Jaunt Java getText() returning correct text but with lots of "?"


The title says it all. The text is there, but instead of "aldo" I get "al?do", and the "?" characters seem to appear in a random pattern.

I have tried removing them with (String).replace("?", ""), but with no success.

I have also tried re-encoding the string like this, with combinations of UTF_8, UTF_16 and ISO-8859, again with no success:

byte[] ptext = tempName.getBytes(UTF_8); 
String tempName1 = new String(ptext, UTF_16); 
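A small sketch of why this kind of re-encoding tends not to help: encoding a String with one charset and decoding the bytes with another reinterprets the bytes rather than removing anything. The class and method names below are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Jaunt.
class ReencodeDemo {
    // Encodes as UTF-8, then decodes the same bytes as UTF-16 (the attempt shown above).
    static String reencode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String mangled = reencode("aldo");
        // Four one-byte ASCII characters are read back as two 16-bit units,
        // so the result is different text, not a cleaned-up version of the input.
        System.out.println(mangled.length());        // prints 2
        System.out.println("aldo".equals(mangled));  // prints false
    }
}
```

If the unwanted characters are already present in the decoded String, they have to be identified and filtered, not re-encoded.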

An example of what I am getting:

Studded Regular Sweatshirt          // Instead of this
S?tudde?d R?eg?ular? Sw?eats?h?irt  // I get this

Could it be the website that notices the headless browser and tries to "spoof" its content? How can I overcome this?

Yu Jiaao On BEST ANSWER

It looks very likely that the site you are scraping intentionally mixes the 3f ("?") and 64 characters into your result. So you either have to disguise your scraper as a normal browser, or filter the extra characters out by replacing them.
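As a rough illustration of the "look like a normal browser" option, the sketch below builds a request with browser-like headers using the JDK's own java.net.http API (Java 11+), not Jaunt; how to set the same headers in Jaunt is not shown here. The User-Agent value and the class name are assumptions for the example, and whether this is enough to change what the site returns is untested.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch only: builds (but does not send) a request with browser-like headers.
class BrowserLikeRequest {
    // Assumed example value; copy the string your own browser actually sends.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

    static HttpRequest build(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .header("Accept-Language", "en-US,en;q=0.9")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("https://example.com/");
        // Show the headers that would be sent with the request.
        System.out.println(request.headers().map());
    }
}
```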

Sample text

Sca???rfa???ce??? E???mbr???oi�d???ered L�e???athe

After filtering

Scarface Embroidered Leathe
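Before deciding what to filter, it may help to check which code points the scraped string really contains, since a console can print "?" for characters it cannot display. The helper below is a hypothetical sketch (the class and method names are not from Jaunt) that lists every code point in a string.

```java
// Hypothetical helper: lists the Unicode code points of a string.
class CodePointDump {
    /** Returns a space-separated list such as "U+0053 U+003F" for the given text. */
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // A literal question mark shows up as U+003F.
        System.out.println(describe("S?t"));       // prints U+0053 U+003F U+0074
        // An invisible character such as a zero-width space shows up as U+200B.
        System.out.println(describe("S\u200Bt"));  // prints U+0053 U+200B U+0074
    }
}
```

If the output shows something other than U+003F between the letters, replace("?", "") cannot match it, and the filter has to target those code points instead.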




//Sca???rfa???ce??? E???mbr???oi�d???ered L�e???athe
//Scarface Embroidered Leathe

String hex="5363613f3f3f7266613f3f3f63653f3f3f20453f3f3f6d62723f3f3f6f69643f3f3f65726564204c653f3f3f61746865";
byte[] bytes= hexStringToBytes(hex);

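//the only line you need: replace() takes literal text, so "?" needs no regex escaping
//"\uFFFD" is the Unicode replacement character shown as '�' in the sample above
String res = new String(bytes, java.nio.charset.StandardCharsets.UTF_8).replace("?", "").replace("\uFFFD", "");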

// Maps one upper-case hex digit to its value 0..15 (returns -1 for any other character)
private static byte charToByte(char c) {
    return (byte) "0123456789ABCDEF".indexOf(c);
}


public static byte[] hexStringToBytes(String hexString) {
    if (hexString == null || hexString.equals("")) {
        return null;
    }
    hexString = hexString.toUpperCase();
    int length = hexString.length() / 2;
    char[] hexChars = hexString.toCharArray();
    byte[] d = new byte[length];
    for (int i = 0; i < length; i++) {
        int pos = i * 2;
        d[i] = (byte) (charToByte(hexChars[pos]) << 4 | charToByte(hexChars[pos + 1]));

    }
    return d;
}

public static String bytesToHexString(byte[] src){
    StringBuilder stringBuilder = new StringBuilder("");
    if (src == null || src.length <= 0) {
        return null;
    }
    for (int i = 0; i < src.length; i++) {
        int v = src[i] & 0xFF;
        String hv = Integer.toHexString(v);
        if (hv.length() < 2) {
            stringBuilder.append(0);
        }
        stringBuilder.append(hv);
    }
    return stringBuilder.toString();
}

public String printHexString(byte[] b) {
    String a = "";
    for (int i = 0; i < b.length; i++) { 
        String hex = Integer.toHexString(b[i] & 0xFF); 
        if (hex.length() == 1) { 
            hex = '0' + hex; 
        }

        a = a+hex;
    } 

    return a;
}