Jaunt Java getText() returning correct text but with lots of "?"


The title says it all. The text is there, but instead of "aldo" I get "al?do", and the "?" characters seem to appear in a random pattern. I have tried removing them with (String).replace("?", ""), but with no success.

I have also tried the following, with various combinations of UTF_8, UTF_16 and ISO-8859, with no success:

byte[] ptext = tempName.getBytes(UTF_8); 
String tempName1 = new String(ptext, UTF_16); 

An example of what I am getting:

Studded Regular Sweatshirt          // Instead of this
S?tudde?d R?eg?ular? Sw?eats?h?irt  // I get this

Could it be the website that notices the headless browser and tries to "spoof" its content? How can I overcome this?

Best answer

It looks very likely that the site you are scraping intentionally mixes extra characters (the 3f and 64 bytes; 3f is "?") into the text it serves you. So you either have to disguise your client as a normal browser when scraping, or filter the extra characters out by replacing them.
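As a rough illustration of the "look like a normal browser" option, the sketch below attaches browser-like headers to a request. It uses the JDK's own java.net.http client (Java 11+) as a stand-in, not Jaunt's API, and the class name, the User-Agent string, and the URL are all placeholders chosen for illustration. Whether the site actually keys on these headers is an assumption; it may use other signals.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class BrowserLikeRequest {
    // Illustrative desktop-browser User-Agent value (an assumption, not taken from the question).
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

    // Builds (but does not send) a GET request that carries browser-like headers.
    static HttpRequest build(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .header("Accept-Language", "en-US,en;q=0.9")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("https://example.com/"); // placeholder URL
        System.out.println(request.headers().map());
    }
}
```

If the "?" characters still appear with such headers, filtering them out is the remaining option.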

Sample text:

Sca???rfa???ce??? E???mbr???oi�d???ered L�e???athe

After filtering:

Scarface Embroidered Leathe
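Since the question reports that replace("?", "") had no effect, it may be worth checking which characters the string really contains before filtering. A console can print "?" for any character it cannot display, so the visible "?" is not necessarily U+003F. The following is a minimal diagnostic sketch; the class name, method name, and sample input are hypothetical.

```java
import java.util.stream.Collectors;

public class CodePointDump {
    // Renders every code point as U+XXXX so invisible or look-alike characters stand out.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Hypothetical input: a literal '?', a replacement character, and a zero-width space.
        String sample = "al?do\uFFFD\u200B";
        System.out.println(describe(sample));
    }
}
```

Running describe on the value returned by getText() should show whether the unwanted characters are U+003F, U+FFFD, or something else, which determines what has to be replaced.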




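// Input:  Sca???rfa???ce??? E???mbr???oid???ered Le???athe
// Output: Scarface Embroidered Leathe

// Hex dump of the input above (3f is the "?" byte)
String hex = "5363613f3f3f7266613f3f3f63653f3f3f20453f3f3f6d62723f3f3f6f69643f3f3f65726564204c653f3f3f61746865";
byte[] bytes = hexStringToBytes(hex);

// The only line you need (requires: import java.nio.charset.StandardCharsets;).
// replace() treats its argument literally, so "?" needs no regex escaping.
// U+FFFD is the replacement character shown as "�" when bytes cannot be decoded.
String res = new String(bytes, StandardCharsets.UTF_8).replace("?", "").replace("\uFFFD", "");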

// Maps one uppercase hex digit to its value 0-15 (returns -1 for a non-hex character)
private static byte charToByte(char c) {
    return (byte) "0123456789ABCDEF".indexOf(c);
}


public static byte[] hexStringToBytes(String hexString) {
    if (hexString == null || hexString.equals("")) {
        return null;
    }
    hexString = hexString.toUpperCase();
    int length = hexString.length() / 2;
    char[] hexChars = hexString.toCharArray();
    byte[] d = new byte[length];
    for (int i = 0; i < length; i++) {
        int pos = i * 2;
        d[i] = (byte) (charToByte(hexChars[pos]) << 4 | charToByte(hexChars[pos + 1]));

    }
    return d;
}

public static String bytesToHexString(byte[] src){
    StringBuilder stringBuilder = new StringBuilder("");
    if (src == null || src.length <= 0) {
        return null;
    }
    for (int i = 0; i < src.length; i++) {
        int v = src[i] & 0xFF;
        String hv = Integer.toHexString(v);
        if (hv.length() < 2) {
            stringBuilder.append(0);
        }
        stringBuilder.append(hv);
    }
    return stringBuilder.toString();
}

public String printHexString(byte[] b) {
    String a = "";
    for (int i = 0; i < b.length; i++) {
        String hex = Integer.toHexString(b[i] & 0xFF);
        if (hex.length() == 1) {
            hex = '0' + hex;
        }
        a = a + hex;
    }
    return a;
}
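For reference, the filtering step can also be written as one self-contained helper. This is only a sketch: the class and method names are made up, and the set of characters to strip ("?", U+FFFD, and the zero-width characters U+200B to U+200D and U+FEFF) is an assumption about what the site injects. Note that it also removes any legitimate question marks in the text.

```java
public class NoiseFilter {
    // Strips '?', the replacement character U+FFFD, and common zero-width characters.
    // Inside a regex character class '?' is literal, so it needs no escaping.
    static String strip(String s) {
        return s.replaceAll("[?\\uFFFD\\u200B-\\u200D\\uFEFF]", "");
    }

    public static void main(String[] args) {
        // Sample taken from the question above.
        System.out.println(strip("S?tudde?d R?eg?ular? Sw?eats?h?irt"));
    }
}
```

If the code points in the scraped string turn out to be different ones, the character class would need to be adjusted accordingly.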