Fetch part of an HTML page in Java


I'm having trouble understanding how to download only part of an HTML page. I tried the traditional way, using URL::openStream and a BufferedReader, but I'm not sure whether this approach forces me to download the whole page. The problem: I have a fairly large HTML page and need to parse two numbers from it, which update at least once a second. With the approach above I can only detect changes once every 2-3 seconds, and I wonder whether there is a way to make it faster. So I thought that fetching only part of the page might help.


There are 2 best solutions below

4
On BEST ANSWER

I think you should check how the page itself fetches the data (SSE or WebSocket) and try to subscribe to that service directly. If that is impossible, try a more efficient XML parser. I recommend https://vtd-xml.sourceforge.io/; it can be ~10x faster than the DOM parser that comes with the JDK.
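
If the page turns out to use SSE, subscribing could look roughly like the sketch below. It uses the JDK 11+ java.net.http client; the class name SseSketch, the endpoint URL you pass in and the idea that each "data:" line carries one update are all assumptions, since nothing is known about the actual service. Multi-line events are not reassembled here.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.function.Consumer;
import java.util.stream.Stream;

// Hypothetical helper, not part of any library.
final class SseSketch {

    /** Keeps only the payload of "data:" lines; other SSE fields and comments are dropped. */
    static Stream<String> dataPayloads(Stream<String> lines) {
        return lines
                .filter(line -> line.startsWith("data:"))
                .map(line -> line.substring("data:".length()))
                // the SSE format allows one optional space after the colon
                .map(value -> value.startsWith(" ") ? value.substring(1) : value);
    }

    /** Blocks and calls onData for every update until the server closes the stream. */
    static void subscribe(String url, Consumer<String> onData)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "text/event-stream")
                .build();
        // ofLines() hands over lines as they arrive instead of waiting for the full body
        HttpResponse<Stream<String>> response =
                client.send(request, HttpResponse.BodyHandlers.ofLines());
        try (Stream<String> lines = response.body()) {
            dataPayloads(lines).forEach(onData);
        }
    }
}
```

You would call something like SseSketch.subscribe(endpointUrl, System.out::println) from a background thread, where endpointUrl is whatever address the browser's network tab shows for the event stream.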

Also be careful with BufferedReader.readLine(): there is a hidden allocation cost for strings that you don't really need (this is fairly advanced territory, as you have to think about CPU memory bandwidth, L1 cache misses, etc.).
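
To illustrate the allocation point, here is a rough sketch that scans the stream through one reusable char buffer and returns as soon as the number after a marker has been read, so no per-line String is created and your code does not have to read the rest of the page. The class name MarkerScanner, the marker text and the assumption that the digits follow the marker directly are all hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of any library.
final class MarkerScanner {

    /**
     * Reads until the (non-empty) marker has been seen, then returns the run of
     * digits that immediately follows it, or -1 if the marker or digits are missing.
     * Simplification: the matching is only correct for markers whose first
     * character does not occur again inside the marker.
     */
    static long readNumberAfter(Reader in, String marker) {
        char[] buf = new char[8192]; // reused for every read, no per-line allocation
        int matched = 0;             // number of marker characters matched so far
        boolean inNumber = false;    // true once the whole marker has been matched
        boolean sawDigit = false;
        long value = 0;
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    char c = buf[i];
                    if (inNumber) {
                        if (c >= '0' && c <= '9') {
                            value = value * 10 + (c - '0');
                            sawDigit = true;
                        } else {
                            return sawDigit ? value : -1; // stop early, skip the rest
                        }
                    } else if (c == marker.charAt(matched)) {
                        matched++;
                        if (matched == marker.length()) {
                            inNumber = true;
                        }
                    } else {
                        // restart the match, possibly at the current character
                        matched = (c == marker.charAt(0)) ? 1 : 0;
                    }
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return sawDigit ? value : -1;
    }
}
```

In practice you would wrap the URL stream in an InputStreamReader with the page's charset, pass it to readNumberAfter, and close the stream once the value is returned.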

Example using the library I mentioned:

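import com.ximpleware.*;
import java.nio.charset.StandardCharsets;

byte[] pageInBytes = readAllBytesFromTheURL(); // your own helper that downloads the page
VTDGen vg = new VTDGen();
vg.setDoc(pageInBytes);
vg.parse(false); // false = not namespace-aware; throws ParseException if the page is not well-formed XML
VTDNav vn = vg.getNav();

AutoPilot ap = new AutoPilot(vn);

// Jump to the section that we want to process
ap.selectXPath("/html/body/div");
if (ap.evalXPath() != -1) { // moves vn to the first matching element
    // getElementFragment() packs the offset (low 32 bits) and the length (high 32 bits)
    long fragment = vn.getElementFragment();
    int offset = (int) fragment;
    int length = (int) (fragment >> 32);
    // Assumes the page is UTF-8 encoded, so offsets are byte offsets
    String div = new String(pageInBytes, offset, length, StandardCharsets.UTF_8);
}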
0
On

I wrote a helper that reads the URL content lazily. The parser for the elements lives in another class.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.NoSuchElementException;

public class HTMLReaderHelper {

private final URL currentURL;

HTMLReaderHelper(URL url){
    currentURL = url;
}

public CharIterator charIterator(){
    try {
        return new CharIterator();
    } catch(IOException ex){
        // Fail loudly instead of returning null, which would only cause a NullPointerException later
        throw new UncheckedIOException(ex);
    }
}

public StringIterator stringIterator(){
    return new StringIterator();
}

// Iterates over the characters of the page, reading lazily from the connection.
class CharIterator implements Iterator<Character>{

    // Buffered and charset-aware; UTF-8 is an assumption about the page encoding
    private final Reader urlReader;

    // Character read ahead by hasNext(), or -1 at end of stream
    private int peeked;

    // True while 'peeked' holds a value that next() has not consumed yet
    private boolean hasPeeked;

    private CharIterator() throws IOException {
        urlReader = new BufferedReader(
                new InputStreamReader(currentURL.openStream(), StandardCharsets.UTF_8));
    }

    @Override
    public boolean hasNext() {
        if(!hasPeeked){
            try {
                peeked = urlReader.read();
                if(peeked == -1){
                    urlReader.close();
                }
            } catch (IOException ex) {
                peeked = -1; // treat a read failure as end of stream
            }
            hasPeeked = true;
        }
        return peeked != -1;
    }

    @Override
    public Character next() {
        if(!hasNext()){
            throw new NoSuchElementException();
        }
        hasPeeked = false;
        return (char) peeked;
    }
}

// Iterates over the non-empty lines of the page.
class StringIterator implements Iterator<String>{

    private final CharIterator charPointer;

    // Line read ahead by hasNext(), or null if none is pending
    private String nextLine;

    private StringIterator(){
        charPointer = charIterator();
    }

    @Override
    public boolean hasNext() {
        if(nextLine == null){
            nextLine = readLine();
        }
        return nextLine != null;
    }

    @Override
    public String next() {
        if(!hasNext()){
            throw new NoSuchElementException();
        }
        String line = nextLine;
        nextLine = null;
        return line;
    }

    // Returns the next non-empty line, or null at end of stream
    private String readLine(){
        StringBuilder sb = new StringBuilder();
        while (charPointer.hasNext()){
            char c = charPointer.next();
            if(c == '\n' || c == '\r'){
                if(sb.length() > 0){
                    return sb.toString();
                }
                // otherwise skip line terminators that precede the line
            } else {
                sb.append(c);
            }
        }
        return sb.length() > 0 ? sb.toString() : null;
    }
}
}