Parsing objects out of a PDF: objects with byte streams are ignored for some reason?

My current assignment involves extracting all of the objects from a PDF file and then working with the parsed objects. However, I have noticed that some of the stream objects are being skipped entirely by my code.

I am completely confused and hoping someone can help indicate what is going wrong here.

Here is the main parsing code.

    void parseRawPDFFile() {
        //Transform the bytes obtained from the file into a byte character sequence. This byte character sequence
        //object is what allows us to use it in regex.
        ByteCharSequence byteCharSequence = new ByteCharSequence(bytesFromFile.toByteArray());
        byteCharSequence.getStringFromData();

        Pattern pattern = Pattern.compile(SINGLE_OBJECT_REGEX);
        Matcher matcher = pattern.matcher(byteCharSequence);

        //While we have a match (apparently only one match exists at a time) keep looping over the list.
        //When a match is found, get the starting and ending indices and manually cut these out char by char
        //and assemble them into a new "ByteArrayOutputStream".
        int counterOfDoom = 1;
        while (matcher.find() ) {
            for (int i = 0; i < matcher.groupCount(); i++) {
                ByteArrayOutputStream cutOutArray = cutOutByteArrayOutputStreamFromOriginal(matcher.start(), matcher.end());
                System.out.println("----------------------------------------------------");
                System.out.println(cutOutArray);
                //At this point we have cut out the object and can now send it for processing.
                createPDFObject(cutOutArray);

                System.out.println(counterOfDoom);
                System.out.println("----------------------------------------------------");
                counterOfDoom++;
            }
        }
    }

Here is the code for ByteCharSequence (credit for the core of this code goes to http://blog.sarah-happy.ca/2013/01/java-regular-expression-on-byte-array.html):

public class ByteCharSequence implements CharSequence {

    private final byte[] data;
    private final int length;
    private final int offset;

    public ByteCharSequence(byte[] data) {
        this(data, 0, data.length);
    }

    public ByteCharSequence(byte[] data, int offset, int length) {
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public int length() {
        return this.length;
    }

    @Override
    public char charAt(int index) {
        return (char) (data[offset + index] & 0xff);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        return new ByteCharSequence(data, offset + start, end - start);
    }

    /**
     * Get the string from the ByteCharSequence data.
     * @return
     */
    public String getStringFromData() {
        //Load it into the method I know works to convert it to a string... Optimized? Probably not at all.
        //But it works...
        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        for (byte individualByte : data) {
            byteArrayOutputStream.write(individualByte);
        }

        return byteArrayOutputStream.toString();
    }
}

The pdf data that I am processing at present:

10 0 obj
<</Filter/FlateDecode/Length 1040>>stream
(Bunch of bytes)
endstream
endobj


12 0 obj
<</Filter/FlateDecode/Length 2574/N 3>>stream
(Bunch of bytes)
endstream
endobj

One thing I have already looked into:

1: From what I understand, there is no limit on how much data these structures can hold, so size shouldn't be the issue?

There is 1 best solution below


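Add the Pattern.DOTALL flag to the Pattern.compile call so that the "." in your pattern also matches newline characters. Without it, "." stops at every line terminator, and compressed stream data almost always contains bytes that look like one =)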
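
To illustrate the flag, here is a minimal sketch. SINGLE_OBJECT_REGEX is not shown in the question, so the OBJECT_REGEX below is only a hypothetical stand-in, and the class name, helper methods, and sample data are made up for the demo. By default, "." in a Java regex does not match line terminators, and binary stream content can contain bytes such as 0x0A or 0x0D, so a match that has to cross the stream fails and the object is skipped.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotAllDemo {
    // Hypothetical stand-in for SINGLE_OBJECT_REGEX, which the question does not show.
    static final String OBJECT_REGEX = "\\d+ \\d+ obj.*?endobj";

    /** Counts the matches of OBJECT_REGEX in the given bytes, using the given regex flags. */
    public static int countObjects(byte[] data, int flags) {
        // ISO-8859-1 maps every byte to the char with the same value (0-255),
        // which is the same mapping ByteCharSequence.charAt performs.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        Matcher matcher = Pattern.compile(OBJECT_REGEX, flags).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    /** Made-up sample: two objects, the second with a newline inside its stream. */
    public static byte[] sample() {
        String pdf = "10 0 obj <</Length 3>>stream abc endstream endobj "
                   + "12 0 obj <</Length 3>>stream a\nb endstream endobj";
        return pdf.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] data = sample();
        // Without DOTALL, the object whose stream contains '\n' is skipped.
        System.out.println(countObjects(data, 0));              // prints 1
        // With DOTALL, '.' also matches line terminators, so both are found.
        System.out.println(countObjects(data, Pattern.DOTALL)); // prints 2
    }
}
```

In the question's code this would amount to Pattern.compile(SINGLE_OBJECT_REGEX, Pattern.DOTALL). Prefixing the regex itself with (?s) enables the same mode.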