Minimal PDF file according to PDF-2.0 spec results in corrupted document structure

Question

142 Views Asked by ermeglio At 28 February 2024 at 00:07

I am trying to create a minimal example PDF file that follows the PDF 2.0 ISO specification. I would like to avoid the classic xref table and trailer section and use only a cross-reference stream.

The file opens in Adobe Acrobat, but when I close it, Acrobat offers to save it. I take that to mean it has repaired what it considers a corrupted document structure.

So I guess my PDF does not comply with PDF 2.0. But why not?

Here is my code for the PDF-2.0 File:

UPDATE: I tried to follow some of the comments; thanks for the input. The changes were:

I added /Length to the XRef stream, since it seems to be required for any stream (strangely, the spec shows examples without /Length, but it is added now). I have not (yet) added page content, as it is defined as optional. The new version of the PDF opens in Acrobat, but the save dialog still appears on closing.

I still don't know what is missing for Acrobat to recognize the file as valid and stop prompting to save.

%PDF-2.0
%Óëéá
1 0 obj
<</Type /Catalog
/Pages 2 0 R
/Metadata 5 0 R
>>
endobj
2 0 obj
<</Type /Pages
/Kids [3 0 R 4 0 R]
/Count 2
>>
endobj
3 0 obj
<</Type /Page
/Parent 2 0 R
/MediaBox [0 0 595 842]
>>
endobj
4 0 obj
<</Type /Page
/Parent 2 0 R
/MediaBox [0 0 595 842]
>>
endobj
5 0 obj
<</Type /Metadata
/Subtype /XML
/Length /2880
>>
stream
<?xpacket begin="ï»¿" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
      <pdf:Producer>PdfProd</pdf:Producer>
    </rdf:Description>
    <rdf:Description rdf:about="" xmlns:xmp="http://ns.adobe.com/xap/1.0/">
      <xmp:CreateDate>2024-02-28T23:46:34+01:00</xmp:CreateDate>
    </rdf:Description>
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:format>application/pdf</dc:format>
    </rdf:Description>
    <rdf:Description rdf:about="" xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
      <xmpMM:DocumentID>f2015454-8669-45e4-9218-ad61ad0e2082</xmpMM:DocumentID>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>

                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
<?xpacket end="w"?>
endstream
endobj
6 0 obj
<</Type /XRef
/Index [0 7]
/Size 7
/W [1 2 1]
/Root 1 0 R
/ID [<1f7139e82f1c048ff020a6c953c3addd><1f7139e82f1c048ff020a6c953c3addd>]
/Length 77
>>
stream
00 0000 00
01 000F 01
01 004F 02
01 008D 03
01 00D3 04
01 0119 05
01 0CAB 06
endstream
endobj
startxref
3405
%%EOF

2ND UPDATE: I tried to implement all the suggestions; many thanks for the very useful input in the comments. After these changes, an online PDF validator says the file is OK. For Acrobat, however, it is now even worse: it can no longer open the file at all ("The file is damaged and could not be repaired."). Thanks in advance for any help!

%PDF-2.0
%Óëéá
1 0 obj
<</Type /Catalog
/Metadata 2 0 R
/Pages 3 0 R
>>
endobj
2 0 obj
<</Type /Metadata
/Length 2881
/Subtype /XML
>>
stream
<?xpacket begin="ï»¿" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
      <pdf:Producer>Pdf2You</pdf:Producer>
    </rdf:Description>
    <rdf:Description rdf:about="" xmlns:xmp="http://ns.adobe.com/xap/1.0/">
      <xmp:CreateDate>2024-03-04T23:42:40+01:00</xmp:CreateDate>
    </rdf:Description>
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:format>application/pdf</dc:format>
    </rdf:Description>
    <rdf:Description rdf:about="" xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
      <xmpMM:DocumentID>7525a6cc-24c0-4d27-a995-17ee0436f906</xmpMM:DocumentID>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>

                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
<?xpacket end="w"?>
endstream
endobj
3 0 obj
<</Type /Pages
/Kids [4 0 R]
/Count 1
>>
endobj
4 0 obj
<</Type /Page
/Parent 3 0 R
/MediaBox [0 0 595 842]
/Resources<<>>
>>
endobj
5 0 obj
<</Type /XRef
/Length 66
/Index [0 6]
/Filter /ASCIIHexDecode
/Size 6
/W [1 2 1]
/Root 1 0 R
/ID [<884dfb9a4ffe1d4accf3d4454478960f><884dfb9a4ffe1d4accf3d4454478960f>]
>>
stream
00 0000 00
01 000F 00
01 004F 00
01 0BE0 00
01 0C18 00
01 0C6D 00
endstream
endobj
startxref
3181
%%EOF

PS: all line endings are LF. PPS: the validation tool that reports the file as valid is https://www.pdf-online.com/osa/validate.aspx

**K J** · Accepted Answer · 2024-02-28T18:27:05.590000

Below I show the more common features from the PDF 2.0 ISO standard that your file needs in order to be accepted by most, if not all, PDF 1.x and 2.0 readers, without those readers or Acrobat attempting any "fix" on opening or closing.

The smallest possible file with a classic "trailer" that is acceptable to Acrobat etc. is roughly 300 bytes (303 with the preferred EOL after %%EOF).

%PDF-2.0
1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj
2 0 obj<</Type/Pages/Count 1/Kids[3 0 R]>>endobj
3 0 obj<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]>>endobj
xref
0 4
0000000000 65535 f 
0000000009 00000 n 
0000000052 00000 n 
0000000101 00000 n 
trailer<</Size 4/Root 1 0 R>>
startxref
164
%%EOF
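The byte offsets in the table above can be checked by assembling the same file programmatically. The following is a minimal Python sketch, assuming LF line endings and the exact object text shown; the function name `build_minimal_pdf` is only illustrative, and the free entry is written with generation 65535.

```python
def build_minimal_pdf():
    """Assemble the three-object file and record where each object starts."""
    bodies = [
        b"<</Type/Catalog/Pages 2 0 R>>",
        b"<</Type/Pages/Count 1/Kids[3 0 R]>>",
        b"<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]>>",
    ]
    out = bytearray(b"%PDF-2.0\n")
    offsets = []
    for num, body in enumerate(bodies, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj"
        out += b"%d 0 obj" % num + body + b"endobj\n"

    xref_pos = len(out)  # value that goes after startxref
    size = len(bodies) + 1  # objects 1..3 plus the free entry for object 0
    out += b"xref\n0 %d\n" % size
    out += b"0000000000 65535 f \n"  # each entry is exactly 20 bytes
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer<</Size %d/Root 1 0 R>>\n" % size
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out), offsets, xref_pos


pdf_bytes, obj_offsets, xref_offset = build_minimal_pdf()
print(obj_offsets, xref_offset, len(pdf_bytes))
```

Computing the offsets from the assembled bytes, rather than typing them by hand, avoids the off-by-a-few-bytes errors that make viewers rebuild the table.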

The smallest file with an XRef stream that is not "fixed" or "rejected" by Acrobat viewers 6 or later is 371 bytes (perhaps 370 if you ignore the EOL after %%EOF)!

%PDF-2.0
%ÞÐƒ²
2 0 obj<</Type/Catalog/Pages 4 0 R>>endobj
4 0 obj<</Type/Pages/Count 1/Kids[3 0 R]>>endobj
3 0 obj<</Type/Page/Parent 4 0 R/MediaBox[0 0 612 792]>>endobj
1 0 obj<</Type/XRef/Root 2 0 R/Length 10/Size 5/Index [0 5]/W [1 1 0]/ID [<0123456789ABCDEF0123456789ABCDEF><0123456789ABCDEF0123456789ABCDEF>]>>
stream
  ªk: 
endstream endobj
startxref
170
%%EOF
% The following hex table shows the stream's 10 bytes of binary data in hex.
% Because the data is binary, the binary marker on the 2nd line (%ÞÐƒ²) must be included.
stream
00 00
01 AA
01 0F
01 6B
01 3A
endstream endobj
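The 10 binary bytes in that stream follow directly from /W [1 1 0]: one byte for the entry type, one byte for the offset, and no bytes for the generation. A small Python sketch of the packing, using the offsets from the hex table above (the helper name `pack_xref_entries` is an assumption made for illustration):

```python
def pack_xref_entries(entries, widths):
    """Pack (type, field2, field3) tuples as big-endian fields of the given widths."""
    out = bytearray()
    for entry in entries:
        for value, width in zip(entry, widths):
            # A width of 0 contributes no bytes; the reader applies a default.
            out += value.to_bytes(width, "big")
    return bytes(out)


# Object 0 is free; objects 1..4 are in use at the offsets from the hex table.
xref_entries = [(0, 0, 0), (1, 0xAA, 0), (1, 0x0F, 0), (1, 0x6B, 0), (1, 0x3A, 0)]
xref_data = pack_xref_entries(xref_entries, (1, 1, 0))
print(len(xref_data), xref_data.hex())
```

One-byte offsets only work while every object starts below byte 256, which is why the larger files in this thread use /W [1 2 1].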

It does not matter in which order the objects are numbered. Commonly the metadata would be the 3rd object if an Info dictionary came first. Acrobat will normally "fix" a PDF by adding a duplicate Info dictionary first, with entries selected from the metadata. Here, however, for a file that is "minimally acceptable" to Acrobat and other conforming readers, there is no /Info entry.

Note that the standard says a metadata stream is not required, so it can be deleted to give a smaller example.

The following example is not strictly the minimum acceptable, because it contains a page content stream (Contents in the page object) and a metadata stream. These objects are included to make the file useful as a starting point for creating other, more realistic PDF files.

Writers often emit /Type as the last dictionary entry, although we might logically expect or prefer it first. Comments on altering your version are at the end.

%PDF-2.0
%ÞÐƒ²
1 0 obj
<</Type/Catalog/Pages 2 0 R/Metadata 5 0 R>>
endobj
2 0 obj
<</Type/Pages/Count 1/Kids[3 0 R]>>
endobj
3 0 obj
<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]/Contents 4 0 R/Resources<<>>>>
endobj
4 0 obj
<</Length 0>>
stream
endstream
endobj
5 0 obj
<</Type/Metadata/Subtype/XML/Length 1059>>
stream
<?xpacket begin="ï»¿" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="" xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
<pdf:Producer>name</pdf:Producer>
</rdf:Description>
<rdf:Description rdf:about="" xmlns:xmp="http://ns.adobe.com/xap/1.0/">
<xmp:CreatorTool>name</xmp:CreatorTool>
<xmp:CreateDate>2012-12-25T12:34:56Z</xmp:CreateDate>
<xmp:ModifyDate>2012-12-25T12:34:56Z</xmp:ModifyDate>
</rdf:Description>
<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:format>application/pdf</dc:format>
<dc:title><rdf:Alt>
<rdf:li xml:lang="x-default">title</rdf:li>
</rdf:Alt></dc:title>
<dc:creator><rdf:Seq>
<rdf:li>author</rdf:li>
</rdf:Seq></dc:creator>
</rdf:Description>
<rdf:Description rdf:about="" xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
<xmpMM:DocumentID>GUID of document</xmpMM:DocumentID>
<xmpMM:InstanceID>GUID save change</xmpMM:InstanceID>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
 
<?xpacket end="w"?>
endstream
endobj
xref
0 6
0000000000 65535 f 
0000000015 00000 n 
0000000075 00000 n 
0000000126 00000 n 
0000000220 00000 n 
0000000266 00000 n 
trailer
<</Size 6/Root 1 0 R>>
startxref
1401
%%EOF

Alterations

The Pages object should include a /Count, even if there is only a single page (page index 0).
<</Type/Pages/Count 1/Kids[3 0 R]>>

A page should reference some contents (even if we declare them empty).
<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]/Contents 4 0 R/Resources<<>>>>

A minimal page content stream can be as simple as:

# # obj
<</Length 0>>
stream
endstream
endobj

Others have suggested an altered "xref" structure. However, although the standard "hints" at the type you want (an expanded, text-encoded cross-reference stream), I have yet to see that format accepted by Adobe Acrobat. I have no working examples other than /FlateDecode-encoded ones.

In the standard this is "hinted" as:

/Filter /ASCIIHexDecode %For readability only
data has been encoded in hexadecimal representation for readability; in actual practice, a lossless decompression filter such as FlateDecode can be used.

Adobe added an explanation in its PDF 1.7 Reference, also in Appendix H, which still seems to be current policy.

"3.4.7, “Cross-Reference Streams” (Cross-Reference Stream Dictionary) 20. FlateDecode is the only filter supported by Acrobat 6.0 and later viewers for cross-reference streams. These viewers also support unencoded cross-reference streams."

So files should generally use either fully deflated (FlateDecode) or fully unencoded cross-reference streams and never mix the two, especially when there are edits (incremental additions, alterations, etc.).

Above I used the smaller example (in this case the 2.0 H2 example) in its uncompressed text form.

[Later EDIT]

As @mkl has pointed out, your cross-reference stream can drop the ASCII /Filter and use a pure binary stream instead; all readers, including Adobe Acrobat viewers, will accept that as an equivalent working format. In effect it meets the "unencoded cross-reference streams" statement.

So replace

5 0 obj
<</Type /XRef
/Length 66
/Index [0 6]
/Filter /ASCIIHexDecode
/Size 6
/W [1 2 1]
/Root 1 0 R
/ID [<884dfb9a4ffe1d4accf3d4454478960f><884dfb9a4ffe1d4accf3d4454478960f>]
>>
stream
00 0000 00
01 000F 00
01 004F 00
01 0BE0 00
01 0C18 00
01 0C6D 00
endstream
endobj
startxref
3181
%%EOF

With

5 0 obj
<</Type /XRef
/Length 24
/Index [0 6]
/Size 6
/W [1 2 1]
/Root 1 0 R
/ID [<884dfb9a4ffe1d4accf3d4454478960f><884dfb9a4ffe1d4accf3d4454478960f>]
>>
stream
       O à  m 
endstream
endobj
startxref
3181
%%EOF

Written out in hex, the 24-byte stream (including its null bytes) is the following, which is more compact than the spaced text table it replaces:
0000000001000F0001004F00010BE000010C1800010C6D00

However, that would be unworkable for pure ANSI text editing.
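Since the binary stream cannot be read in a text editor, a few lines of Python can turn it back into entries for inspection. This is only a sketch: the helper name `unpack_xref_entries` is hypothetical, and it assumes every width in /W is non-zero (a zero-width type field would need the default type 1 applied).

```python
def unpack_xref_entries(data, widths):
    """Split decoded XRef stream data into tuples of big-endian integer fields."""
    row = sum(widths)
    if row == 0 or len(data) % row != 0:
        raise ValueError("stream length is not a multiple of the /W row size")
    entries = []
    for start in range(0, len(data), row):
        pos = start
        fields = []
        for width in widths:
            fields.append(int.from_bytes(data[pos:pos + width], "big"))
            pos += width
        entries.append(tuple(fields))
    return entries


raw_table = bytes.fromhex("0000000001000F0001004F00010BE000010C1800010C6D00")
for number, (kind, offset, generation) in enumerate(unpack_xref_entries(raw_table, (1, 2, 1))):
    print(number, kind, offset, generation)
```

The last entry decodes to offset 3181, which matches the startxref value of the file, as it should when object 5 is the cross-reference stream itself.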

Most readers that can open the file will, on saving, simply replace the cross-reference stream with an ASCII table, for example replacing 6 with 7:

6 0 obj
<<>>
endobj
7 0 obj
<</Creator (\(FlexiPDF\))/ICNAppName (FlexiPDF)/ICNAppPlatform (Win)/ICNAppVersion (3.0.7)/ModDate (D:20240305155621)>>
endobj
xref
0 8
0000000005 65535 f 
0000000009 00000 n 
0000000223 00000 n 
0000003183 00000 n
.... etc.

or convert it to a Flate-compressed stream:

6 0 obj
<</DecodeParms<</Columns 4/Predictor 12>>/Filter/FlateDecode
/ID [<884DFB9A4FFE1D4ACCF3D4454478960F><884DFB9A4FFE1D4ACCF3D4454478960F>]
/Length 41/Root 1 0 R/Size 7/Type/XRef/W [1 2 1]>>
stream
xÚcb``øÏÄÈÀÏÈÄÀàÈ dma`úÏÝd}`dúÏý Y:{
endstream
endobj
startxref
3167
%%EOF
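The /DecodeParms in that dictionary (Predictor 12, Columns 4) mean each 4-byte row is stored as a PNG "Up" difference against the previous row before deflating. A rough Python sketch of the round trip with the standard `zlib` module, using the six known entries from the hex table as sample data (the function names are illustrative, and the data length is assumed to be a multiple of the column count):

```python
import zlib


def png_up_encode(data, columns):
    """Prefix each row with filter type 2 (Up) and store row minus previous row."""
    out = bytearray()
    prev = bytes(columns)  # the row before the first is treated as all zeros
    for start in range(0, len(data), columns):
        row = data[start:start + columns]
        out.append(2)
        out += bytes((b - p) & 0xFF for b, p in zip(row, prev))
        prev = row
    return bytes(out)


def png_up_decode(data, columns):
    """Reverse png_up_encode; only filter type 2 rows are handled in this sketch."""
    out = bytearray()
    prev = bytes(columns)
    stride = columns + 1
    for start in range(0, len(data), stride):
        if data[start] != 2:
            raise ValueError("unsupported PNG filter type")
        diffs = data[start + 1:start + stride]
        row = bytes((d + p) & 0xFF for d, p in zip(diffs, prev))
        out += row
        prev = row
    return bytes(out)


sample_table = bytes.fromhex("0000000001000F0001004F00010BE000010C1800010C6D00")
flate_stream = zlib.compress(png_up_encode(sample_table, 4))
restored = png_up_decode(zlib.decompress(flate_stream), 4)
print(len(flate_stream), restored == sample_table)
```

The length of `flate_stream` is what /Length would have to state; the exact value depends on the zlib settings, so it is not expected to match the 41 bytes in the listing above.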

Readers compliant with the PDF ISO standard (apart from Acrobat) will also accept the following combination as meeting the standard, so a text-encoded stream is easier to use with ISO 2.0 compliant readers, but NOT with Acrobat!

/Filter/ASCII85Decode>>
stream
z!<<W1!<>mq!=Rfc!=TeF!=WfF
endstream
endobj
startxref
3181
%%EOF

The ASCII85 string above (with the five-character all-zero group shown as a single z, and every other 4 bytes expanded to 5 characters) is acceptable to most readers (apart from Acrobat DC). Even the Acrobat-powered plug-in within Edge will accept it!

Acrobat Reader in Edge (IE mode) refuses to open the file. The same file in the same Edge tab, simply switched out of IE mode so that it uses the lighter "Powered by Adobe Acrobat" plug-in, works.
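The ASCII85 form of the cross-reference data can be reproduced with Python's standard `base64` module, which folds a group of four zero bytes into `z` by default. This is a sketch under the assumption that the stream holds the same 24 bytes as the binary /W [1 2 1] table; note that the PDF specification also expects a `~>` end-of-data marker after the encoded text, which the listing in this answer omits.

```python
import base64

# The 24 bytes of the binary cross-reference table (six 4-byte entries).
a85_source = bytes.fromhex("0000000001000F0001004F00010BE000010C1800010C6D00")

# a85encode emits 5 characters per 4 input bytes and "z" for an all-zero group.
a85_text = base64.a85encode(a85_source)
print(a85_text.decode("ascii"), len(a85_text))

# Decoding restores the original table.
a85_back = base64.a85decode(a85_text)
```

Here 24 bytes become 26 characters, so the text form costs little space, but as the answer notes it only helps with readers that accept filters other than FlateDecode on cross-reference streams.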
