Get all SDF/COS objects from PDF

Question

Get all SDF/COS objects from PDF

606 Views Asked by superstator At 13 February 2020 at 00:49

I am trying to get a list of all SDF/COS objects within a PDF document, using PDFNet 7.0.4 and netcoreapp3.1. Using a different PDF parser, I know that this document has 570 total COS objects within it, including 3 images.

Initially I used PDFDoc to load the document, and iterated through the pages just looking for Element objects of type e_image or e_inline_image, but this only yielded 2 out of 3 images. In a larger document it did even worse; 0 out of ~2600 images.

Now, I've stepped back and am trying to do a lower level search via SDFDoc. I can get a trailer object, and then iterate through it, recursing any e_dict or e_stream objects, and returning anything that looks like a real object (i.e., anything that actually has an object number and generation).

IEnumerable<Obj> Recurse(Obj root)
{
    var idHash = new HashSet<PdfIdentifier>();

    return Recurse(root, idHash);

    static IEnumerable<Obj> Recurse(Obj obj, HashSet<PdfIdentifier> idHash)
    {
        var id = obj.ToPdfIdentifier();

        if (!idHash.Contains(id))
        {
            if (id != nullIdentifier)
            {
                idHash.Add(id);
                yield return obj;
            }

            if (obj.GetType().OneOf(Obj.ObjType.e_dict, Obj.ObjType.e_stream))
            {
                for (var iter = obj.GetDictIterator(); iter.HasNext(); iter.Next())
                {
                    foreach (var child in Recurse(iter.Value(), idHash))
                    {
                        yield return child;
                    }
                }
            }
        }
    }
}

static PdfIdentifier nullIdentifier = new PdfIdentifier() { Generation = 0, ObjectNum = 0 };

ToPdfIdentifier is a simple extension method to get the object number and generation:

public static PdfIdentifier ToPdfIdentifier(this pdftron.SDF.Obj obj) => new PdfIdentifier { ObjectNum = obj.GetObjNum(), Generation = obj.GetGenNum() };

This runs OK, but only returns 45 objects, none of them the images I'm actually interested in.

How can I simply get all COS objects from a document?

edit

Here is the original PDFDoc code we tried to get all images:

private IEnumerable<(PdfIdentifier id, Element el)> GetImages(Stream stream)
{
    var doc = new PDFDoc(stream);

    var reader = new ElementReader();

    for (var iter = doc.GetPageIterator(); iter.HasNext(); iter.Next())
    {
        reader.Begin(iter.Current());

        var el = reader.Next();
        while (el != null)
        {
            var type = el.GetType();
            if (el.GetType().OneOf(Element.Type.e_image, Element.Type.e_inline_image))
            {
                var obj = el.GetXObject();
                var id = el.GetXObject().ToPdfIdentifier();

                yield return (id, el);
            }
            el = reader.Next();
        }

        reader.End();
    }
}

This kind of worked in that it returned some images, but not all. For some sample documents it returned all, for some it returned a subset, and for some it returned none at all.

edit

Just for future reference, thanks to the answer below from Ryan, we ended up with a pair of nice clean extension methods:

public static IEnumerable<SDF.Obj> GetAllObj(this SDF.SDFDoc sdfDoc)
{
    var xrefTableSize = sdfDoc.XRefSize();
    for (int objNum = 0; objNum < xrefTableSize; objNum++)
    {
        var obj = sdfDoc.GetObj(objNum);
        if (obj.IsFree())
        {
            continue;
        }
        else
        {
            yield return obj;
        }
    }
}

and

public static string Subtype(this SDF.Obj obj) => obj.FindObj("Subtype") switch
{
    null => null,
    var s when s.IsName() => s.GetName(),
    var s when s.IsString() => s.GetAsPDFText(),
    _ => throw new Exception("COS object has an invalid Subtype entry")
};

Now we can get images as simply as sdfDoc.GetAllObj().Where(o => o.IsStream() && o.Subtype() == "Image"); or even use Linq:

from o in sdfDoc.GetAllObj()
where o.IsStream() && o.Subtype() == "Image"
select new Image(o);

Original Q&A

There are 1 best solutions below

**Ryan** · Accepted Answer · 2020-02-13T22:38:04.220000

If you want to get the images that are actually used on a page of the PDF (in case there happen to be unused images in the PDF), then you would use this sample code. This code would have the added bonus of including inline images. https://www.pdftron.com/documentation/samples/dotnetcore/cs/ImageExtractTest

Though the above can be slow, if the document has hundreds or thousands of pages, that are complicated graphically.

The otherway, as you described, is to iterate the COS objects. The following C# code finds all Image streams. Note, the PDF standard specifically states that Streams have to be Indirect objects. So I think you can safely omit reading through all the direct objects.

using (PDFDoc doc = new PDFDoc("2002.04610.pdf"))
{
    doc.InitSecurityHandler();
    int xrefSz = doc.GetSDFDoc().XRefSize();
    for (int xrefCounter = 0; xrefCounter < xrefSz; ++xrefCounter)
    {
        Obj o = doc.GetSDFDoc().GetObj(xrefCounter);
        if (o.IsFree())
        {
            continue;
        }
        if(o.IsStream())
        {
            Obj subtypeObj = o.FindObj("Subtype");
            if (subtypeObj != null)
            {
                string subtype = "";
                if(subtypeObj.IsName()) subtype = subtypeObj.GetName();
                if(subtypeObj.IsString()) subtype = subtypeObj.GetAsPDFText(); // Subtype should be a Name, but just in case
                if (subtype.CompareTo("Image") == 0)
                {
                    Console.WriteLine("Indirect object {0} is an Image Stream", o.GetObjNum());
                }
            }
        }
    }
}

Get all SDF/COS objects from PDF

There are 1 best solutions below

Related Questions in C#

Related Questions in .NET-CORE

Related Questions in PDFTRON

Related Questions in PDFNET

Trending Questions

Popular # Hahtags

Popular Questions