C# XmlReader misbehaves after failing to read an oversized element


I have a very simple XML file in which one child element holds a DNA string. The string can be anywhere from a few hundred characters to 3 billion characters long (as in the human genome). As expected, reading a record that contains such an element into memory throws an OutOfMemoryException.

Here is the XML:

    <?xml version="1.0" encoding="UTF-8"?>
    <ROOT>
      <RECORD>
        <ID>OX451740</ID>
        <DNASEQ>
        tctacttcactcacgtaagtgatacc.... 1.2 BILLION MORE!!!!.... tctacttcactcacgtaagtgatacc
        </DNASEQ>
      </RECORD>
      <RECORD>
        <ID>OU641503</ID>
        <DNASEQ>
        gtaaccaaatcggtgctgctttctggtacgtgttgcagaccctgaaccatcaattgttgga
        </DNASEQ>
      </RECORD>
      <RECORD>
        <ID>Y19466</ID>
        <DNASEQ>
        atcgtcacccccccgcccgccacacctgacaaacaagatgtgcggcgggggtggaa
        </DNASEQ>
      </RECORD>
      <RECORD>
        <ID>AM472099</ID>
        <DNASEQ>
        cagttgacactctccttctcataaaattttgtaagcacaaccacatcacaactat
        </DNASEQ>
      </RECORD>
    </ROOT>

The problem is that catching the exception and then continuing to parse causes more errors, even though none of the remaining records in the XML contain long DNA strings.

If I delete the record in question, all records in the file are read just fine.

The file itself can be several gigabytes in size, so I use XmlReader to avoid loading the whole file into memory.

Happy to share the real XML file if that can help.

So my code is as follows:

    using System;
    using System.Xml;
    using System.Xml.Linq;

    namespace LargeXMLReader
    {
        internal class Program
        {
            static void Main(string[] args)
            {
                XmlReaderSettings settings = new XmlReaderSettings();
                settings.DtdProcessing = DtdProcessing.Ignore;
                settings.CheckCharacters = true;
                settings.IgnoreWhitespace = true;
                using (XmlReader reader = XmlReader.Create("test.xml", settings))
                {
                    reader.ReadStartElement("ROOT");
                    while (!reader.EOF)
                    {
                        if (reader.NodeType == XmlNodeType.Element && reader.Name == "RECORD")
                        {
                            try
                            {
                                // Materializes the whole RECORD, including the full DNASEQ text.
                                // On success the reader is left on the node after </RECORD>.
                                XElement el = (XElement)XNode.ReadFrom(reader);
                                foreach (XElement e in el.Elements())
                                {
                                    if (e.Name == "ID")
                                        Console.WriteLine(e.Value);
                                }
                            }
                            catch
                            {
                                // Bare catch: swallows every exception, not only OutOfMemoryException.
                                Console.WriteLine("Too big element...");
                                // Attempt to move on from wherever the failed read left the reader.
                                reader.Read();
                            }
                        }
                        else
                        {
                            reader.Read();
                        }
                    }
                }
            }
        }
    }

I would expect it to fail on the first record only, but it also fails on the last one. If I delete the large record from the XML, everything is read just fine. With my real data, it fails on the large record and then fails again and again on some, but not all, of the remaining records. Very strange.

Is this a bug in XmlReader? Could Base64 encoding help somehow?
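For comparison, here is a minimal sketch of a variant that never materializes a whole `RECORD`: it walks the nodes one by one and consumes the `DNASEQ` text in fixed-size chunks via `XmlReader.ReadValueChunk`. It is a sketch under assumptions, not a confirmed fix for the behavior described above: the `RecordScanner` class, its `Scan` method, and the 64 K buffer size are hypothetical names and values chosen for illustration, and it assumes each `ID` and `DNASEQ` holds a single text node and that only the sequence length (not the sequence itself) is needed.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper (not part of the original code).
internal static class RecordScanner
{
    // Yields (ID, number of non-whitespace DNASEQ characters) per RECORD
    // without ever holding a complete DNASEQ value in memory.
    public static IEnumerable<(string Id, long SeqLength)> Scan(XmlReader reader)
    {
        char[] buffer = new char[64 * 1024]; // assumed chunk size
        string current = "";                 // name of the element being read
        string id = "";
        long length = 0;

        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    current = reader.Name;
                    if (current == "RECORD") { id = ""; length = 0; }
                    break;

                case XmlNodeType.Text:
                    if (current == "ID")
                    {
                        id = reader.Value; // IDs are short, so Value is safe here
                    }
                    else if (current == "DNASEQ")
                    {
                        // Pull the text in pieces instead of asking for Value.
                        int n;
                        while ((n = reader.ReadValueChunk(buffer, 0, buffer.Length)) > 0)
                        {
                            for (int i = 0; i < n; i++)
                                if (!char.IsWhiteSpace(buffer[i])) length++;
                        }
                    }
                    break;

                case XmlNodeType.EndElement:
                    if (reader.Name == "RECORD")
                        yield return (id, length);
                    current = "";
                    break;
            }
        }
    }
}
```

With the settings from the question, this could be driven by `foreach (var rec in RecordScanner.Scan(XmlReader.Create("test.xml", settings))) Console.WriteLine(rec.Id);`. Inside the chunk loop, the buffer contents could be written to a file or processed incrementally instead of just counted.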
