Get substring from MemoryStream without converting entire stream to string


I would like to efficiently get a substring from a MemoryStream (whose contents originally come from an XML file in a zip). Currently, I read the entire MemoryStream into a string and then search for the start and end tags of the XML node I want. This works fine, but the file may be very large, so I would like to avoid converting the entire MemoryStream into a string and instead extract just the desired section of XML text directly from the stream.

What is the best way to go about this?

string xmlText;
using (var zip = ZipFile.Read(zipFileName))
{
    var ze = zip[zipPath];
    using (var ms = new MemoryStream())
    {
        ze.Extract(ms);
        ms.Position = 0;
        using(var sr = new StreamReader(ms))
        {
            xmlText = sr.ReadToEnd();
        }
    }
}

string startTag = "<someTag>";
string endTag = "</someTag>";
int startIndex = xmlText.IndexOf(startTag, StringComparison.Ordinal);
int endIndex = xmlText.IndexOf(endTag, startIndex, StringComparison.Ordinal) + endTag.Length - 1;
xmlText = xmlText.Substring(startIndex, endIndex - startIndex + 1);

Best answer

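If your file is valid XML, you can use an XmlReader to avoid converting the entire stream into a string. The MemoryStream still holds the extracted bytes, but the reader parses it forward-only and only the node you ask for is turned into text: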

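string xmlText = null; // stays null if <someTag> is not found
using (var zip = ZipFile.Read(zipFileName))
{
    var ze = zip[zipPath];
    using (var ms = new MemoryStream())
    {
        ze.Extract(ms);
        ms.Position = 0;
        using (var xml = XmlReader.Create(ms))
        {
            // Advance to the first <someTag> element, skipping everything before it
            if (xml.ReadToFollowing("someTag"))
            {
                // ReadInnerXml returns the content inside <someTag>;
                // use ReadOuterXml instead to include the tags themselves
                xmlText = xml.ReadInnerXml();
            }
            else
            {
                // <someTag> not found
            }
        }
    }
}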

You'll likely want to catch XmlException, which the reader throws if the file is not well-formed XML.
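Since the question's original code keeps the start and end tags in the result, here is a minimal, self-contained sketch of the same idea using ReadOuterXml. The class name XmlFragmentExample and the method ExtractElement are made up for illustration; it assumes the stream contains XML that is well-formed at least up to the end of the wanted element.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class XmlFragmentExample
{
    // Hypothetical helper: returns the first <elementName> element,
    // tags included, or null if the element does not occur.
    // Throws XmlException if the reader hits malformed XML on the way.
    public static string ExtractElement(Stream xmlStream, string elementName)
    {
        using (var reader = XmlReader.Create(xmlStream))
        {
            // ReadToFollowing moves forward to the next matching element
            return reader.ReadToFollowing(elementName)
                ? reader.ReadOuterXml()
                : null;
        }
    }
}
```

It could be called as XmlFragmentExample.ExtractElement(ms, "someTag") right after ms.Position = 0 in the code above. Note that the reader may re-serialize the element slightly (for example, adding in-scope namespace declarations), so the result is not guaranteed to be byte-for-byte identical to the source text.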

Another answer

Assuming that, since it is XML, it will have line breaks, it would probably be best to use StreamReader.ReadLine and search for your tags in each line. (Also, remember to wrap your StreamReader in a using block.)

Something like

        using (var ms = new MemoryStream())
        {
            ze.Extract(ms);
            ms.Position = 0;
            using (var sr = new StreamReader(ms))
            {
                bool adding = false;
                string startTag = "<someTag>";
                string endTag = "</someTag>";
                StringBuilder text = new StringBuilder();
                while (sr.Peek() >= 0)
                {
                    string tmp = sr.ReadLine();
                    if (!adding && tmp.Contains(startTag))
                    {
                        adding = true;
                    }
                    if (adding)
                    {
                        // AppendLine restores the line break that ReadLine strips
                        text.AppendLine(tmp);
                        // Only stop on an end tag seen after the start tag
                        if (tmp.Contains(endTag))
                            break;
                    }
                }
                xmlText = text.ToString();
            }
        }

This assumes that the start and end tags are each on a line by themselves. If they are not, you can trim the resulting string by finding the indexes of the start and end tags again, as you originally did.
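As a rough sketch of that clean-up folded into the loop itself, the version below trims text before the start tag and after the end tag as it goes. The names LineScanExample and ExtractBetween are hypothetical, and it assumes each tag appears whole on a single line with no attributes (a start tag such as <someTag id="1"> would not match).

```csharp
using System;
using System.IO;
using System.Text;

public static class LineScanExample
{
    // Hypothetical helper: returns the text from startTag through endTag,
    // or null if either tag is never found. Line breaks are normalized to '\n'.
    public static string ExtractBetween(TextReader reader, string startTag, string endTag)
    {
        var text = new StringBuilder();
        bool adding = false;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (!adding)
            {
                int start = line.IndexOf(startTag, StringComparison.Ordinal);
                if (start < 0) continue;      // still before the section
                adding = true;
                line = line.Substring(start); // drop text before the start tag
            }
            int end = line.IndexOf(endTag, StringComparison.Ordinal);
            if (end >= 0)
            {
                // Keep the line only up to and including the end tag
                text.Append(line, 0, end + endTag.Length);
                return text.ToString();
            }
            text.Append(line).Append('\n');
        }
        return null;
    }
}
```

With the StreamReader from the snippet above, this would be called as LineScanExample.ExtractBetween(sr, "<someTag>", "</someTag>"). Only the lines between the two tags are ever accumulated, so memory use depends on the size of the section rather than the size of the file.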