Unable to completely parse XML in PowerShell


I have an XML file that I would like to parse through, and retrieve back specific information.

To make it easy to understand, here is a screenshot of what the XML file looks like:

enter image description here

I would like to parse through the XML and, for each Item node, retrieve the fields indicated in the screenshot. Each of the values retrieved needs to be formatted per Item node.

Finally, I would love to be able to specify a criterion to look for, and retrieve only the items that match it.

I have been trying, without luck. Here is what I have been able to come up with:

[xml]$MyXMLFile = gc 'X:\folder\my.xml'
$XMLItem = $MyXMLFile.PatchScan.Machine.Product.Item
$Patch = $XMLItem | Where-Object {$_.Class -eq 'Patch'}
$Patch.BulletinID
$Patch.PatchName
$Patch.Status

When I run the above code, it returns no results. However, for testing purposes only, if I remove the Item portion and modify the code above accordingly, I can get it working.

I load the XML into an XML object. When I traverse it down to Product, it works perfectly:

PS> $xmlobj.PatchScan.Machine.Product | Select-Object -Property Name, SP

Name SP
---- --
Windows 10 Pro (x64) 1607
Internet Explorer 11 (x64) Gold
Windows Media Player 12.0 Gold
MDAC 6.3 (x64) Gold
.NET Framework 4.7 (x64) Gold
MSXML 3.0 SP11
MSXML 6.0 (x64) SP3
DirectX 9.0c Gold
Adobe Flash 23 Gold
VMware Tools x64 Gold
Microsoft Visual C++ 2008 SP1 Redistributable Gold
Microsoft Visual C++ 2008 SP1 Redistributable (x64) Gold

Now add Item, and IntelliSense appends an opening parenthesis, as if Item were a method: $xmlobj.PatchScan.Machine.Product.Item( ← See that? That is why I think the Item node is doing something strange, and that is my roadblock.

This screenshot shows better how it starts with many product folders, and then in each product folder there are many item folders.

enter image description here

The XML in the product folder I don't care about. I need the individual information in each item folder.

There are 2 best solutions below

1
On BEST ANSWER

XML is a structured text format. It knows nothing about "folders". What you see in your screenshots is just how the data is rendered by the program you use for displaying it.

Anyway, the best approach to get what you want is using SelectNodes() with an XPath expression. As usual.

[xml]$xml = Get-Content 'X:\folder\my.xml'
$xml.SelectNodes('//Product/Item[@Class="Patch"]') |
    Select-Object BulletinID, PatchName, Status
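
To also filter on a criterion, as the question asks, the XPath predicate can be extended with further conditions. The following is only a sketch: the inline document, the EX-00x values, and the assumption that BulletinID, PatchName and Status are child elements of Item (with Class as an attribute) are all made up for illustration, since the real file is only visible in the screenshots.

```powershell
# Hypothetical sample data; the real file's layout is assumed, not known.
[xml]$xml = @'
<PatchScan>
  <Machine>
    <Product>
      <Item Class="Patch">
        <BulletinID>EX-001</BulletinID>
        <PatchName>example-a.msu</PatchName>
        <Status>Installed</Status>
      </Item>
      <Item Class="Patch">
        <BulletinID>EX-002</BulletinID>
        <PatchName>example-b.msu</PatchName>
        <Status>Missing</Status>
      </Item>
    </Product>
  </Machine>
</PatchScan>
'@

# XPath is case-sensitive. The predicate combines the attribute test
# with a test on the text of the <Status> child element.
$missing = $xml.SelectNodes('//Product/Item[@Class="Patch" and Status="Missing"]') |
    Select-Object BulletinID, PatchName, Status

$missing
```

If the fields turn out to be attributes rather than child elements, the same idea should apply with @Status in place of Status.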
0
On

tl;dr

As you suspected, a name collision prevented access to the .Item property on the XML elements of interest; fix the problem with explicit enumeration of the parent elements:

$xml.PatchScan.Machine.Product |
  % { $_.Item | select BulletinId, PatchName, Status }

% is a built-in alias for the ForEach-Object cmdlet; see bottom section for an explanation.


As an alternative, Ansgar Wiechers' helpful answer offers a concise XPath-based solution, which is both efficient and allows sophisticated queries.

As an aside: PowerShell v3+ comes with the Select-Xml cmdlet, which takes a file path as an argument, allowing for a single-pipeline solution:

(Select-Xml -LiteralPath X:\folder\my.xml '//Product/Item[@Class="Patch"]').Node |
  Select-Object BulletinId, PatchName, Status

Note:

  • Select-Xml wraps the matching XML nodes in an outer object, hence the need to access the .Node property.

  • As with direct use of the .NET APIs, querying XML documents that have namespaces requires extra work, namely declaring a (hash)table of namespace prefixes that map to namespace URIs, and use of those prefixes in the XPath query - see this answer
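
As a rough illustration of the namespace point, here is a sketch using Select-Xml's -Namespace parameter. The namespace URI urn:example:patches, the prefix p, and the inline document are invented for the example; the file in the question may not use namespaces at all.

```powershell
# Hypothetical document with a default namespace; the URI is made up.
[xml]$xml = @'
<PatchScan xmlns="urn:example:patches">
  <Product>
    <Item Class="Patch"><Status>Installed</Status></Item>
    <Item Class="Other"><Status>Missing</Status></Item>
  </Product>
</PatchScan>
'@

# Map a prefix of your choosing ('p' here) to the namespace URI ...
$ns = @{ p = 'urn:example:patches' }

# ... and use that prefix on every element name in the XPath query.
# Unprefixed attributes are in no namespace, so @Class needs no prefix.
$nodes = (Select-Xml -Xml $xml -XPath '//p:Product/p:Item[@Class="Patch"]' -Namespace $ns).Node

$nodes | Select-Object Class, Status
```

Without the prefix, a query such as //Product/Item would find nothing in this document, because unprefixed names in XPath refer to elements in no namespace.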


PowerShell's adaptation of the XML DOM (dot notation):

Update: A more comprehensive summary can now be found in this answer.

PowerShell decorates the object hierarchy contained in [System.Xml.XmlDocument] instances (created with cast [xml], for instance):

  • with properties named for the input document's specific elements and attributes[1] at every level; e.g.:

     ([xml] '<foo><bar>baz</bar></foo>').foo.bar # -> 'baz'
     ([xml] '<foo><bar id="1" /></foo>').foo.bar.id # -> '1'
    
  • turning multiple elements of the same name at a given hierarchy level implicitly into arrays (specifically, of type [object[]]); e.g.:

     ([xml] '<foo><C>one</C><C>two</C></foo>').foo.C[1] # -> 'two'
    

As the examples (and your own code in the question) show, this allows for access via convenient dot notation.

Note: If you use dot notation to target an element that has at least one attribute and/or child elements, the element itself is returned (an XmlElement instance); otherwise, it is the element's text content; for information about updating XML documents via dot notation, see this answer.
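
A small sketch of that distinction, using a made-up document with one text-only element and one element that carries an attribute:

```powershell
[xml]$doc = '<foo><plain>baz</plain><rich id="1">qux</rich></foo>'

# Text-only element: dot notation yields the text itself.
$doc.foo.plain.GetType().Name   # String

# Element with an attribute: dot notation yields the XmlElement ...
$doc.foo.rich.GetType().Name    # XmlElement

# ... so the text has to be requested explicitly.
$doc.foo.rich.InnerText         # qux
$doc.foo.rich.'#text'           # qux
```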

The downside of dot notation is that there can be name collisions, if an incidental input-XML element name happens to be the same as either an intrinsic [System.Xml.XmlElement] property name (for single-element properties), or an intrinsic [Array] property name (for array-valued properties; [System.Object[]] derives from [Array]).

In the event of a name collision: If the property being accessed contains:

  • a single child element ([System.Xml.XmlElement]), the incidental properties win.

    • This too can be problematic, because it makes accessing intrinsic type properties unpredictable - see bottom section.
  • an array of child elements, the [Array] type's properties win.

    • Therefore, the following element names break dot notation with array-valued properties (obtained with reflection command
      Get-Member -InputObject 1, 2 -Type Properties, ParameterizedProperty):

          Item Count IsFixedSize IsReadOnly IsSynchronized Length LongLength Rank SyncRoot
      

See the last section for a discussion of this difference and for how to gain access to the intrinsic [System.Xml.XmlElement] properties in the event of a collision.
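
The array case can be reproduced in isolation with a made-up document whose repeated elements each contain a child named Length, one of the colliding names listed above:

```powershell
[xml]$doc = '<foo><C><Length>5</Length></C><C><Length>7</Length></C></foo>'

# .C is an array of two elements, so .Length resolves to the array's own
# property rather than to the <Length> child elements.
$doc.foo.C.Length                            # 2

# Enumerating the array explicitly reaches the child elements.
$doc.foo.C | ForEach-Object { $_.Length }    # 5, 7
```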

The workaround is to use explicit enumeration of array-valued properties, using the ForEach-Object cmdlet, as demonstrated at the top.
Here is a complete example:

[xml] $xml = @'
<PatchScan>
  <Machine>
    <Product>
      <Name>Windows 10 Pro (x64)</Name>
      <Item Class="Patch">
        <BulletinId>MSAF-054</BulletinId>
        <PatchName>windows10.0-kb3189031-x64.msu</PatchName>
        <Status>Installed</Status>
      </Item>
      <Item Class="Patch">
        <BulletinId>MSAF-055</BulletinId>
        <PatchName>windows10.0-kb3189032-x64.msu</PatchName>
        <Status>Not Installed</Status>
      </Item>
    </Product>
    <Product>
      <Name>Windows 7 Pro (x86)</Name>
      <Item Class="Patch">
        <BulletinId>MSAF-154</BulletinId>
        <PatchName>windows7-kb3189031-x86.msu</PatchName>
        <Status>Partly Installed</Status>
      </Item>
      <Item Class="Patch">
        <BulletinId>MSAF-155</BulletinId>
        <PatchName>windows7-kb3189032-x86.msu</PatchName>
        <Status>Uninstalled</Status>
      </Item>
    </Product>
  </Machine>
</PatchScan>
'@

# Enumerate the array-valued .Product property explicitly, so that
# the .Item property can successfully be accessed on each XmlElement instance.
$xml.PatchScan.Machine.Product | 
  ForEach-Object { $_.Item | Select-Object BulletinID, PatchName, Status }

The above yields:

BulletinId PatchName                     Status
---------- ---------                     ------
MSAF-054   windows10.0-kb3189031-x64.msu Installed
MSAF-055   windows10.0-kb3189032-x64.msu Not Installed
MSAF-154   windows7-kb3189031-x86.msu    Partly Installed
MSAF-155   windows7-kb3189032-x86.msu    Uninstalled

Further down the rabbit hole: What properties are shadowed when:

Note: By shadowing I mean that in the case of a name collision, the "winning" property - the one whose value is reported - effectively hides the other one, thereby "putting it in the shadow".


In the case of using dot notation with arrays, a feature called member-access enumeration comes into play, which applies to any collection in PowerShell v3+; in other words: the behavior is not specific to the [xml] type.

In short: accessing a property on a collection implicitly accesses the property on each member of the collection (item in the collection) and returns the resulting values as an array ([System.Object[]]); e.g.:

# Using member-access enumeration, collect the value of the .prop property from
# the array's individual *members*.
> ([pscustomobject] @{ prop = 10 }, [pscustomobject] @{ prop = 20 }).prop
10
20

However, if the collection type itself has a property by that name, the collection's own property takes precedence; e.g.:

# !! Since arrays themselves have a property named .Count,
# !! member-access enumeration does NOT occur here.
> ([pscustomobject] @{ count = 10 }, [pscustomobject] @{ count = 20 }).Count
2  # !! The *array's* count property was accessed, returning the count of elements
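
When the collection's own property gets in the way like this, the members' values can still be reached by enumerating explicitly. A minimal sketch with the same two made-up objects:

```powershell
$items = [pscustomobject] @{ count = 10 }, [pscustomobject] @{ count = 20 }

$items.Count                                   # 2 (the array's own property)

# Explicit enumeration accesses the property on each member instead.
$items | ForEach-Object { $_.count }           # 10, 20
$items | Select-Object -ExpandProperty count   # 10, 20
```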

In the case of using dot notation with [xml] (PowerShell-decorated System.Xml.XmlDocument and System.Xml.XmlElement instances), the PowerShell-added, incidental properties shadow the type-native ones.

While this behavior is easy to grasp, the fact that the outcome depends on the specific input can also be treacherous:

For instance, in the following example the incidental name child element shadows the intrinsic property of the same name on the element itself:

> ([xml] '<xml><child>foo</child></xml>').xml.Name
xml  # OK: The element's *own* name

> ([xml] '<xml><name>foo</name></xml>').xml.Name
foo  # !! .name was interpreted as the incidental *child* element

If you do need to gain access to the intrinsic type's properties, use .get_<property-name>():[2]

> ([xml] '<xml><name>foo</name></xml>').xml.get_Name()
xml  # OK - the intrinsic property value, thanks to the use of .get_*()

[1] If a given element has both an attribute and an element by the same name, PowerShell reports both, as the elements of an array ([object[]]).

[2] The true type-native properties have get_* accessor methods (all properties in .NET are ultimately implemented via such methods), which the virtual, adapted properties lack, so calling these methods is a way to bypass the shadowing. Another option is to use the intrinsic psbase property, e.g. ([xml] '<xml><name>foo</name></xml>').xml.psbase.Name