Extract embedded XML data from an audio file in Windows


We have a platform that records our call centre calls and appends XML holding important metadata about each call to the end of the WAV file.

I'm trying to read a folder of these WAV files and pull the metadata into a list for a user. Their preference is for the list to be in Excel, but I'm struggling to find a method that works reliably on a normal Windows computer without installing anything special, such as Python.

Excel has an XML import function, but it fails because the XML is at the end of the file: Excel reads from the start and gets confused by the audio data. I just need to skip down to <recording> and read from there until </recording>.

I've tried the following in PowerShell:

$directory = "C:\test"
$wavFiles = Get-ChildItem -Path $directory -Filter *.wav

foreach ($file in $wavFiles) {
    Write-Host "Processing file: $($file.Name)"
    $content = Get-Content -Path $file.FullName -Raw -Encoding Byte
    $decodedContent = [System.Text.Encoding]::UTF8.GetString($content)
    $match = [regex]::Match($decodedContent, '<recording>.+?</recording>')
    if ($match.Success) {
        $xmlContent = $match.Value
        Write-Host "Found XML in file $($file.Name):"
        Write-Host $xmlContent
    } else {
        Write-Host "No XML found in file $($file.Name)."
    }
}

This correctly locates the files but fails to find the XML, even though the XML can be seen when opening a file in a text editor such as Notepad++:

Processing file: 131346032527__8115_02-13-2024-11-12-58.wav

No XML found in file 131346032527__8115_02-13-2024-11-12-58.wav.

Any ideas?

There are 3 best solutions below

1
mklement0 On BEST ANSWER

Note:

  • The following assumes that a well-formed <recording> XML element is indeed embedded in your .wav files, and that it is represented as a single, contiguous block of bytes.

By default, . in the .NET regex engine matches any character except a newline character (\n).

  • Thus, one possible explanation for not being able to find a match is that your <recording> XML element spans multiple lines.

    • To make . match newlines too, you need to set the SingleLine .NET regex option, which you can do in one of the following ways:

      • Use the equivalent s inline option, as part of the regex:

        [regex]::Match($decodedContent, '(?s)<recording>.+?</recording>')
        
      • Use the [regex]::Match() overload that takes a System.Text.RegularExpressions.RegexOptions argument:

        [regex]::Match($decodedContent, '<recording>.+?</recording>', 'SingleLine')
        
    • Note:

      • This solution may have to be combined with using a different character encoding - see the next major bullet point.

      • Also, note that .NET's regex engine is case-sensitive by default (unlike the PowerShell functionality that builds on it, notably -match and -replace); to make the above case-insensitive, use (?si) as the inline options or 'SingleLine, IgnoreCase' as the options argument.

  • Independently, perhaps the embedded XML uses a character encoding other than UTF-8, notably one that isn't at least compatible with UTF-8 with respect to ASCII-range characters.

    • E.g., if the actual encoding is UTF-16LE, use the following:

       [System.Text.Encoding]::Unicode.GetString($content)
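
A minimal sketch of the newline point above, using a made-up multi-line sample string in place of real decoded .wav content (the <agent> element and its value are hypothetical). It shows the default pattern failing across line breaks, the (?s) variant succeeding, and the match then being cast to [xml]:

```powershell
# Hypothetical stand-in for the decoded file content: noise around a multi-line XML block.
$decodedContent = "RIFF....audio bytes....`n<recording>`n  <agent>8115</agent>`n</recording>`ntrailing bytes"

# Without the Singleline option, '.' stops at the first newline, so nothing matches.
$plain = [regex]::Match($decodedContent, '<recording>.+?</recording>')

# With the inline (?s) option, '.' also matches newlines.
$single = [regex]::Match($decodedContent, '(?s)<recording>.+?</recording>')

$plain.Success    # False
$single.Success   # True

# The matched text is well-formed XML, so it can be cast to an XmlDocument.
$xml = [xml] $single.Value
$xml.recording.agent   # 8115
```

If the real files still produce no match with (?s), the encoding possibility described above is the next thing to check.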
      
0
jdweng On

I finally extracted the XML by walking the RIFF chunks of the file, and used LINQ to XML (System.Xml.Linq) to parse it.

using assembly System.Xml.Linq 

$filename = 'c:\temp\test.wav';

enum Wave {
   PCM = 0x0001
   IBM_MULAW = 0x0101
   IBM_ALAW = 0x0102
   IBM_ADPCM = 0x0103
}

$stream = [System.IO.File]::OpenRead($filename);
$bReader = [System.IO.BinaryReader]::New($stream);
$length = $bReader.BaseStream.Length;

while ($bReader.BaseStream.Position -lt $length)
{
   $identifier = $bReader.ReadBytes(4);
   $identifiedStr = [System.Text.Encoding]::UTF8.GetString($identifier);
   $size = [System.BitConverter]::ToInt32($bReader.ReadBytes(4), 0);
   $chunk = $bReader.ReadBytes($size);

   # RIFF chunks are word-aligned: skip the single padding byte after an odd-sized chunk
   if ($size % 2 -eq 1 -and $bReader.BaseStream.Position -lt $length) { $null = $bReader.ReadByte(); }

   if ($identifiedStr -eq 'RIFF')
   {
      $chunkIdent = [System.Text.Encoding]::UTF8.GetString($chunk, 0, 4).Trim();

      $offset = 4;

      if($chunkIdent -eq "WAVE")
      {
         while ($offset -lt $chunk.Length)
         {
 
             $chunktypeStr = [System.Text.Encoding]::UTF8.GetString($chunk, $offset, 4).Trim();
             Write-Host "Chunk Ident = $chunkIdent, Chunk Type = $chunktypeStr"

             $chunkTypeLength = [System.BitConverter]::ToInt32($chunk, $offset + 4);


             switch ($chunktypeStr)
             {
                'fmt'   {
                           # 'fmt ' payload: format tag, channels, sample rate, average bytes per second, block align
                           $category = [Wave][System.BitConverter]::ToInt16($chunk, $offset + 8);
                           $channels = [System.BitConverter]::ToInt16($chunk, $offset + 10);
                           $samplesPerSecond = [System.BitConverter]::ToInt32($chunk, $offset + 12);
                           $averageBytesPerSecond = [System.BitConverter]::ToInt32($chunk, $offset + 16);
                           Write-Host "Format : Category = $category, Channels = $channels, Samples per second = $samplesPerSecond, Average bytes per second = $averageBytesPerSecond"
                           $blockAlign = [System.BitConverter]::ToInt16($chunk, $offset + 20);
                        }
                 'data' {
                           # Audio samples are binary, so keep them as bytes rather than decoding them as text
                           $data = [byte[]]::new($chunkTypeLength);
                           [System.Array]::Copy($chunk, $offset + 8, $data, 0, $chunkTypeLength);
                        }
                 'iXML' {
                            # The payload starts 8 bytes past the chunk header and is $chunkTypeLength bytes long
                            $xml = [System.Text.Encoding]::UTF8.GetString($chunk, $offset + 8, $chunkTypeLength);
                            $doc = [System.Xml.Linq.XDocument]::Parse($xml);
                        }
                 default {
                 }
             }


             $offset += $chunkTypeLength + 8
             if($chunkTypeLength %2 -eq 1) { $offset += 1}
         }
      }
   }
}
1
Michael Brogan On

I played around with this and had similar trouble getting the regex to match. I found that removing the `r`n (CRLF) sequences, so that the entire raw string collapses to one line, let the pattern match. Note that this only removes Windows-style line endings; content that uses bare `n line endings would need those removed too. See the example below.

$decodedContent = @"
This is a bunch of garbage
<recording>
    <Title>My Thing</Title>
    <Date>1-2-24</Date>
    <meta>12345</meta>
</recording>
A bunch of other garbage
"@
$match = [regex]::Match(($decodedContent.Replace("`r`n","")),'<recording>.+?</recording>')
$xmlContent = $match.Value
[xml]$xml = $xmlContent
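
Since the goal in the question is a list that opens in Excel, one way to continue from here is to flatten each <recording> element into an object and write the objects out with Export-Csv. This is only a sketch under stated assumptions: ConvertTo-RecordingRow is a hypothetical helper name, the child element names come from the sample above rather than from real call recordings, the children are assumed to be simple leaf elements, and the output path is arbitrary.

```powershell
# Hypothetical helper: turns one <recording> XML string into a flat object
# with one property per child element, plus the source file name.
function ConvertTo-RecordingRow {
    param([string] $XmlText, [string] $FileName)

    $doc = [xml] $XmlText
    $row = [ordered] @{ File = $FileName }
    foreach ($node in $doc.DocumentElement.ChildNodes) {
        # Skip anything that is not an element (e.g. comments).
        if ($node.NodeType -eq 'Element') { $row[$node.Name] = $node.InnerText }
    }
    [pscustomobject] $row
}

# Sample input taken from the example above; 'example.wav' is a made-up file name.
$sample = '<recording><Title>My Thing</Title><Date>1-2-24</Date><meta>12345</meta></recording>'
$rows = @(ConvertTo-RecordingRow -XmlText $sample -FileName 'example.wav')

# Write a CSV that Excel can open directly; the path is an assumption.
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) 'recordings.csv'
$rows | Export-Csv -Path $csvPath -NoTypeInformation
```

In a real run, the helper would be called once per .wav file inside the foreach loop from the question, with the rows collected before the single Export-Csv call.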