PowerShell Extract text between two strings with -Tail and -Wait

1.6k Views Asked by At

I have a text file with a large number of log messages. I want to extract the messages between two string patterns. I want the extracted message to appear as it is in the text file.

I tried the following methods. It works, but doesn't support Get-Content's -Wait and -Tail options. Also, the extracted results are displayed in one line, but not like the text file. Inputs are welcome :-)

Sample Code

function GetTextBetweenTwoStrings($startPattern, $endPattern, $filePath){

    # Get content from the input file
    $fileContent = Get-Content $filePath

    # Regular expression (Regex) of the given start and end patterns
    $pattern = "$startPattern(.*?)$endPattern"

    # Perform the Regex opperation
    $result = [regex]::Match($fileContent,$pattern).Value

    # Finally return the result to the caller
    return $result
}

# Clear the screen
Clear-Host

$input = "THE-LOG-FILE.log"
$startPattern = 'START-OF-PATTERN'
$endPattern = 'END-OF-PATTERN'

# Call the function
GetTextBetweenTwoStrings -startPattern $startPattern -endPattern $endPattern -filePath $input

Improved script based on Theo's answer. The following points need to be improved:

  1. The beginning and end of the output is somehow trimmed despite I adjusted the buffer size in the script.
  2. How to wrap each matched result into START and END string?
  3. Still I could not figure out how to use the -Wait and -Tail options

Updated Script

# Clear the screen
Clear-Host

# Adjust the buffer size of the window
$bw = 10000
$bh = 300000
if ($host.name -eq 'ConsoleHost') # or -notmatch 'ISE'
{
  [console]::bufferwidth = $bw
  [console]::bufferheight = $bh
}
else
{
    $pshost = get-host
    $pswindow = $pshost.ui.rawui
    $newsize = $pswindow.buffersize
    $newsize.height = $bh
    $newsize.width = $bw
    $pswindow.buffersize = $newsize
}


function Get-TextBetweenTwoStrings ([string]$startPattern, [string]$endPattern, [string]$filePath){
    # Get content from the input file
    $fileContent = Get-Content -Path $filePath -Raw
    # Regular expression (Regex) of the given start and end patterns
    $pattern = '(?is){0}(.*?){1}' -f [regex]::Escape($startPattern), [regex]::Escape($endPattern)
    # Perform the Regex operation and output
    [regex]::Match($fileContent,$pattern).Groups[1].Value
}

# Input file path
 $inputFile = "THE-LOG-FILE.log"

# The patterns
$startPattern = 'START-OF-PATTERN'
$endPattern = 'END-OF-PATTERN'


Get-TextBetweenTwoStrings -startPattern $startPattern -endPattern $endPattern -filePath $inputFile
2

There are 2 best solutions below

5
On BEST ANSWER
  • You need to perform streaming processing of your Get-Content call, in a pipeline, such as with ForEach-Object, if you want to process lines as they're being read.

    • This is a must if you're using Get-Content -Wait, because such a call doesn't terminate by itself (it keeps waiting for new lines to be added to the file, indefinitely), but inside a pipeline its output can be processed as it is being received, even before the command terminates.
  • You're trying to match across multiple lines, which with Get-Content output would only work if you used the -Raw switch - by default, Get-Content reads its input file(s) line by line.

    • However, -Raw is incompatible with -Wait.
    • Therefore, you must stick with line-by-line processing, which requires that you match the start and end patterns separately, and keep track of when you're processing lines between those two patterns.

Here's a proof of concept, but note the following:

  • -Tail 100 is hard-coded - adjust as needed or make it another parameter.

  • The use of -Wait means that the function will run indefinitely - waiting for new lines to be added to $filePath - so you'll need to use Ctrl-C to stop it.

    • While you can use a Get-TextBetweenTwoStrings call itself in a pipeline for object-by-object processing, assigning its result to a variable ($result = ...) won't work when terminating with Ctrl-C, because this method of termination also aborts the assignment operation.

    • To work around this limitation, the function below is defined as an advanced function, which automatically enables support for the common -OutVariable parameter, which is populated even in the event of termination with Ctrl-C; your sample call would then look as follows (as Theo notes, don't use the automatic $input variable as a custom variable):

      # Look for blocks of interest in the input file, indefinitely,
      # and output them as they're being found.
      # After termination with Ctrl-C, $result will also contain the blocks
      # found, if any.
      Get-TextBetweenTwoStrings -OutVariable result -startPattern $startPattern -endPattern $endPattern -filePath $inputFile
      
  • Per your feedback you want the block of lines to encompass the full lines on which the start and end patterns match, so the regexes below are enclosed in .*

  • The word pattern in your $startPattern and $endPattern parameters is a bit ambiguous in that it suggests that they themselves are regexes that can therefore be used as-is or embedded as-is in a larger regex on the RHS of the -match operator.
    However, in the solution below I am assuming that they are be treated as literal strings, which is why they are escaped with [regex]::Escape(); simply omit these calls if these parameters are indeed regexes themselves; i.e.:

    $startRegex = '.*' + $startPattern + '.*'
    $endRegex = '.*' + $endPattern + '.*'
    
  • The solution assumes there is no overlap between blocks and that, in a given block, the start and end patterns are on separate lines.

  • Each block found is output as a single, multi-line string, using LF ("`n") as the newline character; if you want a CRLF newline sequences instead, use "`r`n"; for the platform-native newline format (CRLF on Windows, LF on Unix-like platforms), use [Environment]::NewLine.

# Note the use of "-" after "Get", to adhere to PowerShell's
# "<Verb>-<Noun>" naming convention.
function Get-TextBetweenTwoStrings {

  # Make the function an advanced one, so that it supports the 
  # -OutVariable common parameter.
  [CmdletBinding()]
  param(
    $startPattern, 
    $endPattern, 
    $filePath
  )

  # Note: If $startPattern and $endPattern are themselves
  #       regexes, omit the [regex]::Escape() calls.
  $startRegex = '.*' + [regex]::Escape($startPattern) + '.*'
  $endRegex = '.*' + [regex]::Escape($endPattern) + '.*'

  $inBlock = $false
  $block = [System.Collections.Generic.List[string]]::new()

  Get-Content -Tail 100 -Wait $filePath | ForEach-Object {
    if ($inBlock) {
      if ($_ -match $endRegex) {
        $block.Add($Matches[0])
        # Output the block of lines as a single, multi-line string
        $block -join "`n"
        $inBlock = $false; $block.Clear()       
      }
      else {
        $block.Add($_)
      }
    }
    elseif ($_ -match $startRegex) {
      $inBlock = $true
      $block.Add($Matches[0])
    }
  }

}
6
On

First of all, you should not use $input as self-defined variable name, because this is an Automatic variable.

Then, you are reading the file as a string array, where you would rather read is as a single, multiline string. For that append switch -Raw to the Get-Content call.

The regex you are creating does not allow fgor regex special characters in the start- and end patterns you give, so it I would suggest using [regex]::Escape() on these patterns when creating the regex string.

While your regex does use a group capturing sequence inside the brackets, you are not using that when it comes to getting the value you seek.

Finally, I would recommend using PowerShell naming convention (Verb-Noun) for the function name

Try

function Get-TextBetweenTwoStrings ([string]$startPattern, [string]$endPattern, [string]$filePath){
    # Get content from the input file
    $fileContent = Get-Content -Path $filePath -Raw
    # Regular expression (Regex) of the given start and end patterns
    $pattern = '(?is){0}(.*?){1}' -f [regex]::Escape($startPattern), [regex]::Escape($endPattern)
    # Perform the Regex operation and output
    [regex]::Match($fileContent,$pattern).Groups[1].Value
}

$inputFile    = "D:\Test\THE-LOG-FILE.log"
$startPattern = 'START-OF-PATTERN'
$endPattern   = 'END-OF-PATTERN'

Get-TextBetweenTwoStrings -startPattern $startPattern -endPattern $endPattern -filePath $inputFile

Would result in something like:

blahblah
more lines here

The (?is) makes the regex case-insensitive and have the dot match linebreaks as well


Nice to see you're using my version of the Get-TextBetweenTwoStrings function, however I believe you are mistaking the output in the console to output as in a dedicated text editor. In the console, too long lines will be truncated, whereas in a text editor like notepad, you can choose to wrap long lines or have a horizontal scrollbar.

If you simply append

| Set-Content -Path 'X:\wherever\theoutput.txt'

to the Get-TextBetweenTwoStrings .. call, you will find the lines are NOT truncated when you open it in Word or notepad for instance.

In fact, you can have that line folowed by

notepad 'X:\wherever\theoutput.txt'

to have notepad open that file straight away.