I have a script that looks for a regex, such as an address or phone number, inside a large number of files. My current method using Start-Job works as expected, albeit very slowly. I'm looking for ways to speed it up and return results quicker, if at all possible.
I have ventured into the world of Runspaces within PowerShell after browsing around for help. Below is the code I have mashed together with a brief understanding of how Runspaces are used.
My question is about how Runspaces can be used so that a Get-ChildItem request running in parallel will not scan the same file across multiple runspaces, if that is even possible.
I created 20,000 files containing junk, and manually edited 2 files with the word "KETCHUP!" inside.
10k of the files are .xml and 10k are .txt.
I'm trying not to use the PowerShell v7 -Parallel parameter, as I would like to hand my script/GUI to other members of staff who are not in IT and will not have anything newer than the ISE installed.
$Finished.Text = 'Working.....'

# Get the list of files to search through
$path = "C:\intel\spam"
Push-Location $path
$FILES = Get-ChildItem -Filter *.XML -File

### 5 Runspace limit
$RunspacePool = [RunspaceFactory]::CreateRunspacePool(1, 5)
$RunspacePool.ApartmentState = "MTA"
$RunspacePool.Open()
$runspaces = @()

# Set up the scriptblock each runspace will run
$scriptblock = {
    Param (
        [object]$files
    )
    foreach ($file in $files) {
        # Use FullName: runspaces in the pool do not inherit the caller's current location
        $test = Select-String -Path $file.FullName -Pattern 'KETCHUP!' -List | Select-Object FileName, Path
        if ($test) {
            Add-Content -Path 'C:\intel\matches.txt' -Value $test.FileName
        }
    }
}

Write-Output "Starting search..."
$runspace = [PowerShell]::Create()
[void]$runspace.AddScript($scriptblock)
[void]$runspace.AddArgument($FILES) # <-- Send files to be searched
$runspace.RunspacePool = $RunspacePool
$AsyncObject = $runspace.BeginInvoke()

# Track the pipeline and its async handle so we can wait on it and clean up later
$runspaces += [PSCustomObject]@{ Pipe = $runspace; Status = $AsyncObject }

# Wait for all runspaces to complete
while ($runspaces.Status.IsCompleted -contains $false) { Start-Sleep -Milliseconds 100 }

# Collect output and clean up runspaces
$Data = foreach ($rs in $runspaces) {
    $rs.Pipe.EndInvoke($rs.Status)
    $rs.Pipe.Dispose()
}

# Cleanup runspace pool
$RunspacePool.Close()
$RunspacePool.Dispose()
Pop-Location
It really comes down to logic, and that means breaking the files to search into chunks, which is totally doable. The way it works is more or less like this: let's imagine you have 8 files and a hypothetical 2-core CPU.
After determining the chunk size (which would be 4 in this hypothetical scenario, since 8 files divided by 2 cores is 4), the code would divide these files into chunks:
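Chunk 1: file1, file2, file3, file4
Chunk 2: file5, file6, file7, file8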
This division would be stored in the $file_chunks ArrayList. Now, when parallel processing begins, each CPU core (or runspace) picks up a chunk:
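Core 1 (runspace 1) -> Chunk 1: file1, file2, file3, file4
Core 2 (runspace 2) -> Chunk 2: file5, file6, file7, file8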
Each core works on its own subset of files, allowing for faster parallel processing.
With this said and done, you can create a more robust solution, such as a function, so you can re-use it in a friendlier manner. :)
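Here is a minimal sketch of what that could look like, assuming the chunking approach above. The function name Search-FilesInParallel and its parameters are placeholders of my own; note it returns the matches to the pipeline rather than appending to a shared matches.txt, which avoids the runspaces contending over one output file:

function Search-FilesInParallel {
    param (
        [Parameter(Mandatory)][string]$Path,
        [Parameter(Mandatory)][string]$Pattern,
        [string]$Filter = '*.*',
        [int]$ChunkCount = [Environment]::ProcessorCount
    )

    $files = Get-ChildItem -Path $Path -Filter $Filter -File
    if (-not $files) { return }

    # Break the file list into one chunk per runspace so no file is scanned twice
    $chunk_size = [int][math]::Ceiling($files.Count / $ChunkCount)
    $file_chunks = [System.Collections.ArrayList]::new()
    for ($i = 0; $i -lt $files.Count; $i += $chunk_size) {
        $end = [math]::Min($i + $chunk_size - 1, $files.Count - 1)
        [void]$file_chunks.Add($files[$i..$end])
    }

    $pool = [RunspaceFactory]::CreateRunspacePool(1, $ChunkCount)
    $pool.Open()

    $scriptblock = {
        param ([object[]]$chunk, [string]$pattern)
        foreach ($file in $chunk) {
            # -List stops at the first match per file, which is all we need here
            Select-String -Path $file.FullName -Pattern $pattern -List |
                Select-Object FileName, Path
        }
    }

    # One pipeline per chunk, all drawing runspaces from the same pool
    $jobs = foreach ($chunk in $file_chunks) {
        $ps = [PowerShell]::Create()
        $ps.RunspacePool = $pool
        [void]$ps.AddScript($scriptblock).AddArgument($chunk).AddArgument($Pattern)
        [PSCustomObject]@{ Pipe = $ps; Status = $ps.BeginInvoke() }
    }

    # Wait for every chunk to finish, emit the matches, then clean up
    while ($jobs.Status.IsCompleted -contains $false) { Start-Sleep -Milliseconds 100 }
    foreach ($job in $jobs) {
        $job.Pipe.EndInvoke($job.Status)
        $job.Pipe.Dispose()
    }
    $pool.Close()
    $pool.Dispose()
}

You would then call it along the lines of:

$found = Search-FilesInParallel -Path 'C:\intel\spam' -Pattern 'KETCHUP!' -Filter '*.xml'

Each runspace only ever sees its own chunk, so no file is scanned twice.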