PowerShell: removing rows from a CSV that match values in a separate CSV

We are currently in the process of migrating files and at the end of each migration we need to produce a report of any file that doesn't get migrated. (Not all source files are migrated).

Before the migration I produce a CSV file that lists all files in the source, and after the migration I produce another CSV that lists all files that have been migrated.

I am running the PowerShell script below. For each row in the migrated CSV it removes any matching file from the source CSV, saves the source CSV, and then repeats on the next row, but it's taking an age to run. (I am guessing that's because it's checking each value in the migrated CSV against each row in the source CSV.) One file has 77810 rows and the other has 88900 rows.

Is there any other way to do this that would speed up the process? These files are tiny compared to some of the migrations.

import-csv -Path 'C:\Temp\CFP Didnts\Test\SPeopleMigratedFiles.csv' | foreach-object {
    $Fullpath = $($_.FullPath)
    $text = [regex]::escape($Fullpath)
 
    (Import-CSv 'C:\temp\CFP Didnts\Test\S People Commissioning Full Files.csv') |
        Where-Object{ $_.FullName -notmatch $text } |
        Export-Csv -NoTypeInformation -Path 'C:\temp\CFP Didnts\Test\S People Commissioning Full Files.csv' -Force
}

In short, what I need to end up with is one CSV that contains only the values not present in the migrated files CSV.

Thanks for any help or suggestions.

It does work, but even on a small migration (CSV1 contains 77810 rows and CSV2 contains 88900 rows, small compared to other migrations) it is very, very slow.

Answer by Santiago Squarzon:

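The main issue with your code is that you're reading the whole of S People Commissioning Full Files.csv on every loop iteration, and you're also overwriting it on every loop iteration. With 77810 and 88900 rows, that is up to roughly 6.9 billion regex comparisons, on top of tens of thousands of full file rewrites.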

Assuming you're looking to remove the rows of S People Commissioning Full Files.csv whose .FullName value is an exact match of a .FullPath value in SPeopleMigratedFiles.csv, then -notmatch is not the right operator to use in this case. Even combined with [regex]::Escape, the pattern is unanchored, so it would be filtering out partial matches too.
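To make the partial-match problem concrete, here is a minimal sketch using made-up paths (the values are hypothetical, not taken from the CSVs in the question). An escaped pattern still matches any longer string that contains it, whereas a plain equality comparison does not:

```powershell
# Hypothetical paths, for illustration only
$migrated = 'C:\Data\Report'
$source   = 'C:\Data\Report', 'C:\Data\Report\2023\Summary.docx', 'C:\Data\Other'

# Escaping makes the backslashes literal, but the pattern is still unanchored
$pattern = [regex]::Escape($migrated)

# -notmatch also excludes every path that merely contains the migrated one
$keptByNotMatch = @($source | Where-Object { $_ -notmatch $pattern })

# -ne excludes only the exact (case-insensitive) match
$keptByNotEqual = @($source | Where-Object { $_ -ne $migrated })

$keptByNotMatch   # only C:\Data\Other
$keptByNotEqual   # C:\Data\Report\2023\Summary.docx and C:\Data\Other
```

If a regex really were needed, anchoring the escaped pattern with ^ and $ would give exact matching, but a direct comparison or a set lookup is simpler and faster.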

The way you can improve your code in this case is to use a HashSet<T> for the lookups and to replace ForEach-Object with a script block that has a process block, which enumerates faster and consumes fewer resources. For a performance comparison between these 2 techniques for enumerating collections see the results in this answer.

# The parentheses make Import-Csv read the whole file before the pipeline starts,
# so Export-Csv can safely overwrite the same path at the end
(Import-Csv 'C:\temp\CFP Didnts\Test\S People Commissioning Full Files.csv') | & {
    begin {
        # A HashSet is a collection of unique values with fast lookups.
        # Load every migrated path into it once, ignoring case.
        $hash = [System.Collections.Generic.HashSet[string]]::new(
            [string[]] (Import-Csv 'C:\Temp\CFP Didnts\Test\SPeopleMigratedFiles.csv').FullPath,
            [System.StringComparer]::InvariantCultureIgnoreCase)
    }
    process {
        # Add() returns $true only if the value of `.FullName` was not already in the set
        if ($hash.Add($_.FullName)) {
            # which means `SPeopleMigratedFiles.csv` does not contain it, so keep the row
            $_
        }
    }
} | Export-Csv 'C:\temp\CFP Didnts\Test\S People Commissioning Full Files.csv' -NoTypeInformation
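
If you want to check the behaviour before pointing it at real data, the same idea can be tried on a few in-memory rows. This is only a sketch: the rows below are made up, and it uses Contains rather than Add, which is an assumption about what you want when the source list itself repeats a path (Contains keeps the repeats, Add would drop them).

```powershell
# Hypothetical sample rows standing in for the two CSV files
$migrated = @'
FullPath
C:\Share\a.txt
C:\Share\b.txt
'@ | ConvertFrom-Csv

$source = @'
FullName,Length
C:\Share\A.TXT,10
C:\Share\b.txt,20
C:\Share\c.txt,30
'@ | ConvertFrom-Csv

# Build the lookup once, ignoring case as in the code above
$lookup = [System.Collections.Generic.HashSet[string]]::new(
    [string[]] $migrated.FullPath,
    [System.StringComparer]::InvariantCultureIgnoreCase)

# Keep only the rows whose FullName is not in the migrated set
$notMigrated = @($source | Where-Object { -not $lookup.Contains($_.FullName) })

# Only the c.txt row remains
$notMigrated | ConvertTo-Csv -NoTypeInformation
```

On real files, the two here-strings would be replaced with Import-Csv calls and the result piped to Export-Csv instead of ConvertTo-Csv.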