I have a delimited file with 350 columns. The delimiter is \034 (the ASCII field-separator character).
I have to extract a particular column's value and count the occurrences of each distinct value of that column in the file. If a distinct value's count is greater than or equal to 2, I need to output it to a file.
The source file is 1 GB. I have written the following command, but it is very slow:
```powershell
Get-Content E:\Test\test.txt |
  Foreach {($_ -split '\034')[117]} |
  Group-Object -Property { $_ } |
  %{ if($_.Count -ge 2) { Select-Object -InputObject $_ -Property Name,Count} } |
  Export-csv -Path "E:\Test\test2.csv" -NoTypeInformation
```
Please help!
I suggest using a `switch` statement to process the input file quickly (by PowerShell standards).

Tip of the hat to Mathias R. Jessen for suggesting the `-NoElement` switch, which streamlines the `Group-Object` call by maintaining only abstract group information; that is, only the grouping criteria (as reflected in `.Name`), not also the individual objects that make up the group (as normally reflected in `.Group`), are returned via the output objects.

As for what you tried:
- `Get-Content` with line-by-line streaming in the pipeline is slow, both generally (the object-by-object passing introduces overhead) and, specifically, because `Get-Content` decorates each line it outputs with ETS (Extended Type System) metadata. The `-Raw` switch instead reads the entire file as a single, multi-line string, which is much faster.

- Passing `-Property { $_ }` to `Group-Object` isn't necessary - just omit it. Without a `-Property` argument, the input objects are grouped as a whole.

- Chaining `Where-Object` and `Select-Object` - rather than filtering via an `if` statement in a `ForEach-Object` call combined with multiple `Select-Object` calls - is not only conceptually clearer, but also performs better.
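Putting the advice above together, the `switch`-based approach might look like the following sketch (it reuses the file path, zero-based column index 117, and the count-of-2 threshold from the question; the `& { ... }` wrapper is needed so the `switch` statement's output can stream into a pipeline):

```powershell
& {
  # switch -File streams the file line by line, without the per-line
  # ETS metadata overhead that Get-Content incurs in the pipeline.
  switch -File E:\Test\test.txt {
    default { ($_ -split '\034')[117] }  # emit the 118th column of each line
  }
} |
  Group-Object -NoElement |    # count distinct values only (.Name / .Count, no .Group)
  Where-Object Count -ge 2 |   # keep values that occur at least twice
  Select-Object Name, Count |
  Export-Csv -Path E:\Test\test2.csv -NoTypeInformation
```

Note that `-split '\034'` works because .NET regexes interpret `\034` as an octal escape for the field-separator character (0x1C).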