I have a delimited file with 350 columns. The delimiter is `\034` (the ASCII field-separator character).
I have to extract a particular column and count each distinct value of that column in the file. If the count of a distinct value is greater than or equal to 2, I need to output it to a file.

The source file is 1 GB. I have written the following command, but it is very slow.
```powershell
Get-Content E:\Test\test.txt |
  Foreach {($_ -split '\034')[117]} |
  Group-Object -Property { $_ } |
  %{ if($_.Count -ge 2) { Select-Object -InputObject $_ -Property Name,Count} } |
  Export-csv -Path "E:\Test\test2.csv" -NoTypeInformation
```
Please help!
I suggest using a `switch` statement to process the input file quickly (by PowerShell standards).

Tip of the hat to Mathias R. Jessen for suggesting the `-NoElement` switch, which streamlines the `Group-Object` call by maintaining only abstract group information; that is, only the grouping criteria (as reflected in `.Name`), not the individual objects that make up each group (as normally reflected in `.Group`), are returned via the output objects.

As for what you tried:
- `Get-Content` with line-by-line streaming in the pipeline is slow, both generally (the object-by-object passing introduces overhead) and, specifically, because `Get-Content` decorates each line it outputs with ETS (Extended Type System) metadata. The `-Raw` switch instead reads the entire file as a single, multi-line string, which is much faster.

- Passing `-Property { $_ }` to `Group-Object` isn't necessary - just omit it. Without a `-Property` argument, the input objects are grouped as a whole.

- Chaining `Where-Object` and `Select-Object` - rather than filtering via an `if` statement in a `ForEach-Object` call combined with multiple `Select-Object` calls - is not only conceptually clearer, but also performs better.
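Putting those three fixes together, the original pipeline could be rewritten along these lines - a sketch only, with the paths and column index 117 taken from the question:

```powershell
# Read the whole file at once (-Raw), then split it into lines ourselves.
((Get-Content -Raw 'E:\Test\test.txt') -split '\r?\n').ForEach({ ($_ -split '\034')[117] }) |
  Group-Object -NoElement |    # no -Property argument: group the values as a whole
  Where-Object Count -ge 2 |   # filter with Where-Object instead of an if inside ForEach-Object
  Select-Object Name, Count |  # a single Select-Object call
  Export-Csv -Path 'E:\Test\test2.csv' -NoTypeInformation
```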
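The `switch`-based approach itself might look like the following - my reconstruction, not the answer's verbatim code. `switch -File` reads the file line by line without the pipeline's per-object overhead, and the `& { }` wrapper lets the statement's output stream into the pipeline:

```powershell
& {
  # switch -File streams the file line by line; the default branch runs for every line.
  switch -File 'E:\Test\test.txt' {
    default { $_.Split([char] 0x1C)[117] }  # 0x1C == octal \034, the FS character
  }
} |
  Group-Object -NoElement |   # track only group names and counts, not the members
  Where-Object Count -ge 2 |
  Select-Object Name, Count |
  Export-Csv -Path 'E:\Test\test2.csv' -NoTypeInformation
```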