I've written a script to help me identify duplicate files. For some reason, if I split these commands and export/import to CSV, it runs much faster than if I leave everything in memory. Here is my original code; it is god-awful slow:
Get-ChildItem M:\ -recurse | where-object {$_.length -gt 524288000} | select-object Directory, Name | Group-Object directory | ?{$_.count -gt 1} | %{$_.Group} | export-csv -notypeinformation M:\Misc\Scripts\Duplicates.csv
If I split this into 2 commands and export to CSV in the middle it runs about 100x faster. I'm hoping someone could shed some light on what I'm doing wrong.
Get-ChildItem M:\ -recurse | where-object {$_.length -gt 524288000} | select-object Directory, Name | Export-Csv -notypeinformation M:\Misc\Scripts\Duplicates\4.csv
import-csv M:\Misc\Scripts\Duplicates\4.csv | Group-Object directory | ?{$_.count -gt 1} | %{$_.Group} | export-csv -notypeinformation M:\Misc\Scripts\Duplicates\Duplicates.csv
remove-item M:\Misc\Scripts\Duplicates\4.csv
appreciate any suggestions,
~TJ
It's not `Group-Object` that is slow, it's your grouping condition. You're asking it to group `FileInfo` objects by their `.Directory` property, which represents their parent folder's `DirectoryInfo` instance, so the cmdlet has to group by a very complex object. Instead, you could use the `.DirectoryName` property, which represents the parent directory's `FullName` (a simple string), or the `.Directory.Name` property, which represents the parent folder's `Name` (also a simple string).

To summarize, the main reason exporting to a CSV is faster in this case is that when `Export-Csv` receives your objects from the pipeline, it calls the `ToString()` method on each object's property values, so the `Directory` instance gets converted to its string representation (calling `ToString()` on this instance ends up being the folder's `FullName`).

As for your code, if you want to keep it as efficient as possible without overcomplicating it:
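A minimal sketch of that change, grouping on the string `DirectoryName` instead of the `Directory` object. The function name `Get-LargeFileGroup` and its parameters are hypothetical, introduced only so the path and threshold are not hard-coded; the `-File` switch assumes PowerShell 3.0 or later.

```powershell
# Hypothetical helper (not from the original post): returns files larger than
# $MinLength that share a parent directory with at least one other such file.
function Get-LargeFileGroup {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [long] $MinLength = 500MB   # 524288000 bytes, the threshold used in the question
    )

    Get-ChildItem -LiteralPath $Path -Recurse -File |
        Where-Object { $_.Length -gt $MinLength } |
        # DirectoryName is a plain string (the parent's full path), so it is a cheap group key
        Group-Object -Property DirectoryName |
        Where-Object { $_.Count -gt 1 } |
        ForEach-Object { $_.Group } |
        Select-Object -Property DirectoryName, Name
}
```

With the paths from the question, it could be called as `Get-LargeFileGroup -Path M:\ | Export-Csv -NoTypeInformation M:\Misc\Scripts\Duplicates.csv`. Note that the CSV column is then named `DirectoryName` rather than `Directory`.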
If you want to group them by the parent's `Name` instead of `FullName`, you could use:
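A sketch of that variant, assuming the same filter as in the question and grouping on `.Directory.Name` through a script block. The function name `Get-LargeFileGroupByFolderName`, its parameters, and the `Folder` column name are hypothetical; `-File` assumes PowerShell 3.0 or later.

```powershell
# Hypothetical helper (not from the original post): same filter as the question,
# but grouped on the parent folder's short name instead of its full path.
function Get-LargeFileGroupByFolderName {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [long] $MinLength = 500MB   # 524288000 bytes, the threshold used in the question
    )

    Get-ChildItem -LiteralPath $Path -Recurse -File |
        Where-Object { $_.Length -gt $MinLength } |
        # The script block yields a simple string key: the parent folder's Name
        Group-Object -Property { $_.Directory.Name } |
        Where-Object { $_.Count -gt 1 } |
        ForEach-Object { $_.Group } |
        Select-Object -Property @{ Name = 'Folder'; Expression = { $_.Directory.Name } }, Name
}
```

Keep in mind that grouping by the short name merges folders that share a name but live in different locations, so two files in separate folders that are both called, say, `Movies` would land in the same group.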