Custom Powershell sort function


I have a giant array of 1M+ names in it, some are alpha numeric some are just alpha.

CSV:
id,firstname,lastname,email,email2,profession
100,Andeee,Michella,[email protected],[email protected],police officer
101,Tybie,1Grobe,[email protected],[email protected],worker
102,Fernande,Azeria,[email protected],[email protected],developer
103,Lenna,Schenck,[email protected],[email protected],police officer
104,4Marti,Brittani,[email protected],[email protected],worker
105,Riannon,Aldric,[email protected],[email protected],doctor
106,Corry,Nikaniki,[email protected],[email protected],worker
107,Correy,Shama,[email protected],[email protected],police officer
108,Marcy,Drus,[email protected],[email protected],worker
109,Bill,Valerio,[email protected],[email protected],worker

I don't want to use Sort-Object or sort on the entire array because it's taking way too long. This needs to be done in PowerShell because of restrictions in my environment.

The array comes from a CSV that I exported from another PowerShell job (I don't have access to the job's code, just the results).

Here is the example that I created from a Java method I found. It failed with the following error: The script failed due to call depth overflow.

$array = @("Ryan", "Kelly", "Alex", "Kyle", "Riley")
mergeSort $array

write-host $array

function mergeSort
{

   param([string[]] $arr)

      if($arr.length -ge 2){
         
         #find mid-point
         $left_len = [math]::floor([int32]$arr.length/2)                                              
         
         #declare array size of left of mid-point
         $left = [string[]]::new($left_len);                                                                 

         #find mid-point
         $right_len = [math]::ceiling([int32]($arr.Length - $arr.Length /2))                
     
         #declare array size right of mid-point
         $right = [string[]]::new($right_len);                                                               

         for ($i = 0; $i -lt $left_len.Length; $i++){
            $left= $arr[$i]
         }#for loop

         for ($i = 0; $i -lt $right_len; $i++){
            $right = $arr[$i +$arr.Length/2]
         }
     }#if($arr.length -ge 2)

   mergeSort $left

   mergeSort $right

   merge ($arr, $left, $right)
}

function merge
{
    param ([string[]] $result,[string[]] $left, [string[]] $right)

    $i1 = 0

    $12 = 0

    for ($i = 0; $i -le $result.Length; $i++) {
      if($i2 -gt $right.Length -or ($i1 -lt $left.Length -and $left[$i1].CompareTo($right[$i2]) -lt 0)) {
         $result[$i] = $left[$i1]
          $i1++
       }
       else {
          $result[$i] = $right[$i2]
          $i2++
       }

   }

   $result.legnth

 }

This is the latest solution I came up with following everyone's suggestions. I want to make this work in parallel, but it throws errors:

$array = @('Ryan', 'Kelly', 'Alex', 'Kyle', 'Riley', '4test', 'test4', 'why', 'you', 'me', 'where', 'hello', 'jose', 'test', 
'Jelly', 'Plex', 'Cyle', 'Miley', '5test', '3test4', 'who', 'Bou', 'We', 'There', 'Yellow', 'Pose', 'West')

$type = ("System.Collections.Generic.SortedSet"+'`'+"2") -as "Type"
$type = $type.MakeGenericType( @( ("System.string" -as "Type"), ("system.string" -as "Type") ) )
$sortedArray = [Activator]::CreateInstance($type, 10000)

$a, $b = ($array | Split-Collection -Count ([int]$array.length/2) | %{ $_ -join ',' })

$firstCollection = $a.Split(",")
$secondCollection = $b.Split(",")

$counter = 0
$counterHalf = $array.Length/2

1..$counterHalf| ForEach {
    try {   
        $col1 = $firstCollection[$counter]
        $sortedArray.Add($col1, $counter)
    }
    catch { "Out of bound col 1" }

    try {    
        $col2 = $secondCollection[$counter]
        $sortedArray.Add($col2, $counter)
    }
    catch { "Out of bound col 2" }
    
    $counter++
}

$sortedArray


function Split-Collection {
    [CmdletBinding()]
    param(
        [Parameter(ValueFromPipeline=$true)] $Collection,
        [Parameter(Mandatory=$true)][ValidateRange(1, 247483647)][int] $Count)
    begin {
        $Ctr = 0
        $Arrays = @()
        $TempArray = @()
    }
    process {
        if (++$Ctr -eq $Count) {
            $Ctr = 0
            $Arrays += , @($TempArray + $_)
            $TempArray = @()
            return
        }
        $TempArray += $_
    }
    end {
        if ($TempArray) { $Arrays += , $TempArray }
        $Arrays
    }
}

3 Answers

Accepted Answer (score: 6)

FWIW, this is an answer to the original question about getting your Merge Sort code working. Unfortunately it's not very performant, so I don't know if it will actually help you with the wider problem of sorting your 1M+ rows...

The Good News

I made some tweaks to your original mergeSort that appear to fix it, at least for the sample array at the top of your question.

The fixes were mostly typos - see a screenshot from BeyondCompare to see the changes I made:

[screenshot of the BeyondCompare diff omitted]

The Bad News

It's sloooooow.

PS> $array = [string[]] (1..10000 | % { $_.ToString() }) 
PS> measure-command {
    mergeSort $array
}

...
TotalMilliseconds : 11511.74

compared to

PS> $array = [string[]] (1..10000 | % { $_.ToString() }) 
PS> measure-command {
    $array = $array | sort-object
}

...
TotalMilliseconds : 36.8607

Maybe it performs better with the scale of data you're talking about, but I didn't have the patience to test it :-)

The Ugly

I also tweaked the code slightly so it sorts the left side before doing anything with the right side, which means it doesn't need to use as much memory.

Here's the updated sample, for posterity.

$ErrorActionPreference = "Stop";
Set-StrictMode -Version "Latest";

function mergeSort
{

    param([string[]] $arr)

    if( $arr.length -gt 1 )
    {

        # sort the left side
        $left_len = [Math]::Floor([int32]$arr.length / 2)
        $left = [string[]]::new($left_len);                                                                 
        for( $i = 0; $i -lt $left_len; $i++ )
        {
            $left[$i] = $arr[$i]
        }
        mergeSort -arr $left

        # sort the right side
        $right_len = $arr.Length - $left_len
        $right = [string[]]::new($right_len);
        for( $i = 0; $i -lt $right_len; $i++ )
        {
            $right[$i] = $arr[$left_len + $i]
        }
        mergeSort -arr $right

        # merge the two sides
        merge -result $arr -left $left -right $right

    }

}

function merge
{
    param ([string[]] $result,[string[]] $left, [string[]] $right)

    $i1 = 0
    $i2 = 0

    for ($i = 0; $i -lt $result.Length; $i++)
    {
        if( ($i1 -lt $left.Length) -and (($i2 -ge $right.Length) -or $left[$i1].CompareTo($right[$i2]) -lt 0) )
        {
            $result[$i] = $left[$i1]
            $i1++
        }
        else
        {
            $result[$i] = $right[$i2]
            $i2++
        }
    }

}

$array = [string[]] @("Ryan", "Kelly", "Alex", "Kyle", "Riley")
mergeSort $array

write-host $array

One thing to specifically call out is casting the input array to [string[]]:

$array = [string[]] @("Ryan", "Kelly", "Alex", "Kyle", "Riley")

Without the cast, $array is of type [System.Object[]]. When it's passed to mergeSort, PowerShell creates a new temporary [string[]] array internally, copies the values into it, and sorts that internal array - but it never assigns the internal array back to $array, so the sort appears to have no effect.

Try it without the cast to see that in action.
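For example, a quick way to see the difference (this assumes the fixed mergeSort function above is already defined in your session):

```powershell
# Without the cast $array stays [System.Object[]] and the sorted
# copy is discarded:
$array = @("Ryan", "Kelly", "Alex")
mergeSort $array
$array            # still in the original order

# With the cast the function mutates $array in place:
$array = [string[]] @("Ryan", "Kelly", "Alex")
mergeSort $array
$array            # Alex Kelly Ryan
```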

Answer (score: 6)

Use a SortedDictionary, which keeps its entries ordered by key as they are inserted:

$filename = 'c:\temp\test.csv'
$dict = [System.Collections.Generic.SortedDictionary[string,string]]::new()

$reader = [System.IO.StreamReader]::new($filename)
#skip header
$reader.ReadLine()
while( ($line = $reader.ReadLine()) -ne $null )
{
   if($line.Trim().Length -gt 0)
   {
      $firstComma = $line.IndexOf(',')
      $id = $line.Substring(0, $firstComma)
      $dict.Add($id, $line)
   }
}
$reader.Close()
$dict
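One caveat worth noting (my addition, not part of the answer): SortedDictionary's Add method throws if the key already exists, so if the id column could ever repeat you may prefer indexer assignment, which silently overwrites instead:

```powershell
# Sketch of the difference between Add and the indexer on a duplicate key.
$dict = [System.Collections.Generic.SortedDictionary[string,string]]::new()
$dict.Add('100', 'first row')
# $dict.Add('100', 'second row')   # would throw: key already exists
$dict['100'] = 'second row'        # indexer overwrites: last occurrence wins
```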
Answer (score: 1)

If you do have performance issues with PowerShell you might start with reading PowerShell scripting performance considerations

Top-down implementation

With regards to your first attempt, one thing you might want to avoid is a recursive function, simply because calling a function in PowerShell is pretty expensive; see: What is the best way to reuse static code.
The specific error The script failed due to call depth overflow is most likely due to an incorrect check for when the recursive calls should stop, so the function recurses forever.

Bottom-up implementation

With regards to your second attempt, using the increase assignment operator (+=) to create a collection is a quite common PowerShell performance gotcha, see: Why should I avoid using the increase assignment operator (+=) to create a collection
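A minimal illustration of why += hurts (exact timings will vary by machine, but the gap grows quadratically with the element count):

```powershell
# += allocates a new array and copies every element on each iteration,
# so building n elements costs O(n^2) copies in total:
(Measure-Command {
    $arr = @()
    foreach ($i in 1..10000) { $arr += $i }
}).TotalMilliseconds

# A generic List grows in place with amortized O(1) appends:
(Measure-Command {
    $list = [System.Collections.Generic.List[int]]::new()
    foreach ($i in 1..10000) { $list.Add($i) }
}).TotalMilliseconds
```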

Merge Sort Prototype

Taking these two issues in consideration, you might come up with a function like:

function MergeSort($List, $By) {
    $Count = @($List).get_Count()
    if ($Count -le 1) { return $List }
    $Temp = New-Object PSCustomObject[] $Count
    for ($Width = 1; $Width -lt $Count; $Width = 2 * $Width) {
        for ($Start = 0; $Start -lt $Count; $Start = $Start + 2 * $Width) {
            $Left = $Right = $To = 0
            do {
                if (
                  $Right -ge $Width -or $Start + $Width + $Right -ge $Count -or (
                    $Left -lt $Width -and $Start + $Left -lt $Count -and 
                    $List[$Start + $Left].$By -lt $List[$Start + $Width + $Right].$By
                  )
                )    { $Temp[$Start + $To++] = $List[$Start + $Left++] }
                else { $Temp[$Start + $To++] = $List[$Start + $Width + $Right++] }
            } while ($To -lt 2 * $Width -and $Start + $To -lt $Count)
        }
        $List, $Temp = $Temp, $List # Swap (the references of) the lists
    }
    $List
}

Demo

$Data = ConvertFrom-Csv @'
id,firstname,lastname,email,email2,profession
100,Andeee,Michella,[email protected],[email protected],police officer
101,Tybie,1Grobe,[email protected],[email protected],worker
102,Fernande,Azeria,[email protected],[email protected],developer
103,Lenna,Schenck,[email protected],[email protected],police officer
104,4Marti,Brittani,[email protected],[email protected],worker
105,Riannon,Aldric,[email protected],[email protected],doctor
106,Corry,Nikaniki,[email protected],[email protected],worker
107,Correy,Shama,[email protected],[email protected],police officer
108,Marcy,Drus,[email protected],[email protected],worker
109,Bill,Valerio,[email protected],[email protected],worker
'@

MergeSort $Data -by lastname |Format-Table

id  firstname lastname email                       email2                    profession
--  --------- -------- -----                       ------                    ----------
101 Tybie     1Grobe   [email protected]     [email protected]     worker
105 Riannon   Aldric   [email protected]  [email protected]  doctor
102 Fernande  Azeria   [email protected] [email protected] developer
104 4Marti    Brittani [email protected]  [email protected]  worker
108 Marcy     Drus     [email protected]      [email protected]      worker
100 Andeee    Michella [email protected] [email protected] police officer
106 Corry     Nikaniki [email protected]  [email protected]  worker
103 Lenna     Schenck  [email protected]   [email protected]   police officer
107 Correy    Shama    [email protected]    [email protected]    police officer
109 Bill      Valerio  [email protected]    [email protected]    worker

But as already concluded in @mclayton's helpful answer, it is unlikely that you will be able to beat the native Sort-Object with a hand-rolled PowerShell function.

SortedDictionary

Anyhow, the suggestion in @jdweng's helpful answer of using the SortedDictionary<TKey,TValue> class is a better candidate for beating the native Sort-Object command.

$Dictionary = [System.Collections.Generic.SortedDictionary[string,object]]::new()
$Data.foreach{ $Dictionary[$_.lastname] = $_ }
$Dictionary.Values |Format-Table
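Note that dictionary keys must be unique, so two rows with the same lastname would overwrite each other in the snippet above. One possible workaround (my assumption, not part of the original answer) is to disambiguate the key with the id column, which preserves the lastname ordering while keeping every row:

```powershell
# Composite key "lastname,id" keeps duplicate lastnames distinct;
# assumes $Data is the ConvertFrom-Csv result from the demo above.
$Dictionary = [System.Collections.Generic.SortedDictionary[string,object]]::new()
$Data.foreach{ $Dictionary["$($_.lastname),$($_.id)"] = $_ }
$Dictionary.Values |Format-Table
```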

Benchmark

I question your comment that the SortedDictionary "solution is linear", but I can't prove that. Anyway, I guess in the end it is the actual performance that counts here.

$Data = 1..10000 |Foreach-Object { [PSCustomObject]@{ Id = $_; Name = (New-Guid).Guid } }
(Measure-Command { 
    $Dict = [System.Collections.Generic.SortedDictionary[string,object]]::new()
    $Data.foreach{ $Dict[$_.Name] = $_ }
    $Null = $Dict.Values
}).TotalMilliseconds

(Measure-Command { 
    $SortedList = [System.Collections.Generic.SortedList[string,object]]::new()
    $Data.foreach{ $SortedList.Add($_.Name, $_ ) }
    $Null = $SortedList.Values
}).TotalMilliseconds

(Measure-Command { 
    $Null = $Data |Sort-Object -Property Name
}).TotalMilliseconds

(Measure-Command { 
    $Null = MergeSort $Data -By Name
}).TotalMilliseconds

Results

SortedDictionary:   110.8943
SortedList:         133.1337
Sort-Object:        141.4514
MergeSort Function: 550.4682

Divide and conquer

So, now that we know that a SortedDictionary and a SortedList perform the best, let's see if we can divide these .NET tasks over multiple parallel threads (CPUs).

Parallel Sort Prototype

For this I will need the SortedList (which is a little slower than SortedDictionary), as I will need to be able to index into the returned lists.

function ParallelSort($List, $By, $ThrottleLimit = 3) {
    $Count = $List.get_Count()
    $Size = [Math]::Ceiling($Count / $ThrottleLimit)
    # Split the input into $ThrottleLimit chunks
    $Lists =
        for ($c = 0; $c -lt $ThrottleLimit; $c++) {
            $Start = $c * $Size
            $End = [Math]::Min(($Start + $Size), $Count) - 1
            ,$List[$Start..$End]
        }
    # Sort each chunk on its own thread
    $Lists = $Lists |Foreach-Object -ThrottleLimit $ThrottleLimit -Parallel {
        $By = $Using:By
        $SortedList = [System.Collections.Generic.SortedList[string,object]]::new()
        foreach ($Item in $_) { $SortedList.Add($Item.$By, $Item) }
        ,$SortedList.Values
    }

    # Merge the sorted chunks: repeatedly emit the smallest head element
    $Indices = [UInt[]]::new($ThrottleLimit)
    Do {
        $lt = $Null
        for ($l = 0; $l -lt $ThrottleLimit; $l++) {
            if ($Indices[$l] -lt $Lists[$l].Count -and (
                $Null -eq $lt -or
                $Lists[$l][$Indices[$l]].$By -lt $Lists[$lt][$Indices[$lt]].$By)
            ) { $lt = $l }
        }
        if ($Null -ne $lt) { $Lists[$lt][$Indices[$lt]++] }
    } While ($Null -ne $lt)
}

Usage (sanity check)

$Data = 1..50 |Foreach-Object { [PSCustomObject]@{ Id = $_; Name = (New-Guid).Guid } }
ParallelSort $Data -By Name

With a large amount of data (e.g. 200,000 entries), the merged-parallel sort routine indeed appears to be faster than the single-process SortedDictionary:

Merged-parallel:  4324.5299
SortedDictionary: 5335.0125