Sorting an array of strings by arbitrary numbers of substrings

265 Views Asked by At

I'm attempting to modify the OrderedImportsFixer class in php-cs-fixer so I can clean up my files the way I want. What I want is to order my imports in a fashion similar to what you'd see in a filesystem listing, with "directories" listed before "files".

So, given this array:

$indexes = [
    26 => ["namespace" => "X\\Y\\Zed"],
    9 =>  ["namespace" => "A\\B\\See"],
    3 =>  ["namespace" => "A\\B\\Bee"],
    38 => ["namespace" => "A\\B\\C\\Dee"],
    51 => ["namespace" => "X\\Wye"],
    16 => ["namespace" => "A\\Sea"],
    12 => ["namespace" => "A\\Bees"],
    31 => ["namespace" => "M"],
];

I'd like this output:

$sorted = [
    38 => ["namespace" => "A\\B\\C\\Dee"],
    3 =>  ["namespace" => "A\\B\\Bee"],
    9 =>  ["namespace" => "A\\B\\See"],
    12 => ["namespace" => "A\\Bees"],
    16 => ["namespace" => "A\\Sea"],
    26 => ["namespace" => "X\\Y\\Zed"],
    51 => ["namespace" => "X\\Wye"],
    31 => ["namespace" => "M"],
];

As in a typical filesystem listing:

screenshot of a directory listing

I've been going at uasort for a while (key association must be maintained) and have come close. Admittedly, this is due more to desperate flailing than any sort of rigorous methodology. Not really having a sense of how uasort works is kind of limiting me here.

// get the maximum number of namespace components in the list
$ns_counts = array_map(function($val){
    return count(explode("\\", $val["namespace"]));
}, $indexes);
$limit = max($ns_counts);

for ($depth = 0; $depth <= $limit; $depth++) {
    uasort($indexes, function($first, $second) use ($depth, $limit) {
        $fexp = explode("\\", $first["namespace"]);
        $sexp = explode("\\", $second["namespace"]);
        if ($depth === $limit) {
            // why does this help?
            array_pop($fexp);
            array_pop($sexp);
        }
        $fexp = array_slice($fexp, 0, $depth + 1, true);
        $sexp = array_slice($sexp, 0, $depth + 1, true);
        $fimp = implode(" ", $fexp);
        $simp = implode(" ", $sexp);
        //echo "$depth: $fimp <-> $simp\n";
        return strnatcmp($fimp, $simp);
    });
}
echo json_encode($indexes, JSON_PRETTY_PRINT);

This gives me properly sorted output, but with deeper namespaces on the bottom instead of the top:

{
    "31": {
        "namespace": "M"
    },
    "12": {
        "namespace": "A\\Bees"
    },
    "16": {
        "namespace": "A\\Sea"
    },
    "3": {
        "namespace": "A\\B\\Bee"
    },
    "9": {
        "namespace": "A\\B\\See"
    },
    "38": {
        "namespace": "A\\B\\C\\Dee"
    },
    "51": {
        "namespace": "X\\Wye"
    },
    "26": {
        "namespace": "X\\Y\\Zed"
    }
}

I'm thinking I may have to build a separate array for each level of namespace and sort it separately, but have drawn a blank on how I might do that. Any suggestions for getting the last step of this working, or something completely different that doesn't involve so many loops?

4

There are 4 best solutions below

3
On BEST ANSWER

I believe the following should work:

uasort($indexes, static function (array $entry1, array $entry2): int {  
    $ns1Parts = explode('\\', $entry1['namespace']);
    $ns2Parts = explode('\\', $entry2['namespace']);

    $ns1Length = count($ns1Parts);
    $ns2Length = count($ns2Parts);

    for ($i = 0; $i < $ns1Length && isset($ns2Parts[$i]); $i++) {
        $isLastPartForNs1 = $i === $ns1Length - 1;
        $isLastPartForNs2 = $i === $ns2Length - 1;

        if ($isLastPartForNs1 !== $isLastPartForNs2) {
            return $isLastPartForNs1 <=> $isLastPartForNs2;
        }

        $nsComparison = $ns1Parts[$i] <=> $ns2Parts[$i];

        if ($nsComparison !== 0) {
            return $nsComparison;
        }
    }

    return 0;
});

What it does is:

  • split namespaces into parts,
  • compare each part starting from the first one, and:
    • if we're at the last part for one and not the other, prioritize the one with the most parts,
    • otherwise, if the respective parts are different, prioritize the one that is before the other one alphabetically.

Demo

12
On

We divide this into 4 steps.

Step 1: Create hierarchical structure from the dataset.

function createHierarchicalStructure($indexes){
    $data = [];
    foreach($indexes as $d){
        $temp = &$data;
        foreach(explode("\\",$d['namespace']) as $namespace){
            if(!isset($temp[$namespace])){
                $temp[$namespace] = [];
            }
            $temp = &$temp[$namespace];
        }
    }
    
    return $data;
}

Split the namespaces by \\ and maintain a $data variable. Use & address reference to keep editing the same copy of the array.

Step 2: Sort the hierarchy in first folders then files fashion.

function fileSystemSorting(&$indexes){
    foreach($indexes as $key => &$value){
        fileSystemSorting($value);
    }
    
    uksort($indexes,function($key1,$key2) use ($indexes){
        if(count($indexes[$key1]) == 0 && count($indexes[$key2]) > 0) return 1;
        if(count($indexes[$key2]) == 0 && count($indexes[$key1]) > 0) return -1;
        return strnatcmp($key1,$key2);
    });
}

Sort the subordinate folders and use uksort for the current level of folders. Vice-versa would also work. If both 2 folders in comparison have subfolders, compare them as strings, else if one is a folder and another is a file, make folders come above.

Step 3: Flatten the hierarchical structure now that they are in order.

function flattenFileSystemResults($hierarchical_data){
    $result = [];
    foreach($hierarchical_data as $key => $value){
        if(count($value) > 0){
            $sub_result = flattenFileSystemResults($value);
            foreach($sub_result as $r){
                $result[] = $key . "\\" . $r;
            }   
        }else{
            $result[] = $key;
        }
    }
    
    return $result;
}

Step 4: Restore the initial data keys back and return the result.

function associateKeys($data,$indexes){
    $map = array_combine(array_column($indexes,'namespace'),array_keys($indexes));
    $result = [];
    foreach($data as $val){
        $result[ $map[$val] ] = ['namespace' => $val];
    }
    return $result;
}

Driver code:

function foldersBeforeFiles($indexes){
   $hierarchical_data = createHierarchicalStructure($indexes);
   fileSystemSorting($hierarchical_data);
   return associateKeys(flattenFileSystemResults($hierarchical_data),$indexes);
}

print_r(foldersBeforeFiles($indexes));

Demo: https://3v4l.org/cvoB2

4
On

Here's another version that breaks the steps down further that, although it might not be the most optimal, definitely helps my brain think about it. See the comments for more details on what is going on:

uasort(
    $indexes,
    static function (array $a, array $b) {

        $aPath = $a['namespace'];
        $bPath = $b['namespace'];

        // Just in case there are duplicates
        if ($aPath === $bPath) {
            return 0;
        }

        // Break into parts
        $aParts = explode('\\', $aPath);
        $bParts = explode('\\', $bPath);

        // If we only have a single thing then it is a root-level, just compare the item
        if (1 === count($aParts) && 1 === count($bParts)) {
            return $aPath <=> $bPath;
        }

        // Get the class and namespace (file and folder) parts
        $aClass = array_pop($aParts);
        $bClass = array_pop($bParts);

        $aNamespace = implode('\\', $aParts);
        $bNamespace = implode('\\', $bParts);

        // If the namespaces are the same, sort by class name
        if ($aNamespace === $bNamespace) {
            return $aClass <=> $bClass;
        }

        // If the first namespace _starts_ with the second namespace, sort it first
        if (0 === mb_strpos($aNamespace, $bNamespace)) {
            return -1;
        }

        // Same as above but the other way
        if (0 === mb_strpos($bNamespace, $aNamespace)) {
            return 1;
        }

        // Just only by namespace
        return $aNamespace <=> $bNamespace;
    }
);

Online demo

0
On

I find no fault with Jeto's algorithmic design, but I decided to implement it more concisely. My snippet avoids iterated function calls and arithmetic in the for() loop, uses a single spaceship operator, and a single return. My snippet is greater than 50% shorter and I generally find it easier to read, but then everybody thinks their own baby is cute, right?

Code: (Demo)

uasort($indexes, function($a, $b) {  
    $aParts = explode('\\', $a['namespace']);
    $bParts = explode('\\', $b['namespace']);
    $aLast = count($aParts) - 1;
    $bLast = count($bParts) - 1;

    for ($cmp = 0, $i = 0; $i <= $aLast && !$cmp; ++$i) {
        $cmp = [$i === $aLast, $aParts[$i]] <=> [$i === $bLast, $bParts[$i]];
    }
    return $cmp;
});

Like Jeto's answer, it iterates each array simultaneously and

  • checks if either element is the last element of the array, if so it is bumped down the list (because we want longer paths to win tiebreakers);
  • if neither element is the last in its array, then compare the elements' current string alphabetically.
  • The process repeats until a non-zero evaluation is generated.
  • Since duplicate entries are not expected to occur, the return value should always be -1 or 1 (never 0)

Note, I am halting the for() loop with two conditions.

  1. If $i is greater than the number of elements in $aParts (only) -- because if $bParts has fewer elements than $aParts, then $cmp will generate a non-zero value before a Notice is triggered.
  2. If $cmp is a non-zero value.

Finally, to explain the array syntax on either side of the spaceship operator...
The spaceship operator will compare the arrays from left to right, so it will behave like:

leftside[0] <=> rightside[0] then leftside[1] <=> rightside[1]

Making multiple comparisons in this way does not impact performance because there are no function calls on either side of the <=>. If there were function calls involved, it would be more performant to make individual comparisons in a fallback manner like:

fun($a1) <=> fun($b1) ?: fun($a2) <=> fun($b2)

This way subsequent function calls are only actually made if a tiebreak is necessary.