I need to distribute a set of repetitive strings as evenly as possible.
Is there any way to do this better then simple shuffling using unsort? It can't do what I need.
For example if the input is
aaa
aaa
aaa
bbb
bbb
The output I need
aaa
bbb
aaa
bbb
aaa
The number of repetitive strings doesn't have any limit as well as the number of the reps of any string.
The input can be changed to list string number_of_reps
aaa 3
bbb 2
... .
zzz 5
Is there an existing tool, Perl module or algorithm to do this?
Abstract: Given your description of how you determine an “even distribution”, I have written an algorithm that calculates a “weight” for each possible permutation. It is then possible to brute-force the optimal permutation.
Weighing an arrangement of items
It is trivial to count the distances between occurrences of strings. I decided to count in a way that the example combination
would give the count
I.e. Two adjacent strings have distance one, and a string at the start or the end has distance one to the edge of the string. These properties make the distances easier to calculate, but are just a constant that will be removed later.
This is the code for counting distances:
Next, we calculate the standard variance for each set of distances. The variance of one distance d describes how far they are off from the average a. As it is squared, large anomalies are heavily penalized:
We get the standard variance of a data set by summing the variance of each item, and then calculating the square root:
Expressed as Perl code:
We can now calculate how even the occurrences of one string in our permutation are, by calculating the standard variance of the distances. The smaller this value is, the more even the distribution is.
Now we have to combine these weights to a total weight of our combination. We have to consider the following properties:
The following can be swapped out by a different procedure, but I decided to weigh each standard variance by raising it to the power of occurrences, then adding all weighted svariances:
This turns out to prefer good distributions.
We can now calculate the weight of a given permutation by passing it to
weigh_distance
. Therefore, we can decide if two permutations are equally well distributed, or if one is to be prefered:Selecting optimal permutations
Given a selection of permuations, we can select those permutations that are optimal:
This will return at least one of the given possibilities. If the exact one is unimportant, an arbitrary element of the returend array can be selected.
Bug: This relies on stringification of floats, and is therefore open to all kinds of off-by-epsilon errors.
Creating all possible permutations
For a given multiset of strings, we want to find the optimal permutation. We can think of the available strings as a hash mapping the strings to the remaining avaliable occurrences. With a bit of recursion, we can build all permutations like
The
_fetch_perm_cache
returns an ref to a cached array of permutations, and a boolean flag to test for success. I used the following implementation with deeply nested hashes, that stores the permutations on leaf nodes. To mark the leaf nodes, I have used the empty string—hence the above test.That not all strings are valid input keys is no issue: every collection can be enumerated, so
make_perms
could be given integers as keys, which are translated back to whatever data they represent by the caller. Note that the caching makes this non-threadsafe (if%perm_cache
were shared).Connecting the pieces
This is now a simple matter of
This would yield
which are all optimal solutions by the used definition. Interestingly, the solution
is not included. This could be a bad edge case of the weighing procedure, which strongly favours putting occurrences of rare strings towards the center. See Futher work.
Completing the test cases
We can run these test cases by
Out of interest, we can calculate the optimal distributions for these letters:
This brings us
Further work
The permutation generation algorithm has to be improved. Memoization could lead to a speedup.Done! The permutation generation is now 50× faster for synthetic benchmarks, and can access cached input in O(n), where n is the number of different input strings.The standard variances are raised to the power of the occurrences. This is probably not ideal, as a large deviation for a large number of occurrences weighs lighter than a small deviation for few occurrences, e.g.
This should in fact be reversed.
Edit
Below is a faster procedure that approximates a good distribution. In some cases, it will yield the correct solution, but this is not generally the case. The output is bad for inputs with many different strings where most have very few occurrences, but is generally acceptable where only few strings have few occurrences. It is significantly faster than the brute-force solution.
It works by inserting strings at regular intervals, then spreading out avoidable repetitions.
Edit 2
I generalized the algorithm for any objects, not just strings. I did this by translating the input to an abstract representation like “two of the first thing, one of the second”. The big advantage here is that I only need integers and arrays to represent the permutations. Also, the cache is smaller, because
A => 4, C => 2
,C => 4, B => 2
and$regex => 2, $fh => 4
represent the same abstract multisets. The speed penalty incurred by the neccessity to transform data between the external, internal, and cache representations is roughly balanced by the reduced number of recursions.The large bottleneck is in the
select_best
sub, which I have largely rewritten in Inline::C (still eats ~80% of execution time).These issues go a bit beyond the scope of the original question, so I won't paste the code in here, but I guess I'll make the project available via github once I've ironed out the wrinkles.