How to write a prioritized left-shuffle algorithm in O(n)?


There are shuffle algorithms like Fisher-Yates. They take an array and return it with its elements in uniformly random order, and they run in O(n).

What I'm trying to do is to implement a prioritized left-shuffle algorithm. What does that mean?

  • Prioritized: It does not take an array of values. It takes an array of value-probability pairs. E.g. [ (1, 60), (2, 10), (3, 10), (4, 20) ]. Value 1 has 60%, value 2 has 10%, ...
  • left-shuffle: The higher the probability of a value, the higher its chances to be far on the left of the array.

Let's take this example [ (1, 10), (2, 10), (3, 60), (4, 20) ]. The most probable result should be [ 3, 4, 1, 2 ] or [ 3, 4, 2, 1 ].
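To sanity-check that claim: the probability of a full output order is the product of conditional draws without replacement. A small Python check (the helper name is mine, not part of the question):

```python
# Probability of drawing a specific order from value-weight pairs
# without replacement: each factor is weight / (sum of remaining weights).
weights = {1: 10, 2: 10, 3: 60, 4: 20}

def order_probability(order, weights):
    remaining = sum(weights.values())
    p = 1.0
    for v in order:
        p *= weights[v] / remaining
        remaining -= weights[v]
    return p

print(order_probability([3, 4, 2, 1], weights))  # 0.15
print(order_probability([3, 4, 1, 2], weights))  # 0.15
print(order_probability([1, 2, 3, 4], weights))  # much smaller
```

Both [ 3, 4, 2, 1 ] and [ 3, 4, 1, 2 ] come out at 0.6 * 0.5 * 0.5 = 0.15, which is why they tie as the most probable results.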


I tried implementing this, but I haven't found any solution in O(n).

O(n^2) pseudocode based on Fisher-Yates:

sum = 100  # total weight remaining (100%)
for i = 0 to n-2:
    r = random integer between 1 and sum      # pick a point on the weight line
    localsum = 0
    for j = i to n-1:
        localsum = localsum + pair[j].Probability
        if localsum >= r:                     # element j owns that point
            swap(i, j)
            break
    sum = sum - pair[i].Probability           # the chosen weight leaves the pool

One thing that could improve this a bit: sorting the elements by decreasing probability at the start would minimize the expected number of swaps and the iterations in the inner loop, though it does not change the worst-case complexity.

Is there a better solution (maybe even in O(n))?


There are 4 answers below.

On BEST ANSWER

Update of my first answer:

I've found a paper in which 'Roulette-wheel selection via stochastic acceptance' is introduced; it selects an element in expected O(1) time. This makes the whole algorithm O(n) and it is simple to implement:

from random import randint
from random import random
import time

data = [ (1, 10), (2, 10), (3, 60), (4, 20) ]

def swap(i, j, array):
    array[j], array[i] = array[i], array[j]

def roulette_wheel_selection(data, start, max_weight_limit):
    # Rejection sampling: propose a uniform index >= start and accept it
    # with probability weight / max_weight_limit.
    while True:
        r = random()
        r_index = randint(start, len(data) - 1)
        if r <= data[r_index][1] / max_weight_limit:
            return r_index

def shuffle(data, max_weight):
    data = data.copy()
    n = len(data)
    for i in range(n - 1):
        # Fill position i with a weight-proportional pick from the suffix.
        r_index = roulette_wheel_selection(data, i, max_weight)
        swap(i, r_index, data)
    return data

def performance_test(iterations, data):
    start = time.time()
    max_weight = max([item[1] for item in data])
    for i in range(iterations):
        shuffle(data, max_weight)
    end = time.time()
    print(len(data), ': ',end - start)
    return end - start

performance_test(1000, data)

data2 = []
for i in range(10):
    data2 += data
performance_test(1000, data2)  

data3 = []
for i in range(100):
    data3 += data
performance_test(1000, data3) 

data4 = []
for i in range(1000):
    data4 += data
performance_test(1000, data4) 

Performance Output

4 :  0.09153580665588379
40 :  0.6010794639587402
400 :  5.142168045043945
4000 :  50.09365963935852

So it's linear time in n (the data size). Compared to my first answer, I changed the constant from the updated remaining sum to the maximum weight of all data items. The running time still depends on that max_weight constant, though: if someone has a strategy to update max_weight properly as items are removed, performance would improve further.

On

I've found a paper in which 'Roulette-wheel selection via stochastic acceptance' is introduced; it selects an element in expected O(1) time. This makes the whole algorithm O(n) and it is simple to implement:

from random import randint
from random import random

data = [ (1, 10), (2, 10), (3, 60), (4, 20) ]

def swap(i, j, array):
    array[j], array[i] = array[i], array[j]

def roulette_wheel_selection(data, start, total):
    # Rejection sampling against the remaining total weight.
    while True:
        r = random()
        r_index = randint(start, len(data) - 1)
        if r <= data[r_index][1] / total:
            return r_index

def shuffle(data):
    data = data.copy()
    n = len(data)
    total = 100.0  # remaining weight; all weights sum to 100
    for i in range(n - 1):
        r_index = roulette_wheel_selection(data, i, total)
        swap(i, r_index, data)
        total = total - data[i][1]  # the chosen weight leaves the pool
    return data

for i in range(10):
    print(shuffle(data))

Output

[(3, 60), (4, 20), (2, 10), (1, 10)]
[(3, 60), (1, 10), (4, 20), (2, 10)]
[(3, 60), (1, 10), (4, 20), (2, 10)]
[(3, 60), (4, 20), (1, 10), (2, 10)]
[(3, 60), (4, 20), (2, 10), (1, 10)]
[(3, 60), (4, 20), (2, 10), (1, 10)]
[(3, 60), (4, 20), (2, 10), (1, 10)]
[(4, 20), (3, 60), (1, 10), (2, 10)]
[(3, 60), (2, 10), (4, 20), (1, 10)]
[(4, 20), (3, 60), (2, 10), (1, 10)]

Notice: for the best performance, roulette_wheel_selection should use the maximum remaining weight (p_max), updated on every iteration, instead of the remaining sum. I use the sum here because it is easy to compute and to keep updated.

On

There’s a way to do this in time O(n log n) using augmented binary search trees. The idea is the following: insert the items you want to shuffle into a binary search tree, each annotated with its weight. Then, for each node in the BST, calculate the total weight of all the nodes in the subtree rooted at that node. For example, the root's subtree weight will be 1 (the sum of all the weights, which is 1 because it’s a probability distribution), the root's left child will be annotated with the total weight of the left subtree, and the root's right child with the total weight of the right subtree.

With this structure in place, you can in time O(log n) select a random element from the tree, distributed according to your weights. The algorithm works like this. Pick a random number x, uniformly, in the range from 0 to the total weight left in the tree (initially 1, but as items are picked this will decrease). Then, start at the tree root. Let L be the weight of the tree’s left subtree and w be the weight of the root. Recursively use this procedure to select a node:

  • If x < L, move left and recursively select a node from there.
  • If L ≤ x < L + w, return the root.
  • If L + w ≤ x, set x := x - L - w and recursively select a node from the right subtree.

This technique is sometimes called roulette wheel selection, in case you want to learn more about it.

Once you’ve selected an item from the BST, you can then delete that item from the BST to ensure you don’t pick it again. There are techniques that ensure that, after removing the node from the tree, you can fix up the weight sums of the remaining nodes in the tree in time O(log n) so that they correctly reflect the weights of the remaining items. Do a search for augmented binary search tree for details about how to do this. Overall, this means that you’ll spend O(log n) work sampling and removing a single item, which summed across all n items gives an O(n log n)-time algorithm for generating your shuffle.
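As one concrete realization of the same idea, the augmented tree can be replaced by a Fenwick (binary indexed) tree over the weights, which also gives O(log n) weighted selection and O(log n) weight updates. A sketch, with all names my own rather than from this answer:

```python
import random

class FenwickTree:
    """Prefix sums over weights with point updates, both O(log n)."""
    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)
        for i, w in enumerate(weights):
            self.add(i, w)

    def add(self, i, delta):
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def find(self, x):
        """Index of the item whose weight interval contains x."""
        i = 0
        bit = 1 << self.n.bit_length()
        while bit:
            nxt = i + bit
            if nxt <= self.n and self.tree[nxt] <= x:
                x -= self.tree[nxt]
                i = nxt
            bit >>= 1
        return min(i, self.n - 1)  # clamp guards the x == total edge case

def weighted_shuffle(pairs):
    """O(n log n) prioritized left-shuffle: weighted sampling w/o replacement."""
    values = [v for v, _ in pairs]
    weights = [float(w) for _, w in pairs]
    tree = FenwickTree(weights)
    total = sum(weights)
    out = []
    for _ in range(len(pairs)):
        i = tree.find(random.uniform(0, total))
        out.append(values[i])
        tree.add(i, -weights[i])  # remove item i from the distribution
        total -= weights[i]
    return out

print(weighted_shuffle([(1, 10), (2, 10), (3, 60), (4, 20)]))
```

The Fenwick tree is flatter than an explicit BST but plays the same role: `find` walks down the implicit tree exactly like the three-case descent described above, and the negative `add` is the O(log n) weight fix-up after removal.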

I’m not sure whether it’s possible to improve upon this. There is another algorithm for sampling from a discrete distribution called Vose’s alias method which gives O(1)-time queries, but it doesn’t nicely handle changes to the underlying distribution, which is something you need for your use case.

On

The 'Roulette-wheel selection via stochastic acceptance' answer of @StefanFenn technically answers my question.

But it has a disadvantage:

The maximum in the algorithm is only calculated once; recalculating it on every pick would make the performance worse than O(n). With priorities like [100000000, 1, 2, 3], the algorithm needs roughly one iteration of the while loop in roulette_wheel_selection while the 100000000 item is still available, but on the order of millions of iterations per pick as soon as that item has been removed.

So I want to show you a very short O(n log n) solution I've found whose running time does not depend on how large the priorities themselves are (C# code):

var n = elements.Count;
var result = Enumerable.Range(0, n)
                       .OrderByDescending(k => Math.Pow(_rng.NextDouble(), 1.0 / elements[k].Priority))
                       .Select(i => elements[i].Value)
                       .ToList();

Description: From the collection of n prioritized elements, we create a sequence of indices 0, 1, ..., n-1. For each of them, we call Math.Pow to compute a sort key and order the indices descending by that key (descending because we want the values with higher priorities on the left, not the right). Now we have the indices 0, 1, ..., n-1 in a prioritized/weighted random order. In the last step, we select the values according to the order of these indices.
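For comparison with the Python answers above, the same sort-by-key trick (weighted random sampling as described by Efraimidis and Spirakis) can be sketched in Python; the function name is mine, and weights are assumed to be positive:

```python
import random

def weighted_shuffle_by_keys(pairs):
    """O(n log n) prioritized left-shuffle via the sort-by-key trick."""
    # Each (value, weight) pair gets the key random() ** (1 / weight).
    # Sorting by these keys descending yields an order distributed like
    # sequential weighted sampling without replacement. Weights must be > 0.
    return [v for v, w in sorted(pairs,
                                 key=lambda p: random.random() ** (1.0 / p[1]),
                                 reverse=True)]

print(weighted_shuffle_by_keys([(1, 10), (2, 10), (3, 60), (4, 20)]))
```

A nice property of this variant, matching the point above: the number of key computations is exactly n regardless of how skewed the weights are, so a single huge priority cannot blow up the running time.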