os.listdir() - choose randomly from the returned list based on a condition


I have a directory containing thousands of images from three different domains.

Let's say the file names look like xxx_A.png, yyy_B.png and zzz_C.png, with thousands of files from each domain.

os.listdir() returns a list of all image names in the directory.

I then want to filter this list based on some percentages.

Example: out of these thousands of images, I want only 100 shuffled images, where 30% are from domain A, 30% from domain B and 40% from domain C.

So, given a total number and these percentages, I want to pick that many random images (the domain is identified by the file name), and this will be the new list.

Example:

Input:

['1_A.png', '2_A.png', '3_A.png', '4_A.png', '5_A.png', '6_A.png', '7_A.png', '8_A.png', '9_A.png', '10_A.png', '1_B.png', '2_B.png', '3_B.png', '4_B.png', '5_B.png', '6_B.png', '7_B.png', '8_B.png', '9_B.png', '10_B.png', '1_C.png', '2_C.png', '3_C.png', '4_C.png', '5_C.png', '6_C.png', '7_C.png', '8_C.png', '9_C.png', '10_C.png']

I want 12 images, 30% from domain A, 30% from domain B and 40% from domain C

Output:

 ['1_C.png', '10_C.png', '2_B.png', '4_A.png', '3_A.png', '9_C.png', '7_C.png', '6_A.png', '8_B.png', '10_B.png', '3_C.png', '5_C.png']

How can I do this?


There are 3 best solutions below

0
On BEST ANSWER

Below is a function I defined. As Martin stated, math.ceil is probably the best function to use to get the number of files per domain (so you don't get fewer than your desired amount). Also, you will want to sample without replacement (meaning you don't want to repeat file names), so you should not call random.choice repeatedly like Rakesh did, since each call picks independently and the same file can come up more than once. Shuffling each sub-list with random.shuffle and taking a slice avoids this problem.

Input:

import random
import math
os_dir_list = ['1_A.png', '2_A.png', '3_A.png', '4_A.png', '5_A.png', '6_A.png', '7_A.png', '8_A.png', '9_A.png', '10_A.png', '1_B.png', '2_B.png', '3_B.png', '4_B.png', '5_B.png', '6_B.png', '7_B.png', '8_B.png', '9_B.png', '10_B.png', '1_C.png', '2_C.png', '3_C.png', '4_C.png', '5_C.png', '6_C.png', '7_C.png', '8_C.png', '9_C.png', '10_C.png']

def shuffle_pick(os_dir_list, length, tuple_list):
    shuffled_list = []
    for letter, percent in tuple_list:
        # Files belonging to this domain, in random order.
        sub_list = [img for img in os_dir_list if img.endswith(letter + '.png')]
        random.shuffle(sub_list)
        # Number to take: this domain's share of the requested total, rounded up.
        num = int(math.ceil(length * percent / 100.0))
        shuffled_list += sub_list[:num]
    # Rounding up can overshoot, so trim to the requested length.
    return shuffled_list[:length]

print(shuffle_pick(os_dir_list, 12, [('A', 30), ('B', 30), ('C', 40)]))

Output:

['1_A.png', '5_A.png', '3_A.png', '6_A.png', '1_B.png', '7_B.png', '9_B.png', '5_B.png', '10_C.png', '4_C.png', '3_C.png', '9_C.png']

You can also call random.shuffle(shuffled_list) before the return statement to shuffle the output list.
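To try the same idea against a real directory instead of a hard-coded list, something along these lines might work. This is only a sketch: the function name pick_from_directory, the dictionary of shares and the throwaway demo directory are all made up for illustration, and it uses random.sample (which also draws without replacement) rather than shuffle-and-slice. The demo asks for 10 files so that the percentages divide evenly; with other totals the rounding can leave the result a file or two off.

```python
import os
import random
import tempfile


def pick_from_directory(path, total, shares):
    """Pick roughly `total` file names from `path`, split by domain suffix.

    `shares` maps a domain letter to its percentage, e.g. {'A': 30, ...}.
    (Hypothetical helper, not part of the answer above.)
    """
    names = os.listdir(path)
    picked = []
    for letter, percent in shares.items():
        candidates = [n for n in names if n.endswith('_' + letter + '.png')]
        # Round to the nearest whole file; never ask for more than exist.
        wanted = min(len(candidates), round(total * percent / 100))
        # random.sample draws without replacement, so no name repeats.
        picked.extend(random.sample(candidates, wanted))
    random.shuffle(picked)
    return picked


# Demo on a throwaway directory holding 10 empty files per domain.
with tempfile.TemporaryDirectory() as tmp:
    for letter in 'ABC':
        for i in range(1, 11):
            open(os.path.join(tmp, '%d_%s.png' % (i, letter)), 'w').close()
    demo = pick_from_directory(tmp, 10, {'A': 30, 'B': 30, 'C': 40})
    print(len(demo), demo)
```

The min() guard only matters when a domain has fewer files than its share asks for; random.sample would otherwise raise a ValueError.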

2
On

Here is one possible approach:

  1. First split all of the filenames into domains based on the letter using a defaultdict(list). e.g. a dictionary looking like:

    {'A' : ['file1_A.jpg', 'file2_A.jpg'], 'B' : ['file1_B.jpg']}
    
  2. For each domain, use random.sample() to randomly take the required number of files from the domain into an output list. math.ceil() is used to ensure enough files are always present by always rounding upwards.

  3. Finally, shuffle the combined output list (if required) and ensure that the correct overall number of files are present.

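This gives an output that follows the requested distribution as closely as the rounding allows. Because each count is rounded up, the combined list can be slightly too long, and the final truncation then drops files from randomly chosen domains.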

from collections import defaultdict
import random
import math

domains = defaultdict(list)

files = ['1_A.png', '2_A.png', '3_A.png', '4_A.png', '5_A.png', '6_A.png', '7_A.png', '8_A.png', '9_A.png', '10_A.png', '1_B.png', '2_B.png', '3_B.png', '4_B.png', '5_B.png', '6_B.png', '7_B.png', '8_B.png', '9_B.png', '10_B.png', '1_C.png', '2_C.png', '3_C.png', '4_C.png', '5_C.png', '6_C.png', '7_C.png', '8_C.png', '9_C.png', '10_C.png']

for file in files:
    domains[file[-5]].append(file)

total_required = 12
output = []    

for key, percentage in (('A', 30.0), ('B', 30.0), ('C', 40.0)):
    len_required = int(math.ceil(percentage * total_required / 100.0))
    output.extend(random.sample(domains[key], len_required))

random.shuffle(output)
output = output[:total_required]

print(len(output), output)

Giving a possible output of:

12 ['6_B.png', '2_B.png', '3_B.png', '10_A.png', '1_A.png', '6_A.png', '2_C.png', '1_B.png', '1_C.png', '3_C.png', '2_A.png', '10_C.png']    

Tested on Python 3.6.6
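With 12 files and 30/30/40, rounding each domain up gives 4 + 4 + 5 = 13, so one randomly chosen file is dropped by the final slice. If the per-domain counts need to add up to the total exactly, one option (not what the code above does) is a largest-remainder split: round every count down, then hand the leftover slots to the domains with the biggest fractional parts. A rough sketch, where exact_counts is a hypothetical helper and the percentages are assumed to sum to 100:

```python
import math
import random


def exact_counts(total, shares):
    """Split `total` into whole-number counts that sum to `total` exactly.

    `shares` maps a domain key to a percentage; the percentages are assumed
    to add up to 100. (Hypothetical helper for illustration.)
    """
    raw = {k: total * p / 100.0 for k, p in shares.items()}
    counts = {k: int(math.floor(v)) for k, v in raw.items()}
    leftover = total - sum(counts.values())
    # Hand the leftover slots to the domains with the largest fractional part.
    by_fraction = sorted(raw, key=lambda k: raw[k] - counts[k], reverse=True)
    for k in by_fraction[:leftover]:
        counts[k] += 1
    return counts


counts = exact_counts(12, {'A': 30, 'B': 30, 'C': 40})
print(counts)

# Use the counts to sample from the same 30 example file names.
files = ['%d_%s.png' % (i, d) for d in 'ABC' for i in range(1, 11)]
output = []
for key, n in counts.items():
    pool = [f for f in files if f[-5] == key]
    output.extend(random.sample(pool, n))
random.shuffle(output)
print(len(output), output)
```

Ties between equal fractional parts (A and B here) are broken by dictionary order, which is an arbitrary choice; a random tie-break would work just as well.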

0
On

This is one approach. I use a dictionary to group the images by domain, then calculate the number of images required from each domain.

Demo:

import random    

inputData = ['1_A.png', '2_A.png', '3_A.png', '4_A.png', '5_A.png', '6_A.png', '7_A.png', '8_A.png', '9_A.png', '10_A.png', '1_B.png', '2_B.png', '3_B.png', '4_B.png', '5_B.png', '6_B.png', '7_B.png', '8_B.png', '9_B.png', '10_B.png', '1_C.png', '2_C.png', '3_C.png', '4_C.png', '5_C.png', '6_C.png', '7_C.png', '8_C.png', '9_C.png', '10_C.png']

d = {"A": [], "B":[], "C":[]}
#for i in os.listdir("path"):
for i in inputData:           #Group images by domain. 
    if "A" in i:
        d["A"].append(i)
    elif "B" in i:
        d["B"].append(i)
    else:
        d["C"].append(i)

percentage = {"A": 30, "B": 30, "C": 60} 

res = []
for k, v in d.items():
    res.extend([random.choice(v) for i in range(int((percentage[k] * len(v)) / 100.0))])
print(res) 

Output:

['7_A.png', '8_A.png', '9_A.png', '6_C.png', '8_C.png', '9_C.png', '7_C.png', '9_C.png', '7_C.png', '1_B.png', '6_B.png', '2_B.png']
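Note that random.choice is called once per pick, so the same file can be chosen more than once (9_C.png and 7_C.png both appear twice in the output above). If duplicates are not wanted, a variant of the same loop using random.sample might look like the sketch below. The variable names are changed slightly and the input list is generated rather than typed out, but the percentages are still taken relative to each domain's own file count, as in the demo above.

```python
import random

# Same 30 example file names, generated instead of written out.
input_data = ['%d_%s.png' % (i, dom) for dom in 'ABC' for i in range(1, 11)]

# Group images by the domain letter that precedes '.png'.
groups = {'A': [], 'B': [], 'C': []}
for name in input_data:
    groups[name[-5]].append(name)

percentage = {'A': 30, 'B': 30, 'C': 60}  # share of each domain's own files

res = []
for key, names in groups.items():
    count = int(percentage[key] * len(names) / 100.0)
    # random.sample never returns the same element twice.
    res.extend(random.sample(names, count))
print(res)
```

Indexing with name[-5] assumes every file name ends in a single domain letter followed by '.png'.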