I have a directory containing thousands of images from three different domains.
Let's say the file names are xxx_A.png, yyy_B.png and zzz_C.png, with thousands of files from each domain.
os.listdir()
returns a list of all image names inside the directory.
I then want to filter this list based on some percentages.
Example: out of these thousands of images, I want only 100 shuffled images, where 30% are from domain A, 30% from domain B and 40% from domain C.
So, given a target number and these percentages, I want to pick that many random images (the domain is identified by the file name), and this will be the new list.
Example:
Input:
['1_A.png', '2_A.png', '3_A.png', '4_A.png', '5_A.png', '6_A.png', '7_A.png', '8_A.png', '9_A.png', '10_A.png', '1_B.png', '2_B.png', '3_B.png', '4_B.png', '5_B.png', '6_B.png', '7_B.png', '8_B.png', '9_B.png', '10_B.png', '1_C.png', '2_C.png', '3_C.png', '4_C.png', '5_C.png', '6_C.png', '7_C.png', '8_C.png', '9_C.png', '10_C.png']
I want 12 images, 30% from domain A, 30% from domain B and 40% from domain C
Output:
['1_C.png', '10_C.png', '2_B.png', '4_A.png', '3_A.png', '9_C.png', '7_C.png', '6_A.png', '8_B.png', '10_B.png', '3_C.png', '5_C.png']
How can I do this?
Below is a function I defined. As Martin stated, math.ceil is probably the best function to use to get the number of files per domain, so that you don't end up with fewer than your desired amount. You will also want to sample without replacement (meaning you don't want to repeat file names), so you should not call random.choice repeatedly like Rakesh did, because each call is independent and can pick the same file more than once. Shuffling with random.shuffle and then taking the first few items avoids this problem.
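A minimal sketch of a function along these lines, not necessarily identical to the original. It assumes file names end in `_<domain>.png` and that percentages are given as integers; the name `sample_by_domain` and its parameters (`file_names`, `total`, `percentages`, `seed`) are hypothetical and only there for illustration.

```python
import math
import random


def sample_by_domain(file_names, total, percentages, seed=None):
    """Pick about `total` names, split across domains by percentage.

    `percentages` maps a domain suffix such as "A" to an integer percentage.
    Hypothetical helper: names are assumed to look like "<id>_<domain>.png".
    """
    rng = random.Random(seed)  # seed is optional, only for reproducibility
    shuffled_list = []
    for domain, percent in percentages.items():
        # Keep only the files that belong to this domain.
        candidates = [n for n in file_names if n.endswith(f"_{domain}.png")]
        # Round up so that no domain gets fewer files than its share.
        count = math.ceil(total * percent / 100)
        if count > len(candidates):
            raise ValueError(f"not enough files for domain {domain}")
        # Shuffle, then slice: this samples without replacement.
        rng.shuffle(candidates)
        shuffled_list.extend(candidates[:count])
    return shuffled_list


file_names = [f"{i}_{d}.png" for d in "ABC" for i in range(1, 11)]
picked = sample_by_domain(file_names, 12, {"A": 30, "B": 30, "C": 40}, seed=0)
print(len(picked))
```

Because each share is rounded up, the result can be slightly larger than requested: for 12 files at 30/30/40 the counts are 4, 4 and 5, so 13 names come back. The names are grouped by domain in the order the percentages were given.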
Input:
Output:
You can also call
random.shuffle(shuffled_list)
before the return statement to shuffle the output list.
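One detail worth keeping in mind: random.shuffle reorders the list in place and returns None, so its result should not be assigned or returned directly. A small illustration, with made-up sample names:

```python
import random

# Sample names for illustration only, grouped by domain.
shuffled_list = ["4_A.png", "3_A.png", "2_B.png", "8_B.png", "1_C.png", "9_C.png"]
before = sorted(shuffled_list)

result = random.shuffle(shuffled_list)  # reorders shuffled_list in place

print(result)                           # None
print(sorted(shuffled_list) == before)  # True: same items, only the order changed
```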