How to divide a series of words into "N" chunks?


First of all, forgive me for any ambiguity; I find my problem a bit hard to explain in English. Basically, what I want to do is divide a huge set of words into "N" parts.

For example, read all the words in a file, then divide them among, let's say, N = 10 parts. To be more precise, I'm working on a data mining project. There are thousands of documents whose words I need to sort.

Say N = 2. I know I can put a-m in one file and n-z in another. I need an algorithm that can do this for N > 100.

PS: My program FIRST has to create the N files (or chunks), then read all the words and, depending on how they begin, assign them to one of the chunks.

EXAMPLE: input: N = 2, words = [....]

output: [words starting with a-m], [words starting with n-z]

In other words, I want to divide my words lexicographically.


Best answer:

This is a rough idea of what you want:

l = "i find my problem a bit hard to explain in English".split()
n = 2
ln = len(l)
chnk = ln // n  # words per chunk; floor division, since plain / returns a float in Python 3
srt = sorted(l, key=str.lower) # use str.lower as the key or uppercase will come before lower
# lazily slice the sorted list into pieces of chnk words each
chunks = (srt[i:chnk+i] for i in range(0, len(srt), chnk))

In [4]: l = "i find my problem a bit hard to explain in English".split()
In [5]: n = 2
In [6]: ln = len(l)
In [7]: chnk = ln // n
In [8]: srt = sorted(l, key=str.lower)
In [9]: chunks = (srt[i:chnk+i] for i in range(0, len(srt), chnk))
In [10]:
In [10]: for chunk in chunks:
   ....:         print(chunk)
   ....:
['a', 'bit', 'English', 'explain', 'find']
['hard', 'i', 'in', 'my', 'problem']
['to']

Obviously you will have to handle the case where n does not divide the number of words evenly: as the output above shows, 11 words with n = 2 gives three chunks instead of two.
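One possible way to handle the uneven case is to spread the leftover words over the first few chunks so that exactly n chunks always come back. This is only a minimal sketch, not part of the original answer; the function name split_into_n is hypothetical, and it assumes all the words fit in memory.

```python
def split_into_n(words, n):
    """Sort words case-insensitively and split them into exactly n chunks.

    Hypothetical helper: chunk sizes differ by at most one word, and some
    chunks are empty when there are fewer words than chunks.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    ordered = sorted(words, key=str.lower)
    # base = minimum words per chunk, extra = chunks that get one more word
    base, extra = divmod(len(ordered), n)
    chunks = []
    start = 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        chunks.append(ordered[start:start + size])
        start += size
    return chunks


words = "i find my problem a bit hard to explain in English".split()
chunks = split_into_n(words, 2)
for chunk in chunks:
    print(chunk)
```

Note that this splits by word count, not by fixed letter ranges such as a-m and n-z, so the boundary letters depend on the data.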

Another answer:

You can use itertools.

from itertools import islice

# islice('ABCDEFG', 2) --> A B
# islice('ABCDEFG', 2, 4) --> C D
# islice('ABCDEFG', 2, None) --> C D E F G
# islice('ABCDEFG', 0, None, 2) --> A C E G

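your_dict = [1, 2, 3, 4, 5]  # despite the name, any iterable works (this one is a list)
# islice returns a lazy iterator; wrap it in list() to see the items
first_chunk = islice(your_dict, 2)  #--> 1 2
second_chunk = islice(your_dict, 2, None)  #--> 3 4 5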

After that, you can play with the 2nd and 3rd arguments of islice and wrap them in a function.
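Wrapping islice in a function could look roughly like the sketch below. The name chunked and its size parameter are assumptions made for illustration; the idea is to keep pulling size items from a single shared iterator until it runs dry.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    iterator = iter(iterable)  # one shared iterator, so each islice call resumes where the last stopped
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


result = list(chunked([1, 2, 3, 4, 5], 2))
print(result)  # [[1, 2], [3, 4], [5]]
```

To get sorted chunks of words, pass in sorted(words, key=str.lower) as the iterable.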