Python: iterating in multiple levels

-------------2000--------------
1        17824
2        20131125192004.9
3        690714s1969    dcu           000 0 eng
4    a       75601809 
4    a    DLC
4    b    eng
4    c    DLC
5    a    WA 750
-------------2001--------------
1        3224
2        20w125192004.9
3        690714s1969    dcu           000 0 eng
5    a    WA 120
-------------2002--------------
2        2013341524626245.9
3        484914s1969    dcu           000 0 eng
4    a       75601809 
4    a    eng
4    c    DLC
5    a    WA 345

I want to iterate through both the years and the fields under each year (e.g. 1, 2, 3, 4, and 5). The letters (a, b, and so on) that follow some field numbers are subfields.

The lines with dashes in my data indicate the year of the entry. Each record group starts at a ---year--- line and ends at the line before the next ---year--- line.

Also, fields is a list: fields = ["1", "2", "3", "4", "5"].

I'm eventually trying to retrieve the values next to the fields for each entry/year. For example, if my current field is 1, which is equivalent to fields[0], I would iterate through all the years (2000, 2001, and 2002) to get the values for the field 1. The output would be

17824
3224
(Blank space for Year 2002)  

How can I iterate through the years (indicated by the dashed lines)? I can't work out code that generates the desired output.


This is a fairly involved answer built around a helper function, but I think you'll find it flexible. It uses an iterutil-style helper function called groupby that I originally wrote myself. The groupby function accepts a key function that specifies which group each item belongs to. In your case the key function is a little fancy because it has to maintain state to know which year each line belongs to. The code below is runnable as-is: copy and paste it into a script and let me know what you think.

EDIT

It turns out the groupby function is already implemented in the itertools module, and I had been overlooking it all along. I edited the code to use the itertools version.
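One detail worth knowing before reading the script: itertools.groupby only groups consecutive items that share a key, which is why a key function that increments at every year header splits the lines into one group per year. A tiny sketch with made-up values (the list `values` is purely illustrative):

```python
import itertools

# Made-up input: note that "a" appears in two separate runs.
values = ["a", "a", "b", "a"]

# groupby starts a new group every time the key changes,
# so the two runs of "a" are NOT merged.
runs = [(key, len(list(group))) for key, group in itertools.groupby(values)]
print(runs)
# [('a', 2), ('b', 1), ('a', 1)]
```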

#!/usr/bin/env python

import io
import re
import itertools as it

data = '''-------------2000--------------
1        17824
2        20131125192004.9
3        690714s1969    dcu           000 0 eng
4    a       75601809 
4    a    DLC
4    b    eng
4    c    DLC
5    a    WA 750
-------------2001--------------
1        3224
2        20w125192004.9
3        690714s1969    dcu           000 0 eng
5    a    WA 120
-------------2002--------------
2        2013341524626245.9
3        484914s1969    dcu           000 0 eng
4    a       75601809 
4    a    eng
4    c    DLC
5    a    WA 345'''

def group_year():
    ''' 
    A stateful closure to group the year blobs together
    ''' 
    # Hack to update a variable from the closure
    g = [0]
    def closure(e):
        if re.findall(r'-----[0-9]{4}------', e): 
            g[0] += 1
        return g[0]
    return closure

if __name__ == "__main__":
    f = io.StringIO(data)  # on Python 2, use io.BytesIO(data) instead
    gy = group_year()
    for k, group in it.groupby(f, key=gy):
        # group is now an iterator over the lines of one year group in the data
        # Now you can iterate on each group like so:
        for line in group:
            rec = line.strip().split()
            # Guard against blank lines, then print the value of field 1
            if rec and rec[0] == '1':
                print(rec[1])
        # You could also use nested groupby's at this point to perform
        # further grouping on the different columns or whatever
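
To get exactly the output asked for in the question, including the blank entry for a year that lacks the field, the same grouping idea can be extended along these lines. This is a minimal Python 3 sketch rather than a definitive implementation: the names `HEADER`, `year_key`, and `field_values`, and the shortened `sample` data, are assumptions made for illustration.

```python
import itertools
import re

# Shortened sample in the same layout as the question's data (illustrative).
sample = """-------------2000--------------
1        17824
5    a    WA 750
-------------2001--------------
1        3224
5    a    WA 120
-------------2002--------------
2        2013341524626245.9
5    a    WA 345"""

# A header line is a run of dashes, a four-digit year, and more dashes.
HEADER = re.compile(r"^-+(\d{4})-+$")

def year_key():
    """Return a stateful key function that maps each line to its year."""
    current = None
    def key(line):
        nonlocal current
        match = HEADER.match(line.strip())
        if match:
            current = match.group(1)
        return current
    return key

def field_values(text, field):
    """Yield (year, value) pairs; value is '' when the year lacks the field."""
    for year, group in itertools.groupby(text.splitlines(), key=year_key()):
        value = ""
        for line in group:
            parts = line.split(None, 1)  # field number, then the rest of the line
            if len(parts) == 2 and parts[0] == field:
                value = parts[1].strip()
                break  # keep only the first occurrence of the field
        yield year, value

print(list(field_values(sample, "1")))
# [('2000', '17824'), ('2001', '3224'), ('2002', '')]
```

Because the year itself is used as the group key, each value stays attached to its year, and a missing field shows up as an empty string instead of being silently skipped.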

You can first use a regex to split your text into blocks (assuming the whole input is held in a string s), then use itertools.izip_longest (itertools.zip_longest in Python 3) within a nested list comprehension to get your expected columns:

>>> import re
>>> blocks=re.split(r'-+\d+-+',s)
>>> from itertools import izip_longest

>>> z=[list(izip_longest(*[k for k in sub if k])) for sub in izip_longest(*[[j.split() for j in i.split('\n')] for i in blocks])]
>>> z
[[], [('1', '1', '2'), ('17824', '3224', '2013341524626245.9')], [('2', '2', '3'), ('20131125192004.9', '20w125192004.9', '484914s1969'), (None, None, 'dcu'), (None, None, '000'), (None, None, '0'), (None, None, 'eng')], [('3', '3', '4'), ('690714s1969', '690714s1969', 'a'), ('dcu', 'dcu', '75601809'), ('000', '000', None), ('0', '0', None), ('eng', 'eng', None)], [('4', '5', '4'), ('a', 'a', 'a'), ('75601809', 'WA', 'eng'), (None, '120', None)], [('4', '4'), ('a', 'c'), ('DLC', 'DLC')], [('4', '5'), ('b', 'a'), ('eng', 'WA'), (None, '345')], [('4',), ('c',), ('DLC',)], [('5',), ('a',), ('WA',), ('750',)], []]

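Each sublist corresponds to one line position within the blocks, not to a field number. For example, the first non-empty sublist holds the first line of every block, so the entry for 2002 is its field 2 because that block has no field 1: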

>>> z=[i for i in z if i] # remove the empty lists
>>> z[0]
[('1', '1', '2'), ('17824', '3224', '2013341524626245.9')]
>>> z[0][1]
('17824', '3224', '2013341524626245.9')
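
Because the blocks above are aligned by line position, a year that is missing a field shifts its later lines up a row. A possible alternative is to key each block by field number instead. The following is a minimal Python 3 sketch under stated assumptions: `parse_records` and the shortened `sample` data are hypothetical names introduced here, and a capturing group is added to the split pattern so that re.split keeps the years.

```python
import re

# Shortened sample in the same layout as the question's data (illustrative).
sample = """-------------2000--------------
1        17824
4    a    DLC
4    b    eng
-------------2001--------------
1        3224
-------------2002--------------
2        2013341524626245.9
4    c    DLC"""

def parse_records(text):
    """Map each year to a dict of field number -> list of raw values."""
    # With a capturing group, re.split returns the years as well:
    # ['', '2000', body, '2001', body, '2002', body]
    parts = re.split(r"-+(\d{4})-+\n?", text)
    records = {}
    for year, body in zip(parts[1::2], parts[2::2]):
        fields = {}
        for line in body.splitlines():
            tokens = line.split(None, 1)  # field number, then the rest
            if len(tokens) == 2:
                fields.setdefault(tokens[0], []).append(tokens[1].strip())
        records[year] = fields
    return records

records = parse_records(sample)

# First value of field "1" for every year, '' where the field is absent.
column = [records[year].get("1", [""])[0] for year in sorted(records)]
print(column)
# ['17824', '3224', '']
```

Repeated fields such as 4 are kept as lists, so `records["2000"]["4"]` holds both subfield lines for that year.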