Proper way to read a position-based text file


So I have a file with data in this (standardized) format:

 12455WE READ THIS             TOO796445 125997  554777     
 22455 888AND THIS       TOO796445 125997  55477778 2 1

Probably thought up by someone who has done too much COBOL.

Each field has a fixed length, and I can read it by slicing the line.

My problem is: how can I structure my code so that it is more flexible and does not rely on hard-coded offsets for the slices? Should I use a class of constants, or something like that?

EDIT:

Also, the first digit (0 to 9, always present) determines the structure of the line, which is of fixed length. The file is provided by a third party who guarantees its validity, so I don't need to check the format, only read it. There are around 11 different line structures.


There are 3 best solutions below

2
On BEST ANSWER

Create a list of column widths and a routine that takes that list and a column index as parameters. The routine computes the start offset of the slice by summing the widths of all preceding columns, and the end offset by adding the width of the indexed column to the start offset.
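
A minimal sketch of that routine, assuming a hypothetical `widths` list and a helper named `get_field` (neither name comes from the question, and the widths are illustrative only):

```python
# Hypothetical column widths for one record layout (illustrative values only)
widths = [1, 4, 28, 7, 7, 7]

def get_field(line, widths, index):
    """Return the text of column `index`, given the list of column widths."""
    start = sum(widths[:index])   # total width of all preceding columns
    end = start + widths[index]   # add this column's own width
    return line[start:end]

# Build a sample line from padded pieces so the column boundaries are explicit
line = "1" + "2455" + "WE READ THIS".ljust(28) + "796445".ljust(7)
print(repr(get_field(line, widths, 0)))
print(repr(get_field(line, widths, 3)))
```

Summing the preceding widths on every call is fine for a handful of columns; if it ever matters, the offsets could be computed once per layout and cached.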

6
On

My suggestion is to use a dictionary keyed on the 5-digit line type code. Each value in the dictionary can be a list of field offsets (or of (offset, width) tuples), indexed by field position.

If your fields have names it may be convenient to use a class instead of a list to store field offset data. However, namedtuples may be better here, since then you can access your field offset data either via its name or by its field position, so you get the best of both worlds.

namedtuples are actually implemented as classes, but defining a new namedtuple type is much more compact than writing an explicit class definition, and namedtuples use the __slots__ protocol, so their instances take up less RAM than instances of a normal class that uses __dict__ for storing its attributes.
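
As a small illustration of the two access styles, here is a sketch using a hypothetical two-field `Layout` type (the name and the offsets are made up for the example):

```python
from collections import namedtuple

# Hypothetical layout with two fields, each stored as an (offset, width) tuple
Layout = namedtuple('Layout', ['record_type', 'amount'])
layout = Layout(record_type=(0, 5), amount=(48, 6))

# The same entry is reachable by name or by position
print(layout.amount)     # (48, 6)
print(layout[1])         # (48, 6)

# The generated class declares an empty __slots__ tuple
print(Layout.__slots__)  # ()
```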


Here's one way to use namedtuples to store field offset data. I'm not claiming that the following code is the best way to do this, but it should give you some ideas.

from collections import namedtuple

#Create a namedtuple, `Fields`, containing all field names
fieldnames = [
    'record_type', 
    'special',
    'communication',
    'id_number',
    'transaction_code',
    'amount',
    'other',
]

Fields = namedtuple('Fields', fieldnames)

#Some fake test data
data = [
    #          1         2         3         4         5
    #012345678901234567890123456789012345678901234567890123
    "12455WE READ THIS             TOO796445 125997  554777",
    "22455 888AND THIS       TOO796445 125997  55477778 2 1",
]

#A dict to store the field (offset, width) data for each field in a record,
#keyed by record type, which is always stored at (0, 5)
offsets = {}

#Some fake record structures
offsets['12455'] = Fields(
    record_type=(0, 5), 
    special=None,
    communication=(5, 28),
    id_number=(33, 6),
    transaction_code=(40, 6),
    amount=(48, 6),
    other=None)

offsets['22455'] = Fields( 
    record_type=(0, 5),
    special=(6, 3),
    communication=(9, 18),
    id_number=(27, 6),
    transaction_code=(34, 6),
    amount=(42, 8),
    other=(51,3))

#Test.
for row in data:
    print(row)
    #Get record type
    rt = row[:5]
    #Get field structure
    fields = offsets[rt]
    for name in fieldnames:
        #Get field offset data by field name
        t = getattr(fields, name)
        if t is not None:
            start, flen = t
            stop = start + flen
            #Use a separate name so the `data` list is not shadowed
            value = row[start:stop]
            print("%-16s ... %r" % (name, value))
    print()

output

12455WE READ THIS             TOO796445 125997  554777
record_type      ... '12455'
communication    ... 'WE READ THIS             TOO'
id_number        ... '796445'
transaction_code ... '125997'
amount           ... '554777'

22455 888AND THIS       TOO796445 125997  55477778 2 1
record_type      ... '22455'
special          ... '888'
communication    ... 'AND THIS       TOO'
id_number        ... '796445'
transaction_code ... '125997'
amount           ... '55477778'
other            ... '2 1'
0
On

You can have a list of widths of the columns describing the format and unfold it like this:

formats = [
    [1, ],
    [1, 4, 28, 7, 7, 7],
]

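def unfold(line):
    # The first character of the line selects the list of column widths
    lengths = formats[int(line[0])]
    # Running totals of the widths give the end offset of each column
    ends = [sum(lengths[0:n+1]) for n in range(len(lengths))]
    # Each column starts where the previous one ended
    return [line[s:e] for s,e in zip([0] + ends[:-1], ends)]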

lines = [
    "12455WE READ THIS             TOO796445 125997 554777",
]

for line in lines:
    print(unfold(line))

Edit: Updated the code to better match what maazza asked in the edited question. This assumes the format designator is a single digit, but it can easily be generalized to other format designators.
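
One possible generalization, sketched here under the assumption that the designator is still the first character of the line but need not be a digit: key the layouts by that character in a dict instead of indexing a list. The layouts and the "A" designator below are hypothetical, and `itertools.accumulate` replaces the repeated `sum` calls.

```python
from itertools import accumulate

# Hypothetical layouts keyed by the designator character (illustrative widths only)
formats = {
    "1": [1, 4, 28, 7, 7, 7],
    "A": [1, 9, 10],
}

def unfold(line):
    """Split a line into fields using the widths registered for its first character."""
    widths = formats[line[0]]
    ends = list(accumulate(widths))   # running totals = end offset of each column
    starts = [0] + ends[:-1]          # each column starts where the previous one ended
    return [line[s:e] for s, e in zip(starts, ends)]

print(unfold("A" + "some text" + "0123456789"))
# ['A', 'some text', '0123456789']
```

An unknown designator raises `KeyError` here, which may be acceptable given that the file's validity is guaranteed by the provider.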