I am reading a file using Python, and within the file there are sections, each introduced by a header line that starts with the '#' character:
#HEADER1, SOME EXTRA INFO
data first section
1 2
1 233
...
// THIS IS A COMMENT
#HEADER2, SECOND SECTION
452
134
// ANOTHER COMMENT
...
#HEADER3, THIRD SECTION
Now I wrote code to read the file as follows:
with open(filename) as fh:
    enumerated = enumerate(iter(fh.readline, ''), start=1)
    for lino, line in enumerated:
        # handle special section
        if line.startswith('#'):
            print("="*40)
            print(line)
            while True:
                start = fh.tell()
                lino, line = next(enumerated)
                if line.startswith('#'):
                    fh.seek(start)
                    break
                print("[{}] {}".format(lino, line))
The output is:
========================================
#HEADER1, SOME EXTRA INFO
[2] data first section
[3] 1 2
[4] 1 233
[5] ...
[6] // THIS IS A COMMENT
========================================
#HEADER2, SECOND SECTION
[9] 452
[10] 134
[11] // ANOTHER COMMENT
[12] ...
========================================
#HEADER3, THIRD SECTION
Now you see that the line counter lino is no longer valid because I'm using seek(). Also, it won't help if I decrease it before breaking the loop, because this counter is increased with each call to next(). So is there an elegant way to solve this problem in Python 3.x? Also, is there a better way of handling the StopIteration than putting a pass statement in an except block?
UPDATE
So far I have adopted an implementation based on the suggestion made by @Dunes. I had to change it a bit so I can peek ahead to see if a new section is starting. I don't know if there's a better way to do this, so please jump in with comments:
class EnumeratedFile:

    def __init__(self, fh, lineno_start=1):
        self.fh = fh
        self.lineno = lineno_start

    def __iter__(self):
        return self

    def __next__(self):
        result = self.lineno, self.fh.readline()
        if result[1] == '':
            raise StopIteration
        self.lineno += 1
        return result

    def mark(self):
        self.marked_lineno = self.lineno
        self.marked_file_position = self.fh.tell()

    def recall(self):
        self.lineno = self.marked_lineno
        self.fh.seek(self.marked_file_position)

    def section(self):
        # peek at the next character without consuming it;
        # True while the next line does not start a new section
        pos = self.fh.tell()
        char = self.fh.read(1)
        self.fh.seek(pos)
        return char != '#'
And then the file is read and each section is processed as follows:
# create enumerated object
e = EnumeratedFile(fh)

header = ""
for lineno, line in e:
    print("[{}] {}".format(lineno, line))
    header = line.rstrip()

    # HEADER1
    if header.startswith("#HEADER1"):
        # process header 1 lines
        while e.section():
            # get node line
            lineno, line = next(e)
            # do whatever needs to be done with the line
    elif header.startswith("#HEADER2"):
        # etc.
        pass
You cannot alter the counter of the enumerate() iterable, no. You don't need to at all here, nor do you need to seek. Instead, use a nested loop and buffer the section header:
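A minimal sketch of that approach follows. The names format_sections and SAMPLE are hypothetical, and handing the already-read line to the inner loop via itertools.chain is one way to do it, assumed here for illustration.

```python
import io
from itertools import chain

SAMPLE = """\
#HEADER1, SOME EXTRA INFO
data first section
1 2
1 233
...
// THIS IS A COMMENT
#HEADER2, SECOND SECTION
452
134
// ANOTHER COMMENT
...
#HEADER3, THIRD SECTION
"""


def format_sections(fh):
    """Return report lines for a file-like object (hypothetical helper)."""
    out = []
    enumerated = enumerate(fh, start=1)  # a single counter for the whole file
    header = None  # most recent header, buffered until its section is handled
    for lineno, line in enumerated:
        if line.startswith('#'):
            header = line  # buffer the header; its section starts on the next line
            continue
        if header is None:
            continue  # ignore anything before the first header
        out.append("=" * 40)
        out.append(header.rstrip())
        # Inner loop: handle the section, starting with the line already read.
        for lineno, line in chain([(lineno, line)], enumerated):
            if line.startswith('#'):
                header = line  # store the new header and end this section
                break
            out.append("[{}] {}".format(lineno, line.rstrip()))
    return out


# io.StringIO stands in for the open file; no seek() or tell() is needed,
# so the counter stays in step with the file (452 is reported as line 8).
report = format_sections(io.StringIO(SAMPLE))
print("\n".join(report))
```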
This buffers the header line only; every time we come across a new header, it is stored and the current section loop is ended.
With the sample file from the question, the third section remains unprocessed because there are no lines in it; had there been, the header variable would already have been set in anticipation.
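On the second question, one option (a sketch, not the only way) is the two-argument form of the built-in next(), which returns a default instead of raising StopIteration when the iterator is exhausted. The sample lines and the names lines and seen are made up for illustration.

```python
# Hypothetical input standing in for the lines of a file.
lines = iter(["#HEADER1\n", "1 2\n"])

seen = []
while True:
    # The second argument is returned once the iterator is exhausted,
    # so no try/except StopIteration block is needed.
    line = next(lines, None)
    if line is None:
        break
    seen.append(line.rstrip())

print(seen)
```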