I have a very large file of data and each entry looks something like this:
5 (this can be any number, call this line n)
Line 1
Line 2
Line 3
n lines, in this case 5, i.e. lines 4 - 8
Line 9
n lines, in this case again 5, i.e. lines 10-14
Line 15
Essentially, each entry starts with one line, followed by 3 lines + n lines + 1 line + n lines + 1 line, i.e. 2n + 6 lines in total.
This number n is an integer, but it can vary from entry to entry. Is there a way to figure out how many data entries I have in this file?
I already have code that loops over each entry if I know how many entries there are, but is there a way to figure out the number of entries in the first place?
Thanks!
edit: Here are two examples of a sample entry -
5
10.0 0.0 0.0
0.0 10.0 0.0
0.0 0.0 10.0
A -0.005364798 -0.022912843 0.017346957
B 0.527031905 0.603310150 0.560736787
B -0.629466850 -0.628385741 0.628048126
B -0.649090857 0.603667874 -0.726135880
B 0.683741908 -0.584386774 -0.700569743
-17.862057
-2.022841336 -1.477407454 -5.606136767
2.521789668 2.889251770 2.572440406
-0.401914888 -0.722582908 0.244151982
0.806040926 -0.990697574 1.474733506
-0.903074369 0.301436166 1.314862295
0.016462
7
10.0 0.0 0.0
0.0 10.0 0.0
0.0 0.0 10.0
A -0.591644968 -0.645755982 -0.014245979
B 1.198655655 -0.588872080 -0.025169784
B -1.460774580 -1.255848596 0.025804796
B 0.321839745 2.199107994 0.050450166
C 0.617684720 -1.389588077 -0.075897238
C 0.493712792 1.349385956 -0.004249822
D -0.808145644 0.577304796 0.014326943
-26.435922
1.649465696 -2.945456091 -0.152209323
0.531241391 -1.113956273 -0.135548573
-0.529287352 -0.556746737 -0.061346528
-2.152476371 6.326868481 0.441458459
-1.633473432 3.325310912 0.291306019
0.726490986 -8.268565793 -0.512575180
1.408090505 3.232545501 0.128915126
0.155658
The first number, an integer (5 or 7 in these examples), determines the number of lines that follow this block of three lines:
10.0 0.0 0.0
0.0 10.0 0.0
0.0 0.0 10.0
It also determines the number of lines that follow the single line after that, which in the first example is -17.862057.
Each entry looks something like this. Basically, the goal is to figure out how many entries there are in total, using the fact that the first integer of each entry tells you how many lines make up the rest of that entry.
I've written code that works with your given example. It doesn't know at the start how many entries there are; it just keeps reading from the file until the file is exhausted, pulling out one entry at a time. I've saved your sample input in `input.txt`. I've now also modified the code to read the data in as floats.
The output demonstrates that it found 2 entries and parsed them as floats, and it then prints the entries. I'm not entirely sure what the entries are, so I've kept them ambiguously named. Note that I've preserved as much of each entry's data as I can in my big list-tuple structure, because I'm not sure which bits are relevant either, so the original file should almost be reconstructable from the entries in memory.
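As a minimal sketch of a parser along the lines described in this answer: the names `read_floats`, `parse_entry` and `entries` are the ones used in the explanation, but the function bodies, the field order in the returned tuple, and the use of `io.StringIO` in place of `input.txt` are assumptions made for illustration. It also assumes there are no blank lines between or after entries.

```python
import functools
import io

# The two sample entries from the question, standing in for input.txt.
SAMPLE = """\
5
10.0 0.0 0.0
0.0 10.0 0.0
0.0 0.0 10.0
A -0.005364798 -0.022912843 0.017346957
B 0.527031905 0.603310150 0.560736787
B -0.629466850 -0.628385741 0.628048126
B -0.649090857 0.603667874 -0.726135880
B 0.683741908 -0.584386774 -0.700569743
-17.862057
-2.022841336 -1.477407454 -5.606136767
2.521789668 2.889251770 2.572440406
-0.401914888 -0.722582908 0.244151982
0.806040926 -0.990697574 1.474733506
-0.903074369 0.301436166 1.314862295
0.016462
7
10.0 0.0 0.0
0.0 10.0 0.0
0.0 0.0 10.0
A -0.591644968 -0.645755982 -0.014245979
B 1.198655655 -0.588872080 -0.025169784
B -1.460774580 -1.255848596 0.025804796
B 0.321839745 2.199107994 0.050450166
C 0.617684720 -1.389588077 -0.075897238
C 0.493712792 1.349385956 -0.004249822
D -0.808145644 0.577304796 0.014326943
-26.435922
1.649465696 -2.945456091 -0.152209323
0.531241391 -1.113956273 -0.135548573
-0.529287352 -0.556746737 -0.061346528
-2.152476371 6.326868481 0.441458459
-1.633473432 3.325310912 0.291306019
0.726490986 -8.268565793 -0.512575180
1.408090505 3.232545501 0.128915126
0.155658
"""


def read_floats(text):
    """Turn a whitespace-separated string into a tuple of floats."""
    return tuple(float(token) for token in text.split())


def parse_entry(input_file):
    """Parse a single entry, or return None once the file is exhausted."""
    n = input_file.readline()
    if n:  # readline() returns '' at end of file
        n = int(n)
        # Three lines of floats that open every entry.
        header = [read_floats(input_file.readline()) for _ in range(3)]
        # n lines of "letter x y z"; strip first so line[0] is the letter.
        lines = map(str.strip, (input_file.readline() for _ in range(n)))
        labelled = [(line[0], read_floats(line[1:])) for line in lines]
        value_a = float(input_file.readline())
        vectors = [read_floats(input_file.readline()) for _ in range(n)]
        value_b = float(input_file.readline())
        return n, header, labelled, value_a, vectors, value_b
    return None


input_file = io.StringIO(SAMPLE)
# Keep calling parse_entry(input_file) until it returns None.
entries = list(iter(functools.partial(parse_entry, input_file), None))
print(len(entries))  # 2
```

With a real file, `io.StringIO(SAMPLE)` would be replaced by `open("input.txt")`, ideally inside a `with` block.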
Regarding the lines starting with a character: this is handled by first applying `str.strip` to the line, as sometimes there is a space before the character. It then separates the line into `line[0]` and `line[1:]`, which are the character and a slice of the string representing the data; the latter is then operated on as normal.

More on how I separate the characters from the floats:
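A minimal sketch of this strip-then-slice step on a single line. The helper name `split_label` is hypothetical and only used here for illustration; the input line is taken from the sample in the question, with a leading space added.

```python
def split_label(line):
    """Split a line like ' B 0.5 0.6 0.7' into ('B', (0.5, 0.6, 0.7))."""
    line = line.strip()              # drop leading/trailing whitespace
    label, rest = line[0], line[1:]  # first character vs. everything after it
    return label, tuple(float(token) for token in rest.split())


label, values = split_label(" B 0.527031905 0.603310150 0.560736787\n")
print(label)        # B
print(len(values))  # 3
```

Without the `strip()` call, `line[0]` would be the leading space rather than the letter.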
Take one of the lines from your sample that starts with a letter, for example `A -0.005364798 -0.022912843 0.017346957`, possibly with a leading space. It is parsed as part of a larger expression, but if we're considering only this line, we can look at less of the expression.

The first thing that happens to the line is `str.strip`, from `map(str.strip..)`. This strips any leading and trailing whitespace to ensure the first character is the letter to be removed. After this, the line in memory starts directly with the letter.

The line is then separated into `line[0]` and `read_floats(line[1:])`. This is where the distinction between the string and the floats is made: the first character is separated from the rest of the string, which is then passed to `read_floats`. This uses slice notation, a powerful syntax Python has for getting sub-sequences of sequences such as strings and lists. The slice `1:` means 'slice from index 1 to the end of the string'.

`for _` is a Python convention for when you just need to repeat something without keeping track of which repetition it is. I.e. it reads a line for each number in the `range(n)`, so it reads `n` lines, but it doesn't need to keep track of which number the current line is. It could just as well say `for i in range(n)`, except `i` would be unused, so the loop variable is called `_` to indicate you don't want it.

`if n:` checks that the string `n` is not empty. This is because when you `readline()` a file that has been exhausted, an empty string is returned. This means that instead of crashing when it's done with the file, the program will just neatly stop parsing entries. This is important because we don't know the number of entries, so we keep trying to read an `n` until we can no longer read an `n`, which is why we have to use an if statement.

Regarding why `entries` looks so convoluted: `parse_entry(input_file)` would only parse a single entry. All of the other baggage is required to parse all entries. `functools.partial(parse_entry, input_file)` means 'apply the argument `input_file` to the function `parse_entry`', giving a new function that takes no arguments and calls `parse_entry(input_file)` each time it is called. `iter` then keeps calling it until it returns `None`. This is quite a useful trick: the `iter` function can be given any zero-argument function and then a value to stop at, and it will keep returning values from the function until it hits the 'stop' value. A simpler, more often seen example might be `iter(sys.stdin.readline, "a\n")`. This would keep reading lines from `stdin` until it hit a line containing only `a`.

On tuples and tuple unpacking: you could unpack each entry into separate names and then work with the pieces individually.
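A minimal sketch of tuple unpacking on one entry. The layout `(n, header, labelled, value_a, vectors, value_b)` and all of the variable names are assumptions chosen for illustration; the numbers are the first sample entry from the question.

```python
# One entry, laid out as (n, header, labelled, value_a, vectors, value_b).
entry = (
    5,
    [(10.0, 0.0, 0.0), (0.0, 10.0, 0.0), (0.0, 0.0, 10.0)],
    [
        ("A", (-0.005364798, -0.022912843, 0.017346957)),
        ("B", (0.527031905, 0.603310150, 0.560736787)),
        ("B", (-0.629466850, -0.628385741, 0.628048126)),
        ("B", (-0.649090857, 0.603667874, -0.726135880)),
        ("B", (0.683741908, -0.584386774, -0.700569743)),
    ],
    -17.862057,
    [
        (-2.022841336, -1.477407454, -5.606136767),
        (2.521789668, 2.889251770, 2.572440406),
        (-0.401914888, -0.722582908, 0.244151982),
        (0.806040926, -0.990697574, 1.474733506),
        (-0.903074369, 0.301436166, 1.314862295),
    ],
    0.016462,
)

# Unpack the whole entry into separate names in one assignment.
n, header, labelled, value_a, vectors, value_b = entry

# Unpacking also works in loops and can be nested.
labels = [label for label, _ in labelled]
first_label, (x, y, z) = labelled[0]

print(n)       # 5
print(labels)  # ['A', 'B', 'B', 'B', 'B']
print(first_label, x)  # A -0.005364798
```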
Hopefully this demonstrates how you might go about making use of the structure.
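If only the number of entries is needed, it may be enough to read each `n` and skip the rest of the entry without parsing it. The following is a minimal sketch, assuming every entry follows the layout from the question exactly (one line holding `n`, then 3 + n + 1 + n + 1 = 2n + 5 further lines). `count_entries` and `fake_entry` are hypothetical names, and the placeholder lines stand in for real data.

```python
import io
from itertools import islice


def count_entries(input_file):
    """Count entries by reading each n and skipping the 2n + 5 lines after it."""
    count = 0
    for first_line in input_file:
        if not first_line.strip():
            continue  # assumption: stray blank lines can be ignored
        n = int(first_line)
        remaining = 2 * n + 5
        # Consume the rest of this entry from the same file iterator.
        skipped = sum(1 for _ in islice(input_file, remaining))
        if skipped != remaining:
            raise ValueError("file ended in the middle of an entry")
        count += 1
    return count


def fake_entry(n):
    """Build a placeholder entry with the right number of lines."""
    return "\n".join([str(n)] + ["placeholder"] * (2 * n + 5)) + "\n"


data = io.StringIO(fake_entry(5) + fake_entry(7))
print(count_entries(data))  # 2
```

This never converts the data lines to floats, so it should be cheaper than a full parse on a very large file, but it will not notice malformed data inside an entry.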