Working with huge text files (or quasi-csv) with Python - creating matrices from text data


I've reached a point where I don't have a good idea of how to solve my problem. I am working with a text file (*.inp - an Abaqus job file) and I want to extract some basic information from it. So far I have identified two major problems:

  1. Such files are quite big, i.e. 500 000 lines.
  2. Their structure is not always csv-like.

Ad. 1. Because of the huge amount of data, I wanted to use the pandas library to speed up the operations (which will be repeated in an optimization loop).

Ad. 2. An exemplary *.inp file with its "strange" structure (please note that "node" and "element" are actual names used in the code, and each element is built up from several nodes, e.g. a cube = element, each of the cube's vertices = node):

*NODE
     1,  0.0, 0.0, 3.0
     2,  -17.0, 5.5, 2.3
     3,  51.0, 0.0, 639.8          
     5,  0.0, 5.5 , 31.0 
...
     145000, 31.3, 21.5, 99.8
*ELEMENT, ELSET=Name1, TYPE=Type1
     1527450, 265156, 273237, 265019, 265021, 275728, 273221, 265599,
     265146, 273583, 265020
     1527449, 269279, 272869, 269277, 269479, 273130, 272862, 269278,
     269489, 275729, 269627
     1527448, 272250, 272858, 275350, 273327, 272851, 275730, 275731,
     273346, 275732, 275733
...
     1126546, 265180, 275352, 273263, 273237, 275736, 275737, 275738,
     275739, 275740, 273246
*ELEMENT, ELSET=Name2, Type2
...
*SURFACE, NAME=Surf1
     12345, S5
     34567, S3
...
*STEP
*STATIC
1.0,,,1.0
*BOUNDARY
bc_1,1,3,0.0
bc_2,6,6,0.0
...
...

Values listed under the "*NODE" keyword have the following sequence: node_id, coord_x, coord_y, coord_z

This is the biggest set of data in the model, which is why I wanted to use pandas for it (reading it like a csv). For this part I don't see major issues.

Values listed under the "*ELEMENT" keyword are a bit more complicated:

line n: elementn_id, node1_id, node2_id, node3_id, node4_id, node5_id, node6_id, node7_id

line n+1: node8_id, node9_id, node10_id

In this case, pandas imports this part of the file as two separate lines (obviously) with N/A items in the last 7 columns of the n+1 rows. I use pd.read_csv for it. Please be aware that node ids 1 to 10 together form one element (whose id is the first value in the nth row).
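A minimal sketch of stitching those wrapped rows back together after pd.read_csv, assuming every element spans exactly two physical lines (the first with the element id and seven node ids, the second with the remaining three) - the inline data stands in for the real file:

```python
import io

import pandas as pd

# Two elements, each wrapped over two lines as in the *.inp file.
raw = io.StringIO(
    "1527450, 265156, 273237, 265019, 265021, 275728, 273221, 265599,\n"
    "265146, 273583, 265020\n"
    "1527449, 269279, 272869, 269277, 269479, 273130, 272862, 269278,\n"
    "269489, 275729, 269627\n"
)

df = pd.read_csv(raw, header=None)
# Rows alternate: even rows hold element_id + nodes 1-7,
# odd rows hold nodes 8-10 (pandas pads the short rows with NaN).
first = df.iloc[0::2].reset_index(drop=True)
second = df.iloc[1::2].reset_index(drop=True)
# Glue the first 8 fields of the even rows to the 3 fields of the odd rows,
# giving one 11-column row per element.
elements = pd.concat(
    [first.iloc[:, :8], second.iloc[:, :3].set_axis([8, 9, 10], axis=1)],
    axis=1,
)
print(elements)
```

This relies on the two-lines-per-element layout being strictly regular; if some elements wrap differently, the lines would need to be joined before parsing instead.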

And now I state the problem :):

  1. How to properly import the data which lies between *ELEMENT, ELSET=Name1 and *ELEMENT, ELSET=Name2, when my aim is to have a matrix in which each element uses 1 row only, with a total of 11 columns (1st - element_id, 2-11 - node ids).
  2. So far I have divided this *.inp file into separate files to be able to work on them... Now I want to do it all in one script, i.e. create matrix A = [(node_id, coord_x, coord_y, coord_z), ...] and matrix B = [(element_id, node1_id, ..., node10_id), ...] at once. How to do so, if a simple pd.read_csv doesn't perform OK in this case? There are plenty of purely string rows which should either not be imported or be excluded, to speed up the script.

My idea was to open the *.inp file in Python with the 'open' function, then add some kind of tags/triggers to mark which lines of the file should be processed further (maybe using pandas), but in that case I don't use pandas for the import...

I believe my problem is quite dull for most of you, but I am not strictly a developer :) I don't expect a direct, ready-made solution, but your advice on where to look for potential answers or tools.

Thank you all in advance and I wish you a nice day, prz

There are 2 best solutions below.
Interesting challenge!

So long as the files you need to process roughly follow that structure, something like this might work for you. See below for the output.

  • The file data is inlined in a io.StringIO() to make this self-sufficient, but it could just as well be an open("data.inp") file stream.
  • If you're not very well-versed in Python, the generator functions with their yield magic may seem a little arcane, sorry about that. :)
  • It does need to "re-constitute" the CSV files in memory for Pandas to read, which could be a bottleneck if you're low on memory, but you'll know by giving it a shot...
  • Note how the "Surf1" group is deliberately skipped, so you have an idea of how to ignore certain sections.
import io
import itertools
import pandas as pd


def split_abq(inp_file):
    """
    Split an ABQ file into tuples of "group" header and lines in that group.
    """

    current_header = None
    for line in inp_file:
        line = line.strip()  # remove whitespace
        if not line:  # skip empty lines
            continue
        if line.startswith("*"):  # note headers
            current_header = line
            continue
        yield (current_header, line)  # generate lines


def group_split_abq(split_generator):
    """
    "De-repeat" the output of `split_abq` into a generator (group name, generator-of-lines)
    """
    for group, entries in itertools.groupby(split_generator, lambda pair: pair[0]):
        line_generator = (line for _, line in entries)  # A generator expression; this is evaluated lazily
        yield (group, line_generator)


def lines_to_df(line_generator, **read_csv_kwargs):
    """
    Convert an iterable of CSV-ish lines into a Pandas dataframe
    """

    # "Write" an in-memory file for Pandas to parse
    csv_io = io.StringIO()
    for line in line_generator:
        print(line, file=csv_io)
    csv_io.seek(0)  # Seek back to the start
    return pd.read_csv(csv_io, **read_csv_kwargs)


input_file = io.StringIO(
    """
*NODE
     1,  0.0, 0.0, 3.0
     2,  -17.0, 5.5, 2.3
     3,  51.0, 0.0, 639.8
     5,  0.0, 5.5 , 31.0 
     145000, 31.3, 21.5, 99.8
*ELEMENT, ELSET=Name1, TYPE=Type1
     1527450, 265156, 273237, 265019, 265021, 275728, 273221, 265599, 265146, 273583, 265020
     1527449, 269279, 272869, 269277, 269479, 273130, 272862, 269278, 269489, 275729, 269627
     1527448, 272250, 272858, 275350, 273327, 272851, 275730, 275731, 273346, 275732, 275733
     1126546, 265180, 275352, 273263, 273237, 275736, 275737, 275738, 275739, 275740, 273246
*ELEMENT, ELSET=Name2, Type2
    1527450, 265156, 273237, 265019, 265021, 275728, 273221, 265599, 265146, 273583, 265020
*SURFACE, NAME=Surf1
     12345, S5
     34567, S3
*STEP
*STATIC
1.0,,,1.0
*BOUNDARY
bc_1,1,3,0.0
bc_2,6,6,0.0
"""
)


for group, lines in group_split_abq(split_abq(input_file)):
    print("=================================")
    print("Group: ", group)
    if "Surf1" in group:  # You can use this opportunity to ignore some groups
        print("-> Skipping Surf1")
        continue
    df = lines_to_df(lines, header=None)  # `header` should probably be decided by the group type
    print(df.head())
    print("------------------------------\n")
    # You could store the various `df`s generated in a dict here

The output is

=================================
Group:  *NODE
        0     1     2      3
0       1   0.0   0.0    3.0
1       2 -17.0   5.5    2.3
2       3  51.0   0.0  639.8
3       5   0.0   5.5   31.0
4  145000  31.3  21.5   99.8
------------------------------

=================================
Group:  *ELEMENT, ELSET=Name1, TYPE=Type1
        0       1       2       3       4       5       6       7       8       9       10
0  1527450  265156  273237  265019  265021  275728  273221  265599  265146  273583  265020
1  1527449  269279  272869  269277  269479  273130  272862  269278  269489  275729  269627
2  1527448  272250  272858  275350  273327  272851  275730  275731  273346  275732  275733
3  1126546  265180  275352  273263  273237  275736  275737  275738  275739  275740  273246
------------------------------

=================================
Group:  *ELEMENT, ELSET=Name2, Type2
        0       1       2       3       4       5       6       7       8       9       10
0  1527450  265156  273237  265019  265021  275728  273221  265599  265146  273583  265020
------------------------------

=================================
Group:  *SURFACE, NAME=Surf1
-> Skipping Surf1
=================================
Group:  *STATIC
     0   1   2    3
0  1.0 NaN NaN  1.0
------------------------------

=================================
Group:  *BOUNDARY
      0  1  2    3
0  bc_1  1  3  0.0
1  bc_2  6  6  0.0
------------------------------
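Building on that loop, the generated dataframes could be collected in a dict keyed by group header and then converted into the matrix A (nodes) and matrix B (elements) the question asks for - a sketch with illustrative stand-in data, not part of the original answer:

```python
import pandas as pd

# Stand-ins for the dataframes the loop above would yield per group.
dfs = {
    "*NODE": pd.DataFrame([[1, 0.0, 0.0, 3.0], [2, -17.0, 5.5, 2.3]]),
    "*ELEMENT, ELSET=Name1, TYPE=Type1": pd.DataFrame(
        [[1527450] + list(range(265000, 265010))]
    ),
}

A = dfs["*NODE"].to_numpy()  # each row: node_id, coord_x, coord_y, coord_z
B = dfs["*ELEMENT, ELSET=Name1, TYPE=Type1"].to_numpy()  # element_id, node1_id ... node10_id
print(A.shape, B.shape)
```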

You might want to look at using this parser, AbqParse. I've not tried it, so I can't say whether it'll work for you; the code is also quite old, so it may not work with more recent versions of Python.