I have been using the FUNSD dataset for sequence labeling in unstructured documents, per this paper: LayoutLM: Pre-training of Text and Layout for Document Image Understanding. The data, after cleaning and moving from a dict to a dataframe, looks like this:
The dataset is laid out as follows:

- The column `id` is the unique identifier for each word group inside a document, shown in column `text` (like nodes).
- The column `label` identifies whether the word group is classified as a 'question' or an 'answer'.
- The column `linking` denotes which word groups are 'linked' (like edges), connecting corresponding 'questions' to 'answers'.
- The column `box` denotes the location coordinates (x, y of the top left corner and x, y of the bottom right corner) of the word group, relative to the top left corner of the page (0, 0).
- The column `words` holds each individual word inside the word group, along with its location (`box`).
I aim to train a classifier that identifies which words inside the column `words` are linked together, using a Graph Neural Net. The first step is to transform my current dataset into a network. My questions are as follows:
1. Is there a way to break each row in the column `words` into two columns, `[box_word, text_word]`, each holding a single word, while replicating the other columns that stay the same, `[id, label, text, box]`, resulting in a final dataframe with these columns: `[box, text, label, box_word, text_word]`?
2. I can tokenize the columns `text` and `text_word`, one-hot encode the column `label`, and split the multi-valued numeric columns `box` and `box_word` into individual columns, but how do I split up / rearrange the column `linking` to define the edges of my network graph?
3. Am I taking the correct route in using the dataframe to generate a network and then using it to train a GNN?
Any and all help/tips are appreciated.
Edit: process multiple entries in the column `words`.

Your questions 1 and 2 are answered in the code. They are actually quite simple (assuming the data format is correctly represented by what is shown in the screenshot). Digest:
Q1: `apply` the splitting function on the column and unpack the result with `.tolist()` so that separate columns can be created. See this post also.

Q2: Use a list comprehension to unpack the extra list layer and retain only non-empty edges.
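For concreteness, here is a minimal sketch of what that could look like. The toy dataframe and the exact nesting of `words` (a list of `{'text', 'box'}` dicts) and `linking` (a list of `[question_id, answer_id]` pairs) are my own assumptions based on the FUNSD format, not the original data:

```python
import pandas as pd

# Toy frame mimicking the layout described above (all values are made up).
df = pd.DataFrame({
    "id": [0, 1],
    "label": ["question", "answer"],
    "text": ["NAME:", "John Doe"],
    "box": [[10, 10, 60, 25], [70, 10, 150, 25]],
    "words": [
        [{"text": "NAME:", "box": [10, 10, 60, 25]}],
        [{"text": "John", "box": [70, 10, 105, 25]},
         {"text": "Doe", "box": [110, 10, 150, 25]}],
    ],
    "linking": [[[0, 1]], [[0, 1]]],
})

# Q1: one row per word -- explode 'words', then apply a splitting function
# and unpack with .tolist() so separate columns can be created.
exploded = df.explode("words", ignore_index=True)
word_cols = pd.DataFrame(
    exploded["words"].apply(lambda w: [w["box"], w["text"]]).tolist(),
    columns=["box_word", "text_word"],
)
df_words = pd.concat([exploded.drop(columns=["words"]), word_cols], axis=1)

# Q2: list comprehension (here a set comprehension, to de-duplicate) that
# strips the extra list layer and keeps only non-empty pairs.
edges = sorted({tuple(pair) for links in df["linking"] for pair in links if pair})

print(df_words[["id", "label", "text", "box", "box_word", "text_word"]])
print(edges)  # [(0, 1)]
```

`df_words` then has the per-word `[box_word, text_word]` columns asked for in Q1, and `edges` is a flat edge list ready to feed a graph library.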
Q3: Yes and no. Yes, because `pandas` is good at organizing data with heterogeneous types. For example, lists, dicts, ints, and floats can be present in different columns. Several I/O functions, such as `pd.read_csv()` or `pd.read_json()`, are also very handy. However, there is overhead in data access, and that is especially costly when iterating over rows (records). Therefore, the transformed data that feeds directly into your model is usually converted into `numpy.array` or a more efficient format. Such a format conversion task is the data scientist's sole responsibility.
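As a rough illustration of that last point (building on the toy `df` and `edges` from the sketch above, which are assumptions of mine), the organized dataframe can be turned into plain arrays before handing it to a GNN library:

```python
import numpy as np

# Stack the group-level boxes into a dense feature matrix, and transpose the
# linking pairs into the (2, num_edges) edge-index layout used by libraries
# such as PyTorch Geometric.
node_features = np.stack(df["box"].to_numpy())   # shape: (num_word_groups, 4)
edge_index = np.array(edges, dtype=np.int64).T   # shape: (2, num_edges)
print(node_features.shape, edge_index.shape)
```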
**Code and Output**

I make up my own sample dataset. Irrelevant columns are ignored (as I am not obliged to include them, and shouldn't).
Q1 output