Create Network from dictionary of Text and Numerical data - to train GNN


I have been using the FUNSD dataset to predict sequence labels in unstructured documents, following this paper: LayoutLM: Pre-training of Text and Layout for Document Image Understanding. After cleaning and moving the data from a dict to a dataframe, it looks like this (screenshot): FUNSD Dataframe. The dataset is laid out as follows:

  • The column id is the unique identifier for each word group inside a document, shown in the column text (like nodes).
  • The column label identifies whether the word group is classified as a 'question' or an 'answer'.
  • The column linking denotes which word groups are 'linked' (like edges), connecting corresponding 'questions' to 'answers'.
  • The column box holds the location coordinates (x, y top left; x, y bottom right) of the word group relative to the top-left corner (0, 0).
  • The column words holds each individual word inside the word group, together with its own location (box).

I aim to train a classifier to identify words inside the column words that are linked together, using a graph neural net. The first step is to be able to transform my current dataset into a network. My questions are as follows:

  1. Is there a way to break each row in the column words into two columns [box_word, text_word], one row per word, while replicating the other columns which stay the same ([id, label, text, box]), resulting in a final dataframe with the columns [id, label, text, box, box_word, text_word]?

  2. I can tokenize the columns text and text_word, one-hot encode the column label, and split the columns holding more than one numeric value (box and box_word) into individual columns, but how do I split up/rearrange the column linking to define the edges of my network graph?

  3. Am I taking the correct route in using the dataframe to generate a network, and then using it to train a GNN?

Any and all help/tips are appreciated.

BEST ANSWER

Edit: updated to process multiple entries in the column words.

Your questions 1 and 2 are answered in the code below. It is actually quite simple (assuming the data format is correctly represented by what is shown in the screenshot). Digest:

Q1: apply the splitting function on the column and unpack the result with .tolist() so that separate columns can be created. See this post also.

Q2: use a list comprehension to unpack the extra list layer and retain only non-empty edges.

Q3: Yes and no. Yes, because pandas is good at organizing data with heterogeneous types: lists, dicts, ints and floats can live in different columns. Several I/O functions, such as pd.read_csv() or pd.read_json(), are also very handy.

However, there is overhead in data access, and that is especially costly when iterating over rows (records). Therefore, the transformed data that feeds directly into your model is usually converted into numpy.array or an even more efficient format. That format conversion is the data scientist's responsibility.
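To illustrate that last point, a ragged list-of-lists column such as linking can be flattened once, up front, into a plain numpy array (a minimal sketch with made-up values, not your actual data):

```python
import numpy as np

# made-up "linking" column: each row holds zero or more [src, dst] pairs
linking = [[[0, 4]], [], [[4, 6]], [[6, 0]]]

# flatten all rows into one (num_edges, 2) integer array in a single pass;
# empty rows contribute nothing
edges = np.array([edge for row in linking for edge in row], dtype=np.int64)
print(edges.shape)  # (3, 2)
```
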

Code and Output

I made up my own sample dataset. Irrelevant columns are ignored (as I am not obliged to reproduce them, and shouldn't).

import networkx as nx
import pandas as pd

# data
df = pd.DataFrame(
    data={
        "words": [
            [{"box": [1, 2, 3, 4], "text": "TO:"}, {"box": [7, 7, 7, 7], "text": "777"}],
            [{"box": [1, 2, 3, 4], "text": "TO:"}],
            [{"text": "TO:", "box": [1, 2, 3, 4]}, {"box": [4, 4, 4, 4], "text": "444"}],
            [{"text": "TO:", "box": [1, 2, 3, 4]}],
        ],
        "linking": [
            [[0, 4]],
            [],
            [[4, 6]],
            [[6, 0]],
        ]
    }
)


# Q1. split
def split(el):
    ls_box = []
    ls_text = []
    for dic in el:
        ls_box.append(dic["box"])
        ls_text.append(dic["text"])
    return ls_box, ls_text

# straightforward, but may emit a deprecation warning on some pandas versions:
# df[["box_word", "text_word"]] = df["words"].apply(split).tolist()
# to avoid that, transpose the list of (box, text) tuples first
ls_tup = df["words"].apply(split).tolist()  # len: 4x2
ls_tup_tr = list(map(list, zip(*ls_tup)))  # len: 2x4
df["box_word"] = ls_tup_tr[0]
df["text_word"] = ls_tup_tr[1]

# Q2. construct graph
# flatten the list-of-lists; a row may contribute several edges, empty rows none
ls_edges = [edge for links in df["linking"].values for edge in links]
print(ls_edges)  # [[0, 4], [4, 6], [6, 0]]

g = nx.Graph()
g.add_edges_from(ls_edges)
list(g.nodes)  # [0, 4, 6]
list(g.edges)  # [(0, 4), (0, 6), (4, 6)]
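As a follow-up to Q3: if you later feed this graph into a GNN library, many of them expect a (2, num_edges) edge-index array with each undirected edge listed in both directions. A minimal sketch of that conversion (the exact shape a given library wants is an assumption, check its docs):

```python
import networkx as nx
import numpy as np

# same graph as above
g = nx.Graph()
g.add_edges_from([[0, 4], [4, 6], [6, 0]])

# duplicate each undirected edge in both directions, then transpose
edges = np.array(
    [(u, v) for u, v in g.edges] + [(v, u) for u, v in g.edges]
).T
print(edges.shape)  # (2, 6)
```
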

Q1 output

# trim the first column for printing
df_show = df.copy(deep=True)
df_show["words"] = df_show["words"].apply(lambda s: str(s)[:10])
df_show

Out[51]: 
        words   linking                      box_word   text_word
0  [{'box': [  [[0, 4]]  [[1, 2, 3, 4], [7, 7, 7, 7]]  [TO:, 777]
1  [{'box': [        []                [[1, 2, 3, 4]]       [TO:]
2  [{'text':   [[4, 6]]  [[1, 2, 3, 4], [4, 4, 4, 4]]  [TO:, 444]
3  [{'text':   [[6, 0]]                [[1, 2, 3, 4]]       [TO:]
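If you instead want one row per word, as question 1 is literally phrased, DataFrame.explode followed by unpacking the dicts is a possible alternative (a sketch using the same kind of made-up sample; only id and words are shown, but other columns are replicated the same way):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [0, 1],
    "words": [
        [{"box": [1, 2, 3, 4], "text": "TO:"}, {"box": [7, 7, 7, 7], "text": "777"}],
        [{"box": [1, 2, 3, 4], "text": "TO:"}],
    ],
})

# one dict per row; the remaining columns (here: id) are replicated automatically
exploded = df.explode("words", ignore_index=True)

# unpack each dict into box_word / text_word columns
exploded["box_word"] = exploded["words"].apply(lambda d: d["box"])
exploded["text_word"] = exploded["words"].apply(lambda d: d["text"])
exploded = exploded.drop(columns="words")
print(exploded)
```
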