Target/output mismatch using SentencePieceTokenizer layer with HuggingFace dataset?


I am trying to test a simple model that uses a SentencePieceTokenizer layer over a (HuggingFace) dataset, but I can't get the shape of the dataset's target to agree with the model's output. All code is available [here](https://github.com/rbelew/rikHak/blob/master/tst_240311.py).

First, I get the dataset from HF and convert it to the tf.data.Dataset that keras.Model.fit() expects, using:

trainDS = LH_dataset_HF['train'].to_tf_dataset(
        columns=["text"],
        label_cols=["answer"],
        batch_size=batch_size,
        shuffle=False,
        )
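As a sanity check on what `fit()` will see, `element_spec` reports the batched shapes. (This is a sketch with a synthetic stand-in dataset, since the HF download isn't reproduced here.)

```python
import tensorflow as tf

# Stand-in for the converted dataset: batched (text, answer) pairs
texts = tf.constant(["doc one", "doc two", "doc three"])
answers = tf.constant(["Yes", "Yes", "No"])
ds = tf.data.Dataset.from_tensor_slices((texts, answers)).batch(2)

# Both components keep a leading (None,) batch dimension
print(ds.element_spec)
```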

I can demonstrate that the data is loaded and the SentencePieceTokenizer is working as expected:

trainTF shape=(6, 3) answer shape=(6,)
all answers=[b'Yes' b'Yes' b'Yes' b'No' b'No' b'No']
echo1:  b'My roommate and I were feeling unwell in our basement apartment for a long ...

My model begins with a keras_nlp.tokenizers.SentencePieceTokenizer layer, has one embedding layer, and then makes a prediction:

Model: "functional_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input_layer (InputLayer)        │ (None)                 │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ sentence_piece_tokenizer        │ (None, 32)             │             0 │
│ (SentencePieceTokenizer)        │                        │               │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ embed (Embedding)               │ (None, 32, 100)        │     1,000,000 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ predictions (Dense)             │ (None, 32, 1)          │           101 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 1,000,101 (3.82 MB)
Trainable params: 1,000,101 (3.82 MB)
Non-trainable params: 0 (0.00 B)

But when I try to model.fit(trainDS) I get

ValueError: Arguments target and output must have the same rank (ndim). Received: target.shape=(None,), output.shape=(None, 32, 1)
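I suspect the rank mismatch arises because the final Dense layer is applied to each of the 32 token positions, giving one prediction per token rather than one per example. One way to get a single prediction per example (a sketch of the post-tokenizer part of the model, not necessarily the right architecture for this task) is to collapse the sequence axis, e.g. with GlobalAveragePooling1D. The vocabulary size 10000 is inferred from the 1,000,000 embedding parameters; the SentencePieceTokenizer stage is omitted so the sketch runs standalone.

```python
import keras

inputs = keras.Input(shape=(32,), dtype="int32")          # token ids
x = keras.layers.Embedding(10000, 100)(inputs)            # (None, 32, 100)
x = keras.layers.GlobalAveragePooling1D()(x)              # (None, 100): sequence axis collapsed
outputs = keras.layers.Dense(1, activation="sigmoid")(x)  # (None, 1)
model = keras.Model(inputs, outputs)
print(model.output_shape)
```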

I also did an experiment using map to transform the string labels to integers:

def binaryLbl(txt, tlbl):
    return txt, 1 if tlbl == 'Yes' else 0

trainDS2 = trainDS.map(binaryLbl)
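For comparison, a tensor-aware version of the mapping (a sketch using `tf.where`, which compares element-wise and so preserves the batch dimension) yields integer labels that keep their `(None,)` shape:

```python
import tensorflow as tf

def binary_lbl(txt, tlbl):
    # tf.where operates element-wise, so a batched (None,) string
    # label tensor maps to a batched (None,) int32 label tensor
    return txt, tf.where(tf.equal(tlbl, "Yes"), 1, 0)

ds = tf.data.Dataset.from_tensor_slices(
    (["a", "b"], ["Yes", "No"])).batch(2).map(binary_lbl)
print(ds.element_spec)
```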

but `trainDS2.element_spec` now says the label is a scalar with no batch dimension?!

(TensorSpec(shape=(None,), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int32, name=None))

Why does `target.shape=(None,)`? How should I make it match the single prediction output node of the model?

Package versions

torch=2.1.0.post100
torchtext=0.16.1
tensorflow=2.15.0
tensorflow_text=2.15.0
keras=3.0.5
keras_nlp=0.7.0