I want to convert a tab-delimited text into a 2D tensor object so that I can feed the data into a CNN.
What is the proper way to do this?
I wrote the following:
from typing import List, Union, cast

import tensorflow as tf

CellType = Union[str, float, int, bool]
RowType = List[CellType]

# Mapping Python types to TensorFlow data types
TF_DATA_TYPES = {
    str: tf.string,
    float: tf.float32,
    int: tf.int32,
    bool: tf.bool
}


def convert_string_to_tensorflow_object(data_string):
    # Split the string into lines
    linesStringList1d: List[str] = data_string.strip().split('\n')

    # Split each line into columns
    dataStringList2d: List[List[str]] = []
    for line in linesStringList1d:
        rowItem: List[str] = line.split(' ')
        dataStringList2d.append(rowItem)

    # Convert the data to TensorFlow tensors
    listOfRows: List[RowType] = []
    for rowItem in dataStringList2d:
        oneRow: RowType = []
        for stringItem in rowItem:
            oneRow.append(cast(CellType, stringItem))
        listOfRows.append(oneRow)

    # Get the TensorFlow data type based on the Python type of CellType
    tf_data_type = TF_DATA_TYPES[type(CellType)]
    listOfRows = tf.constant(listOfRows, dtype=tf_data_type)

    # Create a TensorFlow dataset
    return listOfRows


if __name__ == "__main__":
    # Example usage
    data_string: str = """
1 ASN C 7.042 9.118 0.000 1 1 1 1 1 0
2 LEU H 5.781 5.488 7.470 0 0 0 0 1 0
3 THR H 5.399 5.166 6.452 0 0 0 0 0 0
4 GLU H 5.373 4.852 6.069 0 0 0 0 1 0
5 LEU H 5.423 5.164 6.197 0 0 0 0 2 0
"""
    tensorflow_dataset = convert_string_to_tensorflow_object(data_string)
    print(tensorflow_dataset)
Output:
C:\Users\pc\AppData\Local\Programs\Python\Python311\python.exe C:/git/heca_v2~~2/src/cnn_lib/convert_string_to_tensorflow_object.py
Traceback (most recent call last):
File "C:\git\heca_v2~~2\src\cnn_lib\convert_string_to_tensorflow_object.py", line 51, in <module>
tensorflow_dataset = convert_string_to_tensorflow_object(data_string)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\git\heca_v2~~2\src\cnn_lib\convert_string_to_tensorflow_object.py", line 34, in convert_string_to_tensorflow_object
tf_data_type = TF_DATA_TYPES[type(CellType)]
~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
KeyError: <class 'typing._UnionGenericAlias'>
Process finished with exit code 1
Can I resolve the error?
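The error occurs because type(CellType) does not return one of the keys in your TF_DATA_TYPES dictionary. CellType is a union type, and calling type() on it returns a class internal to the typing module (typing._UnionGenericAlias in your traceback), not str, float, int, or bool. Note also that cast(CellType, stringItem) does not convert anything at runtime; it only informs the type checker, so every cell is still a str.
Instead of trying to derive the data type from CellType, inspect the actual data items and convert them to the appropriate data type. You can then convert the list of rows into a TensorFlow tensor.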
That would require all data to be of the same data type, so you may need to decide on a common data type that can represent all your data without loss of information.
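To make the first point concrete, here is a minimal sketch of looking up the data type from the actual values rather than from the annotation. DTYPE_NAMES and parse_cell are hypothetical names introduced for illustration; DTYPE_NAMES stands in for TF_DATA_TYPES and uses plain strings so the snippet runs without TensorFlow.

```python
from typing import Union

CellType = Union[str, float, int, bool]

# Hypothetical stand-in for TF_DATA_TYPES: dtype names as plain strings,
# so the lookup logic can be tried without TensorFlow installed.
DTYPE_NAMES = {str: "string", float: "float32", int: "int32", bool: "bool"}

# type() of the alias is a class internal to the typing module,
# so it is never one of the dictionary keys.
print(type(CellType) in DTYPE_NAMES)  # False


def parse_cell(text: str) -> Union[int, float, str]:
    """Return the cell as int or float if it parses, otherwise as str."""
    for parser in (int, float):
        try:
            return parser(text)
        except ValueError:
            continue
    return text


cells = [parse_cell(c) for c in "1\tASN\tC\t7.042".split("\t")]
for value in cells:
    # Look up by the type of the actual value, not by the annotation.
    print(repr(value), DTYPE_NAMES[type(value)])
```

This shows why a single row of the sample ends up with mixed types (int, str, float), which is the reason a common data type has to be chosen before building the tensor.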
Since a TensorFlow CNN (convolutional neural network) typically works with numeric data, try float. The non-numeric columns in your sample (for example ASN and C) cannot be passed to float() directly, so they need to be encoded as numbers or dropped first.
Finally, your code splits each line on a space (' '), but you mentioned that your data is tab-delimited. Change line.split(' ') to line.split('\t'). That way, you split each line on tab characters, convert the string representations of numbers to float (assuming your CNN can work with float data), and create a tf.Tensor from the 2D list of floats.
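Putting these steps together, one possible sketch looks like the following. The function names parse_tab_delimited and rows_to_tensor are made up for this example, and the handling of text cells is an assumption: each non-numeric cell is replaced by an integer code assigned per column in order of first appearance. Dropping those columns or one-hot encoding them would be equally valid choices, depending on what the CNN should see.

```python
from typing import Dict, List


def parse_tab_delimited(text: str, delimiter: str = "\t") -> List[List[float]]:
    """Parse delimited text into a list of rows of floats.

    Assumption: cells that are not numeric (e.g. "ASN", "H") are replaced
    by an integer code, assigned per column in order of first appearance.
    """
    vocabularies: Dict[int, Dict[str, int]] = {}  # column index -> {label: code}
    rows: List[List[float]] = []
    for line in text.strip().splitlines():
        row: List[float] = []
        for col, cell in enumerate(line.split(delimiter)):
            cell = cell.strip()
            try:
                row.append(float(cell))
            except ValueError:
                # Not a number: give the label the next free code in this column.
                vocab = vocabularies.setdefault(col, {})
                row.append(float(vocab.setdefault(cell, len(vocab))))
        rows.append(row)
    return rows


def rows_to_tensor(rows: List[List[float]]):
    """Build a 2D float32 tensor; all rows must have the same length."""
    # Imported here so the parsing step can be used without TensorFlow.
    import tensorflow as tf

    return tf.constant(rows, dtype=tf.float32)


sample = (
    "1\tASN\tC\t7.042\t9.118\t0.000\t1\t1\t1\t1\t1\t0\n"
    "2\tLEU\tH\t5.781\t5.488\t7.470\t0\t0\t0\t0\t1\t0\n"
    "3\tTHR\tH\t5.399\t5.166\t6.452\t0\t0\t0\t0\t0\t0\n"
)
rows = parse_tab_delimited(sample)
print(rows[0])
```

Calling rows_to_tensor(rows) on this sample should then give a float32 tensor of shape (3, 12). If the rows have different lengths, tf.constant raises an error, which is a useful early check that the delimiter is correct.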