How can I convert a Union type into a tensor type?


I want to convert a tab-delimited text into a 2D tensor object so that I can feed the data into a CNN.

What is the proper way to do this?

I wrote the following:

from typing import List, Union, cast
import tensorflow as tf

CellType = Union[str, float, int, bool]
RowType = List[CellType]

# Mapping Python types to TensorFlow data types
TF_DATA_TYPES = {
    str: tf.string,
    float: tf.float32,
    int: tf.int32,
    bool: tf.bool
}

def convert_string_to_tensorflow_object(data_string):
    # Split the string into lines
    linesStringList1d: List[str] = data_string.strip().split('\n')

    # Split each line into columns
    dataStringList2d: List[List[str]] = []
    for line in linesStringList1d:
        rowItem: List[str] = line.split(' ')
        dataStringList2d.append(rowItem)

    # Convert the data to TensorFlow tensors
    listOfRows: List[RowType] = []
    for rowItem in dataStringList2d:
        oneRow: RowType = []
        for stringItem in rowItem:
            oneRow.append(cast(CellType, stringItem))
        listOfRows.append(oneRow)

    # Get the TensorFlow data type based on the Python type of CellType
    tf_data_type = TF_DATA_TYPES[type(CellType)]

    listOfRows = tf.constant(listOfRows, dtype=tf_data_type)

    # Create a TensorFlow dataset
    return listOfRows

if __name__ == "__main__":
    # Example usage
    data_string: str = """
    1 ASN C  7.042   9.118  0.000 1 1 1 1  1  0
    2 LEU H  5.781   5.488  7.470 0 0 0 0  1  0
    3 THR H  5.399   5.166  6.452 0 0 0 0  0  0
    4 GLU H  5.373   4.852  6.069 0 0 0 0  1  0
    5 LEU H  5.423   5.164  6.197 0 0 0 0  2  0
    """

    tensorflow_dataset = convert_string_to_tensorflow_object(data_string)

    print(tensorflow_dataset)

Output:

C:\Users\pc\AppData\Local\Programs\Python\Python311\python.exe C:/git/heca_v2~~2/src/cnn_lib/convert_string_to_tensorflow_object.py
Traceback (most recent call last):
  File "C:\git\heca_v2~~2\src\cnn_lib\convert_string_to_tensorflow_object.py", line 51, in <module>
    tensorflow_dataset = convert_string_to_tensorflow_object(data_string)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\git\heca_v2~~2\src\cnn_lib\convert_string_to_tensorflow_object.py", line 34, in convert_string_to_tensorflow_object
    tf_data_type = TF_DATA_TYPES[type(CellType)]
                   ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
KeyError: <class 'typing._UnionGenericAlias'>

Process finished with exit code 1

How can I resolve this error?

Best answer:

The KeyError occurs because type(CellType) does not return one of the keys in your TF_DATA_TYPES dictionary. CellType is a Union object, so calling type() on it returns the class of that object (typing._UnionGenericAlias in your traceback), never str, float, int, or bool.
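To make this concrete, here is a small sketch that needs no TensorFlow. It shows that the type of the Union object is not one of the dictionary keys, that typing.get_args() can list the member types if you ever need them, and that typing.cast() does not convert anything at runtime. The variable names are only for illustration.

```python
from typing import Union, cast, get_args

CellType = Union[str, float, int, bool]

# type() of the Union object is a class from the typing machinery,
# not one of the four builtins used as dictionary keys.
union_class = type(CellType)
print(union_class in (str, float, int, bool))  # False

# get_args() exposes the member types of the Union.
members = get_args(CellType)
print(members)

# cast() only informs the static type checker; at runtime the value
# is returned unchanged, so a string stays a string.
cell = cast(CellType, "7.042")
print(type(cell))
```

So the cast() loop in the question leaves every cell as a str; no numeric conversion ever happens there.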

Instead of trying to find the data type from CellType, inspect the actual data items and convert them to the appropriate data type.
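One way to inspect the actual data items is to try the numeric conversions in order and fall back to the original text. This is a minimal sketch, not the only way to do it; parse_cell is a hypothetical helper name, and it assumes that anything int() or float() accepts should be treated as a number.

```python
from typing import Union

CellType = Union[str, float, int]

def parse_cell(text: str) -> CellType:
    """Return text as int if possible, else float, else the stripped string."""
    text = text.strip()
    for converter in (int, float):
        try:
            return converter(text)
        except ValueError:
            continue  # not parseable by this converter, try the next one
    return text

row = [parse_cell(item) for item in ["1", "ASN", "C", "7.042"]]
print(row)  # [1, 'ASN', 'C', 7.042]
```

A row parsed this way still mixes strings and numbers, so it cannot go into a single tensor as is; it only tells you which columns are numeric.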

What would be your expected end result?

A 2D Tensor.

To get that, convert the list of rows into a TensorFlow tensor.
A tensor has a single dtype, so all cells must share one data type; you need to decide on a common data type that can represent all your data without loss of information.

Since a TensorFlow CNN (convolutional neural network) typically works with numeric data, try float.
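If you want to keep the text columns (such as ASN or C) rather than drop them, they have to become numbers first. The sketch below assigns each distinct string a float code in order of first appearance. encode_categories is a hypothetical helper, and plain integer codes are an assumption here: they impose an artificial ordering, so a one-hot encoding may suit a CNN better.

```python
from typing import Dict, List

def encode_categories(values: List[str]) -> List[float]:
    """Map each distinct string to a float code, numbered by first appearance."""
    codes: Dict[str, float] = {}
    encoded: List[float] = []
    for value in values:
        if value not in codes:
            codes[value] = float(len(codes))  # next unused code
        encoded.append(codes[value])
    return encoded

residues = ["ASN", "LEU", "THR", "GLU", "LEU"]
print(encode_categories(residues))  # [0.0, 1.0, 2.0, 3.0, 1.0]
```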

Finally, your code splits each line on a single space (' '), but you said that your data is tab-delimited. Change line.split(' ') to line.split('\t').
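If you are not sure whether a file uses tabs or runs of spaces (the sample pasted in the question is aligned with multiple spaces), calling split() with no argument handles both, because it splits on any run of whitespace. A short comparison, using one shortened sample line:

```python
spaced = "3 THR H  5.399"        # columns aligned with a double space
tabbed = "3\tTHR\tH\t5.399"      # the same columns, tab-delimited

# Splitting on a single space leaves an empty string for the extra space.
print(spaced.split(' '))   # ['3', 'THR', 'H', '', '5.399']

# Splitting on any whitespace gives clean fields in both cases.
print(spaced.split())      # ['3', 'THR', 'H', '5.399']
print(tabbed.split())      # ['3', 'THR', 'H', '5.399']
```

The trade-off is that split() cannot represent an empty field, so use split('\t') if a column may legitimately be blank.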

from typing import List
import tensorflow as tf

# Assumption: columns 1 and 2 hold text (e.g. "ASN" and "C") that float()
# cannot parse, so they are left out of the tensor. Adjust for your data.
TEXT_COLUMNS = {1, 2}

def convert_string_to_tensorflow_object(data_string: str) -> tf.Tensor:
    # Split the string into lines
    linesStringList1d: List[str] = data_string.strip().split('\n')

    # Strip the indentation, then split each line into columns on tabs
    dataStringList2d: List[List[str]] = [line.strip().split('\t') for line in linesStringList1d]

    # Convert the numeric string items to float, as CNNs typically work with numeric data
    dataFloatList2d: List[List[float]] = [
        [float(item) for index, item in enumerate(row) if index not in TEXT_COLUMNS]
        for row in dataStringList2d
    ]

    # Convert the data to a 2D TensorFlow tensor
    tensor = tf.constant(dataFloatList2d, dtype=tf.float32)

    return tensor

if __name__ == "__main__":
    # Example usage
    data_string: str = """
    1\tASN\tC\t7.042\t9.118\t0.000\t1\t1\t1\t1\t1\t0
    2\tLEU\tH\t5.781\t5.488\t7.470\t0\t0\t0\t0\t1\t0
    3\tTHR\tH\t5.399\t5.166\t6.452\t0\t0\t0\t0\t0\t0
    4\tGLU\tH\t5.373\t4.852\t6.069\t0\t0\t0\t0\t1\t0
    5\tLEU\tH\t5.423\t5.164\t6.197\t0\t0\t0\t0\t2\t0
    """

    tensorflow_tensor = convert_string_to_tensorflow_object(data_string)
    print(tensorflow_tensor)

That way, you split each line by tab characters, convert the string representations of numbers to float (assuming your CNN can work with float data), and create a tf.Tensor from the 2D list of floats.

| 2D List of mixed dtypes (as per CellType) |
\------+------------------------------------/
       |
       | Convert all elements to float
       v
| 2D List of float                          |
\------+------------------------------------/
       |
       | Convert to TensorFlow Tensor
       v
| 2D TensorFlow Tensor                      |
\-------------------------------------------/
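
The first two stages of the diagram can be sketched in plain Python, with a shape check added before the tensor step, since tf.constant needs every row to have the same length. to_float_rows and its skip_columns parameter are hypothetical names; skipping columns 1 and 2 assumes those are the text columns in your data.

```python
from typing import Iterable, List

def to_float_rows(data_string: str, skip_columns: Iterable[int] = (1, 2)) -> List[List[float]]:
    """Parse whitespace-delimited text into equal-length rows of floats."""
    skip = set(skip_columns)
    rows: List[List[float]] = []
    for line in data_string.strip().splitlines():
        fields = line.split()  # handles tabs and runs of spaces
        if not fields:
            continue  # ignore blank lines
        rows.append([float(f) for i, f in enumerate(fields) if i not in skip])

    # A 2D tensor must be rectangular, so reject ragged input early.
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        raise ValueError(f"rows have differing lengths: {sorted(widths)}")
    return rows

sample = """
1 ASN C  7.042   9.118  0.000 1 1 1 1  1  0
2 LEU H  5.781   5.488  7.470 0 0 0 0  1  0
"""
rows = to_float_rows(sample)
print(len(rows), len(rows[0]))  # 2 10

# Final stage of the diagram (requires TensorFlow):
# tensor = tf.constant(rows, dtype=tf.float32)
```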