Is the format of the preprocessing correct?

I am trying to have a neural network predict whether or not a transaction is suspicious. I have created 50,000 synthetic transactions for training (the format is shown below), but no matter what I do, the network only slowly learns the training data to around 55% accuracy and then barely beats chance (~52%) on the test data. I have tried different network architectures, learning rates, batch sizes, numbers of epochs, etc. I suspect there is a problem with the preprocessing. Thank you in advance.

Below are the code, the tail of the output log (this particular run used 20 epochs), and a snippet of the JSON:

import json

import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

# Read the JSON data
with open('new_transactions_training.json') as f:
    data = json.load(f)

# Initialize lists to store values for each column
types = []
amounts = []
transactionTimes = []
transactionLocations = []
devices = []
paymentMethods = []
recentChanges = []
suspiciousFlags = []

# Iterate over each transaction dictionary
for transaction in data:
    types.append(transaction['type'])
    amounts.append(transaction['amount'])
    transactionTimes.append(transaction['transactionTime'])
    transactionLocations.append(transaction['transactionLocation'])
    devices.append(transaction['device'])
    paymentMethods.append(transaction['paymentMethod'])
    recentChanges.append(transaction['recentChangeInAccountDetails'])
    suspiciousFlags.append(transaction['suspicious'])

# Create DataFrame from the lists of values
df = pd.DataFrame({
    'type': types,
    'amount': amounts,
    'transactionTime': transactionTimes,
    'transactionLocation': transactionLocations,
    'device': devices,
    'paymentMethod': paymentMethods,
    'recentChangeInAccountDetails': recentChanges,
    'suspicious': suspiciousFlags
})
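
# (Aside: since `data` is already a list of flat dicts, the whole block above
# should be equivalent to this one-liner; shorthand note, not part of the run:)
# df = pd.DataFrame(data)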

# Extract features and labels
X = df.drop('suspicious', axis=1)  # Features
y = df['suspicious']  # Labels

# Preprocess the features
# Encode categorical variables and scale numerical features
categorical_features = ['type', 'transactionLocation', 'device', 'paymentMethod']
numerical_features = ['amount', 'transactionTime', 'recentChangeInAccountDetails']
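
# Aside (an assumption on my part): transactionTime is a raw Unix epoch, so after
# StandardScaler it mostly encodes "how recent" the transaction is. If the label
# rules depend on time of day, a cyclical hour encoding might expose that better
# (needs `import numpy as np`); commented-out sketch, not part of the run:
# hours = pd.to_datetime(X['transactionTime'], unit='s').dt.hour
# X['hourSin'] = np.sin(2 * np.pi * hours / 24)
# X['hourCos'] = np.cos(2 * np.pi * hours / 24)
# X = X.drop('transactionTime', axis=1)
# numerical_features = ['amount', 'hourSin', 'hourCos', 'recentChangeInAccountDetails']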

# Define transformers for the preprocessing pipeline
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
numerical_transformer = StandardScaler()

# Combine transformers into a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Apply transformations
X_processed = preprocessor.fit_transform(X)

# Convert sparse matrices to dense arrays
if isinstance(X_processed, csr_matrix):
    X_processed = X_processed.toarray()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=42)
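
# Aside: fit_transform was applied to all of X before the split, so the scaler
# and encoder have already seen the test rows. A leakage-free variant (sketch,
# same settings) would split first and fit on the training rows only:
# X_tr, X_te, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# X_train = preprocessor.fit_transform(X_tr)
# X_test = preprocessor.transform(X_te)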

# Define the neural network architecture
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=100, batch_size=32, validation_data=(X_test, y_test))

# Evaluate the model
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f'Test Accuracy: {test_acc:.3f}')
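
To double-check the encoding itself, this is the kind of snippet I run after fit_transform (a sketch, not part of the script above; get_feature_names_out needs scikit-learn 1.0+):

# Shape of the transformed matrix and the generated column names
print(X_processed.shape)
print(preprocessor.get_feature_names_out())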

End of the output log:

Epoch 20/20

1250/1250 [==============================] - 2s 1ms/step - loss: 0.6777 - accuracy: 0.5484 - val_loss: 0.6888 - val_accuracy: 0.5209

313/313 [==============================] - 0s 1ms/step - loss: 0.6888 - accuracy: 0.5209
Test Accuracy: 0.521
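
Since test accuracy hovers just above 50%, one quick check (a sketch using `y` from the code above) is whether the classes are roughly balanced, because a model collapsing to the majority class would produce exactly this kind of curve:

# Proportion of suspicious vs. non-suspicious labels in the synthetic data
print(y.value_counts(normalize=True))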

JSON snippet:

[{"type":"WITHDRAWAL","amount":280.65,"transactionTime":1708716420.000000000,"transactionLocation":"AUSTRALIA","device":"SMART_WATCH","paymentMethod":"DEBIT_CARD","recentChangeInAccountDetails":false,"suspicious":false},{"type":"WITHDRAWAL","amount":917.46,"transactionTime":1708742400.000000000,"transactionLocation":"AUSTRALIA","device":"MOBILE","paymentMethod":"WIRE_TRANSFER","recentChangeInAccountDetails":false,"suspicious":false},

I expected close to 100% accuracy, since the logic that generates the different transactions and their labels is no more than ten if/else statements.
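
Because the labels come from a handful of if/else rules, a tree model should recover them almost perfectly from the same features; this sketch (using the X_train/X_test split from the code above) is how I would confirm whether the signal survives the preprocessing:

from sklearn.tree import DecisionTreeClassifier

# If the preprocessing preserves the rule-based signal, a shallow tree should
# score near 100%; if it also sits near 50%, the information is lost upstream.
tree = DecisionTreeClassifier(max_depth=10, random_state=42)
tree.fit(X_train, y_train)
print('Tree test accuracy:', tree.score(X_test, y_test))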
