Google Colab keeps crashing when trying to run keras tuner


This is my first machine learning project, working with a dataset I created on my own.

Unfortunately, Google Colab keeps crashing, and it seems to have something to do with Keras Tuner, but I am not sure.

It actually worked for a while, but now it crashes immediately when I run it.

Edit: Colab crashes when I run tuner.search.

The log (read from the bottom up):

Dec 2, 2020, 12:53:12 PM    WARNING 
WARNING:root:kernel e615fcc9-5bdc-44af-ad35-ee2a772f131f restarted
Dec 2, 2020, 12:53:12 PM    INFO    KernelRestarter: 
restarting kernel (1/5), keep random ports
Dec 2, 2020, 12:53:11 PM    WARNING 2020-12-02 11:53:11.006902: 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] 
Created TensorFlow device 
(/job:localhost/replica:0/task:0/device:GPU:0 with 10630 MB memory) 
-> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, 
compute capability: 3.7)
Dec 2, 2020, 12:53:11 PM    WARNING 2020-12-02 11:53:11.006032: 
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] 
successful NUMA node read from SysFS had negative value (-1), 
but there must be at least one NUMA node, so returning NUMA node zero
Dec 2, 2020, 12:53:11 PM    WARNING 2020-12-02 11:53:11.004903: 
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] 
successful NUMA node read from SysFS had negative value (-1), 
but there must be at least one NUMA node, so returning NUMA node zero
Dec 2, 2020, 12:53:11 PM    WARNING 2020-12-02 11:53:11.004580: 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
Dec 2, 2020, 12:53:11 PM    WARNING 2020-12-02 11:53:11.004559: 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
Dec 2, 2020, 12:53:11 PM    WARNING 2020-12-02 11:53:11.004497: 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] 
Device interconnect StreamExecutor with strength 1 edge matrix:
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.529441: 
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] 
Successfully opened dynamic library libcudart.so.10.1
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.529298: 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] 
Adding visible gpu devices: 0
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.528166: 
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] 
successful NUMA node read from SysFS had negative value (-1), 
but there must be at least one NUMA node, so returning NUMA node zero
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.526440: 
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] 
successful NUMA node read from SysFS had negative value (-1), 
but there must be at least one NUMA node, so returning NUMA node zero
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.526344: 
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] 
Successfully opened dynamic library libcudnn.so.7
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.526305: 
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] 
Successfully opened dynamic library libcusparse.so.10
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.526268: 
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] 
Successfully opened dynamic library libcusolver.so.10
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.526227: 
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] 
Successfully opened dynamic library libcurand.so.10
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.526186: 
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] 
Successfully opened dynamic library libcufft.so.10
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.526125: 
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] 
Successfully opened dynamic library libcublas.so.10
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.525706: 
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] 
Successfully opened dynamic library libcudart.so.10.1
Dec 2, 2020, 12:53:10 PM    WARNING coreClock: 0.8235GHz coreCount: 
13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
Dec 2, 2020, 12:53:10 PM    WARNING pciBusID: 0000:00:04.0 name: 
Tesla K80 computeCapability: 3.7
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.525625: 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] 
Found device 0 with properties:
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.524630: 
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] 
successful NUMA node read from SysFS had negative value (-1)
, but there must be at least one NUMA node, so returning NUMA node zero
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.523938: 
I tensorflow/compiler/xla/service/service.cc:176] 
StreamExecutor device (0): Tesla K80, Compute Capability 3.7
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.523902: 
I tensorflow/compiler/xla/service/service.cc:168] 
XLA service 0x7a39500 initialized for platform CUDA 
(this does not guarantee that XLA will be used). Devices:
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.522755: 
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] 
successful NUMA node read from SysFS had negative value (-1), 
but there must be at least one NUMA node, so returning NUMA node zero
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.467341: 
I tensorflow/compiler/xla/service/service.cc:176] 
StreamExecutor device (0): Host, Default Version
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.467308: 
I tensorflow/compiler/xla/service/service.cc:168] 
XLA service 0x2383480 initialized for platform Host 
(this does not guarantee that XLA will be used). Devices:
Dec 2, 2020, 12:53:10 PM    WARNING 2020-12-02 11:53:10.466693: 
I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] 
CPU Frequency: 2300000000 Hz

My code

import tensorflow as tf
import kerastuner
from tensorflow import keras
from kerastuner.tuners import RandomSearch
from kerastuner.engine.hypermodel import HyperModel
from kerastuner.engine.hyperparameters import HyperParameters
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.losses import sparse_categorical_crossentropy

!unzip -q /content/paintings.zip

data_dir = "/content/paintings"

# These three lines are only here because I read somewhere
# that they would help solve the problem, but they don't.
gpu_devices = tf.config.experimental.list_physical_devices('GPU')
for device in gpu_devices:
    tf.config.experimental.set_memory_growth(device, True)

num_classes = 50
nb_epochs = 10
batch_size = 16
img_height = 128
img_width = 128

train_datagen = ImageDataGenerator(rescale=1./255,
    validation_split=0.2) 

train_generator = train_datagen.flow_from_directory(
    data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    shuffle=True,
    class_mode="sparse",
    subset='training') 

validation_generator = train_datagen.flow_from_directory(
    data_dir, 
    target_size=(img_height, img_width),
    batch_size=batch_size,
    shuffle=True,
    class_mode="sparse",
    subset='validation') 

hp = HyperParameters()
hp.Choice('learning_rate', [0.005, 1e-4])
hp.Int('num_layers_conv', 1, 5)
hp.Int('num_layers_dense', 1, 3)
hp.Int('dense_n',
        min_value=0,
        max_value=500,
        step=50)
hp.Choice(
        'activation',
        values=['relu', 'tanh'],
        default='relu')
hp.Float('dropout',
          min_value=0.0,
          max_value=0.5,
          default=0.25,
          step=0.05)

def build_model(hp):
    model = keras.Sequential()

    for i in range(hp.get('num_layers_conv')):
        model.add(layers.Conv2D(
            filters=hp.Int('filters_' + str(i), 0, 512, step=32),
            kernel_size=hp.Int('kernel_size_' + str(i), 3, 5),
            padding="same",
            activation=hp.get('activation')))

    model.add(layers.MaxPooling2D(pool_size=(2,2)))
  
    model.add(layers.Conv2D(32, kernel_size=(3, 3), activation='relu'))
    
    model.add(layers.MaxPooling2D(pool_size=(2,2)))

    model.add(layers.Flatten())
    
    for i in range(hp.get('num_layers_dense')):
        model.add(layers.Dense(units=hp.get('dense_n'),
                               activation=hp.get('activation')))
        model.add(layers.BatchNormalization())
        model.add(layers.Dropout(rate=hp.get('dropout')))

    model.add(layers.Dense(num_classes, activation='softmax'))
    
    model.compile(
        optimizer=keras.optimizers.Adam(hp.get('learning_rate')),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'])
    
    return model

tuner = RandomSearch(
    build_model,
    max_trials=100,
    executions_per_trial=1,
    hyperparameters=hp,
    directory = "output",
    project_name = "ArtNet",
    objective='val_accuracy')

tuner.search(train_generator,
             epochs=10,
             validation_data=validation_generator)

Any help would be really appreciated!

There are 2 answers below.

Answer 1:

This could be because multiple Colab tabs are open and you are running out of RAM. Use only a single tab and run the process. Use the code below to check how much RAM you have and how much the process takes when you start it. Let me know if this works.

# Memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil psutil humanize

import os
import psutil
import humanize
import GPUtil as GPU

GPUs = GPU.getGPUs()
# XXX: only one GPU on Colab and it isn't guaranteed
gpu = GPUs[0]

def printm():
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available),
          " | Proc size: " + humanize.naturalsize(process.memory_info().rss))
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(
        gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil * 100, gpu.memoryTotal))

printm()
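
You could, for example, call printm() right before the call that crashes, to see how close you already are to the RAM limit. A minimal sketch, assuming the tuner and generators from the question are already defined:

printm()  # baseline: free system RAM and GPU memory before the search

tuner.search(train_generator,
             epochs=10,
             validation_data=validation_generator)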
Answer 2:

For me, changing the image size worked. If you are loading lots of images, reduce them to half their size and then try again.
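
For example, with the generators from the question, that could mean halving img_height and img_width (64 is just an illustrative value):

# Halving the input resolution roughly quarters the memory needed per image.
img_height = 64   # was 128
img_width = 64    # was 128

train_generator = train_datagen.flow_from_directory(
    data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    shuffle=True,
    class_mode="sparse",
    subset='training')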