How to use TCMalloc on Google Cloud ML Engine


How can I use TCMalloc on Google Cloud ML Engine? Apart from TCMalloc, is there any other way to solve memory leak issues on ML Engine?

Finalizing the graph doesn't seem to help.
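
For reference, "finalizing the graph" here means calling Graph.finalize() once the model is built, so that any op accidentally created inside the training loop raises an error instead of silently growing the graph. Roughly like this (a simplified TF 1.x sketch, not my exact trainer code):

    import tensorflow as tf

    graph = tf.Graph()
    with graph.as_default():
        # ... build the model, loss and train op here ...
        train_op = tf.no_op(name='train')  # stand-in for the real train op
        init_op = tf.global_variables_initializer()

    # Freeze the graph: adding ops after this point raises an error,
    # which should rule out graph growth as the source of the leak.
    graph.finalize()

    with tf.Session(graph=graph) as sess:
        sess.run(init_op)
        for _ in range(1000):  # training loop
            sess.run(train_op)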


Memory utilization graph: [image]

I got an out-of-memory error after training for 73 epochs. Here is part of the training log:

11:26:33.707 Job failed.

11:26:20.949 Finished tearing down TensorFlow.

11:25:18.568 The replica master 0 ran out-of-memory and exited with a non-zero status of 247. To find out more about why your job exited please check the logs

11:25:07.785 Clean up finished.

11:25:07.785 Module completed; cleaning up.

11:25:07.783 Module raised an exception for failing to call a subprocess Command '['python', '-m', u'trainer.main', u'--data=gs://', u'--train_log_dir=gs://tfoutput/joboutput', u'--model=trainer.crisp_model', u'--num_threads=32', u'--memory_usage=0.8', u'--max_out_norm=1', u'--train_batch_size=64', u'--sample_size=112', u'--num_gpus=4', u'--allow_growth=True', u'--weight_loss_by_train_size=True', u'-x', returned non-zero exit status -9.

11:23:08.853 PNG warning: Exceeded size limit while expanding chunk

11:18:18.474 epoch 58.0: accuracy = 0.9109

11:17:14.851 2017-05-17 10:17:14.851024: epoch 58, loss = 0.12, lr = 0.085500 (228.9 examples/sec; 0.280 sec/batch)

11:15:39.532 PNG warning: Exceeded size limit while expanding chunk

11:10:23.855 PoolAllocator: After 372618242 get requests, put_count=372618151 evicted_count=475000 eviction_rate=0.00127476 and unsatisfied allocation rate=0.00127518

11:05:32.928 PNG warning: Exceeded size limit while expanding chunk

10:59:26.006 epoch 57.0: accuracy = 0.8868

10:58:24.117 2017-05-17 09:58:24.117444: epoch 57, loss = 0.23, lr = 0.085750 (282.2 examples/sec; 0.227 sec/batch)

10:54:37.440 PNG warning: Exceeded size limit while expanding chunk

10:53:30.323 PoolAllocator: After 366350973 get requests, put_count=366350992 evicted_count=465000 eviction_rate=0.00126927 and unsatisfied allocation rate=0.0012694

10:51:51.417 PNG warning: Exceeded size limit while expanding chunk

10:40:43.811 epoch 56.0: accuracy = 0.7897

10:39:41.308 2017-05-17 09:39:41.308624: epoch 56, loss = 0.06, lr = 0.086000 (273.8 examples/sec; 0.234 sec/batch)

10:38:14.522 PoolAllocator: After 360630699 get requests, put_count=360630659 evicted_count=455000 eviction_rate=0.00126168 and unsatisfied allocation rate=0.00126197

10:36:10.480 PNG warning: Exceeded size limit while expanding chunk

10:21:50.715 epoch 55.0: accuracy = 0.9175

10:20:51.801 PoolAllocator: After 354197216 get requests, put_count=354197255 evicted_count=445000 eviction_rate=0.00125636 and unsatisfied allocation rate=0.00125644

10:20:49.815 2017-05-17 09:20:49.815251: epoch 55, loss = 0.25, lr = 0.086250 (285.6 examples/sec; 0.224 sec/batch)

10:02:56.637 epoch 54.0: accuracy = 0.9191

10:01:57.367 2017-05-17 09:01:57.367369: epoch 54, loss = 0.09, lr = 0.086500 (256.5 examples/sec; 0.249 sec/batch)

10:01:42.365 PoolAllocator: After 347107694 get requests, put_count=347107646 evicted_count=435000 eviction_rate=0.00125321 and unsatisfied allocation rate=0.00125354

09:45:56.116 PNG warning: Exceeded size limit while expanding chunk

09:44:12.698 epoch 53.0: accuracy = 0.9039

09:43:09.888 2017-05-17 08:43:09.888202: epoch 53, loss = 0.10, lr = 0.086750 (307.0 examples/sec; 0.208 sec/batch)

09:41:48.672 PoolAllocator: After 339747205 get requests, put_count=339747210 evicted_count=425000 eviction_rate=0.00125093 and unsatisfied allocation rate=0.00125111

09:36:14.085 PNG warning: Exceeded size limit while expanding chunk

09:35:11.686 PNG warning: Exceeded size limit while expanding chunk

09:34:45.011 PNG warning: Exceeded size limit while expanding chunk

09:31:03.212 PNG warning: Exceeded size limit while expanding chunk

09:28:40.116 PoolAllocator: After 335014430 get requests, put_count=335014342 evicted_count=415000 eviction_rate=0.00123875 and unsatisfied allocation rate=0.00123921

09:27:38.374 PNG warning: Exceeded size limit while expanding chunk

09:25:23.913 PNG warning: Exceeded size limit while expanding chunk

09:25:16.065 epoch 52.0: accuracy = 0.9313

09:24:16.963 2017-05-17 08:24:16.962930: epoch 52, loss = 0.11, lr = 0.087000 (278.7 examples/sec; 0.230 sec/batch)

09:17:48.417 PNG warning: Exceeded size limit while expanding chunk

09:13:34.740 PoolAllocator: After 329380055 get requests, put_count=329379978 evicted_count=405000 eviction_rate=0.00122958 and unsatisfied allocation rate=0.00123001

09:06:09.948 update epoch 51.0: accuracy = 0.9357

09:06:09.948 epoch 51.0: accuracy = 0.9357

09:05:09.575 2017-05-17 08:05:09.575641: epoch 51, loss = 0.11, lr = 0.087250 (248.4 examples/sec; 0.258 sec/batch)

08:59:17.735 PNG warning: Exceeded size limit while expanding chunk

08:55:58.605 PoolAllocator: After 322904781 get requests, put_count=322904714 evicted_count=395000 eviction_rate=0.00122327 and unsatisfied allocation rate=0.00122368

08:48:46.322 PNG warning: Exceeded size limit while expanding chunk

08:47:27.936 epoch 50.0: accuracy = 0.9197

08:46:29.370 2017-05-17 07:46:29.370135: epoch 50, loss = 0.20, lr = 0.087500 (253.2 examples/sec; 0.253 sec/batch)

I've tried using TCMalloc for training on my local machine; there is still a memory leak, but it is smaller than without TCMalloc.
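
On the local machine I enable TCMalloc by preloading it before Python starts (LD_PRELOAD). Since on ML Engine I only control the Python package, I was considering a wrapper like the sketch below at the top of trainer.main, assuming libtcmalloc is even installed on the worker image; the .so path is from my local Ubuntu setup and is an assumption, not something I've verified on ML Engine:

    import os
    import sys

    # Assumed location of TCMalloc; adjust to wherever gperftools /
    # libtcmalloc is installed (it may not exist at all on ML Engine workers).
    TCMALLOC = '/usr/lib/libtcmalloc.so.4'

    if 'LD_PRELOAD' not in os.environ and os.path.exists(TCMALLOC):
        # LD_PRELOAD must be set before the interpreter starts, so re-exec
        # this process with the variable added to the environment.
        env = dict(os.environ, LD_PRELOAD=TCMALLOC)
        os.execve(sys.executable, [sys.executable] + sys.argv, env)

    # ...the normal trainer entry point continues below...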

1 Answer

TensorFlow uses jemalloc by default, and that is what is used on CloudML Engine as well:

jemalloc is a general purpose malloc(3) implementation that emphasizes fragmentation avoidance and scalable concurrency support.

So fragmentation is not likely the root cause of your memory issues.
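
If you want to confirm where the growth comes from, one simple check is to log the process's resident memory once per epoch and see whether it still climbs with the graph finalized. A minimal, Linux-only sketch (not specific to Cloud ML Engine):

    def current_rss_mb():
        # Read the current resident set size from /proc (Linux only).
        with open('/proc/self/status') as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1]) / 1024.0  # kB -> MB
        return 0.0

    # Inside the training loop, e.g. once per epoch:
    # print('epoch %d: RSS = %.1f MB' % (epoch, current_rss_mb()))

If the resident size grows steadily per epoch even with a finalized graph, the leak is more likely in Python-side state accumulated across epochs (or in the input pipeline) than in allocator fragmentation.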