Connect the Kaggle GPU manually


I know most users use TensorFlow or PyTorch as their modeling framework, but I am trying to convert a model (called ernie-doc) written in PaddlePaddle so that it runs on Kaggle, and I suspect some GPU connection issue is happening.

!pip install -q -U paddlepaddle-gpu
import paddle
import paddle.fluid as fluid
paddle.enable_static()
# as the documentation suggests, verify the installation
fluid.install_check.run_check()

It runs successfully:

Running Verify Fluid Program ... 
Your Paddle Fluid works well on SINGLE GPU or CPU.
Your Paddle Fluid works well on MUTIPLE GPU or CPU.
Your Paddle Fluid is installed successfully! Let's start deep Learning with Paddle Fluid
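
As an extra sanity check, I also looked at how many CUDA devices Paddle can see; get_cuda_device_count and cuda_places are my best guess at the right fluid helpers for this:

import paddle.fluid as fluid

# how many CUDA devices does Paddle see on this kernel?
print(fluid.core.get_cuda_device_count())  # expect 1 on a single-P100 session
print(fluid.cuda_places())                 # expect one place, for GPU 0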

However, things get weird when fitting the model:

import os
import sys

# make the ernie-doc repo (mounted as a Kaggle dataset) importable
sys.path.append(os.path.abspath("/kaggle/input/erniedoc/ernie-doc"))
from finetune.classifier import create_model, evaluate
...
print("use gpu...")
place = fluid.CUDAPlace(0)
startup_prog = fluid.Program()
train_program = fluid.Program()
origin_train_program = train_program
exe = fluid.Executor(place)
exe.run(startup_prog)
init_model(args, exe, startup_prog)
...
outputs = evaluate(exe, train_program, train_pyreader, graph_vars, 
                                        train_mems_vars, tower_mems_np,
                                       "train", steps, trainer_id, dev_count, scheduled_lr, use_vars=args.use_vars)
...

It complains:

RuntimeError                              Traceback (most recent call last)
<ipython-input-8-51b504e78714> in main(args)
    163                         outputs = evaluate(train_exe, train_program, train_pyreader, graph_vars, 
    164                                         train_mems_vars, tower_mems_np,
--> 165                                        "train", steps, trainer_id, dev_count, scheduled_lr, use_vars=args.use_vars)
    166                         tower_mems_np = outputs['tower_mems_np']
    167 

...

/opt/conda/lib/python3.7/site-packages/paddle/fluid/executor.py in _run_program(self, program, feed, fetch_list, feed_var_name, fetch_var_name, scope, return_numpy, use_program_cache)
   1230         else:
   1231             self._default_executor.run_prepared_ctx(ctx, scope, False, False,
-> 1232                                                     False)
   1233         arr = scope.find_var(fetch_var_name).get_fetch_list()
   1234         tensors = arr._move_to_list()

RuntimeError: 

--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0   std::string paddle::platform::GetTraceBackString<std::string>(std::string&&, char const*, int)
1   paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
2   paddle::memory::allocation::AlignedAllocator::AllocateImpl(unsigned long)
3   paddle::memory::allocation::AutoGrowthBestFitAllocator::AllocateImpl(unsigned long)
4   paddle::memory::allocation::Allocator::Allocate(unsigned long)
5   paddle::memory::allocation::RetryAllocator::AllocateImpl(unsigned long)
6   paddle::memory::allocation::AllocatorFacade::Alloc(paddle::platform::Place const&, unsigned long)
7   paddle::memory::allocation::AllocatorFacade::AllocShared(paddle::platform::Place const&, unsigned long)
8   paddle::memory::AllocShared(paddle::platform::Place const&, unsigned long)
9   paddle::framework::Tensor::mutable_data(paddle::platform::Place const&, paddle::framework::proto::VarType_Type, unsigned long)
10  paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const
11  std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, float>, paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, double>, paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
12  paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
13  paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
14  paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
15  paddle::framework::Executor::RunPartialPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, long, long, bool, bool, bool)
16  paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool, bool)

----------------------
Error Message Summary:
----------------------
ResourceExhaustedError: 

Out of memory error on GPU 0. Cannot allocate 432.000244MB memory on GPU 0, 15.811646GB memory has been allocated and available memory is only 89.750000MB.

Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model. 

 (at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:79)
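
To separate "the GPU is unusable" from "the model is simply too big", here is a minimal smoke test along the same static-graph executor path (my own sketch, not from the ernie-doc repo):

import numpy as np
import paddle
import paddle.fluid as fluid

paddle.enable_static()
place = fluid.CUDAPlace(0)
main_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(main_prog, startup_prog):
    x = fluid.data(name="x", shape=[4, 4], dtype="float32")
    y = fluid.layers.matmul(x, x)
exe = fluid.Executor(place)
exe.run(startup_prog)
out, = exe.run(main_prog,
               feed={"x": np.ones((4, 4), np.float32)},
               fetch_list=[y])
print(out.sum())  # ones(4,4) @ ones(4,4) sums to 64.0

If this tiny program also hits the OOM, the memory is gone before my model even starts; if it runs, the 432 MB allocation in the traceback is just the last straw on an already-full device.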

What is going on here? The script is adapted from the official one, and it clearly allocated some memory before failing, so I assume the GPU is connected and the script has no bug up to that point. But why the OOM? GPU 0 has 16 GB of memory and nothing else is running. Checking the GPU info afterwards:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.04   Driver Version: 450.119.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P0    35W / 250W |  16191MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
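
To cross-check the nvidia-smi numbers from inside the notebook, here is a minimal query via pynvml (assuming the package is available on the Kaggle image; otherwise pip install nvidia-ml-py3):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0, the only device here
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print("used:  %8.0f MiB" % (info.used / 1024**2))
print("free:  %8.0f MiB" % (info.free / 1024**2))
print("total: %8.0f MiB" % (info.total / 1024**2))
pynvml.nvmlShutdown()

Either way, the odd part stands: the process table above is empty even though almost all of the 16 GB shows as used.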

Should I stop some process, or do something else? Besides decreasing the batch size as the error message suggests, the only other knob I have found so far is Paddle's GPU memory flags (untested sketch below). Any suggestion would be appreciated!
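
If I read the PaddlePaddle docs correctly, these FLAGS_* environment variables are read at startup, so they must be set before import paddle runs:

import os

# FLAGS_* names are from the PaddlePaddle docs; untested on my side.
# Cap how much GPU memory Paddle pre-allocates (default is most of the card).
os.environ["FLAGS_fraction_of_gpu_memory_to_use"] = "0.5"
# Free intermediate tensors as soon as they are no longer needed.
os.environ["FLAGS_eager_delete_tensor_gb"] = "0.0"

import paddle  # the flags must be set before this import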
