I know most people use TensorFlow or PyTorch as their modeling framework, but I am trying to convert a model (called ERNIE-Doc) written in Paddle so it runs on Kaggle, and I suspect I am hitting some GPU issue.
!pip install -q -U paddlepaddle-gpu
import paddle
import paddle.fluid as fluid
paddle.enable_static()
# as the documentation suggests, run the install check
fluid.install_check.run_check()
It runs successfully:
Running Verify Fluid Program ...
Your Paddle Fluid works well on SINGLE GPU or CPU.
Your Paddle Fluid works well on MUTIPLE GPU or CPU.
Your Paddle Fluid is installed successfully! Let's start deep Learning with Paddle Fluid
However, things get weird when I fit the model:
import os
import sys
sys.path.append(os.path.abspath("/kaggle/input/erniedoc/ernie-doc"))
from finetune.classifier import create_model, evaluate
...
print("use gpu...")
place = fluid.CUDAPlace(0)
startup_prog = fluid.Program()
train_program = fluid.Program()
origin_train_program = train_program
exe = fluid.Executor(place)
exe.run(startup_prog)
init_model(args, exe, startup_prog)
...
outputs = evaluate(exe, train_program, train_pyreader, graph_vars,
                   train_mems_vars, tower_mems_np,
                   "train", steps, trainer_id, dev_count, scheduled_lr,
                   use_vars=args.use_vars)
...
It complains:
RuntimeError Traceback (most recent call last)
<ipython-input-8-51b504e78714> in main(args)
163 outputs = evaluate(train_exe, train_program, train_pyreader, graph_vars,
164 train_mems_vars, tower_mems_np,
--> 165 "train", steps, trainer_id, dev_count, scheduled_lr, use_vars=args.use_vars)
166 tower_mems_np = outputs['tower_mems_np']
167
...
/opt/conda/lib/python3.7/site-packages/paddle/fluid/executor.py in _run_program(self, program, feed, fetch_list, feed_var_name, fetch_var_name, scope, return_numpy, use_program_cache)
1230 else:
1231 self._default_executor.run_prepared_ctx(ctx, scope, False, False,
-> 1232 False)
1233 arr = scope.find_var(fetch_var_name).get_fetch_list()
1234 tensors = arr._move_to_list()
RuntimeError:
--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0 std::string paddle::platform::GetTraceBackString<std::string>(std::string&&, char const*, int)
1 paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
2 paddle::memory::allocation::AlignedAllocator::AllocateImpl(unsigned long)
3 paddle::memory::allocation::AutoGrowthBestFitAllocator::AllocateImpl(unsigned long)
4 paddle::memory::allocation::Allocator::Allocate(unsigned long)
5 paddle::memory::allocation::RetryAllocator::AllocateImpl(unsigned long)
6 paddle::memory::allocation::AllocatorFacade::Alloc(paddle::platform::Place const&, unsigned long)
7 paddle::memory::allocation::AllocatorFacade::AllocShared(paddle::platform::Place const&, unsigned long)
8 paddle::memory::AllocShared(paddle::platform::Place const&, unsigned long)
9 paddle::framework::Tensor::mutable_data(paddle::platform::Place const&, paddle::framework::proto::VarType_Type, unsigned long)
10 paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const
11 std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, float>, paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, double>, paddle::operators::MatMulKernel<paddle::platform::CUDADeviceContext, paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
12 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
13 paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
14 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
15 paddle::framework::Executor::RunPartialPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, long, long, bool, bool, bool)
16 paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool, bool)
----------------------
Error Message Summary:
----------------------
ResourceExhaustedError:
Out of memory error on GPU 0. Cannot allocate 432.000244MB memory on GPU 0, 15.811646GB memory has been allocated and available memory is only 89.750000MB.
Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model.
(at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:79)
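Doing the arithmetic on the numbers in that message (just a sanity check on my part), the allocated plus free memory accounts for essentially the whole card, so the allocator really has grabbed all 16 GB before this request:

```python
# Sanity check on the numbers reported in the error message: does
# allocated + free add up to the card's total memory?
allocated_gb = 15.811646   # "15.811646GB memory has been allocated"
free_mb = 89.750000        # "available memory is only 89.750000MB"
requested_mb = 432.000244  # "Cannot allocate 432.000244MB"

total_mb = allocated_gb * 1024 + free_mb
print(f"allocated + free ~ {total_mb:.0f} MiB")          # close to the card's 16280 MiB
print(f"shortfall: {requested_mb - free_mb:.2f} MiB")     # how much the request overshoots
```

So the request overshoots the remaining memory by roughly 342 MiB, and the total matches the 16280 MiB that nvidia-smi reports for the P100.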
What is going on here? The script is modified from the official one, and it has already allocated some memory, so I assume the GPU is connected and the script has no bug up to this point. But why the OOM? GPU 0 has 16 GB of memory and nothing else is running on it. Checking the GPU info afterwards:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.04 Driver Version: 450.119.04 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:00:04.0 Off | 0 |
| N/A 41C P0 35W / 250W | 16191MiB / 16280MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Should I stop some process, or do something else? Any suggestion would be appreciated!
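P.S. One thing I am considering (not sure whether it is the right fix): Paddle's GPU allocator can be tuned through `FLAGS_*` environment variables, which have to be set before `paddle` is imported. The flag names below are from the PaddlePaddle docs; whether they actually resolve this OOM is an assumption on my part. A sketch:

```python
import os

def set_paddle_memory_flags():
    """Set Paddle's GPU-memory flags; must run before `import paddle`."""
    # Allocate on demand instead of grabbing a large pool up front.
    os.environ["FLAGS_allocator_strategy"] = "auto_growth"
    # Cap the fraction of the card Paddle may pre-allocate.
    os.environ["FLAGS_fraction_of_gpu_memory_to_use"] = "0.5"
    # Garbage-collect dead tensors immediately in static-graph mode.
    os.environ["FLAGS_eager_delete_tensor_gb"] = "0.0"

set_paddle_memory_flags()
# ... only now: import paddle
```

That, plus lowering the batch size as the error message suggests, is what I plan to try next.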