Why does cv2.dnn run faster on the CPU than on the GPU?


I am new to OpenCV with CUDA, so I have been testing the simplest case, loading a model on the GPU rather than the CPU to see how much faster the GPU is, and I am horrified by the result I get.

----------------------------------------------------------------
---         GPU                vs             CPU            ---
---                                                          ---
--- 21.913758993148804 seconds --- 3.0586464405059814 seconds ---
--- 22.379303455352783 seconds --- 3.1384341716766357 seconds ---
--- 21.500431060791016 seconds --- 2.9400241374969482 seconds ---
--- 21.292986392974854 seconds --- 3.3738017082214355 seconds ---
--- 20.88358211517334 seconds  --- 3.388749599456787 seconds  ---

I will give my code snippet in case I am doing something wrong that causes the GPU time to spike so high.

import base64
import io
import time
from io import StringIO

import cv2
import numpy as np
from PIL import Image
from imutils import resize  # assumed source of resize(); the import is not shown in the original snippet
from flask_socketio import emit  # the socketio app object is assumed to be set up elsewhere


def loadYolo():
    net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
    
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)

    classes = []
    with open("coco.names", "r") as f:
        classes = [line.strip() for line in f.readlines()]

    layer_names = net.getLayerNames()
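    # NOTE: the i[0] - 1 indexing matches OpenCV 4.5.0, where getUnconnectedOutLayers()
    # returns an Nx1 array; newer releases return a flat array (use layer_names[i - 1] there)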
    output_layers = [layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers()]
    return net,classes,layer_names,output_layers


@socketio.on('image')
def image(data_image):

    sbuf = StringIO()  # note: sbuf is written to but never read afterwards
    sbuf.write(data_image)
    
    b = io.BytesIO(base64.b64decode(data_image))
    if(str(data_image) == 'data:,'):
        pass
    else:
        pimg = Image.open(b)
    
        frame = cv2.cvtColor(np.array(pimg), cv2.COLOR_RGB2BGR)
        frame = resize(frame, width=700)
        frame = cv2.flip(frame, 1)
    
        # NOTE: rebuilds the network and reloads the weights from disk on every frame
        net, classes, layer_names, output_layers = loadYolo()
        height, width, channels = frame.shape

        
        blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                     swapRB=True, crop=False)

       
        start_time = time.time()  # assumed: the timer presumably started here; not shown in the original snippet
        net.setInput(blob)
        outs = net.forward(output_layers)
        print("--- %s seconds ---" % (time.time() - start_time))
        
        
        class_ids = []
        confidences = []
        boxes = []
        for out in outs:
            for detection in out:
                scores = detection[5:]
                class_id = np.argmax(scores)
                confidence = scores[class_id]
                if confidence > 0.5:
                    # Object detected
                    center_x = int(detection[0] * width)
                    center_y = int(detection[1] * height)
                    w = int(detection[2] * width)
                    h = int(detection[3] * height)

                    # Rectangle coordinates
                    x = int(center_x - w / 2)
                    y = int(center_y - h / 2)

                    boxes.append([x, y, w, h])
                    confidences.append(float(confidence))
                    class_ids.append(class_id)

        indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
        font = cv2.FONT_HERSHEY_PLAIN
        colors = np.random.uniform(0, 255, size=(len(classes), 3))
        for i in range(len(boxes)):
            if i in indexes:
                x, y, w, h = boxes[i]
                label = str(classes[class_ids[i]])
                color = colors[class_ids[i]]
                cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
                cv2.putText(frame, label, (x, y + 30), font, 1, color, 2)
    
        imgencode = cv2.imencode('.jpg', frame)[1]

        stringData = base64.b64encode(imgencode).decode('utf-8')
        b64_src = 'data:image/jpg;base64,'
        stringData = b64_src + stringData
        emit('response_back', stringData)

My GPU is an NVIDIA GTX 1050 Ti and my CPU is a 9th-gen Intel i5, in case anyone needs the specifications. Can someone please enlighten me? I am super confused right now. Thank you very much.

EDIT 1: I tried using cv2.dnn.DNN_TARGET_CUDA instead of cv2.dnn.DNN_TARGET_CUDA_FP16, but the time is still terrible compared to the CPU. Below is the GPU result:

--- 10.91195559501648 seconds ---
--- 11.344025135040283 seconds ---
--- 11.754926204681396 seconds ---
--- 12.779674530029297 seconds ---

Below is the CPU result:

--- 4.780993223190308 seconds ---
--- 4.910650253295898 seconds ---
--- 4.990436553955078 seconds ---
--- 5.246175050735474 seconds ---

It is still slower than the CPU.

EDIT 2: OpenCV is 4.5.0, CUDA is 11.1, and cuDNN is 8.0.1.


3 Answers

Answer 1 (score: 7)

DNN_TARGET_CUDA_FP16 refers to 16-bit floating point. Since your GPU is a 1050 Ti, it does not seem to work well with FP16: consumer Pascal cards run FP16 at a small fraction (1/64) of their FP32 rate, so FP16 inference is actually much slower than FP32 on them. You can check your card's FP16 support and compute capability in NVIDIA's documentation. I think you should change this line:

net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)

into:

net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
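
If you are not sure how well your card handles FP16, you can query the device and time both targets yourself. Below is a minimal sketch of that comparison, assuming OpenCV was built with CUDA support and the yolov4 files from the question are present; time_target is just an illustrative helper name:

import time

import cv2
import numpy as np

# Print compute capability and other details of device 0
cv2.cuda.printCudaDeviceInfo(0)

def time_target(target, runs=5):
    # Build a fresh net for the given target and time the steady-state forward pass
    net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
    net.setPreferableTarget(target)
    blob = cv2.dnn.blobFromImage(np.zeros((416, 416, 3), np.uint8),
                                 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    net.forward(net.getUnconnectedOutLayersNames())  # warm-up: the first pass pays one-time CUDA init costs
    start = time.time()
    for _ in range(runs):
        net.forward(net.getUnconnectedOutLayersNames())
    return (time.time() - start) / runs

print("FP32:", time_target(cv2.dnn.DNN_TARGET_CUDA))
print("FP16:", time_target(cv2.dnn.DNN_TARGET_CUDA_FP16))

On a Pascal card like the 1050 Ti, the FP32 target should come out well ahead.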
Answer 2 (score: 2)

You should definitely only load YOLO once. Recreating it for every image that comes through the socket is slow on both CPU and GPU, but the GPU takes longer to initialize, which is why you're seeing it run slower than the CPU.

I don't understand what you mean by using an LRU cache for your YOLO model. Without seeing the rest of your code structure I can't make any real suggestions, but can you at least temporarily put the network in the global scope, just to see if it runs faster? (Remove the function altogether and put its body at module level; one possible lru_cache reading of your idea is sketched after the code below.)

Something like this:

# The network is built once at import time and shared by every request
net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")

net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)

classes = []
with open("coco.names", "r") as f:
    classes = [line.strip() for line in f.readlines()]

layer_names = net.getLayerNames()
output_layers = [layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers()]


@socketio.on('image')
def image(data_image):

    sbuf = StringIO()
    sbuf.write(data_image)
    
    b = io.BytesIO(base64.b64decode(data_image))
    if(str(data_image) == 'data:,'):
        pass
    else:
        pimg = Image.open(b)
    
        frame = cv2.cvtColor(np.array(pimg), cv2.COLOR_RGB2BGR)
        frame = resize(frame, width=700)
        frame = cv2.flip(frame, 1)
    
        height, width, channels = frame.shape

        
        blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                     swapRB=True, crop=False)

       
        start_time = time.time()  # assumed: the timer starts here; not shown in the original
        net.setInput(blob)
        outs = net.forward(output_layers)
        print("--- %s seconds ---" % (time.time() - start_time))
        
        
        class_ids = []
        confidences = []
        boxes = []
        for out in outs:
            for detection in out:
                scores = detection[5:]
                class_id = np.argmax(scores)
                confidence = scores[class_id]
                if confidence > 0.5:
                    # Object detected
                    center_x = int(detection[0] * width)
                    center_y = int(detection[1] * height)
                    w = int(detection[2] * width)
                    h = int(detection[3] * height)

                    # Rectangle coordinates
                    x = int(center_x - w / 2)
                    y = int(center_y - h / 2)

                    boxes.append([x, y, w, h])
                    confidences.append(float(confidence))
                    class_ids.append(class_id)

        indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
        font = cv2.FONT_HERSHEY_PLAIN
        colors = np.random.uniform(0, 255, size=(len(classes), 3))
        for i in range(len(boxes)):
            if i in indexes:
                x, y, w, h = boxes[i]
                label = str(classes[class_ids[i]])
                color = colors[class_ids[i]]
                cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
                cv2.putText(frame, label, (x, y + 30), font, 1, color, 2)
    
        imgencode = cv2.imencode('.jpg', frame)[1]

        stringData = base64.b64encode(imgencode).decode('utf-8')
        b64_src = 'data:image/jpg;base64,'
        stringData = b64_src + stringData
        emit('response_back', stringData)
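
If restructuring everything into module scope is awkward for your app, the LRU-cache idea you mentioned amounts to the same thing: memoize loadYolo so the network is only built on the first call. Here is a sketch under that assumption, using the loadYolo from the question (with the FP32 target per the other answer):

from functools import lru_cache

import cv2

@lru_cache(maxsize=1)  # the first call builds the net; every later call returns the cached tuple
def loadYolo():
    net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
    with open("coco.names") as f:
        classes = tuple(line.strip() for line in f)
    layer_names = net.getLayerNames()
    output_layers = tuple(layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers())
    return net, classes, layer_names, output_layers

The handler's net, classes, layer_names, output_layers = loadYolo() line then works unchanged but only pays the load cost once; tuples are returned instead of lists so the cached values cannot be mutated between requests.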
Answer 3 (score: 0)

Based on the previous two answers, I managed to arrive at the solution. Changing:

net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)

into:

net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

doubled the GPU speed, since my GPU type is not compatible with FP16 (thanks to Amir Karami). And although Ian Chu's answer did not solve my problem by itself, it gave me the basis to force all the images to use a single net instance. That lowered the processing time dramatically, from each frame needing about 10 seconds to 0.03-0.04 seconds, surpassing the CPU speed many times over. I did not accept either answer, because neither on its own solved my problem, but both became the foundation of my solution, so I still upvoted them. I am leaving my answer here in case anyone else encounters this problem.
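
For anyone who lands here later, below is a condensed sketch of the combined fix: a single module-level net, the FP32 CUDA target, and one warm-up pass (file names as in the question; detect is an illustrative helper, not part of the original code):

import time

import cv2
import numpy as np

# Build the network once, at import time
net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)  # FP32: Pascal cards handle this far better than FP16
output_layers = net.getUnconnectedOutLayersNames()

# Warm-up: the first CUDA forward pass pays one-time initialization costs
dummy = cv2.dnn.blobFromImage(np.zeros((416, 416, 3), np.uint8),
                              1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(dummy)
net.forward(output_layers)

def detect(frame):
    # One YOLOv4 forward pass on a BGR frame; returns the raw network outputs
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    start = time.time()
    outs = net.forward(output_layers)
    print("--- %s seconds ---" % (time.time() - start))
    return outs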