I'm currently working on a project using TensorFlow's MoveNet for pose estimation on a video. The model is detecting keypoints quite well, but there's an issue with the keypoint positioning.
I'm experiencing misalignment between the detected keypoints and the actual body parts in the video frames. The keypoints are off by a significant margin. I'm wondering if this misalignment is a common issue or if there are any specific adjustments that need to be made.
Here's a summary of the relevant information:
- I'm using TensorFlow and the MoveNet model for single-pose estimation.
- The input video has a resolution of 1280x720 pixels.
- I'm resizing the frames to 192x192 pixels with tf.image.resize_with_pad, which preserves the aspect ratio by adding letterbox padding, before passing them to the model (see the geometry sketch after this list).
- The code for rendering keypoints and connections is based on standard practices for pose estimation.
The specific problem is that keypoints are not correctly aligned with the body parts they represent, e.g., the elbow keypoint is not in the correct position. I'm looking for guidance on how to address this issue.
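To make the preprocessing concrete, here is my own back-of-the-envelope for what tf.image.resize_with_pad does to a 1280x720 frame at a 192x192 target (these numbers are my arithmetic, not something from the TensorFlow docs):

src_w, src_h = 1280, 720
target = 192

scale = min(target / src_w, target / src_h)  # 192/1280 = 0.15
scaled_w = round(src_w * scale)              # 192: the width fills the input
scaled_h = round(src_h * scale)              # 108: the height does not
pad_y = (target - scaled_h) / 2              # 42 px of padding top and bottom
pad_x = (target - scaled_w) / 2              # 0

If I have this right, roughly 44% of the model input's height is padding, so keypoint y-coordinates normalized against the padded square will not line up with the raw frame if I simply multiply them by the frame height.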
import tensorflow as tf
import tensorflow_hub as hub
import cv2
import numpy as np

# Load the single-pose Lightning model (expects a 192x192 int32 input)
model = hub.load("https://tfhub.dev/google/movenet/singlepose/lightning/4")
movenet = model.signatures['serving_default']

# Keypoint pairs to connect, with matplotlib-style color codes.
# Indices follow MoveNet's ordering: 0 nose, 1-4 eyes/ears, 5-6 shoulders,
# 7-8 elbows, 9-10 wrists, 11-12 hips, 13-14 knees, 15-16 ankles.
EDGES = {
    (0, 1): 'm',
    (0, 2): 'c',
    (1, 3): 'm',
    (2, 4): 'c',
    (0, 5): 'm',
    (0, 6): 'c',
    (5, 7): 'm',
    (7, 9): 'm',
    (6, 8): 'c',
    (8, 10): 'c',
    (5, 6): 'y',
    (5, 11): 'm',
    (6, 12): 'c',
    (11, 12): 'y',
    (11, 13): 'm',
    (13, 15): 'm',
    (12, 14): 'c',
    (14, 16): 'c'
}

def draw_keypoints(frame, keypoints, confidence_threshold):
    y, x, c = frame.shape
    # Scale normalized (y, x) coordinates up to frame pixels
    shaped = np.squeeze(np.multiply(keypoints, [y, x, 1]))
    for kp in shaped:
        ky, kx, kp_conf = kp
        if kp_conf > confidence_threshold:
            cv2.circle(frame, (int(kx), int(ky)), 6, (0, 255, 0), -1)

def draw_connections(frame, keypoints, edges, confidence_threshold):
    y, x, c = frame.shape
    shaped = np.squeeze(np.multiply(keypoints, [y, x, 1]))
    for edge, color in edges.items():
        p1, p2 = edge
        y1, x1, c1 = shaped[p1]
        y2, x2, c2 = shaped[p2]
        # Only draw an edge when both endpoints are confident
        if c1 > confidence_threshold and c2 > confidence_threshold:
            cv2.line(frame, (int(x1), int(y1)), (int(x2), int(y2)), (0, 0, 255), 4)

# Loop through each detected person and render keypoints and connections
def loop_through_people(frame, keypoints_with_scores, edges, confidence_threshold):
    for person in keypoints_with_scores:
        draw_connections(frame, person, edges, confidence_threshold)
        draw_keypoints(frame, person, confidence_threshold)

cap = cv2.VideoCapture('federer.mp4')
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:  # end of stream or failed read
        break

    # Resize to the model's 192x192 input, letterboxing to keep aspect ratio
    img = tf.image.resize_with_pad(tf.expand_dims(frame.copy(), axis=0), 192, 192)
    input_img = tf.cast(img, dtype=tf.int32)

    # Detection section: output_0 has shape [1, 1, 17, 3] for single pose,
    # each keypoint as (y, x, score) normalized to the padded input
    results = movenet(input_img)
    keypoints_with_scores = results['output_0'].numpy()

    # Render keypoints onto the original full-resolution frame
    loop_through_people(frame, keypoints_with_scores, EDGES, 0.1)

    cv2.imshow('MoveNet SinglePose', frame)
    if cv2.waitKey(10) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
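In case it helps, this is the kind of un-padding step I have been experimenting with: a hypothetical helper (unpad_keypoints is my own name, not a library function) that inverts the letterbox from the sketch above and returns keypoints in original-frame pixel coordinates. The draw functions would then use these pixel values directly instead of multiplying by the frame shape:

def unpad_keypoints(keypoints, frame_h, frame_w, input_size=192):
    # Recompute the scale and padding that resize_with_pad applied
    scale = min(input_size / frame_w, input_size / frame_h)
    pad_x = (input_size - frame_w * scale) / 2
    pad_y = (input_size - frame_h * scale) / 2
    shaped = np.squeeze(keypoints).copy()  # (17, 3) rows of (y, x, score)
    # Normalized coords -> padded-input pixels -> original-frame pixels
    shaped[:, 0] = (shaped[:, 0] * input_size - pad_y) / scale
    shaped[:, 1] = (shaped[:, 1] * input_size - pad_x) / scale
    return shaped

I have not confirmed this is the right fix, which is partly why I'm asking.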
I have used a same-sized model created by PINTO0309 on GitHub, translated for use with OpenVINO, and it did not perform well in terms of the positional accuracy of the keypoints. However, when I used the 256x256 or 192x256 variants, the accuracy increased and was actually great. In my experience, 192x192 is simply not enough resolution to get the accuracy you would expect when processing a 720p video.
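If you want to stay on TF Hub rather than go through an OpenVINO conversion, the Thunder variant of MoveNet is the higher-accuracy (but slower) counterpart to Lightning and takes a 256x256 input. Something like this sketch should drop into your existing loop with only the input size changed:

model = hub.load("https://tfhub.dev/google/movenet/singlepose/thunder/4")
movenet = model.signatures['serving_default']

# Inside the frame loop: same preprocessing as before, but at 256x256
img = tf.image.resize_with_pad(tf.expand_dims(frame, axis=0), 256, 256)
input_img = tf.cast(img, dtype=tf.int32)
results = movenet(input_img)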