TensorFlow.js MoveNet returns an output tensor of shape [1, 6, 56]


Inputs: A frame of video or an image, represented as an int32 tensor of shape 192x192x3. Channel order: RGB with values in [0, 255].

Outputs: A float32 tensor of shape [1, 1, 17, 3].

The first two channels of the last dimension represent the yx coordinates (normalized to the image frame, i.e. range in [0.0, 1.0]) of the 17 keypoints (in the order: [nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, right ankle]).

The third channel of the last dimension represents the prediction confidence scores of each keypoint, also in the range [0.0, 1.0].
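
For context, a minimal sketch of how that single-pose output maps to named keypoints (the helper name and the KEYPOINT_NAMES array are illustrative, not part of the model's API):

// Illustrative helper (not part of the model API) showing how the
// [1, 1, 17, 3] single-pose output maps to named keypoints.
const KEYPOINT_NAMES = [
  'nose', 'left eye', 'right eye', 'left ear', 'right ear',
  'left shoulder', 'right shoulder', 'left elbow', 'right elbow',
  'left wrist', 'right wrist', 'left hip', 'right hip',
  'left knee', 'right knee', 'left ankle', 'right ankle',
];

async function parseSinglePose(outputTensor) {
  // [1, 1, 17, 3] -> nested JS array; destructure away the two leading dims.
  const [[keypoints]] = await outputTensor.array();
  return keypoints.map(([y, x, score], i) => ({
    name: KEYPOINT_NAMES[i],
    y, // normalized to [0.0, 1.0] relative to the input frame
    x,
    score,
  }));
}

My code: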

const MODEL_PATH = './model';
const EXAMPLE_IMG = document.getElementById('exampleImg');

let movenet = undefined;

async function loadAndRunModel() {
  movenet = await tf.loadGraphModel(MODEL_PATH, { fromTFHub: true });

  // let exampleInputTensor = tf.zeros([1, 192, 192, 3], 'int32');
  let imageTensor = tf.browser.fromPixels(EXAMPLE_IMG);
  console.log(imageTensor.shape);

  let cropStartPoint = [15, 170, 0];
  let cropSize = [345, 345, 3];
  let croppedTensor = tf.slice(imageTensor, cropStartPoint, cropSize);

  let resizedTensor = tf.image
    .resizeBilinear(croppedTensor, [192, 192], true)
    .toInt();
  console.log(resizedTensor.shape);

  let tensorOutput = movenet.predict(tf.expandDims(resizedTensor));
  // Read the tensor values back as a JS array (awaiting the tensor itself does nothing).
  let arrayOutput = await tensorOutput.array();

  console.log(arrayOutput);
}

loadAndRunModel();

I get an output tensor of shape [1, 6, 56]. According to the documentation it should return a tensor of shape [1, 1, 17, 3]. Why does it return a different output?

1 Answer

MoveNet comes in three "flavors":

  • single pose "lightning": fast but not the most accurate
  • single pose "thunder": more accurate but slower
  • multipose "lightning": fast multipose detection

The single pose versions return a tensor of shape [1, 1, 17, 3] as described in your question, but the multipose version returns a tensor of shape [1, 6, 56], which is described in its model card:

Output: A float32 tensor of shape [1, 6, 56].

  • The first dimension is the batch dimension, which is always equal to 1.
  • The second dimension corresponds to the maximum number of instance detections. The model can detect up to 6 people in the image frame simultaneously.
  • The third dimension represents the predicted bounding box/keypoint locations and scores:
    • The first 17 * 3 = 51 elements are the keypoint locations and scores in the format [y_0, x_0, s_0, y_1, x_1, s_1, …, y_16, x_16, s_16], where y_i, x_i, s_i are the yx coordinates (normalized to the image frame, i.e. range in [0.0, 1.0]) and the confidence score of the i-th keypoint. The order of the 17 keypoints is: [nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, right ankle].
    • The remaining 5 elements [ymin, xmin, ymax, xmax, score] represent the region of the bounding box (in normalized coordinates) and the confidence score of the instance.
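
In other words, each of the 6 rows is 51 keypoint values followed by 5 box values. Here is a minimal sketch of unpacking that layout (the helper name and the 0.3 score threshold are illustrative choices, not from the model card):

// Illustrative sketch: unpack the [1, 6, 56] multipose output into one
// object per detected person. Names and the threshold are my own choices.
const MIN_INSTANCE_SCORE = 0.3; // arbitrary example threshold

async function parseMultiPose(outputTensor) {
  // [1, 6, 56] -> drop the batch dimension, leaving 6 rows of 56 values.
  const [detections] = await outputTensor.array();
  return detections
    .filter((row) => row[55] >= MIN_INSTANCE_SCORE) // element 55 is the instance score
    .map((row) => {
      // First 17 * 3 = 51 elements: [y, x, score] per keypoint.
      const keypoints = [];
      for (let i = 0; i < 17; i++) {
        keypoints.push({ y: row[3 * i], x: row[3 * i + 1], score: row[3 * i + 2] });
      }
      // Remaining 5 elements: bounding box + instance confidence.
      const [ymin, xmin, ymax, xmax, score] = row.slice(51);
      return { keypoints, box: { ymin, xmin, ymax, xmax }, score };
    });
}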

You probably grabbed the multipose version. If you want the [1, 1, 17, 3] output described in the documentation you quoted, load one of the single pose models instead.
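
Assuming the TF Hub path below is still current (it's the published path as I know it; verify against the model's listing before relying on it), loading the single pose "lightning" variant looks something like this:

// Sketch: load the single-pose "lightning" variant from TF Hub.
// The URL is an assumption; double-check it against the current model card.
async function loadSinglePoseModel() {
  const SINGLEPOSE_LIGHTNING_URL =
    'https://tfhub.dev/google/tfjs-model/movenet/singlepose/lightning/4';
  const model = await tf.loadGraphModel(SINGLEPOSE_LIGHTNING_URL, {
    fromTFHub: true,
  });
  // predict() on this model returns the [1, 1, 17, 3] tensor from the question.
  return model;
}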