Nested dictionary to pandas df concatenating rows


Given the following dict:

j = {
  "source": "https://example.com",
  "timestamp": "2021-04-12T19:34:24Z",
  "durationInTicks": 1082400000,
  "duration": "PT1M48.24S",
  "combinedRecognizedPhrases": [
    {
      "channel": 0,
      "lexical": "aaa",
      "itn": "aaa",
      "maskedITN": "aaa",
      "display": "aaa"
    }
  ],
  "recognizedPhrases": [
    {
      "recognitionStatus": "Success",
      "channel": 0,
      "speaker": 1,
      "offset": "PT2.18S",
      "duration": "PT3.88S",
      "offsetInTicks": 21800000,
      "durationInTicks": 38800000,
      "nBest": [
        {
          "confidence": 0.9306252,
          "lexical": "gracias por llamar",
          "itn": "gracias por llamar",
          "maskedITN": "gracias por llamar",
          "display": "¿Gracias por llamar",
          "words": [
            {
              "word": "gracias",
              "offset": "PT2.18S",
              "duration": "PT0.37S",
              "offsetInTicks": 21800000,
              "durationInTicks": 3700000,
              "confidence": 0.930625
            },
            {
              "word": "por",
              "offset": "PT2.55S",
              "duration": "PT0.18S",
              "offsetInTicks": 25500000,
              "durationInTicks": 1800000,
              "confidence": 0.930625
            },
            {
              "word": "llamar",
              "offset": "PT2.73S",
              "duration": "PT0.22S",
              "offsetInTicks": 27300000,
              "durationInTicks": 2200000,
              "confidence": 0.930625
            }
          ]
        }
      ]
    },
    {
      "recognitionStatus": "Success",
      "channel": 0,
      "speaker": 2,
      "offset": "PT6.85S",
      "duration": "PT5.63S",
      "offsetInTicks": 68500000,
      "durationInTicks": 56300000,
      "nBest": [
        {
          "confidence": 0.9306253,
          "lexical": "quiero hacer un pago",
          "itn": "quiero hacer un pago",
          "maskedITN": "quiero hacer un pago",
          "display": "quiero hacer un pago"
        }
      ]
    },
    {
      "recognitionStatus": "Success",
      "channel": 0,
      "speaker": 2,
      "offset": "PT13.29S",
      "duration": "PT3.81S",
      "offsetInTicks": 132900000,
      "durationInTicks": 38100000,
      "nBest": [
        {
          "confidence": 0.93062526,
          "lexical": "no sé bien la cantidad",
          "itn": "no sé bien la cantidad",
          "maskedITN": "no sé bien la cantidad",
          "display": "no sé bien la cantidad"
        }
      ]
    }
  ]
}

Goal: get the information of interest into a single row of a DataFrame.

What I have done so far:

import pandas as pd

df = pd.json_normalize(j, record_path=['recognizedPhrases', 'nBest'], meta=['source', 'durationInTicks', 'duration', ['recognizedPhrases', 'speaker']])
# Join everything each speaker said into a single string, then keep one row per speaker
df['speech'] = df.groupby(['source', 'recognizedPhrases.speaker'])['display'].transform(lambda x : ' '.join(x))
df = df.drop_duplicates(subset=['recognizedPhrases.speaker'])

Obtained df: (image not available)

Why I am not satisfied with this output: the df has two rows (one for each recognizedPhrases.speaker), but I need all the information in a single row, with one column for what speaker 1 said (currently in the speech column) and another column for what speaker 2 said.

Additional information: Performance is an important factor since I will be doing this process with thousands of files.
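Since performance matters here, one option worth measuring is to skip the intermediate DataFrame and build each row straight from the dict, creating a single DataFrame at the end. The following is only a sketch under assumptions: the helper name transcript_to_row is made up for illustration, only the first nBest candidate of each phrase is used, and sample is a shortened stand-in for the dict j above.

```python
from collections import defaultdict

import pandas as pd


def transcript_to_row(doc):
    """Flatten one transcript dict into a single flat row (hypothetical helper)."""
    speech = defaultdict(list)
    for phrase in doc.get("recognizedPhrases", []):
        n_best = phrase.get("nBest") or []
        if n_best:
            # Assumption: only the first nBest candidate is wanted.
            speech[phrase["speaker"]].append(n_best[0]["display"])

    row = {
        "source": doc["source"],
        "durationInTicks": doc["durationInTicks"],
        "duration": doc["duration"],
    }
    # One column per speaker, phrases joined in the order they appear.
    for speaker in sorted(speech):
        row[f"recognizedPhrases.speaker{speaker}"] = " ".join(speech[speaker])
    return row


# Shortened stand-in for the dict `j` (same keys, less content).
sample = {
    "source": "https://example.com",
    "durationInTicks": 1082400000,
    "duration": "PT1M48.24S",
    "recognizedPhrases": [
        {"speaker": 1, "nBest": [{"display": "¿Gracias por llamar"}]},
        {"speaker": 2, "nBest": [{"display": "quiero hacer un pago"}]},
        {"speaker": 2, "nBest": [{"display": "no sé bien la cantidad"}]},
    ],
}

# One dict per file; the DataFrame is built once from the list of rows.
rows = [transcript_to_row(doc) for doc in [sample]]
df_direct = pd.DataFrame(rows)
```

With thousands of files, the list comprehension would iterate over the parsed files instead of [sample]. Whether this is actually faster than json_normalize plus groupby depends on the data, so it should be timed on a realistic batch.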

Edit 1: The result I expect would look something like this:

expected_dict = {'source': {0: 'https://example.com'},
 'durationInTicks': {0: 1082400000},
 'duration': {0: 'PT1M48.24S'},
 'recognizedPhrases.speaker1': {0: '¿Gracias por llamar'},
 'recognizedPhrases.speaker2': {0: 'quiero hacer un pago no sé bien la cantidad'}}
expected_df = pd.DataFrame(expected_dict)
Best answer:

You can pivot() into the expected output:

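index = ['source', 'durationInTicks', 'duration']
columns = ['recognizedPhrases.speaker']
values = ['speech']

# One row per file, one column per speaker
df = df[index+columns+values].pivot(index=index, columns=columns, values=values[0])
# Flatten the column labels to 'recognizedPhrases.speaker1', 'recognizedPhrases.speaker2', ...
df.columns = [f'{df.columns.name}{column}' for column in df.columns]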

Output:

source               durationInTicks  duration    recognizedPhrases.speaker1  recognizedPhrases.speaker2
https://example.com  1082400000       PT1M48.24S  ¿Gracias por llamar         quiero hacer un pago no sé bien la cantidad
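Note that pivot() leaves source, durationInTicks and duration in the index, whereas expected_df in the question has them as ordinary columns. A minimal end-to-end sketch, assuming a shortened stand-in dict (sample) in place of j and an extra reset_index() call that is not part of the answer's code:

```python
import pandas as pd

# Shortened stand-in for the question's dict `j` (same keys, less content).
sample = {
    "source": "https://example.com",
    "durationInTicks": 1082400000,
    "duration": "PT1M48.24S",
    "recognizedPhrases": [
        {"speaker": 1, "nBest": [{"display": "¿Gracias por llamar"}]},
        {"speaker": 2, "nBest": [{"display": "quiero hacer un pago"}]},
        {"speaker": 2, "nBest": [{"display": "no sé bien la cantidad"}]},
    ],
}

# Steps from the question: one row per nBest entry, then one row per speaker.
df = pd.json_normalize(
    sample,
    record_path=["recognizedPhrases", "nBest"],
    meta=["source", "durationInTicks", "duration", ["recognizedPhrases", "speaker"]],
)
df["speech"] = df.groupby(["source", "recognizedPhrases.speaker"])["display"].transform(
    lambda x: " ".join(x)
)
df = df.drop_duplicates(subset=["recognizedPhrases.speaker"])

# Steps from the answer: pivot the speakers into columns and flatten the labels.
index = ["source", "durationInTicks", "duration"]
columns = ["recognizedPhrases.speaker"]
wide = df[index + columns + ["speech"]].pivot(index=index, columns=columns, values="speech")
wide.columns = [f"{wide.columns.name}{column}" for column in wide.columns]

# Added step (assumption): move the index levels back into regular columns.
result = wide.reset_index()
```

Passing lists to index and columns in pivot() needs a reasonably recent pandas (1.1 or later).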