I have a question about the Whisper neural network. I have two-channel recordings of phone calls. How do I transcribe a .wav file to a text file with labels indicating which party is speaking? For example: operator: ... client: ...
I tried working with each channel separately: I wrote them into two files, transcribed each, and combined the results. I wanted to know if there is a simpler solution.
If I understood correctly, you also want to add information about which speaker says what, right? From what I know, the way you approached it is the simplest one (https://github.com/openai/whisper/discussions/1026). A minimal sketch of that approach is below.
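For reference, here is how that split-and-merge approach could look, assuming channel 0 is the operator and channel 1 is the client (swap them if your telephony setup differs), and using the soundfile package to split the stereo file:

```python
import soundfile as sf
import whisper

# Split the stereo recording into one mono file per party.
# Assumption: channel 0 is the operator, channel 1 is the client.
data, sr = sf.read("call.wav")
sf.write("operator.wav", data[:, 0], sr)
sf.write("client.wav", data[:, 1], sr)

model = whisper.load_model("small")

# Transcribe each channel and tag every segment with its speaker.
segments = []
for path, speaker in [("operator.wav", "operator"), ("client.wav", "client")]:
    result = model.transcribe(path)
    for seg in result["segments"]:
        segments.append((seg["start"], speaker, seg["text"].strip()))

# Interleave the two transcripts by segment start time.
for start, speaker, text in sorted(segments):
    print(f"[{start:7.2f}s] {speaker}: {text}")
```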
You could also merge the two channels into one (make the audio mono), but then you might run into issues where the speakers sometimes overlap each other. If you do want to go this route, you can output timestamps alongside the text; then, if you know when each speaker talks in the recording, you can match the output timestamps to the corresponding speaker (see the sketch below). Whisper already outputs segment-level timestamps; if you need word-level timestamps, you can use this Whisper implementation: https://github.com/linto-ai/whisper-timestamped.
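To illustrate the mono route, a sketch of matching Whisper's segment-level timestamps against known speaker intervals; the speaker_turns list here is hypothetical and would come from your telephony system or a separate diarization step:

```python
import whisper

model = whisper.load_model("small")
result = model.transcribe("call_mono.wav")

# Hypothetical: intervals (start, end, speaker) in seconds during
# which each party is known to be talking.
speaker_turns = [
    (0.0, 4.2, "operator"),
    (4.2, 9.8, "client"),
]

def speaker_at(t):
    """Return the speaker whose interval contains time t."""
    for start, end, who in speaker_turns:
        if start <= t < end:
            return who
    return "unknown"

# Label each segment by the speaker active at its start time.
for seg in result["segments"]:
    print(f"{speaker_at(seg['start'])}: {seg['text'].strip()}")
```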
I also just found out that the native implementation of Whisper has support for word-level timestamps (if you add `word_timestamps=True` in the `.transcribe()` command, see https://github.com/openai/whisper/blob/main/whisper/transcribe.py).
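A quick usage sketch of that option (each segment then carries a "words" list with per-word start and end times):

```python
import whisper

model = whisper.load_model("small")
result = model.transcribe("call_mono.wav", word_timestamps=True)

# With word_timestamps=True, every segment contains a "words" list,
# each entry holding the word plus its start/end time in seconds.
for seg in result["segments"]:
    for word in seg["words"]:
        print(f"{word['start']:6.2f}-{word['end']:6.2f}  {word['word']}")
```

Hope this helps!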