I have an audio clip that was played through a laptop's built-in speakers and recorded with the same laptop's built-in microphone. It contains a series of spoken English numbers, for example: 2, 34, 12, 45.
When only one computer is playing audio and nobody is talking nearby, Whisper recognizes the numbers fairly accurately. However, when people are talking or several computers are playing audio at the same time, it fails to recognize the numbers I played.
For instance, if people nearby or other computers are saying numbers like 55, 63, 12, 32, and so on, Whisper can't correctly pick out the numbers I want, which are 2, 34, 12, 45.
Is there any solution to this problem?
I used OpenAI Whisper for the recognition. For example, when the numbers played were 7, 23, 55, 90, the result I got was: 1, 2, 3, 4, 7, 23, 55, 95, 76, 78, 88, 99, 100, 1, 2
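
To make the mismatch in that example concrete, here is a minimal sketch that aligns the expected numbers against the recognized ones using Python's standard `difflib`. It only compares the two lists given above and does not call Whisper; the function name `align_numbers` is just a hypothetical helper for illustration.

```python
from difflib import SequenceMatcher


def align_numbers(expected, recognized):
    """Return the expected numbers that appear, in order, in the recognized list."""
    # autojunk=False disables difflib's popularity heuristic, which is not needed here.
    matcher = SequenceMatcher(None, expected, recognized, autojunk=False)
    matched = []
    for block in matcher.get_matching_blocks():
        # Each block is (start in expected, start in recognized, length);
        # the final block is a zero-length sentinel and adds nothing.
        matched.extend(expected[block.a:block.a + block.size])
    return matched


expected = [7, 23, 55, 90]
recognized = [1, 2, 3, 4, 7, 23, 55, 95, 76, 78, 88, 99, 100, 1, 2]

matched = align_numbers(expected, recognized)
# Simple membership check; assumes the expected numbers are all distinct.
missed = [n for n in expected if n not in matched]
extra_count = len(recognized) - len(matched)

print("matched:", matched)
print("missed:", missed)
print("extra numbers from other sources:", extra_count)
```

For this example, three of the four target numbers are found in order, the last one is missed (95 appears where 90 was played), and the remaining recognized numbers come from the other voices or computers.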