What is the right way to time phonemes in SAPI TTS (C#)? (SpVoice.Phoneme() -> StreamPosition)

Question


1.1k Views Asked by tomdelahaba At 29 July 2025 at 17:21

I have the following problem in my application. I am writing an app where a user enters text, SAPI TTS converts it to speech, and I then work with the output WAV. What I need is information about the phonemes: where in the output WAV each phoneme is, how long the voice takes to say it, and so on. I used SpVoice.Phoneme() and added a handler for phoneme events, so I can now get the duration and similar data. However, SpVoice.Phoneme() also has a StreamPosition attribute, and I have no idea what it means.

from MSDN:

StreamPosition
The character position in the output stream at which the phoneme begins.

I don't understand whether they mean the byte position in the output WAV (the byte at which the phoneme starts), a time in milliseconds in the output WAV, or something else.

For example, for text:

This is high. This is low. This is fast. This is slow.

I get the StreamPositions values:

Position:0
Position:120
Position:2562
....
Position:143798
Position:147874
Position:151950

The output WAV file is 5.377098 seconds long, and the last phoneme, "ow", is spoken at about 4.734 s. The output WAV file is 237,568 bytes. So the StreamPosition value 147874 is probably not the byte at which the phoneme begins. It cannot be a time in milliseconds either: the WAV is only about 5.4 s long, and 151950 ms would be 151.95 s, so that is ruled out.

So what is StreamPosition? What does its value mean?

I really need to capture the exact time at which each phoneme begins. I tried it with DateTime.Now.Ticks/10000: when the user clicks the button that starts the TTS, I save this value, and when the handler catches a phoneme I read the value again. The phoneme time is then currTime-startTime. But this method is not very exact; there is always some deviation. Does SpVoice.Phoneme() offer a way to get exact information about when a phoneme began? If not, is there a better way to get a more precise time in ms?

Sorry for my English, and thanks a lot for any answers and advice.


There are 2 best solutions below

Steven Du On 29 January 2012 at 16:17

1) I am not sure how you save the output to the WAV file, but a file size of 237,568 bytes is larger than expected (assuming a 16 kHz sampling rate and 2 bytes per sample). The expected size of a 5.377098-second WAV file is

5.377098 * 16000 * 2 = 172067 bytes + header (44 bytes)

So I think your WAV file contains the phoneme events as well.
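The size estimate above can be sketched as a small function. This is only an illustration, assuming uncompressed PCM and the canonical 44-byte header; `ExpectedWavBytes` is a hypothetical helper name and not part of SAPI. It is written in C++ like the SDK excerpts in this answer, and the arithmetic ports directly to C#.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a SAPI function): estimate the size of an
// uncompressed PCM WAV file from its duration and format.
// Assumes the canonical 44-byte header with no extra chunks.
std::uint64_t ExpectedWavBytes(double seconds,
                               std::uint32_t samplesPerSec,
                               std::uint16_t bitsPerSample,
                               std::uint16_t channels) {
    assert(seconds >= 0.0);
    const std::uint64_t kHeaderBytes = 44;
    // Audio payload: duration * samples per second * bytes per sample * channels.
    const double dataBytes =
        seconds * samplesPerSec * (bitsPerSample / 8.0) * channels;
    // Round the payload to the nearest byte, then add the header.
    return static_cast<std::uint64_t>(dataBytes + 0.5) + kHeaderBytes;
}
```

With the 16 kHz, 16-bit mono assumption, `ExpectedWavBytes(5.377098, 16000, 16, 1)` gives 172111 bytes. With 22050 Hz instead it gives 237174 bytes, which is much closer to the reported 237,568, so it is worth checking the actual output format before concluding anything from the file size.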

2) TTS takes time to generate its output, so you can't time it that way. I suggest:

2.1) Record the phoneme events, as you may already have done in 1).

You can also refer to the Windows SDK sample:

C:\Program Files\Microsoft SDKs\Windows\v7.1\Samples\winui\speech\ttsapplication

    if (SUCCEEDED(hr))
    {
        // OriginalFmt.WaveFormatExPtr()->nSamplesPerSec;
        hr = SPBindToFile(m_szWFileName, SPFM_CREATE_ALWAYS, &cpWavStream,
                          &OriginalFmt.FormatId(), OriginalFmt.WaveFormatExPtr(),
                          SPFEI_ALL_TTS_EVENTS);
    }
    if (SUCCEEDED(hr))
    {
        // Set the voice's output to the wav file instead of the speakers
        hr = m_cpVoice->SetOutput(cpWavStream, TRUE);
    }

2.2) Time against another event, such as the stream-start event (I am not sure of the exact name).

In the Windows SDK:

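    while (m_cpVoice->GetEvents(1, &event, &ul) == S_OK)
    {
        if (event.eEventId == SPEI_VISEME)
        {
            // SAPI packs two values into each parameter of a viseme event.
            printf("v: %d\'", (int)LOWORD(event.lParam)); // current viseme (low word of lParam)
            printf("t: %d\'", (int)HIWORD(event.wParam)); // viseme duration in ms (high word of wParam)
        }
        else if (event.eEventId == SPEI_END_INPUT_STREAM)
        {
            // end of the input stream: stop timing here
        }
        else if (event.eEventId == SPEI_START_INPUT_STREAM)
        {
            // start of the input stream: take the reference time here
        }
    }

Note that this code is C++, not C#.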

**tomdelahaba** · Accepted Answer

OK, I will answer this myself. My bachelor's thesis professor sent me some C++ code he wrote. I spent the last two days reading it, and now I see how stupid I was.

So here is the answer.

The StreamPosition attribute really is the byte position in the output stream (probably the WAV data).

If you want the millisecond position in the output stream, you need to write something like:

(int)StreamPosition / (double)wavFileFormat_samplesPerSec / ((double)wavFileFormat_BitsPerSample / 8) * 1000

Without the final * 1000 the result is in seconds. The formula as written assumes a mono stream; for multi-channel audio you also need to divide by the number of channels.

So you need to find out the format of the output stream (bits per sample, samples per second), and from that you get the timing in milliseconds.
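As a rough sketch of the conversion described above, here is the same arithmetic as a C++ function (C++ because that is what the SDK excerpts in this thread use; it ports directly to C#). `PcmFormat` and `StreamPositionToMs` are hypothetical names made up for illustration, not SAPI types; in a real program the three format values would come from the output stream's wave format.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical holder for the output stream's PCM format (not a SAPI type).
struct PcmFormat {
    std::uint32_t samplesPerSec;  // e.g. 16000
    std::uint16_t bitsPerSample;  // e.g. 16
    std::uint16_t channels;       // e.g. 1 for mono
};

// Convert a byte offset in the audio stream (such as StreamPosition)
// to a time in milliseconds from the start of the stream.
double StreamPositionToMs(std::uint64_t streamPositionBytes,
                          const PcmFormat& fmt) {
    // Bytes of audio produced per second of playback.
    const double bytesPerSecond = static_cast<double>(fmt.samplesPerSec) *
                                  (fmt.bitsPerSample / 8.0) * fmt.channels;
    assert(bytesPerSecond > 0.0);
    // Bytes divided by bytes-per-second gives seconds; scale to milliseconds.
    return static_cast<double>(streamPositionBytes) / bytesPerSecond * 1000.0;
}
```

For example, assuming a 16 kHz, 16-bit mono stream (an assumption; the question does not state the format), `StreamPositionToMs(151950, PcmFormat{16000, 16, 1})` gives about 4748 ms, which is close to the roughly 4.734 s observed for the last phoneme in the question.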
