Hello all....

I have a quick question regarding large vocabulary continuous speech recognition (LVCSR) with HTK. I am currently building a recogniser based on the WSJCAM0 speech corpus.

In the initial version I produced, I was only getting in the region of 55% word accuracy (for the 5000 word task A, si_dt5a), and I noticed that, in many cases, the recognition broke off early. I.e.

    HELLO HOW ARE YOU TODAY

gets recognised as

    HELLO HOW ARE

Strangely, this truncation of recognition did not trigger the 'no tokens survived to end of utterance' error.

I looked more closely at the utterances in which this occurred. It often seemed that, at the point where the utterance cut off, there was a portion of silence (as can be seen from the monophone transcription). Now, in my language model and pronunciation dictionary, the only words to have the pronunciation 'sil' are the start and end words (i.e. !ENTER, !EXIT). I reasoned that the recognition must be cutting off early because the only place that silence can occur is at the end of the utterance. Therefore, I need a way for silence to appear inside my language network.

Currently, my language network is built using HBuild and a bigram language model. What I really need is the ability to specify an HParse-style grammar and use it with HBuild and my bigram (word) model, i.e.

    grammar: { $word [sil] }

But, as far as I am aware, this cannot be done.

Therefore, I improved performance by including a <SIL> word in my bigram language model. I basically added a new unigram (with a single pronunciation 'sil') with a reasonable probability of occurrence. This pushed performance up to 76%.

However, this is obviously not the way to do it. By including the silence word in the language model, I am affecting the recognition of the word immediately following it (as it is not able to use the preceding 'real' word in its bigram). I.e.

    HERE <SILENCE> SATURDAY

gets recognised as:

    HERE <SILENCE> SO TODAY

What is happening is that SO TODAY is more probable than SATURDAY, because the <SILENCE> effectively causes the language model to back off to unigrams, and SO TODAY is probably a lot more likely than SATURDAY without taking context into account.

So, my question is this: can anyone think of a better way for me to include a facility for silence in this system?

Thanks

Alastair

--
Alastair James
Research Student
School of Computing Sciences
University of East Anglia, Norwich, UK
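The back-off effect described in the post can be illustrated with a toy back-off bigram model. This is only a sketch: every probability and back-off weight below is invented for illustration, the helper names are hypothetical, and acoustic scores are ignored entirely. It assumes, as in the post, that the silence word was added as a unigram only, so no bigram has it as a history.

```python
# All numbers are hypothetical and chosen only to illustrate the effect.
UNIGRAM = {"SATURDAY": 0.0005, "SO": 0.01, "TODAY": 0.002}
BIGRAM = {("HERE", "SATURDAY"): 0.02, ("SO", "TODAY"): 0.1}
# Back-off weights per history. <SILENCE> has no bigrams of its own,
# so every successor is scored through the unigram back-off path.
BACKOFF = {"HERE": 0.5, "SO": 0.5, "<SILENCE>": 1.0}


def bigram_prob(history, word):
    """Return P(word | history), backing off to the unigram if unseen."""
    if (history, word) in BIGRAM:
        return BIGRAM[(history, word)]
    return BACKOFF.get(history, 1.0) * UNIGRAM[word]


def sequence_prob(history, words):
    """Language model probability of `words` following `history`."""
    prob = 1.0
    for word in words:
        prob *= bigram_prob(history, word)
        history = word
    return prob


# Without a silence word, SATURDAY benefits from its bigram with HERE.
direct_saturday = sequence_prob("HERE", ["SATURDAY"])
direct_so_today = sequence_prob("HERE", ["SO", "TODAY"])

# After <SILENCE>, the context HERE is lost. The common factor
# P(<SILENCE> | HERE) is omitted because it is the same on both paths.
after_sil_saturday = sequence_prob("<SILENCE>", ["SATURDAY"])
after_sil_so_today = sequence_prob("<SILENCE>", ["SO", "TODAY"])

print("no silence :", direct_saturday, "vs", direct_so_today)
print("after sil  :", after_sil_saturday, "vs", after_sil_so_today)
```

With these made-up numbers the language model prefers SATURDAY when it directly follows HERE, but prefers SO TODAY once the silence word sits in between, which is the behaviour the post describes.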
"Alastair Bruce James" <abj@sys.uea.ac.uk> writes:

> I have a quick question regarding large vocabulary continuous speech
> recognition (LVCSR) with HTK. I am currently building a recogniser based on
> the WSJCAM0 speech corpus.
>
> I reasoned that the recognition must be
> cutting off early because, as the only place that silence can occur is at
> the end of the utterance.

What does your dictionary look like? Do you have variants that end in sil and sp respectively? See the tutorial in the HTK book for a way to train these models.

You might want to post such questions to the htk-users mailing list as you might get more feedback there.

Gunnar
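One way to read the suggestion above is that each dictionary word gets two pronunciation variants, one ending in sp and one ending in sil. The sketch below generates such variants from a simplified dictionary. It assumes each line is just a word followed by its phones (optional output symbols and pronunciation probabilities are not handled), the function name is hypothetical, and the phone strings in the example are illustrative rather than taken from the WSJCAM0 dictionary.

```python
def add_silence_variants(dict_lines, short_pause="sp", silence="sil"):
    """Return dictionary lines with an sp and a sil variant per word.

    Assumes a simplified format: WORD phone1 phone2 ...
    Entries that already end in sp or sil (e.g. !ENTER, !EXIT) are
    passed through unchanged.
    """
    out = []
    for line in dict_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word, phones = fields[0], fields[1:]
        if not phones or phones[-1] in (short_pause, silence):
            out.append(" ".join(fields))
            continue
        for tail in (short_pause, silence):
            out.append(" ".join([word, *phones, tail]))
    return out


# Illustrative entries only; the phone symbols are placeholders.
example = ["HELLO hh ax l ow", "!ENTER sil", "!EXIT sil"]
variants = add_silence_variants(example)
for entry in variants:
    print(entry)
```

Doubling the number of pronunciations enlarges the recognition network, so some slowdown is to be expected. The sp and sil models themselves still have to be trained as described in the HTK book tutorial.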
Just to let anyone know who might search this newsgroup in the future...

Including an sp and a sil pronunciation of each word slowed recognition down greatly; however, I am seeing a 4-5% increase in word accuracy over the old 'silence word' method.

Thanks!!

Alastair

"Gunnar Evermann" <ge204@eng.cam.ac.uk> wrote in message news:mqqn01jezt8.fsf@eng.cam.ac.uk...
> "Alastair Bruce James" <abj@sys.uea.ac.uk> writes:
>
> > I have a quick question regarding large vocabulary continuous speech
> > recognition (LVCSR) with HTK. I am currently building a recogniser based on
> > the WSJCAM0 speech corpus.
> >
> > I reasoned that the recognition must be
> > cutting off early because, as the only place that silence can occur is at
> > the end of the utterance.
>
> What does your dictionary look like? Do you have variants that end in
> sil and sp respectively? See the tutorial in the HTK book for a way to
> train these models.
>
> You might want to post such questions to the htk-users mailing list as
> you might get more feedback there.
>
> Gunnar
Alastair Bruce James wrote:

> Hello all....
>
> I have a quick question regarding large vocabulary continuous speech
> recognition (LVCSR) with HTK. I am currently building a recogniser based on
> the WSJCAM0 speech corpus.
>
> In the initial version I produced, I was only getting in the region of 55%
> word accuracy

A former MSc student of mine, Christophe Van Bael, managed closer to 30% WER, also using HTK. His dissertation is here:

http://www.ling.ed.ac.uk/teaching/postgrad/mscslp/past_dissertations.shtml

Simon

--
Dr. Simon King                          Simon.King@ed.ac.uk
Centre for Speech Technology Research   www.cstr.ed.ac.uk
For MSc/PhD info, visit www.hcrc.ed.ac.uk/language-at-edinburgh