WSJCAM0 with HTK

Hello all....

I have a quick question regarding large vocabulary continuous speech
recognition (LVCSR) with HTK. I am currently building a recogniser based on
the WSJCAM0 speech corpus.

In the initial version I produced, I was only getting in the region of 55%
word accuracy (for the 5000 word task A, si_dt5a), and I noticed that, in
many cases, the recognition broke off early. I.e.

        HELLO HOW ARE YOU TODAY
gets recognised as
        HELLO HOW ARE

Strangely, this truncation of recognition did not trigger the 'no tokens
survived to end of utterance' error. I looked more closely at the utterances
in which this occurred. Often, at the point where the recognition cut off,
there was a portion of silence (as can be seen from the monophone
transcription). Now, in my language model and pronunciation dictionary, the
only words to have the pronunciation 'sil' are the start and end words
(i.e. !ENTER, !EXIT). I reasoned that the recognition must be cutting off
early because the only place that silence can occur is at the end of the
utterance, so a mid-utterance pause can only be matched by !EXIT.

Therefore, I need a way for silence to appear in my language network.
Currently, my language network is built using HBuild and a bigram language
model. What I really need is the ability to specify an HParse-style grammar
and use it with HBuild and my bigram (word) model, i.e.

grammar:     { $word [sil] }

But, as far as I am aware, this cannot be done. Therefore, I improved
performance by including a <SIL> word in my bigram language model. I
basically added a new unigram (with a single pronunciation 'sil') with a
reasonable probability of occurrence. This pushed performance up to 76%.

However, this is obviously not the way to do it. By including the silence
word in the language model, I am affecting the recognition of the word
immediately following it (as that word is not able to use the preceding
'real' word in its bigram). I.e.

HERE <SILENCE> SATURDAY

gets recognised as:

HERE <SILENCE> SO TODAY

What is happening is that SO TODAY is more probable than SATURDAY because the
<SILENCE> effectively causes the language model to back off to unigrams, and
SO TODAY is probably a lot more likely than SATURDAY when context is not
taken into account.
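
To make the back-off effect concrete, here is a toy sketch in Python. All of
the probabilities and the single shared back-off weight are invented purely
for illustration (they do not come from any real WSJCAM0 language model), the
function names are hypothetical, and the acoustic scores are ignored
entirely. It only shows the language model side of the argument.

```python
import math

# Toy values, invented for illustration only.
unigram = {
    "HERE": 0.01,
    "SATURDAY": 0.0001,
    "SO": 0.01,
    "TODAY": 0.004,
    "<SILENCE>": 0.05,
}
# Bigrams that the toy model has "seen": P(word | previous word).
bigram = {
    ("HERE", "SATURDAY"): 0.02,
    ("SO", "TODAY"): 0.03,
    ("HERE", "<SILENCE>"): 0.05,
}
# Simplification: one back-off weight for every history.
BACKOFF = 0.4


def prob(word, prev):
    """Bigram probability, backing off to a scaled unigram if unseen."""
    if (prev, word) in bigram:
        return bigram[(prev, word)]
    return BACKOFF * unigram[word]


def log_score(words):
    """Log probability of the word sequence given its first word."""
    return sum(math.log(prob(w, prev)) for prev, w in zip(words, words[1:]))


# Without the silence word, SATURDAY can use HERE as its bigram context.
no_sil_saturday = log_score(["HERE", "SATURDAY"])
no_sil_so_today = log_score(["HERE", "SO", "TODAY"])

# With the silence word, the context is <SILENCE>, so both paths back off.
sil_saturday = log_score(["HERE", "<SILENCE>", "SATURDAY"])
sil_so_today = log_score(["HERE", "<SILENCE>", "SO", "TODAY"])

print(no_sil_saturday > no_sil_so_today)
print(sil_so_today > sil_saturday)
```

With these toy numbers, SATURDAY wins when it directly follows HERE, but once
<SILENCE> sits in between, SATURDAY is scored by its (small) unigram alone
and the SO TODAY path comes out ahead.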

So, my question is this: Can anyone think of a better way for me to include
a facility for silence in this system?

Thanks

Alastair
-- 
Alastair James
Research Student
School of Computing Sciences
University of East Anglia, Norwich, UK


Alastair
7/29/2004 9:49:41 AM

"Alastair Bruce James" <abj@sys.uea.ac.uk> writes:

> I have a quick question regarding large vocabulary continuous speech
> recognition (LVCSR) with HTK. I am currently building a recogniser based on
> the WSJCAM0 speech corpus.
>
> I reasoned that the recognition must be
> cutting off early because, as the only place  that silence can occur is at
> the end of the utterance.

What does your dictionary look like? Do you have variants that end in
sil and sp respectively? See the tutorial in the HTK book for a way to
train these models.

You might want to post such questions to the htk-users mailing list as
you might get more feedback there.


  Gunnar
Gunnar
7/29/2004 12:57:23 PM
Just to let anyone know who might search this newsgroup in the future...

Including an sp and a sil pronunciation of each word slowed down recognition
greatly. However, I am seeing a 4-5% increase in word accuracy over the old
'silence word' method.
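
For anyone trying the same thing, a minimal sketch of the dictionary
expansion in Python follows. It is not necessarily the exact procedure used
here. It assumes a plain HTK-style dictionary with one "WORD phone phone ..."
entry per line, that the start and end words are !ENTER and !EXIT as in the
original post, and that an sp model has already been trained (see the
tutorial in the HTK book). The function name is hypothetical.

```python
def add_silence_variants(lines, skip=("!ENTER", "!EXIT")):
    """Return dictionary lines with an sp-final and a sil-final variant per word.

    Entries for words in `skip`, and entries that already end in sp or sil,
    are passed through unchanged.
    """
    out = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        word, rest = fields[0], fields[1:]
        if word in skip or (rest and rest[-1] in ("sp", "sil")):
            out.append(" ".join(fields))
            continue
        # One variant followed by a short pause, one followed by full silence.
        for tail in ("sp", "sil"):
            out.append(" ".join([word, *rest, tail]))
    return out


example = [
    "!ENTER sil",
    "HELLO hh ax l ow",
    "HOW hh aw",
]
expanded = add_silence_variants(example)
for entry in expanded:
    print(entry)
```

Because every word now has two pronunciations, the recognition network grows
accordingly, which is consistent with the slowdown mentioned above.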

Thanks!!

Alastair

"Gunnar Evermann" <ge204@eng.cam.ac.uk> wrote in message
news:mqqn01jezt8.fsf@eng.cam.ac.uk...
> "Alastair Bruce James" <abj@sys.uea.ac.uk> writes:
>
> > I have a quick question regarding large vocabulary continuous speech
> > recognition (LVCSR) with HTK. I am currently building a recogniser based on
> > the WSJCAM0 speech corpus.
> >
> > I reasoned that the recognition must be
> > cutting off early because, as the only place  that silence can occur is at
> > the end of the utterance.
>
> What does your dictionary look like? Do you have variants that end in
> sil and sp respectively? See the tutorial in the HTK book for a way to
> train these models.
>
> You might want to post such questions to the htk-users mailing list as
> you might get more feedback there.
>
>
>   Gunnar


Alastair
7/29/2004 6:00:36 PM
Alastair Bruce James wrote:
> Hello all....
> 
> I have a quick question regarding large vocabulary continuous speech
> recognition (LVCSR) with HTK. I am currently building a recogniser based on
> the WSJCAM0 speech corpus.
> 
> In the initial version I produced, I was only getting in the region of 55%
> word accuracy 

A former MSc student of mine, Christophe Van Bael, managed closer to 
30% WER, also using HTK - his dissertation is here:

http://www.ling.ed.ac.uk/teaching/postgrad/mscslp/past_dissertations.shtml
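
Since the thread mixes word accuracy and WER, here is a small Python sketch
of how the two relate, assuming accuracy is computed the way HTK's HResults
reports it, (N - D - S - I) / N, so that WER is simply 100 minus accuracy.
The error counts below are invented for illustration and the function names
are hypothetical.

```python
def word_accuracy(n, subs, dels, ins):
    """Percent accuracy as (N - D - S - I) / N * 100."""
    return 100.0 * (n - dels - subs - ins) / n


def word_error_rate(n, subs, dels, ins):
    """Percent word error rate as (S + D + I) / N * 100."""
    return 100.0 * (subs + dels + ins) / n


# Invented counts: 100 reference words, 30 substitutions,
# 10 deletions and 5 insertions.
acc = word_accuracy(100, 30, 10, 5)
wer = word_error_rate(100, 30, 10, 5)
print(acc, wer)
```

On that definition, 55% word accuracy corresponds to 45% WER.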

Simon

-- 
Dr. Simon King                               Simon.King@ed.ac.uk
Centre for Speech Technology Research          www.cstr.ed.ac.uk
For MSc/PhD info, visit  www.hcrc.ed.ac.uk/language-at-edinburgh
Simon
9/8/2004 6:42:55 PM