Abstract


One-stage decoding, the integration of speech recognition and linguistic analysis into a single probabilistic process, is an interesting trend in speech research. In this paper, we present a simple one-stage decoding scheme that can be realised without implementing a specialized decoder or using complex language models. Instead, we reduce HMM-based semantic analysis to the problem of deriving annotated versions of the conventional language model, while the acoustic model remains unchanged. We present experiments with the ATIS corpus (Price, 1990) in which the performance of the one-stage method is shown to be comparable with the traditional two-stage approach, while requiring a significantly smaller increase in language model size.

1 Introduction

In a spoken dialogue system, speech recognition and linguistic analysis play a decisive role in the overall performance of the system. Traditionally, word hypotheses produced by the automatic speech recognition (ASR) component are fed into a separate natural language understanding (NLU) module for deriving a semantic meaning representation.

These semantic representations are the system's understanding of the user's intentions. Based on this knowledge, the dialogue manager has to decide on the system reaction. Because speech recognition is a probabilistic pattern matching problem that usually does not generate one single possible result, hard decisions taken after the speech recognition process can cause a significant loss of information that may be important for parsing and other subsequent processing steps, and may thus lead to avoidable system failures. One common way of avoiding this problem is the use of N-best lists or word lattices as output representations, but these may require more complex NLU processing and/or increased processing times. In this paper, we follow an alternative approach: integrating flat HMM-based semantic analysis with the speech recognition process, resulting in a one-stage recognition system that avoids hard decisions between ASR and NLU. The resulting system produces word hypotheses where each word is annotated with a semantic label, from which a frame-based semantic representation may easily be constructed. Fig. 1 sketches the individual processes involved in our integrative approach. The shaded portions in the figure indicate the models and processing steps that will be modified by versions using semantic labels.
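To illustrate how a frame-based representation may be read off such semantically labelled word hypotheses, the following is a minimal sketch. The label inventory (FROMLOC, TOLOC, DUMMY for semantically empty words) and the example hypothesis are invented for illustration and are not taken from the paper's actual label set:

```python
# Sketch: build a flat frame from (word, semantic_label) hypotheses.
# Words labelled DUMMY carry no semantic content; the rest fill frame slots.
def words_to_frame(labelled_words):
    """labelled_words: list of (word, semantic_label) pairs from the decoder."""
    frame = {}
    for word, label in labelled_words:
        if label == "DUMMY":  # filler word, skip
            continue
        frame.setdefault(label, []).append(word)
    return {slot: " ".join(words) for slot, words in frame.items()}

hyp = [("show", "DUMMY"), ("flights", "DUMMY"), ("from", "DUMMY"),
       ("boston", "FROMLOC"), ("to", "DUMMY"), ("denver", "TOLOC")]
print(words_to_frame(hyp))  # {'FROMLOC': 'boston', 'TOLOC': 'denver'}
```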

This leads to an overall architecture in which a separate semantic decoding step (5) becomes dispensable. One contribution of this work is to show that, compared to other one-stage approaches (Thomae et al., 2003), such an integrated recognition system does not require a specialized decoder or complex language model support. Instead, basic bi-gram language models may be used. We achieve the integration by "reducing" the NLU part to language modelling whilst enriching the lexicon and language model with semantic information. Conventional basic language modelling techniques are capable of representing this information. We redefine the units used in the language model: instead of using "plain" words, these are annotated with additional information, which may consist of semantic labels and context information. For each of these annotated variants of a word, the phonetic transcription of the "plain" word is used. Consequently, the ASR cannot decide which variant to choose on the basis of the acoustic model, and no retraining of the acoustic model is necessary. The speech recogniser produces word hypotheses enriched with semantic labels.

Figure 1: Principal knowledge sources and models of speech recognition and semantic analysis. Shaded parts constitute the changes when using a one-stage approach. The numbers indicate the following computational steps: (1) acoustic model parameter estimation, (2) language modelling, (3) Viterbi acoustic decoding, (4) semantic model parameter estimation, (5) Viterbi semantic decoding.

The remainder of this paper is structured as follows: In the next section we give a brief overview of the Cambridge HTK software we used for our experiments with the ATIS corpus.

In Section 3 we outline the HMM-based parsing method. The basic approach for adding information to the speech recogniser's language model is described in Section 4. In Section 5 we discuss our experiments and present speech recognition results. Finally, we conclude by pointing out further possible improvements.

2 Acoustic Modelling and Speech Recognition Using HTK

Speech recognition may be formulated as an optimisation problem: given a sequence of observations O consisting of acoustic feature vectors, determine the sequence of words W that maximizes the conditional probability P(W|O). Bayes' rule is used to replace this conditional probability, which is not directly computable, by the product of two components: P(O|W), the acoustic model, and P(W), the language model.

    W_opt = argmax_W { P(W) P(O|W) }    (1)

The Cambridge Hidden Markov Model Toolkit (HTK) (Young et al., 2004) can be used to build robust speaker-independent speech recognition systems.
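The decision rule in Eq. (1) can be sketched as follows. The hypotheses and their probabilities below are invented for illustration; a real decoder performs this maximisation implicitly over a network rather than over an explicit hypothesis list:

```python
import math

# Sketch of Eq. (1): among competing word sequences W, choose the one
# maximising P(W) * P(O|W), i.e. the sum of log language-model and
# log acoustic scores. Scores here are made up for illustration.
def best_hypothesis(hypotheses):
    """hypotheses: list of (words, log P(W), log P(O|W)) triples."""
    return max(hypotheses, key=lambda h: h[1] + h[2])[0]

hyps = [
    (["flights", "to", "boston"],  math.log(0.020), math.log(1e-5)),
    (["flights", "two", "boston"], math.log(0.001), math.log(1e-5)),
]
# The acoustic scores tie, so the language model decides.
print(best_hypothesis(hyps))  # ['flights', 'to', 'boston']
```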

The tied acoustic model parameters are estimated by the forward-backward algorithm. The HTK Viterbi decoder can be used together with a probabilistic word network that may be computed from a finite state grammar or the bi-gram statistics of a text corpus. The decoder's token passing algorithm is able to produce word hypothesis lattices or N-best lists of recognition results. Internally, this word network is combined with a phonetic transcription dictionary to produce an expanded network of phoneme states. Usually, one phoneme or triphone is represented by five states.

Figure 2: Semantic case grammar formalism. reference word: case frame or concept identifier; case frame: set of cases related to a concept; case: attribute of a concept; case marker: surface structure indicator of a case; case system: complete set of cases of the application.

For our experiments with the ATIS corpus, the acoustic model is constructed in the conventional way. We use 4500 utterances to train a triphone recogniser with 8 Gaussian mixtures. The 5929 physical triphones expand to 27619 logical ones. The acoustic model is used for both the two-stage and the one-stage experiments.
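Estimating the bi-gram statistics mentioned above is straightforward; a minimal sketch over annotated word:label units of the kind used in our one-stage setup follows. The tiny corpus, the token naming scheme, and the absence of smoothing are all simplifications for illustration:

```python
from collections import Counter

# Sketch: maximum-likelihood bi-gram estimation over annotated units,
# P(w2 | w1) = count(w1, w2) / count(w1), with sentence-boundary markers.
def bigram_probs(sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])                  # history counts
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # adjacent pairs
    return {pair: count / unigrams[pair[0]] for pair, count in bigrams.items()}

corpus = [["from:DUMMY", "boston:FROMLOC", "to:DUMMY", "denver:TOLOC"],
          ["from:DUMMY", "denver:FROMLOC"]]
probs = bigram_probs(corpus)
print(probs[("from:DUMMY", "boston:FROMLOC")])  # 0.5
```

Note that because each annotated variant is a distinct language-model unit, the bi-gram statistics can prefer, e.g., `boston:FROMLOC` after `from:DUMMY` even though both variants of "boston" share one phonetic transcription.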
