As mentioned above, the language model P(W) represents the probability of a word sequence. With the bigram approximation P(W) ≈ ∏i P(wi | wi−1), this probability reduces to transition probabilities between words in a word network. By adding information to the language model (cf. shaded parts of Fig. 1) we modify the word network so that, instead of “plain” words, its nodes are “annotated” variants of the original words. The annotation encodes additional information that is relevant to further processing in the dialogue system but does not affect the pronunciation of the word. Introducing such labelled word variants makes it possible to encode relations that hold between the labels rather than between the words themselves. Consider the following utterance with each word labelled with additional information:

Show-(null) me-(null) ground- transportation- for-(null) Dallas-(v:at-city)

A word network computed from utterances in this form, rather than from plain text, captures the fact that after a word labelled (null), a city name labelled (v:at-city) is much more likely than one labelled (v:from-city) or (v:to-city).
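The labelled-bigram idea described above can be sketched as follows: bigram statistics are simply counted over word–label pairs instead of plain words. The corpus and label strings below are illustrative toy data, not taken from the ATIS training texts.

```python
from collections import defaultdict

def bigram_counts(sentences):
    """Count bigram transitions over labelled tokens such as 'Dallas-(v:at-city)'."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return counts

def bigram_prob(counts, prev, cur):
    """Relative-frequency estimate of P(cur | prev); 0 if prev was never seen."""
    total = sum(counts[prev].values())
    return counts[prev][cur] / total if total else 0.0

# Toy labelled corpus (labels hypothetical)
corpus = [
    ["show-(null)", "me-(null)", "flights-(null)", "for-(null)", "Dallas-(v:at-city)"],
    ["flights-(null)", "from-(null)", "Boston-(v:from-city)", "to-(null)", "Dallas-(v:to-city)"],
]
c = bigram_counts(corpus)
print(bigram_prob(c, "for-(null)", "Dallas-(v:at-city)"))  # 1.0 in this toy corpus
```

Because the labels are part of the token identity, the resulting model automatically prefers label sequences that occurred in the training data.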
In order to compute the modified version of the network it suffices to replace the words in the training corpus by their labelled variants and to compute the bigram statistics from this modified corpus (cf. step (2) in Fig. 1). To expand the word network into the network of phoneme states required by the speech recognizer, the phonetic transcription dictionary must be modified accordingly: for each labelled variant of a word appearing in the labelled training texts, the respective unlabelled word entry is copied. The Viterbi decoder then outputs sequences of annotated words (step (3)).

The language model may be enriched not only with semantic labels; other information, such as the context of a word, may also be used. A language model labelled with a context consisting of one word on the left is essentially a trigram model. There is a trade-off between what the network can express and its size: using too many different labels per word may quickly result in word networks impractical for real-time use.

Table 1: Word network sizes for different labelling techniques. “Expanded” refers to the phoneme state network; t denotes the estimated per-utterance processing time.

For our experiments within the ATIS domain, Table 1 summarises the word network sizes for different labelling methods. Here, “ASR” refers to the original unlabelled baseline language model.
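The dictionary modification can be sketched as copying the unlabelled pronunciation entry for every labelled variant, since the label does not affect pronunciation. The function name, dictionary format, and entries below are illustrative assumptions, not the paper's actual lexicon.

```python
def expand_dictionary(base_dict, labelled_vocab):
    """For each labelled variant 'word-(label)', copy the pronunciation of 'word'.
    The label is phonetically irrelevant, so the phoneme string is reused verbatim."""
    expanded = dict(base_dict)  # keep the original unlabelled entries
    for variant in labelled_vocab:
        word, _, _ = variant.partition("-(")
        if word in base_dict:
            expanded[variant] = base_dict[word]
    return expanded

# Toy pronunciations (hypothetical phoneme strings)
base = {"dallas": "d ae l ax s", "show": "sh ow"}
vocab = {"dallas-(v:at-city)", "dallas-(v:to-city)", "show-(null)"}
new = expand_dictionary(base, vocab)
print(new["dallas-(v:at-city)"])  # same phonemes as the plain entry 'dallas'
```

Every labelled variant thus becomes a separate decoder vocabulary item with an identical phoneme-state expansion, which is exactly why the expanded network grows with the number of labels per word.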
“ASR/Cl” is a simple class-based language model with manually defined classes. A left context of one word was used in “ASR/Co”, and combined with classes in “ASR/CC”. These labelled versions may be used in the two-stage approach to improve the speech recognition results. “ASR+”, “ASR+Co” and “ASR+N” refer to semantically labelled language models. “ASR+” is trained directly on the semantically labelled training texts. “ASR+Co” additionally includes a left context of one semantic label, whereas “ASR+N” includes sub-label numbering. As the numbers show, word classes as well as the semantic methods incur only a modest increase in network size. The word-based methods, however, significantly inflate the model. Although we have not systematically recorded the time needed to recognize the test set with these networks, it is fair to say that it escalates from minutes to hours. The last column in Table 1 gives the estimated average per-utterance processing time in seconds. The numbers were obtained on a 2.6 GHz Pentium 4 with 1 GB of RAM running Linux.

For our experiments with the ATIS corpus, the stochastic parsing model is computed from 1500 utterances, manually annotated in a bootstrapping process. We use 13 semantic word classes (e.g. /weekday/, /number/, /month/, /cityname/). The semantic representation consists of 70 different labels; splitting sequences of identical labels into numbered sub-labels results in 174 numbered labels. The representation focuses on the identification of key slots, such as origin, destination and stop-over locations, as well as airline names and flight numbers. Word sequences containing temporal information, such as constraints on the departure or arrival time, are not analysed in detail; instead, all these words are marked with (v:depart-time) or (v:arrive-time), respectively. The test corpus consists of 418 utterances which were manually annotated with semantic labels.
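The sub-label numbering used for “ASR+N” can be sketched as splitting each run of identical labels into positionally numbered variants; the `#n` naming scheme below is an assumption for illustration, not the paper's notation.

```python
def number_sublabels(labels):
    """Turn runs of identical labels into numbered sub-labels,
    e.g. [a, a, b] -> [a#1, a#2, b#1]."""
    numbered = []
    run = 0
    for i, lab in enumerate(labels):
        # restart the counter whenever the label changes
        run = run + 1 if i > 0 and labels[i - 1] == lab else 1
        numbered.append(f"{lab}#{run}")
    return numbered

print(number_sublabels(["(v:depart-time)", "(v:depart-time)", "(null)"]))
# ['(v:depart-time)#1', '(v:depart-time)#2', '(null)#1']
```

This is how 70 base labels can expand to 174 numbered labels: each base label contributes as many numbered variants as the longest run it appears in within the training data.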
For the two-stage approach, different word-based language models (plain, class-based, context, combined) were used (cf. Section 4). An N-best decoding was performed, and the 20 best hypotheses were subsequently labelled by the stand-alone stochastic parser. The result with the maximum combined probability was then chosen.
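The two-stage rescoring step can be sketched as follows, assuming log-probability scores from the recognizer and the parser are simply summed; the toy parser and all scores below are illustrative, and the actual combination used in the paper may weight the two models differently.

```python
def rescore_nbest(hypotheses, parse):
    """Pick the hypothesis maximising the combined log-probability.
    hypotheses: list of (word_sequence, asr_logprob) pairs;
    parse: function mapping words -> (labelled_sequence, parser_logprob)."""
    best, best_score = None, float("-inf")
    for words, asr_lp in hypotheses:
        labelled, parse_lp = parse(words)
        score = asr_lp + parse_lp  # combine probabilities in the log domain
        if score > best_score:
            best, best_score = labelled, score
    return best, best_score

# Toy stand-in for the stochastic parser (hypothetical scoring)
def toy_parse(words):
    lp = -1.0 if "dallas" in words else -5.0
    return [w + "-(null)" for w in words], lp

nbest = [(["show", "flights"], -2.0), (["show", "dallas"], -2.5)]
print(rescore_nbest(nbest, toy_parse))
```

Note that the parser can overturn the recognizer's top choice: here the second hypothesis has a worse acoustic/LM score but wins on the combined value.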
In the one-stage approach, two refinements (context and numbering