HMM-Based Semantic Case Frame Analysis


In the domain of spoken language information retrieval, spontaneous effects in speech are very important (Minker, 1999). These include false starts, repetitions and ill-formed utterances. It would therefore be improvident to base the semantic extraction exclusively on a syntactic analysis of the input utterance. Parsing failures due to ungrammatical syntactic constructs may be reduced if the phrases containing important semantic information can be extracted whilst ignoring the non-essential or redundant parts of the utterance.

Restarts and repeats frequently occur between phrases, and poorly formed syntactic constructs often consist of well-formed phrases that are semantically meaningful. One approach to extracting semantic information is based on case frames. The original concept of a case frame, as described by Fillmore (Fillmore, 1968), is based on a set of universally applicable cases or case values, which express the relationship between a verb and its nouns. Bruce (Bruce, 1975) extended the Fillmore theory to any concept-based system and defined an appropriate semantic grammar whose formalism is given in Fig. For the example query could you give me a ticket price on [uh] [throat clear] a flight first class from San Francisco to Dallas please, a typical semantic case grammar would instantiate the following terminals:
• price: this reference word identifies the concept airfare (other concepts may be: book, flight, …)
• from: case marker of the case from-city corresponding to the departure city San Francisco

• to: case marker of the case to-city corresponding to the arrival city Dallas
• class: case marker of the case flight-class corresponding to first
• case system: from, to, class, …

The parsing process based on a semantic case grammar typically considers less than 50% of the example query to be semantically meaningful; the hesitations and false starts are ignored. The approach therefore appears well suited for natural language understanding components where the need for semantic guidance in parsing is especially relevant. Case frame analysis may be used in a rule-based case grammar. Here, we apply HMM-based modelling instead (Pieraccini et al., 1992; Minker et al., 1999). In the frame-based representation, the semantic labelling does not consider all the words of the utterance, but only those related to the concept and its cases. However, in order to estimate the model parameters, each word of the utterance must have a corresponding semantic label. Thus, the additional label (null) is assigned to those words not used by the case frame analyzer for the specific application. A semantic sequence consists of basic labels corresponding respectively to the reference words, the case markers (m:case), the case values (v:case) and irrelevant words (null). Relative occurrences of model states and observations are used to establish the Markov Model, whose topology needs to be fixed prior to training and decoding.
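A minimal sketch of the frame-based labelling of the example query may make the coverage figure concrete. The label names below follow the scheme described in the text; the exact label for the reference word (here written (concept:airfare)) and the per-word assignments are illustrative assumptions, since the paper does not list the full inventory.

```python
# Hypothetical word-by-word labelling of the example query.
# "(concept:airfare)" and the per-word assignments are assumptions
# for illustration, not the paper's exact label set.
query = ("could you give me a ticket price on uh a flight "
         "first class from San Francisco to Dallas please").split()

labels = {
    "price": "(concept:airfare)",   # reference word identifying the concept
    "from": "(m:from-city)",        # case markers
    "to": "(m:to-city)",
    "class": "(m:flight-class)",
    "San": "(v:from-city)", "Francisco": "(v:from-city)",  # case values
    "Dallas": "(v:to-city)",
    "first": "(v:flight-class)",
}

# Every remaining word receives the (null) label.
tagged = [(w, labels.get(w, "(null)")) for w in query]
meaningful = [t for t in tagged if t[1] != "(null)"]
print(f"{len(meaningful)}/{len(tagged)} words are semantically meaningful")
```

With these toy assignments, 8 of 19 words carry semantic content, consistent with the "less than 50%" coverage noted above.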

Semantic labels are defined as the states sj. All states, such as the examples (v:at-city) and (null) shown, can follow each other; thus the model is ergodic. In direct analogy to the speech recognition problem (equation 1), the decoding consists of maximizing the conditional probability P(S|W) of some state sequence S given the observation sequence W:

[S]opt = argmaxS {P(S) P(W|S)} (2)

Given the dimensionality of the sequence W, the exact computation of the likelihood P(W|S) is intractable. Again, bi-grams are a common approximation in order to robustly estimate the Markov Model parameters: the state transition probabilities P(sj|si) and the observation symbol probability distribution P(wm|sj) in state j. In contrast to speech recognition, the computation of the model parameters can be achieved through maximum likelihood estimation, i.e. by counting event occurrences. Usually a back-off and discounting strategy is applied in order to improve robustness in the face of unseen events. An HMM-based parsing module may be conceived of as a probabilistic finite state transducer that translates a sequence of words into a sequence of semantic labels; each label denotes the word's function in the semantic representation. Although the flat semantic model has known limitations with respect to the representation of long-term dependencies, it is often sufficient for practical applications. Several methods, such as contextual observations and garbage models, have been shown to enhance the performance of HMM-based stochastic parsing models (Beuschel et al., 2004).
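The training-by-counting and Viterbi decoding described above can be sketched in a few lines. This is a toy illustration under simplifying assumptions: an invented two-sentence labelled corpus, and add-one smoothing in place of the back-off and discounting strategy mentioned in the text.

```python
from collections import defaultdict
import math

# Toy labelled corpus (invented) of (word, semantic label) pairs.
corpus = [
    [("show", "(null)"), ("flights", "(concept)"),
     ("to", "(m:to-city)"), ("dallas", "(v:to-city)")],
    [("flights", "(concept)"), ("from", "(m:from-city)"),
     ("boston", "(v:from-city)"), ("to", "(m:to-city)"),
     ("denver", "(v:to-city)")],
]

# Maximum likelihood estimation by counting event occurrences.
trans = defaultdict(lambda: defaultdict(int))   # counts for P(s_j | s_i)
emit = defaultdict(lambda: defaultdict(int))    # counts for P(w_m | s_j)
for sent in corpus:
    prev = "<s>"
    for w, s in sent:
        trans[prev][s] += 1
        emit[s][w] += 1
        prev = s

states = list(emit)
vocab = {w for sent in corpus for w, _ in sent}

def log_p(table, given, event, n_events):
    # Add-one smoothed log probability (a stand-in for back-off/discounting).
    total = sum(table[given].values())
    return math.log((table[given][event] + 1) / (total + n_events))

def viterbi(words):
    # Maximize P(S)P(W|S) over state sequences S, as in equation (2).
    best = {s: (log_p(trans, "<s>", s, len(states))
                + log_p(emit, s, words[0], len(vocab)), [s])
            for s in states}
    for w in words[1:]:
        new = {}
        for s in states:
            e = log_p(emit, s, w, len(vocab))
            score, path = max(
                (best[p][0] + log_p(trans, p, s, len(states)) + e,
                 best[p][1] + [s]) for p in states)
            new[s] = (score, path)
        best = new
    return max(best.values())[1]

print(viterbi(["show", "flights", "to", "denver"]))
# → ['(null)', '(concept)', '(m:to-city)', '(v:to-city)']
```

Note how "denver" is decoded as (v:to-city) although it only occurred after (m:to-city) context in training; the bigram state model carries the case information.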

Adding Information to the Language Model


As mentioned above, the language model P(W) represents the probability of a word sequence. With the bi-gram approximation P(W) ≈ ∏i P(wi|wi−1), this probability becomes a product of transition probabilities between words in a word network. By adding information to the language model (cf. shaded parts of Fig. 1), we modify the word network such that, instead of "plain" words as nodes, the network now contains "annotated" variants of these original words. The annotation encodes additional information that is relevant to the further processing in the dialogue system, but does not affect the pronunciation of the word. By introducing such labelled word variants, it is possible to encode additional relations that exist between the labels rather than between the words. Consider the following utterance and a corresponding labelling of each word with additional information:

Show-(null) me-(null) ground- transportation- for-(null) Dallas-(v:at-city)

The word network computed from utterances of the latter form instead of plain texts will represent the fact that after a word labelled as (null), a city name labelled as (v:at-city) is much more likely than one labelled as (v:from-city) or (v:to-city).
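Computing bi-gram statistics over such annotated words is mechanically the same as for a plain language model; only the token inventory changes. A minimal sketch, using an invented two-utterance labelled corpus:

```python
from collections import Counter

# Invented labelled training texts: each token is a word-label pair
# written as "word-(label)", as in the utterance above.
labelled = [
    ["show-(null)", "me-(null)", "flights-(concept)",
     "to-(m:to-city)", "dallas-(v:to-city)"],
    ["flights-(concept)", "to-(m:to-city)", "denver-(v:to-city)"],
]

bigrams = Counter()
unigrams = Counter()
for sent in labelled:
    # Count bi-gram events, including sentence boundaries.
    for prev, cur in zip(["<s>"] + sent, sent + ["</s>"]):
        bigrams[(prev, cur)] += 1
        unigrams[prev] += 1

def p(cur, prev):
    # Unsmoothed ML bi-gram estimate over labelled words. The labelled
    # variants of a word are distinct nodes, so the same surface word
    # with different labels gets different transition probabilities.
    return bigrams[(prev, cur)] / unigrams[prev]

print(p("to-(m:to-city)", "flights-(concept)"))  # 1.0 in this toy corpus
```

In a real system these counts would of course be smoothed; the point is that the network nodes are now word-label pairs rather than plain words.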

In order to compute the modified version of the network, it is only necessary to replace the words by their labelled variants in the training corpus and to compute the bi-gram statistics from this modified corpus (cf. step (2) in Fig. 1). For expanding the word network into a network of phoneme states, as required by the speech recognition, it is necessary to modify the phonetic transcription dictionary accordingly: for each labelled variant of a word appearing in the labelled training texts, the respective unlabelled word entry is copied. The Viterbi decoder will now output sequences of annotated words (step (3)). The language model may not only be enriched by semantic labels; other information, such as the context of the word, may also be used. A language model labelled with a context that consists of one word on the left is essentially a tri-gram model. There is a trade-off between what the network can express and its size: using too many different labels for each word in the network may quickly result in word networks impractical for real-time use.

Table 1: Word network sizes for different labelling techniques. "Expanded" refers to the phoneme state network. t denotes the estimated per-utterance processing time.

For our experiments within the ATIS domain, Table 1 summarises the word network sizes for different labelling methods. Here, "ASR" refers to the original baseline unlabelled language model.
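The dictionary modification amounts to copying pronunciation entries for every labelled variant that occurs in the training texts. A small sketch, with invented phoneme strings (the actual dictionary format and phone set are not specified in the text):

```python
# Unlabelled pronunciation dictionary; phoneme strings are illustrative only.
lexicon = {"dallas": "d ae l ax s", "to": "t uw"}

# Labelled variants observed in the (invented) labelled training texts.
labelled_vocab = {"dallas-(v:to-city)", "dallas-(v:from-city)",
                  "to-(m:to-city)"}

expanded = dict(lexicon)
for variant in labelled_vocab:
    base = variant.rsplit("-(", 1)[0]      # strip the "-(label)" suffix
    if base in lexicon:
        # The label does not affect pronunciation, so the entry is copied.
        expanded[variant] = lexicon[base]

print(len(expanded))  # the 2 original entries plus 3 labelled copies
```

Because the copies share pronunciations, the phoneme-state expansion grows with the number of labelled variants, which is the network-size trade-off quantified in Table 1.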

“ASR/Cl” is a simple class-based language model with manually defined classes. A left context of one word was used in “ASR/Co”, and combined with classes in “ASR/CC”. These labelled versions may be used in the two-stage approach to improve the speech recognition results. “ASR+”, “ASR+Co” and “ASR+N” refer to semantically labelled language models. “ASR+” is directly trained on the semantically labelled training texts. “ASR+Co” furthermore includes a left context of one semantic label, whereas “ASR+N” includes sub-label numbering. As can be seen from the numbers, word classes as well as the semantic methods only incur a modest increase in network size. The word-based methods, however, significantly inflate the model. Although we have not systematically recorded the time necessary for recognizing the test set with these networks, it is fair to say the time escalates from minutes to hours. The last column in Table 1 denotes the estimated average per-utterance processing time in seconds. The numbers were obtained on a Pentium 4 with 2.6 GHz clock speed and 1 GB of RAM running Linux.

For our experiments with the ATIS corpus, the stochastic parsing model is computed from 1500 utterances, manually annotated in a bootstrapping process. We use 13 semantic word classes (e.g. /weekday/, /number/, /month/, /cityname/). The semantic representation consists of 70 different labels. Splitting sequences of identical labels into numbered sub-labels results in 174 numbered labels. The semantic representation focuses on the identification of key slots, such as origin, destination and stop-over locations, as well as airline names and flight numbers. Word sequences containing temporal information, such as constraints on the departure or arrival time, are not analysed in detail. Instead, all these words are marked with (v:arrive-time) or (v:depart-time), respectively. The test corpus consists of 418 utterances which were manually annotated with semantic labels.
For the two-stage approach different word-based language models (plain, class-based, context, combined) were used (cf. section 4). An N-best decoding was performed and 20 hypotheses were subsequently labelled by the stand-alone stochastic parser. After that, the result with the maximum combined probability value was chosen.
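The N-best rescoring step above can be sketched as follows. The hypotheses, their log scores, and the parser scores are invented placeholders; the real system combines the recognizer's probability with the stochastic parser's best-path probability over 20 hypotheses.

```python
# Invented 2-best list: (hypothesis, recognizer log-prob, parser log-prob).
# In the real system the parser score comes from labelling each hypothesis
# with the stand-alone stochastic parser.
nbest = [
    (["flights", "to", "dallas"], -11.0, -5.0),
    (["flights", "through", "dallas"], -10.5, -6.2),
]

# Choose the hypothesis with the maximum combined probability
# (sum of log-probabilities).
best = max(nbest, key=lambda h: h[1] + h[2])
print(best[0])  # → ['flights', 'to', 'dallas']
```

With these toy numbers, the recognizer alone prefers the second hypothesis, but the parser's semantic score flips the choice to the first one, which is exactly the intended effect of the rescoring.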
In the one-stage approach, two refinements (context and numbering) of the semantically labelled language model were evaluated (cf. “ASR+Co” and “ASR+N”).

Speech Recognition Experiments


Tables 2 and 3 present the results of these experiments, based on word recognition and concept recognition performance, respectively. The columns titled “Correct” and “Accuracy” refer to word correct rate and word accuracy, as well as to their concept-level equivalents. The “Sentence” column lists the percentage of completely correctly decoded sentences. For the two-stage approach, the numbers in Table 2 denote the performance of the speech recognition system alone (step (3) in Fig. 1). For the one-stage approach, the semantic labels were removed after decoding in order to obtain the plain word sequences.

It can be seen that the word-based recognition benefits both from word-based additions to the language model and from semantic labels.

Table 3 summarizes the concept-level results. Here, the semantic labels are also compared against the reference; numbers in sub-labels are ignored, however. The “NLU” row denotes the performance on perfectly recognized data, i.e. on the training transcriptions. One-stage integrated recognition produces competitive recognition rates when compared to the two-stage approach, even though in the two-stage approach each stage's representation can be fine-tuned separately. It is interesting to note a subtle difference between the decoding procedures of the two-stage and the one-stage architectures. In a stand-alone stochastic parser, Viterbi decoding is used for word-to-label correspondences. The probability of a transition from semantic state si to sj is thus defined as the product P(wj|sj)P(sj|si), where P(wj|sj) is the probability of observing wj in state sj. In contrast, if a labelled language model is used, the transition probability is P(wj|wi), where wi and wj are pairs of the actual words and their associated labels, so the surface form of the last word influences the transition as well (not only its label).

Conclusions and Future Work

It can be shown that a flat HMM-based semantic analysis does not require a separate decoding stage. Instead it seems possible to use the speech recogniser's language model to represent the semantic state model, without compromising recognition in terms of word or slot error rate.

For a stand-alone speech recognition component, it seems advantageous to use a class-based or context-based language model, since it improves the word recognition score. For the stochastic parsing, numbered sub-labels provide the best results. With N-best decoding, the stochastic parser can be used to select the best overall hypothesis.

A number of improvements and extensions may be considered for the different processing stages. Firstly, instead of representing compound airport and city names such as “New York City” as word sequences, they could be entered in the dictionary as single words, which should avoid certain recognition errors. In addition, an equivalent of a class-based language model should be defined for semantically annotated language models. Also, contextual observations, i.e. the use of a class of manually defined context words, could help the stochastic parser to address long-term dependencies that have so far proved difficult. Finally, the ATIS task results in relatively simple semantic structures and yields a limited vocabulary size. It would be interesting to apply our proposed techniques to a more complex domain, such as an appointment scheduling task (Minker et al., 1999), implying a more natural speech-based interaction. This would enable us to validate our approach on larger vocabulary sizes.

INTERSPEECH’2005 SCIENCE QUIZ

Fun and imagination at INTERSPEECH’2005 – EUROSPEECH!

Upon registration in Lisbon, all INTERSPEECH’2005 participants received a sheet with 16 intriguing questions from the area of language and speech science and technology. They were selected from proposals from colleagues from all over the world. Participants were challenged to find the right answers during INTERSPEECH’2005 and to compete for the honour and a nice prize, a beautiful vase of Portuguese ceramics.

THE ANSWERS AND THE WINNER
Despite much discussion and frantic Internet searching, the quiz was considered quite difficult, and only 40 participants returned the form. There was a tie between two participants who got 12 answers right. The final winner was Ibon Saratxaga from Spain, with Arlo Faria in second place. Other high scores were obtained by Mats Blomberg, Lou Boves, Frederic Bimbot, Athanassios Katsamanis and Bernd Möbius.