{"title": "Maximum-Likelihood Continuity Mapping (MALCOM): An Alternative to HMMs", "book": "Advances in Neural Information Processing Systems", "page_first": 744, "page_last": 750, "abstract": null, "full_text": "Maximum-Likelihood Continuity Mapping (MALCOM): An Alternative to HMMs \n\nDavid A. Nix (dnix@lanl.gov) and John E. Hogden (hogden@lanl.gov) \n\nComputer Research & Applications, CIC-3, MS B265 \n\nLos Alamos National Laboratory, Los Alamos, NM 87545 \n\nAbstract \n\nWe describe Maximum-Likelihood Continuity Mapping (MALCOM), an alternative to hidden Markov models (HMMs) for processing sequence data such as speech. While HMMs have a discrete \"hidden\" space constrained by a fixed finite-automaton architecture, MALCOM has a continuous hidden space (a continuity map) that is constrained only by a smoothness requirement on paths through the space. MALCOM fits into the same probabilistic framework for speech recognition as HMMs, but it represents a more realistic model of the speech production process. To evaluate the extent to which MALCOM captures speech production information, we generated continuous-speech continuity maps for three speakers and used the paths through them to predict measured speech articulator data. The median correlation between the MALCOM paths obtained from only the speech acoustics and the articulator measurements was 0.77 on an independent test set not used to train MALCOM or the predictor. This unsupervised model achieved correlations over speakers and articulators only 0.02 to 0.15 lower than those obtained using an analogous supervised method which used articulatory measurements as well as acoustics. 
\n\n1 INTRODUCTION \n\nHidden Markov models (HMMs) are generally considered to be the state of the art in speech recognition (e.g., Young, 1996). The strengths of the HMM framework include a rich mathematical foundation, powerful training and recognition algorithms for large speech corpora, and a probabilistic framework that can incorporate statistical phonology and syntax (Morgan & Bourlard, 1995). However, HMMs are known to be a poor model of the speech production process. While speech production is a continuous, temporally evolving process, HMMs treat speech production as a discrete, finite-state system where the current state depends only on the immediately preceding state. Furthermore, while HMMs are designed to capture temporal information as state transition probabilities, Bourlard et al. (1995) suggest that when the transition probabilities are replaced by constant values, recognition results do not significantly deteriorate. That is, while transitions are often considered the most perceptually relevant component of speech, the conventional HMM framework is poor at capturing transition information. \n\nGiven these deficiencies, we are considering alternatives to the HMM approach that maintain its strengths while improving upon its weaknesses. This paper describes one such model, called Maximum-Likelihood Continuity Mapping (MALCOM). We first review a general statistical framework for speech recognition so that we can compare the HMM and MALCOM formulations. Then we consider what the abstract hidden state represents in MALCOM, demonstrating empirically that the paths through MALCOM's hidden space are closely related to the movements of the speech production articulators. 
\n\n2 A GENERAL FRAMEWORK FOR SPEECH RECOGNITION \n\nConsider an unknown speech waveform that is converted by a front-end signal-processing module into a sequence of acoustic vectors X. Given a space of possible utterances, W, the task of speech recognition is to return the most likely utterance W* given the observed acoustic sequence X. Using Bayes' rule, this corresponds to \n\nW* = argmax_W P(W|X) = argmax_W P(X|W) P(W) / P(X). \n\n(1) \n\nIn recognition, P(X) is typically ignored because it is constant over all W, and the posterior P(W|X) is estimated as the product of the prior probability of the word sequence, P(W), and the probability that the observed acoustics were generated by the word sequence, P(X|W). The prior P(W) is estimated by a language model, while the production probability P(X|W) is estimated by an acoustic model. In continuous speech recognition, the product of these terms must be maximized over W; however, in this paper we restrict our attention to the form of the acoustic model only. Every candidate utterance W corresponds to a sequence of word/phone models M_W such that P(X|W) = P(X|M_W), and each M_W considers all possible paths through some \"hidden\" space. Thus, for each candidate utterance, we must calculate \n\nP(X|M_W) = ∫ P(X|Y, M_W) P(Y|M_W) dY, \n\n(2) \n\nwhere Y is some path through the hidden space. \n\n2.1 HIDDEN MARKOV MODELS \n\nBecause HMMs are finite-state machines with a given fixed architecture, the path Y through the hidden space corresponds to a series of discrete states, simplifying the integral of Eq. (2) to a sum. However, to avoid computing the contribution of all possible paths, the Viterbi approximation, which considers only the single path that maximizes Eq. (2), is frequently used without much loss in recognition performance (Morgan & Bourlard, 1995). Thus, \n\nP(X|M_W) ≈ max_Y P(X|Y, M_W) P(Y|M_W). 
\n\n(3) \n\nThe first term corresponds to the product of the emission probabilities of the acoustics given the state sequence and is typically estimated by mixtures of high-dimensional Gaussian densities. The second term corresponds to the product of the state transition probabilities. However, because Bourlard et al. (1995) found that this second term contributes little to recognition performance, the modeling power of the conventional HMM must reside in the first term. Training the HMM system involves estimating both the emission and the transition probabilities from real speech data. The Baum-Welch/forward-backward algorithm (e.g., Morgan & Scofield, 1994) is the standard computationally efficient algorithm for iteratively estimating these distributions. \n\n2.2 MAXIMUM-LIKELIHOOD CONTINUITY MAPPING (MALCOM) \n\nIn contrast to HMMs, the multi-dimensional MALCOM hidden space is continuous: there are an infinite number of states and paths through them. While the HMM is constrained by a fixed architecture, MALCOM is constrained by the notion of continuity of the hidden path. That is, the path must be smooth and continuous: it may not carry any energy above a given cutoff frequency. Unlike the discrete path in an HMM, the smooth hidden path in MALCOM attempts to emulate the motion of the speech articulators in what we call a continuity map (CM). \n\nUnless we know how to evaluate the integral of Eq. (2) (which we currently do not), we must also make the Viterbi approximation and approximate P(X|M_W) by considering only the single path that maximizes the likelihood of the acoustics X given the utterance model M_W, resulting in Eq. (3) once again. Analogously, the first term, P(X|Y, M_W), corresponds to the acoustic generation probability given the hidden path, and the second term corresponds to the probability of the hidden path given the utterance model. 
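Both formulations rely on the Viterbi approximation of Eq. (3). In the discrete HMM case it reduces to dynamic programming over states; the following is a minimal log-domain sketch, where the 2-state transition and emission numbers are purely illustrative (they are not taken from the paper):

```python
import numpy as np

def viterbi_log(log_A, log_B, log_pi):
    # Best single state path through an HMM: the Viterbi approximation
    # of Eq. (3), max_Y P(X|Y, M) P(Y|M), computed in the log domain.
    #   log_A:  (S, S) log transition probabilities
    #   log_B:  (T, S) log emission probability of each observed frame
    #   log_pi: (S,)   log initial-state probabilities
    T, S = log_B.shape
    delta = log_pi + log_B[0]              # best log score ending in each state
    back = np.zeros((T, S), dtype=int)     # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A    # scores[i, j]: best score via i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):          # trace the best path backwards
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta.max())

# Toy example: sticky states, emissions favoring state 0 twice, then state 1.
log_A = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
log_B = np.log(np.array([[0.9, 0.1], [0.9, 0.1], [0.01, 0.99]]))
log_pi = np.log(np.array([0.5, 0.5]))
best_path, best_score = viterbi_log(log_A, log_B, log_pi)
```

With these numbers the single best path stays in state 0 for two frames and then jumps to state 1; the score is the log of the corresponding product of transition and emission probabilities.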
This paper focuses on the first term because this is the term that produces conventional HMM performance.1 \n\n1However, we are currently developing a model of P(Y|M_W) to replace the corresponding (and useless) term in the conventional HMM formulation as well (Hogden et al., 1998). \n\nCommon to all M_W is a set of N probability density functions (pdfs) Φ that define the CM hidden space, modeling the likelihood of Y given X for an N-code vector quantization (VQ) of the acoustic space. Because these pdfs are defined over the low-dimensional CM space instead of the high-dimensional acoustic space (e.g., 6 vs. 40+ dimensions), MALCOM requires many fewer parameters to be estimated than the corresponding HMM. \n\n3 THE MALCOM ALGORITHM \n\nWe now turn to developing an algorithm to estimate both the CM pdfs Φ and the corresponding paths Y that together maximize the likelihood of a given time series of acoustics, ℒ = P(X|Y, Φ). This is an extension of the method first proposed by Hogden (1995), in which he instead maximized P(Y|X, Φ) using vowel data from a single speaker. Starting with random but smooth Y, the MALCOM training algorithm generates a CM by iterating between the following two steps: (1) given Y, reestimate Φ to maximize ℒ; and (2) given Φ, reestimate smooth paths Y to maximize ℒ. \n\n3.1 LOG LIKELIHOOD FUNCTION \n\nTo specify the log likelihood function ℒ, we make two dependence claims and one independence assumption. First, we claim that y_t depends (to at least some small extent) on all other y in the utterance, an expression of the continuity constraint described above. We make another natural claim that x_t depends on y_t: the path configuration at time t influences the corresponding acoustics. However, we do make the conditional independence assumption that \n\nℒ = P(X|Y, Φ) = ∏_{t=1}^{n} P(x_t|y_t, Φ). \n\n(4) \n\nNote that Eq. (4) does not assume that each x_t is independent of x_{t-1} (as is often assumed in data modeling); it only assumes that the conditioning of x_t on y_t is independent from t-1 to t. For example, because x_t depends on y_t, y_t depends on all other y (the smoothness constraint), and x_{t-1} depends on y_{t-1}, x_t is not assumed to be independent of all other x's in the utterance. \n\nWith a log transformation and an invocation of Bayes' rule, we obtain the MALCOM log likelihood function: \n\nln ℒ = Σ_{t=1}^{n} [ ln P(y_t|x_t, Φ) + ln P(x_t) - ln P(y_t|Φ) ], \n\n(5) \n\nwhere P(y_t|x_t, Φ) = p[y_t|x_j, Φ(x_j)] and the particular model j depends on which of the N VQ codes x_t is assigned to. Here we use a simple multi-dimensional Gaussian for each pdf, but we are currently exploring the use of multi-modal mixtures of Gaussians to represent the pdfs for sounds, such as stop consonants, for which the inverse map from acoustics to articulation may not be unique (Nix, 1998). Next, we need an estimate of the third term, which we obtain by summing over all VQ partitions: P(y_t|Φ) = Σ_{j=1}^{N} p[y_t|x_j, Φ(x_j)] P(x_j). \n\n3.2 PDF ESTIMATION \n\nFor step (1) of training, the gradient of the log likelihood function with respect to a pdf mean μ_j is \n\n∇_{μ_j} ln ℒ = Σ_{t: x(t)=x_j} Σ^{-1}(y_t - μ_j) - Σ_{t=1}^{n} ( p[y_t|x_j, Φ(x_j)] P(x_j) / Σ_{k=1}^{N} p[y_t|x_k, Φ(x_k)] P(x_k) ) Σ^{-1}(y_t - μ_j), \n\n(6) \n\nwhere Σ is the covariance matrix for each pdf. For the results in this paper, we use a common radially symmetric covariance matrix for all pdfs and reestimate the covariance matrix after each path optimization step.2 In doing the optimization, we employ the following algorithm: \n\n1. Make an initial guess of each μ_j as the mean of the path configurations corresponding to the observed acoustics assigned to VQ code x_j. \n2. Construct ∇_μ ln ℒ by considering Eq. (6) over all N acoustic partitions. \n3. 
Determine a search direction for the optimization using, for example, conjugate gradients, and perform a line search along this direction (Press et al., 1988). \n4. Repeat steps [2]-[3] until convergence. \n\nTo avoid potential degenerate solutions, after each pdf optimization step the dimensions of the CM are orthogonalized. Furthermore, because the scale of the continuity map is meaningless (only its topological arrangement matters), the N pdf means are scaled to zero mean, unit variance before each path optimization step. \n\n3.3 PATH ESTIMATION \n\nFor step (2) of training, we use gradient-based optimization to reestimate Y, where the gradient of the log likelihood function with respect to a specific y_t is given by \n\n∇_{y_t} ln ℒ = ∇_{y_t} p[y_t|x_t, Φ(x_t)] / p[y_t|x_t, Φ(x_t)] - ∇_{y_t} Σ_{j=1}^{N} p[y_t|x_j, Φ(x_j)] P(x_j) / Σ_{j=1}^{N} p[y_t|x_j, Φ(x_j)] P(x_j). \n\n(7) \n\n2However, we are currently exploring the effects of individual and diagonal covariance matrices. \n\nIn doing the optimization, we employ the following gradient-based algorithm: \n\n1. Make an initial guess of the path Y0 as the means of the pdfs corresponding to the observed acoustic sequence X. \n2. Low-pass filter Y0. \n3. Construct ∇_Y ln ℒ by considering Eq. (7) over all t. \n4. Determine a search direction for the optimization using, for example, conjugate gradients (Press et al., 1988). \n5. Low-pass filter this search direction using the same filter as in step [2]. \n6. Perform a line search along the filtered direction (Press et al., 1988). \n7. Repeat steps [3]-[6] until convergence. \n\nBecause neither the line search direction nor the initial estimate Y0 contains energy above the cutoff frequency of the low-pass filter, their linear addition (the next estimate of Y) will not contain energy above the cutoff frequency either. Thus, steps [2] and [5] implement the desired smoothness constraint. 
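The smoothness argument above can be sketched with an FFT-based low-pass projection. The path length, dimensionality, and cutoff index below are illustrative only, and a random vector stands in for the true gradient of Eq. (7):

```python
import numpy as np

def lowpass(path, keep):
    # Project a path onto band-limited signals by zeroing all FFT
    # coefficients at or above the cutoff index `keep`, per dimension.
    F = np.fft.rfft(path, axis=0)
    F[keep:] = 0.0
    return np.fft.irfft(F, n=path.shape[0], axis=0)

T, D, keep = 64, 2, 6                            # illustrative sizes
rng = np.random.default_rng(0)
Y = lowpass(rng.standard_normal((T, D)), keep)   # steps [1]-[2]: smooth initial guess
grad = rng.standard_normal((T, D))               # stand-in for grad ln L of Eq. (7)
direction = lowpass(grad, keep)                  # step [5]: filter the search direction
Y_next = Y + 0.1 * direction                     # one line-search step along it

# The update stays band-limited: no energy above the cutoff frequency.
assert np.abs(np.fft.rfft(Y_next, axis=0)[keep:]).max() < 1e-9
```

Because the sum of two band-limited signals is band-limited, every iterate of the path estimate satisfies the smoothness constraint by construction, exactly as steps [2] and [5] intend.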
\n\n4 COMPARING MALCOM PATHS TO SPEECH ARTICULATION \n\nTo evaluate our claim that MALCOM paths are topologically related to articulator motions, we construct a regression predictor from Y to measured articulator data using the training data and test the quality of this predictor on an independent test set. \n\nOur speech corpus consists of data from two male and one female native speakers of German. This data was obtained from Dr. Igor Zlokarnik and was recorded at the Technical University of Munich, Germany, using electro-magnetic articulography (EMA) (Perkell et al., 1992). Each speaker's articulatory measurements and acoustics were recorded for the same 108 sentences, where each sentence was about 4 seconds long. \n\nThe acoustics were recorded using a room-placed microphone and sampled with 16-bit resolution at 16 kHz. Prior to our receiving the data from Munich, the data were resampled at 11025 Hz. To represent the acoustic signal as a compact vector time series, we used 256-sample (23.2 msec) Hamming-windowed frames, with a new frame starting every 5.8 msec (75% overlap). We transform each frame into a 13th-order LPC-cepstral coefficient vector a_t (12 cepstral features plus log gain; see Morgan & Scofield, 1994). A full acoustic feature vector x_t consists of a window of seven frames such that x_t is made up of the frames {a_{t-6}, a_{t-4}, a_{t-2}, a_t, a_{t+2}, a_{t+4}, a_{t+6}}. To VQ the acoustic space we used the classical k-means algorithm (e.g., Bishop, 1995), but we used 512 codes to model the vowel data, and 256 codes each to model the stop consonants, the fricatives, the nasals, and the liquids (1536 codes combined).3 \n\nThe articulatory data consist of the (x, y) coordinates of 4 coils along the tongue and the y-coordinates of coils on the jaw and lower lip. Figure 1 illustrates the approximate location of each coil. 
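The framing and vector-quantization front end described above can be sketched as follows. This is a simplified illustration: the LPC-cepstral analysis step is omitted, and the k-means shown is a plain implementation rather than the paper's per-phone-class codebook scheme:

```python
import numpy as np

def frames(signal, size=256, hop=64):
    # Hamming-windowed frames: 256 samples (23.2 msec at 11025 Hz),
    # one new frame every 64 samples (5.8 msec, i.e. 75% overlap).
    win = np.hamming(size)
    n = 1 + (len(signal) - size) // hop
    return np.stack([signal[i * hop : i * hop + size] * win for i in range(n)])

def kmeans(X, k, iters=20, seed=0):
    # Classical k-means VQ of the feature space (LPC-cepstra in the paper).
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None] - centers[None]) ** 2).sum(-1)   # squared distances
        codes = d.argmin(1)                               # nearest code per frame
        for j in range(k):
            if (codes == j).any():
                centers[j] = X[codes == j].mean(0)        # recenter each code
    return centers, codes
```

One second of 11025 Hz audio yields 169 frames under this scheme; each frame's feature vector would then be assigned to its nearest codebook entry.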
The data were originally sampled at 250 Hz but were resampled to 172.26 Hz to match one articulatory sample to each 75%-overlapping acoustic frame of 256 samples. The articulatory data were subsequently low-pass filtered at 15 Hz to remove measurement noise. \n\nSentences 1-90 were used as a training set, and sentences 91-108 were withheld for evaluation. A separate CM was generated for each speaker using the training data. We used an 8 Hz cutoff frequency because the measured articulatory data had very little energy above 8 Hz, and a 6-dimensional continuity map was used because the first six principal components capture 99% of the variance of the corresponding articulator data (Nix, 1998). \n\n3This acoustic representation and VQ scheme were determined to work well for modeling real articulator data (Nix, 1998), so they were used here as well. \n\n[Figure 1 diagram: EMA coil positions on the tongue tip (T1), tongue middle (T2), tongue dorsum (T3), tongue back (T4), lower lip (LL), and lower jaw (LJ).] \n\nFigure 1: Approximate positions of EMA coils for speech articulation measurements. \n\nBecause the third term in Eq. (5) is computationally complex, we approximated Eq. (5) by only its first term (the second term is constant during training) until ln ℒ, calculated at the end of each iteration using all terms, started to decrease. At this point we started using both the first and third terms of Eq. (5). In each pdf and path optimization step, our convergence criterion was a maximum movement of a mean or a path of less than 10^-4. 
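The interplay of the terms of Eq. (5) and the mean gradient of Eq. (6) can be checked numerically. The sketch below assumes a shared spherical covariance (sigma2 times the identity) and toy sizes; all numbers are illustrative, not from the paper:

```python
import numpy as np

def log_lik(Y, codes, mu, sigma2, Px):
    # Eq. (5): sum_t [ ln p(y_t|x_t, Phi) + ln P(x_t) - ln P(y_t|Phi) ].
    T, D = Y.shape
    diff = Y[:, None, :] - mu[None, :, :]                  # (T, N, D)
    logp = (-0.5 * (diff ** 2).sum(-1) / sigma2
            - 0.5 * D * np.log(2 * np.pi * sigma2))        # ln p(y_t | x_j)
    mix = (np.exp(logp) * Px).sum(1)                       # P(y_t | Phi)
    rows = np.arange(T)
    return float((logp[rows, codes] + np.log(Px[codes]) - np.log(mix)).sum())

def grad_means(Y, codes, mu, sigma2, Px):
    # Gradient of Eq. (5) w.r.t. each pdf mean mu_j, as in Eq. (6).
    T, D = Y.shape
    diff = Y[:, None, :] - mu[None, :, :]
    logp = (-0.5 * (diff ** 2).sum(-1) / sigma2
            - 0.5 * D * np.log(2 * np.pi * sigma2))
    joint = np.exp(logp) * Px                              # p(y_t|x_j) P(x_j)
    resp = joint / joint.sum(1, keepdims=True)             # posterior over codes
    grad = np.zeros_like(mu)
    for j in range(mu.shape[0]):
        grad[j] = diff[codes == j, j].sum(0) / sigma2      # first term of Eq. (6)
        grad[j] -= (resp[:, [j]] * diff[:, j]).sum(0) / sigma2
    return grad
```

A finite-difference check of `grad_means` against `log_lik` confirms that the two expressions are consistent, mirroring the term-by-term derivation of Eqs. (5) and (6).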
Our convergence criterion for the entire algorithm was a correlation greater than 0.99, in all dimensions, between the paths from one full iteration of pdf and path optimization and the next. This usually took about 30 iterations. \n\nTo evaluate the extent to which MALCOM hidden paths capture information related to articulation, we used the same training set to estimate a non-linear regression function from the output generated by MALCOM to the corresponding measured articulator data. We used an ensemble of 10 single-hidden-layer, 32-hidden-unit, multi-layer perceptrons trained on different 2/3-training, 1/3-early-stopping partitions of the training set, where the results of the ensemble on the test set were averaged (e.g., Bishop, 1995). A linear regression produced results approximately 10% worse than those we report here. \n\nTo contrast with the unsupervised MALCOM method, we also tested a supervised method in which the articulatory data were available for training as well as evaluation. This involved only the pdf optimization step of MALCOM because the paths were fixed as the articulator measurements. The resulting pdfs were then used in the path optimization step to determine paths for the test data acoustics. We could then measure what fraction of this supervised performance the unsupervised MALCOM attained. \n\n5 RESULTS AND CONCLUSIONS \n\nThe results of this regression on the test set are plotted in Figure 2. The MALCOM paths had a median correlation of 0.77 with the actual articulator data, compared to 0.84 for the comparable supervised method. Thus, using only the speech acoustics, MALCOM generated continuity maps with correlations to real articulator measurements only 0.02 to 0.15 lower than those of the corresponding supervised model, which used articulatory measurements as well as acoustics. 
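The evaluation protocol, an ensemble trained on different subsamples of the training set whose test-set predictions are averaged and then scored by per-dimension correlation, can be sketched as follows. Linear least-squares models stand in for the paper's MLPs, and all data here are synthetic:

```python
import numpy as np

def ensemble_predict(models, X):
    # Average the test-set predictions of an ensemble of regressors
    # (the paper averaged 10 MLPs; linear maps stand in here).
    return np.mean([X @ W for W in models], axis=0)

def per_dim_corr(pred, target):
    # Pearson correlation per articulator dimension, as reported in Fig. 2.
    p = pred - pred.mean(0)
    t = target - target.mean(0)
    return (p * t).sum(0) / np.sqrt((p ** 2).sum(0) * (t ** 2).sum(0))

# Synthetic stand-in data: 6-D paths mapped to 10 articulator dimensions.
rng = np.random.default_rng(0)
W_true = rng.standard_normal((6, 10))
X_train, X_test = rng.standard_normal((200, 6)), rng.standard_normal((50, 6))
Y_train = X_train @ W_true + 0.1 * rng.standard_normal((200, 10))
Y_test = X_test @ W_true + 0.1 * rng.standard_normal((50, 10))

# Train each ensemble member on a different 2/3 subsample of the training set.
models = []
for _ in range(10):
    idx = rng.choice(200, 133, replace=False)
    W, *_ = np.linalg.lstsq(X_train[idx], Y_train[idx], rcond=None)
    models.append(W)

corr = per_dim_corr(ensemble_predict(models, X_test), Y_test)
median_corr = float(np.median(corr))
```

The median of `corr` over dimensions corresponds to the single summary number (0.77 for MALCOM, 0.84 supervised) reported in the text.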
\n\nGiven that (1) MALCOM fits into the same probabilistic framework for speech recognition as HMMs and (2) MALCOM's hidden paths capture considerable information about the speech production process, we believe that MALCOM will prove to be a viable alternative to the HMM for speech processing tasks. Our current work emphasizes developing a word model to complete the MALCOM formulation and test a full speech recognition system. Furthermore, MALCOM is applicable to any other task to which HMMs can be applied, including fraud detection (Hogden, 1997) and text processing. \n\n[Figure 2 bar chart: correlation values for articulator dimensions T1x, T1y, T2x, T2y, T3x, T3y, T4x, T4y, LLy, and LJy.] \n\nFigure 2: Correlation between estimated and actual articulator trajectories on the independent test set, averaged across speakers. Each full bar is the performance of the supervised analogue of MALCOM, and the horizontal line on each bar is the performance of MALCOM itself. \n\nAcknowledgments \n\nWe would like to thank James Howse and Mike Mozer for their helpful comments on this manuscript and Igor Zlokarnik for sharing his data with us. This work was performed under the auspices of the U.S. Department of Energy. \n\nReferences \n\nBishop, C.M. (1995). Neural Networks for Pattern Recognition, NY: Oxford University Press, Inc. \n\nBourlard, H., Konig, Y., & Morgan, N. (1995). \"REMAP: Recursive estimation and maximization of a posteriori probabilities, application to transition-based connectionist speech recognition,\" International Computer Science Institute Technical Report TR-94-064. \n\nHogden, J. (1995). \"Improving on hidden Markov models: an articulatorily constrained, maximum-likelihood approach to speech recognition and speech coding,\" Los Alamos National Laboratory Technical Report, LA-UR-96-3945. 
\n\nHogden, J. (1997). \"Maximum likelihood continuity mapping for fraud detection,\" Los Alamos National Laboratory Technical Report, LA-UR-97-992. \n\nHogden, J., Nix, D.A., Gracco, V., & Rubin, P. (1998). \"Stochastic word models for articulatorily constrained speech recognition and synthesis,\" submitted to Acoustical Society of America Conference, 1998. \n\nMorgan, N. & Bourlard, H.A. (1995). \"Neural Networks for Statistical Recognition of Continuous Speech,\" Proceedings of the IEEE, 83(5), 742-770. \n\nMorgan, D.P., & Scofield, C.L. (1992). Neural Networks and Speech Processing, Boston, MA: Kluwer Academic Publishers. \n\nNix, D.A. (1998). Probabilistic methods for inferring vocal-tract articulation from speech acoustics, Ph.D. Dissertation, U. of CO at Boulder, Dept. of Computer Science, in preparation. \n\nPerkell, J.S., Cohen, M.H., Svirsky, M.A., Matthies, M.L., Garabieta, I., & Jackson, M.T.T. (1992). \"Electromagnetic midsagittal articulometer systems for transducing speech articulatory movements,\" Journal of the Acoustical Society of America, 92(6), 3078-3096. \n\nPress, W.H., Teukolsky, S.A., Vetterling, W.T., & Flannery, B.P. (1988). Numerical Recipes in C, Cambridge University Press. \n\nYoung, S.J. (1996). \"A review of large-vocabulary continuous speech recognition,\" IEEE Signal Processing Magazine, September, 45-57. \n", "award": [], "sourceid": 1634, "authors": [{"given_name": "David", "family_name": "Nix", "institution": null}, {"given_name": "John", "family_name": "Hogden", "institution": null}]}