dc.contributor.author |
Katsamanis, A |
en |
dc.contributor.author |
Papandreou, G |
en |
dc.contributor.author |
Maragos, P |
en |
dc.date.accessioned |
2014-03-01T01:30:37Z |
|
dc.date.available |
2014-03-01T01:30:37Z |
|
dc.date.issued |
2009 |
en |
dc.identifier.issn |
1558-7916 |
en |
dc.identifier.uri |
https://dspace.lib.ntua.gr/xmlui/handle/123456789/19605 |
|
dc.subject |
Active appearance models (AAMs) |
en |
dc.subject |
Audiovisual-to-articulatory speech inversion |
en |
dc.subject |
Canonical correlation analysis (CCA) |
en |
dc.subject |
Multimodal fusion |
en |
dc.subject.classification |
Acoustics |
en |
dc.subject.classification |
Engineering, Electrical & Electronic |
en |
dc.subject.other |
Active appearance models |
en |
dc.subject.other |
Appearance modeling |
en |
dc.subject.other |
Audio features |
en |
dc.subject.other |
Canonical correlation analysis |
en |
dc.subject.other |
Dynamic information |
en |
dc.subject.other |
Electromagnetic articulography |
en |
dc.subject.other |
Face tracking |
en |
dc.subject.other |
Facial analysis |
en |
dc.subject.other |
Ill-posed |
en |
dc.subject.other |
Ill-posedness |
en |
dc.subject.other |
Inversion process |
en |
dc.subject.other |
Inversion scheme |
en |
dc.subject.other |
Line spectral frequencies |
en |
dc.subject.other |
Linear mapping |
en |
dc.subject.other |
Markovian |
en |
dc.subject.other |
Mel-frequency cepstral coefficients |
en |
dc.subject.other |
Model switching |
en |
dc.subject.other |
Multi-modal |
en |
dc.subject.other |
Multi-stream hidden Markov model |
en |
dc.subject.other |
Piecewise linear models |
en |
dc.subject.other |
Points of interest |
en |
dc.subject.other |
Speech acoustics |
en |
dc.subject.other |
Speech inversion |
en |
dc.subject.other |
Speech production |
en |
dc.subject.other |
Visual feature extraction |
en |
dc.subject.other |
Visual information |
en |
dc.subject.other |
Visual modalities |
en |
dc.subject.other |
Vocal tracts |
en |
dc.subject.other |
Face recognition |
en |
dc.subject.other |
Feature extraction |
en |
dc.subject.other |
Frequency estimation |
en |
dc.subject.other |
Hidden Markov models |
en |
dc.subject.other |
Piecewise linear techniques |
en |
dc.subject.other |
Speech recognition |
en |
dc.subject.other |
Visual communication |
en |
dc.subject.other |
Audio acoustics |
en |
dc.title |
Face active appearance modeling and speech acoustic information to recover articulation |
en |
heal.type |
journalArticle |
en |
heal.identifier.primary |
10.1109/TASL.2008.2008740 |
en |
heal.identifier.secondary |
http://dx.doi.org/10.1109/TASL.2008.2008740 |
en |
heal.language |
English |
en |
heal.publicationDate |
2009 |
en |
heal.abstract |
We are interested in recovering aspects of the vocal tract's geometry and dynamics from speech, a problem referred to as speech inversion. Traditional audio-only speech inversion techniques are inherently ill-posed since the same speech acoustics can be produced by multiple articulatory configurations. To alleviate the ill-posedness of the audio-only inversion process, we propose an inversion scheme which also exploits visual information from the speaker's face. The complex audiovisual-to-articulatory mapping is approximated by an adaptive piecewise linear model. Model switching is governed by a Markovian discrete process which captures articulatory dynamic information. Each constituent linear mapping is effectively estimated via canonical correlation analysis. In the described multimodal context, we investigate alternative fusion schemes which allow interaction between the audio and visual modalities at various synchronization levels. For facial analysis, we employ active appearance models (AAMs) and demonstrate fully automatic face tracking and visual feature extraction. Using the AAM features in conjunction with audio features such as Mel-frequency cepstral coefficients (MFCCs) or line spectral frequencies (LSFs) leads to effective estimation of the trajectories followed by certain points of interest in the speech production system. We report experiments on the QSMT and MOCHA databases which contain audio, video, and electromagnetic articulography data recorded in parallel. The results show that exploiting both audio and visual modalities in a multistream hidden Markov model-based scheme clearly improves performance relative to either audio-only or visual-only estimation. © 2009 IEEE. |
en |
heal.publisher |
IEEE - Institute of Electrical and Electronics Engineers, Inc. |
en |
heal.journalName |
IEEE Transactions on Audio, Speech, and Language Processing |
en |
dc.identifier.doi |
10.1109/TASL.2008.2008740 |
en |
dc.identifier.isi |
ISI:000263639400002 |
en |
dc.identifier.volume |
17 |
en |
dc.identifier.issue |
3 |
en |
dc.identifier.spage |
411 |
en |
dc.identifier.epage |
422 |
en |