Classification of lexical stress using spectral and prosodic features for computer-assisted language learning systems

Authors
Ferrer, Luciana; Bratt, Harry; Richey, Colleen; Franco, Horacio; Abrash, Victor; Precoda, Kristin
Publication year
2015
Language
English
Resource type
article
Status
published version
Description
We present a system for detection of lexical stress in English words spoken by English learners. This system was designed to be part of the EduSpeak® computer-assisted language learning (CALL) software. The system uses both prosodic and spectral features to detect the level of stress (unstressed, primary or secondary) for each syllable in a word. Features are computed on the vowels and include normalized energy, pitch, spectral tilt, and duration measurements, as well as log-posterior probabilities obtained from the frame-level mel-frequency cepstral coefficients (MFCCs). Gaussian mixture models (GMMs) are used to represent the distribution of these features for each stress class. The system is trained on utterances by L1-English children and tested on English speech from L1-English children and L1-Japanese children with variable levels of English proficiency. Since it is trained on data from L1-English speakers, the system can be used on English utterances spoken by speakers of any L1 without retraining. Furthermore, automatically determined stress patterns are used as the intended target; therefore, hand-labeling of training data is not required. This allows us to use a large amount of data for training the system. Our algorithm results in an error rate of approximately 11% on English utterances from L1-English speakers and 20% on English utterances from L1-Japanese speakers. We show that all features, both spectral and prosodic, are necessary for achievement of optimal performance on the data from L1-English speakers; MFCC log-posterior probability features are the single best set of features, followed by duration, energy, pitch and finally, spectral tilt features. For English utterances from L1-Japanese speakers, energy, MFCC log-posterior probabilities and duration are the most important features.
Affiliation: Ferrer, Luciana. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. SRI International; United States
Affiliation: Bratt, Harry. SRI International; United States
Affiliation: Richey, Colleen. SRI International; United States
Affiliation: Franco, Horacio. SRI International; United States
Affiliation: Abrash, Victor. SRI International; United States
Affiliation: Precoda, Kristin. SRI International; United States
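
The classification step summarized in the description (one GMM per stress class, scored on a feature vector computed over a vowel) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the diagonal covariances, the equal class priors, the two-dimensional feature vector, and every numeric parameter in `TOY_GMMS` are hypothetical, and the function names are invented for this example. Model training, feature extraction, and the MFCC log-posterior features are not shown.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, gmm):
    """Log likelihood of x under a GMM given as a list of (weight, mean, var)."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

def stress_posteriors(x, class_gmms, priors=None):
    """Posterior probability of each stress class for one vowel's features."""
    classes = list(class_gmms)
    if priors is None:
        # Assumption: equal priors when none are supplied.
        priors = {c: 1.0 / len(classes) for c in classes}
    scores = {
        c: math.log(priors[c]) + gmm_log_likelihood(x, class_gmms[c])
        for c in classes
    }
    top = max(scores.values())
    norm = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {c: math.exp(s - norm) for c, s in scores.items()}

def classify_stress(x, class_gmms, priors=None):
    """Return the stress class with the highest posterior."""
    post = stress_posteriors(x, class_gmms, priors)
    return max(post, key=post.get)

# Hypothetical toy models over two made-up features
# (normalized duration, normalized energy); values are illustrative only.
TOY_GMMS = {
    "unstressed": [(0.6, [-1.0, -1.0], [0.5, 0.5]),
                   (0.4, [-0.5, -1.5], [0.5, 0.5])],
    "primary":    [(0.5, [1.0, 1.0], [0.5, 0.5]),
                   (0.5, [1.5, 0.5], [0.5, 0.5])],
    "secondary":  [(1.0, [0.2, 0.1], [0.3, 0.3])],
}
```

Calling `classify_stress([1.2, 0.9], TOY_GMMS)` scores the vector against all three toy models and returns the label of the best-scoring class. In a CALL setting, the posteriors from `stress_posteriors` could also be compared against the expected stress pattern of the word, though how the described system makes that comparison is not stated in this record.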
Subject
Computer-Assisted Language Learning
Gaussian Mixture Models
Lexical Stress Detection
Mel Frequency Cepstral Coefficients
Prosodic Features
Access level
open access
Terms of use
https://creativecommons.org/licenses/by-nc-nd/2.5/ar/
Repository
CONICET Digital (CONICET)
Institution
Consejo Nacional de Investigaciones Científicas y Técnicas
OAI identifier
oai:ri.conicet.gov.ar:11336/38100

purl_subject.fl_str_mv https://purl.org/becyt/ford/1.2
https://purl.org/becyt/ford/1
dc.date.none.fl_str_mv 2015-02
dc.type.none.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
http://purl.org/coar/resource_type/c_6501
info:ar-repo/semantics/articulo
dc.identifier.none.fl_str_mv http://hdl.handle.net/11336/38100
Ferrer, Luciana; Bratt, Harry; Richey, Colleen; Franco, Horacio; Abrash, Victor; et al.; Classification of lexical stress using spectral and prosodic features for computer-assisted language learning systems; Elsevier Science; Speech Communication; 69; 2-2015; 31-45
0167-6393
CONICET Digital
CONICET
dc.language.none.fl_str_mv eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/url/http://www.sciencedirect.com/science/article/pii/S0167639315000151
info:eu-repo/semantics/altIdentifier/doi/10.1016/j.specom.2015.02.002
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Elsevier Science
repository.mail.fl_str_mv dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar