Automatic word count estimation from daylong child-centered recordings in various language environments using language-independent syllabification of speech

Authors
Räsänen, Okko; Seshadri, Shreyas; Karadayi, Julien; Riebling, Eric; Bunce, John; Cristia, Alejandrina; Metze, Florian; Casillas, Marisa; Rosemberg, Celia Renata; Bergelson, Elika; Soderstrom, Melanie
Publication year
2019
Language
English
Resource type
article
Status
published version
Description
Automatic word count estimation (WCE) from audio recordings can be used to quantify the amount of verbal communication in a recording environment. One key application of WCE is to measure language input heard by infants and toddlers in their natural environments, as captured by daylong recordings from microphones worn by the infants. Although WCE is nearly trivial for high-quality signals in high-resource languages, daylong recordings are substantially more challenging due to the unconstrained acoustic environments and the presence of near- and far-field speech. Moreover, many use cases of interest involve languages for which reliable ASR systems or even well-defined lexicons are not available. A good WCE system should also perform similarly for low- and high-resource languages in order to enable unbiased comparisons across different cultures and environments. Unfortunately, the current state-of-the-art solution, the LENA system, is based on proprietary software and has only been optimized for American English, limiting its applicability. In this paper, we build on existing work on WCE and present the steps we have taken towards a freely available system for WCE that can be adapted to different languages or dialects with a limited amount of orthographically transcribed speech data. Our system is based on language-independent syllabification of speech, followed by a language-dependent mapping from syllable counts (and a number of other acoustic features) to the corresponding word count estimates. We evaluate our system on samples from daylong infant recordings from six different corpora consisting of several languages and socioeconomic environments, all manually annotated with the same protocol to allow direct comparison. We compare a number of alternative techniques for the two key components in our system: speech activity detection and automatic syllabification of speech. 
As a result, we show that our system can reach relatively consistent WCE accuracy across multiple corpora and languages (with some limitations). In addition, the system outperforms LENA on three of the four corpora consisting of different varieties of English. We also demonstrate how an automatic neural network-based syllabifier, when trained on multiple languages, generalizes well to novel languages beyond the training data, outperforming two previously proposed unsupervised syllabifiers as a feature extractor for WCE.
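The second stage described above, a language-dependent mapping from syllable counts and other acoustic features to word counts, can be illustrated with a minimal sketch. The abstract does not specify the form of the mapping, so the ordinary least-squares linear model below is an assumption, as are the function names, the choice of duration as the second feature, and the toy numbers, which are invented for illustration and are not results from the paper.

```python
import numpy as np


def fit_syllable_to_word_mapping(features, word_counts):
    """Fit a linear map (with intercept) from per-segment features to word counts.

    features: 2-D array, one row per audio segment; column 0 is assumed to be
    the automatic syllable count, remaining columns are other acoustic features.
    word_counts: 1-D array of reference word counts from transcripts.
    """
    design = np.column_stack([features, np.ones(len(features))])
    weights, *_ = np.linalg.lstsq(design, word_counts, rcond=None)
    return weights


def estimate_word_counts(features, weights):
    """Apply the fitted mapping; negative estimates are clipped to zero."""
    design = np.column_stack([features, np.ones(len(features))])
    return np.clip(design @ weights, 0.0, None)


# Toy adaptation data (hypothetical): [syllable count, segment duration in s].
train_features = np.array(
    [[12, 4.0], [30, 9.5], [7, 2.5], [21, 7.0], [16, 5.0]], dtype=float
)
train_words = np.array([8, 21, 5, 14, 11], dtype=float)

weights = fit_syllable_to_word_mapping(train_features, train_words)
estimates = estimate_word_counts(np.array([[18, 6.0]]), weights)
print(estimates)
```

Under this assumption, adapting to a new language amounts to refitting `weights` on a small set of transcribed segments, while the syllabifier that produces the counts stays unchanged.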
Affiliation: Räsänen, Okko. University of Tampere; Finland
Affiliation: Seshadri, Shreyas. Aalto University; Finland
Affiliation: Karadayi, Julien. Université Paris Sciences et Lettres; France
Affiliation: Riebling, Eric. Carnegie Mellon University; United States
Affiliation: Bunce, John. University of Manitoba; Canada
Affiliation: Cristia, Alejandrina. Université Paris Sciences et Lettres; France
Affiliation: Metze, Florian. Carnegie Mellon University; United States
Affiliation: Casillas, Marisa. Max Planck Institute for Psycholinguistics; Netherlands
Affiliation: Rosemberg, Celia Renata. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Saavedra 15. Centro Interdisciplinario de Investigaciones en Psicología Matemática y Experimental Dr. Horacio J. A. Rimoldi; Argentina
Affiliation: Bergelson, Elika. Duke University; United States
Affiliation: Soderstrom, Melanie. University of Manitoba; Canada
Subject
AUTOMATIC SYLLABIFICATION
DAYLONG RECORDINGS
LANGUAGE ACQUISITION
NOISE ROBUSTNESS
WORD COUNT ESTIMATION
Access level
open access
Terms of use
https://creativecommons.org/licenses/by-nc-nd/2.5/ar/
Repository
CONICET Digital (CONICET)
Institution
Consejo Nacional de Investigaciones Científicas y Técnicas
OAI identifier
oai:ri.conicet.gov.ar:11336/108130

id CONICETDig_edaa0c24f09777adcb5e6a2c6ed64aba
oai_identifier_str oai:ri.conicet.gov.ar:11336/108130
network_acronym_str CONICETDig
repository_id_str 3498
network_name_str CONICET Digital (CONICET)
purl_subject.fl_str_mv https://purl.org/becyt/ford/5.3
https://purl.org/becyt/ford/5
publishDate 2019
dc.date.none.fl_str_mv 2019-10
dc.type.none.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
http://purl.org/coar/resource_type/c_6501
info:ar-repo/semantics/articulo
format article
status_str publishedVersion
dc.identifier.none.fl_str_mv http://hdl.handle.net/11336/108130
Räsänen, Okko; Seshadri, Shreyas; Karadayi, Julien; Riebling, Eric; Bunce, John; et al.; Automatic word count estimation from daylong child-centered recordings in various language environments using language-independent syllabification of speech; Elsevier; Speech Communication; 113; 10-2019; 63-80
0167-6393
CONICET Digital
CONICET
url http://hdl.handle.net/11336/108130
dc.language.none.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/doi/10.1016/j.specom.2019.08.005
info:eu-repo/semantics/altIdentifier/url/https://www.sciencedirect.com/science/article/pii/S0167639318304205
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
https://creativecommons.org/licenses/by-nc-nd/2.5/ar/
eu_rights_str_mv openAccess
rights_invalid_str_mv https://creativecommons.org/licenses/by-nc-nd/2.5/ar/
dc.format.none.fl_str_mv application/pdf
application/pdf
application/pdf
dc.publisher.none.fl_str_mv Elsevier
publisher.none.fl_str_mv Elsevier
dc.source.none.fl_str_mv reponame:CONICET Digital (CONICET)
instname:Consejo Nacional de Investigaciones Científicas y Técnicas
reponame_str CONICET Digital (CONICET)
collection CONICET Digital (CONICET)
instname_str Consejo Nacional de Investigaciones Científicas y Técnicas
repository.name.fl_str_mv CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas
repository.mail.fl_str_mv dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar