Initial Explorations for Document Clustering Tasks in Latin Elegiac Poets

Authors
Nusch, Carlos Javier; Del Rio Riande, Gimena; Cagnina, Leticia Cecilia; Errecalde, Marcelo Luis; Antonelli, Leandro
Publication year
2024
Language
English
Resource type
conference paper
Status
published version
Description
This article describes several automatic text analysis tasks that apply Natural Language Processing techniques to a corpus of Latin texts from the 1st century BC and the 1st century AD. The work is motivated by the aim of understanding a historical literary trend centered on the theme of love that spans from antiquity to the medieval period. The authors analyzed are Gaius Valerius Catullus, Albius Tibullus, and Sextus Propertius, who represent the literary movement of the neoterics and form the group of poets to be identified, and Publius Vergilius Maro and Marcus Annaeus Lucanus, epic poets with markedly distinct styles, who serve as control samples. The purpose of this preliminary, exploratory study is to investigate which features are most promising for document clustering. Clustering was carried out on fixed ranges of character n-grams and word n-grams using the K-Means method, with the Silhouette Index used to determine the optimal number of clusters. Using the optimal clusters as labels, decision trees were trained for each n-gram range to identify the features with the highest Information Gain and Information Gain Ratio. The trees were trained with the Entropy criterion, and Feature Importance was computed. Results vary with text preprocessing: simply filtering stopwords from the corpus yields better Silhouette scores, with one or two features showing potential classification value for the decision trees. Applying TF-IDF weighting yields Silhouette indices closer to zero, but with Importance distributed more evenly across features.
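The feature-scoring step described above can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation (the record does not say which tools they used). It only shows how character n-grams over a fixed range can be counted and how the Information Gain and Information Gain Ratio of one n-gram's presence can be computed against cluster labels. The four toy strings, the label list, and the function names `char_ngrams`, `entropy`, and `gain_and_ratio` are hypothetical; in the study the labels would come from K-Means rather than being written by hand.

```python
import math
from collections import Counter


def char_ngrams(text, n_min, n_max):
    """Count character n-grams for every n in the fixed range [n_min, n_max]."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def entropy(values):
    """Shannon entropy, in bits, of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())


def gain_and_ratio(feature_values, labels):
    """Information Gain and Gain Ratio of a discrete feature w.r.t. labels."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected label entropy after splitting on the feature.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    # Split information: entropy of the feature's own value distribution.
    split_info = entropy(feature_values)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


# Hypothetical toy documents (not taken from the corpus) and hand-written
# labels standing in for the cluster assignments K-Means would produce.
docs = ["amor amat", "arma virum", "amica mea", "bella plus"]
cluster_labels = [0, 1, 0, 1]

# Binary feature: does the bigram "am" occur in the document?
has_am = ["am" in char_ngrams(doc, 2, 3) for doc in docs]

gain, ratio = gain_and_ratio(has_am, cluster_labels)
print(f"gain={gain:.3f} ratio={ratio:.3f}")
```

In this toy case the bigram separates the two label groups perfectly, so both scores reach their maximum of one bit. A feature that takes the same value in every document has zero split information, and the sketch reports a ratio of 0.0 for it rather than dividing by zero.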
Subject
Computer and Information Sciences
Latin Elegiac Poets
Document Clustering
K-Means
Silhouette Coefficient
Decision Trees
Feature Importance
Information Gain Ratio
Access level
open access
Terms of use
http://creativecommons.org/licenses/by-nc-sa/4.0/
Repository
CIC Digital (CICBA)
Institution
Comisión de Investigaciones Científicas de la Provincia de Buenos Aires
OAI identifier
oai:digital.cic.gba.gob.ar:11746/12412

Publication date
2024-06
Format
application/pdf
URL
https://digital.cic.gba.gob.ar/handle/11746/12412