Clustering Tasks and Decision Trees with Augustan Love Poets: Cohesion and Separation in Feature Importance Extraction

Authors
Nusch, Carlos Javier; Del Rio Riande, María Gimena; Cagnina, Leticia; Errecalde, Marcelo Luis; Antonelli, Rubén Leandro
Publication year
2024
Language
Spanish
Resource type
article
Status
published version
Description
This article extends automatic text analysis tasks from previous work by applying natural language processing techniques to a corpus of Latin texts from the 1st century BC and the 1st century AD. The motivation is to better understand a historical literary trend centered on the theme of love that spans from antiquity to the medieval period. The authors analyzed are Gaius Valerius Catullus, Albius Tibullus, and Sextus Propertius, representing the literary movement of the neoterics, and Publius Vergilius Maro and Marcus Annaeus Lucanus, epic poets with distinct styles who serve as control samples. Unlike in previous work, several corrections were made to the preprocessing, including improved word tokenization that accounts for enclitics and handling of orthographic variants. For the clustering tasks, the K-Means method was used, with the Silhouette Score determining the optimal number of clusters. Using these optimal clusters as labels, decision trees were trained for each n-gram range in order to identify the features with the highest Information Gain and Information Gain Ratio. The trees were trained with the entropy criterion, and Feature Importance was computed. This study details the classification results and the features extracted by the decision trees, based on the best Silhouette scores obtained and on Information Gain. It also examines whether the words or word fragments with classificatory potential identified in the process match the findings of previous exploratory tasks performed with other techniques.
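As an illustration of the feature-ranking step described above, the following minimal sketch computes Information Gain and Information Gain Ratio for binary n-gram presence features against cluster labels. It is not the authors' implementation: the function names (`entropy`, `information_gain`, `gain_ratio`), the six-document toy data, and the feature names `ngram_a` and `ngram_b` are assumptions made for this example, and the clustering step that would produce the labels is not shown.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def information_gain(feature_values, labels):
    """Entropy reduction obtained by partitioning the labels on a feature."""
    total = len(labels)
    partitions = {}
    for value, label in zip(feature_values, labels):
        partitions.setdefault(value, []).append(label)
    # Weighted average entropy of the labels inside each partition.
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


def gain_ratio(feature_values, labels):
    """Information gain divided by the split information of the feature."""
    split_info = entropy(feature_values)
    if split_info == 0:
        # A constant feature cannot split the data, so its ratio is defined as 0 here.
        return 0.0
    return information_gain(feature_values, labels) / split_info


# Hypothetical toy data: cluster labels for six documents (assumed, not from the study).
cluster_labels = [0, 0, 0, 1, 1, 1]
# Hypothetical binary features: 1 if the n-gram occurs in the document, else 0.
has_ngram_a = [1, 1, 1, 0, 0, 0]  # occurs only in cluster 0
has_ngram_b = [1, 0, 1, 0, 1, 0]  # spread across both clusters

for name, feature in (("ngram_a", has_ngram_a), ("ngram_b", has_ngram_b)):
    gain = information_gain(feature, cluster_labels)
    ratio = gain_ratio(feature, cluster_labels)
    print(f"{name}: gain={gain:.3f} ratio={ratio:.3f}")
```

On this toy data, `ngram_a` separates the two clusters perfectly while `ngram_b` barely reduces entropy, so it ranks far lower. The ratio divides the gain by the entropy of the feature's own value distribution, which penalizes features that fragment the data into many small partitions.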
Dirección PREBI-SEDICI
Subject
Computer Science
Humanities
Augustan love poets
Document Clustering
K Means
Silhouette Coefficient
Decision Trees
Feature Importance
Information Gain Ratio
Access level
open access
Terms of use
http://creativecommons.org/licenses/by-nc-sa/4.0/
Repository
SEDICI (UNLP)
Institution
Universidad Nacional de La Plata
OAI identifier
oai:sedici.unlp.edu.ar:10915/175050

id SEDICI_bfb72a281634a7860774d03fc4af112c
network_acronym_str SEDICI
repository_id_str 1329
network_name_str SEDICI (UNLP)
dc.date.none.fl_str_mv 2024-11-18
dc.type.none.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
Articulo
http://purl.org/coar/resource_type/c_6501
info:ar-repo/semantics/articulo
dc.identifier.none.fl_str_mv http://sedici.unlp.edu.ar/handle/10915/175050
dc.language.none.fl_str_mv spa
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/issn/1613-0073
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
http://creativecommons.org/licenses/by-nc-sa/4.0/
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:SEDICI (UNLP)
instname:Universidad Nacional de La Plata
instacron:UNLP
repository.name.fl_str_mv SEDICI (UNLP) - Universidad Nacional de La Plata
repository.mail.fl_str_mv alira@sedici.unlp.edu.ar