Clustering Tasks and Decision Trees with Augustan Love Poets: Cohesion and Separation in Feature Importance Extraction
- Authors
- Nusch, Carlos Javier; del Rio Riande, Gimena; Cagnina, Leticia Cecilia; Errecalde, Marcelo Luis; Antonelli, Leandro
- Publication year
- 2024
- Language
- English
- Resource type
- conference paper
- Status
- published version
- Description
- This article extends various automatic text analysis tasks from previous works by applying natural language processing techniques to a corpus of Latin texts from the 1st century BC and 1st century AD. The motivation behind this work is to delve into and understand a historical literary trend revolving around the themes of love, spanning from antiquity through to the medieval period. The analyzed authors include Gaius Valerius Catullus, Albius Tibullus, and Sextus Propertius, representing the literary movement of the neoterics, and Publius Vergilius Maro and Marcus Annaeus Lucanus, epic poets with distinct styles, serving as control samples. Unlike previous works, various corrections were added to the preprocessing tasks, including improved word tokenization with enclitics and handling of orthographic variances. For the clustering tasks, the K-Means method and the Silhouette Score were used to determine the optimal cluster sizes. Using these optimal clusters as labels, decision trees were trained for each range of n-grams, aiming to identify features with the highest Information Gain and Information Gain Ratio. The trees were trained based on the criterion of Entropy, and calculations of Feature Importance were performed. In this study, we focused on detailing the classification results and features extracted by the decision trees, based on the best Silhouette scores obtained and the Information Gain. We examined whether the words or parts of words with classificatory potential identified in the process matched the findings from previous exploratory tasks performed using other techniques.
- Subject
-
Computer and Information Sciences
Augustan love poets
Document Clustering
K Means
Silhouette Coefficient
Decision Trees
Feature Importance
Information Gain Ratio
- Access level
- open access
- Terms of use
- http://creativecommons.org/licenses/by-nc-sa/4.0/
- Repository
- Institution
- Comisión de Investigaciones Científicas de la Provincia de Buenos Aires
- OAI identifier
- oai:digital.cic.gba.gob.ar:11746/12425
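
The description above mentions ranking n-gram features by Information Gain and Information Gain Ratio, with cluster assignments used as class labels. As a rough illustration of those two measures only, and not the authors' implementation, the following pure-Python sketch computes both for a presence/absence feature. The function names and the toy data are hypothetical and do not come from the paper.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def information_gain(feature_values, labels):
    """Entropy of `labels` minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(feature_values, labels):
    """Information gain normalized by the entropy of the split itself."""
    split_info = entropy(feature_values)
    if split_info == 0:
        # The feature takes a single value, so it cannot split the data.
        return 0.0
    return information_gain(feature_values, labels) / split_info


# Hypothetical toy data: cluster labels for six documents, and whether
# each of two illustrative n-grams occurs in each document.
cluster_labels = [0, 0, 0, 1, 1, 1]
ngram_a_present = [1, 1, 1, 0, 0, 0]  # occurs only in cluster 0
ngram_b_present = [1, 0, 1, 0, 1, 0]  # spread across both clusters

for name, feature in [("ngram_a", ngram_a_present), ("ngram_b", ngram_b_present)]:
    ig = information_gain(feature, cluster_labels)
    gr = gain_ratio(feature, cluster_labels)
    print(f"{name}: information gain = {ig:.4f}, gain ratio = {gr:.4f}")
```

In this toy case the first n-gram separates the two clusters perfectly, so it receives the maximum gain of one bit, while the second carries almost no information about cluster membership. A decision tree trained with the entropy criterion selects splits on the same basis, which is why features of the first kind tend to appear near the root.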
id |
CICBA_655e3407951219c9be775e42b5ff2b77 |
---|---|
oai_identifier_str |
oai:digital.cic.gba.gob.ar:11746/12425 |
network_acronym_str |
CICBA |
repository_id_str |
9441 |
network_name_str |
CIC Digital (CICBA) |
dc.title.none.fl_str_mv |
Clustering Tasks and Decision Trees with Augustan Love Poets: Cohesion and Separation in Feature Importance Extraction |
dc.creator.none.fl_str_mv |
Nusch, Carlos Javier; del Rio Riande, Gimena; Cagnina, Leticia Cecilia; Errecalde, Marcelo Luis; Antonelli, Leandro |
dc.subject.none.fl_str_mv |
Computer and Information Sciences; Augustan love poets; Document Clustering; K Means; Silhouette Coefficient; Decision Trees; Feature Importance; Information Gain Ratio |
publishDate |
2024 |
dc.date.none.fl_str_mv |
2024-12 |
dc.type.none.fl_str_mv |
info:eu-repo/semantics/conferenceObject info:eu-repo/semantics/publishedVersion http://purl.org/coar/resource_type/c_5794 info:ar-repo/semantics/documentoDeConferencia |
format |
conferenceObject |
status_str |
publishedVersion |
dc.identifier.none.fl_str_mv |
https://digital.cic.gba.gob.ar/handle/11746/12425 |
dc.language.none.fl_str_mv |
eng |
dc.relation.none.fl_str_mv |
info:eu-repo/semantics/altIdentifier/issn/1613-0073 |
dc.rights.none.fl_str_mv |
info:eu-repo/semantics/openAccess http://creativecommons.org/licenses/by-nc-sa/4.0/ |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:CIC Digital (CICBA) instname:Comisión de Investigaciones Científicas de la Provincia de Buenos Aires instacron:CICBA |
repository.name.fl_str_mv |
CIC Digital (CICBA) - Comisión de Investigaciones Científicas de la Provincia de Buenos Aires |
repository.mail.fl_str_mv |
marisa.degiusti@sedici.unlp.edu.ar |