Initial Explorations for Document Clustering Tasks in Latin Elegiac Poets
- Authors
- Nusch, Carlos Javier; Del Rio Riande, Gimena; Cagnina, Leticia Cecilia; Errecalde, Marcelo Luis; Antonelli, Leandro
- Publication year
- 2024
- Language
- English
- Resource type
- conference document
- Status
- published version
- Description
- This article describes various Automatic Text Analysis tasks applying Natural Language Processing techniques on a corpus of Latin texts from the 1st century BC and 1st century AD. The motivation behind this work is to delve into and understand a historical literary trend revolving around the themes of love, spanning from antiquity through to the medieval period. The analyzed authors include Gaius Valerius Catullus, Albius Tibullus, and Sextus Propertius, who represent the literary movement of the neoterics, as a group of poets to be identified, and Publius Vergilius Maro and Marcus Annaeus Lucanus, epic poets with remarkably distinct styles, as control samples. The purpose of this preliminary and exploratory study is to investigate the potential and best features for document clustering. The clustering tasks were carried out using fixed ranges of character n-grams and word n-grams. For the clustering tasks, the K-Means method and the Silhouette Index were used for determining the optimal cluster sizes. Using optimal clusters as labels, decision trees were trained for each range of n-grams, aiming to identify features with the highest Information Gain and Information Gain Ratio. The trees were trained based on the criterion of Entropy, and calculations of Feature Importance were performed. Results show variations based on text preprocessing techniques: simple filtering of stopwords in the corpus yields better Silhouette scores, with one or two features showing potential classification value for the decision trees. The application of TF-IDF weighting results in Silhouette indices closer to zero, albeit with a more balanced distribution of Importance among different features.
- Subject
- Computer and Information Sciences; Latin Elegiac Poets; Document Clustering; K Means; Silhouette Coefficient; Decision Trees; Feature Importance; Information Gain Ratio
- Access level
- open access
- Terms of use
- http://creativecommons.org/licenses/by-nc-sa/4.0/
- Repository
- Institution
- Comisión de Investigaciones Científicas de la Provincia de Buenos Aires
- OAI identifier
- oai:digital.cic.gba.gob.ar:11746/12412
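
The description in this record mentions character n-gram features and decision trees trained with the Entropy criterion to find features with high Information Gain and Information Gain Ratio. As a rough illustration only (this is not the authors' code), the Python sketch below extracts character n-grams and computes both measures for a binary "n-gram present" feature against cluster labels. The function names, the toy Latin lines, and the two cluster labels are assumptions made for this example, and the K-Means and Silhouette steps are left out.

```python
import math
from collections import Counter


def char_ngrams(text, n_min, n_max):
    """Count character n-grams for every n in the range [n_min, n_max]."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def entropy(values):
    """Shannon entropy, in bits, of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())


def gain_and_ratio(present, labels):
    """Information Gain and Gain Ratio of a binary feature against labels."""
    total = len(labels)
    groups = {}
    for flag, label in zip(present, labels):
        groups.setdefault(flag, []).append(label)
    # Expected label entropy after splitting on the feature.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    # Split information: entropy of the feature values themselves.
    split_info = entropy(present)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


# Toy documents and cluster labels: hypothetical values, not data from the paper.
docs = [
    "odi et amo",
    "vivamus mea lesbia atque amemus",
    "arma virumque cano",
    "bella per emathios plus quam civilia campos",
]
labels = [0, 0, 1, 1]

# Binary feature: does the character bigram "am" occur in the document?
present = ["am" in char_ngrams(doc, 2, 3) for doc in docs]
gain, ratio = gain_and_ratio(present, labels)
print(f"gain={gain:.4f} ratio={ratio:.4f}")
```

A feature that separates the two label groups perfectly would score 1.0 on both measures here, while a feature present in every document would score 0.0. The Gain Ratio divides the gain by the entropy of the feature itself, which penalizes splits that are very unbalanced.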
id |
CICBA_098542e0bf3e569b715d3dcbc64ca21b |
---|---|
oai_identifier_str |
oai:digital.cic.gba.gob.ar:11746/12412 |
network_acronym_str |
CICBA |
repository_id_str |
9441 |
network_name_str |
CIC Digital (CICBA) |
dc.title.none.fl_str_mv |
Initial Explorations for Document Clustering Tasks in Latin Elegiac Poets |
title |
Initial Explorations for Document Clustering Tasks in Latin Elegiac Poets |
dc.creator.none.fl_str_mv |
Nusch, Carlos Javier; Del Rio Riande, Gimena; Cagnina, Leticia Cecilia; Errecalde, Marcelo Luis; Antonelli, Leandro |
author |
Nusch, Carlos Javier |
author_facet |
Nusch, Carlos Javier; Del Rio Riande, Gimena; Cagnina, Leticia Cecilia; Errecalde, Marcelo Luis; Antonelli, Leandro |
author_role |
author |
author2 |
Del Rio Riande, Gimena; Cagnina, Leticia Cecilia; Errecalde, Marcelo Luis; Antonelli, Leandro |
author2_role |
author; author; author; author |
dc.subject.none.fl_str_mv |
Computer and Information Sciences; Latin Elegiac Poets; Document Clustering; K Means; Silhouette Coefficient; Decision Trees; Feature Importance; Information Gain Ratio |
topic |
Computer and Information Sciences; Latin Elegiac Poets; Document Clustering; K Means; Silhouette Coefficient; Decision Trees; Feature Importance; Information Gain Ratio |
publishDate |
2024 |
dc.date.none.fl_str_mv |
2024-06 |
dc.type.none.fl_str_mv |
info:eu-repo/semantics/conferenceObject info:eu-repo/semantics/publishedVersion http://purl.org/coar/resource_type/c_5794 info:ar-repo/semantics/documentoDeConferencia |
format |
conferenceObject |
status_str |
publishedVersion |
dc.identifier.none.fl_str_mv |
https://digital.cic.gba.gob.ar/handle/11746/12412 |
url |
https://digital.cic.gba.gob.ar/handle/11746/12412 |
dc.language.none.fl_str_mv |
eng |
language |
eng |
dc.rights.none.fl_str_mv |
info:eu-repo/semantics/openAccess http://creativecommons.org/licenses/by-nc-sa/4.0/ |
eu_rights_str_mv |
openAccess |
rights_invalid_str_mv |
http://creativecommons.org/licenses/by-nc-sa/4.0/ |
dc.format.none.fl_str_mv |
application/pdf application/pdf |
dc.source.none.fl_str_mv |
reponame:CIC Digital (CICBA) instname:Comisión de Investigaciones Científicas de la Provincia de Buenos Aires instacron:CICBA |
reponame_str |
CIC Digital (CICBA) |
collection |
CIC Digital (CICBA) |
instname_str |
Comisión de Investigaciones Científicas de la Provincia de Buenos Aires |
instacron_str |
CICBA |
institution |
CICBA |
repository.name.fl_str_mv |
CIC Digital (CICBA) - Comisión de Investigaciones Científicas de la Provincia de Buenos Aires |
repository.mail.fl_str_mv |
marisa.degiusti@sedici.unlp.edu.ar |