Initial Explorations for Document Clustering Tasks in Latin Elegiac Poets

Authors: Nusch, Carlos Javier; Del Rio Riande, María Gimena; Cagnina, Leticia Cecilia; Errecalde, Marcelo Luis; Antonelli, Rubén Leandro
Publication year: 2024
Language: Spanish (Castilian)
Resource type: conference paper
Status: published version
Description: This article describes several Automatic Text Analysis tasks that apply Natural Language Processing techniques to a corpus of Latin texts from the 1st century BC and the 1st century AD. The motivation is to explore and understand a historical literary trend centered on the theme of love, spanning from antiquity to the medieval period. The authors analyzed are Gaius Valerius Catullus, Albius Tibullus, and Sextus Propertius, who represent the literary movement of the neoterics and form the group of poets to be identified, and Publius Vergilius Maro and Marcus Annaeus Lucanus, epic poets with remarkably distinct styles, who serve as control samples. The purpose of this preliminary, exploratory study is to investigate which features are most promising for document clustering. The clustering tasks were carried out using fixed ranges of character n-grams and word n-grams, with the K-Means method for clustering and the Silhouette Index for determining the optimal cluster sizes. Using the optimal clusters as labels, decision trees were trained for each n-gram range to identify the features with the highest Information Gain and Information Gain Ratio. The trees were trained with the Entropy criterion, and Feature Importance was computed. Results vary with the text preprocessing technique: simple stopword filtering yields better Silhouette scores, with one or two features showing potential classification value for the decision trees. Applying TF-IDF weighting yields Silhouette indices closer to zero, albeit with a more balanced distribution of Importance among features.
Dirección PREBI-SEDICI
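The measures named in the description can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the four toy strings, the hand-written cluster labels (standing in for K-Means output), the n-gram range 2 to 3, and all function names are assumptions made for illustration. It counts character n-grams over a fixed range, computes the mean Silhouette coefficient for a given labeling, and computes the Information Gain and Gain Ratio of one binary n-gram feature, which is the kind of quantity an entropy-criterion decision tree evaluates at a split.

```python
from collections import Counter
from math import log2, sqrt


def char_ngrams(text, n_min, n_max):
    """Count character n-grams for every n in the fixed range [n_min, n_max]."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def distance(a, b):
    """Euclidean distance between two sparse count vectors (Counters)."""
    return sqrt(sum((a[k] - b[k]) ** 2 for k in set(a) | set(b)))


def silhouette(vectors, labels):
    """Mean silhouette coefficient; assumes at least two distinct labels."""
    scores = []
    for i, (v, lab) in enumerate(zip(vectors, labels)):
        own = [distance(v, w)
               for j, (w, other) in enumerate(zip(vectors, labels))
               if other == lab and j != i]
        if not own:  # a singleton cluster contributes 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within the own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(distance(v, w) for w, l in zip(vectors, labels) if l == c)
            / labels.count(c)
            for c in set(labels) if c != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum(c / total * log2(c / total) for c in Counter(values).values())


def gain_and_ratio(feature, labels):
    """Information Gain and Gain Ratio of a discrete feature w.r.t. labels."""
    groups = {}
    for value, lab in zip(feature, labels):
        groups.setdefault(value, []).append(lab)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(feature)
    return gain, (gain / split_info if split_info > 0 else 0.0)


# Hypothetical toy documents and labels, NOT taken from the paper's corpus.
docs = ["amor amor", "amor amoris", "arma virum", "arma bella"]
labels = [0, 0, 1, 1]  # stand-in for cluster labels produced by K-Means
vectors = [char_ngrams(d, 2, 3) for d in docs]

score = silhouette(vectors, labels)
has_am = [v["am"] > 0 for v in vectors]  # binary feature: bigram "am" present
gain, ratio = gain_and_ratio(has_am, labels)
print(score, gain, ratio)
```

In this toy setting the bigram "am" separates the two hand-assigned groups perfectly, so it receives the maximum gain for a two-class problem. On a real corpus the labels would come from the clustering step and the feature ranking from the trained trees.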
Subject: Computer Science
Humanities
Latin Elegiac Poets
Document Clustering
K Means
Silhouette Coefficient
Decision Trees
Feature Importance
Information Gain Ratio
Access level: open access
Terms of use: http://creativecommons.org/licenses/by-nc-sa/4.0/
Repository
Institution: Universidad Nacional de La Plata
OAI identifier: oai:sedici.unlp.edu.ar:10915/182788

URL: http://sedici.unlp.edu.ar/handle/10915/182788
Related ISBN: 978-3-031-91690-8
Format: application/pdf
