Improving LDA topic modeling in Twitter with graph community detection

Authors: Albanese, Federico; Feuerstein, Esteban
Publication year: 2021
Language: English
Resource type: conference paper
Status: published version
Description: Texts can be characterized from their content using machine learning and natural language processing techniques. In particular, understanding their topic is useful for tasks such as personalized message recommendation, fake news detection and public opinion monitoring. Latent Dirichlet Allocation (LDA) is an unsupervised generative model for topic decomposition, which represents texts as random mixtures over topics with a Dirichlet distribution, where each topic is characterized by a distribution over words. However, this method is hard to apply when the text is short and sometimes incoherent, as is often the case with posts on social networks such as Twitter. Several works have shown that tweet pooling (aggregating tweets into longer documents) improves LDA results, but its performance depends on which method is used to aggregate the texts. We propose a new method to detect topics on Twitter: “Community pooling”. In this novel scheme, we first define the retweet graph, where users are the nodes and retweets between them are the edges. Then, we use the Louvain method for community detection to uncover the communities (groups of users who mainly interact with each other but not with other groups). Finally, we aggregate into a single document all the tweets authored by the users in a community. This method therefore drastically reduces the total number of documents and yields a denser word co-occurrence matrix, which benefits the LDA algorithm. To evaluate our model, we created two datasets of tweets with different characteristics: a generic dataset covering various topics such as music, health and movies, and a second dataset corresponding to an event, Biden’s presidential inauguration day in the United States.
We compare the performance of our model with state-of-the-art schemes and previous pooling models in terms of document retrieval performance, cluster quality and supervised machine learning classification score. Results showed that Community pooling performed better on all datasets and tasks, with the sole exception of the retrieval task on the event dataset. Moreover, Community pooling was faster than all other aggregation techniques (less than half the running time), which is particularly useful in big data scenarios.
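The three steps described in the abstract (retweet graph, Louvain communities, one document per community) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `community_pooling`, the input shapes, the undirected weighted graph, the fixed `seed`, and the toy data are all assumptions made for the example. It relies on `louvain_communities` from NetworkX (available in version 2.8 and later).

```python
from collections import defaultdict

import networkx as nx


def community_pooling(tweets, retweets, seed=0):
    """Pool tweets into one document per retweet-graph community.

    tweets:   iterable of (author, text) pairs
    retweets: iterable of (retweeter, original_author) pairs
    Both input shapes are assumptions made for this sketch.
    """
    tweets = list(tweets)

    # Step 1: retweet graph. Users are nodes, retweets are edges.
    # Assumption: undirected edges weighted by the number of retweets.
    graph = nx.Graph()
    graph.add_nodes_from(author for author, _ in tweets)
    for retweeter, original_author in retweets:
        if retweeter == original_author:
            continue  # ignore self-retweets
        if graph.has_edge(retweeter, original_author):
            graph[retweeter][original_author]["weight"] += 1
        else:
            graph.add_edge(retweeter, original_author, weight=1)

    # Step 2: Louvain community detection (returns a list of node sets).
    communities = nx.community.louvain_communities(
        graph, weight="weight", seed=seed
    )
    user_to_community = {
        user: index
        for index, members in enumerate(communities)
        for user in members
    }

    # Step 3: concatenate all tweets written by users of the same community.
    pooled = defaultdict(list)
    for author, text in tweets:
        pooled[user_to_community[author]].append(text)
    return [" ".join(texts) for _, texts in sorted(pooled.items())]


# Toy data (hypothetical): two groups of users who only retweet each other.
tweets = [
    ("ana", "new album out today"),
    ("bob", "that guitar solo though"),
    ("cat", "concert tickets on sale"),
    ("dan", "flu season is here"),
    ("eve", "get your vaccine"),
    ("fay", "wash your hands"),
]
retweets = [
    ("ana", "bob"), ("bob", "cat"), ("cat", "ana"),
    ("dan", "eve"), ("eve", "fay"), ("fay", "dan"),
]
documents = community_pooling(tweets, retweets)
```

On this toy input, six tweets collapse into two pooled documents, one per group. The pooled documents would then be vectorized and passed to any LDA implementation in place of the individual tweets.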
Sociedad Argentina de Informática e Investigación Operativa
Subject: Computer Science
Topic modeling
Community detection
Twitter
Text mining
Text clustering
Access level: open access
Terms of use: http://creativecommons.org/licenses/by-nc-sa/3.0/
Repository
Institution: Universidad Nacional de La Plata
OAI identifier: oai:sedici.unlp.edu.ar:10915/140126

id	SEDICI_5fd4fdf6ccd46a425c872c7f6fc2466e
oai_identifier_str	oai:sedici.unlp.edu.ar:10915/140126
network_acronym_str	SEDICI
repository_id_str	1329
network_name_str	SEDICI (UNLP)
dc.title.none.fl_str_mv	Improving LDA topic modeling in Twitter with graph community detection
dc.creator.none.fl_str_mv	Albanese, Federico Feuerstein, Esteban
author	Albanese, Federico
author_facet	Albanese, Federico Feuerstein, Esteban
author_role	author
author2	Feuerstein, Esteban
author2_role	author
dc.subject.none.fl_str_mv	Computer Science Topic modeling Community detection Twitter Text mining Text clustering
topic	Computer Science Topic modeling Community detection Twitter Text mining Text clustering
publishDate	2021
dc.date.none.fl_str_mv	2021-10
dc.type.none.fl_str_mv	info:eu-repo/semantics/conferenceObject info:eu-repo/semantics/publishedVersion Conference object http://purl.org/coar/resource_type/c_5794 info:ar-repo/semantics/documentoDeConferencia
format	conferenceObject
status_str	publishedVersion
dc.identifier.none.fl_str_mv	http://sedici.unlp.edu.ar/handle/10915/140126
url	http://sedici.unlp.edu.ar/handle/10915/140126
dc.language.none.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	info:eu-repo/semantics/altIdentifier/url/http://50jaiio.sadio.org.ar/pdfs/agranda/AGRANDA-04.pdf info:eu-repo/semantics/altIdentifier/issn/2683-8966
dc.rights.none.fl_str_mv	info:eu-repo/semantics/openAccess http://creativecommons.org/licenses/by-nc-sa/3.0/ Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
eu_rights_str_mv	openAccess
rights_invalid_str_mv	http://creativecommons.org/licenses/by-nc-sa/3.0/ Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
dc.format.none.fl_str_mv	application/pdf 19-19
dc.source.none.fl_str_mv	reponame:SEDICI (UNLP) instname:Universidad Nacional de La Plata instacron:UNLP
reponame_str	SEDICI (UNLP)
collection	SEDICI (UNLP)
instname_str	Universidad Nacional de La Plata
instacron_str	UNLP
institution	UNLP
repository.name.fl_str_mv	SEDICI (UNLP) - Universidad Nacional de La Plata
repository.mail.fl_str_mv	alira@sedici.unlp.edu.ar
