Cost-Sensitive Classifier for Spam Detection on News Media Twitter Accounts (revised April 2017)

Authors: Tur, Georvic; Homsi, Masun Nabhan
Publication year: 2017
Language: English
Resource type: conference paper
Status: published version
Description (abstract): Social media are increasingly being used as sources in mainstream news coverage. However, because news updates so rapidly, it is easy to fall into the trap of accepting everything as true. Spam content usually refers to information that goes viral and skews users' views on subjects. Despite recent advances in spam analysis methods, extracting accurate and useful information from tweets remains a challenging task. This paper introduces a new approach for classifying spam and non-spam tweets using a cost-sensitive classifier that includes Random Forest. The approach consisted of three phases: preprocessing, classification and evaluation. In the preprocessing phase, tweets were first annotated manually, and then four different sets of features were extracted from them. In the classification phase, four machine learning algorithms were first cross-validated to determine the best base classifier for spam detection. Then, the class imbalance problem was addressed by resampling and by incorporating arbitrary misclassification costs into the learning process. In the evaluation phase, the trained algorithm was tested with unseen tweets. Experimental results showed that the proposed approach helped mitigate overfitting and reduced classification error, achieving an overall accuracy of 89.14% in training and 76.82% in testing.
Sociedad Argentina de Informática e Investigación Operativa (SADIO)
Subject: Computer Science (Ciencias Informáticas)
spam classification
twitter
topic discovering
cost-sensitive classifier
random forest
Access level: open access
Terms of use: http://creativecommons.org/licenses/by-sa/3.0/
Repository
Institution: Universidad Nacional de La Plata
OAI identifier: oai:sedici.unlp.edu.ar:10915/63208
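
Illustrative note: the abstract mentions incorporating misclassification costs into the learning process but does not give the cost values or the exact mechanism. The sketch below shows one common way a cost-sensitive decision can work, namely choosing the class with the lowest expected cost given class probabilities from a base classifier such as Random Forest. The cost matrix, the probabilities and the function name are hypothetical and are not taken from the paper.

```python
# Sketch only: the cost values and probabilities below are hypothetical,
# not taken from the paper.

def min_expected_cost_class(probs, cost):
    """Return the class index with the lowest expected misclassification cost.

    probs[i]   -- estimated probability that the true class is i
    cost[i][j] -- cost of predicting class j when the true class is i
    """
    n_classes = len(cost[0])
    expected = [
        sum(probs[i] * cost[i][j] for i in range(len(probs)))
        for j in range(n_classes)
    ]
    return min(range(n_classes), key=expected.__getitem__)


# Classes: 0 = non-spam, 1 = spam.
# Hypothetical costs: missing a spam tweet (true 1, predicted 0) costs 5,
# flagging a legitimate tweet (true 0, predicted 1) costs 1.
COST = [
    [0.0, 1.0],
    [5.0, 0.0],
]

# Hypothetical base-classifier output: 30% chance the tweet is spam.
probs = [0.7, 0.3]
plain = max(range(2), key=probs.__getitem__)  # argmax ignores costs
cost_aware = min_expected_cost_class(probs, COST)
print(plain, cost_aware)
```

With these assumed costs, a tweet that a plain argmax would label non-spam is flagged as spam, because a missed spam tweet is weighted five times more heavily than a false alarm. This is one way unequal costs can offset class imbalance.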


id	SEDICI_c4deda322bcfe50ae193906afa066a87
oai_identifier_str	oai:sedici.unlp.edu.ar:10915/63208
network_acronym_str	SEDICI
repository_id_str	1329
network_name_str	SEDICI (UNLP)
dc.title.none.fl_str_mv	Cost-Sensitive Classifier for Spam Detection on News Media Twitter Accounts (revised April 2017)
dc.creator.none.fl_str_mv	Tur, Georvic Homsi, Masun Nabhan
author	Tur, Georvic
author_facet	Tur, Georvic Homsi, Masun Nabhan
author_role	author
author2	Homsi, Masun Nabhan
author2_role	author
dc.subject.none.fl_str_mv	Ciencias Informáticas spam classification twitter topic discovering cost-sensitive classifier random forest
publishDate	2017
dc.date.none.fl_str_mv	2017-09
dc.type.none.fl_str_mv	info:eu-repo/semantics/conferenceObject info:eu-repo/semantics/publishedVersion Objeto de conferencia http://purl.org/coar/resource_type/c_5794 info:ar-repo/semantics/documentoDeConferencia
format	conferenceObject
status_str	publishedVersion
dc.identifier.none.fl_str_mv	http://sedici.unlp.edu.ar/handle/10915/63208
url	http://sedici.unlp.edu.ar/handle/10915/63208
dc.language.none.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	info:eu-repo/semantics/altIdentifier/url/http://www.clei2017-46jaiio.sadio.org.ar/sites/default/files/Mem/SLMDI/SLMDI-07.pdf
dc.rights.none.fl_str_mv	info:eu-repo/semantics/openAccess http://creativecommons.org/licenses/by-sa/3.0/ Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
eu_rights_str_mv	openAccess
rights_invalid_str_mv	http://creativecommons.org/licenses/by-sa/3.0/ Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
dc.format.none.fl_str_mv	application/pdf
dc.source.none.fl_str_mv	reponame:SEDICI (UNLP) instname:Universidad Nacional de La Plata instacron:UNLP
reponame_str	SEDICI (UNLP)
collection	SEDICI (UNLP)
instname_str	Universidad Nacional de La Plata
instacron_str	UNLP
institution	UNLP
repository.name.fl_str_mv	SEDICI (UNLP) - Universidad Nacional de La Plata
repository.mail.fl_str_mv	alira@sedici.unlp.edu.ar
