First Steps towards Data-Driven Adversarial Deduplication

Autores: Paredes, José Nicolás; Simari, Gerardo; Martinez, Maria Vanina; Falappa, Marcelo Alejandro
Año de publicación: 2018
Idioma: inglés
Tipo de recurso: artículo
Estado: versión publicada
Descripción: In traditional databases, the entity resolution problem (which is also known as deduplication)refers to the task of mapping multiple manifestations of virtual objects totheir corresponding real-worldentities. When addressing this problem, in both theory and practice, it is widely assumed that suchsets of virtual objects appear as the result of clerical errors, transliterations, missing or updatedattributes, abbreviations, and so forth. In this paper, we address this problem under the assumptionthat this situation is caused by malicious actors operating in domains in which they do not wishto be identified, such as hacker forums and markets in which the participants are motivated toremain semi-anonymous (though they wish to keep their true identities secret, they find it useful forcustomers to identify their products and services). We are therefore in the presence of a different, andeven more challenging, problem that we refer to as adversarial deduplication. In this paper, we studythis problem via examples that arise from real-world data on malicious hacker forums and marketsarising from collaborations with a cyber threat intelligence company focusing on understanding thiskind of behavior. We argue that it is very difficult—if not impossible—to find ground truth data onwhich to build solutions to this problem, and develop a set of preliminary experiments based ontraining machine learning classifiers that leverage text analysis to detect potential cases of duplicateentities. Our results are encouraging as a first step towards building tools that human analysts canuse to enhance their capabilities towards fighting cyber threats.
Fil: Paredes, José Nicolás. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; Argentina
Fil: Simari, Gerardo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; Argentina. Arizona State University; Estados Unidos
Fil: Martinez, Maria Vanina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. Universidad de Buenos Aires; Argentina
Fil: Falappa, Marcelo Alejandro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; Argentina
Materia: ADVERSARIAL DEDUPLICATION
CYBER THREAT INTELLIGENCE
MACHINE LEARNING CLASSIFIERS
Nivel de accesibilidad: acceso abierto
Condiciones de uso: https://creativecommons.org/licenses/by/2.5/ar/
Repositorio
Institución: Consejo Nacional de Investigaciones Científicas y Técnicas
OAI Identificador: oai:ri.conicet.gov.ar:11336/89020

Acceder

id	CONICETDig_1ac1c207aef6892a3ea8e8f33050dcc3
oai_identifier_str	oai:ri.conicet.gov.ar:11336/89020
network_acronym_str	CONICETDig
repository_id_str	3498
network_name_str	CONICET Digital (CONICET)
spelling	First Steps towards Data-Driven Adversarial DeduplicationParedes, José NicolásSimari, GerardoMartinez, Maria VaninaFalappa, Marcelo AlejandroADVERSARIAL DEDUPLICATIONCYBER THREAT INTELLIGENCEMACHINE LEARNING CLASSIFIERShttps://purl.org/becyt/ford/1.2https://purl.org/becyt/ford/1In traditional databases, the entity resolution problem (which is also known as deduplication)refers to the task of mapping multiple manifestations of virtual objects totheir corresponding real-worldentities. When addressing this problem, in both theory and practice, it is widely assumed that suchsets of virtual objects appear as the result of clerical errors, transliterations, missing or updatedattributes, abbreviations, and so forth. In this paper, we address this problem under the assumptionthat this situation is caused by malicious actors operating in domains in which they do not wishto be identified, such as hacker forums and markets in which the participants are motivated toremain semi-anonymous (though they wish to keep their true identities secret, they find it useful forcustomers to identify their products and services). We are therefore in the presence of a different, andeven more challenging, problem that we refer to as adversarial deduplication. In this paper, we studythis problem via examples that arise from real-world data on malicious hacker forums and marketsarising from collaborations with a cyber threat intelligence company focusing on understanding thiskind of behavior. We argue that it is very difficult—if not impossible—to find ground truth data onwhich to build solutions to this problem, and develop a set of preliminary experiments based ontraining machine learning classifiers that leverage text analysis to detect potential cases of duplicateentities. Our results are encouraging as a first step towards building tools that human analysts canuse to enhance their capabilities towards fighting cyber threats.Fil: Paredes, José Nicolás. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; ArgentinaFil: Simari, Gerardo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; Argentina. Arizona State University; Estados UnidosFil: Martinez, Maria Vanina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. Universidad de Buenos Aires; ArgentinaFil: Falappa, Marcelo Alejandro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; ArgentinaMDPI AG2018-07-27info:eu-repo/semantics/articleinfo:eu-repo/semantics/publishedVersionhttp://purl.org/coar/resource_type/c_6501info:ar-repo/semantics/articuloapplication/pdfapplication/pdfhttp://hdl.handle.net/11336/89020Paredes, José Nicolás; Simari, Gerardo; Martinez, Maria Vanina; Falappa, Marcelo Alejandro; First Steps towards Data-Driven Adversarial Deduplication; MDPI AG; Information (Switzerland); 9; 8; 27-7-2018; 189-2042078-2489CONICET DigitalCONICETenginfo:eu-repo/semantics/altIdentifier/url/https://www.mdpi.com/2078-2489/9/8/189info:eu-repo/semantics/altIdentifier/doi/10.3390/info9080189info:eu-repo/semantics/openAccesshttps://creativecommons.org/licenses/by/2.5/ar/reponame:CONICET Digital (CONICET)instname:Consejo Nacional de Investigaciones Científicas y Técnicas2026-02-26T10:22:43Zoai:ri.conicet.gov.ar:11336/89020instacron:CONICETInstitucionalhttp://ri.conicet.gov.ar/Organismo científico-tecnológicoNo correspondehttp://ri.conicet.gov.ar/oai/requestdasensio@conicet.gov.ar; lcarlino@conicet.gov.arArgentinaNo correspondeNo correspondeNo correspondeopendoar:34982026-02-26 10:22:43.827CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicasfalse
dc.title.none.fl_str_mv	First Steps towards Data-Driven Adversarial Deduplication
title	First Steps towards Data-Driven Adversarial Deduplication
spellingShingle	First Steps towards Data-Driven Adversarial Deduplication Paredes, José Nicolás ADVERSARIAL DEDUPLICATION CYBER THREAT INTELLIGENCE MACHINE LEARNING CLASSIFIERS
title_short	First Steps towards Data-Driven Adversarial Deduplication
title_full	First Steps towards Data-Driven Adversarial Deduplication
title_fullStr	First Steps towards Data-Driven Adversarial Deduplication
title_full_unstemmed	First Steps towards Data-Driven Adversarial Deduplication
title_sort	First Steps towards Data-Driven Adversarial Deduplication
dc.creator.none.fl_str_mv	Paredes, José Nicolás Simari, Gerardo Martinez, Maria Vanina Falappa, Marcelo Alejandro
author	Paredes, José Nicolás
author_facet	Paredes, José Nicolás Simari, Gerardo Martinez, Maria Vanina Falappa, Marcelo Alejandro
author_role	author
author2	Simari, Gerardo Martinez, Maria Vanina Falappa, Marcelo Alejandro
author2_role	author author author
dc.subject.none.fl_str_mv	ADVERSARIAL DEDUPLICATION CYBER THREAT INTELLIGENCE MACHINE LEARNING CLASSIFIERS
topic	ADVERSARIAL DEDUPLICATION CYBER THREAT INTELLIGENCE MACHINE LEARNING CLASSIFIERS
purl_subject.fl_str_mv	https://purl.org/becyt/ford/1.2 https://purl.org/becyt/ford/1
dc.description.none.fl_txt_mv	In traditional databases, the entity resolution problem (which is also known as deduplication)refers to the task of mapping multiple manifestations of virtual objects totheir corresponding real-worldentities. When addressing this problem, in both theory and practice, it is widely assumed that suchsets of virtual objects appear as the result of clerical errors, transliterations, missing or updatedattributes, abbreviations, and so forth. In this paper, we address this problem under the assumptionthat this situation is caused by malicious actors operating in domains in which they do not wishto be identified, such as hacker forums and markets in which the participants are motivated toremain semi-anonymous (though they wish to keep their true identities secret, they find it useful forcustomers to identify their products and services). We are therefore in the presence of a different, andeven more challenging, problem that we refer to as adversarial deduplication. In this paper, we studythis problem via examples that arise from real-world data on malicious hacker forums and marketsarising from collaborations with a cyber threat intelligence company focusing on understanding thiskind of behavior. We argue that it is very difficult—if not impossible—to find ground truth data onwhich to build solutions to this problem, and develop a set of preliminary experiments based ontraining machine learning classifiers that leverage text analysis to detect potential cases of duplicateentities. Our results are encouraging as a first step towards building tools that human analysts canuse to enhance their capabilities towards fighting cyber threats. Fil: Paredes, José Nicolás. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; Argentina Fil: Simari, Gerardo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; Argentina. Arizona State University; Estados Unidos Fil: Martinez, Maria Vanina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. Universidad de Buenos Aires; Argentina Fil: Falappa, Marcelo Alejandro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; Argentina
description	In traditional databases, the entity resolution problem (which is also known as deduplication)refers to the task of mapping multiple manifestations of virtual objects totheir corresponding real-worldentities. When addressing this problem, in both theory and practice, it is widely assumed that suchsets of virtual objects appear as the result of clerical errors, transliterations, missing or updatedattributes, abbreviations, and so forth. In this paper, we address this problem under the assumptionthat this situation is caused by malicious actors operating in domains in which they do not wishto be identified, such as hacker forums and markets in which the participants are motivated toremain semi-anonymous (though they wish to keep their true identities secret, they find it useful forcustomers to identify their products and services). We are therefore in the presence of a different, andeven more challenging, problem that we refer to as adversarial deduplication. In this paper, we studythis problem via examples that arise from real-world data on malicious hacker forums and marketsarising from collaborations with a cyber threat intelligence company focusing on understanding thiskind of behavior. We argue that it is very difficult—if not impossible—to find ground truth data onwhich to build solutions to this problem, and develop a set of preliminary experiments based ontraining machine learning classifiers that leverage text analysis to detect potential cases of duplicateentities. Our results are encouraging as a first step towards building tools that human analysts canuse to enhance their capabilities towards fighting cyber threats.
publishDate	2018
dc.date.none.fl_str_mv	2018-07-27
dc.type.none.fl_str_mv	info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo
format	article
status_str	publishedVersion
dc.identifier.none.fl_str_mv	http://hdl.handle.net/11336/89020 Paredes, José Nicolás; Simari, Gerardo; Martinez, Maria Vanina; Falappa, Marcelo Alejandro; First Steps towards Data-Driven Adversarial Deduplication; MDPI AG; Information (Switzerland); 9; 8; 27-7-2018; 189-204 2078-2489 CONICET Digital CONICET
url	http://hdl.handle.net/11336/89020
identifier_str_mv	Paredes, José Nicolás; Simari, Gerardo; Martinez, Maria Vanina; Falappa, Marcelo Alejandro; First Steps towards Data-Driven Adversarial Deduplication; MDPI AG; Information (Switzerland); 9; 8; 27-7-2018; 189-204 2078-2489 CONICET Digital CONICET
dc.language.none.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	info:eu-repo/semantics/altIdentifier/url/https://www.mdpi.com/2078-2489/9/8/189 info:eu-repo/semantics/altIdentifier/doi/10.3390/info9080189
dc.rights.none.fl_str_mv	info:eu-repo/semantics/openAccess https://creativecommons.org/licenses/by/2.5/ar/
eu_rights_str_mv	openAccess
rights_invalid_str_mv	https://creativecommons.org/licenses/by/2.5/ar/
dc.format.none.fl_str_mv	application/pdf application/pdf
dc.publisher.none.fl_str_mv	MDPI AG
publisher.none.fl_str_mv	MDPI AG
dc.source.none.fl_str_mv	reponame:CONICET Digital (CONICET) instname:Consejo Nacional de Investigaciones Científicas y Técnicas
reponame_str	CONICET Digital (CONICET)
collection	CONICET Digital (CONICET)
instname_str	Consejo Nacional de Investigaciones Científicas y Técnicas
repository.name.fl_str_mv	CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas
repository.mail.fl_str_mv	dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar
_version_	1858305675642273792
score	13.176822

First Steps towards Data-Driven Adversarial Deduplication

Publicaciones similares