First Steps towards Data-Driven Adversarial Deduplication
- Autores
- Paredes, José Nicolás; Simari, Gerardo; Martinez, Maria Vanina; Falappa, Marcelo Alejandro
- Año de publicación
- 2018
- Idioma
- inglés
- Tipo de recurso
- artículo
- Estado
- versión publicada
- Descripción
- In traditional databases, the entity resolution problem (which is also known as deduplication)refers to the task of mapping multiple manifestations of virtual objects totheir corresponding real-worldentities. When addressing this problem, in both theory and practice, it is widely assumed that suchsets of virtual objects appear as the result of clerical errors, transliterations, missing or updatedattributes, abbreviations, and so forth. In this paper, we address this problem under the assumptionthat this situation is caused by malicious actors operating in domains in which they do not wishto be identified, such as hacker forums and markets in which the participants are motivated toremain semi-anonymous (though they wish to keep their true identities secret, they find it useful forcustomers to identify their products and services). We are therefore in the presence of a different, andeven more challenging, problem that we refer to as adversarial deduplication. In this paper, we studythis problem via examples that arise from real-world data on malicious hacker forums and marketsarising from collaborations with a cyber threat intelligence company focusing on understanding thiskind of behavior. We argue that it is very difficult—if not impossible—to find ground truth data onwhich to build solutions to this problem, and develop a set of preliminary experiments based ontraining machine learning classifiers that leverage text analysis to detect potential cases of duplicateentities. Our results are encouraging as a first step towards building tools that human analysts canuse to enhance their capabilities towards fighting cyber threats.
Fil: Paredes, José Nicolás. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; Argentina
Fil: Simari, Gerardo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; Argentina. Arizona State University; Estados Unidos
Fil: Martinez, Maria Vanina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. Universidad de Buenos Aires; Argentina
Fil: Falappa, Marcelo Alejandro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; Argentina - Materia
-
ADVERSARIAL DEDUPLICATION
CYBER THREAT INTELLIGENCE
MACHINE LEARNING CLASSIFIERS - Nivel de accesibilidad
- acceso abierto
- Condiciones de uso
- https://creativecommons.org/licenses/by/2.5/ar/
- Repositorio
- Institución
- Consejo Nacional de Investigaciones Científicas y Técnicas
- OAI Identificador
- oai:ri.conicet.gov.ar:11336/89020
Ver los metadatos del registro completo
id |
CONICETDig_1ac1c207aef6892a3ea8e8f33050dcc3 |
---|---|
oai_identifier_str |
oai:ri.conicet.gov.ar:11336/89020 |
network_acronym_str |
CONICETDig |
repository_id_str |
3498 |
network_name_str |
CONICET Digital (CONICET) |
spelling |
First Steps towards Data-Driven Adversarial DeduplicationParedes, José NicolásSimari, GerardoMartinez, Maria VaninaFalappa, Marcelo AlejandroADVERSARIAL DEDUPLICATIONCYBER THREAT INTELLIGENCEMACHINE LEARNING CLASSIFIERShttps://purl.org/becyt/ford/1.2https://purl.org/becyt/ford/1In traditional databases, the entity resolution problem (which is also known as deduplication)refers to the task of mapping multiple manifestations of virtual objects totheir corresponding real-worldentities. When addressing this problem, in both theory and practice, it is widely assumed that suchsets of virtual objects appear as the result of clerical errors, transliterations, missing or updatedattributes, abbreviations, and so forth. In this paper, we address this problem under the assumptionthat this situation is caused by malicious actors operating in domains in which they do not wishto be identified, such as hacker forums and markets in which the participants are motivated toremain semi-anonymous (though they wish to keep their true identities secret, they find it useful forcustomers to identify their products and services). We are therefore in the presence of a different, andeven more challenging, problem that we refer to as adversarial deduplication. In this paper, we studythis problem via examples that arise from real-world data on malicious hacker forums and marketsarising from collaborations with a cyber threat intelligence company focusing on understanding thiskind of behavior. We argue that it is very difficult—if not impossible—to find ground truth data onwhich to build solutions to this problem, and develop a set of preliminary experiments based ontraining machine learning classifiers that leverage text analysis to detect potential cases of duplicateentities. Our results are encouraging as a first step towards building tools that human analysts canuse to enhance their capabilities towards fighting cyber threats.Fil: Paredes, José Nicolás. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; ArgentinaFil: Simari, Gerardo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; Argentina. Arizona State University; Estados UnidosFil: Martinez, Maria Vanina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. Universidad de Buenos Aires; ArgentinaFil: Falappa, Marcelo Alejandro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; ArgentinaMDPI AG2018-07-27info:eu-repo/semantics/articleinfo:eu-repo/semantics/publishedVersionhttp://purl.org/coar/resource_type/c_6501info:ar-repo/semantics/articuloapplication/pdfapplication/pdfhttp://hdl.handle.net/11336/89020Paredes, José Nicolás; Simari, Gerardo; Martinez, Maria Vanina; Falappa, Marcelo Alejandro; First Steps towards Data-Driven Adversarial Deduplication; MDPI AG; Information (Switzerland); 9; 8; 27-7-2018; 189-2042078-2489CONICET DigitalCONICETenginfo:eu-repo/semantics/altIdentifier/url/https://www.mdpi.com/2078-2489/9/8/189info:eu-repo/semantics/altIdentifier/doi/10.3390/info9080189info:eu-repo/semantics/openAccesshttps://creativecommons.org/licenses/by/2.5/ar/reponame:CONICET Digital (CONICET)instname:Consejo Nacional de Investigaciones Científicas y Técnicas2025-09-29T10:10:59Zoai:ri.conicet.gov.ar:11336/89020instacron:CONICETInstitucionalhttp://ri.conicet.gov.ar/Organismo científico-tecnológicoNo correspondehttp://ri.conicet.gov.ar/oai/requestdasensio@conicet.gov.ar; lcarlino@conicet.gov.arArgentinaNo correspondeNo correspondeNo correspondeopendoar:34982025-09-29 10:11:00.038CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicasfalse |
dc.title.none.fl_str_mv |
First Steps towards Data-Driven Adversarial Deduplication |
title |
First Steps towards Data-Driven Adversarial Deduplication |
spellingShingle |
First Steps towards Data-Driven Adversarial Deduplication Paredes, José Nicolás ADVERSARIAL DEDUPLICATION CYBER THREAT INTELLIGENCE MACHINE LEARNING CLASSIFIERS |
title_short |
First Steps towards Data-Driven Adversarial Deduplication |
title_full |
First Steps towards Data-Driven Adversarial Deduplication |
title_fullStr |
First Steps towards Data-Driven Adversarial Deduplication |
title_full_unstemmed |
First Steps towards Data-Driven Adversarial Deduplication |
title_sort |
First Steps towards Data-Driven Adversarial Deduplication |
dc.creator.none.fl_str_mv |
Paredes, José Nicolás Simari, Gerardo Martinez, Maria Vanina Falappa, Marcelo Alejandro |
author |
Paredes, José Nicolás |
author_facet |
Paredes, José Nicolás Simari, Gerardo Martinez, Maria Vanina Falappa, Marcelo Alejandro |
author_role |
author |
author2 |
Simari, Gerardo Martinez, Maria Vanina Falappa, Marcelo Alejandro |
author2_role |
author author author |
dc.subject.none.fl_str_mv |
ADVERSARIAL DEDUPLICATION CYBER THREAT INTELLIGENCE MACHINE LEARNING CLASSIFIERS |
topic |
ADVERSARIAL DEDUPLICATION CYBER THREAT INTELLIGENCE MACHINE LEARNING CLASSIFIERS |
purl_subject.fl_str_mv |
https://purl.org/becyt/ford/1.2 https://purl.org/becyt/ford/1 |
dc.description.none.fl_txt_mv |
In traditional databases, the entity resolution problem (which is also known as deduplication)refers to the task of mapping multiple manifestations of virtual objects totheir corresponding real-worldentities. When addressing this problem, in both theory and practice, it is widely assumed that suchsets of virtual objects appear as the result of clerical errors, transliterations, missing or updatedattributes, abbreviations, and so forth. In this paper, we address this problem under the assumptionthat this situation is caused by malicious actors operating in domains in which they do not wishto be identified, such as hacker forums and markets in which the participants are motivated toremain semi-anonymous (though they wish to keep their true identities secret, they find it useful forcustomers to identify their products and services). We are therefore in the presence of a different, andeven more challenging, problem that we refer to as adversarial deduplication. In this paper, we studythis problem via examples that arise from real-world data on malicious hacker forums and marketsarising from collaborations with a cyber threat intelligence company focusing on understanding thiskind of behavior. We argue that it is very difficult—if not impossible—to find ground truth data onwhich to build solutions to this problem, and develop a set of preliminary experiments based ontraining machine learning classifiers that leverage text analysis to detect potential cases of duplicateentities. Our results are encouraging as a first step towards building tools that human analysts canuse to enhance their capabilities towards fighting cyber threats. Fil: Paredes, José Nicolás. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; Argentina Fil: Simari, Gerardo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; Argentina. Arizona State University; Estados Unidos Fil: Martinez, Maria Vanina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. Universidad de Buenos Aires; Argentina Fil: Falappa, Marcelo Alejandro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; Argentina |
description |
In traditional databases, the entity resolution problem (which is also known as deduplication)refers to the task of mapping multiple manifestations of virtual objects totheir corresponding real-worldentities. When addressing this problem, in both theory and practice, it is widely assumed that suchsets of virtual objects appear as the result of clerical errors, transliterations, missing or updatedattributes, abbreviations, and so forth. In this paper, we address this problem under the assumptionthat this situation is caused by malicious actors operating in domains in which they do not wishto be identified, such as hacker forums and markets in which the participants are motivated toremain semi-anonymous (though they wish to keep their true identities secret, they find it useful forcustomers to identify their products and services). We are therefore in the presence of a different, andeven more challenging, problem that we refer to as adversarial deduplication. In this paper, we studythis problem via examples that arise from real-world data on malicious hacker forums and marketsarising from collaborations with a cyber threat intelligence company focusing on understanding thiskind of behavior. We argue that it is very difficult—if not impossible—to find ground truth data onwhich to build solutions to this problem, and develop a set of preliminary experiments based ontraining machine learning classifiers that leverage text analysis to detect potential cases of duplicateentities. Our results are encouraging as a first step towards building tools that human analysts canuse to enhance their capabilities towards fighting cyber threats. |
publishDate |
2018 |
dc.date.none.fl_str_mv |
2018-07-27 |
dc.type.none.fl_str_mv |
info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo |
format |
article |
status_str |
publishedVersion |
dc.identifier.none.fl_str_mv |
http://hdl.handle.net/11336/89020 Paredes, José Nicolás; Simari, Gerardo; Martinez, Maria Vanina; Falappa, Marcelo Alejandro; First Steps towards Data-Driven Adversarial Deduplication; MDPI AG; Information (Switzerland); 9; 8; 27-7-2018; 189-204 2078-2489 CONICET Digital CONICET |
url |
http://hdl.handle.net/11336/89020 |
identifier_str_mv |
Paredes, José Nicolás; Simari, Gerardo; Martinez, Maria Vanina; Falappa, Marcelo Alejandro; First Steps towards Data-Driven Adversarial Deduplication; MDPI AG; Information (Switzerland); 9; 8; 27-7-2018; 189-204 2078-2489 CONICET Digital CONICET |
dc.language.none.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
info:eu-repo/semantics/altIdentifier/url/https://www.mdpi.com/2078-2489/9/8/189 info:eu-repo/semantics/altIdentifier/doi/10.3390/info9080189 |
dc.rights.none.fl_str_mv |
info:eu-repo/semantics/openAccess https://creativecommons.org/licenses/by/2.5/ar/ |
eu_rights_str_mv |
openAccess |
rights_invalid_str_mv |
https://creativecommons.org/licenses/by/2.5/ar/ |
dc.format.none.fl_str_mv |
application/pdf application/pdf |
dc.publisher.none.fl_str_mv |
MDPI AG |
publisher.none.fl_str_mv |
MDPI AG |
dc.source.none.fl_str_mv |
reponame:CONICET Digital (CONICET) instname:Consejo Nacional de Investigaciones Científicas y Técnicas |
reponame_str |
CONICET Digital (CONICET) |
collection |
CONICET Digital (CONICET) |
instname_str |
Consejo Nacional de Investigaciones Científicas y Técnicas |
repository.name.fl_str_mv |
CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas |
repository.mail.fl_str_mv |
dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar |
_version_ |
1844614004170817536 |
score |
13.069144 |