-
Intent Detection at Scale: Tuning a Generic Model using Relevant Intents
Authors:
Nichal Narotamo,
David Aparicio,
Tiago Mesquita,
Mariana Almeida
Abstract:
Accurately predicting the intent of customer support requests is vital for efficient support systems, enabling agents to quickly understand messages and prioritize responses accordingly. While different approaches exist for intent detection, maintaining separate client-specific or industry-specific models can be costly and impractical as the client base expands.
This work proposes a system to scale intent predictions to various clients effectively, by combining a single generic model with a per-client list of relevant intents. Our approach minimizes training and maintenance costs while providing a personalized experience for clients, allowing for seamless adaptation to changes in their relevant intents. Furthermore, we propose a strategy for using the clients' relevant intents as model features that proves to be resilient to changes in the relevant intents of clients -- a common occurrence in production environments.
The final system exhibits significantly superior performance compared to industry-specific models, showcasing its flexibility and ability to cater to diverse client needs.
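The core idea above can be illustrated with a minimal sketch: a single generic scorer whose prediction is restricted to each client's relevant-intent list. All names, intents, and the keyword "model" below are invented for illustration; the paper's actual model and features are not shown here.

```python
# Hypothetical sketch: one generic intent scorer, masked per client.
ALL_INTENTS = ["refund", "shipping", "cancel_order", "tech_support", "billing"]

# Per-client lists of relevant intents (assumed configuration).
CLIENT_INTENTS = {
    "client_a": {"refund", "shipping", "cancel_order"},
    "client_b": {"tech_support", "billing"},
}

def score_intents(message):
    """Stand-in for the generic model: returns a score per intent."""
    # A real system would run a trained classifier here.
    keywords = {"refund": ["money back"], "shipping": ["deliver"],
                "cancel_order": ["cancel"], "tech_support": ["crash"],
                "billing": ["invoice"]}
    return {i: sum(k in message.lower() for k in kws)
            for i, kws in keywords.items()}

def predict(message, client):
    """Pick the best-scoring intent among the client's relevant intents."""
    scores = score_intents(message)
    relevant = CLIENT_INTENTS[client]
    return max(relevant, key=lambda i: scores[i])

print(predict("I want to cancel my order", "client_a"))  # cancel_order
```

Because the generic model is shared, onboarding a new client or changing a client's intents only edits the configuration, not the model.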
Submitted 15 September, 2023;
originally announced September 2023.
-
Natural language to SQL in low-code platforms
Authors:
Sofia Aparicio,
Samuel Arcadinho,
João Nadkarni,
David Aparício,
João Lages,
Mariana Lourenço,
Bartłomiej Matejczyk,
Filipe Assunção
Abstract:
One of the developers' biggest challenges in low-code platforms is retrieving data from a database using SQL queries. Here, we propose a pipeline allowing developers to write natural language (NL) to retrieve data. In this study, we collect, label, and validate data covering the SQL queries most often performed by OutSystems users. We use that data to train an NL model that generates SQL. Alongside this, we describe the entire pipeline, which comprises a feedback loop that allows us to quickly collect production data and use it to retrain our SQL generation model. Using crowd-sourcing, we collect 26k NL and SQL pairs and obtain an additional 1k pairs from production data. Finally, we develop a UI that allows developers to input an NL query in a prompt and receive a user-friendly representation of the resulting SQL query. We use A/B testing to compare four different models in production and observe a 240% improvement in terms of adoption of the feature, 220% in terms of engagement rate, and a 90% decrease in failure rate when compared against the first model that we put into production, showcasing the effectiveness of our pipeline in continuously improving our feature.
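The feedback-loop idea can be sketched in a few lines: validated production NL/SQL pairs flow back into the training set, and the model is retrained. The lookup-table "model" and all names below are illustrative assumptions, not the paper's architecture.

```python
# Hypothetical sketch of the retraining feedback loop.
training_pairs = [("list all users", "SELECT * FROM Users")]  # crowd-sourced

def retrain(pairs):
    """Stand-in for model training: here, just a lookup table."""
    return dict(pairs)

model = retrain(training_pairs)

# Production: a developer confirms a suggested query is correct; the
# validated pair flows back into the training data.
production_feedback = [("count the orders", "SELECT COUNT(*) FROM Orders")]
training_pairs.extend(production_feedback)
model = retrain(training_pairs)

print(model["count the orders"])  # SELECT COUNT(*) FROM Orders
```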
Submitted 29 August, 2023;
originally announced August 2023.
-
The GANfather: Controllable generation of malicious activity to improve defence systems
Authors:
Ricardo Ribeiro Pereira,
Jacopo Bono,
João Tiago Ascensão,
David Aparício,
Pedro Ribeiro,
Pedro Bizarro
Abstract:
Machine learning methods to aid defence systems in detecting malicious activity typically rely on labelled data. In some domains, such labelled data is unavailable or incomplete. In practice, this can lead to low detection rates and high false positive rates, which characterise, for example, anti-money laundering systems. In fact, it is estimated that 1.7--4 trillion euros are laundered annually and go undetected. We propose The GANfather, a method to generate samples with properties of malicious activity, without label requirements. We propose to reward the generation of malicious samples by introducing an extra objective to the typical Generative Adversarial Networks (GANs) loss. Ultimately, our goal is to enhance the detection of illicit activity using the discriminator network as a novel and robust defence system. Optionally, we may encourage the generator to bypass pre-existing detection systems. This setup then reveals defensive weaknesses for the discriminator to correct. We evaluate our method in two real-world use cases, money laundering and recommendation systems. In the former, our method moves cumulative amounts close to 350 thousand dollars through a network of accounts without being detected by an existing system. In the latter, we recommend the target item to a broad user base with as few as 30 synthetic attackers. In both cases, we train a new defence system to capture the synthetic attacks.
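The "extra objective" idea can be sketched numerically: the generator's loss combines the usual adversarial term with a reward for malicious properties. The function names, the non-saturating loss choice, and the weighting are simplified assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of a GAN generator loss with an extra objective.
import math

def generator_loss(disc_prob_real, malicious_score, lam=0.1):
    """Non-saturating GAN loss plus a reward for maliciousness.

    disc_prob_real: discriminator's probability that the sample is real.
    malicious_score: task-specific measure, e.g. amount moved undetected.
    lam: weight of the extra objective (assumed hyperparameter).
    """
    adversarial = -math.log(max(disc_prob_real, 1e-12))  # fool discriminator
    reward = -lam * malicious_score                      # encourage maliciousness
    return adversarial + reward

# A sample that both fools the discriminator and moves more money
# achieves a lower (better) generator loss.
l1 = generator_loss(disc_prob_real=0.9, malicious_score=5.0)
l2 = generator_loss(disc_prob_real=0.9, malicious_score=1.0)
print(l1 < l2)  # True
```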
Submitted 25 July, 2023;
originally announced July 2023.
-
From random-walks to graph-sprints: a low-latency node embedding framework on continuous-time dynamic graphs
Authors:
Ahmad Naser Eddin,
Jacopo Bono,
David Aparício,
Hugo Ferreira,
João Ascensão,
Pedro Ribeiro,
Pedro Bizarro
Abstract:
Many real-world datasets have an underlying dynamic graph structure, where entities and their interactions evolve over time. Machine learning models should consider these dynamics in order to harness their full potential in downstream tasks. Previous approaches for graph representation learning have focused on either sampling k-hop neighborhoods, akin to breadth-first search, or random walks, akin to depth-first search. However, these methods are computationally expensive and unsuitable for real-time, low-latency inference on dynamic graphs. To overcome these limitations, we propose graph-sprints, a general-purpose feature extraction framework for continuous-time dynamic graphs (CTDGs) that has low latency and is competitive with state-of-the-art, higher-latency models. To achieve this, a streaming, low-latency approximation to random-walk-based features is proposed. In our framework, time-aware node embeddings summarizing multi-hop information are computed using only single-hop operations on the incoming edges. We evaluate our proposed approach on three open-source datasets and two in-house datasets, and compare with three state-of-the-art algorithms (TGN-attn, TGN-ID, Jodie). We demonstrate that our graph-sprints features, combined with a machine learning classifier, achieve competitive performance (outperforming all baselines for the node classification tasks in five datasets). Simultaneously, graph-sprints significantly reduce inference latencies, achieving close to an order of magnitude speed-up in our experimental setting.
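The single-hop streaming idea can be sketched as follows: each incoming edge updates the destination node's state from the source node's state, so multi-hop information accumulates over time without explicit walks. The scalar state, the exponential decay rule, and the blending weights are illustrative assumptions, not the paper's actual update.

```python
# Sketch of a streaming, single-hop, time-decayed node-state update.
import math

state = {}  # node -> (embedding value, last update time)

def update(src, dst, t, alpha=0.5, tau=10.0):
    """Process edge (src, dst) at time t with one single-hop operation."""
    s_val, _ = state.get(src, (0.0, t))
    d_val, d_t = state.get(dst, (0.0, t))
    decay = math.exp(-(t - d_t) / tau)  # older information fades
    # Blend the destination's decayed state with the source's state.
    state[dst] = (alpha * d_val * decay + (1 - alpha) * (s_val + 1.0), t)

# Edges arriving in time order; b's state carries information from a,
# and c's state then indirectly carries information from a via b,
# even though each update only touched one hop.
for src, dst, t in [("a", "b", 1.0), ("b", "c", 2.0)]:
    update(src, dst, t)
print(state["c"][0] > 0)  # True
```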
Submitted 16 February, 2024; v1 submitted 17 July, 2023;
originally announced July 2023.
-
T5QL: Taming language models for SQL generation
Authors:
Samuel Arcadinho,
David Aparício,
Hugo Veiga,
António Alegria
Abstract:
Automatic SQL generation has been an active research area, aiming at streamlining the access to databases by writing natural language with the given intent instead of writing SQL. Current SOTA methods for semantic parsing depend on LLMs to achieve high predictive accuracy on benchmark datasets. This reduces their applicability, since LLMs require expensive GPUs. Furthermore, SOTA methods are ungrounded and thus not guaranteed to always generate valid SQL. Here we propose T5QL, a new SQL generation method that improves the performance in benchmark datasets when using smaller LMs, namely T5-Base, by 13pp when compared against SOTA methods. Additionally, T5QL is guaranteed to always output valid SQL using a context-free grammar to constrain SQL generation. Finally, we show that dividing semantic parsing into two tasks, candidate SQL generation and candidate re-ranking, is a promising research avenue that can reduce the need for large LMs.
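Grammar-constrained decoding can be illustrated with a toy example: at each step, only tokens the grammar allows may be emitted, so the output is valid by construction. The tiny token-level "grammar" and the fake model ranking below are assumptions for illustration; the paper uses a full context-free SQL grammar.

```python
# Toy illustration of grammar-constrained greedy decoding.
# Allowed next tokens given the previous token (a toy grammar).
GRAMMAR = {
    None: {"SELECT"},
    "SELECT": {"name", "id"},
    "name": {"FROM"},
    "id": {"FROM"},
    "FROM": {"users", "orders"},
}

def constrained_decode(model_ranking, max_len=4):
    """Greedy decode, skipping tokens the grammar forbids."""
    out, prev = [], None
    for _ in range(max_len):
        allowed = GRAMMAR.get(prev, set())
        if not allowed:
            break
        # Pick the model's highest-ranked token that is grammatically legal.
        tok = next(t for t in model_ranking if t in allowed)
        out.append(tok)
        prev = tok
    return " ".join(out)

# A (fake) model preference order; ungrammatical tokens are masked out,
# so even a badly ranked model cannot emit invalid SQL.
ranking = ["users", "name", "SELECT", "FROM", "id", "orders"]
print(constrained_decode(ranking))  # SELECT name FROM users
```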
Submitted 21 September, 2022;
originally announced September 2022.
-
Anti-Money Laundering Alert Optimization Using Machine Learning with Graphs
Authors:
Ahmad Naser Eddin,
Jacopo Bono,
David Aparício,
David Polido,
João Tiago Ascensão,
Pedro Bizarro,
Pedro Ribeiro
Abstract:
Money laundering is a global problem that concerns legitimizing proceeds from serious felonies (1.7-4 trillion euros annually) such as drug dealing, human trafficking, or corruption. The anti-money laundering systems deployed by financial institutions typically comprise rules aligned with regulatory frameworks. Human investigators review the alerts and report suspicious cases. Such systems suffer from high false-positive rates, undermining their effectiveness and resulting in high operational costs. We propose a machine learning triage model, which complements the rule-based system and learns to predict the risk of an alert accurately. Our model uses both entity-centric engineered features and attributes characterizing inter-entity relations in the form of graph-based features. We leverage time windows to construct the dynamic graph, optimizing for time and space efficiency. We validate our model on a real-world banking dataset and show how the triage model can reduce the number of false positives by 80% while detecting over 90% of true positives. In this way, our model can significantly improve anti-money laundering operations.
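The time-window idea for graph features can be sketched simply: keep only recent transactions when computing an entity's relational features, bounding both memory and computation. The window length and the "distinct counterparties" feature are illustrative assumptions, not the paper's exact choices.

```python
# Sketch of sliding-time-window graph features for one entity.
from collections import deque

WINDOW = 7.0  # days (assumed window length)

class EntityFeatures:
    """Counts distinct counterparties within a sliding time window."""
    def __init__(self):
        self.recent = deque()  # (timestamp, counterparty), time-ordered

    def add(self, t, counterparty):
        self.recent.append((t, counterparty))
        # Drop transactions that fell out of the window.
        while self.recent and self.recent[0][0] < t - WINDOW:
            self.recent.popleft()

    def degree(self):
        return len({c for _, c in self.recent})

e = EntityFeatures()
for t, c in [(0.0, "acc1"), (1.0, "acc2"), (7.5, "acc3")]:
    e.add(t, c)
print(e.degree())  # 2  (acc1 at t=0 fell outside the 7-day window)
```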
Submitted 17 June, 2022; v1 submitted 14 December, 2021;
originally announced December 2021.
-
GUDIE: a flexible, user-defined method to extract subgraphs of interest from large graphs
Authors:
Maria Inês Silva,
David Aparício,
Beatriz Malveiro,
João Tiago Ascensão,
Pedro Bizarro
Abstract:
Large, dense, small-world networks often emerge from social phenomena, including financial networks, social media, or epidemiology. As networks grow in importance, it is often necessary to partition them into meaningful units of analysis. In this work, we propose GUDIE, a message-passing algorithm that extracts relevant context around seed nodes based on user-defined criteria. We design GUDIE for rich, labeled graphs, and expansions consider node and edge attributes. Preliminary results indicate that GUDIE expands to insightful areas while avoiding unimportant connections. The resulting subgraphs contain the relevant context for a seed node and can accelerate and extend analysis capabilities in finance and other critical networks.
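The expansion loop described above can be sketched as a breadth-first search from a seed node that only crosses edges satisfying a user-defined criterion. The graph encoding and the high-value-transfer criterion are illustrative assumptions.

```python
# Sketch of criterion-guided subgraph extraction from a seed node.
from collections import deque

# Labeled graph: node -> list of (neighbor, edge_attributes).
GRAPH = {
    "seed": [("a", {"amount": 900}), ("b", {"amount": 5})],
    "a": [("c", {"amount": 700})],
    "b": [("d", {"amount": 2})],
    "c": [],
    "d": [],
}

def extract_subgraph(seed, keep_edge):
    """BFS from seed, following only edges where keep_edge(attrs) holds."""
    seen, queue = {seed}, deque([seed])
    while queue:
        node = queue.popleft()
        for nbr, attrs in GRAPH[node]:
            if nbr not in seen and keep_edge(attrs):
                seen.add(nbr)
                queue.append(nbr)
    return seen

# User-defined criterion: only expand across high-value transfers,
# so the low-value branch through "b" is never explored.
print(sorted(extract_subgraph("seed", lambda a: a["amount"] > 100)))
# ['a', 'c', 'seed']
```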
Submitted 20 August, 2021;
originally announced August 2021.
-
Finding NeMo: Fishing in banking networks using network motifs
Authors:
Xavier Fontes,
David Aparício,
Maria Inês Silva,
Beatriz Malveiro,
João Tiago Ascensão,
Pedro Bizarro
Abstract:
Banking fraud causes billion-dollar losses for banks worldwide. In fraud detection, graphs help in understanding complex transaction patterns and discovering new fraud schemes. This work explores graph patterns in a real-world transaction dataset by extracting and analyzing its network motifs. Since banking graphs are heterogeneous, we focus on heterogeneous network motifs. Additionally, we propose a novel network randomization process that generates valid banking graphs. From our exploratory analysis, we conclude that network motifs extract insightful and interpretable patterns.
Submitted 10 August, 2021;
originally announced August 2021.
-
GuiltyWalker: Distance to illicit nodes in the Bitcoin network
Authors:
Catarina Oliveira,
João Torres,
Maria Inês Silva,
David Aparício,
João Tiago Ascensão,
Pedro Bizarro
Abstract:
Money laundering is a global phenomenon with wide-reaching social and economic consequences. Cryptocurrencies are particularly susceptible due to the lack of control by authorities and their anonymity. Thus, it is important to develop new techniques to detect and prevent illicit cryptocurrency transactions. In our work, we propose new features based on the structure of the graph and past labels to boost the performance of machine learning methods to detect money laundering. Our method, GuiltyWalker, performs random walks on the bitcoin transaction graph and computes features based on the distance to illicit transactions. We combine these new features with features proposed by Weber et al. and observe an improvement of about 5pp on illicit classification. Namely, we observe that our proposed features are particularly helpful during a black market shutdown, where the algorithm by Weber et al. performed poorly.
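The distance-to-illicit idea can be sketched with a single random walk: walk from a node and record how many steps it takes to hit a known-illicit node; summary statistics of these hitting distances then become features. The toy graph, walk length, and statistics are illustrative assumptions.

```python
# Sketch of random-walk hitting distances to illicit nodes.
import random

GRAPH = {1: [2, 3], 2: [4], 3: [4], 4: []}
ILLICIT = {4}

def hitting_distance(start, max_steps=10, rng=random):
    """One random walk; returns steps to reach an illicit node, or None."""
    node = start
    for step in range(1, max_steps + 1):
        neighbors = GRAPH[node]
        if not neighbors:
            return None  # dead end before reaching an illicit node
        node = rng.choice(neighbors)
        if node in ILLICIT:
            return step
    return None

random.seed(0)
dists = [d for d in (hitting_distance(1) for _ in range(100)) if d is not None]
print(min(dists))  # 2: node 4 is always exactly two hops from node 1
```

In a feature vector, one might use the minimum, mean, and hit rate of such walks per node; those exact choices are assumptions here.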
Submitted 21 July, 2021; v1 submitted 10 February, 2021;
originally announced February 2021.
-
Machine learning methods to detect money laundering in the Bitcoin blockchain in the presence of label scarcity
Authors:
Joana Lorenz,
Maria Inês Silva,
David Aparício,
João Tiago Ascensão,
Pedro Bizarro
Abstract:
Every year, criminals launder billions of dollars acquired from serious felonies (e.g., terrorism, drug smuggling, or human trafficking) harming countless people and economies. Cryptocurrencies, in particular, have become a haven for money laundering activity. Machine Learning can be used to detect these illicit patterns. However, labels are so scarce that traditional supervised algorithms are inapplicable. Here, we address money laundering detection assuming minimal access to labels. First, we show that existing state-of-the-art solutions using unsupervised anomaly detection methods are inadequate to detect the illicit patterns in a real Bitcoin transaction dataset. Then, we show that our proposed active learning solution is capable of matching the performance of a fully supervised baseline by using just 5% of the labels. This solution mimics a typical real-life situation in which a limited number of labels can be acquired through manual annotation by experts.
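The active-learning loop can be sketched with uncertainty sampling: repeatedly ask an "expert" to label only the example the current model is least sure about. The 1-D threshold "model" and the toy data are assumptions for illustration; the paper's models and query strategy are not shown here.

```python
# Sketch of an uncertainty-sampling active-learning loop.
# Toy 1-D data: the true label is 1 for x >= 5 (unknown to the model).
DATA = [(x, int(x >= 5)) for x in range(10)]

def fit(labeled):
    """'Train' a threshold classifier: midpoint between class extremes."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return (min(pos) + max(neg)) / 2 if pos and neg else 0.5

def uncertainty(x, thr):
    """Closer to the decision boundary = more uncertain."""
    return -abs(x - thr)

# Start with one label per class, then query the most uncertain point.
labeled = [DATA[0], DATA[9]]
for _ in range(3):
    thr = fit(labeled)
    unlabeled = [p for p in DATA if p not in labeled]
    query = max(unlabeled, key=lambda p: uncertainty(p[0], thr))
    labeled.append(query)  # "expert" reveals the true label

thr = fit(labeled)
print(all((x >= thr) == bool(y) for x, y in DATA))  # True, with 5/10 labels
```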
Submitted 5 October, 2021; v1 submitted 29 May, 2020;
originally announced May 2020.
-
ARMS: Automated rules management system for fraud detection
Authors:
David Aparício,
Ricardo Barata,
João Bravo,
João Tiago Ascensão,
Pedro Bizarro
Abstract:
Fraud detection is essential in financial services, with the potential to greatly reduce criminal activities and save considerable resources for businesses and customers. We address online fraud detection, which consists of classifying incoming transactions as either legitimate or fraudulent in real-time. Modern fraud detection systems consist of a machine learning model and rules defined by human experts. Often, the rules' performance degrades over time due to concept drift, especially of an adversarial nature. Furthermore, they can be costly to maintain, either because they are computationally expensive or because they send transactions for manual review. We propose ARMS, an automated rules management system that evaluates the contribution of individual rules and optimizes the set of active rules using heuristic search and a user-defined loss-function. It complies with critical domain-specific requirements, such as handling different actions (e.g., accept, alert, and decline), priorities, blacklists, and large datasets (i.e., hundreds of rules and millions of transactions). We use ARMS to optimize the rule-based systems of two real-world clients. Results show that it can maintain the original systems' performance (e.g., recall, or false-positive rate) using only a fraction of the original rules (~ 50% in one case, and ~ 20% in the other).
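The rule-optimization idea can be sketched with a greedy search: try dropping each rule and keep the removal whenever a user-defined loss does not worsen. The rules, data, loss weights, and greedy strategy below are toy assumptions; ARMS's actual heuristic search is richer.

```python
# Sketch of greedy rule pruning under a user-defined loss.
# Each "rule" flags a set of transaction ids; some ids are actual fraud.
RULES = {
    "r1": {1, 2, 3},   # catches real fraud
    "r2": {2, 3},      # redundant with r1
    "r3": {10, 11},    # only false positives
}
FRAUD = {1, 2, 3}

def loss(active):
    """User-defined loss: a missed fraud weighs 10x a false alert."""
    flagged = set().union(*(RULES[r] for r in active)) if active else set()
    missed = len(FRAUD - flagged)
    false_alerts = len(flagged - FRAUD)
    return 10 * missed + false_alerts

active = set(RULES)
for rule in sorted(RULES):            # try removing each rule in turn
    if loss(active - {rule}) <= loss(active):
        active -= {rule}              # removal doesn't hurt: drop it

print(sorted(active))  # ['r1']  (r2 is redundant, r3 only adds false alerts)
```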
Submitted 14 February, 2020;
originally announced February 2020.
-
A Survey on Subgraph Counting: Concepts, Algorithms and Applications to Network Motifs and Graphlets
Authors:
Pedro Ribeiro,
Pedro Paredes,
Miguel E. P. Silva,
David Aparicio,
Fernando Silva
Abstract:
Computing subgraph frequencies is a fundamental task that lies at the core of several network analysis methodologies, such as network motifs and graphlet-based metrics, which have been widely used to categorize and compare networks from multiple domains. Counting subgraphs is, however, computationally very expensive, and there has been a large body of work on efficient algorithms and strategies to make subgraph counting feasible for larger subgraphs and networks.
This survey aims precisely to provide a comprehensive overview of the existing methods for subgraph counting. Our main contribution is a general and structured review of existing algorithms, classifying them according to a set of key characteristics and highlighting their main similarities and differences. We identify and describe the main conceptual approaches, giving insight into their advantages and limitations, and provide pointers to existing implementations. We initially focus on exact sequential algorithms, but we also provide a thorough survey of approximate methodologies (with a trade-off between accuracy and execution time) and parallel strategies (that need to deal with an unbalanced search space).
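For intuition, here is the brute-force version of the task the surveyed algorithms accelerate: enumerating all connected 3-node subgraphs (triangles vs. paths) by checking every node triple. The toy graph is an assumption; the cubic blow-up of this enumeration is exactly why specialized methods are needed on large networks.

```python
# Naive enumeration of connected 3-node subgraphs.
from itertools import combinations

EDGES = {("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")}

def connected(u, v):
    return (u, v) in EDGES or (v, u) in EDGES

nodes = {n for e in EDGES for n in e}
counts = {"triangle": 0, "path": 0}
for trio in combinations(sorted(nodes), 3):
    k = sum(connected(u, v) for u, v in combinations(trio, 2))
    if k == 3:
        counts["triangle"] += 1
    elif k == 2:
        counts["path"] += 1  # two edges among three nodes: a connected path

print(counts)  # {'triangle': 1, 'path': 2}
```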
Submitted 28 October, 2019;
originally announced October 2019.
-
GoT-WAVE: Temporal network alignment using graphlet-orbit transitions
Authors:
David Aparício,
Pedro Ribeiro,
Tijana Milenković,
Fernando Silva
Abstract:
Global pairwise network alignment (GPNA) aims to find a one-to-one node mapping between two networks that identifies conserved network regions. GPNA algorithms optimize node conservation (NC) and edge conservation (EC). NC quantifies topological similarity between nodes. Graphlet-based degree vectors (GDVs) are a state-of-the-art topological NC measure. Dynamic GDVs (DGDVs) were used as a dynamic NC measure within the first-ever algorithms for GPNA of temporal networks: DynaMAGNA++ and DynaWAVE. The latter is superior for larger networks. We recently developed a different graphlet-based measure of temporal node similarity, graphlet-orbit transitions (GoTs). Here, we use GoTs instead of DGDVs as a new dynamic NC measure within DynaWAVE, resulting in a new approach, GoT-WAVE.
On synthetic networks, GoT-WAVE improves DynaWAVE's accuracy by 25% and speed by 64%. On real networks, when optimizing only dynamic NC, each method is superior ~50% of the time. While DynaWAVE benefits more from also optimizing dynamic EC, only GoT-WAVE can support directed edges. Hence, GoT-WAVE is a promising new temporal GPNA algorithm, which efficiently optimizes dynamic NC. Future work on better incorporating dynamic EC may yield further improvements.
Submitted 24 August, 2018;
originally announced August 2018.
-
Formal verification of the YubiKey and YubiHSM APIs in Maude-NPA
Authors:
Antonio González-Burgueño,
Damián Aparicio,
Santiago Escobar,
Catherine Meadows,
José Meseguer
Abstract:
In this paper, we perform an automated analysis of two devices developed by Yubico: YubiKey, designed to authenticate a user to network-based services, and YubiHSM, Yubico's hardware security module. Both are analyzed using the Maude-NPA cryptographic protocol analyzer. Although previous work has been done applying automated tools to these devices, to the best of our knowledge there has been no completely automated analysis to date. This is not surprising, because both YubiKey and YubiHSM, which make use of cryptographic APIs, involve a number of complex features: (i) discrete time in the form of Lamport clocks, (ii) a mutable memory for storing previously seen keys or nonces, (iii) event-based properties that require an analysis of sequences of actions, and (iv) reasoning modulo exclusive-or. In this work, we have been able to both prove properties of YubiKey and find the known attacks on the YubiHSM, in a completely automated way beyond the capabilities of previous work in the literature.
Submitted 19 June, 2018;
originally announced June 2018.
-
Temporal Network Comparison using Graphlet-orbit Transitions
Authors:
David Aparício,
Pedro Ribeiro,
Fernando Silva
Abstract:
Networks are widely used to model real-world systems and uncover their topological features. Network properties such as the degree distribution and shortest path length have been computed in numerous real-world networks, and most of them have been shown to be both scale-free and small-world networks. Graphlets and network motifs are subgraph patterns that capture richer structural information than the aforementioned global network properties, and these local features are often used for network comparison. However, past work on graphlets and network motifs is almost exclusively applicable only to static networks. Many systems are better represented as temporal networks, which depict not only how a system was at a given stage but also how it evolved. Time-dependent information is crucial in temporal networks and, by disregarding that data, static methods cannot achieve the best possible results. This paper introduces an extension of graphlets for temporal networks. Our proposed method enumerates all 4-node graphlet-orbits in each network snapshot, building the corresponding orbit-transition matrix in the process. Our hypothesis is that networks representing similar systems have characteristic orbit transitions which identify them better than simple static patterns, and this is assessed on a set of real temporal networks split into categories. In order to perform temporal network comparison, we put forward an orbit-transition-agreement metric (OTA). OTA correctly groups a set of temporal networks that both static network motifs and graphlets fail to group adequately. Furthermore, our method produces interpretable results which we use to uncover characteristic orbit transitions, and these can be regarded as a network fingerprint.
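The transition-matrix construction can be illustrated in simplified form: instead of full 4-node graphlet orbits, track each node's degree class per snapshot and count how nodes move between classes across consecutive snapshots. The degree-class simplification and the toy snapshots are assumptions; the real method uses graphlet orbits, but the counting is analogous.

```python
# Simplified sketch of building an orbit-transition count across snapshots.
from collections import Counter

SNAPSHOTS = [
    {("a", "b")},                 # t0: a and b have degree 1, c degree 0
    {("a", "b"), ("a", "c")},     # t1: a reaches degree 2
]
NODES = ["a", "b", "c"]

def degree_class(node, edges):
    """Stand-in for a node's graphlet orbit: here, just its degree."""
    return sum(node in e for e in edges)

transitions = Counter()
for prev, curr in zip(SNAPSHOTS, SNAPSHOTS[1:]):
    for n in NODES:
        transitions[(degree_class(n, prev), degree_class(n, curr))] += 1

print(dict(transitions))  # {(1, 2): 1, (1, 1): 1, (0, 1): 1}
```

Networks from similar systems would then be compared by how similar their transition counts are, which is the role OTA plays in the paper.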
Submitted 14 July, 2017;
originally announced July 2017.
-
Network comparison using directed graphlets
Authors:
David Aparício,
Pedro Ribeiro,
Fernando Silva
Abstract:
With recent advances in high-throughput cell biology the amount of cellular biological data has grown drastically. Such data is often modeled as graphs (also called networks) and studying them can lead to new insights into molecule-level organization. A possible way to understand their structure is by analysing the smaller components that constitute them, namely network motifs and graphlets. Graphlets are particularly well suited to compare networks and to assess their level of similarity but are almost always used as small undirected graphs of up to five nodes, thus limiting their applicability in directed networks. However, a large set of interesting biological networks such as metabolic, cell signaling or transcriptional regulatory networks are intrinsically directional, and using metrics that ignore edge direction may gravely hinder information extraction. The applicability of graphlets is extended to directed networks by considering the edge direction of the graphlets. We tested our approach on a set of directed biological networks and verified that they were correctly grouped by type using directed graphlets. However, enumerating all graphlets in a large network is a computationally demanding task. Our implementation addresses this concern by using a state-of-the-art data structure, the g-trie, which is able to greatly reduce the necessary computation. We compared our tool, gtrieScanner, to other state-of-the-art methods and verified that it is the fastest general tool for graphlet counting.
Submitted 5 November, 2015;
originally announced November 2015.