Social Network Analysis for Fraud Detection

Basic concepts

  • guilt-by-association
    • the rationale for studying social networks
    • assumes that a person is guilty of a crime because he/she is associated with someone or something involved in a crime
  • applications of social network analytics
    • because many frauds are committed by organized groups of fraudsters acting together
    • traditional fraud detection models are limited in their ability to detect relations or extract insight from a social network
    • social network analytics gives insight into suspicious cases (e.g., insurance fraud, company bankruptcies, and opinion fraud)
  • network components

    • complex network analysis (CNA) - the study of complex networks, including their structure, characteristics, and dynamics
    • graph theory - the mathematical study of the analysis and representation of networks

      • types of graphs
        • directed graph
        • undirected graph
      • edge representation
        • self-edge - e.g., a person transfers money from his/her account to another account he/she owns
        • multi-edge - e.g., a person pays the same merchant twice
        • hyper-edge - e.g., three people attend the same event
      • weighted graph - represents the intensity of relations in a network

        • binary weight
          • 0 or 1: indicates whether or not a relation exists between two nodes
          • -1, 0 or 1: e.g. 1 represents close relation, 0 represents no relation, -1 represents animosity
        • numeric weight
          • higher numeric value represents closer affinity
        • normalized weight
          • variant of numeric weight where all outgoing edges of a node sum up to 1
        • Jaccard weight

          • edge weight depends on how "social" both nodes are
          • formula

          \[w(v_1,v_2)=\frac{|\Gamma(v_1)\cap\Gamma(v_2)|}{|\Gamma(v_1)\cup\Gamma(v_2)|}\]

          • example
            • person A attended 10 events, person B attended 5 events
            • both went to 3 common events
            • \(Jaccard\ weight = \frac{3}{10+5-3}=\frac{1}{4}\)
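
A minimal sketch of the Jaccard weight calculation, assuming each node's events are stored as a Python set; the node names and events below are made up for illustration.

```python
# Jaccard weight between two nodes, based on the events (Gamma) each node
# participates in. All names and events are illustrative.

def jaccard_weight(events_a: set, events_b: set) -> float:
    """w(v1, v2) = |Gamma(v1) intersect Gamma(v2)| / |Gamma(v1) union Gamma(v2)|."""
    union = events_a | events_b
    if not union:
        return 0.0                                # neither node attended any event
    return len(events_a & events_b) / len(union)

# Example from the notes: A attended 10 events, B attended 5, 3 in common.
events_a = {f"e{i}" for i in range(10)}           # e0 .. e9
events_b = {"e0", "e1", "e2", "e10", "e11"}       # 3 events shared with A
print(jaccard_weight(events_a, events_b))         # 3 / (10 + 5 - 3) = 0.25
```
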
      • for fraud

        • nodes - legitimate or fraudulent
        • edges - relationship
        • egonet - one-hop neighbourhoods
          • a one-hop neighbourhood is typically sufficient for fraud analysis
  • network representations

    • graphically (a)
    • mathematically
      • adjacency or connection matrix - a matrix of size \(n \times n\) (n is the number of nodes) (b)
        • a sparse matrix - containing many zero values
        • as in real-life social networks, where most people are connected to only a small number of friends
      • adjacency list - abstract representation of adjacency matrix (d)
      • weight matrix - a matrix with edge weights (c)
      • weight list - abstract representation of weight matrix (e)
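
A small sketch of these four mathematical representations for one toy undirected, weighted network; the node labels and weights are made up, and numpy is assumed to be available.

```python
import numpy as np

# Toy undirected, weighted network; edge weights are illustrative.
edges = [("A", "B", 1.0), ("A", "C", 0.5), ("B", "C", 2.0), ("C", "D", 1.5)]
nodes = ["A", "B", "C", "D"]
idx = {v: i for i, v in enumerate(nodes)}
n = len(nodes)

adjacency = np.zeros((n, n), dtype=int)    # (b) adjacency / connection matrix
weights = np.zeros((n, n))                 # (c) weight matrix
adj_list = {v: [] for v in nodes}          # (d) adjacency list
weight_list = {v: [] for v in nodes}       # (e) weight list

for u, v, w in edges:
    i, j = idx[u], idx[v]
    adjacency[i, j] = adjacency[j, i] = 1  # undirected: symmetric matrix
    weights[i, j] = weights[j, i] = w
    adj_list[u].append(v); adj_list[v].append(u)
    weight_list[u].append((v, w)); weight_list[v].append((u, w))

print(adjacency)   # sparse in real-life networks: mostly zeros
print(adj_list)    # {'A': ['B', 'C'], 'B': ['A', 'C'], ...}
```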

Homophilic network

  • definition
    • fraudsters connect to fraudsters
    • legitimate people connect to legitimate people
  • web of frauds
    • fraudsters' nodes are grouped together
  • features
    • no network is perfectly homophilic - if a network exhibits evidence of homophily, it is worthwhile to investigate more thoroughly
    • a network that is not homophilic is heterophilic
  • formula
    • \(l\) = fraction of legitimate nodes in the network
    • \(f\) = fraction of fraudulent nodes in the network
    • \(2lf\) = expected fraction of cross-label edges (edges connecting a fraudulent and a legitimate node) under random mixing
    • \(\hat{r}\) = observed fraction of cross-label edges
    • if the null hypothesis is rejected (the observed fraction of cross-label edges is significantly less than the expected fraction), the network is homophilic

\[H_0 : \hat{r} \geq 2lf\]

  • example
  • dyadicity & heterophilicity

    • \(Dyadicity(D)=\frac{m_{11}}{\overline{m}_{11}}\), where \(m_{11}\) is the observed number of edges between two fraudulent nodes and \(\overline{m}_{11}\) the number expected if edges were placed randomly
    • \(Heterophilicity(H)=\frac{m_{10}}{\overline{m}_{10}}\), where \(m_{10}\) is the observed number of cross-label edges and \(\overline{m}_{10}\) the randomly expected number
    • if D > 1, the network is dyadic
      • fraudulent cases tend to connect more densely among themselves than expected randomly
    • if H < 1, the network is heterophobic (opposite of heterophilic)
      • fraud nodes have fewer connections to legitimate nodes than expected randomly
    • if a network is dyadic and heterophobic (D > 1 and H < 1), it exhibits homophily
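
A sketch of the dyadicity and heterophilicity calculation for a small labelled network, assuming the usual expected counts under random mixing (\(\overline{m}_{11}=n_1(n_1-1)p/2\) and \(\overline{m}_{10}=n_1(N-n_1)p\), with \(n_1\) the number of fraud nodes and \(p\) the network density); the toy edges and labels are made up.

```python
# Dyadicity (D) and heterophilicity (H) of a labelled network.
# labels: 1 = fraud, 0 = legitimate; edges are undirected pairs.
# Expected counts assume edges are placed randomly with connectance p.

def dyadicity_heterophilicity(edges, labels):
    N, M = len(labels), len(edges)
    p = 2 * M / (N * (N - 1))                       # connectance (density)
    n1 = sum(labels.values())                       # number of fraud nodes

    m11 = sum(1 for u, v in edges if labels[u] == 1 and labels[v] == 1)
    m10 = sum(1 for u, v in edges if labels[u] != labels[v])

    m11_exp = n1 * (n1 - 1) * p / 2                 # expected fraud-fraud edges
    m10_exp = n1 * (N - n1) * p                     # expected cross-label edges
    return m11 / m11_exp, m10 / m10_exp

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
labels = {"a": 1, "b": 1, "c": 1, "d": 0, "e": 0}
D, H = dyadicity_heterophilicity(edges, labels)
print(f"D = {D:.2f}, H = {H:.2f}")   # D > 1 and H < 1 suggests homophily
```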

Metrics to measure impact of neighborhoods

  • neighbour metrics
    • characterize the target of interest based on its direct associates
    • fraud detection models typically use only the first-order (direct) neighbourhood
    • include: degree, triangles, density, relational neighbour & probabilistic relational neighbour
  • centrality metrics
    • quantify the importance of an individual in a social network
    • include: geodesic paths, betweenness, closeness, and graph theoretic center
  • collective inference algorithms
    • compute the probability that a node is exposed to fraud, i.e. probability that fraud influences a certain node
    • include: PageRank, Gibbs sampling, iterative classification, relaxation labelling, and loopy belief propagation

Neighborhood Metrics

  • degree
    • definition
    • degree distribution
      • the probability distribution of the degree in the network
      • in real-life networks, follows power law (i.e. many nodes are connected with few other nodes, while only few nodes are connected to many other nodes)
    • k-regular graph
      • network with a constant degree distribution
      • each node in the network has the same degree
      • e.g. 4-regular graph
  • triangle
    • definition
    • example
  • density

    • definition

      • measures the extent to which nodes in a network are connected to each other
      • i.e., the ratio of the number of observed edges (M) to the maximum possible number of edges in the graph
      • fully connected network: each node is connected to every other node, giving \(\frac{N(N-1)}{2}\) edges, where N = total number of nodes

      \[d=\frac{2M}{N(N-1)}\]

    • example
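A short sketch computing node degrees and the network density for a made-up undirected toy graph.

```python
from collections import Counter

# Degree and density of a small undirected toy network (illustrative).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
nodes = {"A", "B", "C", "D"}

degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1
print(dict(degree))               # {'A': 2, 'B': 2, 'C': 3, 'D': 1}

N, M = len(nodes), len(edges)
density = 2 * M / (N * (N - 1))   # d = 2M / (N(N-1))
print(density)                    # 8 / 12 = 0.67
```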

  • relational neighbourhood

    • relational neighbor classifier
      • makes use of the homophily assumption - connected nodes have a propensity to belong to the same class, i.e. guilt-by-association
      • if two nodes are associated, they tend to exhibit similar behaviours
    • posterior class probability for node n to belong to class c
      • example

  • probabilistic relational neighbour classifier

    • extension of relational neighbour classifier
    • example
    • \(P(c|n_j)\) can be result of a local model, or of a previously applied network model
    • the relational neighbour and probabilistic relational neighbour classifiers can both be used as classifiers on their own, or as additional variables in fraud prediction models
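
A minimal sketch contrasting the relational neighbour and probabilistic relational neighbour classifiers for one target node; the edge weights, neighbour labels, and probabilities are made up.

```python
# Relational neighbour vs. probabilistic relational neighbour classifier
# for a single target node. All values below are illustrative.

# neighbour -> (edge weight, hard label, P(fraud | neighbour))
neighbours = {
    "n1": (1.0, "fraud", 1.0),
    "n2": (0.5, "legit", 0.2),
    "n3": (2.0, "fraud", 0.7),
}
w_total = sum(w for w, _, _ in neighbours.values())

# Relational neighbour classifier: P(fraud | n) is the normalized sum of
# the edge weights of the neighbours labelled as fraud.
w_fraud = sum(w for w, label, _ in neighbours.values() if label == "fraud")
print("relational neighbour   P(fraud|n) =", w_fraud / w_total)   # 3.0 / 3.5

# Probabilistic relational neighbour: weight each neighbour's own fraud
# probability (e.g. from a local or previously applied network model).
p_fraud = sum(w * p for w, _, p in neighbours.values()) / w_total
print("probabilistic variant  P(fraud|n) =", p_fraud)             # 2.5 / 3.5
```
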
  • relational logistic regression classifier

    • definition
      • starts from a dataset of local, node-specific characteristics called intrinsic variables (e.g., company age, sector)
    • network characteristics
      • mode-link: most frequently occurring class among the neighbours
      • count-link: frequencies of the classes among the neighbours
      • binary-link: binary indicators indicating class presence
      • example
    • trick: use logistic regression with the dataset that contains both intrinsic and network variables
    • caution: there might be correlations between the network variables added
      • filtered out during an input selection procedure (e.g. using stepwise logistic regression), or
      • featurization - can measure the behaviour of the neighbors in terms of the target variables or the intrinsic variables
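
A sketch of the relational logistic regression idea: derive mode-link, count-link, and binary-link variables from each node's neighbours and combine them with an intrinsic variable in a logistic regression. scikit-learn's LogisticRegression is assumed to be available, and all data are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# For each node, combine an intrinsic variable (company age) with network
# variables derived from the labels of its neighbours. Data are illustrative.

neighbour_labels = [      # neighbour labels per node (1 = fraud, 0 = legitimate)
    [1, 1, 0],            # node 0
    [0, 0, 0, 0],         # node 1
    [1, 0],               # node 2
    [1, 1, 1],            # node 3
]
age = [2, 15, 7, 1]       # intrinsic variable
y = [1, 0, 0, 1]          # known target labels

rows = []
for a, nbrs in zip(age, neighbour_labels):
    count_link = sum(nbrs)                            # number of fraud neighbours
    mode_link = int(count_link * 2 >= len(nbrs))      # majority neighbour class (ties -> fraud)
    binary_link = int(count_link > 0)                 # at least one fraud neighbour
    rows.append([a, mode_link, count_link, binary_link])

X = np.array(rows)
model = LogisticRegression().fit(X, y)
print(model.predict_proba(X)[:, 1])                   # fitted fraud probabilities
```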

Centrality metrics

  • Dijkstra's algorithm
    • example
    • Dijkstra's algorithm is only applicable to networks with non-negative edge weights
  • geodesic path (or shortest path)
    • computes the minimum distance needed to reach a node from a target node
    • Q: how far is any fraudulent node in the network removed from the target node?
      • if a fraudulent node is in the close neighborhood, fraud might impact the node more intensively and contaminate the target of interest
    • graph theoretic center
      • the node with the smallest, maximum distance to all other nodes, i.e. most central node in the network
    • average path length
      • length one needs to traverse on average to reach one node from another
    • in fraud, if more paths exist between two nodes, there is a higher chance that fraudulent influence will eventually reach the target node
    • adjacency matrix - the k-th power \(A^k\) gives the number of connecting paths of length k between any two nodes
      • example
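
A sketch of Dijkstra's algorithm with a binary heap, assuming non-negative edge weights as noted above; the toy graph is made up.

```python
import heapq

# Dijkstra's algorithm (non-negative edge weights only). Returns the
# geodesic (shortest-path) distance from `source` to every reachable node.

def dijkstra(graph, source):
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                       # stale heap entry, skip
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

graph = {                                  # node -> [(neighbour, weight), ...]
    "A": [("B", 1), ("C", 4)],
    "B": [("A", 1), ("C", 2), ("D", 5)],
    "C": [("A", 4), ("B", 2), ("D", 1)],
    "D": [("B", 5), ("C", 1)],
}
print(dijkstra(graph, "A"))   # {'A': 0.0, 'B': 1.0, 'C': 3.0, 'D': 4.0}
```
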
  • closeness

    • definition

      • mean geodesic distance or farness measure
        • \(d(v_i, v_j)\) = distance between node i and node j along the geodesic (shortest) path
        • \(g(v_i)\) = mean geodesic distance or farness from node i to the other nodes
        • \(n\) = number of nodes in the network

      \[g(v_i)=\frac{\displaystyle \sum_{j=1,\, j\neq i}^{n}d(v_i, v_j)}{n-1}\]

    • closeness centrality

      • measures the mean distance from a node to every other node in the network
      • inverse of farness measure

      \[Closeness \ Centrality(v_i)={\left(\frac{\displaystyle \sum_{j=1,\, j\neq i}^{n}d(v_i, v_j)}{n-1}\right)}^{-1}\]

    • a few points to note

      • the higher the value, the more central the node is in the network
      • the closeness centralities of all nodes in the network may lie close together, so it is important to inspect the decimals
      • closeness centrality often excludes the distances to nodes that cannot be reached
    • example
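
A sketch of closeness centrality as the inverse of the farness measure, using breadth-first search on a small unweighted toy graph; distances to unreachable nodes are excluded, as noted above. The graph is made up.

```python
from collections import deque

# Closeness centrality = inverse of the mean geodesic distance.
graph = {
    "A": ["B"], "B": ["A", "C", "D"],
    "C": ["B", "D"], "D": ["B", "C", "E"], "E": ["D"],
}

def geodesic_distances(source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:             # first visit = shortest path (unweighted)
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness_centrality(node):
    dist = geodesic_distances(node)
    others = [d for v, d in dist.items() if v != node]   # reachable nodes only
    farness = sum(others) / len(others)                  # mean geodesic distance
    return 1.0 / farness

for v in graph:
    print(v, round(closeness_centrality(v), 3))
# e.g. B: distances 1, 1, 1, 2 -> farness 1.25 -> closeness 0.8
```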

  • betweenness

    • definition
      • measures the extent to which a node lies on the geodesic paths connecting any two nodes in the network
      • the extent to which information passes through this node
      • a node with a high betweenness possibly connects communities with each other
    • betweenness centrality

      • \(g_{jk}\) = number of shortest paths between node j and node k
      • \(g_{jk}(v_i)\) = number of shortest paths between node j and node k that pass through node \(v_i\)

      \[Betweenness \ Centrality(v_i)=\displaystyle \sum_{j < k}\frac{g_{jk}(v_i)}{g_{jk}}\]

    • example
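
A short sketch of betweenness centrality using networkx (an assumption about the available tooling), since counting shortest paths by hand is tedious; the toy graph is made up.

```python
import networkx as nx

# Betweenness centrality with networkx (assumed to be available).
# normalized=False returns the raw sum over node pairs from the formula above.

G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])

print(nx.betweenness_centrality(G, normalized=False))
# 'C' scores highest: the shortest paths A-D, A-E, B-D and B-E all pass
# through it, so it bridges the A-B-C triangle with D and E.
```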

Collective inference algorithms

  • definition
    • infers a set of class labels/probabilities for the unknown nodes
    • rationale: inferences about nodes can mutually affect one another
  • PageRank algorithm
    • Google's original web-search ranking is based on this algorithm
    • explanation
    • ranking of a web page depends on
      • ranking of web pages pointing towards that web page
      • out-degree of the linking web pages
      • random surfer: assumes surfers get bored and jump randomly to another web page
    • calculating the PageRank of a page requires the rankings of the neighboring web pages
      • start with a random PageRank value for every web page
      • iteratively update the PageRank scores until a predefined number of iterations is reached or a stopping criterion is met (e.g. when the change in the ranking is marginal)
    • application of PageRank algorithm to fraud analytics
      • PageRank algorithm can be seen as a propagation of page influence through the network
      • propagate fraud through the network
    • personalize PageRank by fraud
      • i.e. propagate fraud influence through the people-to-people network (adjacency matrix A), restarting at the known fraudulent nodes
    • example
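
A sketch of PageRank personalized by fraud, using power iteration over a column-normalized adjacency matrix with a restart vector that points only to known fraudulent nodes; the damping factor and toy matrix are illustrative, not from the notes' example.

```python
import numpy as np

# Personalized PageRank: the restart vector e puts all its mass on known
# fraudulent nodes, so the resulting scores measure exposure to fraud.

def personalized_pagerank(A, fraud_nodes, d=0.85, iters=100):
    n = A.shape[0]
    out = A.sum(axis=0)
    P = A / np.where(out == 0, 1, out)          # column-normalize: split influence over neighbours
    e = np.zeros(n)
    e[fraud_nodes] = 1.0 / len(fraud_nodes)     # restart on fraud nodes only
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = d * P @ r + (1 - d) * e             # propagate, then random restart
    return r / r.sum()

# 5-node undirected toy network (symmetric adjacency matrix).
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
scores = personalized_pagerank(A, fraud_nodes=[0])   # node 0 is a known fraud
print(np.round(scores, 3))   # nodes closer to node 0 score higher
```
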
  • Gibbs sampling
    • uses a local classifier to infer a posterior class probability in order to initialize the node labels in the network
    • the original semi-labelled graph is transformed into a fully labelled graph by sampling the posterior probabilities of the local classifier
    • predictive features of a local classifier consist of only intrinsic variables
    • labels of unknown nodes in the graph express an expectation of the true class value
    • iterative procedure
      • updates the expected class labels of the unknown nodes
      • the first \(iter_b\) steps approach a stationary distribution (burn-in period), where no statistics are recorded
      • the last \(iter_c\) steps, keeps track of which class labels are assigned to each node
      • final class probability estimate is computed as normalized count of the number of times each class is assigned to a particular node
  • iterative classification algorithm
    • similar to Gibbs sampling, iterative classification initializes the semi-labelled graph by using a local classifier
    • bootstrap phase
      • based on the local model’s output, the most probable class label is assigned to each unknown node
    • iteration phase
      • a relational learner updates the class labels of each unknown node based on the outcome of a relational logistic regression model
      • input features: computed as link statistics of the current label assignments
      • link statistics: mode (most frequently occurring label among the neighboring nodes), count (number of neighboring fraud nodes), binary (whether at least one neighboring node is fraudulent)
      • nodes not yet classified are ignored
      • a new class label is assigned to each unknown node based on the largest posterior probability; this step is repeated until a stopping criterion is met
      • the final class label corresponds to the class label estimate generated during the last iteration
  • relaxation labelling
    • starts from a local classifier to initialize a node’s class label
    • previous approaches assigned a hard label (i.e. either legitimate or fraud).
    • soft labelling: relaxation labelling starts with assigning each node a probability that indicates the likelihood of a node to belong to a certain class
    • next, the probability class labels are used to iteratively update the class probability using a relational model
    • the class estimates of the last iteration are the final class label estimates
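
A minimal sketch of relaxation labelling, assuming a simple probabilistic relational neighbour model (unweighted average of the neighbours' current fraud probabilities) as the relational learner; the initial probabilities stand in for a local classifier's output and are made up.

```python
# Relaxation labelling with soft labels. Initial fraud probabilities stand in
# for a local classifier's output; the relational model is an unweighted
# average of the neighbours' current fraud probabilities. Data are made up.

graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
p_fraud = {"a": 1.0, "b": 0.6, "c": 0.5, "d": 0.1}   # initial soft labels
known_fraud = {"a"}                                   # confirmed fraudster, kept fixed

for _ in range(10):                                   # iterate until (near) convergence
    updated = {}
    for node, nbrs in graph.items():
        if node in known_fraud:
            updated[node] = p_fraud[node]             # keep known labels fixed
        else:
            updated[node] = sum(p_fraud[v] for v in nbrs) / len(nbrs)
    p_fraud = updated                                 # synchronous update

# Fraud influence has propagated from the known fraudster towards its neighbours.
print({k: round(v, 2) for k, v in p_fraud.items()})
```
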
  • loopy belief propagation
    • a collective inference procedure based on iterative message passing
    • idea
      • belief of each node to be in state x (let’s say fraud) depends on the messages it receives from its neighbors
      • belief of a node to be in state x is the normalized product of the received messages
      • the message as well as the belief is continuously updated during the algorithm
  • featurization
    • mapping of an unstructured data source like a network into useful and meaningful characteristics of each node
    • most network-based features already give a good indication of which nodes might be fraudulent in the future; these network-based characteristics can be combined with local (intrinsic) features in a classification model
  • bipartite graph
    • unipartite network
      • networks with only one node type
      • link weight: shared events or behaviors
      • the higher the link weight, the more intense the relationship is
      • two nodes are more influenced by each other if their relationship is more intense
    • affiliation or bipartite network
      • represents the reason why people connect to each other, and includes the events that network nodes (e.g. people or companies) attend or share
      • creates new insights in the network structure
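
A sketch that projects an affiliation (bipartite) people-to-events network onto a unipartite people-to-people network, with the link weight counting the shared events; the people and events are made up.

```python
from collections import defaultdict
from itertools import combinations

# Project a bipartite (affiliation) network of people and events onto a
# unipartite people-to-people network. The link weight counts shared events,
# so a higher weight means a more intense relationship.

attends = {                       # event -> people attending it
    "claim_1": ["ann", "bob"],
    "claim_2": ["ann", "bob", "cee"],
    "garage_x": ["bob", "cee"],
}

link_weight = defaultdict(int)
for people in attends.values():
    for u, v in combinations(sorted(people), 2):
        link_weight[(u, v)] += 1  # one shared event adds 1 to the weight

print(dict(link_weight))
# {('ann', 'bob'): 2, ('ann', 'cee'): 1, ('bob', 'cee'): 2}
```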

Post-Processing of Fraud Analytics

Backtesting analytical fraud models

  • backtesting
    • aims at contrasting ex-ante made predictions with ex-post realized outcomes
    • the key idea is to verify whether the fraud model still performs satisfactorily
    • important because fraudsters might outsmart fraud analytical models by continuously changing their strategies
  • backtesting activities
    • data stability
    • model stability
    • model calibration
  • traffic light indicator approach
    • green light = low probability of fraud
    • red light = high probability of fraud
  • backtesting data stability

    • check whether the population on which the model is currently being used is similar to the population that was used to develop the model
    • if differences occur in step 1, verify the stability of the individual variables
    • for step 1, a system stability index (SSI) can be calculated as

      \[SSI=\sum_{i=1}^{k}(observed_i - expected_i)\cdot\ln\left(\frac{observed_i}{expected_i}\right)\]

      • SSI measures difference between expected and observed distribution
      • higher SSI implies a population shift and instability
      • rule of thumb
        • SSI < 0.10: no significant shift (green)
        • 0.10 ≤ SSI < 0.25: moderate shift (yellow)
        • SSI ≥ 0.25: significant shift (red)
      • the SSI is also referred to as the deviation index and is identical to the information value
      • SSI can be monitored through time or by individual variables
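
A sketch of the SSI calculation with the traffic-light rule of thumb; the expected and observed distributions over the bins are made up.

```python
import numpy as np

# System stability index (SSI) between the expected (training) and observed
# (current) population distributions over k bins/pools. Data are illustrative.

def ssi(observed, expected):
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.sum((observed - expected) * np.log(observed / expected)))

expected = [0.30, 0.40, 0.20, 0.10]    # distribution at model development
observed = [0.22, 0.38, 0.26, 0.14]    # distribution on the current population

value = ssi(observed, expected)
if value < 0.10:
    light = "green (no significant shift)"
elif value < 0.25:
    light = "yellow (moderate shift)"
else:
    light = "red (significant shift)"
print(round(value, 4), light)
```
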
  • backtesting model stability

    • volatility of the model performance
    • track the model performance using a performance metric (PM), the number of observations, the number of frauds, and the corresponding traffic light
    • track these over subsequent periods of time
    • traffic light coding procedure
      • if change of PM < 5%, assign green light
      • if change of PM ≥ 5% and < 10%, assign yellow light
      • if change of PM ≥ 10%, assign red light
    • bootstrapping procedure
      • pool the training and out-of-time test observations with the predicted PM into one larger sample
      • draw a training and a test set bootstrap sample with the same size as the original training and out-of-time test set (NOTE: a bootstrap sample is one with replacement)
      • calculate the difference for PM between the bootstrap training and the bootstrap test sample
      • repeat 1,000 or more times to get the distribution and statistically test whether the difference is zero
    • traffic light procedure
      • green: no statistical difference at 95%
      • yellow: statistical difference at 95% but not at 99%
      • red: statistical difference at 99%
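
A sketch of the bootstrapping procedure for model stability, using AUC as an assumed performance metric (via scikit-learn's roc_auc_score) and randomly generated label/score pairs in place of real training and out-of-time test data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score   # AUC used here as an example PM

# Pool the training and out-of-time test observations (label + model score),
# repeatedly draw bootstrap train/test samples of the original sizes with
# replacement, and build the null distribution of the PM difference.

rng = np.random.default_rng(0)

def pm(sample):                              # performance metric: AUC
    y, score = sample[:, 0], sample[:, 1]
    return roc_auc_score(y, score)

# (label, score) pairs standing in for the original training and test sets
train = np.column_stack([rng.integers(0, 2, 500), rng.random(500)])
test = np.column_stack([rng.integers(0, 2, 300), rng.random(300)])

observed_diff = pm(train) - pm(test)
pooled = np.vstack([train, test])

diffs = []
for _ in range(1000):                        # 1,000 or more bootstrap draws
    b_train = pooled[rng.integers(0, len(pooled), len(train))]
    b_test = pooled[rng.integers(0, len(pooled), len(test))]
    diffs.append(pm(b_train) - pm(b_test))

p_value = np.mean(np.abs(diffs) >= abs(observed_diff))
print(f"PM difference = {observed_diff:.4f}, bootstrap p-value = {p_value:.3f}")
```
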
  • backtesting model calibration

    • in classification or regression model
      • aim of the model is to come up with well-calibrated estimates of fraud probabilities or fraud amounts
      • the calibration needs to be monitored in time
    • procedure
      • a classification model assigns fraud probabilities to various pools of customers
      • each pool has a corresponding calibrated probability, as it was calculated during model development
      • to see how these probabilities evolve in time and whether they remain stable
    • binomial test
      • used as heuristic for evaluating the quality of the calibration
      • traffic light procedure
        • green: no statistical difference at 90%
        • yellow: statistical difference at 90% but not at 95%
        • orange: statistical difference at 95% but not at 99%
        • red: statistical difference at 99%
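
A sketch of the binomial test for one pool, using the common normal approximation, together with the four-colour traffic-light procedure; the pool size, fraud count, and calibrated probability are made up.

```python
from math import sqrt
from statistics import NormalDist

# Binomial test (normal approximation) for one pool: is the observed fraud
# count significantly higher than the calibrated fraud probability implies?
# Pool size, fraud count and calibrated probability are illustrative.

n = 2_000             # customers in the pool
observed_frauds = 35
p_calibrated = 0.012  # calibrated fraud probability from model development

mean = n * p_calibrated
std = sqrt(n * p_calibrated * (1 - p_calibrated))
z = (observed_frauds - mean) / std
p_value = 1 - NormalDist().cdf(z)            # one-sided: too many frauds?

if p_value > 0.10:
    light = "green"
elif p_value > 0.05:
    light = "yellow"
elif p_value > 0.01:
    light = "orange"
else:
    light = "red"
print(f"z = {z:.2f}, p-value = {p_value:.4f}, traffic light: {light}")
```
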
    • hosmer-lemeshow test
      • test calibrated versus observed fraud rates across multiple pools simultaneously
    • backtest calibration of regression model
      • use a parametric Student's t-test
        • can be used to evaluate the significance of the error, defined as the difference between the predicted and observed amounts
      • assumes the output of the regression model has been categorized into pools