Social Network Analysis for Fraud Detection

Basic concepts

  • guilt-by-association
    • the rationale for studying social networks
    • assumes that a person is guilty of a crime because he/she is associated with someone or something involved in a crime
  • applications of social network analytics
    • because many frauds are committed by organized groups of fraudsters acting together
    • traditional fraud detection models are limited in their ability to detect relations or extract insight from a social network
    • social network analytics gives insight into suspicious cases (e.g., insurance fraud, company bankruptcies, and opinion fraud)
  • network components

    • complex network analysis (CNA) - the study of complex networks, including their structure, characteristics, and dynamics
    • graph theory - the mathematical study of the analysis and representation of networks

      • types of graphs
        • directed graph
        • undirected graph
      • edge representation
        • self-edge - e.g., a person transfers money from his/her account to another account he/she owns
        • multi-edge - e.g., a person pays the same merchant twice
        • hyper-edge - e.g., three people attend the same event
      • weighted graph - represents the intensity of relations in a network

        • binary weight
          • 0 or 1: indicates whether or not a relation exists between two nodes
          • -1, 0 or 1: e.g. 1 represents close relation, 0 represents no relation, -1 represents animosity
        • numeric weight
          • higher numeric value represents closer affinity
        • normalized weight
          • variant of numeric weight where all outgoing edges of a node sum up to 1
        • Jaccard weight

          • edge weight depends on how "social" both nodes are
          • formula

          \[w(v_1,v_2)=\frac{|\Gamma(v_1)\cap\Gamma(v_2)|}{|\Gamma(v_1)\cup\Gamma(v_2)|}\]

          • example
            • person A attended 10 events, person B attended 5 events
            • both went to 3 common events
            • \(Jaccard\ weight = \frac{3}{10+5-3}=\frac{1}{4}\)
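
A minimal sketch of the Jaccard weight calculation, assuming each node's events are stored as a Python set; the node names and events below are made up for illustration.

```python
# Jaccard weight between two nodes, based on the events (Gamma) each node
# participates in. All names and events are illustrative.

def jaccard_weight(events_a: set, events_b: set) -> float:
    """w(v1, v2) = |Gamma(v1) intersect Gamma(v2)| / |Gamma(v1) union Gamma(v2)|."""
    union = events_a | events_b
    if not union:
        return 0.0                                # neither node attended any event
    return len(events_a & events_b) / len(union)

# Example from the notes: A attended 10 events, B attended 5, 3 in common.
events_a = {f"e{i}" for i in range(10)}           # e0 .. e9
events_b = {"e0", "e1", "e2", "e10", "e11"}       # 3 events shared with A
print(jaccard_weight(events_a, events_b))         # 3 / (10 + 5 - 3) = 0.25
```
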
      • for fraud

        • nodes - legitimate or fraudulent
        • edges - relationship
        • egonet - one-hop neighbourhoods
          • a one-hop neighbourhood is typically sufficient for fraud analysis
  • network representations

    • graphically (a)
    • mathematically
      • adjacency or connection matrix - a matrix of size \(n \times n\) (n is the number of nodes) (b)
        • a sparse matrix - containing many zero values
        • as in real-life social networks, where most people are connected to only a small number of friends
      • adjacency list - abstract representation of adjacency matrix (d)
      • weight matrix - a matrix with edge weights (c)
      • weight list - abstract representation of weight matrix (e)
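
A small sketch of these four mathematical representations for one toy undirected, weighted network; the node labels and weights are made up, and numpy is assumed to be available.

```python
import numpy as np

# Toy undirected, weighted network; edge weights are illustrative.
edges = [("A", "B", 1.0), ("A", "C", 0.5), ("B", "C", 2.0), ("C", "D", 1.5)]
nodes = ["A", "B", "C", "D"]
idx = {v: i for i, v in enumerate(nodes)}
n = len(nodes)

adjacency = np.zeros((n, n), dtype=int)    # (b) adjacency / connection matrix
weights = np.zeros((n, n))                 # (c) weight matrix
adj_list = {v: [] for v in nodes}          # (d) adjacency list
weight_list = {v: [] for v in nodes}       # (e) weight list

for u, v, w in edges:
    i, j = idx[u], idx[v]
    adjacency[i, j] = adjacency[j, i] = 1  # undirected: symmetric matrix
    weights[i, j] = weights[j, i] = w
    adj_list[u].append(v); adj_list[v].append(u)
    weight_list[u].append((v, w)); weight_list[v].append((u, w))

print(adjacency)   # sparse in real-life networks: mostly zeros
print(adj_list)    # {'A': ['B', 'C'], 'B': ['A', 'C'], ...}
```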

Homophilic network

  • definition
    • fraudsters connect to fraudsters
    • legitimate people connect to legitimate people
  • web of frauds
    • fraudsters' nodes are grouped together
  • features
    • no network is perfectly homophilic - if a network exhibits evidence of homophily, it is worthwhile to investigate more thoroughly
    • a network that is not homophilic is heterophilic
  • formula
    • \(l\) = fraction of legitimate nodes in the network
    • \(f\) = fraction of fraudulent nodes in the network
    • \(2lf\) = expected fraction of cross-label edges (edges connecting a fraudulent and a legitimate node) under random mixing
    • \(\hat{r}\) = observed fraction of cross-label edges
    • if the null hypothesis is rejected (the observed fraction of cross-label edges is significantly less than the expected fraction), the network is homophilic

\[H_0 : \hat{r} \geq 2lf\]

  • example
  • dyadicity & heterophilicity

    • \(Dyadicity(D)=\frac{m_{11}}{\overline{m}_{11}}\), where \(m_{11}\) is the observed number of edges between two fraudulent nodes and \(\overline{m}_{11}\) the number expected if edges were placed randomly
    • \(Heterophilicity(H)=\frac{m_{10}}{\overline{m}_{10}}\), where \(m_{10}\) is the observed number of cross-label edges and \(\overline{m}_{10}\) the randomly expected number
    • if D > 1, the network is dyadic
      • fraudulent cases tend to connect more densely among themselves than expected randomly
    • if H < 1, the network is heterophobic (opposite of heterophilic)
      • fraud nodes have fewer connections to legitimate nodes than expected randomly
    • if a network is dyadic and heterophobic (D > 1 and H < 1), it exhibits homophily
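
A sketch of the dyadicity and heterophilicity calculation for a small labelled network, assuming the usual expected counts under random mixing (\(\overline{m}_{11}=n_1(n_1-1)p/2\) and \(\overline{m}_{10}=n_1(N-n_1)p\), with \(n_1\) the number of fraud nodes and \(p\) the network density); the toy edges and labels are made up.

```python
# Dyadicity (D) and heterophilicity (H) of a labelled network.
# labels: 1 = fraud, 0 = legitimate; edges are undirected pairs.
# Expected counts assume edges are placed randomly with connectance p.

def dyadicity_heterophilicity(edges, labels):
    N, M = len(labels), len(edges)
    p = 2 * M / (N * (N - 1))                       # connectance (density)
    n1 = sum(labels.values())                       # number of fraud nodes

    m11 = sum(1 for u, v in edges if labels[u] == 1 and labels[v] == 1)
    m10 = sum(1 for u, v in edges if labels[u] != labels[v])

    m11_exp = n1 * (n1 - 1) * p / 2                 # expected fraud-fraud edges
    m10_exp = n1 * (N - n1) * p                     # expected cross-label edges
    return m11 / m11_exp, m10 / m10_exp

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
labels = {"a": 1, "b": 1, "c": 1, "d": 0, "e": 0}
D, H = dyadicity_heterophilicity(edges, labels)
print(f"D = {D:.2f}, H = {H:.2f}")   # D > 1 and H < 1 suggests homophily
```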

Metrics to measure impact of neighborhoods

  • neighbour metrics
    • characterize the target of interest based on its direct associates
    • fraud detection models typically use only the first-order (direct) neighbourhood
    • include: degree, triangles, density, relational neighbour & probabilistic relational neighbour
  • centrality metrics
    • quantify the importance of an individual in a social network
    • include: geodesic paths, betweenness, closeness, and graph theoretic center
  • collective inference algorithms
    • compute the probability that a node is exposed to fraud, i.e. probability that fraud influences a certain node
    • include: PageRank, Gibbs sampling, iterative classification, relaxation labelling, and loopy belief propagation

Neighborhood Metrics

  • degree
    • definition
    • degree distribution
      • the probability distribution of the degree in the network
      • in real-life networks, follows power law (i.e. many nodes are connected with few other nodes, while only few nodes are connected to many other nodes)
    • k-regular graph
      • network with a constant degree distribution
      • each node in the network has the same degree
      • e.g. 4-regular graph
  • triangle
    • definition
    • example
  • density

    • definition

      • measures the extent to which nodes in a network are connected to each other
      • i.e., the ratio of the number of observed edges (M) to the maximum possible number of edges in the graph
      • fully connected network: each node is connected to every other node, giving \(\frac{N(N-1)}{2}\) edges, where N = total number of nodes

      \[d=\frac{2M}{N(N-1)}\]

    • example
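A short sketch computing node degrees and the network density for a made-up undirected toy graph.

```python
from collections import Counter

# Degree and density of a small undirected toy network (illustrative).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
nodes = {"A", "B", "C", "D"}

degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1
print(dict(degree))               # {'A': 2, 'B': 2, 'C': 3, 'D': 1}

N, M = len(nodes), len(edges)
density = 2 * M / (N * (N - 1))   # d = 2M / (N(N-1))
print(density)                    # 8 / 12 = 0.67
```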

  • relational neighbourhood

    • relational neighbor classifier
      • makes use of the homophily assumption - connected nodes have a propensity to belong to the same class, i.e. guilt-by-association
      • if two nodes are associated, they tend to exhibit similar behaviours
    • posterior class probability for node n to belong to class c
      • example

  • probabilistic relational neighbour classifier

    • extension of relational neighbour classifier
    • example
    • \(P(c|n_j)\) can be result of a local model, or of a previously applied network model
    • the relational neighbour and probabilistic relational neighbour classifiers can both be used as classifiers on their own, or as additional variables in fraud prediction models
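
A minimal sketch contrasting the relational neighbour and probabilistic relational neighbour classifiers for one target node; the edge weights, neighbour labels, and probabilities are made up.

```python
# Relational neighbour vs. probabilistic relational neighbour classifier
# for a single target node. All values below are illustrative.

# neighbour -> (edge weight, hard label, P(fraud | neighbour))
neighbours = {
    "n1": (1.0, "fraud", 1.0),
    "n2": (0.5, "legit", 0.2),
    "n3": (2.0, "fraud", 0.7),
}
w_total = sum(w for w, _, _ in neighbours.values())

# Relational neighbour classifier: P(fraud | n) is the normalized sum of
# the edge weights of the neighbours labelled as fraud.
w_fraud = sum(w for w, label, _ in neighbours.values() if label == "fraud")
print("relational neighbour   P(fraud|n) =", w_fraud / w_total)   # 3.0 / 3.5

# Probabilistic relational neighbour: weight each neighbour's own fraud
# probability (e.g. from a local or previously applied network model).
p_fraud = sum(w * p for w, _, p in neighbours.values()) / w_total
print("probabilistic variant  P(fraud|n) =", p_fraud)             # 2.5 / 3.5
```
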
  • relational logistic regression classifier

    • definition
      • starts from a dataset of local, node-specific characteristics called intrinsic variables (e.g., company age, sector)
    • network characteristics
      • mode-link: most frequently occurring class among the neighbours
      • count-link: frequencies of the classes among the neighbours
      • binary-link: binary indicators indicating class presence
      • example
    • trick: use logistic regression with the dataset that contains both intrinsic and network variables
    • caution: there might be correlations between the network variables added
      • filtered out during an input selection procedure (e.g. using stepwise logistic regression), or
      • featurization - can measure the behaviour of the neighbors in terms of the target variables or the intrinsic variables
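
A sketch of the relational logistic regression idea: derive mode-link, count-link, and binary-link variables from each node's neighbours and combine them with an intrinsic variable in a logistic regression. scikit-learn's LogisticRegression is assumed to be available, and all data are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# For each node, combine an intrinsic variable (company age) with network
# variables derived from the labels of its neighbours. Data are illustrative.

neighbour_labels = [      # neighbour labels per node (1 = fraud, 0 = legitimate)
    [1, 1, 0],            # node 0
    [0, 0, 0, 0],         # node 1
    [1, 0],               # node 2
    [1, 1, 1],            # node 3
]
age = [2, 15, 7, 1]       # intrinsic variable
y = [1, 0, 0, 1]          # known target labels

rows = []
for a, nbrs in zip(age, neighbour_labels):
    count_link = sum(nbrs)                            # number of fraud neighbours
    mode_link = int(count_link * 2 >= len(nbrs))      # majority neighbour class (ties -> fraud)
    binary_link = int(count_link > 0)                 # at least one fraud neighbour
    rows.append([a, mode_link, count_link, binary_link])

X = np.array(rows)
model = LogisticRegression().fit(X, y)
print(model.predict_proba(X)[:, 1])                   # fitted fraud probabilities
```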

Centrality metrics

  • Dijkstra's algorithm
    • example
    • Dijkstra's algorithm is only applicable to networks with non-negative edge weights
  • geodesic path (or shortest path)
    • computes the minimum distance needed to reach a node from a target node
    • Q: how far is any fraudulent node in the network removed from the target node?
      • if a fraudulent node is in the close neighborhood, fraud might impact the node more intensively and contaminate the target of interest
    • graph theoretic center
      • the node with the smallest, maximum distance to all other nodes, i.e. most central node in the network
    • average path length
      • length one needs to traverse on average to reach one node from another
    • in fraud, if more paths exist between two nodes, there is a higher chance that fraudulent influence will eventually reach the target node
    • adjacency matrix - the k-th power \(A^k\) gives the number of connecting paths of length k between any two nodes
      • example
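
A sketch of Dijkstra's algorithm with a binary heap, assuming non-negative edge weights as noted above; the toy graph is made up.

```python
import heapq

# Dijkstra's algorithm (non-negative edge weights only). Returns the
# geodesic (shortest-path) distance from `source` to every reachable node.

def dijkstra(graph, source):
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                       # stale heap entry, skip
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

graph = {                                  # node -> [(neighbour, weight), ...]
    "A": [("B", 1), ("C", 4)],
    "B": [("A", 1), ("C", 2), ("D", 5)],
    "C": [("A", 4), ("B", 2), ("D", 1)],
    "D": [("B", 5), ("C", 1)],
}
print(dijkstra(graph, "A"))   # {'A': 0.0, 'B': 1.0, 'C': 3.0, 'D': 4.0}
```
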
  • closeness

    • definition

      • mean geodesic distance or farness measure
        • \(d(v_i, v_j)\) = distance between node i and node j along the geodesic (shortest) path
        • \(g(v_i)\) = mean geodesic distance or farness from node i to the other nodes
        • \(n\) = number of nodes in the network

      \[g(v_i)=\frac{\displaystyle \sum_{j=1,\, j\neq i}^{n}d(v_i, v_j)}{n-1}\]

    • closeness centrality

      • measures the mean distance from a node to every other node in the network
      • inverse of farness measure

      \[Closeness \ Centrality(v_i)={\left(\frac{\displaystyle \sum_{j=1,\, j\neq i}^{n}d(v_i, v_j)}{n-1}\right)}^{-1}\]

    • a few points to note

      • the higher the value, the more central the node is in the network
      • the closeness centralities of all nodes in the network may lie close together, so it is important to inspect the decimals
      • closeness centrality often excludes the distances to nodes that cannot be reached
    • example
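
A sketch of closeness centrality as the inverse of the farness measure, using breadth-first search on a small unweighted toy graph; distances to unreachable nodes are excluded, as noted above. The graph is made up.

```python
from collections import deque

# Closeness centrality = inverse of the mean geodesic distance.
graph = {
    "A": ["B"], "B": ["A", "C", "D"],
    "C": ["B", "D"], "D": ["B", "C", "E"], "E": ["D"],
}

def geodesic_distances(source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:             # first visit = shortest path (unweighted)
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness_centrality(node):
    dist = geodesic_distances(node)
    others = [d for v, d in dist.items() if v != node]   # reachable nodes only
    farness = sum(others) / len(others)                  # mean geodesic distance
    return 1.0 / farness

for v in graph:
    print(v, round(closeness_centrality(v), 3))
# e.g. B: distances 1, 1, 1, 2 -> farness 1.25 -> closeness 0.8
```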

  • betweenness

    • definition
      • measures the extent to which a node lies on the geodesic paths connecting any two nodes in the network
      • the extent to which information passes through this node
      • a node with a high betweenness possibly connects communities with each other
    • betweenness centrality

      • \(g_{jk}\) = number of shortest paths between node j and node k
      • \(g_{jk}(v_i)\) = number of shortest paths between node j and node k that pass through node \(v_i\)

      \[Betweenness \ Centrality(v_i)=\displaystyle \sum_{j < k}\frac{g_{jk}(v_i)}{g_{jk}}\]

    • example
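
A short sketch of betweenness centrality using networkx (an assumption about the available tooling), since counting shortest paths by hand is tedious; the toy graph is made up.

```python
import networkx as nx

# Betweenness centrality with networkx (assumed to be available).
# normalized=False returns the raw sum over node pairs from the formula above.

G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])

print(nx.betweenness_centrality(G, normalized=False))
# 'C' scores highest: the shortest paths A-D, A-E, B-D and B-E all pass
# through it, so it bridges the A-B-C triangle with D and E.
```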

Collective inference algorithms

  • definition
    • infers a set of class labels/probabilities for the unknown nodes
    • rationale: inferences about nodes can mutually affect one another
  • PageRank algorithm
    • Google's original web-search ranking is based on this algorithm
    • explanation
    • ranking of a web page depends on
      • ranking of web pages pointing towards that web page
      • out-degree of the linking web pages
      • random surfer: assumes surfers get bored and jump randomly to another web page
    • calculating the PageRank of a page requires the rankings of the neighboring web pages
      • start with a random PageRank value for every web page
      • iteratively update the PageRank scores until a predefined number of iterations is reached or a stopping criterion is met (e.g. when the change in the ranking is marginal)
    • application of PageRank algorithm to fraud analytics
      • PageRank algorithm can be seen as a propagation of page influence through the network
      • propagate fraud through the network
    • personalize PageRank by fraud
      • i.e. propagate fraud influence through the people-to-people network (adjacency matrix A), restarting at the known fraudulent nodes
    • example
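
A sketch of PageRank personalized by fraud, using power iteration over a column-normalized adjacency matrix with a restart vector that points only to known fraudulent nodes; the damping factor and toy matrix are illustrative, not from the notes' example.

```python
import numpy as np

# Personalized PageRank: the restart vector e puts all its mass on known
# fraudulent nodes, so the resulting scores measure exposure to fraud.

def personalized_pagerank(A, fraud_nodes, d=0.85, iters=100):
    n = A.shape[0]
    out = A.sum(axis=0)
    P = A / np.where(out == 0, 1, out)          # column-normalize: split influence over neighbours
    e = np.zeros(n)
    e[fraud_nodes] = 1.0 / len(fraud_nodes)     # restart on fraud nodes only
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = d * P @ r + (1 - d) * e             # propagate, then random restart
    return r / r.sum()

# 5-node undirected toy network (symmetric adjacency matrix).
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
scores = personalized_pagerank(A, fraud_nodes=[0])   # node 0 is a known fraud
print(np.round(scores, 3))   # nodes closer to node 0 score higher
```
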
  • Gibbs sampling
    • uses a local classifier to infer a posterior class probability in order to initialize the node labels in the network
    • the original semi-labelled graph is transformed into a fully labelled graph by sampling the posterior probabilities of the local classifier
    • predictive features of a local classifier consist of only intrinsic variables
    • labels of unknown nodes in the graph express an expectation of the true class value
    • iterative procedure
      • updates the expected class labels of the unknown nodes
      • the first \(iter_b\) steps approach a stationary distribution (burn-in period), where no statistics are recorded
      • the last \(iter_c\) steps, keeps track of which class labels are assigned to each node
      • final class probability estimate is computed as normalized count of the number of times each class is assigned to a particular node
  • iterative classification algorithm
    • similar to Gibbs sampling, iterative classification initializes the semi-labelled graph by using a local classifier
    • bootstrap phase
      • based on the local model’s output, the most probable class label is assigned to each unknown node
    • iteration phase
      • a relational learner updates the class labels of each unknown node based on the outcome of a relational logistic regression model
      • input features: computed as link statistics of the current label assignments
      • link statistics: mode (most frequently occurring label among the neighboring nodes), count (number of neighboring fraud nodes), binary (whether at least one neighboring node is fraudulent)
      • nodes not yet classified are ignored
      • a new class label is assigned to each unknown node based on the largest posterior probability; this step is repeated until a stopping criterion is met
      • the final class label corresponds to the class label estimate generated during the last iteration
  • relaxation labelling
    • starts from a local classifier to initialize a node’s class label
    • previous approaches assigned a hard label (i.e. either legitimate or fraud).
    • soft labelling: relaxation labelling starts with assigning each node a probability that indicates the likelihood of a node to belong to a certain class
    • next, the probability class labels are used to iteratively update the class probability using a relational model
    • the class estimates of the last iteration are the final class label estimates
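
A minimal sketch of relaxation labelling, assuming a simple probabilistic relational neighbour model (unweighted average of the neighbours' current fraud probabilities) as the relational learner; the initial probabilities stand in for a local classifier's output and are made up.

```python
# Relaxation labelling with soft labels. Initial fraud probabilities stand in
# for a local classifier's output; the relational model is an unweighted
# average of the neighbours' current fraud probabilities. Data are made up.

graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
p_fraud = {"a": 1.0, "b": 0.6, "c": 0.5, "d": 0.1}   # initial soft labels
known_fraud = {"a"}                                   # confirmed fraudster, kept fixed

for _ in range(10):                                   # iterate until (near) convergence
    updated = {}
    for node, nbrs in graph.items():
        if node in known_fraud:
            updated[node] = p_fraud[node]             # keep known labels fixed
        else:
            updated[node] = sum(p_fraud[v] for v in nbrs) / len(nbrs)
    p_fraud = updated                                 # synchronous update

# Fraud influence has propagated from the known fraudster towards its neighbours.
print({k: round(v, 2) for k, v in p_fraud.items()})
```
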
  • loopy belief propagation
    • a collective inference procedure based on iterative message passing
    • idea
      • belief of each node to be in state x (let’s say fraud) depends on the messages it receives from its neighbors
      • belief of a node to be in state x is the normalized product of the received messages
      • the message as well as the belief is continuously updated during the algorithm
  • featurization
    • mapping of an unstructured data source like a network into useful and meaningful characteristics of each node
    • most network-based features already give a good indication of which nodes might be fraudulent in the future; these network-based characteristics can be combined with local (intrinsic) features in a classification model
  • bipartite graph
    • unipartite network
      • networks with only one node type
      • link weight: shared events or behaviors
      • the higher the link weight, the more intense the relationship is
      • two nodes are more influenced by each other if their relationship is more intense
    • affiliation or bipartite network
      • represents the reason why people connect to each other, and includes the events that network nodes (e.g. people or companies) attend or share
      • creates new insights in the network structure
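
A sketch that projects an affiliation (bipartite) people-to-events network onto a unipartite people-to-people network, with the link weight counting the shared events; the people and events are made up.

```python
from collections import defaultdict
from itertools import combinations

# Project a bipartite (affiliation) network of people and events onto a
# unipartite people-to-people network. The link weight counts shared events,
# so a higher weight means a more intense relationship.

attends = {                       # event -> people attending it
    "claim_1": ["ann", "bob"],
    "claim_2": ["ann", "bob", "cee"],
    "garage_x": ["bob", "cee"],
}

link_weight = defaultdict(int)
for people in attends.values():
    for u, v in combinations(sorted(people), 2):
        link_weight[(u, v)] += 1  # one shared event adds 1 to the weight

print(dict(link_weight))
# {('ann', 'bob'): 2, ('ann', 'cee'): 1, ('bob', 'cee'): 2}
```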

Post-Processing of Fraud Analytics

Backtesting analytical fraud models

  • backtesting
    • aims at contrasting ex-ante made predictions with ex-post realized outcomes
    • the key idea is to verify whether the fraud model still performs satisfactorily
    • important because fraudsters might outsmart fraud analytical models by continuously changing their strategies
  • backtesting activities
    • data stability
    • model stability
    • model calibration
  • traffic light indicator approach
    • green light = low probability of fraud
    • red light = high probability of fraud
  • backtesting data stability

    • check whether the population on which the model is currently being used is similar to the population that was used to develop the model
    • if differences occur in step 1, verify the stability of the individual variables
    • for step 1, a system stability index (SSI) can be calculated as

      \[SSI=\sum_{i=1}^{k}(observed_i - expected_i)\cdot\ln\left(\frac{observed_i}{expected_i}\right)\]

      • SSI measures difference between expected and observed distribution
      • higher SSI implies a population shift and instability
      • rule of thumb
        • SSI < 0.10: no significant shift (green)
        • 0.10 ≤ SSI < 0.25: moderate shift (yellow)
        • SSI ≥ 0.25: significant shift (red)
      • the SSI is also referred to as the deviation index and is identical to the information value
      • SSI can be monitored through time or by individual variables
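
A sketch of the SSI calculation with the traffic-light rule of thumb; the expected and observed distributions over the bins are made up.

```python
import numpy as np

# System stability index (SSI) between the expected (training) and observed
# (current) population distributions over k bins/pools. Data are illustrative.

def ssi(observed, expected):
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.sum((observed - expected) * np.log(observed / expected)))

expected = [0.30, 0.40, 0.20, 0.10]    # distribution at model development
observed = [0.22, 0.38, 0.26, 0.14]    # distribution on the current population

value = ssi(observed, expected)
if value < 0.10:
    light = "green (no significant shift)"
elif value < 0.25:
    light = "yellow (moderate shift)"
else:
    light = "red (significant shift)"
print(round(value, 4), light)
```
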
  • backtesting model stability

    • volatility of the model performance
    • track the model performance using a performance metric (PM), the number of observations, the number of frauds, and the corresponding traffic light
    • track these over subsequent periods of time
    • traffic light coding procedure
      • if change of PM < 5%, assign green light
      • if change of PM ≥ 5% and < 10%, assign yellow light
      • if change of PM ≥ 10%, assign red light
    • bootstrapping procedure
      • pool the training and out-of-time test observations with the predicted PM into one larger sample
      • draw a training and a test set bootstrap sample with the same size as the original training and out-of-time test set (NOTE: a bootstrap sample is one with replacement)
      • calculate the difference for PM between the bootstrap training and the bootstrap test sample
      • repeat 1,000 or more times to get the distribution and statistically test whether the difference is zero
    • traffic light procedure
      • green: no statistical difference at 95%
      • yellow: statistical difference at 95% but not at 99%
      • red: statistical difference at 99%
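
A sketch of the bootstrapping procedure for model stability, using AUC as an assumed performance metric (via scikit-learn's roc_auc_score) and randomly generated label/score pairs in place of real training and out-of-time test data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score   # AUC used here as an example PM

# Pool the training and out-of-time test observations (label + model score),
# repeatedly draw bootstrap train/test samples of the original sizes with
# replacement, and build the null distribution of the PM difference.

rng = np.random.default_rng(0)

def pm(sample):                              # performance metric: AUC
    y, score = sample[:, 0], sample[:, 1]
    return roc_auc_score(y, score)

# (label, score) pairs standing in for the original training and test sets
train = np.column_stack([rng.integers(0, 2, 500), rng.random(500)])
test = np.column_stack([rng.integers(0, 2, 300), rng.random(300)])

observed_diff = pm(train) - pm(test)
pooled = np.vstack([train, test])

diffs = []
for _ in range(1000):                        # 1,000 or more bootstrap draws
    b_train = pooled[rng.integers(0, len(pooled), len(train))]
    b_test = pooled[rng.integers(0, len(pooled), len(test))]
    diffs.append(pm(b_train) - pm(b_test))

p_value = np.mean(np.abs(diffs) >= abs(observed_diff))
print(f"PM difference = {observed_diff:.4f}, bootstrap p-value = {p_value:.3f}")
```
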
  • backtesting model calibration

    • in classification or regression model
      • aim of the model is to come up with well-calibrated estimates of fraud probabilities or fraud amounts
      • the calibration needs to be monitored in time
    • procedure
      • a classification model assigns fraud probabilities to various pools of customers
      • each pool has a corresponding calibrated probability, as it was calculated during model development
      • to see how these probabilities evolve in time and whether they remain stable
    • binomial test
      • used as heuristic for evaluating the quality of the calibration
      • traffic light procedure
        • green: no statistical difference at 90%
        • yellow: statistical difference at 90% but not at 95%
        • orange: statistical difference at 95% but not at 99%
        • red: statistical difference at 99%
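
A sketch of the binomial test for one pool, using the common normal approximation, together with the four-colour traffic-light procedure; the pool size, fraud count, and calibrated probability are made up.

```python
from math import sqrt
from statistics import NormalDist

# Binomial test (normal approximation) for one pool: is the observed fraud
# count significantly higher than the calibrated fraud probability implies?
# Pool size, fraud count and calibrated probability are illustrative.

n = 2_000             # customers in the pool
observed_frauds = 35
p_calibrated = 0.012  # calibrated fraud probability from model development

mean = n * p_calibrated
std = sqrt(n * p_calibrated * (1 - p_calibrated))
z = (observed_frauds - mean) / std
p_value = 1 - NormalDist().cdf(z)            # one-sided: too many frauds?

if p_value > 0.10:
    light = "green"
elif p_value > 0.05:
    light = "yellow"
elif p_value > 0.01:
    light = "orange"
else:
    light = "red"
print(f"z = {z:.2f}, p-value = {p_value:.4f}, traffic light: {light}")
```
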
    • hosmer-lemeshow test
      • test calibrated versus observed fraud rates across multiple pools simultaneously
    • backtest calibration of regression model
      • use a parametric Student's t-test
        • can be used to evaluate the significance of the error, defined as the difference between the predicted and observed amounts
      • assumes the output of the regression model has been categorized into pools