Financial Fraud Analytics

Fraud detection models

  • classification
    • the output is a category
  • regression
    • the output is a real value

Techniques

  • statistical techniques
    • break-point analysis
    • peer-group analysis
    • association rule analysis
    • linear/logistic regressions
  • AI/DM/ML techniques
    • supervised learning
      • uses historical information to retrieve patterns
      • requires a labelled dataset
    • unsupervised learning
      • uses historical information to retrieve patterns
      • does not require a labelled dataset
      • divided into
        • clustering - to discover the inherent groupings in the data
        • association - to discover rules that describe large portions of the data and find relations between items
    • semi-supervised learning
      • deals with partially labelled data
    • social network analysis
      • learn and detect characteristics of fraudulent behavior in a network of linked entities
    • notes
      • these techniques are not mutually exclusive but complement each other
      • an effective fraud detection and prevention system may combine the use of different techniques

Data analytics

  • predictive analytics
    • logistic regression
    • decision tree
    • random forest
    • neural network
    • support vector machines
  • descriptive analytics
    • clustering
    • autoencoder

Statistical Techniques

Descriptive analytics for fraud detection

  • outlier detection techniques
    • break-point analysis
      • intra-account, indicated by a sudden change in account behavior
      • starts from defining a fixed time window
      • split the time window into old and new
      • old part represents the local model or profile against which the new observations will be compared
      • use a t-test (mean comparison) to compare the averages of the old and new parts (see the sketch after this list)
      • observations are ranked according to the t-statistic value
    • peer-group analysis
      • inter-account, compares with the peer group (normal behavior)
      • identifies the peer group of a target account
        • by using prior business knowledge or in a statistical way
        • group size cannot be too small (sensitive to noise), or too big (insensitive to local irregularities)
      • compares the behavior of the target account with peer accounts, e.g. using a t-test or any distance metric
    • disadvantage
      • the two methods detect local anomalies rather than global anomalies
        • cases may be normal in the global population
        • yet these cases are marked as suspicious when compared to the local profile or peers (more sensitive to local anomalies)
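
A minimal sketch of the break-point analysis above, using scipy's two-sample t-test; the account series, window size, and old/new split are illustrative assumptions, not values from a real system.

```python
# Break-point analysis on a single account (illustrative data and windows).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Daily transaction amounts: stable history, then a sudden jump.
amounts = np.concatenate([rng.normal(100, 10, 20),   # "old" behaviour
                          rng.normal(180, 10, 4)])   # "new" behaviour

window = amounts[-24:]                 # fixed time window
old, new = window[:20], window[20:]    # split into old and new parts

# Compare the averages of the old and new parts; a large |t|
# (small p-value) flags a sudden change in account behaviour.
t_stat, p_value = stats.ttest_ind(old, new, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```
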
  • relation detection techniques
    • association rule analysis
      • detect frequently occurring relationships between items
      • originates from market basket analysis - to detect which items are frequently purchased together
      • rules measure correlational associations and should not be interpreted as causal relations
    • steps
      • identify frequent item sets
        • frequency of an item set is measured by means of its support \(\mathrm{support}(X) = \frac{\text{number of transactions containing } X}{\text{total number of transactions}}\)
        • example
          • item set {insured A, police officer X, auto repair shop 1}
          • occurs in claim ID# 1, 6, 9
          • support = 3/10 = 30%
        • a frequent item set - an item set of which the support is higher than a minimum value (e.g. 10%)
      • identify association rules
        • strength of the association rule measured by confidence \(\mathrm{confidence}(X \rightarrow Y) = P(Y|X) = \frac{\mathrm{support}(X \cap Y)}{\mathrm{support}(X)}\)
        • example
          • item set {insured A, police officer X, auto repair shop 1} can have multiple association rules
            • if insured A AND police officer X -> auto repair shop 1
              • X = {insured A, police officer X}
              • \(support(X) = 4/10\) (ID# 1, 2, 6, 9)
              • Y = {auto repair shop 1}
              • \(support(X\cap Y) = 3/10\) (ID# 1, 6, 9)
              • confidence (X -> Y) = 75%
            • if insured A AND auto repair shop 1 -> police officer X
            • if insured A -> auto repair shop 1 AND police officer X
        • selected association rule - a rule of which the confidence is higher than a specified value
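
A minimal sketch of the support and confidence computations in the worked example above; the claim contents are assumptions chosen to reproduce the stated counts, with the unlisted claims left empty as placeholders.

```python
# Support and confidence for the worked example (10 claims; the itemset
# occurs in claims 1, 6, 9; {insured A, police officer X} also in claim 2).
claims = {
    1: {"insured A", "police officer X", "auto repair shop 1"},
    2: {"insured A", "police officer X"},
    3: set(), 4: set(), 5: set(),
    6: {"insured A", "police officer X", "auto repair shop 1"},
    7: set(), 8: set(),
    9: {"insured A", "police officer X", "auto repair shop 1"},
    10: set(),
}

def support(itemset):
    """Fraction of all transactions containing every item in `itemset`."""
    hits = sum(1 for items in claims.values() if itemset <= items)
    return hits / len(claims)

X = {"insured A", "police officer X"}
Y = {"auto repair shop 1"}
print(support(X))                    # 4/10 = 0.4
print(support(X | Y))                # 3/10 = 0.3
print(support(X | Y) / support(X))   # confidence(X -> Y) = 0.75
```
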

Predictive analytics for fraud detection

  • linear regression
    • used to model a continuous target variable
    • general formulation: \(Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_N X_N\)
      • slope: positive or negative relation between X and Y
      • \(\beta_1\ldots\beta_N\): regression coefficient of a variable
      • \(\beta_0\): intercept coefficient, the expected mean value of Y when all X = 0
    • minimize the sum of squared errors (equivalently the mean squared error, MSE) to find the best-fit straight line (see the sketch below)
    • advantages
      • performs exceptionally well for linearly separable data
      • operationally efficient and easy to interpret & implement
      • can extrapolate beyond the range of a specific dataset
    • disadvantages
      • target and explanatory variables must have a linear relation
      • prone to noise and overfitting (fitting the training dataset better than the testing dataset)
      • sensitive to outliers
      • assumes explanatory variables are independent; correlated variables lead to multicollinearity and inflated coefficient errors
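
A minimal sketch of the least-squares fit referenced above, using scikit-learn; the synthetic data and the true coefficients are assumptions for illustration.

```python
# Fit Y = b0 + b1*X1 + b2*X2 by minimizing the sum of squared errors.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                 # two explanatory variables
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.5, 200)

model = LinearRegression().fit(X, y)          # ordinary least squares
print(model.intercept_, model.coef_)          # approx. 3.0 and [2.0, -1.5]
```
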
  • logistic regression
    • can be used for classification problems where the target variable is 0 or 1; the predicted probability is bounded between 0 and 1
    • \(P(Y=1)=1-P(Y=0)\)
    • steps
      • optimize via maximum likelihood estimation (MLE) - choose the parameters in such a way as to maximize the probability of getting the sample at hand
      • interpret the result
        • linear in the log odds (logit)
        • estimates a linear decision boundary between the two classes
    • advantages
      • performs exceptionally well for linearly separable data
      • operationally efficient and easy to implement
      • a good baseline to measure the performance of other, more complex fraud detection models (see the sketch below)
    • disadvantages
      • cannot solve nonlinear problems
      • prone to overfitting
      • difficult to capture complex relationships
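
A minimal sketch of a logistic regression baseline for fraud classification; the synthetic features (e.g. amount, frequency, recency) and labels are illustrative assumptions.

```python
# Logistic regression as a fraud-classification baseline.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))                 # e.g. amount, frequency, recency
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 500) > 1.5).astype(int)

clf = LogisticRegression().fit(X, y)          # fitted by maximum likelihood
print(clf.predict_proba(X[:3])[:, 1])         # P(Y = 1), i.e. fraud probability
```
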

Decision Tree

Overview

  • basics
  • aims
    • minimize "impurity" in the data
  • impurity
    • the node impurity is a measure of the homogeneity of the labels at the node

Splitting decision

  • categorical variables - decision tree

    • Gini impurity
    • entropy
    • example

      • formula

      \[IG(D_p,f)=I(D_p)-\frac{N_{left}}{N}I(D_{left})-\frac{N_{right}}{N}I(D_{right})\]

      • compare the IG for the splitting decision using the attributes "AGE" and "INCOME" (see the sketch after this section)
      • IG(AGE) > IG(INCOME) -> use AGE as the splitting attribute
  • continuous variables - regression tree

    • mean square error (MSE)
    • variance
    • regression tree splits - favour homogeneity within node and heterogeneity between nodes
    • example - favour a low MSE in a leaf node
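
A minimal sketch of the information-gain formula above using Gini impurity; the parent labels and the candidate split are illustrative assumptions.

```python
# Information gain IG(Dp, f) = I(Dp) - (N_left/N) I(D_left) - (N_right/N) I(D_right).
import numpy as np

def gini(labels):
    """Gini impurity I(D) = 1 - sum_k p_k^2 at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    n = len(parent)
    return (gini(parent)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))

parent = np.array([1, 1, 1, 0, 0, 0, 0, 0])
# One candidate split, e.g. on AGE; repeat for INCOME and pick the
# attribute with the larger information gain.
left, right = parent[:4], parent[4:]
print(information_gain(parent, left, right))
```
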

Stopping decision

  • aims
    • avoid overfitting - an overly complex tree fails to correctly model the noise-free pattern or trend in the data
  • steps
    • split the data into training set and validation set (usually 7:3)
    • stop growing the tree when the misclassification error on the validation set reaches its minimum value (see the sketch below)
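
A minimal sketch of the stopping decision: grow trees of increasing depth and stop where the validation misclassification error is minimal. The data, depth range, and 7:3 split are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# 7:3 split into training and validation sets.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

errors = {}
for depth in range(1, 15):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    errors[depth] = 1 - tree.score(X_val, y_val)   # misclassification error

best_depth = min(errors, key=errors.get)           # stop where error is minimal
print(best_depth, errors[best_depth])
```
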

Comparison

  • advantages of decision tree
    • easy to understand and interpret
    • useful in data exploration
    • less data cleaning required - robust to outliers in inputs and no problem with missing values
    • data type not a constraint - can handle both continuous and categorical data for both input and target data
    • automatically detects interactions, accommodates nonlinearity and selects input variables
  • disadvantages of decision tree
    • prone to overfitting
    • splitting turns continuous input variables into discrete variables
    • unstable fitted tree - a small change in the data results in a very different series of splits - sensitive to the dataset
  • classification vs regression tree

Ensemble Methods

Overview

  • definition
    • a machine learning model that involves a group of prediction models (not only one model)
  • reasons of using ensemble learning
    • performance: better predictions than any single contributing model
    • robustness: reduce the spread or dispersion of the predictions and model performance

Bootstrap sampling

  • definition
    • samples of the same size are repeatedly drawn, with replacement, from the larger original sample
  • procedures
    • choose a number of bootstrap samples to perform
    • choose a sample size
    • for each bootstrap sample
      • draw a sample with replacement with the chosen size
      • calculate the statistic on the sample
    • calculate the mean of the calculated sample statistics
  • bagging (bootstrap aggregating)
    • take N number of bootstraps from the underlying sample
    • build a prediction model on each bootstrap sample, then combine the N models (see the sketch below)
      • for classification -> majority voting
      • for regression -> calculate the average of the outcome of the N prediction models
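
A minimal sketch of bagging with decision trees as base learners; the data, the number of bootstraps, and the choice of base learner are illustrative assumptions.

```python
# Bagging: N bootstrap samples -> N trees -> majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

n_bootstraps = 25
models = []
for _ in range(n_bootstraps):
    idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Classification -> majority voting over the N prediction models.
votes = np.array([m.predict(X) for m in models])
y_hat = (votes.mean(axis=0) > 0.5).astype(int)
print((y_hat == y).mean())
```
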

Random Forest

Overview

  • definition
    • is a forest of decision trees
    • can be used for both classification tree and regression tree
    • achieve dissimilarities among the decision trees by
      • adopting a bootstrap procedure to select training samples for each tree (training every tree on the same dataset would be useless)
      • selecting a random subset of attributes at each node
      • training different base models of decision trees
    • result of random forest is a model with better performance compared to a single decision tree model

Comparison

  • advantages
    • random forest can achieve excellent predictive performance and is suitable for the requirements of fraud detection
    • capable of dealing with data sets having only a few observations, but with lots of variables
  • disadvantages
    • it is a black-box model, which is actually more complicated than a single tree
      • variable importance can be used to understand the internal workings of random forest (or any ensemble model)

Variable importance

  • aims
    • when getting results from a random forest, identify which variables have the most predictive power
    • variables with HIGH importance can be used for further analysis, while variables with LOW importance can be discarded
  • example - two mechanisms are provided; which one should be used depends on the application (see the sketch below)
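
A minimal sketch of two common importance mechanisms for a random forest, impurity-based importance and permutation importance, using scikit-learn; the synthetic data are an assumption, and which mechanism to prefer depends on the application.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.feature_importances_)          # mean decrease in impurity (per variable)

perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(perm.importances_mean)            # permutation importance (per variable)
```
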

Performance Evaluation

Split the sample data

  • dataset roles
    • training data - to build the model
    • validation data - to be used during model development (e.g. making stopping decision in decision tree)
    • testing data - to test the performance of the model
  • splitting the dataset
    • observations used for training should not be used for testing or validation
    • training:(validation):testing (not strict, it depends)
      • 7:3 (validation dataset is not required)
      • 4:3:3 (validation dataset is required)
  • k-fold cross-validation
    • can be applied when the sample size is small
    • procedures (k = 5)
      • the original dataset is randomly partitioned into k disjoint sets
      • then k − 1 parts are used for training a model and the remaining part is used for evaluation
      • this process is repeated k times for all possible choices of the test set, producing k test errors
      • the final performance is reported by averaging the errors from each iteration (see the sketch at the end of this section)
    • with more than one trained model, which one should be chosen?
      • similar to ensemble methods, a voting procedure can be used
      • use leave-one-out cross-validation and randomly select one model
        • randomly select one of the trained models
        • each model is trained on all data except the one dropped observation
        • since all models differ by one observation only, the performance should be similar for all models
      • use all observations for training
        • then use the cross-validation performance result (mean error) as the independent estimate of the model
        • only use this when the sample size is extremely small (no-choice option)
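
A minimal sketch of 5-fold cross-validation with scikit-learn; the model choice and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))           # small sample -> k-fold CV is suitable
y = (X[:, 0] > 0).astype(int)

scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores)                           # accuracy on each held-out fold
print(scores.mean())                    # final reported performance
```
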

Model evaluation

  • classification model

    • statistical measures
      • correlation tests
      • comparison of means tests
      • regression tests
      • non-parametric tests
    • performance metric

      • mainly used in machine learning evaluation
      • the confusion matrix

        • accuracy: percentage of total items classified correctly

        \[Classification\ accuracy=\frac{TP+TN}{TP+FP+FN+TN}\]

        • error: percentage of total items classified incorrectly

        \[Classification\ error=\frac{FP+FN}{TP+FP+FN+TN}\]

        • recall: how many fraudsters are correctly classified as fraudsters (most important, i.e. favour TP > FN)

        \[Sensitivity=Recall=Hit\ rate=\frac{TP}{TP+FN}\]

        • precision: how many predicted fraudsters are actually fraudsters (useful when false positives are costly, e.g. spam mail detection should not flag important legitimate mail, i.e. favour TP > FP)

        \[Precision=\frac{TP}{TP+FP}\]

        • F1 score: the harmonic mean of precision and recall (takes FP and FN into account, thus more informative than accuracy)

        \[F-measure=\frac{2\times (Precision\times Recall)}{Precision+Recall}\]

        • for fraud detection models, recall is the most useful performance measure (see the sketch at the end of this subsection)
        • example (10 fraud cases among 100 claims; the model predicts every case as no fraud, so TP = 0 and TN = 90)

          \[Accuracy = \frac{0+90}{100}=90\%\]

          • even with very high accuracy, this model is useless for detecting fraud, because no fraud cases are detected (recall = 0)
      • ROC-AUC

        • basic terms
          • ROC = Receiver Operating Characteristics
          • AUC = Area Under the ROC Curve
          • \(\text{True Positive Rate (TPR)} = \text{Sensitivity} = \text{Recall} = \text{Hit rate} = \frac{TP}{TP+FN}\)
          • \(\text{True Negative Rate (TNR)} = \text{Specificity} = \frac{TN}{FP + TN}\)
          • \(\text{False Positive Rate (FPR)} = 1 - \text{Specificity} = \frac{FP}{FP+TN}\)
        • ROC curve
          • obtained by varying the classification threshold on the predicted probabilities
          • with TPR as y-axis and FPR as x-axis
        • example
          • AUC is always between 0 and 1
          • calculate the area under each curve and compare which model is better
          • AUC = 1 = ideal situation where all fraud and no fraud cases are correctly predicted
          • AUC = 0.5 (i.e. the diagonal) = random guessing, i.e. no discrimination power between fraud and no-fraud cases
          • any curve under the diagonal = worse than random guessing (no use)
    • any measures that are applicable for the model developed
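
A minimal sketch computing the classification metrics above with scikit-learn; the labels and scores are illustrative assumptions.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true  = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]       # 3 fraud cases out of 10
y_pred  = [1, 1, 0, 0, 0, 0, 0, 1, 0, 0]       # model catches 2, 1 false alarm
y_score = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1, 0.2, 0.6, 0.1, 0.05]

print(accuracy_score(y_true, y_pred))    # (TP + TN) / total
print(recall_score(y_true, y_pred))      # TP / (TP + FN), key for fraud
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_score))    # area under the ROC curve
```
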

  • regression model

    • MAE (Mean Absolute Error)
      • \(y_i\) is the actual expected output and \(\hat{y_i}\) is the model's prediction
      • simplest measure, but not the most popular (the regression measures below are computed in the sketch at the end of this subsection)

    \[MAE=\frac{1}{N} \displaystyle \sum^{N}_{i=1} |y_i - \hat{y_i}|\]

    • MSE (Mean Squared Error)
      • more sensitive to outliers, compared with MAE

    \[MSE=\frac{1}{N} \displaystyle \sum^{N}_{i=1} {(y_i - \hat{y_i})}^2\]

    • RMSE (Root Mean Squared Error)
      • takes the square root to correct for the squared error terms in MSE, so the error has the same units as the target variable

    \[RMSE=\sqrt{\frac{1}{N} \displaystyle \sum^{N}_{i=1} {(y_i - \hat{y_i})}^2}\]

    • r-squared
      • also called coefficient of determination
      • explains the degree to which the input variables explain the variation of the output/predicted variable
      • limitation: either stay the same or increases with the addition of more variables, even if they do not have any relationship with the output variables

    \[R^2=1-\frac{SS_{RES}}{SS_{TOT}}=1-\frac{\sum_i {(y_i-\hat{y_i})}^2}{\sum_i {(y_i-\bar{y})}^2}\]

    • adjusted r-squared
      • N is the total sample size (number of rows) and p is the number of predictors (number of columns)
      • overcome the R-squared limitation
      • for building a linear regression on multiple variables, adjusted R-squared is the suggested measure
      • but for only one input variable, R-squared and adjusted R-squared are the same

    \[Adjusted \ R^2=1-\frac{(1-R^2)(N-1)}{N-p-1}\]

    • scatter plot

      • the more closely the plotted points approximate a straight line, the better the performance of the regression model
    • Pearson correlation coefficient

      • varies between -1 and +1
      • closer to +1 indicates better agreement

    \[corr(\hat{y},y)=\frac{\sum^n_{i=1}(\hat{y_i}-\bar{\hat{y}})(y_i-\bar{y})}{\sqrt{\sum^n_{i=1}{(\hat{y_i}-\bar{\hat{y}})}^2}\sqrt{\sum^n_{i=1}{(y_i-\bar{y})}^2}}\]
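
A minimal sketch computing the regression measures above; the actual and predicted values are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y     = np.array([3.0, 5.0, 7.5, 9.0])   # actual values
y_hat = np.array([2.8, 5.4, 7.0, 9.5])   # model predictions

mae  = mean_absolute_error(y, y_hat)
mse  = mean_squared_error(y, y_hat)
rmse = np.sqrt(mse)                       # same units as the target
r2   = r2_score(y, y_hat)

n, p = len(y), 1                          # sample size, number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
corr = np.corrcoef(y_hat, y)[0, 1]        # Pearson correlation coefficient
print(mae, mse, rmse, r2, adj_r2, corr)
```
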

  • comparison

    • for classification model
      • the output is categorical data
      • the measure of performance is the % of correctly predicted values
    • for regression model

      • the output is a continuous number
      • the measure of the performance is how "close" the predicted value is to the actual value, any deviation from the actual value is an error

      \[Error = Y_{actual} - Y_{predicted}\]