FITE7410 Autoencoder and Benford’s Law
Review and Autoencoder
A quick review
- assumptions underlying PCA
- a Gaussian (normal) distribution is not required
- interval-level measurement
- random sampling
- linearity
- large variances have important structure
- the principal components are orthogonal
- fraud detection with outliers
- assume data has N features (N-dimensions)
- using PCA to find key components
- identify outliers using the key components (a sketch follows this review list)
- the reason for using PCA
- N-dimensional data are difficult to understand
- PCA supports dimension reduction
- PCA
- is a technique to reduce the dimensionality of data by forming new variables that are linear composites of the original variables
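As referenced above, a minimal sketch of PCA-based outlier detection with scikit-learn; the synthetic data, the 95% variance cut-off and the 3-standard-deviation rule are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic N-dimensional data with a few planted outliers (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X[:5] += 8.0

# Find the key components (enough to explain, say, 95% of the variance).
pca = PCA(n_components=0.95).fit(X)
X_rec = pca.inverse_transform(pca.transform(X))

# Points the key components cannot reconstruct well are flagged as outliers.
recon_error = np.mean((X - X_rec) ** 2, axis=1)
threshold = recon_error.mean() + 3 * recon_error.std()
print("flagged rows:", np.where(recon_error > threshold)[0])
```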
Autoencoder
- definition
- an autoencoder is a type of artificial neural network used to learn efficient data coding in an unsupervised manner
- it is a neural network that is trained to recreate as output what it is fed as input
- characteristics
- works for non-linear situations
- feed-forward neural network
- encoder and decoder
- unsupervised learning
- reconstruct the input in the output layer
- code layer is the latent representation
- with autoencoder, the hidden layers will map the input to a vector space in a “nice” way
- using fewer neurons in the code (bottleneck) layer, we map the input from a high-dimensional space to one with fewer dimensions, i.e. dimension reduction (a minimal sketch follows the applications list below)
- for higher-dimensional data, autoencoders can learn a complex representation of the data (a manifold) that describes observations in a lower dimensionality and can be decoded back into the original input space
- applications
- dimensionality reduction
- information retrieval
- anomaly detection
- fraud detection
- insider detection
- intrusion detection
- image processing
- drug discovery
- population synthesis
- popularity prediction
- machine translation
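As referenced above, a minimal sketch of the encoder / code / decoder structure in Keras (the notes do not prescribe a framework, so this is an assumption); the 29/14/7 layer sizes mirror the example used later in these notes, and the activation choices are illustrative:

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(29,))                    # input dimension = number of features
h = layers.Dense(14, activation="relu")(inputs)      # encoder
code = layers.Dense(7, activation="relu")(h)         # code layer = latent representation
h = layers.Dense(14, activation="relu")(code)        # decoder
outputs = layers.Dense(29, activation="linear")(h)   # reconstruct the input

autoencoder = keras.Model(inputs, outputs)           # trained to recreate its input
encoder = keras.Model(inputs, code)                  # encoder alone: dimension reduction
autoencoder.compile(optimizer="adam", loss="mse")    # unsupervised: fit(X, X)
```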
using reconstruction error to find the fraudulent transactions
- the dataset has too few fraudulent cases
create an autoencoder and train it only on non-fraudulent transactions
- train the autoencoder with non-fraudulent transactions only
- after 100 epochs, we have a trained autoencoder
- feed all transactions of the training data set into the trained autoencoder
calculate the error (reconstruction error) between the input and the output
reconstruction error
- autoencoders compress the input into a lower-dimensional projection and then reconstruct the output from this representation
- reconstruction error is the distance between the original input and its autoencoder reconstruction
\[L(x,x^{\prime})={||x-x^{\prime}||}^2\]
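The same error computed per transaction in NumPy (a sketch; `x_prime` would be the autoencoder's output, e.g. from the model sketched earlier):

```python
import numpy as np

def reconstruction_error(x, x_prime):
    """L(x, x') = ||x - x'||^2, computed row-wise (one error per transaction)."""
    return np.sum((x - x_prime) ** 2, axis=1)

# Toy check: identical input and reconstruction give zero error.
x = np.array([[1.0, 2.0, 3.0]])
print(reconstruction_error(x, x))   # -> [0.]
```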
evaluate the result
- it is expected that the reconstruction error of the autoencoder is higher for fraudulent transactions than the normal transactions
- transaction \(t\) is defined to be fraudulent if its reconstruction error falls outside a confidence interval of \(d\) standard deviations from the mean, i.e. above \(\text{mean} + d \times \text{stdev}\), for some parameter \(d\)
- if the data is normally distributed
- if the reconstruction error of a transaction > \(\text{mean} + 3 \times \text{stdev}\), mark the case as fraudulent (for a normal distribution, about 99.7% of values lie within 3 standard deviations of the mean)
- if the data is not normally distributed
- by Chebyshev's inequality, the probability of a value falling outside this interval is at most \(\frac{1}{d^2}\)
- for d = 3, the probability is at most 0.1111
- for d = 4, the probability is at most 0.0625
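A small sketch of the thresholding rule and the Chebyshev bound on simulated reconstruction errors (the gamma distribution and the values of d are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
errors = rng.gamma(shape=2.0, scale=1.0, size=10_000)   # stand-in reconstruction errors

d = 3
threshold = errors.mean() + d * errors.std()            # mean + d * stdev
flagged = errors > threshold
print(f"threshold = {threshold:.3f}, flagged = {flagged.mean():.4%} of transactions")

# Chebyshev: for any distribution, P(|X - mean| >= d * stdev) <= 1 / d^2
for d in (3, 4):
    print(f"d = {d}: Chebyshev bound = {1 / d**2:.4f}")
```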
A real autoencoder
- process
- normalize and scale the data set
- randomly split into training data set (80%) and testing data set (20%)
- create the autoencoder: 5 fully connected layers with 29, 14, 7, 14 and 29 (for example) neurons respectively
- feed the normal data set into the autoencoder model for training
- check whether the reconstruction error on the training data set converges or not (see the training sketch after this list)
- the reconstruction error on the training and testing data sets seems to converge nicely
- evaluate the performance using the testing data set
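As referenced above, a sketch of this process with scikit-learn and Keras; the synthetic data, activations, optimizer and batch size are assumptions, while the 80/20 split, the 29-14-7-14-29 layout, training on normal transactions only and the 100 epochs follow the notes:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic stand-in for the transaction data: 29 features, ~1% fraud (illustrative).
rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 29))
y = (rng.random(5000) < 0.01).astype(int)

X = StandardScaler().fit_transform(X)                   # normalize and scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)               # 80% / 20% split
X_train_normal = X_train[y_train == 0]                  # non-fraudulent transactions only

# 5 fully connected layers: 29 -> 14 -> 7 -> 14 -> 29
autoencoder = keras.Sequential([
    keras.Input(shape=(29,)),
    layers.Dense(14, activation="tanh"),
    layers.Dense(7, activation="relu"),
    layers.Dense(14, activation="tanh"),
    layers.Dense(29, activation="linear"),
])
autoencoder.compile(optimizer="adam", loss="mse")

history = autoencoder.fit(X_train_normal, X_train_normal,
                          epochs=100, batch_size=32, shuffle=True,
                          validation_data=(X_test, X_test), verbose=0)
# Convergence check: inspect history.history["loss"] and history.history["val_loss"].
```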
- evaluation
- metrics
- \(Recall=\frac{TP}{TP+FN}=TPR\)
- measures how many relevant results are returned
- \(Precision=\frac{TP}{TP+FP}\)
- measures the relevancy of obtained results
- \(F1=2 \times \frac{Recall \times Precision}{Recall+Precision}\)
- \(\text{True Positive Rate (TPR)}=\frac{TP}{TP+FN}\)
- \(\text{False Positive Rate (FPR)}=\frac{FP}{FP+TN}\)
- AUC = Area Under ROC Curve
- good to have both Precision and Recall values equal to 1
- high recall, low precision: many transactions are classified as fraud, but most of them are actually normal (many false positives, low relevancy)
- high precision, low recall: only a few transactions are classified as fraud (with high relevancy), so many fraudulent transactions are missed
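A quick sketch of these metrics with scikit-learn, on hypothetical labels and predictions (purely illustrative):

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # ground truth (1 = fraud)
y_pred = [0, 1, 1, 0, 0, 1, 0, 0, 1, 0]   # labels from thresholding the reconstruction error

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("recall (TPR):", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("precision   :", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("F1          :", f1_score(y_true, y_pred))
print("FPR         :", fp / (fp + tn))                   # FP / (FP + TN)
```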
- optimal threshold
- threshold value
- the boundary between the normal and fraudulent transactions using the autoencoder
- the threshold is applied to the reconstruction error: transactions whose error exceeds the threshold are classified as fraudulent
- different threshold values therefore produce different classification results
- when the threshold value increases, the precision increases and the recall decreases (a threshold-sweep sketch appears after the Precision-Recall curve notes below)
- when threshold = 50, the recall is 0.2 and the precision is 0.4
- comparing different threshold values
- higher threshold value
- fewer normal transactions are falsely classified, but fewer fraudulent cases are discovered
- lower threshold value
- more fraudulent cases can be found, but more normal transactions are falsely classified
- ROC curve (Receiver Operating Characteristic curve)
- as the classification threshold changes, plot the TPR against the FPR
- a graph showing the performance of a classification model for all classification thresholds
- lowering the classification threshold classifies more items as positive
- increasing both False Positives and True Positives
- Precision-Recall curve
- a high area under the curve represents both high recall and high precision
- high precision means low false positive rate
- high recall means low false negative rate
- high scores for both precision and recall means the classifier returns accurate results (high precision) and finds majority of all fraudulent transactions (high recall)
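As referenced above, a sketch tying the threshold sweep, the ROC curve and the Precision-Recall curve together, using simulated reconstruction errors as anomaly scores (the score distributions and class sizes are assumptions):

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_curve

# Simulated reconstruction errors: fraud cases tend to have larger errors.
rng = np.random.default_rng(2)
scores = np.concatenate([rng.gamma(2.0, 1.0, 950), rng.gamma(6.0, 1.5, 50)])
y_true = np.concatenate([np.zeros(950), np.ones(50)])

fpr, tpr, _ = roc_curve(y_true, scores)
print("ROC AUC:", auc(fpr, tpr))

precision, recall, _ = precision_recall_curve(y_true, scores)
print("PR  AUC:", auc(recall, precision))

# Sweeping the threshold: a higher threshold raises precision and lowers recall.
for t in np.percentile(scores, [80, 90, 95, 99]):
    pred = scores > t
    tp = np.sum(pred & (y_true == 1)); fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    print(f"threshold {t:6.2f}: precision = {tp/(tp+fp):.2f}, recall = {tp/(tp+fn):.2f}")
```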
Fraud detection using autoencoder and k-NN
- process
- using only the encoder part of an autoencoder (dimension reduction)
- apply k-Nearest Neighbors (k-NN) for classification to the instances in the reduced dimension space
- both normal and fraudulent transactions will be used in training
- example
- train the autoencoder with all transactions (normal and fraudulent) in the training data set
- for the k-NN classification, both training and testing transactions will be mapped to the “reduced” 12-dimensional space with the encoder
- for each transaction in the testing data set, the 3 closest neighboring instances of the training data set decide whether it is a fraudulent transaction or not
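A self-contained sketch of this pipeline (the synthetic data, the extra 20-neuron layers and the short training run are assumptions; the 12-dimensional code and the 3 nearest neighbours follow the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic stand-in data: 29 features, ~2% fraud, made separable for the demo.
rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 29))
y = (rng.random(2000) < 0.02).astype(int)
X[y == 1] += 3.0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Autoencoder with a 12-dimensional code layer; keep only the encoder part afterwards.
inputs = keras.Input(shape=(29,))
h = layers.Dense(20, activation="relu")(inputs)
code = layers.Dense(12, activation="relu")(h)
outputs = layers.Dense(29, activation="linear")(layers.Dense(20, activation="relu")(code))
autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_train, X_train, epochs=10, batch_size=64, verbose=0)  # all transactions
encoder = keras.Model(inputs, code)

# Map training and testing data to the 12-dimensional space, then let 3 neighbours vote.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(encoder.predict(X_train, verbose=0), y_train)
y_pred = knn.predict(encoder.predict(X_test, verbose=0))
print("flagged in test set:", int(y_pred.sum()))
```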
Benford's Law (The First Digit Law)
Overview
- content
- the frequency of occurrence of the leading digits in “naturally occurring” numerical distributions is predictable and non-uniform, closer to a power-law than to a uniform distribution
- a given number is about six times more likely to start with a 1 than a 9
- this is counterintuitive; people would expect a uniform distribution
- characteristics
- a series of numerical records follows Benford’s Law when they
- represent magnitudes of events or quantities, such as populations of cities or flows of water in rivers
- do not have pre-established minimum or maximum limits
- are not made up of numbers used as identifiers, such as ID cards, bank accounts, telephone numbers
- have a mean which is greater than the median (i.e. positively skewed), and the data is not concentrated around the mean
Benford's law in fraud detection
- an experiment
- start with a Benford dataset D
- D was contaminated with a sample from a normal distribution having the same mean and the same variance as D
- values in D were replaced at random by the fraudulent sample, so that the size of the contaminated dataset D’ was kept constant
- measure the (Euclidean) distance of the contaminated dataset D’ from Benford’s Law
- observations
- the figure shows how the distance from Benford’s Law evolves with the size of the fraudulent sample (% of the size of the dataset D)
- range of contamination was chosen to be 0-10%
- an exponential curve was found to be a “good” fit, i.e. the distance from Benford’s Law increases exponentially as the dataset becomes more contaminated
- distance can be used as a measure for fraud detection
- the variance of the distance is large, especially for high levels of contamination → separating noise from fraud may be difficult (a simulation sketch of this experiment follows)
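A simulation sketch of the experiment described above; the Benford-like generator (10 raised to a uniform exponent), the dataset size and the contamination levels are assumptions rather than the original experiment's data:

```python
import numpy as np

rng = np.random.default_rng(4)
benford_p = np.log10(1 + 1 / np.arange(1, 10))          # Benford first-digit probabilities

def first_digits(values):
    values = np.abs(values)
    return (values / 10 ** np.floor(np.log10(values))).astype(int)

def distance_from_benford(values):
    digits = first_digits(values)
    freq = np.array([(digits == d).mean() for d in range(1, 10)])
    return np.linalg.norm(freq - benford_p)              # Euclidean distance

# A Benford-conforming dataset D, spanning several orders of magnitude.
D = 10 ** rng.uniform(0, 5, size=20_000)

# Contaminate D with a normal sample (same mean and variance), replacing values at
# random so that the size of the contaminated dataset stays constant.
for frac in (0.0, 0.02, 0.05, 0.10):
    Dc = D.copy()
    idx = rng.choice(len(D), size=int(frac * len(D)), replace=False)
    Dc[idx] = np.abs(rng.normal(D.mean(), D.std(), size=len(idx)))
    print(f"contamination {frac:>4.0%}: distance from Benford = {distance_from_benford(Dc):.4f}")
```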
Mathematical formulation
- the frequency of occurrence of the leading digit \(D_1\) in naturally occurring numerical distributions follows a logarithmically decaying distribution
\[Pr(D_1 = d) = \log_{10}\left(1+\frac{1}{d}\right)\]
- result: digit 1 is the leading digit about 30.1% of the time, and the probability decreases monotonically to about 4.6% for digit 9 (see the sketch below)
- idea
- if a certain set of values follows Benford’s Law, then a model’s predictions for the corresponding values should also follow Benford’s Law
- for fraud detection
- manipulated or fraudulent data do not tend to conform to Benford’s Law, whereas unmanipulated data do
- Benford’s Law can be used to investigate fraud patterns for datasets in fields where categorical values have been counted, such as medical test results or income tax revenues
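As referenced above, a short sketch evaluating the formula for each digit:

```python
import numpy as np

digits = np.arange(1, 10)
probs = np.log10(1 + 1 / digits)   # Pr(D1 = d) = log10(1 + 1/d)
for d, p in zip(digits, probs):
    print(f"digit {d}: {p:.4f}")   # 0.3010 for d = 1 down to 0.0458 for d = 9
print("sum:", probs.sum())         # the nine probabilities sum to 1
```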
- example 1 - consider financial data
- assume someone owns a stock with a value of $1,000
- for the holding to reach $2,000, it would have to double, i.e. grow by 100%
- to further grow from $2,000 to $3,000, it would only need to increase by 50%
- to further grow from $3,000 to $4,000, i.e. first digit from 3 to 4, it would need to increase by 33.3%
- therefore, moving the first digit from 1 to 2 requires far more growth (100%) than moving it from 3 to 4 (33.3%), so a steadily growing quantity spends more time with small leading digits (see the sketch below)
- Benford distribution is a “distribution of distributions”
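As referenced above, a small sketch of this intuition: a quantity growing at a steady 1% per step (an arbitrary assumed rate) spends much more time with small leading digits, and the observed fractions come out close to Benford's probabilities:

```python
import numpy as np

value, leading = 1.0, []
for _ in range(5000):                                  # 5,000 growth steps (illustrative)
    leading.append(int(10 ** (np.log10(value) % 1)))   # leading digit of the current value
    value *= 1.01                                      # steady 1% growth per step

digits, counts = np.unique(leading, return_counts=True)
for d, c in zip(digits, counts):
    print(f"digit {d}: observed {c / len(leading):.3f} vs Benford {np.log10(1 + 1/d):.3f}")
```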
- example 2 - Enron Case (splitting the invoice)
- consider a company where any travel and entertainment expenses over $10,000 must be approved by the vice president
- employees split invoices with amounts over $10,000 to avoid the “approval”
- a spike in first-digit frequencies around 5 and 6, a clear violation of Benford’s Law
notes
- Benford’s Law works best for fraud detection when the criminals are unaware of it
- if the criminals know how the law works, they can fool or cheat it
- you can use Benford’s Law to flag datasets that might be fraudulent, but you can’t use it to prove the datasets are not fraudulent
Chi-square goodness-of-fit test
- use a statistical test to check whether a dataset obeys Benford’s Law
- Chi-square goodness-of-fit test: statistical test to check whether an empirical (observed) distribution differs significantly from a theoretical (expected) distribution
- a significance level (against which the p-value is compared) is used as the discriminator
- commonly used significance level: 0.05, meaning a 5% risk of erroneously concluding that a difference exists when it does not
- other common significance levels: 0.01 or 0.1
- process
- find the degree of freedom (df), defined as number of categories (k) minus 1: df = k – 1
- for Benford’s law, df = 9 – 1 = 8
- calculate the expected frequency count at each level by multiplying the sample size by the theoretical proportions at each level
- calculate the chi-square test statistic: \(\chi^2=\sum_{i=1}^{k}\frac{(O_i-E_i)^2}{E_i}\), where \(O_i\) and \(E_i\) are the observed and expected counts for digit \(i\)
- refer to the chi-square distribution table for the critical value at df = 8 and the chosen significance level
- if the test statistic is less than the critical value, the difference is not statistically significant, so you cannot reject the hypothesis that the observed and theoretical distributions are the same
- for df = 8 and a 0.05 significance level the critical value is 15.51: if the test statistic < 15.51 (equivalently, the p-value > 0.05), there is no statistically significant difference between the observed dataset and the Benford’s Law prediction
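A sketch of the test with SciPy; the observed first-digit counts are hypothetical, while the expected counts, df = 8 and the 15.51 critical value follow the procedure above:

```python
import numpy as np
from scipy.stats import chi2, chisquare

benford_p = np.log10(1 + 1 / np.arange(1, 10))

# Hypothetical observed first-digit counts from a dataset of 1,000 records.
observed = np.array([295, 180, 127, 96, 82, 64, 58, 52, 46])
expected = benford_p * observed.sum()              # expected counts under Benford's Law

stat, p_value = chisquare(f_obs=observed, f_exp=expected)   # df = 9 - 1 = 8
critical = chi2.ppf(0.95, df=8)                             # = 15.51 for alpha = 0.05
print(f"chi-square = {stat:.2f}, p-value = {p_value:.4f}, critical value = {critical:.2f}")
# Here stat < 15.51 (equivalently p-value > 0.05): cannot reject conformance with Benford's Law.
```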