FITE7410 Autoencoder and Benford's Law

Review and Autoencoder

A qiuck review

assumptions underlying PCA
- is not Gaussian distribution
- interval-level measurement
- random sampling
- linearity
- large variances have important structure
- the principal components are orthogonal
fraud detection with outliers
- assume data has N features (N-dimensions)
- using PCA to find key components
- identify outliers using the key components
the reason of using PCA
- N-dimensional data are difficult to understand
- PCA supports dimension reduction
PCA
- is a technique to reduce the dimensionality of data by forming new variables that are linear composites of the original variables

Autoencoder

definition
- an autoencoder is a type of artificial neural network used to learn efficient data coding in an unsupervised manner
- it is an neural network that is trained to recreate as output what it is fed as input
characteristics
- for non-linear situation
- feed-forward neural network
- encoder and decoder
- unsupervised learning
- reconstruct the input in the output layer
- code layer is the latent representation
- with autoencoder, the hidden layers will map the input to a vector space in a “nice” way
- using fewer neurons in the output layer, we map the input from a high dimensional space to one with few dimensions, i.e. dimension reduction
- for higher dimensional data, autoencoders are capable of learning a complex representation of the data (manifold) which can be used to describe observations in a lower dimensionality and correspondingly decoded into the original input space
applications
- dimensionality reduction
- information retrieval
- anomaly detection
  - fraud detection
  - insider detection
  - intrusion detection
- image processing
- drug discovery
- population synthesis
- popularity prediction
- machine translation
using reconstruction error to find the fraudulent transactions
- the dataset has too few fraudulent cases
- create an autoencoder and train it only on non-fraudulent transactions
  - train the autoencoder with non-fraudulent transactions only
  - after 100 epochs, we have an encoder
  - feed all transactions of the training data set into the trained encoder
  - calculate the error (reconstruction error) between the input and the output
    - reconstruction error
      - autoencoders compress the input into a lower-dimensional projection and then reconstruct the output from this representation
      - reconstruction error is the distance between the original input and its autoencoder reconstruction
      \[L(x,x^{\prime})={||x-x^{\prime}||}^2\]
- evaluate the result
  - it is expected that the reconstruction error of the autoencoder is higher for fraudulent transactions than the normal transactions
  - transaction $t$ is defined to be fraudulent if it falls outside a confidence interval that is d standard deviations from the mean for some parameter $d: mean + d \times stdev$
  - if the data is normally distributed
    - if the reconstruction error of a transaction > $d: mean + 3 \times stdev$, mark the case as fraudulent
  - if the data is not normally distributed
    - by Chebyshev's inequality, the probability of a value falling outside this interval is at most $\frac{1}{d^2}$
      - for d = 3, the probability is at most 0.1111
      - for d = 4, the probability is at most 0.0625

A real autoencoder

process
- normalize and scale the data set
- randomly split into training data set (80%) and testing data set (20%)
- create the autoencoder: 5 fully connected layers with 29, 14, 7, 14 and 29 (for example) neurons respectively
- feed the normal data set into the autoencoder model for training
- check whether the reconstruction error on the training data set converge or not
  - the reconstruction error on the training and testing data sets seems to converge nicely
- evaluate the performance using the testing data set
evaluation
- metrics
  - $Recall=\frac{TP}{TP+FN}=TPR$
    - measures how many relevant results are returned
  - $Precision=\frac{TP}{TP+FP}$
    - measures the relevancy of obtained results
  - $F1=2 \times \frac{Recall \times Precision}{Recall+Precision}$
  - $True Positive Rate (TPR)=\frac{TP}{TP+FN}$
  - $False Positive Rate (FPR)=\frac{FP}{FP+TN}$
  - AUC = Area Under ROC Curve
  - good to have both Precision and Recall values equal to 1
  - high recall low precision: classify many fraud transactions, most of them are normal transactions (high false positives, low relevancy)
  - high precision low recall: classify few fraud transactions (with high relevancy), many fraud transactions cannot be found
- optimal threshold
  - threshold value
    - the boundary between the normal and fraudulent transactions using the autoencoder
  - different threshold value produces different reconstruction error
    - when threshold value increases, reconstruction error increases
  - when threshold value increases, the precision increases and the recall decreases
    - when threshold = 50, the recall is 0.2 and the precision is 0.4
  - comparing different threshold values
    - higher threshold value
      - lower in false classification of normal transaction, but discovery of fewer fraudulent cases
    - lower threshold value
      - more fraudulent cases can be found, but more false classification of normal transaction
- ROC curve (Receiver Operating Characteristic curve)
  - change the threshold, show the TPR vs FPR
  - a graph showing the performance of a classification model for all classification thresholds
  - lowering the classification threshold classifies more items as positive
    - increasing both False Positives and True Positives
- Precision-Recall curve
  - a high area under the curve represents both high recall and high precision
    - high precision means to false positive rate
    - high recall means low false negative rate
  - high scores for both precision and recall means the classifier returns accurate results (high precision) and finds majority of all fraudulent transactions (high recall)

Fraud detection using autoencoder and k-NN

process
- using only the encoder part of an autoencoder (dimension reduction)
- apply k-Nearest Neighbors (k-NN) for classification to the instances in the reduced dimension space
- both normal and fraudulent transactions will be used in training
example
- train the encoder with all transaction in the training data set
- for the k-NN classification, both training and testing transactions will be mapped to the “reduced” 12-dimensional space with the encoder
- for each transaction in the testing data set, the 3 closest neighboring instances of the training data set decide whether it is a fraudulent transaction or not

Benford's Law (The First Digit Law)

Overview

content
- the frequency of occurrence of the leading digits in “naturally occurring” numerical distributions is predictable and nonuniform, but closer to a power-law distribution
- a given number is six times more likely to start with a 1 than a 9
- this is counterintuitive, people would expect a uniform distribution
characteristics
- a series of numerical records follows Benford’s Law when they
  - represents magnitudes of events or events, such as populations of cities, flows of water in rivers
  - do not have pre-established minimum or maximum limits
  - are not made up of numbers used as identifiers, such as ID cards, bank accounts, telephone numbers
  - have a mean which is less than the median, and the data is not concentrated around the mean

Benford's law in fraud detection

an experiment
- start with a Benford dataset D
- D was contaminated by a sample from a normal distribution with the same mean and the same variance of D
- the data were replaced at random in D by the fraudulent, so that the size of the dataset D’ was kept constant
- measure the distance (Euclidean distance) of the contaminated dataset D from Benford’s Law
observations
- the figure shows the evolution of the distance from Benford’s Law to the size of the fraudulent sample (% of the size of the dataset D)
- range of contamination was chosen to be 0-10%
- an exponential curve was found to be “good” fit, i.e. distance from Benford’s Law increases exponentially as the dataset is more contaminated
  - distance can be used as a measure for fraud detection
  - variance in distance is large, especially for high level of contamination → separating noise from fraud may be difficult
mathmatically usage
- the frequency of occurrence of the leading digit D in naturally occurring numerical distributions following a logarithmically decaying distribution
\[Pr(D_1 = d) = log_{10}(1+\frac{1}{d})\]
- result
- idea
  - if a certain set of values follows Benford’s Law, then model’s for the corresponding predicted values should also follow Benford’s Law
- for fraud detection
  - manipulated or fraudulent data do not trend to confirm to Benford’s Law, whereas unmanipulated data do
    - Benford’s Law can be used to investigate fraud patterns for datasets in fields where categorical values have been counted, such as medical test results or income tax revenues
- example 1 - consider financial data
  - assume someone owes a stock with a value of $1,000
  - for his fund to arrive $2,000, it would have to double by growing 100%
  - to further grow from $2,000 to $3,000, it would only need to increase by 50%
  - to further grow from $3,000 to $4,000, i.e. first digit from 3 to 4, it would need to increase by 33.3%
  - therefore, for the first digit from 1 to 2 (100%), it requires more growth than for the first digit from 3 to 4 (33.3%)
  - Benford distribution is a “distribution of distributions”
- example 2 - Enron Case (splitting the invoice)
  - consider a company where any travel and entertainment expenses over $10,000 must be approved by the vice president
  - employees will split invoice with amount over $10,000 to avoid the “approval”
  - a spike in first-digit frequencies around 5 and 6, clear violation of Bendford’s Law
notes
- Benford’s Law works best for fraud detection when the criminals are unaware of it
- if the criminals know how the law works, they can fool or cheat it
- you can use Benford’s Law to flag datasets that might be fraudulent, but you can’t use it to prove the datasets are not fraudulent
Chi-square goodness-of-fit test
- use statistical test to verify that a dataset obeys Benford’s law
- Chi-square goodness-of-fit test: statistical test to check whether an empirical (observed) distribution differs significantly from a theoretical (expected) distribution
- a significance level (p-value) is used as the discriminator
  - commonly used p-value: 0.05, means 5% risk of erroneously concluding a differences exists when it does not
  - other p-value: 0.01 or 0.1
- process
  - find the degree of freedom (df), defined as number of categories (k) minus 1: df = k – 1
    - for Benford’s law, df = 9 – 1 = 8
  - calculate the expected frequency count at each level by multiplying the sample size by the theoretical proportions at each level
  - calculate the chi-square random variable, aka the test statistic
  - refer to the chi-square distribution table for the p-value for df=8
  - if the test statistic is less than the p-value in the table, it is considered as significant, then you cannot reject the hypothesis that the observed and theoretical distributions are the same
    - if the test statistic < 15.51, the p-value is > 0.05, there is no statistically significant difference between the observed dataset and the Benford’s law prediction

FITE7410 Autoencoder and Benford’s Law

Review and Autoencoder

Benford's Law (The First Digit Law)