Data Collection

Overview

  • real-life data problems
    • inconsistencies
    • incompleteness
    • duplication
    • merging
    • data size too big to handle, etc.
  • how to deal with
    • pre-process the raw data before proceeding to conduct the analysis steps

Data sources

  • types
    • structured
      • transactional data
      • contractual, subscription, or account data
      • surveys
      • behavioural information
    • unstructured
      • text documents, e.g. emails, web pages, claim forms or multimedia contents
      • contextual or network information
      • qualitative expert-based data
      • publicly available data
    • semi-structured
  • 5Vs
    • velocity
    • volume
    • variety
    • veracity
    • value

Basic concepts

  • data table
    • columns, variables, fields, characteristics, attributes, features, etc. -> same things
    • rows, instances, observations, lines, records, tuples, etc. -> same things
  • data types
    • continuous data
      • e.g. transaction amount, balance on savings account, similarity index
    • categorical data
      • nominal
        • e.g. marital status, payment type, country or region
      • ordinal
        • e.g. age coded as young, middle-aged, and old
      • binary
        • e.g. yes/no, 1/0 (only two values)

Data Pre-Processing

Overview

  • definition
    • a process to transform raw data into some usable format for the next data analytics step
    • provides techniques that help us understand the data and support knowledge discovery at the same time
  • reasons
    • to deal with the problems of raw data, e.g. noise, incompleteness, inconsistencies
    • to make sense of the raw data, i.e. to transform raw data into an understandable format

Basic techniques

  • data integration
    • combines data from multiple sources into a coherent data store
    • problems
      • entity identification problem
      • redundant attribute problem
      • tuple (record) duplication problem
        • redundant records after merging data from different sources
      • data conflict problem
  • data cleaning
    • attempts to fill in missing values (incompleteness), smooth out noise while identifying outliers, and correct inconsistencies in the data
    • "dirty" data
      • incomplete/missing data
        • human input error, intentionally hiding some information
        • not applicable values, e.g. for customers without a Visa card, Visa card transaction fields are not applicable
        • data not matching the search or filter criteria, e.g. transactions > 1 billion
      • noisy data (outliers or errors)
        • reasons of outliers
          • valid observations, e.g. salary of senior management > $1M
          • invalid observations, e.g. age > 300
      • data inconsistencies (similar to data conflict in data integration)
      • duplicate records (similar to data duplication in data integration)
    • resolution - missing data
      • listwise/pairwise deletion
        • problems
          • listwise - reduced sample size and possibility of missing some important information
          • pairwise - still have the problem of drawing conclusions based on a subset of data only
      • data imputation - fill in the missing data
        • 4 techniques
          • which method to use depends on the dataset, the volume of missing data, and familiarity with the various methods
          • model-based estimation uses a single model: if the value of one attribute is missing, the other attributes are used to estimate it
          • multiple imputation uses several models and averages the results; separate imputation models must be built in addition to the analysis model
        • MICE - an R package for multiple imputation (Multivariate Imputation by Chained Equations); see the sketch after this block
          • mice()
            • generates several imputed datasets (5 by default), stored in a mids object
            • these datasets are copies of the original dataframe except that missing values are now replaced with values generated by mice()
          • with()
            • runs the OLS regression on each of the datasets in the mids object
            • obtains a different set of regression coefficients for each dataset, reflecting the effect of each variable on the output
            • the coefficients differ because each dataset contains different imputed values; we do not know in advance which ones are correct
            • the results are stored in a mira object
          • pool()
            • pools the coefficients into a single set of regression coefficients, essentially by taking the mean
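A minimal sketch of the mice() / with() / pool() workflow described above, using the nhanes example data that ships with the mice package; the regression formula chl ~ age + bmi is only illustrative.

```r
library(mice)

data(nhanes)                            # small example dataset with missing values
md.pattern(nhanes)                      # inspect the missing-data pattern

imp <- mice(nhanes, m = 5, seed = 1)    # generate 5 imputed datasets (a mids object)
fit <- with(imp, lm(chl ~ age + bmi))   # run the OLS regression on each dataset (a mira object)
est <- pool(fit)                        # pool the coefficients into a single set
summary(est)
```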
    • resolution - outliers
      • detection
        • calculate minimum, maximum, z-score values for each attribute
        • flag observations as outliers when the absolute value of the z-score is larger than 3
        • use visualization, e.g. histogram, box plots
      • treatment
        • for invalid observations, treat the outlier as a missing value and apply one of the missing-value handling techniques
        • for valid observations, truncation/capping/winsorizing, i.e. set upper and lower limits on a variable
          • z-score (standard deviation), e.g. upper/lower limit = \(M \pm 3\sigma\) (equivalent to \(|z\text{-score}| = 3\))
          • IQR (interquartile range), e.g. upper/lower limit = \(M \pm 3 \times IQR/(2 \times 0.6745)\)
    • outliers or red flags
      • outliers may sometimes actually be the fraud cases, because the behaviour of fraudsters usually deviates from that of normal non-fraudsters
      • these deviations from normal patterns are red flags of fraud, e.g. a small payment followed immediately by a large payment -> may be credit card fraud
      • be cautious when dealing with outliers -> flag them for further analysis rather than removing them blindly
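A base-R sketch of the z-score detection and winsorizing treatment above, on a made-up income vector; the constants 3 and 0.6745 follow the formulas given earlier.

```r
set.seed(1)
income <- c(rnorm(1000, mean = 50000, sd = 10000), 1e6)   # made-up data plus one extreme value

# detection: flag observations with |z-score| > 3
z <- (income - mean(income)) / sd(income)
outliers <- which(abs(z) > 3)

# treatment of valid outliers: winsorize at M +/- 3 standard deviations
upper <- mean(income) + 3 * sd(income)
lower <- mean(income) - 3 * sd(income)
income_capped <- pmin(pmax(income, lower), upper)

# alternative limits based on the IQR: M +/- 3 * IQR / (2 * 0.6745)
iqr_lim <- 3 * IQR(income) / (2 * 0.6745)
income_capped_iqr <- pmin(pmax(income, median(income) - iqr_lim),
                          median(income) + iqr_lim)
```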
  • data transformation

    • data are transformed into an appropriate format for data analysis; this step is part of the ETL (Extract, Transform, Load) process
    • purposes
      • for easy comparison among different datasets with diverse formats
      • for easy combination with other data sets to provide insights
      • to perform aggregation of data
    • techniques

      • normalization
        • required only when attributes have different ranges
        • e.g. AGE ranges from 0 - 100 while INCOME ranges from 10,000 - 100,000 -> INCOME might have a larger effect on the predictive power of the model purely due to its larger scale
        • for continuous variables
          • min-max normalization (range normalization)
          • z-score standardization
          • (natural) log or base-10 log
          • square root
          • inverse
          • square
          • exponential
          • centring (subtract mean)
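A base-R sketch of several of the normalization options listed above, applied to a hypothetical INCOME vector.

```r
income <- c(10000, 25000, 40000, 75000, 100000)     # hypothetical INCOME values

income_minmax  <- (income - min(income)) / (max(income) - min(income))  # min-max to [0, 1]
income_zscore  <- (income - mean(income)) / sd(income)                  # z-score standardization
income_log     <- log(income)                                           # natural log (log10() for base 10)
income_sqrt    <- sqrt(income)                                          # square root
income_centred <- income - mean(income)                                 # centring (subtract the mean)
```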
      • attribute (feature) construction
        • transform a given set of input features to generate a new set of more powerful features
        • purpose
          • dimensionality reduction
          • prediction performance improvement
        • method - PCA
      • discretization

        • replace the values of numeric attribute (continuous variable) by conceptual values (discrete variable)
        • e.g. replace AGE (numeric value) with AGEGROUP (children, youth, adult, elderly), group rare levels into one discrete group "OTHER"
        • method - binning transformation (or categorization)

          • purpose
            • to reduce the number of categories for categorical variables
            • to reduce the number of values for a given continuous variable by dividing the range of variable into discrete intervals
          • example - non-linear relation of default risk vs age
            • the relation can only be captured when non-linear models are used (e.g. neural networks)
            • will not work well for linear models (e.g. regression)
            • group the variable into ranges so that the non-monotonicity can be captured (e.g. grouping ages 25-45 into one bin, the peak within that range disappears)
          • method #1
            • equal interval binning - divide the value range into intervals of equal width
              • BIN 1 (range 1,000 - 1,500): A, B, C, F
              • BIN 2 (range 1,500 - 2,000): D, E
            • equal frequency binning - sort first, then divide into groups containing the same number of observations
              • BIN 1: A, B, C
              • BIN 2: D, E, F
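A base-R sketch of the two binning methods above; the six income values for observations A-F are made up so that they reproduce the bins shown.

```r
income <- c(A = 1050, B = 1200, C = 1300, D = 1600, E = 1900, F = 1450)  # made-up values

# equal interval binning: split the 1,000-2,000 range into equal-width intervals
equal_interval <- cut(income, breaks = c(1000, 1500, 2000),
                      labels = c("BIN1", "BIN2"), include.lowest = TRUE)
split(names(income), equal_interval)    # BIN1: A B C F, BIN2: D E

# equal frequency binning: break points at quantiles, so each bin has the same number of observations
equal_freq <- cut(income, breaks = quantile(income, probs = c(0, 0.5, 1)),
                  labels = c("BIN1", "BIN2"), include.lowest = TRUE)
split(names(income), equal_freq)        # BIN1: A B C, BIN2: D E F
```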
          • method #2 - chi-square test

            • purpose - combine categories/bins in a statistically justified way
            • formula

              \[\chi^2 = \sum \frac{{(observed \ value - expected \ value)}^2}{expected \ value}\]

            • example

              • blue = observed counts, red = expected counts (in the original figure)
              • if the bins are significantly different, the split is kept (if not, try another combination)
                \(\chi^2 = \frac{(6000-5670)^2}{5670} + \frac{(300-630)^2}{630} + \frac{(1950-2241)^2}{2241} + \frac{(540-249)^2}{249} + \frac{(1050-1089)^2}{1089} + \frac{(160-121)^2}{121} = 583\)
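The same chi-square statistic computed in R from the observed and expected counts in the worked example above.

```r
observed <- c(6000, 300, 1950, 540, 1050, 160)
expected <- c(5670, 630, 2241, 249, 1089, 121)

chi_sq <- sum((observed - expected)^2 / expected)
chi_sq    # roughly 583-584; a large value means the bins differ significantly
```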
  • data sampling/reduction

    • makes it possible to perform data analysis on huge amounts of data that may take a very long time if not reduced in size
    • purpose
      • increase storage efficiency
      • reduce data storage and analysis costs
    • methods
      • reduce number of cases
        • aggregation (e.g. "DAILY" transforms to "MONTHLY")
        • sampling
          • purpose
            • to produce a sufficiently representative sample for further analysis
            • to reduce the size of an extremely large dataset to manageable size
          • pros & cons
            • reduces costs and is more operationally efficient, while still giving good prediction results
            • still involves uncertainty, since the results are only estimates based on a subset of the data
          • methods
            • simple random sampling
              • randomly select the sample from the dataset
              • the probability of being selected is the same
            • systematic sampling
              • sample every \(k^{th}\) observation from the dataset from a random start
              • suitable for cases with the need to keep the sequences or periodicity of the dataset
              • e.g. to create a sample of \(n=20\) from a dataset size \(N=200\)
                • randomly select an integer from 1 to 10 (200/20), e.g. random# = 6
                • start with the \(6^{th}\) observation and then select every \(10^{th}\) observation (i.e. observations 6, 16, 26, ...) as the sample
            • stratified sampling
              • divide the dataset into groups (strata) according to similar characteristics
              • use simple random sampling within each group (stratum)
              • more precise than simple random sampling if the strata are homogeneous
            • cluster sampling
              • divide the dataset into groups (clusters) according to similar characteristics
              • randomly select whole clusters as the sample (sampled clusters) -> every observation in a selected cluster is included
              • used when the dataset is too widely dispersed
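A base-R sketch of simple random, systematic, and stratified sampling on a hypothetical data frame df of N = 200 rows; the region column used as the stratum is made up.

```r
set.seed(1)
N  <- 200
df <- data.frame(id = 1:N, region = sample(c("North", "South"), N, replace = TRUE))

# simple random sampling: n = 20 rows, each with equal selection probability
srs <- df[sample(N, size = 20), ]

# systematic sampling: random start in 1..k, then every k-th observation
k     <- N / 20                                    # k = 10
start <- sample(1:k, 1)                            # random start, e.g. 6
sys   <- df[seq(from = start, to = N, by = k), ]   # observations 6, 16, 26, ...

# stratified sampling: simple random sample of 10% within each stratum
strat <- do.call(rbind, lapply(split(df, df$region),
                               function(s) s[sample(nrow(s), ceiling(0.1 * nrow(s))), ]))
```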
      • reduce number of distinct values or variables
        • binning
      • variable selection
        • purpose
          • dimensionality reduction
          • operational efficiency - reduce time and memory
          • better interpretability with easier visualization using fewer variables
          • eliminate irrelevant features (e.g. typically only 10-15 features end up being useful in fraud detection models)
        • input variables are selected (or filtered) based on the usefulness or relation with target variables, using
          • correlation with target variable (most commonly used)
          • information criteria
          • clustering of variables
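A small sketch of filter-based variable selection using correlation with the target; the data and the 0.1 cutoff are made up for illustration.

```r
set.seed(1)
n <- 500
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 2 * X$x1 - X$x2 + rnorm(n)              # made-up target driven by x1 and x2 only

cors <- sapply(X, function(col) cor(col, y)) # correlation of each input with the target
selected <- names(cors)[abs(cors) > 0.1]     # keep variables above an arbitrary cutoff
selected                                     # typically "x1" and "x2"
```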

Imbalanced Data Handling

Imbalanced dataset

  • definition
    • also known as skewed dataset
    • a special case of classification problems where the class distribution is not uniform among the classes
    • typically composed of two classes
      • majority class
      • minority class
  • problems in machine learning
    • most machine learning models assume an equal distribution of classes
    • a model may focus on learning the characteristics of majority class due to the abundance of samples available for learning
    • many machine learning models will show bias towards majority class, leading to incorrect conclusions
  • slight imbalance vs severe imbalance
    • if the dataset is only slightly imbalanced (e.g. a ratio of 4:6), it can still be used for training directly

Handling methods

  • varying the sample window
    • increase the number of fraudsters by increasing the time horizon
      • e.g. using 12-month window rather than 6-month window
    • sample every fraudster twice or more
  • undersampling and oversampling
    • undersampling
      • reduce the number of majority class observations
      • some information is lost
    • oversampling
      • increase the number of minority class observations (e.g. by replicating them)
      • no information is lost from either the majority or the minority class (advantage)
      • it is prone to overfitting
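A base-R sketch of random undersampling and oversampling on a made-up dataset with a 5% fraud rate; dedicated packages exist, but the core idea is just resampling row indices.

```r
set.seed(1)
df <- data.frame(amount = rexp(1000, rate = 1 / 100),
                 fraud  = c(rep(1, 50), rep(0, 950)))   # 5% minority (fraud) class

minority <- which(df$fraud == 1)
majority <- which(df$fraud == 0)

# undersampling: keep all minority rows, randomly drop majority rows to the same size
under <- df[c(minority, sample(majority, length(minority))), ]

# oversampling: keep all majority rows, replicate minority rows with replacement
over  <- df[c(majority, sample(minority, length(majority), replace = TRUE)), ]

table(under$fraud)
table(over$fraud)
```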
  • SMOTE (Synthetic Minority Over-Sampling Technique)
    • creates synthetic observations based upon the existing minority observations
    • combines synthetic oversampling of the minority class with undersampling of the majority class
    • is better than either under-/over- sampling, can be used in fraud detection
    • example - synthetic observations are created along the line segments joining a minority observation to its nearest minority-class neighbours
    • problem
      • may create a "bridge" through the majority class region if a minority observation is an outlier lying among the majority observations
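A minimal sketch of the SMOTE interpolation idea (new points on the line segment between a minority observation and one of its k nearest minority neighbours); this is illustrative only, not the reference implementation.

```r
smote_sketch <- function(minority, k = 5, n_new = 100) {
  d <- as.matrix(dist(minority))                 # pairwise distances between minority observations
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(minority))
  for (i in seq_len(n_new)) {
    a   <- sample(nrow(minority), 1)             # pick a random minority observation
    nn  <- order(d[a, ])[2:(k + 1)]              # its k nearest minority neighbours (skip itself)
    b   <- nn[sample(length(nn), 1)]             # pick one neighbour at random
    gap <- runif(1)                              # random position on the connecting line
    synthetic[i, ] <- minority[a, ] + gap * (minority[b, ] - minority[a, ])
  }
  synthetic
}

set.seed(1)
min_obs <- matrix(rnorm(20 * 2), ncol = 2)       # 20 made-up minority observations, 2 features
new_obs <- smote_sketch(min_obs, k = 3, n_new = 50)
```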

  • ADASYN (Adaptive Synthetic Sampling)
    • a generalization of the SMOTE algorithm
    • takes the density distribution of the minority observations into account
    • finds the k nearest neighbours of each minority instance, then calculates the ratio of majority to minority instances in that neighbourhood to decide how many new samples to create
    • this impurity ratio is calculated for all minority data points
    • the higher the ratio, the more synthetic (minority class) data points are created around that observation
    • example
      • the number of synthetic data points created for Obs3 will be double that for Obs2
  • other approaches
    • using probabilities
      • likelihood approach
      • adjusting posterior probabilities
    • cost-sensitive learning - assign higher misclassification costs to minority class (i.e. fraud cases)
  • notes
    • the performance of the sampling and cost-sensitive learning approaches is good and broadly equivalent for handling imbalanced datasets
    • the sampling approaches are suggested for fraud analytics

Exploratory Data Analysis

Overview

  • definition
    • the process of using quick and simple methods for the visualization and examination of small data, e.g. boxplots, histograms, etc.
    • but traditional EDA is not always feasible for analysing big datasets, due to the 5Vs
  • purpose
    • to have a better understanding of the data before building the fraud detection model
    • to detect problems in data

Basic techniques

  • non-graphical - statistics summary
    • illustrate relationships between two or more data variables using statistics or cross-tabulation
    • as a preview to check the symmetry or asymmetry of the distribution, i.e. skewness of the data
    • calculate for the whole sample set and target set
    • using descriptive statistics
      • mean, median (for continuous variables)
      • mode (for categorical variables)
      • standard deviation (represent how much data is spread around the mean)
      • percentile distribution (percentage/distribution of the data)
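A quick base-R sketch of these summary statistics on a made-up claims dataset with a binary fraud target.

```r
set.seed(1)
claims <- data.frame(amount = rexp(500, rate = 1 / 2000),          # made-up claim amounts
                     fraud  = rbinom(500, size = 1, prob = 0.05))  # made-up fraud flag

summary(claims$amount)                            # min, quartiles, median, mean, max
sd(claims$amount)                                 # spread around the mean
quantile(claims$amount, probs = seq(0, 1, 0.1))   # percentile distribution
table(claims$fraud)                               # frequencies of the categorical target
tapply(claims$amount, claims$fraud, median)       # median amount per class (non-fraud vs fraud)
```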
  • graphical - visualization
    • using picture of data, such as stem-and-leaf plots, box plots, pie chart, histogram, scatter plot
    • missing data, outliers, and distribution can be more easily identified using visualization of the data
    • example
      • the maximum value of hospital stay is 668 (> 365) -> impossible
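The corresponding graphical checks, reusing the made-up claims data frame from the previous sketch; extreme values like the 668-day stay stand out immediately in a histogram or boxplot.

```r
hist(claims$amount, breaks = 30,
     main = "Claim amount", xlab = "Amount")          # distribution shape / skewness
boxplot(amount ~ fraud, data = claims,
        names = c("non-fraud", "fraud"),
        main = "Claim amount by class")               # outliers appear beyond the whiskers
```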
  • correlation analysis

    • find out the variable relations (positively or negatively correlated), e.g. use scatter plots, correlation
    • pearson correlation (mostly used)

      • the covariance of the two variables divided by the product of their standard deviations
      • if the absolute value of the coefficient lies between 0.5 and 1.0, the variables are highly correlated
      • if the absolute value of the coefficient is below 0.29, the correlation is low

      \[\rho_{X, Y}=\frac{cov(X, Y)}{\sigma_X \sigma_Y}\]

    • example

      • many of the variables are not significantly correlated with each other or with the class variable - partly because this is an imbalanced-class dataset
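A base-R sketch of the Pearson correlation on made-up variables, both as a single coefficient and as a correlation matrix.

```r
set.seed(1)
x <- rnorm(200)
y <- 0.8 * x + rnorm(200, sd = 0.5)   # made up to be positively correlated with x
z <- rnorm(200)                       # unrelated noise variable

cor(x, y, method = "pearson")         # single coefficient, cov(x, y) / (sd(x) * sd(y))
round(cor(data.frame(x, y, z)), 2)    # correlation matrix; |rho| > 0.5 -> highly correlated
```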
  • PCA (Principal Component Analysis)

    • reduces the dimensionality of a huge dataset down to a manageable number of dimensions
    • only keep the principal components that capture most of the information (variance) in the data
    • example
      • the first few principal components are most important
      • can use only these instead of the original 50+ variables
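A base-R sketch of PCA with prcomp() on the built-in iris measurements as a stand-in; in practice the same call would run on the 50+ fraud variables mentioned above.

```r
X   <- iris[, 1:4]                                # four numeric variables as a small stand-in
pca <- prcomp(X, center = TRUE, scale. = TRUE)    # PCA on standardized variables

summary(pca)        # proportion of variance explained by each principal component
pca$rotation        # loadings of the original variables on each component

scores <- pca$x[, 1:2]   # keep only the first two components as new features
head(scores)
```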