FITE7410 Processing Fraud Data
Data Collection
Overview
- real-life data problems
- inconsistencies
- incompleteness
- duplication
- merging
- data size too big to handle, etc.
- how to deal with these problems
- pre-process the raw data before proceeding to conduct the analysis steps
Data sources
- types
- structured
- transactional data
- contractual, subscription, or account data
- surveys
- behavioural information
- unstructured
- text documents, e.g. emails, web pages, claim forms or multimedia contents
- contextual or network information
- qualitative expert-based data
- publicly available data
- semi-structured
- 5Vs
- velocity
- volume
- variety
- veracity
- value
Basic concepts
- data table
- columns, variables, fields, characteristics, attributes, features, etc. -> same things
- rows, instances, observations, lines, records, tuples, etc. -> same things
- data types
- continuous data
- e.g. transaction amount, balance on savings account, similarity index
- categorical data
- nominal
- e.g. marital status, payment type, country or region
- ordinal
- e.g. age coded as young, middle-aged, and old
- binary
- e.g. yes/no, 1/0 (only two values)
Data Pre-Processing
Overview
- definition
- a process to transform raw data into some usable format for the next data analytics step
- provides techniques that help us understand the data and discover knowledge from it at the same time
- reasons
- to deal with the problems of raw data, e.g. noise, incompleteness, inconsistencies
- to make sense of the raw data, i.e. to transform raw data into an understandable format
Basic techniques
- data integration
- combines data from multiple sources into a coherent data store
- problems
- entity identification problem
- redundant attribute problem
- tuple (record) duplication problem
- redundant records after merging data from different sources
- data conflict problem
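- a minimal R sketch of these integration steps; the data frames and column names (crm, core, cust_id) are made-up illustrations, not from the course:

```r
# Two hypothetical customer tables from different source systems
crm  <- data.frame(cust_id = c(1, 2, 3), name = c("Ann", "Bob", "Carl"))
core <- data.frame(cust_id = c(2, 3, 3), balance = c(500, 120, 120))

# Entity identification: match records from both sources on a common key
merged <- merge(crm, core, by = "cust_id", all.x = TRUE)

# Tuple (record) duplication: drop rows that are identical after merging
merged <- merged[!duplicated(merged), ]
merged
```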
- data cleaning
- attempts to fill in missing values (incompleteness), smooth out noise while identifying outliers, and correct inconsistencies in the data
- "dirty" data
- incomplete/missing data
- human input errors, or information intentionally hidden
- not-applicable values, e.g. for customers without a Visa card, Visa card transaction fields are not applicable
- values not matching search or filter criteria, e.g. transactions > 1 billion
- noisy data (outliers or errors)
- reasons of outliers
- valid observations, e.g. salary of senior management > $1M
- invalid observations, e.g. age > 300
- data inconsistencies (similar to data conflict in data integration)
- duplicate records (similar to data duplication in data integration)
- resolution - missing data
- listwise/pairwise deletion
- problems
- listwise - reduced sample size and possibility of missing some important information
- pairwise - still have the problem of drawing conclusions based on a subset of data only
- data imputation - fill in the missing data
- 4 techniques
- which method to use depends on the dataset, the volume of missing data, and familiarity with the various methods
- model-based estimation uses a single model: if the value of one attribute is missing, the other attributes are used to estimate it
- multiple imputation uses several models and averages the results; separate imputation models need to be built in addition to the analysis model
- MICE - a package in R for multiple imputation (Multivariate Imputation by Chained Equations)
- mice()
- generates several imputed datasets (5 by default), stored in a mids object
- these datasets are copies of the original dataframe except that missing values are now replaced with values generated by mice()
- with()
- runs the OLS regression on all datasets in the mids object
- obtain a different regression coefficient for each dataset, reflecting the effect of each variable on output
- coefficients are different because each dataset contains different imputed values, we do not know which one is correct in advance
- the results are stored in a mira object
- pool()
- combines the coefficients into one set of regression coefficients, essentially by averaging them (see the R sketch below)
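- a minimal sketch of the mice() / with() / pool() workflow, using the small nhanes example dataset shipped with the mice package and a simple linear model (the variables here are illustrative, not fraud data):

```r
library(mice)

data(nhanes)                           # example dataset with missing values
imp <- mice(nhanes, m = 5, seed = 1)   # generate 5 imputed datasets (a mids object)

# fit the same regression on every imputed dataset (returns a mira object)
fit <- with(imp, lm(chl ~ age + bmi))

# pool the 5 sets of coefficients into one set of estimates
summary(pool(fit))
```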
- resolution - outliers
- detection
- calculate minimum, maximum, z-score values for each attribute
- define outliers as observations where the absolute value of the z-score is larger than 3
- use visualization, e.g. histogram, box plots
- treatment
- for invalid observations, treat the outlier as missing value and can use one of the techniques for handling missing value to deal with outlier value
- for valid observations, truncation/capping/winsorizing, i.e. set upper and lower limits on a variable
- z-score (standard deviation) method, e.g. upper/lower limit = \(\mu \pm 3\sigma\), i.e. cap at an absolute z-score of 3
- IQR (interquartile range) method, e.g. upper/lower limit = \(M \pm 3 \times IQR/(2 \times 0.6745)\), where \(M\) is the median and \(IQR/(2 \times 0.6745)\) is a robust estimate of the standard deviation (see the R sketch below)
- outliers or red flags
- outliers may sometimes actually be the fraud cases, because the behaviour of fraudsters usually deviates from that of normal non-fraudsters
- these deviations from normal patterns are red flags of fraud, e.g. a small payment followed immediately by a large payment -> may be credit card fraud
- be cautious when dealing with outliers and mark them for further analysis
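- a base-R sketch of the detection and winsorizing steps above, on a made-up amount vector:

```r
set.seed(1)
amount <- c(rnorm(100, mean = 500, sd = 50), 5000)   # one artificial outlier

# detection: flag observations with an absolute z-score larger than 3
z <- (amount - mean(amount)) / sd(amount)
which(abs(z) > 3)

# treatment 1: cap at mean +/- 3 standard deviations (z-score method)
upper <- mean(amount) + 3 * sd(amount)
lower <- mean(amount) - 3 * sd(amount)
amount_z <- pmin(pmax(amount, lower), upper)

# treatment 2: cap at median +/- 3 * IQR / (2 * 0.6745) (IQR method)
m      <- median(amount)
spread <- IQR(amount) / (2 * 0.6745)   # robust estimate of the standard deviation
amount_iqr <- pmin(pmax(amount, m - 3 * spread), m + 3 * spread)
```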
- data transformation
- data are transformed into an appropriate format for data analysis, also known as ETL (Extract, Transform, Load)
- purposes
- for easy comparison among different data sets with diverse format
- for easy combination with other data sets to provide insights
- to perform aggregation of data
- techniques
- normalization
- required only when attributes have different ranges
- e.g. AGE range from 0 - 100, INCOME range from 10,000 - 100,000 -> INCOME might have larger effect on the predictive power of the model due to its large value
- for continuous variables
- min-max normalization (range normalization)
- z-score standardization
- (natural) log or base-10 log
- square root
- inverse
- square
- exponential
- centring (subtract mean)
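- a base-R sketch of min-max normalization and z-score standardization on an illustrative INCOME vector:

```r
income <- c(10000, 25000, 40000, 70000, 100000)

# min-max (range) normalization to [0, 1]
income_minmax <- (income - min(income)) / (max(income) - min(income))

# z-score standardization: mean 0, standard deviation 1
income_z <- (income - mean(income)) / sd(income)   # or simply scale(income)

# simple non-linear transforms to reduce skewness
income_log  <- log(income)    # natural log
income_sqrt <- sqrt(income)
```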
- attribute (feature) construction
- transforms a given set of input features to generate a new set of more powerful features
- purpose
- dimensionality reduction
- prediction performance improvement
- method - PCA
- discretization
- replace the values of numeric attribute (continuous variable) by conceptual values (discrete variable)
- e.g. replace AGE (numeric value) with AGEGROUP (children, youth, adult, elderly), group rare levels into one discrete group "OTHER"
- method - binning transformation (or categorization)
- purpose
- to reduce the number of categories for categorical variables
- to reduce the number of values for a given continuous variable by dividing the range of variable into discrete intervals
- example - non-linear relation of default risk vs age
- the non-linear relation fits only when non-linear models are used (e.g. neural networks)
- it will not work well for linear models (e.g. regression)
- group the variable into ranges so that the non-monotonicity can be captured (e.g. grouping ages 25-45 into one bin; the peak would disappear)
- method #1
- equal interval binning
- BIN 1 (range 1,000 - 1,500): A, B, C, F
- BIN 2 (range 1,500 - 2,000): D, E
- equal frequency binning - sort firstly and divide into groups with the same volume
- BIN 1: A, B, C
- BIN 2: D, E, F
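- a base-R sketch of the two binning methods; the six amounts for observations A-F are made up so that they match the BIN assignments quoted above:

```r
amount <- c(A = 1000, B = 1200, C = 1400, F = 1450, D = 1600, E = 1900)

# equal interval binning: split the range 1,000-2,000 into bins of equal width
equal_interval <- cut(amount, breaks = c(1000, 1500, 2000),
                      include.lowest = TRUE, labels = c("BIN1", "BIN2"))

# equal frequency binning: split at the median so each bin has the same count
equal_frequency <- cut(amount, breaks = quantile(amount, probs = c(0, 0.5, 1)),
                       include.lowest = TRUE, labels = c("BIN1", "BIN2"))

table(equal_interval)    # 4 vs 2 observations (A, B, C, F / D, E)
table(equal_frequency)   # 3 vs 3 observations (A, B, C / D, E, F)
```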
- method #2 - chi-square test
- purpose - to combine categories into bins better
- formula
\[\chi^2 = \sum \frac{(\text{observed value} - \text{expected value})^2}{\text{expected value}}\]
- example
- observed vs. expected frequencies (in the original figure, blue = observed, red = expected)
- if the bins are significantly different, the split is made (if not, try another combination): \(\chi^2 = \frac{(6000-5670)^2}{5670} + \frac{(300-630)^2}{630} + \frac{(1950-2241)^2}{2241} + \frac{(540-249)^2}{249} + \frac{(1050-1089)^2}{1089} + \frac{(160-121)^2}{121} \approx 583\)
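- the \(\chi^2\) value can be reproduced in R from the observed and expected counts of the worked example:

```r
observed <- c(6000, 300, 1950, 540, 1050, 160)
expected <- c(5670, 630, 2241, 249, 1089, 121)

chi_sq <- sum((observed - expected)^2 / expected)
chi_sq   # approximately 583.9
```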
- data sampling/reduction
- makes it possible to perform data analysis on huge amounts of data that may take a very long time if not reduced in size
- purpose
- increase storage efficiency
- reduce data storage and analysis costs
- methods
- reduce number of cases
- aggregation (e.g. "DAILY" transforms to "MONTHLY")
- sampling
- purpose
- to produce a sufficiently representative sample for further analysis
- to reduce the size of an extremely large dataset to manageable size
- pros & cons
- reduces costs and is more operationally efficient, while still giving good prediction results
- still involves uncertainty, since the results are only estimates
- methods
- simple random sampling
- randomly select the sample from the dataset
- the probability of being selected is the same
- systematic sampling
- sample every \(k^{th}\) observation from the dataset from a random start
- suitable for cases with the need to keep the sequences or periodicity of the dataset
- e.g. to create a sample of \(n=20\) from a dataset size \(N=200\)
- randomly select an integer from 1 to 10 (200/20), e.g. random# = 6
- start with \(6^{th}\) observation and select every \(6^{th}\) unit observation as sample
- stratified sampling
- divide the dataset into groups (strata) according to similar characteristics
- using simple random sampling for each group (strata)
- more precise than simple random sampling if the strata are homogeneous
- cluster sampling
- divide the dataset into groups (clusters) according to similar characteristics
- randomly select clusters as samples (sampled clusters) -> select the whole group
- used when the dataset is too widely dispersed
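- base-R sketches of the four sampling schemes on a hypothetical transaction data frame (N = 200, n = 20); the region column used for strata/clusters is a made-up illustration:

```r
set.seed(42)
txn <- data.frame(id     = 1:200,
                  amount = round(runif(200, 10, 5000), 2),
                  region = sample(c("HK", "SG", "UK", "US"), 200, replace = TRUE))

# simple random sampling: every row has the same probability of being selected
srs <- txn[sample(nrow(txn), 20), ]

# systematic sampling: random start in 1..k, then every k-th row (k = N / n = 10)
k     <- nrow(txn) / 20
start <- sample(1:k, 1)
sys   <- txn[seq(start, nrow(txn), by = k), ]

# stratified sampling: simple random sample of 5 rows within each region (stratum)
strat <- do.call(rbind, lapply(split(txn, txn$region),
                               function(s) s[sample(nrow(s), 5), ]))

# cluster sampling: randomly pick 2 regions (clusters) and keep all of their rows
picked <- sample(unique(txn$region), 2)
clus   <- txn[txn$region %in% picked, ]
```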
- reduce number of distinct values or variables
- binning
- variable selection
- purpose
- dimensionality reduction
- operational efficiency - reduce time and memory
- better interpretability with easier visualization using fewer variables
- eliminate irrelevant features (e.g. 10-15 features are useful in fraud detection models)
- input variables are selected (or filtered) based on the usefulness or relation with target variables, using
- correlation with target variable (mostly used)
- information criteria
- clustering of variables
Imbalanced Data Handling
Imbalanced dataset
- definition
- also known as skewed dataset
- a special case of classification problems where the class distribution is not uniform among the classes
- composed of two classes
- majority class
- minority class
- problems in machine learning
- most machine learning models assume an equal distribution of classes
- a model may focus on learning the characteristics of majority class due to the abundance of samples available for learning
- many machine learning models will show bias towards majority class, leading to incorrect conclusions
- slight imbalance vs severe imbalance
- if the dataset is only slightly imbalanced (e.g. a ratio of 4:6), it can still be used for training
Handling methods
- varying the sample window
- increase the number of fraudsters by increasing the time horizon
- e.g. using 12-month window rather than 6-month window
- sample every fraudster twice or more
- undersampling and oversampling
- undersampling
- reduce the number of majority class observations
- some information is lost
- oversampling
- increase the number of minority class observations (e.g. by replicating them)
- no information is lost from either the majority or the minority class (advantage)
- it is prone to overfitting
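- a minimal base-R sketch of random undersampling and oversampling on a made-up fraud label (the target ratio of 1:4 is chosen purely for illustration):

```r
set.seed(7)
df <- data.frame(amount = runif(1000, 1, 1000),
                 fraud  = c(rep(1, 20), rep(0, 980)))   # 2% minority class

minority <- df[df$fraud == 1, ]
majority <- df[df$fraud == 0, ]

# undersampling: keep only a random subset of the majority class (information is lost)
under <- rbind(minority,
               majority[sample(nrow(majority), nrow(minority) * 4), ])

# oversampling: replicate minority observations with replacement (prone to overfitting)
over <- rbind(majority,
              minority[sample(nrow(minority), nrow(majority) / 4, replace = TRUE), ])

table(under$fraud)
table(over$fraud)
```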
- SMOTE (Synthetic Minority Over-Sampling Technique)
- creates synthetic observations based upon the existing minority observations
- combines synthetic oversampling of the minority class with undersampling of the majority class
- is better than either under- or over-sampling alone and can be used in fraud detection
- example - synthetic observations are created along the line between a minority observation and one of its nearest minority neighbours (see the sketch after this list)
- problem
- may create a line bridge with the majority class if a minority observation is an outlier that lies within the majority class region
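- a minimal, illustrative sketch of the SMOTE idea (a synthetic point is interpolated between a minority observation and one of its k nearest minority neighbours); in practice a tested implementation (e.g. the smotefamily package in R) would normally be used:

```r
set.seed(3)
minority <- matrix(rnorm(20, mean = 5), ncol = 2)   # minority (fraud) class, 2 features

smote_one <- function(X, k = 3) {
  i  <- sample(nrow(X), 1)                               # random minority observation
  d  <- sqrt(rowSums((X - matrix(X[i, ], nrow(X), ncol(X), byrow = TRUE))^2))
  nn <- order(d)[2:(k + 1)]                              # its k nearest minority neighbours
  j  <- sample(nn, 1)                                    # one neighbour chosen at random
  X[i, ] + runif(1) * (X[j, ] - X[i, ])                  # point on the line between them
}

synthetic <- t(replicate(5, smote_one(minority)))        # 5 new synthetic observations
synthetic
```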
- ADASYN (Adaptive Synthetic Sampling)
- a generalization of the SMOTE algorithm
- takes into account the distribution of density
- measures the k-nearest neighbors for all minority instances, then calculates the class ratio of the minority and majority instances to create new samples
- impurity ratio is calculated for all minority data points
- the higher the ratio, the more synthetic (minority class) data points are created
- example
- the synthetic data points of Obs3 will double that of Obs2
- other approaches
- using probabilities
- likelihood approach
- adjusting posterior probabilities
- cost-sensitive learning - assign higher misclassification costs to minority class (i.e. fraud cases)
- notes
- the performance of the sampling and cost-sensitive learning approaches is good and broadly equivalent for handling imbalanced datasets
- the sampling approaches are suggested for fraud analytics
Exploratory Data Analysis
Overview
- definition
- the process of using quick and simple methods for the visualization and examination of small data, e.g. boxplots, histograms, etc.
- but it is not feasible to use traditional EDA alone to analyse big datasets, due to the 5Vs
- purpose
- to have a better understanding of the data before building the fraud detection model
- to detect problems in data
Basic techniques
- non-graphical - statistics summary
- illustrate relationships between two or more data variables using statistics or cross-tabulation
- as a preview to check the symmetry or asymmetry of the distribution, i.e. skewness of the data
- calculate for the whole sample set and target set
- using descriptive statistics
- mean, median (for continuous variables)
- mode (for categorical variables)
- standard deviation (represents how much the data are spread around the mean)
- percentile distribution (percentage/distribution of the data)
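- a base-R sketch of the non-graphical summary for a hypothetical claims data frame, computed for the whole sample and for the fraud (target) subset:

```r
set.seed(9)
claims <- data.frame(amount = rlnorm(500, meanlog = 7),
                     stay   = rpois(500, 4),
                     fraud  = rbinom(500, 1, 0.05))

summary(claims)                             # min, quartiles, median, mean, max per variable
sapply(claims[, c("amount", "stay")], sd)   # spread around the mean

# compare the target (fraud) subset against the whole sample
summary(claims[claims$fraud == 1, c("amount", "stay")])
```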
- graphical - visualization
- using pictures of the data, such as stem-and-leaf plots, box plots, pie charts, histograms, scatter plots
- missing data, outliers, and distribution can be more easily identified using visualization of the data
- example
- the maximum value of hospital stay is 668 (> 365) -> impossible
- correlation analysis
- find out the variable relations (positively or negatively correlated), e.g. use scatter plots, correlation
- Pearson correlation (mostly used)
- the covariance of the two variables divided by the product of their standard deviations
- if the absolute value of the coefficient lies between 0.5 and 1.0, the variables are highly correlated
- if the absolute value of the coefficient is below 0.29, the correlation is low
\[\rho_{X, Y}=\frac{cov(X, Y)}{\sigma_X \sigma_Y}\]
- example
- many of the variables are not significantly correlated with each other or with the class variable - because this is an imbalanced-class dataset
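- a short sketch of Pearson correlation in R on a made-up data frame; because the synthetic variables are generated independently of the label, the correlations with the class variable will be close to zero, similar to the imbalanced-class situation described above:

```r
set.seed(9)
claims <- data.frame(amount = rlnorm(500, meanlog = 7),
                     stay   = rpois(500, 4),
                     fraud  = rbinom(500, 1, 0.05))

# Pearson correlation matrix of the input variables and the class label
round(cor(claims, method = "pearson"), 2)

# correlation of each candidate input with the target, usable as a simple variable filter
cor(claims[, c("amount", "stay")], claims$fraud)
```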
- PCA (Principal Component Analysis)
- reduces the dimensionality of a huge dataset down to a manageable number of dimensions
- only keep the principal components that carry most of the information (highest explained variance)
- explained
- example
- the first few principal components are most important
- one can use only these instead of the 50+ original variables
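- a base-R PCA sketch using prcomp() on the built-in USArrests data; a numeric matrix of fraud features would be treated the same way:

```r
# standardize the variables, then compute the principal components
pca <- prcomp(USArrests, scale. = TRUE)

summary(pca)          # proportion of variance explained by each component
head(pca$x[, 1:2])    # scores on the first two components, used instead of the originals
```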