Data Collection

Overview

  • real-life data problems
    • inconsistencies
    • incompleteness
    • duplication
    • merging
    • data size too big to handle, etc.
  • how to deal with
    • pre-process the raw data before proceeding to conduct the analysis steps

Data sources

  • types
    • structured
      • transactional data
      • contractual, subscription, or account data
      • surveys
      • behavioural information
    • unstructured
      • text documents, e.g. emails, web pages, claim forms or multimedia contents
      • contextual or network information
      • qualitative expert-based data
      • publicly available data
    • semi-structured
  • 5Vs
    • velocity
    • volume
    • variety
    • veracity
    • value

Basic concepts

  • data table
    • columns, variables, fields, characteristics, attributes, features, etc. -> same things
    • rows, instances, observations, lines, records, tuples, etc. -> same things
  • data types
    • continuous data
      • e.g. transaction amount, balance on savings account, similarity index
    • categorical data
      • nominal
        • e.g. marital status, payment type, country or region
      • ordinal
        • e.g. age coded as young, middle-aged, and old
      • binary
        • e.g. yes/no, 1/0 (only two values)

Data Pre-Processing

Overview

  • definition
    • a process to transform raw data into some usable format for the next data analytics step
    • provides techniques that help us understand the data and support knowledge discovery at the same time
  • reasons
    • to deal with the problems of raw data, e.g. noise, incompleteness, inconsistencies
    • to make sense of the raw data, i.e. to transform raw data into an understandable format

Basic techniques

  • data integration
    • combines data from multiple sources into a coherent data store
    • problems
      • entity identification problem
      • redundant attribute problem
      • tuple (record) duplication problem
        • redundant records after merging data from different sources
      • data conflict problem
  • data cleaning
    • attempts to fill in missing values (incompleteness), smooth out noise while identifying outliers, and correct inconsistencies in the data
    • "dirty" data
      • incomplete/missing data
        • human input error, intentionally hiding some information
        • not applicable values, e.g. for customers without a Visa card, Visa card transaction fields are not applicable
        • data not matching the search or filter criteria, e.g. transactions > 1 billion
      • noisy data (outliers or errors)
        • reasons of outliers
          • valid observations, e.g. salary of senior management > $1M
          • invalid observations, e.g. age > 300
      • data inconsistencies (similar to data conflict in data integration)
      • duplicate records (similar to data duplication in data integration)
    • resolution - missing data
      • listwise/pairwise deletion
        • problems
          • listwise - reduced sample size and possibility of missing some important information
          • pairwise - still have the problem of drawing conclusions based on a subset of data only
      • data imputation - fill in the missing data
        • 4 techniques
          • which method to use depends on the dataset, the volume of missing data, and familiarity with the various methods
          • model-based estimation uses a single model: if the value of one attribute is missing, the other attributes are used to estimate it
          • multiple imputation uses several models and averages the results; separate imputation models must be built in addition to the analysis model
        • MICE - an R package for multiple imputation (Multivariate Imputation by Chained Equations); see the sketch after this block
          • mice()
            • generates several imputed datasets (5 by default), stored in a mids object
            • these datasets are copies of the original dataframe except that missing values are now replaced with values generated by mice()
          • with()
            • runs the OLS regression on each of the datasets in the mids object
            • obtains a different set of regression coefficients for each dataset, reflecting the effect of each variable on the output
            • the coefficients differ because each dataset contains different imputed values; we do not know in advance which ones are correct
            • the results are stored in a mira object
          • pool()
            • pools the coefficients into a single set of regression coefficients, essentially by taking the mean
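A minimal sketch of the mice() / with() / pool() workflow described above, using the nhanes example data that ships with the mice package; the regression formula chl ~ age + bmi is only illustrative.

```r
library(mice)

data(nhanes)                            # small example dataset with missing values
md.pattern(nhanes)                      # inspect the missing-data pattern

imp <- mice(nhanes, m = 5, seed = 1)    # generate 5 imputed datasets (a mids object)
fit <- with(imp, lm(chl ~ age + bmi))   # run the OLS regression on each dataset (a mira object)
est <- pool(fit)                        # pool the coefficients into a single set
summary(est)
```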
    • resolution - outliers
      • detection
        • calculate minimum, maximum, z-score values for each attribute
        • flag observations as outliers when the absolute value of the z-score is larger than 3
        • use visualization, e.g. histogram, box plots
      • treatment
        • for invalid observations, treat the outlier as a missing value and apply one of the missing-value handling techniques
        • for valid observations, truncation/capping/winsorizing, i.e. set upper and lower limits on a variable
          • z-score (standard deviation), e.g. upper/lower limit = \(M \pm 3\sigma\) (equivalent to \(|z\text{-score}| = 3\))
          • IQR (interquartile range), e.g. upper/lower limit = \(M \pm 3 \times IQR/(2 \times 0.6745)\)
    • outliers or red flags
      • outliers may sometimes actually be the fraud cases, because the behaviour of fraudsters usually deviates from that of normal non-fraudsters
      • these deviations from normal patterns are red flags of fraud, e.g. a small payment followed immediately by a large payment -> may be credit card fraud
      • be cautious when dealing with outliers -> flag them for further analysis rather than removing them blindly
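A base-R sketch of the z-score detection and winsorizing treatment above, on a made-up income vector; the constants 3 and 0.6745 follow the formulas given earlier.

```r
set.seed(1)
income <- c(rnorm(1000, mean = 50000, sd = 10000), 1e6)   # made-up data plus one extreme value

# detection: flag observations with |z-score| > 3
z <- (income - mean(income)) / sd(income)
outliers <- which(abs(z) > 3)

# treatment of valid outliers: winsorize at M +/- 3 standard deviations
upper <- mean(income) + 3 * sd(income)
lower <- mean(income) - 3 * sd(income)
income_capped <- pmin(pmax(income, lower), upper)

# alternative limits based on the IQR: M +/- 3 * IQR / (2 * 0.6745)
iqr_lim <- 3 * IQR(income) / (2 * 0.6745)
income_capped_iqr <- pmin(pmax(income, median(income) - iqr_lim),
                          median(income) + iqr_lim)
```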
  • data transformation

    • data are transformed into an appropriate format for data analysis; this step is part of the ETL (Extract, Transform, Load) process
    • purposes
      • for easy comparison among different datasets with diverse formats
      • for easy combination with other data sets to provide insights
      • to perform aggregation of data
    • techniques

      • normalization
        • required only when attributes have different ranges
        • e.g. AGE ranges from 0 - 100 while INCOME ranges from 10,000 - 100,000 -> INCOME might have a larger effect on the predictive power of the model purely due to its larger scale
        • for continuous variables
          • min-max normalization (range normalization)
          • z-score standardization
          • (natural) log or base-10 log
          • square root
          • inverse
          • square
          • exponential
          • centring (subtract mean)
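A base-R sketch of several of the normalization options listed above, applied to a hypothetical INCOME vector.

```r
income <- c(10000, 25000, 40000, 75000, 100000)     # hypothetical INCOME values

income_minmax  <- (income - min(income)) / (max(income) - min(income))  # min-max to [0, 1]
income_zscore  <- (income - mean(income)) / sd(income)                  # z-score standardization
income_log     <- log(income)                                           # natural log (log10() for base 10)
income_sqrt    <- sqrt(income)                                          # square root
income_centred <- income - mean(income)                                 # centring (subtract the mean)
```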
      • attribute (feature) construction
        • transform a given set of input features to generate a new set of more powerful features
        • purpose
          • dimensionality reduction
          • prediction performance improvement
        • method - PCA
      • discretization

        • replace the values of numeric attribute (continuous variable) by conceptual values (discrete variable)
        • e.g. replace AGE (numeric value) with AGEGROUP (children, youth, adult, elderly), group rare levels into one discrete group "OTHER"
        • method - binning transformation (or categorization)

          • purpose
            • to reduce the number of categories for categorical variables
            • to reduce the number of values for a given continuous variable by dividing the range of variable into discrete intervals
          • example - non-linear relation of default risk vs age
            • the relation can only be captured when non-linear models are used (e.g. neural networks)
            • will not work well for linear models (e.g. regression)
            • group the variable into ranges so that the non-monotonicity can be captured (e.g. grouping ages 25-45 into one bin, the peak within that range disappears)
          • method #1
            • equal interval binning - divide the value range into intervals of equal width
              • BIN 1 (range 1,000 - 1,500): A, B, C, F
              • BIN 2 (range 1,500 - 2,000): D, E
            • equal frequency binning - sort first, then divide into groups containing the same number of observations
              • BIN 1: A, B, C
              • BIN 2: D, E, F
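A base-R sketch of the two binning methods above; the six income values for observations A-F are made up so that they reproduce the bins shown.

```r
income <- c(A = 1050, B = 1200, C = 1300, D = 1600, E = 1900, F = 1450)  # made-up values

# equal interval binning: split the 1,000-2,000 range into equal-width intervals
equal_interval <- cut(income, breaks = c(1000, 1500, 2000),
                      labels = c("BIN1", "BIN2"), include.lowest = TRUE)
split(names(income), equal_interval)    # BIN1: A B C F, BIN2: D E

# equal frequency binning: break points at quantiles, so each bin has the same number of observations
equal_freq <- cut(income, breaks = quantile(income, probs = c(0, 0.5, 1)),
                  labels = c("BIN1", "BIN2"), include.lowest = TRUE)
split(names(income), equal_freq)        # BIN1: A B C, BIN2: D E F
```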
          • method #2 - chi-square test

            • purpose - combine categories/bins in a statistically justified way
            • formula

              \[\chi^2 = \sum \frac{{(observed \ value - expected \ value)}^2}{expected \ value}\]

            • example

              • blue = observed counts, red = expected counts (in the original figure)
              • if the bins are significantly different, the split is kept (if not, try another combination)
                \(\chi^2 = \frac{(6000-5670)^2}{5670} + \frac{(300-630)^2}{630} + \frac{(1950-2241)^2}{2241} + \frac{(540-249)^2}{249} + \frac{(1050-1089)^2}{1089} + \frac{(160-121)^2}{121} = 583\)
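The same chi-square statistic computed in R from the observed and expected counts in the worked example above.

```r
observed <- c(6000, 300, 1950, 540, 1050, 160)
expected <- c(5670, 630, 2241, 249, 1089, 121)

chi_sq <- sum((observed - expected)^2 / expected)
chi_sq    # roughly 583-584; a large value means the bins differ significantly
```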
  • data sampling/reduction

    • makes it possible to perform data analysis on huge amounts of data that may take a very long time if not reduced in size
    • purpose
      • increase storage efficiency
      • reduce data storage and analysis costs
    • methods
      • reduce number of cases
        • aggregation (e.g. "DAILY" transforms to "MONTHLY")
        • sampling
          • purpose
            • to produce a sufficiently representative sample for further analysis
            • to reduce the size of an extremely large dataset to manageable size
          • pros & cons
            • reduces costs and is more operationally efficient, while still giving good prediction results
            • still involves uncertainty, since the results are only estimates based on a subset of the data
          • methods
            • simple random sampling
              • randomly select the sample from the dataset
              • the probability of being selected is the same
            • systematic sampling
              • sample every \(k^{th}\) observation from the dataset from a random start
              • suitable for cases with the need to keep the sequences or periodicity of the dataset
              • e.g. to create a sample of \(n=20\) from a dataset size \(N=200\)
                • randomly select an integer from 1 to 10 (200/20), e.g. random# = 6
                • start with the \(6^{th}\) observation and then select every \(10^{th}\) observation (i.e. observations 6, 16, 26, ...) as the sample
            • stratified sampling
              • divide the dataset into groups (strata) according to similar characteristics
              • use simple random sampling within each group (stratum)
              • more precise than simple random sampling if the strata are homogeneous
            • cluster sampling
              • divide the dataset into groups (clusters) according to similar characteristics
              • randomly select whole clusters as the sample (sampled clusters) -> every observation in a selected cluster is included
              • used when the dataset is too widely dispersed
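A base-R sketch of simple random, systematic, and stratified sampling on a hypothetical data frame df of N = 200 rows; the region column used as the stratum is made up.

```r
set.seed(1)
N  <- 200
df <- data.frame(id = 1:N, region = sample(c("North", "South"), N, replace = TRUE))

# simple random sampling: n = 20 rows, each with equal selection probability
srs <- df[sample(N, size = 20), ]

# systematic sampling: random start in 1..k, then every k-th observation
k     <- N / 20                                    # k = 10
start <- sample(1:k, 1)                            # random start, e.g. 6
sys   <- df[seq(from = start, to = N, by = k), ]   # observations 6, 16, 26, ...

# stratified sampling: simple random sample of 10% within each stratum
strat <- do.call(rbind, lapply(split(df, df$region),
                               function(s) s[sample(nrow(s), ceiling(0.1 * nrow(s))), ]))
```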
      • reduce number of distinct values or variables
        • binning
      • variable selection
        • purpose
          • dimensionality reduction
          • operational efficiency - reduce time and memory
          • better interpretability with easier visualization using fewer variables
          • eliminate irrelevant features (e.g. typically only 10-15 features end up being useful in fraud detection models)
        • input variables are selected (or filtered) based on the usefulness or relation with target variables, using
          • correlation with target variable (most commonly used)
          • information criteria
          • clustering of variables
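A small sketch of filter-based variable selection using correlation with the target; the data and the 0.1 cutoff are made up for illustration.

```r
set.seed(1)
n <- 500
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 2 * X$x1 - X$x2 + rnorm(n)              # made-up target driven by x1 and x2 only

cors <- sapply(X, function(col) cor(col, y)) # correlation of each input with the target
selected <- names(cors)[abs(cors) > 0.1]     # keep variables above an arbitrary cutoff
selected                                     # typically "x1" and "x2"
```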

Imbalanced Data Handling

Imbalanced dataset

  • definition
    • also known as skewed dataset
    • a special case of classification problems where the class distribution is not uniform among the classes
    • typically composed of two classes
      • majority class
      • minority class
  • problems in machine learning
    • most machine learning models assume an equal distribution of classes
    • a model may focus on learning the characteristics of majority class due to the abundance of samples available for learning
    • many machine learning models will show bias towards majority class, leading to incorrect conclusions
  • slight imbalance vs severe imbalance
    • if the dataset is only slightly imbalanced (e.g. a ratio of 4:6), it can still be used for training directly

Handling methods

  • varying the sample window
    • increase the number of fraudsters by increasing the time horizon
      • e.g. using 12-month window rather than 6-month window
    • sample every fraudster twice or more
  • undersampling and oversampling
    • undersampling
      • reduce the number of majority class observations
      • some information is lost
    • oversampling
      • increase the number of minority class observations (e.g. by replicating them)
      • no information is lost from either the majority or the minority class (advantage)
      • it is prone to overfitting
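A base-R sketch of random undersampling and oversampling on a made-up dataset with a 5% fraud rate; dedicated packages exist, but the core idea is just resampling row indices.

```r
set.seed(1)
df <- data.frame(amount = rexp(1000, rate = 1 / 100),
                 fraud  = c(rep(1, 50), rep(0, 950)))   # 5% minority (fraud) class

minority <- which(df$fraud == 1)
majority <- which(df$fraud == 0)

# undersampling: keep all minority rows, randomly drop majority rows to the same size
under <- df[c(minority, sample(majority, length(minority))), ]

# oversampling: keep all majority rows, replicate minority rows with replacement
over  <- df[c(majority, sample(minority, length(majority), replace = TRUE)), ]

table(under$fraud)
table(over$fraud)
```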
  • SMOTE (Synthetic Minority Over-Sampling Technique)
    • creates synthetic observations based upon the existing minority observations
    • combines synthetic oversampling of the minority class with undersampling of the majority class
    • is better than either under-/over- sampling, can be used in fraud detection
    • example - synthetic observations are created along the line segments joining a minority observation to its nearest minority-class neighbours
    • problem
      • may create a "bridge" through the majority class region if a minority observation is an outlier lying among the majority observations
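A minimal sketch of the SMOTE interpolation idea (new points on the line segment between a minority observation and one of its k nearest minority neighbours); this is illustrative only, not the reference implementation.

```r
smote_sketch <- function(minority, k = 5, n_new = 100) {
  d <- as.matrix(dist(minority))                 # pairwise distances between minority observations
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(minority))
  for (i in seq_len(n_new)) {
    a   <- sample(nrow(minority), 1)             # pick a random minority observation
    nn  <- order(d[a, ])[2:(k + 1)]              # its k nearest minority neighbours (skip itself)
    b   <- nn[sample(length(nn), 1)]             # pick one neighbour at random
    gap <- runif(1)                              # random position on the connecting line
    synthetic[i, ] <- minority[a, ] + gap * (minority[b, ] - minority[a, ])
  }
  synthetic
}

set.seed(1)
min_obs <- matrix(rnorm(20 * 2), ncol = 2)       # 20 made-up minority observations, 2 features
new_obs <- smote_sketch(min_obs, k = 3, n_new = 50)
```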

  • ADASYN (Adaptive Synthetic Sampling)
    • a generalization of the SMOTE algorithm
    • takes the density distribution of the minority observations into account
    • finds the k nearest neighbours of each minority instance, then calculates the ratio of majority to minority instances in that neighbourhood to decide how many new samples to create
    • this impurity ratio is calculated for all minority data points
    • the higher the ratio, the more synthetic (minority class) data points are created around that observation
    • example
      • the number of synthetic data points created for Obs3 will be double that for Obs2
  • other approaches
    • using probabilities
      • likelihood approach
      • adjusting posterior probabilities
    • cost-sensitive learning - assign higher misclassification costs to minority class (i.e. fraud cases)
  • notes
    • the performance of the sampling and cost-sensitive learning approaches is good and broadly equivalent for handling imbalanced datasets
    • the sampling approaches are suggested for fraud analytics

Exploratory Data Analysis

Overview

  • definition
    • the process of using quick and simple methods for the visualization and examination of small data, e.g. boxplots, histograms, etc.
    • but traditional EDA is not always feasible for analysing big datasets, due to the 5Vs
  • purpose
    • to have a better understanding of the data before building the fraud detection model
    • to detect problems in data

Basic techniques

  • non-graphical - statistics summary
    • illustrate relationships between two or more data variables using statistics or cross-tabulation
    • as a preview to check the symmetry or asymmetry of the distribution, i.e. skewness of the data
    • calculate for the whole sample set and target set
    • using descriptive statistics
      • mean, median (for continuous variables)
      • mode (for categorical variables)
      • standard deviation (represent how much data is spread around the mean)
      • percentile distribution (percentage/distribution of the data)
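A quick base-R sketch of these summary statistics on a made-up claims dataset with a binary fraud target.

```r
set.seed(1)
claims <- data.frame(amount = rexp(500, rate = 1 / 2000),          # made-up claim amounts
                     fraud  = rbinom(500, size = 1, prob = 0.05))  # made-up fraud flag

summary(claims$amount)                            # min, quartiles, median, mean, max
sd(claims$amount)                                 # spread around the mean
quantile(claims$amount, probs = seq(0, 1, 0.1))   # percentile distribution
table(claims$fraud)                               # frequencies of the categorical target
tapply(claims$amount, claims$fraud, median)       # median amount per class (non-fraud vs fraud)
```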
  • graphical - visualization
    • using picture of data, such as stem-and-leaf plots, box plots, pie chart, histogram, scatter plot
    • missing data, outliers, and distribution can be more easily identified using visualization of the data
    • example
      • the maximum value of hospital stay is 668 (> 365) -> impossible
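The corresponding graphical checks, reusing the made-up claims data frame from the previous sketch; extreme values like the 668-day stay stand out immediately in a histogram or boxplot.

```r
hist(claims$amount, breaks = 30,
     main = "Claim amount", xlab = "Amount")          # distribution shape / skewness
boxplot(amount ~ fraud, data = claims,
        names = c("non-fraud", "fraud"),
        main = "Claim amount by class")               # outliers appear beyond the whiskers
```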
  • correlation analysis

    • find out the variable relations (positively or negatively correlated), e.g. use scatter plots, correlation
    • pearson correlation (mostly used)

      • the covariance of the two variables divided by the product of their standard deviations
      • if the absolute value of the coefficient lies between 0.5 and 1.0, the variables are highly correlated
      • if the absolute value of the coefficient is below 0.29, the correlation is low

      \[\rho_{X, Y}=\frac{cov(X, Y)}{\sigma_X \sigma_Y}\]

    • example

      • many of the variables are not significantly correlated with each other or with the class variable - partly because this is an imbalanced-class dataset
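A base-R sketch of the Pearson correlation on made-up variables, both as a single coefficient and as a correlation matrix.

```r
set.seed(1)
x <- rnorm(200)
y <- 0.8 * x + rnorm(200, sd = 0.5)   # made up to be positively correlated with x
z <- rnorm(200)                       # unrelated noise variable

cor(x, y, method = "pearson")         # single coefficient, cov(x, y) / (sd(x) * sd(y))
round(cor(data.frame(x, y, z)), 2)    # correlation matrix; |rho| > 0.5 -> highly correlated
```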
  • PCA (Principal Component Analysis)

    • reduces the dimensionality of a huge dataset down to a manageable number of dimensions
    • only keep the principal components that capture most of the information (variance) in the data
    • example
      • the first few principal components are most important
      • can use only these instead of the original 50+ variables
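A base-R sketch of PCA with prcomp() on the built-in iris measurements as a stand-in; in practice the same call would run on the 50+ fraud variables mentioned above.

```r
X   <- iris[, 1:4]                                # four numeric variables as a small stand-in
pca <- prcomp(X, center = TRUE, scale. = TRUE)    # PCA on standardized variables

summary(pca)        # proportion of variance explained by each principal component
pca$rotation        # loadings of the original variables on each component

scores <- pca$x[, 1:2]   # keep only the first two components as new features
head(scores)
```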