Clustering

Basics about clustering

  • aim
    • partition a set of observations into groups (or clusters)
  • feature
    • the homogeneity (similarity) within each cluster is maximized
    • the heterogeneity (dissimilarity) between clusters is maximized
    • unsupervised classification (no labels)
    • no predefined target class
    • number of clusters unknown
    • meaning of clusters unknown
  • distance metrics

    • measure how similar or dissimilar observations are (illustrated in code after this list)
    • types

      • metrics for continuous variables

        • Minkowski distance

          • formula
            • \(n\) represents the number of variables
            • \(x_i, x_j\) are two observations

          \[D(x_i,x_j)={(\sum_{k=1}^n {|x_{ik}-x_{jk}|}^p)}^{1/p}\]

          • when p = 1, it is Manhattan (or city block) distance

          \[D(x_i,x_j)=|x_{11}-x_{21}|+|x_{12}-x_{22}|=|50-30|+|20-10|=30\]

          • when p = 2, it is Euclidean distance

          \[D(x_i,x_j)=\sqrt{{(x_{11}-x_{21})}^2+{(x_{12}-x_{22})}^2}=\sqrt{{(50-30)}^2+{(20-10)}^2}\approx 22.4\]

        • Pearson correlation

      • metrics for categorical variables

        • simple matching coefficient (SMC)
          • the proportion of variables on which the two observations take identical values (both "Yes" or both "No")
          • assumption of SMC is that "Yes" and "No" are of equal weights
        • Jaccard index
          • similar to SMC, but leaves out the "No-No" matches
          • measures the similarity between observations across those red flags that were raised at least once
          • especially useful in situations where many red-flag indicators are available and typically only a few are raised
        • example
          • SMC(Claim1, Claim2) = 2/5
          • Jaccard(Claim1, Claim2) = 1/4
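
The distance and similarity measures above are easy to compute directly. Below is a minimal Python sketch: the continuous values (50, 20) and (30, 10) come from the worked Minkowski example above, while the two binary red-flag vectors are hypothetical, chosen only so that they reproduce SMC = 2/5 and Jaccard = 1/4 (the actual claim data are not shown in these notes).

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

x1 = np.array([50, 20])
x2 = np.array([30, 10])
print(minkowski(x1, x2, p=1))   # Manhattan: |50-30| + |20-10| = 30
print(minkowski(x1, x2, p=2))   # Euclidean: sqrt(20^2 + 10^2) ~ 22.36

def smc(a, b):
    """Simple matching coefficient: matching positions (1-1 or 0-0) over all variables."""
    a, b = np.asarray(a), np.asarray(b)
    return np.mean(a == b)

def jaccard(a, b):
    """Jaccard index: 1-1 matches over variables flagged at least once (0-0 ignored)."""
    a, b = np.asarray(a), np.asarray(b)
    return np.sum((a == 1) & (b == 1)) / np.sum((a == 1) | (b == 1))

# Hypothetical red-flag vectors for two claims (1 = flag raised, 0 = not raised)
claim1 = [1, 0, 1, 1, 0]
claim2 = [1, 0, 0, 0, 1]
print(smc(claim1, claim2))      # 2 matches out of 5 variables = 0.4
print(jaccard(claim1, claim2))  # 1 joint flag out of 4 flagged variables = 0.25
```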

Types of clustering

  • hierarchical

    • agglomerative
      • bottom-up
    • divisive
      • top-down

    • use a similarity or distance matrix to merge or split one cluster at a time (see the code sketch at the end of this section)
    • linkage methods (how the distance between two clusters is measured)

      • single linkage: smallest possible distance between the two clusters
        • tends to give thin, elongated, chain-like clusters, since dissimilar objects within a cluster are not penalized
      • complete linkage: largest possible distance between the two clusters
        • clusters are tight, more balanced and spherical
      • average linkage: average of all pairwise distances between the two clusters
        • merges clusters with small variances, resulting in clusters with similar variance
      • centroid method: distance between the centroids of both clusters
        • most robust to outliers, but does not perform as well as Ward's method or average linkage
      • Ward's distance: the increase in total within-cluster sum of squares caused by merging the two clusters

        • tends to merge clusters with small numbers of observations and results in balanced clusters, so the dendrogram and hierarchy are clearer
        • formula

        \[D_{Ward}(C_i,C_j)=\sum_{x\in C_i}{(x-c_i)}^2+\sum_{x\in C_j}{(x-c_j)}^2-\sum_{x\in C_{ij}}{(x-c_{ij})}^2\]

        • find the optimal clustering through
          • dendrogram - a termination condition determines where to cut the dendrogram
          • scree plot - the elbow point indicates the optimal number of clusters
    • advantages

      • easy to interpret and understand (dendrogram is easy to understand)
      • relatively easy to implement
      • number of clusters does not need to be specified prior to the analysis
    • disadvantages

      • the methods do not scale well to large data sets, i.e. computationally expensive
      • interpretation of clusters is often subjective and depends on expert knowledge
      • does not work well with mixed data types and missing data
      • not the best solution for data sets with high dimensions
  • non-hierarchical

    • k-means
      • some basics
        • "k-means" because each of the k clusters is represented by the centre (or mean) of its observations
        • k needs to be specified before the analysis starts
      • procedure
        • select k observations as initial cluster centroids (seeds)
        • assign each observation to the cluster that has the closest centroid
        • when all observations have been assigned, recalculate the positions of the k centroids
        • repeat until the cluster centroids no longer change or a fixed number of iterations is reached (see the code sketch at the end of this section)
      • advantages
        • relatively computationally efficient compared with hierarchical clustering
        • simple to implement and scales to large datasets
      • disadvantages
        • need to specify k, the number of clusters, in advance
        • often terminates at a local optimum, so different initial cluster centres should be tried
        • unable to handle noisy data and outliers
    • self-organizing map (SOM)
  • many others

    • density-based
    • model-based
    • graph-based
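
The hierarchical and k-means procedures above map directly onto standard library calls. The sketch below is illustrative only: the data are synthetic (three Gaussian blobs standing in for, say, customer behaviour variables), and scipy and scikit-learn are assumed to be available.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data: three blobs standing in for customer behaviour variables
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, (50, 2)),
               rng.normal([3, 3], 0.5, (50, 2)),
               rng.normal([0, 3], 0.5, (50, 2))])
X = StandardScaler().fit_transform(X)      # standardize before distance-based clustering

# Agglomerative (bottom-up) hierarchical clustering with Ward's method
Z = linkage(X, method="ward")              # alternatives: "single", "complete", "average", "centroid"
labels_hier = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree (needs matplotlib)

# k-means: k must be chosen up front; n_init reruns with different seeds to avoid poor local optima
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)                 # the k centroids after convergence
print(labels_hier[:10], km.labels_[:10])
```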

Performance Evaluation of Clustering

Overview

  • method
    • no universal criterion for clustering performance evaluation
  • only 2 perspectives
    • statistical perspective
    • interpretability perspective
  • the standard of a good clustering solution
    • high within-cluster similarity
    • low between-cluster similarity

SSE (Sum of Squared Errors)

  • a statistical method
  • how to evaluate
    • the lower the SSE within a particular cluster (WSS), the more homogeneous that cluster is, i.e. higher within-cluster similarity
    • the higher the sum of squares between different clusters (BSS), the more heterogeneous the clusters are, i.e. lower between-cluster similarity
  • the optimal number of clusters
    • elbow method - makes use of within-cluster SSE (WSS)
    • steps
      • compute the k-means clustering models for different k values
      • for each k, calculate the WSS
      • plot WSS against the value of k
      • find the point where there is a bend (or elbow) in the curve -> this k is optimal (see the sketch below)
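
A minimal sketch of the elbow method; the data matrix here is a random placeholder, and km.inertia_ in scikit-learn is exactly the within-cluster SSE (WSS).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Placeholder data; substitute your own standardized feature matrix
X = np.random.default_rng(0).normal(size=(300, 2))

wss = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)          # inertia_ = within-cluster sum of squares (WSS)

plt.plot(list(ks), wss, marker="o")  # look for the bend ("elbow") in this curve
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster SSE (WSS)")
plt.show()
```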

Interpretation of a clustering solution

  • compare clusters with population distribution (or between clusters)
    • plot histograms, means, bar charts, etc. of the most important variables by cluster
    • example
      • cluster C1 has observations with low recency values and high monetary values, whereas the frequency is similar to the original population
  • use classification tree
    • to predict the cluster labels from the input variables, so that the tree's splits describe the clusters (see the sketch below)
    • example
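
One way to make the classification-tree idea concrete: fit a shallow decision tree with the cluster label as the target and read off its splits as cluster descriptions. The sketch below uses synthetic data and made-up feature names (recency, monetary) purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic placeholder: two behavioural variables per customer
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, (50, 2)), rng.normal([3, 3], 0.5, (50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Fit a shallow tree with the cluster label as the target; its rules describe the clusters
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, labels)
print(export_text(tree, feature_names=["recency", "monetary"]))
```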

Neural Network

Basic concepts

  • neural networks (NN)
    • can be used to perform tasks like classification, clustering, predictive analytics
  • ANN & DNN
    • artificial neural networks (ANN) are composed of layers of node/neurons
    • neural networks with more than one hidden layer are called deep neural networks (DNN)
  • structure (a minimal forward-pass sketch follows this list)
    • input layer - initial input data for the neural network (visible layer)
    • hidden layers - between input and output layers, where the computations are done and passed on to other nodes deeper in the neural network
    • output layer - output result of the given input (visible layer)
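
A minimal numpy sketch of that structure, purely for illustration (random weights, three inputs, one hidden layer of four neurons, one output neuron):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.normal(size=3)                           # input layer: 3 input variables

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)    # hidden layer: 4 neurons
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)    # output layer: 1 neuron

h = np.tanh(W1 @ x + b1)                         # hidden activations (TanH)
y = sigmoid(W2 @ h + b2)                         # output in (0, 1), e.g. a fraud probability
print(y)
```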

Inside an artificial neuron

  • activation functions
    • mathematical equations that determine the output of a neural network
    • attached to each node; defines whether the given node should be "activated", based on whether the node's input is relevant for the model's prediction
    • normalize the output of each neuron to a range between 0 and 1 or between -1 and 1
    • examples
      • linear activation functions
        • threshold (unit step or sign) function
          • can only output 0 or 1
        • linear function
          • makes the network's output a linear function of its inputs, so it cannot cater for non-linear data
      • non-linear activation functions (most neural networks use this for more complex mappings between inputs and outputs)
        • Sigmoid function
          • accepts any input value and squashes the output to between 0 and 1
          • smooth gradient and clear predictions
          • problems
            • output not zero centered & computationally expensive
            • vanishing gradient - when the sigmoid's input is very large or very small, the derivative becomes very small, i.e. << 1, causing poor learning in deep networks
        • TanH (hyperbolic tangent) function
          • similar to the Sigmoid function, but zero-centred, i.e. easier to model inputs with strongly negative, neutral and strongly positive values (see the numeric sketch after this list)
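
The numeric sketch below compares the sigmoid and TanH activations and their derivatives; the input values are arbitrary, chosen only to show how the gradients shrink for large inputs (the vanishing-gradient problem mentioned above).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)               # at most 0.25, reached at z = 0

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2       # at most 1.0, reached at z = 0

for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z={z:5.1f}  sigmoid={sigmoid(z):.4f} (grad {sigmoid_grad(z):.5f})  "
          f"tanh={np.tanh(z):.4f} (grad {tanh_grad(z):.5f})")
# Both gradients shrink towards 0 as |z| grows; sigmoid outputs lie in (0, 1)
# while tanh outputs are zero-centred in (-1, 1).
```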

Training a neural network

  • cost/loss function

    • NN is trained based on this
    • to measure the error of the network's prediction, or how good the neural network model is for a particular task
    • cost function - MSE

      • take the difference between the predicted output and the actual output
      • square the differences
      • calculate the mean

      \[MSE=\frac{1}{n}\sum_{i=1}^n{(Y_i-\hat{Y_i})}^2\]

    • the MSE should be as small as possible

  • process: compute predictions (forward pass), measure the error with the cost function, adjust the weights to reduce it, and repeat over many iterations

  • for fraud analytics (a training sketch follows this list)

    • usually one hidden layer is enough
    • theoretically can use different activation function at each node, but usually this is fixed for each layer
    • for hidden layers (black-box approach)
      • use non-linear functions, e.g. TanH function
    • for output layers
      • classification targets (fraud vs non-fraud): use sigmoid function
      • regression target (e.g. amount of fraud): can use any function
    • data pre-processing
      • normalize the continuous variables, e.g. using z-scores
      • reduce the number of categories for categorical variables
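
A sketch of the recipe above using scikit-learn's MLPClassifier: one hidden layer with TanH activations, a logistic (sigmoid) output for the binary fraud target, and z-score normalization of the continuous inputs. The data are a synthetic stand-in with continuous inputs only; the feature construction and threshold are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for fraud data: 5 continuous inputs, binary fraud label
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 1).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = make_pipeline(
    StandardScaler(),                                   # z-score the continuous inputs
    MLPClassifier(hidden_layer_sizes=(10,),             # a single hidden layer
                  activation="tanh",                    # non-linear hidden activation
                  max_iter=1000, random_state=0),       # output is logistic for a binary target
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))                      # accuracy on the held-out set
```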

Interpret neural network

  • variable selection
    • features
      • to select those variables that actively contribute to the neural network output
      • does not give insight about the internal workings of the neural network
    • Hinton diagram
      • visualizes the weights between inputs and hidden neurons as squares
      • size of square is proportional to size of the weight
      • color of the square represents sign of the weight (e.g. black = -ve, white = +ve)
      • e.g. “Income” variable can be removed
    • variable selection procedures
      • inspect the Hinton diagram and remove the variable whose weights are closest to ZERO
      • re-estimate the neural network with the variable removed (to speed up the convergence, it could be beneficial to start from the previous weights)
      • continue with step 1 until a stopping criterion is met; the criterion could be a decrease in predictive performance or a fixed number of steps
    • backward variable selection procedures
      • build a neural network with all N variables
      • remove each variable in turn and re-estimate the network (this will give N networks each having N-1 variables)
      • remove the variable whose absence gives the best performing network (e.g. in terms of misclassification error, mean squared error)
      • repeat this procedure until the performance decreases significantly
  • rule extraction
    • feature
      • extract if-then classification rules
    • decompositional approach
      • decompose the network’s internal workings by inspecting weights and/or activation values
      • example
    • pedagogical approach
      • example
    • performance evaluation
      • extracted rule sets are evaluated in terms of
        • accuracy
        • conciseness (e.g. number of rules, number of conditions per rule)
        • fidelity – measures to what extent the extracted rule set succeeds in mimicking the neural network
      • benchmark the extracted rules/trees with a tree built directly on the original data to see the benefit of going through the neural network
  • two-stage models
    • combines two models to obtain both interpretability and performance
    • procedure
      • stage 1 (model interpretability)
        • to estimate an easy-to-understand model (e.g. linear regression, logistic regression)
      • stage 2 (model performance)
        • to predict the errors made by the simple model using the same set of predictors
    • example (see the sketch below)
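
The two-stage idea can be sketched as follows for a regression target (e.g. the amount of fraud): an interpretable linear regression first, then a neural network trained on its residuals using the same predictors, with the final prediction being the sum of the two. The data are synthetic and the layer size is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic regression target with a non-linear component the linear model will miss
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = 2 * X[:, 0] + np.sin(3 * X[:, 1]) + rng.normal(scale=0.3, size=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Stage 1: easy-to-understand model
stage1 = LinearRegression().fit(X_tr, y_tr)
residuals = y_tr - stage1.predict(X_tr)

# Stage 2: neural network predicts the errors of the simple model from the same predictors
stage2 = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=0).fit(X_tr, residuals)

# Combined prediction = interpretable baseline + modelled correction
y_hat = stage1.predict(X_te) + stage2.predict(X_te)
print(mean_squared_error(y_te, stage1.predict(X_te)),   # stage 1 alone
      mean_squared_error(y_te, y_hat))                  # two-stage model
```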

Advantages and disadvantages

  • advantages
    • handle different types of input variables and support multivariate targets
    • support flexible and nonlinear relations between input and target variables
    • due to flexibility of the model, it is applicable to many problems
  • disadvantages
    • black-box model, not easy to interpret
    • prone to overfitting
    • requires pre-processing of the input attributes, e.g. imputation of missing data

Support Vector Machine (SVM)

Overview

  • issues with NN
    • the objective function is nonconvex (i.e. may have multiple local minima)
    • the effort needed to tune the number of hidden neurons
  • basics of SVM
    • supervised learning model that can be used for both classification and regression tasks
    • more commonly used for classification tasks
    • basic idea - to find a line or a hyperplane that best divides the dataset into 2 classes

Basic idea

  • linear classification
    • decision boundary
      • 1-d: a dot
      • 2-d: a line
      • n-d: hyperplane
    • steps
      • find the points (support vectors) that are closest to the hyperplane (the line) from both classes
      • compute the margin = the distance between the support vectors and the hyperplane
      • find the hyperplane with the maximal margin
    • objective of SVM
      • to maximize the margin between the two dotted lines (through the support vectors of each class)
      • i.e. the two classes are as far apart as possible
    • linearly non-separable case
      • introduce some extra terms
        • an error term (\(e_k\)) allows for misclassification errors
        • a hyperparameter (C) balances maximizing the margin against minimizing the error on the data
      • if the data are linearly separable, these terms can simply be ignored
  • nonlinear classification

    • mapping function \(\phi(x)\)

      • transforms the input data points to a higher-dimensional feature space
      • mathematically, the mapping never has to be computed explicitly; only the kernel function (the inner product in the feature space) is needed
      • kernel trick

        • formula

        \[K(x_k,x_l)=\phi(x_k)^T\phi(x_l)\]

    • commonly used kernel functions

      • linear kernel: \(K(x,x_k)=x_k^Tx\)
      • polynomial kernel: \(K(x,x_k)={(1+x_k^Tx)}^d\)
      • radial basis function (RBF) kernel: \(K(x,x_k)=exp\{-{||x-x_k||}^2/\sigma^2\}\)
      • example
        • the RBF kernel usually performs best, but the extra kernel parameter \(\sigma\) (often exposed as gamma in software) needs to be tuned
      • tuning of hyperparameters

        • hyperparameter C - the penalty assigned to errors (misclassifications)
          • small C value
            • hyperplane with a larger margin, i.e. more violations are allowed; emphasis on maximizing the margin
          • large C value
            • hyperplane with a tighter margin, i.e. more emphasis is placed on minimizing the number of misclassifications
          • the right balance is chosen by comparing predictive performance
        • gamma
          • only needs to be tuned when a non-linear (e.g. RBF) kernel is used
          • lower value of gamma
            • each training point's influence reaches further, so distant points also shape the separating plane
            • creates a loose (smoother) fit of the training dataset
          • higher value of gamma
            • only nearby points influence the separating line
            • tries to fit the training data exactly (risk of overfitting)
          • which value is better is decided by predictive performance; the difference shows up in the shape of the decision boundary
        • procedure (sketched in code at the end of this section)

          • partition the data into 40% training, 30% validation and 30% test data
          • build an RBF SVM classifier for each (\(\sigma, C\))

            \[\sigma \in \{0.5, 5, 10, 15, 25, 50, 100, 250, 500\}\]

            \[C \in \{0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500\}\]

          • choose the (\(\sigma, C\)) combination with the best validation set performance

          • build an RBF SVM classifier with the optimal (\(\sigma, C\)) combination on combined training + validation dataset

          • calculate the performance of the estimated RBF SVM classifier on the test set

  • SVM interpretation

    • an SVM classifier can be represented as a neural network
    • the neural network interpretation techniques can then be applied
    • alternatively, approximate the SVM predictions with a decision tree (as in the earlier example)
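
The tuning procedure described above (40/30/30 split, grid over \(\sigma\) and C) can be sketched as follows with scikit-learn. Note that scikit-learn parameterizes the RBF kernel as exp(-gamma·||x - x'||²), so the notes' \(\sigma\) corresponds to gamma = 1/\(\sigma\)²; the grids match the procedure above, but the data are synthetic placeholders.

```python
import numpy as np
from itertools import product
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary data (placeholder for a fraud data set)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
y = ((X[:, 0] ** 2 + X[:, 1] ** 2) > 2).astype(int)

# 40% training, 30% validation, 30% test
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, train_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

scaler = StandardScaler().fit(X_tr)
X_tr, X_val, X_te = scaler.transform(X_tr), scaler.transform(X_val), scaler.transform(X_te)

sigmas = [0.5, 5, 10, 15, 25, 50, 100, 250, 500]
Cs = [0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500]

# Build an RBF SVM for each (sigma, C) and keep the combination with the best validation accuracy.
# scikit-learn's RBF kernel is exp(-gamma * ||x - x'||^2), so gamma = 1 / sigma^2.
best = None
for sigma, C in product(sigmas, Cs):
    clf = SVC(kernel="rbf", C=C, gamma=1.0 / sigma ** 2).fit(X_tr, y_tr)
    acc = clf.score(X_val, y_val)
    if best is None or acc > best[0]:
        best = (acc, sigma, C)

# Refit with the optimal (sigma, C) on training + validation data, then evaluate once on the test set
_, sigma, C = best
final = SVC(kernel="rbf", C=C, gamma=1.0 / sigma ** 2).fit(
    np.vstack([X_tr, X_val]), np.concatenate([y_tr, y_val]))
print((sigma, C), final.score(X_te, y_te))
```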

Advantages and disadvantages

  • advantages
    • works really well with a clear margin of separation
    • effective in high dimensional spaces and where the number of dimensions is greater than the number of samples
    • uses a subset of training points in the decision function (called support vectors), so it is also memory efficient
  • disadvantages
    • does not perform well when the data set is large or very noisy
    • choosing an effective kernel function (and tuning its parameters) can be computationally intensive

Multiclass Classification Techniques

Binary vs multiclass

  • binary
    • label as "fraud" and "no fraud"
  • multiclass
    • more than 2 labels
    • nominal labels
      • fraud, suspected fraud, no fraud
    • ordinal labels
      • severe fraud, medium fraud, light fraud, no fraud

Techniques

  • logistic regression (for nominal target variables)

    • choose one of the target classes (e.g. class K) as the base class

      \[\frac{P(Y=1|X_1,\ldots,X_N)}{P(Y=K|X_1,\ldots,X_N)}=e^{(\beta_0^1+\beta_1^1X_1+\beta_2^1X_2+\ldots+\beta_N^1X_N)}\]

      \[\frac{P(Y=2|X_1,\ldots,X_N)}{P(Y=K|X_1,\ldots,X_N)}=e^{(\beta_0^2+\beta_1^2X_1+\beta_2^2X_2+\ldots+\beta_N^2X_N)}\]

      \[\cdots\]

      \[\frac{P(Y=K-1|X_1,\ldots,X_N)}{P(Y=K|X_1,\ldots,X_N)}=e^{(\beta_0^{K-1}+\beta_1^{K-1}X_1+\beta_2^{K-1}X_2+\ldots+\beta_N^{K-1}X_N)}\]

    • based on the fact that all probabilities sum to 1

      \[P(Y=1|X_1,\ldots,X_N)=\frac{e^{(\beta_0^1+\beta_1^1X_1+\beta_2^1X_2+\ldots+\beta_N^1X_N)}}{1+\sum_{k=1}^{K-1}e^{(\beta_0^k+\beta_1^kX_1+\beta_2^kX_2+\ldots+\beta_N^kX_N)}}\]

      \[P(Y=2|X_1,\ldots,X_N)=\frac{e^{(\beta_0^2+\beta_1^2X_1+\beta_2^2X_2+\ldots+\beta_N^2X_N)}}{1+\sum_{k=1}^{K-1}e^{(\beta_0^k+\beta_1^kX_1+\beta_2^kX_2+\ldots+\beta_N^kX_N)}}\]

      \[P(Y=K|X_1,\ldots,X_N)=\frac{1}{1+\sum_{k=1}^{K-1}e^{(\beta_0^k+\beta_1^kX_1+\beta_2^kX_2+\ldots+\beta_N^kX_N)}}\]

    • the parameters \(\beta\) are estimated using maximum a posteriori estimation, which is an extension of maximum likelihood estimation

  • decision tree

    • the impurity criteria can be written as follows (K is the number of class labels)

      \[Entropy(S)=-\displaystyle \sum^K_{k=1}p_klog_2(p_k)\]

      \[Gini(S)=\displaystyle \sum^K_{k=1}p_k(1-p_k)\]

    • the splitting and stopping decisions are the same as in the binary classification case

  • NN

    • increase the number of output neurons to K, where K is the number of class labels
    • an observation is assigned to the output neuron with the highest activation value
  • SVM

    • treat it as a set of binary problems by splitting the target classes into two groups at a time
    • one-versus-one: for K target classes, learn an SVM for every pair of classes as follows (see the sketch at the end of this section)
      • classifies target as "y=1" vs "y=2"; "y=1" vs "y=3"; ... "y=1" vs "y=K"; "y=2" vs "y=3"; "y=2" vs "y=4"; ... "y=2" vs "y=K" ...
      • will have \(\frac{K(K-1)}{2}\) classifiers
      • use a (weighted) voting procedure for the final classification result
    • one-versus-all: for K target classes, learn K SVMs as follows
      • SVM#1, with target class label=1, classifies target as “y=1” vs “y≠1"
      • SVM#2, with target class label=2, classifies target as “y=2” vs “y≠2"
      • SVM#K, with target class label=K, classifies target as “y=K” vs “y≠K”
      • use the highest posterior probability to decide the classification result
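
A sketch of the multiclass techniques above in scikit-learn: multinomial logistic regression (one coefficient vector per class, with probabilities summing to 1) and SVMs via the one-versus-one and one-versus-all decompositions. The three-class data set is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic data with K = 3 nominal classes (e.g. fraud, suspected fraud, no fraud)
X, y = make_classification(n_samples=1500, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Multinomial logistic regression: class probabilities sum to 1 for each observation
logreg = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
print(logreg.predict_proba(X_te[:3]).sum(axis=1))   # each row sums to 1

# SVM via binary decompositions: K(K-1)/2 one-vs-one classifiers, or K one-vs-all classifiers
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X_tr, y_tr)
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X_tr, y_tr)
print(ovo.score(X_te, y_te), ovr.score(X_te, y_te))
```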