Linear Discriminant Analysis
(Predictive Discriminant Analysis)
Ricco Rakotomalala
Tutoriels Tanagra - http://data-mining-tutorials.blogspot.fr/
Maximum A Posteriori Rule
Calculating the posterior probability
Bayes' theorem:

$$P(Y = y_k / X) = \frac{P(Y = y_k) \cdot P(X / Y = y_k)}{P(X)} = \frac{P(Y = y_k) \cdot P(X / Y = y_k)}{\sum_{l=1}^{K} P(Y = y_l) \cdot P(X / Y = y_l)}$$
MAP – Maximum A Posteriori rule: since the denominator $P(X)$ does not depend on $k$, maximizing the posterior probability amounts to maximizing the numerator.

$$y_{k^*} = \arg\max_k P(Y = y_k / X) \iff y_{k^*} = \arg\max_k P(Y = y_k) \cdot P(X / Y = y_k)$$
Prior probability of class k: $P(Y = y_k)$, estimated by the empirical frequency $n_k / n$.
How to estimate $P(X / Y = y_k)$? Assumptions are introduced in order to obtain a convenient calculation of this distribution.
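A minimal numpy sketch of the MAP rule, assuming the priors and the class-conditional densities $P(X / Y = y_k)$ are already available (the numeric values below are hypothetical):

```python
import numpy as np

# Hypothetical priors P(Y=y_k) and class-conditional densities P(X / Y=y_k)
# evaluated at a single instance X, for K = 3 classes.
priors = np.array([0.3, 0.5, 0.2])          # n_k / n
likelihoods = np.array([0.02, 0.10, 0.01])  # P(X / Y=y_k)

posteriors = priors * likelihoods
posteriors /= posteriors.sum()              # Bayes: divide by P(X)

k_star = np.argmax(posteriors)              # MAP rule: most probable class
```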
Assumption 1: (X1, …, XJ / yk) is assumed multivariate normal
(Multivariate Gaussian Distribution – Parametric method)
Multivariate Gaussian density:

$$P(X_1 = v_1, \ldots, X_J = v_J / y_k) = \frac{1}{(2\pi)^{J/2} \sqrt{\det(\Sigma_k)}} \, e^{-\frac{1}{2} (X - \mu_k) \Sigma_k^{-1} (X - \mu_k)'}$$

where the $\mu_k$ are the conditional centroids and the $\Sigma_k$ the conditional covariance matrices.
3
2
2
1
1
1
2
3
4
Iris-setosa
Ricco Rakotomalala
Tutoriels Tanagra - http://data-mining-tutorials.blogspot.fr/
Iris-versicolor
5
6
Iris-virginica
3
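A short sketch of how this conditional density could be evaluated with scipy; the centroid and covariance values below are hypothetical:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical conditional parameters for one class y_k (J = 2 predictors).
mu_k = np.array([4.26, 1.33])           # conditional centroid
sigma_k = np.array([[0.27, 0.08],
                    [0.08, 0.04]])      # conditional covariance matrix

x = np.array([4.5, 1.4])                # instance to evaluate
density = multivariate_normal(mean=mu_k, cov=sigma_k).pdf(x)  # P(X / Y=y_k)
```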
Assumption 2: Population covariance matrices are equal
   k , k  1,, K
Figure: (X1) pet_length vs. (X2) pet_width by (Y) type — the same three groups (Iris-setosa, Iris-versicolor, Iris-virginica), now assumed to share a common covariance matrix.
Linear classification functions
(under the assumptions [1] and [2])
The natural logarithm of the conditional probability is proportional to:
$$\ln P(X / y_k) \propto -\tfrac{1}{2} (X - \mu_k) \Sigma^{-1} (X - \mu_k)'$$
From a sample with n instances, K classes and J predictive variables:

$$\hat{\mu}_k = \begin{pmatrix} \bar{x}_{k,1} \\ \vdots \\ \bar{x}_{k,J} \end{pmatrix} \quad \text{(conditional centroids)}$$

$$\hat{\Sigma} = \frac{1}{n - K} \sum_{k=1}^{K} n_k \hat{\Sigma}_k \quad \text{(pooled variance-covariance matrix)}$$
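A numpy sketch of these two estimators, assuming a data matrix X of shape (n, J) and a label vector y (the function name is hypothetical):

```python
import numpy as np

def lda_estimates(X, y):
    """Conditional centroids and pooled variance-covariance matrix."""
    classes = np.unique(y)
    n, K = len(y), len(classes)
    centroids = {k: X[y == k].mean(axis=0) for k in classes}
    # Pooled covariance: n_k-weighted sum of the conditional (biased)
    # covariance matrices, normalized by n - K, as in the formula above.
    pooled = sum(np.cov(X[y == k], rowvar=False, bias=True) * (y == k).sum()
                 for k in classes) / (n - K)
    return centroids, pooled
```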
Linear classification functions
(an explicit classification model that can classify an unseen instance)
The classification function for $y_k$ is proportional to $P(Y = y_k / X)$:

$$d(Y_k, X) = \ln P(Y = y_k) + \mu_k \Sigma^{-1} X' - \tfrac{1}{2} \mu_k \Sigma^{-1} \mu_k'$$

The first term takes into account the prior probability of the group.

Decision rule:

$$d(Y_1, X) = a_{1,0} + a_{1,1} X_1 + a_{1,2} X_2 + \cdots + a_{1,J} X_J$$
$$d(Y_2, X) = a_{2,0} + a_{2,1} X_1 + a_{2,2} X_2 + \cdots + a_{2,J} X_J$$
$$y_{k^*} = \arg\max_k d(Y_k, X)$$
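Building on the lda_estimates sketch above, the classification functions and the arg max rule could be coded as follows (hypothetical helper):

```python
import numpy as np

def lda_predict(x, centroids, pooled, priors):
    """Compute d(Y_k, X) for each class and return the arg max class."""
    inv = np.linalg.inv(pooled)
    scores = {}
    for k, mu in centroids.items():
        a = inv @ mu                                  # slope coefficients a_{k,j}
        a0 = np.log(priors[k]) - 0.5 * mu @ inv @ mu  # intercept a_{k,0}
        scores[k] = a0 + x @ a                        # d(Y_k, X)
    return max(scores, key=scores.get)                # y_k* = arg max_k d(Y_k, X)
```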

Advantages and shortcomings
LDA - in general - is as effective as the other linear methods (e.g. logistic regression)
>> It is robust to deviations from the Gaussian assumption
>> It may be disturbed by a strong deviation from the homoscedasticity assumption
>> It is sensitive to the dimensionality and/or the presence of redundant variables
>> Multimodal conditional distributions are a problem (e.g. 2 or more "clusters" for Y=y_k)
>> It is sensitive to outliers
Classification rule – Distance to the centroids
The classification function d(Y_k, X) computed for the individual ω is based on

$$(X(\omega) - \mu_k) \Sigma^{-1} (X(\omega) - \mu_k)'$$

Distance-based classification: assign ω to the population to which it is closest, (1) in the sense of the distance to the centroids, (2) using the Mahalanobis distance.
We understand that LDA fails in some situations: (a) when we have multimodal conditional
distributions, the group centroids are not reliable; (b) when the conditional covariance matrices are
very different, the pooled covariance matrix is not appropriate for the calculation of distances.
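A compact sketch of this distance-based reading of the rule, reusing the hypothetical centroids and pooled matrix from the earlier snippets:

```python
import numpy as np

def nearest_centroid(x, centroids, pooled):
    """Assign x to the class whose centroid is closest in Mahalanobis distance."""
    inv = np.linalg.inv(pooled)
    d2 = {k: (x - mu) @ inv @ (x - mu) for k, mu in centroids.items()}
    return min(d2, key=d2.get)
```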
Classification rule – Linear separator
Linear decision boundaries (hyperplanes) to separate the groups.
Each boundary is defined by the points equidistant from two conditional centroids.
In LDA, the decision rule can thus be interpreted in several ways: (a) MAP decision rule (posterior probability); (b) distance to the centroids; (c) linear separator which defines regions in the representation space.
Evaluation of the classifier
(1) Estimating classification error rate
Holdout scheme: learning + test samples → confusion matrix
(2) Overall “statistical” evaluation of the classifier
$$H_0: \mu_1 = \cdots = \mu_K$$

One-way MANOVA statistical test
H0: the population centroids do not differ
The test statistic: WILKS' LAMBDA

$$\Lambda = \frac{\det(W)}{\det(V)}$$

where W is the pooled (within-groups) covariance matrix and V the global covariance matrix.

In practice, we use the Bartlett transformation (χ² distribution) or the Rao transformation (F distribution) to define the critical region.
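A numpy sketch of Wilks' lambda with its Bartlett χ² transformation; the scaling factor −(n − 1 − (J + K)/2) is the standard Bartlett correction (assumption: biased covariance estimators on both sides, so the 1/n factors cancel in the determinant ratio):

```python
import numpy as np
from scipy.stats import chi2

def wilks_lambda(X, y):
    """Wilks' lambda and its Bartlett chi-squared approximation."""
    classes = np.unique(y)
    n, J, K = X.shape[0], X.shape[1], len(classes)
    cov = lambda M: np.atleast_2d(np.cov(M, rowvar=False, bias=True))
    V = cov(X)                                   # global covariance matrix
    W = sum(cov(X[y == k]) * (y == k).sum()      # pooled (within-groups) matrix
            for k in classes) / n
    lam = np.linalg.det(W) / np.linalg.det(V)
    stat = -(n - 1 - (J + K) / 2) * np.log(lam)  # Bartlett transformation
    p_value = chi2.sf(stat, df=J * (K - 1))      # chi2 with J(K-1) d.f.
    return lam, stat, p_value
```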
Assessing the relevance of the descriptors
Measuring the influence of the variables in the classifier
The idea is to measure the variation of the Wilks' lambda between the model with J variables and the model without the variable under evaluation (J-1 variables).
The F statistic (loss in separation if the J-th variable is deleted):

$$F = \frac{n - K - J + 1}{K - 1} \left( \frac{\Lambda_{J-1}}{\Lambda_J} - 1 \right) \sim F(K - 1,\, n - K - J + 1)$$
This statistic is often available in tools from the statistical community (less often in tools from the machine learning community).
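A short sketch of this partial F statistic, reusing the hypothetical wilks_lambda helper defined above:

```python
import numpy as np

def partial_f(X, y, j):
    """F statistic: loss in separation if the j-th variable is deleted."""
    n, J = X.shape
    K = len(np.unique(y))
    lam_J = wilks_lambda(X, y)[0]                         # with all J variables
    lam_J1 = wilks_lambda(np.delete(X, j, axis=1), y)[0]  # without variable j
    return (n - K - J + 1) / (K - 1) * (lam_J1 / lam_J - 1)
```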
The particular case of the binary classification (K = 2)
We have a binary class attribute: Y = {+, -}
$$d(+, X) = a_{+,0} + a_{+,1} X_1 + a_{+,2} X_2 + \cdots + a_{+,J} X_J$$
$$d(-, X) = a_{-,0} + a_{-,1} X_1 + a_{-,2} X_2 + \cdots + a_{-,J} X_J$$

$$D(X) = d(+, X) - d(-, X) = c_0 + c_1 X_1 + c_2 X_2 + \cdots + c_J X_J$$

Decision rule: D(X) > 0 ⟹ Y = +
Interpretation
>> D(X) is a SCORE function: it assigns to each instance a score proportional to the positive class probability estimate
>> The sign of each coefficient indicates the direction of the variable's influence on the class attribute
Evaluation
>> There is an analogy between logistic regression and LDA.
>> There is also a strong analogy between the linear regression of an indicator (0/1) response variable and LDA (some results of the former can be reused for the latter).
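A tiny sketch of the binary score function; the coefficient vectors below are hypothetical:

```python
import numpy as np

# Hypothetical coefficients of the two classification functions (J = 2),
# each vector holding (a_{.,0}, a_{.,1}, a_{.,2}).
a_pos = np.array([1.2, 0.8, -0.3])
a_neg = np.array([0.5, 0.2, 0.4])

c = a_pos - a_neg                   # score coefficients c_0, c_1, c_2
x = np.array([1.0, 2.5, 1.1])       # instance, with a leading 1 for the intercept
label = '+' if c @ x > 0 else '-'   # decision rule: D(X) > 0  =>  Y = +
```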
LDA with Tanagra software
Statistical overall evaluation (MANOVA)

Stat             | Value     | p-value
Wilks' Lambda    | 0.1639    | -
Bartlett -- C(9) | 1252.4759 | 0
Rao -- F(9, 689) | 390.5925  | 0

LDA Summary – Classification functions (Linear Discriminant Functions) and statistical evaluation (variable importance)

Attribute  | begnin    | malignant  | Wilks L. | Partial L. | F(1,689)  | p-value
clump      | 0.728957  | 1.615639   | 0.183803 | 0.891601   | 83.76696  | 0
ucellsize  | -0.316259 | 0.29187    | 0.166796 | 0.982512   | 12.26383  | 0.000492
ucellshape | 0.066021  | 0.504149   | 0.165463 | 0.990423   | 6.6621    | 0.010054
mgadhesion | 0.057281  | 0.232155   | 0.164499 | 0.99623    | 2.60769   | 0.106805
sepics     | 0.654272  | 0.869596   | 0.164423 | 0.996687   | 2.29011   | 0.130659
bnuclei    | 0.209333  | 1.427423   | 0.210303 | 0.779248   | 195.18577 | 0
bchromatin | 0.686367  | 1.245253   | 0.167816 | 0.976538   | 16.55349  | 0.000053
normnucl   | -0.000296 | 0.461624   | 0.168846 | 0.97058    | 20.88498  | 0.000006
mitoses    | 0.200806  | 0.278126   | 0.163956 | 0.99953    | 0.32432   | 0.569209
constant   | -3.047873 | -23.296414 |          |            |           |

The "begnin" and "malignant" columns are the classification function coefficients; the remaining columns measure variable importance.
LDA with SPAD software
(1) Only for binary problems
(2) All predictive variables must be continuous
(3) The relevance of the variables is evaluated via a linear regression on the indicator response variable

$$D = d(\text{begnin} / X) - d(\text{malignant} / X)$$

Overall statistical evaluation of the model: F derived from the Wilks' lambda, Hotelling's T².
Link with the regression output: (9.15…)² = 83.76696 — the squared t statistic of a coefficient in the indicator regression equals the corresponding F statistic of the LDA evaluation.
Results of the linear regression on the indicator response variable.
Dealing with discrete (categorical) predictive variables
(1) Dummy coding scheme (we must define a fixed reference level)
(2) DISQUAL (Saporta): Multiple Correspondence Analysis + LDA on the factor scores
(this is a kind of regularization which reduces the variance of the classifier when we select a subset of the factors)
Some tools such as SPAD can perform DISQUAL and provide the classification functions on the dummy variables. A minimal sketch of route (1) follows.
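A minimal sketch of route (1), dummy coding followed by LDA, with pandas and scikit-learn; the file and column names are hypothetical:

```python
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

df = pd.read_csv("data.csv")                  # hypothetical dataset
X = pd.get_dummies(df[["region", "grape"]],   # hypothetical categorical predictors
                   drop_first=True)           # fixed reference level per variable
lda = LinearDiscriminantAnalysis().fit(X, df["quality"])
```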
Feature selection (1) – The STEPDISC approach
Forward strategy
Principle: Based on the F statistic
Process: Evaluate the addition of the (J+1)th
variable into the classifier at each step
$$F = \frac{n - K - J}{K - 1} \left( \frac{\Lambda_J}{\Lambda_{J+1}} - 1 \right) \sim F(K - 1,\, n - K - J)$$
FORWARD selection
J = 0
REPEAT
    For each candidate variable, calculate the F statistic
    Select the variable which maximizes F
    Does the addition imply a "significant" improvement of the model?
    If YES, the variable is incorporated into the model
UNTIL (no variable can be added)
Note:
(1) Problems may arise when "significant" is defined through the computed p-value (see 'multiple comparisons')
(2) Other strategies: BACKWARD and BIDIRECTIONAL
(3) A similar strategy is used in linear regression
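A sketch of the forward loop, reusing the hypothetical wilks_lambda helper from the earlier slide; alpha is the significance level of the stopping rule:

```python
import numpy as np
from scipy.stats import f as f_dist

def stepdisc_forward(X, y, alpha=0.05):
    """Greedy FORWARD selection driven by the F statistic."""
    n, p = X.shape
    K = len(np.unique(y))
    selected = []
    while True:
        J = len(selected)
        best = None
        for j in range(p):
            if j in selected:
                continue
            lam_J = wilks_lambda(X[:, selected], y)[0] if selected else 1.0
            lam_J1 = wilks_lambda(X[:, selected + [j]], y)[0]
            F = (n - K - J) / (K - 1) * (lam_J / lam_J1 - 1)
            if best is None or F > best[1]:
                best = (j, F)
        if best is None:
            break                                    # no candidate variable left
        if f_dist.sf(best[1], K - 1, n - K - J) >= alpha:
            break                                    # no "significant" improvement
        selected.append(best[0])                     # incorporate the best variable
    return selected
```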
Feature selection (2)
Wine quality (Tenenhaus, pp. 256-260)
E.g. Stopping rule – Significance level α = 0.05

Temperature (°C) | Sun (h) | Heat (days) | Rain (mm) | Quality
3064             | 1201    | 10          | 361       | medium
3000             | 1053    | 11          | 338       | bad
3155             | 1133    | 19          | 393       | medium
3085             | 970     | 4           | 467       | bad
3245             | 1258    | 36          | 294       | good
...              | ...     | ...         | ...       | ...