close

Вход

Log in using OpenID

embedDownload
1
INPUT DATA ANALYSIS
SMDE-MIRI-FIB
Course 2013-14 Term 1
Author: Lídia Montero
IDA
Introduction
2
1.
In stochastic simulations, must decide on input
probability distributions from which to
generate random variates


2.
Random variates = observations or draws or
realizations of random variables from specified
input distributions/processes.
Queueing Systems: interarrival and service times
Once these input probability distributions are
specified, must have a way to generate
random variates from them
IDA
3
Specifying Univariate Input
Distributions

Univariate distribution: a single (scalar) random
input
 May

have many across simulation
Typically have real-world observed data on a
simulation input
 Want
to fit a probability distribution to the data
 Then use this fitted distribution to generate random
variates to drive the simulation

Why not just drive simulation with actual observed
data?
 Seems
more direct than fitting/generating
 Useful for validation, but there are problems
IDA
Choosing Probability Distributions
4

There are many probability distributions; some common
ones:



Continuous: normal, exponential, continuous uniform, triangular
Discrete: discrete uniform, binomial, Poisson, geometric, negative
binomial
Lists of probability distributions and their
properties/relationships:


Many books and web sites
Beware: different sources may parameterize the same
distribution differently, important to mesh with our current
definition.
IDA
5
Choosing Probability Distributions –
Qualitative


How do you choose one distribution from among
hundreds?
Some qualitative criteria:
Continuous vs. discrete inputs – choose according to situation
 Range (a.k.a. support) of the distribution – finite on both ends
or infinite on right or left ends? Match with the real situation
 Make a histogram of the observed data, match it to shapes of
known probability density functions (continuous) or probability
mass function (discrete) … Formulate hiphotesis,
 But still need to estimate distribution parameters (“fit”), test the
goodness of fit. Descriptive statistical tools are a ‘must’.

IDA
6
Fitting Distributions to Data –
Preliminaries
Select a possible distribution form, estimate
parameters, test goodness of fit … repeat as
necessary …
 Several distribution-fitting packages, or
modules provided in MINITAB.
 Or statistical resources in R.
 Use Stat::Fit (www.geerms.com) … free student
version, plus a more-capable “textbook”
version on this book’s website

IDA
Interarrival times to a queueing system (I)
7
0.0056
0.5861
1.1745
1.5370
2.1642
2.7615
3.7070
4.4611
5.4059
6.3889
7.1863
8.5473
9.5786
11.0024
13.0725
14.3476
15.6594
16.8580
19.3180
21.9020
25.5066
27.7501
30.2375
33.4450
38.5539
45.0766
61.9682
0.0967
0.6353
1.1891
1.5647
2.1831
2.7816
3.7101
4.5225
5.4331
6.4247
7.3967
8.5664
9.6571
11.1306
13.2578
14.3792
15.9621
16.9638
19.3221
22.5750
26.0540
27.9995
30.3884
34.4730
38.6309
46.2803
66.3411
0.1807
0.6510
1.2248
1.6214
2.2502
2.8736
3.7975
4.6649
5.4764
6.4959
7.6357
8.5788
9.7378
11.1442
13.6691
14.3882
16.0468
17.2893
19.7837
22.6202
26.1539
28.8724
30.4502
35.3383
38.6955
46.6233
66.7154
0.2262
0.6731
1.2468
1.6926
2.3509
2.9131
3.9954
4.8233
5.5202
6.6339
7.6863
8.7657
9.9193
11.5322
13.8986
14.5239
16.0755
17.3069
19.8183
23.0887
26.3157
29.1316
31.3845
35.5309
40.2452
47.8474
67.0041
0.2389
0.7571
1.3303
1.7026
2.3802
2.9503
4.0307
4.8930
5.5676
6.6549
7.7195
8.7732
10.0864
11.5343
14.0160
14.5886
16.3010
17.3494
19.9215
23.4108
26.4727
29.2133
31.6547
36.2030
40.7250
48.5374
70.4577
0.2628
0.7587
1.3524
1.8342
2.4070
2.9543
4.2456
4.9570
5.7514
6.7329
7.8727
8.7898
10.4750
12.1626
14.0924
15.1880
16.3375
17.6276
20.4094
24.2013
26.8948
29.5204
31.7689
37.0102
41.0193
53.3607
79.3826
0.4112
1.0471
1.3836
1.9083
2.4423
3.3028
4.3496
5.0024
5.9360
6.9757
7.9692
8.9490
10.5871
12.2130
14.1088
15.2705
16.5762
18.0241
21.2502
24.2341
26.9533
29.8822
31.9442
37.3794
41.6327
56.1566
IDA
0.5298
1.0651
1.4123
1.9799
2.5339
3.4130
4.3519
5.0737
5.9792
7.1417
8.1695
9.0254
10.8425
12.5622
14.2914
15.4756
16.7091
18.1331
21.3287
24.3140
26.9969
30.1434
32.2406
37.3898
44.2501
58.3313
0.5525
1.1033
1.4896
2.0623
2.7035
3.4334
4.4318
5.1245
6.1983
7.1698
8.3292
9.2393
10.8702
13.0473
14.3412
15.5005
16.7427
18.5128
21.3893
24.8287
27.0315
30.2184
33.2581
38.2246
45.0125
60.6484
Interarrival times to a queueing system (II): sample
moments
8
60
Frequency
50
40
30
20
10
0
0
20
40
60
80
tiempos
Histogram 240 observed inter arrival times
Mean and standard deviation of the sample :
μˆ = 16.398179
ˆ = 15.855114
σ
Coefficient of Variation cv = 0.966882 ≈ 1.0 →
Hipothesis : exponential inter arrival times : f(x) = 0.060982e -0.60982x
IDA
Interarrival times to a queueing system (III):
validation of the theoretical distribution
9
•
•
•
Group observations into intervals: plot them in a histogram
Nj is the number of observations in the sample belonging to j-th class, whose
lower and upper values are: [aj-1, aj)
According to the theoretical distribution considered f(x) (depending on the
hypothesis), then the expected number of observation in j-th class Npj, being N
the total sample size, and pj the expected probability for j-th interval, that
according to f(x) is:
aj
pj =
∫ f ( x)dx
a j −1
•
Compute the χ2 statistic for distribution matching, distributed as χ k2−1 :
k
( N j − Np j ) 2
j =1
Np j
χ =∑
2
If the fit is good enough then χ2 value should be less than the theoretical
distribution of the statistic χ 2k −1,1−α , for a confidence level of α.
•
In the example taking equal probability for each interval pj=0.04, fiven k = 25
classes, then the theoretical number of observations should be Npj = 240 x 0.04 =
j
9.6. Lower and upper interval values are given by aIDA
) ,
j = −16.398179 ln(1 −
25
leading to the following table .
Ensure equal number of expected obs.
according to hypothetical distribution
10
11
Ensure equal number of expected obs.
according to hypothetical distribution
par(mfrow=c(1,2))
hist(sample,freq=F,breaks=25,col="yellow",main="Equal
interval length ... ")
curve(dgamma(x,shape=1,scale=16.398179),col=2,add=T)
sequence<-seq(0,1,by=0.04)
qualist<-quantile(sample,sequence)
qualist[1]<-0
sequence;qualist
hist(sample,freq=F,breaks=qualist,col="green",main="Qua
ntile defined lenght... ")
curve(dgamma(x,shape=1,scale=16.398179),col=2,add=T)
Interarrival times to a queueing system (III): validation of theoretical
distribution
12
k
[aj-1,aj)
Nj
Npj
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
[0,0.6694)
[0.6694,1.3673)
[1.3673,2.0962)
[2.0962,2.8591)
[2.8591,3.6591)
[3.6591,4.5002)
[4.5002,5.3868)
[5.3868,6.3242)
[6.3242,7.3183)
[7.3183,8.3766)
[8.3766,9.5079)
[9.5079,10.7232)
[10.7332,12.0357)
[12.0357,13.4626)
[13.4626,15.0255)
[15.0255,16.7532)
[16.7532,18.6846)
[18.6846,20.8743)
[20.8743,23.4021)
[23.4021,26.3918)
[26.3918,30.051)
[30.051,34.7865)
[34.7865,41.4173)
[41.4173,52.7837)
[52.7837,∞)
12
12
12
11
7
10
8
9
10
8
9
7
7
6
12
13
9
6
7
9
12
13
13
8
10
9.6
9.6
9.6
9.6
9.6
9.6
9.6
9.6
9.6
9.6
9.6
9.6
9.6
9.6
9.6
9.6
9.6
9.6
9.6
9.6
9.6
9.6
9.6
9.6
9.6
(Nj- Npj)2/
Npj
0.6
0.6
0.6
0.2041
0.7041
0.0166
0.2666
0.0375
0.0166
0.2666
0.0375
0.7041
0.7041
1.35
0.6
1.2041
0.0375
1.35
0.7041
0.0375
0.6
1.2041
1.2041
0.2666
0.0166
2
χ =13.3324
2
Threshold for chi-square distribution χ 24
≥ χ2=13.3324 statistic ⇒ H0 accepted
.
, 0.90 = 33196
IDA
Service times in a queueing system : repeat the process
13
0.1454
1.2881
1.7846
2.1979
2.6951
3.2008
3.6541
4.0073
4.3200
4.6505
5.2811
6.0355
6.5401
7.1373
7.9724
8.4725
9.1904
10.1837
10.5886
11.3437
12.2778
13.0563
15.5465
18.3013
20.7997
22.5364
32.0592
0.5238
1.3493
1.8179
2.2023
2.7077
3.2599
3.6675
4.0156
4.4019
4.7220
5.3477
6.0467
6.6975
7.2518
7.9819
8.6467
9.2989
10.1946
10.6869
11.4795
12.3234
13.2111
15.6115
18.6442
20.9388
22.6313
32.1830
0.6166
1.3659
1.8522
2.2234
2.7357
3.2930
3.7180
4.0534
4.4094
4.8663
5.4460
6.2092
6.8948
7.2790
7.9981
8.6740
9.6704
10.1948
10.7418
11.5043
12.4025
13.2396
15.8801
18.9559
21.1958
22.8665
44.9957
0.6795
1.3746
1.8625
2.3095
2.8507
3.3534
3.7821
4.0854
4.5001
4.8709
5.5446
6.2176
6.9817
7.3775
8.0131
8.7237
9.7038
10.2016
10.7594
11.5985
12.4373
13.6681
16.0658
19.1108
21.3721
23.5957
48.1438
0.7180
1.4258
1.8690
2.3292
2.9065
3.3606
3.8698
4.1096
4.5362
4.9562
5.7256
6.2428
7.0083
7.6351
8.0861
8.8410
10.0262
10.2505
10.7623
11.7391
12.6500
14.6065
16.2848
19.1158
21.6883
24.7190
57.7308
0.8721
1.4259
1.9063
2.4369
2.9127
3.4855
3.9133
4.1534
4.5387
5.0169
5.9442
6.3282
7.0408
7.6402
8.1647
8.9370
10.0340
10.3965
10.9943
11.9313
12.7672
14.6497
17.3174
19.1320
21.7385
25.9496
59.4783
0.9957
1.5716
2.0432
2.4854
3.0853
3.5595
3.9293
4.1550
4.5698
5.1590
5.9679
6.4020
7.0896
7.7009
8.2875
8.9715
10.0410
10.4271
11.0059
12.0433
12.7755
14.8285
17.6161
19.2530
21.7717
27.0300
IDA
1.1146
1.6946
2.0636
2.5326
3.0942
3.5697
3.9537
4.2082
4.5793
5.1796
5.9866
6.4833
7.1086
7.8690
8.3136
8.9877
10.0722
10.4441
11.0416
12.0448
12.8075
14.9139
17.9746
19.9745
22.0588
27.4983
1.2283
1.7126
2.1548
2.5630
3.1387
3.5865
3.9765
4.2180
4.5832
5.2767
6.0239
6.5308
7.1304
7.8781
8.4141
9.0189
10.0822
10.5701
11.1802
12.1718
13.0493
15.4660
17.9907
20.6804
22.2163
28.7685
Service times in a queueing system : not so simple !
14
50
Frequency
40
30
20
10
0
0
10
20
30
40
50
60
tserv
Histogram for the 240 observation sample
50
Frequency
40
30
20
10
0
0
10
20
30
40
50
60
tserv
H0: Theoretical distribución Γ(1.25,7.5)
IDA
Fitting Distributions to i.i.d univariate data

15


Formulation of hypothesis once descriptive data analysis is performed: Histograms,
Box Plots, etc. List of candidate distributions.
Estimation of parameters: location, scale and shape parameter for each candidate
distribution.
Validation goodness of fit for each possible distribution: Quantile and Probability
Plots, Χ2 Test, Kolmogorov-Smirnoff Test,…
PROCESS
ANALYSIS
DATA
COLLECTION
PROPOSED DISTR.
HIPOTHESIS
LIST OF CANDIDATE
DISTRIBUTION
DESCRIPTIVE
DATA ANALYSIS
MODEL:
THEORETICAL DISTRIBUTION
Estimation
of model
parámeters
Goodness of Fit Tests
IDA
REAL PROCESS
16
Fitting Distributions to i.i.d univariate
data

For each chosen distributions, Fit Distribution
Parameters  Validation of Fit
 Several Goodness of Fit tests: Chi Squared (most
popular test, df must be provided), K-S
(Kolmogorov-Smirnoff), A-D (Anderson-Darling).
 Minitab: Table with all distributions, all goodnessof-fit tests’ test statistics
 Smaller test statistics ⇒ better fit
 Detail report on each chosen distribution, and each
goodness-of-fit test applied to it
IDA
17
Fitting Distributions to i.i.d univariate
data


H0: “Sample Data distributed as ...”
Each test provides a p value that means “probability of H0
to be true”
How to interpret p value:


If p value is less than the selected confidence level (usually 5
or 10%) then “REJECT H0”, meaning that the proposed
distribution does not fit available data.
If p value is greater than the selected confidence level
(usually 5 or 10%) then there is no evidence to “REJECT H0”,
thus “DO NOT REJECT H0” (“ACCEPT H0”), meaning that the
proposed distribution does fit available data.
IDA

18

Coefficient of variation for random variate X:
σX
CX =
μX
Intuitively is a mesure of how far the selected distribution for the sample
data relies from the exponential distribution ( with CX=1 )
A. Candidate distributions for nonnegative sample data shows CX<1:
Gamma family distributions:
 − xβ
∞
 e β −α x α −1 para x ≥ 0
fX (x) =  Γ (α )
Γ(α) = ∫ y α -1e −α dy
0

para x = 0
0
α ∈ R + , β ∈ R + If α = k ∈ Z + ⇒ Γ(k + 1) = k!
α shape parameter, β scale parameter. FX(x) has an analitical expression
j

when α is integer:
x 



− x α −1  β 
β

FX (x) = 1 − e ∑
j!
j=0

0 otherwise
x≥0
IDA

19
Particular case: when α =1 , Γ(1) ≡ exponential distribution scale β or
rate parameter 1/β.
Mean x = αβ
α
⇒ CX =
<1
fX (x) → 
2
2
α
Variance σ = αβ

For α = k integer and β=1/(kµ), Γ(k, 1/(kµ)) is Erlang distribution with
shape k and mean 1/µ, sum k I.I.D. exponentials with mean 1/(kµ),
k −1
kμ(kμ ) x k −1 −kμ x 
fX (x) =
e

(k − 1)!
1
1

x
≥
0;
Mean
;
Variance
2
j j
k −1
μ
kμ
x
(
)
kμ x 
FX (x) = 1 − e −kμ ∑
j! 
j=0
( )
( )
IDA
Fitting Distributions to i.i.d univariate data
20


Hipothesis: Observations X1, X2, …, Xn, are I.I.D.
Method: Maximum likelihood estimation
 Let Pθ(x) be a univariate probability function, depending
ˆof θ parameter
on θ parameter. We infer an estimate θ
calculate from sample data X1, X2, …, Xn.
 Maximum Likelihood Function
L(θ) is defined as:
θˆ
L(θ) = Pθ(X1) Pθ(X2)…. Pθ(Xn)
ˆ
θ
 The estimate
of θ parameter maximizes L(θ) being the
one the “one that better explain sample data”.
ˆθ ⇔ dL(θ ) = 0
dθ
IDA
Example 1: Exponential distribution

21

1 − xμ
Mean θ = μ > 0, fμ (x ) = e , x ≥ 0
μ
Likelihood function :
 1 −X1 μ
L(μ ) =  e
μ

Log-likelihood function:

Determine the value of
 1 − X 2 μ
 e
 μ
μˆ
  1 −Xn μ
... e
 μ
 1 n 
 1
 = n exp − ∑ X i 
 μ i =1 
 μ
1 n
l(μ ) = ln[L(μ )] = −nlnμ − ∑ X i
μ i =1
maximizing L(µ) for µ≥0 :
MAX L(μ ) ≡ MAX l(μ )
μ

According to
KKT conditions:
μ
dl(μ )
n 1
=− + 2
dμ
μ μ
n
n
∑X
i =1
n
∑ X i ⇒ μˆ =
dl2 (μ ) n 2
= 2− 3
2
dμ
μ μ
n
∑X
i =1
IDA
i
i =1
i
< 0 for μ = μˆ
22
Exemple 2: Gamma Γ(α,β=1), no location
. Scale β= 1
parameter considered.
IDA
Take a look to standard Gamma pdf’s
23
IDA
Exemple 2: Gamma Γ(α,λ), pay attention
to the meaning of parameters
24

Gamma Γ(α,β), no location parameter considered
e− xβ

β − α x α −1
fX (x) =  Γ (α )

0
x>0
x=0
∞
Γ(α) = ∫ y α -1e −α dy
0
α ∈ R + , β ∈ R + α = k ∈ Z + ⇒ Γ(k + 1) = k!

Likelihood function :
L(α, β ) =
β
−nα
n
α −1

 1 n
exp − ∑ X i 
 β i =1 
[Γ(α )]n


 ∏ X i 
 i =1 
non linear system to simultaneous estimate αˆ y βˆ
∂L(α, β )
∂L(α, β )
=0
= 0,
∂β
∂α
IDA
Solution : fitdistr() method in R
25
Exemple 2: Gamma Γ(α,λ), pay
attention to the meaning of parameters




Probability Density function, most common parametrization with
λ rate parameter
Gamma α = n/2 and λ = 1/2 is a chi-square variate with n
degrees of freedom (df).
If α = k integer then k-Erlang and λ = kμ.
If α = 1 then exponential variate with rate λ (mean 1/ λ) . In
the previous parametrization β has a mean meaning and thus, λ
2
=1/β.
α
1
E[T ] = V [T ] = α  
λ
 λ  IDA
26
Erlang Distribution:for statisticians a particular
gamma when shape parameter is integer
In computer science a case of hypoexponential
distribution
IDA
HipoExponential distributions (HIPOEXP)
27




HipoExp: Serial connection of Exponentials phases
2 phase HipoExp notated as HiPO(λ1, λ2).
Density, distribution and hazard functions are:
Service time for a disk, usually modeled as 3 phase
HipoExp since total access time is the sum of searching
time plus latency plus transfer timeés.
IDA
HipoExponential distributions (HIPOEXP)
28

HipoExp(1,5)
HipoExp(1,5) density function
HipoExp(1,5) Distribution function
IDA
Erlang distribution
29



Particular case of HIPOEXP process: all phases are i.i.d
exponentially distributed.
Also a particular case of gamma distribution when shape
parameter is integer (meaning, number of modeling
phases).
No specific distribution in R for Erlang variates.
IDA
Service times in a queueing system : descriptive
statistical analysis
30
Summary for TServicio
Anderson-Darling Normality Test
A-Squared
P-Value <
0
12
24
36
48
60
11,08
0,005
Mean
StDev
Variance
Skewness
Kurtosis
N
9,6076
8,7011
75,7084
2,51942
9,63090
240
Minimum
1st Quartile
Median
3rd Quartile
Maximum
0,1454
3,9173
7,3283
12,2513
59,4783
95% Confidence Interval for Mean
8,5012
10,7140
95% Confidence Interval for Median
6,3537
95% Confidence Interval for StDev
95% Confidence Intervals
7,9860
Mean
Median
6
7
8
9
8,4523
10
11
IDA
9,5578
Service times in a queueing system :
Minitab output
31
Probability Plot for TServicio
Probability Plot for TServicio
Weibull - 95% CI
Lognormal - 95% CI
Goodness of Fit Test
90
80
70
60
50
40
30
20
99,9
AD = 0,996
P-Value = 0,013
Goodness of Fit Test
99
AD = 0,947
P-Value = 0,016
95
90
10
Percent
Percent
99,9
99
5
3
2
1
80
70
60
50
40
30
20
10
5
1
0,1
0,01
0,10
1,00
TServicio
10,00
0,1
100,00
0,1
1,0
Probability Plot for TServicio
10,0
TServicio
Probability Plot for TServicio
Gamma - 95% CI
Logistic - 95% CI
99,9
99
95
90
80
70
60
50
40
30
20
99,9
Goodness of Fit Test
AD = 0,525
P-Value = 0,211
Goodness of Fit Test
AD = 5,963
P-Value < 0,005
99
95
Percent
Percent
100,0
10
5
3
2
80
50
20
5
1
1
0,1
0,1
1,0
TServicio
10,0
100,0
0,1
-30
-20
-10
IDA
0
10
20
TServicio
30
40
50
60
Service times in a queueing system :
GAMMA (θ,α,β)
32


Shape α
= 1,526
Scale β =
6,295.
Location/
shift
paramete
r θ =0
Histogram of TServicio
Gamma
80
Shape 1,526
Scale 6,295
N
240
70
60
Frequency

50
40
30
20
10
0
0
8
16
24
32
TServicio
40
IDA
48
56
Example 3: Weibull distribution W(θ,α,β)

33
Probability Density function fX(x) and distribution function FX(x) when no
shift is considered; i.e. θ = 0:
 −α α −1 − x β 

fX (x ) = αβ x e
0 otherwise

α

− x 
1 − e  β 
x≥0
FX (x) = 
0 otherwise
α
x≥0
Shape parameter α > 0, scale parameter β > 0
2



α
2
α
1
+
+
 


2 
Variance β Γ
 
 − Γ
  α    α  
 α + 1
Mean βΓ

 α 
Γ(c + 1) = cΓ(c ), Γ(0 ) = Γ(1) = 1, Γ( 12 ) = Π

Maximum likelihood estimate require numeric resolution of the
simultaneous equations:
 n αˆ 
X lnX i
lnX i
 ∑ Xi 
∑
∑
1 i =1
i =1
, y βˆ =  i=1 
− =
n
 n 
αˆ
n
αˆ
IDA
Xi


∑
i =1


n
αˆ
i
n
1
αˆ
34
IDA
Example 3: Weibull distribution W(θ,α,β)
35



Pay attention to the functional form, that is meaning of model
parameters.
Instead W(θ=0,α,λ=1/β), widely used in engineering.
Modeling of fatigue in mechanical or electronic components
F (t ) = 1 − e
2

11 1
1  1    2  1  1  
E[T ] =  Γ  V [T ] =    2Γ  − Γ  
α  λ  α 
α  λ   α  α α  
α shape parameter and λ is rate parameter, related to scale
parameter β , λ =1/β.
Previously to the use of any statistical package, the distribution
functional form must be checked.
2


− λt α
IDA
Descriptive Statistics methods in R (I)
36


Variate type: discrete variate (with few values) or
continous variate (also, discrete one with many
values).
Numerical description: moments; i.e. mean, variance
(standard deviation), kurtosis, etc. Classified as:
 Centrality
Trend: Mean (not average), Median, Mode.
 Diversion Trend: Variance (Standard Deviation),
Quantiles, IQR (Q3-Q1), etc.

Rough description in R: summary( dataframe )
IDA
Descriptive Statistics methods in R (II)
37


Discrete variate (with few values or cathegorical) :
pie.chart(table( variable ) ), barplot( table( variable ) ) .
Continuous variates: Classified as:
 Histogram:
hist(variable, freq=T). Draw a theoretical
distribution
 Boxplot: use Boxplot(variable) in library(car) instead
that boxplot(variable).


Default description in R: plot( variable ).
Possible tools increase when 2 simultaneous variates
are considered (not the case in this unit).
IDA
Descriptive Statistic methods in R:
example
38



Draw a sample of size 5000 from W(θ=5,α=2,λ=1),
Draw a sample of size 5000 from W(θ=0,α=2,λ=1),
Theoretical moments for standard Weibull; i.e., X∼W(θ=0,α=2,λ=1).
And thus for Y∼ W(θ=5,α=2,λ=1) ?
 α + 1
Mean βΓ

 α 
2



α
2
α
1
+
+
 


2 
Variance β Γ
 
 − Γ
α
α
 





Γ(c + 1) = cΓ(c ), Γ(0 ) = Γ(1) = 1, Γ( 12 ) = Π
E [X ] = Γ((c + 1) c )β = Γ(3 2 ) = π 2 E [Y ] = 5 + π 2


Compute sample statistics: check moment consistency.
Plot PDF (Probability Density Function), overlap theoretical distribution
IDA
Take a look to standard Weibull pdf’s
39
IDA
40
Descriptive Statistics methods in R:
example

Weibull: α = 2 and λ =1
# Exemple Weibull: shift 0 - scale 1 - shape 2
par(mfrow=c(1,1))
m.shape2<- rweibull(5000, shape=2, scale = 1)
mean( m.shape2);stdev( m.shape2);var( m.shape2)
hist( m.shape2, freq=F )
x<-seq(0,max(m.shape2),100)
curve(dweibull(x,shape=2,scale=1),col=2,add=T)
# Exemple Weibull: shift 5 - scale 1 - shape 2
m.shape2<- 5 + m.shape2
# How would the new histogram be?
IDA
41
Descriptive Statistics methods in R:
example






Hypothesis H0: Data is Weibull distributed, with known
shape α=2 (unrealistic)
Theoretical moments for standard Weibull with shape
α=2; i.e., W(θ=0,α=2,λ=1).
Compute numerical statistics: sample and H0.
Plot PDF (Probability Density Function) for sample data,
overlap theoretical distribution
Shift and scale contribute at first sight to missmatch H0.
How to avoid the distorsion effect of non standard
distributed sample data?
IDA
42
Descriptive Statistics methods in R:
example



Hypothesis H0: Data is Weibull distributed, with known
shape α=2 (unrealistic)
How to avoid the distorsion effect of non standard
distributed sample data?
Plot theoretical standard quantiles for the theoretical
distribution vs sample data quantiles. That’s a QQPlot.
Compute the coefficient of determination R2 of the linear
regression fit for QQPlot: this is a goodness of fit indicator as
closer to 1 than better is the fit to H0.
 Intercept and slope of regression are estimates of shift and
scale parameters for H0.

IDA
43
Descriptive Statistics methods in R:
example
• Sample data from Weibull location (shift) 5, scale 1 and shape 2.
• H0: Weibull distribution with shape α= 2.
• Plot QQPlot and estimate simple linear regression line.
MEIO 2 (SIMULACIÓN) - III Input Data Analysis
44
Descriptive Statistics methods in R:
example

H0: Weibull: α = 2. Shape parameter known.
> # Sample data in m.shape2 variable
> # Compute standard weibull shape 2 for probability points
in the sample
> par(mfrow=c(1,2))
> hist( m.shape2, freq=F )
> x<-seq(min(m.shape2),max(m.shape2),100)
> curve(dweibull(x-5,shape=2,scale=1),col=2,add=T)
> xx <- qweibull( ppoints( m.shape2 ) , shape=2)
> yy<- sort( m.shape2)
> mm <- lm( yy ~ xx )
> cor(yy,xx) # Squared correlation coeff is R2
[1] 0.9998977
> mm
Coefficients:
(Intercept)
xx
5.002425
0.986965
> plot( qweibull( ppoints( m.shape2), shape=2 ), sort(
m.shape2) ) # plot( xx, yy )
I
> lines( xx, mm$fitted.values, col=2 )
45
Descriptive Statistics methods in R:
example


•
•
H0: Weibull α = 2. Shape parameter known.
Sample data W(θ=5,α=2,λ=1).
Intercept is a shift (location) parameter estimate.
Slope is a scale parameter estimate.
IDA
Inference of parameter estimates for H0
46

H0: Weibull. Shift should be estimated externally.

ML Estimation in R: method fitdistr()
in library MASS.
> # Maximum likelihood estimation shape and scale (rate),
package MASS. Shift (location) estimated externally.
> fitdistr( m.shape2-5, "weibull")
shape
scale
2.008910258
0.990032096
(0.022126097) (0.007340344)
Warning messages:
1: Se han producido NaNs in: dweibull(x, shape, scale, log)
…
> fitdistr( m.shape2-5, "weibull",lower=0.001)
shape
scale
2.008909637
0.990030803
(0.022126093) (0.007340332)
IDA
47
Goodness of fit: Chisq Test, KS Test,
Anderson-Darling Test, etc.


H0: Weibull θ=5, α = 2 and λ =1 .
General goodness of fit Kolmogorov-Smirnoff (no shift parameter). Not
available: Chi Squared Test.
> chisq.test( m.shape2-5, "pweibull", shape= 2, scale=1
) # Non existent: Not authomatically provided by R
> ks.test( m.shape2-5, "pweibull", shape= 2, scale=1 )
One-sample Kolmogorov-Smirnov test data: m.shape2
D = 0.015, p-value = 0.2131
alternative hypothesis: two.sided
> # library(nortest): several normality tests
> shapiro.test( m.shape2)
Shapiro-Wilk normality test data: m.shape2
W = 0.9717, p-value < 2.2e-16
IDA
48
Descriptive Statistics methods in R:
example



Hypothesis H0: Data is Weibull distributed, with
unknown shape. Define a feasible range for the
unknown shape parameter.
For each feasible α value in the range:
QQPlot for α
Compute the coefficient of determination R2 of the linear
regression fit for QQPlot(α ): remember as closer to 1 than
better is the fit.
 PPCC Plot (Probability Plot Correlation Coeficient Plot ) : R2(α) vs α

α
is selected as the value that maximizes R2 linear
regression fit in PPCC Plot .
IDA
49
Descriptive Statistics methods in R:
example
• H0: Weibull. Unknown shape parameter.
• PPCC Plot: Probability Plot Correlation Coeficient Plot
> # H0: Sample Data is Weibull distributed
> # Aquell valor de param. shape que dona la maxima
correlació en el pplot
> # m.shape2 <- 5+rweibull(5000, shape=2, scale = 1)
> rang <- seq(0.1,3,length=50)
> cors<-rep(0,50)
> yy<- sort( m.shape2 )
> for (i in 1:50)
+ {
+ xx <- qweibull( ppoints(m.shape2 ) , shape=rang[ i ])
+ mm <- lm( yy ~ xx )
+ cors[ i ]<-cor(yy,xx)
+ }
> plot(rang,cors) # PPCC Plot
> rang[which(cors>=max(cors))]
[1] 2.053061
> # Repeat the process (1.5,2.5) …
IDA
50
Descriptive Statistics methods in R:
example


H0: Weibull. Unknown shape parameter.
PPCC Plot: Probability Plot Correlation Coeficient Plot
1. Asummed range for shape
parameter α : (1.5, 2.5).
2. Scattergram R2 vs α (PPCC
Plot).
3. Estimate for shape paramter
α is the value given the
maximum inPPCC Plot
4. To increase precision: repeat
the process.
IDA
51
Example: Repeat PPCC Plot estrategy for
a gamma sample
• Exemple Gamma: α = 1,525 and rate λ =1/6.30. No shift.
• H0: Sample Gamma distributed with unknown shape, rate and
> no shift.
> rang <- seq(0.1,10,length=100)
> cors<-rep(0,100)
> for (i in 1:100)
+{
+ xx <- qgamma( ppoints( mostra72 ) , shape=rang[ i ])
+ yy<- sort( mostra72 )
+ mm <- lm( yy ~ xx )
+ cors[ i ]<-cor(yy,xx)
+}
> plot(rang,cors)
> rang[which(cors>=max(cors))]
[1] 1.5
IDA
Example: Gamma sample data
52


>
>
>
>
>
Sample data: Gamma: α = 1,525 and rate λ =1/6.30. No shift.
H0: Sample Gamma distributed with estimated shape 1.5 (according to
PPCC Plot). Unknown scale/rate and shift.
# mostra72 <- rgamma(5000, shape=1.525, scale = 6.3)
# PPCC Plot technique estimated shape 1.5
xx <- qgamma( ppoints( mostra72 ) , shape=1.5)
yy<- sort( mostra72 )
mm <- lm( yy ~ xx )
> cor( yy,xx)
[1] 0.999535
> mm
(Intercept)
-0.06699
xx
6.47899
> plot( qgamma( ppoints( mostra72 ), shape=1.5 ), sort(
mostra72 ) )
> lines( xx, mm$fitted.values, col=2 )
IDA
Example: Gamma sample data
53
> ks.test(mostra72,"pgamma", shape= 1.5, scale=6.47899 )
One-sample Kolmogorov-Smirnov test data: mostra72
D = 0.0159, p-value = 0.1591 alternative hypothesis: two.sided
> ks.test(mostra72,"pgamma", shape= 1.525, scale=6.30 )
One-sample Kolmogorov-Smirnov test data: mostra72
D = 0.0126, p-value = 0.4018 alternative hypothesis: two.sided
# Max LK estimate: package MASS in R:
> fitdistr( mostra72, "gamma",lower=0.001)
shape
rate
1.487304711 0.154110460
(0.027060660) (0.003324827)
> beta=1/rate
> 6.488852
IDA
54
SOME USEFUL RANDOM
VARIATES FOR MODELING
PURPOSES
Input Data
Analysis
Pareto variate: Density function plot
55
α
f ( x ) = αk x
−α −1
, x ≥ k, α , k > 0
Long-tailed distribution for:
• Modeling CPU time,
• Size of transferred files,
• Thinking time in a Web
editor, etc.
IDA
56
Pareto variate: Distribution function
plot
IDA
Hipoexponential: General case
57
r

Z =
∑X
i =1
i
, where X1,X2, … , Xr are independent
variates with exponential distribution of rate λi
λi ≠ λ j for i ≠ j

Then Z is an r-phase hipoexponential variate.
EXP(λ1)
EXP(λ2)
EXP(λr)
IDA
58
HiperExponential variate
(HIPEREXP)


Composed Distribution for alternance of K Exponential
variates (non identical)
Ellapsed CPU Time sometimes modeled as a HiperExp.
IDA
59
Hiper Exponencial Vs. Exponencial
CDF
IDA
60
Transformed variates:
Y pdf from X pdf since Y = Φ(X)

Let X be a normal variate X - N(µ, σ2)

and let us define Y = Φ(X) = eX, thus

Y is a log-normal variate-
IDA
Beta distribution… 2 shape parameters
61


Stàndard
defined (0,1).
Let α , β, be
shape
parameters,
thus unimodal,
J-shape, Ushape, … can
be obtained.
IDA
Distribution-Fitting Issues
62

What if nothing fits?


Too much data?



“False” rejection
Watch for bad data
Sensitivity analysis


Empirical distributions
Test both means and variation
Types of inputs:
Deterministic vs. Stochastic
 Scalar vs. Multivariate vs. Stochastic
 Time-Varying Arrival Rates

IDA
1/--pages
Пожаловаться на содержимое документа