Applying Statistical Learning,
Optimization, and Control to Application
Performance Management in the Cloud
Xiaoyun Zhu
October 17, 2014
© 2014 VMware Inc. All rights reserved.
Rapidly growing public cloud market
2
How about hosting critical applications?
3
Application performance – a real concern
(Survey results: 64%, 51%, 44%)
Source: “The hidden costs of managing applications in the cloud,” Compuware/Research In Action White Paper, Dec. 2012,
based on survey results from 468 CIOs in Americas, Europe, and Asia.
4
Application performance management is hard
Service Level Objective (SLO): 95% of all transactions should be completed within 500 ms
An SLO violation triggers performance troubleshooting & remediation
(Diagram: a cloud hosting provider running many tenant applications)
Challenges in managing application performance
•  On average, 46.2 hours spent in “war-room” scenarios each month
Source: Improving the usability of APM data: Essential capabilities and benefits. TRAC Research, June 2012, based on
survey data from 400 IT organizations worldwide
6
Challenges in usability of performance data
“false negatives”
Source: Improving the usability of APM data: Essential capabilities and benefits. TRAC Research, June 2012, based on
survey data from 400 IT organizations worldwide
7
APM goal: achieve service-level-objective (SLO)
Technical challenges
•  Enterprise applications are distributed or multi-tiered
•  App-level performance depends on access to many
resources
–  HW: CPU, memory, cache, network, storage
–  SW: threads, connection pool, locks
•  Time-varying application behavior
•  Time-varying hosting condition
•  Dynamic and bursty workload demands
•  Performance interference among co-hosted applications
8
Better IT analytics for APM automation
Three-pronged approach
Learning
Optimization
Control
9
Why learning?
•  Deals with APM-generated big data problem
•  Fills the semantic gap with learned models
•  Answers key modeling questions
10
APM-generated Big Data
•  “APM tools were part of the huge explosion in metric
collection, generating thousands of KPIs per application.”
•  “83% of respondents agreed that metric data collection has
grown >300% in the last 4 years alone.”
•  10 years ago data was mostly collected every 15 minutes;
now typically every 5 minutes; 23% every 1 minute or less
•  “88% of companies are only able to analyze less than half
of the metric data they collect… 45% analyze less than a
quarter of the data.”
•  “77% of respondents cannot effectively correlate business,
customer experience, and IT metrics.”
Source: “APM-generated big data boom.” Netuitive & APMDigest, July 2012, based on survey of US & UK IT professionals.
11
What performance data are collected?
Infrastructure-level
Physical host metrics
•  System-level stats collected by the hypervisor
§  e.g., esxtop – CPU, memory, disk, network, interrupt
•  CPU stats
§  %USED, %RUN, %RDY, %SYS, %OVRLP, %CSTP, %WAIT, %IDLE,
%SWPWT
•  ~100s-1000s metrics per host!
VM metrics
•  Resource usage stats collected by the guest OS
§  e.g., dstat, iostat
•  ~10s metrics per VM
•  Widely available on most platforms
•  Available at a time scale of seconds to minutes
12
What performance data are collected?
Application-level
Types of metrics
•  End-user experience (response times, throughput)
•  Application architecture discovery
•  Transaction tracing
•  Component monitoring
VMware Hyperic monitoring tool
•  Agents deployed in VMs
•  Auto-discovers types of applications running
•  Plugins to extract application-related performance stats
•  Stats available at a time scale of minutes
•  Stats aggregated in Hyperic server
•  Supports over 80 different application components
•  Extensible framework to allow customized plugins
13
The Semantic Gap challenge
Correlating performance data from different sources into a model
14
Learning helps answer key modeling questions
•  Q1: Which metrics go into the model?
•  Thousands of metrics from each ESX host and its VMs
•  Which system resources or parameters affect application
performance the most?
•  Q2: What kind of model should we use?
•  White-box vs. empirical models
•  Linear vs. nonlinear models
•  Offline vs. online models
•  Q3: Does our model capture current behavior?
•  Application workloads and environments are constantly changing
15
An example multi-tier application
Type of Metrics     Number
app performance     8
raw host metrics    7226
raw VM metrics      266
16
Q1: Which metrics go into the model?
Phase 1: Correlation-based metrics filtering

Category   Type of Metrics       Number
Input      given app metric      1
Input      raw system metrics    Nraw
Output     candidate metrics     Ncan

Filtering criteria: |corrcoef| >= 0.8 and p-value <= 0.1

Example: of 7226 raw metrics from 4 ESX hosts (esxtop), only 132 candidate metrics remain — 98% are not highly correlated. The strength of the correlations differs across the app metrics considered (Throughput, Mean RT, 95-percentile RT).
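As a rough illustration of this Phase-1 filter (not the exact vPerfGuard implementation), the sketch below computes a Pearson correlation coefficient and p-value between each raw system metric and the chosen app metric, assuming the metrics are available as time-aligned pandas columns; the 0.8 and 0.1 thresholds are the ones shown above, everything else is an assumption.

# Minimal sketch of Phase-1 correlation-based filtering (illustrative only).
# Assumes `system_metrics` is a pandas DataFrame (one column per raw metric)
# and `app_metric` is a pandas Series (e.g., mean RT), time-aligned with it.
import pandas as pd
from scipy.stats import pearsonr

def filter_candidates(system_metrics: pd.DataFrame, app_metric: pd.Series,
                      corr_threshold: float = 0.8, p_threshold: float = 0.1):
    candidates = []
    for name in system_metrics.columns:
        col = system_metrics[name]
        if col.std() == 0:                 # skip constant metrics
            continue
        r, p = pearsonr(col, app_metric)
        if abs(r) >= corr_threshold and p <= p_threshold:
            candidates.append((name, r))
    # strongest correlations first
    return sorted(candidates, key=lambda x: -abs(x[1]))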
17
Q1: Which metrics go into the model?
Phase 2: Model-based metrics selection
Input: Ncan candidate metrics and a performance model F
Output: Npred top predictor metrics that provide a good fit for the data
Tunable parameter: minimum incremental improvement in R² (stop when Imp < 0.01)
Example: the selected predictors improve R² by 0.668, 0.074, and 0.063 in turn before the improvement drops below the threshold
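A minimal sketch of this Phase-2 greedy forward selection, assuming scikit-learn's LinearRegression as the performance model F and training-set R² as the fit measure; the 0.01 stopping threshold matches the slide, the rest is illustrative.

# Greedy forward selection: repeatedly add the candidate metric that most
# improves R^2, and stop when the incremental improvement falls below a
# tunable threshold.
import numpy as np
from sklearn.linear_model import LinearRegression

def select_predictors(X: np.ndarray, names: list, y: np.ndarray,
                      min_improvement: float = 0.01):
    selected, best_r2 = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = []
        for j in remaining:
            cols = selected + [j]
            r2 = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
            scores.append((r2, j))
        r2, j = max(scores)                     # best next metric
        if r2 - best_r2 < min_improvement:
            break                               # Imp < threshold: stop
        selected.append(j)
        remaining.remove(j)
        best_r2 = r2
    return [names[j] for j in selected], best_r2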
18
Q2: What kind of models should we use?
White-box performance models
•  Pros
•  Solid theoretical foundation
•  Application-aware, easier to interpret
•  Closed-form solution in some special cases
•  Cons
•  Requires detailed knowledge of the system, application, workload, and deployment
•  More appropriate for aggregate behavior or offline analysis
•  Harder to automate, scale, or adapt
19
Q2: What kind of models should we use?
Black-box empirical models
•  Pros
•  Generic: No a priori assumptions
•  Tools: Many learning algorithms available
•  Automation: Easier to do partially or fully
•  Scalable: Easier to codify analysis in algorithms
•  Challenges
•  Efficiency: Real-time data processing and analytics
•  Accuracy: Reducing false positives and false negatives
•  Adaptivity: Handles changing workloads and environments
20
Q2: What kind of models should we use?
Linear vs. nonlinear models
(Four candidate models: linear regression, k nearest neighbors, regression tree, boosting)
Q2: What kind of models should we use?
Tradeoff between linear and nonlinear models
•  Nonlinear models have better accuracy than the linear regression model
•  The linear regression model has the lowest computation cost
•  The boosting approach has the best accuracy but the highest cost
•  A regression tree may be a good tradeoff between accuracy and cost (see the comparison sketch below)
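The comparison below is a hedged sketch of how such a tradeoff could be measured, using scikit-learn stand-ins for the four model families on a held-out split; the hyperparameters and the use of fit time as a cost proxy are assumptions, not the evaluation from the talk.

# Compare accuracy (held-out R^2) and cost (fit time) of the four families.
import time
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor

def compare_models(X: np.ndarray, y: np.ndarray):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    models = {
        "linear regression": LinearRegression(),
        "k nearest neighbors": KNeighborsRegressor(n_neighbors=5),
        "regression tree": DecisionTreeRegressor(max_depth=6),
        "boosting": GradientBoostingRegressor(),
    }
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_tr, y_tr)
        cost_ms = (time.perf_counter() - start) * 1e3
        print(f"{name:22s} R^2={model.score(X_te, y_te):.3f}  fit={cost_ms:.1f} ms")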
22
Q2: What kind of models should we use?
Offline vs. online models
•  Offline modeling
•  More appropriate for nonlinear models
•  More suitable for capacity planning and initial sizing
•  Cannot adapt to runtime changes in app, workload, or system
•  Online modeling
•  Should be cheap to compute and update
•  Linear models more appropriate
•  Can adapt to changes in app, workload, and system
•  Suitable for runtime adaptation and reconfiguration
23
Q3: Does our model capture current behavior?
Online change-point detection
•  Hypothesis: The distribution of prediction errors (residuals) is
stationary if there are no changes in the application/environment
•  Detection: Use a hypothesis test to compare error distributions from
adjacent time windows
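A minimal sketch of this idea, keeping residuals in a sliding buffer and comparing the two most recent windows; the two-sample Kolmogorov–Smirnov test, window size, and significance level are assumptions, since the slide does not name the specific hypothesis test used.

# Online change-point detection on prediction residuals.
from collections import deque
from scipy.stats import ks_2samp

class ChangeDetector:
    def __init__(self, window: int = 30, alpha: float = 0.05):
        self.window = window
        self.alpha = alpha
        self.residuals = deque(maxlen=2 * window)   # two adjacent windows

    def update(self, predicted: float, observed: float) -> bool:
        """Add one residual; return True if the model should be re-trained."""
        self.residuals.append(observed - predicted)
        if len(self.residuals) < 2 * self.window:
            return False                            # not enough data yet
        old = list(self.residuals)[: self.window]
        new = list(self.residuals)[self.window :]
        result = ks_2samp(old, new)
        return result.pvalue < self.alpha           # distributions differ -> change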
24
vPerfGuard: Learning-based troubleshooting
(Framework diagram)
•  Sensor Module: collects application performance metrics (THP, MRT, RT95p), host metrics (1000’s), and guest VM metrics (10’s), and streams new samples
•  Online Change-Point Detection Module: runs online hypothesis testing on the current model and top metrics to decide “re-train model? Yes/No”
•  Metric Filtering & Model Building Module: on re-training, applies Phase 1 (correlation-based filtering) and Phase 2 (model-based filtering) to the raw metrics, yielding suspicious metrics for remediation
* P. Xiong et al. “vPerfGuard: An automated model-driven framework for application performance diagnosis in consolidated cloud
environments.” ICPE 2013.
25
Case study: CPU contention with co-located VMs
Scenario: CPU contention caused by noisy neighbors (co-located VMs)

Intervals    MRT Model
27–45        MRT = 1.13 H_ESX1_CPU_Util + 1.97 H_ESX4_Mem_Active – 89.7
46–74        MRT = 752.8 H_ESX1_CPULoad_1MinAvg – 562.9
75–89        MRT = 12.5 H_ESX1_Web_vCPU_Ready – 25.0
90–102       MRT = –7.70 H_ESX1_vCPU_Idle + 410.3

The model is retrained at each detected change point. Note: all models during the contention period show CPU on ESX1 as the top metric affecting application latency!
26
Why control and optimization?
•  Control (or dynamic adaptation) takes advantage of newly
exposed performance tuning knobs
•  Feedback allows tolerance of model imperfection and
uncertainties
•  Optimization handles tradeoffs between competing goals
–  performance vs. power
–  responsiveness vs. stability
27
Auto-Scaling to maintain application SLO
A feedback-control approach
(Diagram: End User → Front Tier → DB Tier, with Application Latency measured end-to-end)
28
Auto-Scaling to maintain application SLO
A feedback-control approach
(Diagram: the same End User → Front Tier → DB Tier application shown before and after horizontal scaling, driven by the measured Application Latency)
29
Auto-Scaling to maintain application SLO
A feedback-control approach
(Diagram: End User → Front Tier → DB Tier, contrasting horizontal scaling with vertical scaling, both driven by the measured Application Latency)
30
Existing solutions to horizontal scaling
Threshold-based approach
•  User-defined threshold on a specific metric
–  Spin up new instances when threshold is violated
–  e.g. AWS Auto Scaling: http://aws.amazon.com/autoscaling/
(Chart: CPU Utilization (%) vs. time, with a threshold line at 80)
•  Challenges
–  How to handle multiple application tiers?
–  How to handle multiple resources?
–  How to determine the threshold value?
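For reference, a threshold rule of this kind reduces to a few lines per metric and per tier, which is exactly why the questions above get awkward as metrics and tiers multiply; the 80% threshold and three-interval debounce below are illustrative values only.

# Sketch of a single-metric, single-tier threshold rule.
def threshold_autoscaler(cpu_util_history, threshold=80.0, sustained=3):
    """Return +1 to add an instance, 0 to do nothing."""
    recent = cpu_util_history[-sustained:]
    if len(recent) == sustained and all(u > threshold for u in recent):
        return 1
    return 0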
31
Our Solution: Learning-based auto scaling
•  User only needs to provide end-to-end performance goal
•  Uses reinforcement learning to capture application’s scaling behavior
and inform future actions
•  Uses heuristics to seed the learning process
•  Handles multiple resources and tiers
•  Fully automated without human intervention
(Chart: avg Apache latency vs. the SLO, number of AppServer VMs, and number of client threads plotted over time; left axis: End-to-End Latency (ms), right axis: Number of App-Server VMs & Number of Client Threads, x-axis: Time in minutes, 0–900)
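As a sketch of the learning component behind results like the one above, a tabular Q-learning agent could map an observed state (e.g., load level and current VM count) to scale-out/scale-in actions; the state, reward, and hyperparameters below are assumptions for illustration, and the real system additionally seeds learning with heuristics and handles multiple resources and tiers.

# Minimal Q-learning sketch for horizontal scaling decisions.
# Each control interval: observe state -> choose action -> apply -> observe
# latency -> compute reward -> learn(state, action, reward, next_state).
import random
from collections import defaultdict

ACTIONS = (-1, 0, +1)                      # remove / keep / add one app-server VM

def reward(latency_ms, slo_ms, num_vms):
    # favor meeting the SLO, penalize over-provisioning (weights are assumed)
    return (1.0 if latency_ms <= slo_ms else -1.0) - 0.05 * num_vms

class ScalingAgent:
    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)        # (state, action) -> estimated value
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def choose(self, state):
        if random.random() < self.epsilon:                 # explore
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])   # exploit

    def learn(self, state, action, r, next_state):
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        self.q[(state, action)] += self.alpha * (
            r + self.gamma * best_next - self.q[(state, action)])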
32
Vertical scaling of resource containers
Automatic tuning of resource control settings
•  Available on various virtualization platforms
•  For shared CPU, memory, disk I/O*, network I/O*:
–  Reservation (R)* – minimum guaranteed amount of resources
–  Limit (L) – upper bound on resource consumption (non-work-conserving)
–  Shares (S) – relative priority during resource contention
•  VM’s CPU/memory demand (D): estimated by hypervisor, critical to
actual allocation
Actual allocation = f(R, L, S, D, Cap)
(Diagram: the allocation lies between the reservation R and the limit L, within the VM’s configured size C, subject to the available capacity)
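The sketch below illustrates the semantics of f(R, L, S, D, Cap) in simplified form: reservations are granted first, the remainder is split in proportion to shares, and no VM receives more than its limit or its demand. The real ESX divvying algorithm is considerably more involved; this only shows how the four knobs interact.

# Simplified single-level divvying of one resource across VMs.
def divvy(capacity, vms):
    """vms: list of dicts with keys R, L, S, D (reservation, limit, shares, demand)."""
    # Grant every VM its reservation first (assumes sum of reservations <= capacity).
    alloc = {i: vm["R"] for i, vm in enumerate(vms)}
    remaining = capacity - sum(alloc.values())
    # VMs still below both their limit and their demand can receive more.
    active = {i for i, vm in enumerate(vms) if alloc[i] < min(vm["L"], vm["D"])}
    while remaining > 1e-9 and active:
        total_shares = sum(vms[i]["S"] for i in active)
        if total_shares == 0:
            break
        leftover = 0.0
        for i in list(active):
            vm = vms[i]
            grant = remaining * vm["S"] / total_shares       # share-proportional slice
            headroom = min(vm["L"], vm["D"]) - alloc[i]
            given = min(grant, headroom)
            alloc[i] += given
            leftover += grant - given                        # redistribute what was capped
            if headroom - given <= 1e-9:
                active.discard(i)                            # capped by limit or demand
        remaining = leftover
    return alloc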
33
DRS (Distributed Resource Scheduler)
Resource pool hierarchy
(Resource pool tree: a VDC divided into resource pools RP1 <R1, L1, S1> and RP2 <R2, L2, S2>; vApp1 and vApp2, each with per-VM settings <r, l, s>, contain Web, App, and DB VMs; VM1 and VM2 are standalone VMs)
•  Capacity of an RP divvied hierarchically based on resource settings
•  Sibling RPs share capacity of the VDC
•  Sibling VMs share capacity of the parent RP
* VMware distributed resource management: Design, implementation, and lessons learned, VMware Technical Journal,
April 2012.
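Continuing the sketch above, hierarchical divvying can be expressed by applying the same flat divvy() at each level of the tree: the VDC’s capacity is split among sibling RPs by their <R, L, S> settings and aggregated demand, and each grant is then re-divided among that RP’s children. Aggregating demand as the plain sum over a subtree is an assumption; DRS computes entitlements more carefully (see the VMTJ paper cited above).

# Hierarchical divvying over the resource-pool tree, reusing divvy() above.
def subtree_demand(node):
    """Aggregate demand of a subtree (a simplifying assumption)."""
    if "children" not in node:
        return node["D"]
    return sum(subtree_demand(c) for c in node["children"])

def divvy_tree(capacity, node):
    """node: {'name', 'children': [...]} for an RP/vApp, or {'name','R','L','S','D'} for a VM."""
    if "children" not in node:
        return {node["name"]: min(capacity, node["D"])}
    settings = [{"R": c.get("R", 0.0), "L": c.get("L", float("inf")),
                 "S": c.get("S", 1.0), "D": subtree_demand(c)}
                for c in node["children"]]
    grants = divvy(capacity, settings)          # split capacity among siblings
    result = {}
    for idx, child in enumerate(node["children"]):
        result.update(divvy_tree(grants[idx], child))   # recurse into each child
    return result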
34
Powerful knobs, hard to use
•  How do VM-level settings impact application performance?
•  How to set RP-level settings to protect high priority applications within
the RP?
•  Fully reserved (R=L=C) for critical applications
–  Leads to lower consolidation ratio due to admission control
•  Others left at default (R=0, L=C) until performance problem arises
–  Increases reservation for the bottleneck resource (which one? by how much?)
(Diagram: a workload drives a vApp consisting of Web, App, and DB VMs; the measured performance p(t) must be tied back to per-VM resource settings)
35
Performance model learned for each vApp
Maps VM-level resource allocations to app-level performance
•  Captures multiple tiers and multiple resource types
•  Chooses a linear, low-order model (easy to compute)
•  Workload indirectly captured in model parameters
•  Model parameters updated online in each interval (tracks nonlinearity)
(Diagram: workload λ drives the vApp of Web, App, and DB VMs; the model maps per-VM CPU usage u_k^c(t), memory usage u_k^m(t), and I/O usage u_k^io(t) to the measured performance p(t))
Model form: p(t) = f(p(t−1), u(t))
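One common way to realize such an online, low-order linear model is recursive least squares with a forgetting factor, sketched below; the specific estimator used in the talk is not stated on the slide, so treat this as an illustrative stand-in for p(t) = a·p(t−1) + bᵀu(t) + c.

# Online linear model with recursive least squares (forgetting factor tracks
# slowly varying application behavior).
import numpy as np

class OnlineLinearModel:
    def __init__(self, num_inputs, forgetting=0.95):
        n = num_inputs + 2                       # parameters: a, b (per input), c
        self.theta = np.zeros(n)
        self.P = np.eye(n) * 1e3                 # large initial covariance
        self.lam = forgetting

    def _phi(self, p_prev, u):
        return np.concatenate(([p_prev], u, [1.0]))

    def predict(self, p_prev, u):
        return float(self.theta @ self._phi(p_prev, u))

    def update(self, p_prev, u, p_obs):
        phi = self._phi(p_prev, u)
        err = p_obs - self.theta @ phi           # prediction residual
        Pphi = self.P @ phi
        gain = Pphi / (self.lam + phi @ Pphi)
        self.theta = self.theta + gain * err
        self.P = (self.P - np.outer(gain, Pphi)) / self.lam
        return err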
36
Use optimization to handle design tradeoff
•  An example cost function:
     J(u(t+1)) = (p(t+1) − p_SLO)² + β ||u(t+1) − u(t)||²
   where the first term is the performance cost, the second is the control cost, and β sets the tradeoff between performance and stability
•  Solve for the optimal resource allocations:
     u*(t+1) = g(p(t), p_SLO, u(t), λ, β)
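Assuming a linear predictive model of the form p(t+1) ≈ a·p(t) + bᵀu(t+1) + c (as in the online model above, with u now read as the controllable allocations), J is quadratic in u(t+1) and the optimum has a closed form: setting the gradient to zero gives (b bᵀ + βI) u*(t+1) = β u(t) + b (p_SLO − a·p(t) − c). The sketch below solves that small linear system; in practice the result would also be clamped to feasible allocation ranges, and the exact form of g used in the talk may differ.

# Closed-form optimal allocation under the assumed linear model.
import numpy as np

def optimal_allocation(a, b, c, p_now, u_now, p_slo, beta):
    b = np.asarray(b, dtype=float)
    u_now = np.asarray(u_now, dtype=float)
    # (b b^T + beta I) u = beta * u_now + b * (p_slo - a*p_now - c)
    A = np.outer(b, b) + beta * np.eye(len(b))
    rhs = beta * u_now + b * (p_slo - a * p_now - c)
    return np.linalg.solve(A, rhs)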
37
AppRM
SLO-driven auto-tuning of resource control settings
•  For each application, vApp Manager translates its SLO into desired
resource control settings at individual VM level
•  For each resource pool, RP Manager computes the actual VM- and
RP-level resource settings to satisfy all critical applications
(Architecture: each application runs in a vApp (VM1 … VMn) inside a Resource Pool (RP). App/System Sensors feed a per-application vApp Manager, which takes the app-level SLO and produces desired VM resource settings. An Arbiter in the Resource Pool Manager (RP Manager) reconciles the requests, and an Actuator applies the actual VM- or RP-level settings via the vSphere API.)
38
vApp Manager overview
(Control loop inside vApp1 (VM1 … VMn): the App Sensor reports observed app performance p(t) and the System Sensor reports current resource allocations u(t); the Model Builder maintains the model p = f(u); the Application Controller compares p(t) against the app-level SLO p_ref, and the Resource Controller computes desired resource allocations u(t+1), which the vApp Manager passes to the RP Manager as desired VM resource settings s(t+1))
39
Performance evaluation
•  Application
–  MongoDB – distributed data processing application with sharding
–  Rain – workload generation tool used to generate dynamic workloads
•  Workload
–  Number of clients
–  Read/write mix
•  Evaluation questions
–  Can the vApp Manager meet an individual application’s SLO?
–  Can the RP Manager meet the SLOs of multiple vApps?
(Deployment diagram: Mongos, Configsvr, Shard1, and Shard2 spread across VMs VM1–VM3)
40
Result: Meeting mean response time target
•  Under-provisioned initial settings: R = 0, Limit = 512 (MHz, MB)
•  Over-provisioned initial settings: R = 0, L = unlimited (cpu, mem)
(Chart: mean response time, target 300 ms — normalized response time on a log scale (0.1–100) vs. time interval (every 1 min) for RT-scenario1 and RT-scenario2 against the target; an initial-learning phase is followed by control + continued learning)
41
Resource utilization (under-provisioned case)
•  Target response time = 300 ms
•  Initial setting R = 0, L = 512 MHz/MB (under-provisioned)
(Charts: CPU utilization and memory utilization over time for the Mongos, Shard1, and Shard2 VMs)
42
Recap:
APM automation requires better analytics
•  Learning: online modeling of application performance
•  Optimization: tradeoffs between competing goals
•  Control: model-driven online adaptation in the face of uncertainty
43
Grand challenge
The Vision of Autonomic Computing, IEEE Computer, Jan. 2003.
“Systems manage themselves according to an administrator’s goals.
New components integrate as effortlessly as a new cell establishes itself
in the human body. These ideas are not science fiction, but elements of
the grand challenge to create self-managing computing systems.”
Enablers
•  Widely deployed sensors and lots of (noisy) data
•  New control knobs, resource fungibility and elasticity
•  Increasing compute, storage, and network capacity
•  Matured learning, control, and optimization techniques
Challenges
•  Software complexity, nonlinearity, dependency, scalability
•  Automated root-cause analysis, integrated diagnosis & control
•  Need more collaborations between control and systems people
•  How to teach control theory to CS students?
44
Thanks to collaborators
•  VMware: Lei Lu, Rean Griffith, Mustafa Uysal, Anne Holler, Pradeep Padala, Aashish Parikh, Parth Shah
•  HP Labs: Zhikui Wang, Sharad Singhal, Arif Merchant (now Google)
•  KIT: Simon Spinner, Samuel Kounev
•  College of William & Mary: Evgenia Smirni
•  Georgia Tech: Pengcheng Xiong (now NEC Lab), Calton Pu
•  University of Michigan: Kang Shin, Karen Hou
45
Related venues
•  International Conference on Autonomic Computing: https://www.usenix.org/conference/icac14
•  Feedback Computing Workshop (formerly known as FeBID): http://feedbackcomputing.org/ and http://www.controlofsystems.org/
46
References
•  X. Zhu, et al. “What does control theory bring to systems research?” ACM SIGOPS
Operating Systems Review, 43(1), January 2009.
•  P. Padala et al. “Automated control of multiple virtualized resources.” Eurosys 2009.
•  A. Gulati et al. “VMware distributed resource management: Design, implementation, and
lessons learned.” VMware Technical Journal, Vol. 1(1), April 2012.
•  P. Xiong et al. “vPerfGuard: An automated model-driven framework for application
performance diagnosis in consolidated cloud environments.” ICPE 2013.
•  A. Gulati, “Towards proactive resource management in virtualized datacenters,”
RESoLVE 2013.
•  L. Lu, et al., “Application-Driven dynamic vertical scaling of virtual machines in resource
pools.” NOMS 2014.
47