Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting

Jakub Konečný, Jie Liu, Peter Richtárik, Martin Takáč
1. Problem Description
The problem is to minimize a sum of two convex functions,

    min_{x ∈ R^d} { P(x) := f(x) + R(x) },    (1)

where f is the average of a large number of smooth convex functions f_i(x), i.e.,

    f(x) := (1/n) Σ_{i=1}^n f_i(x).    (2)
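A concrete instance of (1) and (2) helps fix notation. The sketch below builds P(x) = f(x) + R(x) for L2-regularized logistic regression, the setup used in the experiments of Section 6; the helper names (make_problem, grad_f) are illustrative choices and are not part of the poster.

```python
# Minimal sketch of the composite objective (1)-(2), assuming L2-regularized
# logistic regression: f_i(x) = log(1 + exp(-b_i * <a_i, x>)), R(x) = (lam/2)*||x||^2.
import numpy as np

def make_problem(A, b, lam):
    """A: n-by-d data matrix, b: labels in {-1, +1}, lam: regularization weight."""
    n, d = A.shape

    def f(x):                # f(x) = (1/n) * sum_i f_i(x)
        return np.mean(np.log1p(np.exp(-b * (A @ x))))

    def grad_f(x):           # full gradient of f, computed once per epoch in mS2GD
        s = -b / (1.0 + np.exp(b * (A @ x)))
        return A.T @ s / n

    def R(x):                # regularizer, handled separately via its prox
        return 0.5 * lam * np.dot(x, x)

    def P(x):                # composite objective
        return f(x) + R(x)

    return f, grad_f, R, P
```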
2. Assumptions

Assumption 1. The regularizer R : R^d → R ∪ {+∞} is convex and closed. The functions f_i : R^d → R are differentiable and have Lipschitz continuous gradients with constant L > 0, i.e., ∀x, y ∈ R^d,

    ‖∇f_i(x) − ∇f_i(y)‖ ≤ L ‖x − y‖,

where ‖·‖ is the L2 norm.

Assumption 2. P is strongly convex with parameter μ > 0, i.e., ∀x, y ∈ dom(P),

    P(y) ≥ P(x) + ξ^T (y − x) + (μ/2) ‖y − x‖²,   ∀ξ ∈ ∂P(x),

where ∂P(x) is the subdifferential of P at x.
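The poster does not spell out how L and μ are obtained. For the logistic-regression setup of Section 6 they have simple standard bounds, sketched below; the helper constants() is an illustration under that assumption, not something stated on the poster.

```python
# Sketch: crude bounds on the constants of Assumptions 1-2 for L2-regularized
# logistic regression. Each f_i(x) = log(1 + exp(-b_i * <a_i, x>)) has a
# Lipschitz-continuous gradient with L_i <= ||a_i||^2 / 4, and the regularizer
# R(x) = (lam/2)*||x||^2 makes P strongly convex with mu >= lam.
import numpy as np

def constants(A, lam):
    L = np.max(np.sum(A * A, axis=1)) / 4.0   # max_i ||a_i||^2 / 4
    mu = lam                                  # strong convexity inherited from R
    return L, mu
```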
3. The Algorithm (mS2GD)
Algorithm 1 mS2GD
1:  Input: m (max # of stochastic steps per epoch); h > 0 (stepsize); x_0 ∈ R^d (starting point); mini-batch size b ∈ {1, . . . , n}
2:  for k = 0, 1, 2, . . . do
3:    Compute and store g_k ← ∇f(x_k) = (1/n) Σ_i ∇f_i(x_k)
4:    Initialize the inner loop: y_{k,0} ← x_k
5:    Let t_k ← t ∈ {1, 2, . . . , m} uniformly at random
6:    for t = 0 to t_k − 1 do
7:      Choose mini-batch A_{kt} ⊂ {1, . . . , n} of size b, uniformly at random
8:      Compute a stochastic estimate of ∇f(y_{k,t}): v_{k,t} ← g_k + (1/b) Σ_{i ∈ A_{kt}} (∇f_i(y_{k,t}) − ∇f_i(x_k))
9:      y_{k,t+1} ← prox_{hR}(y_{k,t} − h v_{k,t})
10:   end for
11:   Set x_{k+1} ← y_{k,t_k}
12: end for
This is a simplified case of our original algorithm. A complete version of the
algorithm and convergence result, with known lower bounds of the convexity
parameters ν_F, ν_R for F and R, respectively, requires a non-uniform distribution
for the number of steps per epoch [1].
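Algorithm 1 translates almost line by line into code. Below is a sketch that assumes access to a component-gradient oracle grad_fi(x, i) and, for concreteness, the L2 regularizer R(x) = (lam/2)‖x‖², whose proximal operator has the closed form prox_{hR}(z) = z/(1 + h·lam). The function names and signatures are illustrative, not taken from [1].

```python
# Sketch of Algorithm 1 (mS2GD). Assumes a component-gradient oracle
# grad_fi(x, i) and the L2 regularizer R(x) = (lam/2)*||x||^2, for which
# prox_{hR}(z) = z / (1 + h*lam) in closed form. Illustrative only.
import numpy as np

def prox_l2(z, h, lam):
    return z / (1.0 + h * lam)

def ms2gd(grad_fi, n, d, m, h, b, lam, epochs, x0=None, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(d) if x0 is None else x0.copy()
    for k in range(epochs):
        # Step 3: full gradient at the reference point x_k
        g = np.mean([grad_fi(x, i) for i in range(n)], axis=0)
        y = x.copy()                                    # Step 4
        t_k = rng.integers(1, m + 1)                    # Step 5: t_k uniform in {1,...,m}
        for _ in range(t_k):                            # Steps 6-10
            batch = rng.choice(n, size=b, replace=False)    # Step 7
            # Step 8: variance-reduced estimate of grad f(y_{k,t})
            v = g + np.mean([grad_fi(y, i) - grad_fi(x, i) for i in batch], axis=0)
            y = prox_l2(y - h * v, h, lam)              # Step 9
        x = y                                           # Step 11: x_{k+1} = y_{k,t_k}
    return x
```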
4. Convergence Result

The following bound on the variance of the stochastic gradient estimate is crucial:

    E ‖v_{k,t} − ∇f(y_{k,t})‖² ≤ 4α(b)L (P(y_{k,t}) − P(x_*) + P(x_k) − P(x_*)).

Theorem 1. Let Assumptions 1 and 2 be satisfied and let x_* = arg min_x P(x). In addition, assume that the stepsize satisfies 0 < h < min{1/(4Lα(b)), 1/L} and that m is sufficiently large so that

    ρ := 1/(m h μ (1 − 4hLα(b))) + 4hLα(b)(m + 1)/(m (1 − 4hLα(b))) < 1,    (3)

where α(b) := (n − b)/(b(n − 1)). Then mS2GD has linear convergence in expectation:

    E(P(x_k) − P(x_*)) ≤ ρ^k (P(x_0) − P(x_*)).
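In practice one wants to check that a chosen pair (h, m) actually satisfies (3). The helper below is a sketch of that check using the quantities of Theorem 1; it is illustrative and not part of the poster. A returned value below 1 certifies the linear convergence guarantee for that parameter choice.

```python
# Sketch: evaluate the rate rho of Theorem 1 for given problem and algorithm
# parameters, and verify the stepsize condition. Illustrative only.
def rho_ms2gd(n, b, m, h, L, mu):
    alpha = (n - b) / (b * (n - 1))                    # alpha(b) = (n-b)/(b(n-1))
    h_max = 1.0 / L if alpha == 0 else min(1.0 / (4 * L * alpha), 1.0 / L)
    assert 0 < h < h_max, "stepsize violates the condition of Theorem 1"
    c = 1.0 - 4 * h * L * alpha
    return 1.0 / (m * h * mu * c) + 4 * h * L * alpha * (m + 1) / (m * c)
```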
5. Mini-Batch Speedup

Theorem 2. Fix a target ρ ∈ (0, 1) and the mini-batch size b. Let us define

    h̃_b := sqrt( (1 + ρ)²/(ρμ)² + 1/(4μα(b)L) ) − (1 + ρ)/(ρμ).

Then the optimal step size h_b^* and the maximum size of the inner loop m_b^*, which minimize the number of gradient evaluations while keeping a sufficient overall decrease, are given as follows. If h̃_b ≤ 1/L, then h_b^* = h̃_b and

    m_b^* = (8α(b)L/(μρ²)) (1 + ρ + sqrt( μρ²/(4α(b)L) + (1 + ρ)² )).    (4)

Otherwise, h_b^* = 1/L and

    m_b^* = (L/μ + 4α(b)) / (ρ − 4α(b)(1 + ρ)).

If m_b^* ≤ m_1^*/b, then we can reach the same accuracy with fewer gradient evaluations. Equation (4) shows that, as long as the condition h̃_b ≤ 1/L is satisfied, m_b^* decreases at a rate roughly faster than 1/b. Hence we can attain the same accuracy with less work than in the case b = 1.
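Theorem 2 turns directly into a small parameter calculator. The sketch below assumes 1 ≤ b < n so that α(b) > 0 and, in the second branch, that ρ > 4α(b)(1 + ρ); it is an illustration, not code from the poster. Comparing speedup_params(n, b, rho, L, mu)[1] against speedup_params(n, 1, rho, L, mu)[1] / b reproduces the comparison m_b^* ≤ m_1^*/b discussed above.

```python
# Sketch: optimal stepsize h_b^* and inner-loop size m_b^* from Theorem 2,
# for a fixed target rate rho and mini-batch size b. Assumes b < n (so that
# alpha(b) > 0) and, in the second branch, rho > 4*alpha(b)*(1 + rho).
import math

def speedup_params(n, b, rho, L, mu):
    alpha = (n - b) / (b * (n - 1))                         # alpha(b)
    h_tilde = (math.sqrt(((1 + rho) / (rho * mu)) ** 2 + 1.0 / (4 * mu * alpha * L))
               - (1 + rho) / (rho * mu))
    if h_tilde <= 1.0 / L:
        h_star = h_tilde
        m_star = (8 * alpha * L / (mu * rho ** 2)) * (
            1 + rho + math.sqrt(mu * rho ** 2 / (4 * alpha * L) + (1 + rho) ** 2))
    else:
        h_star = 1.0 / L
        m_star = (L / mu + 4 * alpha) / (rho - 4 * alpha * (1 + rho))
    return h_star, m_star
```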
6. Numerical Experiments

[Figure 1: three semilog plots of Objective − Optimum versus Effective Passes (top and middle) and Effective Passes/b (bottom).]

Figure 1: rcv1 dataset, logistic regression, R(x) = (1/(2n)) ‖x‖².
TOP: A comparison between mS2GD and other relevant algorithms (S2GD, SAG, SGDcon, SGD+, FISTA) shows its competitiveness. SGDcon uses the best constant step-size chosen in hindsight, and SGD+ uses the adaptive step-size h = h_0/(k + 1), where k is the number of effective passes.
MIDDLE: We compare mS2GD for different mini-batch sizes (b = 1, 2, 4, 8, 16, 32) with the best parameters m and h for each batch size.
BOTTOM: The ideal speedup from parallelism, i.e., what would happen if we could always evaluate the b gradients in parallel and thus be b times faster.

7. References

[1] Jakub Konečný, Jie Liu, Peter Richtárik and Martin Takáč: mS2GD: Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting, OPT 2014 @ NIPS.
[2] Jakub Konečný and Peter Richtárik: Semi-Stochastic Gradient Descent Methods, arXiv:1312.1666, 2013.
[3] Lin Xiao and Tong Zhang: A Proximal Stochastic Gradient Method with Progressive Variance Reduction, arXiv:1403.4699, 2014.