Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting
Jakub Konečný, Jie Liu, Peter Richtárik, Martin Takáč

1. Problem Description

The problem is to minimize a sum of two convex functions,

    \min_{x \in \mathbb{R}^d} \{ P(x) := f(x) + R(x) \},                                   (1)

where f is the average of a large number of smooth convex functions f_i(x), i.e.,

    f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x).                                             (2)

2. Assumptions

Assumption 1. The regularizer R : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\} is convex and closed. The functions f_i : \mathbb{R}^d \to \mathbb{R} are differentiable and have Lipschitz continuous gradients with constant L > 0, i.e., for all x, y \in \mathbb{R}^d,

    \|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\|,

where \|\cdot\| is the L2 norm.

Assumption 2. P is strongly convex with parameter \mu > 0, i.e., for all x, y \in \mathrm{dom}(P),

    P(y) \ge P(x) + \xi^T (y - x) + \frac{\mu}{2} \|y - x\|^2, \quad \forall \xi \in \partial P(x),

where \partial P(x) is the subdifferential of P at x.

3. The Algorithm (mS2GD)

Algorithm 1 mS2GD
 1: Input: m (max # of stochastic steps per epoch); h > 0 (stepsize); x_0 \in \mathbb{R}^d (starting point); mini-batch size b \in \{1, ..., n\}
 2: for k = 0, 1, 2, ... do
 3:   Compute and store g_k \leftarrow \nabla f(x_k) = \frac{1}{n} \sum_i \nabla f_i(x_k)
 4:   Initialize the inner loop: y_{k,0} \leftarrow x_k
 5:   Let t_k \leftarrow t \in \{1, 2, ..., m\} uniformly at random
 6:   for t = 0 to t_k - 1 do
 7:     Choose mini-batch A_{kt} \subset \{1, ..., n\} of size b, uniformly at random
 8:     Compute a stochastic estimate of \nabla f(y_{k,t}): v_{k,t} \leftarrow g_k + \frac{1}{b} \sum_{i \in A_{kt}} (\nabla f_i(y_{k,t}) - \nabla f_i(x_k))
 9:     y_{k,t+1} \leftarrow \mathrm{prox}_{hR}(y_{k,t} - h v_{k,t})
10:   end for
11:   Set x_{k+1} \leftarrow y_{k,t_k}
12: end for

This is a simplified case of our original algorithm. A complete version of the algorithm and convergence result, with known lower bounds of the convexity parameters \nu_F and \nu_R for F and R respectively, requires a non-uniform distribution for the number of steps per epoch [1].

4. Convergence Result

Theorem 1. Let Assumptions 1 and 2 be satisfied and let x^* = \arg\min_x P(x). In addition, assume that the stepsize satisfies 0 < h < \min\{\frac{1}{4L\alpha(b)}, \frac{1}{L}\} and that m is sufficiently large so that

    \rho = \frac{1}{m h \mu (1 - 4 h L \alpha(b))} + \frac{4 h L \alpha(b)(m + 1)}{m (1 - 4 h L \alpha(b))} < 1,        (3)

where \alpha(b) = \frac{n - b}{b(n - 1)}. Then mS2GD has linear convergence in expectation:

    E(P(x_k) - P(x^*)) \le \rho^k (P(x_0) - P(x^*)).

The following bound on the variance of the estimate v_{k,t} is considered crucial:

    E \|v_{k,t} - \nabla f(y_{k,t})\|^2 \le 4 L \alpha(b) (P(y_{k,t}) - P(x^*) + P(x_k) - P(x^*)).

5. Mini-Batch Speedup

Theorem 2. Fix a target \rho \in (0, 1) and the mini-batch size b. Let us define

    \tilde{h}_b := \sqrt{\left(\frac{1+\rho}{\rho\mu}\right)^2 + \frac{1}{4\mu\alpha(b)L}} - \frac{1+\rho}{\rho\mu}.

Then the optimal stepsize h_*^b and the maximal size of the inner loop m_*^b, which minimize the number of gradient evaluations while keeping sufficient overall decrease, are given as follows. If \tilde{h}_b \le \frac{1}{L}, then h_*^b = \tilde{h}_b and

    m_*^b = \frac{8\alpha(b)L}{\mu\rho^2}\left(1 + \rho + \sqrt{(1+\rho)^2 + \frac{\mu\rho^2}{4\alpha(b)L}}\right);        (4)

otherwise, h_*^b = \frac{1}{L} and

    m_*^b = \frac{L/\mu + 4\alpha(b)}{\rho - 4\alpha(b)(1+\rho)}.

If m_*^b \le m_*^1 / b, then we can reach the same accuracy with fewer gradient evaluations. Equation (4) shows that as long as the condition \tilde{h}_b \le \frac{1}{L} is satisfied, m_*^b decreases at a rate roughly faster than 1/b. Hence, we can attain the same accuracy with less work than in the case b = 1.
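To make Algorithm 1 and the parameter choice of Theorem 2 concrete, here is a minimal Python sketch; it is not part of the original poster. The names alpha, optimal_h_m, mS2GD, grad_avg and prox_hR are illustrative, and the gradient and proximal oracles are assumed to be supplied by the user.

```python
import numpy as np

def alpha(b, n):
    # alpha(b) = (n - b) / (b (n - 1)), as defined after equation (3)
    return (n - b) / (b * (n - 1))

def optimal_h_m(rho, b, n, L, mu):
    # Optimal stepsize h and inner-loop size m from Theorem 2 (sketch).
    a = alpha(b, n)
    c = (1 + rho) / (rho * mu)
    h_tilde = np.sqrt(c ** 2 + 1.0 / (4 * mu * a * L)) - c
    if h_tilde <= 1.0 / L:
        h = h_tilde
        m = (8 * a * L / (mu * rho ** 2)) * (
            1 + rho + np.sqrt((1 + rho) ** 2 + mu * rho ** 2 / (4 * a * L)))
    else:
        h = 1.0 / L
        m = (L / mu + 4 * a) / (rho - 4 * a * (1 + rho))
    return h, int(np.ceil(m))

def mS2GD(grad_avg, prox_hR, x0, n, b, h, m, epochs, seed=0):
    # A minimal sketch of Algorithm 1.
    #   grad_avg(x, idx) -> (1/|idx|) * sum_{i in idx} grad f_i(x)
    #   prox_hR(z, h)    -> proximal operator of h*R evaluated at z
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    all_idx = np.arange(n)
    for k in range(epochs):
        g = grad_avg(x, all_idx)                      # full gradient g_k = grad f(x_k)
        y = x.copy()                                  # y_{k,0} = x_k
        t_k = rng.integers(1, m + 1)                  # t_k uniform in {1, ..., m}
        for t in range(t_k):
            A = rng.choice(n, size=b, replace=False)  # mini-batch A_{kt} of size b
            v = g + grad_avg(y, A) - grad_avg(x, A)   # variance-reduced estimate of grad f(y_{k,t})
            y = prox_hR(y - h * v, h)                 # proximal step
        x = y                                         # x_{k+1} = y_{k, t_k}
    return x
```

For an L2 regularizer R(x) = (\lambda/2)\|x\|^2, for example, the proximal step reduces to prox_hR(z, h) = z / (1 + h\lambda).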
6. Numerical Experiments

Figure 1: rcv1 dataset, logistic regression, R(x) = \frac{1}{2n}\|x\|^2. Each panel plots the objective minus the optimum against the number of effective passes over the data (divided by b in the bottom panel). TOP: comparison between mS2GD and other relevant algorithms (S2GD, SAG, SGDcon, SGD+, FISTA), indicating its competitiveness; SGDcon uses a constant step-size chosen in hindsight, and SGD+ uses the adaptive step-size h = h_0/(k + 1), where k is the number of effective passes. MIDDLE: mS2GD with different mini-batch sizes b = 1, 2, 4, 8, 16, 32, using the best parameters m and h for each batch size. BOTTOM: the ideal speedup by parallelism, i.e., what would happen if we could always efficiently evaluate the b gradients in parallel, making each step b times faster.

7. References

[1] Jakub Konečný, Jie Liu, Peter Richtárik and Martin Takáč: mS2GD: Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting, OPT 2014 @NIPS.
[2] Jakub Konečný and Peter Richtárik: Semi-Stochastic Gradient Descent Methods, arXiv:1312.1666, 2013.
[3] Lin Xiao and Tong Zhang: A Proximal Stochastic Gradient Method with Progressive Variance Reduction, arXiv:1403.4699, 2014.
