
Communication-Efficient Gradient Coding for Straggler Mitigation in Distributed Learning (cs.IT) --- 用户7305506

Distributed implementations of gradient-based methods, wherein a server distributes gradient computations across worker machines, need to overcome two limitations: delays caused by slow running machines called 'stragglers', and communication overheads. Recently, Ye and Abbe [ICML 2018] proposed a coding-theoretic paradigm to characterize a fundamental trade-off between computation load per worker, communication overhead per worker, and straggler tolerance. However, their proposed coding schemes suffer from heavy decoding complexity and poor numerical stability. In this paper, we develop a communication-efficient gradient coding framework to overcome these drawbacks. Our proposed framework enables using any linear code to design the encoding and decoding functions. When a particular code is used in this framework, its block-length determines the computation load, dimension determines the communication overhead, and minimum distance determines the straggler tolerance. The flexibility of choosing a code allows us to gracefully trade-off the straggler threshold and communication overhead for smaller decoding complexity and higher numerical stability. Further, we show that using a maximum distance separable (MDS) code generated by a random Gaussian matrix in our framework yields a gradient code that is optimal with respect to the trade-off and, in addition, satisfies stronger guarantees on numerical stability as compared to the previously proposed schemes. Finally, we evaluate our proposed framework on Amazon EC2 and demonstrate that it reduces the average iteration time by 16% as compared to prior gradient coding schemes.
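As a rough illustration of the straggler-tolerance idea described above, here is a minimal sketch (my own, not the authors' implementation; the partition count `k`, worker count `n`, gradient dimension `d`, and the toy gradients are all assumed for demonstration): each of `n` workers returns one Gaussian-coded linear combination of `k` partial gradients, and the server recovers the full gradient from any `k` responses, since any `k` rows of a random Gaussian matrix are invertible with probability 1.

```python
import numpy as np

rng = np.random.default_rng(0)

k, n = 4, 6  # k gradient partitions, n workers; tolerates up to n - k = 2 stragglers
d = 8        # gradient dimension

# Stand-ins for the k partial gradients that workers would compute locally
grads = rng.normal(size=(k, d))

# Random Gaussian generator matrix: any k of its n rows form an invertible
# k x k submatrix with probability 1 (the MDS property used in the paper)
G = rng.normal(size=(n, k))

# Worker i sends a single coded combination of the partial gradients
coded = G @ grads  # shape (n, d)

# The server hears back from any k workers, e.g. workers 1, 2, 4, 5
# (workers 0 and 3 are stragglers and are simply ignored)
alive = [1, 2, 4, 5]
decoded = np.linalg.solve(G[alive], coded[alive])

# Summing the recovered partial gradients gives the full gradient
full_gradient = decoded.sum(axis=0)
assert np.allclose(full_gradient, grads.sum(axis=0))
```

This toy version ignores the computation-load and communication-overhead axes of the trade-off (each worker here computes only one coded message); the paper's framework controls those through the block-length and dimension of the chosen linear code.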

Original title: Communication-Efficient Gradient Coding for Straggler Mitigation in Distributed Learning


Original authors: Swanand Kadhe, O. Ozan Koyluoglu, Kannan Ramchandran

Source: https://arxiv.org/abs/2005.07184



