Published by Kurt Paulsen. Modified 5 years ago.
1
Online Job Scheduling in Distributed Machine Learning Clusters
段新朋
2
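Problem
Job scheduling under the Parameter Server model of machine learning training.
Nodes are divided into workers and servers: workers compute gradients, servers store and update the parameters.
When a new job arrives, it must be allocated a number of workers and servers that meets its training needs while respecting the total resource capacity.
The input data is divided into chunks, and each chunk is trained by one worker. A chunk is further divided into mini-batches, and one mini-batch is trained at a time.
The parameters are divided evenly by the number of servers, and each server stores one part of the parameters.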
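The data and parameter layout described on this slide can be sketched as follows. This is a minimal illustration only: the function names, the index-range representation, and the rule for distributing a remainder across servers are assumptions made for the example, not taken from the slides.

```python
from typing import List


def split_into_chunks(num_samples: int, chunk_size: int) -> List[range]:
    """Split sample indices [0, num_samples) into consecutive chunks (one per worker)."""
    return [range(start, min(start + chunk_size, num_samples))
            for start in range(0, num_samples, chunk_size)]


def split_into_mini_batches(chunk: range, batch_size: int) -> List[range]:
    """Split one chunk into the mini-batches that a worker trains one at a time."""
    return [chunk[i:i + batch_size] for i in range(0, len(chunk), batch_size)]


def partition_parameters(num_params: int, num_servers: int) -> List[range]:
    """Divide parameter indices as evenly as possible among the servers."""
    base, extra = divmod(num_params, num_servers)
    shards, start = [], 0
    for s in range(num_servers):
        # Assumption: the first `extra` servers each hold one additional parameter.
        size = base + (1 if s < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards
```

For instance, `split_into_chunks(10, 4)` gives three chunks, the last one shorter, and `partition_parameters(10, 3)` gives shards of sizes 4, 3 and 3.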
3
Modeling
4
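Offline Algorithm
Decide which jobs to accept and which to reject, and allocate workers and servers to every accepted job, so that the total utility of all jobs is maximized while all constraints are satisfied.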
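To make the offline objective concrete, here is a toy sketch that enumerates every accept/reject decision and keeps the feasible one with the highest total utility. It is a deliberate simplification and not the formulation from the slides: it assumes a single time slot, fixed worker and server demands per job, and two hypothetical capacity limits (`worker_cap`, `server_cap`). Exhaustive search is only practical for a handful of jobs.

```python
from itertools import product
from typing import List, Tuple

# Hypothetical job description: (utility, workers_needed, servers_needed)
Job = Tuple[float, int, int]


def best_admission(jobs: List[Job], worker_cap: int,
                   server_cap: int) -> Tuple[float, Tuple[int, ...]]:
    """Try every accept (1) / reject (0) vector; return the best feasible one."""
    best_utility, best_choice = 0.0, tuple(0 for _ in jobs)
    for choice in product((0, 1), repeat=len(jobs)):
        workers = sum(c * job[1] for c, job in zip(choice, jobs))
        servers = sum(c * job[2] for c, job in zip(choice, jobs))
        if workers > worker_cap or servers > server_cap:
            continue  # violates a capacity constraint
        utility = sum(c * job[0] for c, job in zip(choice, jobs))
        if utility > best_utility:
            best_utility, best_choice = utility, choice
    return best_utility, best_choice
```

With jobs `[(10.0, 4, 2), (6.0, 2, 1), (5.0, 2, 1)]` and capacities of 4 workers and 2 servers, accepting the two smaller jobs (total utility 11) beats accepting the single large one (utility 10).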
5
Online Algorithm
We formulate the dual of (14) by relaxing the integrality constraints (18) and associating dual variables $p_h^r(t)$, $q_k^r(t)$ and $\mu_i$ with constraints (15), (16) and (17), respectively.
6
Primal-dual
μ: payoff
P: unit cost for type-r resource
Q:
f: utility function
7
P, Q
P: unit cost for type-r resource on the worker at time t.
Q: unit cost for type-r resource on the server at time t.
U: maximum per-unit-resource job utility for type-r resource on physical servers to deploy workers.
L: $L_1$ ($L_2$) represents the minimum unit-time-unit-resource job utility on physical servers to deploy workers (parameter servers), among all jobs.
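The slides do not give the formula for P and Q. A common choice in online primal-dual scheduling is a price that starts at the lower bound L when a resource is idle and grows exponentially to the upper bound U when it is fully used. The sketch below assumes that form for a single resource and a single time slot. The function names and the rule "admit when the payoff μ is positive" are illustrative assumptions, not the authors' exact algorithm.

```python
def unit_price(used: float, capacity: float, low: float, high: float) -> float:
    """Assumed exponential price: `low` when idle, `high` when fully used."""
    return low * (high / low) ** (used / capacity)


def job_payoff(utility: float, demand: float, used: float, capacity: float,
               low: float, high: float) -> float:
    """Payoff mu: job utility minus the cost of its demand at the current price."""
    return utility - unit_price(used, capacity, low, high) * demand


def admit(utility: float, demand: float, used: float, capacity: float,
          low: float, high: float) -> bool:
    """Admit only if the job fits and its payoff is positive."""
    fits = used + demand <= capacity
    return fits and job_payoff(utility, demand, used, capacity, low, high) > 0
```

Because the price rises with utilization, a job that is accepted on a half-empty cluster may be rejected when the cluster is nearly full. That reserves the remaining capacity for jobs with higher utility per unit of resource.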
8
Online Algorithm
9
Finding best job schedule