Online job scheduling in Distributed Machine Learning Clusters

1 Online job scheduling in Distributed Machine Learning Clusters
段新朋

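2 Problem Job scheduling under the Parameter Server model of machine learning training
Roles are split into workers and servers: workers compute gradients, and servers store and update the parameters. When a new job arrives, it must be allocated a number of workers and servers that meet its training needs while respecting the total resource capacity. The input data is split into chunks, each trained by one worker; a chunk is further split into mini-batches, and each training step processes one mini-batch. The parameters are divided evenly among the servers, so each server holds one part of the parameters.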
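The data and parameter partitioning described on this slide can be sketched as follows. This is a minimal illustration, not code from the presentation: the function names (`split_into_chunks`, `split_into_mini_batches`, `shard_parameters`) and the use of contiguous index ranges are assumptions made for the example.

```python
from typing import List


def split_into_chunks(num_samples: int, chunk_size: int) -> List[range]:
    """Split sample indices into contiguous chunks; one worker trains each chunk."""
    return [
        range(start, min(start + chunk_size, num_samples))
        for start in range(0, num_samples, chunk_size)
    ]


def split_into_mini_batches(chunk: range, batch_size: int) -> List[range]:
    """Split one chunk into mini-batches; each training step uses one of them."""
    return [
        range(start, min(start + batch_size, chunk.stop))
        for start in range(chunk.start, chunk.stop, batch_size)
    ]


def shard_parameters(num_params: int, num_servers: int) -> List[range]:
    """Divide parameter indices evenly among servers (sizes differ by at most 1)."""
    base, extra = divmod(num_params, num_servers)
    shards = []
    start = 0
    for server in range(num_servers):
        size = base + (1 if server < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards
```

For example, `shard_parameters(10, 3)` gives three shards of sizes 4, 3 and 3, so every parameter index is stored on exactly one server.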

3 Modeling

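4 Offline Algorithm Decide which jobs to accept and which to reject, and allocate workers and servers to each accepted job, so that the total utility of all jobs is maximized while every constraint is satisfied.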
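To make the offline objective concrete, here is a deliberately simplified sketch: it ignores the time dimension and any dependence of utility on completion time, and just enumerates every subset of jobs under a fixed worker and server capacity. The job tuple layout `(utility, workers, servers)` and the function name `best_admission` are assumptions for illustration; the enumeration is exponential in the number of jobs and is not the algorithm from the presentation.

```python
from itertools import combinations
from typing import Dict, Tuple


def best_admission(
    jobs: Dict[str, Tuple[float, int, int]],
    capacity: Tuple[int, int],
) -> Tuple[Tuple[str, ...], float]:
    """Return the feasible set of jobs with the highest total utility.

    jobs maps a job name to (utility, workers needed, servers needed);
    capacity is (total workers, total servers). Hypothetical toy model.
    """
    names = sorted(jobs)
    best_set: Tuple[str, ...] = ()
    best_utility = 0.0
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            workers = sum(jobs[j][1] for j in subset)
            servers = sum(jobs[j][2] for j in subset)
            if workers > capacity[0] or servers > capacity[1]:
                continue  # violates a resource constraint
            utility = sum(jobs[j][0] for j in subset)
            if utility > best_utility:
                best_set, best_utility = subset, utility
    return best_set, best_utility
```

With jobs `a = (10.0, 4, 2)`, `b = (6.0, 2, 1)`, `c = (5.0, 2, 1)` and capacity `(4, 2)`, accepting `b` and `c` together beats accepting `a` alone, which shows why the decision cannot be made job by job on utility alone.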

5 Online Algorithm We formulate the dual of (14) by relaxing integrality constraints (18) and associating dual variables $p_h^r(t)$, $q_k^r(t)$ and $\mu_i$ with (15), (16) and (17), respectively.

6 Primal-dual
μ: payoff
P: unit cost for type-r resource
Q:
f: utility function
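One common reading of these symbols, stated here as an assumption because the slide does not spell it out, is that a job's payoff μ is its utility f minus the cost of the resources its schedule consumes at the current unit prices, floored at zero, and that a job is admitted only when the payoff is positive. The helper names below (`schedule_cost`, `payoff`, `should_admit`) are hypothetical.

```python
from typing import Dict


def schedule_cost(demand: Dict[str, float], unit_price: Dict[str, float]) -> float:
    """Total cost of a schedule: per-type resource demand times its unit price."""
    return sum(amount * unit_price[resource] for resource, amount in demand.items())


def payoff(utility: float, cost: float) -> float:
    """Assumed payoff: utility minus resource cost, never below zero."""
    return max(0.0, utility - cost)


def should_admit(utility: float, cost: float) -> bool:
    """Admit the job only if serving it yields a strictly positive payoff."""
    return payoff(utility, cost) > 0.0
```

Under this reading, a job whose utility does not cover the priced cost of its resources gets payoff zero and is rejected.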

7 P,Q
P: unit cost for type-r resource on the worker at time t.
Q: unit cost for type-r resource on the server at time t.
U: maximum per-unit-resource job utility for type-r resource on physical servers to deploy workers.
L: L_1 (L_2) represents the minimum unit-time-unit-resource job utility on physical servers to deploy workers (parameter servers), among all jobs.
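The slide defines the bounds U and L but does not show how the unit costs are computed from them. A form often used in online primal-dual scheduling is an exponential price that starts at L when the resource is idle and reaches U when it is fully used; the sketch below assumes that form purely for illustration, and the function name `unit_price` and its parameters are hypothetical.

```python
def unit_price(used: float, capacity: float, lower: float, upper: float) -> float:
    """Assumed exponential price: lower * (upper / lower) ** (used / capacity).

    The price equals `lower` on an idle resource and `upper` on a fully
    used one, so no job whose per-unit utility is at most `upper` can
    afford a resource that has no capacity left.
    """
    if capacity <= 0 or lower <= 0 or upper < lower:
        raise ValueError("need capacity > 0 and 0 < lower <= upper")
    utilization = min(max(used / capacity, 0.0), 1.0)  # clamp to [0, 1]
    return lower * (upper / lower) ** utilization
```

The design intent of such a function is that cheap resources are handed out freely early on, while nearly exhausted resources are reserved for jobs with high utility.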

8 Online Algorithm

9 Finding best job schedule

10

