
1 School of EECS, Peking University
Synchronization. Hongfei Yan, School of EECS, Peking University, 4/18/2012. While communication is important, it is not the entire story. Closely related is how processes cooperate and synchronize with one another. Cooperation is partly supported by means of naming, which allows processes to at least share resources, or entities in general. Synchronization covers two things: (1) simultaneous access to a shared resource, and (2) agreement on the ordering of events. In many cases it is also important that a group of processes can appoint one process as a coordinator.

2 Contents 01: Introduction 02: Architectures 03: Processes 04: Communication 05: Naming 06: Synchronization 07: Consistency & Replication 08: Fault Tolerance 09: Security 10: Distributed Object-Based Systems 11: Distributed File Systems 12: Distributed Web-Based Systems 13: Distributed Coordination-Based Systems

3 Outline 6.1 Clock Synchronization (++) 6.2 Logical Clocks (++)
6.3 Mutual Exclusion 6.4 Global Positioning of nodes 6.5 Election Algorithms

4 6.1 Clock Synchronization
Physical Clocks; Global Positioning System (++); Clock Synchronization Algorithms (++). Physical clocks deal with actual (absolute) time and with how processes can synchronize to it; logical clocks deal only with the relative ordering of events.

5 Why is it important? Need to measure time accurately
E.g., auditing in e-commerce. Algorithms that depend on time, e.g., consistency protocols and make. Correctness of distributed systems frequently hinges on the satisfaction of global system invariants, e.g.: absence of deadlocks; write access to a distributed database is never granted to more than one process at a time; the sum of all account debits and ATM payments in an electronic cash system is zero; objects are subject to garbage collection only when no further reference to them exists.

6 E.g., the UNIX make program. In a centralized system, time is unambiguous: when a process wants to know the time, it makes a system call and the kernel tells it. In UNIX, large programs are normally split into multiple source files, so that a change to one source file requires only that file to be recompiled, not all of them. Is it possible to synchronize all the clocks in a distributed system? When each machine has its own clock, an event that occurred after another event may nevertheless be assigned an earlier time.

7 A Brief History of Time
The motion of the sun, so closely tied to human life, is fairly uniform. Astronomical timekeeping dates from the invention of the mechanical clock in the 17th century: GMT (Greenwich Mean Time).
One rotation of the earth is a day; one revolution around the sun is a year. The moment the sun reaches its highest apparent point in the sky is called the transit of the sun, and the interval between two consecutive transits is a solar day. With 24 hours per day and 3600 seconds per hour, a solar second is 1/86400 of a solar day.
Physical timekeeping dates from the 1940s: one second is the time for 9,192,631,770 transitions of the cesium-133 atom. The BIH (Bureau International de l'Heure) averages values reported by laboratories to produce International Atomic Time (TAI). Whenever the difference between TAI and solar time grows to 800 msec, a leap second is applied; the corrected system is called Universal Coordinated Time (UTC).
Notes (translated): Measuring time is harder than it seems, especially when high precision is required. Since the invention of the mechanical clock in the 17th century, time was measured astronomically: the sun transits at noon each day, two consecutive transits bound a solar day, and a solar second was defined as exactly 1/86400 of a solar day. In the 1940s scientists established that the earth's rotation period is not constant: tidal friction and atmospheric drag are slowing the rotation, and from the growth patterns of ancient coral geologists believe that 300 million years ago a year had about 400 days. There are also short-term variations in the length of a day, possibly caused by turbulence in the earth's molten core. These findings led astronomers to measure the lengths of many days, average them, and divide by 86400 to obtain the mean solar second. The atomic clock, invented in 1948, made more precise measurement possible: unaffected by the earth's wobble, it counts transitions of the cesium-133 atom, and physicists defined the second as 9,192,631,770 such transitions, chosen so that the atomic second equaled the mean solar second in the year it was introduced. Today some 50 laboratories worldwide own cesium-133 clocks and periodically report their tick counts to the BIH in Paris, which averages them to produce TAI. Although TAI is very stable, 86400 TAI seconds are now about 3 msec shorter than a mean solar day (because the mean solar day keeps getting longer); using TAI alone, noon would drift earlier and earlier, eventually occurring before dawn. The BIH solved this by introducing leap seconds: whenever the difference between TAI and solar timekeeping grows to 800 msec, a leap second is applied. The resulting system, based on constant-length TAI seconds but kept in step with the sun's motion, is called Universal Coordinated Time (UTC); it has essentially replaced Greenwich Mean Time, an astronomical standard, as the basis of modern civil timekeeping.

8 Physical Clocks A high-frequency oscillator, counter, and holding register. The counter decrements on each oscillation; when it reaches 0 it generates a "tick" and is reloaded from the holding register. On a network, clocks will differ: "skew". Two problems: How do we synchronize the clocks with each other? How do we synchronize multiple clocks to the "real" time? Skew is not that well defined, so be careful.

9 Measuring Time How long is a second? How long is a day?
One solar day is 86,400 solar seconds. Some variation in the rotational speed of the earth: mean solar second.

10 A Solar Day Computation of the mean solar day.

11 Changes Big problem: the earth's rotation is slowing.
Solar seconds are getting longer. Is this acceptable for science and engineering? Use physical clocks not tied to the earth: TAI. Hm... but then there are more seconds in a day! What are leap years for? Leap seconds are periodically added. People might notice this, and we could have the same kind of situation as occurred in 1582, when Pope Gregory XIII decreed that 10 days be omitted from the calendar. A leap year (or intercalary year) is a year containing one or more extra days (or, in the case of lunisolar calendars, an extra month) in order to keep the calendar year synchronized with the astronomical or seasonal year. For example, in the Gregorian calendar, February in a leap year has 29 days instead of the usual 28, so the year lasts 366 days instead of the usual 365. Because seasons and astronomical events do not repeat in a whole number of days, a calendar with the same number of days in each year would, over time, drift with respect to the event it is supposed to track. By occasionally inserting (intercalating) an additional day or month, the drift can be corrected. A year that is not a leap year is called a common year.

12 Leap Seconds TAI seconds are of constant length, unlike solar seconds. Leap seconds are introduced when necessary to keep in phase with the sun. UTC is TAI with leap seconds. Suppose we want to know how much time elapsed between two events. Which do we use?

13 Time Standards International Atomic Time
TAI is a physical time standard that defines the second as the duration of 9,192,631,770 periods of the radiation of a specified transition of the cesium-133 atom. TAI is a chronoscopic timescale, i.e., a timescale without any discontinuities. It defines the epoch, the origin of time measurement, as January 1, 1958 at 00:00:00 hours, and continuously increases the counter as time progresses. Universal Coordinated Time (UTC): UTC is an astronomical time standard that is the basis for the time on the "wall clock". In 1972 it was internationally agreed that the duration of the second should conform to the TAI standard, but that the number of seconds in an hour would occasionally have to be modified by inserting a leap second into UTC to maintain synchrony between wall-clock time and astronomical phenomena, like day and night.
Notes on related time standards (translated): (1) Mean solar day: time is traditionally read from the sun's position in the sky, but because the earth's orbit around the sun is an ellipse, the apparent solar day is not uniform. The mean solar day was introduced to average out this nonuniformity: one mean solar day equals 24 mean solar hours, and one mean solar hour equals 3600 mean solar seconds. (2) Universal Time (UT0/UT1/UT2): Greenwich mean solar time, counted from mean midnight (Greenwich, near London, is also the origin of geographic longitude), is called Universal Time. It is obtained by averaging telescope observations of stars made around the world, to a precision of only about 10^-9. Because of polar motion and the nonuniformity of the earth's rotation, the raw result, denoted UT0, is itself nonuniform; correcting UT0 for polar motion gives UT1, and further applying an empirical correction for the seasonal variation in the rotation rate gives UT2. (3) International Atomic Time (TAI): in 1967 the atomic standard officially replaced the astronomical definition of the second. The new second is the duration of 9,192,631,770 periods of the radiation of the ground-state hyperfine transition of cesium-133 at sea level in zero magnetic field; its stability exceeds 10^-14. The origin of atomic time was set at January 1, 1958, 0h (UT), the instant at which atomic time and universal time coincided. (4) Universal Coordinated Time (UTC): relative to universal time, which is based on the earth's rotation, atomic time is a uniform system, which is important for measuring time intervals; but universal time reflects the earth's position in space and corresponds to the familiar cycle of seasons and of day and night. UTC was introduced to reconcile the two: it is essentially an atomic time, since its second equals the atomic second, but its reading is kept close to universal time by manual intervention. (5) Leap seconds: whenever UTC and UT1 differ by close to 0.9 seconds, one second is inserted into or removed from UTC at the end of June or December of that year.

14 Physical Clocks (1/3) Problem: Sometimes we simply need the exact time, not just an ordering. Solution: Universal Coordinated Time (UTC): Based on the number of transitions per second of the cesium-133 atom (pretty accurate). At present, the real time is taken as the average of some 50 cesium clocks around the world. Introduces a leap second from time to time to compensate for days getting longer. UTC is broadcast through short-wave radio and satellite. Satellites can give an accuracy of about ±0.5 ms.

15 Physical Clocks (2/3) Problem: Suppose we have a distributed system with a UTC-receiver somewhere in it => we still have to distribute its time to each machine. Basic principle: Every machine has a timer that generates an interrupt H times per second. There is a clock in machine p that ticks on each timer interrupt. Denote the value of that clock by Cp(t) , where t is UTC time. Ideally, we have that for each machine p, Cp(t) = t, or, in other words, dC/dt = 1.

16 Physical Clocks (3/3) In practice: 1 − ρ ≤ dC/dt ≤ 1 + ρ
Goal: Never let two clocks in any system differ by more than δ time units => synchronize at least every δ/(2ρ) seconds. The relation between clock time and UTC when clocks tick at different rates: suppose two clocks are synchronized at time t. If they drift from UTC in opposite directions, then at a time Δt after they were synchronized they may be as much as 2ρΔt apart. So if the operating system designers want to guarantee that no two clocks ever differ by more than δ, clocks must be resynchronized (in software) at least every δ/(2ρ) seconds.
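As a quick worked example, with assumed values for the drift rate ρ and the precision target δ, the required resynchronization interval follows directly from the bound above:

```python
# Hypothetical numbers: maximum drift rate rho (seconds of drift per
# second of real time) and a target precision delta.
rho = 1e-5      # 10 ppm, typical of a cheap quartz oscillator
delta = 0.01    # never let two clocks differ by more than 10 ms

# Two clocks drifting in opposite directions diverge at rate 2*rho,
# so they must be resynchronized at least every delta/(2*rho) seconds.
resync_interval = delta / (2 * rho)
print(f"resynchronize at least every {resync_interval:.0f} s")  # 500 s
```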

17 6.1 Clock Synchronization
Physical Clocks; Global Positioning System (++); Clock Synchronization Algorithms (++). Physical clocks deal with actual (absolute) time and with how processes can synchronize to it; logical clocks deal only with the relative ordering of events.

18 Global Positioning System (1/2)
Basic idea: You can get an accurate account of the time as a side effect of GPS. Problem: Even assuming that the clocks of the satellites are accurate and synchronized: it takes a while before a signal reaches the receiver, and the receiver's clock is definitely out of sync with the satellites'.

19 Global Positioning System (2/2)
Δr is the unknown deviation of the receiver's clock. xr, yr, zr are the unknown coordinates of the receiver. Ti is the timestamp on a message from satellite i. Δi = (Tnow − Ti) + Δr is the measured delay of the message sent by satellite i. Measured distance to satellite i: c × Δi (c is the speed of light). Real distance: di = cΔi − cΔr = sqrt((xi − xr)² + (yi − yr)² + (zi − zr)²). 4 satellites => 4 equations in 4 unknowns (with Δr as one of them).
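A minimal sketch of how a receiver could solve these four equations numerically, using a few Gauss-Newton iterations with NumPy. The satellite positions, delays, and starting guess below are made-up illustration values, not real GPS data:

```python
import numpy as np

C = 299_792_458.0  # speed of light (m/s)

def locate(sat_pos, measured_delay, guess=(0, 0, 0, 0)):
    """Solve the four GPS equations for (x_r, y_r, z_r, Delta_r).
    sat_pos: (n, 3) satellite coordinates; measured_delay: (n,) seconds,
    each Delta_i = true propagation time + receiver clock error Delta_r."""
    x = np.array(guess, dtype=float)
    for _ in range(20):
        diff = x[:3] - sat_pos                 # (n, 3)
        dist = np.linalg.norm(diff, axis=1)
        # residual: geometric distance minus clock-corrected pseudorange
        r = dist - C * (measured_delay - x[3])
        J = np.column_stack([diff / dist[:, None], np.full(len(r), C)])
        x -= np.linalg.solve(J.T @ J, J.T @ r)  # Gauss-Newton step
    return x

# Synthetic scenario: receiver 3 ms fast, four satellites in view.
sats = np.array([[15e6, 0, 20e6], [-15e6, 5e6, 20e6],
                 [0, 15e6, 20e6], [5e6, -15e6, 20e6]])
true_pos, dr = np.array([1e6, 2e6, 0.0]), 3e-3
delays = np.linalg.norm(sats - true_pos, axis=1) / C + dr
print(locate(sats, delays))   # recovers x_r, y_r, z_r and Delta_r
```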

20 6.1 Clock Synchronization
Physical Clocks; Global Positioning System (++); Clock Synchronization Algorithms (++). Physical clocks deal with actual (absolute) time and with how processes can synchronize to it; logical clocks deal only with the relative ordering of events.

21 Clock Synchronization
Internal synchronization: processors run a clock-sync protocol, e.g.: each broadcasts its clock reading; each processor receives a set of values from the others (the sets may differ); an algorithm picks a synchronized value from the set. External synchronization: GPS, a satellite system deployed by the US military, became public and inexpensive in the early 1990s. Think of the satellites as broadcasting the time: a small radio receiver picks up signals from three or more satellites (four, when the receiver's clock offset must also be solved for) and triangulates to determine its position; the same computation also yields an extremely accurate clock (milliseconds).

22 Clock Synchronization Algorithms
Cristian's Algorithm (1989): the time server is passive. E.g., the Network Time Protocol. The Berkeley Algorithm (1989): the time server is active. (Units: a millisecond (msec) is one thousandth (10^-3) of a second; a microsecond is one millionth (10^-6) of a second; the second is the basic unit of time under the Système International d'Unités.)

23 Cristian's Algorithm (1/2)
Getting the current time from a time server.

24 Cristian's Algorithm (2/2)
Two problems. One major: time must never run backward, so instead of jumping, the clock is adjusted gradually: when slowing down, the interrupt routine adds fewer msec to the time on each interrupt until the correction has been made; the clock is advanced gradually by adding more msec at each interrupt. One minor: it takes a nonzero amount of time for the time server's reply to get back to the sender. Estimate the one-way delay as (T1 − T0)/2, or (T1 − T0 − I)/2 if the server's interrupt-handling time I is known. A series of measurements can be averaged; alternatively, trust the message that came back fastest.
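A sketch of the client side in Python. The ask_server() call is a placeholder for the request to the passive time server; the handling-time parameter and the choice to trust the fastest round trip follow the notes above:

```python
import time

def probe(ask_server, handling_time=0.0):
    """One probe of a passive time server (Cristian's algorithm).
    ask_server() is a placeholder returning the server's clock reading;
    handling_time is the server's interrupt-handling time I, if known."""
    t0 = time.monotonic()
    server_time = ask_server()
    t1 = time.monotonic()
    rtt = t1 - t0
    # Assume the reply travelled for about (rtt - I) / 2 seconds.
    estimate = server_time + (rtt - handling_time) / 2
    return rtt, estimate - time.time()   # (round trip, clock offset)

def best_offset(ask_server, n=8):
    """Take several measurements and trust the fastest round trip."""
    return min(probe(ask_server) for _ in range(n))[1]
```

The returned offset would then be applied gradually, as described above, rather than by setting the clock directly.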

25 Network Time Protocol (RFC 1305)
NTP is a protocol for synchronizing computer clocks. It can synchronize a computer to a time server or to a clock source (a quartz clock, GPS, and so on), providing high-precision correction (error under 1 millisecond against the standard on a LAN, tens of milliseconds over a WAN), and it can use cryptographic authentication to defend against malicious protocol attacks. How NTP works: NTP can obtain UTC from atomic clocks, observatories, satellites, or the Internet. Time propagates down a hierarchy of NTP servers: all servers are assigned to strata according to their distance from an external UTC source. Stratum 1 sits at the top with a direct UTC feed; Stratum 2 obtains its time from Stratum 1, Stratum 3 from Stratum 2, and so on, with the total number of strata limited to 15. Logically, the servers form a layered, interconnected architecture in which the Stratum-1 time servers are the foundation of the whole system. A host typically connects to several time servers and uses statistical algorithms to filter the readings from the different servers, selecting the best paths and sources for correcting its own clock. Even if a host cannot reach a given time server for a long time, the NTP service keeps operating. To guard against malicious tampering, NTP uses an authentication mechanism to check that a time message really comes from the claimed server, and it checks the return path of the data, providing protection against interference.

26 NTP Architecture (Figure: an NTP subnet with servers at strata 1, 2, 2, 3, 3, 3. Arrows denote synchronization control; numbers denote strata.) The subnet reconfigures when servers become unreachable.

27 Synchronization Measures of NTP
Multicast mode: intended for use on a high-speed LAN; assumes a small delay; low accuracy but efficient. Procedure-call mode: similar to Cristian's algorithm; higher accuracy than multicast. Symmetric mode: the highest accuracy.

28 Getting the current time from a time server
A can estimate its offset relative to B as θ = T3 + [(T2 − T1) + (T4 − T3)]/2 − T4 = [(T2 − T1) + (T3 − T4)]/2. For the one-way delay, δ = [(T4 − T1) − (T3 − T2)]/2. The trick is to find a good estimation for these delays.
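The two estimates translate directly into code. A small Python sketch with a made-up worked example (B's clock runs 5 units ahead of A's, and each direction takes 2 time units):

```python
def ntp_offset_delay(t1, t2, t3, t4):
    """theta: offset of A's clock relative to B; delta: one-way delay.
    Timestamps: T1 = A sends request, T2 = B receives it,
    T3 = B sends reply, T4 = A receives it."""
    theta = ((t2 - t1) + (t3 - t4)) / 2
    delta = ((t4 - t1) - (t3 - t2)) / 2
    return theta, delta

# A sends at 10 (its own clock); B receives at 17 and replies at 18
# (B's clock); A receives at 15 (its own clock).
print(ntp_offset_delay(10, 17, 18, 15))   # -> (5.0, 2.0)
```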

29 Symmetric Mode Sync. Implementation
NTP servers retain the 8 most recent pairs <θi, δi>. The value θi that corresponds to the minimum δi is chosen as the estimate of θ. An NTP server exchanges with several peers in addition to its parent. Peers with lower stratum numbers are favored, as are peers with the lowest synchronization dispersion. The timekeeping quality at a particular peer is determined by a sum of weighted offset differences, called the dispersion; the total dispersion to the root due to all causes is called the synchronization dispersion.

30 The Berkeley Algorithm
(a) The time daemon asks all the other machines for their clock values. (b) The machines answer. (c) The time daemon tells everyone how to adjust their clocks.
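A sketch of one polling round in Python. Clock readings are plain numbers here, and the example values mirror the well-known textbook figure (daemon at 3:00, members at 2:50 and 3:25):

```python
def berkeley_round(daemon_clock, member_clocks):
    """One round of the Berkeley algorithm: the time daemon polls the
    members, averages all readings (its own included), and returns the
    adjustment each machine should apply to its clock."""
    readings = [daemon_clock] + list(member_clocks)
    average = sum(readings) / len(readings)
    return [average - r for r in readings]   # daemon first, then members

# Minutes past 2:00 -> average is 3:05, so adjustments are +5, +15, -20.
print(berkeley_round(180, [170, 205]))   # -> [5.0, 15.0, -20.0]
```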

31 Outline 6.1 Clock Synchronization (++) 6.2 Logical Clocks (++)
6.3 Mutual Exclusion 6.4 Global Positioning of nodes 6.5 Election Algorithms

32 6.2 Logical Clocks Lamport’s Logical Clocks (++) Vector Clocks (++)
So far, we have assumed that clock synchronization is naturally related to real time. However, we have also seen that it may be sufficient that every node agrees on a current time, without that time necessarily being the same as the real time. In this section we discuss Lamport's algorithm, which synchronizes logical clocks. We also discuss an extension of Lamport's approach, called vector timestamps.

33 Happen-Before (HB) Relation
Problem: We first need to introduce a notion of ordering before we can order anything. Let → denote the HB relation. HB1: If e and e′ are events in the same process pi and e comes before e′, then e → e′. HB2: For any message m, send(m) → receive(m). HB3: If e, e′ and e″ are events such that e → e′ and e′ → e″, then e → e″. Note: this introduces a partial ordering of events in a system with concurrently operating processes; it is a causal ordering.

34 Happen-before relation
Example: a || e (a and e are concurrent). Shortcomings: the relation is not suitable for process collaboration that does not involve message transmission, and it captures only potential causal ordering: not every happened-before relationship is actually causal.

35 Events occurring at three processes

36 Logical Clocks (1/2) Problem: How do we maintain a global view of the system's behavior that is consistent with the happened-before relation? Solution: attach a timestamp C(e) to each event e, satisfying the following properties: P1: If a and b are two events in the same process and a → b, then we demand that C(a) < C(b). P2: If a corresponds to sending a message m, and b to the receipt of that message, then also C(a) < C(b). Problem: How to attach a timestamp to an event when there is no global clock? => maintain a consistent set of logical clocks, one per process.

37 Logical Clocks (2/2) Solution: Each process Pi maintains a local counter Ci and adjusts this counter according to the following rules: (1) For any two successive events that take place within Pi, Ci is incremented by 1. (2) Each time a message m is sent by process Pi, the message receives a timestamp ts(m) = Ci. (3) Whenever a message m is received by a process Pj, Pj adjusts its local counter Cj to max{Cj, ts(m)}, then executes step (1) before passing m to the application. Property P1 is satisfied by (1); property P2 by (2) and (3). Note: it can still occur that two events are assigned the same time; avoid this by breaking ties through process IDs. (Page 246.) Lamport's solution follows directly from the happened-before relation. To implement Lamport's logical clocks, each process Pi maintains a local counter Ci. These counters are updated as follows (Raynal and Singhal, 1996): Before executing an event (i.e., sending a message over the network, delivering a message to an application, or some other internal event), Pi executes Ci ← Ci + 1. When process Pi sends a message m to Pj, it sets m's timestamp ts(m) equal to Ci after having executed the previous step. Upon receipt of a message m, process Pj adjusts its own local counter as Cj ← max{Cj, ts(m)}, after which it executes the first step and delivers the message to the application.
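The three rules map directly onto a small class. A minimal Python sketch; message transport is left out, and callers are assumed to exchange the returned timestamps themselves:

```python
class LamportClock:
    """Minimal sketch of Lamport's logical clock rules."""
    def __init__(self):
        self.c = 0

    def internal_event(self):
        self.c += 1                 # rule 1: increment per event
        return self.c

    def send(self):
        self.c += 1                 # increment, then stamp the message
        return self.c               # rule 2: ts(m) = C_i

    def receive(self, ts_m):
        self.c = max(self.c, ts_m)  # rule 3: adjust to the max, then
        self.c += 1                 # increment before delivery
        return self.c
```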

38 Logical Clocks - Example
Note: Adjustments take place in the middleware layer. Figure 6-9: (a) three processes, each with its own clock; the clocks run at different rates. (b) Lamport's algorithm corrects the clocks. Figure 6-10: the positioning of Lamport's logical clocks in distributed systems.

39 Example: Totally Ordered Multicasting (1/2)
Problem: We sometimes need to guarantee that concurrent updates on a replicated database are seen in the same order everywhere: P1 adds $100 to an account (initial value: $1000); P2 increments the account by 1%; there are two replicas. Result, in the absence of proper synchronization: replica #1 ← $1111, while replica #2 ← $1110. Example: a bank places copies of an account database in two different cities, say New York and San Francisco, and a query is always forwarded to the nearest copy. A customer in San Francisco wants to add $100 to his account, which currently contains $1000; at the same time, a bank employee in New York initiates an update by which the customer's account is to be increased with 1 percent interest.

40 Example: Totally Ordered Multicasting (2/2)
Solution: Process Pi sends a timestamped message msgi to all others, and the message itself is put in a local queue queuei. Any incoming message at Pj is queued in queuej according to its timestamp, and acknowledged to every other process. Pj passes a message msgi to its application only if: (1) msgi is at the head of queuej, and (2) it has been acknowledged by every other process. A totally ordered multicast is a multicast operation by which all messages are delivered in the same order to each receiver. In addition, we assume that messages from the same sender are received in the order they were sent, and that no messages are lost.
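A sketch of the scheme in Python. It assumes lossless FIFO channels and a broadcast callback that reaches every process, including the sender; a queued message is delivered only once all n processes have acknowledged it:

```python
import heapq

class TOMulticast:
    """Sketch of totally ordered multicast with Lamport timestamps."""
    def __init__(self, pid, n, broadcast):
        self.pid, self.n, self.broadcast = pid, n, broadcast
        self.clock = 0
        self.queue = []      # heap of (timestamp, sender, message)
        self.acks = {}       # (timestamp, sender) -> set of ack senders

    def send(self, msg):
        self.clock += 1
        self.broadcast(("MSG", self.clock, self.pid, msg))

    def on_packet(self, kind, ts, sender, payload=None):
        self.clock = max(self.clock, ts) + 1
        if kind == "MSG":
            heapq.heappush(self.queue, (ts, sender, payload))
            self.acks.setdefault((ts, sender), set())
            self.broadcast(("ACK", self.clock, self.pid, (ts, sender)))
        else:                            # an ACK for a queued message
            self.acks.setdefault(payload, set()).add(sender)
        return self._deliverable()

    def _deliverable(self):
        out = []
        while self.queue:
            ts, sender, msg = self.queue[0]
            if len(self.acks.get((ts, sender), ())) < self.n:
                break                    # head not yet acked by everyone
            heapq.heappop(self.queue)
            out.append(msg)              # same order at every process
        return out
```

Because every process queues by (timestamp, sender) and delivers only fully acknowledged heads, all processes hand messages to the application in the same order.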

41 6.2 Logical Clocks Lamport’s Logical Clocks Vector Clocks

42 Vector Clocks (1/2) Observation: Lamport's clocks do not guarantee that if C(a) < C(b), then a causally preceded b. Example: event a: m1 is received at T = 16; event b: m2 is sent at T = 20. We cannot conclude that a causally precedes b. The problem is that Lamport clocks do not capture causality; causality can be captured by means of vector clocks.

43 Vector Clocks (2/2) Solution:
Each process Pi has an array VCi[1..n], where VCi[j] denotes the number of events that process Pi knows have taken place at process Pj. When Pi sends a message m, it adds 1 to VCi[i] and sends VCi along with m as the vector timestamp ts(m). Result: upon arrival, the recipient knows Pi's timestamp. When a process Pj delivers a message m that it received from Pi with vector timestamp ts(m), it (1) updates each VCj[k] to max{VCj[k], ts(m)[k]} and (2) increments VCj[j] by 1. Question: What does VCi[j] = k mean in terms of messages sent and received?
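A minimal Python sketch of these update rules for process i of n; sending returns the vector timestamp to ship along with the message:

```python
class VectorClock:
    """Vector clock for process i of n (the rules above, verbatim)."""
    def __init__(self, i, n):
        self.i = i
        self.vc = [0] * n

    def send(self):
        self.vc[self.i] += 1           # add 1 to VC_i[i] before sending
        return list(self.vc)           # ts(m): a copy travels with m

    def deliver(self, ts_m):
        # (1) component-wise maximum, (2) increment own entry
        self.vc = [max(a, b) for a, b in zip(self.vc, ts_m)]
        self.vc[self.i] += 1
```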

44 Causally Ordered Multicasting (1/2)
Observation: We can now ensure that a message is delivered only if all causally preceding messages have already been delivered. Adjustment: Pi increments VCi[i] only when sending a message, and Pj "adjusts" VCj when receiving a message (i.e., it effectively does not change VCj[j]). Pj postpones delivery of m until: (1) ts(m)[i] = VCj[i] + 1, meaning m is the next message expected from Pi (no skipping); and (2) ts(m)[k] ≤ VCj[k] for all k ≠ i, meaning Pj has seen all messages that Pi had seen when it sent m.

45 Causally Ordered Multicasting (2/2)
Example 1; Example 2: take VC2 = [0, 2, 2] and ts(m) = [1, 3, 0] from P0. What information does P2 have, and what will it do when receiving m (from P0)? Note that there is no need for process Pj to delay the delivery of its own messages.
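The two delivery conditions can be written as a small Python predicate and checked against Example 2 above:

```python
def can_deliver(ts_m, sender_i, vc_j, j):
    """P_j may deliver m from P_i iff m is the next message expected
    from P_i and P_j has already seen everything P_i had seen."""
    if j == sender_i:              # own messages need no delay
        return True
    return (ts_m[sender_i] == vc_j[sender_i] + 1 and
            all(ts_m[k] <= vc_j[k]
                for k in range(len(ts_m)) if k != sender_i))

# Example 2: VC2 = [0,2,2], ts(m) = [1,3,0] from P0.
# Condition (1) holds (1 == 0+1), but ts[1] = 3 > VC2[1] = 2,
# so P2 must postpone delivery of m.
print(can_deliver([1, 3, 0], 0, [0, 2, 2], 2))   # -> False
```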

46 Asynchronous distributed system model:
Observation: A distributed computation consists of a set of processes that cooperate to achieve a common goal. A main characteristic of these computations is that the processes share no common global memory and communicate only by exchanging messages over a communication network. Asynchronous distributed system model: message transfer delays are finite yet unpredictable; this includes systems that span large geographic areas and are subject to unpredictable loads. Causality: given two events in a distributed computation, a crucial problem is knowing whether they are causally related. ("Fundamentals of Distributed Computing: A Practical Tour of Vector Clock Systems.") A key concept of asynchronous distributed systems is causality: could the occurrence of one event be a consequence of the other?

47 An example of a distributed computation

48 Causal broadcast Causal broadcast states that the order in which processes deliver messages to application processes cannot violate the precedence order of the corresponding broadcast events. More precisely, if two broadcast messages m and m′ are such that broadcast(m) → broadcast(m′), then every process must deliver m before m′. If the broadcasts of m and m′ are concurrent, then processes are free to deliver m and m′ in any order. This means that when a process delivers a message m, all messages whose broadcasts causally precede the broadcast of m have already been delivered to that process. In our discussion of basic vector clock properties, we investigate three problems: causal broadcast, detecting message stability, and detecting an event pattern. Birman and Joseph introduced the causal broadcast notion to reduce the asynchrony of communication channels as application processes perceive them.

49 Causal delivery of broadcast messages
When m′ arrives at P2, its delivery must be delayed: m′ arrived at P2 before m, but the sending of m causally precedes the sending of m′, so m must be delivered first.

50 Outline 6.1 Clock Synchronization (++) 6.2 Logical Clocks (++)
6.3 Mutual Exclusion 6.4 Global Positioning of nodes 6.5 Election Algorithms Fundamental to distributed systems is the concurrency and collaboration among multiple processes. In many cases, this also means that processes will need to simultaneously access the same resources. To prevent such concurrent accesses from corrupting the resource, or making it inconsistent, solutions are needed to grant mutually exclusive access by processes.

51 6.3 Mutual Exclusion Overview A Centralized Algorithm
A Decentralized Algorithm A Distributed Algorithm A Token Ring Algorithm A Comparison of the Four Algorithms

52 Terminology In concurrent programming, a critical section (CS) is a piece of code that accesses a shared resource (a data structure or device) that must not be concurrently accessed by more than one thread of execution. A critical section will usually terminate in fixed time, and a thread, task, or process will only have to wait a fixed time to enter it (bounded waiting); some synchronization mechanism, such as a semaphore, is required at the entry and exit of the critical section to ensure exclusive use. Mutual exclusion (ME, often abbreviated to mutex) algorithms are used in concurrent programming to avoid the simultaneous use of a common resource, such as a global variable, by pieces of code called critical sections.

53 Applications use ME Observation: If a data item is replicated at various sites, we expect that only one of the sites updates it at a time. This gives rise to the problem of ME, which requires that only one of the contending processes be allowed, at a time, to enter its CS. (1) Replicated databases: replication increases fault tolerance and improves response time. Protocols that control access to replicated data and ensure data consistency in case of network partitioning are called replica control protocols. A transaction (an access request to data) is the basic unit of user computation in a database; it executes in three steps: it reads a portion of the database into a local workspace, performs some local computation on it, and finally writes some values back into the database. All replica control protocols require that mutual exclusion be guaranteed between two write operations, and between a read and a write operation. (2) Distributed shared memory (DSM): an abstraction used for sharing data between computers that do not share physical memory. [Saxena and Rai, 2003] P. C. Saxena and J. Rai, "A survey of permission-based distributed mutual exclusion algorithms," Comput. Stand. Interfaces, vol. 25, 2003.

54 Performance Metrics of DME Algorithm
Message Complexity (MC): the number of messages exchanged by a process per CS entry. Synchronization Delay (SD): the average time delay in granting CS, i.e., the period between the instant a site invokes mutual exclusion and the instant it enters the CS. A good DME algorithm must be safe (it ensures mutual exclusion), live (the system makes progress toward the execution of the CS, and deadlock cannot occur), and fair (it is not biased against or in favor of any node, and each request is eventually satisfied). It should have low MC and SD, and high fault tolerance and availability: the fault tolerance is the maximal number of nodes that can fail before it becomes impossible for any node to enter its CS, and the availability is the probability that the CS can be entered in the presence of failures. (Page 162.)

55 Mutual Exclusion Prevent simultaneous access to a resource
Two different categories: Permission-based approach: a Centralized Algorithm, a Decentralized Algorithm, a Distributed Algorithm. Token-based approach: a Token Ring Algorithm (the process holding the token gets access; the hard part is regenerating a lost token).

56 Classification tree for DME algorithms

57 Centralized Algorithm
Use a central coordinator to simulate how it is done in a one-processor system. Step 1: Process 1 asks the coordinator for permission to access a shared resource; permission is granted. Step 2: Process 2 then asks permission to access the same resource; the coordinator does not reply, but queues the request. Step 3: When process 1 releases the resource, it tells the coordinator, which then replies to 2.
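A sketch of the coordinator's logic in Python; grant is a placeholder callback standing in for the reply message to a process:

```python
from collections import deque

class Coordinator:
    """Centralized mutual exclusion (sketch of the three steps above)."""
    def __init__(self, grant):
        self.grant = grant
        self.holder = None
        self.waiting = deque()

    def request(self, pid):
        if self.holder is None:
            self.holder = pid
            self.grant(pid)           # step 1: permission granted
        else:
            self.waiting.append(pid)  # step 2: no reply; queue it

    def release(self, pid):
        assert pid == self.holder
        self.holder = self.waiting.popleft() if self.waiting else None
        if self.holder is not None:
            self.grant(self.holder)   # step 3: next in line gets it
```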

58 A Decentralized Algorithm
Use a distributed hash table (DHT): each resource hashes to a node. Each resource has n coordinators (called replicas in the book). A limit m (> n/2) is predefined. A client acquires the lock by sending a request to each coordinator; if it gets m permissions, it has the lock. If a resource is already locked, the request is rejected (as opposed to just blocking). Does this work? Why is it better than the centralized solution? What happens if a coordinator fails? Why must m > n/2?
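A sketch of one acquisition attempt in Python. The try_lock()/unlock() coordinator interface and the backoff interval are assumptions made for this sketch, not a fixed API:

```python
import random
import time

def try_acquire(coordinators, m):
    """Decentralized lock attempt: ask all n coordinators, succeed on
    at least m > n/2 grants; on failure, release whatever was granted
    and back off before retrying."""
    granted = [c for c in coordinators if c.try_lock()]
    if len(granted) >= m:
        return granted                    # lock held; keep for release
    for c in granted:                     # rejected: give back the votes
        c.unlock()
    time.sleep(random.uniform(0.0, 0.1))  # random backoff, then retry
    return None
```

Requiring m > n/2 means two clients can never both collect a majority, which is what makes the scheme safe as long as coordinators do not forget their state.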

59 n = 5, m = 3. Squares are clients; circles are coordinators.
1. Send lock requests. 2. Receive responses; blue succeeds. 3. Release if failed. What happens if m = 2?

60 Coordinator Failure If a coordinator fails, replace it.
But what about the lock state? Replacement amounts to a resetting of the coordinator's state, which could result in violating mutual exclusion. How many would have to fail? 2m − n: at most n − m coordinators were outside the original grant set, so another x = 2m − n of the granting ones must reset before a second client can reach the limit m (n − m + x = m). What is the probability of violation?

61 Probability of Violation
Let p be the probability of failure during some time Δt. The probability that k out of the m granting coordinators reset is P[k] = C(m, k) p^k (1 − p)^(m−k). To violate mutual exclusion, at least 2m − n resets are needed (n − m + x = m is the minimum to reach the limit m). MC = 3mk, k = 1, 2, ... (three messages per coordinator per attempt, over k attempts); SD = 2m. Problems: starvation, low efficiency. With node participation for 3 hours, Δt of 10 seconds, n = 32 and m = 0.75n, the violation probability Pv is vanishingly small.
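Summing the binomial tail gives the violation probability directly. A small Python computation using the slide's numbers; modelling the per-coordinator reset probability as p = Δt divided by the mean node lifetime is an assumption of this sketch:

```python
from math import comb

def p_violation(n, m, p):
    """Probability that at least 2m - n of the m granting coordinators
    reset (each independently with probability p), which is what a
    second client would need in order to also collect m votes."""
    return sum(comb(m, k) * p**k * (1 - p)**(m - k)
               for k in range(2 * m - n, m + 1))

# n = 32, m = 24 (0.75 n); nodes live ~3 hours, a reset lasts 10 s.
print(p_violation(32, 24, 10 / (3 * 3600)))   # vanishingly small
```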

62 A Distributed Algorithm
When a process wants a resource, it builds a message with the name of the resource, its process number, and the current (logical) time. It then reliably sends the message to all processes and waits for an OK from everyone. When a process receives a request message: (1) If the receiver is not accessing the resource and does not currently want it, it sends back an OK message: "Yes, you can have it. I don't want it, so what do I care?" (2) If the receiver already has access to the resource, it simply does not reply; instead, it queues the request: "Sorry, I am using it. I will save your request and give you an OK when I am done with it." (3) If the receiver wants the resource as well but has not yet acquired it, it compares the timestamp of the incoming message with the one in the message it sent everyone; the lowest one wins. If the incoming message has the lower timestamp, the receiver sends back an OK: "I want it also, but you were first." If its own message has the lower timestamp, it queues the incoming request: "Sorry, I want it also, and I was first." When done using the resource, a process sends an OK to everyone on its queue and deletes the queue entries.
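A sketch of the receiver-side rules in Python (this distributed algorithm is commonly credited to Ricart and Agrawala). Message transport and the Lamport clock are assumed to live elsewhere; send_ok is a placeholder callback:

```python
class RicartAgrawala:
    """Sketch of the distributed mutual exclusion rules for one resource."""
    RELEASED, WANTED, HELD = range(3)

    def __init__(self, pid, send_ok):
        self.pid, self.send_ok = pid, send_ok
        self.state = self.RELEASED
        self.my_ts = None
        self.deferred = []              # requests queued while we win

    def want(self, ts):                 # call after broadcasting REQUEST
        self.state, self.my_ts = self.WANTED, ts

    def on_all_oks(self):               # all n-1 OKs collected
        self.state = self.HELD          # enter the critical section

    def on_request(self, ts, pid):
        mine_wins = (self.state == self.HELD or
                     (self.state == self.WANTED and
                      (self.my_ts, self.pid) < (ts, pid)))
        if mine_wins:
            self.deferred.append(pid)   # "sorry, I was first"
        else:
            self.send_ok(pid)           # "yes, you can have it"

    def on_exit(self):
        self.state = self.RELEASED
        for pid in self.deferred:       # now let the others in
            self.send_ok(pid)
        self.deferred.clear()
```

Ties are broken by comparing (timestamp, process id) pairs, so two requests can never both win.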

63 (Figure: two processes, 0 and 2, want to access a shared resource at the same moment.) Step 1: each sends a timestamped request to everyone. Step 2: process 0 has the lowest timestamp, so it wins and accesses the resource. Step 3: when process 0 is done, it sends an OK also, so 2 can now go ahead. Let us try to understand why the algorithm works: if there is no conflict, it clearly works.

64 Evaluation How many messages are required? More or fewer than centralized? One request plus an OK from everyone else, so 2(n − 1) per entry. More scalable? How much work per node, per lock? Is it better than centralized? How many points of failure? n: we have replaced a poor solution with a worse one. Can we figure out how to handle failure? One improvement: allow a process to enter a critical region when it has collected permission from a simple majority of the other processes.

65 A Token Ring Algorithm (a) An unordered group of processes on a network. (b) A logical ring constructed in software.

66 When the ring is initialized, process 0 is given the token.
The token circulates around the ring in point-to-point messages. When a process wants to enter the CS, it waits until it gets the token, enters, holds the token while inside, exits, and passes the token on. Starvation? Lost tokens? Other crashes?
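A sketch of one token hop in Python; wants_cs, enter_cs, and send_token are placeholder callbacks standing in for the application and the message transport:

```python
def token_ring_step(pid, ring, wants_cs, enter_cs, send_token):
    """The token arrives at `pid`: enter the critical section only if
    this process wants to, then forward the token to the successor on
    the logical ring. Exactly one token => mutual exclusion."""
    if wants_cs(pid):
        enter_cs(pid)
    successor = ring[(ring.index(pid) + 1) % len(ring)]
    send_token(successor)
```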

67 A Comparison of the Four Algorithms
Algorithm | Messages per entry/exit | Delay before entry (in message times) | Problems
Centralized | 3 | 2 | coordinator crash
Decentralized | 3mk (k = 1, 2, ...) | 2m | starvation, low efficiency
Distributed | 2(n − 1) | 2(n − 1) | crash of any process
Token ring | 1 to ∞ | 0 to n − 1 | lost token, process crash

68 Outline 6.1 Clock Synchronization (++) 6.2 Logical Clocks (++)
6.3 Mutual Exclusion 6.4 Global Positioning of nodes 6.5 Election Algorithms When the number of nodes in a distributed system grows, it becomes increasingly difficult for any node to keep track of the others. Such knowledge may be important for executing distributed algorithms such as routing, multicasting, data placement, searching, and so on.

69 Global Positioning of Nodes
Problem: How can a single node efficiently estimate the latency between any two other nodes in a distributed system? Solution: construct a geometric overlay network in which the distance d(P, Q) reflects the actual latency between P and Q.

70 Computing Position (1/2)
Observation: a node P needs m + 1 distance measures (landmarks) to compute its own position in an m-dimensional space. Consider the two-dimensional case: P needs to solve three equations in two unknowns (xP, yP), namely di = sqrt((xi − xP)² + (yi − yP)²) for landmarks i = 1, 2, 3.

71 Computing Position (2/2)
T. S. E. Ng and H. Zhang, "Predicting Internet Network Distance with Coordinates-Based Approaches," INFOCOM 2002.

72 Szymaniak et al., 2004: As it turns out, with well-chosen landmarks, m can be as small as 6 or 7, with the estimated distance d̂(P, Q) being no more than a factor of 2 different from the actual latency d(P, Q) for arbitrary nodes P and Q.

73 Outline 6.1 Clock Synchronization (++) 6.2 Logical Clocks (++)
6.3 Mutual Exclusion 6.4 Global Positioning of nodes 6.5 Election Algorithms

74 Election Algorithms The Bully Algorithm A Ring Algorithm
Superpeer Selection

75 Election Algorithms Principle: An algorithm requires that some process act as a coordinator. The question is how to select this special process dynamically. Note: In many systems the coordinator is chosen by hand (e.g., file servers); this leads to centralized solutions => a single point of failure. Question: Is a fully distributed solution, i.e., one without a coordinator, always more robust than any centralized/coordinated solution? Many distributed algorithms require one process to act as a coordinator. In general it does not matter which process, but one of them must do it, and only one: you can't have two coordinators in a centralized mutual exclusion algorithm, for example. Election algorithms serve this purpose. The outcome is one, and only one, winner, and all processes must agree; the biggest challenge is handling failures.

76 Election by Bullying (1/2)
Principle: Each process has an associated priority (weight); the process with the highest priority should always be elected as the coordinator. Issue: How do we find the heaviest process? Any process can start an election by sending an election message to all other processes (assuming you don't know the weights of the others). If a process Pheavy receives an election message from a lighter process Plight, it sends a take-over message to Plight, and Plight is out of the race. If a process doesn't get a take-over message back, it wins and sends a victory message to all other processes. In detail: when any process notices that the current coordinator is down, it starts an election. A process P holds an election as follows: P sends an ELECTION message to all processes with higher numbers. If no one responds, P wins the election and becomes coordinator. If one of the higher-ups answers, it takes over and P's job is done. When a process gets an ELECTION message, it sends an OK back, indicating that it will take over, and then holds a (recursive) election itself. When a process knows that it has won, it sends a winner message to all other processes. When a crashed process comes back up, it holds an election.
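A compact Python model of the election's outcome: rather than exchanging real messages, alive(p) stands in for "process p answers", and the recursion models the hand-off to a higher process. The printed example reproduces the slide's scenario, with process 7 down:

```python
def bully_election(pid, processes, alive):
    """Winner of a bully election started by `pid` (a model, not a
    message-passing implementation). Returns the highest live process."""
    higher = [p for p in processes if p > pid]
    if not any(alive(p) for p in higher):
        return pid                 # nobody bigger answered: pid wins
    # Some higher process answered OK and takes over; recurse from the
    # lowest live one (the chain always ends at the highest live id).
    takeover = min(p for p in higher if alive(p))
    return bully_election(takeover, processes, alive)

print(bully_election(4, [1, 2, 3, 4, 5, 6, 7], lambda p: p != 7))  # -> 6
```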

77 Election by Bullying (2/2)
Question: We're assuming something very important here. What is it? Bully Algorithm (Garcia-Molina, 1982): Process 4 holds an election; processes 5 and 6 respond, telling 4 to stop; now 5 and 6 each hold an election; process 6 tells 5 to stop; process 6 wins and tells everyone. Answer: we assume that every process knows the process number of every other process; what the processes do not know is which ones are currently up and which are currently down.

78 Election in a Ring (1/2) Principle: Process priority is obtained by organizing processes into a (logical) ring; the process with the highest priority should be elected as coordinator. Any process can start an election by sending an ELECTION message to its successor; if the successor is down, the message is passed on to the next successor. At each hop, the sender adds its own process number to the list in the message. When the message gets back to the initiator, everyone has had a chance to make its presence known. The initiator then sends a COORDINATOR message around the ring containing the list of all living processes; the one with the highest priority (number) is the new coordinator.
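A compact Python model of one election round; alive(p) again stands in for reachability, and a single pass collects the ids that the circulating ELECTION message would accumulate:

```python
def ring_election(initiator, ring, alive):
    """The ELECTION message travels once around the logical ring,
    skipping dead processes and collecting live ids; the COORDINATOR
    pass then announces the highest id on the list."""
    start = ring.index(initiator)
    members = [ring[(start + k) % len(ring)]
               for k in range(len(ring))
               if alive(ring[(start + k) % len(ring)])]
    return max(members)            # announced in the COORDINATOR pass

print(ring_election(3, [0, 1, 2, 3, 4, 5], lambda p: p != 5))   # -> 4
```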

79 Election in a Ring (2/2) Q&A: It does no harm to have extra messages circulating; at worst it consumes a little bandwidth, but this is not considered wasteful. Question: Does it matter if two processes initiate an election? Question: What happens if a process crashes during the election?

80 Elections in Large-Scale Systems
The algorithms we have been discussing so far generally apply to relatively small distributed systems. Moreover, the algorithms concentrate on the selection of only a single node.

81 Superpeer Selection Issue: How do we select superpeers such that:
Normal nodes have low-latency access to superpeers; superpeers are evenly distributed across the overlay network; there is a predefined fraction of superpeers; and no superpeer should need to serve more than a fixed number of normal nodes. DHT approach: reserve a fixed part of the ID space for superpeers. Example: if S superpeers are needed for a system that uses m-bit identifiers, simply reserve the k = ⌈log2 S⌉ leftmost bits for superpeers. With N nodes, we will have, on average, 2^(k−m) · N superpeers. Routing to a superpeer: send a message for key p to the node responsible for p AND 11...1100...00 (k ones followed by m − k zeros).
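The ID-space reservation is easy to make concrete. A small Python sketch of the mask computation and the routing rule; S, m, and the sample key are arbitrary illustration values:

```python
from math import ceil, log2

def superpeer_params(S, m):
    """Reserve the k = ceil(log2 S) leftmost bits of an m-bit id space
    for superpeers; return k and the routing mask (k ones, m-k zeros)."""
    k = ceil(log2(S))
    mask = ((1 << k) - 1) << (m - k)
    return k, mask

def superpeer_key(p, S, m):
    """Route a request for key p to the superpeer responsible for it:
    keep p's top k bits and zero the rest (p AND 11..1100..00)."""
    _, mask = superpeer_params(S, m)
    return p & mask

k, mask = superpeer_params(S=12, m=8)             # k = 4, mask = 0b11110000
print(k, bin(superpeer_key(0b10110101, 12, 8)))   # -> 4 0b10110000
```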

82 Superpeer Selection in an m-dimensional geometric space
Virginia Lo, Dayi Zhou, Yuhong Liu, Chris GauthierDickey, and Jun Li, "Scalable Supernode Selection in Peer-to-Peer Overlay Networks," Proceedings of the Second International Workshop on Hot Topics in Peer-to-Peer Systems (HOT-P2P'05), 2005. The basic idea is simple: a total of N tokens are spread across N randomly chosen nodes, and no node can hold more than one token. Each token represents a repelling force by which other tokens are inclined to move away. The net effect is that if all tokens exert the same repulsion force, they move away from each other and spread themselves evenly in the geometric space (moving tokens in a two-dimensional space using repulsion forces). When a token has been held by a node for a given amount of time, that node promotes itself to superpeer.

