While designing a distributed message queue, I noted that it would be difficult to properly order messages between a quorum of nodes if the system clocks between the nodes are not perfectly in sync or if messages came in faster than the system’s minimum tick resolution.
For example, on Windows 2008 R2 virtual machines running under Hyper-V, we’ve noted that the lowest resolution we can achieve without registry hacks is 15.6 ms and is related to the system minimum timer period. If two messages are received in under 15.6 ms, even on a single node, they will effectively have the same tick count and it will be impossible to order them by wallclock time.
If the clocks on nodes aren’t perfectly in sync, Node A with a clock that is 3 seconds behind may see a message come in 2 seconds before Node B even though Node B’s message actually came in first. When working with virtual machines, clock skew, and large clusters, even protocols like NTP won’t keep clocks within acceptable limits to guarantee ordering.
Because of these issues with wallclock time, to accomplish a partial ordering of messages between nodes, I have looked into vector clocks which I see mentioned a lot in distributed systems and I think I have come up with a plan that lets them work with the hash ring quorum based messaging system I am developing.
Vector clocks track the number of events that have been processed by actors in a distributed system. Each actor tracks not only its current event count, but also holds knowledge of the number of events that it is aware of that have occurred with other actors. Using this knowledge, it is possible to deduce causality, and thus order between a series of messages.
There are a few methods for updating and passing vector clock data between actors in a system, and this actually seems to be the most interesting part of utilizing them. You can read more in Distributed Computing [Kindle Edition].
While searching for a way to utilize vector clocks in my system, I first found a way that didn’t work which I will demonstrate first, then I’ll explain the method that I plan to use which utilizes a quorum consensus to update the vector clock on a message.
Let’s take a look at the first example of using vector clocks that didn’t quite turn out the way I needed it to in order to decide on message order. See the crude diagram below.
In the diagram above, we are sending the message “hi how are you” in 4 pieces coming from two different producers. The first “hi” message comes from producer 1 (“p1”). At the time the first message is sent, the producer has no knowledge of the vector clock associated with the queue. In fact, the queue is brand new and all the nodes give this queue an initial value of [0,0,0] corresponding to the initial state of all three nodes involved in managing the range for this queue. When the first message “hi” is send from p1, it sent to nodes A and B that form a quorum for this queue. Both nodes are contacted and update their vector clocks to show that they have received an event for this queue. The return clocks for nodes A and B are [1,0,0] (the first 1 corresponding to node A, the first node in the range) and [0,1,0] (the 1 in the second position corresponding to node B, the second node in the range) respectively. Note that this is also the value of the vector clock recorded with the message “hi” for each of the nodes. As far as node A is concerned message “hi” has a clock of [1,0,0], and as far as node B is concerned, message “hi” has a different value of [0,1,0] for its clock.
When the producer gets these clocks back from the quorum, they are combined with max(vector clock A, vector clock B) and the producer now knows that the resultant clock from this operation is [1,1,0]. The producer p1 then sends “how” to nodes A and B again forming a quorum, and tags the message with its knowledge of the combined vector clock [1,1,0] to A and B. Node A receives this message, increments its part in the vector clock and ALSO updates its knowledge of the vector clock for Node B yielding [2,1,0]. Node B gets the same message, updates its value in the vector clock and ALSO updates its knowledge of the vector clock value for node A yielding [1,2,0]. The producer then combines these values again to yield a resulting clock of [2,2,0].
This process continues through p2 producing the message “are” and then p1 producing the message “you”. Then on the bottom right hand side of the diagram, we see consumer c1 connect to nodes A and C, and try to read the message stream.
Message “hi” is consumed from Node A, and its clock value there is [1,0,0]. Message “how” is also consumed from Node A, and its clock value is [2,1,0]. From this data we can deduce that message “hi” came before “how” because “how” has two values that are greater than those of “hi” and none that are less than it. So far so good, but not for long.
When we then consume the value “are” we have a problem. The clock value obtained from Node C on that message is [0,0,1]. While the third count is greater than the third count on any other message, the clock values for Nodes A and B are lower than any other previous message. This leaves us with an ambiguity in how to order the messages.
What I noticed while performing this mental exercise was that the producer DOES know the proper order at each of the steps. Since it is combining the vector clocks returned from the quorum, it has a better idea of the current clock for the messages than the nodes do. What if we take an extra step once we have contacted a quorum, and instruct the nodes to write the combined clock values back to the message?
You can see this sequence in the diagram below:
Now we can see a completely different (but familiar) series of events. p1 writes the value “hi” to nodes A and B. A’s resultant clock is [1,0,0] while B’s is [0,1,0]. p1 gets the quorum response back and then proceeds to tell A and B to update the vclock for the message “hi” to [1,1,0]. p1 then writes “how” to nodes A and B, and gets a resulting clock of [2,2,0], again instructing both nodes to write this clock value back to the message. This continues until we reach the end of the series of messages in the queue.
When consumer c1 connects to nodes A and C this time, it has a different view of the clocks associated with the messages. “hi” is now read from Node A, and has a clock of [1,1,0]. “how” is also read from Node A and has a clock of [2,2,0]. “are” is read from Node C and has a clock of [2,3,1] and finally “you” is read from Node A again with a clock value of [3,4,1] producing a series of clocks [1,1,0] -> [2,2,0] -> [2,3,1] -> [3,4,1]. There are no ambiguities remaining in how to order the messages, and the proper order of the resulting messages is maintained when they are returned.
As I learn more about distributed algorithms, I may discover a more efficient way to accomplish the same thing. For example, I haven’t looked into using matrix clocks yet to see if I might be able to skip the extra write-back step given the additional information that would be available. However, for now I am happy that this will at least solve the ordering issue.