Distributed messaging fault tolerance

In a previous post, I was left with the question of how to ensure that messages are not dropped in the face of a specific type of failure. The problem I faced at the end of that article was the case where the consumer loses its link to the common node that a producer is sending messages to as part of a quorum.

[Figure: cli_common_link]

In this scenario, the producer continues pumping out messages to a quorum of nodes, unaware of the link failure on the client side. The client does not receive them until it makes a new connection to achieve a read quorum. Without adapting this design, messages will be lost while the client reconnects.

[Figure: cli_common_link_recover]

I have chosen to go with a design that allows the caller to specify a TTL on messages even if they are not destined for persistent storage. I will then implement an in-memory queue for messages that are received on a node but not claimed. Any message that is not claimed by an active subscriber will sit in this queue for a configurable amount of time (I'm thinking 5-10 seconds at most), giving the consumer time to recover from a temporary failure like the one above without losing messages. Though this wasn't very important for my use case, it felt wrong to lose messages during ANY kind of failure on what is supposed to be a fault-tolerant system.
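
To make this concrete, here is a minimal sketch in Python of what such a queue could look like. The class name, the size cap, and the hold/claim_all methods are assumptions for illustration, not sopmq's actual API; the point is simply a time- and space-limited holding area for messages that arrive while no subscriber is attached.

```python
import threading
import time
from collections import OrderedDict

class UnclaimedQueue:
    """Time- and space-limited holding area for messages that arrive on a
    node while no subscriber is attached (hypothetical names, not sopmq's API)."""

    def __init__(self, max_messages=10_000):
        self._messages = OrderedDict()   # message_id -> (expires_at, payload)
        self._lock = threading.Lock()
        self._max_messages = max_messages

    def hold(self, message_id, payload, ttl_seconds):
        """Park an unclaimed message for at most ttl_seconds."""
        with self._lock:
            self._evict_expired()
            if len(self._messages) >= self._max_messages:
                return False  # space-limited: refuse rather than grow without bound
            self._messages[message_id] = (time.monotonic() + ttl_seconds, payload)
            return True

    def claim_all(self):
        """Called when a subscriber (re)connects: drain whatever is still live."""
        with self._lock:
            self._evict_expired()
            payloads = [payload for _, payload in self._messages.values()]
            self._messages.clear()
            return payloads

    def _evict_expired(self):
        now = time.monotonic()
        expired = [mid for mid, (exp, _) in self._messages.items() if exp <= now]
        for mid in expired:
            del self._messages[mid]
```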

The other option I had entertained in the previous article was to commit messages to every live node that handles the range, and only return success if the producer was able to contact at least a quorum of them. This would handle the client-side link failure above, but I threw an additional monkey wrench into the mix.

I want to borrow the idea of the coordinator node from Apache Cassandra: any client connects to a single node and doesn't have to worry about ring topology to begin passing messages through the system. I want to keep the design of the client as simple as I can so that it is easy to port to multiple languages.
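
From the client's point of view, the whole cluster then collapses to one connection, roughly as in the sketch below. The framing and the publish method are invented for illustration; the only point is that the client never learns which nodes own which ranges.

```python
import socket

class SimpleClient:
    """The client's entire view of the cluster: one connection to one node.
    Framing and method names below are invented for illustration."""

    def __init__(self, host, port):
        # Connect to any single node; that node acts as the coordinator.
        self._sock = socket.create_connection((host, port))

    def publish(self, queue_name: str, payload: bytes):
        # Length-prefixed frame; routing to the right range is the
        # coordinator's problem, not the client's.
        name = queue_name.encode("utf-8")
        frame = (len(name).to_bytes(4, "big") + name
                 + len(payload).to_bytes(4, "big") + payload)
        self._sock.sendall(frame)
```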

[Figure: coordinator]

In this configuration, even if a producer were to write to all the nodes that handle a range on the ring, we could still lose messages if the coordinator node went down. Writing to a quorum of nodes and holding unpiped messages temporarily in a time- and space-limited queue allows the coordinator to die without losing messages, as long as recovery happens within the TTL specified by the producer of the message.
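
A rough sketch of that write path, with the same caveat that the names are hypothetical: the coordinator forwards the message to every live replica for the range, each replica either pipes it to a subscriber or parks it in its unclaimed queue, and the write only counts as accepted once a quorum has acknowledged.

```python
def quorum_write(replicas, message_id, payload, ttl_seconds):
    """replicas: the live nodes owning this range, each exposing a hypothetical
    deliver_or_hold() that returns True once the message is either piped to a
    subscriber or parked in that node's unclaimed queue under the TTL."""
    quorum = len(replicas) // 2 + 1
    acks = 0
    for replica in replicas:
        try:
            if replica.deliver_or_hold(message_id, payload, ttl_seconds):
                acks += 1
        except ConnectionError:
            continue  # a down replica simply doesn't count toward the quorum
    return acks >= quorum  # only now does the coordinator report success
```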

The disadvantages of this approach are additional latency and additional memory usage. Because I have to wait for the given TTL to expire before pronouncing a message claimed or unclaimed, when no client is available to read a message there will be a minimum of TTL seconds of latency before I can report back to the coordinator node that the message could not be piped. And because I have to hold messages in the queue for the full TTL, there is memory tied up for that data.

However, one piece of good news is that because sopmq is optionally persistent, if a message is flagged to be persisted and can't be piped, I can store it in the Cassandra backend and return immediately once a quorum of nodes responds that the message could not be piped. In the persistent case, the latency will be minimal, and for us that is the case that matters most, because it means we can quickly tell the user that their message was not delivered but was saved.
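
A sketch of that persistent fallback, again with invented names (the store object, its save() method, and the return values are assumptions): when a quorum reports that the message could not be piped and the producer asked for persistence, the message goes to the Cassandra backend and the producer gets an answer right away instead of waiting out the TTL.

```python
def resolve_unpiped(message, wants_persistence, quorum_not_piped, cassandra_store):
    """Decide the outcome of a message that a quorum could not pipe to any subscriber."""
    if quorum_not_piped and wants_persistence:
        cassandra_store.save(message)   # durable in the Cassandra backend
        return "SAVED"                  # answer immediately: not delivered, but kept
    if quorum_not_piped:
        return "NOT_PIPED"              # non-persistent: the producer only learns this after the TTL
    return "PIPED"
```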

We’ll see in time how these decisions impact the design and development of the project.
