This is the third post in a series of posts that explains Ark, a consensus algorithm we’ve developed for TokuMX and MongoDB to fix known issues in elections and failover. The tech report we released last week describes the algorithm in full detail. These posts are a layman’s explanation.
In the first post, I discussed the basics of replication, how it works, and its important components: syncing data (or basic replication), elections, replication rollback, and write acknowledgement. In the second post, I explained the current election algorithm and discussed some of its high-level problems. In this post, I assume the reader is familiar with the first two posts and discuss why data that has been successfully acknowledged with majority write concern may be lost in a failover.
The best way to demonstrate how data may be lost is with examples. For each of our examples, suppose things start out as follows. We have a five-node replica set: n1, n2, n3, n4, and n5, with the following setup (sketched in code just after the list):
- n1 is primary
- n2 and n3 are syncing off of n1
- n4 and n5 are syncing off of n3
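For readers who prefer to see the setup as data, here is a toy Python model of that topology. Everything in it is purely illustrative bookkeeping, not MongoDB or TokuMX code; the names mirror the list above.

```python
# Illustrative only (not server code): the starting topology shared by
# all of the examples, plus the majority size that both write concern
# and elections care about.

nodes = ["n1", "n2", "n3", "n4", "n5"]
primary = "n1"

sync_source = {   # which member each secondary pulls its oplog from
    "n2": "n1",
    "n3": "n1",
    "n4": "n3",
    "n5": "n3",
}

majority = len(nodes) // 2 + 1
print(majority)  # 3: majority write concern and elections both need 3 of 5
```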
Example 1
Suppose the following sequence of events happens:
- A network partition happens such that (n1, n2) gets disconnected from (n3, n4, n5).
- n3 notices that it cannot reach n1, and proceeds to elect itself.
- n4 and n5 vote for n3, and as a result, n3 becomes primary.
- n3 performs write A, which gets replicated and acknowledged by n4 and n5, thus having been acknowledged by a majority of the replica set.
- n1 reconnects to the replica set, but still as a primary. It has yet to step down.
Now we have two primaries, n1 and n3. One needs to step down. If n1 steps down, nothing bad happens, but if n3 steps down, a write that was acknowledged with majority write concern will be lost.
In MongoDB 2.4, this was quite possible: in dual-primary scenarios, which primary stepped down was essentially arbitrary. SERVER-9765 addressed this issue. Now, in 2.6, a primary uses the timestamp of the last election to determine which primary should step down. In the example above, because n3 was elected at a later time than n1, n3 ought to remain primary and n1 ought to step down. Because members may participate in an election at most once every thirty seconds, successful elections ought to be at least thirty seconds apart. That said, I don't understand how clock skew between different members of the replica set impacts this algorithm.
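As a rough sketch of that tie-break (this is not MongoDB's actual implementation, and the timestamps are invented), the decision reduces to comparing election times, which also shows why clock skew matters: the times come from different machines' clocks.

```python
# A minimal sketch of the post-SERVER-9765 tie-break, not MongoDB code.
# Election times are wall-clock timestamps taken on different machines,
# which is exactly where clock skew could creep in.

def who_steps_down(a, b):
    """Of two simultaneous primaries, the one elected earlier steps down."""
    return a if a["election_time"] < b["election_time"] else b

n1 = {"name": "n1", "election_time": 100.0}  # elected before the partition
n3 = {"name": "n3", "election_time": 160.0}  # elected during the partition

assert who_steps_down(n1, n3)["name"] == "n1"  # n3 stays primary; write A survives
```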
Example 2
This example expands a bit on the first one:
- A network partition happens such that (n1, n2) gets disconnected from (n3, n4, n5).
- n3 notices that it cannot reach n1, and proceeds to elect itself.
- n4 and n5 vote for n3, and as a result, n3 becomes primary.
- n3 performs write A, which gets replicated and acknowledged by n4 and n5, thus having been acknowledged by a majority of the replica set.
- n1 reconnects to the replica set, but still as a primary. It has yet to step down.
- n1 accepts a write B that gets replicated and acknowledged by n2.
- n4 stops replicating off of n3 and starts replicating off n1.
- n4 rolls back write A, because n1 doesn’t have it, and replicates and acknowledges write B.
Now we have a big problem. Both n1 and n3 are primary, and each has performed a different write that has been acknowledged by a majority of the replica set. One of them needs to step down and roll back its write, and whichever one does will have rolled back a write that was acknowledged with majority write concern.
This example demonstrates an underlying problem of MongoDB's replication: a timestamp is used as the oplog position (a.k.a. the GTID). The symptom is that two different primaries may produce oplog entries whose positions interleave. Because write B happens after write A in wall-clock time, the timestamp stored in B's oplog entry on n1 is greater than the one in A's oplog entry that has been replicated to n4. As a result, n4 thinks n1 is farther ahead and that it's OK to start replicating off n1. As soon as it makes this decision, n4 will replicate and acknowledge B, which allows both writes to be acknowledged by a majority.
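A toy sketch makes the interleaving concrete. Assume, purely for illustration, that an oplog position is nothing more than a wall-clock timestamp and that a member switches to whichever sync source has the higher position:

```python
# Illustrative only: oplog position modeled as a bare wall-clock timestamp.
# Writes A and B come from two different primaries on divergent branches,
# yet their positions are totally ordered by time.

write_A = {"primary": "n3", "ts": 105}  # majority-acknowledged via n4 and n5
write_B = {"primary": "n1", "ts": 110}  # issued later in wall-clock time

n4_position = write_A["ts"]  # n4 has replicated A
n1_position = write_B["ts"]  # n1's latest entry is B

# Comparing bare timestamps, n4 concludes that n1 is "ahead" of it...
if n1_position > n4_position:
    # ...so it switches sync sources, rolling back the
    # majority-acknowledged write A in order to take write B.
    print("n4 resyncs from n1: rolls back A, acknowledges B")
```

The comparison itself is the bug: nothing in a bare timestamp records which primary, or which election, produced the entry.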
In its current election algorithm (the one preceding Ark), TokuMX does not use timestamps to determine oplog position, so the problem described above does not affect it. However, TokuMX does have the problem of two different primaries producing oplog entries whose positions interleave. The existing thirty-second timer between elections makes the problem harder to demonstrate, but it does exist.
Nevertheless, solving these issues does not fix the election protocol. There is one more problem. Here is one final example.
Example 3
This example is similar to the preceding two, in that n1 is a primary and, after a partition, n3 becomes primary. Let's suppose that the resolution of dual primaries is robust and works; that is, we can deterministically predict that the primary that was elected earlier will step down. Let's also suppose that the positions of the oplog entries produced never interleave; that is, any entry produced by n3 (the later election) is guaranteed to come after any entry produced by n1 (the earlier election). We still have the following problem.
- All secondaries replicate from n1.
- A network partition happens such that (n1, n2) gets disconnected from (n3, n4, n5).
- n3 notices that it cannot reach n1, and proceeds to elect itself.
- n4 and n5 vote for n3, but n3 has yet to become primary.
- After n4 and n5 have voted, the network restores itself.
- n1 performs a write A that is acknowledged by n2, n4, and n5 (4 out of 5 members!). n3 has yet to become primary or to replicate and acknowledge the write.
- n3 finally gets around to becoming primary, without having a chance to replicate and acknowledge A.
- n1 notices that n3 was elected later and steps down.
- All other members start replicating off n3 and, as a result, roll back write A.
The problem here is that write acknowledgement and voting for primaries do not communicate enough. n4 and n5 vote for n3, and then acknowledge a write that may be rolled back, in part because of the very vote they just cast. This is the main underlying problem that neither TokuMX's nor MongoDB's current election protocol takes into account, and it is the problem Ark rectifies.
So, in summary, the underlying issues in the existing election protocol are as follows:
- When faced with dual primary situations, MongoDB and TokuMX must be able to resolve the situation predictably. This means:
- Two primaries never produce oplog entries whose positions interleave.
- When faced with two primaries, the one that produces oplog entries with smaller positions steps down.
- Threads that sync data and acknowledge writes need to communicate more with the threads that vote for primaries (see the sketch after this list). In short, the following ought to happen:
- A member should not vote for a primary whose election would roll back a write that the member has already acknowledged.
- A member should not acknowledge a write that may be rolled back due to an election it has already voted “yes” in.
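As a rough sketch of what those last two properties could look like in code (this is not Ark itself, whose actual mechanism is the subject of the next post; all names, fields, and positions here are invented for illustration), the key idea is that a member's voting path and its acknowledgement path consult the same state, so each can veto the other:

```python
# Illustrative sketch only -- not Ark. The point is the shared state:
# acked_positions is visible to the voting logic, and last_vote_position
# is visible to the acknowledgement logic. All names are invented.

class Member:
    def __init__(self, name):
        self.name = name
        self.acked_positions = []       # positions of writes this member has acknowledged
        self.last_vote_position = None  # position of the candidate it last voted for

    def may_vote_for(self, candidate_position):
        # Property 1: never vote for a candidate whose election would
        # roll back a write this member has already acknowledged.
        return all(pos <= candidate_position for pos in self.acked_positions)

    def vote_for(self, candidate_position):
        assert self.may_vote_for(candidate_position)
        self.last_vote_position = candidate_position

    def may_acknowledge(self, write_position):
        # Property 2: never acknowledge a write that a vote this member
        # has already cast could cause to be rolled back.
        if self.last_vote_position is None:
            return True
        return write_position >= self.last_vote_position

# Example 3 replayed with the guards: n4 votes for n3, then refuses to
# acknowledge write A from the old primary n1, so A is never
# majority-acknowledged in the first place.
n4 = Member("n4")
n4.vote_for(candidate_position=200)            # votes yes in n3's election
print(n4.may_acknowledge(write_position=150))  # False: A could be rolled back
```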
Making sure these properties hold is what Ark does. In my next post, I’ll start explaining Ark in detail.