In my last post, I described the differences between a TokuMX oplog entry and a MongoDB oplog entry. One reason the entries are so different is that TokuMX supports multi-statement and multi-document transactions. In this post, I want to elaborate on why multi-statement transactions require changes to the oplog, and explain how we changed replication to support arbitrarily large transactions.
Take the following MongoDB statement:
db.foo.insert( [ { _id : 1 , a : 10 }, { _id : 2 , a : 20 } ] );
Inside the MongoDB server, this statement is not transactional. The insertion of each document is transactional, but the statement itself is not. That means a valid outcome is that the first document gets inserted but the second does not, or that a query picks up the first document but not the second. With TokuMX, on the other hand, the entire statement is transactional, meaning that either the entire statement is applied or none of it is. There is no in-between. Therefore, no query that is supposed to capture both documents can pick up the first document and miss the second.
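TokuMX also exposes multi-statement transactions directly in the shell. As a rough sketch (command names shown here for illustration; check the TokuMX docs for the exact syntax), wrapping two separate inserts in an explicit transaction looks something like this:

db.runCommand({ beginTransaction: 1 });   // start a multi-statement transaction
db.foo.insert( { _id : 1 , a : 10 } );
db.foo.insert( { _id : 2 , a : 20 } );
db.runCommand({ commitTransaction: 1 });  // both inserts become visible atomically, or neither does

Either way, whether the transaction is implicit (one statement) or explicit (several statements), the atomicity guarantee is the same.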
This behavior has an impact on the oplog. With MongoDB, because the statement is not transactional, writing the two inserts to the oplog separately is ok. If a secondary replicates the first document but not the second, that’s still consistent with MongoDB’s normal behavior. Therefore, MongoDB’s oplog format of having one operation per oplog entry is valid. With TokuMX, we cannot afford this behavior. We need the contents of a transaction to be captured together in the oplog, so that the secondary knows exactly what operations must be applied within a transaction when replaying events. If a transaction is atomic on the primary, it has to be atomic on the secondaries too. Therefore, we came to the conclusion that to properly support multi-document transactions with replication, at minimum, we had to extend and possibly change the oplog format (although, to be fair, there were many reasons for this, as my last post outlined).
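To make this concrete, here is roughly what the two formats look like for the insert above (extra bookkeeping fields elided). MongoDB logs two independent entries, one per document; TokuMX logs a single entry whose “ops” array holds both operations:

// MongoDB: one oplog entry per inserted document
{ "ts" : Timestamp(1374508126, 1), "op" : "i", "ns" : "test.foo", "o" : { "_id" : 1, "a" : 10 } }
{ "ts" : Timestamp(1374508126, 2), "op" : "i", "ns" : "test.foo", "o" : { "_id" : 2, "a" : 20 } }

// TokuMX: the whole statement captured as one entry
{ "_id" : ..., "ops" : [ { "op" : "i", "ns" : "test.foo", "o" : { "_id" : 1, "a" : 10 } },
                         { "op" : "i", "ns" : "test.foo", "o" : { "_id" : 2, "a" : 20 } } ] }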
We saw two options for doing this:
- Write all operations for a transaction in a single oplog entry.
- Have an oplog entry per operation, but link them with a transaction id.
The first option seemed more efficient. Writing one larger entry to the oplog containing all the operations is faster than writing many small entries, one per operation. So, for moderately sized “normal” transactions, the decision was a no-brainer: the first option is better.
What made the decision interesting was designing a way to handle very large transactions. Although not necessarily recommended, users can create arbitrarily large transactions containing gigabytes of oplog data, and we need to handle this case correctly. The first option requires maintaining an in-memory list of the operations performed and then writing them to the oplog in one entry right before committing the transaction. With large transactions, this in-memory list may grow arbitrarily large, causing us to run out of memory. So, without modification, the first option is unusable. The second option, because it keeps no in-memory list, does not have this problem.
What still made the second option unattractive, from a behavioral standpoint, was the following: if we write a transaction’s data to the oplog before it has committed, while the transaction is still in progress, then secondaries tailing the oplog cannot proceed past that uncommitted data. So, suppose we have a replica set where all secondaries are up to date. If a large transaction takes one hour to complete and we log an operation at the beginning of that transaction, then secondaries will stall for the whole hour waiting for it to complete. By not writing to the oplog until we are ready to commit, all transactions that complete while the large transaction is live can still be replicated.
As a result, we really wanted to find a way to make the first option work. And that is how we came up with the oplog.refs collection, which I explained in my last post. When a transaction’s operations start taking up too much space (by default, 1MB), we start writing the operations to the oplog.refs collection in batches. If the transaction aborts, everything written there is rolled back with it and disappears. If the transaction commits, we write an entry into the oplog.rs collection that stores a reference to those oplog.refs entries. Secondaries that tail the oplog know that if an oplog entry has such a reference, the relevant entries in oplog.refs (and there may be many) must be copied as well and replayed.
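Concretely, a committed large transaction ends up looking something like this in the two collections (a sketch; the field values are made up, and the exact shape of the oplog.refs _id is an internal detail):

// local.oplog.rs: one entry that references the spilled operations instead of embedding an "ops" array
{ "_id" : ..., "ref" : ObjectId("51ee9b69c9ecdce9762a5386") }

// local.oplog.refs: the batches written while the transaction was running,
// all tagged with the same ObjectId so a secondary can find and replay them together
{ "_id" : { "oid" : ObjectId("51ee9b69c9ecdce9762a5386"), "seq" : 0 }, "ops" : [ ... ] }
{ "_id" : { "oid" : ObjectId("51ee9b69c9ecdce9762a5386"), "seq" : 1 }, "ops" : [ ... ] }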
This relieves the memory pressure of maintaining an in-memory list: we simply dump the list to disk, via a collection, periodically. Now, for large transactions, secondaries stall only while the transaction is committing, not from the moment it began. This is significantly better. (On a side note, with TokuMX, large write transactions may still take a non-trivial amount of time to commit, so be careful. Sometimes they are necessary, but try to avoid very large transactions in a replica set.)
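If it helps, the primary-side logic boils down to something like the following shell-flavored pseudocode (not actual TokuMX source, just the shape of the algorithm):

var SPILL_THRESHOLD = 1024 * 1024;   // default spill size (1MB)
var buffered = [];                   // in-memory list of this transaction's operations
var bufferedBytes = 0;
var refId = null;                    // ObjectId shared by all spilled batches
var seq = 0;

function logOp(op) {
    buffered.push(op);
    bufferedBytes += Object.bsonsize(op);
    if (bufferedBytes > SPILL_THRESHOLD) {
        if (refId === null) refId = new ObjectId();
        // spill the current batch to oplog.refs inside the same storage-level transaction,
        // so an abort makes these documents disappear too
        db.getSiblingDB("local").oplog.refs.insert({ _id : { oid : refId, seq : seq++ }, ops : buffered });
        buffered = [];
        bufferedBytes = 0;
    }
}

function commitTxn() {
    if (refId === null) {
        // small transaction: everything still fits in a single oplog entry
        db.getSiblingDB("local").oplog.rs.insert({ ops : buffered });
    } else {
        if (buffered.length > 0) {
            db.getSiblingDB("local").oplog.refs.insert({ _id : { oid : refId, seq : seq++ }, ops : buffered });
        }
        // large transaction: the oplog entry only carries a reference to the spilled batches
        db.getSiblingDB("local").oplog.rs.insert({ ref : refId });
    }
}

The key point is that the in-memory list never grows past the spill threshold, and nothing at all appears in oplog.rs until commit time.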
So, this is why in the TokuMX oplog, “normal” transactions have an “ops” array that contains operations instead of an individual “op” field. This is why large transactions have no “ops” field, and instead have a “ref” field that is an OID. And this is why the oplog.refs collection exists.
In conclusion, I hope some of these oplog changes we’ve made now make sense. In future posts, I will explain the reasoning behind other changes.