Starting with 2.2.0, mongod starts several threads to apply the oplog: one BackgroundSync thread fetches oplog entries from the master, one thread organizes the entries so that they can be applied in parallel, and a pool of worker threads applies them. There are 8 worker threads in 2.2.0.
BackgroundSync thread
It uses OplogReader to fetch oplog entries from the master and stores them locally in a BlockingQueue. The BlockingQueue is fixed-size, but its size is std::numeric_limits<std::size_t>::max() (which seems too large?). mongod can push an oplog entry into this queue only when there is free space available in it. mongod records how many oplog entries are in the queue, how big they are, and how long it takes to put entries into the queue; we can see this information with the command db.serverStatus().replNetworkQueue.
Note that the “push time” does not include the network transfer time: the timer starts when bgsync has an oplog entry in hand and stops when the entry has been pushed into the queue.
@bgsync.cpp
Timer timer;
// the blocking queue will wait (forever) until there's room for us to push
OCCASIONALLY {
    LOG(2) << "bgsync buffer has " << _buffer.size() << " bytes" << rsLog;
}
_buffer.push(o);
{
    boost::unique_lock<boost::mutex> lock(_mutex);
    // update counters
    _queueCounter.waitTime += timer.millis();
    _queueCounter.numElems++;
    ...
}
Because this interval is so short, the recorded time is sometimes 0.
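To make the push-and-count behaviour concrete, here is a minimal sketch of a byte-bounded blocking buffer that tracks the same kind of counters. It is not MongoDB's BlockingQueue: the names BoundedOplogBuffer and QueueCounters are hypothetical, entries are plain strings, and it uses the C++ standard library instead of boost.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>

// Hypothetical counters, loosely modelled on the queue statistics described above.
struct QueueCounters {
    long long waitTimeMs = 0;  // total time spent inside push()
    std::size_t numElems = 0;  // entries currently buffered
    std::size_t numBytes = 0;  // bytes currently buffered
};

class BoundedOplogBuffer {
public:
    explicit BoundedOplogBuffer(std::size_t maxBytes) : _maxBytes(maxBytes) {}

    // Blocks until the entry fits (or the buffer is empty), then enqueues it.
    // The timer covers only the push itself, not any network transfer.
    void push(const std::string& entry) {
        const auto start = std::chrono::steady_clock::now();
        std::unique_lock<std::mutex> lock(_mutex);
        _notFull.wait(lock, [&] {
            return _queue.empty() || _counters.numBytes + entry.size() <= _maxBytes;
        });
        _queue.push_back(entry);
        _counters.numElems++;
        _counters.numBytes += entry.size();
        _counters.waitTimeMs += std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start).count();
        _notEmpty.notify_one();
    }

    // Blocks until an entry is available, then removes and returns it.
    std::string pop() {
        std::unique_lock<std::mutex> lock(_mutex);
        _notEmpty.wait(lock, [&] { return !_queue.empty(); });
        std::string entry = _queue.front();
        _queue.pop_front();
        assert(_counters.numBytes >= entry.size());  // counter invariant
        _counters.numElems--;
        _counters.numBytes -= entry.size();
        _notFull.notify_all();
        return entry;
    }

    // Returns a snapshot of the counters.
    QueueCounters counters() const {
        std::lock_guard<std::mutex> lock(_mutex);
        return _counters;
    }

private:
    const std::size_t _maxBytes;
    mutable std::mutex _mutex;
    std::condition_variable _notFull;
    std::condition_variable _notEmpty;
    std::deque<std::string> _queue;
    QueueCounters _counters;
};
```

When the buffer has room, push() returns almost immediately, which is why a millisecond-resolution wait time often stays at 0 in this sketch as well.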
SyncTail thread
It takes oplog entries from the blocking queue, prefetches them in parallel, and hashes them into several vectors so that they can be applied in parallel. The hash function looks like this:
void SyncTail::fillWriterVectors(const std::deque<BSONObj>& ops,
                                 std::vector< std::vector<BSONObj> >* writerVectors) {
    for (std::deque<BSONObj>::const_iterator it = ops.begin();
         it != ops.end();
         ++it) {
        const BSONElement e = it->getField("ns");
        verify(e.type() == String);
        const char* ns = e.valuestr();
        int len = e.valuestrsize();
        uint32_t hash = 0;
        MurmurHash3_x86_32( ns, len, 0, &hash);
        (*writerVectors)[hash % writerVectors->size()].push_back(*it);
    }
}
This thread submits the jobs to the worker threads and waits for them to complete.
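The effect of hashing on the namespace is that every operation on a given collection goes to the same writer vector and keeps its original order there. The following self-contained sketch illustrates that property. It is an illustration only: the Op struct and partitionByNamespace are hypothetical names, and std::hash is swapped in for MurmurHash3.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for an oplog entry: a namespace plus a payload.
struct Op {
    std::string ns;
    std::string payload;
};

// Distributes ops over writerCount buckets by hashing the namespace, so that
// all ops for one namespace land in the same bucket in their original order.
std::vector<std::vector<Op>> partitionByNamespace(const std::vector<Op>& ops,
                                                  std::size_t writerCount) {
    assert(writerCount > 0);
    std::vector<std::vector<Op>> buckets(writerCount);
    std::hash<std::string> hasher;  // assumption: std::hash instead of MurmurHash3
    for (const Op& op : ops) {
        buckets[hasher(op.ns) % writerCount].push_back(op);
    }
    return buckets;
}
```

A consequence worth noting: if all writes go to a single collection, every entry hashes to the same bucket and only one worker thread has work to do.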
About prefetch
Prefetch covers both indexes and records.
As I understand it, mongo introduced “prefetch” to support “parallel apply”. For example, suppose 3 worker threads apply oplog entries in parallel to 3 collections that are all in the same database, so they compete for the same write lock. Assume T1 obtains the write lock and then finds that the page it needs is not in physical memory: a page fault occurs and the data has to be fetched into memory. During this time the other two threads, T2 and T3, are waiting for T1's write lock, although they would not have to wait if the fault were handled outside the lock. So “prefetch” means that all threads prefetch in parallel before taking the write lock. After the parallel prefetch, T1 takes the write lock (no page fault this time), modifies the collection and releases the lock; then T2 and T3 do the same. This feature reduces the time spent waiting for the write lock, but it does not reduce the likelihood of page faults; it only moves them to a different point in time (before the write lock is taken instead of after).
Also, in some situations we do not change any indexed columns, so we do not want to prefetch all the indexes, because that only adds unnecessary I/O requests. For this, mongo provides an option, “replIndexPrefetch”, which we can set to “_id_only”.
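The lock-waiting argument above can be put into a toy timing model. This is only a sketch under strong assumptions: every thread needs the same time to page its data in (faultMs) and to apply its change (applyMs), the parallel page faults overlap perfectly, and all threads share one write lock. The function names are hypothetical.

```cpp
#include <cassert>

// Without prefetch the page fault happens while the write lock is held, so
// each thread occupies the lock for faultMs + applyMs and the threads run
// strictly one after another.
long long totalWithoutPrefetch(int threads, long long faultMs, long long applyMs) {
    assert(threads >= 0);
    return threads * (faultMs + applyMs);
}

// With prefetch all threads page their data in concurrently before locking
// (assumed to overlap perfectly); only the apply steps are serialized.
long long totalWithPrefetch(int threads, long long faultMs, long long applyMs) {
    assert(threads >= 0);
    if (threads == 0) return 0;
    return faultMs + threads * applyMs;
}
```

With 3 threads, a 10 ms page fault and a 1 ms apply step, the model gives 33 ms without prefetch and 13 ms with it. The number of page faults is 3 in both cases; only the time spent holding the lock changes.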
Worker thread pool
There are 8 threads in the thread pool; they are the ones that finally apply the oplog entries.
How to improve apply performance
1. Check whether the network is the bottleneck.
We can run db.serverStatus().replNetworkQueue to see whether some oplog entries have been fetched but not yet applied. If there are, we can assume that the network is not the root cause.
2. If the network is not the root cause, see whether we can use more collections and put them in different databases.
Because mongod has used database-level locks since 2.2.0, spreading the collections over different databases lets the entries really be applied in parallel.
If possible, we can split one collection into several collections such as “t_0”, “t_1”, “t_2” and so on.
3. Disable unnecessary prefetch.
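For step 2, the application has to decide which of the split collections a document belongs to. One possible way, sketched here with hypothetical names, is to hash a document key and derive both the database and the collection name from the result, so that each part lives in its own database and therefore has its own lock. The “db_<i>.t_<i>” naming and the use of std::hash are assumptions for illustration, not something mongo prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Maps a document key to one of `parts` namespaces of the form "db_<i>.t_<i>".
// The same key always maps to the same namespace within one program run.
std::string namespaceForKey(const std::string& key, std::size_t parts) {
    assert(parts > 0);
    const std::size_t i = std::hash<std::string>()(key) % parts;
    const std::string suffix = std::to_string(i);
    return "db_" + suffix + ".t_" + suffix;
}
```

Because std::hash is not guaranteed to produce the same values across implementations, a real deployment would need a hash function with a fixed, documented result so that every client agrees on the mapping.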