Just a few questions re sequential I/O and the OpLog versus the Extent Store. If the I/O is deemed sequential in nature, will this always bypass the OpLog, or only when the write operation is larger than 1MB? Does bypassing the OpLog mean that the write will be a lot slower in comparison? It still hits the SSD, so my assumption is that it's going to be about the same. Why does the process of coalescing the writes before sequentially draining them help with performance? I'm interested in why this step is necessary as opposed to just writing directly to the SSD and then replicating out.
That’s an interesting question.
Write IO is deemed sequential when there is more than 1.5MB of outstanding write IO to a vDisk (as of 4.6). IOs meeting this criterion bypass the OpLog and go directly to the Extent Store, since they are already large chunks of aligned data and won't benefit from coalescing.
Nutanix Bible: I/O Path and Cache
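In pseudocode, that routing decision boils down to something like the sketch below. This is purely illustrative: the names and structure are made up and are not the actual implementation; only the 1.5MB threshold comes from the documentation quoted above.

```python
# Hypothetical sketch of the write-routing decision described above.
# Not Nutanix's actual code; the 1.5MB figure is the documented
# threshold as of 4.6, everything else is invented for illustration.

SEQUENTIAL_THRESHOLD_BYTES = 1_536 * 1024  # 1.5MB of outstanding write IO

def route_write(outstanding_write_bytes: int) -> str:
    """Decide whether a vDisk write bypasses the OpLog."""
    if outstanding_write_bytes > SEQUENTIAL_THRESHOLD_BYTES:
        # Large, already-aligned chunks gain nothing from coalescing,
        # so they go straight to the Extent Store.
        return "extent_store"
    # Everything else lands in the OpLog to be coalesced first.
    return "oplog"
```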
You are correct in saying that, since the data is written to SSD either way, there is not much of a performance difference at write time. The data is still considered hot, so it will not be placed on HDD straight away. However, if you think ahead to the read that comes next from the Extent Store, it will be more efficient since the data was aligned beforehand. Let me know if that makes sense.
Thanks for helping me to understand this better. What you've said makes sense and matches what I was thinking. I'm still a little unclear about this part, though: "However, if you think ahead to the read that comes next from the Extent Store, it will be more efficient since the data was aligned beforehand." I'm still unsure what the benefit of coalescing and then sequentially draining the incoming writes actually is.
I am more than happy to attempt to explain this. Feel free to ask more questions.
You read the data after you write it. Just like me writing this right now. Imagine that you’re working with the alphabet. You can write a b c d e … z or you can write a y h i … c b l … k.
In both instances, the task is the same – to read the letters in alphabetical order. Which scenario is going to take you longer? (Please disregard the fact that you can reproduce the order from memory without reading it)
It gets a little more complicated in real life, where there are multiple layers of data organisation, but in a nutshell this is why sequential I/O is better than random.
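If you want to see the alphabet analogy in actual numbers, here is a toy experiment you can run. It is entirely illustrative: the file name and sizes are arbitrary, and on a modern OS the page cache will soften the gap, but the shape of the result is the point.

```python
# Toy demonstration: reading the same data sequentially versus in a
# random order. On spinning disks the gap is dramatic; even on SSDs
# many small scattered reads cost more than one ordered pass.

import os
import random
import time

BLOCK = 4096
BLOCKS = 25_000  # ~100MB of test data

with open("demo.bin", "wb") as f:
    f.write(os.urandom(BLOCK * BLOCKS))

def read_blocks(order):
    with open("demo.bin", "rb") as f:
        for i in order:
            f.seek(i * BLOCK)
            f.read(BLOCK)

sequential = list(range(BLOCKS))
shuffled = sequential[:]
random.shuffle(shuffled)

for name, order in [("sequential", sequential), ("random", shuffled)]:
    start = time.perf_counter()
    read_blocks(order)
    print(f"{name}: {time.perf_counter() - start:.2f}s")
```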
The write is received and evaluated. If it's sequential, it is written to the Extent Store. If it's random, it hangs out in the OpLog until either it becomes part of a sequence (and is drained) or it is overwritten.
Draining the OpLog sequentially means writing pieces of data to the Extent Store not in the order they arrived in the OpLog but in sorted order. Instead of writing a y h i … c b l … k, the Extent Store receives a b c d e … z. That way, when a read request comes for a letter, a number of them, or a whole sequence, it is easy to locate them on the Extent Store. Think of it as looking for a file or a folder on your computer: you sort by date or alphabetical order, but you sort to find what you're looking for.
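As a rough sketch (not the real OpLog data structure; everything here is simplified for illustration), you can picture the drain step like this: buffered writes are keyed by vDisk offset and flushed in sorted order, so neighbouring data ends up adjacent on the Extent Store. It also shows the overwrite case from above: a newer write to the same offset simply replaces the buffered one.

```python
# Hedged sketch of "draining the OpLog in order". Purely illustrative.

def drain_oplog(oplog: dict[int, bytes], extent_store: bytearray) -> None:
    """Flush buffered writes sorted by offset, then clear the oplog."""
    for offset in sorted(oplog):             # a b c ..., not arrival order
        data = oplog[offset]
        extent_store[offset:offset + len(data)] = data
    oplog.clear()

# Writes arrive out of order (random IO); a repeat write to the same
# offset would just overwrite the buffered entry in the dict.
oplog = {8192: b"c" * 4096, 0: b"a" * 4096, 4096: b"b" * 4096}
extent_store = bytearray(12_288)
drain_oplog(oplog, extent_store)  # ...but the data lands on disk as a, b, c
```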
The data that has been touched recently is likely to be touched again soon. That's why buffers are everywhere: RAM, the recent-files list in any text editor, the recent files in any file browser you use, and even NICs have a sort of cache to handle bursts of I/O.
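That recency principle can be shown in miniature with a toy LRU cache. This is purely illustrative; the real caches mentioned above are far more sophisticated.

```python
# A tiny LRU cache: recently touched items stay, the stalest is evicted.

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)   # touching it makes it "hot" again
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the coldest entry
```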
Thanks again for taking the time to explain. Ah, I see, so essentially the reason the data is sequentially drained is to make reading the data later on a lot faster, because as you said reading sequentially is a lot faster than reading randomly? Is the data replicated to other nodes before it's drained? Also, just to confirm, any sequential write operation under 1.5MB in size would still be written to the OpLog?
It is both read and write operations that benefit from data being written sequentially. When you need to edit a file, data block, etc., you need to find it first. Each search operation takes time. Not long by human standards, but it does. With randomized data this time accumulates, as the system needs to find and change the data in each of its locations on the disk. The more sensitive the application is to latency, the more noticeable the difference. When the data is written in sequence, retrieval takes less time. This is true for any system. It is why defragmentation (not the same thing, but it works for a similar reason) improves performance, for example.
The difference is especially noticeable when the data is later considered to be cold and migrated to the HDD.
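A quick back-of-the-envelope model shows why the seek time adds up so fast once data lives on HDD. The latency figures below are generic textbook numbers, not Nutanix measurements:

```python
# Rough illustration of how per-seek latency accumulates on HDD.

SEEK_MS = 8.0          # typical HDD seek plus rotational latency
TRANSFER_MS = 0.05     # time to transfer one 4K block once positioned
BLOCKS = 10_000

random_io = BLOCKS * (SEEK_MS + TRANSFER_MS)      # a seek before every block
sequential_io = SEEK_MS + BLOCKS * TRANSFER_MS    # one seek, then stream

print(f"random:     {random_io / 1000:.1f}s")     # ~80s
print(f"sequential: {sequential_io / 1000:.2f}s") # ~0.5s
```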
All other IOs, including those which can be large (e.g. >64K) will still be handled by the OpLog.
Nutanix Bible: I/O Path and Cache
Great, that makes complete sense and is well explained. Thanks for taking the time to answer my queries. I could go on asking more, but you've certainly answered my initial questions and then some. Cheers
Happy to help :) You can keep on asking. I can't promise I will have answers, but I'll do my best.