Buffering vs streaming benchmark: first results
My poor laptop's had a busy weekend. It's run 72 tests, rebooting between each test. Most of these tests have kept both the CPU and disk busy for a lot of the time. I expect to update this blog post with more numbers - and possibly more strategies - as time goes on, but I wanted to get the numbers out as quickly as possible. Graphs should be coming when I've got a decent network to experiment with Google graphing.
Source code and setup
While I don't really expect anyone else to sacrifice their computer for the best part of 24 hours, it doesn't take very much effort to get the tests running. This zip file contains the source code for three programs, a batch file to generate the test data, and a command file.
The CreateFiles executable creates input files based on its command line arguments. (This is run multiple times by the batch file.) EncryptFiles "encrypts" (I use the term loosely) all the files in a given directory, writing the results to another directory. The class design of EncryptFiles is a little bit funny, but it's really tailored for these tests. It should be easy to write other strategies in the same way too.
ExecuteAndReboot is used to automate the testing. Basically it reads a command file (from a hard-coded directory; simplicity trumps flexibility here). If the file isn't empty, it waits five minutes (so that after a reboot the computer has time to settle down after loading all its services) and runs the command specified in the first line of the file, appending the output to a results file. When the command has finished executing, ExecuteAndReboot rewrites the command file, removing the first line, and schedules a reboot. The idea is to drop ExecuteAndReboot into a Startup folder, set Windows to log in automatically, reboot once and then just let it go. If you want to use the computer, just wait until it's in the "waiting five minutes" bit at the start of the cycle and kill ExecuteAndReboot. Next time you reboot it will start testing again.
That's the general principle - if anyone does want to run the tests on their mail and needs a bit of help setting it up, just let me know.
Test environment and workload
I expect the details of the execution environment to affect the outcome significantly. For the sake of the record then, my laptop specs are:
Dell Inspiron 1720 (unlikely to be relevant, but...)
Intel Core 2 Duo T7250 2.0GHz
Vista Home Premium 32 bit
3 GB main memory, 32K L1 cache, 2048K L2 cache
2 disks, both 120GB (Fujitsu MHY2120BH, 5400RPM, 8MB cache)
Even though there are two drives in the system, the tests only ever read from and wrote to a single disk.
I created four sets of input data; I've presented the results for each one separately below, and included the description of the files along with the results. Basically in each case there's a set of files where each file is the same size, consisting of a fixed number of fixed-length lines.
The "encryption" that is performed in each case (on the encoded version of the line) is just to XOR each byte with a value. This is repeated a specified number of times, and the XOR value on each pass is the current index, i.e. it first XORs with 0, then 1, then 2 etc. I've used work levels of 0 (the loop never runs), 100 and 1000. These were picked somewhat arbitrarily, but they give interesting results. At 0 the process is clearly IO-bound. At 1000 it's clearly CPU-bound. At 100 it's not pegging the CPU, but there's significant activity. I may try 200 as well at some point.
For each scenario I tried using 1, 2 and 3 threads (and 4 for the big tests; I'll go back and repeat earlier ones with 4 threads soon). The two strategies used are "streaming": read a line, encrypt, write the line, repeat and "buffering": read a whole file, encrypt it all, write it all out. After a thread has started, it needs no shared data at all - it's given the complete list of files to encrypt at the very start.
Results
For every table, the number of threads is shown by the leftmost column, and the work level is shown by the topmost row. The cells are of the form "x/y" where "x" is the time taken for the streaming strategy, and "y" is the time taken for the buffering strategy. All times are in seconds. For each column, I've highlighted the optimal test.
Small tests
There are 10000 test files. Each is 100 lines long, with 100 characters per line.
| Threads/Work |
0 |
100 |
1000 |
| 1 |
47/90 |
94/79 |
163/158 |
| 2 |
95/112 |
113/157 |
129/146 |
| 3 |
179/128 |
130/144 |
152/160 |
| 4 |
144/139 |
163/193 |
156/204 |
Medium tests (short lines)
There are 100 test files. Each is 500,000 lines long, with 20 characters per line.
| Threads/Work |
0 |
100 |
1000 |
| 1 |
138/157 |
219/267 |
1125/1205 |
| 2 |
196/238 |
259/263 |
604/691 |
| 3 |
292/250 |
312/236 |
624/676 |
| 4 |
302/291 |
321/281 |
602/666 |
Medium tests (long lines)
There are 100 test files. Each is 100 lines long, with 100,000 characters per line.
| Threads/Work |
0 |
100 |
1000 |
| 1 |
203/213 |
249/286 |
1027/1107 |
| 2 |
283/264 |
311/296 |
602/696 |
| 3 |
319/284 |
314/278 |
608/618 |
| 4 |
360/300 |
336/297 |
607/598 |
Big tests
There are 20 files. Each file is 100,000 lines long, with 1000 characters per line.
| Threads/Work |
0 |
100 |
1000 |
| 1 |
190/246 |
349/441 |
2047/2165 |
| 2 |
365/375 |
392/473 |
1097/1387 |
| 3 |
456/379 |
484/399 |
1161/1296 |
| 4 |
517/442 |
521/431 |
1113/1214 |
Conclusions
I'm loathe to draw too many conclusions so far. After all, we've only got data for a single test environment. It would be interesting to see what would happen on a quad core machine. However, I think a few things can be said with reasonable confidence:
- Sometimes the streaming strategy wins, sometimes the buffering strategy wins. The difference can be quite significant either way. For a given input and workload, the streaming solution almost always wins on my machine.
- In ideal scenario, the bottlenecked resource should be busy for as much of the time as possible. For example, in a high-CPU task it makes little sense to have idle cores when you've already loaded data. Likewise in a disk-bound task we should always be fetching (or writing) data even if all our cores are momentarily busy. Personally I think this is the operating system's pre-fetch systems's job. We could do it ourselves, but it's nicer if we don't have to.
- If we're disk-bound, it makes sense to read (or write) one file at a time, to avoid seeking. This is obvious from the results - for work levels of 0 and 100, a single-thread solution is consistently best (and streaming usually works better buffering). Based on this assumption, I suspect a cleaner solution would be to have a single thread reading largish chunks (but not the whole file) and feeding the other threads - or to use async IO. However, this all gets very complicated. One nice aspect of both of the current solutions is that they're really simple to implement safely. Reading line-by-line in an async manner is a pain. I'd quite like to write an async "execute for every line" helper at some time, but not just yet.
- If we're CPU bound, the buffering solution's sweet spot is 3 threads (remember we're using 2 cores) and the streaming solution works better with just 2 threads - introducing more threads will make both the IO and context switching less efficient.
- The streaming strategy always uses less application memory than the buffering strategy. When running the buffering strategy against the "big" data set with 4 threads, the memory varied between 800MB and 1.8GB (as measured by the "Total bytes in all heaps" performance counter), vs around 165KB with the streaming strategy. (Obviously the process itself took more than 165KB, but that was the memory used in "normal" managed heap memory.) Using more memory to improve performance is a valid technique, but I'd say it's only worth it when the improvement is very significant. Additionally, the streaming technique instantly scales to huge data files. While they're not in the current requirements, it doesn't hurt to get that ability for free.
Next steps...
I've included quite a few "I'm going to try to [...]" points in this post. Just for reference, my plans are:
- Turn off my virus checker! If this makes a significant difference, I'll rerun all the current tests
- Try to visualise the results with Google graphs - although I think it may get cluttered
- Rerun big/buffering/1000 test - odd results
- Rerun tests with a work level of 200 and keep an eye on CPU usage - I'd like to spot where we tip between CPU and IO being the bottleneck
- Try to optimize the Stream/StreamReader being used - we should be able to skip the FileStream buffer completely, and specifying FileOptions.SequentialScan should give more hints to Windows about how we're intending to read the data. I'll try a few different buffer sizes, probably 4K, 8K and 64K, just to see what happens.
This is a really interesting little example problem, because it sounds so simple (and it is simple in terms of the implementation) but it has a lot of environment variables. Of course it would be nicer if we didn't have to reboot the machine between test runs in order to get a fair result, but never mind...