Benchmarking IO: buffering vs streaming
I mentioned in my recent book review that I was concerned about a recommendation to load all of the data from an input file before processing any of it. This seems to me to be a bad idea in an age where Windows will read ahead to anticipate what data you need next, allowing you to process efficiently in a streaming fashion.
However, without any benchmarks I'm just guessing. I'd like to set up a benchmark to test this - it's an interesting problem which I suspect has lots of nuances. This isn't about trying to prove the book wrong - it's about investigating a problem which sounds relatively simple, but could well not be. I wouldn't be at all surprised to see that in some cases the streaming solution is faster, and in other cases the buffered solution is faster.
The Task
The situation presented is like this:
- We have a bunch of input files, either locally or on the network (I'm probably just going to test locally for now)
- Each file is less than 100MB
- We need to encrypt each line of text in each input file, writing it to a corresponding output file
The method suggested in the book is for each thread to:
- Load a file into a List<string>
- Encrypt every line (replacing it in the list)
- Save to a new file
My alternative option is:
- Open a TextReader and a TextWriter for the input/output
- Repeatedly read a line, encrypt, write the encrypted line until we've exhausted the input file
- Close both the reader and the writer
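To make the two shapes concrete, here's a rough sketch of both strategies. It's illustrative only, not the code that will be benchmarked: the method names and the `encryptLine` delegate are placeholders I've made up, and it's written with top-level statements and local functions, which assumes a recent C# compiler.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Buffered: load every line, replace each one in the list, then save the lot.
static void ProcessBuffered(string inputPath, string outputPath, Func<string, string> encryptLine)
{
    List<string> lines = new List<string>(File.ReadAllLines(inputPath));
    for (int i = 0; i < lines.Count; i++)
    {
        lines[i] = encryptLine(lines[i]);
    }
    File.WriteAllLines(outputPath, lines);
}

// Streaming: read a line, encrypt it, write it, until the input is exhausted.
static void ProcessStreaming(string inputPath, string outputPath, Func<string, string> encryptLine)
{
    using (TextReader reader = File.OpenText(inputPath))
    using (TextWriter writer = File.CreateText(outputPath))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            writer.WriteLine(encryptLine(line));
        }
    } // Disposing closes both the reader and the writer.
}
```

Both methods take the same arguments, so the benchmark can pick one or the other without changing anything else.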
These are the two implementations I want to test. I strongly suspect that the optimal solution would involve async IO, but doing an async version of ReadLine is a real pain for various reasons. I'm going to keep it simple - using plain threading, no TPL etc.
I haven't written any code yet. This is where you come in - not to write the code for me, but to make sure I test in a useful way.
Environmental variations
My plan of attack is to first write a small program to generate the input files. These will just be random text files, and the program will have a few command line parameters:
- Directory to put files under (one per test variation, basically)
- Number of files to create
- Number of lines per file
- Number of characters per line
I'll probably test a high and a low number for each of the last three parameters, possibly omitting a few variations for practical reasons.
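The generator could be as small as the sketch below. The four parameters from the list above map directly onto the method's arguments; the `seed` parameter and the `input-N.txt` naming are assumptions of mine (a fixed seed would make the files reproducible between runs), as is the choice of printable ASCII for the "random text".

```csharp
using System;
using System.IO;
using System.Text;

// Writes fileCount files of random printable ASCII text into the directory.
// The seed parameter is hypothetical; it just makes the output repeatable.
static void GenerateFiles(string directory, int fileCount, int linesPerFile, int charsPerLine, int seed)
{
    Directory.CreateDirectory(directory);
    Random random = new Random(seed);
    char[] buffer = new char[charsPerLine];
    for (int file = 0; file < fileCount; file++)
    {
        string path = Path.Combine(directory, "input-" + file + ".txt");
        using (TextWriter writer = new StreamWriter(path, false, Encoding.ASCII))
        {
            for (int line = 0; line < linesPerFile; line++)
            {
                for (int i = 0; i < charsPerLine; i++)
                {
                    // Printable ASCII from ' ' (32) to '~' (126), so no stray line breaks.
                    buffer[i] = (char) random.Next(32, 127);
                }
                writer.WriteLine(buffer);
            }
        }
    }
}
```

Parsing the command line would then just be a matter of passing the directory and `int.Parse` of the three counts to this method.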
In an ideal world I'd test on several different computers, locally and networked, but that just isn't practical. In particular I'd be interested to see how much difference an SSD (low seek time) makes to this test. I'll be using my normal laptop, which is a dual core Core Duo with two normal laptop disks. I may well try using different drives for reading and writing to see how much difference that makes.
Benchmarking
The benchmark program will also have a few command line parameters:
- Directory to read files from
- Directory to write files to
- Number of threads to use (in some cases I suspect that more threads than cores will be useful, to avoid cores idling while data is read for a blocking thread)
- Strategy to use (buffered or streaming)
- Encryption work level
The first four parameters here are pretty self-explanatory, but the encryption work level isn't. Basically I want to be able to vary the difficulty of the task, which will vary whether it ends up being CPU-bound or IO-bound (I expect). So, for a particular line I will:
- Convert to binary (using Encoding.ASCII - I'll generate just ASCII files)
- Encrypt the binary data
- Encrypt the encrypted binary data
- Encrypt the encrypted encrypted [...] etc until we've hit the number given by the encryption work level
- Base64 encode the result - this will be the output line
So with an encryption work level of 1 I'll just encrypt once. With a work level of 2 I'll encrypt twice, etc. This is purely for the sake of giving the computer something to do. I'll use AES unless anyone has a better suggestion. (Another option would be to just use an XOR or something else incredibly simple.) The key/IV will be fixed for all tests, just in case that has a bearing on anything.
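Here's roughly what I have in mind for a single line, as a sketch rather than the final code. `EncryptLine` is a name I've made up, and it assumes AES with the framework's default settings from `Aes.Create()` and a caller-supplied fixed key and IV. Creating the `Aes` object per line is wasteful; a real run would probably create it once per thread.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Encrypts the line workLevel times in a row, then Base64 encodes the result.
static string EncryptLine(string line, int workLevel, byte[] key, byte[] iv)
{
    // The input files are plain ASCII, so this conversion loses nothing.
    byte[] data = Encoding.ASCII.GetBytes(line);
    using (Aes aes = Aes.Create())
    {
        aes.Key = key;
        aes.IV = iv;
        for (int i = 0; i < workLevel; i++)
        {
            // Each pass encrypts the output of the previous pass.
            using (ICryptoTransform encryptor = aes.CreateEncryptor())
            {
                data = encryptor.TransformFinalBlock(data, 0, data.Length);
            }
        }
    }
    return Convert.ToBase64String(data);
}
```

One thing worth bearing in mind: with a padded block cipher each pass makes the data a little longer, so a high work level increases the output size (and therefore the write cost) as well as the CPU cost.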
The benchmarking program is going to be as simple as I can possibly make it:
- Start a stopwatch
- Read the names of all the files in the directory
- Create a list of files for each thread to encrypt
- Create and start the threads
- Use Thread.Join on all the threads
- Stop the stopwatch and report the time taken
No rendezvous required at all, which certainly simplifies things. By creating each thread's work list before starting that thread, I don't need to worry about memory model issues. It should all just be fine.
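The steps above might come out something like this sketch. `RunBenchmark` and the `processFile` delegate are hypothetical names; handing files out round-robin and creating the output directory up front are my assumptions, not decisions I've settled on.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Threading;

// Times how long threadCount threads take to process every file in inputDirectory.
static TimeSpan RunBenchmark(string inputDirectory, string outputDirectory,
                             int threadCount, Action<string, string> processFile)
{
    Stopwatch stopwatch = Stopwatch.StartNew();
    string[] files = Directory.GetFiles(inputDirectory);
    Directory.CreateDirectory(outputDirectory);

    // Build every work list before any thread starts (round-robin assignment).
    List<string>[] workLists = new List<string>[threadCount];
    for (int i = 0; i < threadCount; i++)
    {
        workLists[i] = new List<string>();
    }
    for (int i = 0; i < files.Length; i++)
    {
        workLists[i % threadCount].Add(files[i]);
    }

    Thread[] threads = new Thread[threadCount];
    for (int i = 0; i < threadCount; i++)
    {
        List<string> work = workLists[i]; // Fresh variable per iteration for the lambda.
        threads[i] = new Thread(() =>
        {
            foreach (string input in work)
            {
                string output = Path.Combine(outputDirectory, Path.GetFileName(input));
                processFile(input, output);
            }
        });
        threads[i].Start();
    }

    foreach (Thread thread in threads)
    {
        thread.Join();
    }
    stopwatch.Stop();
    return stopwatch.Elapsed;
}
```

The `processFile` argument would be whichever of the buffered or streaming implementations the strategy parameter selects, with the encryption work level already bound in.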
In the absence of a better way of emptying all the file read caches (at the Windows and disk levels) I plan to reboot my computer between test runs (which makes it pretty expensive in terms of time spent - hence omitting some variations). I wasn't planning on shutting services etc down: I really hope that Vista won't do anything silly like trying to index the disk while I've got a heavy load going. Obviously I won't run any other applications at the same time.
If anyone has any suggested changes, I'd be very glad to hear them. Have I missed anything? Should I run a test where the file sizes vary? Is there a better way of flushing all caches than rebooting?
I don't know exactly when I'm going to find time to do all of this, but I'll get there eventually :)