  • A blog about Microsoft Windows development, focused on kernel-mode driver development, the Windows DDK, WDK, and related tools.

    To elaborate on the copyright notice at the bottom: all content produced by me on this site is copyrighted and licensed as follows:

    This work is licensed under a Creative Commons License (Attribution-NonCommercial 2.0, http://creativecommons.org/licenses/by-nc/2.0/).

    Although I work for Positive Networks, this work is my own and is not connected with my employer in any way.

Kernel Mustard

Reflections on Windows System Programming
Steve Dispensa, MVP - Windows DDK

September 2004 - Posts

Driver Developer's Toolbox, Part 4: The Checked Build
Fresh off of two days' worth of board meetings at my company, and two days out from leaving town for two weeks on vacation (Switzerland, Germany, and France), I'm exceedingly low on time, so please accept my apologies in advance for being slow to post. During my absence, I have a couple of guest bloggers lined up to discuss Interesting Things(TM) to tide you over until I get back.

Today I want to talk about the checked build. As you may know, there are two different builds of the OS: the free build and the checked build. The difference is that the checked build has additional checks compiled in (usually in the form of ASSERT() macros) and lots of debug logging (via KdPrint()). You can get a good feel for how checked build code differs from free build code simply by reading the Microsoft-supplied samples in the DDK. These extra checks can be a great help during the development of driver projects.
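
To make the free/checked distinction concrete, here is a minimal user-mode sketch of how debug-only checks can compile in or out. The DDK keys its real ASSERT() and KdPrint() off the DBG symbol in a similar spirit, but MY_DBG, MY_ASSERT, MY_KDPRINT, and checked_divide below are hypothetical stand-ins made up for illustration, not the DDK definitions.

```c
#include <stdio.h>

/* Hypothetical stand-in for the DDK's DBG symbol: 1 = "checked", 0 = "free". */
#ifndef MY_DBG
#define MY_DBG 1
#endif

/* Counts failed checks instead of breaking into a debugger. */
static int g_assert_failures = 0;

#if MY_DBG
#define MY_ASSERT(expr) \
    do { if (!(expr)) { g_assert_failures++; \
         fprintf(stderr, "ASSERT failed: %s (%s:%d)\n", #expr, __FILE__, __LINE__); } } while (0)
/* Called with double parentheses, in the style of KdPrint((...)). */
#define MY_KDPRINT(args) printf args
#else
#define MY_ASSERT(expr)  ((void)0)
#define MY_KDPRINT(args) ((void)0)
#endif

/* Example routine: both the check and the logging vanish when MY_DBG is 0. */
static int checked_divide(int numerator, int denominator)
{
    MY_ASSERT(denominator != 0);
    MY_KDPRINT(("checked_divide(%d, %d)\n", numerator, denominator));
    return denominator != 0 ? numerator / denominator : 0;
}
```

Building the same file with -DMY_DBG=0 gives the "free" flavor: the routine behaves the same on good input, but the bad call goes unreported.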

The problem with the checked build is that it's a pain to deal with. It is hard to find, although it is available on all MSDN subscriptions from "Operating Systems" on up. Also, all service packs are released with checked build counterparts that can typically be downloaded from microsoft.com. Once you have the build, you have a couple of options for how to install it. I prefer to run with a full checked build at this point on one of my test boxes, but the full checked build has the disadvantage of being considerably slower than the free build, so it takes a lot of horsepower to run. The other problem with the full checked build is that debug messages can get to be amazingly verbose, which is oftentimes not helpful.

The solution to these problems is to use a partial checked build. This means using a checked kernel and hal (the kernel and the hal must always go together - they're a matched pair), and any additional checked kernel components that are relevant. For example, when I'm developing an NDIS driver, I typically run with checked versions of ndis.sys, tdi.sys, tcpip.sys, and afd.sys. FSD and FS filter development call for checked versions of ntfs and fastfat. Use your head; the right checked binaries are usually obvious.

Getting the checked build onto the system is another matter, however. Due to the Fantastic Miracle of System File Protection, driver development has become slightly more painful in this area. To get Windows to allow you to replace the free files with checked ones, you must disable SFP and reboot with a debugger attached. The good news is that the kind folks at OSR have a tool that twiddles the registry keys for SFP automatically, and seems to work across lots of versions of the OS.

Once you have SFP disabled, simply back up ntoskrnl, hal, and whatever other binaries you're replacing, and copy over the new ones. Keep a debugger attached to the system at all times, as things just won't work right without it. ASSERT() macros will crash the system with a bugcheck if there is no debugger, which is seldom helpful. There's always some idiotic driver (VMWare, are you listening?) that ASSERTs in ntio during boot-up, so you'll need the debugger to dismiss the assert.

I find that developing and testing very early on with a checked build can be a big help in preventing the introduction of bugs in your drivers. And, what's more, the newer you are at driver development, the bigger the payoff.

Suggestion Box
Please leave suggestions for topics as feedback on this thread. I'm also happy to take submissions for articles -- please use the contact link to mail me if you'd like to submit something.

Driver Developer's Toolbox, Part 3: Verifier
After getting a lot of feedback for the "Introduction to WinDBG" article I posted a few weeks ago, I thought I'd follow it up with another in the series, this time about Driver Verifier. All driver developers need to know a few basic things about the tools of the trade, and one of the most important tools for driver development that Microsoft ships is Driver Verifier.

Driver Verifier is essentially a library of routines that cross-check your interaction with the OS in a stricter way than normal. The design philosophy of the operating system is that kernel-mode components should trust one another. This works well in practice, as it provides significant performance improvements on often-used code paths. However, this trust is exactly what you don't want during driver development, as it can let subtle errors go unnoticed until your software is in your customers' hands.

There are a couple of tools designed to enforce stricter checking in the OS. One of these tools, the checked build, I'll talk about in another article. Driver Verifier is the other major runtime driver validation tool. Verifier ships with the operating system, and changes with each release. The configuration interface is presented in a user-mode GUI app, but the code that does the real work is embedded in ntoskrnl. Also, note that Verifier and the checked build are unrelated; you really need both, but you can use either alone.

Verifier can catch tons of little errors once you turn it on. It can check for proper use of IRQLs and spin locks, proper implementation of the DMA protocol, correct handling of IRPs, and so on. It can even test your driver in a low-memory simulation, randomly failing memory allocation requests. Best driver development practice dictates that all development testing be done with full driver verifier turned on (possibly with the exception of low resources simulation). By following this rule, you're sure to catch as many mistakes as possible, before they are covered up by more layers of your code. In fact, one developer at Microsoft told me that he routinely runs the entire OS under verifier.
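
The low-memory simulation is easy to picture with a small user-mode sketch. The allocator below is hypothetical (faulty_alloc, g_fail_period, and duplicate_pair are names invented for this example), and it fails every Nth request on a fixed schedule rather than randomly, purely to keep the sketch deterministic. The point is the cleanup path that only runs when the second allocation fails.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical fault-injecting allocator: fails every Nth request. */
static unsigned g_alloc_calls = 0;
static unsigned g_fail_period = 0;      /* 0 = never inject a failure */

static void *faulty_alloc(size_t size)
{
    g_alloc_calls++;
    if (g_fail_period != 0 && g_alloc_calls % g_fail_period == 0)
        return NULL;                    /* injected failure */
    return malloc(size);
}

/* Code under test: must cope with either allocation failing.
   Returns 0 on success, -1 on failure (with both outputs left NULL). */
static int duplicate_pair(const char *first, const char *second,
                          char **first_copy, char **second_copy)
{
    *first_copy = NULL;
    *second_copy = NULL;

    *first_copy = faulty_alloc(strlen(first) + 1);
    if (*first_copy == NULL)
        return -1;

    *second_copy = faulty_alloc(strlen(second) + 1);
    if (*second_copy == NULL) {
        free(*first_copy);              /* the easy-to-forget cleanup path */
        *first_copy = NULL;
        return -1;
    }

    strcpy(*first_copy, first);
    strcpy(*second_copy, second);
    return 0;
}
```

Setting g_fail_period to 2 makes the second allocation fail, which exercises exactly the branch that ordinary testing tends to miss.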

The description that follows is based on the current 64-bit XP preview release, but other OSes are similar. To enable this magic, it's easiest to start verifier.exe from the Start->Run box. There is also a registry-based way to configure Verifier, but I won't go into it here - use Regmon if you're curious. Choose "Create custom settings" and click Next, then "Select individual settings from a list", and Next again. At this point, you're presented with a list of verifications. The easiest and most comprehensive thing is to check them all, but you may want to leave low resources simulation off the list during development and early testing. Also, IRQL checking has a sizeable performance impact, because it invalidates all pageable pages in the driver before each call into your code. Still, this is an invaluable test, particularly in combination with the PAGED_CODE macro. Once you finish with options, you are prompted to select which drivers to verify.
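
The rule that PAGED_CODE enforces can be modeled in a few lines of user-mode C. The IRQL values below match the x86 DDK definitions, but everything prefixed with sim_ or SIM_ is a hypothetical stand-in: the real macro consults KeGetCurrentIrql() and asserts, rather than bumping a counter.

```c
/* IRQL values as defined for x86 in the DDK. */
#define PASSIVE_LEVEL  0
#define APC_LEVEL      1
#define DISPATCH_LEVEL 2

/* Hypothetical stand-in for KeGetCurrentIrql(). */
static int sim_current_irql = PASSIVE_LEVEL;
static int sim_paged_code_violations = 0;

/* Pageable code may only run at APC_LEVEL or below, because a page
   fault cannot be serviced at DISPATCH_LEVEL or above. */
#define SIM_PAGED_CODE() \
    do { if (sim_current_irql > APC_LEVEL) sim_paged_code_violations++; } while (0)

/* A routine that would live in a pageable section of a driver. */
static int sim_pageable_routine(int input)
{
    SIM_PAGED_CODE();
    return input * 2;
}
```

Calling sim_pageable_routine() with sim_current_irql set to DISPATCH_LEVEL records a violation on every call, whether or not the page happens to be resident, which is the property that makes the check pair so well with Verifier's IRQL checking.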

Once verification is started on your driver, be sure to have a kernel debugger hooked up, as any violations that Verifier finds will turn into breaks into the debugger. If you don't have a debugger attached, the system will just bugcheck with Verifier's own bugcheck code. Usually, Verifier is pretty clear about what has gone wrong with your driver, so the problems (if not the fixes) are pretty obvious. In my experience, Verifier doesn't produce many false positives - if Verifier breaks in, there is a very high probability that you have a real bug, and you should fix it.

One other thing - I occasionally find drivers that have clearly not been tested against verifier, because they trip it off during whole system verification. If you find one of these drivers, be a good citizen and drop a friendly note to the company that is responsible for it. Bug reports from clueful developers are always appreciated. And, if that doesn't work, there's always public shame. :-)

A Tale Of Two Laptops
I'm a firm believer in broadening horizons. I love alternatives and underdogs. Variety is the spice of life. With all of that in mind, I ordered a couple of new laptops for my development team that came in this week.

The first one was a Sager - a brand that I had never heard of before a few months ago. Sager makes an impressive box. In fact, it's by far the most impressive laptop I've ever run across, feature-wise. One lucky developer wound up with a Sager NP4750, which is an AMD64-based box. In addition to having every option I've ever heard of in a computer -- seriously, check that link if you don't believe me -- the AMD64 setup seems to be pretty solid. As you know if you've been reading my blog for a while, I'm a big fan of the AMD64, and of 64-bit computing in general. Other than a few minor gotchas with the XP 64-bit preview release (bluescreen on trying to install the wrong VMWare, sound drivers don't quite work all of the time, etc), it looks good. We're still working on the set-up, so I'll let you know if we don't get any of those problems resolved. FWIW, my dual Opteron wound up with zero problems, and I couldn't be happier with it. I've never heard of Sager before this, so I invested in the best warranty coverage they could offer. It can't be any worse than the WinBooks that are on their way out!

Because of the success and ultra-coolness of that Sager laptop, I decided to do the obvious thing and buy an Apple PowerBook G4. After having run Linux on my laptops for many years, I decided it was time to upgrade to a slightly more usable UNIX laptop. My wife has had a PowerMac G5 for over a year now, and it's been fantastic. Everything works right, the UI is the most beautiful graphics work I've ever seen (non-art graphics, anyway), and in general, the Mac Mystique is real. This laptop does nothing to diminish my happiness with Apple. Seriously, if you've never bought a piece of Apple hardware before, go treat yourself to an iPod or something and marvel at the amazingly good packaging and perfect out-of-box experience. As someone who has been in the product business for a few years, I have learned to really appreciate the companies in the world that do an outstanding job on fit and finish.

Anyway, you might be wondering what a Windows driver developer is doing running all of this wacky hardware. The answer is simple: the more different environments you use, the better you get at using all of them. The more different operating systems you expose yourself to, the better you get at improving any of them. My Macintosh experiences (and my Linux experiences) have been invaluable when it comes to improving my Windows products.

I'm still in the process of setting up the Mac, but I have Microsoft VirtualPC 6.1 and Microsoft Office 2004, so I have everything I need to do driver development the way I always have. Emulation speed isn't as good in VPC, however, so for testing, I use Microsoft Remote Desktop to get into my afore-mentioned dual Opteron. It's so much faster than any laptop I've ever seen (even the Sager) that laptop-based testing just doesn't make sense to me any more.

Some Follow-Up To Previous Comments
A couple of things:

- Further offline discussion with Wayne points out that, if there are memory barrier issues in Java (in its current incarnation), they are JDK problems, not language problems per se, due to the fact that Java guarantees "program order" (causing a permanent performance penalty). The example he gave turned out to need some re-working to really test this correctly.

- The Java synchronization stuff posted doesn't actually do anything for memory barriers at all in theory, although as it happens, all of the underlying OS synchronization primitives provide implicit memory barriers. Java on an architecture in which synchronization primitives are implemented differently might have a problem.

- Rod posted a cool link about Java memory issues. It's a good thing that I Hate Java, or else I'd have to be concerned about stuff like this. :-)

- As far as memory barrier references go, there are few. There is some discussion in Dekker and Newcomer's Developing Windows NT Device Drivers, which is old and out of date, but a great book nonetheless. Wikipedia has an article about memory barriers, and they're covered in the processor manuals for the Pentium 4, Itanium, and AMD64. Note that they're also sometimes referred to as "fences". Adrian Oney from Microsoft knows about them; that's as much as I can say about that though. :-) I would really appreciate any additional resources you find.

Intel And Multi-Core Chips
One of my daily reads is ArsTechnica. Hannibal, their CPU guy, does an amazing job of talking about the low-level stuff in such a way that it's easily understandable, even to people with limited neural matter such as myself.

His review of the Intel Developer Forum (just concluded) has some very interesting stuff in it with regard to dual-core CPUs. As if it weren't already important enough to design software in an MP-safe way, now we're reaching the point that software *must* be designed to take advantage of multiple CPUs, or else a significant chunk of your average microprocessor will go unused.

If you have 5 minutes after you're done with your real work (i.e. reading this blog), head on over there and take a peek, and then make sure you re-test all of your drivers on MP boxes for good measure.

Memory Barriers Wrap-up
Hello blogosphere! I hope everyone had a great time this weekend puzzling through the mysteries of memory barriers. Personally, I spent the weekend coding and reading about relativity (a recent post by Raymond Chen got me re-re-re-re-re-started on physics again).

In addition to the above-mentioned nonsense, I got some time to drag out the Intel manuals to see what they had to say about x86 memory barriers. For the curious, the details can be found in section 7.3 of the 3rd volume of the Intel Pentium 4 manuals.

The situation is slightly different between the {i486, P5} and P6+ (Pentium Pro, Pentium II, Xeon, etc.) processors. The first group of chips enforces relatively strong program ordering of reads and writes at all times, with one exception: read misses are allowed to go ahead of write hits. In other words, if a program writes to memory location 1 and then reads from memory location 2, the read is allowed to hit the system bus before the write. This is because the execution stream inside the processor is usually totally blocked waiting for reads, whereas writes can be "queued" to the cache somewhat more asynchronously in the core without blocking program flow.

The P6-based processors present a slightly different story, adding support for out-of-order writes of long string data and speculative read support. In order to control these features of the processor, Intel has supplied a few instructions to enforce memory ordering. There are three explicit fence instructions - LFENCE, SFENCE, and MFENCE.

  • LFENCE - Load fence - all pending load operations must be completed by the time an LFENCE executes
  • SFENCE - Store fence - all pending store operations must be completed by the time an SFENCE executes
  • MFENCE - Memory fence - all pending load and store operations must be completed by the time an MFENCE executes

These instructions are in addition to the "synchronizing" instructions, such as interlocked memory operations and the CPUID instruction. The latter cause a total pipeline flush, leading to less-efficient utilization of the CPU. It should be noted that the DDK defines KeMemoryBarrier() using an interlocked store operation, so KeMemoryBarrier() suffers from this performance issue.

This story changes on other architectures, as I've said before, so the best practice is still to code defensively and use memory barriers where you need them. However, it doesn't look like you're likely to run into these situations in x86-land.

Memory Barriers, Part 2
So my question du jour is, "Is anyone still not using Firefox?" I have been getting sick in recent months of friends and family calling me and complaining about spyware, pop-ups, viruses, and so on. Amazingly enough, simply installing Firefox has dropped my personal support call volume to near-zero. I've also been using Firefox exclusively for months, except for accessing certain MS sites that require IE, and have been thrilled. YMMV, of course, but the newly-released 1.0 preview release runs amazingly well on both Linux and Windows. It's actually more stable on my AMD64 than either the 32-bit or 64-bit versions of IE.

OK, so yesterday, I posted an extra-credit assignment. Nobody tried it, so I'm going to elaborate on it a bit. If you haven't read yesterday's post yet, scroll down and do so before trying to go at this one.

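int a = 0;
int b = 1;

/* Thread 1: checks the invariant forever. */
void f(void)
{
        for(;;)
        {
                ASSERT(a < b);
        }
}

/* Thread 2: bumps b first, then a, so a < b should always hold. */
void g(void)
{
        for(;;)
        {
                b++;
                a++;
        }
}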

This code is similar to code I once saw a Microsoft person scribble on a whiteboard, and I thought it was a really interesting way to frame the memory barrier problem. Say you create both threads f() and g() on a dual-proc computer and then just walk away and let it run. Will the ASSERT ever fire? According to the MS guy, the answer is "yes", and the reason is that the a++ can be committed to RAM before the b++, making a == b.

Consider the values of a and b after a few revolutions. There are a couple of different scenarios:

   case 1            case 2
 (expected)    (reordered writes)
   a | b             a | b
   -----             -----
   0 | 1             0 | 1     (initial)
   0 | 2             1 | 1     (case 1: after b++; case 2: write re-ordered)
   1 | 2             1 | 2

There are other sequences too, for example: a++, b++, b++, a++; and b++, a++, a++, b++.
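
One way to play with these sequences is a tiny simulator. This is a sketch under a simplifying assumption: an observer samples a and b together at one instant, after each write lands. The function name and the string notation ('a' means an a++ has been committed, 'b' means a b++ has been committed) are invented for this example.

```c
#include <stddef.h>

/* Applies a sequence of committed writes to the initial state a = 0, b = 1
   and reports whether a < b held after every step.
   Returns 1 if the invariant always held, 0 if it was ever violated. */
static int invariant_holds(const char *commit_order)
{
    int a = 0;
    int b = 1;

    for (size_t i = 0; commit_order[i] != '\0'; i++) {
        if (commit_order[i] == 'a')
            a++;
        else if (commit_order[i] == 'b')
            b++;

        if (!(a < b))
            return 0;   /* an observer sampling here would trip ASSERT(a < b) */
    }
    return 1;
}
```

The expected order "baba" keeps the invariant, while "abba" breaks it on the first step and "baab" breaks it on the third.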

There are a couple of interesting things to think about here. The first is that this happens in a loop. That effectively gives you two places to put memory barriers: between b and a, like so:

void g(void)
{
        for(;;)
        {
                b++;
                KeMemoryBarrier();
                a++;
        }
}

or between a and b:

void g(void)
{
        for(;;)
        {
                b++;
                a++;
                KeMemoryBarrier();
        }
}

Notice that there are actually two places this barrier can be placed, with equivalent effect.

These two examples solve slightly different problems, as outlined in the sequences given above.

So, over the weekend, here are three more things to ponder:

  1. What impact does the fact that a++ is actually a read/update/write operation have on this? Is the effect architecture-specific?
  2. Are the reordering issues different between on-chip reordering and compiler-generated reordering? Is this also architecture-specific? (think 64-bit computing here)
  3. What would the tables look like under the various possible sequences with and without the barriers in either or both places?

Have a good weekend!

Memory Barriers
Sorry for the long break in blogging; I've been catching up on 1001 things at work, and getting ready for an upcoming trip to the Old World. I promise I won't let it happen again! <g>

I first heard about memory barriers from Ed Dekker's book on NT Device Drivers. This is still probably my favorite overall book on driver-writing, even though it's getting to be badly out of date. Ed can tell stories with the best of 'em, and his is one of the few books that really has a personality. He addressed the concept of memory barriers in conjunction with the (slightly oddball) Alpha processor port of NT.

First, consider the following code:

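int a = 0;
int b = 0;

/* Thread 1: waits for a to be set, then expects b to be set as well. */
void f(void)
{
        while(a == 0)
        {
        }

        ASSERT(b == 1);
}

/* Thread 2: sets b first, then a. */
void g(void)
{
        b = 1;
        a = 1;
}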

Assume f() and g() are two threads started simultaneously. Will that ASSERT() ever fire? The naive answer is "no". However, modern super-scalar processors sometimes re-order memory accesses for various reasons, and if that happens, the ASSERT can be tripped. This is subtle; think about it for a second if it's not immediately obvious.

This problem can be fixed with a memory barrier. A memory barrier is an explicit instruction to the CPU that orders reads and writes to memory. In other words, it requires that any outstanding read or write accesses to memory be completed, processor-wide. The implementation of a memory barrier is CPU-specific, as are the situations in which one might be needed. IA-64 write combining presents different issues to programmers than normal x86 semantics, for example.

On an x86 chip, any interlocked operation will force an implied memory barrier, and if you're using a new enough DDK, you can call KeMemoryBarrier() to make your intentions obvious. The above code would be fixed, for example, by changing g() as follows:

void g(void)
{
        b = 1;
        KeMemoryBarrier();
        a = 1;
}
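
For readers outside the kernel, roughly the same fix can be sketched in portable C11 with <stdatomic.h>. This is an analogue, not the DDK mechanism: the release fence stands in for KeMemoryBarrier() on the writer side, and the sketch also adds an acquire fence on the reader side, which the C11 memory model requires and which the example above leaves implicit. f() returns the value it would have asserted on so that the result can be inspected.

```c
#include <stdatomic.h>

static int b = 0;           /* ordinary data, written before the flag */
static atomic_int a = 0;    /* flag that f() waits on */

/* Counterpart of g(): the release fence keeps the store to b
   ahead of the store to a. */
static void g(void)
{
    b = 1;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&a, 1, memory_order_relaxed);
}

/* Counterpart of f(): spin until the flag is set, then read b.
   The acquire fence keeps the read of b from moving above the loop.
   Returns the value of b that ASSERT(b == 1) would inspect. */
static int f(void)
{
    while (atomic_load_explicit(&a, memory_order_relaxed) == 0) {
        /* spin until g() publishes */
    }
    atomic_thread_fence(memory_order_acquire);
    return b;
}
```

With g() running on one thread and f() on another, f() should only ever return 1. On x86 the two fences typically compile to no instruction at all and merely restrain the compiler, which fits the observation that x86 rarely needs explicit barriers.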

So how can you tell when you need one? Well, the good news is that it doesn't seem like I run into many situations where this is an issue. The operating system protects you with implicit memory barriers included in all locks, and if you always protect shared memory with a lock of some sort, you're safe. However, if you try to minimize the use of locks in your code, this can jump up and bite you.

There is one other source of re-ordering that you should be aware of, as well: compilers tend to re-order things in certain cases, and while the rules for re-ordering are subtle and complex, you can always protect yourself using a combination of the "volatile" keyword in C code and compiler-specific intrinsics.
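
As a rough illustration of the compiler side of the problem, the sketch below combines volatile with a compiler-only barrier. It uses C11's atomic_signal_fence as a portable stand-in for the compiler-specific intrinsics mentioned above (MSVC's _ReadWriteBarrier is one such intrinsic). The variable and function names are hypothetical, and the barrier here constrains only the compiler; it emits no CPU fence.

```c
#include <stdatomic.h>

/* volatile: every access in the source is actually performed, in order
   relative to other volatile accesses. */
static volatile int device_status = 0;
static int scratch[2];

/* Compiler-only barrier: in practice compilers will not move memory
   accesses across it, but no fence instruction is generated. */
static void compiler_barrier(void)
{
    atomic_signal_fence(memory_order_seq_cst);
}

static int write_then_flag(int value)
{
    scratch[0] = value;
    compiler_barrier();          /* keep the store above from sinking below */
    device_status = 1;           /* volatile store is not elided or merged */
    return scratch[0] + device_status;
}
```

Neither volatile nor the compiler barrier orders anything at the processor level, so on a weakly ordered CPU a real memory barrier is still needed on top of them.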

For more information on memory barriers, check out this paper at WHDC.

Extra credit: analyze the following code, in light of memory barriers:

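int a = 0;
int b = 1;

/* Thread 1: checks the invariant forever. */
void f(void)
{
        for(;;)
        {
                ASSERT(a < b);
        }
}

/* Thread 2: bumps b first, then a. */
void g(void)
{
        for(;;)
        {
                b++;
                a++;
        }
}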

When Does "Output" Mean "Input"?
After more philosophization on the meaning of direct IOCTL codes, I came to the conclusion that I've never used METHOD_IN_DIRECT in a driver. Naturally, I wondered if it was any different than METHOD_OUT_DIRECT. Boy, was that an interesting investigation.
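
For reference, the transfer method is carried in the low two bits of the IOCTL code itself. The sketch below reproduces the bit layout of the DDK's CTL_CODE macro under a different name (SKETCH_CTL_CODE) to make clear it is an illustration; method_from_ctl_code is a hypothetical helper, and the device type and function number used with it are example values only.

```c
/* Transfer-method values as defined in the DDK headers. */
#define METHOD_BUFFERED   0
#define METHOD_IN_DIRECT  1
#define METHOD_OUT_DIRECT 2
#define METHOD_NEITHER    3

/* Same bit layout as the DDK's CTL_CODE macro:
   device type in bits 16-31, access in 14-15, function in 2-13, method in 0-1. */
#define SKETCH_CTL_CODE(device_type, function, method, access) \
    (((unsigned long)(device_type) << 16) | ((unsigned long)(access) << 14) | \
     ((unsigned long)(function) << 2) | (unsigned long)(method))

/* Extracts the transfer method from a control code. */
static unsigned long method_from_ctl_code(unsigned long ctl_code)
{
    return ctl_code & 3UL;
}
```

For example, SKETCH_CTL_CODE(0x22, 0x800, METHOD_IN_DIRECT, 0) yields 0x222001, and masking with 3 recovers METHOD_IN_DIRECT.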

To start off with, you have to know a thing or two about how an IRP works. An IRP is the basic data structure passed into all driver dispatch routines. It contains all of the caller's parameters, as well as an associated data structure that replaces the traditional stack used during function calls. In particular, IRPs have a member called MdlAddress. Note that it doesn't say "InMdlAddress" and "OutMdlAddress" - it's just MdlAddress.

After some consideration, I determined that when a usermode app calls DeviceIoControl() or NtDeviceIoControlFile() on a METHOD_IN_DIRECT code, it must just pass its data in the InputBuffer into the driver at MdlAddress. I put together a quick test driver to verify this fact. Nope, wrong.

The next step was to look around for any sample code that calls DeviceIoControl() with METHOD_IN_DIRECT. I searched my DDKs for about 5 minutes and finally gave up - the only samples I found were calling from the kernel, and not calling NtDeviceIoControlFile().

After fiddling with the code for long enough to convince myself that I wasn't crazy (riiiiight), I decided to do what any sane developer would do in a similar situation: I broke out WinDbg. Knowing that all IOCTL requests from user mode end up calling NtDeviceIoControlFile, I disassembled that function:

kd> ln nt!NtDeviceIoControlFile
(8052af7e)   nt!NtDeviceIoControlFile   |  (8052afaa)   nt!NtFsControlFile
Exact matches:
    nt!NtDeviceIoControlFile = 
kd> u 8052af7e 8052afaa
nt!NtDeviceIoControlFile:
8052af7e 55               push    ebp
8052af7f 8bec             mov     ebp,esp
8052af81 6a01             push    0x1
8052af83 ff752c           push    dword ptr [ebp+0x2c]
8052af86 ff7528           push    dword ptr [ebp+0x28]
8052af89 ff7524           push    dword ptr [ebp+0x24]
8052af8c ff7520           push    dword ptr [ebp+0x20]
8052af8f ff751c           push    dword ptr [ebp+0x1c]
8052af92 ff7518           push    dword ptr [ebp+0x18]
8052af95 ff7514           push    dword ptr [ebp+0x14]
8052af98 ff7510           push    dword ptr [ebp+0x10]
8052af9b ff750c           push    dword ptr [ebp+0xc]
8052af9e ff7508           push    dword ptr [ebp+0x8]
8052afa1 e84ea70000       call    nt!IopXxxControlFile (805356f4)
8052afa6 5d               pop     ebp
8052afa7 c22800           ret     0x28

It looks like NtDeviceIoControlFile just hops directly to IopXxxControlFile(), which is not exported. Disassembling that function in WinDbg shows that this is where the real magic happens. Some selected lines:

kd> ln nt!IopXxxControlFile
(805356f4)   nt!IopXxxControlFile   |  (80535dac)   nt!IopInitializeBootLogging
Exact matches:
    nt!IopXxxControlFile = 
kd> u 805356f4 80535dac
8053579f e846befdff       call    nt!ProbeForWrite (805115ea)
805357ed e8409ff6ff       call    nt!ObReferenceObjectByHandle (8049f732)
805358da e85408efff       call    nt!IoGetRelatedDeviceObject (80426133)
805358e4 e86f06efff       call    nt!IoGetAttachedDevice (80425f58)
80535aea e84deaeeff       call    nt!IoAllocateIrp (8042453c)
80535c16 e84cebeeff       call    nt!IoAllocateMdl (80424767)

etc...

OK, so now I know we're in the right function. Now I look for what happens to METHOD_IN_DIRECT, which (according to the DDK) is type 1. That IoAllocateMdl call looks promising, too, as we know that the function should only be allocating an MDL for DIRECT I/O. Some exploration yields:

80535c0c 53               push    ebx
80535c0d 6a01             push    0x1
80535c0f 56               push    esi
80535c10 ff752c           push    dword ptr [ebp+0x2c]
80535c13 ff7528           push    dword ptr [ebp+0x28]
80535c16 e84cebeeff       call    nt!IoAllocateMdl (80424767)

Now, remember that arguments are pushed on the stack backwards, so ebp+0x28 will be VirtualAddress, ebp+0x2c will be Length, esi (which is xor'd to 0) represents a FALSE for SecondaryBuffer, 0x1 is TRUE for ChargeQuota, and ebx holds the address of the IRP (which I know is correct, because it was set to the return value of IoAllocateIrp()).

The interesting point is that this is the *only* call to IoAllocateMdl in the entire function. In fact, it's the only call to any MDL-related function, so that must be what's used to set MdlAddress. A little exploration confirms that:

kd> dt nt!_IRP
   +0x000 Type             : Int2B
   +0x002 Size             : Uint2B
   +0x004 MdlAddress       : Ptr32 _MDL
...

80535c16 e84cebeeff       call    nt!IoAllocateMdl (80424767)
80535c1b 894304           mov     [ebx+0x4],eax

Here, I used the dt command to tell me the offset of the MdlAddress member of the IRP struct. Then, I looked at what happened to the return value (eax), and sure enough, it's a match. Remember that we determined above that ebx is our IRP.

So, only one question remains: what data is mapped into that MDL? Here's the interesting part: those arguments provided to IoAllocateMdl are statically defined. They're not dependent on the transfer method. In other words: no matter what transfer method you choose, if you get to the IoAllocateMdl() call, you're getting the same buffer mapped into the MDL. Which buffer is it?

To find that out, we have to identify ebp+0x28 and ebp+0x2c. Looking back at the way this function was called, we should be able to figure out what happens. The good news here is that this function uses the standard stack frame pointer, which is set up at the top of the function:

kd> u 805356f4 80535dac
nt!IopXxxControlFile:
805356f4 55               push    ebp
805356f5 8bec             mov     ebp,esp

This means we only have to look at whatever is at +0x28 in the caller's frame. Remember that the push we just did above is the first thing on the stack, and the return address will be next. So, we just go back to the caller's string of pushes and look for the one at +0x20, which will be the 9th argument. That turns out to be ebp+0x28 as well. Using the same logic, we see that our argument is the 9th argument to NtDeviceIoControlFile. Now, we just crack open our copy of Nebbett's Native API book, and find that the 9th argument to NtDeviceIoControlFile() is OutputBuffer!
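
The offset arithmetic is easy to get wrong by hand, so here it is as a small helper. It assumes a 32-bit function with the standard frame shown above (push ebp / mov ebp,esp) and 4-byte argument slots; arg_index_from_ebp_offset is a name made up for this sketch.

```c
/* With a standard 32-bit frame, [ebp+0] holds the saved ebp, [ebp+4] the
   return address, and the first argument sits at [ebp+8]. Each argument
   slot is 4 bytes wide.
   Returns the 1-based argument number, or -1 if the offset is not an
   argument slot. */
static int arg_index_from_ebp_offset(unsigned offset)
{
    if (offset < 8 || (offset & 3) != 0)
        return -1;
    return (int)((offset - 8) / 4) + 1;
}
```

Feeding it the offsets from the listings gives 9 for ebp+0x28 and 10 for ebp+0x2c, and 7 and 8 for ebp+0x20 and ebp+0x24.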

Well, that certainly explains a lot. No matter whether you specify METHOD_IN_DIRECT or METHOD_OUT_DIRECT, it looks like Windows will just build an MDL on OutputBuffer. After this little revelation, I went back and tried to figure out what happened to InputBuffer, which is the 7th argument, at offset ebp+0x20. I didn't have to look far - immediately above the IoAllocateMdl() stuff is this:

80535bca 397520           cmp     [ebp+0x20],esi
80535bcd 7435             jz      nt!IopXxxControlFile+0x510 (80535c04)
80535bcf 68496f2020       push    0x20206f49
80535bd4 ff7524           push    dword ptr [ebp+0x24]
80535bd7 ff75d8           push    dword ptr [ebp-0x28]
80535bda e8075ceeff       call    nt!ExAllocatePoolWithQuotaTag (8041b7e6)
80535bdf 89430c           mov     [ebx+0xc],eax
80535be2 8b4d24           mov     ecx,[ebp+0x24]
80535be5 8b7520           mov     esi,[ebp+0x20]
80535be8 8bf8             mov     edi,eax
80535bea 8bc1             mov     eax,ecx
80535bec c1e902           shr     ecx,0x2
80535bef f3a5             rep     movsd
80535bf1 8bc8             mov     ecx,eax
80535bf3 83e103           and     ecx,0x3
80535bf6 f3a4             rep     movsb
80535bf8 c7430830000000   mov     dword ptr [ebx+0x8],0x30
80535bff 33f6             xor     esi,esi
80535c01 8b4d2c           mov     ecx,[ebp+0x2c]

Remember that esi is still 0. This code allocates a buffer of ebp+0x24 (i.e. InputLength) bytes and sets it to Irp->AssociatedIrp.SystemBuffer (also found with the dt command). It then does what boils down to RtlCopyMemory(), x86-style, from source ebp+0x20 (InputBuffer) to dest SystemBuffer, length ebp+0x24 (InputLength). In other words, the system always double-buffers InputBuffer on NtDeviceIoControlFile().
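
For anyone who doesn't read x86 string instructions every day, the copy in that listing can be modeled in C roughly as follows. This is a user-mode sketch of the rep movsd / rep movsb pair only, with a made-up function name; it does not model the pool allocation or the IRP fields.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies length/4 dwords, then the remaining length%4 bytes, mirroring the
   shr ecx,2 / rep movsd / and ecx,3 / rep movsb sequence. memcpy is used
   for each dword so the sketch stays safe for unaligned buffers. */
static void copy_dwords_then_bytes(void *dest, const void *src, size_t length)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    size_t dwords = length >> 2;     /* shr ecx,0x2 */
    size_t tail   = length & 3;      /* and ecx,0x3 */

    for (size_t i = 0; i < dwords; i++) {     /* rep movsd */
        uint32_t tmp;
        memcpy(&tmp, s, sizeof tmp);
        memcpy(d, &tmp, sizeof tmp);
        s += 4;
        d += 4;
    }
    for (size_t i = 0; i < tail; i++)         /* rep movsb */
        *d++ = *s++;
}
```

A 7-byte input, for instance, is copied as one dword followed by three single bytes.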

OK, so I know you really have to be a geek to find this fascinating, but I really didn't gather that this was the case just from reading the documentation, although it's certainly possible that I missed it. The lack of samples seems to indicate that this isn't a commonly-used code path, either.

The bad news is that this post has taken over 2 hours to write, and now it's likely that I'm going to be late to work. See you on the flip side.

A Quick Note...
Thanks to Girish Bharadwaj for the GMail invite. I'm officially one of the Cool People now, thanks to him! Check out his blog when you get a chance.

More About FAST_MUTEX
In an earlier post, I said that FAST_MUTEXes are more efficient than spin locks to use if you don't need to synchronize at DISPATCH_LEVEL. That is true in the sense of overall system efficiency (which was the intended sense), and as an added bonus, they actually seem to be faster to acquire in the uncontended case.

However, according to the disassembly, they hit the dispatcher lock in just the same way as regular KMUTEXes on release and on contended acquire. This makes those paths slower than the corresponding spin lock paths. The overall system efficiency gain is still worthwhile, though. Peter Wieland from Microsoft posted on NTDEV about this today.

Some Peculiar Peculiarities
A couple of odd things keep happening lately. First, a private high school that I do occasional volunteer work for tried to deploy XP SP2 onto an administrative computer today, and it slowed the entire box waaaay down. Windows would take 30 seconds to open, IMs would take 20 seconds to send, etc. It was terrible. After listening to this particular administrator's agony for 5 minutes, I told her to just call Microsoft, as I had heard they were providing free tech support for SP2.

Well, it was true - they were. According to the administrator, the support person had heard of the problem, and the only known workaround was to uninstall the service pack. This is obviously not an optimal solution, and I didn't get out there with a kernel debugger, so I really have no idea what went wrong. Anybody else have similar experiences?

Another oddball thing is this .Text blog. There is one bug that I keep running across, but aside from that, it's a fine blogging package. However, it crashes on my Mac. I don't get it - both Safari (the default Mac browser, based on Konqueror's KHTML engine) and the latest Mac IE segfault and kindly ask me if I'd like to file a bug report with Apple. I don't expect too much sympathy from the crowd that is likely to read this blog :-) but nonetheless, it's weird. I've had maybe 3 other segfaults ever in the last 12+ months of active use. This .Text crash is pretty reliable, too. Who knows.

More .Text issues
For some reason, the last post leads to a weird .Text error if you click the feedback link. Sooo, if you have feedback for that previous post, whack the feedback link below *this* post.

More On IOCTL Security
Hello blogosphere! I'm back from an extended holiday weekend and feeling much refreshed. Let's see if that improves the quality of these posts. :-)

First, though, I'd like to give a Black Lump Of Coal (as opposed to a Gold Star(TM)) to Blockbuster Video - the people who rent movies for a few nights - for their new marketing campaign. First, they are finally trying to answer NetFlix by going to a flat, per-month pricing model. OK, so that's not so bad (but it still won't compete with NetFlix), but in the process, they've created the Worst Euphemism Ever: they refer to an end to "EXTENDED VIEWING FEES". Hahaha! They're "Late Fees" when they're trying to scare you into returning DVDs on time, threatening your credit, and in general, chewing you out for being a horrible person (this actually happened to my sister-in-law!). BUT, now that being late is a SELLING POINT, they're conveniently called EXTENDED VIEWING FEES! Black Lump Of Coal, big time.

OK, with that little rant out of the way, let's talk about two common security issues with IOCTLs. First, if you use either one of the DIRECT methods (METHOD_IN_DIRECT or METHOD_OUT_DIRECT) or METHOD_BUFFERED, you are guaranteed that the buffer will point to valid memory. It'll either be in NonPagedPool or be locked down with MmProbeAndLockPages() - the OS locks down your pages for you.

HOWEVER, with the DIRECT methods, you *are* at risk of another kind of attack. Suppose that the user program has two threads. Your driver receives the IOCTL request on thread 1 and validates its buffer. Then, thread 2 changes the contents of the buffer out from under the driver. Then, the driver, still running on thread 1, reads something out of that buffer - say, for example, the valid length of the buffer (a bad idea to begin with -- the OS already gives you this). Now you have an incorrect buffer length, and you're going to be either reading or writing memory you shouldn't be. The result can range from a crash to a security hole. The correct thing to do here is to "capture" the arguments once, up front. Once captured (via an RtlCopyMemory() or something similar), work only on the captured copy of the parameters.
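The capture idea is easier to see in code. What follows is a minimal sketch in plain C rather than real driver code: the REQUEST layout and the handle_request name are invented for illustration, and memcpy() stands in for RtlCopyMemory(). The point is only the ordering: copy first, validate the copy, and never go back to the shared buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical request layout, for illustration only. */
typedef struct {
    unsigned int  payload_length;   /* caller-supplied, so untrusted */
    unsigned char payload[64];
} REQUEST;

/* Sums the payload bytes into *sum. Returns the number of bytes summed,
 * or -1 if the request is malformed. `buffer_length` is the length the
 * OS reported for the buffer, not a value read out of the buffer. */
static int handle_request(const void *shared_buffer, size_t buffer_length,
                          unsigned int *sum)
{
    REQUEST captured;
    unsigned int i;

    assert(shared_buffer != NULL && sum != NULL);

    if (buffer_length < sizeof(REQUEST))
        return -1;

    /* Capture once. From here on, shared_buffer is never read again, so a
     * second thread scribbling on it cannot change what we validated. */
    memcpy(&captured, shared_buffer, sizeof(captured));

    if (captured.payload_length > sizeof(captured.payload))
        return -1;

    *sum = 0;
    for (i = 0; i < captured.payload_length; i++)
        *sum += captured.payload[i];

    return (int)captured.payload_length;
}
```

Had the loop re-read payload_length from shared_buffer instead of from captured, the bounds check above would be worthless, which is exactly the attack described.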

METHOD_BUFFERED avoids this whole mess by capturing both the input buffer and the output buffer for you. The OS probes the buffers and then copies the input into a captured buffer. The output is also copied back by the OS. All of this is presumably done within the safe confines of proper OS-level exception handling, so you don't risk crashing the system.

With METHOD_NEITHER, things get even worse. In addition to the attacks described above, the usermode program can actually invalidate the entire buffer. The pages aren't probed and locked by the OS, and the address you will be referencing will be a usermode address. Finally, there's one other consideration: because it is a usermode address, you cannot touch it unless you're in the context of the calling process. Otherwise, you'll be reading from or writing to the wrong process, also leading to either a security hole or a crash.

Some IOCTL Code Definition Tips
We've all* defined our own IOCTL codes in the past. They're a primary way to enable user-mode -> driver and driver -> driver communication. Although they were probably originally envisioned in the context of storage-related drivers, Microsoft supports the use of an IOCTL dispatch in most kinds of drivers. NDIS, for example, supports NdisMRegisterDevice for the explicit purpose of providing access to the IOCTL framework in network drivers.

IOCTL codes are defined using the CTL_CODE() macro, which is part of ntddk.h in kernel-mode and winioctl.h in user-mode. The first parameter takes a Device Type, which can either be one of the Microsoft-defined device types (see the DDK headers for a list) or a custom device type code.

There are a couple of things to keep in mind here. The first is that the Device Type parameter must match the device type that is passed into IoCreateDevice(). Also, if you define a custom device type, it should be 0x8000 or above. The bottom 15 bits are what actually represent the device type, and the 16th bit is known as the Common bit. The DDK requires that the Common bit be set on all custom Device Types. Another way of saying this is that all custom device types must be between 0x8000 and 0xFFFF. Similarly, the function code is required to be between 0x800 and 0xFFF, because the top bit ("Custom" in this case) is required to be set for non-Microsoft-defined function codes. Playing by the rules will make your driver as compatible as possible with all releases of the OS, present and future.
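To make the bit layout concrete, here is a small sketch. SKETCH_CTL_CODE uses the same shifts as the DDK's CTL_CODE() but is redefined locally so the example builds without the DDK headers. The device type 0x8001, the function code 0x800, and every name starting with SKETCH_, FILE_DEVICE_MYDRIVER, or IOCTL_MYDRIVER_ are made-up values for illustration, not anything a real driver defines.

```c
#include <assert.h>

/* Same layout as CTL_CODE(): device type in bits 16-31, required access in
 * bits 14-15, function code in bits 2-13, transfer method in bits 0-1. */
#define SKETCH_CTL_CODE(DeviceType, Function, Method, Access)   \
    (((unsigned long)(DeviceType) << 16) |                      \
     ((unsigned long)(Access)     << 14) |                      \
     ((unsigned long)(Function)   <<  2) |                      \
      (unsigned long)(Method))

#define SKETCH_METHOD_BUFFERED  0   /* value of METHOD_BUFFERED */
#define SKETCH_FILE_ANY_ACCESS  0   /* value of FILE_ANY_ACCESS */

/* Hypothetical custom device type and function code. */
#define FILE_DEVICE_MYDRIVER    0x8001
#define MYDRIVER_FUNC_GET_INFO  0x800

#define IOCTL_MYDRIVER_GET_INFO                                   \
    SKETCH_CTL_CODE(FILE_DEVICE_MYDRIVER, MYDRIVER_FUNC_GET_INFO, \
                    SKETCH_METHOD_BUFFERED, SKETCH_FILE_ANY_ACCESS)

/* Pull the fields back out of a finished code. */
static unsigned long device_type_of(unsigned long code)
{
    return (code >> 16) & 0xFFFF;
}

static unsigned long function_of(unsigned long code)
{
    return (code >> 2) & 0xFFF;
}

/* The range rules from the text: Common/Custom bit set. */
static int is_custom_device_type(unsigned long device_type)
{
    return device_type >= 0x8000 && device_type <= 0xFFFF;
}

static int is_custom_function(unsigned long function)
{
    return function >= 0x800 && function <= 0xFFF;
}
```

With these values IOCTL_MYDRIVER_GET_INFO works out to 0x80012000. A function code of 0x7FF would fail is_custom_function(), because the Custom bit is clear and the value falls in the range reserved for Microsoft.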

Method is one of METHOD_BUFFERED, METHOD_IN_DIRECT, METHOD_OUT_DIRECT, or METHOD_NEITHER. METHOD_BUFFERED is the most common transfer method, and is generally the safest and easiest to use. This method double-buffers your data by copying it from the supplied user-mode buffer into a newly-created kernel-mode buffer, and then passing that new buffer to your driver instead of the original one. If you're transferring less than one page of data (4K on x86), and especially if you're doing it infrequently, this is the way to go. If you're transferring larger amounts of data, one of the DIRECT methods may make sense. This is particularly true if you're going to wind up DMAing your data to or from a device, but it is also true if you just want to avoid the double-buffer in general. I'm not going to discuss METHOD_NEITHER at the moment, other than to say that you shouldn't use it. I'll get into more detail about why another day.

The final knob to turn is the RequiredAccess parameter. I admit that I really didn't understand what this parameter was for until quite a while after I wrote my first driver. It turns out that it is a method for enforcing some small but nontrivial amount of access control on who can call your IOCTL. This specifies the kind of access the user must have to the device, as specified in the CreateFile() call, in order for the IO manager to let the IRP through. FILE_ANY_ACCESS means that they can send the IRP with virtually any access at all, as long as they have an open file handle to the driver. FILE_READ_ACCESS and FILE_WRITE_ACCESS loosely correlate to the ability to read and write data to and from the device.

Most driver writers just set this to FILE_ANY_ACCESS and forget about it. This is, of course, exactly the wrong thing to do. A much better strategy is to specify the most restrictive access possible (FILE_READ_ACCESS|FILE_WRITE_ACCESS -- yes, you can OR them together) whenever possible, and only remove bits when necessary ("necessary" depends on the kind of driver you're writing). This parameter is particularly important in IOCTLs where you're actually reading and writing data -- why would you allow a user to read data from an IOCTL if you wouldn't allow the same user to read data using ReadFile()? -- but it should probably be applied carefully to all IOCTLs.
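Here is a rough model of what that access check amounts to. It is a simplification, not the I/O manager's actual code: handle_may_send and required_access_of are invented names, the SKETCH_ constants only mirror the values of FILE_ANY_ACCESS, FILE_READ_ACCESS, and FILE_WRITE_ACCESS, and the `granted` argument assumes the handle's rights have already been boiled down to those same two bits.

```c
#include <assert.h>

#define SKETCH_FILE_ANY_ACCESS    0x0   /* value of FILE_ANY_ACCESS   */
#define SKETCH_FILE_READ_ACCESS   0x1   /* value of FILE_READ_ACCESS  */
#define SKETCH_FILE_WRITE_ACCESS  0x2   /* value of FILE_WRITE_ACCESS */

/* Same layout as CTL_CODE(), redefined so this builds without the DDK. */
#define SKETCH_CTL_CODE(DeviceType, Function, Method, Access)   \
    (((unsigned long)(DeviceType) << 16) |                      \
     ((unsigned long)(Access)     << 14) |                      \
     ((unsigned long)(Function)   <<  2) |                      \
      (unsigned long)(Method))

/* RequiredAccess lives in bits 14-15 of the IOCTL code. */
static unsigned long required_access_of(unsigned long ioctl_code)
{
    return (ioctl_code >> 14) & 0x3;
}

/* Simplified model: the request goes through only if the handle was
 * opened with every access right that the IOCTL code demands. */
static int handle_may_send(unsigned long ioctl_code, unsigned long granted)
{
    unsigned long required = required_access_of(ioctl_code);
    return (granted & required) == required;
}
```

Under this model, a code built with SKETCH_FILE_READ_ACCESS | SKETCH_FILE_WRITE_ACCESS is refused on a handle that only has read access, while a code built with SKETCH_FILE_ANY_ACCESS gets through on any open handle, which is exactly why FILE_ANY_ACCESS is a poor default.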

Finally, it might be obvious, but try to name your IOCTL codes something obvious. My office has a standard that goes IOCTL__. In other words, you might have IOCTL_POSVPN_SET_INFO to configure our VPN driver. This goes with the standard rants about variable naming, and is generally an important thing if you want someone else to be able to work on your code.

OK, I expect everyone to run out and tighten up their use of CTL_CODE(). When you're done with that, go listen to Fred Jones, Part 2, by Ben Folds. It'll make you a Better Person.

Happy hacking!

* OK, "All" might be a bit of an exaggeration. :-)

Inside the NT Insider
I just got the latest edition of the NT Insider from OSR in the mail today. They really outdid themselves with this issue. It's 52 pages of my favorite topic: testing! There are lots of articles about debugging, testing, and so on. They're well-written, as usual, and form a terrific resource for budding driver developers and seasoned pros alike.

If you haven't done so already, go over to www.osronline.com and register for a subscription. You won't be sorry.

Two Quickies
If you've spent much time looking at Microsoft sample code or reading through the DDK headers, you've probably noticed a macro called PAGED_CODE(). The DDK defines it as:

#if DBG
#define PAGED_CODE() \
    { if (KeGetCurrentIrql() > APC_LEVEL) { \
         KdPrint(( "EX: Pageable code called at IRQL %d\n", KeGetCurrentIrql() )); \
         ASSERT(FALSE); \
      } \
    }
#else
#define PAGED_CODE() NOP_FUNCTION;
#endif

As you can see, this macro makes sure your code is being called at <= APC_LEVEL, which is a requirement for things like referencing pageable memory. It's a good idea to put this at the top of all of your functions that require <= APC_LEVEL. Because it compiles out in free builds, there's no real harm in using it. You'll be a better person for it.

Another nice thing to do is to mark your segments as pageable or as init if possible. Most of the Microsoft samples do this. It's accomplished with a couple of pragma directives:

#pragma alloc_text (INIT, DriverEntry)
#pragma alloc_text (PAGE, AddDevice)

The idea here is to use the first pragma on any functions (like DriverEntry) that will never be called again after they run the first time. The OS can then discard them completely and not waste any more resources on them. The second pragma marks functions as pageable, meaning that the code can be paged out to disk if needed, to make room for other processes to execute. Note that this cannot be done for any code that might run at DISPATCH_LEVEL or higher, and as such is nicely mated with the PAGED_CODE() macro.

The Joy of NUMA
Dr. HardwareBlog has an article up about Non-Uniform Memory Architectures on his blog. It looks like the first in a series. This has interesting ramifications for all developers, including kernel-mode development. Keep an eye on it.

Driver Developer's Toolbox, Part 2: WinDBG
It's another back-to-basics day. Lots of driver developers post questions on various forums that basically boil down to "how do I debug my driver?". Let me see if I can clear things up a bit.

If you want to be a serious driver developer, you will need a debugger. I know you don't write bugs :) but you might not understand something about the way the OS is behaving underneath you, so there is still a need to get good with a debugger. Furthermore, you'll never understand how a computer works in general, or how Windows works in particular, until you've done some time staring at a debugging console. You need a debugger if you plan on using the checked build, running driver verifier, disabling system file protection, or looking at the output from your DbgPrint/KdPrint statements. And, of course, you need a debugger because there's no other way: there is no printf() debugging or MessageBox() debugging in the kernel.

Taxonomy of Debuggers
There are really only two debuggers that are in common use in the driver development community. One is a product called SoftICE, from Compuware (formerly NuMega). It's a good product, and above all, it's a phenomenal hack, considering the way they're able to slip the debugger between Windows and the bare iron. Because of this approach, SoftICE works on the actual computer you're trying to debug (as opposed to WinDBG). One other benefit of SoftICE is that it runs on Windows 9x. If you're a Poor Unfortunate Soul stuck with supporting one of those fine versions of Windows, SoftICE may be a good idea: the other debugger (wdeb386.exe) is *absolutely horrible*. Really. Unusable. Promise.

SoftICE does have a couple of drawbacks, though. First, it's expensive. It only comes bundled with other products that you may not want or need. In fact, for the price of SoftICE, you can probably buy either VMWare (below) or a second computer to use as a debugging target. The bigger issue to me, however, is that it's not the standard. Most kernel-mode programmers work with WinDBG. Questions about your driver that have debugger involvement will almost always be answered in the context of WinDBG. If you ever call Microsoft Product Support Services for help, WinDBG will be the assumption. That said, SoftICE really is a great product. If you want to go down that path, knock yourself out.

In general, though, it's probably best to go to WHDC and download the Debugging Tools for Windows (also known as WinDBG). WinDBG (pronounced either "win-d-b-g" or "wind-bag", depending on who you ask) is an incredibly powerful, feature-rich debugger for kernel-mode (and user-mode!) applications. In addition to the standard debugger intrinsics, Microsoft has shipped dozens of debugger extensions that do higher-level things like look for locks, dump the contents of IRPs in a way that makes sense, and so on.

Setting Up Your Debugger
Getting kernel debugging set up is a little bit of a pain. There are two basic ways to go about it. The first way is to use two computers, connected to each other by either a standard null-modem cable or by a FireWire cable. I've heard nothing but complaints about FireWire from Gary Little, a frequent contributor to NTDEV and the public Microsoft newsgroups, and I have never used it myself, so I generally recommend the good old-fashioned RS-232 connection. One thing worth noting is that it has to be a plain-jane RS-232 port on the target machine; USB serial ports and the like won't work, as they depend on the kernel loading lots of drivers to get them to work.

In addition to a physical connection, you have to modify the debug target computer's boot.ini file. Generally, I remove the "/FASTDETECT" switch and add "/SOS /DEBUG /DEBUGPORT=COM1 /BAUDRATE=115200". Speed is everything, so you want to use the highest baud rate that your box supports. Once you make the modifications, your computer will boot to a boot menu and prompt you as to whether or not you want to launch the debugging mode. Choose the "[Debugger Enabled]" option to make the kernel look for a kernel debugger.

If you don't have a second computer handy, there is a new feature introduced with Windows XP that may be of some use. It adds the ability to do a limited form of local kernel debugging. This is similar to the livekd tool that was shipped with Inside Windows 2000, and is useful as a learning tool. It cannot be used to do any sort of invasive kernel debugging, though, so it's mostly inappropriate for kernel development.

Another way to avoid getting a second debugging target computer is to use VMWare or Microsoft VirtualPC. I use VMWare every day to develop and test drivers. It is amazingly good for this sort of development. You can do remote kernel debugging over a named pipe from host to guest, and you can set up restore points, so you never have to worry about crashing and burning on a test computer. No more 10-minute re-ghosting procedures - restoring a crashed VM takes 10 seconds on my test computer. Note, however, that this is not totally sufficient for kernel development, as VMWare doesn't emulate a multi-processor VM yet, and it can't do 64-bit CPUs. All things considered, I'd never be caught trying to do driver development without VMWare. I've never used VirtualPC, but I hear it's similar. VPC has the advantage of being included in MSDN subscriptions.

Working With The Debugger
On your debugging host computer, start WinDBG, choose Kernel Debug from the menu, and enter the appropriate communication parameters. Once you hit OK, the debugger will check the target computer, and you're ready to go. Hit "Ctrl+Break" to break into the target. What you do from here is best learned by reading the debugger's help file; type .hh at the kd> prompt for more.

If your computer crashes while a debugger is attached, it won't bluescreen. Instead, it will break into the debugger and give you a chance to figure out what is going on. One of the most useful commands to type at the kd> prompt on a crashed computer is "!analyze -v". It will invoke an analysis extension in the debugger that is extremely good at figuring out what is wrong with the crashed computer. If you ever post a question on a public forum about a crashing driver, please be sure to include output from this command.

In order to get !analyze -v to work correctly, you must be running the correct symbols. Fortunately, Microsoft has fixed the Hell of Symbols in recent years. In current debuggers, you can use a symbol path that points to an Internet server from which the correct symbols are automatically downloaded. Not 100% of the OS symbols are found on the symbol server yet, so I also use an old-fashioned symbol directory for things such as service pack symbols, checked build symbols, etc. My symbol path winds up looking like:

srv*x:\symbols\symserv*http://msdl.microsoft.com/download/symbols;x:\symbols\2ksp4chk;...

Once your symbols are set up correctly, issuing a .reload command from the kd> prompt will load symbols to match the running binaries. The stack ('kb' at the kd> prompt) should now look more reasonable, depending on the kind of crash you have.

Wrap-Up
I only have one complaint about WinDBG: the Hell of Docking. Microsoft recently re-worked the user interface (I think this happened with 6.3), and I can't get the damned thing to lay out the way I want it to any more. If anyone from Microsoft is reading, *please*, free us from the Hell of Docking! Let my people go!

I hope you've found this little tutorial useful. Debugging is an art form that takes a lifetime to master, and a nontrivial amount of learning just to become basically functional. The help file is good, and there are other resources on the Internet (particularly on microsoft.com). Also, don't hesitate to post any questions to the WINDBG mailing list hosted by www.osronline.com, or to one of the public Microsoft newsgroups dealing with debugging. The OSR list is monitored by several people who know exactly what they're doing, and the Microsoft groups get a lot of Microsoft employee participation from people who have above-average amounts of clue.

Happy debugging!