I came home tonight to see reports of blue screens of death and failed boot up/reboots for XP machines that had installed the MS10-015 security patch or hotfix. There is a long thread on Microsoft Answers.
After installing the patch you have to reboot. You are then greeted with:
STOP: 0x00000050 (0x80097004, 0x00000001, 0x80515103, 0x00000000).
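If you want to read that line rather than just stare at it, here is a minimal Python sketch that splits it into the bug check code and its four parameters. Bug check 0x50 is PAGE_FAULT_IN_NONPAGED_AREA; the parameter meanings in the comments follow Microsoft's bug check reference for XP-era systems. The function names `parse_stop` and `describe_0x50` are my own, purely for illustration.

```python
import re

# The stop line from the blue screen, written with plain ASCII "0x" prefixes.
STOP_LINE = "STOP: 0x00000050 (0x80097004, 0x00000001, 0x80515103, 0x00000000)"


def parse_stop(line):
    """Return (bug_check_code, [param1, param2, param3, param4]) as integers."""
    values = [int(v, 16) for v in re.findall(r"0x[0-9A-Fa-f]+", line)]
    if len(values) != 5:
        raise ValueError("expected a stop code followed by four parameters")
    return values[0], values[1:]


def describe_0x50(params):
    """Describe the parameters of bug check 0x50 (PAGE_FAULT_IN_NONPAGED_AREA).

    param1: memory address that was referenced
    param2: 0 = read operation, 1 = write operation
    param3: address of the code that referenced the memory (if known)
    param4: reserved
    """
    address, access, caller, _reserved = params
    operation = {0: "read", 1: "write"}.get(access, "unknown access")
    return f"{operation} to 0x{address:08X} by code at 0x{caller:08X}"


code, params = parse_stop(STOP_LINE)
if code == 0x50:
    print("PAGE_FAULT_IN_NONPAGED_AREA:", describe_0x50(params))
```

This only tells you what kind of fault it was, not which driver or file is to blame.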
The posted fix by Microsoft in the thread is:
- Boot from your Windows XP CD or DVD and start the recovery console (see this Microsoft article for help with this step). Once you are in the Repair Screen:
- Type this command: CHDIR $NtUninstallKB977165$\spuninst
- Type this command: BATCH spuninst.txt
- When complete, type this command: exit
There is what appears to be some misinformation or hysteria about this. For example:
- Some news articles are claiming that Windows 2003 and Vista are reported in this thread as being affected. I saw no mention of those operating systems.
- I saw one article (a random find) that tried to make it look like this affected Hyper-V. Pah! It does not, from everything I have read. There are no reports of issues with Windows Server 2008 or Windows Server 2008 R2. Put the Kool-Aid down and step away from the cup.
I cannot claim there are not problems there but they are not in that thread.
EDIT: Overnight after this blog post was originally written, some people did post about Vista and W2003 suffering issues with blue screens caused by the update.
It is bad that a patch has affected many. I’m sure MS will be making someone feel very uncomfortable overnight about this. It’s bad that it happened at all. But let’s face it. Not everyone is affected. There is some combination of factors that is contributing to the blue screen. There is some scenario that MS didn’t test or couldn’t predict. These things happen. It could be some niche piece of software or driver that reacts badly to the patch.
EDIT: I’ve read on one site that some people are finding an issue with the ATAPI.SYS file not looking like the genuine file supplied by MS. They suspect an old malware issue causes an incompatibility with the fix!!!
This situation (whatever the actual cause of the blue screens) is why I think people like Steve Reilly, who preach that we should all push out security updates immediately and without question, are wrong for me (maybe not wrong for you). How many zero-day exploits have there ever been? Not many. Think of the big bad attacks … Nimda, SQL Slammer, MS Blaster, Conficker. They all attacked vulnerabilities that had been fixed with patches long beforehand. What’s a couple of weeks? It’s because of the rare occasion when a patch goes wrong that I run a 3 phase process for patches.
I have three groups in WSUS. I configure my Windows Update agents either via group policy (AD members) or registry edits (.REG files for workgroup members) to be members of 1 of 3 groups:
- Testing – contains VM’s with various blends of OS and application
- Management – Our production AD, management systems, and online applications
- Hosting – Hosted customer servers
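For the workgroup members, the registry edit amounts to the standard client-side targeting values under the Windows Update policy key. Here is a rough Python sketch that builds the .REG text for a given group. The WSUS URL `http://wsus.example.local` is a made-up placeholder and `build_reg` is my own helper name; the registry value names (`WUServer`, `WUStatusServer`, `TargetGroupEnabled`, `TargetGroup`, `UseWUServer`) are the documented ones. It assumes the WSUS server is set to let computers assign themselves to groups via Group Policy or registry settings.

```python
# Policy key used by the Windows Update agent for WSUS settings.
WU_POLICY_KEY = r"HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate"

# Group names from the list above.
GROUPS = ["Testing", "Management", "Hosting"]


def build_reg(group, wsus_url):
    """Return the text of a .REG file that points a machine at WSUS
    and places it in the given client-side targeting group."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{WU_POLICY_KEY}]",
        f'"WUServer"="{wsus_url}"',
        f'"WUStatusServer"="{wsus_url}"',
        '"TargetGroupEnabled"=dword:00000001',
        f'"TargetGroup"="{group}"',
        "",
        f"[{WU_POLICY_KEY}\\AU]",
        '"UseWUServer"=dword:00000001',
        "",
    ]
    return "\r\n".join(lines)


# Hypothetical server name, for illustration only.
print(build_reg("Testing", "http://wsus.example.local"))
```

Save the output as a .REG file per group and import it on the relevant workgroup machines; AD members get the same values from group policy instead.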
We’re a hosting company. WSUS has an automatic approval policy for the Testing group. The machines in that group are VM’s on my Hyper-V lab server. They patch in the late morning/early afternoon (around lunch) so we can see how they reacted.
Ideally that group would contain samples of the various bits of hardware you have on the network, so that drivers are included in the mix. I was lucky enough to be able to do that with one employer in the past – but we did push out updates in less than a week from release. However, I need to be cost conscious and that is not an option now.
When we’re happy we sit and watch the news. If all is well, change control happens, and then we approve the updates for the management network. Stealing a line from Microsoft, we eat our own dog food. Over the 3 nights of the following weekend (Friday, Saturday, Sunday), machines are patched and reboot automatically. Some services are clustered/replicated and we do them on different nights or time slots. We have scheduled scripts on the OpsMgr RMS to put machines into maintenance mode.
Now we watch how that went and continue to watch the news wires. If there are no more problems then we approve the updates for the hosting customers after another change control process. Patches then deploy according to their pre-agreed time windows.
The end result is that within 2-3 weeks all security updates are deployed. You could compress this down to a week. We are totally minimizing the risk of being stung by a “bad” update. Like I said earlier, MS probably did test the update as far as is realistically possible. There is always the chance that something bad happens.
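To put rough dates on that, here is a small Python sketch of the three phases. The only facts taken from our process are the order of the phases and the Friday-to-Sunday window for the management network. The gaps are assumptions of mine for illustration: Testing patches the day after release, Management takes the first Friday after that, and Hosting approval comes a week after the management weekend ends (`soak_days`). Customer time windows then add more on top.

```python
from datetime import date, timedelta

FRIDAY = 4  # date.weekday(): Monday is 0, Friday is 4


def patch_tuesday(year, month):
    """Second Tuesday of the month, when MS normally releases updates."""
    first = date(year, month, 1)
    days_to_first_tuesday = (1 - first.weekday()) % 7
    return first + timedelta(days=days_to_first_tuesday + 7)


def next_weekday(start, weekday):
    """First date strictly after `start` that falls on `weekday`."""
    delta = (weekday - start.weekday() - 1) % 7 + 1
    return start + timedelta(days=delta)


def phased_schedule(release, soak_days=7):
    """Illustrative dates for the Testing, Management and Hosting phases."""
    testing = release + timedelta(days=1)              # assumed: lab VMs patch next day
    management_start = next_weekday(testing, FRIDAY)   # Friday night
    management_end = management_start + timedelta(days=2)  # Sunday night
    hosting_approval = management_end + timedelta(days=soak_days)  # assumed gap
    return {
        "release": release,
        "testing": testing,
        "management_start": management_start,
        "management_end": management_end,
        "hosting_approval": hosting_approval,
    }


for phase, day in phased_schedule(patch_tuesday(2010, 2)).items():
    print(f"{phase:17} {day:%a %d %b %Y}")
```

Shrinking `soak_days` is how you would compress the whole thing towards a week.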
Steve Reilly’s argument was that if you get a bad update then you can easily roll back your server farm because it’s probably 90% virtual. In my opinion you shouldn’t really use snapshots in production on Hyper-V. They’re supported but they suck the life from your VM’s. DPM or 3rd party solutions that are using the Hyper-V VSS writer are cool for this. But really, do you want to risk your production network going down for hours while you recover (starting at 3am when your patch failed) because of the rush to deploy an update that will likely not have an attack vector for quite some time?
Weigh the various risks and make an informed decision for yourself. Maybe Steve Reilly’s approach to push out updates without testing is right for you. Maybe my phased and cautious approach is. Maybe there is a middle ground that you prefer. Do the research and be sure you know why you make your decision and that it is based on fact.
There is strong suspicion that the BSOD’s are actually happening on machines that were already infected by a rootkit called TDSS. It attacks ATAPI.SYS and replacing that file appears to fix the BSOD issue as well. Microsoft Security Essentials appears to be able to detect it.