The Operations Manager 2007 agent and management server communicate with each other and perform mutual authentication using Kerberos. They’re in the same forest and hence in the same Kerberos domain. But what happens if you have agents outside the forest? If you read anything from Microsoft (or the OpsMgr book I just bought) you’d be left with the impression that you must install the OpsMgr gateway. You’d then install a custom X.509 cert (requiring a cert server running on Windows Enterprise Edition) on that machine and on the OpsMgr server. There are two problems with this:
- What if the untrusted network is a workgroup, e.g. a DMZ? There’s no Kerberos domain for the agents on the network to authenticate with the Gateway.
- What if you are monitoring many networks with only one or two agents on each network? Are you going to install lots and lots of Gateways?
If you are persistent with your searches you will find that:
- There is one mention by Microsoft in a downloadable Word document that you can install agents with the X.509 cert so that the agents can communicate directly with the management server.
- There is an almost complete guide by Duncan McAlynn on how to install the certs using MOMCERTIMPORT /SUBJECTNAME (the subject name is the name of the cert in the certificate store).
Duncan appears to be the only person to have attempted to document this process, so he deserves credit for it. The MS documentation folks have done a poor job with OpsMgr, e.g. failing to cover this subject and failing to document complete management pack authoring. The instructions for setting up the CA are in the OpsMgr 2007 Security Guide, and Duncan walks you through installing the agent. The only missing step is that you need to install and import CA and agent certs on the OpsMgr management server(s) so that they have a means for mutual authentication with the agents.
I’d been doing this successfully on servers and then I hit one server where the agent could not use the cert. I saw the following in the Operations Manager Event Log:
Source: OpsMgr Connector
Event ID: 21036
The certificate specified in the registry at HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Machine Settings cannot be used for authentication. The error is The credentials supplied to the package were not recognized
I reissued that cert, re-imported it, and re-installed the agent half a dozen times. I’d opened a call with MS (thanks to IT Pro Momentum) but the first PSS agent was not the Mae West to deal with. He kept claiming that my CA was at fault but I knew it wasn’t; other agents were fine. Finally the ticket got reassigned to Brian, who was a pleasure to work with.
He started coming up with some new ideas straight away. The first was that maybe the cert store was corrupt. I tried a fix for that (CERTUTIL -F -REPAIRSTORE MY "<thumbprint of agent cert>") but that didn’t fix the problem. Brian asked if we could look at the server together using "EasyAssist" … it’s MS’s answer to WebEx or LogMeIn so they can get Remote Assistance over web friendly protocols. We poked around and saw something interesting.
- The CA cert in Computer\Trusted Root Authorities was fine.
- The agent cert in the Computer\Personal store was fine. The certification path was fine.
- When you run MOMCERTIMPORT it copies the cert into Computer\Operations Manager in the certificate store. I had overlooked this. Here, the certification path was invalid. Weird, because it was fine in the Computer\Personal store.
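To compare the two copies of the agent cert from the command line rather than the MMC snap-in, something like the following might help. It is only a sketch: it builds `certutil -verifystore` commands and prints them without running anything. The store name "Operations Manager" is my assumption for how certutil addresses the store that MOMCERTIMPORT copies into, and the thumbprint is a made-up placeholder.

```python
from typing import List, Sequence


def verify_commands(thumbprint: str,
                    stores: Sequence[str] = ("My", "Operations Manager")) -> List[List[str]]:
    """Build one 'certutil -verifystore' command per store (nothing is executed).

    "My" is the Computer\\Personal store. "Operations Manager" is an assumed
    store name; check what the store is actually called on your agent.
    """
    return [["certutil", "-verifystore", store, thumbprint] for store in stores]


if __name__ == "__main__":
    # Hypothetical thumbprint, for illustration only.
    for command in verify_commands("ab12cd34"):
        print(command)
```

If the chain validates in one store and not in the other, you are looking at the same symptom described above.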
We manually imported the cert into that store and the certification path was still screwed. We re-imported the CA cert but it was still screwed. We re-imported the CA cert and the Operations Manager copy of the cert. The certification path was fine but the agent didn’t appear to be using it. We re-ran MOMCERTIMPORT and the certification path was invalid again. OK … I thought we’d try this:
- Delete all copies of the agent and CA certs from the certificate store.
- Brian suggested restarting the Cryptographic Services and OpsMgr Health services.
- I went through the process of re-importing: import the CA cert into Computer\Trusted Root Authorities, import the agent PFX into Computer\Personal, re-run MOMCERTIMPORT /SUBJECTNAME and restart the OpsMgr Health service.
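For anyone who wants to script the three steps above, here is a rough sketch of the order of operations. It only builds the command list and prints it; nothing is executed. Several things are my assumptions rather than anything Brian or I typed on the day: the service names `CryptSvc` and `HealthService`, the use of `certutil -delstore`, `-addstore` and `-importpfx` instead of the MMC snap-in, and every file name, subject name and thumbprint shown.

```python
from typing import List, Sequence, Tuple


def reimport_plan(ca_cert: str, agent_pfx: str, subject_name: str,
                  stale: Sequence[Tuple[str, str]]) -> List[List[str]]:
    """Return the clean re-import commands in order (nothing is executed).

    stale is a list of (store name, thumbprint) pairs for the old CA and
    agent cert copies that should be deleted first.
    """
    plan: List[List[str]] = []

    # Step 1: delete every old copy of the agent and CA certs.
    for store, thumbprint in stale:
        plan.append(["certutil", "-delstore", store, thumbprint])

    # Step 2: restart the cryptography and OpsMgr Health services
    # (service names are assumptions).
    for service in ("CryptSvc", "HealthService"):
        plan.append(["net", "stop", service])
        plan.append(["net", "start", service])

    # Step 3: CA cert into the trusted roots, agent PFX into Computer\Personal,
    # then let MOMCERTIMPORT pick the cert up by subject name.
    plan.append(["certutil", "-addstore", "Root", ca_cert])
    plan.append(["certutil", "-importpfx", agent_pfx])
    plan.append(["MOMCertImport", "/SubjectName", subject_name])

    # Finally, restart the Health service so the agent uses the new cert.
    plan.append(["net", "stop", "HealthService"])
    plan.append(["net", "start", "HealthService"])
    return plan


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    for command in reimport_plan("ca.cer", "agent.pfx", "agent01.example.com",
                                 [("My", "ab12cd34"), ("Root", "ef56ab78")]):
        print(" ".join(command))
```

Printing the plan first and running the commands by hand is deliberate: on a box that is already misbehaving I would rather see each step succeed than have a script plough on.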
Lo and behold … it worked! In fact, it worked so well that we detected a hardware fault on the server that we hadn’t known about. Sweet; OpsMgr rules!
A big "Thank You" to Brian for helping out on that one. For the most part, I’ve always had good dealings with MS PSS agents going back to 2003. It was good to see this one being rescued so professionally.