Cluster service failure after AD lockdown
Users were unable to connect to their shares. John discovered that the Cluster service wasn't started, and that any attempt to start it resulted in error 1068.
He attempted to ping the virtual server's IP address and it returned a "request timed out" message. He got the same error when trying to ping the cluster node's public adapter.
When he got to the node he found the Cluster service in a Starting state. He soon discovered that he had no network connectivity to or from either Cluster node, and that their network cards were missing from "Network Connections." The only changes made to the network were a few minor group policy settings to lock down permissions a bit. Maybe that had something to do with this? It looked like it was going to be a long night...
This is another fairly common problem. This is not really just a Cluster problem, but that is usually how it is presented to me. Of course if networking is not functional, then Cluster isn't going to work either. :) I have worked at least three of these issues in the last two months, and thought it warranted discussion since there isn't a public KB article on this particular scenario yet. I hope to fully document every error encountered here, so that others may find this post when they run into this situation. (KB articles sometimes take a while to get published)
System event log:
SAM event ID: 12291 "SAM failed to start the TCP/IP or SPX/IPX listening thread"
IPSec event ID: 4292 "The IPSec driver has entered Block mode."
DfsSvc event ID: 14523 "DFS could not contact any DC for Domain DFS operations."
Application event log:
EventSystem event ID: 4609 "The COM+ Event System detected a bad return code during its internal processing. HRESULT was 80004015 from line 142 of d:\nt\com\complus\src\events\tier2\service.cpp."
Other problems discovered with this node:
The COM+ Event System, Network Connections and Shell Hardware Detection services were in a Starting state.
The following services failed to start:
Cluster Service: Error 1068: The dependency service or group failed to start.
File Replication: Error 1068: The dependency service or group failed to start.
--Viewing the dependencies opens a window titled "Service Dependencies" with the message: Win32: Access is denied.
IPSEC Services: Error 1899: The endpoint mapper database entry could not be created.
System Event Notification: Error 1068: The dependency service or group failed to start.
--Trying to view the dependencies on the server returns the following message: Win32: Access is denied.
Task Scheduler: "The endpoint mapper database could not be loaded"
We have three services failing with "the dependency service or group failed to start."
When we try to view the dependencies we get an access denied message.
Let's look in the registry to see what each of these services depends on:
Cluster service:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusSvc
DependOnService:
ClusNet
RpcSs
W32Time
NetMan
File Replication:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NtFrs
DependOnService:
EventLog
RpcSs
EventSystem
System Event Notification:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\SENS
DependOnService:
EventSystem
So the common dependencies are RpcSs and EventSystem.
RpcSs is the Remote Procedure Call (RPC) service, and EventSystem is the COM+ Event System service. We know from earlier that COM+ Event System is one of the services stuck in a Starting state, so that is why the File Replication and System Event Notification services haven't started. One of the other dependencies for the Cluster service is NetMan, which is the Network Connections service. Network Connections is also one of the services stuck in a Starting state.
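As a rough illustration, the comparison above can be reproduced in a few lines of Python. This is only a sketch: the DEPENDS_ON table is typed in by hand from the DependOnService values listed above, and the helper name shared_dependencies is made up for this example. Nothing here reads the registry.

```python
from collections import Counter

# DependOnService values copied by hand from the registry keys listed above.
DEPENDS_ON = {
    "ClusSvc": ["ClusNet", "RpcSs", "W32Time", "NetMan"],
    "NtFrs": ["EventLog", "RpcSs", "EventSystem"],
    "SENS": ["EventSystem"],
}


def shared_dependencies(depends_on):
    """Return, sorted by name, the dependencies listed by more than one service."""
    counts = Counter(dep for deps in depends_on.values() for dep in set(deps))
    return sorted(dep for dep, n in counts.items() if n > 1)


print(shared_dependencies(DEPENDS_ON))
```

For the three services above this prints ['EventSystem', 'RpcSs'], which matches the two names picked out by eye.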
So now the real question is: Why are the COM+ Event System and Network Connections services not starting?
If we view the dependencies for these two services, we just find RpcSs listed. So it all boils down to RPC. However, the Remote Procedure Call (RPC) service is actually started.
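Putting the pieces together, each failed service can be traced through a stuck service back to RpcSs. The sketch below is hypothetical: both tables are hand-copied from this post (including the single RpcSs dependency for EventSystem and NetMan), and blocking_chains is a name invented for illustration.

```python
# Dependencies as described in this post; entries are typed in by hand.
DEPENDS_ON = {
    "ClusSvc": ["ClusNet", "RpcSs", "W32Time", "NetMan"],
    "NtFrs": ["EventLog", "RpcSs", "EventSystem"],
    "SENS": ["EventSystem"],
    "EventSystem": ["RpcSs"],
    "NetMan": ["RpcSs"],
}

# Services observed in a Starting state that other services depend on.
STUCK_STARTING = {"EventSystem", "NetMan"}


def blocking_chains(service):
    """List (service, stuck dependency, what the stuck dependency needs) tuples."""
    chains = []
    for dep in DEPENDS_ON.get(service, []):
        if dep in STUCK_STARTING:
            for root in DEPENDS_ON.get(dep, []):
                chains.append((service, dep, root))
    return chains


for failed in ("ClusSvc", "NtFrs", "SENS"):
    for svc, stuck, root in blocking_chains(failed):
        print(f"{svc} is waiting on {stuck}, which depends on {root}")
```

Every chain ends at RpcSs, which is why the investigation turns to the RPC service next even though it reports as started.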
If you do a search in the knowledge base on these errors, you are likely to come across this article:
909444 Systems that have changed the default Access Control List permissions on the %windir%\registration directory may experience various problems after you install the Microsoft Security Bulletin MS05-051 for COM+ and MS DTC
This discusses changes made by a hotfix that would cause these problems. The fix is to correct NTFS permissions on the %SystemRoot%\Registration directory. However, in this case the permissions on the node already matched those given in the article, so this was not the cause.
You may also come across this one:
916254 COM+-related events may be logged in Event Viewer when you install Windows XP Service Pack 2 and join the computer to a domain
Most would come across this second article and instantly dismiss it since it says "Windows XP Service Pack 2." However, we have a lot of the same symptoms, and since XP SP2 and Server 2003 SP1 include a lot of the same security changes it warrants further investigation.
One of the security changes in SP1 for Windows Server 2003 was to change the Logon Account used for RPC.
RPC used to log on as Local System and now uses an account with fewer privileges: Network Service.
The article states that this issue occurs if the SERVICE account is missing from the policy setting "Impersonate a client after authentication."
We can see if SERVICE is missing from this policy by performing the following steps:
1. Open up Local Security Policy in order to see what the effective settings are:
Start, Run, secpol.msc
2. Expand Local Policies, User Rights Assignment and then open up "Impersonate a client after authentication"
At minimum the following should be listed: Administrators and SERVICE
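If you would rather script this check, one possible approach is to export the user rights with "secedit /export /cfg rights.inf /areas USER_RIGHTS" and inspect the SeImpersonatePrivilege line. The Python sketch below makes several assumptions: the file name rights.inf and the function names are invented for this example, the export is assumed to list well-known accounts as SIDs prefixed with "*" (S-1-5-32-544 for Administrators, S-1-5-6 for SERVICE), and the text is passed in as a string. secedit typically writes the file as UTF-16, so a real file would need to be read with that encoding.

```python
# Well-known SIDs for the two principals this post says must hold the right.
REQUIRED = {"S-1-5-32-544": "Administrators", "S-1-5-6": "SERVICE"}


def holders_of(inf_text, right="SeImpersonatePrivilege"):
    """Return the entries assigned to a user right in a secedit export."""
    in_section = False
    for raw in inf_text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            in_section = line.lower() == "[privilege rights]"
            continue
        if in_section and "=" in line:
            name, _, value = line.partition("=")
            if name.strip().lower() == right.lower():
                return {e.strip().lstrip("*") for e in value.split(",") if e.strip()}
    return set()


def missing_principals(inf_text):
    """Names from REQUIRED that appear neither by SID nor by name."""
    held = {h.lower() for h in holders_of(inf_text)}
    return sorted(
        name for sid, name in REQUIRED.items()
        if sid.lower() not in held and name.lower() not in held
    )


# Hypothetical export from a locked-down server where SERVICE was removed.
SAMPLE = """[Unicode]
Unicode=yes
[Privilege Rights]
SeImpersonatePrivilege = *S-1-5-32-544
"""

print(missing_principals(SAMPLE))
```

An empty list means both accounts are present; anything else names what needs to be added back to the policy.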
The problem that I have seen recently happens when someone decides to change the "Impersonate a client after authentication" user right in group policy. Typically, they decide to lock down their servers and give only specific accounts certain privileges. However, after the SERVICE account is incorrectly removed from this user right, the server loses all network connectivity. Fortunately, this problem doesn't show up until after a reboot, so you have an opportunity to identify it before causing a major outage of all servers in a large OU.
The fix is simple for the servers that haven't been restarted:
1. Correct the policy and then force group policy to be reapplied. (gpupdate /force)
(To correct the policy: just add SERVICE and Administrators to this policy setting in addition to the other ones defined)
If you have already rebooted the servers after applying the incorrect policy settings, simply changing the policy back will not fix them, because they have already lost network access and cannot receive the corrected policy (unless the policy change was made locally to begin with). In that case:
1. Export the following registry key:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\RpcSs
2. In the services snap-in: Change Remote Procedure Call (RPC) to start up with the Local System account instead of Network Service, and then reboot
3. At this point the majority of the services should be started and we should now have network access. Ensure that the offending group policy has been corrected with the proper accounts, force group policy to apply (gpupdate /force), and then reboot.
4. Change the logon account for the Remote Procedure Call (RPC) service back to Network Service by importing the reg file that you exported in step one, and then reboot. Alternatively, navigate to the following registry key, make the change below, and then reboot:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\RpcSs
Change the ObjectName value from LocalSystem to: NT Authority\NetworkService
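If the exported reg file is not at hand, the same single value could in principle be written out as a small .reg file. The Python sketch below is an illustration under assumptions, not a tested recovery tool: the function name reg_string_entry is invented, ObjectName is assumed to be a plain string value, and the output follows the "Windows Registry Editor Version 5.00" text format, in which backslashes and quotes inside string values are escaped. Compare the result with the file exported in step one before importing anything.

```python
def reg_string_entry(key_path, value_name, value):
    """Build the text of a version 5.00 .reg file that sets one string value."""

    def esc(text):
        # Backslashes and double quotes must be escaped inside quoted strings.
        return text.replace("\\", "\\\\").replace('"', '\\"')

    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key_path}]",
        f'"{esc(value_name)}"="{esc(value)}"',
        "",
    ]
    return "\r\n".join(lines)


RPCSS_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\RpcSs"

# The account name is taken from the step above.
reg_text = reg_string_entry(RPCSS_KEY, "ObjectName", r"NT Authority\NetworkService")
print(reg_text)
```

The key path is written as-is inside the square brackets, while the account name comes out with a doubled backslash, as the file format expects.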
For more information regarding this security setting see article on Technet: SeImpersonatePrivilege
I have commented KB 269229 to reflect the requirement for SERVICE to be included in this User Right.
Comments
The policy you talk about was defined in a group policy for all the domain controllers. Someone had removed the service account from the Local Security Setting you mention, so the problem occurred when a reboot happened after the server had applied its GPs for the first time.
Weird, as the GP has always been there and none of the other DCs have ever had a problem. It seems this problem is related to a hotfix for SP1; the strange thing is that the new DC was running SP2...
Fixed now, so many thanks for your help on this!