You can almost always count on interesting things happening during Update Weekend. Sometimes a patch will yield
unexpected results, sometimes you
lose access to the server after initiating a restart (and yet the server doesn’t actually restart), and so on. Well, this past weekend was no different, but the types of issues encountered were.
As such, I’m starting a new series of posts demonstrating how troubleshooting was approached in a particular situation, to help others identify possible troubleshooting steps or avenues when they encounter similar problems. We’ll start with a rather typical symptom (a server restarted remotely that could not be reached when it should have come back up) that had a very unusual root cause.
As mentioned, this started when I lost access to the server in question following a remote restart request. When doing updates, we always do a clean restart of the system prior to installing anything, to make sure the server comes up cleanly; that way, if there are problems afterward, we know they’re NOT related to the updates. Anyway, I restarted the server in question Saturday morning at 8:30am, and by 9:00am I knew it wasn’t coming back. Not only could I not connect via RDP, but telnet to port 25 to check SMTP was also failing, so the server clearly hadn’t come back up.
I was able to reach a contact for this customer and got someone on site to take a look. Maybe it received a shutdown command instead of a restart, maybe it lost power, whatever. The on-site contact was able to log into the server, but it was running very slowly. We checked the basics: Did it have a valid IP address? It did. Could the server ping the default gateway? It could. Could the server ping www.google.com? It could not. Hm, sounds like a DNS issue. I asked the on-site person to open the Services control panel, and it took about 5 minutes to open. Not good. At that point, I arranged to go on site myself.
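For reference, that basic connectivity triage can be run from a command prompt on the server itself. The gateway address below is a placeholder, not the customer's actual configuration:

```shell
REM Confirm the IP configuration and which DNS servers the machine is using
ipconfig /all

REM Basic reachability: ping the default gateway by IP (placeholder address)
ping 192.168.1.1

REM Name resolution test: if pings by IP work but this fails, DNS is the suspect
ping www.google.com
nslookup www.google.com
```

The order matters: testing by IP first separates "the network is down" from "name resolution is broken," which is exactly the fork this incident turned on.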
When I arrived, the server was still running very sluggishly. I confirmed the tests we had already done: the ipconfig output was correct, and basic networking was working (the server could ping the gateway and other internal resources by IP), but DNS was failing. I tried an nslookup, and the request to the DNS server timed out. OK, sounds like the DNS service isn’t running. I looked in the open Services console, and sure enough the DNS Server service was stuck in a Starting (but never Started) state. That’s when I noticed that a number of Automatic services had not started, including (but not limited to) DHCP Server, Event Log, Terminal Services, SMTP, WINS, and a few others.
OK, so that explains why the server can’t get out to the Internet, and why I couldn’t remotely access the server. Now what? Let’s try to start some of the services and see if it’s just a startup glitch that kept them from launching at boot. I started with DHCP simply so we could get workstations back up if needed. DHCP Server wouldn’t start because one of its service dependencies didn’t start. OK, that’s another step towards the solution. Let’s look at the dependencies for the DHCP Server service and the other services that didn’t start and find a common service.
After looking at the dependencies for most of the services, the common thread was the Event Log service. So if we could get the Event Log service running, we’d probably get several of the others started. Next step: reboot into Safe Mode and see if that alters the behavior. We restarted the server in Safe Mode with Networking and saw the same problems: Event Log and other services that should start in Safe Mode were not starting. At that point we rebooted back into normal mode and continued troubleshooting from there.
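The dependency walk above can also be done from the command line with sc.exe, which is faster than clicking through each service's properties dialog. The service names here are the short names (which differ from the display names shown in the Services console); DHCPServer and Eventlog are the standard short names on this era of Windows Server:

```shell
REM Show the configuration of the DHCP Server service;
REM the DEPENDENCIES lines list what it needs running first
sc qc DHCPServer

REM Go the other direction: list every service that depends on Event Log
sc EnumDepend Eventlog

REM Check the current state of the Event Log service itself
sc query Eventlog
```

Running `sc EnumDepend` against the suspected common dependency is a quick way to confirm the theory: if most of the stuck services show up in its output, you've found your choke point.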
It was possible that a corrupt event log file was keeping the service from starting. So I went into C:\WINDOWS\system32\config, moved the event log files (*.evt) to a different directory, and tried to start the Event Log service. It still failed to come up, but only 4 new log files got created, even though I had moved 8 or 9 out of the folder. Hm. What was the last log created? The DNS log. Let’s take a look in Event Viewer and see which log file might be causing the problem.
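The "park the logs and retry" step looks roughly like this; the backup directory name is arbitrary, and note that a service hung in the Starting state may not respond until after a reboot:

```shell
REM Park the existing event log files somewhere safe (backup path is arbitrary)
mkdir C:\evt-backup
move C:\WINDOWS\system32\config\*.evt C:\evt-backup\

REM Try to start the Event Log service again; it recreates missing .evt files
net start Eventlog
```

Moving rather than deleting the files is deliberate: if a log file turns out to be fine, it can be put back, and the contents are preserved for later forensics either way. The useful clue in this case wasn't whether the service started, but which log files it did and didn't recreate.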
Boom, that’s when I found the issue. Even though Event Viewer couldn’t display the contents of the log files (since the service wasn’t started), I could see all the logs it wanted to display, and that’s where I found the errant entry: one of the logs had a name that started with FSSCRM and looked more like an error message than a legitimate event log title. Since the Event Log service loads its list of logs from the registry, I opened regedit and browsed to HKLM\SYSTEM\CurrentControlSet\Services\Eventlog. Sure enough, there was a key with that unusual name, and the values in that key pointed to locations on the server that didn’t exist. I exported the key to a registry file (just in case) and then deleted it. When I attempted to start the Event Log service again, it fired right up, as did all of the dependent services. Of course, we did another full reboot of the system to make sure all services started as expected, and sure enough they did.
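The back-up-then-delete step can be done with reg.exe instead of regedit. The key name below is a placeholder; the real one started with FSSCRM but I'm not reproducing it here:

```shell
REM Export the suspect key first, just in case (key name is a placeholder)
reg export "HKLM\SYSTEM\CurrentControlSet\Services\Eventlog\FSSCRM-badkey" C:\badkey-backup.reg

REM Delete the key without a confirmation prompt, then retry the service
reg delete "HKLM\SYSTEM\CurrentControlSet\Services\Eventlog\FSSCRM-badkey" /f
net start Eventlog
```

Exporting before deleting is the cheap insurance that makes registry surgery like this reversible: if removing the key had made things worse, double-clicking the .reg file would have restored it.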
While I still have no idea how this key got into the registry, or whether it was a valid key that somehow got corrupted, we got the server back online and the system running, which gave me time to research what service might have been associated with that erroneous log setting. It also serves as a lesson: just because something looks like a networking problem doesn’t mean a networking problem is at the core. And it’s another good reason why you shouldn’t go mucking around in the registry without good reason. One small malformed registry key effectively brought down this server, at least from the business owner’s perspective.