Troubleshooting PSOD
The Purple Screen of Death, commonly known as PSOD, is something most of us have seen at one time or another on an ESXi host.
Usually, when we hit a PSOD, we take a screenshot of it, reboot the host, then capture the logs and upload them to VMware support for analysis.
Why not analyze the dump yourself?
Step 1: Sometimes we miss the screenshot of the PSOD. That's alright! If a core dump is configured for the ESXi host, we can extract the dump file to get at the crash logs.
Once the host is back up from the unplanned reboot after the PSOD, log in to the host over SSH (with PuTTY, for example) and go to the core directory. The core directory is where your PSOD logging goes.
The important location is the “core” folder, which contains the kernel dump. The PSOD dumps what was in memory to a file called vmkernel-zdump.1, or something to that effect, and places it in that directory.
So go to:
# cd /var/core
Then list the files here using ls -ltr. You will see a file like the one below.
vmkernel-zdump.1
Step 2: How do we extract it?
Well, there is a handy extract script that does the whole job: “vmkdump_extract”. This command must be run against the zdump.1 file, like this:
# vmkdump_extract vmkernel-zdump.1
This creates several output files.
Note: All we need for the analysis is the vmkernel-log.1 file.
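As a rough sketch of Steps 1 and 2 together, the shell function below picks the newest vmkernel-zdump.* file in a directory so it can be handed to vmkdump_extract. The function name latest_zdump is made up for this example, and it assumes the dump files follow the vmkernel-zdump.N naming described in this post.

```shell
# Hypothetical helper (not part of ESXi): print the newest vmkernel-zdump.*
# file in a directory. The directory defaults to /var/core, the path used in
# this post; pass another path as the first argument if yours differs.
latest_zdump() {
    core_dir="${1:-/var/core}"
    # ls -t sorts newest first; keep only the first entry.
    # Errors are silenced so the output is empty when no dump exists.
    ls -t "$core_dir"/vmkernel-zdump.* 2>/dev/null | head -n 1
}

# Possible usage on the host:
#   cd /var/core
#   dump="$(latest_zdump)"
#   [ -n "$dump" ] && vmkdump_extract "$dump"
```

An empty result means no dump file was found in that directory, in which case there is nothing to extract.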
Step 3: Open the vmkernel-log.1 file using one of the methods below:
a. WinSCP (GUI)
b. less vmkernel-log.1 (command line)
I am a Windows and VMware support engineer, so I definitely prefer the GUI method for analyzing the log file :)
Let’s use WinSCP:
Step 4: Connect to your ESXi host using WinSCP, browse to the /var/core path, and copy vmkernel-log.1 to your local machine.
Step 5: Now that vmkernel-log.1 is on your local machine, open it in an editor such as Notepad++ (right-click the file and choose to edit it with Notepad++) and search for the keyword “BlueScreen”. This takes you to the events around the crash.
The first line, @BlueScreen:, tells you the crash exception, such as Exception 13/14. In my case it points to “LINT1/NMI (motherboard nonmaskable interrupt), undiagnosed. This may be a hardware problem; please contact your hardware vendor”, which indicates a hardware issue.
The VMKuptime tells you how long the kernel had been up before the crash.
The logging after that is the information we need to look at: the cause of the crash.
Note: The crash dump varies for every crash. The causes can range from hardware errors to driver issues, issues with the ESXi build, and a lot more.
If you use method b, skip to the end of the file by pressing Shift+G and slowly work your way up by pressing Page Up. You will come across a line that says @BlueScreen: <event>, and after that you know exactly what you need to check :)
Each dump analysis will be different, but the fundamentals are the same.
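If paging through the file by hand gets tedious, the search can also be scripted. The sketch below prints the first @BlueScreen line and a number of lines after it, each prefixed with its line number. The function name show_bluescreen and the default of 20 following lines are assumptions made for this example, not anything the log format requires.

```shell
# Hypothetical helper: print the first line containing "@BlueScreen" in a log
# file, plus the N lines that follow it (N defaults to 20).
show_bluescreen() {
    log_file="$1"
    after="${2:-20}"
    awk -v n="$after" '
        # On the first match only, start a countdown of n + 1 lines to print.
        /@BlueScreen/ && !found { found = 1; left = n + 1 }
        # While the countdown is running, print the line with its line number.
        left > 0 { print NR ": " $0; left-- }
    ' "$log_file"
}

# Possible usage:
#   show_bluescreen vmkernel-log.1 40
```

Only the first match is used, so a log with several @BlueScreen lines shows the earliest one; increase the second argument to see more of what follows it.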