Tuesday, 29 December 2015

How To Analyze PSOD

The Purple Screen of Death, commonly known as a PSOD, is something most of us encounter sooner or later when running an ESXi host.

Usually when we experience a PSOD, we reboot the host (which is a must), then gather the logs and upload them to VMware support for analysis (where I spend a good amount of time going through them).

Why not take a look at the dumps by yourself?

Step 1:
I am going to simulate a PSOD on my ESXi host. You need to be logged in to the host via SSH. The command is:

# vsish -e set /reliability/crashMe/Panic 1 

And when you open the DCUI of the ESXi host, you can see the PSOD.


Step 2:
Sometimes we might miss out on the screenshot of the PSOD. Well, that's alright! If we have a core dump configured for the ESXi host, we can extract the dump files to recover the crash logs.

Reboot the host if it is stuck on the PSOD screen. Once the host is back up, log in to the host over SSH (PuTTY) and go to the core directory, which is where your PSOD dumps are written:

# cd /var/core 


Then list out the files here:



# ls -lh 

Step 3:
How do we extract it?

Well, we have a nice extract script that does all the work: " vmkdump_extract ". This command must be executed against the zdump.1 file, which looks something like this:

# vmkdump_extract vmkernel-zdump.1 

It creates four files:
a) vmkernel-log.1
b) vmkernel-core.1
c) visorFS.tar
d) vmkernel-pci

All we require for analysis is the vmkernel-log.1 file.

Step 4:
Open the vmkernel-log.1 file using the below command:

# less vmkernel-log.1 

Skip to the end of the file by pressing Shift+G, then slowly scroll back up by pressing PageUp.
You will come across a line that says @BlueScreen: <event>

In my case, the dumps were:

2015-12-17T20:34:03.603Z cpu3:47209)@BlueScreen: CrashMe
2015-12-17T20:34:03.603Z cpu3:47209)Code start: 0x418021200000 VMK uptime: 0:01:14:16.524>
2015-12-17T20:34:03.603Z cpu3:47209)0x412461a5dc10:[0x41802128d249]PanicvPanicInt@vmkernel#nover+0x575 stack: 0x726f632000000008
2015-12-17T20:34:03.603Z cpu3:47209)0x412461a5dc70:[0x41802128d48d]Panic_NoSave@vmkernel#nover+0x49 stack: 0x412461a5dcd0
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5dd60:[0x41802157a63b]CrashMeCurrentCore@vmkernel#nover+0x553 stack: 0x100000278
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5dda0:[0x41802157a8ca]CrashMe_VsiCommandSet@vmkernel#nover+0x13e stack: 0x0
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5de30:[0x41802160c3c7]VSI_SetInfo@vmkernel#nover+0x2fb stack: 0x41109d630330
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5dec0:[0x4180217bd7a7]UWVMKSyscallUnpackVSI_Set@<none>#<none>+0xef stack: 0x412461a67000
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5df00:[0x418021783a47]User_UWVMKSyscallHandler@<none>#<none>+0x243 stack: 0x412461a5df20
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5df10:[0x4180212aa90d]User_UWVMKSyscallHandler@vmkernel#nover+0x1d stack: 0xffbc0bb8

2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5df20:[0x4180212f2064]gate_entry@vmkernel#nover+0x64 stack: 0x0 

  • The first line, @BlueScreen:, tells you the crash exception (for example Exception 13/14); in my case it is CrashMe, which indicates a manually triggered crash. 
  • The VMK uptime tells you the kernel uptime before the crash.
  • The logging after that is the information we need to look through: the cause of the crash. 
Now, the crash dump varies for every crash. The causes can range from hardware errors and driver issues to problems with the ESXi build, and a lot more.
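To save scrolling, the @BlueScreen line and the backtrace that follows it can also be pulled out with a short script. This is just a sketch: the file name vmkernel-log.1 comes from Step 3, but the number of context lines is an assumption, since the real backtrace length varies per crash.

```python
def extract_crash(log_path, context=12):
    """Return the @BlueScreen line plus the lines that follow it.

    'context' is how many lines to keep starting at the marker; adjust it
    to taste, since backtrace length differs from crash to crash.
    """
    with open(log_path) as f:
        lines = f.read().splitlines()
    for i, line in enumerate(lines):
        if "@BlueScreen:" in line:
            return lines[i:i + context]
    return []  # no crash marker found in this log
```

Running extract_crash("vmkernel-log.1") against the extracted log would return the same lines shown in Step 4.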

Each dump analysis will be different, but the basics are the same. 

So, you can try analyzing the dumps by yourself. However, if you are entitled to VMware support, I will do the job for you.

Cheers!

Unable To Delete Orphaned/Stale VMDK File

So today I got a case where we were trying to delete an orphaned flat.vmdk file.

A brief background of what was being experienced here:

There were three ESXi hosts and two shared datastores among these hosts. A couple of folders on these two shared datastores contained only flat.vmdk files. These flat files were not associated with any virtual machines, and their last-modified dates were about a year old.

However, every time we tried to delete the file from the datastore browser GUI, we got the error:

Cannot Delete File [Datastore Name] File_Name.vmdk


So, when we tried to delete this file from the command line using " rm -f <file_name> ", we got the error:

rm: cannot remove 'File.vmdk': No such file or directory

Also:
We were able to move the file to another datastore and remove it there successfully. But a stale copy of the file was still left behind on the original datastore.

So, how do we remove this stale file?

Step 1:

  • Take a SSH session to all the hosts that have access to this datastore where the stale file resides. 
  • In my case all the three hosts in the cluster.

Step 2:

  • Run the below command. It has to be executed from an SSH (PuTTY) session on every host that has connectivity to that datastore.
# vmkfstools -L release /vmfs/volumes/<datastore_name>/<folder_name>/<file_name>-flat.vmdk

This can result in two error outputs:

First error:
Could not open /vmfs/volumes/xxxxxxxx/xxxxxxx/xxxxxx-flat.vmdk 
Command release failed Error: Device or resource busy

Second error:
Command release failed 
Error: Inappropriate ioctl for device


In my case it was the second error.

The host that gives you the second error holds the stale lock on the file. In my case, all three hosts returned the second error, and I had to reboot all three of them. 

Once the hosts are rebooted, you can successfully remove the stale flat.vmdk files. 

Understanding VMkernel.log for vMotion Operation

Let's decode the vMotion logging in VMkernel.log.

Open an SSH (PuTTY) session to the host where the virtual machine currently resides. Change to the log directory:

# cd /var/log 


Capture the live logging of VMkernel using the following command:

# tail -f vmkernel.log 



Perform vMotion of a virtual machine residing on this host to any other available host with a shared storage. You will see the below logging:

I will break down the logging with " // " for comments.

2015-12-25T16:39:25.565Z cpu4:2758489)Migrate: vm 2758492: 3284: Setting VMOTION info: Source ts = 1451061663105920, src ip = <192.168.1.176> dest ip = <192.168.1.177> Dest wid = 1830931 using SHARED swap

//The first line, Migrate: vm 2758492, does not name the virtual machine being migrated. It gives the world ID of the virtual machine that is going to be migrated. To find a virtual machine's world ID before migrating, run the command # esxcli vm process list, which lists the world IDs of all the virtual machines residing on the host.

// The Setting VMOTION info value 1451061663105920 is the vMotion ID. This ID matters because grepping for it in hostd.log or in vmware.log (which resides in the virtual machine's directory) gives you further information about the vMotion. In vmware.log you can see the transitioning states of the vMotion, with each state performing a set of steps.

// The src ip, where the virtual machine currently resides, is 192.168.1.176, and the destination it is being migrated to is 192.168.1.177.

// The Dest wid 1830931 is the world ID this virtual machine will have once the vMotion is completed. 
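The fields called out above can be parsed out of the "Setting VMOTION info" line with a regular expression. This is a sketch matched against the exact field layout of this one log sample; the pattern may need adjusting for other ESXi builds.

```python
import re

# Field layout taken from the "Setting VMOTION info" line shown above.
VMOTION_RE = re.compile(
    r"Migrate: vm (?P<world_id>\d+):.*Setting VMOTION info: Source ts = "
    r"(?P<vmotion_id>\d+), src ip = <(?P<src_ip>[^>]+)> dest ip = "
    r"<(?P<dest_ip>[^>]+)> Dest wid = (?P<dest_wid>\d+)"
)

def parse_vmotion_line(line):
    """Return a dict of the vMotion fields, or None if the line doesn't match."""
    m = VMOTION_RE.search(line)
    return m.groupdict() if m else None
```

Feeding the log line above through parse_vmotion_line() returns the world ID, vMotion ID, both IPs, and the destination world ID in one go, ready to grep for in hostd.log or vmware.log.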


2015-12-25T16:39:25.567Z cpu4:2758489)Tcpip_Vmk: 1288: Affinitizing 192.168.1.176 to world 2772001, Success
2015-12-25T16:39:25.567Z cpu4:2758489)VMotion: 2734: 1451061663105920 S: Set ip address '192.168.1.176' worldlet affinity to send World ID 2772001
2015-12-25T16:39:25.567Z cpu4:2758489)Hbr: 3340: Migration start received (worldID=2758492) (migrateType=1) (event=0) (isSource=1) (sharedConfig=1)

// Here the source host is being prepared for migration, with its IP address taken into consideration.

//The Migration start received line logs the vMotion type. The world ID 2758492 is recorded, and migrateType=1 indicates a host migration.

//The host I am currently logged in to via SSH is the source host, which is why it shows isSource=1 and sharedConfig=1.


2015-12-25T16:39:25.567Z cpu5:2771999)CpuSched: 583: user latency of 2771999 vmotionStreamHelper0-2758492 0 changed by 2771999 vmotionStreamHelper0-2758492 -1
2015-12-25T16:39:25.568Z cpu4:2772001)MigrateNet: 1186: 1451061663105920 S: Successfully bound connection to vmknic '192.168.1.176'

//Here the connection is established from the source host's vMotion VMkernel port.

2015-12-25T16:39:25.570Z cpu5:33435)MigrateNet: vm 33435: 2096: Accepted connection from <::ffff:192.168.1.177>

// Here the destination host has accepted the connection for the vMotion.


2015-12-25T16:39:25.570Z cpu5:33435)MigrateNet: vm 33435: 2166: dataSocket 0x410958a8dc00 receive buffer size is -565184049
2015-12-25T16:39:25.570Z cpu4:2772001)MigrateNet: 1186: 1451061663105920 S: Successfully bound connection to vmknic '192.168.1.176'
2015-12-25T16:39:25.571Z cpu4:2772001)VMotionUtil: 3396: 1451061663105920 S: Stream connection 1 added.
2015-12-25T16:39:25.571Z cpu4:2772001)MigrateNet: 1186: 1451061663105920 S: Successfully bound connection to vmknic '192.168.1.176'
2015-12-25T16:39:25.572Z cpu4:2772001)VMotionUtil: 3396: 1451061663105920 S: Stream connection 2 added.

//Both the source and destination have established the connection, and the vMotion process takes place. The vmkernel.log does not record the details of the memory copy; if you check the vmware.log for this virtual machine, you can see the states and progress of the vMotion in detail.

2015-12-25T16:39:25.848Z cpu3:2758492)VMotion: 4531: 1451061663105920 S: Stopping pre-copy: only 0 pages left to send, which can be sent within the switchover time goal of 0.500 seconds (network bandwidth ~2.116 MB/s, 52403100% t2d)

//In short how vMotion works is:


  • A shadow VM is created on the destination host.
  • Each memory page is copied from the source to the destination via the vMotion network. This is known as preCopy.
  • Another pass is made over the VM's memory, copying any pages that changed during the last preCopy iteration.
  • The pre-copy iterations continue until no changed pages remain.
  • The source VM is stunned and the destination VM is resumed.
//Basically, the memory state of the virtual machine is transferred to the shadow virtual machine created on the destination host. Memory is transferred as pages over the vMotion network. The more actively the VM changes its memory, the longer the vMotion process takes. 

//Towards the end of the vMotion, the source VM must be destroyed and operations must continue at the destination end. For this, ESXi must determine that the last few memory pages can be transferred to the destination quickly, within the switch-over goal of 0.5 seconds.

//So when it says only 0 pages left to send, which can be sent within the switchover time goal of 0.500 seconds, it means there are no more changed memory pages left to transfer. The host therefore declares that the source VM can be destroyed, the vMotion can complete, and the destination VM can resume, all within the feasible switchover time. 
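The pre-copy loop described above can be modeled in a few lines of code. This is a toy simulation, not how the VMkernel actually implements it; the page counts and dirty fraction are made-up numbers for illustration.

```python
def precopy_passes(total_pages, dirty_fraction, switchover_budget_pages, max_passes=20):
    """Simulate iterative pre-copy: each pass resends the pages the guest
    dirtied during the previous pass. The VM is stunned once the remainder
    is small enough to send within the switchover time goal."""
    to_send = total_pages
    for passes in range(1, max_passes + 1):
        sent = to_send                        # transmit everything still dirty
        to_send = int(sent * dirty_fraction)  # guest re-dirties some pages meanwhile
        if to_send <= switchover_budget_pages:
            return passes, to_send            # stun source, send final pages
    return max_passes, to_send                # did not converge within max_passes
```

For example, a VM with 1,000,000 pages that re-dirties 10% of what was sent each pass, against a 500-page switchover budget, converges in 4 passes with 100 pages left for the stun window. A VM that dirties memory faster than the network can drain it never converges, which is why heavily loaded VMs take longer to vMotion.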


2015-12-25T16:39:25.952Z cpu5:2772001)VMotionSend: 3643: 1451061663105920 S: Sent all modified pages to destination (no network bandwidth estimate)

//This tells us that for this vMotion ID, the source ("S") has sent all the modified memory pages to the destination.


2015-12-25T16:39:26.900Z cpu0:2758489)Hbr: 3434: Migration end received (worldID=2758492) (migrateType=1) (event=1) (isSource=1) (sharedConfig=1)
2015-12-25T16:39:26.908Z cpu3:32820)Net: 3354: disconnected client from port 0x200000c
2015-12-25T16:39:26.967Z cpu3:34039)DLX: 3768: vol 'Recovery_LUN', lock at 116094976: [Req mode 1] Checking liveness:

//Here the migration has completed for that world ID and migration type. The virtual machine, which in my case resides on Recovery_LUN, is now locked by the destination host under the new world ID assigned during the vMotion.


So now you know what a successful vMotion looks like in the vmkernel.log.
The in-depth detail of a vMotion can be found in the vmware.log, which becomes self-explanatory once you know what to look for and where to look.

Friday, 27 November 2015

To Use Hyper-V in VMware Workstation

hypervisor.cpuid.v0 = "FALSE"


Add the above line to the .vmx file of the VM used for Hyper-V.

Tuesday, 24 November 2015

VM Deployment

VMware vSphere is one of the most stable, reliable platforms available for building virtualized cloud infrastructures. 

Are you ready to get started? 

Just take a few minutes first to consider some important deployment factors. First and foremost is hardware. No, really, you have to get the right hardware.

1. Get the Right Hardware: Don't assume that just any server, CPU, or other hardware component is compatible with vSphere, because only a limited set is. Incompatible hardware means some or all of the features won't work, such as Fault Tolerance and DirectPath I/O.
Without the right hardware, vSphere just won't work; at best, a whole lot of really important features won't. Only a narrow set of hardware components is certified to work with vSphere. You have to be sure you have one of the certified:
  • Servers
  • CPU
  • Storage adapters
  • Network adapters
  • Storage
Bear in mind that the latest VMware vSphere versions are only compatible with 64-bit hardware, so customers still using 32-bit hardware are out of luck. Though the approved hardware components are manufactured by popular makers like Intel and AMD, you can’t assume that all Intel and AMD products are certified, because they aren’t. 
Check the Compatibility Guide before buying or prepare to get your feelings hurt.

2. Get the Right Software Licenses: Software vendors have started cracking down on software licensing infringements with a vengeance in recent years. Some of the most aggressive are the big guys like Microsoft, Oracle, and SAP. IBM and AutoDesk are also getting a bad rep for knocking on doors and demanding software audits. Don't let this simple oversight cost you (and it will: potentially tens of thousands of dollars). Get the software licenses you need and set up some kind of software asset management (SAM) process to keep yourself out of hot water in the future.

3. Size the Host Correctly: Each host has to be sized appropriately in order to properly manage the apps that are installed and operated in that environment. You don't have to guess at it, because there are tools to help you determine what size to set up the host in a given scenario for optimal performance (and profits). Use one of the available tools to determine the right host size in your environment.

4. Monitor Your Clients for More Than a Few Hours or Days:
A common deployment mistake is to monitor clients for just a day or less to determine their needs. You won't get a good read on each client's metrics during the first day or first few days; average and peak usage can't really be measured unless you monitor the situation for a full month. By then, you should have collected adequate metrics to properly address the average and peak workloads of each customer.
A little planning now can save you a lot of trouble, expense, and frustration in the long run. For a near real-time monitoring solution to improve your VMware performance and free up as much as 60% of your time, visit Opvizor today.

VMware Snapshot Best Practices

Here is what a VMware virtual machine snapshot is, and the best practices for using one.

Snapshots are not backups. A snapshot file is only a change log of the original virtual disk. Therefore, do not rely on it as a direct backup process. The virtual machine is running on the most current snapshot, not the original vmdk disk files.

Snapshots are not complete copies of the original vmdk disk files. Taking a snapshot does not create a complete copy of the original vmdk disk file; rather, it creates a delta disk that records only the changes made after the snapshot. The change log in the snapshot file combines with the original disk files to make up the current state of the virtual machine. If the base disks are deleted, the snapshot files are useless.
Delta files can grow to the same size as the original base disk file, which is why the provisioned storage size of a virtual machine can increase by up to the original size of the virtual machine multiplied by the number of snapshots on it.
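That worst case is simple arithmetic. Here is a quick sketch (the disk sizes in the example are hypothetical):

```python
def worst_case_provisioned_gb(base_disk_gb, num_snapshots):
    """Each delta can grow as large as the base disk, so the worst-case
    provisioned space is the base plus one base-sized delta per snapshot."""
    return base_disk_gb * (1 + num_snapshots)
```

For example, a 100 GB base disk with 3 snapshots can consume up to 400 GB of datastore space, which is why long-lived snapshot chains fill datastores.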


  • The maximum supported number of snapshots in a chain is 32. However, VMware recommends using only 2-3 snapshots in a chain.
  • Do not keep any single snapshot for more than 24-72 hours. Snapshots should not be maintained over long periods of time for application or virtual machine version control purposes.
  • This prevents snapshots from growing so large as to cause issues when deleting/committing them to the original virtual machine disks. Take the snapshot, make the changes to the virtual machine, and delete/commit the snapshot as soon as you have verified the proper working state of the virtual machine.
  • Be especially diligent with snapshot use on high-transaction virtual machines such as email and database servers. These snapshots can very quickly grow in size, filling Datastore space. Commit snapshots on these virtual machines as soon as you have verified the proper working state of the process you are testing.
  • An excessive number of delta files in a chain (caused by an excessive number of snapshots), or large delta files, may degrade virtual machine and host performance.

If using a third party product that takes advantage of snapshots (such as virtual machine backup software), regularly monitor systems configured for backups to ensure that no snapshots remain active for extensive periods of time.
Snapshots should only be present for the duration of the backup process.
Snapshots taken by third party software (called via API) may not show up in the vCenter Snapshot Manager. Routinely check for snapshots through the command-line.

For more info, kindly refer VMware KB - http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1025279 

Microsoft Virtual Tech Labs

TechNet Virtual Labs enable you to quickly evaluate and test Microsoft's newest products and technologies through a series of guided, hands-on labs that you can complete in 90 minutes or less. There is no complex setup or installation required, and you can use TechNet Virtual Labs online immediately, free. 

Below is a list of 17 Microsoft Virtual Labs available for Windows Server 2012 R2:

1. Learn PowerShell command-line scripting with Windows Server 2012 R2 via Virtual Labs
2. How to Configure and Manage Windows Server 2012 R2 via Virtual Labs
3. Exploring New Capabilities of Windows Server 2012 R2 Hyper-V Virtual Labs
4. Explore New Storage Features of Windows Server 2012 R2 via Virtual Labs
5. Learn Storage Management with System Center Virtual Machine Manager (SCVMM) 2012 R2 via Virtual Labs
6. Explore and Learn PowerShell Desired State Configuration with Windows Server 2012 R2 via Virtual Labs 
7. Getting to Know Windows Server 2012 R2 Preview Virtual Labs
8. Learn More About New WSUS Features of Windows Server 2012 R2 via Virtual Labs
9. Explore More About Windows Server 2012 R2 RemoteApp and Remote Desktops via Virtual Labs 
10. Get Introduced to Windows Server 2012 R2 Essentials via Virtual Labs
11. Windows Server 2012 R2: Enabling Windows Server Work Folders Virtual Labs
12. Windows Server 2012 R2: Implementing Workplace Join Virtual Labs
13. Enhancements with Windows Server 2012 R2 Hyper-V Replica Virtual Labs
14. Network Automation using IP Address Management (IPAM) with Windows Server 2012 R2 Virtual Labs
15. Managing Your Datacenter Devices with Windows Server and System Center 2012 R2 Virtual Labs
16. Desired State Configuration with Windows Server 2012 R2 Virtual Labs
17. Windows Server 2012 R2: Implementing Claims-Aware Applications Virtual Labs