As part of my job as a consultant at VMware, I regularly deliver Health Check and Architecture Review engagements. I have found quite a few best practices that are commonly missed, and thought I would document the top 5 here for everyone's benefit. Maybe on the next round more enterprises will pass these best-practice checks. The list below is not in any particular order; it is just how the items came to mind. All of them are important best practices to follow unless you have a strong reason not to.
1- Change the port group security default settings ForgedTransmits and MACAddressChanges to Reject unless the application requires the defaults. Also ensure promiscuous mode keeps its default setting of Reject unless your application requires it. Setting all three to Reject increases the security of your environment: the last thing you want is someone forging transmitted packets on your network, or a compromised VM claiming to be someone else, disrupting another VM and receiving packets meant for it. Even worse, if promiscuous mode is set to Accept, a VM can sniff all the data flowing on that particular port group. Change all of these to Reject unless you are using something like Microsoft Clustering or NLB; in that case, dedicate a specific port group to those workloads and set the required security settings to Accept on that port group only. I believe the main reason many admins miss this configuration is that ForgedTransmits and MACAddressChanges are set to Accept by default, and many admins love default settings.
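To make the check in tip 1 concrete, here is a minimal Python sketch of an audit that flags port groups where any of the three security settings is not Reject. The dictionary layout, the setting keys, and the port group names are hypothetical; a real audit would read these values from vCenter rather than a hard-coded list.

```python
# Hypothetical inventory format: one plain dict per port group.
# A real audit would pull these values from vCenter instead.
SECURITY_SETTINGS = ("promiscuous_mode", "mac_address_changes", "forged_transmits")


def find_permissive_port_groups(port_groups, exceptions=()):
    """Return {port group name: [settings not set to reject]}.

    Port groups named in `exceptions` (for example one dedicated to NLB
    or Microsoft Clustering) are skipped.
    """
    findings = {}
    for pg in port_groups:
        if pg["name"] in exceptions:
            continue
        # A missing setting is treated as "accept" so it still gets flagged.
        permissive = [s for s in SECURITY_SETTINGS if pg.get(s, "accept") != "reject"]
        if permissive:
            findings[pg["name"]] = permissive
    return findings


port_groups = [
    {"name": "Prod-VMs", "promiscuous_mode": "reject",
     "mac_address_changes": "accept", "forged_transmits": "accept"},
    {"name": "DMZ", "promiscuous_mode": "reject",
     "mac_address_changes": "reject", "forged_transmits": "reject"},
    {"name": "NLB-Only", "promiscuous_mode": "reject",
     "mac_address_changes": "accept", "forged_transmits": "accept"},
]

findings = find_permissive_port_groups(port_groups, exceptions={"NLB-Only"})
print(findings)
```

Keeping the exceptions explicit mirrors the advice above: the relaxed settings live on a dedicated port group, and everything else is expected to be Reject.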
2- Consider limiting each host's visibility to the datastores it requires. I often visit customers who mount every VMFS datastore in the environment on every ESXi host. This is BAAADDDD!!!! It is bad from a scalability perspective: you are limited to 256 LUNs per host, and you have now imposed that limit on your entire environment. It is also quite bad from an availability and management perspective. Imagine one of these LUNs causing an APD (All Paths Down) or PDL (Permanent Device Loss) situation; you are now risking your whole environment being affected rather than limiting the effect to a particular cluster. You also increase the chances that admins managing one cluster will not understand the effect of deleting datastores used by other clusters, which can land you in a disastrous situation.
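As a rough illustration of tip 2, the following Python sketch reports datastores that are visible to hosts in more than one cluster. The host, cluster, and datastore names are made up, and the two input mappings are an assumed export format rather than anything vCenter produces directly.

```python
from collections import defaultdict


def datastores_spanning_clusters(host_cluster, host_datastores):
    """Return {datastore: [clusters that can see it]} for datastores
    mounted by hosts in more than one cluster."""
    clusters_by_datastore = defaultdict(set)
    for host, datastores in host_datastores.items():
        for datastore in datastores:
            clusters_by_datastore[datastore].add(host_cluster[host])
    return {
        datastore: sorted(clusters)
        for datastore, clusters in clusters_by_datastore.items()
        if len(clusters) > 1
    }


# Hypothetical inventory: which cluster each host belongs to,
# and which datastores each host has mounted.
host_cluster = {"esx01": "Prod", "esx02": "Prod", "esx03": "Test"}
host_datastores = {
    "esx01": ["prod-ds01", "shared-ds01"],
    "esx02": ["prod-ds01", "shared-ds01"],
    "esx03": ["test-ds01", "shared-ds01"],
}

spanning = datastores_spanning_clusters(host_cluster, host_datastores)
print(spanning)
```

Anything this reports is a candidate for review; some cross-cluster datastores may be intentional, so treat the output as a list of questions rather than violations.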
3- Size with VMware HA host failure considerations. Yeah, yeah, it seems like common sense, but I have lost count of the times I went to customers and found admission control disabled. As soon as I see that, I know there is a big chance the customer has not properly sized the environment, because whenever admission control is turned on they get the HA admission control violation error. Sorry, disabling admission control is not the solution in that case; the error means you need more capacity or better sizing of your VMs. Turning off admission control is a risky decision, because HA can no longer guarantee that you will have enough capacity to fail over all your protected VMs after a host failure, which is an outcome most of us don't want. If you disable HA admission control, you should not blame VMware when some VMs do not start after a host failure even though you have HA on! If you are doing this, it is time to look at improving your environment sizing! Don't wait!
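The sizing question behind tip 3 can be sketched as simple arithmetic in Python. This is a deliberately simplified model and not how HA admission control actually calculates capacity: it only looks at memory, assumes the largest hosts are the ones that fail, and ignores CPU, overhead, and fragmentation. The function name and the figures are illustrative assumptions.

```python
def survives_host_failures(host_capacity_gb, vm_memory_gb, host_failures=1):
    """Return True if the total VM memory still fits on the remaining
    hosts after the `host_failures` largest hosts are lost.

    Simplified worst-case check; not VMware's admission control algorithm.
    """
    if host_failures >= len(host_capacity_gb):
        return False  # no hosts would be left to restart anything
    # Sort ascending and drop the largest hosts (worst case).
    surviving = sorted(host_capacity_gb)[: len(host_capacity_gb) - host_failures]
    return sum(vm_memory_gb) <= sum(surviving)


hosts = [256, 256, 256]        # memory per host in GB (hypothetical)
vms_ok = [50] * 10             # 500 GB of VM memory in total
vms_oversized = [50] * 12      # 600 GB of VM memory in total

print(survives_host_failures(hosts, vms_ok))         # fits on two hosts
print(survives_host_failures(hosts, vms_oversized))  # does not fit
```

If a check like this fails for your cluster, that is the same message the admission control violation is giving you: add capacity or right-size the VMs.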
4- Maintain configuration consistency across hosts in a single cluster. Yes, another one that everyone knows about but only a few pass the test for. Trying to maintain consistency in host configuration manually can be challenging. How many of you can say that the NTP configuration, default gateway/DNS, firewall settings, port group settings, and attached LUNs of every host in your vSphere cluster are consistent at all times? Unless you have a very small environment, it can be challenging, especially with all the changes that are carried out over time. I would highly recommend two things to streamline the process. First, use automated tools to help you audit consistency. Host Profiles, vCenter Configuration Manager, and even a self-written PowerCLI script can be great tools to report on any inconsistency or change against your desired configuration. Second, ensure that you have a mechanism in place that limits continuous undocumented changes to the environment; a well-laid-out change control process can definitely help.
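To show what a self-written consistency report for tip 4 might look like, here is a small Python sketch that compares per-host settings and reports any host that differs from the value most hosts share. Using the majority value as the baseline is an assumption made for this example (Host Profiles, for instance, compares against a reference host instead), and the hosts, keys, and values are hypothetical.

```python
from collections import Counter


def find_config_drift(host_configs):
    """Return {setting: {"expected": majority value, "outliers": {host: value}}}
    for every setting on which the hosts disagree.

    Values must be hashable (strings, numbers, tuples).
    """
    drift = {}
    all_keys = sorted({key for cfg in host_configs.values() for key in cfg})
    for key in all_keys:
        values = {host: cfg.get(key) for host, cfg in host_configs.items()}
        expected, _count = Counter(values.values()).most_common(1)[0]
        outliers = {host: v for host, v in values.items() if v != expected}
        if outliers:
            drift[key] = {"expected": expected, "outliers": outliers}
    return drift


# Hypothetical settings collected from three hosts in one cluster.
host_configs = {
    "esx01": {"ntp_server": "10.0.0.10", "dns": ("10.0.0.2", "10.0.0.3"), "gateway": "10.0.0.1"},
    "esx02": {"ntp_server": "10.0.0.10", "dns": ("10.0.0.2", "10.0.0.3"), "gateway": "10.0.0.1"},
    "esx03": {"ntp_server": "10.0.0.99", "dns": ("10.0.0.2", "10.0.0.3"), "gateway": "10.0.0.1"},
}

drift = find_config_drift(host_configs)
print(drift)
```

A report like this only tells you that hosts differ; deciding which value is correct still belongs to your documented desired configuration and change control process.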
5- Utilize VMware HA priorities. Trust me, when VMware introduced this feature, it was for a good cause. Imagine that you have 60 VMs per host (yes, I have seen much higher numbers out there!) and that host fails. Could you honestly say that all the VMs on that host are created equal, and that you don't care which one starts first and which one starts last and faces the longest downtime? If you are not comfortable with that, then you had better start utilizing VMware HA priorities to ensure your critical VMs start first. One thing to remember here is that VMware HA still does not respect vApp start orders. Another misconception I have been seeing lately is that some customers assume VMware HA respects resource pool shares. Just because a resource pool's shares have been set to High does not mean HA will start its VMs before those of another resource pool whose shares are set to Low. Resource pool shares only prioritize VMs from a resource perspective when there is resource contention; they have nothing to do with HA priorities.
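The distinction in tip 5 between HA restart priority and resource pool shares can be illustrated with a toy Python model. This is not HA's real restart logic; it only shows that the ordering comes from the per-VM restart priority while the shares value plays no part. The field names, priority labels, and VM names are assumptions for the example.

```python
# Lower rank restarts earlier. Labels are illustrative.
PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}


def restart_order(vms):
    """Return VM names ordered by HA restart priority.

    VMs with restart priority "disabled" are left out. The resource pool
    shares field is deliberately ignored: it does not affect restart order.
    """
    enabled = [vm for vm in vms if vm["ha_priority"] != "disabled"]
    ordered = sorted(enabled, key=lambda vm: PRIORITY_RANK[vm["ha_priority"]])
    return [vm["name"] for vm in ordered]


vms = [
    {"name": "test-web", "ha_priority": "low", "pool_shares": "high"},
    {"name": "prod-db", "ha_priority": "high", "pool_shares": "low"},
    {"name": "prod-app", "ha_priority": "medium", "pool_shares": "normal"},
    {"name": "scratch", "ha_priority": "disabled", "pool_shares": "high"},
]

order = restart_order(vms)
print(order)
```

Note that test-web sits in a pool with High shares yet restarts last, while prod-db sits in a pool with Low shares and restarts first.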
I hope these tips will be useful to many of you out there, and will give you something to think about on Monday morning. I would love to hear your feedback and see which of the above you missed. It would be kind of you to share such info in the comments area below. You can comment anonymously if you feel like it!