Beware: dork-out session below.
I’d like to solicit advice from the GIS\VM community regarding our ArcGIS Server 10 upgrade on VMWare ESXi, especially regarding the table below. Are there any major factors which I left out or factors which look like they might be a problem?
We all know the “cloud” is all the rage. Even Apple is doing it. I don’t want to rehash the vague benefits, but rather some of the concrete obstacles.
At the Port of San Diego we have a Virtual Infrastructure leveraging VMWare’s ESXi 3.5 software. About a year ago our Development and Production environments (Linux\Oracle11g\SDE and Win2003\AGS9.3.1) were ported over to this environment. We gave the physical machines over to these environments. For about 6 months everything worked fine. I regularly checked the ArcGIS Server logs: no errors. Managers, administrators and users were all satisfied with the performance.
On January 26th we had an unplanned server outage. The system has never been the same since. The server and the services were restarted successfully, but our performance dropped to unacceptable levels. I could get the system to error by simulating a few users using different browsers. It certainly didn’t feel as snappy. In some cases layers wouldn’t load and we started getting lovely errors like those below in our ArcGIS Manager log.
Tension between what I refer to as “hardware guys” and “software guys” ensued. As a software guy my feeling was that something in the VM had changed. At one point the production GIS webserver was placed into a “throttled” state. Our Oracle DBA referred to it as “nice” mode. The hardware team took exception to that categorization. The hardware guys claimed it was an OS issue and the servers weren’t even leveraging all the virtual hardware we had access to. Our Oracle DBA pulled the Linux\Oracle11g\SDE machine out of the Virtual Infrastructure and onto a physical Solaris machine to free up some resources. Performance increased but has never reached an acceptable level for a critical business system.
I have troubleshot this performance issue from every angle I can think of. I worked towards increasing the performance of the services: caching aerials, using optimized symbology, finer scale dependencies, republishing services as MSD-based services and fixing every issue flagged by the Analyze button on the Map Service Publishing toolbar. I did everything I could think of at the OS level: defragmenting virtual and LUN disks, disk cleanup, checkdisk, defragmenting page files. I learned the nitty-gritty details of perfmon counters to ensure the application server was performing properly. I delved into the performance tab available in our VMWare Infrastructure Client and tweaked our VM settings to leverage memory ballooning. We created a new volume specifically for page files. We disabled acceleration and disconnected the virtual floppy and CD drives. If you are still reading you probably understand my frustration.
That brings us to today and our upgrade to ArcGIS 10. Our plan is to do an in-place database upgrade but create new Windows Server 2008 64-bit VMs, thereby overcoming the 32-bit OS memory limitation. According to a white paper from Esri and VMWare on deploying ArcGIS Server in VMWare Infrastructure: “Under average conditions, a CPU in a SOC machine can support about four concurrently active service instances. … If each machine is a dual-CPU system, this configuration can accommodate about 16 users simultaneously performing operations on services.” Under this logic our 4-core machine should be able to handle 64 concurrent users. My goal is to identify the inevitable differences between the white paper and our future environment. This is not an apples-to-apples comparison. Below is a spreadsheet outlining a series of factors which could affect performance.
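For anyone who wants to play with the numbers, the rule of thumb in that quote can be turned into a rough back-of-the-envelope calculation. This is only a sketch under stated assumptions: it takes the white paper's "about four" active service instances per CPU at face value, treats each core of the VM as one CPU (which may not hold for virtual CPUs), and makes no attempt to fill in the elided part of the quote. The function names are hypothetical and come from neither Esri nor VMWare.

```python
# Rule of thumb quoted from the white paper: about four concurrently
# active service instances per CPU on a SOC machine.
INSTANCES_PER_CPU = 4


def estimate_instances(cpus: int) -> int:
    """Rough count of concurrently active service instances.

    Assumption: each core (or vCPU) counts as one CPU.
    """
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return INSTANCES_PER_CPU * cpus


def implied_users_per_instance(target_users: int, cpus: int) -> float:
    """How many users each active instance must serve to hit a target."""
    return target_users / estimate_instances(cpus)


if __name__ == "__main__":
    cores = 4  # our planned 4-core VM
    print("active instances:", estimate_instances(cores))
    print("users per instance for 64 users:",
          implied_users_per_instance(64, cores))
```

Under these assumptions a 4-core machine supports about 16 active instances, so a 64-user target implies roughly four users sharing each instance. Whether that ratio holds on a virtualized host is exactly the kind of difference the comparison is meant to surface.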
As the webserver administrator, should I have followed our Oracle DBA’s lead and moved back to physical machines? None of our other mission-critical business systems (Email, Document Management, or ERP) are on the Virtual Infrastructure, nor is there any intention of moving them to VMs. We want our Enterprise GIS to be considered part of this group and leveraged in a similar way. If our implementation is successful, we hope to prove not only that our GIS implementation is ready for prime time, but also that our Virtual Infrastructure can support other critical business systems. This would pave the way for other systems to move to the Virtual Infrastructure, thereby realizing the efficiency, availability, flexibility and financial benefits of managing our own cloud.
ESRI ArcGIS Server 9.3 for VMware Infrastructure Deployment and Technical Considerations Guide White Paper