Enterprise GIS as Virtual Infrastructure Proving Grounds


Beware: dork-out session below.

I’d like to solicit advice from the GIS\VM community regarding our ArcGIS Server 10 upgrade on VMware ESXi, especially regarding the table below. Are there any major factors I left out, or any that look like they might be a problem?

We all know the “cloud” is all the rage. Even Apple is doing it. I don’t want to rehash the vague benefits, but rather some of the concrete obstacles.

At the Port of San Diego we have a Virtual Infrastructure running VMware’s ESXi 3.5 software. About a year ago our Development and Production environments (Linux\Oracle11g\SDE and Win2003\AGS9.3.1) were ported over to this environment. We gave the physical machines over to these environments. For about 6 months everything worked fine. I regularly checked the ArcGIS Server logs: no errors. Managers, administrators and users were all satisfied with the performance.

On January 26th we had an unplanned server outage. The system has never been the same since. The server and the services were restarted successfully, but our performance dropped to unacceptable levels. I could get the system to error by simulating a few users using different browsers. It certainly didn’t feel as snappy. In some cases layers wouldn’t load and we started getting lovely errors like those below in our ArcGIS Manager log.

Tension between what I refer to as “hardware guys” and “software guys” ensued. As a software guy, my feeling was that something in the VM had changed. At one point the production GIS webserver was placed into a “throttled” state. Our Oracle DBA referred to it as “nice” mode. The hardware team took exception to that categorization. The hardware guys claimed it was an OS issue and that the servers weren’t even using all the virtual hardware we had access to. Our Oracle DBA pulled the Linux\Oracle11g\SDE machine out of the Virtual Infrastructure and onto a physical Solaris machine to free up some resources. Performance increased but has never reached an acceptable level for a critical business system.

I have troubleshot this performance issue from every angle I can think of. I worked on increasing the performance of the services: caching aerials, using optimized symbology, setting finer scale dependencies, republishing services as MSD-based services and fixing every issue flagged by the Analyze button on the Map Service Publishing toolbar. I did everything I could think of at the OS level: defragmenting the virtual and LUN disks, disk cleanup, checkdisk and defragmenting page files. I learned the nitty-gritty details of perfmon counters to ensure the application server was performing properly. I delved into the Performance tab in our VMware Infrastructure Client and tweaked our VM settings to leverage memory ballooning. We created a new volume specifically for page files. We disabled acceleration and disconnected the virtual floppy and CD drives. If you are still reading, you probably understand my frustration.
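For anyone wanting to repeat the "few simulated users" test without juggling browsers, here is a minimal Python sketch. It is not what I actually ran: the URL is a hypothetical placeholder, and the function names (`fetch_url`, `timed_call`, `simulate_users`) are made up for illustration. It keeps a handful of requests in flight at once and reports how many failed and how long they took, which gives a number to compare before and after each tweak.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Hypothetical placeholder; substitute the URL of a real map service request.
SERVICE_URL = "http://gis.example.com/map-service/export"


def fetch_url(url, timeout=30):
    """Request the URL once and return the number of bytes received."""
    with urlopen(url, timeout=timeout) as response:
        return len(response.read())


def timed_call(fetch, url):
    """Return (elapsed_seconds, error_or_None) for a single request."""
    start = time.perf_counter()
    try:
        fetch(url)
        error = None
    except Exception as exc:  # record the failure instead of stopping the run
        error = exc
    return time.perf_counter() - start, error


def simulate_users(url, users=5, requests_per_user=3, fetch=fetch_url):
    """Keep up to `users` requests in flight and summarize the timings."""
    total = users * requests_per_user
    with ThreadPoolExecutor(max_workers=users) as pool:
        results = list(pool.map(lambda _: timed_call(fetch, url), range(total)))
    times = sorted(elapsed for elapsed, _ in results)
    errors = sum(1 for _, error in results if error is not None)
    return {
        "requests": total,
        "errors": errors,
        "median_s": times[len(times) // 2],
        "max_s": times[-1],
    }


# Example call (commented out so nothing hits the network on import):
# print(simulate_users(SERVICE_URL, users=5, requests_per_user=3))
```

The `fetch` parameter is there so the timing logic can be exercised with a stand-in function before pointing it at a production server.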

That brings us to today and our upgrade to ArcGIS 10. Our plan is to do an in-place database upgrade but create new Windows Server 2008 64-bit VMs, thereby overcoming the 32-bit OS memory limitation. According to a white paper from Esri and VMware on deploying ArcGIS Server in VMware Infrastructure: “Under average conditions, a CPU in a SOC machine can support about four concurrently active service instances. … If each machine is a dual-CPU system, this configuration can accommodate about 16 users simultaneously performing operations on services.” Under this logic our 4-core machine should be able to handle 64 concurrent users. My goal is to identify the inevitable differences between the white paper and our future environment. This is not an apples-to-apples comparison. Below is a spreadsheet outlining a series of factors which could affect performance.
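To make the white paper's rule of thumb explicit, here is a minimal Python sketch of the arithmetic. The function name is made up, and two things are assumptions rather than anything the white paper states in the quote above: that each core counts as one "CPU", and that capacity simply multiplies across machines. The number of SOC machines in the quoted configuration is elided, so the inputs below are illustrative only.

```python
# "about four concurrently active service instances" per CPU (white paper quote)
INSTANCES_PER_CPU = 4


def estimated_concurrent_instances(machines, cpus_per_machine,
                                   per_cpu=INSTANCES_PER_CPU):
    """Rule-of-thumb estimate: machines x CPUs per machine x instances per CPU."""
    return machines * cpus_per_machine * per_cpu


# Illustrative inputs (assumptions, not figures from the white paper):
for label, machines, cpus in [
    ("two dual-CPU machines", 2, 2),
    ("one 4-core machine", 1, 4),
    ("four 4-core machines", 4, 4),
]:
    print(label, "->", estimated_concurrent_instances(machines, cpus))
```

Under these assumptions two dual-CPU machines give 16, a single 4-core machine also gives 16, and it takes 16 CPUs in total to reach 64, so the estimate depends heavily on how many machines and cores are counted.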


As the webserver administrator, should I have followed our Oracle DBA’s lead and moved back to physical machines? None of our other mission-critical business systems (Email, Document Management, or ERP) are on the Virtual Infrastructure, nor is there any intention of moving them to VMs. Our Enterprise GIS intends to be considered among this group and leveraged in a similar way. If our implementation is successful we hope to prove not only that our GIS implementation is ready for prime time, but also that our Virtual Infrastructure can support other critical business systems. This will pave the way for other systems to move to the Virtual Infrastructure, thereby realizing the efficiency, availability, flexibility and financial benefits of managing our own cloud.

ESRI ArcGIS Server 9.3 for VMware Infrastructure Deployment and Technical Considerations Guide White Paper
http://www.vmware.com/files/pdf/ESRI-DeploymentGuide-v1.0.pdf (permalink)




One Response to “Enterprise GIS as Virtual Infrastructure Proving Grounds”

  1. Ari Isaak Says:

    Below are some comments I received via email:

    The AGS services timing out at startup puts the emphasis on the speed of the data supply to the ArcGIS Server VMs as the culprit, i.e. the database, given that moving it to a Solaris box improved it.

    I’d look to isolate the Oracle DB by using file geodatabases (fGDBs), if possible, to compare performance. There are some very low-level things inside Oracle that ArcSDE puts pressure on; the SGA, for example, is something that I’ve seen make big differences. It could also be an index on a common layer, especially in a WHERE clause of a spatial view; the MSD publish tool may not be able to see that.

    Another possible cause would be a patch that came into effect after the restart, maybe an update on the Oracle/Linux machine.

    The SATA 7200 rpm drives listed are not good for the databases, and not ideal for the ArcGIS Server virtual machines, as these are very disk intensive.

    Optimizer would be good here to be able to isolate which layer is performing slowly if it is specific to a layer, and would definitely have identified when the problem had occurred. It doesn’t have any visibility into the data tier/ SQL profiling, so it would add specific details on how, but not why, it’s performing worse.

    The ESXi 4.1 upgrade is worth it; system performance is much better.

    This person can be reached at http://www.northsouthgis.com/
