Saturday, May 16, 2009

More Lessons learned from Performance Testing SharePoint

Abstract

Performance testing SharePoint, or any web-based application, can be quite tricky.  Recently my team launched an upgraded Corporate Web Site based on SharePoint 2007.  The launch was quite challenging, mainly due to mistakes made during performance testing (see Lessons Learned from Intranet Launch).

This post is dedicated to the lessons learned from the performance testing of Corporate Web Site. 

Background

Prior to launch we ran through our performance test scenarios 3 times.  Each time the output showed that we could scale way beyond the existing implementation of our Corporate Web Site (referred to as Violin from here on).

The performance test scenarios had been chosen based on traffic patterns and pages determined to be high risk for performance (This was good). 

Our key performance requirements stated that the web servers must support 38 page views / sec with response time < 5 sec (This was good).  This is a nice well defined requirement, although some could argue that 38 page views needs to be broken down into specific types of pages (ex. 10 home page views, 7 chapter page views, …). 

We also had a performance goal stating that processor utilization should not go above 80% on web servers for more than 5 seconds (This was good).
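The CPU goal above can be checked mechanically against collected counter samples. Here is a minimal Python sketch, not taken from the original test tooling. It assumes one processor-utilization sample per second, and the function names are made up for illustration.

```python
from typing import Sequence


def longest_breach_seconds(samples: Sequence[float],
                           interval_s: float = 1.0,
                           threshold_pct: float = 80.0) -> float:
    """Return the longest continuous time (in seconds) spent above threshold_pct."""
    longest = 0
    current = 0
    for value in samples:
        # Extend the current run while above the threshold, otherwise reset it.
        current = current + 1 if value > threshold_pct else 0
        longest = max(longest, current)
    return longest * interval_s


def meets_cpu_goal(samples: Sequence[float],
                   interval_s: float = 1.0,
                   threshold_pct: float = 80.0,
                   max_breach_s: float = 5.0) -> bool:
    """True if utilization never stays above the threshold for more than max_breach_s."""
    return longest_breach_seconds(samples, interval_s, threshold_pct) <= max_breach_s
```

A short spike passes this check while a sustained plateau fails it, which matches how the goal is worded.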

For the final test we replayed traffic from IIS logs that were taken during the peak traffic window (when we received the most requests / sec).  This was a bit tricky because my Load Runner resource told me that this was not supported by Load Runner.  So he and I had to massage the data inside the IIS logs so that Load Runner could run the tests (this felt wrong at the time, but I cannot say if it was a mistake).
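For readers wondering what "massaging" IIS logs can look like, here is a rough Python sketch that turns W3C extended log lines into a list of URLs to replay. It is not the script we used. It assumes the log carries a `#Fields:` directive that includes `cs-method`, `cs-uri-stem` and `cs-uri-query`, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List


def parse_w3c_log(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per request line, keyed by the names in the #Fields directive."""
    fields: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
            continue
        if line.startswith("#"):
            continue  # other directives such as #Software or #Date
        values = line.split()
        if len(values) != len(fields):
            continue  # skip malformed or truncated lines
        yield dict(zip(fields, values))


def to_replay_urls(records: Iterable[Dict[str, str]], method: str = "GET") -> List[str]:
    """Keep requests of one HTTP method and rebuild path plus query string."""
    urls = []
    for rec in records:
        if rec.get("cs-method") != method:
            continue
        url = rec.get("cs-uri-stem", "")
        query = rec.get("cs-uri-query", "-")
        if query != "-":  # IIS writes "-" for an empty field
            url += "?" + query
        urls.append(url)
    return urls
```

Note that this keeps only the URLs. The `date` and `time` columns are what you would need to preserve the original pacing of the requests.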

We used Load Runner (sorry I do not know version) for all of the performance tests.  The Load Runner clients were located within the same data center as our web servers, but they were on different network segments.

When we ran the tests we engaged several people from the operations team (Network, Windows Server, SQL Server DBA and SharePoint Admin).  These people were tasked with monitoring components related to their area of expertise.  They were also required to collect performance statistics and report those back so they could be included in the overall performance test report (This was good).

“Performance is exceptional” or The False Sense of Security

So each time we ran the tests we were able to reach levels of about 90 page views / sec on one server with avg. response time < 5 seconds (we have 4 load balanced WFE in our farm).  So we were high-fiving and slapping each other on the back.  As far as we were concerned the performance requirements were met: check them off, we are done.

We did notice an occasional spike w/ CPU, but we were able to correlate this back to pages expiring in Output Cache.  So this was not a concern.

Well once we went live we discovered that something was gravely wrong.

So What Went Wrong

After going live we discovered that the output cache hit ratio was not aligned with the numbers we were seeing during performance testing.  We were getting far fewer output cache hits.  This resulted in the servers having to do a lot more work than originally anticipated.

What could have happened? We thought we did everything right with the performance tests.  What went wrong?

Well, after much soul searching (and re-reading the basics of performance testing), it hit me:

Oh $hit we didn’t model user variations and think times. 

 

Does that really matter? 

Yeah, it does.  We ran a high number of requests, but the proportion of cached vs. un-cached requests was out of balance.  Had we taken user think times and other variations (browser type, user location) into consideration, we would have seen fewer hits against the output cache.

Classic 101 Performance Testing Mistake.  Oh well, you pick yourself up, dust yourself off and vow not to make the same mistake again.

Lessons Learned Summary

1. Think times matter

User think times are critical when doing performance testing (especially for web applications that rely on ASP.Net Output Caching to meet performance goals).
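A toy simulation shows why. This is a minimal sketch, assuming a simple time-based cache where each page stays cached for a fixed number of seconds after a miss. The 60 second duration, the page names and the request gaps are illustrative numbers, not values from our farm.

```python
from typing import List, Sequence, Tuple


def build_requests(urls: Sequence[str], gap_s: float) -> List[Tuple[float, str]]:
    """Space the given URLs gap_s seconds apart, starting at t=0."""
    return [(i * gap_s, url) for i, url in enumerate(urls)]


def hit_ratio(requests: Sequence[Tuple[float, str]], cache_duration_s: float) -> float:
    """Replay (timestamp, url) pairs against a time-based cache and return hits / total."""
    expires_at = {}
    hits = 0
    for timestamp, url in requests:
        if expires_at.get(url, float("-inf")) > timestamp:
            hits += 1  # page is still in the cache
        else:
            expires_at[url] = timestamp + cache_duration_s  # miss: render and cache it
    return hits / len(requests) if requests else 0.0


# 10 distinct pages, each requested 30 times in round-robin order.
urls = ["/page%d.aspx" % (i % 10) for i in range(300)]

fast_replay = hit_ratio(build_requests(urls, gap_s=0.1), cache_duration_s=60)
with_think_time = hit_ratio(build_requests(urls, gap_s=10), cache_duration_s=60)
```

In this toy model the fast replay revisits every page well inside the cache window, so nearly every request is a hit. With a 10 second gap each page is revisited only after its cache entry has expired, so every request is a miss. Same URLs, completely different load on the servers.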

2. End user variations matter

Just as important as think times, you need to look at the IIS logs (or your web analytics reports) to understand browser differences and location differences.  This is extremely critical if you have Output Cache configured so that it treats these differences as non-cached page requests.
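To see how much these variations can matter, here is a small Python sketch that counts how many separate cache entries a single page needs once the cache varies by request attributes. The attribute names (`browser`, `language`) and the sample values are hypothetical. Your real cache profile decides what it actually varies by.

```python
from typing import Dict, Iterable, List, Tuple


def cache_key(request: Dict[str, str], vary_by: Tuple[str, ...]) -> Tuple[str, ...]:
    """Build the cache key: the URL plus every attribute the cache varies by."""
    return (request["url"],) + tuple(request[name] for name in vary_by)


def distinct_cache_entries(requests: Iterable[Dict[str, str]],
                           vary_by: Tuple[str, ...]) -> int:
    """Count how many separate cached copies the requests would create."""
    return len({cache_key(r, vary_by) for r in requests})


def sample_requests() -> List[Dict[str, str]]:
    """One page requested by every combination of 3 browsers and 2 languages."""
    browsers = ["IE7", "Firefox3", "Safari3"]
    languages = ["en-US", "de-DE"]
    return [{"url": "/Pages/default.aspx", "browser": b, "language": lang}
            for b in browsers for lang in languages]
```

With no variation the sample page needs one cache entry. Varying by browser and language together turns that into six, and each of those has to be rendered once before it can be served from cache. A load test that sends every request with the same browser and language never sees that cost.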

3. Mix up the IP addresses to fool user affinity

While this is not as important as Think Times and End User variations it is important if you are doing performance testing through a load balancer configured with session affinity. 

All of the tests we ran looked like they were coming from 2 IPs.  While I cannot prove this invalidated the test results, it looks like some sort of caching efficiency was realized somewhere in the stack (Switch, NIC, IIS, …).
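Here is one way 2 source IPs could skew a test. I do not know how our load balancer implements affinity, so this Python sketch simply assumes it hashes the client IP to pick a server. The hashing scheme and function names are assumptions for illustration only.

```python
import hashlib
from typing import Iterable, Set


def pick_server(client_ip: str, server_count: int) -> int:
    """Map a client IP to a server index; the same IP always gets the same server."""
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") % server_count


def servers_used(client_ips: Iterable[str], server_count: int) -> Set[int]:
    """Return the set of server indexes that receive any traffic."""
    return {pick_server(ip, server_count) for ip in client_ips}


# Two load generator IPs against a 4 server farm (addresses are made up).
load_test_servers = servers_used(["10.0.0.1", "10.0.0.2"], server_count=4)
```

Under this assumption two client IPs can reach at most two of the four web front ends, so half the farm (or more) sits idle while the busy servers keep their caches warm. Real traffic from many IPs spreads across all of them.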

References

Microsoft Patterns and Practices: Performance Testing Guidance for Web Applications

Microsoft Office Server Online: Configure page output cache settings

MSDN: Output Caching and Cache Profiles

Sunday, May 3, 2009

Lessons Learned from Launch of Intranet on SharePoint 2007

Abstract

This post provides some lessons learned from the launch of our Corporate Intranet.  After about two weeks of poor performance and stability issues we stabilized the site and resolved most of the issues.  The lessons learned here are common and I'm sure our team was not the first (nor the last) to make these mistakes. 

Background

Our Corporate Intranet supports about 21 Business Area / Business Units (BA/BUs).  When I say Intranet I am referring to a content publishing web site that provides announcements, latest news, corporate policies and other information that is important for employees to consider.  It is not a place for employees to collaborate as teams, this is done by another set of SharePoint 2007 sites.

The Intranet is hosted on something we call the Common Web Platform (CWP).  What makes it common is it is one set of features / functionality that powers Intranet, Extranet and Internet publishing sites.

In 2007 I started working on a project to upgrade CWP from its current infrastructure (SharePoint 2003 / MS Content Management Server 2002) to SharePoint 2007.  The first major component to roll out under the new SharePoint 2007 version of CWP is the Intranet site.

Our intranet is not small. It contains approximately 67,000 webs, 65,0000 documents and 70,000 web pages.  The business requirements for sharing content between BA/BUs led us to determine that putting all this content in one site collection was the best choice.  I still believe this was the right decision, but it did cause us to create a SharePoint Content DB that is around 330 GB.

Launch Day

Launch day was actually quite calm from my perspective.  Yes, we had a large site, but I felt we had done an excellent job with performance testing, so launch would actually go quite smoothly.  I do not want to go into specifics, but the performance testing had shown that the new SharePoint 2007 site would be able to scale about 3 X higher in number of page views and users than the existing platform.

Everything wasn't perfect, in fact far from it.  We had quite a number of lingering issues from content migration.  We also had some application bugs that just would not go away.  But everyone agreed that these could be solved so we decided to go forward with the launch.

So at approximately noon Eastern US time on March 25th, 2009 we had the DNS team flip the switch, and all traffic rolled off the old environment and onto the new.  It was one of the smoothest cutovers I have ever been associated with.  I even heard some people saying that they did not know we had flipped the switch.

The next morning as Europe came online the proverbial $hit hit the fan.  I'm not going to go into the blow by blow details, but I will say a dedicated team of engineers that wanted nothing more than to see this new platform succeed went to work along with MS Premier Support.  On Tuesday April 7th the task force was closed as everyone agreed that while the new Intranet had some problems it was stable and performance was acceptable to end users.

This was a really tough one to troubleshoot.  The thing that made it tough was just inconsistency with the crashes.  We could never tie it back to one specific event or one set of clear patterns.  The only consistency was the fact that it crashed during peak traffic loads (from 2 AM - 9 AM Eastern US time).  The Intranet availability dipped to about 60% during these two weeks.

Lessons Learned

As I stated, after about two weeks of pure hell we got things stable.  During that time we did a lot of analysis and made a few changes.  So, in no particular order, here are the things we changed and why, and what I personally learned.

1. Hosting web services that do not use SharePoint inside your SharePoint Application Pool is bad (umm kay… South Park ref).

One of our field controls makes calls to a web service that in turn makes calls to a database to retrieve some data.  It is pretty basic stuff.  Well, to make a long story short, the web service ended up in our SharePoint solution package, and our field control ended up making a call back into the same Application Domain to pull some data from a database.  Yes, I know, not a very smart thing to do.

Anyway, during performance testing we put together specific KPIs to watch for this web service.  We saw no major problems with it, but put it on a list of things to change once the application went into maintenance mode.

While we never linked any outages specifically to calls to this web service, we did see a major improvement in overall stability when the web service was moved to a separate application pool.

So the lesson learned is to keep the Application Pools that host SharePoint sites dedicated to SharePoint sites (do not have those Application Pools host non SharePoint IIS Sites).

2. Be sure to set RowLimit on query at less than 5000 items to avoid table locks.

One of the problems that definitely caused outages was table locks at the SQL Server level.  We traced the table locks back to SQL that was being generated by a CAML query we used to show documents associated with a given web page. 

SQL Server will lock a table if it thinks a query will return more than 5000 rows (lock escalation).  So it is very important that you set a row limit when using SPQuery and CrossListQueryCache objects.  When SharePoint generates the SQL for CrossListQuery it will set a default row limit of 2 million items.  I’m not sure if it does the same thing for SPQuery, but better safe than sorry.

So the lesson learned here is always set a row limit that is less than 5000 when using SPQuery and CrossListQueryCache. 

3. If querying by FileRef use SPWeb.GetListItem instead

The CAML query referenced in Item 2 above was using the FileRef field to filter the result list.  Unfortunately FileRef is a special field inside of SharePoint, meaning it doesn’t lend itself to being indexed (See Index List Field).  So the SQL queries that were generated from the CAML were doing full table scans, which is another big performance hit and can cause unwanted database locks.

So in the end we abandoned using a CAML query to get the documents and instead pulled them with the SPWeb.GetListItem method.  At first there was a huge debate on our team, because fundamentally it is better to reduce communication with the DB.  We were going from essentially one call to the DB to two calls per file in our document list field control (note: SPWeb.GetListItem results in at least 2 calls to the DB, one to get the List field info and one to pull the ListItem data).

Our control has a limit of 200 documents that can be displayed.  So we knew the maximum number of times we would call GetListItem per page would be 200.  We also knew that the average number of documents per page was 3.  So most pages had very few documents to display.
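The trade-off is easy to put numbers on. This is just a back-of-the-envelope Python sketch using the figures above. The 2 calls per item is the lower bound mentioned in the note, so the real counts may be higher, and the constant names are mine.

```python
# Figures taken from the post; names are illustrative only.
CALLS_PER_GET_LIST_ITEM = 2   # lower bound: list field info + list item data
MAX_DOCUMENTS_PER_PAGE = 200  # hard limit enforced by the control
AVG_DOCUMENTS_PER_PAGE = 3    # typical page
CAML_QUERY_CALLS = 1          # the single query we moved away from


def min_db_calls(document_count: int,
                 calls_per_item: int = CALLS_PER_GET_LIST_ITEM) -> int:
    """Minimum number of DB calls needed to load document_count documents one by one."""
    return document_count * calls_per_item


typical_calls = min_db_calls(AVG_DOCUMENTS_PER_PAGE)     # 3 * 2 = 6
worst_case_calls = min_db_calls(MAX_DOCUMENTS_PER_PAGE)  # 200 * 2 = 400
```

So a typical page goes from 1 call to at least 6, and the worst case is at least 400.  Those extra calls are cheap lookups, though, whereas the single CAML query was a full table scan that could lock the table.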

Our team is looking at alternate approaches.  One idea is to add a field to each document that has a GUID.  Then index that field and go back to doing queries using that new field.  We have a lot of testing to do before we make a decision to go in that direction.

So the lesson learned was: do not write CAML queries that use FileRef as the primary field to filter the results.

4. Don't make these mistakes with performance testing.

Okay, this one requires a separate blog post.  I promise to post a blog entry with this information very soon.  In the meantime I can say that the key mistake made with performance testing was not taking into consideration user sessions and think times.  We had the right URLs (we took these straight from the logs of a production machine), but we ran them through too fast, which created a situation where URLs used output cached versions of the pages when under normal load they would not have used the cached versions.

Wrapping Up

Granted, we had a rough launch because the performance testing did not catch the critical application issues.  But I do not want to leave people with the impression that everything we did was wrong.  Our team did a lot of stuff right, and often these things get forgotten when things go wrong.  So here is a short list of the things we did right:

  1. We used 64 bit hardware for all our servers (SQL and Web Front Ends).
  2. We used the caching options with Publishing sites effectively (Output Cache, Object Cache and BLOB Cache).
  3. We discovered a major memory leak in our code with performance testing and fixed it before going live.
  4. We put together a well defined set of Solutions and Features for our application (so we can deploy easily).
  5. We created a team of people that have some really deep knowledge on building SharePoint Publishing sites.

References

MSDN: Best Practices: Common Coding Issues When Using the SharePoint Object Model

Microsoft TechNet: Tune Web server performance (Office SharePoint Server)

SharePoint for End Users: Manage large SharePoint lists for better performance

Reza Alirezaei’s Blog: 20 key Points Arising, or Inferred, From “Working with large lists in MOSS 2007” Paper
