Sunday, June 21, 2009

SharePoint Saturday Charlotte

This past Saturday I had the privilege of presenting at the SharePoint Saturday Charlotte event alongside some top talent in the SharePoint community.  It was great to finally meet some of the people I follow on Twitter (too many to name).  Also, big kudos to Dan Lewis (@danlewisnet), Brian Gough and all of the #SPSCLT volunteers. You guys/girls rock!!!

As promised, here is the slide deck from my talk about Performance Testing with SharePoint.  I was hoping for a slightly bigger turnout, but when I saw I was in the same time slot as Becky Isserman (@MossLover) and Laura Rogers (@WonderLaura) I knew I would be lucky to get 7 people. :)

I really enjoyed the sessions I got to attend. 

The day started off with a GREAT presentation from Phil Wicklund. Phil had lots of pragmatic advice for managing your SharePoint investment. 

Next, I learned a LOT about how MS has implemented Windows Azure from Rick Taylor (@slkrck).  Rick is a GREAT speaker and had lots of war stories (which I LOVE hearing). 

Then, I got to hear Mike Watson (@mikewat) talk about SharePoint hosting architectures.  He has some really good insight into what it takes to make SharePoint purr. Also, I got to hear that our hosting architecture for SharePoint is in line with what he thinks is the RIGHT way to do it.

Next, I listened in on Dan Usher (@usher) talking about taxonomies.  I felt much better about my Internet-facing solution, which will have about 30 Web Applications in one farm, after hearing what he is doing.

Finally, I listened to Dan Attis (@jdattis) talk about a solution he recently wrapped up that used SharePoint lists to store data for a Web Interface that was not SharePoint.  It sounds like a really cool solution the folks at B&R built.  Also, I learned about Object Initializers in .Net 3.5, really cool stuff.

Definitely will not be my last SharePoint Saturday event.

Thursday, June 4, 2009

Big Thanks to Office / SharePoint Teams for TAP Airlift

I had the privilege of attending the Office 14 TAP Air Lift this week in Seattle.  This is my second time coming out to a SharePoint Air Lift and I must say they never disappoint.  While I cannot share any information from the Air Lift, I can say these are exciting times to be working with SharePoint.

I really want to extend a big thank you to Microsoft for hosting a great event.  The folks in Office / SharePoint development teams have some big deadlines in front of them.  For them to take time out of their busy schedules to spend some one on one time with customers says a LOT. 

Saturday, May 16, 2009

More Lessons learned from Performance Testing SharePoint

Abstract

Performance testing with SharePoint, or any web-based application, can be quite tricky.  Recently my team launched an upgraded Corporate Web Site based on SharePoint 2007.  The launch was quite challenging, mainly due to mistakes made during performance testing (see Lessons Learned from Launch of Intranet on SharePoint 2007).

This post is dedicated to the lessons learned from the performance testing of the Corporate Web Site.

Background

Prior to launch we ran through our performance test scenarios 3 times.  Each time the output showed that we could scale way beyond the existing implementation of our Corporate Web Site (referred to as Violin from here on).

The performance test scenarios had been chosen based on traffic patterns and pages determined to be high risk for performance (This was good). 

Our key performance requirements stated that the web servers must support 38 page views / sec with response time < 5 sec (This was good).  This is a nice well defined requirement, although some could argue that 38 page views needs to be broken down into specific types of pages (ex. 10 home page views, 7 chapter page views, …). 

We also had a performance goal stating that processor utilization should not go above 80% on web servers for more than 5 seconds (This was good).

For the final test we replayed traffic from IIS logs that were taken during the peak traffic window (when we received the most requests / sec).  This was a bit tricky because my Load Runner resource told me that this was not supported by Load Runner.  So he and I had to massage the data inside the IIS logs until Load Runner could run the tests (this felt wrong at the time, but I cannot say if it was a mistake).

We used Load Runner (sorry I do not know version) for all of the performance tests.  The Load Runner clients were located within the same data center as our web servers, but they were on different network segments.

When we ran the tests we engaged several people from operations team (Network, Windows Server, SQL Server DBA and SharePoint Admin).  These people were tasked with monitoring components related to their area of expertise.  They were also required to collect performance statistics and report those back so they could be included in overall performance test report (This was good).

“Performance is exceptional” or The False Sense of Security

So each time we ran the tests we were able to reach levels of about 90 page views / sec on one server with avg. response time < 5 seconds (we have 4 load balanced WFEs in our farm).  So we were high-fiving and slapping each other on the back.  As far as we were concerned the performance requirements were met: check them off, we are done.

We did notice an occasional spike w/ CPU, but we were able to correlate this back to pages expiring in Output Cache.  So this was not a concern.

Well once we went live we discovered that something was gravely wrong.

So What Went Wrong

After going live we discovered that the output cache hit ratio was not aligned with the numbers we were seeing during performance testing.  We were getting a LOT fewer output cache hits.  This resulted in the servers having to do a lot more work than originally anticipated.

What could have happened? We thought we did everything right with the performance tests.  What went wrong?

Well after much soul searching (and re-reading the basics of performance testing) it hit me:

Oh $hit we didn’t model user variations and think times. 

 

Does that really matter? 

Yeah it does. The reason is that we ran a high number of requests, but the proportion of cached requests vs. un-cached requests was out of balance.  Had we taken into consideration user think times and other variations (browser type, user location) we would have seen fewer hits against the output cache.

Classic 101 Performance Testing Mistake.  Oh well, you pick yourself up, dust yourself off and vow not to make the same mistake again.

Lessons Learned Summary

1. Think times matter

User think times are critical when doing performance testing (especially for web applications that rely on ASP.Net Output Caching to meet performance goals).

2. End user variations matter

Just as important as think times are end user variations.  You need to look at the IIS logs (or your web analytics reports) to understand browser differences and location differences.  This is extremely critical if you have Output Cache configured so that it treats these differences as non-cached page requests.

3. Mix up the IP addresses to fool user affinity

While this is not as important as Think Times and End User variations it is important if you are doing performance testing through a load balancer configured with session affinity. 

All of the tests we ran looked like they were coming from 2 IPs.  While I cannot prove this invalidated the test results, it looks like some sort of caching efficiency was realized somewhere in the stack (Switch, NIC, IIS, …).
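To make lessons 1 and 2 concrete, here is a minimal sketch of a toy time-based cache.  It is not SharePoint or ASP.NET code, and every name in it (OutputCacheSketch, HitRatio, the parameters) is hypothetical.  It assumes requests cycle through every page/variation combination in a fixed order and that a cached entry expires a fixed number of seconds after it was stored.

```csharp
using System;
using System.Collections.Generic;

// Toy model of a time-based output cache; not SharePoint or ASP.NET code.
public static class OutputCacheSketch
{
    // Returns the fraction of requests that would be served from cache.
    // A "variation" stands for anything the cache profile varies by
    // (for example browser type), so each page/variation pair is one cache entry.
    public static double HitRatio(int requests, int pages, int variationsPerPage,
                                  double secondsBetweenRequests, double cacheSeconds)
    {
        // Maps a cache key to the time (in seconds) at which it was last stored.
        var cachedAt = new Dictionary<int, double>();
        int distinctKeys = pages * variationsPerPage;
        int hits = 0;

        for (int i = 0; i < requests; i++)
        {
            double now = i * secondsBetweenRequests;
            int key = i % distinctKeys; // cycle through every page/variation pair

            double stamp;
            if (cachedAt.TryGetValue(key, out stamp) && now - stamp < cacheSeconds)
            {
                hits++; // entry is still fresh
            }
            else
            {
                cachedAt[key] = now; // miss: render the page and cache it again
            }
        }

        return (double)hits / requests;
    }
}
```

With these assumptions, OutputCacheSketch.HitRatio(1000, 10, 1, 0.01, 60) models a log replay with no think time and no variations and comes out at 0.99, while OutputCacheSketch.HitRatio(1000, 10, 3, 1, 60) spreads the same 1000 requests over three variations per page with one second between requests and drops to 0.49.  The numbers are only as good as the toy model, but the direction matches what we saw in production.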

References

Microsoft Patterns and Practices: Performance Testing Guidance for Web Applications

Microsoft Office Server Online: Configure page output cache settings

MSDN: Output Caching and Cache Profiles

Sunday, May 3, 2009

Lessons Learned from Launch of Intranet on SharePoint 2007

Abstract

This post provides some lessons learned from the launch of our Corporate Intranet.  After about two weeks of poor performance and stability issues we stabilized the site and resolved most of the issues.  The lessons learned here are common and I'm sure our team was not the first (nor the last) to make these mistakes. 

Background

Our Corporate Intranet supports about 21 Business Areas / Business Units (BA/BUs).  When I say Intranet I am referring to a content publishing web site that provides announcements, latest news, corporate policies and other information that is important for employees to consider.  It is not a place for employees to collaborate as teams; that is done by another set of SharePoint 2007 sites.

The Intranet is hosted on something we call the Common Web Platform (CWP).  What makes it common is it is one set of features / functionality that powers Intranet, Extranet and Internet publishing sites.

In 2007 I started working on a project to upgrade CWP from its current infrastructure (SharePoint 2003 / MS Content Management Server 2002) to SharePoint 2007.  The first major component to roll out under the new SharePoint 2007 version of CWP is the Intranet site.

Our intranet is not small. It contains approximately 67,000 webs, 65,0000 documents and 70,000 web pages.  The business requirements for sharing content between BA/BUs led us to determine that putting all this content in one site collection was the best choice.  I still believe this was the right decision, but it did cause us to create a SharePoint Content DB that is around 330 GB.

Launch Day

Launch day was actually quite calm from my perspective.  Yes, we had a large site, but I felt we had done an excellent job with performance testing, so launch would go quite smoothly.  I do not want to go into specifics, but the performance testing had shown that the new SharePoint 2007 site would be able to scale to about 3 times the page views and users of the existing platform.

Everything wasn't perfect, in fact far from it.  We had quite a number of lingering issues from content migration.  We also had some application bugs that just would not go away.  But everyone agreed that these could be solved so we decided to go forward with the launch.

So at approximately noon Eastern US time on March 25th, 2009 we had the DNS team flip the switch and all traffic rolled off the old environment and onto the new.  It was one of the smoothest cutovers I have ever been associated with. I even heard some people saying that they did not know we had flipped the switch.

The next morning as Europe came online the proverbial $hit hit the fan.  I'm not going to go into the blow by blow details, but I will say a dedicated team of engineers that wanted nothing more than to see this new platform succeed went to work along with MS Premier Support.  On Tuesday April 7th the task force was closed as everyone agreed that while the new Intranet had some problems it was stable and performance was acceptable to end users.

This was a really tough one to troubleshoot.  The thing that made it tough was just inconsistency with the crashes.  We could never tie it back to one specific event or one set of clear patterns.  The only consistency was the fact that it crashed during peak traffic loads (from 2 AM - 9 AM Eastern US time).  The Intranet availability dipped to about 60% during these two weeks.

Lessons Learned

As I stated, after about two weeks of pure hell we got things stable.  During that time we did a lot of analysis and made a few changes.  So, in no particular order, here are the things we changed, why we changed them, and what I personally learned.

1. Hosting web services that do not use SharePoint inside your SharePoint Application Pool is bad (umm kay… South Park ref).

One of our field controls makes calls to a web service that in turn makes calls to a database to retrieve some data.  It is pretty basic stuff.  Well, to make a long story short, the web service ended up in our SharePoint solution package, and our field control ended up making a call back into the same Application Domain to fetch some data from a database.  Yes, I know, not a very smart thing to do.

Anyway, during performance testing we put together specific KPIs to watch for this web service.  We saw no major problems with it, but put it on a list of things to change once the application went into maintenance mode.

While we never linked any outages specifically to calls to this web service, we did see a major improvement in overall stability when the web service was moved to a separate application pool.

So the lesson learned is to keep the Application Pools that host SharePoint sites dedicated to SharePoint sites (do not have those Application Pools host non SharePoint IIS Sites).

2. Be sure to set RowLimit on query at less than 5000 items to avoid table locks.

One of the problems that definitely caused outages was table locks at the SQL Server level.  We traced the table locks back to SQL that was being generated by a CAML query we used to show documents associated with a given web page. 

SQL Server will lock a table if it thinks a query will touch more than about 5000 rows (this is lock escalation: once a single statement holds roughly 5,000 locks, SQL Server may trade them for one table lock).  So it is very important that you set a row limit when using SPQuery and CrossListQueryCache objects.  When SharePoint generates the SQL for a CrossListQuery it will set a default row limit of 2 million items.  I'm not sure if it does the same thing for SPQuery, but better safe than sorry.

So the lesson learned here is always set a row limit that is less than 5000 when using SPQuery and CrossListQueryCache. 
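As a minimal sketch of that rule, the helper below always sets RowLimit before a query is run.  The class name, method names and the cap of 2000 are hypothetical choices for illustration, not the code from our solution; only SPQuery.Query, SPQuery.RowLimit and SPList.GetItems come from the SharePoint object model.  It needs a reference to Microsoft.SharePoint.dll.

```csharp
using Microsoft.SharePoint;

public static class LimitedQuerySketch
{
    // Hypothetical cap for illustration; anything comfortably under 5000.
    public const uint MaxRows = 2000;

    // Builds an SPQuery whose RowLimit is always set explicitly,
    // so the generated SQL never falls back to a default limit.
    public static SPQuery BuildQuery(string caml)
    {
        SPQuery query = new SPQuery();
        query.Query = caml;
        query.RowLimit = MaxRows;
        return query;
    }

    // Runs the capped query against a list.
    public static SPListItemCollection GetItems(SPList list, string caml)
    {
        return list.GetItems(BuildQuery(caml));
    }
}
```

Funnelling every query through one helper like this makes it hard to forget the limit.  The same idea should apply to the RowLimit property on CrossListQueryInfo when using CrossListQueryCache, though that path is not shown here.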

3. If querying by FileRef use SPWeb.GetListItem instead

The CAML query referenced in Item 2 above was using the FileRef field to filter the result list.  Unfortunately FileRef is a special field inside of SharePoint, meaning it doesn't lend itself to being indexed (see Index List Field).  So the SQL queries that were generated from the CAML were doing full table scans, which is another big performance hit and can cause unwanted database locks.

So in the end we abandoned using a CAML query to get the documents and instead pulled them with the SPWeb.GetListItem method.  At first there was a huge debate on our team, because fundamentally it is better to reduce communication with the DB.  We were going from essentially one call to the DB to two calls per file in our document list field control (note: SPWeb.GetListItem results in at least 2 calls to the DB, one to get the List field info and one to pull the ListItem data).

Our control has a limit of 200 documents that can be displayed.  So we knew the maximum number of times we would call GetListItem per page would be 200.  We also knew that the average number of documents per page was 3.  So most pages had very few documents to display.

Our team is looking at alternate approaches.  One idea is to add a field to each document that has a GUID.  Then index that field and go back to doing queries using that new field.  We have a lot of testing to do before we make a decision to go in that direction.

So the lesson learned was: do not write CAML queries that use FileRef as the primary field to filter the results.
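The per-document lookup can be sketched roughly as follows.  This is an illustration under assumptions, not our field control: the class and method names are hypothetical, the list of document URLs is assumed to be already known to the control, and error handling for URLs that no longer resolve is left out.  It needs a reference to Microsoft.SharePoint.dll.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.SharePoint;

public static class DocumentLookupSketch
{
    // Mirrors the 200-document display limit of the field control.
    public const int MaxDocuments = 200;

    // Fetches each document's list item by URL instead of running
    // one CAML query that filters on FileRef.
    public static List<SPListItem> GetDocuments(SPWeb web, IList<string> documentUrls)
    {
        var items = new List<SPListItem>();
        int count = Math.Min(documentUrls.Count, MaxDocuments);

        for (int i = 0; i < count; i++)
        {
            // One GetListItem call (and at least two DB round trips) per document.
            items.Add(web.GetListItem(documentUrls[i]));
        }

        return items;
    }
}
```

The trade-off is more round trips in exchange for avoiding the table scan, which is acceptable here only because most pages show a handful of documents.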

4. Don't make these mistakes with performance testing.

Okay, this one requires a separate blog post.  I promise to post a blog entry with this information very soon.  In the meantime I can say that the key mistake made with performance testing was not taking into consideration user sessions and think times.  We had the right URLs (we took these straight from the logs of a production machine), but we ran them through too fast, which created a situation where URLs were served from output-cached versions of the pages when under normal load they would not have been.

Wrapping Up

Granted, we had a rough launch because the performance testing did not catch the critical application issues.  I do not want to leave people with the impression that everything we did was wrong.  Our team did a lot of stuff right, and often these things get forgotten when things go wrong.  So here is a short list of the things we did right:

  1. We used 64 bit hardware for all our servers (SQL and Web Front Ends).
  2. We used the caching options with Publishing sites effectively (Output Cache, Object Cache and BLOB Cache).
  3. We discovered a major memory leak in our code with performance testing and fixed it before going live.
  4. We put together a well defined set of Solutions and Features for our application (so we can deploy easily).
  5. We created a team of people that have some really deep knowledge on building SharePoint Publishing sites.

References

MSDN: Best Practices: Common Coding Issues When Using the SharePoint Object Model

Microsoft TechNet: Tune Web server performance (Office SharePoint Server)

SharePoint for End Users: Manage large SharePoint lists for better performance

Reza Alirezaei’s Blog: 20 key Points Arising, or Inferred, From “Working with large lists in MOSS 2007” Paper

Thursday, April 30, 2009

I'm Back...

It's been a while since my last post. 

I've been heads down in the middle of our corporate rollout of SharePoint 2007.  Specifically I have been working on a project to upgrade our content management system from MCMS 2002 / SharePoint 2003 to SharePoint 2007.  Exciting stuff and every day presents new challenges.

We just recently went live with our Intranet and now are starting work on Extranet and Internet sites. 

Sunday, June 1, 2008

Managing Key-Value Pair Settings in SharePoint

Recently my team discussed various ways to manage application configuration settings in SharePoint.  We wanted to avoid using Web.Config files for Key-Value Pair data because of the complexity of managing this data in SharePoint.

SharePoint contains a decent API (SPWebConfigModification) for managing web.config settings across the farm.  It works pretty well, but we have had some trouble with it in certain situations.  If you are interested in a good article on SPWebConfigModification, check out this one by Mark Wagner.

Thankfully the folks on the SharePoint development team delivered just what we needed to have a simple solution for managing key-value pair settings.  It is the SPPropertyBag class.

SharePoint provides a properties collection for SPFarm, SPWebApplication, SPWebService and SPWeb objects (none for SPSite, but site settings can be stored @ SPSite.RootWeb).

The properties collections for SPFarm, SPWebApplication and SPWebService are not really based on the SPPropertyBag class.  The properties collection is inherited from the base class SPPersistedObject and is actually a Hashtable.  But for simple key-value settings it works the same way as SPPropertyBag.

It is really easy to manage the properties settings.  Here is some sample code to update settings for SPFarm, SPWebApplication or SPWebService.

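// Requires: using Microsoft.SharePoint.Administration;
// Adds or updates a setting on an SPFarm, SPWebApplication or SPWebService.
public static void SetPropertyValue(SPPersistedObject spObject, string name, string value)
{
    if (spObject.Properties.ContainsKey(name))
    {
        spObject.Properties[name] = value;
    }
    else
    {
        spObject.Properties.Add(name, value);
    }

    // Persist the change to the configuration database.
    spObject.Update();
}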

The SPWeb class actually uses SPPropertyBag for its properties.  The code to manage SPWeb.Properties is similar.


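// Requires: using Microsoft.SharePoint;
// Adds or updates a setting in the property bag of a web site.
public static void SetPropertyValue(SPWeb webSite, string name, string value)
{
    if (webSite.Properties.ContainsKey(name))
    {
        webSite.Properties[name] = value;
    }
    else
    {
        webSite.Properties.Add(name, value);
    }

    // Note: Update() is called on the property bag, not on the SPWeb.
    webSite.Properties.Update();
}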
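Reading a setting back is the other half of the story.  Here is a minimal sketch of a read helper with a fallback value.  The class name, method name and default-value parameter are hypothetical.  It works on a plain Hashtable, which is the type behind the Properties collection on SPPersistedObject, so it can be tried without a SharePoint farm.

```csharp
using System.Collections;

public static class PropertyReaderSketch
{
    // Returns the stored value for a key, or defaultValue when the key
    // is missing or its value is null.
    public static string GetPropertyValue(Hashtable properties, string name, string defaultValue)
    {
        if (properties != null && properties.ContainsKey(name))
        {
            object value = properties[name];
            if (value != null)
            {
                return value.ToString();
            }
        }

        return defaultValue;
    }
}
```

On a farm this might be called as PropertyReaderSketch.GetPropertyValue(SPFarm.Local.Properties, "MyApp.ServiceUrl", "http://localhost"), where the key and the default are made-up examples.  SPWeb.Properties is a string dictionary rather than a Hashtable, so for webs the indexer can be read directly and checked for null.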

Wednesday, May 21, 2008

Creating a SharePoint Web with the API

I have been working on an application that provides a repeatable way to create sites.  This application is very similar to the SharePoint Test Data Population Tool.

One of the first tasks was to create new web applications.  A quick look at the API and I thought well this is simple enough.

First instantiate a new instance of SPWebApplicationBuilder.  Set some properties on my object and then call create.  Use the SPWebApplication.Provision() method and I am done.

Well that is not exactly true.  What I discovered was that the Provision method only creates the web application on the local server.  This is great for a single web front end, but for a more typical setup more work is required.

So I dug into the Administrative screens using Reflector.  I discovered that the SharePoint team is instantiating a timer job that will provision the newly created web on the other web front ends.

if (SPFarm.Local.TimerService.Instances.Count > 1)
{
    SPWebApplicationProvisioningJobDefinition definition = new SPWebApplicationProvisioningJobDefinition(application, this.ApplicationPoolSection.ResetIis);
    definition.Schedule = new SPOneTimeSchedule(DateTime.Now);
    definition.Update();
}

So I thought okay, this is simple enough. Then I discovered that SPWebApplicationProvisioningJobDefinition is marked as internal. Agghhhh!!!!


This pattern of finding a useful SharePoint routine only to have it hidden with the Internal keyword happens way too often.  I am sure that usually there is a good reason to hide away the functionality, but I could see no logical reason to hide SPWebApplicationProvisioningJobDefinition. 


So after throwing a few darts at the SharePoint Development Team dartboard (no, I do not actually have one, but it could be a great gift for SharePoint developers) I sat down and created a timer job definition that does basically the same thing as SPWebApplicationProvisioningJobDefinition.


This was the first timer job I had ever written so I made a *few* mistakes. 


First I failed to understand a basic concept of timer jobs.  That is any classes inheriting from SPJobDefinition need to be deployed to the GAC on each SharePoint server.  If they are not then the job will not be created correctly.  The frustrating thing I encountered was the fact that no runtime error was raised.  So it looked like the job definition was created correctly, but it was not.


Second I failed to put a default constructor on my job definition class.  This resulted in a runtime error with the job definition.  The runtime error indicated that the class failed to serialize properly.


Third I failed to decorate my member variables with the [Persisted] attribute.   This resulted in me losing my variable values.


Fourth I did not realize that I have to manually delete the job definitions when they fail.  To delete the timer job definition bring up the SharePoint Central Administration web site.  On the Operations tab choose Timer job definitions (under the Global Configuration header).  Next click the timer job definition in the list and then click the Delete button.


Fifth I failed to understand that job definitions must be uniquely named.  This was a problem for me because my application would create one timer job per Web Application.  Sometimes a new web application job would be created before the other job was finished.  To work around the issue I included a timestamp in the name.  I am not sure if this was smart, but it works.


Sixth I failed to understand that the OWSTimer.exe service must be restarted when deploying a new version of my timer job class library.  The OWSTimer.exe service is what manages the timer jobs.  It actually instantiates and executes the timer job classes.  Since my timer job class was deployed to the GAC it meant I had to restart the OWSTimer.exe (on each SharePoint server ughhh!!!). 
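Putting the second, third and fifth points together, a skeleton for such a job might look like the sketch below.  It is not my actual class: the class name, the resetIis field, the job-name format and the assumption that calling Provision() locally is all the job needs to do are for illustration only.  It needs a reference to Microsoft.SharePoint.dll, and the compiled assembly has to be strong-named so it can go into the GAC on each server.

```csharp
using System;
using System.Globalization;
using Microsoft.SharePoint.Administration;

public class WebAppProvisioningJobSketch : SPJobDefinition
{
    // Without [Persisted] this value is lost when the job is serialized.
    [Persisted]
    private bool resetIis;

    // Default constructor: required so the job can be deserialized.
    public WebAppProvisioningJobSketch() : base()
    {
    }

    // SPJobLockType.None is meant to let the job run on each server
    // rather than on a single one.
    public WebAppProvisioningJobSketch(SPWebApplication webApplication, bool resetIis)
        : base(BuildJobName(DateTime.UtcNow), webApplication, null, SPJobLockType.None)
    {
        this.resetIis = resetIis;
        this.Title = "Provision web application (sketch)";
    }

    // Job names must be unique, so a timestamp is appended.
    public static string BuildJobName(DateTime timestamp)
    {
        return "WebAppProvisioningSketch-" +
               timestamp.ToString("yyyyMMddHHmmssfff", CultureInfo.InvariantCulture);
    }

    public override void Execute(Guid targetInstanceId)
    {
        // Assumption: provisioning the web application on the local server
        // is all this job has to do.
        this.WebApplication.Provision();

        if (this.resetIis)
        {
            // Resetting IIS is deliberately left out of this sketch.
        }
    }
}
```

Scheduling follows the same pattern as the decompiled code above: create the job, set Schedule to a new SPOneTimeSchedule(DateTime.Now), then call Update().  A millisecond timestamp can still collide if two jobs are created in the same instant, so a GUID suffix would be a safer choice for the name.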


In the end I accomplished my goal, but it would have been a lot easier if the SharePoint development team had not marked SPWebApplicationProvisioningJobDefinition as internal.