I’ve been at my new job for two weeks. In that time I’ve diagrammed all the databases as they are today, and I’ve recommended adding keys to some tables that had none. I’ve also started diagramming the object data that’s being collected in the system, hoping to find duplication and build out a more efficient design. I’ll be putting together both a transactional design and a related analysis design. To me, this is the really interesting part: visualizing the data and optimizing it.
The problem is, this new architecture is the long-term optimization. While I work toward it, I have to keep an eye on how the data is being used today. Right now I’m evaluating Confio’s Ignite8 for SQL Server. So far I like the product; I am, however, seeing limitations in how I’ve implemented it. I don’t have a server on the same LAN segment where I could install the monitor and watch the instance 24/7. I’m working to remedy this, but it looks like early January before I can. Once the hardware is in place, I’ll resume my monitoring efforts.
Anyway, back to my story.
While checking out my server, I identified a query that was spending nearly an hour a day, in total, waiting to complete. It was an export process, so I knew I’d be dealing with a large data set. I could also see that this query was running between 5,000 and 10,000 times per day. Ignite helped me identify two wait types: Memory/CPU and PAGEIOLATCH_SH.
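If you don’t have something like Ignite watching the server, the plan cache DMVs will surface the same kind of offender. Here’s a rough sketch of the query I’d start with (this isn’t what Ignite does under the hood, and the TOP count is arbitrary):

    -- Sketch: top cached statements by total elapsed time
    -- (the DMV stores microseconds; shown here as milliseconds).
    SELECT TOP (10)
        qs.execution_count,
        qs.total_elapsed_time / 1000 AS total_elapsed_ms,
        qs.total_worker_time  / 1000 AS total_cpu_ms,
        qs.total_logical_reads,
        SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
            ((CASE qs.statement_end_offset
                  WHEN -1 THEN DATALENGTH(st.text)
                  ELSE qs.statement_end_offset
              END - qs.statement_start_offset) / 2) + 1) AS statement_text
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY qs.total_elapsed_time DESC;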
I didn’t throw out Confio’s suggestion that we might be “reading more data from memory than necessary”. I started scanning the query that was being run, and discovered that the query was calling a view.
That view was composed of a few tables joined to another view.
That second view was itself composed of several views (which used the PIVOT function), joined to more tables.
One of those views in turn had a cross-database join, which rules out indexed views, since WITH SCHEMABINDING doesn’t allow cross-database references. So I looked at the amount of data being retrieved by each view.
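Rather than opening each view definition by hand to see how deep the nesting goes, sys.sql_expression_dependencies (SQL Server 2008 and up) will list what a view references, and it flags cross-database references too. A quick sketch, with dbo.ExportView standing in for the real view name:

    -- Sketch: everything one view references; repeat for each referenced
    -- view to walk the nesting. referenced_database_name is populated
    -- for cross-database references.
    SELECT
        d.referenced_database_name,
        d.referenced_schema_name,
        d.referenced_entity_name,
        o.type_desc                      -- VIEW vs. USER_TABLE when it resolves locally
    FROM sys.sql_expression_dependencies AS d
    LEFT JOIN sys.objects AS o
        ON o.object_id = d.referenced_id
    WHERE d.referencing_id = OBJECT_ID(N'dbo.ExportView');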
Each of the child views passed all of its columns up to the parent, and all of those columns were used (this is a wide export). So I turned my attention to the base tables used in the queries, and I noticed that each join was using two columns.
So I dug into the actual data. I found that one column used in the join wasn’t very selective at all. But the other was!
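A quick way to check selectivity is to compare distinct values to total rows for each join column; something like this sketch (the table and column names are placeholders, not our real schema):

    -- Sketch: rough selectivity of the two join columns.
    -- Closer to 1.0 = more selective; closer to 0 = lots of duplicates.
    SELECT
        COUNT(*)                                    AS total_rows,
        COUNT(DISTINCT OrderID)    * 1.0 / COUNT(*) AS selectivity_OrderID,
        COUNT(DISTINCT RegionCode) * 1.0 / COUNT(*) AS selectivity_RegionCode
    FROM dbo.OrderDetail;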
A-HAH!
So, after a little further exploring, I found I was able to optimize the queries a bit by changing all the joins to use the single selective column rather than both. This worked because the selective column was the primary key in several of the tables (though it wasn’t always defined as a foreign key in the matching table…I have to fix that).
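The change itself was mechanical: drop the low-selectivity column from each join predicate and keep only the primary key column. It’s only safe because the primary key alone identifies the matching row, which made the second predicate redundant. A sketch with placeholder names, not our real schema:

    -- Before (sketch): joining on both columns, one of which is barely selective.
    SELECT h.OrderID, d.LineNumber, d.Quantity
    FROM dbo.OrderHeader AS h
    JOIN dbo.OrderDetail AS d
        ON d.OrderID = h.OrderID
       AND d.RegionCode = h.RegionCode;

    -- After (sketch): join on the selective primary key column only.
    SELECT h.OrderID, d.LineNumber, d.Quantity
    FROM dbo.OrderHeader AS h
    JOIN dbo.OrderDetail AS d
        ON d.OrderID = h.OrderID;

    -- The follow-up I mentioned: actually declare the missing foreign key.
    ALTER TABLE dbo.OrderDetail
        ADD CONSTRAINT FK_OrderDetail_OrderHeader
        FOREIGN KEY (OrderID) REFERENCES dbo.OrderHeader (OrderID);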
It’s only a small improvement, but a noticeable one. In unit testing I saved 1,621 milliseconds per query, on average. At our current volume, that should just about cut that wait time in half.
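If you want to run the same kind of comparison, SET STATISTICS TIME is enough: run the old and new versions of the query and compare the elapsed milliseconds SQL Server reports. A rough sketch:

    -- Sketch: comparing old vs. new versions of a query.
    -- DBCC DROPCLEANBUFFERS;   -- optional cold-cache comparison; test servers only
    SET STATISTICS TIME ON;
    SET STATISTICS IO ON;

    -- run the original query here, then the rewritten one,
    -- and compare the CPU time / elapsed time lines in the output

    SET STATISTICS TIME OFF;
    SET STATISTICS IO OFF;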
Couple that with the fact that we’re moving to new hardware (with more than double the CPU and triple the RAM) in January, and the wait for this query should fall off to nearly nothing. I usually hate to say “throw more hardware at it,” but this time I’m going to leave it there, since the hardware has already been ordered.
The whole process took about four hours. Like I said, not easy. Definitely not quick, either. After we throw the hardware at it, I’ll look at refactoring this query if it remains one of the top 10 worst-performing queries on my servers.
Do any of you out there have any suggestions for tracking down performance issues with queries that call views? What have you done to help performance? What was a waste of time?