I really think monitoring is an under-appreciated topic in our industry. Sure, there are several companies that make pretty good money writing monitoring software. Cards on the table, I even worked for one of them several years ago. Whether you invest in 3rd party monitoring software or not, you really should have some full solution in place for monitoring and alerting. I have spent many years developing my own monitoring scripts, and I continue to do so. Everything I create today, though, is supplemental to the monitoring solution I work with.
One of the things I hear a lot is “we don’t have budget for monitoring software so we create all of our own”. However, you still end up paying for the monitoring solution even if you do it yourself. You can either invest $X in a ready-made full solution or you can spend $X in lost man-hours creating your own and never reaching the full solution point. If you count the losses incurred when you miss something that a full solution would have caught, you can end up spending 2 or 3 times $X by rolling your own solution.
And really, have you ever met a DBA that didn’t wish someone could free up several hours of their time every week to work on other things they’d like to work on?
With that said, I offer you these 5 tips for monitoring and alerting that are applicable whether you are buying a solution or implementing your own.
- Monitoring should not be local only: Early in my career, I learned the hard way that when you only monitor a server from that server, it’s not sufficient. In my very first DBA gig, we were a very small company (12 people) running quite a few servers. Our budget didn’t cover luxuries like test staff or 3rd party monitoring software. All of our monitoring was stuff we had written over the years, usually for very specific things either in reaction to a problem we had in the past or to a problem we anticipated.
One day, we had a client complaining that their data wasn’t getting updated. We downloaded the data from their parent organization every half hour. New data should show up on their public website within 45 minutes of entering it into their member system. Upon investigation, I discovered that the process was failing to start. Ergo, none of my monitoring scripts reported errors because the errors never got a chance to occur. As a result, I set up an alert from a remote server that counted new records added to our master database from their system every half hour. If no new data was found that had been imported from the external system, it sent an alert saying to make sure the process was running.
The key takeaway here is that whether you are monitoring it via 3rd party software or you roll your own checks, don’t rely only on monitoring processes that are local to the server. You need to be able to monitor the server, at least to some degree, from a remote server so that when the server is down, you still get alerts, even if it’s just alerts that the server is down.
And no, I do not think relying on an absence of alerts is sufficient.
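The remote check from the story above can be sketched in a few lines of Python. This is only a minimal sketch under assumptions: it uses the timestamp of the newest imported record rather than a record count, the 35-minute threshold (a half-hour schedule plus a small grace period) is my own choice, and the function name `check_import_freshness` is hypothetical. How you fetch the timestamp from the master database is up to you; the point is that the check runs on a different server than the import process.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed threshold: the import runs every half hour, so allow 30 minutes
# plus a small grace period before calling the data stale.
MAX_AGE = timedelta(minutes=35)


def check_import_freshness(last_import: Optional[datetime],
                           now: datetime,
                           max_age: timedelta = MAX_AGE) -> Optional[str]:
    """Return an alert message if the import looks stalled, else None."""
    if last_import is None:
        # Nothing has ever been imported, which is just as suspicious.
        return "No imported records found; make sure the import process is running."
    age = now - last_import
    if age > max_age:
        minutes = int(age.total_seconds() // 60)
        return (f"No new imported records for {minutes} minutes; "
                "make sure the import process is running.")
    return None  # fresh data arrived recently, nothing to report


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Pretend the newest imported row is 90 minutes old.
    print(check_import_freshness(now - timedelta(minutes=90), now))
```

Because the check looks for evidence that the work happened, it fires even when the process never starts and therefore never raises an error of its own.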
- Avoid the storm: One of the things I find myself battling over and over is alert storming. I see a lot of people who set up alerts for both failures and successes. And they do it for everything. So you end up getting thousands of emails per day that have no meaning. The actual error alerts among them are very few.
What always ends up happening is that we inadvertently train ourselves to ignore these emails and thus when a critical alert actually arrives, we don’t see it. If you are constantly being hit by a storm of emails, you’re not going to see that very important raindrop when it lands on you. Then you may find yourself in the awkward position of having to explain why you missed a critical alert.
- Alerts should be actionable: This relates directly to the previous tip. In addition to creating a storm of alerts by alerting on successes, alerting on things you don’t care about should also be avoided. For example, if you have a process that is coded to retry a failure three times, and it sends an alert every time it fails, but you know there is nothing for you to do until you get the 4th error, then the first 3 errors are not actionable. So once again, you learn to ignore them. It becomes easy to miss one of the 4 emails if you are waiting for 4 emails. Or what if it fails twice and then the process crashes and never sends more errors? You would think it was successful because you only received 2 error alerts.
What you should do in that scenario is to only send an alert if it fails all 4 times. Send 1 alert saying that it failed 4 times and won’t be automatically tried again. Unless alerts are actionable, meaning that when an alert is received, someone needs to respond to it in some manner, then those alerts will eventually be ignored, and along with them, the alerts that do require action will end up being ignored as well.
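Here is a minimal sketch of that "one alert after the final failure" pattern in Python. The names `run_with_retries`, `task`, and `send_alert` are hypothetical; `send_alert` stands in for whatever actually delivers your notification (email, pager, chat).

```python
from typing import Callable, List


def run_with_retries(task: Callable[[], None],
                     send_alert: Callable[[str], None],
                     max_attempts: int = 4) -> bool:
    """Run task up to max_attempts times; alert once, only if every attempt fails."""
    errors: List[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            task()
            return True  # success: earlier failures were not actionable, so stay quiet
        except Exception as exc:
            # Remember the error for the final report instead of alerting now.
            errors.append(f"attempt {attempt}: {exc}")
    # Every attempt failed: send a single alert that someone has to act on.
    send_alert(
        f"Task failed {max_attempts} times and will not be retried automatically. "
        + "; ".join(errors)
    )
    return False
```

Note that this only covers failures the process survives. If the whole process crashes partway through, nothing is sent at all, which is exactly the gap a remote check is meant to close.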
- Minimize impact: At some point in your career, you will be tempted to just monitor everything because you don’t know what may go wrong next and you’re tired of not having the info you need for things that occurred in the past. Not everything that you may decide to capture is lightweight. Things like querying the plan cache on a SQL Server with a lot of RAM can have a large overhead.
I know people who poll the plan cache every 5 minutes and store it in a table so that they can look at the plans later if something goes wrong. Looking at one of my main servers with 512 GB of RAM, the plan cache is about 5 GB. If I write that 5 GB to a table every 5 minutes, that’s 1,440 GB or almost 1.5 TB of data per day. Tell me that’s not going to have a noticeable impact on the server.
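The arithmetic is worth running for your own servers before you schedule a capture like this. A quick sketch in Python, assuming (as in the example above) that the entire plan cache is written on every poll; the function name `daily_capture_gb` is just for illustration:

```python
MINUTES_PER_DAY = 24 * 60


def daily_capture_gb(snapshot_gb: float, interval_minutes: int) -> float:
    """Estimate GB written per day when a full snapshot is stored every interval."""
    snapshots_per_day = MINUTES_PER_DAY / interval_minutes
    return snapshot_gb * snapshots_per_day


# 5 GB every 5 minutes: 288 snapshots per day
print(daily_capture_gb(5, 5))  # 1440.0
```

Stretching the interval to an hour still writes 120 GB per day for the same cache, so the snapshot size matters as much as the polling frequency.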
Story time. At a place I previously worked, the other DBA team (yes, we had 2 DBA teams that managed different servers) was having performance problems with one of their SQL Servers. I looked at the server and saw that the biggest resource user on their system was their monitoring. They were monitoring the server so heavily that the monitoring was a bigger workload on the production server than the application. Here is what they had set up for monitoring on all production servers:
- PerfMon capture of system counters for systems team
- PerfMon capture of system and SQL Server counters for DBA team
- Default SQL Trace
- Default Extended Events system health session
- Custom SQL Trace to capture queries that ran longer than 3 seconds
- Custom SQL Audit that served no purpose other than saying “we audit access”
- Custom SQL Trace from their 3rd party monitoring software
- Custom SQL Trace from their 3rd party security compliance software
In case you weren’t counting, that’s 4 SQL Traces, 2 PerfMon captures, and 2 Extended Events sessions (SQL Audit uses Extended Events).
- Give until it hurts: Not every alert has to be critical, but if an alert is critical, then it should be pervasive and annoying. It should keep alerting until the issue gets fixed. For example, if a server crashes, the monitoring software should be screaming its head off and should continue screaming until the monitoring is paused by someone or the issue is fixed. If it is something that demands attention, it should be impossible to overlook the alerts and so annoying that someone will look into it just to quiet the alerts if for no other reason.
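A rough sketch of that "keep screaming" behavior in Python follows. Everything here is an assumption for illustration: `is_resolved`, `is_paused`, and `send_alert` are hypothetical callbacks you would wire to your own checks and notification channel, and the 5-minute default interval is arbitrary.

```python
import time
from typing import Callable


def alert_until_resolved(is_resolved: Callable[[], bool],
                         is_paused: Callable[[], bool],
                         send_alert: Callable[[str], None],
                         message: str,
                         interval_seconds: float = 300.0,
                         sleep: Callable[[float], None] = time.sleep) -> int:
    """Re-send a critical alert until the issue is fixed or monitoring is paused.

    Returns the number of alerts sent.
    """
    sent = 0
    while not is_resolved() and not is_paused():
        sent += 1
        # Number the alerts so the recipient can see how long it has been going on.
        send_alert(f"[CRITICAL #{sent}] {message}")
        sleep(interval_seconds)
    return sent
```

The `sleep` parameter exists only so the loop can be exercised without real waiting. Reserve this treatment for alerts that are truly critical; applied to everything, it simply recreates the storm described earlier.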