Monitoring Databases to Prevent Performance & Availability Outages
This presentation covers OS/Infrastructure monitoring but nobody monitors their OS manually.
Most deployments ue Nagios etc. Then why another talk on monitoring , something which everyone knows about. Usually what a sysadmin sees is probaly an isolated event , but somone needs to understand the whole picture ,
what would be the effect of IO slowdown on the database , what all can you expect.
Based on my work in providing Database Support, the idea is to talk with attendees about what we see most commonly in the Service Requests. How they can avoid some of those failure scenarios or at least detect them and minimize the downtime.
Key takeaway is “what are the signs to look for to prevent or minimize outages”.
Most of the content is applicable to databases in genral , but I would also talk about things which are Mysql specific.The last part of this presentation would have some details from MySQL Performance Schema and MySQL Enterprise Monitoring.
A. Why monitor :
a. Database is essential to most application
b. Application server can be redundant, can be spawned any time
c. Database is a limited resource , non-redundant most of the times
d. What can effective monitoring achieve
B. Outage Scenarios:
a. Performance Outage Scenarios: Database is alive
i. Unexpected load
ii. Spiked latency
iii. runaway queries
iv. Unexpected Locking
b. Availability Outage Scenarios: Database has crashed/shutdown
i. Disk is full
ii. Network is down
iii. Seg(6//11) Crash/ Bugs
iv. Out of Memory aka OOM
v. Server Code Bugs
C. What to Monitor ?:
CPU & Threads :
OS threads vs SQL sessions/connections
Number of threads
Running vs blocked
Disk Errors, multi-path failures
%use of different mount points esp “/tmp”
dmesg (-T if available)
/proc/<pid of mysqld>/status (VM*)
stack traces in mysqld error log file , lock status
D. Monitoring for customers:
long running queries
locking duration/timeouts/deadlocks/pending transactions/lock waits
data volume chnages/rows examined vs returned
temp disk usage/On-disk sorting
semaphore waits/pending flushes/fsync
connections/network disconnects/aborted connects/aborted clients/max_connections
backups & archive logging (binary logging)
MySQL Specific: performance_schema, MEM
The speaker is a DBA/Support Engineer with experience in support mission critical databases. As a MySQL Support Engineer for Oracle he is involved in troubleshooting availability and performance issues for large clients especially in the telecom domain.