Monitoring Databases to Prevent Performance & Availability Outages

Submitted Mar 9, 2018

Technical level: Intermediate

This presentation covers OS/Infrastructure monitoring but nobody monitors their OS manually.
Most deployments ue Nagios etc. Then why another talk on monitoring , something which everyone knows about. Usually what a sysadmin sees is probaly an isolated event , but somone needs to understand the whole picture ,
what would be the effect of IO slowdown on the database , what all can you expect.

Based on my work in providing Database Support, the idea is to talk with attendees about what we see most commonly in the Service Requests. How they can avoid some of those failure scenarios or at least detect them and minimize the downtime.

Key takeaway is “what are the signs to look for to prevent or minimize outages”.

Most of the content is applicable to databases in genral , but I would also talk about things which are Mysql specific.The last part of this presentation would have some details from MySQL Performance Schema and MySQL Enterprise Monitoring.

Outline

A. Why monitor :

a. Database is essential to most application
b. Application server can be redundant, can be spawned any time
c. Database is a limited resource , non-redundant most of the times
d. What can effective monitoring achieve

B. Outage Scenarios:

a. Performance Outage Scenarios: Database is alive
i. Unexpected load
ii. Spiked latency
iii. runaway queries
iv. Unexpected Locking

b. Availability Outage Scenarios: Database has crashed/shutdown
i. Disk is full
ii. Network is down
iii. Seg(6//11) Crash/ Bugs
iv. Out of Memory aka OOM
v. Server Code Bugs

C. What to Monitor ?:

Internal infrastructure:
CPU & Threads :
OS threads vs SQL sessions/connections
Number of threads
Thread states
Running vs blocked
CPU Load

IO :
Load
Usage
Latency
Disk Errors, multi-path failures

File System:
%use of different mount points esp “/tmp”
FS errors

OS/SYSTEMS:
dmesg (-T if available)
/var/log
/proc/meminfo
/proc/<pid of mysqld>/status (VM*)
Paging/swapping activity
Background flushing

Process:
stack traces in mysqld error log file , lock status

Network:
socket status
dropped packets/collisions

D. Monitoring for customers:

SQL:
long running queries
untimely queries
locking duration/timeouts/deadlocks/pending transactions/lock waits
data volume chnages/rows examined vs returned
concurrency
temp disk usage/On-disk sorting
semaphore waits/pending flushes/fsync
connections/network disconnects/aborted connects/aborted clients/max_connections
backups & archive logging (binary logging)

MySQL Specific: performance_schema, MEM

Speaker bio

The speaker is a DBA/Support Engineer with experience in support mission critical databases. As a MySQL Support Engineer for Oracle he is involved in troubleshooting availability and performance issues for large clients especially in the telecom domain.

Slides

https://www.dropbox.com/s/eqqbktcsfcno2v9/Monitoring_Alerts_Databases.pdf?dl=0

Rootconf proposals for round the year in 2018