Best Practices for Alerting and Logging with SQL Server Blocked Process Monitor
Blocking in SQL Server is a natural outcome of concurrency control, but prolonged blocking can degrade application performance and user experience. The Blocked Process Monitor (BPM) in SQL Server provides a way to detect sessions that are blocked longer than a configured threshold. Proper alerting and logging around BPM events let DBAs surface problems early, troubleshoot faster, and track trends over time. This article covers practical best practices for configuring BPM, creating reliable alerts, designing useful logs, and integrating with monitoring systems.
What the Blocked Process Monitor does (brief)
The Blocked Process Monitor raises an event when a session has been blocked longer than the configured threshold (by default BPM is off). When triggered, SQL Server emits a blocked process report, which can be captured through Extended Events or SQL Trace. The report is an XML document describing the blocked and blocking sessions, the wait resource and lock mode, each session's input buffer, and its execution stack. This output is useful for identifying the root cause of long-running blocking chains.
Key fact: BPM triggers when a session is blocked longer than the threshold set by the blocked process threshold option.
Configuring BPM: thresholds and scope
Choose an appropriate blocked process threshold
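- Default is 0 (BPM disabled). Set a threshold in seconds using sp_configure 'blocked process threshold', <seconds>, and apply it with RECONFIGURE. This is an advanced option, so 'show advanced options' must be enabled first. Microsoft's documentation advises against values of 1 to 4, which make the deadlock monitor run constantly.
- Recommended starting points: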
- OLTP systems: 5–15 seconds — catches user-impacting waits without excessive noise.
- Reporting/analytics: 30–60 seconds — avoids noise from long-running queries expected on these workloads.
- Tune based on application SLAs and typical query characteristics.
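To make the configuration step concrete, here is a minimal Python sketch that renders the T-SQL batch for a chosen threshold and rejects values outside the documented range. The function name `render_threshold_script` is hypothetical; the T-SQL it emits uses only `sp_configure` and `RECONFIGURE`.

```python
def render_threshold_script(seconds: int) -> str:
    """Return a T-SQL batch that sets the blocked process threshold.

    Accepts 0 (BPM disabled) or 5..86400 seconds. Values 1-4 are rejected
    because they make the deadlock monitor run constantly.
    """
    if seconds != 0 and not 5 <= seconds <= 86400:
        raise ValueError("use 0 (disabled) or a value from 5 to 86400 seconds")
    return "\n".join([
        # The threshold is an advanced option, so expose advanced options first.
        "EXEC sp_configure 'show advanced options', 1;",
        "RECONFIGURE;",
        f"EXEC sp_configure 'blocked process threshold', {seconds};",
        "RECONFIGURE;",
    ])


if __name__ == "__main__":
    # Example: a 10-second threshold for an OLTP instance.
    print(render_threshold_script(10))
```

Generating the script rather than hand-typing it makes it easier to keep the value under version control alongside other instance settings.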
Consider server-wide impact
- BPM is global to the SQL Server instance. A single threshold applies across all databases and workloads.
- Beware of environments with mixed workloads; a single threshold may be noisy if you host both OLTP and long analytical queries.
Capturing BPM events: Extended Events vs SQL Trace
Prefer Extended Events
- Extended Events (XE) are lightweight, flexible, and the recommended modern approach.
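- Use the blocked_process_report event (XE) to capture BPM output. Its payload includes the blocked_process XML report plus fields such as database_id, database_name, duration, lock_mode, object_id, and transaction_id. The XML report itself carries the session IDs, wait resource, input buffers, and execution stacks of both sessions. Be cautious with actions such as sql_text or session_id: the event is raised by a background monitor task, so actions generally describe that task rather than the blocked or blocking session.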
- Example XE session elements:
- Target: event_file (persistent) or ring_buffer (volatile).
- Filters: predicates on event fields such as database_id, database_name, or duration to reduce noise.
When to use SQL Trace / server-side traces
- SQL Trace is deprecated, so avoid it in new designs. Use it only if legacy tooling requires it.
- Traces are heavier and less flexible than XE.
Designing alerting rules
Alert on meaningful conditions, not every BPM event
- Avoid alert fatigue. BPM can generate many events if the threshold is low. Alert on:
- Events where the same blocking session persists across multiple captures.
- Blocking that lasts longer than X seconds, or that involves specific critical databases or application accounts.
- Repeat offenders: same blocking session or same query causing repeated blocks.
Alert destinations
- Email: for immediate DBA visibility.
- Pager/incident platforms (PagerDuty, OpsGenie): for high-severity blocking impacting SLAs.
- Logging/monitoring platforms (Splunk, ELK, Datadog, Prometheus + Alertmanager): for aggregation and trends.
Rate-limiting and suppression
- Implement deduplication: suppress repeated alerts for the same blocking session for a configurable cool-down (e.g., 10–30 minutes).
- Threshold escalation: e.g., warning at 10s, critical at 60s, and page on critical.
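The deduplication and escalation rules above can be sketched as a small in-memory gate. This is a minimal illustration, assuming a 15-minute cool-down and the 10s/60s levels from the example; `AlertGate` and its `decide` method are hypothetical names, and a real deployment would persist state outside the process.

```python
from datetime import datetime, timedelta

COOL_DOWN = timedelta(minutes=15)  # assumed value inside the 10-30 minute range
WARNING_S, CRITICAL_S = 10, 60     # escalation levels from the example above


class AlertGate:
    """Suppress repeat alerts for the same blocker and severity."""

    def __init__(self):
        self._last_sent = {}  # (blocker_key, severity) -> time of last alert

    def decide(self, blocker_key, blocked_seconds, now):
        """Return 'warning', 'critical', or None when nothing should be sent."""
        if blocked_seconds >= CRITICAL_S:
            severity = "critical"
        elif blocked_seconds >= WARNING_S:
            severity = "warning"
        else:
            return None
        key = (blocker_key, severity)
        last = self._last_sent.get(key)
        if last is not None and now - last < COOL_DOWN:
            return None  # still cooling down for this blocker at this level
        self._last_sent[key] = now
        return severity


if __name__ == "__main__":
    gate = AlertGate()
    start = datetime(2024, 1, 1, 12, 0, 0)
    print(gate.decide("spid-53", 12, start))                         # warning
    print(gate.decide("spid-53", 20, start + timedelta(minutes=1)))  # None
    print(gate.decide("spid-53", 65, start + timedelta(minutes=2)))  # critical
```

Keying the cool-down on both blocker and severity lets an escalation from warning to critical through even while the warning is suppressed. Session IDs are reused over time, so a production key would likely need something more stable than the session ID alone.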
Logging strategy: structure and retention
What to log from BPM
- Timestamp of event
- Blocked session id and blocking session id(s)
- Blocking chain (ordered list of sessions)
- SQL text of both blocked and blocking queries (trim or hash long texts)
- Database name and object ids (if available)
- Wait type and wait resource
- Duration blocked at time of capture
- Execution plan or query hash (when available)
- Hostname, application name, login/user
- Execution stack frames (SQL handle, statement offsets) if present
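The "blocking chain" item in the list above is not delivered ready-made: each report describes one blocked session and its immediate blocker. A minimal sketch of reconstructing the ordered chain from such pairs follows; the function names and the sample session IDs are illustrative assumptions.

```python
def blocking_chain(blocked_by, session_id):
    """Follow blocker links from session_id up to the head blocker.

    blocked_by maps a blocked session ID to its immediate blocker.
    Returns the ordered chain, starting with session_id.
    """
    chain = [session_id]
    seen = {session_id}
    while chain[-1] in blocked_by:
        blocker = blocked_by[chain[-1]]
        if blocker in seen:  # guard against inconsistent or cyclic input
            break
        chain.append(blocker)
        seen.add(blocker)
    return chain


def head_blockers(blocked_by):
    """Sessions that block others but are not themselves blocked."""
    return sorted(set(blocked_by.values()) - set(blocked_by))


if __name__ == "__main__":
    # Hypothetical pairs gathered from reports in one monitoring interval.
    pairs = {57: 54, 54: 53, 61: 53}
    print(blocking_chain(pairs, 57))  # [57, 54, 53]
    print(head_blockers(pairs))       # [53]
```

Logging the head blocker alongside the chain makes it easier to group many blocked sessions under a single root cause.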
Storage and schema
- Use a structured log (JSON) for ease of parsing. Example JSON fields: timestamp, blocked_session, blocker_session, sql_text_blocked, sql_text_blocker, db_name, wait_resource, duration_seconds, plan_handle, application.
- Store logs in a central persistent store: relational table, ELK/Splunk index, or time-series DB.
- Example table columns (relational):
- event_time DATETIME
- blocked_session INT
- blocker_session INT
- db_name NVARCHAR(128)
- wait_resource NVARCHAR(256)
- duration_seconds INT
- sql_text_blocked NVARCHAR(MAX)
- sql_text_blocker NVARCHAR(MAX)
- query_hash BINARY(8)
- plan_handle VARBINARY(64)
- raw_event XML (or NVARCHAR(MAX) if the raw event is stored as JSON)
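As an illustration of the structured-log approach, the sketch below turns one event into a JSON line using the example field names above, truncating long SQL text and adding a hash so identical statements can still be grouped. The 200-character limit, the `_sha256` suffix, and the function names are assumptions, not part of any SQL Server API.

```python
import hashlib
import json

MAX_SQL_CHARS = 200  # assumed truncation limit


def shorten_sql(text, limit=MAX_SQL_CHARS):
    """Normalize whitespace, hash the full text, then truncate for storage."""
    text = " ".join(text.split())
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    if len(text) > limit:
        text = text[:limit] + "..."
    return text, digest


def to_log_line(event):
    """Serialize one BPM event (a dict) as a single JSON line."""
    record = dict(event)
    for field in ("sql_text_blocked", "sql_text_blocker"):
        text, digest = shorten_sql(record.get(field) or "")
        record[field] = text
        record[field + "_sha256"] = digest
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical event values for illustration only.
    print(to_log_line({
        "timestamp": "2024-01-01T12:00:00Z",
        "blocked_session": 57,
        "blocker_session": 53,
        "db_name": "Sales",
        "wait_resource": "KEY: 6:72057594041991168 (ce52f92a058c)",
        "duration_seconds": 12,
        "sql_text_blocked": "SELECT * FROM dbo.Orders WHERE OrderId = 42",
        "sql_text_blocker": "UPDATE dbo.Orders SET Status = 'X'",
        "application": "OrderService",
    }))
```

The hash is taken before truncation so that two long statements sharing a prefix do not collapse into one group. Truncation alone does not remove PII from literals, so masking may still be needed.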
Retention and size control
- Retain detailed BPM logs for a shorter window (e.g., 30–90 days).
- Keep aggregated metrics (counts, top blockers) longer (6–24 months) for trend analysis.
- Mask or truncate PII in SQL text if required by compliance.
Practical implementation: sample Extended Events session (conceptual)
Use an XE session capturing blocked_process_report to an event_file target. Filter by database_id or application name as needed. Persist files to a monitored directory and have a log shipper (Fluentd/Logstash/SQL agent job) parse and forward events to your central logging system.
Example XE fields to capture:
- blocked_process_report event
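- Fields: blocked_process (the XML report), database_id, database_name, duration, lock_mode, object_id, transaction_id. Take session IDs, host names, login names, and SQL text from the XML report rather than from actions, which generally describe the background monitor task that raised the event.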
- Target: event_file with rollover and size limits
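The session described above could be created with DDL along the following lines. This Python sketch only renders the T-SQL text; the session name, file name, and size limits are assumptions to adjust for your environment, and the helper `render_xe_session` is hypothetical.

```python
import re


def render_xe_session(name, file_name, database_name=None,
                      max_file_mb=50, max_files=10):
    """Return T-SQL that creates and starts a blocked_process_report session."""
    if not re.fullmatch(r"[A-Za-z0-9_]+", name):
        raise ValueError("session name must be alphanumeric or underscore")
    predicate = ""
    if database_name is not None:
        escaped = database_name.replace("'", "''")  # escape T-SQL string literal
        predicate = f" (WHERE ([database_name] = N'{escaped}'))"
    safe_file = file_name.replace("'", "''")
    return f"""CREATE EVENT SESSION [{name}] ON SERVER
ADD EVENT sqlserver.blocked_process_report{predicate}
ADD TARGET package0.event_file (
    SET filename = N'{safe_file}',
        max_file_size = ({max_file_mb}),
        max_rollover_files = ({max_files})
)
WITH (STARTUP_STATE = ON);
ALTER EVENT SESSION [{name}] ON SERVER STATE = START;"""


if __name__ == "__main__":
    # Example: capture reports for a hypothetical Sales database only.
    print(render_xe_session("bpm_capture", "blocked_process.xel",
                            database_name="Sales"))
```

`max_file_size` is expressed in megabytes, and `STARTUP_STATE = ON` restarts the session when the instance restarts. The session produces nothing until the blocked process threshold is set to a non-zero value.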
Parsing, enriching, and correlating BPM logs
Enrichment steps
- Resolve session IDs to login/application/host via sys.dm_exec_sessions snapshots at time of event.
- Link query_hash or plan_handle to cached plans and to historical runtime metrics (execution counts, average duration) from Query Store or sys.dm_exec_query_stats.
- Correlate with performance counters (CPU, IO) and job schedules to find root causes.
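Before any enrichment, the XML report has to be parsed into fields. The sketch below uses Python's standard `xml.etree.ElementTree` on a hand-written, heavily abbreviated sample; real reports carry many more attributes, and the output field names are the illustrative ones used in this article rather than anything SQL Server defines.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hand-written sample of a blocked process report.
SAMPLE_REPORT = """
<blocked-process-report monitorLoop="1234">
  <blocked-process>
    <process spid="57" waittime="12000" lockMode="S" currentdb="6"
             waitresource="KEY: 6:72057594041991168 (ce52f92a058c)"
             clientapp="OrderService" hostname="APP01" loginname="svc_orders">
      <inputbuf>SELECT * FROM dbo.Orders WHERE OrderId = 42</inputbuf>
    </process>
  </blocked-process>
  <blocking-process>
    <process spid="53" status="sleeping" trancount="1" currentdb="6"
             clientapp="BatchLoader" hostname="ETL01" loginname="svc_etl">
      <inputbuf>UPDATE dbo.Orders SET Status = 'X'</inputbuf>
    </process>
  </blocking-process>
</blocked-process-report>
"""


def parse_report(xml_text):
    """Extract the main fields from one blocked process report."""
    root = ET.fromstring(xml_text.strip())
    blocked = root.find("blocked-process/process")
    blocker = root.find("blocking-process/process")
    return {
        "blocked_session": int(blocked.get("spid")),
        "blocker_session": int(blocker.get("spid")),
        # waittime is reported in milliseconds.
        "duration_seconds": int(blocked.get("waittime", "0")) // 1000,
        "wait_resource": blocked.get("waitresource"),
        "lock_mode": blocked.get("lockMode"),
        "blocker_status": blocker.get("status"),
        "blocker_application": blocker.get("clientapp"),
        "blocker_host": blocker.get("hostname"),
        "sql_text_blocked": (blocked.findtext("inputbuf") or "").strip(),
        "sql_text_blocker": (blocker.findtext("inputbuf") or "").strip(),
    }


if __name__ == "__main__":
    for key, value in parse_report(SAMPLE_REPORT).items():
        print(f"{key}: {value}")
```

A blocker with status "sleeping" and an open transaction, as in this sample, usually points to a transaction the application left open rather than a slow query.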
Correlation examples
- If a particular application host appears frequently as blocker, involve app devs to optimize or change retry logic.
- If blocks align with nightly batch jobs, consider scheduling changes or resource isolation (separate resource pools).
Alerting playbooks and runbooks
What an on-call alert should include
- Short summary: blocked session X blocked by Y for N seconds on database Z.
- Top 3 reasons to check: long-running transaction, missing index or range scan holding locks, application retry loop causing buildup.
- Quick commands to run:
- sys.dm_tran_locks and sys.dm_exec_requests to see current wait_resource and blocking_session_id.
- sys.dm_exec_sql_text(plan_handle) to get query text.
- sp_who2 or custom queries to see blocking chains.
- Suggested immediate mitigations:
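- If safe, end the blocking session with KILL <session_id>, after confirming the blocker with EXEC sp_who2 or sys.dm_exec_requests. KILL rolls back the session's open transaction, which can itself take time.
- As a stopgap, set a lock timeout (SET LOCK_TIMEOUT) so blocked requests fail fast instead of queuing; shortening transactions in the application is the long-term fix.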
- If caused by index maintenance/batch jobs: throttle or reschedule.
Avoiding dangerous practices
- Do not leave the threshold at 0 if you want monitoring; 0 disables BPM.
- Avoid automatically killing sessions as the default remediation. A killed session's transaction is rolled back, which can cause application errors and lost work.
- Beware of capturing full SQL text for long-retention logs without truncation or masking (PII risk).
Automation and integration ideas
- Automate a workflow: XE -> parser -> enrich -> push to SIEM -> runbook link -> alerting rules with dedupe/aggregation.
- Create dashboards showing top blockers by query_hash, by database, and by host.
- Implement anomaly detection: unusual spike in BPM events triggers an automated investigation job to snapshot DMVs and attach to an incident.
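One simple way to approximate the anomaly-detection idea is to compare the current interval's BPM event count with a recent baseline. The sketch below uses a mean plus three standard deviations with a minimum floor; both parameters and the function name `is_spike` are assumptions, and a real system might prefer seasonality-aware methods.

```python
from statistics import mean, pstdev


def is_spike(history, current, min_sigma=3.0, min_count=5):
    """Return True when current exceeds the baseline by a clear margin.

    history: event counts for recent comparable intervals.
    min_count: floor so that quiet systems do not alert on tiny counts.
    """
    if len(history) < 2:
        return False  # not enough data to form a baseline
    threshold = max(mean(history) + min_sigma * pstdev(history), min_count)
    return current > threshold


if __name__ == "__main__":
    recent_counts = [2, 3, 1, 2, 4, 2]  # hypothetical events per 5 minutes
    print(is_spike(recent_counts, 20))  # True
    print(is_spike(recent_counts, 4))   # False
```

When a spike is detected, the follow-up job can snapshot the relevant DMVs and attach the results to the incident.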
Example alerting rule matrix
Severity | Condition | Action |
---|---|---|
Warning | Blocked > 10s in non-critical DB | Log to SIEM, send email to DB team |
High | Blocked > 30s in critical DB or repeated same blocker >3 times/hour | Create incident, notify on-call |
Critical | Blocked > 120s impacting SLA or blocking critical job | Page on-call, collect diagnostics, consider kill after approval |
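The matrix above can be encoded as a small, testable function. This is a minimal sketch that follows the rows literally, evaluated from most to least severe; the function and parameter names are hypothetical, and inputs such as `impacts_sla` would come from your own tagging of databases and jobs.

```python
def classify(blocked_seconds, critical_db, repeats_last_hour=0,
             impacts_sla=False):
    """Map one blocking observation to a severity from the rule matrix.

    Returns 'Critical', 'High', 'Warning', or None when no row matches.
    """
    if blocked_seconds > 120 and impacts_sla:
        return "Critical"
    if (blocked_seconds > 30 and critical_db) or repeats_last_hour > 3:
        return "High"
    if blocked_seconds > 10 and not critical_db:
        return "Warning"
    return None


if __name__ == "__main__":
    print(classify(15, critical_db=False))                    # Warning
    print(classify(45, critical_db=True))                     # High
    print(classify(150, critical_db=True, impacts_sla=True))  # Critical
```

Read literally, the matrix leaves blocking of 10 to 30 seconds in a critical database unclassified, and the sketch preserves that gap; you may want to add a row for it.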
Summary — practical checklist
- Set a sensible global blocked process threshold aligned to workload.
- Use Extended Events (blocked_process_report) and capture relevant fields.
- Filter and enrich events to reduce noise and make logs actionable.
- Centralize logs (JSON) and retain details short-term, aggregates long-term.
- Alert on meaningful patterns, use rate-limits and escalation.
- Prepare runbooks with safe remediation steps; avoid automatic kills.
- Monitor trends and implement fixes (indexes, plan changes, scheduling, resource isolation).