5.17. Alerts

PPEM supports sending alerts to users and user groups via email to monitor agents, hosts, and instances. To send alerts, you must configure an SMTP server.

For an alert to be sent, an alert trigger must fire. An alert trigger fires when the value of a metric is below, equal to, or higher than the specified threshold. Metrics are stored in data sources, for example, in the repository database.

To determine the alert trigger threshold, alert trigger rules are used. An alert trigger rule is a set of one or multiple conditions containing logical operators and values. For example, an alert trigger rule condition can include the > logical operator and the 0 value. With such a rule, the alert trigger fires if the value of the specified metric exceeds 0.

Relations between multiple alert trigger rule conditions are determined by the AND and OR logical connectives.
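
The way conditions and connectives combine can be sketched in a few lines of Python. This is only an illustration, not PPEM code: the function name rule_fires and the data layout are hypothetical, and the short operator names (eq, gt, gte, lt, lte, neq) are taken from the operator list in the alert creation form.

```python
import operator

# Short operator names as listed in the alert creation form.
# The mapping to Python comparisons is an assumption made for illustration.
OPERATORS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
    "neq": operator.ne,
}


def rule_fires(value, conditions, connective="and"):
    """Return True if the metric value satisfies the rule.

    conditions: list of (operator_name, threshold) pairs.
    connective: "and" requires every condition, "or" requires at least one.
    """
    results = [OPERATORS[name](value, threshold) for name, threshold in conditions]
    return all(results) if connective == "and" else any(results)


# A single condition "> 0" fires for any positive value.
print(rule_fires(5, [("gt", 0)]))                    # True
# Two conditions joined by AND: 5 > 0 holds, 5 < 3 does not.
print(rule_fires(5, [("gt", 0), ("lt", 3)], "and"))  # False
# The same two conditions joined by OR.
print(rule_fires(5, [("gt", 0), ("lt", 3)], "or"))   # True
```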

For alerts to work properly, you must first install and configure logging and monitoring tools.

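This section explains how to manage alerts. It includes the following instructions:

  • Pre-Configuring Alerts

  • Creating an Alert

  • Viewing Alerts

  • Viewing Detailed Alert Information

  • Disabling and Enabling Alerts

  • Editing an Alert

  • Editing Alert Recipients

  • Clearing Alert History

  • Deleting an Alert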

Important

The alerts functionality is in the beta phase. Currently, only pgpro-otel-collector metrics can be used for alert triggers. Other existing limitations are specified in the corresponding instructions of this section.

Pre-Configuring Alerts

To work with alerts, you must pre-configure them in the ppem-manager.yml manager configuration file.

You can specify the following parameters:

alerts:
  metrics:
    request_chunk_size: number_of_instance_IDs
  cleanup_grace_period: alert_cleanup_interval_if_no_data_is_received
  scheduler:
    interval: interval_for_checking_new_alerts
    initial_delay: delay_for_starting_alert_scheduler
    timeout: timeout_for_updating_alert_trigger_rules
  delayed_data:
    is_enabled: true or false
    data_delay: default_data_arrival_delay_for_all_sources
    datasource_delays:
      metrics: delay_for_metrics_arrival
      logs: delay_for_log_arrival
    max_delay: maximum_allowed_data_arrival_delay
    is_adaptive_delay: true or false
  notifier:
    num_workers: number_of_concurrent_workers
    worker_batch_size: number_of_alerts_in_one_batch
    worker_interval: interval_for_checking_new_alerts
    backoff_base: exponential_backoff_calculation_duration
    max_retries: maximum_number_of_alert_attempts
    notification_timeout: alert_timeout
    janitor_interval: janitor_worker_polling_interval
    stale_processing_timeout: stale_alert_processing_timeout
  email:
    is_enabled: true or false
    smtp:
      host: SMTP_server_hostname_or_IP
      port: SMTP_server_port
      username: username_for_SMTP_server_authentication
      password: password_for_SMTP_server_authentication
      from: alert_sender_email
      timeout: SMTP_server_connection_timeout
      use_starttls: true or false
      use_ssl: true or false
      tls:
        insecure_skip_verify: true or false
        root_ca_path: path_to_root_CA

Where:

  • metrics: The parameters of sending requests to the metrics plugin.

    • request_chunk_size: The maximum number of instance IDs within one request.

      Default value: 100.

  • cleanup_grace_period: The time interval after which alerts are cleaned up if no data is received.

    Default value: 6h.

  • scheduler: The parameters of the scheduler that updates alerts in the manager memory.

    • interval: The time interval for the scheduler to check for new alerts to process.

      Default value: 50s.

    • initial_delay: The delay before starting the scheduler for the first time after the start of PPEM.

      Default value: 10s.

    • timeout: The scheduler timeout for updating alert trigger rules.

      Default value: 10m.

  • delayed_data: The parameters for managing delayed metrics and logs with unknown delay time.

    • is_enabled: Specifies whether delayed data management parameters are enabled.

      Possible values:

      • true

      • false

      If true is specified, PPEM checks for delayed metrics and logs.

      Default value: false.

    • data_delay: The default data delay for all data sources when specific delays are not configured.

      Default value: 180s.

    • datasource_delays: The data delay for specific data sources. This parameter allows specifying different delays for metrics and logs as they may arrive at different rates.

      Possible values:

      • metrics: The delay for the metrics arrival, in seconds. Metrics typically have more consistent collection intervals but may be delayed due to network or processing issues.

      • logs: The delay for the log arrival, in seconds. Logs may arrive more frequently but with higher variability in timing due to log rotation and processing.

    • max_delay: The maximum allowed delay, used to prevent processing data that is too old. Data older than this number of seconds is ignored to prevent false alerts from stale data.

      Default value: 600s.

    • is_adaptive_delay: Enables or disables the adaptive delay learning based on observed data arrival patterns.

      Possible values:

      • true

      • false

      When enabled, PPEM learns actual delays from data timestamps and adjusts the lookback window dynamically.

      Default value: true.

  • notifier: The parameters of the notifier that sends alerts.

    • num_workers: The number of concurrent workers that will send alerts.

      Default value: 5.

    • worker_batch_size: The number of alerts processed by workers in one batch.

      Default value: 20.

    • worker_interval: The polling interval for workers to check for new alerts in the repository database.

      Default value: 30s.

    • backoff_base: The base duration for the exponential backoff calculation when resending a failed alert.

      The delay for resending the alert is calculated as:

      backoff_base × 2^number_of_retry_attempts

      Default value: 10s.

    • max_retries: The maximum number of attempts to resend a failed alert.

      Default value: 3.

    • notification_timeout: The maximum amount of time for the notifier to wait for an alert to be sent before considering it failed.

      Default value: 20s.

    • janitor_interval: The polling interval for the janitor worker that cleans alerts stuck in the processing state.

      Default value: 1m.

    • stale_processing_timeout: The amount of time after which alerts stuck in the processing state are considered stale and must be reset by the janitor worker.

      Default value: 10m.

  • email: The parameters of sending alerts via email.

    • is_enabled: Specifies whether alerts are sent via email.

      Possible values:

      • true

      • false

      If false is specified, alerts are logged instead of being sent via email.

      Default value: false.

    • smtp: The parameters of the SMTP server used for sending alerts.

      • host: The hostname or IP address of the SMTP server.

        Default value: localhost.

      • port: The port number of the SMTP server.

        Default value: 25.

      • username: The username for authenticating on the SMTP server.

        Default value: "".

      • password: The password for authenticating on the SMTP server.

        Default value: "".

      • from: The email address of the alert sender.

        Default value: admin@localdomain.local.

      • timeout: The SMTP server connection timeout.

        Default value: 10s.

      • use_starttls: Specifies whether the STARTTLS extension is used for securing the SMTP server connection.

        Possible values:

        • true

        • false

        Default value: false.

      • use_ssl: Specifies whether the SSL/TLS protocol is used for the SMTP server connection.

        Possible values:

        • true

        • false

        Default value: false.

      • tls: The TLS protocol parameters.

        • insecure_skip_verify: Specifies whether the client skips the verification of the certificate chain and hostname of the SMTP server.

          Possible values:

          • true

          • false

          Default value: false.

          Important

          Setting this parameter to true is a security risk. Use it only for testing purposes or in trusted networks.

        • root_ca_path: The path to the CA certificate used for verifying the certificate of the SMTP server.

          Default value: "".
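
As a worked example of the resend delay formula given for backoff_base, the following Python sketch computes the delays for the default notifier values (backoff_base of 10s, max_retries of 3). The function name retry_delays is hypothetical, and the sketch assumes that the retry counter starts at 0 for the first resend. The documentation does not state whether counting starts at 0 or 1, so the actual delays may be twice as long.

```python
def retry_delays(backoff_base_s=10, max_retries=3):
    """Return the delay in seconds before each resend attempt.

    Implements backoff_base * 2^attempt, assuming attempt starts at 0
    (an assumption; the starting index is not documented).
    """
    return [backoff_base_s * 2 ** attempt for attempt in range(max_retries)]


# Default settings: three resend attempts with doubling delays.
print(retry_delays())       # [10, 20, 40]
# A larger base and more retries grow the schedule quickly.
print(retry_delays(30, 5))  # [30, 60, 120, 240, 480]
```

Under this assumption, a failed alert with default settings is abandoned about 70 seconds after the first failure, not counting the time spent on each attempt.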

Creating an Alert

  1. In the navigation panel, go to Monitoring → Alerts.

  2. In the top-right corner of the page, click Create alert.

  3. Enter parameters of the new alert (parameters marked with an asterisk are required):

    • Name.

    • Severity: The alert priority.

      Possible values:

      • Low

      • Medium

      • High

    • Datasource type: The type of the metrics that will be used for the alert trigger.

      Currently, you can only use pgpro-otel-collector metrics.

    • Datasource.

    • State: The state of the alert after creation.

      Possible values:

      • Disabled

      • Enabled

    • Check interval, sec.: The time interval in seconds for verifying the data source of the alert trigger.

      Minimum value: 60.

    • Flap check, count: The number of repeated triggers required for stopping the alert.

      The value of 0 means that this limitation is disabled.

    • Notify after, sec.: The time in seconds during which the trigger must continually fire for the alert to be sent.

    • Cooldown period, sec.: The time in seconds during which the alert will not be sent after the last firing trigger.

      0 means that this limitation is disabled.

  4. Click Next, and then specify additional parameters (parameters marked with an asterisk are required):

    • Metric name: The name of the metric that will be used for the alert trigger, without any additional characters.

      You can use the following pgpro-otel-collector metrics from the monitoring.metrics table of the repository database:

      • postgresql.archiver.archived_count

      • postgresql.archiver.failed_count

      • postgresql.bgwriter.buffers_checkpoint

      • postgresql.bgwriter.buffers_clean

      • postgresql.bgwriter.buffers_backend

      • postgresql.bgwriter.buffers_allocated

      • postgresql.bgwriter.maxwritten_clean

      • postgresql.bgwriter.buffers_backend_fsync

      • postgresql.bgwriter.checkpoints_requested

      • postgresql.bgwriter.checkpoints_timed

      • postgresql.bgwriter.checkpoint_sync_time_milliseconds

      • postgresql.bgwriter.checkpoint_write_time_milliseconds

      • postgresql.databases.blocks_hit

      • postgresql.databases.blocks_read

      • postgresql.databases.conflicts

      • postgresql.databases.deadlocks

      • postgresql.databases.checksum_failures

      • postgresql.databases.tuples_fetched

      • postgresql.databases.tuples_returned

      • postgresql.databases.tuples_inserted

      • postgresql.databases.tuples_updated

      • postgresql.databases.tuples_deleted

      • postgresql.databases.temp_bytes

      • postgresql.databases.temp_files

      • postgresql.wal.bytes

      • postgresql.databases.rollbacks

      • system.cpu.utilization

      • system.memory.usage

      • system.paging.usage

      • postgresql.wal.records

      • postgresql.databases.commits

    • Operator and Threshold value: The alert trigger rule condition containing a logical operator and value.

      Possible logical operators:

      • = (eq)

      • > (gt)

      • >= (gte)

      • < (lt)

      • <= (lte)

      • != (neq)

      For example, if you select > and specify 0, the alert is sent when the value of the specified metric exceeds 0.

      You can add multiple alert trigger rule conditions by clicking Add.

    • Rule condition: The logical connectives for the specified alert trigger rule conditions.

      Possible values:

      • And

      • Or

      This parameter is available only if you added multiple alert trigger rule conditions.

    • Instances to check.

      Possible values:

      • Check all.

      • Select instances.

        For this value, from Instances, select the instances.

    • Notify users: The users that will receive alerts.

    • Notify groups: The user groups that will receive alerts.

    • Alert template: The template of the alert text.

      You can use the following variables in the alert text:

      • {{.Title}}: The name of the metric used for the alert trigger.

      • {{.Timestamp}}: The date and time when the alert trigger fired.

      • {{.HostName}}: The name of the host where the trigger fired.

      • {{.AgentName}}: The agent that caused the trigger to fire.

      • {{.InstanceName}}: The name of the instance where the trigger fired.

      • {{.Status}}: The status of the alert trigger.

    • Notify resolved: Specifies whether the alert is sent once the trigger is resolved.

      Possible values:

      • Enabled.

        For this value, in Resolved template, enter the template of the alert text.

        You can use the same variables in this alert text as in Alert template.

      • Disabled.

  5. Click Next and then Save to confirm the selected parameters and create the alert.
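
To preview how the variables listed for Alert template expand, the placeholder substitution can be imitated in Python. This is a rough sketch only: the {{.Name}} syntax resembles Go templates, but the engine PPEM actually uses is not stated here, so the render function, the template text, and all sample values below are hypothetical.

```python
import re

# Matches placeholders of the form {{.Name}}, allowing optional spaces.
PLACEHOLDER = re.compile(r"\{\{\s*\.(\w+)\s*\}\}")


def render(template, values):
    """Replace each {{.Name}} placeholder with values["Name"].

    Placeholders without a matching value are left unchanged.
    """
    return PLACEHOLDER.sub(
        lambda match: str(values.get(match.group(1), match.group(0))),
        template,
    )


# Hypothetical template text and sample values, for illustration only.
template = "[{{.Status}}] {{.Title}} on {{.InstanceName}} ({{.HostName}}) at {{.Timestamp}}"
sample = {
    "Status": "firing",
    "Title": "postgresql.databases.deadlocks",
    "InstanceName": "example-instance",
    "HostName": "db-host.example.com",
    "Timestamp": "2025-01-01 12:00:00",
}
print(render(template, sample))
```

Checking a template this way before saving the alert helps catch misspelled variable names, which the sketch leaves in the output unexpanded.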

Viewing Alerts

In the navigation panel, go to Monitoring → Alerts.

The table of alerts with the following columns will be displayed:

  • Name.

  • State.

    Possible values:

    • Enabled

    • Disabled

    • Inactive

  • Source name: The data source of the alert trigger.

    This column includes additional information:

    • Type: The type of the metrics used for the alert trigger.

      Possible values:

      • Repositories: System metrics.

      • Metrics: pgpro-otel-collector metrics.

      • Logs: pgpro-otel-collector logs.

        This type is temporarily not supported.

  • Parameters:

    • Interval, sec.: The time interval in seconds for verifying the data source of the alert trigger.

    • Flap check, count: The number of repeated triggers required for stopping the alert.

      0 means that this limitation is disabled.

    • Notify after, sec.: The time in seconds during which the trigger must continually fire for the alert to be sent.

    • Cooldown period, sec.: The time in seconds during which the alert is not sent after the last trigger.

      0 means that this limitation is disabled.

  • Priority: The alert priority.

    Possible values:

    • Not classified

    • Low

    • Medium

    • High

  • Recipients: The users that receive alerts.

  • Group recipients: The user groups that receive alerts.

  • Rule: The alert trigger rule conditions.

    For example, if an alert trigger rule condition is postgresql.up > 0, the alert is sent when the value of the postgresql.up metric exceeds 0.

  • Actions.

    For more information about available actions, refer to other instructions in this section.
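
The interplay of the Notify after and Cooldown period parameters described above can be sketched as a small decision function. This is an interpretation, not PPEM's implementation: the function name should_send and its arguments are hypothetical, and the sketch assumes that the cooldown is counted from the moment the alert was last sent.

```python
def should_send(now, firing_since, last_sent, notify_after=0, cooldown=0):
    """Decide whether to send an alert at time `now` (all times in seconds).

    firing_since: when the trigger started firing continuously, or None.
    last_sent: when the alert was last sent, or None if it was never sent.
    Assumption: the cooldown period is measured from the last sent alert.
    """
    if firing_since is None:
        return False  # the trigger is not firing
    if now - firing_since < notify_after:
        return False  # has not fired continuously for long enough
    if cooldown > 0 and last_sent is not None and now - last_sent < cooldown:
        return False  # still within the cooldown period
    return True


# Firing for 30 s with "Notify after" set to 60: too early.
print(should_send(30, 0, None, notify_after=60))                 # False
# Firing for 100 s and never sent before: send.
print(should_send(100, 0, None, notify_after=60))                # True
# Sent 10 s ago with a 300 s cooldown: suppressed.
print(should_send(100, 0, 90, notify_after=60, cooldown=300))    # False
```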

Viewing Detailed Alert Information

  1. In the navigation panel, go to Monitoring → Alerts.

  2. Click the three vertical dots icon → View details next to the alert.

    The alert editing window with the following tabs will be displayed:

    • History: The history of trigger firing.

    • Parameters. For more information about alert parameters, refer to Creating an Alert.

  3. Click Close.

Disabling and Enabling Alerts

  1. In the navigation panel, go to Monitoring → Alerts.

  2. Click Disable or Enable next to the alert.

Editing an Alert

  1. In the navigation panel, go to Monitoring → Alerts.

  2. Click the three vertical dots icon → Edit next to the alert.

  3. Edit alert parameters.

  4. Click Save.

Editing Alert Recipients

  1. In the navigation panel, go to Monitoring → Alerts.

  2. Click the three vertical dots icon → Edit recipients next to the alert.

  3. Edit users and user groups that receive alerts.

  4. Click Save.

Clearing Alert History

  1. In the navigation panel, go to Monitoring → Alerts.

  2. Click the three vertical dots icon → View details next to the alert.

  3. Click Clear.

Deleting an Alert

Important

  • System alerts cannot be deleted.

  • Deleted alerts cannot be restored.

To delete an alert:

  1. In the navigation panel, go to Monitoring → Alerts.

  2. Click Delete next to the alert.

  3. Confirm the operation and click Delete.