5.17. Alerts

PPEM supports sending alerts to users and user groups via email to monitor agents, hosts, and instances. To send alerts, you must configure an SMTP server.

For an alert to be sent, an alert trigger must fire. An alert trigger fires when the value of a metric is lower than, equal to, or higher than the specified threshold, depending on the configured rule. Metrics are stored in data sources, for example, in the repository database.

To determine the alert trigger threshold, alert trigger rules are used. An alert trigger rule is a set of one or multiple conditions containing logical operators and values. For example, an alert trigger rule condition can include the > logical operator and the 0 value. With such a rule, the alert trigger fires if the value of the specified metric exceeds 0.

Relations between multiple alert trigger rule conditions are determined by the AND and OR logical connectives.
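
The way conditions and connectives combine can be sketched in a few lines of Python. This is only an illustration of the logic described above, not the PPEM implementation: the trigger_fires function and the OPERATORS mapping are hypothetical names, and the operator keys (eq, gt, gte, lt, lte, neq) follow the short names shown next to the operators in the alert creation form.

```python
import operator

# Hypothetical mapping from the operator short names to Python comparisons.
OPERATORS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
    "neq": operator.ne,
}


def trigger_fires(metric_value, conditions, connective="AND"):
    """Return True if metric_value satisfies the rule.

    conditions: list of (operator_name, threshold) pairs.
    connective: "AND" (all conditions must hold) or "OR" (any condition holds).
    """
    results = [OPERATORS[name](metric_value, threshold)
               for name, threshold in conditions]
    if connective == "AND":
        return all(results)
    if connective == "OR":
        return any(results)
    raise ValueError(f"unknown connective: {connective}")


# One condition: the trigger fires when the metric exceeds 0.
print(trigger_fires(3, [("gt", 0)]))                          # True
# Two conditions joined by AND: 150 is above 0 but not <= 100.
print(trigger_fires(150, [("gt", 0), ("lte", 100)], "AND"))   # False
# The same conditions joined by OR: one match is enough.
print(trigger_fires(150, [("gt", 0), ("lte", 100)], "OR"))    # True
```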

For alerts to work properly, you must first install and configure logging and monitoring tools.

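This section explains how to manage alerts. It includes the following instructions:

  • Pre-Configuring Alerts

  • Creating an Alert

  • Viewing Alerts

  • Disabling and Enabling Alerts

  • Editing Alert Recipients

  • Deleting an Alert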

Important

The alerts functionality is in the beta phase. Currently, only pgpro-otel-collector metrics can be used for alert triggers. Other existing limitations are specified in the corresponding instructions of this section.

Pre-Configuring Alerts

To work with alerts, you must first configure them in the ppem-manager.yml manager configuration file.

You can specify the following parameters:

alerts:
  metrics:
    request_chunk_size: number_of_instance_IDs
  scheduler:
    interval: interval_for_checking_new_alerts
    initial_delay: delay_for_starting_alert_scheduler
    timeout: timeout_for_updating_alert_trigger_rules
  notifier:
    num_workers: number_of_concurrent_workers
    worker_batch_size: number_of_alerts_in_one_batch
    worker_interval: interval_for_checking_new_alerts
    backoff_base: exponential_backoff_calculation_duration
    max_retries: maximum_number_of_alert_attempts
    notification_timeout: alert_timeout
    janitor_interval: janitor_worker_polling_interval
    stale_processing_timeout: stale_alert_processing_timeout
  email:
    is_enabled: true or false
    smtp:
      host: SMTP_server_hostname_or_IP
      port: SMTP_server_port
      username: username_for_SMTP_server_authentication
      password: password_for_SMTP_server_authentication
      from: alert_sender_email
      timeout: SMTP_server_connection_timeout
      use_starttls: true or false
      use_ssl: true or false
      tls:
        insecure_skip_verify: true or false
        root_ca_path: path_to_root_CA

Where:

  • metrics: The parameters of sending requests to the metrics plugin.

    • request_chunk_size: The maximum number of instance IDs within one request.

      Default value: 100.

  • scheduler: The parameters of the scheduler that updates alerts in the manager memory.

    • interval: The time interval for the scheduler to check for new alerts to process.

      Default value: 50s.

    • initial_delay: The delay before starting the scheduler for the first time after the start of PPEM.

      Default value: 10s.

    • timeout: The scheduler timeout for updating alert trigger rules.

      Default value: 10m.

  • notifier: The parameters of the notifier that sends alerts.

    • num_workers: The number of concurrent workers that will send alerts.

      Default value: 5.

    • worker_batch_size: The number of alerts processed by workers in one batch.

      Default value: 20.

    • worker_interval: The polling interval for workers to check for new alerts in the repository database.

      Default value: 30s.

    • backoff_base: The base duration for the exponential backoff calculation when resending a failed alert.

      The delay for resending the alert is calculated as:

      backoff_base × (2^number_of_retry_attempts).

      Default value: 10s.

    • max_retries: The maximum number of attempts to resend a failed alert.

      Default value: 3.

    • notification_timeout: The maximum amount of time for the notifier to wait for an alert to be sent before considering it failed.

      Default value: 20s.

    • janitor_interval: The polling interval for the janitor worker that cleans alerts stuck in the processing state.

      Default value: 1m.

    • stale_processing_timeout: The amount of time after which alerts stuck in the processing state are considered stale and must be reset by the janitor worker.

      Default value: 10m.

  • email: The parameters of sending alerts via email.

    • is_enabled: Specifies whether alerts are sent via email.

      Possible values:

      • true

      • false

      If false is specified, alerts are logged instead of being sent via email.

      Default value: false.

    • smtp: The parameters of the SMTP server used for sending alerts.

      • host: The hostname or IP address of the SMTP server.

        Default value: localhost.

      • port: The port number of the SMTP server.

        Default value: 25.

      • username: The username for authenticating with the SMTP server.

        Default value: "".

      • password: The password for authenticating with the SMTP server.

        Default value: "".

      • from: The email address of the alert sender.

        Default value: admin@localdomain.local.

      • timeout: The SMTP server connection timeout.

        Default value: 10s.

      • use_starttls: Specifies whether the STARTTLS extension is used for securing the SMTP server connection.

        Possible values:

        • true

        • false

        Default value: false.

      • use_ssl: Specifies whether the SSL/TLS protocol is used for the SMTP server connection.

        Possible values:

        • true

        • false

        Default value: false.

      • tls: The TLS protocol parameters.

        • insecure_skip_verify: Specifies whether the client skips the verification of the certificate chain and hostname of the SMTP server.

          Possible values:

          • true

          • false

          Default value: false.

          Important

          Setting this parameter to true is a security risk. Set it to true only for testing purposes or in trusted networks.

        • root_ca_path: The path to the CA certificate used for verifying the certificate of the SMTP server.

          Default value: "".
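
As a worked example of the backoff_base formula, the following Python sketch lists the delays before each resend attempt for the default values of backoff_base (10s) and max_retries (3). The parse_duration and retry_delays helpers are hypothetical. The sketch assumes that number_of_retry_attempts equals 1 for the first resend; if the counter actually starts at 0, pass first_attempt=0.

```python
def parse_duration(text):
    """Convert a duration such as "10s" or "1m" to seconds.

    Only the s, m, and h suffixes are handled in this sketch.
    """
    units = {"s": 1, "m": 60, "h": 3600}
    return int(text[:-1]) * units[text[-1]]


def retry_delays(backoff_base="10s", max_retries=3, first_attempt=1):
    """Return the delay in seconds before each resend attempt.

    Implements backoff_base * 2^attempt. The value of the first attempt
    number (first_attempt) is an assumption, not documented behavior.
    """
    base = parse_duration(backoff_base)
    attempts = range(first_attempt, first_attempt + max_retries)
    return [base * 2 ** attempt for attempt in attempts]


print(retry_delays())                  # [20, 40, 80]
print(retry_delays(first_attempt=0))   # [10, 20, 40]
```

Under either assumption, the delay doubles with every failed attempt, so increasing max_retries lengthens the total time an alert can remain pending much faster than increasing backoff_base does.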

Creating an Alert

  1. In the navigation panel, go to Monitoring → Alerts.

  2. In the top-right corner of the page, click Create alert.

  3. Enter parameters of the new alert (parameters marked with an asterisk are required):

    • Name.

    • Datasource type: The type of the metrics that will be used for the alert trigger.

      Currently, you can only use pgpro-otel-collector metrics.

    • Datasource.

    • State: The state of the alert after creation.

      Possible values:

      • Disabled

      • Enabled

    • Check interval, sec.: The time interval in seconds for verifying the data source of the alert trigger.

      Minimum value: 60.

    • Flap check, count: The number of repeated triggers required for stopping the alert.

      0 means that this limitation is disabled.

    • Notify after, sec.: The time in seconds during which the trigger must continually fire for the alert to be sent.

    • Cooldown period, sec.: The time in seconds during which the alert will not be sent after the last firing trigger.

      0 means that this limitation is disabled.

    Note

    Currently, you cannot change the values of the following parameters: Flap check, count; Notify after, sec.; and Cooldown period, sec.

  4. Click Next, and then specify additional parameters (parameters marked with an asterisk are required):

    • Metric name: The name of the metric without any additional characters that will be used for the alert trigger.

      You can use the following pgpro-otel-collector metrics from the monitoring.metrics table of the repository database:

      • postgresql.archiver.archived_count

      • postgresql.archiver.failed_count

      • postgresql.bgwriter.buffers_checkpoint

      • postgresql.bgwriter.buffers_clean

      • postgresql.bgwriter.buffers_backend

      • postgresql.bgwriter.buffers_allocated

      • postgresql.bgwriter.maxwritten_clean

      • postgresql.bgwriter.buffers_backend_fsync

      • postgresql.bgwriter.checkpoints_requested

      • postgresql.bgwriter.checkpoints_timed

      • postgresql.bgwriter.checkpoint_sync_time_milliseconds

      • postgresql.bgwriter.checkpoint_write_time_milliseconds

      • postgresql.databases.blocks_hit

      • postgresql.databases.blocks_read

      • postgresql.databases.conflicts

      • postgresql.databases.deadlocks

      • postgresql.databases.checksum_failures

      • postgresql.databases.tuples_fetched

      • postgresql.databases.tuples_returned

      • postgresql.databases.tuples_inserted

      • postgresql.databases.tuples_updated

      • postgresql.databases.tuples_deleted

      • postgresql.databases.temp_bytes

      • postgresql.databases.temp_files

      • postgresql.wal.bytes

      • postgresql.databases.rollbacks

      • system.cpu.utilization

      • system.memory.usage

      • system.paging.usage

      • postgresql.wal.records

      • postgresql.databases.commits

    • Operator and Threshold value: The logical operator and the value that form an alert trigger rule condition.

      Possible logical operators:

      • = (eq)

      • > (gt)

      • >= (gte)

      • < (lt)

      • <= (lte)

      • != (neq)

      For example, if you select > and specify 0, the alert is sent when the value of the specified metric exceeds 0.

      You can add multiple alert trigger rule conditions by clicking Add.

    • Rule condition: The logical connectives for the specified alert trigger rule conditions.

      Possible values:

      • AND

      • OR

      This parameter is available only if you added multiple alert trigger rule conditions.

    • Instances to check.

      Possible values:

      • Check all.

      • Select instances.

        If you select this value, select the instances from Instances.

    • Notify users: The users that will receive alerts.

    • Notify groups: The user groups that will receive alerts.

    • Alert template: The template of the alert text.

      You can use the following variables in the alert text:

      • {{.Title}}: The name of the metric used for the alert trigger.

      • {{.Timestamp}}: The date and time when the alert trigger fired.

      • {{.Status}}: The status of the alert trigger.

    • Notify resolved: Specifies whether the alert is sent once the trigger is resolved.

      Possible values:

      • Enabled.

        For this value, in Resolved template, enter the template of the alert text.

        You can use the same variables in this alert text as in Alert template.

      • Disabled.

  5. Click Save.
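
To see how the variables in Alert template expand, consider the following Python sketch. It only imitates the substitution of {{.Title}}, {{.Timestamp}}, and {{.Status}}; the actual template engine used by PPEM may behave differently. The render function is hypothetical, and the timestamp format and the firing status value are assumptions made for illustration.

```python
import re


def render(template, values):
    """Replace {{.Name}} placeholders with the matching values.

    Placeholders with unknown names are left unchanged.
    """
    def substitute(match):
        name = match.group(1)
        return str(values.get(name, match.group(0)))

    return re.sub(r"\{\{\s*\.(\w+)\s*\}\}", substitute, template)


template = "[{{.Status}}] {{.Title}} at {{.Timestamp}}"
values = {
    "Title": "postgresql.databases.deadlocks",
    "Timestamp": "2025-01-01 12:00:00",  # hypothetical timestamp format
    "Status": "firing",                  # hypothetical status value
}

print(render(template, values))
# [firing] postgresql.databases.deadlocks at 2025-01-01 12:00:00
```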

Viewing Alerts

In the navigation panel, go to Monitoring → Alerts.

The table of alerts with the following columns will be displayed:

  • Name.

  • State.

    Possible values:

    • Enabled

    • Disabled

  • Source name: The data source of the alert trigger.

    This column includes additional information:

    Type: The type of the metrics used for the alert trigger.

    Possible values:

    • Repositories: System metrics.

    • Metrics: pgpro-otel-collector metrics.

    • Logs: pgpro-otel-collector logs.

      This type is currently not supported.

  • Interval, sec.: The time interval in seconds for verifying the data source of the alert trigger.

  • Flap check, count: The number of repeated triggers required for stopping the alert.

    0 means that this limitation is disabled.

  • Notify after, sec.: The time in seconds during which the trigger must continually fire for the alert to be sent.

  • Cooldown period, sec.: The time in seconds during which the alert is not sent after the last trigger.

    0 means that this limitation is disabled.

  • Recipients: The users that receive alerts.

  • Group recipients: The user groups that receive alerts.

  • Rule: The alert trigger rule conditions.

    For example, if an alert trigger rule condition is postgresql.up > 0, the alert is sent when the value of the postgresql.up metric exceeds 0.

  • Actions.

    For more information about available actions, refer to other instructions in this section.

Disabling and Enabling Alerts

  1. In the navigation panel, go to Monitoring → Alerts.

  2. Click Disable or Enable next to the alert.

Editing Alert Recipients

  1. In the navigation panel, go to Monitoring → Alerts.

  2. Click Edit recipients next to the alert.

  3. Edit users and user groups that receive alerts.

  4. Click Save.

Deleting an Alert

Important

  • System alerts cannot be deleted.

  • Deleted alerts cannot be restored.

To delete an alert:

  1. In the navigation panel, go to Monitoring → Alerts.

  2. Click Delete next to the alert.

  3. Confirm the operation and click Delete.