Thread: Identify root-cause for intermittent spikes

Identify root-cause for intermittent spikes

From

Sengottaiyan T

Date:

11 October 2022, 11:06:55

Hi All,

I'm looking for suggestions:

Environment: AWS PostgreSQL RDS instance - Version 14.3
Operations support gets intermittent alerts from the monitoring tool through AWS cloud watch metrics on Disk Queue Depth, CPU burst-credit & CPU Utilization.
I would like to understand what is causing the spike - is the number of logon's increased, (or) number of transactions per second increased, (or) SQL execution picked wrong plan and the long running (I/O, CPU or memory intensive) SQL is increasing load on server (cause and effect scenario) etc.,

Due to the reactive nature of the issues, we rely on the metrics gathered in the AWS cloud watch monitoring (for the underlying OS stats), Performance Insights (for the DB performance) and correlate SQL queries with pg_Stat_Statements view. But the data in the view is an aggregated stats. And, I'm looking to see the deltas compared to normal runs.
How should I approach and get to the root-cause?

AppDynamics is already configured for the RDS instance. Are there any open source monitoring tools available which would help to capture and visualize the deltas?

Thanks,
Senko

Re: Identify root-cause for intermittent spikes

From

MichaelDBA

Date:

11 October 2022, 11:48:54

Hello,

Your problem is probably, too many active, concurrent connections. Get it from here the db directly:
select datname, usename, application_name, substring(query, 1, 80) query from pg_stat_activity where state in ('active','idle in transaction');

Compare the number of rows returned with the number of vCPUs. If it's more than double the number of vCPUs in your AWS instance class, then you are cpu saturated.

Regards,

Michael Vitale

Sengottaiyan T wrote on 10/11/2022 7:06 AM:

Hi All,

I'm looking for suggestions:

Environment: AWS PostgreSQL RDS instance - Version 14.3
Operations support gets intermittent alerts from the monitoring tool through AWS cloud watch metrics on Disk Queue Depth, CPU burst-credit & CPU Utilization.
I would like to understand what is causing the spike - is the number of logon's increased, (or) number of transactions per second increased, (or) SQL execution picked wrong plan and the long running (I/O, CPU or memory intensive) SQL is increasing load on server (cause and effect scenario) etc.,

Due to the reactive nature of the issues, we rely on the metrics gathered in the AWS cloud watch monitoring (for the underlying OS stats), Performance Insights (for the DB performance) and correlate SQL queries with pg_Stat_Statements view. But the data in the view is an aggregated stats. And, I'm looking to see the deltas compared to normal runs.
How should I approach and get to the root-cause?

AppDynamics is already configured for the RDS instance. Are there any open source monitoring tools available which would help to capture and visualize the deltas?

Thanks,
Senko

Re: Identify root-cause for intermittent spikes

From

Rick Otten

Date:

12 October 2022, 00:42:19

I like to use pgbadger to collect data on what is happening in RDS instances. You have to turn up a bunch of logging in RDS:

1. Turn on connection logging, duration logging, lock_waits, and anything else that you are interested in studying.

2. Then grab all of your postgresql logs from AWS. I wrote this little bash script to pull all of the logs for a current day. It will work if you have your aws credentials configured correctly and can run aws-cli commands.

```

#!/bin/env bash

## Return all of the postgresql log files saved by RDS since midnight.
## Save them in your current directory.
## This is so we can use cli tools like "grep"
## It is also really handy for feeding into pgbadger for deeper analysis.

# aws requires the timestamp to be in milliseconds.
# unfortunately date will provide either seconds or nano seconds, so we have to do math.
midnight_timestamp=$(date -d $(date -I) '+%s')
midnight_timestamp_milliseconds=$(echo "${midnight_timestamp} * 1000" | bc)

logfiles=$(aws rds describe-db-log-files \
--profile default \
--db-instance-identifier "some_rds_instance_name" \
--output json \
--file-last-written ${midnight_timestamp_milliseconds} | jq -r ".DescribeDBLogFiles[].LogFileName")

for logfile in $(echo ${logfiles})
do
# remove the leading "error/" so we can use the name to save it.
logfile_save=$(echo "${logfile}" | awk -F\/ '{print $NF}')

tput bold; echo "${logfile}"; tput sgr0
aws rds download-db-log-file-portion \
--profile admin \
--db-instance-identifier prod-notify-me-1 \
--log-file-name ${logfile} \
--output text \
--no-paginate > ${logfile_save}
done

```

3. Then run pgbadger:

``` ~/src/pgbadger/pgbadger -f rds postgresql*

```

4. Open the `out.html` in your browser, and poke around. There is a ton of stuff you can find in all the drop down menus about what was happening in your database over the time window you collected the logs for. The html is generated as a standalone file by a perl script of all things. It is pretty impressive.

On Tue, Oct 11, 2022 at 7:07 AM Sengottaiyan T <techsenko@gmail.com> wrote:

Hi All,

I'm looking for suggestions:

Environment: AWS PostgreSQL RDS instance - Version 14.3
Operations support gets intermittent alerts from the monitoring tool through AWS cloud watch metrics on Disk Queue Depth, CPU burst-credit & CPU Utilization.
I would like to understand what is causing the spike - is the number of logon's increased, (or) number of transactions per second increased, (or) SQL execution picked wrong plan and the long running (I/O, CPU or memory intensive) SQL is increasing load on server (cause and effect scenario) etc.,

Due to the reactive nature of the issues, we rely on the metrics gathered in the AWS cloud watch monitoring (for the underlying OS stats), Performance Insights (for the DB performance) and correlate SQL queries with pg_Stat_Statements view. But the data in the view is an aggregated stats. And, I'm looking to see the deltas compared to normal runs.
How should I approach and get to the root-cause?

AppDynamics is already configured for the RDS instance. Are there any open source monitoring tools available which would help to capture and visualize the deltas?

Thanks,
Senko

Re: Identify root-cause for intermittent spikes

From

SAMEER KUMAR

Date:

12 October 2022, 06:50:31

On Tue, 11 Oct 2022, 22:07 Sengottaiyan T, <techsenko@gmail.com> wrote:

Hi All,

I'm looking for suggestions:

Environment: AWS PostgreSQL RDS instance - Version 14.3
Operations support gets intermittent alerts from the monitoring tool through AWS cloud watch metrics on Disk Queue Depth, CPU burst-credit & CPU Utilization.
I would like to understand what is causing the spike - is the number of logon's increased, (or) number of transactions per second increased, (or) SQL execution picked wrong plan and the long running (I/O, CPU or memory intensive) SQL is increasing load on server (cause and effect scenario) etc.,

Due to the reactive nature of the issues, we rely on the metrics gathered in the AWS cloud watch monitoring (for the underlying OS stats), Performance Insights (for the DB performance) and correlate SQL queries with pg_Stat_Statements view. But the data in the view is an aggregated stats. And, I'm looking to see the deltas compared to normal runs.

Performance Insights should also offer you visibility into statement level stats for Top SQL if pg_stat_statements is enabled.

Performance Insights also has other metrics (Counter Metrics) that you can refer to to understand some of the data points you are after - xact_count/second, session_in_idle_in_transactions/second, blocked_transactions/second etc. You need to add them to PI dashboard by using Manage Meteics button on PI dashboard.

How should I approach and get to the root-cause?

AppDynamics is already configured for the RDS instance. Are there any open source monitoring tools available which would help to capture and visualize the deltas?

Thanks,
Senko

Regards

Sameer

DB Specialist,

Amazon Web Services

Re: Identify root-cause for intermittent spikes

From

Sengottaiyan T

Date:

12 October 2022, 17:16:23

Thanks, Michael.

Due to reactive nature of the intermittent alerts, Is there any table which stores the historical information / periodic snapshots captured from the pg_stat_activity view?

On Tue, Oct 11, 2022 at 5:18 PM MichaelDBA <MichaelDBA@sqlexec.com> wrote:

Hello,

Your problem is probably, too many active, concurrent connections. Get it from here the db directly:
select datname, usename, application_name, substring(query, 1, 80) query from pg_stat_activity where state in ('active','idle in transaction');

Compare the number of rows returned with the number of vCPUs. If it's more than double the number of vCPUs in your AWS instance class, then you are cpu saturated.

Regards,
Michael Vitale

Sengottaiyan T wrote on 10/11/2022 7:06 AM:
Hi All,

I'm looking for suggestions:

Environment: AWS PostgreSQL RDS instance - Version 14.3
Operations support gets intermittent alerts from the monitoring tool through AWS cloud watch metrics on Disk Queue Depth, CPU burst-credit & CPU Utilization.
I would like to understand what is causing the spike - is the number of logon's increased, (or) number of transactions per second increased, (or) SQL execution picked wrong plan and the long running (I/O, CPU or memory intensive) SQL is increasing load on server (cause and effect scenario) etc.,

Due to the reactive nature of the issues, we rely on the metrics gathered in the AWS cloud watch monitoring (for the underlying OS stats), Performance Insights (for the DB performance) and correlate SQL queries with pg_Stat_Statements view. But the data in the view is an aggregated stats. And, I'm looking to see the deltas compared to normal runs.
How should I approach and get to the root-cause?

AppDynamics is already configured for the RDS instance. Are there any open source monitoring tools available which would help to capture and visualize the deltas?

Thanks,
Senko

Re: Identify root-cause for intermittent spikes

From

Sengottaiyan T

Date:

12 October 2022, 17:19:19

Thanks, Rick - I will give it a try.

On Wed, Oct 12, 2022 at 6:12 AM Rick Otten <rottenwindfish@gmail.com> wrote:

I like to use pgbadger to collect data on what is happening in RDS instances. You have to turn up a bunch of logging in RDS:

1. Turn on connection logging, duration logging, lock_waits, and anything else that you are interested in studying.

2. Then grab all of your postgresql logs from AWS. I wrote this little bash script to pull all of the logs for a current day. It will work if you have your aws credentials configured correctly and can run aws-cli commands.
```
#!/bin/env bash

## Return all of the postgresql log files saved by RDS since midnight.
## Save them in your current directory.
## This is so we can use cli tools like "grep"
## It is also really handy for feeding into pgbadger for deeper analysis.

# aws requires the timestamp to be in milliseconds.
# unfortunately date will provide either seconds or nano seconds, so we have to do math.
midnight_timestamp=$(date -d $(date -I) '+%s')
midnight_timestamp_milliseconds=$(echo "${midnight_timestamp} * 1000" | bc)

logfiles=$(aws rds describe-db-log-files \
--profile default \
--db-instance-identifier "some_rds_instance_name" \
--output json \
--file-last-written ${midnight_timestamp_milliseconds} | jq -r ".DescribeDBLogFiles[].LogFileName")

for logfile in $(echo ${logfiles})
do
# remove the leading "error/" so we can use the name to save it.
logfile_save=$(echo "${logfile}" | awk -F\/ '{print $NF}')

tput bold; echo "${logfile}"; tput sgr0
aws rds download-db-log-file-portion \
--profile admin \
--db-instance-identifier prod-notify-me-1 \
--log-file-name ${logfile} \
--output text \
--no-paginate > ${logfile_save}
done
```
3. Then run pgbadger:
``` ~/src/pgbadger/pgbadger -f rds postgresql*
```
4. Open the `out.html` in your browser, and poke around. There is a ton of stuff you can find in all the drop down menus about what was happening in your database over the time window you collected the logs for. The html is generated as a standalone file by a perl script of all things. It is pretty impressive.

On Tue, Oct 11, 2022 at 7:07 AM Sengottaiyan T <techsenko@gmail.com> wrote:
Hi All,

I'm looking for suggestions:

Environment: AWS PostgreSQL RDS instance - Version 14.3
Operations support gets intermittent alerts from the monitoring tool through AWS cloud watch metrics on Disk Queue Depth, CPU burst-credit & CPU Utilization.
I would like to understand what is causing the spike - is the number of logon's increased, (or) number of transactions per second increased, (or) SQL execution picked wrong plan and the long running (I/O, CPU or memory intensive) SQL is increasing load on server (cause and effect scenario) etc.,

Due to the reactive nature of the issues, we rely on the metrics gathered in the AWS cloud watch monitoring (for the underlying OS stats), Performance Insights (for the DB performance) and correlate SQL queries with pg_Stat_Statements view. But the data in the view is an aggregated stats. And, I'm looking to see the deltas compared to normal runs.
How should I approach and get to the root-cause?

AppDynamics is already configured for the RDS instance. Are there any open source monitoring tools available which would help to capture and visualize the deltas?

Thanks,
Senko

Re: Identify root-cause for intermittent spikes

From

hubert depesz lubaczewski

Date:

13 October 2022, 08:33:28

On Wed, Oct 12, 2022 at 10:46:23PM +0530, Sengottaiyan T wrote:
> Thanks, Michael.
> 
> Due to reactive nature of the intermittent alerts, Is there any table which
> stores the historical information / periodic snapshots captured from the
> pg_stat_activity view?

No. Unless you will make one, and add data to it from some cron-like
thing.

Best regards,

depesz

Re: Identify root-cause for intermittent spikes

From

Sengottaiyan T

Date:

14 October 2022, 17:27:00

Thanks, Depsez.

Please suggest: Is there any open-source tool available for capturing such information?

On Thu, Oct 13, 2022, 14:03 hubert depesz lubaczewski <depesz@depesz.com> wrote:

On Wed, Oct 12, 2022 at 10:46:23PM +0530, Sengottaiyan T wrote:
> Thanks, Michael.
>
> Due to reactive nature of the intermittent alerts, Is there any table which
> stores the historical information / periodic snapshots captured from the
> pg_stat_activity view?

No. Unless you will make one, and add data to it from some cron-like
thing.

Best regards,

depesz

Re: Identify root-cause for intermittent spikes

From

Julien Rouhaud

Date:

15 October 2022, 10:35:22

On Fri, Oct 14, 2022 at 10:57:00PM +0530, Sengottaiyan T wrote:
>
> Please suggest: Is there any open-source tool available for capturing such
> information?

Most of the open-source tools won't work as you won't be able to install them
on RDS.

As far as I know the "Performance Insights" provides detailed information, not
only cumulated metrics, so that's probably your best option.

Re: Identify root-cause for intermittent spikes

From

hubert depesz lubaczewski

Date:

16 October 2022, 09:18:28

On Fri, Oct 14, 2022 at 10:57:00PM +0530, Sengottaiyan T wrote:
> Please suggest: Is there any open-source tool available for capturing such
> information?

pg_cron?

depesz