Thread: How to analyze a short but heavy intermittent slowdown of BIND on a production database (or BIND vs log_lock_waits)

Hi,

When performing post-mortem analysis of some short latency spikes on a heavily loaded database, I found that the reason for a (less than 10 second) latency spike wasn't at the EXECUTE stage but at the BIND stage.
At the same time, graphical monitoring shows that during this few-second period there were some queries waiting at the BIND stage.

Logging setup:
log_min_duration_statement=200ms
log_lock_waits=on
deadlock_timeout=100ms
So I expected that every lock wait over 100ms (>deadlock_timeout) should show up in the log.
But in the log I see only the spikes of slow BIND, with no lock waits logged.
(counting slow BIND entries in 10-second buckets:
grep BIND /var/log/postgresql/postgresql-2022-12-29.log | grep 'duration' | perl -pe 's/^(2022-12-29 \d\d:\d\d:\d).*$/$1/' | sort | uniq -c | less
...
      9 2022-12-29 00:12:5
      2 2022-12-29 00:13:1
      3 2022-12-29 00:13:5
!!!  68 2022-12-29 00:14:0
      5 2022-12-29 00:14:1
      3 2022-12-29 00:14:2
      2 2022-12-29 00:14:3
).
But no lock waits at the BIND stage were logged during the problem period (and no lock waits at all, in fact).
Similar issues happen a few times per day without any visible pattern (but usually on the same tables).
No CPU or IO load/latency spikes were found during the problem periods.
No EXECUTE slowdowns were found in the log during that time.

So currently I have two hypotheses under research:
1) not every lock wait during the BIND stage gets logged;
2) there is some non-lock-related intermittent slowdown of BIND.

I would appreciate any ideas on how to debug this issue (the duration of such a spike is usually under 1s, but given how many TPS the database serves, 1s is too much and affects end users).
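
What I plan to try next is periodic sampling of pg_stat_activity to see what the stuck backends are actually waiting on during such a spike. A rough sketch of the sampling query (the 100ms interval and the filter on client backends are my assumptions, nothing validated yet):

-- Run in a tight loop (e.g. every 100ms) and keep the output for post-mortem analysis.
SELECT now()            AS sample_time,
       pid,
       state,
       wait_event_type,
       wait_event,
       query_start,
       left(query, 80)  AS query
FROM pg_stat_activity
WHERE backend_type = 'client backend'
  AND state = 'active'
  AND wait_event IS NOT NULL;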


--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

On Sat, Dec 31, 2022 at 02:26:08PM +0200, Maxim Boguk wrote:
> Hi,
> 
> When performing post-mortem analysis of some short latency spikes on a
> heavily loaded database, I found that the reason for a (less than 10 second)
> latency spike wasn't at the EXECUTE stage but at the BIND stage.
> At the same time, graphical monitoring shows that during this few-second
> period there were some queries waiting at the BIND stage.
> 
> Logging setup:
> log_min_duration_statement=200ms
> log_lock_waits=on
> deadlock_timeout=100ms
> So I expected that every lock wait over 100ms (>deadlock_timeout) should
> show up in the log.
> But in the log I see only the spikes of slow BIND, with no lock waits logged.

What version of Postgres?  What settings have non-default values?
What OS/version?  What environment/hardware?  VM/image/provider/...?
What are the queries that are running BIND?  What parameter types?
Are the slow BINDs failing?  Are their parameters being logged?
What else is running besides Postgres?  Are the DB clients local or
remote?  It shouldn't matter, but what client library?

https://wiki.postgresql.org/wiki/Slow_Query_Questions

-- 
Justin





On Sat, Dec 31, 2022 at 4:32 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Sat, Dec 31, 2022 at 02:26:08PM +0200, Maxim Boguk wrote:
> Hi,
>
> When performing post-mortem analysis of some short latency spikes on a
> heavily loaded database, I found that the reason for a (less than 10 second)
> latency spike wasn't at the EXECUTE stage but at the BIND stage.
> At the same time, graphical monitoring shows that during this few-second
> period there were some queries waiting at the BIND stage.
>
> Logging setup:
> log_min_duration_statement=200ms
> log_lock_waits=on
> deadlock_timeout=100ms
> So I expected that every lock wait over 100ms (>deadlock_timeout) should
> show up in the log.
> But in the log I see only the spikes of slow BIND, with no lock waits logged.

What version of Postgres?  What settings have non-default values?
What OS/version?  What environment/hardware?  VM/image/provider/...?
What are the queries that are running BIND?  What parameter types?
Are the slow BINDs failing?  Are their parameters being logged?
What else is running besides Postgres?  Are the DB clients local or
remote?  It shouldn't matter, but what client library?

What version of Postgres? - 14.6

What settings have non-default values ? - A lot (it's a 48-core Amazon EC2 server with 396GB of RAM,
and the database is carefully tuned for this particular workload).

What OS/version? - Ubuntu 20.04 LTS

What environment/hardware? - A 48-core Amazon EC2 server with 396GB of RAM and local NVMe storage
(i3en.12xlarge).

What are the queries that are running BIND ? - Nothing special; during a problem period a lot of completely different queries become stuck at the BIND and PARSE stages, yet no long-duration (>100ms) EXECUTE calls are found. In general it feels as if the whole BIND/PARSE machinery locks up for a short period.
==== LOG SAMPLE ==========================
2023-01-01 09:07:31.622 UTC 1848286 ****** from [local] [vxid:84/20886619 txid:0] [PARSE] LOG:  duration: 235.472 ms  parse <unnamed>: SELECT COUNT(*) FROM "job_stats_master" WHERE (job_stats_master.created_at >
= '2022-12-31 09:07:31.350000') AND (job_stats_master.created_at < '2023-01-01 09:07:31.350000') AND "job_stats_master"."employer_id" = **** AND "job_stats_master"."action" = 2 AND "job_stats_master"."job_board_id" = **** AND "job_stats_master"."ip_matching_id" = *****
2023-01-01 09:07:31.622 UTC 1898699 ******  from [local] [vxid:158/22054921 txid:0] [BIND] LOG:  duration: 231.274 ms  bind <unnamed>: SELECT id, name FROM job_types WHERE id IN ($1)
2023-01-01 09:07:31.622 UTC 1898699 ******* from [local] [vxid:158/22054921 txid:0] [BIND] DETAIL:  parameters: $1 = '0'
2023-01-01 09:07:31.622 UTC 1794756 ******* from [local] [vxid:281/10515416 txid:0] [BIND] LOG:  duration: 231.024 ms  bind <unnamed>: SELECT id, name FROM job_types WHERE id IN ($1)
2023-01-01 09:07:31.622 UTC 1794756 ******* from [local] [vxid:281/10515416 txid:0] [BIND] DETAIL:  parameters: $1 = '0'

... 5 pages of BIND/PARSE entries for different, mutually unrelated queries, each logged with over 100ms runtime

2023-01-01 09:07:31.623 UTC 1806315 ******* from [local] [vxid:231/17406673 txid:0] [BIND] LOG:  duration: 139.372 ms  bind <unnamed>: SELECT     employers.*,     third_party_employer_pixels.facebook_pixel_id AS facebook_pixel_id,     third_party_employer_pixels.google_pixel_id   AS google_pixel_id,     third_party_employer_pixels.google_actions    AS google_actions,     employer_pixel_configurations.solution        AS tracking_solution,     employer_pixel_configurations.domain_name     AS domain,     settings.use_multiple_bids   FROM employers     LEFT JOIN third_party_employer_pixels   ON third_party_employer_pixels.employer_id = employers.id     LEFT JOIN employer_pixel_configurations ON employer_pixel_configurations.id = employers.id     LEFT JOIN settings                      ON settings.id = employers.setting_id WHERE employers.id = $1
2023-01-01 09:07:31.623 UTC 1806315 ******* from [local] [vxid:231/17406673 txid:0] [BIND] DETAIL:  parameters: $1 = '*****'
2023-01-01 09:07:31.624 UTC 1806321 ******* from [local] [vxid:176/21846997 txid:0] [BIND] LOG:  duration: 120.237 ms  bind <unnamed>: SELECT job_boards.*, enterprises.product_type, feed_settings.use_employer_exported_name as use_employer_exported_name, integration_job_board_settings.integration_status as integration_status FROM job_boards LEFT JOIN integration_job_board_settings ON integration_job_board_settings.id = job_boards.integration_job_board_setting_id LEFT JOIN enterprises ON enterprises.id = job_boards.enterprise_id LEFT JOIN feed_settings ON feed_settings.id = job_boards.feed_setting_id WHERE job_boards.id = $1
2023-01-01 09:07:31.624 UTC 1806321 ******* from [local] [vxid:176/21846997 txid:0] [BIND] DETAIL:  parameters: $1 = '****'
===============================================================
What is really curious in the log: every one of the 100+ queries stuck at the PARSE/BIND stage was logged (and thus unstuck) at the same exact moment... which very likely means that they all had been stuck in the same single place.
I.e. something locked the whole PARSE/BIND machinery (but not EXECUTE) for 200+ms.
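
A quick way to test the "same single place" guess would be to run something like this right inside the sub-second window (just a sketch; catching the window in time is the hard part):

SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event
ORDER BY count(*) DESC;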

Are the slow BINDs failing?
No, they all executed successfully later after being unstuck.

Are their parameters being logged ?
Yes.

What else is running besides postgres ?
Nothing else; it's a dedicated DB server.

Are the DB clients local or remote ?
All remote, over a fast network.

It shouldn't matter, but what client library?
50% Ruby on Rails / 50% Java (JDBC).


The problem is that the issue happens only a few times per 24 hours, with a usual duration under 1 second,
so it is very hard to catch with perf, gdb, or strace.



--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

Howdy,

A few additional questions:
  1. How many concurrent, active connections are running when these BIND problems occur?  select count(*) from pg_stat_activity where state in ('active','idle in transaction')
  2. Are the queries using gigantic IN (<big list>) values?
  3. Perhaps unrelated, but is log_temp_files turned on, and if so, do you have a lot of logs related to that?

Regards,
Michael Vitale, just another PG DBA








On Sun, Jan 1, 2023 at 3:27 PM MichaelDBA <MichaelDBA@sqlexec.com> wrote:
Howdy,

A few additional questions:
  1. How many concurrent, active connections are running when these BIND problems occur?  select count(*) from pg_stat_activity where state in ('active','idle in transaction')
  2. Are the queries using gigantic IN (<big list>) values?
  3. Perhaps unrelated, but is log_temp_files turned on, and if so, do you have a lot of logs related to that?

Regards,
Michael Vitale, just another PG DBA


1) Usual load (no anomalies):
10-20 concurrently running queries (i.e. the issue isn't related to a load spike or similar anomaly),
plus 5-10 short 'idle in transaction' sessions (a usual amount too),
and around 300 total connections to the database (behind a local pgbouncer in transaction mode).

2) No... a long BIND for a huge parameter list is a known issue to me, but in this case there is nothing like that... just (every?) PARSE/BIND stuck for a short period (including ones which don't require pg_statistic table access at all)...
Here are some funny samples from the latest spike:
2023-01-01 15:45:09.151 UTC 2421121 ******** from [local] [vxid:109/20732521 txid:0] [BIND] LOG:  duration: 338.830 ms  bind <unnamed>: ROLLBACK
2023-01-01 15:45:09.151 UTC 2365255 ******** from [local] [vxid:41/21277531 txid:2504447286] [PARSE] LOG:  duration: 338.755 ms  parse <unnamed>: select nextval ('jobs_id_seq')
along with normal select/insert/update/delete operations that got stuck for a short time too...

3) log_temp_files is on for sure; I found no correlation with temp file usage, and no correlation between the latency spikes and logged autovacuum actions either.

PS: '[BIND] LOG:  duration: 338.830 ms  bind <unnamed>: ROLLBACK' on a definitely not overloaded and perfectly healthy server is probably the most curious log entry of 2022 for me.

--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

Hi Maxim,

10-20 active, concurrent connections is way below any CPU load problem you should have with 48 available vCPUs. 
You never explicitly said what the load is, so what is it in terms of the 1/5/15-minute load averages?

Maxim Boguk wrote on 1/1/2023 11:30 AM:
1) Usual load (no anomalies):
10-20 concurrently running queries (i.e. the issue isn't related to a load spike or similar anomaly),
plus 5-10 short 'idle in transaction' sessions (a usual amount too),
and around 300 total connections to the database (behind a local pgbouncer in transaction mode).


Regards,

Michael Vitale

Michaeldba@sqlexec.com

703-600-9343 




On Sun, Jan 1, 2023 at 6:43 PM MichaelDBA <MichaelDBA@sqlexec.com> wrote:
Hi Maxim,

10-20 active, concurrent connections is way below any CPU load problem you should have with 48 available vCPUs. 
You never explicitly said what the load is, so what is it in terms of the 1/5/15-minute load averages?


The load average is 10-15 at all times; the servers are really overprovisioned (2-3x in available CPU resources) because the application is quite sensitive to database latency.
And during these latency spikes EXECUTE works without any issues (i.e. only PARSE/BIND get stuck).


--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

You said it's a dedicated server, but pgbouncer is running locally, right?  PGBouncer has a small footprint, but is the CPU high for it?

Maxim Boguk wrote on 1/1/2023 11:51 AM:


On Sun, Jan 1, 2023 at 6:43 PM MichaelDBA <MichaelDBA@sqlexec.com> wrote:
Hi Maxim,

10-20 active, concurrent connections is way below any CPU load problem you should have with 48 available vCPUs. 
You never explicitly said what the load is, so what is it in terms of the 1/5/15-minute load averages?


The load average is 10-15 at all times; the servers are really overprovisioned (2-3x in available CPU resources) because the application is quite sensitive to database latency.
And during these latency spikes EXECUTE works without any issues (i.e. only PARSE/BIND get stuck).


--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678



Regards,

Michael Vitale

Michaeldba@sqlexec.com

703-600-9343 



On Sun, Jan 1, 2023 at 6:55 PM MichaelDBA <MichaelDBA@sqlexec.com> wrote:
You said it's a dedicated server, but pgbouncer is running locally, right?  PGBouncer has a small footprint, but is the CPU high for it?

There are 4 pgbouncer processes running in so_reuseport mode.
I never saw more than 40% of a single CPU core used by any one pgbouncer process (most of the time under 20%).
So it's unlikely that this is a result of pgbouncer being overloaded.
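
For completeness, client-side queuing in pgbouncer can also be checked from its admin console (a sketch; assumes the admin console is reachable, e.g. via psql -p 6432 pgbouncer):

-- Run against the pgbouncer admin console, not against PostgreSQL itself.
SHOW POOLS;   -- cl_waiting: clients currently queued waiting for a server connection
SHOW STATS;   -- avg_wait_time: average time (in microseconds) clients spent waiting for a server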

 
--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678



On Sat, Dec 31, 2022 at 2:26 PM Maxim Boguk <maxim.boguk@gmail.com> wrote:
Hi,

When performing post-mortem analysis of some short latency spikes on a heavily loaded database, I found that the reason for a (less than 10 second) latency spike wasn't at the EXECUTE stage but at the BIND stage.
At the same time, graphical monitoring shows that during this few-second period there were some queries waiting at the BIND stage.

Logging setup:
log_min_duration_statement=200ms
log_lock_waits=on
deadlock_timeout=100ms
So I expected that every lock wait over 100ms (>deadlock_timeout) should show up in the log.
But in the log I see only the spikes of slow BIND, with no lock waits logged.
(counting slow BIND entries in 10-second buckets:
grep BIND /var/log/postgresql/postgresql-2022-12-29.log | grep 'duration' | perl -pe 's/^(2022-12-29 \d\d:\d\d:\d).*$/$1/' | sort | uniq -c | less
...
      9 2022-12-29 00:12:5
      2 2022-12-29 00:13:1
      3 2022-12-29 00:13:5
!!!  68 2022-12-29 00:14:0
      5 2022-12-29 00:14:1
      3 2022-12-29 00:14:2
      2 2022-12-29 00:14:3
).
But no lock waits at the BIND stage were logged during the problem period (and no lock waits at all, in fact).
Similar issues happen a few times per day without any visible pattern (but usually on the same tables).
No CPU or IO load/latency spikes were found during the problem periods.
No EXECUTE slowdowns were found in the log during that time.


Follow-up research on this issue led me to the following results:
every logged spike of BIND/PARSE response time correlated with the corresponding backends waiting on
wait_event_type = LWLock
wait_event = pg_stat_statements
and all of these spikes happen during an increment of the pg_stat_statements_info.dealloc counter.
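
For reference, a minimal way to watch that counter and line it up with the slow BIND/PARSE entries in the log (PostgreSQL 14+, where pg_stat_statements_info is available):

-- Sample periodically and compare the timestamps of dealloc increments
-- with the timestamps of the slow BIND/PARSE log entries.
SELECT now() AS sample_time, dealloc, stats_reset
FROM pg_stat_statements_info;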

Some searching about this issue led me to the following blog post about a similar issue:

However, we already have pg_stat_statements.max=10000, so a further increase of this parameter
seems counterproductive (the size of 14/main/pg_stat_tmp/pgss_query_texts.stat is already over 20MB).
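
A rough way to see how close the statement cache is to its eviction threshold (just a sketch; note that scanning pg_stat_statements is itself not free on a busy system):

SELECT count(*)                                        AS stored_entries,
       current_setting('pg_stat_statements.max')::int  AS max_entries
FROM pg_stat_statements;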


Open questions remain:
1) Is it expected behaviour for pg_stat_statements to block every BIND/PARSE for the whole duration of the cleanup while it deallocates the least-used entries?


2) Is there any recommended workaround for this issue for systems with a strict latency SLA?
(Blocking every database query that uses the extended query protocol for 200-500ms, 50+ times per day at random times, is unfortunately not acceptable in our case.)


3) Why do only BIND/PARSE get blocked, but not EXECUTE?
(Maybe there is some difference in the implementation of the plan vs exec pg_stat_statements counters? See the sketch below.)
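
For context, these are the separate plan vs exec counters I mean (a sketch against the PostgreSQL 13+ column set; the plan-side columns are only populated when pg_stat_statements.track_planning is enabled):

SELECT queryid,
       plans, total_plan_time,   -- updated by the planner hook
       calls, total_exec_time    -- updated by the executor hooks
FROM pg_stat_statements
ORDER BY total_plan_time DESC
LIMIT 10;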


Kind Regards,
Maxim


--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678


What happens if you take pg_stat_statements out of the picture (remove from shared_preload_libraries)?  Does your BIND problem go away?


On Thu, Jan 5, 2023 at 1:31 PM MichaelDBA <MichaelDBA@sqlexec.com> wrote:

What happens if you take pg_stat_statements out of the picture (remove from shared_preload_libraries)?  Does your BIND problem go away?

I didn't test this idea because it requires a restart of the database (which cannot be done quickly), and without pg_stat_statements there would be no adequate performance monitoring of the database.
But I'm pretty sure the issue would go away with pg_stat_statements disabled.

--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

Well if you find out for sure, please let me know.  I'm very interested in the outcome of this problem.

Maxim Boguk wrote on 1/5/2023 6:44 AM:


On Thu, Jan 5, 2023 at 1:31 PM MichaelDBA <MichaelDBA@sqlexec.com> wrote:

What happens if you take pg_stat_statements out of the picture (remove from shared_preload_libraries)?  Does your BIND problem go away?

I didn't test this idea because it requires a restart of the database (which cannot be done quickly), and without pg_stat_statements there would be no adequate performance monitoring of the database.
But I'm pretty sure the issue would go away with pg_stat_statements disabled.

--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678