Thread: Query running longer
Hello All,
We have the below query, which runs for ~45 seconds on a Postgres Aurora reader instance. I have captured the EXPLAIN ANALYZE output. I want to understand where exactly the resources are being spent and whether we can optimize it further. It's a UI query showing the top 50 rows and is supposed to finish in sub-second time, but it takes around ~45 seconds to finish.
I am also seeing multiple workers in the plan; does that mean the query is somehow running in parallel?
explain (analyze,verbose,costs,buffers) select TAB1.PRI from SCHEMA1.TAB1 TAB1
inner join SCHEMA1.TAB4 TAB4 on TAB4.PRI = TAB1.PRI
inner join SCHEMA1."TAB2" TAB2 on TAB2.PRI = TAB1.PRI
inner join SCHEMA1.TAB3 a2 on a2.AID = TAB2.AID
where TAB2.MID = XXXXX and TAB4.TAB4_code = 'XX'
and TAB2.TAB2_time between '2024-01-01' and '2024-01-31'
order by TAB2.TAB2_time desc
limit 50;
Limit (cost=13052924.01..13052924.14 rows=50 width=45) (actual time=45211.971..45224.720 rows=50 loops=1)
" Output: TAB1.PRI, TAB2.TAB2_time"
Buffers: shared hit=1980943 read=2335820
I/O Timings: shared/local read=112477.014
-> Sort (cost=13052924.01..13052924.19 rows=70 width=45) (actual time=45211.969..45224.713 rows=50 loops=1)
" Output: TAB1.PRI, TAB2.TAB2_time"
Sort Key: TAB2.TAB2_time DESC
Sort Method: top-N heapsort Memory: 32kB
Buffers: shared hit=1980943 read=2335820
I/O Timings: shared/local read=112477.014
-> Gather (cost=92917.38..13052921.87 rows=70 width=45) (actual time=947.004..45221.915 rows=5428 loops=1)
" Output: TAB1.PRI, TAB2.TAB2_time"
Workers Planned: 4
Workers Launched: 4
Buffers: shared hit=1980943 read=2335820
I/O Timings: shared/local read=112477.014
-> Nested Loop (cost=91917.38..13051914.87 rows=18 width=45) (actual time=945.946..45195.224 rows=1086 loops=5)
" Output: TAB1.PRI, TAB2.TAB2_time"
Inner Unique: true
Buffers: shared hit=1980943 read=2335820
I/O Timings: shared/local read=112477.014
Worker 0: actual time=936.808..45193.518 rows=1036 loops=1
Buffers: shared hit=382606 read=465076
I/O Timings: shared/local read=22452.028
Worker 1: actual time=947.246..45194.168 rows=1055 loops=1
Buffers: shared hit=383165 read=484189
I/O Timings: shared/local read=22617.135
Worker 2: actual time=933.623..45192.534 rows=1145 loops=1
Buffers: shared hit=415758 read=473182
I/O Timings: shared/local read=22741.488
Worker 3: actual time=965.639..45193.603 rows=1078 loops=1
Buffers: shared hit=398009 read=449053
I/O Timings: shared/local read=22221.094
-> Nested Loop (cost=91916.81..13051828.80 rows=18 width=81) (actual time=945.917..43729.931 rows=1086 loops=5)
" Output: TAB1.PRI, TAB2.TAB2_time, TAB2.AID"
Inner Unique: true
Join Filter: ((TAB4.PRI)::text = (TAB1.PRI)::text)
Buffers: shared hit=1962289 read=2328363
I/O Timings: shared/local read=105246.220
Worker 0: actual time=936.781..43732.652 rows=1036 loops=1
Buffers: shared hit=379077 read=463587
I/O Timings: shared/local read=21008.508
Worker 1: actual time=947.212..43699.507 rows=1055 loops=1
Buffers: shared hit=379573 read=482704
I/O Timings: shared/local read=21142.572
Worker 2: actual time=933.589..43696.710 rows=1145 loops=1
Buffers: shared hit=411836 read=471634
I/O Timings: shared/local read=21266.581
Worker 3: actual time=965.608..43768.535 rows=1078 loops=1
Buffers: shared hit=394288 read=447583
I/O Timings: shared/local read=20814.288
-> Parallel Hash Join (cost=91916.24..13051765.39 rows=18 width=117) (actual time=945.879..42758.939 rows=1086 loops=5)
" Output: TAB4.PRI, TAB2.TAB2_time, TAB2.PRI, TAB2.AID"
Hash Cond: ((TAB2.PRI)::text = (TAB4.PRI)::text)
Buffers: shared hit=1943792 read=2322814
I/O Timings: shared/local read=100496.787
Worker 0: actual time=936.743..42798.247 rows=1036 loops=1
Buffers: shared hit=375573 read=462501
I/O Timings: shared/local read=20094.654
Worker 1: actual time=947.169..42752.987 rows=1055 loops=1
Buffers: shared hit=375975 read=481619
I/O Timings: shared/local read=20216.926
Worker 2: actual time=933.545..42660.854 rows=1145 loops=1
Buffers: shared hit=407956 read=470465
I/O Timings: shared/local read=20252.386
Worker 3: actual time=965.567..42797.288 rows=1078 loops=1
Buffers: shared hit=390609 read=446481
I/O Timings: shared/local read=19863.965
" -> Parallel Bitmap Heap Scan on SCHEMA1.""TAB2"" TAB2 (cost=84860.50..13040301.00 rows=1175611 width=80) (actual time=713.054..26942.082 rows=956249 loops=5)"
" Output: TAB2.TAB2_time, TAB2.PRI, TAB2.AID"
Recheck Cond: (TAB2.MID = 'XXXXX'::numeric)
Rows Removed by Index Recheck: 2137395
Filter: ((TAB2.TAB2_time >= '2024-01-01 00:00:00+00'::timestamp with time zone) AND (TAB2.TAB2_time <= '2024-01-31 00:00:00+00'::timestamp with time zone))
Heap Blocks: exact=5300 lossy=782577
Buffers: shared hit=1651569 read=2245157
I/O Timings: shared/local read=29063.286
Worker 0: actual time=713.040..27006.980 rows=942051 loops=1
Buffers: shared hit=317611 read=447013
I/O Timings: shared/local read=5851.688
Worker 1: actual time=713.047..27065.878 rows=939696 loops=1
Buffers: shared hit=317632 read=466176
I/O Timings: shared/local read=6038.851
Worker 2: actual time=713.027..26894.506 rows=967468 loops=1
Buffers: shared hit=349596 read=454912
I/O Timings: shared/local read=5962.348
Worker 3: actual time=713.091..26826.767 rows=961928 loops=1
Buffers: shared hit=332980 read=430848
I/O Timings: shared/local read=5426.475
-> Bitmap Index Scan on TAB2_idx2 (cost=0.00..83684.89 rows=4702443 width=0) (actual time=688.661..688.661 rows=4781245 loops=1)
Index Cond: (TAB2.MID = 'XXXXX'::numeric)
Buffers: shared hit=12408
Worker 2: actual time=688.661..688.661 rows=4781245 loops=1
Buffers: shared hit=12408
-> Parallel Hash (cost=7042.63..7042.63 rows=1049 width=37) (actual time=217.987..217.988 rows=27613 loops=5)
Output: TAB4.PRI
Buckets: 262144 (originally 2048) Batches: 1 (originally 1) Memory Usage: 13936kB
Buffers: shared hit=134917
Worker 0: actual time=214.981..214.982 rows=27779 loops=1
Buffers: shared hit=27133
Worker 1: actual time=215.455..215.456 rows=27805 loops=1
Buffers: shared hit=27159
Worker 2: actual time=215.774..215.774 rows=27330 loops=1
Buffers: shared hit=26706
Worker 3: actual time=215.776..215.777 rows=26880 loops=1
Buffers: shared hit=26245
-> Parallel Bitmap Heap Scan on SCHEMA1.TAB4 TAB4 (cost=26.39..7042.63 rows=1049 width=37) (actual time=23.650..201.606 rows=27613 loops=5)
Output: TAB4.PRI
Recheck Cond: ((TAB4.TAB4_code)::text = 'XX'::text)
Rows Removed by Index Recheck: 616610
Heap Blocks: exact=11978 lossy=15624
Buffers: shared hit=134917
Worker 0: actual time=20.627..199.852 rows=27779 loops=1
Buffers: shared hit=27133
Worker 1: actual time=21.065..199.786 rows=27805 loops=1
Buffers: shared hit=27159
Worker 2: actual time=21.445..198.582 rows=27330 loops=1
Buffers: shared hit=26706
Worker 3: actual time=21.470..195.915 rows=26880 loops=1
Buffers: shared hit=26245
-> Bitmap Index Scan on TAB4_idx1 (cost=0.00..25.95 rows=1784 width=0) (actual time=23.938..23.938 rows=138067 loops=1)
Index Cond: ((TAB4.TAB4_code)::text = 'XX'::text)
Buffers: shared hit=72
-> Index Only Scan using TAB1_pk on SCHEMA1.TAB1 TAB1 (cost=0.57..3.51 rows=1 width=37) (actual time=0.891..0.891 rows=1 loops=5428)
Output: TAB1.PRI
Index Cond: (TAB1.PRI = (TAB2.PRI)::text)
Heap Fetches: 0
Buffers: shared hit=18262 read=5549
I/O Timings: shared/local read=4749.434
Worker 0: actual time=0.899..0.899 rows=1 loops=1036
Buffers: shared hit=3464 read=1086
I/O Timings: shared/local read=913.854
Worker 1: actual time=0.894..0.894 rows=1 loops=1055
Buffers: shared hit=3558 read=1085
I/O Timings: shared/local read=925.646
Worker 2: actual time=0.901..0.901 rows=1 loops=1145
Buffers: shared hit=3840 read=1169
I/O Timings: shared/local read=1014.196
Worker 3: actual time=0.898..0.898 rows=1 loops=1078
Buffers: shared hit=3634 read=1102
I/O Timings: shared/local read=950.323
-> Index Only Scan using TAB3_pk on SCHEMA1.TAB3 a2 (cost=0.57..4.78 rows=1 width=36) (actual time=1.336..1.336 rows=1 loops=5428)
Output: a2.AID
Index Cond: (a2.AID = (TAB2.AID)::text)
Heap Fetches: 1836
Buffers: shared hit=18278 read=7398
I/O Timings: shared/local read=7172.664
Worker 0: actual time=1.393..1.393 rows=1 loops=1036
Buffers: shared hit=3455 read=1473
I/O Timings: shared/local read=1429.250
Worker 1: actual time=1.405..1.405 rows=1 loops=1055
Buffers: shared hit=3531 read=1476
I/O Timings: shared/local read=1464.637
Worker 2: actual time=1.296..1.296 rows=1 loops=1145
Buffers: shared hit=3857 read=1538
I/O Timings: shared/local read=1465.583
Worker 3: actual time=1.309..1.309 rows=1 loops=1078
Buffers: shared hit=3642 read=1459
I/O Timings: shared/local read=1395.946
Query Identifier: 7231829541130579109
Planning:
Buffers: shared hit=1414
Planning Time: 1.305 ms
Execution Time: 45224.792 ms
" Output: TAB1.PRI, TAB2.TAB2_time"
Buffers: shared hit=1980943 read=2335820
I/O Timings: shared/local read=112477.014
-> Sort (cost=13052924.01..13052924.19 rows=70 width=45) (actual time=45211.969..45224.713 rows=50 loops=1)
" Output: TAB1.PRI, TAB2.TAB2_time"
Sort Key: TAB2.TAB2_time DESC
Sort Method: top-N heapsort Memory: 32kB
Buffers: shared hit=1980943 read=2335820
I/O Timings: shared/local read=112477.014
-> Gather (cost=92917.38..13052921.87 rows=70 width=45) (actual time=947.004..45221.915 rows=5428 loops=1)
" Output: TAB1.PRI, TAB2.TAB2_time"
Workers Planned: 4
Workers Launched: 4
Buffers: shared hit=1980943 read=2335820
I/O Timings: shared/local read=112477.014
-> Nested Loop (cost=91917.38..13051914.87 rows=18 width=45) (actual time=945.946..45195.224 rows=1086 loops=5)
" Output: TAB1.PRI, TAB2.TAB2_time"
Inner Unique: true
Buffers: shared hit=1980943 read=2335820
I/O Timings: shared/local read=112477.014
Worker 0: actual time=936.808..45193.518 rows=1036 loops=1
Buffers: shared hit=382606 read=465076
I/O Timings: shared/local read=22452.028
Worker 1: actual time=947.246..45194.168 rows=1055 loops=1
Buffers: shared hit=383165 read=484189
I/O Timings: shared/local read=22617.135
Worker 2: actual time=933.623..45192.534 rows=1145 loops=1
Buffers: shared hit=415758 read=473182
I/O Timings: shared/local read=22741.488
Worker 3: actual time=965.639..45193.603 rows=1078 loops=1
Buffers: shared hit=398009 read=449053
I/O Timings: shared/local read=22221.094
-> Nested Loop (cost=91916.81..13051828.80 rows=18 width=81) (actual time=945.917..43729.931 rows=1086 loops=5)
" Output: TAB1.PRI, TAB2.TAB2_time, TAB2.AID"
Inner Unique: true
Join Filter: ((TAB4.PRI)::text = (TAB1.PRI)::text)
Buffers: shared hit=1962289 read=2328363
I/O Timings: shared/local read=105246.220
Worker 0: actual time=936.781..43732.652 rows=1036 loops=1
Buffers: shared hit=379077 read=463587
I/O Timings: shared/local read=21008.508
Worker 1: actual time=947.212..43699.507 rows=1055 loops=1
Buffers: shared hit=379573 read=482704
I/O Timings: shared/local read=21142.572
Worker 2: actual time=933.589..43696.710 rows=1145 loops=1
Buffers: shared hit=411836 read=471634
I/O Timings: shared/local read=21266.581
Worker 3: actual time=965.608..43768.535 rows=1078 loops=1
Buffers: shared hit=394288 read=447583
I/O Timings: shared/local read=20814.288
-> Parallel Hash Join (cost=91916.24..13051765.39 rows=18 width=117) (actual time=945.879..42758.939 rows=1086 loops=5)
" Output: TAB4.PRI, TAB2.TAB2_time, TAB2.PRI, TAB2.AID"
Hash Cond: ((TAB2.PRI)::text = (TAB4.PRI)::text)
Buffers: shared hit=1943792 read=2322814
I/O Timings: shared/local read=100496.787
Worker 0: actual time=936.743..42798.247 rows=1036 loops=1
Buffers: shared hit=375573 read=462501
I/O Timings: shared/local read=20094.654
Worker 1: actual time=947.169..42752.987 rows=1055 loops=1
Buffers: shared hit=375975 read=481619
I/O Timings: shared/local read=20216.926
Worker 2: actual time=933.545..42660.854 rows=1145 loops=1
Buffers: shared hit=407956 read=470465
I/O Timings: shared/local read=20252.386
Worker 3: actual time=965.567..42797.288 rows=1078 loops=1
Buffers: shared hit=390609 read=446481
I/O Timings: shared/local read=19863.965
" -> Parallel Bitmap Heap Scan on SCHEMA1.""TAB2"" TAB2 (cost=84860.50..13040301.00 rows=1175611 width=80) (actual time=713.054..26942.082 rows=956249 loops=5)"
" Output: TAB2.TAB2_time, TAB2.PRI, TAB2.AID"
Recheck Cond: (TAB2.MID = 'XXXXX'::numeric)
Rows Removed by Index Recheck: 2137395
Filter: ((TAB2.TAB2_time >= '2024-01-01 00:00:00+00'::timestamp with time zone) AND (TAB2.TAB2_time <= '2024-01-31 00:00:00+00'::timestamp with time zone))
Heap Blocks: exact=5300 lossy=782577
Buffers: shared hit=1651569 read=2245157
I/O Timings: shared/local read=29063.286
Worker 0: actual time=713.040..27006.980 rows=942051 loops=1
Buffers: shared hit=317611 read=447013
I/O Timings: shared/local read=5851.688
Worker 1: actual time=713.047..27065.878 rows=939696 loops=1
Buffers: shared hit=317632 read=466176
I/O Timings: shared/local read=6038.851
Worker 2: actual time=713.027..26894.506 rows=967468 loops=1
Buffers: shared hit=349596 read=454912
I/O Timings: shared/local read=5962.348
Worker 3: actual time=713.091..26826.767 rows=961928 loops=1
Buffers: shared hit=332980 read=430848
I/O Timings: shared/local read=5426.475
-> Bitmap Index Scan on TAB2_idx2 (cost=0.00..83684.89 rows=4702443 width=0) (actual time=688.661..688.661 rows=4781245 loops=1)
Index Cond: (TAB2.MID = 'XXXXX'::numeric)
Buffers: shared hit=12408
Worker 2: actual time=688.661..688.661 rows=4781245 loops=1
Buffers: shared hit=12408
-> Parallel Hash (cost=7042.63..7042.63 rows=1049 width=37) (actual time=217.987..217.988 rows=27613 loops=5)
Output: TAB4.PRI
Buckets: 262144 (originally 2048) Batches: 1 (originally 1) Memory Usage: 13936kB
Buffers: shared hit=134917
Worker 0: actual time=214.981..214.982 rows=27779 loops=1
Buffers: shared hit=27133
Worker 1: actual time=215.455..215.456 rows=27805 loops=1
Buffers: shared hit=27159
Worker 2: actual time=215.774..215.774 rows=27330 loops=1
Buffers: shared hit=26706
Worker 3: actual time=215.776..215.777 rows=26880 loops=1
Buffers: shared hit=26245
-> Parallel Bitmap Heap Scan on SCHEMA1.TAB4 TAB4 (cost=26.39..7042.63 rows=1049 width=37) (actual time=23.650..201.606 rows=27613 loops=5)
Output: TAB4.PRI
Recheck Cond: ((TAB4.TAB4_code)::text = 'XX'::text)
Rows Removed by Index Recheck: 616610
Heap Blocks: exact=11978 lossy=15624
Buffers: shared hit=134917
Worker 0: actual time=20.627..199.852 rows=27779 loops=1
Buffers: shared hit=27133
Worker 1: actual time=21.065..199.786 rows=27805 loops=1
Buffers: shared hit=27159
Worker 2: actual time=21.445..198.582 rows=27330 loops=1
Buffers: shared hit=26706
Worker 3: actual time=21.470..195.915 rows=26880 loops=1
Buffers: shared hit=26245
-> Bitmap Index Scan on TAB4_idx1 (cost=0.00..25.95 rows=1784 width=0) (actual time=23.938..23.938 rows=138067 loops=1)
Index Cond: ((TAB4.TAB4_code)::text = 'XX'::text)
Buffers: shared hit=72
-> Index Only Scan using TAB1_pk on SCHEMA1.TAB1 TAB1 (cost=0.57..3.51 rows=1 width=37) (actual time=0.891..0.891 rows=1 loops=5428)
Output: TAB1.PRI
Index Cond: (TAB1.PRI = (TAB2.PRI)::text)
Heap Fetches: 0
Buffers: shared hit=18262 read=5549
I/O Timings: shared/local read=4749.434
Worker 0: actual time=0.899..0.899 rows=1 loops=1036
Buffers: shared hit=3464 read=1086
I/O Timings: shared/local read=913.854
Worker 1: actual time=0.894..0.894 rows=1 loops=1055
Buffers: shared hit=3558 read=1085
I/O Timings: shared/local read=925.646
Worker 2: actual time=0.901..0.901 rows=1 loops=1145
Buffers: shared hit=3840 read=1169
I/O Timings: shared/local read=1014.196
Worker 3: actual time=0.898..0.898 rows=1 loops=1078
Buffers: shared hit=3634 read=1102
I/O Timings: shared/local read=950.323
-> Index Only Scan using TAB3_pk on SCHEMA1.TAB3 a2 (cost=0.57..4.78 rows=1 width=36) (actual time=1.336..1.336 rows=1 loops=5428)
Output: a2.AID
Index Cond: (a2.AID = (TAB2.AID)::text)
Heap Fetches: 1836
Buffers: shared hit=18278 read=7398
I/O Timings: shared/local read=7172.664
Worker 0: actual time=1.393..1.393 rows=1 loops=1036
Buffers: shared hit=3455 read=1473
I/O Timings: shared/local read=1429.250
Worker 1: actual time=1.405..1.405 rows=1 loops=1055
Buffers: shared hit=3531 read=1476
I/O Timings: shared/local read=1464.637
Worker 2: actual time=1.296..1.296 rows=1 loops=1145
Buffers: shared hit=3857 read=1538
I/O Timings: shared/local read=1465.583
Worker 3: actual time=1.309..1.309 rows=1 loops=1078
Buffers: shared hit=3642 read=1459
I/O Timings: shared/local read=1395.946
Query Identifier: 7231829541130579109
Planning:
Buffers: shared hit=1414
Planning Time: 1.305 ms
Execution Time: 45224.792 ms
On Fri, 2024-02-02 at 02:27 +0530, veem v wrote:
> We have the below query which is running for ~45 seconds on postgres aurora reader instance.
> I have captured the explain analyze. Want to understand, where exactly the resources are
> getting spent and if we can be able to optimize it further.

Aurora <> PostgreSQL, but here is what I can see:

- The index scan on SCHEMA1."TAB2" has to check 2 million extra
  rows because "work_mem" is too small. Almost the complete time
  is spent there.

- You may be getting a bad plan, because the statistics on
  SCHEMA1.TAB4 are either out of date or not detailed enough,
  which makes PostgreSQL underestimate the result size.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
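If lossy bitmap pages are indeed the problem, one quick session-level experiment is to raise "work_mem" and re-capture the plan (the 256MB value below is only an illustrative guess, not a recommendation):

```sql
-- Raise the bitmap memory budget for this session only; tune for your instance
SET work_mem = '256MB';

-- Re-run the same plan capture; if work_mem was the limit,
-- "Heap Blocks: lossy=..." and "Rows Removed by Index Recheck"
-- on the TAB2 node should shrink or disappear
EXPLAIN (ANALYZE, VERBOSE, COSTS, BUFFERS)
SELECT TAB1.PRI
FROM SCHEMA1.TAB1 TAB1
JOIN SCHEMA1.TAB4 TAB4 ON TAB4.PRI = TAB1.PRI
JOIN SCHEMA1."TAB2" TAB2 ON TAB2.PRI = TAB1.PRI
JOIN SCHEMA1.TAB3 a2 ON a2.AID = TAB2.AID
WHERE TAB2.MID = XXXXX
  AND TAB4.TAB4_code = 'XX'
  AND TAB2.TAB2_time BETWEEN '2024-01-01' AND '2024-01-31'
ORDER BY TAB2.TAB2_time DESC
LIMIT 50;

RESET work_mem;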
On Thu, Feb 1, 2024 at 4:13 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Fri, 2024-02-02 at 02:27 +0530, veem v wrote:
> We have the below query which is running for ~45 seconds on postgres aurora reader instance.
> I have captured the explain analyze. Want to understand, where exactly the resources are
> getting spent and if we can be able to optimize it further.
Aurora <> PostgreSQL, but here is what I can see:
- The index scan on SCHEMA1."TAB2" has to check 2 million extra
rows because "work_mem" is too small. Almost the complete time
is spent there.
Is that visible in these numbers?
Buffers: shared hit=xxxxxx read=22xxxxx
On Fri, 2 Feb 2024 at 02:43, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Fri, 2024-02-02 at 02:27 +0530, veem v wrote:
> We have the below query which is running for ~45 seconds on postgres aurora reader instance.
> I have captured the explain analyze. Want to understand, where exactly the resources are
> getting spent and if we can be able to optimize it further.
Aurora <> PostgreSQL, but here is what I can see:
- The index scan on SCHEMA1."TAB2" has to check 2 million extra
rows because "work_mem" is too small. Almost the complete time
is spent there.
- You may be getting a bad plan, because the statistics on
SCHEMA1.TAB4 are either out of date or not detailed enough,
which makes PostgreSQL underestimate the result size.
Thank you so much.
I can read and understand execution plans in the Oracle database easily, and can also interpret database parameters, but I am new to Postgres. Are there some good resources to quickly get up to speed on the Postgres execution path here, and on the various key parameters (like the work_mem one you mentioned)?
Regarding the step below: how did you work out that this is where the majority of the DB resources are spent? The total query time was ~45 seconds; how can I find out how much of that was spent in this step? And to fix this one node on TAB2, should we increase "work_mem"?
-> Parallel Bitmap Heap Scan on SCHEMA1.""TAB2"" TAB2 (cost=84860.50..13040301.00 rows=1175611 width=80) (actual time=713.054..26942.082 rows=956249 loops=5)"
" Output: TAB2.TAB2_time, TAB2.PRI, TAB2.AID"
Recheck Cond: (TAB2.MID = 'XXXXX'::numeric)
Rows Removed by Index Recheck: 2137395
Filter: ((TAB2.TAB2_time >= '2024-01-01 00:00:00+00'::timestamp with time zone) AND (TAB2.TAB2_time <= '2024-01-31 00:00:00+00'::timestamp with time zone))
Heap Blocks: exact=5300 lossy=782577
Buffers: shared hit=1651569 read=2245157
I/O Timings: shared/local read=29063.286
- The index scan on SCHEMA1."TAB2" has to check 2 million extra
rows because "work_mem" is too small. Almost the complete time
is spent there.
rows because "work_mem" is too small. Almost the complete time
is spent there.
-> Parallel Bitmap Heap Scan on SCHEMA1.""TAB2"" TAB2 (cost=84860.50..13040301.00 rows=1175611 width=80) (actual time=713.054..26942.082 rows=956249 loops=5)"
" Output: TAB2.TAB2_time, TAB2.PRI, TAB2.AID"
Recheck Cond: (TAB2.MID = 'XXXXX'::numeric)
Rows Removed by Index Recheck: 2137395
Filter: ((TAB2.TAB2_time >= '2024-01-01 00:00:00+00'::timestamp with time zone) AND (TAB2.TAB2_time <= '2024-01-31 00:00:00+00'::timestamp with time zone))
Heap Blocks: exact=5300 lossy=782577
Buffers: shared hit=1651569 read=2245157
I/O Timings: shared/local read=29063.286
For the other point you mentioned below: should we run VACUUM ANALYZE on the table TAB4?
- You may be getting a bad plan, because the statistics on
SCHEMA1.TAB4 are either out of date or not detailed enough,
which makes PostgreSQL underestimate the result size.
Output: TAB4.PRI
Buckets: 262144 (originally 2048) Batches: 1 (originally 1) Memory Usage: 13936kB
Buffers: shared hit=134917
Worker 0: actual time=214.981..214.982 rows=27779 loops=1
Buffers: shared hit=27133
Worker 1: actual time=215.455..215.456 rows=27805 loops=1
Buffers: shared hit=27159
Worker 2: actual time=215.774..215.774 rows=27330 loops=1
Buffers: shared hit=26706
Worker 3: actual time=215.776..215.777 rows=26880 loops=1
Buffers: shared hit=26245
-> Parallel Bitmap Heap Scan on SCHEMA1.TAB4 TAB4 (cost=26.39..7042.63 rows=1049 width=37) (actual time=23.650..201.606 rows=27613 loops=5)
Output: TAB4.PRI
Recheck Cond: ((TAB4.TAB4_code)::text = 'XX'::text)
Rows Removed by Index Recheck: 616610
Heap Blocks: exact=11978 lossy=15624
Buffers: shared hit=134917
Worker 0: actual time=20.627..199.852 rows=27779 loops=1
Buffers: shared hit=27133
Worker 1: actual time=21.065..199.786 rows=27805 loops=1
Buffers: shared hit=27159
Worker 2: actual time=21.445..198.582 rows=27330 loops=1
Buffers: shared hit=26706
Worker 3: actual time=21.470..195.915 rows=26880 loops=1
Buffers: shared hit=26245
-> Bitmap Index Scan on TAB4_idx1 (cost=0.00..25.95 rows=1784 width=0) (actual time=23.938..23.938 rows=138067 loops=1)
Index Cond: ((TAB4.TAB4_code)::text = 'XX'::text)
Buffers: shared hit=72
On Fri, 2024-02-02 at 10:14 +0530, veem v wrote:
> As you mentioned below, So wondering how you got to know, if this is the step where majority
> of the DB resources get spent. And as total time the query ran was ~45 seconds, and out of
> that how much time it spent in this step, how can i get that? And to fix this one line with
> regards to TAB2, should we increase "work_mem" here?
>
> -> Parallel Bitmap Heap Scan on SCHEMA1."TAB2" TAB2 (cost=84860.50..13040301.00 rows=1175611 width=80) (actual time=713.054..26942.082 rows=956249 loops=5)
>       Output: TAB2.TAB2_time, TAB2.PRI, TAB2.AID
>       Recheck Cond: (TAB2.MID = 'XXXXX'::numeric)
>       Rows Removed by Index Recheck: 2137395
>       Filter: ((TAB2.TAB2_time >= '2024-01-01 00:00:00+00'::timestamp with time zone) AND (TAB2.TAB2_time <= '2024-01-31 00:00:00+00'::timestamp with time zone))
>       Heap Blocks: exact=5300 lossy=782577
>       Buffers: shared hit=1651569 read=2245157
>       I/O Timings: shared/local read=29063.286

The bitmap index scan builds a bitmap, the size of which is limited by "work_mem".
If that is big enough, the bitmap will contain a bit for each table row; if not, it only
contains a bit per table row for some table blocks (the "exact" ones), then it degrades
to a bit per block (the "lossy" ones). For lossy blocks, *all* rows in the block have to
be rechecked, which is overhead. Given that only a small part of the time
((26942.082 - 713.054) * 5 ms) is spent doing I/O (29063.286 ms), I guess that the rest
is spent processing table rows.

> Another point you mentioned as below, for this, should we run vacuum analyze on the table TAB4?
>
> -> Parallel Bitmap Heap Scan on SCHEMA1.TAB4 TAB4 (cost=26.39..7042.63 rows=1049 width=37) (actual time=23.650..201.606 rows=27613 loops=5)
> -> Bitmap Index Scan on TAB4_idx1 (cost=0.00..25.95 rows=1784 width=0) (actual time=23.938..23.938 rows=138067 loops=1)
>       Index Cond: ((TAB4.TAB4_code)::text = 'XX'::text)
>       Buffers: shared hit=72

An ANALYZE might be enough. If not, you can try to collect more detailed statistics
for the column:

ALTER TABLE SCHEMA1.TAB4 ALTER TAB4_code SET STATISTICS 1000;
ANALYZE SCHEMA1.TAB4;

Yours,
Laurenz Albe
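One way to check whether the refreshed statistics fixed the misestimate (a sketch, assuming the same filter value) is to compare the planner's row estimate against the real count:

```sql
-- After ANALYZE, the estimate for this predicate should move from the
-- earlier ~1784 toward the actual ~138067 seen in the index scan
EXPLAIN SELECT * FROM SCHEMA1.TAB4 WHERE TAB4_code = 'XX';

-- Compare with the true number of matching rows
SELECT count(*) FROM SCHEMA1.TAB4 WHERE TAB4_code = 'XX';
```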