Thread: using memoize in in paralel query decreases performance

using memoize in in paralel query decreases performance

From
Pavel Stehule
Date:
Hi

In one query I can see very big overhead of memoize node - unfortunately with hits = 0

The Estimate is almost very good. See details in attachment

Regards

Pavel



Attachment

Re: using memoize in in paralel query decreases performance

From
David Rowley
Date:
On Mon, 6 Mar 2023 at 20:34, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> In one query I can see very big overhead of memoize node - unfortunately with hits = 0
>
> The Estimate is almost very good. See details in attachment

Are you able to share the version number for this?

Also, it would be good to see EXPLAIN ANALYZE *VERBOSE* for the
memorize plan so we can see the timings for the parallel workers.

The results of:

EXPLAIN ANALYZE
SELECT DISTINCT  ictc.sub_category_id
FROM ixfk_ictc_subcategoryid ictc
INNER JOIN item i ON i.item_category_id = ictc.sub_category_id
WHERE ictc.super_category_id = ANY
('{47124,49426,49488,47040,47128}'::bigint[]);

would also be useful. That should give an idea of the ndistinct
estimate.  I guess memorize thinks there are fewer unique values than
the 112 that were found.  There's probably not much to be done about
that. The slowness of the parallel workers seems like a more
interesting thing to understand.

David



Re: using memoize in in paralel query decreases performance

From
Pavel Stehule
Date:


po 6. 3. 2023 v 9:16 odesílatel David Rowley <dgrowleyml@gmail.com> napsal:
On Mon, 6 Mar 2023 at 20:34, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> In one query I can see very big overhead of memoize node - unfortunately with hits = 0
>
> The Estimate is almost very good. See details in attachment

Are you able to share the version number for this?

15.1 - upgrade on 15.2 is planned this month
 

Also, it would be good to see EXPLAIN ANALYZE *VERBOSE* for the
memorize plan so we can see the timings for the parallel workers.



The results of:

EXPLAIN ANALYZE
SELECT DISTINCT  ictc.sub_category_id
FROM ixfk_ictc_subcategoryid ictc
INNER JOIN item i ON i.item_category_id = ictc.sub_category_id
WHERE ictc.super_category_id = ANY
('{47124,49426,49488,47040,47128}'::bigint[]);



would also be useful. That should give an idea of the ndistinct
estimate.  I guess memorize thinks there are fewer unique values than
the 112 that were found.  There's probably not much to be done about
that. The slowness of the parallel workers seems like a more
interesting thing to understand.

David

Re: using memoize in in paralel query decreases performance

From
David Rowley
Date:
On Mon, 6 Mar 2023 at 21:55, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> default https://explain.depesz.com/s/fnBe

It looks like the slowness is coming from the Bitmap Index scan and
Bitmap heap scan rather than Memoize.

       ->  Bitmap Heap Scan on public.item i  (cost=285.69..41952.12
rows=29021 width=16) (actual time=20.395..591.606 rows=20471
loops=784)
     Output: i.id, i.item_category_id
     Recheck Cond: (i.item_category_id = ictc.sub_category_id)
     Heap Blocks: exact=1590348
     Worker 0:  actual time=20.128..591.426 rows=20471 loops=112
     Worker 1:  actual time=20.243..591.627 rows=20471 loops=112
     Worker 2:  actual time=20.318..591.660 rows=20471 loops=112
     Worker 3:  actual time=21.180..591.644 rows=20471 loops=112
     Worker 4:  actual time=20.226..591.357 rows=20471 loops=112
     Worker 5:  actual time=20.597..591.418 rows=20471 loops=112
     ->  Bitmap Index Scan on ixfk_ite_itemcategoryid
(cost=0.00..278.43 rows=29021 width=0) (actual time=14.851..14.851
rows=25362 loops=784)
   Index Cond: (i.item_category_id = ictc.sub_category_id)
   Worker 0:  actual time=14.863..14.863 rows=25362 loops=112
   Worker 1:  actual time=14.854..14.854 rows=25362 loops=112
   Worker 2:  actual time=14.611..14.611 rows=25362 loops=112
   Worker 3:  actual time=15.245..15.245 rows=25362 loops=112
   Worker 4:  actual time=14.909..14.909 rows=25362 loops=112
   Worker 5:  actual time=14.841..14.841 rows=25362 loops=112

> disabled memoize https://explain.depesz.com/s/P2rP

->  Bitmap Heap Scan on public.item i  (cost=285.69..41952.12
rows=29021 width=16) (actual time=9.256..57.503 rows=20471 loops=784)
       Output: i.id, i.item_category_id
       Recheck Cond: (i.item_category_id = ictc.sub_category_id)
       Heap Blocks: exact=1590349
       Worker 0:  actual time=9.422..58.420 rows=20471 loops=112
       Worker 1:  actual time=9.449..57.539 rows=20471 loops=112
       Worker 2:  actual time=9.751..58.129 rows=20471 loops=112
       Worker 3:  actual time=9.620..57.484 rows=20471 loops=112
       Worker 4:  actual time=8.940..57.911 rows=20471 loops=112
       Worker 5:  actual time=9.454..57.488 rows=20471 loops=112
       ->  Bitmap Index Scan on ixfk_ite_itemcategoryid
(cost=0.00..278.43 rows=29021 width=0) (actual time=4.581..4.581
rows=25363 loops=784)
     Index Cond: (i.item_category_id = ictc.sub_category_id)
     Worker 0:  actual time=4.846..4.846 rows=25363 loops=112
     Worker 1:  actual time=4.734..4.734 rows=25363 loops=112
     Worker 2:  actual time=4.803..4.803 rows=25363 loops=112
     Worker 3:  actual time=4.959..4.959 rows=25363 loops=112
     Worker 4:  actual time=4.402..4.402 rows=25363 loops=112
     Worker 5:  actual time=4.778..4.778 rows=25363 loops=112

I wonder if the additional work_mem required for Memoize is just doing
something like causing kernel page cache evictions and leading to
fewer buffers for ixfk_ite_itemcategoryid and the item table being
cached in the kernel page cache.

Maybe you could get an idea of that if you SET track_io_timing = on;
and EXPLAIN (ANALYZE, BUFFERS) both queries.

David



Re: using memoize in in paralel query decreases performance

From
Pavel Stehule
Date:


po 6. 3. 2023 v 22:52 odesílatel David Rowley <dgrowleyml@gmail.com> napsal:
On Mon, 6 Mar 2023 at 21:55, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> default https://explain.depesz.com/s/fnBe

It looks like the slowness is coming from the Bitmap Index scan and
Bitmap heap scan rather than Memoize.

       ->  Bitmap Heap Scan on public.item i  (cost=285.69..41952.12
rows=29021 width=16) (actual time=20.395..591.606 rows=20471
loops=784)
     Output: i.id, i.item_category_id
     Recheck Cond: (i.item_category_id = ictc.sub_category_id)
     Heap Blocks: exact=1590348
     Worker 0:  actual time=20.128..591.426 rows=20471 loops=112
     Worker 1:  actual time=20.243..591.627 rows=20471 loops=112
     Worker 2:  actual time=20.318..591.660 rows=20471 loops=112
     Worker 3:  actual time=21.180..591.644 rows=20471 loops=112
     Worker 4:  actual time=20.226..591.357 rows=20471 loops=112
     Worker 5:  actual time=20.597..591.418 rows=20471 loops=112
     ->  Bitmap Index Scan on ixfk_ite_itemcategoryid
(cost=0.00..278.43 rows=29021 width=0) (actual time=14.851..14.851
rows=25362 loops=784)
   Index Cond: (i.item_category_id = ictc.sub_category_id)
   Worker 0:  actual time=14.863..14.863 rows=25362 loops=112
   Worker 1:  actual time=14.854..14.854 rows=25362 loops=112
   Worker 2:  actual time=14.611..14.611 rows=25362 loops=112
   Worker 3:  actual time=15.245..15.245 rows=25362 loops=112
   Worker 4:  actual time=14.909..14.909 rows=25362 loops=112
   Worker 5:  actual time=14.841..14.841 rows=25362 loops=112

> disabled memoize https://explain.depesz.com/s/P2rP

->  Bitmap Heap Scan on public.item i  (cost=285.69..41952.12
rows=29021 width=16) (actual time=9.256..57.503 rows=20471 loops=784)
       Output: i.id, i.item_category_id
       Recheck Cond: (i.item_category_id = ictc.sub_category_id)
       Heap Blocks: exact=1590349
       Worker 0:  actual time=9.422..58.420 rows=20471 loops=112
       Worker 1:  actual time=9.449..57.539 rows=20471 loops=112
       Worker 2:  actual time=9.751..58.129 rows=20471 loops=112
       Worker 3:  actual time=9.620..57.484 rows=20471 loops=112
       Worker 4:  actual time=8.940..57.911 rows=20471 loops=112
       Worker 5:  actual time=9.454..57.488 rows=20471 loops=112
       ->  Bitmap Index Scan on ixfk_ite_itemcategoryid
(cost=0.00..278.43 rows=29021 width=0) (actual time=4.581..4.581
rows=25363 loops=784)
     Index Cond: (i.item_category_id = ictc.sub_category_id)
     Worker 0:  actual time=4.846..4.846 rows=25363 loops=112
     Worker 1:  actual time=4.734..4.734 rows=25363 loops=112
     Worker 2:  actual time=4.803..4.803 rows=25363 loops=112
     Worker 3:  actual time=4.959..4.959 rows=25363 loops=112
     Worker 4:  actual time=4.402..4.402 rows=25363 loops=112
     Worker 5:  actual time=4.778..4.778 rows=25363 loops=112

I wonder if the additional work_mem required for Memoize is just doing
something like causing kernel page cache evictions and leading to
fewer buffers for ixfk_ite_itemcategoryid and the item table being
cached in the kernel page cache.

Maybe you could get an idea of that if you SET track_io_timing = on;
and EXPLAIN (ANALYZE, BUFFERS) both queries.


Regards

Pavel


David

Re: using memoize in in paralel query decreases performance

From
David Rowley
Date:
 /On Tue, 7 Mar 2023 at 21:09, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> po 6. 3. 2023 v 22:52 odesílatel David Rowley <dgrowleyml@gmail.com> napsal:
>> I wonder if the additional work_mem required for Memoize is just doing
>> something like causing kernel page cache evictions and leading to
>> fewer buffers for ixfk_ite_itemcategoryid and the item table being
>> cached in the kernel page cache.
>>
>> Maybe you could get an idea of that if you SET track_io_timing = on;
>> and EXPLAIN (ANALYZE, BUFFERS) both queries.
>
>
> https://explain.depesz.com/s/vhk0

This is the enable_memoize=on one. The I/O looks like:

Buffers: shared hit=105661309 read=15274264 dirtied=15707 written=34863
I/O Timings: shared/local read=2671836.341 write=1286.869

2671836.341 / 15274264 = ~0.175 ms per read.

> https://explain.depesz.com/s/R5ju

This is the faster enable_memoize = off one. The I/O looks like:

Buffers: shared hit=44542473 read=18541899 dirtied=11988 written=18625
I/O Timings: shared/local read=1554838.583 write=821.477

1554838.583 / 18541899 = ~0.084 ms per read.

That indicates that the enable_memoize=off version is just finding
more pages in the kernel's page cache than the slower query.  The
slower query just appears to be under more memory pressure causing the
kernel to have less free memory to cache useful pages. I don't see
anything here that indicates any problems with Memoize.  Sure the
statistics could be better as, ideally, the Memoize wouldn't have
happened for the i_2 relation. I don't see anything that indicates any
bugs with this, however. It's pretty well known that Memoize puts
quite a bit of faith into ndistinct estimates. If it causes too many
issues the enable_memoize switch can be turned to off.

You might want to consider experimenting with smaller values of
work_mem and/or hash_mem_multiplier for this query or just disabling
memoize altogether.

David



Re: using memoize in in paralel query decreases performance

From
Pavel Stehule
Date:


út 7. 3. 2023 v 9:58 odesílatel David Rowley <dgrowleyml@gmail.com> napsal:
 /On Tue, 7 Mar 2023 at 21:09, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> po 6. 3. 2023 v 22:52 odesílatel David Rowley <dgrowleyml@gmail.com> napsal:
>> I wonder if the additional work_mem required for Memoize is just doing
>> something like causing kernel page cache evictions and leading to
>> fewer buffers for ixfk_ite_itemcategoryid and the item table being
>> cached in the kernel page cache.
>>
>> Maybe you could get an idea of that if you SET track_io_timing = on;
>> and EXPLAIN (ANALYZE, BUFFERS) both queries.
>
>
> https://explain.depesz.com/s/vhk0

This is the enable_memoize=on one. The I/O looks like:

Buffers: shared hit=105661309 read=15274264 dirtied=15707 written=34863
I/O Timings: shared/local read=2671836.341 write=1286.869

2671836.341 / 15274264 = ~0.175 ms per read.

> https://explain.depesz.com/s/R5ju

This is the faster enable_memoize = off one. The I/O looks like:

Buffers: shared hit=44542473 read=18541899 dirtied=11988 written=18625
I/O Timings: shared/local read=1554838.583 write=821.477

1554838.583 / 18541899 = ~0.084 ms per read.

That indicates that the enable_memoize=off version is just finding
more pages in the kernel's page cache than the slower query.  The
slower query just appears to be under more memory pressure causing the
kernel to have less free memory to cache useful pages. I don't see
anything here that indicates any problems with Memoize.  Sure the
statistics could be better as, ideally, the Memoize wouldn't have
happened for the i_2 relation. I don't see anything that indicates any
bugs with this, however. It's pretty well known that Memoize puts
quite a bit of faith into ndistinct estimates. If it causes too many
issues the enable_memoize switch can be turned to off.

You might want to consider experimenting with smaller values of
work_mem and/or hash_mem_multiplier for this query or just disabling
memoize altogether.

I can live with it. This is an analytical query and the performance is not too important for us. I was surprised that the performance was about 25% worse, and so the hit ratio was almost zero. I am thinking, but I am not sure if the estimation of the effectiveness of memoization can depend (or should depend) on the number of workers? In this case the number of workers is high.



David

Re: using memoize in in paralel query decreases performance

From
David Rowley
Date:
On Tue, 7 Mar 2023 at 22:09, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> I can live with it. This is an analytical query and the performance is not too important for us. I was surprised that
theperformance was about 25% worse, and so the hit ratio was almost zero. I am thinking, but I am not sure if the
estimationof the effectiveness of memoization can depend (or should depend) on the number of workers? In this case the
numberof workers is high. 

The costing for Memoize takes the number of workers into account by
way of the change in expected input rows.  The number of estimated
input rows is effectively just divided by the number of parallel
workers, so if we expect 1 million rows from the outer side of the
join and 4 workers, then we'll assume the memorize will deal with
250,000 rows per worker.  If the n_distinct estimate for the cache key
is 500,000, then it's not going to look very attractive to Memoize
that.  In reality, estimate_num_groups() won't say the number of
groups is higher than the input rows, but Memoize, with all the other
overheads factored into the costs, it would never look favourable if
the planner thought there was never going to be any repeated values.
The expected cache hit ratio there would be zero.

David



Re: using memoize in in paralel query decreases performance

From
Pavel Stehule
Date:


út 7. 3. 2023 v 10:46 odesílatel David Rowley <dgrowleyml@gmail.com> napsal:
On Tue, 7 Mar 2023 at 22:09, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> I can live with it. This is an analytical query and the performance is not too important for us. I was surprised that the performance was about 25% worse, and so the hit ratio was almost zero. I am thinking, but I am not sure if the estimation of the effectiveness of memoization can depend (or should depend) on the number of workers? In this case the number of workers is high.

The costing for Memoize takes the number of workers into account by
way of the change in expected input rows.  The number of estimated
input rows is effectively just divided by the number of parallel
workers, so if we expect 1 million rows from the outer side of the
join and 4 workers, then we'll assume the memorize will deal with
250,000 rows per worker.  If the n_distinct estimate for the cache key
is 500,000, then it's not going to look very attractive to Memoize
that.  In reality, estimate_num_groups() won't say the number of
groups is higher than the input rows, but Memoize, with all the other
overheads factored into the costs, it would never look favourable if
the planner thought there was never going to be any repeated values.
The expected cache hit ratio there would be zero.

Thanks for the explanation. 

Pavel


David