Thread: using memoize in parallel query decreases performance
Hi,

In one query I can see a very big overhead from the Memoize node - unfortunately with hits = 0.

The estimates are quite good. See details in the attachment.

Regards

Pavel
Attachment
On Mon, 6 Mar 2023 at 20:34, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> In one query I can see a very big overhead from the Memoize node - unfortunately with hits = 0.
>
> The estimates are quite good. See details in the attachment.

Are you able to share the version number for this?

Also, it would be good to see EXPLAIN ANALYZE *VERBOSE* for the Memoize plan so we can see the timings for the parallel workers.

The results of:

EXPLAIN ANALYZE
SELECT DISTINCT ictc.sub_category_id
FROM ixfk_ictc_subcategoryid ictc
INNER JOIN item i ON i.item_category_id = ictc.sub_category_id
WHERE ictc.super_category_id = ANY ('{47124,49426,49488,47040,47128}'::bigint[]);

would also be useful. That should give an idea of the ndistinct estimate. I guess Memoize thinks there are fewer unique values than the 112 that were found. There's probably not much to be done about that. The slowness of the parallel workers seems like a more interesting thing to understand.

David

On Mon, 6 Mar 2023 at 9:16, David Rowley <dgrowleyml@gmail.com> wrote:
> On Mon, 6 Mar 2023 at 20:34, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>> In one query I can see a very big overhead from the Memoize node - unfortunately with hits = 0.
>>
>> The estimates are quite good. See details in the attachment.
>
> Are you able to share the version number for this?

15.1 - an upgrade to 15.2 is planned this month

> Also, it would be good to see EXPLAIN ANALYZE *VERBOSE* for the
> Memoize plan so we can see the timings for the parallel workers.

default https://explain.depesz.com/s/fnBe

disabled memoize https://explain.depesz.com/s/P2rP

> The results of:
>
> EXPLAIN ANALYZE
> SELECT DISTINCT ictc.sub_category_id
> FROM ixfk_ictc_subcategoryid ictc
> INNER JOIN item i ON i.item_category_id = ictc.sub_category_id
> WHERE ictc.super_category_id = ANY
> ('{47124,49426,49488,47040,47128}'::bigint[]);
>
> would also be useful. That should give an idea of the ndistinct
> estimate. I guess Memoize thinks there are fewer unique values than
> the 112 that were found. There's probably not much to be done about
> that. The slowness of the parallel workers seems like a more
> interesting thing to understand.
>
> David
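
As an aside, a quick way to see what the planner itself already believes about the distinct values is to query pg_stats directly. The sketch below is illustrative only; the table and column names are taken from the plans in this thread and may need adjusting, and the whole-table n_distinct it reports can differ from the effective number of distinct keys left after the super_category_id filter, which is what the DISTINCT query above measures.

-- Illustrative sketch: the planner's per-column ndistinct estimates.
-- A positive n_distinct is an absolute count; a negative value is a
-- fraction of the table's rows (e.g. -0.1 means ~10% distinct).
SELECT tablename, attname, n_distinct
FROM pg_stats
WHERE (tablename, attname) IN (('ixfk_ictc_subcategoryid', 'sub_category_id'),
                               ('item', 'item_category_id'));
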
On Mon, 6 Mar 2023 at 21:55, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> default https://explain.depesz.com/s/fnBe

It looks like the slowness is coming from the Bitmap Index Scan and Bitmap Heap Scan rather than Memoize.

->  Bitmap Heap Scan on public.item i  (cost=285.69..41952.12 rows=29021 width=16) (actual time=20.395..591.606 rows=20471 loops=784)
      Output: i.id, i.item_category_id
      Recheck Cond: (i.item_category_id = ictc.sub_category_id)
      Heap Blocks: exact=1590348
      Worker 0: actual time=20.128..591.426 rows=20471 loops=112
      Worker 1: actual time=20.243..591.627 rows=20471 loops=112
      Worker 2: actual time=20.318..591.660 rows=20471 loops=112
      Worker 3: actual time=21.180..591.644 rows=20471 loops=112
      Worker 4: actual time=20.226..591.357 rows=20471 loops=112
      Worker 5: actual time=20.597..591.418 rows=20471 loops=112
      ->  Bitmap Index Scan on ixfk_ite_itemcategoryid  (cost=0.00..278.43 rows=29021 width=0) (actual time=14.851..14.851 rows=25362 loops=784)
            Index Cond: (i.item_category_id = ictc.sub_category_id)
            Worker 0: actual time=14.863..14.863 rows=25362 loops=112
            Worker 1: actual time=14.854..14.854 rows=25362 loops=112
            Worker 2: actual time=14.611..14.611 rows=25362 loops=112
            Worker 3: actual time=15.245..15.245 rows=25362 loops=112
            Worker 4: actual time=14.909..14.909 rows=25362 loops=112
            Worker 5: actual time=14.841..14.841 rows=25362 loops=112

> disabled memoize https://explain.depesz.com/s/P2rP

->  Bitmap Heap Scan on public.item i  (cost=285.69..41952.12 rows=29021 width=16) (actual time=9.256..57.503 rows=20471 loops=784)
      Output: i.id, i.item_category_id
      Recheck Cond: (i.item_category_id = ictc.sub_category_id)
      Heap Blocks: exact=1590349
      Worker 0: actual time=9.422..58.420 rows=20471 loops=112
      Worker 1: actual time=9.449..57.539 rows=20471 loops=112
      Worker 2: actual time=9.751..58.129 rows=20471 loops=112
      Worker 3: actual time=9.620..57.484 rows=20471 loops=112
      Worker 4: actual time=8.940..57.911 rows=20471 loops=112
      Worker 5: actual time=9.454..57.488 rows=20471 loops=112
      ->  Bitmap Index Scan on ixfk_ite_itemcategoryid  (cost=0.00..278.43 rows=29021 width=0) (actual time=4.581..4.581 rows=25363 loops=784)
            Index Cond: (i.item_category_id = ictc.sub_category_id)
            Worker 0: actual time=4.846..4.846 rows=25363 loops=112
            Worker 1: actual time=4.734..4.734 rows=25363 loops=112
            Worker 2: actual time=4.803..4.803 rows=25363 loops=112
            Worker 3: actual time=4.959..4.959 rows=25363 loops=112
            Worker 4: actual time=4.402..4.402 rows=25363 loops=112
            Worker 5: actual time=4.778..4.778 rows=25363 loops=112

I wonder if the additional work_mem required for Memoize is just doing something like causing kernel page cache evictions and leading to fewer buffers for ixfk_ite_itemcategoryid and the item table being cached in the kernel page cache.

Maybe you could get an idea of that if you SET track_io_timing = on; and EXPLAIN (ANALYZE, BUFFERS) both queries.

David

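To make the suggestion above concrete, here is an illustrative sketch (not from the original thread). track_io_timing can be switched on for just the current session, and BUFFERS added to the EXPLAIN options; the SELECT used here is only a stand-in borrowed from earlier in the thread, since the real query under test is the one from the attachment.

-- Illustrative sketch only: gather I/O timings for both plan shapes.
SET track_io_timing = on;

-- with Memoize allowed (the default)
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT DISTINCT ictc.sub_category_id
FROM ixfk_ictc_subcategoryid ictc
INNER JOIN item i ON i.item_category_id = ictc.sub_category_id
WHERE ictc.super_category_id = ANY ('{47124,49426,49488,47040,47128}'::bigint[]);

-- then disable Memoize and re-run the same EXPLAIN for the comparison plan
SET enable_memoize = off;
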
On Mon, 6 Mar 2023 at 22:52, David Rowley <dgrowleyml@gmail.com> wrote:
> On Mon, 6 Mar 2023 at 21:55, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>> default https://explain.depesz.com/s/fnBe
>
> It looks like the slowness is coming from the Bitmap Index Scan and
> Bitmap Heap Scan rather than Memoize.
>
> ->  Bitmap Heap Scan on public.item i  (cost=285.69..41952.12 rows=29021 width=16) (actual time=20.395..591.606 rows=20471 loops=784)
>       Output: i.id, i.item_category_id
>       Recheck Cond: (i.item_category_id = ictc.sub_category_id)
>       Heap Blocks: exact=1590348
>       Worker 0: actual time=20.128..591.426 rows=20471 loops=112
>       Worker 1: actual time=20.243..591.627 rows=20471 loops=112
>       Worker 2: actual time=20.318..591.660 rows=20471 loops=112
>       Worker 3: actual time=21.180..591.644 rows=20471 loops=112
>       Worker 4: actual time=20.226..591.357 rows=20471 loops=112
>       Worker 5: actual time=20.597..591.418 rows=20471 loops=112
>       ->  Bitmap Index Scan on ixfk_ite_itemcategoryid  (cost=0.00..278.43 rows=29021 width=0) (actual time=14.851..14.851 rows=25362 loops=784)
>             Index Cond: (i.item_category_id = ictc.sub_category_id)
>             Worker 0: actual time=14.863..14.863 rows=25362 loops=112
>             Worker 1: actual time=14.854..14.854 rows=25362 loops=112
>             Worker 2: actual time=14.611..14.611 rows=25362 loops=112
>             Worker 3: actual time=15.245..15.245 rows=25362 loops=112
>             Worker 4: actual time=14.909..14.909 rows=25362 loops=112
>             Worker 5: actual time=14.841..14.841 rows=25362 loops=112
>
>> disabled memoize https://explain.depesz.com/s/P2rP
>
> ->  Bitmap Heap Scan on public.item i  (cost=285.69..41952.12 rows=29021 width=16) (actual time=9.256..57.503 rows=20471 loops=784)
>       Output: i.id, i.item_category_id
>       Recheck Cond: (i.item_category_id = ictc.sub_category_id)
>       Heap Blocks: exact=1590349
>       Worker 0: actual time=9.422..58.420 rows=20471 loops=112
>       Worker 1: actual time=9.449..57.539 rows=20471 loops=112
>       Worker 2: actual time=9.751..58.129 rows=20471 loops=112
>       Worker 3: actual time=9.620..57.484 rows=20471 loops=112
>       Worker 4: actual time=8.940..57.911 rows=20471 loops=112
>       Worker 5: actual time=9.454..57.488 rows=20471 loops=112
>       ->  Bitmap Index Scan on ixfk_ite_itemcategoryid  (cost=0.00..278.43 rows=29021 width=0) (actual time=4.581..4.581 rows=25363 loops=784)
>             Index Cond: (i.item_category_id = ictc.sub_category_id)
>             Worker 0: actual time=4.846..4.846 rows=25363 loops=112
>             Worker 1: actual time=4.734..4.734 rows=25363 loops=112
>             Worker 2: actual time=4.803..4.803 rows=25363 loops=112
>             Worker 3: actual time=4.959..4.959 rows=25363 loops=112
>             Worker 4: actual time=4.402..4.402 rows=25363 loops=112
>             Worker 5: actual time=4.778..4.778 rows=25363 loops=112
>
> I wonder if the additional work_mem required for Memoize is just doing
> something like causing kernel page cache evictions and leading to
> fewer buffers for ixfk_ite_itemcategoryid and the item table being
> cached in the kernel page cache.
>
> Maybe you could get an idea of that if you SET track_io_timing = on;
> and EXPLAIN (ANALYZE, BUFFERS) both queries.

Regards
Pavel
> David

On Tue, 7 Mar 2023 at 21:09, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> On Mon, 6 Mar 2023 at 22:52, David Rowley <dgrowleyml@gmail.com> wrote:
>> I wonder if the additional work_mem required for Memoize is just doing
>> something like causing kernel page cache evictions and leading to
>> fewer buffers for ixfk_ite_itemcategoryid and the item table being
>> cached in the kernel page cache.
>>
>> Maybe you could get an idea of that if you SET track_io_timing = on;
>> and EXPLAIN (ANALYZE, BUFFERS) both queries.
>
> https://explain.depesz.com/s/vhk0

This is the enable_memoize=on one. The I/O looks like:

Buffers: shared hit=105661309 read=15274264 dirtied=15707 written=34863
I/O Timings: shared/local read=2671836.341 write=1286.869

2671836.341 / 15274264 = ~0.175 ms per read.

> https://explain.depesz.com/s/R5ju

This is the faster enable_memoize = off one. The I/O looks like:

Buffers: shared hit=44542473 read=18541899 dirtied=11988 written=18625
I/O Timings: shared/local read=1554838.583 write=821.477

1554838.583 / 18541899 = ~0.084 ms per read.

That indicates that the enable_memoize=off version is just finding more pages in the kernel's page cache than the slower query. The slower query just appears to be under more memory pressure causing the kernel to have less free memory to cache useful pages. I don't see anything here that indicates any problems with Memoize. Sure the statistics could be better as, ideally, the Memoize wouldn't have happened for the i_2 relation. I don't see anything that indicates any bugs with this, however. It's pretty well known that Memoize puts quite a bit of faith into ndistinct estimates. If it causes too many issues the enable_memoize switch can be turned to off.

You might want to consider experimenting with smaller values of work_mem and/or hash_mem_multiplier for this query or just disabling memoize altogether.

David

On Tue, 7 Mar 2023 at 9:58, David Rowley <dgrowleyml@gmail.com> wrote:
> On Tue, 7 Mar 2023 at 21:09, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>>
>> On Mon, 6 Mar 2023 at 22:52, David Rowley <dgrowleyml@gmail.com> wrote:
>>> I wonder if the additional work_mem required for Memoize is just doing
>>> something like causing kernel page cache evictions and leading to
>>> fewer buffers for ixfk_ite_itemcategoryid and the item table being
>>> cached in the kernel page cache.
>>>
>>> Maybe you could get an idea of that if you SET track_io_timing = on;
>>> and EXPLAIN (ANALYZE, BUFFERS) both queries.
>>
>> https://explain.depesz.com/s/vhk0
>
> This is the enable_memoize=on one. The I/O looks like:
>
> Buffers: shared hit=105661309 read=15274264 dirtied=15707 written=34863
> I/O Timings: shared/local read=2671836.341 write=1286.869
>
> 2671836.341 / 15274264 = ~0.175 ms per read.
>
>> https://explain.depesz.com/s/R5ju
>
> This is the faster enable_memoize = off one. The I/O looks like:
>
> Buffers: shared hit=44542473 read=18541899 dirtied=11988 written=18625
> I/O Timings: shared/local read=1554838.583 write=821.477
>
> 1554838.583 / 18541899 = ~0.084 ms per read.
>
> That indicates that the enable_memoize=off version is just finding
> more pages in the kernel's page cache than the slower query. The
> slower query just appears to be under more memory pressure causing the
> kernel to have less free memory to cache useful pages. I don't see
> anything here that indicates any problems with Memoize. Sure the
> statistics could be better as, ideally, the Memoize wouldn't have
> happened for the i_2 relation. I don't see anything that indicates any
> bugs with this, however. It's pretty well known that Memoize puts
> quite a bit of faith into ndistinct estimates. If it causes too many
> issues the enable_memoize switch can be turned to off.
>
> You might want to consider experimenting with smaller values of
> work_mem and/or hash_mem_multiplier for this query or just disabling
> memoize altogether.

I can live with it. This is an analytical query and its performance is not too important for us. I was surprised that the performance was about 25% worse and that the hit ratio was almost zero. I am wondering, though I am not sure, whether the estimation of the effectiveness of memoization can (or should) depend on the number of workers? In this case the number of workers is high.

> David
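
For reference, the knobs mentioned above can all be changed per session (or per transaction with SET LOCAL) rather than globally; the sketch below is illustrative and the values are arbitrary examples, not recommendations from the thread.

-- Illustrative per-session experiments (example values only).
SET enable_memoize = off;       -- simplest: keep the planner from choosing Memoize
-- or keep Memoize but give it a smaller cache, so it competes less for memory:
SET work_mem = '16MB';
SET hash_mem_multiplier = 1.0;
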
On Tue, 7 Mar 2023 at 22:09, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> I can live with it. This is an analytical query and its performance is not too important for us. I was surprised that the performance was about 25% worse and that the hit ratio was almost zero. I am wondering, though I am not sure, whether the estimation of the effectiveness of memoization can (or should) depend on the number of workers? In this case the number of workers is high.

The costing for Memoize takes the number of workers into account by way of the change in expected input rows. The number of estimated input rows is effectively just divided by the number of parallel workers, so if we expect 1 million rows from the outer side of the join and 4 workers, then we'll assume the Memoize will deal with 250,000 rows per worker. If the n_distinct estimate for the cache key is 500,000, then it's not going to look very attractive to Memoize that. In reality, estimate_num_groups() won't say the number of groups is higher than the input rows, but for Memoize, with all the other overheads factored into the costs, it would never look favourable if the planner thought there was never going to be any repeated values. The expected cache hit ratio there would be zero.

David

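To make the arithmetic above concrete, here is a rough back-of-envelope calculation (an illustrative simplification, not the planner's actual costing code), using the example figures from the message: 1 million outer rows split across 4 workers against an n_distinct estimate of 500,000 for the cache key.

-- Illustration only: with ~250,000 rows per worker and 500,000 expected
-- distinct cache keys, every per-worker row looks like a brand-new key,
-- so the naive expected hit ratio clamps to zero and Memoize is costed
-- unfavourably.
SELECT 1000000 / 4                                AS est_rows_per_worker,
       500000                                     AS est_distinct_keys,
       greatest(0, 1 - 500000 / (1000000 / 4.0))  AS naive_expected_hit_ratio;
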
On Tue, 7 Mar 2023 at 10:46, David Rowley <dgrowleyml@gmail.com> wrote:
> On Tue, 7 Mar 2023 at 22:09, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>> I can live with it. This is an analytical query and its performance is not too important for us. I was surprised that the performance was about 25% worse and that the hit ratio was almost zero. I am wondering, though I am not sure, whether the estimation of the effectiveness of memoization can (or should) depend on the number of workers? In this case the number of workers is high.
>
> The costing for Memoize takes the number of workers into account by
> way of the change in expected input rows. The number of estimated
> input rows is effectively just divided by the number of parallel
> workers, so if we expect 1 million rows from the outer side of the
> join and 4 workers, then we'll assume the Memoize will deal with
> 250,000 rows per worker. If the n_distinct estimate for the cache key
> is 500,000, then it's not going to look very attractive to Memoize
> that. In reality, estimate_num_groups() won't say the number of
> groups is higher than the input rows, but for Memoize, with all the
> other overheads factored into the costs, it would never look favourable
> if the planner thought there was never going to be any repeated values.
> The expected cache hit ratio there would be zero.

Thanks for the explanation.
Pavel
> David