Thread: strange slow query - lost lot of time somewhere
Hi
I found a query that is significantly slower with more memory
plan 1 - fast https://explain.depesz.com/s/XM1f
plan 2 - slow https://explain.depesz.com/s/2rBw
Strange - the time to produce the last row is about the same, but the execution time is 10x worse
It looks like slow cleanup of the execution environment
On Mon, 2 May 2022 at 15:28, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
On Mon, 2 May 2022 at 11:00, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> Hi
>
> I found a query that is significantly slower with more memory
Which PostgreSQL version did you use? Did you enable assert checking?
Do you have an example database setup to work with?
> plan 2
> QUERY PLAN
> ----------------
> Nested Loop Anti Join (cost=46.53..2914.58 rows=1 width=16) (actual time=18.306..23.065 rows=32 loops=1)
> ...
> Execution Time: 451.161 ms
Truly strange; especially the 418ms difference between execution time
and the root node's "actual time". I haven't really seen such
differences happen, except when concurrent locks were held at the
table / index level.
> plan 1 - fast https://explain.depesz.com/s/XM1f
>
> plan 2 - slow https://explain.depesz.com/s/2rBw
>
> Strange - the time of last row is +/- same, but execution time is 10x worse
The only difference between the two plans that I see is that plan 1
doesn't use memoization, whereas plan 2 does use 2 memoize plan nodes
(one of 66 misses; one of 342 misses). The only "expensive" operation
that I see in memoize nodes is the check for memory size in
assert-enabled builds; and that should have very low overhead
considering that the size of the memoized data is only 8kB and 25kB
respectively.
This is PostgreSQL 14 used in a production environment:
(2022-05-02 15:37:29) prd_aukro=# show debug_assertions ;
┌──────────────────┐
│ debug_assertions │
├──────────────────┤
│ off │
└──────────────────┘
(1 řádka)
Čas: 0,295 ms
(2022-05-02 15:37:35) prd_aukro=# select version();
┌────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ version │
├────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PostgreSQL 14.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4), 64-bit │
└────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(1 řádka)
Čas: 0,629 ms
Only shared_buffers was changed, to 32GB, and work_mem to 70MB. Unfortunately, it is a production environment with customer data, so I cannot play with it too much.
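A minimal sketch of a session-confined comparison, assuming a trivial stand-in in place of the real query; SET LOCAL keeps the changed setting inside the transaction, so the production configuration is untouched:

    BEGIN;
    SET LOCAL work_mem = '4MB';
    EXPLAIN (ANALYZE, BUFFERS, SETTINGS)
      SELECT count(*) FROM generate_series(1, 1000) AS g(i);  -- stand-in query
    SET LOCAL work_mem = '70MB';
    EXPLAIN (ANALYZE, BUFFERS, SETTINGS)
      SELECT count(*) FROM generate_series(1, 1000) AS g(i);  -- stand-in query
    ROLLBACK;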
This is perf of the slow run:
25,94% postmaster [kernel.kallsyms] [k] clear_page_erms
11,06% postmaster [kernel.kallsyms] [k] page_fault
5,51% postmaster [kernel.kallsyms] [k] prepare_exit_to_usermode
5,18% postmaster [kernel.kallsyms] [k] __list_del_entry_valid
5,15% postmaster libc-2.28.so [.] __memset_avx2_erms
3,99% postmaster [kernel.kallsyms] [k] unmap_page_range
3,07% postmaster postgres [.] hash_search_with_hash_value
2,73% postmaster [kernel.kallsyms] [k] cgroup_throttle_swaprate
2,49% postmaster postgres [.] heap_page_prune_opt
1,92% postmaster [kernel.kallsyms] [k] try_charge
1,85% postmaster [kernel.kallsyms] [k] swapgs_restore_regs_and_return_to_usermode
1,82% postmaster [kernel.kallsyms] [k] error_entry
1,73% postmaster postgres [.] _bt_checkkeys
1,48% postmaster [kernel.kallsyms] [k] free_pcppages_bulk
1,35% postmaster [kernel.kallsyms] [k] get_page_from_freelist
1,20% postmaster [kernel.kallsyms] [k] __pagevec_lru_add_fn
1,08% postmaster [kernel.kallsyms] [k] percpu_ref_put_many.constprop.84
1,08% postmaster postgres [.] 0x00000000003c1be6
1,06% postmaster [kernel.kallsyms] [k] get_mem_cgroup_from_mm.part.49
0,86% postmaster [kernel.kallsyms] [k] __handle_mm_fault
0,79% postmaster [kernel.kallsyms] [k] mem_cgroup_charge
0,70% postmaster [kernel.kallsyms] [k] release_pages
0,61% postmaster postgres [.] _bt_checkpage
0,61% postmaster [kernel.kallsyms] [k] free_pages_and_swap_cache
0,60% postmaster [kernel.kallsyms] [k] handle_mm_fault
0,57% postmaster postgres [.] tbm_iterate
0,56% postmaster [kernel.kallsyms] [k] __count_memcg_events.part.70
0,55% postmaster [kernel.kallsyms] [k] __mod_memcg_lruvec_state
0,52% postmaster postgres [.] 0x000000000015f6e5
0,50% postmaster [kernel.kallsyms] [k] prep_new_page
0,49% postmaster [kernel.kallsyms] [k] __do_page_fault
0,46% postmaster [kernel.kallsyms] [k] _raw_spin_lock
0,44% postmaster [kernel.kallsyms] [k] do_anonymous_page
This is perf of the fast run:
21,13% postmaster postgres [.] hash_search_with_hash_value
15,33% postmaster postgres [.] heap_page_prune_opt
10,22% postmaster libc-2.28.so [.] __memset_avx2_erms
10,00% postmaster postgres [.] _bt_checkkeys
6,23% postmaster postgres [.] 0x00000000003c1be6
4,94% postmaster postgres [.] _bt_checkpage
2,85% postmaster postgres [.] tbm_iterate
2,31% postmaster postgres [.] nocache_index_getattr
2,13% postmaster postgres [.] pg_qsort
1,58% postmaster postgres [.] heap_hot_search_buffer
1,58% postmaster postgres [.] FunctionCall2Coll
1,58% postmaster postgres [.] 0x000000000015f6e5
1,17% postmaster postgres [.] LWLockRelease
0,85% postmaster libc-2.28.so [.] __memcmp_avx2_movbe
0,64% postmaster postgres [.] 0x00000000003e4233
0,54% postmaster postgres [.] hash_bytes
0,53% postmaster postgres [.] 0x0000000000306fbb
0,53% postmaster postgres [.] LWLockAcquire
0,42% postmaster postgres [.] 0x00000000003c1c6f
0,42% postmaster postgres [.] _bt_compare
Regards
Pavel
On Mon, 2 May 2022 at 16:02, Pavel Stehule <pavel.stehule@gmail.com> wrote:
It looks like memoization allocates a lot of memory - maybe there is some temporary memory leak.
Performance counter stats for process id '4004464':
84,26 msec task-clock # 0,012 CPUs utilized
3 context-switches # 0,036 K/sec
0 cpu-migrations # 0,000 K/sec
19 page-faults # 0,225 K/sec
0 cycles # 0,000 GHz
106 873 995 instructions
20 225 431 branches # 240,026 M/sec
348 834 branch-misses # 1,72% of all branches
7,106142051 seconds time elapsed
Performance counter stats for process id '4004464':
1 116,97 msec task-clock # 0,214 CPUs utilized
4 context-switches # 0,004 K/sec
0 cpu-migrations # 0,000 K/sec
99 349 page-faults # 0,089 M/sec
0 cycles # 0,000 GHz
478 842 411 instructions
89 495 015 branches # 80,123 M/sec
1 014 763 branch-misses # 1,13% of all branches
5,221116331 seconds time elapsed
Regards
Pavel
On Mon, 2 May 2022 at 16:44, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
On Mon, 2 May 2022 at 16:09, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
>
>
> po 2. 5. 2022 v 16:02 odesílatel Pavel Stehule <pavel.stehule@gmail.com> napsal:
>> there is just shared buffers changed to 32GB and work_mem to 70MB. Unfortunately - it is in production environment with customer data, so I cannot to play too much
>>
>> This is perf of slow
>>
>> 25,94% postmaster [kernel.kallsyms] [k] clear_page_erms
>> 11,06% postmaster [kernel.kallsyms] [k] page_fault
>> 5,51% postmaster [kernel.kallsyms] [k] prepare_exit_to_usermode
>> 5,18% postmaster [kernel.kallsyms] [k] __list_del_entry_valid
>> 5,15% postmaster libc-2.28.so [.] __memset_avx2_erms
>> 3,99% postmaster [kernel.kallsyms] [k] unmap_page_range
>> 3,07% postmaster postgres [.] hash_search_with_hash_value
>> 2,73% postmaster [kernel.kallsyms] [k] cgroup_throttle_swaprate
>> 2,49% postmaster postgres [.] heap_page_prune_opt
>> 1,92% postmaster [kernel.kallsyms] [k] try_charge
>> 1,85% postmaster [kernel.kallsyms] [k] swapgs_restore_regs_and_return_to_usermode
>> 1,82% postmaster [kernel.kallsyms] [k] error_entry
>> 1,73% postmaster postgres [.] _bt_checkkeys
>> 1,48% postmaster [kernel.kallsyms] [k] free_pcppages_bulk
>> 1,35% postmaster [kernel.kallsyms] [k] get_page_from_freelist
>> 1,20% postmaster [kernel.kallsyms] [k] __pagevec_lru_add_fn
>> 1,08% postmaster [kernel.kallsyms] [k] percpu_ref_put_many.constprop.84
>> 1,08% postmaster postgres [.] 0x00000000003c1be6
>> 1,06% postmaster [kernel.kallsyms] [k] get_mem_cgroup_from_mm.part.49
>> 0,86% postmaster [kernel.kallsyms] [k] __handle_mm_fault
>> 0,79% postmaster [kernel.kallsyms] [k] mem_cgroup_charge
>> 0,70% postmaster [kernel.kallsyms] [k] release_pages
>> 0,61% postmaster postgres [.] _bt_checkpage
>> 0,61% postmaster [kernel.kallsyms] [k] free_pages_and_swap_cache
>> 0,60% postmaster [kernel.kallsyms] [k] handle_mm_fault
>> 0,57% postmaster postgres [.] tbm_iterate
>> 0,56% postmaster [kernel.kallsyms] [k] __count_memcg_events.part.70
>> 0,55% postmaster [kernel.kallsyms] [k] __mod_memcg_lruvec_state
>> 0,52% postmaster postgres [.] 0x000000000015f6e5
>> 0,50% postmaster [kernel.kallsyms] [k] prep_new_page
>> 0,49% postmaster [kernel.kallsyms] [k] __do_page_fault
>> 0,46% postmaster [kernel.kallsyms] [k] _raw_spin_lock
>> 0,44% postmaster [kernel.kallsyms] [k] do_anonymous_page
>>
>> This is fast
>>
>> 21,13% postmaster postgres [.] hash_search_with_hash_value
>> 15,33% postmaster postgres [.] heap_page_prune_opt
>> 10,22% postmaster libc-2.28.so [.] __memset_avx2_erms
>> 10,00% postmaster postgres [.] _bt_checkkeys
>> 6,23% postmaster postgres [.] 0x00000000003c1be6
>> 4,94% postmaster postgres [.] _bt_checkpage
>> 2,85% postmaster postgres [.] tbm_iterate
>> 2,31% postmaster postgres [.] nocache_index_getattr
>> 2,13% postmaster postgres [.] pg_qsort
>> 1,58% postmaster postgres [.] heap_hot_search_buffer
>> 1,58% postmaster postgres [.] FunctionCall2Coll
>> 1,58% postmaster postgres [.] 0x000000000015f6e5
>> 1,17% postmaster postgres [.] LWLockRelease
>> 0,85% postmaster libc-2.28.so [.] __memcmp_avx2_movbe
>> 0,64% postmaster postgres [.] 0x00000000003e4233
>> 0,54% postmaster postgres [.] hash_bytes
>> 0,53% postmaster postgres [.] 0x0000000000306fbb
>> 0,53% postmaster postgres [.] LWLockAcquire
>> 0,42% postmaster postgres [.] 0x00000000003c1c6f
>> 0,42% postmaster postgres [.] _bt_compare
>>
>
> It looks so memoization allocate lot of memory - maybe there are some temporal memory leak
Memoization doesn't leak memory any more than hash tables do; so I
doubt that that is the issue.
> Performance counter stats for process id '4004464':
>
> 84,26 msec task-clock # 0,012 CPUs utilized
> 3 context-switches # 0,036 K/sec
> 0 cpu-migrations # 0,000 K/sec
> 19 page-faults # 0,225 K/sec
> 0 cycles # 0,000 GHz
> 106 873 995 instructions
> 20 225 431 branches # 240,026 M/sec
> 348 834 branch-misses # 1,72% of all branches
>
> 7,106142051 seconds time elapsed
>
Assuming the above was for the fast query
> Performance counter stats for process id '4004464':
>
> 1 116,97 msec task-clock # 0,214 CPUs utilized
> 4 context-switches # 0,004 K/sec
> 0 cpu-migrations # 0,000 K/sec
> 99 349 page-faults # 0,089 M/sec
> 0 cycles # 0,000 GHz
> 478 842 411 instructions
> 89 495 015 branches # 80,123 M/sec
> 1 014 763 branch-misses # 1,13% of all branches
>
> 5,221116331 seconds time elapsed
... and this for the slow one:
It seems like this system is actively swapping memory; which has a
negative impact on your system. This seems to be indicated by the high
amount of page faults and the high amount of time spent in the kernel
(as per the perf report one mail earlier). Maybe too much (work)memory
was assigned and the machine you're running on doesn't have that
amount of memory left?
This computer has 354GB RAM, and there is 63GB RAM free (unused memory)
Either way, seeing that so much time is spent in the kernel I don't
think that PostgreSQL is the main/only source of the slow query here,
so I don't think pgsql-hackers is the right place to continue with
this conversation.
I can see this issue only when Memoize is enabled. So it looks like a Postgres issue. 400MB of work mem is not too much.
Regards,
Matthias
PS. Maybe next time start off in
https://www.postgresql.org/list/pgsql-performance/ if you have
performance issues with unknown origin.
The wiki also has some nice tips to debug performance issues:
https://wiki.postgresql.org/wiki/Slow_Query_Questions
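A minimal sketch, using only the standard pg_settings view, for checking the memory-related settings questioned above:

    SELECT name, setting, unit
      FROM pg_settings
     WHERE name IN ('shared_buffers', 'work_mem', 'hash_mem_multiplier');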
On Mon, 2 May 2022 at 21:00, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> I found a query that is significantly slower with more memory

Can you clarify what you mean here? More memory was installed on the
machine? or work_mem was increased? or?

> plan 1 - fast https://explain.depesz.com/s/XM1f
>
> plan 2 - slow https://explain.depesz.com/s/2rBw

If it was work_mem you increased, it seems strange that the plan would
switch over to using a Nested Loop / Memoize plan. Only 91 rows are
estimated on the outer side of the join. It's hard to imagine that
work_mem was so low that the Memoize costing code thought there would
ever be cache evictions.

> Strange - the time of last row is +/- same, but execution time is 10x worse
>
> It looks like slow environment cleaning

Can you also show EXPLAIN for the Memoize plan without ANALYZE?

Does the slowness present every time that plan is executed?

Can you show the EXPLAIN ANALYZE of the nested loop plan with
enable_memoize = off? You may need to disable hash and merge join.

David
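A minimal sketch of the requested experiments; the SELECT is a trivial stand-in for the real query, so only the shape of the commands is meant here:

    EXPLAIN
      SELECT count(*) FROM generate_series(1, 1000) AS g(i);  -- plan only, nothing is executed

    SET enable_memoize = off;
    SET enable_hashjoin = off;   -- may be needed to keep the nested-loop plan
    SET enable_mergejoin = off;
    EXPLAIN (ANALYZE, BUFFERS)
      SELECT count(*) FROM generate_series(1, 1000) AS g(i);  -- stand-in query
    RESET enable_memoize;
    RESET enable_hashjoin;
    RESET enable_mergejoin;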
David Rowley <dgrowleyml@gmail.com> writes:
> On Mon, 2 May 2022 at 21:00, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>> I found a query that is significantly slower with more memory

> If it was work_mem you increased, it seems strange that the plan would
> switch over to using a Nested Loop / Memoize plan.

Yeah, there's something unexplained there.

I think that the most probable explanation for the symptoms is that
cost_memoize_rescan is computing some insane value for est_entries,
causing ExecInitMemoize to allocate-and-zero a huge hash table,
which ExecEndMemoize then frees again. Neither of those steps
gets counted into any plan node's runtime, but EXPLAIN's total
execution time will include them. An insane value for est_entries
could perhaps go along with a cost misestimate that convinces the
planner to include the memoize even though it seems pointless.

I spent some time studying cost_memoize_rescan, and the only
conclusions I arrived at were that the variable names are poorly
chosen and the comments are unhelpful. For instance, one would
think that est_entry_bytes is the expected size of one cache entry,
but it seems to actually be the total space that would be occupied
if the whole input relation were loaded into the cache. And
the est_cache_entries computation seems nonsensical; if it does
make sense, the comment sure doesn't illuminate why. So I am
quite prepared to buy into the idea that cost_memoize_rescan is
producing bogus answers, but it's hard to tell what it's coming out
with in this example. Too bad EXPLAIN doesn't print est_entries.

regards, tom lane
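Restating that hypothesis against the plan 2 numbers (a reading of the argument above, not a measured breakdown):

\[
\underbrace{451.161\ \mathrm{ms}}_{\text{Execution Time}} \;\approx\; \underbrace{23.065\ \mathrm{ms}}_{\text{root node actual time}} \;+\; \underbrace{\sim 428\ \mathrm{ms}}_{\text{executor startup/shutdown work charged to no plan node}}
\]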
On Mon, May 2, 2022 at 4:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Rowley <dgrowleyml@gmail.com> writes:
> On Mon, 2 May 2022 at 21:00, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>> I found a query that is significantly slower with more memory
> If it was work_mem you increased, it seems strange that the plan would
> switch over to using a Nested Loop / Memoize plan.
Yeah, there's something unexplained there.
I spent some time studying cost_memoize_rescan, and the only
conclusions I arrived at were that the variable names are poorly
chosen and the comments are unhelpful. For instance, one would
think that est_entry_bytes is the expected size of one cache entry,
but it seems to actually be the total space that would be occupied
if the whole input relation were loaded into the cache.
And
the est_cache_entries computation seems nonsensical; if it does
make sense, the comment sure doesn't illuminate why.
My take on this is that a cache entry is keyed by a parameterization and any given entry can have, at most, every tuple saved into it (hence the computation of tuples*per-tuple-size). So the maximum number of hash keys is the total available memory divided by input relation size. This upper bound is stored in est_cache_entries. If the number of unique parameterizations expected (at worst one-per-call) is less than this we use that value and never evict. If it is more we use the est_cache_entries and plan to evict.
What I'm expecting to find but don't see is that by definition each unique parameterization must positively match a unique subset of the input relation tuples. If we remember only those tuples that matched then at no point should the total memory for the hash table exceed the size of the input relation.
Now, I'm not completely confident the cache only holds matched tuples...but if so:
From that the mpath->est_entries should be "min(hash_mem_bytes / est_entry_bytes, 1.0) * ndistinct"
i.e., all groups or a fractional subset based upon available hash memory
Then:
ndistinct = estimate_num_groups() || calls
retention_ratio = min(hash_mem_bytes / est_entry_bytes, 1.0)
est_entries = retention_ratio * ndistinct
evict_ratio = 1.0 - retention_ratio
hit_ratio = (est_entries / ndistinct) - (ndistinct / calls) || clamp to 0.0
I don't understand the adjustment factor ndistinct/calls
evictions total cost adjustment also assumes we are evicting all tuples as opposed to tuples/est_entries
This is a "rescan" so aside from cache management isn't the cost of originally populating the cache already accounted for elsewhere?
David J.
On Tue, 3 May 2022 at 11:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > On Mon, 2 May 2022 at 21:00, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> >> I found a query that is significantly slower with more memory
>
> > If it was work_mem you increased, it seems strange that the plan would
> > switch over to using a Nested Loop / Memoize plan.
>
> Yeah, there's something unexplained there.
>
> I think that the most probable explanation for the symptoms is that
> cost_memoize_rescan is computing some insane value for est_entries,
> causing ExecInitMemoize to allocate-and-zero a huge hash table,
> which ExecEndMemoize then frees again. Neither of those steps
> gets counted into any plan node's runtime, but EXPLAIN's total
> execution time will include them. An insane value for est_entries
> could perhaps go along with a cost misestimate that convinces the
> planner to include the memoize even though it seems pointless.

That seems pretty unlikely to me. est_entries is based on the minimum
value of the expected number of total cache entries and the ndistinct
value. ndistinct cannot be insane here as ndistinct is never going to
be higher than the number of calls, which is the row estimate from the
outer side of the join. That's 91 in both cases here. As far as I
can see, that's just going to make a table of 128 buckets.

See estimate_num_groups_incremental() at:

	/*
	 * We don't ever want to return an estimate of zero groups, as that tends
	 * to lead to division-by-zero and other unpleasantness. The input_rows
	 * estimate is usually already at least 1, but clamp it just in case it
	 * isn't.
	 */
	input_rows = clamp_row_est(input_rows);

> I spent some time studying cost_memoize_rescan, and the only
> conclusions I arrived at were that the variable names are poorly
> chosen and the comments are unhelpful. For instance, one would
> think that est_entry_bytes is the expected size of one cache entry,
> but it seems to actually be the total space that would be occupied
> if the whole input relation were loaded into the cache.

I think you've misunderstood. It *is* the estimated size of a single
entry. I think you might be going wrong in assuming "tuples" is the
expected tuples from all rescans of the inner side of the join. It's
actually from a single scan. I can add a comment there to help make
that clear.

> And
> the est_cache_entries computation seems nonsensical; if it does
> make sense, the comment sure doesn't illuminate why. So I am
> quite prepared to buy into the idea that cost_memoize_rescan is
> producing bogus answers, but it's hard to tell what it's coming out
> with in this example. Too bad EXPLAIN doesn't print est_entries.

I'm wishing I put the initial hash table size and the final hash table
size in EXPLAIN + EXPLAIN ANALYZE now. Perhaps it's not too late for
v15 to do that so that it might help us figure things out in the
future.

I'm open to making improvements to the comments in that area. I do
remember spending quite a bit of time trying to make things as clear
as possible as it is fairly complex what's going on there.

David
On Tue, 3 May 2022 at 13:43, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> hit_ratio = (est_entries / ndistinct) - (ndistinct / calls) || clamp to 0.0
> I don't understand the adjustment factor ndistinct/calls

I've attached a spreadsheet showing you the impact of subtracting
(ndistinct / calls). What this is correcting for is the fact that the
first scan from each unique value is a cache miss. The more calls we
have, the more hits we'll get. If there was only 1 call per distinct
value then there'd never be any hits. Without subtracting (ndistinct /
calls) and assuming there's space in the cache for each ndistinct
value, we'd assume 100% cache hit ratio if calls == ndistinct. What
we should assume in that case is a 0% hit ratio as the first scan for
each distinct parameter must always be a miss as we've never had a
chance to cache any tuples for it yet.

> This is a "rescan" so aside from cache management isn't the cost of
> originally populating the cache already accounted for elsewhere?

The cost of the first scan is calculated in create_memoize_path().
Since the first scan will always be a cache miss, the code there just
adds some cache management surcharges. Namely:

	/*
	 * Add a small additional charge for caching the first entry.  All the
	 * harder calculations for rescans are performed in cost_memoize_rescan().
	 */
	pathnode->path.startup_cost = subpath->startup_cost + cpu_tuple_cost;
	pathnode->path.total_cost = subpath->total_cost + cpu_tuple_cost;

David
On Mon, May 2, 2022 at 7:13 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 3 May 2022 at 11:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > On Mon, 2 May 2022 at 21:00, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> >> I found a query that is significantly slower with more memory
>
> > If it was work_mem you increased, it seems strange that the plan would
> > switch over to using a Nested Loop / Memoize plan.
>
> Yeah, there's something unexplained there.
>
> I think that the most probable explanation for the symptoms is that
> cost_memoize_rescan is computing some insane value for est_entries,
> causing ExecInitMemoize to allocate-and-zero a huge hash table,
> which ExecEndMemoize then frees again. Neither of those steps
> gets counted into any plan node's runtime, but EXPLAIN's total
> execution time will include them. An insane value for est_entries
> could perhaps go along with a cost misestimate that convinces the
> planner to include the memoize even though it seems pointless.
That seems pretty unlikely to me. est_entries is based on the minimum
value of the expected number of total cache entries and the ndistinct
value. ndistinct cannot be insane here as ndistinct is never going to
be higher than the number of calls, which is the row estimate from the
outer side of the join. That's 91 in both cases here. As far as I
can see, that's just going to make a table of 128 buckets.
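Worked through under that reading (my arithmetic; the power-of-two rounding is an assumption consistent with the 128-bucket figure):

\[
\text{est\_entries} = \min(\text{est\_cache\_entries},\ \text{ndistinct}), \qquad \text{ndistinct} \le \text{calls} = 91, \qquad 2^{\lceil \log_2 91 \rceil} = 128\ \text{buckets}
\]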
If est_entries goes to zero due to hash_mem_bytes/est_entry_bytes < 1 (hence floor takes it to zero) the executor will use a size value of 1024 instead in build_hash_table.
That seems unlikely but there is no data to support or refute it.
I'm open to making improvements to the comments in that area. I do
remember spending quite a bit of time trying to make things as clear
as possible as it is fairly complex what's going on there.
A few more intermediate calculation variables, along with descriptions, would help.
e.g., min(est_cache_entries, ndistinct) is repeated twice after its initial definition.
retention_ratio per my other reply
The (ndistinct/calls) part of hit_ratio being described specifically.
David J.
On Mon, May 2, 2022 at 7:30 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 3 May 2022 at 13:43, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> hit_ratio = (est_entries / ndistinct) - (ndistinct / calls) || clamp to 0.0
> I don't understand the adjustment factor ndistinct/calls
I've attached a spreadsheet showing you the impact of subtracting
(ndistinct / calls). What this is correcting for is the fact that the
first scan from each unique value is a cache miss. The more calls we
have, the more hits we'll get. If there was only 1 call per distinct
value then there'd never be any hits. Without subtracting (ndistinct /
calls) and assuming there's space in the cache for each ndistinct
value, we'd assume 100% cache hit ratio if calls == ndistinct. What
we should assume in that case is a 0% hit ratio as the first scan for
each distinct parameter must always be a miss as we've never had a
chance to cache any tuples for it yet.
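As a worked case of that argument (assuming the cache can hold every distinct key):

\[
\text{calls} = \text{ndistinct} = N,\ \ \text{est\_entries} = N \;\Longrightarrow\; \text{hit\_ratio} = \frac{N}{N} - \frac{N}{N} = 0
\]

i.e. every lookup is then the first (and only) scan for its key, so the expected hit ratio is 0% rather than the 100% the unadjusted formula would give.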
Thank you. I understand the theory and agree with it - but the math doesn't seem to be working out.
Plugging in:
n = 2,000
e = 500
c = 10,000
proper = 5%
incorrect = 25%
But of the 10,000 calls we will receive, the first 2,000 will be misses while 2,000 of the remaining 8,000 will be hits, due to sharing 2,000 distinct groups among the available inventory of 500 (25% of 8,000 is 2,000). 2,000 hits in 10,000 calls yields 20%.
I believe the correct formula to be:
((calls - ndistinct) / calls) * (est_entries / ndistinct) = hit_ratio
.80 * .25 = .20
First we recognize that our hit ratio can be at most (c - n) / c, since n misses are guaranteed. We take that ratio and scale it by our cache efficiency, since the remaining potential hits whose cache entry is not in memory will turn into misses.
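Putting the two formulas side by side with the numbers above (arithmetic only):

\[
\text{current: } \frac{e}{n} - \frac{n}{c} = 0.25 - 0.20 = 0.05, \qquad \text{proposed: } \frac{c - n}{c} \cdot \frac{e}{n} = 0.80 \times 0.25 = 0.20
\]

which matches the direct count of 2,000 expected hits out of 10,000 calls.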
David J.
On Tue, 3 May 2022 at 15:22, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> Plugging in:
> n = 2,000
> e = 500
> c = 10,000
>
> proper = 5%
> incorrect = 25%
>
> But of the 10,000 calls we will receive, the first 2,000 will be misses
> while 2,000 of the remaining 8,000 will be hits, due to sharing 2,000
> distinct groups among the available inventory of 500 (25% of 8,000 is
> 2,000). 2,000 hits in 10,000 calls yields 20%.
>
> I believe the correct formula to be:
>
> ((calls - ndistinct) / calls) * (est_entries / ndistinct) = hit_ratio
> .80 * .25 = .20

I think you're correct here. The formula should be that. However, two
things; 1) this being incorrect is not the cause of the original
problem reported on this thread, and 2) there's just no way we could
consider fixing this in v15, let alone back patch it to v14. Maybe we
should open a new thread about this and put an entry in the first CF
for v16 under bugs and come back to it after we branch.

Thinking the cache hit ratio is lower than it actually is going to be
will reduce the chances of the planner switching to a Nested Loop /
Memoize plan vs a Hash or Merge Join plan. I was already fairly
concerned that Memoize could cause performance regressions when the
ndistinct value or expected cache entry size is underestimated or the
outer side rows are overestimated.

What I've got to calculate the cache hit ratio does seem incorrect
given what you're showing, however it does add an element of pessimism
and reduces the chances of a bad plan being picked when work_mem is
too low to cache all entries. Swapping it out for your formula seems
like it would increase the chances of a Memoize plan being picked when
the row, ndistinct and cache entry size estimates are correct;
however, it could also increase the chance of a bad plan being picked
in cases where the estimates are incorrect.

My problem with changing this now would be that we already often
perform Nested Loop joins when a Hash or Merge join would be a better
option. I'd hate to take us in a direction where we make that problem
even worse.

David
On Mon, 2 May 2022 at 23:48, David Rowley <dgrowleyml@gmail.com> wrote:
On Mon, 2 May 2022 at 21:00, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> I found a query that is significantly slower with more memory
Can you clarify what you mean here? More memory was installed on the
machine? or work_mem was increased? or?
> plan 1 - fast https://explain.depesz.com/s/XM1f
>
> plan 2 - slow https://explain.depesz.com/s/2rBw
If it was work_mem you increased, it seems strange that the plan would
switch over to using a Nested Loop / Memoize plan. Only 91 rows are
estimated on the outer side of the join. It's hard to imagine that
work_mem was so low that the Memoize costing code thought there would
ever be cache evictions.
> Strange - the time of last row is +/- same, but execution time is 10x worse
>
> It looks like slow environment cleaning
Can you also show EXPLAIN for the Memoize plan without ANALYZE?
Yes - it is strange - it is slow even without execution:
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Nested Loop Anti Join (cost=59.62..3168.15 rows=1 width=16) │
│ -> Nested Loop Anti Join (cost=59.34..3165.24 rows=1 width=16) │
│ -> Nested Loop Semi Join (cost=58.48..3133.09 rows=1 width=16) │
│ -> Bitmap Heap Scan on item itembo0_ (cost=57.34..2708.22 rows=11 width=16) │
│ Recheck Cond: ((ending_time IS NULL) OR ((status_id = ANY ('{1,4,5}'::bigint[])) AND (CURRENT_TIMESTAMP < ending_time) AND (starting_time <= CURRENT_TIMESTAMP))) │
│ Filter: ((to_expose_date IS NULL) AND (status_id = ANY ('{1,4,5}'::bigint[])) AND (starting_time <= CURRENT_TIMESTAMP) AND ((ending_time IS NULL) OR (CURRENT_TIMESTAMP < ending_time))) │
│ -> BitmapOr (cost=57.34..57.34 rows=1751 width=0) │
│ -> Bitmap Index Scan on stehule354 (cost=0.00..2.08 rows=1 width=0) │
│ Index Cond: (ending_time IS NULL) │
│ -> Bitmap Index Scan on stehule1010 (cost=0.00..55.26 rows=1751 width=0) │
│ Index Cond: ((status_id = ANY ('{1,4,5}'::bigint[])) AND (ending_time > CURRENT_TIMESTAMP) AND (starting_time <= CURRENT_TIMESTAMP)) │
│ -> Nested Loop (cost=1.14..37.71 rows=91 width=8) │
│ -> Index Only Scan using uq_isi_itemid_itemimageid on item_share_image itemsharei2__1 (cost=0.57..3.80 rows=91 width=16) │
│ Index Cond: (item_id = itembo0_.id) │
│ -> Memoize (cost=0.57..2.09 rows=1 width=8) │
│ Cache Key: itemsharei2__1.item_image_id │
│ Cache Mode: logical │
│ -> Index Only Scan using pk_item_image on item_image itemimageb3__1 (cost=0.56..2.08 rows=1 width=8) │
│ Index Cond: (id = itemsharei2__1.item_image_id) │
│ -> Nested Loop (cost=0.85..32.14 rows=1 width=8) │
│ -> Index Only Scan using uq_isi_itemid_itemimageid on item_share_image itemsharei2_ (cost=0.57..3.80 rows=91 width=16) │
│ Index Cond: (item_id = itembo0_.id) │
│ -> Memoize (cost=0.29..1.72 rows=1 width=8) │
│ Cache Key: itemsharei2_.item_image_id │
│ Cache Mode: logical │
│ -> Index Only Scan using stehule3001 on item_image itemimageb3_ (cost=0.28..1.71 rows=1 width=8) │
│ Index Cond: (id = itemsharei2_.item_image_id) │
│ -> Index Only Scan using ixfk_ima_itemid on item_missing_attribute itemmissin1_ (cost=0.28..1.66 rows=1 width=8) │
│ Index Cond: (item_id = itembo0_.id) │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(29 řádek)
Čas: 420,392 ms
Does the slowness present every time that plan is executed?
It looks like it, yes.
Can you show the EXPLAIN ANALYZE of the nested loop plan with
enable_memoize = off? You may need to disable hash and merge join.
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Nested Loop Anti Join (cost=1093.22..4488.89 rows=1 width=16) (actual time=5.723..60.470 rows=13 loops=1) │
│ -> Nested Loop Anti Join (cost=1092.94..4485.97 rows=1 width=16) (actual time=5.165..60.368 rows=41 loops=1) │
│ -> Gather (cost=1001.70..4391.26 rows=1 width=16) (actual time=1.909..56.913 rows=41 loops=1) │
│ Workers Planned: 2 │
│ Workers Launched: 2 │
│ -> Nested Loop Semi Join (cost=1.70..3391.16 rows=1 width=16) (actual time=22.032..39.253 rows=14 loops=3) │
│ -> Parallel Index Only Scan using stehule1010 on item itembo0_ (cost=0.57..2422.83 rows=5 width=16) (actual time=21.785..38.851 rows=14 loops=3) │
│ Index Cond: ((status_id = ANY ('{1,4,5}'::bigint[])) AND (starting_time <= CURRENT_TIMESTAMP)) │
│ Filter: ((to_expose_date IS NULL) AND ((ending_time IS NULL) OR (CURRENT_TIMESTAMP < ending_time))) │
│ Rows Removed by Filter: 1589 │
│ Heap Fetches: 21 │
│ -> Nested Loop (cost=1.13..192.76 rows=91 width=8) (actual time=0.029..0.029 rows=1 loops=41) │
│ -> Index Only Scan using uq_isi_itemid_itemimageid on item_share_image itemsharei2__1 (cost=0.57..3.80 rows=91 width=16) (actual time=0.015..0.015 rows=1 loops=41) │
│ Index Cond: (item_id = itembo0_.id) │
│ Heap Fetches: 2 │
│ -> Index Only Scan using pk_item_image on item_image itemimageb3__1 (cost=0.56..2.08 rows=1 width=8) (actual time=0.013..0.013 rows=1 loops=41) │
│ Index Cond: (id = itemsharei2__1.item_image_id) │
│ Heap Fetches: 2 │
│ -> Hash Join (cost=91.24..94.71 rows=1 width=8) (actual time=0.084..0.084 rows=0 loops=41) │
│ Hash Cond: (itemsharei2_.item_image_id = itemimageb3_.id) │
│ -> Index Only Scan using uq_isi_itemid_itemimageid on item_share_image itemsharei2_ (cost=0.57..3.80 rows=91 width=16) (actual time=0.003..0.004 rows=6 loops=41) │
│ Index Cond: (item_id = itembo0_.id) │
│ Heap Fetches: 2 │
│ -> Hash (cost=67.41..67.41 rows=1861 width=8) (actual time=3.213..3.214 rows=3950 loops=1) │
│ Buckets: 4096 (originally 2048) Batches: 1 (originally 1) Memory Usage: 187kB │
│ -> Index Only Scan using stehule3001 on item_image itemimageb3_ (cost=0.28..67.41 rows=1861 width=8) (actual time=0.029..2.479 rows=3950 loops=1) │
│ Heap Fetches: 2203 │
│ -> Index Only Scan using ixfk_ima_itemid on item_missing_attribute itemmissin1_ (cost=0.28..1.66 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=41) │
│ Index Cond: (item_id = itembo0_.id) │
│ Heap Fetches: 0 │
│ Planning Time: 1.471 ms │
│ Execution Time: 60.570 ms │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(32 řádek)
Čas: 62,982 ms
David
On Tue, 3 May 2022 at 6:09, Pavel Stehule <pavel.stehule@gmail.com> wrote:
po 2. 5. 2022 v 23:48 odesílatel David Rowley <dgrowleyml@gmail.com> napsal:On Mon, 2 May 2022 at 21:00, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> I found a query that is significantly slower with more memory
Can you clarify what you mean here? More memory was installed on the
machine? or work_mem was increased? or?
> plan 1 - fast https://explain.depesz.com/s/XM1f
>
> plan 2 - slow https://explain.depesz.com/s/2rBw
If it was work_mem you increased, it seems strange that the plan would
switch over to using a Nested Loop / Memoize plan. Only 91 rows are
estimated on the outer side of the join. It's hard to imagine that
work_mem was so low that the Memoize costing code thought there would
ever be cache evictions.
> Strange - the time of last row is +/- same, but execution time is 10x worse
>
> It looks like slow environment cleaning
Can you also show EXPLAIN for the Memoize plan without ANALYZE?
yes - it is strange - it is just slow even without execution:
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Nested Loop Anti Join (cost=59.62..3168.15 rows=1 width=16) │
│ -> Nested Loop Anti Join (cost=59.34..3165.24 rows=1 width=16) │
│ -> Nested Loop Semi Join (cost=58.48..3133.09 rows=1 width=16) │
│ -> Bitmap Heap Scan on item itembo0_ (cost=57.34..2708.22 rows=11 width=16) │
│ Recheck Cond: ((ending_time IS NULL) OR ((status_id = ANY ('{1,4,5}'::bigint[])) AND (CURRENT_TIMESTAMP < ending_time) AND (starting_time <= CURRENT_TIMESTAMP))) │
│ Filter: ((to_expose_date IS NULL) AND (status_id = ANY ('{1,4,5}'::bigint[])) AND (starting_time <= CURRENT_TIMESTAMP) AND ((ending_time IS NULL) OR (CURRENT_TIMESTAMP < ending_time))) │
│ -> BitmapOr (cost=57.34..57.34 rows=1751 width=0) │
│ -> Bitmap Index Scan on stehule354 (cost=0.00..2.08 rows=1 width=0) │
│ Index Cond: (ending_time IS NULL) │
│ -> Bitmap Index Scan on stehule1010 (cost=0.00..55.26 rows=1751 width=0) │
│ Index Cond: ((status_id = ANY ('{1,4,5}'::bigint[])) AND (ending_time > CURRENT_TIMESTAMP) AND (starting_time <= CURRENT_TIMESTAMP)) │
│ -> Nested Loop (cost=1.14..37.71 rows=91 width=8) │
│ -> Index Only Scan using uq_isi_itemid_itemimageid on item_share_image itemsharei2__1 (cost=0.57..3.80 rows=91 width=16) │
│ Index Cond: (item_id = itembo0_.id) │
│ -> Memoize (cost=0.57..2.09 rows=1 width=8) │
│ Cache Key: itemsharei2__1.item_image_id │
│ Cache Mode: logical │
│ -> Index Only Scan using pk_item_image on item_image itemimageb3__1 (cost=0.56..2.08 rows=1 width=8) │
│ Index Cond: (id = itemsharei2__1.item_image_id) │
│ -> Nested Loop (cost=0.85..32.14 rows=1 width=8) │
│ -> Index Only Scan using uq_isi_itemid_itemimageid on item_share_image itemsharei2_ (cost=0.57..3.80 rows=91 width=16) │
│ Index Cond: (item_id = itembo0_.id) │
│ -> Memoize (cost=0.29..1.72 rows=1 width=8) │
│ Cache Key: itemsharei2_.item_image_id │
│ Cache Mode: logical │
│ -> Index Only Scan using stehule3001 on item_image itemimageb3_ (cost=0.28..1.71 rows=1 width=8) │
│ Index Cond: (id = itemsharei2_.item_image_id) │
│ -> Index Only Scan using ixfk_ima_itemid on item_missing_attribute itemmissin1_ (cost=0.28..1.66 rows=1 width=8) │
│ Index Cond: (item_id = itembo0_.id) │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(29 řádek)
Čas: 420,392 ms
there is really something strange (see attached file). It looks like this issue is much more related to planning time than execution time
Does the slowness present every time that plan is executed?
Looks like yes.
Can you show the EXPLAIN ANALYZE of the nested loop plan with
enable_memoize = off? You may need to disable hash and merge join.
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Nested Loop Anti Join (cost=1093.22..4488.89 rows=1 width=16) (actual time=5.723..60.470 rows=13 loops=1) │
│ -> Nested Loop Anti Join (cost=1092.94..4485.97 rows=1 width=16) (actual time=5.165..60.368 rows=41 loops=1) │
│ -> Gather (cost=1001.70..4391.26 rows=1 width=16) (actual time=1.909..56.913 rows=41 loops=1) │
│ Workers Planned: 2 │
│ Workers Launched: 2 │
│ -> Nested Loop Semi Join (cost=1.70..3391.16 rows=1 width=16) (actual time=22.032..39.253 rows=14 loops=3) │
│ -> Parallel Index Only Scan using stehule1010 on item itembo0_ (cost=0.57..2422.83 rows=5 width=16) (actual time=21.785..38.851 rows=14 loops=3) │
│ Index Cond: ((status_id = ANY ('{1,4,5}'::bigint[])) AND (starting_time <= CURRENT_TIMESTAMP)) │
│ Filter: ((to_expose_date IS NULL) AND ((ending_time IS NULL) OR (CURRENT_TIMESTAMP < ending_time))) │
│ Rows Removed by Filter: 1589 │
│ Heap Fetches: 21 │
│ -> Nested Loop (cost=1.13..192.76 rows=91 width=8) (actual time=0.029..0.029 rows=1 loops=41) │
│ -> Index Only Scan using uq_isi_itemid_itemimageid on item_share_image itemsharei2__1 (cost=0.57..3.80 rows=91 width=16) (actual time=0.015..0.015 rows=1 loops=41) │
│ Index Cond: (item_id = itembo0_.id) │
│ Heap Fetches: 2 │
│ -> Index Only Scan using pk_item_image on item_image itemimageb3__1 (cost=0.56..2.08 rows=1 width=8) (actual time=0.013..0.013 rows=1 loops=41) │
│ Index Cond: (id = itemsharei2__1.item_image_id) │
│ Heap Fetches: 2 │
│ -> Hash Join (cost=91.24..94.71 rows=1 width=8) (actual time=0.084..0.084 rows=0 loops=41) │
│ Hash Cond: (itemsharei2_.item_image_id = itemimageb3_.id) │
│ -> Index Only Scan using uq_isi_itemid_itemimageid on item_share_image itemsharei2_ (cost=0.57..3.80 rows=91 width=16) (actual time=0.003..0.004 rows=6 loops=41) │
│ Index Cond: (item_id = itembo0_.id) │
│ Heap Fetches: 2 │
│ -> Hash (cost=67.41..67.41 rows=1861 width=8) (actual time=3.213..3.214 rows=3950 loops=1) │
│ Buckets: 4096 (originally 2048) Batches: 1 (originally 1) Memory Usage: 187kB │
│ -> Index Only Scan using stehule3001 on item_image itemimageb3_ (cost=0.28..67.41 rows=1861 width=8) (actual time=0.029..2.479 rows=3950 loops=1) │
│ Heap Fetches: 2203 │
│ -> Index Only Scan using ixfk_ima_itemid on item_missing_attribute itemmissin1_ (cost=0.28..1.66 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=41) │
│ Index Cond: (item_id = itembo0_.id) │
│ Heap Fetches: 0 │
│ Planning Time: 1.471 ms │
│ Execution Time: 60.570 ms │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(32 řádek)
Čas: 62,982 ms
David
Attachment
Pavel Stehule <pavel.stehule@gmail.com> writes:
> there is really something strange (see attached file). It looks like this issue
> is much more related to planning time than execution time

You sure there's not something taking an exclusive lock on one of these
tables every so often?

regards, tom lane
On Tue, May 3, 2022 at 6:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Pavel Stehule <pavel.stehule@gmail.com> writes:
> there is really something strange (see attached file). It looks like this issue
> is much more related to planning time than execution time
You sure there's not something taking an exclusive lock on one of these
tables every so often?
I am almost sure - I can see this issue every time I set a higher work_mem, and I don't see it in other cases.
Regards
Pavel
regards, tom lane
On Tue, 3 May 2022 at 17:02, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> On Tue, May 3, 2022 at 6:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> You sure there's not something taking an exclusive lock on one of these
>> tables every so often?
>
> I am almost sure - I can see this issue every time I set a higher work_mem, and I don't see it in other cases.

hmm, I don't think the query being blocked on a table lock would cause
this behaviour. As far as I know, all table locks should be obtained
before the timer starts for the "Execution Time" timer in EXPLAIN
ANALYZE. However, locks are obtained on indexes at executor startup,
so if there was some delay in obtaining a lock on the index it would
show up this way. I just don't know of anything that obtains a
conflicting lock on an index without the same conflicting lock on the
table that the index belongs to.

I do agree that the perf report does indicate that the extra time is
taken due to some large amount of memory being allocated. I just can't
quite see how that would happen in Memoize given that
estimate_num_groups() clamps the distinct estimate as the number of
input rows, which is 91 in both cases in your problem query.

Are you able to run the Memoize query in psql with \watch 0.1 for a
few seconds while you do:

perf record --call-graph dwarf --pid <pid> sleep 2

then send along the perf report.

I locally hacked build_hash_table() in nodeMemoize.c to make the
hashtable 100 million elements and I see my perf report for a trivial
Memoize query come up as:

+   99.98%     0.00%  postgres  postgres   [.] _start
+   99.98%     0.00%  postgres  libc.so.6  [.] __libc_start_main_alias_2 (inlined)
+   99.98%     0.00%  postgres  libc.so.6  [.] __libc_start_call_main
+   99.98%     0.00%  postgres  postgres   [.] main
+   99.98%     0.00%  postgres  postgres   [.] PostmasterMain
+   99.98%     0.00%  postgres  postgres   [.] ServerLoop
+   99.98%     0.00%  postgres  postgres   [.] BackendStartup (inlined)
+   99.98%     0.00%  postgres  postgres   [.] BackendRun (inlined)
+   99.98%     0.00%  postgres  postgres   [.] PostgresMain
+   99.98%     0.00%  postgres  postgres   [.] exec_simple_query
+   99.98%     0.00%  postgres  postgres   [.] PortalRun
+   99.98%     0.00%  postgres  postgres   [.] FillPortalStore
+   99.98%     0.00%  postgres  postgres   [.] PortalRunUtility
+   99.98%     0.00%  postgres  postgres   [.] standard_ProcessUtility
+   99.98%     0.00%  postgres  postgres   [.] ExplainQuery
+   99.98%     0.00%  postgres  postgres   [.] ExplainOneQuery
+   99.95%     0.00%  postgres  postgres   [.] ExplainOnePlan
+   87.87%     0.00%  postgres  postgres   [.] standard_ExecutorStart
+   87.87%     0.00%  postgres  postgres   [.] InitPlan (inlined)
+   87.87%     0.00%  postgres  postgres   [.] ExecInitNode
+   87.87%     0.00%  postgres  postgres   [.] ExecInitNestLoop
+   87.87%     0.00%  postgres  postgres   [.] ExecInitMemoize
+   87.87%     0.00%  postgres  postgres   [.] build_hash_table (inlined)   <----
+   87.87%     0.00%  postgres  postgres   [.] memoize_create (inlined)
+   87.87%     0.00%  postgres  postgres   [.] memoize_allocate (inlined)
+   87.87%     0.00%  postgres  postgres   [.] MemoryContextAllocExtended
+   87.87%     0.00%  postgres  postgres   [.] memset (inlined)

Failing that, are you able to pg_dump these tables and load them into
a PostgreSQL instance that you can play around with and patch?
Provided you can actually recreate the problem on that instance.

David
> I do agree that the perf report does indicate that the extra time is taken due to
> some large amount of memory being allocated. I just can't quite see how that
> would happen in Memoize given that
> estimate_num_groups() clamps the distinct estimate as the number of input
> rows, which is 91 in both cases in your problem query.
>
> Are you able to run the Memoize query in psql with \watch 0.1 for a few seconds
> while you do:
>
> perf record --call-graph dwarf --pid <pid> sleep 2
>
> then send along the perf report.
>
> I locally hacked build_hash_table() in nodeMemoize.c to make the hashtable 100
> million elements and I see my perf report for a trivial Memoize query come up
> as:
> [..]
>
> Failing that, are you able to pg_dump these tables and load them into a
> PostgreSQL instance that you can play around with and patch?
> Provided you can actually recreate the problem on that instance.
>

+1 to what David says, we need a reproducer.

In [1] Pavel wrote that he's having a lot of clear_page_erms(), so maybe this will be a little help: I recall having a similar issue - a lot of minor page faults and high %sys when raising work_mem. For me it was a different issue some time ago, but it was something like build_hash_table() being used by UNION recursive calls -> BuildTupleHashTable() -> .. malloc() -> mmap64(). When mmap() is issued with MAP_ANONYMOUS the kernel will zero out the memory (more memory -> potentially bigger CPU waste visible as minor page faults; erms stands for "Enhanced REP MOVSB/STOSB"; this is on the kernel side). The culprit was the planner allocating something that wouldn't be used later.

Additional three ways to figure that one (all are IMHO production safe):
a) already mentioned perf with --call-graph dwarf -p PID
b) strace -p PID -e 'mmap' # verify if mmap() NULL is not having MAP_ANONYMOUS flag, size of mmap() request will somehow match work_mem sizing
c) gdb -p PID and then breakpoint for mmap and verify each mmap() # check MAP_ANONYMOUS as above

[1] - https://www.postgresql.org/message-id/CAFj8pRAo5CrF8mpPxMvnBYFSqu4HYDqRsQnLqGphckNHkHosFg%40mail.gmail.com

-J.
On Wed, May 4, 2022 at 2:15, David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 3 May 2022 at 17:02, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> On Tue, May 3, 2022 at 6:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> You sure there's not something taking an exclusive lock on one of these
>> tables every so often?
>
> I am almost sure - I can see this issue every time I set a higher work_mem, and I don't see it in other cases.
hmm, I don't think the query being blocked on a table lock would cause
this behaviour. As far as I know, all table locks should be obtained
before the timer starts for the "Execution Time" timer in EXPLAIN
ANALYZE. However, locks are obtained on indexes at executor startup,
so if there was some delay in obtaining a lock on the index it would
show up this way. I just don't know of anything that obtains a
conflicting lock on an index without the same conflicting lock on the
table that the index belongs to.
I do agree that the perf report does indicate that the extra time is
taken due to some large amount of memory being allocated. I just can't
quite see how that would happen in Memoize given that
estimate_num_groups() clamps the distinct estimate as the number of
input rows, which is 91 in both cases in your problem query.
Are you able to run the Memoize query in psql with \watch 0.1 for a
few seconds while you do:
perf record --call-graph dwarf --pid <pid> sleep 2
then send along the perf report.
I locally hacked build_hash_table() in nodeMemoize.c to make the
hashtable 100 million elements and I see my perf report for a trivial
Memoize query come up as:
+ 99.98% 0.00% postgres postgres [.] _start
+ 99.98% 0.00% postgres libc.so.6 [.]
__libc_start_main_alias_2 (inlined)
+ 99.98% 0.00% postgres libc.so.6 [.] __libc_start_call_main
+ 99.98% 0.00% postgres postgres [.] main
+ 99.98% 0.00% postgres postgres [.] PostmasterMain
+ 99.98% 0.00% postgres postgres [.] ServerLoop
+ 99.98% 0.00% postgres postgres [.] BackendStartup (inlined)
+ 99.98% 0.00% postgres postgres [.] BackendRun (inlined)
+ 99.98% 0.00% postgres postgres [.] PostgresMain
+ 99.98% 0.00% postgres postgres [.] exec_simple_query
+ 99.98% 0.00% postgres postgres [.] PortalRun
+ 99.98% 0.00% postgres postgres [.] FillPortalStore
+ 99.98% 0.00% postgres postgres [.] PortalRunUtility
+ 99.98% 0.00% postgres postgres [.] standard_ProcessUtility
+ 99.98% 0.00% postgres postgres [.] ExplainQuery
+ 99.98% 0.00% postgres postgres [.] ExplainOneQuery
+ 99.95% 0.00% postgres postgres [.] ExplainOnePlan
+ 87.87% 0.00% postgres postgres [.] standard_ExecutorStart
+ 87.87% 0.00% postgres postgres [.] InitPlan (inlined)
+ 87.87% 0.00% postgres postgres [.] ExecInitNode
+ 87.87% 0.00% postgres postgres [.] ExecInitNestLoop
+ 87.87% 0.00% postgres postgres [.] ExecInitMemoize
+ 87.87% 0.00% postgres postgres [.]
build_hash_table (inlined) <----
+ 87.87% 0.00% postgres postgres [.] memoize_create (inlined)
+ 87.87% 0.00% postgres postgres [.]
memoize_allocate (inlined)
+ 87.87% 0.00% postgres postgres [.]
MemoryContextAllocExtended
+ 87.87% 0.00% postgres postgres [.] memset (inlined)
Failing that, are you able to pg_dump these tables and load them into
a PostgreSQL instance that you can play around with and patch?
Provided you can actually recreate the problem on that instance.
+ 71,98% 14,36% postmaster [kernel.kallsyms] [k] page_fault ▒
+ 70,19% 6,59% postmaster libc-2.28.so [.] __memset_avx2_erms ▒
+ 68,20% 0,00% postmaster postgres [.] ExecInitNode ▒
+ 68,20% 0,00% postmaster postgres [.] ExecInitNestLoop ▒
+ 68,13% 0,00% postmaster postgres [.] ExecInitMemoize ▒
+ 68,13% 0,00% postmaster postgres [.] MemoryContextAllocExtended ▒
+ 63,20% 0,00% postmaster postgres [.] 0x0000000000776b89 ▒
+ 63,20% 0,00% postmaster postgres [.] PostgresMain ◆
+ 63,03% 0,00% postmaster postgres [.] 0x00000000007f48ca ▒
+ 63,03% 0,00% postmaster postgres [.] PortalRun ▒
+ 63,03% 0,00% postmaster postgres [.] 0x00000000007f83ae ▒
+ 63,03% 0,00% postmaster postgres [.] 0x00000000007f7fee ▒
+ 63,03% 0,00% postmaster pg_stat_statements.so [.] 0x00007f5579b599c6 ▒
+ 63,03% 0,00% postmaster postgres [.] standard_ProcessUtility ▒
+ 63,03% 0,00% postmaster postgres [.] ExplainQuery ▒
+ 62,83% 0,00% postmaster postgres [.] 0x000000000062e83c ▒
+ 62,83% 0,00% postmaster postgres [.] ExplainOnePlan ▒
+ 57,47% 0,14% postmaster [kernel.kallsyms] [k] do_page_fault ▒
+ 57,23% 0,51% postmaster [kernel.kallsyms] [k] __do_page_fault ▒
+ 55,61% 0,71% postmaster [kernel.kallsyms] [k] handle_mm_fault ▒
+ 55,19% 0,00% postmaster pg_stat_statements.so [.] 0x00007f5579b5ad2c ▒
+ 55,19% 0,00% postmaster postgres [.] standard_ExecutorStart ▒
+ 54,78% 0,87% postmaster [kernel.kallsyms] [k] __handle_mm_fault ▒
+ 53,54% 0,37% postmaster [kernel.kallsyms] [k] do_anonymous_page ▒
+ 36,36% 0,21% postmaster [kernel.kallsyms] [k] alloc_pages_vma ▒
+ 35,99% 0,31% postmaster [kernel.kallsyms] [k] __alloc_pages_nodemask ▒
+ 35,40% 1,06% postmaster [kernel.kallsyms] [k] get_page_from_freelist ▒
+ 27,71% 0,62% postmaster [kernel.kallsyms] [k] prep_new_page ▒
+ 27,09% 26,99% postmaster [kernel.kallsyms] [k] clear_page_erms ▒
+ 11,24% 2,29% postmaster [kernel.kallsyms] [k] swapgs_restore_regs_and_return_to_usermode ▒
+ 8,95% 6,87% postmaster [kernel.kallsyms] [k] prepare_exit_to_usermode ▒
+ 7,83% 1,01% postmaster [kernel.kallsyms] [k] mem_cgroup_charge
David
On Wed, May 4, 2022 at 16:08, Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
Additional three ways to figure that one (all are IMHO production safe):
a) already mentioned perf with --call-graph dwarf -p PID
b) strace -p PID -e 'mmap' # verify if mmap() NULL is not having MAP_ANONYMOUS flag, size of mmap() request will somehow match work_mem sizing
c) gdb -p PID and then breakpoint for mmap and verify each mmap() # check MAP_ANONYMOUS as above
I don't have debug symbols, so I don't have more details now
Breakpoint 1 at 0x7f557f0c16c0
(gdb) c
Continuing.
Breakpoint 1, 0x00007f557f0c16c0 in mmap64 () from /lib64/libc.so.6
(gdb) bt
#0 0x00007f557f0c16c0 in mmap64 () from /lib64/libc.so.6
#1 0x00007f557f04dd91 in sysmalloc () from /lib64/libc.so.6
#2 0x00007f557f04eaa9 in _int_malloc () from /lib64/libc.so.6
#3 0x00007f557f04fb1e in malloc () from /lib64/libc.so.6
#4 0x0000000000932134 in AllocSetAlloc ()
#5 0x00000000009376cf in MemoryContextAllocExtended ()
#6 0x00000000006ad915 in ExecInitMemoize ()
#7 0x000000000068dc02 in ExecInitNode ()
#8 0x00000000006b37ff in ExecInitNestLoop ()
#9 0x000000000068dc56 in ExecInitNode ()
#10 0x00000000006b37ff in ExecInitNestLoop ()
#11 0x000000000068dc56 in ExecInitNode ()
#12 0x00000000006b37de in ExecInitNestLoop ()
#13 0x000000000068dc56 in ExecInitNode ()
#14 0x00000000006b37de in ExecInitNestLoop ()
#15 0x000000000068dc56 in ExecInitNode ()
#16 0x0000000000687e4d in standard_ExecutorStart ()
#17 0x00007f5579b5ad2d in pgss_ExecutorStart () from /usr/pgsql-14/lib/pg_stat_statements.so
#18 0x000000000062e643 in ExplainOnePlan ()
#19 0x000000000062e83d in ExplainOneQuery ()
#20 0x000000000062ee6f in ExplainQuery ()
#21 0x00000000007f9b15 in standard_ProcessUtility ()
#22 0x00007f5579b599c7 in pgss_ProcessUtility () from /usr/pgsql-14/lib/pg_stat_statements.so
#23 0x00000000007f7fef in PortalRunUtility ()
#24 0x00000000007f83af in FillPortalStore ()
#25 0x00000000007f86dd in PortalRun ()
#26 0x00000000007f48cb in exec_simple_query ()
#27 0x00000000007f610e in PostgresMain ()
#28 0x0000000000776b8a in ServerLoop ()
#29 0x0000000000777a03 in PostmasterMain ()
#30 0x00000000004fe413 in main ()
(gdb) p
The history is empty.
(gdb) c
Continuing.
Breakpoint 1, 0x00007f557f0c16c0 in mmap64 () from /lib64/libc.so.6
(gdb) bt
#0 0x00007f557f0c16c0 in mmap64 () from /lib64/libc.so.6
#1 0x00007f557f04dd91 in sysmalloc () from /lib64/libc.so.6
#2 0x00007f557f04eaa9 in _int_malloc () from /lib64/libc.so.6
#3 0x00007f557f04fb1e in malloc () from /lib64/libc.so.6
#4 0x0000000000932134 in AllocSetAlloc ()
#5 0x00000000009376cf in MemoryContextAllocExtended ()
#6 0x00000000006ad915 in ExecInitMemoize ()
#7 0x000000000068dc02 in ExecInitNode ()
#8 0x00000000006b37ff in ExecInitNestLoop ()
#9 0x000000000068dc56 in ExecInitNode ()
#10 0x00000000006b37ff in ExecInitNestLoop ()
#11 0x000000000068dc56 in ExecInitNode ()
#12 0x00000000006b37de in ExecInitNestLoop ()
#13 0x000000000068dc56 in ExecInitNode ()
#14 0x0000000000687e4d in standard_ExecutorStart ()
#15 0x00007f5579b5ad2d in pgss_ExecutorStart () from /usr/pgsql-14/lib/pg_stat_statements.so
#16 0x000000000062e643 in ExplainOnePlan ()
#17 0x000000000062e83d in ExplainOneQuery ()
#18 0x000000000062ee6f in ExplainQuery ()
#19 0x00000000007f9b15 in standard_ProcessUtility ()
#20 0x00007f5579b599c7 in pgss_ProcessUtility () from /usr/pgsql-14/lib/pg_stat_statements.so
#21 0x00000000007f7fef in PortalRunUtility ()
#22 0x00000000007f83af in FillPortalStore ()
#23 0x00000000007f86dd in PortalRun ()
#24 0x00000000007f48cb in exec_simple_query ()
#25 0x00000000007f610e in PostgresMain ()
#26 0x0000000000776b8a in ServerLoop ()
#27 0x0000000000777a03 in PostmasterMain ()
#28 0x00000000004fe413 in main ()
(gdb) c
Continuing.
There were 2 hits of mmap.
Regards
Pavel
[1] - https://www.postgresql.org/message-id/CAFj8pRAo5CrF8mpPxMvnBYFSqu4HYDqRsQnLqGphckNHkHosFg%40mail.gmail.com
-J.
Hi Pavel,

> I don't have debug symbols, so I don't have more details now
> Breakpoint 1 at 0x7f557f0c16c0
> (gdb) c
> Continuing.
> Breakpoint 1, 0x00007f557f0c16c0 in mmap64 () from /lib64/libc.so.6
> (gdb) bt
> #0 0x00007f557f0c16c0 in mmap64 () from /lib64/libc.so.6
> #1 0x00007f557f04dd91 in sysmalloc () from /lib64/libc.so.6
> #2 0x00007f557f04eaa9 in _int_malloc () from /lib64/libc.so.6
> #3 0x00007f557f04fb1e in malloc () from /lib64/libc.so.6
> #4 0x0000000000932134 in AllocSetAlloc ()
> #5 0x00000000009376cf in MemoryContextAllocExtended ()
> #6 0x00000000006ad915 in ExecInitMemoize ()

Well the PGDG repo have the debuginfos (e.g. postgresql14-debuginfo) rpms / dpkgs(?) so I hope you are basically 1 command away of being able to debug it further what happens in ExecInitMemoize()
Those packages seem to be safe as they modify only /usr/lib/debug so should not have any impact on production workload.

-J.
On Thu, May 5, 2022 at 8:51, Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
Hi Pavel,
> I don't have debug symbols, so I don't have more details now
> Breakpoint 1 at 0x7f557f0c16c0
> (gdb) c
> Continuing.
> Breakpoint 1, 0x00007f557f0c16c0 in mmap64 () from /lib64/libc.so.6
> (gdb) bt
> #0 0x00007f557f0c16c0 in mmap64 () from /lib64/libc.so.6
> #1 0x00007f557f04dd91 in sysmalloc () from /lib64/libc.so.6
> #2 0x00007f557f04eaa9 in _int_malloc () from /lib64/libc.so.6
> #3 0x00007f557f04fb1e in malloc () from /lib64/libc.so.6
> #4 0x0000000000932134 in AllocSetAlloc ()
> #5 0x00000000009376cf in MemoryContextAllocExtended ()
> #6 0x00000000006ad915 in ExecInitMemoize ()
Well the PGDG repo have the debuginfos (e.g. postgresql14-debuginfo) rpms / dpkgs(?) so I hope you are basically 1 command away of being able to debug it further what happens in ExecInitMemoize()
Those packages seem to be safe as they modify only /usr/lib/debug so should not have any impact on production workload.
I just have to wait for admin action - I have no root rights for the server.
-J.
On Thu, 5 May 2022 at 19:26, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> On Thu, May 5, 2022 at 8:51, Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
>> > Breakpoint 1, 0x00007f557f0c16c0 in mmap64 () from /lib64/libc.so.6
>> > (gdb) bt
>> > #0 0x00007f557f0c16c0 in mmap64 () from /lib64/libc.so.6
>> > #1 0x00007f557f04dd91 in sysmalloc () from /lib64/libc.so.6
>> > #2 0x00007f557f04eaa9 in _int_malloc () from /lib64/libc.so.6
>> > #3 0x00007f557f04fb1e in malloc () from /lib64/libc.so.6
>> > #4 0x0000000000932134 in AllocSetAlloc ()
>> > #5 0x00000000009376cf in MemoryContextAllocExtended ()
>> > #6 0x00000000006ad915 in ExecInitMemoize ()
>>
>> Well the PGDG repo have the debuginfos (e.g. postgresql14-debuginfo) rpms / dpkgs(?) so I hope you are basically 1 command away of being able to debug it further what happens in ExecInitMemoize()
>> Those packages seem to be safe as they modify only /usr/lib/debug so should not have any impact on production workload.
>
> I just have to wait for admin action - I have no root rights for the server.

Looking at ExecInitMemoize() it's hard to see what could take such a
long time other than the build_hash_table(). Tom did mention this,
but I can't quite see how the size given to that function could be
larger than 91 in your case.

If you get the debug symbols installed, can you use gdb to

break nodeMemoize.c:268
p size

maybe there's something I'm missing following the code and maybe there
is some way that est_entries is not set to what I thought it was.

It would also be good to see the same perf report again after the
debug symbols are installed in order to resolve those unresolved
function names.

David
On Mon, May 2, 2022 at 10:02 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
On Tue, May 3, 2022 at 6:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Pavel Stehule <pavel.stehule@gmail.com> writes:
> there is really something strange (see attached file). It looks like this issue
> is much more related to planning time than execution time
You sure there's not something taking an exclusive lock on one of these
tables every so often?
I am almost sure - I can see this issue every time I set a higher work_mem, and I don't see it in other cases.
What are the values of work_mem and hash_mem_multiplier for the two cases?
David J.
On Fri, May 6, 2022 at 1:28, David G. Johnston <david.g.johnston@gmail.com> wrote:
On Mon, May 2, 2022 at 10:02 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
On Tue, May 3, 2022 at 6:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Pavel Stehule <pavel.stehule@gmail.com> writes:
> there is really something strange (see attached file). It looks like this issue
> is much more related to planning time than execution time
You sure there's not something taking an exclusive lock on one of these
tables every so often?
I am almost sure - I can see this issue every time I set a higher work_mem, and I don't see it in other cases.
What are the values of work_mem and hash_mem_multiplier for the two cases?
(2022-05-06 07:35:21) prd_aukro=# show work_mem ;
┌──────────┐
│ work_mem │
├──────────┤
│ 400MB │
└──────────┘
(1 řádka)
Čas: 0,331 ms
(2022-05-06 07:35:32) prd_aukro=# show hash_mem_multiplier ;
┌─────────────────────┐
│ hash_mem_multiplier │
├─────────────────────┤
│ 1 │
└─────────────────────┘
(1 řádka)
David J.
On Fri, May 6, 2022 at 1:19, David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 5 May 2022 at 19:26, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> On Thu, May 5, 2022 at 8:51, Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
>> > Breakpoint 1, 0x00007f557f0c16c0 in mmap64 () from /lib64/libc.so.6
>> > (gdb) bt
>> > #0 0x00007f557f0c16c0 in mmap64 () from /lib64/libc.so.6
>> > #1 0x00007f557f04dd91 in sysmalloc () from /lib64/libc.so.6
>> > #2 0x00007f557f04eaa9 in _int_malloc () from /lib64/libc.so.6
>> > #3 0x00007f557f04fb1e in malloc () from /lib64/libc.so.6
>> > #4 0x0000000000932134 in AllocSetAlloc ()
>> > #5 0x00000000009376cf in MemoryContextAllocExtended ()
>> > #6 0x00000000006ad915 in ExecInitMemoize ()
>>
>> Well the PGDG repo have the debuginfos (e.g. postgresql14-debuginfo) rpms / dpkgs(?) so I hope you are basically 1 command away of being able to debug it further what happens in ExecInitMemoize()
>> Those packages seem to be safe as they modify only /usr/lib/debug so should not have any impact on production workload.
>
> I just have to wait for admin action - I have no root rights for the server.
Looking at ExecInitMemoize() it's hard to see what could take such a
long time other than the build_hash_table(). Tom did mention this,
but I can't quite see how the size given to that function could be
larger than 91 in your case.
If you get the debug symbols installed, can you use gdb to
break nodeMemoize.c:268
p size
maybe there's something I'm missing following the code and maybe there
is some way that est_entries is not set to what I thought it was.
It would also be good to see the same perf report again after the
debug symbols are installed in order to resolve those unresolved
function names.
Breakpoint 1, build_hash_table (size=4369066, mstate=0xfc7f08) at nodeMemoize.c:268
268 if (size == 0)
(gdb) p size
$1 = 4369066
This size corresponds to work_mem.
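A quick back-of-the-envelope check on where 4369066 could come from: with work_mem = 400MB and hash_mem_multiplier = 1, an average entry size of roughly 96 bytes reproduces the gdb value exactly. The 96 bytes per entry is an assumption inferred from these numbers, not a figure taken from the PostgreSQL source, so treat the sketch below as plain arithmetic rather than the planner's actual formula:

/*
 * Back-of-the-envelope sketch only.  The ~96 bytes per cache entry is an
 * assumption inferred from 419430400 / 4369066; the real planner derives
 * its estimate from the expected size of the cached tuples.
 */
#include <stdio.h>

int main(void)
{
    long hash_mem_bytes  = 400L * 1024 * 1024; /* work_mem = 400MB, hash_mem_multiplier = 1 */
    long est_entry_bytes = 96;                 /* assumed average size of one cached entry */

    /* memory-based cap on the number of cache entries */
    printf("%ld\n", hash_mem_bytes / est_entry_bytes);   /* prints 4369066 */
    return 0;
}

That the memory-based cap, rather than the expected number of distinct keys, decides the hash table size here fits the symptom of the problem only appearing when work_mem is raised.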
+ 99,92% 0,00% postmaster postgres [.] ServerLoop ▒
+ 99,92% 0,00% postmaster postgres [.] PostgresMain ▒
+ 99,92% 0,00% postmaster postgres [.] exec_simple_query ▒
+ 99,70% 0,00% postmaster postgres [.] PortalRun ▒
+ 99,70% 0,00% postmaster postgres [.] FillPortalStore ▒
+ 99,70% 0,02% postmaster postgres [.] PortalRunUtility ▒
+ 99,68% 0,00% postmaster pg_stat_statements.so [.] 0x00007f5579b599c6 ▒
+ 99,68% 0,00% postmaster postgres [.] standard_ProcessUtility ▒
+ 99,68% 0,00% postmaster postgres [.] ExplainQuery ◆
+ 99,63% 0,00% postmaster postgres [.] ExplainOneQuery ▒
+ 99,16% 0,00% postmaster postgres [.] ExplainOnePlan ▒
+ 99,06% 0,00% postmaster pg_stat_statements.so [.] 0x00007f5579b5ad2c ▒
+ 99,06% 0,00% postmaster postgres [.] standard_ExecutorStart ▒
+ 99,06% 0,00% postmaster postgres [.] InitPlan (inlined) ▒
+ 99,06% 0,00% postmaster postgres [.] ExecInitNode ▒
+ 99,06% 0,00% postmaster postgres [.] ExecInitNestLoop ▒
+ 99,00% 0,02% postmaster postgres [.] ExecInitMemoize ▒
+ 98,87% 26,80% postmaster libc-2.28.so [.] __memset_avx2_erms ▒
+ 98,87% 0,00% postmaster postgres [.] build_hash_table (inlined) ▒
+ 98,87% 0,00% postmaster postgres [.] memoize_create (inlined) ▒
+ 98,87% 0,00% postmaster postgres [.] memoize_allocate (inlined) ▒
+ 98,87% 0,00% postmaster postgres [.] MemoryContextAllocExtended ▒
+ 72,08% 72,08% postmaster [unknown] [k] 0xffffffffbaa010e0
0,47% 0,00% postmaster postgres [.] pg_plan_query
0,47% 0,00% postmaster pg_stat_statements.so [.] 0x00007f5579b59ba4
0,47% 0,00% postmaster postgres [.] standard_planner
0,47% 0,00% postmaster postgres [.] subquery_planner
0,47% 0,00% postmaster postgres [.] grouping_planner
0,47% 0,00% postmaster postgres [.] query_planner
David
Pavel Stehule <pavel.stehule@gmail.com> writes:
> Breakpoint 1, build_hash_table (size=4369066, mstate=0xfc7f08) at
> nodeMemoize.c:268
> 268 if (size == 0)
> (gdb) p size
> $1 = 4369066

Uh-huh ....

regards, tom lane
On Fri, 6 May 2022 at 17:52, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> Breakpoint 1, build_hash_table (size=4369066, mstate=0xfc7f08) at nodeMemoize.c:268
> 268 if (size == 0)
> (gdb) p size
> $1 = 4369066

Thanks for the report. I think I now see the problem. Looking at
[1], it seems that's a bushy plan. That's fine, but less common than a
left deep plan.

I think the problem is down to some incorrect code in
get_memoize_path() where we pass the wrong value of "calls" to
create_memoize_path(). I think instead of outer_path->parent->rows it
instead should be outer_path->rows.

If you look closely at the plan, you'll see that the outer side of the
inner-most Nested Loop is parameterized by some higher-level nested
loop.

-> Nested Loop (cost=1.14..79.20 rows=91 width=8) (actual time=0.024..0.024 rows=1 loops=66)
   -> Index Only Scan using uq_isi_itemid_itemimageid on item_share_image itemsharei2__1 (cost=0.57..3.85 rows=91 width=16) (actual time=0.010..0.010 rows=1 loops=66)
         Index Cond: (item_id = itembo0_.id)
         Heap Fetches: 21
   -> Memoize (cost=0.57..2.07 rows=1 width=8) (actual time=0.013..0.013 rows=1 loops=66)

so instead of passing 91 to create_memoize_path() as I thought. Since
I can't see any WHERE clause items filtering rows from the
itemsharei2__1 relation, then the outer_path->parent->rows should
be whatever pg_class.reltuples says.

Are you able to send the results of:

explain select item_id from item_share_image group by item_id; -- I'm
interested in the estimated number of groups in the plan's top node.

select reltuples from pg_class where oid = 'item_share_image'::regclass;

I'm expecting the estimated number of rows in the top node of the
group by plan to be about 4369066.

David

[1] https://explain.depesz.com/s/2rBw#source
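To make the effect of that "calls" mix-up concrete, here is a small self-contained model of the sizing logic described above. It is an illustration only, not PostgreSQL source code: the 96-byte entry size and the reltuples value of 98266120 are assumptions carried over from earlier in the thread, and the est_entries() helper exists only for this sketch (its name is borrowed from the executor field David mentions). It mimics the two clamps discussed here - the distinct-key estimate is clamped to the number of expected calls, and the result is capped by how many entries fit in hash_mem:

/*
 * Toy model of the Memoize sizing discussed above - an illustration with
 * assumed numbers (96 bytes/entry, reltuples ~= 98266120), not the actual
 * PostgreSQL planner code.
 */
#include <stdio.h>

static long est_entries(double ndistinct, double calls,
                        long hash_mem_bytes, long est_entry_bytes)
{
    /* the distinct estimate is clamped to the number of input rows (calls) */
    if (ndistinct > calls)
        ndistinct = calls;

    /* memory-based cap: how many entries fit into hash_mem */
    long est_cache_entries = hash_mem_bytes / est_entry_bytes;

    return (long) (ndistinct < est_cache_entries ? ndistinct : est_cache_entries);
}

int main(void)
{
    long hash_mem = 400L * 1024 * 1024;  /* work_mem = 400MB, hash_mem_multiplier = 1 */
    long entry_sz = 96;                  /* assumed per-entry size, see earlier note */

    /* calls taken from outer_path->parent->rows: the whole item_share_image table */
    printf("calls = reltuples: %ld entries\n",
           est_entries(98266120.0, 98266120.0, hash_mem, entry_sz));   /* 4369066 */

    /* calls taken from outer_path->rows: the parameterized outer path */
    printf("calls = 91:        %ld entries\n",
           est_entries(98266120.0, 91.0, hash_mem, entry_sz));         /* 91 */

    return 0;
}

With calls taken from the whole outer relation, the hash_mem cap wins and the table is built with 4369066 buckets; with calls taken from the parameterized outer path, the estimate collapses back to 91, which matches both the gdb observation and the handful of keys the plan actually needs.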
On Fri, 6 May 2022 at 20:04, David Rowley <dgrowleyml@gmail.com> wrote:
> Thanks for the report. I think I now see the problem. Looking at
> [1], it seems that's a bushy plan. That's fine, but less common than a
> left deep plan.

On second thoughts, it does not need to be a bushy plan for the outer
side of the nested loop to be parameterized by some higher-level nested
loop. There's an example of a plan like this in the regression tests.

regression=# explain (analyze, costs off, summary off)
regression-# select * from tenk1 t1 left join
regression-# (tenk1 t2 join tenk1 t3 on t2.thousand = t3.unique2)
regression-# on t1.hundred = t2.hundred and t1.ten = t3.ten
regression-# where t1.unique1 = 1;
                                                QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Nested Loop Left Join (actual time=0.258..0.487 rows=20 loops=1)
   ->  Index Scan using tenk1_unique1 on tenk1 t1 (actual time=0.049..0.049 rows=1 loops=1)
         Index Cond: (unique1 = 1)
   ->  Nested Loop (actual time=0.204..0.419 rows=20 loops=1)
         Join Filter: (t1.ten = t3.ten)
         Rows Removed by Join Filter: 80
         ->  Bitmap Heap Scan on tenk1 t2 (actual time=0.064..0.194 rows=100 loops=1)
               Recheck Cond: (t1.hundred = hundred)
               Heap Blocks: exact=86
               ->  Bitmap Index Scan on tenk1_hundred (actual time=0.036..0.036 rows=100 loops=1)
                     Index Cond: (hundred = t1.hundred)
         ->  Memoize (actual time=0.001..0.001 rows=1 loops=100)
               Cache Key: t2.thousand
               Cache Mode: logical
               Hits: 90  Misses: 10  Evictions: 0  Overflows: 0  Memory Usage: 4kB
               ->  Index Scan using tenk1_unique2 on tenk1 t3 (actual time=0.009..0.009 rows=1 loops=10)
                     Index Cond: (unique2 = t2.thousand)
(17 rows)

debugging this I see that the Memoize plan won because it was passing
10000 as the number of calls. It should have been passing 100. The
Memoize node's number of loops agrees with that.

Fixing the calls to correctly pass 100 gets rid of the Memoize node.

I've attached a patch to fix. I'll look at it in more detail after the weekend.

I'm very tempted to change the EXPLAIN output in at least master to
display the initial and final (maximum) hash table sizes. Wondering if
anyone would object to that?

David
Attachment
On Fri, May 6, 2022 at 10:05, David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 6 May 2022 at 17:52, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> Breakpoint 1, build_hash_table (size=4369066, mstate=0xfc7f08) at nodeMemoize.c:268
> 268 if (size == 0)
> (gdb) p size
> $1 = 4369066
Thanks for the report. I think I now see the problem. Looking at
[1], it seems that's a bushy plan. That's fine, but less common than a
left deep plan.
I think the problem is down to some incorrect code in
get_memoize_path() where we pass the wrong value of "calls" to
create_memoize_path(). I think instead of outer_path->parent->rows it
instead should be outer_path->rows.
If you look closely at the plan, you'll see that the outer side of the
inner-most Nested Loop is parameterized by some higher-level nested
loop.
-> Nested Loop (cost=1.14..79.20 rows=91 width=8) (actual
time=0.024..0.024 rows=1 loops=66)
-> Index Only Scan using
uq_isi_itemid_itemimageid on item_share_image itemsharei2__1
(cost=0.57..3.85 rows=91 width=16) (actual time=0.010..0.010 rows=1
loops=66)
Index Cond: (item_id = itembo0_.id)
Heap Fetches: 21
-> Memoize (cost=0.57..2.07 rows=1 width=8)
(actual time=0.013..0.013 rows=1 loops=66)
so instead of passing 91 to create_memoize_path() as I thought. Since
I can't see any WHERE clause items filtering rows from the
itemsharei2__1 relation, then the outer_path->parent->rows should
be whatever pg_class.reltuples says.
Are you able to send the results of:
explain select item_id from item_share_image group by item_id; -- I'm
interested in the estimated number of groups in the plan's top node.
select reltuples from pg_class where oid = 'item_share_image'::regclass;
I'm expecting the estimated number of rows in the top node of the
group by plan to be about 4369066.
QUERY PLAN
────────────────────────────────────────────────────────────────────────────
Finalize HashAggregate (cost=1543418.63..1554179.08 rows=1076045 width=8)
Group Key: item_id
-> Gather (cost=1000.57..1532658.18 rows=4304180 width=8)
Workers Planned: 4
-> Group (cost=0.57..1101240.18 rows=1076045 width=8)
Group Key: item_id
-> Parallel Index Only Scan using ixfk_isi_itemid on item_share_image (cost=0.57..1039823.86 rows=24566530 width=8)
(7 řádek)
Čas: 1,808 ms
(2022-05-06 12:30:26) prd_aukro=# select reltuples from pg_class where oid = 'item_share_image'::regclass;
reltuples
────────────
9.826612e+07
(1 řádka)
Čas: 0,887 ms
Regards
Pavel
David
[1] https://explain.depesz.com/s/2rBw#source
On Tue, May 03, 2022 at 02:13:18PM +1200, David Rowley wrote:
> I'm wishing I put the initial hash table size and the final hash table
> size in EXPLAIN + EXPLAIN ANALYZE now. Perhaps it's not too late for
> v15 to do that so that it might help us figure things out in the
> future.

On Fri, May 06, 2022 at 09:27:57PM +1200, David Rowley wrote:
> I'm very tempted to change the EXPLAIN output in at least master to
> display the initial and final (maximum) hash table sizes. Wondering if
> anyone would object to that?

No objection to add it to v15.

I'll point out that "Cache Mode" was added to EXPLAIN between 14.1 and 14.2
without controversy, so this could conceivably be backpatched to v14, too.

commit 6c32c0977783fae217b5eaa1d22d26c96e5b0085
Author: David Rowley <drowley@postgresql.org>
Date:   Wed Nov 24 10:07:38 2021 +1300

    Allow Memoize to operate in binary comparison mode
On Tue, 10 May 2022 at 14:22, Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Fri, May 06, 2022 at 09:27:57PM +1200, David Rowley wrote:
> > I'm very tempted to change the EXPLAIN output in at least master to
> > display the initial and final (maximum) hash table sizes. Wondering if
> > anyone would object to that?
>
> No objection to add it to v15.
>
> I'll point out that "Cache Mode" was added to EXPLAIN between 14.1 and 14.2
> without controversy, so this could conceivably be backpatched to v14, too.
>
> commit 6c32c0977783fae217b5eaa1d22d26c96e5b0085

This is seemingly a good point, but I don't really think it's a case
of just keeping the EXPLAIN output stable in minor versions, it's more
about adding new fields to structs.

I just went and wrote the patch and the fundamental difference seems
to be that what I did in 6c32c0977 managed to only add a new field in
the empty padding between two fields. That resulted in no fields in
the struct being pushed up in their address offset. The idea here is
not to break any extension that's already been compiled that
references some field that comes after that.

In the patch I've just written, I've had to add some fields which
causes sizeof(MemoizeState) to go up, resulting in the offsets of some
later fields changing.

One thing I'll say about this patch is that I found it annoying that I
had to add code to cache_lookup() when we failed to find an entry.
That's probably not the end of the world as that's only for cache
misses. Ideally, I'd just be looking at the size of the hash table at
the end of execution, however, naturally, we must show the EXPLAIN
output before we shut down the executor.

I just copied the Hash Join output. It looks like:

# alter table tt alter column a set (n_distinct=4);
ALTER TABLE
# analyze tt;
# explain (analyze, costs off, timing off) select * from tt inner join t2 on tt.a=t2.a;
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Nested Loop (actual rows=1000000 loops=1)
   ->  Seq Scan on tt (actual rows=1000000 loops=1)
   ->  Memoize (actual rows=1 loops=1000000)
         Cache Key: tt.a
         Cache Mode: logical
         Hits: 999990  Misses: 10  Evictions: 0  Overflows: 0  Memory Usage: 2kB
         Hash Buckets: 16 (originally 4)
         ->  Index Only Scan using t2_pkey on t2 (actual rows=1 loops=10)
               Index Cond: (a = tt.a)
               Heap Fetches: 0
 Planning Time: 0.483 ms
 Execution Time: 862.860 ms
(12 rows)

Does anyone have any views about the attached patch going into v15?

David
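The padding point is easy to see with a small illustration (generic structs, not the actual MemoizeState definition): a field that fits into existing alignment padding leaves sizeof() and the offsets of later members untouched, while a field that does not fit shifts every later member, which is what breaks extensions compiled against the old layout. A minimal sketch for a typical 64-bit build:

/* Illustration of the padding argument above - generic structs, not the
 * actual PostgreSQL MemoizeState definition. */
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Before
{
    int   a;        /* 4 bytes, then 4 bytes of padding before the pointer */
    char *b;
} Before;

typedef struct PaddingReuse
{
    int   a;
    bool  extra;    /* lands in the former padding: same sizeof, same offsetof(b) */
    char *b;
} PaddingReuse;

typedef struct Shifted
{
    int   a;
    long  extra;    /* does not fit in the padding: offsetof(b) moves and sizeof grows */
    char *b;
} Shifted;

int main(void)
{
    printf("Before:       sizeof=%zu offsetof(b)=%zu\n", sizeof(Before), offsetof(Before, b));
    printf("PaddingReuse: sizeof=%zu offsetof(b)=%zu\n", sizeof(PaddingReuse), offsetof(PaddingReuse, b));
    printf("Shifted:      sizeof=%zu offsetof(b)=%zu\n", sizeof(Shifted), offsetof(Shifted, b));
    return 0;
}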
Attachment
On Fri, 6 May 2022 at 21:27, David Rowley <dgrowleyml@gmail.com> wrote:
> I've attached a patch to fix. I'll look at it in more detail after the weekend.

I've now pushed this fix to master and backpatched to 14.

David
On Mon, May 16, 2022 at 6:11, David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 6 May 2022 at 21:27, David Rowley <dgrowleyml@gmail.com> wrote:
> I've attached a patch to fix. I'll look at it in more detail after the weekend.
I've now pushed this fix to master and backpatched to 14.
Thank you
Pavel
David