Thread: Planner picks an O(n²) query plan when one is available
Offending O(n²) query:
```sql
SELECT id FROM indexed_table WHERE indexed_value = ANY (ARRAY[1,2,...])
```
I'm not posting this to the `pgsql-performance` mailing list because this is about fixing the issue, not working around it.
I'm not posting this to the `pgsql-bugs` mailing list because the documentation states that performance issues are not bugs.
Please let me know if this isn't the appropriate place.
From what I've been able to investigate so far, it looks like:
1. During a bitmap heap scan, if the recheck condition is activated, execution becomes O(n²): each candidate row is rechecked by scanning the array linearly.
2. A Hash index stores only hashes, not the indexed values themselves, and hash collisions are possible, so a bitmap index scan on a Hash index forces the parent bitmap heap scan to recheck every row. That recheck leads to the O(n²) of point 1.
3. When a Hash index is available, Postgres tends to prefer it over a B-tree index, ignoring the two properties above.
4. The EXPLAIN ANALYZE output shows that the planner always assumes arrays have 10 elements, instead of using the estimated size of the subquery they are built from, or the actual size of the value passed as argument (see the illustration just after this list). This may be what causes 3.
5. Even with a B-tree index, if work_mem is too small for the bitmap to stay exact (as in `Heap Blocks: exact=510 lossy=43738` in the output of `EXPLAIN (ANALYZE, BUFFERS)`), the recheck is also activated, again causing O(n²).
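To make point 4 concrete, here is a quick check against the reproducer tables defined below (cost figures elided; the same `rows=10` estimate appears in every bitmap plan of the full reproducer output):
```sql
-- Plain EXPLAIN on the array variant: the scan is costed as if the
-- array held 10 elements, even though lookup_test holds 50k values.
EXPLAIN SELECT * FROM tmp_benchmark
WHERE value = ANY (CAST((SELECT array_agg(value) FROM lookup_test) AS text[]));
-- => Bitmap Heap Scan on tmp_benchmark  (cost=... rows=10 width=20)
--      Recheck Cond: (value = ANY ($0))
--      -> Bitmap Index Scan on tmp_benchmark_benchmark_value_hash  (... rows=10 ...)
```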
It looks like this could be improved/fixed by either or both of:
1. Using a hashset (or sort + binary search) for the recheck, past a certain array length or even always, instead of always searching linearly (see the sketch after this list).
2. Fixing the planner's assumption that all arrays have size 10, using actual or estimated sizes instead. That size should also feed the recheck-condition cost estimate. (This wouldn't fix recheck performance itself, but it would at least make the planner aware that lossy heap scans and Hash-index bitmap heap scans are n times more expensive than exact B-tree bitmap heap scans, and hopefully steer it away from picking them.)
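For reference (I realize I said this is about fixing rather than working around), here is a minimal sketch of the query shapes that already avoid the linear array recheck today; the `lookup` prepared-statement name and `PREPARE`/`$1` usage are only illustrative:
```sql
-- Fast baseline from the reproducer below: when the values arrive as a
-- subselect, the planner turns the lookup into a join (HashAggregate +
-- index probes in the plans below) instead of scanning the array
-- linearly for every candidate row.
SELECT * FROM tmp_benchmark
WHERE value = ANY (SELECT value FROM lookup_test);

-- Same idea while keeping an array parameter: unnest it so it is
-- planned as a join. (Unlike "= ANY", a plain join returns duplicate
-- rows if the array contains duplicates.)
PREPARE lookup(text[]) AS
  SELECT t.*
  FROM tmp_benchmark t
  JOIN unnest($1) AS u(value) USING (value);
-- EXECUTE lookup(ARRAY['123', '456']);
```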
Is my understanding correct?
Thanks,
PG Version: PostgreSQL 16.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 13.2.0, 64-bit
Reproducer:
```sql
BEGIN;
CREATE TABLE tmp_benchmark(
id Int8 NOT NULL GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
value text NOT NULL
);
-- Insert ~1M distinct random values (DISTINCT guards against random collisions)
INSERT INTO tmp_benchmark(value) SELECT DISTINCT (floor(random() * 1000000000000)::int8)::text FROM generate_series(1,1000000);
ALTER TABLE tmp_benchmark ADD CONSTRAINT tmp_benchmark_unique_value_and_corresponding_index UNIQUE(value);
CREATE INDEX tmp_benchmark_benchmark_value_hash ON tmp_benchmark USING HASH (value);
ANALYZE tmp_benchmark;
-- Randomly sample 50k of those values
CREATE TEMPORARY TABLE lookup_test ON COMMIT DROP AS
WITH sub AS (SELECT value FROM tmp_benchmark LIMIT 10000000)
SELECT value FROM sub ORDER BY RANDOM() LIMIT 50000;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM tmp_benchmark WHERE value = ANY (SELECT value FROM lookup_test);
/* Gives:
Nested Loop (cost=864.00..2376.52 rows=43520 width=20) (actual time=6.890..51.473 rows=50000 loops=1)
Buffers: shared hit=102085, local hit=320
-> HashAggregate (cost=864.00..866.00 rows=200 width=32) (actual time=6.879..11.140 rows=50000 loops=1)
Group Key: lookup_test.value
Batches: 1 Memory Usage: 3617kB
Buffers: local hit=320
-> Seq Scan on lookup_test (cost=0.00..755.20 rows=43520 width=32) (actual time=0.003..1.702 rows=50000 loops=1)
Buffers: local hit=320
-> Index Scan using tmp_benchmark_benchmark_value_hash on tmp_benchmark (cost=0.00..7.88 rows=1 width=20) (actual time=0.001..0.001 rows=1 loops=50000)
Index Cond: (value = lookup_test.value)
Rows Removed by Index Recheck: 0
Buffers: shared hit=102085
Planning:
Buffers: shared hit=18 read=1
Planning Time: 0.196 ms
Execution Time: 52.520 ms
*/
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM tmp_benchmark WHERE value = ANY (CAST((SELECT array_agg(value) FROM lookup_test) AS text[]));
/* Gives:
Bitmap Heap Scan on tmp_benchmark (cost=904.09..943.13 rows=10 width=20) (actual time=22.804..13931.788 rows=50000 loops=1)
Recheck Cond: (value = ANY ($0))
Rows Removed by Index Recheck: 15
Heap Blocks: exact=6369
Buffers: shared hit=68885, local hit=320
InitPlan 1 (returns $0)
-> Aggregate (cost=864.00..864.01 rows=1 width=32) (actual time=4.784..4.785 rows=1 loops=1)
Buffers: local hit=320
-> Seq Scan on lookup_test (cost=0.00..755.20 rows=43520 width=32) (actual time=0.007..2.040 rows=50000 loops=1)
Buffers: local hit=320
-> Bitmap Index Scan on tmp_benchmark_benchmark_value_hash (cost=0.00..40.08 rows=10 width=0) (actual time=22.166..22.166 rows=50015 loops=1)
Index Cond: (value = ANY ($0))
Buffers: shared hit=62516, local hit=320
Planning:
Buffers: shared hit=9
Planning Time: 0.112 ms
Execution Time: 13933.270 ms
*/
-- Randomly sample 100k of those values
CREATE TEMPORARY TABLE lookup_test_2 ON COMMIT DROP AS
WITH sub AS (SELECT value FROM tmp_benchmark LIMIT 10000000)
SELECT value FROM sub ORDER BY RANDOM() LIMIT 100000;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM tmp_benchmark WHERE value = ANY (SELECT value FROM lookup_test_2);
/* Gives:
Nested Loop (cost=1555.20..3025.00 rows=78336 width=20) (actual time=15.661..104.241 rows=100000 loops=1)
Buffers: shared hit=204153, local hit=576, temp read=20 written=44
-> HashAggregate (cost=1555.20..1557.20 rows=200 width=32) (actual time=15.653..29.110 rows=100000 loops=1)
Group Key: lookup_test_2.value
Batches: 5 Memory Usage: 6977kB Disk Usage: 232kB
Buffers: local hit=576, temp read=20 written=44
-> Seq Scan on lookup_test_2 (cost=0.00..1359.36 rows=78336 width=32) (actual time=0.007..3.692 rows=100000 loops=1)
Buffers: local hit=576
-> Index Scan using tmp_benchmark_benchmark_value_hash on tmp_benchmark (cost=0.00..7.88 rows=1 width=20) (actual time=0.001..0.001 rows=1 loops=100000)
Index Cond: (value = lookup_test_2.value)
Rows Removed by Index Recheck: 0
Buffers: shared hit=204153
Planning:
Buffers: shared hit=4
Planning Time: 0.146 ms
Execution Time: 106.359 ms
*/
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM tmp_benchmark WHERE value = ANY (CAST((SELECT array_agg(value) FROM lookup_test_2) AS text[]));
/* Gives:
Bitmap Heap Scan on tmp_benchmark (cost=1595.29..1634.33 rows=10 width=20) (actual time=45.983..55338.490 rows=100000 loops=1)
Recheck Cond: (value = ANY ($0))
Rows Removed by Index Recheck: 21
Heap Blocks: exact=6370
Buffers: shared hit=129063 read=2197, local hit=576
InitPlan 1 (returns $0)
-> Aggregate (cost=1555.20..1555.21 rows=1 width=32) (actual time=8.863..8.864 rows=1 loops=1)
Buffers: local hit=576
-> Seq Scan on lookup_test_2 (cost=0.00..1359.36 rows=78336 width=32) (actual time=0.007..3.696 rows=100000 loops=1)
Buffers: local hit=576
-> Bitmap Index Scan on tmp_benchmark_benchmark_value_hash (cost=0.00..40.08 rows=10 width=0) (actual time=45.436..45.436 rows=100025 loops=1)
Index Cond: (value = ANY ($0))
Buffers: shared hit=124890, local hit=576
Planning Time: 0.089 ms
Execution Time: 55342.836 ms
*/
-- Note how with 2x the number of values, execution time is ~4x the above. Monitoring CPU while this query runs shows 100% CPU, unlike with the subselect-based variant.
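-- (13933 ms for the 50k-element array vs 55343 ms for the 100k one:
--  a ~3.97x increase for 2x the values, consistent with O(n²).)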
DROP INDEX tmp_benchmark_benchmark_value_hash;
-- Once the hash index is removed, even with the same 100k sample and a bitmap heap scan plan, the query no longer behaves quadratically.
-- On the same 100k sample:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM tmp_benchmark WHERE value = ANY (CAST((SELECT array_agg(value) FROM lookup_test_2) AS text[]));
/* Gives:
Bitmap Heap Scan on tmp_benchmark (cost=1599.54..1638.58 rows=10 width=20) (actual time=234.701..256.963 rows=100000 loops=1)
Recheck Cond: (value = ANY ($0))
Heap Blocks: exact=6370
Buffers: shared hit=296148 read=10225 written=767, local hit=576
InitPlan 1 (returns $0)
-> Aggregate (cost=1555.20..1555.21 rows=1 width=32) (actual time=9.425..9.426 rows=1 loops=1)
Buffers: local hit=576
-> Seq Scan on lookup_test_2 (cost=0.00..1359.36 rows=78336 width=32) (actual time=0.007..4.001 rows=100000 loops=1)
Buffers: local hit=576
-> Bitmap Index Scan on tmp_benchmark_unique_value_and_corresponding_index (cost=0.00..44.33 rows=10 width=0) (actual time=234.264..234.264 rows=100000 loops=1)
Index Cond: (value = ANY ($0))
Buffers: shared hit=296148 read=3855 written=763, local hit=576
Planning:
Buffers: shared hit=8 read=3
Planning Time: 0.169 ms
Execution Time: 259.087 ms
*/
-- So despite doing a bitmap heap scan on the same 100k-element array as above, this one doesn't appear to be quadratic (259 ms here vs 55343 ms with the hash index).
-- With less work_mem, so that even the B-tree bitmap scan becomes lossy and has to recheck:
SET LOCAL work_mem TO '64kB';
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM tmp_benchmark WHERE value = ANY (CAST((SELECT array_agg(value) FROM lookup_test_2) AS text[]));
/* Gives:
Bitmap Heap Scan on tmp_benchmark (cost=1599.54..1638.58 rows=10 width=20) (actual time=248.154..974679.087 rows=100000 loops=1)
Recheck Cond: (value = ANY ($0))
Rows Removed by Index Recheck: 802166
Heap Blocks: exact=683 lossy=5687
Buffers: shared hit=300000 read=6370, local hit=576
InitPlan 1 (returns $0)
-> Aggregate (cost=1555.20..1555.21 rows=1 width=32) (actual time=9.274..9.276 rows=1 loops=1)
Buffers: local hit=576
-> Seq Scan on lookup_test_2 (cost=0.00..1359.36 rows=78336 width=32) (actual time=0.009..3.944 rows=100000 loops=1)
Buffers: local hit=576
-> Bitmap Index Scan on tmp_benchmark_unique_value_and_corresponding_index (cost=0.00..44.33 rows=10 width=0) (actual time=234.699..234.699 rows=100000 loops=1)
Index Cond: (value = ANY ($0))
Buffers: shared hit=300000, local hit=576
Planning Time: 0.099 ms
Execution Time: 974687.222 ms
*/
-- => It's clear that the recheck is O(n²) when it is actually performed: 974687 ms here vs 259 ms for the exact scan above, with 802166 rows removed by the recheck.
ROLLBACK;
```
On Thu, 21 Nov 2024 at 13:03, Toto guyoyg <thomas.bessou@hotmail.fr> wrote:
>
> Offending O(n²) query:

I disagree with the O(n^2) claims. The number of live matched rows in a single table's bitmap scan may be anywhere from 0 (leading to O(1) complexity in the rescan) to 970_662_608_670 (= 226 tuples per page * (2^32 - 1) pages), so such a rescan's complexity - given O(n) complexity for always choosing to use a linear scan on the array to find matches - would better be parameterized as O(n*m), for m live tuples in the generated bitmap, or even O(n*m + k), as the bitmap may contain k dead tuples which can also take some measurable amount of effort to scan.

> 4. The EXPLAIN ANALYZE output shows that the planner always assumes arrays have 10 elements, instead of using the estimated size of the subquery they are built from, or the actual size of the value passed as argument.

Did you recheck this on a recently released version of PostgreSQL? IIRC this was updated in PG17, and with `= ANY (ARRAY[...])` expressions you will get the expected number of rows for that array expression (assuming accurate statistics on your table).

> It looks like this could be improved/fixed by either or both of:
>
> 1. Using a hashset (or sort + binary search) for the recheck, past a certain array length or even always, instead of always searching linearly

IIUC, hashed "= ANY()" expressions were already implemented with commit 50e17ad2 (released in PG14) - look for EEOP_HASHED_SCALARARRAYOP in expression handling code.

> 2. Fixing the planner's assumption that all arrays have size 10, using actual or estimated sizes instead.

IIUC, this was also already implemented with commit 9391f715 (released in PG17).

> That size should also feed the recheck-condition cost estimate. (This wouldn't fix recheck performance itself, but it would at least make the planner aware that lossy heap scans and Hash-index bitmap heap scans are n times more expensive than exact B-tree bitmap heap scans, and hopefully steer it away from picking them.)
>
> Is my understanding correct?

I believe your understanding may be quite out of date. Not all planner or executor features and optimizations are explicitly present in the output of EXPLAIN, and the examples all indicate you may be working with an outdated view of Postgres' capabilities.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
> On Thu, 21 Nov 2024 at 13:03, Toto guyoyg <thomas.bessou@hotmail.fr> wrote:
>>
>> Offending O(n²) query:

> I disagree with the O(n^2) claims.

I think these cases actually are O(n^2). But I'm finding it hard to care. What we have here is a straightforward way to write a query versus a much-less-straightforward way --- indeed, a way that isn't even syntactically legal per the SQL standard. The straightforward way is already well optimized, and no reason has been given why the much-less-straightforward way should be considered preferable. So I'm not seeing why we should put our finite development resources into optimizing the much-less-straightforward way.

>> 4. The EXPLAIN ANALYZE output shows that the planner always assumes arrays have 10 elements, instead of using the estimated size of the subquery they are built from, or the actual size of the value passed as argument.

It's a little disingenuous to complain about bad estimates with this test methodology: the test tables are never vacuumed or analyzed. And since they're temporary, there's no hope of autovacuum curing that oversight. It's not clear that having done that would improve anything in this particular case, but it's certainly not helping.

regards, tom lane