Re: POC, WIP: OR-clause support for indexes - Mailing list pgsql-hackers

From Alena Rybakina
Subject Re: POC, WIP: OR-clause support for indexes
Date
Msg-id 919bfbcb-f812-758d-d687-71f89f0d9a68@postgrespro.ru
Whole thread Raw
In response to Re: POC, WIP: OR-clause support for indexes  (Andrey Lepikhov <a.lepikhov@postgrespro.ru>)
Responses Re: POC, WIP: OR-clause support for indexes
List pgsql-hackers

I agree with your idea and try to implement it and will soon attach a patch with a solution.

I also have a really practical example confirming that such optimization can be useful.

A query was written that consisted of 50000 conditions due to the fact that the ORM framework couldn't work with a query having an ANY operator. In summary, we got a better plan that contained 50000 Bitmap Index Scan nodes with 50000 different conditions. Since approximately 27336 Bite of memory were required to initialize one BitmapOr Index Scan node, therefore, about 1.27 GB of memory was spent at the initialization step of the plan execution and query execution time was about 55756,053 ms (00:55,756).

psql -U postgres -c "CREATE DATABASE test_db" pgbench -U postgres -d test_db -i -s 10

SELECT FORMAT('prepare x %s AS SELECT * FROM pgbench_accounts a      WHERE %s', '(' || string_agg('int', ',') || ')',      string_agg(FORMAT('aid = $%s', g.id), ' or ') ) AS cmd FROM      generate_series(1, 50000) AS g(id) \gexec

SELECT FORMAT('execute x %s;', '(' || string_agg(g.id::text, ',') || ')')      AS cmd FROM generate_series(1, 50000) AS g(id) \gexec 

I got the plan of this query:

                                                                    QUERY PLAN                                                                 
    
---------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on pgbench_accounts a  (cost=44.35..83.96 rows=10 width=97)
   Recheck Cond: ((aid = 1) OR (aid = 2) OR (aid = 3) OR (aid = 4) OR (aid = 5) OR (aid = 6) OR (aid = 7) OR (aid = 8) OR (aid = 9) OR (aid = 10))
   ->  BitmapOr  (cost=44.35..44.35 rows=10 width=0)
         ->  Bitmap Index Scan on pgbench_accounts_pkey  (cost=0.00..4.43 rows=1 width=0)
               Index Cond: (aid = 1)
         ->  Bitmap Index Scan on pgbench_accounts_pkey  (cost=0.00..4.43 rows=1 width=0)
               Index Cond: (aid = 2)
         ->  Bitmap Index Scan on pgbench_accounts_pkey  (cost=0.00..4.43 rows=1 width=0)
               Index Cond: (aid = 3)
         ->  Bitmap Index Scan on pgbench_accounts_pkey  (cost=0.00..4.43 rows=1 width=0)
               Index Cond: (aid = 4)
         ->  Bitmap Index Scan on pgbench_accounts_pkey  (cost=0.00..4.43 rows=1 width=0)
               Index Cond: (aid = 5)
         ->  Bitmap Index Scan on pgbench_accounts_pkey  (cost=0.00..4.43 rows=1 width=0)
               Index Cond: (aid = 6)
         ->  Bitmap Index Scan on pgbench_accounts_pkey  (cost=0.00..4.43 rows=1 width=0)
               Index Cond: (aid = 7)
         ->  Bitmap Index Scan on pgbench_accounts_pkey  (cost=0.00..4.43 rows=1 width=0)
               Index Cond: (aid = 8)
         ->  Bitmap Index Scan on pgbench_accounts_pkey  (cost=0.00..4.43 rows=1 width=0)
               Index Cond: (aid = 9)
         ->  Bitmap Index Scan on pgbench_accounts_pkey  (cost=0.00..4.43 rows=1 width=0)
               Index Cond: (aid = 10)

If I rewrite this query using ANY operator,

SELECT FORMAT('prepare x %s AS SELECT * FROM pgbench_accounts a WHERE aid = ANY(SELECT g.id FROM generate_series(1, 50000) AS g(id))', '(' || string_agg('int', ',') || ')' ) AS cmd FROM generate_series(1, 50000) AS g(id)
\gexec

I will get a plan where the array comparison operator is used through ANY operator at the index scan stage. It's execution time is significantly lower as  339,764 ms.

                                            QUERY PLAN                                             
--------------------------------------------------------------------------------------------------- Index Scan using pgbench_accounts_pkey on pgbench_accounts a  (cost=0.42..48.43 rows=10 width=97)   Index Cond: (aid = ANY ('{1,2,3,4,5,6,7,8,9,10}'::integer[]))
(2 rows)

IN operator is also converted to ANY operator, and if I rewrite this query as:
SELECT FORMAT('prepare x %s AS SELECT * FROM pgbench_accounts a WHERE aid IN(%s)', '(' || string_agg('int', ',') || ')', string_agg(FORMAT('%s', g.id), ', ') ) AS cmd FROM generate_series(1, 50000) AS g(id)
\gexec

I will get the same plan as the previous one using ANY operator and his execution time will be about the same.
                                            QUERY PLAN                                             
--------------------------------------------------------------------------------------------------- Index Scan using pgbench_accounts_pkey on pgbench_accounts a  (cost=0.42..48.43 rows=10 width=97)   Index Cond: (aid = ANY ('{1,2,3,4,5,6,7,8,9,10}'::integer[]))
(2 rows)

On 28.12.2022 07:19, Andrey Lepikhov wrote:
On 12/26/15 23:04, Teodor Sigaev wrote:
I'd like to present OR-clause support for indexes. Although OR-clauses could be supported by bitmapOR index scan it isn't very effective and such scan lost any order existing in index. We (with Alexander Korotkov) presented results on Vienna's conference this year. In short, it provides performance improvement:

EXPLAIN ANALYZE
SELECT count(*) FROM tst WHERE id = 5 OR id = 500 OR id = 5000;
...
The problems on the way which I see for now:
1 Calculating cost. Right now it's just a simple transformation of costs computed for BitmapOr path. I'd like to hope that's possible and so index's estimation function could be non-touched. So, they could believe that all clauses are implicitly-ANDed
2 I'd like to add such support to btree but it seems that it should be a separated patch. Btree search algorithm doesn't use any kind of stack of pages and algorithm to walk over btree doesn't clear for me for now.
3 I could miss some places which still assumes  implicitly-ANDed list of clauses although regression tests passes fine.
I support such a cunning approach. But this specific case, you demonstrated above, could be optimized independently at an earlier stage. If to convert:

(F(A) = ConstStableExpr_1) OR (F(A) = ConstStableExpr_2)
to
F(A) IN (ConstStableExpr_1, ConstStableExpr_2)

it can be seen significant execution speedup. For example, using the demo.sql to estimate maximum positive effect we see about 40% of execution and 100% of planning speedup.

To avoid unnecessary overhead, induced by the optimization, such transformation may be made at the stage of planning (we have cardinality estimations and have pruned partitions) but before creation of a relation scan paths. So, we can avoid planning overhead and non-optimal BitmapOr in the case of many OR's possibly aggravated by many indexes on the relation.
For example, such operation can be executed in create_index_paths() before passing rel->indexlist.

-- 
Alena Rybakina
Postgres Professional

pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: Extracting cross-version-upgrade knowledge from buildfarm client
Next
From: Marcos Pegoraro
Date:
Subject: Re: POC, WIP: OR-clause support for indexes