Re: POC, WIP: OR-clause support for indexes - Mailing list pgsql-hackers

From Alena Rybakina
Subject Re: POC, WIP: OR-clause support for indexes
Date
Msg-id 8b040a42-863a-6660-6bdf-c023c40e8927@yandex.ru
Whole thread Raw
In response to Re: POC, WIP: OR-clause support for indexes  (Ranier Vilela <ranier.vf@gmail.com>)
List pgsql-hackers

Sorry for the possible duplicate. I have a suspicion that the previous email was not sent.

Hi!

On 29.06.2023 04:36, Ranier Vilela wrote:
Em qua., 28 de jun. de 2023 às 18:45, Tomas Vondra <tomas.vondra@enterprisedb.com> escreveu:
On 6/27/23 20:55, Ranier Vilela wrote:
> Hi,
>
>>I finished writing the code patch for transformation "Or" expressions to
>>"Any" expressions. I didn't see any problems in regression tests, even
>>when I changed the constant at which the minimum or expression is
>>replaced by any at 0. I ran my patch on sqlancer and so far the code has
>>never fallen.
> Thanks for working on this.
>
> I took the liberty of making some modifications to the patch.
> I didn't compile or test it.
> Please feel free to use them.
>

I don't want to be rude, but this doesn't seem very helpful.
Sorry, It was not my intention to cause interruptions.


- You made some changes, but you don't even attempt to explain what you
changed or why you changed it.
1. Reduce scope
2. Eliminate unnecessary variables
3. Eliminate unnecessary expressions
 

- You haven't even tried to compile the code, nor tested it. If it
happens to compile, wow could others even know it actually behaves the
way you wanted?
Attached v2 with make check pass all tests.
Ubuntu 64 bits
gcc 64 bits


- You responded in a way that breaks the original thread, so it's not
clear which message you're responding to.
It was a pretty busy day.

Sorry for the noise, I hope I was of some help.

regards,
Ranier Vilela

P.S.
0001-Replace-clause-X-N1-OR-X-N2-.-with-X-ANY-N1-N2-on.patch fails with 4 tests.

Sorry I didn't answer right away. I will try not to do this in the future thank you for your participation and help.

Yes, the scope of this patch may be small, but I am sure that it will solve the worst case of memory consumption with large numbers of "or" expressions or reduce execution and planning time. As I have already said, I conducted a launch on a database with 20 billion data generated using a benchmark. Unfortunately, at that time I sent a not quite correct picture: the execution time, not the planning time, increases with the number of "or" expressions (https://www.dropbox.com/s/u7gt81blbv2adpi/execution_time.png?dl=0). x is the number of or expressions, y is the execution/scheduling time.

I also throw memory consumption at 50,000 "or" expressions collected by HeapTrack (where memory consumption was recorded already at the initialization stage of the 1.27GB https://www.dropbox.com/s/vb827ya0193dlz0/pic3.png?dl=0). I think such a transformation will allow just the same to avoid such a worst case, since in comparison with ANY memory is much less and takes little time.

SELECT FORMAT('prepare x %s AS SELECT * FROM pgbench_accounts a WHERE %s', '(' || string_agg('int', ',') || ')', string_agg(FORMAT('aid = $%s', g.id), ' or ') ) AS cmd FROM generate_series(1, 50000) AS g(id)
\gexec

SELECT FORMAT('execute x %s;', '(' || string_agg(g.id::text, ',') || ')') AS cmd FROM generate_series(1, 50000) AS g(id)
\gexec

I tried to add a transformation at the path formation stage before we form indexes (set_plain_rel_pathlist function) and at the stage when we have preprocessing of "or" expressions (getting rid of duplicates or useless conditions), but everywhere there was a problem of incorrect selectivity estimation.

CREATE TABLE tenk1 (unique1 int, unique2 int, ten int, hundred int);
insert into tenk1 SELECT x,x,x,x FROM generate_series(1,50000) as x;
CREATE INDEX a_idx1 ON tenk1(unique1);
CREATE INDEX a_idx2 ON tenk1(unique2);
CREATE INDEX a_hundred ON tenk1(hundred);

postgres=# explain analyze
select * from tenk1 a join tenk1 b on  ((a.unique2 = 3 or a.unique2 = 7)) or (a.unique1 = 1);                                                      QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------- Nested Loop  (cost=0.00..140632434.50 rows=11250150000 width=32) (actual time=0.077..373.279 rows=1350000 loops=1)   ->  Seq Scan on tenk1 b  (cost=0.00..2311.00 rows=150000 width=16) (actual time=0.037..13.941 rows=150000 loops=1)   ->  Materialize  (cost=0.00..3436.01 rows=75001 width=16) (actual time=0.000..0.001 rows=9 loops=150000)         ->  Seq Scan on tenk1 a  (cost=0.00..3061.00 rows=75001 width=16) (actual time=0.027..59.174 rows=9 loops=1)               Filter: ((unique2 = ANY (ARRAY[3, 7])) OR (unique1 = 1))               Rows Removed by Filter: 149991 Planning Time: 0.438 ms Execution Time: 407.144 ms
(8 rows)

Only by converting the expression at this stage, we do not encounter this problem.

postgres=# set enable_bitmapscan ='off';
SET
postgres=# explain analyze
select * from tenk1 a join tenk1 b on  a.unique2 = 3 or a.unique2 = 7 or a.unique1 = 1;                                                      QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------- Nested Loop  (cost=0.00..22247.02 rows=1350000 width=32) (actual time=0.094..373.627 rows=1350000 loops=1)   ->  Seq Scan on tenk1 b  (cost=0.00..2311.00 rows=150000 width=16) (actual time=0.051..14.667 rows=150000 loops=1)   ->  Materialize  (cost=0.00..3061.05 rows=9 width=16) (actual time=0.000..0.001 rows=9 loops=150000)         ->  Seq Scan on tenk1 a  (cost=0.00..3061.00 rows=9 width=16) (actual time=0.026..42.389 rows=9 loops=1)               Filter: ((unique2 = ANY ('{3,7}'::integer[])) OR (unique1 = 1))               Rows Removed by Filter: 149991 Planning Time: 0.414 ms Execution Time: 409.154 ms
(8 rows)

I compiled my original patch and there were no problems with regression tests. The only time there was a problem when I set the const_transform_or_limit variable to 0 (I have 15), as you have in the patch. To be honest, diff appears there because you had a different plan, specifically the expressions "or" are replaced by ANY (see regression.diffs).
Unfortunately, your patch version did not apply immediately, I did not understand the reasons, I applied it manually.
At the moment, I'm not sure that the constant is the right number for applying transformations, so I'm in search of it, to be honest. I will post my observations on this issue later. If you don't mind, I'll leave the constant equal to 15 for now.

Sorry, I don't understand well enough what is meant by points "Eliminate unnecessary variables" and "Eliminate unnecessary expressions". Can you explain in more detail?


Regarding the patch, there was a Warning at the compilation stage. 

In file included from ../../../src/include/nodes/bitmapset.h:21,

                 from ../../../src/include/nodes/parsenodes.h:26,

                 from ../../../src/include/catalog/objectaddress.h:17,

                 from ../../../src/include/catalog/pg_aggregate.h:24,

                 from parse_expr.c:18:

parse_expr.c: In function ‘transformBoolExprOr’:

../../../src/include/nodes/nodes.h:133:66: warning: ‘expr’ is used uninitialized [-Wuninitialized]

  133 | #define nodeTag(nodeptr)                (((const Node*)(nodeptr))->type)

      |                                                                  ^~

parse_expr.c:116:29: note: ‘expr’ was declared here

  116 |         BoolExpr           *expr;

      |                             ^~~~

I couldn't figure out how to fix it and went back to my original version. To be honest, I don't think anything needs to be changed here.

Unfortunately, I didn't understand the reasons why, with the available or expressions, you don't even try to convert to ANY by calling transformBoolExpr, as I saw. I went back to my version.

I think it's worth checking whether the or_statement variable is positive.

I think it's worth leaving the use of the or_statement variable in its original form.

  switch (expr->boolop)
  {
    case AND_EXPR:
      opname = "AND";
      break;
    case OR_EXPR:
      opname = "OR";
      or_statement = true;
      break;
    case NOT_EXPR:
      opname = "NOT";
      break;
    default:
      elog(ERROR, "unrecognized boolop: %d", (int) expr->boolop);
      opname = NULL;    /* keep compiler quiet */
      break;
  }

  if (!or_statement || list_length(expr->args) < const_transform_or_limit)
    return transformBoolExpr(pstate, (BoolExpr *)expr_orig);
The current version of the patch also works and all tests pass.

-- 
Regards,
Alena Rybakina
Postgres Professional
Attachment

pgsql-hackers by date:

Previous
From: Alena Rybakina
Date:
Subject: Re: POC, WIP: OR-clause support for indexes
Next
From: Richard Guo
Date:
Subject: Re: Assert !bms_overlap(joinrel->relids, required_outer)