Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays - Mailing list pgsql-hackers

From David Rowley
Subject Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays
Date
Msg-id CAApHDvr_b3VGY3LEcdjJutCjFS3S7hOCTCE1ddg8x3Qf12f19g@mail.gmail.com
In response to Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On Fri, 9 Apr 2021 at 09:32, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> I ran the same set of benchmarks on the v6, which I think should be
> mostly identical to what was committed. I extended the test a bit to
> test table with 0, 1, 5 and 1000 rows, and also with int and text
> values, to see how it works with more expensive comparators.
>
> I built binaries with gcc 9.2.0 and clang 11.0.0, the full results are
> attached. There's a bit of difference between gcc and clang, but the
> general behavior is about the same, so I'll only present gcc results to
> keep this simple. I'll only show the throughput comparison to master, so >1.0
> means good, <1.0 means bad. If you're interested in actual tps, see the
> full results.
>
> For the v5 patch (actually v4-0001 + v5-0002) and v6, the results are:
>
> integer column / v5
> ===================
>
>     rows      10/p    100/p     16/p     10/s    100/s     16/s
>     -----------------------------------------------------------
>        0       97%      97%      97%     107%     126%     108%
>        1       95%      82%      94%     108%     132%     110%
>        5       95%      83%      95%     108%     132%     110%
>     1000      129%     481%     171%     131%     382%     165%

I think we should likely ignore the v5 patch now.  The reason is that
it was pretty much unfinished and there were many places where I'd not
yet added support for the HashedScalarArrayOpExpr node type.  This
could cause these nodes to be skipped during node mutations or node
walking, which would certainly make planning faster, just not in a way
that's correct.

> integer column / v6
> ===================
>
>     rows      10/p    100/p     16/p     10/s    100/s     16/s
>     -----------------------------------------------------------
>        0       97%      97%      97%      98%      98%      98%
>        1       96%      84%      95%      97%      97%      98%
>        5       97%      85%      96%      98%      97%      97%
>     1000      129%     489%     172%     128%     330%     162%

This is a really informative set of results. I can only guess that the
slowdown of the 100/prepared query is down to building the hash table.
I think that's because the 0 rows test does not show the slowdown, and
we only build the hash table the first time the expression is
evaluated. There's a slightly larger hit on 1 row vs 5 rows, which
makes sense since the rewards of the hash lookup start to pay off as
the row count grows.
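
For illustration, I believe the regressed case is roughly the following
shape (names invented here; the real benchmark of course used the
10/16/100-value lists and 0-1000 row tables shown above):

-- A prepared statement with a long IN list against a near-empty table.
-- The hash table for the IN list is only built the first time the
-- expression is evaluated, so the 0-row case never pays the build cost,
-- while the 1-row and 5-row cases pay it without getting much back.
CREATE TABLE t (a int);
INSERT INTO t VALUES (1);
PREPARE p AS SELECT * FROM t
    WHERE a IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                11, 12, 13, 14, 15, 16, 17, 18, 19, 20);
EXECUTE p;  -- 20 values here; 100 in the slowest case above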

Looking at your tps numbers, I think I can see why we get the drop in
performance with prepared statements but not simple statements. This
seems to just be down to the fact that the planning time dominates in
the simple statement case.  For example, the "1 row" test for 100/s
for v6 is 10023.3 tps, whereas the 100/p result is 44093.8 tps. With
master, prepared gets 52400.0 tps. So we could say the hash table
build costs us 8306.2 tps, or 3.59 microseconds per execution, per:

postgres=# select 1000000 / 52400.0 - 1000000 / 44093.8;
      ?column?
---------------------
 -3.5949559161508538
(1 row)

If we look at the tps for the simple query version of the same test,
master did 10309.6 tps and v6 did 10023.3 tps. If we apply that 3.59
microsecond slowdown to master's tps, then we get to within about 1% of
the v6 tps:

postgres=# select 1000000 / (1000000 / 10309.6 + 3.59);
       ?column?
-----------------------
 9941.6451581291294165
(1 row)
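
As a quick sanity check, comparing that predicted figure (rounded) with
the measured v6 tps:

postgres=# select 9941.65 / 10023.3;  -- about 0.992, i.e. within 1%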

> text column / v6
> ================
>
>     rows      10/p    100/p     16/p     10/s    100/s     16/s
>     -----------------------------------------------------------
>        0      101%     101%     101%      98%      99%      99%
>        1       98%      82%      96%      98%      96%      97%
>        5      100%      84%      98%      98%      96%      98%
>     1000      297%    1645%     408%     255%    1000%     336%
>
>
> Overall, the behavior for integer and text columns is the same, for both
> patches. There are a couple of interesting observations:
>
> 1) For the "simple" query mode, v5 helped quite a bit (20-30% speedup),
> but v6 does not seem to help at all - it's either the same as or slower than
> unpatched master.

I think that's related to the fact that I didn't finish adding
HashedScalarArrayOpExpr processing to all places that needed it.

> I wonder why is that, and if we could get some of the speedup with v6?
> At first I thought that maybe v5 is not building the hash table in cases
> where v6 does, but that shouldn't be faster than master.

I don't think v5 and v6 really do anything much differently in the
executor. The only real difference is during ExecInitExprRec(), when we
initialize the expression. With v5 we had a case
T_HashedScalarArrayOpExpr: to handle the new node type, whereas in v6
we have if (OidIsValid(opexpr->hashfuncid)). Oh, wait. In v6 I did add
the previously missing permissions check on the hash function, so that
will account for something. As far as I can see, that check is
required.

> 2) For the "prepared" mode, there's a clear performance hit the longer
> the array is (for both v5 and v6). For 100 elements it's about 15%,
> which is not great.
>
> I think the reason is fairly simple - building the hash table is not
> free, and with few rows it's not worth it - it'd be faster to just
> search the array directly. Unfortunately, the logic that makes the
> decision to switch to hashing only looks at the array length, and
> ignores the number of rows entirely. So I think if we want to address
> this, convert_saop_to_hashed_saop needs to compare
>
>     hash_build_cost + nrows * hash_lookup_cost
>
> and
>
>     nrows * linear_lookup_cost
>
> to make reasonable decision.

I thought about that, but I was really worried that the performance of
ScalarArrayOpExpr would just become too annoyingly unpredictable. You
know fairly well that we can often get massive row underestimations in
the planner (I guess you worked on extended statistics mainly because
of that). The cases I want to avoid are the ones where we get a big row
underestimation but don't actually get a bad plan as a result. For
example, a query like:

SELECT * FROM big_table WHERE col1 = ... AND col2 = ... AND col3 = ...
AND col4 IN( ... big list of values ...);

If col1, col2 and col3 are highly correlated but individually fairly
selective, then we could massively underestimate how many rows the IN
clause will see (assuming no rearranging was done here).
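
To make that concrete, here's a contrived sketch (reusing the names from
the query above; the data distribution is made up) where the planner's
independence assumption underestimates the rows reaching the IN clause
by several orders of magnitude:

-- col1, col2 and col3 always hold the same value, so each "= 1" qual
-- looks 1-in-100 selective on its own but filters nothing extra in
-- combination.  Without extended statistics the planner multiplies the
-- selectivities and expects about 1 row to reach the IN clause, when
-- around 10000 rows actually do.
CREATE TABLE big_table AS
    SELECT i % 100 AS col1, i % 100 AS col2, i % 100 AS col3,
           i % 1000 AS col4
    FROM generate_series(1, 1000000) AS i;
ANALYZE big_table;

EXPLAIN SELECT * FROM big_table
 WHERE col1 = 1 AND col2 = 1 AND col3 = 1
   AND col4 IN (1, 101, 201, 301, 401, 501, 601, 701, 801, 901);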

I'm not completely opposed to the idea of taking the estimated rows
into account during planning. It might just mean having to move the
convert_saop_to_hashed_saop() call somewhere else. I imagine that's
fairly trivial to do. I just have concerns about doing so.

> I was thinking that maybe we can ignore this, because people probably
> have much larger tables in practice. But I'm not sure that's really
> true, because there may be other quals and it's possible the preceding
> ones are quite selective, filtering most of the rows.
>
> I'm not sure how much of the necessary information we have available in
> convert_saop_to_hashed_saop_walker, though :-( I suppose we know the
> number of input rows for that plan node, but I'm not sure about the
> selectivity of the other quals.
>
> It's also a bit strange that we get speedup for "simple" protocol, while
> for "prepared" it gets slower. That seems counter-intuitive, because why
> should we see opposite outcomes in those cases? I'd assume that we'll
> see either speedup or slowdown in both cases, with the relative change
> being more significant in the "prepared" mode.

I hope my theory above, about planning time dominating the overall time
in the simple query case, shows why that is.

Another way to look at these results is to take your tps values,
calculate how long it takes to perform some number of transactions, and
then total up that time. If I do that to work out how long each test
took to perform 1000 transactions, then sum the per-test times grouping
by mode and version with a rollup on mode, I get:

time values are in seconds:

pg-master    28.07
  prepared   11.28
  simple     16.79

pg-v6        15.86
  prepared    5.23
  simple     10.63

So, comparing the total times, the overall result is 177%.
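
In psql terms, that's just the ratio of the two totals:

postgres=# select 28.07 / 15.86;  -- master total / v6 total, about 1.77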

David


