Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays
Date
Msg-id 390e8d18-0c90-1372-b8be-149870c9296f@enterprisedb.com
In response to Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays  (David Rowley <dgrowleyml@gmail.com>)
Responses Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays  (David Rowley <dgrowleyml@gmail.com>)
List pgsql-hackers

On 4/9/21 1:21 AM, David Rowley wrote:
> On Fri, 9 Apr 2021 at 09:32, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> I ran the same set of benchmarks on the v6, which I think should be
>> mostly identical to what was committed. I extended the test a bit to
>> test tables with 0, 1, 5 and 1000 rows, and also with int and text
>> values, to see how it works with more expensive comparators.
>>
>> I built binaries with gcc 9.2.0 and clang 11.0.0, the full results are
>> attached. There's a bit of difference between gcc and clang, but the
>> general behavior is about the same, so I'll only present gcc results to
>> keep this simple. I'll only show the throughput comparison to master, so >1.0
>> means good, <1.0 means bad. If you're interested in actual tps, see the
>> full results.
>>
>> For the v5 patch (actually v4-0001 + v5-0002) and v6, the results are:
>>
>> integer column / v5
>> ===================
>>
>>     rows      10/p    100/p     16/p     10/s    100/s     16/s
>>     -----------------------------------------------------------
>>        0       97%      97%      97%     107%     126%     108%
>>        1       95%      82%      94%     108%     132%     110%
>>        5       95%      83%      95%     108%     132%     110%
>>     1000      129%     481%     171%     131%     382%     165%
> 
> I think we should likely ignore the v5 patch now.  The reason is that
> it was pretty much unfinished and there were many places where I'd not
> yet added support for the HashedScalarArrayOpExpr node type.  This
> could cause these nodes to be skipped during node mutations or node
> walking, which would certainly make planning faster, just not in a way
> that's correct.
> 
>> integer column / v6
>> ===================
>>
>>     rows      10/p    100/p     16/p     10/s    100/s     16/s
>>     -----------------------------------------------------------
>>        0       97%      97%      97%      98%      98%      98%
>>        1       96%      84%      95%      97%      97%      98%
>>        5       97%      85%      96%      98%      97%      97%
>>     1000      129%     489%     172%     128%     330%     162%
> 
> This is a really informative set of results. I can only guess that the
> slowdown of the 100/prepared query is down to building the hash table.
> I think that's because the 0 rows test does not show the slowdown and we
> only build the table when evaluating for the first time. There's a
> slightly larger hit on 1 row vs 5 rows, which makes sense since the
> rewards of the hash lookup start paying off more with more rows.
> 

Agreed. I think that's essentially what I wrote too.
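
For anyone skimming the thread: the reason the 0-row case shows no hit is
that the table is built lazily, on the first evaluation. Roughly this
pattern - a simplified standalone sketch, not the actual executor code,
and all the names here are made up:

#include <stdbool.h>
#include <stdio.h>

typedef struct SaopHashState
{
    bool        table_built;    /* set on the first evaluation */
    int         elements[100];  /* stand-in for the array constants */
    int         nelements;
} SaopHashState;

static void
build_hash_table(SaopHashState *state)
{
    /* the expensive one-time setup would happen here */
    state->table_built = true;
}

static bool
eval_hashed_saop(SaopHashState *state, int value)
{
    if (!state->table_built)
        build_hash_table(state);    /* 0 input rows => never reached */

    /* probe; a linear scan stands in for the real hash probe */
    for (int i = 0; i < state->nelements; i++)
        if (state->elements[i] == value)
            return true;
    return false;
}

int
main(void)
{
    SaopHashState state = {false, {1, 2, 3}, 3};

    printf("%d\n", eval_hashed_saop(&state, 2));    /* prints 1 */
    return 0;
}

With 0 input rows eval_hashed_saop() is never called, so the build cost
never shows up - which matches the 0-row numbers above.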

> Looking at your tps numbers, I think I can see why we get the drop in
> performance with prepared statements but not simple statements. This
> seems to just be down to the fact that the planning time dominates in
> the simple statement case.  For example, the "1 row" test for 100/s
> for v6 is 10023.3 tps, whereas the 100/p result is 44093.8 tps. With
> master, prepared gets 52400.0 tps. So we could say the hash table
> build costs us 8306.2 tps, or 3.59 microseconds per execution, per:
> 
> postgres=# select 1000000 / 52400.0 - 1000000 / 44093.8;
>       ?column?
> ---------------------
>  -3.5949559161508538
> (1 row)
> 
> If we look at the tps for the simple query version of the same test,
> master did 10309.6 tps and v6 did 10023.3 tps. If we apply that 3.59
> microsecond slowdown to master's tps, then we get pretty close to
> within 1% of the v6 tps:
> 
> postgres=# select 1000000 / (1000000 / 10309.6 + 3.59);
>        ?column?
> -----------------------
>  9941.6451581291294165
> (1 row)
> 

Right, that makes perfect sense.

>> text column / v6
>> ================
>>
>>     rows      10/p    100/p     16/p     10/s    100/s     16/s
>>     -----------------------------------------------------------
>>        0      101%     101%     101%      98%      99%      99%
>>        1       98%      82%      96%      98%      96%      97%
>>        5      100%      84%      98%      98%      96%      98%
>>     1000      297%    1645%     408%     255%    1000%     336%
>>
>>
>> Overall, the behavior for integer and text columns is the same for both
>> patches. There are a couple of interesting observations:
>>
>> 1) For the "simple" query mode, v5 helped quite a bit (20-30% speedup),
>> but v6 does not seem to help at all - it's either the same as or slower
>> than the unpatched master.
> 
> I think that's related to the fact that I didn't finish adding
> HashedScalarArrayOpExpr processing to all places that needed it.
> 

Not sure I understand. Shouldn't that effectively default to the
unpatched behavior? How could that result in code that is *faster* than
master?


>> I wonder why that is, and whether we could get some of that speedup
>> with v6? At first I thought that maybe v5 is not building the hash
>> table in cases where v6 does, but even that shouldn't make it faster
>> than master.
> 
> I don't think v5 and v6 really do anything much differently in the
> executor. The only difference is really during ExecInitExprRec() when
> we initialize the expression. With v5 we had a case
> T_HashedScalarArrayOpExpr: to handle the new node type, but in v6 we
> have if (OidIsValid(opexpr->hashfuncid)). Oh, wait. I did add the
> missing permissions check on the hash function, so that will account
> for something. As far as I can see, that's required.
> 

I think the check is required, but I don't think it should be all that
expensive - it's likely just one of many permission checks done during
executor startup.

But I think the puzzle is not so much v5 vs. v6 as v5 vs. master. I
still don't understand how v5 managed to be faster than master, but
maybe I'm missing something.

>> 2) For the "prepared" mode, there's a clear performance hit the longer
>> the array is (for both v5 and v6). For 100 elements it's about 15%,
>> which is not great.
>>
>> I think the reason is fairly simple - building the hash table is not
>> free, and with few rows it's not worth it - it'd be faster to just
>> search the array directly. Unfortunately, the logic that makes the
>> decision to switch to hashing looks only at the array length, and
>> ignores the number of rows entirely. So I think if we want to address
>> this, convert_saop_to_hashed_saop needs to compare
>>
>>     hash_build_cost + nrows * hash_lookup_cost
>>
>> and
>>
>>     nrows * linear_lookup_cost
>>
>> to make a reasonable decision.
> 
> I thought about that but I was really worried that the performance of
> ScalarArrayOpExpr would just become too annoyingly unpredictable. You
> know fairly well that we can often get massive row underestimations in
> the planner. (I guess you worked on ext stats mainly because of that)
> The problem I want to avoid is the case where we get a big row
> underestimation but don't really get a bad plan as a result. For
> example a query like:
> 
> SELECT * FROM big_table WHERE col1 = ... AND col2 = ... AND col3 = ...
> AND col4 IN( ... big list of values ...);
> 
> If col1, col2 and col3 are highly correlated but individually fairly
> selective, then we could massively underestimate how many rows the IN
> clause will see (assuming no rearranging was done here).
> 
> I'm not completely opposed to the idea of taking the estimated rows
> into account during planning. It might just mean having to move the
> convert_saop_to_hashed_saop() call somewhere else. I imagine that's
> fairly trivial to do. I just have concerns about doing so.
> 

Hmm, yeah. I understand your concerns, but the way it works now kinda
penalizes correct estimates. Imagine a workload with simple OLTP queries
that always hit only a couple of rows - we make it run 15% slower just
because we're afraid there might be an under-estimate.

It's sensible to make the planning resilient to under-estimates, but I'm
not sure ignoring the cardinality estimate, with the justification that
it might be wrong, is a good strategy. In a way, all the other planning
decisions assume the estimate is correct, so why should this one be any
different? We don't use seqscans exclusively just because the
selectivity estimate might be wrong and make an index scan ineffective,
for example.

Maybe the right solution is to rely on the estimates, but then also
enable the hashing if we significantly cross the threshold during
execution. So for example we might get an estimate of 10 rows and
calculate that hashing would start winning at 100 rows, so we start
without hashing. But if we then see 200 rows during execution, we build
the hash table and start using it.

Yes, there's a risk that the scan ends right at those 200 rows, in which
case the time spent building the hash table is wasted. But it's much
more likely that many more rows will follow.
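
To make that a bit more concrete, the decision could look roughly like
the sketch below. It's a standalone toy, not a patch against
convert_saop_to_hashed_saop() - the function name and the cost constants
are made up, and in a real patch they'd presumably be derived from
cpu_operator_cost and friends:

#include <stdbool.h>
#include <stdio.h>

/* Toy version of the cost comparison discussed above. */
static bool
saop_should_hash(double nrows, int array_len)
{
    double      hash_build_cost = 10.0 * array_len;     /* one-time setup */
    double      hash_lookup_cost = 1.0;                 /* per input row */
    double      linear_lookup_cost = 0.5 * array_len;   /* per row, ~half the array */

    return hash_build_cost + nrows * hash_lookup_cost
        < nrows * linear_lookup_cost;
}

int
main(void)
{
    /* with a 100-element array the break-even point is around 20 rows */
    for (double nrows = 1; nrows <= 1000; nrows *= 10)
        printf("nrows = %4.0f -> %s\n", nrows,
               saop_should_hash(nrows, 100) ? "hash" : "linear");
    return 0;
}

The execution-time variant would then simply redo the same comparison
with the actual number of rows seen so far, and build the hash table
once that count crosses the threshold.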

>> I was thinking that maybe we can ignore this, because people probably
>> have much larger tables in practice. But I'm not sure that's really
>> true, because there may be other quals and it's possible the preceding
>> ones are quite selective, filtering most of the rows.
>>
>> I'm not sure how much of the necessary information we have available in
>> convert_saop_to_hashed_saop_walker, though :-( I suppose we know the
>> number of input rows for that plan node, but I'm not sure about the
>> selectivity of the other quals.
>>
>> It's also a bit strange that we get speedup for "simple" protocol, while
>> for "prepared" it gets slower. That seems counter-intuitive, because why
>> should we see opposite outcomes in those cases? I'd assume that we'll
>> see either speedup or slowdown in both cases, with the relative change
>> being more significant in the "prepared" mode.
> 
> I hope my theory above about the planner time dominating the overall
> time shows why that is.
> 

I think it does explain the v6 behavior, where prepared gets slower
while simple is about the same (with low row counts).

But I'm not sure I understand the v5 results, where the simple mode got
faster than master.

> Another way to look at these results is by taking your tps values and
> calculating how long it takes to do N transactions, then
> totalling up the time it takes. If I do that to calculate how long it
> took each test to perform 1000 transactions and sum each test grouping
> by mode and version with rollup on mode, I get:
> 
> time values are in seconds:
> 
> pg-master   28.07
>   prepared  11.28
>   simple    16.79
> 
> pg-v6       15.86
>   prepared   5.23
>   simple    10.63
> 
> So, comparing the total times, the overall result is 177%.
> 

Not sure how you got those numbers, or how that explains the results.

E.g. on v5, the results for 100 int values / 1 row look like this:

            100/p     100/s
    master  52400     10310
        v5  43446     13610

I understand why the prepared mode got slower. I don't understand how
the simple mode got faster.
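
Just to spell out the arithmetic on those tps numbers (a trivial
standalone snippet, values copied from the table above):

#include <stdio.h>

int
main(void)
{
    const char *label[] = {"master 100/p", "v5 100/p",
                           "master 100/s", "v5 100/s"};
    double      tps[] = {52400, 43446, 10310, 13610};

    /* convert transactions per second to microseconds per execution */
    for (int i = 0; i < 4; i++)
        printf("%-13s %8.0f tps = %6.1f us/xact\n",
               label[i], tps[i], 1000000.0 / tps[i]);
    return 0;
}

That's ~19.1 vs. ~23.0 us per execution for prepared, i.e. a ~4 us
regression consistent with building the hash table. But for simple it's
~97.0 vs. ~73.5 us - an improvement of over 20 us that the executor
change alone can't explain.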

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


