Thread: neqjoinsel versus "refresh materialized view concurrently"

neqjoinsel versus "refresh materialized view concurrently"

From
Jeff Janes
Date:
The following commit has caused a devastating performance regression
in concurrent refresh of MV:

commit 7ca25b7de6aefa5537e0dbe56541bc41c0464f97
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Wed Nov 29 22:00:29 2017 -0500

    Fix neqjoinsel's behavior for semi/anti join cases.


The below reproduction goes from taking about 1 second to refresh, to taking an amount of time I don't have the patience to measure.  

drop table foobar2 cascade;
create table foobar2 as select * from generate_series(1,200000);
create materialized view foobar3 as select * from foobar2;
create unique index on foobar3 (generate_series );
analyze foobar3;
refresh materialized view CONCURRENTLY foobar3 ;


When I interrupt the refresh, I get a message including this line:

CONTEXT:  SQL statement "SELECT newdata FROM pg_temp_3.pg_temp_16420 newdata WHERE newdata IS NOT NULL AND EXISTS (SELECT * FROM pg_temp_3.pg_temp_16420 newdata2 WHERE newdata2 IS NOT NULL AND newdata2 OPERATOR(pg_catalog.*=) newdata AND newdata2.ctid OPERATOR(pg_catalog.<>) newdata.ctid) LIMIT 1"

So it makes sense that the commit in question could have caused a change in the execution plan.  Because these are temp tables, I can't easily get my hands on them to investigate further.

Cheers,

Jeff

Re: neqjoinsel versus "refresh materialized view concurrently"

From
Thomas Munro
Date:
On Wed, Mar 14, 2018 at 8:07 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> The following commit has caused a devastating performance regression
> in concurrent refresh of MV:
>
> commit 7ca25b7de6aefa5537e0dbe56541bc41c0464f97
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   Wed Nov 29 22:00:29 2017 -0500
>
>     Fix neqjoinsel's behavior for semi/anti join cases.
>
>
> The below reproduction goes from taking about 1 second to refresh, to taking
> an amount of time I don't have the patience to measure.
>
> drop table foobar2 cascade;
> create table foobar2 as select * from generate_series(1,200000);
> create materialized view foobar3 as select * from foobar2;
> create unique index on foobar3 (generate_series );
> analyze foobar3;
> refresh materialized view CONCURRENTLY foobar3 ;
>
>
> When I interrupt the refresh, I get a message including this line:
>
> CONTEXT:  SQL statement "SELECT newdata FROM pg_temp_3.pg_temp_16420 newdata
> WHERE newdata IS NOT NULL AND EXISTS (SELECT * FROM pg_temp_3.pg_temp_16420
> newdata2 WHERE newdata2 IS NOT NULL AND newdata2 OPERATOR(pg_catalog.*=)
> newdata AND newdata2.ctid OPERATOR(pg_catalog.<>) newdata.ctid) LIMIT 1"
>
> So it makes sense that the commit in question could have caused a change in
> the execution plan.  Because these are temp tables, I can't easily get my
> hands on them to investigate further.

Ouch.  A quadratic join.  This looks like an invisible correlation problem.

load 'auto_explain';
set auto_explain.log_min_duration = 0;
set auto_explain.log_analyze = true;

drop table if exists t cascade;

create table t as select generate_series(1, 5000);
create materialized view mv as select * from t;
create unique index on mv(generate_series);
analyze mv;
refresh materialized view concurrently mv;

HEAD:

LOG:  duration: 26101.612 ms  plan:
Query Text: SELECT newdata FROM pg_temp_3.pg_temp_16452 newdata WHERE
newdata IS NOT NULL AND EXISTS (SELECT * FROM pg_temp_3.pg_temp_16452
newdata2 WHERE newdata2 IS NOT NULL AND newdata2
OPERATOR(pg_catalog.*=) newdata AND newdata2.ctid
OPERATOR(pg_catalog.<>) newdata.ctid) LIMIT 1
Limit  (cost=0.00..90.52 rows=1 width=28) (actual
time=26101.608..26101.608 rows=0 loops=1)
 ->  Nested Loop Semi Join  (cost=0.00..225220.96 rows=2488 width=28)
(actual time=26101.606..26101.606 rows=0 loops=1)
       Join Filter: ((newdata2.ctid <> newdata.ctid) AND (newdata.* *=
newdata2.*))
       Rows Removed by Join Filter: 25000000
       ->  Seq Scan on pg_temp_16452 newdata  (cost=0.00..73.00
rows=4975 width=34) (actual time=0.022..15.448 rows=5000 loops=1)
             Filter: (newdata.* IS NOT NULL)
       ->  Materialize  (cost=0.00..97.88 rows=4975 width=34) (actual
time=0.000..0.500 rows=5000 loops=5000)
             ->  Seq Scan on pg_temp_16452 newdata2  (cost=0.00..73.00
rows=4975 width=34) (actual time=0.010..4.033 rows=5000 loops=1)
                   Filter: (newdata2.* IS NOT NULL)

And with commit 7ca25b7de6aefa5537e0dbe56541bc41c0464f97 reverted:

LOG:  duration: 36.358 ms  plan:
Query Text: SELECT newdata FROM pg_temp_3.pg_temp_16470 newdata WHERE
newdata IS NOT NULL AND EXISTS (SELECT * FROM pg_temp_3.pg_temp_16470
newdata2 WHERE newdata2 IS NOT NULL AND newdata2
OPERATOR(pg_catalog.*=) newdata AND newdata2.ctid
OPERATOR(pg_catalog.<>) newdata.ctid) LIMIT 1
Limit  (cost=756.95..939.50 rows=1 width=28) (actual
time=36.354..36.354 rows=0 loops=1)
 ->  Merge Semi Join  (cost=756.95..2947.51 rows=12 width=28) (actual
time=36.352..36.352 rows=0 loops=1)
       Merge Cond: (newdata.* *= newdata2.*)
       Join Filter: (newdata2.ctid <> newdata.ctid)
       Rows Removed by Join Filter: 5000
       ->  Sort  (cost=378.48..390.91 rows=4975 width=34) (actual
time=9.622..10.300 rows=5000 loops=1)
             Sort Key: newdata.* USING *<
             Sort Method: quicksort  Memory: 622kB
             ->  Seq Scan on pg_temp_16470 newdata  (cost=0.00..73.00
rows=4975 width=34) (actual time=0.021..4.986 rows=5000 loops=1)
                   Filter: (newdata.* IS NOT NULL)
       ->  Sort  (cost=378.48..390.91 rows=4975 width=34) (actual
time=7.378..8.010 rows=5000 loops=1)
             Sort Key: newdata2.* USING *<
             Sort Method: quicksort  Memory: 622kB
             ->  Seq Scan on pg_temp_16470 newdata2  (cost=0.00..73.00
rows=4975 width=34) (actual time=0.017..3.034 rows=5000 loops=1)
                   Filter: (newdata2.* IS NOT NULL)

-- 
Thomas Munro
http://www.enterprisedb.com


Re: neqjoinsel versus "refresh materialized view concurrently"

From
Thomas Munro
Date:
On Wed, Mar 14, 2018 at 11:34 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> LOG:  duration: 26101.612 ms  plan:
> Query Text: SELECT newdata FROM pg_temp_3.pg_temp_16452 newdata WHERE
> newdata IS NOT NULL AND EXISTS (SELECT * FROM pg_temp_3.pg_temp_16452
> newdata2 WHERE newdata2 IS NOT NULL AND newdata2
> OPERATOR(pg_catalog.*=) newdata AND newdata2.ctid
> OPERATOR(pg_catalog.<>) newdata.ctid) LIMIT 1
> Limit  (cost=0.00..90.52 rows=1 width=28) (actual
> time=26101.608..26101.608 rows=0 loops=1)
>  ->  Nested Loop Semi Join  (cost=0.00..225220.96 rows=2488 width=28)
> (actual time=26101.606..26101.606 rows=0 loops=1)
>        Join Filter: ((newdata2.ctid <> newdata.ctid) AND (newdata.* *=
> newdata2.*))
>        Rows Removed by Join Filter: 25000000
>        ->  Seq Scan on pg_temp_16452 newdata  (cost=0.00..73.00
> rows=4975 width=34) (actual time=0.022..15.448 rows=5000 loops=1)
>              Filter: (newdata.* IS NOT NULL)
>        ->  Materialize  (cost=0.00..97.88 rows=4975 width=34) (actual
> time=0.000..0.500 rows=5000 loops=5000)
>              ->  Seq Scan on pg_temp_16452 newdata2  (cost=0.00..73.00
> rows=4975 width=34) (actual time=0.010..4.033 rows=5000 loops=1)
>                    Filter: (newdata2.* IS NOT NULL)

This plan is chosen because we're looking for just one row (LIMIT 1)
that has equal data but a different ctid.  In this case we're not
going to find one, so we'll pay the full enormous cost of the nested
loop, but the startup cost is estimated as 0 and we think we are going
to find a row straight away.  That's because we don't know that it's
unlikely for there to be a row with the same columns but a different
ctid.
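To see why the startup cost dominates here, the planner's treatment of LIMIT can be mimicked with a tiny cost model (a sketch of the general idea, not PostgreSQL's actual costing code): fetching k rows of an estimated n is charged roughly the startup cost plus k/n of the run cost. Plugging in the estimates from the two plans shown earlier reproduces the numbers EXPLAIN printed:

```python
# Sketch of how LIMIT interpolates plan cost (illustrative model only,
# not PostgreSQL source).  Fetching k of an estimated est_rows rows is
# charged: startup + (k / est_rows) * (total - startup).

def limited_cost(startup, total, est_rows, k):
    frac = min(k / est_rows, 1.0)
    return startup + frac * (total - startup)

# Estimates taken from the EXPLAIN output above (nested loop from HEAD,
# merge join from the run with the commit reverted):
nestloop = limited_cost(0.0, 225220.96, 2488, 1)   # fast start, huge total
mergejoin = limited_cost(756.95, 2947.51, 12, 1)   # slow start, small total

# nestloop comes out near 90.52 (cf. "cost=0.00..90.52") and mergejoin
# near 939.50 (cf. "cost=756.95..939.50"), so with LIMIT 1 the nested
# loop looks far cheaper -- even though its total cost is ~75x higher.
assert nestloop < mergejoin
```

Without the LIMIT, the comparison would be between the total costs (225220.96 vs. 2947.51) and the merge join would win easily.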

There is a fundamental and complicated estimation problem lurking here
of course and I'm not sure what to think about that yet.  Maybe there
is a very simple fix for this particular problem:

--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -660,7 +660,7 @@ refresh_by_match_merge(Oid matviewOid, Oid
tempOid, Oid relowner,
                                         "(SELECT * FROM %s newdata2
WHERE newdata2 IS NOT NULL "
                                         "AND newdata2
OPERATOR(pg_catalog.*=) newdata "
                                         "AND newdata2.ctid
OPERATOR(pg_catalog.<>) "
-                                        "newdata.ctid) LIMIT 1",
+                                        "newdata.ctid)",
                                         tempname, tempname);
        if (SPI_execute(querybuf.data, false, 1) != SPI_OK_SELECT)
                elog(ERROR, "SPI_exec failed: %s", querybuf.data);

That gets me back to the sort-merge plan, but maybe it's too superficial.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: neqjoinsel versus "refresh materialized view concurrently"

From
Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> This looks like an invisible correlation problem.

Yeah --- the planner has no idea that the join rows satisfying
newdata.* *= newdata2.* are likely to be exactly the ones not
satisfying newdata.ctid <> newdata2.ctid.  It's very accidental
that we got a good plan before.

I've not looked to see where this query is generated, but I wonder
if we could make things better by dropping the LIMIT 1 and instead
using an executor count parameter to stop early.

            regards, tom lane


Re: neqjoinsel versus "refresh materialized view concurrently"

From
Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> There is a fundamental and complicated estimation problem lurking here
> of course and I'm not sure what to think about that yet.  Maybe there
> is a very simple fix for this particular problem:

Ah, I see you thought of the same hack I did.

I think this may actually be a good fix, and here's the reason: this plan
is in fact being driven entirely off planner default estimates, because
we don't have any estimation code that knows what to do with
"wholerowvar *= wholerowvar".  I'm suspicious that we could drop the
preceding ANALYZE as being a waste of cycles, except maybe it's finding
out the number of rows for us.  In any case, LIMIT 1 is only a good idea
to the extent that the planner knows what it's doing, and this is an
example where it demonstrably doesn't and won't any time soon.

            regards, tom lane


Re: neqjoinsel versus "refresh materialized view concurrently"

From
Thomas Munro
Date:
On Wed, Mar 14, 2018 at 12:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> There is a fundamental and complicated estimation problem lurking here
>> of course and I'm not sure what to think about that yet.  Maybe there
>> is a very simple fix for this particular problem:
>
> Ah, I see you thought of the same hack I did.
>
> I think this may actually be a good fix, and here's the reason: this plan
> is in fact being driven entirely off planner default estimates, because
> we don't have any estimation code that knows what to do with
> "wholerowvar *= wholerowvar".  I'm suspicious that we could drop the
> preceding ANALYZE as being a waste of cycles, except maybe it's finding
> out the number of rows for us.  In any case, LIMIT 1 is only a good idea
> to the extent that the planner knows what it's doing, and this is an
> example where it demonstrably doesn't and won't any time soon.

Hmm.  I wonder if the ANALYZE might have been needed to avoid the
nested loop plan at some point in history.

Here's a patch to remove LIMIT 1, which fixes the plan for Jeff's test
scenario and some smaller and larger examples I tried.  The query is
already executed with SPI_execute(..., 1) so it'll give up after one
row anyway.  The regression test includes a case that causes a row to
be produced here and that's passing ('ERROR:  new data for
materialized view "mvtest_mv" contains duplicate rows without any null
columns').

-- 
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: neqjoinsel versus "refresh materialized view concurrently"

From
Jeff Janes
Date:
On Tue, Mar 13, 2018 at 4:57 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Wed, Mar 14, 2018 at 12:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> There is a fundamental and complicated estimation problem lurking here
>> of course and I'm not sure what to think about that yet.  Maybe there
>> is a very simple fix for this particular problem:
>
> Ah, I see you thought of the same hack I did.
>
> I think this may actually be a good fix, and here's the reason: this plan
> is in fact being driven entirely off planner default estimates, because
> we don't have any estimation code that knows what to do with
> "wholerowvar *= wholerowvar".  I'm suspicious that we could drop the
> preceding ANALYZE as being a waste of cycles, except maybe it's finding
> out the number of rows for us.  In any case, LIMIT 1 is only a good idea
> to the extent that the planner knows what it's doing, and this is an
> example where it demonstrably doesn't and won't any time soon.

Hmm.  I wonder if the ANALYZE might have been needed to avoid the
nested loop plan at some point in history.

Here's a patch to remove LIMIT 1, which fixes the plan for Jeff's test
scenario and some smaller and larger examples I tried.  The query is
already executed with SPI_execute(..., 1) so it'll give up after one
row anyway.  The regression test includes a case that causes a row to
be produced here and that's passing ('ERROR:  new data for
materialized view "mvtest_mv" contains duplicate rows without any null
columns').

Is there any good way to make the regression tests fail if the plan reverts to the bad one?  The only thing I can think of would be to make the table bigger so the regression test becomes "noticeably slower", but that is pretty vague and not user friendly: the test would formally pass, and we'd just have to hope it is slow enough for someone to investigate.
 
Cheers,

Jeff

Re: neqjoinsel versus "refresh materialized view concurrently"

From
Thomas Munro
Date:
On Wed, Mar 14, 2018 at 2:56 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Mar 13, 2018 at 4:57 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Here's a patch to remove LIMIT 1, which fixes the plan for Jeff's test
>> scenario and some smaller and larger examples I tried.  The query is
>> already executed with SPI_execute(..., 1) so it'll give up after one
>> row anyway.  The regression test includes a case that causes a row to
>> be produced here and that's passing ('ERROR:  new data for
>> materialized view "mvtest_mv" contains duplicate rows without any null
>> columns').
>
> Is there any good way to make the regression tests fail if the plan reverts
> to the bad one?  The only thing I can think of would be to make the table
> bigger so the regression test becomes "noticeably slower", but that is
> pretty vague and not user friendly: the test would formally pass, and we'd
> just have to hope it is slow enough for someone to investigate.

I can't think of a good way.  I guess it can still pick a nested loop
if it thinks there'll only be a couple of loops.  This patch tells it
to pay attention to the total cost, not the startup cost, so as soon
as it thinks there is more than a handful of rows the quadratic cost
will exceed the sort/merge's n log n cost.
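That crossover can be sketched numerically (a back-of-the-envelope comparison count, not real planner costing): the nested loop examines every pair of rows, while sort plus merge does roughly n log n work.

```python
import math

# Back-of-the-envelope comparison counts (not actual planner costs):
# a nested loop self-join examines n * n row pairs, while two sorts
# plus one merge pass do roughly 2 * n * log2(n) + n comparisons.

def nestloop_comparisons(n):
    return n * n

def sort_merge_comparisons(n):
    return 2 * n * math.log2(n) + n

# At n = 5000 the nested loop examines 25,000,000 pairs -- exactly the
# "Rows Removed by Join Filter: 25000000" in the slow plan above --
# versus only ~128,000 comparisons for the sort/merge approach.
assert nestloop_comparisons(5000) == 25_000_000
assert sort_merge_comparisons(5000) < 200_000
```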

Since I've had hash joins on the mind recently I couldn't help
noticing that you can't get a hash join out of this query's "record
image" based join qual (or even a regular row-based =).  Were these
operators rigged up specifically to allow merge joins for this very
matview code?  Just for fun, I tried a quick and dirty hack to get
past that by naming the columns explicitly in the query instead.
Maybe that's unfair for a single-column table or doesn't have the right
semantics, I'm not sure...  But here are the rough times from my
laptop with a 5 million row version of your test:

patched just to remove LIMIT 1: 85s, of which 61s in the dup-check query (sort)
hacked to name columns: 38s, of which 14s in the dup-check query (hash)
hacked to name columns, work_mem=1GB: 31s, of which 4s in the
dup-check query (hash)

The reason I thought about that is because Parallel Hash is really
good at scaling big stupid self-joins (though for that we'd need
UNLOGGED instead of TEMP tables, which I didn't have time to try out
today).

-- 
Thomas Munro
http://www.enterprisedb.com


Re: neqjoinsel versus "refresh materialized view concurrently"

From
Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Wed, Mar 14, 2018 at 2:56 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> Is there any good way to make the regression tests fail if the plan reverts
>> to the bad one?  The only thing I can think of would be to make the table
>> bigger so the regression test becomes "noticeably slower", but that is
>> pretty vague and not user friendly: the test would formally pass, and we'd
>> just have to hope it is slow enough for someone to investigate.

> I can't think of a good way.  I guess it can still pick a nested loop
> if it thinks there'll only be a couple of loops.  This patch tells it
> to pay attention to the total cost, not the startup cost, so as soon
> as it thinks there is more than a hand full of rows the quadratic cost
> will exceed the sort/merge's logarithmic cost.

Right.  After further thought, the key point here is that in non-error
cases the query will produce no rows, meaning that it must be executed
to completion before we can be sure of that.  But applying a LIMIT
encourages the planner to pick a fast-start (slow-completion) plan,
which is not going to be what we want.  If in fact the query were going
to produce a lot of rows, and the planner could accurately predict that,
then maybe a LIMIT would be useful --- but there's no hope of estimates
on wholerowvar *= wholerowvar being accurate any time soon, let alone
correctly handling the correlation with ctid <> ctid.  So the LIMIT
is just an invitation to trouble and we may as well remove it.

Committed that way.  I also changed EXISTS(SELECT * ...) to
EXISTS(SELECT 1 ...), in hopes of saving a few microseconds of
parsing effort.

> Since I've had hash joins on the mind recently I couldn't help
> noticing that you can't get a hash join out of this query's "record
> image" based join qual (or even a regular row-based =).

Yeah, because there's no hash support for recordimage.  So there's even
less reason to be worried about how smart the planner is for this query:
basically, it might do a nestloop for a very small number of rows, but
otherwise it's gonna have to go for a merge join.

My earlier thought that we might be able to skip the ANALYZE step
seems wrong, though.  It's true that it does little for this query,
but the follow-on query to build a diff table can probably make
good use of the stats.

            regards, tom lane