Thread: parallel sequential scan returns extraneous rows

parallel sequential scan returns extraneous rows

From
Michael Day
Date:
I have found a nasty bug when using parallel sequential scans with an
exists clause on postgresql 9.6.1.  I have found that the rows returned
using parallel sequential scan plans are incorrect (though I haven't
dug sufficiently to know in what ways).  See below for an example of
the issue.

denver=# set max_parallel_workers_per_gather = 0;
SET
denver=# select count(*)
denver-# from users u
denver-# join address a on (a.users_id = u.id)
denver-# where exists (select 1 from address where users_id = u.id);
  count
---------
 9486910
(1 row)

                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2117655.96..2117655.97 rows=1 width=8)
   ->  Merge Join  (cost=1257.34..2094242.48 rows=9365393 width=0)
         Merge Cond: (u.id = a.users_id)
         ->  Merge Semi Join  (cost=495.43..1607025.52 rows=2824687 width=8)
               Merge Cond: (u.id = address.users_id)
               ->  Index Only Scan using users_pk on users u  (cost=0.43..1018930.31 rows=11648927 width=4)
               ->  Index Only Scan using address_idx01 on address  (cost=0.43..456495.80 rows=9365393 width=4)
         ->  Index Only Scan using address_idx01 on address a  (cost=0.43..456495.80 rows=9365393 width=4)
(8 rows)



denver=# set max_parallel_workers_per_gather = 1;
SET
denver=# select count(*)
denver-# from users u
denver-# join address a on (a.users_id = u.id)
denver-# where exists (select 1 from address where users_id = u.id);
  count
----------
 29288954
(1 row)

                                               QUERY PLAN
-------------------------------------------------------------------------------------------------
 Aggregate  (cost=1889898.47..1889898.48 rows=1 width=8)
   ->  Hash Join  (cost=1401575.70..1866484.99 rows=9365393 width=0)
         Hash Cond: (a.users_id = u.id)
         ->  Seq Scan on address a  (cost=0.00..299463.93 rows=9365393 width=4)
         ->  Hash  (cost=1355233.12..1355233.12 rows=2824687 width=8)
               ->  Gather  (cost=571820.86..1355233.12 rows=2824687 width=8)
                     Workers Planned: 1
                     ->  Hash Join  (cost=570820.86..1071764.42 rows=2824687 width=8)
                           Hash Cond: (address.users_id = u.id)
                           ->  Parallel Seq Scan on address  (cost=0.00..260900.55 rows=5509055 width=4)
                           ->  Hash  (cost=379705.27..379705.27 rows=11648927 width=4)
                                 ->  Seq Scan on users u  (cost=0.00..379705.27 rows=11648927 width=4)
(12 rows)

Re: parallel sequential scan returns extraneous rows

From
Tom Lane
Date:
Michael Day <blake@rcmail.com> writes:
> I have found a nasty bug when using parallel sequential scans with an
> exists clause on postgresql 9.6.1.  I have found that the rows returned
> using parallel sequential scan plans are incorrect (though I haven't
> dug sufficiently to know in what ways).  See below for an example of
> the issue.

Hm, looks like a planner error: it seems to be forgetting that the join
to "address" should be a semijoin.  "address" should either be on the
inside of a "Semi" join (as in your first, correct-looking plan) or be
passed through a unique-ification stage such as a HashAgg.  Clearly,
neither thing is happening in the second plan.
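The two correct strategies described there can be sketched outside the
planner.  This is an illustrative Python sketch of the semantics only
(invented toy data, not PostgreSQL code):

```python
# "users JOIN address WHERE EXISTS (... address ...)" must not be
# multiplied by duplicate matches on the EXISTS side of the query.

users = [1, 2, 3]
address = [(10, 1), (11, 1), (12, 2)]  # (id, users_id); user 1 has two addresses

# Strategy 1: semi join -- a user qualifies on the first inner match,
# and is emitted at most once regardless of how many matches exist.
semi = [u for u in users if any(users_id == u for _, users_id in address)]

# Strategy 2: unique-ify the inner rel first (like a HashAgg on
# users_id), then do a plain join against the deduplicated set.
unique_inner = {users_id for _, users_id in address}
uniq_join = [u for u in users if u in unique_inner]

# Both strategies agree: users 1 and 2 qualify, each exactly once.
assert semi == uniq_join == [1, 2]
```

The buggy parallel plan effectively does a plain join against the
non-deduplicated inner rel, which is what multiplies the rows.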

I couldn't reproduce this in a bit of trying, however.  Can you come
up with a self-contained test case?

            regards, tom lane

Re: parallel sequential scan returns extraneous rows

From
Michael Day
Date:
I was able to reproduce with this set of data.

create table users (id integer);
create table address (id integer, users_id integer);

insert into users select s from generate_series(1,1000000) s;
insert into address select s, s/2 from generate_series(1,2000000) s;

analyze users;
analyze address;

set max_parallel_workers_per_gather = 0;

select count(*)
from users u
join address a on (a.users_id = u.id)
where exists (select 1 from address where users_id = u.id);

set max_parallel_workers_per_gather = 1;

select count(*)
from users u
join address a on (a.users_id = u.id)
where exists (select 1 from address where users_id = u.id);
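As a sanity check on what this query should return, here is a
scaled-down replica of the same data shape run through SQLite from
Python (an illustrative substitute engine; SQLite has no parallel
plans, so it gives the non-parallel answer):

```python
import sqlite3

# Scaled-down repro data: users 1..1000, and address rows (s, s/2)
# for s in 1..2000, using integer division as in the original SQL.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table users (id integer)")
cur.execute("create table address (id integer, users_id integer)")
cur.executemany("insert into users values (?)", [(i,) for i in range(1, 1001)])
cur.executemany("insert into address values (?, ?)",
                [(s, s // 2) for s in range(1, 2001)])

# The EXISTS clause is redundant with the join condition here, so the
# correct count is just the number of matching (u, a) pairs: users_id
# values 1..999 each appear twice, users_id 1000 once: 999*2 + 1 = 1999.
(count,) = cur.execute("""
    select count(*)
    from users u
    join address a on (a.users_id = u.id)
    where exists (select 1 from address where users_id = u.id)
""").fetchone()
print(count)  # 1999
```

A plan that fails to treat the EXISTS as a semijoin would roughly
double this count, matching the inflated result seen above.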


On 11/29/16, 11:19 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:


Re: parallel sequential scan returns extraneous rows

From
Tom Lane
Date:
Michael Day <blake@rcmail.com> writes:
> I was able to reproduce with this set of data.

Ah, thanks for the test case!  Looks like the handling of
JOIN_UNIQUE_INNER cases for parallel plans is completely confused ---
it forgot about needing to unique-ify the inner rel, and I'm rather
surprised it didn't trip over some Asserts while at it.

Will fix, thanks.

            regards, tom lane