Thread: Bug in execution of EXISTS and IN clauses for large tables
Dear PostgreSQL team,
I would like to report an unexpected behavior of the queries using either the EXISTS or IN clauses.
I am using PostgreSQL version 11.1 (accessed via the DBVisualizer).
Description: I need to subset one large table to rows to only ids present in second table and then join columns from a third table, that has all IDs (unique) with different information. Finally, counts of different columns in different group by conditions are executed. To make the sub-setting efficient and avoid multiplication of the rows when sub-setting to the IDs present only in the second table, I have used the “where exists” or “where ID in…” clauses. However, under my conditions, the “exists” and “in” clauses were processed as inner join leading to multiplication of the rows. Placing the “exists” or “in” clauses on the second smaller table1, retrieved expected smaller number of rows. It seems that PostgreSQL selects different plan for the two clauses depending on the cost.
Minimal settings to reproduce the issue:
I have managed to reproduce the issue with this minimal setting. I have three tables with large number of rows:
table1_25mil_base table: 25 million rows, ID column (simple sequence) and column_x – random number with duplicates
table2_60mil table: 60 million rows – created from the first table, multiple occurrences of the 25 million unique IDs, and column_y (random)
table3_25mil table: 25 million rows - created from the first table, but contains only subset of the unique IDs (~ 2/3 of the unique IDs), and column_z (random).
The presence of the second column the table 3 seems to be needed to reproduce the issue.
The queries to create the test tables are at the end below.
The result of the different queries that should in principle provide the same output is below – please note the number of rows in the tables produced.
Please let me know if this is an expected behavior?
If you need any more information, let me know.
Thank you.
Kind regards,
Martin Kos
/* test 1 using “EXISTS” */
select count(*)
from (select *
from table2_60mil table2
where exists (select id
from table3_30mil table3
where table2.id = table3.id
)
)t1
join (select id, column_x
from table1_25mil_base
)t2
on t1.id = t2.id
-- 72'010'384 (expected less then 60 million rows present in the table2)
;
select count(*)
from (select *
from table2_60mil
--where id in (select id from test_20mil)
)t1
join (select id, column_x
from table1_25mil_base table1
where exists (select id
from table3_30mil table3
where table1.id = table3.id
)
)t2
on t1.id = t2.id
-- 44'273'564 (expected “correct” reduced number of rows)
;
/* test 2 – using “IN clause” */
select count(*)
from (select *
from table2_60mil
where id in (select id from table3_30mil)
)t1
join (select id, column_x
from table1_25mil_base
)t2
on t1.id = t2.id
-- 72'010'384 (expected less then 60 million rows present in the table2. The number of rows corresponds to inner join);
select count(*)
from (select *
from table2_60mil
--where id in (select id from test_20mil)
)t1
join (select id, column_x
from table1_25mil_base
where id in (select id from table3_30mil)
)t2
on t1.id = t2.id
-- 44'273'564 (expected “correct” reduced number of rows)
;
/* test compare with "inner join" */
select count(*)
from (select table2.id
from table2_60mil table2
join table3_30mil table3
on table2.id = table3.id
)t1
join (select id, column_x
from table1_25mil_base
)t2
on t1.id = t2.id
-- 72'010'384
;
/* test 3 – sub-setting to the same table -> should result in the same number of rows as in the original table! */
select count(*)
from (select *
from table2_60mil
where id in (select id from table2_60mil)
)t1
join (select id, column_x
from table1_25mil_base
)t2
on t1.id = t2.id
-- 179'989'276 – clearly multiplied rows, behaves as inner join
;
select count(*)
from (select *
from table2_60mil
--where id in (select id from test_20mil)
)t1
join (select id, column_x
from table1_25mil_base
where id in (select id from table2_60mil)
)t2
on t1.id = t2.id
-- 60'000'000 (correct number of rows)
;
/* make base table1 with 25 million distinct ids */
select generate_series(1,25000000) AS id, floor(random() * 100000)::int as column_x
into table1_25mil_base
;
/* make table2 with 60 million rows with close to all ids. Offset is used to prevent planner to reuse of the tables, perhaps not needed */
select id, floor(random() * 10000)::int as column_y
into table2_60mil
from ((select id
from table1_25mil_base
order by random()
limit 10000000)
union all
(select id
from table1_25mil_base
order by random()
limit 10000000 offset 0)
union all
(select id
from table1_25mil_base
order by random()
limit 10000000 offset 2345656)
union all
(select id
from table1_25mil_base
order by random()
limit 10000000 offset 500000)
union all
(select id
from table1_25mil_base
order by random()
limit 10000000 offset 1452652)
union all
(select id
from table1_25mil_base
order by random()
limit 10000000 offset 777)
)t
;
/* create 25 million "indication" table */
select id, floor(random() * 10000)::int as column_z
into table3_30mil
from ((select id
from table1_25mil_base
order by random()
limit 5000000 offset 7)
union all
(select id
from table1_25mil_base
order by random()
limit 5000000 offset 345786)
union all
(select id
from table1_25mil_base
order by random()
limit 5000000 offset 0)
union all
(select id
from table1_25mil_base
order by random()
limit 5000000 offset 20008)
union all
(select id
from table1_25mil_base
order by random()
limit 5000000 offset 369)
union all
(select id
from table1_25mil_base
order by random()
limit 5000000 offset 77877)
)t
;
/* tables info */
select count(id), count(distinct id)
from table1_25mil_base
-- 25000000 25000000
;
select count(id), count(distinct id)
from table2_60mil
-- 60000000 23833944
;
select count(id), count(distinct id)
from table3_30mil
-- 25000000 23833944
;
|
Attachment
On Tuesday, February 22, 2022, Martin Kos <martin.kos@molecularhealth.com> wrote:
I would like to report an unexpected behavior of the queries using either the EXISTS or IN clauses.
I am using PostgreSQL version 11.1 (accessed via the DBVisualizer).
On Tuesday, February 22, 2022, David G. Johnston <david.g.johnston@gmail.com> wrote:
On Tuesday, February 22, 2022, Martin Kos <martin.kos@molecularhealth.com> wrote:
I would like to report an unexpected behavior of the queries using either the EXISTS or IN clauses.
I am using PostgreSQL version 11.1 (accessed via the DBVisualizer).
Reporting bugs without a standalone reproducer against long unsupported versions is generally inactionable.Though at minimum providing explain analyze results, especially if you have a near equivalent query that supposedly works, is needed.
"David G. Johnston" <david.g.johnston@gmail.com> writes: > This seems like a decent report (haven’t played with it), still, version > 11.1! Indeed. I can't see any problem in 11.current --- for me, the first four queries all give 44272445, the next 72002328, and the last two 60000000. So either we fixed it since 11.1, or the problem requires some nondefault setting that wasn't mentioned. I'm not particularly interested in bisecting to see where it was fixed. regards, tom lane
I agree, there is not much sense in dissecting it if it's not reproducible in the current versions.
Just to confirm (sorry if this is a trivial question) - the "exists" and "in" should never behave like an inner join, right?
Thanks.
Best wishes,
Martin
>
|
> From: Tom Lane <tgl@sss.pgh.pa.us>
> Sent: Tuesday, 22 February 2022 16:45
> To: David G. Johnston <david.g.johnston@gmail.com>
> Cc: Martin Kos <martin.kos@molecularhealth.com>; pgsql-
> bugs@lists.postgresql.org
> Subject: Re: Bug in execution of EXISTS and IN clauses for large tables
>
> Caution - External Sender: This message is from an external source and
> may contain unsafe content. Please do not click on any links or open any
> attachments unless you are sure. If in doubt, contact the MH Service Desk for
> a further plausibility check.
>
> "David G. Johnston" <david.g.johnston@gmail.com> writes:
> > This seems like a decent report (haven’t played with it), still,
> > version 11.1!
>
> Indeed. I can't see any problem in 11.current --- for me, the first four queries
> all give 44272445, the next 72002328, and the last two 60000000. So either
> we fixed it since 11.1, or the problem requires some nondefault setting that
> wasn't mentioned. I'm not particularly interested in bisecting to see where it
> was fixed.
>
> regards, tom lane
Attachment
On Wed, 23 Feb 2022 at 09:29, Martin Kos <martin.kos@molecularhealth.com> wrote: > Just to confirm (sorry if this is a trivial question) - the "exists" and "in" should never behave like an inner join, right? If the query planner is able first determine that the IN/EXISTs can be executed as a semi-join, and it later can also find proofs that any given outer side row of the semi-join can match to *at most* 1 inner side row, then the planner will switch this to an inner join. This allows more flexibility in the join orders, which can produce more efficient plans. David