Home > mailing lists

Thread: Bug in execution of EXISTS and IN clauses for large tables

Bug in execution of EXISTS and IN clauses for large tables

From

Martin Kos

Date:

22 February 2022, 13:17:06

Dear PostgreSQL team,

I would like to report an unexpected behavior of the queries using either the EXISTS or IN clauses.

I am using PostgreSQL version 11.1 (accessed via the DBVisualizer).

Description: I need to subset one large table to rows to only ids present in second table and then join columns from a third table, that has all IDs (unique) with different information. Finally, counts of different columns in different group by conditions are executed. To make the sub-setting efficient and avoid multiplication of the rows when sub-setting to the IDs present only in the second table, I have used the “where exists” or “where ID in…” clauses. However, under my conditions, the “exists” and “in” clauses were processed as inner join leading to multiplication of the rows. Placing the “exists” or “in” clauses on the second smaller table1, retrieved expected smaller number of rows. It seems that PostgreSQL selects different plan for the two clauses depending on the cost.

Minimal settings to reproduce the issue:

I have managed to reproduce the issue with this minimal setting. I have three tables with large number of rows:

table1_25mil_base table: 25 million rows, ID column (simple sequence) and column_x – random number with duplicates
table2_60mil table: 60 million rows – created from the first table, multiple occurrences of the 25 million unique IDs, and column_y (random)

table3_25mil table: 25 million rows - created from the first table, but contains only subset of the unique IDs (~ 2/3 of the unique IDs), and column_z (random).

The presence of the second column the table 3 seems to be needed to reproduce the issue.

The queries to create the test tables are at the end below.

The result of the different queries that should in principle provide the same output is below – please note the number of rows in the tables produced.

Please let me know if this is an expected behavior?

If you need any more information, let me know.

Thank you.

Kind regards,
Martin Kos

/* test 1 using “EXISTS” */

select count(*)

from (select *

from table2_60mil table2

where exists (select id

from table3_30mil table3

where table2.id = table3.id

)

)t1

join (select id, column_x

from table1_25mil_base

)t2

on t1.id = t2.id

-- 72'010'384 (expected less then 60 million rows present in the table2)

;

select count(*)

from (select *

from table2_60mil

--where id in (select id from test_20mil)

)t1

join (select id, column_x

from table1_25mil_base table1

where exists (select id

from table3_30mil table3

where table1.id = table3.id

)

)t2

on t1.id = t2.id

-- 44'273'564 (expected “correct” reduced number of rows)

;

/* test 2 – using “IN clause” */

select count(*)

from (select *

from table2_60mil

where id in (select id from table3_30mil)

)t1

join (select id, column_x

from table1_25mil_base

)t2

on t1.id = t2.id

-- 72'010'384 (expected less then 60 million rows present in the table2. The number of rows corresponds to inner join);

select count(*)

from (select *

from table2_60mil

--where id in (select id from test_20mil)

)t1

join (select id, column_x

from table1_25mil_base

where id in (select id from table3_30mil)

)t2

on t1.id = t2.id

-- 44'273'564 (expected “correct” reduced number of rows)

;

/* test compare with "inner join" */

select count(*)

from (select table2.id

from table2_60mil table2

join table3_30mil table3

on table2.id = table3.id

)t1

join (select id, column_x

from table1_25mil_base

)t2

on t1.id = t2.id

-- 72'010'384

;

/* test 3 – sub-setting to the same table -> should result in the same number of rows as in the original table! */

select count(*)

from (select *

from table2_60mil

where id in (select id from table2_60mil)

)t1

join (select id, column_x

from table1_25mil_base

)t2

on t1.id = t2.id

-- 179'989'276 – clearly multiplied rows, behaves as inner join

;

select count(*)

from (select *

from table2_60mil

--where id in (select id from test_20mil)

)t1

join (select id, column_x

from table1_25mil_base

where id in (select id from table2_60mil)

)t2

on t1.id = t2.id

-- 60'000'000 (correct number of rows)

;

/* make base table1 with 25 million distinct ids */

select generate_series(1,25000000) AS id, floor(random() * 100000)::int as column_x

into table1_25mil_base

;

/* make table2 with 60 million rows with close to all ids. Offset is used to prevent planner to reuse of the tables, perhaps not needed */

select id, floor(random() * 10000)::int as column_y

into table2_60mil

from ((select id

from table1_25mil_base

order by random()

limit 10000000)

union all

(select id

from table1_25mil_base

order by random()

limit 10000000 offset 0)

union all

(select id

from table1_25mil_base

order by random()

limit 10000000 offset 2345656)

union all

(select id

from table1_25mil_base

order by random()

limit 10000000 offset 500000)

union all

(select id

from table1_25mil_base

order by random()

limit 10000000 offset 1452652)

union all

(select id

from table1_25mil_base

order by random()

limit 10000000 offset 777)

;

/* create 25 million "indication" table */

select id, floor(random() * 10000)::int as column_z

into table3_30mil

from ((select id

from table1_25mil_base

order by random()

limit 5000000 offset 7)

union all

(select id

from table1_25mil_base

order by random()

limit 5000000 offset 345786)

union all

(select id

from table1_25mil_base

order by random()

limit 5000000 offset 0)

union all

(select id

from table1_25mil_base

order by random()

limit 5000000 offset 20008)

union all

(select id

from table1_25mil_base

order by random()

limit 5000000 offset 369)

union all

(select id

from table1_25mil_base

order by random()

limit 5000000 offset 77877)

;

/* tables info */

select count(id), count(distinct id)

from table1_25mil_base

-- 25000000 25000000

;

select count(id), count(distinct id)

from table2_60mil

-- 60000000 23833944

;

select count(id), count(distinct id)

from table3_30mil

-- 25000000 23833944

;

Martin Kos

Biomedical Data Scientist Analytics




E:	martin.kos@molecularhealth.com

Molecular Health GmbH

Kurfürsten‑Anlage 21

69115

Heidelberg

www.molecularhealth.com

Vorsitzender des Aufsichtsrats: Prof. Dr. Christof Hettich
Geschäftsführer: Dr. Friedrich von Bohlen und Halbach (Sprecher der Geschäftsleitung), Dr. Stephan Brock
Sitz der Gesellschaft: Heidelberg
Handelsregister: Amtsgericht Mannheim - HRB 338037

Confidentiality note: The information in this email and any attachment may contain confidential and proprietary information of Molecular Health GmbH, Molecular Health, Inc. and/or its affiliates and may be privileged or otherwise protected from disclosure. If you are not the intended recipient, you are hereby notified that any review, reliance or distribution by others or forwarding without express permission is strictly prohibited and may cause liability. In case you have received this message due to an error in transmission, we kindly ask you to notify the sender immediately and to delete this email and any attachment from your system.

Disclaimer: The information in this transmission contains privileged and confidential information, including patient information protected by federal and state privacy laws. It is intended only for the use of the person(s) named above. If you are not the intended recipient, you are hereby notified, that any review, dissemination, distribution or duplication of this communication is strictly prohibited. If you are not the intended recipient, please contact the sender by reply mail and destroy all copies of the original message.

Attachment

Re: Bug in execution of EXISTS and IN clauses for large tables

From

"David G. Johnston"

Date:

22 February 2022, 14:17:53

On Tuesday, February 22, 2022, Martin Kos <martin.kos@molecularhealth.com> wrote:

I would like to report an unexpected behavior of the queries using either the EXISTS or IN clauses.

I am using PostgreSQL version 11.1 (accessed via the DBVisualizer).

Reporting bugs without a standalone reproducer against long unsupported versions is generally inactionable.

Though at minimum providing explain analyze results, especially if you have a near equivalent query that supposedly works, is needed.

David J.

Re: Bug in execution of EXISTS and IN clauses for large tables

From

"David G. Johnston"

Date:

22 February 2022, 14:21:05

On Tuesday, February 22, 2022, David G. Johnston <david.g.johnston@gmail.com> wrote:

On Tuesday, February 22, 2022, Martin Kos <martin.kos@molecularhealth.com> wrote:

I would like to report an unexpected behavior of the queries using either the EXISTS or IN clauses.

I am using PostgreSQL version 11.1 (accessed via the DBVisualizer).

Reporting bugs without a standalone reproducer against long unsupported versions is generally inactionable.

Though at minimum providing explain analyze results, especially if you have a near equivalent query that supposedly works, is needed.

My bad, too early for me…

This seems like a decent report (haven’t played with it), still, version 11.1!

David J.

Re: Bug in execution of EXISTS and IN clauses for large tables

From

Tom Lane

Date:

22 February 2022, 15:45:08

"David G. Johnston" <david.g.johnston@gmail.com> writes:
> This seems like a decent report (haven’t played with it), still, version
> 11.1!

Indeed.  I can't see any problem in 11.current --- for me, the first
four queries all give 44272445, the next 72002328, and the last two
60000000.  So either we fixed it since 11.1, or the problem requires
some nondefault setting that wasn't mentioned.  I'm not particularly
interested in bisecting to see where it was fixed.

            regards, tom lane

RE: Bug in execution of EXISTS and IN clauses for large tables

From

Martin Kos

Date:

22 February 2022, 16:16:18

Thank you Tom,

I agree, there is not much sense in dissecting it if it's not reproducible in the current versions.

Just to confirm (sorry if this is a trivial question) - the "exists" and "in" should never behave like an inner join, right?
Thanks.

Best wishes,
Martin

>

Martin Kos

Biomedical Data Scientist Analytics




E:	martin.kos@molecularhealth.com

Molecular Health GmbH

Kurfürsten‑Anlage 21

69115

Heidelberg

www.molecularhealth.com

-----Original Message-----
> From: Tom Lane <tgl@sss.pgh.pa.us>
> Sent: Tuesday, 22 February 2022 16:45
> To: David G. Johnston <david.g.johnston@gmail.com>
> Cc: Martin Kos <martin.kos@molecularhealth.com>; pgsql-
> bugs@lists.postgresql.org
> Subject: Re: Bug in execution of EXISTS and IN clauses for large tables
>
> Caution - External Sender: This message is from an external source and
> may contain unsafe content. Please do not click on any links or open any
> attachments unless you are sure. If in doubt, contact the MH Service Desk for
> a further plausibility check.
>
> "David G. Johnston" <david.g.johnston@gmail.com> writes:
> > This seems like a decent report (haven’t played with it), still,
> > version 11.1!
>
> Indeed. I can't see any problem in 11.current --- for me, the first four queries
> all give 44272445, the next 72002328, and the last two 60000000. So either
> we fixed it since 11.1, or the problem requires some nondefault setting that
> wasn't mentioned. I'm not particularly interested in bisecting to see where it
> was fixed.
>
> regards, tom lane

Attachment

Re: Bug in execution of EXISTS and IN clauses for large tables

From

David Rowley

Date:

22 February 2022, 23:33:19

On Wed, 23 Feb 2022 at 09:29, Martin Kos <martin.kos@molecularhealth.com> wrote:
> Just to confirm (sorry if this is a trivial question) - the "exists" and "in" should never behave like an inner join,
right?

If the query planner is able first determine that the IN/EXISTs can be
executed as a semi-join, and it later can also find proofs that any
given outer side row of the semi-join can match to *at most* 1 inner
side row, then the planner will switch this to an inner join. This
allows more flexibility in the join orders, which can produce more
efficient plans.

David