Thread: BUG #16504: Wrapping query in EXISTS() causes sequential scans of tables

BUG #16504: Wrapping query in EXISTS() causes sequential scans of tables

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      16504
Logged by:          Drk Shadow
Email address:      postgresql.20.drkshadow@spamgourmet.com
PostgreSQL version: 12.0
Operating system:   Gentoo
Description:

I'm trying to migrate from Postgresql 9.6, but running into performance
issues.

In this case, I have a query that selects from a table, joins a table, joins
that same table, joins the first table, performs a NOT
firsttable.value=lasttable.value to make sure that I'm not joining an item
with itself. There are indexes for the conditions in each join. When I run
this as a select query, the EXPLAIN output uses index scans the whole way
through. When I SELECT EXISTS(thatquery) FROM firsttable WHERE firsttable.id
IN (1,2,3), the EXPLAIN output switches to using sequential scans. This
destroys performance on this 3.8m row table.


The real case, with a view and multiple queries joined together, trying to
determine if one item (or its preview) has a tag anywhere:
https://www.db-fiddle.com/f/efSF4GbPEbAAUQWRnRCkEp/5

The example case, where I compare an EXISTS() to a flat select:
https://www.db-fiddle.com/f/f96LRpJQF3VFGZo9EP8nMm/0

=======
The SQL for the example, in case db-fiddle goes away

CREATE TABLE boxes (
  itemid INTEGER PRIMARY KEY,
  size INTEGER DEFAULT 0,
  md5 CHARACTER(22),
  prevoid oid DEFAULT 0,
  prevlocal BOOLEAN DEFAULT FALSE
 );
CREATE INDEX prevoid ON boxes (prevoid) WHERE prevoid > 0;
CREATE INDEX boxhash ON boxes(size, SUBSTR(md5, 1, 3));

-- your preview is stored in another location..
CREATE TABLE prevs (
  id INTEGER PRIMARY KEY,
  size INTEGER NOT NULL,
  md5 CHARACTER(22) NOT NULL
  -- location CHARACTER...
 );
CREATE INDEX prevhash ON prevs(size, SUBSTR(md5, 1, 3));

-- ratings
CREATE TABLE box_tags (
  size INTEGER NOT NULL,
  md5 CHARACTER(22) NOT NULL,
  value INTEGER NOT NULL
  );
CREATE INDEX taghashes ON box_tags(size, SUBSTR(md5, 1, 3));

CREATE VIEW other_location AS
SELECT bx_a.itemid, bx_a.prevlocal AS local, bx_other_a.itemid AS bitemid,
bt_a.value
from boxes bx_a
JOIN prevs pv_a ON pv_a.id=(bx_a.prevoid)::INTEGER AND bx_a.prevlocal
JOIN prevs pv1_a ON (pv1_a.size,SUBSTR(pv1_a.md5, 1,
3),pv1_a.md5)=(pv_a.size, SUBSTR(pv_a.md5, 1, 3),pv_a.md5)
JOIN boxes bx_other_a ON bx_other_a.prevoid=(pv1_a.id)::oid AND
bx_other_a.prevlocal
LEFT JOIN box_tags bt_a ON (bt_a.size,SUBSTR(bt_a.md5, 1, 3),
bt_a.md5)=(pv1_a.size,SUBSTR(pv1_a.md5, 1, 3), pv1_a.md5)
WHERE bx_other_a.prevoid > 0 AND NOT bx_a.itemid = bx_other_a.itemid
;

-- ----
-- The examples


EXPLAIN 
SELECT EXISTS(select 1 FROM other_location WHERE itemid=boxes.itemid)
FROM boxes
WHERE boxes.itemid IN (1,2,3);

----
Not copy-pasting the EXPLAIN output, the first query has two subplans. The
first subplan uses indexes, and the second subplan uses  -> Seq Scan on
boxes bx_a_1 (cost=0.00..22.10 rows=605 width=8).

The second query has no second subplan, and there is no sequential table
scan.

It appears there is a bug: EXISTS() subqueries are not handled as part of
the query, or perhaps they're done in parallel (at extreme cost), or perhaps
something else. In this case, there's no indication of a parallel branch,
and `SET max_parallel_workers_per_gather = 0;` does not change the plan.
I've noticed this problem as well with JOIN (SELECT...) subqueries (but I
didn't make an example), so I believe this isn't about EXISTS(), but rather
about _all_ subqueries now being a performance barrier.

As such, I can't migrate anything with subqueries to Postgres 12. Is this a
bug?


EXPLAIN 
SELECT other_location.bitemid
FROM boxes
JOIN other_location ON other_location.itemid=boxes.itemid
WHERE boxes.itemid IN (1,2,3);


PG Bug reporting form <noreply@postgresql.org> writes:
> In this case, I have a query that selects from a table, joins a table, joins
> that same table, joins the first table, performs a NOT
> firsttable.value=lasttable.value to make sure that I'm not joining an item
> with itself. There are indexes for the conditions in each join. When I run
> this as a select query, the EXPLAIN output uses index scans the whole way
> through. When I SELECT EXISTS(thatquery) FROM firsttable WHERE firsttable.id
> IN (1,2,3), the EXPLAIN output switches to using sequential scans. This
> destroys performance on this 3.8m row table.

Are you claiming that 9.6 did better?  I'm not aware that we changed
anything significant about how a sub-select in a query's select list
works.

The table/view definitions alone are not enough to investigate this,
since it's unlikely that we'd get the same plans on an empty table
as a populated one.  Please see if you can make a self-contained test
case with some dummy data that reproduces the problem (ie better
plan on 9.6 than later).

            regards, tom lane