Thread: Inefficient query plan for SELECT ... EXCEPT ...

Inefficient query plan for SELECT ... EXCEPT ...

From

Dimitrios Apostolou

Date:

31 October 2023, 22:41:33

Hello list,

I'm getting an inefficient query plan for a SELECT ... EXCEPT ... query,
where the left side is a very short table (even zero-length sometimes, but
also also rarely can be as long as 200K rows), and the right side is a
table with 10M UNIQUE NOT NULL rows:

> \d test_datatags
                             Table "public.test_datatags"
      Column     |  Type   | Collation | Nullable |             Default
----------------+---------+-----------+----------+----------------------------------
  test_datatag_n | integer |           | not null | generated by default as identity
  test_datatag   | text    |           | not null |
Indexes:
     "test_datatags_pkey" PRIMARY KEY, btree (test_datatag_n)
     "test_datatags_test_datatag_key" UNIQUE CONSTRAINT, btree (test_datatag)


Follows a simplified and pathological case, that takes 1.7s even though
the left side is empty:


> BEGIN;
> CREATE TEMPORARY TABLE tmp_table(d1 TEXT) ON COMMIT DROP;
> ANALYZE VERBOSE tmp_table ;
INFO:  analyzing "pg_temp_9.tmp_table"
INFO:  "tmp_table": scanned 0 of 0 pages, containing 0 live rows and 0 dead rows; 0 rows in sample, 0 estimated total
rows
> EXPLAIN (ANALYSE,VERBOSE,BUFFERS,SETTINGS)
             SELECT DISTINCT  d1
                 FROM tmp_table
                 WHERE d1 IS NOT NULL
             EXCEPT
                 SELECT test_datatag FROM test_datatags;

                                                             QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------
  HashSetOp Except  (cost=0.00..299854.08 rows=1 width=36) (actual time=1726.470..1726.472 rows=0 loops=1)
    Output: "*SELECT* 1".d1, (0)
    Buffers: shared hit=1259 read=58800
    I/O Timings: shared/local read=77.713
    ->  Append  (cost=0.00..278054.53 rows=8719821 width=36) (actual time=3.754..1287.901 rows=8702840 loops=1)
          Buffers: shared hit=1259 read=58800
          I/O Timings: shared/local read=77.713
          ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..0.02 rows=1 width=36) (actual time=0.003..0.003 rows=0
loops=1)
                Output: "*SELECT* 1".d1, 0
                ->  HashAggregate  (cost=0.00..0.01 rows=1 width=32) (actual time=0.002..0.003 rows=0 loops=1)
                      Output: tmp_table.d1
                      Group Key: tmp_table.d1
                      Batches: 1  Memory Usage: 24kB
                      ->  Seq Scan on pg_temp.tmp_table  (cost=0.00..0.00 rows=1 width=32) (actual time=0.001..0.001
rows=0loops=1) 
                            Output: tmp_table.d1
                            Filter: (tmp_table.d1 IS NOT NULL)
          ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..234455.40 rows=8719820 width=26) (actual time=3.751..943.850
rows=8702840loops=1) 
                Output: "*SELECT* 2".test_datatag, 1
                Buffers: shared hit=1259 read=58800
                I/O Timings: shared/local read=77.713
                ->  Seq Scan on public.test_datatags  (cost=0.00..147257.20 rows=8719820 width=22) (actual
time=3.747..515.420rows=8702840 loops=1) 
                      Output: test_datatags.test_datatag
                      Buffers: shared hit=1259 read=58800
                      I/O Timings: shared/local read=77.713
  Settings: effective_io_concurrency = '0', enable_partitionwise_aggregate = 'on', enable_partitionwise_join = 'on',
hash_mem_multiplier= '1', 
  maintenance_io_concurrency = '0', max_parallel_workers = '12', max_parallel_workers_per_gather = '0', temp_buffers =
'64MB'
  Planning:
    Buffers: shared hit=2
  Planning Time: 0.055 ms
  JIT:
    Functions: 15
    Options: Inlining false, Optimization false, Expressions true,
Deforming true
    Timing: Generation 0.317 ms, Inlining 0.000 ms, Optimization 0.179 ms,
Emission 3.542 ms, Total 4.037 ms
  Execution Time: 1726.835 ms
(33 rows)


I'm wondering why the planner doesn't see that the left table is very small and follow a different path.
From an abstract computer science POV, I would

1. sort the left table  (the right one is already indexed)
2. "merge" the two tables, by walking them in-order in parallel and excluding the matches
3. stop when the left table is exhausted, which would happen very early.

Is this worth a bug report? I can file one if the issue is not known.
Or am I misunderstanding the implications of the SELECT-EXCEPT query?

In the meantime I have replaced the query with a LEFT OUTER JOIN which
performs much better, and I believe is equivalent. I find it less readable
than the query in question though. Plus, I have a bunch of SELECT-EXCEPT
queries (with smaller right-side tables) in my application that I would
hate to change them all to the ugliest equivalent. Under what conditions
would the above query plan perform well?

Thanks in advance,
Dimitris

Re: Inefficient query plan for SELECT ... EXCEPT ...

From

"David G. Johnston"

Date:

31 October 2023, 22:56:01

On Tue, Oct 31, 2023 at 3:41 PM Dimitrios Apostolou <jimis@gmx.net> wrote:

Is this worth a bug report? I can file one if the issue is not known.
Or am I misunderstanding the implications of the SELECT-EXCEPT query?

In the meantime I have replaced the query with a LEFT OUTER JOIN which
performs much better,

I'd expect a NOT EXISTS query would also perform well and more accurately reflect what you seem to be wanting here.

I could see maybe optimizing away the second independent query if the first produces zero rows but it also doesn't seem like that productive an optimization to attempt.

David J.

Re: Inefficient query plan for SELECT ... EXCEPT ...

From

David Rowley

Date:

31 October 2023, 22:57:03

On Wed, 1 Nov 2023 at 11:41, Dimitrios Apostolou <jimis@gmx.net> wrote:
> I'm wondering why the planner doesn't see that the left table is very small and follow a different path.
> From an abstract computer science POV, I would
>
> 1. sort the left table  (the right one is already indexed)
> 2. "merge" the two tables, by walking them in-order in parallel and excluding the matches
> 3. stop when the left table is exhausted, which would happen very early.

It would be possible to have some sort of MergeExcept operator and
have the planner consider that. Unfortunately, since the upper planner
was changed a few years ago to have it consider paths the same as the
join planner does, nobody has yet come back to the union planner to
properly pathify that.  I do have a WIP patch to do this work, but I
wasn't planning on improving EXCEPT, only UNION. Making it work for
EXCEPT and INTERSECT would require a new executor operator.

> Is this worth a bug report? I can file one if the issue is not known.

No. It's just a missing optimisation. We know about it.

> In the meantime I have replaced the query with a LEFT OUTER JOIN which
> performs much better, and I believe is equivalent. I find it less readable
> than the query in question though. Plus, I have a bunch of SELECT-EXCEPT
> queries (with smaller right-side tables) in my application that I would
> hate to change them all to the ugliest equivalent. Under what conditions
> would the above query plan perform well?

It'll be best if you just use NOT EXISTS. You should be able to form
the LEFT JOINS to make use of an Anti-Join.  If you don't want to
rewrite your queries then you'll just be at the mercy of the current
planner's ability to plan EXCEPT queries, unfortunately.  There won't
be any bug fixes to improve this. It may, however, be improved in some
future version of PostgreSQL.

David

Re: Inefficient query plan for SELECT ... EXCEPT ...

From

Tom Lane

Date:

01 November 2023, 00:52:07

David Rowley <dgrowleyml@gmail.com> writes:
> It would be possible to have some sort of MergeExcept operator and
> have the planner consider that. Unfortunately, since the upper planner
> was changed a few years ago to have it consider paths the same as the
> join planner does, nobody has yet come back to the union planner to
> properly pathify that.  I do have a WIP patch to do this work, but I
> wasn't planning on improving EXCEPT, only UNION. Making it work for
> EXCEPT and INTERSECT would require a new executor operator.

Yeah.  We're moderately good about UNION ALL, but UNION/INTERSECT/EXCEPT
are an area that nobody has ever gotten around to optimizing: the two
sub-queries will be planned independently and then fed to a simplistic
set-operation node.  Maybe that'll get better someday but don't hold
your breath.  In the meantime, try to recast an EXCEPT query as an
antijoin.

            regards, tom lane

Re: Inefficient query plan for SELECT ... EXCEPT ...

From

Dimitrios Apostolou

Date:

01 November 2023, 13:32:51

Thank you all for the answers, they covered me well.


>> Is this worth a bug report? I can file one if the issue is not known.
>
> No. It's just a missing optimisation. We know about it.

It's good I shot an email first then. FWIW my usual way in other projects
would be to check the bugtracker, and just "follow" the relevant issue if
it's minor like a missing optimisation. I didn't find a way to search for
"known issues" in the Postgresql project.


Dimitris