Thread: Apply LIMIT when computation is logically irrelevant

Apply LIMIT when computation is logically irrelevant

From
Robins Tharakan
Date:
Hi,

When an SQL needs to UNION constants on either side, it should be possible to 
implicitly apply a LIMIT 1 and get good speed up. Is this an incorrect understanding,
or something already discussed but rejected for some reason?

This need came up while reviewing generated SQL, where the need was to return true when
at least one of two lists had a row. A simplified version is given below:

(SELECT 1 FROM pg_class) UNION (SELECT 1 FROM pg_class);
vs.
(select 1 FROM pg_class limit 1) UNION (SELECT 1 FROM pg_class limit 1); -- Faster




postgres=# explain analyse (select 1 from generate_series(1,10000)) UNION (select 1 from generate_series(1,10000));
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=550.00..750.00 rows=20000 width=4) (actual time=54.847..54.866 rows=1 loops=1)
   Group Key: (1)
   ->  Append  (cost=0.00..500.00 rows=20000 width=4) (actual time=0.782..40.215 rows=20000 loops=1)
         ->  Function Scan on generate_series  (cost=0.00..100.00 rows=10000 width=4) (actual time=0.780..7.542 rows=10000 loops=1)
         ->  Function Scan on generate_series generate_series_1  (cost=0.00..100.00 rows=10000 width=4) (actual time=0.929..7.706 rows=10000 loops=1)
 Planning Time: 0.055 ms
 Execution Time: 55.535 ms
(7 rows)

postgres=# explain analyse (select 1 from generate_series(1,10000) limit 1) UNION (select 1 from generate_series(1,10000) limit 1);
                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=0.07..0.08 rows=2 width=4) (actual time=1.292..1.298 rows=1 loops=1)
   ->  Sort  (cost=0.07..0.07 rows=2 width=4) (actual time=1.290..1.292 rows=2 loops=1)
         Sort Key: (1)
         Sort Method: quicksort  Memory: 25kB
         ->  Append  (cost=0.00..0.06 rows=2 width=4) (actual time=0.554..1.266 rows=2 loops=1)
               ->  Limit  (cost=0.00..0.01 rows=1 width=4) (actual time=0.552..0.554 rows=1 loops=1)
                     ->  Function Scan on generate_series  (cost=0.00..100.00 rows=10000 width=4) (actual time=0.550..0.551 rows=1 loops=1)
               ->  Limit  (cost=0.00..0.01 rows=1 width=4) (actual time=0.706..0.707 rows=1 loops=1)
                     ->  Function Scan on generate_series generate_series_1  (cost=0.00..100.00 rows=10000 width=4) (actual time=0.704..0.705 rows=1 loops=1)
 Planning Time: 0.096 ms
 Execution Time: 1.847 ms
(11 rows)


postgres=# select version();
                                                  version
-----------------------------------------------------------------------------------------------------------
 PostgreSQL 13devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.2.1 20170915 (Red Hat 7.2.1-2), 64-bit
(1 row)

-
robins

Re: Apply LIMIT when computation is logically irrelevant

From
Francisco Olarte
Date:
Robins:

On Mon, Jul 6, 2020 at 1:37 PM Robins Tharakan <tharakan@gmail.com> wrote:

> When an SQL needs to UNION constants on either side, it should be possible to
> implicitly apply a LIMIT 1 and get good speed up. Is this an incorrect understanding,
> or something already discussed but rejected for some reason?

Maybe the optimization does not hold its weight. I mean, the increased
complexity in the optimizer, bigger memory footprint, testing and
developer usage, is not worth it.

> This need came up while reviewing generated SQL, where the need was to return true when
> at least one of two lists had a row. A simplified version is given below:

I.e., I do not think this is a "need", specially in generated SQL,
seems more like a deficiency in the generator ( specially since
generators are able, and some do it, to massively overcondition the
generated code to insure the optimizer does not miss anything ), and
wrapping things in a limit 1 when just testing for row existence seems
easy to do while generating.

> (SELECT 1 FROM pg_class) UNION (SELECT 1 FROM pg_class);
> vs.
> (select 1 FROM pg_class limit 1) UNION (SELECT 1 FROM pg_class limit 1); -- Faster

As an aside, isn't it easier, more correct ( in my opinion ) and
clearer to just use exists for row-existence test? Because you have to
at least see it there is a result above, probably using exists, and
you can do...

select exists(SELECT 1 FROM pg_class) or exists(SELECT 1 FROM pg_class);

to get a direct boolean and benefit from shortcircuiting, by putting
the most likely one first, and from the internal knowledge the
optimizer may have to not fully evaluate queries, which may be greater
than deducting from the union/limit case.

Francisco Olarte.



Re: Apply LIMIT when computation is logically irrelevant

From
Simon Riggs
Date:
On Mon, 6 Jul 2020 at 12:37, Robins Tharakan <tharakan@gmail.com> wrote:
 
When an SQL needs to UNION constants on either side, it should be possible to 
implicitly apply a LIMIT 1 and get good speed up. Is this an incorrect understanding,
or something already discussed but rejected for some reason?

This need came up while reviewing generated SQL, where the need was to return true when
at least one of two lists had a row. A simplified version is given below:

(SELECT 1 FROM pg_class) UNION (SELECT 1 FROM pg_class);
vs.
(select 1 FROM pg_class limit 1) UNION (SELECT 1 FROM pg_class limit 1); -- Faster

Those two queries aren't logically equivalent, so you can't apply the LIMIT 1 as an optimization.

First query returns lots of random rows, the second query returns just one random row.

--
Simon Riggs                http://www.2ndQuadrant.com/
Mission Critical Databases

Re: Apply LIMIT when computation is logically irrelevant

From
Michael Lewis
Date:
On Mon, Jul 6, 2020 at 5:37 AM Robins Tharakan <tharakan@gmail.com> wrote:
This need came up while reviewing generated SQL, where the need was to return true when
at least one of two lists had a row.

Generated SQL... yep. That will happen. Manual SQL may be more work, but often has significant reward.

If you can change how the SQL is generated, I would expect that EXISTS would likely be more performant in your case, even if you need to do UNION ALL between the two current queries that potentially return true.

Re: Apply LIMIT when computation is logically irrelevant

From
David Rowley
Date:
On Tue, 7 Jul 2020 at 00:43, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Mon, 6 Jul 2020 at 12:37, Robins Tharakan <tharakan@gmail.com> wrote:
>
>>
>> When an SQL needs to UNION constants on either side, it should be possible to
>> implicitly apply a LIMIT 1 and get good speed up. Is this an incorrect understanding,
>> or something already discussed but rejected for some reason?
>>
>> This need came up while reviewing generated SQL, where the need was to return true when
>> at least one of two lists had a row. A simplified version is given below:
>>
>> (SELECT 1 FROM pg_class) UNION (SELECT 1 FROM pg_class);
>> vs.
>> (select 1 FROM pg_class limit 1) UNION (SELECT 1 FROM pg_class limit 1); -- Faster
>
>
> Those two queries aren't logically equivalent, so you can't apply the LIMIT 1 as an optimization.
>
> First query returns lots of random rows, the second query returns just one random row.

I think the idea here is that because the target list contains only
constants that pulling additional rows from the query after the first
one will just be a duplicate row and never add any rows after the
UNION is processed.

David



Re: Apply LIMIT when computation is logically irrelevant

From
Simon Riggs
Date:
On Mon, 6 Jul 2020 at 21:49, David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 7 Jul 2020 at 00:43, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Mon, 6 Jul 2020 at 12:37, Robins Tharakan <tharakan@gmail.com> wrote:
>
>>
>> When an SQL needs to UNION constants on either side, it should be possible to
>> implicitly apply a LIMIT 1 and get good speed up. Is this an incorrect understanding,
>> or something already discussed but rejected for some reason?
>>
>> This need came up while reviewing generated SQL, where the need was to return true when
>> at least one of two lists had a row. A simplified version is given below:
>>
>> (SELECT 1 FROM pg_class) UNION (SELECT 1 FROM pg_class);
>> vs.
>> (select 1 FROM pg_class limit 1) UNION (SELECT 1 FROM pg_class limit 1); -- Faster
>
>
> Those two queries aren't logically equivalent, so you can't apply the LIMIT 1 as an optimization.
>
> First query returns lots of random rows, the second query returns just one random row.

I think the idea here is that because the target list contains only
constants that pulling additional rows from the query after the first
one will just be a duplicate row and never add any rows after the
UNION is processed.

OK, I see. Are you saying you think it's a worthwhile optimization to autodetect?

--
Simon Riggs                http://www.2ndQuadrant.com/
Mission Critical Databases

Re: Apply LIMIT when computation is logically irrelevant

From
David Rowley
Date:
On Tue, 7 Jul 2020 at 09:03, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Mon, 6 Jul 2020 at 21:49, David Rowley <dgrowleyml@gmail.com> wrote:
>>
>> On Tue, 7 Jul 2020 at 00:43, Simon Riggs <simon@2ndquadrant.com> wrote:
>> >
>> > On Mon, 6 Jul 2020 at 12:37, Robins Tharakan <tharakan@gmail.com> wrote:
>> >> (SELECT 1 FROM pg_class) UNION (SELECT 1 FROM pg_class);
>> >> vs.
>> >> (select 1 FROM pg_class limit 1) UNION (SELECT 1 FROM pg_class limit 1); -- Faster
>> >
>> >
>> > Those two queries aren't logically equivalent, so you can't apply the LIMIT 1 as an optimization.
>> >
>> > First query returns lots of random rows, the second query returns just one random row.
>>
>> I think the idea here is that because the target list contains only
>> constants that pulling additional rows from the query after the first
>> one will just be a duplicate row and never add any rows after the
>> UNION is processed.
>
>
> OK, I see. Are you saying you think it's a worthwhile optimization to autodetect?

I guess it's all about how much effort would be involved to detect
such cases vs how likely it is that we're going to speed up someone's
query.  I imagine it's not much effort to detect this, but also, this
is the first time I recall seeing this mentioned, so perhaps that
means not many people would be rewarded by making such a change. (It
does seem like quite a strange way to express the query.)

There is currently a patch floating around that implements UniqueKeys
which allows RelOptInfos to be tagged with the properties that they're
unique on.  With the current design of that patch, there is no way to
say "this relation *only* has duplicate rows".  Perhaps some design
tweaks there can make detecting this case cheaper in terms of CPU
effort during planning, and perhaps also in terms of how much code it
would take to make it work.  I'll try to keep this in mind when I
review that work soon.   If we were to start adding optimisations for
cases such as this, then I'd rather they were done in some
general-purpose way that just makes them "just work" rather than
adding special cases around the codebase that trigger some special
behaviour.  Likely in this case, it would still take at least some
code when planning setops.

Additionally, the setops code is badly in need of a rewrite, so the
bar is likely pretty high on adding too many smarts in there that we
need to migrate forwards after a rewrite. The setops planning code is
still not making use of the upper planner pathification work that Tom
did years ago. In many cases UNION would be far more optimal if it
were implemented as an Index Scan -> MergeAppend -> Unique.  I think
right now we only consider Append -> Sort -> Unique and Append -> Hash
Aggregate.


David