Thread: Apply LIMIT when computation is logically irrelevant
Hi,
When an SQL needs to UNION constants on either side, it should be possible to
When an SQL needs to UNION constants on either side, it should be possible to
implicitly apply a LIMIT 1 and get good speed up. Is this an incorrect understanding,
or something already discussed but rejected for some reason?
This need came up while reviewing generated SQL, where the need was to return true when
at least one of two lists had a row. A simplified version is given below:
vs.
(select 1 FROM pg_class limit 1) UNION (SELECT 1 FROM pg_class limit 1); -- Faster
postgres=# explain analyse (select 1 from generate_series(1,10000)) UNION (select 1 from generate_series(1,10000));
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=550.00..750.00 rows=20000 width=4) (actual time=54.847..54.866 rows=1 loops=1)
Group Key: (1)
-> Append (cost=0.00..500.00 rows=20000 width=4) (actual time=0.782..40.215 rows=20000 loops=1)
-> Function Scan on generate_series (cost=0.00..100.00 rows=10000 width=4) (actual time=0.780..7.542 rows=10000 loops=1)
-> Function Scan on generate_series generate_series_1 (cost=0.00..100.00 rows=10000 width=4) (actual time=0.929..7.706 rows=10000 loops=1)
Planning Time: 0.055 ms
Execution Time: 55.535 ms
(7 rows)
postgres=# explain analyse (select 1 from generate_series(1,10000) limit 1) UNION (select 1 from generate_series(1,10000) limit 1);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.07..0.08 rows=2 width=4) (actual time=1.292..1.298 rows=1 loops=1)
-> Sort (cost=0.07..0.07 rows=2 width=4) (actual time=1.290..1.292 rows=2 loops=1)
Sort Key: (1)
Sort Method: quicksort Memory: 25kB
-> Append (cost=0.00..0.06 rows=2 width=4) (actual time=0.554..1.266 rows=2 loops=1)
-> Limit (cost=0.00..0.01 rows=1 width=4) (actual time=0.552..0.554 rows=1 loops=1)
-> Function Scan on generate_series (cost=0.00..100.00 rows=10000 width=4) (actual time=0.550..0.551 rows=1 loops=1)
-> Limit (cost=0.00..0.01 rows=1 width=4) (actual time=0.706..0.707 rows=1 loops=1)
-> Function Scan on generate_series generate_series_1 (cost=0.00..100.00 rows=10000 width=4) (actual time=0.704..0.705 rows=1 loops=1)
Planning Time: 0.096 ms
Execution Time: 1.847 ms
(11 rows)
postgres=# select version();
version
-----------------------------------------------------------------------------------------------------------
PostgreSQL 13devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.2.1 20170915 (Red Hat 7.2.1-2), 64-bit
(1 row)
-
robins
Robins: On Mon, Jul 6, 2020 at 1:37 PM Robins Tharakan <tharakan@gmail.com> wrote: > When an SQL needs to UNION constants on either side, it should be possible to > implicitly apply a LIMIT 1 and get good speed up. Is this an incorrect understanding, > or something already discussed but rejected for some reason? Maybe the optimization does not hold its weight. I mean, the increased complexity in the optimizer, bigger memory footprint, testing and developer usage, is not worth it. > This need came up while reviewing generated SQL, where the need was to return true when > at least one of two lists had a row. A simplified version is given below: I.e., I do not think this is a "need", specially in generated SQL, seems more like a deficiency in the generator ( specially since generators are able, and some do it, to massively overcondition the generated code to insure the optimizer does not miss anything ), and wrapping things in a limit 1 when just testing for row existence seems easy to do while generating. > (SELECT 1 FROM pg_class) UNION (SELECT 1 FROM pg_class); > vs. > (select 1 FROM pg_class limit 1) UNION (SELECT 1 FROM pg_class limit 1); -- Faster As an aside, isn't it easier, more correct ( in my opinion ) and clearer to just use exists for row-existence test? Because you have to at least see it there is a result above, probably using exists, and you can do... select exists(SELECT 1 FROM pg_class) or exists(SELECT 1 FROM pg_class); to get a direct boolean and benefit from shortcircuiting, by putting the most likely one first, and from the internal knowledge the optimizer may have to not fully evaluate queries, which may be greater than deducting from the union/limit case. Francisco Olarte.
On Mon, 6 Jul 2020 at 12:37, Robins Tharakan <tharakan@gmail.com> wrote:
When an SQL needs to UNION constants on either side, it should be possible toimplicitly apply a LIMIT 1 and get good speed up. Is this an incorrect understanding,or something already discussed but rejected for some reason?This need came up while reviewing generated SQL, where the need was to return true whenat least one of two lists had a row. A simplified version is given below:(SELECT 1 FROM pg_class) UNION (SELECT 1 FROM pg_class);
vs.
(select 1 FROM pg_class limit 1) UNION (SELECT 1 FROM pg_class limit 1); -- Faster
Those two queries aren't logically equivalent, so you can't apply the LIMIT 1 as an optimization.
First query returns lots of random rows, the second query returns just one random row.
On Mon, Jul 6, 2020 at 5:37 AM Robins Tharakan <tharakan@gmail.com> wrote:
This need came up while reviewing generated SQL, where the need was to return true whenat least one of two lists had a row.
Generated SQL... yep. That will happen. Manual SQL may be more work, but often has significant reward.
If you can change how the SQL is generated, I would expect that EXISTS would likely be more performant in your case, even if you need to do UNION ALL between the two current queries that potentially return true.
On Tue, 7 Jul 2020 at 00:43, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Mon, 6 Jul 2020 at 12:37, Robins Tharakan <tharakan@gmail.com> wrote: > >> >> When an SQL needs to UNION constants on either side, it should be possible to >> implicitly apply a LIMIT 1 and get good speed up. Is this an incorrect understanding, >> or something already discussed but rejected for some reason? >> >> This need came up while reviewing generated SQL, where the need was to return true when >> at least one of two lists had a row. A simplified version is given below: >> >> (SELECT 1 FROM pg_class) UNION (SELECT 1 FROM pg_class); >> vs. >> (select 1 FROM pg_class limit 1) UNION (SELECT 1 FROM pg_class limit 1); -- Faster > > > Those two queries aren't logically equivalent, so you can't apply the LIMIT 1 as an optimization. > > First query returns lots of random rows, the second query returns just one random row. I think the idea here is that because the target list contains only constants that pulling additional rows from the query after the first one will just be a duplicate row and never add any rows after the UNION is processed. David
On Mon, 6 Jul 2020 at 21:49, David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 7 Jul 2020 at 00:43, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Mon, 6 Jul 2020 at 12:37, Robins Tharakan <tharakan@gmail.com> wrote:
>
>>
>> When an SQL needs to UNION constants on either side, it should be possible to
>> implicitly apply a LIMIT 1 and get good speed up. Is this an incorrect understanding,
>> or something already discussed but rejected for some reason?
>>
>> This need came up while reviewing generated SQL, where the need was to return true when
>> at least one of two lists had a row. A simplified version is given below:
>>
>> (SELECT 1 FROM pg_class) UNION (SELECT 1 FROM pg_class);
>> vs.
>> (select 1 FROM pg_class limit 1) UNION (SELECT 1 FROM pg_class limit 1); -- Faster
>
>
> Those two queries aren't logically equivalent, so you can't apply the LIMIT 1 as an optimization.
>
> First query returns lots of random rows, the second query returns just one random row.
I think the idea here is that because the target list contains only
constants that pulling additional rows from the query after the first
one will just be a duplicate row and never add any rows after the
UNION is processed.
OK, I see. Are you saying you think it's a worthwhile optimization to autodetect?
On Tue, 7 Jul 2020 at 09:03, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Mon, 6 Jul 2020 at 21:49, David Rowley <dgrowleyml@gmail.com> wrote: >> >> On Tue, 7 Jul 2020 at 00:43, Simon Riggs <simon@2ndquadrant.com> wrote: >> > >> > On Mon, 6 Jul 2020 at 12:37, Robins Tharakan <tharakan@gmail.com> wrote: >> >> (SELECT 1 FROM pg_class) UNION (SELECT 1 FROM pg_class); >> >> vs. >> >> (select 1 FROM pg_class limit 1) UNION (SELECT 1 FROM pg_class limit 1); -- Faster >> > >> > >> > Those two queries aren't logically equivalent, so you can't apply the LIMIT 1 as an optimization. >> > >> > First query returns lots of random rows, the second query returns just one random row. >> >> I think the idea here is that because the target list contains only >> constants that pulling additional rows from the query after the first >> one will just be a duplicate row and never add any rows after the >> UNION is processed. > > > OK, I see. Are you saying you think it's a worthwhile optimization to autodetect? I guess it's all about how much effort would be involved to detect such cases vs how likely it is that we're going to speed up someone's query. I imagine it's not much effort to detect this, but also, this is the first time I recall seeing this mentioned, so perhaps that means not many people would be rewarded by making such a change. (It does seem like quite a strange way to express the query.) There is currently a patch floating around that implements UniqueKeys which allows RelOptInfos to be tagged with the properties that they're unique on. With the current design of that patch, there is no way to say "this relation *only* has duplicate rows". Perhaps some design tweaks there can make detecting this case cheaper in terms of CPU effort during planning, and perhaps also in terms of how much code it would take to make it work. I'll try to keep this in mind when I review that work soon. If we were to start adding optimisations for cases such as this, then I'd rather they were done in some general-purpose way that just makes them "just work" rather than adding special cases around the codebase that trigger some special behaviour. Likely in this case, it would still take at least some code when planning setops. Additionally, the setops code is badly in need of a rewrite, so the bar is likely pretty high on adding too many smarts in there that we need to migrate forwards after a rewrite. The setops planning code is still not making use of the upper planner pathification work that Tom did years ago. In many cases UNION would be far more optimal if it were implemented as an Index Scan -> MergeAppend -> Unique. I think right now we only consider Append -> Sort -> Unique and Append -> Hash Aggregate. David