Thread: FETCH FIRST clause PERCENT option
FETCH FIRST with the PERCENT option is SQL-standard syntax that allows the limit count to be specified as a percentage, in addition to specifying it as an exact count. It is listed as one of the "Major features simply not implemented yet" on a recent wiki page [1].
I implemented it by executing the query plan twice: once without the limit filter, to find the total number of rows that will be fed to the limit node so that the exact limit count can be calculated, and then again with the limit filter using the newly calculated exact count.
[1] https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL_Standard
Regards
Surafel
Hi,

On 2018-08-16 17:27:45 +0300, Surafel Temesgen wrote:
> FETCH FIRST with PERCENT option is SQL standard that state limit count to
> be specified in a percentage in addition to specify it in exact count and
> listed as one of Major features simply not implemented yet in recent wiki
> page [1].
>
> I implemented it by executing the query plan twice. One without limit
> filter to find the total number of row that will be feed to limit node so
> the exact limit count can be calculated and the query plan execute again
> with limit filter with newly calculated exact count .

Won't that have rather massive issues with multiple evaluations of
clauses in the query? Besides being really expensive?

I think you'd have to instead spill the query results into a tuplestore
(like we do for FOR HOLD queries at end of xact), and then do the
computations based on that.

- Andres
>>>>> "Andres" == Andres Freund <andres@anarazel.de> writes:

 Andres> I think you'd have to instead spill the query results into a
 Andres> tuplestore (like we do for FOR HOLD queries at end of xact),
 Andres> and then do the computations based on that.

The approach I've suggested to some people who wanted to look into this
was to inject a window function call into the query, and teach the Limit
node to be able to stop on a boolean condition (which would be based on
that function result). This method should also work for WITH TIES by
using a different window function.

--
Andrew (irc:RhodiumToad)
Won't that have rather massive issues with multiple evaluations of
clauses in the query? Besides being really expensive?
The plan is re-scanned after the first execution; I can't see an issue with multiple executions of a clause in this case.
I think you'd have to instead spill the query results into a tuplestore
The downside is that all the results have to be stored even if the needed tuples are only a fraction of them, and they have to be stored for longer, so the trade-off becomes memory versus CPU utilization.
On 2018-08-21 15:47:07 +0300, Surafel Temesgen wrote:
> On Thu, Aug 16, 2018 at 5:34 PM, Andres Freund <andres@anarazel.de> wrote:
>
>> Won't that have rather massive issues with multiple evaluations of
>> clauses in the query? Besides being really expensive?
>
> The plan re-scane after first execution I can’t see issue for multiple
> execution of a clause in this case

Imagine volatile functions being used. That can be problematic because
of repeated side-effects, but also will lead to plainly wrong results.
Consider what happens with your approach where somebody does something
like WHERE value < random().

Greetings,

Andres Freund
Imagine volatile functions being used. That can be problematic because
of repeated side-effects, but also will lead to plainly wrong
results. Consider what happens with your approach where somebody does
something like WHERE value < random().
On 2018-08-28 14:14, Surafel Temesgen wrote:
> On Tue, Aug 21, 2018 at 3:50 PM Andres Freund <andres@anarazel.de> wrote:
>
>> Imagine volatile functions being used. That can be problematic because
>> of repeated side-effects, but also will lead to plainly wrong
>> results. Consider what happens with your approach where somebody does
>> something like WHERE value < random().
>
> Thanks for explaining . The attach patch use tuplestore instead
> [fetch-wth-percent-v2.patch]

I had a quick look at this. Apply, compile, make check are ok. In
straightforward usage seems to work ok. But I also ran across this crash:

create table if not exists t as select i from generate_series(1, 100) as f(i);
select * from t offset 60 rows fetch next 3 percent rows only FOR UPDATE;

TRAP: FailedAssertion("!(slot != ((void *)0))", File: "execTuples.c", Line: 428)

Erik Rijkers
TRAP: FailedAssertion("!(slot != ((void *)0))", File: "execTuples.c",
Line: 428)
On Fri, Aug 31, 2018 at 11:42 PM Surafel Temesgen <surafel3000@gmail.com> wrote:
> On Tue, Aug 28, 2018 at 7:33 PM Erik Rijkers <er@xs4all.nl> wrote:
>>
>> TRAP: FailedAssertion("!(slot != ((void *)0))", File: "execTuples.c",
>> Line: 428)
>
> The attache patch include a fix for the crash .can you check it again?

FYI this fails[1] in src/test/modules/test_ddl_deparse's create_table
test because of the keyword:

CREATE TABLE stud_emp (
	percent		int4
) INHERITS (emp, student);

! ERROR:  syntax error at or near "percent"

[1] https://travis-ci.org/postgresql-cfbot/postgresql/builds/430777123

--
Thomas Munro
http://www.enterprisedb.com
> On Aug 16, 2018, at 7:34 AM, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2018-08-16 17:27:45 +0300, Surafel Temesgen wrote:
>> FETCH FIRST with PERCENT option is SQL standard that state limit count to
>> be specified in a percentage in addition to specify it in exact count and
>> listed as one of Major features simply not implemented yet in recent wiki
>> page [1].
>>
>> I implemented it by executing the query plan twice. One without limit
>> filter to find the total number of row that will be feed to limit node so
>> the exact limit count can be calculated and the query plan execute again
>> with limit filter with newly calculated exact count .

Surafel, there are no regression tests that I can see in your patch. It
would help if you added some, as then I could see precisely what behavior
you are expecting. As it is, I'm just guessing here, but here goes....

> Won't that have rather massive issues with multiple evaluations of
> clauses in the query? Besides being really expensive?
>
> I think you'd have to instead spill the query results into a tuplestore
> (like we do for FOR HOLD queries at end of xact), and then do the
> computations based on that.

I should think that spilling anything to a tuplestore would only be needed
if the query contains an ORDER BY expression. If you query

    FETCH FIRST 50 PERCENT * FROM foo;

you should just return every other row, discarding the rest, right? It's
only when an explicit ordering is given that the need to store the results
arises. Even with

    FETCH FIRST 50 PERCENT name FROM foo ORDER BY name;

you can return one row for every two rows that you get back from the
sort node, reducing the maximum number you need to store at any time to
no more than 25% of all rows.

Or am I missing something?

mark
Hi,

On 2018-09-20 17:06:36 -0700, Mark Dilger wrote:
> I should think that spilling anything to a tuplestore would only be needed
> if the query contains an ORDER BY expression. If you query
>
> FETCH FIRST 50 PERCENT * FROM foo;
>
> you should just return every other row, discarding the rest, right? It's
> only when an explicit ordering is given that the need to store the results
> arises. Even with
>
> FETCH FIRST 50 PERCENT name FROM foo ORDER BY name;
>
> you can return one row for every two rows that you get back from the
> sort node, reducing the maximum number you need to store at any time to
> no more than 25% of all rows.

I'm doubtful about the validity of these optimizations, particularly
around being surprising. But I think more importantly, we should focus
on the basic implementation that's needed anyway.

Greetings,

Andres Freund
> On Sep 20, 2018, at 5:29 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2018-09-20 17:06:36 -0700, Mark Dilger wrote:
>> I should think that spilling anything to a tuplestore would only be needed
>> if the query contains an ORDER BY expression. If you query
>>
>> FETCH FIRST 50 PERCENT * FROM foo;
>>
>> you should just return every other row, discarding the rest, right? It's
>> only when an explicit ordering is given that the need to store the results
>> arises. Even with
>>
>> FETCH FIRST 50 PERCENT name FROM foo ORDER BY name;
>>
>> you can return one row for every two rows that you get back from the
>> sort node, reducing the maximum number you need to store at any time to
>> no more than 25% of all rows.
>
> I'm doubtful about the validity of these optimizations, particularly
> around being surprising. But I think more importantly, we should focus
> on the basic implementation that's needed anyway.

You may be right that getting the basic implementation finished first is
better than optimizing at this stage. So the rest of what I'm going to say
is just in defense of the optimization, and not an argument for needing to
optimize right away.

As for reducing the surprise factor, I think that it would be surprising
if I ask for a smallish percentage of rows and it takes significantly
longer and significantly more memory or disk than asking for all the rows
takes. If I'm including an explicit ORDER BY, then that explains it, but
otherwise, I'd be surprised. Note that I'm not saying I'd be surprised by
it taking roughly the same length of time / memory / disk. I'd only be
surprised if it took a lot more.

There are plenty of SQL generation engines that people put in their
software. I'd expect something like sprintf("FETCH FIRST %d PERCENT %s
FROM %s", percentage, columns, tablename) to show up in such engines, and
percentage to sometimes be 100. At least in that case you should just
return all rows rather than dumping them into a tuplestore. Likewise, if
the percentage is 0, you'll want to finish quickly. Actually, I don't know
if the SQL spec would require side effects to still happen, in which case
you'd still have to generate all rows for their side effects to happen,
and then just not return them. But still, no tuplestore.

So the implementation of FETCH FIRST would at least need to think about
what percentage is being requested, rather than just mindlessly adding a
node to the tree for storing everything, then computing the LIMIT based on
the number of rows stored, and then returning that number of rows.

mark
Hey,

On 9/21/18, Mark Dilger <hornschnorter@gmail.com> wrote:
> Surafel, there are no regression tests that I can see in your patch. It
> would help if you added some, as then I could precisely what behavior you
> are expecting.

Thank you for looking at it. The attached patch adds regression tests.
> On Sep 25, 2018, at 5:07 AM, Surafel Temesgen <surafel3000@gmail.com> wrote:
>
> hey
>
> On 9/21/18, Mark Dilger <hornschnorter@gmail.com> wrote:
>
>> Surafel, there are no regression tests that I can see in your patch. It
>> would help if you added some, as then I could precisely what behavior you
>> are expecting.
> thank you for looking at it .the attach patch add regression tests
> <fetch-wth-percent-v4.patch>

Surafel,

I messed around with your changes to the grammar and it seems you don't
need to add PERCENT as a reserved keyword. Moving this to the unreserved
keyword section does not cause any shift/reduce errors, and the regression
tests still pass. Relative to your patch v4, these changes help:

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index a97ee30924..9872cf7257 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -15186,6 +15186,7 @@ unreserved_keyword:
 			| PARTITION
 			| PASSING
 			| PASSWORD
+			| PERCENT
 			| PLANS
 			| POLICY
 			| PRECEDING
@@ -15465,7 +15466,6 @@ reserved_keyword:
 			| ONLY
 			| OR
 			| ORDER
-			| PERCENT
 			| PLACING
 			| PRIMARY
 			| REFERENCES
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 150d51df8f..f23fee0653 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -300,7 +300,7 @@ PG_KEYWORD("partial", PARTIAL, UNRESERVED_KEYWORD)
 PG_KEYWORD("partition", PARTITION, UNRESERVED_KEYWORD)
 PG_KEYWORD("passing", PASSING, UNRESERVED_KEYWORD)
 PG_KEYWORD("password", PASSWORD, UNRESERVED_KEYWORD)
-PG_KEYWORD("percent", PERCENT, RESERVED_KEYWORD)
+PG_KEYWORD("percent", PERCENT, UNRESERVED_KEYWORD)
 PG_KEYWORD("placing", PLACING, RESERVED_KEYWORD)
 PG_KEYWORD("plans", PLANS, UNRESERVED_KEYWORD)
 PG_KEYWORD("policy", POLICY, UNRESERVED_KEYWORD)
> On Sep 25, 2018, at 8:08 AM, Mark Dilger <hornschnorter@gmail.com> wrote:
>
>> On Sep 25, 2018, at 5:07 AM, Surafel Temesgen <surafel3000@gmail.com> wrote:
>>
>> hey
>>
>> On 9/21/18, Mark Dilger <hornschnorter@gmail.com> wrote:
>>
>>> Surafel, there are no regression tests that I can see in your patch. It
>>> would help if you added some, as then I could precisely what behavior you
>>> are expecting.
>> thank you for looking at it .the attach patch add regression tests
>> <fetch-wth-percent-v4.patch>
>
> Surafel,
>
> I messed around with your changes to the grammar and it seems you don't
> need to add PERCENT as a reserved keyword. Moving this to the unreserved
> keyword section does not cause any shift/reduce errors, and the regression
> tests still pass. Relative to your patch v4, these changes help:

I spoke too soon. The main regression tests pass, but your change to
src/test/modules/test_ddl_deparse/sql/create_table.sql per Thomas's
suggestion is no longer needed, since PERCENT no longer needs to be
quoted. I recommend you also apply the following to your v4 patch, which
just rolls back that one change you made, and at least for me, is enough
to get `make check-world` to pass:

diff --git a/src/test/modules/test_ddl_deparse/sql/create_table.sql b/src/test/modules/test_ddl_deparse/sql/create_table.sql
index 4325de2d04..5e78452729 100644
--- a/src/test/modules/test_ddl_deparse/sql/create_table.sql
+++ b/src/test/modules/test_ddl_deparse/sql/create_table.sql
@@ -94,7 +94,7 @@ CREATE TABLE student (
 ) INHERITS (person);
 
 CREATE TABLE stud_emp (
-	"percent"	int4
+	percent		int4
 ) INHERITS (emp, student);
I messed around with your changes to the grammar and it seems you don't
need to add PERCENT as a reserved keyword. Moving this to the unreserved
keyword section does not cause any shift/reduce errors, and the regression
tests still pass. Relative to your patch v4, these changes help:
On 25/11/2018 10:00, Surafel Temesgen wrote:
>> I messed around with your changes to the grammar and it seems you don't
>> need to add PERCENT as a reserved keyword. Moving this to the unreserved
>> keyword section does not cause any shift/reduce errors, and the
>> regression tests still pass. Relative to your patch v4, these changes help:
>
> In sql standard PERCENT list as reserved world so i don't think it is a
> thing that can change by me.

Yes it absolutely is. In PostgreSQL we only make words as reserved as
they need to be, and PERCENT doesn't need to be reserved at all.

Also, this query returns 210 rows instead of the expected 208:

select *
from generate_series(1, 1000)
fetch first 20.8 percent rows only

As for the design, I think you should put some serious consideration
into Andrew Gierth's approach. I would rather see something like that.

--
Vik Fearing
+33 6 46 75 15 36
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
Also, this query returns 210 rows instead of the expected 208:
select *
from generate_series(1, 1000)
fetch first 20.8 percent rows only
This is because FETCH FIRST values work with integers, and a fractional number is converted to an integer. For example, select * from generate_series(1, 1000) fetch first 20.3 rows only; is not an error; rather, it returns 20 rows.
As for the design, I think you should put some serious consideration
into Andrew Gierth's approach. I would rather see something like that.
I didn't go in that direction because I don't understand why this approach is inferior to using a window function, and I don't fully understand how to implement it.
regards
Surafel
On 25/11/2018 12:49, Surafel Temesgen wrote:
> On Sun, Nov 25, 2018 at 1:24 PM Vik Fearing <vik.fearing@2ndquadrant.com> wrote:
>
>> Also, this query returns 210 rows instead of the expected 208:
>>
>> select *
>> from generate_series(1, 1000)
>> fetch first 20.8 percent rows only
>
> this is because fetch first values work with integer and it change
> fractional number to nearest integer number like select * from
> generate_series(1, 1000) fetch first 20.3 rows only; is not an error
> rather it return 20 rows.

I don't see how this behavior is justified by reading the SQL standard.
Obviously only an integer number of rows is going to be returned, but
the percentage should be calculated correctly.

I assume you meant 200 rows there, but the correct number of rows to
return is 203 for that query. Please fix it.

--
Vik Fearing
+33 6 46 75 15 36
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
>>>>> "Surafel" == Surafel Temesgen <surafel3000@gmail.com> writes:

 Surafel> this is because fetch first values work with integer and it
 Surafel> change fractional number to nearest integer number like select
 Surafel> * from generate_series(1, 1000) fetch first 20.3 rows only; is
 Surafel> not an error rather it return 20 rows.

31) The declared type of the <simple value specification> simply
    contained in <fetch first percentage> shall be numeric.

Note that unlike <fetch first row count> there is no requirement for the
type to be _exact_ numeric or to have scale 0.

later:

  ii) If <fetch first percentage> is specified, then let FFP be the
      <simple value specification> simply contained in <fetch first
      percentage>, and let LOCT be a <literal> whose value is OCT. Let
      FFRC be the value of

        CEILING ( FFP * LOCT / 100.0E0 )

      NOTE 333 -- The percentage is computed using the number of rows
      before removing the rows specified by <offset row count>.

ceil(20.3 * 1000 / 100.0E0) is definitely 203, not 200.

--
Andrew.
On 11/25/18 1:05 PM, Vik Fearing wrote:
> On 25/11/2018 12:49, Surafel Temesgen wrote:
>> this is because fetch first values work with integer and it change
>> fractional number to nearest integer number like select * from
>> generate_series(1, 1000) fetch first 20.3 rows only; is not an error
>> rather it return 20 rows.
>
> I don't see how this behavior is justified by reading the SQL standard.
> Obviously only an integer number of rows is going to be returned, but
> the percentage should be calculated correctly.

Right. My draft of SQL standard says this:

<fetch first clause> ::=
    FETCH { FIRST | NEXT } [ <fetch first quantity> ] { ROW | ROWS } { ONLY | WITH TIES }

<fetch first quantity> ::=
    <fetch first row count>
  | <fetch first percentage>

<fetch first percentage> ::=
    <simple value specification> PERCENT

and then

30) The declared type of <fetch first row count> shall be an exact
    numeric with scale 0 (zero).

31) The declared type of the <simple value specification> simply
    contained in <fetch first percentage> shall be numeric.

So the standard pretty much requires treating the value as numeric.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On 11/25/18 1:05 PM, Vik Fearing wrote:
>> I don't see how this behavior is justified by reading the SQL standard.
>> Obviously only an integer number of rows is going to be returned, but
>> the percentage should be calculated correctly.

> Right. My draft of SQL standard says this:
> 31) The declared type of the <simple value specification> simply
> contained in <fetch first percentage> shall be numeric.
> So the standard pretty much requires treating the value as numeric.

Note that this should not be read as "it must be of the type that PG
calls numeric". I'd read it as requiring a value that's in the numeric
type hierarchy. float8 would be a reasonable implementation choice if
we're going to coerce to a specific type.

			regards, tom lane
I assume you meant 200 rows there, but the correct number of rows to
return is 203 for that query. Please fix it.
Attached is a patch that includes a fix for it.
Hi,

I've been looking at this patch today, and I think there's a couple of
issues with the costing and execution. Consider a simple table with 1M rows:

    create table t as select i from generate_series(1,1000000) s(i);

and these two queries, that both select 1% of rows

    select * from t fetch first 10000 rows only;

    select * from t fetch first 1 percent rows only;

which are costed like this

    test=# explain select * from t fetch first 10000 rows only;
                                QUERY PLAN
    -----------------------------------------------------------------
     Limit  (cost=0.00..144.25 rows=10000 width=4)
       ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=4)
    (2 rows)

    test=# explain select * from t fetch first 1 percent rows only;
                                QUERY PLAN
    -----------------------------------------------------------------
     Limit  (cost=0.00..14425.00 rows=1000000 width=4)
       ->  Seq Scan on t  (cost=0.00..14425.00 rows=1000000 width=4)
    (2 rows)

There are two issues with the costs in the "percent" case.

Firstly, the estimated number of rows should be about the same (10k) in
both cases, but the "percent" case incorrectly uses the total row count
(i.e. as if 100% rows were returned).

It's correct that the total cost for the "percent" case is based on 100%
of rows, of course, because the code has to fetch everything and stash
it into the tuplestore in LIMIT_INITIAL phase.

But that implies the startup cost for the "percent" case can't be 0,
because all this work has to be done before emitting the first row.

So these costing aspects need fixing, IMHO.

The execution part of the patch seems to be working correctly, but I
think there's an improvement - we don't need to execute the outer plan
to completion before emitting the first row. For example, let's say the
outer plan produces 10000 rows in total and we're supposed to return the
first 1% of those rows. We can emit the first row after fetching the
first 100 rows, we don't have to wait for fetching all 10k rows.

This may seem pointless at first because we eventually have to scan all
the rows anyway (as we don't know we're done otherwise). But it may be
nested in another early-terminating node in which case it will matter,
e.g. like this:

    SELECT * FROM t1 WHERE EXISTS (
      SELECT * FROM t2 FETCH FIRST 1 PERCENT ROWS ONLY
    ) foo;

I admit this example is somewhat artificial, but it's pretty easy to end
up with queries like this with views etc. Overall the fast startup seems
like a quite valuable feature and we should try keeping it.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
I've been looking at this patch today, and I think there's a couple of
issues with the costing and execution. Consider a simple table with 1M rows:
create table t as select i from generate_series(1,1000000) s(i);
and these two queries, that both select 1% of rows
select * from t fetch first 10000 rows only;
select * from t fetch first 1 percent rows only;
which are costed like this
test=# explain select * from t fetch first 10000 rows only;
QUERY PLAN
-----------------------------------------------------------------
Limit (cost=0.00..144.25 rows=10000 width=4)
-> Seq Scan on t (cost=0.00..14425.00 rows=1000000 width=4)
(2 rows)
test=# explain select * from t fetch first 1 percent rows only;
QUERY PLAN
-----------------------------------------------------------------
Limit (cost=0.00..14425.00 rows=1000000 width=4)
-> Seq Scan on t (cost=0.00..14425.00 rows=1000000 width=4)
(2 rows)
There are two issues with the costs in the "percent" case.
Firstly, the estimated number of rows should be about the same (10k) in
both cases, but the "percent" case incorrectly uses the total row count
(i.e. as if 100% rows were returned).
Ok, I will correct it.
It's correct that the total cost for the "percent" case is based on 100%
of rows, of course, because the code has to fetch everything and stash
it into the tuplestore in LIMIT_INITIAL phase.
But that implies the startup cost for the "percent" case can't be 0,
because all this work has to be done before emitting the first row.
Correct: the startup cost must be equal to the total cost of the source data path, and the total cost has to be slightly greater than the startup cost. I am planning to increase the total cost by limitCount * 0.1 (similar to parallel_tuple_cost), because getting a tuple from the tuplestore is an operation roughly similar to passing a tuple from worker to master backend.
So these costing aspects need fixing, IMHO.
The execution part of the patch seems to be working correctly, but I
think there's an improvement - we don't need to execute the outer plan
to completion before emitting the first row. For example, let's say the
outer plan produces 10000 rows in total and we're supposed to return the
first 1% of those rows. We can emit the first row after fetching the
first 100 rows, we don't have to wait for fetching all 10k rows.
On 1/3/19 1:00 PM, Surafel Temesgen wrote:
> Hi
>
> On Tue, Jan 1, 2019 at 10:08 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>> ...
>>
>> Firstly, the estimated number of rows should be about the same (10k) in
>> both cases, but the "percent" case incorrectly uses the total row count
>> (i.e. as if 100% rows were returned).
>
> Ok I will correct it
>
>> It's correct that the total cost for the "percent" case is based on 100%
>> of rows, of course, because the code has to fetch everything and stash
>> it into the tuplestore in LIMIT_INITIAL phase.
>>
>> But that implies the startup cost for the "percent" case can't be 0,
>> because all this work has to be done before emitting the first row.
>
> Correct, startup cost must be equal to total cost of source data path
> and total cost have to be slightly greater than startup cost . I am
> planing to increase the total cost by limitCount * 0.1(similar to the
> parallel_tuple_cost) because getting tuple from tuplestore are almost
> similar operation to passing a tuple from worker to master backend.

OK, sounds good in principle.

>> So these costing aspects need fixing, IMHO.
>>
>> The execution part of the patch seems to be working correctly, but I
>> think there's an improvement - we don't need to execute the outer plan
>> to completion before emitting the first row. For example, let's say the
>> outer plan produces 10000 rows in total and we're supposed to return the
>> first 1% of those rows. We can emit the first row after fetching the
>> first 100 rows, we don't have to wait for fetching all 10k rows.
>
> but total rows count is not given how can we determine safe to return row

But you know how many rows were fetched from the outer plan, and this
number only grows. So the number of rows returned by FETCH FIRST
... PERCENT also only grows. For example with 10% of rows, you know that
once you reach 100 rows you should emit ~10 rows, with 200 rows you know
you should emit ~20 rows, etc. So you may track how many rows we're
supposed to return / returned so far, and emit them early.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 1/3/19 1:00 PM, Surafel Temesgen wrote:
> Hi
>
> On Tue, Jan 1, 2019 at 10:08 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:
> The execution part of the patch seems to be working correctly, but I
> think there's an improvement - we don't need to execute the outer plan
> to completion before emitting the first row. For example, let's say the
> outer plan produces 10000 rows in total and we're supposed to return the
> first 1% of those rows. We can emit the first row after fetching the
> first 100 rows, we don't have to wait for fetching all 10k rows.
>
>
> but total rows count is not given how can we determine safe to return row
>
But you know how many rows were fetched from the outer plan, and this
number only grows. So the number of rows returned by FETCH FIRST
... PERCENT also only grows. For example with 10% of rows, you know that
once you reach 100 rows you should emit ~10 rows, with 200 rows you know
you should emit ~20 rows, etc. So you may track how many rows we're
supposed to return / returned so far, and emit them early.
The execution part of the patch seems to be working correctly, but I
think there's an improvement - we don't need to execute the outer plan
to completion before emitting the first row. For example, let's say the
outer plan produces 10000 rows in total and we're supposed to return the
first 1% of those rows. We can emit the first row after fetching the
first 100 rows, we don't have to wait for fetching all 10k rows.
On 1/4/19 7:40 AM, Surafel Temesgen wrote:
> On Thu, Jan 3, 2019 at 4:51 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>> But you know how many rows were fetched from the outer plan, and this
>> number only grows. So the number of rows returned by FETCH FIRST
>> ... PERCENT also only grows. For example with 10% of rows, you know that
>> once you reach 100 rows you should emit ~10 rows, with 200 rows you know
>> you should emit ~20 rows, etc. So you may track how many rows we're
>> supposed to return / returned so far, and emit them early.
>
> yes that is clear but i don't find it easy to put that in formula. may
> be someone with good mathematics will help

What formula? All the math remains exactly the same, you just need to
update the number of rows to return and track how many rows are already
returned.

I haven't tried doing that, but AFAICS you'd need to tweak how/when
node->count is computed - instead of computing it only once it needs to
be updated after fetching each row from the subplan.

Furthermore, you'll need to stash the subplan rows somewhere (into a
tuplestore probably), and whenever the node->count value increments,
you'll need to grab a row from the tuplestore and return that (i.e.
tweak node->position and set node->subSlot).

I hope that makes sense. The one thing I'm not quite sure about is
whether tuplestore allows adding and getting rows at the same time.

Does that make sense?

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 1/4/19 8:08 AM, Surafel Temesgen wrote:
> On Tue, Jan 1, 2019 at 10:08 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>     The execution part of the patch seems to be working correctly, but I
>     think there's an improvement - we don't need to execute the outer plan
>     to completion before emitting the first row. For example, let's say the
>     outer plan produces 10000 rows in total and we're supposed to return the
>     first 1% of those rows. We can emit the first row after fetching the
>     first 100 rows, we don't have to wait for fetching all 10k rows.
>
> my other concern is in this case we are going to determine limit and
> safe row to return on the fly, which has additional computation, and I
> fear that it will not perform better in most cases.

I very much doubt recomputing the number of rows to return will be
measurable at all, considering those are int64 values. It's likely
negligible compared to all the other overhead (reading rows from the
subplan, etc.). And even if it turns out expensive, there are ways to
make it less expensive (e.g. not computing it every time, but only when
it actually can increment the value - for example with 1% rows to
return, it's enough to recompute it every 100 rows).

It also very much depends on which use cases you consider important.
Clearly, you're only thinking about the case that actually requires
fetching everything. But there's also the use case where you have
another limit node somewhere above. So let's do a simple "cost" analysis
of these two use cases:

1) case #1 (no LIMIT above, requires fetching everything)

Let's assume the "cost" of the current approach (fetching everything in
advance, computing node->count once) is the baseline here. With the
incremental approach, there likely is some extra overhead. Let's
overshoot it a bit - with 5% overhead, it's 1.05 of the baseline.

2) case #2 (LIMIT/EXISTS above, only needs to fetch the first few rows,
perhaps just a single one)

Here, we take the incremental approach as the baseline. With the "fetch
everything first" approach we need to evaluate the whole subplan and
fetch all the results, instead of just some small fraction. But the
exact ratio is pretty much arbitrary - I can easily construct queries
that need to fetch 100x or 100000x more data from the subplan.

So essentially using the "incremental fetch" approach means we may pay
5% overhead in cases where we need to fetch everything, while the "fetch
everything first" approach means we'll add an arbitrary amount of
overhead to cases where we don't need to fetch everything.

Of course, one way to deal with this would be to argue the cases
benefiting from the incremental approach are very rare / non-existent.
Perhaps that's the case, although I don't see why it would be.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
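The trade-off in the two cases above can be written down as a toy model (the 5% overhead figure is the hypothetical number from the mail, not a measured cost; these are illustrative functions, not PostgreSQL's cost model):

```c
#include <assert.h>

/*
 * Back-of-the-envelope model of the two approaches discussed above:
 * the incremental approach pays ~5% overhead when everything must be
 * fetched, while "fetch everything first" always pays the full subplan
 * cost, even when a LIMIT above only needs a tiny fraction of the rows.
 */
static double
cost_incremental(double subplan_cost, double fraction_needed)
{
    /* only reads the fraction of the subplan it needs, plus overhead */
    return 1.05 * subplan_cost * fraction_needed;
}

static double
cost_fetch_everything(double subplan_cost, double fraction_needed)
{
    (void) fraction_needed;     /* always executes the whole subplan */
    return subplan_cost;
}
```

With these numbers, the incremental approach loses by 5% when everything is needed, but wins by up to two orders of magnitude (or more) when only a fraction is.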
What formula? All the math remains exactly the same, you just need to
update the number of rows to return and track how many rows are already
returned.
I haven't tried doing that, but AFAICS you'd need to tweak how/when
node->count is computed - instead of computing it only once it needs to
be updated after fetching each row from the subplan.
Furthermore, you'll need to stash the subplan rows somewhere (into a
tuplestore probably), and whenever the node->count value increments,
you'll need to grab a row from the tuplestore and return that (i.e.
tweak node->position and set node->subSlot).
I hope that makes sense. The one thing I'm not quite sure about is
whether tuplestore allows adding and getting rows at the same time.
Does that make sense?
In the current implementation, in the LIMIT_INITIAL state we execute the outer plan to the end, store the resulting tuples in a tuplestore and calculate limitCount as an exact number. We are in this state only once and do not return any tuple. Once we are in the LIMIT_INWINDOW state and the inner node execution asks for a tuple, it is returned from the tuplestore immediately.
In order to do fast startup, what I was thinking was dividing the work done in the LIMIT_INITIAL state by limitCount. For example, if we want 0.5 percent of the result, then in the LIMIT_INWINDOW state we execute the outer node 200 times, store the result in the tuplestore and return the first tuple. If we don't get that many tuples, that means we reached the end of the limit. But I think I can't do it this way, because a percentage has meaning only relative to the total number of rows, so the LIMIT_INITIAL work can't be avoided.
regards
Surafel
On 1/6/19 12:40 PM, Surafel Temesgen wrote:
> On Fri, Jan 4, 2019 at 5:27 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>     What formula? All the math remains exactly the same, you just need to
>     update the number of rows to return and track how many rows are already
>     returned.
>
>     Furthermore, you'll need to stash the subplan rows somewhere (into a
>     tuplestore probably), and whenever the node->count value increments,
>     you'll need to grab a row from the tuplestore and return that (i.e.
>     tweak node->position and set node->subSlot).
>
> In current implementation in LIMIT_INITIAL state we execute outer
> plan to the end, store the resulting tuples in tuplestore and
> calculate limitCount in number. We are in this state only once and
> did not return any tuple. Once we are at LIMIT_INWINDOW state and
> inner node execution asking for tuple it return from tuple store
> immediately.

Right.

> Inorder to do fast startup what I was thinking was dividing the work
> done at LIMIT_INITIAL state in to limitCount. For example we want
> 0.5 percent of the result which means in LIMIT_INWINDOW state we
> execute outer node 200 times, store the result in tuplestore and
> return the first tuple. if we don't get that much tuple that means we
> reach end of the limit.

Yes, pretty much.

> But I think i can't do it in this way because percent have meaning
> only with total number so LIMIT_INITIAL work can't be avoid.

I'm not sure I understand what you mean by "percent would have meaning
only with the total number".

The important thing is that you can track how many tuples you're
supposed to return (based on the current number of rows from the outer
plan) and how many you've produced so far. And those numbers are only
growing.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 1/9/19 7:09 AM, Surafel Temesgen wrote:
> On Sun, Jan 6, 2019 at 5:51 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>     > Inorder to do fast startup what I was thinking was dividing the work
>     > done at LIMIT_INITIAL state in to limitCount. [...]
>
>     Yes, pretty much.
>
> While working on this method i found an issue: it did not work well
> on a percent with a fractional part like 2.6 percent, which means we
> have to emit one row per every 38.4615 tuples, so it has to be rounded
> up to 39 or down to 38; either way it leads to an incorrect result.
> Even with an integer percentage sometimes the result comes out with a
> fractional part and needs rounding.

It's hard to say what exactly you are doing, because you haven't shared
any code nor examples. But it sounds as if you're computing how often to
produce an additional row at the beginning, and then simply multiplying
it. I'm not sure that's quite possible, because there will be issues due
to rounding.

Firstly, I see your last patch (from 30/11) does this:

    node->count = DatumGetInt64((DatumGetFloat8(val) *
                      tuplestore_tuple_count(node->totalTuple)) / 100);

but the SQL standard I have says this in the <query expression> section:

    If <fetch first percentage> is specified, then let FFP be the
    <simple value specification> simply contained in <fetch first
    percentage>, and let LOCT be a <literal> whose value is OCT. Let
    FFRC be the value of

        CEILING ( FFP * LOCT / 100.0E0 )

That's not what your patch does, though, because it relies on automatic
float->int conversion, which does floor() and not ceil(). FWIW the final
DatumGetInt64 seems unnecessary, because the expression is float8 and
not Datum. The patch should do something like this instead:

    node->count = ceil((DatumGetFloat8(val) *
                      tuplestore_tuple_count(node->totalTuple)) / 100);

Now, back to the row count issue - let's assume the user requests

    FETCH FIRST 3 PERCENT ROWS ONLY

This means we should produce 1 row every 33 rows, but only very roughly.
In reality it's a bit less, because 3% is not 1/33 exactly.

Consider this query, which computes the number of rows to return for
each total row count:

    select sample_cnt, l - lag(l) over (order by sample_cnt) AS d
      from (
        select sample_cnt, min(nrows) AS l
          from (
            select nrows,
                   (nrows * 3.0 / 100.0) AS raw_sample_cnt,
                   ceil(nrows * 3.0 / 100.0) AS sample_cnt
              from generate_series(1,10000) s(nrows)
          ) foo
         group by sample_cnt
         order by sample_cnt
      ) bar;

The inner-most query simply computes how many rows we're supposed to
return based on a certain row count. Then we compute the points at which
sample_cnt actually changes. And finally we compute the distance between
those points. You should get something like

     sample_cnt | d
    ------------+----
              1 |
              2 | 33
              3 | 33
              4 | 34
              5 | 33
              6 | 33
              7 | 34
              8 | 33
              9 | 33
             10 | 34
             11 | 33
            ...

This shows that the distance oscillates between 33 and 34, so no matter
what value you pick, you can't simply multiply it to get the desired
sample size. It needs to be re-computed as you go.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
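The oscillation the SQL query demonstrates can also be checked with plain integer arithmetic (a hypothetical standalone check, not part of the patch): the smallest row count at which ceil(3% * nrows) first reaches a given value is floor((k-1)*100/3)+1, and successive gaps come out as 33 or 34, never a constant:

```c
#include <assert.h>

/*
 * For FETCH FIRST 3 PERCENT: the smallest nrows at which
 * ceil(nrows * 3 / 100) first reaches sample_cnt. Integer division
 * floors, mirroring the SQL demo above. Hypothetical helper only.
 */
static int
first_nrows(int sample_cnt)
{
    return (sample_cnt - 1) * 100 / 3 + 1;
}

/* gap between the points where one more output row becomes due */
static int
gap(int sample_cnt)
{
    return first_nrows(sample_cnt + 1) - first_nrows(sample_cnt);
}
```

Since the gap is not a fixed interval, a Limit node that multiplies a precomputed interval will drift; recomputing the count per row does not.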
It's hard to say what exactly you are doing, because you haven't shared
any code nor examples.
Attachment
On 1/9/19 4:43 PM, Surafel Temesgen wrote:
> On Wed, Jan 9, 2019 at 5:38 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>     It's hard to say what exactly you are doing, because you haven't shared
>     any code nor examples.
>
> okay i attach in progress patch

Yeah, that's what I thought - the patch computes

    node->count = DatumGetInt64(100 / DatumGetFloat8(val));

and then always fetches this number of records before emitting the next
row from the tuplestore. That's wrong, as I explained before, because
the distance does change due to rounding.

See the attached patch, which recomputes the count regularly. I don't
claim the patch is committable or that it has no other bugs, but
hopefully it shows what I meant.

FWIW you don't need to create any slots - the two already created are
enough. You certainly don't need to create the slots within a loop.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
See the attached patch, which recomputes the count regularly. I don't
claim the patch is committable or that it has no other bugs, but
hopefully it shows what I meant.
I found only one issue: it does not work well with cursors.
postgres=# START TRANSACTION;
START TRANSACTION
postgres=# create table t as select i from generate_series(1,1000) s(i);
SELECT 1000
postgres=# declare c cursor for select * from t fetch first 5 percent rows only;
DECLARE CURSOR
postgres=# fetch all in c;
ERROR: trying to store a minimal tuple into wrong type of slot
I am looking at it.
Meanwhile I fixed row estimation and cost, and changed create_ordered_paths so that path creation does not consider LIMIT in the PERCENTAGE case.
regards
Surafel
Attachment
On 1/24/19 10:57 AM, Surafel Temesgen wrote:
> On Wed, Jan 9, 2019 at 8:18 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>     See the attached patch, which recomputes the count regularly. I don't
>     claim the patch is committable or that it has no other bugs, but
>     hopefully it shows what I meant.
>
> I got only one issue it is not work well with cursor
>
> postgres=# START TRANSACTION;
> START TRANSACTION
> postgres=# create table t as select i from generate_series(1,1000) s(i);
> SELECT 1000
> postgres=# declare c cursor for select * from t fetch first 5 percent rows only;
> DECLARE CURSOR
> postgres=# fetch all in c;
> ERROR:  trying to store a minimal tuple into wrong type of slot
>
> I am looking at it.

OK. Does that mean you agree the incremental approach is reasonable?

> meanwhile i fix row estimation and cost and make create_ordered_paths
> creation with no LIMIT consideration in PERCENTAGE case

I haven't reviewed the costing code yet, but linear interpolation seems
reasonable in principle.

I find the create_limit_path changes somewhat sloppy and difficult to
comprehend (particularly without comments). For example, the fact that
the code computes the total cost based on count_rows, which may be
tweaked later, seems suspicious. I mean, the code now does this:

    if (limitOption == PERCENTAGE)
    {
        ... update count_rows ...
        ... compute total_cost ...
    }

    if (count_rows > pathnode->path.rows)
        count_rows = pathnode->path.rows;

    ... compute total_cost for the EXACT_NUMBER case ...

    pathnode->path.rows = count_rows;

But I need to think about this a bit more.

FWIW I see there are no regression tests for the PERCENT option.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
OK. Does that mean you agree the incremental approach is reasonable?
On Mon, Jan 28, 2019 at 1:28 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
There is no noticeable performance difference, but I like the previous approach more regarding cursor operation: it fetches tuples forward and backward from the tuplestore only, whereas in the incremental approach we have to re-execute the outer node in every forward and backward fetching operation.
regards
Surafel
Attachment
On 1/30/19 7:07 AM, Surafel Temesgen wrote:
> On Mon, Jan 28, 2019 at 1:28 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>     OK. Does that mean you agree the incremental approach is reasonable?
>
> there are no noticeable performance difference but i love previous
> approach more regarding cursor operation it fetch tuple forward and
> backward from tuplestore only but in incremental approach we have to
> re execute outer node in every forward and backward fetching operation

I'm not sure I understand - are you saying every time the user does a
FETCH, we have to run the outer plan from scratch? I don't see why would
that be necessary? And if it is, how come there's no noticeable
performance difference?

Can you share a patch implementing the incremental approach, and a query
demonstrating the issue?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2/1/19 12:10 PM, Surafel Temesgen wrote:
> here is a rebased version of previous patch with planner
> comment included

It's really hard to say which comment that is ...

FWIW I find the changes in nodeLimit lacking sufficient comments. The
comments present are somewhat obvious - what we need are comments
explaining why things happen. For example LIMIT_INITIAL now includes
this chunk of code:

    case LIMIT_INITIAL:

        if (node->limitOption == PERCENTAGE)
        {
            /*
             * Find all rows the plan will return.
             */
            for (;;)
            {
                slot = ExecProcNode(outerPlan);
                if (TupIsNull(slot))
                {
                    break;
                }
                tuplestore_puttupleslot(node->totalTuple, slot);
            }
        }

Ignoring the fact that the comment is incorrectly indented, it states a
rather obvious fact - that we fetch all rows from a plan and stash them
into a tuplestore. What is missing is a comment explaining why we need
to do that. This applies to other changes in nodeLimit too, and
elsewhere.

Another detail is that we generally leave a free line before "if", i.e.
instead of

        }
        if (node->limitOption == PERCENTAGE)
        {

it should be

        }

        if (node->limitOption == PERCENTAGE)
        {

because it's fairly easy to misread as "else if". Even better, there
should be a comment before the "if" explaining what the branch does.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I'm not sure I understand - are you saying every time the user does a
FETCH, we have to run the outer plan from scratch? I don't see why would
that be necessary? And if it is, how come there's no noticeable
performance difference?
Can you share a patch implementing the incremental approach, and a query
demonstrating the issue?
I didn't implement it, but it's obvious that it doesn't work the same way as the previous approach.
We would need a different implementation. My plan was to use a tuplestore per call and clear
it after returning a tuple, but I see that the plan will not go far, mainly because the last returned
slot is not the last slot we get from the outer plan's execution.
regards
Surafel
On 2/23/19 8:53 AM, Surafel Temesgen wrote:
> On Sun, Feb 10, 2019 at 2:22 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>     I'm not sure I understand - are you saying every time the user does a
>     FETCH, we have to run the outer plan from scratch? I don't see why would
>     that be necessary? And if it is, how come there's no noticeable
>     performance difference?
>
>     Can you share a patch implementing the incremental approach, and a query
>     demonstrating the issue?
>
> I didn't implement it but its obvious that it doesn't work similarly
> with previous approach.

Sure, but that's hardly a sufficient argument for the current approach.

> We need different implementation and my plan was to use tuplestore per
> call and clear it after returning tuple but I see that the plan will
> not go far because mainly the last returned slot is not the last slot
> we get from outerPlan execution

I'm sorry, I still don't understand what the supposed problem is. I
don't think it's all that different from what nodeMaterial.c does, for
example.

As I explained before, having to execute the outer plan till completion
before returning any tuples is an issue. So either it needs fixing or an
explanation why it's not an issue.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I'm sorry, I still don't understand what the supposed problem is. I
don't think it's all that different from what nodeMaterial.c does, for
example.
Attachment
Hello.

At Sat, 23 Feb 2019 22:27:44 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in <81a5c0e9-c17d-28f3-4647-8a4659cdfdb1@2ndquadrant.com>:

> As I explained before, having to execute the outer plan till completion
> before returning any tuples is an issue. So either it needs fixing or an
> explanation why it's not an issue.

One biggest issue seems to be we don't know the total number of
outer tuples before actually reading a null tuple. I doubt of
general shortcut for that. It also seems preventing limit node
from just using materialized outer.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
At Thu, 28 Feb 2019 13:32:17 +0300, Surafel Temesgen <surafel3000@gmail.com> wrote in <CALAY4q_TrFN2z1oQcpxCKB5diy+A7m5hhG4VOV5d1BaWpf+qqA@mail.gmail.com>:

> On Sun, Feb 24, 2019 at 12:27 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>     I'm sorry, I still don't understand what the supposed problem is. I
>     don't think it's all that different from what nodeMaterial.c does, for
>     example.
>
> sorry for the noise. Attached is complete patch for incremental approach

Mmm. It seems to just move reading-all-before-return-the-first-tuple
from LIMIT_INITIAL to LIMIT_RESCAN. If I read it correctly, node->count
is not correct, since it is calculated as "the percentage of the number
of tuples read *so far*, excluding the offset portion".

| while (node->position - node->offset >= node->count)
| {
|     slot = ExecProcNode(outerPlan);
|     (if tuple is NULL then break;)
|     tuplestore_puttupleslot(node->tuple_store, slot);
|     cnt = tuplestore_tuple_count(node->tuple_store);
|     node->count = ceil(node->percent * cnt / 100.0);
| }

If I'm missing nothing, this seems almost the same as the following.

    while ((position - offset) * percent / 100 >= (position - offset) * percent / 100) {}

In the first place, the patch breaks nodeLimit.c and the build fails.

====
-	 * previous time we got a different result.
+	 * previous time we got a different result.In PERCENTAGE option there are
+	 * no bound on the number of output tuples */
	 */
====

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2/28/19 12:26 PM, Kyotaro HORIGUCHI wrote:
> One biggest issue seems to be we don't know the total number of
> outer tuples before actually reading a null tuple. I doubt of
> general shortcut for that. It also seems preventing limit node
> from just using materialized outer.

Sure, if you actually want all tuples, you'll have to execute the outer
plan till completion. But that's not what I'm talking about - what if we
only ever need to read one row from the limit?

To give you an (admittedly somewhat contrived and artificial) example:

    SELECT * FROM t1 WHERE id IN (
      SELECT id FROM t2 ORDER BY x FETCH FIRST 10 PERCENT ROWS ONLY
    );

Maybe this example is bogus and/or does not really matter in practice. I
don't know, but I've been unable to convince myself that's the case.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello.

At Thu, 28 Feb 2019 21:16:25 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in <fbd08ad3-5dd8-3169-6cba-38d610d7be7f@2ndquadrant.com>:

> > One biggest issue seems to be we don't know the total number of

# One *of* the biggest *issues*?

> > outer tuples before actually reading a null tuple. I doubt of
> > general shortcut for that. It also seems preventing limit node
> > from just using materialized outer.
>
> Sure, if you actually want all tuples, you'll have to execute the outer
> plan till completion. But that's not what I'm talking about - what if we
> only ever need to read one row from the limit?

We have no choice but to read all tuples once, only to find we are to
return just one row, since the estimator is not guaranteed to be exact
as required for this purpose.

> To give you a (admittedly, somewhat contrived and artificial example):
>
>     SELECT * FROM t1 WHERE id IN (
>       SELECT id FROM t2 ORDER BY x FETCH FIRST 10 PERCENT ROWS ONLY
>     );
>
> Maybe this example is bogus and/or does not really matter in practice. I
> don't know, but I've been unable to convince myself that's the case.

I see such kind of idiom as common. Even in the quite simple example
above, *we* cannot tell how many tuples the inner should return unless
we actually fetch all tuples in t2. This is the same problem as with
count(*).

The query is equivalent to the following one.

    SELECT * FROM t1 WHERE id IN (
      SELECT id FROM t2 ORDER BY x
       FETCH FIRST (SELECT ceil(count(*) * 0.1) FROM t2) ROWS ONLY
    );

This scans t2 twice, but this patch does only one full scan, moving the
other partial scan to a tuplestore. We would win if the outer is complex
enough.

Anyway, even excluding the count(*) issue, it seems that we are not
alone at that point. (That is, at least Oracle shows different E-Rows
and A-Rows for PERCENT.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachment
On 3/1/19 2:31 AM, Kyotaro HORIGUCHI wrote:
> We have no choice than once reading all tuples just to find we
> are to return just one row, since estimator is not guaranteed to
> be exact as required for this purpose.

I don't follow - I'm not suggesting to base this on row estimates at
all. I'm saying that we can read in rows, and decide how many rows we
are expected to produce. For example, with a 1% sample, we'll produce
about 1 row for each 100 rows read from the outer plan. So we read the
first row, and compute

    rows_to_produce = ceil(0.01 * rows_read)

and every time this increases, we emit another row from a tuplestore.

> The query is equivalent to the folloing one.
>
>     SELECT * FROM t1 WHERE id IN (
>       SELECT id FROM t2 ORDER BY x
>        FETCH FIRST (SELECT ceil(count(*) * 0.1) FROM t2) ROWS ONLY
>     );
>
> This scans t2 twice, but this patch does only one full scan
> moving another partial scan to tuplestore. We would win if the
> outer is complex enough.

But what if all t1 rows match the very first row in the subselect? In
that case we don't need to read all rows from the subplan - we only need
to read the first chunk (until we emit the first row).

> Anyway, even excluding the count(*) issue, it seems that we are
> not alone in that point. (That is, at least Oracle shows
> different E-Rows and A-Rows for PERCENT).

I'm not sure what E-Rows and A-Rows are? I suppose that's estimated and
actual - but as I explained above, that's not what I propose the
incremental approach to use.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Surafel,

On 3/1/19 8:18 PM, Tomas Vondra wrote:
>
> I'm not sure what E-Rows and A-Rows are? I suppose that's estimated and
> actual - but as I explained above, that's not what I propose the
> incremental approach to use.

Marked as Waiting for Author since it appears Tomas is waiting for a
response from you.

Regards,
-- 
-David
david@pgmasters.net
Andres Freund <andres@anarazel.de> writes:
>> offset_clause:
>> @@ -15435,6 +15442,7 @@ reserved_keyword:
>>  | ONLY
>>  | OR
>>  | ORDER
>> + | PERCENT
>>  | PLACING
>>  | PRIMARY
>>  | REFERENCES

> Are we really ok with adding a new reserved keyword 'PERCENT' for this?

I'm not. It doesn't look like it ought to be necessary to reserve it,
either, given that we don't have to reduce the production right there.
(If that doesn't work right away, try getting rid of row_or_rows in
favor of spelling out those alternatives in separate productions.)

More generally: using an undocumented list as the data structure for
select_limit's result was already a pretty bad idea, I think, and this
patch certainly makes it totally unreadable. Probably ought to refactor
that to use some sort of struct.

regards, tom lane
On Tue, Mar 26, 2019 at 03:06:52PM +0300, Surafel Temesgen wrote:
As for the OFFSET, I don't see why that would be incompatible with the
PERCENT clause. I'm not sure if it should compute the percentage from the
total number of rows (including those skipped by the OFFSET) or the rest,
but I assume the SQL standard says something about that.
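The two possible readings can be made concrete with a small sketch. This is purely illustrative Python (the function names are made up, and the SQL standard presumably settles which reading is correct):

```python
import math

def percent_of_total(rows, offset, percent):
    # Reading A: the percentage is taken from the full result set,
    # so rows skipped by OFFSET still count toward the total.
    n = math.ceil(len(rows) * percent / 100.0)
    return rows[offset:offset + n]

def percent_of_remainder(rows, offset, percent):
    # Reading B: the percentage is taken only from the rows that
    # remain after the OFFSET is applied.
    rest = rows[offset:]
    n = math.ceil(len(rest) * percent / 100.0)
    return rest[:n]
```

For 100 rows with OFFSET 50 and 10 PERCENT, reading A returns 10 rows while reading B returns 5 - the ambiguity being discussed.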
Hi,
On 2019-03-29 12:04:50 +0300, Surafel Temesgen wrote:
> + if (node->limitOption == PERCENTAGE)
> + {
> + while (node->position - node->offset < node->count)
> + {
> + slot = MakeSingleTupleTableSlot(tupleDescriptor, &TTSOpsMinimalTuple);
> + if (tuplestore_gettupleslot(node->tupleStore, true, true, slot))
There's several blocks like this - how's this not going to be a resource
leak? As far as I can tell you're just going to create new slots, and
their previous contents aren't going to be cleared? I think there very
rarely are cases where an executor node's *Next function (or its
subsidiaries) creates slots. Normally you ought to create them *once* on
the *Init function.
You might not directly leak memory, because this is probably running in
a short lived context, but you'll leak tuple desc references etc. (Or if
it were a heap buffer slot, buffer pins).
> + /* In PERCENTAGE case the result is already in tuplestore */
> + if (node->limitOption == PERCENTAGE)
> + {
> + slot = MakeSingleTupleTableSlot(tupleDescriptor, &TTSOpsMinimalTuple);
> + if (tuplestore_gettupleslot(node->tupleStore, false, false, slot))
> + {
> + node->subSlot = slot;
> + node->lstate = LIMIT_INWINDOW;
> + }
> + else
> + elog(ERROR, "LIMIT subplan failed to run backwards");
> + }
> + else if (node->limitOption == EXACT_NUMBER)
> + {
> /*
> * Backing up from subplan EOF, so re-fetch previous tuple; there
> * should be one! Note previous tuple must be in window.
> @@ -194,6 +329,7 @@ ExecLimit(PlanState *pstate)
> node->subSlot = slot;
> node->lstate = LIMIT_INWINDOW;
> /* position does not change 'cause we didn't advance it before */
> + }
The indentation here looks wrong...
> break;
>
> case LIMIT_WINDOWEND:
> @@ -278,17 +414,29 @@ recompute_limits(LimitState *node)
> /* Interpret NULL count as no count (LIMIT ALL) */
> if (isNull)
> {
> - node->count = 0;
> + node->count = 1;
> node->noCount = true;
Huh?
> }
> else
> {
> - node->count = DatumGetInt64(val);
> - if (node->count < 0)
> - ereport(ERROR,
> - (errcode(ERRCODE_INVALID_ROW_COUNT_IN_LIMIT_CLAUSE),
> - errmsg("LIMIT must not be negative")));
> - node->noCount = false;
> + if (node->limitOption == PERCENTAGE)
> + {
> + /*
> + * We expect to return at least one row (unless there
> + * are no rows in the subplan), and we'll update this
> + * count later as we go.
> + */
> + node->count = 0;
> + node->percent = DatumGetFloat8(val);
> + }
> + else
> + {
> + node->count = DatumGetInt64(val);
> + if (node->count < 0)
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_ROW_COUNT_IN_LIMIT_CLAUSE),
> + errmsg("LIMIT must not be negative")));
> + }
Shouldn't we error out with a negative count, regardless of PERCENT? Or
is that prevented elsewhere?
> }
> }
> else
> @@ -299,8 +447,10 @@ recompute_limits(LimitState *node)
> }
>
> /* Reset position to start-of-scan */
> - node->position = 0;
> + node->position = 0;;
unnecessary
> if (parse->sortClause)
> {
> - current_rel = create_ordered_paths(root,
> - current_rel,
> - final_target,
> - final_target_parallel_safe,
> - have_postponed_srfs ? -1.0 :
> - limit_tuples);
> +
> + /* In PERCENTAGE option there are no bound on the number of output tuples */
> + if (parse->limitOption == PERCENTAGE)
> + current_rel = create_ordered_paths(root,
> + current_rel,
> + final_target,
> + final_target_parallel_safe,
> + have_postponed_srfs ? -1.0 :
> + -1.0);
> + else
> + current_rel = create_ordered_paths(root,
> + current_rel,
> + final_target,
> + final_target_parallel_safe,
> + have_postponed_srfs ? -1.0 :
> + limit_tuples);
I'd rather not duplicate those two calls, and just have the limit_tuples
computation inside an if.
> offset_clause:
> @@ -15435,6 +15442,7 @@ reserved_keyword:
> | ONLY
> | OR
> | ORDER
> + | PERCENT
> | PLACING
> | PRIMARY
> | REFERENCES
Are we really ok with adding a new reserved keyword 'PERCENT' for this?
That seems awfully likely to already exist in columns etc. Is there a
chance we can use a less strong category?
On Tue, Mar 26, 2019 at 03:06:52PM +0300, Surafel Temesgen wrote:
>On Thu, Feb 28, 2019 at 11:16 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
>wrote:
>
>>
>> To give you a (admittedly, somewhat contrived and artificial example):
>>
>> SELECT * FROM t1 WHERE id IN (
>> SELECT id FROM t2 ORDER BY x FETCH FIRST 10 PERCENT ROWS ONLY
>> );
>>
>> Maybe this example is bogus and/or does not really matter in practice. I
>> don't know, but I've been unable to convince myself that's the case.
>
>
>does this means we abandon incremental approach? and am not sure of
>calculating
>percentage after OFFSET clause is acceptable or not
>
I don't follow. I don't think I've suggested to abandon the incremental
approach - I've explained why I think it's what the patch should be doing,
and illustrated that with an example.
On Fri, Apr 05, 2019 at 03:14:56PM +0300, Surafel Temesgen wrote:
>On Tue, Mar 26, 2019 at 5:46 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
>wrote:
>
>> On Tue, Mar 26, 2019 at 03:06:52PM +0300, Surafel Temesgen wrote:
>> >On Thu, Feb 28, 2019 at 11:16 PM Tomas Vondra <
>> tomas.vondra@2ndquadrant.com>
>> >wrote:
>> >
>> >>
>> >> To give you a (admittedly, somewhat contrived and artificial example):
>> >>
>> >> SELECT * FROM t1 WHERE id IN (
>> >>     SELECT id FROM t2 ORDER BY x FETCH FIRST 10 PERCENT ROWS ONLY
>> >> );
>> >>
>> >> Maybe this example is bogus and/or does not really matter in practice. I
>> >> don't know, but I've been unable to convince myself that's the case.
>> >
>> >
>> >does this means we abandon incremental approach? and am not sure of
>> >calculating
>> >percentage after OFFSET clause is acceptable or not
>> >
>>
>> I don't follow. I don't think I've suggested to abandon the incremental
>> approach - I've explained why I think it's what the patch should be doing,
>> and illustrated that with an example.
>>
>
>but it is more complex than the previous approach and it will be more
>complex
>in starting calculating percentage before offset row count. Doesn't
>simplicity prefer?
>

Sure, simplicity has its value. And often the simplest solution is the
best one. But in plenty of cases it's a tradeoff between simplicity and
efficiency, i.e. the simplest solution may perform poorly.

I've tried to explain why I think the incremental solution is the right
approach here - not having to always execute the subplan to completion.
I still haven't heard reasons why that argument is irrelevant (it may be).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jun 27, 2019 at 9:06 PM Surafel Temesgen <surafel3000@gmail.com> wrote:
> The attached patch include the fix for all the comment given

Hi Surafel,

There's a call to adjust_limit_rows_costs() hiding under
contrib/postgres_fdw, so this fails check-world.

-- 
Thomas Munro
https://enterprisedb.com
The following review has been posted through the commitfest application:

make installcheck-world:  not tested
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

The basic functionality works as I expect. In the following example I
would have guessed it would return 4 rows instead of 5. I don't mind that
it uses ceil here, but think that deserves a mention in the documentation.

CREATE TABLE r100 (id INT);
INSERT INTO r100 SELECT generate_series(1, 100);

SELECT * FROM r100 FETCH FIRST 4.01 PERCENT ROWS ONLY;
 id
----
  1
  2
  3
  4
  5
(5 rows)

There's a missing space between the period and the following sentence in
src\backend\executor\nodeLimit.c:

"previous time we got a different result.In PERCENTAGE option there are"

There's a missing space and the beginning "w" should be capitalized in
doc\src\sgml\ref\select.sgml:

with <literal>PERCENT</literal> count specifies the maximum number of rows to return in percentage.<literal>ROW</literal>

Another missing space after the period:

"previous time we got a different result.In PERCENTAGE option there are"

Ryan Lambert

The new status of this patch is: Waiting on Author
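The rounding observed above - 4.01 percent of 100 rows returning 5 rows - is consistent with taking ceil() of the fractional row count, so any fractional remainder rounds up to one more row. A sketch of that arithmetic (illustrative Python, not the patch's code):

```python
import math

def percent_row_count(total_rows, percent):
    # ceil() of the fractional count reproduces the observed behavior:
    # 4.01 percent of 100 rows -> ceil(4.01) -> 5 rows.
    return math.ceil(total_rows * percent / 100.0)
```

This also matches the later observation that FETCH FIRST .001 PERCENT of a 10-million-row table yields 100 rows.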
The v5 patch [1] applies cleanly and passes make installcheck-world.
My concern with this is its performance. As a user I would expect using this feature to enable queries to run faster than they would simply pulling the full table. I tested on tables ranging from 10k rows up to 10 million with the same basic result: using FETCH FIRST N PERCENT is slower than using the full table. At best it ran slightly slower than querying the full table; at worst it increased execution times by 1400% when using a high percentage (95%).
Testing was on a DigitalOcean 2 CPU, 2GB RAM droplet. All queries were executed multiple times (6+) with EXPLAIN (ANALYZE, COSTS) and \timing enabled; the query plans provided are from the faster end of each query.
Create the test table:
DROP TABLE IF EXISTS r10mwide;
CREATE TABLE r10mwide AS
SELECT generate_series(1, 10000000)::INT AS id,
random() AS v1,
random() AS v2,
random() AS v3,
random() AS v4,
random() AS v5,
random() AS v6,
random() AS v7,
random() AS v8,
random() AS v9,
random() AS v10
;
VACUUM ANALYZE;
Start with a baseline query from the full table; the limiting will be done within the CTE, as I would typically do in a production query. The outer query does basic aggregates. Below I provide each full query for easy copy/paste, but the only real change is how the inner results are limited. This runs in 2.2s off the full table.
EXPLAIN (ANALYZE, COSTS)
WITH t AS (
SELECT id, v1, v2
FROM r10mwide
) SELECT AVG(v1), MIN(v1), AVG(v1 + v2) FROM t
;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=227192.07..227192.08 rows=1 width=24) (actual time=2230.419..2230.419 rows=1 loops=1)
-> Gather (cost=227191.84..227192.05 rows=2 width=72) (actual time=2228.321..2231.225 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=226191.84..226191.85 rows=1 width=72) (actual time=2218.754..2218.754 rows=1 loops=3)
-> Parallel Seq Scan on r10mwide (cost=0.00..184524.92 rows=4166692 width=16) (actual time=0.032..934.652 rows=3333333 loops=3)
Planning Time: 0.116 ms
Execution Time: 2231.935 ms
(8 rows)
Time: 2232.459 ms (00:02.232)
It did use parallel query; since the FETCH FIRST queries apparently won't, I turned that off and ran it again.
SET max_parallel_workers_per_gather = 0;
New query plan for full table, no parallel, just over 4 seconds.
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=342859.21..342859.22 rows=1 width=24) (actual time=4253.861..4253.862 rows=1 loops=1)
-> Seq Scan on r10mwide (cost=0.00..242858.60 rows=10000060 width=16) (actual time=0.062..1728.346 rows=10000000 loops=1)
Planning Time: 0.082 ms
Execution Time: 4253.918 ms
(4 rows)
The following uses an explicit row count to pull 100k rows (1%) and gives a massive improvement in speed; this query ranged from 55-120 ms. This is the type of speedup I would expect, and it beats the full-table query hands down, regardless of parallel query.
EXPLAIN (ANALYZE, COSTS)
WITH t AS (
SELECT id, v1, v2
FROM r10mwide
FETCH FIRST 100000 ROWS ONLY
) SELECT AVG(v1), MIN(v1), AVG(v1 + v2) FROM t
;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=4428.58..4428.59 rows=1 width=24) (actual time=55.963..55.963 rows=1 loops=1)
-> Limit (cost=0.00..2428.57 rows=100000 width=20) (actual time=0.041..35.725 rows=100000 loops=1)
-> Seq Scan on r10mwide (cost=0.00..242858.60 rows=10000060 width=20) (actual time=0.039..24.796 rows=100000 loops=1)
Planning Time: 0.134 ms
Execution Time: 56.032 ms
(5 rows)
Time: 57.262 ms
Using FETCH FIRST 1 PERCENT it takes over 7 seconds, 71% slower than querying the full table.
EXPLAIN (ANALYZE, COSTS)
WITH t AS (
SELECT id, v1, v2
FROM r10mwide
FETCH FIRST 1 PERCENT ROWS ONLY
) SELECT AVG(v1), MIN(v1), AVG(v1 + v2) FROM t
;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=6857.22..6857.23 rows=1 width=24) (actual time=7112.205..7112.206 rows=1 loops=1)
-> Limit (cost=2428.60..4857.19 rows=100001 width=20) (actual time=0.036..7047.394 rows=100000 loops=1)
-> Seq Scan on r10mwide (cost=0.00..242858.60 rows=10000060 width=20) (actual time=0.033..3072.881 rows=10000000 loops=1)
Planning Time: 0.132 ms
Execution Time: 7214.302 ms
(5 rows)
Time: 7215.278 ms (00:07.215)
Cranking the percentage up to 95% took 59-75 seconds.
EXPLAIN (ANALYZE, COSTS)
WITH t AS (
SELECT id, v1, v2
FROM r10mwide
FETCH FIRST 95 PERCENT ROWS ONLY
) SELECT AVG(v1), MIN(v1), AVG(v1 + v2) FROM t
;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=651432.48..651432.49 rows=1 width=24) (actual time=58981.043..58981.044 rows=1 loops=1)
-> Limit (cost=230715.67..461431.34 rows=9500057 width=20) (actual time=0.017..55799.389 rows=9500000 loops=1)
-> Seq Scan on r10mwide (cost=0.00..242858.60 rows=10000060 width=20) (actual time=0.014..3847.146 rows=10000000 loops=1)
Planning Time: 0.117 ms
Execution Time: 59079.680 ms
(5 rows)
Even taking it way down to .001% (100 rows!) didn't run fast by any measure: more than 6 seconds.
EXPLAIN (ANALYZE, COSTS)
WITH t AS (
SELECT id, v1, v2
FROM r10mwide
FETCH FIRST .001 PERCENT ROWS ONLY
) SELECT AVG(v1), MIN(v1), AVG(v1 + v2) FROM t
;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=6.86..6.87 rows=1 width=24) (actual time=6414.406..6414.406 rows=1 loops=1)
-> Limit (cost=2.43..4.86 rows=100 width=20) (actual time=0.021..6413.981 rows=100 loops=1)
-> Seq Scan on r10mwide (cost=0.00..242858.60 rows=10000060 width=20) (actual time=0.017..3086.504 rows=10000000 loops=1)
Planning Time: 0.168 ms
Execution Time: 6495.686 ms
[1] https://www.postgresql.org/message-id/attachment/102333/percent-incremental-v5.patch
Hi Thomas,

Thank you for informing me.

> Hi Surafel,
>
> There's a call to adjust_limit_rows_costs() hiding under
> contrib/postgres_fdw, so this fails check-world.

Fixed. I also include the review given by Ryan in the attached patch.

regards
Surafel
> Hi Surafel,
>
> The v5 patch [1] applies cleanly and passes make installcheck-world.
>
> My concern with this is its performance. As a user I would expect using this feature to enable queries to run faster than they would simply pulling the full table. I tested on tables ranging from 10k rows up to 10 million with the same basic result that using FETCH FIRST N PERCENT is slower than using the full table. At best it ran slightly slower than querying the full table, at worst it increased execution times by 1400% when using a large high percentage (95%).

The cost of FETCH FIRST N PERCENT execution in the current implementation
is the cost of pulling the full table plus the cost of storing and
fetching the tuples from the tuplestore, so it cannot perform better than
pulling the full table in any case. This is because we can't determine
the number of rows to return without executing the plan to the end. The
planner can estimate the number of rows that will be returned, but that
estimate is not exact.
regards
Surafel
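A minimal model of the strategy just described (illustrative Python, not the executor code): the whole input has to be spooled before the row count - and hence the limit - is known, which is why the full-scan cost is unavoidable in this design.

```python
import math

def fetch_percent_via_store(scan, percent):
    # Pull the entire input into a store (the list stands in for the
    # tuplestore); only then is len(store) known, so only then can the
    # exact limit be computed.
    store = list(scan)
    limit = math.ceil(len(store) * percent / 100.0)
    return store[:limit]
```

Even asking for 1 percent of the rows still consumes the whole scan, unlike the incremental scheme discussed earlier in the thread.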
I did some more testing. I initialized a database with 1 million rows with indexes and joins to test against and ran pgbench with a few different settings for % to return. I started with a base query not utilizing the new functionality. The queries used are similar to my prior examples, code at [1].
createdb bench_test
psql -d bench_test -f init/reporting.sql -v scale=10
The following provided 3.21 TPS and an average latency of 623 ms. The "per_change_" columns in the table below use those baseline values.
pgbench -c 2 -j 2 -T 600 -P 60 -s 10 \
-f tests/reporting1.sql bench_test
The remainder of the tests use the following, only adjusting fetch_percent value:
pgbench -c 2 -j 2 -T 600 -P 60 -s 10 \
--define=fetch_percent=1 \
-f tests/reporting_fetch_percent.sql \
bench_test
Returning 1% it runs well. By 10% the TPS drops by 30% while the average latency increases by 43%. When returning 95% of the table latency has increased by 548%.
fetch_percent | tps | latency_avg_ms | per_change_tps | per_change_latency
---------------+------+----------------+----------------+--------------------
1 | 3.37 | 593 | 0.05 | -0.05
5 | 2.85 | 700 | -0.11 | 0.12
10 | 2.24 | 891 | -0.30 | 0.43
25 | 1.40 | 1423 | -0.56 | 1.28
45 | 0.93 | 2147 | -0.71 | 2.45
95 | 0.49 | 4035 | -0.85 | 5.48
I manually tested the inner select queries without the outer aggregation thinking it might be a different story with a simple select and no CTE. Unfortunately it showed the same overall characteristics. 1% returns in about 550 ms, 45% took 1950, and 95% took 4050.
[1] https://github.com/rustprooflabs/pgbench-tests
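For reference, the per_change columns in the table above can be recomputed from the baseline run (3.21 TPS, 623 ms average latency). A small sketch of that arithmetic (illustrative Python, names made up here):

```python
def per_change(value, baseline):
    # fractional change relative to the baseline run,
    # rounded to two places as in the table above
    return round(value / baseline - 1, 2)

BASE_TPS = 3.21        # baseline reporting1.sql run
BASE_LATENCY_MS = 623

results = [  # (fetch_percent, tps, latency_avg_ms) from the table above
    (1, 3.37, 593), (5, 2.85, 700), (10, 2.24, 891),
    (25, 1.40, 1423), (45, 0.93, 2147), (95, 0.49, 4035),
]
changes = {p: (per_change(t, BASE_TPS), per_change(l, BASE_LATENCY_MS))
           for p, t, l in results}
```

The computed values reproduce the reported per_change_tps and per_change_latency columns, e.g. 0.49 TPS at 95% is a -0.85 (85%) drop and 4035 ms is a 5.48 (548%) latency increase.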
Hello.

At Tue, 9 Jul 2019 21:56:32 -0600, Ryan Lambert <ryan@rustprooflabs.com> wrote in <CAN-V+g-rwFp=xQEjOwbJuggNLegMi1qDhaJt3h1Eqm16yqwqmw@mail.gmail.com>
> I did some more testing. I initialized a database with 1 million rows with
> indexes and joins to test against and ran pgbench with a few different
> settings for % to return. I started with a base query not utilizing the
> new functionality. The queries used are similar to my prior examples, code
> at [1].
>
> createdb bench_test
> psql -d bench_test -f init/reporting.sql -v scale=10
>
> The following provided 3.21 TPS and an average latency of 623. The
> "per_change_" columns in the table below use those values.
>
> pgbench -c 2 -j 2 -T 600 -P 60 -s 10 \
>     -f tests/reporting1.sql bench_test
>
> The remainder of the tests use the following, only adjusting fetch_percent
> value:
>
> pgbench -c 2 -j 2 -T 600 -P 60 -s 10 \
>     --define=fetch_percent=1 \
>     -f tests/reporting_fetch_percent.sql \
>     bench_test
>
> Returning 1% it runs well. By 10% the TPS drops by 30% while the average
> latency increases by 43%. When returning 95% of the table latency has
> increased by 548%.
>
>  fetch_percent | tps  | latency_avg_ms | per_change_tps | per_change_latency
> ---------------+------+----------------+----------------+--------------------
>              1 | 3.37 |            593 |           0.05 |              -0.05
>              5 | 2.85 |            700 |          -0.11 |               0.12
>             10 | 2.24 |            891 |          -0.30 |               0.43
>             25 | 1.40 |           1423 |          -0.56 |               1.28
>             45 | 0.93 |           2147 |          -0.71 |               2.45
>             95 | 0.49 |           4035 |          -0.85 |               5.48
>
> I manually tested the inner select queries without the outer aggregation
> thinking it might be a different story with a simple select and no CTE.
> Unfortunately it showed the same overall characteristics. 1% returns in
> about 550 ms, 45% took 1950, and 95% took 4050.
>
> [1] https://github.com/rustprooflabs/pgbench-tests

It is seen by a simpler test.
create table t as select a from generate_series(0, 99999) a;
analyze t;
explain analyze select * from t order by a desc;
  Execution Time: 116.613 ms
explain analyze select * from t order by a desc fetch first 1 percent rows only;
  Execution Time: 158.458 ms
explain analyze select * from t order by a desc fetch first 100 percent rows only;
  Execution Time: 364.442 ms

I didn't look closer at the version. Fetching from tuplestore
and returning all tuples costs 206ms and it is exceeding the cost
of fetching of the whole table and returning all tuples. I don't
believe tuplestore that isn't spilling out to disk is so much slower
than (cached) table access.

Other than that, we can rip the clause if it is 100%

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Hello.

At Wed, 10 Jul 2019 15:02:57 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in <20190710.150257.260806103.horikyota.ntt@gmail.com>
> It is seen by a simpler test.
>
> create table t as select a from generate_series(0, 99999) a;
> analyze t;
> explain analyze select * from t order by a desc;
>   Execution Time: 116.613 ms
> explain analyze select * from t order by a desc fetch first 1 percent rows only;
>   Execution Time: 158.458 ms
> explain analyze select * from t order by a desc fetch first 100 percent rows only;
>   Execution Time: 364.442 ms
>
> I didn't looked closer to the version. Fetching from tuplestore
> and returning all tuples costs 206ms and it is exceeding the cost
> of fething of the whole table and returning all tuples. I don't
> believe tuplestore that isn't splling out to disk is so slower
> than (cached) table access.
>
> Other than that, we can rip the clause if it is 100%

As a more significant point, I found that the first query in the
aboves runs faster by about 10-18% on master (unpatched).

explain analyze select * from t order by a desc;
  Execution Time: 96.690 ms

But perf didn't give me useful information.

patched:
11857  11.7065  postgres      qsort_ssup
 9026   8.9114  postgres      ApplySortComparator
 6443   6.3612  [vdso] (tgid:8388 range:0x7ffed49ed000-0x7ffed49eefff) [vdso] (tgid:8388 range:0x7ffed49ed000-0x7ffed49eefff)
 5826   5.7520  postgres      btint4fastcmp
 4699   4.6393  no-vmlinux    /no-vmlinux
 3451   3.4072  libc-2.17.so  __memcpy_ssse3_back
 3270   3.2285  postgres      LogicalTapeWrite
 2972   2.9343  postgres      copytup_heap
 2961   2.9234  postgres      readtup_heap
 2769   2.7338  postgres      LogicalTapeRead
 2457   2.4258  postgres      GetMemoryChunkContext
 2147   2.1197  postgres      InstrStopNode
 2021   1.9953  postgres      heapgettup_pagemode
 1583   1.5629  postgres      writetup_heap
 1555   1.5353  postgres      tuplesort_gettuple_common
 1508   1.4889  postgres      AllocSetAlloc
...
master: 12932 12.0168 postgres qsort_ssup 9491 8.8193 postgres ApplySortComparator 6705 6.2305 postgres btint4fastcmp 6557 6.0930 [vdso] (tgid:6341 range:0x7ffdd0315000-0x7ffdd0316fff) [vdso] (tgid:6341 range:0x7ffdd0315000-0x7ffdd0316fff) 4874 4.5291 no-vmlinux /no-vmlinux 4059 3.7717 postgres readtup_heap 3707 3.4447 libc-2.17.so __memcpy_ssse3_back 3583 3.3294 postgres LogicalTapeWrite 3382 3.1427 postgres LogicalTapeRead 3001 2.7886 postgres copytup_heap 2522 2.3435 postgres GetMemoryChunkContext 2464 2.2896 postgres heapgettup_pagemode 2115 1.9653 postgres InstrStopNode 1847 1.7163 postgres tuplesort_gettuple_common 1652 1.5351 postgres writetup_heap 1565 1.4542 postgres AllocSetAlloc regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2019-Jul-08, Surafel Temesgen wrote: > Hi Thomas, > Thank you for informing me > > Hi Surafel, > > > > There's a call to adjust_limit_rows_costs() hiding under > > contrib/postgres_fdw, so this fails check-world. > > > > Fixed . I also include the review given by Ryan in attached patch What's with the new tuplestore function for getting heap tuples? That looks really odd. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
What's with the new tuplestore function for getting heap tuples? That
looks really odd.
Previously I created a new TTSOpsMinimalTuple slot for every tuple returned in order to fetch it from the tuplestore, because the tuplestore stores tuples in MinimalTuple format and the created slot's format changes in subsequent operations. This caused a resource leak, as Andres mentioned, and I created this function to remove the need for creating a new slot for every returned tuple because of the format difference
regards
Surafel
"It is possible for FETCH FIRST N PERCENT to create poorly performing query plans when the N supplied exceeds 50 percent. In these cases query execution can take an order of magnitude longer to execute than simply returning the full table. If performance is critical using an explicit row count for limiting is recommended."
I don’t understand how the fetch first n percent functionality can be replaced with explicit row count limiting. There may be a way to do it on the client side, but we cannot be sure of its performance advantage
regards
Surafel
Hi Ryan,
On Tue, Jul 9, 2019 at 4:13 PM Ryan Lambert <ryan@rustprooflabs.com> wrote:
> "It is possible for FETCH FIRST N PERCENT to create poorly performing query plans when the N supplied exceeds 50 percent. In these cases query execution can take an order of magnitude longer to execute than simply returning the full table. If performance is critical using an explicit row count for limiting is recommended."
I don’t understand how fetch first n percent functionality can be replaced with explicit row count limiting. There may be a way to do it in a client side but we can not be sure of its performance advantage
regards
Surafel
"Using <literal>PERCENT</literal> is best suited to returning single-digit percentages of the query's total row count."
Other than that, we can rip the clause if it is 100%
I was suggesting a warning in the documentation so users aren't caught unaware about the performance characteristics. My first version was very rough, how about adding this in doc/src/sgml/ref/select.sgml?
"Using <literal>PERCENT</literal> is best suited to returning single-digit percentages of the query's total row count."
The following paragraphs in that same section give suggestions and warnings regarding LIMIT and OFFSET usage, I think this is more in line with the wording of those existing warnings.
> Other than that, we can rip the clause if it is 100%
You mean if PERCENT=100 it should short circuit and run the query normally? I like that.
Attachment
The patch does not do it automatically. It is currently the query writer's obligation to do that.
plan = (Plan *) make_limit(plan,
subparse->limitOffset,
- subparse->limitCount);
+ subparse->limitCount,
+ subparse->limitOption);
I assume the limit percentage number goes into subparse->limitCount? If so, I don't see that documented. Or does this truly only store the count?
Attachment
Hello. At Thu, 1 Aug 2019 15:54:11 +0300, Surafel Temesgen <surafel3000@gmail.com> wrote in <CALAY4q_icXPTcAVxnPckpLGrZgER9-FWM0VPFvvOsPiJrnK6Dg@mail.gmail.com> > > Other than that, we can rip the clause if it is 100% > > > > > > You mean if PERCENT=100 it should short circuit and run the query > > normally? I like that. > > > > The patch did not did it automatically. Its query writer obligation to do > that currently I have some comments. This patch uses distinct parameters for exact number and percentage. On the othe hand planner has a notion of tuple_fraction covering the both. The same handling is also used for tuple number estimation. Using the same scheme will make things far simpler. See the comment of grouping_planner(). In executor part, addition to LimiteState.position, this patch uses LimiteState.backwardPosition to count how many tuples we're back from the end of the current tuplestore. But things will get simpler by just asking the tuplestore for the number of holding tuples. + slot = node->subSlot; Why is this needed? The variable is properly set before use and the assignment is bogus in the first place. The new code block in LIMIT_RESCAN in ExecLimit is useless since it is exatctly the same with existing code block. Why didn't you use the existing if block? > if (node->limitOption == PERCENTAGE) + { + node->perExactCount = ceil(node->percent * node->position / 100.0); + + node->position is the number of tuples returned to upper node so far (not the number of tuples this node has read from the lower node so far). I don't understand what the expression means. + if (node->perExactCount == node->perExactCount + 1) + node->perExactCount++; What? The condition never be true. As the result, the following if block below won't run. > /* + * Return the tuple up to the number of exact count in OFFSET + * clause without percentage value consideration. 
+ */ + if (node->perExactCount > 0) + { + + /* + * We may needed this tuple in backward scan so put it into + * tuplestore. + */ + if (node->limitOption == PERCENTAGE) + { + tuplestore_puttupleslot(node->tupleStore, slot); + tuplestore_advance(node->tupleStore, true); + } "needed"->"need" ? The comment says that this is needed for backward scan, but it is actually required even in forward scan. More to the point, tuplestore_advance lacks comment. Anyway, the code in LIMIT_RESCAN is broken in some ways. For example, if it is used as the inner scan of a NestLoop, the tuplestore accumulates tuples by every scan. You will see that the tuplestore stores up to 1000 tuples (10 times of the inner) by the following query. create table t0 as (select a from generate_series(0, 99) a); create table t1 as (select a from generate_series(0, 9) a); analyze; set enable_hashjoin to false; set enable_mergejoin to false; set enable_material to false; explain analyze select t.*, t1.* from (select a from t0 fetch first 10 percent rows only) as t join t1 on (t.a = t1.a); QUERY PLAN ------------------------------------------------------------------------------- Nested Loop (cost=0.20..7.35 rows=10 width=8) (actual ...) -> Seq Scan on t1 (cost=0.00..1.10 rows=10 width=4) (actual ...) -> Limit (cost=0.20..0.40 rows=10 width=4) (actual ...) -> Seq Scan on t0 (cost=0.00..2.00 rows=100 width=4) (actual ...) That's it for now. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
I have some comments.
This patch uses distinct parameters for the exact number and the
percentage. On the other hand the planner has a notion of
tuple_fraction covering both. The same handling is also used
for tuple number estimation. Using the same scheme will make
things far simpler. See the comment of grouping_planner().
In the executor part, in addition to LimitState.position, this patch
uses LimitState.backwardPosition to count how many tuples we're
back from the end of the current tuplestore. But things will get
simpler by just asking the tuplestore for the number of tuples
it holds.
+ slot = node->subSlot;
Why is this needed? The variable is properly set before use and
the assignment is bogus in the first place.
The new code block in LIMIT_RESCAN in ExecLimit is useless since
it is exactly the same as the existing code block. Why didn't you
use the existing if block?
> if (node->limitOption == PERCENTAGE)
+ {
+ node->perExactCount = ceil(node->percent * node->position / 100.0);
+
+
node->position is the number of tuples returned to the upper node so
far (not the number of tuples this node has read from the lower
node so far). I don't understand what the expression means.
node->position holds the number of tuples this node has read from the lower
node so far. see LIMIT_RESCAN state
+ if (node->perExactCount == node->perExactCount + 1)
+ node->perExactCount++;
What? The condition can never be true. As a result, the following
if block won't run.
> /*
+ * Return the tuple up to the number of exact count in OFFSET
+ * clause without percentage value consideration.
+ */
+ if (node->perExactCount > 0)
+ {
+
+ /*
+ * We may needed this tuple in backward scan so put it into
+ * tuplestore.
+ */
+ if (node->limitOption == PERCENTAGE)
+ {
+ tuplestore_puttupleslot(node->tupleStore, slot);
+ tuplestore_advance(node->tupleStore, true);
+ }
"needed"->"need" ? The comment says that this is needed for
backward scan, but it is actually required even in forward
Anyway, the code in LIMIT_RESCAN is broken in some ways. For
example, if it is used as the inner scan of a NestLoop, the
tuplestore accumulates tuples on every scan. You will see that
the tuplestore stores up to 1000 tuples (10 times the inner)
with the following query.
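(Aside, not part of the original mail.) The incremental behaviour under review — emit a buffered tuple as soon as the percentage limit implied by the rows read so far already covers it — can be modelled outside the executor. This is an illustrative Python sketch, not the patch's C code; it relies only on ceil(read × percent / 100) being non-decreasing as more rows arrive:

```python
import math

def fetch_first_percent(rows, percent):
    """Illustrative model of an incremental FETCH FIRST n PERCENT.

    Rows are buffered (the executor would use a tuplestore) because the
    final limit is only known once the lower node is exhausted, but a
    buffered row can be emitted early: ceil(read * percent / 100) never
    decreases, so any row inside today's limit is inside the final one.
    """
    store = []           # stands in for the tuplestore
    emitted = 0
    for read, row in enumerate(rows, start=1):
        store.append(row)
        while emitted < math.ceil(read * percent / 100.0):
            yield store[emitted]
            emitted += 1

print(list(fetch_first_percent(range(10), 25)))   # 25% of 10 rows -> [0, 1, 2]
print(list(fetch_first_percent(range(10), 100)))  # 100% streams every row
```

Note the buffer here belongs to a single generator invocation, so a fresh scan starts from an empty store — which is exactly the property the NestLoop example above shows the patched executor failing to preserve.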
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: tested, passed
The latest patch [1] and the cleanup patch [2] apply cleanly to the latest master (5f110933). I reviewed the conversation and don't see any outstanding questions or concerns. Updating status to ready for committer.
[1] https://www.postgresql.org/message-id/attachment/103028/percent-incremental-v6.patch
[2] https://www.postgresql.org/message-id/attachment/103157/percent-incremental-v6-comment-cleanup.patch
Ryan Lambert
The new status of this patch is: Ready for Committer
On 2019-08-19 01:33, Ryan Lambert wrote: > The following review has been posted through the commitfest > application: > make installcheck-world: tested, passed > Implements feature: tested, passed > Spec compliant: not tested > Documentation: tested, passed > > The latest patch [1] and the cleanup patch [2] apply cleanly to the > latest master (5f110933). I reviewed the conversation and don't see > any outstanding questions or concerns. Updating status to ready for > committer. > > [1] > > https://www.postgresql.org/message-id/attachment/103028/percent-incremental-v6.patch > [2] > > https://www.postgresql.org/message-id/attachment/103157/percent-incremental-v6-comment-cleanup.patch > > Ryan Lambert > > The new status of this patch is: Ready for Committer Hi, (running with those two patches applied) select * from onek where thousand < 5 order by thousand fetch first -1 percent rows only is correctly caught (with "ERROR: PERCENT must not be negative") but: select * from onek where thousand < 5 order by thousand fetch first 101 percent rows only is not. It doesn't return, and cannot be Ctrl-C'ed out of, which I guess is another bug? thanks, Erik Rijkers
Hi,
(running with those two patches applied)
select * from onek where thousand < 5 order by thousand fetch first -1
percent rows only
is correctly caught (with "ERROR: PERCENT must not be negative") but:
select * from onek where thousand < 5 order by thousand fetch first
101 percent rows only
is not. It doesn't return, and cannot be Ctrl-C'ed out of, which I guess
is another bug?
Attachment
On 2019-08-19 11:18, Surafel Temesgen wrote: > > [..] > > [percent-incremental-v7.patch] Thanks. Another little thing, not sure it's a bug: limit interprets its argument by rounding up or down as one would expect: table onek limit 10.4; --> gives 10 rows table onek limit 10.6; --> gives 11 rows but FETCH count PERCENT does not do that; it rounds always up. select * from (table onek limit 100) f fetch first 10.4 percent rows only; --> gives 11 rows select * from (table onek limit 100) f fetch first 10.6 percent rows only; --> gives 11 rows I see that it's documented in the .sgml to behave as it does, but it seems wrong to me; shouldn't that 10.4-percent-query yield 10 rows instead of 11? thanks, Erik Rijkers
Another little thing, not sure it's a bug:
limit interprets its argument by rounding up or down as one would
expect:
table onek limit 10.4; --> gives 10 rows
table onek limit 10.6; --> gives 11 rows
but FETCH count PERCENT does not do that; it always rounds up.
select * from (table onek limit 100) f fetch first 10.4 percent rows
only; --> gives 11 rows
select * from (table onek limit 100) f fetch first 10.6 percent rows
only; --> gives 11 rows
I see that it's documented in the .sgml to behave as it does, but it
seems wrong to me;
shouldn't that 10.4-percent-query yield 10 rows instead of 11?
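(Aside, not part of the original mail.) The two behaviours Erik contrasts can be modelled in a few lines of Python — LIMIT rounds its fractional argument to the nearest integer, while the patch, per its documentation, always rounds the computed percentage up:

```python
import math

total = 100  # row count of the subquery, as in Erik's example

# LIMIT 10.4 / LIMIT 10.6: the fractional count is rounded to nearest
print(round(10.4), round(10.6))          # -> 10 11

# FETCH FIRST x PERCENT as documented: always round up
print(math.ceil(total * 10.4 / 100.0))   # -> 11
print(math.ceil(total * 10.6 / 100.0))   # -> 11
```

Switching the ceil to a nearest-integer rounding would make the 10.4-percent query return 10 rows, matching LIMIT's behaviour.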
Hi, At Wed, 7 Aug 2019 10:20:09 +0300, Surafel Temesgen <surafel3000@gmail.com> wrote in <CALAY4q98xbVHtZ4yj9DcCMG2-s1_JuRr7fYaNfW+bKmr22OiVQ@mail.gmail.com> > Hi > On Wed, Aug 7, 2019 at 6:11 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> > wrote: > > > > > I have some comments. > > > > This patch uses distinct parameters for exact number and > > percentage. On the othe hand planner has a notion of > > tuple_fraction covering the both. The same handling is also used > > for tuple number estimation. Using the same scheme will make > > things far simpler. See the comment of grouping_planner(). > > > > > Its because of data type difference .In planner the data type is the same I meant that, with the usage of tuple_fraction, one variable can represent both the absolute limit and relative limit. This simplifies parse structs. > > In executor part, addition to LimiteState.position, this patch > > uses LimiteState.backwardPosition to count how many tuples we're > > back from the end of the current tuplestore. But things will get > > simpler by just asking the tuplestore for the number of holding > > tuples. > > > > > backwardPosition hold how many tuple returned in backward scan I meant that we don't need to hold the number in estate. > > + slot = node->subSlot; > > > > Why is this needed? The variable is properly set before use and > > the assignment is bogus in the first place. > > > > > its because Tuplestore needs initialized slot. I meant that the initilized slot is overwritten before first use. > > The new code block in LIMIT_RESCAN in ExecLimit is useless since > > it is exatctly the same with existing code block. Why didn't you > > use the existing if block? > > > > But they test different scenario I meant that the two different scenario can share the code block. 
> > > if (node->limitOption == PERCENTAGE) > > + { > > + node->perExactCount = ceil(node->percent * > > node->position / 100.0); > > + > > + > > > > node->position is the number of tuples returned to upper node so > > far (not the number of tuples this node has read from the lower > > node so far). I don't understand what the expression means. > > > > node->position hold the number of tuples this node has read from the lower > node so far. see LIMIT_RESCAN state Reallly? node->position is incremented when tuplestore_gettupleslot_heaptuple() succeeded and reutnrs the tuple to the caller immediately... > > + if (node->perExactCount == node->perExactCount + 1) > > + node->perExactCount++; > > > > What? The condition never be true. As the result, the following > > if block below won't run. > > > > it became true according to the number of tuple returned in from the lower > node so far > and percentage specification. Mmm. How do you think X can be equal to (X + 1)? > > > /* > > + * Return the tuple up to the number of exact count in > > OFFSET > > + * clause without percentage value consideration. > > + */ > > + if (node->perExactCount > 0) > > + { > > + > > > > > > > > > > + /* > > + * We may needed this tuple in backward scan so put it into > > + * tuplestore. > > + */ > > + if (node->limitOption == PERCENTAGE) > > + { > > + tuplestore_puttupleslot(node->tupleStore, slot); > > + tuplestore_advance(node->tupleStore, true); > > + } > > > > "needed"->"need" ? The comment says that this is needed for > > backward scan, but it is actually required even in forward > > scan. More to the point, tuplestore_advance lacks comment. > > > ok > > > > > > Anyway, the code in LIMIT_RESCAN is broken in some ways. For > > example, if it is used as the inner scan of a NestLoop, the > > tuplestore accumulates tuples by every scan. You will see that > > the tuplestore stores up to 1000 tuples (10 times of the inner) > > by the following query. 
> > It this because in percentage we scan the whole table
It's useless and rather harmful that the tuplestore holds an indefinite number of duplicate sets of the whole tuples from the lower node. We must reuse tuples already stored in the tuplestore or clear it before the next round.
regards.
-- Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
At Wed, 7 Aug 2019 10:20:09 +0300, Surafel Temesgen <surafel3000@gmail.com> wrote in <CALAY4q98xbVHtZ4yj9DcCMG2-s1_JuRr7fYaNfW+bKmr22OiVQ@mail.gmail.com>
> Hi
> On Wed, Aug 7, 2019 at 6:11 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> wrote:
>
> >
> > I have some comments.
> >
> > This patch uses distinct parameters for exact number and
> > percentage. On the othe hand planner has a notion of
> > tuple_fraction covering the both. The same handling is also used
> > for tuple number estimation. Using the same scheme will make
> > things far simpler. See the comment of grouping_planner().
> >
> >
> Its because of data type difference .In planner the data type is the same
I meant that, with the usage of tuple_fraction, one variable can
represent both the absolute limit and relative limit. This
simplifies parse structs.
In grouping_planner the patch sets the tuple bound to -1 in create_ordered_paths because, with a percentage, it iterates until the end. Does that answer your question?
> > In executor part, addition to LimiteState.position, this patch
> > uses LimiteState.backwardPosition to count how many tuples we're
> > back from the end of the current tuplestore. But things will get
> > simpler by just asking the tuplestore for the number of holding
> > tuples.
> >
> >
> backwardPosition hold how many tuple returned in backward scan
I meant that we don't need to hold the number in estate.
I did it this way because I didn't find an API in tuplestore to do this
> > + slot = node->subSlot;
> >
> > Why is this needed? The variable is properly set before use and
> > the assignment is bogus in the first place.
> >
> >
> its because Tuplestore needs initialized slot.
I meant that the initialized slot is overwritten before its first use.
> > The new code block in LIMIT_RESCAN in ExecLimit is useless since
> > it is exatctly the same with existing code block. Why didn't you
> > use the existing if block?
> >
>
> But they test different scenario
I meant that the two different scenario can share the code block.
Sorry for not clarifying well. One has to be checked before offsetting and the other after offsetting.
> > > if (node->limitOption == PERCENTAGE)
> > + {
> > + node->perExactCount = ceil(node->percent *
> > node->position / 100.0);
> > +
> > +
> >
> > node->position is the number of tuples returned to upper node so
> > far (not the number of tuples this node has read from the lower
> > node so far). I don't understand what the expression means.
> >
>
> node->position hold the number of tuples this node has read from the lower
> node so far. see LIMIT_RESCAN state
Really? node->position is incremented when
tuplestore_gettupleslot_heaptuple() succeeds and the tuple is
returned to the caller immediately...
> > + if (node->perExactCount == node->perExactCount + 1)
> > + node->perExactCount++;
> >
> > What? The condition never be true. As the result, the following
> > if block below won't run.
> >
>
> it became true according to the number of tuple returned in from the lower
> node so far
> and percentage specification.
Mmm. How do you think X can be equal to (X + 1)?
> > > /*
> > + * Return the tuple up to the number of exact count in
> > OFFSET
> > + * clause without percentage value consideration.
> > + */
> > + if (node->perExactCount > 0)
> > + {
> > +
> >
> >
> >
> >
> > + /*
> > + * We may needed this tuple in backward scan so put it into
> > + * tuplestore.
> > + */
> > + if (node->limitOption == PERCENTAGE)
> > + {
> > + tuplestore_puttupleslot(node->tupleStore, slot);
> > + tuplestore_advance(node->tupleStore, true);
> > + }
> >
> > "needed"->"need" ? The comment says that this is needed for
> > backward scan, but it is actually required even in forward
> > scan. More to the point, tuplestore_advance lacks comment.
>
>
> ok
>
>
> >
> > Anyway, the code in LIMIT_RESCAN is broken in some ways. For
> > example, if it is used as the inner scan of a NestLoop, the
> > tuplestore accumulates tuples by every scan. You will see that
> > the tuplestore stores up to 1000 tuples (10 times of the inner)
> > by the following query.
> >
>
> It this because in percentage we scan the whole table
It's useless and rather harmful that the tuplestore holds an
indefinite number of duplicate sets of the whole tuples from the
lower node. We must reuse tuples already stored in the tuplestore
or clear it before the next round.
Attachment
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed
Hi everyone,
The v8 patch [1] applies cleanly and passes tests. I reviewed the conversation to date and from what I can see, all identified bugs have been fixed and concerns either fixed or addressed. Marking as ready for committer.
[1] https://www.postgresql.org/message-id/attachment/103486/percent-incremental-v8.patch
Ryan Lambert
The new status of this patch is: Ready for Committer
Surafel Temesgen <surafel3000@gmail.com> writes: > [ percent-incremental-v8.patch ] I took a quick look through this. * Why is this adding new functionality in tuplestore.c? Especially functionality to get out a different tuple representation than what was put in? I can't see a reason why nodeLimit should need that, and for sure I don't see a reason why it would need it for this feature if it didn't before. * The business with query.limitCount having different data types depending on LimitOption seems entirely bletcherous and bug-inducing. (There are live bugs of that sort in what you have now: notably, that kluge in adjust_limit_rows_costs will crash and burn if float8 is pass-by-reference. Breaking the Datum abstraction is bad practice.) I wonder if we could get away with making limitCount be float8 all the time. This would mean that you'd lose exactness for counts above 2^53 (or so, depending on whether float8 is IEEE or not), but I doubt that anybody would ever notice. * Names like "PERCENTAGE" seem mighty generic to be exposing as globally known symbols; also you're disregarding a fairly widespread PG convention that members of an enum should have names derived from the enum type's name. So I'd be inclined to make the values of enum LimitOption be LIMIT_OPTION_COUNT and LIMIT_OPTION_PERCENT. (I don't especially like EXACT_NUMBER, first because it's going to be quite long with the prefix, and second because it suggests that maybe there's an INEXACT_NUMBER alternative.) * The "void *limitOption" business in gram.y also seems pretty ugly; it'd be better for that to be enum LimitOption. I gather you want the null to indicate "no option given", but likely it'd be better to invent a LIMIT_OPTION_DEFAULT enum alternative for that purpose. 
* An alternative to the three points above is just to separate Query.limitCount (an int8) from Query.limitPercent (a float8), with the understanding that at most one of the two can be non-null, and use similar representations at other steps of the processing chain. Then you don't need enum limitOption at all. That might not scale especially nicely to additional variants, but trying to overload limitCount even further isn't going to be nice either. * The proposed change in grouping_planner() seems like a complete hack. Why would you not change the way that limit_tuples was computed earlier, instead? That is, it seems like the part of the planner to teach about this feature is preprocess_limit (and limit_needed), not someplace else. It is surely not sane that only this one planner decision would react to a percentage-based limit. * I didn't really study the changes in nodeLimit.c, but doing "tuplestore_rescan" in ExecReScanLimit is surely just wrong. You probably want to delete and recreate the tuplestore, instead, since whatever data you already collected is of no further use. Maybe, in the case where no rescan of the child node is needed, you could re-use the data already collected; but that would require a bunch of additional logic. I'm inclined to think that v1 of the patch shouldn't concern itself with that sort of optimization. * I don't see any delta in ruleutils.c, but surely that needs to know about this feature so that it can correctly print views using it. regards, tom lane
Surafel Temesgen <surafel3000@gmail.com> writes:
> [ percent-incremental-v8.patch ]
I took a quick look through this.
* Why is this adding new functionality in tuplestore.c? Especially
functionality to get out a different tuple representation than what was
put in?
If I am not mistaken, the tuplestore stores MinimalTuples (tuples without transaction status information) for minimal memory usage. It changes what we put in into MinimalTuple format, so we need the same format for getting it out. The other way I can think of doing it is allocating a new MinimalTuple slot every time to get the tuple, which has a resource leak. The operation is very similar to RunFromStore, without the ability to reuse a once-allocated slot.
I can't see a reason why nodeLimit should need that, and for
sure I don't see a reason why it would need it for this feature if it
didn't before.
Only this feature needs tuple storage
* The business with query.limitCount having different data types
depending on LimitOption seems entirely bletcherous and bug-inducing.
(There are live bugs of that sort in what you have now: notably, that
kluge in adjust_limit_rows_costs will crash and burn if float8 is
pass-by-reference. Breaking the Datum abstraction is bad practice.)
I wonder if we could get away with making limitCount be float8 all the
time. This would mean that you'd lose exactness for counts above 2^53
(or so, depending on whether float8 is IEEE or not), but I doubt that
anybody would ever notice.
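(Aside, not part of the original mail.) The 2^53 cliff Tom mentions is easy to demonstrate; Python floats are IEEE 754 doubles, like float8 on most platforms:

```python
n = 2 ** 53

# Every integer up to 2^53 is exactly representable as a double...
assert float(n) == n
assert float(n - 1) == n - 1

# ...but 2^53 + 1 is not: it rounds back down to 2^53, so a limit
# count stored as float8 would silently lose exactness above this point.
assert float(n + 1) == float(n)
print(int(float(n + 1)))  # -> 9007199254740992, not ...993
```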
* Names like "PERCENTAGE" seem mighty generic to be exposing as globally
known symbols; also you're disregarding a fairly widespread PG
convention that members of an enum should have names derived from the
enum type's name. So I'd be inclined to make the values of enum
LimitOption be LIMIT_OPTION_COUNT and LIMIT_OPTION_PERCENT. (I don't
especially like EXACT_NUMBER, first because it's going to be quite long
with the prefix, and second because it suggests that maybe there's an
INEXACT_NUMBER alternative.)
* The "void *limitOption" business in gram.y also seems pretty ugly;
it'd be better for that to be enum LimitOption. I gather you want
the null to indicate "no option given", but likely it'd be better
to invent a LIMIT_OPTION_DEFAULT enum alternative for that purpose.
* An alternative to the three points above is just to separate
Query.limitCount (an int8) from Query.limitPercent (a float8), with the
understanding that at most one of the two can be non-null, and use
similar representations at other steps of the processing chain. Then
you don't need enum limitOption at all. That might not scale especially
nicely to additional variants, but trying to overload limitCount even
further isn't going to be nice either.
* The proposed change in grouping_planner() seems like a complete hack.
Why would you not change the way that limit_tuples was computed earlier,
instead?
That is, it seems like the part of the planner to teach about
this feature is preprocess_limit (and limit_needed), not someplace else.
It is surely not sane that only this one planner decision would react to
a percentage-based limit.
* I didn't really study the changes in nodeLimit.c, but doing
"tuplestore_rescan" in ExecReScanLimit is surely just wrong. You
probably want to delete and recreate the tuplestore, instead, since
whatever data you already collected is of no further use. Maybe, in
the case where no rescan of the child node is needed, you could re-use
the data already collected; but that would require a bunch of additional
logic. I'm inclined to think that v1 of the patch shouldn't concern
itself with that sort of optimization.
* I don't see any delta in ruleutils.c, but surely that needs to know
about this feature so that it can correctly print views using it.
Attachment
Hi Tom,
In the attached patch I include the comments given.
regards
Surafel
> From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
> Date: 2019-09-05 22:26:29
> * I didn't really study the changes in nodeLimit.c, but doing
> "tuplestore_rescan" in ExecReScanLimit is surely just wrong. You
> probably want to delete and recreate the tuplestore, instead, since
> whatever data you already collected is of no further use. Maybe, in
> the case where no rescan of the child node is needed, you could re-use
> the data already collected; but that would require a bunch of additional
> logic. I'm inclined to think that v1 of the patch shouldn't concern
> itself with that sort of optimization.
I don't think this was addressed.
Ryan Lambert
At Thu, 26 Sep 2019 15:13:42 -0600, Ryan Lambert <ryan@rustprooflabs.com> wrote in <CAN-V+g_cG4_R9q25nccNoMrtwA7DsCuHh-G5_ygaj8PBFTWUDw@mail.gmail.com>
> On Thu, Sep 19, 2019 at 6:52 AM Surafel Temesgen <surafel3000@gmail.com> wrote:
> > Hi Tom,
> > In the attached patch i include the comments given
> >
> > regards
> > Surafel
>
> Patch v9 applies and passes make installcheck-world.
>
> > From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
> > Date: 2019-09-05 22:26:29
> > * I didn't really study the changes in nodeLimit.c, but doing
> > "tuplestore_rescan" in ExecReScanLimit is surely just wrong. You
> > probably want to delete and recreate the tuplestore, instead, since
> > whatever data you already collected is of no further use. Maybe, in
> > the case where no rescan of the child node is needed, you could re-use
> > the data already collected; but that would require a bunch of additional
> > logic. I'm inclined to think that v1 of the patch shouldn't concern
> > itself with that sort of optimization.
>
> I don't think this was addressed.
It seems to me that Tom is suggesting to consider that kind of optimization after first completing the patch without it. (That in the first patch was just wrong, but more stuff is needed to achieve that.)
regards.
-- Kyotaro Horiguchi
NTT Open Source Software Center
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: not tested
The latest patch applies cleanly and passes all tests. I believe all feedback has been addressed except for the one case that can be improved upon in later work. Updating status as ready for committer.
Ryan Lambert
The new status of this patch is: Ready for Committer
Hi, This patch is marked as RFC since September. Since then there was no discussion on this thread, but Andrew proposed an alternative approach based on window functions in a separate thread [1] (both for this and for the WITH TIES case). I'll set this patch back to "needs review" - at this point we need to decide which of the approaches is the right one. regards [1] https://www.postgresql.org/message-id/flat/87o8wvz253.fsf@news-spur.riddles.org.uk -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thank you for the notification, Tomas.

At Thu, 16 Jan 2020 22:33:58 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
> This patch is marked as RFC since September. Since then there was no
> discussion on this thread, but Andrew proposed an alternative approach
> based on window functions in a separate thread [1] (both for this and
> for the WITH TIES case).
>
> I'll set this patch back to "needs review" - at this point we need to
> decide which of the approaches is the right one.

Even considering that we have only two usages of the mechanism for now, WITH TIES and PERCENT, I like [1] for its generic nature, simplicity and smaller invasiveness to the executor. The fact that a cursor needs to materialize to allow backward scan would not be a big problem. It seems that PERCENT will be easily implemented on that.

However, isn't there a possibility that we allow FETCH FIRST x PERCENT WITH TIES, though I'm not sure what the SQL spec says about that? I can't come up right now with how to do that using window functions.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On 17/01/2020 07:20, Kyotaro Horiguchi wrote:
> Thank you for the notification, Tomas.
>
> At Thu, 16 Jan 2020 22:33:58 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
>> This patch is marked as RFC since September. Since then there was no
>> discussion on this thread, but Andrew proposed an alternative approach
>> based on window functions in a separate thread [1] (both for this and
>> for the WITH TIES case).
>>
>> I'll set this patch back to "needs review" - at this point we need to
>> decide which of the approaches is the right one.
>
> Even consdiering that we have only two usage , WITH TIES and PERCENT,
> of the mechanism for now, I like [1] for the generic nature,
> simpleness and smaller invasiveness to executor. The fact that cursor
> needs materialize to allow backward scan would not be a big problem.
> It seems that PERCENT will be easily implemented on that.
>
> However, isn't there a possibility that we allow FETCH FIRST x PERCENT
> WITH TIES, though I'm not sure what SQL spec tells about that? I
> don't come up right now how to do that using window functions..

PERCENT and WITH TIES can play together, per spec.

--
Vik Fearing
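To make "play together" concrete for readers: a rough sketch of the combined semantics, under the assumption that the percentage first fixes a ceiling row count and WITH TIES then extends it over peers of the last row with respect to the ORDER BY key. Illustrative code only — the function is invented here, not taken from the spec text or either patch.

```c
/* FETCH FIRST p PERCENT WITH TIES over an already-sorted key column:
 * take ceil(n * p / 100) rows, then keep going while the next row is
 * a peer (equal sort key) of the last row taken. */
static int percent_with_ties(const int *sorted, int n, double percent)
{
    double exact = n * percent / 100.0;
    int    k = (int) exact;

    if (exact > k)
        k++;                            /* PERCENT rounds the count up */
    while (k > 0 && k < n && sorted[k] == sorted[k - 1])
        k++;                            /* WITH TIES: include peers */
    return k;
}
```

For example, 40 percent of five rows is two rows, but if rows two through four share the same key, WITH TIES returns four rows.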
> PERCENT and WITH TIES can play together, per spec.

I also Implement PERCENT WITH TIES option. patch is attached
i don't start a new tread because the patches share common code

Attachment
On Mon, Aug 10, 2020 at 01:23:44PM +0300, Surafel Temesgen wrote:
> I also Implement PERCENT WITH TIES option. patch is attached
> i don't start a new tread because the patches share common code

This fails to apply per the CF bot.  Could you send a rebase?
--
Michael
Hi Michael

On Thu, Sep 24, 2020 at 6:58 AM Michael Paquier <michael@paquier.xyz> wrote:
> This fails to apply per the CF bot.  Could you send a rebase?

Attachment
Sorry for the delay. I started to look at this.

At Fri, 25 Sep 2020 12:25:24 +0300, Surafel Temesgen <surafel3000@gmail.com> wrote in
> Hi Michael
> On Thu, Sep 24, 2020 at 6:58 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> > On Mon, Aug 10, 2020 at 01:23:44PM +0300, Surafel Temesgen wrote:
> > > I also Implement PERCENT WITH TIES option. patch is attached
> > > i don't start a new tread because the patches share common code
> >
> > This fails to apply per the CF bot.  Could you send a rebase?

This still applies on the master HEAD.

percent-incremental-v11.patch

The existing nodeLimit passes the slot of the subnode up to the caller, but this patch changes that behavior. You added a new function to tuplestore.c that stores a minimal tuple into the slot passed from the subnode, but we should refrain from scribbling on that slot. Instead, the PERCENT path of the limit node should use its own ResultTupleSlot for the purpose. See nodeSort for a concrete example.

+	LIMIT_OPTION_PER_WITH_TIES,	/* FETCH FIRST... PERCENT WITH TIES */

That name is a bit hard to read. We should spell it with complete words.

 		case LIMIT_INWINDOW:
 			...
+			if (IsPercentOption(node->limitOption) && node->backwardPosition
 			...
+			if (IsPercentOption(node->limitOption) && node->reachEnd)
 			...
+			if (IsPercentOption(node->limitOption))

I think we can use a separate lstate state for each condition above, since IsPercentOption() gives a constant result throughout execution. For example, LIMIT_PERCENT_TUPLESLOT_NOT_FILLED and LIMIT_PERCENT_TUPLESLOT_FILLED, and some derived states similar to the non-percent path. I *feel* that makes the code simpler. What do you think about this?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
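A small illustration of the lstate suggestion, in self-contained C (the state and function names here are invented, not taken from the patch): because IsPercentOption() is constant for a given node, its answer can be folded into the state machine once at initialization, instead of being re-evaluated inside LIMIT_INWINDOW on every tuple.

```c
/* Invented states: the percent decision is made once, up front. */
typedef enum {
    LSTATE_INWINDOW,                      /* plain LIMIT path */
    LSTATE_PERCENT_TUPLESTORE_NOT_FILLED, /* PERCENT: still buffering input */
    LSTATE_PERCENT_TUPLESTORE_FILLED      /* PERCENT: buffer complete */
} LimitSketchState;

/* Decide the state once, since "is this a PERCENT limit?" never
 * changes during execution. */
static LimitSketchState limit_init_state(int is_percent)
{
    return is_percent ? LSTATE_PERCENT_TUPLESTORE_NOT_FILLED
                      : LSTATE_INWINDOW;
}

/* Only the percent path has a "buffer now complete" transition; the
 * plain path never needs the runtime check at all. */
static LimitSketchState limit_on_input_exhausted(LimitSketchState s)
{
    return (s == LSTATE_PERCENT_TUPLESTORE_NOT_FILLED)
               ? LSTATE_PERCENT_TUPLESTORE_FILLED
               : s;
}
```

The payoff is that each per-tuple branch tests only the enum, and the percent-only logic cannot be reached from the plain path by accident.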
On 1/26/21 11:29 AM, Surafel Temesgen wrote:
>
> On Mon, Jan 25, 2021 at 2:39 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com <mailto:horikyota.ntt@gmail.com>> wrote:
>
>     Sorry for the dealy. I started to look this.
>
> Hi kyotaro,
> Thanks for looking into this but did we agree to proceed
> on this approach? I fear that it will be the west of effort if
> Andrew comes up with the patch for his approach.
> Andrew Gierth: Are you working on your approach?

[Added Andrew to recipients]

Andrew, would you care to comment?

Regards,
--
-David
david@pgmasters.net
On Thu, Mar 25, 2021 at 11:15:10AM -0400, David Steele wrote:
> On 1/26/21 11:29 AM, Surafel Temesgen wrote:
> >
> > On Mon, Jan 25, 2021 at 2:39 PM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com <mailto:horikyota.ntt@gmail.com>> wrote:
> >
> >     Sorry for the dealy. I started to look this.
> >
> > Hi kyotaro,
> > Thanks for looking into this but did we agree to proceed
> > on this approach? I fear that it will be the west of effort if
> > Andrew comes up with the patch for his approach.
> > Andrew Gierth: Are you working on your approach?
>
> [Added Andrew to recipients]
>
> Andrew, would you care to comment?

Hi everyone,

This is still marked as needs review, but I think it should be marked as "Returned with feedback" or "Waiting on author". Suggestions?

--
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL
On Tue, Jan 26, 2021 at 07:29:11PM +0300, Surafel Temesgen wrote:
> On Mon, Jan 25, 2021 at 2:39 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> > Sorry for the dealy. I started to look this.
>
> Hi kyotaro,
> Thanks for looking into this but did we agree to proceed
> on this approach? I fear that it will be the west of effort if
> Andrew comes up with the patch for his approach.
> Andrew Gierth: Are you working on your approach?

Hi Surafel,

I'm marking this one as "returned with feedback". Then you or Andrew can submit a new version of the patch for the next CF.

--
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL