Re: FETCH FIRST clause PERCENT option - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: FETCH FIRST clause PERCENT option |
Date | |
Msg-id | 7dd45bec-1e8b-138c-b2d9-2acb395a0e11@2ndquadrant.com Whole thread Raw |
In response to | Re: FETCH FIRST clause PERCENT option (Surafel Temesgen <surafel3000@gmail.com>) |
Responses |
Re: FETCH FIRST clause PERCENT option
|
List | pgsql-hackers |
On 1/9/19 7:09 AM, Surafel Temesgen wrote: > > > On Sun, Jan 6, 2019 at 5:51 PM Tomas Vondra > <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote: > > > On 1/6/19 12:40 PM, Surafel Temesgen wrote: > > > > > > On Fri, Jan 4, 2019 at 5:27 PM Tomas Vondra > > <tomas.vondra@2ndquadrant.com > <mailto:tomas.vondra@2ndquadrant.com> > <mailto:tomas.vondra@2ndquadrant.com > <mailto:tomas.vondra@2ndquadrant.com>>> wrote: > > > > > > What formula? All the math remains exactly the same, you just > need to > > update the number of rows to return and track how many rows > are already > > returned. > > > > I haven't tried doing that, but AFAICS you'd need to tweak > how/when > > node->count is computed - instead of computing it only once it > needs to > > be updated after fetching each row from the subplan. > > > > Furthermore, you'll need to stash the subplan rows somewhere > (into a > > tuplestore probably), and whenever the node->count value > increments, > > you'll need to grab a row from the tuplestore and return that > (i.e. > > tweak node->position and set node->subSlot). > > > > I hope that makes sense. The one thing I'm not quite sure about is > > whether tuplestore allows adding and getting rows at the same > time. > > > > Does that make sense > > > > > > > > In current implementation in LIMIT_INITIAL state we execute outer > > plan to the end , store the resulting tuples in tuplestore and > > calculate limitCount in number .We are in this state only once and > > did not return any tuple. Once we are at LIMIT_INWINDOW state and > > inner node execution asking for tuple it return from tuple store > > immediately. > > > > Right. > > > Inorder to do fast startup what I was thinking was dividing the work > > done at LIMIT_INITIAL state in to limitCount. For example we want > > 0.5 percent of the result which means in LIMIT_INWINDOW state we > > execute outer node 200 times ,store the result in tuplestore and > > return the first tuple. if we don’t get that much tuple that means we > > reach end of the limit. > Yes, pretty much. > > > > While working on this method i found an issue that did not work will > on percent with fractional part like 2.6 percent which means we have > to emit per every 38.4615 tuple so it have to be round up to 39 or > round down to 38 either way it leads to incorrect result .Even with > integer percentage sometimes the result came out with fractional > part and needs rounding > It's hard to say what exactly are you doing, because you haven't shared any code nor examples. But it sounds as if you're computing how often to produce an additional row at the beginning, and then simply multiplying it. I'm not sure that's quite possible, because there will be issues due to rounding. Firstly, I see your last patch (from 30/11) does this: node->count = DatumGetInt64((DatumGetFloat8(val) * tuplestore_tuple_count(node->totalTuple)) / 100); but the SQL standard I have says this in the <query expression> section: If <fetch first percentage> is specified, then let FFP be the <simple value specification> simply contained in <fetch first percentage>, and let LOCT be a <literal> whose value is OCT. Let FFRC be the value of CEILING ( FFP * LOCT / 100.0E0 ) That's not what you patch does, though, because it relies on automatic float->int conversion, which does floor() and not ceil(). FWIW the final DatumGetInt64 seems unnecessary, because the expression is float8 and not Datum. The patch should do something like this instead: node->count = ceil((DatumGetFloat8(val) * tuplestore_tuple_count(node->totalTuple)) / 100); Now, back to the row count issue - let's assume the user requests FETCH FIRST 3 PERCENT ROWS ONLY This means we should produce 1 rows every 33 rows, but only very roughly. In reality it's a bit less, because 3% is not 1/33 exactly. Consider this query, which computes number of rows for select sample_cnt, l - lag(l) over (order by sample_cnt) AS d from ( select sample_cnt, min(nrows) AS l from ( select nrows, (nrows * 3.0 / 100.0) AS raw_sample_cnt, ceil(nrows * 3.0 / 100.0) AS sample_cnt from generate_series(1,10000) s(nrows) ) foo group by sample_cnt order by sample_cnt ) bar; The inner-most query simply computes how many rows we're supposed to return based on certain row count. Then we compute points at which the sample_cnt actually changes. And finally we compute the distance between those points. You should get something like sample_cnt | d ------------+---- 1 | 2 | 33 3 | 33 4 | 34 5 | 33 6 | 33 7 | 34 8 | 33 9 | 33 10 | 34 11 | 33 ... This shows that the distance oscillates between 33 and 34, so no matter what value you pick, you can't simply multiply it to get the desired sample size. It needs to be re-computed as you go. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date: