Re: Infinite Interval - Mailing list pgsql-hackers
| From | Ashutosh Bapat |
|---|---|
| Subject | Re: Infinite Interval |
| Date | |
| Msg-id | CAExHW5uXBi5y5dQkrewACC8H=2Kk20EOiWe9Y8uJktgsPb4FHQ@mail.gmail.com |
| In response to | Re: Infinite Interval (Dean Rasheed <dean.a.rasheed@gmail.com>) |
| Responses | Re: Infinite Interval |
| List | pgsql-hackers |

On Wed, Sep 20, 2023 at 5:39 PM Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>
> On Wed, 20 Sept 2023 at 11:27, jian he <jian.universality@gmail.com> wrote:
> >
> > if I remove IntervalAggState's element: MemoryContext, it will not work.
> > so I don't understand what the above sentence means...... Sorry. (it's
> > my problem)
> >
>
> I don't see why it won't work. The point is to try to simplify
> do_interval_accum() as much as possible. Looking at the current code,
> I see a few places that could be simpler:
>
> + X.day = newval->day;
> + X.month = newval->month;
> + X.time = newval->time;
> +
> + temp.day = state->sumX.day;
> + temp.month = state->sumX.month;
> + temp.time = state->sumX.time;
>
> Why do we need these local variables X and temp? It could just add the
> values from newval directly to those in state->sumX.
>
> + /* The rest of this needs to work in the aggregate context */
> + old_context = MemoryContextSwitchTo(state->agg_context);
>
> Why? It's not allocating any memory here, so I don't see a need to
> switch context.
>
> So basically, do_interval_accum() could be simplified to:
>
> static void
> do_interval_accum(IntervalAggState *state, Interval *newval)
> {
>     /* Count infinite intervals separately from all else */
>     if (INTERVAL_IS_NOBEGIN(newval))
>     {
>         state->nInfcount++;
>         return;
>     }
>     if (INTERVAL_IS_NOEND(newval))
>     {
>         state->pInfcount++;
>         return;
>     }
>
>     /* Update count of finite intervals */
>     state->N++;
>
>     /* Update sum of finite intervals */
>     if (unlikely(pg_add_s32_overflow(state->sumX.month, newval->month,
>                                      &state->sumX.month)))
>         ereport(ERROR,
>                 errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
>                 errmsg("interval out of range"));
>
>     if (unlikely(pg_add_s32_overflow(state->sumX.day, newval->day,
>                                      &state->sumX.day)))
>         ereport(ERROR,
>                 errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
>                 errmsg("interval out of range"));
>
>     if (unlikely(pg_add_s64_overflow(state->sumX.time, newval->time,
>                                      &state->sumX.time)))
>         ereport(ERROR,
>                 errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
>                 errmsg("interval out of range"));
>
>     return;
> }
>
> and that can be further refactored, as described below, and similarly
> for do_interval_discard(), except using pg_sub_s32/64_overflow().
>
> > > Also, this needs to include serialization and deserialization
> > > functions, otherwise these aggregates will no longer be able to use
> > > parallel workers. That makes a big difference to query E, if the size
> > > of the test data is scaled up.
> > >
> > I tried, but failed. sum(interval) result is correct, but
> > avg(interval) result is wrong.
> >
> > SELECT sum(b), avg(b)
> >       ,avg(b) = sum(b)/count(*) as should_be_true
> >       ,avg(b) * count(*) = sum(b) as should_be_true_too
> > from interval_aggtest_1m;  -- 1 million rows
> >
> > The above query expects the two bool columns to return true, but actually
> > both return false. (It took some time to find out that parallel mode makes
> > the number of rows 1_000_002 when it should be 1_000_000.)
> >
>
> I think the reason for your wrong results is this code in
> interval_avg_combine():
>
> + if (state2->N > 0)
> + {
> +     /* The rest of this needs to work in the aggregate context */
> +     old_context = MemoryContextSwitchTo(agg_context);
> +
> +     /* Accumulate interval values */
> +     do_interval_accum(state1, &state2->sumX);
> +
> +     MemoryContextSwitchTo(old_context);
> + }
>
> The problem is that using do_interval_accum() to add the 2 sums
> together also adds 1 to the count N, making it incorrect. This code
> should only be adding state2->sumX to state1->sumX, not touching
> state1->N. And, as in do_interval_accum(), there is no need to switch
> memory context.
>
> Given that there are multiple places in this file that need to add
> intervals, I think it makes sense to further refactor, and add a local
> function to add 2 finite intervals, along the lines of the code above.
> This can then be called from do_interval_accum(), interval_avg_combine(),
> and interval_pl(). And similarly for subtracting 2 finite intervals.
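
For illustration, such a helper for adding two finite intervals could look
roughly like the sketch below. The name finite_interval_add and its exact
signature are hypothetical here, not taken from the attached patches:

/*
 * Hypothetical sketch: add two finite intervals into *result, erroring
 * out on overflow of any field.  Name and signature are illustrative only.
 */
static void
finite_interval_add(const Interval *span1, const Interval *span2,
                    Interval *result)
{
    if (unlikely(pg_add_s32_overflow(span1->month, span2->month,
                                     &result->month)))
        ereport(ERROR,
                errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
                errmsg("interval out of range"));

    if (unlikely(pg_add_s32_overflow(span1->day, span2->day,
                                     &result->day)))
        ereport(ERROR,
                errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
                errmsg("interval out of range"));

    if (unlikely(pg_add_s64_overflow(span1->time, span2->time,
                                     &result->time)))
        ereport(ERROR,
                errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
                errmsg("interval out of range"));
}

As a usage illustration, do_interval_accum() could then call it as
finite_interval_add(&state->sumX, newval, &state->sumX), and
interval_avg_combine() as
finite_interval_add(&state1->sumX, &state2->sumX, &state1->sumX), with the
counts N, pInfcount and nInfcount merged separately rather than going
through do_interval_accum().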

I was already working on refactoring Jian's patches but forgot to mention
it earlier. I think the attached patchset addresses all your comments, but
it does not implement serialization and deserialization yet. I will take a
look at Jian's patch for the same and incorporate/refactor those changes.

Jian, I don't understand why there are two sets of test queries, one with
ORDER BY and one without. Does the ORDER BY specification make any
difference to the testing?

The patches are thus:
0001, 0002 - same as before
0003 - earlier 0003 + the doc changes suggested by Jian
0004 - fixes DecodeInterval()
0005 - refactors Jian's code fixing window functions; does not contain the
       changes for serialization and deserialization

Jian, please let me know if I have missed anything else.

In my testing, I saw the following timings for the queries:
Query A: 145.164 ms
Query B: 222.419 ms
Query C: 136.995 ms
Query D: 146.893 ms
Query E: 38.053 ms
Query F: 80.112 ms

I didn't see any degradation in the case of sum().

--
Best Wishes,
Ashutosh Bapat
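
For readers following the serialization discussion, a rough sketch of what
the serialize half could look like is below, following the pattern other
aggregates use for parallel aggregation. The IntervalAggState layout
(int64 counters N, pInfcount, nInfcount and an Interval sumX) and the
function name are assumptions for illustration, not taken from the
attached patches:

#include "libpq/pqformat.h"

/*
 * Illustrative sketch only: flatten the transition state into a bytea so
 * that a parallel worker's partial aggregate can be shipped to the leader.
 * Field names and types of IntervalAggState are assumed, not taken from
 * the patches under discussion.
 */
Datum
interval_avg_serialize(PG_FUNCTION_ARGS)
{
    IntervalAggState *state;
    StringInfoData buf;

    /* This may only be called in an aggregate context */
    if (!AggCheckCallContext(fcinfo, NULL))
        elog(ERROR, "aggregate function called in non-aggregate context");

    state = (IntervalAggState *) PG_GETARG_POINTER(0);

    pq_begintypsend(&buf);

    pq_sendint64(&buf, state->N);            /* count of finite inputs */
    pq_sendint64(&buf, state->sumX.time);    /* sum of finite intervals */
    pq_sendint32(&buf, state->sumX.day);
    pq_sendint32(&buf, state->sumX.month);
    pq_sendint64(&buf, state->pInfcount);    /* count of +infinity inputs */
    pq_sendint64(&buf, state->nInfcount);    /* count of -infinity inputs */

    PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
}

The deserialize counterpart would mirror this with pq_getmsgint64() and
pq_getmsgint() into a state allocated in the aggregate context, and both
functions would need pg_proc entries plus aggserialfn/aggdeserialfn
settings in pg_aggregate for the affected aggregates.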