Re: Infinite Interval - Mailing list pgsql-hackers

From: Ashutosh Bapat
Subject: Re: Infinite Interval
Msg-id: CAExHW5uXBi5y5dQkrewACC8H=2Kk20EOiWe9Y8uJktgsPb4FHQ@mail.gmail.com
In response to: Re: Infinite Interval (Dean Rasheed <dean.a.rasheed@gmail.com>)
List: pgsql-hackers
On Wed, Sep 20, 2023 at 5:39 PM Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>
> On Wed, 20 Sept 2023 at 11:27, jian he <jian.universality@gmail.com> wrote:
> >
> > if I remove IntervalAggState's MemoryContext member, it will not
> > work, so I don't understand what the above sentence means. Sorry.
> > (It's probably my misunderstanding.)
> >
>
> I don't see why it won't work. The point is to try to simplify
> do_interval_accum() as much as possible. Looking at the current code,
> I see a few places that could be simpler:
>
> +    X.day = newval->day;
> +    X.month = newval->month;
> +    X.time = newval->time;
> +
> +    temp.day = state->sumX.day;
> +    temp.month = state->sumX.month;
> +    temp.time = state->sumX.time;
>
> Why do we need these local variables X and temp? It could just add the
> values from newval directly to those in state->sumX.
>
> +    /* The rest of this needs to work in the aggregate context */
> +    old_context = MemoryContextSwitchTo(state->agg_context);
>
> Why? It's not allocating any memory here, so I don't see a need to
> switch context.
>
> So basically, do_interval_accum() could be simplified to:
>
> static void
> do_interval_accum(IntervalAggState *state, Interval *newval)
> {
>     /* Count infinite intervals separately from all else */
>     if (INTERVAL_IS_NOBEGIN(newval))
>     {
>         state->nInfcount++;
>         return;
>     }
>     if (INTERVAL_IS_NOEND(newval))
>     {
>         state->pInfcount++;
>         return;
>     }
>
>     /* Update count of finite intervals */
>     state->N++;
>
>     /* Update sum of finite intervals */
>     if (unlikely(pg_add_s32_overflow(state->sumX.month, newval->month,
>                                      &state->sumX.month)))
>         ereport(ERROR,
>                 errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
>                 errmsg("interval out of range"));
>
>     if (unlikely(pg_add_s32_overflow(state->sumX.day, newval->day,
>                                      &state->sumX.day)))
>         ereport(ERROR,
>                 errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
>                 errmsg("interval out of range"));
>
>     if (unlikely(pg_add_s64_overflow(state->sumX.time, newval->time,
>                                      &state->sumX.time)))
>         ereport(ERROR,
>                 errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
>                 errmsg("interval out of range"));
>
>     return;
> }
>
> and that can be further refactored, as described below, and similarly
> for do_interval_discard(), except using pg_sub_s32/64_overflow().
>
>
> > > Also, this needs to include serialization and deserialization
> > > functions, otherwise these aggregates will no longer be able to use
> > > parallel workers. That makes a big difference to query E, if the size
> > > of the test data is scaled up.
> > >
> > I tried, but failed: the sum(interval) result is correct, but the
> > avg(interval) result is wrong.
> >
> > SELECT sum(b) ,avg(b)
> >         ,avg(b) = sum(b)/count(*) as should_be_true
> >         ,avg(b) * count(*) = sum(b) as should_be_true_too
> > from interval_aggtest_1m; -- 1 million rows.
> > The above query expects both boolean columns to return true, but
> > actually both return false. (After spending some time on it, I found
> > that parallel mode makes the row count 1_000_002 when it should be
> > 1_000_000.)
> >
>
> I think the reason for your wrong results is this code in
> interval_avg_combine():
>
> +    if (state2->N > 0)
> +    {
> +        /* The rest of this needs to work in the aggregate context */
> +        old_context = MemoryContextSwitchTo(agg_context);
> +
> +        /* Accumulate interval values */
> +        do_interval_accum(state1, &state2->sumX);
> +
> +        MemoryContextSwitchTo(old_context);
> +    }
>
> The problem is that using do_interval_accum() to add the 2 sums
> together also adds 1 to the count N, making it incorrect. This code
> should only be adding state2->sumX to state1->sumX, not touching
> state1->N. And, as in do_interval_accum(), there is no need to switch
> memory context.
>
> Given that there are multiple places in this file that need to add
> intervals, I think it makes sense to further refactor, and add a local
> function to add 2 finite intervals, along the lines of the code above.
> This can then be called from do_interval_accum(),
> interval_avg_combine(), and interval_pl(). And similarly for
> subtracting 2 finite intervals.
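
A helper pair along those lines could look roughly like this. This is
only a sketch: it uses the GCC/Clang __builtin_*_overflow builtins in
place of PostgreSQL's pg_add_s32/s64_overflow and pg_sub_s32/64_overflow,
reduces the ereport() to a boolean return, and the struct layout and
function names are illustrative rather than the committed code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for PostgreSQL's Interval */
typedef struct Interval
{
    int64_t time;   /* microseconds */
    int32_t day;
    int32_t month;
} Interval;

/*
 * Add two finite intervals field by field.
 * Returns false on overflow, in which case the caller would
 * ereport "interval out of range".
 */
static bool
finite_interval_pl(const Interval *span1, const Interval *span2,
                   Interval *result)
{
    return !(__builtin_add_overflow(span1->month, span2->month, &result->month) ||
             __builtin_add_overflow(span1->day, span2->day, &result->day) ||
             __builtin_add_overflow(span1->time, span2->time, &result->time));
}

/* Subtract span2 from span1, with the same overflow contract. */
static bool
finite_interval_mi(const Interval *span1, const Interval *span2,
                   Interval *result)
{
    return !(__builtin_sub_overflow(span1->month, span2->month, &result->month) ||
             __builtin_sub_overflow(span1->day, span2->day, &result->day) ||
             __builtin_sub_overflow(span1->time, span2->time, &result->time));
}
```

With helpers like these, do_interval_accum() reduces to the infinity
checks, state->N++, and a single addition call; the combine function can
then add the two sums without touching N, and the discard path uses the
subtraction variant.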

I was already working on refactoring Jian's patches but forgot to
mention it there. I think the attached patchset addresses all your
comments, but it does not yet implement serialization and
deserialization. I will take a look at Jian's patch for those and
incorporate/refactor the changes.
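
The general shape of that serialization work is to flatten the
transition state into a byte buffer and rebuild it on the other side.
A rough self-contained sketch, assuming state fields like the ones
discussed above (the real functions would go through bytea and the
pq_send/pq_get routines; the struct layout and names here are
illustrative, not the actual patch):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stand-in for PostgreSQL's Interval */
typedef struct Interval
{
    int64_t time;   /* microseconds */
    int32_t day;
    int32_t month;
} Interval;

/* Assumed shape of the aggregate transition state */
typedef struct IntervalAggState
{
    int64_t  N;          /* count of finite intervals */
    Interval sumX;       /* sum of finite intervals */
    int64_t  pInfcount;  /* count of +infinity inputs */
    int64_t  nInfcount;  /* count of -infinity inputs */
} IntervalAggState;

/* Flatten the state into buf; returns the number of bytes written. */
static size_t
interval_agg_serialize(const IntervalAggState *state, uint8_t *buf)
{
    uint8_t *p = buf;

    memcpy(p, &state->N, sizeof(int64_t));          p += sizeof(int64_t);
    memcpy(p, &state->sumX.time, sizeof(int64_t));  p += sizeof(int64_t);
    memcpy(p, &state->sumX.day, sizeof(int32_t));   p += sizeof(int32_t);
    memcpy(p, &state->sumX.month, sizeof(int32_t)); p += sizeof(int32_t);
    memcpy(p, &state->pInfcount, sizeof(int64_t));  p += sizeof(int64_t);
    memcpy(p, &state->nInfcount, sizeof(int64_t));  p += sizeof(int64_t);
    return (size_t) (p - buf);
}

/* Rebuild the state from buf, reading fields in the same order. */
static void
interval_agg_deserialize(const uint8_t *buf, IntervalAggState *state)
{
    const uint8_t *p = buf;

    memcpy(&state->N, p, sizeof(int64_t));          p += sizeof(int64_t);
    memcpy(&state->sumX.time, p, sizeof(int64_t));  p += sizeof(int64_t);
    memcpy(&state->sumX.day, p, sizeof(int32_t));   p += sizeof(int32_t);
    memcpy(&state->sumX.month, p, sizeof(int32_t)); p += sizeof(int32_t);
    memcpy(&state->pInfcount, p, sizeof(int64_t));  p += sizeof(int64_t);
    memcpy(&state->nInfcount, p, sizeof(int64_t));  p += sizeof(int64_t);
}
```

The point Dean made about avg() applies here too: the combine function
that consumes a deserialized state must add state2->N to state1->N and
state2->sumX to state1->sumX directly, rather than calling
do_interval_accum(), which would bump N by one.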

Jian,
I don't understand why there are two sets of test queries, one with
ORDER BY and one without. Does the ORDER BY specification make any
difference to the testing?

The patches are as follows:
0001, 0002 - same as before
0003 - earlier 0003 plus the doc changes suggested by Jian
0004 - fixes DecodeInterval()
0005 - refactored version of Jian's code fixing window functions; does
not contain the serialization and deserialization changes. Jian,
please let me know if I have missed anything else.

In my testing, I saw the following query timings:
Query A: 145.164 ms
Query B: 222.419 ms

Query C: 136.995 ms
Query D: 146.893 ms

Query E: 38.053 ms
Query F: 80.112 ms

I didn't see any degradation in the case of sum().

--
Best Wishes,
Ashutosh Bapat
