Re: range_agg - Mailing list pgsql-hackers

From Paul A Jungwirth
Subject Re: range_agg
Date
Msg-id CA+renyX7jsWnyry6SOn-WSgdgQLtyFJ7SLWBVs+oatwQsMzvog@mail.gmail.com
Whole thread Raw
In response to Re: range_agg  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: range_agg  (David Fetter <david@fetter.org>)
Re: range_agg  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
On Mon, Jul 1, 2019 at 3:38 PM Jeff Davis <pgsql@j-davis.com> wrote:
>
> For getting into core though, it should be a more complete set of
> related operations. The patch is implicitly introducing the concept of
> a "multirange" (in this case, an array of ranges), but it's not making
> the concept whole.
>
> What else should return a multirange? This patch implements the union-
> agg of ranges, but we also might want an intersection-agg of ranges
> (that is, the set of ranges that are subranges of every input). Given
> that there are other options here, the name "range_agg" is too generic
> to mean union-agg specifically.

Thanks for the review!

I like the idea of adding a multirange type that works like range
types, although I'm not sure I want to build it. :-)

My own motivations for the range_agg patch are for temporal databases,
where I have two use-cases: checking temporal foreign keys [1] and
doing a Snodgrass "coalesce" operation [2]. The FKs use-case is more
important. For coalesce I just immediately UNNEST the array, so a
multirange would sort of "get in the way". It's no big deal if I can
cast a multirange to an array, although for something that would run
on every INSERT/UPDATE/DELETE I'd like to understand what the cast
would cost us in performance. But coalesce isn't part of SQL:2011 and
should be optional behavior or just something in an extension. The FKs
use case matters to me a lot more, and I think a multirange would be
fine for that. Also a multirange would solve the generics problem
Pavel asked about. (I'll say more about that in a reply to him.)

I'm not convinced that an intersection aggregate function for
multiranges would be used by anyone, but I don't mind building one.
Every other *_agg function has an "additive" sense, not a
"subtractive" sense. json{,b}_object_agg are the closest since you
*could* imagine intersection semantics for those, but they are unions.
So in terms of *naming* I think using "range_agg" for union semantics
is natural and would fit expectations. (I'm not the first to name this
function range_agg btw: [3]).

But there is clearly more than one worthwhile way to aggregate ranges:

- David suggested weighted_range_agg and covering_range_agg. At PGCon
2019 someone else said he has had to build something that was
essentially weighted_range_agg. I can see it used for
scheduling/capacity/max-utilization problems.
- You suggested an intersection range_agg.
- At [4] there is a missing_ranges function that only gives the *gaps*
between the input ranges.

Nonetheless I still think I would call the union behavior simply
range_agg, and then use weighted_range_agg, covering_range_agg,
intersection_range_agg, and missing_range_agg for the rest (assuming
we built them all). I'm not going to insist on any of that, but it's
what feels most user-friendly to me.

> What can we do with a multirange? A lot of range operators still make
> sense, like "contains" or "overlaps"; but "adjacent" doesn't quite
> work. What about new operations like inverting a multirange to get the
> gaps?

I can think of a lot of cool operators for `multirange \bigoplus
multirange` or `multirange \bigoplus range` (commuting of course). And
I've certainly wanted `range + range` and `range - range` in the past,
which would both return a multirange.

> If we have a more complete set of operators here, the flags for
> handling overlapping ranges and gaps will be unnecessary.

Both of my use cases should permit overlaps & gaps (range_agg(r, true,
true)), so I'm actually pretty okay with dropping the flags entirely
and just giving a one-param function that behavior. Or defining a
strict_range_agg that offers more control. But also I don't understand
how richer *operators* make the flags for the *aggregate* unnecessary.

So I really like the idea of multiranges, but I'm reluctant to take it
on myself, especially since this patch is just a utility function for
my other temporal patches. But I don't want my rush to leave a blemish
in our standard library either. But I think what really persuades me
to add multiranges is making a range_agg that more easily supports
user-defined range types. So how about I start on it and see how it
goes? I expect I can follow the existing code for range types pretty
closely, so maybe it won't be too hard.

Another option would be to rename my function range_array_agg (or
something) so that we are leaving space for a multirange function in
the future. I don't love this idea myself but it would could a Plan B.
What do you think of that?

Regards,
Paul

[1] With range_agg I can make the temporal FK check use similar SQL to
the non-temporal FK check:

    SELECT 1
    FROM [ONLY] <pktable> x
    WHERE pkatt1 = $1 [AND ...]
    FOR KEY SHARE OF x

vs

    SELECT 1
    FROM (
        SELECT range_agg(r, true, true) AS r
        FROM  (
            SELECT pkperiodatt AS r
            FROM [ONLY] pktable x
            WHERE pkatt1 = $1 [AND ...]
            FOR KEY SHARE OF x
        ) x1
    ) x2
    WHERE $n <@ x2.r[1]

(The temporal version could be simplified a bit if FOR KEY SHARE ever
supports aggregate functions.)

[2] page number 159ff in https://www2.cs.arizona.edu/~rts/tdbbook.pdf
(section 6.5.1, starts on page 183 of the PDF)

[3]
https://git.proteus-tech.com/open-source/django-postgres/blob/fa91cf9b43ce942e84b1a9be22f445f3515ca360/postgres/sql/range_agg.sql

[4] https://schinckel.net/2014/11/18/aggregating-ranges-in-postgres/



pgsql-hackers by date:

Previous
From: Jeff Davis
Date:
Subject: Re: range_agg
Next
From: Noah Misch
Date:
Subject: Re: Fix runtime errors from -fsanitize=undefined