Re: range_agg - Mailing list pgsql-hackers
From | Paul A Jungwirth |
---|---|
Subject | Re: range_agg |
Date | |
Msg-id | CA+renyWdkaThzX3FKCVU6eV7k6=NJH862w2TjkQ2M7jftFoWWg@mail.gmail.com Whole thread Raw |
In response to | Re: range_agg (David Fetter <david@fetter.org>) |
Responses |
Re: range_agg
|
List | pgsql-hackers |
Thanks everyone for offering some thoughts on this! Tom Lane <tgl@sss.pgh.pa.us> wrote: > have you given any thought to just deciding that ranges and > multiranges are the same type? I can see how it might be nice to have just one type to think about. Still I think keeping them separate makes sense. Other folks have brought up several reasons already. Just to chime in: Tom Lane <tgl@sss.pgh.pa.us> wrote: > Isaac Morland <isaac.morland@gmail.com> writes: > > Definitely agreed that range and multirange (or whatever it's called) > > should be different. In the work I do I have a number of uses for ranges, > > but not (yet) for multiranges. I want to be able to declare a column as > > range and be sure that it is just a single range, and then call lower() and > > upper() on it and be sure to get just one value in each case; and if I > > accidentally try to take the union of ranges where the union isn’t another > > range, I want to get an error rather than calculate some weird (in my > > context) multirange. > > I do not find that argument convincing at all. Surely you could put > that constraint on your column using "CHECK (numranges(VALUE) <= 1)" > or some such notation. A check constraint works for columns, but there are other contexts where you'd like to restrict things to just a contiguous range, e.g. user-defined functions and intermediate results in queries. Basic ranges seem a lot simpler to think about, so I can appreciate how letting any range be a multirange adds a heavy cognitive burden. I think a lot of people will share Isaac's opinion here. Tom Lane <tgl@sss.pgh.pa.us> wrote: > Also, this would allow us to remove at least one ugly misfeature: > > regression=# select '[1,2]'::int4range + '[3,10)'::int4range; > ?column? > ---------- > [1,10) > (1 row) > > regression=# select '[1,2]'::int4range + '[4,10)'::int4range; > ERROR: result of range union would not be contiguous Because of backwards compatibility we can't really change +/-/* not to raise (right?), so if we joined ranges and multiranges we'd need to add operators with a different name. I was calling those @+/@-/@* before, but that was considered too unintuitive and undiscoverable. Having two types lets us use the nicer operator names. Tom Lane <tgl@sss.pgh.pa.us> wrote: > it seems like we could consider the traditional > range functions like lower() and upper() to report on the first or last > range bound in a multirange I tried to keep functions/operators similar, so already lower(mr) = lower(r) and upper(mr) = upper(r). I think *conceptually* it's good to make ranges & multiranges as interchangable as possible, but that doesn't mean they have to be the same type. Adding multiranges-as-ranges also raises questions about their string format. If a multirange is {[1,2), [4,5)} would you only print the curly braces when there is more than one element? I don't *think* allowing non-contiguous ranges would break how we use them in GiST indexes or exclusion constraints, but maybe someone can think of some problem I can't. It's one place to be wary anyway. At the very least it would make those things slower I expect. On a few other issues people have raised recently: Alvaro Herrera <alvherre@2ndquadrant.com> writes: > I wonder what's the point of multirange arrays. Is there a reason we > create those? We have arrays of everything else, so why not have them for multiranges? We don't have to identify specific use cases here, although I can see how you'd want to call array_agg/UNNEST on some multiranges, e.g. (Actually I really want to add an UNNEST that *takes* a multirange, but that could be a follow-on commit.) If nothing else I think omitting arrays of multiranges would be a strange irregularity in the type system. David G. Johnston <david.g.johnston@gmail.com> wrote: > In the tests there is: > > +select '{[a,a],[b,b]}'::textmultirange; > + textmultirange > +---------------- > + {[a,a],[b,b]} > +(1 row) > + > +-- without canonicalization, we can't join these: > +select '{[a,a], [b,b]}'::textmultirange; > + textmultirange > +---------------- > + {[a,a],[b,b]} > +(1 row) > + > > Aside from the comment they are identical so I'm confused as to why both tests exist - though I suspect it has to do withthe fact that the expected result would be {[a,b]} since text is discrete. Those tests are for basic string parsing (multirange_in), so one is testing {A,B} and the other {A, B} (with a space after the comma). (There are some tests right above those that also have blank spaces, but they only output a single element in the multirange result.) David G. Johnston <david.g.johnston@gmail.com> wrote: > Also, the current patch set seems a bit undecided on whether it wants to be truly a multi-range or a range that can reportnon-contiguous components. Specifically, > > +select '{[a,d), [b,f]}'::textmultirange; > + textmultirange > +---------------- > + {[a,f]} > +(1 row) Without a canonicalization function, we can't know that [a,a] touches [b,b], but we *can* know that [a,d) touches [b,f). Or even: regression=# select '{[a,b), [b,b]}'::textmultirange; textmultirange ---------------- {[a,b]} (1 row) So I'm joining ranges whenever we know they touch. I think this is consistent with existing range operators, e.g.: regression=# select '[a,a]'::textrange -|- '[b,b]'; ?column? ---------- f regression=# select '[a,b)'::textrange -|- '[b,b]'; ?column? ---------- t David G. Johnston <david.g.johnston@gmail.com> wrote: > There is a an argument that a multi-range should output {[a,d),[b,f]}. IMO its arguable that a multi-range container shouldnot try and reduce the number of contained ranges at all. Automatically combining touching ranges seems very desirable to me, and one of the motivations to building a multirange type instead of just using an array of ranges. Mathematically {[1,2), [2,3)} is equivalent to {[1,3)}, and merging the touching elements keeps things easier to read/understand and faster. Ranges themselves have a canonical form too. Not canonicalizing raises a lot of questions, like when are two "equivalent" ranges equal? And when you compose operators/functions, do you keep all the internally-introduced splits, or somehow preserve only splits that were present in your top-level inputs? If you really want a list of possibly-touching ranges, I would use an array for that. Why even have a range_agg (instead of just array_agg'ing the ranges) if you're not going to merge the inputs? Isaac Morland <isaac.morland@gmail.com> writes: > On a related note, I was thinking about this and I don’t think I like > range_agg as a name at all. I know we have array_agg and string_agg but surely > shouldn’t this be called union_agg, and shouldn’t there also be an > intersect_agg? I mean, taking the union isn’t the only possible aggregate on > ranges or multiranges. The patch does include a range_intersect_agg already. Since there are so many set-like things in SQL, I don't think the unqualified union_agg/intersect_agg are appropriate to give to ranges alone. And the existing ${type}_agg functions are all "additive": json_agg, json_object_agg, array_agg, string_agg, xmlagg. So I think range_agg is the least-surprising name for this behavior. I'm not even the first person to call it that, as you can see from [1]. David Fetter <david@fetter.org> writes: > One way to do that would be to include a "range cardinality" in the > data structure which be the number of left ends in it. I agree that is probably useful enough to add to this patch. I'll work on it. David Fetter <david@fetter.org> writes: > There's another use case not yet covered here that could make this > even more complex, we should probably plan for it: multi-ranges with > weights. Several people have asked me about this. I think it would need to be a separate type though, e.g. weighted_multirange. Personally I wouldn't mind working on it eventually, but I don't think it needs to be part of this initial patch. Possibly it could even be an extension. In lieu of a real type you also have an array-of-ranges, which is what I originally proposed range_agg to return. Finally, I think I mentioned this a long time ago, but I'm still not sure if this patch needs work around these things: - gist opclass - spgist opclass - typanalyze - selectivity I'd love for a real Postgres expert to tell me "No, we can add that later" or "Yes, you have to add that now." Even better if they can offer some help, because I'm not sure I understand those areas well enough to do it myself. Thanks all, Paul [1] https://git.proteus-tech.com/open-source/django-postgres/blob/fa91cf9b43ce942e84b1a9be22f445f3515ca360/postgres/sql/range_agg.sql
pgsql-hackers by date: