Re: range_agg - Mailing list pgsql-hackers

From Pavel Stehule
Subject Re: range_agg
Date
Msg-id CAFj8pRCUQxPwkoGqDfN-BSqCXTrc6KVBJRP_yQRsthOdx=uC7A@mail.gmail.com
Whole thread Raw
In response to Re: range_agg  (Paul A Jungwirth <pj@illuminatedcomputing.com>)
List pgsql-hackers


pá 5. 7. 2019 v 19:22 odesílatel Paul A Jungwirth <pj@illuminatedcomputing.com> napsal:
On Fri, Jul 5, 2019 at 4:31 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:
> The first issue is unstable regress tests - there is a problem with opr_sanity

I would prefer to avoid needing to add anything to opr_sanity really.
A multirange would let me achieve that I think. But otherwise I'll add
the ordering. Thanks!

> +   rangeTypeId = get_fn_expr_argtype(fcinfo->flinfo, 1);
> +   if (!type_is_range(rangeTypeId))
> +   {
> +       ereport(ERROR, (errmsg("range_agg must be called with a range")));
> +   }

I think this was left over from my attempts to support user-defined
ranges. I believe I can remove it now.

> +# Making 2- and 3-param range_agg polymorphic is difficult
> +# because it would take an anyrange and return an anyrange[],
> +# which doesn't exist.
> +# As a workaround we define separate functions for each built-in range type.
> +# This is what causes the mess in src/test/regress/expected/opr_sanity.out.
> +{ oid => '8003', descr => 'aggregate transition function',
>
> This is strange. Does it means so range_agg will not work with custom range types?

You would have to define your own range_agg with your own custom type
in the signature. I gave instructions about that in my range_agg
extension (https://github.com/pjungwir/range_agg), but it's not as
easy with range_agg in core because I don't think you can define a
custom function that is backed by a built-in C function. At least I
couldn't get that to work.

I understand so anybody can implement own function, but this is fundamental problem in implementation.

In this case I prefer to start implement anyrangearray type first. It is not hard work.
 

The biggest argument for a multirange to me is that we could have
anymultirange, with similar rules as anyarray. That way I could have
`range_agg(r anyrange) RETURNS anymultirange`. [1] is a conversation
where I asked about doing this before multiranges were suggested. Also
I'm aware of your own recent work on polymorphic types at [2] but I
haven't read it closely enough to see if it would help me here. Do you
think it applies?

I don't see any difference between anymultirange or anyrangearray - it is just name

ANSI SQL has a special kind for this purpose - "set". It is maybe better, because PostgreSQL's arrays revoke a idea of ordering, but it is hard to do some generic order for ranges.

Functionality is important - if you can do some special functionality, that cannot be implemented as "array of ranges", then the new type is necessary. If all functionality can be covered by array of ranges, then there is not necessity of new type.

I don't talk about polymorphic types - probably it needs one - "anymultirange" or "anyrangearray".

If some operation can be done smarter with new type, then I am ok for new type, if not, then we should to use array of ranges.


> I am not sure about implementation. I think so accumulating all ranges, sorting and processing on the end can be memory and CPU expensive.

I did some research and couldn't find any algorithm that was better
than O(n log n), but if you're aware of any I'd like to know about it.
Assuming we can't improve on the complexity bounds, I think a sort +
iteration is desirable because it keeps things simple and benefits
from past & future work on the sorting code. I care more about
optimizing time here than RAM, especially since we'll use this in FK
checks, where the inputs will rarely be more than a few and probably
never in the millions.

We especially want to avoid an O(n^2) algorithm, which would be easy
to stumble into if we naively accumulated the result as we go. For
example if we had multiranges you could imagine an implementation that
just did `result + r` for all inputs r. But that would have the same
n^2 pattern as looping over strcat.

We could use a tree to keep things sorted and accumulate as we go, but
the implementation would be more complicated and I think slower.

I don't think - working with large arrays is slow, due often cache miss. I understand so it depends on data, but in this area, I can imagine significant memory reduction based on running processing.

array builder doesn't respect work_men, and I think so any different way is safer.

Regards

Pavel

 

Thanks for the review!

Paul

[1] https://www.postgresql.org/message-id/CA%2BrenyVOjb4xQZGjdCnA54-1nzVSY%2B47-h4nkM-EP5J%3D1z%3Db9w%40mail.gmail.com

[2] https://www.postgresql.org/message-id/CAFj8pRDna7VqNi8gR+Tt2Ktmz0cq5G93guc3Sbn_NVPLdXAkqA@mail.gmail.com

pgsql-hackers by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: range_agg
Next
From: Amit Kapila
Date:
Subject: Re: Comment typo in tableam.h