Thread: Patch: propose to include 3 new functions into intarray and intagg

Patch: propose to include 3 new functions into intarray and intagg

From
"Dmitry Koterov"
Date:
<div dir="ltr">Hello.<br /><br />Here are these functions with detailed documentation:<br /><a
href="http://en.dklab.ru/lib/dklab_postgresql_patch/">http://en.dklab.ru/lib/dklab_postgresql_patch/</a><br/><br
/>intagg.int_array_append_aggregate(int[]):fast merge arrays into one large list<br />
intarray._int_group_count_sort(int[],bool): frequency-based sorting<br />intarray.bidx(int[], int): binary search in a
sortedarray<br /><br />Tested for about a year on a real PostgreSQL cluster (10 machines, Slony replication) under a
heavyload (millions of requests). <br /> No crash nor memory problem detected during a year, so I suppose these
functionsare well-tested.<br /><br />What do you think about that?<br /></div> 

Re: Patch: propose to include 3 new functions into intarray and intagg

From
Zdenek Kotala
Date:
Dmitry Koterov napsal(a):
> Hello.
> 
> Here are these functions with detailed documentation:
> http://en.dklab.ru/lib/dklab_postgresql_patch/

Added to next commit fest patch list.
Zdenek

-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: Patch: propose to include 3 new functions into intarray and intagg

From
Gregory Stark
Date:
Regarding the patch listed on the commitfest "3 new functions into intarray
and intagg" (which I just noticed has a reviewer listed -- doh):

http://archives.postgresql.org/message-id/d7df81620808130429l2a75c895g5dd6fe8ae64cc23e@mail.gmail.com

I definitely like the int_array_append_aggregate function but I don't see
anything int[] specific about it. We should be able to have a generic
array_union() aggregate which uses the same IsA(fcinfo->context, AggState)
trick to scribble on its state variable. It don't even see any reason it
couldn't work for arrays of varlenas, though it would take a bit of
restructuring.

So I would be definitely for a adding this to core if it were rewritten to
work with generic arrays which, unless there are problems I'm not seeing, I
don't think would be very hard.

As far as detailed code commentary the only thing which jumps out at me is
that it's using MemoryContextAlloc to grow the array instead of repalloc which
seems like a waste. This isn't a new thing though, it was how intagg was
written and this patch just didn't change it.

I'm not against putting more functions into intagg and intarray and bidx and
the grouping/counting thing seem like they might be useful functionality. but
I have a feeling others might feel differently.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!


Re: Patch: propose to include 3 new functions into intarray and intagg

From
Markus Wanner
Date:
Hi,

Gregory Stark wrote:
> Regarding the patch listed on the commitfest "3 new functions into intarray
> and intagg" (which I just noticed has a reviewer listed -- doh):

..well, just add your name as well, no?

> I definitely like the int_array_append_aggregate function but I don't see
> anything int[] specific about it. We should be able to have a generic
> array_union() aggregate which uses the same IsA(fcinfo->context, AggState)
> trick to scribble on its state variable. It don't even see any reason it
> couldn't work for arrays of varlenas, though it would take a bit of
> restructuring.

Yeah, the same idea was bugging me. Doesn't such code already exist?

> So I would be definitely for a adding this to core if it were rewritten to
> work with generic arrays which, unless there are problems I'm not seeing, I
> don't think would be very hard.
> 
> As far as detailed code commentary the only thing which jumps out at me is
> that it's using MemoryContextAlloc to grow the array instead of repalloc which
> seems like a waste. This isn't a new thing though, it was how intagg was
> written and this patch just didn't change it.

Oh, good catch.

> I'm not against putting more functions into intagg and intarray and bidx and
> the grouping/counting thing seem like they might be useful functionality. but
> I have a feeling others might feel differently.

The naming 'bidx' seems a bit weired to me, but otherwise I'm also 
optimistic about it.

Regards

Markus Wanner


Re: Patch: propose to include 3 new functions into intarray and intagg

From
Gregory Stark
Date:
Markus Wanner <markus@bluegap.ch> writes:

> Hi,
>
> Gregory Stark wrote:
>> Regarding the patch listed on the commitfest "3 new functions into intarray
>> and intagg" (which I just noticed has a reviewer listed -- doh):
>
> ..well, just add your name as well, no?

Yeah, people should feel free to comment on items even if other people have or
are as well. It would have just been more useful for me to pick one that
didn't already have someone reading it is all.

>> As far as detailed code commentary the only thing which jumps out at me is
>> that it's using MemoryContextAlloc to grow the array instead of repalloc which
>> seems like a waste. This isn't a new thing though, it was how intagg was
>> written and this patch just didn't change it.
>
> Oh, good catch.

Actually on further thought what's going on is that it's resizing the array by
doubling its size when necessary. palloc/repalloc does that as well, so you're
getting two layers of extra space to handle reallocations. I think it's
simpler and cleaner to just repalloc just enough space at each growth and let
repalloc handle allocating extra space to handle future growth. I think that
would be necessary for handling varlenas also.

> The naming 'bidx' seems a bit weired to me, but otherwise I'm also optimistic
> about it.

If we wanted to put that in core it would make more sense to have a flag on
the array indicating whether it's sorted or not which is maintained whenever
we construct or alter an array. Then just have the regular _int_contains()
(which is an operator @>) take advantage of it if it's set and the data type
is fixed-size.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production
Tuning


Re: Patch: propose to include 3 new functions into intarray and intagg

From
Markus Wanner
Date:
Hi,

Gregory Stark wrote:
>> The naming 'bidx' seems a bit weired to me, but otherwise I'm also optimistic
>> about it.
> 
> If we wanted to put that in core

Uh.. ATM it's just a patch against contrib. I don't think 'bidx' needs 
to go into core. Otherwise we'd have to move the whole intarr contrib 
module as well, no?

> it would make more sense to have a flag on
> the array indicating whether it's sorted or not which is maintained whenever
> we construct or alter an array. Then just have the regular _int_contains()
> (which is an operator @>) take advantage of it if it's set and the data type
> is fixed-size.

Yeah, that sounds like a good thing. Currently, the intarr module 
doesn't provide the optimized functions for the outside world 
(_int_inter_inner() and such.. the _inner appendix really means "inside 
intarr module only). I've ended up copy'n'pasting that code into my own 
module, where I take care about ordering myself. But still want to 
maintain the optimization.

However, that's probably not within the scope of this patch.

Regards

Markus Wanner




Re: Patch: propose to include 3 new functions into intarray and intagg

From
Markus Wanner
Date:
Hi,

Gregory Stark wrote:
> I definitely like the int_array_append_aggregate function but I don't see
> anything int[] specific about it. We should be able to have a generic
> array_union() aggregate which uses the same IsA(fcinfo->context, AggState)
> trick to scribble on its state variable. It don't even see any reason it
> couldn't work for arrays of varlenas, though it would take a bit of
> restructuring.

I've just noticed that this already is a todo item:
  "Add array_accum() and array_to_set() functions for arrays
    The standards specify array_agg() and UNNEST."

See also:
http://archives.postgresql.org/pgsql-hackers/2007-08/msg00464.php

Regards

Markus Wanner


Review Report: propose to include 3 new functions into intarray and intagg

From
Markus Wanner
Date:
Hi,

this is my first "official" review. I've tried to follow the "Review a
patch" guidelines from the wiki - thanks Simon, that was pretty helpful.

This review covers only the intagg additions.

Dmitry Koterov wrote:
> Here are these functions with detailed documentation:
> http://en.dklab.ru/lib/dklab_postgresql_patch/

Submission review: we generally prefer having patches archived on our
mailing lists, so please just send future patches or revisions of this
patch to our lists (I prefer -hackers, but probably -patches is still
the official one). The documentation should go into a README file of the
contrib module or some such. Tests are missing, but that's the case for
the intagg contrib module anyway. The patch applies and is in context
diff format, good. It adds an unrelated newline, which should generally 
be avoided.

Usability review: functional additions look good and usable, certainly
within a contrib module. I don't think it's compliant to any spec, but
that's not required for contrib, IMO.

Functional test: works as advertised on the accompanying website. I've
tested the int_array_append_aggregate somewhat, see [1].

Performance testing: I've tested with 10 mio rows of arrays of size 1
and compared against core's int_array_aggregate function, see [1] again.
In this simple test it took roughly 50% longer, which seems okay. Memory 
consumption looks sane as well.

Coding review: style seems fine for contrib, though lines longer than 80 
cols should be broken up. Comments in the code are sparse , some even 
counter-productive (marking something as "additional things" certainly 
doesn't help).

Code and architecture review: the new int_array_append_aggregate() 
functions itself seems fine to me.

Summary: My general feeling is, that this patch should be applied after 
minor code style corrections. As a longer term goal I think intagg 
should be integrated into  core, since it's very basic functionality. 
TODO entries for things like an array_accum() aggregate already exist. 
Adding this patch to contrib now might be a step into the right direction.

Dmitry, can you please apply these small corrections and re-submit the 
patch?

Regards

Markus Wanner

P.S.: I dislike the intagg's use of PGARRAY, but that's nothing to do 
with this patch. Shouldn't this better use a real composite type as the 
aggregate's state type? I'd propose to clean up the intagg contrib 
module and prepare it for inclusion into core.


[1]: functional and performance testing session:

On a database with (patched) intagg and intarr contrib modules:

markus=# CREATE TABLE test (id INT NOT NULL, arr INT[] NOT NULL);
CREATE TABLE
markus=# INSERT INTO test VALUES (1, ARRAY[1,2,3]), (2, ARRAY[4,5]), (3,
ARRAY[3,2,1]);
INSERT 0 3
markus=# SELECT * FROM test; id |   arr
----+---------  1 | {1,2,3}  2 | {4,5}  3 | {3,2,1}
(3 rows)

markus=# SELECT int_array_append_aggregate(arr) FROM test; int_array_append_aggregate
---------------------------- {1,2,3,4,5,3,2,1}
(1 row)

markus=# SELECT * FROM test; id |   arr
----+---------  1 | {1,2,3}  2 | {4,5}  3 | {3,2,1}
(3 rows)

markus=# SELECT int_array_aggregate(id) AS ids,
int_array_append_aggregate(arr) FROM test GROUP BY (id / 2);  ids  | int_array_append_aggregate
-------+---------------------------- {1}   | {1,2,3} {2,3} | {4,5,3,2,1}
(2 rows)

markus=# SELECT int_array_aggregate(id) AS ids,
int_array_append_aggregate(arr) FROM test GROUP BY (id % 2);  ids  | int_array_append_aggregate
-------+---------------------------- {2}   | {4,5} {1,3} | {1,2,3,3,2,1}
(2 rows)

markus=# INSERT INTO test VALUES (4, NULL);
INSERT 0 1

markus=# SELECT int_array_append_aggregate(arr) FROM test; int_array_append_aggregate
---------------------------- {1,2,3,4,5,3,2,1}
(1 row)

markus=# SELECT id, int_array_append_aggregate(arr) FROM test GROUP BY id; id | int_array_append_aggregate
----+----------------------------  4 | {}  2 | {4,5}  3 | {3,2,1}  1 | {1,2,3}
(4 rows)

-- switching to performance testing

markus=# \timing
Timing is on.

markus=# DELETE FROM test;
DELETE 4
Time: 9.037 ms

markus=# INSERT INTO test SELECT generate_series(1, 10000000),
array[round(random() * 100)]::int[];
INSERT 0 10000000
Time: 53321.186 ms

markus=# SELECT icount(int_array_aggregate(id)) AS count FROM test;  count
---------- 10000000
(1 row)

Time: 2493.184 ms

markus=# SELECT icount(int_array_append_aggregate(arr)) AS count FROM test;  count
---------- 10000000
(1 row)

Time: 4152.478 ms




Markus Wanner <markus@bluegap.ch> writes:
> Submission review: we generally prefer having patches archived on our
> mailing lists, so please just send future patches or revisions of this
> patch to our lists (I prefer -hackers, but probably -patches is still
> the official one).

Please.  But -patches is dead, please use -hackers.

> The documentation should go into a README file of the
> contrib module or some such.

No, it should go into the SGML docs, specifically doc/src/sgml/intarray.sgml
and intagg.sgml.  We got rid of flat-text README documentation for
contrib in 8.3, and are not about to permit any backsliding.

> Summary: My general feeling is, that this patch should be applied after 
> minor code style corrections. As a longer term goal I think intagg 
> should be integrated into  core, since it's very basic functionality. 

Well, what should be integrated is generic (non-datatype-specific)
versions of this functionality, ie functions working on
anyarray/anyelement not just int4[]/int4.  There was some discussion
about that a few days ago.  I'm not really sure how painful it would be
to do; probably the generalization to non-fixed-width types would be the
trickiest part.
        regards, tom lane


Re: Review Report: propose to include 3 new functions into intarray and intagg

From
"Dmitry Koterov"
Date:
<div dir="ltr">OK, thank you for your review.<br /><br />I'll correct everything and send a patch in a couple of days.
Areyou completely sure that this patch will be included? If not, seems the work of the patch standartization has much
lowerpriority, and I will not hurry so much.<br /><br />But, what about intarray patch? Does somebody plan to review
it?I'd prefer to include it too. If you approve, I'll correct the code style in intarray contrib patch too.<br /><br
/><br/><br /><div class="gmail_quote">On Sat, Sep 6, 2008 at 3:34 PM, Markus Wanner <span
dir="ltr"><markus@bluegap.ch></span>wrote:<br /><blockquote class="gmail_quote" style="border-left: 1px solid
rgb(204,204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> Hi,<br /><br /> this is my first "official" review.
I'vetried to follow the "Review a<br /> patch" guidelines from the wiki - thanks Simon, that was pretty helpful.<br
/><br/> This review covers only the intagg additions.<br /><br /> Dmitry Koterov wrote:<br /><blockquote
class="gmail_quote"style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Hereare these functions with detailed documentation:<br /><a href="http://en.dklab.ru/lib/dklab_postgresql_patch/"
target="_blank">http://en.dklab.ru/lib/dklab_postgresql_patch/</a><br/></blockquote><br /> Submission review: we
generallyprefer having patches archived on our<br /> mailing lists, so please just send future patches or revisions of
this<br/> patch to our lists (I prefer -hackers, but probably -patches is still<br /> the official one). The
documentationshould go into a README file of the<br /> contrib module or some such. Tests are missing, but that's the
casefor<br /> the intagg contrib module anyway. The patch applies and is in context<br /> diff format, good. It adds an
unrelatednewline, which should generally be avoided.<br /><br /> Usability review: functional additions look good and
usable,certainly<br /> within a contrib module. I don't think it's compliant to any spec, but<br /> that's not required
forcontrib, IMO.<br /><br /> Functional test: works as advertised on the accompanying website. I've<br /> tested the
int_array_append_aggregatesomewhat, see [1].<br /><br /> Performance testing: I've tested with 10 mio rows of arrays of
size1<br /> and compared against core's int_array_aggregate function, see [1] again.<br /> In this simple test it took
roughly50% longer, which seems okay. Memory consumption looks sane as well.<br /><br /> Coding review: style seems fine
forcontrib, though lines longer than 80 cols should be broken up. Comments in the code are sparse , some even
counter-productive(marking something as "additional things" certainly doesn't help).<br /><br /> Code and architecture
review:the new int_array_append_aggregate() functions itself seems fine to me.<br /><br /> Summary: My general feeling
is,that this patch should be applied after minor code style corrections. As a longer term goal I think intagg should be
integratedinto  core, since it's very basic functionality. TODO entries for things like an array_accum() aggregate
alreadyexist. Adding this patch to contrib now might be a step into the right direction.<br /><br /> Dmitry, can you
pleaseapply these small corrections and re-submit the patch?<br /><br /> Regards<br /><br /> Markus Wanner<br /><br />
P.S.:I dislike the intagg's use of PGARRAY, but that's nothing to do with this patch. Shouldn't this better use a real
compositetype as the aggregate's state type? I'd propose to clean up the intagg contrib module and prepare it for
inclusioninto core.<br /><br /><br /> [1]: functional and performance testing session:<br /><br /> On a database with
(patched)intagg and intarr contrib modules:<br /><br /> markus=# CREATE TABLE test (id INT NOT NULL, arr INT[] NOT
NULL);<br/> CREATE TABLE<br /> markus=# INSERT INTO test VALUES (1, ARRAY[1,2,3]), (2, ARRAY[4,5]), (3,<br />
ARRAY[3,2,1]);<br/> INSERT 0 3<br /> markus=# SELECT * FROM test;<br />  id |   arr<br /> ----+---------<br />  1 |
{1,2,3}<br/>  2 | {4,5}<br />  3 | {3,2,1}<br /> (3 rows)<br /><br /> markus=# SELECT int_array_append_aggregate(arr)
FROMtest;<br />  int_array_append_aggregate<br /> ----------------------------<br />  {1,2,3,4,5,3,2,1}<br /> (1
row)<br/><br /> markus=# SELECT * FROM test;<br />  id |   arr<br /> ----+---------<br />  1 | {1,2,3}<br />  2 |
{4,5}<br/>  3 | {3,2,1}<br /> (3 rows)<br /><br /> markus=# SELECT int_array_aggregate(id) AS ids,<br />
int_array_append_aggregate(arr)FROM test GROUP BY (id / 2);<br />  ids  | int_array_append_aggregate<br />
-------+----------------------------<br/>  {1}   | {1,2,3}<br />  {2,3} | {4,5,3,2,1}<br /> (2 rows)<br /><br />
markus=#SELECT int_array_aggregate(id) AS ids,<br /> int_array_append_aggregate(arr) FROM test GROUP BY (id % 2);<br />
 ids | int_array_append_aggregate<br /> -------+----------------------------<br />  {2}   | {4,5}<br />  {1,3} |
{1,2,3,3,2,1}<br/> (2 rows)<br /><br /> markus=# INSERT INTO test VALUES (4, NULL);<br /> INSERT 0 1<br /><br />
markus=#SELECT int_array_append_aggregate(arr) FROM test;<br />  int_array_append_aggregate<br />
----------------------------<br/>  {1,2,3,4,5,3,2,1}<br /> (1 row)<br /><br /> markus=# SELECT id,
int_array_append_aggregate(arr)FROM test GROUP BY id;<br />  id | int_array_append_aggregate<br />
----+----------------------------<br/>  4 | {}<br />  2 | {4,5}<br />  3 | {3,2,1}<br />  1 | {1,2,3}<br /> (4 rows)<br
/><br/> -- switching to performance testing<br /><br /> markus=# \timing<br /> Timing is on.<br /><br /> markus=#
DELETEFROM test;<br /> DELETE 4<br /> Time: 9.037 ms<br /><br /> markus=# INSERT INTO test SELECT generate_series(1,
10000000),<br/> array[round(random() * 100)]::int[];<br /> INSERT 0 10000000<br /> Time: 53321.186 ms<br /><br />
markus=#SELECT icount(int_array_aggregate(id)) AS count FROM test;<br />  count<br /> ----------<br />  10000000<br />
(1row)<br /><br /> Time: 2493.184 ms<br /><br /> markus=# SELECT icount(int_array_append_aggregate(arr)) AS count FROM
test;<br/>  count<br /> ----------<br />  10000000<br /> (1 row)<br /><br /> Time: 4152.478 ms<br /><br /><br
/></blockquote></div><br/></div> 

Re: Review Report: propose to include 3 new functions into intarray and intagg

From
Markus Wanner
Date:
Hi,

Dmitry Koterov wrote:
> I'll correct everything and send a patch in a couple of days.

Cool, thank you.

> Are you 
> completely sure that this patch will be included?

Uh.. I'm not a committer, but I'm pretty sure your patch has good chances.

I can help with SGML documentation, if you want.

> But, what about intarray patch? Does somebody plan to review it? I'd 
> prefer to include it too. If you approve, I'll correct the code style in 
> intarray contrib patch too.

I've already volunteered for reviewing it as well. I just felt like 
splitting things up...

Regards

Markus Wanner




Re: Review Report: propose to include 3 new functions into intarray and intagg

From
Heikki Linnakangas
Date:
Markus Wanner wrote:
> Dmitry Koterov wrote:
>> But, what about intarray patch? Does somebody plan to review it? I'd 
>> prefer to include it too. If you approve, I'll correct the code style 
>> in intarray contrib patch too.
> 
> I've already volunteered for reviewing it as well. I just felt like 
> splitting things up...

Looking at the intarray patch:

I find it a bit unfriendly to have a function that depends on sorted 
input, but doesn't check it. But that's probably not a good enough 
reason to reject an otherwise simple and useful function. Also, we 
already have uniq, which doesn't strictly speaking require sorted input, 
but it does if you want to eliminate all duplicates from the array.

_int_group_count_sort seems a bit special purpose. Why does it bother to 
sort the output? That's wasted time if you don't need sorted output, or 
if you want the array sorted by the integer value instead of frequency. 
If you want sorted output, you can just sort it afterwards.

Also, it's requiring sorted input for a small performance gain, but 
there's a lot more precedence in the existing intarray functions to not 
require sorted input, but to sort the input instead (union, intersect, 
same, overlap).

I realize that the current implementation is faster for the use case 
where the input is sorted, and output needs to be sorted, but if we go 
down that path we'll soon have dozens of different variants of various 
functions, with different ordering requirements of inputs and outputs.

So, I'd suggest changing _int_group_count_sort so that it doesn't 
require sorted input, and doesn't sort the output. The binary search 
function looks good to me (I think I'd prefer naming it bsearch(), 
though, though I can see that it was named bidx in reference to the 
existing idx function). Also, as Markus pointed out, the SGML docs need 
to be updated.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Review Report: propose to include 3 new functions into intarray and intagg

From
Markus Wanner
Date:
Hi,

sorry for not having completed this review, yet. As you are obviously 
looking at the patch as well, I'll try to quickly write down my points 
so far.

Trying to compile the intarray module, I now receive an error:

error: ‘INT4OID’ undeclared (first use in this function)

That can be solved by including "catalog/pg_type.h" from 
contrib/intarr/_int_op.c.


The PG_FUNCTION_INFO_V1 and prototype definition certainly belong to the 
top of the file, where all others are.

Some lines are longer than 80 columns and again comments are a bit 
sparse or even useless (no "additional things", please).


Heikki Linnakangas wrote:
> I find it a bit unfriendly to have a function that depends on sorted 
> input, but doesn't check it. But that's probably not a good enough 
> reason to reject an otherwise simple and useful function. Also, we 
> already have uniq, which doesn't strictly speaking require sorted input, 
> but it does if you want to eliminate all duplicates from the array.

I think it's a performance optimization which is absolutely required in 
some cases. Some time ago I've also had to rip out the sorting step from 
certain intarray module functions to save processing time.

One option already mentioned somewhere would be saving a 'sorted' 
property for the array. Then again, I think such a thing would certainly 
have to be done globally, for all kinds of arrays.

> _int_group_count_sort seems a bit special purpose. Why does it bother to 
> sort the output? That's wasted time if you don't need sorted output, or 
> if you want the array sorted by the integer value instead of frequency. 
> If you want sorted output, you can just sort it afterwards.

Agreed. IMO the normal GROUP BY and ORDER BY stuff of the database 
itself should be used for such a thing. However, that means turning an 
array into a set of rows...

> Also, it's requiring sorted input for a small performance gain, but 
> there's a lot more precedence in the existing intarray functions to not 
> require sorted input, but to sort the input instead (union, intersect, 
> same, overlap).

..and exactly these are the functions I had to wrap again to strip the 
sorting step, due to poor performance for known-sorted arrays.

> I realize that the current implementation is faster for the use case 
> where the input is sorted, and output needs to be sorted, but if we go 
> down that path we'll soon have dozens of different variants of various 
> functions, with different ordering requirements of inputs and outputs.

Agreed.

However, given the OP is using that in production, there seems to be a 
use case for the optimization, where we have none for the same function 
without it.

> So, I'd suggest changing _int_group_count_sort so that it doesn't 
> require sorted input, and doesn't sort the output. The binary search 
> function looks good to me (I think I'd prefer naming it bsearch(), 
> though, though I can see that it was named bidx in reference to the 
> existing idx function). Also, as Markus pointed out, the SGML docs need 
> to be updated.

As is, it should probably also carry the '_int' prefix, because it's not 
a general purpose array function. So propose to name it '_int_bsearch'.

Overall I think these functions are overly specialized and should be 
replaced be more general counterparts in core. However, until we have 
that, it's hard to refuse such a thing for contrib.

By that reasoning, however, the intarray would have to provide methods 
for sorted as well as asorted input arrays as well.

I'm closing my part of reviewing of this patch now. Dmitry, how do you 
want to proceed with these patches? Do you have time for some cleaning 
up and writing documentation?

Regards

Markus Wanner


Re: Review Report: propose to include 3 new functions into intarray and intagg

From
"Dmitry Koterov"
Date:
No problem, I have time for clearing. But are these functions
guaranteed to be included in the contrib? If there is no guarantee,
seems the time of clearing will be wasted. (5 years ago I have already
cleaned one open-source library on demand and after that it was not
approved for PEAR repository, so now I am more circumspect, sorry :-).

So - is the approval procedure finished? If not, who could make the
final decision ("be or not to be")? Sorry, I don't yet know well the
patch proposal procedure...

P.S.
1. Unfortunately GROUP BY + ORDER BY is sometimes 1000 times (!)
slower than _int_group_count_sort. Because of that I have created this
function.
2. I have to assume that the input is sorted in functions, because
else the performance is lowered very much. Sort operation is quite
expensive; checking if an array is sorted or not is also quite
expensive... So I think it is not a good idea to add
sorting-independency to functions.
3. Seems the conversion of functions to anyarray/anyelement is much
more time-expensive than simply including it to specialized
intarray/intagg. Is it absolutely necessery?


On Mon, Sep 15, 2008 at 4:38 PM, Markus Wanner <markus@bluegap.ch> wrote:
> Hi,
>
> sorry for not having completed this review, yet. As you are obviously
> looking at the patch as well, I'll try to quickly write down my points so
> far.
>
> Trying to compile the intarray module, I now receive an error:
>
> error: 'INT4OID' undeclared (first use in this function)
>
> That can be solved by including "catalog/pg_type.h" from
> contrib/intarr/_int_op.c.
>
>
> The PG_FUNCTION_INFO_V1 and prototype definition certainly belong to the top
> of the file, where all others are.
>
> Some lines are longer than 80 columns and again comments are a bit sparse or
> even useless (no "additional things", please).
>
>
> Heikki Linnakangas wrote:
>>
>> I find it a bit unfriendly to have a function that depends on sorted
>> input, but doesn't check it. But that's probably not a good enough reason to
>> reject an otherwise simple and useful function. Also, we already have uniq,
>> which doesn't strictly speaking require sorted input, but it does if you
>> want to eliminate all duplicates from the array.
>
> I think it's a performance optimization which is absolutely required in some
> cases. Some time ago I've also had to rip out the sorting step from certain
> intarray module functions to save processing time.
>
> One option already mentioned somewhere would be saving a 'sorted' property
> for the array. Then again, I think such a thing would certainly have to be
> done globally, for all kinds of arrays.
>
>> _int_group_count_sort seems a bit special purpose. Why does it bother to
>> sort the output? That's wasted time if you don't need sorted output, or if
>> you want the array sorted by the integer value instead of frequency. If you
>> want sorted output, you can just sort it afterwards.
>
> Agreed. IMO the normal GROUP BY and ORDER BY stuff of the database itself
> should be used for such a thing. However, that means turning an array into a
> set of rows...
>
>> Also, it's requiring sorted input for a small performance gain, but
>> there's a lot more precedence in the existing intarray functions to not
>> require sorted input, but to sort the input instead (union, intersect, same,
>> overlap).
>
> ..and exactly these are the functions I had to wrap again to strip the
> sorting step, due to poor performance for known-sorted arrays.
>
>> I realize that the current implementation is faster for the use case where
>> the input is sorted, and output needs to be sorted, but if we go down that
>> path we'll soon have dozens of different variants of various functions, with
>> different ordering requirements of inputs and outputs.
>
> Agreed.
>
> However, given the OP is using that in production, there seems to be a use
> case for the optimization, where we have none for the same function without
> it.
>
>> So, I'd suggest changing _int_group_count_sort so that it doesn't require
>> sorted input, and doesn't sort the output. The binary search function looks
>> good to me (I think I'd prefer naming it bsearch(), though, though I can see
>> that it was named bidx in reference to the existing idx function). Also, as
>> Markus pointed out, the SGML docs need to be updated.
>
> As is, it should probably also carry the '_int' prefix, because it's not a
> general purpose array function. So propose to name it '_int_bsearch'.
>
> Overall I think these functions are overly specialized and should be
> replaced be more general counterparts in core. However, until we have that,
> it's hard to refuse such a thing for contrib.
>
> By that reasoning, however, the intarray would have to provide methods for
> sorted as well as asorted input arrays as well.
>
> I'm closing my part of reviewing of this patch now. Dmitry, how do you want
> to proceed with these patches? Do you have time for some cleaning up and
> writing documentation?
>
> Regards
>
> Markus Wanner
>