Thread: Patch: propose to include 3 new functions into intarray and intagg
<div dir="ltr">Hello.<br /><br />Here are these functions with detailed documentation:<br /><a href="http://en.dklab.ru/lib/dklab_postgresql_patch/">http://en.dklab.ru/lib/dklab_postgresql_patch/</a><br/><br />intagg.int_array_append_aggregate(int[]):fast merge arrays into one large list<br /> intarray._int_group_count_sort(int[],bool): frequency-based sorting<br />intarray.bidx(int[], int): binary search in a sortedarray<br /><br />Tested for about a year on a real PostgreSQL cluster (10 machines, Slony replication) under a heavyload (millions of requests). <br /> No crash nor memory problem detected during a year, so I suppose these functionsare well-tested.<br /><br />What do you think about that?<br /></div>
Dmitry Koterov napsal(a): > Hello. > > Here are these functions with detailed documentation: > http://en.dklab.ru/lib/dklab_postgresql_patch/ Added to next commit fest patch list. Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
Regarding the patch listed on the commitfest "3 new functions into intarray and intagg" (which I just noticed has a reviewer listed -- doh): http://archives.postgresql.org/message-id/d7df81620808130429l2a75c895g5dd6fe8ae64cc23e@mail.gmail.com I definitely like the int_array_append_aggregate function but I don't see anything int[] specific about it. We should be able to have a generic array_union() aggregate which uses the same IsA(fcinfo->context, AggState) trick to scribble on its state variable. It don't even see any reason it couldn't work for arrays of varlenas, though it would take a bit of restructuring. So I would be definitely for a adding this to core if it were rewritten to work with generic arrays which, unless there are problems I'm not seeing, I don't think would be very hard. As far as detailed code commentary the only thing which jumps out at me is that it's using MemoryContextAlloc to grow the array instead of repalloc which seems like a waste. This isn't a new thing though, it was how intagg was written and this patch just didn't change it. I'm not against putting more functions into intagg and intarray and bidx and the grouping/counting thing seem like they might be useful functionality. but I have a feeling others might feel differently. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
Hi, Gregory Stark wrote: > Regarding the patch listed on the commitfest "3 new functions into intarray > and intagg" (which I just noticed has a reviewer listed -- doh): ..well, just add your name as well, no? > I definitely like the int_array_append_aggregate function but I don't see > anything int[] specific about it. We should be able to have a generic > array_union() aggregate which uses the same IsA(fcinfo->context, AggState) > trick to scribble on its state variable. It don't even see any reason it > couldn't work for arrays of varlenas, though it would take a bit of > restructuring. Yeah, the same idea was bugging me. Doesn't such code already exist? > So I would be definitely for a adding this to core if it were rewritten to > work with generic arrays which, unless there are problems I'm not seeing, I > don't think would be very hard. > > As far as detailed code commentary the only thing which jumps out at me is > that it's using MemoryContextAlloc to grow the array instead of repalloc which > seems like a waste. This isn't a new thing though, it was how intagg was > written and this patch just didn't change it. Oh, good catch. > I'm not against putting more functions into intagg and intarray and bidx and > the grouping/counting thing seem like they might be useful functionality. but > I have a feeling others might feel differently. The naming 'bidx' seems a bit weired to me, but otherwise I'm also optimistic about it. Regards Markus Wanner
Markus Wanner <markus@bluegap.ch> writes: > Hi, > > Gregory Stark wrote: >> Regarding the patch listed on the commitfest "3 new functions into intarray >> and intagg" (which I just noticed has a reviewer listed -- doh): > > ..well, just add your name as well, no? Yeah, people should feel free to comment on items even if other people have or are as well. It would have just been more useful for me to pick one that didn't already have someone reading it is all. >> As far as detailed code commentary the only thing which jumps out at me is >> that it's using MemoryContextAlloc to grow the array instead of repalloc which >> seems like a waste. This isn't a new thing though, it was how intagg was >> written and this patch just didn't change it. > > Oh, good catch. Actually on further thought what's going on is that it's resizing the array by doubling its size when necessary. palloc/repalloc does that as well, so you're getting two layers of extra space to handle reallocations. I think it's simpler and cleaner to just repalloc just enough space at each growth and let repalloc handle allocating extra space to handle future growth. I think that would be necessary for handling varlenas also. > The naming 'bidx' seems a bit weired to me, but otherwise I'm also optimistic > about it. If we wanted to put that in core it would make more sense to have a flag on the array indicating whether it's sorted or not which is maintained whenever we construct or alter an array. Then just have the regular _int_contains() (which is an operator @>) take advantage of it if it's set and the data type is fixed-size. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production Tuning
Hi, Gregory Stark wrote: >> The naming 'bidx' seems a bit weired to me, but otherwise I'm also optimistic >> about it. > > If we wanted to put that in core Uh.. ATM it's just a patch against contrib. I don't think 'bidx' needs to go into core. Otherwise we'd have to move the whole intarr contrib module as well, no? > it would make more sense to have a flag on > the array indicating whether it's sorted or not which is maintained whenever > we construct or alter an array. Then just have the regular _int_contains() > (which is an operator @>) take advantage of it if it's set and the data type > is fixed-size. Yeah, that sounds like a good thing. Currently, the intarr module doesn't provide the optimized functions for the outside world (_int_inter_inner() and such.. the _inner appendix really means "inside intarr module only). I've ended up copy'n'pasting that code into my own module, where I take care about ordering myself. But still want to maintain the optimization. However, that's probably not within the scope of this patch. Regards Markus Wanner
Hi, Gregory Stark wrote: > I definitely like the int_array_append_aggregate function but I don't see > anything int[] specific about it. We should be able to have a generic > array_union() aggregate which uses the same IsA(fcinfo->context, AggState) > trick to scribble on its state variable. It don't even see any reason it > couldn't work for arrays of varlenas, though it would take a bit of > restructuring. I've just noticed that this already is a todo item: "Add array_accum() and array_to_set() functions for arrays The standards specify array_agg() and UNNEST." See also: http://archives.postgresql.org/pgsql-hackers/2007-08/msg00464.php Regards Markus Wanner
Hi, this is my first "official" review. I've tried to follow the "Review a patch" guidelines from the wiki - thanks Simon, that was pretty helpful. This review covers only the intagg additions. Dmitry Koterov wrote: > Here are these functions with detailed documentation: > http://en.dklab.ru/lib/dklab_postgresql_patch/ Submission review: we generally prefer having patches archived on our mailing lists, so please just send future patches or revisions of this patch to our lists (I prefer -hackers, but probably -patches is still the official one). The documentation should go into a README file of the contrib module or some such. Tests are missing, but that's the case for the intagg contrib module anyway. The patch applies and is in context diff format, good. It adds an unrelated newline, which should generally be avoided. Usability review: functional additions look good and usable, certainly within a contrib module. I don't think it's compliant to any spec, but that's not required for contrib, IMO. Functional test: works as advertised on the accompanying website. I've tested the int_array_append_aggregate somewhat, see [1]. Performance testing: I've tested with 10 mio rows of arrays of size 1 and compared against core's int_array_aggregate function, see [1] again. In this simple test it took roughly 50% longer, which seems okay. Memory consumption looks sane as well. Coding review: style seems fine for contrib, though lines longer than 80 cols should be broken up. Comments in the code are sparse , some even counter-productive (marking something as "additional things" certainly doesn't help). Code and architecture review: the new int_array_append_aggregate() functions itself seems fine to me. Summary: My general feeling is, that this patch should be applied after minor code style corrections. As a longer term goal I think intagg should be integrated into core, since it's very basic functionality. TODO entries for things like an array_accum() aggregate already exist. Adding this patch to contrib now might be a step into the right direction. Dmitry, can you please apply these small corrections and re-submit the patch? Regards Markus Wanner P.S.: I dislike the intagg's use of PGARRAY, but that's nothing to do with this patch. Shouldn't this better use a real composite type as the aggregate's state type? I'd propose to clean up the intagg contrib module and prepare it for inclusion into core. [1]: functional and performance testing session: On a database with (patched) intagg and intarr contrib modules: markus=# CREATE TABLE test (id INT NOT NULL, arr INT[] NOT NULL); CREATE TABLE markus=# INSERT INTO test VALUES (1, ARRAY[1,2,3]), (2, ARRAY[4,5]), (3, ARRAY[3,2,1]); INSERT 0 3 markus=# SELECT * FROM test; id | arr ----+--------- 1 | {1,2,3} 2 | {4,5} 3 | {3,2,1} (3 rows) markus=# SELECT int_array_append_aggregate(arr) FROM test; int_array_append_aggregate ---------------------------- {1,2,3,4,5,3,2,1} (1 row) markus=# SELECT * FROM test; id | arr ----+--------- 1 | {1,2,3} 2 | {4,5} 3 | {3,2,1} (3 rows) markus=# SELECT int_array_aggregate(id) AS ids, int_array_append_aggregate(arr) FROM test GROUP BY (id / 2); ids | int_array_append_aggregate -------+---------------------------- {1} | {1,2,3} {2,3} | {4,5,3,2,1} (2 rows) markus=# SELECT int_array_aggregate(id) AS ids, int_array_append_aggregate(arr) FROM test GROUP BY (id % 2); ids | int_array_append_aggregate -------+---------------------------- {2} | {4,5} {1,3} | {1,2,3,3,2,1} (2 rows) markus=# INSERT INTO test VALUES (4, NULL); INSERT 0 1 markus=# SELECT int_array_append_aggregate(arr) FROM test; int_array_append_aggregate ---------------------------- {1,2,3,4,5,3,2,1} (1 row) markus=# SELECT id, int_array_append_aggregate(arr) FROM test GROUP BY id; id | int_array_append_aggregate ----+---------------------------- 4 | {} 2 | {4,5} 3 | {3,2,1} 1 | {1,2,3} (4 rows) -- switching to performance testing markus=# \timing Timing is on. markus=# DELETE FROM test; DELETE 4 Time: 9.037 ms markus=# INSERT INTO test SELECT generate_series(1, 10000000), array[round(random() * 100)]::int[]; INSERT 0 10000000 Time: 53321.186 ms markus=# SELECT icount(int_array_aggregate(id)) AS count FROM test; count ---------- 10000000 (1 row) Time: 2493.184 ms markus=# SELECT icount(int_array_append_aggregate(arr)) AS count FROM test; count ---------- 10000000 (1 row) Time: 4152.478 ms
Markus Wanner <markus@bluegap.ch> writes: > Submission review: we generally prefer having patches archived on our > mailing lists, so please just send future patches or revisions of this > patch to our lists (I prefer -hackers, but probably -patches is still > the official one). Please. But -patches is dead, please use -hackers. > The documentation should go into a README file of the > contrib module or some such. No, it should go into the SGML docs, specifically doc/src/sgml/intarray.sgml and intagg.sgml. We got rid of flat-text README documentation for contrib in 8.3, and are not about to permit any backsliding. > Summary: My general feeling is, that this patch should be applied after > minor code style corrections. As a longer term goal I think intagg > should be integrated into core, since it's very basic functionality. Well, what should be integrated is generic (non-datatype-specific) versions of this functionality, ie functions working on anyarray/anyelement not just int4[]/int4. There was some discussion about that a few days ago. I'm not really sure how painful it would be to do; probably the generalization to non-fixed-width types would be the trickiest part. regards, tom lane
Re: Review Report: propose to include 3 new functions into intarray and intagg
From
"Dmitry Koterov"
Date:
<div dir="ltr">OK, thank you for your review.<br /><br />I'll correct everything and send a patch in a couple of days. Areyou completely sure that this patch will be included? If not, seems the work of the patch standartization has much lowerpriority, and I will not hurry so much.<br /><br />But, what about intarray patch? Does somebody plan to review it?I'd prefer to include it too. If you approve, I'll correct the code style in intarray contrib patch too.<br /><br /><br/><br /><div class="gmail_quote">On Sat, Sep 6, 2008 at 3:34 PM, Markus Wanner <span dir="ltr"><markus@bluegap.ch></span>wrote:<br /><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204,204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> Hi,<br /><br /> this is my first "official" review. I'vetried to follow the "Review a<br /> patch" guidelines from the wiki - thanks Simon, that was pretty helpful.<br /><br/> This review covers only the intagg additions.<br /><br /> Dmitry Koterov wrote:<br /><blockquote class="gmail_quote"style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> Hereare these functions with detailed documentation:<br /><a href="http://en.dklab.ru/lib/dklab_postgresql_patch/" target="_blank">http://en.dklab.ru/lib/dklab_postgresql_patch/</a><br/></blockquote><br /> Submission review: we generallyprefer having patches archived on our<br /> mailing lists, so please just send future patches or revisions of this<br/> patch to our lists (I prefer -hackers, but probably -patches is still<br /> the official one). The documentationshould go into a README file of the<br /> contrib module or some such. Tests are missing, but that's the casefor<br /> the intagg contrib module anyway. The patch applies and is in context<br /> diff format, good. It adds an unrelatednewline, which should generally be avoided.<br /><br /> Usability review: functional additions look good and usable,certainly<br /> within a contrib module. I don't think it's compliant to any spec, but<br /> that's not required forcontrib, IMO.<br /><br /> Functional test: works as advertised on the accompanying website. I've<br /> tested the int_array_append_aggregatesomewhat, see [1].<br /><br /> Performance testing: I've tested with 10 mio rows of arrays of size1<br /> and compared against core's int_array_aggregate function, see [1] again.<br /> In this simple test it took roughly50% longer, which seems okay. Memory consumption looks sane as well.<br /><br /> Coding review: style seems fine forcontrib, though lines longer than 80 cols should be broken up. Comments in the code are sparse , some even counter-productive(marking something as "additional things" certainly doesn't help).<br /><br /> Code and architecture review:the new int_array_append_aggregate() functions itself seems fine to me.<br /><br /> Summary: My general feeling is,that this patch should be applied after minor code style corrections. As a longer term goal I think intagg should be integratedinto core, since it's very basic functionality. TODO entries for things like an array_accum() aggregate alreadyexist. Adding this patch to contrib now might be a step into the right direction.<br /><br /> Dmitry, can you pleaseapply these small corrections and re-submit the patch?<br /><br /> Regards<br /><br /> Markus Wanner<br /><br /> P.S.:I dislike the intagg's use of PGARRAY, but that's nothing to do with this patch. Shouldn't this better use a real compositetype as the aggregate's state type? I'd propose to clean up the intagg contrib module and prepare it for inclusioninto core.<br /><br /><br /> [1]: functional and performance testing session:<br /><br /> On a database with (patched)intagg and intarr contrib modules:<br /><br /> markus=# CREATE TABLE test (id INT NOT NULL, arr INT[] NOT NULL);<br/> CREATE TABLE<br /> markus=# INSERT INTO test VALUES (1, ARRAY[1,2,3]), (2, ARRAY[4,5]), (3,<br /> ARRAY[3,2,1]);<br/> INSERT 0 3<br /> markus=# SELECT * FROM test;<br /> id | arr<br /> ----+---------<br /> 1 | {1,2,3}<br/> 2 | {4,5}<br /> 3 | {3,2,1}<br /> (3 rows)<br /><br /> markus=# SELECT int_array_append_aggregate(arr) FROMtest;<br /> int_array_append_aggregate<br /> ----------------------------<br /> {1,2,3,4,5,3,2,1}<br /> (1 row)<br/><br /> markus=# SELECT * FROM test;<br /> id | arr<br /> ----+---------<br /> 1 | {1,2,3}<br /> 2 | {4,5}<br/> 3 | {3,2,1}<br /> (3 rows)<br /><br /> markus=# SELECT int_array_aggregate(id) AS ids,<br /> int_array_append_aggregate(arr)FROM test GROUP BY (id / 2);<br /> ids | int_array_append_aggregate<br /> -------+----------------------------<br/> {1} | {1,2,3}<br /> {2,3} | {4,5,3,2,1}<br /> (2 rows)<br /><br /> markus=#SELECT int_array_aggregate(id) AS ids,<br /> int_array_append_aggregate(arr) FROM test GROUP BY (id % 2);<br /> ids | int_array_append_aggregate<br /> -------+----------------------------<br /> {2} | {4,5}<br /> {1,3} | {1,2,3,3,2,1}<br/> (2 rows)<br /><br /> markus=# INSERT INTO test VALUES (4, NULL);<br /> INSERT 0 1<br /><br /> markus=#SELECT int_array_append_aggregate(arr) FROM test;<br /> int_array_append_aggregate<br /> ----------------------------<br/> {1,2,3,4,5,3,2,1}<br /> (1 row)<br /><br /> markus=# SELECT id, int_array_append_aggregate(arr)FROM test GROUP BY id;<br /> id | int_array_append_aggregate<br /> ----+----------------------------<br/> 4 | {}<br /> 2 | {4,5}<br /> 3 | {3,2,1}<br /> 1 | {1,2,3}<br /> (4 rows)<br /><br/> -- switching to performance testing<br /><br /> markus=# \timing<br /> Timing is on.<br /><br /> markus=# DELETEFROM test;<br /> DELETE 4<br /> Time: 9.037 ms<br /><br /> markus=# INSERT INTO test SELECT generate_series(1, 10000000),<br/> array[round(random() * 100)]::int[];<br /> INSERT 0 10000000<br /> Time: 53321.186 ms<br /><br /> markus=#SELECT icount(int_array_aggregate(id)) AS count FROM test;<br /> count<br /> ----------<br /> 10000000<br /> (1row)<br /><br /> Time: 2493.184 ms<br /><br /> markus=# SELECT icount(int_array_append_aggregate(arr)) AS count FROM test;<br/> count<br /> ----------<br /> 10000000<br /> (1 row)<br /><br /> Time: 4152.478 ms<br /><br /><br /></blockquote></div><br/></div>
Re: Review Report: propose to include 3 new functions into intarray and intagg
From
Markus Wanner
Date:
Hi, Dmitry Koterov wrote: > I'll correct everything and send a patch in a couple of days. Cool, thank you. > Are you > completely sure that this patch will be included? Uh.. I'm not a committer, but I'm pretty sure your patch has good chances. I can help with SGML documentation, if you want. > But, what about intarray patch? Does somebody plan to review it? I'd > prefer to include it too. If you approve, I'll correct the code style in > intarray contrib patch too. I've already volunteered for reviewing it as well. I just felt like splitting things up... Regards Markus Wanner
Re: Review Report: propose to include 3 new functions into intarray and intagg
From
Heikki Linnakangas
Date:
Markus Wanner wrote: > Dmitry Koterov wrote: >> But, what about intarray patch? Does somebody plan to review it? I'd >> prefer to include it too. If you approve, I'll correct the code style >> in intarray contrib patch too. > > I've already volunteered for reviewing it as well. I just felt like > splitting things up... Looking at the intarray patch: I find it a bit unfriendly to have a function that depends on sorted input, but doesn't check it. But that's probably not a good enough reason to reject an otherwise simple and useful function. Also, we already have uniq, which doesn't strictly speaking require sorted input, but it does if you want to eliminate all duplicates from the array. _int_group_count_sort seems a bit special purpose. Why does it bother to sort the output? That's wasted time if you don't need sorted output, or if you want the array sorted by the integer value instead of frequency. If you want sorted output, you can just sort it afterwards. Also, it's requiring sorted input for a small performance gain, but there's a lot more precedence in the existing intarray functions to not require sorted input, but to sort the input instead (union, intersect, same, overlap). I realize that the current implementation is faster for the use case where the input is sorted, and output needs to be sorted, but if we go down that path we'll soon have dozens of different variants of various functions, with different ordering requirements of inputs and outputs. So, I'd suggest changing _int_group_count_sort so that it doesn't require sorted input, and doesn't sort the output. The binary search function looks good to me (I think I'd prefer naming it bsearch(), though, though I can see that it was named bidx in reference to the existing idx function). Also, as Markus pointed out, the SGML docs need to be updated. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Review Report: propose to include 3 new functions into intarray and intagg
From
Markus Wanner
Date:
Hi, sorry for not having completed this review, yet. As you are obviously looking at the patch as well, I'll try to quickly write down my points so far. Trying to compile the intarray module, I now receive an error: error: ‘INT4OID’ undeclared (first use in this function) That can be solved by including "catalog/pg_type.h" from contrib/intarr/_int_op.c. The PG_FUNCTION_INFO_V1 and prototype definition certainly belong to the top of the file, where all others are. Some lines are longer than 80 columns and again comments are a bit sparse or even useless (no "additional things", please). Heikki Linnakangas wrote: > I find it a bit unfriendly to have a function that depends on sorted > input, but doesn't check it. But that's probably not a good enough > reason to reject an otherwise simple and useful function. Also, we > already have uniq, which doesn't strictly speaking require sorted input, > but it does if you want to eliminate all duplicates from the array. I think it's a performance optimization which is absolutely required in some cases. Some time ago I've also had to rip out the sorting step from certain intarray module functions to save processing time. One option already mentioned somewhere would be saving a 'sorted' property for the array. Then again, I think such a thing would certainly have to be done globally, for all kinds of arrays. > _int_group_count_sort seems a bit special purpose. Why does it bother to > sort the output? That's wasted time if you don't need sorted output, or > if you want the array sorted by the integer value instead of frequency. > If you want sorted output, you can just sort it afterwards. Agreed. IMO the normal GROUP BY and ORDER BY stuff of the database itself should be used for such a thing. However, that means turning an array into a set of rows... > Also, it's requiring sorted input for a small performance gain, but > there's a lot more precedence in the existing intarray functions to not > require sorted input, but to sort the input instead (union, intersect, > same, overlap). ..and exactly these are the functions I had to wrap again to strip the sorting step, due to poor performance for known-sorted arrays. > I realize that the current implementation is faster for the use case > where the input is sorted, and output needs to be sorted, but if we go > down that path we'll soon have dozens of different variants of various > functions, with different ordering requirements of inputs and outputs. Agreed. However, given the OP is using that in production, there seems to be a use case for the optimization, where we have none for the same function without it. > So, I'd suggest changing _int_group_count_sort so that it doesn't > require sorted input, and doesn't sort the output. The binary search > function looks good to me (I think I'd prefer naming it bsearch(), > though, though I can see that it was named bidx in reference to the > existing idx function). Also, as Markus pointed out, the SGML docs need > to be updated. As is, it should probably also carry the '_int' prefix, because it's not a general purpose array function. So propose to name it '_int_bsearch'. Overall I think these functions are overly specialized and should be replaced be more general counterparts in core. However, until we have that, it's hard to refuse such a thing for contrib. By that reasoning, however, the intarray would have to provide methods for sorted as well as asorted input arrays as well. I'm closing my part of reviewing of this patch now. Dmitry, how do you want to proceed with these patches? Do you have time for some cleaning up and writing documentation? Regards Markus Wanner
Re: Review Report: propose to include 3 new functions into intarray and intagg
From
"Dmitry Koterov"
Date:
No problem, I have time for clearing. But are these functions guaranteed to be included in the contrib? If there is no guarantee, seems the time of clearing will be wasted. (5 years ago I have already cleaned one open-source library on demand and after that it was not approved for PEAR repository, so now I am more circumspect, sorry :-). So - is the approval procedure finished? If not, who could make the final decision ("be or not to be")? Sorry, I don't yet know well the patch proposal procedure... P.S. 1. Unfortunately GROUP BY + ORDER BY is sometimes 1000 times (!) slower than _int_group_count_sort. Because of that I have created this function. 2. I have to assume that the input is sorted in functions, because else the performance is lowered very much. Sort operation is quite expensive; checking if an array is sorted or not is also quite expensive... So I think it is not a good idea to add sorting-independency to functions. 3. Seems the conversion of functions to anyarray/anyelement is much more time-expensive than simply including it to specialized intarray/intagg. Is it absolutely necessery? On Mon, Sep 15, 2008 at 4:38 PM, Markus Wanner <markus@bluegap.ch> wrote: > Hi, > > sorry for not having completed this review, yet. As you are obviously > looking at the patch as well, I'll try to quickly write down my points so > far. > > Trying to compile the intarray module, I now receive an error: > > error: 'INT4OID' undeclared (first use in this function) > > That can be solved by including "catalog/pg_type.h" from > contrib/intarr/_int_op.c. > > > The PG_FUNCTION_INFO_V1 and prototype definition certainly belong to the top > of the file, where all others are. > > Some lines are longer than 80 columns and again comments are a bit sparse or > even useless (no "additional things", please). > > > Heikki Linnakangas wrote: >> >> I find it a bit unfriendly to have a function that depends on sorted >> input, but doesn't check it. But that's probably not a good enough reason to >> reject an otherwise simple and useful function. Also, we already have uniq, >> which doesn't strictly speaking require sorted input, but it does if you >> want to eliminate all duplicates from the array. > > I think it's a performance optimization which is absolutely required in some > cases. Some time ago I've also had to rip out the sorting step from certain > intarray module functions to save processing time. > > One option already mentioned somewhere would be saving a 'sorted' property > for the array. Then again, I think such a thing would certainly have to be > done globally, for all kinds of arrays. > >> _int_group_count_sort seems a bit special purpose. Why does it bother to >> sort the output? That's wasted time if you don't need sorted output, or if >> you want the array sorted by the integer value instead of frequency. If you >> want sorted output, you can just sort it afterwards. > > Agreed. IMO the normal GROUP BY and ORDER BY stuff of the database itself > should be used for such a thing. However, that means turning an array into a > set of rows... > >> Also, it's requiring sorted input for a small performance gain, but >> there's a lot more precedence in the existing intarray functions to not >> require sorted input, but to sort the input instead (union, intersect, same, >> overlap). > > ..and exactly these are the functions I had to wrap again to strip the > sorting step, due to poor performance for known-sorted arrays. > >> I realize that the current implementation is faster for the use case where >> the input is sorted, and output needs to be sorted, but if we go down that >> path we'll soon have dozens of different variants of various functions, with >> different ordering requirements of inputs and outputs. > > Agreed. > > However, given the OP is using that in production, there seems to be a use > case for the optimization, where we have none for the same function without > it. > >> So, I'd suggest changing _int_group_count_sort so that it doesn't require >> sorted input, and doesn't sort the output. The binary search function looks >> good to me (I think I'd prefer naming it bsearch(), though, though I can see >> that it was named bidx in reference to the existing idx function). Also, as >> Markus pointed out, the SGML docs need to be updated. > > As is, it should probably also carry the '_int' prefix, because it's not a > general purpose array function. So propose to name it '_int_bsearch'. > > Overall I think these functions are overly specialized and should be > replaced be more general counterparts in core. However, until we have that, > it's hard to refuse such a thing for contrib. > > By that reasoning, however, the intarray would have to provide methods for > sorted as well as asorted input arrays as well. > > I'm closing my part of reviewing of this patch now. Dmitry, how do you want > to proceed with these patches? Do you have time for some cleaning up and > writing documentation? > > Regards > > Markus Wanner >