Thread: Aggregates with non-commutative transition functions
Dear list, I am working on a bibliograpic database. Some citations come from Medline, whose "full" format gives citations as an (ordered) set of tag-value pairs. Among them, authots are quoted one tag-pair per author. Collecting them is trivial, giving a table whose structure is essentially as in : create cit_authors ( recnum int4, -- Currrent record (citation) linenum int4, -- current line in input file, gives ordering author text, -- Author name and initials primary key (recnum, linenum)); This table has secondary indexes on both recnum and linenum, for various query efficiency reasons ... In some case, a third index on author might prove useful ... In order to build the authors list of a given reference, I built an auxilliary aggregate : create function txt_glue (text, text) returns text as ' declare t1 alias for $1; t2 alias for $2; res text; begin if t1 is null then res:=t2; elsif t2 is null then res:=t1; else res := t1 || \', \' || t2; end if; return res; end;' language plpgsql; create aggregate glue ( basetype=text, sfunc=txt_glue, stype=text); The problem is as follows : how can I guarantee that the authors will be quoted in the original order ? In this case, text catenation is *highly* noncommutative ! (<AsbestosLongjohns> Getting the authors order wrong is a sure-fire way to get all the authors mad at you ... </AsbestosLongjohns>). In other words, may I guarantee that : select recnum, glue(linenum)as authors from (select recnum, linenum, author from cit_authors where <some conditions on recnum> order by recnum, linenum) as foo; will indeed give me the authors in the original order ? Your thoughs ? Emmanuel Charpentier
Emmanuel Charpentier <charpent@bacbuc.dyndns.org> writes: > In other words, may I guarantee that : > select recnum, glue(linenum)as authors from (select recnum, linenum, author > from cit_authors where <some conditions on recnum> order by recnum, > linenum) as foo; > will indeed give me the authors in the original order ? The query as given is illegal; you'd need to add "GROUP BY recnum" to the outer query to make it legal. And once you do that, I don't think you can rely on the ordering of the rows. This problem has been discussed before, and I think we came up with some workaround for it, but I don't have time to go trolling the archives for the solution at the moment. regards, tom lane
(CC'd to the newsgroup/mailing list to work around damn Tom's spamguard, which refuses to recognize my SMTP as a possibly valid one ...) Dear Tom, Tom Lane wrote: > Emmanuel Charpentier <charpent@bacbuc.dyndns.org> writes: > >>In other words, may I guarantee that : >>select recnum, glue(linenum)as authors from (select recnum, linenum, author >>from cit_authors where <some conditions on recnum> order by recnum, >>linenum) as foo; >>will indeed give me the authors in the original order ? > > > The query as given is illegal; you'd need to add "GROUP BY recnum" to > the outer query to make it legal. And once you do that, I don't think > you can rely on the ordering of the rows. Right ! I was typing off the top of my mind ... > This problem has been discussed before, and I think we came up with some > workaround for it, but I don't have time to go trolling the archives for > the solution at the moment. I looked up the archives (not a, easy task ...), didn't found a solution per se, but a good explanation of the problem (yours, BTW ...). and got an idea : the problem arises, according to you) from the fact that the aggregation itself requires a sort according to the grouping key, which isn't stable (i.e. may muck the row ordering of the table to which the aggregate is to be applied) "on most platforms" (does this mean that you use the libc sort() ?). Since the stability of the sort is an issue but for this quite specific case, one could envision a flag to "CREATE AGGREGATE", signalling the planner the fact that the transition function isn't commutative, hence the need to use a stable version of sort() (I *suppose* that this exists in the littterature ...). AFAICT, this is the only case where this need arises. Since CREATE AGGREGATE is PostgreSQL-specific anyway, this shouldn't break compatibility with anything. What I do not know is if this extension has enough of "general usefulness" to grant the work (insertion of a stable sort routine, mucling with the planner, changging the grammar for CREATE AGGREGATE, extenging pg_aggregate, etc ...) necessary to fit it in PostgreSQL... In the meanwhile, I'll cope with an ad-hockery (probably involving a cursor). Your thoughs ? Emmanuel Charpentier -- Emmanuel Charpentier
Emmanuel Charpentier <charpent@bacbuc.dyndns.org> writes: > Since the stability of the sort is an issue but for this quite specific > case, one could envision a flag to "CREATE AGGREGATE", signalling the > planner the fact that the transition function isn't commutative, hence the > need to use a stable version of sort() Actually, I would think that you'd really prefer that the system not run a sort step at all. As of CVS tip, if the planner decides to use hash-based aggregation for your query then there wouldn't be any pre-sort. But there's no guarantee it will do that. A better alternative is to get the planner to notice in the context of the outer query that the inner query's result is already sorted by recnum. Then it wouldn't do the unwanted sort in any case. This has been on the to-do list for awhile, but hasn't risen to the top ... regards, tom lane
I said: > A better alternative is to get the planner to notice in the context of > the outer query that the inner query's result is already sorted by > recnum. Then it wouldn't do the unwanted sort in any case. This has > been on the to-do list for awhile, but hasn't risen to the top ... Now it has ... as of CVS tip, you can do regression=# create table tab(foo int, bar int, baz float); CREATE TABLE regression=# explain select foo, avg(baz) from regression-# (select foo,baz from tab order by foo, bar) ss regression-# group by foo; QUERY PLAN -------------------------------------------------------------------------- GroupAggregate (cost=69.83..77.83 rows=200 width=16) -> Subquery Scan ss (cost=69.83..72.33 rows=1000 width=16) -> Sort (cost=69.83..72.33 rows=1000 width=16) Sort Key: foo, bar -> Seq Scan on tab (cost=0.00..20.00 rows=1000 width=16) (5 rows) Note the lack of an extra sort above the subquery. This provides a general technique for controlling the ordering of inputs to a user-written aggregate function, even when grouping. regards, tom lane
Tom Lane wrote: > I said: > >>A better alternative is to get the planner to notice in the context of >>the outer query that the inner query's result is already sorted by >>recnum. Then it wouldn't do the unwanted sort in any case. This has >>been on the to-do list for awhile, but hasn't risen to the top ... > > > Now it has ... as of CVS tip, you can do [ Nice demo ... ] > Note the lack of an extra sort above the subquery. This provides a > general technique for controlling the ordering of inputs to a > user-written aggregate function, even when grouping. Schön ! I suppose that this has other fringe benefits for planning in general ... Thanks a lot ! I'll try to build a secondary PostgreSQL from CVS on a development machine to test it. Do you plan incorporation in some forthcoming 7.3.x ? Or push it back to 7.4 ? Emmanuel Charpentier
Emmanuel Charpentier <charpent@bacbuc.dyndns.org> writes: > Tom Lane wrote: >> Note the lack of an extra sort above the subquery. This provides a >> general technique for controlling the ordering of inputs to a >> user-written aggregate function, even when grouping. > Sch�n ! I suppose that this has other fringe benefits for planning in > general ... Yes, it should save cycles in many scenarios, so I thought it was worth doing in any case. > Do you plan incorporation in some forthcoming 7.3.x ? Or push it back > to 7.4 ? No, I would not risk back-patching this into 7.3.*. It's not a bug fix. (But having said that, you could get the diffs from the CVS server and back-patch to create your own private 7.3 variant, if you can't wait for 7.4. Offhand I do not think there'd be any great difficulty in applying the change to 7.3 branch.) regards, tom lane
Tom Lane wrote: > Emmanuel Charpentier <charpent@bacbuc.dyndns.org> writes: [ A nice optimisation allowing non-commutative aggregates to do what they're intended to do ... ] >>Do you plan incorporation in some forthcoming 7.3.x ? Or push it back >>to 7.4 ? > > > No, I would not risk back-patching this into 7.3.*. It's not a bug fix. Right ! > (But having said that, you could get the diffs from the CVS server and > back-patch to create your own private 7.3 variant, if you can't wait > for 7.4. Offhand I do not think there'd be any great difficulty in > applying the change to 7.3 branch.) Schuck no ! I'm not good enough at this ... My coding days are behind me, and I'm currently just a user. BTW, I tend to stay on stable versions. I'm experimenting with 7.3 on a development machine, but my main production machine is still 7.2 (I'm staying with Debian woody). The recent backport of 7.3 to Woody is a Good Thing (TM), but the current 7.3.1 has problems with ODBC. Oliver Elphick is currently backporting 7.3.2, but it(s not here yet ... I'll wait patiently 7.4 ... Thanks a lot ! Emmanuel Charpentier PS : Did you (or anyone) considered my suggestion for an "alter table alter column retype" extension ? Thinking of it again, I guess it could have other benefits and be something else than simple syntactic sugar ...