Thread: Aggregates with non-commutative transition functions

Aggregates with non-commutative transition functions

From
Emmanuel Charpentier
Date:
Dear list,

I am working on a bibliograpic database. Some citations come from Medline,
whose "full" format gives citations as an (ordered) set of tag-value pairs.
Among them, authots are quoted one tag-pair per author. Collecting them is
trivial, giving a table whose structure is essentially as in :

create cit_authors (
    recnum int4, -- Currrent record (citation)
    linenum int4, -- current line in input file, gives ordering
    author text, -- Author name and initials
    primary key (recnum, linenum));

This table has secondary indexes on both recnum and linenum, for various
query efficiency reasons ... In some case, a third index on author might
prove useful ...

In order to build the authors list of a given reference, I built an
auxilliary aggregate :

create function txt_glue (text, text) returns text as '
declare
   t1 alias for $1;
   t2 alias for $2;
   res text;
begin
   if t1 is null
      then
        res:=t2;
      elsif t2 is null
        then res:=t1;
      else res := t1 || \', \' || t2;
    end if;
   return res;
end;'
language plpgsql;

create aggregate glue ( basetype=text,
   sfunc=txt_glue,
   stype=text);

The problem is as follows : how can I guarantee that the authors will be
quoted in the original order ? In this case, text catenation is *highly*
noncommutative ! (<AsbestosLongjohns> Getting the authors order wrong is a
sure-fire way to get all the authors mad at you ... </AsbestosLongjohns>).

In other words, may I guarantee that :

select recnum, glue(linenum)as authors from (select recnum, linenum, author
from cit_authors where <some conditions on recnum> order by recnum,
linenum) as foo;

will indeed give me the authors in the original order ?

Your thoughs ?

                    Emmanuel Charpentier


Re: Aggregates with non-commutative transition functions

From
Tom Lane
Date:
Emmanuel Charpentier <charpent@bacbuc.dyndns.org> writes:
> In other words, may I guarantee that :
> select recnum, glue(linenum)as authors from (select recnum, linenum, author
> from cit_authors where <some conditions on recnum> order by recnum,
> linenum) as foo;
> will indeed give me the authors in the original order ?

The query as given is illegal; you'd need to add "GROUP BY recnum" to
the outer query to make it legal.  And once you do that, I don't think
you can rely on the ordering of the rows.

This problem has been discussed before, and I think we came up with some
workaround for it, but I don't have time to go trolling the archives for
the solution at the moment.

            regards, tom lane

Re: Aggregates with non-commutative transition functions

From
Emmanuel Charpentier
Date:
(CC'd to the newsgroup/mailing list to work around damn Tom's spamguard,
which refuses to recognize my SMTP as a possibly valid one ...)

Dear Tom,

Tom Lane wrote:
> Emmanuel Charpentier <charpent@bacbuc.dyndns.org> writes:
>
>>In other words, may I guarantee that :
>>select recnum, glue(linenum)as authors from (select recnum, linenum, author
>>from cit_authors where <some conditions on recnum> order by recnum,
>>linenum) as foo;
>>will indeed give me the authors in the original order ?
>
>
> The query as given is illegal; you'd need to add "GROUP BY recnum" to
> the outer query to make it legal.  And once you do that, I don't think
> you can rely on the ordering of the rows.

Right ! I was typing off the top of my mind ...

> This problem has been discussed before, and I think we came up with some
> workaround for it, but I don't have time to go trolling the archives for
> the solution at the moment.

I looked up the archives (not a, easy task ...), didn't found a solution
per se, but a good explanation of the problem (yours, BTW ...). and got an
idea : the problem arises, according to you) from the fact that the
aggregation itself requires a sort according to the grouping key, which
isn't stable (i.e. may muck the row ordering of the table to which the
aggregate is to be applied) "on most platforms" (does this mean that you
use the libc sort() ?).

Since the stability of the sort is an issue but for this quite specific
case, one could envision a flag to "CREATE AGGREGATE", signalling the
planner the fact that the transition function isn't commutative, hence the
need to use a stable version of sort() (I *suppose* that this exists in the
littterature ...). AFAICT, this is the only case where this need arises.

Since CREATE AGGREGATE is PostgreSQL-specific anyway, this shouldn't break
compatibility with anything.

What I do not know is if this extension has enough of "general usefulness"
to grant the work (insertion of a stable sort routine, mucling with the
planner, changging the grammar for CREATE AGGREGATE, extenging
pg_aggregate, etc ...) necessary to  fit it in PostgreSQL...

In the meanwhile, I'll cope with an ad-hockery (probably involving a cursor).

Your thoughs ?

                        Emmanuel Charpentier

--
Emmanuel Charpentier


Re: Aggregates with non-commutative transition functions

From
Tom Lane
Date:
Emmanuel Charpentier <charpent@bacbuc.dyndns.org> writes:
> Since the stability of the sort is an issue but for this quite specific
> case, one could envision a flag to "CREATE AGGREGATE", signalling the
> planner the fact that the transition function isn't commutative, hence the
> need to use a stable version of sort()

Actually, I would think that you'd really prefer that the system not
run a sort step at all.

As of CVS tip, if the planner decides to use hash-based aggregation
for your query then there wouldn't be any pre-sort.  But there's no
guarantee it will do that.

A better alternative is to get the planner to notice in the context of
the outer query that the inner query's result is already sorted by
recnum.  Then it wouldn't do the unwanted sort in any case.  This has
been on the to-do list for awhile, but hasn't risen to the top ...

            regards, tom lane

Re: Aggregates with non-commutative transition functions

From
Tom Lane
Date:
I said:
> A better alternative is to get the planner to notice in the context of
> the outer query that the inner query's result is already sorted by
> recnum.  Then it wouldn't do the unwanted sort in any case.  This has
> been on the to-do list for awhile, but hasn't risen to the top ...

Now it has ... as of CVS tip, you can do

regression=# create table tab(foo int, bar int, baz float);
CREATE TABLE
regression=# explain select foo, avg(baz) from
regression-#   (select foo,baz from tab order by foo, bar) ss
regression-# group by foo;
                                QUERY PLAN
--------------------------------------------------------------------------
 GroupAggregate  (cost=69.83..77.83 rows=200 width=16)
   ->  Subquery Scan ss  (cost=69.83..72.33 rows=1000 width=16)
         ->  Sort  (cost=69.83..72.33 rows=1000 width=16)
               Sort Key: foo, bar
               ->  Seq Scan on tab  (cost=0.00..20.00 rows=1000 width=16)
(5 rows)

Note the lack of an extra sort above the subquery.  This provides a
general technique for controlling the ordering of inputs to a
user-written aggregate function, even when grouping.

            regards, tom lane

Re: Aggregates with non-commutative transition functions

From
Emmanuel Charpentier
Date:
Tom Lane wrote:
> I said:
>
>>A better alternative is to get the planner to notice in the context of
>>the outer query that the inner query's result is already sorted by
>>recnum.  Then it wouldn't do the unwanted sort in any case.  This has
>>been on the to-do list for awhile, but hasn't risen to the top ...
>
>
> Now it has ... as of CVS tip, you can do

[ Nice demo ... ]

> Note the lack of an extra sort above the subquery.  This provides a
> general technique for controlling the ordering of inputs to a
> user-written aggregate function, even when grouping.

Schön ! I suppose that this has other fringe benefits for planning in
general ...

Thanks a lot ! I'll try to build a secondary PostgreSQL from CVS on a
development machine to test it.

Do you plan incorporation in some forthcoming 7.3.x ? Or push it back to 7.4 ?

                    Emmanuel Charpentier

Re: Aggregates with non-commutative transition functions

From
Tom Lane
Date:
Emmanuel Charpentier <charpent@bacbuc.dyndns.org> writes:
> Tom Lane wrote:
>> Note the lack of an extra sort above the subquery.  This provides a
>> general technique for controlling the ordering of inputs to a
>> user-written aggregate function, even when grouping.

> Sch�n ! I suppose that this has other fringe benefits for planning in
> general ...

Yes, it should save cycles in many scenarios, so I thought it was worth
doing in any case.

> Do you plan incorporation in some forthcoming 7.3.x ? Or push it back
> to 7.4 ?

No, I would not risk back-patching this into 7.3.*.  It's not a bug fix.

(But having said that, you could get the diffs from the CVS server and
back-patch to create your own private 7.3 variant, if you can't wait
for 7.4.  Offhand I do not think there'd be any great difficulty in
applying the change to 7.3 branch.)

            regards, tom lane

Re: Aggregates with non-commutative transition functions

From
Emmanuel Charpentier
Date:
Tom Lane wrote:
> Emmanuel Charpentier <charpent@bacbuc.dyndns.org> writes:

[ A nice optimisation allowing non-commutative aggregates to do what
they're intended to do ... ]

>>Do you plan incorporation in some forthcoming 7.3.x ? Or push it back
>>to 7.4 ?
>
>
> No, I would not risk back-patching this into 7.3.*.  It's not a bug fix.

Right !

> (But having said that, you could get the diffs from the CVS server and
> back-patch to create your own private 7.3 variant, if you can't wait
> for 7.4.  Offhand I do not think there'd be any great difficulty in
> applying the change to 7.3 branch.)

Schuck no ! I'm not good enough at this ... My coding days are behind me,
and I'm currently just a user. BTW, I tend to stay on stable versions. I'm
experimenting with 7.3 on a development machine, but my main production
machine is still 7.2 (I'm staying with Debian woody). The recent backport
of 7.3 to Woody is a Good Thing (TM), but the current 7.3.1 has problems
with ODBC. Oliver Elphick is currently backporting 7.3.2, but it(s not here
yet ...

I'll wait patiently 7.4 ...

Thanks a lot !

                    Emmanuel Charpentier

PS : Did you (or anyone) considered my suggestion for an "alter table alter
column retype" extension ? Thinking of it again, I guess it could have
other benefits and be something else than simple syntactic sugar ...