Thread: ts_rewrite aggregate API seems mighty ugly

ts_rewrite aggregate API seems mighty ugly

From
Tom Lane
Date:
While working on the documentation it's finally sunk into me what
ts_rewrite is all about, and I'm not very happy.  The simple
three-argument form is fine, it's basically a tsquery-specific
version of replace():

regression=# select ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'foo|bar'::tsquery);      ts_rewrite        
-------------------------'b' & ( 'foo' | 'bar' )
(1 row)

(BTW, would it be possible to get it to not randomly rearrange the order
of the clauses?  That makes it quite confusing to understand what it's
doing, and I don't offhand see a reason for that to happen.)

And the two-argument form is okay, though a bit hard to wrap your head
around --- the second argument is the text of a SQL query that returns
successive second and third arguments for successive ts_rewrite
operations on the first argument.  For instance,

regression=# create table ts_subst(target tsquery, subst tsquery);
CREATE TABLE
regression=# insert into ts_subst values('a', 'foo|bar');
INSERT 0 1
regression=# insert into ts_subst values('b', 'baz');
INSERT 0 1
regression=# select ts_rewrite('a & b'::tsquery, 'select target, subst from ts_subst');       ts_rewrite         
---------------------------'baz' & ( 'foo' | 'bar' )
(1 row)

The point of this of course is to be able to apply many different
substitutions stored in a table.

However, the aggregate form seems just plain weird:

regression=# select ts_rewrite(array['a & b'::tsquery, target, subst]) from ts_subst;       ts_rewrite         
---------------------------'baz' & ( 'baz' | 'foo' )
(1 row)

The assumption here is that target and substitute come from successive
rows of the table, and again we're trying to map an original tsquery
through a bunch of different replace() substitutions.  The array[] bit
is ugly and inefficient.  I suppose it's a holdover from days when
aggregates could only take one argument.  Now that we have
multiple-argument aggregates it would seem to make more sense to express
the thing as a three-argument aggregate, except that then we'd have to
give it a different name to distinguish it from the basic three-argument
non-aggregate function.

Another thing that I find ugly about this is that the 'original'
tsquery is meaningful only on the first accumulation cycle; after that
it's ignored, but the second and third array elements do have meaning.
That's inconsistent.

But the killer problem is that the aggregate form actually does the
Wrong Thing if there are no rows to aggregate over --- it'll return
NULL, not the original tsquery as you'd want.  This is not too
improbable if you follow the documentation's recommendation to filter
the rows using an @> operator:

regression=# select ts_rewrite(array['a & b'::tsquery, target, subst]) from ts_subst where 'a & b'::tsquery @> target;
    ts_rewrite         
 
---------------------------'baz' & ( 'bar' | 'foo' )
(1 row)

regression=# select ts_rewrite(array['quux'::tsquery, target, subst]) from ts_subst where 'quux'::tsquery @>
target;ts_rewrite
 
------------
(1 row)

So it seems to me this is just wrong, and the only safe way to do it is
to use the two-argument form, with the selection from the table
happening as a SPI query.

Since we're already committed to an initdb for beta2, it's not quite too
late to reconsider the API here.  My feeling at the moment is that we
should just drop the aggregate form of ts_rewrite; it does nothing you
can't do better with the two-argument form, and it is just ugly to boot.
Also, if anyone does come up with a not-so-ugly design later, we can
always add things in 8.4 or beyond; but once it's in core it's going
to be a very hard sell to take it out.

Comments?
        regards, tom lane


Re: ts_rewrite aggregate API seems mighty ugly

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Since we're already committed to an initdb for beta2, it's not quite too
> late to reconsider the API here.  My feeling at the moment is that we
> should just drop the aggregate form of ts_rewrite; it does nothing you
> can't do better with the two-argument form, and it is just ugly to boot.
> Also, if anyone does come up with a not-so-ugly design later, we can
> always add things in 8.4 or beyond; but once it's in core it's going
> to be a very hard sell to take it out.

The two-argument form may not be actively broken but it sounds not very
integrated. Passing a string which is then planned as an SQL query is not very
SQL-ish. If you prepare the query does that string get prepared? When do you
get syntax errors? If the plan's invalidated does it handle it correctly?

It sounds to me like once we have a good aggregate interface there's not much
point to the two-argument form. A good aggregate form would let you do the
subquery or a join or any other form of query to fetch all the replacements
you want.

I would suggest putting both the two-argument form and the aggregate form into
a contrib module -- effectively *back* into a contrib module until the
interfaces can be fully integrated into SQL. So whereas previously you
installed a contrib module for the whole tsearch2, now it's all integrated
except for a few oddball functions which you can still get by supplementing
the core tsearch with the contrib module.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com


Re: ts_rewrite aggregate API seems mighty ugly

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> The two-argument form may not be actively broken but it sounds not very
> integrated. Passing a string which is then planned as an SQL query is not very
> SQL-ish.

True.  I'll bet you don't like ts_stat() either.
        regards, tom lane


Re: ts_rewrite aggregate API seems mighty ugly

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> The two-argument form may not be actively broken but it sounds not very
>> integrated. Passing a string which is then planned as an SQL query is not very
>> SQL-ish.
>
> True.  I'll bet you don't like ts_stat() either.

It seems the right way interface here wouldn't be too different from what's
there. All we need is a SRF which takes a single tsvector and returns the set
of words from it.

Then you could do the aggregates yourself in SQL:

SELECT count(distinct apodid) as ndoc,      count(*) as nentry,      element FROM ( SELECT apodid, ts_elements(vector)
ASelement   FROM apod
 
) GROUP BY element

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com


Re: ts_rewrite aggregate API seems mighty ugly

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> True.  I'll bet you don't like ts_stat() either.

> It seems the right way interface here wouldn't be too different from what's
> there. All we need is a SRF which takes a single tsvector and returns the set
> of words from it.

> Then you could do the aggregates yourself in SQL:

> SELECT count(distinct apodid) as ndoc,
>        count(*) as nentry,
>        element
>   FROM (
>   SELECT apodid, ts_elements(vector) AS element
>     FROM apod
> ) GROUP BY element

I'm not sure that's so much cleaner than what's there.  It's relying on
SRF-in-SELECT-list, which is doable at the C level but it's deprecated;
plus the SRF has to return a record, which makes the notation a bit
klugy --- (element).whatever in the upper select-list, and the GROUP BY
probably won't work the way you show here, either.

Another big problem is that the count(distinct) is going to cause the
whole thing to have pretty awful performance.

Not sure about a better way though...
        regards, tom lane


Re: ts_rewrite aggregate API seems mighty ugly

From
Tom Lane
Date:
BTW, I noticed while poking at this that the 2-arg form of ts_rewrite()
is labeled proretset = true in pg_proc.h.  This must be a typo, since
a moment's glance at the code shows it cannot return a set.  Good thing
we didn't ship beta2 yet ...
        regards, tom lane


Re: ts_rewrite aggregate API seems mighty ugly

From
Tom Lane
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Gregory Stark <stark@enterprisedb.com> writes:
>> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>>> True.  I'll bet you don't like ts_stat() either.

>> It seems the right way interface here wouldn't be too different from what's
>> there. ...

> I'm not sure that's so much cleaner than what's there.

It doesn't seem like anyone else is particularly excited about this.
What I'm inclined to do is
* remove the aggregate form of ts_rewrite, which seems unusably  broken because of the problem with the no-rewrites
case.*leave the text-query-input form alone.* leave ts_stat alone.
 

Perhaps future improvements to the aggregate-function mechanisms will
allow these forms of ts_rewrite and ts_stat to be deprecated in favor of
aggregate-based ones.  But for now they offer useful functionality, and
no one else seems to like the idea of hiding them in a contrib module.

One small thing that I am inclined to do, though, is tweak the SPI calls
inside these functions to pass read_only = true.  This is not so much to
prevent someone from doing a non-SELECT query, as to cause the invoked
query to use the same snapshot as the outer query is using.  That'll
possibly avoid some odd behaviors, especially in scenarios where the
function gets evaluated more than once in the outer query.  As the
code stands you could conceivably get different answers.
        regards, tom lane