Thread: ts_rewrite aggregate API seems mighty ugly
While working on the documentation it's finally sunk into me what ts_rewrite is all about, and I'm not very happy. The simple three-argument form is fine, it's basically a tsquery-specific version of replace(): regression=# select ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'foo|bar'::tsquery); ts_rewrite -------------------------'b' & ( 'foo' | 'bar' ) (1 row) (BTW, would it be possible to get it to not randomly rearrange the order of the clauses? That makes it quite confusing to understand what it's doing, and I don't offhand see a reason for that to happen.) And the two-argument form is okay, though a bit hard to wrap your head around --- the second argument is the text of a SQL query that returns successive second and third arguments for successive ts_rewrite operations on the first argument. For instance, regression=# create table ts_subst(target tsquery, subst tsquery); CREATE TABLE regression=# insert into ts_subst values('a', 'foo|bar'); INSERT 0 1 regression=# insert into ts_subst values('b', 'baz'); INSERT 0 1 regression=# select ts_rewrite('a & b'::tsquery, 'select target, subst from ts_subst'); ts_rewrite ---------------------------'baz' & ( 'foo' | 'bar' ) (1 row) The point of this of course is to be able to apply many different substitutions stored in a table. However, the aggregate form seems just plain weird: regression=# select ts_rewrite(array['a & b'::tsquery, target, subst]) from ts_subst; ts_rewrite ---------------------------'baz' & ( 'baz' | 'foo' ) (1 row) The assumption here is that target and substitute come from successive rows of the table, and again we're trying to map an original tsquery through a bunch of different replace() substitutions. The array[] bit is ugly and inefficient. I suppose it's a holdover from days when aggregates could only take one argument. Now that we have multiple-argument aggregates it would seem to make more sense to express the thing as a three-argument aggregate, except that then we'd have to give it a different name to distinguish it from the basic three-argument non-aggregate function. Another thing that I find ugly about this is that the 'original' tsquery is meaningful only on the first accumulation cycle; after that it's ignored, but the second and third array elements do have meaning. That's inconsistent. But the killer problem is that the aggregate form actually does the Wrong Thing if there are no rows to aggregate over --- it'll return NULL, not the original tsquery as you'd want. This is not too improbable if you follow the documentation's recommendation to filter the rows using an @> operator: regression=# select ts_rewrite(array['a & b'::tsquery, target, subst]) from ts_subst where 'a & b'::tsquery @> target; ts_rewrite ---------------------------'baz' & ( 'bar' | 'foo' ) (1 row) regression=# select ts_rewrite(array['quux'::tsquery, target, subst]) from ts_subst where 'quux'::tsquery @> target;ts_rewrite ------------ (1 row) So it seems to me this is just wrong, and the only safe way to do it is to use the two-argument form, with the selection from the table happening as a SPI query. Since we're already committed to an initdb for beta2, it's not quite too late to reconsider the API here. My feeling at the moment is that we should just drop the aggregate form of ts_rewrite; it does nothing you can't do better with the two-argument form, and it is just ugly to boot. Also, if anyone does come up with a not-so-ugly design later, we can always add things in 8.4 or beyond; but once it's in core it's going to be a very hard sell to take it out. Comments? regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Since we're already committed to an initdb for beta2, it's not quite too > late to reconsider the API here. My feeling at the moment is that we > should just drop the aggregate form of ts_rewrite; it does nothing you > can't do better with the two-argument form, and it is just ugly to boot. > Also, if anyone does come up with a not-so-ugly design later, we can > always add things in 8.4 or beyond; but once it's in core it's going > to be a very hard sell to take it out. The two-argument form may not be actively broken but it sounds not very integrated. Passing a string which is then planned as an SQL query is not very SQL-ish. If you prepare the query does that string get prepared? When do you get syntax errors? If the plan's invalidated does it handle it correctly? It sounds to me like once we have a good aggregate interface there's not much point to the two-argument form. A good aggregate form would let you do the subquery or a join or any other form of query to fetch all the replacements you want. I would suggest putting both the two-argument form and the aggregate form into a contrib module -- effectively *back* into a contrib module until the interfaces can be fully integrated into SQL. So whereas previously you installed a contrib module for the whole tsearch2, now it's all integrated except for a few oddball functions which you can still get by supplementing the core tsearch with the contrib module. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes: > The two-argument form may not be actively broken but it sounds not very > integrated. Passing a string which is then planned as an SQL query is not very > SQL-ish. True. I'll bet you don't like ts_stat() either. regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Gregory Stark <stark@enterprisedb.com> writes: >> The two-argument form may not be actively broken but it sounds not very >> integrated. Passing a string which is then planned as an SQL query is not very >> SQL-ish. > > True. I'll bet you don't like ts_stat() either. It seems the right way interface here wouldn't be too different from what's there. All we need is a SRF which takes a single tsvector and returns the set of words from it. Then you could do the aggregates yourself in SQL: SELECT count(distinct apodid) as ndoc, count(*) as nentry, element FROM ( SELECT apodid, ts_elements(vector) ASelement FROM apod ) GROUP BY element -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes: > "Tom Lane" <tgl@sss.pgh.pa.us> writes: >> True. I'll bet you don't like ts_stat() either. > It seems the right way interface here wouldn't be too different from what's > there. All we need is a SRF which takes a single tsvector and returns the set > of words from it. > Then you could do the aggregates yourself in SQL: > SELECT count(distinct apodid) as ndoc, > count(*) as nentry, > element > FROM ( > SELECT apodid, ts_elements(vector) AS element > FROM apod > ) GROUP BY element I'm not sure that's so much cleaner than what's there. It's relying on SRF-in-SELECT-list, which is doable at the C level but it's deprecated; plus the SRF has to return a record, which makes the notation a bit klugy --- (element).whatever in the upper select-list, and the GROUP BY probably won't work the way you show here, either. Another big problem is that the count(distinct) is going to cause the whole thing to have pretty awful performance. Not sure about a better way though... regards, tom lane
BTW, I noticed while poking at this that the 2-arg form of ts_rewrite() is labeled proretset = true in pg_proc.h. This must be a typo, since a moment's glance at the code shows it cannot return a set. Good thing we didn't ship beta2 yet ... regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Gregory Stark <stark@enterprisedb.com> writes: >> "Tom Lane" <tgl@sss.pgh.pa.us> writes: >>> True. I'll bet you don't like ts_stat() either. >> It seems the right way interface here wouldn't be too different from what's >> there. ... > I'm not sure that's so much cleaner than what's there. It doesn't seem like anyone else is particularly excited about this. What I'm inclined to do is * remove the aggregate form of ts_rewrite, which seems unusably broken because of the problem with the no-rewrites case.*leave the text-query-input form alone.* leave ts_stat alone. Perhaps future improvements to the aggregate-function mechanisms will allow these forms of ts_rewrite and ts_stat to be deprecated in favor of aggregate-based ones. But for now they offer useful functionality, and no one else seems to like the idea of hiding them in a contrib module. One small thing that I am inclined to do, though, is tweak the SPI calls inside these functions to pass read_only = true. This is not so much to prevent someone from doing a non-SELECT query, as to cause the invoked query to use the same snapshot as the outer query is using. That'll possibly avoid some odd behaviors, especially in scenarios where the function gets evaluated more than once in the outer query. As the code stands you could conceivably get different answers. regards, tom lane