Thread: tsearch patch and namespace pollution

tsearch patch and namespace pollution

From
Tom Lane
Date:
I find the following additions to pg_proc in the current tsearch2 patch:
                  proc                   | prorettype 
------------------------------------------+------------pg_ts_parser_is_visible(oid)             |
booleanpg_ts_dict_is_visible(oid)              | booleanpg_ts_template_is_visible(oid)           |
booleanpg_ts_config_is_visible(oid)            | booleantsvectorin(cstring)                      |
tsvectortsvectorout(tsvector)                   | cstringtsvectorsend(tsvector)                   |
byteatsqueryin(cstring)                      | tsquerytsqueryout(tsquery)                      |
cstringtsquerysend(tsquery)                    | byteagtsvectorin(cstring)                     |
gtsvectorgtsvectorout(gtsvector)                 | cstringtsvector_lt(tsvector,tsvector)           |
booleantsvector_le(tsvector,tsvector)          | booleantsvector_eq(tsvector,tsvector)           |
booleantsvector_ne(tsvector,tsvector)          | booleantsvector_ge(tsvector,tsvector)           |
booleantsvector_gt(tsvector,tsvector)          | booleantsvector_cmp(tsvector,tsvector)          |
integerlength(tsvector)                        | integerstrip(tsvector)                          |
tsvectorsetweight(tsvector,"char")              | tsvectortsvector_concat(tsvector,tsvector)       |
tsvectorvq_exec(tsvector,tsquery)               | booleanqv_exec(tsquery,tsvector)                |
booleantt_exec(text,text)                      | booleanct_exec(character varying,text)          |
booleantq_exec(text,tsquery)                   | booleancq_exec(character varying,tsquery)       |
booleantsquery_lt(tsquery,tsquery)             | booleantsquery_le(tsquery,tsquery)              |
booleantsquery_eq(tsquery,tsquery)             | booleantsquery_ne(tsquery,tsquery)              |
booleantsquery_ge(tsquery,tsquery)             | booleantsquery_gt(tsquery,tsquery)              |
booleantsquery_cmp(tsquery,tsquery)            | integertsquery_and(tsquery,tsquery)             |
tsquerytsquery_or(tsquery,tsquery)             | tsquerytsquery_not(tsquery)                     |
tsquerytsq_mcontains(tsquery,tsquery)          | booleantsq_mcontained(tsquery,tsquery)          |
booleannumnode(tsquery)                        | integerquerytree(tsquery)                       |
textrewrite(tsquery,tsquery,tsquery)        | tsqueryrewrite(tsquery,text)                    |
tsqueryrewrite_accum(tsquery,tsquery[])        | tsqueryrewrite_finish(tsquery)                  |
tsqueryrewrite(tsquery[])                      | tsquerystat(text)                               |
recordstat(text,text)                         | recordrank(real[],tsvector,tsquery,integer)    |
realrank(real[],tsvector,tsquery)           | realrank(tsvector,tsquery,integer)           | realrank(tsvector,tsquery)
                 | realrank_cd(real[],tsvector,tsquery,integer) | realrank_cd(real[],tsvector,tsquery)         |
realrank_cd(tsvector,tsquery,integer)       | realrank_cd(tsvector,tsquery)                | realtoken_type(oid)
                 | recordtoken_type(text)                         | recordparse(oid,text)                          |
recordparse(text,text)                        | recordlexize(oid,text)                         |
text[]lexize(text,text)                       | text[]headline(oid,text,tsquery,text)          |
textheadline(oid,text,tsquery)              | textheadline(text,text,tsquery,text)         |
textheadline(text,text,tsquery)             | textheadline(text,tsquery,text)              | textheadline(text,tsquery)
                 | textto_tsvector(oid,text)                    | tsvectorto_tsvector(text,text)                   |
tsvectorto_tsquery(oid,text)                    | tsqueryto_tsquery(text,text)                    |
tsqueryplainto_tsquery(oid,text)               | tsqueryplainto_tsquery(text,text)               |
tsqueryto_tsvector(text)                       | tsvectorto_tsquery(text)                         |
tsqueryplainto_tsquery(text)                   | tsquerytsvector_update_trigger()                |
triggerget_ts_config_oid(text)                 | oidget_current_ts_config()                  | oid
 
(82 rows)

(This list omits functions with INTERNAL arguments, as those are of
no particular concern to users.)

While most of these are probably OK, I'm disturbed by the prospect
that we are commandeering names as generic as "parse" or "stat"
with argument types as generic as "text".  I think we need to put
a "ts_" prefix on some of these.  Specifically, I find these names
totally unacceptable without a ts_ prefix:
stat(text)                               | recordstat(text,text)                          | record
token_type(oid)                          | recordtoken_type(text)                         | record
parse(oid,text)                          | recordparse(text,text)                         | record
lexize(oid,text)                         | text[]lexize(text,text)                        | text[]

These guys might be all right given that some of their arguments are
tsvector or tsquery, but it's not completely convincing --- think about
the case where an argument is given as an undecorated literal string.
It's also not all that clear that they are related to text searching.
I'm for putting a ts_ prefix on them too:
rank(real[],tsvector,tsquery,integer)    | realrank(real[],tsvector,tsquery)            |
realrank(tsvector,tsquery,integer)          | realrank(tsvector,tsquery)                   |
realrank_cd(real[],tsvector,tsquery,integer)| realrank_cd(real[],tsvector,tsquery)         |
realrank_cd(tsvector,tsquery,integer)       | realrank_cd(tsvector,tsquery)                | real
 
rewrite(tsquery,tsquery,tsquery)         | tsqueryrewrite(tsquery,text)                    |
tsqueryrewrite_accum(tsquery,tsquery[])        | tsqueryrewrite_finish(tsquery)                  |
tsqueryrewrite(tsquery[])                      | tsquery
 
headline(oid,text,tsquery,text)          | textheadline(oid,text,tsquery)               |
textheadline(text,text,tsquery,text)        | textheadline(text,text,tsquery)              |
textheadline(text,tsquery,text)             | textheadline(text,tsquery)                   | text
 

These guys are just plain badly named, as it's completely unobvious that
they have anything to do with tsearch (or what they do at all, actually).
Furthermore the "varchar" variants seem entirely redundant with the
"text" ones:
vq_exec(tsvector,tsquery)                | booleanqv_exec(tsquery,tsvector)                | booleantt_exec(text,text)
                    | booleanct_exec(character varying,text)          | booleantq_exec(text,tsquery)
|booleancq_exec(character varying,tsquery)       | boolean
 

Comments, suggestions?
        regards, tom lane


Re: tsearch patch and namespace pollution

From
Bruce Momjian
Date:
Tom Lane wrote:
> I find the following additions to pg_proc in the current tsearch2 patch:

It seems a lot of these are useless and just bloat.  I will mark a few:

>                    proc                   | prorettype 
> ------------------------------------------+------------
>  pg_ts_parser_is_visible(oid)             | boolean
>  pg_ts_dict_is_visible(oid)               | boolean
>  pg_ts_template_is_visible(oid)           | boolean
>  pg_ts_config_is_visible(oid)             | boolean

Why would anyone look these up via OID rather than name?

>  tsvectorin(cstring)                      | tsvector
>  tsvectorout(tsvector)                    | cstring
>  tsvectorsend(tsvector)                   | bytea
>  tsqueryin(cstring)                       | tsquery
>  tsqueryout(tsquery)                      | cstring
>  tsquerysend(tsquery)                     | bytea
>  gtsvectorin(cstring)                     | gtsvector
>  gtsvectorout(gtsvector)                  | cstring
>  tsvector_lt(tsvector,tsvector)           | boolean
>  tsvector_le(tsvector,tsvector)           | boolean
>  tsvector_eq(tsvector,tsvector)           | boolean
>  tsvector_ne(tsvector,tsvector)           | boolean
>  tsvector_ge(tsvector,tsvector)           | boolean
>  tsvector_gt(tsvector,tsvector)           | boolean
>  tsvector_cmp(tsvector,tsvector)          | integer
>  length(tsvector)                         | integer
>  strip(tsvector)                          | tsvector
>  setweight(tsvector,"char")               | tsvector
>  tsvector_concat(tsvector,tsvector)       | tsvector
>  vq_exec(tsvector,tsquery)                | boolean
>  qv_exec(tsquery,tsvector)                | boolean
>  tt_exec(text,text)                       | boolean
>  ct_exec(character varying,text)          | boolean
>  tq_exec(text,tsquery)                    | boolean
>  cq_exec(character varying,tsquery)       | boolean
>  tsquery_lt(tsquery,tsquery)              | boolean
>  tsquery_le(tsquery,tsquery)              | boolean
>  tsquery_eq(tsquery,tsquery)              | boolean
>  tsquery_ne(tsquery,tsquery)              | boolean
>  tsquery_ge(tsquery,tsquery)              | boolean
>  tsquery_gt(tsquery,tsquery)              | boolean
>  tsquery_cmp(tsquery,tsquery)             | integer
>  tsquery_and(tsquery,tsquery)             | tsquery
>  tsquery_or(tsquery,tsquery)              | tsquery
>  tsquery_not(tsquery)                     | tsquery
>  tsq_mcontains(tsquery,tsquery)           | boolean
>  tsq_mcontained(tsquery,tsquery)          | boolean
>  numnode(tsquery)                         | integer
>  querytree(tsquery)                       | text
>  rewrite(tsquery,tsquery,tsquery)         | tsquery
>  rewrite(tsquery,text)                    | tsquery
>  rewrite_accum(tsquery,tsquery[])         | tsquery
>  rewrite_finish(tsquery)                  | tsquery
>  rewrite(tsquery[])                       | tsquery
>  stat(text)                               | record
>  stat(text,text)                          | record

>  rank(real[],tsvector,tsquery,integer)    | real
>  rank(real[],tsvector,tsquery)            | real
>  rank(tsvector,tsquery,integer)           | real
>  rank(tsvector,tsquery)                   | real
>  rank_cd(real[],tsvector,tsquery,integer) | real
>  rank_cd(real[],tsvector,tsquery)         | real
>  rank_cd(tsvector,tsquery,integer)        | real
>  rank_cd(tsvector,tsquery)                | real

Do we realy need this many ranking functions?

>  token_type(oid)                          | record

Again, why by OID?

>  token_type(text)                         | record
>  parse(oid,text)                          | record
>  parse(text,text)                         | record
>  lexize(oid,text)                         | text[]
>  lexize(text,text)                        | text[]
>  headline(oid,text,tsquery,text)          | text
>  headline(oid,text,tsquery)               | text
>  headline(text,text,tsquery,text)         | text
>  headline(text,text,tsquery)              | text
>  headline(text,tsquery,text)              | text
>  headline(text,tsquery)                   | text
>  to_tsvector(oid,text)                    | tsvector
>  to_tsvector(text,text)                   | tsvector
>  to_tsquery(oid,text)                     | tsquery

Why OID again for the configuration?  I just don't see the use case and
it is bloat and causes confusion.

>  to_tsquery(text,text)                    | tsquery
>  plainto_tsquery(oid,text)                | tsquery
>  plainto_tsquery(text,text)               | tsquery

Again, OID.  I asked Oleg about this and he said:

> Bruce, just remove oid argument specification from documentation.

so I think we can go ahead and remove cases where the configuration name
or object is specified by oid.  I have already removed them from the
documentation and I though the patch had them removed too, but I guess
not.  Admittedly this API has been in flux.

>  to_tsvector(text)                        | tsvector
>  to_tsquery(text)                         | tsquery
>  plainto_tsquery(text)                    | tsquery
>  tsvector_update_trigger()                | trigger
>  get_ts_config_oid(text)                  | oid
>  get_current_ts_config()                  | oid
> (82 rows)
> 
> (This list omits functions with INTERNAL arguments, as those are of
> no particular concern to users.)

Also @@ accepts TEXT @@ TEXT, at least according to the documentation. 
Is it clear to anyone which is tsquery and which tsvector?  Is this
something we want to support?

> While most of these are probably OK, I'm disturbed by the prospect
> that we are commandeering names as generic as "parse" or "stat"
> with argument types as generic as "text".  I think we need to put
> a "ts_" prefix on some of these.  Specifically, I find these names
> totally unacceptable without a ts_ prefix:
> 
>  stat(text)                               | record
>  stat(text,text)                          | record
> 
>  token_type(oid)                          | record
>  token_type(text)                         | record
> 
>  parse(oid,text)                          | record
>  parse(text,text)                         | record
> 
>  lexize(oid,text)                         | text[]
>  lexize(text,text)                        | text[]
> 
> These guys might be all right given that some of their arguments are
> tsvector or tsquery, but it's not completely convincing --- think about
> the case where an argument is given as an undecorated literal string.
> It's also not all that clear that they are related to text searching.
> I'm for putting a ts_ prefix on them too:
> 
>  rank(real[],tsvector,tsquery,integer)    | real
>  rank(real[],tsvector,tsquery)            | real
>  rank(tsvector,tsquery,integer)           | real
>  rank(tsvector,tsquery)                   | real
>  rank_cd(real[],tsvector,tsquery,integer) | real
>  rank_cd(real[],tsvector,tsquery)         | real
>  rank_cd(tsvector,tsquery,integer)        | real
>  rank_cd(tsvector,tsquery)                | real
> 
>  rewrite(tsquery,tsquery,tsquery)         | tsquery
>  rewrite(tsquery,text)                    | tsquery
>  rewrite_accum(tsquery,tsquery[])         | tsquery
>  rewrite_finish(tsquery)                  | tsquery
>  rewrite(tsquery[])                       | tsquery
> 
>  headline(oid,text,tsquery,text)          | text
>  headline(oid,text,tsquery)               | text
>  headline(text,text,tsquery,text)         | text
>  headline(text,text,tsquery)              | text
>  headline(text,tsquery,text)              | text
>  headline(text,tsquery)                   | text
> 
> These guys are just plain badly named, as it's completely unobvious that
> they have anything to do with tsearch (or what they do at all, actually).
> Furthermore the "varchar" variants seem entirely redundant with the
> "text" ones:
> 
>  vq_exec(tsvector,tsquery)                | boolean
>  qv_exec(tsquery,tsvector)                | boolean
>  tt_exec(text,text)                       | boolean
>  ct_exec(character varying,text)          | boolean
>  tq_exec(text,tsquery)                    | boolean
>  cq_exec(character varying,tsquery)       | boolean
> 
> Comments, suggestions?

I would be happy if all text search functions began with 'ts', 'ts_', or
'to_ts', and if we don't clean this up now, it is going to be harder in
the future.  I think users can expect some migration for text search in
8.3 as a benefit of getting into core and be dump-able.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: tsearch patch and namespace pollution

From
Michael Paesold
Date:
Bruce Momjian wrote:
> I would be happy if all text search functions began with 'ts', 'ts_', or
> 'to_ts', and if we don't clean this up now, it is going to be harder in
> the future.

+1 from me. \df is also much more useful then.
> I think users can expect some migration for text search in
> 8.3 as a benefit of getting into core and be dump-able.

I guess so. Especially if you change some functions, they will have to 
change source code anyway. So you can as well cleanup all functions that 
don't fit into a sound naming schema.

Best Regards
Michael Paesold