Re: tsearch2 in PostgreSQL 8.3? - Mailing list pgsql-hackers

From Tom Lane
Subject Re: tsearch2 in PostgreSQL 8.3?
Date
Msg-id 12278.1187127714@sss.pgh.pa.us
Whole thread Raw
In response to Re: tsearch2 in PostgreSQL 8.3?  (Bruce Momjian <bruce@momjian.us>)
Responses Re: tsearch2 in PostgreSQL 8.3?  (Bruce Momjian <bruce@momjian.us>)
Re: tsearch2 in PostgreSQL 8.3?  ("Simon Riggs" <simon@2ndquadrant.com>)
List pgsql-hackers
Bruce Momjian <bruce@momjian.us> writes:
> Alvaro Herrera wrote:
>> I haven't really seen anyone else arguing about this.  I wonder whether
>> you are being overly zealous about it.

> Uh, OK, but no one has told me how a database restore without a
> configuration name would work, so I am all ears.

It's the dump/restore problem that really is the kicker here.  I don't
mind so much a GUC that only controls the interpretation of queries, but
if it determines how an index or a tsvector column gets rebuilt, we've
got problems.

I've just finished re-reading the prior thread, and here are what seem
to me to be the salient points:

* Oleg, Teodor, and all of the old-line users of tsearch2 are
comfortable with setting up a trigger to maintain a materialized
tsvector column for a table.  They argue that recomputing the tsvector
(possibly more than once) during a query is more expensive than fetching
it from disk.  My suspicion is that the latter argument gets weaker
every year --- CPUs are getting faster lots faster than disks are.

* Bruce (and I ... not sure about anyone else) want to support usage of
text search via a functional index.  This is argued to be easier to set
up (no fooling with triggers) and possibly faster depending on CPU vs
I/O speeds.  I don't think there is any desire here to eliminate the
trigger approach, just to provide an alternative.

* For *either* the trigger or functional-index approach, I think it is
unacceptable to rely on a GUC variable to determine how the tsvector is
derived from the raw-document fields for storage or indexing.  It's just
too error-prone, particularly when you consider dump-and-reload cases.
What I think we should say is that the ts parsing configuration name
can be either hardwired or taken from another field of the table.
In the trigger case this would mean providing a couple of standard
triggers, one taking the config name as a trigger parameter, and the
other accepting a trigger parameter that's the name of the config name
column.  In the index case this would mean that the index expression
has to be either to_tsvector('constant', ...) or to_tsvector(field, ...).
Note that all four cases boil down to saying that stored or indexed
tsvectors have to be derived from the two-parameter form of to_tsvector.

* For queries, there is not anything very wrong with having a default
configuration, but the trick is how to get the planner to match that up
with an index that's written with the two-parameter form of to_tsvector.
One hackish possibility is to define the single-parameter form of
to_tsvector like this:
 create function to_tsvector(text) returns tsvector as $$   select to_tsvector(get_default_text_search_config(), $1) $$
languagesql strict stable;
 

where get_default_text_search_config() is essentially just
current_setting('default_text_search_config') except it is
misleadingly marked immutable.  Then, a query with
WHERE to_tsvector(document_col) @@ tsquery(...)

will have the SQL function inlined, and the
get_default_text_search_config() call const-folded, and suddenly
it looks like
WHERE to_tsvector('english', document_col) @@ tsquery(...)

and can be matched to a functional index that's declared using the
explicit 'english' configuration name.  This is pretty grotty though
... can anyone think of a better way?  (The main objection I can see
to it is that someone could shoot himself in the foot by using this
function instead of two-parameter to_tsvector in a custom trigger
function.  But hopefully, anyone writing a custom trigger function
will have read the manual's warning not to do that.  Note that I
suggest marking the function stable so that it can't be misused
that way in a functional index.  Another possible objection is that
get_default_text_search_config() is a foot-gun all by itself, since
it could be used in a functional index.  Aside from not documenting
it I'm not sure there's much to be done about that.)

* I'm not enamored of Heikki's idea about a datatype-based solution,
because I don't think that it will allow columns containing tsvectors
derived with different configurations.  It's perfectly clear that
advanced users want to be able to do that, and it's also clear that
as long as the config name is coming from a stored column (or can be
reconstructed somehow from the stored data) that it's perfectly
well-defined.

Since I don't think that a datatype solution is the way to go,
I don't feel that we are as far away from an agreement as Bruce
is worried about.
        regards, tom lane


pgsql-hackers by date:

Previous
From: "Mike Rylander"
Date:
Subject: Re: default_text_search_config and expression indexes
Next
From: Bruce Momjian
Date:
Subject: Re: tsearch2 in PostgreSQL 8.3?