Thread: text search vs schemas

text search vs schemas

From
Tom Lane
Date:
I wrote:
> We can't put tsvector_update_trigger() into core in anything like its
> current form.  As is, it will take an unqualified function name, look
> it up, and call it.  The prospects for subversion by search path
> manipulation are obvious, and even if you aren't concerned about
> malicious attacks, the effects of the trigger are context-dependent

Actually ... I'm suddenly not happy about the choice to put text search
configurations etc. into schemas at all.  We've been sitting here and
assuming that to_tsvector('english', my_text_col) has a well defined
meaning --- but as the patch stands, *it does not*.  The interpretation
of the config name could easily change depending on search_path.

It does not seem likely that a typical installation will have so many
text search configs that subdividing them into schemas will really be
useful.  If I recall correctly, Teodor did that on my recommendation
that it'd be the cleanest way to distinguish built-in from non-built-in
objects for dump purposes.  That is, pg_dump would ignore TS objects
that are in pg_catalog and dump everything else.  But I'm having severe
second thoughts about that.

What seems the most attractive alternative at the moment is to have a
flat namespace for TS objects (no schemas) and introduce something like
a "bool is_built_in" column for pg_dump to consult in deciding whether
to dump 'em.

Comments?
        regards, tom lane


Re: text search vs schemas

From
"Trevor Talbot"
Date:
On 8/16/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Actually ... I'm suddenly not happy about the choice to put text search
> configurations etc. into schemas at all.  We've been sitting here and
> assuming that to_tsvector('english', my_text_col) has a well defined
> meaning --- but as the patch stands, *it does not*.  The interpretation
> of the config name could easily change depending on search_path.
>
> It does not seem likely that a typical installation will have so many
> text search configs that subdividing them into schemas will really be
> useful.  If I recall correctly, Teodor did that on my recommendation
> that it'd be the cleanest way to distinguish built-in from non-built-in
> objects for dump purposes.  That is, pg_dump would ignore TS objects
> that are in pg_catalog and dump everything else.  But I'm having severe
> second thoughts about that.
>
> What seems the most attractive alternative at the moment is to have a
> flat namespace for TS objects (no schemas) and introduce something like
> a "bool is_built_in" column for pg_dump to consult in deciding whether
> to dump 'em.

That assumes a database-oriented search config, instead of a case of
multiple users confined to invidual schemas doing their own thing.  Is
the latter possible now, and do you want to remove that ability?

Something else that occurs to me though: the problem seems to be that
parts of tsearch take object names as strings.  I thought one
advantage of having it in core is that they are now real database
objects, with owners etc.  How many other database objects are passed
around as string labels?

Wouldn't treating them as actual objects remove this whole issue?
What happens now if you try to drop a configuration that's still used
in a trigger somewhere?

(I'm new to both tsearch2 and this list, so please excuse any
mistakes.  Mostly keeping an eye on this for future use in my own
projects.)


Re: text search vs schemas

From
Tom Lane
Date:
"Trevor Talbot" <quension@gmail.com> writes:
> Wouldn't treating them as actual objects remove this whole issue?

Uh, no.  Function names for example are subject to search-path
confusion.
        regards, tom lane


Re: text search vs schemas

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> "Trevor Talbot" <quension@gmail.com> writes:
>> Wouldn't treating them as actual objects remove this whole issue?
>
> Uh, no.  Function names for example are subject to search-path
> confusion.

Wait, are they? They are in PL languages but only because most languages store
their source code as text just as is happening here. 

But they're not in views or other native SQL uses of functions because they
store the reference to the specific function's OID. In dumps they output the
schema along with the name. As in:

postgres=# \d foo.testv      View "foo.testv"Column |  Type   | Modifiers 
--------+---------+-----------i      | integer | a      | integer | 
View definition:SELECT test.i, foo.a(test.i) AS a  FROM foo.test;

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com


Re: text search vs schemas

From
Oleg Bartunov
Date:
On Thu, 16 Aug 2007, Tom Lane wrote:

> I wrote:
>> We can't put tsvector_update_trigger() into core in anything like its
>> current form.  As is, it will take an unqualified function name, look
>> it up, and call it.  The prospects for subversion by search path
>> manipulation are obvious, and even if you aren't concerned about
>> malicious attacks, the effects of the trigger are context-dependent
>
> Actually ... I'm suddenly not happy about the choice to put text search
> configurations etc. into schemas at all.  We've been sitting here and
> assuming that to_tsvector('english', my_text_col) has a well defined
> meaning --- but as the patch stands, *it does not*.  The interpretation
> of the config name could easily change depending on search_path.

what's wrong with schema-qualified name ?

>
> It does not seem likely that a typical installation will have so many
> text search configs that subdividing them into schemas will really be
> useful.  If I recall correctly, Teodor did that on my recommendation

it's useful.


> that it'd be the cleanest way to distinguish built-in from non-built-in
> objects for dump purposes.  That is, pg_dump would ignore TS objects

I think you're wrong here. Schema often used to save connections and it's
natural to have different searches in different schemes.

> that are in pg_catalog and dump everything else.  But I'm having severe
> second thoughts about that.
>
> What seems the most attractive alternative at the moment is to have a
> flat namespace for TS objects (no schemas) and introduce something like
> a "bool is_built_in" column for pg_dump to consult in deciding whether
> to dump 'em.
>
> Comments?
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>       subscribe-nomail command to majordomo@postgresql.org so that your
>       message can get through to the mailing list cleanly
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: text search vs schemas

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> Uh, no.  Function names for example are subject to search-path
>> confusion.

> Wait, are they? They are in PL languages but only because most
> languages store their source code as text just as is happening here.

Hmmm ... if you look at the current solution for default expressions
for serial columns, ie nextval() on a regclass constant, it's pretty
schema-safe.  So we could imagine inventing a regconfig datatype that
is the same sort of wrapper-over-OID.  Then make the 2-parameter form
of to_tsvector take that type instead of text.

That seems like it'd fix the problem for expression indexes on
to_tsvector calls, but I don't see how it fixes the problem for
triggers.  We don't have any clear path for making trigger arguments
be anything but a list of strings.
        regards, tom lane


Re: text search vs schemas

From
Martijn van Oosterhout
Date:
On Fri, Aug 17, 2007 at 01:16:22AM -0400, Tom Lane wrote:
> That seems like it'd fix the problem for expression indexes on
> to_tsvector calls, but I don't see how it fixes the problem for
> triggers.  We don't have any clear path for making trigger arguments
> be anything but a list of strings.

I'm unsure how it works now, but it seems reasonable that when a
regclass/regtype/regetc is passed to a trigger, pass it in OID form.
This can be cast back safely inside the trigger itself. Seems a little
hacky though...

Having it as a type would also help with tracking dependancies.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: text search vs schemas

From
Peter Eisentraut
Date:
Am Freitag, 17. August 2007 05:15 schrieb Tom Lane:
> Actually ... I'm suddenly not happy about the choice to put text search
> configurations etc. into schemas at all.  We've been sitting here and
> assuming that to_tsvector('english', my_text_col) has a well defined
> meaning --- but as the patch stands, *it does not*.  The interpretation
> of the config name could easily change depending on search_path.

But that isn't different from any other part of the system.  A proper fix 
would be a mechanism to alleviate the confusion in all places, not simply to 
remove features that cause such confusion in some places (but not all, 
thereby causing inconsistencies).

> It does not seem likely that a typical installation will have so many
> text search configs that subdividing them into schemas will really be
> useful.

But schemas are not only used to organize objects because there are so many.  
Altering the search path to get at a different implementation without having 
to alter the names in every single place is also a useful application.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: text search vs schemas

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Fri, Aug 17, 2007 at 01:16:22AM -0400, Tom Lane wrote:
>> That seems like it'd fix the problem for expression indexes on
>> to_tsvector calls, but I don't see how it fixes the problem for
>> triggers.  We don't have any clear path for making trigger arguments
>> be anything but a list of strings.

> I'm unsure how it works now, but it seems reasonable that when a
> regclass/regtype/regetc is passed to a trigger, pass it in OID form.

If you insist on a solution that involves attaching type information
to trigger arguments, then we can forget about getting tsearch into 8.3.
That's a nontrivial amount of new design and code that hasn't even been
on the radar screen before.

At the moment I feel our thoughts have to revolve not around adding
complexity to tsearch, but taking stuff out.  If we ship it with no
schema support for TS objects in 8.3, we can always add that later,
if there proves to be real demand for that (and I note that the contrib
version has gotten along fine without it).  But we cannot go in the
other direction.
        regards, tom lane


Re: text search vs schemas

From
Martijn van Oosterhout
Date:
On Fri, Aug 17, 2007 at 10:42:09AM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > I'm unsure how it works now, but it seems reasonable that when a
> > regclass/regtype/regetc is passed to a trigger, pass it in OID form.
>
> If you insist on a solution that involves attaching type information
> to trigger arguments, then we can forget about getting tsearch into 8.3.
> That's a nontrivial amount of new design and code that hasn't even been
> on the radar screen before.

Hmm, maybe I didn't explain clearly enough. I meant that if the
argument is a regclass for example, to pass it in the TG_ARGV list as
the OID in *string form*.

That way trigger arguments stay a list of strings, yet the whole thing
is schema safe because when trigger body casts the string back to a
regclass, it gets exactly what was passed.

Hope this makes more sense,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: text search vs schemas

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Am Freitag, 17. August 2007 05:15 schrieb Tom Lane:
>> Actually ... I'm suddenly not happy about the choice to put text search
>> configurations etc. into schemas at all.

> But that isn't different from any other part of the system.  A proper fix 
> would be a mechanism to alleviate the confusion in all places, not simply to 
> remove features that cause such confusion in some places (but not all, 
> thereby causing inconsistencies).

Well, we are already inconsistent about this.  PL languages and index
access methods, for example, don't have schema-ified names.

>> It does not seem likely that a typical installation will have so many
>> text search configs that subdividing them into schemas will really be
>> useful.

> But schemas are not only used to organize objects because there are so many.
> Altering the search path to get at a different implementation without having 
> to alter the names in every single place is also a useful application.

This is isomorphic to the argument about whether default_text_search_config
is a good idea; indeed, I think that default_text_search_config pretty
much solves this problem already, for the places where it's sane to have
the configuration-in-use depend upon context.  The problem with using
schemas for TS configs is that we can't prevent the search result from
changing in contexts where it mustn't change.  At least, not short of
requiring fully-qualified config names in those places, which doesn't
sound like an advance in usability.
        regards, tom lane


Re: text search vs schemas

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Fri, Aug 17, 2007 at 10:42:09AM -0400, Tom Lane wrote:
>> If you insist on a solution that involves attaching type information
>> to trigger arguments, then we can forget about getting tsearch into 8.3.

> Hmm, maybe I didn't explain clearly enough. I meant that if the
> argument is a regclass for example, to pass it in the TG_ARGV list as
> the OID in *string form*.

Are you expecting the *user* to deal with that?  If not, how is the
system supposed to know which trigger arguments to do it to?  What
about dump and reload?
        regards, tom lane


Re: text search vs schemas

From
"Trevor Talbot"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Gregory Stark <stark@enterprisedb.com> writes:
> > "Tom Lane" <tgl@sss.pgh.pa.us> writes:
> >> Uh, no.  Function names for example are subject to search-path
> >> confusion.
>
> > Wait, are they? They are in PL languages but only because most
> > languages store their source code as text just as is happening here.
>
> Hmmm ... if you look at the current solution for default expressions
> for serial columns, ie nextval() on a regclass constant, it's pretty
> schema-safe.  So we could imagine inventing a regconfig datatype that
> is the same sort of wrapper-over-OID.  Then make the 2-parameter form
> of to_tsvector take that type instead of text.

Right, that's what I was getting at.  I was confused about the trigger
issues, sorry about that.

> That seems like it'd fix the problem for expression indexes on
> to_tsvector calls, but I don't see how it fixes the problem for
> triggers.  We don't have any clear path for making trigger arguments
> be anything but a list of strings.

Okay, trying to catch up here...

For the simple case of handling a single column, we've got expression
indexes as above.

For handling two or more columns, expression indexes don't work that
well, so that leaves triggers.  There happens to be one utility
function provided for that purpose, tsvector_update_trigger().  This
trigger function needs its configuration as a (string) argument, and
is also the only one with this problem.

Is that correct?

If so, then it seems the question should really be: is this situation
of wanting to index multiple columns together, without even using
different ranks for them, so common that this trigger function belongs
in core?  Maybe it shouldn't be there at all; instead have the docs
walk through creating a specialized trigger function.  It doesn't get
rid of the schema-qualified names issue, but when you're writing PL
functions you need to deal with that anyway, tsearch or not.

And there's still contrib of course.


Re: text search vs schemas

From
"Trevor Talbot"
Date:
On 8/17/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> At the moment I feel our thoughts have to revolve not around adding
> complexity to tsearch, but taking stuff out.  If we ship it with no
> schema support for TS objects in 8.3, we can always add that later,
> if there proves to be real demand for that (and I note that the contrib
> version has gotten along fine without it).  But we cannot go in the
> other direction.

Currently you can schema-qualify objects where you need to, to avoid
issues with search_path subversion.  If it's impossible to
schema-qualify tsearch configs now, when schema support is later added
it suddenly exposes everyone to risks that didn't exist before, and
requires manual changes to fix.

I'm for removing complexity, but per-schema support seems like a
design decision that needs to be made up front, whichever way it goes.


Re: text search vs schemas

From
Tom Lane
Date:
"Trevor Talbot" <quension@gmail.com> writes:
> Currently you can schema-qualify objects where you need to, to avoid
> issues with search_path subversion.  If it's impossible to
> schema-qualify tsearch configs now, when schema support is later added
> it suddenly exposes everyone to risks that didn't exist before, and
> requires manual changes to fix.

True.  I thought of another counter-argument as well: we use schemas
not only so that pg_dump can tell user from system objects, but for
permissions purposes.  If TS configs don't live in schemas then there is
no structure for controlling who may create one.  A fairly standard
requirement is to be able to prevent someone from creating any non-temp
objects (or not even those), and right now you do it by revoking create
rights in the public schema, and/or permissions to create new schemas.
We'd need some other kluge for non-schema-ified TS objects.

So let's go back to the "regconfig" idea.  If we invent such a type, and
make the 2-parameter forms of to_tsvector et al take that instead of
just text for the config name, then I think we have fixed things for the
expression index case.  (There are also other benefits, eg a command
applying one of these functions on many rows wouldn't have to do a
name-based config lookup each time.)  The problem is with the trigger
approach.

As my copy of the patch currently stands, there are two built-in trigger
functions, tsvector_update_trigger and tsvector_update_trigger_column.
The first expects trigger arguments   name of tsvector col, name of tsconfig to use, name(s) of text col(s)
and the second   name of tsvector col, name of tsconfig col, name(s) of text col(s)
that is, the tsconfig name is stored in a text column.  We could fix
the second form by changing it to expect the tsconfig column to be of
type regconfig.  The first form is a bit more problematic.  I can see
two approaches: either specify both the schema and the tsconfig name,
as two separate arguments, or keep it one argument but insist that
the content of the argument be an explicitly-qualified name.  The
second way seems a bit klugier when considered in isolation, but I think
I like it better, because there would be a natural migration path to
treating the argument as being of type regconfig when and if we get
around to having real types for trigger arguments.  (Which I think is
a good idea, btw, just not for 8.3.)

This looks fairly do-able --- a regconfig data type is no problem,
should be able to whip it up with an hour or two of cutting and pasting
code from one of the existing OID alias types.  And the other changes
seem minor.  (Note: offhand I don't see a need for "reg" types for
parsers, dictionaries, or templates, since none of those are referenced
directly in queries.)

Thoughts?
        regards, tom lane


Re: text search vs schemas

From
"Trevor Talbot"
Date:
On 8/18/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> As my copy of the patch currently stands, there are two built-in trigger
> functions, tsvector_update_trigger and tsvector_update_trigger_column.
> The first expects trigger arguments
>     name of tsvector col, name of tsconfig to use, name(s) of text col(s)
> and the second
>     name of tsvector col, name of tsconfig col, name(s) of text col(s)
> that is, the tsconfig name is stored in a text column.  We could fix
> the second form by changing it to expect the tsconfig column to be of
> type regconfig.  The first form is a bit more problematic.  I can see
> two approaches: either specify both the schema and the tsconfig name,
> as two separate arguments, or keep it one argument but insist that
> the content of the argument be an explicitly-qualified name.  The
> second way seems a bit klugier when considered in isolation, but I think
> I like it better, because there would be a natural migration path to
> treating the argument as being of type regconfig when and if we get
> around to having real types for trigger arguments.  (Which I think is
> a good idea, btw, just not for 8.3.)

I like the second approach too.  It may be slightly awkward for now,
but IMHO it does the right thing.