Thread: Simplifying Text Search

Simplifying Text Search

From
Simon Riggs
Date:
Something Tom Dunstan just mentioned has made me ask the question "Why
does our full text search feature look so strange?". It's the
operator-laden syntax that causes the problem.

By any stretch, this query is difficult for most people to understand:

SELECT * FROM text_table
WHERE to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');

Wouldn't it be much simpler to just have a function, so we can write
this query like this?

SELECT * FROM text_table
WHERE text_search('haystack needle haystack', 'needle');

We then explain to people that while the above is nice, it will presume
that both the function inputs are Text, which isn't any good for complex
searches, indexing and dictionaries etc.., so then we move to:

SELECT * FROM text_table
WHERE text_search('haystack needle haystack'::tsvector,
'needle'::tsquery);

or perhaps

SELECT * FROM text_table
WHERE full_text_search('haystack needle haystack', 'needle & hay');

which would automatically do the conversions to tsvector and tsquery for
us. No more tedious casting, easy to read.

[*text_search() functions would return bool]

So we end up with a normal sounding function that is overloaded to
provide all of the various goodies. We can include the text_search(text,
text) version of the function in the normal chapter on functions, with a
pointer to the more complex stuff elsewhere.

Sound good?

We can then explain everything without having to use @@ operators. They
can then be introduced as an option.

The side benefit of this is that we can then allow our wonderful new
functionality to be more easily usable by things like Hibernate. We just
tell them we have this new function and thats all they need to know.

I know that under the covers the @@ operator is necessary because we
hang various pieces of optimizer information from it. Each function
signature gets an operator with matching signature, so there's a 1:1
correspondence in most use cases. So to make this all hang together,
there'd need to be a some smarts that says: if there is only one
operator on a function then use the operator's optimizer information
when you see just the function. That information can be assessed at DDL
time, so we can keep accurate track of operator counts in pgproc. 

An alternative approach might be to make the first operator created on a
function the "primary" operator. All other operators would then be
secondary operators, so that adding operators would not change the
inference mechanism.

I've not got sufficient knowledge to say how hard the
function-to-operator inference is, but it would be dang useful in making
text search and many other programs readable and easy to interface to.
In the end that is going to mean wider usage of that functionality, with
more people feeling like they can dip their toes into the water.

I must confess I have insufficient time to do this myself right now, not
least me discovering exactly how. I'm spending time on this now because
I'm the one that has to explain this stuff to people and things like
this can make a huge difference in their understanding and eventual
uptake.

Thoughts?

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Simplifying Text Search

From
Martijn van Oosterhout
Date:
On Mon, Nov 12, 2007 at 03:00:36PM +0000, Simon Riggs wrote:
> Something Tom Dunstan just mentioned has made me ask the question "Why
> does our full text search feature look so strange?". It's the
> operator-laden syntax that causes the problem.
>
> By any stretch, this query is difficult for most people to understand:
>
> SELECT * FROM text_table
> WHERE to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');
>
> Wouldn't it be much simpler to just have a function, so we can write
> this query like this?
>
> SELECT * FROM text_table
> WHERE text_search('haystack needle haystack', 'needle');

Can't you do this with an SQL function that gets expanded inline?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
>  -- John F Kennedy

Re: Simplifying Text Search

From
Simon Riggs
Date:
On Mon, 2007-11-12 at 16:28 +0100, Martijn van Oosterhout wrote:
> On Mon, Nov 12, 2007 at 03:00:36PM +0000, Simon Riggs wrote:
> > Something Tom Dunstan just mentioned has made me ask the question "Why
> > does our full text search feature look so strange?". It's the
> > operator-laden syntax that causes the problem.
> > 
> > By any stretch, this query is difficult for most people to understand:
> > 
> > SELECT * FROM text_table
> > WHERE to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');
> > 
> > Wouldn't it be much simpler to just have a function, so we can write
> > this query like this?
> > 
> > SELECT * FROM text_table
> > WHERE text_search('haystack needle haystack', 'needle');
> 
> Can't you do this with an SQL function that gets expanded inline?

Yep, we can. Good thinking. So the change is fairly trivial.

What do you think of the proposal to make text search work this way and
to document this more easily readable form?

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Simplifying Text Search

From
David Fetter
Date:
On Mon, Nov 12, 2007 at 03:48:20PM +0000, Simon Riggs wrote:
> On Mon, 2007-11-12 at 16:28 +0100, Martijn van Oosterhout wrote:
> > On Mon, Nov 12, 2007 at 03:00:36PM +0000, Simon Riggs wrote:
> > > Something Tom Dunstan just mentioned has made me ask the question "Why
> > > does our full text search feature look so strange?". It's the
> > > operator-laden syntax that causes the problem.
> > > 
> > > By any stretch, this query is difficult for most people to understand:
> > > 
> > > SELECT * FROM text_table
> > > WHERE to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');
> > > 
> > > Wouldn't it be much simpler to just have a function, so we can write
> > > this query like this?
> > > 
> > > SELECT * FROM text_table
> > > WHERE text_search('haystack needle haystack', 'needle');
> > 
> > Can't you do this with an SQL function that gets expanded inline?
> 
> Yep, we can. Good thinking. So the change is fairly trivial.
> 
> What do you think of the proposal to make text search work this way
> and to document this more easily readable form?

+1 for adding this.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Simplifying Text Search

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> Something Tom Dunstan just mentioned has made me ask the question "Why
> does our full text search feature look so strange?". It's the
> operator-laden syntax that causes the problem.
> 
> By any stretch, this query is difficult for most people to understand:
> 
> SELECT * FROM text_table
> WHERE to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');
> 
> Wouldn't it be much simpler to just have a function, so we can write
> this query like this?
> 
> SELECT * FROM text_table
> WHERE text_search('haystack needle haystack', 'needle');
> 
> We then explain to people that while the above is nice, it will presume
> that both the function inputs are Text, which isn't any good for complex
> searches, indexing and dictionaries etc.., so then we move to:
> 
> SELECT * FROM text_table
> WHERE text_search('haystack needle haystack'::tsvector,
> 'needle'::tsquery);
> 
> or perhaps
> 
> SELECT * FROM text_table
> WHERE full_text_search('haystack needle haystack', 'needle & hay');
> 
> which would automatically do the conversions to tsvector and tsquery for
> us. No more tedious casting, easy to read.

There's a text @@ text operator, so you can write just:

SELECT * FROM tstable where data @@ 'needle';

No need to cast.

Unfortunately, that form can't use a GIN index, I think. But that's 
another issue, which I don't think your proposal would fix...

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Simplifying Text Search

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> So we end up with a normal sounding function that is overloaded to
> provide all of the various goodies.

As best I can tell, @@ does exactly this already.  This is just a
different spelling of the same capability, and I don't actually
find it better.  Why is "text_search(x,y)" better than "x @@ y"?
We don't recommend that people write "texteq(x,y)" instead of
"x = y".

> Sound good?

It's not an improvement, it's not compatible with what existing tsearch2
users are accustomed to, and it's several months too late...
        regards, tom lane


Re: Simplifying Text Search

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Simon Riggs <simon@2ndquadrant.com> writes:
>> So we end up with a normal sounding function that is overloaded to
>> provide all of the various goodies.
>
> As best I can tell, @@ does exactly this already.  This is just a
> different spelling of the same capability, and I don't actually
> find it better.  Why is "text_search(x,y)" better than "x @@ y"?
> We don't recommend that people write "texteq(x,y)" instead of
> "x = y".

I agree, I find it odd to suggest that a function would be more natural than
an operator. The main reason the non-core version of tsearch felt so much like
an add-on was precisely that it had to use functions to interface with
objects. That Postgres supports creating new operators is a strength which
allows a lot more extensibility.

And yet I agree that there's something awkward about the tsearch syntax. I'm
not sure where the core of it comes from though, but I don't think it comes
from the use of operators.

Part of it is that "@@" isn't a familiar operator. I'm not even sure what to
read it as. "Matches"? "Satisfies"?

Perhaps we should think (at some point in the future) about some way of
allowing alphabetic characters in operator names. Then you could write
something like: 
 col ~satisfies~ '1 & 2'

(That exact syntax wouldn't work without removing ~ from the characters in
normal operators so something with more finesse would be needed.)

The other part of tsearch that seems somewhat awkward is just the very concept
and syntax of tsqueries. But that seems pretty integral to the functionality
and I don't see any way to avoid it. It's not entirely unlike the idea of
regexps which I'm sure would seem unnatural if we were just meeting them with
no background.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production
Tuning


Re: Simplifying Text Search

From
Simon Riggs
Date:
On Mon, 2007-11-12 at 11:56 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > So we end up with a normal sounding function that is overloaded to
> > provide all of the various goodies.
> 
> As best I can tell, @@ does exactly this already.  This is just a
> different spelling of the same capability, and I don't actually
> find it better.  Why is "text_search(x,y)" better than "x @@ y"?
> We don't recommend that people write "texteq(x,y)" instead of
> "x = y".

Most people don't understand those differences. x = y means "make sure
they are the same" to most people. They don't see what you (and I) see:
function and operator interchangeability. So text_search() is better
than @@ and = is better than texteq(). Life ain't neat...

Right now, Full Text Search SQL looks like complete gibberish and it
dissuades many people from using what is an awesome set of features. I
just want to add a little sugar to help people get started.

> > Sound good?
> 
> It's not an improvement

That is the very point of debate

> it's not compatible with what existing tsearch2
> users are accustomed to

@@ would still exist, so no problems. These additions are for new users,
not old ones.

> it's several months too late...

True. I wish I'd thought of it before. I've waded through the syntax
without thinking how to make it more easily readable and explainable.
Damn.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Simplifying Text Search

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Mon, 2007-11-12 at 11:56 -0500, Tom Lane wrote:
>> Simon Riggs <simon@2ndquadrant.com> writes:
>>> So we end up with a normal sounding function that is overloaded to
>>> provide all of the various goodies.
>> As best I can tell, @@ does exactly this already.  This is just a
>> different spelling of the same capability, and I don't actually
>> find it better.  Why is "text_search(x,y)" better than "x @@ y"?
>> We don't recommend that people write "texteq(x,y)" instead of
>> "x = y".
> 
> Most people don't understand those differences. x = y means "make sure
> they are the same" to most people. They don't see what you (and I) see:
> function and operator interchangeability. So text_search() is better
> than @@ and = is better than texteq(). Life ain't neat...
> 
> Right now, Full Text Search SQL looks like complete gibberish and it
> dissuades many people from using what is an awesome set of features. I
> just want to add a little sugar to help people get started.

Granted, @@ is a bit awkward until you get used to it. "x LIKE y" would 
read out better, but unfortunately that's already taken ;-).

In any case, it's way too late.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Simplifying Text Search

From
Andrew Sullivan
Date:
On Mon, Nov 12, 2007 at 08:09:48PM +0000, Simon Riggs wrote:
> 
> @@ would still exist, so no problems. These additions are for new users,
> not old ones.

Given that this is all sugar on top of tsearch anyway, why not put it in
pgfoundry as the tsearch_sugar project?  Then packagers could include a
standard set of such sugar if they wanted.

A

-- 
Andrew Sullivan
Old sigs will return after re-constitution of blue smoke


Re: Simplifying Text Search

From
Alvaro Herrera
Date:
Heikki Linnakangas wrote:

> Granted, @@ is a bit awkward until you get used to it. "x LIKE y" would 
> read out better, but unfortunately that's already taken ;-).

Actually LIKE does not make much sense when you have 'hay & needle'.
Probably MATCHES would be a better term ... but then, MySQL defines a
strange thing called MATCH/AGAINST; so apparently you use "MATCH (column
list) AGAINST (pattern spec)"

None of this is standard though ...

-- 
Alvaro Herrera                 http://www.amazon.com/gp/registry/CTMLCN8V17R4
"Everybody understands Mickey Mouse. Few understand Hermann Hesse.
Hardly anybody understands Einstein. And nobody understands Emperor Norton."


Re: Simplifying Text Search

From
Aidan Van Dyk
Date:
* Heikki Linnakangas <heikki@enterprisedb.com> [071112 15:18]:
> Simon Riggs wrote:

> >Right now, Full Text Search SQL looks like complete gibberish and it
> >dissuades many people from using what is an awesome set of features. I
> >just want to add a little sugar to help people get started.

> Granted, @@ is a bit awkward until you get used to it. "x LIKE y" would 
> read out better, but unfortunately that's already taken ;-).

Can LIKE be easily overloaded in the parser?  So:text LIKE text
works in it's current form, andtsvector LIKE tsquery
also works like the @@?  Or have I gotten all the ts* types all mixed up
again...

But it doesn't buy anything except avoiding the "@@" that people seem to not
grok easily, and it might actually cause more grief, because of people
confusing the 2 forms of LIKE.

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Simplifying Text Search

From
Simon Riggs
Date:
On Mon, 2007-11-12 at 20:17 +0000, Heikki Linnakangas wrote:

> Granted, @@ is a bit awkward until you get used to it. "x LIKE y" would 
> read out better, but unfortunately that's already taken ;-).

Remember, I'm not suggesting we get rid of @@

> In any case, it's way too late.

I'm suggesting we add a couple of simple SQL functions that will help
text search docs be more easily understood.

It's beta and its valid to respond to usability issues just as we would
respond to code bugs. Otherwise, why have beta? Late, but not too late. 

SQLServer, Oracle and MySQL all use functions, not operators. My
observation would be that we have the hardest and most difficult to
understand full text search capability. The Contains() function seems
like a better name than I gave earlier also.

I love what we've done; I just want more people be able to use it.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Simplifying Text Search

From
"Pavel Stehule"
Date:
Hello

look to standard, please. SQL/MM has part - full text.

SELECT  docno
FROM information
WHERE document.CONTAINS('STEMMED FORM OF "standard"IN SAME PARAGRAPH ASSOUNDS LIKE "sequel"') = 1

it's little bit baroque, It's sample of method.

So,it can be:

SELECT ..  FROM x.contains(y);

It's well readable and elegant too.

Regards
Pavel Stehule


Re: Simplifying Text Search

From
"Joshua D. Drake"
Date:
Simon Riggs wrote:
> On Mon, 2007-11-12 at 20:17 +0000, Heikki Linnakangas wrote:
> 
>> Granted, @@ is a bit awkward until you get used to it. "x LIKE y" would 
>> read out better, but unfortunately that's already taken ;-).
> 
> Remember, I'm not suggesting we get rid of @@
> 
>> In any case, it's way too late.
> 
> I'm suggesting we add a couple of simple SQL functions that will help
> text search docs be more easily understood.
> 
> It's beta and its valid to respond to usability issues just as we would
> respond to code bugs. Otherwise, why have beta? Late, but not too late. 
> 
> SQLServer, Oracle and MySQL all use functions, not operators. My
> observation would be that we have the hardest and most difficult to
> understand full text search capability. The Contains() function seems
> like a better name than I gave earlier also.
> 
> I love what we've done; I just want more people be able to use it.

Hmmm, my choices are:

SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & 
rat'::tsquery;

Or:

SELECT ts_match('a fat cat sat on a mat and ate a fat rat','cat & rat');

This seems a little too much like the "duh" department to ignore. A set 
of SQL functions would certainly be appropriate here.

Sincerely,

Joshua D. Drake







Re: Simplifying Text Search

From
Martijn van Oosterhout
Date:
On Mon, Nov 12, 2007 at 03:44:18PM -0500, Aidan Van Dyk wrote:
> Can LIKE be easily overloaded in the parser?  So:
>     text LIKE text
> works in it's current form, and
>     tsvector LIKE tsquery
> also works like the @@?  Or have I gotten all the ts* types all mixed up
> again...

AIUI LIKE is mashed into an operator at parse time, so yes, if you
create the operator with the right name it will just work.

Or not (I havn't tested it).

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
>  -- John F Kennedy

Re: Simplifying Text Search

From
Simon Riggs
Date:
On Mon, 2007-11-12 at 21:59 +0100, Pavel Stehule wrote:

> SELECT  docno
> FROM information
> WHERE document.CONTAINS
>  ('STEMMED FORM OF "standard"
>  IN SAME PARAGRAPH AS
>  SOUNDS LIKE "sequel"') = 1
> 
> it's little bit baroque, It's sample of method.

Seems thats the way Oracle does it too.

The SQLServer syntax is 
WHERE contains(text_column, search_query)

which seems marginally better.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Simplifying Text Search

From
Alvaro Herrera
Date:
Pavel Stehule escribió:
> Hello
> 
> look to standard, please. SQL/MM has part - full text.

Huh, what version of the standard is this?  My copy (the typical 2003
draft) doesn't have SQL/MM AFAICS.


-- 
Alvaro Herrera                 http://www.amazon.com/gp/registry/CTMLCN8V17R4
"La espina, desde que nace, ya pincha" (Proverbio africano)


Re: Simplifying Text Search

From
"Pavel Stehule"
Date:
On 12/11/2007, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Pavel Stehule escribió:
> > Hello
> >
> > look to standard, please. SQL/MM has part - full text.
>
> Huh, what version of the standard is this?  My copy (the typical 2003
> draft) doesn't have SQL/MM AFAICS.
>
>

I found

http://jtc1sc32.org/doc/N0751-0800/32N0771T.pdf
http://www.sigmod.org/record/issues/0112/standards.pdf
http://dbs.uni-leipzig.de/file/kap5.pdf

Pavel


> --
> Alvaro Herrera                 http://www.amazon.com/gp/registry/CTMLCN8V17R4
> "La espina, desde que nace, ya pincha" (Proverbio africano)
>

Re: Simplifying Text Search

From
Bruce Momjian
Date:
Simon Riggs wrote:
> On Mon, 2007-11-12 at 11:56 -0500, Tom Lane wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> > > So we end up with a normal sounding function that is overloaded to
> > > provide all of the various goodies.
> > 
> > As best I can tell, @@ does exactly this already.  This is just a
> > different spelling of the same capability, and I don't actually
> > find it better.  Why is "text_search(x,y)" better than "x @@ y"?
> > We don't recommend that people write "texteq(x,y)" instead of
> > "x = y".
> 
> Most people don't understand those differences. x = y means "make sure
> they are the same" to most people. They don't see what you (and I) see:
> function and operator interchangeability. So text_search() is better
> than @@ and = is better than texteq(). Life ain't neat...
> 
> Right now, Full Text Search SQL looks like complete gibberish and it
> dissuades many people from using what is an awesome set of features. I
> just want to add a little sugar to help people get started.

I realized this when editing the documentation but not clearly.  I
noticed that:
http://momjian.us/main/writings/pgsql/sgml/textsearch-intro.html#TEXTSEARCH-MATCHING
tsvector @@ tsquerytsquery  @@ tsvectortext @@ tsquerytext @@ text
The first two of these we saw already. The form text @@ tsquery  isequivalent to to_tsvector(x) @@ y. The form text @@
text is equivalentto to_tsvector(x) @@ plainto_tsquery(y).
 

was quite odd, especially the "text @@ text" case, and in fact it makes
casting almost required unless you can remember which one is a query and
which is a vector (hint, the vector is first).  What really adds to the
confusion is that the operator is two _identical_ characters, meaning
the operator is symetric, and it behave symetric if you cast one side,
but as vector @@ query if you don't.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://postgres.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Simplifying Text Search

From
Simon Riggs
Date:
On Mon, 2007-11-12 at 23:03 -0500, Bruce Momjian wrote:
> Simon Riggs wrote:
> > On Mon, 2007-11-12 at 11:56 -0500, Tom Lane wrote:
> > > Simon Riggs <simon@2ndquadrant.com> writes:
> > > > So we end up with a normal sounding function that is overloaded to
> > > > provide all of the various goodies.
> > > 
> > > As best I can tell, @@ does exactly this already.  This is just a
> > > different spelling of the same capability, and I don't actually
> > > find it better.  Why is "text_search(x,y)" better than "x @@ y"?
> > > We don't recommend that people write "texteq(x,y)" instead of
> > > "x = y".
> > 
> > Most people don't understand those differences. x = y means "make sure
> > they are the same" to most people. They don't see what you (and I) see:
> > function and operator interchangeability. So text_search() is better
> > than @@ and = is better than texteq(). Life ain't neat...
> > 
> > Right now, Full Text Search SQL looks like complete gibberish and it
> > dissuades many people from using what is an awesome set of features. I
> > just want to add a little sugar to help people get started.
> 
> I realized this when editing the documentation but not clearly.  I
> noticed that:
> 
>     http://momjian.us/main/writings/pgsql/sgml/textsearch-intro.html#TEXTSEARCH-MATCHING
> 
>     tsvector @@ tsquery
>     tsquery  @@ tsvector
>     text @@ tsquery
>     text @@ text
> 
>     The first two of these we saw already. The form text @@ tsquery  is
>     equivalent to to_tsvector(x) @@ y. The form text @@ text  is equivalent
>     to to_tsvector(x) @@ plainto_tsquery(y).
> 
> was quite odd, especially the "text @@ text" case, and in fact it makes
> casting almost required unless you can remember which one is a query and
> which is a vector (hint, the vector is first).  What really adds to the
> confusion is that the operator is two _identical_ characters, meaning
> the operator is symetric, and it behave symetric if you cast one side,
> but as vector @@ query if you don't.

I'm thinking we can have an inlinable function

contains(text, text) returns int 

Return values limited to just 0 or 1 or NULL, as with SQL/MM.
It's close to SQL/MM, but not exact.

contains(sourceText, searchText) is a macro for

case to_tsvector(default_text_search_config, sourceText) @@
to_tsquery(default_text_search_config, searchText)
when true then 1
when false then 0
else null
end

that allows us to write indexable queries like this

WHERE contains(sourceText, searchText) > 0

where we must still have built the index on a constant config.
Not checked that still works yet, maybe not, in which case something
slightly more complex to make sure its still indexable. This is the
difficult part.

So changes are:
- add SQL function
- simplify first 2 pages of docs using this function

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Simplifying Text Search

From
"Pavel Stehule"
Date:
On 13/11/2007, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Mon, 2007-11-12 at 23:03 -0500, Bruce Momjian wrote:
> > Simon Riggs wrote:
> > > On Mon, 2007-11-12 at 11:56 -0500, Tom Lane wrote:
> > > > Simon Riggs <simon@2ndquadrant.com> writes:
> > > > > So we end up with a normal sounding function that is overloaded to
> > > > > provide all of the various goodies.
> > > >
> > > > As best I can tell, @@ does exactly this already.  This is just a
> > > > different spelling of the same capability, and I don't actually
> > > > find it better.  Why is "text_search(x,y)" better than "x @@ y"?
> > > > We don't recommend that people write "texteq(x,y)" instead of
> > > > "x = y".
> > >
> > > Most people don't understand those differences. x = y means "make sure
> > > they are the same" to most people. They don't see what you (and I) see:
> > > function and operator interchangeability. So text_search() is better
> > > than @@ and = is better than texteq(). Life ain't neat...
> > >
> > > Right now, Full Text Search SQL looks like complete gibberish and it
> > > dissuades many people from using what is an awesome set of features. I
> > > just want to add a little sugar to help people get started.
> >
> > I realized this when editing the documentation but not clearly.  I
> > noticed that:
> >
> >       http://momjian.us/main/writings/pgsql/sgml/textsearch-intro.html#TEXTSEARCH-MATCHING
> >
> >       tsvector @@ tsquery
> >       tsquery  @@ tsvector
> >       text @@ tsquery
> >       text @@ text
> >
> >       The first two of these we saw already. The form text @@ tsquery  is
> >       equivalent to to_tsvector(x) @@ y. The form text @@ text  is equivalent
> >       to to_tsvector(x) @@ plainto_tsquery(y).
> >
> > was quite odd, especially the "text @@ text" case, and in fact it makes
> > casting almost required unless you can remember which one is a query and
> > which is a vector (hint, the vector is first).  What really adds to the
> > confusion is that the operator is two _identical_ characters, meaning
> > the operator is symetric, and it behave symetric if you cast one side,
> > but as vector @@ query if you don't.
>
> I'm thinking we can have an inlinable function
>
> contains(text, text) returns int
>
> Return values limited to just 0 or 1 or NULL, as with SQL/MM.
> It's close to SQL/MM, but not exact.
>
> contains(sourceText, searchText) is a macro for
>
> case to_tsvector(default_text_search_config, sourceText) @@
> to_tsquery(default_text_search_config, searchText)
> when true then 1
> when false then 0
> else null
> end
>

it's look well.

Pavel


Re: Simplifying Text Search

From
Simon Riggs
Date:
On Tue, 2007-11-13 at 08:58 +0100, Pavel Stehule wrote:
> On 13/11/2007, Simon Riggs <simon@2ndquadrant.com> wrote:

> > I'm thinking we can have an inlinable function
> >
> > contains(text, text) returns int
> >
> > Return values limited to just 0 or 1 or NULL, as with SQL/MM.
> > It's close to SQL/MM, but not exact.
> >
> > contains(sourceText, searchText) is a macro for
> >
> > case to_tsvector(default_text_search_config, sourceText) @@
> > to_tsquery(default_text_search_config, searchText)
> > when true then 1
> > when false then 0
> > else null
> > end
> >
> 
> it's look well.

I think it needs lot more thought yet. Travelling now, so not able to
add further. Will pick up again in next few days.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Simplifying Text Search

From
Peter Eisentraut
Date:
Am Montag, 12. November 2007 schrieb Martijn van Oosterhout:
> On Mon, Nov 12, 2007 at 03:44:18PM -0500, Aidan Van Dyk wrote:
> > Can LIKE be easily overloaded in the parser?  So:
> >     text LIKE text
> > works in it's current form, and
> >     tsvector LIKE tsquery
> > also works like the @@?  Or have I gotten all the ts* types all mixed up
> > again...
>
> AIUI LIKE is mashed into an operator at parse time, so yes, if you
> create the operator with the right name it will just work.

What we'd need is a way to convert a LIKE pattern into a tsquery ('%foo%bar%' 
=> 'foo & bar').  Then you might even be able to sneak index-optimized text 
search into existing applications.  Might be worth a try.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: Simplifying Text Search

From
Peter Eisentraut
Date:
Am Montag, 12. November 2007 schrieb Pavel Stehule:
> > Huh, what version of the standard is this?  My copy (the typical 2003
> > draft) doesn't have SQL/MM AFAICS.
>
> I found
>
> http://jtc1sc32.org/doc/N0751-0800/32N0771T.pdf
> http://www.sigmod.org/record/issues/0112/standards.pdf
> http://dbs.uni-leipzig.de/file/kap5.pdf

I think this standard never actually became one.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: Simplifying Text Search

From
Gregory Stark
Date:
"Peter Eisentraut" <peter_e@gmx.net> writes:

> What we'd need is a way to convert a LIKE pattern into a tsquery ('%foo%bar%' 
> => 'foo & bar').  Then you might even be able to sneak index-optimized text 
> search into existing applications.  Might be worth a try.

I don't think that's the right direction to go. Notably "%foo%bar%" isn't the
same thing as "foo & bar". Also most tsearch queries can't be expressed as
LIKE patterns anyways.


What I do like is the idea of using LIKE as a model to follow for tsearch.
Make some keyword which maps to an operator. So you could do something like
WHERE col SATISFIES 'foo & bar'

which maps to @@ just as LIKE maps to ~~. This has the interesting property of
working for ltree and other data types which support the @@ operator.

Now I'm starting to wonder though. How evil would it be to just rename all the
@@ operators to ~~ ? That redefines LIKE to be a data-type-specific operation.
Effectively declaring that tsquery, lquery, et al are all analogous concepts
for LIKE patterns.

It's probably safer to come up with a new keyword for @@ and if we ever decide
to pursue a regexp data type we could define @@ for that as well which would
make that keyword an equivalent to LIKE for regexps.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!


Re: Simplifying Text Search

From
Peter Eisentraut
Date:
Am Dienstag, 13. November 2007 schrieb Gregory Stark:
> "Peter Eisentraut" <peter_e@gmx.net> writes:
> > What we'd need is a way to convert a LIKE pattern into a tsquery
> > ('%foo%bar%' => 'foo & bar').  Then you might even be able to sneak
> > index-optimized text search into existing applications.  Might be worth a
> > try.
>
> I don't think that's the right direction to go. Notably "%foo%bar%" isn't
> the same thing as "foo & bar". Also most tsearch queries can't be expressed
> as LIKE patterns anyways.

The requirement is to express LIKE patterns as tsearch queries, not the other 
way around.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: Simplifying Text Search

From
Gregory Stark
Date:
"Peter Eisentraut" <peter_e@gmx.net> writes:

> Am Dienstag, 13. November 2007 schrieb Gregory Stark:
>> "Peter Eisentraut" <peter_e@gmx.net> writes:
>> > What we'd need is a way to convert a LIKE pattern into a tsquery
>> > ('%foo%bar%' => 'foo & bar').  Then you might even be able to sneak
>> > index-optimized text search into existing applications.  Might be worth a
>> > try.
>>
>> I don't think that's the right direction to go. Notably "%foo%bar%" isn't
>> the same thing as "foo & bar". Also most tsearch queries can't be expressed
>> as LIKE patterns anyways.
>
> The requirement is to express LIKE patterns as tsearch queries, not the other 
> way around.

As an optimization to use GIST indexes this is a cool idea. Much like our
existing functionality which converts "col LIKE 'bar%'" to range queries so it
can use btree indexes.

But it doesn't help the people who find "col @@ 'foo & bar'" confusing.
They're just being thrown off because "@@" is an unfamiliar operator.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication
support!


Re: Simplifying Text Search

From
Gregory Stark
Date:
"Bruce Momjian" <bruce@momjian.us> writes:

> I realized this when editing the documentation but not clearly.  I
> noticed that:
>
>     http://momjian.us/main/writings/pgsql/sgml/textsearch-intro.html#TEXTSEARCH-MATCHING
>
>     tsvector @@ tsquery
>     tsquery  @@ tsvector
>     text @@ tsquery
>     text @@ text
>
>     The first two of these we saw already. The form text @@ tsquery  is
>     equivalent to to_tsvector(x) @@ y. The form text @@ text  is equivalent
>     to to_tsvector(x) @@ plainto_tsquery(y).
>
> was quite odd, especially the "text @@ text" case, and in fact it makes
> casting almost required unless you can remember which one is a query and
> which is a vector (hint, the vector is first).  What really adds to the
> confusion is that the operator is two _identical_ characters, meaning
> the operator is symetric, and it behave symetric if you cast one side,
> but as vector @@ query if you don't.

I find this odd as well. Effectively what we're doing is rather than defining
the casting behaviour in a global way we're defining operators specifically
for text which do the casts internally. That seems like a bad idea, especially
given that other data types implement @@ operators as well.

I feel like the right idea is to throw out all but tsvector @@ tsquery and
define casts as necessary to get that to work in every (non-inverted) case
above.

Actually the only case which wouldn't work with just that is a bare
 'foo' @@ 'bar'

And even that would work fine until you load _int.sql or ltree.sql which
define conflicting operators.

Separately I feel like we should name this operator something like ~= or =? or
something like that. @@ doesn't look like any kind of equality or matching
operator and it looks symmetric. We also already have @@ operators which are
right-handed unary operators for geometric data types which this is very
different from.

I would suggest something like =?


PS: I thought of a whacky idea which would look neat but be mainly silly. I
thought I would mention it anyways though. If we define a unary postfix
operator "text ?" which just casted text to tsquery then define a 
"text = tsquery" operator which does what @@ does. Then you could write
queries like:
WHERE col = 'foo & bar' ?

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication
support!


Re: Simplifying Text Search

From
Simon Riggs
Date:
On Tue, 2007-11-13 at 13:32 +0100, Peter Eisentraut wrote:
> Am Dienstag, 13. November 2007 schrieb Gregory Stark:
> > "Peter Eisentraut" <peter_e@gmx.net> writes:
> > > What we'd need is a way to convert a LIKE pattern into a tsquery
> > > ('%foo%bar%' => 'foo & bar').  Then you might even be able to sneak
> > > index-optimized text search into existing applications.  Might be worth a
> > > try.
> >
> > I don't think that's the right direction to go. Notably "%foo%bar%" isn't
> > the same thing as "foo & bar". Also most tsearch queries can't be expressed
> > as LIKE patterns anyways.
> 
> The requirement is to express LIKE patterns as tsearch queries, not the other 
> way around.

+1

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Simplifying Text Search

From
"Trevor Talbot"
Date:
On 11/13/07, Peter Eisentraut <peter_e@gmx.net> wrote:
> Am Dienstag, 13. November 2007 schrieb Gregory Stark:
> > "Peter Eisentraut" <peter_e@gmx.net> writes:

> > > What we'd need is a way to convert a LIKE pattern into a tsquery
> > > ('%foo%bar%' => 'foo & bar').  Then you might even be able to sneak
> > > index-optimized text search into existing applications.  Might be worth a
> > > try.

> > I don't think that's the right direction to go. Notably "%foo%bar%" isn't
> > the same thing as "foo & bar". Also most tsearch queries can't be expressed
> > as LIKE patterns anyways.

> The requirement is to express LIKE patterns as tsearch queries, not the other
> way around.

How?  LIKE queries are incapable of expressing word boundaries, do not
support substitution, and are implicitly ordered.  tsearch queries
operate entirely on word boundaries, may substitute words, and are
unordered.

I don't see the two as even working in the same space, let alone be
convertable for optimization purposes.  If the idea was just to use a
tsearch index as an initial filter, then running LIKE on the results,
dictionary-based substitution makes that unreliable.


Re: Simplifying Text Search

From
Andrew Dunstan
Date:

Trevor Talbot wrote:
> On 11/13/07, Peter Eisentraut <peter_e@gmx.net> wrote:
>   
>> Am Dienstag, 13. November 2007 schrieb Gregory Stark:
>>     
>>> "Peter Eisentraut" <peter_e@gmx.net> writes:
>>>       
>
>   
>>>> What we'd need is a way to convert a LIKE pattern into a tsquery
>>>> ('%foo%bar%' => 'foo & bar').  Then you might even be able to sneak
>>>> index-optimized text search into existing applications.  Might be worth a
>>>> try.
>>>>         
>
>   
>>> I don't think that's the right direction to go. Notably "%foo%bar%" isn't
>>> the same thing as "foo & bar". Also most tsearch queries can't be expressed
>>> as LIKE patterns anyways.
>>>       
>
>   
>> The requirement is to express LIKE patterns as tsearch queries, not the other
>> way around.
>>     
>
> How?  LIKE queries are incapable of expressing word boundaries, do not
> support substitution, and are implicitly ordered.  tsearch queries
> operate entirely on word boundaries, may substitute words, and are
> unordered.
>
> I don't see the two as even working in the same space, let alone be
> convertable for optimization purposes.  If the idea was just to use a
> tsearch index as an initial filter, then running LIKE on the results,
> dictionary-based substitution makes that unreliable.
>
>
>   

The fact that we are having this discussion at all demonstrates to me 
that we should leave well alone - any use of LIKE in this context is 
just about guaranteed to cause massive confusion. (Not to mention that 
it's far too late in the dev cycle to be making such changes, if we're 
thinking of them for 8.3).

cheers

andrew


Re: Simplifying Text Search

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> The fact that we are having this discussion at all demonstrates to me 
> that we should leave well alone - any use of LIKE in this context is 
> just about guaranteed to cause massive confusion.

Also, the semantics of LIKE are perfectly clearly specified by the SQL
standard; unless you're willing to ignore the spec, it's simply not
that useful for full-text search.  Therefore, we have to tell people
to use some other API anyway.  The existing tsearch2 API at least has
the virtue of having been proven in the field over several years.
        regards, tom lane


Re: Simplifying Text Search

From
"Webb Sprague"
Date:
> ... Therefore, we have to tell people
> to use some other API anyway.  The existing tsearch2 API at least has
> the virtue of having been proven in the field over several years.

I can only speak as a moderately sophisticated end user, but ... I
think the tsearch2 API has been "proven" to alienate a lot of
potential users, myself included.  If the simple things were simple,
there might be a large user base that would rebel against an API
extension, but I don't think this is the case.  And I think the need
for a simpler, refactored interface to tsearch is desperate.

Granted, one can learn tsearch2 as is, but it is somewhat painful.
It isn't the sort of thing one figures out for fun and potential
future use, but  probably only if one is forced to.  If we (well,
"you", really) could make tsearch2 less like C++ (or OCAML or FORTH )
and more like Python, we would get a whole lot of new users of tsearch
in the process, and probably a whole of good will toward PostgreSQL.

-W


Re: Simplifying Text Search

From
Simon Riggs
Date:
On Tue, 2007-11-13 at 08:58 +0100, Pavel Stehule wrote:
> On 13/11/2007, Simon Riggs <simon@2ndquadrant.com> wrote:
> > I'm thinking we can have an inlinable function
> >
> > contains(text, text) returns int
> >
> > Return values limited to just 0 or 1 or NULL, as with SQL/MM.
> > It's close to SQL/MM, but not exact.
> >
> > contains(sourceText, searchText) is a macro for
> >
> > case to_tsvector(default_text_search_config, sourceText) @@
> > to_tsquery(default_text_search_config, searchText)
> > when true then 1
> > when false then 0
> > else null
> > end

Better idea:

in-linable function called

create function
contains(sourceText text, searchText text, config text) returns boolean
as $$
to_tsvector(config, sourceText) @@ to_tsquery(config, searchText);
$$ language sql;

so that 

SELECT title
FROM pgweb
WHERE contains(body, 'a & b', 'english')

is an indexable, easily readable way of using full text search.

allowing 

SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');?column? 
----------t

to become

SELECT contains('fat cats ate fat rats', 'fat & rat', 'english');?column? 
----------t

Proposed changes:
1. Add function contains()
2. Alter docs to show use of contains()

All other @@ features still the same

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Simplifying Text Search

From
Gregory Stark
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:

> Proposed changes:
> 1. Add function contains()
> 2. Alter docs to show use of contains()
>
> All other @@ features still the same

Have you yet given any advantages of contains over @@ ?

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about
EnterpriseDB'sPostgreSQL training!
 


Re: Simplifying Text Search

From
Magnus Hagander
Date:
On Wed, Nov 14, 2007 at 07:46:58AM +0000, Gregory Stark wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> 
> > Proposed changes:
> > 1. Add function contains()
> > 2. Alter docs to show use of contains()
> >
> > All other @@ features still the same
> 
> Have you yet given any advantages of contains over @@ ?

Familiarity for users of SQL Server that are migrating? ;-)
(http://msdn2.microsoft.com/en-us/library/ms187787.aspx)

//Magnus


Re: Simplifying Text Search

From
Bruce Momjian
Date:
Simon Riggs wrote:
> Better idea:
> 
> in-linable function called
> 
> create function
> contains(sourceText text, searchText text, config text) returns boolean
> as $$
> to_tsvector(config, sourceText) @@ to_tsquery(config, searchText);
> $$ language sql;
> 
> so that 
> 
> SELECT title
> FROM pgweb
> WHERE contains(body, 'a & b', 'english')
> 
> is an indexable, easily readable way of using full text search.
> 
> allowing 
> 
> SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');
>  ?column? 
> ----------
>  t
> 
> to become
> 
> SELECT contains('fat cats ate fat rats', 'fat & rat', 'english');
>  ?column? 
> ----------
>  t
> 
> Proposed changes:
> 1. Add function contains()
> 2. Alter docs to show use of contains()
> 
> All other @@ features still the same

One advantage to this syntax is that it allows the specification of the
configuration name, which needs a function call to be done with the @@
operator.  

However, to access a tsvector column we would need to add another
function that takes a tsvector value.  However, the config specification
would apply only to the text column, not the pre-computed tsvector, so
that might be confusing.  I think this method could use a function index
but it wouldn't be very clear from the syntax.

The problem I always kept having in clarifying the syntax is that I
could never find anything that covered all common use cases.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://postgres.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Simplifying Text Search

From
Peter Eisentraut
Date:
I wrote:
> What we'd need is a way to convert a LIKE pattern into a tsquery
> ('%foo%bar%' => 'foo & bar').  Then you might even be able to sneak
> index-optimized text search into existing applications.  Might be worth a
> try.

Here is how this could work:

CREATE FUNCTION likepattern_to_tsquery(text) RETURNS tsquery   RETURNS NULL ON NULL INPUT IMMUTABLE   LANGUAGE SQL   AS
$$SELECT trim(replace($1, '%', ' & '), '& ')::tsquery; $$;
 


UPDATE pg_operator SET oprname = '#~~#' WHERE oprcode = 'textlike'::regproc;


CREATE FUNCTION textlike_ts(text, text) RETURNS boolean   RETURNS NULL ON NULL INPUT IMMUTABLE   LANGUAGE SQL   AS $$
SELECT$1 @@ likepattern_to_tsquery($2) AND $1 #~~# $2; $$;
 


CREATE OPERATOR ~~ (   PROCEDURE = textlike_ts,   LEFTARG = text,   RIGHTARG = text
);

Maybe something like this could be useful for people who cannot readily change 
their application code.  (Of course it is not meant to solve the issue of how 
to make the text-search functionality itself easier to access.)

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: Simplifying Text Search

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> [ replace LIKE with this: ]

> CREATE FUNCTION textlike_ts(text, text) RETURNS boolean
>     RETURNS NULL ON NULL INPUT IMMUTABLE
>     LANGUAGE SQL
>     AS $$ SELECT $1 @@ likepattern_to_tsquery($2) AND $1 #~~# $2; $$;

Cute trick, but as-is this will not make anything go any faster, because
it doesn't expose any opportunities for indexing the @@ operation.
I think what you'd really need is something like

$$ SELECT to_tsvector('english', $1) @@ likepattern_to_tsquery($2) AND $1 #~~# $2; $$;

which will win if there is an expression index on to_tsvector('english',
<textcolumn>).  (You can substitute your preferred configuration of
course, but you don't get to rely on default_text_search_config, because
that would render the expression non-immutable and thus non-indexable.)

This points up the same old notational problem that there is no good
place in the operator notation to mention which text search
configuration you want to use.  Simon's suggestion of a three-parameter
function at least addresses that issue.
        regards, tom lane


Re: Simplifying Text Search

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> Better idea:

> create function
> contains(sourceText text, searchText text, config text) returns boolean
> as $$
> to_tsvector(config, sourceText) @@ to_tsquery(config, searchText);
> $$ language sql;

I think you have confused "simple query syntax" with "easy to use".

The above does make the query notation look nicer, but the query will
not actually go fast unless the DBA has made an expression index on
to_tsvector('desired config', textcolumn).  Thus, in terms of getting
people "over the hump" of doing their first decently-performing text
search, you haven't reduced the number of concepts to be understood
at all; rather, you've added one more.  People will still have to
understand the complexity that contains() is supposedly hiding.
Worse, they'll have to make the connection between two completely
different-looking syntaxes every time they look at their schemas.
        regards, tom lane


Re: Simplifying Text Search

From
Bruce Momjian
Date:
Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Better idea:
> 
> > create function
> > contains(sourceText text, searchText text, config text) returns boolean
> > as $$
> > to_tsvector(config, sourceText) @@ to_tsquery(config, searchText);
> > $$ language sql;
> 
> I think you have confused "simple query syntax" with "easy to use".
> 
> The above does make the query notation look nicer, but the query will
> not actually go fast unless the DBA has made an expression index on
> to_tsvector('desired config', textcolumn).  Thus, in terms of getting
> people "over the hump" of doing their first decently-performing text
> search, you haven't reduced the number of concepts to be understood
> at all; rather, you've added one more.  People will still have to
> understand the complexity that contains() is supposedly hiding.
> Worse, they'll have to make the connection between two completely
> different-looking syntaxes every time they look at their schemas.

Yes, this the same problem we had months ago trying to improve the
syntax, that there was no easy syntax that covered all common use cases.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://postgres.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Simplifying Text Search

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Yes, this the same problem we had months ago trying to improve the
> syntax, that there was no easy syntax that covered all common use cases.

The thing that we keep coming up against is that we'd like queries to
be able to depend on default_text_search_config.  Quite aside from
anyone's personal opinions about whether operator or function syntax
is preferable, it would clearly be desirable to be able to say
... WHERE textcolumn @@ 'pattern-constant'

and get a fast full-text search that's governed by the current setting
of default_text_search_config (of course, it can only be fast if there
is an index using that same configuration, but that's a setup detail).

It strikes me that now that we have an invalidatable plan cache,
we could fix this by having the planner rewrite "textcolumn @@ something"
into "to_tsvector(regconfig_constant, textcolumn) @@ something",
so long as it marked the resulting plan as needing to be invalidated
by any change in the value of default_text_search_config.  Once you
have that form, it can be matched against an index, and away you go.

Too late for 8.3, but seems doable enough for 8.4.
        regards, tom lane


Re: Simplifying Text Search

From
"Trevor Talbot"
Date:
On 11/14/07, Peter Eisentraut <peter_e@gmx.net> wrote:
> I wrote:
> > What we'd need is a way to convert a LIKE pattern into a tsquery
> > ('%foo%bar%' => 'foo & bar').  Then you might even be able to sneak
> > index-optimized text search into existing applications.  Might be worth a
> > try.
>
> Here is how this could work:
>
> CREATE FUNCTION likepattern_to_tsquery(text) RETURNS tsquery

[...]

But that coversion itself is fundamentally flawed, is the problem.

'foo bar'
'fooandbar'
'barfoo and foobar'

'%foo%bar%' matches all 3.
'foo & bar' matches only the first.

If the application currently using LIKE actually wants a word-based
search, it should probably just convert to using tsearch wholesale,
since it doesn't work as intended now.  If it actually wants wildcard
matching behavior, it can't use tsearch at all.


Re: Simplifying Text Search

From
Oleg Bartunov
Date:
On Wed, 14 Nov 2007, Tom Lane wrote:

> Bruce Momjian <bruce@momjian.us> writes:
>> Yes, this the same problem we had months ago trying to improve the
>> syntax, that there was no easy syntax that covered all common use cases.
>
> The thing that we keep coming up against is that we'd like queries to
> be able to depend on default_text_search_config.  Quite aside from
> anyone's personal opinions about whether operator or function syntax
> is preferable, it would clearly be desirable to be able to say
>
>     ... WHERE textcolumn @@ 'pattern-constant'
>
> and get a fast full-text search that's governed by the current setting
> of default_text_search_config (of course, it can only be fast if there
> is an index using that same configuration, but that's a setup detail).
>
> It strikes me that now that we have an invalidatable plan cache,
> we could fix this by having the planner rewrite "textcolumn @@ something"
> into "to_tsvector(regconfig_constant, textcolumn) @@ something",
> so long as it marked the resulting plan as needing to be invalidated
> by any change in the value of default_text_search_config.  Once you
> have that form, it can be matched against an index, and away you go.
>
> Too late for 8.3, but seems doable enough for 8.4.

cute, pity it's didn't come up in our early discussion

>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
>
>                http://www.postgresql.org/about/donate
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: Simplifying Text Search

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> On Wed, Nov 14, 2007 at 07:46:58AM +0000, Gregory Stark wrote:
>> Have you yet given any advantages of contains over @@ ?

> Familiarity for users of SQL Server that are migrating? ;-)
> (http://msdn2.microsoft.com/en-us/library/ms187787.aspx)

That argument would only hold water if we were going to adopt *all* of
their syntax for the feature ... and the other choices they've made seem
pretty ugly to me.
        regards, tom lane


Re: Simplifying Text Search

From
Peter Eisentraut
Date:
Am Mittwoch, 14. November 2007 schrieb Trevor Talbot:
> But that coversion itself is fundamentally flawed, is the problem.

I know it's incorrect, but with a different parser and/or dictionary you could 
make it work.

In practice, the search pattern will mostly be provided dynamically from some 
user input, so you could conceivably be able to modify the search patterns 
more readily than the entire queries in your application.  Anyway, it's just 
an idea for those who need it.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: Simplifying Text Search

From
"Trevor Talbot"
Date:
On 11/15/07, Peter Eisentraut <peter_e@gmx.net> wrote:

> In practice, the search pattern will mostly be provided dynamically from some
> user input, so you could conceivably be able to modify the search patterns
> more readily than the entire queries in your application.  Anyway, it's just
> an idea for those who need it.

Ah, I see what you mean, like for a simple web forum that only knows
LIKE searches now.  It may be easier to adjust the DB to do the
intended thing instead of trying to change the entire forum.  I wasn't
thinking of those narrow cases.


Re: Simplifying Text Search

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Am Mittwoch, 14. November 2007 schrieb Trevor Talbot:
>> But that coversion itself is fundamentally flawed, is the problem.

> I know it's incorrect, but with a different parser and/or dictionary you could 
> make it work.

No, I don't think so.  Trevor's killer point is that the token
boundaries chosen by the parser --- no matter *what* they are ---
might not line up with the substrings needed by a given LIKE pattern.
There isn't any mechanism in tsearch that will find the stored word
"foobar" if the search is for "foo" & "bar"; nor vice versa.

It might be possible to use a pg_trgm index in this way, since
AFAICT from the documents pg_trgm just chops up the substrings
blindly rather than trying to have smarts about word boundaries.

(And on third thought, I suppose you could emulate pg_tgrm with
a suitable parser and a lobotomized dictionary ... but it would
be pretty wasteful to use the tsearch mechanisms for that.)
        regards, tom lane


Re: Simplifying Text Search

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Magnus Hagander <magnus@hagander.net> writes:
>> On Wed, Nov 14, 2007 at 07:46:58AM +0000, Gregory Stark wrote:
>>> Have you yet given any advantages of contains over @@ ?
>
>> Familiarity for users of SQL Server that are migrating? ;-)
>> (http://msdn2.microsoft.com/en-us/library/ms187787.aspx)
>
> That argument would only hold water if we were going to adopt *all* of
> their syntax for the feature ... and the other choices they've made seem
> pretty ugly to me.

Well, yes and no. A degree of partial compatibility could still be interesting
without the bells and whistles as long as we're compatible as far as we do go.
We could always add bells and whistles later if we want.

What would be of concern here is that we could be cutting off adding a more
fully compatible function. 

Their syntax for the search pattern is quite different (and more
human-readable). If we implement a contains() function which just takes a text
and casts it from our text representation to a tsquery then we're blocking
ever having a function which does a full conversion from an MSSQL style query
string to a tsquery.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication
support!


Re: Simplifying Text Search

From
Bruce Momjian
Date:
This has been saved for the 8.4 release:
http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---------------------------------------------------------------------------

Simon Riggs wrote:
> Something Tom Dunstan just mentioned has made me ask the question "Why
> does our full text search feature look so strange?". It's the
> operator-laden syntax that causes the problem.
> 
> By any stretch, this query is difficult for most people to understand:
> 
> SELECT * FROM text_table
> WHERE to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');
> 
> Wouldn't it be much simpler to just have a function, so we can write
> this query like this?
> 
> SELECT * FROM text_table
> WHERE text_search('haystack needle haystack', 'needle');
> 
> We then explain to people that while the above is nice, it will presume
> that both the function inputs are Text, which isn't any good for complex
> searches, indexing and dictionaries etc.., so then we move to:
> 
> SELECT * FROM text_table
> WHERE text_search('haystack needle haystack'::tsvector,
> 'needle'::tsquery);
> 
> or perhaps
> 
> SELECT * FROM text_table
> WHERE full_text_search('haystack needle haystack', 'needle & hay');
> 
> which would automatically do the conversions to tsvector and tsquery for
> us. No more tedious casting, easy to read.
> 
> [*text_search() functions would return bool]
> 
> So we end up with a normal sounding function that is overloaded to
> provide all of the various goodies. We can include the text_search(text,
> text) version of the function in the normal chapter on functions, with a
> pointer to the more complex stuff elsewhere.
> 
> Sound good?
> 
> We can then explain everything without having to use @@ operators. They
> can then be introduced as an option.
> 
> The side benefit of this is that we can then allow our wonderful new
> functionality to be more easily usable by things like Hibernate. We just
> tell them we have this new function and thats all they need to know.
> 
> I know that under the covers the @@ operator is necessary because we
> hang various pieces of optimizer information from it. Each function
> signature gets an operator with matching signature, so there's a 1:1
> correspondence in most use cases. So to make this all hang together,
> there'd need to be a some smarts that says: if there is only one
> operator on a function then use the operator's optimizer information
> when you see just the function. That information can be assessed at DDL
> time, so we can keep accurate track of operator counts in pgproc. 
> 
> An alternative approach might be to make the first operator created on a
> function the "primary" operator. All other operators would then be
> secondary operators, so that adding operators would not change the
> inference mechanism.
> 
> I've not got sufficient knowledge to say how hard the
> function-to-operator inference is, but it would be dang useful in making
> text search and many other programs readable and easy to interface to.
> In the end that is going to mean wider usage of that functionality, with
> more people feeling like they can dip their toes into the water.
> 
> I must confess I have insufficient time to do this myself right now, not
> least me discovering exactly how. I'm spending time on this now because
> I'm the one that has to explain this stuff to people and things like
> this can make a huge difference in their understanding and eventual
> uptake.
> 
> Thoughts?
> 
> -- 
>   Simon Riggs
>   2ndQuadrant  http://www.2ndQuadrant.com
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
> 
>                 http://www.postgresql.org/about/donate

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://postgres.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Simplifying Text Search

From
Simon Riggs
Date:
On Tue, 2007-11-20 at 14:25 -0500, Bruce Momjian wrote:
> This has been saved for the 8.4 release:
> 
>     http://momjian.postgresql.org/cgi-bin/pgpatches_hold
> 

It isn't a patch, so isn't being held for later review, nor have you
added it to the TODO list, so I'm not sure what this means.

Would you mind me asking what happens to things on this list?

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Simplifying Text Search

From
Bruce Momjian
Date:
Simon Riggs wrote:
> On Tue, 2007-11-20 at 14:25 -0500, Bruce Momjian wrote:
> > This has been saved for the 8.4 release:
> > 
> >     http://momjian.postgresql.org/cgi-bin/pgpatches_hold
> > 
> 
> It isn't a patch, so isn't being held for later review, nor have you
> added it to the TODO list, so I'm not sure what this means.
> 
> Would you mind me asking what happens to things on this list?

It says "saved" meaning we will review the issue for 8.4.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://postgres.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Simplifying Text Search

From
Decibel!
Date:
On Nov 15, 2007, at 4:41 AM, Gregory Stark wrote:
>   Ask me about EnterpriseDB's Slony Replication support!


Hah, wtf is that all about? :)

BTW, looked at londiste?
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



Re: Simplifying Text Search

From
"Joshua D. Drake"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Tue, 4 Dec 2007 16:41:52 -0600
Decibel! <decibel@decibel.org> wrote:

> On Nov 15, 2007, at 4:41 AM, Gregory Stark wrote:
> >   Ask me about EnterpriseDB's Slony Replication support!
> 
> 
> Hah, wtf is that all about? :)

Stealth marketing :)...

Joshua D. Drake


- -- 
     === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564   24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997  http://www.commandprompt.com/        UNIQUE NOT NULL
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHVdlPATb/zqfZUUQRAsnFAJ9W8KkxkoijdAUOLsB71kZGmWaxfgCgqat+
GswHcfzZNDkw4i37s9ecy7M=
=7x8o
-----END PGP SIGNATURE-----

Re: Simplifying Text Search

From
Decibel!
Date:
On Dec 4, 2007, at 4:48 PM, Joshua D. Drake wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Tue, 4 Dec 2007 16:41:52 -0600
> Decibel! <decibel@decibel.org> wrote:
>
>> On Nov 15, 2007, at 4:41 AM, Gregory Stark wrote:
>>>   Ask me about EnterpriseDB's Slony Replication support!
>>
>>
>> Hah, wtf is that all about? :)
>
> Stealth marketing :)...

Dammit, I need to de-program reply-all from my fingers...

But yeah, I thought it funny. :)
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



Re: Simplifying Text Search

From
Bruce Momjian
Date:
Added to TODO:

* Consider a function-based API for '@@' full text searches
 http://archives.postgresql.org/pgsql-hackers/2007-11/msg00511.php


---------------------------------------------------------------------------

Simon Riggs wrote:
> Something Tom Dunstan just mentioned has made me ask the question "Why
> does our full text search feature look so strange?". It's the
> operator-laden syntax that causes the problem.
> 
> By any stretch, this query is difficult for most people to understand:
> 
> SELECT * FROM text_table
> WHERE to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');
> 
> Wouldn't it be much simpler to just have a function, so we can write
> this query like this?
> 
> SELECT * FROM text_table
> WHERE text_search('haystack needle haystack', 'needle');
> 
> We then explain to people that while the above is nice, it will presume
> that both the function inputs are Text, which isn't any good for complex
> searches, indexing and dictionaries etc.., so then we move to:
> 
> SELECT * FROM text_table
> WHERE text_search('haystack needle haystack'::tsvector,
> 'needle'::tsquery);
> 
> or perhaps
> 
> SELECT * FROM text_table
> WHERE full_text_search('haystack needle haystack', 'needle & hay');
> 
> which would automatically do the conversions to tsvector and tsquery for
> us. No more tedious casting, easy to read.
> 
> [*text_search() functions would return bool]
> 
> So we end up with a normal sounding function that is overloaded to
> provide all of the various goodies. We can include the text_search(text,
> text) version of the function in the normal chapter on functions, with a
> pointer to the more complex stuff elsewhere.
> 
> Sound good?
> 
> We can then explain everything without having to use @@ operators. They
> can then be introduced as an option.
> 
> The side benefit of this is that we can then allow our wonderful new
> functionality to be more easily usable by things like Hibernate. We just
> tell them we have this new function and thats all they need to know.
> 
> I know that under the covers the @@ operator is necessary because we
> hang various pieces of optimizer information from it. Each function
> signature gets an operator with matching signature, so there's a 1:1
> correspondence in most use cases. So to make this all hang together,
> there'd need to be a some smarts that says: if there is only one
> operator on a function then use the operator's optimizer information
> when you see just the function. That information can be assessed at DDL
> time, so we can keep accurate track of operator counts in pgproc. 
> 
> An alternative approach might be to make the first operator created on a
> function the "primary" operator. All other operators would then be
> secondary operators, so that adding operators would not change the
> inference mechanism.
> 
> I've not got sufficient knowledge to say how hard the
> function-to-operator inference is, but it would be dang useful in making
> text search and many other programs readable and easy to interface to.
> In the end that is going to mean wider usage of that functionality, with
> more people feeling like they can dip their toes into the water.
> 
> I must confess I have insufficient time to do this myself right now, not
> least me discovering exactly how. I'm spending time on this now because
> I'm the one that has to explain this stuff to people and things like
> this can make a huge difference in their understanding and eventual
> uptake.
> 
> Thoughts?
> 
> -- 
>   Simon Riggs
>   2ndQuadrant  http://www.2ndQuadrant.com
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
> 
>                 http://www.postgresql.org/about/donate

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://postgres.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +