Re: Full text searching, anyone interested? - Mailing list pgsql-hackers

From Gavin Sherry
Subject Re: Full text searching, anyone interested?
Date
Msg-id Pine.LNX.4.21.0106031141320.17650-100000@linuxworld.com.au
Whole thread Raw
In response to Full text searching, anyone interested?  (mlw <markw@mohawksoft.com>)
List pgsql-hackers
Hi guys,


On Sat, 2 Jun 2001, mlw wrote:

> I frequently rant on this about full text searching.
> 
[snip]
> I would love to find a way to get a bitmap like index native to Postgres. I
> would, in fact, love and expect to do an amount of it. The problem is to do it
> "right" will require a lot of work at very low levels of Postgres.
> 
> Is anyone interested in pursuing this?

Yes. I think this would be an important feature of PostgreSQL. I have
hacked contrib/fulltextindex to bits in order to segment the index so that
I can better deploy it on a cluster of postgres machines. I have also
changed it to a score/rating style system. However, it is still a word ->
oid relationship and is not scaling as well as I had hoped.

> How should it look?

In terms of interface to SQL, the function call which activates your FTI
search is much neater than the way I do it -- build a query based on the
number of search terms and the boolean operator used.

In terms of how it is interfaced to Postgres backend, I think it should be
an index type which one can apply to any character orientated
columns(s). It would be important that the index capable of handling
multiple columns so that many textual fields in a single row could be
indexed.

The index itself is where troubles would be found. People expect a lot
from full text searches. My own implementations allow chronological
searching, score based searching and searching on similar words. It would
be hard to interface this to CREATE INDEX as well as select. So, if a
native full text index was to be build it would have to be able to:

a) Index multiple columns
b) Be configurable: score/frequency based sorting, sorting in terms of a
column in an index row?
c) Be interfaced to a user level fti() function
d) Be ignored by the planner (if we want searches to occur only through
fti())
e) honour insert/delete.

Something else which is an issue is the size of the index. Indices on text
columns are generally very large. In my applications I have managed to
reduce this through segmenting the indices along the following lines: case
sensitivity/insensitivty, leading characters. This dramatically reduces
the IO load of an index scan -- but it would be quite difficult to build
this into a dynamic framework for the backend. For one, a VACUUM or some
equivalent would need to evaluate the size of a given index and, based on
other configuration information (is the user allowing index
segmentation?) segment the index based on some criteria. The problem then
is that for large indices, this could take quite some time.

I've probably over looked a fair number of other things which would need
to be considered. However, it's safe to say that such a feature native to
Postgres would be greatly appreciated by many.

Gavin



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Re: [GENERAL] +/- Inf for float8's
Next
From: Peter Eisentraut
Date:
Subject: Re: [PATCHES] Fw: Isn't pg_statistic a security hole - Solution Proposal