Thread: OK, does anyone have any better ideas?

OK, does anyone have any better ideas?

From
mlw
Date:
I have a working version of a text search engine. I want to make it work
for Postgres (I will be releasing it GPL). It can literally find the
occurrence of a string of words within 5 million records in a few
milliseconds. It is very fast, it works similarly to many web search
engines.

I have tried many approaches to integrate the search system with
Postgres, but I can't find any method that isn't too slow or too
cumbersome.

The best I have been able to come up with is this:

create function textsearch(varchar) returns integer as '
    DECLARE
        handle  integer;
        count   integer;
        pos     integer;
    BEGIN
        handle = search_exec( \'localhost\', $1);
        count = search_count(handle);
        for pos in 0 .. count-1 loop
            insert into search_result(key, rank)
            values (search_key(handle, pos), search_rank(handle, pos));
        end loop;
        return search_done(handle);
    END;
' language 'plpgsql';

And this is used as:

create temp table search_result (key integer, rank integer);

select textsearch('bla bla');

select field from table, search_result where field_key = search_result.key
order by search_result.rank;

drop table search_result ;  


The problems with this are: I can't seem to create a table from within
plpgsql (I read about a patch, but I have to find out which version it is
in), so I have to create the table outside the function.

I can also only execute one text search at a time, because I can't seem
to use the name of a table that has been passed in to the plpgsql
environment, which would allow multiple searches to be joined, as in:

select textsearch(temp_tbl1, 'bla bla');
select textsearch(temp_tbl2, 'foo bar');

select field from table1, table2 where table1.field_key = temp_tbl1.key
and  table2.field_key = temp_tbl2.key;


This could be so sweet, but right now it is just a disaster and I am
pulling my hair out. Does anyone have any suggestions or tricks that
could make this easier/faster, or is Postgres just unable to do this
sort of thing?


-- 
http://www.mohawksoft.com


Re: OK, does anyone have any better ideas?

From
Tom Lane
Date:
mlw <markw@mohawksoft.com> writes:
> I have a working version of a text search engine. I want to make it work
> for Postgres (I will be releasing it GPL). It can literally find the
> occurrence of a string of words within 5 million records in a few
> milliseconds.

Where are the records coming from?  Are they inside the database?
(If not, why do you care about integrating this with Postgres?)

It seems like the right way to integrate this sort of functionality
is to turn it into a kind of index, so that you can do

        SELECT * FROM mytable WHERE keyfield ~~~ 'search string';

where ~~~ is the name of some operator that is associated with the
index.  The temporary-table approach you are taking seems inherently
klugy, and would still be awkward even if we had functions returning
recordsets...
        regards, tom lane


Re: OK, does anyone have any better ideas?

From
mlw
Date:
Tom Lane wrote:
> 
> mlw <markw@mohawksoft.com> writes:
> > I have a working version of a text search engine. I want to make it work
> > for Postgres (I will be releasing it GPL). It can literally find the
> > occurrence of a string of words within 5 million records in a few
> > milliseconds.
> 
> Where are the records coming from?  Are they inside the database?
> (If not, why do you care about integrating this with Postgres?)
> 
> It seems like the right way to integrate this sort of functionality
> is to turn it into a kind of index, so that you can do
> 
>         SELECT * FROM mytable WHERE keyfield ~~~ 'search string';
> 
> where ~~~ is the name of some operator that is associated with the
> index.  The temporary-table approach you are taking seems inherently
> klugy, and would still be awkward even if we had functions returning
> recordsets...

OK, I see the misunderstanding; you are absolutely right, it is VERY
kludgy.

It is sort of like a bitmap index, but it is more of a search engine. I
actually have it working on a commercial website. You run a program
periodically (a cron job, say) that executes a query; the query results
are then parsed, and an index of words, keys, ranks and phrase meta-data
is created. You also specify which fields in the query should be indexed
and which field will be the "key." (It is not ACID, if I understand what
the term means.) The data for the text search need not even be in the
database, as long as the "key" being indexed is.

Then you call search with a string, such as "the long and winding road"
or "software OR hardware AND engineer NOT sales." A few milliseconds
later, a list of key/rank pairs is produced. This is FAR faster than
the '~~~' operator because it never does a full table scan. It is
assumed that the "key" field specified is properly indexed.
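
To make that concrete, here is roughly what a client of the search daemon
does, in C. The function names are the same ones the plpgsql wrapper above
calls; the exact signatures shown are simplified assumptions for
illustration:

/*
 * Simplified sketch of a search daemon client.  The search_* calls are
 * the same ones the plpgsql wrapper uses; the C signatures here are
 * assumptions for illustration only.
 */
#include <stdio.h>

extern int search_exec(const char *host, const char *query); /* returns a handle  */
extern int search_count(int handle);            /* number of hits      */
extern int search_key(int handle, int pos);     /* key field of hit #pos */
extern int search_rank(int handle, int pos);    /* rank of hit #pos    */
extern int search_done(int handle);             /* free the result set */

int main(void)
{
    int handle = search_exec("localhost", "the long and winding road");
    int count = search_count(handle);
    int pos;

    /* Results come back as key/rank pairs, already sorted by rank. */
    for (pos = 0; pos < count; pos++)
        printf("key=%d rank=%d\n",
               search_key(handle, pos), search_rank(handle, pos));

    return search_done(handle);
}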

If I had a way of getting the key/rank result pairs deeper into Postgres,
it would be an amazing platform for some serious high-speed search
applications. Think about a million resumes online, searchable with
an arbitrary text string to get a list of candidates, powered by
Postgres, handling 100 queries a second.

Right now, the way I have it working is PHP makes the search call and
then executes a query with the first result (highest rank) and returns
the data. If I could get the key/rank pair into postgres as a table or
multiple searches into postgres as a set of tables, then you could do
some amazing queries really really fast.

Still, you said that "select foo from bar where key = textsearch('bla
bla',..)" could not be done, and my previous example was the only other
way I have been able to even prototype my idea.


-- 
http://www.mohawksoft.com


Re: OK, does anyone have any better ideas?

From
mlw
Date:
Tom Lane wrote:
> 
> mlw <markw@mohawksoft.com> writes:
> > I have a working version of a text search engine. I want to make it work
> > for Postgres (I will be releasing it GPL). It can literally find the
> > occurrence of a string of words within 5 million records in a few
> > milliseconds.
> 
> Where are the records coming from?  Are they inside the database?
> (If not, why do you care about integrating this with Postgres?)
> 
> It seems like the right way to integrate this sort of functionality
> is to turn it into a kind of index, so that you can do
> 
>         SELECT * FROM mytable WHERE keyfield ~~~ 'search string';
> 
> where ~~~ is the name of some operator that is associated with the
> index.  The temporary-table approach you are taking seems inherently
> klugy, and would still be awkward even if we had functions returning
> recordsets...

Oh! Another method I tried and just could not get working was returning
an array of integers. I was thinking about "select * from table where
key_field in ( textsearch('bla bla') )", but I haven't been able to get
that to work, and as per a previous post and belatedly reading a FAQ,
this would probably still force a full table scan.

Another method I thought about was having a table with some maximum
number of zero-initialized records, and trying something like:

create table temp_table as select * from ts_template limit
textsearch('bla bla', 10);

select filltable(temp_table, 10);

select * from table where key_field = temp_table.key;

As you can see, all of these ideas are heinous hacks; there has to be a
better way. Surely someone has a better idea.

-- 
http://www.mohawksoft.com


Re: OK, does anyone have any better ideas?

From
Tom Lane
Date:
mlw <markw@mohawksoft.com> writes:
> Then you call search with a string, such as "the long and winding road"
> or "software OR hardware AND engineer NOT sales." A few milliseconds
> later, a list of key/rank pairs is produced. This is FAR faster than
> the '~~~' operator because it never does a full table scan.

An index-associated operator doesn't imply a full table scan either.
The whole purpose of an index is to pull out the rows matched by the
WHERE expression without doing a full scan.

The thing that bothers me about the way you're doing it is that the
search result as such doesn't give you access to anything but the keys
themselves.  Typically what you want to do is get the whole record(s)
in which the matching keys are located --- and that's why the notion
of SELECT ... WHERE textfield-matches-search-string looks so attractive.
You get the records immediately, in one step.  Without that, your next
step after the search engine call is to do a join of the search result
table against your data table, and poof there goes much of your speed
gain.  (At best, you can make the join reasonably quick by having an
index on the unique key field ... but that just means another index to
maintain.)

Another advantage of handling it as an index is that you don't have to
rely on a periodic recomputation of the index; you can do on-the-fly
updates each time the table is altered.  (Of course, if your indexing
technology can't handle incremental updates efficiently, that might not
be of any value to you.  But there's nothing in the system design that
precludes making an index type that's only updated by REINDEX.)

I realize this is probably not what you wanted to hear, since building a
new index type is a lot more work than I suppose you were looking for.
But if you want a full-text index that's integrated naturally into
Postgres, that's the path to travel.  The way you're doing it is
swimming against the tide.  Even when the function-returning-recordset
limitation is gone (maybe a version or two away), it's still going to
be an awkward and inefficient way to work.
        regards, tom lane


Re: OK, does anyone have any better ideas?

From
Andrew Snow
Date:
Could you perhaps post the code you have for splitting a text field up into
keys? Then I could work on turning it into a new type of index with a new
operator, as Tom suggested.

(Or is this what the text search code in contrib already does??)


- Andrew



Re: OK, does anyone have any better ideas?

From
mlw
Date:
Andrew Snow wrote:
> 
> Could you perhaps post the code you have for splitting a text field up into
> keys? Then I could work on turning it into a new type of index with a new
> operator, as Tom suggested.
> 
> (Or is this what the text search code in contrib already does??)

Go to a search engine like Lycos, AllTheWeb, or AltaVista. This is the
type of search capability I want to bring to Postgres. I have it working
as a stand-alone daemon; it is fast and produces very relevant results.
I just thought that this sort of functionality would be a big plus if I
could tie it down deep in Postgres.

The big advantages of the code are high relevance, boolean operations, and
very, very high-speed operation. If I could get easy Postgres record
selection out of it, it would rock.

-- 
http://www.mohawksoft.com


Re: OK, does anyone have any better ideas?

From
mlw
Date:
Andrew Snow wrote:
> 
> Could you perhaps post the code you have for splitting a text field up into
> keys? Then I could work on turning it into a new type of index with a new
> operator, as Tom suggested.
> 
> (Or is this what the text search code in contrib already does??)
> 
> - Andrew

OK, I guess I am not getting everything across. Let me give the basics:

There are two programs: sqlindex, and sqlfts.

sqlindex, is the SQL indexer.
sqlfts, is the SQL full text server.

They currently take a config file, which will be replaced by columns in
the database. (This technology is aimed at multiple databases and
non-SQL uses.) The config file currently looks like the example at the
bottom of this post.

The current incarnation of this server sits outside of Postgres and
executes joins based on the results of the query.

The indexing query returns a number of fields; one must be designated as
the "key" field. In websearch lingo, think of it as the "document name."
At index time, I separate the individual fields and create bitmap
files which relate word numbers to document bits.

Words are parsed and a dictionary is created. Phrase meta-data is also
stored along with the document reference (key field) associated with a
document number.
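
Boiled down, the index-time loop is something like the sketch below. It is
simplified: the real code stores the bitmaps in compressed forms and keeps
phrase meta-data, and the linear-scan dictionary is only for illustration.

/*
 * Simplified sketch of the index-time loop: each record gets a document
 * number, and that document's bit is set in the bitmap of every word the
 * record contains.  No overflow checks; real bitmaps are compressed.
 */
#include <stdlib.h>
#include <string.h>

#define MAX_DOCS   2000000              /* bits per word bitmap */
#define MAX_WORDS  60000
#define BYTES_PER_BITMAP (MAX_DOCS / 8)

typedef struct {
    char          *word;
    unsigned char *bits;                /* one bit per document */
} WordBitmap;

static WordBitmap dict[MAX_WORDS];
static int        nwords = 0;

/* Find the bitmap for a word, creating it on first sight. */
static WordBitmap *dict_lookup(const char *word)
{
    int i;

    for (i = 0; i < nwords; i++)
        if (strcmp(dict[i].word, word) == 0)
            return &dict[i];

    dict[nwords].word = strdup(word);
    dict[nwords].bits = calloc(BYTES_PER_BITMAP, 1);
    return &dict[nwords++];
}

/* Set bit 'docnum' in the bitmap of every word in this record. */
static void index_record(int docnum, char *text)
{
    char *word;

    for (word = strtok(text, " \t\n"); word != NULL;
         word = strtok(NULL, " \t\n"))
    {
        WordBitmap *wb = dict_lookup(word);

        wb->bits[docnum / 8] |= (unsigned char) (1 << (docnum % 8));
    }
}

int main(void)
{
    char rec0[] = "pink floyd pigs";
    char rec1[] = "pink panther theme";

    index_record(0, rec0);
    index_record(1, rec1);   /* "pink" now has bits 0 and 1 set */
    return 0;
}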

When a query is executed, each word is picked out of the dictionary. At
various points, phrases are evaluated, the bitmap indexes are ANDed,
ORed, or NOTed together, and rank is applied. The results are then sorted
by rank, and the document numbers are merged in with the document
"references" (key field values) and returned with their ranks.

This technology works quite well as a search engine sort of thing if I
store a URL or file name and a teaser as the document reference. I
thought it would be cool (and easy) if I just stored a SQL key field as
the URL, and connected this stuff to a SQL database. I chose Postgres
because I had used it in a number of projects, and thought since it was
open source I would have fewer problems.

It has not been easy to do what I thought would be a fairly trivial
task. I am starting to get Windows programming flashbacks of the "so
close, but yet so far" feeling one gets when one tries to do
conceptually simple things on Windows.

I'm sorry, I am getting discouraged and beginning to think that this
project is not going to work.


>>>>>>>>>>> configuration file <<<<<<<<<<<<<<<<<<<<<<
# The computer host name used for the database
sqlindex=localhost
sqlfts=localhost
# The name of the database
sqldb=cdinfo
# Base name of the index files.
basename=cdinfo/index
# The key field used to index and find records.
sqlkey=trackid
sqlkeyindex=off
metaphone=1
# A SQL query that produces a single result, which is the count of
# records to be indexed.
sqlcount=select count(trackid) from zsong
# The SQL query used to produce data to be indexed.
sqlretrieve=select  * from songview;
sqlfields = all,performer2,title,song

# A SQL query that will be used to display a list of records found
sqldisplay=select zsong.muzenbr, performer, performer2, title, song,
trackid from ztitles, zsong where zsong.muzenbr = ztitles.muzenbr and
zsong.trackid = %s
# The tcpport is the TCP/IP port for the server
tcpport = 8090
ftsproc = 5
ftsqueue = 32     
<<<<<<<<<<<<

-- 
http://www.mohawksoft.com


Re: OK, does anyone have any better ideas?

From
Oleg Bartunov
Date:
We need a multi-key B-tree-like index for such a problem.
Our full-text search engine is based on arrays, and we need to find quickly
whether some number exists in an array - some kind of index over an int array.
We're currently testing a GiST approach and it seems we will have some
conclusions soon. A multi-key B-tree-like index would be better in my
opinion, but that requires too much work and knowledge of Postgres's internals.
Yesterday I read about UB-trees; they seem good for indexing and querying
sets. Currently Postgres has no set-specific methods.
Regards,
    Oleg
On Fri, 8 Dec 2000, mlw wrote:

> Tom Lane wrote:
> > 
> > mlw <markw@mohawksoft.com> writes:
> > > I have a working version of a text search engine. I want to make it work
> > > for Postgres (I will be releasing it GPL). It can literally find the
> > > occurrence of a string of words within 5 million records in a few
> > > milliseconds.
> > 
> > Where are the records coming from?  Are they inside the database?
> > (If not, why do you care about integrating this with Postgres?)
> > 
> > It seems like the right way to integrate this sort of functionality
> > is to turn it into a kind of index, so that you can do
> > 
> >         SELECT * FROM mytable WHERE keyfield ~~~ 'search string';
> > 
> > where ~~~ is the name of some operator that is associated with the
> > index.  The temporary-table approach you are taking seems inherently
> > klugy, and would still be awkward even if we had functions returning
> > recordsets...
> 
> Oh! Another method I tried and just could not get working was returning
> an array of integers. I was thinking about "select * from table where
> key_field in ( textsearch('bla bla') )", but I haven't been able to get
> that to work, and as per a previous post and belatedly reading a FAQ,
> this would probably still force a full table scan.
> 
> Another method I thought about was having a table with some maximum
> number of zero-initialized records, and trying something like:
> 
> create table temp_table as select * from ts_template limit
> textsearch('bla bla', 10);
> 
> select filltable(temp_table, 10);
> 
> select * from table where key_field = temp_table.key;
> 
> As you can see, all of these ideas are heinous hacks; there has to be a
> better way. Surely someone has a better idea.
> 
> -- 
> http://www.mohawksoft.com
> 

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



...

From
Oleg Bartunov
Date:
Here is a link to the paper
"A New Method to Index and Query Sets" by J. Hoffmann and J. Koehler:
http://www.informatik.uni-freiburg.de/~hoffmann/publications.html
BTW, is there a restriction on the length of keys in GiST in 7.1?
Regards,
    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83




Re: OK, does anyone have any better ideas?

From
mlw
Date:
Oleg Bartunov wrote:
> 
> We need a multi-key B-tree-like index for such a problem.
> Our full-text search engine is based on arrays, and we need to find quickly
> whether some number exists in an array - some kind of index over an int array.
> We're currently testing a GiST approach and it seems we will have some
> conclusions soon. A multi-key B-tree-like index would be better in my
> opinion, but that requires too much work and knowledge of Postgres's internals.
> Yesterday I read about UB-trees; they seem good for indexing and querying
> sets. Currently Postgres has no set-specific methods.

The way I do my search indexing is with bitmap objects and a word
dictionary. One creates a searchable dictionary of words by scanning the
selected records. So, in one query that results in 2 million records,
the total aggregate number of words is about 60,000, depending on how you
parse. For each word, you create a "bitmap object" (in one of a few
forms) where bit '0' represents the first record, bit '1' represents the
second, and so on, until you have 2 million bits.

Set the correct bit in the bitmap for each document that contains that
word. In the end you will have the equivalent of 60,000 bitmaps of 2
million bits each.

During search time, one creates an empty bitmap of 2 million bits as a
work space. One parses the search term and performs boolean operations
on the workspace with the bitmap retrieved for each word.

When you are done parsing, you have a bitmap with a bit set for each
document that fits the search criteria. You then enumerate the bits by
bit position, and you now have a list of document numbers.
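
In C, the boolean step is roughly the following (a simplified sketch:
compression, phrases, and ranking are omitted, and the tiny bitmaps in
main() fake a "pink floyd -box" style query):

/*
 * Simplified sketch of query-time evaluation: boolean-combine the
 * per-word bitmaps into a workspace, then enumerate the set bits to get
 * document numbers.
 */
#include <stdio.h>

#define NDOCS  2000000
#define NBYTES (NDOCS / 8)

static void bitmap_or(unsigned char *work, const unsigned char *w)
{
    int i;
    for (i = 0; i < NBYTES; i++)
        work[i] |= w[i];
}

static void bitmap_and(unsigned char *work, const unsigned char *w)
{
    int i;
    for (i = 0; i < NBYTES; i++)
        work[i] &= w[i];
}

static void bitmap_not(unsigned char *work, const unsigned char *w)
{
    int i;
    for (i = 0; i < NBYTES; i++)
        work[i] &= (unsigned char) ~w[i];
}

/* Each set bit position in the workspace is a matching document number. */
static void emit_matches(const unsigned char *work)
{
    int doc;
    for (doc = 0; doc < NDOCS; doc++)
        if (work[doc / 8] & (1 << (doc % 8)))
            printf("document %d matches\n", doc);
}

int main(void)
{
    static unsigned char work[NBYTES], pink[NBYTES], floyd[NBYTES], box[NBYTES];

    pink[0]  = 0x07;   /* pretend documents 0, 1, 2 contain "pink"  */
    floyd[0] = 0x05;   /* pretend documents 0 and 2 contain "floyd" */
    box[0]   = 0x04;   /* pretend document 2 contains "box"         */

    bitmap_or(work, pink);    /* seed the empty workspace            */
    bitmap_and(work, floyd);  /* "pink floyd"      -> documents 0, 2 */
    bitmap_not(work, box);    /* "pink floyd -box" -> document 0     */
    emit_matches(work);
    return 0;
}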

If only I could get the list of document numbers back into
Postgres... it would be great.
-- 
http://www.mohawksoft.com


Re: OK, does anyone have any better ideas?

From
Oleg Bartunov
Date:
On Sat, 9 Dec 2000, mlw wrote:

> Oleg Bartunov wrote:
> > 
> > We need multi-key B-tree like index for such problem.
> > Our full text search engine is based on arrays and we need to find quickly
> > is some number exists in array - some kind of index over int array.
> > We're currently testing GiST approach and seems will have some conclusions
> > soon. I think multi-key B-tree like index would be better in my
> > opinion, but this requires to much work and knowledge of postgres's internals.
> > Yesterday I read about UBTree, seems like it's good for index and query
> > sets. Currently postgres has no set specific methods.
> 
> The way I do my search indexing is with bitmap objects and a word
> dictionary. One creates a searchable dictionary of words by scanning the
> selected records. So, in one query that results in 2 million records,
> the total aggregate number of words is about 60,000, depending on how you
> parse. For each word, you create a "bitmap object" (in one of a few
> forms) where bit '0' represents the first record, bit '1' represents the
> second, and so on, until you have 2 million bits.
> 
> Set the correct bit in the bitmap for each document that contains that
> word. In the end you will have the equivalent of 60,000 bitmaps of 2
> million bits each.
> 
> During search time, one creates an empty bitmap of 2 million bits as a
> work space. One parses the search term and performs boolean operations
> on the workspace with the bitmap retrieved for each word.
> 
> When you are done parsing, you have a bitmap with a bit set for each
> document that fits the search criteria. You then enumerate the bits by
> bit position, and you now have a list of document numbers.
> 
> If only I could get the list of document numbers back into
> Postgres... it would be great.

Gotcha. It's impossible to return a set from a function, so the only
way is to use Perl to parse your bitmap. We did (in one project) external
search using suffix arrays, which are incredibly fast, and used Postgres
to return results to Perl for processing.

Regards,
Oleg

> -- 
> http://www.mohawksoft.com
> 

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Re: OK, does anyone have any better ideas?

From
mlw
Date:
Oleg Bartunov wrote:
> 
> Gotcha. It's impossible to return a set from a function, so the only
> way is to use Perl to parse your bitmap. We did (in one project) external
> search using suffix arrays, which are incredibly fast, and used Postgres
> to return results to Perl for processing.

Here's a question; I simply do not know enough about the internals
of Postgres to answer it myself. I had a brainstorm last night and
thought of a method.

Create a table: is it possible to call "SPI_exec" in a C function which
does this:

"create temp table fubar as select ts_key(10) as 'key', ts_rank(10) as
'rank' from textsearch_template where ts_placeholder(10) limit
ts_count(10)"

In the above example, which call would be called first? I assume the
count would be called first, but I'm probably wrong. Whichever function
was called first would execute the query. textsearch_template would
be a bogus table with 1000 or so zeros.
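
The C side would look something like this minimal sketch.
SPI_connect/SPI_exec/SPI_finish are the real SPI entry points; the ts_*
functions in the query string are the hypothetical ones above, and
ts_build_results is a made-up name:

/*
 * Minimal sketch of a C function that builds the temp result table via
 * SPI.  Only the SPI calls are real; everything ts_* is hypothetical.
 */
#include "postgres.h"
#include "fmgr.h"
#include "executor/spi.h"

PG_FUNCTION_INFO_V1(ts_build_results);

Datum
ts_build_results(PG_FUNCTION_ARGS)
{
    int ret;

    if (SPI_connect() != SPI_OK_CONNECT)
        elog(ERROR, "ts_build_results: SPI_connect failed");

    /* Run the create-table statement from inside the function. */
    ret = SPI_exec("create temp table fubar as"
                   " select ts_key(10) as key, ts_rank(10) as rank"
                   " from textsearch_template"
                   " where ts_placeholder(10) limit ts_count(10)", 0);

    SPI_finish();

    PG_RETURN_INT32(ret);
}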

So, in a query one does this:

select ts_search('fubar', 'bla bla');

select * from table, fubar where table.field_key = fubar.key;

How about this: Is there a construct in Postgres that represents a row
ID, so a row can be found quickly without using an index? I tried oid
but that didn't seem fast at all.

P.S. If you want to see the system working, I have a test fixture
running at "http://gateway.mohawksoft.com/music.php3". It calls the text
search daemon from PHP, and the text search daemon executes an SQL query
per result (PQexec). Look for a popular song and press "search."

A good example: look for "pink floyd pigs," then try "pink floyd pigs
-box." (It is running slowly because it has debugging code, but it is
still pretty fast.) This index has been metaphoned, so something like
"penk floid" will work too.

The "+" operator is "requires" this is the default. The "-" operator is
"must not have" and the "?" operator is "may have" (the "?" operator is
a big hit because it increases the selection size.)

I think if you try it, you'll see why I want to be able to get it deep
into postgres, and what the possibilities are.

-- 
http://www.mohawksoft.com


Re:

From
Tom Lane
Date:
Oleg Bartunov <oleg@sai.msu.su> writes:
> BTW, is there a restriction on the length of keys in GiST in 7.1?

Index tuples still have to fit in a block --- no TOAST table for
an index (yet).

I think we can compress index datums in-line though; at least that
should work in principle.
        regards, tom lane


Re:

From
Oleg Bartunov
Date:
On Sat, 9 Dec 2000, Tom Lane wrote:

> Oleg Bartunov <oleg@sai.msu.su> writes:
> > BTW, is there a restriction on the length of keys in GiST in 7.1?
> 
> Index tuples still have to fit in a block --- no TOAST table for
> an index (yet).

Not good news. Indexing and querying sets requires much greater key length.
Is there a plan to do this?

> 
> I think we can compress index datums in-line though; at least that
> should work in principle.

We could use lossy compression of GiST indices.
regards,
    Oleg

> 
>             regards, tom lane
> 

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



RE: OK, does anyone have any better ideas?

From
"Edmar Wiggers"
Date:
Personally, I'm not too afraid of having an FTS engine outside the database.
Oracle's interMedia, which has a very powerful/fast FTS engine, uses that
approach.

I could look into how they do the SQL integration; maybe it would yield some
ideas.

Mark, about that row identifier: OIDs are no good for fast lookups. Maybe
you could use the tuple ID (TID). But please note that tuple IDs change
after a simple update. It's like the offset of the row inside the table
file (and therefore blazing fast for getting to that row again).

One possible idea for SQL integration: can one use index access-method
functions to query the FTS outside the database? Yes, it would require some
work, but the results, I guess, would be wonderful. OK, Tom, not as fast as
an index sitting inside Postgres, but AFAICS that's not going to happen
anytime soon.

Yours sincerely,

Edmar Wiggers
BRASMAP Information Systems
+55 48 9960 2752




RE: OK, does anyone have any better ideas?

From
Oleg Bartunov
Date:
On Sat, 9 Dec 2000, Edmar Wiggers wrote:

> One possible idea for SQL integration: can one use index access-method
> functions to query the FTS outside the database? Yes, it would require some
> work, but the results, I guess, would be wonderful. OK, Tom, not as fast as
> an index sitting inside Postgres, but AFAICS that's not going to happen
> anytime soon.

We did external indexing using suffix arrays ( http://sary.namazu.org )
for one project because of limited time :-) But we had to do a lot
of work on things like ACID (well, we already have some technology) and
everything we usually get from SQL.
Now we're trying to implement fast indexing using GiST.
Regards,
    Oleg

> 
> Yours sincerely,
> 
> Edmar Wiggers
> BRASMAP Information Systems
> +55 48 9960 2752
> 
> 

_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Re: OK, does anyone have any better ideas?

From
Tom Lane
Date:
"Edmar Wiggers" <edmar@brasmap.com> writes:
> One possible idea for SQL integration: can one use index access-method
> functions to query the FTS outside the database?

Hm.  In principle an index access method can do whatever it darn
pleases.  In practice, though, I think the main problem with making an
index type for FTS is simply learning *how* to make a new index access
method.  (No one currently associated with the project has ever done it
--- the four existing types all date back to Berkeley, I believe.)
Seems to me that that learning curve is not going to be made any easier
by pushing the guts of the information outside of Postgres ... if
anything, it'd probably be harder because the existing examples of index
access methods would become less relevant.
        regards, tom lane

PS: by the way, do any of the rest of you find that Mark's email address
doesn't work?  Every time I send him something, it sits in my mail queue
for five days and then bounces with Name server: mohawksoft.com.: host
name lookup failure.  The DNS server's syslog entries look like
Dec  9 11:34:56 sss2 named[10258]: ns_resp: query(mohawksoft.com) All possible A RR's lame
Sorry to use up list bandwidth on this, but I have no other way to reach
him --- apparently hub.org's nameserver is less picky than bind
8.2.2.p7?


Re: OK, does anyone have any better ideas?

From
mlw
Date:
Oleg Bartunov wrote:
> 
> On Sat, 9 Dec 2000, Edmar Wiggers wrote:
> 
> > One possible idea for SQL integration: can one use index access-method
> > functions to query the FTS outside the database? Yes, it would require some
> > work, but the results, I guess, would be wonderful. OK, Tom, not as fast as
> > an index sitting inside Postgres, but AFAICS that's not going to happen
> > anytime soon.
> 
> We did external indexing using suffix arrays ( http://sary.namazu.org )
> for one project because of limited time :-) But we had to do a lot
> of work on things like ACID (well, we already have some technology) and
> everything we usually get from SQL.
> Now we're trying to implement fast indexing using GiST.

I think I have the answer, or at least as close as I think I can get in
the near term.

The syntax is as follows:

create temp table search_results as select ts_key(10) as "key",
ts_rank(10) as "rank" from ts_template where ts_search('bla bla bla',
10)>0;

select * from table, search_results where search_results.key = table.field;

It is not ideal, obviously, but it solves a couple of problems and should
not be too big a performance hit. (If ANYONE can come up with a better
way, I'd really, really love to hear it.)

The first call to ts_search(...) does the search and saves the results.
Each successive call simply advances the result number. ts_key() and
ts_rank() work on the current result number and return the respective
value. ts_template is a table of some maximum number of records plus 1,
say 1001. 

When the total number of results has been returned (or the maximum has
been reached), ts_search frees the search results (because they should be
saved in the table) and returns 0, stopping the select call.
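
ts_search itself would be a stateful C function along these lines (a
sketch only: argument handling is simplified, the fmgr boilerplate is
assumed, and search_exec, search_count, and search_done are my daemon's
calls):

/*
 * Sketch of the stateful ts_search: the first call runs the search and
 * saves the result set; each later call advances the result number; when
 * the results run out, it frees them and returns 0 to stop the scan.
 */
#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"

extern int search_exec(const char *host, const char *query);
extern int search_count(int handle);
extern int search_done(int handle);

static int cur_handle = -1;     /* active search handle, -1 = none */
static int cur_pos = 0;         /* current result number           */
static int cur_count = 0;       /* total number of results         */

PG_FUNCTION_INFO_V1(ts_search);

Datum
ts_search(PG_FUNCTION_ARGS)
{
    if (cur_handle < 0)
    {
        /* First call: run the search and remember the result set. */
        text *arg = PG_GETARG_TEXT_P(0);
        char *query = DatumGetCString(DirectFunctionCall1(textout,
                                                          PointerGetDatum(arg)));

        cur_handle = search_exec("localhost", query);
        cur_count = search_count(cur_handle);
        cur_pos = 0;
    }

    if (cur_pos >= cur_count)
    {
        /* Exhausted: free the results and stop the enclosing select. */
        search_done(cur_handle);
        cur_handle = -1;
        PG_RETURN_INT32(0);
    }

    cur_pos++;                    /* advance to the next result       */
    PG_RETURN_INT32(cur_pos);     /* nonzero keeps the select running */
}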

Anyone see any problems with this approach? It is not ideal, but it is
as efficient as anything I can come up with without spending a year or
two creating a new Postgres index type.


-- 
http://www.mohawksoft.com


Re: OK, does anyone have any better ideas?

From
Hannu Krosing
Date:
Andrew Snow wrote:
> 
> Could you perhaps post the code you have for splitting a text field up into
> keys? Then I could work on turning it into a new type of index with a new
> operator, as Tom suggested.
> 
> (Or is this what the text search code in contrib already does??)

I think that splitting the text is the least of the problems.

I've been contemplating integrating my own (currently BSDDB-based)
full-text index with PostgreSQL, and I've come to the conclusion that
the index proper should work on an already-split list of words (so it
could be made to work with any array type easily).

So my suggestion is to start working on an inverted index on a field of
any array type.

Or even better - start digging into the indexing interface of PostgreSQL,
first documenting it (the current docs just state that it is too
complicated for anyone to comprehend ;) and then cleaning it up, thus
making it easier for any kind of new index to be written.

-------------
Hannu