Re: Postgres + Xapian (was Re: fulltext searching via a custom index type) - Mailing list pgsql-hackers

From: Eric Ridge
Subject: Re: Postgres + Xapian (was Re: fulltext searching via a custom index type)
Date:
Msg-id: B021131D-4230-11D8-ADB4-000A95BB5944@tcdi.com
In response to: Re: Postgres + Xapian (was Re: fulltext searching via a custom index type) (Eric Ridge <ebr@tcdi.com>)
Responses: Re: Postgres + Xapian (was Re: fulltext searching via a custom index type)
List: pgsql-hackers
Thanks to everyone who provided input about this idea.  After taking
everything into consideration and talking with Olly Betts from the
Xapian project, what I'm going to do is implement a btree that supports
multiple roots and a "tag" value of arbitrary length for each item in
the tree, then implement a new Xapian backend that uses it.  In the
end, it'll be much more efficient and much less dirty than this silly
filesystem thing I have now.  And hopefully, concurrency and WAL will
be easier to deal with too.  Having the existing nbtree AM as a guide
will be very useful.
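
To make the multi-root part concrete, here's roughly what I'm picturing
for the metapage (purely illustrative -- the struct and field names are
made up, and this isn't a worked-out on-disk format):

    #include "postgres.h"
    #include "storage/block.h"

    /* Hypothetical metapage for a multi-root btree: one root block
     * per logical tree, where nbtree keeps a single root pointer. */
    typedef struct XapMetaPageData
    {
        uint32      xm_magic;     /* identifies the page as ours */
        uint32      xm_numRoots;  /* number of logical trees */
        BlockNumber xm_roots[1];  /* xm_numRoots root page numbers */
    } XapMetaPageData;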

I have one issue that could potentially make all of this completely 
useless, and it's related to Postgres deciding to use a Filter instead 
of an Index Scan.

I asked this question last week, and Tom Lane responded with a 
solution, but I don't think I explained myself very well.  And now that 
I've got a bit more experience with some of the PG internals, maybe I 
can ask the question more intelligently.

Given this query:

    select id from table where document_text ==> 'food' and title ==> 'water';

Postgres generates this plan:

    Index Scan using idxtitle on table  (cost=0.00..4.01 rows=1 width=8)
      Index Cond: (title ==> 'water'::text)
      Filter: (document_text ==> 'food'::text)

The problem is, the "document_text" column is *really* big.  Average 
value length is 171k.

With this query plan, my operator procedure is forced to re-parse the
document_text value of every row returned by the index scan on the
title column, and to do a bunch of Xapian tricks for each one.  The
overhead is huge.
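
For anyone who hasn't followed the earlier thread, the operator
procedure looks roughly like this (a sketch; the function name and the
xapian_match helper are hypothetical stand-ins for what I actually have):

    #include "postgres.h"
    #include "fmgr.h"

    /* hypothetical helper: tokenizes the field and runs the Xapian query */
    extern bool xapian_match(text *field, text *query);

    PG_FUNCTION_INFO_V1(xapian_contains);

    /* Rough shape of my "==>" operator procedure.  When the planner
     * uses it as a Filter, this runs once per row returned by the
     * other index scan -- detoasting and re-parsing an average of
     * 171k of document_text every single time. */
    Datum
    xapian_contains(PG_FUNCTION_ARGS)
    {
        text   *field = PG_GETARG_TEXT_P(0);  /* detoast: already costly */
        text   *query = PG_GETARG_TEXT_P(1);

        PG_RETURN_BOOL(xapian_match(field, query));
    }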

The composite index solution doesn't seem ideal here because there's 
absolutely no way I can anticipate every combination of fields a user 
might choose to search.

Forgetting about composite indexes for a moment, is Postgres even
capable of doing two index scans in this situation?  Does it know how
to take the intersection of two scans?
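
For concreteness, the plan shape I'm hoping for is something like this
(entirely made-up output -- I'm not claiming the planner can produce it):

    Intersection
      -> Index Scan using idxtitle on table
           Index Cond: (title ==> 'water'::text)
      -> Index Scan using idxdocument_text on table
           Index Cond: (document_text ==> 'food'::text)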

My AM's cost estimate function literally sets the selectivity, 
correlation, total cost, and startup cost values to zero (and I've 
tried tons of combinations of really big, really small, and really 
negative values).  My thought behind setting them all to zero was that 
if the cost function basically says, "There is no cost", I could fool 
Postgres into wanting to use the index everywhere it can.  Sadly, this 
doesn't work out.
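
Concretely, my estimator currently boils down to this (a sketch; the
function name is mine, and the argument layout follows the index AM
API's amcostestimate convention -- four input arguments, then the four
output parameters):

    #include "postgres.h"
    #include "fmgr.h"
    #include "optimizer/cost.h"

    PG_FUNCTION_INFO_V1(xapian_costestimate);

    Datum
    xapian_costestimate(PG_FUNCTION_ARGS)
    {
        /* args 0..3 (root, rel, index, indexQuals) are ignored here */
        Cost        *indexStartupCost = (Cost *) PG_GETARG_POINTER(4);
        Cost        *indexTotalCost   = (Cost *) PG_GETARG_POINTER(5);
        Selectivity *indexSelectivity = (Selectivity *) PG_GETARG_POINTER(6);
        double      *indexCorrelation = (double *) PG_GETARG_POINTER(7);

        /* claim the scan costs nothing, hoping the planner will
         * always pick the index -- which, as described, it doesn't */
        *indexStartupCost = 0.0;
        *indexTotalCost   = 0.0;
        *indexSelectivity = 0.0;
        *indexCorrelation = 0.0;

        PG_RETURN_VOID();
    }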

Now, I realize I can fix my cost estimate function to return better 
costs for the title and document_text fields so that PG will instead 
decide to index scan on document_text, but what I really want to do is 
make PG do an index scan for each field.

Can PG even do this?

I'd appreciate any thoughts on this.  Hopefully, I'm just missing 
something really obvious.

thanks!

eric


