Re: Postgres + Xapian (was Re: fulltext searching via a custom index type) - Mailing list pgsql-hackers
From: Eric Ridge
Subject: Re: Postgres + Xapian (was Re: fulltext searching via a custom index type)
Date:
Msg-id: B021131D-4230-11D8-ADB4-000A95BB5944@tcdi.com
In response to: Re: Postgres + Xapian (was Re: fulltext searching via a custom index type) (Eric Ridge <ebr@tcdi.com>)
Responses: Re: Postgres + Xapian (was Re: fulltext searching via a custom index type)
List: pgsql-hackers
Thanks to everyone who provided input on this idea. After taking everything into consideration and talking with Olly Betts from the Xapian project, what I'm going to do is implement a btree that supports multiple roots and a "tag" value of arbitrary length for each item in the tree, and then implement a new Xapian backend that uses it. (A rough picture of that layout is sketched at the end of this message.) In the end, it'll be much more efficient and much less dirty than the silly filesystem thing I have now. Hopefully, concurrency and WAL will be easier to deal with too. Having the existing nbtree AM as a guide will be very useful.

I have one issue that could potentially make all of this completely useless, and it's related to Postgres deciding to use a Filter instead of an Index Scan. I asked this question last week, and Tom Lane responded with a solution, but I don't think I explained myself very well. Now that I've got a bit more experience with some of the PG internals, maybe I can ask the question more intelligently.

Given this query:

    select id from table where document_text ==> 'food' and title ==> 'water';

Postgres generates this plan:

    Index Scan using idxtitle on table  (cost=0.00..4.01 rows=1 width=8)
      Index Cond: (title ==> 'water'::text)
      Filter: (document_text ==> 'food'::text)

The problem is that the "document_text" column is *really* big: the average value is 171k. With this plan, my operator procedure is forced to re-parse the document_text value of each row returned by the index scan against the title column, and to do a bunch of Xapian tricks for each row (a sketch of such a procedure appears at the end of this message). The overhead is huge.

The composite index solution doesn't seem ideal here, because there's absolutely no way I can anticipate every combination of fields a user might choose to search. Forgetting about composite indexes for a moment: is Postgres even capable of doing two index scans in this situation? Does it know how to take the intersection of two scans?

My AM's cost estimate function literally sets the selectivity, correlation, total cost, and startup cost values to zero (also sketched at the end of this message), and I've tried tons of combinations of really big, really small, and really negative values. My thought behind setting them all to zero was that if the cost function basically says "there is no cost", I could fool Postgres into wanting to use the index everywhere it can. Sadly, it doesn't work out that way.

Now, I realize I can fix my cost estimate function to return better costs for the title and document_text fields so that PG will decide to index scan on document_text instead, but what I really want is for PG to do an index scan for each field. Can PG even do this?

I'd appreciate any thoughts on this. Hopefully, I'm just missing something really obvious.

thanks!

eric
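For concreteness, here is a minimal sketch of the kind of on-disk layout the first paragraph describes: a metapage that tracks several root pages instead of one, and leaf items that carry a variable-length tag alongside the key. Every name and field below is a hypothetical illustration, not taken from nbtree or from any actual patch.

    #include <stdint.h>

    typedef uint32_t BlockNo;        /* stand-in for Postgres's BlockNumber */

    /* Metapage: one entry per logical tree sharing the same physical file.
     * A conventional single-root btree would have nroots == 1. */
    typedef struct MultiRootMeta
    {
        uint32_t magic;              /* identifies the AM's file format */
        uint32_t nroots;             /* number of logical trees */
        BlockNo  roots[1];           /* root page of each tree; grows */
    } MultiRootMeta;

    /* A leaf item: the key bytes are followed inline by the tag bytes,
     * so the tag can be arbitrarily long without a fixed-size field. */
    typedef struct TaggedItem
    {
        uint16_t keylen;             /* size of the key that follows */
        uint32_t taglen;             /* size of the tag after the key */
        /* char key[keylen]; char tag[taglen]; stored inline */
    } TaggedItem;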
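As for why the Filter node hurts: the procedure behind the "==>" operator sees the full column value for every row the filter examines. The function below is a hypothetical stand-in written against the standard version-1 fmgr calling convention; the function name, the helper, and the parsing work are all assumptions, but the detoast-and-reparse-per-row pattern is the point.

    #include "postgres.h"
    #include "fmgr.h"

    PG_FUNCTION_INFO_V1(document_matches);

    /* Stand-in for the real Xapian work: parse the document and probe
     * the query against it.  Purely illustrative. */
    static bool
    xapian_probe(text *document, text *query)
    {
        (void) document;
        (void) query;
        return false;
    }

    Datum
    document_matches(PG_FUNCTION_ARGS)
    {
        /* When this runs as a Filter, the (on average 171k) document_text
         * value is detoasted and re-parsed here once per candidate row. */
        text *document = PG_GETARG_TEXT_P(0);
        text *query    = PG_GETARG_TEXT_P(1);

        PG_RETURN_BOOL(xapian_probe(document, query));
    }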
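Finally, the "everything costs zero" estimator described above would look roughly like this. The parameter list mirrors the amcostestimate interface as documented for Postgres of this era, but the sketch is self-contained: the planner struct pointers are collapsed to void *, and the function name is made up. Worth noting in passing: a selectivity of zero claims the scan returns essentially no rows, which makes that one path look cheap; nothing in this interface asks the planner to combine two index scans.

    /* Self-contained sketch of a cost estimator that reports zero for
     * everything, as described in the message above. */
    typedef double Cost;
    typedef double Selectivity;

    static void
    my_am_costestimate(void *root, void *rel, void *index, void *indexQuals,
                       Cost *indexStartupCost, Cost *indexTotalCost,
                       Selectivity *indexSelectivity,
                       double *indexCorrelation)
    {
        (void) root; (void) rel; (void) index; (void) indexQuals;

        /* "There is no cost": claim the scan is free to start and run... */
        *indexStartupCost = 0;
        *indexTotalCost = 0;

        /* ...and that it is perfectly selective and uncorrelated.  Zero
         * selectivity means "returns ~no rows"; it does not make the
         * planner use more than one index. */
        *indexSelectivity = 0;
        *indexCorrelation = 0;
    }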