Postgres + Xapian (was Re: fulltext searching via a custom index type ) - Mailing list pgsql-hackers
From | Eric B.Ridge |
---|---|
Subject | Postgres + Xapian (was Re: fulltext searching via a custom index type ) |
Date | |
Msg-id | CE6BCC23-3CDA-11D8-BF11-000A95D98B3E@tcdi.com Whole thread Raw |
In response to | Re: fulltext searching via a custom index type (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Postgres + Xapian (was Re: fulltext searching via a custom index type )
|
List | pgsql-hackers |
On Dec 26, 2003, at 4:04 PM, Tom Lane wrote: > Eric Ridge <ebr@tcdi.com> writes: >> Xapian has it's own storage subsystem, and that's what I'm using to >> store the index... not using anything internal to postgres (although >> this could change). > > I would say you have absolutely zero chance of making it work that way. I still think this is one of the best quotes I've heard in awhile. :) > It might be worth pointing out here than an index AM is not bound to > use > exactly the typical Postgres page layout. Thanks again for this little bit of info. It was just enough to get me thinking about how to make "it work". Xapian is basically a big btree index, except it's 5 btree indexes. One for terms, one for posts (terms with positions), one for positions, one for arbitrary document values, and one for the documents themselves. Each index is made up of 3 physical files on disk. All told there's 17 files for a single Xapian index (15 db files, a versioninfo file, and a lock file). I couldn't think of a way to create a whole new database type for Xapian that could deal with managing 5 btree indexes inside of Postgres (other than using tables w/ standard postgres btree index on certain fields), so instead, I dug into Xapian and abstracted out it's filesystem i/o (open, read, write, etc). (as an aside, I did spend some time pondering ways to adapt Postgres' nbtree AM to handle this, but I just don't understand how it works) Once I had Xapian's filesystem i/o encapsulated into a nice little C++ class, I embarked on creating a mini "filesystem" ontop of Postgres' storage subsystem. In essence, I've now got a Postgres access method that mirrors the basics of a filesystem, from creating/open files to reading from and writing to them, in addition to truncation and deletion. After that, it was just a matter of the glue code to teach Xapian to use this "filesystem" for all its filesystem i/o, and voila!, Xapian works ontop of Postgres' storage subsystem and I didn't have to rewrite Xapian from scratch. And surprisingly, despite the additional overhead of this filesystem abstraction layer, it's still very fast... esp. once Buffers get cached. I've still got more work to do (like dealing with locking and general concurrency issues, not to mention bugs I haven't found yet), but it's working *really* well in a single-user environment. So here's the important question: How stupid is this? I've done some benchmarking against tsearch2. Attached are the queries and execution times on my dual 2gig G5 w/ 2gig ram. The table contains 51,160 records. It's every text file contained on my computer (which includes multiple copies of all my java projects). All told, it's 337,343,569 bytes of data, with an average file size of 6,594 bytes. The Xapian operator is "=>", and tsearch2's operator is "@@". I ran each query 6 times, and just took the best execution time. It's also worth noting that my document parser is much different than tsearch2's. I'm splitting words on non-alphanumerics (and currently am not using stopwords), and it seems that tsearch2 tries to do something more intelligent, so the # of results returned vary widely between tsearch2 and Xapian. I'm not offering an opinion on which way is "better". I've got a few more questions about transactions, locking, and a few other things, but I just thought I'd throw this out as a status report and to see if there's any kind of reaction. thanks for your time. eric
Attachment
pgsql-hackers by date: