Re: Query advice request - Mailing list pgsql-novice

From Mladen Gogala
Subject Re: Query advice request
Date
Msg-id 4CB0F5BC.5080308@vmsinfo.com
Whole thread Raw
In response to Query advice request  (Mark Kelly <pgsql@wastedtimes.net>)
List pgsql-novice
Francisco Leovey wrote:
> Well, it is a good enough implementation for us, a very large legal
> document database.
> We are forced by government policy to only use free software.
> Full phrase search can be simulated easily with or's to extract the
> documents.
> And we have proximity search done by the application.
>
I opted for Sphinx which is also free software. It has connector for
PostgreSQL and works perfectly. It can index 55 million documents
without a problem:

Sphinx 1.10-id64-beta (r2420)
Copyright (c) 2001-2010, Andrew Aksyonoff
Copyright (c) 2008-2010, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/usr/local/etc/sphinx.conf'...
indexing index 'mover'...
collected 55616085 docs, 101706.4 MB
sorted 15298.1 Mhits, 100.0% done
total 55616085 docs, 101706364580 bytes
total 18132.419 sec, 5609089 bytes/sec, 3067.21 docs/sec
total 6705 reads, 647.896 sec, 6460.7 kb/call avg, 96.6 msec/call avg
total 23240 writes, 307.031 sec, 4144.8 kb/call avg, 13.2 msec/call avg
[Sat Oct  9 05:03:20.204 2010] [32745] using config file
'/usr/local/etc/sphinx.conf'...
[Sat Oct  9 05:03:20.230 2010] [32745] listening on all interfaces,
port=9312
Sphinx 1.10-id64-beta (r2420)

This index contains 55 million documents and takes 100GB of space. It's
not integrated with Postgres the way Tsearch2 is, but it has PHP and
Perl API's and all the capabilities I need. The biggest problem was not
with the index but with the database and optimizing the SQL accessing a
huge partitioned table. To that end, I even upgraded to Postgres 9.0,
which can resolve conditions like 'IS NOT NULL'  by using indexes. The
3rd software I mentioned is Lucene, produced by the Apache foundation,
with the price equal to the price of Sphinx and Lucene. Here is my table:

psql (9.0.1)
Type "help" for help.

news=# set search_path=moreover;
SET
news=# select count(*) from moreover_documents;
  count
----------
 55892429
(1 row)

There is a text column, which is indexed, and I need all the usual
search options: phrases, proximity, quorum and alike.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com


pgsql-novice by date:

Previous
From: Mladen Gogala
Date:
Subject: Re: Query advice request
Next
From: e-letter
Date:
Subject: Re: permissions failure to copy csv data