Thread: Query advice request

Query advice request

From

Mark Kelly

Date:

09 October 2010, 16:10:06

Hi.

I'm wondering if there is a better way to do this:

SELECT news.id FROM news
WHERE news.headline ~* '(japan|office)'
OR news.body ~* '(japan|office)'
ORDER BY news.posted DESC;

It is for small articles; a few hundred or so rows in the table and not
expecting millions.

Cheers,

Mark

Re: Query advice request

From

Mladen Gogala

Date:

09 October 2010, 18:23:59

Mark, you should be using text indexes. You are looking for the whole
words and the regular expression will not use an index. If you want
speed, at least one of the two columns in your query should be indexed
with a text index.

Mark Kelly wrote:
> Hi.
>
> I'm wondering if there is a better way to do this:
>
> SELECT news.id FROM news
> WHERE news.headline ~* '(japan|office)'
> OR news.body ~* '(japan|office)'
> ORDER BY news.posted DESC;
>
> It is for small articles; a few hundred or so rows in the table and not
> expecting millions.
>
> Cheers,
>
> Mark
>
>

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com

Re: Query advice request

From

Mladen Gogala

Date:

09 October 2010, 23:07:50

Francisco Leovey wrote:
> Well, it is a good enough implementation for us, a very large legal
> document database.
> We are forced by government policy to only use free software.
> Full phrase search can be simulated easily with or's to extract the
> documents.
> And we have proximity search done by the application.
>
I opted for Sphinx which is also free software. It has connector for
PostgreSQL and works perfectly. It can index 55 million documents
without a problem:

Sphinx 1.10-id64-beta (r2420)
Copyright (c) 2001-2010, Andrew Aksyonoff
Copyright (c) 2008-2010, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/usr/local/etc/sphinx.conf'...
indexing index 'mover'...
collected 55616085 docs, 101706.4 MB
sorted 15298.1 Mhits, 100.0% done
total 55616085 docs, 101706364580 bytes
total 18132.419 sec, 5609089 bytes/sec, 3067.21 docs/sec
total 6705 reads, 647.896 sec, 6460.7 kb/call avg, 96.6 msec/call avg
total 23240 writes, 307.031 sec, 4144.8 kb/call avg, 13.2 msec/call avg
[Sat Oct  9 05:03:20.204 2010] [32745] using config file
'/usr/local/etc/sphinx.conf'...
[Sat Oct  9 05:03:20.230 2010] [32745] listening on all interfaces,
port=9312
Sphinx 1.10-id64-beta (r2420)

This index contains 55 million documents and takes 100GB of space. It's
not integrated with Postgres the way Tsearch2 is, but it has PHP and
Perl API's and all the capabilities I need. The biggest problem was not
with the index but with the database and optimizing the SQL accessing a
huge partitioned table. To that end, I even upgraded to Postgres 9.0,
which can resolve conditions like 'IS NOT NULL'  by using indexes. The
3rd software I mentioned is Lucene, produced by the Apache foundation,
with the price equal to the price of Sphinx and Lucene. Here is my table:

psql (9.0.1)
Type "help" for help.

news=# set search_path=moreover;
SET
news=# select count(*) from moreover_documents;
  count
----------
 55892429
(1 row)

There is a text column, which is indexed, and I need all the usual
search options: phrases, proximity, quorum and alike.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com

Re: Query advice request

From

Mark Kelly

Date:

10 October 2010, 21:48:29

Hi Mladen

Thanks for taking the time to reply.

On Saturday 09 Oct 2010 at 19:23 Mladen Gogala wrote:

> Mark, you should be using text indexes. You are looking for the whole
> words and the regular expression will not use an index. If you want
> speed, at least one of the two columns in your query should be indexed
> with a text index.

That's probably a bit more high-level than I was asking for TBH - I was
curious as to whether there was a better way to write the query, and I'm
assuming from the lack of comments on my idiocy that it's at least okay :)

The headline column in the query has an index (it is limited to 200 characters
to fit the web layout), and I shall investigate using the stuff in your reply
to Francisco (though I never saw his post for some reason) regarding fulltext.

Thanks,

Mark