Re: Search for fixed set of keywords - Mailing list pgsql-performance

From Oleg Bartunov
Subject Re: Search for fixed set of keywords
Date
Msg-id Pine.LNX.4.64.0801091351440.26876@sn.sai.msu.ru
Whole thread Raw
In response to Search for fixed set of keywords  (Jörg Kiegeland <kiegeland@ikv.de>)
Responses Re: Search for fixed set of keywords  (Jörg Kiegeland <kiegeland@ikv.de>)
List pgsql-performance
Did you try integer arrays with GIN (inverted index) ?

Oleg
On Wed, 9 Jan 2008, J?rg Kiegeland wrote:

> Hello,
>
> I have an interesting generic search task, for which I have done different
> performance tests and I would like to share and discuss my results on this
> newsgroup.
>
> So I begin to describe the search task:
>
> =========
> You have a set of N unique IDs. Every ID is associated with an integer
> scoring value. Every ID is also associated with up to K different keywords
> (there are totally K different keywords K1 ... Kn). Now find the first Z
> best-scored IDs which are associated with a given set of keywords in one of
> two ways:
>
> (C1) The ID must be associated with all keywords of the given set of
> keywords.
> (C2) The ID must be associated with at least one keyword of the given set of
> keywords.
> =========
>
>
> My tests showed that only a Multiple-Column-approach resulted in a acceptable
> query response time. I also tried out an int-array approach using gist,  a
> sub-string approach, a bit-column approach, and even a sub-string approach
> using Solr.
> Actually, the int-array approach was 20% faster for Z=infinity, but it became
> linear for the test case [Z=1000 and *all* IDs matches the search condition].
> (To be not misunderstood, "acceptable time" means: having a fixed Z, a fixed
> set of keywords K, a fixed query, and an increasing N, results in constant up
> to logarithmic response time; linear or worser-than-linear time is not
> accepted)
>
> In the Multiple-Column-approach, there is one table. The table has a boolean
> column for each keyword. It has also a column for the ID and for the scoring.
> Now, for each keyword column and for the scoring column a separate index is
> created.
> C1 is implemented by an AND-query on the keyword columns, C2 by and OR query,
> and the result is sorted for the scoring column, cutting of after the first Z
> results.
>
> However our requirements for the search task have changed and I not yet
> managed to find a search approach with acceptable response time for following
> variation:
> Namely that one uses C2 and do not sort for a scoring column but use as
> scoring value the number of matched keywords for a given ID.
> The difficulty in this query type is that the scoring is dependent on the
> query itself..
>
> So has anyone an idea how to solve this query type with acceptable response
> time, or can anybody tell/prove, that this is theoretically not possible?
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

pgsql-performance by date:

Previous
From: Jörg Kiegeland
Date:
Subject: Search for fixed set of keywords
Next
From: Adrian Moisey
Date:
Subject: Re: big database performance