Thread: Search for fixed set of keywords

Search for fixed set of keywords

From

Jörg Kiegeland

Date:

09 January 2008, 09:43:54

Hello,

I have an interesting generic search task, for which I have done
different performance tests and I would like to share and discuss my
results on this newsgroup.

So I begin to describe the search task:

=========
You have a set of N unique IDs. Every ID is associated with an integer
scoring value. Every ID is also associated with up to K different
keywords (there are totally K different keywords K1 ... Kn). Now find
the first Z best-scored IDs which are associated with a given set of
keywords in one of two ways:

(C1) The ID must be associated with all keywords of the given set of
keywords.
(C2) The ID must be associated with at least one keyword of the given
set of keywords.
=========


My tests showed that only a Multiple-Column-approach resulted in a
acceptable query response time. I also tried out an int-array approach
using gist,  a sub-string approach, a bit-column approach, and even a
sub-string approach using Solr.
Actually, the int-array approach was 20% faster for Z=infinity, but it
became linear for the test case [Z=1000 and *all* IDs matches the search
condition].
(To be not misunderstood, "acceptable time" means: having a fixed Z, a
fixed set of keywords K, a fixed query, and an increasing N, results in
constant up to logarithmic response time; linear or worser-than-linear
time is not accepted)

In the Multiple-Column-approach, there is one table. The table has a
boolean column for each keyword. It has also a column for the ID and for
the scoring. Now, for each keyword column and for the scoring column a
separate index is created.
C1 is implemented by an AND-query on the keyword columns, C2 by and OR
query, and the result is sorted for the scoring column, cutting of after
the first Z results.

However our requirements for the search task have changed and I not yet
managed to find a search approach with acceptable response time for
following variation:
Namely that one uses C2 and do not sort for a scoring column but use as
scoring value the number of matched keywords for a given ID.
The difficulty in this query type is that the scoring is dependent on
the query itself..

So has anyone an idea how to solve this query type with acceptable
response time, or can anybody tell/prove, that this is theoretically not
possible?

Re: Search for fixed set of keywords

From

Oleg Bartunov

Date:

09 January 2008, 09:59:39

Did you try integer arrays with GIN (inverted index) ?

Oleg
On Wed, 9 Jan 2008, J?rg Kiegeland wrote:

> Hello,
>
> I have an interesting generic search task, for which I have done different
> performance tests and I would like to share and discuss my results on this
> newsgroup.
>
> So I begin to describe the search task:
>
> =========
> You have a set of N unique IDs. Every ID is associated with an integer
> scoring value. Every ID is also associated with up to K different keywords
> (there are totally K different keywords K1 ... Kn). Now find the first Z
> best-scored IDs which are associated with a given set of keywords in one of
> two ways:
>
> (C1) The ID must be associated with all keywords of the given set of
> keywords.
> (C2) The ID must be associated with at least one keyword of the given set of
> keywords.
> =========
>
>
> My tests showed that only a Multiple-Column-approach resulted in a acceptable
> query response time. I also tried out an int-array approach using gist,  a
> sub-string approach, a bit-column approach, and even a sub-string approach
> using Solr.
> Actually, the int-array approach was 20% faster for Z=infinity, but it became
> linear for the test case [Z=1000 and *all* IDs matches the search condition].
> (To be not misunderstood, "acceptable time" means: having a fixed Z, a fixed
> set of keywords K, a fixed query, and an increasing N, results in constant up
> to logarithmic response time; linear or worser-than-linear time is not
> accepted)
>
> In the Multiple-Column-approach, there is one table. The table has a boolean
> column for each keyword. It has also a column for the ID and for the scoring.
> Now, for each keyword column and for the scoring column a separate index is
> created.
> C1 is implemented by an AND-query on the keyword columns, C2 by and OR query,
> and the result is sorted for the scoring column, cutting of after the first Z
> results.
>
> However our requirements for the search task have changed and I not yet
> managed to find a search approach with acceptable response time for following
> variation:
> Namely that one uses C2 and do not sort for a scoring column but use as
> scoring value the number of matched keywords for a given ID.
> The difficulty in this query type is that the scoring is dependent on the
> query itself..
>
> So has anyone an idea how to solve this query type with acceptable response
> time, or can anybody tell/prove, that this is theoretically not possible?
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Search for fixed set of keywords

From

Jörg Kiegeland

Date:

10 January 2008, 08:24:01

> Did you try integer arrays with GIN (inverted index) ?
I now tried this, and GIN turned out to be linear time, compared with
GIST which was acceptable time. However I tested this only for
Z=infinity, for Z=1000, GIST/GIN are both not acceptable.

Re: Search for fixed set of keywords

From

Oleg Bartunov

Date:

10 January 2008, 11:06:13

On Thu, 10 Jan 2008, J?rg Kiegeland wrote:

>> Did you try integer arrays with GIN (inverted index) ?
> I now tried this, and GIN turned out to be linear time, compared with GIST
> which was acceptable time. However I tested this only for Z=infinity, for
> Z=1000, GIST/GIN are both not acceptable.

Sorry, I didn't follow your problem, but GIN should be certainly
logarithmic on the number of unique words. Also, it'd be much clear
if you show us your queries and explain analyze.


     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83