Thread: weighting (the results of) a query ?

weighting (the results of) a query ?

From
"Peter Galbavy"
Date:
I have a table with a primary key ('md5') and a bunch of text fields.
There is one row per 'photograph' and the number of rows is about 1100
now but will rise to over 20,000 in a few months - assuming I get time
to import all my stuff.

I would like to offer users on my web site a free text search on these
text fields, but I would like to weight the results base on which field
the text came from.

Let's say those fields are (for simplicity) 'category', 'subcategory',
'caption' and 'keywords'.

I want to do:
   SELECT md5, weighting() FROM images WHERE       category ~* 'term' OR subcategory ~* 'term' OR ...

Is there anything I can do - including writing functions - to return a
number that is somehow representative of which WHERE clause matched
'first' and even better the more columns matched ?

I am guessing that like 'C' an 'OR' conditional stops at the first match
and does not process further conditions after a previous one has
matched - that's good enough for me for day one...

It is not critial that I get a value out, the return order of results
could be fine too.

I would like to minimise the number of queries to the DB, but I can fall
back on doing one query per column and combining the results in perl.
This is my approach for an initial implementation later today unless
anyone can suggest otherwise...

Any pointers, tips, code, suggestions greatly appreciated.

Happy New Year all, BTW
--
Peter



Re: weighting (the results of) a query ?

From
"Peter Galbavy"
Date:
Thanks to Len Morgan for the hints to get to this:

SELECT *, '4' as result  FROM images
WHERE       iptc_release_date < 'today' AND       iptc_priority BETWEEN 1 AND 5 AND iptc_caption ~* 'term'
UNION
SELECT *, '3' as result FROM images
WHERE       iptc_release_date < 'today' AND       iptc_priority BETWEEN 1 AND 5 AND iptc_keywords ~* 'term'
UNION
SELECT *, '2' as result FROM images
WHERE       iptc_release_date < 'today' AND       iptc_priority BETWEEN 1 AND 5 AND iptc_category ~* 'term'
UNION
SELECT *, '1' as result FROM images
WHERE       iptc_release_date < 'today' AND       iptc_priority BETWEEN 1 AND 5 AND iptc_subcategory ~* 'term'

ORDER BY result desc, iptc_priority asc, shot_time, image_serial asc;

Using a constant and UNION made it work OK. Not sure yet on real world
performance, but that's what tuning is for :)

Hope someone finds this in the archive and finds it useful.

Peter

----- Original Message -----
From: "Peter Galbavy" <peter.galbavy@knowtion.net>
To: <pgsql-sql@postgresql.org>
Sent: Sunday, January 05, 2003 10:41 AM
Subject: [SQL] weighting (the results of) a query ?


> I have a table with a primary key ('md5') and a bunch of text fields.
> There is one row per 'photograph' and the number of rows is about 1100
> now but will rise to over 20,000 in a few months - assuming I get time
> to import all my stuff.
>
> I would like to offer users on my web site a free text search on these
> text fields, but I would like to weight the results base on which field
> the text came from.
>
> Let's say those fields are (for simplicity) 'category', 'subcategory',
> 'caption' and 'keywords'.
>
> I want to do:
>
>     SELECT md5, weighting() FROM images WHERE
>         category ~* 'term' OR subcategory ~* 'term' OR ...
>
> Is there anything I can do - including writing functions - to return a
> number that is somehow representative of which WHERE clause matched
> 'first' and even better the more columns matched ?
>
> I am guessing that like 'C' an 'OR' conditional stops at the first match
> and does not process further conditions after a previous one has
> matched - that's good enough for me for day one...
>
> It is not critial that I get a value out, the return order of results
> could be fine too.
>
> I would like to minimise the number of queries to the DB, but I can fall
> back on doing one query per column and combining the results in perl.
> This is my approach for an initial implementation later today unless
> anyone can suggest otherwise...
>
> Any pointers, tips, code, suggestions greatly appreciated.
>
> Happy New Year all, BTW
> --
> Peter
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>



Re: weighting (the results of) a query ?

From
Tom Lane
Date:
"Peter Galbavy" <peter.galbavy@knowtion.net> writes:
> Thanks to Len Morgan for the hints to get to this:
> SELECT *, '4' as result  FROM images
> WHERE
>         iptc_release_date < 'today' AND
>         iptc_priority BETWEEN 1 AND 5 AND iptc_caption ~* 'term'
> UNION
> SELECT *, '3' as result FROM images
> WHERE
>         iptc_release_date < 'today' AND
>         iptc_priority BETWEEN 1 AND 5 AND iptc_keywords ~* 'term'
> UNION
> SELECT *, '2' as result FROM images
> WHERE
>         iptc_release_date < 'today' AND
>         iptc_priority BETWEEN 1 AND 5 AND iptc_category ~* 'term'
> UNION
> SELECT *, '1' as result FROM images
> WHERE
>         iptc_release_date < 'today' AND
>         iptc_priority BETWEEN 1 AND 5 AND iptc_subcategory ~* 'term'

> ORDER BY result desc, iptc_priority asc, shot_time, image_serial asc;

> Using a constant and UNION made it work OK. Not sure yet on real world
> performance, but that's what tuning is for :)

Most likely you should write UNION ALL, not UNION.  As given, the query
will go through a pass of attempted duplicate-row-elimination, which is
almost certainly not what you want.
        regards, tom lane


Re: weighting (the results of) a query ?

From
"Peter Galbavy"
Date:
> Most likely you should write UNION ALL, not UNION.  As given, the query
> will go through a pass of attempted duplicate-row-elimination, which is
> almost certainly not what you want.

Not sure - what I want is only one row per real row but ordered as per the
constants. When you say duplicate-row-elimination do you mean including the
'constant' weighting, in which case UNION ALL is probably right. At the
moment, the dataset returned appears correctly de-duplicated.

What I mean is when an image row has both 'eat static' in the caption and as
a keyword, I want it returned only once...

Peter



Re: weighting (the results of) a query ?

From
Tom Lane
Date:
"Peter Galbavy" <peter.galbavy@knowtion.net> writes:
> Not sure - what I want is only one row per real row but ordered as per the
> constants. When you say duplicate-row-elimination do you mean including the
> 'constant' weighting,

Yes, UNION would consider all columns in deciding if two rows are dups.

> What I mean is when an image row has both 'eat static' in the caption and as
> a keyword, I want it returned only once...

I think your query might fail on that requirement regardless, no?  At
least I missed how you'd prevent it.
        regards, tom lane


Re: weighting (the results of) a query ?

From
"Peter Galbavy"
Date:
> I think your query might fail on that requirement regardless, no?  At
> least I missed how you'd prevent it.

I have had about 10 minutes to play with this - my day jobrequires I do real
testing when I get home later tonight :)

Thanks, and I will keep an eye out for this and figure a way around it.

Petr