Thread: FTI, paged, ranked searching and efficiency.

FTI, paged, ranked searching and efficiency.

From
Paul
Date:
Hello,

This is going to be a bit long, I hope some of you will take the
trouble to read it :)

I am building a search engine for a section of a (PHP based) website.
I wish the user to be able to a number of words in the search, and the
search results to be ranked by the number of times words occur (both
different words and the same word occuring multiple times are good).

My (simplified) table structure is this:
======        Table "entry_fti"Attribute |    Type     | Modifier
-----------+-------------+----------string    | varchar(25) |id        | oid         |
Index: entry_fti_string_idx
                                  Table "entry"      Attribute       | Type  |                    Modifier
-----------------------+-------+----------------------------------------------
-entry_id              |integer| not null default 
nextval('entry_id_seq'::text)entry_name            |text   |entry_description_html|text   |entry_image_id
|integer|not null default 0entry_tn_image_id     |integer| not null default 0entry_live            |boolean| not null
default't'
 
Index: entry_pkey
                              Table "image"Attribute  |    Type     |                    Modifier
------------+-------------+------------------------------------------------image_id   | integer     | not null default
nextval('image_id_seq'::text)image_name| varchar(32) |height     | integer     | not nullwidth      | integer     | not
null
Indices:  image_pkey
======

And my (simplified) query looks like this:
======
SELECT   COUNT(entry_fti.id) AS rating,        entry.entry_name AS name,        entry.entry_id AS id,
entry.entry_description_htmlAS description_html,        image.image_name AS thumb1_name,        image.height AS
thumb1_height,       image.width AS thumb1_width
 
FROM     entry, entry_fti, image
WHERE    entry_fti.id=entry.oid AND    entry.entrytn_image_id=image.image_id AND    entry.entry_live = 't'::bool AND
(        entry_fti.string ~'^word1'         OR         entry_fti.string ~'^word2'         OR                      .
                .         OR         entry_fti.string ~'^wordn'        ) 
 
GROUP BY entry.entry_id,        entry.entry_name,        entry.entry_description_html,        image.image_name,
image.height,       image.width
 
ORDER BY rating DESC 
======

Now this all works, which is good. My problem now is that I want to 
limit the number of results shown on a page to 20 and show the number
of pages of extra results, much like you'd see on any search engine site.
Naturally I immediatly thought of the LIMIT and OFFSET clauses, but then:
a) I'd need to do an extra query, to find out the total number of results  to show the number of pages on the webpage.
b) I have no idea how to write that query. It'd be a COUNT of 'rating'  in the above, which would be a
COUNT(COUNT(entry_fti.id))which  would probably require some hideous (and not legal?) double GROUP  BY construct.
Ouch.

So basically, LIMIT/OFFSET looks like a no go. This leaves me with just
doing the above query, and using PHP to jump to a particular row in the
results depending on what page you are on and pg_numrows() to 
calculate the number of pages.

Would that be particularly inefficient? 
Should I be looking harder for a LIMIT/OFFSET based solution?

Perhaps I'd be better off splitting it into two queries, one to just
get the entry_id list in order, then another query to pull out the
rest of the information for the 20 of those entry_ids that are on the results 
page I wish to show?
That would stop Postgres from gathering so much information that I am just 
going to throw away anyway without looking at.

Any ideas? Have I missed something obvious that will help me? Or better yet, 
can someone who has done this sort of thing before tell me whether I am on the 
right track?

Paul



Re: FTI, paged, ranked searching and efficiency.

From
Josh Berkus
Date:
Paul,

    I'm afraid that much of that was over my head.  In fact, I'm keeping it
as an example in case I ever need to do something similar.  Forward your
info and I'll include credit in the source :-)

    In general terms, I've always depended on the PHP to select a "page" of
results, using the logic that the number of results on a page is a
matter for the web application to handle (a display issue) rather than
something to be handled on the back-end (a data issue).

    However, you point about not pulling out tons of data the user will
never examine (i.e. 2nd and succeeding pages) is well-taken.  Although,
without pulling the entire data set, you can't display to the user how
many results there are, total.  If it's a strong possibility that the
users are really only ever going to want the top 20-40 rated results,
then splitting it as you suggest ... first, counting all the matches and
then dragging in the rest of the data for the top X records ... makes a
lot of sense.

    Unfortunately, your only real option I can see for DB server-side row
grabbing is: Create the query(ies) as a temporary table or view using a
function.  Then use Limit and Offset to grab one chunk of data at a
time.  This is, of course, a serious problem for mutli-user performance
since eash user would need their own temp table or view.

    From what I can tell, search engines (google, for example) grab the
whole recordset and use the web script to parse it out 25 records at a
time.

    Hopefully, someone on this list will have done that before and can
provide less theoretical advice.

                    -Josh Berkus

P.S. I've also posted this to the pgsql-php list.  I;ve quoted the full
text of your question below my .sig for that reason.

--
______AGLIO DATABASE SOLUTIONS___________________________
                                        Josh Berkus
   Complete information technology      josh@agliodbs.com
    and data management solutions       (415) 436-9166
   for law firms, small businesses       fax  436-0137
    and non-profit organizations.       pager 338-4078
                                San Francisco



Paul wrote:
>
> Hello,
>
> This is going to be a bit long, I hope some of you will take the
> trouble to read it :)
>
> I am building a search engine for a section of a (PHP based) website.
> I wish the user to be able to a number of words in the search, and the
> search results to be ranked by the number of times words occur (both
> different words and the same word occuring multiple times are good).
>
> My (simplified) table structure is this:
> ======
>          Table "entry_fti"
>  Attribute |    Type     | Modifier
> -----------+-------------+----------
>  string    | varchar(25) |
>  id        | oid         |
> Index: entry_fti_string_idx
>
>                                    Table "entry"
>        Attribute       | Type  |                    Modifier
> -----------------------+-------+----------------------------------------------
> -
>  entry_id              |integer| not null default
> nextval('entry_id_seq'::text)
>  entry_name            |text   |
>  entry_description_html|text   |
>  entry_image_id        |integer| not null default 0
>  entry_tn_image_id     |integer| not null default 0
>  entry_live            |boolean| not null default 't'
> Index: entry_pkey
>
>                                Table "image"
>  Attribute  |    Type     |                    Modifier
> ------------+-------------+------------------------------------------------
>  image_id   | integer     | not null default nextval('image_id_seq'::text)
>  image_name | varchar(32) |
>  height     | integer     | not null
>  width      | integer     | not null
> Indices:  image_pkey
> ======
>
> And my (simplified) query looks like this:
> ======
> SELECT   COUNT(entry_fti.id) AS rating,
>          entry.entry_name AS name,
>          entry.entry_id AS id,
>          entry.entry_description_html AS description_html,
>          image.image_name AS thumb1_name,
>          image.height AS thumb1_height,
>          image.width AS thumb1_width
> FROM     entry, entry_fti, image
> WHERE    entry_fti.id=entry.oid
>   AND    entry.entrytn_image_id=image.image_id
>   AND    entry.entry_live = 't'::bool
>   AND    (
>           entry_fti.string ~'^word1'
>           OR
>           entry_fti.string ~'^word2'
>           OR
>                        .
>                        .
>           OR
>           entry_fti.string ~'^wordn'
>          )
> GROUP BY entry.entry_id,
>          entry.entry_name,
>          entry.entry_description_html,
>          image.image_name,
>          image.height,
>          image.width
> ORDER BY rating DESC
> ======
>
> Now this all works, which is good. My problem now is that I want to
> limit the number of results shown on a page to 20 and show the number
> of pages of extra results, much like you'd see on any search engine site.
> Naturally I immediatly thought of the LIMIT and OFFSET clauses, but then:
> a) I'd need to do an extra query, to find out the total number of results
>    to show the number of pages on the webpage.
> b) I have no idea how to write that query. It'd be a COUNT of 'rating'
>    in the above, which would be a COUNT(COUNT(entry_fti.id)) which
>    would probably require some hideous (and not legal?) double GROUP
>    BY construct. Ouch.
>
> So basically, LIMIT/OFFSET looks like a no go. This leaves me with just
> doing the above query, and using PHP to jump to a particular row in the
> results depending on what page you are on and pg_numrows() to
> calculate the number of pages.
>
> Would that be particularly inefficient?
> Should I be looking harder for a LIMIT/OFFSET based solution?
>
> Perhaps I'd be better off splitting it into two queries, one to just
> get the entry_id list in order, then another query to pull out the
> rest of the information for the 20 of those entry_ids that are on the results
> page I wish to show?
> That would stop Postgres from gathering so much information that I am just
> going to throw away anyway without looking at.
>
> Any ideas? Have I missed something obvious that will help me? Or better yet,
> can someone who has done this sort of thing before tell me whether I am on the
> right track?
>
> Paul

Re: [PHP] Re: FTI, paged, ranked searching and efficiency.

From
Josh Berkus
Date:
Stephen,

> How come nobody's ever thought of cursors?
>
> DECLARE foo CURSOR FOR SELECT stuff FROM stuff WHERE foo ORDER BY
> something;
>
> Hop forward N rows?
>         MOVE FORWARD $n IN foo
>
> Want M rows?
>         FETCH FORWARD $m IN foo

    I'm intrigued by this.  How would I retrieve cursor rows into a web
application?  If we could output a cursor to a functon result (we
can't), it would be easy, but I'm not sure otherwise.

                    -Josh
--
______AGLIO DATABASE SOLUTIONS___________________________
                                        Josh Berkus
   Complete information technology      josh@agliodbs.com
    and data management solutions       (415) 436-9166
   for law firms, small businesses       fax  436-0137
    and non-profit organizations.       pager 338-4078
                                San Francisco

Re: [PHP] Re: FTI, paged, ranked searching and efficiency.

From
Stephen van Egmond
Date:
Josh Berkus (josh@agliodbs.com) wrote:
>     Unfortunately, your only real option I can see for DB server-side row
> grabbing is: Create the query(ies) as a temporary table or view using a
> function.  Then use Limit and Offset to grab one chunk of data at a
> time.  This is, of course, a serious problem for mutli-user performance
> since eash user would need their own temp table or view.

How come nobody's ever thought of cursors?

DECLARE foo CURSOR FOR SELECT stuff FROM stuff WHERE foo ORDER BY
something;

Hop forward N rows?
    MOVE FORWARD $n IN foo

Want M rows?
    FETCH FORWARD $m IN foo

/Steve


Re: [PHP] Re: FTI, paged, ranked searching and efficiency.

From
Stephen van Egmond
Date:
My example wasn't complete. You need to wrap the cursor calls in BEGIN
and END. Read the DECLARE docs for the minutiae.

/Steve