Re: FTI, paged, ranked searching and efficiency. - Mailing list pgsql-sql

From Josh Berkus
Subject Re: FTI, paged, ranked searching and efficiency.
Date
Msg-id 3A12C1DA.D296628C@agliodbs.com
Whole thread Raw
In response to FTI, paged, ranked searching and efficiency.  (Paul <paul@operamail.com>)
Responses Re: [PHP] Re: FTI, paged, ranked searching and efficiency.
Re: [PHP] Re: FTI, paged, ranked searching and efficiency.
List pgsql-sql
Paul,

    I'm afraid that much of that was over my head.  In fact, I'm keeping it
as an example in case I ever need to do something similar.  Forward your
info and I'll include credit in the source :-)

    In general terms, I've always depended on the PHP to select a "page" of
results, using the logic that the number of results on a page is a
matter for the web application to handle (a display issue) rather than
something to be handled on the back-end (a data issue).

    However, you point about not pulling out tons of data the user will
never examine (i.e. 2nd and succeeding pages) is well-taken.  Although,
without pulling the entire data set, you can't display to the user how
many results there are, total.  If it's a strong possibility that the
users are really only ever going to want the top 20-40 rated results,
then splitting it as you suggest ... first, counting all the matches and
then dragging in the rest of the data for the top X records ... makes a
lot of sense.

    Unfortunately, your only real option I can see for DB server-side row
grabbing is: Create the query(ies) as a temporary table or view using a
function.  Then use Limit and Offset to grab one chunk of data at a
time.  This is, of course, a serious problem for mutli-user performance
since eash user would need their own temp table or view.

    From what I can tell, search engines (google, for example) grab the
whole recordset and use the web script to parse it out 25 records at a
time.

    Hopefully, someone on this list will have done that before and can
provide less theoretical advice.

                    -Josh Berkus

P.S. I've also posted this to the pgsql-php list.  I;ve quoted the full
text of your question below my .sig for that reason.

--
______AGLIO DATABASE SOLUTIONS___________________________
                                        Josh Berkus
   Complete information technology      josh@agliodbs.com
    and data management solutions       (415) 436-9166
   for law firms, small businesses       fax  436-0137
    and non-profit organizations.       pager 338-4078
                                San Francisco



Paul wrote:
>
> Hello,
>
> This is going to be a bit long, I hope some of you will take the
> trouble to read it :)
>
> I am building a search engine for a section of a (PHP based) website.
> I wish the user to be able to a number of words in the search, and the
> search results to be ranked by the number of times words occur (both
> different words and the same word occuring multiple times are good).
>
> My (simplified) table structure is this:
> ======
>          Table "entry_fti"
>  Attribute |    Type     | Modifier
> -----------+-------------+----------
>  string    | varchar(25) |
>  id        | oid         |
> Index: entry_fti_string_idx
>
>                                    Table "entry"
>        Attribute       | Type  |                    Modifier
> -----------------------+-------+----------------------------------------------
> -
>  entry_id              |integer| not null default
> nextval('entry_id_seq'::text)
>  entry_name            |text   |
>  entry_description_html|text   |
>  entry_image_id        |integer| not null default 0
>  entry_tn_image_id     |integer| not null default 0
>  entry_live            |boolean| not null default 't'
> Index: entry_pkey
>
>                                Table "image"
>  Attribute  |    Type     |                    Modifier
> ------------+-------------+------------------------------------------------
>  image_id   | integer     | not null default nextval('image_id_seq'::text)
>  image_name | varchar(32) |
>  height     | integer     | not null
>  width      | integer     | not null
> Indices:  image_pkey
> ======
>
> And my (simplified) query looks like this:
> ======
> SELECT   COUNT(entry_fti.id) AS rating,
>          entry.entry_name AS name,
>          entry.entry_id AS id,
>          entry.entry_description_html AS description_html,
>          image.image_name AS thumb1_name,
>          image.height AS thumb1_height,
>          image.width AS thumb1_width
> FROM     entry, entry_fti, image
> WHERE    entry_fti.id=entry.oid
>   AND    entry.entrytn_image_id=image.image_id
>   AND    entry.entry_live = 't'::bool
>   AND    (
>           entry_fti.string ~'^word1'
>           OR
>           entry_fti.string ~'^word2'
>           OR
>                        .
>                        .
>           OR
>           entry_fti.string ~'^wordn'
>          )
> GROUP BY entry.entry_id,
>          entry.entry_name,
>          entry.entry_description_html,
>          image.image_name,
>          image.height,
>          image.width
> ORDER BY rating DESC
> ======
>
> Now this all works, which is good. My problem now is that I want to
> limit the number of results shown on a page to 20 and show the number
> of pages of extra results, much like you'd see on any search engine site.
> Naturally I immediatly thought of the LIMIT and OFFSET clauses, but then:
> a) I'd need to do an extra query, to find out the total number of results
>    to show the number of pages on the webpage.
> b) I have no idea how to write that query. It'd be a COUNT of 'rating'
>    in the above, which would be a COUNT(COUNT(entry_fti.id)) which
>    would probably require some hideous (and not legal?) double GROUP
>    BY construct. Ouch.
>
> So basically, LIMIT/OFFSET looks like a no go. This leaves me with just
> doing the above query, and using PHP to jump to a particular row in the
> results depending on what page you are on and pg_numrows() to
> calculate the number of pages.
>
> Would that be particularly inefficient?
> Should I be looking harder for a LIMIT/OFFSET based solution?
>
> Perhaps I'd be better off splitting it into two queries, one to just
> get the entry_id list in order, then another query to pull out the
> rest of the information for the 20 of those entry_ids that are on the results
> page I wish to show?
> That would stop Postgres from gathering so much information that I am just
> going to throw away anyway without looking at.
>
> Any ideas? Have I missed something obvious that will help me? Or better yet,
> can someone who has done this sort of thing before tell me whether I am on the
> right track?
>
> Paul

pgsql-sql by date:

Previous
From: Patrik Kudo
Date:
Subject: Index problem
Next
From: Josh Berkus
Date:
Subject: Re: [PHP] Re: FTI, paged, ranked searching and efficiency.