Re: FTI, paged, ranked searching and efficiency. - Mailing list pgsql-sql
From | Josh Berkus |
---|---|
Subject | Re: FTI, paged, ranked searching and efficiency. |
Date | |
Msg-id | 3A12C1DA.D296628C@agliodbs.com Whole thread Raw |
In response to | FTI, paged, ranked searching and efficiency. (Paul <paul@operamail.com>) |
Responses |
Re: [PHP] Re: FTI, paged, ranked searching and efficiency.
Re: [PHP] Re: FTI, paged, ranked searching and efficiency. |
List | pgsql-sql |
Paul, I'm afraid that much of that was over my head. In fact, I'm keeping it as an example in case I ever need to do something similar. Forward your info and I'll include credit in the source :-) In general terms, I've always depended on the PHP to select a "page" of results, using the logic that the number of results on a page is a matter for the web application to handle (a display issue) rather than something to be handled on the back-end (a data issue). However, you point about not pulling out tons of data the user will never examine (i.e. 2nd and succeeding pages) is well-taken. Although, without pulling the entire data set, you can't display to the user how many results there are, total. If it's a strong possibility that the users are really only ever going to want the top 20-40 rated results, then splitting it as you suggest ... first, counting all the matches and then dragging in the rest of the data for the top X records ... makes a lot of sense. Unfortunately, your only real option I can see for DB server-side row grabbing is: Create the query(ies) as a temporary table or view using a function. Then use Limit and Offset to grab one chunk of data at a time. This is, of course, a serious problem for mutli-user performance since eash user would need their own temp table or view. From what I can tell, search engines (google, for example) grab the whole recordset and use the web script to parse it out 25 records at a time. Hopefully, someone on this list will have done that before and can provide less theoretical advice. -Josh Berkus P.S. I've also posted this to the pgsql-php list. I;ve quoted the full text of your question below my .sig for that reason. -- ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete information technology josh@agliodbs.com and data management solutions (415) 436-9166 for law firms, small businesses fax 436-0137 and non-profit organizations. pager 338-4078 San Francisco Paul wrote: > > Hello, > > This is going to be a bit long, I hope some of you will take the > trouble to read it :) > > I am building a search engine for a section of a (PHP based) website. > I wish the user to be able to a number of words in the search, and the > search results to be ranked by the number of times words occur (both > different words and the same word occuring multiple times are good). > > My (simplified) table structure is this: > ====== > Table "entry_fti" > Attribute | Type | Modifier > -----------+-------------+---------- > string | varchar(25) | > id | oid | > Index: entry_fti_string_idx > > Table "entry" > Attribute | Type | Modifier > -----------------------+-------+---------------------------------------------- > - > entry_id |integer| not null default > nextval('entry_id_seq'::text) > entry_name |text | > entry_description_html|text | > entry_image_id |integer| not null default 0 > entry_tn_image_id |integer| not null default 0 > entry_live |boolean| not null default 't' > Index: entry_pkey > > Table "image" > Attribute | Type | Modifier > ------------+-------------+------------------------------------------------ > image_id | integer | not null default nextval('image_id_seq'::text) > image_name | varchar(32) | > height | integer | not null > width | integer | not null > Indices: image_pkey > ====== > > And my (simplified) query looks like this: > ====== > SELECT COUNT(entry_fti.id) AS rating, > entry.entry_name AS name, > entry.entry_id AS id, > entry.entry_description_html AS description_html, > image.image_name AS thumb1_name, > image.height AS thumb1_height, > image.width AS thumb1_width > FROM entry, entry_fti, image > WHERE entry_fti.id=entry.oid > AND entry.entrytn_image_id=image.image_id > AND entry.entry_live = 't'::bool > AND ( > entry_fti.string ~'^word1' > OR > entry_fti.string ~'^word2' > OR > . > . > OR > entry_fti.string ~'^wordn' > ) > GROUP BY entry.entry_id, > entry.entry_name, > entry.entry_description_html, > image.image_name, > image.height, > image.width > ORDER BY rating DESC > ====== > > Now this all works, which is good. My problem now is that I want to > limit the number of results shown on a page to 20 and show the number > of pages of extra results, much like you'd see on any search engine site. > Naturally I immediatly thought of the LIMIT and OFFSET clauses, but then: > a) I'd need to do an extra query, to find out the total number of results > to show the number of pages on the webpage. > b) I have no idea how to write that query. It'd be a COUNT of 'rating' > in the above, which would be a COUNT(COUNT(entry_fti.id)) which > would probably require some hideous (and not legal?) double GROUP > BY construct. Ouch. > > So basically, LIMIT/OFFSET looks like a no go. This leaves me with just > doing the above query, and using PHP to jump to a particular row in the > results depending on what page you are on and pg_numrows() to > calculate the number of pages. > > Would that be particularly inefficient? > Should I be looking harder for a LIMIT/OFFSET based solution? > > Perhaps I'd be better off splitting it into two queries, one to just > get the entry_id list in order, then another query to pull out the > rest of the information for the 20 of those entry_ids that are on the results > page I wish to show? > That would stop Postgres from gathering so much information that I am just > going to throw away anyway without looking at. > > Any ideas? Have I missed something obvious that will help me? Or better yet, > can someone who has done this sort of thing before tell me whether I am on the > right track? > > Paul