Thread: best paging strategies for large datasets?
Hi,

I have a large dataset (page 1 at http://www.cruisefish.net/stat.md) and am in the process of developing a pager to let users leaf through it (30K rows).

Ideally I'd like to know, when requesting any 'page' of data, where I am within the dataset: how many pages are available each way, etc.

Of course that can be done by doing a count(*) query before requesting a limit/offset subset. But the main query is already quite slow, so I'd like to minimize the number of queries.

I am intrigued by window functions, though, especially the row_number() and ntile(int) ones.

Adding "row_number() over (order by <reverse query>)" to my query will return the total number of rows in the first row, letting me deduce the number of pages remaining, etc. row_number() apparently adds very little cost to the main query.

And ntile(buckets) seems nice too, but I need the total row count for it to contain a 'page' number: ntile(row_count/page_size).

What better "paging" strategies are out there?

Thanks,
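[Editor's note: a sibling window function, count(*) OVER (), returns the total row count on every row without the reverse-order trick, and it is computed before LIMIT/OFFSET is applied. A minimal sketch using Python's sqlite3 module — SQLite 3.25+ shares the window-function syntax used here — where the `stat` table and its data are hypothetical stand-ins for the real multi-join:]

```python
import sqlite3

# Hypothetical stand-in for the slow multi-join in the post.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stat (id INTEGER PRIMARY KEY, ship TEXT)")
conn.executemany("INSERT INTO stat (ship) VALUES (?)",
                 [(f"ship-{i}",) for i in range(95)])

page_size, page = 10, 3  # zero-based page number
rows = conn.execute(
    """
    SELECT id, ship,
           count(*)     OVER ()            AS total_rows,
           row_number() OVER (ORDER BY id) AS rn
    FROM stat
    ORDER BY id
    LIMIT ? OFFSET ?
    """,
    (page_size, page * page_size),
).fetchall()

total = rows[0][2]              # same value on every returned row
pages = -(-total // page_size)  # ceiling division
print(total, pages, rows[0][3], rows[-1][3])  # 95 10 31 40
```

One round trip yields the page of rows, the total row count, and each row's absolute position, which is everything the pager UI needs.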
Hello

2010/5/12 Louis-David Mitterrand <vindex+lists-pgsql-sql@apartia.org>:
> I have a large dataset (page 1 at http://www.cruisefish.net/stat.md) and
> am in the process of developing a pager to let users leaf through it
> (30K rows).
>
> Ideally I'd like to know when requesting any 'page' of data where I am
> within the dataset: how many pages are available each way, etc.
>
> Of course that can be done by doing a count(*) query before requesting a
> limit/offset subset. But the main query is already quite slow, so I'd
> like to minimize them.

Look at scrollable cursors; see the DECLARE CURSOR statement.

Regards
Pavel Stehule
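[Editor's note: in PostgreSQL the scrollable-cursor approach Pavel suggests looks like `DECLARE pager SCROLL CURSOR FOR SELECT ...`, then `MOVE FORWARD n IN pager` / `FETCH m FROM pager` inside a transaction, so the expensive query is planned and started once. A sketch of the same "open once, read incrementally" idea with a DB-API cursor and `fetchmany()`, using sqlite3 and a hypothetical demo table:]

```python
import sqlite3

# PostgreSQL server-side equivalent (inside a transaction):
#   DECLARE pager SCROLL CURSOR FOR SELECT ... ORDER BY ...;
#   MOVE FORWARD 30 IN pager;   -- skip to page 4
#   FETCH 10 FROM pager;        -- read one page
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stat (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO stat (id) VALUES (?)",
                 [(i,) for i in range(1, 31)])

# The query runs once; each fetchmany() just steps further through it.
cur = conn.execute("SELECT id FROM stat ORDER BY id")
page1 = cur.fetchmany(10)   # rows 1-10
page2 = cur.fetchmany(10)   # rows 11-20
print(page1[0][0], page2[-1][0])  # 1 20
```

The caveat for a web pager is that a server-side cursor lives only as long as its transaction/connection, so it suits session-bound browsing more than stateless page links.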
oops typos

On 5/12/2010 1:41 AM, Louis-David Mitterrand wrote:
> I have a large dataset (page 1 at http://www.cruisefish.net/stat.md) and
> am in the process of developing a pager to let users leaf through it
> (30K rows).

That's not that big of a record set.

> Ideally I'd like to know when requesting any 'page' of data where I am
> within the dataset: how many pages are available each way, etc.
>
> Of course that can be done by doing a count(*) query before requesting a
> limit/offset subset. But the main query is already quite slow, so I'd
> like to minimize them.

What do you mean by quite slow? On a 30K record table, count() and query speed should not be a problem.

> But I am intrigued by window functions, especially the row_number() and
> ntile(int) ones.
>
> Adding "row_number() over (order by <reverse query>)" to my query will
> return the total number of rows in the first row, letting me deduce the
> number of pages remaining, etc. row_number() apparently adds very little
> cost to the main query.

That will get you a sequential number, but you still don't know how many records are in the table; LIMIT and OFFSET block that value, so I don't see how this helps.

LIMIT and OFFSET together with the total record count tell us where we are in the record set and which page we are on:

RecordCount/Limit = Number of pages
CurrentPage = (offset % RecordCount)/Limit

To complicate things further, what if the site allows the user to change the number of records displayed per page? The pager logic needs to figure out how many records to return per page, and what the next and previous iterations are. Without the total record count I don't see how that is even possible. I have written pagers in ASP and PHP.

All legitimate Magwerks Corporation quotations are sent in a .PDF file attachment with a unique ID number generated by our proprietary quotation system. Quotations received via any other form of communication will not be honored.
CONFIDENTIALITY NOTICE: This e-mail, including attachments, may contain legally privileged, confidential or other information proprietary to Magwerks Corporation and is intended solely for the use of the individual to whom it is addressed. If the reader of this e-mail is not the intended recipient or authorized agent, the reader is hereby notified that any unauthorized viewing, dissemination, distribution or copying of this e-mail is strictly prohibited. If you have received this e-mail in error, please notify the sender by replying to this message and destroy all occurrences of this e-mail immediately. Thank you.
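[Editor's note: the RecordCount/Limit and offset formulas above, written out with the ceiling/floor handling a later message in the thread alludes to; the function name is mine, and the current page is reported 1-based here rather than via the zero-based modulo form quoted above:]

```python
import math

def pager_state(total_rows: int, limit: int, offset: int):
    """Derive page count and current page from total count + LIMIT/OFFSET."""
    pages = math.ceil(total_rows / limit)  # a partial last page still counts
    current = offset // limit + 1          # 1-based current page
    return pages, current

# 30K rows, 50 per page, user is at offset 1200:
print(pager_state(30_000, 50, 1200))  # (600, 25)
```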
On 05/12/10 09:41, Louis-David Mitterrand wrote:
> I have a large dataset (page 1 at http://www.cruisefish.net/stat.md) and
> am in the process of developing a pager to let users leaf through it
> (30K rows).
>
> Ideally I'd like to know when requesting any 'page' of data where I am
> within the dataset: how many pages are available each way, etc.
>
> Of course that can be done by doing a count(*) query before requesting a
> limit/offset subset. But the main query is already quite slow, so I'd
> like to minimize them.

Nowadays I tend to bet on AJAX; in other words, I propose moving some of the calculation to the client side.

This particular situation might look like the following. First you count(*) the rows and select the requested page, returning the count result to the client bundled with a page of rows. The client then:

(1) renders the acquired rows,
(2) __memorizes__ what part of the data it just got, and
(3) stores the count result to calculate "the pager div".

Subsequent clicks on "the pager div" should not immediately generate requests; the client first decides whether a request is needed.
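[Editor's note: a sketch of the client-side caching logic described above — remember the total count and the pages already fetched, and only hit the server for a page that isn't cached. The class and the `fetch_page` callable standing in for the AJAX request are illustrative, not from the thread:]

```python
class PagerCache:
    """Client-side cache: pages already seen never trigger a new request."""
    def __init__(self, fetch_page):
        self.fetch_page = fetch_page  # callable: page -> (rows, total)
        self.total = None             # count(*) result, stored once
        self.pages = {}               # memorized pages of rows

    def get(self, page):
        if page not in self.pages:    # decide whether a request is needed
            rows, self.total = self.fetch_page(page)
            self.pages[page] = rows
        return self.pages[page]

calls = []
def fake_fetch(page):                 # stand-in for the AJAX round trip
    calls.append(page)
    return [f"row-{page}-{i}" for i in range(10)], 30_000

cache = PagerCache(fake_fetch)
cache.get(1); cache.get(2); cache.get(1)  # third call served from cache
print(len(calls), cache.total)            # 2 30000
```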
On Wed, May 12, 2010 at 12:26:17PM -0400, Justin Graf wrote:
> On 5/12/2010 1:41 AM, Louis-David Mitterrand wrote:
> > I have a large dataset (page 1 at http://www.cruisefish.net/stat.md) and
> > am in the process of developing a pager to let users leaf through it
> > (30K rows).
>
> That's not that big of a record set.

Well, for me it's a big one :) But then again it's my first serious web app.

> What do you mean by quite slow?

Like several seconds. I have to cache the results.

> On a 30K record table, count() and query speed should not be a problem.

This query is a large multi-join of time-series data, not a single table. And it's not (prematurely :) optimized. I'm planning a materialized view for it.

> That will get you a sequential number, but you still don't know how many
> records are in the table; LIMIT and OFFSET block that value, so I don't
> see how this helps.
>
> LIMIT and OFFSET together with the total record count tell us where we
> are in the record set and which page we are on.

Hmm, good to know. I hadn't tried that yet.

> RecordCount/Limit = Number of pages
> CurrentPage = (offset % RecordCount)/Limit

These simple formulas will be handy.

> To complicate things further, what if the site allows the user to change
> the number of records displayed per page? The pager logic needs to
> figure out how many records to return per page, and what the next and
> previous iterations are. Without the total record count I don't see how
> that is even possible.
>
> I have written pagers in ASP and PHP.

Thanks for your input. I now realize I'll have to get a total count in a separate (cached) query, or else I'll only be able to provide a basic "previous/next" pager.

Cheers,
On Thu, May 13, 2010 at 11:36:53AM +0400, silly sad wrote:
> Nowadays I tend to bet on AJAX; in other words, I propose moving some of
> the calculation to the client side.
>
> First you count(*) the rows and select the requested page, returning the
> count result to the client bundled with a page of rows. The client then:
>
> (1) renders the acquired rows,
> (2) __memorizes__ what part of the data it just got, and
> (3) stores the count result to calculate "the pager div".
>
> Subsequent clicks on "the pager div" should not immediately generate
> requests; the client first decides whether a request is needed.

Yes, rendering the results through AJAX is a good idea, but one has to be careful not to expose one's LIMIT and OFFSET to the client, only the "page" number. Or else the client could query the whole dataset. A lot of "professional" web sites have that hole.
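[Editor's note: a sketch of the point above — the client sends only a page number, and the server derives LIMIT/OFFSET itself while clamping the page size, so no request can pull the whole dataset at once. The function name and the `MAX_PAGE_SIZE` cap are illustrative:]

```python
MAX_PAGE_SIZE = 100  # hard server-side cap; never trust the client

def page_to_limit_offset(page: int, page_size: int, total_rows: int):
    """Map a client-supplied page number to a safe LIMIT/OFFSET pair."""
    page_size = max(1, min(page_size, MAX_PAGE_SIZE))  # clamp size
    last_page = max(1, -(-total_rows // page_size))    # ceiling division
    page = max(1, min(page, last_page))                # clamp page number
    return page_size, (page - 1) * page_size

print(page_to_limit_offset(3, 50, 30_000))      # (50, 100)
print(page_to_limit_offset(7, 10_000, 30_000))  # size clamped: (100, 600)
```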
> Yes, rendering the results through AJAX is a good idea, but one has to
> be careful not to expose one's LIMIT and OFFSET to the client, only the
> "page" number. Or else the client could query the whole dataset. A lot
> of "professional" web sites have that hole.

This is not a hole; it is only a matter of aesthetics.
On 5/13/2010 4:41 AM, silly sad wrote:
>> Yes, rendering the results through AJAX is a good idea, but one has to
>> be careful not to expose one's LIMIT and OFFSET to the client, only the
>> "page" number. Or else the client could query the whole dataset. A lot
>> of "professional" web sites have that hole.
>
> This is not a hole; it is only a matter of aesthetics.

Silly Sad is right: this is not a hole but a matter of aesthetics.

To keep the code simple and limit the amount of state that has to be tracked with the client's session on the server, I pass the limit and offset to the client, normally in a URL/link. This also avoids the problem of the user's session expiring: with server-side state the information is lost, meaning the user has to restart, which is very annoying. So the URLs look something like this:

www.mywebsite.com/index.php?module=getthedata&limit=10&offset=30&orderby=5

On the server, cast these three values to integer to block SQL injection.

I really don't care if the user sends a command to get all 10,000 records. If you block that, all you've accomplished is slowing down data harvesting and eating up even more resources, as the client/user will send ever more GETs to harvest the data. Nothing has been gained.

To keep the casual onlooker from screwing with the URL, encode it in base64. That keeps honest people honest, but hackers will pull it apart in a second.
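[Editor's note: the integer cast described above, sketched with sqlite3 — coerce limit/offset/orderby before they go anywhere near SQL, bind limit/offset as query parameters, and resolve the orderby index through a whitelist rather than splicing user text into the statement. The table, columns, and `ALLOWED_ORDER` map are hypothetical:]

```python
import sqlite3

# orderby index from the URL -> known column name (never spliced from input)
ALLOWED_ORDER = {1: "id", 5: "price"}

def safe_page_query(conn, params: dict):
    limit = int(params.get("limit", 10))    # int() raises on injection text
    offset = int(params.get("offset", 0))
    order = ALLOWED_ORDER[int(params.get("orderby", 1))]
    sql = f"SELECT id FROM stat ORDER BY {order} LIMIT ? OFFSET ?"
    return conn.execute(sql, (limit, offset)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stat (id INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO stat (id, price) VALUES (?, ?)",
                 [(i, i * 1.5) for i in range(1, 41)])
print(safe_page_query(conn, {"limit": "10", "offset": "30", "orderby": "1"}))
```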
On 5/13/2010 3:43 AM, Louis-David Mitterrand wrote:
**snip**
>> What do you mean by quite slow?
>
> Like several seconds. I have to cache the results.

Well then, I suggest posting the queries to Performance or here and letting us take a look at them. Don't forget to include the explain/analyze output and the number of records in each table.

>> On a 30K record table, count() and query speed should not be a problem.
>
> This query is a large multi-join of time-series data, not a single
> table. And it's not (prematurely :) optimized.
>
> I'm planning a materialized view for it.

Here, read this: http://www.pgcon.org/2008/schedule/events/69.en.html

The question that begs to be asked is: how big are the tables the query is accessing? If it's not hundreds of thousands to millions of records, or on a stone-age server, my guess is the query can be improved.

**snip**
>> RecordCount/Limit = Number of pages
>> CurrentPage = (offset % RecordCount)/Limit
>
> These simple formulas will be handy.

Don't forget to use floor on these in whatever language you're programming in.
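[Editor's note: the materialized view mentioned above had to be hand-rolled at the time of this thread; modern PostgreSQL (9.3+) provides CREATE MATERIALIZED VIEW and REFRESH MATERIALIZED VIEW natively. A sketch of the hand-rolled form using sqlite3 — store the slow join's result in a real table and page against that instead. The `cruise`/`price` tables and the aggregate are hypothetical stand-ins for the site's real schema:]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cruise (id INTEGER PRIMARY KEY, ship TEXT);
    CREATE TABLE price  (cruise_id INTEGER, amount REAL);
""")
conn.executemany("INSERT INTO cruise VALUES (?, ?)",
                 [(i, f"ship-{i}") for i in range(1, 6)])
conn.executemany("INSERT INTO price VALUES (?, ?)",
                 [(i, 100.0 * i) for i in range(1, 6)])

def refresh_stat_matview(conn):
    """Re-materialize the slow multi-join into a plain, indexable table."""
    conn.executescript("""
        DROP TABLE IF EXISTS stat_mv;
        CREATE TABLE stat_mv AS
            SELECT c.id, c.ship, min(p.amount) AS best_price
            FROM cruise c JOIN price p ON p.cruise_id = c.id
            GROUP BY c.id, c.ship;
    """)

refresh_stat_matview(conn)  # re-run on a schedule or after data changes
print(conn.execute("SELECT count(*) FROM stat_mv").fetchone()[0])  # 5
```

The pager then runs its cheap count(*) and LIMIT/OFFSET queries against `stat_mv`, paying the multi-join cost only at refresh time.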