Thread: best paging strategies for large datasets?
Hi,

I have a large dataset (page 1 at http://www.cruisefish.net/stat.md) and am in the process of developing a pager to let users leaf through it (30K rows).

Ideally I'd like to know, when requesting any 'page' of data, where I am within the dataset: how many pages are available each way, etc.

Of course that can be done by doing a count(*) query before requesting a limit/offset subset. But the main query is already quite slow, so I'd like to minimize the number of queries.

I am intrigued by window functions, though, especially the row_number() and ntile(int) ones.

Adding "row_number() over (order by <reverse query>)" to my query will return the total number of rows in the first row, letting me deduce the number of pages remaining, etc. row_number() apparently adds very little cost to the main query.

And ntile(buckets) seems nice too, but I need the total row count for it to contain a 'page' number: ntile(row_count/page_size).

What better "paging" strategies are out there?

Thanks,
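[Editor's note: a sibling window function, count(*) OVER (), returns the total row count on every row without the reverse-order trick, and it is computed before LIMIT/OFFSET is applied. A minimal sketch using Python's sqlite3 module — SQLite 3.25+ shares the window-function syntax used here — where the `stat` table and its data are hypothetical stand-ins for the real multi-join:]

```python
import sqlite3

# Hypothetical stand-in for the slow multi-join in the post.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stat (id INTEGER PRIMARY KEY, ship TEXT)")
conn.executemany("INSERT INTO stat (ship) VALUES (?)",
                 [(f"ship-{i}",) for i in range(95)])

page_size, page = 10, 3  # zero-based page number
rows = conn.execute(
    """
    SELECT id, ship,
           count(*)     OVER ()            AS total_rows,
           row_number() OVER (ORDER BY id) AS rn
    FROM stat
    ORDER BY id
    LIMIT ? OFFSET ?
    """,
    (page_size, page * page_size),
).fetchall()

total = rows[0][2]              # same value on every returned row
pages = -(-total // page_size)  # ceiling division
print(total, pages, rows[0][3], rows[-1][3])  # 95 10 31 40
```

One round trip yields the page of rows, the total row count, and each row's absolute position, which is everything the pager UI needs.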
Hello

2010/5/12 Louis-David Mitterrand <vindex+lists-pgsql-sql@apartia.org>:
> I have a large dataset (page 1 at http://www.cruisefish.net/stat.md) and
> am in the process of developing a pager to let users leaf through it
> (30K rows).
>
> Ideally I'd like to know when requesting any 'page' of data where I am
> within the dataset: how many pages are available each way, etc.
>
> Of course that can be done by doing a count(*) query before requesting a
> limit/offset subset. But the main query is already quite slow, so I'd
> like to minimize them.

Look at scrollable cursors; see the DECLARE CURSOR statement.

Regards
Pavel Stehule
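[Editor's note: in PostgreSQL the scrollable-cursor approach Pavel suggests looks like `DECLARE pager SCROLL CURSOR FOR SELECT ...`, then `MOVE FORWARD n IN pager` / `FETCH m FROM pager` inside a transaction, so the expensive query is planned and started once. A sketch of the same "open once, read incrementally" idea with a DB-API cursor and `fetchmany()`, using sqlite3 and a hypothetical demo table:]

```python
import sqlite3

# PostgreSQL server-side equivalent (inside a transaction):
#   DECLARE pager SCROLL CURSOR FOR SELECT ... ORDER BY ...;
#   MOVE FORWARD 30 IN pager;   -- skip to page 4
#   FETCH 10 FROM pager;        -- read one page
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stat (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO stat (id) VALUES (?)",
                 [(i,) for i in range(1, 31)])

# The query runs once; each fetchmany() just steps further through it.
cur = conn.execute("SELECT id FROM stat ORDER BY id")
page1 = cur.fetchmany(10)   # rows 1-10
page2 = cur.fetchmany(10)   # rows 11-20
print(page1[0][0], page2[-1][0])  # 1 20
```

The caveat for a web pager is that a server-side cursor lives only as long as its transaction/connection, so it suits session-bound browsing more than stateless page links.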
oops typos

On 5/12/2010 1:41 AM, Louis-David Mitterrand wrote:
> I have a large dataset (page 1 at http://www.cruisefish.net/stat.md) and
> am in the process of developing a pager to let users leaf through it
> (30K rows).

That's not that big of a record set.

> Ideally I'd like to know when requesting any 'page' of data where I am
> within the dataset: how many pages are available each way, etc.
>
> Of course that can be done by doing a count(*) query before requesting a
> limit/offset subset. But the main query is already quite slow, so I'd
> like to minimize them.

What do you mean by quite slow? On a 30K record table, count() and query speed should not be a problem.

> But I am intrigued by window functions, especially the row_number() and
> ntile(int) ones.
>
> Adding "row_number() over (order by <reverse query>)" to my query will
> return the total number of rows in the first row, letting me deduce the
> number of pages remaining, etc. row_number() apparently adds very little
> cost to the main query.

That will get you a sequential number, but you still don't know how many records are in the table; LIMIT and OFFSET block that value, so I don't see how this helps.

LIMIT and OFFSET together with the total record count tell us where we are in the record set and which page we are on:

RecordCount/Limit = Number of pages
CurrentPage = (offset % RecordCount)/Limit

To complicate things further, what if the site allows the user to change the number of records displayed per page? The pager logic needs to figure out how many records to return per page, and what the next and previous iterations are. Without the total record count I don't see how that is even possible. I have written pagers in ASP and PHP.

All legitimate Magwerks Corporation quotations are sent in a .PDF file attachment with a unique ID number generated by our proprietary quotation system. Quotations received via any other form of communication will not be honored.
CONFIDENTIALITY NOTICE: This e-mail, including attachments, may contain legally privileged, confidential or other information proprietary to Magwerks Corporation and is intended solely for the use of the individual to whom it is addressed. If the reader of this e-mail is not the intended recipient or authorized agent, the reader is hereby notified that any unauthorized viewing, dissemination, distribution or copying of this e-mail is strictly prohibited. If you have received this e-mail in error, please notify the sender by replying to this message and destroy all occurrences of this e-mail immediately. Thank you.
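[Editor's note: the RecordCount/Limit and offset formulas above, written out with the ceiling/floor handling a later message in the thread alludes to; the function name is mine, and the current page is reported 1-based here rather than via the zero-based modulo form quoted above:]

```python
import math

def pager_state(total_rows: int, limit: int, offset: int):
    """Derive page count and current page from total count + LIMIT/OFFSET."""
    pages = math.ceil(total_rows / limit)  # a partial last page still counts
    current = offset // limit + 1          # 1-based current page
    return pages, current

# 30K rows, 50 per page, user is at offset 1200:
print(pager_state(30_000, 50, 1200))  # (600, 25)
```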
On 05/12/10 09:41, Louis-David Mitterrand wrote:
> I have a large dataset (page 1 at http://www.cruisefish.net/stat.md) and
> am in the process of developing a pager to let users leaf through it
> (30K rows).
>
> Ideally I'd like to know when requesting any 'page' of data where I am
> within the dataset: how many pages are available each way, etc.
>
> Of course that can be done by doing a count(*) query before requesting a
> limit/offset subset. But the main query is already quite slow, so I'd
> like to minimize them.

Nowadays I tend to bet on AJAX; in other words, I propose moving some of the calculation to the client side.

This particular situation might look like the following. First you count(*) the rows and select the requested page, returning the count result to the client bundled with a page of rows. The client then:

(1) renders the acquired rows,
(2) __memorizes__ what part of the data it just got, and
(3) stores the count result to calculate "the pager div".

Subsequent clicks on "the pager div" should not immediately generate requests; the client first decides whether a request is needed.
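[Editor's note: a sketch of the client-side caching logic described above — remember the total count and the pages already fetched, and only hit the server for a page that isn't cached. The class and the `fetch_page` callable standing in for the AJAX request are illustrative, not from the thread:]

```python
class PagerCache:
    """Client-side cache: pages already seen never trigger a new request."""
    def __init__(self, fetch_page):
        self.fetch_page = fetch_page  # callable: page -> (rows, total)
        self.total = None             # count(*) result, stored once
        self.pages = {}               # memorized pages of rows

    def get(self, page):
        if page not in self.pages:    # decide whether a request is needed
            rows, self.total = self.fetch_page(page)
            self.pages[page] = rows
        return self.pages[page]

calls = []
def fake_fetch(page):                 # stand-in for the AJAX round trip
    calls.append(page)
    return [f"row-{page}-{i}" for i in range(10)], 30_000

cache = PagerCache(fake_fetch)
cache.get(1); cache.get(2); cache.get(1)  # third call served from cache
print(len(calls), cache.total)            # 2 30000
```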
On Wed, May 12, 2010 at 12:26:17PM -0400, Justin Graf wrote:
> On 5/12/2010 1:41 AM, Louis-David Mitterrand wrote:
> > I have a large dataset (page 1 at http://www.cruisefish.net/stat.md) and
> > am in the process of developing a pager to let users leaf through it
> > (30K rows).
>
> That's not that big of a record set.

Well, for me it's a big one :) But then again it's my first serious web app.

> What do you mean by quite slow?

Like several seconds. I have to cache the results.

> On a 30K record table, count() and query speed should not be a problem.

This query is a large multi-join of time-series data, not a single table. And it's not (prematurely :) optimized. I'm planning a materialized view for it.

> That will get you a sequential number, but you still don't know how many
> records are in the table; LIMIT and OFFSET block that value, so I don't
> see how this helps.
>
> LIMIT and OFFSET together with the total record count tell us where we
> are in the record set and which page we are on.

Hmm, good to know. I hadn't tried that yet.

> RecordCount/Limit = Number of pages
> CurrentPage = (offset % RecordCount)/Limit

These simple formulas will be handy.

> To complicate things further, what if the site allows the user to change
> the number of records displayed per page? The pager logic needs to
> figure out how many records to return per page, and what the next and
> previous iterations are. Without the total record count I don't see how
> that is even possible.
>
> I have written pagers in ASP and PHP.

Thanks for your input. I now realize I'll have to get a total count in a separate (cached) query, or else I'll only be able to provide a basic "previous/next" pager.

Cheers,
On Thu, May 13, 2010 at 11:36:53AM +0400, silly sad wrote:
> Nowadays I tend to bet on AJAX; in other words, I propose moving some of
> the calculation to the client side.
>
> First you count(*) the rows and select the requested page, returning the
> count result to the client bundled with a page of rows. The client then:
>
> (1) renders the acquired rows,
> (2) __memorizes__ what part of the data it just got, and
> (3) stores the count result to calculate "the pager div".
>
> Subsequent clicks on "the pager div" should not immediately generate
> requests; the client first decides whether a request is needed.

Yes, rendering the results through AJAX is a good idea, but one has to be careful not to expose one's LIMIT and OFFSET to the client, only the "page" number. Or else the client could query the whole dataset. A lot of "professional" web sites have that hole.
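[Editor's note: a sketch of the point above — the client sends only a page number, and the server derives LIMIT/OFFSET itself while clamping the page size, so no request can pull the whole dataset at once. The function name and the `MAX_PAGE_SIZE` cap are illustrative:]

```python
MAX_PAGE_SIZE = 100  # hard server-side cap; never trust the client

def page_to_limit_offset(page: int, page_size: int, total_rows: int):
    """Map a client-supplied page number to a safe LIMIT/OFFSET pair."""
    page_size = max(1, min(page_size, MAX_PAGE_SIZE))  # clamp size
    last_page = max(1, -(-total_rows // page_size))    # ceiling division
    page = max(1, min(page, last_page))                # clamp page number
    return page_size, (page - 1) * page_size

print(page_to_limit_offset(3, 50, 30_000))      # (50, 100)
print(page_to_limit_offset(7, 10_000, 30_000))  # size clamped: (100, 600)
```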
> Yes, rendering the results through AJAX is a good idea, but one has to
> be careful not to expose one's LIMIT and OFFSET to the client, only the
> "page" number. Or else the client could query the whole dataset. A lot
> of "professional" web sites have that hole.

This is not a hole; it is only a matter of aesthetics.
On 5/13/2010 4:41 AM, silly sad wrote:
>> Yes, rendering the results through AJAX is a good idea, but one has to
>> be careful not to expose one's LIMIT and OFFSET to the client, only the
>> "page" number. Or else the client could query the whole dataset. A lot
>> of "professional" web sites have that hole.
>
> This is not a hole; it is only a matter of aesthetics.

Silly Sad is right: this is not a hole but a matter of aesthetics.

To keep the code simple and limit the amount of state that has to be tracked with the client's session on the server, I pass the limit and offset to the client, normally in a URL/link. This also avoids the problem of the user's session expiring: with server-side state the information is lost, meaning the user has to restart, which is very annoying. So the URLs look something like this:

www.mywebsite.com/index.php?module=getthedata&limit=10&offset=30&orderby=5

On the server, cast these three values to integer to block SQL injection.

I really don't care if the user sends a command to get all 10,000 records. If you block that, all you've accomplished is slowing down data harvesting and eating up even more resources, as the client/user will send ever more GETs to harvest the data. Nothing has been gained.

To keep the casual onlooker from screwing with the URL, encode it in base64. That keeps honest people honest, but hackers will pull it apart in a second.
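[Editor's note: the integer cast described above, sketched with sqlite3 — coerce limit/offset/orderby before they go anywhere near SQL, bind limit/offset as query parameters, and resolve the orderby index through a whitelist rather than splicing user text into the statement. The table, columns, and `ALLOWED_ORDER` map are hypothetical:]

```python
import sqlite3

# orderby index from the URL -> known column name (never spliced from input)
ALLOWED_ORDER = {1: "id", 5: "price"}

def safe_page_query(conn, params: dict):
    limit = int(params.get("limit", 10))    # int() raises on injection text
    offset = int(params.get("offset", 0))
    order = ALLOWED_ORDER[int(params.get("orderby", 1))]
    sql = f"SELECT id FROM stat ORDER BY {order} LIMIT ? OFFSET ?"
    return conn.execute(sql, (limit, offset)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stat (id INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO stat (id, price) VALUES (?, ?)",
                 [(i, i * 1.5) for i in range(1, 41)])
print(safe_page_query(conn, {"limit": "10", "offset": "30", "orderby": "1"}))
```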
On 5/13/2010 3:43 AM, Louis-David Mitterrand wrote:
**snip**
>> What do you mean by quite slow?
>
> Like several seconds. I have to cache the results.

Well then, I suggest posting the queries to Performance or here and letting us take a look at them. Don't forget to include the explain/analyze output and the number of records in each table.

>> On a 30K record table, count() and query speed should not be a problem.
>
> This query is a large multi-join of time-series data, not a single
> table. And it's not (prematurely :) optimized.
>
> I'm planning a materialized view for it.

Here, read this: http://www.pgcon.org/2008/schedule/events/69.en.html

The question that begs to be asked is: how big are the tables the query is accessing? If it's not hundreds of thousands to millions of records, or on a stone-age server, my guess is the query can be improved.

**snip**
>> RecordCount/Limit = Number of pages
>> CurrentPage = (offset % RecordCount)/Limit
>
> These simple formulas will be handy.

Don't forget to use floor on these in whatever language you're programming in.
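[Editor's note: the materialized view mentioned above had to be hand-rolled at the time of this thread; modern PostgreSQL (9.3+) provides CREATE MATERIALIZED VIEW and REFRESH MATERIALIZED VIEW natively. A sketch of the hand-rolled form using sqlite3 — store the slow join's result in a real table and page against that instead. The `cruise`/`price` tables and the aggregate are hypothetical stand-ins for the site's real schema:]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cruise (id INTEGER PRIMARY KEY, ship TEXT);
    CREATE TABLE price  (cruise_id INTEGER, amount REAL);
""")
conn.executemany("INSERT INTO cruise VALUES (?, ?)",
                 [(i, f"ship-{i}") for i in range(1, 6)])
conn.executemany("INSERT INTO price VALUES (?, ?)",
                 [(i, 100.0 * i) for i in range(1, 6)])

def refresh_stat_matview(conn):
    """Re-materialize the slow multi-join into a plain, indexable table."""
    conn.executescript("""
        DROP TABLE IF EXISTS stat_mv;
        CREATE TABLE stat_mv AS
            SELECT c.id, c.ship, min(p.amount) AS best_price
            FROM cruise c JOIN price p ON p.cruise_id = c.id
            GROUP BY c.id, c.ship;
    """)

refresh_stat_matview(conn)  # re-run on a schedule or after data changes
print(conn.execute("SELECT count(*) FROM stat_mv").fetchone()[0])  # 5
```

The pager then runs its cheap count(*) and LIMIT/OFFSET queries against `stat_mv`, paying the multi-join cost only at refresh time.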