Thread: Re: Performance issues when the number of records are around 10 Million

Re: Performance issues when the number of records are around 10 Million

From: "Kevin Grittner"
venu madhav  wrote:

>>> AND e.timestamp >= '1270449180'
>>> AND e.timestamp < '1273473180'
>>> ORDER BY
>>> e.cid DESC,
>>> e.cid DESC
>>> limit 21
>>> offset 10539780

> The second column acts as a secondary key for sorting if the
> primary sorting key is a different column. For this query both of
> them are same.

Any chance you could just leave the second one off in that case?

> This query is part of an application which allows user to select
> time ranges and retrieve the data in that interval. Hence the time
> stamp.

Which, of course, is going to affect the number of rows.  Which
leaves me wondering how you know that once you select and sequence
the result set you need to read past and ignore exactly 10539780
rows to get to the last page.

> To have it in some particular order we're doing order by.

Which will affect which rows are at any particular offset.

> If the records are more in the interval,

How do you know that before you run your query?

> we display in sets of 20/30 etc. The user also has the option to
> browse through any of those records hence the limit and offset.

Have you considered alternative techniques for paging?  You might
use values at the edges of the page to run a small query (limit, no
offset) when they page.  You might generate all the pages on the
first pass and cache them for a while.
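
As a rough sketch of the first of those suggestions (the table name "event"
is an assumption, since only the alias "e" appears in the query quoted above,
and "lastcid" stands for the cid value the application remembered from the
bottom edge of the page it last displayed):

   SELECT *
     FROM event e
    WHERE e.timestamp >= '1270449180'
      AND e.timestamp < '1273473180'
      AND e.cid < lastcid   -- resume just past the previous page's edge
    ORDER BY e.cid DESC
    LIMIT 20;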

> When the user asks for the last set of 20 records, this query gets
> executed.

The DESC on the ORDER BY makes it look like you're trying to use the
ORDER BY to get to the end, but then your offset tells PostgreSQL to
skip the 10.5 million result rows with the highest keys.  Is the
"last page" the one with the highest or lowest values for cid?

-Kevin



Re: Performance issues when the number of records are around 10 Million

From: "Kevin Grittner"
venu madhav <venutaurus539@gmail.com> wrote:

>> > If the records are more in the interval,
>>
>> How do you know that before you run your query?
>>
> I calculate the count first.

This and other comments suggest that the data is totally static
while this application is running.  Is that correct?

> If we generate all the pages at once, retrieving all the 10 M
> records at once, it would take much longer

Are you sure of that?  It seems to me that it's going to read all
ten million rows once for the count and again for the offset.  It
might actually be faster to pass them just once and build the pages.

Also, you didn't address the issue of storing enough information on
the page to read off either edge in the desired sequence with just a
LIMIT and no offset.  "Last page" or "page up" would need to reverse
the direction on the ORDER BY.  This would be very fast if you have
appropriate indexes.  Your current technique can never be made very
fast.
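
As a rough sketch, "last page" under that scheme could be fetched with
something like the following (the table name "event" is an assumption,
since only the alias "e" appears in the quoted query; the application
would reverse the returned rows so the page still reads in descending
cid order):

   SELECT *
     FROM event e
    WHERE e.timestamp >= '1270449180'
      AND e.timestamp < '1273473180'
    ORDER BY e.cid ASC   -- direction reversed: start from the low end
    LIMIT 20;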

-Kevin

Re: Performance issues when the number of records are around 10 Million

From: Craig James
On 5/12/10 4:55 AM, Kevin Grittner wrote:
> venu madhav  wrote:
>> we display in sets of 20/30 etc. The user also has the option to
>> browse through any of those records hence the limit and offset.
>
> Have you considered alternative techniques for paging?  You might
> use values at the edges of the page to run a small query (limit, no
> offset) when they page.  You might generate all the pages on the
> first pass and cache them for a while.

Kevin is right.  You need to use "hitlists" - a semi-temporary table that holds the results of your initial query.
You're repeating a complex, expensive query over and over, once for each page of data that the user wants to see.
Instead, using a hitlist, your initial query looks something like this:

create sequence hitlist_seq;   -- supplies the sortorder default below

create table hitlist_xxx(
    objectid integer,
    sortorder integer default nextval('hitlist_seq')
);

insert into hitlist_xxx (objectid)
     (select ... your original query ... order by ...)

You store some object ID or primary key in the "hitlist" table, and the sequence records your original order.

Then when your user asks for page 1, 2, 3 ... N, all you have to do is join your hitlist to your original data:

   select ... from mytables join hitlist_xxx on (...)
      where sortorder >= 100 and sortorder < 120;

which would instantly return one 20-row page of your data.

To do this, you need a way to know when a user is finished so that you can discard the hitlist.
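
(One possibility, not spelled out above: tie the hitlist table's name to the
user's session and drop it when the session ends or times out.)

   -- when the session that created hitlist_xxx goes away:
   DROP TABLE IF EXISTS hitlist_xxx;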

Craig

Re: Performance issues when the number of records are around 10 Million

From: "Kevin Grittner"
venu madhav <venutaurus539@gmail.com> wrote:
> Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:

>> > I calculate the count first.
>>
>> This and other comments suggest that the data is totally static
>> while this application is running.  Is that correct?
>>
> No, the data gets added when the application is running. As I've
> mentioned before it could be as fast as 100-400 records per
> second. And it is an important application which will be running
> 24/7.

Then how can you trust that the count you run before selecting is
accurate when you run the SELECT?  Are they both in the same
REPEATABLE READ or SERIALIZABLE transaction?
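
In PostgreSQL that would look roughly like this (a sketch only; the table
name "event" is an assumption, and the statements stand in for whatever the
application actually runs):

   BEGIN ISOLATION LEVEL REPEATABLE READ;

   SELECT count(*)
     FROM event e
    WHERE e.timestamp >= '1270449180'
      AND e.timestamp < '1273473180';

   SELECT *
     FROM event e
    WHERE e.timestamp >= '1270449180'
      AND e.timestamp < '1273473180'
    ORDER BY e.cid DESC
    LIMIT 21 OFFSET 10539780;

   COMMIT;

Both statements then see the same snapshot, so rows logged by the 24/7
application cannot make the count disagree with the rows the OFFSET skips.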

>> Also, you didn't address the issue of storing enough information
>> on the page to read off either edge in the desired sequence with
>> just a LIMIT and no offset.  "Last page" or "page up" would need
>> to reverse the direction on the ORDER BY.  This would be very
>> fast if you have appropriate indexes.  Your current technique can
>> never be made very fast.
>>
> I actually didn't understand what you meant when you said
> "storing enough information on the page to read off either edge in
> the desired sequence with just a LIMIT and no offset". What kind
> of information can we store to improve the performance?

Well, for starters, it's entirely possible that the "hitlist"
approach posted by Craig James will work better for you than what
I'm about to describe.  Be sure to read this post carefully:

http://archives.postgresql.org/pgsql-performance/2010-05/msg00058.php

The reason that might work better than the idea I was suggesting is
that the combination of selecting on timestamp and ordering by
something else might make it hard to use reasonable indexes to
position and limit well enough for the technique I was suggesting to
perform well.  It's hard to say without testing.

For what I was describing, you must use an ORDER BY which guarantees
a consistent sequence for the result rows.  I'm not sure whether you
always have that currently; if not, that's another nail in the
coffin of your current technique, since the same OFFSET into the
result might be different rows from one time to the next, even if
data didn't change.  If your ORDER BY can't guarantee a unique set
of ordering values for every row in the result set, you need to add
any missing columns from a unique index (usually the primary key) to
the ORDER BY clause.
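
For example, if the application ever sorts on a column that is not unique,
appending the primary key keeps the order deterministic (assuming here that
cid is the primary key, which the thread suggests but does not state
outright):

   -- instead of:  ORDER BY e.timestamp DESC
   -- use:         ORDER BY e.timestamp DESC, e.cid DESC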

Anyway, once you are sure you have an ORDER BY which is
deterministic, you make sure your software remembers the ORDER BY
values for the first and last entries on the page.  Then you can do
something like (abstractly):

SELECT x, y, z
  FROM a, b
  WHERE ts BETWEEN m AND n
    AND a.x = b.a_x
    AND (x, y) > (lastx, lasty)
  ORDER BY x, y
  LIMIT 20;

With the right indexes, data distributions, selection criteria, and
ORDER BY columns -- that *could* be very fast.  If not, look at
Craig's post.

-Kevin

Re: Performance issues when the number of records are around 10 Million

From: venu madhav


On Wed, May 12, 2010 at 5:25 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
venu madhav  wrote:

>>> AND e.timestamp >= '1270449180'
>>> AND e.timestamp < '1273473180'
>>> ORDER BY
>>> e.cid DESC,
>>> e.cid DESC
>>> limit 21
>>> offset 10539780

> The second column acts as a secondary key for sorting if the
> primary sorting key is a different column. For this query both of
> them are same.

Any chance you could just leave the second one off in that case?
[Venu] Yes, that can be ignored. But I am not sure that removing it would reduce the time drastically.

> This query is part of an application which allows user to select
> time ranges and retrieve the data in that interval. Hence the time
> stamp.

Which, of course, is going to affect the number of rows.  Which
leaves me wondering how you know that once you select and sequence
the result set you need to read past and ignore exactly 10539780
rows to get to the last page.
[Venu] For example: my database has 10539793 records. My application first calculates the count of records in that interval. Then, based on the user's request to display 10/20/30/40 records per page, it calculates how many records to display when the "last" link is clicked.
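
(With 10539793 matching rows and 20 rows per page, the last page would hold
10539793 mod 20 = 13 rows, which is presumably how the offset of 10539780 in
the query quoted above was computed.)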

> To have it in some particular order we're doing order by.

Which will affect which rows are at any particular offset.
[Venu] Yes, by default it uses the primary key for the order by.

> If the records are more in the interval,

How do you know that before you run your query?
 [Venu] I calculate the count first.

> we display in sets of 20/30 etc. The user also has the option to
> browse through any of those records hence the limit and offset.

Have you considered alternative techniques for paging?  You might
use values at the edges of the page to run a small query (limit, no
offset) when they page.  You might generate all the pages on the
first pass and cache them for a while.

[Venu] If we generate all the pages at once, retrieving all the 10 M records at once, it would take much longer, and since the request comes from the browser, there is a chance of the browser timing out.
> When the user asks for the last set of 20 records, this query gets
> executed.

The DESC on the ORDER BY makes it look like you're trying to use the
ORDER BY to get to the end, but then your offset tells PostgreSQL to
skip the 10.5 million result rows with the highest keys.  Is the
"last page" the one with the highest or lowest values for cid?

[Venu] The last page contains the lowest values of cid. By default we get the records in the decreasing order of cid and then get the last 10/20.

Thank you,
Venu.
-Kevin



Re: Performance issues when the number of records are around 10 Million

From: venu madhav


On Wed, May 12, 2010 at 7:26 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
venu madhav <venutaurus539@gmail.com> wrote:

>> > If the records are more in the interval,
>>
>> How do you know that before you run your query?
>>
> I calculate the count first.

This and other comments suggest that the data is totally static
while this application is running.  Is that correct?
[Venu] No, the data gets added when the application is running. As I've mentioned before, it could be as fast as 100-400 records per second. And it is an important application which will be running 24/7.

> If we generate all the pages at once, retrieving all the 10 M
> records at once, it would take much longer

Are you sure of that?  It seems to me that it's going to read all
ten million rows once for the count and again for the offset.  It
might actually be faster to pass them just once and build the pages.
[Venu] Even if the retrieval is faster, the client which views the database and the server where the data gets logged can be anywhere on the globe. So it is not feasible to get all the 1 or 10 M records at once from the server to the client.
 

Also, you didn't address the issue of storing enough information on
the page to read off either edge in the desired sequence with just a
LIMIT and no offset.  "Last page" or "page up" would need to reverse
the direction on the ORDER BY.  This would be very fast if you have
appropriate indexes.  Your current technique can never be made very
fast.
[Venu] I actually didn't understand what you meant when you said "storing enough information on the page to read off either edge in the desired sequence with just a LIMIT and no offset". What kind of information can we store to improve the performance?  Reversing the order by is one thing; I am trying to figure out how fast it is. Thanks a lot for this suggestion.

Thank you,
Venu.

-Kevin