Re: Performance issues when the number of records are around 10 Million - Mailing list pgsql-performance

From Kevin Grittner
Subject Re: Performance issues when the number of records are around 10 Million
Date
Msg-id 4BEBBEA70200002500031699@gw.wicourts.gov
In response to Re: Performance issues when the number of records are around 10 Million  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
List pgsql-performance
venu madhav <venutaurus539@gmail.com> wrote:
> Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:

>> > I calculate the count first.
>>
>> This and other comments suggest that the data is totally static
>> while this application is running.  Is that correct?
>>
> No, the data gets added while the application is running. As I've
> mentioned before, it could be as fast as 100-400 records per
> second. And it is an important application which will be running
> 24/7.

Then how can you trust that the count you run before selecting is
accurate when you run the SELECT?  Are they both in the same
REPEATABLE READ or SERIALIZABLE transaction?
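As a minimal sketch of what I mean (event_log, ts, and id are
made-up names standing in for your table, timestamp column, and
primary key, and I'm reusing your current OFFSET style only to show
where the snapshot matters):

BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM event_log WHERE ts BETWEEN m AND n;
SELECT * FROM event_log
  WHERE ts BETWEEN m AND n
  ORDER BY ts, id
  OFFSET 100 LIMIT 20;
COMMIT;

Both statements run against the same snapshot, so the count can't
drift away from what the paged SELECT sees, no matter how fast rows
are being inserted.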

>> Also, you didn't address the issue of storing enough information
>> on the page to read off either edge in the desired sequence with
>> just a LIMIT and no offset.  "Last page" or "page up" would need
>> to reverse the direction on the ORDER BY.  This would be very
>> fast if you have appropriate indexes.  Your current technique can
>> never be made very fast.
>>
> I actually didn't understand what you meant when you said
> "storing enough information on the page to read off either edge in
> the desired sequence with just a LIMIT and no offset". What kind
> of information can we store to improve the performance?

Well, for starters, it's entirely possible that the "hitlist"
approach posted by Craig James will work better for you than what
I'm about to describe.  Be sure to read this post carefully:

http://archives.postgresql.org/pgsql-performance/2010-05/msg00058.php

The reason that might work better than the idea I was suggesting is
that the combination of selecting on timestamp and ordering by
something else might make it hard to use reasonable indexes to
position and limit well enough for the technique I was suggesting to
perform well.  It's hard to say without testing.

For what I was describing, you must use an ORDER BY which guarantees
a consistent sequence for the result rows.  I'm not sure whether you
always have that currently; if not, that's another nail in the
coffin of your current technique, since the same OFFSET into the
result might be different rows from one time to the next, even if
data didn't change.  If your ORDER BY can't guarantee a unique set
of ordering values for every row in the result set, you need to add
any missing columns from a unique index (usually the primary key) to
the ORDER BY clause.
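
To make that concrete (a sketch only; event_log, ts, and id are
hypothetical stand-ins for your table, timestamp column, and primary
key), if you currently sort on a timestamp that can repeat, append
the primary key:

SELECT *
  FROM event_log
  WHERE ts BETWEEN m AND n
  ORDER BY ts, id
  LIMIT 20;

Because (ts, id) is unique for every row, the result sequence is
fully determined, which is the property the technique below depends
on.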

Anyway, once you are sure you have an ORDER BY which is
deterministic, you make sure your software remembers the ORDER BY
values for the first and last entries on the page.  Then you can do
something like (abstractly):

SELECT x, y, z
  FROM a, b
  WHERE ts BETWEEN m AND n
    AND a.x = b.a_x
    AND (x, y) > (lastx, lasty)
  ORDER BY x, y
  LIMIT 20;

With the right indexes, data distributions, selection criteria, and
ORDER BY columns -- that *could* be very fast.  If not, look at
Craig's post.
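
For the "page up" / "last page" case mentioned earlier, the same
idea runs in the other direction. A sketch, reusing the abstract
names above (firstx and firsty stand for the ORDER BY values of the
first row currently displayed):

SELECT x, y, z
  FROM a, b
  WHERE ts BETWEEN m AND n
    AND a.x = b.a_x
    AND (x, y) < (firstx, firsty)
  ORDER BY x DESC, y DESC
  LIMIT 20;

The rows come back in reverse order, so the application reverses
them again before display (or wraps the query and re-sorts
ascending).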

-Kevin
