Thread: Hardware upgrade for a high-traffic database
Hi All,

We're currently running Postgres 7.4.1 on FreeBSD 5.2, a Dual Xeon 2.4, 2GB
ECC, 3Ware Serial ATA RAID 5 w/ 4 disks (SLOW!!).

Our database is about 20GB on disk, we have some quite large tables - 2M
rows with TEXT fields in a sample table, accessed constantly. We average
about 4,000 - 5,000 queries per second - all from web traffic. As you can
imagine, we're quite disk limited and checkpoints can be killer.
Additionally, we see queries and connections getting serialized due to
queries that take a long time (5 sec or so) while waiting on disk access.
No fun at all.

We've tweaked everything long and hard, and at the end of the day, the disk
is killing us.

We're looking to upgrade our server - or rather, replace it, as it has no
upgrade path to SCSI. I'm considering going Opteron (though right now we
don't need more CPU time), and am looking for suggestions on what an
optimal RAID configuration may look like (disks, controller, cache
setting). We're in the market to buy right now - any good vendor
suggestions?

I'd appreciate any input, thanks!

Jason
Hi Rod,

Actually, we're already using a substantial caching system in code for
nearly all pages delivered - we've exhausted that option. Our system uses a
login/session table for about 1/8 of our page views (those visitors who are
logged in), and has tracking features. While I'd love to scrap them and
give the database server a vacation, it's a requirement for us.

You're correct about the query caching (stored in memory) being used - most
of our queries are run once and then come from memory (or, based on the
speed of consecutive executions, that seems to be the case). Once a user
hits a page for the first time in an hour or so, it seems to cache their
session query.

The issue that I think we're seeing is that the performance on the 3Ware
RAID is quite bad. Watching FreeBSD systat will show it at "100% busy" at
around "3.5 MB/s". When it needs to seek across a table (for, say, an
aggregate function - typically a COUNT()), it slows the entire server down
while working on the disk. Additionally, VACUUMs make the server
practically useless. We have indexes on everything that's used in queries,
and the planner is using them.

The server has 2GB of physical memory, however it only uses between 130MB
and 200MB of it. Postgres is the only application running on the server.

Our pertinent settings look like this:

max_connections = 512
shared_buffers = 20000
sort_mem = 2000
vacuum_mem = 20000
effective_cache_size = 300000
fsync = false
wal_sync_method = fsync
wal_buffers = 32
checkpoint_segments = 2
checkpoint_timeout = 30
commit_delay = 10000

Typically, we don't use anywhere near the 512 connections - however there
are peak hours where we come close, and other times that we eclipse it and
run out (should some connections become serialized due to a slowdown).
It's not something that we can comfortably lower. The non-standard
checkpoint settings have helped make it less likely that a large (in disk
time) query will conflict with a checkpoint write.

I'm a programmer - definitely not a DBA by any stretch - though I am forced
into the role. From reading this list, it seems to me that our settings
are reasonable given our usage, and that a disk upgrade is likely in order.

I'd love to hear any suggestions.

Thanks,

Jason

-----Original Message-----
From: Rod Taylor [mailto:pg@rbt.ca]
Sent: Tuesday, August 10, 2004 7:07 PM
To: Jason Coene
Cc: Postgresql Performance
Subject: Re: [PERFORM] Hardware upgrade for a high-traffic database

> Our database is about 20GB on disk, we have some quite large tables - 2M
> rows with TEXT fields in a sample table, accessed constantly. We average
> about 4,000 - 5,000 queries per second - all from web traffic. As you can

99% is reads? and probably the same data over and over again? You might
want to think about a small code change to cache sections of page output in
memory for the most commonly generated pages (there are usually 3 or 4 that
account for 25% to 50% of web traffic -- starting pages).

The fact you're getting 5k queries/second off IDE drives tells me most of
the active data is in memory -- so your actual working data set is probably
quite small (less than 10% of the 20GB).

If the above is all true (mostly reads, smallish dataset, etc.) and the
database is not growing very quickly, you might want to look into RAM and
RAM bandwidth over disk. An Opteron with 8GB ram using the same old IDE
drives. Get a mobo with a SCSI raid controller in it, so the disk component
can be upgraded in the future (when necessary).
On Tue, 2004-08-10 at 13:17, Jason Coene wrote:
> Hi All,
>
> We're currently running Postgres 7.4.1 on FreeBSD 5.2, a Dual Xeon 2.4, 2GB
> ECC, 3Ware Serial ATA RAID 5 w/ 4 disks (SLOW!!).
>
> Our database is about 20GB on disk, we have some quite large tables - 2M
> rows with TEXT fields in a sample table, accessed constantly. We average
> about 4,000 - 5,000 queries per second - all from web traffic. As you can
> imagine, we're quite disk limited and checkpoints can be killer.
> Additionally, we see queries and connections getting serialized due to
> queries that take a long time (5 sec or so) while waiting on disk access.
> No fun at all.
>
> We've tweaked everything long and hard, and at the end of the day, the disk
> is killing us.
>
> We're looking to upgrade our server - or rather, replace it as it has no
> upgrade path to SCSI. I'm considering going Opteron (though right now we
> don't need more CPU time), and am looking for suggestions on what an optimal
> RAID configuration may look like (disks, controller, cache setting). We're
> in the market to buy right now - any good vendor suggestions?

I've had very good luck with LSI MegaRAID controllers with battery backed
cache. The amount of cache doesn't seem as important as having it, and
having it set for write back.

After that, 2 gigs or more of memory is the next improvement. After that,
the speed of the memory.
> Our database is about 20GB on disk, we have some quite large tables - 2M
> rows with TEXT fields in a sample table, accessed constantly. We average
> about 4,000 - 5,000 queries per second - all from web traffic. As you can

99% is reads? and probably the same data over and over again? You might
want to think about a small code change to cache sections of page output in
memory for the most commonly generated pages (there are usually 3 or 4 that
account for 25% to 50% of web traffic -- starting pages).

The fact you're getting 5k queries/second off IDE drives tells me most of
the active data is in memory -- so your actual working data set is probably
quite small (less than 10% of the 20GB).

If the above is all true (mostly reads, smallish dataset, etc.) and the
database is not growing very quickly, you might want to look into RAM and
RAM bandwidth over disk. An Opteron with 8GB ram using the same old IDE
drives. Get a mobo with a SCSI raid controller in it, so the disk component
can be upgraded in the future (when necessary).
> We're currently running Postgres 7.4.1 on FreeBSD 5.2, a Dual Xeon 2.4,
> 2GB ECC, 3Ware Serial ATA RAID 5 w/ 4 disks (SLOW!!).

Cheap solution while you look for another server:

Try to use something other than RAID 5. You have 4 disks, so you could use
a striping+mirroring RAID which would boost performance. You can switch
with minimal downtime (copy files to another HDD, change RAID parameters,
copy again...) - maybe 1 hour? If your hardware supports it, of course.

And tell us how it works!
> The issue that I think we're seeing is that the performance on the 3Ware
> RAID is quite bad, watching FreeBSD systat will show it at "100% busy" at
> around "3.5 MB/s". When it needs to seek across a table (for, say, an
> aggregate function - typically a COUNT()), it slows the entire server down
> while working on the disk. Additionally, VACUUM's make the server
> practically useless. We have indexes on everything that's used in queries,
> and the planner is using them.

It sounds to me like your application is CPU bound, except when
vacuuming...then your server is just overloaded. A higher performance i/o
system will help when vacuuming and checkpointing but will not solve the
overall problem.

With a (good & well supported) battery backed raid controller you can turn
fsync back on which will help you with your i/o storm issues (plus the
safety issue). This will be particularly important if you follow my next
suggestion.

One thing you might consider is materialized views. Your aggregate
functions are killing you...try to avoid using them (except min/max on an
index). Just watch out for mutable functions like now().

http://www.varlena.com/varlena/GeneralBits/Tidbits/matviews.html

An application specific approach is to use triggers to keep the data you
need in as close to query form as possible...you can reap enormous savings
particularly if your queries involve 3 or more tables or have large
aggregate scans.

Merlin
On Wed, 2004-08-11 at 17:51, Merlin Moncure wrote:
> One thing you might consider is materialized views. Your aggregate
> functions are killing you...try to avoid using them (except min/max on
> an index). Just watch out for mutable functions like now().
>
> http://www.varlena.com/varlena/GeneralBits/Tidbits/matviews.html
>
> An application specific approach is to use triggers to keep the data you
> need in as close to query form as possible...you can reap enormous
> savings particularly if your queries involve 3 or more tables or have
> large aggregate scans.

I thought materialized views support in pgsql was experimental as yet. Is
the pg mat-view code up to production servers? Also, do you have to delete
mat-views before you dump the db, or does dump automatically not dump the
mat-view data?

Sanjay.
> On Wed, 2004-08-11 at 17:51, Merlin Moncure wrote:
>
> > One thing you might consider is materialized views. Your aggregate
> > functions are killing you...try to avoid using them (except min/max on
> > an index). Just watch out for mutable functions like now().
> >
> > http://www.varlena.com/varlena/GeneralBits/Tidbits/matviews.html
> >
> > An application specific approach is to use triggers to keep the data you
> > need in as close to query form as possible...you can reap enormous
> > savings particularly if your queries involve 3 or more tables or have
> > large aggregate scans.
>
> I thought materialized views support in pgsql was experimental as yet. Is
> the pg mat-view code up to production servers? Also, do you have to delete
> mat-views before you dump the db, or does dump automatically not dump the
> mat-view data?

I think you are thinking about 100% 'true' materialized views. In that
case the answer is no, the server does not have them. The GeneralBits
article describes how to emulate them through pl/pgsql triggers. I just
bumped into the article yesterday and was very impressed by it, I have to
admit. Note: I have never tried the method, but it should work. I cc'd
the author, who perhaps might chime in and tell you more about them.

Materialized views can give performance savings so good that the tpc people
had to ban them from benchmarks because they skewed results...:) In
postgres, they can help a lot with aggregates...there are many gotchas tho,
for example keeping a count() up to date can get kind of tricky. If you
can get them to work, the filesystem cache efficiency will rocket
upwards...YMMV.

Getting back on topic, I missed the original post where the author stated
his problems were i/o related, not cpu (contrary to my speculation). I
wonder what his insert/update load is?

Merlin
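[To make the trigger-emulation idea above concrete, here is a minimal
sketch; the table, column, and function names are hypothetical rather than
from Jason's schema, it assumes plpgsql is installed and that a summary row
already exists for each topic, and it uses 7.4-era quoting since
dollar-quoting did not exist yet.]

    -- summary table kept up to date by triggers, so hot pages read one row
    -- instead of running COUNT(*) over the comments table
    create table topic_comment_counts (
        topicid   int4 primary key,
        ncomments int4 not null default 0
    );

    create function maintain_comment_count() returns trigger as '
    begin
        if TG_OP = ''INSERT'' then
            update topic_comment_counts set ncomments = ncomments + 1
             where topicid = NEW.topicid;
        elsif TG_OP = ''DELETE'' then
            update topic_comment_counts set ncomments = ncomments - 1
             where topicid = OLD.topicid;
        end if;
        return null;  -- after trigger, return value is ignored
    end;
    ' language plpgsql;

    create trigger comments_count_trg after insert or delete on comments
        for each row execute procedure maintain_comment_count();

The cost is a little extra write I/O per insert/delete, plus the
count-maintenance gotchas Merlin mentions (for example, concurrent inserts
on the same topic serialize on the summary row).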
Thanks for all the feedback. To clear it up, we are definitely not CPU
bound at the moment. Any slowdown seems to be disk dependent, or from
serialization due to a long query (due to disk).

We do have a lot of INSERT/UPDATE calls, specifically on tables that track
user sessions, then of course things like comments, etc (where we'll see
10-30 INSERTs per second, with a TEXT field, and hundreds of reads per
second). Additionally, our system does use a lot of aggregate functions.
I'll look into materialized views, it sounds like it may be worth
implementing.

One question I do have though - you specifically mentioned NOW() as
something to watch out for, in that it's mutable. We typically use COUNT()
as a subselect to retrieve the number of associated rows to the current
query. Additionally, we use NOW a lot, primarily to detect the status of a
date, i.e.:

SELECT id FROM subscriptions WHERE userid = 11111 AND timeend > NOW();

Is there a better way to do this? I was under the impression that NOW()
was pretty harmless, just to return a current timestamp.

Based on feedback, I'm looking at a minor upgrade of our RAID controller to
a 3ware 9000 series (SATA with cache, battery backup optional), and
re-configuring it for RAID 10. It's a damn cheap upgrade at around $350
and an hour of downtime, so I figure that it's worth it for us to give it a
shot.

Thanks,

Jason

-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Merlin Moncure
Sent: Wednesday, August 11, 2004 1:04 PM
To: skpobox@gawab.com
Cc: Postgresql Performance; jgardner@jonathangardner.net
Subject: Re: [PERFORM] Hardware upgrade for a high-traffic database

> On Wed, 2004-08-11 at 17:51, Merlin Moncure wrote:
>
> > One thing you might consider is materialized views. Your aggregate
> > functions are killing you...try to avoid using them (except min/max on
> > an index). Just watch out for mutable functions like now().
> >
> > http://www.varlena.com/varlena/GeneralBits/Tidbits/matviews.html
> >
> > An application specific approach is to use triggers to keep the data you
> > need in as close to query form as possible...you can reap enormous
> > savings particularly if your queries involve 3 or more tables or have
> > large aggregate scans.
>
> I thought materialized views support in pgsql was experimental as yet. Is
> the pg mat-view code up to production servers? Also, do you have to delete
> mat-views before you dump the db, or does dump automatically not dump the
> mat-view data?

I think you are thinking about 100% 'true' materialized views. In that
case the answer is no, the server does not have them. The GeneralBits
article describes how to emulate them through pl/pgsql triggers. I just
bumped into the article yesterday and was very impressed by it, I have to
admit. Note: I have never tried the method, but it should work. I cc'd
the author, who perhaps might chime in and tell you more about them.

Materialized views can give performance savings so good that the tpc people
had to ban them from benchmarks because they skewed results...:) In
postgres, they can help a lot with aggregates...there are many gotchas tho,
for example keeping a count() up to date can get kind of tricky. If you
can get them to work, the filesystem cache efficiency will rocket
upwards...YMMV.

Getting back on topic, I missed the original post where the author stated
his problems were i/o related, not cpu (contrary to my speculation). I
wonder what his insert/update load is?
Merlin
Jason,

One suggestion I have: stay away from Adaptec ZCR RAID products. We've
been doing testing on them, and they don't perform well at all.

--brian

On Aug 11, 2004, at 1:08 PM, Jason Coene wrote:
> Thanks for all the feedback. To clear it up, we are definitely not CPU
> bound at the moment. Any slowdown seems to be disk dependent, or from
> serialization due to a long query (due to disk).
> We do have a lot of INSERT/UPDATE calls, specifically on tables that track
> user sessions, then of course things like comments, etc (where we'll see
> 10-30 INSERTs per second, with a TEXT field, and hundreds of reads per
> second). Additionally, our system does use a lot of aggregate functions.
> I'll look into materialized views, it sounds like it may be worth
> implementing.

Right. The point is: is your i/o bottleneck on the read side or the write
side? With 10-30 inserts/sec and fsync off, it's definitely on the read
side. What's interesting is that such a low insert load is causing i/o
storm problems. How does your app run with fsync on?

With read-bound i/o problems, you might want to consider upgrading memory
first to get better cache efficiency. You may want to consider Opteron for
>4GB allocations (yummy!).

The good news is that read problems are usually solvable by being clever,
whereas write problems require hardware.

> One question I do have though - you specifically mentioned NOW() as
> something to watch out for, in that it's mutable. We typically use

This is specifically with regards to materialized views. Mutable functions
cause problems because when they are pushed onto the view, they are
refreshed...something to watch out for.

The trick with MVs is to increase your filesystem cache efficiency. The
big picture is to keep frequently read data in a single place to make
better benefit of cache. Aggregates naturally read multiple rows to return
a single row's worth of data, so you want to target them first. This all
comes at a cost of update I/O time and some application complexity.

> as a subselect to retrieve the number of associated rows to the current
> query. Additionally, we use NOW a lot, primarily to detect the status of
> a date, i.e.:

Might want to check if your application middleware (php?) exposes
PQntuples()...this is a zero cost way to get the same information.

> Based on feedback, I'm looking at a minor upgrade of our RAID controller
> to a 3ware 9000 series (SATA with cache, battery backup optional), and
> re-configuring it for RAID 10. It's a damn cheap upgrade at around $350
> and an hour of downtime, so I figure that it's worth it for us to give it
> a shot.

p.s. you can also increase cache efficiency by reducing database size, for
example use int2/int4 vs. numerics.

Good luck!
>
> Right. The point is: is your i/o bottleneck on the read side or the write
> side? With 10-30 inserts/sec and fsync off, it's definitely on the read
> side. What's interesting is that such a low insert load is causing i/o
> storm problems. How does your app run with fsync on?
>
> With read-bound i/o problems, you might want to consider upgrading memory
> first to get better cache efficiency. You may want to consider Opteron
> for >4GB allocations (yummy!).
>
> The good news is that read problems are usually solvable by being clever,
> whereas write problems require hardware.

The difference fsync being off makes seems to be that it allows the server
to write in groups instead of scattering our INSERT/UPDATE calls all over -
it helps keep things going. When a checkpoint occurs, reads slow down
there. Normal reads are usually quite fast, aside from some reads.

A good example: a comments table where users submit TEXT data. A common
query is to find the last 5 comments a user has submitted. The scan, while
using an index, takes a considerable amount of time (> 0.5 sec is about as
good as it gets). Again, it's using an index on the single WHERE clause
(userid = int). The field that's used to ORDER BY (timestamp) is also
indexed.

I'm wondering why our PG server is using so little memory... The system has
2GB of memory, though only around 200MB of it are used. Is there a PG
setting to force more memory usage towards the cache? Additionally, we use
FreeBSD. I've heard that Linux may manage that memory better, any truth
there? Sorry if I'm grabbing at straws here :)

> > One question I do have though - you specifically mentioned NOW() as
> > something to watch out for, in that it's mutable. We typically use
>
> This is specifically with regards to materialized views. Mutable
> functions cause problems because when they are pushed onto the view,
> they are refreshed...something to watch out for.
>
> The trick with MVs is to increase your filesystem cache efficiency. The
> big picture is to keep frequently read data in a single place to make
> better benefit of cache. Aggregates naturally read multiple rows to
> return a single row's worth of data, so you want to target them first.
> This all comes at a cost of update I/O time and some application
> complexity.
>
> > as a subselect to retrieve the number of associated rows to the current
> > query. Additionally, we use NOW a lot, primarily to detect the status
> > of a date, i.e.:
>
> Might want to check if your application middleware (php?) exposes
> PQntuples()...this is a zero cost way to get the same information.

Thanks, I'll look into it. We use C and PHP.

> > Based on feedback, I'm looking at a minor upgrade of our RAID controller
> > to a 3ware 9000 series (SATA with cache, battery backup optional), and
> > re-configuring it for RAID 10. It's a damn cheap upgrade at around $350
> > and an hour of downtime, so I figure that it's worth it for us to give
> > it a shot.
>
> p.s. you can also increase cache efficiency by reducing database size,
> for example use int2/int4 vs. numerics.

I've gone through and optimized data types as much as possible. I'll see
what else we can do w/o causing downtime once PG 8 is ready to go and we
can change data types on the fly.

Thanks,

Jason
On Aug 11, 2004, at 3:18 PM, Jason Coene wrote:
>
> I'm wondering why our PG server is using so little memory... The system
> has 2GB of memory, though only around 200MB of it are used. Is there a
> PG setting to force more memory usage towards the cache? Additionally,
> we use FreeBSD. I've heard that Linux may manage that memory better,
> any truth there? Sorry if I'm grabbing at straws here :)
>

i don't know about freebsd, but linux is very aggressive about using unused
memory for disk cache. we have a dedicated linux box running pg with 2gb
of memory; about 250mb of memory is being used by processes
(system+postgres) and shared memory (postgres only), and there is 1.75gb of
disk buffers in use in the kernel. this particular database is only about
4gb, so a good portion of the db resides in memory, certainly most of the
active set. the disk subsystem is a 6 disk scsi u160 raid array which
performs pretty well when there is io.
> I'm wondering why our PG server is using so little memory... The system has
> 2GB of memory, though only around 200MB of it are used. Is there a PG

This is the second time you've said this. Surely you're not implying there
is 1.8GB Free Memory -- rather than 1.8GB in Buffers or Cache.

Send output of the below:

sysctl vm
sysctl -a | grep buffers
top | grep -E "(Mem|Swap):"
On Wed, 2004-08-11 at 17:31, Brian Hirt wrote:
> On Aug 11, 2004, at 3:18 PM, Jason Coene wrote:
> >
> > I'm wondering why our PG server is using so little memory... The system
> > has 2GB of memory, though only around 200MB of it are used. Is there a
> > PG setting to force more memory usage towards the cache? Additionally,
> > we use FreeBSD. I've heard that Linux may manage that memory better,
> > any truth there? Sorry if I'm grabbing at straws here :)
> >
> i don't know about freebsd, but linux is very aggressive about using
> unused memory for disk cache. we have a dedicated linux box running pg

Aggressive indeed.. I'm stuck with the version that has a tendency to swap
out active processes rather than abandon disk cache -- it gets very
annoying!
> -----Original Message-----
> From: Rod Taylor [mailto:pg@rbt.ca]
> Sent: Wednesday, August 11, 2004 5:46 PM
> To: Jason Coene
> Cc: 'Merlin Moncure'; Postgresql Performance
> Subject: Re: [PERFORM] Hardware upgrade for a high-traffic database
>
> > I'm wondering why our PG server is using so little memory... The system
> > has 2GB of memory, though only around 200MB of it are used. Is there a PG
>
> This is the second time you've said this. Surely you're not implying
> there is 1.8GB Free Memory -- rather than 1.8GB in Buffers or Cache.

Hi Rod,

I was looking at top and vmstat - which always show under 300MB "Active".
We may hit 400MB at peak. Everything I see (though this isn't my area of
expertise) points to most of the memory simply being unused. Results
below, am I missing something?

Jason

> Send output of the below:
>
> sysctl vm

d01> sysctl vm
vm.vmtotal:
System wide totals computed every five seconds: (values in kilobytes)
===============================================
Processes: (RUNQ: 1 Disk Wait: 0 Page Wait: 0 Sleep: 149)
Virtual Memory: (Total: 2101614K, Active 440212K)
Real Memory: (Total: 2023532K Active 327032K)
Shared Virtual Memory: (Total: 14356K Active: 3788K)
Shared Real Memory: (Total: 4236K Active: 2456K)
Free Memory Pages: 88824K

vm.loadavg: { 0.46 0.41 0.42 }
vm.v_free_min: 3312
vm.v_free_target: 13997
vm.v_free_reserved: 749
vm.v_inactive_target: 20995
vm.v_cache_min: 13997
vm.v_cache_max: 27994
vm.v_pageout_free_min: 34
vm.pageout_algorithm: 0
vm.swap_enabled: 1
vm.swap_async_max: 4
vm.dmmax: 32
vm.nswapdev: 1
vm.swap_idle_threshold1: 2
vm.swap_idle_threshold2: 10
vm.v_free_severe: 2030
vm.stats.sys.v_swtch: 627853362
vm.stats.sys.v_trap: 3622664114
vm.stats.sys.v_syscall: 1638589210
vm.stats.sys.v_intr: 3250875036
vm.stats.sys.v_soft: 1930666043
vm.stats.vm.v_vm_faults: 3197534554
vm.stats.vm.v_cow_faults: 2999625102
vm.stats.vm.v_cow_optim: 10093309
vm.stats.vm.v_zfod: 3603956919
vm.stats.vm.v_ozfod: 3104475907
vm.stats.vm.v_swapin: 3353
vm.stats.vm.v_swapout: 3382
vm.stats.vm.v_swappgsin: 3792
vm.stats.vm.v_swappgsout: 7213
vm.stats.vm.v_vnodein: 14675
vm.stats.vm.v_vnodeout: 140671
vm.stats.vm.v_vnodepgsin: 24330
vm.stats.vm.v_vnodepgsout: 245840
vm.stats.vm.v_intrans: 3643
vm.stats.vm.v_reactivated: 35038
vm.stats.vm.v_pdwakeups: 26984
vm.stats.vm.v_pdpages: 335769007
vm.stats.vm.v_dfree: 8
vm.stats.vm.v_pfree: 1507856856
vm.stats.vm.v_tfree: 430723755
vm.stats.vm.v_page_size: 4096
vm.stats.vm.v_page_count: 512831
vm.stats.vm.v_free_reserved: 749
vm.stats.vm.v_free_target: 13997
vm.stats.vm.v_free_min: 3312
vm.stats.vm.v_free_count: 968
vm.stats.vm.v_wire_count: 62039
vm.stats.vm.v_active_count: 44233
vm.stats.vm.v_inactive_target: 20995
vm.stats.vm.v_inactive_count: 343621
vm.stats.vm.v_cache_count: 21237
vm.stats.vm.v_cache_min: 13997
vm.stats.vm.v_cache_max: 27994
vm.stats.vm.v_pageout_free_min: 34
vm.stats.vm.v_interrupt_free_min: 2
vm.stats.vm.v_forks: 45205536
vm.stats.vm.v_vforks: 74315
vm.stats.vm.v_rforks: 0
vm.stats.vm.v_kthreads: 2416
vm.stats.vm.v_forkpages: 1464383994
vm.stats.vm.v_vforkpages: 4259727
vm.stats.vm.v_rforkpages: 0
vm.stats.vm.v_kthreadpages: 0
vm.stats.misc.zero_page_count: 709
vm.stats.misc.cnt_prezero: -972664922
vm.max_proc_mmap: 34952
vm.msync_flush_flags: 3
vm.idlezero_enable: 1
vm.idlezero_maxrun: 16
vm.max_launder: 32
vm.pageout_stats_max: 13997
vm.pageout_full_stats_interval: 20
vm.pageout_stats_interval: 5
vm.pageout_stats_free_max: 5
vm.swap_idle_enabled: 0
vm.defer_swapspace_pageouts: 0
vm.disable_swapspace_pageouts: 0
vm.pageout_lock_miss: 0
vm.zone:
ITEM SIZE LIMIT USED FREE REQUESTS
FFS2 dinode: 256, 0, 30156, 4389, 20093512
FFS1 dinode: 128, 0, 0, 0, 0
FFS inode: 140, 0, 30156, 4340, 20093512
SWAPMETA: 276, 121576, 16, 264, 44599
ripcb: 180, 32780, 0, 132, 289
hostcache: 88, 15390, 6, 309, 741
syncache: 104, 15390, 0, 418, 44592418
tcptw: 56, 6603, 3, 1204, 224900
tcpcb: 368, 32769, 136, 4264, 44594153
inpcb: 180, 32780, 139, 4437, 44594153
udpcb: 180, 32780, 10, 144, 85953
unpcb: 140, 32788, 6, 246, 143982
socket: 240, 32768, 152, 4248, 44824378
KNOTE: 64, 0, 0, 434, 7561
PIPE: 172, 0, 8, 222, 352848
NFSNODE: 460, 0, 1596, 92, 2419
NFSMOUNT: 424, 0, 1, 17, 1
DIRHASH: 1024, 0, 238, 86, 287
L VFS Cache: 291, 0, 165, 160, 11956
S VFS Cache: 68, 0, 38283, 3430, 3795133
NAMEI: 1024, 0, 0, 240, 907013101
VNODEPOLL: 60, 0, 1, 131, 2
VNODE: 260, 0, 34104, 36, 34104
g_bio: 136, 0, 0, 5887, 551700514
VMSPACE: 236, 0, 152, 987, 45279840
UPCALL: 44, 0, 0, 0, 0
KSE: 64, 0, 1224, 202, 1224
KSEGRP: 120, 0, 1224, 109, 1224
THREAD: 312, 0, 1224, 84, 1224
PROC: 452, 0, 261, 963, 45282231
Files: 68, 0, 782, 5413, 719968279
4096: 4096, 0, 441, 1935, 90066743
2048: 2048, 0, 237, 423, 25077
1024: 1024, 0, 23, 157, 448114
512: 512, 0, 108, 140, 770519
256: 256, 0, 458, 1102, 70685682
128: 128, 0, 1904, 1041, 186085712
64: 64, 0, 5124, 13042, 1404464781
32: 32, 0, 1281, 1302, 839881182
16: 16, 0, 842, 1548, 1712031683
DP fakepg: 72, 0, 0, 0, 0
PV ENTRY: 28, 2166780, 157829, 769251, 56650653911
MAP ENTRY: 60, 0, 6716, 33280, 2270740046
KMAP ENTRY: 60, 65538, 24, 702, 152938
MAP: 160, 0, 9, 41, 2
VM OBJECT: 132, 0, 21596, 10654, 1136467083
128 Bucket: 524, 0, 3115, 0, 0
64 Bucket: 268, 0, 200, 10, 0
32 Bucket: 140, 0, 191, 5, 0
16 Bucket: 76, 0, 49, 3, 0
UMA Hash: 128, 0, 0, 31, 0
UMA Slabs: 34, 0, 3095, 95, 0
UMA Zones: 432, 0, 52, 2, 0

vm.kvm_size: 1069543424
vm.kvm_free: 364900352

> sysctl -a | grep buffers

d01> sysctl -a | grep buffers
vfs.numdirtybuffers: 52
vfs.lodirtybuffers: 909
vfs.hidirtybuffers: 1819
vfs.numfreebuffers: 7146
vfs.lofreebuffers: 404
vfs.hifreebuffers: 808

> top | grep -E "(Mem|Swap):"

d01> top | grep -E "(Mem|Swap):"
Mem: 173M Active, 1346M Inact, 242M Wired, 77M Cache, 112M Buf, 5784K Free
Swap: 4096M Total, 124K Used, 4096M Free
On Wed, 2004-08-11 at 18:03, Jason Coene wrote:
> > -----Original Message-----
> > From: Rod Taylor [mailto:pg@rbt.ca]
> > Sent: Wednesday, August 11, 2004 5:46 PM
> > To: Jason Coene
> > Cc: 'Merlin Moncure'; Postgresql Performance
> > Subject: Re: [PERFORM] Hardware upgrade for a high-traffic database
> >
> > > I'm wondering why our PG server is using so little memory... The system
> > > has 2GB of memory, though only around 200MB of it are used. Is there a PG
> >
> > This is the second time you've said this. Surely you're not implying
> > there is 1.8GB Free Memory -- rather than 1.8GB in Buffers or Cache.
>
> Hi Rod,
>
> I was looking at top and vmstat - which always show under 300MB "Active".
> We may hit 400MB at peak. Everything I see (though this isn't my area of
> expertise) points to most of the memory simply being unused. Results
> below, am I missing something?

This looks fine. The memory is not unused (only 5MB is actually empty) but
is being used for disk cache.

Active is memory used by programs and would need to be swapped if this
space was needed.

Inactive is memory that is generally dirty. Disk cache is often here. In
your case, you likely write to the same pages you're reading from -- which
is why this number is so big. It also explains why a checkpoint is a
killer; a large chunk of this memory set needs to be pushed to disk.

Cache is memory used generally for disk cache that is not dirty. It's been
read from the disk and could be cleared immediately if necessary.

Wired is memory that cannot be swapped. In your case, Shared Memory is
probably Wired (this is good). There is another sysctl to check and set
whether it is wired or swappable.

Interesting (if dry) read:
http://www.freebsd.org/doc/en_US.ISO8859-1/articles/vm-design/index.html
> > Hi Rod,
> >
> > I was looking at top and vmstat - which always show under 300MB "Active".
> > We may hit 400MB at peak. Everything I see (though this isn't my area of
> > expertise) points to most of the memory simply being unused. Results
> > below, am I missing something?
>
> This looks fine. The memory is not unused (only 5MB is actually empty)
> but is being used for disk cache.
>
> Active is memory used by programs and would need to be swapped if this
> space was needed.
>
> Inactive is memory that is generally dirty. Disk cache is often here. In
> your case, you likely write to the same pages you're reading from --
> which is why this number is so big. It also explains why a checkpoint is
> a killer; a large chunk of this memory set needs to be pushed to disk.
>
> Cache is memory used generally for disk cache that is not dirty. It's
> been read from the disk and could be cleared immediately if necessary.
>
> Wired is memory that cannot be swapped. In your case, Shared Memory is
> probably Wired (this is good). There is another sysctl to check and set
> whether it is wired or swappable.
>
> Interesting (if dry) read:
> http://www.freebsd.org/doc/en_US.ISO8859-1/articles/vm-design/index.html

Ah, thanks - I didn't know that Inactive was still being used. I'm glad to
know that at least the OS is using up the free memory for disk cache.
Shared memory is Wired, set via sysctl. Thanks for the info! It sounds
like adding more memory would help cache more data - I'll look into the
upgrade.

Jason
"Jason Coene" <jcoene@gotfrag.com> writes: > A good example, a comments table where users submit TEXT data. A common > query is to find the last 5 comments a user has submitted. The scan, while > using an index, takes a considerable amount of time (> 0.5 sec is about as > good as it gets). Again, it's using an index on the single WHERE clause > (userid = int). The field that's used to ORDER BY (timestamp) is also > indexed. You mean you are doing SELECT ... WHERE userid = 42 ORDER BY timestamp DESC LIMIT 5; and hoping that separate indexes on userid and timestamp will get the job done? They won't. There are only two possible plans for this, neither very good: select all of user 42's posts and sort them, or scan timewise backwards through *all* posts looking for the last 5 from user 42. If you do this enough to justify a specialized index, I would suggest a two-column index on (userid, timestamp). You will also need to tweak the query, because the planner is not quite smart enough to deduce that such an index is applicable to the given sort order: SELECT ... WHERE userid = 42 ORDER BY userid DESC, timestamp DESC LIMIT 5; This should generate an index-scan-backwards plan that will execute nigh instantaneously, because it will only fetch the rows you really want. You might or might not be able to drop the separate indexes on userid and timestamp, depending on what other queries you might have that need them. But you should be paying attention to what plans you are really getting (see EXPLAIN) rather than just assuming that some indexes chosen at random will do what you need. regards, tom lane
[snip]
> One question I do have though - you specifically mentioned NOW() as
> something to watch out for, in that it's mutable. We typically use COUNT()
> as a subselect to retrieve the number of associated rows to the current
> query. Additionally, we use NOW a lot, primarily to detect the status of a
> date, i.e.:
>
> SELECT id FROM subscriptions WHERE userid = 11111 AND timeend > NOW();
>
> Is there a better way to do this? I was under the impression that NOW()
> was pretty harmless, just to return a current timestamp.

NOW() will trigger unnecessary sequence scans. Since it is unknown, when a
prepared query or function is planned, what the value will be when the
statement is run, the planner plans the query with now() as a variable.
This can push the planner to a seq scan over an index scan. I have seen
this time and time again.

You can create your own immutable now(), but don't use it in functions or
prepared queries or you will get wrong results.

> Based on feedback, I'm looking at a minor upgrade of our RAID controller
> to a 3ware 9000 series (SATA with cache, battery backup optional), and
> re-configuring it for RAID 10. It's a damn cheap upgrade at around $350
> and an hour of downtime, so I figure that it's worth it for us to give it
> a shot.
>
> Thanks,
>
> Jason

Russell Smith
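[Purely as an illustration of the workaround Russell mentions - not
something he posted, and subject to his warning - the usual shape is a
wrapper function falsely declared IMMUTABLE so the planner folds it to a
constant at plan time; the function name here is made up.]

    -- "immutable now" wrapper: the IMMUTABLE marking is a lie to the planner,
    -- so never use it inside functions or prepared statements, where a cached
    -- plan would keep returning a stale timestamp
    CREATE FUNCTION now_immutable() RETURNS timestamptz AS
        'SELECT now()' LANGUAGE sql IMMUTABLE;

    SELECT id FROM subscriptions
    WHERE userid = 11111 AND timeend > now_immutable();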
> You mean you are doing
> SELECT ... WHERE userid = 42 ORDER BY timestamp DESC LIMIT 5;
> and hoping that separate indexes on userid and timestamp will get the
> job done? They won't. There are only two possible plans for this,
> neither very good: select all of user 42's posts and sort them, or
> scan timewise backwards through *all* posts looking for the last 5 from
> user 42.

Wow! I did try the method you state below (including the WHERE-restricted
column in the sort by, and creating a two-column index), and it did execute
much faster (even on odd userids, to avoid cached results as much as
possible).

We have a lot of:

SELECT whatever
FROM ourtable
WHERE field1 = X
AND field2 = Y
AND field3 = Z
ORDER BY id DESC
LIMIT 5

With indexes:

ourtable(id)
ourtable(field1, field2, field3)

Is it standard procedure with postgres to include any fields listed in
WHERE in the ORDER BY, and create a single index for only the ORDER BY
fields (in order of appearance, of course)?

> If you do this enough to justify a specialized index, I would suggest a
> two-column index on (userid, timestamp). You will also need to tweak
> the query, because the planner is not quite smart enough to deduce that
> such an index is applicable to the given sort order:
> SELECT ... WHERE userid = 42 ORDER BY userid DESC, timestamp DESC LIMIT 5;
> This should generate an index-scan-backwards plan that will execute nigh
> instantaneously, because it will only fetch the rows you really want.
>
> You might or might not be able to drop the separate indexes on userid
> and timestamp, depending on what other queries you might have that need
> them. But you should be paying attention to what plans you are really
> getting (see EXPLAIN) rather than just assuming that some indexes chosen
> at random will do what you need.
>
> regards, tom lane

We do many varied queries on nearly every table - our data is highly
relational, and we have a lot of indexes. I thought the planner would pick
up the right index via constraints and not require them in ORDER BY...
EXPLAIN ANALYZE says that the indexes are being used, ala:

gf=# EXPLAIN ANALYZE SELECT id FROM comments WHERE userid = 51 ORDER BY
timestamp DESC LIMIT 5;
                              QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1608.43..1608.45 rows=5 width=8) (actual time=0.292..0.317 rows=5 loops=1)
   ->  Sort  (cost=1608.43..1609.45 rows=407 width=8) (actual time=0.287..0.295 rows=5 loops=1)
         Sort Key: "timestamp"
         ->  Index Scan using comments_ix_userid on comments  (cost=0.00..1590.79 rows=407 width=8) (actual time=0.031..0.190 rows=35 loops=1)
               Index Cond: (userid = 51)
 Total runtime: 0.375 ms
(6 rows)

Is this the wrong procedure? Your suggested syntax seems much more
efficient, but I don't quite understand exactly why, as PG is using our
existing indexes...
gf=# EXPLAIN ANALYZE SELECT id FROM comments WHERE userid = 51 ORDER BY
userid DESC, timestamp DESC LIMIT 5;
                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..19.90 rows=5 width=12) (actual time=0.040..0.076 rows=5 loops=1)
   ->  Index Scan Backward using comments_ix_userid_timestamp on comments  (cost=0.00..1620.25 rows=407 width=12) (actual time=0.035..0.054 rows=5 loops=1)
         Index Cond: (userid = 51)
 Total runtime: 0.134 ms
(4 rows)

Note: This was done after adding an index on comments (userid, timestamp).

Regards,

Jason
"Jason Coene" <jcoene@gotfrag.com> writes: > We have a lot of: > SELECT whatever > FROM ourtable > WHERE field1 = X > AND field2 = Y > AND field3 = Z > ORDER BY id DESC > LIMIT 5 > With indexes: > ourtable(id) > ourtable(field1, field2, field3) > Is it standard procedure with postgres to include any fields listed in WHERE > in the ORDER BY, and create a single index for only the ORDER BY fields (in > order of appearance, of course)? It depends. If the X/Y/Z constraint is already pretty selective, then it seems sufficient to me to pick up the matching rows (using the 3-field index), sort them by id, and take the first 5. The case where the extra-index-column trick is useful is where the WHERE clause *isn't* real selective and so a lot of rows would have to be sorted. In your previous example, I imagine you have a lot of prolific posters and so "all posts by userid 42" can be a nontrivial set. The double-column index lets you skip the sort and just pull out the required rows by scanning from the end of the range of userid = 42 entries. > gf=# EXPLAIN ANALYZE SELECT id FROM comments WHERE userid = 51 ORDER BY > timestamp DESC LIMIT 5; > QUERY PLAN > ---------------------------------------------------------------------------- > ------------------------------------------------------------------- > Limit (cost=1608.43..1608.45 rows=5 width=8) (actual time=0.292..0.317 > rows=5 loops=1) > -> Sort (cost=1608.43..1609.45 rows=407 width=8) (actual > time=0.287..0.295 rows=5 loops=1) > Sort Key: "timestamp" > -> Index Scan using comments_ix_userid on comments > (cost=0.00..1590.79 rows=407 width=8) (actual time=0.031..0.190 rows=35 > loops=1) > Index Cond: (userid = 51) > Total runtime: 0.375 ms > (6 rows) This example looks fine, but since userid 51 evidently only has 35 posts, there's not much time needed to read 'em all and sort 'em. The place where the double-column index will win big is on userids with hundreds of posts. You have to keep in mind that each index costs time to maintain during inserts/updates. So adding an index just because it makes a few queries a little faster probably isn't a win. You need to make tradeoffs. regards, tom lane
On Wed, 11 Aug 2004 20:29:04 -0400, Jason Coene <jcoene@gotfrag.com> wrote:

> gf=# EXPLAIN ANALYZE SELECT id FROM comments WHERE userid = 51 ORDER BY
> timestamp DESC LIMIT 5;
>                               QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=1608.43..1608.45 rows=5 width=8) (actual time=0.292..0.317 rows=5 loops=1)
>    ->  Sort  (cost=1608.43..1609.45 rows=407 width=8) (actual time=0.287..0.295 rows=5 loops=1)
>          Sort Key: "timestamp"
>          ->  Index Scan using comments_ix_userid on comments  (cost=0.00..1590.79 rows=407 width=8) (actual time=0.031..0.190 rows=35 loops=1)
>                Index Cond: (userid = 51)
>  Total runtime: 0.375 ms
> (6 rows)

Well, you have to read it from the bottom.

- Index Scan using comments_ix_userid: it selects all records for your
user. rows=407: there are 407 rows.

- Sort (cost=1608.43..1609.45 rows=407 width=8): it sorts them to find the
5 most recent.

So basically you grab 407 rows to return only 5, so you do 80x more disk
I/O than necessary. It is likely that posts from all users are interleaved
in the table, so this probably translates directly into 407 page fetches.

Note: EXPLAIN ANALYZE will only give good results the first time you run
it. The second time, all the data is in the cache, so it looks faster than
it really is.

> gf=# EXPLAIN ANALYZE SELECT id FROM comments WHERE userid = 51 ORDER BY
> userid DESC, timestamp DESC LIMIT 5;
>                               QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=0.00..19.90 rows=5 width=12) (actual time=0.040..0.076 rows=5 loops=1)
>    ->  Index Scan Backward using comments_ix_userid_timestamp on comments  (cost=0.00..1620.25 rows=407 width=12) (actual time=0.035..0.054 rows=5 loops=1)
>          Index Cond: (userid = 51)
>  Total runtime: 0.134 ms
> (4 rows)
>
> Note: This was done after adding an index on comments (userid, timestamp)

Well, this one correctly uses the index, fetches 5 rows, and returns them.

So, excluding index page hits, your unoptimized query has >400 page
fetches, and your optimized one has 5 page fetches. Still wonder why it's
faster? Seq scan is fast when locality of reference is good. In your
case, it's very bad.
> This example looks fine, but since userid 51 evidently only has 35
> posts, there's not much time needed to read 'em all and sort 'em. The
> place where the double-column index will win big is on userids with
> hundreds of posts.
>
> You have to keep in mind that each index costs time to maintain during
> inserts/updates. So adding an index just because it makes a few queries
> a little faster probably isn't a win. You need to make tradeoffs.

IMNSHO, in Jason's case he needs to do everything possible to get his
frequently run queries going as quick as possible. ISTM he can give up a
little on the update side, especially since he is running fsync=false. A
.3-.5 sec query multiplied over 50-100 users running concurrently adds up
quick.

Ideally, you are looking up records based on a key that takes you directly
to the first record you want and is pointing to the next number of records
in ascending order. I can't stress enough how important this is so long as
you can deal with the index/update overhead.

I don't have a huge amount of experience with this in pg, but one of the
tricks we do in the ISAM world is a 'reverse date' system, so that you can
scan forwards on the key to pick up datetimes in descending order. This is
often a win because the o/s cache may assume read/forwards, giving you more
cache hits. There are a few different ways to do this, but imagine:

create table t
(
    id int,
    ts timestamp default now(),
    iv interval default ('01/01/2050'::timestamp - now())
);

create index t_idx on t(id, iv);

select * from t where id = k order by id, iv limit 5;

The above query should do a much better job pulling up data and should be
easier on your cache. A further win might be to cluster the table on this
key if the table is really big.

note: interval is a poor type to do this with, because it's a 12 byte type
(just used it here for demonstration purposes because it's easy). With a
little trickery you can stuff it into a time type or an int4 type (even
better!). If you want to be really clever you can do it without adding any
data to your table at all through functional indexes.

Since the planner can use the same index in the extraction and ordering,
you get some savings...not much, but worthwhile when applied over a lot of
users. Knowing when and how to apply multiple key/functional indexes will
make you feel like you have 10 times the database you are using right now.

Merlin
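[For what it's worth, one minimal sketch of the int4 variant Merlin alludes
to - my own illustration, not his exact method, with hypothetical table and
column names; the reverse key is stored as a plain integer and populated at
insert time by the application or a trigger.]

    -- "reverse time" as an int4: a larger epoch yields a smaller key, so a
    -- forward scan on (user_id, rev_time) returns the newest posts first,
    -- e.g. rev_time := 2147483647 - extract(epoch from now())::int4
    alter table messages add column rev_time int4;
    create index messages_user_revtime_idx on messages (user_id, rev_time);

    select * from messages where user_id = 42
    order by user_id, rev_time limit 5;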
"Merlin Moncure" <merlin.moncure@rcsonline.com> writes: > The following suggestion works in two principles: one is that instead of > using timestamps for ordering, integers are quicker, The difference would be pretty marginal --- especially if you choose to use bigints instead of ints. (A timestamp is just a float8 or bigint under the hood, and is no more expensive to compare than those datatypes. Timestamps *are* expensive to convert for I/O, but comparison does not have to do that.) I wouldn't recommend kluging up your data schema just for that. regards, tom lane
> I don't have a huge amount of experience with this in pg, but one of the
> tricks we do in the ISAM world is a 'reverse date' system, so that you
> can scan forwards on the key to pick up datetimes in descending order.
> This is often a win because the o/s cache may assume read/forwards,
> giving you more cache hits. There are a few different ways to do this,
> but imagine:

I've been thinking more about this, and there is an even better way of
doing it if you are willing to beat on your data a bit. It involves the
use of sequences.

Lets revisit your id/timestamp query combination for a message board. The
assumption is you are using integer keys for all tables. You probably have
something like:

create table messages
(
    user_id int4 references users,
    topic_id int4 references topics,
    message_id serial,
    message_time timestamp default now(),
    [...]
);

The following suggestion works on two principles: one is that instead of
using timestamps for ordering, integers are quicker, and sequences have a
built in ability for reverse-ordering.

Lets define:

create sequence message_seq increment -1 start 2147483647
    minvalue 0 maxvalue 2147483647;

now we define our table:

create table messages
(
    user_id int4 references users,
    topic_id int4 references topics,
    message_id int4 default nextval('message_seq') primary key,
    message_time timestamp default now(),
    [...]
);

create index user_message_idx on messages(user_id, message_id);

-- optional
cluster user_message_idx messages;

Since the sequence is in descending order, we don't have to do any tricks
to logically reverse order the table.

-- return last k posts made by user u in descending order
select * from messages where user_id = u
order by user_id, message_id limit k;

-- return last k posts on a topic
create index topic_message_idx on messages(topic_id, message_id);
select * from messages where topic_id = t
order by topic_id, message_id limit k;

A side benefit of clustering is that there is little penalty for increasing
k because of read ahead optimization, whereas in normal scenarios your read
time scales with k (forcing small values for k). If we tended to pull up
messages by topic more frequently than user, we would cluster on
topic_message_idx instead. (If we couldn't decide, we might cluster on
message_id or not at all.)

The crucial point is that we are making this one index run really fast at
the expense of other operations. The other major point is we can use a
sequence in place of a timestamp for ordering. Using int4 vs. timestamp is
a minor efficiency win; if you are worried about > 4B rows, then stick with
timestamp.

This all boils down to a central unifying principle: organize your indices
around your expected access patterns to the data. Sorry if I'm bleating on
and on about this...I just think there is plenty of optimization room left
in there :)

Merlin
"Merlin Moncure" <merlin.moncure@rcsonline.com> wrote: -- optional cluster user_message_idx messages; would one not have to repeat this operation regularly, to keep any advantage of this ? my impression was that this is a relatively heavy operation on a large table. gnari
Tom Lane wrote:
> The difference would be pretty marginal --- especially if you choose to
> use bigints instead of ints. (A timestamp is just a float8 or bigint
> under the hood, and is no more expensive to compare than those datatypes.
> Timestamps *are* expensive to convert for I/O, but comparison does not
> have to do that.) I wouldn't recommend kluging up your data schema just
> for that.

Right (int4 use was assumed). I agree, but it's kind of a 'two birds with
one stone' kind of thing, because it's easier to work with reverse ordering
integers than time values. So I claim a measurable win (the real gainer of
course being able to select and sort on the same key, which works on any
type), based on the int4-int8 difference, which is a 33% reduction in key
size.

One claim I don't have the data for is that read-forward is better than
read-back, but my gut tells me he'll get a better cache hit ratio that way.
This will be very difficult to measure.

As for kludging, using a decrementing sequence is not a bad idea if the
general tendency is to read the table backwards, even if just for
conceptual reasons. The main kludge is the int4 assumption, which (IMO)
isn't so bad. He would just have to rebuild the existing p-key in reverse
order (10$ says his keys are all already int4s), and hopefully not mess
with the application code too much. At least, it's what I would try if I
was in his shoes :)

YMMV

Merlin
> Would one not have to repeat this operation regularly to keep any
> advantage of this? My impression was that this is a relatively heavy
> operation on a large table.

Yeah, it requires an exclusive lock and a table rebuild. It might be
useful to a message board type database since (one presumes) the reads
would be concentrated over recently created data, entered after the cluster
and losing any benefit. As far as table size, bigger tables are a larger
operation but take longer to get all out of whack. Question is: what
percentage of the data turns over between maintenance periods?

Plus, there has to be a maintenance period...nobody does anything while the
table is clustering. Also, a particular method of reading the table has to
really dominate as far as user usage pattern.

So, it's pretty rare to use cluster, but it can help in some cases.

Merlin