Thread: settings input for upgrade

settings input for upgrade

From
"Midge Brown"
Date:
I'm in the process of upgrading from postgres 7.4.8 to 9.0.4 and wanted to run my decisions past some folks who can give me some input on whether my decisions make sense or not.
 

It's basically a LAPP configuration and on a busy day we probably get in the neighborhood of a million hits.

 
Server Info:
 
- 4 dual-core AMD Opteron 2212 processors, 2010.485 MHz
- 64GB RAM
- 16 x 67GB RAID 1 drives and 1 x 464GB RAID 10 drive (all ext3)
- Linux 2.6.18-164.el5 #1 SMP Thu Sep 3 03:28:30 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
 
There are 3 separate databases:
 
DB1 is 10GB and consists of multiple tables that I've spread out so that the 3 most used have their data and indexes on 6 separate RAID1 drives, the 3 next busiest have data & index on 3 drives, and the remaining tables and indexes are on the RAID10 drive. The WAL for all is on a separate RAID1 drive.
 
The others are very write-heavy, started as one table within the original DB, and were split out on an odd/even id # in an effort to get better performance:
 
DB2 is 25GB with data, index, and WAL all on separate RAID1 drives.
DB3 is 15GB with data, index, and WAL on separate RAID1 drives.
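
In case the detail helps, this is roughly what that layout looks like in practice; the tablespace names, paths, and table names below are just placeholders, not the real ones:

 -- one tablespace per dedicated RAID1 volume (directories owned by postgres)
 CREATE TABLESPACE ts_data_a LOCATION '/mnt/raid1_a/pgdata';
 CREATE TABLESPACE ts_idx_a  LOCATION '/mnt/raid1_b/pgdata';

 -- put a busy table and its index on their own drives
 ALTER TABLE busy_table SET TABLESPACE ts_data_a;
 ALTER INDEX busy_table_pkey SET TABLESPACE ts_idx_a;

The WAL drives are handled outside of SQL, with pg_xlog living on (or symlinked to) its own RAID1 volume.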
 
Here are the changes I made to postgresql.conf. The only differences between the conf file for DB1 and those for DB2 & 3 are the port and effective_cache_size (which I made slightly smaller -- 8GB instead of 10 -- for the 2 write-heavy DBs). The 600 max connections are often idle and don't get explicitly closed in the application. I'm looking at connection pooling as well.
 

 autovacuum = on
 autovacuum_analyze_threshold = 250
 autovacuum_freeze_max_age = 200000000
 autovacuum_max_workers = 3
 autovacuum_naptime = 10min
 autovacuum_vacuum_cost_delay = 20ms
 autovacuum_vacuum_cost_limit = -1
 autovacuum_vacuum_threshold = 250
 checkpoint_completion_target = 0.7
 checkpoint_segments = 64
 checkpoint_timeout = 5min
 checkpoint_warning = 30s
 deadlock_timeout = 3s
 effective_cache_size = 10GB
 log_autovacuum_min_duration = 1s
 maintenance_work_mem = 256MB
 max_connections = 600
 max_locks_per_transaction = 64
 max_stack_depth = 8MB
 shared_buffers = 4GB
 vacuum_cost_delay = 10ms
 wal_buffers = 32MB
 wal_level = minimal
 work_mem = 128MB

ANY comments or suggestions would be greatly appreciated.  
 
Thank you,
Midge

Re: settings input for upgrade

From
Robert Klemme
Date:
On Thu, Aug 18, 2011 at 11:55 PM, Midge Brown <midgems@sbcglobal.net> wrote:
> I'm in the process of upgrading from postgres 7.4.8 to 9.0.4 and wanted to
> run my decisions past some folks who can give me some input on whether my
> decisions make sense or not.

I am not sure what decisions you actually refer to here: in your
posting I can only see a description of the current setup but no
decisions for the upgrade (i.e. changed parameters, other physical
layout, etc.).

> The others are very write-heavy, started as one table within the original
> DB, and were split out on an odd/even id # in an effort to get better
> performance:

Did it pay off?  I mean, you planned to increase performance; did
that actually happen?  Apart from reserving IO bandwidth (which you
achieved by placing data on different disks) you basically only added
reserved memory for each instance by separating them.  Or are there
any other effects achieved by separating (like reduced lock contention
on some globally shared resource, distribution of CPU for logging)?

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Re: settings input for upgrade

From
"Midge Brown"
Date:
Robert,
 
I was largely looking for input on whether I may have inadvertently shot myself in the foot with some of the choices I made when setting up postgresql 9.0, which is on different hardware than the 7.4 setup was.
 
The splitting of one table to two separate databases was done on 7.4 and did make a positive change in write performance. I was including that information only in an attempt to provide as much detail as possible.
 
- Midge
----- Original Message -----
Sent: Saturday, August 20, 2011 2:38 AM
Subject: Re: [PERFORM] settings input for upgrade


Re: settings input for upgrade

From
Scott Marlowe
Date:
On Thu, Aug 18, 2011 at 3:55 PM, Midge Brown <midgems@sbcglobal.net> wrote:
> Here are the changes I made to postgres.conf. The only differences between
> the conf file for DB1 and those for DB2 & 3 are the port and
> effective_cache_size (which I made slightly smaller -- 8 GB instead of 10 --
> for the 2 write-heavy DBs). The 600 max connections are often idle and don't
> get explicitly closed in the application. I'm looking at connection pooling
> as well.

>  work_mem = 128MB

I'd lower this unless you are certain that something like 16MB just
isn't gonna get similar performance.  Even with most connections
idle, 128MB is a rather large work_mem.  Remember it's per sort, per
connection.  It can quickly cause the kernel to dump the file cache that
keeps the machine running fast if a couple dozen connections run a
handful of large sorts at once.  What happens is that while things run
smoothly when there's low to medium load, under high load the machine
will start thrashing trying to allocate too much work_mem and then
just slow to a crawl.
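
As a rough worst case with your numbers: 600 connections each doing a single 128MB sort would ask for around 75GB, more than the whole box.  A common pattern is a modest global setting plus a per-session bump just for the jobs that need it, something like this (the 16MB/128MB split here is only an example):

 -- postgresql.conf: work_mem = 16MB for everyone
 -- then, only in the session running the big report:
 SET work_mem = '128MB';
 -- ... run the large sort/aggregate here ...
 RESET work_mem;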

Re: settings input for upgrade

From
"Midge Brown"
Date:
Thank you!
----- Original Message -----
Sent: Saturday, August 20, 2011 9:01 PM
Subject: Re: [PERFORM] settings input for upgrade


Re: settings input for upgrade

From
Robert Klemme
Date:
On Sat, Aug 20, 2011 at 8:33 PM, Midge Brown <midgems@sbcglobal.net> wrote:
> Robert,
>
> I was largely looking for input on whether I may have inadvertently shot
> myself in the foot with some of the choices I made when setting up
> postgresql 9.0, which is on different hardware than was the 7.4 setup.

OK, I thought the config changes were the diff for the other two
databases and not for 9.0.

> The splitting of one table to two separate databases was done on 7.4 and did
> make a positive change in write performance. I was including that
> information only in an attempt to provide as much detail as possible.

Good to know!  Thanks!

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Re: settings input for upgrade

From
Greg Smith
Date:
On 08/18/2011 05:55 PM, Midge Brown wrote:
 
DB1 is 10GB and consists of multiple tables that I've spread out so that the 3 most used have their data and indexes on 6 separate RAID1 drives, the 3 next busiest have data & index on 3 drives, and the remaining tables and indexes are on the RAID10 drive. The WAL for all is on a separate RAID1 drive.
DB2 is 25GB with data, index, and WAL all on separate RAID1 drives.
DB3 is 15GB with data, index, and WAL on separate RAID1 drives.

Anytime you have a set of disks and a set of databases/tables to lay out onto them, there are two main options to consider:

-Put all of them into a single RAID10 array.  Performance will be high no matter what subset of the database is being used.  But if one particular part of a database is really busy, it can divert resources away from the rest.

-Break the database into fine-grained pieces and carefully lay out each of them on disk.  Performance of any individual chunk will be steady here.  But if only a subset of the data is being used, server resources will sit idle: all of the disks that don't hold the data being accessed go unused.

Consider two configurations following these ideas:

1) 12 disks are placed into a large RAID10 array.  Peak transfer rate will be about 600MB/s on sequential scans.

2) 6 RAID1 arrays are created and the database is manually laid out onto those disks.  Peak transfer rate from any one section will be closer to 100MB/s.

Each of these is optimizing for a different use scenario.  Here's the best case for each:

-One user is active, and they're hitting one of the database sections.  In setup (1) they might get 600MB/s, the case where it shows the most benefit.  In setup (2), they'd only get 100MB/s.

-10 users are pounding one section of the database; 1 user is hitting a different section.  In setup (2), all 10 users will be fighting over access to one section of the disk, each getting (at best) 10MB/s of its transfers. The nature of random I/O means that it will likely be much worse for them.  Meanwhile, the user hitting the other database section will still be merrily chugging away getting their 100MB/s.  Had setup (1) been used, you'd have 11 users fighting over 600MB/s, so at best 55MB/s for each.  And with the random mix, it could be much worse.

Which of these is better?  Well, (1) is guaranteed to use your hardware to its fullest capability.  There are some situations where contention over the disk array will cause performance to be lower for some people, compared to if they had an isolated environment split up more like (2).  But the rest of the time, (2) will have taken a large number of disks and left them idle.  The second example shows this really well.  The mere fact that you have such a huge aggregate speed available means that the big array really doesn't necessarily suffer that badly from a heavy load.  It has 6X as much capacity to handle them.  You really need to have a >6:1 imbalance in access before the carefully laid out version pulls ahead.  In every other case, the big array wins.

You can defend (2) as the better choice if you have really compelling, hard data proving use of the various parts of the data is split quite evenly among the expected incoming workload.  If you have response time latency targets that require separating resources evenly among the various types of users, it can also make sense there.  I don't know if the data you've been collecting from your older version is good enough to know that for sure or not.

In every other case, you'd be better off just dumping the whole pile into a single, large array, and letting the array and operating system figure out how to schedule things best.  That's why this is the normal practice for building PostgreSQL systems.  The sole exception is that splitting out the pg_xlog filesystem can usually be justified in a larger array.  The fact that it's always sequential I/O means that mixing its work with the rest of the server doesn't work as well as giving it a dedicated pair of drives to write to, where it doesn't ever stop to seek somewhere else.
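
If you do end up consolidating onto one big array while keeping a drive pair for WAL, the usual way to relocate it is just a symlink, along these lines (paths are placeholders, and the server has to be stopped first):

 pg_ctl -D "$PGDATA" stop
 mv "$PGDATA"/pg_xlog /mnt/wal_raid1/pg_xlog
 ln -s /mnt/wal_raid1/pg_xlog "$PGDATA"/pg_xlog
 pg_ctl -D "$PGDATA" start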
 
 wal_buffers = 32MB

This might as well drop to 16MB.  And you've already gotten some warnings about work_mem.  Switching to a connection pooler would help with that, too.
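
A minimal pgbouncer setup, as a sketch to show the shape of it; the database names, port, and pool sizes here are placeholders you'd tune:

 [databases]
 db1 = host=127.0.0.1 port=5432 dbname=db1

 [pgbouncer]
 listen_addr = 127.0.0.1
 listen_port = 6432
 auth_type = md5
 auth_file = /etc/pgbouncer/userlist.txt
 pool_mode = transaction
 max_client_conn = 600
 default_pool_size = 20

The application then points at port 6432, and the 600 client connections share a much smaller pool of real backends.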


 autovacuum_analyze_threshold = 250
 autovacuum_naptime = 10min
 autovacuum_vacuum_threshold = 250
 vacuum_cost_delay = 10ms

This strikes me as more customization than you really should be doing to autovacuum, if you haven't been running on a recent version of PostgreSQL yet.  You shouldn't ever need to touch the thresholds for example.  Those only matter on really small tables; once something gets big enough to really matter, the threshold part is really small compared to the scale factor one.  And the defaults are picked partly so that cleanup of the system catalog tables is done frequently enough.  You're slowing that cleanup by moving the thresholds upward so much, and that's not a great idea. 
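
For reference, the trigger condition autovacuum uses is:

 vacuum threshold  = autovacuum_vacuum_threshold  + autovacuum_vacuum_scale_factor  * reltuples
 analyze threshold = autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples

With the default scale factors (0.2 for vacuum, 0.1 for analyze), a 1 million row table's vacuum trigger is 200,050 dead rows with the default threshold of 50 versus 200,250 with 250: noise.  A 1,000 row catalog table's goes from 250 to 450, nearly doubling the time between cleanups.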

For similar reasons, you really shouldn't be touching autovacuum_naptime unless there's really good evidence it's necessary for your environment.

Changing things so that regular vacuums executed at the command line also honor a cost delay like this should be fine, though.  With vacuum_cost_delay at 10ms versus autovacuum's 20ms, those will run using roughly twice the resources of the autovacuum ones, but not flat out the way an unthrottled manual vacuum normally would.

 deadlock_timeout = 3s

You probably don't want to increase this.  When you reach the point where you want to find slow lock issues by turning on log_lock_waits, you're just going to put it right back to the default again--or lower it.
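
In other words, once you're chasing lock problems, the combination you'll probably want looks more like this (values are only an example):

 deadlock_timeout = 1s
 log_lock_waits = on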

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us

Re: settings input for upgrade

From
Scott Marlowe
Date:
On Sun, Aug 21, 2011 at 1:20 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>  deadlock_timeout = 3s
>
> You probably don't want to increase this.  When you reach the point where
> you want to find slow lock issues by turning on log_lock_waits, you're just
> going to put it right back to the default again--or lower it.

All of these random changes bring up the REAL subject: changes
should be made by measuring performance before and after each change
set, justifying each one.  Just randomly throwing what seem like
good changes at the database is a surefire recipe for disaster.
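
Even a crude, repeatable baseline run before and after each change set is enough; for example with pgbench (on a scratch database, and the scale and client counts here are placeholders):

 createdb pgbench_test
 pgbench -i -s 100 pgbench_test       # initialize at scale factor 100
 pgbench -c 32 -T 300 pgbench_test    # 32 clients for 5 minutes; compare TPS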

Re: settings input for upgrade

From
"Midge Brown"
Date:
Thank you. I'll set work_mem back to 16MB and see what happens from there.
-Midge
----- Original Message -----
Sent: Saturday, August 20, 2011 9:01 PM
Subject: Re: [PERFORM] settings input for upgrade


Re: settings input for upgrade

From
"Midge Brown"
Date:
Thank you so much for the input, and the detail provided.
 
I'll be making the configuration changes, probably over the course of the week, checking the effect after each (as Scott Marlowe reminded me). I was pushed to put the new version into production over the weekend, which at least may provide me with some accurate feedback, so I'll see what happens for a bit before addressing the disk/drive layout.
 
-Midge
 
----- Original Message -----
Sent: Sunday, August 21, 2011 12:20 PM
Subject: Re: [PERFORM] settings input for upgrade


Re: settings input for upgrade

From
Greg Smith
Date:
On 08/22/2011 12:48 PM, Midge Brown wrote:
 
I was pushed to put the new version into production over the weekend, which at least may provide me with some accurate feedback, so I'll see what happens for a bit before addressing the disk/drive layout.

The good news is that deploying onto the split up configuration will give you lots of data to collect about just how the load is split up over the various parts of the database.  You can just monitor which drives the I/O goes to.  If there are some that are really underused, and you can arrange downtime and disk space to merge them together in the future, that may be an option for improving performance one day.
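
For example, something as simple as the per-device stats plus the per-table block counts will show where the hot spots are (run the query in each database; table names will be whatever yours happen to be):

 iostat -dxk 5        # per-drive utilization and throughput, every 5 seconds

 -- which tables actually hit the disk for reads
 SELECT relname, heap_blks_read, coalesce(idx_blks_read, 0) AS idx_blks_read
   FROM pg_statio_user_tables
  ORDER BY heap_blks_read + coalesce(idx_blks_read, 0) DESC
  LIMIT 10;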

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us