Thread: postgresql.conf recommendations

postgresql.conf recommendations

From
Johnny Tan
Date:
Server specs:
Dell R610
dual E5645 hex-core 2.4GHz
192GB RAM
RAID 1: 2x400GB SSD (OS + WAL logs)
RAID 10: 4x400GB SSD (/var/lib/pgsql)


/etc/sysctl.conf:
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.shmmax = 68719476736
kernel.shmall = 4294967296
vm.overcommit_memory = 0
vm.swappiness = 0
vm.dirty_background_bytes = 536870912
vm.dirty_bytes = 536870912
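
(As a sanity check, the values the kernel is actually running with can be compared against this file. A minimal sketch, not part of our current setup; assumes root and stock sysctl:)

# show the values currently in effect
sysctl vm.swappiness vm.overcommit_memory vm.dirty_background_bytes vm.dirty_bytes
# re-apply the file after editing it
sysctl -p /etc/sysctl.conf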


postgresql.conf:
listen_addresses = '*' # what IP address(es) to listen on;
max_connections = 150 # (change requires restart)
shared_buffers = 48GB # min 128kB
work_mem = 1310MB # min 64kB
maintenance_work_mem = 24GB # min 1MB
wal_level = hot_standby # minimal, archive, or hot_standby
checkpoint_segments = 64 # in logfile segments, min 1, 16MB each
checkpoint_timeout = 30min # range 30s-1h
checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
max_wal_senders = 5 # max number of walsender processes
wal_keep_segments = 2000 # in logfile segments, 16MB each; 0 disables
hot_standby = on # "on" allows queries during recovery
max_standby_archive_delay = 120s # max delay before canceling queries
max_standby_streaming_delay = 120s # max delay before canceling queries
effective_cache_size = 120GB
constraint_exclusion = partition # on, off, or partition
log_destination = 'syslog' # Valid values are combinations of
logging_collector = on # Enable capturing of stderr and csvlog
log_directory = 'pg_log' # directory where log files are written,
log_filename = 'postgresql-%a.log' # log file name pattern,
log_truncate_on_rotation = on # If on, an existing log file with the
log_rotation_age = 1d # Automatic rotation of logfiles will
log_rotation_size = 0 # Automatic rotation of logfiles will
log_min_duration_statement = 500 # -1 is disabled, 0 logs all statements
log_checkpoints = on
log_line_prefix = 'user=%u db=%d remote=%r ' # special values:
log_lock_waits = on # log lock waits >= deadlock_timeout
autovacuum = on # Enable autovacuum subprocess?  'on'
log_autovacuum_min_duration = 0 # -1 disables, 0 logs all actions and
autovacuum_max_workers = 5 # max number of autovacuum subprocesses
datestyle = 'iso, mdy'
lc_messages = 'en_US.UTF-8' # locale for system error message
lc_monetary = 'en_US.UTF-8' # locale for monetary formatting
lc_numeric = 'en_US.UTF-8' # locale for number formatting
lc_time = 'en_US.UTF-8' # locale for time formatting
default_text_search_config = 'pg_catalog.english'
deadlock_timeout = 300ms
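
(For reference, a quick way to confirm which of these are actually in effect on the running server is to query pg_settings. Just a sketch; the connection options are assumptions:)

psql -U postgres -d postgres -c "SELECT name, setting, unit, source FROM pg_settings WHERE source <> 'default';"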


per pgtune:
#------------------------------------------------------------------------------
# pgtune wizard run on 2013-02-05
# Based on 198333224 KB RAM in the server
#------------------------------------------------------------------------------
default_statistics_target = 100
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
effective_cache_size = 128GB
work_mem = 1152MB
wal_buffers = 8MB
checkpoint_segments = 16
shared_buffers = 44GB
max_connections = 80

We use pgbouncer (set to 140 connections) in transaction pooling mode in front of our db.
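
(A rough way to watch how close we get to that 140 limit during a spike is pgbouncer's admin console. Sketch only; the listen port and admin/stats user are assumptions about our setup:)

psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "SHOW POOLS;"
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "SHOW STATS;"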


The problem:

For the most part, the server hums along. No applications other than postgres run on this server. Load averages rarely break 2.0, it never swaps, and %iowait is usually not more than 0.12.

But periodically, there are spikes in our app's db response time. Normally, the app's db response time hovers in the 100ms range for most of the day. During the spike times, it can go up to 1000ms or 1500ms, and the number of pg connections goes to 140 (maxed out at pgbouncer's limit, where normally it's only about 20-40 connections). Also, during these times, which usually last less than 2 minutes, we will see several thousand queries in the pg log (this is with log_min_duration_statement = 500), compared to maybe one or two dozen 500ms+ queries in non-spike times.

The gap between spikes could be an hour, two hours, sometimes half a day. There doesn't appear to be any pattern that we can see:
* there are no obvious queries that are locking the db
* it doesn't necessarily happen during high-traffic times, though it can
* it doesn't happen during any known system, db, or app regularly-scheduled job, including crons
* in general, there's no discernible regularity to it at all
* it doesn't coincide with checkpoint starts or completions
* it doesn't coincide with autovacuums
* there are no messages in any system logs that might indicate any system or hardware-related issue

Besides spikes in our graphs, the only other visible effect is that %system in sar goes from an average of 0.7 to as high as 10.0 or so (%iowait and all other sar variables remain the same).

According to our monitoring system, web requests get queued up, and our alerting system sometimes reports either a timeout or multiple web response times greater than 300ms, so we suspect (but have no proof) that some users see either a long hang or possibly a timeout. But since it's almost always less than two minutes, and sometimes less than one, we don't really hear any complaints (we're guessing most people hit reload, things work again, and they continue on), and we haven't been able to see any negative effect ourselves.

But we want to get in front of the problem, in case it is something that will get worse as traffic continues to grow. We've tweaked various configs on the OS side as well as the postgresql.conf side. What's posted above is our current setup, and the problem persists.

Any ideas as to where we could even look?
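
(One thing we could leave running in the meantime is a loop that samples pg_stat_activity every few seconds, so there's something to look at the next time a spike hits. A rough sketch, not something we currently do; the connection options and the waiting column of our 9.x-era pg_stat_activity are assumptions:)

# count backends by waiting state every 5 seconds
while true; do
  psql -U postgres -At -c "SELECT now(), waiting, count(*) FROM pg_stat_activity GROUP BY 1, 2;" >> /tmp/pg_activity.log
  sleep 5
done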

Also, whether related or unrelated to the spikes, are there any recommendations for our postgresql.conf or sysctl.conf based on our hardware? Based on pgtune's output, I am lowering maintenance_work_mem from 24GB down to maybe 2GB, but I keep reading conflicting advice about other settings, such as checkpoints or max_connections.

johnny

Re: postgresql.conf recommendations

From
Josh Krupka
Date:

Just out of curiosity, are you using transparent huge pages?


Re: postgresql.conf recommendations

From
Johnny Tan
Date:
# cat /sys/kernel/mm/redhat_transparent_hugepage/defrag 
[always] never


On Tue, Feb 5, 2013 at 5:37 PM, Josh Krupka <jkrupka@gmail.com> wrote:

Just out of curiosity, are you using transparent huge pages?



Re: postgresql.conf recommendations

From
Josh Krupka
Date:
I've been looking into something on our system that sounds similar to what you're seeing. I'm still researching it, but I suspect the memory compaction that runs as part of transparent huge pages when memory is allocated, though that's yet to be proven. The tunable you mentioned controls the compaction that runs at allocation time so it can try to allocate large pages; there's a separate one that controls whether the compaction is done in khugepaged, and another that controls whether THP is used at all (/sys/kernel/mm/transparent_hugepage/enabled, though the path may differ in your distro).

What's the output of this command?
egrep 'trans|thp|compact_' /proc/vmstat
compact_stall represents the number of processes that were stalled to do a compaction; the other metrics have to do with other parts of THP. If you see compact_stall climbing, from what I can tell those stalls might be causing your spikes. I haven't found a way of telling how long the processes were stalled. You could probably get a little more insight into the processes with some tracing, assuming you can catch it quickly enough. Running perf top will also show the compaction happening, but that doesn't necessarily mean it's impacting your running processes.
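
(A crude way to watch that counter over time, sketched here only as an idea: diff compact_stall at a fixed interval and see whether the jumps line up with the latency spikes.)

# print the change in compact_stall every 10 seconds
prev=$(awk '/^compact_stall/ {print $2}' /proc/vmstat)
while sleep 10; do
  cur=$(awk '/^compact_stall/ {print $2}' /proc/vmstat)
  echo "$(date +%T) compact_stall +$((cur - prev))"
  prev=$cur
done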




On Tue, Feb 5, 2013 at 6:46 PM, Johnny Tan <johnnydtan@gmail.com> wrote:
# cat /sys/kernel/mm/redhat_transparent_hugepage/defrag 
[always] never





Re: postgresql.conf recommendations

From
Pavan Deolasee
Date:
On Wed, Feb 6, 2013 at 3:32 AM, Johnny Tan <johnnydtan@gmail.com> wrote:
>
> maintenance_work_mem = 24GB # min 1MB

I'm quite astonished by this setting. Not that it explains the problem
at hand, but I wonder if this is a plain mistake in configuration.

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


Re: postgresql.conf recommendations

From
"ac@hsk.hk"
Date:

On 6 Feb 2013, at 12:23 PM, Josh Krupka wrote:

On Tue, Feb 5, 2013 at 6:46 PM, Johnny Tan <johnnydtan@gmail.com> wrote:
shared_buffers = 48GB # min 128kB




Hi,

From the postgresql.conf, I can see that shared_buffers is set to 48GB, which is not small. It is possible for a buffer cache that large to accumulate a lot of dirty pages, and when a checkpoint starts, that could cause a checkpoint I/O spike.

I would suggest using pgtune to get a recommended configuration for PostgreSQL.

Regards

Re: postgresql.conf recommendations

From
Kevin Grittner
Date:
"ac@hsk.hk" <ac@hsk.hk> wrote:
> Johnny Tan <johnnydtan@gmail.com> wrote:

>>shared_buffers = 48GB # min 128kB

> From the postgresql.conf, I can see that the shared_buffers is
> set to 48GB which is not small,  it would be possible that the
> large buffer cache could be "dirty", when a checkpoint starts, it
> would cause a checkpoint I/O spike.
>
>
> I would like to suggest you about using pgtune to get recommended
> conf for postgresql.

I have seen symptoms like those described which were the result of
too many dirty pages accumulating inside PostgreSQL shared_buffers.
It might be something else entirely in this case, but it would at
least be worth trying a reduced shared_buffers setting combined
with more aggressive bgwriter settings.  I might try something like
the following changes, as an experiment:

shared_buffers = 8GB
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 4
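
One rough way to judge whether the background writer keeps up after a change like this is to compare snapshots of pg_stat_bgwriter taken some time apart. A sketch only; connection options are assumed:

psql -c "SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint, buffers_clean, maxwritten_clean, buffers_backend FROM pg_stat_bgwriter;"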

-Kevin


Re: postgresql.conf recommendations

From
David Whittaker
Date:
Josh/Johnny,

We've been seeing a similar problem as well, and had also figured THP was involved.  We found this in syslog: https://gist.github.com/davewhittaker/4723285, which led us to disable THP 2 days ago.  At first the results seemed good.  In particular, our issues always seemed interrupt related and the average interrupts/sec immediately dropped from 7k to around 3k after restarting.  The good news is that we didn't see any spike in system CPU time yesterday.  The bad news is that we did see a spike in app latency that originated from the DB, but now the spike is in user CPU time and seems to be spread across all of the running postgres processes.  Interrupts still blew up to 21k/sec when it happened.  We are still diagnosing, but I'd be curious to see if either of you get similar results from turning THP off.






Re: postgresql.conf recommendations

From
Josh Krupka
Date:
David,

Interesting observations. I had not been tracking the interrupts, but perhaps I should take a look. How are you measuring them over a period of time, or are you just watching them in real time?

Did you turn off THP altogether, or just the THP defrag?



Re: postgresql.conf recommendations

From
David Whittaker
Date:
Hi Josh,

On Wed, Feb 6, 2013 at 1:20 PM, Josh Krupka <jkrupka@gmail.com> wrote:
David,

Interesting observations. I had not been tracking the interrupts, but perhaps I should take a look. How are you measuring them over a period of time, or are you just watching them in real time?

I initially saw it happen with vmstat, but now I'm collecting them every 5 minutes over SNMP with Cacti.
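
(For a quick look without Cacti, a sketch; the in and cs columns of vmstat are interrupts and context switches per second, and sar needs sysstat installed:)

vmstat 5
# or, total interrupts per second:
sar -I SUM 5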
 
Did you turn off THP altogether, or just the THP defrag?

We disabled THP altogether, with the thought that we might re-enable it without defrag if we got positive results. At this point I don't think THP is the root cause though, so I'm curious to see if anyone else gets positive results from disabling it. We definitely haven't seen any performance hit from turning it off.
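
(For reference, THP can be turned off at runtime roughly like this; a sketch, and the redhat_ prefix applies to RHEL 6 style kernels, while mainline kernels use /sys/kernel/mm/transparent_hugepage instead:)

echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
# needs to be re-applied on boot, e.g. from rc.local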
 






Re: postgresql.conf recommendations

From
Johnny Tan
Date:



On Wed, Feb 6, 2013 at 7:49 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
"ac@hsk.hk" <ac@hsk.hk> wrote:
> Johnny Tan <johnnydtan@gmail.com> wrote:

>>shared_buffers = 48GB # min 128kB

> From the postgresql.conf, I can see that the shared_buffers is
> set to 48GB which is not small,  it would be possible that the
> large buffer cache could be "dirty", when a checkpoint starts, it
> would cause a checkpoint I/O spike.
>
>
> I would like to suggest you about using pgtune to get recommended
> conf for postgresql.

I have seen symptoms like those described which were the result of
too many dirty pages accumulating inside PostgreSQL shared_buffers.
It might be something else entirely in this case, but it would at
least be worth trying a reduced shared_buffers setting combined
with more aggressive bgwriter settings.  I might try something like
the following changes, as an experiment:

shared_buffers = 8GB
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 4

Thanks Kevin. Wouldn't this be controlled by our checkpoint settings, though? 

Re: postgresql.conf recommendations

From
Johnny Tan
Date:
On Wed, Feb 6, 2013 at 2:13 PM, David Whittaker <dave@iradix.com> wrote:
We disabled THP altogether, with the thought that we might re-enable it without defrag if we got positive results. At this point I don't think THP is the root cause though, so I'm curious to see if anyone else gets positive results from disabling it. We definitely haven't seen any performance hit from turning it off.

We are considering disabling THP, although we aren't seeing the errors in syslog that David mentioned.

Josh: What made you think of THP? I'm curious if there's a discussion you can point me to. Since you mentioned it, I've been looking more into it, and there isn't too much. In fact, this post makes it sound like enabling it fixes a similar problem to what we're seeing -- i.e., %system shoots up during the spikes: http://www.pythian.com/blog/performance-tuning-hugepages-in-linux/

johnny

Re: postgresql.conf recommendations

From
Johnny Tan
Date:
On Tue, Feb 5, 2013 at 11:23 PM, Josh Krupka <jkrupka@gmail.com> wrote:
I've been looking into something on our system that sounds similar to what you're seeing. I'm still researching it, but I suspect the memory compaction that runs as part of transparent huge pages when memory is allocated, though that's yet to be proven. The tunable you mentioned controls the compaction that runs at allocation time so it can try to allocate large pages; there's a separate one that controls whether the compaction is done in khugepaged, and another that controls whether THP is used at all (/sys/kernel/mm/transparent_hugepage/enabled, though the path may differ in your distro).

BTW, I sent /defrag yesterday, but /enabled had the same output.
 
What's the output of this command?
egrep 'trans|thp|compact_' /proc/vmstat
compact_stall represents the number of processes that were stalled to do a compaction; the other metrics have to do with other parts of THP. If you see compact_stall climbing, from what I can tell those stalls might be causing your spikes. I haven't found a way of telling how long the processes were stalled. You could probably get a little more insight into the processes with some tracing, assuming you can catch it quickly enough. Running perf top will also show the compaction happening, but that doesn't necessarily mean it's impacting your running processes.

Interesting:

# egrep 'trans|thp|compact_' /proc/vmstat
nr_anon_transparent_hugepages 643
compact_blocks_moved 22629094
compact_pages_moved 532129382
compact_pagemigrate_failed 0
compact_stall 398051
compact_fail 80453
compact_success 317598
thp_fault_alloc 8254106
thp_fault_fallback 167286
thp_collapse_alloc 622783
thp_collapse_alloc_failed 3321
thp_split 122833 

Re: postgresql.conf recommendations

From
Jeff Janes
Date:
On Tue, Feb 5, 2013 at 2:02 PM, Johnny Tan <johnnydtan@gmail.com> wrote:

> checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0

I always set this to 0.9.  I don't know why the default is 0.5.


> But periodically, there are spikes in our app's db response time. Normally,
> the app's db response time hovers in the 100ms range for most of the day.
> During the spike times, it can go up to 1000ms or 1500ms, and the number of
> pg connections goes to 140 (maxed out to pgbouncer's limit, where normally
> it's only about 20-40 connections).

What if you lower the pgbouncer limit to 40?

It is hard to know if the latency spikes cause the connection build
up, or if the connection build up cause the latency spikes, or if they
reinforce each other in a vicious circle.  But making the connections
wait in pgbouncer's queue rather than in the server should do no harm,
and very well might help.
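
If you want to experiment with that, it's only a pgbouncer.ini change plus a reload -- a sketch with illustrative values, assuming the admin console is on the default port 6432 and the connecting user is listed in admin_users:

# in pgbouncer.ini (values are examples, not recommendations):
#   pool_mode = transaction
#   default_pool_size = 40
#   max_client_conn = 500    ; extra clients just queue inside pgbouncer
# then apply the new config without dropping existing clients:
psql -p 6432 -U postgres pgbouncer -c "RELOAD;"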

> Also, during these times, which usually
> last less than 2 minutes, we will see several thousand queries in the pg log
> (this is with log_min_duration_statement = 500), compared to maybe one or
> two dozen 500ms+ queries in non-spike times.

Is it just the duration of the queries that changes, or are the queries of a different nature?

Cheers,

Jeff


Re: postgresql.conf recommendations

From
Josh Krupka
Date:
I originally got started down that trail because running perf top while having some of the slow query issues showed compaction_alloc at the top of the list.  That function is the THP page compaction, which led me to some pages like:
http://www.olivierdoucet.info/blog/2012/05/19/debugging-a-mysql-stall/
http://structureddata.org/2012/06/18/linux-6-transparent-huge-pages-and-hadoop-workloads/
https://gist.github.com/fgbreel/4454559
Only the last one is about pg, but I think they all may still be applicable.

The kernel docs in Documentation/vm/transhuge.txt have an explanation of the metrics.

We hadn't been having the issue much until a few weeks ago, when we started using the rest of our free memory for page cache.  My thought was that if we have no memory that's totally free, it might be doing compaction more often.  That led me to look at how often compaction is happening, but like I said, I don't know how to tell how *long* it's happening -- someone who knows systemtap better than I do might be able to help collect that info, assuming the right systemtap events are there.

One thing to keep in mind is that the page you linked to (http://www.pythian.com/blog/performance-tuning-hugepages-in-linux/) talks mostly about *regular* large pages, which are related to but different from THP.

I have yet to be able to prove THP's involvement one way or the other, but we are going to try some things on a test box to zero in on it. 

Re: postgresql.conf recommendations

From
Kevin Grittner
Date:
Johnny Tan <johnnydtan@gmail.com> wrote:

> Wouldn't this be controlled by our checkpoint settings, though?

Spread checkpoints made the issue less severe, but on servers with
a lot of RAM I've had to make the above changes (or even go lower
with shared_buffers) to prevent a burst of writes from overwhelming
the RAID controller's battery-backed cache.  There may be other
things which could cause these symptoms, so I'm not certain that
this will help; but I have seen this as the cause and seen the
suggested changes help.

-Kevin


Re: postgresql.conf recommendations

From
Strahinja Kustudić
Date:
As others suggested, having shared_buffers = 48GB is too large. You should never need to go above 8GB. I have a similar server and mine has

shared_buffers = 8GB
checkpoint_completion_target = 0.9

This looks like a problem of dirty memory being flushed to disk. You should set up your monitoring to track dirty memory from /proc/meminfo and check whether it correlates with the slowdowns. Also, vm.dirty_background_bytes should always be a fraction of vm.dirty_bytes: once more than vm.dirty_bytes bytes are dirty, all writing to disk stalls until everything is flushed, whereas when vm.dirty_background_bytes is reached the kernel just slowly starts flushing those pages to disk in the background. As far as I remember, vm.dirty_bytes should be configured to be a little less than the cache size of your RAID controller, while vm.dirty_background_bytes should be about 4 times smaller.
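
Something like this is what I mean -- the limits below are purely illustrative, assuming for the sake of example a 512MB controller cache (adjust to whatever your card actually has), and the monitoring loop just samples Dirty from /proc/meminfo:

# watch dirty memory so you can correlate it with the slowdowns
while true; do
    echo "$(date +%H:%M:%S) $(grep ^Dirty: /proc/meminfo)"
    sleep 5
done

# example limits for a 512MB BBU cache: dirty_bytes a bit below the cache size,
# dirty_background_bytes roughly 4x smaller (persist them in /etc/sysctl.conf as well)
sysctl -w vm.dirty_bytes=402653184             # 384MB
sysctl -w vm.dirty_background_bytes=100663296  # 96MB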


Strahinja Kustudić
| System Engineer | Nordeus



Re: postgresql.conf recommendations

From
Charles Gomes
Date:
I've benchmarked shared_buffers with high and low settings, in a server dedicated to postgres with 48GB my settings are:
shared_buffers = 37GB
effective_cache_size = 38GB

Having a small number and depending on OS caching is unpredictable; if the server is dedicated to postgres, you want to make sure postgres has the memory. A random unrelated process doing a cat /dev/sda1 should not destroy postgres buffers.
I agree your problem is most likely related to the dirty background ratio; those buffers are READ only and have nothing to do with disk writes.




Re: postgresql.conf recommendations

From
Johnny Tan
Date:
I appreciate all the responses on this thread, even though some are conflicting :). We are going to try these one at a time, but we'll likely need a day or so in between each to see what impact (if any) it has, so it will take time. But I will post our findings back here.

We'll start with dirty_background_bytes, as that's straightforward. (dirty_bytes is already set to the same size as our RAID cache.)

Re: postgresql.conf recommendations

From
Josh Krupka
Date:
Just as an update from my angle on the THP side... I put together a systemtap script last night and so far it's confirming my theory (at least in our environment).  I want to go through some more data and make some changes on our test box to see if we can make it go away before declaring success - it's always possible two problems are intertwined or that the THP thing is only showing up because of the *real* problem... you know how it goes.

Basically the systemtap script does this:
- probes the compaction function
- keeps track of the number of calls to it and the aggregate time spent in it, by process
- at the end, spits out the collected info.

So far when I run the script for a short period of time that I know THP compactions are happening, I have been able to match up the compaction duration collected via systemtap with a query in the pg logs that took that amount of time or slightly longer (as expected).  A lot of these are only a second or so, so I haven't been able to catch everything, but at least the data I am getting is consistent.

Will be interested to see what you find Johnny.

Re: postgresql.conf recommendations

From
Johnny Tan
Date:
Josh:

Are you able to share your systemtap script? Our problem will be to try and regenerate the same amount of traffic/load that we see in production. We could replay our queries, but we don't even capture a full set because it'd be roughly 150GB per day.

johnny



Re: postgresql.conf recommendations

From
Johnny Tan
Date:
On Thu, Feb 7, 2013 at 11:16 PM, Tony Chan <tonychan@emblocsoft.com> wrote:
Hi,

May I know what is your setting for OS cache?


Tony:

Wasn't sure if you were asking me, but here's the output from "free":
# free
             total       used       free     shared    buffers     cached
Mem:     198333224  187151280   11181944          0     155512  179589612
-/+ buffers/cache:    7406156  190927068
Swap:     16777208          0   16777208 


- better to analyze large joins and sequential scans, and tune this parameter, e.g. reduce the size of effective_cache_size in postgresql.conf and change it for big queries.

This makes sense. We were setting it based on the tuning guideline from this page:
"Setting effective_cache_size to 1/2 of total memory would be a normal conservative setting, and 3/4 of memory is a more aggressive but still reasonable amount."

johnny 

Re: postgresql.conf recommendations

From
Scott Marlowe
Date:
On Thu, Feb 7, 2013 at 7:41 AM, Charles Gomes <charlesrg@outlook.com> wrote:
> I've benchmarked shared_buffers with high and low settings, in a server
> dedicated to postgres with 48GB my settings are:
> shared_buffers = 37GB
> effective_cache_size = 38GB
>
> Having a small number and depending on OS caching is unpredictable, if the
> server is dedicated to postgres you want make sure postgres has the memory.
> A random unrelated process doing a cat /dev/sda1 should not destroy postgres
> buffers.
> I agree your problem is most related to dirty background ration, where
> buffers are READ only and have nothing to do with disk writes.

You make an assertion here but do not tell us of your benchmarking
methods.  My testing in the past has shown catastrophic performance
with a very large % of memory as postgresql buffers under heavy write
loads, especially transactional ones.  Many others on this list have
had the same thing happen.  Also, you suppose PostgreSQL has a better
/ smarter caching algorithm than the OS kernel, and often this
is NOT the case.

In this particular instance the OP may not be seeing an issue from too
large a pg buffer, but my point still stands: a large pg buffer can cause
problems with heavy or even moderate write loads.


Re: postgresql.conf recommendations

From
Josh Krupka
Date:
Johnny,

Sure thing, here's the system tap script:

#! /usr/bin/env stap

# Track time spent in THP page compaction, keyed by pid-execname.
global pauses, counts

probe begin {
  printf("%s\n", ctime(gettimeofday_s()))
}

# Fires when compaction_alloc() returns; @entry() grabs the timestamp taken at entry,
# so elapsed_time is how long this process was stuck compacting (in microseconds).
probe  kernel.function("compaction_alloc@mm/compaction.c").return {
  elapsed_time = gettimeofday_us() - @entry(gettimeofday_us())
  key = sprintf("%d-%s", pid(), execname())
  pauses[key] = pauses[key] + elapsed_time
  counts[key]++
}

# On exit, print total stall time (ms) and stall count per process.
probe end {
  printf("%s\n", ctime(gettimeofday_s()))
  foreach (pid in pauses) {
    printf("pid %s : %d ms %d pauses\n", pid, pauses[pid]/1000, counts[pid])
  }
}


I was able to do some more observations in production, and some testing in the lab, here are my latest findings:
- The THP compaction delays aren't happening during queries (at least not that I've seen yet) from the middleware our legacy app uses.  The pauses during those queries are what originally got my attention.  Those queries, though, only ever insert/update/read/delete one record at a time (don't ask).  That theoretically makes sense: because of how that app works, the pg backend processes for that app don't have to ask for as much memory during a query, which is when the THP compactions would be happening.
- The THP compaction delays are impacting backend processes for other apps, and things like autovacuum processes - sometimes multiple seconds' worth of delay over a short period of time.
- I haven't been able to duplicate 1+s query times for our "one record at a time" app in the lab, but I was getting some 20-30ms queries, which is still higher than it should be most of the time.  We noticed in production, by looking at pg_stat_bgwriter, that the backends were having to write pages out for 50% of the allocations, so we started tuning checkpoint/bgwriter settings on the test system and seem to be making some progress (a query sketch follows this list).  See http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm
- I think you already started looking at this, but the linux dirty memory settings may have to be tuned as well (see Greg's post http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html).   Ours haven't been changed from the defaults, but that's another thing to test for next week.  Have you had any luck tuning these yet?
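
The sketch mentioned above -- a hedged example of pulling that backend-write ratio out of pg_stat_bgwriter (the percentage here is simply buffers_backend over buffers_alloc, which is one common way to compute it):

# how often backends, rather than the bgwriter/checkpointer, had to write pages out
psql -c "SELECT buffers_checkpoint, buffers_clean, buffers_backend, buffers_alloc,
                round(100.0 * buffers_backend / nullif(buffers_alloc, 0), 1) AS backend_write_pct
         FROM pg_stat_bgwriter;"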

Josh

Re: postgresql.conf recommendations

From
Jeff Janes
Date:
On Sat, Feb 9, 2013 at 6:51 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Thu, Feb 7, 2013 at 7:41 AM, Charles Gomes <charlesrg@outlook.com> wrote:
>> I've benchmarked shared_buffers with high and low settings, in a server
>> dedicated to postgres with 48GB my settings are:
>> shared_buffers = 37GB
>> effective_cache_size = 38GB
>>
>> Having a small number and depending on OS caching is unpredictable, if the
>> server is dedicated to postgres you want make sure postgres has the memory.
>> A random unrelated process doing a cat /dev/sda1 should not destroy postgres
>> buffers.
>> I agree your problem is most related to dirty background ration, where
>> buffers are READ only and have nothing to do with disk writes.
>
> You make an assertion here but do not tell us of your benchmarking
> methods.

Well, he is not the only one committing that sin.

> My testing in the past has show catastrophic performance
> with very large % of memory as postgresql buffers with heavy write
> loads, especially transactional ones.  Many others on this list have
> had the same thing happen.

People also have problems from setting it too low.  For example, doing
bulk loads into indexed tables becomes catastrophically bad when the
size of the index exceeds shared_buffers by too much (where "too much"
depends on the kernel, IO subsystem, and settings of vm.dirty*), and
increasing shared_buffers up to 80% of RAM fixes that (if 80% of RAM
is large enough to hold the indexes being updated).

Of course when doing bulk loads into truncated tables, you should drop
the indexes.  But if bulk loading into live tables, that is often a
cure worse than the disease.

> Also you supposed PostgreSQL has a better
> / smarter caching algorithm than the OS kernel, and often times this
> is NOT the case.

Even if it is not smarter as an algorithm, it might still be better to
use it.  For example,  "heap_blks_read", "heap_blks_hit", and friends
become completely useless if most block "reads" are not actually
coming from disk.

Also, vacuum_cost_page_miss is impossible to tune if some unknown but
potentially large fraction of those misses are not really misses, and
that fraction changes from table to table, and from wrap-around scan
to vm scan on the same table.

> In this particular instance the OP may not be seeing an issue from too
> large of a pg buffer, my point still stands, large pg_buffer can cause
> problems with heavy or even moderate write loads.

Sure, but that can go the other way as well.  What additional
instrumentation is needed so that people can actually know which is
the case for them?

Cheers,

Jeff


Re: postgresql.conf recommendations

From
Scott Marlowe
Date:
On Sat, Feb 9, 2013 at 1:16 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Sat, Feb 9, 2013 at 6:51 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>> On Thu, Feb 7, 2013 at 7:41 AM, Charles Gomes <charlesrg@outlook.com> wrote:
>>> I've benchmarked shared_buffers with high and low settings, in a server
>>> dedicated to postgres with 48GB my settings are:
>>> shared_buffers = 37GB
>>> effective_cache_size = 38GB
>>>
>>> Having a small number and depending on OS caching is unpredictable, if the
>>> server is dedicated to postgres you want make sure postgres has the memory.
>>> A random unrelated process doing a cat /dev/sda1 should not destroy postgres
>>> buffers.
>>> I agree your problem is most related to dirty background ration, where
>>> buffers are READ only and have nothing to do with disk writes.
>>
>> You make an assertion here but do not tell us of your benchmarking
>> methods.
>
> Well, he is not the only one committing that sin.

I'm not asking for a complete low-level view, but it would be nice to
know if he's benchmarking heavy read or write loads, lots of users, a
few users, something.  All we get is "I've benchmarked a lot" followed
by "don't let the OS do the caching."  At least with my testing I was
using a large transactional system (heavy write), and there I KNOW from
testing that large shared_buffers do nothing but get in the way.

All the rest of the stuff you mention is why we have effective_cache_size,
which tells postgresql roughly how much of the data CAN be cached.
In short, postgresql is designed to use and/or rely on the OS cache.


Re: postgresql.conf recommendations

From
Jeff Janes
Date:
On Saturday, February 9, 2013, Scott Marlowe wrote:
On Sat, Feb 9, 2013 at 1:16 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Sat, Feb 9, 2013 at 6:51 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>> On Thu, Feb 7, 2013 at 7:41 AM, Charles Gomes <charlesrg@outlook.com> wrote:
>>> I've benchmarked shared_buffers with high and low settings, in a server
>>> dedicated to postgres with 48GB my settings are:
>>> shared_buffers = 37GB
>>> effective_cache_size = 38GB
>>>
>>> Having a small number and depending on OS caching is unpredictable, if the
>>> server is dedicated to postgres you want make sure postgres has the memory.
>>> A random unrelated process doing a cat /dev/sda1 should not destroy postgres
>>> buffers.
>>> I agree your problem is most related to dirty background ration, where
>>> buffers are READ only and have nothing to do with disk writes.
>>
>> You make an assertion here but do not tell us of your benchmarking
>> methods.
>
> Well, he is not the only one committing that sin.

I'm not asking for a complete low level view.  but it would be nice to
know if he's benchmarking heavy read or write loads, lots of users, a
few users, something.  All we get is "I've benchmarked a lot" followed
by "don't let the OS do the caching."  At least with my testing I was
using a large transactional system (heavy write) and there I KNOW from
testing that large shared_buffers do nothing but get in the way.

Can you see this with pgbench workloads? (it is certainly write heavy)

I've tried to reproduce these problems, and was never able to.
 

all the rest of the stuff you mention is why we have effective cache
size which tells postgresql about how much of the data CAN be cached.

The effective_cache_size setting does not figure into any of the things I mentioned.
 
Cheers,

Jeff

Re: postgresql.conf recommendations

From
Charles Gomes
Date:


> Date: Sat, 9 Feb 2013 14:03:35 -0700
> Subject: Re: [PERFORM] postgresql.conf recommendations
> From: scott.marlowe@gmail.com
> To: jeff.janes@gmail.com
> CC: charlesrg@outlook.com; strahinjak@nordeus.com; kgrittn@ymail.com; johnnydtan@gmail.com; ac@hsk.hk; jkrupka@gmail.com; alex@paperlesspost.com; pgsql-performance@postgresql.org
>
> On Sat, Feb 9, 2013 at 1:16 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> > On Sat, Feb 9, 2013 at 6:51 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> >> On Thu, Feb 7, 2013 at 7:41 AM, Charles Gomes <charlesrg@outlook.com> wrote:
> >>> I've benchmarked shared_buffers with high and low settings, in a server
> >>> dedicated to postgres with 48GB my settings are:
> >>> shared_buffers = 37GB
> >>> effective_cache_size = 38GB
> >>>
> >>> Having a small number and depending on OS caching is unpredictable, if the
> >>> server is dedicated to postgres you want make sure postgres has the memory.
> >>> A random unrelated process doing a cat /dev/sda1 should not destroy postgres
> >>> buffers.
> >>> I agree your problem is most related to dirty background ration, where
> >>> buffers are READ only and have nothing to do with disk writes.
> >>
> >> You make an assertion here but do not tell us of your benchmarking
> >> methods.
> >
> > Well, he is not the only one committing that sin.
>
> I'm not asking for a complete low level view. but it would be nice to
> know if he's benchmarking heavy read or write loads, lots of users, a
> few users, something. All we get is "I've benchmarked a lot" followed
> by "don't let the OS do the caching." At least with my testing I was
> using a large transactional system (heavy write) and there I KNOW from
> testing that large shared_buffers do nothing but get in the way.
>
> all the rest of the stuff you mention is why we have effective cache
> size which tells postgresql about how much of the data CAN be cached.
> In short, postgresql is designed to use and / or rely on OS cache.
>
Hello Scott

I've tested using 8 bulk writers on an 8-core machine (16 threads).

I've loaded a database with 17 partitions, total 900 million rows and later executed single queries on it.

In my case the main point of having postgres manage memory is because postgres is the single and most important application running on the server.

 

If Linux managed the cache, it would not know what is important and what should be discarded; it would simply discard the oldest, least-accessed entry.

Let's say a DBA logs in to the server and copies a 20GB file. If you leave it to Linux to decide, it will decide that the 20GB file is more important than the old, not-so-heavily-accessed postgres entries.

 

This may need to be looked at case by case; in my case I need PostgreSQL to perform FAST, and I also don't want cron jobs (for example locate, logrotate, prelink, makewhatis) pushing my cache out.

 

If postgres were unable to manage 40GB of RAM, we would be in major trouble, because nowadays it's normal to buy 64GB servers, and many of us have dealt with 512GB RAM servers.

 

By the way, I've tested this same scenario with Postgres, MySQL and Oracle, and Postgres has given the best results overall, especially with symmetric replication turned on.

Re: postgresql.conf recommendations

From
Johnny Tan
Date:
On Sat, Feb 9, 2013 at 2:37 PM, Josh Krupka <jkrupka@gmail.com> wrote:
Johnny,

Sure thing, here's the system tap script:


Thank you for this!

 
- I think you already started looking at this, but the linux dirty memory settings may have to be tuned as well (see Greg's post http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html).   Ours haven't been changed from the defaults, but that's another thing to test for next week.  Have you had any luck tuning these yet?

We lowered dirty_background_bytes to 1/4 of what dirty_bytes was. That didn't get rid of the spikes, but seemed to have some impact -- just from 24 hours' observation, the spikes were clustered together more closely, and then there were long stretches without any spikes. Unfortunately, we only got to observe 24 hours before we made the next change.

The next change was lowering pgbouncer poolsize down to 50. We originally (way back when) started out at 100, then bumped up to 150. But Jeff Janes' rationale for LOWERING the poolsize/connections made sense to us.

And so far, 48 hours since lowering it, it does seem to have eliminated the DB spikes! We haven't seen one yet, and this is the longest we've gone without seeing one.

To be more precise, we now have a lot more "short" spikes -- i.e., our response time graphs are more jagged, but at least the peaks are under the threshold we desire. Previously, they were "smooth" in between the spikes.

We will probably tweak this knob some more -- i.e., what is the sweet spot between 1 and 100? Would it be higher than 50 but less than 100? Or is it somewhere lower than 50?

Even after we find that sweet spot, I'm still going to try some of the other suggestions. I do want to play with shared_buffers, just so we know whether, for our setup, it's better to have larger or smaller shared_buffers. I'd also like to test the THP stuff on a testing cluster, which we are still in the middle of setting up (or rather, we have set up, but we need to make it more prod-like).

johnny

Re: postgresql.conf recommendations

From
Scott Marlowe
Date:
On Mon, Feb 11, 2013 at 7:57 AM, Charles Gomes <charlesrg@outlook.com> wrote:
>
>
>> Date: Sat, 9 Feb 2013 14:03:35 -0700
>
>> Subject: Re: [PERFORM] postgresql.conf recommendations
>> From: scott.marlowe@gmail.com
>> To: jeff.janes@gmail.com
>> CC: charlesrg@outlook.com; strahinjak@nordeus.com; kgrittn@ymail.com;
>> johnnydtan@gmail.com; ac@hsk.hk; jkrupka@gmail.com; alex@paperlesspost.com;
>> pgsql-performance@postgresql.org
>
>>
>> On Sat, Feb 9, 2013 at 1:16 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> > On Sat, Feb 9, 2013 at 6:51 AM, Scott Marlowe <scott.marlowe@gmail.com>
>> > wrote:
>> >> On Thu, Feb 7, 2013 at 7:41 AM, Charles Gomes <charlesrg@outlook.com>
>> >> wrote:
>> >>> I've benchmarked shared_buffers with high and low settings, in a
>> >>> server
>> >>> dedicated to postgres with 48GB my settings are:
>> >>> shared_buffers = 37GB
>> >>> effective_cache_size = 38GB
>> >>>
>> >>> Having a small number and depending on OS caching is unpredictable, if
>> >>> the
>> >>> server is dedicated to postgres you want make sure postgres has the
>> >>> memory.
>> >>> A random unrelated process doing a cat /dev/sda1 should not destroy
>> >>> postgres
>> >>> buffers.
>> >>> I agree your problem is most related to dirty background ration, where
>> >>> buffers are READ only and have nothing to do with disk writes.
>> >>
>> >> You make an assertion here but do not tell us of your benchmarking
>> >> methods.
>> >
>> > Well, he is not the only one committing that sin.
>>
>> I'm not asking for a complete low level view. but it would be nice to
>> know if he's benchmarking heavy read or write loads, lots of users, a
>> few users, something. All we get is "I've benchmarked a lot" followed
>> by "don't let the OS do the caching." At least with my testing I was
>> using a large transactional system (heavy write) and there I KNOW from
>> testing that large shared_buffers do nothing but get in the way.
>>
>> all the rest of the stuff you mention is why we have effective cache
>> size which tells postgresql about how much of the data CAN be cached.
>> In short, postgresql is designed to use and / or rely on OS cache.
>>
> Hello Scott
>
> I've tested using 8 bulk writers in a 8 core machine (16 Threads).
>
> I've loaded a database with 17 partitions, total 900 million rows and later
> executed single queries on it.
>
> In my case the main point of having postgres manage memory is because
> postgres is the single and most important application running on the server.
>
>
>
> If Linux would manage the Cache it would not know what is important and what
> should be discarded, it would simply discard the oldest least accessed
> entry.

Point taken.  However:

> Let's say a DBA logs in the server and copies a 20GB file. If you leave
> Linux to decide, it will decide that the  20GB file is more important than
> the old not so heavily accessed postgres entries.

The linux kernel (and most other unix kernels) don't cache that way.
They're usually quite smart about caching.  While some older things
might get pushed out, it doesn't generally make room for larger files
that have been accessed just once.  But on a mixed load server this
may not be the case.

> If postgres was unable to manage 40GB of RAM, we would get into major
> problems because nowadays it's normal to buy 64GB servers, and  many of Us
> have dealt with 512GB Ram Servers.

It's not that postgres can't handle a large cache, it's that quite
often the kernel is simply better at it.

> By the way, I've tested this same scenario with Postgres, Mysql and Oracle.
> And Postgres have given the best results overall. Especially with symmetric
> replication turned on.

Good to know.  In the past PostgreSQL has had some performance issues
with large shared_buffers values, and this is still apparently the case
when run on Windows.  With dedicated linux servers running just
postgres, letting the kernel handle the cache has yielded very good
results.  Most of the negative effects of large buffers I've seen have
been on heavy-write / highly transactional dbs.


Re: postgresql.conf recommendations

From
Will Platnick
Date:
We will probably tweak this knob some more -- i.e., what is the sweet spot between 1 and 100? Would it be higher than 50 but less than 100? Or is it somewhere lower than 50?
I would love to know the answer to this as well. We have a similar situation: pgbouncer in transaction pooling mode with 140 connections.  What is the right value to size pgbouncer connections to? Is there a formula that takes the # of cores into account?

Re: postgresql.conf recommendations

From
Scott Marlowe
Date:
On Mon, Feb 11, 2013 at 4:29 PM, Will Platnick <wplatnick@gmail.com> wrote:
> We will probably tweak this knob some more -- i.e., what is the sweet spot
> between 1 and 100? Would it be higher than 50 but less than 100? Or is it
> somewhere lower than 50?
>
> I would love to know the answer to this as well. We have a similar
> situation, pgbouncer with transaction log pooling with 140 connections.
> What is the the right value to size pgbouncer connections to? Is there a
> formula that takes the # of cores into account?

If you can come up with a synthetic benchmark that's similar to what
your real load is (size, mix, etc.), then you can test it and see at what
connection count your throughput peaks while the server still behaves
well.
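
A minimal sketch of such a sweep with pgbench -- scale, client counts and run length are all illustrative, and a custom script (-f) that mimics your real query mix would be closer to reality:

# one-time init of a throwaway benchmark database
createdb bench
pgbench -i -s 100 bench

# sweep client counts and note where tps peaks
for c in 10 20 40 60 80 100 140; do
    echo "== $c clients =="
    pgbench -c $c -j 4 -T 60 bench | grep tps
done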

On a server I built a few years back, with 48 AMD cores and 24 spinners
in a RAID-10 for data and 4 drives in a RAID-10 for pg_xlog (no RAID
controller in this one, as the chassis cooked them), my throughput
peaked at ~60 connections.  What you'll wind up with is a graph where
the throughput keeps climbing as you add clients, and at some point it
will usually drop off quickly once you pass the peak.  The sharper the
drop, the more dangerous it is to run your server in such an overloaded
situation.


--
To understand recursion, one must first understand recursion.