Thread: Queries seldomly take >4s while normally take <1ms?
Hello I have a setup with one master and two slaves which are used by a closed source application. The database is asked the same query, a stored procedure, with different parameters about 4 million times per second at a peak rate of 150 times per second using 10 parallel connections. The slaves are decent Dell servers with 64GB RAM with dual hexacore CPUs and RAID10. Usually this stored procedure takes <1ms as it basically just does two selects against a GIST index for a prefix_range type. Seldomly though, about 1-3 times per day, one of these queries takes up to 4000ms! All those queries also runs in <1ms when executed manually some times later. Queries with similar parameters (who I supposed to use the same area of the index) also continues to run fast during that time. Queries with different paramers which are running parallel on different threads take <1ms, too, so it's not a general "load problem". Cronjobs and other applications seem quiet during that time, there is no peak in any of our monitoring graphs. Automatic vacuum/analyze log entries on the master are not near the timestamps in question. So my problem seems not the query itself nor the way I indexed my data but what could it be? Some strange effects with streaming replication or cache invalidation? Apologies for not giving you reproducible problem but maybe you still have some ideas as I'm just curious as I've never seem such an effect during my MySQL years :-) The queries contain obvious customer data so I'm reluctant to give examples but again I doubt that an explain plan will help if only 1 out of 4E6 queries takes too long. bye, -christian-
Christian -- <original text snip because this POS editor won't let me properly edit> postgres version ? type of replication ? changes from postgres config defaults ? Do they happen more at peak usage, semi regularly or sporadically ? Possibly some sporadic postgres process such as checkpoints of autovac processes kicking off. Do your logs show anything? HTH, Greg W.
Hello On Tue, 9 Apr 2013 03:53:13 -0700 (PDT) Greg Williamson <gwilliamson39@yahoo.com> wrote: > Christian -- > > <original text snip because this POS editor won't let me properly edit> > > postgres version ? 9.2.3 > type of replication ? As written, one master does streaming replication to two slaves. > changes from postgres config defaults ? max_connections = 1000 # (change requires restart) shared_buffers = 20GB # min 128kB wal_level = hot_standby # minimal, archive, or hot_standby archive_mode = on # allows archiving to be done archive_command = 'test ! -f /srv/postgresql-data/archivedir/%f && cp %p /srv/postgresql-data/archivedir/%f' #command to use to archive a logfile segment max_wal_senders = 3 # max number of walsender processes hot_standby = on # "on" allows queries during recovery max_standby_archive_delay = 1h # max delay before canceling queries max_standby_streaming_delay = 1h # max delay before canceling queries hot_standby_feedback = on # send info from standby to prevent effective_cache_size = 1024MB log_destination = 'stderr' # Valid values are combinations of logging_collector = on # Enable capturing of stderr and csvlog log_directory = '/var/log/postgresql/' # directory where log files are written, log_filename = 'postgresql-9.2-data.log' # log file name pattern, log_file_mode = 0640 # creation mode for log files, log_rotation_age = 0 # Automatic rotation of logfiles will log_rotation_size = 0 # Automatic rotation of logfiles will log_min_messages = notice log_min_duration_statement = 500 # -1 is disabled, 0 logs all statements log_connections = on log_disconnections = on log_duration = on log_line_prefix = '%t [%p] %u@%d ' # special values: log_statement = 'all' # none, ddl, mod, all log_timezone = 'Europe/Berlin' track_activities = on track_counts = on track_io_timing = on datestyle = 'iso, dmy' timezone = 'Europe/Berlin' lc_messages = 'en_GB.UTF-8' # locale for system error message default_text_search_config = 'pg_catalog.english' shared_preload_libraries = 'pg_stat_statements' pg_stat_statements.max = 1000 pg_stat_statements.track = all effective_cache_size is way too low, I noticed, can that be the cause? > Do they happen more at peak usage, semi regularly or sporadically ? Sporadically. Not at a specific hour or minute-of-hour. Not only at peak times either. Sometimes at 05:xx in the morning, sometings at 10:xxh. > Possibly some sporadic postgres process such as checkpoints of > autovac processes kicking off. Do your logs show anything ? As far as I understood, Auto-Vacuum and Auto-Cleaning only happen on the master and their results get to the slaves via streaming replication. I therefore checked the master logs and, as written, all occurences of Auto-* are at least half an hour before or after. > HTH, > > Greg W. bye, -christian-
Christian Hammers <ch@lathspell.de> wrote: > 9.2.3 You really need to think about 9.2.4 Real Soon Now; there's a security fix that you probably should not wait on. > max_connections = 1000 # (change requires restart) > shared_buffers = 20GB # min 128kB Those are both potential causes. For max_connections, see this: http://wiki.postgresql.org/wiki/Number_Of_Database_Connections Maybe you happened to have enough users hit the enter key at the same moment to cause a process holding a lock to be starved of cycles or something similar. One problem with a large shared_buffers setting is that PostgreSQL can accumulate a very large number of dirty pages and flush them to the OS all at once. This can overwhelm the storage system and cause exactly the kind of symptoms you're seeing. > effective_cache_size is way too low, I noticed, can that be the cause? Probably not. -- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, 9 Apr 2013 07:25:16 -0700 (PDT) Kevin Grittner <kgrittn@ymail.com> wrote: > Christian Hammers <ch@lathspell.de> wrote: > > > 9.2.3 > > You really need to think about 9.2.4 Real Soon Now; there's a > security fix that you probably should not wait on. Is scheduled (no access from outside to that network segment at least) > > max_connections = 1000 # (change requires restart) > > shared_buffers = 20GB # min 128kB > > Those are both potential causes. For max_connections, see this: > > http://wiki.postgresql.org/wiki/Number_Of_Database_Connections > > Maybe you happened to have enough users hit the enter key at the > same moment to cause a process holding a lock to be starved of > cycles or something similar. The application connects permanently with a fixed number of only 20 connections. > One problem with a large shared_buffers setting is that PostgreSQL > can accumulate a very large number of dirty pages and flush them to > the OS all at once. This can overwhelm the storage system and > cause exactly the kind of symptoms you're seeing. I have pretty big changes during early night hours on the master and then almost only read-only accesses during the day. As checkpoint_timeout is at 5min, there should not be any significant amount of dirty pages during daytime, right? Where would I verity this, with pg_stats_bgwriter.buffers_checkpoint and the Linux I/O graphs? bye, -christian-
One of the most common causes I've seen for this is linux's vm.*dirty* settings to get in the way. Like so many linux kernel "optimizations" this one looks good on paper but gives at best middling improvements with occasional io storms that block everything else. On big mem machines doing a lot of writing IO I just set these to 0. Also tend to turn off swap as well as it's known to get in the way as well.
settings for /etc/sysctl.conf
vm.dirty_background_ratio = 0
vm.dirty_ratio = 0
On Tue, Apr 9, 2013 at 3:41 AM, Christian Hammers <ch@lathspell.de> wrote:
Hello
I have a setup with one master and two slaves which are used by a closed
source application. The database is asked the same query, a stored procedure,
with different parameters about 4 million times per second at a peak rate of
150 times per second using 10 parallel connections. The slaves are decent
Dell servers with 64GB RAM with dual hexacore CPUs and RAID10.
Usually this stored procedure takes <1ms as it basically just does two
selects against a GIST index for a prefix_range type. Seldomly though,
about 1-3 times per day, one of these queries takes up to 4000ms!
All those queries also runs in <1ms when executed manually some times later.
Queries with similar parameters (who I supposed to use the same area of the
index) also continues to run fast during that time. Queries with different
paramers which are running parallel on different threads take <1ms, too,
so it's not a general "load problem".
Cronjobs and other applications seem quiet during that time, there is
no peak in any of our monitoring graphs. Automatic vacuum/analyze log
entries on the master are not near the timestamps in question.
So my problem seems not the query itself nor the way I indexed my data
but what could it be? Some strange effects with streaming replication
or cache invalidation?
Apologies for not giving you reproducible problem but maybe you
still have some ideas as I'm just curious as I've never seem such an
effect during my MySQL years :-) The queries contain obvious customer
data so I'm reluctant to give examples but again I doubt that
an explain plan will help if only 1 out of 4E6 queries takes too long.
bye,
-christian-
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
To understand recursion, one must first understand recursion.
On Apr 9, 2013, at 11:25 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote: > One of the most common causes I've seen for this is linux's vm.*dirty* settings to get in the way. Like so many linux kernel"optimizations" this one looks good on paper but gives at best middling improvements with occasional io storms thatblock everything else. On big mem machines doing a lot of writing IO I just set these to 0. Also tend to turn off swapas well as it's known to get in the way as well. > > settings for /etc/sysctl.conf > vm.dirty_background_ratio = 0 > vm.dirty_ratio = 0 > I'll +1 on the "you have to tune your Linux install" advice. I found the "PostgreSQL 9.0 High Performance" book to be worth its weight in gold. A few days spent with the book and researchon mailing lists improved our PostgreSQL performance multiple times over, and responsiveness under load by ordersof magnitude. http://www.amazon.com/PostgreSQL-High-Performance-Gregory-Smith/dp/184951030X > > > On Tue, Apr 9, 2013 at 3:41 AM, Christian Hammers <ch@lathspell.de> wrote: > Hello > > I have a setup with one master and two slaves which are used by a closed > source application. The database is asked the same query, a stored procedure, > with different parameters about 4 million times per second at a peak rate of > 150 times per second using 10 parallel connections. The slaves are decent > Dell servers with 64GB RAM with dual hexacore CPUs and RAID10. > > Usually this stored procedure takes <1ms as it basically just does two > selects against a GIST index for a prefix_range type. Seldomly though, > about 1-3 times per day, one of these queries takes up to 4000ms! > All those queries also runs in <1ms when executed manually some times later. > Queries with similar parameters (who I supposed to use the same area of the > index) also continues to run fast during that time. Queries with different > paramers which are running parallel on different threads take <1ms, too, > so it's not a general "load problem". > > Cronjobs and other applications seem quiet during that time, there is > no peak in any of our monitoring graphs. Automatic vacuum/analyze log > entries on the master are not near the timestamps in question. > > So my problem seems not the query itself nor the way I indexed my data > but what could it be? Some strange effects with streaming replication > or cache invalidation? > > Apologies for not giving you reproducible problem but maybe you > still have some ideas as I'm just curious as I've never seem such an > effect during my MySQL years :-) The queries contain obvious customer > data so I'm reluctant to give examples but again I doubt that > an explain plan will help if only 1 out of 4E6 queries takes too long. > > bye, > > -christian- > > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general > > > > -- > To understand recursion, one must first understand recursion.
On Tue, Apr 9, 2013 at 12:37 PM, Steven Schlansker <steven@likeness.com> wrote:
I'll +1 on the "you have to tune your Linux install" advice.
On Apr 9, 2013, at 11:25 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> One of the most common causes I've seen for this is linux's vm.*dirty* settings to get in the way. Like so many linux kernel "optimizations" this one looks good on paper but gives at best middling improvements with occasional io storms that block everything else. On big mem machines doing a lot of writing IO I just set these to 0. Also tend to turn off swap as well as it's known to get in the way as well.
>
> settings for /etc/sysctl.conf
> vm.dirty_background_ratio = 0
> vm.dirty_ratio = 0
>
I found the "PostgreSQL 9.0 High Performance" book to be worth its weight in gold. A few days spent with the book and research on mailing lists improved our PostgreSQL performance multiple times over, and responsiveness under load by orders of magnitude.
http://www.amazon.com/PostgreSQL-High-Performance-Gregory-Smith/dp/184951030X
Yep. That's probably the single most useful performance tuning book anyone working with dbs can own. Even if you don't run postgresql, the hardware tuning and testing section is fantastic.