Thread: Checkpoint_segments optimal value
Hello All
I had a database load issue today, and when I was checking the logs I saw this message. I'm running PostgreSQL 8.3.
db::ts:2014-07-17 09:38:21 PDT::LOG: checkpoints are occurring too frequently (26 seconds apart)
db::ts:2014-07-17 09:38:21 PDT::HINT: Consider increasing the configuration parameter "checkpoint_segments".
db::ts:2014-07-17 09:38:21 PDT::LOG: checkpoint starting: xlog
Please suggest the optimal checkpoint_segments value I should set.
My current values are these:
checkpoint_segments = 32 # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min # range 30s-1h
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s # 0 is off
Current Memory values
# - Memory -
effective_cache_size = 8GB
shared_buffers = 1024MB # min 128kB or max_connections*16kB
# (change requires restart)
temp_buffers = 64MB # min 800kB
#max_prepared_transactions = 5 # can be 0 or more
# (change requires restart)
work_mem = 512MB # min 64kB
maintenance_work_mem = 2048MB # min 1MB !!! increased to 1GB 10/1/2010 mrostron (to assist autovacuum workers)
#max_stack_depth = 2MB # min 100kB
# - Free Space Map -
max_fsm_pages = 809600 # min max_fsm_relations*16, 6 bytes each
# (change requires restart)
synchronous_commit = off
#full_page_writes = on # recover from partial page writes
#wal_buffers = 64kB # min 32kB
# (change requires restart)
#wal_writer_delay = 200ms # 1-10000 milliseconds
#commit_delay = 0 # range 0-100000, in microseconds
#commit_siblings = 5 # range 1-1000
On Thu, 17 Jul 2014 10:44:32 -0700 Prabhjot Sheena <prabhjot.sheena@rivalwatch.com> wrote:
> I had a database load issue today, and when I was checking the logs I saw this message. I'm running PostgreSQL 8.3.
>
> db::ts:2014-07-17 09:38:21 PDT::LOG: checkpoints are occurring too frequently (26 seconds apart)
> db::ts:2014-07-17 09:38:21 PDT::HINT: Consider increasing the configuration parameter "checkpoint_segments".
> db::ts:2014-07-17 09:38:21 PDT::LOG: checkpoint starting: xlog

Is this the first time you've had the load problem? How often does it happen? How often is that message in the logs?

The downside (as described in the docs) is that increasing the value will cause PG to take longer to recover in the event of a crash. That, and you need enough disk space to hold the extra segments.

If the warning isn't happening too often, I would try increasing it only a little and see if it helps. If it's not enough you can then increase it some more. Various sources around the Internet suggest that you don't want to go much larger than 256 for this (if only because it's uncommon to do so and is probably indicative of other tuning that you need to do). Unfortunately, you need to restart PG for the change to take effect, so you have to balance experimenting with your tuning against how often you can get away with a server restart.

If it's just that one time that you got that message, then you might want to double checkpoint_segments to 64. Anything more than that seems unjustified, unless you're seeing the problem a lot more often than your email suggests.

If the problem is happening frequently, you're probably way better off organizing an upgrade to PG 9.3 than fiddling with a lot of tuning. The number of performance improvements from 8.3 to 9.3 is quite significant.

-- Potentialtech <wmoran@potentialtech.com>
Potentialtech <wmoran@potentialtech.com> writes:
> If the warning isn't happening too often, I would try increasing it only a
> little and see if it helps. If it's not enough you can then increase it some
> more. Various sources around the Internet suggest that you don't want to go
> much larger than 256 for this (if only because it's uncommon to do so and is
> probably indicative of other tuning that you need to do). Unfortunately, you
> need to restart PG for the change to take effect, so you have to balance
> experimenting with your tuning against how often you can get away with a
> server restart.

Huh? You don't need a restart, just a reload (SIGHUP) to change that.

regards, tom lane
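A minimal sketch of what "just a reload" looks like in practice on 8.3; the config file location and data directory below are assumptions for illustration, not details taken from this thread:

# in postgresql.conf, raise the segment count as discussed above
checkpoint_segments = 64            # was 32; each segment is 16MB

-- then ask the server to re-read its configuration, e.g. from psql as a superuser:
SELECT pg_reload_conf();
-- or from the shell: pg_ctl reload -D /var/lib/pgsql/data

-- confirm the running value:
SHOW checkpoint_segments;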
Thanks. I have changed that to 64 and reloaded it.
When I had the load issue today, there was this exact same query that hit the db 50 to 60 times from different machines in 3 to 4 minutes; it was taking a long time to execute and was holding up the database. I recreated an index and it started performing better. My question is: why is it not fetching the result from memory, since it is the same query that runs again and again?

This is the actual query I'm talking about:

SELECT account.id, account.organization_id, run.application_id,
       work_unit.script, work_unit.id, work_unit.start_time, run.id,
       work_unit.priority
FROM work_unit, run, account
WHERE work_unit.status = 3
  AND work_unit.run_id = run.id
  AND work_unit.type != 1
  AND run.status = 1
  AND run.account_id = account.id

Please suggest if I can do something to fix this.
On Thu, Jul 17, 2014 at 11:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Huh? You don't need a restart, just a reload (SIGHUP) to change that.
On Thu, 17 Jul 2014 14:06:28 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Huh? You don't need a restart, just a reload (SIGHUP) to change that.

Really? Well, thanks for correcting me on that.

-- Potentialtech <wmoran@potentialtech.com>
On Thu, 17 Jul 2014 11:28:04 -0700 Prabhjot Sheena <prabhjot.sheena@rivalwatch.com> wrote:
> Thanks. I have changed that to 64 and reloaded it.
>
> When I had the load issue today, there was this exact same query that hit
> the db 50 to 60 times from different machines in 3 to 4 minutes; it was
> taking a long time to execute and was holding up the database. I recreated
> an index and it started performing better. My question is: why is it not
> fetching the result from memory, since it is the same query that runs
> again and again?
>
> Please suggest if I can do something to fix this.

Provide a lot more information if you want anyone on the list to be able to help: such as explain output while the problem is happening, and some information about the makeup of the tables (column types/indexes/# rows).

Guessing, based on the little information you've provided, it's likely that you have something else going on at the same time that you're not aware of, and this particular query is only a symptom. I'm saying that because SELECTs don't generally create any WAL traffic, so there were probably some INSERT/UPDATE/DELETE running at the same time that both pushed those 3 tables out of memory and/or saturated disk activity to the point that accessing everything becomes slow for a short while, and it's just those queries that you noticed.

Are you making the mistake where you set log_min_duration to 1s and only worry about queries that take longer than 1s? Because I've seen (on multiple occasions) where many 1000s of queries, each running less than 1s, are the actual cause of the problem. pgBadger is particularly helpful in tracking down situations like that.

-- Potentialtech <wmoran@potentialtech.com>
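A hedged sketch of the kind of 8.3 logging setup that makes that sort of analysis possible; the threshold and prefix are illustrative choices (the prefix is one commonly used with pgBadger), not settings taken from this server:

# log every statement with its duration; 0 logs everything, which can be verbose on a busy system
log_min_duration_statement = 0
# a prefix with timestamp, pid, user and database so a log parser can group entries
log_line_prefix = '%t [%p]: user=%u,db=%d '
# log each checkpoint, handy when chasing "checkpoints are occurring too frequently"
log_checkpoints = on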
Here is the explain plan. The query time went back up to 2 to 3 minutes from 3 seconds in just 2 hrs. Can anyone suggest something on how to fix this, or why it is happening?
explain SELECT account.id, account.organization_id, run.application_id, work_unit.script, work_unit.id, work_unit.start_time, run.id, work_unit.priority FROM work_unit, run, account WHERE work_unit.status = 3 AND work_unit.run_id = run.id AND work_unit.type != 1 AND run.status = 1 AND run.account_id = account.id;
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..24.84 rows=1 width=64)
Join Filter: (run.id = work_unit.run_id)
-> Nested Loop (cost=0.00..16.55 rows=1 width=16)
-> Index Scan using run_status_1 on run (cost=0.00..8.27 rows=1 width=12)
Index Cond: (status = 1)
-> Index Scan using account_pkey on account (cost=0.00..8.27 rows=1 width=8)
Index Cond: (account.id = run.account_id)
-> Index Scan using idx_work_unit_1_status_3_new on work_unit (cost=0.00..8.28 rows=1 width=52)
Index Cond: (work_unit.status = 3)
Table Structure
\d work_unit
Table "public.work_unit"
Column | Type | Modifiers
------------+-----------------------------+-----------
id | integer | not null
run_id | integer |
status | integer |
script | character varying |
parameters | character varying |
input | character varying |
start_time | timestamp without time zone |
stop_time | timestamp without time zone |
priority | integer |
type | integer |
lineitems | integer |
outputs | integer |
sub_type | integer |
site_code | text |
Indexes:
"work_unit_pkey" PRIMARY KEY, btree (id)
"idx_work_unit_1_partial" btree (run_id, start_time) WHERE status >= 3
"idx_work_unit_1_run_id" btree (run_id)
"idx_work_unit_1_script" btree (script)
"idx_work_unit_1_site_code" btree (site_code)
"idx_work_unit_1_starttime" btree (start_time)
"idx_work_unit_1_status_3_new" btree (status, type) WHERE status = 3 AND type <> 1
"idx_work_unit_1_status_5" btree (status) WHERE status >= 4
"idx_work_unit_1_status_part_new" btree (status) WHERE status < 4
"idx_work_unit_1_stop_time" btree (stop_time)
\d run
Table "public.run"
Column | Type | Modifiers
---------------------+--------------------------+---------------
id | integer | not null
job_id | integer |
start_time | timestamp with time zone | not null
status | integer | not null
job_name | character varying |
account_id | integer |
application_id | integer |
postproc_script | character varying |
upload_url | character varying |
complete_time | timestamp with time zone |
email | character varying |
size | integer |
errors | integer |
raw_count | integer |
munge | integer |
job_details | character varying |
user_settings | character varying |
run_type | integer |
stop_after | timestamp with time zone |
total_work_time | integer |
flags | integer |
cluster_info_id | integer |
skiplineitems_count | integer |
last_update_ts | timestamp with time zone |
result_exists | boolean | default false
abort_type | integer |
Indexes:
"run_pkey" PRIMARY KEY, btree (id)
"idx_run_acc_stat_comp" btree (account_id, status, complete_time)
"idx_run_app" btree (application_id)
"idx_run_complete_time_2" btree (complete_time)
"idx_run_job_name" btree (job_name)
"run_application_account_idx" btree (application_id, account_id, status)
"run_job" btree (job_id)
"run_result_exists" btree (result_exists)
"run_start_time" btree (start_time)
"run_status" btree (status)
"run_status_1" btree (status) WHERE status = 1
"run_status_part_idx" btree (status) WHERE status < 3
Triggers:
run_upd_ts BEFORE INSERT OR UPDATE ON run FOR EACH ROW EXECUTE PROCEDURE set_last_update_ts()
tr_proc_update_job_summary AFTER UPDATE ON run FOR EACH ROW EXECUTE PROCEDURE tr_proc_update_job_summary()
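One quick, hedged way to see how far the planner's rows=1 estimates above are from reality is to count the rows behind each indexed predicate and then refresh statistics; these statements are illustrative, not something actually run in the thread:

-- rows actually matching the work_unit partial index condition
SELECT count(*) FROM work_unit WHERE status = 3 AND type <> 1;
-- rows actually matching the run_status_1 partial index condition
SELECT count(*) FROM run WHERE status = 1;
-- refresh the statistics those estimates are based on
ANALYZE work_unit;
ANALYZE run;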
On Thu, Jul 17, 2014 at 12:00 PM, Potentialtech <wmoran@potentialtech.com> wrote:
> Provide a lot more information if you want anyone on the list to be able
> to help: such as explain output while the problem is happening, and some
> information about the makeup of the tables (column types/indexes/# rows).
Even though the explain plan suggests otherwise, it is taking a very long time.
On Thu, Jul 17, 2014 at 4:26 PM, Prabhjot Sheena <prabhjot.sheena@rivalwatch.com> wrote:
> Here is the explain plan. The query time went back up to 2 to 3 minutes
> from 3 seconds in just 2 hrs. Can anyone suggest something on how to fix
> this, or why it is happening?
On 7/17/2014 4:26 PM, Prabhjot Sheena wrote:
> Here is the explain plan. The query time went back up to 2 to 3 minutes from 3 seconds in just 2 hrs. Can anyone suggest something on how to fix this, or why it is happening?
>
> explain SELECT account.id, account.organization_id, run.application_id, work_unit.script, work_unit.id, work_unit.start_time, run.id, work_unit.priority FROM work_unit, run, account WHERE work_unit.status = 3 AND work_unit.run_id = run.id AND work_unit.type != 1 AND run.status = 1 AND run.account_id = account.id;

You need to use EXPLAIN ANALYZE to get accurate data.

Run it when the query is fast, and again when the query is slow, and paste both outputs here. Also, you can paste them to http://explain.depesz.com and that will give you a nice analysis of the timing data included in the EXPLAIN ANALYZE output.
-- john r pierce 37N 122W somewhere on the middle of the left coast
On 7/17/2014 11:28 AM, Prabhjot Sheena wrote:
> SELECT account.id, account.organization_id, run.application_id, work_unit.script, work_unit.id, work_unit.start_time, run.id, work_unit.priority FROM work_unit, run, account WHERE work_unit.status = 3 AND work_unit.run_id = run.id AND work_unit.type != 1 AND run.status = 1 AND run.account_id = account.id

That query is an implicit 3-way join. It would be clearer to write it like this:
SELECT account.id, account.organization_id, run.application_id, work_unit.script, work_unit.id, work_unit.start_time, run.id, work_unit.priority
FROM work_unit
JOIN run ON work_unit.run_id = run.id
JOIN account ON run.account_id = account.id
WHERE work_unit.status = 3 AND work_unit.type != 1 AND run.status = 1;
Postgres would (or at least should) treat this exactly the same, but it's clearer what the query is doing when it's written this way.
-- john r pierce 37N 122W somewhere on the middle of the left coast
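To confirm the planner really does treat the two spellings the same, one could, hypothetically, EXPLAIN the rewritten form and compare it with the plan already posted:

EXPLAIN SELECT account.id, account.organization_id, run.application_id,
       work_unit.script, work_unit.id, work_unit.start_time, run.id,
       work_unit.priority
FROM work_unit
JOIN run ON work_unit.run_id = run.id
JOIN account ON run.account_id = account.id
WHERE work_unit.status = 3 AND work_unit.type != 1 AND run.status = 1;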
I just did explain analyze, and currently the database is running slow because of this query.
explain ANALYZE SELECT account.id, account.organization_id, run.application_id, work_unit.script, work_unit.id, work_unit.start_time, run.id, work_unit.priority FROM work_unit, run, account WHERE work_unit.status = 3 AND work_unit.run_id = run.id AND work_unit.type != 1 AND run.status = 1 AND run.account_id = account.id;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..24.84 rows=1 width=64) (actual time=71.824..69729.467 rows=1820 loops=1)
-> Nested Loop (cost=0.00..16.56 rows=1 width=60) (actual time=71.760..69628.874 rows=1820 loops=1)
Join Filter: (work_unit.run_id = run.id)
-> Index Scan using idx_work_unit_1_status_3_new on work_unit (cost=0.00..8.28 rows=1 width=52) (actual time=0.067..154.364 rows=1820 loops=1)
Index Cond: (status = 3)
-> Index Scan using run_status_1 on run (cost=0.00..8.27 rows=1 width=12) (actual time=0.081..34.338 rows=3138 loops=1820)
Index Cond: (run.status = 1)
-> Index Scan using account_pkey on account (cost=0.00..8.27 rows=1 width=8) (actual time=0.044..0.046 rows=1 loops=1820)
Index Cond: (account.id = run.account_id)
Total runtime: 69732.893 ms
On Thu, Jul 17, 2014 at 4:45 PM, John R Pierce <pierce@hogranch.com> wrote:
> You need to use EXPLAIN ANALYZE to get accurate data. Run it when the query
> is fast, and again when the query is slow, and paste both outputs here.
On 7/17/2014 5:01 PM, Prabhjot Sheena wrote:
> I just did explain analyze, and currently the database is running slow because of this query.
>
> explain ANALYZE SELECT account.id, account.organization_id, run.application_id, work_unit.script, work_unit.id, work_unit.start_time, run.id, work_unit.priority FROM work_unit, run, account WHERE work_unit.status = 3 AND work_unit.run_id = run.id AND work_unit.type != 1 AND run.status = 1 AND run.account_id = account.id;

What does the overall workload on the system look like when this is so slow? Taking over a minute to do 1820 iterations of a 3138-row index scan seems way out of bounds for anything other than a very heavily overloaded server.

Is this running on a virtual machine where there are other virtual servers contending for the same resources?

BTW, here's your EXPLAIN ANALYZE formatted pretty: http://explain.depesz.com/s/XRI
-- john r pierce 37N 122W somewhere on the middle of the left coast
The overall load of the system is good. It is a VM, but the other database is running normally and not doing much activity.
vmstat 5 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 2 22048 296804 5016 10829360 0 0 1 3 0 0 5 1 75 18 1
9 1 22048 255184 5024 10868884 0 0 0 60 8955 10518 33 3 41 15 8
11 3 22048 197664 5024 10925944 0 0 0 0 7866 8767 37 2 44 12 6
12 1 22048 140000 5048 10976620 0 0 0 10 9181 10632 39 3 42 11 6
7 1 22048 148068 5076 10970104 0 0 1 33 7456 7715 34 2 46 8 10
11 3 22048 87408 5076 11027832 0 0 0 0 9846 12631 31 4 38 14 14
8 1 22048 48380 5084 11067456 0 0 0 35 7887 8902 33 3 43 11 10
On Thu, Jul 17, 2014 at 5:10 PM, John R Pierce <pierce@hogranch.com> wrote:
> What does the overall workload on the system look like when this is so slow?
> Is this running on a virtual machine where there are other virtual servers
> contending for the same resources?
On 7/17/2014 5:36 PM, Prabhjot Sheena wrote:
> The overall load of the system is good. It is a VM, but the other database
> is running normally and not doing much activity.

Other database? Is that on the same postgresql cluster? Or a separate instance of the postgres server on the same VM? Or a separate VM on the same host?

What does `iostat -x 5` show? Let it run a bunch of samples, then run this slow query and get more samples. Pay attention to the disk drive(s) the database is stored on. High %utilization or await times would be a bad thing.

-- john r pierce 37N 122W somewhere on the middle of the left coast
Prabhjot Sheena <prabhjot.sheena@rivalwatch.com> wrote:
> I m running postgresql 8.3

That is out of support. If you are at all concerned about performance, you would do well to upgrade to a recent and supported major release.

http://www.postgresql.org/support/versioning/

Anyway, it is always a good idea to show the output of:

SELECT version();

> temp_buffers = 64MB

You don't say what you have set for max_connections, but temp_buffers is allocated at need *separately for each connection* and once allocated to a connection it is never released until the connection is closed. So, hypothetically, if you had 100 connections open and they each had used temporary tables at some time or other, even when all are idle they could have 6.4GB of your 12GB of RAM tied up.

> work_mem = 512MB

Unlike temp_buffers, this one is released at the end of the command (or earlier), but it is possible to have multiple allocations for a single connection. If you had 100 queries active, each using one work_mem allocation, that would allocate about 51.2GB of RAM, which looks like it is more than you have. I generally recommend setting max_connections only to what is needed (with a small reserve) and setting work_mem to 25% of machine (or VM) RAM divided by max_connections. On the face of it, neither of the above seems to be causing problems at the time you ran vmstat, but they worry me.

> max_fsm_pages = 809600

Since the EXPLAIN ANALYZE output you later posted shows estimates which were off by several orders of magnitude, quite possibly leading to a suboptimal plan, you should probably run queries to check for bloat and update statistics. If you find badly bloated tables you should run CLUSTER or VACUUM FULL on them to eliminate bloat. Any bloated indexes, and any indexes on bloated tables you cleaned up using VACUUM FULL, should be fixed with REINDEX. (Once you are on 9.0 or later it is no longer necessary to REINDEX a table after using VACUUM FULL on it, although a regular VACUUM after the VACUUM FULL is still important.) If you find bloat you should make autovacuum more aggressive or add VACUUM ANALYZE crontab jobs to prevent a recurrence. You may want to look at increasing statistics targets, either globally or on specific columns used for selection criteria.

After doing the above, or if you decide not to do any or all of it, you should probably run VACUUM ANALYZE VERBOSE; (without specifying a table name and without the FULL option) and review the output. The most important part of the output is the last few lines, which in 8.3 might include warnings that suggest configuration changes.

The above might help, but I think the biggest problem may be your VM. You show very low disk I/O numbers, but a pretty significant fraction of the time waiting for I/O. The previously-suggested iostat output may help nail it down more specifically, but basically you seem to have a big problem with bandwidth for storage access. It's pretty disturbing to see lines in vmstat output which show zero disk in or out, but over 10% of CPU time waiting for storage?!?

-- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
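A rough, hedged sketch of the kind of checks being described here; a real bloat estimate is more involved than this, and the statistics target below is an arbitrary example value, not a recommendation from the thread:

-- dead-tuple counts and last (auto)vacuum times are a quick proxy for bloat on 8.3
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname IN ('work_unit', 'run', 'account')
ORDER BY n_dead_tup DESC;

-- raise the statistics target on a selective column, then refresh planner statistics
ALTER TABLE work_unit ALTER COLUMN status SET STATISTICS 200;
ANALYZE work_unit;
ANALYZE run;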
On 7/21/2014 1:51 PM, Kevin Grittner wrote:
> The above might help, but I think the biggest problem may be your
> VM. You show very low disk I/O numbers, but a pretty significant
> fraction of the time waiting for I/O. The previously-suggested
> iostat output may help nail it down more specifically, but
> basically you seem to have a big problem with bandwidth for storage
> access. It's pretty disturbing to see lines in vmstat output which
> show zero disk in or out, but over 10% of CPU time waiting for
> storage?!?

That would suggest to me a large number of VMs sharing a single SATA drive, or similar.

-- john r pierce 37N 122W somewhere on the middle of the left coast
Thanks a lot, Kevin. This is what I did to improve query performance: I recreated all the indexes on the work_unit table and have been running VACUUM ANALYZE through a cron job 3 times a day on the two tables that are in the query. The query performance is between 2 and 3 seconds now. The strange thing I noticed is that just today, at one point, query performance came down to under 1 second and started using this query plan:

http://explain.depesz.com/s/h5q8

But then the cron job started to VACUUM ANALYZE the work_unit and run tables, and after the vacuum analyze completed, the planner switched to another query plan, which made the query slow again (back to 2 to 3 seconds). This is the query plan that is used now and is slower:

http://explain.depesz.com/s/AiG

We do have an I/O problem during heavy loads because we share a mount point from a NetApp across different machines, which we are getting rid of in a few weeks by moving to Amazon AWS; the I/O issues should go away with fast IOPS.

I have copied some more information below. Can you please recommend what values I should set for temp_buffers and work_mem, and also what query I should run to check for bloat?
caesius=# SELECT version();
version
----------------------------------------------------------------------------------------------------------
PostgreSQL 8.3.7 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-9)
(1 row)
caesius=# show max_connections ;
max_connections
-----------------
600
caesius=# select count(*) from pg_stat_activity;
count
-------
165
#autovacuum = on
log_autovacuum_min_duration = 10000
autovacuum_max_workers = 1
#autovacuum_naptime = 1min
#autovacuum_vacuum_threshold = 50
#autovacuum_analyze_threshold = 50
autovacuum_vacuum_scale_factor = 0.1
#autovacuum_analyze_scale_factor = 0.05
#autovacuum_freeze_max_age = 200000000
#autovacuum_vacuum_cost_delay = 20
#autovacuum_vacuum_cost_limit = -1
# - Cost-Based Vacuum Delay -
#vacuum_cost_delay = 0 # 0-1000 milliseconds
#vacuum_cost_page_hit = 1 # 0-10000 credits
#vacuum_cost_page_miss = 10 # 0-10000 credits
#vacuum_cost_page_dirty = 20 # 0-10000 credits
#vacuum_cost_limit = 200 # 1-10000 credits
[caesius@clientdb01 tmp]$ cat /proc/meminfo
MemTotal: 12582912 kB
MemFree: 204748 kB
Buffers: 4540 kB
Cached: 9541024 kB
SwapCached: 5324 kB
Active: 5218556 kB
Inactive: 6554684 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 12582912 kB
LowFree: 204748 kB
SwapTotal: 16777208 kB
SwapFree: 16755516 kB
Dirty: 36584 kB
Writeback: 20 kB
AnonPages: 2227364 kB
Mapped: 1093452 kB
Slab: 101396 kB
PageTables: 206692 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 23068664 kB
Committed_AS: 3796932 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 9196 kB
VmallocChunk: 34359729063 kB
iostat -d -x 5
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
xvda 0.00 7.40 0.20 1.20 4.80 68.80 52.57 0.01 4.57 2.29 0.32
xvda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
xvda2 0.00 7.40 0.20 1.20 4.80 68.80 52.57 0.01 4.57 2.29 0.32
dm-0 0.00 0.00 0.20 8.60 4.80 68.80 8.36 0.04 4.09 0.36 0.32
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
xvda 0.00 3.80 0.00 0.60 0.00 35.20 58.67 0.00 6.67 5.33 0.32
xvda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
xvda2 0.00 3.80 0.00 0.60 0.00 35.20 58.67 0.00 6.67 5.33 0.32
dm-0 0.00 0.00 0.00 4.40 0.00 35.20 8.00 0.02 4.36 0.73 0.32
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
xvda 0.00 9.40 0.00 1.80 0.00 89.60 49.78 0.00 2.22 0.44 0.08
xvda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
xvda2 0.00 9.40 0.00 1.80 0.00 89.60 49.78 0.00 2.22 0.44 0.08
dm-0 0.00 0.00 0.00 11.20 0.00 89.60 8.00 0.03 3.00 0.07 0.08
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
xvda 0.00 12.38 5.79 13.17 106.99 204.39 16.42 0.20 10.78 1.85 3.51
xvda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
xvda2 0.00 12.38 5.79 13.17 106.99 204.39 16.42 0.20 10.78 1.85 3.51
dm-0 0.00 0.00 5.79 25.55 106.99 204.39 9.94 0.31 9.83 1.12 3.51
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
xvda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
xvda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
xvda2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
xvda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
xvda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
xvda2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
xvda 0.00 0.00 0.00 0.20 0.00 1.60 8.00 0.00 4.00 4.00 0.08
xvda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
xvda2 0.00 0.00 0.00 0.20 0.00 1.60 8.00 0.00 4.00 4.00 0.08
dm-0 0.00 0.00 0.00 0.20 0.00 1.60 8.00 0.00 4.00 4.00 0.08
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
On Mon, Jul 21, 2014 at 2:04 PM, John R Pierce <pierce@hogranch.com> wrote:
> That would suggest to me a large number of VMs sharing a single SATA drive, or similar.
On 7/25/2014 5:58 PM, Prabhjot Sheena wrote:
> We do have an I/O problem during heavy loads because we share a mount point
> from a NetApp across different machines, which we are getting rid of in a
> few weeks by moving to Amazon AWS; the I/O issues should go away with fast
> IOPS.

NetApp has to be some of the most expensive storage I've ever seen in a datacenter and is frequently a source of performance problems. NFS isn't really appropriate for storing database data on; there are way too many opportunities for fsync to get lost, more so if you've got a virtualization layer in there too.

-- john r pierce 37N 122W somewhere on the middle of the left coast
Prabhjot Sheena <prabhjot.sheena@rivalwatch.com> wrote:
> PostgreSQL 8.3.7 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-9)

Running anything on that version is insane. Not only has the 8.3 major release been out of support since February of 2013, but you're missing about 4 years of fixes for serious bugs and security issues since the last patch version of 8.3 before it went out of support. Any vacuum or planning problems you are seeing on that release are very likely fixed on a supported version.

> This is what I did to improve query performance: I recreated all the
> indexes on the work_unit table and have been running VACUUM ANALYZE
> through a cron job 3 times a day on the two tables that are in the
> query. The query performance is between 2 and 3 seconds now.

That's pretty solid evidence that you have been having problems with bloat. The most common cause of this is not setting autovacuum to be aggressive enough.

> autovacuum_max_workers = 1

Vacuum of one big table could starve all other tables, resulting in bloat. If anything, this should be set larger than the default. You might want to try setting it to somewhere in the 3 to 5 range, but then watch for any long periods where all workers are busy. If you see that, you probably need more workers. I would also boost autovacuum_vacuum_cost_limit to maybe 500.

> The strange thing I noticed is that just today, at one point, query
> performance came down to under 1 second and started using this query plan

There's really no point hypothesizing about what might cause that in such an old version with so many known bugs. The best way to improve performance would be to upgrade.

-- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
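Pulling these suggestions together into a postgresql.conf sketch; the numbers are illustrative starting points rather than values tested in this thread, and the work_mem figure simply applies the earlier 25%-of-RAM rule to the ~12GB of RAM and max_connections = 600 shown above (25% of 12GB is roughly 3GB, and 3GB / 600 is about 5MB):

autovacuum_max_workers = 3              # suggested range is 3 to 5 instead of 1 (change requires restart)
autovacuum_vacuum_cost_limit = 500      # boosted from the default of 200
work_mem = 5MB                          # down from 512MB; allocated per sort/hash, possibly several per query
temp_buffers = 8MB                      # assumption: back toward the default, since it is held per connection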