Thread: slow query, different plans

slow query, different plans

From
"Midge Brown"
Date: Friday, August 3, 2012 5:38 PM
I'm having a problem with a query on our production server, but not on a laptop running a similar postgres version with a recent backup copy of the same table. I tried reindexing the table on the production server, but it didn't make any difference. Other queries on the same table are plenty fast.
 
This query has been slow before, but never like this, particularly during a period when only a couple of connections are in use.
 
Vacuum and analyze are run nightly (and show as such in pg_stat_user_tables) in addition to autovacuum during the day. Here are my autovacuum settings, but when I checked last_autovacuum & last_autoanalyze in pg_stat_user_tables those fields were blank.
 
autovacuum = on                        
log_autovacuum_min_duration = 10       
autovacuum_max_workers = 3             
autovacuum_naptime = 1min              
autovacuum_vacuum_threshold = 50       
autovacuum_analyze_threshold = 50      
autovacuum_vacuum_scale_factor = 0.2   
autovacuum_analyze_scale_factor = 0.1  
autovacuum_freeze_max_age = 200000000  
autovacuum_vacuum_cost_delay = 10ms (changed earlier today from 1000ms) 
autovacuum_vacuum_cost_limit = -1
 
wal_level = minimal
wal_buffers = 16MB
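
For reference, the pg_stat_user_tables check was along these lines (just a sketch of the sort of query I ran):

-- sketch: when did (auto)vacuum/analyze last touch clientlog?
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'clientlog';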
 
The only recent change was moving our 3 databases from multiple RAID 1 drives, with tablespaces spread all over, to one large RAID 10 array with indexes and data in pg_default. WAL for this table was moved as well.
 
Does anyone have any suggestions on where to look for the problem? 
 
clientlog table info:
 
Size: 1.94G
 
  Column  |            Type             | Modifiers
----------+-----------------------------+-----------
 pid0     | integer                     | not null
 rid      | integer                     | not null
 verb     | character varying(32)       | not null
 noun     | character varying(32)       | not null
 detail   | text                        |
 path     | character varying(256)      | not null
 ts       | timestamp without time zone |
 applies2 | integer                     |
 toname   | character varying(128)      |
 byname   | character varying(128)      |
Indexes:
    "clientlog_applies2" btree (applies2)
    "clientlog_pid0_key" btree (pid0)
    "clientlog_rid_key" btree (rid)
    "clientlog_ts" btree (ts)
The query, hardware info, and links to both plans:
 
explain analyze select max(ts) as ts from clientlog where applies2=256;
 
Production server:
- 4 dual-core AMD Opteron 2212 processors, 2010.485 MHz
- 64GB RAM
- 464GB RAID10 drive
- Linux 2.6.18-164.el5 #1 SMP Thu Sep 3 03:28:30 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
 PostgreSQL 9.0.4 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit
http://explain.depesz.com/s/8R4
                                                                         
 
From a laptop running Linux 2.6.34.9-69.fc13.868 with 3GB RAM against a copy of the same table:
PostgreSQL 9.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 4.4.4 20100630 (Red Hat 4.4.4-10), 32-bit

http://explain.depesz.com/s/NQl
Thank you,
Midge
 

Re: slow query, different plans

From
Greg Williamson
Date: Friday, August 3, 2012 6:30 PM
Midge --

Sorry for top-quoting -- challenged mail.

Perhaps a difference in the stats estimates -- default_statistics_target?

Can you show us a diff between the postgres config files for each instance? Maybe something there ...
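
For instance, something as simple as this on each box would show whether the stats target differs (just a sketch):

-- run on each instance and compare
SHOW default_statistics_target;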

Greg Williamson



Re: slow query, different plans

From
Tom Lane
Date: Friday, August 3, 2012 11:26 PM
"Midge Brown" <midgems@sbcglobal.net> writes:
> I'm having a problem with a query on our production server, but not on a laptop running a similar postgres version
> with a recent backup copy of the same table. I tried reindexing the table on the production server, but it didn't make
> any difference. Other queries on the same table are plenty fast.

Reindexing won't help that.  The problem is a bad statistical estimate;
it thinks there are about 700 rows with applies2 = 256, when there's
really only one.  That means the "fast" plan is a lot faster than the
planner gives it credit for, and conversely the "slow" plan is a lot
slower than the planner is expecting.  Their estimated costs end up
nearly the same, which makes it a bit of a chance matter which one is
picked --- but the true costs are a lot different.  So you need to fix
that rowcount estimate.  Raising the stats target for the table might
help.

            regards, tom lane

Re: slow query, different plans

From
"Midge Brown"
Date:
Diff of config files is below. default_statistics_target in both is currently at the default of 100, though I'm going to try increasing that for this table as Tom Lane suggested.  
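
Presumably something along these lines for the applies2 column (just a sketch; the 1000 target is only a guess):

-- sketch only: raise the per-column stats target, then re-analyze
ALTER TABLE clientlog ALTER COLUMN applies2 SET STATISTICS 1000;
ANALYZE clientlog;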
-Midge
 
 
Here's the diff of the 2 config files. I didn't list the autovacuum settings since the laptop is a development machine with that feature turned off.
 
109c109
< shared_buffers = 28MB                 # min 128kB
---
> shared_buffers = 4GB
118,120c118,120
< #work_mem = 1MB                               # min 64kB
< #maintenance_work_mem = 16MB          # min 1MB
< #max_stack_depth = 2MB                        # min 100kB
---
> work_mem = 16MB
> maintenance_work_mem = 256MB
> max_stack_depth = 2MB
130c130
< #vacuum_cost_delay = 0ms              # 0-100 milliseconds
---
> vacuum_cost_delay = 10ms
134c134
< #vacuum_cost_limit = 200              # 1-10000 credits
---
> vacuum_cost_limit = 200                       # 1-10000 credits
153c153
< #wal_level = minimal                  # minimal, archive, or hot_standby
---
> wal_level = minimal                   # minimal, archive, or hot_standby
165c165
< wal_buffers = 64kB                    # min 32kB
---
> wal_buffers = 16MB
174c174
< checkpoint_segments = 3               # in logfile segments, min 1, 16MB each
---
> checkpoint_segments = 64              # in logfile segments, min 1, 16MB each
176,177c176,177
< checkpoint_completion_target = 0.5    # checkpoint target duration, 0.0 - 1.0
< checkpoint_warning = 30s              # 0 disables
---
> checkpoint_completion_target = 0.7    # checkpoint target duration, 0.0 - 1.0
> checkpoint_warning = 30s              # 0 disables
231c231
< #effective_cache_size = 128MB
---
> effective_cache_size = 10GB
413c414

Re: slow query, different plans

From
"Midge Brown"
Date:
----- Original Message -----
From: Tom Lane
Sent: Friday, August 03, 2012 11:26 PM
Subject: Re: [PERFORM] slow query, different plans

"Midge Brown" <midgems@sbcglobal.net> writes:
> I'm having a problem with a query on our production server, but not on a laptop running a similar postgres version with a recent backup copy of the same table. I tried reindexing the table on the production server, but it didn't make any difference. Other queries on the same table are plenty fast.

Reindexing won't help that.  The problem is a bad statistical estimate;
it thinks there are about 700 rows with applies2 = 256, when there's
really only one.  That means the "fast" plan is a lot faster than the
planner gives it credit for, and conversely the "slow" plan is a lot
slower than the planner is expecting.  Their estimated costs end up
nearly the same, which makes it a bit of a chance matter which one is
picked --- but the true costs are a lot different.  So you need to fix
that rowcount estimate.  Raising the stats target for the table might
help.

regards, tom lane

--

I added "and ts is not null" to the query and the planner came back with .102 ms. The problem area in production went from a 10 second response to < 1 second.
 
Thanks for the responses.
 
-Midge