Thread: near-identical queries have vastly different plans

From:
Samuel Gendler
Date:

Here's the setup:

I'm cross joining two dimensions before left outer joining to a fact table so that I can put a default value into the result set wherever a value is missing from the fact table.  I have a time dimension and another dimension.  I want the cross join to cover only a subset of rows from each dimension, so I have some filters in the ON clause of the inner join.

I've got 2 nearly identical queries that perform incredibly differently.  The only difference between them is that the fast query uses two text columns from the sensor dimension when filtering the inner join, while the slow query uses bigint id values, where the ids correlate to the text strings in the fast query.  The string columns are device_name and dpoint_name: device_name is unique, but many devices have dpoints with the same name.  The bigint columns are device_id and dpoint_id, where the pair maps to a single row.  There are indexes on all of them, so the only real difference is that an index on dpoint_name will have more rows for a given value than the index on dpoint_id, because dpoints with the same name still have different ids if attached to different devices.  In both queries, exactly 35 rows from the sensor dimension will be returned.  Note also that I'm aggregating fact table rows via avg() because I have duplicate rows in the fact table, but I want to extract only a single row for each time and sensor row.  The left outer join allows me to populate any missing rows with a default value, and the aggregation via avg() combines duplicate rows so that they appear only once.

I can easily just use the fast query, but my concern is that without understanding why the queries are executing differently, I might suddenly discover my code using the slow query plan instead of the fast one at some point in the future, even when using the varchar columns instead of the bigint ids for filtering.  They differ in execution speed by about 5x (which translates to many minutes), so that would be a big problem.  If I could figure out either a query structure or an index structure which will force the fast query plan, I'd be much happier.  So that is what I am looking for - an explanation of how I might convince the planner to always use the fast plan.

It's a CentOS host - quad-core AMD Opteron 1.6GHz, 2GB of RAM, single 75GB disk with everything on it.  I'm not looking for outright performance, just relative performance between the 2 queries.  The DB config was taken wholesale from pg_tune with no changes, IIRC.  It isn't a production box.  I have yet to spec out production hardware for this application, so I don't know what it will eventually be.

# select version();
                                                     version                                                      
------------------------------------------------------------------------------------------------------------------
 PostgreSQL 8.4.5 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-48), 64-bit

             name             |                 current_setting                 
------------------------------+-------------------------------------------------
 checkpoint_completion_target | 0.9
 checkpoint_segments          | 64
 default_statistics_target    | 100
 effective_cache_size         | 1408MB
 lc_collate                   | en_US.UTF-8
 lc_ctype                     | en_US.UTF-8
 log_directory                | pg_log
 log_filename                 | postgresql-%a.log
 log_rotation_age             | 1d
 log_rotation_size            | 0
 log_truncate_on_rotation     | on
 logging_collector            | on
 maintenance_work_mem         | 240MB
 max_connections              | 20
 max_stack_depth              | 2MB
 port                         | 5432
 server_encoding              | UTF8
 shared_buffers               | 480MB
 TimeZone                     | UTC
 wal_buffers                  | 32MB
 work_mem                     | 48MB


time dimension looks like this:

          Column          |            Type             | Modifiers 
--------------------------+-----------------------------+-----------
 time_zone                | character varying(64)       | 
 tstamp                   | timestamp without time zone | 
 tstamptz                 | timestamp with time zone    | 
 local_key                | bigint                      | 
 utc_key                  | bigint                      | 
Indexes:
    "idx_time_ny_local_key" btree (local_key)
    "idx_time_ny_tstamp" btree (tstamp) CLUSTER
    "idx_time_ny_tstamptz" btree (tstamptz)
    "idx_time_ny_utc_key" btree (utc_key)

plus lots of other columns (approx 25 columns, mostly integer) that aren't relevant to this query.  It has 420,480 rows, where each row is 300 seconds after the previous row.  local_key and utc_key are bigint columns in the form yyyyMMddHHmm (utc_key in UTC and local_key in local time for the time zone represented by the table).  tstamp is a timestamp without time zone holding the same value as local_key, but as an actual timestamp type in the table's local time zone.  tstamptz is just a convenient column for when I need to do time zone conversions.
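For clarity, the relationship between tstamp and the bigint key can be sketched like this (an illustrative snippet using PostgreSQL's to_char, not taken from the actual load process):

```sql
-- Sketch: local_key is just the local timestamp rendered as yyyyMMddHHmm.
SELECT tstamp,
       to_char(tstamp, 'YYYYMMDDHH24MI')::bigint AS local_key_equivalent
  FROM dimensions.time_ny
 LIMIT 1;
-- e.g. a tstamp of '2011-06-15 00:00:00' corresponds to local_key 201106150000
```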

The other dimension looks like this:

    Column     |            Type             |                         Modifiers                          
---------------+-----------------------------+------------------------------------------------------------
 sensor_pk     | bigint                      | not null default nextval('sensor_sensor_pk_seq'::regclass)
 building_id   | bigint                      | 
 device_id     | bigint                      | 
 device_type   | character varying(16)       | 
 device_name   | character varying(64)       | 
 dpoint_id     | bigint                      | 
 dpoint_name   | character varying(64)       | 
Indexes:
    "sensor_pkey" PRIMARY KEY, btree (sensor_pk)
    "idx_sensor_device_id" btree (device_id)
    "idx_sensor_device_name" btree (device_name)
    "idx_sensor_device_type" btree (device_type)
    "idx_sensor_dpoint_id" btree (dpoint_id)
    "idx_sensor_dpoint_name" btree (dpoint_name)

There are other columns, but again, they aren't relevant - about 10 more columns, half bigint and half varchar.  Row count is less than 400 rows at the moment, but it will potentially get much larger unless I partition on building_id, which I may well do - in which case, row count will never exceed 100,000 and rarely exceed 10,000.

The fact table is as follows:

             Table "facts.bldg_1_thermal_fact"
    Column    |            Type             |   Modifiers   
--------------+-----------------------------+---------------
 time_fk      | bigint                      | not null
 sensor_fk    | bigint                      | not null
 tstamp       | timestamp without time zone | 
 dpoint_value | real                        | 
 device_mode  | bigint                      | 

With actual data in child tables:

         Table "facts.bldg_1_thermal_fact_20110501"
    Column    |            Type             |   Modifiers   
--------------+-----------------------------+---------------
 time_fk      | bigint                      | not null
 sensor_fk    | bigint                      | not null
 tstamp       | timestamp without time zone | 
 dpoint_value | real                        | 
 device_mode  | bigint                      | 
Indexes:
    "idx_bldg_1_thermal_fact_20110501_sensor_fk" btree (sensor_fk)
    "idx_bldg_1_thermal_fact_20110501_time_fk" btree (time_fk) CLUSTER
Check constraints:
    "bldg_1_thermal_fact_20110501_time_fk_check" CHECK (time_fk >= 201105010000::bigint AND time_fk < 201106010000::bigint)
    "bldg_1_thermal_fact_20110501_tstamp_check" CHECK (tstamp >= '2011-05-01 00:00:00'::timestamp without time zone AND tstamp < '2011-06-01 00:00:00'::timestamp without time zone)
Inherits: bldg_1_thermal_fact

One of the 2 partitions that exists at the moment contains 2 million rows, none of which are relevant to the query in question.  The other partition has 4 million rows and contains 100% of the rows returned by the query.

fast query is as follows:

 SELECT t.local_key,
        s.device_name||'_'||s.dpoint_name,
        CASE WHEN avg(f.dpoint_value) IS NULL THEN -999999
             ELSE avg(f.dpoint_value)::numeric(10,2)
        END as dpoint_value
   FROM dimensions.sensor s
  INNER JOIN dimensions.time_ny t
          ON s.building_id=1
         AND ((s.device_name='AHU_1M02' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_01' AND s.dpoint_name='EffectSetPt')OR(s.device_name='VAV_1M_01' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_02A' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_02A' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_02' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_02' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_03' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_03' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_04' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_04' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_05A' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_05A' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_05' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_05' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_06' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_06' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_07' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_07' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_08' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_08' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_09' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_09' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_10' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_10' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_11' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_11' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_12' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_12' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_13' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_13' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_14' AND s.dpoint_name='EffectSetPt') OR 
(s.device_name='VAV_1M_14' AND s.dpoint_name='SpaceTemp') OR (s.device_name='VAV_1M_15' AND s.dpoint_name='EffectSetPt') OR (s.device_name='VAV_1M_15' AND s.dpoint_name='SpaceTemp'))
         AND t.tstamp BETWEEN '15-Jun-2011 00:00' AND '29-Jun-2011 00:00'
  LEFT OUTER JOIN facts.bldg_1_thermal_fact f
          ON f.time_fk=t.local_key
         AND f.sensor_fk=s.sensor_pk
  GROUP BY 1,2
  ORDER BY 1,2

Note: as complicated as that big filter on the inner join is, it will never result in more than a few tens of rows being selected.

Fast explain is here: 

Slow query is as follows:

 SELECT t.local_key,
         s.device_name||'_'||s.dpoint_name,
         CASE WHEN avg(f.dpoint_value) IS NULL THEN -999999
              ELSE avg(f.dpoint_value)::numeric(10,2)
         END as dpoint_value
    FROM dimensions.sensor s
   INNER JOIN dimensions.time_ny t
           ON s.building_id=1
          AND ((s.device_id=33 AND s.dpoint_id=183) OR (s.device_id=33 AND s.dpoint_id=184) OR (s.device_id=34 AND s.dpoint_id=187) OR (s.device_id=34 AND s.dpoint_id=188) OR (s.device_id=35 AND s.dpoint_id=191) OR (s.device_id=35 AND s.dpoint_id=192) OR (s.device_id=36 AND s.dpoint_id=195) OR (s.device_id=36 AND s.dpoint_id=196) OR (s.device_id=77 AND s.dpoint_id=364) OR (s.device_id=20 AND s.dpoint_id=131) OR (s.device_id=20 AND s.dpoint_id=132) OR (s.device_id=21 AND s.dpoint_id=135) OR (s.device_id=21 AND s.dpoint_id=136) OR (s.device_id=22 AND s.dpoint_id=139) OR (s.device_id=22 AND s.dpoint_id=140) OR (s.device_id=30 AND s.dpoint_id=171) OR (s.device_id=30 AND s.dpoint_id=172) OR (s.device_id=23 AND s.dpoint_id=143) OR (s.device_id=23 AND s.dpoint_id=144) OR (s.device_id=24 AND s.dpoint_id=147) OR (s.device_id=24 AND s.dpoint_id=148) OR (s.device_id=25 AND s.dpoint_id=151) OR (s.device_id=25 AND s.dpoint_id=152) OR (s.device_id=26 AND s.dpoint_id=155) OR (s.device_id=26 AND s.dpoint_id=156) OR (s.device_id=27 AND s.dpoint_id=159) OR (s.device_id=27 AND s.dpoint_id=160) OR (s.device_id=28 AND s.dpoint_id=163) OR (s.device_id=28 AND s.dpoint_id=164) OR (s.device_id=29 AND s.dpoint_id=167) OR (s.device_id=29 AND s.dpoint_id=168) OR (s.device_id=31 AND s.dpoint_id=175) OR (s.device_id=31 AND s.dpoint_id=176) OR (s.device_id=32 AND s.dpoint_id=179) OR (s.device_id=32 AND s.dpoint_id=180))
         AND t.tstamp BETWEEN '15-Jun-2011 00:00' AND '29-Jun-2011 00:00'
  LEFT OUTER JOIN facts.bldg_1_thermal_fact f
          ON f.time_fk=t.local_key
         AND f.sensor_fk=s.sensor_pk 
  GROUP BY 1,2 
  ORDER BY 1,2 

Slow explain is here:

I tried rewriting the slow query such that the cross join is expressed as a subquery left outer joined to the fact table, but it had no effect.  The query plan was identical.

The query plan also doesn't change if I replace the 2nd column with s.sensor_pk (which maps one-to-one to the string I am constructing, but which does have an index).  This query is actually being used as the first query in a call to crosstab(text, text), and that second column is just the pivot column.  It doesn't matter to me whether it is a string or the id, since those labels wind up being rewritten as column names of the function result (select * from crosstab() as q(local_key bigint, column1 real, column2 real, ...)).
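For context, the surrounding crosstab(text, text) call looks roughly like the following (a sketch with placeholder column lists, assuming the tablefunc extension; not the actual application code):

```sql
-- Sketch of the crosstab(text, text) wrapper around the query above.
-- First argument: the source query (ordered by row key, then pivot key).
-- Second argument: the distinct pivot values, in a stable order.
SELECT *
  FROM crosstab(
         $$ /* the fast query shown above, ORDER BY 1,2 */ $$,
         $$ SELECT DISTINCT device_name||'_'||dpoint_name
              FROM dimensions.sensor ORDER BY 1 $$
       ) AS q(local_key bigint, column1 real, column2 real /* , ... */);
```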
From:
Samuel Gendler
Date:



On Thu, Jun 30, 2011 at 1:53 AM, Samuel Gendler <> wrote:
> If I could figure out either a query structure or an index structure which will force the fast query plan, I'd be much happier.  So that is what I am looking for - an explanation of how I might convince the planner to always use the fast plan.


For the record, "set enable_nestloop=false" does force a more effective plan when using the 'slow' query.  It is not quite identical in structure - it materializes the other side of the query, resulting in about 10% less performance - but it is close enough that I'm tempted to disable nestloop whenever I run the query in the hope that it will prevent the planner from switching to the really awful plan.  I know that's kind of a drastic measure, so hopefully someone out there will suggest a config fix which accomplishes the same thing without requiring special handling for this query, but at least it works (for now).
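For reference, the workaround can be scoped to a single transaction rather than changed globally (a sketch using standard PostgreSQL SET LOCAL, which reverts automatically at commit or rollback):

```sql
-- Disable nested loops only for this one query, not session-wide.
BEGIN;
SET LOCAL enable_nestloop = off;
-- ... run the slow-variant query here ...
COMMIT;  -- SET LOCAL's effect ends with the transaction
```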

Incidentally, upgrading to 9.0.x is not out of the question if it is believed that doing so might help here.  I'm only running 8.4 because I've got another project in production on 8.4 and I don't want to have to deal with running both versions on my development laptop.  But that's a pretty weak reason for not upgrading, I know.

--sam

From:
Samuel Gendler
Date:



On Thu, Jun 30, 2011 at 1:53 AM, Samuel Gendler <> wrote:
> If I could figure out either a query structure or an index structure which will force the fast query plan, I'd be much happier.  So that is what I am looking for - an explanation of how I might convince the planner to always use the fast plan.


I eventually noticed that constraint_exclusion didn't seem to be working and remembered that it only works when the filter is on the partitioned table itself, not when the table is being filtered via a join.  Adding a where clause which limits f.time_fk to the appropriate range not only fixed constraint_exclusion behaviour, but also caused the query planner to produce the same plan for both versions of the query - a plan that was an order of magnitude faster than the previous fastest plan.  It went from 20 seconds to just under 2 seconds.
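The added filter amounts to a constant range on the partition key, along these lines (a sketch; the exact placement in the original isn't shown, and it is written here in the ON clause so the left join's NULL-padded rows are preserved - a strict WHERE on f.time_fk would discard them):

```sql
-- Sketch: a constant range on time_fk (the partition key) lets the
-- planner exclude partitions via the CHECK constraints at plan time.
-- Bounds match the 15-Jun..29-Jun window used in the queries above.
LEFT OUTER JOIN facts.bldg_1_thermal_fact f
        ON f.time_fk = t.local_key
       AND f.sensor_fk = s.sensor_pk
       AND f.time_fk BETWEEN 201106150000 AND 201106290000
```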
From:
Tom Lane
Date:

Samuel Gendler <> writes:
> I've got 2 nearly identical queries that perform incredibly differently.

The reason the slow query sucks is that the planner is estimating at
most one "s" row will match that complicated AND/OR condition, so it
goes for a nestloop.  In the "fast" query there is another complicated
AND/OR filter condition, but it's not so far off on the number of
matching rows, so you get a better plan choice.  Can't tell from the
given information whether the better guess is pure luck, or there's some
difference in the column statistics that makes it able to get a better
estimate for that.

In general, though, you're skating on thin ice anytime you ask the
planner to derive statistical estimates about combinations of correlated
columns --- and these evidently are correlated.  Think about refactoring
the table definitions so that you're only testing a single column, which
ANALYZE will be able to provide stats about.  Or maybe you can express
it as a test on a computed expression, which you could then keep an
index on, prompting ANALYZE to gather stats about that.
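The computed-expression approach could look something like this (a sketch; the concatenation expression and index name are illustrative, not from the thread):

```sql
-- Sketch: index the combined expression so ANALYZE gathers statistics
-- on it, then filter on that single expression instead of column pairs.
CREATE INDEX idx_sensor_device_dpoint
    ON dimensions.sensor ((device_name || '_' || dpoint_name));
ANALYZE dimensions.sensor;

-- The AND/OR pairs then collapse to a single-expression test:
SELECT sensor_pk
  FROM dimensions.sensor
 WHERE device_name || '_' || dpoint_name
       IN ('AHU_1M02_SpaceTemp', 'VAV_1M_01_EffectSetPt', 'VAV_1M_01_SpaceTemp');
```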

            regards, tom lane

From:
Samuel Gendler
Date:



On Fri, Jul 1, 2011 at 3:46 PM, Tom Lane <> wrote:
> Samuel Gendler <> writes:
>> I've got 2 nearly identical queries that perform incredibly differently.
>
> The reason the slow query sucks is that the planner is estimating at
> most one "s" row will match that complicated AND/OR condition, so it
> goes for a nestloop.  In the "fast" query there is another complicated
> AND/OR filter condition, but it's not so far off on the number of
> matching rows, so you get a better plan choice.  Can't tell from the
> given information whether the better guess is pure luck, or there's some
> difference in the column statistics that makes it able to get a better
> estimate for that.
>
> In general, though, you're skating on thin ice anytime you ask the
> planner to derive statistical estimates about combinations of correlated
> columns --- and these evidently are correlated.  Think about refactoring
> the table definitions so that you're only testing a single column, which
> ANALYZE will be able to provide stats about.  Or maybe you can express
> it as a test on a computed expression, which you could then keep an
> index on, prompting ANALYZE to gather stats about that.

Thanks.  There is actually already a column in s (sensor_pk) which serves as a primary key for the 2 columns currently being tested.  I didn't write the application code which generates the query, so I can't say for sure why it is being generated as it is, but I'll ask the engineer in question to try the primary key column instead and see what happens.
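Filtering on the single primary-key column would collapse the pair tests to something like this (a sketch; the actual 35 id values are omitted):

```sql
-- Sketch: the (device_id, dpoint_id) pair filters collapse to one
-- indexed-column test, which ANALYZE can estimate from a single
-- column's statistics instead of guessing at correlated pairs.
SELECT sensor_pk, device_name, dpoint_name
  FROM dimensions.sensor s
 WHERE s.building_id = 1
   AND s.sensor_pk IN (/* the 35 matching sensor_pk values */);
```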