Thread: Why is plan (and performance) different on partitioned table?

Why is plan (and performance) different on partitioned table?

From
"Mark Liberman"
Date:

Hi,

I have recently implemented table partitioning in our postgres 8.1 db. Upon analyzing query performance, I have realized that, even when only a single one of the "partitions" has to be scanned, the plan is drastically different, and performs much worse, when I query against the master table (uses merge join), vs. a direct query against the partition directly (uses a hash join).  The majority of our queries only access a single partition.

Any insight into why this happens and what can be done to improve performance would be greatly appreciated.

br_1min is my partitioned table:

explain analyze
SELECT *
FROM br_1min br1 JOIN br_mods mod on br1.modules_id = mod.id 
WHERE ((end_time >= '2006-05-01 17:12:18-07' AND end_time < '2006-05-01 17:13:18-07'))
  AND mod.downloads_id IN (153226,153714,153730,153728,153727,153724,153713,153725,153739,153722) ;

----------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=73.99..223.43 rows=1 width=109) (actual time=2925.629..3082.188 rows=45 loops=1)
   Merge Cond: ("outer".id = "inner".modules_id)
   ->  Index Scan using br_mods_id_pkey on br_mods mod  (cost=0.00..40861.18 rows=282 width=77) (actual time=2922.223..3078.335 rows=45 loops=1)
         Filter: ((downloads_id = 153226) OR (downloads_id = 153714) OR (downloads_id = 153730) OR (downloads_id = 153728) OR (downloads_id = 153727) OR (downloads_id = 153724) OR (downloads_id = 153713) OR (downloads_id = 153725) OR (downloads_id = 153739) OR (downloads_id = 153722))
   ->  Sort  (cost=73.99..76.26 rows=906 width=32) (actual time=3.334..3.508 rows=348 loops=1)
         Sort Key: br1.modules_id
         ->  Append  (cost=0.00..29.49 rows=906 width=32) (actual time=0.133..2.169 rows=910 loops=1)
               ->  Index Scan using br_1min_end_idx on br_1min br1  (cost=0.00..2.02 rows=1 width=32) (actual time=0.029..0.029 rows=0 loops=1)
                     Index Cond: ((end_time >= '2006-05-01 17:12:18-07'::timestamp with time zone) AND (end_time < '2006-05-01 17:13:18-07'::timestamp with time zone))
               ->  Index Scan using br_1min_20557_end_idx on br_1min_20557 br1  (cost=0.00..27.48 rows=905 width=32) (actual time=0.101..1.384 rows=910 loops=1)
                     Index Cond: ((end_time >= '2006-05-01 17:12:18-07'::timestamp with time zone) AND (end_time < '2006-05-01 17:13:18-07'::timestamp with time zone))
 Total runtime: 3082.450 ms
(12 rows)



Now, If I query directly against br_1min_20557, my partition, I get:

explain analyze
SELECT *
FROM br_1min_20557 br1 JOIN br_mods mod on br1.modules_id = mod.id 
WHERE ((end_time >= '2006-05-01 17:12:18-07' AND end_time < '2006-05-01 17:13:18-07'))
  AND mod.downloads_id IN (153226,153714,153730,153728,153727,153724,153713,153725,153739,153722) ;


----------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=764.74..796.94 rows=1 width=109) (actual time=2.488..2.865 rows=45 loops=1)
   Hash Cond: ("outer".modules_id = "inner".id)
   ->  Index Scan using br_1min_20557_end_idx on br_1min_20557 br1  (cost=0.00..27.62 rows=914 width=32) (actual time=0.084..1.886 rows=910 loops=1)
         Index Cond: ((end_time >= '2006-05-01 17:12:18-07'::timestamp with time zone) AND (end_time < '2006-05-01 17:13:18-07'::timestamp with time zone))
   ->  Hash  (cost=764.03..764.03 rows=282 width=77) (actual time=0.284..0.284 rows=45 loops=1)
         ->  Bitmap Heap Scan on br_mods mod  (cost=20.99..764.03 rows=282 width=77) (actual time=0.154..0.245 rows=45 loops=1)
               Recheck Cond: ((downloads_id = 153226) OR (downloads_id = 153714) OR (downloads_id = 153730) OR (downloads_id = 153728) OR (downloads_id = 153727) OR (downloads_id = 153724) OR (downloads_id = 153713) OR (downloads_id = 153725) OR (downloads_id = 153739) OR (downloads_id = 153722))
               ->  BitmapOr  (cost=20.99..20.99 rows=282 width=0) (actual time=0.144..0.144 rows=0 loops=1)
                     ->  Bitmap Index Scan on br_mods_downloads_id_idx  (cost=0.00..2.10 rows=28 width=0) (actual time=0.031..0.031 rows=14 loops=1)
                           Index Cond: (downloads_id = 153226)
                     ->  Bitmap Index Scan on br_mods_downloads_id_idx  (cost=0.00..2.10 rows=28 width=0) (actual time=0.011..0.011 rows=2 loops=1)
                           Index Cond: (downloads_id = 153714)
                     ->  Bitmap Index Scan on br_mods_downloads_id_idx  (cost=0.00..2.10 rows=28 width=0) (actual time=0.007..0.007 rows=2 loops=1)
                           Index Cond: (downloads_id = 153730)
                     ->  Bitmap Index Scan on br_mods_downloads_id_idx  (cost=0.00..2.10 rows=28 width=0) (actual time=0.007..0.007 rows=2 loops=1)
                           Index Cond: (downloads_id = 153728)
                     ->  Bitmap Index Scan on br_mods_downloads_id_idx  (cost=0.00..2.10 rows=28 width=0) (actual time=0.008..0.008 rows=2 loops=1)
                           Index Cond: (downloads_id = 153727)
                     ->  Bitmap Index Scan on br_mods_downloads_id_idx  (cost=0.00..2.10 rows=28 width=0) (actual time=0.008..0.008 rows=2 loops=1)
                           Index Cond: (downloads_id = 153724)
                     ->  Bitmap Index Scan on br_mods_downloads_id_idx  (cost=0.00..2.10 rows=28 width=0) (actual time=0.008..0.008 rows=2 loops=1)
                           Index Cond: (downloads_id = 153713)
                     ->  Bitmap Index Scan on br_mods_downloads_id_idx  (cost=0.00..2.10 rows=28 width=0) (actual time=0.008..0.008 rows=2 loops=1)
                           Index Cond: (downloads_id = 153725)
                     ->  Bitmap Index Scan on br_mods_downloads_id_idx  (cost=0.00..2.10 rows=28 width=0) (actual time=0.041..0.041 rows=16 loops=1)
                           Index Cond: (downloads_id = 153739)
                     ->  Bitmap Index Scan on br_mods_downloads_id_idx  (cost=0.00..2.10 rows=28 width=0) (actual time=0.010..0.010 rows=1 loops=1)
                           Index Cond: (downloads_id = 153722)
 Total runtime: 3.017 ms
(29 rows)

The difference is night-and-day.  Any suggestions?

Thanks alot,

Mark

Re: Why is plan (and performance) different on partitioned table?

From
Tom Lane
Date:
"Mark Liberman" <mliberman@mixedsignals.com> writes:
> I have recently implemented table partitioning in our postgres 8.1 db. =
> Upon analyzing query performance, I have realized that, even when only a =
> single one of the "partitions" has to be scanned, the plan is =
> drastically different, and performs much worse, when I query against the =
> master table (uses merge join), vs. a direct query against the partition =
> directly (uses a hash join).  The majority of our queries only access a =
> single partition.

Joins against partitioned tables suck in 8.1 :-(.  There is code in CVS
HEAD to improve this, but it didn't get done in time for 8.1.

            regards, tom lane

Re: Why is plan (and performance) different on partitioned table?

From
Tom Lane
Date:
I wrote:
> Joins against partitioned tables suck in 8.1 :-(.

Actually ... while the above is a true statement, it's too flippant a
response for your problem.  The reason the planner is going for a
mergejoin in your example is that it thinks the mergejoin will terminate
early.  (Notice that the cost estimate for the mergejoin is actually
quite a bit less than the estimate for its first input.)  This estimate
can only be made if the planner has statistics that say that one of the
join columns has a max value much less than the other's.  Well, that's
fine, but where the heck did it get the stats for the partitioned table?
We don't compute union statistics for partitions.  The answer is that
it's confused and is using the stats for just the parent table as if
they were representative for the whole inheritance tree.

I think this behavior was intentional back when it was coded, but when
inheritance is being used for partitioning, it's clearly brain-dead.
We should either not assume anything about the statistics for an
inheritance tree, or make a real effort to compute them.

For the moment, I've applied a quick patch that makes sure we don't
assume anything.

If you don't have anything in the parent table br_1min, then deleting
the (presumably obsolete) pg_statistic rows for it should fix your
immediate problem.  Otherwise, consider applying the attached.

            regards, tom lane


Index: src/backend/optimizer/path/allpaths.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/optimizer/path/allpaths.c,v
retrieving revision 1.137.2.2
diff -c -r1.137.2.2 allpaths.c
*** src/backend/optimizer/path/allpaths.c    13 Feb 2006 16:22:29 -0000    1.137.2.2
--- src/backend/optimizer/path/allpaths.c    2 May 2006 04:31:27 -0000
***************
*** 264,269 ****
--- 264,276 ----
                   errmsg("SELECT FOR UPDATE/SHARE is not supported for inheritance queries")));

      /*
+      * We might have looked up indexes for the parent rel, but they're
+      * really not relevant to the appendrel.  Reset the pointer to avoid
+      * any confusion.
+      */
+     rel->indexlist = NIL;
+
+     /*
       * Initialize to compute size estimates for whole inheritance tree
       */
      rel->rows = 0;
Index: src/backend/utils/adt/selfuncs.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/selfuncs.c,v
retrieving revision 1.191.2.1
diff -c -r1.191.2.1 selfuncs.c
*** src/backend/utils/adt/selfuncs.c    22 Nov 2005 18:23:22 -0000    1.191.2.1
--- src/backend/utils/adt/selfuncs.c    2 May 2006 04:31:27 -0000
***************
*** 2970,2988 ****
          (varRelid == 0 || varRelid == ((Var *) basenode)->varno))
      {
          Var           *var = (Var *) basenode;
!         Oid            relid;

          vardata->var = basenode;    /* return Var without relabeling */
          vardata->rel = find_base_rel(root, var->varno);
          vardata->atttype = var->vartype;
          vardata->atttypmod = var->vartypmod;

!         relid = getrelid(var->varno, root->parse->rtable);

!         if (OidIsValid(relid))
          {
              vardata->statsTuple = SearchSysCache(STATRELATT,
!                                                  ObjectIdGetDatum(relid),
                                                   Int16GetDatum(var->varattno),
                                                   0, 0);
          }
--- 2970,2996 ----
          (varRelid == 0 || varRelid == ((Var *) basenode)->varno))
      {
          Var           *var = (Var *) basenode;
!         RangeTblEntry *rte;

          vardata->var = basenode;    /* return Var without relabeling */
          vardata->rel = find_base_rel(root, var->varno);
          vardata->atttype = var->vartype;
          vardata->atttypmod = var->vartypmod;

!         rte = rt_fetch(var->varno, root->parse->rtable);

!         if (rte->inh)
!         {
!             /*
!              * XXX This means the Var represents a column of an append relation.
!              * Later add code to look at the member relations and try to derive
!              * some kind of combined statistics?
!              */
!         }
!         else if (rte->rtekind == RTE_RELATION)
          {
              vardata->statsTuple = SearchSysCache(STATRELATT,
!                                                  ObjectIdGetDatum(rte->relid),
                                                   Int16GetDatum(var->varattno),
                                                   0, 0);
          }

Re: Why is plan (and performance) different on partitioned table?

From
"Mark Liberman"
Date:

>If you don't have anything in the parent table br_1min, then deleting
>the (presumably obsolete) pg_statistic rows for it should fix your
>immediate problem.  Otherwise, consider applying the attached.

Tom, thanks alot for your reply.  A few follow-up questions, and one potential "bug"?

I've been experimenting with deleting the rows from pg_statistics.  FYI, there were statistics for all master tables prior to us partioning the data.  We then manually inserted the rows into each inherited partition and, when done - did a truncate of the master table.

So, here's what I'm finding. 

1) When I delete the rows from pg_statistics, the new plan is, indeed, a hash join.

explain analyze
SELECT *
FROM br_1min br1 JOIN br_mods mod on br1.modules_id = mod.id
WHERE ((end_time >= '2006-05-01 17:12:18-07' AND end_time < '2006-05-01 17:13:18-07'))
  AND mod.downloads_id IN (153226,153714,153730,153728,153727,153724,153713,153725,153739,153722) ;

Hash Join  (cost=763.35..807.35 rows=1 width=109) (actual time=3.631..36.181 rows=45 loops=1)
   Hash Cond: ("outer".modules_id = "inner".id)
   ->  Append  (cost=1.04..40.64 rows=877 width=32) (actual time=0.198..34.872 rows=910 loops=1)
         ->  Bitmap Heap Scan on br_1min bfs1  (cost=1.04..8.70 rows=6 width=32) (actual time=0.060..0.060 rows=0 loops=1)
               Recheck Cond: ((end_time >= '2006-05-01 17:12:18-07'::timestamp with time zone) AND (end_time < '2006-05-01 17:13:18-07'::timestamp with time zone))
               ->  Bitmap Index Scan on br_1min_end_idx  (cost=0.00..1.04 rows=6 width=0) (actual time=0.054..0.054 rows=0 loops=1)
                     Index Cond: ((end_time >= '2006-05-01 17:12:18-07'::timestamp with time zone) AND (end_time < '2006-05-01 17:13:18-07'::timestamp with time zone))
         ->  Index Scan using br_1min_20557_end_idx on br_1min_20557 bfs1  (cost=0.00..25.91 rows=869 width=32) (actual time=0.136..1.858 rows=910 loops=1)
               Index Cond: ((end_time >= '2006-05-01 17:12:18-07'::timestamp with time zone) AND (end_time < '2006-05-01 17:13:18-07'::timestamp with time zone))
         ->  Index Scan using br_1min_20570_end_idx on br_1min_20570 bfs1  (cost=0.00..3.02 rows=1 width=32) (actual time=0.092..0.092 rows=0 loops=1)
               Index Cond: ((end_time >= '2006-05-01 17:12:18-07'::timestamp with time zone) AND (end_time < '2006-05-01 17:13:18-07'::timestamp with time zone))
         ->  Index Scan using br_1min_20583_end_idx on br_1min_20583 bfs1  (cost=0.00..3.02 rows=1 width=32) (actual time=32.034..32.034 rows=0 loops=1)
               Index Cond: ((end_time >= '2006-05-01 17:12:18-07'::timestamp with time zone) AND (end_time < '2006-05-01 17:13:18-07'::timestamp with time zone))
   ->  Hash  (cost=761.61..761.61 rows=281 width=77) (actual time=0.487..0.487 rows=45 loops=1)
         ->  Bitmap Heap Scan on br_mods mod  (cost=20.98..761.61 rows=281 width=77) (actual time=0.264..0.435 rows=45 loops=1)
               Recheck Cond: ((downloads_id = 153226) OR (downloads_id = 153714) OR (downloads_id = 153730) OR (downloads_id = 153728) OR (downloads_id = 153727) OR (downloads_id = 153724) OR (downloads_id = 153713) OR (downloads_id = 153725) OR (downloads_id = 153739) OR (downloads_id = 153722))
               ->  BitmapOr  (cost=20.98..20.98 rows=281 width=0) (actual time=0.223..0.223 rows=0 loops=1)
                     ->  Bitmap Index Scan on br_mods_downloads_id_idx  (cost=0.00..2.10 rows=28 width=0) (actual time=0.091..0.091 rows=14 loops=1)
                           Index Cond: (downloads_id = 153226)
                     ->  Bitmap Index Scan on br_mods_downloads_id_idx  (cost=0.00..2.10 rows=28 width=0) (actual time=0.037..0.037 rows=2 loops=1)
                           Index Cond: (downloads_id = 153714)
                     ->  Bitmap Index Scan on br_mods_downloads_id_idx  (cost=0.00..2.10 rows=28 width=0) (actual time=0.010..0.010 rows=2 loops=1)
                           Index Cond: (downloads_id = 153730)
                     ->  Bitmap Index Scan on br_mods_downloads_id_idx  (cost=0.00..2.10 rows=28 width=0) (actual time=0.008..0.008 rows=2 loops=1)
                           Index Cond: (downloads_id = 153728)
                     ->  Bitmap Index Scan on br_mods_downloads_id_idx  (cost=0.00..2.10 rows=28 width=0) (actual time=0.008..0.008 rows=2 loops=1)
                           Index Cond: (downloads_id = 153727)
                     ->  Bitmap Index Scan on br_mods_downloads_id_idx  (cost=0.00..2.10 rows=28 width=0) (actual time=0.007..0.007 rows=2 loops=1)
                           Index Cond: (downloads_id = 153724)
                     ->  Bitmap Index Scan on br_mods_downloads_id_idx  (cost=0.00..2.10 rows=28 width=0) (actual time=0.007..0.007 rows=2 loops=1)
                           Index Cond: (downloads_id = 153713)
                     ->  Bitmap Index Scan on br_mods_downloads_id_idx  (cost=0.00..2.10 rows=28 width=0) (actual time=0.007..0.007 rows=2 loops=1)
                           Index Cond: (downloads_id = 153725)
                     ->  Bitmap Index Scan on br_mods_downloads_id_idx  (cost=0.00..2.10 rows=28 width=0) (actual time=0.031..0.031 rows=16 loops=1)
                           Index Cond: (downloads_id = 153739)
                     ->  Bitmap Index Scan on br_mods_downloads_id_idx  (cost=0.00..2.10 rows=28 width=0) (actual time=0.009..0.009 rows=1 loops=1)
                           Index Cond: (downloads_id = 153722)
 Total runtime: 36.605 ms
(38 rows)


Note:  there are 2 new partitions that our cron jobs automatically created yesterday that are being scanned, but they do not return any rows.

2) When I re-analyze the br_1min table, new rows do not appear in pg_statistics for that table.

Now, my questions:

1) If there are no statistics for the master table, does postgres use the statistics for any of the partitions, or does it create a plan without any statistics related to the partitioned tables (e.g. some default plan.)?

2) I'm curious where it got an estimate of 6 rows for br_1min in "Bitmap Heap Scan on br_1min bfs1  (cost=1.04..8.70 rows=6 width=32)"  Any insight?

3) Basically, I'm wondering if this strategy of deleting the rows in pg_statistics for the master tables will work in all conditions, or if it runs the risk of again using faulty statistics and choosing a bad plan.  Would I be better off setting enable_mergejoin = f in the session right before I  issue this query and then resetting it after?  What are the risks of that approach?


Now, the potentital bug:

It appears that after you truncate a table, the statistics for that table still remain in pg_statistics.  And, as long as there are no rows added back to that table, the same statistics remain for that table, after an ANALYZE, - and are used by queries.  Once, you re-insert any rows in the table, however, new statistics will be computed.  So, the bug appears to be that after a truncate, if there are no rows in a table, the old, out-dated statistics do not get overwritten.  To follow are some simple tests I did to illustrate that.  Maybe this is by design, or, should I post this on pg-hackers?  It might be that in my case, it's better that new statitics ARE NOT inserted into pg_statistics for empty tables, but maybe the fix could be to delete the old statistics for analyzes to an empty table.

Thanks again Tom for your feedback,

- Mark


prdb=# create table mark_temp (col1 int, col2 int);
CREATE TABLE
prdb=# create index mark_temp_idx on mark_temp(col1);
CREATE INDEX

... I then inserted several thousand rows ....

prdb=# analyze mark_temp;
ANALYZE
prdb=# select staattnum,stadistinct from pg_statistic where starelid = (select oid from pg_class where relname = 'mark_temp');
 staattnum | stadistinct
-----------+-------------
         1 |        9671
         2 |           1
(2 rows)

prdb=# explain analyze select * from mark_temp where col1 = 1045;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Index Scan using mark_temp_idx on mark_temp  (cost=0.00..51.35 rows=27 width=8) (actual time=0.013..0.015 rows=1 loops=1)
   Index Cond: (col1 = 1045)
 Total runtime: 0.048 ms
(3 rows)

prdb=# truncate table mark_temp;
TRUNCATE TABLE
prdb=# analyze mark_temp;
ANALYZE

NOTE:  STATISTICS ARE THE SAME AND IT'S STILL DOING AN INDEX SCAN INSTEAD OF A SEQ SCAN

prdb=# explain analyze select * from mark_temp where col1 = 1045;
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Index Scan using mark_temp_idx on mark_temp  (cost=0.00..3.14 rows=1 width=8) (actual time=0.004..0.004 rows=0 loops=1)
   Index Cond: (col1 = 1045)
 Total runtime: 0.031 ms
(3 rows)

prdb=# select staattnum,stadistinct from pg_statistic where starelid = (select oid from pg_class where relname = 'mark_temp');
 staattnum | stadistinct
-----------+-------------
         1 |        9671
         2 |           1
(2 rows)

prdb=# insert into mark_temp (col1,col2) values (1,100);
INSERT 0 1
prdb=# analyze mark_temp;

NOTE: AFTER INSERT, THERE ARE NEW STATISTICS AND IT'S DOING A SEQ SCAN NOW

ANALYZE
prdb=# select staattnum,stadistinct from pg_statistic where starelid = (select oid from pg_class where relname = 'mark_temp');
 staattnum | stadistinct
-----------+-------------
         1 |          -1
         2 |          -1
(2 rows)

prdb=# explain analyze select * from mark_temp where col1 = 1045;
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Seq Scan on mark_temp  (cost=0.00..1.01 rows=1 width=8) (actual time=0.007..0.007 rows=0 loops=1)
   Filter: (col1 = 1045)
 Total runtime: 0.029 ms
(3 rows)







Re: Why is plan (and performance) different on partitioned table?

From
Tom Lane
Date:
"Mark Liberman" <mliberman@mixedsignals.com> writes:
> Now, the potentital bug:
> It appears that after you truncate a table, the statistics for that =
> table still remain in pg_statistics.

That's intentional, on the theory that when the table is re-populated
the new contents will probably resemble the old.

            regards, tom lane