Thread: Bad query plans for queries on partitioned table

From: Julian Mehnle

Hi all,

I have a large database with e-mail meta-data (no bodies) for over 100
million messages.  I am running PostgreSQL 8.2.4 on a server with 2GB of
RAM (shared_buffers = 240MB, temp_buffers = 128MB, work_mem = 256MB,
maintenance_work_mem = 256MB).  I have the data split in two separate
tables, "email" and "email_extras":

                      Table "public.email"
        Column       |            Type             | Modifiers
  -------------------+-----------------------------+-----------
   id                | bigint                      | not null
   load_id           | integer                     | not null
   ts                | timestamp without time zone | not null
   ip                | inet                        | not null
   mfrom             | text                        | not null
   helo              | text                        |

                   Table "public.email_extras"
        Column       |            Type             | Modifiers
  -------------------+-----------------------------+-----------
   id                | bigint                      | not null
   ts                | timestamp without time zone | not null
   size              | integer                     | not null
   hdr_from          | text                        |

Each of these tables has been partitioned equally based on the "ts"
(timestamp) field into two dozen or so tables, each covering one week of
messages.  For testing purposes, I have only one week's partition filled
for each of the "email" and "email_extras" tables (email_2007_week34 and
email_2007_week34_extras).
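
For context, a setup like the one described would presumably have been
created along these lines under 8.2's inheritance-based partitioning (the
actual DDL was not posted, so this is an illustrative sketch; constraint
names and bounds are taken from details that appear later in the thread):

```sql
-- Hypothetical sketch of one weekly partition; in 8.2, partitions are
-- child tables created with INHERITS plus a CHECK constraint on the
-- partitioning key.
CREATE TABLE email_2007_week34 (
    CHECK (ts >= '2007-08-20 00:00:00'
       AND ts <  '2007-08-27 00:00:00')
) INHERITS (email);

CREATE TABLE email_2007_week34_extras (
    CHECK (ts >= '2007-08-20 00:00:00'
       AND ts <  '2007-08-27 00:00:00')
) INHERITS (email_extras);
```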

Now if I perform the following simple join on the "email" and
"email_extras" tables ...

  SELECT
      count(*)
  FROM
      email
      INNER JOIN email_extras USING (id, ts)
  WHERE
      mfrom <> hdr_from;

then I get the following horrendously inefficient plan:

                                                          QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------
   Aggregate  (cost=391396890.89..391396890.90 rows=1 width=0)
     ->  Merge Join  (cost=9338881.64..349156398.02 rows=16896197148 width=0)
           Merge Cond: ((public.email_extras.id = public.email.id) AND (public.email_extras.ts = public.email.ts))
           Join Filter: (public.email.mfrom <> public.email_extras.hdr_from)
           ->  Sort  (cost=4592966.95..4658121.33 rows=26061752 width=48)
                 Sort Key: public.email_extras.id, public.email_extras.ts
                 ->  Append  (cost=0.00..491969.52 rows=26061752 width=48)
                       ->  Seq Scan on email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week13_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week14_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week15_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week16_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week17_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week18_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week19_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week20_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week21_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week22_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week23_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week24_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week25_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week26_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week27_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week28_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week29_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week30_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week31_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week32_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week33_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                        ->  Seq Scan on email_2007_week34_extras email_extras  (cost=0.00..491597.12 rows=26052512 width=33)
                       ->  Seq Scan on email_2007_week35_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week36_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week37_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week38_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week39_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                       ->  Seq Scan on email_2007_week40_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
           ->  Sort  (cost=4745914.69..4811071.87 rows=26062872 width=48)
                 Sort Key: public.email.id, public.email.ts
                 ->  Append  (cost=0.00..644732.72 rows=26062872 width=48)
                       ->  Seq Scan on email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week13 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week14 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week15 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week16 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week17 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week18 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week19 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week20 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week21 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week22 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week23 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week24 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week25 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week26 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week27 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week28 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week29 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week30 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week31 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week32 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week33 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week34 email  (cost=0.00..644349.12 rows=26052512 width=33)
                       ->  Seq Scan on email_2007_week35 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week36 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week37 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week38 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week39 email  (cost=0.00..13.70 rows=370 width=48)
                       ->  Seq Scan on email_2007_week40 email  (cost=0.00..13.70 rows=370 width=48)
  (68 rows)

However, if I restrict the query to just the partitions that actually do
have data in them ...

  SELECT
      count(*)
  FROM
      email_2007_week34
      INNER JOIN email_2007_week34_extras USING (id, ts)
  WHERE
      mfrom <> hdr_from;

then I get a much better plan that uses a hash join:

                                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
   Aggregate  (cost=4266338.94..4266338.95 rows=1 width=0)
     ->  Hash Join  (cost=1111362.80..4266336.07 rows=1145 width=0)
           Hash Cond: ((email_2007_week34.ts = email_2007_week34_extras.ts) AND (email_2007_week34.id = email_2007_week34_extras.id))
           Join Filter: (email_2007_week34.mfrom <> email_2007_week34_extras.hdr_from)
           ->  Seq Scan on email_2007_week34  (cost=0.00..644349.12 rows=26052512 width=33)
           ->  Hash  (cost=491597.12..491597.12 rows=26052512 width=33)
                 ->  Seq Scan on email_2007_week34_extras  (cost=0.00..491597.12 rows=26052512 width=33)
  (7 rows)

Yes, I have `ANALYZE`d the database before running the queries.

How come the query planner gets thrown off that far by the simple table
partitioning?  What can I do to put the query planner back on the right
track?

Julian.


Re: Bad query plans for queries on partitioned table

From: Julian Mehnle

Julian Mehnle wrote:
> I have a large database with e-mail meta-data (no bodies) for over 100
> million messages.  I am running PostgreSQL 8.2.4 on a server with 2GB
> of RAM (shared_buffers = 240MB, temp_buffers = 128MB, work_mem = 256MB,
> maintenance_work_mem = 256MB).  I have the data split in two separate
> tables, "email" and "email_extras":
> [...]
>
> Each of these tables has been partitioned equally based on the "ts"
> (timestamp) field into two dozen or so tables, each covering one week
> of messages.  For testing purposes, I have only one week's partition
> filled for each of the "email" and "email_extras" tables
> (email_2007_week34 {,extras}).

Oh, just for the record:  I do have "constraint_exclusion" enabled.
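
(For reference, the setting can be checked per-session like this; a small
sketch, assuming the default 8.2 run-time parameter:

```sql
-- constraint_exclusion is a run-time GUC in 8.2; it can be verified
-- and enabled per session, or set globally in postgresql.conf.
SHOW constraint_exclusion;
SET constraint_exclusion = on;
```
)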

Julian.


Re: Bad query plans for queries on partitioned table

From: Gregory Stark

"Julian Mehnle" <julian@mehnle.net> writes:

> However, if I restrict the query to just the partitions that actually do
> have data in them ...

There are a few things going on here.

1) The optimizer can't build a plan which ignores those partitions because the
statistics are just approximations. You could insert into one of them at any
time and the statistics won't update immediately. If you have a partition
which is empty of some type of data you can put a constraint on it to promise
the optimizer that that condition will stay true.

2) The optimizer is assuming that empty tables have a default 1,000 records in
them with no idea about their statistics. Otherwise you get terrible plans on
tables which have just been created or never analyzed. In this case that's
causing it to think there will be tons of matches on what is apparently a very
selective criterion.

3) The optimizer is a bit dumb about partitioned tables. But I'm not sure if
that's actually the fault here.

Try adding one record of data to each of those partitions or putting a
constraint on them which will allow constraint_exclusion (I assume you have
that enabled?) to kick in. You'll still be bitten by the parent table but
hopefully that's not enough to cause a problem.
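
Gregory's suggestion of constraining an empty partition might look like the
following (a sketch only; the table name and week bounds are illustrative,
and the thread later shows such constraints already exist here):

```sql
-- Promise the planner that this partition can only ever hold week-33
-- rows, so constraint exclusion can skip it for other weeks.
ALTER TABLE email_2007_week33
    ADD CONSTRAINT email_2007_week33_ts_check
    CHECK (ts >= '2007-08-13 00:00:00'
       AND ts <  '2007-08-20 00:00:00');

-- Refresh statistics so the planner stops using defaults for the table.
ANALYZE email_2007_week33;
```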

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's On-Demand Production Tuning

Re: Bad query plans for queries on partitioned table

From: Julian Mehnle

Gregory Stark wrote:
> There are a few things going on here.
>
> 1) The optimizer can't build a plan which ignores those partitions
> because the statistics are just approximations. You could insert into
> one of them at any time and the statistics won't update immediately. If
> you have a partition which is empty of some type of data you can put a
> constraint on it to promise the optimizer that that condition will stay
> true.

I actually do have constraints on all the partitions, e.g. for week 34:

  Check constraints [for email_2007_week34]:
    "email_2007_week34_ts_check" CHECK (ts >= '2007-08-20 00:00:00'::timestamp without time zone AND ts < '2007-08-27 00:00:00'::timestamp without time zone)

  Check constraints [for email_2007_week34_extras]:
    "email_2007_week34_extras_ts_check" CHECK (ts >= '2007-08-20 00:00:00'::timestamp without time zone AND ts < '2007-08-27 00:00:00'::timestamp without time zone)

Shouldn't this be enough to give the query planner a clue that it only
has to join the "email" and "email_extras" tables' partitions pair-wise,
as opposed to cross-joining them?

For the record, I also have indexes:

  Indexes [for email_2007_week34]:
    "email_2007_week34_pkey" PRIMARY KEY, btree (id)
    "index_email_2007_week34_on_helo" btree (helo)
    "index_email_2007_week34_on_ip" btree (ip)
    "index_email_2007_week34_on_load_id" btree (load_id)
    "index_email_2007_week34_on_mfrom" btree (mfrom)
    "index_email_2007_week34_on_ts" btree (ts)

  Indexes [for email_2007_week34_extras]:
    "email_2007_week34_extras_pkey" PRIMARY KEY, btree (id)

> 2) The optimizer is assuming that empty tables have a default 1,000
> records in them with no idea about their statistics. Otherwise you get
> terrible plans on tables which have just been created or never
> analyzed. In this case that's causing it to think there will be tons of
> matches on what is apparently a very selective criterion.

I see.  But this shouldn't matter under the assumption that constraint
exclusion works correctly, right?

> 3) The optimizer is a bit dumb about partitioned tables. But I'm not
> sure if that's actually the fault here.
>
> Try adding one record of data to each of those partitions or putting a
> constraint on them which will allow constraint_exclusion (I assume you
> have that enabled?) to kick in. You'll still be bitten by the parent
> table but hopefully that's not enough to cause a problem.

The parent table is empty.  How will adding one record to each of the
partitions make a difference given the above constraints?

Julian.


Re: Bad query plans for queries on partitioned table

From: Gregory Stark

"Julian Mehnle" <julian@mehnle.net> writes:

> I actually do have constraints on all the partitions, e.g. for week 34:
>
>   Check constraints [for email_2007_week34]:
>     "email_2007_week34_ts_check" CHECK (ts >= '2007-08-20 00:00:00'::timestamp without time zone AND ts < '2007-08-27 00:00:00'::timestamp without time zone)
>
>   Check constraints [for email_2007_week34_extras]:
>     "email_2007_week34_extras_ts_check" CHECK (ts >= '2007-08-20 00:00:00'::timestamp without time zone AND ts < '2007-08-27 00:00:00'::timestamp without time zone)
>
> Shouldn't this be enough to give the query planner a clue that it only
> has to join the "email" and "email_extras" tables' partitions pair-wise,
> as opposed to cross-joining them?

Ah, well, this falls under "The optimizer is a bit dumb about partitioned
tables". It only looks at the constraints to compare against your WHERE
clause. It doesn't compare them against the constraints for other tables to
see if they're partitioned on the same key and therefore can be joined
table-by-table.

I want 8.4 to be cleverer in this area but there's a ton of things it has to
learn.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!

Re: Bad query plans for queries on partitioned table

From: Julian Mehnle

Gregory Stark wrote:
> "Julian Mehnle" <julian@mehnle.net> writes:
> > I actually do have constraints on all the partitions, e.g. for week
> > 34: [...]
> >
> > Shouldn't this be enough to give the query planner a clue that it
> > only has to join the "email" and "email_extras" tables' partitions
> > pair-wise, as opposed to cross-joining them?
>
> Ah, well, this falls under "The optimizer is a bit dumb about
> partitioned tables". It only looks at the constraints to compare
> against your WHERE clause. It doesn't compare them against the
> constraints for other tables to see if they're partitioned on the same
> key and therefore can be joined table-by-table.
>
> I want 8.4 to be cleverer in this area but there's a ton of things it
> has to learn.

That would be great.

So there's nothing that can be done about it right now, apart from
manually combining separate SELECTs for each partition using UNION?
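
(The manual rewrite in question would look something like this sketch;
only two partitions are shown, and a full rewrite would need one arm per
partition:

```sql
-- Per-partition rewrite of the original count: join each week's pair of
-- partitions directly, then sum the per-partition counts.
SELECT sum(n) AS total FROM (
    SELECT count(*) AS n
    FROM email_2007_week34
    INNER JOIN email_2007_week34_extras USING (id, ts)
    WHERE mfrom <> hdr_from
    UNION ALL
    SELECT count(*)
    FROM email_2007_week35
    INNER JOIN email_2007_week35_extras USING (id, ts)
    WHERE mfrom <> hdr_from
    -- ... one SELECT per remaining partition ...
) AS per_partition;
```
)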

Julian.


Re: Bad query plans for queries on partitioned table

From: Gregory Stark

"Julian Mehnle" <julian@mehnle.net> writes:

> Gregory Stark wrote:
>> "Julian Mehnle" <julian@mehnle.net> writes:
>> > I actually do have constraints on all the partitions, e.g. for week
>> > 34: [...]
>> >
>> > Shouldn't this be enough to give the query planner a clue that it
>> > only has to join the "email" and "email_extras" tables' partitions
>> > pair-wise, as opposed to cross-joining them?
>>
>> Ah, well, this falls under "The optimizer is a bit dumb about
>> partitioned tables". It only looks at the constraints to compare
>> against your WHERE clause. It doesn't compare them against the
>> constraints for other tables to see if they're partitioned on the same
>> key and therefore can be joined table-by-table.
>>
>> I want 8.4 to be cleverer in this area but there's a ton of things it
>> has to learn.
>
> That would be great.
>
> So there's nothing that can be done about it right now, apart from
> manually combining separate SELECTs for each partition using UNION?

Well, in the query you gave, I think if the partitions weren't completely
empty it would still be using the hash join; it would just be doing an append
of all the nearly-empty partitions first. The reason it's getting confused is
that in the absence of stats on them it thinks they contain hundreds of tuples
which will match your where clause and join clause. Look at the expected
number of rows for the merge join compared to the expected number of rows
for the hash join.

But yeah, there will be cases where you really want:

Append
   Merge Join
      Part1 of table1
      Part1 of table2
   Merge Join
      Part2 of table1
      Part2 of table2
   ...


But the planner only knows how to do:

Merge Join
   Append
      Part1 of table1
      Part2 of table1
      ...
   Append
      Part1 of table2
      Part2 of table2
      ...

Which requires two big sorts whereas the first plan could use indexes on
individual partitions. It also has a slower startup time and can't take
advantage of discovering that a partition of table1 is empty to avoid ever
reading from the corresponding partition of table2 the way the first plan can.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!

Re: Bad query plans for queries on partitioned table

From: Julian Mehnle

Gregory, thanks for all the insight!  It is much appreciated.

Julian.
