Thread: Bad query plans for queries on partitioned table
Hi all,

I have a large database with e-mail meta-data (no bodies) for over 100
million messages. I am running PostgreSQL 8.2.4 on a server with 2GB of
RAM (shared_buffers = 240MB, temp_buffers = 128MB, work_mem = 256MB,
maintenance_work_mem = 256MB). I have the data split in two separate
tables, "email" and "email_extras":

               Table "public.email"
 Column  |            Type             | Modifiers
---------+-----------------------------+-----------
 id      | bigint                      | not null
 load_id | integer                     | not null
 ts      | timestamp without time zone | not null
 ip      | inet                        | not null
 mfrom   | text                        | not null
 helo    | text                        |

           Table "public.email_extras"
  Column  |            Type             | Modifiers
----------+-----------------------------+-----------
 id       | bigint                      | not null
 ts       | timestamp without time zone | not null
 size     | integer                     | not null
 hdr_from | text                        |

Each of these tables has been partitioned identically based on the "ts"
(timestamp) field into two dozen or so tables, each covering one week of
messages. For testing purposes, I have only one week's partition filled
for each of the "email" and "email_extras" tables (email_2007_week34 and
email_2007_week34_extras).

Now if I perform the following simple join on the "email" and
"email_extras" tables ...

  SELECT count(*)
    FROM email
   INNER JOIN email_extras USING (id, ts)
   WHERE mfrom <> hdr_from;

... then I get the following horrendously inefficient plan:

                                                         QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=391396890.89..391396890.90 rows=1 width=0)
   ->  Merge Join  (cost=9338881.64..349156398.02 rows=16896197148 width=0)
         Merge Cond: ((public.email_extras.id = public.email.id) AND (public.email_extras.ts = public.email.ts))
         Join Filter: (public.email.mfrom <> public.email_extras.hdr_from)
         ->  Sort  (cost=4592966.95..4658121.33 rows=26061752 width=48)
               Sort Key: public.email_extras.id, public.email_extras.ts
               ->  Append  (cost=0.00..491969.52 rows=26061752 width=48)
                     ->  Seq Scan on email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week13_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week14_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week15_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week16_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week17_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week18_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week19_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week20_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week21_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week22_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week23_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week24_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week25_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week26_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week27_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week28_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week29_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week30_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week31_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week32_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week33_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week34_extras email_extras  (cost=0.00..491597.12 rows=26052512 width=33)
                     ->  Seq Scan on email_2007_week35_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week36_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week37_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week38_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week39_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
                     ->  Seq Scan on email_2007_week40_extras email_extras  (cost=0.00..13.30 rows=330 width=48)
         ->  Sort  (cost=4745914.69..4811071.87 rows=26062872 width=48)
               Sort Key: public.email.id, public.email.ts
               ->  Append  (cost=0.00..644732.72 rows=26062872 width=48)
                     ->  Seq Scan on email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week13 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week14 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week15 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week16 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week17 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week18 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week19 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week20 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week21 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week22 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week23 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week24 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week25 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week26 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week27 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week28 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week29 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week30 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week31 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week32 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week33 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week34 email  (cost=0.00..644349.12 rows=26052512 width=33)
                     ->  Seq Scan on email_2007_week35 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week36 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week37 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week38 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week39 email  (cost=0.00..13.70 rows=370 width=48)
                     ->  Seq Scan on email_2007_week40 email  (cost=0.00..13.70 rows=370 width=48)
(68 rows)

However, if I restrict the query to just the partitions that actually do
have data in them ...
  SELECT count(*)
    FROM email_2007_week34
   INNER JOIN email_2007_week34_extras USING (id, ts)
   WHERE mfrom <> hdr_from;

... then I get a much better plan that uses a hash join:

                                                              QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4266338.94..4266338.95 rows=1 width=0)
   ->  Hash Join  (cost=1111362.80..4266336.07 rows=1145 width=0)
         Hash Cond: ((email_2007_week34.ts = email_2007_week34_extras.ts) AND (email_2007_week34.id = email_2007_week34_extras.id))
         Join Filter: (email_2007_week34.mfrom <> email_2007_week34_extras.hdr_from)
         ->  Seq Scan on email_2007_week34  (cost=0.00..644349.12 rows=26052512 width=33)
         ->  Hash  (cost=491597.12..491597.12 rows=26052512 width=33)
               ->  Seq Scan on email_2007_week34_extras  (cost=0.00..491597.12 rows=26052512 width=33)
(7 rows)

Yes, I have `ANALYZE`d the database before running the queries.

How come the query planner gets thrown off that far by the simple table
partitioning? What can I do to put the query planner back on the right
track?

Julian.
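[For context: partitioning on 8.2 is done with table inheritance plus
CHECK constraints. A minimal sketch of the week-34 partition pair,
consistent with the constraints quoted later in this thread; the actual
DDL repeats this, with shifted ts ranges, for every week from 13
through 40:]

  -- Sketch of one partition pair under 8.2's inheritance-based
  -- partitioning; column definitions are inherited from the parents.
  CREATE TABLE email_2007_week34 (
      CHECK (ts >= '2007-08-20 00:00:00' AND ts < '2007-08-27 00:00:00')
  ) INHERITS (email);

  CREATE TABLE email_2007_week34_extras (
      CHECK (ts >= '2007-08-20 00:00:00' AND ts < '2007-08-27 00:00:00')
  ) INHERITS (email_extras);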
Julian Mehnle wrote:
> I have a large database with e-mail meta-data (no bodies) for over 100
> million messages. I am running PostgreSQL 8.2.4 on a server with 2GB
> of RAM (shared_buffers = 240MB, temp_buffers = 128MB, work_mem =
> 256MB, maintenance_work_mem = 256MB). I have the data split in two
> separate tables, "email" and "email_extras":
> [...]
>
> Each of these tables has been partitioned identically based on the
> "ts" (timestamp) field into two dozen or so tables, each covering one
> week of messages. For testing purposes, I have only one week's
> partition filled for each of the "email" and "email_extras" tables
> (email_2007_week34 and email_2007_week34_extras).

Oh, just for the record: I do have "constraint_exclusion" enabled.

Julian.
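[A quick way to confirm that setting from psql; constraint_exclusion is
what allows the planner to use the partitions' CHECK constraints at
all:]

  SHOW constraint_exclusion;   -- should report "on"
  -- or enable it in postgresql.conf, or per session:
  SET constraint_exclusion = on;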
"Julian Mehnle" <julian@mehnle.net> writes: > However, if I restrict the query to just the partitions that actually do > have data in them ... There are a few things going on here. 1) The optimizer can't build a plan which ignores those partitions because the statistics are just approximations. You could insert into one of them at any time and the statistics won't update immediately. If you have a partition which is empty of some type of data you can put a constraint on it to promise the optimizer that that condition will stay true. 2) The optimizer is assuming that empty tables have a default 1,000 records in them with no idea about their statistics. Otherwise you get terrible plans on tables which have just been created or never analyzed. In this case that's causing it to think there will be tons of matches on what is apparently a very selective criterion. 3) The optimizer is a bit dumb about partitioned tables. But I'm not sure if that's actually the fault here. Try adding one record of data to each of those partitions or putting a constraint on them which will allow constraint_exclusion (I assume you have that enabled?) to kick in. You'll still be bitten by the parent table but hopefully that's not enough to cause a problem. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production Tuning
Gregory Stark wrote:
> There are a few things going on here.
>
> 1) The optimizer can't build a plan which ignores those partitions,
> because the statistics are just approximations: you could insert into
> one of them at any time, and the statistics won't update immediately.
> If you have a partition which is empty of some type of data, you can
> put a constraint on it to promise the optimizer that that condition
> will stay true.

I actually do have constraints on all the partitions, e.g. for week 34:

  Check constraints [for email_2007_week34]:
    "email_2007_week34_ts_check" CHECK (ts >= '2007-08-20 00:00:00'::timestamp without time zone
                                    AND ts <  '2007-08-27 00:00:00'::timestamp without time zone)

  Check constraints [for email_2007_week34_extras]:
    "email_2007_week34_extras_ts_check" CHECK (ts >= '2007-08-20 00:00:00'::timestamp without time zone
                                           AND ts <  '2007-08-27 00:00:00'::timestamp without time zone)

Shouldn't this be enough to give the query planner a clue that it only
has to join the "email" and "email_extras" tables' partitions
pair-wise, as opposed to cross-joining them?

For the record, I also have indexes:

  Indexes [for email_2007_week34]:
    "email_2007_week34_pkey" PRIMARY KEY, btree (id)
    "index_email_2007_week34_on_helo" btree (helo)
    "index_email_2007_week34_on_ip" btree (ip)
    "index_email_2007_week34_on_load_id" btree (load_id)
    "index_email_2007_week34_on_mfrom" btree (mfrom)
    "index_email_2007_week34_on_ts" btree (ts)

  Indexes [for email_2007_week34_extras]:
    "email_2007_week34_extras_pkey" PRIMARY KEY, btree (id)

> 2) The optimizer assumes that empty tables have a default 1,000
> records in them, with no idea about their statistics; otherwise you
> would get terrible plans on tables which have just been created or
> never been analyzed. In this case that's causing it to think there
> will be tons of matches on what is apparently a very selective
> criterion.

I see. But this shouldn't matter under the assumption that constraint
exclusion works correctly, right?

> 3) The optimizer is a bit dumb about partitioned tables. But I'm not
> sure if that's actually the fault here.
>
> Try adding one record of data to each of those partitions, or putting
> a constraint on them which will allow constraint_exclusion (I assume
> you have that enabled?) to kick in. You'll still be bitten by the
> parent table, but hopefully that's not enough to cause a problem.

The parent table is empty. How will adding one record to each of the
partitions make a difference given the above constraints?

Julian.
"Julian Mehnle" <julian@mehnle.net> writes: > I actually do have constraints on all the partitions, e.g. for week 34: > > Check constraints [for email_2007_week34]: > "email_2007_week34_ts_check" CHECK (ts >= '2007-08-20 00:00:00'::timestamp without time zone AND ts < '2007-08-27 00:00:00'::timestampwithout time zone) > > Check constraints [for email_2007_week34_extras]: > "email_2007_week34_extras_ts_check" CHECK (ts >= '2007-08-20 00:00:00'::timestamp without time zone AND ts < '2007-08-2700:00:00'::timestamp without time zone) > > Shouldn't this be enough to give the query planner a clue that it only > has to join the "email" and "email_extras" tables' partitions pair-wise, > as opposed to cross-joining them? Ah, well, this falls under "The optimizer is a bit dumb about partitioned tables". It only looks at the constraints to compare against your WHERE clause. It doesn't compare them against the constraints for other tables to see if they're partitioned on the same key and therefore can be joined table-by-table. I want 8.4 to be cleverer in this area but there's a ton of things it has to learn. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
Gregory Stark wrote:
> "Julian Mehnle" <julian@mehnle.net> writes:
> > I actually do have constraints on all the partitions, e.g. for week
> > 34: [...]
> >
> > Shouldn't this be enough to give the query planner a clue that it
> > only has to join the "email" and "email_extras" tables' partitions
> > pair-wise, as opposed to cross-joining them?
>
> Ah, well, this falls under "The optimizer is a bit dumb about
> partitioned tables". It only looks at the constraints to compare
> against your WHERE clause. It doesn't compare them against the
> constraints for other tables to see if they're partitioned on the
> same key and therefore can be joined table-by-table.
>
> I want 8.4 to be cleverer in this area, but there's a ton of things
> it has to learn.

That would be great.

So there's nothing that can be done about it right now, apart from
manually combining separate SELECTs for each partition using UNION?

Julian.
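[The manual approach described here might look like the following
sketch, shown for just two weeks; UNION ALL, rather than plain UNION,
avoids a pointless duplicate-elimination step:]

  -- Join each same-week partition pair directly, then sum the
  -- per-partition counts (one UNION ALL arm per week).
  SELECT sum(n) AS total FROM (
      SELECT count(*) AS n
        FROM email_2007_week33 e
        JOIN email_2007_week33_extras x ON x.id = e.id AND x.ts = e.ts
       WHERE e.mfrom <> x.hdr_from
      UNION ALL
      SELECT count(*) AS n
        FROM email_2007_week34 e
        JOIN email_2007_week34_extras x ON x.id = e.id AND x.ts = e.ts
       WHERE e.mfrom <> x.hdr_from
      -- UNION ALL ... (repeat for each remaining week)
  ) AS per_week;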
"Julian Mehnle" <julian@mehnle.net> writes: > Gregory Stark wrote: >> "Julian Mehnle" <julian@mehnle.net> writes: >> > I actually do have constraints on all the partitions, e.g. for week >> > 34: [...] >> > >> > Shouldn't this be enough to give the query planner a clue that it >> > only has to join the "email" and "email_extras" tables' partitions >> > pair-wise, as opposed to cross-joining them? >> >> Ah, well, this falls under "The optimizer is a bit dumb about >> partitioned tables". It only looks at the constraints to compare >> against your WHERE clause. It doesn't compare them against the >> constraints for other tables to see if they're partitioned on the same >> key and therefore can be joined table-by-table. >> >> I want 8.4 to be cleverer in this area but there's a ton of things it >> has to learn. > > That would be great. > > So there's nothing that can be done about it right now, apart from > manually combining separate SELECTs for each partition using UNION? Well the in the query you gave I think if the partitions weren't completely empty it would still be using the hash join, it would just be doin an append of all the nearly-empty partitions first. The reason it's getting confused is that in the absence of stats on them it thinks they contain hundreds of tuples which will match your where clause and join clause. Look at the expected number of rows the for the merge jjoin compared to the expected number of rows for the hash join. But yeah, there will be cases where you really want: Append Merge Join Part1 of table1 Part2 of table2 Merge Join Part2 of table1 Part2 of table2 ... But the planner only knows how to do: Merge Join Append Part1 of table1 Part2 of table1 ... Append Part1 of table1 Part2 of table2 ... Which requires two big sorts whereas the first plan could use indexes on individual partitions. It also has a slower startup time and can't take advantage of discovering that a partition of table1 is empty to avoid ever reading from the corresponding partition of table2 the way the first plan can. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
Gregory, thanks for all the insight! It is much appreciated. Julian.