Thread: Re: [GENERAL] nested loops in joins, ambiguous rewrite rules
Hello. I just sent this note to the pgsql-general list, but I thought it probably deserved to land on the hackers list as well. My apologies if I'm wrong.

Charlie

-----------------------------------------
X-Sender: hornberger@tabloid.net (Unverified)
Date: Fri, 29 Jan 1999 18:00:11 -0800
To: pgsql-general@postgreSQL.org
From: Charles Hornberger <hornberger@tabloid.net>
Subject: Re: [GENERAL] nested loops in joins, ambiguous rewrite rules
Sender: owner-pgsql-general@postgreSQL.org

Hello again. I've got a few more notes on this "strange" optimizer behavior, which might be useful to anyone interested in this problem:

As I mentioned in my last posting, I have one DB called apx13 that is producing expensive "nested loop" query plans when I try to join two tables, and one called apx14 that's using "merge join" query plans. I've been trying to figure out why this is. (This is all happening under 6.4. We haven't upgraded to 6.4.2 yet, but I don't see anything in the 6.4.2 HISTORY file that suggests that major changes have been made to the optimizer, so I guess that probably wouldn't change anything?)

In any case, I dumped the entire contents of the apx14 database using `pg_dump -D apx14 > apx14.out`. Then I created a new database called apx15 and ran `psql -f apx14.out apx15`. Then I ran a join query against apx15:

apx15=> \i ../query
explain select a.article_id, b.article_id from article a, article_text b
  where a.article_id = b.article_id;
NOTICE:  QUERY PLAN:

Nested Loop  (cost=3.20 size=2 width=8)
  ->  Seq Scan on article a  (cost=1.07 size=2 width=4)
  ->  Seq Scan on article_text b  (cost=1.07 size=2 width=4)

EXPLAIN

So I ran vacuum analyze and did it again. And got the same query plan.

At that point, I decided to create yet another DB, apx16. I just used the schema from apx14, and the contents of the two tables being used in my join query. In other words:

  `pg_dump -s apx14 > apx14.schema.out`
  `pg_dump -aDt article apx14 > apx14.article.out`
  `pg_dump -aDt article_text apx14 > apx14.article_text.out`
  `psql -f apx14.schema.out apx16`
  `psql -f apx14.article.out apx16`
  `psql -f apx14.article_text.out apx16`

Then I ran the same join query against apx16, and this time the optimizer decided to use a "merge join":

apx16=> \i ../query
explain select a.article_id, b.article_id from article a, article_text b
  where a.article_id = b.article_id;
NOTICE:  QUERY PLAN:

Merge Join  (cost=0.00 size=1 width=8)
  ->  Seq Scan  (cost=0.00 size=0 width=0)
        ->  Sort  (cost=0.00 size=0 width=0)
              ->  Seq Scan on article a  (cost=0.00 size=0 width=4)
  ->  Seq Scan  (cost=0.00 size=0 width=0)
        ->  Sort  (cost=0.00 size=0 width=0)
              ->  Seq Scan on article_text b  (cost=0.00 size=0 width=4)

EXPLAIN

Maybe I'm missing something, but I can't understand why two identical join queries run against two identical tables should produce different output. Maybe there's something in the way my tables and indices are set up, but I can't imagine what that is.

One final note: I just did another test, hoping to exclude other factors from the equation. I created two more databases using only the two relevant tables (and sequences & indices) from apx16. What I discovered is that if I build a database from "full" dump files (i.e., dump files that include both table schema & data), the optimizer will always elect to use "nested loops" for its query plans. On the other hand, if I build the database by first creating the tables and indices, and then doing the inserts in a second operation, it uses the less costly "merge join".
As a matter of fact, it appears that if you create an index on a table *after* inserting data into that table, joins performed on that table will *always* use these expensive nested loops. So my theory is: since pg_dump creates dump files that perform INSERTs on tables before performing any CREATE INDEX statements on tables, using raw dump files to create databases will cause this "nested loop" behavior.

And there's another catch: if you create your tables and indices first, then insert your data, the optimizer will continue to build "merge join" query plans until you make the mistake of running a "vacuum analyze". If you run "vacuum analyze", the optimizer will then decide to begin using the more costly nested loops to handle any join queries. The only way to get around this is to delete all records from the tables being joined, re-insert them, and remember NOT to run vacuum analyze.

If anyone wants to replicate this, I've documented my tests below. (And, of course, if I've totally missed the point on anything in here, please correct me.)

===============
Table structure
===============

apx16=> \d article

Table    = article
+-------------------+---------------------------------+--------+
|       Field       |              Type               | Length |
+-------------------+---------------------------------+--------+
| article_id        | int4 not null default nextval ( |      4 |
| section_id        | int4 not null                   |      4 |
| locale_id         | int4 not null                   |      4 |
| article_source_id | int4                            |      4 |
| volume_id         | int4                            |      4 |
| issue_id          | int4                            |      4 |
| print_page_no     | int4                            |      4 |
| print_publ_date   | date                            |      4 |
| site_publ_date    | date not null                   |      4 |
| site_publ_time    | datetime not null               |      8 |
| inputter_id       | int4 not null                   |      4 |
| input_date        | datetime default text 'now'     |      8 |
| published         | int4 default 0                  |      4 |
+-------------------+---------------------------------+--------+
Indices:  article_article_id_key
          article_issue_ix
          article_locale_ix
          article_section_ix
          article_source_ix
          article_vol_ix

apx16=> \d article_article_id_key

Table    = article_article_id_key
+-------------------+------+--------+
|       Field       | Type | Length |
+-------------------+------+--------+
| article_id        | int4 |      4 |
+-------------------+------+--------+

apx16=> \d article_text

Table    = article_text
+-------------------+---------------+--------+
|       Field       |     Type      | Length |
+-------------------+---------------+--------+
| article_id        | int4 not null |      4 |
| headline          | varchar()     |      0 |
| subhead           | varchar()     |      0 |
+-------------------+---------------+--------+
Index:    article_text_ix

apx16=> \d article_text_ix

Table    = article_text_ix
+-------------------+------+--------+
|       Field       | Type | Length |
+-------------------+------+--------+
| article_id        | int4 |      4 |
+-------------------+------+--------+

=================
CREATE STATEMENTS
=================

CREATE SEQUENCE "article_article_id_seq" start 10 increment 1 maxvalue 2147483647 minvalue 1 cache 1;
SELECT nextval ('article_article_id_seq');
CREATE TABLE "article" (
        "article_id" "int4" DEFAULT nextval ( 'article_article_id_seq' ) NOT NULL,
        "section_id" "int4" NOT NULL,
        "locale_id" "int4" NOT NULL,
        "article_source_id" "int4",
        "volume_id" "int4",
        "issue_id" "int4",
        "print_page_no" "int4",
        "print_publ_date" "date",
        "site_publ_date" "date" NOT NULL,
        "site_publ_time" "datetime" NOT NULL,
        "inputter_id" "int4" NOT NULL,
        "input_date" "datetime" DEFAULT text 'now',
        "published" "int4" DEFAULT 0);
CREATE UNIQUE INDEX "article_article_id_key" on "article" using btree ( "article_id" "int4_ops" );
CREATE INDEX "article_vol_ix" on "article" using btree ( "volume_id" "int4_ops" );
CREATE INDEX "article_source_ix" on "article" using btree ( "article_source_id" "int4_ops" );
CREATE INDEX "article_issue_ix" on "article" using btree ( "issue_id" "int4_ops" );
CREATE INDEX "article_locale_ix" on "article" using btree ( "locale_id" "int4_ops" );
CREATE INDEX "article_section_ix" on "article" using btree ( "section_id" "int4_ops" );
CREATE TABLE "article_text" (
        "article_id" "int4" NOT NULL,
        "headline" varchar,
        "subhead" varchar);
CREATE INDEX "article_text_ix" on "article_text" using btree ( "article_id" "int4_ops" );

=================
INSERT STATEMENTS
=================

INSERT INTO "article" ("article_id","section_id","locale_id","article_source_id","volume_id","issue_id","print_page_no","print_publ_date","site_publ_date","site_publ_time","inputter_id","input_date","published")
  values (10,3,1,4,2,3,4,'04-05-2006','01-28-1999','Thu Jan 28 19:28:40 1999 PST',100,'Thu Jan 28 19:28:40 1999 PST',0);
INSERT INTO "article" ("article_id","section_id","locale_id","article_source_id","volume_id","issue_id","print_page_no","print_publ_date","site_publ_date","site_publ_time","inputter_id","input_date","published")
  values (11,3,1,4,2,3,4,'04-05-2006','01-28-1999','Thu Jan 28 19:28:40 1999 PST',100,'Thu Jan 28 19:28:40 1999 PST',0);
INSERT INTO "article_text" ("article_id","headline","subhead")
  values (10,'Mayor Signs Contract With Company','Legally binding document said to be four pages long');
INSERT INTO "article_text" ("article_id","headline","subhead")
  values (11,'Mayor Cancels Contract','Company Promises to Sue Over Scuttled Deal');

==================
Steps to replicate
==================

Vacuum analyze 'problem'
------------------------
1. Create a fresh database and execute the CREATE statements above
2. Execute the INSERT statements above
3. Execute the following query:
     EXPLAIN SELECT a.article_id, b.article_id
     FROM article a, article_text b
     WHERE a.article_id = b.article_id;
4. You should see:
     NOTICE:  QUERY PLAN:

     Merge Join  (cost=0.00 size=1 width=8)
       ->  Seq Scan  (cost=0.00 size=0 width=0)
             ->  Sort  (cost=0.00 size=0 width=0)
                   ->  Seq Scan on article a  (cost=0.00 size=0 width=4)
       ->  Seq Scan  (cost=0.00 size=0 width=0)
             ->  Sort  (cost=0.00 size=0 width=0)
                   ->  Seq Scan on article_text b  (cost=0.00 size=0 width=4)

     EXPLAIN
5. Do "vacuum analyze".
6. Execute the query again.
7. Now you'll see:
     NOTICE:  QUERY PLAN:

     Nested Loop  (cost=3.20 size=2 width=8)
       ->  Seq Scan on article a  (cost=1.07 size=2 width=4)
       ->  Seq Scan on article_text b  (cost=1.07 size=2 width=4)

     EXPLAIN

pg_dump 'problem'
-----------------
1. Dump the DB you created above to a file with `pg_dump -D dbname > foo`
2. Create a new DB
3. Do `psql -f foo newdb`
4. Execute the join query:
     EXPLAIN SELECT a.article_id, b.article_id
     FROM article a, article_text b
     WHERE a.article_id = b.article_id;
5. You should see:
     NOTICE:  QUERY PLAN:

     Nested Loop  (cost=3.20 size=2 width=8)
       ->  Seq Scan on article a  (cost=1.07 size=2 width=4)
       ->  Seq Scan on article_text b  (cost=1.07 size=2 width=4)

     EXPLAIN
Charles Hornberger <charlie@k4azl.net> writes:
> Hello again. I've got a few more notes on this "strange" optimizer
> behavior, which might be useful to anyone interested in this problem:

Well, I understand at least part of the "problem" here.

First, you're assuming that a merge-join plan is necessarily better than a nested-loop plan. That should be true for large queries, but it is *not* necessarily true for small tables --- when there are only a few tuples in the tables being scanned, a simple nested loop wins because it has much less startup overhead. (Or at least that's what our optimizer thinks; I have not tried to measure this for myself.)

What's really going on here is that when the optimizer *knows* how small your test tables are, it deliberately chooses nested-loop as being the fastest thing. If it doesn't know, it makes some default assumptions about the sizes of the tables, and with those default sizes it decides that merge-join will be cheaper.

So the next question is why apparently similar database situations yield different states of optimizer knowledge. The answer is that creating an index before inserting tuples and creating it afterwards have different side effects on the optimizer's statistics. I've only spent about ten minutes looking into this, so my understanding is no doubt incomplete, but what I've found out is:

 1. There are a couple of statistical fields in the system pg_class table,
    namely relpages and reltuples (number of disk pages and tuples in each
    relation).
 2. There are per-attribute statistics kept in the pg_statistic table.
 3. The pg_class statistics fields are updated by a VACUUM (with or without
    ANALYZE) *and also by CREATE INDEX*. Possibly by other things too ...
    but plain INSERTs and so forth don't update 'em.
 4. The pg_statistic info only seems to be updated by VACUUM ANALYZE.

So if you do
        CREATE TABLE
        CREATE INDEX
        insert tuples
then the state of play is that the optimizer thinks the table is empty. (Or, perhaps, it makes its default assumption --- pg_class doesn't seem to have any special representation for "size of table unknown" as opposed to "size of table is zero", so maybe the optimizer treats reltuples = 0 as meaning it should use a default table size. I haven't looked at that part of the code to find out.)

But if you do
        CREATE TABLE
        insert tuples
        CREATE INDEX
the state of play is that the optimizer thinks there are as many tuples in the table as there were when you created the index. This explains the varying behavior in your detailed test case.

I'll bet that if you investigate the contents of pg_class and pg_statistic in the two databases you were originally working with, you'll find that they are different. But a VACUUM ANALYZE should bring everything into sync, assuming that the actual data in the databases is the same.

At any rate, if you still think there's something flaky going on, please have a look at what's happening in these statistical fields; that'll give us more info about whether there's really a bug or not.

Also, if the optimizer still wants to use nested loop when it knows there are a *lot* of tuples to be processed, there might be a bug in its equations for the cost of these operations --- how large are your tables?

			regards, tom lane
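Points 1 through 4 above can be watched directly from psql. A minimal sketch (the scratch table "t" and index "t_id_ix" are made-up names for illustration; the catalog columns are the ones named above):

  CREATE TABLE t (id int4);
  INSERT INTO t VALUES (1);
  INSERT INTO t VALUES (2);
  -- plain INSERTs: relpages/reltuples in pg_class should be unchanged
  SELECT relname, relpages, reltuples FROM pg_class WHERE relname = 't';
  -- building an index picks up the current row count as a side effect
  CREATE INDEX "t_id_ix" on "t" using btree ( "id" "int4_ops" );
  SELECT relname, relpages, reltuples FROM pg_class WHERE relname = 't';
  -- VACUUM ANALYZE refreshes pg_class and the per-attribute rows in pg_statistic
  VACUUM ANALYZE;

If the same sequence is run with the CREATE INDEX moved ahead of the INSERTs, the second SELECT should still show the statistics of an empty table -- which is exactly the difference between the two dump orderings in the test case above.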
At 04:07 PM 1/30/99 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>First, you're assuming that a merge-join plan is necessarily better than
>a nested-loop plan. That should be true for large queries, but it is
>*not* necessarily true for small tables --- when there are only a few
>tuples in the tables being scanned, a simple nested loop wins because it
>has much less startup overhead. (Or at least that's what our optimizer
>thinks; I have not tried to measure this for myself.)

OK, I understand that I don't understand whether merge-join plans are necessarily better than nested-loop plans, and that it could make sense to pick one or the other depending on the size of the tables and the number of rows in them. Also, your explanation of how 'vacuum analyze' updates the statistics in pg_class and pg_statistic makes it very clear why I'm seeing one query plan in one DB, and a different plan in the other. Thanks for the quick lesson, and my apologies for making it happen on the hackers list.

But let's leave that aside for the moment and address the underlying question: why does it take ~11.5 minutes to process the following query? It performs a simple ANDed join on seven tables. All of the columns named in the WHERE clauses are indexed, and all of the tables except article and article_text contain just one single row; article and article_text contain two rows.

EXPLAIN SELECT a.article_id, a.print_publ_date, b.section_name, c.source_name,
       d.headline, e.issue_name, f.locale_name, g.volume_name
FROM article a, section b, article_source c, article_text d, issue e, locale f, volume g
WHERE a.article_id = d.article_id
  AND a.section_id = b.section_id
  AND a.article_source_id = c.source_id
  AND a.issue_id = e.issue_id
  AND a.locale_id = f.locale_id
  AND a.volume_id = g.volume_id;

Let me explain each of the WHERE clauses:

1) a.article_id has a unique index, d.article_id has a non-unique index
2) a.section_id has a non-unique index, b.section_id has a unique index
3) a.article_source_id has a non-unique index, c.source_id has a unique index
4) a.issue_id has a non-unique index, e.issue_id has a unique index
5) a.locale_id has a non-unique index, f.locale_id has a unique index
6) a.volume_id has a non-unique index, g.volume_id has a unique index

This is being done under Postgres 6.4.2 (we upgraded from 6.4 last night) on a Pentium 150 with 192MB of RAM running Linux 2.0.35.

Here's the query plan it generated:

NOTICE:  QUERY PLAN:

Nested Loop  (cost=12.49 size=2 width=124)
  ->  Nested Loop  (cost=10.43 size=2 width=108)
        ->  Nested Loop  (cost=8.36 size=2 width=92)
              ->  Nested Loop  (cost=6.30 size=2 width=76)
                    ->  Nested Loop  (cost=4.16 size=2 width=60)
                          ->  Nested Loop  (cost=2.10 size=2 width=44)
                                ->  Seq Scan on locale f  (cost=1.03 size=1 width=16)
                                ->  Seq Scan on article a  (cost=1.07 size=2 width=28)
                          ->  Seq Scan on issue e  (cost=1.03 size=1 width=16)
                    ->  Seq Scan on article_text d  (cost=1.07 size=2 width=16)
              ->  Seq Scan on article_source c  (cost=1.03 size=1 width=16)
        ->  Seq Scan on section b  (cost=1.03 size=1 width=16)
  ->  Seq Scan on volume g  (cost=1.03 size=1 width=16)

>At any rate, if you still think there's something flaky going on,
>please have a look at what's happening in these statistical fields;
>that'll give us more info about whether there's really a bug or not.
>
>Also, if the optimizer still wants to use nested loop when it knows
>there are a *lot* of tuples to be processed, there might be a bug in
>its equations for the cost of these operations --- how large are your
>tables?
>

The tables are tiny.
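A more targeted way to pull just the planner-relevant numbers for the seven tables in the join -- a minimal sketch along the lines of what Tom asked for; the full raw catalog dumps follow below:

  SELECT relname, relpages, reltuples
  FROM pg_class
  WHERE relname IN ('article', 'article_text', 'section', 'locale',
                    'article_source', 'volume', 'issue');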
Here are some statistics from the DB.

apx00=> select * from pg_statistic;

starelid|staattnum|staop|stalokey                                  |stahikey
--------+---------+-----+------------------------------------------+----------------------------------------------------
  407828|        4|    0|4                                         |4
  407828|        5|    0|2                                         |2
  407828|        6|    0|3                                         |3
  407828|        7|    0|4                                         |4
  407828|        8|    0|04-05-2006                                |04-05-2006
  407828|        9|    0|01-28-1999                                |01-28-1999
  407828|       10|    0|Thu Jan 28 19:28:40 1999 PST              |Thu Jan 28 19:28:40 1999 PST
  407828|       11|    0|100                                       |100
  407828|       12|    0|Thu Jan 28 19:28:40 1999 PST              |Thu Jan 28 19:28:40 1999 PST
  407828|       13|    0|0                                         |0
  407828|        1|    0|10                                        |11
  407828|        2|    0|3                                         |3
  407828|        3|    0|1                                         |1
  407852|        1|    0|10                                        |11
  407852|        2|    0|Mayor Cancels Contract                    |Mayor Signs Contract With Company
  407852|        3|    0|Company Promises to Sue Over Scuttled Deal|Legally binding document said to be four pages long
  407863|        1|    0|3                                         |3
  407863|        2|    0|News                                      |News
  407874|        1|    0|1                                         |1
  407874|        2|    0|Downtown                                  |Downtown
  407885|        1|    0|4                                         |4
  407885|        2|    0|The Times                                 |The Times
  407896|        1|    0|2                                         |2
  407896|        2|    0|2                                         |2
  407907|        1|    0|3                                         |3
  407907|        2|    0|25                                        |25
  407907|        3|    0|04-05-2006                                |04-05-2006
(27 rows)

apx00=> select oid,* from pg_class where relowner <> 505;

   oid|relname                     |reltype|relowner|relam|relpages|reltuples|relhasindex|relisshared|relkind|relnatts|relchecks|reltriggers|relukeys|relfkeys|relrefs|relhaspkey|relhasrules|relacl
------+----------------------------+-------+--------+-----+--------+---------+-----------+-----------+-------+--------+---------+-----------+--------+--------+-------+----------+-----------+------
407923|article_vol_ix              |      0|     508|  403|       2|        2|f          |f          |i      |       1|        0|          0|       0|       0|      0|f         |f          |
407926|article_source_ix           |      0|     508|  403|       2|        2|f          |f          |i      |       1|        0|          0|       0|       0|      0|f         |f          |
407929|article_issue_ix            |      0|     508|  403|       2|        2|f          |f          |i      |       1|        0|          0|       0|       0|      0|f         |f          |
407932|article_locale_ix           |      0|     508|  403|       2|        2|f          |f          |i      |       1|        0|          0|       0|       0|      0|f         |f          |
407726|article_article_id_seq      |      0|     508|    0|       0|        0|f          |f          |S      |       8|        0|          0|       0|       0|      0|f         |f          |
407743|section_section_id_seq      |      0|     508|    0|       0|        0|f          |f          |S      |       8|        0|          0|       0|       0|      0|f         |f          |
407760|locale_locale_id_seq        |      0|     508|    0|       0|        0|f          |f          |S      |       8|        0|          0|       0|       0|      0|f         |f          |
407777|article_source_source_id_seq|      0|     508|    0|       0|        0|f          |f          |S      |       8|        0|          0|       0|       0|      0|f         |f          |
407794|volume_volume_id_seq        |      0|     508|    0|       0|        0|f          |f          |S      |       8|        0|          0|       0|       0|      0|f         |f          |
407811|issue_issue_id_seq          |      0|     508|    0|       0|        0|f          |f          |S      |       8|        0|          0|       0|       0|      0|f         |f          |
407828|article                     |      0|     508|    0|       1|        2|t          |f          |r      |      13|        0|          0|       0|       0|      0|f         |f          |
407935|article_section_ix          |      0|     508|  403|       2|        2|f          |f          |i      |       1|        0|          0|       0|       0|      0|f         |f          |
407852|article_text                |      0|     508|    0|       1|        2|t          |f          |r      |       3|        0|          0|       0|       0|      0|f         |f          |
407938|article_text_ix             |      0|     508|  403|       2|        2|f          |f          |i      |       1|        0|          0|       0|       0|      0|f         |f          |
407863|section                     |      0|     508|    0|       1|        1|t          |f          |r      |       2|        0|          0|       0|       0|      0|f         |f          |
407941|section_section_id_key      |      0|     508|  403|       2|        1|f          |f          |i      |       1|        0|          0|       0|       0|      0|f         |f          |
407874|locale                      |      0|     508|    0|       1|        1|t          |f          |r      |       2|        0|          0|       0|       0|      0|f         |f          |
407944|locale_locale_id_key        |      0|     508|  403|       2|        1|f          |f          |i      |       1|        0|          0|       0|       0|      0|f         |f          |
407885|article_source              |      0|     508|    0|       1|        1|t          |f          |r      |       2|        0|          0|       0|       0|      0|f         |f          |
407920|article_article_id_key      |      0|     508|  403|       2|        2|f          |f          |i      |       1|        0|          0|       0|       0|      0|f         |f          |
407947|article_source_source_id_key|      0|     508|  403|       2|        1|f          |f          |i      |       1|        0|          0|       0|       0|      0|f         |f          |
407896|volume                      |      0|     508|    0|       1|        1|t          |f          |r      |       2|        0|          0|       0|       0|      0|f         |f          |
407950|volume_volume_id_key        |      0|     508|  403|       2|        1|f          |f          |i      |       1|        0|          0|       0|       0|      0|f         |f          |
407907|issue                       |      0|     508|    0|       1|        1|t          |f          |r      |       3|        0|          0|       0|       0|      0|f         |f          |
407953|issue_issue_id_key          |      0|     508|  403|       2|        1|f          |f          |i      |       1|        0|          0|       0|       0|      0|f         |f          |
(25 rows)

I've appended a dump of this database to the end of this e-mail, if anyone else wants to give it a shot.
(And finally, a note of apology. I think my last e-mail to the list was posted three times; I got two BOUNCE messages back from the list, and so I resent the message each time. I think the "From" header in my e-mails was wrong, but apparently it wasn't wrong enough to actually stop the message from making its way onto the list; it was just wrong enough to generate a BOUNCE.)

Thanks,
Charlie

>What's really going on here is that when the optimizer *knows* how small
>your test tables are, it deliberately chooses nested-loop as being the
>fastest thing. If it doesn't know, it makes some default assumptions
>about the sizes of the tables, and with those default sizes it decides
>that merge-join will be cheaper.
>
>So the next question is why apparently similar database situations yield
>different states of optimizer knowledge. The answer is that creating
>an index before inserting tuples and creating it afterwards have
>different side effects on the optimizer's statistics.
>
>I've only spent about ten minutes looking into this, so my understanding
>is no doubt incomplete, but what I've found out is:
> 1. There are a couple of statistical fields in the system pg_class
>    table, namely relpages and reltuples (number of disk pages and
>    tuples in each relation).
> 2. There are per-attribute statistics kept in the pg_statistic table.
> 3. The pg_class statistics fields are updated by a VACUUM (with or
>    without ANALYZE) *and also by CREATE INDEX*. Possibly by other
>    things too ... but plain INSERTs and so forth don't update 'em.
> 4. The pg_statistic info only seems to be updated by VACUUM ANALYZE.
>
>So if you do
>    CREATE TABLE
>    CREATE INDEX
>    insert tuples
>then the state of play is that the optimizer thinks the table is empty.
>(Or, perhaps, it makes its default assumption --- pg_class doesn't seem
>to have any special representation for "size of table unknown" as
>opposed to "size of table is zero", so maybe the optimizer treats
>reltuples = 0 as meaning it should use a default table size. I haven't
>looked at that part of the code to find out.)
>
>But if you do
>    CREATE TABLE
>    insert tuples
>    CREATE INDEX
>the state of play is that the optimizer thinks there are as many tuples
>in the table as there were when you created the index. This explains
>the varying behavior in your detailed test case.
>
>I'll bet that if you investigate the contents of pg_class and
>pg_statistic in the two databases you were originally working with,
>you'll find that they are different. But a VACUUM ANALYZE should
>bring everything into sync, assuming that the actual data in the
>databases is the same.
>
>At any rate, if you still think there's something flaky going on,
>please have a look at what's happening in these statistical fields;
>that'll give us more info about whether there's really a bug or not.
>
>Also, if the optimizer still wants to use nested loop when it knows
>there are a *lot* of tuples to be processed, there might be a bug in
>its equations for the cost of these operations --- how large are your
>tables?
>
>			regards, tom lane
>

CREATE SEQUENCE "article_article_id_seq" start 10 increment 1 maxvalue 2147483647 minvalue 1 cache 1;
CREATE SEQUENCE "section_section_id_seq" start 7 increment 1 maxvalue 2147483647 minvalue 1 cache 1;
CREATE SEQUENCE "locale_locale_id_seq" start 3 increment 1 maxvalue 2147483647 minvalue 1 cache 1;
CREATE SEQUENCE "article_source_source_id_seq" start 4 increment 1 maxvalue 2147483647 minvalue 1 cache 1;
CREATE SEQUENCE "volume_volume_id_seq" start 2 increment 1 maxvalue 2147483647 minvalue 1 cache 1;
CREATE SEQUENCE "issue_issue_id_seq" start 3 increment 1 maxvalue 2147483647 minvalue 1 cache 1;
CREATE TABLE "article" (
        "article_id" int4 DEFAULT nextval ( 'article_article_id_seq' ) NOT NULL,
        "section_id" int4 NOT NULL,
        "locale_id" int4 NOT NULL,
        "article_source_id" int4,
        "volume_id" int4,
        "issue_id" int4,
        "print_page_no" int4,
        "print_publ_date" date,
        "site_publ_date" date NOT NULL,
        "site_publ_time" datetime NOT NULL,
        "inputter_id" int4 NOT NULL,
        "input_date" datetime DEFAULT text 'now',
        "published" int4 DEFAULT 0);
CREATE TABLE "article_text" (
        "article_id" int4 NOT NULL,
        "headline" character varying,
        "subhead" character varying);
CREATE TABLE "section" (
        "section_id" int4 DEFAULT nextval ( 'section_section_id_seq' ) NOT NULL,
        "section_name" character varying NOT NULL);
CREATE TABLE "locale" (
        "locale_id" int4 DEFAULT nextval ( 'locale_locale_id_seq' ) NOT NULL,
        "locale_name" character varying NOT NULL);
CREATE TABLE "article_source" (
        "source_id" int4 DEFAULT nextval ( 'article_source_source_id_seq' ) NOT NULL,
        "source_name" character varying NOT NULL);
CREATE TABLE "volume" (
        "volume_id" int4 DEFAULT nextval ( 'volume_volume_id_seq' ) NOT NULL,
        "volume_name" character varying);
CREATE TABLE "issue" (
        "issue_id" int4 DEFAULT nextval ( 'issue_issue_id_seq' ) NOT NULL,
        "issue_name" character varying,
        "issue_date" date NOT NULL);
INSERT INTO "article" ("article_id","section_id","locale_id","article_source_id","volume_id","issue_id","print_page_no","print_publ_date","site_publ_date","site_publ_time","inputter_id","input_date","published")
  values (10,3,1,4,2,3,4,'04-05-2006','01-28-1999','Thu Jan 28 19:28:40 1999 PST',100,'Thu Jan 28 19:28:40 1999 PST',0);
INSERT INTO "article" ("article_id","section_id","locale_id","article_source_id","volume_id","issue_id","print_page_no","print_publ_date","site_publ_date","site_publ_time","inputter_id","input_date","published")
  values (11,3,1,4,2,3,4,'04-05-2006','01-28-1999','Thu Jan 28 19:28:40 1999 PST',100,'Thu Jan 28 19:28:40 1999 PST',0);
INSERT INTO "article_text" ("article_id","headline","subhead")
  values (10,'Mayor Signs Contract With Company','Legally binding document said to be four pages long');
INSERT INTO "article_text" ("article_id","headline","subhead")
  values (11,'Mayor Cancels Contract','Company Promises to Sue Over Scuttled Deal');
INSERT INTO "section" ("section_id","section_name") values (3,'News');
INSERT INTO "locale" ("locale_id","locale_name") values (1,'Downtown');
INSERT INTO "article_source" ("source_id","source_name") values (4,'The Times');
INSERT INTO "volume" ("volume_id","volume_name") values (2,'2');
INSERT INTO "issue" ("issue_id","issue_name","issue_date") values (3,'25','04-05-2006');
CREATE UNIQUE INDEX "article_article_id_key" on "article" using btree ( "article_id" "int4_ops" );
CREATE INDEX "article_vol_ix" on "article" using btree ( "volume_id" "int4_ops" );
CREATE INDEX "article_source_ix" on "article" using btree ( "article_source_id" "int4_ops" );
CREATE INDEX "article_issue_ix" on "article" using btree ( "issue_id" "int4_ops" );
CREATE INDEX "article_locale_ix" on "article" using btree ( "locale_id" "int4_ops" );
CREATE INDEX "article_section_ix" on "article" using btree ( "section_id" "int4_ops" );
CREATE INDEX "article_text_ix" on "article_text" using btree ( "article_id" "int4_ops" );
CREATE UNIQUE INDEX "section_section_id_key" on "section" using btree ( "section_id" "int4_ops" );
CREATE UNIQUE INDEX "locale_locale_id_key" on "locale" using btree ( "locale_id" "int4_ops" );
CREATE UNIQUE INDEX "article_source_source_id_key" on "article_source" using btree ( "source_id" "int4_ops" );
CREATE UNIQUE INDEX "volume_volume_id_key" on "volume" using btree ( "volume_id" "int4_ops" );
CREATE UNIQUE INDEX "issue_issue_id_key" on "issue" using btree ( "issue_id" "int4_ops" );
See the SET options of psql.

	test=> show geqo\g
	NOTICE:  GEQO is ON beginning with 8 relations
	SHOW VARIABLE
	test=> \q

We turn on geqo at 8 relations. Try:

	SET GEQO TO 4

and try the query again. Let us know.

> At 04:07 PM 1/30/99 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >First, you're assuming that a merge-join plan is necessarily better than
> >a nested-loop plan. That should be true for large queries, but it is
> >*not* necessarily true for small tables --- when there are only a few
> >tuples in the tables being scanned, a simple nested loop wins because it
> >has much less startup overhead. (Or at least that's what our optimizer
> >thinks; I have not tried to measure this for myself.)
>
> OK, I understand that I don't understand whether merge-join plans are
> necessarily better than nested-loop plans, and that it could make sense to
> pick one or the other depending on the size of the tables and the number of
> rows in them. Also, your explanation of how 'vacuum analyze' updates the
> statistics in pg_class and pg_statistic makes it very clear why I'm seeing
> one query plan in one DB, and a different plan in the other. Thanks for the
> quick lesson, and my apologies for making it happen on the hackers list.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
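Putting the suggestion together with the query from earlier in the thread, the whole thing in one session looks like the sketch below (nothing here beyond Bruce's SET and the seven-way join already posted; note that SET only affects the current backend, so it has to be issued in the same connection that runs the join):

  SET GEQO TO 4;
  EXPLAIN SELECT a.article_id, a.print_publ_date, b.section_name, c.source_name,
         d.headline, e.issue_name, f.locale_name, g.volume_name
  FROM article a, section b, article_source c, article_text d, issue e, locale f, volume g
  WHERE a.article_id = d.article_id
    AND a.section_id = b.section_id
    AND a.article_source_id = c.source_id
    AND a.issue_id = e.issue_id
    AND a.locale_id = f.locale_id
    AND a.volume_id = g.volume_id;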
>We turn on geqo at 8 relations. Try:
>
>	SET GEQO TO 4
>
>and try the query again. Let us know.

Well isn't that something! Thanks so much for your help! I set the GEQO variable to 4 and now the 11.5 minute query executes in 6 seconds with this query plan:

Hash Join  (cost=21.99 size=152 width=124)
  ->  Hash Join  (cost=17.48 size=38 width=108)
        ->  Hash Join  (cost=13.48 size=16 width=92)
              ->  Hash Join  (cost=10.09 size=8 width=76)
                    ->  Hash Join  (cost=6.66 size=7 width=60)
                          ->  Nested Loop  (cost=3.26 size=6 width=44)
                                ->  Seq Scan on volume g  (cost=1.07 size=2 width=16)
                                ->  Seq Scan on article a  (cost=1.10 size=3 width=28)
                          ->  Hash  (cost=0.00 size=0 width=0)
                                ->  Seq Scan on article_text d  (cost=1.10 size=3 width=16)
                    ->  Hash  (cost=0.00 size=0 width=0)
                          ->  Seq Scan on locale f  (cost=1.10 size=3 width=16)
              ->  Hash  (cost=0.00 size=0 width=0)
                    ->  Seq Scan on issue e  (cost=1.07 size=2 width=16)
        ->  Hash  (cost=0.00 size=0 width=0)
              ->  Seq Scan on section b  (cost=1.23 size=7 width=16)
  ->  Hash  (cost=0.00 size=0 width=0)
        ->  Seq Scan on article_source c  (cost=1.13 size=4 width=16)

Are there any recommendations about what value *ought* to be set for GEQO? It seems to me like 'ON=8' is pretty high -- for us, it meant that UNLESS we explicitly set that variable for every JOIN query of 6-7 tables, the joins were going to bog down to a total crawl, while sending memory and CPU consumption through the roof (roughly 22MB and 90-95%, respectively, for the entire query-processing period).

What we've done is change the default setting in /src/include/optimizer/internals.h and recompiled. (It's the very last line in that file.) Maybe it'd be nice to add that as a command-line option to postmaster?

Also, we couldn't find the GEQO README, which was mentioned several times in comments in the source code but doesn't appear to have made its way into the distribution tarball. (AFAIK, we don't have a copy anywhere beneath /usr/local/pgsql/.) Maybe it got overlooked when the tarball was balled up?

Thanks again. If you'd like me to submit any more information about this "problem", please let me know.

Charlie

At 10:12 PM 1/30/99 -0500, Bruce Momjian wrote:
>See the SET options of psql.
>
>test=> show geqo\g
>NOTICE:  GEQO is ON beginning with 8 relations
>SHOW VARIABLE
>test=> \q
>
>
>
>> At 04:07 PM 1/30/99 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >First, you're assuming that a merge-join plan is necessarily better than
>> >a nested-loop plan. That should be true for large queries, but it is
>> >*not* necessarily true for small tables --- when there are only a few
>> >tuples in the tables being scanned, a simple nested loop wins because it
>> >has much less startup overhead. (Or at least that's what our optimizer
>> >thinks; I have not tried to measure this for myself.)
>>
>> OK, I understand that I don't understand whether merge-join plans are
>> necessarily better than nested-loop plans, and that it could make sense to
>> pick one or the other depending on the size of the tables and the number of
>> rows in them. Also, your explanation of how 'vacuum analyze' updates the
>> statistics in pg_class and pg_statistic makes it very clear why I'm seeing
>> one query plan in one DB, and a different plan in the other. Thanks for the
>> quick lesson, and my apologies for making it happen on the hackers list.
>
>
>--
>  Bruce Momjian                        |  http://www.op.net/~candle
>  maillist@candle.pha.pa.us            |  (610) 853-3000
>  +  If your life is a hard drive,     |  830 Blythe Avenue
>  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
>
> >We turn on geqo at 8 relations. Try:
> >	SET GEQO TO 4
> >and try the query again. Let us know.
>
> Well isn't that something! Thanks so much for your help!
>
> I set the GEQO variable to 4 and now the 11.5 minute query
> executes in 6 seconds with this query plan:
>
> Hash Join  (cost=21.99 size=152 width=124)
>
> Are there any recommendations about what value *ought* to be
> set for GEQO? It seems to me like 'ON=8' is pretty high -- for
> us, it meant that UNLESS we explicitly set that variable for
> every JOIN query of 6-7 tables, the joins were going to bog down
> to a total crawl, while sending memory and CPU consumption
> through the roof (roughly 22MB and 90-95%, respectively, for
> the entire query-processing period).

We have toyed with lowering it to 6. I think someone said that was too low, but obviously, some queries need a value of 6 or lower. I don't understand when that is the case, though. Any comments from anyone?

I believe the GEQO README was moved into the documentation in docs. It used to be there, but is no longer.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
>> Well isn't that something! Thanks so much for your help!
>> I set the GEQO variable to 4 and now the 11.5 minute query
>> executes in 6 seconds with this query plan:

Maybe this was already obvious to everyone else, but I just now got it through my head that Charlie was not complaining about the time to *execute* his seven-way join, but about the time to *plan* it. (In other words, EXPLAIN was taking about as long as actually doing the SELECT, right?)

The problem here is explained in the Postgres "Developer's Guide", in the chapter titled "Genetic Query Optimization in Database Systems". The regular planner/optimizer does a nearly exhaustive search through all possible ways to execute the query --- and the number of possible ways grows exponentially in the number of joins. So you need a heuristic shortcut when there are lots of tables. It may not find the best possible execution plan, but it should find a pretty good one in not too much time. That's what the GEQO code does.

>> Are there any recommendations about what value *ought* to be
>> set for GEQO? It seems to me like 'ON=8' is pretty high

Bruce Momjian <maillist@candle.pha.pa.us> writes:
> We have toyed with lowering it to 6. I think someone said that was too
> low, but obviously, some queries need a value of 6 or lower. I don't
> understand when that is the case, though. Any comments from anyone?

The docs say that the number being used is just the number of base tables mentioned in the query. I wonder whether that is the right number to look at. In particular, the number of cases that the exhaustive searcher would have to look at is not only dependent on the number of underlying tables, but also on how many indexes there are to consider using.

Perhaps if we develop a slightly more complex measure to compare against the GEQO threshold, we can do better in deciding when to kick in GEQO. A brain-dead suggestion is number of tables + number of indexes ... does anyone know whether that is a good measure of the number of cases considered in the optimizer?

			regards, tom lane
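For a rough sense of the scale being described here (back-of-the-envelope numbers, not measurements of the 6.4 planner): even restricted to left-deep trees, n base relations admit n! join orders, so the seven-table query above already gives 7! = 5040 orderings to cost out, before the choice of sequential scan versus each usable index is multiplied in at every base relation. Since 7 is still under the default GEQO threshold of 8, it was that exhaustive search, not the executor, that was consuming the 11.5 minutes.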
> Perhaps if we develop a slightly more complex measure to compare against
> the GEQO threshold, we can do better in deciding when to kick in GEQO.
> A brain-dead suggestion is number of tables + number of indexes ...
> does anyone know whether that is a good measure of the number of cases
> considered in the optimizer?

That is an excellent suggestion. We have always seen 6-table joins that are short with geqo, and some longer. Perhaps that is the cause.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026