Thread: Re: [GENERAL] nested loops in joins, ambiguous rewrite rules

Re: [GENERAL] nested loops in joins, ambiguous rewrite rules

From: Charles Hornberger
Date:
Hello.  I just sent this note to the pgsql-general list, but I thought it
probably deserved to land on the hackers list as well.  My apologies if I'm
wrong.

Charlie
-----------------------------------------

X-Sender: hornberger@tabloid.net (Unverified)
Date: Fri, 29 Jan 1999 18:00:11 -0800
To: pgsql-general@postgreSQL.org
From: Charles Hornberger <hornberger@tabloid.net>
Subject: Re: [GENERAL] nested loops in joins, ambiguous rewrite rules
Sender: owner-pgsql-general@postgreSQL.org

Hello again.  I've got a few more notes on this "strange" optimizer
behavior, which might be useful to anyone interested in this problem:

As I mentioned in my last posting, I have one DB called apx13 that is
producing expensive "nested loop" query plans when I try to join two
tables, and one called apx14 that's using "merge join" query plans.  I've
been trying to figure out why this is.  (This is all happening under 6.4.
We haven't upgraded to 6.4.2 yet but I don't see anything in the 6.4.2
HISTORY file that suggests that major changes have been made to the
optimizer, so I guess that probably wouldn't change anything?)

In any case, I dumped the entire contents of the apx14 database using `pg_dump
-D apx14 > apx14.out`.  Then I created a new database called apx15 and ran
`psql -f apx14.out apx15`.

Then I ran a join query against apx15:
 apx15=> \i ../query
 explain select a.article_id, b.article_id
   from article a, article_text b
   where a.article_id = b.article_id;
 NOTICE:  QUERY PLAN:

 Nested Loop  (cost=3.20 size=2 width=8)
   ->  Seq Scan on article a  (cost=1.07 size=2 width=4)
   ->  Seq Scan on article_text b  (cost=1.07 size=2 width=4)

 EXPLAIN

So I ran vacuum analyze and did it again.  And got the same query plan.

At that point, I decided to create yet another DB, apx16.  I just used the
schema from apx14, and the contents of the two tables being used in my join
query.

In other words:
`pg_dump -s apx14 > apx14.schema.out`
`pg_dump -aDt article apx14 > apx14.article.out`
`pg_dump -aDt article_text apx14 > apx14.article_text.out`
`psql -f apx14.schema.out apx16`
`psql -f apx14.article.out apx16`
`psql -f apx14.article_text.out apx16`


Then I ran the same join query against apx16, and this time the optimizer
decided to use a "merge join":
 apx16=> \i ../query
 explain select a.article_id, b.article_id
   from article a, article_text b
   where a.article_id = b.article_id;
 NOTICE:  QUERY PLAN:

 Merge Join  (cost=0.00 size=1 width=8)
   ->  Seq Scan  (cost=0.00 size=0 width=0)
         ->  Sort  (cost=0.00 size=0 width=0)
               ->  Seq Scan on article a  (cost=0.00 size=0 width=4)
   ->  Seq Scan  (cost=0.00 size=0 width=0)
         ->  Sort  (cost=0.00 size=0 width=0)
               ->  Seq Scan on article_text b  (cost=0.00 size=0 width=4)

 EXPLAIN

Maybe I'm missing something, but I can't understand why identical join
queries run against identical tables should produce different query plans.
Maybe there's something in the way my tables and indices are set up, but I
can't imagine what that is.

One final note: I just did another test, hoping to exclude other factors
from the equation.  I created two more databases using only the two
relevant tables (and sequences & indices) from apx16. What I discovered is
that if I build a database from "full" dump files (i.e., dump files that
include both table schema & data), the optimizer will always elect to use
"nested loops" for its query plans.  On the other hand, if I build the
database by first creating the tables and indices, and then doing the
inserts in a second operation, it uses the less costly "merge join".  As a
matter of fact, it appears that if you create an index on a table *after*
inserting data into that table, joins performed on that table will *always*
use these expensive nested loops.

So my theory is:

Since pg_dump creates dump files that perform INSERTs on tables before
performing any CREATE INDEX statements on tables, using raw dump files to
create databases will cause this "nested loop" behavior.
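
Schematically, a dump file produced this way is ordered like so (an
abridged sketch; the "..." stand for the full column and value lists
shown further below):

    CREATE TABLE "article" ( ... );
    CREATE TABLE "article_text" ( ... );
    INSERT INTO "article" ( ... ) values ( ... );
    INSERT INTO "article_text" ( ... ) values ( ... );
    -- the CREATE INDEX statements only come after all of the data:
    CREATE UNIQUE INDEX "article_article_id_key" on "article"
      using btree ( "article_id" "int4_ops" );
    CREATE  INDEX "article_text_ix" on "article_text"
      using btree ( "article_id" "int4_ops" );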

And there's another catch: If you create your tables and indices first,
then insert your data, the optimizer will continue to build "merge join"
query plans until you make the mistake of running a "vacuum analyze".  If
you run "vacuum analyze", the optimizer will then decide to begin using the
more costly nested loops to handle any join queries.  The only way to get
around this is to delete all records from the tables being joined,
re-insert them and remember NOT to run vacuum analyze.

If anyone wants to replicate this, I've documented my tests below.

(And, of course, if I've totally missed the point on anything in here,
please correct me.)

===============
Table structure
===============

apx16=> \d article
Table    = article
+----------------------------------+----------------------------------+--------+
|              Field               |              Type                | Length |
+----------------------------------+----------------------------------+--------+
| article_id                       | int4 not null default nextval (  |      4 |
| section_id                       | int4 not null                    |      4 |
| locale_id                        | int4 not null                    |      4 |
| article_source_id                | int4                             |      4 |
| volume_id                        | int4                             |      4 |
| issue_id                         | int4                             |      4 |
| print_page_no                    | int4                             |      4 |
| print_publ_date                  | date                             |      4 |
| site_publ_date                   | date not null                    |      4 |
| site_publ_time                   | datetime not null                |      8 |
| inputter_id                      | int4 not null                    |      4 |
| input_date                       | datetime default text 'now'      |      8 |
| published                        | int4 default 0                   |      4 |
+----------------------------------+----------------------------------+--------+
Indices:  article_article_id_key
          article_issue_ix
          article_locale_ix
          article_section_ix
          article_source_ix
          article_vol_ix

apx16=> \d article_article_id_key

Table    = article_article_id_key
+----------------------------------+----------------------------------+--------+
|              Field               |              Type                | Length |
+----------------------------------+----------------------------------+--------+
| article_id                       | int4                             |      4 |
+----------------------------------+----------------------------------+--------+

apx16=> \d article_text

Table    = article_text
+----------------------------------+----------------------------------+--------+
|              Field               |              Type                | Length |
+----------------------------------+----------------------------------+--------+
| article_id                       | int4 not null                    |      4 |
| headline                         | varchar()                        |      0 |
| subhead                          | varchar()                        |      0 |
+----------------------------------+----------------------------------+--------+
Index:    article_text_ix

apx16=> \d article_text_ix

Table    = article_text_ix
+----------------------------------+----------------------------------+--------+
|              Field               |              Type                | Length |
+----------------------------------+----------------------------------+--------+
| article_id                       | int4                             |      4 |
+----------------------------------+----------------------------------+--------+



=================
CREATE STATEMENTS
=================

CREATE SEQUENCE "article_article_id_seq" start 10 increment 1 maxvalue
2147483647 minvalue 1  cache 1 ;
SELECT nextval ('article_article_id_seq');
CREATE TABLE "article" ("article_id" "int4" DEFAULT nextval (
'article_article_id_seq' ) NOT NULL, "section_id" "int4" NOT NULL,
"locale_id" "int4" NOT NULL, "article_source_id" "int4", "volume_id"
"int4", "issue_id" "int4", "print_page_no" "int4", "print_publ_date"
"date", "site_publ_date" "date" NOT NULL, "site_publ_time" "datetime" NOT
NULL, "inputter_id" "int4" NOT NULL, "input_date" "datetime" DEFAULT text
'now', "published" "int4" DEFAULT 0);
CREATE UNIQUE INDEX "article_article_id_key" on "article" using btree (
"article_id" "int4_ops" );
CREATE  INDEX "article_vol_ix" on "article" using btree ( "volume_id"
"int4_ops" );
CREATE  INDEX "article_source_ix" on "article" using btree (
"article_source_id" "int4_ops" );
CREATE  INDEX "article_issue_ix" on "article" using btree ( "issue_id"
"int4_ops" );
CREATE  INDEX "article_locale_ix" on "article" using btree ( "locale_id"
"int4_ops" );
CREATE  INDEX "article_section_ix" on "article" using btree ( "section_id"
"int4_ops" );
CREATE TABLE "article_text" ("article_id" "int4" NOT NULL, "headline"
varchar, "subhead" varchar);
CREATE  INDEX "article_text_ix" on "article_text" using btree (
"article_id" "int4_ops" );


=================
INSERT STATEMENTS
=================

INSERT INTO "article"
("article_id","section_id","locale_id","article_source_id","volume_id","issu
e_id","print_page_no","print_publ_date","site_publ_date","site_publ_time","i
nputter_id","input_date","published") values
(10,3,1,4,2,3,4,'04-05-2006','01-28-1999','Thu Jan 28 19:28:40 1999
PST',100,'Thu Jan 28 19:28:40 1999 PST',0);
INSERT INTO "article"
("article_id","section_id","locale_id","article_source_id","volume_id","issu
e_id","print_page_no","print_publ_date","site_publ_date","site_publ_time","i
nputter_id","input_date","published") values
(11,3,1,4,2,3,4,'04-05-2006','01-28-1999','Thu Jan 28 19:28:40 1999
PST',100,'Thu Jan 28 19:28:40 1999 PST',0);
INSERT INTO "article_text" ("article_id","headline","subhead") values
(10,'Mayor Signs Contract With Company','Legally binding document said to
be four pages long');
INSERT INTO "article_text" ("article_id","headline","subhead") values
(11,'Mayor Cancels Contract','Company Promises to Sue Over Scuttled Deal');

==================
Steps to replicate
==================

Vacuum analyze 'problem'
------------------------
1. Create a fresh database and execute the CREATE statements above
2. Execute the INSERT statements above
3. Execute the following query:

   EXPLAIN SELECT a.article_id, b.article_id
     FROM article a, article_text b
     WHERE a.article_id = b.article_id;
 
4. You should see:
   NOTICE:  QUERY PLAN:
   Merge Join  (cost=0.00 size=1 width=8)
     ->  Seq Scan  (cost=0.00 size=0 width=0)
           ->  Sort  (cost=0.00 size=0 width=0)
                 ->  Seq Scan on article a  (cost=0.00 size=0 width=4)
     ->  Seq Scan  (cost=0.00 size=0 width=0)
           ->  Sort  (cost=0.00 size=0 width=0)
                 ->  Seq Scan on article_text b  (cost=0.00 size=0 width=4)

   EXPLAIN

5. Do "vacuum analyze".
6. Execute the query again.
7. Now you'll see:
   NOTICE:  QUERY PLAN:
   Nested Loop  (cost=3.20 size=2 width=8)
     ->  Seq Scan on article a  (cost=1.07 size=2 width=4)
     ->  Seq Scan on article_text b  (cost=1.07 size=2 width=4)

   EXPLAIN


pg_dump 'problem'
-----------------

1. Dump the DB you created above to a file with `pg_dump -D dbname > foo`

2. Create a new DB

3. Do `psql -f foo newdb`

4. Execute the join query:

   EXPLAIN SELECT a.article_id, b.article_id
     FROM article a, article_text b
     WHERE a.article_id = b.article_id;

5. You should see:
   NOTICE:  QUERY PLAN:
   Nested Loop  (cost=3.20 size=2 width=8)
     ->  Seq Scan on article a  (cost=1.07 size=2 width=4)
     ->  Seq Scan on article_text b  (cost=1.07 size=2 width=4)

   EXPLAIN




Re: [HACKERS] nested loops in joins, ambiguous rewrite rules

From: Tom Lane
Date:
Charles Hornberger <charlie@k4azl.net> writes:
> Hello again.  I've got a few more notes on this "strange" optimizer
> behavior, which might be useful to anyone interested in this problem:

Well, I understand at least part of the "problem" here.

First, you're assuming that a merge-join plan is necessarily better than
a nested-loop plan.  That should be true for large queries, but it is
*not* necessarily true for small tables --- when there are only a few
tuples in the tables being scanned, a simple nested loop wins because it
has much less startup overhead.  (Or at least that's what our optimizer
thinks; I have not tried to measure this for myself.)

What's really going on here is that when the optimizer *knows* how small
your test tables are, it deliberately chooses nested-loop as being the
fastest thing.  If it doesn't know, it makes some default assumptions
about the sizes of the tables, and with those default sizes it decides
that merge-join will be cheaper.

So the next question is why apparently similar database situations yield
different states of optimizer knowledge.  The answer is that creating
an index before inserting tuples and creating it afterwards have
different side effects on the optimizer's statistics.

I've only spent about ten minutes looking into this, so my understanding
is no doubt incomplete, but what I've found out is:
  1. There are a couple of statistical fields in the system pg_class
     table, namely relpages and reltuples (number of disk pages and
     tuples in each relation).
  2. There are per-attribute statistics kept in the pg_statistic table.
  3. The pg_class statistics fields are updated by a VACUUM (with or
     without ANALYZE) *and also by CREATE INDEX*.  Possibly by other
     things too ... but plain INSERTs and so forth don't update 'em.
  4. The pg_statistics info only seems to be updated by VACUUM ANALYZE.
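
Something like the following shows the relevant state (a sketch against
the 6.4-era catalogs, using the same columns named above):

    -- table-level stats, maintained by VACUUM and CREATE INDEX:
    SELECT relname, relpages, reltuples
      FROM pg_class
     WHERE relname IN ('article', 'article_text');

    -- per-attribute stats, maintained only by VACUUM ANALYZE:
    SELECT starelid, staattnum, stalokey, stahikey
      FROM pg_statistic
     WHERE starelid = (SELECT oid FROM pg_class
                       WHERE relname = 'article');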
 

So if you do
    CREATE TABLE
    CREATE INDEX
    insert tuples
then the state of play is that the optimizer thinks the table is empty.
(Or, perhaps, it makes its default assumption --- pg_class doesn't seem
to have any special representation for "size of table unknown" as
opposed to "size of table is zero", so maybe the optimizer treats
reltuples = 0 as meaning it should use a default table size.  I haven't
looked at that part of the code to find out.)

But if you do
    CREATE TABLE
    insert tuples
    CREATE INDEX
the state of play is that the optimizer thinks there are as many tuples
in the table as there were when you created the index.  This explains
the varying behavior in your detailed test case.
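
A minimal way to see the difference for yourself (a hypothetical
session; "t" and "t_ix" are made-up names):

    -- index created before the data: reltuples stays 0
    CREATE TABLE t (id int4);
    CREATE INDEX t_ix ON t USING btree ( "id" "int4_ops" );
    INSERT INTO t VALUES (1);
    INSERT INTO t VALUES (2);
    SELECT relpages, reltuples FROM pg_class WHERE relname = 't';

    -- index created after the data: CREATE INDEX records the 2 tuples
    DROP TABLE t;
    CREATE TABLE t (id int4);
    INSERT INTO t VALUES (1);
    INSERT INTO t VALUES (2);
    CREATE INDEX t_ix ON t USING btree ( "id" "int4_ops" );
    SELECT relpages, reltuples FROM pg_class WHERE relname = 't';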

I'll bet that if you investigate the contents of pg_class and
pg_statistic in the two databases you were originally working with,
you'll find that they are different.  But a VACUUM ANALYZE should
bring everything into sync, assuming that the actual data in the
databases is the same.

At any rate, if you still think there's something flaky going on,
please have a look at what's happening in these statistical fields;
that'll give us more info about whether there's really a bug or not.

Also, if the optimizer still wants to use nested loop when it knows
there are a *lot* of tuples to be processed, there might be a bug in
its equations for the cost of these operations --- how large are your
tables?
        regards, tom lane


Re: [HACKERS] nested loops in joins, ambiguous rewrite rules

From: Charles Hornberger
Date:
At 04:07 PM 1/30/99 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>First, you're assuming that a merge-join plan is necessarily better than
>a nested-loop plan.  That should be true for large queries, but it is
>*not* necessarily true for small tables --- when there are only a few
>tuples in the tables being scanned, a simple nested loop wins because it
>has much less startup overhead.  (Or at least that's what our optimizer
>thinks; I have not tried to measure this for myself.)

OK, I understand that I don't understand whether merge-join plans are
necessarily better than nested-loop plans, and that it could make sense to
pick one or the other depending on the size of the tables and the number of
rows in them.  Also, your explanation of how 'vacuum analyze' updates the
statistics in pg_class and pg_statistic makes it very clear why I'm seeing
one query plan in one DB, and different plan in the other.  Thanks for the
quick lesson, and my apologies for making it happen on the hackers list.

But let's leave that aside for the moment and address the underlying
question:  Why does it take ~ 11.5 minutes to process the following query?
It performs a simple ANDed join on seven tables.  All of the columns named
in the WHERE clauses are indexed, and all of the tables except article and
article_text contain just one single row; article and article_text contain
two rows.

EXPLAIN SELECT a.article_id, a.print_publ_date, b.section_name,
               c.source_name, d.headline, e.issue_name,
               f.locale_name, g.volume_name
         FROM article a, section b, article_source c, article_text d,
              issue e, locale f, volume g
        WHERE a.article_id = d.article_id
          AND a.section_id = b.section_id
          AND a.article_source_id = c.source_id
          AND a.issue_id = e.issue_id
          AND a.locale_id = f.locale_id
          AND a.volume_id = g.volume_id ;
 

Let me explain each of the WHERE clauses:
 1) a.article_id has a unique index, d.article_id has a non-unique index
 2) a.section_id has a non-unique index, b.section_id has a unique index
 3) a.article_source_id has a non-unique index, c.source_id has a unique index
 4) a.issue_id has a non-unique index, e.issue_id has a unique index
 5) a.locale_id has a non-unique index, f.locale_id has a unique index
 6) a.volume_id has a non-unique index, g.volume_id has a unique index


This is being done under Postgres 6.4.2 (we upgraded from 6.4 last night)
on a Pentium 150 with 192MB of RAM running Linux 2.0.35 

Here's the query plan it generated:

NOTICE:  QUERY PLAN:

Nested Loop  (cost=12.49 size=2 width=124)
  ->  Nested Loop  (cost=10.43 size=2 width=108)
        ->  Nested Loop  (cost=8.36 size=2 width=92)
              ->  Nested Loop  (cost=6.30 size=2 width=76)
                    ->  Nested Loop  (cost=4.16 size=2 width=60)
                          ->  Nested Loop  (cost=2.10 size=2 width=44)
                                ->  Seq Scan on locale f  (cost=1.03 size=1 width=16)
                                ->  Seq Scan on article a  (cost=1.07 size=2 width=28)
                          ->  Seq Scan on issue e  (cost=1.03 size=1 width=16)
                    ->  Seq Scan on article_text d  (cost=1.07 size=2 width=16)
              ->  Seq Scan on article_source c  (cost=1.03 size=1 width=16)
        ->  Seq Scan on section b  (cost=1.03 size=1 width=16)
  ->  Seq Scan on volume g  (cost=1.03 size=1 width=16)


>At any rate, if you still think there's something flaky going on,
>please have a look at what's happening in these statistical fields;
>that'll give us more info about whether there's really a bug or not.
>
>Also, if the optimizer still wants to use nested loop when it knows
>there are a *lot* of tuples to be processed, there might be a bug in
>its equations for the cost of these operations --- how large are your
>tables?
>

The tables are tiny.

Here are some statistics from the DB.

apx00=> select * from pg_statistic;
starelid|staattnum|staop|stalokey                                  |stahikey
--------+---------+-----+------------------------------------------+----------------------------------------------------
  407828|        4|    0|4                                         |4
  407828|        5|    0|2                                         |2
  407828|        6|    0|3                                         |3
  407828|        7|    0|4                                         |4
  407828|        8|    0|04-05-2006                                |04-05-2006
  407828|        9|    0|01-28-1999                                |01-28-1999
  407828|       10|    0|Thu Jan 28 19:28:40 1999 PST              |Thu Jan 28 19:28:40 1999 PST
  407828|       11|    0|100                                       |100
  407828|       12|    0|Thu Jan 28 19:28:40 1999 PST              |Thu Jan 28 19:28:40 1999 PST
  407828|       13|    0|0                                         |0
  407828|        1|    0|10                                        |11
  407828|        2|    0|3                                         |3
  407828|        3|    0|1                                         |1
  407852|        1|    0|10                                        |11
  407852|        2|    0|Mayor Cancels Contract                    |Mayor Signs Contract With Company
  407852|        3|    0|Company Promises to Sue Over Scuttled Deal|Legally binding document said to be four pages long
  407863|        1|    0|3                                         |3
  407863|        2|    0|News                                      |News
  407874|        1|    0|1                                         |1
  407874|        2|    0|Downtown                                  |Downtown
  407885|        1|    0|4                                         |4
  407885|        2|    0|The Times                                 |The Times
  407896|        1|    0|2                                         |2
  407896|        2|    0|2                                         |2
  407907|        1|    0|3                                         |3
  407907|        2|    0|25                                        |25
  407907|        3|    0|04-05-2006                                |04-05-2006
(27 rows)

apx00=> select oid,* from pg_class where relowner <> 505;
   oid|relname                     |reltype|relowner|relam|relpages|reltuples|relhasindex|relisshared|relkind|relnatts|relchecks|reltriggers|relukeys|relfkeys|relrefs|relhaspkey|relhasrules|relacl
------+----------------------------+-------+--------+-----+--------+---------+-----------+-----------+-------+--------+---------+-----------+--------+--------+-------+----------+-----------+------
407923|article_vol_ix              |      0|     508|  403|       2|        2|f          |f          |i      |       1|        0|          0|       0|       0|      0|f         |f          |
407926|article_source_ix           |      0|     508|  403|       2|        2|f          |f          |i      |       1|        0|          0|       0|       0|      0|f         |f          |
407929|article_issue_ix            |      0|     508|  403|       2|        2|f          |f          |i      |       1|        0|          0|       0|       0|      0|f         |f          |
407932|article_locale_ix           |      0|     508|  403|       2|        2|f          |f          |i      |       1|        0|          0|       0|       0|      0|f         |f          |
407726|article_article_id_seq      |      0|     508|    0|       0|        0|f          |f          |S      |       8|        0|          0|       0|       0|      0|f         |f          |
407743|section_section_id_seq      |      0|     508|    0|       0|        0|f          |f          |S      |       8|        0|          0|       0|       0|      0|f         |f          |
407760|locale_locale_id_seq        |      0|     508|    0|       0|        0|f          |f          |S      |       8|        0|          0|       0|       0|      0|f         |f          |
407777|article_source_source_id_seq|      0|     508|    0|       0|        0|f          |f          |S      |       8|        0|          0|       0|       0|      0|f         |f          |
407794|volume_volume_id_seq        |      0|     508|    0|       0|        0|f          |f          |S      |       8|        0|          0|       0|       0|      0|f         |f          |
407811|issue_issue_id_seq          |      0|     508|    0|       0|        0|f          |f          |S      |       8|        0|          0|       0|       0|      0|f         |f          |
407828|article                     |      0|     508|    0|       1|        2|t          |f          |r      |      13|        0|          0|       0|       0|      0|f         |f          |
407935|article_section_ix          |      0|     508|  403|       2|        2|f          |f          |i      |       1|        0|          0|       0|       0|      0|f         |f          |
407852|article_text                |      0|     508|    0|       1|        2|t          |f          |r      |       3|        0|          0|       0|       0|      0|f         |f          |
407938|article_text_ix             |      0|     508|  403|       2|        2|f          |f          |i      |       1|        0|          0|       0|       0|      0|f         |f          |
407863|section                     |      0|     508|    0|       1|        1|t          |f          |r      |       2|        0|          0|       0|       0|      0|f         |f          |
407941|section_section_id_key      |      0|     508|  403|       2|        1|f          |f          |i      |       1|        0|          0|       0|       0|      0|f         |f          |
407874|locale                      |      0|     508|    0|       1|        1|t          |f          |r      |       2|        0|          0|       0|       0|      0|f         |f          |
407944|locale_locale_id_key        |      0|     508|  403|       2|        1|f          |f          |i      |       1|        0|
407885|article_source              |      0|     508|    0|       1|        1|t          |f          |r      |       2|        0|          0|       0|       0|      0|f         |f          |
407920|article_article_id_key      |      0|     508|  403|       2|        2|f          |f          |i      |       1|        0|          0|       0|       0|      0|f         |f          |
407947|article_source_source_id_key|      0|     508|  403|       2|        1|f          |f          |i      |       1|        0|          0|       0|       0|      0|f         |f          |
407896|volume                      |      0|     508|    0|       1|        1|t          |f          |r      |       2|        0|          0|       0|       0|      0|f         |f          |
407950|volume_volume_id_key        |      0|     508|  403|       2|        1|f          |f          |i      |       1|        0|          0|       0|       0|      0|f         |f          |
407907|issue                       |      0|     508|    0|       1|        1|t          |f          |r      |       3|        0|          0|       0|       0|      0|f         |f          |
407953|issue_issue_id_key          |      0|     508|  403|       2|        1|f          |f          |i      |       1|        0|          0|       0|       0|      0|f         |f          |
(25 rows)


I've appended a dump of this database to the end of this e-mail, if anyone
else wants to give it a shot.

(And finally, a note of apology.  I think my last e-mail to the list was
posted three times; I got two BOUNCE messages back from the list, and so I
resent the message each time.  I think the "From" header in my e-mails was
wrong, but apparently it wasn't wrong enough to actually stop the message
from making its way onto the list; it was just wrong enough to generate a
BOUNCE.)

Thanks,
Charlie

>What's really going on here is that when the optimizer *knows* how small
>your test tables are, it deliberately chooses nested-loop as being the
>fastest thing.  If it doesn't know, it makes some default assumptions
>about the sizes of the tables, and with those default sizes it decides
>that merge-join will be cheaper.
>
>So the next question is why apparently similar database situations yield
>different states of optimizer knowledge.  The answer is that creating
>an index before inserting tuples and creating it afterwards have
>different side effects on the optimizer's statistics.
>
>I've only spent about ten minutes looking into this, so my understanding
>is no doubt incomplete, but what I've found out is:
>  1. There are a couple of statistical fields in the system pg_class
>     table, namely relpages and reltuples (number of disk pages and
>     tuples in each relation).
>  2. There are per-attribute statistics kept in the pg_statistic table.
>  3. The pg_class statistics fields are updated by a VACUUM (with or
>     without ANALYZE) *and also by CREATE INDEX*.  Possibly by other
>     things too ... but plain INSERTs and so forth don't update 'em.
>  4. The pg_statistics info only seems to be updated by VACUUM ANALYZE.
>
>So if you do
>    CREATE TABLE
>    CREATE INDEX
>    insert tuples
>then the state of play is that the optimizer thinks the table is empty.
>(Or, perhaps, it makes its default assumption --- pg_class doesn't seem
>to have any special representation for "size of table unknown" as
>opposed to "size of table is zero", so maybe the optimizer treats
>reltuples = 0 as meaning it should use a default table size.  I haven't
>looked at that part of the code to find out.)
>
>But if you do
>    CREATE TABLE
>    insert tuples
>    CREATE INDEX
>the state of play is that the optimizer thinks there are as many tuples
>in the table as there were when you created the index.  This explains
>the varying behavior in your detailed test case.
>
>I'll bet that if you investigate the contents of pg_class and
>pg_statistic in the two databases you were originally working with,
>you'll find that they are different.  But a VACUUM ANALYZE should
>bring everything into sync, assuming that the actual data in the
>databases is the same.
>
>At any rate, if you still think there's something flaky going on,
>please have a look at what's happening in these statistical fields;
>that'll give us more info about whether there's really a bug or not.
>
>Also, if the optimizer still wants to use nested loop when it knows
>there are a *lot* of tuples to be processed, there might be a bug in
>its equations for the cost of these operations --- how large are your
>tables?
>
>            regards, tom lane
>
>


CREATE SEQUENCE "article_article_id_seq" start 10 increment 1 maxvalue
2147483647 minvalue 1  cache 1 ;
CREATE SEQUENCE "section_section_id_seq" start 7 increment 1 maxvalue
2147483647 minvalue 1  cache 1 ;
CREATE SEQUENCE "locale_locale_id_seq" start 3 increment 1 maxvalue
2147483647 minvalue 1  cache 1 ;
CREATE SEQUENCE "article_source_source_id_seq" start 4 increment 1 maxvalue
2147483647 minvalue 1  cache 1 ;
CREATE SEQUENCE "volume_volume_id_seq" start 2 increment 1 maxvalue
2147483647 minvalue 1  cache 1 ;
CREATE SEQUENCE "issue_issue_id_seq" start 3 increment 1 maxvalue
2147483647 minvalue 1  cache 1 ;
CREATE TABLE "article" (       "article_id" int4 DEFAULT nextval ( 'article_article_id_seq' ) NOT
NULL,       "section_id" int4 NOT NULL,       "locale_id" int4 NOT NULL,       "article_source_id" int4,
"volume_id"int4,       "issue_id" int4,       "print_page_no" int4,       "print_publ_date" date,
"site_publ_date"date NOT NULL,       "site_publ_time" datetime NOT NULL,       "inputter_id" int4 NOT NULL,
"input_date"datetime DEFAULT text 'now',       "published" int4 DEFAULT 0);
 
CREATE TABLE "article_text" (       "article_id" int4 NOT NULL,       "headline" character varying,       "subhead"
charactervarying);
 
CREATE TABLE "section" (       "section_id" int4 DEFAULT nextval ( 'section_section_id_seq' ) NOT
NULL,       "section_name" character varying NOT NULL);
CREATE TABLE "locale" (       "locale_id" int4 DEFAULT nextval ( 'locale_locale_id_seq' ) NOT NULL,       "locale_name"
charactervarying NOT NULL);
 
CREATE TABLE "article_source" (       "source_id" int4 DEFAULT nextval ( 'article_source_source_id_seq' )
NOT NULL,       "source_name" character varying NOT NULL);
CREATE TABLE "volume" (       "volume_id" int4 DEFAULT nextval ( 'volume_volume_id_seq' ) NOT NULL,       "volume_name"
charactervarying);
 
CREATE TABLE "issue" (       "issue_id" int4 DEFAULT nextval ( 'issue_issue_id_seq' ) NOT NULL,       "issue_name"
charactervarying,       "issue_date" date NOT NULL);
 
INSERT INTO "article"
("article_id","section_id","locale_id","article_source_id","volume_id","issu
e_id","print_page_no","print_publ_date","site_publ_date","site_publ_time","i
nputter_id","input_date","published") values
(10,3,1,4,2,3,4,'04-05-2006','01-28-1999','Thu Jan 28 19:28:40 1999
PST',100,'Thu Jan 28 19:28:40 1999 PST',0);
INSERT INTO "article"
("article_id","section_id","locale_id","article_source_id","volume_id","issu
e_id","print_page_no","print_publ_date","site_publ_date","site_publ_time","i
nputter_id","input_date","published") values
(11,3,1,4,2,3,4,'04-05-2006','01-28-1999','Thu Jan 28 19:28:40 1999
PST',100,'Thu Jan 28 19:28:40 1999 PST',0);
INSERT INTO "article_text" ("article_id","headline","subhead") values
(10,'Mayor Signs Contract With Company','Legally binding document said to
be four pages long');
INSERT INTO "article_text" ("article_id","headline","subhead") values
(11,'Mayor Cancels Contract','Company Promises to Sue Over Scuttled Deal');
INSERT INTO "section" ("section_id","section_name") values (3,'News');
INSERT INTO "locale" ("locale_id","locale_name") values (1,'Downtown');
INSERT INTO "article_source" ("source_id","source_name") values (4,'The
Times');
INSERT INTO "volume" ("volume_id","volume_name") values (2,'2');
INSERT INTO "issue" ("issue_id","issue_name","issue_date") values
(3,'25','04-05-2006');
CREATE UNIQUE INDEX "article_article_id_key" on "article" using btree (
"article_id" "int4_ops" );
CREATE  INDEX "article_vol_ix" on "article" using btree ( "volume_id"
"int4_ops" );
CREATE  INDEX "article_source_ix" on "article" using btree (
"article_source_id" "int4_ops" );
CREATE  INDEX "article_issue_ix" on "article" using btree ( "issue_id"
"int4_ops" );
CREATE  INDEX "article_locale_ix" on "article" using btree ( "locale_id"
"int4_ops" );
CREATE  INDEX "article_section_ix" on "article" using btree ( "section_id"
"int4_ops" );
CREATE  INDEX "article_text_ix" on "article_text" using btree (
"article_id" "int4_ops" );
CREATE UNIQUE INDEX "section_section_id_key" on "section" using btree (
"section_id" "int4_ops" );
CREATE UNIQUE INDEX "locale_locale_id_key" on "locale" using btree (
"locale_id" "int4_ops" );
CREATE UNIQUE INDEX "article_source_source_id_key" on "article_source"
using btree ( "source_id" "int4_ops" );
CREATE UNIQUE INDEX "volume_volume_id_key" on "volume" using btree (
"volume_id" "int4_ops" );
CREATE UNIQUE INDEX "issue_issue_id_key" on "issue" using btree (
"issue_id" "int4_ops" );



Re: [HACKERS] nested loops in joins, ambiguous rewrite rules

From: Bruce Momjian
Date:
See the SET options of psql.

test=> show geqo\g
NOTICE:  GEQO is ON beginning with 8 relations
SHOW VARIABLE
test=> \q

We turn on geqo at 8 relations.  Try:

    SET GEQO TO 4

and try the query again.  Let us know.
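
In a fresh session the whole exchange should look something like this
(a sketch; the responses are what I'd expect from 6.4):

    test=> SET GEQO TO 4;
    SET VARIABLE
    test=> SHOW GEQO;
    NOTICE:  GEQO is ON beginning with 4 relations
    SHOW VARIABLE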


> At 04:07 PM 1/30/99 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >First, you're assuming that a merge-join plan is necessarily better than
> >a nested-loop plan.  That should be true for large queries, but it is
> >*not* necessarily true for small tables --- when there are only a few
> >tuples in the tables being scanned, a simple nested loop wins because it
> >has much less startup overhead.  (Or at least that's what our optimizer
> >thinks; I have not tried to measure this for myself.)
> 
> OK, I understand that I don't understand whether merge-join plans are
> necessarily better than nested-loop plans, and that it could make sense to
> pick one or the other depending on the size of the tables and the number of
> rows in them.  Also, your explanation of how 'vacuum analyze' updates the
> statistics in pg_class and pg_statistic makes it very clear why I'm seeing
> one query plan in one DB, and different plan in the other.  Thanks for the
> quick lesson, and my apologies for making it happen on the hackers list.


--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] nested loops in joins, ambiguous rewrite rules

From: Charles Hornberger
Date:
>We turn on geqo at 8 relations.  Try:
>
>    SET GEQO TO 4
>
>and try the query again.  Let us know.

Well isn't that something!  Thanks so much for your help!

I set the GEQO variable to 4 and now the 11.5 minute query executes in 6 seconds with this query plan:

Hash Join  (cost=21.99 size=152 width=124)
  ->  Hash Join  (cost=17.48 size=38 width=108)
        ->  Hash Join  (cost=13.48 size=16 width=92)
              ->  Hash Join  (cost=10.09 size=8 width=76)
                    ->  Hash Join  (cost=6.66 size=7 width=60)
                          ->  Nested Loop  (cost=3.26 size=6 width=44)
                                ->  Seq Scan on volume g  (cost=1.07 size=2 width=16)
                                ->  Seq Scan on article a  (cost=1.10 size=3 width=28)
                          ->  Hash  (cost=0.00 size=0 width=0)
                                ->  Seq Scan on article_text d  (cost=1.10 size=3 width=16)
                    ->  Hash  (cost=0.00 size=0 width=0)
                          ->  Seq Scan on locale f  (cost=1.10 size=3 width=16)
              ->  Hash  (cost=0.00 size=0 width=0)
                    ->  Seq Scan on issue e  (cost=1.07 size=2 width=16)
        ->  Hash  (cost=0.00 size=0 width=0)
              ->  Seq Scan on section b  (cost=1.23 size=7 width=16)
  ->  Hash  (cost=0.00 size=0 width=0)
        ->  Seq Scan on article_source c  (cost=1.13 size=4 width=16)


Are there any recommendations about what value *ought* to be set for
GEQO?  It seems to me like 'ON=8' is pretty high -- for us, it meant
that UNLESS we explicitly set that variable for every JOIN query of 6-7
tables, the joins were going to bog down to a total crawl, while sending
memory and CPU consumption through the roof (roughly 22MB and 90-95%,
respectively, for the entire query-processing period).

What we've done is change the default setting in
/src/include/optimizer/internals.h and recompiled.  (It's the very last
line in that file.)  Maybe it'd be nice to add that as a command-line
option to postmaster?

Also, we couldn't find the GEQO README, which was mentioned several
times in comments in the source code but doesn't appear to have made its
way into the distribution tarball.  (AFAIK, we don't have a copy
anywhere beneath /usr/local/pgsql/.)  Maybe it got overlooked when the
tarball was balled up?

Thanks again.  If you'd like me to submit any more information about this "problem", please let me know.

Charlie

At 10:12 PM 1/30/99 -0500, Bruce Momjian wrote:
>See the SET options of psql.
>
>test=> show geqo\g
>NOTICE:  GEQO is ON beginning with 8 relations
>SHOW VARIABLE
>test=> \q
>
>
>
>> At 04:07 PM 1/30/99 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >First, you're assuming that a merge-join plan is necessarily better than
>> >a nested-loop plan.  That should be true for large queries, but it is
>> >*not* necessarily true for small tables --- when there are only a few
>> >tuples in the tables being scanned, a simple nested loop wins because it
>> >has much less startup overhead.  (Or at least that's what our optimizer
>> >thinks; I have not tried to measure this for myself.)
>> 
>> OK, I understand that I don't understand whether merge-join plans are
>> necessarily better than nested-loop plans, and that it could make sense to
>> pick one or the other depending on the size of the tables and the number of
>> rows in them.  Also, your explanation of how 'vacuum analyze' updates the
>> statistics in pg_class and pg_statistic makes it very clear why I'm seeing
>> one query plan in one DB, and different plan in the other.  Thanks for the
>> quick lesson, and my apologies for making it happen on the hackers list.
>
>
>-- 
>  Bruce Momjian                        |  http://www.op.net/~candle
>  maillist@candle.pha.pa.us            |  (610) 853-3000
>  +  If your life is a hard drive,     |  830 Blythe Avenue
>  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
>
>


Re: [HACKERS] nested loops in joins, ambiguous rewrite rules

From: Bruce Momjian
Date:
> >We turn on geqo at 8 relations.  Try:
> >
> >    SET GEQO TO 4
> >
> >and try the query again.  Let us know.
> 
> Well isn't that something!  Thanks so much for your help!
> 
> I set the GEQO variable to 4 and now the 11.5 minute query
> executes in 6 seconds with this query plan:
> 
> Hash Join  (cost=21.99 size=152 width=124) ...
> 
> 
> Are there any recommendations about what value *ought* to be
> set for GEQO?  It seems to me like 'ON=8' is pretty high -- for
> us, it meant that UNLESS we explicity set that variable for
> every JOIN query of 6-7 tables, the joins were going to bog down
> to a total crawl, while sending memory and CPU consumption
> through the roof (roughly 22MB and 90-95%, respectively, for
> the entire query-processing period).

We have toyed with lowering it to 6.  I think someone said that was too
low, but obviously, some queries need a value of 6 or lower.  I don't
understand when that is the case, though.  Any comments from anyone?

I believe the GEQO README was moved into the documentation in the docs
directory.  It used to be in the source tree, but is no longer.


--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] nested loops in joins, ambiguous rewrite rules

From: Tom Lane
Date:
>> Well isn't that something!  Thanks so much for your help!
>> I set the GEQO variable to 4 and now the 11.5 minute query
>> executes in 6 seconds with this query plan:

Maybe this was already obvious to everyone else, but I just now got it
through my head that Charlie was not complaining about the time to
*execute* his seven-way join, but about the time to *plan* it.

(In other words, EXPLAIN was taking about as long as actually doing
the SELECT, right?)

The problem here is explained in the Postgres "Developer's Guide",
in the chapter titled "Genetic Query Optimization in Database Systems".
The regular planner/optimizer does a nearly exhaustive search through
all possible ways to execute the query --- and the number of possible
ways grows exponentially in the number of joins.
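
(To put a number on it: seven base tables already admit 7! = 5040
possible join orders, before you even count the choices of scan method
and join strategy at each step.)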

So you need a heuristic shortcut when there are lots of tables.  It may
not find the best possible execution plan, but it should find a pretty
good one in not too much time.  That's what the GEQO code does.

>> Are there any recommendations about what value *ought* to be
>> set for GEQO?  It seems to me like 'ON=8' is pretty high

Bruce Momjian <maillist@candle.pha.pa.us> writes:
> We have toyed with lowering it to 6.  I think someone said that was too
> low, but obviously, some queries need a value of 6 or lower.  I don't
> understand when that is the case, though.  Any comments from anyone?

The docs say that the number being used is just the number of base
tables mentioned in the query.  I wonder whether that is the right
number to look at.  In particular, the number of cases that the
exhaustive searcher would have to look at is not only dependent on the
number of underlying tables, but also on how many indexes there are to
consider using.

Perhaps if we develop a slightly more complex measure to compare against
the GEQO threshold, we can do better in deciding when to kick in GEQO.
A brain-dead suggestion is number of tables + number of indexes ...
does anyone know whether that is a good measure of the number of cases
considered in the optimizer?
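
(For what it's worth, Charlie's seven-way join touches 7 tables carrying
12 indexes among them, so it would score 19 under that measure, versus 7
under the current one.)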
        regards, tom lane


Re: [HACKERS] nested loops in joins, ambiguous rewrite rules

From: Bruce Momjian
Date:
> Perhaps if we develop a slightly more complex measure to compare against
> the GEQO threshold, we can do better in deciding when to kick in GEQO.
> A brain-dead suggestion is number of tables + number of indexes ...
> does anyone know whether that is a good measure of the number of cases
> considered in the optimizer?

That is an excellent suggestion.  We have always seen some 6-table joins
that plan quickly with geqo, and some that take longer.  Perhaps that is
the cause.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026