Thread: Optimizer fed bad data about some system-table indexes

Optimizer fed bad data about some system-table indexes

From
Tom Lane
Date:
Last night I was looking into optimizer misbehavior on the sample query

explain select * from pg_class, pg_description
where pg_class.oid = pg_description.objoid;

As of yesterday the system was generating

Hash Join  (cost=86.59 rows=1007 width=101)
  ->  Seq Scan on pg_description  (cost=41.23 rows=1007 width=16)
  ->  Hash  (cost=0.00 rows=0 width=0)
        ->  Index Scan using pg_class_oid_index on pg_class  (cost=5.57 rows=138 width=85)

which was pretty stupid; why use an index scan to load the hashtable?
The reason was that the optimizer was actually estimating the index scan
to be cheaper than a sequential scan (cost of sequential scan was
figured at 6.55).  When I poked into this, I found that costsize.c
was being fed a size of zero for pg_class_oid_index, and was generating
a bogus cost for the index scan because of it.

I changed costsize.c to ensure that cost_index with a selectivity of 1
will always return a larger value than cost_seqscan does with the same
relation-size stats, regardless of what it's told about the index size.
This fixes the immediate problem, but it's still bad that costsize is
getting a bogus index size value; the cost estimates won't be very
accurate.  And considering that there are reasonable stats for 
pg_class_oid_index in pg_class, you'd sort of expect those numbers to
get passed to the optimizer.
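[Editorial note: the guard described above can be sketched as below. The cost formulas, constants, and function signatures here are invented for illustration; they are not the actual costsize.c code.]

```c
#include <assert.h>

/* Toy sequential-scan cost: page reads plus per-tuple CPU. */
static double
cost_seqscan(double relpages, double reltuples)
{
    return relpages + reltuples * 0.01;
}

/* Toy index-scan cost, with the guard described in the mail above. */
static double
cost_index(double relpages, double reltuples,
           double indexpages, double selectivity)
{
    double cost = indexpages * selectivity           /* index page reads  */
                + relpages * selectivity             /* heap page fetches */
                + reltuples * selectivity * 0.02;    /* per-tuple CPU     */

    /* Guard: at selectivity 1, an index scan must always cost more than
     * a sequential scan of the same relation, even if indexpages is
     * bogus (e.g. the zero fed in from the stale relcache init file). */
    if (selectivity >= 1.0)
    {
        double seq = cost_seqscan(relpages, reltuples);

        if (cost <= seq)
            cost = seq + 1.0;
    }
    return cost;
}
```

With this guard, a zero index size can no longer make a full index scan look cheaper than reading the heap directly.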

As near as I can tell, the bogus data is the fault of the relation
cache.  Info about pg_class_oid_index and a couple of other indexes on
system relations is preloaded into the relcache and locked there on
startup --- and it is *not* coming from pg_class, but from an
initialization file that evidently was made when these system tables
were empty.

Bottom line is that optimization estimates that involve these critical
system indexes will be wrong.  That's not a show-stopper, but it seems
to me that it must be costing us performance somewhere along the line.
I'd like to see if it can be fixed.

Does anyone understand:

(a) why does the relcache need an initialization file for the system
index cache entries in the first place?  If I'm reading the code
correctly, it is able to build the initialization file from the info
in pg_class, so one would think it'd be better to just do that during
every startup and forget the initialization file.

(b) if we can't just get rid of the init file, how about dropping and
rebuilding it at the end of the initdb process (after template1 has
been vacuumed)?  Then at least it'd show a size of a few hundred for
pg_class, instead of zero.
        regards, tom lane


Re: [HACKERS] Optimizer fed bad data about some system-table indexes

From
Bruce Momjian
Date:
> As near as I can tell, the bogus data is the fault of the relation
> cache.  Info about pg_class_oid_index and a couple of other indexes on
> system relations is preloaded into the relcache and locked there on
> startup --- and it is *not* coming from pg_class, but from an
> initialization file that evidently was made when these system tables
> were empty.
> 
> Bottom line is that optimization estimates that involve these critical
> system indexes will be wrong.  That's not a show-stopper, but it seems
> to me that it must be costing us performance somewhere along the line.
> I'd like to see if it can be fixed.
> 
> Does anyone understand:
> 
> (a) why does the relcache need an initialization file for the system
> index cache entries in the first place?  If I'm reading the code
> correctly, it is able to build the initialization file from the info
> in pg_class, so one would think it'd be better to just do that during
> every startup and forget the initialization file.

The problem is circular, too.  Without those entries in the cache, the
system can't do the lookups of the real tables.

> (b) if we can't just get rid of the init file, how about dropping and
> rebuilding it at the end of the initdb process (after template1 has
> been vacuumed)?  Then at least it'd show a size of a few hundred for
> pg_class, instead of zero.

You can't drop them or you could never recreate them.  Why does the
vacuum analyze at the end of initdb not fix this?  Is this because the
cache bypasses pg_class and returns the hardcoded rows?

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] Optimizer fed bad data about some system-table indexes

From
Tom Lane
Date:
Bruce Momjian <maillist@candle.pha.pa.us> writes:
>> (a) why does the relcache need an initialization file for the system
>> index cache entries in the first place?

> The problem is circular, too.  Without those entries in the cache, the
> system can't do the lookups of the real tables.

But the init file is built on-the-fly the first time it is needed;
so it seems it can't be as circular as all that.  If we *really* needed
hardcoded data then it would have to be done more like the way the
standard entries in pg_class and other sys tables are made.  I think.

>> (b) if we can't just get rid of the init file, how about dropping and
>> rebuilding it at the end of the initdb process (after template1 has
>> been vacuumed)?  Then at least it'd show a size of a few hundred for
>> pg_class, instead of zero.

> You can't drop them or you could never recreate them.  Why does the
> vacuum analyze at the end of initdb not fix this?  Is this because the
> cache bypasses pg_class and returns the hardcoded rows?

The vacuum analyze *does* fix the data that's in the pg_class entry
for the index.  Trouble is that the relcache entry for the index is
never read from pg_class; it's loaded from this never-updated init file.

One possible answer is to rewrite the init file as the final step of
a vacuum, using the just-updated pg_class data.  But I'm still not
convinced that we really need the init file at all...
        regards, tom lane


Re: [HACKERS] Optimizer fed bad data about some system-table indexes

From
Bruce Momjian
Date:
> The vacuum analyze *does* fix the data that's in the pg_class entry
> for the index.  Trouble is that the relcache entry for the index is
> never read from pg_class; it's loaded from this never-updated init file.
> 
> One possible answer is to rewrite the init file as the final step of
> a vacuum, using the just-updated pg_class data.  But I'm still not
> convinced that we really need the init file at all...

Can you point me to that file?



Re: [HACKERS] Optimizer fed bad data about some system-table indexes

From
Tom Lane
Date:
Bruce Momjian <maillist@candle.pha.pa.us> writes:
> Can you point me to that file?

The code that does this stuff is init_irels() and write_irels() at the
end of backend/utils/cache/relcache.c.  The init file itself is
"pg_internal.init" in the database directory (looks like there is one
for each database...).
        regards, tom lane


Re: [HACKERS] Optimizer fed bad data about some system-table indexes

From
Bruce Momjian
Date:
> Bruce Momjian <maillist@candle.pha.pa.us> writes:
> > Can you point me to that file?
> 
> The code that does this stuff is init_irels() and write_irels() at the
> end of backend/utils/cache/relcache.c.  The init file itself is
> "pg_internal.init" in the database directory (looks like there is one
> for each database...).
> 
>             regards, tom lane
> 

If you delete the file at the end of initdb, is it recreated with the
proper values?



Re: [HACKERS] Optimizer fed bad data about some system-table indexes

From
Tom Lane
Date:
Bruce Momjian <maillist@candle.pha.pa.us> writes:
> If you delete the file at the end of initdb, is it recreated with the
> proper values?

OK, let's try it ...

Sure enough, if I delete the file and then start a new backend,
it's rebuilt.  Not only that, it's rebuilt with the *correct* index-
size values read from pg_class!  And cost_index then gets that data
passed to it.

So this code actually is able to go out and read the database, it just
doesn't want to ;-)

I'd say this whole mechanism is unnecessary; we should just build
the data on-the-fly the way it's done in write_irels(), and eliminate
all the file reading and writing code in init_irels and write_irels.
The only thing it could possibly be doing for us is saving some backend
startup time, but I'm not able to measure any difference when I delete
the init file.
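[Editorial note: the rebuild-on-missing behavior observed here can be sketched as below. The struct, file format, and function names are invented stand-ins, not the real relcache code.]

```c
#include <stdio.h>

/* Minimal stand-in for the cached statistics of one system index. */
typedef struct
{
    double relpages;
    double reltuples;
} IndexStats;

/* Stand-in for reading the index's real statistics out of pg_class. */
static IndexStats
read_stats_from_pg_class(void)
{
    IndexStats s = { 10.0, 1400.0 };

    return s;
}

/* Return cached stats from the init file if it exists; otherwise
 * rebuild them from the catalog and write the file for the next
 * backend, which is the behavior observed when the file is deleted. */
static IndexStats
load_irel_stats(const char *initfile)
{
    IndexStats s;
    FILE *f = fopen(initfile, "rb");

    if (f != NULL)
    {
        size_t n = fread(&s, sizeof s, 1, f);

        fclose(f);
        if (n == 1)
            return s;                 /* fast path: use the cached copy */
    }

    s = read_stats_from_pg_class();   /* slow path: rebuild from catalog */
    f = fopen(initfile, "wb");
    if (f != NULL)
    {
        fwrite(&s, sizeof s, 1, f);
        fclose(f);
    }
    return s;
}
```

The key point is that the slow path exists and works, so the file is purely a startup-time cache; deleting it costs one rebuild and nothing else.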

I'll work on that tomorrow, unless I hear squawks of outrage from
someone who remembers what this code was for.
        regards, tom lane


Re: [HACKERS] Optimizer fed bad data about some system-table indexes

From
Bruce Momjian
Date:
> Bruce Momjian <maillist@candle.pha.pa.us> writes:
> > If you delete the file at the end of initdb, is it recreated with the
> > proper values?
> 
> OK, let's try it ...
> 
> Sure enough, if I delete the file and then start a new backend,
> it's rebuilt.  Not only that, it's rebuilt with the *correct* index-
> size values read from pg_class!  And cost_index then gets that data
> passed to it.
> 
> So this code actually is able to go out and read the database, it just
> doesn't want to ;-)
> 
> I'd say this whole mechanism is unnecessary; we should just build
> the data on-the-fly the way it's done in write_irels(), and eliminate
> all the file reading and writing code in init_irels and write_irels.
> The only thing it could possibly be doing for us is saving some backend
> startup time, but I'm not able to measure any difference when I delete
> the init file.
> 
> I'll work on that tomorrow, unless I hear squawks of outrage from
> someone who remembers what this code was for.

Hmm.  If you can get it to work without the file, great, or you could
just delete the file when vacuum is performed, so the next backend
recreates the file.  That would work too.



Re: [HACKERS] Optimizer fed bad data about some system-table indexes

From
Bruce Momjian
Date:
> > I'd say this whole mechanism is unnecessary; we should just build
> > the data on-the-fly the way it's done in write_irels(), and eliminate
> > all the file reading and writing code in init_irels and write_irels.
> > The only thing it could possibly be doing for us is saving some backend
> > startup time, but I'm not able to measure any difference when I delete
> > the init file.
> > 
> > I'll work on that tomorrow, unless I hear squawks of outrage from
> > someone who remembers what this code was for.
> 
> Hmm.  If you can get it to work without the file, great, or you could
> just delete the file when vacuum is performed, so the next backend
> recreates the file.  That would work too.

One other cache thing I want to do is enable index lookups on cache
failure for tables like pg_operator, which didn't use index lookups in
the cache because we didn't have multi-key system indexes.

If you are looking for a single row, does anyone know when it is faster
to do a sequential scan, and when it is faster to do an index lookup
into the table?



Re: [HACKERS] Optimizer fed bad data about some system-table indexes

From
Tom Lane
Date:
Bruce Momjian <maillist@candle.pha.pa.us> writes:
> Hmm.  If you can get it to work without the file, great, or you could
> just delete the file when vacuum is performed, so the next backend
> recreates the file.  That would work too.

That's a good idea.  I made a test database with a couple thousand
tables in it, and found that when pg_class gets that big it does take
a measurable amount of time to rebuild the index info if the relcache
init file is not there.  (Looked like about a third of a second on my
machine.)  Since backend startup time is a hotbutton item for some
folks, I'm not going to take out the init file code.  I'll just make
VACUUM remove the file, and then the first backend start after a VACUUM
will rebuild the file with up-to-date statistics for the system indexes.
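[Editorial note: the planned VACUUM-side change can be sketched as below; this is illustrative, not the actual vacuum code.]

```c
#include <stdio.h>

/* After VACUUM has updated the pg_class statistics, unlink the relcache
 * init file so the next backend start rebuilds it with fresh data. */
static void
invalidate_init_file(const char *initfile)
{
    /* A missing file is fine: it may already have been removed. */
    remove(initfile);
}
```

Because the next backend rebuilds the file on demand, removal is safe to do at any point after the statistics are written, and repeating it is harmless.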
        regards, tom lane


Re: [HACKERS] Optimizer fed bad data about some system-table indexes

From
Bruce Momjian
Date:
> Bruce Momjian <maillist@candle.pha.pa.us> writes:
> > Hmm.  If you can get it to work without the file, great, or you could
> > just delete the file when vacuum is performed, so the next backend
> > recreates the file.  That would work too.
> 
> That's a good idea.  I made a test database with a couple thousand
> tables in it, and found that when pg_class gets that big it does take
> a measurable amount of time to rebuild the index info if the relcache
> init file is not there.  (Looked like about a third of a second on my
> machine.)  Since backend startup time is a hotbutton item for some
> folks, I'm not going to take out the init file code.  I'll just make
> VACUUM remove the file, and then the first backend start after a VACUUM
> will rebuild the file with up-to-date statistics for the system indexes.

That sounds like a big win.  1/3 second is large.  If they vacuum a
single table, and it is not a system table, can the removal be skipped?
