Thread: Optimizer fed bad data about some system-table indexes
Last night I was looking into optimizer misbehavior on the sample query

    explain select * from pg_class, pg_description
    where pg_class.oid = pg_description.objoid;

As of yesterday the system was generating

    Hash Join  (cost=86.59 rows=1007 width=101)
      ->  Seq Scan on pg_description  (cost=41.23 rows=1007 width=16)
      ->  Hash  (cost=0.00 rows=0 width=0)
            ->  Index Scan using pg_class_oid_index on pg_class
                  (cost=5.57 rows=138 width=85)

which was pretty stupid; why use an index scan to load the hashtable? The reason was that the optimizer was actually estimating the index scan to be cheaper than a sequential scan (the cost of the sequential scan was figured at 6.55).

When I poked into this, I found that costsize.c was being fed a size of zero for pg_class_oid_index, and was generating a bogus cost for the index scan because of it. I changed costsize.c to ensure that cost_index with a selectivity of 1 will always return a larger value than cost_seqscan does with the same relation-size stats, regardless of what it's told about the index size.

This fixes the immediate problem, but it's still bad that costsize is getting a bogus index size value; the cost estimates won't be very accurate. And considering that there are reasonable stats for pg_class_oid_index in pg_class, you'd sort of expect those numbers to get passed to the optimizer.

As near as I can tell, the bogus data is the fault of the relation cache. Info about pg_class_oid_index and a couple of other indexes on system relations is preloaded into the relcache and locked there on startup --- and it is *not* coming from pg_class, but from an initialization file that evidently was made when these system tables were empty.

Bottom line is that optimization estimates that involve these critical system indexes will be wrong. That's not a show-stopper, but it seems to me that it must be costing us performance somewhere along the line. I'd like to see if it can be fixed.
Does anyone understand:

(a) Why does the relcache need an initialization file for the system index cache entries in the first place? If I'm reading the code correctly, it is able to build the initialization file from the info in pg_class, so one would think it'd be better to just do that during every startup and forget the initialization file.

(b) If we can't just get rid of the init file, how about dropping and rebuilding it at the end of the initdb process (after template1 has been vacuumed)? Then at least it'd show a size of a few hundred for pg_class, instead of zero.

regards, tom lane
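The clamp Tom describes can be sketched like this. This is a Python toy with invented cost formulas and parameter names (the real logic lives in the backend's costsize.c and uses much richer models); it only illustrates the guard: at selectivity 1, an index scan must never be costed below a sequential scan of the same relation, however bogus the cached index size is.

```python
def cost_seqscan(rel_pages, rel_tuples, cpu_tuple_cost=0.01):
    # Hypothetical model: one read per heap page plus a small CPU
    # charge per tuple examined.
    return rel_pages + cpu_tuple_cost * rel_tuples

def cost_index(index_pages, rel_pages, rel_tuples, selectivity,
               cpu_tuple_cost=0.01):
    # Hypothetical model: read the selected fraction of the index and
    # of the heap.
    tuples_fetched = selectivity * rel_tuples
    cost = (index_pages * selectivity
            + rel_pages * selectivity
            + cpu_tuple_cost * tuples_fetched)
    # The guard: with selectivity 1, never come in under the seqscan
    # estimate, no matter what the (possibly bogus, e.g. zero)
    # index_pages figure says.
    if selectivity >= 1.0:
        floor = cost_seqscan(rel_pages, rel_tuples, cpu_tuple_cost)
        cost = max(cost, 1.01 * floor)
    return cost
```

With a bogus index size of zero pages and selectivity 1, the unclamped estimate would undercut the seqscan; the clamp forces the sensible ordering while leaving genuinely selective index scans cheap.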
> As near as I can tell, the bogus data is the fault of the relation
> cache. Info about pg_class_oid_index and a couple of other indexes on
> system relations is preloaded into the relcache and locked there on
> startup --- and it is *not* coming from pg_class, but from an
> initialization file that evidently was made when these system tables
> were empty.
>
> Bottom line is that optimization estimates that involve these critical
> system indexes will be wrong. That's not a show-stopper, but it seems
> to me that it must be costing us performance somewhere along the line.
> I'd like to see if it can be fixed.
>
> Does anyone understand:
>
> (a) why does the relcache need an initialization file for the system
> index cache entries in the first place? If I'm reading the code
> correctly, it is able to build the initialization file from the info
> in pg_class, so one would think it'd be better to just do that during
> every startup and forget the initialization file.

The problem is circular too. Without those entries in the cache, the system can't do the lookups of the real tables.

> (b) if we can't just get rid of the init file, how about dropping and
> rebuilding it at the end of the initdb process (after template1 has
> been vacuumed)? Then at least it'd show a size of a few hundred for
> pg_class, instead of zero.

You can't drop them or you could never recreate them. Why does the vacuum analyze at the end of initdb not fix this? Is this because the cache bypasses pg_class and returns the hardcoded rows?

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <maillist@candle.pha.pa.us> writes:
>> (a) why does the relcache need an initialization file for the system
>> index cache entries in the first place?

> The problem is circular too. Without those entries in the cache, the
> system can't do the lookups of the real tables.

But the init file is built on-the-fly the first time it is needed; so it seems it can't be as circular as all that. If we *really* needed hardcoded data then it would have to be done more like the way the standard entries in pg_class and other sys tables are made. I think.

>> (b) if we can't just get rid of the init file, how about dropping and
>> rebuilding it at the end of the initdb process (after template1 has
>> been vacuumed)? Then at least it'd show a size of a few hundred for
>> pg_class, instead of zero.

> You can't drop them or you could never recreate them. Why does the
> vacuum analyze at the end of initdb not fix this? Is this because the
> cache bypasses pg_class and returns the hardcoded rows?

The vacuum analyze *does* fix the data that's in the pg_class entry for the index. Trouble is that the relcache entry for the index is never read from pg_class; it's loaded from this never-updated init file.

One possible answer is to rewrite the init file as the final step of a vacuum, using the just-updated pg_class data. But I'm still not convinced that we really need the init file at all...

regards, tom lane
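The circularity Bruce raises, and the escape hatch Tom points at, can be sketched as follows. The names (RelCache, PG_CLASS, the row shapes) are invented for illustration; the point is that the critical index descriptors are bootstrapped through a slower path that needs no index, after which normal index-assisted lookups can run.

```python
# Hypothetical in-memory stand-in for pg_class: (relname, relpages, reltuples).
PG_CLASS = [
    ("pg_class", 5, 138),
    ("pg_class_oid_index", 2, 138),
    ("pg_description", 10, 1007),
]

class RelCache:
    """Toy relcache. Normal lookups go "via an index", but the index's
    own descriptor has to come from somewhere first; the way out of the
    circle is a sequential-scan path used only to bootstrap the
    critical entries."""

    CRITICAL = {"pg_class_oid_index"}

    def __init__(self):
        self.cache = {}
        # Bootstrap: load the critical index descriptors without using
        # any index at all.
        for name in self.CRITICAL:
            self.cache[name] = self._seqscan_lookup(name)

    def _seqscan_lookup(self, name):
        # Needs no index: just walk the whole "catalog".
        for row in PG_CLASS:
            if row[0] == name:
                return row
        raise KeyError(name)

    def _index_lookup(self, name):
        # Stand-in for an index scan; it can only run once the index's
        # own descriptor is cached.
        assert "pg_class_oid_index" in self.cache
        return self._seqscan_lookup(name)  # same result, faster path in reality

    def lookup(self, name):
        if name not in self.cache:
            self.cache[name] = self._index_lookup(name)
        return self.cache[name]
```

Under this reading, an on-disk init file is an optimization of the bootstrap step, not a logical necessity, which matches Tom's observation that the file can be rebuilt on demand.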
> The vacuum analyze *does* fix the data that's in the pg_class entry
> for the index. Trouble is that the relcache entry for the index is
> never read from pg_class; it's loaded from this never-updated init file.
>
> One possible answer is to rewrite the init file as the final step of
> a vacuum, using the just-updated pg_class data. But I'm still not
> convinced that we really need the init file at all...

Can you point me to that file?
Bruce Momjian <maillist@candle.pha.pa.us> writes:
> Can you point me to that file?

The code that does this stuff is init_irels() and write_irels() at the end of backend/utils/cache/relcache.c. The init file itself is "pg_internal.init" in the database directory (looks like there is one for each database...).

regards, tom lane
> The code that does this stuff is init_irels() and write_irels() at the
> end of backend/utils/cache/relcache.c. The init file itself is
> "pg_internal.init" in the database directory (looks like there is one
> for each database...).

If you delete the file at the end of initdb, is it recreated with the proper values?
Bruce Momjian <maillist@candle.pha.pa.us> writes:
> If you delete the file at the end of initdb, is it recreated with the
> proper values?

OK, let's try it ...

Sure enough, if I delete the file and then start a new backend, it's rebuilt. Not only that, it's rebuilt with the *correct* index-size values read from pg_class! And cost_index then gets that data passed to it.

So this code actually is able to go out and read the database, it just doesn't want to ;-)

I'd say this whole mechanism is unnecessary; we should just build the data on-the-fly the way it's done in write_irels(), and eliminate all the file reading and writing code in init_irels and write_irels. The only thing it could possibly be doing for us is saving some backend startup time, but I'm not able to measure any difference when I delete the init file.

I'll work on that tomorrow, unless I hear squawks of outrage from someone who remembers what this code was for.

regards, tom lane
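The behavior Tom observed -- missing file means rebuild-from-catalog, present file means just read it -- is a classic load-or-rebuild cache pattern. A minimal sketch, with invented names (read_pg_class, load_relcache) and JSON standing in for the real binary file format of pg_internal.init:

```python
import json
import os
import tempfile

def read_pg_class():
    # Hypothetical stand-in for scanning pg_class for the system
    # indexes' stats (the write_irels() side of the real code).
    return {"pg_class_oid_index": {"relpages": 2, "reltuples": 138.0}}

def load_relcache(init_path):
    """If the init file is missing, rebuild the entries from the
    catalog and save them for next time; otherwise just read the file
    (the init_irels() side)."""
    if os.path.exists(init_path):
        with open(init_path) as f:
            return json.load(f)
    entries = read_pg_class()          # rebuild from the catalog ...
    with open(init_path, "w") as f:    # ... and cache for future startups
        json.dump(entries, f)
    return entries

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "pg_internal.init")
    first = load_relcache(path)    # no file: rebuilt from the "catalog"
    second = load_relcache(path)   # file exists: read straight back
    assert first == second
```

The failure mode in the thread falls out of this shape: once written, the file is never refreshed, so it keeps serving whatever the catalog looked like at write time.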
> I'd say this whole mechanism is unnecessary; we should just build
> the data on-the-fly the way it's done in write_irels(), and eliminate
> all the file reading and writing code in init_irels and write_irels.
> The only thing it could possibly be doing for us is saving some backend
> startup time, but I'm not able to measure any difference when I delete
> the init file.
>
> I'll work on that tomorrow, unless I hear squawks of outrage from
> someone who remembers what this code was for.

Hmm. If you can get it to work without the file, great, or you could just delete the file when vacuum is performed, so the next backend recreates the file. That would work too.
> > I'll work on that tomorrow, unless I hear squawks of outrage from
> > someone who remembers what this code was for.
>
> Hmm. If you can get it to work without the file, great, or you could
> just delete the file when vacuum is performed, so the next backend
> recreates the file. That would work too.

One other cache thing I want to do is enable index lookups on cache failure for tables like pg_operator, which didn't use index lookups in the cache because we didn't have multi-key system indexes. If you are looking for a single row, does anyone know when it is faster to do a sequential scan, and when it is faster to do an index lookup into the table?
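For Bruce's single-row question, the usual back-of-the-envelope answer is in page reads: a seqscan touches every heap page, while an index probe touches roughly the btree height plus one heap page, so the seqscan only wins for very small tables. A toy model (invented formulas and a made-up fan-out; real planner arithmetic also weighs random vs sequential I/O):

```python
import math

def seqscan_cost(rel_pages):
    # Fetching one row by seqscan still reads every heap page
    # (on average half, but the worst case is all of them).
    return max(rel_pages, 1)

def index_fetch_cost(rel_tuples, keys_per_page=200):
    # Descend a btree whose height grows with the log of the tuple
    # count (hypothetical fan-out of 200 keys/page), then read one
    # heap page for the matching row.
    height = max(1, math.ceil(math.log(max(rel_tuples, 2), keys_per_page)))
    return height + 1
```

Under this model a one-page catalog ties or beats the index probe, but anything beyond a handful of pages favors the index lookup, which is why caches over growing catalogs generally want the index path.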
Bruce Momjian <maillist@candle.pha.pa.us> writes:
> Hmm. If you can get it to work without the file, great, or you could
> just delete the file when vacuum is performed, so the next backend
> recreates the file. That would work too.

That's a good idea. I made a test database with a couple thousand tables in it, and found that when pg_class gets that big it does take a measurable amount of time to rebuild the index info if the relcache init file is not there. (Looked like about a third of a second on my machine.) Since backend startup time is a hotbutton item for some folks, I'm not going to take out the init file code. I'll just make VACUUM remove the file, and then the first backend start after a VACUUM will rebuild the file with up-to-date statistics for the system indexes.

regards, tom lane
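The plan Tom settles on is invalidation by deletion: the writer (VACUUM) refreshes the catalog stats and simply unlinks the cache file, so the next reader rebuilds it with fresh numbers. A sketch with invented names (STATS, load, vacuum) and JSON in place of the real file format:

```python
import json
import os
import tempfile

# Hypothetical catalog stats, as "VACUUM" would maintain them.
STATS = {"pg_class_oid_index": {"relpages": 0, "reltuples": 0.0}}

def load(init_path):
    # Rebuild the init file from the current catalog stats if it's
    # missing, then read it back.
    if not os.path.exists(init_path):
        with open(init_path, "w") as f:
            json.dump(STATS, f)
    with open(init_path) as f:
        return json.load(f)

def vacuum(init_path):
    # Refresh the catalog stats, then drop the init file so the next
    # "backend start" rebuilds it with up-to-date numbers.
    STATS["pg_class_oid_index"] = {"relpages": 2, "reltuples": 138.0}
    if os.path.exists(init_path):
        os.unlink(init_path)

with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "pg_internal.init")
    stale = load(p)    # init file made while the index looked empty
    vacuum(p)          # stats fixed, init file removed
    fresh = load(p)    # rebuilt with the new stats
    assert stale["pg_class_oid_index"]["reltuples"] == 0.0
    assert fresh["pg_class_oid_index"]["reltuples"] == 138.0
```

Deleting rather than rewriting keeps the writer simple and pushes the rebuild cost onto the first startup after a VACUUM, the trade-off Tom measures at roughly a third of a second on a large pg_class.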
> That's a good idea. I made a test database with a couple thousand
> tables in it, and found that when pg_class gets that big it does take
> a measurable amount of time to rebuild the index info if the relcache
> init file is not there. (Looked like about a third of a second on my
> machine.) Since backend startup time is a hotbutton item for some
> folks, I'm not going to take out the init file code. I'll just make
> VACUUM remove the file, and then the first backend start after a VACUUM
> will rebuild the file with up-to-date statistics for the system indexes.

That sounds like a big win. 1/3 second is large. If they vacuum a single table, and it is not a system table, can the removal be skipped?