RE: [HACKERS] mdnblocks is an amazing time sink in huge relations - Mailing list pgsql-hackers
From: Hiroshi Inoue
Subject: RE: [HACKERS] mdnblocks is an amazing time sink in huge relations
Date:
Msg-id: 000201bf19cd$a5771c60$2801007e@cadzone.tpf.co.jp
In response to: Re: [HACKERS] mdnblocks is an amazing time sink in huge relations (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: [HACKERS] mdnblocks is an amazing time sink in huge relations
List: pgsql-hackers
> > "Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > >> ... I don't see anything fundamentally > >> wrong with using the existence and size of a data file as useful > >> information. It's not a substitute for a lock, of course, and there > >> may be places where we need cross-backend interlocks that we haven't > >> got now. > > > We have to lseek() each time to know the number of blocks of a table > > file. Isn't it a overhead ? > > True, but lseek is pretty cheap as kernel calls go (the kernel just has > to consult the file's inode, which should be in memory already). We're > not going to get the info for free; any other way of keeping track of > it is going to have its own costs. Vadim's been muttering about using > a shared cache for system catalog tuples, which might be a win but I'm > not sure (I'm worried about contention for the cache, especially if it's > protected by just one or a few spinlocks). Anyway, if we did have one > then keeping an accurate block count in the relation's pg_class row > would be a practical alternative. > Seems it's related to a TODO. * Shared catalog cache, reduce lseek()'s by caching table size in shared area But there would be a problem if we use shared catalog cache. Being updated system tuples are only visible to an updating backend and other backends should see committed tuples. On the other hand,an accurate block count should be visible to all backends. Which tuple of a row should we load to catalog cache and update ? > >> That's bothered me too. A possible answer would be to do the unlinking > >> back-to-front (zap the last file first); that'd require a few > more lines > >> of code in md.c, but a crash midway through would then leave a legal > >> file configuration that another backend could still do something with. > > > Oops,it's more serious than I have thought. > > mdunlink() may only truncates a table file by a crash while unlinking > > back-to-front. > > A crash while unlinking front-to-back may leave unlinked segments > > and they would suddenly appear as segments of the recreated table. > > Seems there's no easy fix. > > Well, it seems to me that the first misbehavior (incomplete delete becomes > a partial truncate, and you can try again) is a lot better than the > second (incomplete delete leaves an undeletable, unrecreatable table). > Should I go ahead and make delete/truncate work back-to-front, or do you > see a reason why that'd be a bad thing to do? > I also think back-to-front is better. > > Probably it would require a lot of work to fix correctly. > > Postponing real unlink/truncating until commit and creating table > > files which correspond to their oids ..... etc ... > > It's same as "DROP TABLE inside transations" requires. > > Hmm,is it worth the work ? > > I'm not eager to do that either (not just because of the low return on > work invested, but because naming table files by OIDs would be a big > handicap for debugging and database admin work). However, I'm not > quite sure why you see it as a solution to the problem of recovering > from a failed md.c operation. The risks would be the same if the > deletion was happening during commit, no? And I'd be a lot *more* > worried about deleting the wrong file during manual cleanup if the > files were named by OID ;-) > We don't have to delete relation files even after commit. Backend would never see them and access to the files corresponding to new oids of (being) recreated relations. Deletion is necessary only not to consume disk space. For example vacuum could remove not deleted files. 
Regards,

Hiroshi Inoue
Inoue@tpf.co.jp