
From: Tom Lane
Subject: Re: [HACKERS] mdnblocks is an amazing time sink in huge relations
Msg-id: 25228.940259455@sss.pgh.pa.us
In response to: RE: [HACKERS] mdnblocks is an amazing time sink in huge relations  ("Hiroshi Inoue" <Inoue@tpf.co.jp>)
Responses: RE: [HACKERS] mdnblocks is an amazing time sink in huge relations  ("Hiroshi Inoue" <Inoue@tpf.co.jp>)
List: pgsql-hackers

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
>> ...  I don't see anything fundamentally
>> wrong with using the existence and size of a data file as useful
>> information.  It's not a substitute for a lock, of course, and there
>> may be places where we need cross-backend interlocks that we haven't
>> got now.

> We have to lseek() each time to know the number of blocks in a table
> file.  Isn't that an overhead?

True, but lseek is pretty cheap as kernel calls go (the kernel just has
to consult the file's inode, which should be in memory already).  We're
not going to get the info for free; any other way of keeping track of
it is going to have its own costs.  Vadim's been muttering about using
a shared cache for system catalog tuples, which might be a win but I'm
not sure (I'm worried about contention for the cache, especially if it's
protected by just one or a few spinlocks).  Anyway, if we did have one
then keeping an accurate block count in the relation's pg_class row
would be a practical alternative.
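
For illustration, the lseek() approach amounts to something like this
(just a sketch, not the actual md.c code; file_nblocks is a made-up
name, and I'm assuming the usual 8K BLCKSZ):

    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    /* Ask the kernel for the file's length and convert it to blocks.
     * lseek(SEEK_END) only consults the inode; no data is read. */
    static long
    file_nblocks(int fd)
    {
        off_t   len = lseek(fd, 0, SEEK_END);

        if (len < 0)
            return -1;          /* let the caller report the error */
        return (long) (len / BLCKSZ);
    }

That's one inode lookup per call, which is hard to beat without caching
the value somewhere else.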

>> That's bothered me too.  A possible answer would be to do the unlinking
>> back-to-front (zap the last file first); that'd require a few more lines
>> of code in md.c, but a crash midway through would then leave a legal
>> file configuration that another backend could still do something with.

> Oops, it's more serious than I thought.
> With back-to-front unlinking, a crash during mdunlink() merely leaves
> the table file truncated.  But a crash while unlinking front-to-back
> may leave orphaned segments behind, and they would suddenly reappear
> as segments of a recreated table.
> Seems there's no easy fix.

Well, it seems to me that the first misbehavior (incomplete delete becomes
a partial truncate, and you can try again) is a lot better than the
second (incomplete delete leaves an undeletable, unrecreatable table).
Should I go ahead and make delete/truncate work back-to-front, or do you
see a reason why that'd be a bad thing to do?
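
The loop I have in mind would look roughly like this (a sketch under
made-up names, not a patch; the real md.c bookkeeping is more involved):

    #include <stdio.h>
    #include <unistd.h>

    /* Remove segment files "path.N-1" ... "path.1", then "path" itself.
     * Deleting the highest-numbered segment first means a crash in the
     * middle leaves a shorter but still-legal file configuration. */
    static int
    unlink_segments_back_to_front(const char *path, int nsegments)
    {
        char    segpath[1024];
        int     segno;

        for (segno = nsegments - 1; segno > 0; segno--)
        {
            snprintf(segpath, sizeof(segpath), "%s.%d", path, segno);
            if (unlink(segpath) < 0)
                return -1;      /* lower segments are all still intact */
        }
        return unlink(path);    /* base segment goes last */
    }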

>>>> 3. In the cygwin port, mdunlink()/mdtruncate() may leave segments
>>>> of 0 length.
>> 
>> I don't understand what causes this.  Can you explain?

> You call FileUnlink() after FileTruncate() to unlink in md.c.  If
> FileUnlink() fails, segments of 0 length remain.  But it seems not
> critical to this issue.

Ah, I see.  It's not specific to cygwin then.  I think you are right
that it is not critical, at least not if we change to do the operations
back-to-front.  Truncating and then failing to delete will still leave
a valid file configuration.
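
To make that concrete, here's an illustrative per-segment helper
(made-up names again, not the real File* routines):

    #include <unistd.h>

    /* Shrink the segment to zero length, then remove it.  If unlink()
     * fails after ftruncate() succeeds, a 0-length file is left behind. */
    static int
    drop_segment(int fd, const char *segpath)
    {
        if (ftruncate(fd, 0) < 0)
            return -1;
        close(fd);
        return unlink(segpath);
    }

Even if unlink() fails after the truncate succeeds, going back-to-front
means every lower-numbered segment is untouched, so a retry just sees an
empty last segment.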

> Probably it would require a lot of work to fix correctly: postponing
> the real unlink/truncate until commit, creating table files that
> correspond to their OIDs ... etc.  It's the same work that "DROP
> TABLE inside transactions" requires.  Hmm, is it worth the work?

I'm not eager to do that either (not just because of the low return on
work invested, but because naming table files by OIDs would be a big
handicap for debugging and database admin work).  However, I'm not
quite sure why you see it as a solution to the problem of recovering
from a failed md.c operation.  The risks would be the same if the
deletion was happening during commit, no?  And I'd be a lot *more*
worried about deleting the wrong file during manual cleanup if the
files were named by OID ;-)
        regards, tom lane

