RE: [HACKERS] mdnblocks is an amazing time sink in huge relations - Mailing list pgsql-hackers

From: Hiroshi Inoue
Subject: RE: [HACKERS] mdnblocks is an amazing time sink in huge relations
Date:
Msg-id: 000201bf19cd$a5771c60$2801007e@cadzone.tpf.co.jp
In response to: Re: [HACKERS] mdnblocks is an amazing time sink in huge relations (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: [HACKERS] mdnblocks is an amazing time sink in huge relations (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
>
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> >> ...  I don't see anything fundamentally
> >> wrong with using the existence and size of a data file as useful
> >> information.  It's not a substitute for a lock, of course, and there
> >> may be places where we need cross-backend interlocks that we haven't
> >> got now.
>
> > We have to lseek() each time to know the number of blocks of a table
> > file.  Isn't it an overhead?
>
> True, but lseek is pretty cheap as kernel calls go (the kernel just has
> to consult the file's inode, which should be in memory already).  We're
> not going to get the info for free; any other way of keeping track of
> it is going to have its own costs.  Vadim's been muttering about using
> a shared cache for system catalog tuples, which might be a win but I'm
> not sure (I'm worried about contention for the cache, especially if it's
> protected by just one or a few spinlocks).  Anyway, if we did have one
> then keeping an accurate block count in the relation's pg_class row
> would be a practical alternative.
>

This seems related to a TODO item:
* Shared catalog cache, reduce lseek()'s by caching table size in shared area
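
That lseek() is roughly what md.c does today for each segment file.  A
simplified sketch (not the real code; BLCKSZ assumed to be 8192):

    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ 8192                         /* assumed default block size */

    static long
    segment_nblocks(int fd)
    {
        off_t   len = lseek(fd, 0, SEEK_END);   /* one kernel call per check */

        if (len < 0)
            return -1;                          /* let the caller handle errors */
        return (long) (len / BLCKSZ);           /* whole blocks in this segment */
    }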

But there would be a problem if we used a shared catalog cache.
System tuples that are being updated are visible only to the updating
backend, while other backends should see only the committed tuples.
On the other hand, an accurate block count should be visible to all
backends.
Which tuple version of a row should we load into the catalog cache and
update?
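
Just to illustrate the question, a purely hypothetical entry layout
(none of these names exist in the source) could keep the block count
outside the MVCC-versioned tuple data, protected by its own spinlock,
so every backend sees the same physical size:

    /* Purely hypothetical -- not an existing PostgreSQL structure.
     * Oid and slock_t are PostgreSQL's own types (c.h, storage/s_lock.h). */
    typedef struct SharedRelCacheEntry
    {
        Oid         relid;      /* relation OID, used as the hash key */
        /* ... cached pg_class tuple versions, subject to normal MVCC rules ... */
        slock_t     mutex;      /* spinlock protecting nblocks */
        long        nblocks;    /* physical block count, the same for all backends */
    } SharedRelCacheEntry;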

> >> That's bothered me too.  A possible answer would be to do the unlinking
> >> back-to-front (zap the last file first); that'd require a few more lines
> >> of code in md.c, but a crash midway through would then leave a legal
> >> file configuration that another backend could still do something with.
>
> > Oops, it's more serious than I had thought.
> > mdunlink() may only truncate a table file after a crash while unlinking
> > back-to-front.
> > A crash while unlinking front-to-back may leave segments that were never
> > unlinked, and they would suddenly appear as segments of the recreated
> > table.
> > Seems there's no easy fix.
>
> Well, it seems to me that the first misbehavior (incomplete delete becomes
> a partial truncate, and you can try again) is a lot better than the
> second (incomplete delete leaves an undeletable, unrecreatable table).
> Should I go ahead and make delete/truncate work back-to-front, or do you
> see a reason why that'd be a bad thing to do?
>

I also think back-to-front is better.
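
Something like this, just as a rough sketch (the path building and
segment naming are simplified here, not the real md.c routines):
remove the highest-numbered segment first, so a crash partway through
leaves a shorter but still legal set of segments.

    #include <stdio.h>
    #include <unistd.h>

    static void
    unlink_segments_back_to_front(const char *relpath, int last_segno)
    {
        char    path[1024];
        int     segno;

        for (segno = last_segno; segno > 0; segno--)
        {
            snprintf(path, sizeof(path), "%s.%d", relpath, segno);
            if (unlink(path) < 0)
                return;             /* stop; what remains is still a legal table */
        }
        unlink(relpath);            /* the base file has no ".N" suffix */
    }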

> > Probably it would require a lot of work to fix correctly.
> > Postponing real unlink/truncating until commit and creating table
> > files which correspond to their oids ..... etc ...
> > It's the same work as "DROP TABLE inside transactions" requires.
> > Hmm, is it worth the work?
>
> I'm not eager to do that either (not just because of the low return on
> work invested, but because naming table files by OIDs would be a big
> handicap for debugging and database admin work).  However, I'm not
> quite sure why you see it as a solution to the problem of recovering
> from a failed md.c operation.  The risks would be the same if the
> deletion was happening during commit, no?  And I'd be a lot *more*
> worried about deleting the wrong file during manual cleanup if the
> files were named by OID ;-)
>

We wouldn't even have to delete the relation files after commit.
Backends would never see them; they would access the files
corresponding to the new OIDs of the (re)created relations.
Deletion would be needed only to avoid consuming disk space.

For example, VACUUM could remove the undeleted files.
That may be the PostgreSQL style.
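
A rough sketch of that cleanup (file-per-OID naming and the
relation_oid_exists() helper are assumptions, not how things work
today): at VACUUM time, scan the database directory and unlink any
file whose name is an OID with no live pg_class entry.

    #include <dirent.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    extern bool relation_oid_exists(unsigned long oid);    /* assumed helper */

    static void
    remove_orphan_relation_files(const char *dbdir)
    {
        DIR            *dir = opendir(dbdir);
        struct dirent  *de;
        char            path[1024];

        if (dir == NULL)
            return;
        while ((de = readdir(dir)) != NULL)
        {
            char           *endp;
            unsigned long   oid = strtoul(de->d_name, &endp, 10);

            if (endp == de->d_name || *endp != '\0')
                continue;                   /* not a pure OID file name */
            if (!relation_oid_exists(oid))
            {
                snprintf(path, sizeof(path), "%s/%s", dbdir, de->d_name);
                unlink(path);               /* reclaim the dropped relation's space */
            }
        }
        closedir(dir);
    }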

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp


