RE: [HACKERS] mdnblocks is an amazing time sink in huge relations - Mailing list pgsql-hackers

From Hiroshi Inoue
Subject RE: [HACKERS] mdnblocks is an amazing time sink in huge relations
Date
Msg-id 000c01bf192b$5437e2a0$2801007e@cadzone.tpf.co.jp
In response to Re: [HACKERS] mdnblocks is an amazing time sink in huge relations  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] mdnblocks is an amazing time sink in huge relations
List pgsql-hackers
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > I have been suspicious about the current implementation of md.c.
> > It relies so much on information about existing physical files.
>
> Yes, but on the other hand we rely completely on those same physical
> files to hold our data ;-).  I don't see anything fundamentally
> wrong with using the existence and size of a data file as useful
> information.  It's not a substitute for a lock, of course, and there
> may be places where we need cross-backend interlocks that we haven't
> got now.
>

We have to lseek() every time we want to know the number of blocks in
a table file.  Isn't that an overhead?
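
To make the cost concrete, here is a minimal sketch of such a size
check. It is not the md.c code: sketch_nblocks() and SKETCH_BLCKSZ are
hypothetical names, and the 8192-byte block size is only an assumption
for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_BLCKSZ 8192      /* assumed block size, illustration only */

/*
 * Return the number of whole blocks in the file open on fd, or -1 on
 * error.  Every call costs one lseek() system call, which is the
 * overhead in question.
 */
static long
sketch_nblocks(int fd)
{
    off_t       len;

    assert(fd >= 0);
    len = lseek(fd, 0, SEEK_END);   /* file length in bytes */
    if (len < 0)
        return -1;
    return (long) (len / SKETCH_BLCKSZ);
}
```

A relation split into several segment files would need one such call
per segment, so the cost grows with the size of the table.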

> > What do you think about the following?
> >
> > 2. If a backend was killed or crashed in the middle of execution of
> >     mdunlink()/mdtruncate(), half of the segments wouldn't be
> >     unlinked/truncated.
>
> That's bothered me too.  A possible answer would be to do the unlinking
> back-to-front (zap the last file first); that'd require a few more lines
> of code in md.c, but a crash midway through would then leave a legal
> file configuration that another backend could still do something with.

Oops, it's more serious than I thought.
If a crash occurs while unlinking back-to-front, mdunlink() may end up
merely truncating the table file.
A crash while unlinking front-to-back may leave segments that were never
unlinked, and they would suddenly appear as segments of the recreated
table.
It seems there's no easy fix.
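
The back-to-front case can be sketched as below. This is an
illustration, not the md.c code: the sketch_* names and the max_unlinks
"crash" parameter are hypothetical, and I assume segment 0 is stored as
"base" and segment N as "base.N".

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Build the file name of segment segno: "base" for 0, else "base.N". */
static void
sketch_segpath(char *buf, size_t buflen, const char *base, int segno)
{
    if (segno == 0)
        snprintf(buf, buflen, "%s", base);
    else
        snprintf(buf, buflen, "%s.%d", base, segno);
}

/*
 * Unlink segments nsegs-1 .. 0 in that order.  Stop after at most
 * max_unlinks files to simulate a crash (pass nsegs for a full run).
 * Returns the number of files actually unlinked.
 */
static int
sketch_unlink_back_to_front(const char *base, int nsegs, int max_unlinks)
{
    char        path[1024];
    int         done = 0;
    int         segno;

    assert(nsegs >= 0);
    for (segno = nsegs - 1; segno >= 0 && done < max_unlinks; segno--)
    {
        sketch_segpath(path, sizeof(path), base, segno);
        if (unlink(path) != 0)
            break;              /* give up on the first failure */
        done++;
    }
    return done;
}
```

With three segments and a simulated crash after two unlinks, only
"base" survives: a legal but shorter table, so the drop has in effect
become a truncate. Running the loop in the other direction would
instead leave "base.1" and "base.2" behind without "base".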

> > 3. In the cygwin port, mdunlink()/mdtruncate() may leave segments
> >     of 0 length.
>
> I don't understand what causes this.  Can you explain?
>

To unlink a segment, md.c calls FileTruncate() and then FileUnlink().
If FileUnlink() fails, segments of 0 length remain. But that does not
seem critical to this issue.

> > 4. We couldn't mdcreate() existing files and couldn't mdopen()/
> >     mdunlink() non-existent files.  So there are some cases where
> >     we could neither CREATE TABLE nor DROP TABLE.
>
> True, but I think this is probably the best thing for safety's sake.
> It seems to me there is too much risk of losing or overwriting valid
> data if md.c bulls ahead when it finds an unexpected file configuration.
> I'd rather rely on manual cleanup if things have gotten that seriously
> out of whack... (but that's just my opinion, perhaps I'm in the
> minority?)
>

There is another risk.
We may remove other tables' files by mistake during manual cleanup.
And if I were a newcomer, I would not consider PostgreSQL a real DBMS
(fortunately I have never seen any mention of this).

However, I don't object, because I have the same anxiety and can offer
no easy solution.

It would probably require a lot of work to fix correctly:
postponing the real unlink/truncate until commit, creating table
files whose names correspond to their OIDs ..... etc ...
That is the same work that "DROP TABLE inside transactions" requires.

Hmm, is it worth the work?
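
A rough sketch of the "postpone the real unlink until commit" idea,
only to show its shape. None of this exists in the backend: the
sketch_* functions, the fixed-size pending list, and the commit/abort
hooks are all hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SKETCH_MAX_PENDING 16
#define SKETCH_PATHLEN     256

static char sketch_pending[SKETCH_MAX_PENDING][SKETCH_PATHLEN];
static int  sketch_npending = 0;

/*
 * Called where DROP TABLE would otherwise unlink: only remember the
 * file.  Returns 0 on success, -1 if the list is full or the path is
 * too long.
 */
static int
sketch_schedule_unlink(const char *path)
{
    assert(path != NULL);
    if (sketch_npending >= SKETCH_MAX_PENDING ||
        strlen(path) >= SKETCH_PATHLEN)
        return -1;
    snprintf(sketch_pending[sketch_npending++], SKETCH_PATHLEN, "%s", path);
    return 0;
}

/* At commit: perform the postponed unlinks; returns files removed. */
static int
sketch_at_commit(void)
{
    int         removed = 0;
    int         i;

    for (i = 0; i < sketch_npending; i++)
        if (unlink(sketch_pending[i]) == 0)
            removed++;
    sketch_npending = 0;
    return removed;
}

/* At abort: forget the requests, so the files stay on disk. */
static void
sketch_at_abort(void)
{
    sketch_npending = 0;
}
```

The point is that an aborted transaction never touches the files. A
crash in the middle of sketch_at_commit() can still leave a partial
set of files, so this alone does not solve the segment problem above.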

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
