Re: [HACKERS] TODO item - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: [HACKERS] TODO item
Date
Msg-id 200002071735.MAA08496@candle.pha.pa.us
In response to Re: [HACKERS] TODO item  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses fsync alternatives (was: Re: [HACKERS] TODO item)  (Alfred Perlstein <bright@wintelcom.net>)
Re: [HACKERS] TODO item  (Tom Lane <tgl@sss.pgh.pa.us>)
RE: [HACKERS] TODO item  ("Hiroshi Inoue" <Inoue@tpf.co.jp>)
List pgsql-hackers
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Don't tell me we fsync on every buffer write, and not just at
> > transaction commit?  That is terrible.
> 
> If you don't have -F set, yup.  Why did you think fsync mode was
> so slow?
> 
> > What if we set a flag on the file descriptor stating we dirtied/wrote
> > one of its buffers during the transaction, and cycle through the file
> > descriptors on buffer commit and fsync all involved in the transaction. 
> 
> That's exactly what Tatsuo was describing, I believe.  I think Hiroshi
> has pointed out a serious problem that would make it unreliable when
> multiple backends are running: if some *other* backend fwrites the page
> instead of your backend, and it doesn't fsync until *its* transaction is
> done (possibly long after yours), then you lose the ordering guarantee
> that is the point of the whole exercise...

OK, I understand now.  You are asking whether, if my backend dirties a
buffer but another backend does the write, my backend's fsync() would
flush the buffer that the other backend wrote.

I can't imagine how fsync could flush _only_ the file descriptor buffers
modified by the current process.  It would have to affect all buffers
for the file descriptor.

BSDI says:
    Fsync() causes all modified data and attributes of fd to be moved to a
    permanent storage device.  This normally results in all in-core modified
    copies of buffers for the associated file to be written to a disk.
 

Looking at the BSDI kernel, there is a user-mode file descriptor table,
which maps to a kernel file descriptor table.  An entry in the kernel
table can be shared, so the same open file can be referenced by several
descriptors, as happens after a fork() call.  The kernel table maps to
an actual inode/vnode, which maps to a file.  The only thing that is
kept in the kernel file descriptor table entry (struct file in BSD) is
the current offset in the file.  There is no mapping of who wrote which
blocks.
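
A quick way to see the shared offset from user space, as a rough sketch
only.  It is in Python because its os module wraps the same POSIX calls;
the function name is made up for illustration, and it assumes a Unix
system with fork().  The child writes through the inherited descriptor
and the parent then sees its own offset moved, while a separately opened
descriptor on the same file keeps its own offset:

```python
import os
import tempfile


def shared_offset_after_fork():
    """Return (offset of the inherited fd, offset of an independent open)."""
    fd, path = tempfile.mkstemp()
    try:
        # A second open() of the same file gets its own offset.
        other = os.open(path, os.O_RDWR)
        pid = os.fork()
        if pid == 0:
            # Child: write 8 bytes through the inherited descriptor and
            # exit immediately, skipping the parent's cleanup code.
            try:
                os.write(fd, b"12345678")
            finally:
                os._exit(0)
        os.waitpid(pid, 0)
        inherited = os.lseek(fd, 0, os.SEEK_CUR)   # moved by the child
        fresh = os.lseek(other, 0, os.SEEK_CUR)    # not moved
        os.close(other)
        return inherited, fresh
    finally:
        os.close(fd)
        os.unlink(path)
```

Nothing here records which process issued the write; only the offset is
visible, which matches what the kernel structures suggest.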

In fact, I would suggest that any kernel implementation that could track
such things would be pretty broken.  I can imagine cases where using
such a mapping of blocks to file descriptors would cause compatibility
problems.  Those buffers have to be shared by all processes.

So, I think we are safe if we can either keep that file descriptor open
until commit, or re-open it and fsync it on commit.  That assumes a
re-open is hitting the same file.  My opinion is that we should just
fsync it on close and not worry about a reopen.
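
To make the bookkeeping concrete, here is a minimal sketch of the idea,
not backend code.  Every name in it (DirtyFiles, write, close, commit,
synced) is hypothetical, and it is Python only because os.fsync() and
os.pwrite() map directly onto the system calls.  It remembers which
files were written during the transaction, fsyncs a dirty file before
its descriptor is closed, and fsyncs whatever is still dirty at commit:

```python
import os


class DirtyFiles:
    """Hypothetical per-backend tracking of files written in a transaction."""

    def __init__(self):
        self.fds = {}        # path -> open file descriptor
        self.dirty = set()   # paths written since their last fsync
        self.synced = []     # order of fsync calls, kept only for illustration

    def write(self, path, offset, data):
        fd = self.fds.get(path)
        if fd is None:
            fd = self.fds[path] = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        os.pwrite(fd, data, offset)
        self.dirty.add(path)

    def _sync(self, path):
        os.fsync(self.fds[path])
        self.dirty.discard(path)
        self.synced.append(path)

    def close(self, path):
        # fsync on close, so a dirty descriptor never goes away unsynced.
        if path in self.dirty:
            self._sync(path)
        os.close(self.fds.pop(path))

    def commit(self):
        # sorted() copies the set, so _sync() may shrink it while we loop.
        for path in sorted(self.dirty):
            self._sync(path)
```

This covers only what a single backend knows about its own writes.  It
does nothing for the case raised upthread, where some other backend
writes the page and has not fsynced yet.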

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 

