Re: O_DIRECT in freebsd - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: O_DIRECT in freebsd
Date
Msg-id 200306230042.h5N0gji07128@candle.pha.pa.us
In response to Re: O_DIRECT in freebsd  (Sean Chittenden <sean@chittenden.org>)
Responses Re: O_DIRECT in freebsd  (Sean Chittenden <sean@chittenden.org>)
List pgsql-hackers
Sean Chittenden wrote:
> > Basically, we don't know when we read a buffer whether this is a
> > read-only or read/write.  In fact, we could read it in, and another
> > backend could write it for us.
> 
> Um, wait.  The cache is shared between backends?  I don't think so,
> but it shouldn't matter because there has to be a semaphore locking
> the cache to prevent the coherency issue you describe.  If PostgreSQL
> didn't, it'd be having problems with this now.  I'd also think that
> MVCC would handle the case of updated data in the cache as that has to
> be a common case.  At what point is the cached result invalidated and
> fetched from the OS?

Uh, it's called the _shared_ buffer cache in postgresql.conf, and we
lock pages only while we are reading/writing them, not for the duration
they are in the cache.

> > The big issue is that when we do a write, we don't wait for it to
> > get to disk.
> 
> Only in the case when fsync() is turned off, but again, it's up to
> the OS to manage that can of worms, which I think BSD takes care of.
> From conf/NOTES:

Nope.  When you don't have a kernel buffer cache, and you do a write,
where do you expect it to go?  I assume it goes to the drive, and you
have to wait for that.

> 
> # Attempt to bypass the buffer cache and put data directly into the
> # userland buffer for read operation when O_DIRECT flag is set on the
> # file.  Both offset and length of the read operation must be
> # multiples of the physical media sector size.
> #
> #options                DIRECTIO
> 
> The offsets and length bit kinda bothers me though, but I think that's
> stuff that the kernel has to take into account, not the userland calls.
> I wonder if that's actually accurate any more or affects userland
> calls...  seems like that'd be a bit too much work to have the user
> do, esp given the lack of documentation on the flag... should be just
> a drop-in additional flag, afaict.
> 
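
If the restriction in that conf/NOTES comment does apply to userland
calls, the caller would have to widen each request to sector
boundaries itself.  A minimal sketch of what that might look like
follows.  The names open_direct(), sector_floor(), sector_ceil() and
direct_read() are made up for illustration, the 512-byte sector size
is an assumption (real code would ask the device), and aligning the
buffer address as well is a precaution that some systems also require.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux; harmless elsewhere */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define SECTOR_SIZE 512         /* assumed; not queried from the device */

/* Round a non-negative file offset down to the start of its sector. */
off_t sector_floor(off_t off)
{
    assert(off >= 0);
    return off - (off % SECTOR_SIZE);
}

/* Round a length up to a whole number of sectors. */
size_t sector_ceil(size_t len)
{
    return (len + SECTOR_SIZE - 1) / SECTOR_SIZE * SECTOR_SIZE;
}

/* Open read-only, asking for O_DIRECT where the platform defines it. */
int open_direct(const char *path)
{
#ifdef O_DIRECT
    return open(path, O_RDONLY | O_DIRECT);
#else
    return open(path, O_RDONLY);
#endif
}

/*
 * Read the sectors covering [off, off + len) into a sector-aligned buffer.
 * On success returns the buffer (caller frees it), stores the byte count
 * in *nread, and stores in *skip the position of the requested data
 * inside the buffer.  Returns NULL on allocation or read failure.
 */
void *direct_read(int fd, off_t off, size_t len, size_t *skip, ssize_t *nread)
{
    off_t  start = sector_floor(off);
    size_t span = sector_ceil((size_t) (off - start) + len);
    void  *buf = NULL;

    if (posix_memalign(&buf, SECTOR_SIZE, span) != 0)
        return NULL;
    *nread = pread(fd, buf, span, start);
    if (*nread < 0)
    {
        free(buf);
        return NULL;
    }
    *skip = (size_t) (off - start);
    return buf;
}
```

Under these assumptions, a 10-byte read at offset 1000 becomes one
512-byte read at offset 512, with the wanted bytes starting 488 bytes
into the buffer.
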
> > It seems that to use O_DIRECT, we would have to read the buffer in a
> > special way to mark it as read-only, which seems kind of strange.  I
> > see no reason we can't use free-behind in the PostgreSQL buffer
> > cache to handle most of the benefits of O_DIRECT, without the
> > read-only buffer restriction.
> 
> I don't see how this'd be an issue as buffers populated via a read(),
> that are updated, and then written out, would occupy a new chunk of
> disk to satisfy MVCC.  Why would we need to mark a buffer as read only
> and carry around/check its state?

We update the expired flags on the tuple during update/delete.
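
To make that concrete, here is a toy model, not PostgreSQL's actual
page or tuple layout.  ToyTuple, ToyPage, toy_insert() and
toy_update() are invented for this sketch; only the idea of xmin/xmax
transaction stamps on each tuple version is taken from the real
system.  The new version may land on a different page, but the old
version is stamped in place, so the page that held it is dirtied too.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TUPLES_PER_PAGE 8

typedef struct
{
    unsigned    xmin;           /* transaction that created this version */
    unsigned    xmax;           /* transaction that expired it, 0 if live */
    char        data[16];
} ToyTuple;

typedef struct
{
    ToyTuple    tuples[TUPLES_PER_PAGE];
    int         ntuples;
    bool        dirty;          /* page must be written back */
} ToyPage;

/* Append a new tuple version; returns its slot, or -1 if the page is full. */
int toy_insert(ToyPage *page, unsigned xid, const char *data)
{
    ToyTuple   *t;

    if (page->ntuples >= TUPLES_PER_PAGE)
        return -1;
    t = &page->tuples[page->ntuples];
    t->xmin = xid;
    t->xmax = 0;
    strncpy(t->data, data, sizeof t->data - 1);
    t->data[sizeof t->data - 1] = '\0';
    page->dirty = true;
    return page->ntuples++;
}

/*
 * Update a tuple: the new version goes to new_page, and the old version
 * on old_page is marked expired in place, which dirties old_page as well.
 */
int toy_update(ToyPage *old_page, int slot, ToyPage *new_page,
               unsigned xid, const char *data)
{
    int         new_slot;

    assert(slot >= 0 && slot < old_page->ntuples);
    new_slot = toy_insert(new_page, xid, data);
    if (new_slot < 0)
        return -1;
    old_page->tuples[slot].xmax = xid;  /* write to the "old" buffer */
    old_page->dirty = true;
    return new_slot;
}
```

In this model a page that was read in only to find a row still gets
modified by an update or delete of that row, which is why a read-only
buffer does not fit.
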

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 

