Re: O_DIRECT in freebsd - Mailing list pgsql-hackers
From | Sean Chittenden |
---|---|
Subject | Re: O_DIRECT in freebsd |
Date | |
Msg-id | 20030623011247.GI97131@perrin.int.nxad.com |
In response to | Re: O_DIRECT in freebsd (Bruce Momjian <pgman@candle.pha.pa.us>) |
List | pgsql-hackers |
> > > Basically, we don't know when we read a buffer whether this is a
> > > read-only or read/write.  In fact, we could read it in, and
> > > another backend could write it for us.
> >
> > Um, wait.  The cache is shared between backends?  I don't think
> > so, but it shouldn't matter because there has to be a semaphore
> > locking the cache to prevent the coherency issue you describe.  If
> > PostgreSQL didn't, it'd be having problems with this now.  I'd
> > also think that MVCC would handle the case of updated data in the
> > cache as that has to be a common case.  At what point is the
> > cached result invalidated and fetched from the OS?
>
> Uh, it's called the _shared_ buffer cache in postgresql.conf, and we
> lock pages only while we are reading/writing them, not for the duration
> they are in the cache.

*smacks forehead* Duh, you're right.  I always just turn up the FS
cache in the OS instead.  The shared buffer cache has got to have
enormous churn though if everything ends up in the userland cache.  Is
it really an exhaustive cache?  I thought the bulk of the caching
happened in the kernel and not in the userland.  Is the userland cache
just for the SysCache and friends, or does it cache everything that
moves through PostgreSQL?

> > > The big issue is that when we do a write, we don't wait for it
> > > to get to disk.
> >
> > Only in the case when fsync() is turned off, but again, that's up to
> > the OS to manage that can of worms, which I think BSD takes care of
> > that.  From conf/NOTES:
>
> Nope.  When you don't have a kernel buffer cache, and you do a
> write, where do you expect it to go?  I assume it goes to the drive,
> and you have to wait for that.

Correct: when there isn't enough buffer space, a write call blocks
until the bits hit the disk.  When there is enough buffer space,
however, the buffer holds the bits until they are written to disk and
the kernel returns control to the userland app right away.

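To make the buffered-versus-blocking distinction concrete, here is a minimal sketch (Python for brevity; its os calls are thin wrappers over write(2) and fsync(2)). The helper name timed_write and the 64 kB payload are made up for illustration and come from neither PostgreSQL nor FreeBSD. It times a plain buffered write against a write followed by fsync(): the first normally returns once the kernel has copied the data into its buffers, the second only after the kernel reports the data flushed to the device.

```python
import os
import tempfile
import time


def timed_write(path, payload, sync):
    """Write payload to path; if sync is true, fsync() before returning.

    Returns (bytes_written, elapsed_seconds). Hypothetical helper.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        written = 0
        # write(2) may accept fewer bytes than asked for, so loop.
        while written < len(payload):
            written += os.write(fd, payload[written:])
        if sync:
            # Block until the kernel reports the data flushed to the device.
            os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return written, elapsed


if __name__ == "__main__":
    payload = b"x" * (8 * 8192)  # eight 8 kB pages (PostgreSQL's default block size)
    with tempfile.TemporaryDirectory() as d:
        for sync in (False, True):
            n, secs = timed_write(os.path.join(d, "blk"), payload, sync)
            print(f"sync={sync}: {n} bytes in {secs * 1e3:.3f} ms")
```

The absolute timings depend entirely on the filesystem and the drive's write cache, so treat the printed numbers as a rough indication only.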
Consensus is that FreeBSD does the right thing and hands back data
from the FS buffer even though the fd was marked O_DIRECT (see
bottom).

> > I don't see how this'd be an issue as buffers populated via a
> > read(), that are updated, and then written out, would occupy a new
> > chunk of disk to satisfy MVCC.  Why would we need to mark a buffer
> > as read only and carry around/check its state?
>
> We update the expired flags on the tuple during update/delete.

*nods* Okay, I don't see where the problem would be then with
O_DIRECT.  I'm going to ask Dillon about O_DIRECT since he
implemented it, likely for the backplane database that he's writing.
I'll let ya know what he says.  -sc

Here's a snip from the conversation I had with someone that has mega
vfs foo in FreeBSD:

17:58 * seanc has a question about O_DIRECT
17:58 <@zb^3> ask
17:59 <@seanc> assume two procs have a file open, one proc writes using buffered IO, the other uses O_DIRECT to read from the file, is read() smart enough to hand back the data in the buffer that hasn't hit the disk yet or will there be syncing issues?
18:00 <@zb^3> O_DIRECT in the incarnation from matt dillon will break shit
18:00 <@zb^3> basically, any data read will be set non-cacheable
18:01 <@zb^3> and you'll experience writes earlier than you should
18:01 <@seanc> zb^3: hrm, I don't want to write to the fd + O_DIRECT though
18:02 <@seanc> zb^3: basically you're saying an O_DIRECT fd doesn't consult the FS cache before reading from disk?
18:03 <@zb^3> no, it does
18:03 <@zb^3> but it immediately puts any read blocks on the ass end of the LRU
18:03 <@zb^3> so if you write a block, then read it with O_DIRECT it will get written out early :(
18:04 <@seanc> zb^3: ah, got it... it's not a data coherency issue, it's a priority issue and O_DIRECT makes writes jump the gun
18:04 <@seanc> got it
18:05 <@seanc> zb^3: is that required in the implementation or is it a bug?
18:06 * seanc is wondering whether or not he should bug dillon about this to get things working correctly
18:07 <@zb^3> it's a bug in the implementation
18:08 <@zb^3> to fix it you have to pass flags all the way down into the getblk-like layer
18:08 <@zb^3> and dillon was opposed to that
18:09 <@seanc> zb^3: hrm, thx... I'll go bug him about it now and see what's up in backplane land

-- 
Sean Chittenden
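
The question put to zb^3 in the log can be tried directly. A rough sketch, in Python wrapping open(2)/read(2): one descriptor writes a block through the FS cache without calling fsync(), then a second descriptor opened with O_DIRECT reads it back while the writer is still open. The names direct_read and BLOCK and the 4096-byte alignment are assumptions for illustration. O_DIRECT is not available on every platform or filesystem, so the code falls back to a buffered read and reports which path it took. A match only demonstrates coherency; it says nothing about when the dirty block gets flushed, which is the priority problem described in the log.

```python
import mmap
import os
import tempfile

BLOCK = 4096  # assumed alignment; real code should ask the filesystem/device


def direct_read(path, length=BLOCK):
    """Read `length` bytes from offset 0, using O_DIRECT when possible.

    `length` should be a multiple of BLOCK. Returns (data, used_direct).
    Hypothetical helper, not part of any real API.
    """
    flag = getattr(os, "O_DIRECT", 0)  # 0 where the platform lacks O_DIRECT
    buf = mmap.mmap(-1, length)        # anonymous mmap memory is page-aligned
    try:
        if flag:
            try:
                fd = os.open(path, os.O_RDONLY | flag)
                try:
                    n = os.readv(fd, [buf])
                    return bytes(buf[:n]), True
                finally:
                    os.close(fd)
            except OSError:
                pass  # e.g. EINVAL on filesystems without direct I/O
        # Fallback: ordinary buffered read.
        fd = os.open(path, os.O_RDONLY)
        try:
            n = os.readv(fd, [buf])
            return bytes(buf[:n]), False
        finally:
            os.close(fd)
    finally:
        buf.close()


if __name__ == "__main__":
    payload = b"pg" * (BLOCK // 2)
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "blk")
        wfd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            os.write(wfd, payload)            # buffered write, no fsync()
            data, direct = direct_read(path)  # second fd, writer still open
        finally:
            os.close(wfd)
        print("O_DIRECT used:", direct, "| coherent:", data == payload)
```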