Re: O_DIRECT in freebsd - Mailing list pgsql-hackers

From Sean Chittenden
Subject Re: O_DIRECT in freebsd
Date
Msg-id 20030622224943.GF97131@perrin.int.nxad.com
In response to Re: O_DIRECT in freebsd  (Bruce Momjian <pgman@candle.pha.pa.us>)
Responses Re: O_DIRECT in freebsd
List pgsql-hackers
> > > What you really want is Solaris's free-behind, where it detects
> > > if a scan is exceeding a certain percentage of the OS cache and
> > > moves the pages to the _front_ of the to-be-reused list.  I am
> > > not sure what other OS's support this, but we need this on our
> > > own buffer manager code as well.
> > > 
> > > Our TODO already has:
> > > 
> > >     * Add free-behind capability for large sequential scans (Bruce)
> > > 
> > > Basically, I think we need free-behind rather than O_DIRECT.
> > 
> > I suppose, but you've already polluted the cache by the time the
> > above mentioned mechanism kicks in and takes effect.  Given that
> > the planner has an idea of how much data it's going to read in in
> > order to complete the query, seems easier/better to mark the fd
> > O_DIRECT.  *shrug*
> 
> _That_ is an excellent point.  However, do we know at the time we
> open the file descriptor if we will be doing this?

Doesn't matter, it's an option to fcntl().
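
As a rough sketch of what that could look like (this is not PostgreSQL
code; set_direct_io() is a made-up helper name, and it assumes a platform
whose fcntl(2) accepts O_DIRECT in F_SETFL, as FreeBSD and Linux do):

```c
#define _GNU_SOURCE             /* Linux hides O_DIRECT without this */
#include <errno.h>
#include <fcntl.h>

/*
 * Hypothetical helper: turn O_DIRECT on or off for a descriptor that is
 * already open.  Returns 0 on success, -1 on failure with errno set.
 */
static int
set_direct_io(int fd, int enable)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;

#ifdef O_DIRECT
    if (enable)
        flags |= O_DIRECT;
    else
        flags &= ~O_DIRECT;

    /* The filesystem may still refuse the flag; the caller must check. */
    return fcntl(fd, F_SETFL, flags) == -1 ? -1 : 0;
#else
    /* No O_DIRECT here: clearing is a no-op, setting is unsupported. */
    if (!enable)
        return 0;
    errno = ENOTSUP;
    return -1;
#endif
}
```

The idea would be to call set_direct_io(fd, 1) before a big sequential
read and set_direct_io(fd, 0) afterwards, so the decision can be made
long after open().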

> What about cache coherency problems with other backends not opening
> with O_DIRECT?

That's a problem for the kernel VM, if you mean cache coherency in the
VM.  If you mean inside of the backend, that could be a stickier
issue, I think.  I don't know enough of the internals yet to know if
this is a problem or not, but you're right, it's certainly something
to consider.  Is the cache a write behind cache or is it a read
through cache?  If it's a read through cache, which I think it is,
then the backend would have to dirty all cache entries pertaining to
the relations being opened with O_DIRECT.  The use case for that
being:

1) a transaction begins
2) a few rows out of the huge table are read
3) a huge query is performed that triggers the use of O_DIRECT
4) the rows selected in step 2 are updated (this step should poison or
   update the cache, actually, and act as a write through cache if the
   data is in the cache)
5) the same few rows are read in again
6) transaction is committed
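
The sequence above can be modelled with a toy sketch.  This is not how
the backend's buffer manager works; the arrays and function names are
invented purely to illustrate the case where step 4 updates the cached
copy:

```c
#include <stdbool.h>

#define NROWS 8

static int  disk[NROWS];         /* stand-in for the rows on disk */
static int  cache_val[NROWS];    /* stand-in for cached copies of rows */
static bool cache_valid[NROWS];  /* which rows currently sit in the cache */

/* Read-through: serve from the cache, loading from disk on a miss. */
static int
cached_read(int row)
{
    if (!cache_valid[row])
    {
        cache_val[row] = disk[row];
        cache_valid[row] = true;
    }
    return cache_val[row];
}

/* O_DIRECT-style read: goes to disk and leaves the cache untouched. */
static int
direct_read(int row)
{
    return disk[row];
}

/* Step 4: write to disk and, if the row is cached, update that copy. */
static void
update_row(int row, int value)
{
    disk[row] = value;
    if (cache_valid[row])
        cache_val[row] = value;
}
```

With this model, a cached_read() in step 5 sees the value written in
step 4, and the rows touched only by direct_read() in step 3 never enter
the cache.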

Provided the cache is poisoned or updated in step 4, I can't see how
or where this would be a problem.  Please enlighten if there's a
different case that would need to be taken into account.  I can't
imagine ever wanting to write out data using O_DIRECT and think that
it's a read only optimization in an attempt to minimize the turn over
in the OS's cache.  From fcntl(2):
    O_DIRECT     Minimize or eliminate the cache effects of reading and
                 writing.  The system will attempt to avoid caching the
                 data you read or write.  If it cannot avoid caching the
                 data, it will minimize the impact the data has on the
                 cache.  Use of this flag can drastically reduce
                 performance if not used with care.


> And finally, how do we deal with the fact that writes to O_DIRECT
> files will wait until the data hits the disk because there is no
> kernel buffer cache?

Well, two things.

1) O_DIRECT should never be used on writes... I can't think of a case
   where you'd want it off, even when COPY'ing data and restoring a
   DB, it just doesn't make sense to use it.  The write buffer is
   emptied as soon as the pages hit the disk unless something is
   reading those bits, but I'd imagine the write buffer would be used
   to make sure that as much writing is done to the platter in a
   single write by the OS as possible, circumventing that would be
   insane (though useful possibly for embedded devices with low RAM,
   solid state drives, or some super nice EMC fiber channel storage
   device that basically has its own huge disk cache).

2) Last I checked PostgreSQL wasn't a threaded app and doesn't use
   non-blocking IO.  The backend would block until the call returns,
   where's the problem?  :)

If anything O_DIRECT would shake out any bugs in PostgreSQL's caching
code, if there are any.  -sc

-- 
Sean Chittenden

