Thread: a question about Direct I/O and double buffering

a question about Direct I/O and double buffering

From
Xiaoning Ding
Date:
Hi,

A page may be double buffered: once in PG's buffer pool and again in the OS's
buffer cache. Other DBMSs such as DB2 and Oracle provide a direct I/O option to
eliminate double buffering. I noticed there were discussions about this on the
list, but I cannot find a similar option in PG. Does PG support direct I/O now?

The PG tuning guides usually recommend a shared buffer pool that is small
compared to the size of physical memory. I think this is to avoid swapping: if
there were swapping, the OS kernel might swap out pages in PG's buffer pool
even though PG wants to keep them in memory, i.e. PG would lose full control
over its buffer pool.
A large buffer pool is not good because it may
1. cause more pages to be double buffered, and thus decrease the efficiency of
   the buffer cache and the buffer pool;
2. cause swapping.
Am I right?

If PG's buffer pool is small compared with physical memory, can I say that the
hit ratio of PG's buffer pool is not very meaningful, because most misses can
be satisfied by the OS kernel's buffer cache?

Thanks!


Xiaoning
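
The hit-ratio question above can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name `effective_hit_ratio` and the sample ratios are made up, and it assumes that every request missing PG's buffer pool is then looked up in the OS buffer cache.

```python
def effective_hit_ratio(pg_hit, os_hit_on_miss):
    """Fraction of page requests served from memory (PG pool or OS cache).

    pg_hit          -- hit ratio of PG's shared buffer pool, 0..1
    os_hit_on_miss  -- fraction of PG misses that the OS cache satisfies, 0..1
    """
    if not (0.0 <= pg_hit <= 1.0 and 0.0 <= os_hit_on_miss <= 1.0):
        raise ValueError("ratios must be between 0 and 1")
    # A request avoids the disk if PG hits, or if PG misses and the OS hits.
    return pg_hit + (1.0 - pg_hit) * os_hit_on_miss


if __name__ == "__main__":
    # Hypothetical numbers: a modest 60% pool hit ratio still means that
    # roughly 96% of requests avoid the disk if the OS cache catches 90%
    # of the misses.
    print(effective_hit_ratio(0.60, 0.90))
```

Under these assumptions the pool's own hit ratio understates how often a disk read is avoided, which is the point the question raises.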


Re: a question about Direct I/O and double buffering

From
Erik Jones
Date:
On Apr 5, 2007, at 12:09 PM, Xiaoning Ding wrote:

> Does PG support direct I/O now?

To the best of my knowledge, Postgres itself does not have a direct IO option (although it would be a good addition).  So, in order to use direct IO with Postgres, you'll need to consult your filesystem docs for how to set the forcedirectio mount option.  I believe it can be set dynamically, but if you want it to be permanent you'll need to add it to your fstab/vfstab file.

erik jones <erik@myemma.com>
software developer
615-296-0838
emma(r)



Re: a question about Direct I/O and double buffering

From
Xiaoning Ding
Date:
Erik Jones wrote:
> To the best of my knowledge, Postgres itself does not have a direct IO
> option (although it would be a good addition).  So, in order to use
> direct IO with Postgres, you'll need to consult your filesystem docs for
> how to set the forcedirectio mount option.  I believe it can be set
> dynamically, but if you want it to be permanent you'll need to add it to
> your fstab/vfstab file.

I use Linux.  It supports direct I/O on a per-file basis only.  To
bypass the OS buffer cache, files have to be opened with the O_DIRECT
flag.  I'm afraid that I would have to modify PG.

Xiaoning
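
For readers who have not used the flag, here is a minimal sketch of a per-file direct read on Linux. It is an illustration under stated assumptions, not PG code: the names `read_block` and `BLOCK` are hypothetical, the 4096-byte block size is an assumption, and the function falls back to ordinary buffered I/O when the platform or filesystem rejects O_DIRECT.

```python
import errno
import mmap
import os

BLOCK = 4096  # assumed I/O size; a multiple of the 512-byte sector size


def _read_at(fd, offset):
    # An anonymous mmap is page-aligned, which satisfies the buffer
    # alignment that O_DIRECT expects.
    buf = mmap.mmap(-1, BLOCK)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        n = os.readv(fd, [buf])
        return buf[:n]
    finally:
        buf.close()


def read_block(path, block_no=0):
    """Return (data, used_direct) for one BLOCK-sized block of the file."""
    direct_flag = getattr(os, "O_DIRECT", 0)  # not defined on every platform
    if direct_flag:
        try:
            fd = os.open(path, os.O_RDONLY | direct_flag)
            try:
                return _read_at(fd, block_no * BLOCK), True
            finally:
                os.close(fd)
        except OSError as exc:
            if exc.errno != errno.EINVAL:
                raise
            # EINVAL: O_DIRECT was rejected; fall through to buffered I/O.
    fd = os.open(path, os.O_RDONLY)
    try:
        return _read_at(fd, block_no * BLOCK), False
    finally:
        os.close(fd)
```

The point of the sketch is that the flag is chosen by the process that opens the file, which is why using it with PG would mean changing PG itself.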


Re: a question about Direct I/O and double buffering

From
Mark Lewis
Date:
Not to hijack this thread, but has anybody here tested the behavior of
PG on a file system with OS-level caching disabled via forcedirectio or
by using an inherently non-caching file system such as ocfs2?

I've been thinking about trying this setup to avoid double-caching now
that the 8.x series scales shared buffers better, but I figured I'd ask
first if anybody here had experience with similar configurations.

-- Mark

On Thu, 2007-04-05 at 13:09 -0500, Erik Jones wrote:
> To the best of my knowledge, Postgres itself does not have a direct IO
> option (although it would be a good addition).  So, in order to use
> direct IO with Postgres, you'll need to consult your filesystem docs
> for how to set the forcedirectio mount option.  I believe it can be
> set dynamically, but if you want it to be permanent you'll need to add
> it to your fstab/vfstab file.

Re: a question about Direct I/O and double buffering

From
Erik Jones
Date:
On Apr 5, 2007, at 1:22 PM, Xiaoning Ding wrote:

> I use Linux.  It supports direct I/O on a per-file basis only.  To bypass the OS buffer cache,
> files have to be opened with the O_DIRECT flag.  I'm afraid that I would have to modify PG.
>
> Xiaoning

Looks like it.  I just did a cursory search of the archives and it seems that others have looked at this before, so you'll probably want to start there if you're up to it.

erik jones <erik@myemma.com>
software developer
615-296-0838
emma(r)



Re: a question about Direct I/O and double buffering

From
"Alex Deucher"
Date:
On 4/5/07, Erik Jones <erik@myemma.com> wrote:
> Looks like it.  I just did a cursory search of the archives and it seems
> that others have looked at this before, so you'll probably want to start
> there if you're up to it.
>

Linux used to have (still does?) a RAW interface which might also be
useful.  I think the original code was contributed by Oracle so they
could support direct IO.

Alex

Re: a question about Direct I/O and double buffering

From
Erik Jones
Date:

On Apr 5, 2007, at 1:27 PM, Mark Lewis wrote:
> Not to hijack this thread, but has anybody here tested the behavior of
> PG on a file system with OS-level caching disabled via forcedirectio or
> by using an inherently non-caching file system such as ocfs2?
>
> I've been thinking about trying this setup to avoid double-caching now
> that the 8.x series scales shared buffers better, but I figured I'd ask
> first if anybody here had experience with similar configurations.
>
> -- Mark

Rather than repeat everything, I'll point out that we had a pretty decent discussion on this just last week (in a thread that I started), so check the archives.  In summary, though: if you have a high io transaction load on a db where the average size of your "working set" of data doesn't fit in memory with room to spare, then direct io can be a huge plus; otherwise you probably won't see much of a difference.  I have yet to hear of anybody actually seeing any degradation in db performance from it.  In addition, while it doesn't bother me, I'd watch the top posting, as some people get pretty religious about it (I moved your comments down).

erik jones <erik@myemma.com>
software developer
615-296-0838
emma(r)



Re: a question about Direct I/O and double buffering

From
Mark Lewis
Date:
...
[snipped for brevity]
...
> Rather than repeat everything that was said just last week, I'll point
> out that we just had a pretty decent discussion on this last week that
> I started, so check the archives.  In summary though, if you have a
> high io transaction load with a db where the average size of your
> "working set" of data doesn't fit in memory with room to spare, then
> direct io can be a huge plus, otherwise you probably won't see much of
> a difference.  I have yet to hear of anybody actually seeing any
> degradation in the db performance from it.  In addition, while it
> doesn't bother me, I'd watch the top posting as some people get pretty
> religious about it (I moved your comments down).

I saw the thread, but my understanding from reading through it was that
you never fully tracked down the cause of the factor of 10 write volume
mismatch, so I pretty much wrote it off as a data point for
forcedirectio because of the unknowns.  Did you ever figure out the
cause of that?

-- Mark Lewis

Re: a question about Direct I/O and double buffering

From
Erik Jones
Date:
On Apr 5, 2007, at 2:56 PM, Mark Lewis wrote:


> I saw the thread, but my understanding from reading through it was that
> you never fully tracked down the cause of the factor of 10 write volume
> mismatch, so I pretty much wrote it off as a data point for
> forcedirectio because of the unknowns.  Did you ever figure out the
> cause of that?
>
> -- Mark Lewis

Nope.  What we never tracked down was the factor of 10 drop in database transactions, not disk transactions.  The write volume was most definitely due to the direct io setting -- writes are now being done in terms of the system's block size, whereas before they were being done in terms of the filesystem's cache page size (as it's in virtual memory).  Basically, we do so many write transactions that the fs cache was constantly paging.

erik jones <erik@myemma.com>
software developer
615-296-0838
emma(r)



Re: a question about Direct I/O and double buffering

From
david@lang.hm
Date:
On Thu, 5 Apr 2007, Xiaoning Ding wrote:

>>
>>  To the best of my knowledge, Postgres itself does not have a direct IO
>>  option (although it would be a good addition).  So, in order to use direct
>>  IO with Postgres, you'll need to consult your filesystem docs for how to
>>  set the forcedirectio mount option.  I believe it can be set dynamically,
>>  but if you want it to be permanent you'll need to add it to your
>>  fstab/vfstab file.
>
> I use Linux.  It supports direct I/O on a per-file basis only.  To bypass
> the OS buffer cache, files have to be opened with the O_DIRECT flag.
> I'm afraid that I would have to modify PG.

as someone who has been reading the linux-kernel mailing list for 10
years, let me comment on this a bit.

Linux does have a direct i/o option, but it has significant limits on when
and how you can use it (buffers must be 512-byte aligned and multiples of
512 bytes, things like that). Also, in many cases testing has shown that
there is a fairly significant performance hit for this, not a performance
gain.
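
The alignment limits mentioned above can be sketched in a few lines. This is an illustration only: the helper names `is_aligned` and `aligned_span` are hypothetical, the 512-byte unit is the figure quoted in this thread (a given device or filesystem may require a larger one), and the sketch covers only the file offset and length, not the address of the memory buffer, which must be aligned as well.

```python
SECTOR = 512  # alignment unit quoted in the thread; real devices may need more


def is_aligned(offset, length, align=SECTOR):
    """True when a request could be issued as-is under direct I/O."""
    return length > 0 and offset % align == 0 and length % align == 0


def aligned_span(offset, length, align=SECTOR):
    """Smallest aligned (start, size) range covering [offset, offset + length)."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    start = (offset // align) * align              # round the start down
    end = -(-(offset + length) // align) * align   # round the end up
    return start, end - start


if __name__ == "__main__":
    # A 500-byte read at offset 1000 has to be widened to two whole sectors.
    print(aligned_span(1000, 500))   # (512, 1024)
    # An 8K page at an 8K boundary is already aligned.
    print(is_aligned(8192, 8192))    # True
```

With normal buffered I/O the kernel does this widening on the application's behalf; with direct I/O the application has to do it itself.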

what I think Postgres really needs is support for write barriers
(telling the OS to make sure that everything before the barrier
is written to disk before anything after the barrier). I believe that these
are available on SCSI drives, and on some SATA drives. this sort of
support, along with appropriate async I/O support (which is probably going
to end up being the 'syslets' or 'threadlets' stuff that's in the early
experimental stage, rather than the current aio API), has the potential to
be a noticeable improvement.
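
A barrier of the kind described above is not something an ordinary application can request directly, so the usual stand-in is to call fsync() between the two writes. The sketch below shows that pattern; the function names and the "log before data" framing are illustrative assumptions, and fsync() is stronger than a pure barrier because it waits for the first write to reach stable storage instead of only ordering it.

```python
import os


def write_all(fd, data):
    """Write every byte of data, retrying after short writes."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def ordered_write(fd_first, data_first, fd_second, data_second):
    """Make data_first durable before data_second is even issued."""
    write_all(fd_first, data_first)
    os.fsync(fd_first)    # wait here: this is the cost a barrier would avoid
    write_all(fd_second, data_second)
    os.fsync(fd_second)
```

The wait inside `ordered_write` is what makes this approach expensive: the process stalls until the first write is on disk, whereas a barrier would only constrain the order in which the device performs the writes.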

if you haven't followed the syslets discussion on the kernel list,
threadlets are an approach that basically lets you turn any syscall into an
async interface (if the call doesn't block on anything you get the answer
back immediately; if it does block, it gets turned into an async call by the
kernel).

syslets are a way to combine multiple syscalls into a single call,
avoiding the user->system->user calling overhead for the additional calls.
(it's also viewed as a way to prototype possible new calls: if a
sequence of syscalls ends up being common enough, the kernel devs will look
at making a new, combined syscall. for example, lock, write, unlock could
be made into one if it's common enough and there's enough of a performance
gain.)

David Lang

Re: a question about Direct I/O and double buffering

From
Xiaoning Ding
Date:
Alex Deucher wrote:
> Linux used to have (still does?) a RAW interface which might also be
> useful.  I think the original code was contributed by oracle so they
> could support direct IO.
>
> Alex
I am more concerned with reads, and with how to do direct I/O under Linux here.
Reading raw devices in Linux bypasses the OS buffer cache. But how can you
mount a raw device (it is a character device) as a file system?

  Xiaoning

Re: a question about Direct I/O and double buffering

From
"Alex Deucher"
Date:
On 4/5/07, Xiaoning Ding <dingxn@cse.ohio-state.edu> wrote:
> I am more concerned with reads, and with how to do direct I/O under Linux here.
> Reading raw devices in Linux bypasses the OS buffer cache. But how can you
> mount a raw device (it is a character device) as a file system?
>

In this case, I guess you'd probably have to do it within pg itself.

Alex

Re: a question about Direct I/O and double buffering

From
Erik Jones
Date:

On Apr 5, 2007, at 3:33 PM, david@lang.hm wrote:

> as someone who has been reading the linux-kernel mailing list for 10 years, let me comment on this a bit.
>
> linux does have a direct i/o option,

Yes, I know applications can request direct i/o with the O_DIRECT flag to open(), but can it be forced for all applications, or for individual applications, from "outside" the application (not that I've ever heard of something like the latter)?

> but it has significant limits on when and how you can use it (buffers must be 512-byte aligned and multiples of 512 bytes, things like that).

That's a standard limit imposed by the sector size of hard drives, and it is present in all direct i/o implementations, not just Linux.

> Also, in many cases testing has shown that there is a fairly significant performance hit for this, not a performance gain.

Have those performance hits been noticed for high i/o transaction databases?  The idea here is that these kinds of databases manage their own caches, and having a separate filesystem cache in virtual memory that works with system memory page sizes is an unneeded level of indirection.  Yes, you should expect that other "normal" utilities will suffer a performance hit: if you are trying to cp a 500 byte file, you'll still have to work with 8K writes and reads, whereas with the filesystem cache you can just write/read part of a page in memory and let the cache decide when it needs to write to and read from disk.  If there are other caveats to direct i/o on Linux, I'd love to hear them.

erik jones <erik@myemma.com>
software developer
615-296-0838
emma(r)



Re: a question about Direct I/O and double buffering

From
david@lang.hm
Date:
On Thu, 5 Apr 2007, Xiaoning Ding wrote:

> I am more concerned with reads, and with how to do direct I/O under Linux here.
> Reading raw devices in Linux bypasses the OS buffer cache.

it also bypasses OS readahead, which is not necessarily a win

> But how can you
> mount a raw device (it is a character device) as a file system?

you can run mkfs on /dev/hda just like you do on /dev/hda2 and then
mount the result as a filesystem.

Postgres wants the OS layer to provide the filesystem; Oracle implements
its own filesystem, so you would just point it at the drive/partition and
it would do its own 'formatting'.

this is something that may be reasonable for Postgres to consider doing
someday. since Postgres allocates things into 1 GB files and then keeps
track of what filename is used for what, it could instead allocate things
in 1 GB (or whatever size) chunks on the disk, and just keep track of what
addresses are used for what instead of filenames. this would definitely
allow you to work around problems like the ext2/3 indirect lookup
problems. now that tablespaces are supported, it would be an
interesting experiment to be able to define a tablespace that used a raw
device instead of a filesystem, to see if there are any noticeable
performance gains.
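
The bookkeeping described above (tracking disk addresses instead of filenames) can be sketched as a small lookup table. This is a toy illustration, not a proposal for how Postgres would do it: the class name `RawChunkMap`, its methods, and the 1 GiB default chunk size are all assumptions made for the example, and it only hands out offsets without touching any device.

```python
class RawChunkMap:
    """Toy allocator: maps (relation, segment_no) to a byte offset on a device."""

    def __init__(self, device_size, chunk_size=1 << 30):
        self.chunk_size = chunk_size
        self.total_chunks = device_size // chunk_size
        self.next_free = 0
        self.table = {}  # (relation, segment_no) -> chunk index

    def offset_of(self, relation, segment_no):
        """Return the chunk's starting byte offset, allocating on first use."""
        key = (relation, segment_no)
        if key not in self.table:
            if self.next_free >= self.total_chunks:
                raise RuntimeError("device full")
            self.table[key] = self.next_free
            self.next_free += 1
        return self.table[key] * self.chunk_size


if __name__ == "__main__":
    chunks = RawChunkMap(device_size=4 << 30)      # hypothetical 4 GiB device
    print(chunks.offset_of("orders", 0))           # 0
    print(chunks.offset_of("customers", 0))        # 1073741824
    print(chunks.offset_of("orders", 0))           # 0 again: already mapped
```

A real implementation would also have to persist this table and handle freeing and reuse of chunks, which is part of what a filesystem currently does on the database's behalf.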

David Lang

Re: a question about Direct I/O and double buffering

From
david@lang.hm
Date:
On Thu, 5 Apr 2007, Erik Jones wrote:

> On Apr 5, 2007, at 3:33 PM, david@lang.hm wrote:
>
>> as someone who has been reading the linux-kernel mailing list for 10 years,
>> let me comment on this a bit.
>>
>> linux does have a direct i/o option,
>
> Yes, I know applications can request direct i/o with the O_DIRECT flag to
> open(), but can this be set to be forced for all applications or for
> individual applications from "outside" the application (not that I've ever
> heard of something like the second)?

no it can't, because direct i/o has additional requirements
for what you can use for buffers that don't apply to normal i/o

>> but it has significant limits on when and how you can use it (buffers must
>> be 512-byte aligned and multiples of 512 bytes, things like that).
>
> That's a standard limit imposed by the sector size of hard drives, and is
> present in all direct i/o implementations, not just Linux.

right, but you don't have those limits for normal i/o

>> Also, in many cases testing has shown that there is a fairly significant
>> performance hit for this, not a performance gain.
>
> Those performance hits have been noticed for high i/o transaction databases?
> The idea here is that these kinds of database manage their own caches and
> having a separate filesystem cache in virtual memory that works with system
> memory page sizes is an unneeded level of indirection.

ahh, you're proposing a re-think of how postgres interacts with the O/S,
not just an optimization to be applied to the current architecture.

unlike Oracle, Postgres doesn't try to be an OS itself; it tries very hard
to rely on the OS to properly implement things rather than doing its own
implementation.

> Yes, you should
> expect other "normal" utilities will suffer a performance hit as if you are
> trying to cp a 500 byte file you'll still have to work with 8K writes and
> reads whereas with the filesystem cache you can just write/read part of a
> page in memory and let the cache decide when it needs to write and read from
> disk.  If there are other caveats to direct i/o on Linux I'd love to hear
> them.

other than bad interactions with "normal" utilities not compiled for
direct i/o, I don't remember them offhand.

David Lang

Re: a question about Direct I/O and double buffering

From
"Jim C. Nasby"
Date:
On Thu, Apr 05, 2007 at 03:10:43PM -0500, Erik Jones wrote:
> Nope.  What we never tracked down was the factor of 10 drop in
> database transactions, not disk transactions.  The write volume was
> most definitely due to the direct io setting -- writes are now being
> done in terms of the system's block size, whereas before they were
> being done in terms of the filesystem's cache page size (as it's
> in virtual memory).  Basically, we do so many write transactions that
> the fs cache was constantly paging.

Did you try decreasing the size of the cache pages? I didn't realize
that Solaris used a different size for cache pages and filesystem
blocks. Perhaps the OS was also being too aggressive with read-aheads?

My concern is that you're essentially leaving a lot of your memory
unused this way, since shared_buffers is only set to 1.6G.

BTW, did you ever increase the parameter that controls how much memory
Solaris will use for filesystem caching?
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)