Thread: [Linus Torvalds ] Re: statfs() / statvfs() syscall ballsup...

[Linus Torvalds ] Re: statfs() / statvfs() syscall ballsup...

From
Greg Stark
Date:
There's an interesting thread on linux-kernel right now about O_DIRECT and the
kernel i/o APIs databases need. I noticed a connection between what they were
discussing and the earlier discussions here and the pining for an interface to
avoid having vacuum preempt other disk i/o.


Someone from Oracle is on there explaining what Oracle's needs are. Perhaps
someone more knowledgeable than myself could explain what would most help
postgres in this area.


There was another thread I commented on that touched on another postgres
wishlist item: a way to sync IDE disks reliably without disabling write
caching entirely. There was some inkling that newer drives might provide for
such a possibility. Perhaps that too could be worth advocating for on
postgres's behalf.




On 12 Oct 2003, Greg Stark wrote:
>
> There are other reasons databases want to control their own cache. The
> application knows more about the usage and the future usage of the data than
> the kernel does.

But this again is not an argument for not using the page cache - it's only
an argument for _telling_ the kernel about its use.

> However on busy servers whenever it's run it causes lots of pain because the
> kernel flushes all the cached data in favour of the data this job touches.

Yes. But this is actually pretty easy to avoid in-kernel, since all of the
LRU logic is pretty localized.

It could be done on a per-process thing ("this process should not pollute
the active list") or on a per-fd thing ("accesses through this particular
open are not to pollute the active list").
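An advisory, per-descriptor hint in this spirit exists in POSIX as posix_fadvise(). The following is only a minimal sketch of how a bulk scan might use it, assuming a platform where Python exposes os.posix_fadvise; the helper name scan_without_caching is hypothetical, the hints may be ignored by the kernel, and this is not the interface being proposed above.

```python
import os


def scan_without_caching(path, chunk_size=1 << 20):
    """Read a file sequentially and hint that its pages need not stay cached.

    Hypothetical helper: the hints are advisory and the kernel may ignore them.
    Returns the number of bytes read.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        can_advise = hasattr(os, "posix_fadvise")
        if can_advise:
            # A length of 0 means "from the offset to the end of the file".
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            if can_advise:
                # Ask the kernel to drop the pages that were just consumed.
                os.posix_fadvise(fd, total, len(chunk), os.POSIX_FADV_DONTNEED)
            total += len(chunk)
    finally:
        os.close(fd)
    return total
```

On platforms without posix_fadvise the function degrades to an ordinary sequential read.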

>                                     And
> worse, there's no way to indicate that the i/o it's doing is lower priority,
> so i/o bound servers get hit dramatically.

IO priorities are pretty much worthless. It doesn't _matter_ if other
processes get preferred treatment - what is costly is the latency cost of
seeking. What you want is not priorities, but batching.

            Linus




--
greg

Re: Database Kernels and O_DIRECT

From
James Rogers
Date:
On Sun, 2003-10-12 at 15:13, Greg Stark wrote:
> There's an interesting thread on linux-kernel right now about O_DIRECT and the
> kernel i/o APIs databases need. I noticed a connection between what they were
> discussing and the earlier discussions here and the pining for an interface to
> avoid having vacuum preempt other disk i/o.
>
> Someone from Oracle is on there explaining what Oracle's needs are. Perhaps
> someone more knowledgeable than myself could explain what would most help
> postgres in this area.


There is an important difference between Oracle and Postgres that makes
discussions of this complicated because the assumptions are different.

Oracle runs on top of a database kernel, whereas Postgres does not.  In
the former case, it is very useful and conducive to better performance
to have O_DIRECT and direct control of the I/O in general -- the more,
the better.  In the latter case (e.g. Postgres), it is more of a
nuisance and difficult to exploit well.

The point of having a database kernel underneath the DBMS is two-fold.  

First, it improves portability by acting as an operating system
abstraction layer, replacing OS kernel services with its own equivalents
(which may map to any number of mechanisms underneath).  It is the
reason Oracle is easily supported on so many operating systems; to port
to a new OS, they only have to modify the database kernel, and they
probably have a highly portable generic version to start with that they
can then optimize for a given platform at their leisure. All the rest of
Oracle's code only has to compile against and run on the virtual
operating system that is their database kernel.

Second, where possible, the database kernel bypasses the OS kernel
internally (e.g. O_DIRECT) and implements its own versions of the OS
kernel services that are highly-tuned for database purposes. This often
has significant performance benefits.  While it kind of looks like an OS
on top of an OS, well-written database kernels often tend to exist
almost parallel to the system kernel in certain respects, only using the
system kernel where it is convenient or for future capabilities that
have been stubbed out in the database kernel.  Writing DBMS code to a
database kernel almost always produces a more scalable system than
writing to portable OS APIs because it eliminates the "lowest common
denominator" effect.

Having a database kernel isn't really important unless you are a
performance junkie or have to address really scalable database systems. 
Some more advanced DBMS features are easier to implement on a database
kernel as a pragmatic concern, because the system model being
implemented for is more database friendly. It lets the database take
advantage of the more advanced features and optimizations of whatever
operating system it is running on without the vast majority of the DBMS
code base being aware of these significant differences.

I'd like to see Postgres move to a database kernel eventually for a lot
of reasons, but it would be a relatively significant change. Maybe v8? :-)

Cheers,

-James Rogers
jamesr@best.com




Re: Database Kernels and O_DIRECT

From
Greg Stark
Date:
James Rogers <jamesr@best.com> writes:
> >
> > Someone from Oracle is on there explaining what Oracle's needs are. Perhaps
> > someone more knowledgeable than myself could explain what would most help
> > postgres in this area.
> 
> 
> There is an important difference between Oracle and Postgres that makes
> discussions of this complicated because the assumptions are different.

All the more reason Postgres's view of the world should maybe be represented
there. As it turns out Linus seems unsympathetic to the O_DIRECT approach and
seems more interested in building a better kernel interface to control caching
and i/o scheduling. Something that fits better with postgres's design than
Oracle's.

> the former case, it is very useful and conducive to better performance
> to have O_DIRECT and direct control of the I/O in general -- the more,
> the better.  In the latter case (e.g. Postgres), it is more of a
> nuisance and difficult to exploit well.

Actually I think it would be useful for the WAL. As I understand it there's no
point caching the WAL and every write is going to get synced anyways so
there's no point in buffering it either. The sooner the process can find out
it's been synced the better. But I'm not really 100% up on the way the WAL is
used so I could be wrong.
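To make the idea concrete, here is a minimal sketch of appending a log block with O_DIRECT, falling back to a buffered write plus fsync where direct I/O is unavailable. It is not how the Postgres WAL is written; the function names and the 4096-byte alignment are assumptions for illustration only.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; the real requirement depends on the device


def _write_block(path, buf, extra_flags):
    """Append buf to path and flush it to stable storage."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | extra_flags
    fd = os.open(path, flags, 0o600)
    try:
        os.write(fd, buf)
        os.fsync(fd)  # O_DIRECT alone does not guarantee durability
    finally:
        os.close(fd)


def append_log_block(path, payload):
    """Append one zero-padded block; return True if O_DIRECT was used."""
    if len(payload) > BLOCK:
        raise ValueError("payload does not fit in one block")
    buf = mmap.mmap(-1, BLOCK)  # anonymous mapping: page-aligned, zero-filled
    try:
        buf[:len(payload)] = payload
        direct = getattr(os, "O_DIRECT", 0)  # 0 where the flag does not exist
        if direct:
            try:
                _write_block(path, buf, direct)
                return True
            except OSError:
                # Sketch only: assume the filesystem rejected O_DIRECT.
                # Real code would inspect the error before retrying.
                pass
        _write_block(path, buf, 0)
        return False
    finally:
        buf.close()
```

The aligned buffer matters because direct I/O generally requires the memory address, file offset, and length to be multiples of the device's block size.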

> The point of having a database kernel underneath the DBMS is two-fold.  
> 
> First, it improves portability by acting as an operating system
> abstraction layer, replacing OS kernel services with its own equivalents

Bah. So Oracle has to live with whatever OS features VMS had 20 years ago. It
has to reimplement whatever I/O scheduling or other strategies it wants.
Rather than being the escape from the "lowest common denominator" it is in
fact precisely the cause of it.

You describe Postgres as if abstraction is a foreign concept to it. Much
better to have well designed minimal abstractions for each of the resources
needed, rather than trying to turn every OS you meet into the first one you
met.


-- 
greg



Re: Database Kernels and O_DIRECT

From
James Rogers
Date:
On 10/14/03 8:26 PM, "Greg Stark" <gsstark@mit.edu> wrote:
> 
> All the more reason Postgres's view of the world should maybe be represented
> there. As it turns out Linus seems unsympathetic to the O_DIRECT approach and
> seems more interested in building a better kernel interface to control caching
> and i/o scheduling. Something that fits better with postgres's design than
> Oracle's.


This would certainly help Postgres as currently written, but it won't have
the theoretical performance headroom of what Oracle wants.  A practical
kernel API is too narrow to be fully aware of and exploit database state.
And then there is the portability issue...

The way you want these kinds of things implemented in an operating system
kernel is somewhat orthogonal to how you want them implemented from the
perspective of a database kernel.  Typical resource use cases for an
operating system and a database engine make pretty different assumptions and
the best you'll get is a compromise that doesn't optimize either.

Making additional optimizations to the OS kernel works great for Postgres
(on Linux, at least) because currently very little is optimized in this
regard.  Basically Linus is doing some design optimization work for us.  An
improvement, but kind of a mediocre one in the big scheme of things and not
terribly portable.  If we suddenly wanted to optimize Postgres for
performance the way Oracle does, we would be a lot more keen on the O_DIRECT
approach.

> Actually I think it would be useful for the WAL. As I understand it there's no
> point caching the WAL and every write is going to get synced anyways so
> there's no point in buffering it either. The sooner the process can find out
> it's been synced the better. But I'm not really 100% up on the way the WAL is
> used so I could be wrong.


Aye, I think you may be correct.

> Bah. So Oracle has to live with whatever OS features VMS had 20 years ago. It
> has to reimplement whatever I/O scheduling or other strategies it wants.
> Rather than being the escape from the "lowest common denominator" it is in
> fact precisely the cause of it.


You appear to have completely missed the point.

The point of the abstraction layer is so they can optimize the hell out of
the database for every single platform they support without having to
rewrite a bunch of the database every time.  The database kernel API is
BETTER AND MORE OPTIMAL than the operating system API. It allows them to use
whatever memory management scheme, I/O scheme, etc is the best for every
single platform.  If "the best" happens to be going to the native OS service,
then that is what they do, but most of the code doesn't need to know this if
the abstraction layer is well-designed.

Most of the code in a DBMS does not care where memory comes from, how it's
managed, what the file system actually looks like, or how I/O is done.  As
long as the behavior is the same from the database kernel API it is writing
to, it is all good.  What this means from a practical standpoint is that you
don't *have* to use SysV IPC on every platform, or POSIX, or mmap, or
whatever.  You can use whatever that particular platform likes as long as it
can be mapped into the database kernel API, which tends to be at a high
enough level that just about *any* reasonable implementation of an OS API
can be mapped into it with quite a bit of optimization.


> You describe Postgres as if abstraction is a foreign concept to it. Much
> better to have well designed minimal abstractions for each of the resources
> needed, rather than trying to turn every OS you meet into the first one you
> met.

You have a serious misconception of what a database kernel is and looks
like.

A database kernel doesn't look like the OS kernel that is mapped to it.  You
write a database kernel API that is idealized for database usage and
provides services specifically designed for the needs of a database.  It is
a high-level API, not a mirror copy of standard OS APIs; if you did that,
you wouldn't have any room to do the database kernel implementation.  You
then build an implementation of the API on the local system using whatever
operating system interfaces suit your fancy.  The API is simple enough and
small enough that this isn't particularly difficult to do in a typical case.
And you can write a default kernel that is portable "as is" to most
operating systems.
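As a rough illustration of the shape being described, the sketch below defines a small database-facing storage API and one portable default implementation behind it. Every name here (BlockStore, PortableFileStore, BLOCK_SIZE) is hypothetical and invented for this example; it is not taken from Oracle or Postgres, and a platform-specific implementation could replace the default without callers noticing.

```python
import os
from abc import ABC, abstractmethod

BLOCK_SIZE = 8192  # assumed block size for this sketch


class BlockStore(ABC):
    """Hypothetical database-facing storage API; callers never see OS details."""

    @abstractmethod
    def read_block(self, block_no: int) -> bytes: ...

    @abstractmethod
    def write_block(self, block_no: int, data: bytes) -> None: ...

    @abstractmethod
    def sync(self) -> None: ...


class PortableFileStore(BlockStore):
    """Default implementation using only ordinary buffered file I/O."""

    def __init__(self, path):
        # Create the file if needed, then reopen it for in-place updates.
        open(path, "ab").close()
        self._file = open(path, "r+b")

    def read_block(self, block_no):
        self._file.seek(block_no * BLOCK_SIZE)
        # Blocks past the end of the file read back as zeros.
        return self._file.read(BLOCK_SIZE).ljust(BLOCK_SIZE, b"\0")

    def write_block(self, block_no, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        self._file.seek(block_no * BLOCK_SIZE)
        self._file.write(data)

    def sync(self):
        self._file.flush()
        os.fsync(self._file.fileno())

    def close(self):
        self._file.close()
```

Code written against BlockStore would not change if a different implementation, for instance one using direct I/O on a given platform, were substituted.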

There is some abstraction in Postgres and the database is well-written, but
it isn't written in a manner that makes it easy to swap out operating system
or API models.  It is written to be portable at all levels.  A database
kernel isn't necessarily required to be portable at the very lowest level,
but it is vastly more optimizable because you aren't forced into a narrow
set of choices for interfacing with the operating system.

Operating system APIs are not particularly well-suited for databases, and if
you force a database to adhere to operating system APIs directly, you end up
with a suboptimal situation almost every single time.  You end up with
implementations that you never would have done if you were targeting the
database for only that platform.  Using a database kernel lets you make
platform specific optimizations and API selections without forcing most of
the database code to be aware of it.

Perhaps more to the point, who gives a damn what optimizations Linus puts in
the Linux kernel?  What good does that do Postgres users on FreeBSD, or OSX,
or Windows?  Abstracting a database engine to a set of operating system APIs
is never going to give stellar or even results across all platforms because
the operating system APIs usually aren't written so that you could write
your database optimally.

Theoretically, it is the difference between middling performance in the
typical case and highly optimal in just about every case.  A database kernel
lets you use an operating system in the way it likes to be used rather than
using an API that you just happen to support.

Cheers,

-James Rogers
jamesr@best.com







Re: Database Kernels and O_DIRECT

From
James Rogers
Date:
On 10/14/03 11:31 PM, "James Rogers" <jamesr@best.com> wrote:
> 
> There is some abstraction in Postgres and the database is well-written, but
> it isn't written in a manner that makes it easy to swap out operating system
> or API models.  It is written to be portable at all levels.  A database
> kernel isn't necessarily required to be portable at the very lowest level,
> but it is vastly more optimizable because you aren't forced into a narrow
> set of choices for interfacing with the operating system.


Just to clarify, my post wasn't really to say that we should run out and
make Postgres use a database kernel type internal model tomorrow.  The point
of all that was that Oracle does things that way for a very good reason and
that there can be benefits that may not be immediately obvious.

It is really one of those emergent "needs" when a database engine gets to a
certain level of sophistication.  For smaller and simpler databases, you
don't really need it and the effort isn't justified.  At some point, you
cross a threshold where not only does it become justified but it becomes a
wise idea or not having it will start to punish you in a number of different
ways.  I personally think that Postgres is sitting on the cusp of "it's a
wise idea", and that it is something worth thinking about in the future.

Cheers,

-James Rogers
jamesr@best.com



Re: Database Kernels and O_DIRECT

From
Bruce Momjian
Date:
Greg Stark wrote:
> 
> James Rogers <jamesr@best.com> writes:
> > >
> > > Someone from Oracle is on there explaining what Oracle's needs are. Perhaps
> > > someone more knowledgeable than myself could explain what would most help
> > > postgres in this area.
> > 
> > 
> > There is an important difference between Oracle and Postgres that makes
> > discussions of this complicated because the assumptions are different.
> 
> All the more reason Postgres's view of the world should maybe be represented
> there. As it turns out Linus seems unsympathetic to the O_DIRECT approach and
> seems more interested in building a better kernel interface to control caching
> and i/o scheduling. Something that fits better with postgres's design than
> Oracle's.

Of course, the big question is why Oracle is even there talking to
Linus, and Linus isn't asking to get PostgreSQL involved.  If you are
running an open-source project, you would think you would give favor to
other open-source projects.  Same with MySQL favoritism --- if you are
writing an open-source tool, why favor a database developed/controlled
by a single company?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Database Kernels and O_DIRECT

From
"Paulo Scardine"
Date:
> Of course, the big question is why Oracle is even there talking to
> Linus, and Linus isn't asking to get PostgreSQL involved.  If you are
> running an open-source project, you would think you would give favor to
> other open-source projects.  Same with MySQL favoritism --- if you are
> writing an open-source tool, why favor a database developed/controlled
> by a single company?

It's the unix style: no message, no error... If Postgres developers do not
send any message to Linus he will think Linux is doing just fine for them.

It seems that Oracle cares about improving their Linux port, so they asked
Linus for some features. I doubt Linus ran to Oracle asking "please, how
could I help you improve your closed software project?". Kernel folks seem
to be very busy people.

IMHO if we see any window for improvement in any OS, we should go to Linus
(or Peter or Bill Gates) and ask for it, as was written in the original post.

Regards,
--
Paulo Scardine




Re: Database Kernels and O_DIRECT

From
Tom Lane
Date:
James Rogers <jamesr@best.com> writes:
> If we suddenly wanted to optimize Postgres for performance the way
> Oracle does, we would be a lot more keen on the O_DIRECT approach.

This isn't ever going to happen, for the simple reason that we don't
have Oracle's manpower.  You are blithely throwing around the phrase
"database kernel" like it would be a small simple project.  In reality
you are talking about (at least) implementing our own complete
filesystem, and then doing it over again on every platform we want to
support, and then after that, optimizing it to the point of actually
being enough better than the native facilities to have been worth the
effort.  I cannot conceive of that happening in a Postgres project that
even remotely resembles the present reality, because we just don't have
the manpower; and what manpower we do have is better spent on other
tasks.  We have other things to do than re-invent the operating system
wheel.  Improving the planner, for example.

One of the first concepts I learned in CS grad school was that of
optimizing a system at multiple levels.  If the hardware guys can build
a 2X faster CPU, and the operating system guys can find a 2X improvement
in (say) filesystem performance, and then the application guys can find
a 2X improvement in their algorithms, you've got 8X total speedup, which
might have been impossible or at least vastly harder to get by working
at only one level of the system.  The lesson for Postgres is that we
should not be trying to beat the operating system guys at their own
game.  It's unclear that we can anyway, and we can certainly get more
bang for our optimization buck by working at system levels that don't
correspond to operating-system concerns.

I tend to agree with the opinion that Oracle's architecture is based on
twenty-year-old assumptions.  Back then it was reasonable to assume that
database-specific algorithms could outperform a general-purpose
operating system.  In today's environment that assumption is not a given.
        regards, tom lane


Re: Database Kernels and O_DIRECT

From
Andrew Dunstan
Date:
Tom Lane wrote:

>James Rogers <jamesr@best.com> writes:
>  
>
>>If we suddenly wanted to optimize Postgres for performance the way
>>Oracle does, we would be a lot more keen on the O_DIRECT approach.
>>    
>>
>
>This isn't ever going to happen, for the simple reason that we don't
>have Oracle's manpower.  
>
[snip - long and sensible elaboration of above statement]

I have wondered (somewhat fruitlessly) for several years about the 
possibilities of special purpose lightweight file systems that could 
relax some of the assumptions and checks used in general purpose file 
systems. Such a thing might provide most of the benefits of a "database 
kernel" without imposing anything extra on the database application layer.

Just a thought - I have no resources to make any attack on such a project.

cheers

andrew



Re: Database Kernels and O_DIRECT

From
Hannu Krosing
Date:
James Rogers kirjutas K, 15.10.2003 kell 11:26:
> On 10/14/03 11:31 PM, "James Rogers" <jamesr@best.com> wrote:
> > 
> > There is some abstraction in Postgres and the database is well-written, but
> > it isn't written in a manner that makes it easy to swap out operating system
> > or API models.  It is written to be portable at all levels.  A database
> > kernel isn't necessarily required to be portable at the very lowest level,
> > but it is vastly more optimizable because you aren't forced into a narrow
> > set of choices for interfacing with the operating system.
> 
> 
> Just to clarify, my post wasn't really to say that we should run out and
> make Postgres use a database kernel type internal model tomorrow.  The point
> of all that was that Oracle does things that way for a very good reason and
> that there can be benefits that may not be immediately obvious.

OTOH, what may be a perfectly good reason for Oracle, may not be it for
PostgreSQL.

For me the beauty of open-source software has always been the possibility
to fix problems at the right level (kernel, library, language), and not to
just make workarounds at another level (your application).

So getting some APIs into the kernel for optimizing cache usage or
writeback strategies would be much better than using raw writes and
rewriting the whole thing ourselves.

The newer linux kernels have several schedulers to choose from, so why not
push for choice in other areas as well?

The ultimate "database kernel" could thus be a custom tuned linux kernel
;)

> It is really one of those emergent "needs" when a database engine gets to a
> certain level of sophistication.  For smaller and simpler databases, you
> don't really need it and the effort isn't justified.  At some point, you
> cross a threshold where not only does it become justified but it becomes a
> wise idea or not having it will start to punish you in a number of different
> ways.  I personally think that Postgres is sitting on the cusp of "it's a
> wise idea", and that it is something worth thinking about in the future.

This thread reminds me of the Linus/Tanenbaum monolithic vs. microkernel
argument: while microkernels are theoretically "better", Linux could
outperform them by having the required modularity at the source level, and
for an open-source project this was enough. It also beat the Mach kernel
by being there, whereas the microkernel-based Mach was too hard to
develop/debug and thus has taken way longer to mature.

--------------
Hannu



Re: Database Kernels and O_DIRECT

From
Sailesh Krishnamurthy
Date:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

    Tom> I tend to agree with the opinion that Oracle's architecture
    Tom> is based on twenty-year-old assumptions.  Back then it was
    Tom> reasonable to assume that database-specific algorithms could
    Tom> outperform a general-purpose operating system.  In today's
    Tom> environment that assumption is not a given.
 


In fact: 
  Michael Stonebraker: Operating System Support for Database Management.
  CACM 24(7): 412-418 (1981)

  Abstract:
            Several operating system services are examined with a
            view toward their applicability to support of database
            management functions. These services include buffer pool
            management; the file system; scheduling, process
            management, and interprocess communication; and
            consistency control.
 

-- 
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh




Re: Database Kernels and O_DIRECT

From
Manfred Spraul
Date:
Andrew Dunstan wrote:

>
> I have wondered (somewhat fruitlessly) for several years about the 
> possibilities of special purpose lightweight file systems that could 
> relax some of the assumptions and checks used in general purpose file 
> systems. Such a thing might provide most of the benefits of a 
> "database kernel" without imposing anything extra on the database 
> application layer.

CPU is usually cheap compared to disk io.

There are two things that might be worth looking into:
Oracle released their cluster filesystem (ocfs) as a GPL driver for 
Linux. It might be interesting to check how it performs if used for 
postgres, but I fear that it implicitly assumes that the bulk of the 
caching is performed by the database in user space.
And using O_DIRECT for the WAL logs - the logs are never read.

--   Manfred




Re: Database Kernels and O_DIRECT

From
Christopher Browne
Date:
andrew@dunslane.net (Andrew Dunstan) writes:
> Tom Lane wrote:
>>James Rogers <jamesr@best.com> writes:
>>>If we suddenly wanted to optimize Postgres for performance the way
>>>Oracle does, we would be a lot more keen on the O_DIRECT approach.
>>This isn't ever going to happen, for the simple reason that we don't
>> have Oracle's manpower.
>>
> [snip - long and sensible elaboration of above statement]
>
> I have wondered (somewhat fruitlessly) for several years about the
> possibilities of special purpose lightweight file systems that could
> relax some of the assumptions and checks used in general purpose file
> systems. Such a thing might provide most of the benefits of a
> "database kernel" without imposing anything extra on the database
> application layer.
>
> Just a thought - I have no resources to make any attack on such a project.

There is an exactly relevant project for this, namely Hans Reiser's
"ReiserFS," on Linux.

http://www.namesys.com/whitepaper.html

In Version 4, they will be exporting an API that allows userspace
applications to control the use of transactional filesystem updates.

If someone were to directly build a database on top of this, one might
wind up with some sort of "ReiserSQL," which would be relatively
analogous to the "database kernel" approach.

Of course, the task would be large, and it would likely take _years_
for it to stabilize to the point of being much more than a "neat
hack."

The other neat approach that would be more relevant to PostgreSQL
would be to create a filesystem that stored data in pure blocks, with
pretty large block sizes, and low overhead for saving directory
metadata.  There isn't too terribly much interest in {a,o,m}time...
-- 
output = reverse("ofni.smrytrebil" "@" "enworbbc")
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)


Re: Database Kernels and O_DIRECT

From
Bruce Momjian
Date:
Tom Lane wrote:
> James Rogers <jamesr@best.com> writes:
> > If we suddenly wanted to optimize Postgres for performance the way
> > Oracle does, we would be a lot more keen on the O_DIRECT approach.
> 
> This isn't ever going to happen, for the simple reason that we don't
> have Oracle's manpower.  You are blithely throwing around the phrase
> "database kernel" like it would be a small simple project.  In reality
> you are talking about (at least) implementing our own complete
> filesystem, and then doing it over again on every platform we want to
> support, and then after that, optimizing it to the point of actually
> being enough better than the native facilities to have been worth the
> effort.  I cannot conceive of that happening in a Postgres project that
> even remotely resembles the present reality, because we just don't have
> the manpower; and what manpower we do have is better spent on other
> tasks.  We have other things to do than re-invent the operating system
> wheel.  Improving the planner, for example.

One question is what a database kernel would look like.  Would it
basically mean just taking our existing portability code, such as for
shared memory, and moving it into a separate library with its own API?
Don't we almost have that already?

I am just confused about what would be different.  I think the only major
difference I have heard is to bypass the OS file system and memory
management.  We already bypass most of the memory management by using
palloc.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073