Thread: [Linus Torvalds ] Re: statfs() / statvfs() syscall ballsup...
There's an interesting thread on linux-kernel right now about O_DIRECT and the kernel i/o APIs databases need. I noticed a connection between what they were discussing and the earlier discussions here and the pining for an interface to avoid having vacuum preempt other disk i/o.

Someone from Oracle is on there explaining what Oracle's needs are. Perhaps someone more knowledgeable than myself could explain what would most help postgres in this area.

There was another thread I commented on that touched on another postgres wishlist item. A way to sync IDE disks reliably without disabling write caching entirely. There was some inkling that newer drives might provide for such a possibility. Perhaps that too could be worth advocating for on postgres's behalf.

On 12 Oct 2003, Greg Stark wrote:
>
> There are other reasons databases want to control their own cache. The application knows more about the usage and the future usage of the data than the kernel does.

But this again is not an argument for not using the page cache - it's only an argument for _telling_ the kernel about its use.

> However on busy servers whenever it's run it causes lots of pain because the kernel flushes all the cached data in favour of the data this job touches.

Yes. But this is actually pretty easy to avoid in-kernel, since all of the LRU logic is pretty localized. It could be done on a per-process thing ("this process should not pollute the active list") or on a per-fd thing ("accesses through this particular open are not to pollute the active list").

> And worse, there's no way to indicate that the i/o it's doing is lower priority, so i/o bound servers get hit dramatically.

IO priorities are pretty much worthless. It doesn't _matter_ if other processes get preferred treatment - what is costly is the latency cost of seeking. What you want is not priorities, but batching.

        Linus

--
greg
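The per-fd hint Linus describes ("accesses through this particular open are not to pollute the active list") resembles what the existing POSIX call posix_fadvise() offers; whether that call matches what he had in mind is an assumption. A minimal Python sketch of a vacuum-like sequential pass that tells the kernel it will not reuse the pages it reads follows. The function name and the 1 MiB chunk size are hypothetical, and the hints are purely advisory, so the kernel is free to ignore them.

```python
import os

CHUNK = 1 << 20  # 1 MiB per read; an arbitrary choice for this sketch


def _advise(fd, offset, length, advice):
    """Pass a cache hint to the kernel; a refused hint is not an error."""
    if hasattr(os, "posix_fadvise"):  # not available on every platform
        try:
            os.posix_fadvise(fd, offset, length, advice)
        except OSError:
            pass  # the hint is advisory, so carry on without it


def scan_without_polluting_cache(path, chunk=CHUNK):
    """Read a file front to back, hinting that its pages will not be reused.

    Returns the number of bytes read.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Length 0 means "to end of file": expect one sequential pass.
        _advise(fd, 0, 0, getattr(os, "POSIX_FADV_SEQUENTIAL", 0))
        while True:
            data = os.read(fd, chunk)
            if not data:
                break
            # The range just read starts at the old value of `total`.
            _advise(fd, total, len(data), getattr(os, "POSIX_FADV_DONTNEED", 0))
            total += len(data)
    finally:
        os.close(fd)
    return total
```

This keeps the data flowing through the page cache, as Linus argues it should, and only changes what the kernel is told about the access pattern.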
On Sun, 2003-10-12 at 15:13, Greg Stark wrote:
> There's an interesting thread on linux-kernel right now about O_DIRECT and the kernel i/o APIs databases need. I noticed a connection between what they were discussing and the earlier discussions here and the pining for an interface to avoid having vacuum preempt other disk i/o.
>
> Someone from Oracle is on there explaining what Oracle's needs are. Perhaps someone more knowledgeable than myself could explain what would most help postgres in this area.

There is an important difference between Oracle and Postgres that makes discussions of this complicated because the assumptions are different. Oracle runs on top of a database kernel, whereas Postgres does not. In the former case, it is very useful and conducive to better performance to have O_DIRECT and direct control of the I/O in general -- the more, the better. In the latter case (e.g. Postgres), it is more of a nuisance and difficult to exploit well.

The point of having a database kernel underneath the DBMS is two-fold.

First, it improves portability by acting as an operating system abstraction layer, replacing OS kernel services with its own equivalents (which may map to any number of mechanisms underneath). It is the reason Oracle is easily supported on so many operating systems; to port to a new OS, they only have to modify the database kernel, and they probably have a highly portable generic version to start with that they can then optimize for a given platform at their leisure. All the rest of Oracle's code only has to compile against and run on the virtual operating system that is their database kernel.

Second, where possible, the database kernel bypasses the OS kernel internally (e.g. O_DIRECT) and implements its own versions of the OS kernel services that are highly-tuned for database purposes. This often has significant performance benefits.
While it kind of looks like an OS on top of an OS, well-written database kernels often tend to exist almost parallel to the system kernel in certain respects, only using the system kernel where it is convenient or for future capabilities that have been stubbed out in the database kernel. Writing DBMS code to a database kernel almost always produces a more scalable system than writing to portable OS APIs because it eliminates the "lowest common denominator" effect.

Having a database kernel isn't really important unless you are a performance junkie or have to address really scalable database systems. Some more advanced DBMS features are easier to implement on a database kernel as a pragmatic concern, because the system model being implemented for is more database friendly. It lets the database take advantage of the more advanced features and optimizations of whatever operating system it is running on without the vast majority of the DBMS code base being aware of these significant differences.

I'd like to see Postgres move to a database kernel eventually for a lot of reasons, but it would be a relatively significant change. Maybe v8? :-)

Cheers,

-James Rogers
 jamesr@best.com
James Rogers <jamesr@best.com> writes:

> > Someone from Oracle is on there explaining what Oracle's needs are. Perhaps someone more knowledgeable than myself could explain what would most help postgres in this area.
>
> There is an important difference between Oracle and Postgres that makes discussions of this complicated because the assumptions are different.

All the more reason Postgres's view of the world should maybe be represented there. As it turns out Linus seems unsympathetic to the O_DIRECT approach and seems more interested in building a better kernel interface to control caching and i/o scheduling. Something that fits better with postgres's design than Oracle's.

> the former case, it is very useful and conducive to better performance to have O_DIRECT and direct control of the I/O in general -- the more, the better. In the latter case (e.g. Postgres), it is more of a nuisance and difficult to exploit well.

Actually I think it would be useful for the WAL. As I understand it there's no point caching the WAL and every write is going to get synced anyways so there's no point in buffering it either. The sooner the process can find out it's been synced the better. But I'm not really 100% up on the way the WAL is used so I could be wrong.

> The point of having a database kernel underneath the DBMS is two-fold.
>
> First, it improves portability by acting as an operating system abstraction layer, replacing OS kernel services with its own equivalents

Bah. So Oracle has to live with whatever OS features VMS had 20 years ago. It has to reimplement whatever I/O scheduling or other strategies it wants. Rather than being the escape from the "lowest common denominator" it is in fact precisely the cause of it.

You describe Postgres as if abstraction is a foreign concept to it. Much better to have well designed minimal abstractions for each of the resources needed, rather than trying to turn every OS you meet into the first one you met.

--
greg
On 10/14/03 8:26 PM, "Greg Stark" <gsstark@mit.edu> wrote:
>
> All the more reason Postgres's view of the world should maybe be represented there. As it turns out Linus seems unsympathetic to the O_DIRECT approach and seems more interested in building a better kernel interface to control caching and i/o scheduling. Something that fits better with postgres's design than Oracle's.

This would certainly help Postgres as currently written, but it won't have the theoretical performance headroom of what Oracle wants. A practical kernel API is too narrow to be fully aware of and exploit database state. And then there is the portability issue...

The way you want these kinds of things implemented in an operating system kernel is somewhat orthogonal to how you want them implemented from the perspective of a database kernel. Typical resource use cases for an operating system and a database engine make pretty different assumptions and the best you'll get is a compromise that doesn't optimize either.

Making additional optimizations to the OS kernel works great for Postgres (on Linux, at least) because currently very little is optimized in this regard. Basically Linus is doing some design optimization work for us. An improvement, but kind of a mediocre one in the big scheme of things and not terribly portable.

If we suddenly wanted to optimize Postgres for performance the way Oracle does, we would be a lot more keen on the O_DIRECT approach.

> Actually I think it would be useful for the WAL. As I understand it there's no point caching the WAL and every write is going to get synced anyways so there's no point in buffering it either. The sooner the process can find out it's been synced the better. But I'm not really 100% up on the way the WAL is used so I could be wrong.

Aye, I think you may be correct.

> Bah. So Oracle has to live with whatever OS features VMS had 20 years ago. It has to reimplement whatever I/O scheduling or other strategies it wants.
> Rather than being the escape from the "lowest common denominator" it is in fact precisely the cause of it.

You appear to have completely missed the point.

The point of the abstraction layer is so they can optimize the hell out of the database for every single platform they support without having to rewrite a bunch of the database every time. The database kernel API is BETTER AND MORE OPTIMAL than the operating system API. It allows them to use whatever memory management scheme, I/O scheme, etc is the best for every single platform. If "the best" happens to be going to the native OS service, then that is what they do, but most of the code doesn't need to know this if the abstraction layer is well-designed.

Most of the code in a DBMS does not care where memory comes from, how it's managed, what the file system actually looks like, or how I/O is done. As long as the behavior is the same from the database kernel API it is writing to, it is all good. What this means from a practical standpoint is that you don't *have* to use SysV IPC on every platform, or POSIX, or mmap, or whatever. You can use whatever that particular platform likes as long as it can be mapped into the database kernel API, which tends to be at a high enough level that just about *any* reasonable implementation of an OS API can be mapped into it with quite a bit of optimization.

> You describe Postgres as if abstraction is a foreign concept to it. Much better to have well designed minimal abstractions for each of the resources needed, rather than trying to turn every OS you meet into the first one you met.

You have a serious misconception of what a database kernel is and looks like. A database kernel doesn't look like the OS kernel that is mapped to it. You write a database kernel API that is idealized for database usage and provides services specifically designed for the needs of a database.
It is a high-level API, not a mirror copy of standard OS APIs; if you did that, you wouldn't have any room to do the database kernel implementation. You then build an implementation of the API on the local system using whatever operating system interfaces suit your fancy. The API is simple enough and small enough that this isn't particularly difficult to do in a typical case. And you can write a default kernel that is portable "as is" to most operating systems.

There is some abstraction in Postgres and the database is well-written, but it isn't written in a manner that makes it easy to swap out operating system or API models. It is written to be portable at all levels. A database kernel isn't necessarily required to be portable at the very lowest level, but it is vastly more optimizable because you aren't forced into a narrow set of choices for interfacing with the operating system.

Operating system APIs are not particularly well-suited for databases, and if you force a database to adhere to operating system APIs directly, you end up with a suboptimal situation almost every single time. You end up with implementations that you never would have done if you were targeting the database for only that platform. Using a database kernel lets you make platform specific optimizations and API selections without forcing most of the database code to be aware of it.

Perhaps more to the point, who gives a damn what optimizations Linus puts in the Linux kernel? What good does that do Postgres users on FreeBSD, or OSX, or Windows? Abstracting a database engine to a set of operating system APIs is never going to give stellar or even results across all platforms because the operating system APIs usually aren't written so that you could write your database optimally. Theoretically, it is the difference between middling performance in the typical case and highly optimal in just about every case.
A database kernel lets you use an operating system in the way it likes to be used rather than using an API that you just happen to support.

Cheers,

-James Rogers
 jamesr@best.com
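The layering James describes (a small database-oriented API, one implementation per platform, DBMS code that never sees the difference) can be sketched roughly as below. This is an illustration only, not Oracle's or Postgres's design: every name here (StorageKernel, PortableFileKernel, InMemoryKernel, store_record) and the 8192-byte page size are hypothetical, and the in-memory class merely stands in for a platform-specific backend such as one built on raw devices or O_DIRECT.

```python
import os
from abc import ABC, abstractmethod

PAGE_SIZE = 8192  # hypothetical page size for this sketch


class StorageKernel(ABC):
    """Hypothetical database-kernel storage API: whole pages in and out."""

    @abstractmethod
    def read_page(self, page_no): ...

    @abstractmethod
    def write_page(self, page_no, data): ...

    @abstractmethod
    def sync(self): ...


class PortableFileKernel(StorageKernel):
    """Default backend: ordinary POSIX file calls through the page cache."""

    def __init__(self, path):
        self._fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)

    def read_page(self, page_no):
        data = os.pread(self._fd, PAGE_SIZE, page_no * PAGE_SIZE)
        return data.ljust(PAGE_SIZE, b"\0")  # unwritten pages read as zeros

    def write_page(self, page_no, data):
        os.pwrite(self._fd, data, page_no * PAGE_SIZE)

    def sync(self):
        os.fsync(self._fd)

    def close(self):
        os.close(self._fd)


class InMemoryKernel(StorageKernel):
    """Stand-in for a platform-specific backend; keeps pages in a dict."""

    def __init__(self):
        self._pages = {}

    def read_page(self, page_no):
        return self._pages.get(page_no, bytes(PAGE_SIZE))

    def write_page(self, page_no, data):
        self._pages[page_no] = bytes(data)

    def sync(self):
        pass  # nothing to flush in this stand-in


def store_record(kernel, page_no, record):
    """'DBMS' code: written once against the API, unaware of the backend."""
    if len(record) > PAGE_SIZE:
        raise ValueError("record does not fit in one page")
    kernel.write_page(page_no, record.ljust(PAGE_SIZE, b"\0"))
    kernel.sync()
    return kernel.read_page(page_no)
```

The same store_record() call works unchanged against either backend, which is the portability half of the argument; the sketch says nothing about whether a custom backend would actually outperform the native facilities, which is the part the rest of the thread disputes.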
On 10/14/03 11:31 PM, "James Rogers" <jamesr@best.com> wrote:
>
> There is some abstraction in Postgres and the database is well-written, but it isn't written in a manner that makes it easy to swap out operating system or API models. It is written to be portable at all levels. A database kernel isn't necessarily required to be portable at the very lowest level, but it is vastly more optimizable because you aren't forced into a narrow set of choices for interfacing with the operating system.

Just to clarify, my post wasn't really to say that we should run out and make Postgres use a database kernel type internal model tomorrow. The point of all that was that Oracle does things that way for a very good reason and that there can be benefits that may not be immediately obvious.

It is really one of those emergent "needs" when a database engine gets to a certain level of sophistication. For smaller and simpler databases, you don't really need it and the effort isn't justified. At some point, you cross a threshold where not only does it become justified but it becomes a wise idea or not having it will start to punish you in a number of different ways. I personally think that Postgres is sitting on the cusp of "it's a wise idea", and that it is something worth thinking about in the future.

Cheers,

-James Rogers
 jamesr@best.com
Greg Stark wrote:
>
> James Rogers <jamesr@best.com> writes:
>
> > > Someone from Oracle is on there explaining what Oracle's needs are. Perhaps someone more knowledgeable than myself could explain what would most help postgres in this area.
> >
> > There is an important difference between Oracle and Postgres that makes discussions of this complicated because the assumptions are different.
>
> All the more reason Postgres's view of the world should maybe be represented there. As it turns out Linus seems unsympathetic to the O_DIRECT approach and seems more interested in building a better kernel interface to control caching and i/o scheduling. Something that fits better with postgres's design than Oracle's.

Of course, the big question is why Oracle is even there talking to Linus, and Linus isn't asking to get PostgreSQL involved. If you are running an open-source project, you would think you would give favor to other open-source projects. Same with MySQL favoritism --- if you are writing an open-source tool, why favor a database developed/controlled by a single company?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
> Of course, the big question is why Oracle is even there talking to Linus, and Linus isn't asking to get PostgreSQL involved. If you are running an open-source project, you would think you would give favor to other open-source projects. Same with MySQL favoritism --- if you are writing an open-source tool, why favor a database developed/controlled by a single company?

It's the unix style: no message, no error... If Postgres developers do not send any message to Linus he will think Linux is doing just fine for them.

Seems that Oracle cares to improve their Linux port so they asked Linus for some features. I doubt Linus ran to Oracle asking "please, how could I help you improve your closed software project?". Kernel folks seem to be very busy people.

IMHO if we see any window for improvement in any OS, we should go to Linus (or Peter or Bill Gates) and ask for it. As was written in the original post.

Regards,
--
Paulo Scardine
James Rogers <jamesr@best.com> writes:
> If we suddenly wanted to optimize Postgres for performance the way Oracle does, we would be a lot more keen on the O_DIRECT approach.

This isn't ever going to happen, for the simple reason that we don't have Oracle's manpower. You are blithely throwing around the phrase "database kernel" like it would be a small simple project. In reality you are talking about (at least) implementing our own complete filesystem, and then doing it over again on every platform we want to support, and then after that, optimizing it to the point of actually being enough better than the native facilities to have been worth the effort. I cannot conceive of that happening in a Postgres project that even remotely resembles the present reality, because we just don't have the manpower; and what manpower we do have is better spent on other tasks. We have other things to do than re-invent the operating system wheel. Improving the planner, for example.

One of the first concepts I learned in CS grad school was that of optimizing a system at multiple levels. If the hardware guys can build a 2X faster CPU, and the operating system guys can find a 2X improvement in (say) filesystem performance, and then the application guys can find a 2X improvement in their algorithms, you've got 8X total speedup, which might have been impossible or at least vastly harder to get by working at only one level of the system. The lesson for Postgres is that we should not be trying to beat the operating system guys at their own game. It's unclear that we can anyway, and we can certainly get more bang for our optimization buck by working at system levels that don't correspond to operating-system concerns.

I tend to agree with the opinion that Oracle's architecture is based on twenty-year-old assumptions. Back then it was reasonable to assume that database-specific algorithms could outperform a general-purpose operating system.
In today's environment that assumption is not a given.

            regards, tom lane
Tom Lane wrote:

> James Rogers <jamesr@best.com> writes:
>
> > If we suddenly wanted to optimize Postgres for performance the way Oracle does, we would be a lot more keen on the O_DIRECT approach.
>
> This isn't ever going to happen, for the simple reason that we don't have Oracle's manpower.

[snip - long and sensible elaboration of above statement]

I have wondered (somewhat fruitlessly) for several years about the possibilities of special purpose lightweight file systems that could relax some of the assumptions and checks used in general purpose file systems. Such a thing might provide most of the benefits of a "database kernel" without imposing anything extra on the database application layer.

Just a thought - I have no resources to make any attack on such a project.

cheers

andrew
James Rogers wrote on Wed, 15.10.2003 at 11:26:
> On 10/14/03 11:31 PM, "James Rogers" <jamesr@best.com> wrote:
> >
> > There is some abstraction in Postgres and the database is well-written, but it isn't written in a manner that makes it easy to swap out operating system or API models. It is written to be portable at all levels. A database kernel isn't necessarily required to be portable at the very lowest level, but it is vastly more optimizable because you aren't forced into a narrow set of choices for interfacing with the operating system.
>
> Just to clarify, my post wasn't really to say that we should run out and make Postgres use a database kernel type internal model tomorrow. The point of all that was that Oracle does things that way for a very good reason and that there can be benefits that may not be immediately obvious.

OTOH, what may be a perfectly good reason for Oracle, may not be it for PostgreSQL.

For me the beauty of OS software has always been the possibility to fix problems at the right level (kernel, library, language), and not to just make workarounds at another level (your application).

So getting some APIs into kernel for optimizing cache usage or writeback strategies would be much better than using raw writes and rewriting the whole thing ourselves.

The newer linux kernels have several schedulers to choose from, why not push for choice in other areas as well.

The ultimate "database kernel" could thus be a custom tuned linux kernel ;)

> It is really one of those emergent "needs" when a database engine gets to a certain level of sophistication. For smaller and simpler databases, you don't really need it and the effort isn't justified. At some point, you cross a threshold where not only does it become justified but it becomes a wise idea or not having it will start to punish you in a number of different ways.
> I personally think that Postgres is sitting on the cusp of "it's a wise idea", and that it is something worth thinking about in the future.

This thread reminds me of Linus/Tanenbaum Monolithic vs. Microkernel argument - while theoretically Microkernels are "better" Linux could outperform it by having the required modularity on source level, and being an open-source project this was enough. It also beat the Mach kernel by being there whereas microkernel based Mach was too hard to develop/debug and thus has taken way longer to mature.

--------------
Hannu
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

    Tom> I tend to agree with the opinion that Oracle's architecture is based on twenty-year-old assumptions. Back then it was reasonable to assume that database-specific algorithms could outperform a general-purpose operating system. In today's environment that assumption is not a given.

In fact:

Michael Stonebraker: Operating System Support for Database Management. CACM 24(7): 412-418 (1981)

Abstract: Several operating system services are examined with a view toward their applicability to support of database management functions. These services include buffer pool management; the file system; scheduling, process management, and interprocess communication; and consistency control.

--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh
Andrew Dunstan wrote:

>
> I have wondered (somewhat fruitlessly) for several years about the possibilities of special purpose lightweight file systems that could relax some of the assumptions and checks used in general purpose file systems. Such a thing might provide most of the benefits of a "database kernel" without imposing anything extra on the database application layer.

CPU is usually cheap compared to disk io.

There are two things that might be worth looking into:

Oracle released their cluster filesystem (ocfs) as a GPL driver for Linux. It might be interesting to check how it performs if used for postgres, but I fear that it implicitly assumes that the bulk of the caching is performed by the database in user space.

And using O_DIRECT for the WAL logs - the logs are never read.

--
Manfred
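The WAL idea raised by Greg and repeated here (no point caching or buffering a log that is synced on every write and rarely read back) can be sketched in Python as below. This is not how Postgres writes its WAL; append_wal_block and the 4096-byte BLOCK are hypothetical, and the real alignment requirement for O_DIRECT depends on the device and filesystem. O_DIRECT is Linux-specific and some filesystems reject it, so the sketch falls back to a plain synchronous (O_DSYNC) write when the direct attempt fails.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment unit; the real one depends on device and filesystem


def append_wal_block(path, payload):
    """Append one zero-padded block; return (bytes_written, used_o_direct)."""
    if len(payload) > BLOCK:
        raise ValueError("payload larger than one block")
    buf = mmap.mmap(-1, BLOCK)  # anonymous mapping, so the buffer is page-aligned
    buf[:len(payload)] = payload
    # O_DSYNC: write() returns only after the data has been passed to the device.
    base = os.O_WRONLY | os.O_CREAT | os.O_APPEND | getattr(os, "O_DSYNC", 0)
    attempts = [base]
    if hasattr(os, "O_DIRECT"):  # not defined on every platform
        attempts.insert(0, base | os.O_DIRECT)
    try:
        for flags in attempts:
            try:
                fd = os.open(path, flags, 0o600)
            except OSError:
                continue  # e.g. the filesystem rejects O_DIRECT
            try:
                return os.write(fd, buf), flags != base
            except OSError:
                continue  # e.g. alignment rules not met; retry without O_DIRECT
            finally:
                os.close(fd)
        raise OSError("could not append WAL block to %r" % path)
    finally:
        buf.close()
```

Calling append_wal_block("wal.bin", b"commit 1") appends one full block per call, which keeps the file offset block-aligned for the next direct write. A drive with write caching enabled may still hold the data in its own cache after the call returns, which is the IDE problem mentioned at the start of the thread.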
andrew@dunslane.net (Andrew Dunstan) writes:
> Tom Lane wrote:
> > James Rogers <jamesr@best.com> writes:
> > > If we suddenly wanted to optimize Postgres for performance the way Oracle does, we would be a lot more keen on the O_DIRECT approach.
> > This isn't ever going to happen, for the simple reason that we don't have Oracle's manpower.
> >
> [snip - long and sensible elaboration of above statement]
>
> I have wondered (somewhat fruitlessly) for several years about the possibilities of special purpose lightweight file systems that could relax some of the assumptions and checks used in general purpose file systems. Such a thing might provide most of the benefits of a "database kernel" without imposing anything extra on the database application layer.
>
> Just a thought - I have no resources to make any attack on such a project.

There is an exactly relevant project for this, namely Hans Reiser's "ReiserFS," on Linux.

http://www.namesys.com/whitepaper.html

In Version 4, they will be exporting an API that allows userspace applications to control the use of transactional filesystem updates.

If someone were to directly build a database on top of this, one might wind up with some sort of "ReiserSQL," which would be relatively analogous to the "database kernel" approach.

Of course, the task would be large, and it would likely take _years_ for it to stabilize to the point of being much more than a "neat hack."

The other neat approach that would be more relevant to PostgreSQL would be to create a filesystem that stored data in pure blocks, with pretty large block sizes, and low overhead for saving directory metadata. There isn't too terribly much interest in {a,o,m}time...

--
output = reverse("ofni.smrytrebil" "@" "enworbbc")
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)
Tom Lane wrote:
> James Rogers <jamesr@best.com> writes:
> > If we suddenly wanted to optimize Postgres for performance the way Oracle does, we would be a lot more keen on the O_DIRECT approach.
>
> This isn't ever going to happen, for the simple reason that we don't have Oracle's manpower. You are blithely throwing around the phrase "database kernel" like it would be a small simple project. In reality you are talking about (at least) implementing our own complete filesystem, and then doing it over again on every platform we want to support, and then after that, optimizing it to the point of actually being enough better than the native facilities to have been worth the effort. I cannot conceive of that happening in a Postgres project that even remotely resembles the present reality, because we just don't have the manpower; and what manpower we do have is better spent on other tasks. We have other things to do than re-invent the operating system wheel. Improving the planner, for example.

One question is what a database kernel would look like? Would it basically mean just taking our existing portability code, such as for shared memory, and moving it into a separate library with its own API? Don't we almost have that already? I am just confused what would be different? I think the only major difference I have heard is to bypass the OS file system and memory management. We already bypass most of the memory management by using palloc.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073