Thread: Anything to be gained from a 'Postgres Filesystem'?
I suppose I'm just idly wondering really. Clearly it's against PG philosophy to build an FS or direct IO management into PG, but now that it's so relatively easy to plug filesystems into the main open-source OSes, it struck me that there might be some useful changes to, say, XFS or ext3, that could be made that would help PG out.

I'm thinking along the lines of an FS that's aware of PG's strategies and requirements and therefore optimised to make those activities as efficient as possible - possibly even being aware of PG's disk layout and treating files differently on that basis.

Not being an FS guru I'm not really clear on whether this would help much (enough to be worth it anyway) or not - any thoughts? And if there were useful gains to be had, would it need a whole new FS or could an existing one be modified?

So there might be (as I said, I'm not an FS guru...):
* great append performance for the WAL?
* optimised scattered writes for checkpointing?
* knowledge that FSYNC is being used for preserving ordering a lot of the time, rather than requiring actual writes to disk (so long as the writes eventually happen in order...)?

Matt

Matt Clark
Ymogen Ltd
P: 0845 130 4531
W: https://ymogen.net/
M: 0774 870 1584
Hiya,

Looking at that list, I got the feeling that you'd want to push that PG-awareness down into the block-io layer as well, then, so as to be able to optimise for (perhaps) conflicting goals depending on what the app does; for the IO system to be able to read the app's mind it needs to have some knowledge of what the app is / needs / wants, and I get the impression that this awareness needs to go deeper than the FS only.

--Tim

(But you might have time to rewrite Linux/BSD as a PG-OS? just kidding!)
Reiser4?

On Thu, 21 Oct 2004 08:58:01 +0100, Matt Clark <matt@ymogen.net> wrote:
> It struck me that there might be some useful changes to, say, XFS or ext3,
> that could be made that would help PG out.
> Looking at that list, I got the feeling that you'd want to
> push that PG-awareness down into the block-io layer as well,
> then, so as to be able to optimise for (perhaps) conflicting
> goals depending on what the app does; for the IO system to be
> able to read the apps mind it needs to have some knowledge of
> what the app is / needs / wants and I get the impression that
> this awareness needs to go deeper than the FS only.

That's a fair point, it would need to be a kernel patch really, although not necessarily a very big one; more a case of looking at FDs and, if they're flagged in some way, getting the PGfs to do the job instead of/as well as the normal code path.
On Thu, Oct 21, 2004 at 08:58:01AM +0100, Matt Clark wrote:
> I suppose I'm just idly wondering really. Clearly it's against PG
> philosophy to build an FS or direct IO management into PG, but now it's so
> relatively easy to plug filesystems into the main open-source Oses, It
> struck me that there might be some useful changes to, say, XFS or ext3, that
> could be made that would help PG out.

This really sounds like a poor replacement for just making PostgreSQL use raw devices to me. (I have no idea why that isn't done already, but presumably it isn't all that easy to get right. :-) )

/* Steinar */
-- 
Homepage: http://www.sesse.net/
Hi,

I guess the difference is in 'severe hacking inside PG' vs. 'some unknown amount of hacking that doesn't touch PG code'.

Hacking PG internally to handle raw devices will meet with strong resistance from large portions of the development team. I don't expect (m)any core devs of PG will be excited about rewriting the entire I/O architecture of PG and duplicating large amounts of OS type of code inside the application, just to try to attain an unknown performance benefit.

PG doesn't use one big file, as some databases do, but many small files. Now PG would need to be able to do file-management, if you put the PG database on a raw disk partition! That's icky stuff, and you'll find much resistance against putting such code inside PG.

So why not try to have the external FS know a bit about PG and its directory-layout, and its IO requirements? Then such type of code can at least be maintained outside the application, and will not be as much of a burden to the rest of the application.

(I'm not sure if it's a good idea to create a PG-specific FS in your OS of choice, but it's certainly gonna be easier than getting FS code inside of PG)

cheers,

--Tim
The intuitive thing would be to put pg into a file system.

/Aaron

On Thu, 21 Oct 2004 12:44:10 +0200, Leeuw van der, Tim <tim.leeuwvander@nl.unisys.com> wrote:
> (I'm not sure if it's a good idea to create a PG-specific FS in your OS of
> choice, but it's certainly gonna be easier than getting FS code inside of PG)
Matt Clark wrote:
> I'm thinking along the lines of an FS that's aware of PG's strategies and
> requirements and therefore optimised to make those activities as efiicient
> as possible - possibly even being aware of PG's disk layout and treating
> files differently on that basis.

As someone else noted, this doesn't belong in the filesystem (rather the kernel's block I/O layer/buffer cache). But I agree, an API by which we can tell the kernel what kind of I/O behavior to expect would be good. The kernel needs to provide good behavior for a wide range of applications, but the DBMS can take advantage of a lot of domain-specific information. In theory, being able to pass that domain-specific information on to the kernel would mean we could get better performance without needing to reimplement large chunks of functionality that really ought to be done by the kernel anyway (as implementing raw I/O would require, for example). On the other hand, it would probably mean adding a fair bit of OS-specific hackery, which we've largely managed to avoid in the past.

The closest API to what you're describing that I'm aware of is posix_fadvise(). While that is technically-speaking a POSIX standard, it is not widely implemented (I know Linux 2.6 implements it; based on some quick googling, it looks like AIX does too). Using posix_fadvise() has been discussed in the past, so you might want to search the archives.

We could use FADV_SEQUENTIAL to request more aggressive readahead on a file that we know we're about to sequentially scan. We might be able to use FADV_NOREUSE on the WAL. We might be able to get away with specifying FADV_RANDOM for indexes all of the time, or at least most of the time. One question is how this would interact with concurrent access (AFAICS there is no way to fetch the "current advice" on an fd...)

Also, I would imagine Win32 provides some means to inform the kernel about your expected I/O pattern, but I haven't checked. Does anyone know of any other relevant APIs?

-Neil
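As an illustration of the hints suggested above, here is a minimal sketch using Python's os.posix_fadvise(), which wraps the POSIX call on platforms that provide it (Linux, for instance). The ADVICE table and the advise() and scan() helpers are hypothetical names made up for this example; the pairing of access patterns to advice values simply follows the suggestions in the message and is not taken from any PostgreSQL code.

```python
import os

# Hypothetical mapping from access pattern to advice value. The pairing
# follows the suggestions in the message above, not PostgreSQL itself.
ADVICE = {
    "seqscan": os.POSIX_FADV_SEQUENTIAL,  # ask for more aggressive readahead
    "index": os.POSIX_FADV_RANDOM,        # expect scattered accesses
    "wal": os.POSIX_FADV_NOREUSE,         # data is accessed once, not reused
}


def advise(fd, pattern):
    """Hint the kernel about the expected access pattern for a whole file."""
    # offset=0 with len=0 covers the file from the start to its end.
    os.posix_fadvise(fd, 0, 0, ADVICE[pattern])


def scan(path, block_size=8192):
    """Read a file front to back after issuing the sequential hint.

    Returns the number of bytes read.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        advise(fd, "seqscan")
        total = 0
        while True:
            block = os.read(fd, block_size)
            if not block:
                break
            total += len(block)
        return total
    finally:
        os.close(fd)
```

The hints are advisory only: the kernel is free to ignore them, and the sketch does not answer the open question about concurrent access, since the advice is set per open file description.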
On Thu, Oct 21, 2004 at 12:44:10PM +0200, Leeuw van der, Tim wrote:
> Hacking PG internally to handle raw devices will meet with strong
> resistance from large portions of the development team. I don't expect
> (m)any core devs of PG will be excited about rewriting the entire I/O
> architecture of PG and duplicating large amounts of OS type of code inside
> the application, just to try to attain an unknown performance benefit.

Well, at least I see people claiming >30% difference between different file systems, but no, I'm not shouting "bah, you'd better do this or I'll warez Oracle" :-) I have no idea how much you can improve over the "best" filesystems out there, but having two layers of journalling (both WAL _and_ FS journalling) on top of each other don't make all that much sense to me. :-)

/* Steinar */
-- 
Homepage: http://www.sesse.net/
"Steinar H. Gunderson" <sgunderson@bigfoot.com> writes:
> ... I have no idea how much you can improve over the "best"
> filesystems out there, but having two layers of journalling (both WAL _and_
> FS journalling) on top of each other don't make all that much sense to me.

Which is why setting the FS to journal metadata but not file contents is often suggested as best practice for a PG-only filesystem.

regards, tom lane
Neil Conway wrote:
> Also, I would imagine Win32 provides some means to inform the kernel
> about your expected I/O pattern, but I haven't checked. Does anyone know
> of any other relevant APIs?

See CreateFile, Parameter dwFlagsAndAttributes
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/base/createfile.asp

There is FILE_FLAG_NO_BUFFERING, FILE_FLAG_OPEN_NO_RECALL, FILE_FLAG_RANDOM_ACCESS and even FILE_FLAG_POSIX_SEMANTICS

Jan
On Thu, Oct 21, 2004 at 10:20:55AM -0400, Tom Lane wrote:
>> ... I have no idea how much you can improve over the "best"
>> filesystems out there, but having two layers of journalling (both WAL _and_
>> FS journalling) on top of each other don't make all that much sense to me.
> Which is why setting the FS to journal metadata but not file contents is
> often suggested as best practice for a PG-only filesystem.

Mm, but you still journal the metadata. Oh well, noatime etc.. :-)

By the way, I'm probably hitting a FAQ here, but would O_DIRECT help PostgreSQL any, given large enough shared_buffers?

/* Steinar */
-- 
Homepage: http://www.sesse.net/
> As someone else noted, this doesn't belong in the filesystem (rather
> the kernel's block I/O layer/buffer cache). But I agree, an API by
> which we can tell the kernel what kind of I/O behavior to expect would
> be good.
[snip]
> The closest API to what you're describing that I'm aware of is
> posix_fadvise(). While that is technically-speaking a POSIX standard,
> it is not widely implemented (I know Linux 2.6 implements it; based on
> some quick googling, it looks like AIX does too).

Don't forget about the existence and usefulness of the widely implemented madvise(2)/posix_madvise(2) calls, which can give the OS the following hints: MADV_NORMAL, MADV_SEQUENTIAL, MADV_RANDOM, MADV_WILLNEED, MADV_DONTNEED, and MADV_FREE. :)

-sc

-- 
Sean Chittenden
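For a memory-mapped file, the madvise() hint mentioned above can be sketched as follows. This assumes Python 3.8 or later, where mmap objects expose madvise() and the MADV_* constants on systems that support them; checksum_mapped() is a hypothetical helper written only for this illustration.

```python
import mmap


def checksum_mapped(path):
    """Sum the bytes of a non-empty file through a read-only mapping.

    The mapping is flagged as sequentially accessed before it is walked.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; this raises ValueError if it is empty.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # madvise() and MADV_SEQUENTIAL are only present where the
            # platform and Python version provide them, so check first.
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                m.madvise(mmap.MADV_SEQUENTIAL)
            total = 0
            # Walk the mapping one page at a time, front to back.
            for offset in range(0, len(m), mmap.PAGESIZE):
                total += sum(m[offset:offset + mmap.PAGESIZE])
            return total
```

Unlike posix_fadvise(), this hint applies to a mapped address range rather than to a file descriptor, so it only helps code that reads through mmap in the first place.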
Note that most people are now moving away from raw devices for databases in most applications. The relatively small performance gain isn't worth the hassles.

On Thu, Oct 21, 2004 at 12:27:27PM +0200, Steinar H. Gunderson wrote:
> This really sounds like a poor replacement for just making PostgreSQL use raw
> devices to me. (I have no idea why that isn't done already, but presumably it
> isn't all that easy to get right. :-) )

-- 
Jim C. Nasby, Database Consultant decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
Hi, Leeuw,

On Thu, 21 Oct 2004 12:44:10 +0200 "Leeuw van der, Tim" <tim.leeuwvander@nl.unisys.com> wrote:
> (I'm not sure if it's a good idea to create a PG-specific FS in your
> OS of choice, but it's certainly gonna be easier than getting FS code
> inside of PG)

I don't think PG really needs a specific FS. I rather think that PG could profit from some functionality that's missing in traditional UN*X file systems.

posix_fadvise(2) may be a candidate. Read/write barriers are another one, as well as syncing a bunch of data in different files with a single call (so that the OS can determine the best write order). I can also imagine some interaction with the FS journalling system (to avoid duplicate efforts).

We should create a list of those needs, and then communicate those to the kernel/fs developers. Then we (as well as other apps) can make use of those features where they are available, and use the old way everywhere else.

Maybe Reiser4 is a step in the right direction, and maybe even a postgres plugin for Reiser4 will be worth the effort. Maybe XFS/JFS etc. already have such capabilities. Maybe that's completely wrong.

cheers,
Markus

-- 
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com
> posix_fadvise(2) may be a candidate. Read/write barriers are another one, as
> well as syncing a bunch of data in different files with a single call
> (so that the OS can determine the best write order). I can also imagine
> some interaction with the FS journalling system (to avoid duplicate
> efforts).

There is also the fact that syncing after every transaction could be changed to syncing every N transactions (N fixed or depending on the data size written by the transactions), which would be more efficient than the current behaviour with a sleep. HOWEVER, suppressing the sleep() would lead to postgres returning from the COMMIT while it is in fact not synced, which somehow rings a huge alarm bell somewhere.

What about read order? This could be very useful for SELECT queries involving indexes, which in case of a non-clustered table lead to random seeks in the table. There's fadvise to tell the OS to readahead on a seq scan (I think the OS detects it anyway), but what if there was a system call telling the OS "in the next seconds I'm going to read these chunks of data from this file (gives a list of offsets and lengths), could you put them in your cache in the most efficient order without seeking too much, so that when I read() them in random order, they will be in the cache already?". This would be an asynchronous call which would return immediately, just queuing up the data somewhere in the kernel, and maybe sending a signal to the application when a certain percentage of the data has been cached.

PG could take advantage of this without many code changes, simply by putting a fifo between the index scan and the tuple fetches, to wait the time necessary for the OS to have enough reads to cluster them efficiently.

On very large tables this would maybe not gain much, but on tables which are explicitly clustered, or naturally clustered like accessing an index on a serial primary key in order, it could be interesting.

Just a thought.
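The "sync every N transactions" idea can be sketched without the alarm-bell problem if a commit is only reported as durable once the batch containing it has been fsync()ed. The GroupCommitLog class below is a hypothetical toy written for this illustration; it is not how PostgreSQL writes its WAL, and the batch size and record format are arbitrary assumptions.

```python
import os


class GroupCommitLog:
    """Toy append-only log that issues one fsync() per batch of records."""

    def __init__(self, path, batch_size):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.batch_size = batch_size
        self.pending = []    # transaction ids written but not yet synced
        self.sync_count = 0  # number of fsync() calls issued so far

    def commit(self, txid, payload):
        """Append one record; return the txids that just became durable."""
        data = payload + b"\n"
        while data:
            # os.write() may write only part of the buffer, so loop.
            data = data[os.write(self.fd, data):]
        self.pending.append(txid)
        if len(self.pending) >= self.batch_size:
            return self.flush()
        return []  # written, but NOT yet safe to acknowledge

    def flush(self):
        """Force pending records to disk and hand back their txids."""
        if not self.pending:
            return []
        os.fsync(self.fd)
        self.sync_count += 1
        durable, self.pending = self.pending, []
        return durable

    def close(self):
        """Sync whatever is still pending, then close the file."""
        self.flush()
        os.close(self.fd)
```

The caller must hold back the COMMIT reply for a transaction until its id comes back from commit() or flush(); the price of fewer syncs is that a transaction may wait for its batch to fill.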
lists@boutiquenumerique.com (Pierre-Frédéric Caillaud) writes:
> What about read order ?
> This could be very useful for SELECT queries involving
> indexes, which in case of a non-clustered table lead to random seeks
> in the table.

Another thing that would be valuable would be to have some way to say:

  "Read this data; don't bother throwing other data out of the cache
   to stuff this in."

Something like a "read_uncached()" call...

That would mean that a seq scan or a vacuum wouldn't force useful data out of cache.

-- 
let name="cbbrowne" and tld="cbbrowne.com" in String.concat "@" [name;tld];;
http://www.ntlug.org/~cbbrowne/linuxxian.html
A VAX is virtually a computer, but not quite.
On Thu, 2004-11-04 at 15:47, Chris Browne wrote:
> Another thing that would be valuable would be to have some way to say:
>
>   "Read this data; don't bother throwing other data out of the cache
>    to stuff this in."
>
> Something like a "read_uncached()" call...
>
> That would mean that a seq scan or a vacuum wouldn't force useful data
> out of cache.

ARC does almost exactly those two things in 8.0.

Seq scans do get put in cache, but in a way that means they don't spoil the main bulk of the cache.

-- 
Best Regards, Simon Riggs
On Thu, Nov 04, 2004 at 10:47:31AM -0500, Chris Browne wrote:
> Another thing that would be valuable would be to have some way to say:
>
>   "Read this data; don't bother throwing other data out of the cache
>    to stuff this in."
>
> Something like a "read_uncached()" call...

You mean, like, open(filename, O_DIRECT)? :-)

/* Steinar */
-- 
Homepage: http://www.sesse.net/
Simon Riggs <simon@2ndquadrant.com> writes:
> On Thu, 2004-11-04 at 15:47, Chris Browne wrote:
>> Something like a "read_uncached()" call...
>>
>> That would mean that a seq scan or a vacuum wouldn't force useful data
>> out of cache.
> ARC does almost exactly those two things in 8.0.

But only for Postgres' own shared buffers. The kernel cache still gets trashed, because we have no way to suggest to the kernel that it not hang onto the data read in.

regards, tom lane
On Thu, 2004-11-04 at 19:34, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > ARC does almost exactly those two things in 8.0.
>
> But only for Postgres' own shared buffers. The kernel cache still gets
> trashed, because we have no way to suggest to the kernel that it not
> hang onto the data read in.

I guess a difference in viewpoints. I'm inclined to give most of the RAM to PostgreSQL, since as you point out, the kernel is out of our control. That way, we can do what we like with it - keep it or not, as we choose.

-- 
Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> On Thu, 2004-11-04 at 19:34, Tom Lane wrote:
>> But only for Postgres' own shared buffers. The kernel cache still gets
>> trashed, because we have no way to suggest to the kernel that it not
>> hang onto the data read in.
> I guess a difference in viewpoints. I'm inclined to give most of the RAM
> to PostgreSQL, since as you point out, the kernel is out of our control.
> That way, we can do what we like with it - keep it or not, as we choose.

That's always been a Bad Idea for three or four different reasons, of which ARC will eliminate no more than one.

regards, tom lane
In an attempt to throw the authorities off his trail, schabios@logi-track.com (Markus Schaber) transmitted:
> We should create a list of those needs, and then communicate those
> to the kernel/fs developers. Then we (as well as other apps) can
> make use of those features where they are available, and use the old
> way everywhere else.

Which kernel/fs developers did you have in mind? The ones working on Linux? Or FreeBSD? Or DragonflyBSD? Or Solaris? Or AIX?

Please keep in mind that many of the PostgreSQL developers are BSD folk that aren't particularly interested in creating bleeding edge Linux capabilities.

Furthermore, I'd think long and hard before jumping into such a _spectacularly_ bleeding edge kind of project. The reason why you would want this would be if you needed to get some margin of performance. I can't see wanting that without also wanting some _assurance_ of system reliability, at which point I also want things like vendor support.

If you've ever contacted Red Hat Software, you'd know that they very nearly refuse to provide support for any filesystem other than ext3. Use anything else and they'll make noises about not being able to assure you of anything at all.

If you need high performance, you'd also want to use interesting sorts of hardware. Disk arrays, RAID controllers, that sort of thing. Vendors of such things don't particularly want to talk to you unless you're using a "supported" Linux distribution and a "supported" filesystem.

Jumping into a customized filesystem that neither hardware nor software vendors would remotely consider supporting just doesn't look like a viable strategy to me.

> Maybe Reiser4 is a step into the right way, and maybe even a
> postgres plugin for Reiser4 will be worth the effort. Maybe XFS/JFS
> etc. already have such capabilities. Maybe that's completely wrong.

The capabilities tend to be redundant. They tend to implement vaguely similar transactional capabilities to what databases have to implement. The similarities are not close enough to eliminate either variety of "commit" as redundant.

-- 
"cbbrowne","@","linuxfinances.info"
http://linuxfinances.info/info/linux.html
Rules of the Evil Overlord #128. "I will not employ robots as agents of destruction if there is any possible way that they can be re-programmed or if their battery packs are externally mounted and easily removable." <http://www.eviloverlord.com/>
After a long battle with technology, simon@2ndquadrant.com (Simon Riggs), an earthling, wrote:
> On Thu, 2004-11-04 at 15:47, Chris Browne wrote:
>> Something like a "read_uncached()" call...
>>
>> That would mean that a seq scan or a vacuum wouldn't force useful
>> data out of cache.
>
> ARC does almost exactly those two things in 8.0.
>
> Seq scans do get put in cache, but in a way that means they don't
> spoil the main bulk of the cache.

We're not talking about the same cache.

ARC does these exact things for _shared memory_ cache, and is the obvious inspiration.

But it does more or less nothing about the way OS file buffer cache is managed, and the handling of _that_ would be the point of modifying OS filesystem semantics.

-- 
select 'cbbrowne' || '@' || 'linuxfinances.info';
http://www3.sympatico.ca/cbbrowne/oses.html
Have you ever considered beating yourself with a cluestick?
On Fri, 2004-11-05 at 06:20, Steinar H. Gunderson wrote:
> You mean, like, open(filename, O_DIRECT)? :-)

This disables readahead (at least on Linux), which is certainly not what we want: for the very case where we don't want to keep the data in cache for a while (sequential scans, VACUUM), we also want aggressive readahead.

-Neil
On Thu, 2004-11-04 at 23:29, Pierre-Frédéric Caillaud wrote:
> There is also the fact that syncing after every transaction could be
> changed to syncing every N transactions (N fixed or depending on the data
> size written by the transactions) which would be more efficient than the
> current behaviour with a sleep.

Uh, which "sleep" are you referring to?

Also, how would interacting with the filesystem's journal affect how often we need to force-write the WAL to disk? (ISTM we need to sync _something_ to disk when a transaction commits in order to maintain the WAL invariant.)

> There's fadvise to tell the OS to readahead on a seq scan (I think the OS
> detects it anyway)

Not perfectly, though; also, Linux will do a more aggressive readahead if you tell it to do so via posix_fadvise().

> if there was a system call telling the OS "in the
> next seconds I'm going to read these chunks of data from this file (gives
> a list of offsets and lengths), could you put them in your cache in the
> most efficient order without seeking too much, so that when I read() them
> in random order, they will be in the cache already ?".

http://www.opengroup.org/onlinepubs/009695399/functions/posix_fadvise.html

POSIX_FADV_WILLNEED
    Specifies that the application expects to access the specified data in the near future.

-Neil
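The "list of offsets and lengths" request quoted above can be approximated with one POSIX_FADV_WILLNEED call per chunk, as in this minimal sketch. The prefetch() and read_chunks() helpers are hypothetical names for this example, and sorting the hints by offset is only an assumption about what might help; whether and in what order the kernel actually fetches the data is up to the kernel.

```python
import os


def prefetch(fd, chunks):
    """Tell the kernel that each (offset, length) chunk will be read soon."""
    # Issuing the hints in offset order is a guess at what helps; the
    # kernel may schedule the underlying reads however it likes.
    for offset, length in sorted(chunks):
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)


def read_chunks(path, chunks):
    """Hint all chunks first, then read them in the caller's own order."""
    fd = os.open(path, os.O_RDONLY)
    try:
        prefetch(fd, chunks)
        # pread() takes an explicit offset, so chunk order does not matter.
        return [os.pread(fd, length, offset) for offset, length in chunks]
    finally:
        os.close(fd)
```

This gives the asynchronous "start loading these" half of the proposal, but not the notification half: there is no signal when the data has arrived, so a later read() may still block.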
On Fri, 2004-11-05 at 02:47, Chris Browne wrote:
> Another thing that would be valuable would be to have some way to say:
>
>   "Read this data; don't bother throwing other data out of the cache
>    to stuff this in."

This is similar, although not exactly the same thing:

http://www.opengroup.org/onlinepubs/009695399/functions/posix_fadvise.html

POSIX_FADV_NOREUSE
    Specifies that the application expects to access the specified data once and then not reuse it thereafter.

-Neil
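A rough approximation of the "read_uncached()" wish, sketched under assumptions: the file is flagged with POSIX_FADV_NOREUSE up front, and each block is handed back with POSIX_FADV_DONTNEED once it has been consumed. POSIX_FADV_DONTNEED is not mentioned in the thread; it is brought in here as a related hint from the same POSIX interface. scan_once() is a hypothetical helper, and how much either hint achieves depends on the kernel.

```python
import os


def scan_once(path, block_size=1 << 20):
    """Read a file once, hinting that its pages need not stay cached.

    Returns the number of bytes scanned.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # Whole-file hint (len=0 means "to end of file"): data is used once.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_NOREUSE)
        offset = 0
        while True:
            block = os.read(fd, block_size)
            if not block:
                break
            # ... the caller would process `block` here ...
            # Advisory only: suggest the kernel may drop these pages now.
            os.posix_fadvise(fd, offset, len(block), os.POSIX_FADV_DONTNEED)
            offset += len(block)
        return offset
    finally:
        os.close(fd)
```

Unlike O_DIRECT, this keeps ordinary buffered reads and therefore kernel readahead, which matches the stated goal for sequential scans and VACUUM. It does not prevent the pages from briefly displacing other cached data before they are released.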
Hi, Christopher,

[sorry for the delay of my answer, we were rather busy last weeks]

On Thu, 04 Nov 2004 21:29:04 -0500 Christopher Browne <cbbrowne@acm.org> wrote:
> In an attempt to throw the authorities off his trail, schabios@logi-track.com (Markus Schaber) transmitted:
> > We should create a list of those needs, and then communicate those
> > to the kernel/fs developers. Then we (as well as other apps) can
> > make use of those features where they are available, and use the old
> > way everywhere else.
>
> Which kernel/fs developers did you have in mind? The ones working on
> Linux? Or FreeBSD? Or DragonflyBSD? Or Solaris? Or AIX?

All of them, and others (e.g. Windows). Once we have a list of those needs, the advocates can talk to the OS developers. Some OS developers will follow, others not.

Then the postgres folks (and other application developers that benefit from these capabilities) can point interested users to our benchmarks and tell them that Foox performs 3 times as fast as BaarOs because they provide better support for database needs.

> Please keep in mind that many of the PostgreSQL developers are BSD
> folk that aren't particularly interested in creating bleeding edge
> Linux capabilities.

Then this should be motivation to add those things to BSD, maybe as a patch or loadable module so it does not bloat mainstream. I personally would prefer it to appear in BSD first, because in case it really pays off, it won't be long until it appears in Linux as well :-)

> Jumping into a customized filesystem that neither hardware nor
> software vendors would remotely consider supporting just doesn't look
> like a viable strategy to me.

I did not vote for a custom filesystem, as the OP did. I did vote for isolating a set of useful capabilities PostgreSQL could exploit, and then trying to convince the kernel developers to include these capabilities, so they are likely to be included in the main distributions.

I don't know about the BSD market, but I know that Redhat and SuSE often ship their patched versions of the kernels (so then they officially support the extensions), and most of this is likely to be included in mainstream later.

> > Maybe Reiser4 is a step into the right way, and maybe even a
> > postgres plugin for Reiser4 will be worth the effort. Maybe XFS/JFS
> > etc. already have such capabilities. Maybe that's completely wrong.
>
> The capabilities tend to be redundant. They tend to implement vaguely
> similar transactional capabilities to what databases have to
> implement. The similarities are not close enough to eliminate either
> variety of "commit" as redundant.

But a speed gain may be possible by coordinating DB and FS transactions.

Thanks,
Markus

-- 
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com