Thread: Allowing WAL fsync to be done via O_SYNC
Based on the tests we did last week, it seems clear that on many platforms it's a win to sync the WAL log by writing it with the open() option O_SYNC (or O_DSYNC where available) rather than issuing explicit fsync() (resp. fdatasync()) calls. In theory fsync ought to be faster, but it seems that too many kernels have inefficient implementations of fsync.

I think we need to make both the O_SYNC and fsync() choices available in 7.1. Two important questions need to be settled:

1. Is a compile-time flag (in config.h.in) good enough, or do we need to make it configurable via a GUC variable? (A variable would have to be postmaster-start-time changeable only, so you'd still need a postmaster restart to change it.)

2. Which way should be the default?

There's also the lesser question of what to call the config symbol or variable. My inclination is to go with a compile-time flag named USE_FSYNC_FOR_WAL and have the default be off (ie, use O_SYNC by default), but I'm not strongly set on that. Opinions anyone?

In any case the code should automatically prefer O_DSYNC over O_SYNC if available, and should prefer fdatasync() over fsync() if available; I doubt we need to provide a knob to alter those choices.

BTW, are there any platforms where O_DSYNC exists but has a different spelling?

regards, tom lane
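[For readers following along, a minimal standalone sketch of the two approaches being compared: an explicit fsync() after write(), versus opening the file with O_SYNC so every write() is itself synchronous. The file names and block size are invented for illustration; this is not the xlog.c code.]

    /* Minimal sketch, not PostgreSQL code: two ways of forcing a
     * just-written WAL block out to disk. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    int main(void)
    {
        char block[BLCKSZ];
        int  fd;

        memset(block, 0, sizeof(block));

        /* Method 1: plain write, then an explicit fsync() (or fdatasync()) */
        fd = open("wal.fsync-test", O_RDWR | O_CREAT, 0600);
        if (fd < 0)
            return 1;
        (void) write(fd, block, sizeof(block));
        fsync(fd);          /* some kernels scan lots of clean buffers here */
        close(fd);

        /* Method 2: open with O_SYNC, so the write() itself is synchronous */
        fd = open("wal.osync-test", O_RDWR | O_CREAT | O_SYNC, 0600);
        if (fd < 0)
            return 1;
        (void) write(fd, block, sizeof(block));   /* returns once data is on disk */
        close(fd);

        return 0;
    }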
* Tom Lane <tgl@sss.pgh.pa.us> [010315 09:35] wrote:
> BTW, are there any platforms where O_DSYNC exists but has a different
> spelling?

Yes, FreeBSD only has O_FSYNC; it doesn't have O_SYNC or O_DSYNC.

-- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Alfred Perlstein <bright@wintelcom.net> writes: > * Tom Lane <tgl@sss.pgh.pa.us> [010315 09:35] wrote: >> BTW, are there any platforms where O_DSYNC exists but has a different >> spelling? > Yes, FreeBSD only has: O_FSYNC > it doesn't have O_SYNC nor O_DSYNC. Okay ... we can fall back to O_FSYNC if we don't see either of the others. No problem. Any other weird cases out there? I think Andreas might've muttered something about AIX but I'm not sure now. regards, tom lane
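[A sketch of the fallback chain being settled on here: prefer O_DSYNC, then O_SYNC, then FreeBSD's O_FSYNC spelling. The macro name OPEN_SYNC_FLAG is invented for illustration, not taken from the sources.]

    #include <fcntl.h>

    #if defined(O_DSYNC)
    #define OPEN_SYNC_FLAG   O_DSYNC
    #elif defined(O_SYNC)
    #define OPEN_SYNC_FLAG   O_SYNC
    #elif defined(O_FSYNC)               /* the FreeBSD spelling */
    #define OPEN_SYNC_FLAG   O_FSYNC
    #endif
    /* If none of these exists, fall back to explicit fsync() after write(). */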
Tom Lane writes: > I think we need to make both O_SYNC and fsync() choices available in > 7.1. Two important questions need to be settled: > > 1. Is a compile-time flag (in config.h.in) good enough, or do we need > to make it configurable via a GUC variable? (A variable would have to > be postmaster-start-time changeable only, so you'd still need a > postmaster restart to change it.) As a general rule, if something can be a run time option, as opposed to a compile time option, then it should be. At the very least you keep the installation simple and allow for easier experimenting. > There's also the lesser question of what to call the config symbol > or variable. I suggest "wal_use_fsync" as a GUC variable, assuming the default would be off. Otherwise "wal_use_open_sync". (Use a general-to-specific naming scheme to allow for easier grouping. Having defaults be "off" consistently is more intuitive.) -- Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/
Peter Eisentraut <peter_e@gmx.net> writes: > As a general rule, if something can be a run time option, as opposed to a > compile time option, then it should be. At the very least you keep the > installation simple and allow for easier experimenting. I've been mentally working through the code, and see only one reason why it might be necessary to go with a compile-time choice: suppose we see that none of O_DSYNC, O_SYNC, O_FSYNC, [others] are defined? With the compile-time choice it's easy: #define USE_FSYNC_FOR_WAL, and sail on. If it's a GUC variable then we need a way to prevent the GUC option from becoming unset (which would disable the fsync() calls, leaving nothing to replace 'em). Doable, perhaps, but seems kind of ugly ... any thoughts about that? regards, tom lane
> Based on the tests we did last week, it seems clear that on many
> platforms it's a win to sync the WAL log by writing it with the open()
> option O_SYNC (or O_DSYNC where available) rather than
> issuing explicit fsync() (resp. fdatasync()) calls.

I don't remember a big difference between fsync and O_SYNC in the tfsync tests. Both depend on block size, and keeping in mind that fsync lets us sync after writing *multiple* blocks, I would either use fsync as the default or not deal with O_SYNC at all. But if O_DSYNC is defined and O_DSYNC != O_SYNC, then we should use O_DSYNC by default. (BTW, we didn't compare fdatasync and O_SYNC yet.)

Vadim
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes: > ... I would either > use fsync as default or don't deal with O_SYNC at all. > But if O_DSYNC is defined and O_DSYNC != O_SYNC then we should > use O_DSYNC by default. Hm. We could do that reasonably painlessly as a compile-time test in xlog.c, but I'm not clear on how it would play out as a GUC option. Peter, what do you think about configuration-dependent defaults for GUC variables? regards, tom lane
* Tom Lane <tgl@sss.pgh.pa.us> [010315 11:07] wrote: > "Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes: > > ... I would either > > use fsync as default or don't deal with O_SYNC at all. > > But if O_DSYNC is defined and O_DSYNC != O_SYNC then we should > > use O_DSYNC by default. > > Hm. We could do that reasonably painlessly as a compile-time test in > xlog.c, but I'm not clear on how it would play out as a GUC option. > Peter, what do you think about configuration-dependent defaults for > GUC variables? Sorry, what's a GUC? :) -- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Alfred Perlstein writes:
> Sorry, what's a GUC? :)

Grand Unified Configuration system

It's basically a cute name for the achievement that there's now a single name space and interface for (almost) all postmaster run time configuration variables.

-- Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/
Alfred Perlstein wrote: > * Tom Lane <tgl@sss.pgh.pa.us> [010315 11:07] wrote: > > Peter, what do you think about configuration-dependent defaults for > > GUC variables? > Sorry, what's a GUC? :) Grand Unified Configuration, Peter E.'s baby. See the thread starting at http://www.postgresql.org/mhonarc/pgsql-hackers/2000-03/msg00107.html for details. (And the search is working.... :-)). -- Lamar Owen WGCR Internet Radio 1 Peter 4:11
* Peter Eisentraut <peter_e@gmx.net> [010315 11:33] wrote:
> Alfred Perlstein writes:
> > Sorry, what's a GUC? :)
>
> Grand Unified Configuration system
>
> It's basically a cute name for the achievement that there's now a single
> name space and interface for (almost) all postmaster run time
> configuration variables.

Oh, thanks.

Well considering that, a runtime check for doing_sync_wal_writes == 1 shouldn't be that expensive. Sort of the inverse of -F: if we're using O_SYNC for WAL writes, we don't need to fsync it.

Btw, if you guys want to get some speed with WAL, I'd implement a write-behind process if it was possible to do the O_SYNC writes.

...

And since we're sorta on the topic of IO, I noticed that it looks like (at least in 7.0.3) vacuum and certain other routines read files in reverse order. The problem (at least in FreeBSD) is that we haven't tuned the system to detect reverse reading and hence don't do much readahead. There may be some going on as a function of the read clustering, but I'm not entirely sure. I'd suspect that other OSs might have neglected to check for reverse reading of files as well, but I'm not sure.

Basically, if there were a way to do this differently, or to anticipate the backwards motion and do large reads, it may add latency, but it should improve performance.

-- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Alfred Perlstein <bright@wintelcom.net> writes: > And since we're sorta on the topic of IO, I noticed that it looks > like (at least in 7.0.3) that vacuum and certain other routines > read files in reverse order. Vacuum does that because it's trying to push tuples down from the end into free space in earlier blocks. I don't see much way around that (nor any good reason to think that it's a critical part of vacuum's performance anyway). Where else have you seen such behavior? regards, tom lane
* Tom Lane <tgl@sss.pgh.pa.us> [010315 11:45] wrote:
> Alfred Perlstein <bright@wintelcom.net> writes:
> > And since we're sorta on the topic of IO, I noticed that it looks
> > like (at least in 7.0.3) that vacuum and certain other routines
> > read files in reverse order.
>
> Vacuum does that because it's trying to push tuples down from the end
> into free space in earlier blocks. I don't see much way around that
> (nor any good reason to think that it's a critical part of vacuum's
> performance anyway). Where else have you seen such behavior?

Just vacuum, but the source is large, and I'm sort of lacking on database-foo so I guessed that it may be done elsewhere.

You can optimize this out by implementing the read-behind yourselves, sorta like this:

    struct sglist *
    read(fd, len)
    {
        if (fd.lastpos - fd.curpos <= THRESHOLD) {
            fd.curpos = fd.lastpos - THRESHOLD;
            len = THRESHOLD;
        }
        return (do_read(fd, len));
    }

Of course this is entirely wrong, but illustrates what would/could help. I would fix FreeBSD, but it's sort of a mess and beyond what I've got time to do ATM.

-- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
> Peter Eisentraut <peter_e@gmx.net> writes:
> > As a general rule, if something can be a run time option, as opposed to a
> > compile time option, then it should be. At the very least you keep the
> > installation simple and allow for easier experimenting.
>
> I've been mentally working through the code, and see only one reason why
> it might be necessary to go with a compile-time choice: suppose we see
> that none of O_DSYNC, O_SYNC, O_FSYNC, [others] are defined? With the
> compile-time choice it's easy: #define USE_FSYNC_FOR_WAL, and sail on.
> If it's a GUC variable then we need a way to prevent the GUC option from
> becoming unset (which would disable the fsync() calls, leaving nothing
> to replace 'em). Doable, perhaps, but seems kind of ugly ... any
> thoughts about that?

I don't think making something a run-time option is always a good idea. Giving people too many choices is often confusing.

I think we should just check at compile time, and choose O_* if we have it, and if not, use fsync(). No one will ever do the proper timing tests to know which is better except us. Also, it seems O_* should be faster because you are fsync'ing the buffer you just wrote, so there is no looking around for dirty buffers the way there is with fsync().

-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
> Based on the tests we did last week, it seems clear than on many > platforms it's a win to sync the WAL log by writing it with open() > option O_SYNC (or O_DSYNC where available) rather than issuing explicit > fsync() (resp. fdatasync()) calls. In theory fsync ought to be faster, > but it seems that too many kernels have inefficient implementations of > fsync. Can someone explain why configure/platform-specific flags are allowed to be added at this stage in the release, but my pgmonitor patch was rejected? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Can someone explain why configure/platform-specific flags are allowed to > be added at this stage in the release, but my pgmonitor patch was > rejected? Possibly just because Marc hasn't stomped on me quite yet ;-) However, I can actually make a case for this: we are flushing out performance bugs in a new feature, ie WAL. regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Can someone explain why configure/platform-specific flags are allowed to > > be added at this stage in the release, but my pgmonitor patch was > > rejected? > > Possibly just because Marc hasn't stomped on me quite yet ;-) > > However, I can actually make a case for this: we are flushing out > performance bugs in a new feature, ie WAL. You did a masterful job of making my pgmonitor patch sound like a debug aid instead of a feature too. :-) Have you considered a career in law. :-) -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> > I've been mentally working through the code, and see only one reason why > > it might be necessary to go with a compile-time choice: suppose we see > > that none of O_DSYNC, O_SYNC, O_FSYNC, [others] are defined? With the > > compile-time choice it's easy: #define USE_FSYNC_FOR_WAL, and sail on. > > If it's a GUC variable then we need a way to prevent the GUC option from > > becoming unset (which would disable the fsync() calls, leaving nothing > > to replace 'em). Doable, perhaps, but seems kind of ugly ... any > > thoughts about that? > > I don't think having something a run-time option is always a good idea. > Giving people too many choices is often confusing. > > I think we should just check at compile time, and choose O_* if we have > it, and if not, use fsync(). No one will ever do the proper timing > tests to know which is better except us. Also, it seems O_* should be > faster because you are fsync'ing the buffer you just wrote, so there is > no looking around for dirty buffers like fsync(). I later read Vadim's comment that fsync() of two blocks may be faster than two O_* writes, so I am now confused about the proper solution. However, I think we need to pick one and make it invisible to the user. Perhaps a compiler/config.h flag for testing would be a good solution. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> > Based on the tests we did last week, it seems clear that on many
> > platforms it's a win to sync the WAL log by writing it with the open()
> > option O_SYNC (or O_DSYNC where available) rather than
> > issuing explicit fsync() (resp. fdatasync()) calls.
>
> I don't remember a big difference between fsync and O_SYNC in the tfsync
> tests. Both depend on block size, and keeping in mind that fsync
> lets us sync after writing *multiple* blocks, I would either
> use fsync as the default or not deal with O_SYNC at all.

I see what you are saying: the OS may be faster at fsync'ing two blocks in one operation rather than doing two O_SYNC operations.

Seems we should just pick a default and leave the rest for a later release. Marc wants RC1 tomorrow, I think.

-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes: > I later read Vadim's comment that fsync() of two blocks may be faster > than two O_* writes, so I am now confused about the proper solution. > However, I think we need to pick one and make it invisible to the user. > Perhaps a compiler/config.h flag for testing would be a good solution. I believe that we don't know enough yet to nail down a hard-wired decision. Vadim's idea of preferring O_DSYNC if it appears to be different from O_SYNC is a good first cut, but I think we'd better make it possible to override that, at least for testing purposes. So I think it should be configurable at *some* level. I don't much care whether it's a config.h entry or a GUC variable. But consider this: we'll be more likely to get some feedback from the field (allowing us to refine the policy in future releases) if it is a GUC variable. Not many people will build two versions of the software, but people might take the trouble to play with a run-time configuration setting. regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes: > > I later read Vadim's comment that fsync() of two blocks may be faster > > than two O_* writes, so I am now confused about the proper solution. > > However, I think we need to pick one and make it invisible to the user. > > Perhaps a compiler/config.h flag for testing would be a good solution. > > I believe that we don't know enough yet to nail down a hard-wired > decision. Vadim's idea of preferring O_DSYNC if it appears to be > different from O_SYNC is a good first cut, but I think we'd better make > it possible to override that, at least for testing purposes. > > So I think it should be configurable at *some* level. I don't much care > whether it's a config.h entry or a GUC variable. > > But consider this: we'll be more likely to get some feedback from the > field (allowing us to refine the policy in future releases) if it is a > GUC variable. Not many people will build two versions of the software, > but people might take the trouble to play with a run-time configuration > setting. Yes, I can imagine. Can we remove it once we know the answer? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
I'd actually vote for it to remain for a release or two or more; as we get more experience with stuff, the defaults may turn out to be different for different workloads.

LER

-- Larry Rosenman http://www.lerctr.org/~ler/ Phone: +1 972 414 9812 E-Mail: ler@lerctr.org US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749 US

>>>>>>>>>>>>>>>>>> Original Message <<<<<<<<<<<<<<<<<<
On 3/15/01, 2:46:20 PM, Bruce Momjian <pgman@candle.pha.pa.us> wrote regarding Re: [HACKERS] Allowing WAL fsync to be done via O_SYNC:

> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > I later read Vadim's comment that fsync() of two blocks may be faster
> > > than two O_* writes, so I am now confused about the proper solution.
> > > However, I think we need to pick one and make it invisible to the user.
> > > Perhaps a compiler/config.h flag for testing would be a good solution.
> >
> > I believe that we don't know enough yet to nail down a hard-wired
> > decision. Vadim's idea of preferring O_DSYNC if it appears to be
> > different from O_SYNC is a good first cut, but I think we'd better make
> > it possible to override that, at least for testing purposes.
> >
> > So I think it should be configurable at *some* level. I don't much care
> > whether it's a config.h entry or a GUC variable.
> >
> > But consider this: we'll be more likely to get some feedback from the
> > field (allowing us to refine the policy in future releases) if it is a
> > GUC variable. Not many people will build two versions of the software,
> > but people might take the trouble to play with a run-time configuration
> > setting.

> Yes, I can imagine. Can we remove it once we know the answer?
Tom Lane writes: > "Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes: > > ... I would either > > use fsync as default or don't deal with O_SYNC at all. > > But if O_DSYNC is defined and O_DSYNC != O_SYNC then we should > > use O_DSYNC by default. > > Hm. We could do that reasonably painlessly as a compile-time test in > xlog.c, but I'm not clear on how it would play out as a GUC option. > Peter, what do you think about configuration-dependent defaults for > GUC variables? We have plenty of those already, but we should avoid a variable whose specification is: "The default is 'on' if your system defines one of the macros O_SYNC, O_DSYNC, O_FSYNC, and if O_SYNC and O_DSYNC are distinct, otherwise the default is 'off'." The net result of this would be that the average user would have absolutely no clue what the default on his machine is. Additionally consider that maybe O_SYNC and O_DSYNC have different values but the kernel treats them the same anyway. We really shouldn't try to guess that far. -- Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/
Tom Lane writes: > However, I can actually make a case for this: we are flushing out > performance bugs in a new feature, ie WAL. I haven't followed the jungle of numbers too closely. Is it not the case that WAL + fsync is still faster than 7.0 + fsync and WAL/no fsync is still faster than 7.0/no fsync? -- Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/
Peter Eisentraut <peter_e@gmx.net> writes: >> Peter, what do you think about configuration-dependent defaults for >> GUC variables? > We have plenty of those already, but we should avoid a variable whose > specification is: > "The default is 'on' if your system defines one of the macros O_SYNC, > O_DSYNC, O_FSYNC, and if O_SYNC and O_DSYNC are distinct, otherwise the > default is 'off'." Unfortunately, I think that's just about what the default would need to be. What alternative do you have to offer? > The net result of this would be that the average user would have > absolutely no clue what the default on his machine is. Sure he would. Fire up the software and do "SHOW wal_use_fsync" (or whatever we call it). I think the documentation could just say "the default is platform-dependent". > Additionally consider that maybe O_SYNC and O_DSYNC have different values > but the kernel treats them the same anyway. We really shouldn't try to > guess that far. Well, that's exactly *why* we need an overridable default. Or would you like to try to do some performance measurements in configure? regards, tom lane
> I believe that we don't know enough yet to nail down a hard-wired > decision. Vadim's idea of preferring O_DSYNC if it appears to be > different from O_SYNC is a good first cut, but I think we'd > better make it possible to override that, at least for testing purposes. So let's leave fsync as default and add option to open log files with O_DSYNC/O_SYNC. Vadim
Tom Lane writes:
> I've been mentally working through the code, and see only one reason why
> it might be necessary to go with a compile-time choice: suppose we see
> that none of O_DSYNC, O_SYNC, O_FSYNC, [others] are defined?

We postulate that one of those has to exist. Alternatively, you make the option read

    wal_sync_method = fsync | open_sync

In the "parse_hook" for the parameter you #ifdef out 'open_sync' as a valid option if none of those exist, so a user will get "'open_sync' is not a valid option value".

-- Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/
> "The default is 'on' if your system defines one of the macros O_SYNC, > O_DSYNC, O_FSYNC, and if O_SYNC and O_DSYNC are distinct, otherwise the > default is 'off'." > > The net result of this would be that the average user would have > absolutely no clue what the default on his machine is. > > Additionally consider that maybe O_SYNC and O_DSYNC have different values > but the kernel treats them the same anyway. We really shouldn't try to > guess that far. Good point. I think Tom already found dfsync points to fsync in his libc, or something like that. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Peter Eisentraut <peter_e@gmx.net> writes: > I haven't followed the jungle of numbers too closely. > Is it not the case that WAL + fsync is still faster than 7.0 + fsync and > WAL/no fsync is still faster than 7.0/no fsync? I believe the first is true in most cases. I wouldn't swear to the second though, since WAL requires more I/O and doesn't save any fsyncs if you've got 'em all turned off anyway ... regards, tom lane
Peter Eisentraut <peter_e@gmx.net> writes: > We postulate that one of those has to exist. Alternatively, you make the > option read > wal_sync_method = fsync | open_sync > In the "parse_hook" for the parameter you if #ifdef out 'open_sync' as a > valid option if none of those exist, so a user will get "'open_sync' is > not a valid option value". I like this a lot. In fact, I am mightily tempted to make it wal_sync_method = fsync | fdatasync | open_sync | open_datasync where fdatasync would only be valid if configure found fdatasync() and open_datasync would only be valid if we found O_DSYNC exists and isn't O_SYNC. This would let people try all the available methods under realistic test conditions, for hardly any extra work. Furthermore, the documentation could say something like "The default is the first available method in the order open_datasync, fdatasync, fsync, open_sync" (assuming that Vadim's preferences are right). A small problem is that I don't want to be doing multiple strcasecmp's to figure out what to do in xlog.c. Do you object if I add an "assign_hook" to guc.c that's called when an actual assignment is made? That would provide a place to set up the flag variables that xlog.c would actually look at. Furthermore, having an assign_hook would let us support changing this value at SIGHUP, not only at postmaster start. (The assign hook would just need to fsync whatever WAL file is currently open and possibly close/reopen the file, to ensure that no blocks miss getting synced when we change conventions.) Creeping featurism strikes again ;-) ... but this feels right ... regards, tom lane
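[To make the proposal concrete: a rough sketch of how such a wal_sync_method string could be validated and mapped, with the unavailable choices compiled out. The enum values and the HAVE_FDATASYNC test are assumptions for illustration, not the actual guc.c/xlog.c code.]

    #include <fcntl.h>
    #include <strings.h>

    typedef enum
    {
        SYNC_METHOD_FSYNC,
        SYNC_METHOD_FDATASYNC,
        SYNC_METHOD_OPEN_SYNC,      /* open WAL files with O_SYNC/O_FSYNC */
        SYNC_METHOD_OPEN_DSYNC      /* open WAL files with O_DSYNC */
    } WalSyncMethod;

    /* Returns -1 for a value that isn't available on this platform. */
    static int
    parse_wal_sync_method(const char *name)
    {
        if (strcasecmp(name, "fsync") == 0)
            return SYNC_METHOD_FSYNC;
    #ifdef HAVE_FDATASYNC
        if (strcasecmp(name, "fdatasync") == 0)
            return SYNC_METHOD_FDATASYNC;
    #endif
    #if defined(O_SYNC) || defined(O_FSYNC)
        if (strcasecmp(name, "open_sync") == 0)
            return SYNC_METHOD_OPEN_SYNC;
    #endif
    #if defined(O_DSYNC) && (!defined(O_SYNC) || O_DSYNC != O_SYNC)
        if (strcasecmp(name, "open_datasync") == 0)
            return SYNC_METHOD_OPEN_DSYNC;
    #endif
        return -1;                  /* "not a valid option value" */
    }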
Tom Lane writes:
> wal_sync_method = fsync | fdatasync | open_sync | open_datasync

> A small problem is that I don't want to be doing multiple strcasecmp's
> to figure out what to do in xlog.c.

This should be efficient:

    switch (lower(string[0]) + lower(string[5]))
    {
        case 'f':           /* fsync */
        case 'f' + 's':     /* fdatasync */
        case 'o' + 's':     /* open_sync */
        case 'o' + 'd':     /* open_datasync */
    }

Although ugly, it should serve as a readable solution for now.

> Do you object if I add an "assign_hook" to guc.c that's called when an
> actual assignment is made?

Something like this is on my wish list, but I'm not sure if it's wise to start this now. There are a few issues that need some thought, like how to make the interface for non-string options, and how to keep it in sync with the parse hook of string options, ...

> That would provide a place to set up the flag variables that xlog.c
> would actually look at. Furthermore, having an assign_hook would let
> us support changing this value at SIGHUP, not only at postmaster
> start. (The assign hook would just need to fsync whatever WAL file is
> currently open and possibly close/reopen the file, to ensure that no
> blocks miss getting synced when we change conventions.)

... and possibly here you need to pass the context to the assign hook as well. This application strikes me as a bit too esoteric for a first try.

-- Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/
* Mikheev, Vadim <vmikheev@SECTORBASE.COM> [010315 13:52] wrote:
> > I believe that we don't know enough yet to nail down a hard-wired
> > decision. Vadim's idea of preferring O_DSYNC if it appears to be
> > different from O_SYNC is a good first cut, but I think we'd
> > better make it possible to override that, at least for testing purposes.
>
> So let's leave fsync as default and add option to open log files
> with O_DSYNC/O_SYNC.

I have a weird and untested suggestion:

How many files need to be fsync'd?

If it's more than one, what might work is using mmap() to map the files in adjacent areas, then calling msync() on the entire range; this would allow you to batch fsync the data.

The only problem is that I'm not sure: 1) how portable msync() is. 2) if msync() guarantees metadata consistency.

Another benefit of mmap() is the 'zero' copy nature of it.

-- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
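[For reference, a minimal sketch of the msync() idea for a single file; the multi-file, adjacent-mapping part is the non-portable piece discussed in the replies below. The file name and length are made up.]

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 8192;
        int    fd = open("walfile", O_RDWR | O_CREAT, 0600);
        char  *p;

        if (fd < 0 || ftruncate(fd, (off_t) len) != 0)
            return 1;
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return 1;

        memset(p, 'x', len);         /* "write" by storing into the mapping */
        msync(p, len, MS_SYNC);      /* push the dirty pages out to disk */
        /* note: msync() may not flush file metadata the way fsync() does */

        munmap(p, len);
        close(fd);
        return 0;
    }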
Alfred Perlstein <bright@wintelcom.net> writes: > How many files need to be fsync'd? Only one. > If it's more than one, what might work is using mmap() to map the > files in adjacent areas, then calling msync() on the entire range, > this would allow you to batch fsync the data. Interesting thought, but mmap to a prespecified address is most definitely not portable, whether or not you want to assume that plain mmap is ... regards, tom lane
Peter Eisentraut <peter_e@gmx.net> writes:
> switch (lower(string[0]) + lower(string[5]))
> {
>     case 'f':           /* fsync */
>     case 'f' + 's':     /* fdatasync */
>     case 'o' + 's':     /* open_sync */
>     case 'o' + 'd':     /* open_datasync */
> }
> Although ugly, it should serve as a readable solution for now.

Ugly is the word ...

>> Do you object if I add an "assign_hook" to guc.c that's called when an
>> actual assignment is made?

> Something like this is on my wish list, but I'm not sure if it's wise to
> start this now.

I'm not particularly concerned about changing the interface later if that proves necessary. We're not likely to have so many of the things that an API change is burdensome, and they will all be strictly backend internal. What I have in mind for now is just

    void (*assign_hook) (const char *newval);

(obviously this is for string variables only, for now) called just before actually changing the variable value. This lets the hook see the old value if it needs to.

regards, tom lane
* Tom Lane <tgl@sss.pgh.pa.us> [010315 14:54] wrote:
> Alfred Perlstein <bright@wintelcom.net> writes:
> > How many files need to be fsync'd?
>
> Only one.
>
> > If it's more than one, what might work is using mmap() to map the
> > files in adjacent areas, then calling msync() on the entire range,
> > this would allow you to batch fsync the data.
>
> Interesting thought, but mmap to a prespecified address is most
> definitely not portable, whether or not you want to assume that
> plain mmap is ...

Yeah... :(

Evil thought though (for reference):

    mmap(anon memory) returns addr1
    addr2 = addr1 + maplen
    split addr1<->addr2 on points A, B and C
    mmap(file1 over addr1 to A)
    mmap(file2 over A to B)
    mmap(file3 over B to C)
    mmap(file4 over C to addr2)

It _should_ work, but there's probably some corner cases where it doesn't.

-- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
> Well, that's exactly *why* we need an overridable default. Or would you > like to try to do some performance measurements in configure? At this point I'm more comfortable with a compile-time option (determined statically or in a configure compilation test, not a performance test), rather than a GUC variable. But imho 7.1 will be nice with either choice, and if you think that a variable will make it easier for developers to do tuning from a distance (as opposed to having it just confuse new users) then... ;) - Thomas
Bruce Momjian wrote: > <snip> > No one will ever do the proper timing tests to know which is better except us. Hi Bruce, I believe in the future that anyone doing serious benchmark tests before large-scale implementation will indeed be testing things like this. There will also be people/companies out there who will specialise in "tuning" PostgreSQL systems and they will definitely test stuff like this... different variations, different database structures, different OS's, etc. Regards and best wishes, Justin Clift
> Bruce Momjian wrote:
> > <snip>
> > No one will ever do the proper timing tests to know which is better except us.
>
> Hi Bruce,
>
> I believe in the future that anyone doing serious benchmark tests before
> large-scale implementation will indeed be testing things like this.
> There will also be people/companies out there who will specialise in
> "tuning" PostgreSQL systems and they will definitely test stuff like
> this... different variations, different database structures, different
> OS's, etc.

But I don't want to go the Informix/Oracle way where we have so many tuning options that no one understands them all. I would like us to find the best options and only give users choices when there is a real tradeoff.

For example, Tom had a nice fsync test program. Why can't we run that on various platforms and collect the results, then make a decision on the best default? Trying to test the effects of fsync() with a database wrapped around it really makes for difficult measurement anyway.

-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes: > For example, Tom had a nice fsync test program. Why can't we run that > on various platforms and collect the results, then make a decision on > the best default. Mainly because (a) there's not enough time before release, and (b) that test program was far too stupid to give trustworthy results anyway. (It was assuming exactly one commit per XLOG block, for example.) > Trying to test the affects of fsync() with a database wrapped around it > really makes for difficult measurement anyway. Exactly. What I'm doing now is providing some infrastructure with which we can hope to see some realistic tests. For example, I'm gonna be leaning on Great Bridge's lab guys to rerun their TPC tests with a bunch of combinations, just as soon as the dust settles. But I'm not planning to put my faith in only that one benchmark. I'm all for improving the intelligence of the defaults once we know enough to pick better defaults. But we don't yet, and there's no way that we *will* know enough until after we've shipped a release that has these tuning knobs and gotten some real-world results from the field. regards, tom lane
Is someone able to put together a testing-type script or sequence so people can run this on the various platforms and then report the results? For example, I can set up benchmarking (or automated testing) on various Solaris platforms to run overnight and report the results in the morning. I suspect that quite a few people can do the same.

Would this be a good thing for someone to spend some time and effort on, in generating testing-type scripts/structures? It might be a useful tool to use in the future when making performance-related decisions like this.

Regards and best wishes, Justin Clift

Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I later read Vadim's comment that fsync() of two blocks may be faster
> > than two O_* writes, so I am now confused about the proper solution.
> > However, I think we need to pick one and make it invisible to the user.
> > Perhaps a compiler/config.h flag for testing would be a good solution.
>
> I believe that we don't know enough yet to nail down a hard-wired
> decision. Vadim's idea of preferring O_DSYNC if it appears to be
> different from O_SYNC is a good first cut, but I think we'd better make
> it possible to override that, at least for testing purposes.
>
> So I think it should be configurable at *some* level. I don't much care
> whether it's a config.h entry or a GUC variable.
>
> But consider this: we'll be more likely to get some feedback from the
> field (allowing us to refine the policy in future releases) if it is a
> GUC variable. Not many people will build two versions of the software,
> but people might take the trouble to play with a run-time configuration
> setting.
>
> regards, tom lane
I was wondering if the multiple writes performed to the XLOG could be grouped into one write().

Seems everyone agrees: fdatasync/O_DSYNC is better than plain fsync/O_SYNC, and the O_* flags are better than fsync() if we are doing only one write before every fsync. It seems the only open question is how often we do multiple writes before fsync, and whether that is ever faster than putting the O_* flags on the file for all writes.

-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
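[As a sketch of what "grouping into one write()" could look like: gather the pending log pages and push them out with a single writev() followed by one sync. All names and sizes here are invented for illustration, not taken from xlog.c.]

    #include <sys/uio.h>
    #include <unistd.h>

    #define XLOG_PAGE_SIZE 8192

    /* npages contiguous log pages starting at 'pages', destined for 'fd' */
    static int
    flush_xlog_pages(int fd, char pages[][XLOG_PAGE_SIZE], int npages)
    {
        struct iovec iov[16];
        int     i;

        for (i = 0; i < npages && i < 16; i++)
        {
            iov[i].iov_base = pages[i];
            iov[i].iov_len = XLOG_PAGE_SIZE;
        }
        if (writev(fd, iov, i) < 0)     /* one system call for all pages */
            return -1;
        return fsync(fd);               /* one sync covers every page written */
    }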
Bruce Momjian <pgman@candle.pha.pa.us> writes: > I was wondering if the multiple writes performed to the XLOG could be > grouped into one write(). That would require fairly major restructuring of xlog.c, which I don't want to undertake at this point in the cycle (we're trying to push out a release candidate, remember?). I'm not convinced it would be a huge win anyway. It would be a win if your average transaction writes multiple blocks' worth of XLOG ... but if your average transaction writes less than a block then it won't help. I think it probably is a good idea to restructure xlog.c so that it can write more than one page at a time --- but it's not such a great idea that I want to hold up the release any more for it. regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes: > > I was wondering if the multiple writes performed to the XLOG could be > > grouped into one write(). > > That would require fairly major restructuring of xlog.c, which I don't > want to undertake at this point in the cycle (we're trying to push out > a release candidate, remember?). I'm not convinced it would be a huge > win anyway. It would be a win if your average transaction writes > multiple blocks' worth of XLOG ... but if your average transaction > writes less than a block then it won't help. > > I think it probably is a good idea to restructure xlog.c so that it can > write more than one page at a time --- but it's not such a great idea > that I want to hold up the release any more for it. OK, but the point of adding all those configuration options was to allow us to figure out which was faster. If you can do the code so we no longer need to know the answer of which is best, why bother adding the config options. Just ship our best guess and fix it when we can. Does that make sense? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Bruce Momjian <pgman@candle.pha.pa.us> writes: > OK, but the point of adding all those configuration options was to allow > us to figure out which was faster. If you can do the code so we no > longer need to know the answer of which is best, why bother adding the > config options. How in the world did you arrive at that idea? I don't see anyone around here but you claiming that we don't need any experimentation ... regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes: > > OK, but the point of adding all those configuration options was to allow > > us to figure out which was faster. If you can do the code so we no > > longer need to know the answer of which is best, why bother adding the > > config options. > > How in the world did you arrive at that idea? I don't see anyone around > here but you claiming that we don't need any experimentation ... I am trying to understand what testing we need to do. I know we need configure tests to check to see what exists in the OS. My question was what are we needing to test? If we can do only single writes to the log, don't we prefer O_* to fsync, and the O_D* options over plain O_*? Am I confused? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Bruce Momjian <pgman@candle.pha.pa.us> writes: > My question was what are we needing to test? If we can do only single writes > to the log, don't we prefer O_* to fsync, and the O_D* options over > plain O_*? Am I confused? I don't think we have enough data to conclude that with any certainty. regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > My question was what are we needing to test? If we can do only single writes
> > to the log, don't we prefer O_* to fsync, and the O_D* options over
> > plain O_*? Am I confused?
>
> I don't think we have enough data to conclude that with any certainty.

I just figured we knew the answers to the above issues, and that the only issue was multiple writes vs. fsync(). It is hard for me to imagine O_* being slower than fsync(), or fdatasync being slower than fsync. Are we not able to assume that?

-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes: > It is hard for me to imagine O_* being slower than fsync(), Not hard at all --- if we're writing multiple xlog blocks per transaction, then O_* constrains the sequence of operations more than we really want. Changing xlog.c to combine writes as much as possible would reduce this problem, but not eliminate it. Besides, the entire object of this exercise is to work around an unexpected inefficiency in some kernels' implementations of fsync/fdatasync (viz, scanning over lots of not-dirty buffers). Who's to say that there might not be inefficiencies in other platforms' implementations of the O_* options? regards, tom lane
Hello Tom,

Friday, March 16, 2001, 6:54:22 AM, you wrote:

TL> Alfred Perlstein <bright@wintelcom.net> writes:
>> How many files need to be fsync'd?

TL> Only one.

>> If it's more than one, what might work is using mmap() to map the
>> files in adjacent areas, then calling msync() on the entire range,
>> this would allow you to batch fsync the data.

TL> Interesting thought, but mmap to a prespecified address is most
TL> definitely not portable, whether or not you want to assume that
TL> plain mmap is ...

TL> regards, tom lane

Could anyone consider forking a syncer process to sync data to disk? Build a shared sync queue: when a daemon process wants to do a sync after write() is called, it just puts a sync request into the queue. This can release the process from being blocked on writing as soon as possible. Multiple sync requests for one file can be merged when the request is being inserted into the queue.

-- Regards, Xu Yifeng
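[A very rough sketch of the syncer pass being described: take the queued requests, merge duplicates for the same file, and fsync each file once. In reality the queue would have to live in shared memory and backends would have to wait for an acknowledgement, as the replies below point out; the structure and names here are invented.]

    #include <unistd.h>

    #define MAX_REQS 64

    typedef struct
    {
        int     fds[MAX_REQS];      /* files that backends asked to have synced */
        int     nreqs;
    } SyncQueue;

    static void
    syncer_pass(SyncQueue *q)
    {
        int     i, j;

        for (i = 0; i < q->nreqs; i++)
        {
            int     fd = q->fds[i];
            int     dup = 0;

            /* merge duplicate requests for the same file */
            for (j = 0; j < i; j++)
                if (q->fds[j] == fd)
                    dup = 1;
            if (!dup)
                fsync(fd);
        }
        q->nreqs = 0;               /* now the waiting backends may COMMIT */
    }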
* Xu Yifeng <jamexu@telekbird.com.cn> [010315 22:25] wrote: > Hello Tom, > > Friday, March 16, 2001, 6:54:22 AM, you wrote: > > TL> Alfred Perlstein <bright@wintelcom.net> writes: > >> How many files need to be fsync'd? > > TL> Only one. > > >> If it's more than one, what might work is using mmap() to map the > >> files in adjacent areas, then calling msync() on the entire range, > >> this would allow you to batch fsync the data. > > TL> Interesting thought, but mmap to a prespecified address is most > TL> definitely not portable, whether or not you want to assume that > TL> plain mmap is ... > > TL> regards, tom lane > > Could anyone consider fork a syncer process to sync data to disk ? > build a shared sync queue, when a daemon process want to do sync after > write() is called, just put a sync request to the queue. this can release > process from blocked on writing as soon as possible. multipile sync > request for one file can be merged when the request is been inserting to > the queue. I suggested this about a year ago. :) The problem is that you need that process to potentially open and close many files over and over. I still think it's somewhat of a good idea. -- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Hello Alfred,

Friday, March 16, 2001, 3:21:09 PM, you wrote:

AP> * Xu Yifeng <jamexu@telekbird.com.cn> [010315 22:25] wrote:
>>
>> Could anyone consider fork a syncer process to sync data to disk ?
>> build a shared sync queue, when a daemon process want to do sync after
>> write() is called, just put a sync request to the queue. this can release
>> process from blocked on writing as soon as possible. multipile sync
>> request for one file can be merged when the request is been inserting to
>> the queue.

AP> I suggested this about a year ago. :)

AP> The problem is that you need that process to potentially open and close
AP> many files over and over.

AP> I still think it's somewhat of a good idea.

I am not a DBMS guru. Couldn't the syncer process cache opened files? Is there any problem I didn't consider?

-- Best regards, Xu Yifeng
* Xu Yifeng <jamexu@telekbird.com.cn> [010316 01:15] wrote: > Hello Alfred, > > Friday, March 16, 2001, 3:21:09 PM, you wrote: > > AP> * Xu Yifeng <jamexu@telekbird.com.cn> [010315 22:25] wrote: > >> > >> Could anyone consider fork a syncer process to sync data to disk ? > >> build a shared sync queue, when a daemon process want to do sync after > >> write() is called, just put a sync request to the queue. this can release > >> process from blocked on writing as soon as possible. multipile sync > >> request for one file can be merged when the request is been inserting to > >> the queue. > > AP> I suggested this about a year ago. :) > > AP> The problem is that you need that process to potentially open and close > AP> many files over and over. > > AP> I still think it's somewhat of a good idea. > > I am not a DBMS guru. Hah, same here. :) > couldn't the syncer process cache opened files? is there any problem I > didn't consider ? 1) IPC latency, the amount of time it takes to call fsync will increase by at least two context switches. 2) a working set (number of files needed to be fsync'd) that is larger than the amount of files you wish to keep open. -- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
> > Could anyone consider fork a syncer process to sync data to disk ? > > build a shared sync queue, when a daemon process want to do sync after > > write() is called, just put a sync request to the queue. this can release > > process from blocked on writing as soon as possible. multipile sync > > request for one file can be merged when the request is been inserting to > > the queue. > > I suggested this about a year ago. :) > > The problem is that you need that process to potentially open and close > many files over and over. > > I still think it's somewhat of a good idea. I like the idea too, but people want the transaction to return COMMIT only after data has been fsync'ed so I don't see a big win. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
* Bruce Momjian <pgman@candle.pha.pa.us> [010316 07:11] wrote:
> > > Could anyone consider fork a syncer process to sync data to disk ?
> > > build a shared sync queue, when a daemon process want to do sync after
> > > write() is called, just put a sync request to the queue. this can release
> > > process from blocked on writing as soon as possible. multipile sync
> > > request for one file can be merged when the request is been inserting to
> > > the queue.
> >
> > I suggested this about a year ago. :)
> >
> > The problem is that you need that process to potentially open and close
> > many files over and over.
> >
> > I still think it's somewhat of a good idea.
>
> I like the idea too, but people want the transaction to return COMMIT
> only after data has been fsync'ed so I don't see a big win.

This isn't simply handing off the sync to this other process; it requires an ack from the syncer before returning 'COMMIT'.

-- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
From: "Bruce Momjian" <pgman@candle.pha.pa.us> > > > Could anyone consider fork a syncer process to sync data to disk ? > > > build a shared sync queue, when a daemon process want to do sync after > > > write() is called, just put a sync request to the queue. this can release > > > process from blocked on writing as soon as possible. multipile sync > > > request for one file can be merged when the request is been inserting to > > > the queue. > > > > I suggested this about a year ago. :) > > > > The problem is that you need that process to potentially open and close > > many files over and over. > > > > I still think it's somewhat of a good idea. > > I like the idea too, but people want the transaction to return COMMIT > only after data has been fsync'ed so I don't see a big win. For a log file on a busy system, this could improve throughput a lot--batch commit. You end up with fewer than one fsync() per transaction.
Alfred Perlstein <bright@wintelcom.net> writes: >> couldn't the syncer process cache opened files? is there any problem I >> didn't consider ? > 1) IPC latency, the amount of time it takes to call fsync will > increase by at least two context switches. > 2) a working set (number of files needed to be fsync'd) that > is larger than the amount of files you wish to keep open. These days we're really only interested in fsync'ing the current WAL log file, so working set doesn't seem like a problem anymore. However context-switch latency is likely to be a big problem. One thing we'd definitely need before considering this is to replace the existing spinlock mechanism with something more efficient. Vadim has designed the WAL stuff in such a way that a separate writer/syncer process would be easy to add; in fact it's almost that way already, in that any backend can write or sync data that's been added to the queue by any other backend. The question is whether it'd actually buy anything to have another process. Good stuff to experiment with for 7.2. regards, tom lane
* Tom Lane <tgl@sss.pgh.pa.us> [010316 08:16] wrote:
> Alfred Perlstein <bright@wintelcom.net> writes:
> >> couldn't the syncer process cache opened files? is there any problem I
> >> didn't consider ?
>
> > 1) IPC latency, the amount of time it takes to call fsync will
> > increase by at least two context switches.
>
> > 2) a working set (number of files needed to be fsync'd) that
> > is larger than the amount of files you wish to keep open.
>
> These days we're really only interested in fsync'ing the current WAL
> log file, so working set doesn't seem like a problem anymore. However
> context-switch latency is likely to be a big problem. One thing we'd
> definitely need before considering this is to replace the existing
> spinlock mechanism with something more efficient.

What sort of problems are you seeing with the spinlock code?

> Vadim has designed the WAL stuff in such a way that a separate
> writer/syncer process would be easy to add; in fact it's almost that way
> already, in that any backend can write or sync data that's been added
> to the queue by any other backend. The question is whether it'd
> actually buy anything to have another process. Good stuff to experiment
> with for 7.2.

The delayed/coalesced fsync looked interesting.

-- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Alfred Perlstein <bright@wintelcom.net> writes: >> definitely need before considering this is to replace the existing >> spinlock mechanism with something more efficient. > What sort of problems are you seeing with the spinlock code? It's great as long as you never block, but it sucks for making things wait, because the wait interval will be some multiple of 10 msec rather than just the time till the lock comes free. We've speculated about using Posix semaphores instead, on platforms where those are available. I think Bruce was concerned about the possible overhead of pulling in a whole thread-support library just to get semaphores, however. regards, tom lane
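[For illustration, a sketch of the kind of blocking primitive being speculated about: a process-shared POSIX semaphore, so a waiter sleeps until the holder posts instead of retrying on a 10 msec select() timeout. This assumes sem_init() with pshared=1 actually works on the platform, and that MAP_ANONYMOUS is available (it is spelled MAP_ANON on some systems); on some platforms linking it pulls in exactly the thread library being worried about here.]

    #include <semaphore.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* place the semaphore in memory visible to all backends */
        sem_t  *lock = mmap(NULL, sizeof(sem_t), PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        if (lock == MAP_FAILED || sem_init(lock, 1, 1) != 0)
        {
            perror("sem_init");
            return 1;
        }

        sem_wait(lock);             /* blocks only until the holder posts */
        /* ... critical section: touch shared data structures ... */
        sem_post(lock);             /* wakes one waiter immediately, no 10ms poll */

        sem_destroy(lock);
        return 0;
    }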
> > I was wondering if the multiple writes performed to the
> > XLOG could be grouped into one write().
>
> That would require fairly major restructuring of xlog.c, which I don't

Restructuring? Why? It's only XLogWrite() that makes writes.

> want to undertake at this point in the cycle (we're trying to push out
> a release candidate, remember?). I'm not convinced it would be a huge
> win anyway. It would be a win if your average transaction writes
> multiple blocks' worth of XLOG ... but if your average transaction
> writes less than a block then it won't help.

But in a multi-user environment multiple transactions may write > 1 block before commit.

> I think it probably is a good idea to restructure xlog.c so
> that it can write more than one page at a time --- but it's
> not such a great idea that I want to hold up the release any
> more for it.

Agreed.

Vadim
On Fri, 16 Mar 2001, Tom Lane wrote: > Alfred Perlstein <bright@wintelcom.net> writes: > >> definitely need before considering this is to replace the existing > >> spinlock mechanism with something more efficient. > > > What sort of problems are you seeing with the spinlock code? > > It's great as long as you never block, but it sucks for making things > wait, because the wait interval will be some multiple of 10 msec rather > than just the time till the lock comes free. > > We've speculated about using Posix semaphores instead, on platforms > where those are available. I think Bruce was concerned about the > possible overhead of pulling in a whole thread-support library just to > get semaphores, however. But, with shared libraries, are you really pulling in a "whole thread-support library"? My understanding of shared libraries (altho it may be totally off) was that instead of pulling in a whole library, you pulled in the bits that you needed, pretty much as you needed them ...
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes: > I was wondering if the multiple writes performed to the > XLOG could be grouped into one write(). >> >> That would require fairly major restructuring of xlog.c, which I don't > Restructing? Why? It's only XLogWrite() who make writes. I was thinking of changing the data structure. I guess you could keep the data structure the same and make XLogWrite more complicated, though. >> I think it probably is a good idea to restructure xlog.c so >> that it can write more than one page at a time --- but it's >> not such a great idea that I want to hold up the release any >> more for it. > Agreed. Yes, to-do item for 7.2. regards, tom lane
Larry Rosenman <ler@lerctr.org> writes: >> But, with shared libraries, are you really pulling in a "whole >> thread-support library"? > Yes, you are. On UnixWare, you need to add -Kthread, which CHANGES a LOT > of primitives to go through threads wrappers and scheduling. Right, it's not so much that we care about referencing another shlib, it's that -lpthreads may cause you to get a whole new thread-aware version of libc, with attendant overhead that we don't need or want. regards, tom lane
Yes, you are. On UnixWare, you need to add -Kthread, which CHANGES a LOT of primitives to go through threads wrappers and scheduling.

See the doc on the http://UW7DOC.SCO.COM or http://www.lerctr.org:457/ web pages.

Also, some functions are NOT available without the -Kthread or -Kpthread directives.

LER

>>>>>>>>>>>>>>>>>> Original Message <<<<<<<<<<<<<<<<<<
On 3/16/01, 11:10:34 AM, The Hermit Hacker <scrappy@hub.org> wrote regarding Re: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC:

> On Fri, 16 Mar 2001, Tom Lane wrote:
> > Alfred Perlstein <bright@wintelcom.net> writes:
> > >> definitely need before considering this is to replace the existing
> > >> spinlock mechanism with something more efficient.
> >
> > > What sort of problems are you seeing with the spinlock code?
> >
> > It's great as long as you never block, but it sucks for making things
> > wait, because the wait interval will be some multiple of 10 msec rather
> > than just the time till the lock comes free.
> >
> > We've speculated about using Posix semaphores instead, on platforms
> > where those are available. I think Bruce was concerned about the
> > possible overhead of pulling in a whole thread-support library just to
> > get semaphores, however.

> But, with shared libraries, are you really pulling in a "whole
> thread-support library"? My understanding of shared libraries (altho it
> may be totally off) was that instead of pulling in a whole library, you
> pulled in the bits that you needed, pretty much as you needed them ...
Tom Lane <tgl@sss.pgh.pa.us> writes: > Alfred Perlstein <bright@wintelcom.net> writes: > >> definitely need before considering this is to replace the existing > >> spinlock mechanism with something more efficient. > > > What sort of problems are you seeing with the spinlock code? > > It's great as long as you never block, but it sucks for making things > wait, because the wait interval will be some multiple of 10 msec rather > than just the time till the lock comes free. Plus, using select() for the timeout is putting you into the kernel multiple times in a short period, and causing a reschedule everytime, which is a big lose. This was discussed in the linux-kernel thread that was referred to a few days ago. > We've speculated about using Posix semaphores instead, on platforms > where those are available. I think Bruce was concerned about the > possible overhead of pulling in a whole thread-support library just to > get semaphores, however. Are Posix semaphores faster by definition than SysV semaphores (which are described as "slow" in the source comments)? I can't see how they'd be much faster unless locking/unlocking an uncontended semaphore avoids a system call, in which case you might run into the same problems with userland backoff... Just looked, and on Linux pthreads and POSIX semaphores are both already in the C library. Unfortunately, the Linux C library doesn't support the PROCESS_SHARED attribute for either pthreads mutexes or POSIX semaphores. Grumble. What's the point then? Just some ignorant ramblings, thanks for listening... -Doug
> Yes, you are. On UnixWare, you need to add -Kthread, which CHANGES a LOT > of primitives to go through threads wrappers and scheduling. This was my concern: the change in behavior at startup and in library calls when thread support comes in through a library. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
> On 3/16/01, 11:10:34 AM, The Hermit Hacker <scrappy@hub.org> wrote > regarding Re: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC : > > > But, with shared libraries, are you really pulling in a "whole > > thread-support library"? My understanding of shared libraries (altho it > > may be totally off) was that instead of pulling in a whole library, you > > pulled in the bits that you needed, pretty much as you needed them ... * Larry Rosenman <ler@lerctr.org> [010316 10:02] wrote: > Yes, you are. On UnixWare, you need to add -Kthread, which CHANGES a LOT > of primitives to go through threads wrappers and scheduling. > > See the doc on the http://UW7DOC.SCO.COM or http://www.lerctr.org:457/ > web pages. > > Also, some functions are NOT available without the -Kthread or -Kpthread > directives. This is true on FreeBSD as well. -- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
The Hermit Hacker wrote: >> > But, with shared libraries, are you really pulling in a "whole > thread-support library"? My understanding of shared libraries (altho it > may be totally off) was that instead of pulling in a whole library, you > pulled in the bits that you needed, pretty much as you needed them ... Just by making a thread call libc changes personality to use thread safe routines (I.E. add mutex locking). Use one thread feature, get the whole set...which may not be that bad. -- William K. Volkman. CIO - H.I.S. Financial Services Corporation. 102 S. Tejon, Ste. 920, Colorado Springs, CO 80903 Phone: 719-633-6942 Fax: 719-633-7006 Cell: 719-330-8423
* William K. Volkman <wkv@hiscorp.net> [010318 11:56] wrote: > The Hermit Hacker wrote: > >> > > But, with shared libraries, are you really pulling in a "whole > > thread-support library"? My understanding of shared libraries (altho it > > may be totally off) was that instead of pulling in a whole library, you > > pulled in the bits that you needed, pretty much as you needed them ... > > Just by making a thread call libc changes personality to use thread > safe routines (I.E. add mutex locking). Use one thread feature, get > the whole set...which may not be that bad. Actually it can be pretty bad. Locked bus cycles needed for mutex operations are very, very expensive, not something you want to do unless you really really need to do it. -- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Alfred Perlstein <bright@wintelcom.net> writes: >> Just by making a thread call libc changes personality to use thread >> safe routines (I.E. add mutex locking). Use one thread feature, get >> the whole set...which may not be that bad. > Actually it can be pretty bad. Locked bus cycles needed for mutex > operations are very, very expensive, not something you want to do > unless you really really need to do it. It'd be interesting to try to get some numbers about the actual cost of using a thread-aware libc, on platforms where there's a difference. Shouldn't be that hard to build a postgres executable with the proper library and run some benchmarks ... anyone care to try? regards, tom lane
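Short of building a full postgres binary both ways, a crude first measurement is to time a libc routine that takes internal locks once the thread-aware library is linked in; malloc/free is the usual suspect. The program below is only a rough sketch of that idea -- build it plain, then again with -Kthread (UnixWare) or -lpthread/-pthread (elsewhere), and compare the numbers:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    int
    main(void)
    {
        struct timeval start, stop;
        static void *volatile sink;     /* keep the optimizer honest */
        long    i, iters = 10000000L;

        gettimeofday(&start, NULL);
        for (i = 0; i < iters; i++)
        {
            /* In a thread-aware libc each of these takes and releases an
             * internal mutex; in the single-threaded build they don't. */
            sink = malloc(64);
            free((void *) sink);
        }
        gettimeofday(&stop, NULL);

        printf("%.3f usec per malloc/free pair\n",
               ((stop.tv_sec - start.tv_sec) * 1000000.0 +
                (stop.tv_usec - start.tv_usec)) / (double) iters);
        return 0;
    }

That only shows the per-call library overhead, of course; whether the difference shows up under a real workload is what a pgbench run would tell you.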
* Tom Lane <tgl@sss.pgh.pa.us> [010318 14:55]: > Alfred Perlstein <bright@wintelcom.net> writes: > >> Just by making a thread call libc changes personality to use thread > >> safe routines (I.E. add mutex locking). Use one thread feature, get > >> the whole set...which may not be that bad. > > > Actually it can be pretty bad. Locked bus cycles needed for mutex > > operations are very, very expensive, not something you want to do > > unless you really really need to do it. > > It'd be interesting to try to get some numbers about the actual cost > of using a thread-aware libc, on platforms where there's a difference. > Shouldn't be that hard to build a postgres executable with the proper > library and run some benchmarks ... anyone care to try? I can get the code compiled, but don't have the skills to generate a test case worthy of anything.... LER > > regards, tom lane -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 972-414-9812 E-Mail: ler@lerctr.org US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
* Larry Rosenman <ler@lerctr.org> [010318 14:17] wrote: > * Tom Lane <tgl@sss.pgh.pa.us> [010318 14:55]: > > Alfred Perlstein <bright@wintelcom.net> writes: > > >> Just by making a thread call libc changes personality to use thread > > >> safe routines (I.E. add mutex locking). Use one thread feature, get > > >> the whole set...which may not be that bad. > > > > > Actually it can be pretty bad. Locked bus cycles needed for mutex > > > operations are very, very expensive, not something you want to do > > > unless you really really need to do it. > > > > It'd be interesting to try to get some numbers about the actual cost > > of using a thread-aware libc, on platforms where there's a difference. > > Shouldn't be that hard to build a postgres executable with the proper > > library and run some benchmarks ... anyone care to try? > I can get the code compiled, but don't have the skills to generate > a test case worthy of anything.... There's a 'make test' or something ('regression' maybe?) target that runs a suite of tests on the database; you could use that as a bench/timer. You could also try MySQL's "crashme" script. -- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
Larry Rosenman <ler@lerctr.org> writes: > I can get the code compiled, but don't have the skills to generate > a test case worthy of anything.... contrib/pgbench would do as a first cut. regards, tom lane
> * William K. Volkman <wkv@hiscorp.net> [010318 11:56] wrote: > > The Hermit Hacker wrote: > > >> > > > But, with shared libraries, are you really pulling in a "whole > > > thread-support library"? My understanding of shared libraries (altho it > > > may be totally off) was that instead of pulling in a whole library, you > > > pulled in the bits that you needed, pretty much as you needed them ... > > > > Just by making a thread call libc changes personality to use thread > > safe routines (I.E. add mutex locking). Use one thread feature, get > > the whole set...which may not be that bad. > > Actually it can be pretty bad. Locked bus cycles needed for mutex > operations are very, very expensive, not something you want to do > unless you really really need to do it. And don't forget buggy implementations. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Added to TODO: * Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options * Allow multiple blocks to be written to WAL with one write() > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > It is hard for me to imagine O_* being slower than fsync(), > Not hard at all --- if we're writing multiple xlog blocks per > transaction, then O_* constrains the sequence of operations more > than we really want. Changing xlog.c to combine writes as much > as possible would reduce this problem, but not eliminate it. > > Besides, the entire object of this exercise is to work around > an unexpected inefficiency in some kernels' implementations of > fsync/fdatasync (viz, scanning over lots of not-dirty buffers). > Who's to say that there might not be inefficiencies in other > platforms' implementations of the O_* options? > > regards, tom lane > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
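For reference, the flag-selection order implied by the thread (prefer O_DSYNC, then O_SYNC, then FreeBSD's O_FSYNC, and fall back to explicit fdatasync()/fsync() when none of them exist) comes down to something like the sketch below at open time. The macro and function names are illustrative, not the actual xlog.c code:

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>

    /* Pick the "write it synchronously" open flag in the preference order
     * discussed in this thread.  If the platform has none of them, the flag
     * is 0 and the caller must issue fdatasync()/fsync() after each write. */
    #if defined(O_DSYNC)
    #define WAL_SYNC_FLAG   O_DSYNC
    #elif defined(O_SYNC)
    #define WAL_SYNC_FLAG   O_SYNC
    #elif defined(O_FSYNC)          /* FreeBSD's spelling */
    #define WAL_SYNC_FLAG   O_FSYNC
    #else
    #define WAL_SYNC_FLAG   0
    #endif

    /* Illustrative helper, not xlog.c's actual interface. */
    static int
    wal_open(const char *path)
    {
        return open(path, O_RDWR | O_CREAT | WAL_SYNC_FLAG, 0600);
    }

The second TODO item addresses the cost Tom points out in the quoted text: with an O_* flag every write() is synchronous on its own, so the fewer write() calls per commit -- that is, the more adjacent dirty WAL blocks folded into a single write -- the smaller the penalty.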