Thread: Win32 Powerfail testing
We are developing a Win32 port of PostgreSQL 7.3 (different from Jan's implementation, in that we are using a thread model; in the future I hope we can contribute the source code). We have done power failure testing using the test tool made by Dave Page:

  Subject: [HACKERS] Win32 Powerfail testing - results
  From: "Dave Page" <dpage@vale-housing.co.uk>
  Date: Mon, 3 Feb 2003 16:51:33 -0000

So far we have found some interesting facts. Our Win32 port passes his test in most cases. However, if the power of the machine is turned off right after (10 to 20 seconds) a checkpoint has been made, it does not pass his test. So we think there is something wrong with the checkpoint implementation for the Win32 port, which is essentially the same as Jan's implementation, i.e. using _flushall() instead of sync(). We have been looking for a fix or an alternative implementation of sync() without success.

BTW, we found that the Cygwin port of PostgreSQL does not pass his test either.
--
Tatsuo Ishii
> -----Original Message-----
> From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
> Sent: 05 March 2003 02:23
> To: pgsql-hackers@postgresql.org
> Subject: [HACKERS] Win32 Powerfail testing
>
> So far we found interesting facts. Our Win32 port passes his
> test in most cases. However if power of the machine is turned
> off right after (10 to 20 seconds) the Checkpoint has been
> made, it does not pass his test. [...] We
> were looking for a fix or an alternative implementation of
> sync() without success.

Hi Tatsuo,

Does this help:
http://support.microsoft.com/default.aspx?scid=kb;en-us;66052

Regards, Dave.
> Hi Tatsuo,
>
> Does this help:
> http://support.microsoft.com/default.aspx?scid=kb;en-us;66052

Sorry, but it does not help. The page says we could use FlushFileBuffers() to sync the kernel buffer to the disk. Unfortunately, it requires a file descriptor to flush as its argument, so it cannot be a replacement for sync(). Actually, I have modified the buffer manager so that it remembers all file descriptors that have not yet been synced to disk at checkpoint time, in order to sync them later. However, I found this modification does not help at all, for some reason I don't know.
--
Tatsuo Ishii
On Wed, 5 Mar 2003, Dave Page wrote:
> Does this help:
> http://support.microsoft.com/default.aspx?scid=kb;en-us;66052

OMG, I'm rolling. You have to link with COMMODE.OBJ to fix a flushing problem. Someone at MS has a sense of humor. I thought running PHP on crack was funny (i.e. the --with-crack switch to turn on cracklib) but this one is even better.
> OMG, I'm rolling. You have to connect to the COMMODE.OBJ to fix a
> flushing problem. Someone at MS has a sense of humor.

We have tried COMMODE.OBJ already. It seems it does not help at all.
--
Tatsuo Ishii
Tatsuo Ishii wrote:
> Sorry, but it does not help. The page says we could use
> FlushFileBuffers() to sync the kernel buffer to the
> disk. Unfortunately, it requires a file descriptor to flush for its
> argument. Thus it could not be a replacement of sync(). [...]

It would be an interesting comparison for you to roll the file descriptor tracking changes into the Unix side of the tree and use fsync() or fdatasync() in place of FlushFileBuffers() on the Unix side (you'd have to remove or disable the code that does a sync(), of course). If the end result yields no data corruption issues during powerfail testing on various Unix platforms, then it's reasonably likely that the problem you're experiencing on the Windows side is with the underlying Windows platform and not with your code.
--
Kevin Brown
kevin@sysexperts.com
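On the Unix side, the per-file comparison Kevin describes might look roughly like this (a minimal sketch, not PostgreSQL's actual buffer-manager code; the dirty-file list is a hypothetical stand-in for the descriptor tracking):

```c
/* Sketch: instead of sync(), fsync() only the files the checkpoint
 * actually dirtied.  The dirty-file list here is a plain array of
 * paths; PostgreSQL's real buffer manager tracks open segments. */
#include <fcntl.h>
#include <unistd.h>

/* Returns 0 on success, -1 if any file could not be synced. */
int checkpoint_fsync(const char *dirty_files[], int nfiles)
{
    int rc = 0;
    for (int i = 0; i < nfiles; i++) {
        int fd = open(dirty_files[i], O_RDWR);
        if (fd < 0) { rc = -1; continue; }
        if (fsync(fd) != 0)      /* fdatasync(fd) would also do here */
            rc = -1;
        close(fd);
    }
    return rc;
}
```

Unlike sync(), this touches only PostgreSQL's own files and does not return until each one is on disk.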
> It would be an interesting comparison for you to roll the file
> descriptor tracking changes into the Unix side of the tree and use
> fsync() or fdatasync() in place of FlushFileBuffers() on the
> Unix side [...]

Sounds like an idea. I'll do it if I have spare time.
--
Tatsuo Ishii
> -----Original Message-----
> From: Kevin Brown [mailto:kevin@sysexperts.com]
> Sent: 06 March 2003 04:37
> To: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Win32 Powerfail testing
>
> It would be an interesting comparison for you to roll the
> file descriptor tracking changes into the Unix side of the
> tree and use fsync() or fdatasync() in place of
> FlushFileBuffers() on the Unix side [...]

Agreed, but I still keep thinking that despite some people's claims that Windows ain't up to it, DB2, SQL Server and Exchange Server, as well as probably others that don't use raw partitions, have got over this problem, so therefore we should be able to. Admittedly Microsoft have a bit of an advantage over us, but there must be some accessible way of flushing the buffers in a guaranteed way. I'll look into it some more today if I can...

Regards, Dave.
> -----Original Message-----
> From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
> Sent: 05 March 2003 13:49
> To: Dave Page
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Win32 Powerfail testing
>
> Sorry, but it does not help. The page says we could use
> FlushFileBuffers() to sync the kernel buffer to the
> disk. [...] Actually I have modified the buffer manager so that
> it remembers all file descriptors those have not been synced
> yet to the disk at the checkpoint time to sync them later.
> However I found this modification does not help at all with
> some reason I don't know.

How do you open the files (function, flags etc)?

Regards, Dave.
> How do you open the files (function, flags etc)?

Are you asking how we open files in the buffer manager? If so, basically PostgreSQL uses open() with flags (O_RDWR | PG_BINARY, 0600).
--
Tatsuo Ishii
> -----Original Message-----
> From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
> Sent: 06 March 2003 14:00
> To: Dave Page
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Win32 Powerfail testing
>
> Are you asking the way how to open files in the buffer
> manager? If so, basically PostgreSQL uses open() with flags
> (O_RDWR | PG_BINARY, 0600).

I cannot find it now, but I'm sure I read that FlushFileBuffers() has no effect unless the file was opened with CreateFile() with the GENERIC_WRITE flag. A quick Google shows quite a few people recommending that approach to others having trouble flushing files opened with fopen or _open.

Regards, Dave.
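The CreateFile()/FlushFileBuffers() pairing Dave describes might be sketched as below. This is Win32-only code, so it is guarded and compiles to a no-op elsewhere; the function name and sharing flags are illustrative assumptions, not from the thread:

```c
/* Sketch: FlushFileBuffers() needs a handle opened with GENERIC_WRITE
 * access, which is (per Dave's reading) why it does nothing for files
 * opened via the CRT's open()/fopen(). */
#ifdef _WIN32
#include <windows.h>

int flush_file(const char *path)
{
    HANDLE h = CreateFileA(path, GENERIC_WRITE,
                           FILE_SHARE_READ | FILE_SHARE_WRITE,
                           NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return -1;
    BOOL ok = FlushFileBuffers(h);   /* force this file's dirty pages to disk */
    CloseHandle(h);
    return ok ? 0 : -1;
}
#else
/* Non-Windows builds: nothing to do, succeed trivially. */
int flush_file(const char *path) { (void)path; return 0; }
#endif
```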
> Agreed, but I still keep thinking that despite some peoples
> claims that Windows ain't up to it, DB2, SQL and Exchange
> Server as well a probably others that don't use raw
> partitions have got over this problem, so therefore we should
> be able to.

FWIW, I believe all the mentioned products (OK, at least SQL Server and Exchange) use the CreateFile() API with the flag FILE_FLAG_NO_BUFFERING. It has the following constraints, though, which I guess they code around in the app code:

***
Instructs the system to open the file with no intermediate buffering or caching. When combined with FILE_FLAG_OVERLAPPED, the flag gives maximum asynchronous performance, because the I/O does not rely on the synchronous operations of the memory manager. However, some I/O operations will take longer, because data is not being held in the cache.

An application must meet certain requirements when working with files opened with FILE_FLAG_NO_BUFFERING:

- File access must begin at byte offsets within the file that are integer multiples of the volume's sector size.
- File access must be for numbers of bytes that are integer multiples of the volume's sector size. For example, if the sector size is 512 bytes, an application can request reads and writes of 512, 1024, or 2048 bytes, but not of 335, 981, or 7171 bytes.
- Buffer addresses for read and write operations should be sector aligned (aligned on addresses in memory that are integer multiples of the volume's sector size). Depending on the disk, this requirement may not be enforced.

One way to align buffers on integer multiples of the volume sector size is to use VirtualAlloc to allocate the buffers. It allocates memory that is aligned on addresses that are integer multiples of the operating system's memory page size. Because both memory page and volume sector sizes are powers of 2, this memory is also aligned on addresses that are integer multiples of a volume's sector size. An application can determine a volume's sector size by calling the GetDiskFreeSpace function.
***

There is also the flag FILE_FLAG_WRITE_THROUGH, which says:

***
Instructs the system to write through any intermediate cache and go directly to disk. The system can still cache write operations, but cannot lazily flush them.
***

But that's the CreateFile() Win32 API. The question is how the fopen() etc. calls are mapped to Win32 calls, I'd guess.

//Magnus
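The quoted FILE_FLAG_NO_BUFFERING rules boil down to two pieces of arithmetic: round every I/O size up to a sector multiple, and allocate sector-aligned buffers. A portable sketch (posix_memalign() standing in for VirtualAlloc(), and a hard-coded sector size where Win32 code would call GetDiskFreeSpace()):

```c
#include <stdlib.h>

#define SECTOR_SIZE 512   /* Win32 code would query this via GetDiskFreeSpace() */

/* Round nbytes up to the next multiple of the (power-of-2) sector size. */
size_t round_to_sector(size_t nbytes)
{
    return (nbytes + SECTOR_SIZE - 1) & ~((size_t)SECTOR_SIZE - 1);
}

/* Allocate a buffer whose address is a multiple of the sector size,
 * as required for unbuffered reads and writes. */
void *alloc_io_buffer(size_t nbytes)
{
    void *buf = NULL;
    if (posix_memalign(&buf, SECTOR_SIZE, round_to_sector(nbytes)) != 0)
        return NULL;
    return buf;
}
```

So a request for the quoted "illegal" size of 7171 bytes would have to be issued as a 7680-byte (15-sector) transfer from an aligned buffer.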
> I cannot find it now, but I'm sure I read that FlushFileBuffers() has no
> effect unless the file was opened with CreateFile() with the
> GENERIC_WRITE flag.

I'm sure FlushFileBuffers() is useless for files opened with open() too.

As I said in the previous mails, open()+_commit() does the right job with the transaction log files. So I think I should stick with the open()+_commit() approach for ordinary table/index files too.
--
Tatsuo Ishii
> -----Original Message-----
> From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
> Sent: 06 March 2003 15:17
> To: Dave Page
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Win32 Powerfail testing
>
> As I said in the previous mails, open()+_commit() does the
> right job with the transaction log files. So probably I think
> I should stick with open()+_commit() approach for ordinary
> table/index files too.

Oh, I didn't see that message. So it's either:

open() + _commit()

or

CreateFile() + FlushFileBuffers()

Magnus also mentioned using FILE_FLAG_NO_BUFFERING or FILE_FLAG_WRITE_THROUGH with CreateFile(). I was concerned about the additional complexity with FILE_FLAG_NO_BUFFERING, but FILE_FLAG_WRITE_THROUGH sounds like it might do the job, if a little sub-optimally.

Is there really no way of allowing a decent write cache, but then being able to guarantee a flush at the required time? Sounds a little cuckoo to me, but then it is Microsoft...

Anyhoo, it sounds like open() and _commit() is the best choice, as you say.

Regards, Dave.
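The open()+_commit() choice maps naturally onto the Unix code path, since _commit() is the MSVC CRT's descriptor-level flush (the CRT documentation describes it as forcing the OS to write the file to disk) and fsync() plays the same role on POSIX. A minimal portability sketch, with illustrative names not taken from PostgreSQL:

```c
#include <fcntl.h>
#ifdef _WIN32
#include <io.h>
#define durable_sync(fd) _commit(fd)   /* MSVC CRT flush-to-disk */
#else
#include <unistd.h>
#define durable_sync(fd) fsync(fd)     /* POSIX equivalent */
#endif

/* Write one buffer at the current file position and force it to disk.
 * Returns 0 on success, -1 on any failure. */
int write_and_sync(int fd, const void *buf, unsigned len)
{
    if (write(fd, buf, len) != (int)len)
        return -1;
    return durable_sync(fd);   /* data reaches the disk, not just the cache */
}
```

This is the WAL-style discipline the thread already confirmed works: every log write is followed by a descriptor-level flush.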
My experience with Windows backend work is that you have to turn off all buffering and implement your own write cache of sorts. Flushing is not the only reason: heavy buffering of files (the default behavior) also tends to thrash the server, because the cache does not always release memory properly. Likewise, for maximum results with memory you have to go straight to VirtualAlloc() and avoid using the C run time for any persistent memory allocation. Memory pages get mapped to file pages and all file reads/writes are on sector boundaries. Generally, it's a nightmare.

Merlin

> -----Original Message-----
> From: Dave Page [mailto:dpage@vale-housing.co.uk]
> Sent: Thursday, March 06, 2003 11:02 AM
> To: Tatsuo Ishii
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Win32 Powerfail testing
>
> Anyhoo, it sounds like open() and _commit is this best choice as you
> say.
>
> Regards, Dave.
> > As I said in the previous mails, open()+_commit() does the
> > right job with the transaction log files. So probably I think
> > I should stick with open()+_commit() approach for ordinary
> > table/index files too.
>
> Oh, I didn't see that message. So it's either:
>
> open() + _commit()

Sorry, I did not mention it explicitly. I meant we use the same implementation as Jan's work. He uses open()+_commit(), I believe.
--
Tatsuo Ishii
> -----Original Message-----
> From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
> Sent: 07 March 2003 01:33
> To: Dave Page
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Win32 Powerfail testing
>
> Sorry, I did not mention it explicitly. I meant we use the
> same implementation as Jan's work. He uses open()+_commit(),
> I believe.

Ah, but Jan/Katie's code *did not* survive the powerfails. Is there a relatively easy way we can test open()/_commit() against CreateFile()/FlushFileBuffers() with the FILE_FLAG_WRITE_THROUGH flag as suggested by Magnus (and indirectly by Merlin, I guess)?

Regards, Dave.
> Ah, but Jan/Katie's code *did not* survive the powerfails. Is there a
> relatively easy way we can test open()/_commit against
> CreateFile()/FlushFileBuffers() with the FILE_FLAG_WRITE_THROUGH flag as
> suggested by Magnus (and indirectly by Merlin I guess)?

There are two stages where a synchronized write is needed. One is WAL log writing. We confirmed that with open()/_commit() this is OK.

The other is checkpoint. Here we need to flush the kernel buffers holding previous writes to table/index files. To sync those files, PostgreSQL uses sync(). I guess Jan's implementation did not survive in this case (mine neither).

Today I revisited the implementation (replacing sync() with open/_commit) I made several days ago and found a bug in it (thanks to Hiroshi). With the fixed version, my Win32 port has now passed your test even right after a checkpoint!
--
Tatsuo Ishii
> -----Original Message-----
> From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
> Sent: 07 March 2003 08:37
> To: Dave Page
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Win32 Powerfail testing
>
> Today I revisited the implementation (replacing sync() with
> open/_commit) I made several days ago and found a bug with it
> (thanks to Hiroshi). With the fixed version of it, now my
> Win32 port has passed your test even right after checkpoint!

Ahh, excellent news. Another step nearer to a good quality Windows port :-)

Regards, Dave.
Tatsuo Ishii wrote:
> Today I revisited the implementation (replacing sync() with
> open/_commit) I made several days ago and found a bug with it (thanks
> to Hiroshi). With the fixed version of it, now my Win32 port has
> passed your test even right after checkpoint!

I presume that this implementation tracks which files have been opened and uses _commit() to write all the changes to disk for those files?

If so, then it would be of significant value, IMHO, if you could abstract the changes in such a way that they could be applied to the Unix side as well.

sync() writes *all* uncommitted buffers to disk, whether or not they belong to the process or process group that initiated the sync(). On systems which do more than just host PG, a sync() does more work (sometimes much more work) than is necessary and will unnecessarily burden the system with writes. I think it would be a win, from a design standpoint if nothing else, if PG committed only those pages that it was responsible for.

The Unix equivalent of _commit() appears to be fsync() or fdatasync(). So it sounds a lot like a "port" to Unix of the changes you have made for this might easily be a trivial search and replace. :-)
--
Kevin Brown
kevin@sysexperts.com
Kevin Brown kirjutas R, 07.03.2003 kell 12:05:
> I presume that this implementation tracks which files have been opened
> and uses _commit() to write all the changes to disk for those files?

But are there guarantees that all closed files are flushed to disk as well?

Does postgres guarantee it by doing a _commit() before close(), or do file system semantics guarantee that a file hits the disk when close()'d? (I guess they do not.)
-------------
Hannu
> > I presume that this implementation tracks which files have been opened
> > and uses _commit() to write all the changes to disk for those files?
>
> But are there quarantees that all closed files are flushed to disk as
> well ?

Of course not. So I remember all file names to be flushed in a table (placed in thread-global memory) and later open/_commit/close them at checkpoint time.
--
Tatsuo Ishii
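The dirty-name table Tatsuo describes might be sketched as below. The fixed array, linear-scan dedup, and size limits are illustrative simplifications; the real table lives in thread-global memory inside his port:

```c
#include <string.h>

#define MAX_DIRTY 1024

/* Names of files written since the last checkpoint; the checkpointer
 * later walks this list doing open/_commit (or fsync)/close on each. */
static char dirty_names[MAX_DIRTY][256];
static int  n_dirty = 0;

/* Record a file as needing a flush at the next checkpoint.
 * Deduplicated, so repeated writes to one file cost one entry.
 * Returns 0 on success, -1 if the table is full. */
int remember_dirty(const char *name)
{
    for (int i = 0; i < n_dirty; i++)
        if (strcmp(dirty_names[i], name) == 0)
            return 0;                    /* already queued */
    if (n_dirty >= MAX_DIRTY)
        return -1;                       /* table full */
    strncpy(dirty_names[n_dirty], name, 255);
    dirty_names[n_dirty][255] = '\0';
    n_dirty++;
    return 0;
}
```

Tracking names rather than descriptors is what answers Hannu's question: a file that was written and then closed is still flushed, because it is reopened at checkpoint time.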
Kevin Brown wrote:
> The Unix equivalent of _commit() appears to be fsync() or fdatasync().
> So it sounds a lot like a "port" to Unix of the changes you have made
> for this might easily be a trivial search and replace. :-)

The idea of using this on Unix is tempting, but Tatsuo is using a threaded backend, so it is a little easier to do. However, it would probably be pretty easy to write a file of modified file names that the checkpoint could read and open/fsync/close.

Of course, if there are lots of files, sync() may be faster than opening/fsyncing/closing all those files.
--
Bruce Momjian                     |  http://candle.pha.pa.us
pgman@candle.pha.pa.us            |  (610) 359-1001
 + If your life is a hard drive,  |  13 Roberts Road
 + Christ can be your backup.     |  Newtown Square, Pennsylvania 19073
Bruce Momjian wrote:
> The idea of using this on Unix is tempting, but Tatsuo is using a
> threaded backend, so it is a little easier to do. However, it would
> probably be pretty easy to write a file of modified file names that the
> checkpoint could read and open/fsync/close.

Even that's not strictly necessary -- we *do* have shared memory we can use for this, and even when hundreds of tables have been written the list will only end up being a few tens of kilobytes in size (plus whatever overhead is required to track and manipulate the entries).

But even then, we don't actually have to track the *names* of the files that have changed, just their RelFileNodes, since there's a mapping function from the RelFileNode to the filename.

> Of course, if there are lots of files, sync() may be faster than
> opening/fsync/closing all those files.

This is true, and is something I hadn't actually thought of. So it sounds like some testing would be in order. Unfortunately I know of no system call which will take an array of file descriptors (or file names! May as well go for the gold when wishing for something :-) and sync them all to disk in the most optimal way...
--
Kevin Brown
kevin@sysexperts.com
Kevin Brown wrote:
> Even that's not strictly necessary -- we *do* have shared memory we
> can use for this, and even when hundreds of tables have been written
> the list will only end up being a few tens of kilobytes in size (plus
> whatever overhead is required to track and manipulate the entries).
>
> But even then, we don't actually have to track the *names* of the
> files that have changed, just their RelFileNodes, since there's a
> mapping function from the RelFileNode to the filename.

But we have to allow an unlimited number of files, and shared memory is finite. Perhaps we could just fall back to sync() if the shared memory overflows.
--
Bruce Momjian                     |  http://candle.pha.pa.us
pgman@candle.pha.pa.us            |  (610) 359-1001
 + If your life is a hard drive,  |  13 Roberts Road
 + Christ can be your backup.     |  Newtown Square, Pennsylvania 19073
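Bruce's overflow fallback might be sketched as below: a fixed-size shared list of dirty paths with an overflow flag, where overflowing costs precision (a coarse global sync()) but never correctness. The list size, field names, and return convention are illustrative assumptions:

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define LIST_MAX 64

struct sync_list {
    int  overflowed;           /* set when an entry would not fit */
    int  count;
    char paths[LIST_MAX][256]; /* would live in shared memory */
};

/* Flush everything the list covers.  Returns the number of files
 * individually fsynced, or -1 to signal the coarse sync() path. */
int checkpoint_flush(struct sync_list *l)
{
    if (l->overflowed) {
        sync();                /* coarse but safe: flush all dirty buffers */
        return -1;
    }
    int done = 0;
    for (int i = 0; i < l->count; i++) {
        int fd = open(l->paths[i], O_RDWR);
        if (fd >= 0) { fsync(fd); close(fd); done++; }
    }
    return done;
}
```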
Bruce Momjian wrote:
> But we have to allow an unlimited number of files. Perhaps we could
> just fall back to sync if the shared memory overflows, and shared memory
> is finite.

True. Hmm... perhaps there's another way to do this? Let me explain:

When we do a checkpoint, what we're really doing is writing any committed transactions in the transaction log to the associated data files, right? Or so the theory goes. PG may do something quite different than that. I'm not terribly familiar with the source, and so it may be no surprise that I'm having difficulty finding any code that converts transactions stored in the transaction log into changes to the data files...

Anyway, if a checkpoint really does take transactions and commit them to the data files, then the transactions themselves contain all the information we need. So there would be no need to maintain a separate list: the list has already been stored on disk for us. All we'd have to do is build a list at checkpoint time and fsync/fdatasync each file in the list at the very end. The list wouldn't need to be shared, because the only process that would care is the one doing the checkpointing.

Or so the theory goes. Since I'm having so much trouble finding code that actually does any of what I describe, I'd have no trouble believing that how PG works is very different than I envision...
--
Kevin Brown
kevin@sysexperts.com
> But even then, we don't actually have to track the *names* of the
> files that have changed, just their RelFileNodes, since there's a
> mapping function from the RelFileNode to the filename.

Right. I had noticed that too, and have made changes to my implementation. BTW, you need to track the block number as well: files > 1GB may be split into separate files (segments).

> > Of course, if there are lots of files, sync() may be faster than
> > opening/fsync/closing all those files.
>
> This is true, and is something I hadn't actually thought of. So it
> sounds like some testing would be in order.

I expect the difference between sync() and fsync() does not affect overall performance very much. The checkpoint is performed as a separate process, and the fsync() part of the checkpoint does nothing with WAL, which means the other backend processes, busy with WAL I/O, will not be bothered.
--
Tatsuo Ishii
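Tatsuo's point about the block number comes down to a little arithmetic: with PostgreSQL's 8kB blocks and 1GB segment files, the block number determines which physical segment file must be fsynced. A sketch using the defaults of the time (segment 0 is the bare relation file; later segments get a ".N" suffix):

```c
#define BLCKSZ      8192
#define RELSEG_SIZE (1024 * 1024 * 1024 / BLCKSZ)   /* blocks per 1GB segment */

/* Which segment file of a relation holds the given block number.
 * The checkpointer must flush this particular file, not just "the
 * relation", once tables grow past 1GB. */
unsigned segment_of(unsigned blockno)
{
    return blockno / RELSEG_SIZE;
}
```

So a RelFileNode alone identifies the relation, but (RelFileNode, segment_of(blockno)) identifies the file to flush.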
Kevin Brown <kevin@sysexperts.com> writes:
> Even that's not strictly necessary -- we *do* have shared memory we
> can use for this, and even when hundreds of tables have been written
> the list will only end up being a few tens of kilobytes in size (plus
> whatever overhead is required to track and manipulate the entries).

Why not just fsync _all_ the database table files? If there's no I/O pending on them then the fsync might be very quick. I guess that's something to test: how fast is fsyncing a few hundred file descriptors that have no pending changes? Hell, if you keep all the file descriptors open (yes, I know some systems have problems with that) you don't even need to open and close them, just loop calling fsync umpteen times.

> > Of course, if there are lots of files, sync() may be faster than
> > opening/fsync/closing all those files.
>
> This is true, and is something I hadn't actually thought of. So it
> sounds like some testing would be in order.

Note that the average case here isn't nearly as important as the worst case. It would really suck to find out that running some other program (say, a backup restore on another partition?) suddenly kills your database response because that sync suddenly has gigabytes of data to sync.

Actually, the real problem with sync() is that it's, er, asynchronous: the kernel may return before the data is actually synced. Waiting an arbitrary amount of time might work in the average case, but when something else fills the kernel buffers with pending I/O, that arbitrary time might prove to be inadequate, and it could open a window where a crash could corrupt database integrity.

> Unfortunately I know of no system call which will take an array of
> file descriptors (or file names! May as well go for the gold when
> wishing for something :-) and sync them all to disk in the most
> optimal way...

mmmmm, yum.
--
greg