Thread: Win32 Powerfail testing

Win32 Powerfail testing

From
Tatsuo Ishii
Date:
We are developing a Win32 port of PostgreSQL 7.3 (different from Jan's
implementation in that we are using a thread model; in the future I
hope we can contribute the source code). We have done power failure
testing using the test tool made by Dave Page:

Subject: [HACKERS] Win32 Powerfail testing - results
From: "Dave Page" <dpage@vale-housing.co.uk>
Date: Mon, 3 Feb 2003 16:51:33 -0000

So far we have found some interesting facts. Our Win32 port passes his
test in most cases. However, if the machine's power is turned off
shortly (10 to 20 seconds) after a checkpoint has been made, it does not
pass his test. So we think there is something wrong with the
checkpoint implementation for the Win32 port, which is essentially the
same as Jan's implementation, i.e. it uses _flushall() instead of
sync(). We were looking for a fix or an alternative implementation of
sync(), without success.

BTW, we found that the Cygwin port of PostgreSQL does not pass his test
either.
--
Tatsuo Ishii


Re: Win32 Powerfail testing

From
"Dave Page"
Date:

> -----Original Message-----
> From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
> Sent: 05 March 2003 02:23
> To: pgsql-hackers@postgresql.org
> Subject: [HACKERS] Win32 Powerfail testing
>
> So far we found interesting facts. Our Win32 port passes his
> test in most cases. However if power of the machine is turned
> off right after (10 to 20 seconds) the Checkpoint has been
> made, it does not pass his test. So we are thinking that
> there is something wrong with the checkpoint implementation for
> Win32 port, which is essentially same as Jan's
> implementation, i.e. using _flushall() instead of sync().  We
> were looking for a fix or an alternative implementation of
> sync() without success.

Hi Tatsuo,

Does this help:
http://support.microsoft.com/default.aspx?scid=kb;en-us;66052

Regards, Dave.


Re: Win32 Powerfail testing

From
Tatsuo Ishii
Date:
> > So far we found interesting facts. Our Win32 port passes his 
> > test in most cases. However if power of the machine is turned 
> > off right after (10 to 20 seconds) the Checkpoint has been 
> > made, it does not pass his test. So we are thinking that
> > there is something wrong with the checkpoint implementation for
> > Win32 port, which is essentially same as Jan's
> > implementation, i.e. using _flushall() instead of sync().  We
> > were looking for a fix or an alternative implementation of
> > sync() without success.
> 
> Hi Tatsuo,
> 
> Does this help:
> http://support.microsoft.com/default.aspx?scid=kb;en-us;66052

Sorry, but it does not help. The page says we could use
FlushFileBuffers() to sync the kernel buffers to disk. Unfortunately,
it requires a handle to the file to be flushed as its argument, so it
cannot be a replacement for sync(). Actually, I have modified the
buffer manager so that it remembers all file descriptors that have not
yet been synced to disk, in order to sync them at checkpoint time.
However, I found that this modification does not help at all, for some
reason I don't know.
--
Tatsuo Ishii


Re: Win32 Powerfail testing

From
"scott.marlowe"
Date:
On Wed, 5 Mar 2003, Dave Page wrote:

> 
> 
> > -----Original Message-----
> > From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp] 
> > Sent: 05 March 2003 02:23
> > To: pgsql-hackers@postgresql.org
> > Subject: [HACKERS] Win32 Powerfail testing
> >
> > So far we found interesting facts. Our Win32 port passes his 
> > test in most cases. However if power of the machine is turned 
> > off right after (10 to 20 seconds) the Checkpoint has been 
> > made, it does not pass his test. So we are thinking that
> > there is something wrong with the checkpoint implementation for
> > Win32 port, which is essentially same as Jan's
> > implementation, i.e. using _flushall() instead of sync().  We
> > were looking for a fix or an alternative implementation of
> > sync() without success.
> 
> Hi Tatsuo,
> 
> Does this help:
> http://support.microsoft.com/default.aspx?scid=kb;en-us;66052

OMG, I'm rolling.  You have to connect to the COMMODE.OBJ to fix a 
flushing problem.  Someone at MS has a sense of humor.  I thought running 
PHP on crack was funny (i.e. --with-crack switch to turn on cracklib) but 
this one is even better.



Re: Win32 Powerfail testing

From
Tatsuo Ishii
Date:
> > > So far we found interesting facts. Our Win32 port passes his 
> > > test in most cases. However if power of the machine is turned 
> > > off right after (10 to 20 seconds) the Checkpoint has been 
> > > made, it does not pass his test. So we are thinking that
> > > there is something wrong with the checkpoint implementation for
> > > Win32 port, which is essentially same as Jan's
> > > implementation, i.e. using _flushall() instead of sync().  We
> > > were looking for a fix or an alternative implementation of
> > > sync() without success.
> > 
> > Hi Tatsuo,
> > 
> > Does this help:
> > http://support.microsoft.com/default.aspx?scid=kb;en-us;66052
> 
> OMG, I'm rolling.  You have to connect to the COMMODE.OBJ to fix a 
> flushing problem.  Someone at MS has a sense of humor.  I thought running 
> PHP on crack was funny (i.e. --with-crack switch to turn on cracklib) but 
> this one is even better.

We have tried COMMODE.OBJ already. It does not seem to help at all.
--
Tatsuo Ishii


Re: Win32 Powerfail testing

From
Kevin Brown
Date:
Tatsuo Ishii wrote:
> Sorry, but it does not help. The page says we could use
> FlushFileBuffers() to sync the kernel buffer to the
> disk. Unfortunately, it requires a file descriptor to flush for its
> argument. Thus it could not be a replacement for sync(). Actually I
> have modified the buffer manager so that it remembers all file
> descriptors that have not been synced yet to the disk at the
> checkpoint time to sync them later. However I found this modification
> does not help at all, for some reason I don't know.

It would be an interesting comparison for you to roll the file
descriptor tracking changes into the Unix side of the tree and use
fsync() or fdatasync() in place of FlushFileBuffers() on the Unix side
(you'd have to remove or disable the code that does a sync() of
course).  If the end result yields no data corruption issues during
powerfail testing on various Unix platforms then it's reasonably
likely that the problem you're experiencing on the Windows side is
with the underlying Windows platform and not with your code.



-- 
Kevin Brown                          kevin@sysexperts.com


Re: Win32 Powerfail testing

From
Tatsuo Ishii
Date:
> It would be an interesting comparison for you to roll the file
> descriptor tracking changes into the Unix side of the tree and use
> fsync() or fdatasync() in place of FlushFileBuffers() on the Unix side
> (you'd have to remove or disable the code that does a sync() of
> course).  If the end result yields no data corruption issues during
> powerfail testing on various Unix platforms then it's reasonably
> likely that the problem you're experiencing on the Windows side is
> with the underlying Windows platform and not with your code.

Sounds like an idea. I'll do it if I have some spare time.
--
Tatsuo Ishii


Re: Win32 Powerfail testing

From
"Dave Page"
Date:

> -----Original Message-----
> From: Kevin Brown [mailto:kevin@sysexperts.com]
> Sent: 06 March 2003 04:37
> To: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Win32 Powerfail testing
>
>
> Tatsuo Ishii wrote:
> > Sorry, but it does not help. The page says we could use
> > FlushFileBuffers() to sync the kernel buffer to the
> > disk. Unfortunately, it requires a file descriptor to flush for its
> > argument. Thus it could not be a replacement for sync(). Actually I
> > have modified the buffer manager so that it remembers all file
> > descriptors that have not been synced yet to the disk at the
> > checkpoint time to sync them later. However I found this modification
> > does not help at all, for some reason I don't know.
>
> It would be an interesting comparison for you to roll the
> file descriptor tracking changes into the Unix side of the
> tree and use
> fsync() or fdatasync() in place of FlushFileBuffers() on the
> Unix side (you'd have to remove or disable the code that does
> a sync() of course).  If the end result yields no data
> corruption issues during powerfail testing on various Unix
> platforms then it's reasonably likely that the problem you're
> experiencing on the Windows side is with the underlying
> Windows platform and not with your code.

Agreed, but I still keep thinking that despite some people's claims that
Windows ain't up to it, DB2, SQL and Exchange Server, as well as probably
others that don't use raw partitions, have got over this problem, so
we should be able to as well. Admittedly Microsoft have a bit of an
advantage over us, but there must be some accessible way of flushing the
buffers in a guaranteed way. I'll look into it some more today if I
can...

Regards, Dave.


Re: Win32 Powerfail testing

From
"Dave Page"
Date:

> -----Original Message-----
> From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
> Sent: 05 March 2003 13:49
> To: Dave Page
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Win32 Powerfail testing
>
>
> > > So far we found interesting facts. Our Win32 port passes his
> > > test in most cases. However if power of the machine is turned
> > > off right after (10 to 20 seconds) the Checkpoint has been
> > > made, it does not pass his test. So we are thinking that
> > > there is something wrong with the checkpoint implementation for
> > > Win32 port, which is essentially same as Jan's
> > > implementation, i.e. using _flushall() instead of sync().  We
> > > were looking for a fix or an alternative implementation of
> > > sync() without success.
> >
> > Hi Tatsuo,
> >
> > Does this help:
> > http://support.microsoft.com/default.aspx?scid=kb;en-us;66052
>
> Sorry, but it does not help. The page says we could use
> FlushFileBuffers() to sync the kernel buffer to the
> disk. Unfortunately, it requires a file descriptor to flush
> for its argument. Thus it could not be a replacement for
> sync(). Actually I have modified the buffer manager so that
> it remembers all file descriptors that have not been synced
> yet to the disk at the checkpoint time to sync them later.
> However I found this modification does not help at all, for
> some reason I don't know.

How do you open the files (function, flags etc)?

Regards, Dave.


Re: Win32 Powerfail testing

From
Tatsuo Ishii
Date:
> > Sorry, but it does not help. The page says we could use
> > FlushFileBuffers() to sync the kernel buffer to the
> > disk. Unfortunately, it requires a file descriptor to flush 
> > for its argument. Thus it could not be a replacement for
> > sync(). Actually I have modified the buffer manager so that
> > it remembers all file descriptors that have not been synced
> > yet to the disk at the checkpoint time to sync them later.
> > However I found this modification does not help at all, for
> > some reason I don't know.
> 
> How do you open the files (function, flags etc)? 

Are you asking how files are opened in the buffer manager?
If so, PostgreSQL basically uses open() with the arguments (O_RDWR |
PG_BINARY, 0600).
--
Tatsuo Ishii


Re: Win32 Powerfail testing

From
"Dave Page"
Date:

> -----Original Message-----
> From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
> Sent: 06 March 2003 14:00
> To: Dave Page
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Win32 Powerfail testing
>
>
> > > Sorry, but it does not help. The page says we could use
> > > FlushFileBuffers() to sync the kernel buffer to the
> > > disk. Unfortunately, it requires a file descriptor to flush
> > > for its argument. Thus it could not be a replacement for
> > > sync(). Actually I have modified the buffer manager so that
> > > it remembers all file descriptors that have not been synced
> > > yet to the disk at the checkpoint time to sync them later.
> > > However I found this modification does not help at all, for
> > > some reason I don't know.
> >
> > How do you open the files (function, flags etc)?
>
> Are you asking how files are opened in the buffer
> manager? If so, basically PostgreSQL uses open() with flags
> (O_RDWR | PG_BINARY, 0600).

I cannot find it now, but I'm sure I read that FlushFileBuffers() has no
effect unless the file was opened with CreateFile() with the
GENERIC_WRITE flag. A quick google shows quite a few people recommending
that approach to others having trouble flushing files opened with fopen
or _open.

Regards, Dave.


Re: Win32 Powerfail testing

From
"Magnus Hagander"
Date:
> Agreed, but I still keep thinking that despite some people's
> claims that Windows ain't up to it, DB2, SQL and Exchange
> Server as well as probably others that don't use raw
> partitions have got over this problem, so therefore we should
> be able to. Admittedly Microsoft have a bit of an advantage
> over us, but there must be some accessible way of flushing
> the buffers in a guaranteed way. I'll look into it some more
> today if I can...

FWIW, I believe all the mentioned products (Ok, at least SQL and
Exchange) use the CreateFile() API with the flag FILE_FLAG_NO_BUFFERING.
It has the following constraints, though, which they code around in the
app code I guess:

***
Instructs the system to open the file with no intermediate buffering or
caching. When combined with FILE_FLAG_OVERLAPPED, the flag gives maximum
asynchronous performance, because the I/O does not rely on the
synchronous operations of the memory manager. However, some I/O
operations will take longer, because data is not being held in the
cache.
An application must meet certain requirements when working with files
opened with FILE_FLAG_NO_BUFFERING:


File access must begin at byte offsets within the file that are integer
multiples of the volume's sector size.

File access must be for numbers of bytes that are integer multiples of
the volume's sector size. For example, if the sector size is 512 bytes,
an application can request reads and writes of 512, 1024, or 2048 bytes,
but not of 335, 981, or 7171 bytes.

Buffer addresses for read and write operations should be sector aligned
(aligned on addresses in memory that are integer multiples of the
volume's sector size). Depending on the disk, this requirement may not
be enforced.

One way to align buffers on integer multiples of the volume sector size
is to use VirtualAlloc to allocate the buffers. It allocates memory that
is aligned on addresses that are integer multiples of the operating
system's memory page size. Because both memory page and volume sector
sizes are powers of 2, this memory is also aligned on addresses that are
integer multiples of a volume's sector size.

An application can determine a volume's sector size by calling the
GetDiskFreeSpace function.
***
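
The alignment rules quoted above come down to simple mask arithmetic. What follows is only a minimal sketch of that arithmetic, not code from any of the ports discussed in this thread; the function names are made up for illustration, and the sector size is assumed to be a power of two, as the quoted text says. The sample values 335, 981 and 7171 are the invalid sizes from the quoted documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if n is a power of two (the quoted text says sector sizes are). */
int is_power_of_two(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Round a byte count up to the next multiple of the sector size. */
size_t round_up_to_sector(size_t nbytes, size_t sector_size)
{
    assert(is_power_of_two(sector_size));
    return (nbytes + sector_size - 1) & ~(sector_size - 1);
}

/*
 * Nonzero if the file offset, the transfer length and the buffer address
 * are all integer multiples of the sector size, i.e. the three conditions
 * listed for FILE_FLAG_NO_BUFFERING.
 */
int is_valid_unbuffered_io(uint64_t offset, size_t nbytes,
                           const void *buf, size_t sector_size)
{
    assert(is_power_of_two(sector_size));
    size_t mask = sector_size - 1;

    return (offset & mask) == 0
        && (nbytes & mask) == 0
        && ((uintptr_t) buf & mask) == 0;
}
```

With a 512-byte sector, a request of 335 bytes would have to be padded to 512, and one of 981 bytes to 1024, before it could be issued against an unbuffered file.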

There is also the flag FILE_FLAG_WRITE_THROUGH which says:
Instructs the system to write through any intermediate cache and go
directly to disk. The system can still cache write operations, but
cannot lazily flush them.



But that's the CreateFile() Win32 API. The question is how the fopen()
etc calls are mapped to Win32 calls, I'd guess.


//Magnus


Re: Win32 Powerfail testing

From
Tatsuo Ishii
Date:
> > Are you asking how files are opened in the buffer
> > manager? If so, basically PostgreSQL uses open() with flags
> > (O_RDWR | PG_BINARY, 0600).
> 
> I cannot find it now, but I'm sure I read that FlushFileBuffers() has no
> effect unless the file was opened with CreateFile() with the
> GENERIC_WRITE flag. A quick google shows quite a few people recommending
> that approach to others having trouble flushing files opened with fopen
> or _open.

I'm sure FlushFileBuffers() is useless for files opened with open()
too.

As I said in the previous mails, open()+_commit() does the right job
with the transaction log files. So I think I should probably stick
with the open()+_commit() approach for ordinary table/index files too.
--
Tatsuo Ishii


Re: Win32 Powerfail testing

From
"Dave Page"
Date:

> -----Original Message-----
> From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
> Sent: 06 March 2003 15:17
> To: Dave Page
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Win32 Powerfail testing
>
> I'm sure FlushFileBuffers() is useless for files opened with
> open() too.
>
> As I said in the previous mails, open()+_commit() does the
> right job with the transaction log files. So probably I think
> I should stick with open()+_commit() approach for ordinary
> table/index files too.

Oh, I didn't see that message. So it's either:

open() + _commit()

Or

CreateFile() + FlushFileBuffers()

Magnus also mentioned using FILE_FLAG_NO_BUFFERING or
FILE_FLAG_WRITE_THROUGH with CreateFile(). I was concerned about the
additional complexity with FILE_FLAG_NO_BUFFERING, but
FILE_FLAG_WRITE_THROUGH sounds like it might do the job, if a little
sub-optimally.

Is there really no way of allowing a decent write cache, but then being
able to guarantee a flush at the required time? Sounds a little cuckoo
to me but then it is Microsoft...

Anyhoo, it sounds like open() and _commit() is the best choice, as you
say.

Regards, Dave.
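
For readers comparing the two options listed above, here is a rough sketch of the CreateFile() + FlushFileBuffers() pairing, with an open() + fsync() branch for Unix builds. It is an illustration under assumptions, not code from either port: write_durably() is a hypothetical helper name, and the creation and sharing flags were chosen only to keep the example short. Note that the handle is opened with GENERIC_WRITE, which is the access FlushFileBuffers() needs.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>

/* Write len bytes to path and force them to disk. Returns 0 or -1. */
int write_durably(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    /* FlushFileBuffers() needs a handle opened with GENERIC_WRITE. */
    HANDLE h = CreateFileA(path, GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return -1;

    DWORD written = 0;
    BOOL  ok = WriteFile(h, buf, (DWORD) len, &written, NULL)
            && written == (DWORD) len
            && FlushFileBuffers(h);   /* flush this one file's buffers */

    CloseHandle(h);
    return ok ? 0 : -1;
}

#else
#include <fcntl.h>
#include <unistd.h>

/* Write len bytes to path and force them to disk. Returns 0 or -1. */
int write_durably(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    ssize_t n  = write(fd, buf, len);
    int     ok = (n == (ssize_t) len) && fsync(fd) == 0;

    if (close(fd) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
#endif
```

Whether the data actually survives a power failure still depends on the drive's own write cache, which neither call is guaranteed to control.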


Re: Win32 Powerfail testing

From
"Merlin Moncure"
Date:
My experience with windows backend work is that you have to turn off all
buffering and implement your own write cache of sorts.  Flushing is not
the only reason: heavy buffering of files (the default behavior) also
tends to thrash the server, because the cache does not always release
memory properly.

Likewise, with memory for maximum results you have to go straight to
VirtualAlloc() and avoid using the C run time to do any persistent
memory allocation.  Memory pages get mapped to file pages and all file
reads/writes are on sector boundaries.  Generally, it's a nightmare.
Merlin





Re: Win32 Powerfail testing

From
Tatsuo Ishii
Date:
> > As I said in the previous mails, open()+_commit() does the
> > right job with the transaction log files. So probably I think 
> > I should stick with open()+_commit() approach for ordinary 
> > table/index files too.
> 
> Oh, I didn't see that message. So it's either:
> 
> open() + _commit()

Sorry, I did not mention it explicitly. I meant we use the same
implementation as Jan's work. He uses open()+_commit(), I believe.
--
Tatsuo Ishii


Re: Win32 Powerfail testing

From
"Dave Page"
Date:

> -----Original Message-----
> From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
> Sent: 07 March 2003 01:33
> To: Dave Page
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Win32 Powerfail testing
>
>
> > > As I said in the previous mails, open()+_commit() does the
> > > right job with the transaction log files. So probably I think
> > > I should stick with open()+_commit() approach for ordinary
> > > table/index files too.
> >
> > Oh, I didn't see that message. So it's either:
> >
> > open() + _commit()
>
> Sorry, I did not mention it explicitly. I meant we use the
> same implementation as Jan's work. He uses open()+_commit(),
> I believe.

Ah, but Jan/Katie's code *did not* survive the powerfails. Is there a
relatively easy way we can test open()/_commit against
CreateFile()/FlushFileBuffers() with the FILE_FLAG_WRITE_THROUGH flag as
suggested by Magnus (and indirectly by Merlin I guess)?

Regards, Dave.


Re: Win32 Powerfail testing

From
Tatsuo Ishii
Date:
> > > > As I said in the previous mails, open()+_commit() does the
> > > > right job with the transaction log files. So probably I think 
> > > > I should stick with open()+_commit() approach for ordinary 
> > > > table/index files too.
> > > 
> > > Oh, I didn't see that message. So it's either:
> > > 
> > > open() + _commit()
> > 
> > Sorry, I did not mention it explicitly. I meant we use the 
> > same implementation as Jan's work. He uses open()+_commit(), 
> > I believe.
> 
> Ah, but Jan/Katie's code *did not* survive the powerfails. Is there a
> relatively easy way we can test open()/_commit against
> CreateFile()/FlushFileBuffers() with the FILE_FLAG_WRITE_THROUGH flag as
> suggested by Magnus (and indirectly by Merlin I guess)?

There are two stages where a synchronized write is needed. One is WAL
log writing. We confirmed that open()/_commit() is OK for this.

The other is the checkpoint. Here we need to flush the kernel buffers
holding previous writes to table/index files. To sync those files,
PostgreSQL uses sync(). I guess Jan's implementation did not survive in
this case (neither did mine).

Today I revisited the implementation I made several days ago (replacing
sync() with open/_commit) and found a bug in it (thanks to Hiroshi).
With the fixed version, my Win32 port now passes your test even right
after a checkpoint!
--
Tatsuo Ishii


Re: Win32 Powerfail testing

From
"Dave Page"
Date:

> -----Original Message-----
> From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
> Sent: 07 March 2003 08:37
> To: Dave Page
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Win32 Powerfail testing
>
>
> >
> > Ah, but Jan/Katie's code *did not* survive the powerfails. Is there a
> > relatively easy way we can test open()/_commit against
> > CreateFile()/FlushFileBuffers() with the FILE_FLAG_WRITE_THROUGH flag
> > as suggested by Magnus (and indirectly by Merlin I guess)?
>
> There are two stages where a synchronized write is needed.
> One is WAL log writing. We confirmed that with open()/_commit
> this is ok.
>
> The other is checkpoint. Here we need to flush kernel buffers
> holding previous write to table/index files. To sync those
> files, PostgreSQL uses sync(). I guess Jan's implementation
> did not survive in this case (mine neither).
>
> Today I revisited the implementation (replacing sync() with
> open/_commit) I made several days ago and found a bug with it
> (thanks to Hiroshi). With the fixed version of it, now my
> Win32 port has passed your test even right after checkpoint!

Ahh, excellent news. Another step nearer to a good quality Windows port
:-)

Regards, Dave.


Re: Win32 Powerfail testing

From
Kevin Brown
Date:
Tatsuo Ishii wrote:
> Today I revisited the implementation (replacing sync() with
> open/_commit) I made several days ago and found a bug with it (thanks
> to Hiroshi). With the fixed version of it, now my Win32 port has
> passed your test even right after checkpoint!

I presume that this implementation tracks which files have been opened
and uses _commit() to write all the changes to disk for those files?

If so, then it would be of significant value, IMHO, if you could
abstract the changes in such a way that they could be applied to the
Unix side as well.

sync() writes *all* uncommitted buffers to disk, whether or not they
belong to the process or process group that initiated the sync().  On
systems which do more than just host PG, a sync() does more work
(sometimes much more work) than is necessary and will unnecessarily
burden the system with writes.  I think it would be a win, from a
design standpoint if nothing else, if PG committed only those pages
that it was responsible for.

The Unix equivalent of _commit() appears to be fsync() or fdatasync().
So it sounds a lot like a "port" to Unix of the changes you have made
for this might easily be a trivial search and replace.  :-)

-- 
Kevin Brown                          kevin@sysexperts.com


Re: Win32 Powerfail testing

From
Hannu Krosing
Date:
Kevin Brown kirjutas R, 07.03.2003 kell 12:05:
> Tatsuo Ishii wrote:
> > Today I revisited the implementation (replacing sync() with
> > open/_commit) I made several days ago and found a bug with it (thanks
> > to Hiroshi). With the fixed version of it, now my Win32 port has
> > passed your test even right after checkpoint!
> 
> I presume that this implementation tracks which files have been opened
> and uses _commit() to write all the changes to disk for those files?

But are there guarantees that all closed files are flushed to disk as
well?

Does postgres guarantee it by doing a _commit() before close(), or do
file system semantics guarantee that a file hits the disk when close()'d (I
guess they do not)?

-------------
Hannu



Re: Win32 Powerfail testing

From
Tatsuo Ishii
Date:
> > I presume that this implementation tracks which files have been opened
> > and uses _commit() to write all the changes to disk for those files?
> 
> > But are there guarantees that all closed files are flushed to disk as
> > well?
> > 
> > Does postgres guarantee it by doing a _commit() before close(), or do
> > file system semantics guarantee that a file hits the disk when close()'d (I
> > guess they do not)?

Of course not. So I remember all file names to be flushed in a table
(placed in thread-global memory) and later open/_commit/close them at
checkpoint time.
--
Tatsuo Ishii
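
The table-of-file-names approach described above might look roughly like the sketch below. This is a guess at the general shape, not the actual patch: the function names, the fixed table size and the macro names are all hypothetical, and the table is not protected by any lock, which a threaded backend would need. On Windows it uses _open()/_commit()/_close(); elsewhere it uses the open()/fsync()/close() equivalents suggested earlier in the thread.

```c
#include <assert.h>
#include <string.h>

#ifdef _WIN32
#include <io.h>
#include <fcntl.h>
#define open_rw(path)  _open((path), _O_RDWR | _O_BINARY)
#define sync_fd(fd)    _commit(fd)
#define close_fd(fd)   _close(fd)
#else
#include <fcntl.h>
#include <unistd.h>
#define open_rw(path)  open((path), O_RDWR)
#define sync_fd(fd)    fsync(fd)
#define close_fd(fd)   close(fd)
#endif

#define PENDING_MAX 64     /* hypothetical table size */
#define PATH_CAP    256    /* hypothetical maximum path length */

static char pending[PENDING_MAX][PATH_CAP];
static int  npending = 0;

/* Record that path has writes not yet synced. Returns 0, or -1 if it
 * cannot be recorded (path too long or table full). */
int remember_dirty_file(const char *path)
{
    assert(path != NULL);
    size_t len = strlen(path);
    int    i;

    if (len >= PATH_CAP)
        return -1;
    for (i = 0; i < npending; i++)
        if (strcmp(pending[i], path) == 0)
            return 0;                      /* already listed */
    if (npending == PENDING_MAX)
        return -1;
    memcpy(pending[npending++], path, len + 1);
    return 0;
}

/* At checkpoint time: open, sync and close every remembered file, then
 * clear the table. Returns the number of files that could not be synced. */
int sync_dirty_files(void)
{
    int failures = 0;
    int i;

    for (i = 0; i < npending; i++)
    {
        int fd = open_rw(pending[i]);

        if (fd < 0)
        {
            failures++;
            continue;
        }
        if (sync_fd(fd) != 0)
            failures++;
        close_fd(fd);
    }
    npending = 0;
    return failures;
}
```

A real implementation would also have to decide what a failed open means, for example when a table was dropped between the write and the checkpoint.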


Re: Win32 Powerfail testing

From
Bruce Momjian
Date:
Kevin Brown wrote:
> Tatsuo Ishii wrote:
> > Today I revisited the implementation (replacing sync() with
> > open/_commit) I made several days ago and found a bug with it (thanks
> > to Hiroshi). With the fixed version of it, now my Win32 port has
> > passed your test even right after checkpoint!
> 
> I presume that this implementation tracks which files have been opened
> and uses _commit() to write all the changes to disk for those files?
> 
> If so, then it would be of significant value, IMHO, if you could
> abstract the changes in such a way that they could be applied to the
> Unix side as well.
> 
> sync() writes *all* uncommitted buffers to disk, whether or not they
> belong to the process or process group that initiated the sync().  On
> systems which do more than just host PG, a sync() does more work
> (sometimes much more work) than is necessary and will unnecessarily
> burden the system with writes.  I think it would be a win, from a
> design standpoint if nothing else, if PG committed only those pages
> that it was responsible for.
> 
> The Unix equivalent of _commit() appears to be fsync() or fdatasync().
> So it sounds a lot like a "port" to Unix of the changes you have made
> for this might easily be a trivial search and replace.  :-)

The idea of using this on Unix is tempting, but Tatsuo is using a
threaded backend, so it is a little easier to do.  However, it would
probably be pretty easy to write a file of modified file names that the
checkpoint could read and open/fsync/close.

Of course, if there are lots of files, sync() may be faster than
opening/fsync/closing all those files.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Win32 Powerfail testing

From
Kevin Brown
Date:
Bruce Momjian wrote:
> The idea of using this on Unix is tempting, but Tatsuo is using a
> threaded backend, so it is a little easier to do.  However, it would
> probably be pretty easy to write a file of modified file names that the
> checkpoint could read and open/fsync/close.

Even that's not strictly necessary -- we *do* have shared memory we
can use for this, and even when hundreds of tables have been written
the list will only end up being a few tens of kilobytes in size (plus
whatever overhead is required to track and manipulate the entries).

But even then, we don't actually have to track the *names* of the
files that have changed, just their RelFileNodes, since there's a
mapping function from the RelFileNode to the filename.

> Of course, if there are lots of files, sync() may be faster than
> opening/fsync/closing all those files.

This is true, and is something I hadn't actually thought of.  So it
sounds like some testing would be in order.

Unfortunately I know of no system call which will take an array of
file descriptors (or file names!  May as well go for the gold when
wishing for something :-) and sync them all to disk in the most
optimal way...


-- 
Kevin Brown                          kevin@sysexperts.com


Re: Win32 Powerfail testing

From
Bruce Momjian
Date:
Kevin Brown wrote:
> Bruce Momjian wrote:
> > The idea of using this on Unix is tempting, but Tatsuo is using a
> > threaded backend, so it is a little easier to do.  However, it would
> > probably be pretty easy to write a file of modified file names that the
> > checkpoint could read and open/fsync/close.
> 
> Even that's not strictly necessary -- we *do* have shared memory we
> can use for this, and even when hundreds of tables have been written
> the list will only end up being a few tens of kilobytes in size (plus
> whatever overhead is required to track and manipulate the entries).
> 
> But even then, we don't actually have to track the *names* of the
> files that have changed, just their RelFileNodes, since there's a
> mapping function from the RelFileNode to the filename.

But we have to allow an unlimited number of files, and shared memory
is finite.  Perhaps we could just fall back to sync() if the shared
memory overflows.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
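
The fall-back idea above can be sketched as a bounded table with an overflow flag. This is only an illustration of the bookkeeping, under assumptions: the type and function names are invented, the capacity is deliberately tiny, entries are identified by a relation number and a segment number, and no shared-memory allocation or locking is shown.

```c
#include <assert.h>
#include <stddef.h>

#define PENDING_CAP 4   /* tiny on purpose; a real limit would be tuned */

typedef struct
{
    unsigned relNode;   /* which relation file was written */
    unsigned segno;     /* which segment of that relation */
} PendingEntry;

typedef struct
{
    PendingEntry entries[PENDING_CAP];
    size_t       count;
    int          overflowed;   /* set once an entry could not be stored */
} PendingTable;

void pending_init(PendingTable *t)
{
    assert(t != NULL);
    t->count = 0;
    t->overflowed = 0;
}

/* Record a written file; duplicates are ignored. When the table is full,
 * only the overflow flag is set and the entry itself is dropped. */
void pending_add(PendingTable *t, unsigned relNode, unsigned segno)
{
    size_t i;

    assert(t != NULL);
    if (t->overflowed)
        return;
    for (i = 0; i < t->count; i++)
        if (t->entries[i].relNode == relNode && t->entries[i].segno == segno)
            return;
    if (t->count == PENDING_CAP)
    {
        t->overflowed = 1;
        return;
    }
    t->entries[t->count].relNode = relNode;
    t->entries[t->count].segno = segno;
    t->count++;
}

/* 1 if the checkpoint must fall back to a full sync(), 0 if syncing the
 * listed entries one by one is sufficient. */
int pending_needs_full_sync(const PendingTable *t)
{
    assert(t != NULL);
    return t->overflowed;
}
```

At checkpoint time the caller would either sync each listed file individually or, if pending_needs_full_sync() returns 1, call sync() once, and then reset the table with pending_init().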
 


Re: Win32 Powerfail testing

From
Kevin Brown
Date:
Bruce Momjian wrote:
> Kevin Brown wrote:
> > Bruce Momjian wrote:
> > > The idea of using this on Unix is tempting, but Tatsuo is using a
> > > threaded backend, so it is a little easier to do.  However, it would
> > > probably be pretty easy to write a file of modified file names that the
> > > checkpoint could read and open/fsync/close.
> > 
> > Even that's not strictly necessary -- we *do* have shared memory we
> > can use for this, and even when hundreds of tables have been written
> > the list will only end up being a few tens of kilobytes in size (plus
> > whatever overhead is required to track and manipulate the entries).
> > 
> > But even then, we don't actually have to track the *names* of the
> > files that have changed, just their RelFileNodes, since there's a
> > mapping function from the RelFileNode to the filename.
> 
> But we have to allow an unlimited number of files, and shared memory
> is finite.  Perhaps we could just fall back to sync() if the shared
> memory overflows.

True.

Hmm...perhaps there's another way to do this?  Let me explain:

When we do a checkpoint what we're really doing is writing any
committed transactions in the transaction log to the associated data
files, right?

Or, so the theory goes.  PG may do something quite different than
that.  I'm not terribly familiar with the source and so it may be no
surprise that I'm having difficulty finding any code that converts
transactions stored in the transaction log into changes to the data
files...

Anyway, if a checkpoint really does take transactions and commit them
to the data files, then the transactions themselves contain all the
information we need.  So there would be no need to maintain a separate
list: the list has already been stored on disk for us.  All we'd have
to do is build a list at checkpoint time and fsync/fdatasync each file
in the list at the very end.  The list wouldn't need to be shared
because the only process that would care is the one doing the
checkpointing.
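
The final "fsync each file in the list" pass could look roughly like this on a POSIX system. This is only a sketch: fsync_paths is a made-up name, the list is a plain array private to the checkpointing process, and how the list gets built from the log is left out entirely.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Open, fsync and close each path in turn.
 * Returns the number of paths that could not be synced.
 */
static size_t fsync_paths(const char *const *paths, size_t n)
{
    size_t failures = 0;

    assert(paths != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int fd = open(paths[i], O_RDWR);
        if (fd < 0) {
            failures++;            /* missing or unreadable file */
            continue;
        }
        int ok = (fsync(fd) == 0); /* blocks until the data is on disk */
        if (close(fd) != 0)
            ok = 0;
        if (!ok)
            failures++;
    }
    return failures;
}
```

A nonzero return would mean the checkpoint cannot be declared complete, since at least one file may still have unwritten data.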

Or so the theory goes.  Since I'm having so much trouble finding code
that actually does any of what I describe, I'd have no trouble
believing that how PG works is very different than I envision...



-- 
Kevin Brown                          kevin@sysexperts.com


Re: Win32 Powerfail testing

From
Tatsuo Ishii
Date:
> But even then, we don't actually have to track the *names* of the
> files that have changed, just their RelFileNodes, since there's a
> mapping function from the RelFileNode to the filename.

Right. I have noticed that too and have made changes to my
implementation.

BTW, you need to track the block number as well. Files > 1GB may be
split into separate files (segments).
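
To illustrate why the block number matters: the segment file to fsync can be derived from it. This sketch assumes the default 8 kB block size and 1 GB segments, with segment 0 as the bare file name and later segments carrying a ".N" suffix. The helper names segment_of and segment_path are invented here.

```c
#include <assert.h>
#include <stdio.h>

#define BLCKSZ      8192u                   /* assumed default block size */
#define RELSEG_SIZE (0x40000000u / BLCKSZ)  /* blocks per 1 GB segment */

/* Segment number holding the given block; segment 0 is the base file. */
static unsigned segment_of(unsigned blkno)
{
    return blkno / RELSEG_SIZE;
}

/*
 * Write the segment's file name into buf: "base" for segment 0,
 * "base.N" for segment N. Returns what snprintf returns.
 */
static int segment_path(char *buf, size_t len, const char *base, unsigned blkno)
{
    unsigned seg = segment_of(blkno);

    assert(buf != NULL && base != NULL);
    if (seg == 0)
        return snprintf(buf, len, "%s", base);
    return snprintf(buf, len, "%s.%u", base, seg);
}
```

With these assumptions a write to block 300000 lands in the third file of the relation (suffix ".2"), so tracking only the RelFileNode would leave that segment unsynced.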

> > Of course, if there are lots of files, sync() may be faster than
> > opening/fsync/closing all those files.
> 
> This is true, and is something I hadn't actually thought of.  So it
> sounds like some testing would be in order.

I don't think the difference between sync() and fsync() affects
overall performance much. The checkpoint runs as a separate process,
and the fsync() part of the checkpoint does not touch WAL, so other
backend processes that are busy with WAL I/O will not be disturbed.
--
Tatsuo Ishii


Re: Win32 Powerfail testing

From
Greg Stark
Date:
Kevin Brown <kevin@sysexperts.com> writes:

> Even that's not strictly necessary -- we *do* have shared memory we
> can use for this, and even when hundreds of tables have been written
> the list will only end up being a few tens of kilobytes in size (plus
> whatever overhead is required to track and manipulate the entries).

Why not just fsync _all_ the database table files? If there's no I/O pending
on them then the fsync might be very quick. I guess that's something to test,
how fast is fsyncing a few hundred file descriptors that have no pending
changes. Hell, if you keep all the file descriptors open (yes I know some
systems have problems with that) you don't even need to open and close them,
just loop calling fsync umpteen times.
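
Something like the following could be used to measure that, assuming a POSIX system and an array of descriptors that are already open. The name fsync_all and the timing approach are just one way to set up the test, not anything taken from the PostgreSQL source.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>
#include <unistd.h>

/*
 * Call fsync on every descriptor in fds[].
 * Returns the number of failed calls; if elapsed is non-NULL it receives
 * the wall-clock time of the whole loop in seconds.
 */
static size_t fsync_all(const int *fds, size_t n, double *elapsed)
{
    struct timespec t0, t1;
    size_t failures = 0;

    assert(fds != NULL || n == 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < n; i++)
        if (fsync(fds[i]) != 0)
            failures++;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (elapsed != NULL)
        *elapsed = (double)(t1.tv_sec - t0.tv_sec)
                 + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return failures;
}
```

Running it once over a few hundred idle descriptors and once after dirtying a handful of them would show how much of the cost comes from files with nothing pending.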

> > Of course, if there are lots of files, sync() may be faster than
> > opening/fsync/closing all those files.
> 
> This is true, and is something I hadn't actually thought of.  So it
> sounds like some testing would be in order.

Note that the average case here isn't nearly as important as the worst-case.
It would really suck to find out that running some other program (say a backup
restore on another partition?) suddenly kills your database response because
that sync suddenly has gigabytes of data to sync. 

Actually the real problem with sync is that it's, er, asynchronous. The kernel
will return before the data is actually synced. Waiting an arbitrary amount of
time might work in the average case but when something else fills the kernel
buffers with pending i/o that arbitrary time might prove to be inadequate and
it could open a window where a crash could corrupt the database integrity.

> Unfortunately I know of no system call which will take an array of
> file descriptors (or file names!  May as well go for the gold when
> wishing for something :-) and sync them all to disk in the most
> optimal way...

mmmmm, yum.

-- 
greg