Thread: Idea for fixing the Windows fsync problem

Idea for fixing the Windows fsync problem

From
Tom Lane
Date:
I just had a thought about fixing those Windows "permission denied"
problems.  The case that we believe we understand is where the bgwriter
is trying to execute a previously-logged fsync request against a table
file that is pending delete --- that is, actually has been unlink()'d,
but some other process is holding an open file reference to it.  The
problem is only for fsync, not for write(), because the table drop
sequence always invalidates every shared buffer for the table before
trying to unlink it.

So: maybe the solution is to add a step to the drop sequence, namely
revoking any pending fsync request, before unlink.  This would not only
clean up the Windows issue, it'd also let us remove the current hack in
md.c to not complain about an ENOENT failure (which is really hardly any
safer than ignoring EACCES would be, if you want to be honest about it).

The problem is that the ForwardFsyncRequest() mechanism is asynchronous:
currently, a backend could see pending fsync requests that are still in
the shared-memory queue, but there's no way to tell whether the bgwriter
has already absorbed some requests into its private memory.  How can a
backend tell the bgwriter to forget about it, and then delay until it
can be sure that the bgwriter won't try it later?

We could have backends put "revoke fsync" requests into the shared queue
and then sleep until they see the queue has been drained ... but there's
not a convenient way to implement that delay, and I hardly want to just
"sleep and retry" during every table drop.  It'd probably take at least
one more LWLock, and noticeably more complicated ForwardFsyncRequest()
logic, to make this work.

Thoughts?  Is this a reasonable solution path, or is it likely to be a
waste of time?  We know that there are causes of "permission denied"
that are not explained by the pending-delete problem.
        regards, tom lane


Re: Idea for fixing the Windows fsync problem

From
Magnus Hagander
Date:
Tom Lane wrote:
> I just had a thought about fixing those Windows "permission denied"
> problems.  The case that we believe we understand is where the bgwriter
> is trying to execute a previously-logged fsync request against a table
> file that is pending delete --- that is, actually has been unlink()'d,
> but some other process is holding an open file reference to it.  The
> problem is only for fsync, not for write(), because the table drop
> sequence always invalidates every shared buffer for the table before
> trying to unlink it.
> 
> So: maybe the solution is to add a step to the drop sequence, namely
> revoking any pending fsync request, before unlink.  This would not only
> clean up the Windows issue, it'd also let us remove the current hack in
> md.c to not complain about an ENOENT failure (which is really hardly any
> safer than ignoring EACCES would be, if you want to be honest about it).

Sounds good so far :-)


> The problem is that the ForwardFsyncRequest() mechanism is asynchronous:
> currently, a backend could see pending fsync requests that are still in
> the shared-memory queue, but there's no way to tell whether the bgwriter
> has already absorbed some requests into its private memory.  How can a
> backend tell the bgwriter to forget about it, and then delay until it
> can be sure that the bgwriter won't try it later?
> 
> We could have backends put "revoke fsync" requests into the shared queue
> and then sleep until they see the queue has been drained ... but there's
> not a convenient way to implement that delay, and I hardly want to just
> "sleep and retry" during every table drop.  It'd probably take at least
> one more LWLock, and noticeably more complicated ForwardFsyncRequest()
> logic, to make this work.
> 
> Thoughts?  Is this a reasonable solution path, or is it likely to be a
> waste of time?  We know that there are causes of "permission denied"
> that are not explained by the pending-delete problem.

Do we need to actually wait for it? Does the backend need to know when
it's done? If it fires off the "discard" request, then it's up to the
bgwriter to see it in time, no?

Perhaps we could have the bgwrite check the queue *if* it gets the
ENOENT/EACCESS error and then re-check the queue for drops on that file?
Or maybe that's even more complex?  (I confess I haven't looked at the
code..)

//Magnus


Re: Idea for fixing the Windows fsync problem

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> Perhaps we could have the bgwrite check the queue *if* it gets the
> ENOENT/EACCESS error and then re-check the queue for drops on that file?

Hmm ... seems a bit ugly, but then again I've not been able to come up
with a nice way of making the backends wait for queue drain either; and
not having to make them wait at all is certainly attractive.  I'll look
into doing it like that.  Thanks for the idea!
        regards, tom lane


Re: Idea for fixing the Windows fsync problem

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> Tom Lane wrote:
>> So: maybe the solution is to add a step to the drop sequence, namely
>> revoking any pending fsync request, before unlink.

> Perhaps we could have the bgwrite check the queue *if* it gets the
> ENOENT/EACCESS error and then re-check the queue for drops on that file?

I've committed a tentative patch along these lines to HEAD.  Please
test.
        regards, tom lane


Re: Idea for fixing the Windows fsync problem

From
"Takayuki Tsunakawa"
Date:
From: "Tom Lane" <tgl@sss.pgh.pa.us>I suggested that here
> http://archives.postgresql.org/pgsql-hackers/2007-01/msg00642.php
> but have received no feedback about it ...

I'm sorry, I missed it.

From: "Tom Lane" <tgl@sss.pgh.pa.us>
> Magnus Hagander <magnus@hagander.net> writes:
>> Tom Lane wrote:
>>> So: maybe the solution is to add a step to the drop sequence,
namely
>>> revoking any pending fsync request, before unlink.
>
>> Perhaps we could have the bgwrite check the queue *if* it gets the
>> ENOENT/EACCESS error and then re-check the queue for drops on that
file?
>
> I've committed a tentative patch along these lines to HEAD.  Please
> test.

I agree with Magnus-san's suggestion, too.
Though I'm willing to test, I'm not familiar with building on Windows
yet and do not have enogh time for other works right now.  If someone
builds and gives me the new postgres.exe, I'll put it on my 8.2
installation and test.  Or, could anyone do the following?  These are
what I did in yesterday's test.

1. Open two psql sessions.  Let me call those session1  and session2.

2. On session1, execute:

create table a (c int);
insert into a values(1);

3. On session2, execute:

select * from a;

4. On session1, execute:

drop table a;
checkpoint;

Checkpoint command here reported an error yesterday.  If Tom-san's
patch is effective, it should not fail and no messages are put in the
event log.





Re: Idea for fixing the Windows fsync problem

From
Tom Lane
Date:
I wrote:
> I've committed a tentative patch along these lines to HEAD.  Please
> test.

So I come home from dinner out, and find the buildfarm all red :-(

I'm not sure why I didn't see this failure in my own testing, but in
hindsight it's quite obvious that if the bgwriter is to take a hard
line about fsync failures, it's got to be told about DROP DATABASE
not only DROP TABLE --- that is, there has to be a signaling message
for "revoke fsync requests across whole database".

I think that it should not be necessary to invent a signal for "drop
across tablespace", though, because we don't allow DROP TABLESPACE to
remove any tables --- you've got to drop tables and/or databases to
clean out the tablespace, first.  Anyone see a flaw in that?

Not up to fixing this right now, but will have a go at it in the
morning, unless someone else wants to take a swing at it meanwhile.

BTW: what happens on Windows if we're trying to do the equivalent
of "rm -rf database-dir" and someone is holding open one of the files
in the directory?  Or has the directory itself open for readdir()?
        regards, tom lane


Re: Idea for fixing the Windows fsync problem

From
"Takayuki Tsunakawa"
Date:
From: "Tom Lane" <tgl@sss.pgh.pa.us>
>I wrote:
>> I've committed a tentative patch along these lines to HEAD.  Please
>> test.
>
> So I come home from dinner out, and find the buildfarm all red :-(
>
> I'm not sure why I didn't see this failure in my own testing, but in
> hindsight it's quite obvious that if the bgwriter is to take a hard
> line about fsync failures, it's got to be told about DROP DATABASE
> not only DROP TABLE --- that is, there has to be a signaling message
> for "revoke fsync requests across whole database".
>

Excuse me if I misunderstand English and say something strange.
I thought "buildfarm is red" meant the failure of regression test.
What kind of errors did you get in what operation (e.g. DROP INDEX)?
Is everyone is able to see the test result freely?  Sorry, I'll read
developer's FAQ when I have more time.
And I thought you meant you have a question: why doesn't bgwriter
report "checkpoint request failed" when a checkpoint occurs after
dropping a database?
No, checkpoint after DROP DATABASE should not report "checkpoint
request failed."  When dropping a database, checkpoint is forced
before removing the database files, so bgwriter does not try to open
for fsync the database files.  I saw this in the following code
fragment in src/backend/commands/dbcommands.c:

----------------------------------------/* * On Windows, force a checkpoint so that the bgwriter doesn't hold
any * open files, which would cause rmdir() to fail. */
#ifdef WIN32RequestCheckpoint(true, false);
#endif
/* * Remove all tablespace subdirs belonging to the database. */remove_dbtablespaces(db_id);
----------------------------------------


> I think that it should not be necessary to invent a signal for "drop
> across tablespace", though, because we don't allow DROP TABLESPACE
to>> remove any tables --- you've got to drop tables and/or databases
to
> clean out the tablespace, first.  Anyone see a flaw in that?

I think you are right.
BTW: what happens on Windows if we're trying to do the equivalent
> of "rm -rf database-dir" and someone is holding open one of the
files
> in the directory?  Or has the directory itself open for readdir()?

And I wonder what happens if Windows "copy" command is accessing the
data files when bgwriter tries to open them for fsync, or the reverse
of it.  copy would fail?  If so, it means that online backup sometimes
fails.

And how about a checkpoint after ALTER TABLE/INDEX ... SET TABLESPACE?
When bgwriter tries to open the table/index file, the original file
does not exist.





Re: Idea for fixing the Windows fsync problem

From
Stefan Kaltenbrunner
Date:
Takayuki Tsunakawa wrote:
> From: "Tom Lane" <tgl@sss.pgh.pa.us>
>> I wrote:
>>> I've committed a tentative patch along these lines to HEAD.  Please
>>> test.
>> So I come home from dinner out, and find the buildfarm all red :-(
>>
>> I'm not sure why I didn't see this failure in my own testing, but in
>> hindsight it's quite obvious that if the bgwriter is to take a hard
>> line about fsync failures, it's got to be told about DROP DATABASE
>> not only DROP TABLE --- that is, there has to be a signaling message
>> for "revoke fsync requests across whole database".
>>
> 
> Excuse me if I misunderstand English and say something strange.
> I thought "buildfarm is red" meant the failure of regression test.
> What kind of errors did you get in what operation (e.g. DROP INDEX)?
> Is everyone is able to see the test result freely?  Sorry, I'll read
> developer's FAQ when I have more time.

tom is talking about the postgresql distributed buildfarm:

http://buildfarm.postgresql.org/cgi-bin/show_status.pl

and right now most of the members are indeed "red" due to the fsync
related changes.


Stefan


Re: Idea for fixing the Windows fsync problem

From
"Magnus Hagander"
Date:
>  BTW: what happens on Windows if we're trying to do the equivalent
> > of "rm -rf database-dir" and someone is holding open one of the
> files
> > in the directory?  Or has the directory itself open for readdir()?

For the first definity and I think for the second, when doing it from the commandline, you get a 'cannot delete the
directorybecause it is not empty'.
 
I'm not sure if our implementation for dealing with open files also work with directories.


> And I wonder what happens if Windows "copy" command is accessing the
> data files when bgwriter tries to open them for fsync, or the reverse
> of it.  copy would fail? 

Sharing violation.

> If so, it means that online backup sometimes
> fails.

Any backup software backing up *Anything* online should be using 
VSS or a custom OFM. and all real solutions do. 

/Magnus



Re: Idea for fixing the Windows fsync problem

From
"Takayuki Tsunakawa"
Date:
Hello, Stefan-san
tom is talking about the postgresql distributed buildfarm:
>
> http://buildfarm.postgresql.org/cgi-bin/show_status.pl

Thank you for telling me.  This is a great system, isn't it?

----- Original Message ----- 
From: "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc>
To: "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com>
Cc: "Magnus Hagander" <magnus@hagander.net>;
<pgsql-hackers@postgresql.org>; "Tom Lane" <tgl@sss.pgh.pa.us>
Sent: Wednesday, January 17, 2007 4:44 PM
Subject: Re: [HACKERS] Idea for fixing the Windows fsync problem


> Takayuki Tsunakawa wrote:
>> From: "Tom Lane" <tgl@sss.pgh.pa.us>
>>> I wrote:
>>>> I've committed a tentative patch along these lines to HEAD.
Please
>>>> test.
>>> So I come home from dinner out, and find the buildfarm all red :-(
>>>
>>> I'm not sure why I didn't see this failure in my own testing, but
in
>>> hindsight it's quite obvious that if the bgwriter is to take a
hard
>>> line about fsync failures, it's got to be told about DROP DATABASE
>>> not only DROP TABLE --- that is, there has to be a signaling
message
>>> for "revoke fsync requests across whole database".
>>>
>>
>> Excuse me if I misunderstand English and say something strange.
>> I thought "buildfarm is red" meant the failure of regression test.
>> What kind of errors did you get in what operation (e.g. DROP
INDEX)?
>> Is everyone is able to see the test result freely?  Sorry, I'll
read
>> developer's FAQ when I have more time.
>
> tom is talking about the postgresql distributed buildfarm:
>
> http://buildfarm.postgresql.org/cgi-bin/show_status.pl
>
> and right now most of the members are indeed "red" due to the fsync
> related changes.
>
>
> Stefan
>




Re: Idea for fixing the Windows fsync problem

From
Tom Lane
Date:
"Magnus Hagander" <magnus@hagander.net> writes:
>> BTW: what happens on Windows if we're trying to do the equivalent
>> of "rm -rf database-dir" and someone is holding open one of the
>> files in the directory?  Or has the directory itself open for readdir()?

> For the first definity and I think for the second, when doing it from the commandline, you get a 'cannot delete the
directorybecause it is not empty'.
 
> I'm not sure if our implementation for dealing with open files also work with directories.

I just noticed this in dropdb():
   /*    * On Windows, force a checkpoint so that the bgwriter doesn't hold any    * open files, which would cause
rmdir()to fail.    */
 
#ifdef WIN32   RequestCheckpoint(true, false);
#endif

This is done after flushing shared buffers (and, in a little bit, after
telling the bgwriter to forget pending fsyncs); so there'd be no reason
for the bgwriter to re-open any files in the victim database.  And there
are other interlocks guaranteeing no active backends in the victim
database.

It's still not 100% bulletproof, because it's possible that some other
backend is holding an open file in the database as a consequence of
having had to dump some shared buffer for itself, but that should be
pretty darn rare if the bgwriter is getting its job done.
        regards, tom lane


Re: Idea for fixing the Windows fsync problem

From
Stefan Kaltenbrunner
Date:
Takayuki Tsunakawa wrote:
> Hello, Stefan-san
> 
>  tom is talking about the postgresql distributed buildfarm:
>> http://buildfarm.postgresql.org/cgi-bin/show_status.pl
> 
> Thank you for telling me.  This is a great system, isn't it?

yeah the buildfarm plays an important role in the development process now.


Stefan


Re: Idea for fixing the Windows fsync problem

From
Magnus Hagander
Date:
> Checkpoint command here reported an error yesterday.  If Tom-san's
> patch is effective, it should not fail and no messages are put in the
> event log.

I can confirm that the latest set of patches from Tom as in anoncvs now
fixes this. Checkpoint command succeeds and no error is logged on the
server.

//Magnus


Re: Idea for fixing the Windows fsync problem

From
"Takayuki Tsunakawa"
Date:
From: "Tom Lane" <tgl@sss.pgh.pa.us>
> It's still not 100% bulletproof, because it's possible that some
other
> backend is holding an open file in the database as a consequence of
> having had to dump some shared buffer for itself, but that should be
> pretty darn rare if the bgwriter is getting its job done.

I've understood that you are talking about the case where backends
have to evict dirty buffers containing data of a database they are not
connected to.  The problem is that the backends don't close the data
file after writing dirty buffers.
Then, how about making the backends close the data files?  Concretely,
in FlushBuffer() in src/backend/storage/buffer/bufmgr.c, call
SmgrClose() after SmgrWrite() like this:

--------------------------------------------------
if (reln passed to FlushBuffer() was NULL &&   reln->smgr_rnode.dbNode != my database's oid(where is this
stored?)   SmgrClose(reln);
}
--------------------------------------------------


Or, to make the intention clearer, it may be better to add calls to
SmgrOpen() and SmgrClose() in succession after FlushBuffer() in
BufferAlloc().


BTW, fsync() is causing trouble here in addition to load-distributed
checkpoint that Itagaki-san has been addressing, isn't it?  If fsync
were eliminated by using O_SYNC as commercial databases, Tom-san
didn't have to make efforts to solve this problem.