Thread: Idea for fixing the Windows fsync problem
I just had a thought about fixing those Windows "permission denied" problems. The case that we believe we understand is where the bgwriter is trying to execute a previously-logged fsync request against a table file that is pending delete --- that is, actually has been unlink()'d, but some other process is holding an open file reference to it. The problem is only for fsync, not for write(), because the table drop sequence always invalidates every shared buffer for the table before trying to unlink it. So: maybe the solution is to add a step to the drop sequence, namely revoking any pending fsync request, before unlink. This would not only clean up the Windows issue, it'd also let us remove the current hack in md.c to not complain about an ENOENT failure (which is really hardly any safer than ignoring EACCES would be, if you want to be honest about it). The problem is that the ForwardFsyncRequest() mechanism is asynchronous: currently, a backend could see pending fsync requests that are still in the shared-memory queue, but there's no way to tell whether the bgwriter has already absorbed some requests into its private memory. How can a backend tell the bgwriter to forget about it, and then delay until it can be sure that the bgwriter won't try it later? We could have backends put "revoke fsync" requests into the shared queue and then sleep until they see the queue has been drained ... but there's not a convenient way to implement that delay, and I hardly want to just "sleep and retry" during every table drop. It'd probably take at least one more LWLock, and noticeably more complicated ForwardFsyncRequest() logic, to make this work. Thoughts? Is this a reasonable solution path, or is it likely to be a waste of time? We know that there are causes of "permission denied" that are not explained by the pending-delete problem. regards, tom lane
Tom Lane wrote: > I just had a thought about fixing those Windows "permission denied" > problems. The case that we believe we understand is where the bgwriter > is trying to execute a previously-logged fsync request against a table > file that is pending delete --- that is, actually has been unlink()'d, > but some other process is holding an open file reference to it. The > problem is only for fsync, not for write(), because the table drop > sequence always invalidates every shared buffer for the table before > trying to unlink it. > > So: maybe the solution is to add a step to the drop sequence, namely > revoking any pending fsync request, before unlink. This would not only > clean up the Windows issue, it'd also let us remove the current hack in > md.c to not complain about an ENOENT failure (which is really hardly any > safer than ignoring EACCES would be, if you want to be honest about it). Sounds good so far :-) > The problem is that the ForwardFsyncRequest() mechanism is asynchronous: > currently, a backend could see pending fsync requests that are still in > the shared-memory queue, but there's no way to tell whether the bgwriter > has already absorbed some requests into its private memory. How can a > backend tell the bgwriter to forget about it, and then delay until it > can be sure that the bgwriter won't try it later? > > We could have backends put "revoke fsync" requests into the shared queue > and then sleep until they see the queue has been drained ... but there's > not a convenient way to implement that delay, and I hardly want to just > "sleep and retry" during every table drop. It'd probably take at least > one more LWLock, and noticeably more complicated ForwardFsyncRequest() > logic, to make this work. > > Thoughts? Is this a reasonable solution path, or is it likely to be a > waste of time? We know that there are causes of "permission denied" > that are not explained by the pending-delete problem. Do we need to actually wait for it? Does the backend need to know when it's done? If it fires off the "discard" request, then it's up to the bgwriter to see it in time, no? Perhaps we could have the bgwrite check the queue *if* it gets the ENOENT/EACCESS error and then re-check the queue for drops on that file? Or maybe that's even more complex? (I confess I haven't looked at the code..) //Magnus
Magnus Hagander <magnus@hagander.net> writes: > Perhaps we could have the bgwrite check the queue *if* it gets the > ENOENT/EACCESS error and then re-check the queue for drops on that file? Hmm ... seems a bit ugly, but then again I've not been able to come up with a nice way of making the backends wait for queue drain either; and not having to make them wait at all is certainly attractive. I'll look into doing it like that. Thanks for the idea! regards, tom lane
Magnus Hagander <magnus@hagander.net> writes: > Tom Lane wrote: >> So: maybe the solution is to add a step to the drop sequence, namely >> revoking any pending fsync request, before unlink. > Perhaps we could have the bgwrite check the queue *if* it gets the > ENOENT/EACCESS error and then re-check the queue for drops on that file? I've committed a tentative patch along these lines to HEAD. Please test. regards, tom lane
From: "Tom Lane" <tgl@sss.pgh.pa.us>I suggested that here > http://archives.postgresql.org/pgsql-hackers/2007-01/msg00642.php > but have received no feedback about it ... I'm sorry, I missed it. From: "Tom Lane" <tgl@sss.pgh.pa.us> > Magnus Hagander <magnus@hagander.net> writes: >> Tom Lane wrote: >>> So: maybe the solution is to add a step to the drop sequence, namely >>> revoking any pending fsync request, before unlink. > >> Perhaps we could have the bgwrite check the queue *if* it gets the >> ENOENT/EACCESS error and then re-check the queue for drops on that file? > > I've committed a tentative patch along these lines to HEAD. Please > test. I agree with Magnus-san's suggestion, too. Though I'm willing to test, I'm not familiar with building on Windows yet and do not have enogh time for other works right now. If someone builds and gives me the new postgres.exe, I'll put it on my 8.2 installation and test. Or, could anyone do the following? These are what I did in yesterday's test. 1. Open two psql sessions. Let me call those session1 and session2. 2. On session1, execute: create table a (c int); insert into a values(1); 3. On session2, execute: select * from a; 4. On session1, execute: drop table a; checkpoint; Checkpoint command here reported an error yesterday. If Tom-san's patch is effective, it should not fail and no messages are put in the event log.
I wrote: > I've committed a tentative patch along these lines to HEAD. Please > test. So I come home from dinner out, and find the buildfarm all red :-( I'm not sure why I didn't see this failure in my own testing, but in hindsight it's quite obvious that if the bgwriter is to take a hard line about fsync failures, it's got to be told about DROP DATABASE not only DROP TABLE --- that is, there has to be a signaling message for "revoke fsync requests across whole database". I think that it should not be necessary to invent a signal for "drop across tablespace", though, because we don't allow DROP TABLESPACE to remove any tables --- you've got to drop tables and/or databases to clean out the tablespace, first. Anyone see a flaw in that? Not up to fixing this right now, but will have a go at it in the morning, unless someone else wants to take a swing at it meanwhile. BTW: what happens on Windows if we're trying to do the equivalent of "rm -rf database-dir" and someone is holding open one of the files in the directory? Or has the directory itself open for readdir()? regards, tom lane
From: "Tom Lane" <tgl@sss.pgh.pa.us> >I wrote: >> I've committed a tentative patch along these lines to HEAD. Please >> test. > > So I come home from dinner out, and find the buildfarm all red :-( > > I'm not sure why I didn't see this failure in my own testing, but in > hindsight it's quite obvious that if the bgwriter is to take a hard > line about fsync failures, it's got to be told about DROP DATABASE > not only DROP TABLE --- that is, there has to be a signaling message > for "revoke fsync requests across whole database". > Excuse me if I misunderstand English and say something strange. I thought "buildfarm is red" meant the failure of regression test. What kind of errors did you get in what operation (e.g. DROP INDEX)? Is everyone is able to see the test result freely? Sorry, I'll read developer's FAQ when I have more time. And I thought you meant you have a question: why doesn't bgwriter report "checkpoint request failed" when a checkpoint occurs after dropping a database? No, checkpoint after DROP DATABASE should not report "checkpoint request failed." When dropping a database, checkpoint is forced before removing the database files, so bgwriter does not try to open for fsync the database files. I saw this in the following code fragment in src/backend/commands/dbcommands.c: ----------------------------------------/* * On Windows, force a checkpoint so that the bgwriter doesn't hold any * open files, which would cause rmdir() to fail. */ #ifdef WIN32RequestCheckpoint(true, false); #endif /* * Remove all tablespace subdirs belonging to the database. */remove_dbtablespaces(db_id); ---------------------------------------- > I think that it should not be necessary to invent a signal for "drop > across tablespace", though, because we don't allow DROP TABLESPACE to>> remove any tables --- you've got to drop tables and/or databases to > clean out the tablespace, first. Anyone see a flaw in that? I think you are right. BTW: what happens on Windows if we're trying to do the equivalent > of "rm -rf database-dir" and someone is holding open one of the files > in the directory? Or has the directory itself open for readdir()? And I wonder what happens if Windows "copy" command is accessing the data files when bgwriter tries to open them for fsync, or the reverse of it. copy would fail? If so, it means that online backup sometimes fails. And how about a checkpoint after ALTER TABLE/INDEX ... SET TABLESPACE? When bgwriter tries to open the table/index file, the original file does not exist.
Takayuki Tsunakawa wrote: > From: "Tom Lane" <tgl@sss.pgh.pa.us> >> I wrote: >>> I've committed a tentative patch along these lines to HEAD. Please >>> test. >> So I come home from dinner out, and find the buildfarm all red :-( >> >> I'm not sure why I didn't see this failure in my own testing, but in >> hindsight it's quite obvious that if the bgwriter is to take a hard >> line about fsync failures, it's got to be told about DROP DATABASE >> not only DROP TABLE --- that is, there has to be a signaling message >> for "revoke fsync requests across whole database". >> > > Excuse me if I misunderstand English and say something strange. > I thought "buildfarm is red" meant the failure of regression test. > What kind of errors did you get in what operation (e.g. DROP INDEX)? > Is everyone is able to see the test result freely? Sorry, I'll read > developer's FAQ when I have more time. tom is talking about the postgresql distributed buildfarm: http://buildfarm.postgresql.org/cgi-bin/show_status.pl and right now most of the members are indeed "red" due to the fsync related changes. Stefan
> BTW: what happens on Windows if we're trying to do the equivalent > > of "rm -rf database-dir" and someone is holding open one of the > files > > in the directory? Or has the directory itself open for readdir()? For the first definity and I think for the second, when doing it from the commandline, you get a 'cannot delete the directorybecause it is not empty'. I'm not sure if our implementation for dealing with open files also work with directories. > And I wonder what happens if Windows "copy" command is accessing the > data files when bgwriter tries to open them for fsync, or the reverse > of it. copy would fail? Sharing violation. > If so, it means that online backup sometimes > fails. Any backup software backing up *Anything* online should be using VSS or a custom OFM. and all real solutions do. /Magnus
Hello, Stefan-san tom is talking about the postgresql distributed buildfarm: > > http://buildfarm.postgresql.org/cgi-bin/show_status.pl Thank you for telling me. This is a great system, isn't it? ----- Original Message ----- From: "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> To: "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> Cc: "Magnus Hagander" <magnus@hagander.net>; <pgsql-hackers@postgresql.org>; "Tom Lane" <tgl@sss.pgh.pa.us> Sent: Wednesday, January 17, 2007 4:44 PM Subject: Re: [HACKERS] Idea for fixing the Windows fsync problem > Takayuki Tsunakawa wrote: >> From: "Tom Lane" <tgl@sss.pgh.pa.us> >>> I wrote: >>>> I've committed a tentative patch along these lines to HEAD. Please >>>> test. >>> So I come home from dinner out, and find the buildfarm all red :-( >>> >>> I'm not sure why I didn't see this failure in my own testing, but in >>> hindsight it's quite obvious that if the bgwriter is to take a hard >>> line about fsync failures, it's got to be told about DROP DATABASE >>> not only DROP TABLE --- that is, there has to be a signaling message >>> for "revoke fsync requests across whole database". >>> >> >> Excuse me if I misunderstand English and say something strange. >> I thought "buildfarm is red" meant the failure of regression test. >> What kind of errors did you get in what operation (e.g. DROP INDEX)? >> Is everyone is able to see the test result freely? Sorry, I'll read >> developer's FAQ when I have more time. > > tom is talking about the postgresql distributed buildfarm: > > http://buildfarm.postgresql.org/cgi-bin/show_status.pl > > and right now most of the members are indeed "red" due to the fsync > related changes. > > > Stefan >
"Magnus Hagander" <magnus@hagander.net> writes: >> BTW: what happens on Windows if we're trying to do the equivalent >> of "rm -rf database-dir" and someone is holding open one of the >> files in the directory? Or has the directory itself open for readdir()? > For the first definity and I think for the second, when doing it from the commandline, you get a 'cannot delete the directorybecause it is not empty'. > I'm not sure if our implementation for dealing with open files also work with directories. I just noticed this in dropdb(): /* * On Windows, force a checkpoint so that the bgwriter doesn't hold any * open files, which would cause rmdir()to fail. */ #ifdef WIN32 RequestCheckpoint(true, false); #endif This is done after flushing shared buffers (and, in a little bit, after telling the bgwriter to forget pending fsyncs); so there'd be no reason for the bgwriter to re-open any files in the victim database. And there are other interlocks guaranteeing no active backends in the victim database. It's still not 100% bulletproof, because it's possible that some other backend is holding an open file in the database as a consequence of having had to dump some shared buffer for itself, but that should be pretty darn rare if the bgwriter is getting its job done. regards, tom lane
Takayuki Tsunakawa wrote: > Hello, Stefan-san > > tom is talking about the postgresql distributed buildfarm: >> http://buildfarm.postgresql.org/cgi-bin/show_status.pl > > Thank you for telling me. This is a great system, isn't it? yeah the buildfarm plays an important role in the development process now. Stefan
> Checkpoint command here reported an error yesterday. If Tom-san's > patch is effective, it should not fail and no messages are put in the > event log. I can confirm that the latest set of patches from Tom as in anoncvs now fixes this. Checkpoint command succeeds and no error is logged on the server. //Magnus
From: "Tom Lane" <tgl@sss.pgh.pa.us> > It's still not 100% bulletproof, because it's possible that some other > backend is holding an open file in the database as a consequence of > having had to dump some shared buffer for itself, but that should be > pretty darn rare if the bgwriter is getting its job done. I've understood that you are talking about the case where backends have to evict dirty buffers containing data of a database they are not connected to. The problem is that the backends don't close the data file after writing dirty buffers. Then, how about making the backends close the data files? Concretely, in FlushBuffer() in src/backend/storage/buffer/bufmgr.c, call SmgrClose() after SmgrWrite() like this: -------------------------------------------------- if (reln passed to FlushBuffer() was NULL && reln->smgr_rnode.dbNode != my database's oid(where is this stored?) SmgrClose(reln); } -------------------------------------------------- Or, to make the intention clearer, it may be better to add calls to SmgrOpen() and SmgrClose() in succession after FlushBuffer() in BufferAlloc(). BTW, fsync() is causing trouble here in addition to load-distributed checkpoint that Itagaki-san has been addressing, isn't it? If fsync were eliminated by using O_SYNC as commercial databases, Tom-san didn't have to make efforts to solve this problem.