Thread: BUG #2712: could not fsync segment: Permission denied
The following bug has been logged online: Bug reference: 2712 Logged by: Thomas H Email address: me@alternize.com PostgreSQL version: 8.2b1 Operating system: windows 2003 standard Description: could not fsync segment: Permission denied Details: sometimes we're seeing loads of errors in the log: 2006-10-22 23:48:50 LOG: could not fsync segment 0 of relation 1663/3964774/6409340: Permission denied 2006-10-22 23:48:50 ERROR: storage sync failed on magnetic disk: Permission denied 2006-10-22 23:48:51 LOG: could not fsync segment 0 of relation 1663/3964774/6409340: Permission denied 2006-10-22 23:48:51 ERROR: storage sync failed on magnetic disk: Permission denied 2006-10-22 23:48:52 LOG: could not fsync segment 0 of relation 1663/3964774/6409340: Permission denied 2006-10-22 23:48:52 ERROR: storage sync failed on magnetic disk: Permission denied 2006-10-22 23:48:53 LOG: could not fsync segment 0 of relation 1663/3964774/6409340: Permission denied 2006-10-22 23:48:53 ERROR: storage sync failed on magnetic disk: Permission denied {...} when this happens, there are also files locked within the data\base\{dbid}\. access to those files are denied by the os - the files vanish as soon as postmaster ist stopped & restarted. i haven't yet found a possible reason - i suspect the error to appear *sometimes* after issuing a "VACUUM FULL ANALYZE {tablename}" / "REINDEX TABLE {tablename}". the hardware is checked and ok.
in verbose mode, the log shows a little bit more: 2006-10-23 03:23:14 LOG: 42501: could not fsync segment 0 of relation 1663/3964774/6411190: Permission denied 2006-10-23 03:23:14 LOCATION: mdsync, md.c:785 2006-10-23 03:23:14 ERROR: XX000: storage sync failed on magnetic disk: Permission denied 2006-10-23 03:23:14 LOCATION: smgrsync, smgr.c:888 - thomas ----- Original Message ----- From: "Thomas H" <me@alternize.com> To: <pgsql-bugs@postgresql.org> Sent: Monday, October 23, 2006 1:28 AM Subject: [BUGS] BUG #2712: could not fsync segment: Permission denied > > The following bug has been logged online: > > Bug reference: 2712 > Logged by: Thomas H > Email address: me@alternize.com > PostgreSQL version: 8.2b1 > Operating system: windows 2003 standard > Description: could not fsync segment: Permission denied > Details: > > sometimes we're seeing loads of errors in the log: > > 2006-10-22 23:48:50 LOG: could not fsync segment 0 of relation > 1663/3964774/6409340: Permission denied > 2006-10-22 23:48:50 ERROR: storage sync failed on magnetic disk: > Permission > denied > 2006-10-22 23:48:51 LOG: could not fsync segment 0 of relation > 1663/3964774/6409340: Permission denied > 2006-10-22 23:48:51 ERROR: storage sync failed on magnetic disk: > Permission > denied > 2006-10-22 23:48:52 LOG: could not fsync segment 0 of relation > 1663/3964774/6409340: Permission denied > 2006-10-22 23:48:52 ERROR: storage sync failed on magnetic disk: > Permission > denied > 2006-10-22 23:48:53 LOG: could not fsync segment 0 of relation > 1663/3964774/6409340: Permission denied > 2006-10-22 23:48:53 ERROR: storage sync failed on magnetic disk: > Permission > denied > {...} > > when this happens, there are also files locked within the > data\base\{dbid}\. > access to those files are denied by the os - the files vanish as soon as > postmaster ist stopped & restarted. > > i haven't yet found a possible reason - i suspect the error to appear > *sometimes* after issuing a "VACUUM FULL ANALYZE {tablename}" / "REINDEX > TABLE {tablename}". > > the hardware is checked and ok. > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings >
"Thomas H" <me@alternize.com> writes: > Operating system: windows 2003 standard > Description: could not fsync segment: Permission denied The usual answer to this has been that you're running some overenthusiastic antivirus software. regards, tom lane
unfortunately not. and this is not happening with 8.1 regards, thomas ----- Original Message ----- From: "Tom Lane" <tgl@sss.pgh.pa.us> To: "Thomas H" <me@alternize.com> Cc: <pgsql-bugs@postgresql.org> Sent: Monday, October 23, 2006 4:07 AM Subject: Re: [BUGS] BUG #2712: could not fsync segment: Permission denied > "Thomas H" <me@alternize.com> writes: >> Operating system: windows 2003 standard >> Description: could not fsync segment: Permission denied > > The usual answer to this has been that you're running some > overenthusiastic antivirus software. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend >
there is defenitely something terribly wrong in the windows 8.2b1 regarding file access/locking. 2nd total db lockup today due to file access locks (all hold by postmaster): {...} 2006-10-23 17:48:10 LOG: 42501: could not fsync segment 0 of relation 1663/3964774/6419608: Permission denied 2006-10-23 17:48:10 LOCATION: mdsync, md.c:785 2006-10-23 17:48:10 ERROR: XX000: storage sync failed on magnetic disk: Permission denied 2006-10-23 17:48:10 LOCATION: smgrsync, smgr.c:888 2006-10-23 17:48:10 LOG: 00000: duration: 327.999 ms statement: SELECT threads.*, first.login AS first_user, last.login AS last_user FROM forum.threads JOIN users.users AS first ON first.id = threads.t_first_user LEFT JOIN users.users AS last ON last.id = threads.t_last_user WHERE t_b_id = 4 AND t_status_deleted = false ORDER BY t_status_sticky DESC, t_last_post DESC 2006-10-23 17:48:10 LOCATION: exec_simple_query, postgres.c:1007 2006-10-23 17:48:14 LOG: 00000: could not rename file "pg_xlog/00000001000000040000002E" to "pg_xlog/000000010000000400000037", continuing to try 2006-10-23 17:48:14 LOCATION: pgrename, dirmod.c:142 2006-10-23 18:12:05 LOG: 00000: received fast shutdown request 2006-10-23 18:12:05 LOCATION: pmdie, postmaster.c:1903 2006-10-23 18:12:05 LOG: 00000: aborting any active transactions 2006-10-23 18:12:05 LOCATION: pmdie, postmaster.c:1910 2006-10-23 18:12:05 FATAL: 57P01: terminating connection due to administrator command 2006-10-23 18:12:05 LOCATION: ProcessInterrupts, postgres.c:2465 2006-10-23 18:12:06 ERROR: XX000: could not rename file "pg_xlog/00000001000000040000002E" to "pg_xlog/000000010000000400000037" (initialization of log file 4, segment 55): A blocking operation was interrupted by a call to WSACancelBlockingCall. 2006-10-23 18:12:06 LOCATION: InstallXLogFileSegment, xlog.c:2201 {...} from 17:48:14 on pgsql didn't handle anymore queries until shutdown. as soon as one restarts postmaster, the file locks are cleared up. and no, there are no other file locking tools (av scanners and the such) running - 8.1 on the same box (even on same partition) run fine. regarnds, - thomas ----- Original Message ----- From: "Thomas H." <me@alternize.com> To: "Tom Lane" <tgl@sss.pgh.pa.us> Cc: <pgsql-bugs@postgresql.org> Sent: Monday, October 23, 2006 11:52 AM Subject: Re: [BUGS] BUG #2712: could not fsync segment: Permission denied > unfortunately not. and this is not happening with 8.1 > > regards, > thomas > > ----- Original Message ----- > From: "Tom Lane" <tgl@sss.pgh.pa.us> > To: "Thomas H" <me@alternize.com> > Cc: <pgsql-bugs@postgresql.org> > Sent: Monday, October 23, 2006 4:07 AM > Subject: Re: [BUGS] BUG #2712: could not fsync segment: Permission denied > > >> "Thomas H" <me@alternize.com> writes: >>> Operating system: windows 2003 standard >>> Description: could not fsync segment: Permission denied >> >> The usual answer to this has been that you're running some >> overenthusiastic antivirus software. >> >> regards, tom lane >> >> ---------------------------(end of broadcast)--------------------------- >> TIP 6: explain analyze is your friend >> > > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq >
The same problem exists in 8.1 too. See this thread http://archives.postgresql.org/pgsql-bugs/2006-04/msg00177.php Tom and Magnus tracked down a cause, but I don't think a fix was ever implemented. FWIW, we were bitten by the fsync problem which you noticed too. Unfortunately we were never able to track down a cause (see the mailing list archives). They are separate problems though. Pete >>> "Thomas H." <me@alternize.com> 23.10.2006 18:21 >>> there is defenitely something terribly wrong in the windows 8.2b1 regarding file access/locking. 2nd total db lockup today due to file access locks (all hold by postmaster): 2006-10-23 17:48:10 LOCATION: exec_simple_query, postgres.c:1007 2006-10-23 17:48:14 LOG: 00000: could not rename file "pg_xlog/00000001000000040000002E" to "pg_xlog/000000010000000400000037", continuing to try
"Peter Brant" <Peter.Brant@wicourts.gov> writes: > The same problem exists in 8.1 too. See this thread > http://archives.postgresql.org/pgsql-bugs/2006-04/msg00177.php > Tom and Magnus tracked down a cause, but I don't think a fix was ever > implemented. Thomas seems to have two different issues there: the "could not rename file" problem on the pg_xlog file is probably explained by the mechanism we identified back then (and I'm not sure why no fix has been installed), but there is no known reason other than antivirus software for the "could not fsync" problem. As for fixing the problem we do understand: ISTM it's just an awful idea for pgrename and pgunlink to be willing to loop forever. I think they should time out and report the failure after some reasonable period (say between 10 sec and a minute). If we simply made that change, then the behavior when there's an idle backend sitting on a filehandle for an old xlog segment would be that checkpoints would fail at the MoveOfflineLogs stage, which would not be fatal, but it'd be annoying. We'd probably want to further tweak InstallXLogFileSegment so that rename failure isn't an ERROR, at least not on Windows. (I think we could just make it return false, which'd cause the caller to try to delete the xlog segment, which should work even though rename doesn't.) I'm not in a position to test this though. Magnus or Bruce? regards, tom lane
> The same problem exists in 8.1 too. See this thread its only appearing in 8.2 here, i've just rechecked our logs... is there any workaround? how did you get around that problem of having a total lockdown? thanks, thomas >>>> "Thomas H." <me@alternize.com> 23.10.2006 18:21 >>> > there is defenitely something terribly wrong in the windows 8.2b1 > regarding > file access/locking. 2nd total db lockup today due to file access locks > (all > hold by postmaster): > > 2006-10-23 17:48:10 LOCATION: exec_simple_query, postgres.c:1007 > 2006-10-23 17:48:14 LOG: 00000: could not rename file > "pg_xlog/00000001000000040000002E" to > "pg_xlog/000000010000000400000037", continuing to try > > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq >
That might be one cause (or it might otherwise exacerbate the problem), but it isn't the only cause. We weren't running anti-virus software and neither is Thomas. Unfortunately with the last go around, we collectively ran out of ideas before an underlying cause could be identified. Pete >>> Tom Lane <tgl@sss.pgh.pa.us> 23.10.2006 19:49 >>> installed), but there is no known reason other than antivirus software for the "could not fsync" problem.
Move to Linux. :-) In our case, everything but the database servers were already Linux so it was an easy choice. Things have been rock solid since then. Once things get stuck, I don't think there is an alternative besides "stop -m immediate". However, since the problem is caused by an idle backend holding onto an old WAL segment, maybe having your middle tier/connection pool close and reopen the connections to the database every so often would function as a workaround. Somebody with more knowledge of PG internals than I would have to define "every so often" though (if the idea is viable at all). Pete >>> "Thomas H." <me@alternize.com> 23.10.2006 20:00 >>> is there any workaround? how did you get around that problem of having a total lockdown?
"Peter Brant" <Peter.Brant@wicourts.gov> writes: > FWIW, we were bitten by the fsync problem which you noticed too. > Unfortunately we were never able to track down a cause (see the mailing > list archives). They are separate problems though. Actually, now that I look back in the archives, I think we had theorized that the fsync errors come from attempting to fsync a file that's already been deleted but some backend still has a reference to. Apparently that leads to EACCES instead of ENOENT (which the code is already prepared to expect). http://archives.postgresql.org/pgsql-bugs/2006-04/msg00215.php regards, tom lane
> Actually, now that I look back in the archives, I think we had theorized > that the fsync errors come from attempting to fsync a file that's > already been deleted but some backend still has a reference to. > Apparently that leads to EACCES instead of ENOENT (which the code is > already prepared to expect). with process explorer i can actually check which postgres.exe instance (in all cases i've checked its just 1 instance, and always just 1 file) holds the lock for the file in question. but will that help in determining why it is still holding a reference? the postgres instance that holds the lock eventually closes the filehandle after some minutes. the process itself is not killed but continues thereafter. let me know if i can be of any assistance. since we do regurarly reindex one table whose index size keeps growing despite of often vacuuming, the fsync-problem happens almost 4-5 times per hour. regards, thomas
"Thomas H." <me@alternize.com> writes: > with process explorer i can actually check which postgres.exe instance (in > all cases i've checked its just 1 instance, and always just 1 file) holds > the lock for the file in question. So which one is it? > the postgres instance that holds the lock eventually closes the filehandle > after some minutes. the process itself is not killed but continues > thereafter. That sounds a bit like what I'd expect the bgwriter to do, but the bgwriter is also the one trying to issue the fsync. regards, tom lane
>> with process explorer i can actually check which postgres.exe instance >> (in >> all cases i've checked its just 1 instance, and always just 1 file) holds >> the lock for the file in question. > > So which one is it? it's always one of the db-"slaves" and not "logger process", "stats collector process" or "writer process": right now its PID 4844 ("\BaseNamedObjects\pgident: postgres: db_outnow outnow1 127.0.0.1(2122) idle") that tries to write "D:\DB\PostgreSQL-8.2\data\base\3964774\6422331" can i somehow check what object that file-OID belong(ed/s) to? - thomas
"Thomas H." <me@alternize.com> writes: > right now its PID 4844 ("\BaseNamedObjects\pgident: postgres: db_outnow > outnow1 127.0.0.1(2122) idle") that tries to write > "D:\DB\PostgreSQL-8.2\data\base\3964774\6422331" Do you actually mean it's trying to write that file? Or is it just sitting there holding the open filehandle? > can i somehow check what object that file-OID belong(ed/s) to? You can check in pg_class.relfilenode and pg_class.oid of that database to see if you get a match. But our theory is that this table has been deleted ... regards, tom lane
> "Thomas H." <me@alternize.com> writes: >> right now its PID 4844 ("\BaseNamedObjects\pgident: postgres: db_outnow >> outnow1 127.0.0.1(2122) idle") that tries to write >> "D:\DB\PostgreSQL-8.2\data\base\3964774\6422331" > > Do you actually mean it's trying to write that file? Or is it just > sitting there holding the open filehandle? well, hard to tell :-) according to the log-messages i would assume it is *trying* to write. but the file in question isn't physically there anymore, it's just the open file handle that keeps it from vanish totally - you do not have access to the file (permission denied / access denied) if you for example try to read it or its attributes in file explorer. i've installed Filemon (http://www.sysinternals.com/Utilities/Filemon.html) now. this gives more insight what happens to the file. in this case its file 6422806, the first error message appeared at 23:45:21, the last one at 23:45:26 (only short duration this time). {....} 23:44:57 postgres.exe:1944 WRITE D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS Offset: 16384 Length: 8192 23:44:57 postgres.exe:1944 CLOSE D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS 23:44:57 postgres.exe:1944 OPEN D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS Options: Open Access: 00010080 23:44:57 postgres.exe:1944 QUERY INFORMATION D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS FileAttributeTagInformation 23:44:57 postgres.exe:1944 DELETE D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS 23:44:57 postgres.exe:1944 CLOSE D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS 23:44:57 postgres.exe:1944 OPEN D:\DB\PostgreSQL-8.2\data\base\3964774\6422806.1 NOT FOUND Options: Open Access: 00010080 23:44:57 postgres.exe:5364 CLOSE D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS 23:44:57 postgres.exe:2780 CLOSE D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS 23:44:59 postgres.exe:6036 CLOSE D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS 23:45:11 postgres.exe:5196 CLOSE D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS 23:45:20 postgres.exe:1268 CLOSE D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS 23:45:21 postgres.exe:5196 OPEN D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 DELETE PEND Options: Open Access: 0012019F 23:45:22 postgres.exe:5196 OPEN D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 DELETE PEND Options: Open Access: 0012019F 23:45:23 postgres.exe:5196 OPEN D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 DELETE PEND Options: Open Access: 0012019F 23:45:24 postgres.exe:5196 OPEN D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 DELETE PEND Options: Open Access: 0012019F 23:45:25 postgres.exe:5196 OPEN D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 DELETE PEND Options: Open Access: 0012019F 23:45:26 postgres.exe:5196 OPEN D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 DELETE PEND Options: Open Access: 0012019F 23:45:26 postgres.exe:5428 CLOSE D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS 23:45:26 postgres.exe:2200 CLOSE D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS 23:45:27 postgres.exe:5196 OPEN D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 NOT FOUND Options: Open Access: 0012019F i have earlier log data for this file if needed, but at :45:27 was the last entry. unfortunately i wasn't quick enough to find the blocking process in processviewer, but i guess its pid 5196 >> can i somehow check what object that file-OID belong(ed/s) to? > > You can check in pg_class.relfilenode and pg_class.oid of that database > to see if you get a match. But our theory is that this table has been > deleted ... nothing there as assumed. - thomas
"Thomas H." <me@alternize.com> writes: >> Do you actually mean it's trying to write that file? Or is it just >> sitting there holding the open filehandle? > well, hard to tell :-) > according to the log-messages i would assume it is *trying* to write. The log messages you have don't make it clear which process is trying to do the fsync, but I would expect it to be the bgwriter. (Possibly you should modify log_line_prefix to include PID so we can tell a bit better.) regards, tom lane
> The log messages you have don't make it clear which process is trying to > do the fsync, but I would expect it to be the bgwriter. (Possibly you > should modify log_line_prefix to include PID so we can tell a bit > better.) you're right (as always :-)). its the "writer process" (pid 5196) that outputs the error messages: 2006-10-24 00:09:09 [5196] ERROR: XX000: storage sync failed on magnetic disk: Permission denied 2006-10-24 00:09:09 [5196] LOCATION: smgrsync, smgr.c:888 2006-10-24 00:09:10 [5196] LOG: 42501: could not fsync segment 0 of relation 1663/3964774/6422947: Permission denied 2006-10-24 00:09:10 [5196] LOCATION: mdsync, md.c:785 and in this case, its process 5988 that keeps the file handle open (its entry in pg_class is already deleted): \BaseNamedObjects\pgident: postgres: db_outnow outnow1 127.0.0.1(2362) idle D:\DB\PostgreSQL-8.2\data\base\3964774\6422947 (1 references, 1 handle) ... while pid 5196 constantly tries to open the file (for over 15min in this case), until... 00:22:18 postgres.exe:5196 OPEN D:\DB\PostgreSQL-8.2\data\base\3964774\6422947 DELETE PEND Options: Open Access: 0012019F 00:22:19 postgres.exe:5196 OPEN D:\DB\PostgreSQL-8.2\data\base\3964774\6422947 DELETE PEND Options: Open Access: 0012019F 00:22:20 postgres.exe:5988 CLOSE D:\DB\PostgreSQL-8.2\data\base\3964774\6422947 SUCCESS 00:22:20 postgres.exe:5196 OPEN D:\DB\PostgreSQL-8.2\data\base\3964774\6422947 NOT FOUND Options: Open Access: 0012019F is that of any use? what more logging options would be interesting? - thomas
"Thomas H." <me@alternize.com> writes: > i've installed Filemon (http://www.sysinternals.com/Utilities/Filemon.html) > now. this gives more insight what happens to the file. > ... > D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 DELETE PEND Options: Open > Access: 0012019F This is quite interesting, because it says that Filemon knows how to distinguish a "delete pending" error from other errors. If we could do that, then my prior worries about ignoring all EACCES errors would go away. What's it looking at exactly? regards, tom lane
> > i've installed Filemon=20 > > (http://www.sysinternals.com/Utilities/Filemon.html) > > now. this gives more insight what happens to the file. > > ... > > D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 DELETE PEND Options:=20 > > Open > > Access: 0012019F >=20 > This is quite interesting, because it says that Filemon knows=20 > how to distinguish a "delete pending" error from other=20 > errors. If we could do that, then my prior worries about=20 > ignoring all EACCES errors would go away. What's it looking=20 > at exactly? filemon is a kernel mode filter driver. So it's looking at kernel-only datastructures. AFAIK, there is no way to see that from userspace. //Magnus
> "Peter Brant" <Peter.Brant@wicourts.gov> writes: >> The same problem exists in 8.1 too. See this thread >> http://archives.postgresql.org/pgsql-bugs/2006-04/msg00177.php >> Tom and Magnus tracked down a cause, but I don't think a fix was ever >> implemented. > > Thomas seems to have two different issues there: the "could not rename > file" problem on the pg_xlog file is probably explained by the mechanism > we identified back then (and I'm not sure why no fix has been > installed) just had another total lockdown. the writer-process was trying to rename the C1 to CA which failed: 2006-10-24 14:27:58 [5196] LOG: 00000: could not rename file "pg_xlog/0000000100000004000000C1" to "pg_xlog/0000000100000004000000CA", continuing to try 2006-10-24 14:27:58 [5196] LOCATION: pgrename, dirmod.c:142 when checking the process with process explorer, it reveals that it has this file handles to pg_xlog open: D:\DB\PostgreSQL-8.2\data\pg_xlog D:\DB\PostgreSQL-8.2\data\pg_xlog\0000000100000004000000C3 under normal operation, "writer process" does not seem to have a file handle to the xlog directory (D:\DB\PostgreSQL-8.2\data\pg_xlog) the last error log entry prior to the lockdown was about 15min, so probably the two problems are unrelated. unfortunately, before i could try to get more informations, i had to restart the pg_ctl due to the system being in half-productive mode. that leads me to the question: what are the chances of me being helpful with getting more informations? i can have the db run like that for some more days, but for long term i can't babysit our application and might have to go back to 8.1. thanks, thomas
another lockup, this time due to pgstat_write_statsfile :-( 2006-10-24 17:01:17 [5412] LOG: XX000: could not rename temporary statistics file "global/pgstat.tmp" to "global/pgstat.stat": A blocking operation was interrupted by a call to WSACancelBlockingCall. 2006-10-24 17:01:17 [5412] LOCATION: pgstat_write_statsfile, pgstat.c:2008 2006-10-24 17:23:23 [3280] LOG: 00000: received fast shutdown request 2006-10-24 17:23:23 [3280] LOCATION: pmdie, postmaster.c:1903 2006-10-24 17:23:23 [3280] LOG: 00000: aborting any active transactions 2006-10-24 17:23:23 [3280] LOCATION: pmdie, postmaster.c:1910 2006-10-24 17:23:23 [3468] FATAL: 57P01: terminating connection due to administrator command 2006-10-24 17:23:23 [3468] LOCATION: ProcessInterrupts, postgres.c:2465 this is what filemon reports (~10 times a second): 17:23:18 postgres.exe:1432 OPEN D:\DB\PostgreSQL-8.2\data\pg_xlog\0000000100000004000000DB DELETE PEND Options: Open Access: 00110080 17:23:18 postgres.exe:1432 OPEN D:\DB\PostgreSQL-8.2\data\pg_xlog\0000000100000004000000DB DELETE PEND Options: Open Access: 00110080 17:23:18 postgres.exe:1432 OPEN D:\DB\PostgreSQL-8.2\data\pg_xlog\0000000100000004000000DB DELETE PEND Options: Open Access: 00110080 sorry for flooding. just tell me if i shall rather stop. - thomas ----- Original Message ----- From: "Thomas H." <me@alternize.com> To: <pgsql-bugs@postgresql.org> Sent: Tuesday, October 24, 2006 3:15 PM Subject: Re: [BUGS] could not rename xlog (was: BUG #2712) >> "Peter Brant" <Peter.Brant@wicourts.gov> writes: >>> The same problem exists in 8.1 too. See this thread >>> http://archives.postgresql.org/pgsql-bugs/2006-04/msg00177.php >>> Tom and Magnus tracked down a cause, but I don't think a fix was ever >>> implemented. >> >> Thomas seems to have two different issues there: the "could not rename >> file" problem on the pg_xlog file is probably explained by the mechanism >> we identified back then (and I'm not sure why no fix has been >> installed) > > just had another total lockdown. the writer-process was trying to rename > the C1 to CA which failed: > > 2006-10-24 14:27:58 [5196] LOG: 00000: could not rename file > "pg_xlog/0000000100000004000000C1" to "pg_xlog/0000000100000004000000CA", > continuing to try > 2006-10-24 14:27:58 [5196] LOCATION: pgrename, dirmod.c:142 > > when checking the process with process explorer, it reveals that it has > this file handles to pg_xlog open: > > D:\DB\PostgreSQL-8.2\data\pg_xlog > D:\DB\PostgreSQL-8.2\data\pg_xlog\0000000100000004000000C3 > > under normal operation, "writer process" does not seem to have a file > handle to the xlog directory (D:\DB\PostgreSQL-8.2\data\pg_xlog) > > the last error log entry prior to the lockdown was about 15min, so > probably the two problems are unrelated. > > unfortunately, before i could try to get more informations, i had to > restart the pg_ctl due to the system being in half-productive mode. that > leads me to the question: what are the chances of me being helpful with > getting more informations? i can have the db run like that for some more > days, but for long term i can't babysit our application and might have to > go back to 8.1. > > thanks, > thomas > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly >
> > The same problem exists in 8.1 too. See this thread=20 > > http://archives.postgresql.org/pgsql-bugs/2006-04/msg00177.php > > Tom and Magnus tracked down a cause, but I don't think a=20 > fix was ever=20 > > implemented. >=20 > Thomas seems to have two different issues there: the "could=20 > not rename file" problem on the pg_xlog file is probably=20 > explained by the mechanism we identified back then (and I'm=20 > not sure why no fix has been installed), but there is no=20 > known reason other than antivirus software for the "could not=20 > fsync" problem. >=20 > As for fixing the problem we do understand: ISTM it's just an=20 > awful idea for pgrename and pgunlink to be willing to loop=20 > forever. I think they should time out and report the failure=20 > after some reasonable period (say between 10 sec and a minute). >=20 > If we simply made that change, then the behavior when there's=20 > an idle backend sitting on a filehandle for an old xlog=20 > segment would be that checkpoints would fail at the=20 > MoveOfflineLogs stage, which would not be fatal, but it'd be=20 > annoying. We'd probably want to further tweak=20 > InstallXLogFileSegment so that rename failure isn't an ERROR,=20 > at least not on Windows. (I think we could just make it=20 > return false, which'd cause the caller to try to delete the=20 > xlog segment, which should work even though rename doesn't.) >=20 > I'm not in a position to test this though. Magnus or Bruce? I haven't reproduced this on my box. But if you can give me a patch to try I can build binaries for Thomas to test, if he can do testing but not building. //Magnus
>> I'm not in a position to test this though. Magnus or Bruce? > >I haven't reproduced this on my box. But if you can give me a patch to >try I can build binaries for Thomas to test, if he can do testing but >not building. a binary would be marvelous. if too much hasle, i can setup a msvc++ 2005 here and try to build it on my own, but would obviously prefere if i won't have to... b2 is installed here, but i'm seeing the same problems, so yes, i'm ready for testing ;-) thanks, thomas
"Magnus Hagander" <mha@sollentuna.net> writes: > I haven't reproduced this on my box. But if you can give me a patch to > try I can build binaries for Thomas to test, if he can do testing but > not building. Utterly untested ... BTW, why does pgrename have an #if to check either GetLastError() or errno, but pgunlink doesn't? regards, tom lane
Attachment
Tom Lane wrote: > "Magnus Hagander" <mha@sollentuna.net> writes: > > I haven't reproduced this on my box. But if you can give me a patch to > > try I can build binaries for Thomas to test, if he can do testing but > > not building. > > Utterly untested ... BTW, why does pgrename have an #if to check > either GetLastError() or errno, but pgunlink doesn't? Seems like a bug --- they both should have it. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
>> As for fixing the problem we do understand: ISTM it's just an >> awful idea for pgrename and pgunlink to be willing to loop >> forever. I think they should time out and report the failure >> after some reasonable period (say between 10 sec and a minute). is the main problem realy in the rename/delete function? while i'm in no position of actually knowing whats going on under the hood, my observations in +10 cases during this afternoon/evening revealed some patterns: it is defenitely the writer process that blocks the db. but in every case the writer process seems to fail to rename the file due to another postgresql still holding a filehandle to the very xlog file that should be renamed. ProcessExplorer lets you force a close of the file handle - as soon as you do this [which is a bad thing to do, i assume], the rename succeeds and processing continues normally. i actually can reproduce the error at will now - i just need do pump enough data into the db (~200mb data seems sufficient) to have it lock up. - thomas
"Thomas H." <me@alternize.com> writes: > it is defenitely the writer process that blocks the db. but in every case > the writer process seems to fail to rename the file due to another > postgresql still holding a filehandle to the very xlog file that should be > renamed. Right, all you need is a backend process that's made at least one xlog write and then is left to sit idle for long enough that that xlog file is due for recycling. However, the fact that the writer process is stuck should not in itself cause the DB to lock up. I think what's really happening is that after the writer process gets stuck, the remaining backends chew all the available WAL, and then they need to create more WAL segments for themselves, and the writer process is holding the ControlFileLock so they can't. It might be interesting to think about not requiring the ControlFileLock to be held while making new WAL segments. I think the only reason it does that is to ensure that link/rename failure can be treated as a hard error (because no one could have beat us to the filename), but we're already having to back off that stance for Windows ... regards, tom lane
> It might be interesting to think about not requiring the ControlFileLock > to be held while making new WAL segments. I think the only reason it > does that is to ensure that link/rename failure can be treated as a hard > error (because no one could have beat us to the filename), but we're > already having to back off that stance for Windows ... on a sidenote, i was able to work around the total xlog-lock by ingreasing checkpoint_segments from 3 (default) to 12. that seems enough to have the processes release (timeout?) the filehandles before writer-process wants to rename the xlog file, at least under normal workload. if there is a data load, the lockup still happens, but i can live with that for now. the logs are still being swamped with the other delete error messages, tho: 2006-10-27 16:16:58 [5828] ERROR: XX000: storage sync failed on magnetic disk: Permission denied 2006-10-27 16:16:58 [5828] LOCATION: smgrsync, smgr.c:888 2006-10-27 16:16:59 [5828] LOG: 42501: could not fsync segment 0 of relation 1663/3964774/6495380: Permission denied 2006-10-27 16:16:59 [5828] LOCATION: mdsync, md.c:785 2006-10-27 16:16:59 [5828] ERROR: XX000: storage sync failed on magnetic disk: Permission denied 2006-10-27 16:16:59 [5828] LOCATION: smgrsync, smgr.c:888 magnus, where you able to do a debug build for me to test the patch? would be nice if a solution could be found for the final 8.2 release. cheers, thomas
> > I haven't reproduced this on my box. But if you can give me=20 > a patch to=20 > > try I can build binaries for Thomas to test, if he can do=20 > testing but=20 > > not building. >=20 > Utterly untested ... BTW, why does pgrename have an #if to=20 > check either GetLastError() or errno, but pgunlink doesn't? Ok, I've built a .EXE with this patch. It's compiled without pretty much all other options, hope that still works :-) (Meaning no NLS, no kerberos, no SSL etc) Grab the exe from http://www.hagander.net/download/postgres_renamepatch.zip. Sorry about the delay. //Magnus
thanks alot. using only standard features, so the missing bits shouldn't be a problem. will try to test later tonight or tomorrow. > Grab the exe from > http://www.hagander.net/download/postgres_renamepatch.zip. report will follow. regards, thomas ----- Original Message ----- From: "Magnus Hagander" <mha@sollentuna.net> To: "Tom Lane" <tgl@sss.pgh.pa.us> Cc: "Peter Brant" <Peter.Brant@wicourts.gov>; "Thomas H." <me@alternize.com>; <pgsql-bugs@postgresql.org>; "Bruce Momjian" <bruce@momjian.us> Sent: Sunday, October 29, 2006 6:10 PM Subject: RE: [BUGS] BUG #2712: could not fsync segment: Permission > > I haven't reproduced this on my box. But if you can give me > a patch to > > try I can build binaries for Thomas to test, if he can do > testing but > > not building. > > Utterly untested ... BTW, why does pgrename have an #if to > check either GetLastError() or errno, but pgunlink doesn't? Ok, I've built a .EXE with this patch. It's compiled without pretty much all other options, hope that still works :-) (Meaning no NLS, no kerberos, no SSL etc) Grab the exe from http://www.hagander.net/download/postgres_renamepatch.zip. Sorry about the delay. //Magnus
i've loaded 1gb of data without any xlog-problems, whereas with the 8.2b2 executable it locked up after ~100mb. the xlog-files are cycling... if i need to test for some specific behaviour let me know. maybe a similar patch could be found for the 2nd permission problem, where the writer process tries to use a previously deleted file whose filehandle is still in use by another postgresql process: 2006-10-31 07:12:37 [5392] ERROR: 42501: could not open relation 1663/3964774/6696548: Permission denied 2006-10-31 07:12:37 [5392] LOCATION: mdopen, md.c:366 2006-10-31 07:12:38 [5392] ERROR: 42501: could not open relation 1663/3964774/6696548: Permission denied 2006-10-31 07:12:38 [5392] LOCATION: mdopen, md.c:366 thanks for your efforts, very much appreciated! - thomas ----- Original Message ----- Sent: Sunday, October 29, 2006 6:10 PM Subject: Re: [BUGS] BUG #2712: could not fsync segment: Permission > > I haven't reproduced this on my box. But if you can give me > a patch to > > try I can build binaries for Thomas to test, if he can do > testing but > > not building. > > Utterly untested ... BTW, why does pgrename have an #if to > check either GetLastError() or errno, but pgunlink doesn't? Ok, I've built a .EXE with this patch. It's compiled without pretty much all other options, hope that still works :-) (Meaning no NLS, no kerberos, no SSL etc) Grab the exe from http://www.hagander.net/download/postgres_renamepatch.zip. Sorry about the delay. //Magnus ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
I wrote: > It might be interesting to think about not requiring the ControlFileLock > to be held while making new WAL segments. I think the only reason it > does that is to ensure that link/rename failure can be treated as a hard > error (because no one could have beat us to the filename), but we're > already having to back off that stance for Windows ... Just for the archives: I looked into doing this, and realized that the reason we use a lock here is that on platforms without working link() we use rename() to emplace new WAL segments. And rename() is defined to unlink any existing file of the target name, which would be completely unsafe to do. On platforms with link() we could probably get away without a lock, but I'm disinclined to introduce that much of a platform-specific difference in behavior when we don't have to. regards, tom lane
me wrote: > i've loaded 1gb of data without any xlog-problems, whereas with the 8.2b2 > executable it locked up after ~100mb. the xlog-files are cycling... > > if i need to test for some specific behaviour let me know. what's the status on this patch for inclusion in future 8.2 builds? would be nice to see it in b4 or rc1... thanks, thomas
Thomas H. wrote: > me wrote: > > i've loaded 1gb of data without any xlog-problems, whereas with the 8.2b2 > > executable it locked up after ~100mb. the xlog-files are cycling... > > > > if i need to test for some specific behaviour let me know. > > what's the status on this patch for inclusion in future 8.2 builds? would be > nice to see it in b4 or rc1... This was fixed the day after we packaged beta3: revision 1.44 date: 2006/11/08 20:12:05; author: tgl; state: Exp; lines: +20 -48 Change Windows rename and unlink substitutes so that they time out after 30 seconds instead of retrying forever. Also modify xlog.c so that if it fails to rename an old xlog segment up to a future slot, it will unlink the segment instead. Per discussion of bug #2712, in which it became apparent that Windows can handle unlinking a file that's being held open, but not renaming it. beta4 will have the fix. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Hi. On Mon, 20 Nov 2006 15:33:33 -0500 (EST) Bruce Momjian <bruce@momjian.us> wrote: > Thomas H. wrote: > > me wrote: > > > i've loaded 1gb of data without any xlog-problems, whereas with the 8.2b2 > > > executable it locked up after ~100mb. the xlog-files are cycling... > > > > > > if i need to test for some specific behaviour let me know. > > > > what's the status on this patch for inclusion in future 8.2 builds? would be > > nice to see it in b4 or rc1... > > This was fixed the day after we packaged beta3: > > revision 1.44 > date: 2006/11/08 20:12:05; author: tgl; state: Exp; lines: +20 -48 > Change Windows rename and unlink substitutes so that they time out after > 30 seconds instead of retrying forever. Also modify xlog.c so that if > it fails to rename an old xlog segment up to a future slot, it will > unlink the segment instead. Per discussion of bug #2712, in which it > became apparent that Windows can handle unlinking a file that's being > held open, but not renaming it. > > beta4 will have the fix. Are there any plan to backport this patch for 8.1/8.0 server ? My customer has same problem in 8.1.5 , but this bug doesn't seems to be fixed in 8.1.6 . Backport patch for 8.1.6 attached. Sorry for replying old mail. -- TANIDA Yutaka <tanida@sraoss.co.jp>
It was considered too risky to backpatch because we couldn't do sufficient testing. --------------------------------------------------------------------------- TANIDA Yutaka wrote: > Hi. > > On Mon, 20 Nov 2006 15:33:33 -0500 (EST) > Bruce Momjian <bruce@momjian.us> wrote: > > > Thomas H. wrote: > > > me wrote: > > > > i've loaded 1gb of data without any xlog-problems, whereas with the 8.2b2 > > > > executable it locked up after ~100mb. the xlog-files are cycling... > > > > > > > > if i need to test for some specific behaviour let me know. > > > > > > what's the status on this patch for inclusion in future 8.2 builds? would be > > > nice to see it in b4 or rc1... > > > > This was fixed the day after we packaged beta3: > > > > revision 1.44 > > date: 2006/11/08 20:12:05; author: tgl; state: Exp; lines: +20 -48 > > Change Windows rename and unlink substitutes so that they time out after > > 30 seconds instead of retrying forever. Also modify xlog.c so that if > > it fails to rename an old xlog segment up to a future slot, it will > > unlink the segment instead. Per discussion of bug #2712, in which it > > became apparent that Windows can handle unlinking a file that's being > > held open, but not renaming it. > > > > beta4 will have the fix. > > Are there any plan to backport this patch for 8.1/8.0 server ? > > My customer has same problem in 8.1.5 , but this bug doesn't seems to > be fixed in 8.1.6 . > > Backport patch for 8.1.6 attached. > > Sorry for replying old mail. > > -- > TANIDA Yutaka <tanida@sraoss.co.jp> [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +