Thread: BUG #2712: could not fsync segment: Permission denied

BUG #2712: could not fsync segment: Permission denied

From
"Thomas H"
Date:
The following bug has been logged online:

Bug reference:      2712
Logged by:          Thomas H
Email address:      me@alternize.com
PostgreSQL version: 8.2b1
Operating system:   windows 2003 standard
Description:        could not fsync segment: Permission denied
Details:

sometimes we're seeing loads of errors in the log:

2006-10-22 23:48:50 LOG:  could not fsync segment 0 of relation
1663/3964774/6409340: Permission denied
2006-10-22 23:48:50 ERROR:  storage sync failed on magnetic disk: Permission
denied
2006-10-22 23:48:51 LOG:  could not fsync segment 0 of relation
1663/3964774/6409340: Permission denied
2006-10-22 23:48:51 ERROR:  storage sync failed on magnetic disk: Permission
denied
2006-10-22 23:48:52 LOG:  could not fsync segment 0 of relation
1663/3964774/6409340: Permission denied
2006-10-22 23:48:52 ERROR:  storage sync failed on magnetic disk: Permission
denied
2006-10-22 23:48:53 LOG:  could not fsync segment 0 of relation
1663/3964774/6409340: Permission denied
2006-10-22 23:48:53 ERROR:  storage sync failed on magnetic disk: Permission
denied
{...}

when this happens, there are also files locked within the data\base\{dbid}\.
access to those files are denied by the os - the files vanish as soon as
postmaster ist stopped & restarted.

i haven't yet found a possible reason - i suspect the error to appear
*sometimes* after issuing a "VACUUM FULL ANALYZE {tablename}" / "REINDEX
TABLE {tablename}".

the hardware is checked and ok.

Re: BUG #2712: could not fsync segment: Permission denied

From
"Thomas H."
Date:
in verbose mode, the log shows a little bit more:

2006-10-23 03:23:14 LOG:  42501: could not fsync segment 0 of relation
1663/3964774/6411190: Permission denied
2006-10-23 03:23:14 LOCATION:  mdsync, md.c:785
2006-10-23 03:23:14 ERROR:  XX000: storage sync failed on magnetic disk:
Permission denied
2006-10-23 03:23:14 LOCATION:  smgrsync, smgr.c:888

- thomas

----- Original Message -----
From: "Thomas H" <me@alternize.com>
To: <pgsql-bugs@postgresql.org>
Sent: Monday, October 23, 2006 1:28 AM
Subject: [BUGS] BUG #2712: could not fsync segment: Permission denied


>
> The following bug has been logged online:
>
> Bug reference:      2712
> Logged by:          Thomas H
> Email address:      me@alternize.com
> PostgreSQL version: 8.2b1
> Operating system:   windows 2003 standard
> Description:        could not fsync segment: Permission denied
> Details:
>
> sometimes we're seeing loads of errors in the log:
>
> 2006-10-22 23:48:50 LOG:  could not fsync segment 0 of relation
> 1663/3964774/6409340: Permission denied
> 2006-10-22 23:48:50 ERROR:  storage sync failed on magnetic disk:
> Permission
> denied
> 2006-10-22 23:48:51 LOG:  could not fsync segment 0 of relation
> 1663/3964774/6409340: Permission denied
> 2006-10-22 23:48:51 ERROR:  storage sync failed on magnetic disk:
> Permission
> denied
> 2006-10-22 23:48:52 LOG:  could not fsync segment 0 of relation
> 1663/3964774/6409340: Permission denied
> 2006-10-22 23:48:52 ERROR:  storage sync failed on magnetic disk:
> Permission
> denied
> 2006-10-22 23:48:53 LOG:  could not fsync segment 0 of relation
> 1663/3964774/6409340: Permission denied
> 2006-10-22 23:48:53 ERROR:  storage sync failed on magnetic disk:
> Permission
> denied
> {...}
>
> when this happens, there are also files locked within the
> data\base\{dbid}\.
> access to those files are denied by the os - the files vanish as soon as
> postmaster ist stopped & restarted.
>
> i haven't yet found a possible reason - i suspect the error to appear
> *sometimes* after issuing a "VACUUM FULL ANALYZE {tablename}" / "REINDEX
> TABLE {tablename}".
>
> the hardware is checked and ok.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings
>

Re: BUG #2712: could not fsync segment: Permission denied

From
Tom Lane
Date:
"Thomas H" <me@alternize.com> writes:
> Operating system:   windows 2003 standard
> Description:        could not fsync segment: Permission denied

The usual answer to this has been that you're running some
overenthusiastic antivirus software.

            regards, tom lane

Re: BUG #2712: could not fsync segment: Permission denied

From
"Thomas H."
Date:
unfortunately not.
and this is not happening with 8.1

regards,
thomas

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Thomas H" <me@alternize.com>
Cc: <pgsql-bugs@postgresql.org>
Sent: Monday, October 23, 2006 4:07 AM
Subject: Re: [BUGS] BUG #2712: could not fsync segment: Permission denied


> "Thomas H" <me@alternize.com> writes:
>> Operating system:   windows 2003 standard
>> Description:        could not fsync segment: Permission denied
>
> The usual answer to this has been that you're running some
> overenthusiastic antivirus software.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

Re: BUG #2712: could not fsync segment: Permission denied

From
"Thomas H."
Date:
there is defenitely something terribly wrong in the windows 8.2b1 regarding
file access/locking. 2nd total db lockup today due to file access locks (all
hold by postmaster):

{...}
2006-10-23 17:48:10 LOG:  42501: could not fsync segment 0 of relation
1663/3964774/6419608: Permission denied
2006-10-23 17:48:10 LOCATION:  mdsync, md.c:785
2006-10-23 17:48:10 ERROR:  XX000: storage sync failed on magnetic disk:
Permission denied
2006-10-23 17:48:10 LOCATION:  smgrsync, smgr.c:888
2006-10-23 17:48:10 LOG:  00000: duration: 327.999 ms  statement: SELECT
threads.*, first.login AS first_user, last.login AS last_user FROM
forum.threads JOIN users.users AS first ON first.id = threads.t_first_user
LEFT JOIN users.users AS last ON last.id = threads.t_last_user WHERE t_b_id
= 4 AND t_status_deleted = false ORDER BY t_status_sticky DESC, t_last_post
DESC
2006-10-23 17:48:10 LOCATION:  exec_simple_query, postgres.c:1007
2006-10-23 17:48:14 LOG:  00000: could not rename file
"pg_xlog/00000001000000040000002E" to "pg_xlog/000000010000000400000037",
continuing to try
2006-10-23 17:48:14 LOCATION:  pgrename, dirmod.c:142
2006-10-23 18:12:05 LOG:  00000: received fast shutdown request
2006-10-23 18:12:05 LOCATION:  pmdie, postmaster.c:1903
2006-10-23 18:12:05 LOG:  00000: aborting any active transactions
2006-10-23 18:12:05 LOCATION:  pmdie, postmaster.c:1910
2006-10-23 18:12:05 FATAL:  57P01: terminating connection due to
administrator command
2006-10-23 18:12:05 LOCATION:  ProcessInterrupts, postgres.c:2465
2006-10-23 18:12:06 ERROR:  XX000: could not rename file
"pg_xlog/00000001000000040000002E" to "pg_xlog/000000010000000400000037"
(initialization of log file 4, segment 55): A blocking operation was
interrupted by a call to WSACancelBlockingCall.
2006-10-23 18:12:06 LOCATION:  InstallXLogFileSegment, xlog.c:2201
{...}

from 17:48:14 on pgsql didn't handle anymore queries until shutdown. as soon
as one restarts postmaster, the file locks are cleared up.

and no, there are no other file locking tools (av scanners and the such)
running - 8.1 on the same box (even on same partition) run fine.

regarnds,
- thomas




----- Original Message -----
From: "Thomas H." <me@alternize.com>
To: "Tom Lane" <tgl@sss.pgh.pa.us>
Cc: <pgsql-bugs@postgresql.org>
Sent: Monday, October 23, 2006 11:52 AM
Subject: Re: [BUGS] BUG #2712: could not fsync segment: Permission denied


> unfortunately not. and this is not happening with 8.1
>
> regards,
> thomas
>
> ----- Original Message -----
> From: "Tom Lane" <tgl@sss.pgh.pa.us>
> To: "Thomas H" <me@alternize.com>
> Cc: <pgsql-bugs@postgresql.org>
> Sent: Monday, October 23, 2006 4:07 AM
> Subject: Re: [BUGS] BUG #2712: could not fsync segment: Permission denied
>
>
>> "Thomas H" <me@alternize.com> writes:
>>> Operating system:   windows 2003 standard
>>> Description:        could not fsync segment: Permission denied
>>
>> The usual answer to this has been that you're running some
>> overenthusiastic antivirus software.
>>
>> regards, tom lane
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 6: explain analyze is your friend
>>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>               http://www.postgresql.org/docs/faq
>

Re: BUG #2712: could not fsync segment: Permission

From
"Peter Brant"
Date:
The same problem exists in 8.1 too.  See this thread

http://archives.postgresql.org/pgsql-bugs/2006-04/msg00177.php

Tom and Magnus tracked down a cause, but I don't think a fix was ever
implemented.

FWIW, we were bitten by the fsync problem which you noticed too.
Unfortunately we were never able to track down a cause (see the mailing
list archives).  They are separate problems though.

Pete

>>> "Thomas H." <me@alternize.com> 23.10.2006 18:21 >>>
there is defenitely something terribly wrong in the windows 8.2b1
regarding
file access/locking. 2nd total db lockup today due to file access locks
(all
hold by postmaster):

2006-10-23 17:48:10 LOCATION:  exec_simple_query, postgres.c:1007
2006-10-23 17:48:14 LOG:  00000: could not rename file
"pg_xlog/00000001000000040000002E" to
"pg_xlog/000000010000000400000037", continuing to try

Re: BUG #2712: could not fsync segment: Permission

From
Tom Lane
Date:
"Peter Brant" <Peter.Brant@wicourts.gov> writes:
> The same problem exists in 8.1 too.  See this thread
> http://archives.postgresql.org/pgsql-bugs/2006-04/msg00177.php
> Tom and Magnus tracked down a cause, but I don't think a fix was ever
> implemented.

Thomas seems to have two different issues there: the "could not rename
file" problem on the pg_xlog file is probably explained by the mechanism
we identified back then (and I'm not sure why no fix has been
installed), but there is no known reason other than antivirus software
for the "could not fsync" problem.

As for fixing the problem we do understand: ISTM it's just an awful idea
for pgrename and pgunlink to be willing to loop forever.  I think they
should time out and report the failure after some reasonable period
(say between 10 sec and a minute).

If we simply made that change, then the behavior when there's an idle
backend sitting on a filehandle for an old xlog segment would be that
checkpoints would fail at the MoveOfflineLogs stage, which would not
be fatal, but it'd be annoying.  We'd probably want to further tweak
InstallXLogFileSegment so that rename failure isn't an ERROR, at least
not on Windows.  (I think we could just make it return false, which'd
cause the caller to try to delete the xlog segment, which should work
even though rename doesn't.)

I'm not in a position to test this though.  Magnus or Bruce?

            regards, tom lane

Re: BUG #2712: could not fsync segment: Permission

From
"Thomas H."
Date:
> The same problem exists in 8.1 too.  See this thread

its only appearing in 8.2 here, i've just rechecked our logs...
is there any workaround? how did you get around that problem of having a
total lockdown?

thanks,
thomas

>>>> "Thomas H." <me@alternize.com> 23.10.2006 18:21 >>>
> there is defenitely something terribly wrong in the windows 8.2b1
> regarding
> file access/locking. 2nd total db lockup today due to file access locks
> (all
> hold by postmaster):
>
> 2006-10-23 17:48:10 LOCATION:  exec_simple_query, postgres.c:1007
> 2006-10-23 17:48:14 LOG:  00000: could not rename file
> "pg_xlog/00000001000000040000002E" to
> "pg_xlog/000000010000000400000037", continuing to try
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>               http://www.postgresql.org/docs/faq
>

Re: BUG #2712: could not fsync segment: Permission

From
"Peter Brant"
Date:
That might be one cause (or it might otherwise exacerbate the problem),
but it isn't the only cause.  We weren't running anti-virus software and
neither is Thomas.  Unfortunately with the last go around, we
collectively ran out of ideas before an underlying cause could be
identified.

Pete

>>> Tom Lane <tgl@sss.pgh.pa.us> 23.10.2006 19:49 >>>
installed), but there is no known reason other than antivirus software
for the "could not fsync" problem.

Re: BUG #2712: could not fsync segment: Permission

From
"Peter Brant"
Date:
Move to Linux. :-)  In our case, everything but the database servers
were already Linux so it was an easy choice.  Things have been rock
solid since then.

Once things get stuck, I don't think there is an alternative besides
"stop -m immediate".  However, since the problem is caused by an idle
backend holding onto an old WAL segment, maybe having your middle
tier/connection pool close and reopen the connections to the database
every so often would function as a workaround.  Somebody with more
knowledge of PG internals than I would have to define "every so often"
though (if the idea is viable at all).

Pete

>>> "Thomas H." <me@alternize.com> 23.10.2006 20:00 >>>
is there any workaround? how did you get around that problem of having
a
total lockdown?

Re: BUG #2712: could not fsync segment: Permission

From
Tom Lane
Date:
"Peter Brant" <Peter.Brant@wicourts.gov> writes:
> FWIW, we were bitten by the fsync problem which you noticed too.
> Unfortunately we were never able to track down a cause (see the mailing
> list archives).  They are separate problems though.

Actually, now that I look back in the archives, I think we had theorized
that the fsync errors come from attempting to fsync a file that's
already been deleted but some backend still has a reference to.
Apparently that leads to EACCES instead of ENOENT (which the code is
already prepared to expect).

http://archives.postgresql.org/pgsql-bugs/2006-04/msg00215.php

            regards, tom lane

Re: BUG #2712: could not fsync segment: Permission

From
"Thomas H."
Date:
> Actually, now that I look back in the archives, I think we had theorized
> that the fsync errors come from attempting to fsync a file that's
> already been deleted but some backend still has a reference to.
> Apparently that leads to EACCES instead of ENOENT (which the code is
> already prepared to expect).

with process explorer i can actually check which postgres.exe instance (in
all cases i've checked its just 1 instance, and always just 1 file) holds
the lock for the file in question. but will that help in determining why it
is still holding a reference?
the postgres instance that holds the lock eventually closes the filehandle
after some minutes. the process itself is not killed but continues
thereafter.

let me know if i can be of any assistance. since we do regurarly reindex one
table whose index size keeps growing despite of often vacuuming, the
fsync-problem happens almost 4-5 times per hour.

regards,
thomas

Re: BUG #2712: could not fsync segment: Permission

From
Tom Lane
Date:
"Thomas H." <me@alternize.com> writes:
> with process explorer i can actually check which postgres.exe instance (in
> all cases i've checked its just 1 instance, and always just 1 file) holds
> the lock for the file in question.

So which one is it?

> the postgres instance that holds the lock eventually closes the filehandle
> after some minutes. the process itself is not killed but continues
> thereafter.

That sounds a bit like what I'd expect the bgwriter to do, but the
bgwriter is also the one trying to issue the fsync.

            regards, tom lane

Re: BUG #2712: could not fsync segment: Permission

From
"Thomas H."
Date:
>> with process explorer i can actually check which postgres.exe instance
>> (in
>> all cases i've checked its just 1 instance, and always just 1 file) holds
>> the lock for the file in question.
>
> So which one is it?

it's always one of the db-"slaves" and not "logger process", "stats
collector process" or "writer process":

right now its PID 4844 ("\BaseNamedObjects\pgident: postgres: db_outnow
outnow1 127.0.0.1(2122) idle") that tries to write
"D:\DB\PostgreSQL-8.2\data\base\3964774\6422331"

can i somehow check what object that file-OID belong(ed/s) to?

- thomas

Re: BUG #2712: could not fsync segment: Permission

From
Tom Lane
Date:
"Thomas H." <me@alternize.com> writes:
> right now its PID 4844 ("\BaseNamedObjects\pgident: postgres: db_outnow
> outnow1 127.0.0.1(2122) idle") that tries to write
> "D:\DB\PostgreSQL-8.2\data\base\3964774\6422331"

Do you actually mean it's trying to write that file?  Or is it just
sitting there holding the open filehandle?

> can i somehow check what object that file-OID belong(ed/s) to?

You can check in pg_class.relfilenode and pg_class.oid of that database
to see if you get a match.  But our theory is that this table has been
deleted ...

            regards, tom lane

Re: BUG #2712: could not fsync segment: Permission

From
"Thomas H."
Date:
> "Thomas H." <me@alternize.com> writes:
>> right now its PID 4844 ("\BaseNamedObjects\pgident: postgres: db_outnow
>> outnow1 127.0.0.1(2122) idle") that tries to write
>> "D:\DB\PostgreSQL-8.2\data\base\3964774\6422331"
>
> Do you actually mean it's trying to write that file?  Or is it just
> sitting there holding the open filehandle?

well, hard to tell :-)
according to the log-messages i would assume it is *trying* to write. but
the file in question isn't physically there anymore, it's just the open file
handle that keeps it from vanish totally - you do not have access to the
file (permission denied / access denied) if you for example try to read it
or its attributes in file explorer.

i've installed Filemon (http://www.sysinternals.com/Utilities/Filemon.html)
now. this gives more insight what happens to the file. in this case its file
6422806, the first error message appeared at 23:45:21, the last one at
23:45:26 (only short duration this time).

{....}
23:44:57 postgres.exe:1944 WRITE
D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS Offset: 16384 Length:
8192
23:44:57 postgres.exe:1944 CLOSE
D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS
23:44:57 postgres.exe:1944 OPEN
D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS Options: Open
Access: 00010080
23:44:57 postgres.exe:1944 QUERY INFORMATION
D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS
FileAttributeTagInformation
23:44:57 postgres.exe:1944 DELETE
D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS
23:44:57 postgres.exe:1944 CLOSE
D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS
23:44:57 postgres.exe:1944 OPEN
D:\DB\PostgreSQL-8.2\data\base\3964774\6422806.1 NOT FOUND Options: Open
Access: 00010080
23:44:57 postgres.exe:5364 CLOSE
D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS
23:44:57 postgres.exe:2780 CLOSE
D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS
23:44:59 postgres.exe:6036 CLOSE
D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS
23:45:11 postgres.exe:5196 CLOSE
D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS
23:45:20 postgres.exe:1268 CLOSE
D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS
23:45:21 postgres.exe:5196 OPEN
D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 DELETE PEND Options: Open
Access: 0012019F
23:45:22 postgres.exe:5196 OPEN
D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 DELETE PEND Options: Open
Access: 0012019F
23:45:23 postgres.exe:5196 OPEN
D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 DELETE PEND Options: Open
Access: 0012019F
23:45:24 postgres.exe:5196 OPEN
D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 DELETE PEND Options: Open
Access: 0012019F
23:45:25 postgres.exe:5196 OPEN
D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 DELETE PEND Options: Open
Access: 0012019F
23:45:26 postgres.exe:5196 OPEN
D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 DELETE PEND Options: Open
Access: 0012019F
23:45:26 postgres.exe:5428 CLOSE
D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS
23:45:26 postgres.exe:2200 CLOSE
D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 SUCCESS
23:45:27 postgres.exe:5196 OPEN
D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 NOT FOUND Options: Open
Access: 0012019F

i have earlier log data for this file if needed, but at :45:27 was the last
entry. unfortunately i wasn't quick enough to find the blocking process in
processviewer, but i guess its pid 5196

>> can i somehow check what object that file-OID belong(ed/s) to?
>
> You can check in pg_class.relfilenode and pg_class.oid of that database
> to see if you get a match.  But our theory is that this table has been
> deleted ...

nothing there as assumed.

- thomas

Re: BUG #2712: could not fsync segment: Permission

From
Tom Lane
Date:
"Thomas H." <me@alternize.com> writes:
>> Do you actually mean it's trying to write that file?  Or is it just
>> sitting there holding the open filehandle?

> well, hard to tell :-)
> according to the log-messages i would assume it is *trying* to write.

The log messages you have don't make it clear which process is trying to
do the fsync, but I would expect it to be the bgwriter.  (Possibly you
should modify log_line_prefix to include PID so we can tell a bit
better.)

            regards, tom lane

Re: BUG #2712: could not fsync segment: Permission

From
"Thomas H."
Date:
> The log messages you have don't make it clear which process is trying to
> do the fsync, but I would expect it to be the bgwriter.  (Possibly you
> should modify log_line_prefix to include PID so we can tell a bit
> better.)

you're right (as always :-)). its the "writer process" (pid 5196) that
outputs the error messages:

2006-10-24 00:09:09 [5196] ERROR:  XX000: storage sync failed on magnetic
disk: Permission denied
2006-10-24 00:09:09 [5196] LOCATION:  smgrsync, smgr.c:888
2006-10-24 00:09:10 [5196] LOG:  42501: could not fsync segment 0 of
relation 1663/3964774/6422947: Permission denied
2006-10-24 00:09:10 [5196] LOCATION:  mdsync, md.c:785

and in this case, its process 5988 that keeps the file handle open (its
entry in pg_class is already deleted):

\BaseNamedObjects\pgident: postgres: db_outnow outnow1 127.0.0.1(2362) idle
D:\DB\PostgreSQL-8.2\data\base\3964774\6422947 (1 references, 1 handle)

... while pid 5196 constantly tries to open the file (for over 15min in this
case), until...


00:22:18 postgres.exe:5196 OPEN
D:\DB\PostgreSQL-8.2\data\base\3964774\6422947 DELETE PEND Options: Open
Access: 0012019F
00:22:19 postgres.exe:5196 OPEN
D:\DB\PostgreSQL-8.2\data\base\3964774\6422947 DELETE PEND Options: Open
Access: 0012019F
00:22:20 postgres.exe:5988 CLOSE
D:\DB\PostgreSQL-8.2\data\base\3964774\6422947 SUCCESS
00:22:20 postgres.exe:5196 OPEN
D:\DB\PostgreSQL-8.2\data\base\3964774\6422947 NOT FOUND Options: Open
Access: 0012019F

is that of any use? what more logging options would be interesting?

- thomas

Re: BUG #2712: could not fsync segment: Permission

From
Tom Lane
Date:
"Thomas H." <me@alternize.com> writes:
> i've installed Filemon (http://www.sysinternals.com/Utilities/Filemon.html)
> now. this gives more insight what happens to the file.
> ...
> D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 DELETE PEND Options: Open
> Access: 0012019F

This is quite interesting, because it says that Filemon knows how to
distinguish a "delete pending" error from other errors.  If we could
do that, then my prior worries about ignoring all EACCES errors would
go away.  What's it looking at exactly?

            regards, tom lane

Re: BUG #2712: could not fsync segment: Permission

From
"Magnus Hagander"
Date:
> > i've installed Filemon=20
> > (http://www.sysinternals.com/Utilities/Filemon.html)
> > now. this gives more insight what happens to the file.
> > ...
> > D:\DB\PostgreSQL-8.2\data\base\3964774\6422806 DELETE PEND Options:=20
> > Open
> > Access: 0012019F
>=20
> This is quite interesting, because it says that Filemon knows=20
> how to distinguish a "delete pending" error from other=20
> errors.  If we could do that, then my prior worries about=20
> ignoring all EACCES errors would go away.  What's it looking=20
> at exactly?

filemon is a kernel mode filter driver. So it's looking at kernel-only
datastructures. AFAIK, there is no way to see that from userspace.

//Magnus

Re: could not rename xlog (was: BUG #2712)

From
"Thomas H."
Date:
> "Peter Brant" <Peter.Brant@wicourts.gov> writes:
>> The same problem exists in 8.1 too.  See this thread
>> http://archives.postgresql.org/pgsql-bugs/2006-04/msg00177.php
>> Tom and Magnus tracked down a cause, but I don't think a fix was ever
>> implemented.
>
> Thomas seems to have two different issues there: the "could not rename
> file" problem on the pg_xlog file is probably explained by the mechanism
> we identified back then (and I'm not sure why no fix has been
> installed)

just had another total lockdown. the writer-process was trying to rename the
C1 to CA which failed:

2006-10-24 14:27:58 [5196] LOG:  00000: could not rename file
"pg_xlog/0000000100000004000000C1" to "pg_xlog/0000000100000004000000CA",
continuing to try
2006-10-24 14:27:58 [5196] LOCATION:  pgrename, dirmod.c:142

when checking the process with process explorer, it reveals that it has this
file handles to pg_xlog open:

D:\DB\PostgreSQL-8.2\data\pg_xlog
D:\DB\PostgreSQL-8.2\data\pg_xlog\0000000100000004000000C3

under normal operation, "writer process" does not seem to have a file handle
to the xlog directory (D:\DB\PostgreSQL-8.2\data\pg_xlog)

the last error log entry prior to the lockdown was about 15min, so probably
the two problems are unrelated.

unfortunately, before i could try to get more informations, i had to restart
the pg_ctl due to the system being in half-productive mode. that leads me to
the question: what are the chances of me being helpful with getting more
informations? i can have the db run like that for some more days, but for
long term i can't babysit our application and might have to go back to 8.1.

thanks,
thomas

Re: could not rename xlog (was: BUG #2712)

From
"Thomas H."
Date:
another lockup, this time due to pgstat_write_statsfile :-(

2006-10-24 17:01:17 [5412] LOG:  XX000: could not rename temporary
statistics file "global/pgstat.tmp" to "global/pgstat.stat": A blocking
operation was interrupted by a call to WSACancelBlockingCall.
2006-10-24 17:01:17 [5412] LOCATION:  pgstat_write_statsfile, pgstat.c:2008
2006-10-24 17:23:23 [3280] LOG:  00000: received fast shutdown request
2006-10-24 17:23:23 [3280] LOCATION:  pmdie, postmaster.c:1903
2006-10-24 17:23:23 [3280] LOG:  00000: aborting any active transactions
2006-10-24 17:23:23 [3280] LOCATION:  pmdie, postmaster.c:1910
2006-10-24 17:23:23 [3468] FATAL:  57P01: terminating connection due to
administrator command
2006-10-24 17:23:23 [3468] LOCATION:  ProcessInterrupts, postgres.c:2465

this is what filemon reports (~10 times a second):

17:23:18 postgres.exe:1432 OPEN
D:\DB\PostgreSQL-8.2\data\pg_xlog\0000000100000004000000DB DELETE PEND
Options: Open  Access: 00110080
17:23:18 postgres.exe:1432 OPEN
D:\DB\PostgreSQL-8.2\data\pg_xlog\0000000100000004000000DB DELETE PEND
Options: Open  Access: 00110080
17:23:18 postgres.exe:1432 OPEN
D:\DB\PostgreSQL-8.2\data\pg_xlog\0000000100000004000000DB DELETE PEND
Options: Open  Access: 00110080

sorry for flooding. just tell me if i shall rather stop.

- thomas






----- Original Message -----
From: "Thomas H." <me@alternize.com>
To: <pgsql-bugs@postgresql.org>
Sent: Tuesday, October 24, 2006 3:15 PM
Subject: Re: [BUGS] could not rename xlog (was: BUG #2712)


>> "Peter Brant" <Peter.Brant@wicourts.gov> writes:
>>> The same problem exists in 8.1 too.  See this thread
>>> http://archives.postgresql.org/pgsql-bugs/2006-04/msg00177.php
>>> Tom and Magnus tracked down a cause, but I don't think a fix was ever
>>> implemented.
>>
>> Thomas seems to have two different issues there: the "could not rename
>> file" problem on the pg_xlog file is probably explained by the mechanism
>> we identified back then (and I'm not sure why no fix has been
>> installed)
>
> just had another total lockdown. the writer-process was trying to rename
> the C1 to CA which failed:
>
> 2006-10-24 14:27:58 [5196] LOG:  00000: could not rename file
> "pg_xlog/0000000100000004000000C1" to "pg_xlog/0000000100000004000000CA",
> continuing to try
> 2006-10-24 14:27:58 [5196] LOCATION:  pgrename, dirmod.c:142
>
> when checking the process with process explorer, it reveals that it has
> this file handles to pg_xlog open:
>
> D:\DB\PostgreSQL-8.2\data\pg_xlog
> D:\DB\PostgreSQL-8.2\data\pg_xlog\0000000100000004000000C3
>
> under normal operation, "writer process" does not seem to have a file
> handle to the xlog directory (D:\DB\PostgreSQL-8.2\data\pg_xlog)
>
> the last error log entry prior to the lockdown was about 15min, so
> probably the two problems are unrelated.
>
> unfortunately, before i could try to get more informations, i had to
> restart the pg_ctl due to the system being in half-productive mode. that
> leads me to the question: what are the chances of me being helpful with
> getting more informations? i can have the db run like that for some more
> days, but for long term i can't babysit our application and might have to
> go back to 8.1.
>
> thanks,
> thomas
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>       subscribe-nomail command to majordomo@postgresql.org so that your
>       message can get through to the mailing list cleanly
>

Re: BUG #2712: could not fsync segment: Permission

From
"Magnus Hagander"
Date:
> > The same problem exists in 8.1 too.  See this thread=20
> > http://archives.postgresql.org/pgsql-bugs/2006-04/msg00177.php
> > Tom and Magnus tracked down a cause, but I don't think a=20
> fix was ever=20
> > implemented.
>=20
> Thomas seems to have two different issues there: the "could=20
> not rename file" problem on the pg_xlog file is probably=20
> explained by the mechanism we identified back then (and I'm=20
> not sure why no fix has been installed), but there is no=20
> known reason other than antivirus software for the "could not=20
> fsync" problem.
>=20
> As for fixing the problem we do understand: ISTM it's just an=20
> awful idea for pgrename and pgunlink to be willing to loop=20
> forever.  I think they should time out and report the failure=20
> after some reasonable period (say between 10 sec and a minute).
>=20
> If we simply made that change, then the behavior when there's=20
> an idle backend sitting on a filehandle for an old xlog=20
> segment would be that checkpoints would fail at the=20
> MoveOfflineLogs stage, which would not be fatal, but it'd be=20
> annoying.  We'd probably want to further tweak=20
> InstallXLogFileSegment so that rename failure isn't an ERROR,=20
> at least not on Windows.  (I think we could just make it=20
> return false, which'd cause the caller to try to delete the=20
> xlog segment, which should work even though rename doesn't.)
>=20
> I'm not in a position to test this though.  Magnus or Bruce?

I haven't reproduced this on my box. But if you can give me a patch to
try I can build binaries for Thomas to test, if he can do testing but
not building.

//Magnus

Re: BUG #2712: could not fsync segment: Permission

From
"Thomas H."
Date:
>> I'm not in a position to test this though.  Magnus or Bruce?
>
>I haven't reproduced this on my box. But if you can give me a patch to
>try I can build binaries for Thomas to test, if he can do testing but
>not building.

a binary would be marvelous. if too much hasle, i can setup a msvc++ 2005
here and try to build it on my own, but would obviously prefere if i won't
have to...

b2 is installed here, but i'm seeing the same problems, so yes, i'm ready
for testing ;-)

thanks,
thomas

Re: BUG #2712: could not fsync segment: Permission

From
Tom Lane
Date:
"Magnus Hagander" <mha@sollentuna.net> writes:
> I haven't reproduced this on my box. But if you can give me a patch to
> try I can build binaries for Thomas to test, if he can do testing but
> not building.

Utterly untested ... BTW, why does pgrename have an #if to check
either GetLastError() or errno, but pgunlink doesn't?

            regards, tom lane


Attachment

Re: BUG #2712: could not fsync segment: Permission

From
Bruce Momjian
Date:
Tom Lane wrote:
> "Magnus Hagander" <mha@sollentuna.net> writes:
> > I haven't reproduced this on my box. But if you can give me a patch to
> > try I can build binaries for Thomas to test, if he can do testing but
> > not building.
>
> Utterly untested ... BTW, why does pgrename have an #if to check
> either GetLastError() or errno, but pgunlink doesn't?

Seems like a bug ---  they both should have it.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: BUG #2712: could not fsync segment: Permission

From
"Thomas H."
Date:
>> As for fixing the problem we do understand: ISTM it's just an
>> awful idea for pgrename and pgunlink to be willing to loop
>> forever.  I think they should time out and report the failure
>> after some reasonable period (say between 10 sec and a minute).

is the main problem realy in the rename/delete function? while i'm in no
position of actually knowing whats going on under the hood, my observations
in +10 cases during this afternoon/evening revealed some patterns:

it is defenitely the writer process that blocks the db. but in every case
the writer process seems to fail to rename the file due to another
postgresql still holding a filehandle to the very xlog file that should be
renamed. ProcessExplorer lets you force a close of the file handle - as soon
as you do this [which is a bad thing to do, i assume], the rename succeeds
and processing continues normally.

i actually can reproduce the error at will now - i just need do pump enough
data into the db (~200mb data seems sufficient) to have it lock up.

- thomas

Re: BUG #2712: could not fsync segment: Permission

From
Tom Lane
Date:
"Thomas H." <me@alternize.com> writes:
> it is defenitely the writer process that blocks the db. but in every case
> the writer process seems to fail to rename the file due to another
> postgresql still holding a filehandle to the very xlog file that should be
> renamed.

Right, all you need is a backend process that's made at least one xlog
write and then is left to sit idle for long enough that that xlog file
is due for recycling.

However, the fact that the writer process is stuck should not in itself
cause the DB to lock up.  I think what's really happening is that after
the writer process gets stuck, the remaining backends chew all the
available WAL, and then they need to create more WAL segments for
themselves, and the writer process is holding the ControlFileLock so
they can't.

It might be interesting to think about not requiring the ControlFileLock
to be held while making new WAL segments.  I think the only reason it
does that is to ensure that link/rename failure can be treated as a hard
error (because no one could have beat us to the filename), but we're
already having to back off that stance for Windows ...

            regards, tom lane

Re: BUG #2712: could not fsync segment: Permission

From
"Thomas H."
Date:
> It might be interesting to think about not requiring the ControlFileLock
> to be held while making new WAL segments.  I think the only reason it
> does that is to ensure that link/rename failure can be treated as a hard
> error (because no one could have beat us to the filename), but we're
> already having to back off that stance for Windows ...

on a sidenote, i was able to work around the total xlog-lock by ingreasing
checkpoint_segments from 3 (default) to 12. that seems enough to have the
processes release (timeout?) the filehandles before writer-process wants to
rename the xlog file, at least under normal workload. if there is a data
load, the lockup still happens, but i can live with that for now.

the logs are still being swamped with the other delete error messages, tho:

2006-10-27 16:16:58 [5828] ERROR:  XX000: storage sync failed on magnetic
disk: Permission denied
2006-10-27 16:16:58 [5828] LOCATION:  smgrsync, smgr.c:888
2006-10-27 16:16:59 [5828] LOG:  42501: could not fsync segment 0 of
relation 1663/3964774/6495380: Permission denied
2006-10-27 16:16:59 [5828] LOCATION:  mdsync, md.c:785
2006-10-27 16:16:59 [5828] ERROR:  XX000: storage sync failed on magnetic
disk: Permission denied
2006-10-27 16:16:59 [5828] LOCATION:  smgrsync, smgr.c:888

magnus, where you able to do a debug build for me to test the patch? would
be nice if a solution could be found for the final 8.2 release.

cheers,
thomas

Re: BUG #2712: could not fsync segment: Permission

From
"Magnus Hagander"
Date:
> > I haven't reproduced this on my box. But if you can give me=20
> a patch to=20
> > try I can build binaries for Thomas to test, if he can do=20
> testing but=20
> > not building.
>=20
> Utterly untested ... BTW, why does pgrename have an #if to=20
> check either GetLastError() or errno, but pgunlink doesn't?

Ok, I've built a .EXE with this patch. It's compiled without pretty much
all other options, hope that still works :-) (Meaning no NLS, no
kerberos, no SSL etc)

Grab the exe from
http://www.hagander.net/download/postgres_renamepatch.zip.

Sorry about the delay.

//Magnus

Re: BUG #2712: could not fsync segment: Permission

From
"Thomas H."
Date:
thanks alot. using only standard features, so the missing bits shouldn't be
a problem. will try to test later tonight or tomorrow.

> Grab the exe from
> http://www.hagander.net/download/postgres_renamepatch.zip.

report will follow.

regards,
thomas

----- Original Message -----
From: "Magnus Hagander" <mha@sollentuna.net>
To: "Tom Lane" <tgl@sss.pgh.pa.us>
Cc: "Peter Brant" <Peter.Brant@wicourts.gov>; "Thomas H."
<me@alternize.com>; <pgsql-bugs@postgresql.org>; "Bruce Momjian"
<bruce@momjian.us>
Sent: Sunday, October 29, 2006 6:10 PM
Subject: RE: [BUGS] BUG #2712: could not fsync segment: Permission


> > I haven't reproduced this on my box. But if you can give me
> a patch to
> > try I can build binaries for Thomas to test, if he can do
> testing but
> > not building.
>
> Utterly untested ... BTW, why does pgrename have an #if to
> check either GetLastError() or errno, but pgunlink doesn't?

Ok, I've built a .EXE with this patch. It's compiled without pretty much
all other options, hope that still works :-) (Meaning no NLS, no
kerberos, no SSL etc)

Grab the exe from
http://www.hagander.net/download/postgres_renamepatch.zip.

Sorry about the delay.

//Magnus

Re: xlog lockup patch (was: BUG #2712: could not fsync segment: Permission)

From
"Thomas H."
Date:
i've loaded 1gb of data without any xlog-problems, whereas with the 8.2b2
executable it locked up after ~100mb. the xlog-files are cycling...

if i need to test for some specific behaviour let me know.


maybe a similar patch could be found for the 2nd permission problem, where
the writer process tries to use a previously deleted file whose filehandle
is still in use by another postgresql process:

2006-10-31 07:12:37 [5392] ERROR:  42501: could not open relation
1663/3964774/6696548: Permission denied
2006-10-31 07:12:37 [5392] LOCATION:  mdopen, md.c:366
2006-10-31 07:12:38 [5392] ERROR:  42501: could not open relation
1663/3964774/6696548: Permission denied
2006-10-31 07:12:38 [5392] LOCATION:  mdopen, md.c:366


thanks for your efforts, very much appreciated!

- thomas

----- Original Message -----
Sent: Sunday, October 29, 2006 6:10 PM
Subject: Re: [BUGS] BUG #2712: could not fsync segment: Permission


> > I haven't reproduced this on my box. But if you can give me
> a patch to
> > try I can build binaries for Thomas to test, if he can do
> testing but
> > not building.
>
> Utterly untested ... BTW, why does pgrename have an #if to
> check either GetLastError() or errno, but pgunlink doesn't?

Ok, I've built a .EXE with this patch. It's compiled without pretty much
all other options, hope that still works :-) (Meaning no NLS, no
kerberos, no SSL etc)

Grab the exe from
http://www.hagander.net/download/postgres_renamepatch.zip.

Sorry about the delay.

//Magnus

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
       choose an index scan if your joining column's datatypes do not
       match

Re: BUG #2712: could not fsync segment: Permission

From
Tom Lane
Date:
I wrote:
> It might be interesting to think about not requiring the ControlFileLock
> to be held while making new WAL segments.  I think the only reason it
> does that is to ensure that link/rename failure can be treated as a hard
> error (because no one could have beat us to the filename), but we're
> already having to back off that stance for Windows ...

Just for the archives: I looked into doing this, and realized that the
reason we use a lock here is that on platforms without working link()
we use rename() to emplace new WAL segments.  And rename() is defined
to unlink any existing file of the target name, which would be
completely unsafe to do.

On platforms with link() we could probably get away without a lock,
but I'm disinclined to introduce that much of a platform-specific
difference in behavior when we don't have to.

            regards, tom lane

Re: xlog lockup patch (was: BUG #2712: could not fsync segment: Permission)

From
"Thomas H."
Date:
me wrote:
> i've loaded 1gb of data without any xlog-problems, whereas with the 8.2b2
> executable it locked up after ~100mb. the xlog-files are cycling...
>
> if i need to test for some specific behaviour let me know.

what's the status on this patch for inclusion in future 8.2 builds? would be
nice to see it in b4 or rc1...

thanks,
thomas

Re: xlog lockup patch (was: BUG #2712: could not fsync

From
Bruce Momjian
Date:
Thomas H. wrote:
> me wrote:
> > i've loaded 1gb of data without any xlog-problems, whereas with the 8.2b2
> > executable it locked up after ~100mb. the xlog-files are cycling...
> >
> > if i need to test for some specific behaviour let me know.
>
> what's the status on this patch for inclusion in future 8.2 builds? would be
> nice to see it in b4 or rc1...

This was fixed the day after we packaged beta3:

    revision 1.44
    date: 2006/11/08 20:12:05;  author: tgl;  state: Exp;  lines: +20 -48
    Change Windows rename and unlink substitutes so that they time out after
    30 seconds instead of retrying forever.  Also modify xlog.c so that if
    it fails to rename an old xlog segment up to a future slot, it will
    unlink the segment instead.  Per discussion of bug #2712, in which it
    became apparent that Windows can handle unlinking a file that's being
    held open, but not renaming it.

beta4 will have the fix.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: xlog lockup patch (was: BUG #2712: could not fsync

From
TANIDA Yutaka
Date:
Hi.

On Mon, 20 Nov 2006 15:33:33 -0500 (EST)
Bruce Momjian <bruce@momjian.us> wrote:

> Thomas H. wrote:
> > me wrote:
> > > i've loaded 1gb of data without any xlog-problems, whereas with the 8.2b2
> > > executable it locked up after ~100mb. the xlog-files are cycling...
> > >
> > > if i need to test for some specific behaviour let me know.
> >
> > what's the status on this patch for inclusion in future 8.2 builds? would be
> > nice to see it in b4 or rc1...
>
> This was fixed the day after we packaged beta3:
>
>     revision 1.44
>     date: 2006/11/08 20:12:05;  author: tgl;  state: Exp;  lines: +20 -48
>     Change Windows rename and unlink substitutes so that they time out after
>     30 seconds instead of retrying forever.  Also modify xlog.c so that if
>     it fails to rename an old xlog segment up to a future slot, it will
>     unlink the segment instead.  Per discussion of bug #2712, in which it
>     became apparent that Windows can handle unlinking a file that's being
>     held open, but not renaming it.
>
> beta4 will have the fix.

Are there any plan to  backport this patch for 8.1/8.0 server ?

My customer has same problem in 8.1.5 , but this bug doesn't seems to
be fixed in 8.1.6 .

Backport patch for 8.1.6 attached.

Sorry for replying old mail.

--
TANIDA Yutaka <tanida@sraoss.co.jp>

Re: [pgsql-patches] xlog lockup patch (was: BUG #2712:

From
Bruce Momjian
Date:
It was considered too risky to backpatch because we couldn't do
sufficient testing.

---------------------------------------------------------------------------

TANIDA Yutaka wrote:
> Hi.
>
> On Mon, 20 Nov 2006 15:33:33 -0500 (EST)
> Bruce Momjian <bruce@momjian.us> wrote:
>
> > Thomas H. wrote:
> > > me wrote:
> > > > i've loaded 1gb of data without any xlog-problems, whereas with the 8.2b2
> > > > executable it locked up after ~100mb. the xlog-files are cycling...
> > > >
> > > > if i need to test for some specific behaviour let me know.
> > >
> > > what's the status on this patch for inclusion in future 8.2 builds? would be
> > > nice to see it in b4 or rc1...
> >
> > This was fixed the day after we packaged beta3:
> >
> >     revision 1.44
> >     date: 2006/11/08 20:12:05;  author: tgl;  state: Exp;  lines: +20 -48
> >     Change Windows rename and unlink substitutes so that they time out after
> >     30 seconds instead of retrying forever.  Also modify xlog.c so that if
> >     it fails to rename an old xlog segment up to a future slot, it will
> >     unlink the segment instead.  Per discussion of bug #2712, in which it
> >     became apparent that Windows can handle unlinking a file that's being
> >     held open, but not renaming it.
> >
> > beta4 will have the fix.
>
> Are there any plan to  backport this patch for 8.1/8.0 server ?
>
> My customer has same problem in 8.1.5 , but this bug doesn't seems to
> be fixed in 8.1.6 .
>
> Backport patch for 8.1.6 attached.
>
> Sorry for replying old mail.
>
> --
> TANIDA Yutaka <tanida@sraoss.co.jp>

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +