Thread: postgres crash SOS

postgres crash SOS

From

Felde Norbert

Date:

16 June 2010, 20:55:32

Hi all,

I use 8.2 on a windows server 2008.
Suddenly postgres crashed and I can not do anything. The message is
Could not access status of transaction 3982736.
DATAIL: Could not read from file "pg_clog/0003" at offset 204800: No error.

The only one thing I found to correct this is to create a file filled
with binary 0 and replace clog/0003. Both files size are the same. I
tried that but I am still not able to run vacuum. I tried even with a
bigger empty clog/0003 file, but nothing.
I tried to drop the last few transaction with pg_resetxlog and hoped I
can save some data, but the service wont start.
The last error message is:
ERROR: could not create relation 1663/0/1224: File exists. I looked
for base/1224/0/1663, base/1663/1224/0 and base/1663/0/1224 file but
there is no such file.

Can anyone suggest something?
Many data would be lost if I can not repaire that so please!

Thanks,
fenor

Re: postgres crash SOS

From

Scott Marlowe

Date:

16 June 2010, 21:16:56

On Wed, Jun 16, 2010 at 5:55 PM, Felde Norbert <fenor77@gmail.com> wrote:
> Hi all,
>
> I use 8.2 on a windows server 2008.
> Suddenly postgres crashed and I can not do anything. The message is
> Could not access status of transaction 3982736.
> DATAIL: Could not read from file "pg_clog/0003" at offset 204800: No error.
>
> The only one thing I found to correct this is to create a file filled
> with binary 0 and replace clog/0003. Both files size are the same. I
> tried that but I am still not able to run vacuum.

Was the original 0003 file there, and if so what size was it?  Did you
try to copy it out of the way first?  Did it belong to the pgsql user?
 Did it have any attributes set that would make it impossible to read?
 Are you running anti-virus on this machine? It's known to get in the
way and cause these kinds of problems...

 I tried even with a
> bigger empty clog/0003 file, but nothing.
> I tried to drop the last few transaction with pg_resetxlog and hoped I
> can save some data, but the service wont start.
> The last error message is:
> ERROR: could not create relation 1663/0/1224: File exists. I looked
> for base/1224/0/1663, base/1663/1224/0 and base/1663/0/1224 file but
> there is no such file.
>
> Can anyone suggest something?
> Many data would be lost if I can not repaire that so please!
>
> Thanks,
> fenor
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>



--
When fascism comes to America, it will be intolerance sold as diversity.

Re: postgres crash SOS

From

Merlin Moncure

Date:

16 June 2010, 22:35:26

On Wed, Jun 16, 2010 at 7:55 PM, Felde Norbert <fenor77@gmail.com> wrote:
> Hi all,
>
> I use 8.2 on a windows server 2008.
> Suddenly postgres crashed and I can not do anything. The message is
> Could not access status of transaction 3982736.
> DATAIL: Could not read from file "pg_clog/0003" at offset 204800: No error.
>
> The only one thing I found to correct this is to create a file filled
> with binary 0 and replace clog/0003. Both files size are the same. I

that was probably not the right thing to do -- did you save off the
orgiinal file?

merlin

Re: postgres crash SOS

From

Felde Norbert

Date:

17 June 2010, 05:48:06

Before I began I made a filesystem level backup after I stopped the
postgres service.
I have the original 0003 file, the size is 204800, The size of the
other files in this dir is 262144.

I corrected the permissions of the whole data dir, but the error
message is the same.
The exact messages:
The message is the same for the original pg_clog/0003, the 0003
containing binary 0 and after pg_resetxlog:
pg_dump: Error message from server: ERROR:  could not access status of
transaction 3974799
DETAIL:  Could not read from file "pg_clog/0003" at offset 204800: No error.
pg_dump: The command was: COPY public.active_sessions_split (ct_sid,
ct_name, ct_pos, ct_val, ct_changed) TO stdout;
pg_dump: *** aborted because of error


If create the bigger 0003 containing 0 than I get that:
pg_dump: Error message from server: ERROR:  xlog flush request
0/A19F5BF8 is not satisfied --- flushed only to 0/A02A1AC8
CONTEXT:  writing block 1149 of relation 1663/4192208/4192508
pg_dump: The command was: COPY public.history (historyid, adatkod,
elemid, userid, ido, actionid, targyid, szuloid, opvalue, longfield,
longtext) TO stdout;
pg_dump: *** aborted because of error

Thanks,
fenor

2010/6/17 Scott Marlowe <scott.marlowe@gmail.com>:
> On Wed, Jun 16, 2010 at 5:55 PM, Felde Norbert <fenor77@gmail.com> wrote:
>> Hi all,
>>
>> I use 8.2 on a windows server 2008.
>> Suddenly postgres crashed and I can not do anything. The message is
>> Could not access status of transaction 3982736.
>> DATAIL: Could not read from file "pg_clog/0003" at offset 204800: No error.
>>
>> The only one thing I found to correct this is to create a file filled
>> with binary 0 and replace clog/0003. Both files size are the same. I
>> tried that but I am still not able to run vacuum.
>
> Was the original 0003 file there, and if so what size was it?  Did you
> try to copy it out of the way first?  Did it belong to the pgsql user?
>  Did it have any attributes set that would make it impossible to read?
>  Are you running anti-virus on this machine? It's known to get in the
> way and cause these kinds of problems...
>
>  I tried even with a
>> bigger empty clog/0003 file, but nothing.
>> I tried to drop the last few transaction with pg_resetxlog and hoped I
>> can save some data, but the service wont start.
>> The last error message is:
>> ERROR: could not create relation 1663/0/1224: File exists. I looked
>> for base/1224/0/1663, base/1663/1224/0 and base/1663/0/1224 file but
>> there is no such file.
>>
>> Can anyone suggest something?
>> Many data would be lost if I can not repaire that so please!
>>
>> Thanks,
>> fenor
>>
>> --
>> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-general
>>
>
>
>
> --
> When fascism comes to America, it will be intolerance sold as diversity.
>

Re: postgres crash SOS

From

Merlin Moncure

Date:

17 June 2010, 11:35:35

On Thu, Jun 17, 2010 at 4:47 AM, Felde Norbert <fenor77@gmail.com> wrote:
> Before I began I made a filesystem level backup after I stopped the
> postgres service.
> I have the original 0003 file, the size is 204800, The size of the
> other files in this dir is 262144.

hm...any indication of why the file is small? run out of space?
filesystem/os issues?

try appending hex 0x55 to the end of the file until it's the right
length and starting up the database. disable everything (don't vacuum)
and immediately try to pull off a dump.

what's the last good backup you have?

merlin

Re: postgres crash SOS

From

Merlin Moncure

Date:

17 June 2010, 11:39:10

On Thu, Jun 17, 2010 at 10:35 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, Jun 17, 2010 at 4:47 AM, Felde Norbert <fenor77@gmail.com> wrote:
>> Before I began I made a filesystem level backup after I stopped the
>> postgres service.
>> I have the original 0003 file, the size is 204800, The size of the
>> other files in this dir is 262144.
>
> hm...any indication of why the file is small? run out of space?
> filesystem/os issues?
>
> try appending hex 0x55 to the end of the file until it's the right
> length and starting up the database. disable everything (don't vacuum)
> and immediately try to pull off a dump.
>
> what's the last good backup you have?

if you want to move this along faster and get some more minds on the
problem, hop onto postgresql irc channel:
http://www.postgresql.org/community/irc

also make sure not to modifiy your original filesystem snapshot.
always modify the copy first.

merlin

Re: postgres crash SOS

From

Tom Lane

Date:

17 June 2010, 15:53:59

Felde Norbert <fenor77@gmail.com> writes:
> The message is the same for the original pg_clog/0003, the 0003
> containing binary 0 and after pg_resetxlog:
> pg_dump: Error message from server: ERROR:  could not access status of
> transaction 3974799
> DETAIL:  Could not read from file "pg_clog/0003" at offset 204800: No error.
> pg_dump: The command was: COPY public.active_sessions_split (ct_sid,
> ct_name, ct_pos, ct_val, ct_changed) TO stdout;
> pg_dump: *** aborted because of error

> If create the bigger 0003 containing 0 than I get that:
> pg_dump: Error message from server: ERROR:  xlog flush request
> 0/A19F5BF8 is not satisfied --- flushed only to 0/A02A1AC8
> CONTEXT:  writing block 1149 of relation 1663/4192208/4192508
> pg_dump: The command was: COPY public.history (historyid, adatkod,
> elemid, userid, ido, actionid, targyid, szuloid, opvalue, longfield,
> longtext) TO stdout;
> pg_dump: *** aborted because of error

I'm afraid this means you're screwed :-(.  Both of those symptoms imply
that one part of the database storage is out of sync with another part:
the first error says there are transaction IDs in the
active_sessions_split table that don't exist in pg_clog, and the second
error says that there are pages in the history table that were last
updated by WAL records that don't exist in pg_xlog.  If there are two
such errors, there are probably more.

You weren't too specific about how you got into this state, but I
suppose that it must have been a system crash or power failure.  Even
then, you would not have gotten burnt if the filesystem and hardware
did what they're supposed to do.  I suspect you have a setup wherein
fsync() calls aren't being honored properly.  You may need to disable
write caching on your disks, and/or switch to another filesystem or OS.
(Personally I'd never run a database I cared about on Windows.)

            regards, tom lane

Re: postgres crash SOS

From

Dann Corbit

Date:

17 June 2010, 16:06:26

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of Tom Lane
> Sent: Thursday, June 17, 2010 11:54 AM
> To: Felde Norbert
> Cc: pgsql-general@postgresql.org; Scott Marlowe
> Subject: Re: [GENERAL] postgres crash SOS
>
> Felde Norbert <fenor77@gmail.com> writes:
> > The message is the same for the original pg_clog/0003, the 0003
> > containing binary 0 and after pg_resetxlog:
> > pg_dump: Error message from server: ERROR:  could not access status
> of
> > transaction 3974799
> > DETAIL:  Could not read from file "pg_clog/0003" at offset 204800: No
> error.
> > pg_dump: The command was: COPY public.active_sessions_split (ct_sid,
> > ct_name, ct_pos, ct_val, ct_changed) TO stdout;
> > pg_dump: *** aborted because of error
>
> > If create the bigger 0003 containing 0 than I get that:
> > pg_dump: Error message from server: ERROR:  xlog flush request
> > 0/A19F5BF8 is not satisfied --- flushed only to 0/A02A1AC8
> > CONTEXT:  writing block 1149 of relation 1663/4192208/4192508
> > pg_dump: The command was: COPY public.history (historyid, adatkod,
> > elemid, userid, ido, actionid, targyid, szuloid, opvalue, longfield,
> > longtext) TO stdout;
> > pg_dump: *** aborted because of error
>
> I'm afraid this means you're screwed :-(.  Both of those symptoms imply
> that one part of the database storage is out of sync with another part:
> the first error says there are transaction IDs in the
> active_sessions_split table that don't exist in pg_clog, and the second
> error says that there are pages in the history table that were last
> updated by WAL records that don't exist in pg_xlog.  If there are two
> such errors, there are probably more.
>
> You weren't too specific about how you got into this state, but I
> suppose that it must have been a system crash or power failure.  Even
> then, you would not have gotten burnt if the filesystem and hardware
> did what they're supposed to do.  I suspect you have a setup wherein
> fsync() calls aren't being honored properly.  You may need to disable
> write caching on your disks, and/or switch to another filesystem or OS.
> (Personally I'd never run a database I cared about on Windows.)

Somehow, I doubt that Windows is to blame.  For instance, Oracle and SQL*Server seem to run fine on Windows without
thissort of problem.

Re: postgres crash SOS

From

Tom Lane

Date:

17 June 2010, 16:20:09

Dann Corbit <DCorbit@connx.com> writes:
>> (Personally I'd never run a database I cared about on Windows.)

> Somehow, I doubt that Windows is to blame.  For instance, Oracle and SQL*Server seem to run fine on Windows without
thissort of problem. 

Really?  Are you front-line support for either, so that you can say
that they "never" have this sort of problem on the basis of any actual
evidence?  Our failure rate on Windows is probably pretty low too,
it's just that we only hear about non-working cases here.

(Perhaps more to the point, if they don't have problems, it's likely
because they tell their customers how to configure Windows boxes safely
before the fact.  And people who are spending the money for an Oracle
license will heed that advice.)

            regards, tom lane

Re: postgres crash SOS

From

Scott Marlowe

Date:

17 June 2010, 16:22:18

On Thu, Jun 17, 2010 at 1:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> (Perhaps more to the point, if they don't have problems, it's likely
> because they tell their customers how to configure Windows boxes safely
> before the fact.  And people who are spending the money for an Oracle
> license will heed that advice.)

The only advice I've ever gotten from Oracle regarding running it on
Windows was "don't do that."

Re: postgres crash SOS

From

Dann Corbit

Date:

17 June 2010, 16:28:03

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Thursday, June 17, 2010 12:20 PM
> To: Dann Corbit
> Cc: Felde Norbert; pgsql-general@postgresql.org; Scott Marlowe
> Subject: Re: [GENERAL] postgres crash SOS
>
> Dann Corbit <DCorbit@connx.com> writes:
> >> (Personally I'd never run a database I cared about on Windows.)
>
> > Somehow, I doubt that Windows is to blame.  For instance, Oracle and
> SQL*Server seem to run fine on Windows without this sort of problem.
>
> Really?  Are you front-line support for either, so that you can say
> that they "never" have this sort of problem on the basis of any actual
> evidence?  Our failure rate on Windows is probably pretty low too,
> it's just that we only hear about non-working cases here.

I have worked as a DBA for large farms of database systems.

> (Perhaps more to the point, if they don't have problems, it's likely
> because they tell their customers how to configure Windows boxes safely
> before the fact.  And people who are spending the money for an Oracle
> license will heed that advice.)
>

Care was taken in selection of hardware and configuration.  Is this different for any database system?

Re: postgres crash SOS

From

"Joshua D. Drake"

Date:

17 June 2010, 16:42:17

On Thu, 2010-06-17 at 19:27 +0000, Dann Corbit wrote:
> > -----Original Message-----
> > From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> > Sent: Thursday, June 17, 2010 12:20 PM
> > To: Dann Corbit
> > Cc: Felde Norbert; pgsql-general@postgresql.org; Scott Marlowe
> > Subject: Re: [GENERAL] postgres crash SOS
> >
> > Dann Corbit <DCorbit@connx.com> writes:
> > >> (Personally I'd never run a database I cared about on Windows.)
> >
> > > Somehow, I doubt that Windows is to blame.  For instance, Oracle and
> > SQL*Server seem to run fine on Windows without this sort of problem.
> >
> > Really?  Are you front-line support for either, so that you can say
> > that they "never" have this sort of problem on the basis of any actual
> > evidence?  Our failure rate on Windows is probably pretty low too,
> > it's just that we only hear about non-working cases here.
>
> I have worked as a DBA for large farms of database systems.

I think this argument as a whole represents so many "What ifs" it is not
even worth having.

Fact: Windows Server is a decent OS for databases
Caveat: If properly managed
Wish: A decent shell to work in (although powershell is interesting)

Big fark of a caveat there. Of course the same caveat applies to Linux
or FreeBSD too.

>
> > (Perhaps more to the point, if they don't have problems, it's likely
> > because they tell their customers how to configure Windows boxes safely
> > before the fact.  And people who are spending the money for an Oracle
> > license will heed that advice.)
> >
>
> Care was taken in selection of hardware and configuration.  Is this different for any database system?

Yes :) but does it matter?

Sincerely,

Joshua D. Drake


--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering

Re: postgres crash SOS

From

Tom Lane

Date:

17 June 2010, 16:48:30

"Joshua D. Drake" <jd@commandprompt.com> writes:
> On Thu, 2010-06-17 at 19:27 +0000, Dann Corbit wrote:
>>> (Perhaps more to the point, if they don't have problems, it's likely
>>> because they tell their customers how to configure Windows boxes safely
>>> before the fact.  And people who are spending the money for an Oracle
>>> license will heed that advice.)
>>
>> Care was taken in selection of hardware and configuration.  Is this different for any database system?

> Yes :) but does it matter?

Sure it matters.  The people who are complaining are those who dropped
PG onto some el-cheapo generic PC or other and didn't spend any time
inquiring into things like write cache settings, if indeed they even
know what those are.

It's possible that SQL Server can survive in that kind of environment
because it knows about undocumented Windows hooks that allow it to force
the right things to happen even in a badly-configured machine.  It's
also possible that SQL Server gets just as badly hosed as we do.

            regards, tom lane

Re: postgres crash SOS

From

Merlin Moncure

Date:

17 June 2010, 16:56:27

On Thu, Jun 17, 2010 at 2:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> You weren't too specific about how you got into this state, but I
> suppose that it must have been a system crash or power failure.  Even
> then, you would not have gotten burnt if the filesystem and hardware
> did what they're supposed to do.  I suspect you have a setup wherein
> fsync() calls aren't being honored properly.  You may need to disable
> write caching on your disks, and/or switch to another filesystem or OS.
> (Personally I'd never run a database I cared about on Windows.)

Although I don't run pg/windows anymore, I did for years and I think
you're right -- I bet there was a hard reset event and either fsync
was off or not being honored properly.  I was communicating with the
OP off list to see if there was any evidence of disk full condition
(no response yet).   I never saw any data corruption in line with this
w/o some external trigger.

In regards to windows/pg generally, I don't think it's all that bad.
The windows NT kernel is absolutely rock solid stable (quite frankly,
moreso than linux IMO) and very well supported in terms of hardaware.
otoh, windows kernel configuration is wacky, the shell sucks, insanely
overdesigned security model, posix not really supported, etc.  It's a
mixed bag for sure.

merlin

Re: postgres crash SOS

From

Dann Corbit

Date:

17 June 2010, 17:22:32

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of Merlin Moncure
> Sent: Thursday, June 17, 2010 12:56 PM
> To: Tom Lane
> Cc: Felde Norbert; pgsql-general@postgresql.org; Scott Marlowe
> Subject: Re: [GENERAL] postgres crash SOS
>
> On Thu, Jun 17, 2010 at 2:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > You weren't too specific about how you got into this state, but I
> > suppose that it must have been a system crash or power failure.  Even
> > then, you would not have gotten burnt if the filesystem and hardware
> > did what they're supposed to do.  I suspect you have a setup wherein
> > fsync() calls aren't being honored properly.  You may need to disable
> > write caching on your disks, and/or switch to another filesystem or
> OS.
> > (Personally I'd never run a database I cared about on Windows.)
>
> Although I don't run pg/windows anymore, I did for years and I think
> you're right -- I bet there was a hard reset event and either fsync
> was off or not being honored properly.  I was communicating with the
> OP off list to see if there was any evidence of disk full condition
> (no response yet).   I never saw any data corruption in line with this
> w/o some external trigger.
>
> In regards to windows/pg generally, I don't think it's all that bad.
> The windows NT kernel is absolutely rock solid stable (quite frankly,
> moreso than linux IMO) and very well supported in terms of hardaware.
> otoh, windows kernel configuration is wacky, the shell sucks, insanely
> overdesigned security model, posix not really supported, etc.  It's a
> mixed bag for sure.

I think that the bottom line is that {like any DBMS system} effort is needed to make PostgreSQL run smoothly on any
platform,but for Windows support is still fairly new, while on POSIX systems all the kinks have had plenty of time to
getworked out.  So it should not be unexpected that there will be more problems seen on Windows platforms than on POSIX
typeplatforms.  Even so, I think that the progress on Windows has been excellent and that PostgreSQL is definitely a
viablealternative to traditional commercial systems.  In all cases (Windows or other alternatives) we should use
thoroughlytested and carefully configured systems to store our business data. 

Nothing new or remarkable here.  Just common sense.
IMO-YMMV

Re: postgres crash SOS

From

Felde Norbert

Date:

17 June 2010, 17:51:44

The first error message was what I got after postgres crashed and I
tried to make a dump, run vacuum or tried somthing else.
The second message I got when I tried to repaire the problem, so it
dous not matter because I did something wrong i see.

If I could choose I would use a linux server too, but if the partner
sais there is a windows server and you have to use that than there is
no discuss.

Why I was not specific how to this state came is I do not know.
I could not find anything about a power failer and disk space seemed
to be more than needed. There was entries in log for full virtual
memory.



2010/6/17 Tom Lane <tgl@sss.pgh.pa.us>:
> Felde Norbert <fenor77@gmail.com> writes:
>> The message is the same for the original pg_clog/0003, the 0003
>> containing binary 0 and after pg_resetxlog:
>> pg_dump: Error message from server: ERROR:  could not access status of
>> transaction 3974799
>> DETAIL:  Could not read from file "pg_clog/0003" at offset 204800: No error.
>> pg_dump: The command was: COPY public.active_sessions_split (ct_sid,
>> ct_name, ct_pos, ct_val, ct_changed) TO stdout;
>> pg_dump: *** aborted because of error
>
>> If create the bigger 0003 containing 0 than I get that:
>> pg_dump: Error message from server: ERROR:  xlog flush request
>> 0/A19F5BF8 is not satisfied --- flushed only to 0/A02A1AC8
>> CONTEXT:  writing block 1149 of relation 1663/4192208/4192508
>> pg_dump: The command was: COPY public.history (historyid, adatkod,
>> elemid, userid, ido, actionid, targyid, szuloid, opvalue, longfield,
>> longtext) TO stdout;
>> pg_dump: *** aborted because of error
>
> I'm afraid this means you're screwed :-(.  Both of those symptoms imply
> that one part of the database storage is out of sync with another part:
> the first error says there are transaction IDs in the
> active_sessions_split table that don't exist in pg_clog, and the second
> error says that there are pages in the history table that were last
> updated by WAL records that don't exist in pg_xlog.  If there are two
> such errors, there are probably more.
>
> You weren't too specific about how you got into this state, but I
> suppose that it must have been a system crash or power failure.  Even
> then, you would not have gotten burnt if the filesystem and hardware
> did what they're supposed to do.  I suspect you have a setup wherein
> fsync() calls aren't being honored properly.  You may need to disable
> write caching on your disks, and/or switch to another filesystem or OS.
> (Personally I'd never run a database I cared about on Windows.)
>
>                        regards, tom lane
>

Re: postgres crash SOS

From

Merlin Moncure

Date:

17 June 2010, 17:57:37

On Thu, Jun 17, 2010 at 4:51 PM, Felde Norbert <fenor77@gmail.com> wrote:
> The first error message was what I got after postgres crashed and I
> tried to make a dump, run vacuum or tried somthing else.
> The second message I got when I tried to repaire the problem, so it
> dous not matter because I did something wrong i see.
>
> If I could choose I would use a linux server too, but if the partner
> sais there is a windows server and you have to use that than there is
> no discuss.
>
> Why I was not specific how to this state came is I do not know.
> I could not find anything about a power failer and disk space seemed
> to be more than needed. There was entries in log for full virtual
> memory.

This came before the crash?  Are you sure the server didn't reset
following the virtual memory full?

Memory full is a very dangerous condition for a database server and
may have contributed to your problem or been a symptom of another
problem.  The main things we need to know (any data corruption issue
is worth trying to diagnose after the fact) are:

*) what is the setting for fsync?
*) Are you using a raid controller?  how is the cache configured?
*) If not, is your drive configured to buffer writes?
*) How much free space is left on your various volumes on the computer?

Did you check the system event log for interesting events at or around
the time you saw virtual memory full.  Can we see the log message
reporting memory full condition as well as surrounding messages?

merlin

Re: postgres crash SOS

From

"Joshua D. Drake"

Date:

17 June 2010, 23:42:01

On Thu, 2010-06-17 at 19:27 +0000, Dann Corbit wrote:
> > -----Original Message-----
> > From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> > Sent: Thursday, June 17, 2010 12:20 PM
> > To: Dann Corbit
> > Cc: Felde Norbert; pgsql-general@postgresql.org; Scott Marlowe
> > Subject: Re: [GENERAL] postgres crash SOS
> >
> > Dann Corbit <DCorbit@connx.com> writes:
> > >> (Personally I'd never run a database I cared about on Windows.)
> >
> > > Somehow, I doubt that Windows is to blame.  For instance, Oracle and
> > SQL*Server seem to run fine on Windows without this sort of problem.
> >
> > Really?  Are you front-line support for either, so that you can say
> > that they "never" have this sort of problem on the basis of any actual
> > evidence?  Our failure rate on Windows is probably pretty low too,
> > it's just that we only hear about non-working cases here.
>
> I have worked as a DBA for large farms of database systems.

I think this argument as a whole represents so many "What ifs" it is not
even worth having.

Fact: Windows Server is a decent OS for databases
Caveat: If properly managed
Wish: A decent shell to work in (although powershell is interesting)

Big fark of a caveat there. Of course the same caveat applies to Linux
or FreeBSD too.

>
> > (Perhaps more to the point, if they don't have problems, it's likely
> > because they tell their customers how to configure Windows boxes safely
> > before the fact.  And people who are spending the money for an Oracle
> > license will heed that advice.)
> >
>
> Care was taken in selection of hardware and configuration.  Is this different for any database system?

Yes :) but does it matter?

Sincerely,

Joshua D. Drake


--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering

Re: postgres crash SOS

From

Felde Norbert

Date:

18 June 2010, 05:55:19

Hi,

This are the informations I could collect:

We use cobian to create the backup.
There are two volumes in use, on C is the volume where everything is
installed and here is the postgres data dir too.
The postgres backup that runs everynight places the backup file on
this volume too, it runs before daily backup is started.
There is an another volume where the cobian places the daily backups.

So to be precise:
C:
    postgres
    postgres\data
    postgres dump before daily backup is started
D:
    daily backups including postgres dump from C

The D volume was full on the 06-06 and stayd so for 5 days.

The first virtual memory log entry happend on the 06-09 05:41 and the
last came 06-10 16:18
The log entries are about the same:
Windows successfully diagnosed a low virtual memory condition.
The following programs consumed the most virtual memory:
cbService.exe (2348) consumed 2058158080 bytes,
explorer.exe (7136) consumed 245456896 bytes,
and McScript_InUse.exe (1908) consumed 218529792 bytes

In the postgres log at that time is this:
Postgres log
2010-06-10 16:58:14 LOG:  database system was interrupted at 2010-06-10 16:16:36
2010-06-10 16:58:14 LOG:  checkpoint record is at 0/9FBE5158
2010-06-10 16:58:14 LOG:  redo record is at 0/9FBE5158; undo record is
at 0/0; shutdown FALSE
2010-06-10 16:58:14 LOG:  next transaction ID: 0/3620193; next OID: 6744703
2010-06-10 16:58:14 LOG:  next MultiXactId: 2; next MultiXactOffset: 3
2010-06-10 16:58:14 LOG:  database system was not properly shut down;
automatic recovery in progress
2010-06-10 16:58:14 LOG:  redo starts at 0/9FBE51A8
2010-06-10 16:58:14 FATAL:  the database system is starting up
2010-06-10 16:58:14 LOG:  record with zero length at 0/9FEEDF60
2010-06-10 16:58:14 LOG:  redo done at 0/9FEEDF30
2010-06-10 16:58:15 FATAL:  the database system is starting up
2010-06-10 16:58:16 FATAL:  the database system is starting up
2010-06-10 16:58:17 FATAL:  the database system is starting up
2010-06-10 16:58:17 LOG:  database system is ready
Before this I can not find any interesting entries in the postgres log.

The first postgres backup that failed was on 06-11 00:30. The log is
filled with that message:
2010-06-11 00:31:19 ERROR:  xlog flush request 0/9FF74848 is not
satisfied --- flushed only to 0/9FEEDFB0
2010-06-11 00:31:19 CONTEXT:  writing block 17942 of relation
1663/4192208/4192534
2010-06-11 00:31:19 STATEMENT:  FETCH 100 FROM _pg_dump_cursor.
This message appears in 1 sec intervals and only the writing blocks
blocknumber changes.

About the informations you asked:
There are 2 SCSI drives and they are mirrored using windows mirroring.
As I could find out, the mirroring is done with default settings.
The fsync settings are the default.

fenor

2010/6/17 Merlin Moncure <mmoncure@gmail.com>:
> On Thu, Jun 17, 2010 at 4:51 PM, Felde Norbert <fenor77@gmail.com> wrote:
>> The first error message was what I got after postgres crashed and I
>> tried to make a dump, run vacuum or tried somthing else.
>> The second message I got when I tried to repaire the problem, so it
>> dous not matter because I did something wrong i see.
>>
>> If I could choose I would use a linux server too, but if the partner
>> sais there is a windows server and you have to use that than there is
>> no discuss.
>>
>> Why I was not specific how to this state came is I do not know.
>> I could not find anything about a power failer and disk space seemed
>> to be more than needed. There was entries in log for full virtual
>> memory.
>
> This came before the crash?  Are you sure the server didn't reset
> following the virtual memory full?
>
> Memory full is a very dangerous condition for a database server and
> may have contributed to your problem or been a symptom of another
> problem.  The main things we need to know (any data corruption issue
> is worth trying to diagnose after the fact) are:
>
> *) what is the setting for fsync?
> *) Are you using a raid controller?  how is the cache configured?
> *) If not, is your drive configured to buffer writes?
> *) How much free space is left on your various volumes on the computer?
>
> Did you check the system event log for interesting events at or around
> the time you saw virtual memory full.  Can we see the log message
> reporting memory full condition as well as surrounding messages?
>
> merlin
>

Re: postgres crash SOS

From

Merlin Moncure

Date:

18 June 2010, 10:03:27

On Fri, Jun 18, 2010 at 4:55 AM, Felde Norbert <fenor77@gmail.com> wrote:
> Hi,
>
> This are the informations I could collect:
>
>
> We use cobian to create the backup.
> There are two volumes in use, on C is the volume where everything is
> installed and here is the postgres data dir too.
> The postgres backup that runs everynight places the backup file on
> this volume too, it runs before daily backup is started.
> There is an another volume where the cobian places the daily backups.
>
> So to be precise:
> C:
>        postgres
>        postgres\data
>        postgres dump before daily backup is started
> D:
>        daily backups including postgres dump from C
>
> The D volume was full on the 06-06 and stayd so for 5 days.
>
> The first virtual memory log entry happend on the 06-09 05:41 and the
> last came 06-10 16:18
> The log entries are about the same:
> Windows successfully diagnosed a low virtual memory condition.
> The following programs consumed the most virtual memory:
> cbService.exe (2348) consumed 2058158080 bytes,
> explorer.exe (7136) consumed 245456896 bytes,
> and McScript_InUse.exe (1908) consumed 218529792 bytes
>
>
> In the postgres log at that time is this:
> Postgres log
> 2010-06-10 16:58:14 LOG:  database system was interrupted at 2010-06-10 16:16:36
> 2010-06-10 16:58:14 LOG:  checkpoint record is at 0/9FBE5158
> 2010-06-10 16:58:14 LOG:  redo record is at 0/9FBE5158; undo record is
> at 0/0; shutdown FALSE
> 2010-06-10 16:58:14 LOG:  next transaction ID: 0/3620193; next OID: 6744703
> 2010-06-10 16:58:14 LOG:  next MultiXactId: 2; next MultiXactOffset: 3
> 2010-06-10 16:58:14 LOG:  database system was not properly shut down;
> automatic recovery in progress
> 2010-06-10 16:58:14 LOG:  redo starts at 0/9FBE51A8
> 2010-06-10 16:58:14 FATAL:  the database system is starting up
> 2010-06-10 16:58:14 LOG:  record with zero length at 0/9FEEDF60
> 2010-06-10 16:58:14 LOG:  redo done at 0/9FEEDF30
> 2010-06-10 16:58:15 FATAL:  the database system is starting up
> 2010-06-10 16:58:16 FATAL:  the database system is starting up
> 2010-06-10 16:58:17 FATAL:  the database system is starting up
> 2010-06-10 16:58:17 LOG:  database system is ready
> Before this I can not find any interesting entries in the postgres log.
>
>
> The first postgres backup that failed was on 06-11 00:30. The log is
> filled with that message:
> 2010-06-11 00:31:19 ERROR:  xlog flush request 0/9FF74848 is not
> satisfied --- flushed only to 0/9FEEDFB0
> 2010-06-11 00:31:19 CONTEXT:  writing block 17942 of relation
> 1663/4192208/4192534
> 2010-06-11 00:31:19 STATEMENT:  FETCH 100 FROM _pg_dump_cursor.
> This message appears in 1 sec intervals and only the writing blocks
> blocknumber changes.
>
>
>
> About the informations you asked:
> There are 2 SCSI drives and they are mirrored using windows mirroring.
> As I could find out, the mirroring is done with default settings.
> The fsync settings are the default.

My guess here is that you are running into a bad interaction between
postgres, your backup software, and windows.  Cobian like most windows
backup software products uses vss which snapshots the volume during
the backup.  While that is happening writes are windows has to keep
track of files written to in a different place.  Something blew a fuse
in windows and the database was reset (I'm betting your server
completely reset in addition to postgres).

This is pretty nasty because it's likely to happen again and your
storage looks like it's no properly honoring fsync.  My advice would
be to double check the fsync situation first so you don't lose any
more data (check windows write caching, and your scsi controller cache
setting).  After that, I think you need to take a look at your backup
strategy. Snapshotting the volume postgres is on is probably not a
good idea, especially if you are relying on pg_dump for the backup.

merlin

Re: postgres crash SOS

From

Felde Norbert

Date:

18 June 2010, 13:15:06

Hi,

This is my opinion, tell me if I am wrong.
the backup runs at 00:30. At that time, i am sure nobody use the
program which use postgres, so there is no transaction which store
would be a problem when cobian creates the snapshot.
Cobian started one day to create the backup but the volume was full
that caused that cobian hanged up.
Cobian was still runing and slowly filled the virtual memory and when
it was full, something went wrong with a postgres transaction. After
that point no backup could made, but what I am not understand is how
could postgres still serve data to our application.

We use this backup method for 3 years and nothing happend. Ok, I know,
we was lucky or something else, but I think it is long time if it is
so wrong.

Thank you for your suggestions and advices, I will see what can I do.

fenor

2010/6/18 Merlin Moncure <mmoncure@gmail.com>:
> On Fri, Jun 18, 2010 at 4:55 AM, Felde Norbert <fenor77@gmail.com> wrote:
>> Hi,
>>
>> This are the informations I could collect:
>>
>>
>> We use cobian to create the backup.
>> There are two volumes in use, on C is the volume where everything is
>> installed and here is the postgres data dir too.
>> The postgres backup that runs everynight places the backup file on
>> this volume too, it runs before daily backup is started.
>> There is an another volume where the cobian places the daily backups.
>>
>> So to be precise:
>> C:
>>        postgres
>>        postgres\data
>>        postgres dump before daily backup is started
>> D:
>>        daily backups including postgres dump from C
>>
>> The D volume was full on the 06-06 and stayd so for 5 days.
>>
>> The first virtual memory log entry happend on the 06-09 05:41 and the
>> last came 06-10 16:18
>> The log entries are about the same:
>> Windows successfully diagnosed a low virtual memory condition.
>> The following programs consumed the most virtual memory:
>> cbService.exe (2348) consumed 2058158080 bytes,
>> explorer.exe (7136) consumed 245456896 bytes,
>> and McScript_InUse.exe (1908) consumed 218529792 bytes
>>
>>
>> In the postgres log at that time is this:
>> Postgres log
>> 2010-06-10 16:58:14 LOG:  database system was interrupted at 2010-06-10 16:16:36
>> 2010-06-10 16:58:14 LOG:  checkpoint record is at 0/9FBE5158
>> 2010-06-10 16:58:14 LOG:  redo record is at 0/9FBE5158; undo record is
>> at 0/0; shutdown FALSE
>> 2010-06-10 16:58:14 LOG:  next transaction ID: 0/3620193; next OID: 6744703
>> 2010-06-10 16:58:14 LOG:  next MultiXactId: 2; next MultiXactOffset: 3
>> 2010-06-10 16:58:14 LOG:  database system was not properly shut down;
>> automatic recovery in progress
>> 2010-06-10 16:58:14 LOG:  redo starts at 0/9FBE51A8
>> 2010-06-10 16:58:14 FATAL:  the database system is starting up
>> 2010-06-10 16:58:14 LOG:  record with zero length at 0/9FEEDF60
>> 2010-06-10 16:58:14 LOG:  redo done at 0/9FEEDF30
>> 2010-06-10 16:58:15 FATAL:  the database system is starting up
>> 2010-06-10 16:58:16 FATAL:  the database system is starting up
>> 2010-06-10 16:58:17 FATAL:  the database system is starting up
>> 2010-06-10 16:58:17 LOG:  database system is ready
>> Before this I can not find any interesting entries in the postgres log.
>>
>>
>> The first postgres backup that failed was on 06-11 00:30. The log is
>> filled with that message:
>> 2010-06-11 00:31:19 ERROR:  xlog flush request 0/9FF74848 is not
>> satisfied --- flushed only to 0/9FEEDFB0
>> 2010-06-11 00:31:19 CONTEXT:  writing block 17942 of relation
>> 1663/4192208/4192534
>> 2010-06-11 00:31:19 STATEMENT:  FETCH 100 FROM _pg_dump_cursor.
>> This message appears in 1 sec intervals and only the writing blocks
>> blocknumber changes.
>>
>>
>>
>> About the informations you asked:
>> There are 2 SCSI drives and they are mirrored using windows mirroring.
>> As I could find out, the mirroring is done with default settings.
>> The fsync settings are the default.
>
> My guess here is that you are running into a bad interaction between
> postgres, your backup software, and windows.  Cobian like most windows
> backup software products uses vss which snapshots the volume during
> the backup.  While that is happening writes are windows has to keep
> track of files written to in a different place.  Something blew a fuse
> in windows and the database was reset (I'm betting your server
> completely reset in addition to postgres).
>
> This is pretty nasty because it's likely to happen again and your
> storage looks like it's no properly honoring fsync.  My advice would
> be to double check the fsync situation first so you don't lose any
> more data (check windows write caching, and your scsi controller cache
> setting).  After that, I think you need to take a look at your backup
> strategy. Snapshotting the volume postgres is on is probably not a
> good idea, especially if you are relying on pg_dump for the backup.
>
> merlin
>