Thread: Pg_xlog increase due to postgres crash (disk full)

Pg_xlog increase due to postgres crash (disk full)

From
Cliff de Carteret
Date:
My database crashed a couple of days ago during an upgrade several seconds after committing a large transaction to the database. Eventually we found out that this was due to the disk being full as the transaction had created several gigs of data. A day or so later the disk is full again and PostgreSQL crashes due to the pg_xlog file taking up all of the disk space. I have cleaned up the drive to have so extra space which allows PostgreSQL to start again but the xlogs are still increasing. I have two errors in my pg_log: 

"WARNING: transaction log file "00000001000000A800000078" could not be archived: too many failures" and 

"LOG: archive command failed with exit code 1
DETAIL: The failed archive command was: test ! -f /opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078 && cp pg_xlog/00000001000000A800000078 /opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078"

Postgres version 9.0.3 conf:

  • wal_level = hot_standby
    archive_mode = true
    archive_command = 'test ! -f /opt/postgres/remote_pgsql/wal_archive/%f && cp %p /opt/postgres/remote_pgsql/wal_archive/%f' # command to use to archive a logfile segment
    archive_timeout = 1800 
    max_wal_senders = 1
    max_standby_archive_delay = 900s
    max_standby_streaming_delay = 900s
    default_statistics_target = 50 # pgtune wizard 2010-11-18
    maintenance_work_mem = 480MB # pgtune wizard 2010-11-18
    constraint_exclusion = on # pgtune wizard 2010-11-18
    checkpoint_completion_target = 0.9 # pgtune wizard 2010-11-18
    effective_cache_size = 5632MB # pgtune wizard 2010-11-18
    work_mem = 48MB # pgtune wizard 2010-11-18
    wal_buffers = 8MB # pgtune wizard 2010-11-18
    checkpoint_segments = 16 # pgtune wizard 2010-11-18
    shared_buffers = 1920MB # pgtune wizard 2010-11-18
    max_connections = 80 # pgtune wizard 2010-11-18

I've tried stopping postgres and then deleting the 00000001000000A800000078.ready file and starting postgres but this appears to be recreated instantly and the error is still in the log file.

I've read about the pg_reset_xlog() command but with having to pg_dump our db with a large amount of blobs and restoring it again is highly problematic as the pg_restore has struggled to restore.

Will setting zero_damaged_pages (true) work in 9.0.1 and would this resolve the issue?

Would creating an empty file and replacing the offending xlog work, would this need to be a specific size?

Any ideas?

Re: Pg_xlog increase due to postgres crash (disk full)

From
Adrian Klaver
Date:
On 01/22/2013 03:12 AM, Cliff de Carteret wrote:
> My database crashed a couple of days ago during an upgrade several
> seconds after committing a large transaction to the database. Eventually
> we found out that this was due to the disk being full as the transaction
> had created several gigs of data. A day or so later the disk is full
> again and PostgreSQL crashes due to the pg_xlog file taking up all of
> the disk space. I have cleaned up the drive to have so extra space which
> allows PostgreSQL to start again but the xlogs are still increasing. I
> have two errors in my pg_log:
>
> "WARNING: transaction log file "00000001000000A800000078" could not be
> archived: too many failures" and
>
> "LOG: archive command failed with exit code 1
> DETAIL: The failed archive command was: test ! -f
> /opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078 && cp
> pg_xlog/00000001000000A800000078
> /opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078"

I am not seeing it below, but just to be complete is wal_keep_segments
set to something greater than 0?


Just to be clear which xlogs are filling up, the ones in the original
data directory or in the archive directory?



>
> Postgres version 9.0.3 conf:
>
>   *
>
>
>     wal_level = hot_standby
>     archive_mode = true
>     archive_command = 'test ! -f
>     /opt/postgres/remote_pgsql/wal_archive/%f && cp %p
>     /opt/postgres/remote_pgsql/wal_archive/%f' # command to use to
>     archive a logfile segment
>     archive_timeout = 1800
>     max_wal_senders = 1
>     max_standby_archive_delay = 900s
>     max_standby_streaming_delay = 900s
>     default_statistics_target = 50 # pgtune wizard 2010-11-18
>     maintenance_work_mem = 480MB # pgtune wizard 2010-11-18
>     constraint_exclusion = on # pgtune wizard 2010-11-18
>     checkpoint_completion_target = 0.9 # pgtune wizard 2010-11-18
>     effective_cache_size = 5632MB # pgtune wizard 2010-11-18
>     work_mem = 48MB # pgtune wizard 2010-11-18
>     wal_buffers = 8MB # pgtune wizard 2010-11-18
>     checkpoint_segments = 16 # pgtune wizard 2010-11-18
>     shared_buffers = 1920MB # pgtune wizard 2010-11-18
>     max_connections = 80 # pgtune wizard 2010-11-18
>
>

--
Adrian Klaver
adrian.klaver@gmail.com


Re: Pg_xlog increase due to postgres crash (disk full)

From
Cliff de Carteret
Date:
Hello, thanks for your response.

The xlogs that are filling up are in the original data directory and the wal_keep_segments is commented out on my master and slave postgresql.conf

Cliff


On 22 January 2013 14:48, Adrian Klaver <adrian.klaver@gmail.com> wrote:
On 01/22/2013 03:12 AM, Cliff de Carteret wrote:
My database crashed a couple of days ago during an upgrade several
seconds after committing a large transaction to the database. Eventually
we found out that this was due to the disk being full as the transaction
had created several gigs of data. A day or so later the disk is full
again and PostgreSQL crashes due to the pg_xlog file taking up all of
the disk space. I have cleaned up the drive to have so extra space which
allows PostgreSQL to start again but the xlogs are still increasing. I
have two errors in my pg_log:

"WARNING: transaction log file "00000001000000A800000078" could not be
archived: too many failures" and

"LOG: archive command failed with exit code 1
DETAIL: The failed archive command was: test ! -f
/opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078 && cp
pg_xlog/00000001000000A800000078
/opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078"

I am not seeing it below, but just to be complete is wal_keep_segments set to something greater than 0?


Just to be clear which xlogs are filling up, the ones in the original data directory or in the archive directory?




Postgres version 9.0.3 conf:


  *


    wal_level = hot_standby
    archive_mode = true
    archive_command = 'test ! -f
    /opt/postgres/remote_pgsql/wal_archive/%f && cp %p
    /opt/postgres/remote_pgsql/wal_archive/%f' # command to use to
    archive a logfile segment
    archive_timeout = 1800
    max_wal_senders = 1
    max_standby_archive_delay = 900s
    max_standby_streaming_delay = 900s
    default_statistics_target = 50 # pgtune wizard 2010-11-18
    maintenance_work_mem = 480MB # pgtune wizard 2010-11-18
    constraint_exclusion = on # pgtune wizard 2010-11-18
    checkpoint_completion_target = 0.9 # pgtune wizard 2010-11-18
    effective_cache_size = 5632MB # pgtune wizard 2010-11-18
    work_mem = 48MB # pgtune wizard 2010-11-18
    wal_buffers = 8MB # pgtune wizard 2010-11-18
    checkpoint_segments = 16 # pgtune wizard 2010-11-18
    shared_buffers = 1920MB # pgtune wizard 2010-11-18
    max_connections = 80 # pgtune wizard 2010-11-18



--
Adrian Klaver
adrian.klaver@gmail.com

Re: Pg_xlog increase due to postgres crash (disk full)

From
Adrian Klaver
Date:
On 01/22/2013 06:59 AM, Cliff de Carteret wrote:
> Hello, thanks for your response.
>
> The xlogs that are filling up are in the original data directory and the
> wal_keep_segments is commented out on my master and slave postgresql.conf

It is expected that there may a certain increase in WAL files.
  In particular: checkpoint_segments = 16

http://www.postgresql.org/docs/9.2/static/runtime-config-wal.html#GUC-CHECKPOINT-SEGMENTS
"
checkpoint_segments (integer)
Maximum number of log file segments between automatic WAL checkpoints
(each segment is normally 16 megabytes). The default is three segments.
Increasing this parameter can increase the amount of time needed for
crash recovery. This parameter can only be set in the postgresql.conf
file or on the server command line.
"

For more information that explains the above see:
http://www.postgresql.org/docs/9.2/static/wal-configuration.html

More questions:

Does it look like any WAL files in the pg_xlog directory are being recycled?

Re your archive error from your previous post, is your archive directory
full?

     Assuming the archive directory is remote, is it reachable?

>
> Cliff
>
>


--
Adrian Klaver
adrian.klaver@gmail.com


Re: Pg_xlog increase due to postgres crash (disk full)

From
Cliff de Carteret
Date:
There are no files in the pg_xlog directory from before the database crash except one file which ends in .backup, there are 759 files. In the pg_xlog/archive_status folder there are 755 files. 
The local ../wal_archive folder has no files in it at all whereas the remote wal_archive folder has 147 files present.

The remote archive folder is not full and has ~14G whereas the local archive folder was previously full but now it has been moved so it has 100G+

The archive directory is a mount and as the user postgres I can copy a file into it successfully.

The current setup has been working successfully for several years until the recent database crash


On 22 January 2013 15:34, Adrian Klaver <adrian.klaver@gmail.com> wrote:
On 01/22/2013 06:59 AM, Cliff de Carteret wrote:
Hello, thanks for your response.

The xlogs that are filling up are in the original data directory and the
wal_keep_segments is commented out on my master and slave postgresql.conf

It is expected that there may a certain increase in WAL files.
 In particular: checkpoint_segments = 16

http://www.postgresql.org/docs/9.2/static/runtime-config-wal.html#GUC-CHECKPOINT-SEGMENTS
"
checkpoint_segments (integer)
Maximum number of log file segments between automatic WAL checkpoints (each segment is normally 16 megabytes). The default is three segments. Increasing this parameter can increase the amount of time needed for crash recovery. This parameter can only be set in the postgresql.conf file or on the server command line.
"

For more information that explains the above see:
http://www.postgresql.org/docs/9.2/static/wal-configuration.html

More questions:

Does it look like any WAL files in the pg_xlog directory are being recycled?

Re your archive error from your previous post, is your archive directory full?

    Assuming the archive directory is remote, is it reachable?


Cliff




--
Adrian Klaver
adrian.klaver@gmail.com

Re: Pg_xlog increase due to postgres crash (disk full)

From
"Kevin Grittner"
Date:
Cliff de Carteret wrote:

> The current setup has been working successfully for several years
> until the recent database crash

What file does the server log say it is trying to archive? What
error are you getting? Does that filename already exist on the
archive (or some intermediate location used by the archive command
or script)?

-Kevin


Re: Pg_xlog increase due to postgres crash (disk full)

From
Adrian Klaver
Date:
On 01/22/2013 07:57 AM, Cliff de Carteret wrote:
> There are no files in the pg_xlog directory from before the database
> crash except one file which ends in .backup, there are 759 files. In the
> pg_xlog/archive_status folder there are 755 files.
> The local ../wal_archive folder has no files in it at all whereas the
> remote wal_archive folder has 147 files present.

What is this local wal_archive directory?
 From a previous post:

"LOG: archive command failed with exit code 1
DETAIL: The failed archive command was: test ! -f
/opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078 && cp
pg_xlog/00000001000000A800000078
/opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078

I see only /opt/postgres/remote_pgsql/wal_archive which I assumed was
the remote. I am not sure where the local one fits in?

>
> The remote archive folder is not full and has ~14G whereas the local
> archive folder was previously full but now it has been moved so it has 100G+
>
> The archive directory is a mount and as the user postgres I can copy a
> file into it successfully.
>
> The current setup has been working successfully for several years until
> the recent database crash

So what did you do to get the database running again?

>
>



--
Adrian Klaver
adrian.klaver@gmail.com


Re: Pg_xlog increase due to postgres crash (disk full)

From
Cliff de Carteret
Date:
The local wal_archive directory is a directory named "wal_archive" which is at the same file system level as the data directory. The time stamp states that it has not been used for several years so it looks to be redundant.

To get the database running again I moved an old postgres installation to a backups folder which gave me ~2G of space to start up again until I was able to mount a new disk and move the pg_xlog to a separate partition using a symbolic link.


On 22 January 2013 16:15, Adrian Klaver <adrian.klaver@gmail.com> wrote:
On 01/22/2013 07:57 AM, Cliff de Carteret wrote:
There are no files in the pg_xlog directory from before the database
crash except one file which ends in .backup, there are 759 files. In the
pg_xlog/archive_status folder there are 755 files.
The local ../wal_archive folder has no files in it at all whereas the
remote wal_archive folder has 147 files present.

What is this local wal_archive directory?
From a previous post:


"LOG: archive command failed with exit code 1
DETAIL: The failed archive command was: test ! -f /opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078 && cp pg_xlog/00000001000000A800000078 /opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078

I see only /opt/postgres/remote_pgsql/wal_archive which I assumed was the remote. I am not sure where the local one fits in?



The remote archive folder is not full and has ~14G whereas the local
archive folder was previously full but now it has been moved so it has 100G+

The archive directory is a mount and as the user postgres I can copy a
file into it successfully.

The current setup has been working successfully for several years until
the recent database crash

So what did you do to get the database running again?






--
Adrian Klaver
adrian.klaver@gmail.com

Re: Pg_xlog increase due to postgres crash (disk full)

From
"Kevin Grittner"
Date:
[Please keep the list copied, and put your reply in-line instead
of at the top.]

Cliff de Carteret wrote:
> On 22 January 2013 16:07, Kevin Grittner <kgrittn@mail.com> wrote:
>
>> Cliff de Carteret wrote:
>>
>>> The current setup has been working successfully for several years
>>> until the recent database crash
>>
>> What file does the server log say it is trying to archive? What
>> error are you getting? Does that filename already exist on the
>> archive (or some intermediate location used by the archive command
>> or script)?

> The sever log is (repeated constantly):
>
> LOG: archive command failed with exit code 1
> DETAIL: The failed archive command was: test ! -f
> /opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078 && cp
> pg_xlog/00000001000000A800000078
> /opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078
> WARNING: transaction log file "00000001000000A800000078" could not be
> archived: too many failures
>
> The file 00000001000000A800000078 exists in the remote archive's
> wal_archive directory. I read a post saying to copy the file over to the
> archive and then delete the .ready file to get postgres to move onto the
> next file but this ended up logging out saying that a log file was missing.
> There are more recent files in this directory but they end at the point
> where I reverted all of the changes I made last night when time was running
> out and the database had to be put back to a known state.

I would have deleted (or renamed) the copy in the archive
directory. Archiving should have then resumed and cleaned up the
pg_xlog directory.

-Kevin


Re: Pg_xlog increase due to postgres crash (disk full)

From
Cliff de Carteret
Date:
On 22 January 2013 16:43, Kevin Grittner <kgrittn@mail.com> wrote:
[Please keep the list copied, and put your reply in-line instead
of at the top.]

Cliff de Carteret wrote:
> On 22 January 2013 16:07, Kevin Grittner <kgrittn@mail.com> wrote:
>
>> Cliff de Carteret wrote:
>>
>>> The current setup has been working successfully for several years
>>> until the recent database crash
>>
>> What file does the server log say it is trying to archive? What
>> error are you getting? Does that filename already exist on the
>> archive (or some intermediate location used by the archive command
>> or script)?

> The sever log is (repeated constantly):
>
> LOG: archive command failed with exit code 1
> DETAIL: The failed archive command was: test ! -f
> /opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078 && cp
> pg_xlog/00000001000000A800000078
> /opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078
> WARNING: transaction log file "00000001000000A800000078" could not be
> archived: too many failures
>
> The file 00000001000000A800000078 exists in the remote archive's
> wal_archive directory. I read a post saying to copy the file over to the
> archive and then delete the .ready file to get postgres to move onto the
> next file but this ended up logging out saying that a log file was missing.
> There are more recent files in this directory but they end at the point
> where I reverted all of the changes I made last night when time was running
> out and the database had to be put back to a known state.

I would have deleted (or renamed) the copy in the archive
directory. Archiving should have then resumed and cleaned up the
pg_xlog directory.
I have now deleted the copy on the remote wal_archive folder and the archiving is now functioning and sending the logs from the local to the remote folder. The remote database does not startup and the following is in the log:

LOG:  database system was shut down in recovery at 2013-01-22 10:54:48 GMT
LOG:  entering standby mode
LOG:  restored log file "00000001000000AB00000051" from archive
LOG:  invalid resource manager ID in primary checkpoint record
PANIC:  could not locate a valid checkpoint record
LOG:  startup process (PID 22350) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure

00000001000000AB00000051 is in my remote database's pg_xlog folder

Thanks for your help already!

 

-Kevin

Re: Pg_xlog increase due to postgres crash (disk full)

From
"Kevin Grittner"
Date:
Cliff de Carteret wrote:

> I have now deleted the copy on the remote wal_archive folder and the
> archiving is now functioning and sending the logs from the local to the
> remote folder. The remote database does not startup and the following is in
> the log:
>
> LOG: database system was shut down in recovery at 2013-01-22 10:54:48 GMT
> LOG: entering standby mode
> LOG: restored log file "00000001000000AB00000051" from archive
> LOG: invalid resource manager ID in primary checkpoint record
> PANIC: could not locate a valid checkpoint record
> LOG: startup process (PID 22350) was terminated by signal 6: Aborted
> LOG: aborting startup due to startup process failure
>
> 00000001000000AB00000051 is in my remote database's pg_xlog folder

Any chance that there was a pg_start_backup() call on the master without
a matching pg_stop_backup() call?

-Kevin


Re: Pg_xlog increase due to postgres crash (disk full)

From
Cliff de Carteret
Date:
I had not run these commands on the master as I was only doing sql updates ~1mil of them