Thread: URGENT issue: pg-xlog growing on master!

URGENT issue: pg-xlog growing on master!

From
Niels Kristian Schjødt
Date:
Hi, My pg_xlog dir has been growing rapidly the last 4 days, and my disk is now almost full (1000Gb) even though the
databaseis only 50Gb. I have a streaming replication server running, and in the log of the slave it says: 

cp: cannot stat `/var/lib/postgresql/9.2/wals/0000000200000E1B000000A9': No such file or directory
cp: cannot stat `/var/lib/postgresql/9.2/wals/0000000200000E1B000000A9': No such file or directory
2013-06-10 11:21:45 GMT FATAL:  could not connect to the primary server: could not connect to server: No route to host
        Is the server running on host "192.168.0.4" and accepting
        TCP/IP connections on port 5432?

All the time.

I have tried to restart the server, but that didn't help. I checked the master, and the file
/var/lib/postgresql/9.2/wals/0000000200000E1B000000A9does not exist! I'm pretty lost here, can someone help me solve
thisand get my master server cleaned up. What is causing this, and what do I need to do? 

Kind regards

Re: URGENT issue: pg-xlog growing on master!

From
Dinesh Kumar
Date:

Hi, My pg_xlog dir has been growing rapidly the last 4 days, and my disk is now almost full (1000Gb) even though the database is only 50Gb. I have a streaming replication server running, and in the log of the slave it says:

cp: cannot stat `/var/lib/postgresql/9.2/wals/0000000200000E1B000000A9': No such file or directory
cp: cannot stat `/var/lib/postgresql/9.2/wals/0000000200000E1B000000A9': No such file or directory
2013-06-10 11:21:45 GMT FATAL:  could not connect to the primary server: could not connect to server: No route to host
                Is the server running on host "192.168.0.4" and accepting
                TCP/IP connections on port 5432?

All the time.

I have tried to restart the server, but that didn't help. I checked the master, and the file /var/lib/postgresql/9.2/wals/0000000200000E1B000000A9 does not exist! I'm pretty lost here, can someone help me solve this and get my master server cleaned up. What is causing this, and what do I need to do?


IIRC, this kind of situation we may expect, when the archive command was failed at master side. Could you verify, how many files "000000xxxxxxx.ready" reside under the master's pg_xlog/archive_status directory. And also, verify the master server's recent pg_log file, for finding the root cause of the master server down issue.


Dinesh

-- 
Dinesh Kumar
Software Engineer

Ph: +918087463317
Skype ID: dinesh.kumar432
www.enterprisedb.com

Follow us on Twitter

@EnterpriseDB 

Visit EnterpriseDB for tutorials, webinars, whitepapers and more

 

Re: URGENT issue: pg-xlog growing on master!

From
bricklen
Date:
On Mon, Jun 10, 2013 at 4:29 AM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:

2013-06-10 11:21:45 GMT FATAL:  could not connect to the primary server: could not connect to server: No route to host
                Is the server running on host "192.168.0.4" and accepting
                TCP/IP connections on port 5432?

Did anything get changed on the standby or master around the time this message started occurring?
On the master, what do the following show?
show port;
show listen_addresses;

The master's IP is still 192.168.0.4?

Have you tried connecting to the master using something like:
psql -h 192.168.0.4 -p 5432 -U postgres -d postgres
 
Does that throw a useful error or warning?

Re: URGENT issue: pg-xlog growing on master!

From
Niels Kristian Schjødt
Date:

Den 10/06/2013 kl. 16.36 skrev bricklen <bricklen@gmail.com>:

On Mon, Jun 10, 2013 at 4:29 AM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:

2013-06-10 11:21:45 GMT FATAL:  could not connect to the primary server: could not connect to server: No route to host
                Is the server running on host "192.168.0.4" and accepting
                TCP/IP connections on port 5432?

Did anything get changed on the standby or master around the time this message started occurring?
On the master, what do the following show?
show port;
show listen_addresses;

The master's IP is still 192.168.0.4?

Have you tried connecting to the master using something like:
psql -h 192.168.0.4 -p 5432 -U postgres -d postgres
 
Does that throw a useful error or warning?


It turned out that the switch port that the server was connected to was faulty, and hence no successful connection between master and slave was established. This resolved in pg_xlog building up very fast, because our system performs a lot of changes on the data we store. 

I ended up running pg_archivecleanup on the master to get some space freed urgently. Then I got the switch changed with a new one. Now I'm trying to the streaming replication setup from scratch again, but with no luck.

I can't seem to figure out which steps I need to do, to get the standby server wiped and get it started as a streaming replication again from scratch. I tried to follow the steps, from step 6, in here http://wiki.postgresql.org/wiki/Streaming_Replication but the process seems to fail when I reach the point where I try to do a psql -c "SELECT pg_stop_backup()". It just says:

NOTICE:  pg_stop_backup cleanup done, waiting for required WAL segments to be archived
WARNING:  pg_stop_backup still waiting for all required WAL segments to be archived (60 seconds elapsed)
HINT:  Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.
WARNING:  pg_stop_backup still waiting for all required WAL segments to be archived (120 seconds elapsed)
HINT:  Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.
WARNING:  pg_stop_backup still waiting for all required WAL segments to be archived (240 seconds elapsed)
HINT:  Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.
WARNING:  pg_stop_backup still waiting for all required WAL segments to be archived (480 seconds elapsed)
HINT:  Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.
WARNING:  pg_stop_backup still waiting for all required WAL segments to be archived (960 seconds elapsed)
HINT:  Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.
WARNING:  pg_stop_backup still waiting for all required WAL segments to be archived (1920 seconds elapsed)
HINT:  Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.

When looking at ps aux on the master, I see the following:

postgres 30930  0.0  0.0  98412  1632 ?        Ss   15:59   0:02 postgres: archiver process   failed on 0000000200000E1B000000A9

The file mentioned is the one that it was about to archive, when the standby server failed. Somehow it must still be trying to "catch up" from that file which of cause isn't there any more, since I had to remove those in order to get more space on the HDD. Instead of trying to catch up from the last succeeded file, I want it to start over from scratch with the replication - I just don't know how.



Re: URGENT issue: pg-xlog growing on master!

From
bricklen
Date:

On Mon, Jun 10, 2013 at 8:35 AM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:
I can't seem to figure out which steps I need to do, to get the standby server wiped and get it started as a streaming replication again from scratch. I tried to follow the steps, from step 6, in here http://wiki.postgresql.org/wiki/Streaming_Replication but the process seems to fail when I reach the point where I try to do a psql -c "SELECT pg_stop_backup()". It just says:



If you use pg_basebackup you don't need to manually put the master into backup mode.
Be aware that if you are generating a lot of WAL segments and your filesystem backup is large (and takes a while to ship to the slave), you will need to set "wal_keep_segments" quite high on the master to prevent the segments from disappearing during the setup of the slave -- or at least that's the case when you use "--xlog-method=stream".

Re: URGENT issue: pg-xlog growing on master!

From
Niels Kristian Schjødt
Date:

Den 10/06/2013 kl. 17.51 skrev bricklen <bricklen@gmail.com>:


On Mon, Jun 10, 2013 at 8:35 AM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:
I can't seem to figure out which steps I need to do, to get the standby server wiped and get it started as a streaming replication again from scratch. I tried to follow the steps, from step 6, in here http://wiki.postgresql.org/wiki/Streaming_Replication but the process seems to fail when I reach the point where I try to do a psql -c "SELECT pg_stop_backup()". It just says:



If you use pg_basebackup you don't need to manually put the master into backup mode.
Be aware that if you are generating a lot of WAL segments and your filesystem backup is large (and takes a while to ship to the slave), you will need to set "wal_keep_segments" quite high on the master to prevent the segments from disappearing during the setup of the slave -- or at least that's the case when you use "--xlog-method=stream".


Okay thanks,
I did the base backup, and I ran the rsync command and it succeeded. However then I try to do pg_stop_backup() it just "hangs" and I have a feeling, that it's rather because of some information mismatch than actual loading time, since nothing is transferred to the slave and I keep on seeing that "postgres 30930  0.0  0.0  98412  1632 ?        Ss   15:59   0:02 postgres: archiver process   failed on 0000000200000E1B000000A9" in the process overview, and I know that exactly that file was the one it has been trying to sync ever since the connection dropped. I saw something in here http://postgresql.1045698.n5.nabble.com/safe-to-clear-pg-xlog-archive-status-directory-td5738029.html, about wiping the pg_xlog/archive_status directly in order to "reset" the sync between the servers before running the pg_backup_start(), but I'm unsure if it's right, and when I would do it…

Re: URGENT issue: pg-xlog growing on master!

From
bricklen
Date:

On Mon, Jun 10, 2013 at 8:51 AM, bricklen <bricklen@gmail.com> wrote:

On Mon, Jun 10, 2013 at 8:35 AM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:
I can't seem to figure out which steps I need to do, to get the standby server wiped and get it started as a streaming replication again from scratch. I tried to follow the steps, from step 6, in here http://wiki.postgresql.org/wiki/Streaming_Replication but the process seems to fail when I reach the point where I try to do a psql -c "SELECT pg_stop_backup()". It just says:


If you use pg_basebackup you don't need to manually put the master into backup mode.
Be aware that if you are generating a lot of WAL segments and your filesystem backup is large (and takes a while to ship to the slave), you will need to set "wal_keep_segments" quite high on the master to prevent the segments from disappearing during the setup of the slave -- or at least that's the case when you use "--xlog-method=stream".



For what its worth, I took some notes when I set up Streaming Replication the other day and the process worked for me. There might have been some tweaks here and there that I negelected to write down, but the gist of the steps are below.

If anyone has any corrections, please chime in!


##On the hot standby, create the staging directory to hold the master's log files
mkdir /pgdata/WAL_Archive
chown postgres:postgres /pgdata/WAL_Archive


# master, $PGDATA/postgresql.conf
wal_level = hot_standby
archive_mode = on
## /pgdata/WAL_Archive is a staging directory on the slave, outside of $PGDATA
archive_command = 'rsync -W -a %p postgres@SLAVE_IP_HERE:/pgdata/WAL_Archive/'
max_wal_senders = 3
wal_keep_segments = 10000   # if you have the room, to help the pg_basebackup
                            # not fail due to the WAL segment getting removed from the master.


## Modify the master $PGDATA/pg_hba.conf and enable the replication lines for the IPs of the slaves.
## Issue "pg_ctl reload" on the master after the changes have been made.
# TYPE  DATABASE        USER            ADDRESS                 METHOD
hostssl    replication     replication     SLAVE_IP_HERE/32       md5



## On the hot standby, $PGDATA/postgresql.conf
hot_standby = on #off           # "on" allows queries during recovery
max_standby_archive_delay = 15min # max delay before canceling queries, set to hours if backups will be taken from here
max_standby_streaming_delay = 15min # max delay before canceling queries
hot_standby_feedback = on #off



## On the master, create the replication role, which will be replicated to the slave via pg_basebackup
psql -d postgres -c "CREATE USER replication WITH replication ENCRYPTED PASSWORD 'CHANGEME' LOGIN"


## Restart the master, to pick up the changes to postgresql.conf


## On the slave, from $HOME, issue the pg_basebackup command to start setting up the hot standby from the master
## --host=IP_OF_MASTER      -> The master's IP
## --pgdata=$PGDATA         -> The slave's $PGDATA directory
## -- xlog-method=stream    -> Opens a second connection to the master to stream the WAL segments rather than pulling them all at the end
## --password will prompt for the replication role's password

## Without compression, "stream" gets the changes via the same method as Streaming Replication
time pg_basebackup --pgdata=$PGDATA --host=IP_OF_MASTER --port=5432 --username=replication --password --xlog-method=stream --format=plain --progress --verbose

-- Alternate version with compression
#time pg_basebackup --pgdata=$PGDATA --host=IP_OF_MASTER --port=5432 --username=replication --password --xlog --gzip --format=tar --progress --verbose




##On the standby, create $PGDATA/recovery.conf:
standby_mode = on

## To promote the slave to a live database, issue "touch /tmp/promote_db"
trigger_file = '/tmp/promote_db'

## Host can be the master's IP or hostname
primary_conninfo = 'host=IP_OF_MASTER port=5432 user=replication password=CHANGEME'

## Log the standby WAL segments applied to a standby.log file
## TODO: Add the standby.log to a log rotator
restore_command = 'cp /pgdata/WAL_Archive/%f "%p" 2>>/pgdata/9.2/data/pg_log/standby.log'

## XXX: If there are multiple slaves, do not use pg_archivecleanup (WAL segments could be removed before being applied to other slaves)
archive_cleanup_command = '/usr/pgsql-9.2/bin/pg_archivecleanup /pgdata/WAL_Archive %r'

## On hot standby clusters, set to 'latest' to switch to the newest timeline in the archive
recovery_target_timeline = 'latest'
 

Re: URGENT issue: pg-xlog growing on master!

From
Jeff Janes
Date:
On Mon, Jun 10, 2013 at 8:35 AM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:

WARNING:  pg_stop_backup still waiting for all required WAL segments to be archived (1920 seconds elapsed)
HINT:  Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.

When looking at ps aux on the master, I see the following:

postgres 30930  0.0  0.0  98412  1632 ?        Ss   15:59   0:02 postgres: archiver process   failed on 0000000200000E1B000000A9

The file mentioned is the one that it was about to archive, when the standby server failed. Somehow it must still be trying to "catch up" from that file which of cause isn't there any more, since I had to remove those in order to get more space on the HDD.

So the archive_command is failing because it is trying to archive a file that no longer exists.

One way around this is to remove the .ready files from the pg_xlog/archive_status directory, which correspond to the WAL files you manually removed.  

Another way would be to temporarily replace the archive_command with one that will report success even when the archiving fails, until the archiver gets paste this stretch.  In fact you could just replace the command with 'true', so it reports success without even doing anything.

Cheers,

Jeff

Re: URGENT issue: pg-xlog growing on master!

From
Matheus de Oliveira
Date:



On Mon, Jun 10, 2013 at 12:35 PM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:

Den 10/06/2013 kl. 16.36 skrev bricklen <bricklen@gmail.com>:

On Mon, Jun 10, 2013 at 4:29 AM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:

2013-06-10 11:21:45 GMT FATAL:  could not connect to the primary server: could not connect to server: No route to host
                Is the server running on host "192.168.0.4" and accepting
                TCP/IP connections on port 5432?

Did anything get changed on the standby or master around the time this message started occurring?
On the master, what do the following show?
show port;
show listen_addresses;

The master's IP is still 192.168.0.4?

Have you tried connecting to the master using something like:
psql -h 192.168.0.4 -p 5432 -U postgres -d postgres
 
Does that throw a useful error or warning?


It turned out that the switch port that the server was connected to was faulty, and hence no successful connection between master and slave was established. This resolved in pg_xlog building up very fast, because our system performs a lot of changes on the data we store. 

I ended up running pg_archivecleanup on the master to get some space freed urgently. Then I got the switch changed with a new one. Now I'm trying to the streaming replication setup from scratch again, but with no luck.

I can't seem to figure out which steps I need to do, to get the standby server wiped and get it started as a streaming replication again from scratch. I tried to follow the steps, from step 6, in here http://wiki.postgresql.org/wiki/Streaming_Replication but the process seems to fail when I reach the point where I try to do a psql -c "SELECT pg_stop_backup()". It just says:

NOTICE:  pg_stop_backup cleanup done, waiting for required WAL segments to be archived
WARNING:  pg_stop_backup still waiting for all required WAL segments to be archived (60 seconds elapsed)
HINT:  Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.
(...)

When looking at ps aux on the master, I see the following:

postgres 30930  0.0  0.0  98412  1632 ?        Ss   15:59   0:02 postgres: archiver process   failed on 0000000200000E1B000000A9

The file mentioned is the one that it was about to archive, when the standby server failed. Somehow it must still be trying to "catch up" from that file which of cause isn't there any more, since I had to remove those in order to get more space on the HDD. Instead of trying to catch up from the last succeeded file, I want it to start over from scratch with the replication - I just don't know how.


That is because you manually removed some xlog, and you shouldn't ever do that. To "cancel" the archiving, the better way (IMHO) is to set archive_command to a dummy command, like:

    archive_command = '/bin/true'

And reload PostgreSQL:

    psql -c "SELECT pg_reload_conf()"

With that, PostgreSQL will stop archiving, and so you'll **be with no backup at all**. With some archives removed, you can use your old archive_command again and reload the server.

BTW, check why the archive_command is not working properly (look at PG's log files). Is it because of no space left on disk? If so, removing some may work.

Regards,
--
Matheus de Oliveira
Analista de Banco de Dados
Dextra Sistemas - MPS.Br nível F!
www.dextra.com.br/postgres

Re: URGENT issue: pg-xlog growing on master!

From
Niels Kristian Schjødt
Date:
Okay, cool

You mean that I should do the following right?:

1. Stop slave server
2. set archive_command = 'true' in postgresql.conf on the master server
3. restart master server
4. run psql -c "SELECT pg_start_backup('label', true)" on master
5. run rsync -av --exclude postmaster.pid --exclude pg_xlog /var/lib/postgresql/9.2/main/ postgres@192.168.0.2:/var/lib/postgresql/9.2/main/" on master server
6. run psql -c "SELECT pg_stop_backup();" on master server
7. change archive_command back on master
8. restart master
9. start slave

Just to confirm the approach :-)



Den 10/06/2013 kl. 19.53 skrev Jeff Janes <jeff.janes@gmail.com>:

On Mon, Jun 10, 2013 at 8:35 AM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:

WARNING:  pg_stop_backup still waiting for all required WAL segments to be archived (1920 seconds elapsed)
HINT:  Check that your archive_command is executing properly.  pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.

When looking at ps aux on the master, I see the following:

postgres 30930  0.0  0.0  98412  1632 ?        Ss   15:59   0:02 postgres: archiver process   failed on 0000000200000E1B000000A9

The file mentioned is the one that it was about to archive, when the standby server failed. Somehow it must still be trying to "catch up" from that file which of cause isn't there any more, since I had to remove those in order to get more space on the HDD.

So the archive_command is failing because it is trying to archive a file that no longer exists.

One way around this is to remove the .ready files from the pg_xlog/archive_status directory, which correspond to the WAL files you manually removed.  

Another way would be to temporarily replace the archive_command with one that will report success even when the archiving fails, until the archiver gets paste this stretch.  In fact you could just replace the command with 'true', so it reports success without even doing anything.

Cheers,

Jeff

Re: URGENT issue: pg-xlog growing on master!

From
Jeff Janes
Date:
On Mon, Jun 10, 2013 at 11:02 AM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:
Okay, cool

You mean that I should do the following right?:

1. Stop slave server


At this point, you don't have a slave server.  Not a usable one, anyway.  If you used to have a hot-standby server, it is now simply a historical reporting server.  If you have no need/use for such a reporting server, then yes you should stop it, to avoid confusion.

 
2. set archive_command = 'true' in postgresql.conf on the master server
3. restart master server

You can simply do a reload rather than a full restart.
 
4. run psql -c "SELECT pg_start_backup('label', true)" on master

No, you shouldn't do that yet without first having correctly functioning archiving back in place.  After setting archive_command=true and reloading the server, you have to wait a while for the "bad" WAL files to get pseudo-archived and cleared from the system.  Once that has happened, you can then return archive_command to its previous setting, and again reload/restart the server.  Only at that point should you begin taking the new backup.  In other words, steps 7 and 8 have to be moved up to before step 4.
 
5. run rsync -av --exclude postmaster.pid --exclude pg_xlog /var/lib/postgresql/9.2/main/ postgres@192.168.0.2:/var/lib/postgresql/9.2/main/" on master server
6. run psql -c "SELECT pg_stop_backup();" on master server
7. change archive_command back on master
8. restart master
9. start slave

Just to confirm the approach :-)


Cheers,

Jeff


Re: URGENT issue: pg-xlog growing on master!

From
Niels Kristian Schjødt
Date:
Thanks,

> No, you shouldn't do that yet without first having correctly functioning archiving back in place.  After setting
archive_command=trueand reloading the server, you have to wait a while for the "bad" WAL files to get pseudo-archived
andcleared from the system. 

How do I know when this is done?



Re: URGENT issue: pg-xlog growing on master!

From
Niels Kristian Schjødt
Date:
Solved it - thanks!

Den 10/06/2013 kl. 20.24 skrev Jeff Janes <jeff.janes@gmail.com>:

On Mon, Jun 10, 2013 at 11:02 AM, Niels Kristian Schjødt <nielskristian@autouncle.com> wrote:
Okay, cool

You mean that I should do the following right?:

1. Stop slave server


At this point, you don't have a slave server.  Not a usable one, anyway.  If you used to have a hot-standby server, it is now simply a historical reporting server.  If you have no need/use for such a reporting server, then yes you should stop it, to avoid confusion.

 
2. set archive_command = 'true' in postgresql.conf on the master server
3. restart master server

You can simply do a reload rather than a full restart.
 
4. run psql -c "SELECT pg_start_backup('label', true)" on master

No, you shouldn't do that yet without first having correctly functioning archiving back in place.  After setting archive_command=true and reloading the server, you have to wait a while for the "bad" WAL files to get pseudo-archived and cleared from the system.  Once that has happened, you can then return archive_command to its previous setting, and again reload/restart the server.  Only at that point should you begin taking the new backup.  In other words, steps 7 and 8 have to be moved up to before step 4.
 
5. run rsync -av --exclude postmaster.pid --exclude pg_xlog /var/lib/postgresql/9.2/main/ postgres@192.168.0.2:/var/lib/postgresql/9.2/main/" on master server
6. run psql -c "SELECT pg_stop_backup();" on master server
7. change archive_command back on master
8. restart master
9. start slave

Just to confirm the approach :-)


Cheers,

Jeff