Thread: wal segment failed

From: Lucas Possamai

Failed to archive WAL segment `pg_xlog/00000002000011E800000012` on host `localhost:30022`


Guys, I'm getting this error on my master server, and I believe it is causing a spike


archive command:

archive_command = 'exec nice -n 19 ionice -c 2 -n 7 ../../bin/archive_command.ssh_to_slaves.bash "%p" prod-db-01 prod-db-02 localhost:30022'



Any idea why?

Re: wal segment failed

From: "David G. Johnston"
On Tue, May 17, 2016 at 8:03 PM, Lucas Possamai <drum.lucas@gmail.com> wrote:

Failed to archive WAL segment `pg_xlog/00000002000011E800000012` on host `localhost:30022`


Guys, I'm getting this error on my master server, and I believe it is causing a spike


archive command:

archive_command = 'exec nice -n 19 ionice -c 2 -n 7 ../../bin/archive_command.ssh_to_slaves.bash "%p" prod-db-01 prod-db-02 localhost:30022'



Any idea why?

Since you haven't shown the relevant error output from your script, the following will hopefully point you in the right direction.
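One way to surface those errors is to run the archive script by hand, the same way the server would - a sketch, assuming the relative paths from your archive_command (the server invokes it with the data directory as the working directory) and the segment name from the error message; /path/to/data below is a placeholder:

# Run the script manually as the database OS user, with shell tracing,
# from the data directory, so the relative ../../bin path resolves.
cd /path/to/data
bash -x ../../bin/archive_command.ssh_to_slaves.bash \
    "pg_xlog/00000002000011E800000012" prod-db-01 prod-db-02 localhost:30022
echo "exit status: $?"

A non-zero exit status here is exactly what makes the server log the failure and keep retrying the same segment.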


General info on archive_command usage:


David J.

Re: wal segment failed

From: Lucas Possamai
Yep, so..

1 master and 2 slaves.
All of those servers are working.

The only error I get is this one:
Failed to archive WAL segment `pg_xlog/00000002000011E800000012` on host `localhost:30022`

I'm having spikes that cause an outage every 15 minutes, and I believe those spikes are caused by the error above.

The server was rebooted, and one parameter in postgresql.conf was changed: shared_buffers.

I don't believe that change is the cause, though.
Before the reboot, everything was working.

I just can't find the solution.

What I did:

1 - I can connect via the postgres user between all the servers
2 - the file 00000002000011E800000012 is in the master's pg_xlog (it was already there)
3 - the file 00000002000011E800000012 is in the slaves' /9.2/data/wal_archive (it was already there)
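One quick sanity check for item 1, under the same conditions the archive script uses (BatchMode, the non-standard port; the user and host here are assumptions based on the thread):

# Non-interactive SSH test; any password prompt or host-key problem shows
# up as a non-zero exit status, just as it would inside the archive script.
ssh -p 30022 -o 'BatchMode=yes' postgres@localhost true; echo "exit status: $?"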




Re: wal segment failed

From: "David G. Johnston"
On Tue, May 17, 2016 at 8:32 PM, Lucas Possamai <drum.lucas@gmail.com> wrote:
Yep, so..

1 master and 2 slaves.
All of those servers are working.

The only error I get is this one:
Failed to archive WAL segment `pg_xlog/00000002000011E800000012` on host `localhost:30022`

I'm having spikes that cause an outage every 15 minutes, and I believe those spikes are caused by the error above.

The server was rebooted, and one parameter in postgresql.conf was changed: shared_buffers.

I don't believe that change is the cause, though.
Before the reboot, everything was working.

I just can't find the solution.

What I did:

1 - I can connect via the postgres user between all the servers
2 - the file 00000002000011E800000012 is in the master's pg_xlog (it was already there)
3 - the file 00000002000011E800000012 is in the slaves' /9.2/data/wal_archive (it was already there)



So the thought that comes to my mind - taking the above at face value - is that the archive_command is failing because it wants to archive said WAL segment, but when it goes to do so it finds that the segment already exists in the target location. It correctly fails rather than risk corrupting the remote file, and because of the error it likewise will not remove the segment on the master.

If you are certain, or can become certain, that the remote files are identical to the one on the master, manually removing the WAL segment on the master would seem to resolve the deadlock. I am not recommending that you do this, but it is an option to consider. There are still too many unknowns - and too little experience on my part - for me to recommend anything definitive.
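For "can become certain", comparing checksums is one option - a sketch, with the master path relative to the data directory and a placeholder for the slaves' wal_archive location:

# On the master, from the data directory: checksum the stuck segment.
md5sum pg_xlog/00000002000011E800000012
# On each slave (adjust the path to the real wal_archive directory);
# the sums must match exactly before considering any manual cleanup.
ssh -p 30022 postgres@localhost md5sum /path/to/9.2/data/wal_archive/00000002000011E800000012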

David J.





Re: wal segment failed

From: "David G. Johnston"
On Tue, May 17, 2016 at 8:42 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:



So the thought that comes to my mind - taking the above at face value - is that the archive_command is failing because it wants to archive said WAL segment, but when it goes to do so it finds that the segment already exists in the target location. It correctly fails rather than risk corrupting the remote file, and because of the error it likewise will not remove the segment on the master.

If you are certain, or can become certain, that the remote files are identical to the one on the master, manually removing the WAL segment on the master would seem to resolve the deadlock. I am not recommending that you do this, but it is an option to consider. There are still too many unknowns - and too little experience on my part - for me to recommend anything definitive.


Actually, strike that... the system knows which segment it is trying to archive, so simply removing it likely won't work out well; i.e., it probably won't just move on to the next file in the directory. I'm not positive either way.
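A way to see exactly what the archiver is waiting on - a sketch, assuming a 9.2-style data directory on the master:

# The server marks segments pending archival with .ready files and renames
# them to .done once archive_command exits 0, so the same segment is retried
# until it succeeds rather than being skipped. Run from the data directory:
ls pg_xlog/archive_status/
# e.g. 00000002000011E800000012.ready   <- still pending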

David J.

Re: wal segment failed

From: Lucas Possamai
Even copying all the WAL files from the master to the slaves, and then restarting Postgres, did not work =\

Re: wal segment failed

From: Alvaro Herrera
Lucas Possamai wrote:

> archive command:
>
> archive_command = 'exec nice -n 19 ionice -c 2 -n 7
> ../../bin/archive_command.ssh_to_slaves.bash "%p" prod-db-01 prod-db-02
> localhost:30022'

So what is in ../../bin/archive_command.ssh_to_slaves.bash?  Why are
you using "exec"?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: wal segment failed

From: Lucas Possamai


On 18 May 2016 at 13:29, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Lucas Possamai wrote:

> archive command:
>
> archive_command = 'exec nice -n 19 ionice -c 2 -n 7
> ../../bin/archive_command.ssh_to_slaves.bash "%p" prod-db-01 prod-db-02
> localhost:30022'

So what is in ../../bin/archive_command.ssh_to_slaves.bash?  Why are
you using "exec"?


We send all the WAL files through that script:


# we open the lock file from here
exec {WAL_LOCK_FD}>"${WAL_SEGMENT}.ac_lock" || exit 4;
if ! flock -n ${WAL_LOCK_FD}; then
    printf 'Cannot acquire lock for WAL segment `%s`. Aborting\n' "${WAL_SEGMENT}" 1>&2;
    exit 4;
fi;

# time to connect and send the wal segment to all hosts. We count the failed transfers
TRANSFER_ERRORS=0;
ARORD=0; # see above
for NEXT_PAIR in "${@}"; do
    if [ $((ARORD++)) -gt 0 ]; then
        # split an optional "host:port" pair; default to SSH port 22
        NEXT_HOST="${NEXT_PAIR%:*}";
        if [[ "${NEXT_PAIR}" =~ : ]]; then NEXT_PORT=${NEXT_PAIR#*:}; else NEXT_PORT=22; fi;
        # we use tar over SSH as I don't fully trust scp's exit status. The added benefit is that tar preserves all attributes;
        # the downside is that it's a little tricky to make the remote path relative
        #printf 'Attempting to archive WAL segment `%s` on host `%s`\n' "${WAL_SEGMENT}" "${NEXT_PAIR}" 1>&2;
        IFS=':';
        set +e;
        tar -c -O --no-same-owner -C "${WAL_SEGMENT%/*}" "${WAL_SEGMENT##*/}" | ssh -p ${NEXT_PORT} -C -o 'BatchMode=yes' -o 'CompressionLevel=3' "${USER}@${NEXT_HOST}" "exec tar -x --no-same-owner --overwrite -C '${WAL_ARCHIVE_PATH}'";
        PS_CONCAT="${PIPESTATUS[*]}"; # exit statuses of tar and ssh, joined with ':'
        set -e;
        IFS="${DEFAULT_IFS}";
        if [ "${PS_CONCAT}" == '0:0' ]; then
            printf 'WAL segment `%s` successfully archived on host `%s`\n' "${WAL_SEGMENT}" "${NEXT_PAIR}" 1>&2;
        else
            : $((TRANSFER_ERRORS++));
            printf 'Failed to archive WAL segment `%s` on host `%s`\n' "${WAL_SEGMENT}" "${NEXT_PAIR}" 1>&2;
        fi;
    fi;
done;
flock -u ${WAL_LOCK_FD};
exec {WAL_LOCK_FD}<&-;
rm "${WAL_SEGMENT}.ac_lock";
# the server treats any non-zero exit status as an archiving failure and retries
if [ ${TRANSFER_ERRORS} -eq 0 ]; then
    exit 0;
else
    exit 4;
fi;
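For comparison, the example archive_command in the PostgreSQL documentation for a local archive deliberately fails when the target file already exists, rather than overwriting it:

archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'

The script above takes the opposite approach - the remote tar runs with --overwrite - trading that safety for retries that can still succeed after a partial transfer.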