Thread: WAL segment failed
Failed to archive WAL segment `pg_xlog/00000002000011E800000012` on host `localhost:30022`
Guys, I'm getting this error on my master server, and I believe it is causing spikes.
archive command:
archive_command = 'exec nice -n 19 ionice -c 2 -n 7 ../../bin/archive_command.ssh_to_slaves.bash "%p" prod-db-01 prod-db-02 localhost:30022'
Any idea why?
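(That line is the script's own message; the server log normally carries a companion entry with the archiver's detail. A quick way to pull it, assuming the common setup where logs land under pg_log in the data directory:

  grep 'archive command failed' "$PGDATA"/pg_log/*.log

The log line includes the command's exit status, which narrows things down.)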
For whatever reason you haven't shown the logs of the relevant errors from your script; the following will hopefully point you in the right direction.
General info on archive_command usage:
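(For reference, the stock example from the PostgreSQL documentation, with a placeholder archive directory, is simply:

  archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'

The `test ! -f` guard is what makes it refuse to overwrite an already-archived segment.)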
David J.
Yep, so:
1 master and 2 slaves.
All of those servers are working.
The only error I got is this one:
Failed to archive WAL segment `pg_xlog/00000002000011E800000012` on host `localhost:30022`
I'm having spikes that cause an outage every 15 minutes. I believe the cause of those spikes is the error above.
The server was rebooted and a parameter in postgresql.conf was changed:
shared_buffers.
But I don't believe that change is the cause.
Before the server was rebooted, everything was working.
I just can't find the solution.
What I did:
1 - I can connect via the postgres user between all the servers
2 - the file 00000002000011E800000012 is in the master's /pg_xlog (it was already there)
3 - the file 00000002000011E800000012 is in the slaves' /9.2/data/wal_archive (it was already there)
So the thought that comes to my mind - taking the above at face value - is that the archive_command is failing because it wants to archive said WAL segment but, when it goes to do so, finds that the segment already exists in the target location. It correctly fails rather than potentially corrupt the remote file, and because of the error it likewise will not remove the master's copy of the segment.
If you are certain, or can become certain, that the remote files are identical to the one on the server, it would seem that manually removing the WAL segment on the master would resolve the deadlock. I am not recommending that you do this, but it is an option to consider. There are too many unknowns still present - and too much inexperience on my part - for me to recommend anything definitive.
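(A sketch of one way to become certain, using the paths mentioned in this thread - the segment in the master's pg_xlog and the slaves' 9.2/data/wal_archive - and assuming the postgres user can ssh non-interactively, as the script already does:

  SEG=00000002000011E800000012
  md5sum "$PGDATA/pg_xlog/$SEG"
  ssh postgres@prod-db-01 "md5sum 9.2/data/wal_archive/$SEG"
  ssh postgres@prod-db-02 "md5sum 9.2/data/wal_archive/$SEG"
  ssh -p 30022 postgres@localhost "md5sum 9.2/data/wal_archive/$SEG"

If all four checksums match, the remote copies are identical to the master's.)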
David J.
Actually, strike that... the system knows which segment it is trying to archive, so simply removing it likely won't work out well; i.e., it probably won't just move on to the next file in the directory. I'm not positive either way.
David J.
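(Some background on why: the archiver is driven by marker files under pg_xlog/archive_status - a segment awaiting archival has a .ready file, which the server renames to .done once archive_command exits 0. So it keeps retrying the oldest .ready segment rather than moving on; a quick look, assuming a standard 9.2 layout:

  ls "$PGDATA"/pg_xlog/archive_status/ | head
  # a line like 00000002000011E800000012.ready marks the segment it keeps retrying

)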
Even copying all the WAL files from the master to the slaves
and restarting Postgres
did not work =\
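(One way to surface the script's real error, rather than just its summary line, is to run exactly what the server runs, by hand, as the postgres user - archive_command executes with the data directory as its working directory, hence the relative ../../bin path:

  cd "$PGDATA"
  ../../bin/archive_command.ssh_to_slaves.bash pg_xlog/00000002000011E800000012 prod-db-01 prod-db-02 localhost:30022
  echo "exit status: $?"

Whichever tar/ssh hop fails should then print its own error on stderr just ahead of the 'Failed to archive' line.)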
Lucas Possamai wrote:
> archive command:
>
> archive_command = 'exec nice -n 19 ionice -c 2 -n 7
> ../../bin/archive_command.ssh_to_slaves.bash "%p" prod-db-01 prod-db-02
> localhost:30022'

So what is in ../../bin/archive_command.ssh_to_slaves.bash? Why are
you using "exec"?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 18 May 2016 at 13:29, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Lucas Possamai wrote:
> archive command:
>
> archive_command = 'exec nice -n 19 ionice -c 2 -n 7
> ../../bin/archive_command.ssh_to_slaves.bash "%p" prod-db-01 prod-db-02
> localhost:30022'
So what is in ../../bin/archive_command.ssh_to_slaves.bash? Why are
you using "exec"?
We send all the WAL files through that script:
# we open the lock file here
exec {WAL_LOCK_FD}>"${WAL_SEGMENT}.ac_lock" || exit 4;
if ! flock -n ${WAL_LOCK_FD}; then
    printf 'Cannot acquire lock for WAL segment `%s`. Aborting\n' "${WAL_SEGMENT}" 1>&2;
    exit 4;
fi;

# time to connect and send the WAL segment to all hosts. We count the failed transfers
TRANSFER_ERRORS=0;
ARORD=0; # see above; used to skip the first argument (the WAL segment path), leaving only the host[:port] pairs
for NEXT_PAIR in "${@}"; do
    if [ $((ARORD++)) -gt 0 ]; then
        NEXT_HOST="${NEXT_PAIR%:*}";
        if [[ "${NEXT_PAIR}" =~ : ]]; then NEXT_PORT=${NEXT_PAIR#*:}; else NEXT_PORT=22; fi;
        # we use tar over SSH as I don't fully trust scp's exit status. The added benefit is that tar preserves all attributes;
        # the downside is that it's a little tricky to make the remote path relative
        #printf 'Attempting to archive WAL segment `%s` on host `%s`\n' "${WAL_SEGMENT}" "${NEXT_PAIR}" 1>&2;
        IFS=':'; # so that "${PIPESTATUS[*]}" below joins the two pipeline exit codes as e.g. "0:0"
        set +e;
        tar -c -O --no-same-owner -C "${WAL_SEGMENT%/*}" "${WAL_SEGMENT##*/}" | ssh -p ${NEXT_PORT} -C -o 'BatchMode=yes' -o 'CompressionLevel=3' "${USER}@${NEXT_HOST}" "exec tar -x --no-same-owner --overwrite -C '${WAL_ARCHIVE_PATH}'";
        PS_CONCAT="${PIPESTATUS[*]}";
        set -e;
        IFS="${DEFAULT_IFS}";
        if [ "${PS_CONCAT}" == '0:0' ]; then
            printf 'WAL segment `%s` successfully archived on host `%s`\n' "${WAL_SEGMENT}" "${NEXT_PAIR}" 1>&2;
        else
            : $((TRANSFER_ERRORS++));
            printf 'Failed to archive WAL segment `%s` on host `%s`\n' "${WAL_SEGMENT}" "${NEXT_PAIR}" 1>&2;
        fi;
    fi;
done;

flock -u ${WAL_LOCK_FD};
exec {WAL_LOCK_FD}<&-;
rm "${WAL_SEGMENT}.ac_lock";

if [ ${TRANSFER_ERRORS} -eq 0 ]; then
    exit 0;
else
    exit 4;
fi;
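(For completeness: the server substitutes %p with the segment's path relative to the data directory before handing the command to a shell, so the failing invocation is roughly:

  exec nice -n 19 ionice -c 2 -n 7 ../../bin/archive_command.ssh_to_slaves.bash pg_xlog/00000002000011E800000012 prod-db-01 prod-db-02 localhost:30022

As for Alvaro's question about exec: it makes the spawned shell replace itself with the nice process instead of forking a child, saving one process per archived segment; it does not change the exit status the server sees.)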