Thread: Fwd: Re: Wal archive way behind in streaming replication

Fwd: Re: Wal archive way behind in streaming replication

From

John Scalia

Date:

24 June 2014, 21:19:48

Ok, I did the copy from pg_xlog directory into the restore.conf specifieddirectory. The standby servers seem fine with
that,however, just copying does not inform the primary that 
the copy has happened. The archive_status directory under pg_xlog on the primary still thinks the last WAL sent was *B7
andyet it's now writing *C9. When I did the copy it was 
only up to *C7 and nothing else has shown in the standby's directory.

Now, the *.done files in archive_status are just zero length, but I'm a bit hesitant to just do a touch for the ones I
manuallycopied as I don't know if this is from an in-memory 
queue or if it Postgresql reads the contents of this regularly in order to decide what to copy.

Is that safe to do?

On 6/24/2014 9:56 AM, Andrew Krause wrote:
> You can copy all of the WAL logs from your gap to the standby.  If you place them in the correct location (directory
designatedfor restore) theinstance will automatically apply them all. 
>
>
> Andrew Krause
>
>
>
> On Jun 23, 2014, at 9:24 AM, John Scalia <jayknowsunix@gmail.com> wrote:
>
>> Came in this morning to numerous complaints from pgpool about the standby servers being behind from the primary.
Lookinginto it, no WAL files had been transferred since late Friday. All I did was restart the primaryand the WAL
archvingresumed, however, looking at the WAL files on the standby servers, this is never going to catch up. Now, I've
gotthe archive_timeout on the primary = 600 or 10 minutes and I see WAL files in pg_xlog every 10 minutes. As they show
upon the standby servers, they're also 10 minutes apart, but the primary is writing *21 and the standby's areonly up to
*10.Now, like I said prior, with there being 10 minutes (600seconds) between transfers (the same pace as the WALs are
generated)it will never catch up. Is this really the intended behavior? How would I get the additional WAL files over
tothe standbys without waiting 10 minutes to copy them one at a time? 
>> --
>> Jay
>>
>>
>> --
>> Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-admin
>

Re: Wal archive way behind in streaming replication

From

Andrew Krause

Date:

24 June 2014, 21:52:09

You shouldn’t have to touch the files as long as they aren’t compressed.  You may have to restart the standby instance
toget the recovery to begin though.  I’d suggest tailing your instance log and restarting the standby instance.  It
shouldshow that the logs from the gap are applying right away at startup.  


Andrew Krause




On Jun 24, 2014, at 1:19 PM, John Scalia <jayknowsunix@gmail.com> wrote:

>
> Ok, I did the copy from pg_xlog directory into the restore.conf specifieddirectory. The standby servers seem fine
withthat, however, just copying does not inform the primary that 
> the copy has happened. The archive_status directory under pg_xlog on the primary still thinks the last WAL sent was
*B7and yet it's now writing *C9. When I did the copy it was 
> only up to *C7 and nothing else has shown in the standby's directory.
>
> Now, the *.done files in archive_status are just zero length, but I'm a bit hesitant to just do a touch for the ones
Imanually copied as I don't know if this is from an in-memory 
> queue or if it Postgresql reads the contents of this regularly in order to decide what to copy.
>
> Is that safe to do?
>
> On 6/24/2014 9:56 AM, Andrew Krause wrote:
>> You can copy all of the WAL logs from your gap to the standby.  If you place them in the correct location (directory
designatedfor restore) theinstance will automatically apply them all. 
>>
>>
>> Andrew Krause
>>
>>
>>
>> On Jun 23, 2014, at 9:24 AM, John Scalia <jayknowsunix@gmail.com> wrote:
>>
>>> Came in this morning to numerous complaints from pgpool about the standby servers being behind from the primary.
Lookinginto it, no WAL files had been transferred since late Friday. All I did was restart the primaryand the WAL
archvingresumed, however, looking at the WAL files on the standby servers, this is never going to catch up. Now, I've
gotthe archive_timeout on the primary = 600 or 10 minutes and I see WAL files in pg_xlog every 10 minutes. As they show
upon the standby servers, they're also 10 minutes apart, but the primary is writing *21 and the standby's areonly up to
*10.Now, like I said prior, with there being 10 minutes (600seconds) between transfers (the same pace as the WALs are
generated)it will never catch up. Is this really the intended behavior? How would I get the additional WAL files over
tothe standbys without waiting 10 minutes to copy them one at a time? 
>>> --
>>> Jay
>>>
>>>
>>> --
>>> Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
>>> To make changes to your subscription:
>>> http://www.postgresql.org/mailpref/pgsql-admin
>>
>
>
>
>
>
>
> --
> Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-admin

Re: Wal archive way behind in streaming replication

From

John Scalia

Date:

25 June 2014, 17:40:45

A little examination of the pgarch.c file showed what the archive process on the primary is doing. Anyway, to ensure
thatthe primary knows that it has transmitted all the up to  
date WALs, I went into the primary's data/pg_xlog/archive_status directory and performed "touch
00000003000000900000036.ready"and repeated this command for the other WALs up to  
*44.ready. This really shouldn't have been a problem as the most recent WAL file in pg_xlog was *45. The archiver then
pickedup all those WAL files and transmitted them to the  
standbys. At least I saw them appear on the standby in the directory specified in the recovery.conf file.

Now, what I really don't understand is the standby's behavior. After the WALs arrived, I saw nothing in today's
pg_log/Wed.logfile showing it saw them. I then issued a service  
postgresql-9.3 restart and this is what was spit out in the log:

LOG: entering standby mode
LOG: restored log file "00000000300000000900000035" from archive
LOG: unexpected pageaddr 9/1B000000 in log segment 00000000300000000900000035, offset 0
LOG: started streaming WAL from primary at 9/35000000 on timeline 3
FATAL: the database system is starting up
LOG: consistent recovery state reached at 9/350000C8
LOG: redo starts at 9/350000C8
LOG: database system is ready to accept read only connections

Two things stand out here. First, the standby didn't seem to process the newly arrived WAL files, and second. what's
withthe FATAL: in the logfile? 
--
Jay

On 6/24/2014 2:52 PM, Andrew Krause wrote:
> You shouldn’t have to touch the files as long as they aren’t compressed.  You may have to restart the standby
instanceto get the recovery to begin though.  I’d suggest tailing your instance log and restarting the standby
instance. It should show that the logs from the gap are applying right away at startup. 
>
>
> Andrew Krause
>
>
>
>
> On Jun 24, 2014, at 1:19 PM, John Scalia <jayknowsunix@gmail.com> wrote:
>
>> Ok, I did the copy from pg_xlog directory into the restore.conf specifieddirectory. The standby servers seem fine
withthat, however, just copying does not inform the primary that 
>> the copy has happened. The archive_status directory under pg_xlog on the primary still thinks the last WAL sent was
*B7and yet it's now writing *C9. When I did the copy it was 
>> only up to *C7 and nothing else has shown in the standby's directory.
>>
>> Now, the *.done files in archive_status are just zero length, but I'm a bit hesitant to just do a touch for the ones
Imanually copied as I don't know if this is from an in-memory 
>> queue or if it Postgresql reads the contents of this regularly in order to decide what to copy.
>>
>> Is that safe to do?
>>
>> On 6/24/2014 9:56 AM, Andrew Krause wrote:
>>> You can copy all of the WAL logs from your gap to the standby.  If you place them in the correct location
(directorydesignated for restore) theinstance will automatically apply them all. 
>>>
>>>
>>> Andrew Krause
>>>
>>>
>>>
>>> On Jun 23, 2014, at 9:24 AM, John Scalia <jayknowsunix@gmail.com> wrote:
>>>
>>>> Came in this morning to numerous complaints from pgpool about the standby servers being behind from the primary.
Lookinginto it, no WAL files had been transferred since late Friday. All I did was restart the primaryand the WAL
archvingresumed, however, looking at the WAL files on the standby servers, this is never going to catch up. Now, I've
gotthe archive_timeout on the primary = 600 or 10 minutes and I see WAL files in pg_xlog every 10 minutes. As they show
upon the standby servers, they're also 10 minutes apart, but the primary is writing *21 and the standby's areonly up to
*10.Now, like I said prior, with there being 10 minutes (600seconds) between transfers (the same pace as the WALs are
generated)it will never catch up. Is this really the intended behavior? How would I get the additional WAL files over
tothe standbys without waiting 10 minutes to copy them one at a time? 
>>>> --
>>>> Jay
>>>>
>>>>
>>>> --
>>>> Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
>>>> To make changes to your subscription:
>>>> http://www.postgresql.org/mailpref/pgsql-admin
>>
>>
>>
>>
>>
>> --
>> Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-admin
>

Re: Wal archive way behind in streaming replication

From

Jerry Sievers

Date:

25 June 2014, 19:42:51

John Scalia <jayknowsunix@gmail.com> writes:

> A little examination of the pgarch.c file showed what the archive
> process on the primary is doing. Anyway, to ensure that the primary
> knows that it has transmitted all the up to date WALs, I went into the
> primary's data/pg_xlog/archive_status directory and performed "touch
> 00000003000000900000036.ready" and repeated this command for the other
> WALs up to *44.ready. This really shouldn't have been a problem as the
> most recent WAL file in pg_xlog was *45. The archiver then picked up
> all those WAL files and transmitted them to the standbys. At least I
> saw them appear on the standby in the directory specified in the
> recovery.conf file.
>
> Now, what I really don't understand is the standby's behavior. After
> the WALs arrived, I saw nothing in today's pg_log/Wed.log file showing
> it saw them. I then issued a service postgresql-9.3 restart and this
> is what was spit out in the log:
>
> LOG: entering standby mode
> LOG: restored log file "00000000300000000900000035" from archive
> LOG: unexpected pageaddr 9/1B000000 in log segment 00000000300000000900000035, offset 0
> LOG: started streaming WAL from primary at 9/35000000 on timeline 3
> FATAL: the database system is starting up
> LOG: consistent recovery state reached at 9/350000C8
> LOG: redo starts at 9/350000C8
> LOG: database system is ready to accept read only connections
>
> Two things stand out here. First, the standby didn't seem to process the newly arrived WAL files, and second. what's
withthe FATAL: in the logfile? 

I'd suggest you ...

1. Toss out that standby instance.
2. Re-read all manual sections regarding hot backup/PITR/streaming
   replication etc.
3. Start fresh.
> --
I would not trust a standby instance after possibly corrupting it by
having to frob the .ready files on master.

A standby server configured as hybrid streamer/WAL shipper should...

1. Stream and/or read WAL segments from master's xlog directory when
   wal_keep_segments permits it.
2. Fetch WALs from a remote repository when it can't get  a feed
   directly from master.

There is no manual touching of .ready files needed and I can imagine
doing so   could be harmful in certain situations.

HTH



> Jay
>
> On 6/24/2014 2:52 PM, Andrew Krause wrote:
>> You shouldnt have to touch the files as long as they arent compressed.  You may have to restart the standby
instanceto get the recovery to begin though.  Id suggest tailing your instance log and restarting the standby
instance. It should show that the logs from the gap are applying right away at startup. 
>>
>>
>> Andrew Krause
>>
>>
>>
>>
>> On Jun 24, 2014, at 1:19 PM, John Scalia <jayknowsunix@gmail.com> wrote:
>>
>>> Ok, I did the copy from pg_xlog directory into the restore.conf specifieddirectory. The standby servers seem fine
withthat, however, just copying does not inform the primary that 
>>> the copy has happened. The archive_status directory under pg_xlog on the primary still thinks the last WAL sent was
*B7and yet it's now writing *C9. When I did the copy it was 
>>> only up to *C7 and nothing else has shown in the standby's directory.
>>>
>>> Now, the *.done files in archive_status are just zero length, but I'm a bit hesitant to just do a touch for the
onesI manually copied as I don't know if this is from an in-memory 
>>> queue or if it Postgresql reads the contents of this regularly in order to decide what to copy.
>>>
>>> Is that safe to do?
>>>
>>> On 6/24/2014 9:56 AM, Andrew Krause wrote:
>>>> You can copy all of the WAL logs from your gap to the standby.  If you place them in the correct location
(directorydesignated for restore) theinstance will automatically apply them all. 
>>>>
>>>>
>>>> Andrew Krause
>>>>
>>>>
>>>>
>>>> On Jun 23, 2014, at 9:24 AM, John Scalia <jayknowsunix@gmail.com> wrote:
>>>>
>>>>> Came in this morning to numerous complaints from pgpool about the standby servers being behind from the primary.
Lookinginto it, no WAL files had been transferred since late Friday. All I did was restart the primaryand the WAL
archvingresumed, however, looking at the WAL files on the standby servers, this is never going to catch up. Now, I've
gotthe archive_timeout on the primary = 600 or 10 minutes and I see WAL files in pg_xlog every 10 minutes. As they show
upon the standby servers, they're also 10 minutes apart, but the primary is writing *21 and the standby's areonly up to
*10.Now, like I said prior, with there being 10 minutes (600seconds) between transfers (the same pace as the WALs are
generated)it will never catch up. Is this really the intended behavior? How would I get the additional WAL files over
tothe standbys without waiting 10 minutes to copy them one at a time? 
>>>>> --
>>>>> Jay
>>>>>
>>>>>
>>>>> --
>>>>> Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
>>>>> To make changes to your subscription:
>>>>> http://www.postgresql.org/mailpref/pgsql-admin
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
>>> To make changes to your subscription:
>>> http://www.postgresql.org/mailpref/pgsql-admin
>>

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800

Re: Wal archive way behind in streaming replication

From

"Gilberto Castillo"

Date:

25 June 2014, 19:44:26



Stop to Master, copy folder pg_xlog you slave.

To me it has worked for me

>
> A little examination of the pgarch.c file showed what the archive process
> on the primary is doing. Anyway, to ensure that the primary knows that it
> has transmitted all the up to
> date WALs, I went into the primary's data/pg_xlog/archive_status directory
> and performed "touch 00000003000000900000036.ready" and repeated this
> command for the other WALs up to
> *44.ready. This really shouldn't have been a problem as the most recent
> WAL file in pg_xlog was *45. The archiver then picked up all those WAL
> files and transmitted them to the
> standbys. At least I saw them appear on the standby in the directory
> specified in the recovery.conf file.
>
> Now, what I really don't understand is the standby's behavior. After the
> WALs arrived, I saw nothing in today's pg_log/Wed.log file showing it saw
> them. I then issued a service
> postgresql-9.3 restart and this is what was spit out in the log:
>
> LOG: entering standby mode
> LOG: restored log file "00000000300000000900000035" from archive
> LOG: unexpected pageaddr 9/1B000000 in log segment
> 00000000300000000900000035, offset 0
> LOG: started streaming WAL from primary at 9/35000000 on timeline 3
> FATAL: the database system is starting up
> LOG: consistent recovery state reached at 9/350000C8
> LOG: redo starts at 9/350000C8
> LOG: database system is ready to accept read only connections
>
> Two things stand out here. First, the standby didn't seem to process the
> newly arrived WAL files, and second. what's with the FATAL: in the
> logfile?
> --
> Jay
>
> On 6/24/2014 2:52 PM, Andrew Krause wrote:
>> You shouldn’t have to touch the files as long as they aren’t compressed.
>>  You may have to restart the standby instance to get the recovery to
>> begin though.  I’d suggest tailing your instance log and restarting the
>> standby instance.  It should show that the logs from the gap are
>> applying right away at startup.
>>
>>
>> Andrew Krause
>>
>>
>>
>>
>> On Jun 24, 2014, at 1:19 PM, John Scalia <jayknowsunix@gmail.com> wrote:
>>
>>> Ok, I did the copy from pg_xlog directory into the restore.conf
>>> specifieddirectory. The standby servers seem fine with that, however,
>>> just copying does not inform the primary that
>>> the copy has happened. The archive_status directory under pg_xlog on
>>> the primary still thinks the last WAL sent was *B7 and yet it's now
>>> writing *C9. When I did the copy it was
>>> only up to *C7 and nothing else has shown in the standby's directory.
>>>
>>> Now, the *.done files in archive_status are just zero length, but I'm a
>>> bit hesitant to just do a touch for the ones I manually copied as I
>>> don't know if this is from an in-memory
>>> queue or if it Postgresql reads the contents of this regularly in order
>>> to decide what to copy.
>>>
>>> Is that safe to do?
>>>
>>> On 6/24/2014 9:56 AM, Andrew Krause wrote:
>>>> You can copy all of the WAL logs from your gap to the standby.  If you
>>>> place them in the correct location (directory designated for restore)
>>>> theinstance will automatically apply them all.
>>>>
>>>>
>>>> Andrew Krause
>>>>
>>>>
>>>>
>>>> On Jun 23, 2014, at 9:24 AM, John Scalia <jayknowsunix@gmail.com>
>>>> wrote:
>>>>
>>>>> Came in this morning to numerous complaints from pgpool about the
>>>>> standby servers being behind from the primary. Looking into it, no
>>>>> WAL files had been transferred since late Friday. All I did was
>>>>> restart the primaryand the WAL archving resumed, however, looking at
>>>>> the WAL files on the standby servers, this is never going to catch
>>>>> up. Now, I've got the archive_timeout on the primary = 600 or 10
>>>>> minutes and I see WAL files in pg_xlog every 10 minutes. As they show
>>>>> up on the standby servers, they're also 10 minutes apart, but the
>>>>> primary is writing *21 and the standby's areonly up to *10. Now, like
>>>>> I said prior, with there being 10 minutes (600seconds) between
>>>>> transfers (the same pace as the WALs are generated) it will never
>>>>> catch up. Is this really the intended behavior? How would I get the
>>>>> additional WAL files over to the standbys without waiting 10 minutes
>>>>> to copy them one at a time?
>>>>> --
>>>>> Jay
>>>>>
>>>>>
>>>>> --
>>>>> Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
>>>>> To make changes to your subscription:
>>>>> http://www.postgresql.org/mailpref/pgsql-admin
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
>>> To make changes to your subscription:
>>> http://www.postgresql.org/mailpref/pgsql-admin
>>
>
>
>
>
> --
> Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-admin---
> This message was processed by Kaspersky Mail Gateway 5.6.28/RELEASE
> running at host imx2.etecsa.cu
> Visit our web-site: <http://www.kaspersky.com>, <http://www.viruslist.com>
>


Saludos,
Gilberto Castillo
La Habana, Cuba
---
This message was processed by Kaspersky Mail Gateway 5.6.28/RELEASE running at host imx3.etecsa.cu
Visit our web-site: <http://www.kaspersky.com>, <http://www.viruslist.com>

Re: Wal archive way behind in streaming replication

From

John Scalia

Date:

25 June 2014, 20:53:25

Well, having read and understanding the source code in pgarch.c, I saw nothing dangerous for performing these steps.
Maybethere is something deeper in there, but it just seemed  
odd that the standby isn't recognizing the new files considering they arrived by traditional means. I'm going to go
backand study the source a little more. 

On 6/25/2014 12:42 PM, Jerry Sievers wrote:
> John Scalia <jayknowsunix@gmail.com> writes:
>
>> A little examination of the pgarch.c file showed what the archive
>> process on the primary is doing. Anyway, to ensure that the primary
>> knows that it has transmitted all the up to date WALs, I went into the
>> primary's data/pg_xlog/archive_status directory and performed "touch
>> 00000003000000900000036.ready" and repeated this command for the other
>> WALs up to *44.ready. This really shouldn't have been a problem as the
>> most recent WAL file in pg_xlog was *45. The archiver then picked up
>> all those WAL files and transmitted them to the standbys. At least I
>> saw them appear on the standby in the directory specified in the
>> recovery.conf file.
>>
>> Now, what I really don't understand is the standby's behavior. After
>> the WALs arrived, I saw nothing in today's pg_log/Wed.log file showing
>> it saw them. I then issued a service postgresql-9.3 restart and this
>> is what was spit out in the log:
>>
>> LOG: entering standby mode
>> LOG: restored log file "00000000300000000900000035" from archive
>> LOG: unexpected pageaddr 9/1B000000 in log segment 00000000300000000900000035, offset 0
>> LOG: started streaming WAL from primary at 9/35000000 on timeline 3
>> FATAL: the database system is starting up
>> LOG: consistent recovery state reached at 9/350000C8
>> LOG: redo starts at 9/350000C8
>> LOG: database system is ready to accept read only connections
>>
>> Two things stand out here. First, the standby didn't seem to process the newly arrived WAL files, and second. what's
withthe FATAL: in the logfile? 
> I'd suggest you ...
>
> 1. Toss out that standby instance.
> 2. Re-read all manual sections regarding hot backup/PITR/streaming
>     replication etc.
> 3. Start fresh.
>> --
> I would not trust a standby instance after possibly corrupting it by
> having to frob the .ready files on master.
>
> A standby server configured as hybrid streamer/WAL shipper should...
>
> 1. Stream and/or read WAL segments from master's xlog directory when
>     wal_keep_segments permits it.
> 2. Fetch WALs from a remote repository when it can't get  a feed
>     directly from master.
>
> There is no manual touching of .ready files needed and I can imagine
> doing so   could be harmful in certain situations.
>
> HTH
>
>
>
>> Jay
>>
>> On 6/24/2014 2:52 PM, Andrew Krause wrote:
>>> You shouldn’t have to touch the files as long as they aren’t compressed.  You may have to restart the standby
instanceto get the recovery to begin though.  I’d suggest tailing your instance log and restarting the standby
instance. It should show that the logs from the gap are applying right away at startup. 
>>>
>>>
>>> Andrew Krause
>>>
>>>
>>>
>>>
>>> On Jun 24, 2014, at 1:19 PM, John Scalia <jayknowsunix@gmail.com> wrote:
>>>
>>>> Ok, I did the copy from pg_xlog directory into the restore.conf specifieddirectory. The standby servers seem fine
withthat, however, just copying does not inform the primary that 
>>>> the copy has happened. The archive_status directory under pg_xlog on the primary still thinks the last WAL sent
was*B7 and yet it's now writing *C9. When I did the copy it was 
>>>> only up to *C7 and nothing else has shown in the standby's directory.
>>>>
>>>> Now, the *.done files in archive_status are just zero length, but I'm a bit hesitant to just do a touch for the
onesI manually copied as I don't know if this is from an in-memory 
>>>> queue or if it Postgresql reads the contents of this regularly in order to decide what to copy.
>>>>
>>>> Is that safe to do?
>>>>
>>>> On 6/24/2014 9:56 AM, Andrew Krause wrote:
>>>>> You can copy all of the WAL logs from your gap to the standby.  If you place them in the correct location
(directorydesignated for restore) theinstance will automatically apply them all. 
>>>>>
>>>>>
>>>>> Andrew Krause
>>>>>
>>>>>
>>>>>
>>>>> On Jun 23, 2014, at 9:24 AM, John Scalia <jayknowsunix@gmail.com> wrote:
>>>>>
>>>>>> Came in this morning to numerous complaints from pgpool about the standby servers being behind from the primary.
Lookinginto it, no WAL files had been transferred since late Friday. All I did was restart the primaryand the WAL
archvingresumed, however, looking at the WAL files on the standby servers, this is never going to catch up. Now, I've
gotthe archive_timeout on the primary = 600 or 10 minutes and I see WAL files in pg_xlog every 10 minutes. As they show
upon the standby servers, they're also 10 minutes apart, but the primary is writing *21 and the standby's areonly up to
*10.Now, like I said prior, with there being 10 minutes (600seconds) between transfers (the same pace as the WALs are
generated)it will never catch up. Is this really the intended behavior? How would I get the additional WAL files over
tothe standbys without waiting 10 minutes to copy them one at a time? 
>>>>>> --
>>>>>> Jay
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
>>>>>> To make changes to your subscription:
>>>>>> http://www.postgresql.org/mailpref/pgsql-admin
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
>>>> To make changes to your subscription:
>>>> http://www.postgresql.org/mailpref/pgsql-admin

Re: Wal archive way behind in streaming replication

From

John Scalia

Date:

26 June 2014, 19:48:58

OK, starting over...

I killed off both standby servers and I rebuilt my databases off of the primary server (10.10.1.128) using:

pg_basebackup -h 10.10.1.128 -p 5432 -D /var/lib/pgsql/9.3/data -X s -c fast

It took about 40 minutes to complete for each standby. Immediately after it completed, I copied my previously saved
recovery.conffile into each's data directory and started the  
standby servers. I did not do anything to the files in the directory specified by the restore_command. Now, I would
expectthe primary to continue writing WAL segments into the  
directory while the pg_basebackup is running, but it apparently did not. Checking the primary's pg_xlog directory
showedthat it was now writing *DF, and each standby had only  
received up to *D5. That means each standby's WALs are about 1 hour behind. I would expect each to be at *DE as that
shouldbe the last completely written WAL file, and I would  
guess that the primary is still writing *DF.

The archive process on the primary is now writing WAL segments to the directory in archive_command, as I just saw *D6
popinto it, but it's never going to catch up without help as  
it's only copying 1 file every 10 minutes as a new one gets generated. Am I missing a flag for pg_basebackup? Or should
Ijust accept that the standby's could never catch up with  
WAL segments and just turn archiving off?
--
Jay

Re: Wal archive way behind in streaming replication

From

bricklen

Date:

28 June 2014, 00:35:10

I haven't been following this thread, but FWIW, here is a simple bash script I use to rebuild a hot standby when I'm testing.

It assumes that the master is correctly set up for replication, that the port is 5432. and that the replication user is named "replication".

Runs on CentOS, haven't tried it on any other distro.

This is a script I call /root/recreate_standby.sh, and gets run as root. This is optional, if you run it as postgres. remove the "sudo su - postgres -c" sections.

Watch out for word-wrapping from the email client.

------------

#!/bin/bash

## This script runs on the standby.
## Executed as root, else remove the "sudo - postgres -c" commands.
## Assumes you have a valid recovery.conf saved at
## $PGDATA/../recovery.conf.bkp

export PGDATA=/path/to/data/dir     ## Must be set correctly
export PGPORT=5432
export MASTER=192.168.x.x            ## IP or host entry for the master Postgresql server
export PGBIN=/usr/pgsql-9.3/bin

service postgresql-9.3 stop -m immediate

if [ $? != 0 ]; then
    service postgresql-9.3 start
    echo "Could not shut down PostgreSQL. Aborting."
    exit 1
fi

rm -rf $PGDATA

if [ $? != 0 ]; then
    echo "Could not remove the PostgreSQL $PGDATA dir. Aborting."
    exit 1
fi

## If the replication role is not set to "trust" in the master's
## pg_hba.conf file, the password will need to be passed into the command below,
## and "--no-password" will need to be removed or revised to be "--password"
su - postgres -c "$PGBIN/pg_basebackup --pgdata=$PGDATA --host=$MASTER --port=$PGPORT --username=replication --no-password --xlog-method=stream --format=plain --progress --verbose"

su - postgres -c "cp -p $PGDATA/../recovery.conf.bkp $PGDATA/recovery.conf"

service postgresql-9.3 start

su - postgres -c "$PGBIN/pg_isready -U postgres -p $PGPORT -t2"

while [ $? != 0 ]; do
echo "Sleep 1 second, check if slave is up yet. If not, sleep again."
sleep 1;
su - postgres -c "$PGBIN/pg_isready -U postgres -p $PGPORT -t2"
done

su - postgres -c "$PGBIN/psql -d postgres -U postgres -qXc 'select pg_is_in_recovery() as is_pg_in_recovery'"

exit 0