Thread: Streaming Replication Randomly Locking Up

Streaming Replication Randomly Locking Up

From
Andrew Berman
Date:
Hello,

I'm having an issue where streaming replication just randomly stops working.  I haven't been able to find anything in the logs that points to an issue, but the Postgres startup process shows a "waiting" status on the slave:

postgres  5639  0.1 24.3 3428264 2970236 ?     Ss   Aug14   1:54 postgres: startup process   recovering 000000010000053D0000003F waiting
postgres  5642  0.0 21.4 3428356 2613252 ?     Ss   Aug14   0:30 postgres: writer process
postgres  5659  0.0  0.0 177524   788 ?        Ss   Aug14   0:03 postgres: stats collector process
postgres  7159  1.2  0.1 3451360 18352 ?       Ss   Aug14  17:31 postgres: wal receiver process   streaming 549/216B3730

The replication works great for days, but randomly seems to lock up and replication halts.  I verified that the two databases were out of sync with a query on both of them.  Has anyone experienced this issue before? 

Here are some relevant config settings:

Master:

wal_level = hot_standby
checkpoint_segments = 32
checkpoint_completion_target = 0.9
archive_mode = on
archive_command = 'rsync -a %p foo@foo:/var/lib/pgsql/9.1/wals/%f </dev/null'
max_wal_senders = 2   
wal_keep_segments = 32
       
Slave:

wal_level = hot_standby
checkpoint_segments = 32
#checkpoint_completion_target = 0.5
hot_standby = on
max_standby_archive_delay = -1 
max_standby_streaming_delay = -1
#wal_receiver_status_interval = 10s
#hot_standby_feedback = off

Thank you for any help you can provide!

Andrew

Re: Streaming Replication Randomly Locking Up

From
Lonni J Friedman
Date:
I've never seen this happen.  Looks like you might be using 9.1?  Are
you up to date on all the 9.1.x releases?

Do you have just 1 slave syncing from the master?
Which OS are you using?
Did you verify that there aren't any network problems between the
slave & master?
Or hardware problems (like the NIC dying, or dropping packets)?


On Thu, Aug 15, 2013 at 11:07 AM, Andrew Berman <rexxe98@gmail.com> wrote:
> Hello,
>
> I'm having an issue where streaming replication just randomly stops working.
> I haven't been able to find anything in the logs which point to an issue,
> but the Postgres process shows a "waiting" status on the slave:
>
> postgres  5639  0.1 24.3 3428264 2970236 ?     Ss   Aug14   1:54 postgres:
> startup process   recovering 000000010000053D0000003F waiting
> postgres  5642  0.0 21.4 3428356 2613252 ?     Ss   Aug14   0:30 postgres:
> writer process
> postgres  5659  0.0  0.0 177524   788 ?        Ss   Aug14   0:03 postgres:
> stats collector process
> postgres  7159  1.2  0.1 3451360 18352 ?       Ss   Aug14  17:31 postgres:
> wal receiver process   streaming 549/216B3730
>
> The replication works great for days, but randomly seems to lock up and
> replication halts.  I verified that the two databases were out of sync with
> a query on both of them.  Has anyone experienced this issue before?
>
> Here are some relevant config settings:
>
> Master:
>
> wal_level = hot_standby
> checkpoint_segments = 32
> checkpoint_completion_target = 0.9
> archive_mode = on
> archive_command = 'rsync -a %p foo@foo:/var/lib/pgsql/9.1/wals/%f
> </dev/null'
> max_wal_senders = 2
> wal_keep_segments = 32
>
> Slave:
>
> wal_level = hot_standby
> checkpoint_segments = 32
> #checkpoint_completion_target = 0.5
> hot_standby = on
> max_standby_archive_delay = -1
> max_standby_streaming_delay = -1
> #wal_receiver_status_interval = 10s
> #hot_standby_feedback = off
>
> Thank you for any help you can provide!
>
> Andrew
>



--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman                                    netllama@gmail.com
LlamaLand                       https://netllama.linux-sxs.org


Re: Streaming Replication Randomly Locking Up

From
Andrew Berman
Date:
Hi Lonni,

Yes, I am using PG 9.1.9.
Yes, 1 slave syncing from the master
CentOS 6.4
I don't see any network or hardware issues (e.g. NIC) but will look more into this.  They are communicating on a private network and switch.

I forgot to mention that after I restart the slave, everything syncs right back up and all is working again. So if it is a network issue, replication is just stopping after some hiccup instead of retrying and resuming when things are back up.

Thanks!



On Thu, Aug 15, 2013 at 11:32 AM, Lonni J Friedman <netllama@gmail.com> wrote:
I've never seen this happen.  Looks like you might be using 9.1?  Are
you up to date on all the 9.1.x releases?

Do you have just 1 slave syncing from the master?
Which OS are you using?
Did you verify that there aren't any network problems between the
slave & master?
Or hardware problems (like the NIC dying, or dropping packets)?



Re: Streaming Replication Randomly Locking Up

From
Lonni J Friedman
Date:
Are you certain that there are no relevant errors in the database logs
(on both master & slave)?  Also, are you sure that you didn't
misconfigure logging such that errors wouldn't appear?

On Thu, Aug 15, 2013 at 11:45 AM, Andrew Berman <rexxe98@gmail.com> wrote:
> Hi Lonni,
>
> Yes, I am using PG 9.1.9.
> Yes, 1 slave syncing from the master
> CentOS 6.4
> I don't see any network or hardware issues (e.g. NIC) but will look more
> into this.  They are communicating on a private network and switch.
>
> I forgot to mention that after I restart the slave, everything syncs right
> back up and all is working again so if it is a network issue, the
> replication is just stopping after some hiccup instead of retrying and
> resuming when things are back up.
>
> Thanks!


Re: Streaming Replication Randomly Locking Up

From
Andrew Berman
Date:
The only thing I see that is a possibility for the issue is in the slave log:

LOG:  unexpected EOF on client connection
LOG:  could not receive data from client: Connection reset by peer

I don't know if that's related or not as it could just be somebody running a query.  The log file does seem to be riddled with these but the replication failures don't happen constantly.

As far as I know I'm not swallowing any errors.  The logging is all set as the default:

log_destination = 'stderr'
logging_collector = on
#client_min_messages = notice
#log_min_messages = warning
#log_min_error_statement = error
#log_min_duration_statement = -1
#log_checkpoints = off
#log_connections = off
#log_disconnections = off
#log_error_verbosity = default

I'm going to have a look at the NICs to make sure there's no issue there.

Thanks again for your help!


On Thu, Aug 15, 2013 at 11:51 AM, Lonni J Friedman <netllama@gmail.com> wrote:
Are you certain that there are no relevant errors in the database logs
(on both master & slave)?  Also, are you sure that you didn't
misconfigure logging such that errors wouldn't appear?


Re: Streaming Replication Randomly Locking Up

From
Lonni J Friedman
Date:
I'd suggest enhancing your logging to include time/datestamps for
every entry, and also the client hostname.  That will help to rule
in/out those 'unexpected EOF' errors.
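
A minimal sketch of what that logging change could look like in postgresql.conf on both servers. The specific prefix layout is an assumption for illustration and is not given in the thread; %t, %p, %r, %u and %d are standard log_line_prefix escapes in 9.1.

```conf
# Hypothetical logging settings (values are illustrative, not from the thread)
log_line_prefix = '%t [%p] %r %u@%d '   # timestamp, PID, remote host and port, user@database
log_hostname = on                        # log client host names, not only IPs (costs a DNS lookup)
log_connections = on                     # record each incoming connection
log_disconnections = on                  # record session end and session duration
```

With entries stamped this way, an 'unexpected EOF' line can be matched against the time replication stalled and against the client that dropped the connection.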

On Thu, Aug 15, 2013 at 12:22 PM, Andrew Berman <rexxe98@gmail.com> wrote:
> The only thing I see that is a possibility for the issue is in the slave
> log:
>
> LOG:  unexpected EOF on client connection
> LOG:  could not receive data from client: Connection reset by peer
>
> I don't know if that's related or not as it could just be somebody running a
> query.  The log file does seem to be riddled with these but the replication
> failures don't happen constantly.
>
> As far as I know I'm not swallowing any errors.  The logging is all set as
> the default:
>
> log_destination = 'stderr'
> logging_collector = on
> #client_min_messages = notice
> #log_min_messages = warning
> #log_min_error_statement = error
> #log_min_duration_statement = -1
> #log_checkpoints = off
> #log_connections = off
> #log_disconnections = off
> #log_error_verbosity = default
>
> I'm going to have a look at the NICs to make sure there's no issue there.
>
> Thanks again for your help!


Re: Streaming Replication Randomly Locking Up

From
Andrew Berman
Date:
Yep, that's the first thing I'm going to do.


On Thu, Aug 15, 2013 at 12:34 PM, Lonni J Friedman <netllama@gmail.com> wrote:
I'd suggest enhancing your logging to include time/datestamps for
every entry, and also the client hostname.  That will help to rule
in/out those 'unexpected EOF' errors.


Re: Streaming Replication Randomly Locking Up

From
Jeff Janes
Date:
On Thu, Aug 15, 2013 at 11:07 AM, Andrew Berman <rexxe98@gmail.com> wrote:
> Hello,
>
> I'm having an issue where streaming replication just randomly stops working.
> I haven't been able to find anything in the logs which point to an issue,
> but the Postgres process shows a "waiting" status on the slave:
>
> postgres  5639  0.1 24.3 3428264 2970236 ?     Ss   Aug14   1:54 postgres:
> startup process   recovering 000000010000053D0000003F waiting

There is a recovery conflict which it is waiting to go away.  In other
words, you have a long-running (or long-idle) transaction on the slave
which is blocking recovery.


> max_standby_archive_delay = -1
> max_standby_streaming_delay = -1

...and you are willing to wait forever.

Cheers,

Jeff
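
To make the point concrete, here is a sketch of a bounded wait on the slave. The 30s figure is the PostgreSQL default for these two settings; it is used here as an assumption, not as a recommendation made in the thread.

```conf
# Slave postgresql.conf: illustrative values, not taken from the thread
max_standby_archive_delay = 30s     # -1 means wait forever for conflicting queries
max_standby_streaming_delay = 30s   # -1 means wait forever for conflicting queries
```

With a finite limit, once the limit is exceeded the conflicting queries on the slave are cancelled so that recovery can continue. The trade-off is that long-running queries on the slave may be interrupted.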


Re: Streaming Replication Randomly Locking Up

From
Andrew Berman
Date:
Hi Jeff,

Here is the full process list at the time it stopped working (I have changed the actual username, db and IP for security).  Would the idle in transaction process be the culprit?

postgres  5639  0.1 24.3 3428264 2970236 ?     Ss   Aug14   1:54 postgres: startup process   recovering 000000010000053D0000003F waiting
postgres  5642  0.0 21.4 3428356 2613252 ?     Ss   Aug14   0:30 postgres: writer process
postgres  5659  0.0  0.0 177524   788 ?        Ss   Aug14   0:03 postgres: stats collector process
postgres  7159  1.2  0.1 3451360 18352 ?       Ss   Aug14  17:31 postgres: wal receiver process   streaming 549/216B3730
postgres 10403  0.0  0.2 3430372 25920 ?       Ss   Aug14   0:31 postgres: user db x.x.x.x(61656) idle in transaction
postgres 19933  0.0  0.4 3426604 49564 ?       S    Aug05   0:06 /usr/pgsql-9.1/bin/postmaster -p 5432 -D /var/lib/pgsql/9.1/data
postgres 19935  0.0  0.0 175288   396 ?        Ss   Aug05   0:13 postgres: logger process
postgres 21133  0.0  0.2 3443600 30680 ?       Ss   09:28   0:00 postgres: user db x.x.x.x(64430) idle
postgres 21134  0.4  0.2 3430160 27656 ?       Ss   09:28   0:16 postgres: user db x.x.x.x(64431) idle
root     21529  0.0  0.0 103240   844 pts/0    S+   10:33   0:00 grep --color postgres

Thanks,


Andrew



On Thu, Aug 15, 2013 at 1:20 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Thu, Aug 15, 2013 at 11:07 AM, Andrew Berman <rexxe98@gmail.com> wrote:
> Hello,
>
> I'm having an issue where streaming replication just randomly stops working.
> I haven't been able to find anything in the logs which point to an issue,
> but the Postgres process shows a "waiting" status on the slave:
>
> postgres  5639  0.1 24.3 3428264 2970236 ?     Ss   Aug14   1:54 postgres:
> startup process   recovering 000000010000053D0000003F waiting

There is a recovery conflict which it is waiting to go away.  In other
words, you have a long-running (or long-idle) transaction on the slave
which is blocking recovery.


> max_standby_archive_delay = -1
> max_standby_streaming_delay = -1

...and you are willing to wait forever.

Cheers,

Jeff

Re: Streaming Replication Randomly Locking Up

From
John DeSoi
Date:
On Aug 15, 2013, at 1:07 PM, Andrew Berman <rexxe98@gmail.com> wrote:

> I'm having an issue where streaming replication just randomly stops working.  I haven't been able to find anything in the logs which point to an issue, but the Postgres process shows a "waiting" status on the slave:
>
> postgres  5639  0.1 24.3 3428264 2970236 ?     Ss   Aug14   1:54 postgres: startup process   recovering 000000010000053D0000003F waiting
> postgres  5642  0.0 21.4 3428356 2613252 ?     Ss   Aug14   0:30 postgres: writer process
> postgres  5659  0.0  0.0 177524   788 ?        Ss   Aug14   0:03 postgres: stats collector process
> postgres  7159  1.2  0.1 3451360 18352 ?       Ss   Aug14  17:31 postgres: wal receiver process   streaming 549/216B3730
>
> The replication works great for days, but randomly seems to lock up and replication halts.  I verified that the two databases were out of sync with a query on both of them.  Has anyone experienced this issue before?
>
> Here are some relevant config settings:
>
> Master:
>
> wal_level = hot_standby
> checkpoint_segments = 32
> checkpoint_completion_target = 0.9
> archive_mode = on
> archive_command = 'rsync -a %p foo@foo:/var/lib/pgsql/9.1/wals/%f </dev/null'
> max_wal_senders = 2
> wal_keep_segments = 32

I recently posted about the same thing -- replication just stops after working OK for days or weeks, no errors in the logs on master or slave.

It appears I solved it by adding --timeout=30 to my rsync command. My guess was some kind of network hang and then rsync would just wait forever and never return.

John DeSoi, Ph.D.
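
For reference, the change described above, applied to the archive_command from the original post, might look like the following. This is a sketch: only --timeout=30 comes from this message, and the rest of the command is copied from the master settings quoted above.

```conf
# Master postgresql.conf: archive_command with an rsync I/O timeout added
archive_command = 'rsync -a --timeout=30 %p foo@foo:/var/lib/pgsql/9.1/wals/%f </dev/null'
```

rsync's --timeout is an I/O timeout in seconds. If no data moves for that long, rsync exits with an error, and PostgreSQL retries archiving that segment later instead of blocking on a hung transfer.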



Re: Streaming Replication Randomly Locking Up

From
Andrew Berman
Date:
Awesome, I'll give that a shot John.


On Fri, Aug 16, 2013 at 8:39 AM, John DeSoi <desoi@pgedit.com> wrote:

On Aug 15, 2013, at 1:07 PM, Andrew Berman <rexxe98@gmail.com> wrote:

> I'm having an issue where streaming replication just randomly stops working.  I haven't been able to find anything in the logs which point to an issue, but the Postgres process shows a "waiting" status on the slave:
>
> postgres  5639  0.1 24.3 3428264 2970236 ?     Ss   Aug14   1:54 postgres: startup process   recovering 000000010000053D0000003F waiting
> postgres  5642  0.0 21.4 3428356 2613252 ?     Ss   Aug14   0:30 postgres: writer process
> postgres  5659  0.0  0.0 177524   788 ?        Ss   Aug14   0:03 postgres: stats collector process
> postgres  7159  1.2  0.1 3451360 18352 ?       Ss   Aug14  17:31 postgres: wal receiver process   streaming 549/216B3730
>
> The replication works great for days, but randomly seems to lock up and replication halts.  I verified that the two databases were out of sync with a query on both of them.  Has anyone experienced this issue before?
>
> Here are some relevant config settings:
>
> Master:
>
> wal_level = hot_standby
> checkpoint_segments = 32
> checkpoint_completion_target = 0.9
> archive_mode = on
> archive_command = 'rsync -a %p foo@foo:/var/lib/pgsql/9.1/wals/%f </dev/null'
> max_wal_senders = 2
> wal_keep_segments = 32

I recently posted about the same thing -- replication just stops after working OK for days or weeks, no errors in the logs on master or slave.

It appears I solved it by adding --timeout=30 to my rsync command. My guess was some kind of network hang and then rsync would just wait forever and never return.

John DeSoi, Ph.D.


Re: Streaming Replication Randomly Locking Up

From
Jeff Janes
Date:
On Thu, Aug 15, 2013 at 1:28 PM, Andrew Berman <rexxe98@gmail.com> wrote:
> Hi Jeff,
>
> Here is the full process list at the time it stopped working (I have changed
> the actual username, db and IP for security).  Would the idle in transaction
> process be the culprit?

Most likely, yes.  You should be able to dig into pg_locks to verify.


Cheers,

Jeff


Re: Streaming Replication Randomly Locking Up

From
Jeff Janes
Date:
On Fri, Aug 16, 2013 at 9:45 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Aug 15, 2013 at 1:28 PM, Andrew Berman <rexxe98@gmail.com> wrote:
>> Hi Jeff,
>>
>> Here is the full process list at the time it stopped working (I have changed
>> the actual username, db and IP for security).  Would the idle in transaction
>> process be the culprit?
>
> Most likely, yes.  You should be able to dig into pg_locks to verify.

Actually, you can't.  The waiting doesn't show up in pg_locks, because
it polls in a sleep-loop, rather than doing a normal wait on the lock.

Still, that idle in transaction process is almost surely the culprit.

Cheers,

Jeff


Re: Streaming Replication Randomly Locking Up

From
Andrew Berman
Date:
Ok, next time it happens I'll try to do more sleuthing to figure out if that's the issue.  For now, I'm going to try adding --timeout=30 to the rsync command and see if that fixes things.

Thanks again for your help!

Andrew


On Fri, Aug 16, 2013 at 10:12 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Fri, Aug 16, 2013 at 9:45 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Aug 15, 2013 at 1:28 PM, Andrew Berman <rexxe98@gmail.com> wrote:
>> Hi Jeff,
>>
>> Here is the full process list at the time it stopped working (I have changed
>> the actual username, db and IP for security).  Would the idle in transaction
>> process be the culprit?
>
> Most likely, yes.  You should be able to dig into pg_locks to verify.

Actually, you can't.  The waiting doesn't show up in pg_locks, because
it polls in a sleep-loop, rather than doing a normal wait on the lock.

Still, that idle in transaction process is almost surely the culprit.

Cheers,

Jeff