Thread: Streaming Replication Randomly Locking Up

Streaming Replication Randomly Locking Up

From

Andrew Berman

Date:

15 August 2013, 21:08:06

Hello,

I'm having an issue where streaming replication just randomly stops working. I haven't been able to find anything in the logs which points to an issue, but the Postgres process shows a "waiting" status on the slave:

postgres 5639 0.1 24.3 3428264 2970236 ? Ss Aug14 1:54 postgres: startup process recovering 000000010000053D0000003F waiting
postgres 5642 0.0 21.4 3428356 2613252 ? Ss Aug14 0:30 postgres: writer process
postgres 5659 0.0 0.0 177524 788 ? Ss Aug14 0:03 postgres: stats collector process
postgres 7159 1.2 0.1 3451360 18352 ? Ss Aug14 17:31 postgres: wal receiver process streaming 549/216B3730

The replication works great for days, but randomly seems to lock up and replication halts. I verified that the two databases were out of sync with a query on both of them. Has anyone experienced this issue before?

Here are some relevant config settings:

Master:

wal_level = hot_standby

checkpoint_segments = 32

checkpoint_completion_target = 0.9

archive_mode = on

archive_command = 'rsync -a %p foo@foo:/var/lib/pgsql/9.1/wals/%f </dev/null'

max_wal_senders = 2

wal_keep_segments = 32

Slave:

wal_level = hot_standby

checkpoint_segments = 32

#checkpoint_completion_target = 0.5

hot_standby = on

max_standby_archive_delay = -1

max_standby_streaming_delay = -1

#wal_receiver_status_interval = 10s

#hot_standby_feedback = off

Thank you for any help you can provide!

Andrew

Re: Streaming Replication Randomly Locking Up

From

Lonni J Friedman

Date:

15 August 2013, 21:32:55

I've never seen this happen.  Looks like you might be using 9.1?  Are
you up to date on all the 9.1.x releases?

Do you have just 1 slave syncing from the master?
Which OS are you using?
Did you verify that there aren't any network problems between the
slave & master?
Or hardware problems (like the NIC dying, or dropping packets)?





--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman                                    netllama@gmail.com
LlamaLand                       https://netllama.linux-sxs.org

Re: Streaming Replication Randomly Locking Up

From

Andrew Berman

Date:

15 August 2013, 21:46:08

Hi Lonni,

Yes, I am using PG 9.1.9.

Yes, 1 slave syncing from the master

CentOS 6.4

I don't see any network or hardware issues (e.g. NIC) but will look more into this. They are communicating on a private network and switch.

I forgot to mention that after I restart the slave, everything syncs right back up and all is working again, so if it is a network issue, the replication is just stopping after some hiccup instead of retrying and resuming when things are back up.

Thanks!


Re: Streaming Replication Randomly Locking Up

From

Lonni J Friedman

Date:

15 August 2013, 21:51:32

Are you certain that there are no relevant errors in the database logs
(on both master & slave)?  Also, are you sure that you didn't
misconfigure logging such that errors wouldn't appear?


Re: Streaming Replication Randomly Locking Up

From

Andrew Berman

Date:

15 August 2013, 22:23:15

The only thing I see that is a possibility for the issue is in the slave log:

LOG: unexpected EOF on client connection

LOG: could not receive data from client: Connection reset by peer

I don't know if that's related or not as it could just be somebody running a query. The log file does seem to be riddled with these but the replication failures don't happen constantly.

As far as I know I'm not swallowing any errors. The logging is all set as the default:

log_destination = 'stderr'

logging_collector = on

#client_min_messages = notice

#log_min_messages = warning

#log_min_error_statement = error

#log_min_duration_statement = -1

#log_checkpoints = off

#log_connections = off

#log_disconnections = off

#log_error_verbosity = default

I'm going to have a look at the NICs to make sure there's no issue there.

Thanks again for your help!


Re: Streaming Replication Randomly Locking Up

From

Lonni J Friedman

Date:

15 August 2013, 22:34:26

I'd suggest enhancing your logging to include time/datestamps for
every entry, and also the client hostname.  That will help to rule
in/out those 'unexpected EOF' errors.
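
As a rough sketch of that suggestion, the slave's postgresql.conf could be changed along these lines. The prefix layout and the choice to log connections are assumptions for illustration, not settings anyone in this thread posted; the escapes themselves (%t, %p, %u, %d, %r) are standard log_line_prefix escapes in 9.1.

```
# postgresql.conf (illustrative values, not taken from the thread)
log_line_prefix = '%t [%p] %u@%d %r '   # timestamp, PID, user@database, remote host and port
log_connections = on                    # log each successful connection
log_disconnections = on                 # log session end, including duration
#log_hostname = on                      # resolve client IPs to hostnames (adds a DNS lookup per connection)
```

With the remote host in every line, each 'unexpected EOF' entry can be matched to a client machine and compared against the time replication stalled.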


Re: Streaming Replication Randomly Locking Up

From

Andrew Berman

Date:

15 August 2013, 22:38:27

Yep, that's the first thing I'm going to do.


Re: Streaming Replication Randomly Locking Up

From

Jeff Janes

Date:

15 August 2013, 23:20:20

On Thu, Aug 15, 2013 at 11:07 AM, Andrew Berman <rexxe98@gmail.com> wrote:
> Hello,
>
> I'm having an issue where streaming replication just randomly stops working.
> I haven't been able to find anything in the logs which point to an issue,
> but the Postgres process shows a "waiting" status on the slave:
>
> postgres  5639  0.1 24.3 3428264 2970236 ?     Ss   Aug14   1:54 postgres:
> startup process   recovering 000000010000053D0000003F waiting

There is a recovery conflict which it is waiting to go away.  In other
words, you have a long-running (or long-idle) transaction on the slave
which is blocking recovery.


> max_standby_archive_delay = -1
> max_standby_streaming_delay = -1

...and you are willing to wait forever.
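
For illustration only, a finite limit on the slave might look like the fragment below. The 30s values are an assumption (they happen to be the PostgreSQL defaults), not something recommended in this thread.

```
# postgresql.conf on the slave (illustrative values)
max_standby_archive_delay = 30s      # was -1, i.e. wait forever for conflicting queries
max_standby_streaming_delay = 30s    # was -1, i.e. wait forever for conflicting queries
```

With a finite delay, the standby cancels conflicting queries once the limit is reached instead of pausing WAL replay indefinitely. The trade-off is that long-running queries on the slave can be cancelled.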

Cheers,

Jeff

Re: Streaming Replication Randomly Locking Up

From

Andrew Berman

Date:

15 August 2013, 23:28:29

Hi Jeff,

Here is the full process list at the time it stopped working (I have changed the actual username, db and IP for security). Would the idle in transaction process be the culprit?

postgres 5639 0.1 24.3 3428264 2970236 ? Ss Aug14 1:54 postgres: startup process recovering 000000010000053D0000003F waiting

postgres 5642 0.0 21.4 3428356 2613252 ? Ss Aug14 0:30 postgres: writer process

postgres 5659 0.0 0.0 177524 788 ? Ss Aug14 0:03 postgres: stats collector process

postgres 7159 1.2 0.1 3451360 18352 ? Ss Aug14 17:31 postgres: wal receiver process streaming 549/216B3730

postgres 10403 0.0 0.2 3430372 25920 ? Ss Aug14 0:31 postgres: user db x.x.x.x(61656) idle in transaction

postgres 19933 0.0 0.4 3426604 49564 ? S Aug05 0:06 /usr/pgsql-9.1/bin/postmaster -p 5432 -D /var/lib/pgsql/9.1/data

postgres 19935 0.0 0.0 175288 396 ? Ss Aug05 0:13 postgres: logger process

postgres 21133 0.0 0.2 3443600 30680 ? Ss 09:28 0:00 postgres: user db x.x.x.x(64430) idle

postgres 21134 0.4 0.2 3430160 27656 ? Ss 09:28 0:16 postgres: user db x.x.x.x(64431) idle

root 21529 0.0 0.0 103240 844 pts/0 S+ 10:33 0:00 grep --color postgres

Thanks,

Andrew


Re: Streaming Replication Randomly Locking Up

From

John DeSoi

Date:

16 August 2013, 18:39:14

On Aug 15, 2013, at 1:07 PM, Andrew Berman <rexxe98@gmail.com> wrote:

> I'm having an issue where streaming replication just randomly stops working.  I haven't been able to find anything in the logs which point to an issue, but the Postgres process shows a "waiting" status on the slave:
>
> postgres  5639  0.1 24.3 3428264 2970236 ?     Ss   Aug14   1:54 postgres: startup process   recovering 000000010000053D0000003F waiting
> postgres  5642  0.0 21.4 3428356 2613252 ?     Ss   Aug14   0:30 postgres: writer process
> postgres  5659  0.0  0.0 177524   788 ?        Ss   Aug14   0:03 postgres: stats collector process
> postgres  7159  1.2  0.1 3451360 18352 ?       Ss   Aug14  17:31 postgres: wal receiver process   streaming 549/216B3730
>
> The replication works great for days, but randomly seems to lock up and replication halts.  I verified that the two databases were out of sync with a query on both of them.  Has anyone experienced this issue before?
>
> Here are some relevant config settings:
>
> Master:
>
> wal_level = hot_standby
> checkpoint_segments = 32
> checkpoint_completion_target = 0.9
> archive_mode = on
> archive_command = 'rsync -a %p foo@foo:/var/lib/pgsql/9.1/wals/%f </dev/null'
> max_wal_senders = 2
> wal_keep_segments = 32

I recently posted about the same thing -- replication just stops after working OK for days or weeks, no errors in the logs on master or slave.

It appears I solved it by adding --timeout=30 to my rsync command. My guess was some kind of network hang and then rsync would just wait forever and never return.
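
Applied to the archive_command quoted above, that change would presumably look like this (the 30-second value is the one mentioned here; everything else is copied from the original post):

```
# postgresql.conf on the master (sketch based on the archive_command from the original post)
archive_command = 'rsync -a --timeout=30 %p foo@foo:/var/lib/pgsql/9.1/wals/%f </dev/null'
```

rsync's --timeout is an I/O timeout: if no data moves for that many seconds, rsync exits with a non-zero status. PostgreSQL treats a non-zero exit from archive_command as a failure and retries that segment later, so a hung transfer no longer blocks archiving forever.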

John DeSoi, Ph.D.

Re: Streaming Replication Randomly Locking Up

From

Andrew Berman

Date:

16 August 2013, 19:26:36

Awesome, I'll give that a shot, John.


Re: Streaming Replication Randomly Locking Up

From

Jeff Janes

Date:

16 August 2013, 19:45:44

On Thu, Aug 15, 2013 at 1:28 PM, Andrew Berman <rexxe98@gmail.com> wrote:
> Hi Jeff,
>
> Here is the full process list at the time it stopped working (I have changed
> the actual username, db and IP for security).  Would the idle in transaction
> process be the culprit?

Most likely, yes.  You should be able to dig into pg_locks to verify.


Cheers,

Jeff

Re: Streaming Replication Randomly Locking Up

From

Jeff Janes

Date:

16 August 2013, 20:12:24

On Fri, Aug 16, 2013 at 9:45 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Aug 15, 2013 at 1:28 PM, Andrew Berman <rexxe98@gmail.com> wrote:
>> Hi Jeff,
>>
>> Here is the full process list at the time it stopped working (I have changed
>> the actual username, db and IP for security).  Would the idle in transaction
>> process be the culprit?
>
> Most likely, yes.  You should be able to dig into pg_locks to verify.

Actually, you can't.  The waiting doesn't show up in pg_locks, because
it polls in a sleep-loop, rather than doing a normal wait on the lock.

Still, that idle in transaction process is almost surely the culprit.

Cheers,

Jeff

Re: Streaming Replication Randomly Locking Up

From

Andrew Berman

Date:

16 August 2013, 20:24:26

Ok, next time it happens I'll try to do more sleuthing to figure out if that's the issue. For now, I'm going to try adding --timeout=30 to the rsync command and see if that fixes things.

Thanks again for your help!

Andrew
