Thread: pg_stat_replication in 9.3

pg_stat_replication in 9.3

From
Torsten Förtsch
Date:
Hi,

I noticed a strange behaviour regarding pg_stat_replication in 9.3. If
called from psql using the \watch command, I see all my replicas. From
time to time one of them drops out and reconnects in a short period of
time, typically ~30 sec.

If I use the same select in plpgsql like:

  FOR r in SELECT application_name,
                  client_addr,
                  flush_location, clock_timestamp() AS lmd
             FROM pg_stat_replication
            ORDER BY application_name, client_addr
  LOOP
    RAISE NOTICE 'aname=%, ca=%, lmd=%, loc=%, cur=%, lag=%',
                 r.application_name, r.client_addr, r.lmd,
                 r.flush_location,
                 pg_current_xlog_location(),
                 pg_size_pretty(
                   pg_xlog_location_diff(
                     pg_current_xlog_location(),
                     r.flush_location
                   )
                 );
  END LOOP;

I see one of the replicas dropping out but never coming back again while
in a parallel session using psql and \watch it indeed does come back.

Is that intended?

Torsten


Re: pg_stat_replication in 9.3

From
Andy Colson
Date:
On 09/14/2014 07:03 AM, Torsten Förtsch wrote:
> Hi,
>
> I noticed a strange behaviour regarding pg_stat_replication in 9.3. If
> called from psql using the \watch command, I see all my replicas. From
> time to time one of them drops out and reconnects in a short period of
> time, typically ~30 sec.
>
> If I use the same select in plpgsql like:
>
>    FOR r in SELECT application_name,
>                    client_addr,
>                    flush_location, clock_timestamp() AS lmd
>               FROM pg_stat_replication
>              ORDER BY application_name, client_addr
>    LOOP
>      RAISE NOTICE 'aname=%, ca=%, lmd=%, loc=%, cur=%, lag=%',
>                   r.application_name, r.client_addr, r.lmd,
>                   r.flush_location,
>                   pg_current_xlog_location(),
>                   pg_size_pretty(
>                     pg_xlog_location_diff(
>                       pg_current_xlog_location(),
>                       r.flush_location
>                     )
>                   );
>    END LOOP;
>
> I see one of the replicas dropping out but never coming back again while
> in a parallel session using psql and \watch it indeed does come back.
>
> Is that intended?
>
> Torsten
>
>

I wonder if its a transaction thing?  Maybe \watch is using a transaction for each (or isn't using transactions at
all),whereas the plpgsql is one long transaction? 

Also if one of your replicas is far away, it doesn't really surprise me that it might loose connection every once and a
while. On the other hand, if the box is on the same subnet, right next to the master, and it was loosing connection,
thatwould be a bad thing. 

So, how far away is the replica?  And does 'ps ax|grep postgr' show 'idle' or 'idle in transaction' on the \watch and
theplpgsql? 

-Andy




Re: pg_stat_replication in 9.3

From
Torsten Förtsch
Date:
On 14/09/14 16:24, Andy Colson wrote:
> I wonder if its a transaction thing?  Maybe \watch is using a
> transaction for each (or isn't using transactions at all), whereas the
> plpgsql is one long transaction?
>
> Also if one of your replicas is far away, it doesn't really surprise me
> that it might loose connection every once and a while.  On the other
> hand, if the box is on the same subnet, right next to the master, and it
> was loosing connection, that would be a bad thing.
>
> So, how far away is the replica?  And does 'ps ax|grep postgr' show
> 'idle' or 'idle in transaction' on the \watch and the plpgsql?

The replicas are far away, intercontinental far. I am not complaining
that the replica looses the connection. What makes me wonder is that
within a transaction, pg_stat_replication can forget rows but cannot
acquire new ones. I'd think it should be either report the state at the
beginning of the transaction like now() or the current state like
clock_timestamp(). But currently it's reporting half the current state.

Torsten


Re: pg_stat_replication in 9.3

From
Tom Lane
Date:
=?UTF-8?B?VG9yc3RlbiBGw7ZydHNjaA==?= <torsten.foertsch@gmx.net> writes:
> The replicas are far away, intercontinental far. I am not complaining
> that the replica looses the connection. What makes me wonder is that
> within a transaction, pg_stat_replication can forget rows but cannot
> acquire new ones. I'd think it should be either report the state at the
> beginning of the transaction like now() or the current state like
> clock_timestamp(). But currently it's reporting half the current state.

Are you watching the state in a loop inside a single plpgsql function?
If so, I wonder whether the problem is that the plpgsql function's
snapshot isn't changing.  From memory, marking the function VOLATILE
would help if that's the issue.

            regards, tom lane


Re: pg_stat_replication in 9.3

From
Torsten Förtsch
Date:
On 14/09/14 18:55, Tom Lane wrote:
> Are you watching the state in a loop inside a single plpgsql function?
> If so, I wonder whether the problem is that the plpgsql function's
> snapshot isn't changing.  From memory, marking the function VOLATILE
> would help if that's the issue.

The function is VOLATILE. I attached 2 versions of it. fn-old.sql does
not work because once a slave has disconnected it drops out and does not
come back. fn.sql uses dblink to work around the problem. But it
consumes 2 db connections.

The intent of the function is to be called between operations that may
cause slaves to lag behind. If the lag is below a certain limit, it
simply returns. Otherwise, it waits until the lag drops below a second
limit.

If it were a VOLATILE problem, the functions would not be able to see
when a slave drops out nor changes in the data. But it does see these
changes. Only when a slave comes back online, it is not seen in the
current transaction.

Torsten

Attachment