Thread: pg_stat_replication in 9.3
Hi, I noticed a strange behaviour regarding pg_stat_replication in 9.3. If called from psql using the \watch command, I see all my replicas. From time to time one of them drops out and reconnects in a short period of time, typically ~30 sec. If I use the same select in plpgsql like: FOR r in SELECT application_name, client_addr, flush_location, clock_timestamp() AS lmd FROM pg_stat_replication ORDER BY application_name, client_addr LOOP RAISE NOTICE 'aname=%, ca=%, lmd=%, loc=%, cur=%, lag=%', r.application_name, r.client_addr, r.lmd, r.flush_location, pg_current_xlog_location(), pg_size_pretty( pg_xlog_location_diff( pg_current_xlog_location(), r.flush_location ) ); END LOOP; I see one of the replicas dropping out but never coming back again while in a parallel session using psql and \watch it indeed does come back. Is that intended? Torsten
On 09/14/2014 07:03 AM, Torsten Förtsch wrote: > Hi, > > I noticed a strange behaviour regarding pg_stat_replication in 9.3. If > called from psql using the \watch command, I see all my replicas. From > time to time one of them drops out and reconnects in a short period of > time, typically ~30 sec. > > If I use the same select in plpgsql like: > > FOR r in SELECT application_name, > client_addr, > flush_location, clock_timestamp() AS lmd > FROM pg_stat_replication > ORDER BY application_name, client_addr > LOOP > RAISE NOTICE 'aname=%, ca=%, lmd=%, loc=%, cur=%, lag=%', > r.application_name, r.client_addr, r.lmd, > r.flush_location, > pg_current_xlog_location(), > pg_size_pretty( > pg_xlog_location_diff( > pg_current_xlog_location(), > r.flush_location > ) > ); > END LOOP; > > I see one of the replicas dropping out but never coming back again while > in a parallel session using psql and \watch it indeed does come back. > > Is that intended? > > Torsten > > I wonder if its a transaction thing? Maybe \watch is using a transaction for each (or isn't using transactions at all),whereas the plpgsql is one long transaction? Also if one of your replicas is far away, it doesn't really surprise me that it might loose connection every once and a while. On the other hand, if the box is on the same subnet, right next to the master, and it was loosing connection, thatwould be a bad thing. So, how far away is the replica? And does 'ps ax|grep postgr' show 'idle' or 'idle in transaction' on the \watch and theplpgsql? -Andy
On 14/09/14 16:24, Andy Colson wrote: > I wonder if its a transaction thing? Maybe \watch is using a > transaction for each (or isn't using transactions at all), whereas the > plpgsql is one long transaction? > > Also if one of your replicas is far away, it doesn't really surprise me > that it might loose connection every once and a while. On the other > hand, if the box is on the same subnet, right next to the master, and it > was loosing connection, that would be a bad thing. > > So, how far away is the replica? And does 'ps ax|grep postgr' show > 'idle' or 'idle in transaction' on the \watch and the plpgsql? The replicas are far away, intercontinental far. I am not complaining that the replica looses the connection. What makes me wonder is that within a transaction, pg_stat_replication can forget rows but cannot acquire new ones. I'd think it should be either report the state at the beginning of the transaction like now() or the current state like clock_timestamp(). But currently it's reporting half the current state. Torsten
=?UTF-8?B?VG9yc3RlbiBGw7ZydHNjaA==?= <torsten.foertsch@gmx.net> writes: > The replicas are far away, intercontinental far. I am not complaining > that the replica looses the connection. What makes me wonder is that > within a transaction, pg_stat_replication can forget rows but cannot > acquire new ones. I'd think it should be either report the state at the > beginning of the transaction like now() or the current state like > clock_timestamp(). But currently it's reporting half the current state. Are you watching the state in a loop inside a single plpgsql function? If so, I wonder whether the problem is that the plpgsql function's snapshot isn't changing. From memory, marking the function VOLATILE would help if that's the issue. regards, tom lane
On 14/09/14 18:55, Tom Lane wrote: > Are you watching the state in a loop inside a single plpgsql function? > If so, I wonder whether the problem is that the plpgsql function's > snapshot isn't changing. From memory, marking the function VOLATILE > would help if that's the issue. The function is VOLATILE. I attached 2 versions of it. fn-old.sql does not work because once a slave has disconnected it drops out and does not come back. fn.sql uses dblink to work around the problem. But it consumes 2 db connections. The intent of the function is to be called between operations that may cause slaves to lag behind. If the lag is below a certain limit, it simply returns. Otherwise, it waits until the lag drops below a second limit. If it were a VOLATILE problem, the functions would not be able to see when a slave drops out nor changes in the data. But it does see these changes. Only when a slave comes back online, it is not seen in the current transaction. Torsten