Re: pg_xlog-files not deleted on a standby after accidental server crash - Mailing list pgsql-admin

From Paul Dunkler
Subject Re: pg_xlog-files not deleted on a standby after accidental server crash
Date
Msg-id 1606435B-21AB-418C-91DE-994966EA70A6@xyrality.com
Whole thread Raw
In response to Re: pg_xlog-files not deleted on a standby after accidental server crash  (naveen kumar <mnaveendba2@gmail.com>)
Responses Re: pg_xlog-files not deleted on a standby after accidental server crash
List pgsql-admin
Hi there,

thanks for your answer. We already have some "replay delay" checks implemented in Nagios, which show everything as fine for
this instance.

Anyway, I have tested all your commands (thanks, I didn't know some of them), but the result is still the same.
Replication is fine and is not lagging... but the pg_xlog content is growing and growing and growing.

[root@ct-db1b pg_xlog]# ls -la | wc -l
1479
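
In case it helps narrow things down, the 9.4 settings that should normally bound how many segments a standby keeps around can be checked with (values will obviously differ per setup):

postgres=# show checkpoint_segments;
postgres=# show wal_keep_segments;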


> I might be wrong, because I'm still fairly new to the PostgreSQL world, but you can try this.
>
> I guess your slave is running properly; I think applying all the WALs is just taking time.
>
> The commands below are useful for checking replication status.
>
> 1. pg_stat_replication view on master/primary server.
>
> postgres=# select * from pg_stat_replication ;

postgres=# select * from pg_stat_replication;
  pid  | usesysid | usename  | application_name | client_addr | client_hostname | client_port |        backend_start         | backend_xmin |   state   | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+----------+------------------+-------------+-----------------+-------------+------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
 15919 |       10 | postgres | walreceiver      | 172.16.0.2  |                 |       50495 | 2015-03-24 09:46:12.59108+00 |              | streaming | CA/CB1345F8   | CA/CB1345F8    | CA/CB1345F8    | CA/CB1345B8     |             0 | async
(1 row)

> 2. pg_is_in_recovery(): Function that tells whether the standby is still in recovery mode.
>
> postgres=# select pg_is_in_recovery();

postgres=# select pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 t
(1 row)

> 3. pg_last_xlog_receive_location(): Function that returns the location of the last transaction log streamed to the standby and also written to the standby's disk.
>
> postgres=# select pg_last_xlog_receive_location();
> 4. pg_last_xlog_replay_location(): Function that returns the location of the last transaction log replayed during the recovery process.
>
> postgres=# select pg_last_xlog_replay_location();
>
> 5. pg_last_xact_replay_timestamp(): This function returns the timestamp of the last transaction replayed during recovery. Below is an example:
>
> postgres=# select pg_last_xact_replay_timestamp();

postgres=# select pg_last_xlog_receive_location(); select pg_last_xlog_replay_location(); select pg_last_xact_replay_timestamp();
 pg_last_xlog_receive_location
-------------------------------
 CA/CAF27218
(1 row)

 pg_last_xlog_replay_location
------------------------------
 CA/CAF27218
(1 row)

 pg_last_xact_replay_timestamp
-------------------------------
 2015-03-25 09:54:54.766285+00
(1 row)


> 6. Lag in bytes: to calculate the lag in bytes, use the pg_stat_replication view on the master together with the pg_xlog_location_diff() function.
>
> postgres=# select pg_xlog_location_diff(pg_stat_replication.sent_location, pg_stat_replication.replay_location);

postgres=# SELECT client_hostname, client_addr, pg_xlog_location_diff(pg_stat_replication.sent_location, pg_stat_replication.replay_location) AS byte_lag FROM pg_stat_replication;
 client_hostname | client_addr | byte_lag
-----------------+-------------+----------
                 | 172.16.0.2  |    13024
(1 row)

> 7. Calculating lag in seconds. The following is the SQL most people use to find the lag in seconds:
>
> SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0
>        ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
> END AS log_delay;

postgres=# SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0
ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
END AS log_delay;
 log_delay
-----------
         0
(1 row)

> 8. If the slave is up in hot standby mode, you can tell the delay in seconds of the transactions applied on the slave with this query:
>
> postgres=# select now() - pg_last_xact_replay_timestamp() AS replication_delay;

postgres=# select now() - pg_last_xact_replay_timestamp() AS replication_delay;
 replication_delay
-------------------
 00:00:00.296506
(1 row)

> And please verify archive_cleanup_command in recovery.conf on slave.

I don't use this command on any slaves...
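
(For what it's worth: if a standby restores from a WAL archive, the documented way to prune that archive is an archive_cleanup_command in recovery.conf, along the lines of the following, where the archive path is just a placeholder:

archive_cleanup_command = 'pg_archivecleanup /path/to/archivedir %r'

But as far as I understand it, that cleans the archive directory, not the standby's own pg_xlog, so it wouldn't explain this growth anyway.)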

>
>
> Thanks & Regards,
> M Naveen Kuamr,
> PostgreSQL Database Administrator,
> Mobile Number: +91 7755929449.
>
> On Tue, Mar 24, 2015 at 9:47 PM, Paul Dunkler <paul.dunkler@xyrality.com> wrote:
> Hi there,
>
> i am currently encountering a weird problem:
>
> 3 days ago we had an accidental server crash on one of our database servers (the system just restarted).
> After it was up again, I restarted the databases (we are running 4 instances of Postgres on a 512 GB / 64-processor server). All databases came up fine again.
> Afterwards I restarted the slave Postgres instances (running on another physical server); they recovered correctly and went ahead with receiving WAL via streaming replication. All fine so far.
>
> Then I realized that on only 1 of the 4 slave instances, the WAL file count under data/pg_xlog was rising and rising; it seems like WAL files are not reused/deleted by Postgres.
> Afterwards I tried different things:
> 1. Restarted the slave multiple times, without any effect on the xlog behaviour
> 2. Set up the slave again from a fresh base backup of the master instance
> 3. Checked file permissions, and even tried manually deleting a WAL file under data/pg_xlog (which worked)
>
> Now I'm running out of ideas why these files are not reused/deleted anymore.
> I don't see any errors in the logfiles (neither on the slave nor on the master).
>
> The slave is only showing normal "restartpoint" logs:
>> Mar 24 10:51:10 ct-db1b ct-db1-1[65432]: [3-1] db=, user=, host= LOG:  restartpoint starting: time
>> Mar 24 10:53:42 ct-db1b ct-db1-1[65432]: [4-1] db=, user=, host= LOG:  restartpoint complete: wrote 21264 buffers (0.4%); 0 transaction log file(s) added, 0 removed, 16 recycled; write=149.345 s, sync=2.217 s, total=151.682 s; sync files=1156, longest=0.213 s, average=0.001 s
>> Mar 24 10:53:42 ct-db1b ct-db1-1[65432]: [5-1] db=, user=, host= LOG:  recovery restart point at C3/E9AD3258
>> Mar 24 10:53:42 ct-db1b ct-db1-1[65432]: [5-2] db=, user=, host= DETAIL:  last completed transaction was at log time 2015-03-24 09:53:42.247769+00
>> Mar 24 10:56:10 ct-db1b ct-db1-1[65432]: [6-1] db=, user=, host= LOG:  restartpoint starting: time
>> Mar 24 10:58:41 ct-db1b ct-db1-1[65432]: [7-1] db=, user=, host= LOG:  restartpoint complete: wrote 6402 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=149.516 s, sync=1.101 s, total=150.757 s; sync files=729, longest=0.160 s, average=0.001 s
>> Mar 24 10:58:41 ct-db1b ct-db1-1[65432]: [8-1] db=, user=, host= LOG:  recovery restart point at C3/EDA836D8
>> Mar 24 10:58:41 ct-db1b ct-db1-1[65432]: [8-2] db=, user=, host= DETAIL:  last completed transaction was at log time 2015-03-24 09:58:40.611415+00
>> Mar 24 11:01:10 ct-db1b ct-db1-1[65432]: [9-1] db=, user=, host= LOG:  restartpoint starting: time
>> Mar 24 11:03:41 ct-db1b ct-db1-1[65432]: [10-1] db=, user=, host= LOG:  restartpoint complete: wrote 11746 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 5 recycled; write=149.652 s, sync=0.936 s, total=150.725 s; sync files=672, longest=0.147 s, average=0.001 s
>> Mar 24 11:03:41 ct-db1b ct-db1-1[65432]: [11-1] db=, user=, host= LOG:  recovery restart point at C3/F046DA18
>> Mar 24 11:03:41 ct-db1b ct-db1-1[65432]: [11-2] db=, user=, host= DETAIL:  last completed transaction was at log time 2015-03-24 10:03:40.914148+00
>> Mar 24 11:06:10 ct-db1b ct-db1-1[65432]: [12-1] db=, user=, host= LOG:  restartpoint starting: time
>> Mar 24 11:08:41 ct-db1b ct-db1-1[65432]: [13-1] db=, user=, host= LOG:  restartpoint complete: wrote 11155 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 4 recycled; write=149.780 s, sync=1.184 s, total=151.080 s; sync files=758, longest=0.184 s, average=0.001 s
>> Mar 24 11:08:41 ct-db1b ct-db1-1[65432]: [14-1] db=, user=, host= LOG:  recovery restart point at C3/F3C51B40
>> Mar 24 11:08:41 ct-db1b ct-db1-1[65432]: [14-2] db=, user=, host= DETAIL:  last completed transaction was at log time 2015-03-24 10:08:41.087379+00
>> Mar 24 11:11:10 ct-db1b ct-db1-1[65432]: [15-1] db=, user=, host= LOG:  restartpoint starting: time
>> Mar 24 11:13:41 ct-db1b ct-db1-1[65432]: [16-1] db=, user=, host= LOG:  restartpoint complete: wrote 12602 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=149.632 s, sync=1.309 s, total=151.067 s; sync files=728, longest=0.203 s, average=0.001 s
>> Mar 24 11:13:41 ct-db1b ct-db1-1[65432]: [17-1] db=, user=, host= LOG:  recovery restart point at C3/F93AC8F0
>> Mar 24 11:13:41 ct-db1b ct-db1-1[65432]: [17-2] db=, user=, host= DETAIL:  last completed transaction was at log time 2015-03-24 10:13:41.168975+00
>> Mar 24 11:16:10 ct-db1b ct-db1-1[65432]: [18-1] db=, user=, host= LOG:  restartpoint starting: time
>> Mar 24 11:18:41 ct-db1b ct-db1-1[65432]: [19-1] db=, user=, host= LOG:  restartpoint complete: wrote 10234 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=149.411 s, sync=1.175 s, total=150.703 s; sync files=821, longest=0.142 s, average=0.001 s
>> Mar 24 11:18:41 ct-db1b ct-db1-1[65432]: [20-1] db=, user=, host= LOG:  recovery restart point at C3/FDD61320
>> Mar 24 11:18:41 ct-db1b ct-db1-1[65432]: [20-2] db=, user=, host= DETAIL:  last completed transaction was at log time 2015-03-24 10:18:41.052593+00
>> Mar 24 11:21:10 ct-db1b ct-db1-1[65432]: [21-1] db=, user=, host= LOG:  restartpoint starting: time
>> Mar 24 11:23:40 ct-db1b ct-db1-1[65432]: [22-1] db=, user=, host= LOG:  restartpoint complete: wrote 9105 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=149.598 s, sync=0.842 s, total=150.579 s; sync files=687, longest=0.061 s, average=0.001 s
>> Mar 24 11:23:40 ct-db1b ct-db1-1[65432]: [23-1] db=, user=, host= LOG:  recovery restart point at C4/290A078
>> Mar 24 11:23:40 ct-db1b ct-db1-1[65432]: [23-2] db=, user=, host= DETAIL:  last completed transaction was at log time 2015-03-24 10:23:40.446308+00
>
> Postgres-Version:
>
>> postgres=# select version();
>>                                                    version
>> --------------------------------------------------------------------------------------------------------------
>>  PostgreSQL 9.4.1 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4), 64-bit
>> (1 row)
>
> I would really appreciate somebody's advice on this.
>
> Thanks in advance.
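
One more check that might be worth doing: mapping the latest restart point from the log above (e.g. C3/E9AD3258) to its segment file name, to see how far behind the oldest files in pg_xlog are. If I read the docs right, pg_xlogfile_name() refuses to run during recovery, so this would have to be executed on the master:

postgres=# select pg_xlogfile_name('C3/E9AD3258');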

--
Mit freundlichen Grüßen / Kind regards

Paul Dunkler

