Thread: Stand by server (9.6.6) with corrupt file

Stand by server (9.6.6) with corrupt file

From

Edson Carlos Ericksson Richter

Date:

22 December 2017, 19:29:39

Environment: PostgreSQL 9.6.6 installed from yum repository. Oracle Linux 7 EL x64. Dell servers with Raid 5 (hw).

I was testing our database backup system (based on pgBarman), and discovered that one base file is corrupt in our standby database server. The file is OK in master server, but has 0 bytes in size in standby server.

Looking master and standby servers there is no indication that the problem exists - replication is running fine.

Evidences:

On master server:

[root@server2 1106839]# find 6302536 -exec stat \{\} \;

  File: “6302536”

  Size: 16793600        Blocks: 32800      IO Block: 4096   arquivo comum

Device: f902h/63746d    Inode: 10618465    Links: 1

Access: (0600/-rw-------)  Uid: (   26/postgres)   Gid: (   26/postgres)

Access: 2017-12-08 21:35:38.670841051 -0200

Modify: 2017-12-21 22:51:40.706074439 -0200

Change: 2017-12-21 22:51:40.706074439 -0200

 Birth: -

On standby server:

[root@server3 1106839]# find 6302536 -exec stat \{\} \;

  File: “6302536”

  Size: 0               Blocks: 0          IO Block: 4096   arquivo comum vazio

Device: f901h/63745d    Inode: 391519656   Links: 1

Access: (0600/-rw-------)  Uid: (   26/postgres)   Gid: (   26/postgres)

Access: 2017-12-09 15:50:47.469135640 -0200

Modify: 2017-12-09 15:50:47.469135640 -0200

Change: 2017-12-09 15:50:47.469135640 -0200

 Birth: -

After long investigation, I discovered that if I execute a query on standby server:

< 2017-12-22 11:20:22.417 -02 > ERROR: could not read block 0 in file "base/1106839/6302536": read only 0 of 8192 bytes
< 2017-12-22 11:20:22.417 -02 > STATEMENT: SELECT *
          FROM MY_FAIR_LARGE_TABLE t1

         LEFT OUTER JOIN MY_FAIR_LARGE_SUBTABLE t0 ON (t0.the_id = t1.ID)
                   WHERE (((t1.COMPANY_ID = 2)
                     AND t1.OTHERCOMPANY LIKE '20147617%')
                     AND (t1.TEST_FLAG = 0))
        ORDER BY t1.DUE_DATE LIMIT 1000 OFFSET 0

Very same query on server works fine.

And there is no replication error - everything is in sync between these two servers (I know, I'm begin to be repetitive).

I've about 30 servers with same setup, and this only has this flaw. The only difference is that this database is about 3 times larger than the others (about 90Gb in size).

Server and slave have 23ms of network lag - which seems not be a problem for the other databases in the same server.

Any advice?

Edson Carlos Ericksson Richter SimKorp Ltda
Fone:	(51) 3366-7964

"A mente que se abre a uma nova ideia jamais voltará ao seu tamanho original" - Albert Einstein

Attachment

pfihinpkpebggjkh.png

Re: Stand by server (9.6.6) with corrupt file

From

Stephen Frost

Date:

22 December 2017, 19:42:33

Greetings,

* Edson Carlos Ericksson Richter (richter@simkorp.com.br) wrote:
> I was testing our database backup system (based on pgBarman), and
> discovered that one base file is corrupt in our standby database
> server. The file is OK in master server, but has 0 bytes in size in
> standby server.
[...]
> Any advice?

Have you checked if it's an unlogged table?

Thanks!

Stephen

Attachment

signature.asc

Re: Stand by server (9.6.6) with corrupt file

From

Edson Carlos Ericksson Richter

Date:

23 December 2017, 00:40:12

No, it is a normal table.

Edson Carlos Ericksson Richter SimKorp Ltda
Fone:	(51) 3366-7964

"A mente que se abre a uma nova ideia jamais voltará ao seu tamanho original" - Albert Einstein

Em 22/12/2017 11:42, Stephen Frost escreveu:

Greetings,

* Edson Carlos Ericksson Richter (richter@simkorp.com.br) wrote:

I was testing our database backup system (based on pgBarman), and
discovered that one base file is corrupt in our standby database
server. The file is OK in master server, but has 0 bytes in size in
standby server.

[...]

Any advice?

Have you checked if it's an unlogged table?

Thanks!

Stephen

Attachment

hldedmfnlebjhgif.png

Re: Stand by server (9.6.6) with corrupt file

From

Martin Marques

Date:

23 December 2017, 17:58:42

El 22/12/17 a las 10:29, Edson Carlos Ericksson Richter escribió:
> Environment: PostgreSQL 9.6.6 installed from yum repository. Oracle
> Linux 7 EL x64. Dell servers with Raid 5 (hw).
> 
> 
> I was testing our database backup system (based on pgBarman), and
> discovered that one base file is corrupt in our standby database server.
> The file is OK in master server, but has 0 bytes in size in standby server.

How was the standby built? Did you use a recovered backup from barman or
did you clone from the master with some tool. An if it's the former,
which exact process did you use? (Using pg_basebackup would be the
safest way of cloning)

Regards,

-- 
Martín Marqués                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services