Home > mailing lists

Re: Mysterious table that exists but doesn't exist - Mailing list pgsql-general

From	Dale Fukami
Subject	Re: Mysterious table that exists but doesn't exist
Date	April 17, 2013 17:53:51
Msg-id	CAAHkhS4o00H+3816ALinmYGjc1Rh_gi7XLf70AdJOxB3=q63cA@mail.gmail.com Whole thread Raw
In response to	Re: Mysterious table that exists but doesn't exist (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Mysterious table that exists but doesn't exist
List	pgsql-general

Tree view

On Tue, Apr 16, 2013 at 3:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Dale Fukami <dale.fukami@gmail.com> writes:
> I'm having a problem on a standby server (streaming replication) where a
> table seems to exist but is not queryable. Essentially a select statement
> (and drop/insert/etc) fails but \d and pg_tables show it exists. The table
> exists on the master (and is queryable) and replication is still working in
> that changes to the master table don't cause errors on the standby and
> changes to other tables are replicated and verified to be on the standby.

That's peculiar. The most likely theories seem to be
(1) corruption in the standby's pg_class indexes, such that an index
search for relname = 'tracked_deductibles' fails to find the row,
although it is found by seqscans; or
(2) some type of visibility issue causing SnapshotNow operations to
think the row is invalid, though it is valid to MVCC queries.

Either way, if it's working on the master, then you've had a replication
failure since the standby's files evidently don't match the master's.

What PG version is this (and which versions have been installed since
the replication was set up)? Have you had any system-level crashes on
the standby?

My apologies: PostgreSQL 9.0.5 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 64-bit

This is the only version we've had since we've set up streaming replication.

We believe we've found the moment of corruption. It appears that the db was shutdown then restarted and didn't quite catch up to the master and then was shut down again. We use this standby to take snapshots on a regular basis but it looks like this one didn't quite get into a consistent state before the next one started. Logs pasted at the end of this email.

At this point I'm most concerned about:

1) We were quite lucky I think to have discovered this issue. I think the only other way we'd have noticed would have been if we'd failed over to it and our app stopped working. I'm worried now that we'll end up in a similar situation and won't have known it for many weeks. At some point we lose the ability to PITR based on how many backups are kept. Is there a way to be more confident in our standby machines?

2) Is there a way to recover from this? In the extreme case where we might have failed over to this standby, could we somehow figure out where exactly the corruption is and recover or are we talking about going back to an older snapshot and using our archived WALs? Note: we've already rebuilt this standby to be consistent with master last night. This is more a hypothetical at this point. We've kept the bad slave for experimentation in case anyone has thoughts on this.

Just to clear up the confusion that I had caused John. I had obscured the previous schema name by restoring from snapshots and altering the schema name itself. So, the queries and output are an exact copy/paste from my terminal. I get the same results on the Live standby when using the actual Live schema name but, obviously, it shows the Live schema name rather than 'someschema'.

Thanks,

Dale

Log snippet from what appears to be the time of corruption. We've confirmed this by restoring snapshots just prior and just following this time range. Note the lovely HINT that there may be corruption :)

2013-04-15 15:38:22 UTC @ DETAIL: last completed transaction was at log time 2013-04-15 15:37:07.621579+00

2013-04-15 15:38:22 UTC @ LOG: shutting down

2013-04-15 15:38:22 UTC @ LOG: database system is shut down

2013-04-15 15:39:21 UTC @ LOG: database system was shut down in recovery at 2013-04-15 15:38:22 UTC

2013-04-15 15:39:21 UTC [unknown]@[unknown] LOG: incomplete startup packet

cp: cannot stat `/var/lib/postgresql-archives/00000005.history': No such file or directory

2013-04-15 15:39:21 UTC @ LOG: entering standby mode

cp: cannot stat `/var/lib/postgresql-archives/000000050000164800000088': No such file or directory

2013-04-15 15:39:21 UTC postgres@postgres FATAL: the database system is starting up

2013-04-15 15:39:22 UTC postgres@postgres FATAL: the database system is starting up

cp: cannot stat `/var/lib/postgresql-archives/000000050000164800000086': No such file or directory

2013-04-15 15:39:22 UTC @ LOG: redo starts at 1648/86ED4360

cp: cannot stat `/var/lib/postgresql-archives/000000050000164800000087': No such file or directory

2013-04-15 15:39:23 UTC postgres@postgres FATAL: the database system is starting up

cp: cannot stat `/var/lib/postgresql-archives/000000050000164800000088': No such file or directory

cp: cannot stat `/var/lib/postgresql-archives/000000050000164800000089': No such file or directory

cp: cannot stat `/var/lib/postgresql-archives/00000005000016480000008A': No such file or directory

2013-04-15 15:39:23 UTC postgres@postgres FATAL: the database system is starting up

2013-04-15 15:39:23 UTC @ LOG: consistent recovery state reached at 1648/8A6460E0

2013-04-15 15:39:23 UTC @ LOG: invalid record length at 1648/8A6460E0

2013-04-15 15:39:23 UTC @ LOG: database system is ready to accept read only connections

cp: cannot stat `/var/lib/postgresql-archives/00000005000016480000008A': No such file or directory

2013-04-15 15:39:23 UTC @ LOG: streaming replication successfully connected to primary

2013-04-15 15:39:27 UTC @ LOG: restartpoint starting: xlog

2013-04-15 15:43:56 UTC @ LOG: restartpoint complete: wrote 13894 buffers (2.1%); write=48.569 s, sync=219.650 s, total=269.460 s

2013-04-15 15:43:56 UTC @ LOG: recovery restart point at 1648/8909ED50

2013-04-15 15:43:56 UTC @ DETAIL: last completed transaction was at log time 2013-04-15 15:40:48.449826+00

2013-04-15 15:43:56 UTC @ LOG: restartpoint starting: xlog time

2013-04-15 15:45:02 UTC @ LOG: received fast shutdown request

2013-04-15 15:45:02 UTC @ LOG: aborting any active transactions

2013-04-15 15:45:15 UTC @ FATAL: terminating walreceiver process due to administrator command

2013-04-15 15:45:25 UTC @ LOG: restartpoint complete: wrote 12759 buffers (1.9%); write=41.211 s, sync=47.518 s, total=88.759 s

2013-04-15 15:45:25 UTC @ LOG: recovery restart point at 1648/8FBEBAD8

2013-04-15 15:45:25 UTC @ DETAIL: last completed transaction was at log time 2013-04-15 15:44:30.43801+00

2013-04-15 15:45:25 UTC @ LOG: shutting down

2013-04-15 15:45:25 UTC @ LOG: restartpoint starting: shutdown immediate

2013-04-15 15:46:02 UTC @ LOG: received immediate shutdown request

2013-04-15 15:47:24 UTC @ LOG: database system was interrupted while in recovery at log time 2013-04-15 15:39:45 UTC

2013-04-15 15:47:24 UTC @ HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.

2013-04-15 15:47:24 UTC [unknown]@[unknown] LOG: incomplete startup packet

cp: cannot stat `/var/lib/postgresql-archives/00000005.history': No such file or directory

2013-04-15 15:47:24 UTC @ LOG: entering standby mode

cp: cannot stat `/var/lib/postgresql-archives/000000050000164800000091': No such file or directory

2013-04-15 15:47:24 UTC postgres@postgres FATAL: the database system is starting up

cp: cannot stat `/var/lib/postgresql-archives/00000005000016480000008F': No such file or directory

2013-04-15 15:47:25 UTC @ LOG: redo starts at 1648/8FBEBAD8

cp: cannot stat `/var/lib/postgresql-archives/000000050000164800000090': No such file or directory

2013-04-15 15:47:25 UTC postgres@postgres FATAL: the database system is starting up

cp: cannot stat `/var/lib/postgresql-archives/000000050000164800000091': No such file or directory

2013-04-15 15:47:25 UTC postgres@postgres FATAL: the database system is starting up

cp: cannot stat `/var/lib/postgresql-archives/000000050000164800000092': No such file or directory

cp: cannot stat `/var/lib/postgresql-archives/000000050000164800000093': No such file or directory

2013-04-15 15:47:26 UTC postgres@postgres FATAL: the database system is starting up

2013-04-15 15:47:26 UTC @ LOG: restartpoint starting: xlog

cp: cannot stat `/var/lib/postgresql-archives/000000050000164800000094': No such file or directory

2013-04-15 15:47:26 UTC postgres@postgres FATAL: the database system is starting up

cp: cannot stat `/var/lib/postgresql-archives/000000050000164800000095': No such file or directory

2013-04-15 15:47:27 UTC postgres@postgres FATAL: the database system is starting up

cp: cannot stat `/var/lib/postgresql-archives/000000050000164800000096': No such file or directory

2013-04-15 15:47:27 UTC postgres@postgres FATAL: the database system is starting up

cp: cannot stat `/var/lib/postgresql-archives/000000050000164800000097': No such file or directory

2013-04-15 15:47:28 UTC postgres@postgres FATAL: the database system is starting up

cp: cannot stat `/var/lib/postgresql-archives/000000050000164800000098': No such file or directory

2013-04-15 15:47:28 UTC postgres@postgres FATAL: the database system is starting up

cp: cannot stat `/var/lib/postgresql-archives/000000050000164800000099': No such file or directory

2013-04-15 15:47:29 UTC postgres@postgres FATAL: the database system is starting up

2013-04-15 15:47:29 UTC @ LOG: consistent recovery state reached at 1648/99E05F08

2013-04-15 15:47:29 UTC @ LOG: database system is ready to accept read only connections

cp: cannot stat `/var/lib/postgresql-archives/00000005000016480000009A': No such file or directory

2013-04-15 15:47:30 UTC @ LOG: unexpected pageaddr 1648/794E2000 in log file 5704, segment 154, offset 5120000

cp: cannot stat `/var/lib/postgresql-archives/00000005000016480000009A': No such file or directory

2013-04-15 15:47:30 UTC @ LOG: streaming replication successfully connected to primary

2013-04-15 15:52:34 UTC @ LOG: restartpoint complete: wrote 17096 buffers (2.6%); write=39.897 s, sync=264.845 s, total=307.746 s

2013-04-15 15:52:34 UTC @ LOG: recovery restart point at 1648/9196C3B0

2013-04-15 15:52:34 UTC @ DETAIL: last completed transaction was at log time 2013-04-15 15:48:44.049103+00

2013-04-15 15:52:34 UTC @ LOG: restartpoint starting: xlog time

2013-04-15 15:53:58 UTC @ LOG: restartpoint complete: wrote 27760 buffers (4.2%); write=43.305 s, sync=41.135 s, total=84.489 s

2013-04-15 15:53:58 UTC @ LOG: recovery restart point at 1648/A20E13E8

2013-04-15 15:53:58 UTC @ DETAIL: last completed transaction was at log time 2013-04-15 15:53:35.897601+00

2013-04-15 15:53:58 UTC @ LOG: restartpoint starting: xlog time

2013-04-15 15:55:04 UTC @ LOG: restartpoint complete: wrote 34572 buffers (5.3%); write=41.229 s, sync=24.578 s, total=65.855 s

2013-04-15 15:55:04 UTC @ LOG: recovery restart point at 1648/B1293B80

2013-04-15 15:55:04 UTC @ DETAIL: last completed transaction was at log time 2013-04-15 15:54:41.804552+00

2013-04-15 15:55:04 UTC @ LOG: restartpoint starting: xlog time

2013-04-15 15:55:58 UTC @ LOG: restartpoint complete: wrote 9283 buffers (1.4%); write=41.391 s, sync=12.263 s, total=53.805 s

2013-04-15 15:55:58 UTC @ LOG: recovery restart point at 1648/B371CFF0

2013-04-15 15:55:58 UTC @ DETAIL: last completed transaction was at log time 2013-04-15 15:55:35.627312+00

2013-04-15 15:55:58 UTC @ LOG: restartpoint starting: xlog

2013-04-15 15:56:51 UTC @ LOG: restartpoint complete: wrote 6459 buffers (1.0%); write=41.648 s, sync=11.185 s, total=52.856 s

2013-04-15 15:56:51 UTC @ LOG: recovery restart point at 1648/B6C1DF00

pgsql-general by date:

From: Moshe Jacobson
Date: 17 April 2013, 17:49:41
Subject: Re: Can you spot the difference?

From: Adrian Klaver
Date: 17 April 2013, 18:32:16
Subject: Re: Can you spot the difference?

Re: Mysterious table that exists but doesn't exist - Mailing list pgsql-general

Previous

Next