Re: pgbackrest - question about restoring cluster to a new clusteron same server - Mailing list pgsql-general

From Stephen Frost
Subject Re: pgbackrest - question about restoring cluster to a new clusteron same server
Date
Msg-id 20190919141739.GP6962@tamriel.snowman.net
Whole thread Raw
In response to Re: pgbackrest - question about restoring cluster to a new cluster onsame server  (Ron <ronljohnsonjr@gmail.com>)
Responses Re: pgbackrest - question about restoring cluster to a new cluster onsame server
List pgsql-general
Greetings,

* Ron (ronljohnsonjr@gmail.com) wrote:
> On 9/18/19 8:58 PM, David Steele wrote:
> >On 9/18/19 9:40 PM, Ron wrote:
> >>I'm concerned with one pgbackrest process stepping over another one and
> >>the restore (or the "pg_ctl start" recovery phase) accidentally
> >>corrupting the production database by writing WAL files to the original
> >>cluster.
> >This is not an issue unless you seriously game the system.  When a
> >cluster is promoted it selects a new timeline and all WAL will be
> >archived to the repo on that new timeline.  It's possible to promote a
> >cluster without a timeline switch by tricking it but this is obviously a
> >bad idea.
>
> What's a timeline switchover?

Put simply, it's a branch off of the current WAL stream on to a new WAL
stream and it happens whenever there's a promotion.

Forgive the ASCII art, but-

----A---> timeline 1, where things start
     \-----> a promotion happened at time A, new WAL is on timeline 2

Consider an async replication scenario, where the network on the primary
is lost but it keeps writing out WAL and accepting new commits, but at a
time "A" we give up on it and promote the replica, so the replica
switches to timeline 2 and starts accepting writes.  Now we are in a
situation where two systems are generating WAL (the network partitioned
old primary, and the replica-now-primary).  Having the promotion switch
to a timeline makes it clear where that promotion happened and where the
replica-now-primary's WAL stream started.

This is actually what pg_rewind is based around too- to re-master the
old primary, it'll find that split point A and "rewind" (well, not
really, because it just grabs the pages, but whatever) the old primary
back to A and then the old primary can follow the new primary on
timeline 2.

> >So, if you promote the new cluster and forget to disable archive_command
> >there will be no conflict because the clusters will be generating WAL on
> >separate timelines.
>
> No cluster promotion even contemplated.

Ah, but you are talking about a cluster promotion, though you don't
realize it.  Any time there is a "at some point, I was to stop replaying
WAL and start accepting new changes", there's a timeline switch and
notionally a promotion.

> The point of the exercise would be to create an older copy of the cluster --
> while the production cluster is still running, while production jobs are
> still pumping data into the production database -- from before the time of
> the data loss, and query it in an attempt to recover the records which were
> deleted.

Sure, that's all entirely possible and shouldn't be an issue.  When you
go through the restore process and specify a point where you want the
restore to stop, so that you can connect and pull the down the table,
when PG reaches that point it'll promote and do a timeline switch.

Now, if you don't actually want that restore to promote and come up as a
system that you can write to, you could instead say 'pause', and then
connect to the database and grab whatever data you needed.  That should
also avoid the concern around archive command, provided you never
actually let that system finish recovery and instead just shut it down
while it's still read-only.

If you want to play around with this stuff and see what happens with a
promote, or try doing a pause instead, you might be interested in:

https://learn.crunchydata.com/

and specifically the pgbackrest one:

https://learn.crunchydata.com/pg-administration/courses/basic-postgresql-for-dbas/pgbackrest/

Basically, it's kind of like a blog post where you can play around on a
scratch system that's built into the page and click through the steps to
see what happens, and change things around if you want.

Thanks,

Stephen

Attachment

pgsql-general by date:

Previous
From: Tom Lane
Date:
Subject: Re: n_live_tup count increase after vacuum
Next
From: Adrian Klaver
Date:
Subject: Re: PGPASSWORD in crypted form, for example BlowFish or SHA-256