Thread: Re: Re: xReader, double-effort (was: Temporary tables under hot standby)

Re: Re: xReader, double-effort (was: Temporary tables under hot standby)

From

"Kevin Grittner"

Date:

29 April 2012, 11:27:53

Tom Lane  wrote:
> "Kevin Grittner"  writes:
>> Well, the xReader approach (Aakash's GSoC project) is to serve as
>> a proxy for a WAL stream going to a hot standby, to interpret each
>> incoming WAL record by "cheating" and querying the HS before
>> passing the WAL along, and then using sync rep to the HS
>> regardless of whether the incoming stream is sync or async. What
>> reliability problem do you see with this approach?
>
> Well, first off, it requires a replica that *is* a physical replica
> (else it won't have the same table OIDs, for instance). I'm not
> sure I get the point of logical replication that requires a
> physical replica as a prerequisite.
Maybe I can help with that by describing what the Wisconsin court
system does for circuit court data.  Due to state law, state supreme
court rules, and the general suckiness of the WAN available to us,
each of Wisconsin's 72 counties has a database server in the county
seat which is the "official" data of record.  We have two parallel
forms of backup -- WAL-based and logical.
For WAL-based replication we maintain a copy of the last two weekly
base backups and all WAL files needed to recover from either of them
to any point in time.  One copy of this is required to be in the
county, on a separate system with separate drives.  The other copy is
maintained centrally.  One rather large server has a hot standby of
all county databases, currently fed through copying WAL files
(although we plan to move to streaming this year).  These "physical"
backups are the primary source for recovering from catastrophic
failure of a county server.
We have software to "top off" the recovered database with
transactions from the logical transaction stream as needed, since
that is more up-to-date.  We have also found this combination of
physical and logical transactions useful when someone trying to do a
data fix with direct SQL statements against a production database
mangled things badly.  We use PITR recovery up to just before the
damage, then apply logical transactions from that point forward,
skipping the problem transactions.
Until recently our logical transaction stream was generated from our
Java middle tier running on the database servers.  This has recently
been converted to use a generalized PostgreSQL trigger, written in C,
which is attached to all tables for which we want logical replication
(about 90% of them).  The trigger captures data into transaction
tables in the database as part of the same transaction they are
representing, so that the generation of the logical transaction data
is atomic with the represented data.  We didn't switch to one of the
existing solutions because of all the code downstream which uses the
logical transaction streams -- we didn't want to change the format or
delivery method of the logical steam because it's simply not feasible
to change everything at once.
The way the logical stream is currently delivered is still through
the aforementioned Java middle tier.  It reads the logical
transaction log tables and sends the same data in non-persistent JMS
messages through queues to the "TRAN client" at the central location,
just as we did for years before we ever heard of PostgreSQL.
Normally, the TRAN client reads each logical transaction once, and
feeds many "replication targets" at the central site.  These include: -  four "Central Information Repository" (CIR)
databaseseach of    which contains data from all 72 counties  (two machines each at    two different sites)
 
 -  four "Database Transaction Repository" (DTR) databases on the    same servers as the CIR databases (but separate
RAIDs) This is    used as a supplementary source for recovery (as mentioned    above), as well as for auditing what
happenedon a case when    there is any question, and certain activity reports.
 
 -  a filter to identify transactions of possible interest to the    publishers we use for interfaces to other agencies
(district   attorneys, public defenders, department of corrections, police    agencies, social service agencies, etc.)
Transactionsof    interest are queued for closer review by the publisher software,    which actually posts messages if
thetransaction is found to    actually be of interest.
 
 -  other databases for the state court's "central services" for    such things as court interpreters used for
non-Englishspeakers    in the various counties
 
The number and varied nature of the targets is significant, because
the "fan out" and timing differences need to be considered.  The TRAN
client normally receives one copy of the logical data for each
transaction, and about 99% of the time the transaction's data is
replicated to all databases (and queued for review by publishers)
within a second of someone pressing Enter 300 miles away.  It's that
other 1% that gets really interesting.
To deal with the possibility that one or more central targets were
down or running very slowly, the TRAN client will give up on keeping
them at the same point in the transaction stream after a
(configurable) delay.  Since this replication is important for
disaster recovery, we don't want problems on one server to keep
others from staying up-to-date.  As currently configured, for every
set of targets which falls behind, there is a redundant stream of
logical transactions across the WAN.
Let's talk about the WAN.  We're somewhat "compelled" to use the WAN
offered by the state's Department of Administration.  It is
expensive, so we use the slowest speed that will work for each
county.  That can be as low as 750 Mbps up to 10 Gbps, and even at
those pitiful rates, this was costing us (last I heard) over $80,000
per month -- so boosting WAN speed is a last resort.  Besides being
the conduit for both the WAL and logical transaction streams, this
same WAN is used for file server backups, access to the Internet for
court staff, and inter-county database access (such as District Court
Administrators or Chief Judges access all counties in their
respective districts).
Sometimes a WAN pipe gets saturated and the transaction streams for a
county fall behind.  This can be aggravated by other traffic (such as
when all court employees in a county are asked to view a training
video via the Internet), or by large raster-scanned documents being
added to the database.  In these cases the "let the lagging target
spawn its own transaction stream" approach currently becomes
pathological.  On the saturated connection an ACK of a message from
one or more targets may be delayed long enough that it splits the
stream and makes a bad situation even worse.
We have an item on our annual plan to try to address these problems,
which two people were assigned to start working on next month.  The
xReader technology proposed by Aakash is so much better than what we
otherwise can do, that when management heard about it, they delayed
that project in hopes that by waiting a few months, it can be done
with xReader.  The goal is to move to streaming replication and have
that be the one stream of transaction data flowing over the WAN from
the counties to the central location.  From this stream we want to
generate the logical transactions to feed all the other systems,
while maintaining the WAL-based HS as the primary source for disaster
recovery.  It is important that, at least initially, we can produce
logical transactions in the format accepted by all these other
internal systems.
The plan *was* to continue to fire the triggers to capture the data
on the county servers, but read those tables on the HS.  That has
many obvious disadvantages compared to either of the proposed logical
replication systems.  But for our purposes, generating logical
transactions at the central site rather than at the original database
is far superior, in that it means we only send the data over the WAN
once, not twice.
> Next, it breaks immediately in the face of DDL:
>
>     CREATE TABLE foo (f1 int);
>     BEGIN;
>     ALTER TABLE foo ALTER COLUMN f1 RENAME TO f2;
>     INSERT INTO foo (f2) VALUES (1);
>     COMMIT;
>
> The standby is not going to think that the ALTER is committed, so
> it will not report the right column name when it comes time to
> translate the INSERT. Actually, you can break it even more easily
> than that:
>
>     CREATE TABLE bar AS SELECT ...
>
> What will you do with the insertions executed by this CREATE? They
> have to be executed before the creation of table bar is committed.
>
> Also, you'd need a query connection per database (at least, maybe
> one per session if you were trying to track DDL effects), which
> seems rather a lot of load on the HS slave. That together with the
> requirement for synchronous operation seems absolutely catastrophic
> from a performance standpoint.
Good points, but I think they all yield to one solution.  If we can
generate the appropriate snapshot based on the WAL record we are
looking at, those problems all evaporate, don't they?  It seems to me
that we could build on the "shared snapshot" work to provide a means
to this end.
> If you want to spend the summer building a toy, fine, but I don't
> see this going anywhere for production purposes.
I, the whole DBA group here, and the managers are very excited about
putting such a technology into production here.  Enough to probably
be willing to find a way to do it under version 9.1 this year if at
all possible.
> It would be enormously more performant for the master to be
> emitting logical replication records to start with, since it
> already has all the right names etc at hand at basically no cost.
Not when the consumers are across a WAN, and that WAN is the biggest
performance bottleneck and the most expensive resource involved.
I will cheerfully agree that both approaches have valuable use-cases.
-Kevin

Re: Re: xReader, double-effort (was: Temporary tables under hot standby)

From

Simon Riggs

Date:

29 April 2012, 14:09:21

On Sun, Apr 29, 2012 at 3:27 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

>> It would be enormously more performant for the master to be
>> emitting logical replication records to start with, since it
>> already has all the right names etc at hand at basically no cost.
>
> Not when the consumers are across a WAN, and that WAN is the biggest
> performance bottleneck and the most expensive resource involved.

I agree that the WAN is important, for both bandwidth and response time.

Though it isn't a given that logical change records (LCRs) will
require more bandwidth than physical WAL. WAL contains full page
images, index changes and other information that would be absent from
the LCR stream. It also depends upon the specification of the LCRs -
what metadata is included and whether the LCRs use text or binary.
Those choices have other impacts as well, so measurements and detailed
analysis is required to justify how to proceed. Which is what is in
progress now.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Re: xReader, double-effort (was: Temporary tables under hot standby)

From

Jim Nasby

Date:

03 May 2012, 01:25:01

On 4/29/12 9:27 AM, Kevin Grittner wrote:
> Maybe I can help with that by describing what the Wisconsin court
> system does for circuit court data.

Thanks for the write-up, it was insightful.

One thing I wanted to mention is that non-binary replication has an added advantage over binary from a DR standpoint:
ifcorruption occurs on a master it is more likely to make it into your replicas thanks to full page writes. You might
wantto consider that depending on how sensitive your data is.

-- 
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net

Re: Re: xReader, double-effort (was: Temporary tables under hot standby)

From

Josh Berkus

Date:

03 May 2012, 16:05:32

> One thing I wanted to mention is that non-binary replication has an
> added advantage over binary from a DR standpoint: if corruption occurs
> on a master it is more likely to make it into your replicas thanks to
> full page writes. You might want to consider that depending on how
> sensitive your data is.

Yeah, we've seen this a few times.  We just recently had to rescue a
client from HS-wide corruption using Slony.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: Re: xReader, double-effort (was: Temporary tables under hot standby)

From

"Kevin Grittner"

Date:

03 May 2012, 16:31:39

Josh Berkus <josh@agliodbs.com> wrote:
>> One thing I wanted to mention is that non-binary replication has
>> an added advantage over binary from a DR standpoint: if
>> corruption occurs on a master it is more likely to make it into
>> your replicas thanks to full page writes. You might want to
>> consider that depending on how sensitive your data is.
> 
> Yeah, we've seen this a few times.  We just recently had to rescue
> a client from HS-wide corruption using Slony.
That's an interesting point.  Out of curiosity, how did the
corruption originate?
It suggests a couple questions:
(1)  Was Slony running before the corruption occurred?  If not, how
was Slony helpful?  I know that in our environment, where we have
both going through separate streams, with a repository of the
logical transactions, we would use PITR recovery to get to the
latest known good state which we could easily identify, and then
replay the logical transactions to "top it off" to get current.  If
necessary we could skip logical transactions which were problematic
results of the corruption.
(2)  If logical transactions had been implemented as additions to
the WAL stream, and Slony was using that, do you think they would
still have been usable for this recovery?
Perhaps sending both physical and logical transaction streams over
the WAN isn't such a bad thing, if it gives us more independent
recovery mechanisms.  That's fewer copies than we're sending with
current trigger-based techniques.  It would be particularly
attractive is we could omit (filter out) certain tables before going
across the WAN.  I would be willing to risk sending the big
raster-scanned documents through just the physical channel so long
as I had a nightly compare of md5sum values on both sides so we can
resend any corrupted data (or tell people to rescan).
-Kevin

Re: Re: xReader, double-effort (was: Temporary tables under hot standby)

From

Josh Berkus

Date:

03 May 2012, 16:54:23

> That's an interesting point.  Out of curiosity, how did the
> corruption originate?

We're still not sure.  It appears to be in the system catalogs, though.Note that the original master developed memory
issues.

> It suggests a couple questions:
>  
> (1)  Was Slony running before the corruption occurred?  

No.

> If not, how
> was Slony helpful?  

Install, replicate DB logically, new DB works fine.

> (2)  If logical transactions had been implemented as additions to
> the WAL stream, and Slony was using that, do you think they would
> still have been usable for this recovery?

Quite possibly not.

> Perhaps sending both physical and logical transaction streams over
> the WAN isn't such a bad thing, if it gives us more independent
> recovery mechanisms.  That's fewer copies than we're sending with
> current trigger-based techniques. 

Frankly, there's nothing wrong with the Slony model for replication
except for the overhead of:
1. triggers
2. queues
3. Running DDL

However, the three above are really big issues.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: Re: xReader, double-effort (was: Temporary tables under hot standby)

From

Jim Nasby

Date:

04 May 2012, 19:18:44

On 5/3/12 2:54 PM, Josh Berkus wrote:
>> (2)  If logical transactions had been implemented as additions to
>> >  the WAL stream, and Slony was using that, do you think they would
>> >  still have been usable for this recovery?
> Quite possibly not.

The key advantage that I see in londiste/slony replication is that your data stream has absolutely nothing to do with
anythingbinary or internal to Postgres. That means that the only way corruption will travel from a master to a slave is
ifthe corruption is in the actual fields being updated, and even that's not a given (ie: UPDATING a field to a
completelynew value would not propagate corruption even if the old value of the field was corrupted).

So, embedding a logical stream into WAL is not inherently bad... what would be bad is if that "logical" stream was
susceptibleto corruption due to something like full page writes. Simply embedding the exact same info slony or londiste
capturesinto the WAL should be fine (though likely defeats the purpose). Translating binary WAL data into DML
statementswould very likely allow corruption to travel from master to slave.

-- 
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net