Thread: Re: Re: xReader, double-effort (was: Temporary tables under hot standby)
Re: Re: xReader, double-effort (was: Temporary tables under hot standby)
From
"Kevin Grittner"
Date:
Tom Lane wrote: > "Kevin Grittner" writes: >> Well, the xReader approach (Aakash's GSoC project) is to serve as >> a proxy for a WAL stream going to a hot standby, to interpret each >> incoming WAL record by "cheating" and querying the HS before >> passing the WAL along, and then using sync rep to the HS >> regardless of whether the incoming stream is sync or async. What >> reliability problem do you see with this approach? > > Well, first off, it requires a replica that *is* a physical replica > (else it won't have the same table OIDs, for instance). I'm not > sure I get the point of logical replication that requires a > physical replica as a prerequisite. Maybe I can help with that by describing what the Wisconsin court system does for circuit court data. Due to state law, state supreme court rules, and the general suckiness of the WAN available to us, each of Wisconsin's 72 counties has a database server in the county seat which is the "official" data of record. We have two parallel forms of backup -- WAL-based and logical. For WAL-based replication we maintain a copy of the last two weekly base backups and all WAL files needed to recover from either of them to any point in time. One copy of this is required to be in the county, on a separate system with separate drives. The other copy is maintained centrally. One rather large server has a hot standby of all county databases, currently fed through copying WAL files (although we plan to move to streaming this year). These "physical" backups are the primary source for recovering from catastrophic failure of a county server. We have software to "top off" the recovered database with transactions from the logical transaction stream as needed, since that is more up-to-date. We have also found this combination of physical and logical transactions useful when someone trying to do a data fix with direct SQL statements against a production database mangled things badly. We use PITR recovery up to just before the damage, then apply logical transactions from that point forward, skipping the problem transactions. Until recently our logical transaction stream was generated from our Java middle tier running on the database servers. This has recently been converted to use a generalized PostgreSQL trigger, written in C, which is attached to all tables for which we want logical replication (about 90% of them). The trigger captures data into transaction tables in the database as part of the same transaction they are representing, so that the generation of the logical transaction data is atomic with the represented data. We didn't switch to one of the existing solutions because of all the code downstream which uses the logical transaction streams -- we didn't want to change the format or delivery method of the logical steam because it's simply not feasible to change everything at once. The way the logical stream is currently delivered is still through the aforementioned Java middle tier. It reads the logical transaction log tables and sends the same data in non-persistent JMS messages through queues to the "TRAN client" at the central location, just as we did for years before we ever heard of PostgreSQL. Normally, the TRAN client reads each logical transaction once, and feeds many "replication targets" at the central site. These include: - four "Central Information Repository" (CIR) databaseseach of which contains data from all 72 counties (two machines each at two different sites) - four "Database Transaction Repository" (DTR) databases on the same servers as the CIR databases (but separate RAIDs) This is used as a supplementary source for recovery (as mentioned above), as well as for auditing what happenedon a case when there is any question, and certain activity reports. - a filter to identify transactions of possible interest to the publishers we use for interfaces to other agencies (district attorneys, public defenders, department of corrections, police agencies, social service agencies, etc.) Transactionsof interest are queued for closer review by the publisher software, which actually posts messages if thetransaction is found to actually be of interest. - other databases for the state court's "central services" for such things as court interpreters used for non-Englishspeakers in the various counties The number and varied nature of the targets is significant, because the "fan out" and timing differences need to be considered. The TRAN client normally receives one copy of the logical data for each transaction, and about 99% of the time the transaction's data is replicated to all databases (and queued for review by publishers) within a second of someone pressing Enter 300 miles away. It's that other 1% that gets really interesting. To deal with the possibility that one or more central targets were down or running very slowly, the TRAN client will give up on keeping them at the same point in the transaction stream after a (configurable) delay. Since this replication is important for disaster recovery, we don't want problems on one server to keep others from staying up-to-date. As currently configured, for every set of targets which falls behind, there is a redundant stream of logical transactions across the WAN. Let's talk about the WAN. We're somewhat "compelled" to use the WAN offered by the state's Department of Administration. It is expensive, so we use the slowest speed that will work for each county. That can be as low as 750 Mbps up to 10 Gbps, and even at those pitiful rates, this was costing us (last I heard) over $80,000 per month -- so boosting WAN speed is a last resort. Besides being the conduit for both the WAL and logical transaction streams, this same WAN is used for file server backups, access to the Internet for court staff, and inter-county database access (such as District Court Administrators or Chief Judges access all counties in their respective districts). Sometimes a WAN pipe gets saturated and the transaction streams for a county fall behind. This can be aggravated by other traffic (such as when all court employees in a county are asked to view a training video via the Internet), or by large raster-scanned documents being added to the database. In these cases the "let the lagging target spawn its own transaction stream" approach currently becomes pathological. On the saturated connection an ACK of a message from one or more targets may be delayed long enough that it splits the stream and makes a bad situation even worse. We have an item on our annual plan to try to address these problems, which two people were assigned to start working on next month. The xReader technology proposed by Aakash is so much better than what we otherwise can do, that when management heard about it, they delayed that project in hopes that by waiting a few months, it can be done with xReader. The goal is to move to streaming replication and have that be the one stream of transaction data flowing over the WAN from the counties to the central location. From this stream we want to generate the logical transactions to feed all the other systems, while maintaining the WAL-based HS as the primary source for disaster recovery. It is important that, at least initially, we can produce logical transactions in the format accepted by all these other internal systems. The plan *was* to continue to fire the triggers to capture the data on the county servers, but read those tables on the HS. That has many obvious disadvantages compared to either of the proposed logical replication systems. But for our purposes, generating logical transactions at the central site rather than at the original database is far superior, in that it means we only send the data over the WAN once, not twice. > Next, it breaks immediately in the face of DDL: > > CREATE TABLE foo (f1 int); > BEGIN; > ALTER TABLE foo ALTER COLUMN f1 RENAME TO f2; > INSERT INTO foo (f2) VALUES (1); > COMMIT; > > The standby is not going to think that the ALTER is committed, so > it will not report the right column name when it comes time to > translate the INSERT. Actually, you can break it even more easily > than that: > > CREATE TABLE bar AS SELECT ... > > What will you do with the insertions executed by this CREATE? They > have to be executed before the creation of table bar is committed. > > Also, you'd need a query connection per database (at least, maybe > one per session if you were trying to track DDL effects), which > seems rather a lot of load on the HS slave. That together with the > requirement for synchronous operation seems absolutely catastrophic > from a performance standpoint. Good points, but I think they all yield to one solution. If we can generate the appropriate snapshot based on the WAL record we are looking at, those problems all evaporate, don't they? It seems to me that we could build on the "shared snapshot" work to provide a means to this end. > If you want to spend the summer building a toy, fine, but I don't > see this going anywhere for production purposes. I, the whole DBA group here, and the managers are very excited about putting such a technology into production here. Enough to probably be willing to find a way to do it under version 9.1 this year if at all possible. > It would be enormously more performant for the master to be > emitting logical replication records to start with, since it > already has all the right names etc at hand at basically no cost. Not when the consumers are across a WAN, and that WAN is the biggest performance bottleneck and the most expensive resource involved. I will cheerfully agree that both approaches have valuable use-cases. -Kevin
On Sun, Apr 29, 2012 at 3:27 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: >> It would be enormously more performant for the master to be >> emitting logical replication records to start with, since it >> already has all the right names etc at hand at basically no cost. > > Not when the consumers are across a WAN, and that WAN is the biggest > performance bottleneck and the most expensive resource involved. I agree that the WAN is important, for both bandwidth and response time. Though it isn't a given that logical change records (LCRs) will require more bandwidth than physical WAL. WAL contains full page images, index changes and other information that would be absent from the LCR stream. It also depends upon the specification of the LCRs - what metadata is included and whether the LCRs use text or binary. Those choices have other impacts as well, so measurements and detailed analysis is required to justify how to proceed. Which is what is in progress now. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 4/29/12 9:27 AM, Kevin Grittner wrote: > Maybe I can help with that by describing what the Wisconsin court > system does for circuit court data. Thanks for the write-up, it was insightful. One thing I wanted to mention is that non-binary replication has an added advantage over binary from a DR standpoint: ifcorruption occurs on a master it is more likely to make it into your replicas thanks to full page writes. You might wantto consider that depending on how sensitive your data is. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
> One thing I wanted to mention is that non-binary replication has an > added advantage over binary from a DR standpoint: if corruption occurs > on a master it is more likely to make it into your replicas thanks to > full page writes. You might want to consider that depending on how > sensitive your data is. Yeah, we've seen this a few times. We just recently had to rescue a client from HS-wide corruption using Slony. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: Re: xReader, double-effort (was: Temporary tables under hot standby)
From
"Kevin Grittner"
Date:
Josh Berkus <josh@agliodbs.com> wrote: >> One thing I wanted to mention is that non-binary replication has >> an added advantage over binary from a DR standpoint: if >> corruption occurs on a master it is more likely to make it into >> your replicas thanks to full page writes. You might want to >> consider that depending on how sensitive your data is. > > Yeah, we've seen this a few times. We just recently had to rescue > a client from HS-wide corruption using Slony. That's an interesting point. Out of curiosity, how did the corruption originate? It suggests a couple questions: (1) Was Slony running before the corruption occurred? If not, how was Slony helpful? I know that in our environment, where we have both going through separate streams, with a repository of the logical transactions, we would use PITR recovery to get to the latest known good state which we could easily identify, and then replay the logical transactions to "top it off" to get current. If necessary we could skip logical transactions which were problematic results of the corruption. (2) If logical transactions had been implemented as additions to the WAL stream, and Slony was using that, do you think they would still have been usable for this recovery? Perhaps sending both physical and logical transaction streams over the WAN isn't such a bad thing, if it gives us more independent recovery mechanisms. That's fewer copies than we're sending with current trigger-based techniques. It would be particularly attractive is we could omit (filter out) certain tables before going across the WAN. I would be willing to risk sending the big raster-scanned documents through just the physical channel so long as I had a nightly compare of md5sum values on both sides so we can resend any corrupted data (or tell people to rescan). -Kevin
> That's an interesting point. Out of curiosity, how did the > corruption originate? We're still not sure. It appears to be in the system catalogs, though.Note that the original master developed memory issues. > It suggests a couple questions: > > (1) Was Slony running before the corruption occurred? No. > If not, how > was Slony helpful? Install, replicate DB logically, new DB works fine. > (2) If logical transactions had been implemented as additions to > the WAL stream, and Slony was using that, do you think they would > still have been usable for this recovery? Quite possibly not. > Perhaps sending both physical and logical transaction streams over > the WAN isn't such a bad thing, if it gives us more independent > recovery mechanisms. That's fewer copies than we're sending with > current trigger-based techniques. Frankly, there's nothing wrong with the Slony model for replication except for the overhead of: 1. triggers 2. queues 3. Running DDL However, the three above are really big issues. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 5/3/12 2:54 PM, Josh Berkus wrote: >> (2) If logical transactions had been implemented as additions to >> > the WAL stream, and Slony was using that, do you think they would >> > still have been usable for this recovery? > Quite possibly not. The key advantage that I see in londiste/slony replication is that your data stream has absolutely nothing to do with anythingbinary or internal to Postgres. That means that the only way corruption will travel from a master to a slave is ifthe corruption is in the actual fields being updated, and even that's not a given (ie: UPDATING a field to a completelynew value would not propagate corruption even if the old value of the field was corrupted). So, embedding a logical stream into WAL is not inherently bad... what would be bad is if that "logical" stream was susceptibleto corruption due to something like full page writes. Simply embedding the exact same info slony or londiste capturesinto the WAL should be fine (though likely defeats the purpose). Translating binary WAL data into DML statementswould very likely allow corruption to travel from master to slave. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net