Re: Re: xReader, double-effort (was: Temporary tables under hot standby) - Mailing list pgsql-hackers
From | Kevin Grittner |
---|---|
Subject | Re: Re: xReader, double-effort (was: Temporary tables under hot standby) |
Date | |
Msg-id | 4F9D09810200002500047540@gw.wicourts.gov |
Responses | Re: Re: xReader, double-effort (was: Temporary tables under hot standby) |
| Re: Re: xReader, double-effort (was: Temporary tables under hot standby) |
List | pgsql-hackers |
Tom Lane wrote:

> "Kevin Grittner" writes:
>> Well, the xReader approach (Aakash's GSoC project) is to serve as
>> a proxy for a WAL stream going to a hot standby, to interpret each
>> incoming WAL record by "cheating" and querying the HS before
>> passing the WAL along, and then using sync rep to the HS
>> regardless of whether the incoming stream is sync or async. What
>> reliability problem do you see with this approach?
>
> Well, first off, it requires a replica that *is* a physical replica
> (else it won't have the same table OIDs, for instance). I'm not
> sure I get the point of logical replication that requires a
> physical replica as a prerequisite.

Maybe I can help with that by describing what the Wisconsin court system does for circuit court data.

Due to state law, state supreme court rules, and the general suckiness of the WAN available to us, each of Wisconsin's 72 counties has a database server in the county seat which holds the "official" data of record. We have two parallel forms of backup -- WAL-based and logical.

For WAL-based replication we maintain a copy of the last two weekly base backups and all WAL files needed to recover from either of them to any point in time. One copy of this is required to be in the county, on a separate system with separate drives. The other copy is maintained centrally. One rather large server has a hot standby of all county databases, currently fed through copying WAL files (although we plan to move to streaming this year). These "physical" backups are the primary source for recovering from catastrophic failure of a county server. We have software to "top off" the recovered database with transactions from the logical transaction stream as needed, since that is more up-to-date.

We have also found this combination of physical and logical transactions useful when someone trying to do a data fix with direct SQL statements against a production database has mangled things badly. We use PITR recovery up to just before the damage, then apply logical transactions from that point forward, skipping the problem transactions.

Until recently our logical transaction stream was generated from our Java middle tier running on the database servers. This has recently been converted to use a generalized PostgreSQL trigger, written in C, which is attached to all tables for which we want logical replication (about 90% of them). The trigger captures data into transaction tables in the database as part of the same transaction they represent, so that the generation of the logical transaction data is atomic with the represented data (a minimal sketch of the idea appears below). We didn't switch to one of the existing solutions because of all the code downstream which uses the logical transaction streams -- we didn't want to change the format or delivery method of the logical stream because it's simply not feasible to change everything at once.

The way the logical stream is currently delivered is still through the aforementioned Java middle tier. It reads the logical transaction log tables and sends the same data in non-persistent JMS messages through queues to the "TRAN client" at the central location, just as we did for years before we ever heard of PostgreSQL. Normally, the TRAN client reads each logical transaction once and feeds many "replication targets" at the central site.
These include:

- four "Central Information Repository" (CIR) databases, each of which contains data from all 72 counties (two machines each at two different sites)

- four "Database Transaction Repository" (DTR) databases on the same servers as the CIR databases (but on separate RAIDs); these are used as a supplementary source for recovery (as mentioned above), as well as for auditing what happened on a case when there is any question, and for certain activity reports

- a filter to identify transactions of possible interest to the publishers we use for interfaces to other agencies (district attorneys, public defenders, department of corrections, police agencies, social service agencies, etc.); transactions of interest are queued for closer review by the publisher software, which actually posts messages if the transaction is found to be of interest

- other databases for the state court's "central services", for such things as court interpreters used for non-English speakers in the various counties

The number and varied nature of the targets is significant, because the "fan out" and timing differences need to be considered. The TRAN client normally receives one copy of the logical data for each transaction, and about 99% of the time the transaction's data is replicated to all databases (and queued for review by publishers) within a second of someone pressing Enter 300 miles away. It's that other 1% that gets really interesting.

To deal with the possibility that one or more central targets are down or running very slowly, the TRAN client will give up on keeping them at the same point in the transaction stream after a (configurable) delay. Since this replication is important for disaster recovery, we don't want problems on one server to keep others from staying up-to-date. As currently configured, for every set of targets which falls behind, there is a redundant stream of logical transactions across the WAN.

Let's talk about the WAN. We're somewhat "compelled" to use the WAN offered by the state's Department of Administration. It is expensive, so we use the slowest speed that will work for each county. That can be as low as 750 Kbps, up to 10 Mbps, and even at those pitiful rates this was costing us (last I heard) over $80,000 per month -- so boosting WAN speed is a last resort. Besides being the conduit for both the WAL and logical transaction streams, this same WAN is used for file server backups, access to the Internet for court staff, and inter-county database access (such as District Court Administrators or Chief Judges accessing all counties in their respective districts).

Sometimes a WAN pipe gets saturated and the transaction streams for a county fall behind. This can be aggravated by other traffic (such as when all court employees in a county are asked to view a training video via the Internet), or by large raster-scanned documents being added to the database. In these cases the "let the lagging target spawn its own transaction stream" approach becomes pathological: on the saturated connection, an ACK of a message from one or more targets may be delayed long enough that it splits the stream and makes a bad situation even worse.

We have an item on our annual plan to try to address these problems, which two people were assigned to start working on next month. The xReader technology proposed by Aakash is so much better than what we can otherwise do that, when management heard about it, they delayed that project in the hope that, by waiting a few months, it can be done with xReader.
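To make the trigger-based capture described above concrete, here is a minimal sketch of the idea in plpgsql. It is only an illustration of the technique under assumed names -- the production trigger is written in C, and the tran_log table, its columns, and the capture_tran function are hypothetical:

    -- Hypothetical transaction-log table; the real schema differs.
    CREATE TABLE tran_log (
        seq_no      bigserial PRIMARY KEY,
        tran_id     bigint NOT NULL,      -- groups rows by transaction
        table_name  text   NOT NULL,
        op          "char" NOT NULL,      -- 'I', 'U', or 'D'
        row_data    text                  -- captured row image
    );

    CREATE FUNCTION capture_tran() RETURNS trigger AS $$
    BEGIN
        -- This INSERT runs inside the same transaction as the
        -- triggering statement, so the logical record commits (or
        -- rolls back) atomically with the data it represents.
        IF TG_OP = 'DELETE' THEN
            INSERT INTO tran_log (tran_id, table_name, op, row_data)
                VALUES (txid_current(), TG_TABLE_NAME, 'D', OLD::text);
            RETURN OLD;
        ELSE
            INSERT INTO tran_log (tran_id, table_name, op, row_data)
                VALUES (txid_current(), TG_TABLE_NAME,
                        substr(TG_OP, 1, 1), NEW::text);
            RETURN NEW;
        END IF;
    END;
    $$ LANGUAGE plpgsql;

    -- Attached to each table we want in the logical stream, e.g.:
    CREATE TRIGGER case_tran_capture
        AFTER INSERT OR UPDATE OR DELETE ON some_case_table
        FOR EACH ROW EXECUTE PROCEDURE capture_tran();

A downstream reader can then pull rows out of tran_log in seq_no order, which is roughly what the Java middle tier described above does with the real transaction tables.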
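For those who haven't followed the xReader discussion, the "cheating" mentioned at the top of this message consists of catalog lookups against the hot standby, because a heap WAL record identifies its table only by numbers such as the relfilenode. The queries below are an assumption about the general approach, not the project's actual code, and 16384 is a placeholder value:

    -- Given the relfilenode seen in an incoming heap WAL record,
    -- ask the standby which table it belongs to ...
    SELECT n.nspname, c.relname
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relfilenode = 16384;

    -- ... and what its columns are, so the tuple data can be
    -- decoded into a logical row.
    SELECT a.attname, a.atttypid::regtype
    FROM pg_attribute a
    JOIN pg_class c ON c.oid = a.attrelid
    WHERE c.relfilenode = 16384
      AND a.attnum > 0
      AND NOT a.attisdropped
    ORDER BY a.attnum;

This is also what makes Tom's DDL example below fatal to a naive implementation: when the INSERT's WAL record arrives, the standby's catalogs still show the old column name, because the ALTER has not yet committed there.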
The goal is to move to streaming replication and have that be the one stream of transaction data flowing over the WAN from the counties to the central location. From this stream we want to generate the logical transactions to feed all the other systems, while maintaining the WAL-based HS as the primary source for disaster recovery. It is important that, at least initially, we can produce logical transactions in the format accepted by all these other internal systems. The plan *was* to continue to fire the triggers to capture the data on the county servers, but to read those tables on the HS. That has many obvious disadvantages compared to either of the proposed logical replication systems. But for our purposes, generating logical transactions at the central site rather than at the original database is far superior, in that it means we only send the data over the WAN once, not twice.

> Next, it breaks immediately in the face of DDL:
>
> CREATE TABLE foo (f1 int);
> BEGIN;
> ALTER TABLE foo RENAME COLUMN f1 TO f2;
> INSERT INTO foo (f2) VALUES (1);
> COMMIT;
>
> The standby is not going to think that the ALTER is committed, so
> it will not report the right column name when it comes time to
> translate the INSERT. Actually, you can break it even more easily
> than that:
>
> CREATE TABLE bar AS SELECT ...
>
> What will you do with the insertions executed by this CREATE? They
> have to be executed before the creation of table bar is committed.
>
> Also, you'd need a query connection per database (at least, maybe
> one per session if you were trying to track DDL effects), which
> seems rather a lot of load on the HS slave. That together with the
> requirement for synchronous operation seems absolutely catastrophic
> from a performance standpoint.

Good points, but I think they all yield to one solution. If we can generate the appropriate snapshot based on the WAL record we are looking at, those problems all evaporate, don't they? It seems to me that we could build on the "shared snapshot" work to provide a means to this end.

> If you want to spend the summer building a toy, fine, but I don't
> see this going anywhere for production purposes.

I, the whole DBA group here, and the managers are very excited about putting such a technology into production here -- excited enough that we would probably be willing to find a way to do it under version 9.1 this year if at all possible.

> It would be enormously more performant for the master to be
> emitting logical replication records to start with, since it
> already has all the right names etc at hand at basically no cost.

Not when the consumers are across a WAN, and that WAN is the biggest performance bottleneck and the most expensive resource involved. I will cheerfully agree that both approaches have valuable use-cases.

-Kevin