Thread: Data replication through disk replication
Hi there, I am currently looking into replicated two-node master/slave PostgreSQL environments. Lately I've heard more and more people recommend replicating data from the master to the slave at the disk device level as opposed to the DBMS level (Slony-I). On Linux, usually DRBD is recommended for this, which is basically RAID-1 via a network connection, i.e. DRBD copies everything that the master writes to its disk to the slave. What I keep wondering: Isn't there substantial risk involved? I mean, suppose the master fails in the middle of a write. Isn't there the possibility that this corrupts the database? How robust is PostgreSQL's on-disk file format and write caching strategy against failures like this? With something like Slony-I some data may not be fully copied to the slave when the master crashes. So there may be data loss. But there isn't the risk of database corruption. Or am I missing something here? Thanks, -Thomas
On Fri, May 18, 2007 at 02:48:03PM +0200, Thomas Lopatic wrote: > What I keep wondering: Isn't there substantial risk involved? > I mean, suppose the master fails in the middle of a write. Isn't there > the possibility that this corrupts the database? How robust is > PostgreSQL's on-disk file format and write caching strategy against > failures like this? There would seem to be slightly more risk, but as long as the filesystem honours fsync and doesn't reorder the disk writes, the standard WAL should protect you fine. All you need to be sure of is that the write to the WAL reaches disk before the actual data files do. > With something like Slony-I some data may not be fully copied to the > slave when the master crashes. So there may be data loss. But there > isn't the risk of database corruption. The biggest benefit of Slony in my eyes is that it's asynchronous. If the network connection dies, Slony can handle that, but I have no idea what your DRBD might do... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
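Martijn's WAL-before-data point maps onto a couple of server settings worth double-checking before trusting any block-level replication scheme. A minimal sanity check, assuming a reasonably recent 8.x-era server where these settings exist (the comments describe intent, not recommended values):

    -- Check the crash-safety settings discussed above from any SQL session.
    SHOW fsync;             -- must be 'on', or the WAL-before-data ordering guarantee is void
    SHOW wal_sync_method;   -- how the WAL flush actually reaches the disk (e.g. fdatasync)
    SHOW full_page_writes;  -- guards against torn pages when a crash interrupts a data-file write

With fsync turned off, no amount of careful disk replication will make a crashed cluster trustworthy, on the master or on the mirrored slave.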
On Fri, May 18, 2007 at 02:48:03PM +0200, Thomas Lopatic wrote: > I am currently looking into replicated two-node master/slave PostgreSQL > environments. Lately I've heard more and more people recommend > replicating data from the master to the slave at the disk device level > as opposed to the DBMS level (Slony-I). What are the reasons they recommend this? (See my blathering in another thread about how often the hand-wavy recommendations that are made on this topic can really bite you hard if you don't know all the intimate details underneath.) > What I keep wondering: Isn't there substantial risk involved? Sure, there's risk. There's risk in using Slony, too (if you accidentally issue DDL under Slony without using EXECUTE SCRIPT, you can find yourself having a rather bad day). > I mean, suppose the master fails in the middle of a write. Isn't there > the possibility that this corrupts the database? How robust is > PostgreSQL's on-disk file format and write caching strategy against > failures like this? Well, this is going to depend partly on the implementation and the hardware and the settings you use. Also, you _cannot_ have another postmaster attached to that data area, and if you accidentally open up another postmaster on a data area and it does work while the first postmaster is still connected, you are just completely hosed. This is not a place to be 95% sure you got it right. Note, too, that copying the data files around does not give you the significant advantage that Slony or other replication systems do, that you can use the replicated database to satisfy some types of queries; so you should consider whether you need that. (This is why the excellent docs on this emphasise that there are different kinds of need and therefore different replication technologies.) A -- Andrew Sullivan | ajs@crankycanuck.ca The whole tendency of modern prose is away from concreteness. --George Orwell
[Disk-level replication instead of using Slony-I] > What are the reasons they recommend this? (See my blathering in > another thread about how often the hand-wavy recommendations that are > made on this topic can really bite you hard if you don't know all the > intimate details underneath.) The reason usually given is simplicity. Why replicate the state of every service individually if you can just replicate the file system? Obviously this neglects the fact that there is a lot of state other than the persistent state stored in the file system. That this could bite me hard was exactly my gut feeling. [Are there any risks in disk-level replication?] > Sure, there's risk. There's risk in using Slony, too (if you > accidentally issue DDL under Slony without using EXECUTE SCRIPT, you > can find yourself having a rather bad day). Agreed. I should have been a bit more precise here. Let's assume that I do everything under my control to minimize risk, i.e. I follow best practices with Slony-I and I follow best practices with disk-level replication. I'd like to understand the risk that remains in each case. For Slony-I it seems to me that my risk is losing a couple of rows in my database, which is something that I could live with. For disk-level replication it seems to me that, in case of a master failure, I could easily end up with a corrupted database on the master's hard drive as well as on the slave's (which is a mirror of the master's). That could prevent the slave from taking over, which is clearly something that I cannot live with, as it means downtime. [Suppose that the master fails in the middle of a write. How likely is corruption?] > Well, this is going to depend partly on the implementation and the > hardware and the settings you give. Also, you _cannot_ have another > postmaster attached to that data area, and if you accidentally open > up another postmaster on a data area and it does work while the first > postmaster is still connected, you are just completely hosed. This > is not a place to be 95% sure you got it right. Agreed. Accidentally running two postmasters on the same data directory is indeed an additional risk that I did not think of when making my original posting. But you mention settings. So let's for a moment assume that I get everything right in this respect and then my postmaster fails in the middle of a write. Which settings would I want to modify, and in which way, to minimize the risk of database corruption in such a case? > Note, too, that copying the data files around does not give you the > significant advantage that Slony or other replication systems do, > that you can use the replicated database to satisfy some types of > queries; so you should consider whether you need that. (This is why > the excellent docs on this emphasise that there are different kinds > of need and therefore different replication technologies.) Yup, that's true, thanks for pointing this out. In my scenario, however, I do not have any need to, for example, load-balance read accesses. I am solely interested in improving availability by having one active database node and one passive node that tracks the state of the active node as closely as possible. -Thomas
On Fri, May 18, 2007 at 07:55:24PM +0200, Thomas Lopatic wrote: > For Slony-I it seems to me that my risk is losing a couple of rows in my > database, which is something that I could live with. For disk-level > replication it seems to me that, in case of a master failure, I could > easily end up with a corrupted database on the master's hard drive as well > as on the slave's (which is a mirror of the master's). That could prevent > the slave from taking over, which is clearly something that I cannot live > with, as it means downtime. Right. Slony and other database-level replication systems, if they're designed correctly, expend a great deal of effort on making sure that node switchover and failover works, where "works" means "your data is consistent, even if you lose some". In general, disk-replication schemes are designed for general _machine_ availability. It is very hard to optimise this approach for the many small writes that databases tend to do. That said, if you can be rock solid sure that (1) your "hot" system is always rigorous about fsync and write order, and (2) your "standby" system physically cannot take over the disk until all of the "hot" system's data has been flushed and the "hot" system has somehow been disconnected, then it can be safe. The problem usually lies in the actual guarantees of (1) and (2), and in how well those guarantees correspond to the initial promises. > original posting. But you mention settings. So let's for a moment assume > that I get everything right in this respect and then my postmaster fails > in the middle of a write. Which settings would I want to modify, and in > which way, to minimize the risk of database corruption in such a case? This is entirely out of PostgreSQL's control, and up to the operating system and the cluster/failover implementation you have. Note that there are many ways to do this wrong, including things like using tokens on the filesystem as a lockfile (I swear someone once proposed this to me). A -- Andrew Sullivan | ajs@crankycanuck.ca This work was visionary and imaginative, and goes to show that visionary and imaginative work need not end up well. --Dennis Ritchie
If you're just looking for a way to have high availability and you're ok being tied to linux, DRBD is a good way to go. It keeps things simple in that all changes are replicated, it won't say an fsync is finished until it's finished on the remote host too, and it won't let you mount the block device on the slave system (at least with 0.7x). You lose out on the ability to have multiple read-only databases like you get with Slony, and you have no control over which changes will be replicated, but if you don't care about that, then DRBD probably has the least risk of all postgres replication strategies. In case you think it matters, IBM recommends DRBD for DB2. On Fri, 18 May 2007, Thomas Lopatic wrote: > Hi there, > > I am currently looking into replicated two-node master/slave PostgreSQL > environments. Lately I've heard more and more people recommend > replicating data from the master to the slave at the disk device level > as opposed to the DBMS level (Slony-I). On Linux, usually DRBD is > recommended for this, which is basically RAID-1 via a network > connection, i.e. DRBD copies everything that the master writes to its > disk to the slave. > > What I keep wondering: Isn't there substantial risk involved? > I mean, suppose the master fails in the middle of a write. Isn't there > the possibility that this corrupts the database? How robust is > PostgreSQL's on-disk file format and write caching strategy against > failures like this? > > With something like Slony-I some data may not be fully copied to the > slave when the master crashes. So there may be data loss. But there > isn't the risk of database corruption. > > Or am I missing something here? > > Thanks, > -Thomas
Ben wrote: > If you're just looking for a way to have high availability and you're ok > being tied to linux, DRBD is a good way to go. It keeps things simple in > that all changes are replicated, it won't say an fsync is finished until > it's finished on the remote host too, Oh, so that's how it works. I assume performance must be, huh, not stellar? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
You pay a price on writes, but with write caching enabled on your (battery-backed, of course) RAID card and using gigabit, it's easy to get >100MB/s throughput. It's also easy to replicate different block devices over separate network links, if that becomes your bottleneck. On May 18, 2007, at 6:14 PM, Alvaro Herrera wrote: > Ben wrote: >> If you're just looking for a way to have high availability and >> you're ok >> being tied to linux, DRBD is a good way to go. It keeps things >> simple in >> that all changes are replicated, it won't say an fsync is finished >> until >> it's finished on the remote host too, > > Oh, so that's how it works. I assume performance must be, huh, not > stellar? > > -- > Alvaro Herrera http://www.CommandPrompt.com/ > The PostgreSQL Company - Command Prompt, Inc.
On Fri, May 18, 2007 at 05:03:30PM -0700, Ben wrote: > that all changes are replicated, it won't say an fsync is finished until > it's finished on the remote host too, and it won't let you mount the block > device on the slave system (at least with 0.7x). How can it guarantee these things? The web pages say this: If the primary node fails, heartbeat is switching the secondary device into primary state and starts the application there. (If you are using it with a non-journaling FS this involves running fsck) If the failed node comes up again, it is a new secondary node and has to synchronise its content to the primary. This, of course, will happen without interruption of service in the background. So what happens in those cases where the primary node gets in trouble but isn't actually dead yet? I see a potential for a race condition here that is really troubling to me. (Especially since it uses the TCP/IP stack, which is notoriously subject to DoS on Linux.) I think you really had better have something like STONITH running to use this. > In case you think it matters, IBM recommends DRBD for DB2. IBM also sells HACMP. I refuse to comment on whether IBM's advice on high availability products is worth taking seriously. A -- Andrew Sullivan | ajs@crankycanuck.ca The fact that technology doesn't work is no bar to success in the marketplace. --Philip Greenspun
Andrew Sullivan wrote: >On Fri, May 18, 2007 at 05:03:30PM -0700, Ben wrote: >> that all changes are replicated, it won't say an fsync is finished >> until it's finished on the remote host too, and it won't let you mount >> the block device on the slave system (at least with 0.7x). >How can it guarantee these things? The web pages say this: > If the primary node fails, heartbeat is switching the > secondary device into primary state and starts the > application there. (If you are using it with a non-journaling > FS this involves running fsck) > If the failed node comes up again, it is a new secondary node > and has to synchronise its content to the primary. This, of > course, will happen without interruption of service in the > background. >So what happens in those cases where the primary node gets in >trouble but isn't actually dead yet? I see a potential for a >race condition here that is really troubling to me. >(Especially since it uses the TCP/IP stack, which is >notoriously subject to DoS on Linux.) I think you really had >better have something like STONITH running to use this. The general advice you see at linux-ha is to use redundant heartbeat paths. You can use a serial link if you want to; other options are redundant networks. This reduces the probability of a split-brain situation. As you guessed, STONITH is very much required to guard against possible 'race' conditions caused by hanging nodes. As a note, DRBD will also do a lot of work for you, and it prevents some mistakes you could otherwise make. While starting, it waits if it does not detect the other node and thus doesn't know who has the latest data available; this can be overridden by a timeout if desired. (In practice only a single node will fail, or both will come up at the same time.) It keeps the nodes from running out of sync, and it detects when it is out of sync, requiring administrator intervention. Another piece of advice is to take a look at the linux-ha web site and mailing list. Though the web site might not be the best, the mailing list is quite active and a lot of knowledge is available. In general, high availability is complex and requires a lot of thought to cover all possible cases. [snip] - Joris Dobbelsteen
Er, yes, sorry, I didn't mean to imply that you should run without some kind of STONITH solution, to catch the case when the link DRBD uses goes down but the other network links are still working fine. It's in the common case, when everything is working, that DRBD won't accidentally let you mount the same device on both machines. (In my experience, operator error is more common than hardware error.) On May 19, 2007, at 6:28 AM, Andrew Sullivan wrote: > On Fri, May 18, 2007 at 05:03:30PM -0700, Ben wrote: > >> that all changes are replicated, it won't say an fsync is finished >> until >> it's finished on the remote host too, and it won't let you mount >> the block >> device on the slave system (at least with 0.7x). > > How can it guarantee these things? The web pages say this: > > If the primary node fails, heartbeat is switching the > secondary device into primary state and starts the > application there. (If you are using it with a non-journaling > FS this involves running fsck) > > If the failed node comes up again, it is a new secondary node > and has to synchronise its content to the primary. This, of > course, will happen without interruption of service in the > background. > > So what happens in those cases where the primary node gets in trouble > but isn't actually dead yet? I see a potential for a race condition > here that is really troubling to me. (Especially since it uses the > TCP/IP stack, which is notoriously subject to DoS on Linux.) I think > you really had better have something like STONITH running to use > this. > >> In case you think it matters, IBM recommends DRBD for DB2. > > IBM also sells HACMP. I refuse to comment on whether IBM's advice on > high availability products is worth taking seriously. > > A > > -- > Andrew Sullivan | ajs@crankycanuck.ca > The fact that technology doesn't work is no bar to success in the > marketplace. > --Philip Greenspun
Alvaro Herrera wrote: > Ben wrote: >> If you're just looking for a way to have high availability and you're ok >> being tied to linux, DRBD is a good way to go. It keeps things simple in >> that all changes are replicated, it won't say an fsync is finished until >> it's finished on the remote host too, > > Oh, so that's how it works. I assume performance must be, huh, not stellar? It depends... if the machines are equal and you have bonded two gigabit links together.... Joshua D. Drake === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/
> So what happens in those cases where the primary node gets in trouble > but isn't actually dead yet? Hmmm. Is this really a problem? Couldn't the secondary DRBD node simply stop accepting replicated data from the primary node before firing up postmaster? Then the postmaster on the primary DRBD node would only write locally and not interfere with the secondary DRBD node. Unless I am missing something this would be a valid problem with shared storage but not with DRBD-like replicated storage. (As long as the secondary node can stop replicating if it decides to do so.) -Thomas
On Sun, May 20, 2007 at 12:01:46PM +0200, Thomas Lopatic wrote: > Hmmm. Is this really a problem? Couldn't the secondary DRBD node simply > stop accepting replicated data from the primary node before firing up > postmaster? Then the postmaster on the primary DRBD node would only > write locally and not interfere with the secondary DRBD node. I presume that is possible (you're right that, since it's a block device under OS control, you should be able to tell it which way to behave). My fairly quick reading of the docs doesn't tell me that it works this way, though, so I have to assume it doesn't. (It's not the question of what it could do, but what it actually does, that's the important one for implementors.) A -- Andrew Sullivan | ajs@crankycanuck.ca When my information changes, I alter my conclusions. What do you do sir? --attr. John Maynard Keynes
hi group :) How do I use hex numbers in psql? I.e. instead of: select 16 I want to do select 0x10 like in C++. I tried doing what this website said http://www.faqs.org/docs/ppbook/c12119.htm but it does not work. TY :)
sry, one more question ... I want to trap an exception and return a -1 no matter WHAT it is ... what do i need to replace <anything> with?

create or replace function clrsplit(int4) returns int4 as $$
BEGIN
    delete from split where tkid=$1;
    update tk set dtchksp=null where nid=$1;
    return 0;
EXCEPTION WHEN <anything> THEN
    return -1;
END;
$$ language plpgsql;
On May 20, 2007, at 3:01 AM, Thomas Lopatic wrote: >> So what happens in those cases where the primary node gets in trouble >> but isn't actually dead yet? > > Hmmm. Is this really a problem? The problem comes when the primary cannot replicate to the secondary but can, for whatever reason, still talk to clients. If a client is told something is committed but that commit isn't replicated, you have a problem.
Thomas Lopatic wrote: >> So what happens in those cases where the primary node gets in trouble >> but isn't actually dead yet? > > Hmmm. Is this really a problem? Couldn't the secondary DRBD node simply > stop accepting replicated data from the primary node before firing up > postmaster? Then the postmaster on the primary DRBD node would only > write locally and not interfere with the secondary DRBD node. > > Unless I am missing something this would be a valid problem with shared > storage but not with DRBD-like replicated storage. (As long as the > secondary node can stop replicating if it decides to do so.) Yes, you can always force DRBD to go split-brain if you really want it to, but this is usually unwanted. Usually a shared block device is not the only resource a node holds; you would like it to hold at least an IP address as well. So it buys you nothing to be able to fire up PostgreSQL on the secondary: you still need to take over additional resources to bring your service back online, AND you need to make sure that the primary node won't recover and reclaim ownership of resources that the secondary has already taken over. This is exactly what STONITH is for. If the secondary has the slightest reason to believe the primary node might be dead, it takes that assumption and makes it reality. -- Best regards, Hannes Dorbath
Ben wrote: > On May 20, 2007, at 3:01 AM, Thomas Lopatic wrote: > The problem comes when the primary cannot replicate to the secondary > but can, for whatever reason, still talk to clients. If a client is told > something is committed but that commit isn't replicated, you have a > problem. Right, node and resource fencing is the first thing you need to think about in any HA setup. -- Best regards, Hannes Dorbath
On Sun, 20 May 2007, madhtr wrote: > How do use hex numbers in psql? I.E. > > instead of: > select 16 > I want to do > select 0x10 > See 4.1.2.3: "SELECT x'1FF'::int" http://www.postgresql.org/docs/8.2/static/sql-syntax-lexical.html#SQL-SYNTAX-CONSTANTS Kris Jurka
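To make Kris's answer concrete: the 0x10 syntax from C++ does not exist in SQL, but a hex bit-string constant cast to integer does the same job. A small sketch, using the cast behaviour documented on the page Kris links (the comments show what each cast yields):

    SELECT x'10'::int;    -- hex bit-string constant cast to integer, yields 16
    SELECT x'ff'::int;    -- yields 255
    SELECT x'1FF'::int;   -- the documentation's own example, yields 511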
"madhtr" <madhtr@schif.org> writes: > I want to trap an exception and return a -1 no matter WHAT it is ... what do > i need to replace <anything> with? OTHERS regards, tom lane