Thread: Data replication through disk replication

Data replication through disk replication

From: Thomas Lopatic
Hi there,

I am currently looking into replicated two-node master/slave PostgreSQL
environments. Lately I've heard more and more people recommend
replicating data from the master to the slave at the disk device level
as opposed to the DBMS level (Slony-I). On Linux, usually DRBD is
recommended for this, which is basically RAID-1 via a network
connection, i.e. DRBD copies everything that the master writes to its
disk to the slave.

What I keep wondering: Isn't there substantial risk involved?
I mean, suppose the master fails in the middle of a write. Isn't there
the possibility that this corrupts the database? How robust is
PostgreSQL's on-disk file format and write caching strategy against
failures like this?

With something like Slony-I some data may not be fully copied to the
slave when the master crashes. So there may be data loss. But there
isn't the risk of database corruption.

Or am I missing something here?

Thanks,
-Thomas


Re: Data replication through disk replication

From: Martijn van Oosterhout
On Fri, May 18, 2007 at 02:48:03PM +0200, Thomas Lopatic wrote:
> What I keep wondering: Isn't there substantial risk involved?
> I mean, suppose the master fails in the middle of a write. Isn't there
> the possibility that this corrupts the database? How robust is
> PostgreSQL's on-disk file format and write caching strategy against
> failures like this?

It would seem slightly more risky, but as long as the filesystem honours
fsync and doesn't mess with the order of the disk writes, the
standard WAL log should protect you fine.  All you need to be sure of
is that the WAL gets written to disk before the actual data
files do.
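
For instance, a minimal sketch of the postgresql.conf settings that
matter here (the defaults are already the safe choice; the point is
what not to turn off):

    # postgresql.conf -- illustrative excerpt
    fsync = on                # never turn this off on a replicated volume;
                              # DRBD can only mirror writes that actually
                              # get flushed to the device
    wal_sync_method = fsync   # how WAL writes are forced to disk; the
                              # default varies by platform
    full_page_writes = on     # log full pages after a checkpoint so a
                              # torn page can be rebuilt during recovery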

> With something like Slony-I some data may not be fully copied to the
> slave when the master crashes. So there may be data loss. But there
> isn't the risk of database corruption.

The biggest benefit to Slony in my eyes is that it's asynchronous. If
the network connection dies, Slony can handle that, but I have no idea
what your DRBD might do...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


Re: Data replication through disk replication

From: Andrew Sullivan
On Fri, May 18, 2007 at 02:48:03PM +0200, Thomas Lopatic wrote:

> I am currently looking into replicated two-node master/slave PostgreSQL
> environments. Lately I've heard more and more people recommend
> replicating data from the master to the slave at the disk device level
> as opposed to the DBMS level (Slony-I).

What are the reasons they recommend this?  (See my blathering in
another thread about how often the hand-wavy recommendations that are
made on this topic can really bite you hard if you don't know all the
intimate details underneath.)

> What I keep wondering: Isn't there substantial risk involved?

Sure, there's risk.  There's risk in using Slony, too (if you
accidentally issue DDL under Slony without using EXECUTE SCRIPT, you
can find yourself having a rather bad day).
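
(For the record, a sketch of how DDL is supposed to go out under Slony;
the cluster name, connection details and file name here are made up:

    # slonik script -- illustrative only
    cluster name = mycluster;
    node 1 admin conninfo = 'dbname=mydb host=master';
    node 2 admin conninfo = 'dbname=mydb host=slave';

    # push the DDL file through the replication system so that
    # every node replays it in a consistent order
    execute script (
        set id = 1,
        filename = '/tmp/add_column.sql',
        event node = 1
    );

Running the DDL directly on one node, instead of through slonik like
this, is exactly the "rather bad day" scenario.)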

> I mean, suppose the master fails in the middle of a write. Isn't there
> the possibility that this corrupts the database? How robust is
> PostgreSQL's on-disk file format and write caching strategy against
> failures like this?

Well, this is going to depend partly on the implementation and the
hardware and the settings you give.  Also, you _cannot_ have another
postmaster attached to that data area, and if you accidentally open
up another postmaster on a data area and it does any work while the first
postmaster is still connected, you are just completely hosed.  This
is not a place to be 95% sure you got it right.

Note, too, that copying the data files around does not give you the
significant advantage that Slony or other replication systems do,
that you can use the replicated database to satisfy some types of
queries; so you should consider whether you need that.  (This is why
the excellent docs on this emphasise that there are different kinds
of need and therefore different replication technologies.)

A
--
Andrew Sullivan  | ajs@crankycanuck.ca
The whole tendency of modern prose is away from concreteness.
        --George Orwell

Re: Data replication through disk replication

From: Thomas Lopatic
[Disk-level replication instead of using Slony-I]

> What are the reasons they recommend this?  (See my blathering in
> another thread about how often the hand-wavy recommendations that are
> made on this topic can really bite you hard if you don't know all the
> intimate details underneath.)

The reason usually given is simplicity. Why replicate the state of every
service individually if you can just replicate the file system?
Obviously this neglects the fact that there is lots of state other than
the persistent state stored in the file system. That this could bite me
hard was exactly my gut feeling.

[Are there any risks in disk-level replication?]

> Sure, there's risk.  There's risk in using Slony, too (if you
> accidentally issue DDL under Slony without using EXECUTE SCRIPT, you
> can find yourself having a rather bad day).

Agreed. I should have been a bit more precise here. Let's assume that I
do everything under my control to minimize risk, i.e. I follow best
practices using Slony-I and I follow best practices using disk-level
replication. I'd like to understand the risk that remains in both cases.
For Slony-I it seems to me that my risk is losing a couple of rows in my
database, which is something that I could live with. For disk-level
replication it seems to me that, in case of a master failure, I could
easily end up with a corrupted database file on the master's as well as
the slave's hard drive (which is a mirror of the master's), possibly
preventing the slave from taking over, which is clearly something that I
cannot live with, as it means downtime.

[Suppose that the master fails in the middle of a write. How likely is
corruption?]

> Well, this is going to depend partly on the implementation and the
> hardware and the settings you give.  Also, you _cannot_ have another
> postmaster attached to that data area, and if you accidentally open
> up another postmaster on a data area and it does work while the first
> postmaster is still connected, you are just completely hosed.  This
> is not a place to be 95% sure you got it right.

Agreed. Accidentally running two postmasters on the same data directory is
indeed an additional risk that I did not think of when making my
original posting. But you mention settings. So let's for a moment assume
that I get everything right in this respect and then my postmaster fails
in the middle of a write. Which settings would I want to modify in which
way to minimize the risk of database corruption in such a case?

> Note, too, that copying the data files around does not give you the
> significant advantage that Slony or other replication systems do,
> that you can use the replicated database to satisfy some types of
> queries; so you should consider whether you need that.  (This is why
> the excellent docs on this emphasise that there are different kinds
> of need and therefore different replication technologies.)

Yup, that's true, thanks for pointing this out. In my scenario, however,
I do not have any need to, for example, load-balance read accesses. I am
solely interested in improving the availability by having one active
database node and one passive node that tracks the state of the active
node as closely as possible.

-Thomas

Re: Data replication through disk replication

From: Andrew Sullivan
On Fri, May 18, 2007 at 07:55:24PM +0200, Thomas Lopatic wrote:

> For Slony-I it seems to me that my risk is losing a couple of rows in my
> database, which is something that I could live with. For disk-level
> replication it seems to me that, in case of a master failure, I could
> easily end up with a corrupted database file on the master's as well as
> the slave's hard drive (which is a mirror of the master's), possibly
> preventing the slave from taking over, which is clearly something that I
> cannot live with, as it means downtime.

Right.  Slony and other database-level replication systems, if
they're designed correctly, expend a great deal of effort on making
sure that node switchover and failover work, where "working" means "your
data is consistent, even if you lose some".  In general,
disk-replication schemes are designed for general _machine_
availability.  It is very hard to optimise this approach for the many
small writes that databases tend to do.  That said, if you can be
rock solid sure that (1) your "hot" system is always rigorous about
fsync and write order and (2) that your "standby" system physically
cannot take over the disk while your "hot" system's data has not all
been flushed and the "hot" system somehow disconnected, then it can
be safe.  The problem usually lies in the actual guarantees of (1)
and (2), and the correspondence of those guarantees with the initial
promises.

> original posting. But you mention settings. So let's for a moment assume
> that I get everything right in this respect and then my postmaster fails
> in the middle of a write. Which settings would I want to modify in which
> way to minimize the risk of database corruption in such a case?

This is entirely out of PostgreSQL's control, and up to the operating
system and cluster/failover implementation you have.  Note that there
are many ways to do this wrong, including things like using tokens on
the filesystem as a lockfile (I swear someone once proposed this to
me).

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
This work was visionary and imaginative, and goes to show that visionary
and imaginative work need not end up well.
        --Dennis Ritchie

Re: Data replication through disk replication

From: Ben
If you're just looking for a way to have high availability and you're ok
being tied to linux, DRBD is a good way to go. It keeps things simple in
that all changes are replicated, it won't say an fsync is finished until
it's finished on the remote host too, and it won't let you mount the block
device on the slave system (at least with 0.7x). You lose out on the
ability to have multiple read-only databases like you get with Slony, and
you have no control over which changes will be replicated, but if you
don't care about that, then DRBD probably has the least risk of all
postgres replication strategies.
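
That synchronous behaviour is DRBD's protocol C. A minimal, illustrative
drbd.conf sketch -- host names, devices and addresses are made up:

    resource r0 {
      protocol C;            # C = synchronous: a write only completes once
                             # it has hit the disk on *both* nodes
      on pg-node1 {
        device    /dev/drbd0;
        disk      /dev/sda7;
        address   10.0.0.1:7788;
        meta-disk internal;
      }
      on pg-node2 {
        device    /dev/drbd0;
        disk      /dev/sda7;
        address   10.0.0.2:7788;
        meta-disk internal;
      }
    }

Protocols A and B acknowledge writes earlier, which is faster but gives
up the guarantee described above.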

In case you think it matters, IBM recommends DRBD for DB2.

On Fri, 18 May 2007, Thomas Lopatic wrote:

> Hi there,
>
> I am currently looking into replicated two-node master/slave PostgreSQL
> environments. Lately I've heard more and more people recommend
> replicating data from the master to the slave at the disk device level
> as opposed to the DBMS level (Slony-I). On Linux, usually DRBD is
> recommended for this, which is basically RAID-1 via a network
> connection, i.e. DRBD copies everything that the master writes to its
> disk to the slave.
>
> What I keep wondering: Isn't there substantial risk involved?
> I mean, suppose the master fails in the middle of a write. Isn't there
> the possibility that this corrupts the database? How robust is
> PostgreSQL's on-disk file format and write caching strategy against
> failures like this?
>
> With something like Slony-I some data may not be fully copied to the
> slave when the master crashes. So there may be data loss. But there
> isn't the risk of database corruption.
>
> Or am I missing something here?
>
> Thanks,
> -Thomas

Re: Data replication through disk replication

From: Alvaro Herrera
Ben wrote:
> If you're just looking for a way to have high availability and you're ok
> being tied to linux, DRBD is a good way to go. It keeps things simple in
> that all changes are replicated, it won't say an fsync is finished until
> it's finished on the remote host too,

Oh, so that's how it works.  I assume performance must be, huh, not stellar?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Data replication through disk replication

From: Ben
You pay a price on writes, but with write caching enabled on your
(battery-backed, of course) RAID card and using gigabit ethernet, it's
easy to get >100MB/s throughput. It's also easy to replicate different
block devices over separate network links, if that becomes your
bottleneck.


On May 18, 2007, at 6:14 PM, Alvaro Herrera wrote:

> Ben wrote:
>> If you're just looking for a way to have high availability and you're ok
>> being tied to linux, DRBD is a good way to go. It keeps things simple in
>> that all changes are replicated, it won't say an fsync is finished until
>> it's finished on the remote host too,
>
> Oh, so that's how it works.  I assume performance must be, huh, not stellar?
>
> --
> Alvaro Herrera                                http://www.CommandPrompt.com/
> The PostgreSQL Company - Command Prompt, Inc.


Re: Data replication through disk replication

From: Andrew Sullivan
On Fri, May 18, 2007 at 05:03:30PM -0700, Ben wrote:

> that all changes are replicated, it won't say an fsync is finished until
> it's finished on the remote host too, and it won't let you mount the block
> device on the slave system (at least with 0.7x).

How can it guarantee these things?  The web pages say this:

    If the primary node fails, heartbeat is switching the
    secondary device into primary state and starts the
    application there. (If you are using it with a non-journaling
    FS this involves running fsck)

    If the failed node comes up again, it is a new secondary node
    and has to synchronise its content to the primary. This, of
    course, will happen without interruption of service in the
    background.

So what happens in those cases where the primary node gets in trouble
but isn't actually dead yet?  I see a potential for a race condition
here that is really troubling to me.  (Especially since it uses the
TCP/IP stack, which is notoriously subject to DoS on Linux.)  I think
you really had better have something like STONITH running to use
this.

> In case you think it matters, IBM recommends DRBD for DB2.

IBM also sells HACMP.  I refuse to comment on whether IBM's advice on
high availability products is worth taking seriously.

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
The fact that technology doesn't work is no bar to success in the marketplace.
        --Philip Greenspun

Re: Data replication through disk replication

From: "Joris Dobbelsteen"
>-----Original Message-----
>From: pgsql-general-owner@postgresql.org
>[mailto:pgsql-general-owner@postgresql.org] On Behalf Of
>Andrew Sullivan
>Sent: zaterdag 19 mei 2007 15:28
>To: pgsql-general@postgresql.org
>Subject: Re: [GENERAL] Data replication through disk replication
>
>On Fri, May 18, 2007 at 05:03:30PM -0700, Ben wrote:
>
>> that all changes are replicated, it won't say an fsync is finished
>> until it's finished on the remote host too, and it won't let you mount
>> the block device on the slave system (at least with 0.7x).
>
>How can it guarantee these things?  The web pages say this:
>
>    If the primary node fails, heartbeat is switching the
>    secondary device into primary state and starts the
>    application there. (If you are using it with a non-journaling
>    FS this involves running fsck)
>
>    If the failed node comes up again, it is a new secondary node
>    and has to synchronise its content to the primary. This, of
>    course, will happen without interruption of service in the
>    background.
>
>So what happens in those cases where the primary node gets in
>trouble but isn't actually dead yet?  I see a potential for a
>race condition here that is really troubling to me.
>(Especially since it uses the TCP/IP stack, which is
>notoriously subject to DoS on Linux.)  I think you really had
>better have something like STONITH running to use this.

The general advice you see at linux-ha is to use redundant heartbeat
paths. You can use a serial link if you want to; another option is
redundant networks. This is to reduce the probability of a split-brain
situation.

As you guessed, STONITH is pretty much required to guard against possible
'race' conditions caused by hanging nodes.
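
For illustration, redundant heartbeat paths in heartbeat's ha.cf look
roughly like this (device and interface names are made up):

    # /etc/ha.d/ha.cf -- illustrative excerpt
    serial /dev/ttyS0          # serial heartbeat link, independent of
                               # the ethernet
    baud   19200
    bcast  eth1 eth2           # heartbeat broadcasts on two more NICs
    auto_failback off          # a recovered node must not grab its
                               # resources back automatically
    node   pg-node1 pg-node2
    # plus a stonith directive here to configure the actual fencing device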

As a note, DRBD will also do a lot of work for you and prevents some
mistakes you can make. On startup it waits if it does not detect the
other node, since it then doesn't know who has the latest data
available; this can be overridden by a timeout if desired. (In practice
either a single node will fail or both will come up at the same time.)
This prevents running out of sync. It also detects when it is out of
sync, requiring administrator intervention.

Another piece of advice is to take a look at the linux-ha web site and
mailing list. Though the web site might not be the best, the mailing
list is quite active and a lot of knowledge is available.

In general high availability is complex and requires a lot of thought to
cover all possible cases.

[snip]

- Joris Dobbelsteen


Re: Data replication through disk replication

From: Ben
Er, yes, sorry, I didn't mean to imply that you should run without
some kind of STONITH solution, to catch the case when the link DRBD
uses goes down but the other network links are still working fine.
It's in the common case, when everything is working, that DRBD won't
accidentally let you mount the same device on both machines. (In my
experience, operator error is more common than hardware error.)

On May 19, 2007, at 6:28 AM, Andrew Sullivan wrote:

> On Fri, May 18, 2007 at 05:03:30PM -0700, Ben wrote:
>
>> that all changes are replicated, it won't say an fsync is finished until
>> it's finished on the remote host too, and it won't let you mount the block
>> device on the slave system (at least with 0.7x).
>
> How can it guarantee these things?  The web pages say this:
>
>     If the primary node fails, heartbeat is switching the
>     secondary device into primary state and starts the
>     application there. (If you are using it with a non-journaling
>     FS this involves running fsck)
>
>     If the failed node comes up again, it is a new secondary node
>     and has to synchronise its content to the primary. This, of
>     course, will happen without interruption of service in the
>     background.
>
> So what happens in those cases where the primary node gets in trouble
> but isn't actually dead yet?  I see a potential for a race condition
> here that is really troubling to me.  (Especially since it uses the
> TCP/IP stack, which is notoriously subject to DoS on Linux.)  I think
> you really had better have something like STONITH running to use
> this.
>
>> In case you think it matters, IBM recommends DRBD for DB2.
>
> IBM also sells HACMP.  I refuse to comment on whether IBM's advice on
> high availability products is worth taking seriously.
>
> A
>
> --
> Andrew Sullivan  | ajs@crankycanuck.ca
> The fact that technology doesn't work is no bar to success in the
> marketplace.
>         --Philip Greenspun


Re: Data replication through disk replication

From: "Joshua D. Drake"
Alvaro Herrera wrote:
> Ben wrote:
>> If you're just looking for a way to have high availability and you're ok
>> being tied to linux, DRBD is a good way to go. It keeps things simple in
>> that all changes are replicated, it won't say an fsync is finished until
>> it's finished on the remote host too,
>
> Oh, so that's how it works.  I assume performance must be, huh, not stellar?

It depends... if the machines are equal, and you have bonded two
gigabit links together....

Joshua D. Drake



--

       === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
              http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


Re: Data replication through disk replication

From: Thomas Lopatic
> So what happens in those cases where the primary node gets in trouble
> but isn't actually dead yet?

Hmmm. Is this really a problem? Couldn't the secondary DRBD node simply
stop accepting replicated data from the primary node before firing up
postmaster? Then the postmaster on the primary DRBD node would only
write locally and not interfere with the secondary DRBD node.

Unless I am missing something this would be a valid problem with shared
storage but not with DRBD-like replicated storage. (As long as the
secondary node can stop replicating if it decides to do so.)

-Thomas

Re: Data replication through disk replication

From: Andrew Sullivan
On Sun, May 20, 2007 at 12:01:46PM +0200, Thomas Lopatic wrote:
> Hmmm. Is this really a problem? Couldn't the secondary DRBD node simply
> stop accepting replicated data from the primary node before firing up
> postmaster? Then the postmaster on the primary DRBD node would only
> write locally and not interfere with the secondary DRBD node.

I presume that is possible (you're right that, since it's a block
device under OS control, you should be able to tell it which way to
behave).  My fairly quick reading of the docs doesn't tell me that it
works this way, though, so I have to assume it doesn't.  (The
important question for implementors is not what it could do, but what
it actually does.)

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
When my information changes, I alter my conclusions.  What do you do sir?
        --attr. John Maynard Keynes

Hex numbers in psql

From: "madhtr"
hi group :)

How do I use hex numbers in psql?  I.e.,

instead of:
    select 16
I want to do:
    select 0x10

like in C++

I tried doing what this website said
http://www.faqs.org/docs/ppbook/c12119.htm
but it does not work

TY :)




exception question ....

From: "madhtr"
sry, one more question ...

I want to trap an exception and return a -1 no matter WHAT it is ... what do
I need to replace <anything> with?

>>>>>>
create or replace function clrsplit(int4) returns unknown as
$$
BEGIN
delete from split where tkid=$1;
EXCEPTION
 WHEN <anything>
  return -1;
update tk set dtchksp=null where nid=$1;
return 0;
END
$$ language plpgsql;



Re: Data replication through disk replication

From: Ben
On May 20, 2007, at 3:01 AM, Thomas Lopatic wrote:

>> So what happens in those cases where the primary node gets in trouble
>> but isn't actually dead yet?
>
> Hmmm. Is this really a problem?

The problem comes when the primary cannot replicate to the
secondary but can, for whatever reason, still talk to clients. If a
client is told something is committed but that commit isn't
replicated, you have a problem.

Re: Data replication through disk replication

From: Hannes Dorbath
Thomas Lopatic wrote:
>> So what happens in those cases where the primary node gets in trouble
>> but isn't actually dead yet?
>
> Hmmm. Is this really a problem? Couldn't the secondary DRBD node simply
> stop accepting replicated data from the primary node before firing up
> postmaster? Then the postmaster on the primary DRBD node would only
> write locally and not interfere with the secondary DRBD node.
>
> Unless I am missing something this would be a valid problem with shared
> storage but not with DRBD-like replicated storage. (As long as the
> secondary node can stop replicating if it decides to do so.)

Yes, you can always force DRBD to go split-brain if you really want it
to, but this is usually unwanted.

Usually a shared block device is not the only resource a node holds. You
would like to have it hold at least an IP address as well. So it buys
you nothing to be able to fire up PostgreSQL on the secondary, as you
still need to take over additional resources to bring your service back
online AND you need to make sure that the primary node won't recover
and reclaim ownership of resources that have been taken over by the
secondary again.

This is exactly what STONITH is for. If the secondary has the slightest
reason to believe the primary node might be dead it takes that
assumption and makes it reality.
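
With heartbeat v1, that bundle of resources is exactly what one line in
haresources expresses. An illustrative sketch -- the node name, devices,
address and init script are made up:

    # /etc/ha.d/haresources -- all one logical line
    pg-node1 drbddisk::r0 \
             Filesystem::/dev/drbd0::/var/lib/postgresql::ext3 \
             192.168.1.100 \
             postgresql

The node that holds the service owns the DRBD primary role, the mounted
filesystem, the service IP and the postgresql init script as one unit,
and they fail over together.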


--
Best regards,
Hannes Dorbath

Re: Data replication through disk replication

From: Hannes Dorbath
Ben wrote:
> On May 20, 2007, at 3:01 AM, Thomas Lopatic wrote:
> The problem comes when the primary cannot replicate to the secondary
> but can, for whatever reason, still talk to clients. If a client is told
> something is committed but that commit isn't replicated, you have a
> problem.

Right, node and resource fencing is the first thing you need to think
about in any HA setup.


--
Best regards,
Hannes Dorbath

Re: Hex numbers in psql

From: Kris Jurka

On Sun, 20 May 2007, madhtr wrote:

> How do I use hex numbers in psql?  I.e.,
>
> instead of:
>   select 16
> I want to do
>   select 0x10
>

See 4.1.2.3: "SELECT x'1FF'::int"

http://www.postgresql.org/docs/8.2/static/sql-syntax-lexical.html#SQL-SYNTAX-CONSTANTS
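
So the original example becomes:

    SELECT x'10'::int;   -- bit string constant cast to integer, returns 16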

Kris Jurka


Re: exception question ....

From: Tom Lane
"madhtr" <madhtr@schif.org> writes:
> I want to trap an exception and return a -1 no matter WHAT it is ... what do
> I need to replace <anything> with?

OTHERS

            regards, tom lane
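
For completeness, a sketch of the function with that filled in, assuming
the intent is -1 on any failure and 0 on success. Note that in PL/pgSQL
the EXCEPTION clause belongs at the end of the block it guards, and the
return type needs to be a real type such as int4, not unknown:

    create or replace function clrsplit(int4) returns int4 as
    $$
    BEGIN
        delete from split where tkid = $1;
        update tk set dtchksp = null where nid = $1;
        return 0;
    EXCEPTION
        -- OTHERS traps any error raised by the statements above
        WHEN OTHERS THEN
            return -1;
    END;
    $$ language plpgsql;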