Re: BDR problem - Mailing list pgsql-general

From Craig Ringer
Subject Re: BDR problem
Date
Msg-id CAMsr+YG3Gn5_GqF0C16QhhD47GbSojtAVMD__bJvHt3wN+nCkA@mail.gmail.com
In response to BDR problem  (Charles Lynch <charleslynchpostgresql@gmail.com>)
Responses Re: BDR problem
List pgsql-general
On 12 September 2015 at 05:21, Charles Lynch
<charleslynchpostgresql@gmail.com> wrote:

> We have, just recently, ran into a problem. I created a test cluster only
> within NV and after about a week of working without any problems, we got an
> error: Unexpected EOF on SSL connection. I had seen something like this
> before but on initial cluster join and chalked it up to me doing something
> wrong.

That's generally a network-level problem, though it could also occur
if a worker exits unexpectedly.

> This was after a week of working without issue. I wasn't sure what to
> do next. restarting the database started producing errors like this:
>
> LOG:  starting background worker process "bdr
> (6188205071755053119,1,16385,)->bdr (6188203625564571611,1,"
> FATAL:  mismatch in worker state, got 3, expected 1

That's ... very odd. It's violating a sanity check that shouldn't
really ever be triggered.

How exactly did you restart the database? Can you send more info on
your configuration via direct mail to me?

> This would repeat. So I removed this node from the cluster using the proper
> bdr commands and tried re-joining

You can't just re-join a removed node. Once it's removed it's removed
for ever. You have to drop the database (or re-initdb), create a new
blank database, and join it as a new node.

The reason for this is that when you remove the node the replication
slots on other nodes get dropped, so there's no record of what catchup
work needs to be done. It's not really possible to resync the node
with the rest after that. That's the point of node removal: it frees
the resources held by those slots when a node is retired. Otherwise
you'd just switch it off.
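
The drop, recreate, and join-as-new-node sequence described above could
be sketched roughly as follows. This is only an illustration, not a
tested procedure: the Python helper, the database name, the node name
and the DSNs are all hypothetical, and the bdr.bdr_group_join and
bdr.bdr_node_join_wait_for_ready calls are written as I understand them
from the BDR 0.9 documentation, so check them against the version you
actually have installed. The script only builds the SQL text; it does
not connect to anything.

```python
def quote_literal(value: str) -> str:
    """Quote a string as a SQL literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def quote_ident(name: str) -> str:
    """Quote a SQL identifier by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def rejoin_plan(dbname: str, node_name: str, node_dsn: str, peer_dsn: str):
    """Return (recreate, join): two ordered lists of SQL statements.

    recreate: run while connected to some other database on the removed
              node (for example "postgres"). DROP DATABASE fails while
              anything is still connected to the target database.
    join:     run while connected to the new, blank database.
    """
    recreate = [
        f"DROP DATABASE {quote_ident(dbname)};",
        f"CREATE DATABASE {quote_ident(dbname)};",
    ]
    join = [
        # Assumption: btree_gist must exist before the bdr extension.
        "CREATE EXTENSION btree_gist;",
        "CREATE EXTENSION bdr;",
        # Assumption: parameter names as in the BDR 0.9 docs.
        "SELECT bdr.bdr_group_join("
        f"local_node_name := {quote_literal(node_name)}, "
        f"node_external_dsn := {quote_literal(node_dsn)}, "
        f"join_using_dsn := {quote_literal(peer_dsn)});",
        "SELECT bdr.bdr_node_join_wait_for_ready();",
    ]
    return recreate, join


if __name__ == "__main__":
    # All values below are made up for illustration.
    recreate, join = rejoin_plan(
        dbname="appdb",
        node_name="node_b_new",
        node_dsn="host=node-b dbname=appdb",
        peer_dsn="host=node-a dbname=appdb",
    )
    for statement in recreate + join:
        print(statement)
```

The node name in the example is deliberately different from whatever
the removed node was called, since the database is joining as a new
node rather than resuming as the old one.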

> My problem is I don't know what caused this and, more importantly, I'm not
> sure how to fix it / prevent it and I can't launch this into production
> without figuring this out.

The "mismatch in worker state" error is very likely a bug. The
trick will be figuring out how you triggered it.

Did you retain the malfunctioning cluster, or have you deleted it?

> One other thing: I've seen a lot of conflicting information on how to setup
> BDR on ubuntu (using ppas, what pkg to install, and where to get source) I'm
> curious now if I don't have a younger version and that this issue is all but
> fixed now. Here are my build steps if anyone has any comments on how to
> setup bdr better, please let me know.

You should use the apt repository referenced by
http://bdr-project.org/docs/stable/installation-packages.html#INSTALLATION-PACKAGES-DEBIAN

Support is focused mainly on RHEL/CentOS/Fedora, but Debian/Ubuntu
packages are also produced. We're a little behind at the moment and
haven't got 0.9.2 packages out. I'll be pushing 0.9.3 soon and will
produce 0.9.3 packages for Debian/Ubuntu as well as for
Fedora/RHEL/CentOS.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

