Re: BDR problem - Mailing list pgsql-general
From | Craig Ringer |
---|---|
Subject | Re: BDR problem |
Date | |
Msg-id | CAMsr+YG3Gn5_GqF0C16QhhD47GbSojtAVMD__bJvHt3wN+nCkA@mail.gmail.com |
In response to | BDR problem (Charles Lynch <charleslynchpostgresql@gmail.com>) |
Responses | Re: BDR problem |
List | pgsql-general |
On 12 September 2015 at 05:21, Charles Lynch <charleslynchpostgresql@gmail.com> wrote:

> We have, just recently, ran into a problem. I created a test cluster only
> within NV and after about a week of working without any problems, we got an
> error: Unexpected EOF on SSL connection. I had seen something like this
> before but on initial cluster join and chalked it up to me doing something
> wrong.

That's generally network level, though it could also occur if a worker exits unexpectedly.

> This was after a week of working without issue. I wasn't sure what to
> do next. restarting the database started producing errors like this:
>
> LOG: starting background worker process "bdr
> (6188205071755053119,1,16385,)->bdr (6188203625564571611,1,"
> FATAL: mismatch in worker state, got 3, expected 1

That's ... very odd. It's violating a sanity check that shouldn't really ever be triggered.

How exactly did you restart the database?

Can you send more info on your configuration via direct mail to me?

> This would repeat. So I removed this node from the cluster using the proper
> bdr commands and tried re-joining

You can't just re-join a removed node. Once it's removed it's removed for ever. You have to drop the database (or re-initdb), create a new blank database, and join it as a new node.

The reason for this is that when you remove the node the replication slots on other nodes get dropped, so there's no record of what catchup work needs to be done. It's not really possible to resync the node with the rest after that. That's the point of node removal, to free the resources from those slots when a node is retired, otherwise you'd just switch it off.

> My problem is I don't know what caused this and, more importantly, I'm not
> sure how to fix it / prevent it and I can't launch this into production
> without figuring this out.

The "mismatch in worker state" is strongly likely to be a bug. The trick will be figuring out how you triggered it.
Did you retain the malfunctioning cluster, or have you deleted it?

> One other thing: I've seen a lot of conflicting information on how to setup
> BDR on ubuntu (using ppas, what pkg to install, and where to get source) I'm
> curious now if I don't have a younger version and that this issue is all but
> fixed now. Here are my build steps if anyone has any comments on how to
> setup bdr better, please let me know.

You should use the apt repository referenced by
http://bdr-project.org/docs/stable/installation-packages.html#INSTALLATION-PACKAGES-DEBIAN .

Support is focused mainly on RHEL/CentOS/Fedora, but Debian/Ubuntu packages are also produced. We're a little behind at the moment and haven't got 0.9.2 packages out. I'll be pushing 0.9.3 soon and will produce 0.9.3 packages for Debian/Ubuntu as well as for Fedora/RHEL/CentOS.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
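
For illustration, the re-join procedure described above (drop the database, create a new blank database, join it as a new node) might be sketched as follows. This is only a sketch under assumptions, not a procedure taken from this thread: the database name, node name, and DSNs are hypothetical, and the call names and parameters (`bdr.bdr_group_join`, `bdr.bdr_node_join_wait_for_ready`) are written as I understand the BDR 0.9 documentation, so check them against the docs for your installed version. The helper only builds the SQL text; it does not connect to anything.

```python
# Sketch only: builds the SQL one might run to bring a removed BDR node back
# as a brand-new node. All names and DSNs passed in are hypothetical.

def quote_literal(value: str) -> str:
    """Quote a string as a SQL literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def quote_ident(name: str) -> str:
    """Quote a string as a SQL identifier by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def rejoin_as_new_node_sql(dbname, new_node_name, local_dsn, peer_dsn):
    """Return two lists of statements.

    The first list is meant for a maintenance database (e.g. postgres) on the
    removed node; the second is meant for the freshly created blank database.
    """
    maintenance = [
        # Discard the old copy; it cannot be resynced once the slots are gone.
        f"DROP DATABASE IF EXISTS {quote_ident(dbname)};",
        f"CREATE DATABASE {quote_ident(dbname)};",
    ]
    in_new_db = [
        "CREATE EXTENSION IF NOT EXISTS btree_gist;",
        "CREATE EXTENSION IF NOT EXISTS bdr;",
        # Assumed BDR 0.9 signature; verify against your version's docs.
        "SELECT bdr.bdr_group_join("
        f"local_node_name := {quote_literal(new_node_name)}, "
        f"node_external_dsn := {quote_literal(local_dsn)}, "
        f"join_using_dsn := {quote_literal(peer_dsn)});",
        "SELECT bdr.bdr_node_join_wait_for_ready();",
    ]
    return maintenance, in_new_db


if __name__ == "__main__":
    maintenance, in_new_db = rejoin_as_new_node_sql(
        "appdb",                      # hypothetical database name
        "node2_new",                  # hypothetical new node name
        "host=node2 dbname=appdb",    # hypothetical DSN of this node
        "host=node1 dbname=appdb",    # hypothetical DSN of an existing node
    )
    for stmt in maintenance + in_new_db:
        print(stmt)
```

The statements are split into two lists because the drop and create have to run from a different database than the one being replaced, while the extension and join calls run inside the new blank database.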