Re: hidden junk files in ...data/base/oid/ - Mailing list pgsql-general

From Andrej Vanek
Subject Re: hidden junk files in ...data/base/oid/
Msg-id CAFNFRyGu=dEVoyHWsjVKxbR8xbO6ussOMzPafR1dd=h2fGyBbw@mail.gmail.com
In response to Re: hidden junk files in ...data/base/oid/  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-general
Hello,

thanks for your answer.

I've identified problems in my cluster agent script. It is a custom-written script with built-in automated recovery of a failed slave. It was written at a time when the PostgreSQL 9.1 streaming replication feature was still in beta and no PostgreSQL resource agent for streaming replication was available.
The first problem was that recovery of the failed slave was hardcoded into the start operation, and Pacemaker aborted this start operation when it exceeded the start timeout. With a bigger database, this happened before the backup from the master to the failed slave had finished. This is the point where rsync could be aborted, leaving temporary junk files behind. There was no cleanup before re-running the backup from the master (using rsync), which is probably why the rsync temporary files were left over.
The second problem is what you describe: copying stuff in one direction first, then failing over, then copying in the opposite direction.
This happened because my agent was missing the lock file that the standard ClusterLabs pgsql agent uses to avoid starting a failed master after a double failure followed by a reboot.
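
For illustration, the missing cleanup check could look roughly like the sketch below. It only lists candidates and deletes nothing. It assumes rsync's default temporary naming (a leading dot, the original file name, and a dot followed by six random characters), which matches the .NNNNN.uvwxyz and ..NNNNN.uvwxyz.opqrst names discussed in this thread. The function names and the directory argument are hypothetical and are not part of my agent or of any existing tool.

```python
import re
from pathlib import Path

# Assumption: rsync temp files look like ".<name>.<6 random chars>".
# A temp file of a temp file gets one more leading dot and one more suffix.
# Regular PostgreSQL relation files (16384, 16384.1, 16384_fsm) never start
# with a dot, so they cannot match.
RSYNC_TEMP = re.compile(r"^\.+[^.].*\.[A-Za-z0-9]{6}$")


def is_rsync_leftover(name: str) -> bool:
    """Return True if a file name looks like an rsync temporary file."""
    return RSYNC_TEMP.match(name) is not None


def find_rsync_leftovers(directory: str) -> list[str]:
    """List (not delete) suspicious hidden files directly inside directory."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and is_rsync_leftover(p.name)
    )
```

Such a check would run against each data/base/oid directory on the slave before the backup from the master is restarted. Whether to delete the reported files automatically or only log them is a separate decision.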

Now I'm migrating to the standard Pacemaker PostgreSQL resource agent provided by clusterlabs.org to avoid such issues. It is surely much better tested, with plenty of installations worldwide and community feedback.

In addition, I need to automate recovery from a single (master or slave) failure as much as possible. For this purpose I plan to introduce a new resource on top of the pgsql resource that would recover a failed pgsql slave (or master) when the master is active on the other node (I use only a two-node cluster). Manual recovery by an operator would still be needed when postgres is down on both nodes, to avoid accidental data loss.
Do you know whether such a cluster agent is already available?
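
To make the intended behaviour concrete, here is a minimal sketch of the decision rule I have in mind for that new resource. The names (RecoveryAction, decide_recovery) and the two boolean inputs are hypothetical. A real agent would have to derive them from Pacemaker and PostgreSQL state, which this sketch does not attempt.

```python
from enum import Enum


class RecoveryAction(Enum):
    NONE = "none"                # local postgres is running, nothing to do
    AUTO_RESYNC = "auto_resync"  # re-sync the failed node from the active master
    MANUAL = "manual"            # no active master anywhere: wait for the operator


def decide_recovery(local_running: bool, peer_is_master: bool) -> RecoveryAction:
    """Two-node rule: recover automatically only if the peer is an active master."""
    if local_running:
        return RecoveryAction.NONE
    if peer_is_master:
        return RecoveryAction.AUTO_RESYNC
    # Both nodes down (or peer state unknown): automatic recovery could pick
    # the node with older data, so leave the decision to a human.
    return RecoveryAction.MANUAL
```

For example, decide_recovery(False, True) selects the automatic re-sync, while decide_recovery(False, False) falls back to manual recovery.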

Best Regards, Andrej



2014-05-27 16:09 GMT+02:00 Alvaro Herrera <alvherre@2ndquadrant.com>:
Andrej Vanek wrote:
> Hello,
>
> solved.
> This is not a postgres issue.
>
> The system was used in HA-cluster with streaming replications.
> The hidden files I asked for were created probably by broken (killed)
> rsync. It uses such file-format for temporary files used during copying.
>
> This rsync is used by master to slave database synchronization (full
> on-line backup of master database to slave node) before starting postgres
> in hot-standby mode on slave the node...

You not only have leftover first-order rsync temp files (.NNNNN.uvwxyz)
-- but also when those temp files were being copied over by another
rsync run, which created temp files for the first-order temp files,
leaving you with second-order temp files (..NNNNN.uvwxyz.opqrst).  Not
nice.  I wonder if this is anywhere near sanity -- it looks like you're
copying stuff from one direction first, then failed over, then copied in
the opposite direction.  I would have your setup reviewed real closely,
to avoid data-corrupting configuration mistakes.  I have seen people
make subtle mistakes in their configuration, causing their whole HA
setups to be completely broken.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
