Re: better architecture? - Mailing list pgsql-general

From zach cruise
Subject Re: better architecture?
Date
Msg-id CAL8icXyvYLte7MiHh-ymezitmpOX=WYvYR3PV0=G9+3gQn6ULQ@mail.gmail.com
In response to Re: better architecture?  (Adrian Klaver <adrian.klaver@aklaver.com>)
List pgsql-general
On 11/22/14, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
> On 11/21/2014 07:38 PM, zach cruise wrote:
>> On 11/20/14, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>>> On 11/20/2014 04:57 PM, zach cruise wrote:
>>>> On 11/20/14, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>>>>> On 11/20/2014 12:30 PM, zach cruise wrote:
>>>>>>>
>>>>>>> For more info see:
>>>>>>>
>>>>>>> http://www.postgresql.org/docs/9.3/interactive/continuous-archiving.html
>>>>>> to be clear- i'd change my 2-VM setup {"1. master (dev) - 2. slave
>>>>>> (prod)"} to 3 VMs {"1. master (dev) - 2. slave (prod) - 3. archive
>>>>>> (wal)"}.
>>>>>>
>>>>>> but what do i gain?
>>>>>
>>>>> Extra protection against failure, maybe.
>>>>>
>>>>> So:
>>>>>
>>>>>              --->  WAL Archive ---
>>>>>              |                    |
>>>>>              |    Streaming       |
>>>>> master ---  --------------------> slave
>>>>>
>>>>> If the direct link between the master and slave goes down, the slave
>>>>> can still get WALs from the archive. If the archive machine goes down
>>>>> you still have the direct link. If you take the slave down, the master
>>>>> can still push WALs to the archive. This assumes the 'machines' are
>>>>> actually separated and connecting through different networks. You say
>>>>> you are using VMs, but not where they are running. If they are all
>>>>> running on the same machine, over the same network link, then you
>>>>> really do not have protection against network issues. The same if the
>>>>> host machine goes down. This is one of those pen and paper times, when
>>>>> you sketch out the arrangement and start doing what ifs.
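
for reference- a rough sketch of the master-side settings that would back
this layout (9.3-style; "archivehost" and the share path are placeholders
i made up, not my actual config):

  # postgresql.conf on the master (sketch only)
  wal_level = hot_standby       # so the slave can run as a hot standby
  max_wal_senders = 3           # allow streaming connections
  archive_mode = on
  # copy each completed segment to the archive VM; the command must return
  # a non-zero exit status when the copy fails
  archive_command = 'copy "%p" "\\\\archivehost\\wal\\%f"'
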
>>>
>>> First of all, the below is really in need of a whiteboard/paper diagram
>>> to keep track of the moving parts. That being said, here it goes:
>>>
>>>>
>>>> master, slave and archive can be 3 separate VMs on 1 host, with their
>>>> clones on 2nd and 3rd hosts.
>>>
>>> I can see the above being a potential nightmare. I am not sure how you
>>> ensure that the cloning process results in clones that exactly mirror
>>> the state of the originals at a particular point in time. Failing in
>>> that would seem to me to lead to no end of issues in the replication
>>> process.
>>>
>>>>
>>>> a follow-up question on WAL recycling: ("When WAL archiving is being
>>>> done, the log segments must be archived before being recycled or
>>>> removed" from
>>>> http://www.postgresql.org/docs/9.3/static/wal-configuration.html)
>>>>
>>>> say streaming is off-
>>>> * if both master and archive are down, slave is still up and running.
>>>> yes?
>>>
>>> Yes.
>>>
>>>> * if master writes data when archive is down, it will copy over to
>>>> slave when archive is back up. yes?
>>>
>>> If streaming is off and you are doing archiving then it will copy over
>>> to the archive.
>>>
>>>> * but if WAL is recycled before archive is back up, it will not copy
>>>> over to slave. yes?
>>>
>>> The issue here, as pointed out previously, is that the WALs will stack
>>> up on the master because it will not be able to archive them. So then
>>> you run into a potential out-of-space issue on the master. From here:
>>>
>>> http://www.postgresql.org/docs/9.3/interactive/continuous-archiving.html
>>>
>>> "It is important that the archive command return zero exit status if and
>>> only if it succeeds. Upon getting a zero result, PostgreSQL will assume
>>> that the file has been successfully archived, and will remove or recycle
>>> it. However, a nonzero status tells PostgreSQL that the file was not
>>> archived; it will try again periodically until it succeeds."
>>>
>>> ...
>>>
>>> "While designing your archiving setup, consider what will happen if the
>>> archive command fails repeatedly because some aspect requires operator
>>> intervention or the archive runs out of space. For example, this could
>>> occur if you write to tape without an autochanger; when the tape fills,
>>> nothing further can be archived until the tape is swapped. You should
>>> ensure that any error condition or request to a human operator is
>>> reported appropriately so that the situation can be resolved reasonably
>>> quickly. The pg_xlog/ directory will continue to fill with WAL segment
>>> files until the situation is resolved. (If the file system containing
>>> pg_xlog/ fills up, PostgreSQL will do a PANIC shutdown. No committed
>>> transactions will be lost, but the database will remain offline until
>>> you free some space.)"
>>>
>>>
>>>> see, my concern with a separate archive is: if the archive is down and
>>>> the master gets stuck retrying to push the same segment again and
>>>> again, there may be a problem in recovery when the archive is back up.
>>>> no?
>>>
>>> See above.
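
(since 9.3 has no pg_stat_archiver view, one way i can watch for segments
stuck waiting to be archived is to count the .ready files on the master,
something like:

  -- run as superuser on the master; a growing count means archive_command
  -- keeps failing and pg_xlog is filling up
  SELECT count(*)
  FROM pg_ls_dir('pg_xlog/archive_status') AS f
  WHERE f LIKE '%.ready';

the path is relative to the data directory.)
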
>>
>> also, streaming is working (checked with wireshark) and WAL is being
>> written by the master, but if i comment out the replication entry in
>> pg_hba.conf, WAL does not copy over to the slave?!
>
> Do you have a restore_command set on the slave?:
>
> http://www.postgresql.org/docs/9.3/static/archive-recovery-settings.html#RESTORE-COMMAND

yes (recovery.conf on slave: restore_command = 'copy "\\path\\%f" "%p"')

> per the docs here:
>
> http://www.postgresql.org/docs/9.3/static/standby-settings.html
>
> standby_mode (boolean)
>
>      Specifies whether to start the PostgreSQL server as a standby. If
> this parameter is on, the server will not stop recovery when the end of
> archived WAL is reached, but will keep trying to continue recovery by
> fetching new WAL segments using restore_command and/or by connecting to
> the primary server as specified by the primary_conninfo setting.

yes (recovery.conf on slave: standby_mode = 'on')
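
for context, a sketch of a recovery.conf that uses both streaming and the
WAL archive (hostnames and the share path are placeholders, not my exact
settings):

  standby_mode = 'on'
  # streaming connection to the master
  primary_conninfo = 'host=masterhost port=5432 user=replicator'
  # fall back to the WAL archive when streaming is unavailable
  restore_command = 'copy "\\\\archivehost\\wal\\%f" "%p"'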

>> how can i verify that WAL archiving is working for slave?
>
> Look in the logs or use the following functions:
>
> http://www.postgresql.org/docs/9.3/interactive/functions-admin.html#FUNCTIONS-RECOVERY-CONTROL

here're my outputs:
pg_is_in_recovery() --> t
pg_last_xlog_receive_location() --> "5/D..."
pg_last_xlog_replay_location() --> "5/D..."
pg_last_xact_replay_timestamp() --> "2014-..."

pg_is_xlog_replay_paused() --> f
pg_xlog_replay_pause()
pg_xlog_replay_resume()
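
to put a number on the replay lag, i can also run something like:

  -- receive vs. replay gap in bytes on the slave
  SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(),
                               pg_last_xlog_replay_location());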

