Thread: BUG #13962: transaction logs growing on standby
The following bug has been logged on the website:

Bug reference:      13962
Logged by:          Nick Bales
Email address:      nick.bales@rackspace.com
PostgreSQL version: 9.3.9
Operating system:   CentOS 6.6
Description:

On several of my Postgres 9.3.x clusters with streaming replication, the
number of transaction logs (files in pg_xlog, not the archived logs) on the
standby nodes is growing past the theoretical maximum (wal_keep_segments +
2 * checkpoint_segments + 1). Most are appropriately recycled at each
restartpoint, as can be seen in log entries like:

restartpoint complete: wrote 128 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 recycled

However, at semi-regular intervals, no xlogs are added, removed, or recycled:

restartpoint complete: wrote 135 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled

The number of orphaned files for a period directly matches the number of
log occurrences where no files are recycled during a restartpoint. There is
also no log entry showing that a new transaction log file was added, even
though the number of xlogs in the directory is growing. Furthermore, once a
file is orphaned, it sticks around for the life of the system, ultimately
exhausting all disk space on long-running systems.

Relevant settings:

wal_level = hot_standby
wal_keep_segments = 200
checkpoint_segments = 128
checkpoint_timeout = 5min
checkpoint_completion_target = 0.7
checkpoint_warning = 30s
log_checkpoints = on
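For reference, a minimal sketch of checking the live segment count against
that ceiling (assumes superuser access; pg_ls_dir() exists in 9.3):

    -- wal_keep_segments + 2 * checkpoint_segments + 1 = 200 + 2*128 + 1 = 457
    SELECT count(*)           AS segment_files,
           200 + 2 * 128 + 1  AS theoretical_max
    FROM pg_ls_dir('pg_xlog') AS f
    WHERE f ~ '^[0-9A-F]{24}$';  -- WAL segment names are 24 hex characters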
On Thu, Feb 18, 2016 at 5:42 AM, Nick Bales <nick.bales@rackspace.com> wrote:

> I am not seeing any instances where the standby nodes are falling behind
> the primary. However, I do see several instances each day where the
> standby is having to re-establish the streaming connection to the primary:
>
> 2015-12-15 07:25:10 GMT 12969 XX000 566f9ae1.32a9 0 FATAL: could not
> send data to WAL stream: server closed the connection unexpectedly
> 2015-12-15 07:25:10 GMT 33302 00000 566fc056.8216 0 LOG: started
> streaming WAL from primary at 374/F1000000 on timeline 1
> 2015-12-15 10:05:13 GMT 33302 XX000 566fc056.8216 0 FATAL: could not
> send data to WAL stream: server closed the connection unexpectedly
> 2015-12-15 10:05:13 GMT 5001 00000 566fe5d9.1389 0 LOG: started
> streaming WAL from primary at 375/11000000 on timeline 1
>
> Upon closer inspection, the number of times this is occurring per day does
> correspond to the number of orphaned xlogs for the day (by each file's
> last-updated timestamp), so there does seem to be some relevance. Will
> need to investigate further as to what is causing the streaming connection
> to drop.

That's the issue: there is an interruption in streaming, probably due to
network-related issues. That needs to be fixed first. So this is not a bug.

Please note that this is not an appropriate mailing list for these
technical queries. Please consider sending such technical queries to the
pgsql-general mailing list.

Regards,
Venkata B N

Fujitsu Australia
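One way to watch for those drops from the primary side is the
pg_stat_replication view (a sketch; the *_location column names are as of
9.3 and became *_lsn in PostgreSQL 10):

    -- Run on the primary. A recent backend_start indicates the walreceiver
    -- reconnected; a growing gap between sent and replay indicates lag.
    SELECT client_addr, state, backend_start,
           sent_location, write_location, flush_location, replay_location
    FROM pg_stat_replication;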
Understood. I really thought I had exhausted all potential issues before
submitting this. However, it does still seem like the orphaned xlogs should
be cleaned up if they predate the latest checkpoint.

On 02/17/2016 04:21 PM, Venkata Balaji N wrote:
> On Thu, Feb 18, 2016 at 5:42 AM, Nick Bales <nick.bales@rackspace.com
> <mailto:nick.bales@rackspace.com>> wrote:
>
>> I am not seeing any instances where the standby nodes are falling
>> behind the primary. However, I do see several instances each day where
>> the standby is having to re-establish the streaming connection to the
>> primary:
>>
>> 2015-12-15 07:25:10 GMT 12969 XX000 566f9ae1.32a9 0 FATAL: could not
>> send data to WAL stream: server closed the connection unexpectedly
>> 2015-12-15 07:25:10 GMT 33302 00000 566fc056.8216 0 LOG: started
>> streaming WAL from primary at 374/F1000000 on timeline 1
>> 2015-12-15 10:05:13 GMT 33302 XX000 566fc056.8216 0 FATAL: could not
>> send data to WAL stream: server closed the connection unexpectedly
>> 2015-12-15 10:05:13 GMT 5001 00000 566fe5d9.1389 0 LOG: started
>> streaming WAL from primary at 375/11000000 on timeline 1
>>
>> Upon closer inspection, the number of times this is occurring per day
>> does correspond to the number of orphaned xlogs for the day (by each
>> file's last-updated timestamp), so there does seem to be some relevance.
>> Will need to investigate further as to what is causing the streaming
>> connection to drop.
>
> That's the issue: there is an interruption in streaming, probably due to
> network-related issues. That needs to be fixed first. So this is not a
> bug. Please note that this is not an appropriate mailing list for these
> technical queries. Please consider sending such technical queries to the
> pgsql-general mailing list.
>
> Regards,
> Venkata B N
>
> Fujitsu Australia
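A hedged sketch for spotting those orphans on the standby: segment file
names sort in WAL order, so listing the oldest names together with their
modification times (the timestamps Nick is matching against the disconnect
times) should make the stragglers stand out. Note that pg_xlogfile_name()
cannot be run during recovery, so the results have to be compared by hand
against the latest restartpoint's REDO information from pg_controldata:

    -- Run on the standby as superuser; pg_ls_dir() and pg_stat_file()
    -- both exist in 9.3 and are usable on a hot standby.
    SELECT f AS segment_file,
           (pg_stat_file('pg_xlog/' || f)).modification AS last_modified
    FROM pg_ls_dir('pg_xlog') AS f
    WHERE f ~ '^[0-9A-F]{24}$'
    ORDER BY f
    LIMIT 20;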
Just curious -- did communication happen in private email here? There's at
least one email we do not have on the archives:

http://www.postgresql.org/message-id/flat/CAEyp7J_sYm9Ru-uD4kCSX7_BnC9vq8vDq9Sw9vkj3m2+jK18mw@mail.gmail.com#CAEyp7J_sYm9Ru-uD4kCSX7_BnC9vq8vDq9Sw9vkj3m2+jK18mw@mail.gmail.com

which appears as quoted material from Nick in Venkata's response, but is not
present in the first email in the thread.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 18, 2016 at 9:50 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> Just curious -- did communication happen in private email here? There's
> at least one email we do not have on the archives
>
> http://www.postgresql.org/message-id/flat/CAEyp7J_sYm9Ru-uD4kCSX7_BnC9vq8vDq9Sw9vkj3m2+jK18mw@mail.gmail.com#CAEyp7J_sYm9Ru-uD4kCSX7_BnC9vq8vDq9Sw9vkj3m2+jK18mw@mail.gmail.com
>
> which appears as quoted material from Nick in Venkata's response, but is
> not present in the first email in the thread.

Yes, there is one email which I received in private. I replied including
the community email address.

Regards,
Venkata B N

Fujitsu Australia