Thread: Increasing WAL usage followed by sudden drop
http://i.imgur.com/sva4H.png

In the image above, please find the traffic we have seen from our main postgresql node to our cloud backup. A few notes on that image: a) We're only interested in looking at the blue (outbound) traffic. b) In general, this pipe is used almost exclusively for WAL traffic. Hopefully you can see how overall WAL traffic increases, until it cuts off sharply, only to begin increasing again.

http://i.imgur.com/2V8XY.png

In that image, you can see the traffic just after a cutoff, slowly ramping up again. You can also see our mysterious sawtooth pattern: spikes of WAL traffic that occur on the hour, quarter-hour, half-hour, and three-quarter-hour marks. We don't see any corresponding spikes in database activity at those times, so we're also mystified as to why these spikes are occurring.

Any ideas on any of this? Why the sawteeth? Why the rise-then-drop? Thanks so much!
> Any ideas on any of this? Why the sawteeth? Why the rise-then-drop?

Well, my first thought on the sawteeth is that you have archive_timeout set to 15 minutes. No?

As for the gradual buildup over days, that most likely corresponds to either changes in application activity levels, or some kind of weekly data purge cycle which shrinks your database every weekend. Since I don't know anything about your admin or your application, that's a best guess.

-- Josh Berkus, PostgreSQL Experts Inc. http://pgexperts.com
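To make the archive_timeout hypothesis concrete: when archive_timeout expires, the server switches to a new WAL segment and ships the old one, so even a quiet database pushes whole segments out at the timeout interval. A rough sketch of the floor that implies, assuming the default 16 MB segment size and a hypothetical 15-minute setting matching the quarter-hour spikes:

```python
# Minimum WAL shipped per hour implied by archive_timeout.
# Assumes the default 16 MB WAL segment size; archive_timeout = 15 min
# forces a segment switch (and archive) every 900 s even when the
# database is nearly idle, producing spikes at :00, :15, :30, :45.

SEGMENT_SIZE_MB = 16          # default WAL segment size
ARCHIVE_TIMEOUT_S = 15 * 60   # hypothetical setting matching the sawtooth

switches_per_hour = 3600 // ARCHIVE_TIMEOUT_S
floor_mb_per_hour = switches_per_hour * SEGMENT_SIZE_MB
print(switches_per_hour, floor_mb_per_hour)  # 4 segments, 64 MB/hour
```

If the observed spike sizes are close to a multiple of 16 MB, that would be a strong hint that segment switching, not application activity, is what you are seeing.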
We are not doing anything to postgres that would cause the rise and drop. Database activity is pretty consistent, nor are we doing any kind of purge. This week the drop occurred after 6 days. We are thinking it must be some kind of internal postgres activity, but we can't track it down.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Increasing-WAL-usage-followed-by-sudden-drop-tp5719567p5720136.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.
delongboy <sdelong@saucontech.com> wrote:
> We are not doing anything to postgres that would cause the rise
> and drop. Data base activity is pretty consistent. nor are we
> doing any kind of purge. This week the drop occurred after 6
> days. We are thinking it must be some kind of internal postgres
> activity but we can't track it down.

Maybe autovacuum freezing tuples (which is a WAL-logged operation) as tables periodically hit the autovacuum_freeze_max_age limit?

-Kevin
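A back-of-envelope check on that hypothesis: with the default autovacuum_freeze_max_age of 200 million transactions, a table gets a forced anti-wraparound vacuum once that many XIDs have been consumed since it was last frozen. For the freeze cycle to line up with the observed 6-day interval, the implied XID consumption rate would be:

```python
# If the 6-day WAL cycle were driven by tables hitting
# autovacuum_freeze_max_age (default 200 million XIDs), the implied
# transaction-ID consumption rate would be:

FREEZE_MAX_AGE = 200_000_000      # default autovacuum_freeze_max_age
CYCLE_DAYS = 6                    # observed interval between drops

xids_per_second = FREEZE_MAX_AGE / (CYCLE_DAYS * 86_400)
print(round(xids_per_second))     # roughly 386 XIDs/s
```

Sampling txid_current() a few minutes apart on the production server and dividing by the elapsed time would show whether the real XID burn rate is anywhere near that figure.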
> We are not doing anything to postgres that would cause the rise and
> drop. Data base activity is pretty consistent. nor are we doing any
> kind of purge. This week the drop occurred after 6 days. We are
> thinking it must be some kind of internal postgres activity but we
> can't track it down.

Well, we certainly can't figure it out on this list by blind guessing ...
Josh Berkus wrote
> Well, we certainly can't figure it out on this list by blind guessing ...

Have to agree with you there. Unfortunately this is where we've ended up. What can we look at and/or show that would help us narrow it down? Is there any way we can look into the WAL files and see exactly what is being sent? I think this might actually give us the most insight.
On Fri, Aug 17, 2012 at 10:53 AM, delongboy <sdelong@saucontech.com> wrote:
> Have to agree with you there. Unfortunately this is where we've ended
> up. What can we look at and/or show that would help us to narrow it
> down? Is there anyway we can look into the wal file and see exactly
> what is being sent? I think this might actually give us the most
> insight.

Maybe there is an easier way, but one thing would be to compile a test server (of the same version as the production) with WAL_DEBUG defined in src/include/pg_config_manual.h, turn on the wal_debug guc, and crank up trace_recovery_messages. Then replay the WAL log files from production through this test server and see what it logs. That requires that you have

Easier would be to turn on wal_debug and watch the server log as the WAL logs are generated, instead of recovered, but you probably don't want to do that on production. So you would need a workload generator that also exhibits the phenomenon of interest.

Cheers,

Jeff
Jeff Janes wrote
> Maybe there is an easier way, but one thing would be to compile a test
> server (of the same version as the production) with WAL_DEBUG defined
> in src/include/pg_config_manual.h, turn on the wal_debug guc, and
> crank up trace_recovery_messages. Then replay the WAL log files from
> production through this test server and see what it logs.
>
> Easier would be to turn on wal_debug and watch the server log as the
> WAL logs are generated, instead of recovered, but you probably don't
> want to do that on production.

This sounds like it may help me see what is going on. However, I am not finding very much documentation as to how to do this exactly. From what I can tell, WAL_DEBUG has to be set and postgres needs to be re-compiled to enable it. Is this true? That would not really be a viable option right now. I am in a position to set up a test server and run WAL files through it, but I am not sure how to accomplish this exactly. Is there somewhere anyone could point me to find documentation on how to do this?

Thanks a lot for everyone's input so far.
On Mon, Aug 20, 2012 at 1:51 PM, delongboy <sdelong@saucontech.com> wrote:
> Jeff Janes wrote
>> Maybe there is an easier way, but one thing would be to compile a test
>> server (of the same version as the production) with WAL_DEBUG defined
>> in src/include/pg_config_manual.h, turn on the wal_debug guc, and
>> crank up trace_recovery_messages. Then replay the WAL log files from
>> production through this test server and see what it logs. That
>> requires that you have

Sorry, I got distracted during editing and didn't finish my sentence. It requires that you have a backup to apply the WAL to, and that you have the entire history of WAL from when the backup was started until the time when the interesting things are happening. That is rather annoying.

It seems like it shouldn't be all that hard to write a tool to parse WAL logs on a context-free basis (i.e. without the backup to start applying them to) and emit some kind of description of the records and their sizes. But I don't know of such a tool already existing, and am not able to offer to create one. (And assuming one existed, keeping it in sync with the main code would be a continuing problem.)

> This sounds like it may help me see what is going on. However I am not
> finding very much documentation as to how to do this exactly. What I
> have is it seems this has to be set and postgres needs to be
> re-compiled to enable it. Is this true?

Yes. The compilation only needs to happen on the test server, however, not the production server.

> As that would not really be a viable option right now. I am in
> position to set up a test server and run wal files through it. But I
> am not sure how to accomplish this exactly? Is there somewhere you
> anyone could point me to find documentation on how to do this?
Creating the backup, accumulating the logs, and replaying them are described in: http://www.postgresql.org/docs/9.1/static/continuous-archiving.html

Of course it does not explicitly describe the case of replaying through a toy system rather than another production system; it assumes you are replaying into a soon-to-become production server. I'm not sure how to address that part, other than to have you ask specific questions.

Cheers,

Jeff
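For the test-server side of the replay, a minimal sketch of the settings involved might look like the following. This assumes a 9.1-era layout (recovery.conf in the data directory) and a build configured with WAL_DEBUG defined; the archive path is a hypothetical placeholder for wherever your production WAL is being copied:

```
# postgresql.conf on the test server (binary compiled with WAL_DEBUG):
wal_debug = on
trace_recovery_messages = debug2
log_min_messages = debug2

# recovery.conf in the test server's data directory;
# /path/to/wal_archive is a stand-in for your real archive location:
restore_command = 'cp /path/to/wal_archive/%f "%p"'
```

With that in place, starting the test server against a restored base backup should replay the archived segments and log a line per WAL record, which is where the per-15-minute pattern should become visible.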
Jeff Janes wrote
> It seems like it shouldn't be all that hard to write a tool to parse
> WAL logs in a context-free basis (i.e. without the backup to start
> applying them to) and emit some kind of descriptions of the records
> and their sizes.

I appreciate your help, Jeff. I have come across what seems to be such a tool. It's called xlogdump. I am working on getting it installed; I think I'm having issues with libs at the moment. I will let you know how it works out.

Thank you everybody for your input!
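Once a per-record dump is available, the useful summary for this problem is WAL volume grouped by resource manager (Heap, Btree, XLOG, etc.), since a freeze or checkpoint cycle would stand out as a burst from one resource manager. A small sketch of that aggregation; the line format below is a made-up stand-in, not xlogdump's actual output, so the parsing would need adapting to whatever the installed version prints:

```python
from collections import Counter

# Tally WAL bytes per resource manager from per-record dump lines.
# The "rmgr: <name> len: <bytes>" format here is hypothetical; adjust
# the field positions to match the real tool's output.
sample = """\
rmgr: Heap     len: 54
rmgr: Heap     len: 80
rmgr: Btree    len: 42
rmgr: XLOG     len: 30
"""

totals = Counter()
for line in sample.splitlines():
    parts = line.split()
    if parts and parts[0] == "rmgr:":
        totals[parts[1]] += int(parts[-1])

# Largest WAL producers first.
for rmgr, nbytes in totals.most_common():
    print(rmgr, nbytes)
```

Running something like this over segments from a spike versus a quiet period would show which record type accounts for the extra volume.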