Thread: Increasing WAL usage followed by sudden drop

Increasing WAL usage followed by sudden drop

From
Joseph Marlin
Date:
http://i.imgur.com/sva4H.png

In the image above, please find the traffic we have seen from our main postgresql node to our cloud backup.

A few notes on that image:

a) We're only interested in looking at the blue - outbound - traffic
b) In general, this pipe is almost exclusively for WAL usage only.

Hopefully you can see how generalized WAL traffic increases, until it cuts off sharply, only to begin increasing again.



http://i.imgur.com/2V8XY.png

In that image, you can see the traffic just after a cutoff - slowly ramping up again. You can also see our mysterious
sawtoothpattern - spikes of WAL traffic that occur on the hour, quarter-hour, half-hour, and three-quarter-hour marks.
Wedon't see any corresponding spikes in database activity at those times, so we're also mystified as to why these
spikesare occurring.  


Any ideas on any of this? Why the sawteeth? Why the rise-then-drop?

Thanks so much!


Re: Increasing WAL usage followed by sudden drop

From
Josh Berkus
Date:
> Any ideas on any of this? Why the sawteeth? Why the rise-then-drop?

Well, my first thought on the sawteeth is that you have archive_timeout
set to 15 minutes.  No?

As for the gradual buildup over days, that most likely corresponds to
either changes in application activity levels, or some kind of weekly
data purge cycle which shrinks your database every weekend.  Since I
don't know anything about your admin or your application, that's a best
guess.


--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Increasing WAL usage followed by sudden drop

From
delongboy
Date:
We are not doing anything to postgres that would cause the rise and drop.
Data base activity is pretty consistent.  nor are we doing any kind of
purge.  This week the drop occurred after 6 days.  We are thinking it must
be some kind of internal postgres activity but we can't track it down.



--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Increasing-WAL-usage-followed-by-sudden-drop-tp5719567p5720136.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


Re: Increasing WAL usage followed by sudden drop

From
"Kevin Grittner"
Date:
delongboy <sdelong@saucontech.com> wrote:
> We are not doing anything to postgres that would cause the rise
> and drop.  Data base activity is pretty consistent.  nor are we
> doing any kind of purge.  This week the drop occurred after 6
> days.  We are thinking it must be some kind of internal postgres
> activity but we can't track it down.

Maybe autovacuum freezing tuples (which is a WAL-logged operation)
as tables periodically hit the autovacuum_freeze_max_age limit?

-Kevin


Re: Increasing WAL usage followed by sudden drop

From
Joshua Berkus
Date:
> We are not doing anything to postgres that would cause the rise and
> drop.
> Data base activity is pretty consistent.  nor are we doing any kind
> of
> purge.  This week the drop occurred after 6 days.  We are thinking it
> must
> be some kind of internal postgres activity but we can't track it
> down.

Well, we certainly can't figure it out on this list by blind guessing ...


Re: Increasing WAL usage followed by sudden drop

From
delongboy
Date:
Josh Berkus wrote
>
>> We are not doing anything to postgres that would cause the rise and
>> drop.
>> Data base activity is pretty consistent.  nor are we doing any kind
>> of
>> purge.  This week the drop occurred after 6 days.  We are thinking it
>> must
>> be some kind of internal postgres activity but we can't track it
>> down.
>
> Well, we certainly can't figure it out on this list by blind guessing ...
>

Have to agree with you there.  Unfortunately this is where we've ended up.
What can we look at and/or show that would help us to narrow it down?  Is
there anyway we can look into the wal file and see exactly what is being
sent?  I think this might actually give us the most insight.



--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Increasing-WAL-usage-followed-by-sudden-drop-tp5719567p5720250.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


Re: Increasing WAL usage followed by sudden drop

From
Jeff Janes
Date:
On Fri, Aug 17, 2012 at 10:53 AM, delongboy <sdelong@saucontech.com> wrote:
>
> Josh Berkus wrote
>>
>>> We are not doing anything to postgres that would cause the rise and
>>> drop.
>>> Data base activity is pretty consistent.  nor are we doing any kind
>>> of
>>> purge.  This week the drop occurred after 6 days.  We are thinking it
>>> must
>>> be some kind of internal postgres activity but we can't track it
>>> down.
>>
>> Well, we certainly can't figure it out on this list by blind guessing ...
>>
>
> Have to agree with you there.  Unfortunately this is where we've ended up.
> What can we look at and/or show that would help us to narrow it down?  Is
> there anyway we can look into the wal file and see exactly what is being
> sent?  I think this might actually give us the most insight.

Maybe there is an easier way, but one thing would be to compile a test
server (of the same version as the production) with WAL_DEBUG defined
in src/include/pg_config_manual.h, turn on the wal_debug guc, and
crank up trace_recovery_messages.  Then replay the WAL log files from
production through this test server and see what it logs.  That
requires that you have

Easier would to be turn on wal_debug and watch the server log as the
WAL logs are generated, instead of recovered, but you probably don't
want to do that on production.  So you would need a workload generator
that also exhibits the phenomenon of interest.

Cheers,

Jeff


Re: Increasing WAL usage followed by sudden drop

From
delongboy
Date:
Jeff Janes wrote
>
> Maybe there is an easier way, but one thing would be to compile a test
> server (of the same version as the production) with WAL_DEBUG defined
> in src/include/pg_config_manual.h, turn on the wal_debug guc, and
> crank up trace_recovery_messages.  Then replay the WAL log files from
> production through this test server and see what it logs.  That
> requires that you have
>
> Easier would to be turn on wal_debug and watch the server log as the
> WAL logs are generated, instead of recovered, but you probably don't
> want to do that on production.  So you would need a workload generator
> that also exhibits the phenomenon of interest.
>

This sounds like it may help me see what is going on.  However I am not
finding very much documentation as to how to do this exactly.  What I have
is it seems this has to be set and postgres needs to be re-compiled to
enable it.  Is this true?  As that would not really be a viable option right
now.  I am in position to set up a test server and run wal files through it.
But I am not sure how to accomplish this exactly?  Is there somewhere you
anyone could point me to find documentation on how to do this?

Thanks a lot for everyone's input so far.



--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Increasing-WAL-usage-followed-by-sudden-drop-tp5719567p5720492.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


Re: Increasing WAL usage followed by sudden drop

From
Jeff Janes
Date:
On Mon, Aug 20, 2012 at 1:51 PM, delongboy <sdelong@saucontech.com> wrote:
>
> Jeff Janes wrote
>>
>> Maybe there is an easier way, but one thing would be to compile a test
>> server (of the same version as the production) with WAL_DEBUG defined
>> in src/include/pg_config_manual.h, turn on the wal_debug guc, and
>> crank up trace_recovery_messages.  Then replay the WAL log files from
>> production through this test server and see what it logs.  That
>> requires that you have
>>

Sorry, I got distracted during editing and didn't finish my sentence.
It requires you have a backup to apply the WAL to, and that you have
the entire history of WAL from the when the backup was started, until
the time when the interesting things are happening.  That is rather
annoying.

It seems like it shouldn't be all that hard to write a tool to parse
WAL logs in a context-free basis (i.e. without the backup to start
applying them to) and emit some kind of descriptions of the records
and their sizes.  But I don't know about such a tool already existing,
and am not able to offer to create one.  (And assuming one existed,
keeping it in sync with the main code would be a continuing problem)


>
> This sounds like it may help me see what is going on.  However I am not
> finding very much documentation as to how to do this exactly.  What I have
> is it seems this has to be set and postgres needs to be re-compiled to
> enable it.  Is this true?

Yes.  The compilation only needs to happen on the test server,
however, not the production server.

> As that would not really be a viable option right
> now.  I am in position to set up a test server and run wal files through it.
> But I am not sure how to accomplish this exactly?  Is there somewhere you
> anyone could point me to find documentation on how to do this?

creating the backup, accumulating the logs, and replaying them are described in:

http://www.postgresql.org/docs/9.1/static/continuous-archiving.html

Of course it does not explicitly describe the case of replaying
through a toy system rather than another production system.  It
assumes you are replaying through a soon-to-become production server.

I'm not sure how to address that part, other than to have you ask
specific questions.

Cheers,

Jeff


Re: Increasing WAL usage followed by sudden drop

From
delongboy
Date:
Jeff Janes wrote
>
> It seems like it shouldn't be all that hard to write a tool to parse
> WAL logs in a context-free basis (i.e. without the backup to start
> applying them to) and emit some kind of descriptions of the records
> and their sizes.  But I don't know about such a tool already existing,
> and am not able to offer to create one.  (And assuming one existed,
> keeping it in sync with the main code would be a continuing problem)
>

I appreciate your help Jeff.  I have come across what would seem such a
tool.
Its called xlogdump
I am working on getting it installed, having issues with libs I think at the
moment. I will let you know how it works out.

Thank you everybody for your input!



--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Increasing-WAL-usage-followed-by-sudden-drop-tp5719567p5720953.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.