On Wed, 2006-05-17 at 00:36 -0400, Tom Lane wrote:
> Jeff Frost <jeff@frostconsultingllc.com> writes:
> > On Tue, 16 May 2006, Simon Riggs wrote:
> >> Whatever happened between 02:08 and 02:14 seems important.
>
> > I have the logs and after reviewing /var/log/messages for that time period,
> > there is no other activity besides postgres.
>
> I have a lurking feeling that the still-hypothetical connection between
> archiver and foreground operations might come into operation at pg_clog
> page boundaries (which require emitting XLOG events) --- that is, every
> 32K transactions something special happens. The time delay between
> archiver wedging and foreground wedging would then correspond to how
> long it took the XID counter to reach the next 32K multiple. (Jeff,
> what transaction rate do you see on that server --- is that a plausible
> delay for some thousands of transactions to pass?)
>
> This is just a guess, but if you check the archives for Chris K-L's
> out-of-disk-space server meltdown a year or three ago, you'll see
> something similar.

You'll have to explain a little more. I checked the archives...

The archiver looks for archive_status files that end with .ready, and
that has nothing at all to do with transactions, LWLocks etc. If there's
a file ready, it will archive it; if there's not, it won't. There is
very deliberately a very low amount of synchronization there: the
archiver holds no locks, LWLocks or spinlocks at any time.
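
As a rough illustration of that logic (a simplified sketch, not the
actual archiver code; the function names and the archive_one callback
are hypothetical stand-ins, with archive_one playing the role of
archive_command):

```python
import os

READY_SUFFIX = ".ready"


def ready_segments(status_dir):
    """Return segment names that have a .ready marker, sorted by name."""
    names = [
        entry[: -len(READY_SUFFIX)]
        for entry in os.listdir(status_dir)
        if entry.endswith(READY_SUFFIX)
    ]
    return sorted(names)


def archive_pass(status_dir, archive_one):
    """Archive each ready segment in turn; no locks are taken anywhere.

    archive_one is a hypothetical callback that returns True on success.
    """
    archived = []
    for seg in ready_segments(status_dir):
        if not archive_one(seg):
            break  # leave the .ready file in place and retry on a later pass
        # Mark the segment as archived by renaming .ready to .done.
        os.rename(
            os.path.join(status_dir, seg + READY_SUFFIX),
            os.path.join(status_dir, seg + ".done"),
        )
        archived.append(seg)
    return archived
```

The only shared state in this sketch is the directory listing itself,
which is the point: nothing here touches transactions or lock tables.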

The "lurking feeling" scenario above might or might not be an issue
here, but I can't see how the archiver could be involved at all. I see
no evidence that the archiver is the source of a problem here; the only
reason we're checking it is Jeff's original conjecture that there was a
connection. There *was* a problem, yes, but I think we're looking in
the wrong place for the murder weapon.

pg_clog page extension does look like it could cause problems, generally.
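
For reference, the "every 32K transactions" figure quoted above follows
from pg_clog storing 2 status bits per transaction on an 8K page. A
small sketch of the arithmetic, assuming the default block size and
ignoring XID wraparound (the helper names and the rate in the usage
note are made up for illustration):

```python
BLCKSZ = 8192            # default block size in bytes (assumed here)
CLOG_BITS_PER_XACT = 2   # commit status bits stored per transaction
CLOG_XACTS_PER_PAGE = BLCKSZ * 8 // CLOG_BITS_PER_XACT  # 32768


def xids_until_next_clog_page(xid):
    """Transactions remaining before the XID counter hits the next page boundary."""
    return CLOG_XACTS_PER_PAGE - (xid % CLOG_XACTS_PER_PAGE)


def seconds_until_next_clog_page(xid, xacts_per_second):
    """Estimated wait for the next boundary at a steady transaction rate."""
    return xids_until_next_clog_page(xid) / xacts_per_second
```

At a hypothetical steady 100 transactions per second, a full page lasts
32768 / 100, or roughly 328 seconds, so the delay in that scenario could
be anything from zero up to about five and a half minutes.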
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com