On Sat, 2010-05-08 at 20:57 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On Sunday 09 May 2010 01:34:18 Bruce Momjian wrote:
> >> I think everyone agrees the current code is unusable, per Heikki's
> >> comment about a WAL file arriving after a period of no WAL activity, and
> >> look how long it took our group to even understand why that fails so
> >> badly.
>
> > To be honest its not *that* hard to simply make sure generating wal regularly
> > to combat that. While it surely aint a nice workaround its not much of a
> > problem either.
>
> Well, that's dumping a kluge onto users; but really that isn't the
> point. What we have here is a badly designed and badly implemented
> feature, and we need to not ship it like this so as to not
> institutionalize a bad design.

No, you have it backwards. HS was designed to work with SR. SR
unfortunately did not deliver any form of monitoring, and as a result
the keepalive that HS was known to need was left out, although it had
been on the todo list for some time. Luckily Greg and I argued to have
some monitoring added, and my code was used to provide the barest
minimum of monitoring for SR, yet not enough to help HS.

Of course, if one team doesn't deliver for whatever reason then others
must take up the slack, if they can: no complaints. Since I personally
didn't know this was going to be the case until after freeze, it is very
late to resolve this situation sensibly and time has been against us.
It's much harder for me to reach into the depths of another person's
work and see how to add necessary mechanisms, especially when I'm
working elsewhere. Even if I had done, it's likely that I would have
been blocked with the "great idea, next release" response as already
used on this thread.

Without doubt the current mechanism suffers from the issues you mention,
though the current state is not the result of bad design, merely
inaction and lack of integration. We could resolve the current state in
many ways, if we chose.

Bruce has used the word crippleware for the current state. Raising a
problem and then blocking solutions is the best way I know to cripple a
release. It should be clear that I've done my best to avoid this
situation and have been active on both SR and HS. Had I not acted as I
have done to date, SR would at this point slurp CPU like a bandit and be
unmonitorable, both fatal flaws in production. I point this out not to
argue, but to set the record straight. IMHO your assignment of blame is
misplaced and your comments about poor design do not reflect how we
arrived at the current state.

--
Simon Riggs www.2ndQuadrant.com