On Fri, Oct 7, 2011 at 12:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Sean Laurent <sean@studyblue.com> writes:
> > We've been running into a particularly strange problem that I'm trying to
> > better understand. The super short version is that our application servers
> > lose their connection to the database when I run a backup during periods of
> > higher load and fail to reconnect.
>
> That's just weird. It sounds like the "xfs_freeze" operation, or the
> snapshotting operation, is somehow interrupting network traffic. I'd
> not expect such a thing on a normal server, but who knows what's
> connected to what in an Amazon EC2 instance?
>
> Anyway, I'd suggest trying to instrument something to prove or disprove
> that there's a networking failure involved. It might be as simple as
> watching "ping" behavior ...
Agreed, it's very weird. EBS volumes are effectively network-attached
storage, so blaming network connectivity was my first
inclination as well. Unfortunately, it's definitely not a network
failure:
- The AWS support team has not detected any network outages affecting the
EC2 instance or the EBS volumes anywhere near the times our
outages occurred.
- I can consistently ping the database instance from the application
servers while the problem is occurring.
- I can SSH into the database instance and access Postgres while the
problem is occurring.
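
The checks above cover ICMP (ping) and SSH. A further way to instrument this, as suggested, would be to log timestamped TCP connection attempts to the Postgres port from an application server while the backup runs. What follows is only a minimal sketch under stated assumptions: the function names `probe_tcp` and `watch` are hypothetical, and it tests only whether a TCP connection can be opened, not whether Postgres completes authentication or answers a query.

```python
import socket
import time
from datetime import datetime, timezone


def probe_tcp(host, port, timeout=2.0):
    """Attempt one TCP connection; return (ok, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and opens a TCP socket,
        # raising an OSError subclass on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        return False, time.monotonic() - start, str(exc)


def watch(host, port, interval=1.0, count=10, timeout=2.0):
    """Probe repeatedly, print one timestamped line per attempt, return results."""
    results = []
    for _ in range(count):
        ok, elapsed, err = probe_tcp(host, port, timeout=timeout)
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        status = "ok" if ok else "FAIL"
        print(f"{stamp} {host}:{port} {status} {elapsed:.3f}s {err}")
        results.append(ok)
        time.sleep(interval)
    return results
```

Running something like `watch("db.example.internal", 5432, count=600)` (the host name is a placeholder; 5432 is the default PostgreSQL port) across a backup window would give a log that can be lined up against the time of the freeze and snapshot.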
--
Sean Laurent
Director of Operations
StudyBlue, Inc.