Thread: Postgres 9.01, Amazon EC2/EBS, XFS, JDBC and lost connections

Postgres 9.01, Amazon EC2/EBS, XFS, JDBC and lost connections

From
Sean Laurent
Date:
We've been running into a particularly strange problem that I'm trying to better understand. The super short version is that our application servers lose their connection to the database when I run a backup during periods of higher load and fail to reconnect.

Here's an overview of the setup:

- PostgreSQL 9.0.1 hosted on a cc1.4xlarge Amazon EC2 instance running CentOS 5.6
- 8 disk RAID-0 array of EBS volumes used for primary data storage
- 4 disk RAID-0 array of EBS volumes used for transaction logs
- Root partition is ext3
- RAID arrays are xfs

Backups are taken using a script that runs the following workflow:

- Tell Postgres to start a backup: SELECT pg_start_backup('RAID backup');
- Run "xfs_freeze" on the primary RAID array
- Tell Amazon to take snapshots of each of the EBS volumes
- Run "xfs_freeze -u" to thaw the primary RAID array
- Run "xfs_freeze" on the transaction log RAID array
- Tell Amazon to take snapshots of each of the EBS volumes
- Run "xfs_freeze -u" to thaw the transaction log RAID array
- Tell Postgres the backup is finished: SELECT pg_stop_backup();
- Remove old WAL files
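
The workflow above can be sketched roughly as follows. This is a minimal illustration, not the actual script: the mount points are placeholders, `snapshot_volumes` is a hypothetical callable standing in for whatever issues the EBS snapshot requests, `psql` is assumed to connect with default settings, and the WAL cleanup step is left out. The one design point it shows is that each thaw and the final `pg_stop_backup()` sit in `finally` blocks, so a failed snapshot cannot leave a filesystem frozen.

```python
import subprocess


def run(cmd):
    """Run one command and raise if it exits non-zero."""
    subprocess.run(cmd, check=True)


def snapshot_array(mount_point, snapshot_volumes, run=run):
    """Freeze an XFS filesystem, snapshot its volumes, and always thaw it."""
    run(["xfs_freeze", "-f", mount_point])
    try:
        # Hypothetical hook: request EBS snapshots for the volumes behind
        # this mount point (the real call is not shown in the thread).
        snapshot_volumes(mount_point)
    finally:
        # Thaw even if the snapshot request failed.
        run(["xfs_freeze", "-u", mount_point])


def backup(data_mount, wal_mount, snapshot_volumes, run=run):
    """Bracket both array snapshots with pg_start_backup/pg_stop_backup."""
    run(["psql", "-c", "SELECT pg_start_backup('RAID backup');"])
    try:
        snapshot_array(data_mount, snapshot_volumes, run)
        snapshot_array(wal_mount, snapshot_volumes, run)
    finally:
        run(["psql", "-c", "SELECT pg_stop_backup();"])
```

Passing a recording function as `run` makes it possible to check the command order without touching a real filesystem.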

The whole process takes roughly 7 seconds on average. The RAID arrays are frozen for roughly 2 seconds on average.

Within a few seconds of the backup, our application servers start throwing exceptions that indicate the database connection was closed. Meanwhile, Postgres still shows the connections and we start seeing a really high number (for us) of locks in the database. The application servers refuse to recover and must be killed and restarted. Once they're killed off, the connections actually go away and the locks disappear.

What's particularly weird is that this doesn't happen all the time. The backups were running every hour, but we have only seen the app servers crash 5-10 times over the course of a month.

Has anyone encountered anything like this? Do any of these steps have ramifications that I'm not considering? Especially something that might explain the app server failure?

Thanks.

Sean Laurent
Director of Operations
StudyBlue, Inc.

Re: Postgres 9.01, Amazon EC2/EBS, XFS, JDBC and lost connections

From
Tom Lane
Date:
Sean Laurent <sean@studyblue.com> writes:
> We've been running into a particularly strange problem that I'm trying to
> better understand. The super short version is that our application servers
> lose their connection to the database when I run a backup during periods of
> higher load and fail to reconnect.

> Here's an overview of the setup:

> - PostgreSQL 9.0.1 hosted on a cc1.4xlarge Amazon EC2 instance running
> CentOS 5.6
> - 8 disk RAID-0 array of EBS volumes used for primary data storage
> - 4 disk RAID-0 array of EBS volumes used for transaction logs
> - Root partition is ext3
> - RAID arrays are xfs

> Backups are taken using a script that runs the following workflow:

> - Tell Postgres to start a backup: SELECT pg_start_backup('RAID backup');
> - Run "xfs_freeze" on the primary RAID array
> - Tell Amazon to take snapshots of each of the EBS volumes
> - Run "xfs_freeze -u" to thaw the primary RAID array
> - Run "xfs_freeze" on the transaction log RAID array
> - Tell Amazon to take snapshots of each of the EBS volumes
> - Run "xfs_freeze -u" to thaw the transaction log RAID array
> - Tell Postgres the backup is finished: SELECT pg_stop_backup();
> - Remove old WAL files

> The whole process takes roughly 7 seconds on average. The RAID arrays are
> frozen for roughly 2 seconds on average.

> Within a few seconds of the backup, our application servers start throwing
> exceptions that indicate the database connection was closed. Meanwhile,
> Postgres still shows the connections and we start seeing a really high
> number (for us) of locks in the database. The application servers refuse to
> recover and must be killed and restarted. Once they're killed off, the
> connections actually go away and the locks disappear.

That's just weird.  It sounds like the "xfs_freeze" operation, or the
snapshotting operation, is somehow interrupting network traffic.  I'd
not expect such a thing on a normal server, but who knows what's
connected to what in an Amazon EC2 instance?

Anyway, I'd suggest trying to instrument something to prove or disprove
that there's a networking failure involved.  It might be as simple as
watching "ping" behavior ...
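
One possible way to instrument this, as a variant of watching "ping": time TCP connections to the Postgres port instead of ICMP, since that exercises the same path the application uses. This is only a sketch; the host, port and intervals are whatever the caller supplies, and nothing here is specific to EC2.

```python
import socket
import time


def probe(host, port, timeout=1.0):
    """Return the seconds taken to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return None


def watch(host, port, count, interval=1.0, timeout=1.0):
    """Probe repeatedly; return a list of (wall_clock_time, latency_or_None)."""
    results = []
    for _ in range(count):
        results.append((time.time(), probe(host, port, timeout)))
        time.sleep(interval)
    return results
```

Running `watch` from an application server across a backup window and looking for `None` entries or latency spikes near the freeze would help confirm or rule out a network-level interruption.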

            regards, tom lane

Re: Postgres 9.01, Amazon EC2/EBS, XFS, JDBC and lost connections

From
Craig Ringer
Date:
On 10/07/2011 01:21 AM, Sean Laurent wrote:

> Within a few seconds of the backup, our application servers start
> throwing exceptions that indicate the database connection was closed.
> Meanwhile, Postgres still shows the connections and we start seeing a
> really high number (for us) of locks in the database. The application
> servers refuse to recover and must be killed and restarted. Once they're
> killed off, the connections actually go away and the locks disappear.

Did you have any luck with this?

This sort of thing sounds a lot like "deadlock" ... but I'm not really
sure how Pg's backends/postmaster could get into a deadlock with each
other. It'd be interesting to look at "wchan" in ps to see what the Pg
processes are waiting on.
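
A small sketch of collecting that information, assuming a Linux host with procps `ps` and server processes named `postgres`. The parsing helper is illustrative only.

```python
import subprocess


def parse_ps(output):
    """Turn `ps -o pid,wchan:32,args` output into (pid, wchan, args) tuples."""
    rows = []
    for line in output.splitlines():
        parts = line.split(None, 2)
        # Skip the header line and anything that does not start with a PID.
        if len(parts) == 3 and parts[0].isdigit():
            rows.append((int(parts[0]), parts[1], parts[2]))
    return rows


def postgres_wait_channels():
    """List the kernel wait channel of every process named 'postgres'."""
    result = subprocess.run(
        ["ps", "-C", "postgres", "-o", "pid,wchan:32,args"],
        capture_output=True,
        text=True,
    )
    # ps exits non-zero when nothing matches; that just yields an empty list.
    return parse_ps(result.stdout)
```

Capturing this output during an incident and comparing it with a quiet period would show whether the backends are stuck in a filesystem or block-layer wait.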

Also, check to see if you can connect with `psql' on a local unix socket
and on a local tcp/ip socket.

Can you reproduce this on a non-EC2 system?

--
Craig Ringer

Re: Postgres 9.01, Amazon EC2/EBS, XFS, JDBC and lost connections

From
John R Pierce
Date:
On 10/06/11 10:21 AM, Sean Laurent wrote:
> We've been running into a particularly strange problem that I'm trying
> to better understand. The super short version is that our application
> servers lose their connection to the database when I run a backup
> during periods of higher load and fail to reconnect.
>
> Here's an overview of the setup:
>
> - PostgreSQL 9.0.1 hosted on a cc1.4xlarge Amazon EC2 instance running
> CentOS 5.6
> - 8 disk RAID-0 array of EBS volumes used for primary data storage
> - 4 disk RAID-0 array of EBS volumes used for transaction logs
> - Root partition is ext3
> - RAID arrays are xfs
>
> Backups are taken using a script that runs the following workflow:
>
> - Tell Postgres to start a backup: SELECT pg_start_backup('RAID backup');
> - Run "xfs_freeze" on the primary RAID array
> - Tell Amazon to take snapshots of each of the EBS volumes
> - Run "xfs_freeze -u" to thaw the primary RAID array
> - Run "xfs_freeze" on the transaction log RAID array
> - Tell Amazon to take snapshots of each of the EBS volumes
> - Run "xfs_freeze -u" to thaw the transaction log RAID array
> - Tell Postgres the backup is finished: SELECT pg_stop_backup();
> - Remove old WAL files
>
> The whole process takes roughly 7 seconds on average. The RAID arrays
> are frozen for roughly 2 seconds on average.
>

While xfs_freeze is in effect, all writes are blocked.  This is NOT what
you want to do here, postgres does NOT expect you to take an atomic
snapshot of the database files, rather, by bracketing your backup with
pg_start_backup and pg_stop_backup, it puts things in a state where a
file by file backup will be fine.

from the man pages...

    xfs_freeze halts new access to the filesystem and creates a stable
    image on disk. xfs_freeze is intended to be used with volume
    managers and hardware RAID devices that support the creation of
    snapshots.

    The mount-point argument is the pathname of the directory where the
    filesystem is mounted. The filesystem must be mounted to be frozen
    (see mount(8) <http://linux.die.net/man/8/mount>).

    The -f flag requests the specified XFS filesystem to be frozen from
    new modifications. When this is selected, all ongoing transactions
    in the filesystem are allowed to complete, new write system calls
    are halted, other calls which modify the filesystem are halted, and
    all dirty data, metadata, and log information are written to disk.
    Any process attempting to write to the frozen filesystem will block
    waiting for the filesystem to be unfrozen.


when postgres's writer processes block, I suspect things go sour fast.




--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast


Re: Postgres 9.01, Amazon EC2/EBS, XFS, JDBC and lost connections

From
Craig Ringer
Date:
On 10/10/11 23:29, John R Pierce wrote:

> While xfs_freeze is in effect, all writes are blocked.  This is NOT
> what you want to do here, postgres does NOT expect you to take an
> atomic snapshot of the database files, rather, by bracketing your
> backup with pg_start_backup and pg_stop_backup, it puts things in a
> state where a file by file backup will be fine.
>
While true, taking an atomic snapshot should give them lower recovery
times and - all in all - is probably a good thing.
>
> when postgres's writer processes block, I suspect things go sour fast.
They shouldn't!

If blocking writes causes a server failure that persists once writes
have been unblocked, that's a bug IMO. You might have a bit of a backlog
of writes to clear, but after that all should be well, and if it isn't
then something needs fixing.

--
Craig Ringer

Re: Postgres 9.01, Amazon EC2/EBS, XFS, JDBC and lost connections

From
John R Pierce
Date:
On 10/10/11 7:44 PM, Craig Ringer wrote:
> If blocking writes causes a server failure that persists once writes
> have been unblocked, that's a bug IMO. You might have a bit of a backlog
> of writes to clear, but after that all should be well, and if it isn't
> then something needs fixing.

the process is blocked waiting for this disk write to complete,
meanwhile, the packets are queuing up and waiting for service.

best of luck with all that....


--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast


Re: Postgres 9.01, Amazon EC2/EBS, XFS, JDBC and lost connections

From
Craig Ringer
Date:
On 11/10/11 12:48, John R Pierce wrote:
> On 10/10/11 7:44 PM, Craig Ringer wrote:
>> If blocking writes causes a server failure that persists once writes
>> have been unblocked, that's a bug IMO. You might have a bit of a backlog
>> of writes to clear, but after that all should be well, and if it isn't
>> then something needs fixing.
>
> the process is blocked waiting for this disk write to complete,
> meanwhile, the packets are queuing up and waiting for service.
>
> best of luck with all that....

xfs_freeze for long enough to take a snapshot doesn't take long, or it
shouldn't, anyway. Even if it did, that shouldn't cause a server failure
that persists past when disk I/O is resumed, though it might cause
individual connections to drop.

I can `kill -STOP' Pg, or unplug my network cable for several seconds
and expect everything to resume just fine when I `kill -CONT' or plug
back in. Packets will be buffered by the OS if Pg is busy or by the
closest router if the network is unplugged, and will be delivered when
it becomes responsive again. If that takes too long or if too many
packets arrive, packets will be dropped, in which case TCP/IP will
re-send them. If the outage is protracted enough the client might
eventually decide the peer has gone away and drop the connection, but
even then new connections should be established to the server just fine
once it resumes responding.

It is totally unreasonable for Pg to *stay* nonfunctional once disk I/O
resumes. Existing connections should receive responses they're waiting
on or die, depending on how long it's been, and new connections should
be accepted fine.
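
The buffering behaviour described here can be demonstrated with plain sockets on one machine. In this sketch a toy server stands in for a stalled Pg backend: it accepts a connection, stays unresponsive for a while, then answers. It is an illustration of TCP behaviour under that assumption, not of Postgres itself.

```python
import socket
import threading
import time


def stalled_echo_server(listener, stall_seconds):
    """Accept one connection, stay unresponsive for a while, then echo once."""
    conn, _ = listener.accept()
    with conn:
        time.sleep(stall_seconds)  # stands in for the frozen period
        # Bytes sent during the stall were queued by the kernel.
        conn.sendall(conn.recv(1024))


def roundtrip_through_stall(stall_seconds, client_timeout):
    """Send a request while the server is stalled; return (reply, elapsed)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    server = threading.Thread(
        target=stalled_echo_server, args=(listener, stall_seconds)
    )
    server.start()
    start = time.monotonic()
    try:
        with socket.create_connection(
            listener.getsockname(), timeout=client_timeout
        ) as client:
            client.sendall(b"SELECT 1")
            reply = client.recv(1024)
    finally:
        server.join()
        listener.close()
    return reply, time.monotonic() - start
```

As long as the client is willing to wait longer than the stall, the request sent during the stall is answered once the server resumes and the connection survives.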

--
Craig Ringer

Re: Postgres 9.01, Amazon EC2/EBS, XFS, JDBC and lost connections

From
Sean Laurent
Date:
On Mon, Oct 10, 2011 at 8:09 AM, Craig Ringer <ringerc@ringerc.id.au> wrote:
> On 10/07/2011 01:21 AM, Sean Laurent wrote:
>> Within a few seconds of the backup, our application servers start
>> throwing exceptions that indicate the database connection was closed.
>> Meanwhile, Postgres still shows the connections and we start seeing a
>> really high number (for us) of locks in the database. The application
>> servers refuse to recover and must be killed and restarted. Once they're
>> killed off, the connections actually go away and the locks disappear.
>
> Did you have any luck with this?

No, but I have avoided it by simply not using xfs_freeze and
snapshotting EBS volumes. Instead I've started taking pg_dumps off the
slave database.

> This sort of thing sounds a lot like "deadlock" ... but I'm not really sure
> how Pg's backends/postmaster could get into a deadlock with each other. It'd
> be interesting to look at "wchan" in ps to see what the Pg processes are
> waiting on.

That's definitely a strong contender. It may be that the xfs_freeze
timing was an unrelated problem or even just a coincidence.

> Can you reproduce this on a non-EC2 system?

Unfortunately, we don't have the hardware resources to test this on a
non-EC2 system.

--
Sean Laurent
Director of Operations
StudyBlue, Inc.

Re: Postgres 9.01, Amazon EC2/EBS, XFS, JDBC and lost connections

From
Sean Laurent
Date:
On Fri, Oct 7, 2011 at 12:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Sean Laurent <sean@studyblue.com> writes:
> > We've been running into a particularly strange problem that I'm trying to
> > better understand. The super short version is that our application servers
> > lose their connection to the database when I run a backup during periods of
> > higher load and fail to reconnect.
>
> That's just weird.  It sounds like the "xfs_freeze" operation, or the
> snapshotting operation, is somehow interrupting network traffic.  I'd
> not expect such a thing on a normal server, but who knows what's
> connected to what in an Amazon EC2 instance?
>
> Anyway, I'd suggest trying to instrument something to prove or disprove
> that there's a networking failure involved.  It might be as simple as
> watching "ping" behavior ...

Agreed that it's very weird. EBS volumes are effectively network-attached
storage, so blaming network connectivity was my first
inclination as well. Unfortunately, it's definitely not a network
failure:

- AWS support team has not detected any network outages affecting the
EC2 instance or the EBS volumes at any time remotely near when our
outages occurred.
- I can consistently ping the database instance from the application
servers while the problem is occurring.
- I can SSH into the database instance and access Postgres while the
problem is occurring.

--
Sean Laurent
Director of Operations
StudyBlue, Inc.

Re: Postgres 9.01, Amazon EC2/EBS, XFS, JDBC and lost connections

From
Sean Laurent
Date:
On Tue, Oct 11, 2011 at 12:04 AM, Craig Ringer <ringerc@ringerc.id.au> wrote:
> On 11/10/11 12:48, John R Pierce wrote:
>> On 10/10/11 7:44 PM, Craig Ringer wrote:
>>> If blocking writes causes a server failure that persists once writes
>>> have been unblocked, that's a bug IMO. You might have a bit of a backlog
>>> of writes to clear, but after that all should be well, and if it isn't
>>> then something needs fixing.
>>
>> the process is blocked waiting for this disk write to complete,
>> meanwhile, the packets are queuing up and waiting for service.
>>
>> best of luck with all that....
>
> xfs_freeze for long enough to take a snapshot doesn't take long, or it
> shouldn't, anyway.

On average, xfs_freeze takes about 2 seconds for us with 8 EBS volumes
at 60GB each in a software RAID-0 array.

> Even if it did, that shouldn't cause a server failure
> that persists past when disk I/O is resumed, though it might cause
> individual connections to drop.
<DELETED>
> It is totally unreasonable for Pg to *stay* nonfunctional once disk I/O
> resumes. Existing connections should receive responses they're waiting
> on or die, depending on how long it's been, and new connections should
> be accepted fine.

Exactly. I genuinely expect Postgres to be able to withstand a couple
of seconds of blocked disk I/O. Especially since this isn't a heavy
duty transaction processing system - it's under load, but not a
tremendously high load. During our busier times we average something
in the neighborhood of 300-400 transactions per second, which just
doesn't seem like that much.

As much as I would like Postgres to withstand a 2 second outage, I
don't honestly care. I'd just like to figure out whether I'm looking
at something that's actually a problem or if I should be looking
elsewhere for the problem.
--
Sean Laurent
Director of Operations
StudyBlue, Inc.

Re: Postgres 9.01, Amazon EC2/EBS, XFS, JDBC and lost connections

From
Scott Marlowe
Date:
On Tue, Oct 11, 2011 at 5:00 PM, Sean Laurent <sean@studyblue.com> wrote:
> As much as I would like Postgres to withstand a 2 second outage, I
> don't honestly care. I'd just like to figure out whether I'm looking
> at something that's actually a problem or if I should be looking
> elsewhere for the problem.

Any chance this is a client side failure?  I.e. the client lib is
seeing the 2+ second zero response time as a disconnect?
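
To illustrate the kind of client-side failure meant here: if the client library or a connection pool applies a read timeout shorter than the stall, the client gives up even though the server would have answered. The sketch below uses a toy server and raw sockets. It does not reproduce the JDBC driver's behaviour, and whether any such timeout is configured in this setup is unknown.

```python
import socket
import threading
import time


def slow_server(listener, stall_seconds):
    """Accept one connection and reply only after a delay."""
    conn, _ = listener.accept()
    with conn:
        time.sleep(stall_seconds)
        try:
            conn.sendall(b"ok")
        except OSError:
            pass  # the client already gave up and closed its end


def query_with_timeout(stall_seconds, client_timeout):
    """Return 'ok' if the reply arrived in time, 'timed out' otherwise."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    server = threading.Thread(target=slow_server, args=(listener, stall_seconds))
    server.start()
    try:
        with socket.create_connection(
            listener.getsockname(), timeout=client_timeout
        ) as client:
            client.sendall(b"SELECT 1")
            try:
                return client.recv(16).decode()
            except socket.timeout:
                # The client treats silence as a dead connection.
                return "timed out"
    finally:
        server.join()
        listener.close()
```

With a stall longer than the client timeout the call reports a timeout, while the same stall with a generous timeout succeeds, so the configured timeouts on the application side are worth checking.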

Re: Postgres 9.01, Amazon EC2/EBS, XFS, JDBC and lost connections

From
Sean Laurent
Date:
On Tue, Oct 11, 2011 at 8:50 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Tue, Oct 11, 2011 at 5:00 PM, Sean Laurent <sean@studyblue.com> wrote:
>> As much as I would like Postgres to withstand a 2 second outage, I
>> don't honestly care. I'd just like to figure out whether I'm looking
>> at something that's actually a problem or if I should be looking
>> elsewhere for the problem.
>
> Any chance this is a client side failure?  I.e. the client lib is
> seeing the 2+ second zero response time as a disconnect?

Good question. I don't know. Let me look into that and get back to the
list when I have a better answer.

--
Sean Laurent
Director of Operations
StudyBlue, Inc.