
Postgres 9.01, Amazon EC2/EBS, XFS, JDBC and lost connections - Mailing list pgsql-general

From	Sean Laurent
Subject	Postgres 9.01, Amazon EC2/EBS, XFS, JDBC and lost connections
Date	October 6, 2011 15:24:46
Msg-id	CAK=aZ=k+QSGZFCE8SX8-KbgYDJZy+-5ebmFs3aTLZEkSBb3LQw@mail.gmail.com
List	pgsql-general


We've been running into a particularly strange problem that I'm trying to better understand. The super short version is that when I run a backup during periods of higher load, our application servers lose their connections to the database and fail to reconnect.

Here's an overview of the setup:

- PostgreSQL 9.0.1 hosted on a cc1.4xlarge Amazon EC2 instance running CentOS 5.6

- 8 disk RAID-0 array of EBS volumes used for primary data storage

- 4 disk RAID-0 array of EBS volumes used for transaction logs

- Root partition is ext3

- RAID arrays are xfs

Backups are taken using a script that runs the following workflow:

- Tell Postgres to start a backup: SELECT pg_start_backup('RAID backup');

- Run "xfs_freeze" on the primary RAID array

- Tell Amazon to take snapshots of each of the EBS volumes

- Run "xfs_freeze -u" to thaw the primary RAID array

- Run "xfs_freeze" on the transaction log RAID array

- Tell Amazon to take snapshots of each of the EBS volumes

- Run "xfs_freeze -u" to thaw the transaction log RAID array

- Tell Postgres the backup is finished: SELECT pg_stop_backup();

- Remove old WAL files
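For illustration, the workflow above could be sketched roughly as follows in Python. This is not the actual backup script; it is a minimal sketch under several assumptions: `psql` can connect with its default settings, `snapshot` is a hypothetical caller-supplied function that requests the EBS snapshots for the volumes behind a mount point, and `xfs_freeze -f` is used for the freeze step. The names `run_command`, `frozen`, and `backup` are made up for this example, and the old WAL cleanup step is left out. The `try`/`finally` blocks are there so that a failed snapshot request cannot leave a filesystem frozen or leave Postgres in backup mode.

```python
import subprocess
from contextlib import contextmanager


def run_command(argv):
    """Run an external command and raise if it exits non-zero."""
    subprocess.run(argv, check=True)


@contextmanager
def frozen(mount_point, run=run_command):
    """Freeze an XFS filesystem and always thaw it, even if the body fails."""
    run(["xfs_freeze", "-f", mount_point])
    try:
        yield
    finally:
        run(["xfs_freeze", "-u", mount_point])


def backup(data_mount, wal_mount, snapshot, run=run_command):
    """Snapshot the data array, then the transaction log array.

    `snapshot` is a hypothetical callable that takes a mount point and
    asks Amazon to snapshot the EBS volumes behind it.
    """
    run(["psql", "-c", "SELECT pg_start_backup('RAID backup');"])
    try:
        with frozen(data_mount, run):
            snapshot(data_mount)
        with frozen(wal_mount, run):
            snapshot(wal_mount)
    finally:
        # Always tell Postgres the backup is over, even after an error.
        run(["psql", "-c", "SELECT pg_stop_backup();"])
```

Passing `run` and `snapshot` in as parameters is only a design choice for this sketch: it lets the ordering of the steps be checked with stand-in functions before anything touches a real filesystem.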

The whole process takes roughly 7 seconds on average. The RAID arrays are frozen for roughly 2 seconds on average.

Within a few seconds of the backup, our application servers start throwing exceptions that indicate the database connection was closed. Meanwhile, Postgres still shows the connections and we start seeing a really high number (for us) of locks in the database. The application servers refuse to recover and must be killed and restarted. Once they're killed off, the connections actually go away and the locks disappear.

What's particularly weird is that this doesn't happen all the time. The backups were running every hour, but we have only seen the app servers crash 5-10 times over the course of a month.

Has anyone encountered anything like this? Do any of these steps have ramifications that I'm not considering? Especially something that might explain the app server failure?

Thanks.

Sean Laurent

Director of Operations

StudyBlue, Inc.
