Thread: Postgres 9.0.1, Amazon EC2/EBS, XFS, JDBC and lost connections
We've been running into a particularly strange problem that I'm trying to better understand. The super short version is that our application servers lose their connection to the database when I run a backup during periods of higher load and fail to reconnect.
Thanks.
Here's an overview of the setup:
- PostgreSQL 9.0.1 hosted on a cc1.4xlarge Amazon EC2 instance running CentOS 5.6
- 8 disk RAID-0 array of EBS volumes used for primary data storage
- 4 disk RAID-0 array of EBS volumes used for transaction logs
- Root partition is ext3
- RAID arrays are xfs
Backups are taken using a script that runs the following workflow:
- Tell Postgres to start a backup: SELECT pg_start_backup('RAID backup');
- Run "xfs_freeze" on the primary RAID array
- Tell Amazon to take snapshots of each of the EBS volumes
- Run "xfs_freeze -u" to thaw the primary RAID array
- Run "xfs_freeze" on the transaction log RAID array
- Tell Amazon to take snapshots of each of the EBS volumes
- Run "xfs_freeze -u" to thaw the transaction log RAID array
- Tell Postgres the backup is finished: SELECT pg_stop_backup();
- Remove old WAL files
The whole process takes roughly 7 seconds on average. The RAID arrays are frozen for roughly 2 seconds on average.
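One way to picture the workflow above is the following Python sketch. It is not the script actually in use; the mount points, volume IDs, and the "snapshot-ebs-volume" command are hypothetical placeholders, and the `run` and `sql` callables are assumptions that stand in for whatever executes shell commands and SQL. The only point it illustrates is ordering: each freeze is paired with a thaw in a `finally` block, so a failed snapshot request cannot leave writers blocked.

```python
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], None]


def snapshot_frozen(mount_point: str, volume_ids: Sequence[str], run: Runner) -> None:
    """Freeze one XFS filesystem, snapshot its volumes, and always thaw it."""
    run(["xfs_freeze", "-f", mount_point])
    try:
        for vol in volume_ids:
            # Hypothetical placeholder: substitute the real EBS snapshot call.
            run(["snapshot-ebs-volume", vol])
    finally:
        # Thaw even if a snapshot request fails, so writers are not left blocked.
        run(["xfs_freeze", "-u", mount_point])


def backup(run: Runner, sql: Callable[[str], None]) -> None:
    """Bracket both freeze/snapshot/thaw cycles with start/stop backup."""
    sql("SELECT pg_start_backup('RAID backup');")
    try:
        # Mount points and volume IDs below are made up for illustration.
        snapshot_frozen("/mnt/pgdata", ["vol-data-1", "vol-data-2"], run)
        snapshot_frozen("/mnt/pgxlog", ["vol-xlog-1"], run)
    finally:
        sql("SELECT pg_stop_backup();")


if __name__ == "__main__":
    # Dry run: print each step instead of executing it.
    backup(run=lambda argv: print("RUN", " ".join(argv)),
           sql=lambda query: print("SQL", query))
```

Passing the runner in as a parameter keeps the sketch safe to execute anywhere, since nothing is frozen unless a real command runner is supplied.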
Within a few seconds of the backup, our application servers start throwing exceptions that indicate the database connection was closed. Meanwhile, Postgres still shows the connections and we start seeing a really high number (for us) of locks in the database. The application servers refuse to recover and must be killed and restarted. Once they're killed off, the connections actually go away and the locks disappear.
What's particularly weird is that this doesn't happen all the time. The backups were running every hour, but we have only seen the app servers crash 5-10 times over the course of a month.
Has anyone encountered anything like this? Do any of these steps have ramifications that I'm not considering? Especially something that might explain the app server failure?
Sean Laurent
Director of Operations
StudyBlue, Inc.
Sean Laurent <sean@studyblue.com> writes:
> We've been running into a particularly strange problem that I'm trying to
> better understand. The super short version is that our application servers
> lose their connection to the database when I run a backup during periods of
> higher load and fail to reconnect.
> Here's an overview of the setup:
> - PostgreSQL 9.0.1 hosted on a cc1.4xlarge Amazon EC2 instance running
> CentOS 5.6
> - 8 disk RAID-0 array of EBS volumes used for primary data storage
> - 4 disk RAID-0 array of EBS volumes used for transaction logs
> - Root partition is ext3
> - RAID arrays are xfs
> Backups are taken using a script that runs the following workflow:
> - Tell Postgres to start a backup: SELECT pg_start_backup('RAID backup');
> - Run "xfs_freeze" on the primary RAID array
> - Tell Amazon to take snapshots of each of the EBS volumes
> - Run "xfs_freeze -u" to thaw the primary RAID array
> - Run "xfs_freeze" on the transaction log RAID array
> - Tell Amazon to take snapshots of each of the EBS volumes
> - Run "xfs_freeze -u" to thaw the transaction log RAID array
> - Tell Postgres the backup is finished: SELECT pg_stop_backup();
> - Remove old WAL files
> The whole process takes roughly 7 seconds on average. The RAID arrays are
> frozen for roughly 2 seconds on average.
> Within a few seconds of the backup, our application servers start throwing
> exceptions that indicate the database connection was closed. Meanwhile,
> Postgres still shows the connections and we start seeing a really high
> number (for us) of locks in the database. The application servers refuse to
> recover and must be killed and restarted. Once they're killed off, the
> connections actually go away and the locks disappear.

That's just weird. It sounds like the "xfs_freeze" operation, or the snapshotting operation, is somehow interrupting network traffic. I'd not expect such a thing on a normal server, but who knows what's connected to what in an Amazon EC2 instance?

Anyway, I'd suggest trying to instrument something to prove or disprove that there's a networking failure involved. It might be as simple as watching "ping" behavior ...

regards, tom lane
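A possible way to instrument this beyond ICMP ping is a small TCP probe run from an application server during the backup window. The sketch below is only an illustration: the function names are made up, and the host and port you pass in (5432 is the Postgres default) are assumptions about the setup. Note that a successful TCP connect only shows the kernel on the database host accepted the connection; it does not prove a backend is answering queries.

```python
import socket
import time
from typing import List, Optional


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> Optional[float]:
    """Return the seconds taken to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return None


def watch(host: str, port: int, count: int, interval: float = 0.5) -> List[Optional[float]]:
    """Probe repeatedly and print a timestamped line per attempt."""
    results = []
    for _ in range(count):
        latency = tcp_probe(host, port)
        stamp = time.strftime("%H:%M:%S")
        print(stamp, "FAIL" if latency is None else f"{latency * 1000:.1f} ms")
        results.append(latency)
        time.sleep(interval)
    return results
```

Something like `watch("db.example.internal", 5432, count=60)` (hostname hypothetical) started just before the backup would give timestamps to line up against the freeze and thaw times in the backup log.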
On 10/07/2011 01:21 AM, Sean Laurent wrote:
> Within a few seconds of the backup, our application servers start
> throwing exceptions that indicate the database connection was closed.
> Meanwhile, Postgres still shows the connections and we start seeing a
> really high number (for us) of locks in the database. The application
> servers refuse to recover and must be killed and restarted. Once they're
> killed off, the connections actually go away and the locks disappear.

Did you have any luck with this?

This sort of thing sounds a lot like "deadlock" ... but I'm not really sure how Pg's backends/postmaster could get into a deadlock with each other. It'd be interesting to look at "wchan" in ps to see what the Pg processes are waiting on.

Also, check to see if you can connect with `psql' on a local unix socket and on a local tcp/ip socket.

Can you reproduce this on a non-EC2 system?

--
Craig Ringer
On 10/06/11 10:21 AM, Sean Laurent wrote:
> We've been running into a particularly strange problem that I'm trying
> to better understand. The super short version is that our application
> servers lose their connection to the database when I run a backup
> during periods of higher load and fail to reconnect.
>
> Here's an overview of the setup:
>
> - PostgreSQL 9.0.1 hosted on a cc1.4xlarge Amazon EC2 instance running
> CentOS 5.6
> - 8 disk RAID-0 array of EBS volumes used for primary data storage
> - 4 disk RAID-0 array of EBS volumes used for transaction logs
> - Root partition is ext3
> - RAID arrays are xfs
>
> Backups are taken using a script that runs the following workflow:
>
> - Tell Postgres to start a backup: SELECT pg_start_backup('RAID backup');
> - Run "xfs_freeze" on the primary RAID array
> - Tell Amazon to take snapshots of each of the EBS volumes
> - Run "xfs_freeze -u" to thaw the primary RAID array
> - Run "xfs_freeze" on the transaction log RAID array
> - Tell Amazon to take snapshots of each of the EBS volumes
> - Run "xfs_freeze -u" to thaw the transaction log RAID array
> - Tell Postgres the backup is finished: SELECT pg_stop_backup();
> - Remove old WAL files
>
> The whole process takes roughly 7 seconds on average. The RAID arrays
> are frozen for roughly 2 seconds on average.
>

While xfs_freeze is in effect, all writes are blocked. This is NOT what you want to do here, postgres does NOT expect you to take an atomic snapshot of the database files, rather, by bracketing your backup with pg_start_backup and pg_stop_backup, it puts things in a state where a file by file backup will be fine.

from the man pages...

    xfs_freeze halts new access to the filesystem and creates a stable image on disk. xfs_freeze is intended to be used with volume managers and hardware RAID devices that support the creation of snapshots.

    The mount-point argument is the pathname of the directory where the filesystem is mounted. The filesystem must be mounted to be frozen (see mount <http://linux.die.net/man/8/mount>(8)).

    The -f flag requests the specified XFS filesystem to be frozen from new modifications. When this is selected, all ongoing transactions in the filesystem are allowed to complete, new write system calls are halted, other calls which modify the filesystem are halted, and all dirty data, metadata, and log information are written to disk. Any process attempting to write to the frozen filesystem will block waiting for the filesystem to be unfrozen.

when postgres's writer processes block, I suspect things go sour fast.

--
john r pierce N 37, W 122
santa cruz ca mid-left coast
On 10/10/11 23:29, John R Pierce wrote:
> While xfs_freeze is in effect, all writes are blocked. This is NOT
> what you want to do here, postgres does NOT expect you to take an
> atomic snapshot of the database files, rather, by bracketing your
> backup with pg_start_backup and pg_stop_backup, it puts things in a
> state where a file by file backup will be fine.
>

While true, taking an atomic snapshot should give them lower recovery times and - all in all - is probably a good thing.

>
> when postgres's writer processes block, I suspect things go sour fast.

They shouldn't!

If blocking writes causes a server failure that persists once writes have been unblocked, that's a bug IMO. You might have a bit of a backlog of writes to clear, but after that all should be well, and if it isn't then something needs fixing.

--
Craig Ringer
On 10/10/11 7:44 PM, Craig Ringer wrote:
> If blocking writes causes a server failure that persists once writes
> have been unblocked, that's a bug IMO. You might have a bit of a backlog
> of writes to clear, but after that all should be well, and if it isn't
> then something needs fixing.

the process is blocked waiting for this disk write to complete, meanwhile, the packets are queuing up and waiting for service.

best of luck with all that....

--
john r pierce N 37, W 122
santa cruz ca mid-left coast
On 11/10/11 12:48, John R Pierce wrote:
> On 10/10/11 7:44 PM, Craig Ringer wrote:
>> If blocking writes causes a server failure that persists once writes
>> have been unblocked, that's a bug IMO. You might have a bit of a backlog
>> of writes to clear, but after that all should be well, and if it isn't
>> then something needs fixing.
>
> the process is blocked waiting for this disk write to complete,
> meanwhile, the packets are queuing up and waiting for service.
>
> best of luck with all that....

xfs_freeze for long enough to take a snapshot doesn't take long, or it shouldn't, anyway. Even if it did, that shouldn't cause a server failure that persists past when disk I/O is resumed, though it might cause individual connections to drop.

I can `kill -STOP' Pg, or unplug my network cable for several seconds and expect everything to resume just fine when I `kill -CONT' or plug back in. Packets will be buffered by the OS if Pg is busy or by the closest router if the network is unplugged, and will be delivered when it becomes responsive again. If that takes too long or if too many packets arrive, packets will be dropped, in which case TCP/IP will re-send them. If the outage is protracted enough the client might eventually decide the peer has gone away and drop the connection, but even then new connections should be established to the server just fine once it resumes responding.

It is totally unreasonable for Pg to *stay* nonfunctional once disk I/O resumes. Existing connections should receive responses they're waiting on or die, depending on how long it's been, and new connections should be accepted fine.

--
Craig Ringer
On Mon, Oct 10, 2011 at 8:09 AM, Craig Ringer <ringerc@ringerc.id.au> wrote:
> On 10/07/2011 01:21 AM, Sean Laurent wrote:
>> Within a few seconds of the backup, our application servers start
>> throwing exceptions that indicate the database connection was closed.
>> Meanwhile, Postgres still shows the connections and we start seeing a
>> really high number (for us) of locks in the database. The application
>> servers refuse to recover and must be killed and restarted. Once they're
>> killed off, the connections actually go away and the locks disappear.
>
> Did you have any luck with this?

No, but I have avoided it by simply not using xfs_freeze and snapshotting EBS volumes. Instead I've started taking pg_dumps off the slave database.

> This sort of thing sounds a lot like "deadlock" ... but I'm not really sure
> how Pg's backends/postmaster could get into a deadlock with each other. It'd
> be interesting to look at "wchan" in ps to see what the Pg processes are
> waiting on.

That's definitely a strong contender. It may be that the xfs_freeze timing was an unrelated problem or even just a coincidence.

> Can you reproduce this on a non-EC2 system?

Unfortunately, we don't have the hardware resources to test this on a non-EC2 system.

--
Sean Laurent
Director of Operations
StudyBlue, Inc.
On Fri, Oct 7, 2011 at 12:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Sean Laurent <sean@studyblue.com> writes:
> > We've been running into a particularly strange problem that I'm trying to
> > better understand. The super short version is that our application servers
> > lose their connection to the database when I run a backup during periods of
> > higher load and fail to reconnect.
>
> That's just weird. It sounds like the "xfs_freeze" operation, or the
> snapshotting operation, is somehow interrupting network traffic. I'd
> not expect such a thing on a normal server, but who knows what's
> connected to what in an Amazon EC2 instance?
>
> Anyway, I'd suggest trying to instrument something to prove or disprove
> that there's a networking failure involved. It might be as simple as
> watching "ping" behavior ...

Agreed that it's very weird. EBS volumes are effectively network-attached storage, so blaming network connectivity was my first inclination as well. Unfortunately, it's definitely not a network failure:

- AWS support team has not detected any network outages affecting the EC2 instance or the EBS volumes at any time remotely near when our outages occurred.
- I can consistently ping the database instance from the application servers while the problem is occurring.
- I can SSH into the database instance and access Postgres while the problem is occurring.

--
Sean Laurent
Director of Operations
StudyBlue, Inc.
On Tue, Oct 11, 2011 at 12:04 AM, Craig Ringer <ringerc@ringerc.id.au> wrote:
> On 11/10/11 12:48, John R Pierce wrote:
>> On 10/10/11 7:44 PM, Craig Ringer wrote:
>>> If blocking writes causes a server failure that persists once writes
>>> have been unblocked, that's a bug IMO. You might have a bit of a backlog
>>> of writes to clear, but after that all should be well, and if it isn't
>>> then something needs fixing.
>>
>> the process is blocked waiting for this disk write to complete,
>> meanwhile, the packets are queuing up and waiting for service.
>>
>> best of luck with all that....
>
> xfs_freeze for long enough to take a snapshot doesn't take long, or it
> shouldn't, anyway.

On average, xfs_freeze takes about 2 seconds for us with 8 EBS volumes at 60GB each in a software RAID-0 array.

> Even if it did, that shouldn't cause a server failure
> that persists past when disk I/O is resumed, though it might cause
> individual connections to drop.

<DELETED>

> It is totally unreasonable for Pg to *stay* nonfunctional once disk I/O
> resumes. Existing connections should receive responses they're waiting
> on or die, depending on how long it's been, and new connections should
> be accepted fine.

Exactly. I genuinely expect Postgres to be able to withstand a couple of seconds of blocked disk I/O. Especially since this isn't a heavy duty transaction processing system - it's under load, but not a tremendously high load. During our busier times we average something in the neighborhood of 300-400 transactions per second, which just doesn't seem like that much.

As much as I would like Postgres to withstand a 2 second outage, I don't honestly care. I'd just like to figure out whether I'm looking at something that's actually a problem or if I should be looking elsewhere for the problem.

--
Sean Laurent
Director of Operations
StudyBlue, Inc.
On Tue, Oct 11, 2011 at 5:00 PM, Sean Laurent <sean@studyblue.com> wrote:
> As much as I would like Postgres to withstand a 2 second outage, I
> don't honestly care. I'd just like to figure out whether I'm looking
> at something that's actually a problem or if I should be looking
> elsewhere for the problem.

Any chance this is a client side failure? I.e. the client lib is seeing the 2+ second zero response time as a disconnect?
On Tue, Oct 11, 2011 at 8:50 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Tue, Oct 11, 2011 at 5:00 PM, Sean Laurent <sean@studyblue.com> wrote:
>> As much as I would like Postgres to withstand a 2 second outage, I
>> don't honestly care. I'd just like to figure out whether I'm looking
>> at something that's actually a problem or if I should be looking
>> elsewhere for the problem.
>
> Any chance this is a client side failure? I.e. the client lib is
> seeing the 2+ second zero response time as a disconnect?

Good question. I don't know. Let me look into that and get back to the list when I have a better answer.

--
Sean Laurent
Director of Operations
StudyBlue, Inc.
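The client-side hypothesis can be illustrated in isolation with the Python sketch below. It does not involve Postgres or JDBC at all: a toy server (made up for this example) accepts one request, stalls for a while to stand in for a backend blocked on a frozen filesystem, and then answers. Whether the client sees a normal reply or a failure depends only on how its read timeout compares with the stall. Whether the real JDBC driver or connection pool behaves this way is exactly the open question, so treat this as a model of the idea, not a diagnosis.

```python
import socket
import threading
import time


def stalled_server(listener: socket.socket, stall: float) -> None:
    """Accept one connection, read the request, stall, then reply."""
    conn, _ = listener.accept()
    with conn:
        conn.recv(1024)
        time.sleep(stall)  # stands in for a backend blocked on frozen storage
        try:
            conn.sendall(b"ok")
        except OSError:
            pass  # the client has already given up and closed the socket


def query(port: int, read_timeout: float) -> str:
    """Send one request and wait up to read_timeout seconds for the reply."""
    with socket.create_connection(("127.0.0.1", port)) as sock:
        sock.settimeout(read_timeout)
        sock.sendall(b"ping")
        try:
            return sock.recv(1024).decode()
        except socket.timeout:
            return "timeout"


def demo(stall: float, read_timeout: float) -> str:
    """Run one stalled exchange on a throwaway loopback port."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    worker = threading.Thread(target=stalled_server, args=(listener, stall))
    worker.start()
    try:
        return query(port, read_timeout)
    finally:
        worker.join()
        listener.close()
```

With a stall longer than the read timeout, for example `demo(stall=0.5, read_timeout=0.1)`, the client reports a timeout even though the server was healthy and would have answered; with a patient client, `demo(stall=0.1, read_timeout=2.0)`, the same stall is invisible apart from the delay.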