Thread: Incredibly slow restore times after 9.0 -> 9.2 upgrade
Hi, we have a nightly job that restores current production data into a 'warm spare' development database, so that fresh data is ready for the developers during the day. When we moved from 9.0 to 9.2, the restores went from a few hours to more like 15 hours. We're in Amazon EC2; I've tried new EBS volumes, warmed them up, threw IOPS at them, pretty much all the standard stuff to get more disk performance.

Here's the thing: the disk isn't saturated. The behavior I'm seeing seems very odd to me. The source disk holding the dump is saturated by reads, which is great, but then I see almost nothing being written to the postgres volume. Just nothing happening, then a small burst. There is no write queue backup on the destination disk either. If I look at pg_stat_activity I'll see something like:

  COPY salesforce_reconciliation (salesforce_id, email, advisor_salesforce_id, processed) FROM stdin

and even for small tables that seems to take a very long time, even though the destination disk is at almost 0% utilization.

The dumps are created with pg_dump -Fc and restored with pg_restore -d db -j 2 -O -U postgres PostgreSQL-db.sql.

Is it possible that some default settings changed from 9.0 to 9.2 that would cause this kind of behavior? I'm stumped here. Thanks in advance for any consideration.
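P.S. In case it's useful, this is roughly how I'm watching what the restore workers are doing (a sketch using 9.2 column names; the LIKE filter on COPY is just illustrative):

  SELECT pid, state, now() - query_start AS runtime, query
    FROM pg_stat_activity
   WHERE query LIKE 'COPY %'
   ORDER BY query_start;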
On 28.10.2014 21:55, jmcdonagh wrote:
> Hi, we have a nightly job that restores current production data to
> the development databases in a 'warm spare' database so that if the
> developers need fresh data, it's ready during the day. When we moved
> from 9.0 to 9.2 suddenly the restores began to take from a few hours
> to more like 15 hours or so. We're in Amazon EC2, I've tried new EBS
> volumes, warmed them up, threw IOPS at them, pretty much all the
> standard stuff to get more disk performance.

So, if I understand it correctly, you've been restoring into 9.0, then you switched to 9.2 and it's much slower?

Is the 9.2 instance configured the same as the 9.0 one was? If you run something like this

  SELECT name, setting
    FROM pg_settings
   WHERE source = 'configuration file';

on both versions, what do you get?

> Here's the thing, the disk isn't saturated. The behavior I'm seeing
> seems very odd to me; I'm seeing the source disk which holds the dump
> saturated by reads, which is great, but then I just see nothing being
> written to the postgres volume. Just nothing happening, then a
> small burst. There is no write queue backup on the destination disk
> either. if I look at pg_stat_activity I'll see something like:
>
> COPY salesforce_reconciliation (salesforce_id, email,
> advisor_salesforce_id, processed) FROM stdin
>
> and even for small tables, that seems to take a very long time even
> though the destination disk is almost at 0 utilization.

So, where's the bottleneck? Clearly there is one; is it the CPU, a disk, or something else? Or maybe the network, since you're using EBS?

What do you mean by 'utilization'? How do you measure that?

> The dumps are created with pg_dump -Fc and restored with pg_restore
> -d db -j 2 -O -U postgres PostgreSQL-db.sql.

OK

> Is it possible that some default settings were changed from 9.0 to
> 9.2 that would cause this kind of behavior? I'm stumped here. Thanks
> in advance for any consideration here.

I doubt it. There probably were some changes (after all, we're talking about two major versions), but we generally don't change defaults in a way that would hurt performance.

regards
Tomas
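P.S. If both boxes happen to still be reachable, a quick way to compare is to dump that query from each server and diff the output (just a sketch; the host names are placeholders):

  psql -h old-90-host -U postgres -Atc \
    "SELECT name || ' = ' || setting FROM pg_settings WHERE source = 'configuration file'" \
    > settings-9.0.txt
  psql -h new-92-host -U postgres -Atc \
    "SELECT name || ' = ' || setting FROM pg_settings WHERE source = 'configuration file'" \
    > settings-9.2.txt
  diff settings-9.0.txt settings-9.2.txt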
Hi Tomas, thank you for your thoughtful response!

Tomas Vondra wrote
> On 28.10.2014 21:55, jmcdonagh wrote:
>> Hi, we have a nightly job that restores current production data to
>> the development databases in a 'warm spare' database so that if the
>> developers need fresh data, it's ready during the day. When we moved
>> from 9.0 to 9.2 suddenly the restores began to take from a few hours
>> to more like 15 hours or so. We're in Amazon EC2, I've tried new EBS
>> volumes, warmed them up, threw IOPS at them, pretty much all the
>> standard stuff to get more disk performance.
>
> So, if I understand it correctly, you've been restoring into 9.0, then
> you switched to 9.2 and it's much slower?

Yes. The move was done using snapshots, so it already involved new volumes, but I have also created new volumes since then to rule out a single bad volume.

Tomas Vondra wrote
> Is the 9.2 configured equally to 9.0? If you do something like this
>
>   SELECT name, setting
>     FROM pg_settings
>    WHERE source = 'configuration file';
>
> on both versions, what do you get?

I no longer have the 9.0 box up, but we do track configuration via puppet and git. The only configuration change made for 9.2 is:

  -#standard_conforming_strings = off
  +standard_conforming_strings = off

That's because we have an old app that needs this setting on; otherwise we'd spend a lot of time trying to fix it.

Tomas Vondra wrote
>> Here's the thing, the disk isn't saturated. The behavior I'm seeing
>> seems very odd to me; I'm seeing the source disk which holds the dump
>> saturated by reads, which is great, but then I just see nothing being
>> written to the postgres volume. Just nothing happening, then a
>> small burst. There is no write queue backup on the destination disk
>> either. if I look at pg_stat_activity I'll see something like:
>>
>> COPY salesforce_reconciliation (salesforce_id, email,
>> advisor_salesforce_id, processed) FROM stdin
>>
>> and even for small tables, that seems to take a very long time even
>> though the destination disk is almost at 0 utilization.
>
> So, where's the bottleneck? Clearly, there's one, so is it a CPU, a disk
> or something else? Or maybe network, because you're using EBS?
>
> What do you mean by 'utilization'? How do you measure that?

The bottleneck is I/O, somehow. I say "somehow" because I see iowait averaging about 50% across the two CPUs, yet there are essentially no writes to the destination EBS volume actually happening: just reads from the disk holding the source dump, then a burst of writes to the destination volume every so often. It's kind of puzzling. This is happening on multiple database servers, in multiple availability zones. It's driving me bonkers.

What I mean by utilization is %util from iostat -m -x 1.
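For reference, this is the sort of thing I'm running while a restore is going (the device names here are just examples from my layout; substitute whatever holds the dump and the postgres data directory):

  # one-second samples of per-device throughput, queue depth and %util
  # xvdf = volume holding the dump, xvdg = postgres data volume (examples)
  iostat -m -x 1 xvdf xvdg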
Is the instance EBS-optimized? I am wondering if it's a configuration issue on the instance, not postgres or EBS.
Hi Jason, oddly enough that setting, on or off, does not affect this particular issue; as a rule I generally enable it on instance types that support it. I recently tried moving the nodes up to the latest generation (m3) to try to improve this, but right now m3 won't work for us because we rely on a lot of space in /mnt for temporary data work and the new instances don't have much space there (though what they do have is much faster). So I went back to m1.large and left EBS-optimized off, and I'm not seeing any noteworthy change in performance either way.

I also fired up an RDS postgres instance. Eventually I want to move to RDS anyway, but it's not a good short-term solution to the immediate issue. A restore is running there now, so I'll know within the next day or so whether it's much faster.

I'm still puzzled that this coincided with the change from 9.0 to 9.2, though. Before the upgrade this job never had any issues. But I am 100% aware that could be a red herring.
I just had a thought: I know some of these tables are in need of a vacuum. Could it be that the dump contains a bunch of garbage that the restore then has to sift through? I don't know enough of the details to know whether that's a dumb thought or not.

The restore to RDS took roughly the same amount of time. My next move is to try a fast instance store, and also to do a 9.0 restore from a plain SQL dump, though that won't be a great comparison since I normally use the custom format. I'm assuming here that I can't take the custom-format dump from 9.2 and apply it to 9.0, or can I?
jmcdonagh <Joseph.E.McDonagh@gmail.com> writes:
> I just had a thought: I know some of these tables are in need of a vacuum.
> Could it be that the dump contains a bunch of garbage that the restore
> then has to sift through? I don't know enough of the details to know
> whether that's a dumb thought or not.

No. It's true that the dump will take a bit longer scanning a bloated table than a tight one, but the dump only outputs the live rows, so psql or pg_restore (whatever you're using on the target side) will not have to step over any junk.

HTH

> The restore to RDS took roughly the same amount of time. My next move is
> to try a fast instance store, and also to do a 9.0 restore from a plain
> SQL dump, though that won't be a great comparison since I normally use
> the custom format. I'm assuming here that I can't take the custom-format
> dump from 9.2 and apply it to 9.0, or can I?

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800
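P.S. If you want a rough sense of how much dead-row baggage the dump side is scanning past, something like this on the source database gives an idea (just a sketch; the pg_stat_user_tables counters are estimates, not exact counts):

  SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
    FROM pg_stat_user_tables
   ORDER BY n_dead_tup DESC
   LIMIT 20;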
On 29.10.2014 16:12, jmcdonagh wrote:
> Hi Tomas- thank you for your thoughtful response!
>
> Tomas Vondra wrote
>> On 28.10.2014 21:55, jmcdonagh wrote:
>>> Hi, we have a nightly job that restores current production data to
>>> the development databases in a 'warm spare' database so that if the
>>> developers need fresh data, it's ready during the day. When we moved
>>> from 9.0 to 9.2 suddenly the restores began to take from a few hours
>>> to more like 15 hours or so. We're in Amazon EC2, I've tried new EBS
>>> volumes, warmed them up, threw IOPS at them, pretty much all the
>>> standard stuff to get more disk performance.
>>
>> So, if I understand it correctly, you've been restoring into 9.0, then
>> you switched to 9.2 and it's much slower?
>
> Yes- but since the move was done utilizing snapshots so the move
> involves new volumes, but I have created new volumes since then to
> rule out a single bad volume.

My advice would be to do some basic low-level performance tests to rule this out. Use dd or (better) fio to test raw I/O performance; it's much easier to spot issues that way.

> Tomas Vondra wrote
>> Is the 9.2 configured equally to 9.0? If you do something like this
>>
>>   SELECT name, setting
>>     FROM pg_settings
>>    WHERE source = 'configuration file';
>>
>> on both versions, what do you get?
>
> I no longer have the 9.0 box up but we do track configuration via
> puppet and git. The only configuration change made for 9.2 is:
>
> -#standard_conforming_strings = off
> +standard_conforming_strings = off

Compared to 9.0, I suppose? Anyway, please post the non-default config values, at least for 9.2.

> Cause we have an old app that needs this setting on otherwise we'd
> spend a lot of time trying to fix it.

I doubt standard_conforming_strings has anything to do with the issue.

> Tomas Vondra wrote
>>> Here's the thing, the disk isn't saturated. The behavior I'm seeing
>>> seems very odd to me; I'm seeing the source disk which holds the dump
>>> saturated by reads, which is great, but then I just see nothing being
>>> written to the postgres volume. Just nothing happening, then a
>>> small burst. There is no write queue backup on the destination disk
>>> either. if I look at pg_stat_activity I'll see something like:
>>>
>>> COPY salesforce_reconciliation (salesforce_id, email,
>>> advisor_salesforce_id, processed) FROM stdin
>>>
>>> and even for small tables, that seems to take a very long time even
>>> though the destination disk is almost at 0 utilization.
>>
>> So, where's the bottleneck? Clearly, there's one, so is it a CPU, a
>> disk or something else? Or maybe network, because you're using EBS?
>>
>> What do you mean by 'utilization'? How do you measure that?
>
> The bottleneck is I/O somehow. I say somehow, because I see iowait
> averaging about 50% between two CPUs, but there is just no writes to
> the destination EBS volume really happening, just reads from the
> disk where the source dump is located, then bursts of writes to the
> destination volume every so often. It's kind of puzzling. This is
> happening on multiple database servers, in multiple availability
> zones. Driving me bonkers.
>
> What I mean by utilization is %util from iostat -m -x 1.

I find this rather contradictory. At one moment you say the disk isn't saturated, the next moment you say you're I/O bound.

Also, iowait (as reported e.g. by 'top') is tricky to interpret correctly, especially on multi-CPU systems; there's a nice intro to the complexity at [1]. It's really difficult to interpret the 50% iowait without more info about what's happening on the machine.

IMHO the %util reported by iostat is much easier to interpret, because it means "% of time the device was servicing requests". It has issues too, because 100% does not mean 'saturated' (especially on RAID arrays that can service multiple requests in parallel), but it's better than iowait.

If I had to guess from your info, I'd bet you're CPU bound: there's very little idle time, and about 50% of it is spent waiting for I/O requests (hence the 50% iowait), but the total amount of I/O is very small, so %util is ~0.

Please post a few lines of 'iostat -x -k 1' output. Samples from 'top' and 'vmstat 1' would be handy too.

regards
Tomas

[1] http://veithen.blogspot.cz/2013/11/iowait-linux.html
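P.S. As a concrete starting point, something along these lines exercises the destination volume directly and then captures the stats while a restore runs (just a sketch; the test directory, size and sample counts are examples to adjust):

  # raw sequential write test against a path on the destination volume
  fio --name=seqwrite --directory=/mnt/test --rw=write \
      --bs=8k --size=2G --direct=1 --group_reporting

  # while the restore is running, grab a minute of samples
  iostat -x -k 1 60 > iostat.log &
  vmstat 1 60 > vmstat.log &
  top -b -d 1 -n 60 > top.log &
  wait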
Thanks for the confirmation, Jerry.
Tomas Vondra wrote
> On 29.10.2014 16:12, jmcdonagh wrote:
>> Yes- but since the move was done utilizing snapshots so the move
>> involves new volumes, but I have created new volumes since then to
>> rule out a single bad volume.
>
> My advice would be to do some basic low-level performance tests to rule
> this out. Use dd or (better) fio to test raw I/O performance; it's
> much easier to spot issues that way.

I've done dd tests and the volumes perform fine.

Tomas Vondra wrote
>> I no longer have the 9.0 box up but we do track configuration via
>> puppet and git. The only configuration change made for 9.2 is:
>>
>> -#standard_conforming_strings = off
>> +standard_conforming_strings = off
>
> Compared to 9.0, I suppose? Anyway, please post the non-default config
> values, at least for 9.2.

Yeah, in comparison to 9.0 the only change was that one. Here are the non-default settings (some of them are probably at their default values anyway; these are the uncommented lines from postgresql.conf):

  data_directory = '/mnt/postgresql/9.2/main'
  hba_file = '/etc/postgresql/9.2/main/pg_hba.conf'
  ident_file = '/etc/postgresql/9.2/main/pg_ident.conf'
  external_pid_file = '/var/run/postgresql/9.2-main.pid'
  listen_addresses = '*'
  port = 5432
  max_connections = 300
  unix_socket_directory = '/var/run/postgresql'
  ssl = true
  shared_buffers = 4GB
  temp_buffers = 128MB
  work_mem = 256MB
  maintenance_work_mem = 256MB
  wal_buffers = 512kB
  commit_delay = 50000
  commit_siblings = 1
  random_page_cost = 2.0
  effective_cache_size = 16GB
  from_collapse_limit = 10
  join_collapse_limit = 10
  log_destination = 'stderr'
  client_min_messages = warning
  log_min_messages = warning
  log_min_duration_statement = 1000
  log_line_prefix = '%t '
  autovacuum = on
  datestyle = 'iso, mdy'
  timezone = EST5EDT
  client_encoding = sql_ascii
  lc_messages = 'en_US.UTF-8'
  lc_monetary = 'en_US.UTF-8'
  lc_numeric = 'en_US.UTF-8'
  lc_time = 'en_US.UTF-8'
  default_text_search_config = 'pg_catalog.english'
  standard_conforming_strings = off

Tomas Vondra wrote
>> The bottleneck is I/O somehow. I say somehow, because I see iowait
>> averaging about 50% between two CPUs, but there is just no writes to
>> the destination EBS volume really happening, just reads from the
>> disk where the source dump is located, then bursts of writes to the
>> destination volume every so often. It's kind of puzzling. This is
>> happening on multiple database servers, in multiple availability
>> zones. Driving me bonkers.
>>
>> What I mean by utilization is %util from iostat -m -x 1.
>
> I find this rather contradictory. At one moment you say the disk isn't
> saturated, the next moment you say you're I/O bound.
>
> Also, iowait (as reported e.g. by 'top') is tricky to interpret
> correctly, especially on multi-CPU systems; there's a nice intro to the
> complexity at [1]. It's really difficult to interpret the 50% iowait
> without more info about what's happening on the machine.
>
> IMHO the %util reported by iostat is much easier to interpret, because
> it means "% of time the device was servicing requests". It has issues
> too, because 100% does not mean 'saturated' (especially on RAID arrays
> that can service multiple requests in parallel), but it's better than
> iowait.
>
> If I had to guess from your info, I'd bet you're CPU bound: there's
> very little idle time, and about 50% of it is spent waiting for I/O
> requests (hence the 50% iowait), but the total amount of I/O is very
> small, so %util is ~0.
>
> Please post a few lines of 'iostat -x -k 1' output. Samples from 'top'
> and 'vmstat 1' would be handy too.

Well, I'm confused by this whole thing too, which is why I came here. I can gather those statistics, but first a quick question: could this be caused by frivolous indexes or something like that?
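For context, this is roughly what I'd run to see how many indexes each table carries and how big they are in total (a sketch against the standard catalogs; the WHERE clause just excludes system schemas):

  SELECT schemaname,
         tablename,
         count(*) AS index_count,
         pg_size_pretty(pg_indexes_size(
             (quote_ident(schemaname) || '.' || quote_ident(tablename))::regclass)) AS total_index_size
    FROM pg_indexes
   WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
   GROUP BY schemaname, tablename
   ORDER BY index_count DESC
   LIMIT 20;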