Thread: Incredibly slow restore times after 9.0>9.2 upgrade

Incredibly slow restore times after 9.0>9.2 upgrade

From: jmcdonagh
Hi, we have a nightly job that restores current production data into a 'warm
spare' database on the development servers, so that if the developers need
fresh data, it's ready during the day. When we moved from 9.0 to 9.2, the
restores suddenly went from taking a few hours to more like 15 hours. We're
in Amazon EC2; I've tried new EBS volumes, warmed them up, threw IOPS at
them, pretty much all the standard stuff to get more disk performance.

Here's the thing: the disk isn't saturated. The behavior I'm seeing seems
very odd to me. The source disk holding the dump is saturated by reads,
which is great, but then I see almost nothing being written to the postgres
volume. Just nothing happening, then a small burst. There is no write-queue
backup on the destination disk either. If I look at pg_stat_activity I'll
see something like:

COPY salesforce_reconciliation (salesforce_id, email, advisor_salesforce_id,
processed) FROM stdin

and even for small tables, that seems to take a very long time, even though
the destination disk is at almost 0% utilization.

The dumps are created with pg_dump -Fc and restored with pg_restore -d db -j
2 -O -U postgres PostgreSQL-db.sql (a custom-format archive, despite the
.sql extension).

Is it possible that some default settings changed from 9.0 to 9.2 in a way
that would cause this kind of behavior? I'm stumped. Thanks in advance for
any consideration.
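For reference, the nightly cycle described above amounts to something like the sketch below. The database name and dump path are placeholders, not the actual production values, and the commands are built as strings rather than executed:

```shell
# Placeholders; substitute your real database name and dump location.
DB=db
DUMP=/mnt/dumps/PostgreSQL-db.sql   # custom-format archive, despite the .sql suffix

# -Fc produces the compressed custom format, which is what lets
# pg_restore parallelize the restore with -j.
DUMP_CMD="pg_dump -Fc -U postgres -f $DUMP $DB"

# -j 2: run two parallel restore jobs; -O: do not restore object ownership.
RESTORE_CMD="pg_restore -d $DB -j 2 -O -U postgres $DUMP"

echo "$DUMP_CMD"
echo "$RESTORE_CMD"
```

Note that parallel restore (-j) only works against custom-format or directory-format archives, not plain SQL dumps.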



--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Incredibly-slow-restore-times-after-9-0-9-2-upgrade-tp5824701.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


Re: Incredibly slow restore times after 9.0>9.2 upgrade

From: Tomas Vondra
On 28.10.2014 21:55, jmcdonagh wrote:
> Hi, we have a nightly job that restores current production data to
> the development databases in a 'warm spare' database so that if the
> developers need fresh data, it's ready during the day. When we moved
> from 9.0 to 9.2 suddenly the restores began to take from a few hours
> to more like 15 hours or so. We're in Amazon EC2, I've tried new EBS
> volumes, warmed them up, threw IOPS at them, pretty much all the
> standard stuff to get more disk performance.

So, if I understand it correctly, you've been restoring into 9.0, then
you switched to 9.2 and it's much slower?

Is the 9.2 configured equally to 9.0? If you do something like this

  SELECT name, setting
    FROM pg_settings
   WHERE source = 'configuration file';

on both versions, what do you get?

> Here's the thing, the disk isn't saturated. The behavior I'm seeing
> seems very odd to me; I'm seeing the source disk which holds the dump
> saturated by reads, which is great, but then I just see nothing being
> written to the postgres volume. Just nothing happening, then a
> small burst. There is no write queue backup on the destination disk
> either. if I look at pg_stat_activity I'll see something like:
>
> COPY salesforce_reconciliation (salesforce_id, email,
> advisor_salesforce_id, processed) FROM stdin
>
> and even for small tables, that seems to take a very long time even
> though the destination disk is almost at 0 utilization.

So, where's the bottleneck? Clearly, there's one, so is it a CPU, a disk
or something else? Or maybe network, because you're using EBS?

What do you mean by 'utilization'? How do you measure that?


> The dumps are created with pg_dump -Fc and restored with pg_restore
> -d db -j 2 -O -U postgres PostgreSQL-db.sql.

OK

> Is it possible that some default settings were changed from 9.0 to
> 9.2 that would cause this kind of behavior? I'm stumped here. Thanks
> in advance for any consideration here.

I doubt that. There probably were some changes (after all, we're talking
about two major versions), but we generally don't change defaults in a way
that would hurt performance.

regards
Tomas



Re: Incredibly slow restore times after 9.0>9.2 upgrade

From: jmcdonagh
Hi Tomas- thank you for your thoughtful response!


Tomas Vondra wrote
> On 28.10.2014 21:55, jmcdonagh wrote:
>> Hi, we have a nightly job that restores current production data to
>> the development databases in a 'warm spare' database so that if the
>> developers need fresh data, it's ready during the day. When we moved
>> from 9.0 to 9.2 suddenly the restores began to take from a few hours
>> to more like 15 hours or so. We're in Amazon EC2, I've tried new EBS
>> volumes, warmed them up, threw IOPS at them, pretty much all the
>> standard stuff to get more disk performance.
>
> So, if I understand it correctly, you've been restoring into 9.0, then
> you switched to 9.2 and it's much slower?

Yes, but the move was done using snapshots, so it already involved new
volumes; I have also created new volumes since then to rule out a single
bad volume.


Tomas Vondra wrote
> Is the 9.2 configured equally to 9.0? If you do something like this
>
>   SELECT name, setting
>     FROM pg_settings
>    WHERE source = 'configuration file';
>
> on both versions, what do you get?

I no longer have the 9.0 box up, but we do track configuration via puppet
and git. The only configuration change made for 9.2 is:

-#standard_conforming_strings = off
+standard_conforming_strings = off

That's because we have an old app that needs this setting; otherwise we'd
spend a lot of time trying to fix the app.


Tomas Vondra wrote
>> Here's the thing, the disk isn't saturated. The behavior I'm seeing
>> seems very odd to me; I'm seeing the source disk which holds the dump
>> saturated by reads, which is great, but then I just see nothing being
>> written to the postgres volume. Just nothing happening, then a
>> small burst. There is no write queue backup on the destination disk
>> either. if I look at pg_stat_activity I'll see something like:
>>
>> COPY salesforce_reconciliation (salesforce_id, email,
>> advisor_salesforce_id, processed) FROM stdin
>>
>> and even for small tables, that seems to take a very long time even
>> though the destination disk is almost at 0 utilization.
>
> So, where's the bottleneck? Clearly, there's one, so is it a CPU, a disk
> or something else? Or maybe network, because you're using EBS?
>
> What do you mean by 'utilization'? How do you measure that?

The bottleneck is I/O, somehow. I say somehow because I see iowait averaging
about 50% across the two CPUs, yet there are essentially no writes happening
to the destination EBS volume, just reads from the disk where the source
dump is located, then bursts of writes to the destination volume every so
often. It's kind of puzzling. This is happening on multiple database
servers, in multiple availability zones. Driving me bonkers.

What I mean by utilization is the %util column from iostat -m -x 1.







Re: Incredibly slow restore times after 9.0>9.2 upgrade

From: "Mathis, Jason"
Is the instance EBS-optimized? I am wondering if it's a configuration issue on the instance rather than Postgres or EBS.

On Wed, Oct 29, 2014 at 10:12 AM, jmcdonagh <Joseph.E.McDonagh@gmail.com> wrote:


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: Incredibly slow restore times after 9.0>9.2 upgrade

From: jmcdonagh
Hi Jason, oddly enough turning the setting on or off does not affect this
particular issue. As a rule I generally enable this option on my instances
that support it. I recently tried upgrading the nodes to the latest
generation (m3) to try to rectify/improve this issue. Unfortunately m3 won't
work for us right now, because we rely on a lot of space in /mnt for
temporary data work and the new instances don't have much space there
(though what they have is much faster). So I went back to m1.large and left
EBS-optimized off. I'm not seeing any noteworthy change in performance.

So I fired up an RDS Postgres instance. Eventually I want to move to RDS
anyway, but it's not a good short-term solution to the immediate issue. The
restore is running now, so I'll know within the next day or so whether it
is much faster.

I'm puzzled that this coincided with the upgrade from 9.0 to 9.2, though.
Before the upgrade this job never had any issues. But I am 100% aware that
could be a red herring.





Re: Incredibly slow restore times after 9.0>9.2 upgrade

From: jmcdonagh
I just had a thought: I know some of these tables are in need of vacuuming.
Could it be that the dump contains a bunch of garbage that has to be sifted
through on the restore? I don't know enough details to know whether this is
a dumb thought or not.

The restore to RDS took roughly the same amount of time. My next move is to
try a fast instance store, and also to do a 9.0 restore from a plain SQL
dump, though that won't really be a great test since I normally use the
custom format. I'm assuming here that I can't take the custom dump from 9.2
and apply it to 9.0, or can I?





Re: Incredibly slow restore times after 9.0>9.2 upgrade

From: Jerry Sievers
jmcdonagh <Joseph.E.McDonagh@gmail.com> writes:

> I just had a thought- I know some of these tables are in need of a vacuuming.
> Could it be that the dump is dumping a bunch of garbage that the restore has
> to sift through on the restore? I don't know enough details to know if this
> is a dumb thought or not.

No. However, it's true that the dump will take a bit longer, since it has
to scan a bloated table rather than a tight one.

The dump will only output the live rows; psql or pg_restore, whichever
you're using on the target side, will not have to step over any junk.

HTH

>
> The restore to RDS took roughly the same amount of time. My next move is to
> try on a fast instance store, and also do a postgres 9 restore of a pure SQL
> dump, but that won't really be a great test since I use custom format. I'm
> assuming here that I can't take the custom dump from 9.2 and apply it to
> 9.0, or can I?
>
>
>

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800


Re: Incredibly slow restore times after 9.0>9.2 upgrade

From: Tomas Vondra
On 29.10.2014 16:12, jmcdonagh wrote:
> Hi Tomas- thank you for your thoughtful response!
>
>
> Tomas Vondra wrote
>> On 28.10.2014 21:55, jmcdonagh wrote:
>>> Hi, we have a nightly job that restores current production data to
>>> the development databases in a 'warm spare' database so that if the
>>> developers need fresh data, it's ready during the day. When we moved
>>> from 9.0 to 9.2 suddenly the restores began to take from a few hours
>>> to more like 15 hours or so. We're in Amazon EC2, I've tried new EBS
>>> volumes, warmed them up, threw IOPS at them, pretty much all the
>>> standard stuff to get more disk performance.
>>
>> So, if I understand it correctly, you've been restoring into 9.0, then
>> you switched to 9.2 and it's much slower?
>
> Yes- but since the move was done utilizing snapshots so the move
> involves new volumes, but I have created new volumes since then to
> rule out a single bad volume.

My advice would be to do some basic low-level performance tests to rule
this out. Use dd or (better) fio to test basic I/O performance, it's
much easier to spot issues that way.
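A crude version of the suggested dd test might look like the sketch below. This is only a sequential-I/O smoke test against whatever filesystem holds the temp file; fio is the better tool for anything serious, since it can exercise random I/O and concurrency:

```shell
# Sequential write/read sanity check with dd (GNU coreutils assumed).
TESTFILE=$(mktemp)

# Write 32 MiB; conv=fdatasync forces the data to disk before dd exits,
# so the reported timing includes the flush, not just the page cache.
dd if=/dev/zero of="$TESTFILE" bs=1M count=32 conv=fdatasync

# Read it back.
dd if="$TESTFILE" of=/dev/null bs=1M

SIZE=$(($(wc -c < "$TESTFILE")))
rm -f "$TESTFILE"
echo "wrote and read back $SIZE bytes"
```

On EBS, run this against the actual destination volume (not /tmp) and with a file large enough to defeat caching before drawing any conclusions.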

> Tomas Vondra wrote
>> Is the 9.2 configured equally to 9.0? If you do something like this
>>
>>   SELECT name, setting
>>     FROM pg_settings
>>    WHERE source = 'configuration file';
>>
>> on both versions, what do you get?
>
> I no longer have the 9.0 box up but we do track configuration via
> puppet and git. The only configuration change made for 9.2 is:
>
> -#standard_conforming_strings = off
> +standard_conforming_strings = off

Compared to 9.0, I suppose? Anyway, post the non-default config values
at least for 9.2, please.

> Cause we have an old app that needs this setting on otherwise we'd
> spend a lot of time trying to fix it.

I doubt standard_conforming_strings has anything to do with the issues.

> Tomas Vondra wrote
>>> Here's the thing, the disk isn't saturated. The behavior I'm seeing
>>> seems very odd to me; I'm seeing the source disk which holds the dump
>>> saturated by reads, which is great, but then I just see nothing being
>>> written to the postgres volume. Just nothing happening, then a
>>> small burst. There is no write queue backup on the destination disk
>>> either. if I look at pg_stat_activity I'll see something like:
>>>
>>> COPY salesforce_reconciliation (salesforce_id, email,
>>> advisor_salesforce_id, processed) FROM stdin
>>>
>>> and even for small tables, that seems to take a very long time even
>>> though the destination disk is almost at 0 utilization.
>>
>> So, where's the bottleneck? Clearly, there's one, so is it a CPU, a
>> disk or something else? Or maybe network, because you're using EBS?
>>
>> What do you mean by 'utilization'? How do you measure that?
>
> The bottleneck is I/O somehow. I say somehow, because I see iowait
> averaging about 50% between two CPUs, but there is just no writes to
> the destination EBS volume really happening, just reads from the
> disk where the source dump is located, then bursts of writes to the
> destination volume every so often. It's kind of puzzling. This is
> happening on multiple database servers, in multiple availability
> zones. Driving me bonkers.
>
> What I mean by utilization is util% from iostat -m -x 1.

I find this rather contradictory. At one moment you say the disk isn't
saturated, the next moment you say you're I/O bound.

Also, iowait (as reported e.g. by 'top') is tricky to interpret correctly,
especially on multi-CPU systems (there's a nice intro to the complexity in
[1]). It's really difficult to interpret the 50% iowait without more info
about what's happening on the machine.

IMHO, the utilization (%util, as reported by iostat) is much easier to
interpret, because it means '% of time the device was servicing requests'.
It has issues too, because 100% does not mean 'saturated' (especially on
RAID arrays that can service multiple requests in parallel), but it's
better than iowait.
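For what it's worth, %util is just the last column of each device line in sysstat's extended iostat output, so it's easy to pull out while a restore runs. The sample line below is fabricated for illustration; only the column layout follows `iostat -x -k 1`:

```shell
# Fabricated iostat -x device line; %util is the last field.
SAMPLE="xvdf   0.00  12.00  1.00  3.00  4.00  60.00  32.00  0.05  11.25  2.50  1.00"

DEV=$(echo "$SAMPLE" | awk '{print $1}')    # device name, first field
UTIL=$(echo "$SAMPLE" | awk '{print $NF}')  # %util, last field

echo "$DEV %util=$UTIL"
```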

If I had to guess from your info, I'd bet you're CPU bound, so there's very
little idle time and about 50% of it is spent waiting for I/O requests
(hence the 50% iowait). But in total the amount of I/O is very small, so
%util is ~0.

Please, post a few lines of 'iostat -x -k 1' output. Samples from 'top'
and 'vmstat 1' would be handy too.

regards
Tomas

[1] http://veithen.blogspot.cz/2013/11/iowait-linux.html



Re: Incredibly slow restore times after 9.0>9.2 upgrade

From: jmcdonagh
Thanks for the confirmation Jerry.





Re: Incredibly slow restore times after 9.0>9.2 upgrade

From: jmcdonagh
Tomas Vondra wrote
> On 29.10.2014 16:12, jmcdonagh wrote:
>> Hi Tomas- thank you for your thoughtful response!
>>
>>
>> Tomas Vondra wrote
>>> On 28.10.2014 21:55, jmcdonagh wrote:
>>>> Hi, we have a nightly job that restores current production data to
>>>> the development databases in a 'warm spare' database so that if the
>>>> developers need fresh data, it's ready during the day. When we moved
>>>> from 9.0 to 9.2 suddenly the restores began to take from a few hours
>>>> to more like 15 hours or so. We're in Amazon EC2, I've tried new EBS
>>>> volumes, warmed them up, threw IOPS at them, pretty much all the
>>>> standard stuff to get more disk performance.
>>>
>>> So, if I understand it correctly, you've been restoring into 9.0, then
>>> you switched to 9.2 and it's much slower?
>>
>> Yes- but since the move was done utilizing snapshots so the move
>> involves new volumes, but I have created new volumes since then to
>> rule out a single bad volume.
>
> My advice would be to do some basic low-level performance tests to rule
> this out. Use dd or (better) fio to test basic I/O performance, it's
> much easier to spot issues that way.

I've done dd tests and the volumes perform fine.


Tomas Vondra wrote
>> Tomas Vondra wrote
>>> Is the 9.2 configured equally to 9.0? If you do something like this
>>>
>>>   SELECT name, setting
>>>     FROM pg_settings
>>>    WHERE source = 'configuration file';
>>>
>>> on both versions, what do you get?
>>
>> I no longer have the 9.0 box up but we do track configuration via
>> puppet and git. The only configuration change made for 9.2 is:
>>
>> -#standard_conforming_strings = off
>> +standard_conforming_strings = off
>
> Compared to 9.0, I suppose? Anyway, post the non-default config values
> at least for 9.2, please.

Yeah, so in comparison the only change was that. Here are the non-default
settings (some of them are probably defaults, but these are the uncommented
lines from postgresql.conf):

data_directory = '/mnt/postgresql/9.2/main'   # use data in another directory
hba_file = '/etc/postgresql/9.2/main/pg_hba.conf'   # host-based authentication file
ident_file = '/etc/postgresql/9.2/main/pg_ident.conf'   # ident configuration file
external_pid_file = '/var/run/postgresql/9.2-main.pid'   # write an extra PID file
listen_addresses = '*'   # what IP address(es) to listen on;
port = 5432   # (change requires restart)
max_connections = 300   # (change requires restart)
unix_socket_directory = '/var/run/postgresql'   # (change requires restart)
ssl = true   # (change requires restart)
shared_buffers = 4GB   # min 128kB
temp_buffers = 128MB   # min 800kB
work_mem = 256MB   # min 64kB
maintenance_work_mem = 256MB   # min 1MB
wal_buffers = 512kB   # min 32kB
commit_delay = 50000   # range 0-100000, in microseconds
commit_siblings = 1   # range 1-1000
random_page_cost = 2.0   # same scale as above
effective_cache_size = 16GB
from_collapse_limit = 10
join_collapse_limit = 10   # 1 disables collapsing of explicit
log_destination = 'stderr'   # Valid values are combinations of
client_min_messages = warning   # values in order of decreasing detail:
log_min_messages = warning   # values in order of decreasing detail:
log_min_duration_statement = 1000   # -1 is disabled, 0 logs all statements
log_line_prefix = '%t '   # special values:
autovacuum = on   # Enable autovacuum subprocess?  'on'
datestyle = 'iso, mdy'
timezone = EST5EDT   # actually, defaults to TZ environment
client_encoding = sql_ascii   # actually, defaults to database
lc_messages = 'en_US.UTF-8'   # locale for system error message
lc_monetary = 'en_US.UTF-8'   # locale for monetary formatting
lc_numeric = 'en_US.UTF-8'   # locale for number formatting
lc_time = 'en_US.UTF-8'   # locale for time formatting
default_text_search_config = 'pg_catalog.english'
standard_conforming_strings = off


Tomas Vondra wrote
>> Tomas Vondra wrote
>>>> Here's the thing, the disk isn't saturated. The behavior I'm seeing
>>>> seems very odd to me; I'm seeing the source disk which holds the dump
>>>> saturated by reads, which is great, but then I just see nothing being
>>>> written to the postgres volume. Just nothing happening, then a
>>>> small burst. There is no write queue backup on the destination disk
>>>> either. if I look at pg_stat_activity I'll see something like:
>>>>
>>>> COPY salesforce_reconciliation (salesforce_id, email,
>>>> advisor_salesforce_id, processed) FROM stdin
>>>>
>>>> and even for small tables, that seems to take a very long time even
>>>> though the destination disk is almost at 0 utilization.
>>>
>>> So, where's the bottleneck? Clearly, there's one, so is it a CPU, a
>>> disk or something else? Or maybe network, because you're using EBS?
>>>
>>> What do you mean by 'utilization'? How do you measure that?
>>
>> The bottleneck is I/O somehow. I say somehow, because I see iowait
>> averaging about 50% between two CPUs, but there is just no writes to
>> the destination EBS volume really happening, just reads from the
>> disk where the source dump is located, then bursts of writes to the
>> destination volume every so often. It's kind of puzzling. This is
>> happening on multiple database servers, in multiple availability
>> zones. Driving me bonkers.
>>
>> What I mean by utilization is util% from iostat -m -x 1.
>
> I find this rather contradictory. At one moment you say the disk isn't
> saturated, the next moment you say you're I/O bound.
>
> Also, iowait (as reported e.g. by 'top') is tricky to interpret
> correctly, especially on multi-cpu systems (nice intro to the complexity
> [1]). It's really difficult to interpret the 50% iowait without more
> info about what's happening on the machine.
>
> IMHO, the utilization (as reported by iotop) is much easier to
> interpret, because it means '% of time the device was servicing
> requests'. It has issues too, because 100% does not mean 'saturated'
> (especially on RAID arrays that can service multiple requests in
> parallel), but it's better than iowait.
>
> If I had to guess based from your info, I'd bet you're CPU bound, so
> there's very little idle time and about 50% of it is spent waiting for
> I/O requests (hence the 50% iowait). But in total the amount of I/O is
> very small, so %util is ~0.
>
> Please, post a few lines of 'iostat -x -k 1' output. Samples from 'top'
> and 'vmstat 1' would be handy too.
>
> regards
> Tomas

Well, I'm confused by this whole thing too, which is why I came here. I can
gather those statistics, but first a quick question: could this be caused
by frivolous indexes or something like that?


