Thread: Server slowing down over time

Server slowing down over time

From
Jean Cavallo
Date:
Hi,

I am currently working on a data migration for a client.
The general plan is :
  - Read data from a postgresql database
  - Convert them to the new application
  - Insert in another database (same postgresql instance).

The source database is rather big (~40GB, wo indexes), and the
conversion process takes some time. It is done by multiple workers
on a separate Linux environnement, piece by piece.

When we start the migration, at first it looks good.
Performances are good, and it ran smoothly. After a few hours,
we noticed that things started to slow down. Some queries seemed
to be stuck, so we waited for them to end, and restarted the server.

After that it went well for some time (~10 minutes), then it slowed
down again. We tried again (a few times), and the pattern repeats.

My postgresql specific problem is that it looks like the server gets
stuck. CPU usage is <10%, RAM usage is under 50% max, there is
no noticeable disk usage. But, there are some (<10) active queries,
some of which may take several hours to complete. Those queries
work properly (i.e < 1min) right after the server restarts.

So my question is : What could slow the queries from ~1min to 2hours
which does not involve CPU, Memory, or disk usage, and which would
"reset" when restarting the server ?

For information, the number of processes does not seem to be the
problem, there are ~20 connections with max_connection set to 100.
We noticed at some point that the hard drive holding the target
database was heavily fragmented (100%...), but defrag did not
seem to change anything.

Also, the queries that appear to get stuck are "heavy" queries,
though after a fresh restart they execute in a reasonable time.

Finally, whatever causes the database to wait also causes the
Windows instance to slow down. But restarting Postgresql fixes
this as well.

Configuration :

The Postgresql server runs on a Windows Virtual Machine under
VMWare. The VM has dedicated resources, and the only other
VM on the host is the applicative server (which runs idle while
waiting for the database). There is nothing else running on the
server except postgresql (well, there were other things, but we
stopped everything to no avail).

PostgreSQL 9.3.5, compiled by Visual C++ build 1600, 64-bit
Windows 2008R2 (64 bits)
10 Go RAM
4 vCPU

Host : VMWare ESXi 5.5.0 build-2068190
CPU Intel XEON X5690 3.97GHz
HDD 3x Nearline SAS 15K RAID0

Please let me know if any other information may be useful.

Jean Cavallo

Re: Server slowing down over time

From
Igor Neyman
Date:

 

 

From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Jean Cavallo
Sent: Thursday, August 27, 2015 1:21 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Server slowing down over time

 

Hi,

 

I am currently working on a data migration for a client.

The general plan is :

  - Read data from a postgresql database

  - Convert them to the new application

  - Insert in another database (same postgresql instance).

 

The source database is rather big (~40GB, wo indexes), and the

conversion process takes some time. It is done by multiple workers

on a separate Linux environnement, piece by piece.

 

When we start the migration, at first it looks good.

Performances are good, and it ran smoothly. After a few hours,

we noticed that things started to slow down. Some queries seemed

to be stuck, so we waited for them to end, and restarted the server.

 

After that it went well for some time (~10 minutes), then it slowed

down again. We tried again (a few times), and the pattern repeats.

 

My postgresql specific problem is that it looks like the server gets

stuck. CPU usage is <10%, RAM usage is under 50% max, there is

no noticeable disk usage. But, there are some (<10) active queries,

some of which may take several hours to complete. Those queries

work properly (i.e < 1min) right after the server restarts.

 

So my question is : What could slow the queries from ~1min to 2hours

which does not involve CPU, Memory, or disk usage, and which would

"reset" when restarting the server ?

 

For information, the number of processes does not seem to be the

problem, there are ~20 connections with max_connection set to 100.

We noticed at some point that the hard drive holding the target

database was heavily fragmented (100%...), but defrag did not

seem to change anything.

 

Also, the queries that appear to get stuck are "heavy" queries,

though after a fresh restart they execute in a reasonable time.

 

Finally, whatever causes the database to wait also causes the

Windows instance to slow down. But restarting Postgresql fixes

this as well.

 

Configuration :

 

The Postgresql server runs on a Windows Virtual Machine under

VMWare. The VM has dedicated resources, and the only other

VM on the host is the applicative server (which runs idle while

waiting for the database). There is nothing else running on the

server except postgresql (well, there were other things, but we

stopped everything to no avail).

 

PostgreSQL 9.3.5, compiled by Visual C++ build 1600, 64-bit

Windows 2008R2 (64 bits)

10 Go RAM

4 vCPU

 

Host : VMWare ESXi 5.5.0 build-2068190

CPU Intel XEON X5690 3.97GHz

HDD 3x Nearline SAS 15K RAID0

 

Please let me know if any other information may be useful.


Jean Cavallo

 

 

Having 4 CPUs, I’d try to decrease number of connections from ~20 to 8, and see if “slowing down” still happens.

 

Regards,

Igor Neyman

 

Re: Server slowing down over time

From
Wei Shan
Date:
Could you check pg_locks table to see if there's any major difference between "healthy" state and "slowing down" state?

On 3 September 2015 at 21:07, Igor Neyman <ineyman@perceptron.com> wrote:

 

 

From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Jean Cavallo
Sent: Thursday, August 27, 2015 1:21 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Server slowing down over time

 

Hi,

 

I am currently working on a data migration for a client.

The general plan is :

  - Read data from a postgresql database

  - Convert them to the new application

  - Insert in another database (same postgresql instance).

 

The source database is rather big (~40GB, wo indexes), and the

conversion process takes some time. It is done by multiple workers

on a separate Linux environnement, piece by piece.

 

When we start the migration, at first it looks good.

Performances are good, and it ran smoothly. After a few hours,

we noticed that things started to slow down. Some queries seemed

to be stuck, so we waited for them to end, and restarted the server.

 

After that it went well for some time (~10 minutes), then it slowed

down again. We tried again (a few times), and the pattern repeats.

 

My postgresql specific problem is that it looks like the server gets

stuck. CPU usage is <10%, RAM usage is under 50% max, there is

no noticeable disk usage. But, there are some (<10) active queries,

some of which may take several hours to complete. Those queries

work properly (i.e < 1min) right after the server restarts.

 

So my question is : What could slow the queries from ~1min to 2hours

which does not involve CPU, Memory, or disk usage, and which would

"reset" when restarting the server ?

 

For information, the number of processes does not seem to be the

problem, there are ~20 connections with max_connection set to 100.

We noticed at some point that the hard drive holding the target

database was heavily fragmented (100%...), but defrag did not

seem to change anything.

 

Also, the queries that appear to get stuck are "heavy" queries,

though after a fresh restart they execute in a reasonable time.

 

Finally, whatever causes the database to wait also causes the

Windows instance to slow down. But restarting Postgresql fixes

this as well.

 

Configuration :

 

The Postgresql server runs on a Windows Virtual Machine under

VMWare. The VM has dedicated resources, and the only other

VM on the host is the applicative server (which runs idle while

waiting for the database). There is nothing else running on the

server except postgresql (well, there were other things, but we

stopped everything to no avail).

 

PostgreSQL 9.3.5, compiled by Visual C++ build 1600, 64-bit

Windows 2008R2 (64 bits)

10 Go RAM

4 vCPU

 

Host : VMWare ESXi 5.5.0 build-2068190

CPU Intel XEON X5690 3.97GHz

HDD 3x Nearline SAS 15K RAID0

 

Please let me know if any other information may be useful.


Jean Cavallo

 

 

Having 4 CPUs, I’d try to decrease number of connections from ~20 to 8, and see if “slowing down” still happens.

 

Regards,

Igor Neyman

 




--
Regards,
Ang Wei Shan

Re: Server slowing down over time

From
Tomas Vondra
Date:
Hi,

On 08/27/2015 07:21 PM, Jean Cavallo wrote:
> Hi,
>
> I am currently working on a data migration for a client.
> The general plan is :
>    - Read data from a postgresql database
>    - Convert them to the new application
>    - Insert in another database (same postgresql instance).
>
> The source database is rather big (~40GB, wo indexes), and the
> conversion process takes some time. It is done by multiple workers
> on a separate Linux environnement, piece by piece.
>
> When we start the migration, at first it looks good.
> Performances are good, and it ran smoothly. After a few hours,
> we noticed that things started to slow down. Some queries seemed
> to be stuck, so we waited for them to end, and restarted the server.
>
> After that it went well for some time (~10 minutes), then it slowed
> down again. We tried again (a few times), and the pattern repeats.

If you're moving a lot of data (especially if the destination database
is empty), one possible problem is statistics. This generally is not a
problem in regular operation, because the data growth is gradual and
autovacuum analyzes the tables regularly, but in batch processes this is
often a big issue.

The usual scenario is that there's an empty (or very small) table, where
indexes are inefficient so PostgreSQL plans the queries with sequential
scans. The table suddenly grows, which would make indexes efficient, but
the planner has no idea about that until autovacuum kicks in. But before
that happens, the batch process executes queries on that table.

Try adding ANALYZE after steps that add a lot of data.

>
> My postgresql specific problem is that it looks like the server gets
> stuck. CPU usage is <10%, RAM usage is under 50% max, there is no
> noticeable disk usage. But, there are some (<10) active queries, some
> of which may take several hours to complete. Those queries work
> properly (i.e < 1min) right after the server restarts.

That's a bit strange. Essentially what you're saying is that the
workload is neither CPU nor I/O bound. To make it CPU bound, at least
one CPU would have to be 100% utilized, and with 4 CPUs that's 25%, but
you're saying there's only 10% used. But you're saying I/O is not the
bottleneck either.

> So my question is : What could slow the queries from ~1min to 2hours
>  which does not involve CPU, Memory, or disk usage, and which would
> "reset" when restarting the server ?

A lot of things, unfortunately, and the fact that this is a migration
moving data between two databases makes it even more complicated. The
virtualization does not make it less complex either.

For example, are you sure it's not stuck on the other database? I assume
you're running some long queries, so maybe it's stuck there and the
destination database is just waiting for data? That's be consistent with
the low CPU and I/O usage you observe.

Locking is another possibility, although it probably is not the only
cause - it'd be utilizing at least one CPU otherwise.

>
> For information, the number of processes does not seem to be the
> problem, there are ~20 connections with max_connection set to 100.
> We noticed at some point that the hard drive holding the target
> database was heavily fragmented (100%...), but defrag did not
> seem to change anything.

If it was a problem, you'd see high I/O usage. And that's not the case.

>
> Also, the queries that appear to get stuck are "heavy" queries,
> though after a fresh restart they execute in a reasonable time.

Does the plan change? If not, check waiting locks in pg_locks.

>
> Finally, whatever causes the database to wait also causes the
> Windows instance to slow down. But restarting Postgresql fixes
> this as well.

That's a bit strange, I guess. If you're not observing light CPU and I/O
usage, then the instance should not be slow, unless there's something
else going on - possibly at the virtualization level (e.g. another busy
instance on the same hardware, some sort of accounting that limits the
resources after a time, etc.)

> Configuration :
>
> The Postgresql server runs on a Windows Virtual Machine under
> VMWare. The VM has dedicated resources, and the only other
> VM on the host is the applicative server (which runs idle while
> waiting for the database). There is nothing else running on the
> server except postgresql (well, there were other things, but we
> stopped everything to no avail).
>
> PostgreSQL 9.3.5, compiled by Visual C++ build 1600, 64-bit

You're 4 versions behind.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Server slowing down over time

From
Jim Nasby
Date:
On 9/4/15 6:59 AM, Tomas Vondra wrote:
>> Finally, whatever causes the database to wait also causes the
>> Windows instance to slow down. But restarting Postgresql fixes
>> this as well.
>
> That's a bit strange, I guess. If you're not observing light CPU and I/O
> usage, then the instance should not be slow, unless there's something
> else going on - possibly at the virtualization level (e.g. another busy
> instance on the same hardware, some sort of accounting that limits the
> resources after a time, etc.)

I've experienced something similar on linux before. Database is slow,
but neither CPU or IO is maxed. IIRC there was nothing disturbing in
pg_locks either. I don't recall the server actually slowing down, but
this would have been on something with at least 24 cores, so...

My suspicion has always been that there's some form of locking that
isn't being caught by the tools.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com