Thread: Can lots of small writes badly hamper reads from other tables?

Can lots of small writes badly hamper reads from other tables?

From
Dave Crooke
Date:
Hi folks

This could be a sheer volume issue, but I thought I would ask the wisdom of this forum as to next investigative steps.

----

We use PostgreSQL 8.4.4, which is bundled with our application as a VMware virtual appliance. The bulk of the app's database activity is recording performance data points, which arrive in fairly large sustained bursts of perhaps 10,000 rows a minute at a medium-sized customer; each data point is a logically separate item and is committed as an individual transaction (JDBC auto-commit mode). Our offshore QA team was assigned to track an intermittent issue with the speed of some large queries on other tables, and they believe, based on correlation, that the two activities may be contending.

The large query is coming off of different tables from the ones being written to ... the raw data goes into a table named by day (partitioning is all within the app, not PG) e.g. PERF_RAW_2012_01_24 and then there are a bunch of rollup statements which run hourly to do the aggregations, e.g.

insert into PERF_HOURLY_2012_01_24
  select key_columns, avg(data), now()
  from perf_raw_2012_01_24
  where time_stamp between (now() - interval '1 hour') and now()
  group by key_columns

The big queries are hitting multiple of the PERF_HOURLY tables and pulling a few dozen rows from each.

We are using a 64-bit VM with 8 virtual cores and 8GB RAM, of which Java takes a bit over half, running Linux XXXXX with CentOS 5.x .... PG has 1GB of buffer cache and reasonable (AFAICT) resource limits for everything else, which are intended to be workable for a range of client sizes out of the box. True transactional consistency is disabled for performance reasons; virtual environments do not take kindly to lots of small writes.

---

Is there any tweaking we should do on the PG settings, or on the pattern in which the app is writing? We currently use 10 writer threads on the Java side and they keep PG going pretty well.

I considered bundling the writes into larger transactions - will that really help much with commit consistency off?

Is there some specific "usual suspect" stuff I should look at on the PG side to look for efficiency issues such as index lock contention or a poor buffer cache hit ratio? Will doing EXPLAIN ANALYSE on the big query be informative, and if so, does it need to be done while the write load is applied?

The other whacky idea I had was to have the writer threads pause or throttle themselves when a big query is happening (it's all in one JVM and we are using a connection pooler, so it's easy to intercept and track if needed) however that strikes me as a rather ugly hack and I'd prefer to do something more robust and based on config tweaks that leverage existing resource management in PG.
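
For concreteness, here is a rough sketch of what I mean by intercepting on the Java side - class and method names are made up, and I'm not convinced it's a good idea, it's just the shape of the hack: the writer threads and the report code share a read/write lock, inserts take the read side so they can run concurrently with each other, and the big query takes the write side so new inserts wait until it finishes.

import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical gate shared by the 10 writer threads and the big-query path.
public class BigQueryGate {
    private final ReentrantReadWriteLock gate = new ReentrantReadWriteLock(true);

    // Each writer thread wraps one insert (or one small batch) in this.
    public void runInsert(Runnable insert) {
        gate.readLock().lock();
        try {
            insert.run();
        } finally {
            gate.readLock().unlock();
        }
    }

    // The reporting code wraps the big query in this; it waits for in-flight
    // inserts to drain and holds off new ones while the query runs.
    public void runBigQuery(Runnable query) {
        gate.writeLock().lock();
        try {
            query.run();
        } finally {
            gate.writeLock().unlock();
        }
    }
}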

Relevant schema and config attached, all comments and advice welcome, including general tuning tips and rationale for moving to PG 9.x .... I'm well aware this isn't the acme of PG tuning :)

Cheers
Dave



Attachment

Re: Can lots of small writes badly hamper reads from other tables?

From
Andy Colson
Date:
On 1/24/2012 2:16 PM, Dave Crooke wrote:
> Hi folks
>
> This could be a sheer volume issue, but I thought I would ask the wisdom
> of this forum as to next investigative steps.
>
> ----
>
> We use PostgreSQL 8.4.4 which is bundled with our application as a
> VMware virtual appliance. The bulk of the app's database activity is
> recording performance data points which arrive in fairly large sustained
> bursts of perhaps 10,000 rows a minute at a medium sized customer, each
> of which are logically separate items and being committed as individual
> transactions (JDBC auto-commit mode). Our offshore QA team was assigned
> to track an intermittent issue with speed of some large queries on other
> tables, and they believe based on correlation the two activities may be
> contending.

You have 10 connections, all doing:

begin
insert into PERF_RAW_2012_01_24....  -- one record
commit


If that's what you're doing, yes, I'd say that's the slowest way possible.

Doing this would be faster:

begin
insert into PERF_RAW_2012_01_24....  -- one record
insert into PERF_RAW_2012_01_24....  -- one record
...
insert into PERF_RAW_2012_01_24....  -- one record
commit

Doing this would be even faster:


begin
-- one insert, multiple rows
insert into PERF_RAW_2012_01_24 values (...), (...), (...), ..., (...);
insert into PERF_RAW_2012_01_24 values (...), (...), (...), ..., (...);
commit

And, fastest of all, use COPY.  But be careful, it's so fast
it'll melt your face off :-)
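
Since you're coming from Java: I believe the JDBC driver exposes COPY through its CopyManager API (and if your connections come out of a pooler you may need to unwrap them to get at the PG-specific interface).  Rough sketch, with made-up table columns, connection details and CSV payload:

import java.io.StringReader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class CopySketch {
    // Push a chunk of rows through COPY instead of row-at-a-time INSERTs.
    static void copyChunk(Connection conn, String csvRows) throws Exception {
        CopyManager copy = ((PGConnection) conn).getCopyAPI();
        copy.copyIn(
            "copy perf_raw_2012_01_24 (key_col, data, time_stamp) from stdin with csv",
            new StringReader(csvRows));
    }

    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://dbhost/perfdb", "appuser", "secret");
        copyChunk(conn, "42,3.14,2012-01-24 14:00:00\n43,2.71,2012-01-24 14:00:01\n");
    }
}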


I didn't even bother trying to pick out the uncommented settings from
your .conf file.  Way too much work.

VMs usually have pretty slow IO, so you might want to watch vmstat and
iostat to see if you are IO bound or CPU bound.

Also watching iostat before and after the change might be interesting.

If you keep having lots and lots of transactions, look into
commit_delay; it'll help batch commits out to disk (if I remember
correctly).

-Andy

Re: Can lots of small writes badly hamper reads from other tables?

From
"Tomas Vondra"
Date:
On 24 January 2012, 21:16, Dave Crooke wrote:
> Hi folks
>
> This could be a sheer volume issue, but I thought I would ask the wisdom of
> this forum as to next investigative steps.
>
> ----
>
> We use PostgreSQL 8.4.4 which is bundled with our application as a VMware
> virtual appliance. The bulk of the app's database activity is recording
> performance data points which arrive in fairly large sustained bursts of
> perhaps 10,000 rows a minute at a medium sized customer, each of which are
> logically separate items and being committed as individual transactions
> (JDBC auto-commit mode). Our offshore QA team was assigned to track an
> intermittent issue with speed of some large queries on other tables, and
> they believe based on correlation the two activities may be contending.
>
> The large query is coming off of different tables from the ones being
> written to ... the raw data goes into a table named by day (partitioning
> is
> all within the app, not PG) e.g. PERF_RAW_2012_01_24 and then there are a
> bunch of rollup statements which run hourly to do the aggregations, e.g.

Each storage device has some basic I/O limits - sequential speed (read/write)
and the maximum number of I/O operations it can handle. For example, a 7.2k
drive can do up to 160MB/s sequential reads/writes, but not more than 120
I/O ops per second. Similarly for other devices - 15k drives can do up to
250 I/Os per second. SSDs can handle many more I/Os, e.g. an Intel 320 can
handle about 8k I/Os per second.

I have no idea what kind of storage device you're using or how many
sequential and random operations it can handle. But my guess is you're hitting
the limit of random I/Os - each commit requires an fsync, and you're doing
10,000 of them per minute, i.e. about 160 per second. If the queries need
to read data from the drive (e.g. randomly), this just adds more I/Os.

> Is there any tweaking we should do on the PG settings, or on the pattern
> in
> which the app is writing - we currently use 10 writer threads on the Java
> side and they keep PG going pretty good.

The first thing you should do is group the inserts into one transaction.
That'll lower the number of I/Os the database needs to do. Besides that,
you can move the WAL to a separate (physical) device, thus spreading the
I/Os across more drives.

> I considered bundling the writes into larger transactions, will that
> really
> help much with commit consistency off?

What do you mean by "commit consistency off"?

> Is there some specific "usual suspect" stuff I should look at on the PG
> side to look for efficiency issues such as index lock contention or a poor
> buffer cache hit ratio? Will doing EXPLAIN ANALYSE on the big query be
> informative, and if so, does it need to be done while the write load is
> applied?

The first thing you should do is gather some basic I/O stats.

Run  pg_test_fsync (a contrib module) to see how many fsync operations the
I/O subsystem can handle (if it reports more than 500, use "-o" to get it
running for a longer time).

Then gather "vmstat 1" and "iostat -x 1" for a few seconds when the workload
(inserts and queries) are actually running. That should tell you how the
drives are actually utilized.

Post these results to this list.

> Relevant schema and config attached, all comments and advice welcome,
> including general tuning tips and rationale for moving to PG 9.x .... I'm
> well aware this isn't the acme of PG tuning :)

There's a nice page about tuning at the wiki:

  http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

I'd recommend significantly increasing the number of checkpoint segments,
e.g. to 64 (1GB) and setting completion target to 0.9. This usually helps
write-heavy workloads. And enable log_checkpoints.
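
I.e. something along these lines in postgresql.conf (treat the numbers as a
starting point, not a recommendation tuned for your particular box):

checkpoint_segments = 64              # default is 3; 64 is roughly 1GB of WAL
checkpoint_completion_target = 0.9    # spread checkpoint writes over more time
log_checkpoints = on                  # log checkpoint frequency and duration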

Tomas


Re: Can lots of small writes badly hamper reads from other tables?

From
Dave Crooke
Date:
Hi guys

Thanks for the quick followups folks .... please note I am deliberately running a setup without commit guarantees, so a lot of the conventional advice about not doing small writes isn't applicable, but I do want to understand more about how this affects PG internals even if the I/O is smoothed out.

By "commit consistency off" I am referring to the setting "synchronous_commit = off" in postgresql.conf .... IIRC this should mean semantically that a DB crash may lose data that was reported back to the app as successfully committed, but will result in a consistent state on reboot and recovery. In this case IIUC the "120 commits per second per drive" limit does not apply, and I hope the advice about testing fsync is similarly not applicable to my case. Also, IIUC that settings like commit_siblings and commit_delay should be ignored by PG in my case.

I would be interested in learning what the **in-memory** constraints and costs are on the PG server side of doing a lot of small commits when sync writes are off, e.g. the implications for the locking system, and whether this can affect the long queries on the other tables apart from general resource contention.

The pattern of lots of tiny transactions is semantically correct for the app, and I am using a JDBC prepared statement on the Java side, which I believe the PG driver will turn into a pre-compiled (server-side prepared) statement after enough uses (it does NOT do so on the first few hits). This should in theory be even cheaper than a multi-row INSERT ... VALUES, which is all text and has to be parsed.
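
(The "first few hits" behaviour looks like the driver's prepareThreshold, which IIRC defaults to 5 executions before it switches to a server-side prepared statement - if I have that right, it can be lowered either on the connection URL or per statement. Sketch below, connection details made up:)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class PrepareThresholdSketch {
    public static void main(String[] args) throws Exception {
        // Lower the threshold for the whole connection via the URL ...
        Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://dbhost/perfdb?prepareThreshold=1", "appuser", "secret");

        // ... or per statement, via the driver-specific interface.
        PreparedStatement ps = conn.prepareStatement(
            "insert into perf_raw_2012_01_24 (key_col, data, time_stamp) values (?, ?, ?)");
        ((org.postgresql.PGStatement) ps).setPrepareThreshold(1);
    }
}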

However, if necessary for performance I can bundle the inserts into slightly larger transactions. The only writes that fail are occasional duplicates (same primary key) coming from upstream, and those are pretty rare; in practice, losing a batch of say 100 of these records occasionally is not a big deal in my world (ignoring the sound of cringing DBAs :) so I could afford to bundle the inserts into transactions and just drop a whole bundle if any single write has a primary key collision.
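
(Something like this is what I have in mind - column and class names are made up; a duplicate key anywhere in the bundle just rolls the transaction back and discards the whole bundle:)

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.List;

public class BundledWriterSketch {

    // A single data point; field names are made up.
    static class Point {
        final long key;
        final double value;
        final long timeMillis;
        Point(long key, double value, long timeMillis) {
            this.key = key;
            this.value = value;
            this.timeMillis = timeMillis;
        }
    }

    // One bundle of ~100 points per transaction; on a duplicate primary key
    // (SQLState 23505) roll back and silently drop the whole bundle.
    static void writeBundle(Connection conn, List<Point> bundle) throws SQLException {
        conn.setAutoCommit(false);
        PreparedStatement ps = conn.prepareStatement(
            "insert into perf_raw_2012_01_24 (key_col, data, time_stamp) values (?, ?, ?)");
        try {
            for (Point p : bundle) {
                ps.setLong(1, p.key);
                ps.setDouble(2, p.value);
                ps.setTimestamp(3, new Timestamp(p.timeMillis));
                ps.addBatch();
            }
            ps.executeBatch();
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            SQLException cause = (e.getNextException() != null) ? e.getNextException() : e;
            if (!"23505".equals(cause.getSQLState())) {
                throw e;   // anything other than a duplicate key is a real problem
            }
            // duplicate key somewhere in the bundle: drop it and move on
        } finally {
            ps.close();
        }
    }
}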

Storage setup varies by customer, but a typical setup is to take RAID groups of about 5-10TB each net from something like an EMC Clariion and slice each group into 1TB LUNs which become VMWare datastores, which are written simultaneously from multiple hosts. A mid-size Clariion would host perhaps 50-100 of these small LUNs, and a customer running a high performance environment might have Fibrechannel disks and RAID-10, but SATA and RAID-5/6 would also be normal, albeit with a substantial write-back cache (maybe 1GB, IIRC a current Clariion SP has 4GB total). Each file on the datastore corresponds to a virtual disk on a VM, and the datastore is formatted with VMFS (concurrent writer filesystem, uses SCSI locking to control access to block allocation and directory entries).

The other type of VMWare datastore works at the filesystem layer - instead of a shared SAN with iSCSI / FC-AL, the VMware hosts are all pointed at a shared NFS server directory. NetApp is the popular back end for this configuration.

On top of this virtualization, I have PG laid out on two virtual disks - WAL and log files are on the main system partition, index and table data on a second partition. Both formatted with ext3fs.

One of my larger customers had his SAN guy complain to him that our app was writing more data to the NetApp it was on than every other app combined, so I am mindful of the volume being more than some of these systems were planned for :)

Cheers
Dave



Re: Can lots of small writes badly hamper reads from other tables?

From
Tomas Vondra
Date:
On 24.1.2012 22:36, Dave Crooke wrote:
> Hi guys
>
> Thanks for the quick followups folks .... please note I am deliberately
> running a setup without commit guarantees, so a lot of the conventional
> advice about not doing small writes isn't applicable, but I do want to
> understand more about how this affects PG internals even if the I/O is
> smoothed out.
>
> By "commit consistency off" I am referring to the setting
> "synchronous_commit = off" in postgresql.conf .... IIRC this should mean
> semantically that a DB crash may lose data that was reported back to the
> app as successfully committed, but will result in a consistent state on
> reboot and recovery. In this case IIUC the "120 commits per second per
> drive" limit does not apply, and I hope the advice about testing fsync
> is similarly not applicable to my case. Also, IIUC that settings like
> commit_siblings and commit_delay should be ignored by PG in my case.
Oh, I hadn't noticed the synchronous_commit=off bit. You're right about
the consistency guarantees (possibility of lost transactions, but no
corruption).

IIRC the async commit issues fsync for each commit, but does not wait
for it to finish. The question is whether this improves the way the I/O
is used or not. That's difficult to answer without more detailed info
(vmstat/iostat).

In some cases this may actually hammer the system even worse and kill
the performance, because you're removing the "wait time", so the INSERT
processes end up submitting more fsync operations than the I/O subsystem
can handle.

There are cases when this may actually improve the I/O utilization (e.g.
when there's a lot of drives in RAID).

You need to watch the drive and CPU stats to identify the causes. Is it
CPU bound (100% cpu utilization)? Is it I/O bound (drives 100% utilized)?

Moreover, it's not just about the fsync operations. If there are
constraints that need to be checked (e.g. foreign keys, unique
constraints, etc.), that may cause additional I/O operations.

Maybe you could get better results with commit_delay/commit_siblings.
That effectively groups commits into a single fsync operation. (Which
synchronous_commit=off does not do IIRC).
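
E.g., in postgresql.conf - the values below are just to illustrate the
mechanism and need testing against your actual I/O:

commit_delay = 10000      # microseconds to wait so other commits can share the flush
commit_siblings = 5       # only wait if at least this many other transactions are active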

I've seen really good results with large numbers of concurrent clients.
How many of those "insert" processes are there?

> I would be interested in learning what the **in-memory** constraints and
> costs are on the PG server side of doing a lot of small commits when
> sync writes are _off_, e.g. the implications for the locking system, and
> whether this can affect the long queries on the other tables apart from
> general resource contention.
I really doubt this is the case. If you're interested in watching these
issues, set up a pgbench database with small scaling factor (so that the
DB fits into memory) and maybe set fsync=off. Then you'll be able to
observe the locking issues etc.

But this is all just a hypothesis, and my suggestion is that you really
verify it before trying to fix it - i.e. whether the bottleneck really is
inside PostgreSQL (locking or whatever).

Eliminate all the other usual bottlenecks first - I/O and CPU. Show us
some stats, e.g. vmstat, iostat etc.

> The pattern of lots of tiny transactions is semantically correct for the
> app, and I am using a JDBC prepared statement on the Java side, which I
> believe the PG driver will turn in to a pre-compiled statement with
> enough uses (it does NOT do so on the first few hits). This should in
> theory be even cheaper than a multiple INSERT VALUES which is all text
> and has to be parsed.
>
> However, if necessary for performance I can bundle the inserts into
> slightly larger transactions - cases where writes fail are due only to
> occasional duplicates (same primary key) coming from upstream and are
> pretty rare, and in practice losing a batch of say 100 of these records
> occasionally is not a big deal in my world (ignoring sound of cringing
> DBAs  so I could afford to bundle into transactions and then just drop
> a whole bundle if any single write has a primary key collision.
If it's semantically correct, let's try to keep it that way.

> Storage setup varies by customer, but a typical setup is to take RAID
> groups of about 5-10TB each net from something like an EMC Clariion and
> slice each group into 1TB LUNs which become VMWare datastores, which are
> written simultaneously from multiple hosts. A mid-size Clariion would
> host perhaps 50-100 of these small LUNs, and a customer running a high
> performance environment might have Fibrechannel disks and RAID-10, but
> SATA and RAID-5/6 would also be normal, albeit with a substantial
> write-back cache (maybe 1GB, IIRC a current Clariion SP has 4GB total).
> Each file on the datastore corresponds to a virtual disk on a VM, and
> the datastore is formatted with VMFS (concurrent writer filesystem, uses
> SCSI locking to control access to block allocation and directory entries).
>
> The other type of VMWare datastore works at the filesystem layer -
> instead of a shared SAN with iSCSI / FC-AL, the VMware hosts are all
> pointed at a shared NFS server directory. NetApp is the popular back end
> for this configuration.
Hmmmm, tuning such workloads is usually tightly bound to the I/O layout.
What works great on one setup is going to fail miserably on another one.

Especially RAID-5/6 are well known to suck at write-intensive workloads.
The usual tuning advice in this case is "OMG, get rid of RAID-5/6!"

You really need to gather some data from each setup, see where's the
bottleneck and fix it. It might be in a different place for each setup.

You need to see "inside" the storage, not just the first level. One
fairly frequent mistake (and I've made it repeatedly) is the belief
that when iostat tells you a device is 100% utilized it can't handle
more I/Os.

With a RAID array that's not true - what matters is how the individual
devices are used, not the virtual device on top of them. Consider for
example a RAID-1 with two drives. The array may report 100% utilization
but the devices are in fact only 50% utilized, because half of the requests
are handed to the first device and the other half to the second one.

> On top of this virtualization, I have PG laid out on two virtual disks -
> WAL and log files are on the main system partition, index and table data
> on a second partition. Both formatted with ext3fs.
One suggestion - try increasing effective_io_concurrency. There are
some recommendations here:

  http://www.postgresql.org/docs/8.4/static/runtime-config-resource.html

Use the number of drives as a starting point or experiment a bit.

And increase the checkpoint parameters as I recommended before. You may
even increase the checkpoint timeout - that may significantly lower the
amount of data that's written during checkpoints.

> One of my larger customers had his SAN guy complain to him that our app
> was writing more data to the NetApp it was on than every other app
> combined, so I am mindful of the volume being more than some of these
> systems were planned for
I'm not familiar with NetApp - AFAIK they use RAID-DP, which is a somewhat
improved version of RAID 4 that should perform better than RAID 6. But
in my experience such claims usually miss the "for most workloads" part.

cheers
Tomas


Re: Can lots of small writes badly hamper reads from other tables?

From
Jeff Janes
Date:
On Tue, Jan 24, 2012 at 12:16 PM, Dave Crooke <dcrooke@gmail.com> wrote:
> Hi folks
>
> This could be a sheer volume issue, but I thought I would ask the wisdom of
> this forum as to next investigative steps.

To answer the question in your subject, yes.  If the disk head is
positioned to write in one place, it can't be reading from some other
place.  The various levels of caches and re-ordering and other tricks
can improve the situation, but they have a finite capacity to do so.

> We use PostgreSQL 8.4.4 which is bundled with our application as a VMware
> virtual appliance. The bulk of the app's database activity is recording
> performance data points which arrive in fairly large sustained bursts of
> perhaps 10,000 rows a minute at a medium sized customer, each of which are
> logically separate items and being committed as individual transactions
> (JDBC auto-commit mode). Our offshore QA team was assigned to track an
> intermittent issue with speed of some large queries on other tables, and
> they believe based on correlation the two activities may be contending.
>
> The large query is coming off of different tables from the ones being
> written to ... the raw data goes into a table named by day (partitioning is
> all within the app, not PG) e.g. PERF_RAW_2012_01_24 and then there are a
> bunch of rollup statements which run hourly to do the aggregations, e.g.

In your attached schema there are two perf_raw tables, and they have
different sets of indexes on them.
Which set is in operation during the inserts?


> insert into PERF_HOURLY_2012_01_24 select key_columns, avg(data), now() from
> perf_raw_2012_01_24 where time_stamp between (now() - interval '1 hour') and
> now() group by key_columns
>
> The big queries are hitting multiple of the PERF_HOURLY tables and pulling a
> few dozen rows from each.

How big are those big queries, really?  A few dozen tables times
a few dozen rows?

...
>
> Is there any tweaking we should do on the PG settings, or on the pattern in
> which the app is writing - we currently use 10 writer threads on the Java
> side and they keep PG going pretty good.

Do you need 10 writer threads?  What happens if you use fewer?

>
> I considered bundling the writes into larger transactions, will that really
> help much with commit consistency off?

With synchronous_commit=off, I wouldn't expect the transaction
structure to make much difference.  Especially not if the target of
the mass inserts is indexed.

> Is there some specific "usual suspect" stuff I should look at on the PG side
> to look for efficiency issues such as index lock contention or a poor buffer
> cache hit ratio? Will doing EXPLAIN ANALYSE on the big query be informative,
> and if so, does it need to be done while the write load is applied?

EXPLAIN would probably help, EXPLAIN ANALYSE while the problem is in
action would help more.

Even better would be to see where the queries are blocking during the
problem, but there is no easy way to get that in postgres.  I'd strace
-ttt -T the query process (although the mere act of stracing it can
slow it down enough to relieve the bottleneck you are trying to
identify).

>
> The other whacky idea I had was to have the writer threads pause or throttle
> themselves when a big query is happening (it's all in one JVM and we are
> using a connection pooler, so it's easy to intercept and track if needed)
> however that strikes me as a rather ugly hack and I'd prefer to do something
> more robust and based on config tweaks that leverage existing resource
> management in PG.

Why not just always throttle them?  If you slam the data in as fast as
possible during brief bursts, you are probably just setting yourself
up for this type of issue.  (The brief bursts can be useful if they
make better use of cache, but then you have to accept that other
things will be disrupted during those bursts.)
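
Something as simple as pacing each writer thread to a fixed rows-per-second
budget would do it.  A sketch (class name and budget are made up):

public class PacedWriter {
    // Cap one writer thread at roughly maxRowsPerSec by sleeping off any
    // time it gets ahead of its budget.
    private final double maxRowsPerSec;
    private final long startNanos = System.nanoTime();
    private long rowsWritten = 0;

    public PacedWriter(double maxRowsPerSec) {   // e.g. new PacedWriter(25.0)
        this.maxRowsPerSec = maxRowsPerSec;
    }

    // Call after each insert (or after each small batch, scaled accordingly).
    public void afterRow() throws InterruptedException {
        rowsWritten++;
        double elapsedSec = (System.nanoTime() - startNanos) / 1e9;
        double aheadSec = rowsWritten / maxRowsPerSec - elapsedSec;
        if (aheadSec > 0) {
            Thread.sleep((long) (aheadSec * 1000.0));
        }
    }
}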

Cheers,

Jeff