Thread: WAL + SSD = slow inserts?

WAL + SSD = slow inserts?

From
Skarsol
Date:
I'm trying to increase the speed of inserts in a database that is on a not super fast storage system. I have installed a pair of SSDs and placed pg_xlog on them but am still getting inserts that take up to a second to complete, with .3 seconds being about average. Iostat doesn't show the SSDs stressed at all, and changing synchronous_commit doesn't seem to affect it one way or the other. Where would I look next for what could be causing the delay?
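
One way to narrow down where the time goes is to time a single-row insert from psql with synchronous_commit toggled per transaction; the table and column below are made up for the sketch:

    \timing on

    -- hypothetical single-row insert with the default (synchronous) commit
    BEGIN;
    INSERT INTO some_table (payload) VALUES ('test');
    COMMIT;

    -- same insert, but COMMIT no longer waits for the WAL flush
    BEGIN;
    SET LOCAL synchronous_commit = off;
    INSERT INTO some_table (payload) VALUES ('test');
    COMMIT;

If the second COMMIT is dramatically faster, the time is going into the WAL flush; if both are equally slow, the bottleneck is probably elsewhere (triggers, rules, constraint checks, or lock waits).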

Re: WAL + SSD = slow inserts?

From
desmodemone
Date:
Hello,
could you please post the PostgreSQL version, the postgresql.conf, the operating system used, the kernel version, and the filesystem used?

Thank you


2013/12/5 Skarsol <skarsol@gmail.com>
I'm trying to increase the speed of inserts in a database that is on a not super fast storage system. I have installed a pair of SSDs and placed pg_xlog on them but am still getting inserts that take up to a second to complete, with .3 seconds being about average. Iostat doesn't show the SSDs stressed at all, and changing synchronous_commit doesn't seem to affect it one way or the other. Where would I look next for what could be causing the delay?

Re: WAL + SSD = slow inserts?

From
Skarsol
Date:
psql (PostgreSQL) 9.2.5
Red Hat Enterprise Linux Server release 6.4 (Santiago)
Linux 2.6.32-358.6.1.el6.x86_64 #1 SMP Fri Mar 29 16:51:51 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux
All relevant filesystems are ext4

Changes from defaults:
max_connections = 500
shared_buffers = 32000MB
temp_buffers = 24MB
work_mem = 1GB
maintenance_work_mem = 5GB
wal_level = archive
wal_buffers = 16MB
checkpoint_completion_target = 0.9
archive_mode = on
archive_command = 'test ! -f /databases/pg_archive/db/%f && cp %p /databases/pg_archive/db/%f'
effective_cache_size = 64000MB
default_statistics_target = 5000
log_checkpoints = on
stats_temp_directory = '/tmp/pgstat'



On Thu, Dec 5, 2013 at 9:06 AM, desmodemone <desmodemone@gmail.com> wrote:
Hello,
             could you please post the postgresql version, the postgresql.conf, the operative system used, the kernel version and the filesystem used ?

Thank you


2013/12/5 Skarsol <skarsol@gmail.com>
I'm trying to increase the speed of inserts in a database that is on a not super fast storage system. I have installed a pair of SSDs and placed pg_xlog on them but am still getting inserts that take up to a second to complete, with .3 seconds being about average. Iostat doesn't show the SSDs stressed at all, and changing synchronous_commit doesn't seem to affect it one way or the other. Where would I look next for what could be causing the delay?


Re: WAL + SSD = slow inserts?

From
Scott Marlowe
Date:
On Thu, Dec 5, 2013 at 8:16 AM, Skarsol <skarsol@gmail.com> wrote:
> psql (PostgreSQL) 9.2.5
> Red Hat Enterprise Linux Server release 6.4 (Santiago)
> Linux 2.6.32-358.6.1.el6.x86_64 #1 SMP Fri Mar 29 16:51:51 EDT 2013 x86_64
> x86_64 x86_64 GNU/Linux
> All relevant filesystems are ext4
>
> Changes from defaults:
> max_connections = 500
> shared_buffers = 32000MB
> temp_buffers = 24MB
> work_mem = 1GB
> maintenance_work_mem = 5GB
> wal_level = archive
> wal_buffers = 16MB
> checkpoint_completion_target = 0.9
> archive_mode = on
> archive_command = 'test ! -f /databases/pg_archive/db/%f && cp %p
> /databases/pg_archive/db/%f'
> effective_cache_size = 64000MB
> default_statistics_target = 5000
> log_checkpoints = on
> stats_temp_directory = '/tmp/pgstat'

OK, I'd make the following changes:
1: Drop shared_buffers to something like 1000MB.
2: Drop work_mem to 16MB or so. 1GB is pathological, as it can make
the machine run out of memory quite fast.
3: Drop max_connections to 100 or so. If you really need 500 conns,
then work_mem of 1GB is that much worse.

Next, move pg_xlog OFF the SSDs and back onto spinning media and put
your data/base dir on the SSDs.

SSDs aren't much faster, if at all, for pg_xlog, but are much much
faster for data/base files.

Also, try changing the I/O scheduler for the SSDs to noop:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/block/switching-sched.txt?id=HEAD
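
To double-check what the running server actually has for these settings, a plain catalog query works (nothing here is specific to this particular setup):

    SELECT name, setting, unit
      FROM pg_settings
     WHERE name IN ('shared_buffers', 'work_mem', 'max_connections',
                    'synchronous_commit', 'checkpoint_segments', 'wal_buffers');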


Re: WAL + SSD = slow inserts?

From
Skarsol
Date:
On Thu, Dec 5, 2013 at 9:50 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
On Thu, Dec 5, 2013 at 8:16 AM, Skarsol <skarsol@gmail.com> wrote:
> psql (PostgreSQL) 9.2.5
> Red Hat Enterprise Linux Server release 6.4 (Santiago)
> Linux 2.6.32-358.6.1.el6.x86_64 #1 SMP Fri Mar 29 16:51:51 EDT 2013 x86_64
> x86_64 x86_64 GNU/Linux
> All relevant filesystems are ext4
>
> Changes from defaults:
> max_connections = 500
> shared_buffers = 32000MB
> temp_buffers = 24MB
> work_mem = 1GB
> maintenance_work_mem = 5GB
> wal_level = archive
> wal_buffers = 16MB
> checkpoint_completion_target = 0.9
> archive_mode = on
> archive_command = 'test ! -f /databases/pg_archive/db/%f && cp %p
> /databases/pg_archive/db/%f'
> effective_cache_size = 64000MB
> default_statistics_target = 5000
> log_checkpoints = on
> stats_temp_directory = '/tmp/pgstat'

OK I'd make the following changes.
1: Drop shared_buffers to something like 1000MB
2: drop work_mem to 16MB or so. 1GB is pathological, as it can make
the machine run out of memory quite fast.
3: drop max_connections to 100 or so. if you really need 500 conns,
then work_mem of 1G is that much worse.

Next, move pg_xlog OFF the SSDs and back onto spinning media and put
your data/base dir on the SSDs.

SSDs aren't much faster, if at all, for pg_xlog, but are much much
faster for data/base files.

Also changing the io schduler for the SSDs to noop:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/block/switching-sched.txt?id=HEAD

Changing the scheduler to noop seems to have had a decent effect. I've made the other recommended changes, other than the connections, as we do need that many currently. We're looking to implement PgBouncer, which should help with that.

Moving the whole database to SSD isn't an option currently due to size.

The slowest inserts are happening on tables that are partitioned by creation time. As part of the process there is a rule to select currval from a sequence, but there are no other SELECTs or anything in the trigger procedure. Could the sequence be slowing it down? I don't see a way to change the tablespace of one.
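
One quick way to rule the sequence in or out is to time nextval() on its own in psql (the sequence name here is made up):

    \timing on
    SELECT nextval('some_table_id_seq');

If that comes back in well under a millisecond, the sequence itself is unlikely to be the problem.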

Re: WAL + SSD = slow inserts?

From
Jeff Janes
Date:
On Thu, Dec 5, 2013 at 7:01 AM, Skarsol <skarsol@gmail.com> wrote:
I'm trying to increase the speed of inserts in a database that is on a not super fast storage system. I have installed a pair of SSDs and placed pg_xlog on them but am still getting inserts that take up to a second to complete, with .3 seconds being about average. Iostat doesn't show the SSDs stressed at all, and changing synchronous_commit doesn't seem to affect it one way or the other. Where would I look next for what could be causing the delay?

What are you inserting? At 0.3 seconds per insert, I'm guessing this is not just a simple single-row insert statement.
 
Are you IO bound or CPU bound?

Re: WAL + SSD = slow inserts?

From
Scott Marlowe
Date:
On Thu, Dec 5, 2013 at 9:13 AM, Skarsol <skarsol@gmail.com> wrote:
> On Thu, Dec 5, 2013 at 9:50 AM, Scott Marlowe <scott.marlowe@gmail.com>
> wrote:
>>
>> On Thu, Dec 5, 2013 at 8:16 AM, Skarsol <skarsol@gmail.com> wrote:
>> > psql (PostgreSQL) 9.2.5
>> > Red Hat Enterprise Linux Server release 6.4 (Santiago)
>> > Linux 2.6.32-358.6.1.el6.x86_64 #1 SMP Fri Mar 29 16:51:51 EDT 2013
>> > x86_64
>> > x86_64 x86_64 GNU/Linux
>> > All relevant filesystems are ext4
>> >
>> > Changes from defaults:
>> > max_connections = 500
>> > shared_buffers = 32000MB
>> > temp_buffers = 24MB
>> > work_mem = 1GB
>> > maintenance_work_mem = 5GB
>> > wal_level = archive
>> > wal_buffers = 16MB
>> > checkpoint_completion_target = 0.9
>> > archive_mode = on
>> > archive_command = 'test ! -f /databases/pg_archive/db/%f && cp %p
>> > /databases/pg_archive/db/%f'
>> > effective_cache_size = 64000MB
>> > default_statistics_target = 5000
>> > log_checkpoints = on
>> > stats_temp_directory = '/tmp/pgstat'
>>
>> OK I'd make the following changes.
>> 1: Drop shared_buffers to something like 1000MB
>> 2: drop work_mem to 16MB or so. 1GB is pathological, as it can make
>> the machine run out of memory quite fast.
>> 3: drop max_connections to 100 or so. if you really need 500 conns,
>> then work_mem of 1G is that much worse.
>>
>> Next, move pg_xlog OFF the SSDs and back onto spinning media and put
>> your data/base dir on the SSDs.
>>
>> SSDs aren't much faster, if at all, for pg_xlog, but are much much
>> faster for data/base files.
>>
>> Also changing the io schduler for the SSDs to noop:
>>
>>
>> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/block/switching-sched.txt?id=HEAD
>
>
> Changing the scheduler to noop seems to have had a decent effect. I've made
> the other recommended changes other than the connections as we do need that
> many currently. We're looking to implement pg_bouncer which should help with
> that.
>
> Moving the whole database to SSD isn't an option currently due to size.
>
> The slowest inserts are happening on tables that are partitioned by creation
> time. As part of the process there is a rule to select curval from a
> sequence but there are no other selects or anything in  the trigger
> procedure. Could the sequence be slowing it down? I dont see a way to change
> the tablespace of one.

Rules have a lot of overhead. Is there a reason you're not using
defaults or triggers?


Re: WAL + SSD = slow inserts?

From
bricklen
Date:

On Thu, Dec 5, 2013 at 10:08 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
Rules have a lot of overhead. Is there a reason you're not using
defaults or triggers?

Or for even less overhead, load the partitions directly, and preferably use "DEFAULT nextval('some_sequence')" as Scott mentioned.
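
A minimal sketch of that approach, with made-up names: insert straight into the child table instead of the parent, so neither the routing trigger nor the rule fires, and let the column default hand out the id.

    -- assumes a child partition such as some_table_2013_12 already exists,
    -- with id defaulting to nextval('some_table_id_seq') as on the parent
    INSERT INTO some_table_2013_12 (created, payload)
    VALUES (now(), 'example row');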

Re: WAL + SSD = slow inserts?

From
Mark Kirkwood
Date:
On 06/12/13 05:13, Skarsol wrote:
> On Thu, Dec 5, 2013 at 9:50 AM, Scott Marlowe <scott.marlowe@gmail.com>wrote:
>
>> On Thu, Dec 5, 2013 at 8:16 AM, Skarsol <skarsol@gmail.com> wrote:
>>> psql (PostgreSQL) 9.2.5
>>> Red Hat Enterprise Linux Server release 6.4 (Santiago)
>>> Linux 2.6.32-358.6.1.el6.x86_64 #1 SMP Fri Mar 29 16:51:51 EDT 2013
>> x86_64
>>> x86_64 x86_64 GNU/Linux
>>> All relevant filesystems are ext4
>>>
>>> Changes from defaults:
>>> max_connections = 500
>>> shared_buffers = 32000MB
>>> temp_buffers = 24MB
>>> work_mem = 1GB
>>> maintenance_work_mem = 5GB
>>> wal_level = archive
>>> wal_buffers = 16MB
>>> checkpoint_completion_target = 0.9
>>> archive_mode = on
>>> archive_command = 'test ! -f /databases/pg_archive/db/%f && cp %p
>>> /databases/pg_archive/db/%f'
>>> effective_cache_size = 64000MB
>>> default_statistics_target = 5000
>>> log_checkpoints = on
>>> stats_temp_directory = '/tmp/pgstat'
>> OK I'd make the following changes.
>> 1: Drop shared_buffers to something like 1000MB
>> 2: drop work_mem to 16MB or so. 1GB is pathological, as it can make
>> the machine run out of memory quite fast.
>> 3: drop max_connections to 100 or so. if you really need 500 conns,
>> then work_mem of 1G is that much worse.
>>
>> Next, move pg_xlog OFF the SSDs and back onto spinning media and put
>> your data/base dir on the SSDs.
>>
>> SSDs aren't much faster, if at all, for pg_xlog, but are much much
>> faster for data/base files.
>>
>> Also changing the io schduler for the SSDs to noop:
>>
>>
>> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/block/switching-sched.txt?id=HEAD
>>
> Changing the scheduler to noop seems to have had a decent effect. I've made
> the other recommended changes other than the connections as we do need that
> many currently. We're looking to implement pg_bouncer which should help
> with that.
>


What model SSD are you using? Some can work better with deadline than
noop (as their own scheduling firmware may be pretty poor). Also, check
if there are updates for the SSD firmware. I have a couple of Crucial
M4s that changed from being fairly average to very fast indeed after
getting later firmware...

Cheers

Mark


Re: WAL + SSD = slow inserts?

From
Skarsol
Date:
On Thu, Dec 5, 2013 at 1:19 PM, bricklen <bricklen@gmail.com> wrote:

On Thu, Dec 5, 2013 at 10:08 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
Rules have a lot of overhead. Is there a reason you're not using
defaults or triggers?

Or for even less overhead, load the partitions directly, and preferably use "DEFAULT nextval('some_sequence')" as Scott mentioned.


The rule is being used to return the id of the insert; it's not part of the partitioning itself. The id is generated with DEFAULT nextval. I've looked at using RETURNING instead, but that will require a large refactoring of the codebase and seems to have issues when combined with the partitioning. The partitioning is done with a BEFORE INSERT trigger. The trigger procedure doesn't do any SELECTs; it's just based on the contents of the insert itself.
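
Roughly, the shape being described sounds like the following sketch (all names invented, and the real routing logic is surely more involved):

    -- parent table with a sequence-backed id
    CREATE TABLE some_table (
        id      bigserial PRIMARY KEY,
        created timestamptz NOT NULL DEFAULT now(),
        payload text
    );

    -- one child partition per month
    CREATE TABLE some_table_2013_12 (
        CHECK (created >= DATE '2013-12-01' AND created < DATE '2014-01-01')
    ) INHERITS (some_table);

    -- BEFORE INSERT trigger that routes each row to a child based on NEW.created
    CREATE OR REPLACE FUNCTION some_table_route() RETURNS trigger AS $$
    BEGIN
        IF NEW.created >= DATE '2013-12-01' AND NEW.created < DATE '2014-01-01' THEN
            INSERT INTO some_table_2013_12 VALUES (NEW.*);
        ELSE
            RAISE EXCEPTION 'no partition for %', NEW.created;
        END IF;
        RETURN NULL;  -- the row is already in the child, so skip the parent
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER some_table_route_trg
        BEFORE INSERT ON some_table
        FOR EACH ROW EXECUTE PROCEDURE some_table_route();

    -- the rule that hands the freshly generated id back to the client
    CREATE RULE some_table_get_id AS
        ON INSERT TO some_table
        DO ALSO SELECT currval('some_table_id_seq') AS id;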

Re: WAL + SSD = slow inserts?

From
Rosser Schwarz
Date:
On Thu, Dec 5, 2013 at 9:55 PM, Skarsol <skarsol@gmail.com> wrote:
The rule is being used to return the id of the insert...

Take a look at the RETURNING clause of the INSERT statement. That should meet your needs here without having to bother with rules.
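
For example (names made up again):

    -- returns the generated id directly, no rule needed
    INSERT INTO some_table (created, payload)
    VALUES (now(), 'example row')
    RETURNING id;

One caveat, consistent with the partitioning issue mentioned upthread: if the parent's BEFORE INSERT trigger redirects the row to a child and returns NULL, an INSERT ... RETURNING against the parent returns no row, so RETURNING would have to be aimed at the child table (or the redirection handled some other way).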

rls

--
:wq

Re: WAL + SSD = slow inserts?

From
Tomas Vondra
Date:
On 5.12.2013 17:13, Skarsol wrote:
> On Thu, Dec 5, 2013 at 9:50 AM, Scott Marlowe <scott.marlowe@gmail.com
> <mailto:scott.marlowe@gmail.com>> wrote:
>
>     On Thu, Dec 5, 2013 at 8:16 AM, Skarsol <skarsol@gmail.com
>     <mailto:skarsol@gmail.com>> wrote:
>     > psql (PostgreSQL) 9.2.5
>     > Red Hat Enterprise Linux Server release 6.4 (Santiago)
>     > Linux 2.6.32-358.6.1.el6.x86_64 #1 SMP Fri Mar 29 16:51:51 EDT
>     2013 x86_64
>     > x86_64 x86_64 GNU/Linux
>     > All relevant filesystems are ext4
>     >
>     > Changes from defaults:
>     > max_connections = 500
>     > shared_buffers = 32000MB
>     > temp_buffers = 24MB
>     > work_mem = 1GB
>     > maintenance_work_mem = 5GB
>     > wal_level = archive
>     > wal_buffers = 16MB
>     > checkpoint_completion_target = 0.9
>     > archive_mode = on
>     > archive_command = 'test ! -f /databases/pg_archive/db/%f && cp %p
>     > /databases/pg_archive/db/%f'
>     > effective_cache_size = 64000MB
>     > default_statistics_target = 5000
>     > log_checkpoints = on
>     > stats_temp_directory = '/tmp/pgstat'
>
>     OK I'd make the following changes.
>     1: Drop shared_buffers to something like 1000MB
>     2: drop work_mem to 16MB or so. 1GB is pathological, as it can make
>     the machine run out of memory quite fast.
>     3: drop max_connections to 100 or so. if you really need 500 conns,
>     then work_mem of 1G is that much worse.
>
>     Next, move pg_xlog OFF the SSDs and back onto spinning media and put
>     your data/base dir on the SSDs.
>
>     SSDs aren't much faster, if at all, for pg_xlog, but are much much
>     faster for data/base files.
>
>     Also changing the io schduler for the SSDs to noop:
>
>
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/block/switching-sched.txt?id=HEAD
>
>
> Changing the scheduler to noop seems to have had a decent effect. I've
> made the other recommended changes other than the connections as we do
> need that many currently. We're looking to implement pg_bouncer which
> should help with that.

I'm wondering if you left checkpoint_segments at the default. Try
bumping it up to 32 or more; otherwise it might cause frequent
checkpoints. I see you have log_checkpoints=on, so do you see any
checkpoint messages in the logs?
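
The bgwriter statistics view gives a quick picture of checkpoint
frequency as well:

    -- checkpoints_req growing much faster than checkpoints_timed usually
    -- means checkpoint_segments is too small for the write volume
    SELECT checkpoints_timed, checkpoints_req,
           buffers_checkpoint, buffers_backend
      FROM pg_stat_bgwriter;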

Also, how much data are you actually inserting? Are you inserting a
single row, or a large number of them? What is the structure of the table?
Are there any foreign keys in it?

If you do a batch of such inserts (so that it takes a minute or so in
total), what do you see in top/iostat? Is the system CPU or IO bound?
Show us a dozen lines of

   iostat -x -k 1
   vmstat 1

> Moving the whole database to SSD isn't an option currently due to size.

Moving the WAL to SSDs is rather wasteful, in my experience. A RAID
controller with decent write cache (256MB or more) and BBU is both
faster and cheaper.

Also, there are huge differences between various SSD vendors and
models, or even between the same SSD model with different firmware
versions. What SSD model are you using? Have you updated the firmware?

> The slowest inserts are happening on tables that are partitioned by
> creation time. As part of the process there is a rule to select curval
> from a sequence but there are no other selects or anything in  the
> trigger procedure. Could the sequence be slowing it down? I dont see a
> way to change the tablespace of one.

There are many possible causes for this; for example, you may not be
telling us something about the table structure (e.g. FK constraints
might be causing this) or about the hardware.
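
A quick way to see everything attached to the table (name made up):

    -- lists the constraint definitions, including any foreign keys
    SELECT conname, pg_get_constraintdef(c.oid)
      FROM pg_constraint c
     WHERE c.conrelid = 'some_table'::regclass;

    -- or simply, in psql:
    \d+ some_table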

Have you done some tests on the SSD to verify it works properly? I've
seen broken firmware behaving like this (unexpectedly high latencies
at random intervals, etc.).

Tomas