Thread: Postgres on SSD

Postgres on SSD

From
Ondrej Ivanič
Date:
Hi,

I'm about to evaluate this SSD card: FusionIO ioDrive Duo [1]. The
main reason for this experiment is to see if SSD can significantly
improve query performance. So, I have the following questions:

- Could you please share your experience with SSDs? Any issues?
- What needs to be changed at the Postgres/operating system level? The
obvious one is to change random_page_cost (now: 2) and seq_page_cost
(now: 4). What else should I look at?

Background:
The database schema is pretty simple:
Database size is around 1.4TB. The main tables occupy around 1/3
(450GB, ~220 million rows) and aggregated data occupies 2/3 (900GB).
All indexes are on a separate tablespace (~550GB).
1) main table(s)
This big fact table has ~200 columns and the average row size is 1.5kB.
The table is never updated; new data is inserted constantly using COPY
in chunks of about 10k rows. The table is quite sparse, so it is broken
into 5 tables which are joined as necessary (a regular query reads
10-40% of the row size).

Tables are partitioned by month, but I'm thinking of using one- or
two-week partitions.

The primary key is a composite key (datetime, organisation, transaction_id).
The transaction_id column is unique, but the "datetime" column is used
for partitioning and all queries contain organisation="...."
(multi-tenant database). In addition, there are ~15 single-column
indexes. Old data is deleted after 6 months (drop partition).
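
Roughly, one partition looks like this (a sketch with made-up names;
the real fact table has ~200 columns and the index list is longer):

  -- Minimal parent table (made-up columns):
  CREATE TABLE fact (
      datetime       timestamptz NOT NULL,
      organisation   integer     NOT NULL,
      transaction_id bigint      NOT NULL,
      attribute1     text
  );

  -- One monthly partition, 8.4-style inheritance + CHECK constraint so
  -- constraint_exclusion can prune partitions on the datetime predicate:
  CREATE TABLE fact_2011_08 (
      CHECK (datetime >= '2011-08-01' AND datetime < '2011-09-01')
  ) INHERITS (fact);

  ALTER TABLE fact_2011_08
      ADD PRIMARY KEY (datetime, organisation, transaction_id);

  -- The ~15 single-column indexes are created on each partition, e.g.:
  CREATE INDEX fact_2011_08_attribute1_idx ON fact_2011_08 (attribute1);

  -- New data goes straight into the current partition in ~10k row chunks:
  --   COPY fact_2011_08 FROM '/path/to/chunk.csv' WITH CSV;
  -- and old data is removed by dropping the whole partition:
  --   DROP TABLE fact_2011_02;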

The usual query looks like this:
- select ... from T where organisation = ... and datetime between ...
and ... and <extra filters> order by <single col> limit 1000
Users can choose any column for sorting, but we created indexes for the
most popular/reasonable ones (those ~15 single-column indexes).

In reality, the query is more complex because of a few Postgres issues:
- the partitions/limit/order issue described on Stack Overflow and
fixed in 9.1 or 9.2 [2], [3] (a sketch of the usual workaround follows
below this list)
- partitions/join issues, i.e. a left join's "on" clause must contain
the datetime condition in order to avoid a full table scan on the
joined table
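
The usual workaround for the first issue (a sketch only, using the
made-up names from the partition example above; not necessarily exactly
what we run) is to push the order by/limit into each partition and then
re-sort the much smaller union:

  SELECT *
  FROM (
      (SELECT * FROM fact_2011_07
        WHERE organisation = 42
          AND datetime BETWEEN '2011-07-15' AND '2011-08-10'
        ORDER BY attribute1 LIMIT 1000)
      UNION ALL
      (SELECT * FROM fact_2011_08
        WHERE organisation = 42
          AND datetime BETWEEN '2011-07-15' AND '2011-08-10'
        ORDER BY attribute1 LIMIT 1000)
  ) AS u
  ORDER BY attribute1
  LIMIT 1000;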

Query response time for indexed columns is between 5 and 30 sec
(sometimes 200 sec). The target is to have all queries under 5 sec. If
a query has an order by on a non-indexed column then the response time
is in the hundreds of seconds, but the desired response time is 10 sec
(the test query is over a 1 month range and an organisation has between
0.5 and 2 million rows per month; a single partition has > 30 million
rows).

2) materialised aggregate tables
About 45 tables like this: agg_attribute1_attribute2(date,
organisation, attribute1, attribute2, count) (= select datetime::date,
organisation, attribute1, attribute2, count(*) from T where
organisation = ... and datetime between ... and ... group by 1,2,3,4).
Tables are updated by a cron job every 15 minutes. Thanks to 8.3 HOT
updates - almost no bloat! Monthly partitions are used (date column).
Query response time is between 5 and 30 sec (sometimes 200 sec) and the
target is to have all queries under 5 sec.
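
The refresh is basically an update-then-insert over the newest chunk of
data, something along these lines (a rough sketch with the made-up
names again; a real job would also have to track which rows were
already aggregated so nothing is double counted, and that bookkeeping
is left out here):

  BEGIN;

  -- Existing keys are updated in place (HOT-friendly: count is not
  -- indexed), which is why there is almost no bloat.
  UPDATE agg_attribute1_attribute2 a
     SET count = a.count + d.cnt
    FROM (SELECT datetime::date AS date, organisation, attribute1,
                 attribute2, count(*) AS cnt
            FROM fact
           WHERE datetime >= now() - interval '15 minutes'
           GROUP BY 1, 2, 3, 4) d
   WHERE a.date = d.date
     AND a.organisation = d.organisation
     AND a.attribute1 = d.attribute1
     AND a.attribute2 = d.attribute2;

  -- Keys not seen before are inserted.
  INSERT INTO agg_attribute1_attribute2
              (date, organisation, attribute1, attribute2, count)
  SELECT datetime::date, organisation, attribute1, attribute2, count(*)
    FROM fact f
   WHERE datetime >= now() - interval '15 minutes'
     AND NOT EXISTS (SELECT 1 FROM agg_attribute1_attribute2 a
                      WHERE a.date = f.datetime::date
                        AND a.organisation = f.organisation
                        AND a.attribute1 = f.attribute1
                        AND a.attribute2 = f.attribute2)
   GROUP BY 1, 2, 3, 4;

  COMMIT;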

The usual query is:
select attribute1, count(*) from agg_attribute1_... where organisation
= ... and datetime between ... and ... group by 1 limit 10
or
select attribute1, attribute2, count(*) from agg_attribute1_attribute2
where organisation = ... and datetime between ... and ... group by 1, 2
limit 10

Top N queries perform even worse -- the query response time is in
minutes and the target is around 15 sec
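
A typical Top N query over those tables looks something like this
(sketch, made-up names):

  -- top 10 attribute1 values by total count for one organisation and range
  SELECT attribute1, sum(count) AS total
    FROM agg_attribute1_attribute2
   WHERE organisation = 42
     AND date BETWEEN '2011-07-01' AND '2011-07-31'
   GROUP BY 1
   ORDER BY total DESC
   LIMIT 10;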

Current hardware setup:
XEN host with 16 CPU (Intel(R) Xeon(R) CPU L5520  @ 2.27GHz). CentOS 5.6
80GB RAM
Storage: some Hitachi Fibre channel SAN with two LUNs:
1st LUN has *everything* under $PG_DATA (used 850 GB)
2nd LUN has *all* indexes (index table space) (used 550GB)

Postgres settings:
             name             |                                        current_setting
------------------------------+----------------------------------------------------------------------------------------------------------------
 version                      | PostgreSQL 8.4.5 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-48), 64-bit
 archive_command              | walarchive.sh %p %f
 archive_mode                 | on
 autovacuum                   | on
 autovacuum_max_workers       | 6
 autovacuum_naptime           | 5min
 autovacuum_vacuum_cost_delay | -1
 checkpoint_completion_target | 0.9
 checkpoint_segments          | 48
 constraint_exclusion         | on
 default_statistics_target    | 100
 effective_cache_size         | 20GB
 fsync                        | on
 lc_collate                   | en_US.UTF-8
 lc_ctype                     | en_US.UTF-8
 log_autovacuum_min_duration  | 0
 log_destination              | csvlog
 log_min_duration_statement   | 10s
 log_rotation_age             | 1d
 log_rotation_size            | 0
 log_truncate_on_rotation     | on
 logging_collector            | on
 maintenance_work_mem         | 256MB
 max_connections              | 100
 max_stack_depth              | 2MB
 random_page_cost             | 2
 server_encoding              | UTF8
 shared_buffers               | 9GB
 TimeZone                     | UTC
 vacuum_cost_delay            | 0
 wal_buffers                  | 50MB
 wal_sync_method              | fdatasync
 wal_writer_delay             | 1s
 work_mem                     | 256MB



[1] http://www.fusionio.com/products/iodrive-duo/
[2] http://stackoverflow.com/questions/6268633/postgres-partitioning-order-by-performance
[3] http://stackoverflow.com/questions/2236776/efficient-querying-of-multi-partition-postgres-table

Thanks,
--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)

Re: Postgres on SSD

From
"Tomas Vondra"
Date:
On 10 August 2011, 1:17, Ondrej Ivanič wrote:
> - What needs to be changed at Postgres/Operating system level? The
> obvious one is to change random_page_cost (now: 2) and seq_page_cost
> (now: 4). What else should I look at?

Are you sure about this? I'm not quite sure setting seq_page_cost=4 and
random_page_cost=2 makes sense. Usually seq_page_cost is lower than
random_page_cost, so I wonder why you have set it like that.

Tomas


Re: Postgres on SSD

From
Ondrej Ivanič
Date:
Hi,

2011/8/10 Tomas Vondra <tv@fuzzy.cz>:
> On 10 August 2011, 1:17, Ondrej Ivanič wrote:
>> - What needs to be changed at Postgres/Operating system level? The
>> obvious one is to change random_page_cost (now: 2) and seq_page_cost
>> (now: 4). What else should I look at?
>
> Are you sure about this? I'm not quite sure setting seq_page_cost=4 and
> random_page_cost=2 makes sense. Usually seq_page_cost is lower than
> random_page_cost, so I wonder why you have set it like that.

Oops! Well spotted, Tomas! The actual values are:
random_page_cost = 2
seq_page_cost = 1
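
They can be double-checked directly, for example:

  SHOW seq_page_cost;
  SHOW random_page_cost;
  -- or both at once:
  SELECT name, setting FROM pg_settings
   WHERE name IN ('seq_page_cost', 'random_page_cost');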

--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)

Re: Postgres on SSD

From
Amitabh Kant
Date:
There have been several discussions about SSDs in recent months, although not specific to Fusion-io drives.

See http://archives.postgresql.org/pgsql-general/2011-04/msg00460.php . You can search the archives for more such references.


Amitabh


Re: Postgres on SSD

From
Ondrej Ivanič
Date:
Hi,

2011/8/11 Amitabh Kant <amitabhkant@gmail.com>:
> There have been several discussions for SSD in recent months although not
> specific to Fusion IO drives.
>
> See http://archives.postgresql.org/pgsql-general/2011-04/msg00460.php . You
> can search the archives for more such reference.

I read that one several days ago, but the discussion turned into a
"flamewar" about SSD longevity...

Thanks,
--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)

Re: Postgres on SSD

From
Greg Smith
Date:
On 08/09/2011 07:17 PM, Ondrej Ivanič wrote:
> I'm about to evaluate this SSD card: FusionIO ioDrive Duo [1]. The
> main reason for this experiment is to see if SSD can significantly
> improve query performance
> Database size is around ~1.4TB. Main tables occupied around 1/3
> (450GB, ~220mil rows) and aggregated data occupied 2/3 (900GB). All
> indexes are on separate table space (~550GB)
>

ioDrive hardware is fast at executing all sorts of I/O, but it
particularly excels compared to normal drives with really random
workloads.  For example, I recently tested them in two different
systems, both head to head against regular 20 disk RAID10 arrays (Dell
MD units).  At sequential reads and writes, all drives were basically
the same; >1.2GB/s reads, >600MB/s writes.  The regular drive array was
actually a bit faster on sequential writes, which is common with SSD
showdowns.

Your tables are pretty big; not much of them will fit in memory.  If
your aggregated queries end up executing a lot of sequential scans of
the data set, either to compute the aggregates or to use them, you will
probably discover this is barely faster on FusionIO.  And you certainly
could speed that up for far less money spent on other hardware.

Is there a component to your workload that does a lot of random read or
write requests?  If so, is that chunk of the data set bigger than RAM,
but small enough to fit on the FusionIO drive?  Only when all those
conditions are true does that hardware really make sense.  For example,
running a 300GB pgbench test on a system with 128GB of RAM, the FusionIO
drive was almost 10X as fast as the 20 disk array.  And its raw seek
rate was 20X as fast at all concurrency levels.

But at the same time, tests on database sizes that fit into RAM were
slower on FusionIO than the regular disk array.  When there's no random
I/O to worry about, the slower read/write rate of the SSD meant it lost
the small database tests.

You really need to measure your current system carefully to figure out
just what it's doing as far as I/O goes to make this sort of decision.
Given what ioDrives cost, if you're not sure how to do that yourself
it's surely worth hiring a storage oriented database consultant for a
few days to help figure it out.

> XEN host with 16 CPU (Intel(R) Xeon(R) CPU L5520  @ 2.27GHz). CentOS 5.6
> 80GB RAM
> Storage: some Hitachi Fibre channel SAN with two LUNs:
> 1st LUN has *everything* under $PG_DATA (used 850 GB)
> 2nd LUN has *all* indexes (index table space) (used 550GB)
>

Make sure you do basic benchmarks of all this hardware before you throw
even more variables into the mix.  Both Xen hosts and SANs can cause
all sorts of performance bottlenecks.  It's possible you won't even be
able to fully utilize the hardware you've already got if it's running
with a virtual machine layer in there.  I have no idea how a FusionIO
drive will work in that environment, but I wouldn't expect it to be
great.  They need a fast CPU to run well, and some processing is done in
the driver rather than on the card.

>   checkpoint_segments          | 48
>   maintenance_work_mem         | 256MB
>   shared_buffers               | 9GB
>   wal_buffers                  | 50MB
>   work_mem                     | 256MB
>

checkpoint_segments should be higher, at least 64 and probably 128 to
256.  shared_buffers should be lower (at most 8GB, maybe even less).
maintenance_work_mem should be 1 to 2GB on a server with 80GB of RAM.
There's no proven benefit to increasing wal_buffers over 16MB.

This setting for work_mem can easily allow your server to allocate over
250GB of RAM for query working memory, if all 100 connections do
something.  Either reduce that a lot, or decrease max_connections, if
you want this server to run safely.
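
To put rough numbers on that, here's the direction I mean (just a
sketch, not a tuned postgresql.conf; the actual values need testing on
your workload):

  -- postgresql.conf sketch only:
  --   checkpoint_segments  = 128
  --   shared_buffers       = 8GB      -- or less
  --   maintenance_work_mem = 1GB
  --   wal_buffers          = 16MB
  --   work_mem             = 32MB     -- 100 connections x 256MB x several
  --                                   -- sort/hash nodes per query is far
  --                                   -- more than 80GB of RAM
  -- work_mem can still be raised per session for the few big reporting queries:
  SET work_mem = '256MB';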

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


Re: Postgres on SSD

From
Vick Khera
Date:
2011/8/10 Ondrej Ivanič <ondrej.ivanic@gmail.com>:
> Oops! Well spotted, Tomas! The actual values are:
> random_page_cost = 2
> seq_page_cost = 1
>

With the SSD I would set these to the same value of 1.  That's what I do.
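
For example (just a sketch; set them in postgresql.conf for the whole
cluster, or per session while testing plans):

  SET seq_page_cost = 1;
  SET random_page_cost = 1;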

Re: Postgres on SSD

From
Greg Smith
Date:
On 08/12/2011 04:24 PM, Vick Khera wrote:
> 2011/8/10 Ondrej Ivanič<ondrej.ivanic@gmail.com>:
>
>> Oops! Well spotted, Tomas! The actual values are:
>> random_page_cost = 2
>> seq_page_cost = 1
>>
>>
> With the SSD I would set these to the same value of 1.  That's what I do.
>

That probably makes sense on your RAMSAN.  Sequential access on FusionIO
drives is at least 3X as fast as completely random access though, and in
the worst case random access can be even slower relative to what a
sequential scan can deliver.  It's not the >50X difference seen on
regular drives, but there's an easily measurable gap.  I'm not sure if
it's that the flash cells deliver data faster when you read a sequential
series from the same cell of flash, or if it's just that there are fewer
physical I/Os happening.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


Re: Postgres on SSD

From
Ondrej Ivanič
Date:
Hi,

On 12 August 2011 14:57, Greg Smith <greg@2ndquadrant.com> wrote:
> ioDrive hardware is fast at executing all sorts of I/O, but it particularly
> excels compared to normal drives with really random workloads.

That's what I hope for :). It looks like the ioDrive is 3 to 5 times
faster for sequential IO compared to our SAN.

> Your tables are pretty big; not much of them will fit in memory.  If your
> aggregated queries end up executing a lot of sequential scans of the data
> set in order to compute, or for them to be utilized, you will probably
> discover this is barely faster on FusionIO.

About 99% of disk activity is random IO (reads). Sequential IO is
caused by the aggregates cron job reading the last X minutes of data.
The majority of the queries are primary key scans plus some extra where
condition filtering. The default random_page_cost and seq_page_cost
weren't the way to go because the planner chose a seq scan and query
execution took too long...
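That kind of thing is easy to check with something like this (made-up
names again):

  SET random_page_cost = 4;   -- back to the default, for comparison
  EXPLAIN ANALYZE
  SELECT *
    FROM fact
   WHERE organisation = 42
     AND datetime BETWEEN '2011-07-01' AND '2011-08-01'
   ORDER BY attribute1
   LIMIT 1000;
  -- with the default costs we got a seq scan over the partitions;
  -- lowering random_page_cost makes the planner pick the index instead
  RESET random_page_cost;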

> Is there a component to your workload that does a lot of random read or
> write requests?  If so, is that chunk of the data set bigger than RAM, but
> small enough to fit on the FusionIO drive?  Only when all those conditions
> are true does that hardware really make sense.

Yes, almost all queries do random IO. The final result set is only the
top 1000 rows, but several weeks of data must be read and sorted before
applying the limit. I have two cards available (= 2.4TB) so I can have
the entire dataset there.
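Moving things onto the cards would be roughly like this (mount point
and index name made up):

  CREATE TABLESPACE fio_indexes LOCATION '/mnt/fio0/pg_indexes';
  ALTER INDEX fact_2011_08_attribute1_idx SET TABLESPACE fio_indexes;
  -- ...repeated per index, or the whole cluster gets initdb'ed on the card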

> But at the same time, tests on database sizes that fit into RAM were slower
> on FusionIO than the regular disk array.  When there's no random I/O to
> worry about, the slower read/write rate of the SSD meant it lost the small
> database tests.

Makes sense to me.

>>  checkpoint_segments          | 48
>>  maintenance_work_mem         | 256MB
>>  shared_buffers               | 9GB
>>  wal_buffers                  | 50MB
>>  work_mem                     | 256MB
>>
>
> checkpoint_segments should be higher, at least 64 and probably 128 to 256.
>  shared_buffers should be lower (at most 8GB, maybe even less).
>  maintenance_work_mem should be 1 to 2GB on a server with 80GB of RAM.
>  There's no proven benefit to increasing wal_buffers over 16MB.

Yes, shared_buffers is high; based on your input and other sources I
would like to try lower values. I really don't understand this part of
the "magic", I mean checkpoint_segments and the WAL-related settings.

> This setting for work_mem can easily allow your server to allocate over
> 250GB of RAM for query working memory, if all 100 connections do something.
>  Either reduce that a lot, or decrease max_connections, if you want this
> server to run safely.

Yes, I'm aware of this issue. There is nothing like pgpool between
Apache (PHP) and Postgres, and there are no more than 15 simultaneous
connections...

Thanks,
--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)




Re: Postgres on SSD

From
Ondrej Ivanič
Date:
Hi,

On 12 August 2011 14:57, Greg Smith <greg@2ndquadrant.com> wrote:
>> I'm about to evaluate this SSD card: FusionIO ioDrive Duo [1]. The
>> main reason for this experiment is to see if SSD can significantly
>> improve query performance

The result is that FusionIO helps our queries, which was expected. Most
of the long running queries now return data in the 5 to 30 sec range,
which is very good. The rest of the queries are super fast, but
aggregate queries sometimes need several minutes. Anyway, the overall
performance is satisfactory and the rest could be fixed by redesigning
the aggregates (keep/roll over top N counts, not everything).
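The idea, roughly, is to keep only the top N rows per organisation and
day and drop the rest, something like this sketch (made-up names again,
8.4 window functions):

  -- keep only the 100 biggest combinations per organisation and day
  DELETE FROM agg_attribute1_attribute2 a
   USING (SELECT date, organisation, attribute1, attribute2,
                 row_number() OVER (PARTITION BY date, organisation
                                    ORDER BY count DESC) AS rn
            FROM agg_attribute1_attribute2) r
   WHERE a.date = r.date
     AND a.organisation = r.organisation
     AND a.attribute1 = r.attribute1
     AND a.attribute2 = r.attribute2
     AND r.rn > 100;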

>> Database size is around ~1.4TB. Main tables occupied around 1/3
>> (450GB, ~220mil rows) and aggregated data occupied 2/3 (900GB). All
>> indexes are on separate table space (~550GB)

The redesign mentioned above can reduce the aggregated data to between
1/3 and 1/4 of its current size (and speed up queries). I've tried
changing several settings (work_mem, shared_buffers, random/seq/...
costs) but I wasn't able to get better benchmark results. Our schema is
very simple and the query execution plans are reasonable.

>>  checkpoint_segments          | 48
>>  maintenance_work_mem         | 256MB
>>  shared_buffers               | 9GB
>>  wal_buffers                  | 50MB
>>  work_mem                     | 256MB
>>
>
> checkpoint_segments should be higher, at least 64 and probably 128 to 256.
>  shared_buffers should be lower (at most 8GB, maybe even less).
>  maintenance_work_mem should be 1 to 2GB on a server with 80GB of RAM.
>  There's no proven benefit to increasing wal_buffers over 16MB.

I think this is the part which I have to look at...

Thanks,
--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)