Thread: Postgres on SSD
Hi,

I'm about to evaluate this SSD card: FusionIO ioDrive Duo [1]. The main reason for this experiment is to see if SSD can significantly improve query performance. So, I have the following questions:

- Could you please share your experience with SSD? Any issues?
- What needs to be changed at the Postgres/operating system level? The obvious one is to change random_page_cost (now: 2) and seq_page_cost (now: 4). What else should I look at?

Background: the database schema is pretty simple. Database size is around ~1.4TB. The main tables occupy around 1/3 (450GB, ~220mil rows) and aggregated data occupies 2/3 (900GB). All indexes are in a separate tablespace (~550GB).

1) main table(s)

This big fact table has ~200 columns and the average row size is 1.5kB. This table is never updated and new data is inserted constantly using COPY in chunks of about 10k rows. The table is quite sparse, so it is broken into 5 tables which are joined as necessary (a regular query reads 10-40% of the row size). Tables are partitioned by month but I'm thinking of using one- or two-week partitions. The primary key is a composite key (datetime, organisation, transaction_id). The transaction_id column is unique, but the "datetime" column is used for partitioning and all queries contain organisation="..." (multi-tenant database). In addition, there are ~15 single column indexes. Old data is deleted after 6 months (drop partition).

The usual query looks like this:

- select ... from T where organisation = ... and datetime between ... and ... and ... order by <single col> limit 1000

The user can choose any column for sorting, but we created indexes for the most popular/reasonable ones (those ~15 single column indexes). In reality, the query is more complex because of a few Postgres issues:

- the partitions/limit/order issue described on Stack Overflow and fixed in 9.1 or 9.2 [2], [3]
- partitions/join issues, i.e. a left join "on" clause must contain the datetime condition in order to avoid a full-table scan on the joined table

Query response time for indexed columns is between 5 and 30 sec (sometimes 200 sec). The target is to have all queries under 5 sec. If the query has an order by on a non-indexed column, then the response time is in hundreds of seconds, but the desired response time is 10 sec (the test query is over a 1 month range and an organisation has between 0.5 and 2 mil rows per month; a single partition has > 30 mil rows).

2) materialised aggregate tables

About 45 tables like this:

agg_attribute1_attribute2(date, organisation, attribute1, attribute2, count)
(= select datetime::date, organisation, attribute1, attribute2, count(*) from T where organisation = ... and datetime between ... and ... group by 1,2,3,4)

Tables are updated by a cron job every 15 minutes. Thanks to 8.3 HOT updates - almost no bloat! Monthly partitions are used (date column). Query response time is between 5 and 30 sec (sometimes 200 sec) and the target is to have all queries under 5 sec. The usual query is:

select attribute1, count(*) from agg_attribute1_... where organisation = ... and datetime between ... and ... group by 1 limit 10

or

select attribute1, attribute2, count(*) from agg_attribute1_attribute2 where organisation = ... and datetime between ... and ... group by 1, 2 limit 10

Top N queries perform even worse -- the query response time is in minutes and the target is around 15 sec.

Current hardware setup:
XEN host with 16 CPU (Intel(R) Xeon(R) CPU L5520 @ 2.27GHz)
CentOS 5.6
80GB RAM
Storage: some Hitachi Fibre Channel SAN with two LUNs:
1st LUN has *everything* under $PG_DATA (used 850GB)
2nd LUN has *all* indexes (index tablespace) (used 550GB)

Postgres settings:

             name             | current_setting
------------------------------+-----------------------------------------------
 version                      | PostgreSQL 8.4.5 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-48), 64-bit
 archive_command              | walarchive.sh %p %f
 archive_mode                 | on
 autovacuum                   | on
 autovacuum_max_workers       | 6
 autovacuum_naptime           | 5min
 autovacuum_vacuum_cost_delay | -1
 checkpoint_completion_target | 0.9
 checkpoint_segments          | 48
 constraint_exclusion         | on
 default_statistics_target    | 100
 effective_cache_size         | 20GB
 fsync                        | on
 lc_collate                   | en_US.UTF-8
 lc_ctype                     | en_US.UTF-8
 log_autovacuum_min_duration  | 0
 log_destination              | csvlog
 log_min_duration_statement   | 10s
 log_rotation_age             | 1d
 log_rotation_size            | 0
 log_truncate_on_rotation     | on
 logging_collector            | on
 maintenance_work_mem         | 256MB
 max_connections              | 100
 max_stack_depth              | 2MB
 random_page_cost             | 2
 server_encoding              | UTF8
 shared_buffers               | 9GB
 TimeZone                     | UTC
 vacuum_cost_delay            | 0
 wal_buffers                  | 50MB
 wal_sync_method              | fdatasync
 wal_writer_delay             | 1s
 work_mem                     | 256MB

[1] http://www.fusionio.com/products/iodrive-duo/
[2] http://stackoverflow.com/questions/6268633/postgres-partitioning-order-by-performance
[3] http://stackoverflow.com/questions/2236776/efficient-querying-of-multi-partition-postgres-table

Thanks,
--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)
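For reference, a minimal sketch of the partition layout described above, using 8.4-style inheritance partitioning (names beyond the ones mentioned in the post are hypothetical, and in practice the primary key and indexes must be re-declared on each child table, since they do not inherit):

-- Hypothetical sketch; the real table has ~200 columns and is split
-- into 5 sparse sub-tables.
CREATE TABLE t (
    datetime       timestamp NOT NULL,
    organisation   text      NOT NULL,
    transaction_id bigint    NOT NULL,
    some_col       text,
    -- ... more columns ...
    PRIMARY KEY (datetime, organisation, transaction_id)
);

-- One monthly partition; the CHECK constraint lets
-- constraint_exclusion = on prune the other months:
CREATE TABLE t_2011_08 (
    CHECK (datetime >= '2011-08-01' AND datetime < '2011-09-01')
) INHERITS (t);

-- One of the ~15 single-column sort indexes, created per partition:
CREATE INDEX t_2011_08_some_col_idx ON t_2011_08 (some_col);

-- The usual query shape; on 8.4 this is where the partition/ORDER BY/
-- LIMIT issue [2] hurts, because the limit is not pushed down into the
-- per-partition index scans:
SELECT *
FROM t
WHERE organisation = '...'
  AND datetime BETWEEN '2011-08-01' AND '2011-08-31'
ORDER BY some_col
LIMIT 1000;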
On 10 August 2011, 1:17, Ondrej Ivanič wrote:
> - What needs to be changed at Postgres/Operating system level? The
> obvious one is to change random_page_cost (now: 2) and seq_page_cost
> (now: 4). What else should I look at?

Are you sure about this? I'm not quite sure setting seq_page_cost=4 and random_page_cost=2 makes sense. Usually seq_page_cost is lower than random_page_cost, so I wonder why you have set it like that.

Tomas
Hi,

2011/8/10 Tomas Vondra <tv@fuzzy.cz>:
> On 10 August 2011, 1:17, Ondrej Ivanič wrote:
>> - What needs to be changed at Postgres/Operating system level? The
>> obvious one is to change random_page_cost (now: 2) and seq_page_cost
>> (now: 4). What else should I look at?
>
> Are you sure about this? I'm not quite sure setting seq_page_cost=4 and
> random_page_cost=2 makes sense. Usually seq_page_cost is lower than
> random_page_cost, so I wonder why you have set it like that.

Oops! Well spotted, Tomas! The actual values are:

random_page_cost = 2
seq_page_cost = 1

--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)
There have been several discussions about SSDs in recent months, although not specific to FusionIO drives.

See http://archives.postgresql.org/pgsql-general/2011-04/msg00460.php . You can search the archives for more such references.
Amitabh
Hi,

2011/8/11 Amitabh Kant <amitabhkant@gmail.com>:
> There have been several discussions about SSDs in recent months, although
> not specific to FusionIO drives.
>
> See http://archives.postgresql.org/pgsql-general/2011-04/msg00460.php . You
> can search the archives for more such references.

I read that one several days ago, but the discussion turned into a "flamewar" about SSD longevity...

Thanks,
--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)
On 08/09/2011 07:17 PM, Ondrej Ivanič wrote:
> I'm about to evaluate this SSD card: FusionIO ioDrive Duo [1]. The
> main reason for this experiment is to see if SSD can significantly
> improve query performance
>
> Database size is around ~1.4TB. The main tables occupy around 1/3
> (450GB, ~220mil rows) and aggregated data occupies 2/3 (900GB). All
> indexes are in a separate tablespace (~550GB)

ioDrive hardware is fast at executing all sorts of I/O, but it particularly excels compared to normal drives with really random workloads. For example, I recently tested them in two different systems, both head to head against regular 20 disk RAID10 arrays (Dell MD units). At sequential reads and writes, all drives were basically the same: >1.2GB/s reads, >600MB/s writes. The regular drive array was actually a bit faster on sequential writes, which is common with SSD showdowns.

Your tables are pretty big; not much of them will fit in memory. If your aggregated queries end up executing a lot of sequential scans of the data set in order to compute, or for them to be utilized, you will probably discover this is barely faster on FusionIO. And you certainly could speed that up for far less money spent on other hardware.

Is there a component to your workload that does a lot of random read or write requests? If so, is that chunk of the data set bigger than RAM, but small enough to fit on the FusionIO drive? Only when all those conditions are true does that hardware really make sense. For example, running a 300GB pgbench test on a system with 128GB of RAM, the FusionIO drive was almost 10X as fast as the 20 disk array, and its raw seek rate was 20X as fast at all concurrency levels. But at the same time, tests on database sizes that fit into RAM were slower on FusionIO than the regular disk array. When there's no random I/O to worry about, the slower read/write rate of the SSD meant it lost the small database tests.

You really need to measure your current system carefully to figure out just what it's doing as far as I/O goes to make this sort of decision. Given what ioDrives cost, if you're not sure how to do that yourself it's surely worth hiring a storage oriented database consultant for a few days to help figure it out.

> XEN host with 16 CPU (Intel(R) Xeon(R) CPU L5520 @ 2.27GHz). CentOS 5.6
> 80GB RAM
> Storage: some Hitachi Fibre Channel SAN with two LUNs:
> 1st LUN has *everything* under $PG_DATA (used 850GB)
> 2nd LUN has *all* indexes (index tablespace) (used 550GB)

Make sure you do basic benchmarks of all this hardware before you start adding even more stuff to the mix. Both Xen hosts and SANs can cause all sorts of performance bottlenecks. It's possible you won't even be able to fully utilize the hardware you've already got if it's running with a virtual machine layer in there. I have no idea how a FusionIO drive will work in that environment, but I wouldn't expect it to be great. They need a fast CPU to run well, and some processing is done in the driver rather than on the card.

> checkpoint_segments | 48
> maintenance_work_mem | 256MB
> shared_buffers | 9GB
> wal_buffers | 50MB
> work_mem | 256MB

checkpoint_segments should be higher, at least 64 and probably 128 to 256. shared_buffers should be lower (at most 8GB, maybe even less). maintenance_work_mem should be 1 to 2GB on a server with 80GB of RAM. There's no proven benefit to increasing wal_buffers over 16MB.
This setting for work_mem can easily allow your server to allocate over 250GB of RAM for query working memory, if all 100 connections do something. Either reduce that a lot, or decrease max_connections, if you want this server to run safely.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
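In postgresql.conf terms, a hedged sketch of those recommendations (the exact numbers below are starting points to benchmark, not values taken from this thread; the work_mem figure in particular is an invented illustration):

# Starting points based on the advice above; benchmark before adopting.
checkpoint_segments  = 128    # "at least 64 and probably 128 to 256"
shared_buffers       = 8GB    # "at most 8GB, maybe even less"
maintenance_work_mem = 1GB    # "1 to 2GB on a server with 80GB of RAM"
wal_buffers          = 16MB   # no proven benefit above 16MB
work_mem             = 64MB   # illustrative; one query can make several
                              # work_mem-sized allocations, so keep
                              # max_connections * work_mem well under RAM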
2011/8/10 Ondrej Ivanič <ondrej.ivanic@gmail.com>:
> Oops! Well spotted, Tomas! The actual values are:
> random_page_cost = 2
> seq_page_cost = 1

With the SSD I would set these to the same value of 1. That's what I do.
On 08/12/2011 04:24 PM, Vick Khera wrote:
> 2011/8/10 Ondrej Ivanič <ondrej.ivanic@gmail.com>:
>> Oops! Well spotted, Tomas! The actual values are:
>> random_page_cost = 2
>> seq_page_cost = 1
>
> With the SSD I would set these to the same value of 1. That's what I do.

That probably makes sense on your RAMSAN. Sequential access on FusionIO drives is at least 3X as fast as completely random access, though, and worst-case random access can be even slower relative to what a sequential scan can deliver. It's not the >50X difference seen on regular drives, but there's an easily measurable gap. I'm not sure if it's that the flash cells deliver stuff faster when you read a sequential series from the same cell of flash, or if it's just that there are fewer physical I/Os happening.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
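In postgresql.conf terms, the two positions might look like this (the FusionIO value is purely an illustration of "cheaper than disk, dearer than sequential", not a number from this thread):

# Storage with no seq/random gap at all (e.g. a RAMSAN), per Vick:
seq_page_cost    = 1.0
random_page_cost = 1.0

# FusionIO, where sequential is still ~3X faster than random, per Greg;
# 1.5 is an invented placeholder between 1.0 and the 4.0 disk default:
seq_page_cost    = 1.0
random_page_cost = 1.5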
Hi,

On 12 August 2011 14:57, Greg Smith <greg@2ndquadrant.com> wrote:
> ioDrive hardware is fast at executing all sorts of I/O, but it particularly
> excels compared to normal drives with really random workloads.

That's what I hope for :). It looks like the ioDrive is 3 to 5 times faster for seq IO compared to our SAN.

> Your tables are pretty big; not much of them will fit in memory. If your
> aggregated queries end up executing a lot of sequential scans of the data
> set in order to compute, or for them to be utilized, you will probably
> discover this is barely faster on FusionIO.

About 99% of disk activity is random IO (reads). Seq IO is caused by reading the last X minutes of data by the aggregates cron job. The majority of the queries are primary key scan queries plus some extra where condition filtering. The default random_page_cost and seq_page_cost weren't the way to go because the planner chose a seq scan and query execution took too long...

> Is there a component to your workload that does a lot of random read or
> write requests? If so, is that chunk of the data set bigger than RAM, but
> small enough to fit on the FusionIO drive? Only when all those conditions
> are true does that hardware really make sense.

Yes, almost all queries do random IO. The final result set is the top 1000 rows only, but several weeks of data must be read and sorted before applying the limit. I have two cards available (= 2.4TB) so I can have the entire dataset there.

> But at the same time, tests on database sizes that fit into RAM were slower
> on FusionIO than the regular disk array. When there's no random I/O to
> worry about, the slower read/write rate of the SSD meant it lost the small
> database tests.

Makes sense to me.

>> checkpoint_segments | 48
>> maintenance_work_mem | 256MB
>> shared_buffers | 9GB
>> wal_buffers | 50MB
>> work_mem | 256MB
>
> checkpoint_segments should be higher, at least 64 and probably 128 to 256.
> shared_buffers should be lower (at most 8GB, maybe even less).
> maintenance_work_mem should be 1 to 2GB on a server with 80GB of RAM.
> There's no proven benefit to increasing wal_buffers over 16MB.

Yes, shared_buffers are high; based on your input and other sources I would like to try lower values. I really do not understand this part of the "magic" - I mean checkpoint_segments and the WAL related settings.

> This setting for work_mem can easily allow your server to allocate over
> 250GB of RAM for query working memory, if all 100 connections do something.
> Either reduce that a lot, or decrease max_connections, if you want this
> server to run safely.

Yes, I'm aware of this issue. There's nothing like pgpool between Apache (PHP) and Postgres, and there are no more than 15 simultaneous connections...

Thanks,
--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)
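One way to sanity-check that random-vs-sequential mix from inside Postgres (these statistics views exist in 8.4; block counts are logical reads, so they only approximate physical I/O):

-- Per-table access pattern: seq_tup_read vs idx_tup_fetch shows which
-- access path dominates; heap_blks_read vs heap_blks_hit shows how much
-- misses shared_buffers and actually reaches the storage layer.
SELECT s.relname,
       s.seq_scan, s.seq_tup_read,
       s.idx_scan, s.idx_tup_fetch,
       io.heap_blks_read, io.heap_blks_hit
FROM pg_stat_user_tables s
JOIN pg_statio_user_tables io USING (relid)
ORDER BY io.heap_blks_read DESC
LIMIT 20;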
Hi,

On 12 August 2011 14:57, Greg Smith <greg@2ndquadrant.com> wrote:
>> I'm about to evaluate this SSD card: FusionIO ioDrive Duo [1]. The
>> main reason for this experiment is to see if SSD can significantly
>> improve query performance

The result is that FusionIO will help our queries, which was expected. Most of the long-running queries return data in the 5 to 30 sec range, which is very good. The rest of the queries are super fast, but aggregate queries sometimes need several minutes. Anyway, the overall performance is satisfactory and the rest could be fixed by redesigning the aggregates (keep/roll over top N counts, not everything); see the sketch below.

>> Database size is around ~1.4TB. Main tables occupied around 1/3
>> (450GB, ~220mil rows) and aggregated data occupied 2/3 (900GB). All
>> indexes are on separate table space (~550GB)

The redesign mentioned above can reduce the aggregated data size to between 1/3 and 1/4 of the current size (and speed up queries).

I've tried changing several settings (work_mem, shared_buffers, random/seq/... costs) but I wasn't able to get better benchmark results. Our schema is very simple and the query execution plans are reasonable.

>> checkpoint_segments | 48
>> maintenance_work_mem | 256MB
>> shared_buffers | 9GB
>> wal_buffers | 50MB
>> work_mem | 256MB
>
> checkpoint_segments should be higher, at least 64 and probably 128 to 256.
> shared_buffers should be lower (at most 8GB, maybe even less).
> maintenance_work_mem should be 1 to 2GB on a server with 80GB of RAM.
> There's no proven benefit to increasing wal_buffers over 16MB.

I think this is the part which I have to look at...

Thanks,
--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)
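A hedged sketch of what the "keep/roll over top N counts" redesign could look like (all table and column names are hypothetical; the window function requires 8.4+, which matches the setup above):

-- Keep only the 100 biggest attribute1 groups per (date, organisation)
-- and fold the long tail into a single NULL bucket, shrinking the
-- aggregate table while keeping top-N queries exact.
CREATE TABLE agg_attribute1_topn (
    date         date   NOT NULL,
    organisation text   NOT NULL,
    attribute1   text,           -- NULL = rolled-over "everything else"
    count        bigint NOT NULL
);

INSERT INTO agg_attribute1_topn (date, organisation, attribute1, count)
SELECT date, organisation,
       CASE WHEN rnk <= 100 THEN attribute1 END,   -- tail becomes NULL
       sum(count)
FROM (
    SELECT date, organisation, attribute1, sum(count) AS count,
           rank() OVER (PARTITION BY date, organisation
                        ORDER BY sum(count) DESC) AS rnk
    FROM agg_attribute1_attribute2      -- hypothetical source aggregate
    GROUP BY 1, 2, 3
) ranked
GROUP BY 1, 2, 3;

Top-N queries would then read this much smaller table directly instead of scanning every attribute combination.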