Thread: H800 + md1200 Performance problem
Hello there,
I am having a performance problem with a new Dell server. I currently have these two servers:
Server A (old - production)
-----------------
2xCPU Six-Core AMD Opteron 2439 SE
64GB RAM
Raid controller Perc6 512MB cache NV
- 2 HD 146GB SAS 15Krpm RAID1 (OS CentOS 5.4 and pg_xlog) (XFS, no barriers)
- 6 HD 300GB SAS 15Krpm RAID10 (DB Postgres 8.3.9) (XFS, no barriers)
Server B (new)
------------------
2xCPU 16 Core AMD Opteron 6282 SE
64GB RAM
Raid controller H700 1GB cache NV
- 2 HD 74GB SAS 15Krpm RAID1, stripe 16k (OS CentOS 6.2)
- 4HD 146GB SAS 15Krpm RAID10 stripe 16k XFS (pg_xlog) (ext4 bs 4096, no barriers)
Raid controller H800 1GB cache nv
- MD1200 12HD 300GB SAS 15Krpm RAID10 stripe 256k (DB Postgres 8.3.18) (ext4 bs 4096, stride 64, stripe-width 384, no barriers)
The Postgres DB is the same on both servers. It is about 170GB, with some tables partitioned by date via a trigger. On both servers, settings like shared_buffers and checkpoint_segments are similar, because the RAM is the same.
I assumed the new server would be faster than the old one, because it has more disks in its RAID10 and two RAID controllers with more cache memory, but I am not getting the expected results.
For example this query:
EXPLAIN ANALYZE SELECT c.id AS c__id, c.fk_news_id AS c__fk_news_id, c.fk_news_group_id AS c__fk_news_group_id, c.fk_company_id AS c__fk_company_id, c.import_date AS c__import_date, c.highlight AS c__highlight, c.status AS c__status, c.ord AS c__ord, c.news_date AS c__news_date, c.fk_media_id AS c__fk_media_id, c.title AS c__title, c.search_title_idx AS c__search_title_idx, c.stored AS c__stored, c.tono AS c__tono, c.media_type AS c__media_type, c.fk_editions_news_id AS c__fk_editions_news_id, c.dossier_selected AS c__dossier_selected, c.update_stats AS c__update_stats, c.url_news AS c__url_news, c.url_image AS c__url_image, m.id AS m__id, m.name AS m__name, m.media_type AS m__media_type, m.media_code AS m__media_code, m.fk_data_source_id AS m__fk_data_source_id, m.language_iso AS m__language_iso, m.country_iso AS m__country_iso, m.region_iso AS m__region_iso, m.subregion_iso AS m__subregion_iso, m.media_code_temp AS m__media_code_temp, m.url AS m__url, m.current_rank AS m__current_rank, m.typologyid AS m__typologyid, m.fk_platform_id AS m__fk_platform_id, m.page_views_per_day AS m__page_views_per_day, m.audience AS m__audience, m.last_stats_update AS m__last_stats_update, n.id AS n__id, n.fk_media_id AS n__fk_media_id, n.fk_news_media_id AS n__fk_news_media_id, n.fk_data_source_id AS n__fk_data_source_id, n.news_code AS n__news_code, n.title AS n__title, n.searchfull_idx AS n__searchfull_idx, n.news_date AS n__news_date, n.economical_value AS n__economical_value, n.audience AS n__audience, n.media_type AS n__media_type, n.url_news AS n__url_news, n.url_news_old AS n__url_news_old, n.url_image AS n__url_image, n.typologyid AS n__typologyid, n.author AS n__author, n.fk_platform_id AS n__fk_platform_id, n2.id AS n2__id, n2.name AS n2__name, n3.id AS n3__id, n3.name AS n3__name, f.id AS f__id, f.name AS f__name, n4.id AS n4__id, n4.opentext AS n4__opentext, i.id AS i__id, i.name AS i__name, i.ord AS i__ord, i2.id AS i2__id, i2.name AS i2__name FROM 
company_news_internet c LEFT JOIN media_internet m ON c.fk_media_id = m.id AND m.media_type = 4 LEFT JOIN news_internet n ON c.fk_news_id = n.id AND n.media_type = 4 LEFT JOIN news_media_internet n2 ON n.fk_news_media_id = n2.id AND n2.media_type = 4 LEFT JOIN news_group_internet n3 ON c.fk_news_group_id = n3.id AND n3.media_type = 4 LEFT JOIN feed_internet f ON n3.fk_feed_id = f.id LEFT JOIN news_text_internet n4 ON c.fk_news_id = n4.fk_news_id AND n4.media_type = 4 LEFT JOIN internet_typology i ON n.typologyid = i.id LEFT JOIN internet_media_platform i2 ON n.fk_platform_id = i2.id WHERE (c.fk_company_id = '16073' AND c.status <> '-3' AND n3.fk_feed_id = '30693' AND n3.status = '1' AND f.fk_company_id = '16073') AND n.typologyid IN ('6', '7', '1', '2', '3', '5', '4') AND c.id > '49764393' AND c.news_date >= '2012-04-02'::timestamp - INTERVAL '4 months' AND n.news_date >= '2012-04-02'::timestamp - INTERVAL '4 months' AND c.fk_news_group_id IN ('43475') AND (c.media_type = 4) ORDER BY c.news_date DESC, c.id DESC LIMIT 200
This takes about 20 seconds on server A, but about 150 seconds on the new server B... In the EXPLAIN output I have noticed that on server A the sequential scan on table news_internet_201112 takes about 2s:
-> Seq Scan on news_internet_201112 n (cost=0.00..119749.12 rows=1406528 width=535) (actual time=0.046..2186.379 rows=1844831 loops=1)
Filter: ((news_date >= '2011-12-02 00:00:00'::timestamp without time zone) AND (media_type = 4) AND (typologyid = ANY ('{6,7,1,2,3,5,4}'::integer[])))
While on server B it takes 11s:
-> Seq Scan on news_internet_201112 n (cost=0.00..119520.12 rows=1405093 width=482) (actual time=0.177..11783.621 rows=1844831 loops=1)
Filter: ((news_date >= '2011-12-02 00:00:00'::timestamp without time zone) AND (media_type = 4) AND (typologyid = ANY ('{6,7,1,2,3,5,4}'::integer[])))
It is striking that, while on server A the execution time varies by only a few seconds when I run the same query repeatedly, on server B it fluctuates between 30 and 150 seconds, even though the server has no connected clients.
As another example, here is a full-table scan, running the same query twice:
Server A
------------
EXPLAIN ANALYZE SELECT * from company_news_internet_201111 ;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on company_news_internet_201111 (cost=0.00..457010.37 rows=6731337 width=318) (actual time=0.042..19665.155 rows=6731337 loops=1)
Total runtime: 20391.555 ms
-
EXPLAIN ANALYZE SELECT * from company_news_internet_201111 ;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on company_news_internet_201111 (cost=0.00..457010.37 rows=6731337 width=318) (actual time=0.012..2171.181 rows=6731337 loops=1)
Total runtime: 2831.028 ms
Server B
------------
EXPLAIN ANALYZE SELECT * from company_news_internet_201111 ;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on company_news_internet_201111 (cost=0.00..369577.79 rows=6765779 width=323) (actual time=0.110..10010.443 rows=6765779 loops=1)
Total runtime: 11552.818 ms
-
EXPLAIN ANALYZE SELECT * from company_news_internet_201111 ;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on company_news_internet_201111 (cost=0.00..369577.79 rows=6765779 width=323) (actual time=0.023..8173.801 rows=6765779 loops=1)
Total runtime: 12939.717 ms
It seems that server B does not cache the table?
I'm lost. I have tested different file systems, like XFS, and different stripe sizes, but with no results.
Any ideas about what could be happening?
Thanks a lot!!
César Martín Pérez
cmartinp@gmail.com
Did you check your read ahead settings (getra)?
Mike DelNegro
Sent from my iPhone
Hi Mike,
Thank you for your fast response.
blockdev --getra /dev/sdc
256
What value do you recommend for this setting?
Thanks!
On April 3, 2012 at 14:37, Mike DelNegro <mdelnegro@yahoo.com> wrote:
> Did you check your read ahead settings (getra)?
César Martín Pérez
cmartinp@gmail.com
On Tue, Apr 3, 2012 at 7:20 AM, Cesar Martin <cmartinp@gmail.com> wrote:
> I am having performance problem with new DELL server. [...]
> Takes about 20 second in server A but in new server B takes 150 seconds...
> [...]
> It seems that Server B don cache the table¿?¿?
> Any ideas that could be happen?

That's a significant regression. Probable hardware issue -- have you run performance tests on it such as bonnie++? dd? What's iowait during the scan?

merlin
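[Editor's note] A minimal sketch of how one might capture the iowait and throughput Merlin asks about, while the slow query runs in another session. Assumptions: the data array appears as /dev/sdc (as in the blockdev output later in this thread), and the vmstat "wa" column is in its usual position (16th on a typical Linux vmstat; adjust if your layout differs):

```shell
# Per-device throughput and utilization, one sample per second,
# while the slow query runs in another session:
iostat -xm 1 /dev/sdc

# CPU iowait over time (the "wa" column near the right edge):
vmstat 1

# Pull just the iowait percentage out of a few samples
# (skip the two header lines, print field 16):
vmstat 1 3 | awk 'NR > 2 { print "iowait%: " $16 }'
```

A sustained high "wa" during the scan points at the storage path; a low one with slow scans points more toward CPU, NUMA, or readahead issues.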
On 3.4.2012 14:59, Cesar Martin wrote:
> Hi Mike,
> Thank you for your fast response.
>
> blockdev --getra /dev/sdc
> 256

That's way too low. Is this setting the same on both machines? Anyway, set it to 4096, 8192 or even 16384 and check the difference.

BTW explain analyze is nice, but it's only half the info, especially when the issue is outside PostgreSQL (hw, OS, ...). Please, provide samples from iostat / vmstat or tools like that.

Tomas
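[Editor's note] A concrete sketch of Tomas's suggestion (run as root). The device name /dev/sdc comes from the blockdev output quoted above; persisting via /etc/rc.local is an assumption for a CentOS-era system, not something stated in the thread:

```shell
# Current readahead, in 512-byte sectors (256 sectors = 128 KiB):
blockdev --getra /dev/sdc

# Raise it to 4096 sectors and re-run the sequential-scan test:
blockdev --setra 4096 /dev/sdc

# --setra does not survive a reboot; one way to persist it:
echo 'blockdev --setra 4096 /dev/sdc' >> /etc/rc.local

# Sanity check of the unit: readahead is counted in 512-byte sectors,
# so 4096 sectors is 2048 KiB (2 MiB):
echo $((4096 * 512 / 1024))
```

With a 12-spindle RAID10 behind the H800, a larger readahead lets sequential scans keep all the spindles busy instead of issuing small 128 KiB requests.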
On Tue, Apr 3, 2012 at 6:20 AM, Cesar Martin <cmartinp@gmail.com> wrote:
> I supposed that, new server had to be faster than old, because have more
> disk in RAID10 and two RAID controllers with more cache memory, but really
> I'm not obtaining the expected results

What does

sysctl -n vm.zone_reclaim_mode

say? If it says 1, change it to 0:

sysctl -w zone_reclaim_mode=0

It's an automatic setting designed to make large virtual hosting servers etc run faster but totally screws with pg and file servers with big numbers of cores and large memory spaces.
On Tue, Apr 3, 2012 at 9:32 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> sysctl -w zone_reclaim_mode=0

That should be:

sysctl -w vm.zone_reclaim_mode=0
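[Editor's note] Combining the two messages above into one sequence (needs root). Persisting the setting in /etc/sysctl.conf is an assumption for CentOS 6, not something given in the thread:

```shell
# 1 means each NUMA node prefers reclaiming its own page cache over
# borrowing free memory from another node -- harmful for a large,
# shared PostgreSQL buffer/page cache on a many-core box:
cat /proc/sys/vm/zone_reclaim_mode

# Turn it off for the running kernel:
sysctl -w vm.zone_reclaim_mode=0

# Persist the change across reboots:
echo 'vm.zone_reclaim_mode = 0' >> /etc/sysctl.conf
```

This matches the symptom above: a table that should stay cached being re-read from disk, with wildly fluctuating run times on an otherwise idle machine.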
Yes, setting is the same in both machines.
--
César Martín Pérez
cmartinp@gmail.com
The results of bonnie++ running without arguments are:
Version 1.96 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
cltbbdd01 126G 94 99 202873 99 208327 95 1639 91 819392 88 2131 139
Latency 88144us 228ms 338ms 171ms 147ms 20325us
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
cltbbdd01 16 8063 26 +++++ +++ 27361 96 31437 96 +++++ +++ +++++ +++
Latency 7850us 2290us 2310us 530us 11us 522us
With dd, one CPU core sits at 100% and results are about 100-170 MBps, which I think is a bad result for this hardware:
dd if=/dev/zero of=/vol02/bonnie/DD bs=8M count=100
100+0 records in
100+0 records out
838860800 bytes (839 MB) copied, 8,1822 s, 103 MB/s
dd if=/dev/zero of=/vol02/bonnie/DD bs=8M count=1000 conv=fdatasync
1000+0 records in
1000+0 records out
8388608000 bytes (8,4 GB) copied, 50,8388 s, 165 MB/s
dd if=/dev/zero of=/vol02/bonnie/DD bs=1M count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
1073741824 bytes (1,1 GB) copied, 7,39628 s, 145 MB/s
When I monitor I/O activity with iostat during dd, I have noticed that, if the test takes 10 seconds, the disks have activity only during the last 3 or 4 seconds, and iostat reports about 250-350 MBps. Is that normal?
I set read ahead to different values, but the results don't differ substantially...
Thanks!
On 3 April 2012 15:21, Tomas Vondra <tv@fuzzy.cz> wrote:
On 3.4.2012 14:59, Cesar Martin wrote:
> Hi Mike,
> Thank you for your fast response.
>
> blockdev --getra /dev/sdc
> 256

That's way too low. Is this setting the same on both machines?

Anyway, set it to 4096, 8192 or even 16384 and check the difference.
BTW explain analyze is nice, but it's only half the info, especially
when the issue is outside PostgreSQL (hw, OS, ...). Please, provide
samples from iostat / vmstat or tools like that.
Tomas
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
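[Editor's note: the read-ahead values discussed above are in 512-byte sectors; a quick sketch of what they translate to, plus the blockdev invocations implied (device name /dev/sdc taken from the thread, run as root):]

```shell
# Translate the suggested read-ahead settings from sectors to bytes:
for ra in 256 4096 8192 16384; do
    echo "$ra sectors = $((ra * 512 / 1024)) KiB"
done

# To apply and verify on the data array:
#   blockdev --setra 4096 /dev/sdc
#   blockdev --getra /dev/sdc
```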
OK Scott. I will change this kernel parameter and repeat the tests.
Thanks!
On 3 April 2012 17:34, Scott Marlowe <scott.marlowe@gmail.com> wrote:
On Tue, Apr 3, 2012 at 9:32 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Tue, Apr 3, 2012 at 6:20 AM, Cesar Martin <cmartinp@gmail.com> wrote:
>> Hello there,
>>
>> I am having performance problem with new DELL server. Actually I have this
>> two servers
>>
>> Server A (old - production)
>> -----------------
>> 2xCPU Six-Core AMD Opteron 2439 SE
>> 64GB RAM
>> Raid controller Perc6 512MB cache NV
>> - 2 HD 146GB SAS 15Krpm RAID1 (SO Centos 5.4 y pg_xlog) (XFS no barriers)
>> - 6 HD 300GB SAS 15Krpm RAID10 (DB Postgres 8.3.9) (XFS no barriers)
>>
>> Server B (new)
>> ------------------
>> 2xCPU 16 Core AMD Opteron 6282 SE
>> 64GB RAM
>> Raid controller H700 1GB cache NV
>> - 2HD 74GB SAS 15Krpm RAID1 stripe 16k (SO Centos 6.2)
>> - 4HD 146GB SAS 15Krpm RAID10 stripe 16k XFS (pg_xlog) (ext4 bs 4096, no
>> barriers)
>> Raid controller H800 1GB cache nv
>> - MD1200 12HD 300GB SAS 15Krpm RAID10 stripe 256k (DB Postgres 8.3.18)
>> (ext4 bs 4096, stride 64, stripe-width 384, no barriers)
>>
>> Postgres DB is the same in both servers. This DB has 170GB size with some
>> tables partitioned by date with a trigger. In both shared_buffers,
>> checkpoint_segments... settings are similar because RAM is similar.
>>
>> I supposed that, new server had to be faster than old, because have more
>> disk in RAID10 and two RAID controllers with more cache memory, but really
>> I'm not obtaining the expected results
>
> What does
>
> sysctl -n vm.zone_reclaim_mode
>
> say? If it says 1, change it to 0:
>
> sysctl -w zone_reclaim_mode=0
That should be:

sysctl -w vm.zone_reclaim_mode=0
On 3.4.2012 17:42, Cesar Martin wrote:

> Yes, setting is the same in both machines. > > The results of bonnie++ running without arguments are: > > Version 1.96 ------Sequential Output------ --Sequential Input- > --Random- > -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- > --Seeks-- > Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP > /sec %CP > cltbbdd01 126G 94 99 202873 99 208327 95 1639 91 819392 88 > 2131 139 > Latency 88144us 228ms 338ms 171ms 147ms > 20325us > ------Sequential Create------ --------Random Create-------- > -Create-- --Read--- -Delete-- -Create-- --Read--- > -Delete-- > files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP > /sec %CP > cltbbdd01 16 8063 26 +++++ +++ 27361 96 31437 96 +++++ +++ > +++++ +++ > Latency 7850us 2290us 2310us 530us 11us > 522us > > With DD, one core of CPU put at 100% and results are about 100-170 > MBps, that I thing is bad result for this HW: > > dd if=/dev/zero of=/vol02/bonnie/DD bs=8M count=100 > 100+0 records in > 100+0 records out > 838860800 bytes (839 MB) copied, 8,1822 s, 103 MB/s > > dd if=/dev/zero of=/vol02/bonnie/DD bs=8M count=1000 conv=fdatasync > 1000+0 records in > 1000+0 records out > 8388608000 bytes (8,4 GB) copied, 50,8388 s, 165 MB/s > > dd if=/dev/zero of=/vol02/bonnie/DD bs=1M count=1024 conv=fdatasync > 1024+0 records in > 1024+0 records out > 1073741824 bytes (1,1 GB) copied, 7,39628 s, 145 MB/s > > When monitor I/O activity with iostat, during dd, I have noticed that, > if the test takes 10 second, the disk have activity only during last 3 > or 4 seconds and iostat report about 250-350MBps. Is it normal?

Well, you're testing writing, and the default behavior is to write the data into page cache. And you do have 64GB of RAM so the write cache may take a large portion of the RAM - even gigabytes.

To really test the I/O you need to (a) write about 2x the amount of RAM or (b) tune the dirty_ratio/dirty_background_ratio accordingly.
BTW what are you trying to achieve with "conv=fdatasync" at the end? My dd man page does not mention 'fdatasync' and IMHO it's a mistake on your side. If you want to sync the data at the end, then you need to do something like

time sh -c "dd ... && sync"

> I set read ahead to different values, but the results don't differ > substantially...

Because read-ahead is for reading (which is what a SELECT does most of the time), but the tests above are writing to the device. And writing is not influenced by read-ahead.

To test reading, do this:

dd if=/vol02/bonnie/DD of=/dev/null bs=8M count=1024

Tomas
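[Editor's note: Tomas's "dd && sync" pattern written out concretely - a sketch, with the target path taken from elsewhere in this thread and assumed to be the filesystem under test:]

```shell
# Time the write AND the final sync together, so data parked in the
# page cache cannot flatter the result:
time sh -c "dd if=/dev/zero of=/vol02/bonnie/DD bs=8M count=1024 && sync"

# For the read test, drop caches first (as root) or the file may be
# served from RAM instead of the disks:
#   echo 3 > /proc/sys/vm/drop_caches
#   dd if=/vol02/bonnie/DD of=/dev/null bs=8M count=1024
```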
On Tue, Apr 3, 2012 at 1:01 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

> On 3.4.2012 17:42, Cesar Martin wrote: >> [bonnie++ and dd results snipped in this copy, see the original message above] > > Well, you're testing writing, and the default behavior is to write the > data into page cache. And you do have 64GB of RAM so the write cache may > take large portion of the RAM - even gigabytes. > To really test the I/O > you need to (a) write about 2x the amount of RAM or (b) tune the > dirty_ratio/dirty_background_ratio accordingly. > > Because read-ahead is for reading (which is what a SELECT does most of > the time), but the tests above are writing to the device. And writing is > not influenced by read-ahead.

Yeah, but I have to agree with Cesar -- that's a pretty unspectacular result for a 12-drive SAS array, to say the least (unless the way dd was being run was throwing it off somehow). Something is definitely not right here. Maybe we can see similar tests run on the production server as a point of comparison?

merlin
Hello,
Yesterday I changed the kernel setting that Scott suggested, vm.zone_reclaim_mode = 0. I have run new benchmarks and have noticed changes, at least in Postgres:
First exec:
EXPLAIN ANALYZE SELECT * from company_news_internet_201111;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on company_news_internet_201111 (cost=0.00..369577.79 rows=6765779 width=323) (actual time=0.020..7984.707 rows=6765779 loops=1)
Total runtime: 12699.008 ms
(2 filas)
Second:
EXPLAIN ANALYZE SELECT * from company_news_internet_201111;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on company_news_internet_201111 (cost=0.00..369577.79 rows=6765779 width=323) (actual time=0.023..1767.440 rows=6765779 loops=1)
Total runtime: 2696.901 ms
It seems that data is now being cached correctly...
The large query takes 80 seconds on the first execution and around 23 seconds on the second. This is not spectacular, but it is better than yesterday.
Furthermore the results of dd are strange:
dd if=/dev/zero of=/vol02/bonnie/DD bs=8M count=16384
16384+0 records in
16384+0 records out
137438953472 bytes (137 GB) copied, 803,738 s, 171 MB/s
171 MB/s is, I think, a bad value for a 12-disk SAS RAID10... And when I run iostat during the dd execution I get results like:
sdc 1514,62 0,01 108,58 11 117765
sdc 3705,50 0,01 316,62 0 633
sdc 2,00 0,00 0,05 0 0
sdc 920,00 0,00 63,49 0 126
sdc 8322,50 0,03 712,00 0 1424
sdc 6662,50 0,02 568,53 0 1137
sdc 0,00 0,00 0,00 0 0
sdc 1,50 0,00 0,04 0 0
sdc 6413,00 0,01 412,28 0 824
sdc 13107,50 0,03 867,94 0 1735
sdc 0,00 0,00 0,00 0 0
sdc 1,50 0,00 0,03 0 0
sdc 9719,00 0,03 815,49 0 1630
sdc 2817,50 0,01 272,51 0 545
sdc 1,50 0,00 0,05 0 0
sdc 1181,00 0,00 71,49 0 142
sdc 7225,00 0,01 362,56 0 725
sdc 2973,50 0,01 269,97 0 539
I don't understand why MB_wrtn/s swings from 0 to near 800 MB/s constantly during the execution.
Read results:
dd if=/vol02/bonnie/DD of=/dev/null bs=8M count=16384
16384+0 records in
16384+0 records out
137438953472 bytes (137 GB) copied, 257,626 s, 533 MB/s
sdc 3157,00 392,69 0,00 785 0
sdc 3481,00 432,75 0,00 865 0
sdc 2669,50 331,50 0,00 663 0
sdc 3725,50 463,75 0,00 927 0
sdc 2998,50 372,38 0,00 744 0
sdc 3600,50 448,00 0,00 896 0
sdc 3588,00 446,50 0,00 893 0
sdc 3494,00 434,50 0,00 869 0
sdc 3141,50 390,62 0,00 781 0
sdc 3667,50 456,62 0,00 913 0
sdc 3429,35 426,18 0,00 856 0
sdc 3043,50 378,06 0,00 756 0
sdc 3366,00 417,94 0,00 835 0
sdc 3480,50 432,62 0,00 865 0
sdc 3523,50 438,06 0,00 876 0
sdc 3554,50 441,88 0,00 883 0
sdc 3635,00 452,19 0,00 904 0
sdc 3107,00 386,20 0,00 772 0
sdc 3695,00 460,00 0,00 920 0
sdc 3475,50 432,11 0,00 864 0
sdc 3487,50 433,50 0,00 867 0
sdc 3232,50 402,39 0,00 804 0
sdc 3698,00 460,67 0,00 921 0
sdc 5059,50 632,00 0,00 1264 0
sdc 3934,00 489,56 0,00 979 0
sdc 4536,50 566,75 0,00 1133 0
sdc 5298,00 662,12 0,00 1324 0
Here the results seem more logical: read speed is sustained throughout the whole test...
About the "conv=fdatasync" parameter that Tomas mentioned: I saw it at http://romanrm.ru/en/dd-benchmark and started using it, but it is possibly wrong. Before that I used time sh -c "dd if=/dev/zero of=ddfile bs=X count=Y && sync".
What is your opinion of the results?
I have noticed that since I changed the setting vm.zone_reclaim_mode = 0, swap is totally full. Do you recommend disabling swap?
Thanks!!
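[Editor's note: a sanity check on the figures above, not in the original thread - dd's MB/s is simply bytes copied divided by elapsed seconds, in decimal megabytes:]

```shell
# Recompute the dd throughput from the raw numbers posted above:
# 137438953472 bytes in 803.738 s (write) and 257.626 s (read).
awk 'BEGIN {
    b = 137438953472
    printf "write: %.0f MB/s\n", b / 803.738 / 1e6
    printf "read:  %.0f MB/s\n", b / 257.626 / 1e6
}'
```

This reproduces the 171 MB/s write and 533 MB/s read figures dd reported.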
On Wed, Apr 4, 2012 at 3:42 AM, Cesar Martin <cmartinp@gmail.com> wrote:

> I have noticed that since I changed the setting vm.zone_reclaim_mode = 0, > swap is totally full. Do you recommend me disable swap?

Yes
On 4.4.2012 15:15, Scott Marlowe wrote:

> On Wed, Apr 4, 2012 at 3:42 AM, Cesar Martin <cmartinp@gmail.com> wrote: >> >> I have noticed that since I changed the setting vm.zone_reclaim_mode = 0, >> swap is totally full. Do you recommend me disable swap? > > Yes

Careful about that - it depends on how you disable it.

Setting 'vm.swappiness = 0' is a good idea, don't remove the swap (I've been bitten by the vm.overcommit=2 without a swap repeatedly).

T.
On Wed, Apr 4, 2012 at 7:20 AM, Tomas Vondra <tv@fuzzy.cz> wrote:

> Careful about that - it depends on how you disable it. > > Setting 'vm.swappiness = 0' is a good idea, don't remove the swap (I've > been bitten by the vm.overcommit=2 without a swap repeatedly).

I've had far more problems with swap on and swappiness set to 0 than with swap off. But this has always been on large-memory machines with 64 to 256G of memory. Even with fairly late model linux kernels (i.e. 10.04 LTS through 11.04) I've watched kswapd start up swapping hard on a machine with zero memory pressure and no need for swap. Took about 2 weeks of hard running before kswapd decided to act pathological.

Seen it with swap on, with swappiness at 0, and overcommit at either 0 or 2 on big machines. Once we just took the swap partitions away, the machines ran fine.
On Wed, Apr 4, 2012 at 1:22 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:

> Even with fairly late model linux kernels (i.e. > 10.04 LTS through 11.04) I've watched the kswapd start up swapping > hard on a machine with zero memory pressure and no need for swap. > Took about 2 weeks of hard running before kswapd decided to act > pathological.

Perhaps you had some overfull partitions in tmpfs?
On Wed, Apr 4, 2012 at 10:28 AM, Claudio Freire <klaussfreire@gmail.com> wrote:

> Perhaps you had some overfull partitions in tmpfs?

Nope. Didn't use tmpfs for anything on that machine. Stock Ubuntu 10.04 with Postgres just doing simple but high traffic postgres stuff.
On Wed, Apr 4, 2012 at 10:31 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:

> Nope. Didn't use tmpfs for anything on that machine. Stock Ubuntu > 10.04 with Postgres just doing simple but high traffic postgres stuff.

Just to clarify, the machine had 128G RAM and about 95G of it was kernel cache, the rest used by shared memory (set to 4G) and postgresql.
On 4.4.2012 18:22, Scott Marlowe wrote:

> I've had far more problems with swap on and swappiness set to 0 than > with swap off. But this has always been on large memory machines with > 64 to 256G memory. Even with fairly late model linux kernels (i.e. > 10.04 LTS through 11.04) I've watched the kswapd start up swapping > hard on a machine with zero memory pressure and no need for swap. > Took about 2 weeks of hard running before kswapd decided to act > pathological. > > Seen it with swap on, with swappiness to 0, and overcommit to either 0 > or 2 on big machines. Once we just took the swap partitions away it > the machines ran fine.

I've experienced the issues in exactly the opposite case - machines with very little memory (like a VPS with 512MB of RAM). I did want to operate that machine without a swap, yet it kept failing because of OOM errors or panicking (depending on the overcommit ratio value). It turns out it's quite difficult (almost impossible) to tune the VM for a swap-less case.

In the end I just added 256MB of swap and everything started to work fine - the funny thing is the swap is not used at all (according to sar).

T.
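[Editor's note: the two positions in this sub-thread translate into two different configurations; a hedged sketch of both - which one is right depends on the machine, as the discussion itself shows:]

```shell
# Option A (Tomas): keep a small swap but make the kernel reluctant to use it.
#   In /etc/sysctl.conf:
#     vm.swappiness = 0
#   (keep vm.overcommit_memory = 2 only while swap exists)

# Option B (Scott): drop swap entirely on big-memory boxes (as root):
#   swapoff -a
#   # ...and comment out the swap entry in /etc/fstab
```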
On Wed, Apr 4, 2012 at 4:42 AM, Cesar Martin <cmartinp@gmail.com> wrote:

> Yesterday I changed the kernel setting, that said > Scott, vm.zone_reclaim_mode = 0. I have done new benchmarks and I have > noticed changes at least in Postgres: > [EXPLAIN ANALYZE output snipped in this copy, see the original message above] > > Furthermore the results of dd are strange: > > dd if=/dev/zero of=/vol02/bonnie/DD bs=8M count=16384 > 16384+0 records in > 16384+0 records out > 137438953472 bytes (137 GB) copied, 803,738 s, 171 MB/s > > 171 MB/s I think is bad value for 12 SAS RAID10... > And when I execute iostat > during the dd execution i obtain results like: > [iostat samples snipped, see the original message above] > > I don't understand why MB_wrtn/s go from 0 to near 800MB/s constantly during > execution.

This is looking more and more like a raid controller issue. ISTM it's bucking the cache, filling it up and flushing it synchronously. Your read results are ok but not what they should be IMO. Maybe it's an environmental issue or the card is just a straight up lemon (no surprise in the dell line). Are you using standard drivers, and have you checked for updates? Have you considered contacting dell support?

merlin
A RAID controller or driver problem was the first thing I looked into.
I installed Centos 5.4 at the beginning, but I had performance problems, and I contacted Dell support... but Centos is not supported by Dell... Then I installed Redhat 6 and we contacted Dell with the same problem.
Dell says that everything is right and that this is a software problem.
I have installed Centos 5.4, 6.2 and Redhat 6 with similar results, so I don't think it is a driver problem (megaraid_sas kernel module).
I will check kernel updates...
Thanks!
PS. Lately I'm pretty disappointed with the quality of the DELL components; this is not the first problem we have had with hardware in new machines.
On Wed, Apr 4, 2012 at 12:46 PM, Cesar Martin <cmartinp@gmail.com> wrote:

> Raid controller issue or driver problem was the first problem that I > studied. > I installed Centos 5.4 al the beginning, but I had performance problems, and > I contacted Dell support... but Centos is not support by Dell... Then I > installed Redhat 6 and we contact Dell with same problem. > Dell say that all is right and that this is a software problem. > I have installed Centos 5.4, 6.2 and Redhat 6 with similar result, I think > that not is driver problem (megasas-raid kernel module). > I will check kernel updates... > Thanks!

Look for firmware updates to your RAID card.
On Wed, Apr 4, 2012 at 1:55 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:

> Look for firmware updates to your RAID card.

Already checked that, look here:

http://www.dell.com/support/drivers/us/en/04/DriverDetails?DriverId=R269683&FileId=2731095787&DriverName=Dell%20PERC%20H800%20Adapter%2C%20v.12.3.0-0032%2C%20A02&urlProductCode=False

The latest update is July 2010. I've been down this road with Dell many times and I would advise RMAing the whole server -- that will at least get their attention. Dell performance/software support is worthless and it's a crying shame blowing 10 grand on a server only to have it underperform your 3-year-old workhorse.

merlin
On 4.4.2012 20:46, Cesar Martin wrote:

> Raid controller issue or driver problem was the first problem that I > studied. > I installed Centos 5.4 al the beginning, but I had performance problems, > and I contacted Dell support... but Centos is not support by Dell... > Then I installed Redhat 6 and we contact Dell with same problem. > Dell say that all is right and that this is a software problem. > I have installed Centos 5.4, 6.2 and Redhat 6 with similar result, I > think that not is driver problem (megasas-raid kernel module). > I will check kernel updates... > Thanks!

Well, there are different meanings of 'working'. Obviously you mean 'gives reasonable performance' while Dell understands 'is not on fire'.

IIRC the H800 is just a 926x controller from LSI, so it's probably based on the LSI 2108. Can you post basic info about the settings, i.e.

MegaCli -AdpAllInfo -aALL

or something like that? I'm especially interested in the access/cache policies, cache drop interval etc., i.e.

MegaCli -LDGetProp (-Cache | -Access | -Name | -DskCache)

What I'd do next is test a much smaller array (even a single drive) to see if the issue exists. If it works, try to add another drive, etc. It's much easier to show them something's wrong. The simpler the test case, the better.

I've found this (it's about a 2108-based controller from LSI):

http://www.xbitlabs.com/articles/storage/display/lsi-megaraid-sas9260-8i_3.html#sect0

The paragraphs below the diagram are interesting. Not sure if they describe the same issue you have, but maybe it's related.

Anyway, it's quite usual that a RAID controller has about 50% write performance compared to read performance, usually due to an on-board CPU bottleneck. You do have ~530 MB/s and 170 MB/s, so it's not exactly 50% but it's not very far off.

But the fluctuation, that surely is strange. What are the page cache dirty limits, i.e.

cat /proc/sys/vm/dirty_background_ratio
cat /proc/sys/vm/dirty_ratio

That's probably the #1 source I've seen responsible for such issues (on machines with a lot of RAM).

Tomas
> From: Tomas Vondra <tv@fuzzy.cz>
> But the fluctuation, that surely is strange. What are the page cache
> dirty limits, i.e.
>
> cat /proc/sys/vm/dirty_background_ratio
> cat /proc/sys/vm/dirty_ratio
>
> That's probably #1 source I've seen responsible for such issues (on
> machines with a lot of RAM).
>
+1 on that.
We're running similar 32-core Dell servers with H700s and 128GB RAM.
With those at the defaults (I don't recall if it's 5 and 10 respectively) you're looking at 3.2GB of dirty pages before pdflush flushes them, and 6.4GB before the process is forced to flush them itself.
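The arithmetic behind those thresholds is simple enough to sketch. The 64 GiB figure below matches the new server in this thread; the ratios are the kernel defaults mentioned later (10 and 20):

```shell
# Sketch: how dirty_*_ratio translates into bytes of dirty page cache
# before the kernel reacts. 64 GiB RAM, default ratios 10/20.
ram_bytes=$((64 * 1024 * 1024 * 1024))
bg_bytes=$((ram_bytes * 10 / 100))   # background flushing (pdflush) starts here
fg_bytes=$((ram_bytes * 20 / 100))   # writers are forced to flush synchronously here
echo "background flush above: $((bg_bytes / 1024 / 1024)) MiB"
echo "forced flush above:     $((fg_bytes / 1024 / 1024)) MiB"
```

Roughly 6.4 GiB and 12.8 GiB of dirty data can pile up before anything is forced out to the controller, which is far more than the 1GB of controller cache.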
On 5.4.2012 17:17, Cesar Martin wrote:
> Well, I have installed megacli on the server and attach the results in the
> file megacli.txt. Also we have "Dell Open Manage" installed on the server,
> which can generate a log of the H800. I attach it to this mail with the
> name lsi_0403.
>
> About dirty limits, I have the default values:
> vm.dirty_background_ratio = 10
> vm.dirty_ratio = 20
>
> I have compared with other servers and the values are the same, except on
> the centos 5.4 database production server, which has vm.dirty_ratio = 40

Do the other machines have the same amount of RAM? The point is that values that work with less memory don't work that well with large amounts of memory (and the amount of RAM did grow a lot recently).

For example, a few years ago the average amount of RAM was ~8GB. In that case

  vm.dirty_background_ratio = 10  => 800MB
  vm.dirty_ratio = 20             => 1600MB

which is all peachy if you have a decent controller with a write cache. But turn that into 64GB and suddenly

  vm.dirty_background_ratio = 10  => 6.4GB
  vm.dirty_ratio = 20             => 12.8GB

The problem is that there'll be a lot of data waiting (for 30 seconds by default), and then suddenly the kernel starts writing all of it to the controller. Such systems behave just like yours - short strokes of writes interleaved with 'no activity'.

Greg Smith wrote a nice howto about this - it's from 2007 but all the recommendations are still valid:

http://www.westnet.com/~gsmith/content/linux-pdflush.htm

TL;DR:

- decrease dirty_background_ratio/dirty_ratio (or use *_bytes)
- consider decreasing dirty_expire_centisecs

T.
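Those two TL;DR items can be applied at runtime with sysctl. The byte values below are illustrative starting points only, not figures from the thread; they would need validating against the actual workload:

```shell
# Illustrative only: cap dirty memory in absolute bytes instead of a
# percentage of 64 GB RAM, and expire dirty pages sooner.
# Run as root; persist in /etc/sysctl.conf once values are validated.
sysctl -w vm.dirty_background_bytes=268435456   # 256 MiB
sysctl -w vm.dirty_bytes=1073741824             # 1 GiB
sysctl -w vm.dirty_expire_centisecs=1000        # 10 seconds (default 3000)
```

Setting the *_bytes variants automatically zeroes the corresponding *_ratio settings, so the limits stop scaling with RAM.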
On Thu, Apr 5, 2012 at 10:49 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 5.4.2012 17:17, Cesar Martin wrote:
>> About dirty limits, I have the default values:
>> vm.dirty_background_ratio = 10
>> vm.dirty_ratio = 20
>
> [...]
>
> TL;DR:
>
> - decrease dirty_background_ratio/dirty_ratio (or use *_bytes)
> - consider decreasing dirty_expire_centisecs

The original problem is a read-based performance issue, though, and this will not have any effect on that whatsoever (although it's still excellent advice). Also, dd should bypass the o/s buffer cache. I'm still pretty much convinced that there is a fundamental performance issue with the raid card that dell needs to explain.

merlin
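As an aside, GNU dd only bypasses the page cache when told to. A scaled-down sketch of a cache-free read test follows; the file name and sizes are illustrative (the thread read 137 GB from /vol02/bonnie/DD with bs=8M count=16384). Without iflag=direct, a re-read can be served from RAM and inflate the numbers:

```shell
# Create a small test file, then read it back with O_DIRECT so the page
# cache cannot serve the re-read. dd prints its throughput to stderr.
f=ddtest.bin
dd if=/dev/zero of="$f" bs=1M count=32 conv=fsync 2>/dev/null
dd if="$f" of=/dev/null bs=1M iflag=direct
rm -f "$f"
```

Alternatively, dropping the cache between runs (`echo 3 > /proc/sys/vm/drop_caches`, as root) achieves much the same for a one-off benchmark.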
On 5.4.2012 20:43, Merlin Moncure wrote:
> The original problem is a read-based performance issue, though, and this
> will not have any effect on that whatsoever (although it's still
> excellent advice). Also, dd should bypass the o/s buffer cache. I'm
> still pretty much convinced that there is a fundamental performance
> issue with the raid card that dell needs to explain.

Well, there are two issues IMHO.

1) Read performance that's not exactly as good as one would expect from a
   12 x 15k SAS RAID10 array. Given that the 15k Cheetah drives usually
   give something like 170 MB/s for sequential reads/writes, I'd definitely
   expect more than 533 MB/s when reading the data - at least something
   near 1GB/s (equal to 6 drives).

   Hmm, the dd read performance seems to grow over time - I wonder if this
   is the issue with the adaptive read policy, as mentioned in the xbitlabs
   report.

   Cesar, can you set the read policy to 'read ahead'

     megacli -LDSetProp RA -LALL -aALL

   or maybe 'no read-ahead'

     megacli -LDSetProp NORA -LALL -aALL

   It's worth a try; maybe it somehow conflicts with the way the kernel
   handles read-ahead or something. I find these adaptive heuristics a bit
   unpredictable ...

   Another thing - I see that patrol reads are enabled. Can you disable
   them and see how that affects the performance?

2) Write performance behaviour, which is much more suspicious ... Not sure
   if it's related to the read performance issues.

Tomas
Hi,
Today I'm doing new benchmarks with RA, NORA, WB and WT in the controller:
With NORA
-----------------
dd if=/vol02/bonnie/DD of=/dev/null bs=8M count=16384
16384+0 records in
16384+0 records out
137438953472 bytes (137 GB) copied, 318,306 s, 432 MB/s
With RA
------------
dd if=/vol02/bonnie/DD of=/dev/null bs=8M count=16384
16384+0 records in
16384+0 records out
137438953472 bytes (137 GB) copied, 179,712 s, 765 MB/s
dd if=/vol02/bonnie/DD of=/dev/null bs=8M count=16384
16384+0 records in
16384+0 records out
137438953472 bytes (137 GB) copied, 202,948 s, 677 MB/s
dd if=/vol02/bonnie/DD of=/dev/null bs=8M count=16384
16384+0 records in
16384+0 records out
137438953472 bytes (137 GB) copied, 213,157 s, 645 MB/s
With Adaptive RA
-----------------
[root@cltbbdd01 ~]# dd if=/vol02/bonnie/DD of=/dev/null bs=8M count=16384
16384+0 records in
16384+0 records out
137438953472 bytes (137 GB) copied, 169,533 s, 811 MB/s
[root@cltbbdd01 ~]# dd if=/vol02/bonnie/DD of=/dev/null bs=8M count=16384
16384+0 records in
16384+0 records out
137438953472 bytes (137 GB) copied, 207,223 s, 663 MB/s
The differences between runs of the same test under the same conditions are very strange... It seems that adaptive read-ahead is the best option.
For the write tests, I applied tuned-adm throughput-performance, which changes the IO elevator to deadline and raises vm.dirty_ratio to 40...
With WB
-------------
dd if=/dev/zero of=/vol02/bonnie/DD bs=8M count=16384
16384+0 records in
16384+0 records out
137438953472 bytes (137 GB) copied, 539,041 s, 255 MB/s
dd if=/dev/zero of=/vol02/bonnie/DD bs=8M count=16384
16384+0 records in
16384+0 records out
137438953472 bytes (137 GB) copied, 505,695 s, 272 MB/s
Enforce WB
-----------------
dd if=/dev/zero of=/vol02/bonnie/DD bs=8M count=16384
16384+0 records in
16384+0 records out
137438953472 bytes (137 GB) copied, 662,538 s, 207 MB/s
With WT
--------------
dd if=/dev/zero of=/vol02/bonnie/DD bs=8M count=16384
16384+0 records in
16384+0 records out
137438953472 bytes (137 GB) copied, 750,615 s, 183 MB/s
I think these results are more logical... WT gives poor performance, and the differences within the same test are minimal.
Later I ran a pair of dd's at the same time:
dd if=/dev/zero of=/vol02/bonnie/DD2 bs=8M count=16384
16384+0 records in
16384+0 records out
137438953472 bytes (137 GB) copied, 633,613 s, 217 MB/s
dd if=/dev/zero of=/vol02/bonnie/DD bs=8M count=16384
16384+0 records in
16384+0 records out
137438953472 bytes (137 GB) copied, 732,759 s, 188 MB/s
It's very strange that with parallel dd's I get ~400 MB/s aggregate. It's as if CentOS had a per-process limit on IO throughput...
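A per-process ceiling is easy to probe with a scaled-down version of this experiment (sizes shrunk from the thread's bs=8M count=16384; file names illustrative). If the aggregate throughput of two streams clearly beats a single stream, the bottleneck is per process rather than in the array:

```shell
# Time one write stream, then two in parallel; `time` reports each phase.
time dd if=/dev/zero of=dd_a.bin bs=1M count=32 conv=fsync 2>/dev/null
time ( dd if=/dev/zero of=dd_b.bin bs=1M count=32 conv=fsync 2>/dev/null &
       dd if=/dev/zero of=dd_c.bin bs=1M count=32 conv=fsync 2>/dev/null &
       wait )
rm -f dd_a.bin dd_b.bin dd_c.bin
```

Dirty-page writeback settings are one plausible source of such a per-process throttle: once a single writer crosses vm.dirty_ratio, that writer is forced into synchronous flushing while others proceed.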
On April 5, 2012 at 22:06, Tomas Vondra <tv@fuzzy.cz> wrote:

On 5.4.2012 20:43, Merlin Moncure wrote:
> The original problem is read based performance issue though and this
> will not have any affect on that whatsoever (although it's still
> excellent advice). Also dd should bypass the o/s buffer cache. I
> still pretty much convinced that there is a fundamental performance
> issue with the raid card dell needs to explain.

Well, there are two issues IMHO.
1) Read performance that's not exactly as good as one'd expect from a
12 x 15k SAS RAID10 array. Given that the 15k Cheetah drives usually
give like 170 MB/s for sequential reads/writes. I'd definitely
expect more than 533 MB/s when reading the data. At least something
near 1GB/s (equal to 6 drives).
Hmm, the dd read performance seems to grow over time - I wonder if
this is the issue with adaptive read policy, as mentioned in the
xbitlabs report.
Cesar, can you set the read policy to a 'read ahead'
megacli -LDSetProp RA -LALL -aALL
or maybe 'no read-ahead'
megacli -LDSetProp NORA -LALL -aALL
It's worth a try, maybe it somehow conflicts with the way kernel
handles read-ahead or something. I find these adaptive heuristics
a bit unpredictable ...
Another thing - I see the patrol reads are enabled. Can you disable
that and try how that affects the performance?
2) Write performance behaviour, that's much more suspicious ...
Not sure if it's related to the read performance issues.
Tomas
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
César Martín Pérez
cmartinp@gmail.com
Hi,
Finally, the problem was the BIOS configuration. DBPM was set to "Active Power Controller"; I changed this to "Max Performance". http://en.community.dell.com/techcenter/power-cooling/w/wiki/best-practices-in-power-management.aspx
Now write speed is 550 MB/s and read is 1.1 GB/s.
Thank you all for your advice.
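For anyone chasing the same symptom, the OS-side view of power management is worth checking before (or alongside) the BIOS. A minimal sketch, assuming a Linux cpufreq sysfs layout (exact paths vary by kernel and driver):

```shell
# Read the cpufreq governor for cpu0, if the driver exposes one. A value
# like "ondemand" or "powersave", combined with a BIOS profile such as
# "Active Power Controller", can explain throttled throughput.
governor=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor 2>/dev/null \
           || echo "not exposed")
echo "cpu0 governor: $governor"
# Cross-check the current clock against the model's nominal frequency.
grep -m1 'cpu MHz' /proc/cpuinfo || echo "cpu MHz: not listed"
```

If the clock sits well below nominal while the box is under load, power management is throttling the CPUs regardless of what the RAID card is doing.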
César Martín Pérez
cmartinp@gmail.com
On Mon, Apr 16, 2012 at 8:13 AM, Cesar Martin <cmartinp@gmail.com> wrote:
> Hi,
>
> Finally, the problem was the BIOS configuration. DBPM was set to "Active
> Power Controller"; I changed this to "Max Performance".
> http://en.community.dell.com/techcenter/power-cooling/w/wiki/best-practices-in-power-management.aspx
> Now write speed is 550MB/s and read 1.1GB/s.

Why in the world would a server be delivered to a customer with such a setting turned on? ugh.
On Mon, Apr 16, 2012 at 10:45 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Mon, Apr 16, 2012 at 8:13 AM, Cesar Martin <cmartinp@gmail.com> wrote:
>> Hi,
>>
>> Finally, the problem was the BIOS configuration. DBPM was set to "Active
>> Power Controller"; I changed this to "Max Performance".
>> http://en.community.dell.com/techcenter/power-cooling/w/wiki/best-practices-in-power-management.aspx
>> Now write speed is 550MB/s and read 1.1GB/s.
>
> Why in the world would a server be delivered to a customer with such a
> setting turned on? ugh.

likely informal pressure to reduce power consumption. anyways, this verifies my suspicion that it was a dell problem. in my dealings with them, you truly have to threaten to send the server back before the solution magically appears. don't spend time and money playing their 'qualified environment' game -- it never works... just tell them to shove it.

there are a number of second tier vendors that give good value and allow you to do things like install your own disk drives without getting your support terminated. of course, you lose the 'enterprise support', to which I give a value of approximately zero.

merlin
On Mon, Apr 16, 2012 at 10:08 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> likely informal pressure to reduce power consumption. anyways, this
> verifies my suspicion that it was a dell problem. in my dealings with
> them, you truly have to threaten to send the server back then the
> solution magically appears. don't spend time and money playing their
> 'qualified environment' game -- it never works... just tell them to
> shove it.
>
> there are a number of second tier vendors that give good value and
> allow you to do things like install your own disk drives without
> getting your support terminated. of course, you lose the 'enterprise
> support', to which I give a value of approximately zero.

Dell's support never even came close to what I used to get from Aberdeen.
On Mon, Apr 16, 2012 at 10:31 AM, Glyn Astill <glynastill@yahoo.co.uk> wrote:
> Because it's Dell and that's what they do.
>
> When our R910s arrived, despite them knowing what we were using them for,
> they'd installed the memory to use only one channel per cpu. Buried deep
> in their manual I discovered that they called this "power optimised" mode,
> and I had to buy a whole extra bunch of risers to be able to use all of
> the channels properly.
>
> If it wasn't for proper load testing, and Greg Smith's stream scaling
> tests, I don't think I'd even have spotted it.

See, that's where a small technically knowledgeable supplier is so great. "No you don't want 8 8G dimms, you want 16 4G dimms." etc.
> From: Scott Marlowe <scott.marlowe@gmail.com>
> On Mon, Apr 16, 2012 at 8:13 AM, Cesar Martin <cmartinp@gmail.com> wrote:
>> Hi,
>>
>> Finally, the problem was the BIOS configuration. DBPM was set to "Active
>> Power Controller"; I changed this to "Max Performance".
>> http://en.community.dell.com/techcenter/power-cooling/w/wiki/best-practices-in-power-management.aspx
>> Now write speed is 550MB/s and read 1.1GB/s.
>
> Why in the world would a server be delivered to a customer with such a
> setting turned on? ugh.

Because it's Dell and that's what they do.

When our R910s arrived, despite them knowing what we were using them for, they'd installed the memory to use only one channel per cpu. Buried deep in their manual I discovered that they called this "power optimised" mode, and I had to buy a whole extra bunch of risers to be able to use all of the channels properly.

If it wasn't for proper load testing, and Greg Smith's stream scaling tests, I don't think I'd even have spotted it.