Thread: crosstab speed
ok, I have an application that I am trying to speed up. It's a reporting application that makes heavy use of the crosstab function.
Here is some of the setup / configuration details:
Postgres 8.3.3
RedHat Enterprise 5.2 (2.6.18 kernel)
Sun x4600, 8 dual-core Opteron 8218 processors, 32GB RAM, StorageTek SAN
6 15k FC disks raid 10 for data,
2 15k FC disks raid 1 for xlog,
2 10k SAS disks raid 1 for OS
The table that I am querying has just under 600k records, 55 columns, and 30 indexes.
The table is not static; there are several hundred inserts a day into it.
This is not the only application that uses Postgres on this server. There are several other transactional apps as well.
Here is an example query:
select "COL_HEAD"[1] as site,
       "COL_HEAD"[2] as product_line_description,
       "COL_HEAD"[3] as report_sls,
       "COL_HEAD"[4] as fy_period,
       "2006", "2007", "2008", "2009"
from public.crosstab(
    'select ARRAY[site::text, product_line_description::text, report_sls::text, fy_period::text] as COL_HEADER,
            fy_year, sum(invoice_value)
     from order_data_tbl
     where fy_year is not null
     group by site::text, product_line_description::text, report_sls::text, fy_period::text, fy_year
     order by site::text, product_line_description::text, report_sls::text, fy_period::text',
    'select fy_year from order_data_tbl where fy_year is not null group by fy_year order by fy_year'
) as order_data_tbl("COL_HEAD" text[], "2006" numeric(20,2), "2007" numeric(20,2), "2008" numeric(20,2), "2009" numeric(20,2))
The crosstab function is taking between 5 and 15 seconds to return. While the query is running, one of the cores will be close to 100%, but watching iostat makes me believe that the entire table is cached and none of it is being read from disk. Depending on what report is being run, the indexes may or may not be of any assistance. In the above query the planner does not use an index. Depending on what the user is looking for, some indexes will be used because more is specified in the where clause, at which point the query time can be under two seconds. The problem is that most reports generated with this application don't have a where clause. Are there any changes that I can make to my config to speed up these huge aggregating queries?
Here is my postgresql.conf
max_connections = 1500
shared_buffers = 8GB
work_mem = 2GB
maintenance_work_mem = 8GB
max_fsm_pages = 2048000
wal_buffers = 1024kB
checkpoint_segments = 256
checkpoint_timeout = 10min
effective_cache_size = 20GB
default_statistics_target = 100
log_destination = 'stderr'
logging_collector = on
log_directory = 'pg_log'
log_truncate_on_rotation = on
log_rotation_age = 1d
log_rotation_size = 1GB
log_error_verbosity = default
autovacuum = on
autovacuum_max_workers = 9
datestyle = 'iso, mdy'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
default_text_search_config = 'pg_catalog.english'
synchronize_seqscans = on
log_min_duration_statement = 250
-Jeremiah Elliott
On Thu, Nov 13, 2008 at 1:42 PM, Jeremiah Elliott <jeremiah0@gmail.com> wrote:
> ok, I have an application that I am trying to speed up. Its a reporting
> application that makes heavy use of the crosstab function.

<snip>

> here is an example query

<snip>

Providing explain analyze output for that would probably help.

> The crosstab function is taking between 5 and 15 seconds to return. While
> the query is running one of the cores will be close to 100%, but watching
> iostat makes me believe that the entire table is cached and none of it is
> being read from disk.
> Depending on what report is being run the indexes may or may not be of any
> assistance. In the above query the planner does not use an index.

<snip>

> The problem is that most reports that get generated with this application
> don't have a where clause. Are there any changes that I can make to my
> config to speed up these huge aggregating queries?

Either get a faster CPU (an incremental change at best), rethink your queries, or pre-create the output ahead of time with either a materialized view or a table your app knows to use. Most other options won't help that much if you're running over a metric ton of data at a shot.

> Here is my postgresql.conf
>
> max_connections = 1500
> work_mem = 2GB

These two settings are kind of incompatible. It means you expect upwards of a thousand users, and each one can grab 2GB for each sort it runs. If they're large datasets with multiple sorts required, even a handful of queries could put your machine in a swap storm, and basically your own queries would DoS the machine. It's better, if you have a lot of users who don't need large work_mem, to set it to something more sane, like 2 or 4 meg, and then issue a set work_mem=xxxx when you run your single monstrous query.

> maintenance_work_mem = 8GB
> autovacuum_max_workers = 9

These two are also quite dangerous together, as each worker can grab 8 gigs at a time. (Someone correct me if I'm wrong, but I'm pretty sure maintenance_work_mem is per vacuum worker.) Generally you'll not see a big return after the first few hundred megabytes. Same goes for work_mem.

If you repeat the same basic query, or parts of it, over and over, it may be faster to look into building some materialized views on top of the tables to use for that.
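The per-query work_mem override suggested above might look like this (the 256MB value is only an illustrative guess; tune it to your actual sort sizes):

```sql
-- postgresql.conf: keep the server-wide default small so the
-- transactional apps can't collectively exhaust RAM
-- work_mem = 4MB

-- In the reporting session only, raise it just before the big query:
SET work_mem = '256MB';
-- ... run the crosstab query here ...
RESET work_mem;  -- return to the server default for the rest of the session
```

SET only affects the current session, so the other applications sharing the server keep the small default.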
Jonathan Gardner wrote an excellent tutorial on how to "roll your own" that's located here: http://jonathangardner.net/tech/w/PostgreSQL/Materialized_Views
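PostgreSQL 8.3 has no built-in materialized views, so "rolling your own" along the lines of that tutorial amounts to a snapshot table refreshed on a schedule. A minimal sketch (the table name and refresh policy are made up for illustration):

```sql
-- Snapshot table holding the pre-aggregated rows the reports need.
CREATE TABLE order_data_summary AS
SELECT site::text, product_line_description::text, report_sls::text,
       fy_period::text, fy_year, sum(invoice_value) AS invoice_value
FROM order_data_tbl
WHERE fy_year IS NOT NULL
GROUP BY 1, 2, 3, 4, 5;

-- Refresh from cron (e.g. nightly). TRUNCATE + re-INSERT inside one
-- transaction so readers never observe an empty table.
BEGIN;
TRUNCATE order_data_summary;
INSERT INTO order_data_summary
SELECT site::text, product_line_description::text, report_sls::text,
       fy_period::text, fy_year, sum(invoice_value)
FROM order_data_tbl
WHERE fy_year IS NOT NULL
GROUP BY 1, 2, 3, 4, 5;
COMMIT;
```

With only a few hundred inserts a day, a nightly (or even hourly) full rebuild of 600k source rows should be cheap, and the crosstab can then read from the much smaller summary table.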
Jeremiah Elliott wrote:
> ok, I have an application that I am trying to speed up. Its a reporting
> application that makes heavy use of the crosstab function.

<snip>

> here is an example query
>
> The crosstab function is taking between 5 and 15 seconds to return.

Please run the two embedded queries independently, i.e.

select ARRAY[site::text, product_line_description::text, report_sls::text, fy_period::text] as COL_HEADER,
       fy_year,
       sum(invoice_value)
from order_data_tbl
where fy_year is not null
group by site::text, product_line_description::text, report_sls::text, fy_period::text, fy_year
order by site::text, product_line_description::text, report_sls::text, fy_period::text;

-- and --

select fy_year
from order_data_tbl
where fy_year is not null
group by fy_year
order by fy_year;

How long does each take? crosstab cannot run any faster than the sum of these two queries run on their own. If the second one doesn't change often, can you pre-calculate it, perhaps once a day?

Joe
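One way to time the pieces, as suggested upthread, is EXPLAIN ANALYZE, which actually executes the query and reports where the time goes. For example, for the second (category) query:

```sql
EXPLAIN ANALYZE
SELECT fy_year
FROM order_data_tbl
WHERE fy_year IS NOT NULL
GROUP BY fy_year
ORDER BY fy_year;
```

The "actual time" figures in the output show per-node cost, so you can see whether the time is going into the aggregation itself or the sort, which is what determines whether any config change can help.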