Thread: Performance options for CPU-bound multi-SUM query
I have a warehousing case where data is bucketed by a key consisting of an hourly timestamp and 3 other columns. In addition, there are 32 numeric columns. The tables are partitioned on regular date ranges and aggregated to the lowest usable resolution.
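For concreteness, the shape is roughly as follows; the table and column names here are illustrative, not the real ones:

    -- Hourly bucket plus 3 other key columns, and 32 numeric measures.
    -- Date-range partitioning is done with inheritance children (not shown).
    CREATE TABLE metrics (
        bucket_ts  timestamp NOT NULL,  -- hourly timestamp bucket
        key1       integer   NOT NULL,
        key2       integer   NOT NULL,
        key3       integer   NOT NULL,
        c01        numeric,
        c02        numeric,
        -- ... c03 through c31 elided ...
        c32        numeric,
        PRIMARY KEY (bucket_ts, key1, key2, key3)
    );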
The principal use case is to select over a range of dates or hours, filter by the other key columns, and SUM() all 32 of the other columns. The execution plan shows that the primary key index limits row selection efficiently, but the query appears CPU-bound performing all 32 of those SUM() aggregates.
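The hot query takes this form (again with illustrative names and values):

    SELECT sum(c01), sum(c02), /* ... */ sum(c32)
    FROM   metrics
    WHERE  bucket_ts >= '2016-01-01' AND bucket_ts < '2016-01-08'
    AND    key1 = 42 AND key2 = 7;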
I am looking at a couple of distributed PostgreSQL forks, but until those reach feature parity with 9.5 I am hoping to stay with single-node PostgreSQL.
Are there any other options I can use to improve query times?
The server has 64GB RAM, with work_mem set to 1GB. Storage is all SSD, with separate RAID-10 volumes for OS, data, and indexes. Additional settings beyond the defaults are as follows.
default_statistics_target = 500
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
effective_cache_size = 48GB
work_mem = 1GB
wal_buffers = 16MB
checkpoint_segments = 128
shared_buffers = 16GB
max_connections = 20
Matt <bsg075@gmail.com> wrote:
> The principal use case is to select over a range of dates or hours, filter by
> the other key columns, and SUM() all 32 of the other columns. The execution
> plan shows that the primary key index limits row selection efficiently, but
> the query appears CPU-bound performing all 32 of those SUM() aggregates.
>
> Are there any other options I can use to improve query times?

Maybe Cybertec's agg() patch for parallel aggregation, see
http://www.cybertec.at/postgresql_produkte/agg-parallele-aggregierungen-fuer-postgresql/
(and ask Hans for English documentation!) But, I see, it needs 9.5.

Andreas
On 25 January 2016 at 15:45, Matt <bsg075@gmail.com> wrote:
> The principal use case is to select over a range of dates or hours, filter by
> the other key columns, and SUM() all 32 of the other columns. The execution
> plan shows that the primary key index limits row selection efficiently, but
> the query appears CPU-bound performing all 32 of those SUM() aggregates.

SUM(numeric) also has to work quite a bit harder than an aggregate like
sum(float8), since addition on numeric is implemented in software rather than
in hardware. It depends on the use case, but in some cases the float4 or
float8 types might be an option, and they offer much faster aggregation.
There is also https://github.com/2ndQuadrant/fixeddecimal, which may be of
some use if you need fixed precision up to a predefined scale. We found that
using fixeddecimal instead of numeric for the TPC-H benchmark improved the
performance of query 1 significantly.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
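To make the suggested change concrete, a minimal sketch, assuming the illustrative metrics table and c01 ... c32 column names from above (and assuming the extension's fixeddecimal type name):

    -- Convert the measure columns from numeric to float8 (double precision).
    -- Batching the conversions into one ALTER TABLE means the table is
    -- rewritten once rather than once per column.
    ALTER TABLE metrics
        ALTER COLUMN c01 TYPE float8,
        ALTER COLUMN c02 TYPE float8,
        -- ... c03 through c31 elided ...
        ALTER COLUMN c32 TYPE float8;

    -- Or, with the fixeddecimal extension built and installed:
    -- CREATE EXTENSION fixeddecimal;
    -- ALTER TABLE metrics ALTER COLUMN c01 TYPE fixeddecimal, ...;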
Moving from NUMERIC to FLOAT(8) did indeed lower query times by about 20%.

I will try fixeddecimal and agg() as time permits.

On 25 Jan 2016, at 4:44, David Rowley wrote:
> It depends on the use case, but in some cases the float4 or float8 types
> might be an option, and they offer much faster aggregation. There is also
> https://github.com/2ndQuadrant/fixeddecimal, which may be of some use if
> you need fixed precision up to a predefined scale.
On 28 January 2016 at 08:41, Matt <bsg075@gmail.com> wrote:
> Moving from NUMERIC to FLOAT(8) did indeed lower query times by about 20%.
>
> I will try fixeddecimal and agg() as time permits.

That's surprisingly little gain. Please note that you'll not gain any further
improvement from the fixeddecimal type beyond what you have already gotten
from float8. My tests showed that it is very slightly slower than float8,
possibly because float8 addition does not have to check for overflow.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
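A rough way to reproduce the relative cost of the aggregates themselves, as a micro-benchmark sketch on a scratch table in psql:

    -- 5 million rows with the same value stored as numeric and as float8.
    CREATE TEMP TABLE bench AS
    SELECT g::numeric AS n, g::float8 AS f
    FROM   generate_series(1, 5000000) AS g;

    \timing on
    SELECT sum(n) FROM bench;  -- numeric addition is implemented in software
    SELECT sum(f) FROM bench;  -- float8 addition maps to a hardware instruction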