Table Partitions / Partial Indexes - Mailing list pgsql-performance
From | Mike C |
---|---|
Subject | Table Partitions / Partial Indexes |
Date | |
Msg-id | bd0eabd0512111807q6b54115q2477c7c2a37f1c1a@mail.gmail.com Whole thread Raw |
Responses |
Re: Table Partitions / Partial Indexes
Re: Table Partitions / Partial Indexes |
List | pgsql-performance |
Hello,
I've got a table with ~60 Million rows and am having performance problems querying it. Disks are setup as 4x10K SCSI 76GB, RAID 1+0. The table is being inserted into multiple times every second of the day, with no updates and every 2nd day we delete 1/60th of the data (as it becomes old). Vacuum analyze is scheduled to run 3 times a day.
Query:
select sum(TOTAL_FROM) as TOTAL_IN, sum(TOTAL_TO) as TOTAL_OUT, SOURCE_MAC from PC_TRAFFIC where FK_DEVICE = 996 and TRAFFIC_DATE >= '2005-10-14 00:00:00' and TRAFFIC_DATE <= '2005-11-13 23:59:59' group by SOURCE_MAC order by 1 desc
Table:
CREATE TABLE PC_TRAFFIC (
PK_PC_TRAFFIC INTEGER NOT NULL,
TRAFFIC_DATE TIMESTAMP NOT NULL,
SOURCE_MAC CHAR(20) NOT NULL,
DEST_IP CHAR(15),
DEST_PORT INTEGER,
TOTAL_TO DOUBLE PRECISION,
TOTAL_FROM DOUBLE PRECISION,
FK_DEVICE SMALLINT,
PROTOCOL_TYPE SMALLINT
);
CREATE INDEX pc_traffic_pkidx ON pc_traffic (pk_pc_traffic);
CREATE INDEX pc_traffic_idx3 ON pc_traffic (fk_device, traffic_date);
Plan:
Sort (cost=76650.58..76650.58 rows=2 width=40)
Sort Key: sum(total_from)
-> HashAggregate (cost=76650.54..76650.57 rows=2 width=40)
-> Bitmap Heap Scan on pc_traffic (cost=534.64..76327.03 rows=43134 width=40)
Recheck Cond: ((fk_device = 996) AND (traffic_date >= '2005-10-01 00:00:00'::timestamp without time zone) AND (traffic_date <= '2005-10-31 23:59:59'::timestamp without time zone))
-> Bitmap Index Scan on pc_traffic_idx3 (cost=0.00..534.64 rows=43134 width=0)
Index Cond: ((fk_device = 996) AND (traffic_date >= '2005-10-01 00:00:00'::timestamp without time zone) AND (traffic_date <= '2005-10-31 23:59:59'::timestamp without time zone))
(7 rows)
CLUSTER on PC_TRAFFIC_IDX3 gives me significantly improved performance:
Sort (cost=39886.65..39886.66 rows=2 width=40)
Sort Key: sum(total_from)
-> HashAggregate (cost=39886.61..39886.64 rows=2 width=40)
-> Index Scan using pc_traffic_idx3 on pc_traffic (cost=0.00..39551.26 rows=44714 width=40)
Index Cond: ((fk_device = 996) AND (traffic_date >= '2005-10-01 00:00:00'::timestamp without time zone) AND (traffic_date <= '2005-10-31 23:59:59'::timestamp without time zone))
(5 rows)
However the clustering is only effective on the first shot. Because of the constant usage of the table we can't perform a vacuum full nor any exclusive lock function.
Would table partitioning/partial indexes help much? Partitioning on date range doesn't make much sense for this setup, where a typical 1-month query spans both tables (as the billing month for the customer might start midway through a calendar month).
Noting that the index scan was quicker than the bitmap, I'm trying to make the indexes smaller/more likely to index scan. I have tried partitioning against fk_device, with 10 child tables. I'm using fk_device % 10 = 1, fk_device % 10 = 2, fk_device % 10 = 3, etc... as the check constraint.
CREATE TABLE pc_traffic_0 (CHECK(FK_DEVICE % 10 = 0)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_1 (CHECK(FK_DEVICE % 10 = 1)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_2 (CHECK(FK_DEVICE % 10 = 2)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_3 (CHECK(FK_DEVICE % 10 = 3)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_4 (CHECK(FK_DEVICE % 10 = 4)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_5 (CHECK(FK_DEVICE % 10 = 5)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_6 (CHECK(FK_DEVICE % 10 = 6)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_7 (CHECK(FK_DEVICE % 10 = 7)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_8 (CHECK(FK_DEVICE % 10 = 8)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_9 (CHECK(FK_DEVICE % 10 = 9)) INHERITS (pc_traffic);
... indexes now look like:
CREATE INDEX pc_traffic_6_idx3 ON pc_traffic_6 (fk_device, traffic_date);
To take advantage of the query my SQL now has to include the mod operation (so the query planner picks up the correct child tables):
select sum(TOTAL_FROM) as TOTAL_IN, sum(TOTAL_TO) as TOTAL_OUT, SOURCE_MAC from PC_TRAFFIC where FK_DEVICE = 996 and FK_DEVICE % 10 = 6 and TRAFFIC_DATE >= '2005-10-14 00:00:00' and TRAFFIC_DATE <= '2005-11-13 23:59:59' group by SOURCE_MAC order by 1 desc
Sorry I would show the plan but I'm rebuilding the dev database atm. It was faster though and did pick up the correct child table. It was also a bitmap scan on the index IIRC.
Would I be better off creating many partial indexes instead of multiple tables AND multiple indexes?
Am I using a horrid method for partitioning the data? (% 10)
Should there be that big of an improvement for multiple tables given that all the data is still stored on the same filesystem?
Any advice on table splitting much appreciated.
Cheers,
Mike C.
I've got a table with ~60 Million rows and am having performance problems querying it. Disks are setup as 4x10K SCSI 76GB, RAID 1+0. The table is being inserted into multiple times every second of the day, with no updates and every 2nd day we delete 1/60th of the data (as it becomes old). Vacuum analyze is scheduled to run 3 times a day.
Query:
select sum(TOTAL_FROM) as TOTAL_IN, sum(TOTAL_TO) as TOTAL_OUT, SOURCE_MAC from PC_TRAFFIC where FK_DEVICE = 996 and TRAFFIC_DATE >= '2005-10-14 00:00:00' and TRAFFIC_DATE <= '2005-11-13 23:59:59' group by SOURCE_MAC order by 1 desc
Table:
CREATE TABLE PC_TRAFFIC (
PK_PC_TRAFFIC INTEGER NOT NULL,
TRAFFIC_DATE TIMESTAMP NOT NULL,
SOURCE_MAC CHAR(20) NOT NULL,
DEST_IP CHAR(15),
DEST_PORT INTEGER,
TOTAL_TO DOUBLE PRECISION,
TOTAL_FROM DOUBLE PRECISION,
FK_DEVICE SMALLINT,
PROTOCOL_TYPE SMALLINT
);
CREATE INDEX pc_traffic_pkidx ON pc_traffic (pk_pc_traffic);
CREATE INDEX pc_traffic_idx3 ON pc_traffic (fk_device, traffic_date);
Plan:
Sort (cost=76650.58..76650.58 rows=2 width=40)
Sort Key: sum(total_from)
-> HashAggregate (cost=76650.54..76650.57 rows=2 width=40)
-> Bitmap Heap Scan on pc_traffic (cost=534.64..76327.03 rows=43134 width=40)
Recheck Cond: ((fk_device = 996) AND (traffic_date >= '2005-10-01 00:00:00'::timestamp without time zone) AND (traffic_date <= '2005-10-31 23:59:59'::timestamp without time zone))
-> Bitmap Index Scan on pc_traffic_idx3 (cost=0.00..534.64 rows=43134 width=0)
Index Cond: ((fk_device = 996) AND (traffic_date >= '2005-10-01 00:00:00'::timestamp without time zone) AND (traffic_date <= '2005-10-31 23:59:59'::timestamp without time zone))
(7 rows)
CLUSTER on PC_TRAFFIC_IDX3 gives me significantly improved performance:
Sort (cost=39886.65..39886.66 rows=2 width=40)
Sort Key: sum(total_from)
-> HashAggregate (cost=39886.61..39886.64 rows=2 width=40)
-> Index Scan using pc_traffic_idx3 on pc_traffic (cost=0.00..39551.26 rows=44714 width=40)
Index Cond: ((fk_device = 996) AND (traffic_date >= '2005-10-01 00:00:00'::timestamp without time zone) AND (traffic_date <= '2005-10-31 23:59:59'::timestamp without time zone))
(5 rows)
However the clustering is only effective on the first shot. Because of the constant usage of the table we can't perform a vacuum full nor any exclusive lock function.
Would table partitioning/partial indexes help much? Partitioning on date range doesn't make much sense for this setup, where a typical 1-month query spans both tables (as the billing month for the customer might start midway through a calendar month).
Noting that the index scan was quicker than the bitmap, I'm trying to make the indexes smaller/more likely to index scan. I have tried partitioning against fk_device, with 10 child tables. I'm using fk_device % 10 = 1, fk_device % 10 = 2, fk_device % 10 = 3, etc... as the check constraint.
CREATE TABLE pc_traffic_0 (CHECK(FK_DEVICE % 10 = 0)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_1 (CHECK(FK_DEVICE % 10 = 1)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_2 (CHECK(FK_DEVICE % 10 = 2)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_3 (CHECK(FK_DEVICE % 10 = 3)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_4 (CHECK(FK_DEVICE % 10 = 4)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_5 (CHECK(FK_DEVICE % 10 = 5)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_6 (CHECK(FK_DEVICE % 10 = 6)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_7 (CHECK(FK_DEVICE % 10 = 7)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_8 (CHECK(FK_DEVICE % 10 = 8)) INHERITS (pc_traffic);
CREATE TABLE pc_traffic_9 (CHECK(FK_DEVICE % 10 = 9)) INHERITS (pc_traffic);
... indexes now look like:
CREATE INDEX pc_traffic_6_idx3 ON pc_traffic_6 (fk_device, traffic_date);
To take advantage of the query my SQL now has to include the mod operation (so the query planner picks up the correct child tables):
select sum(TOTAL_FROM) as TOTAL_IN, sum(TOTAL_TO) as TOTAL_OUT, SOURCE_MAC from PC_TRAFFIC where FK_DEVICE = 996 and FK_DEVICE % 10 = 6 and TRAFFIC_DATE >= '2005-10-14 00:00:00' and TRAFFIC_DATE <= '2005-11-13 23:59:59' group by SOURCE_MAC order by 1 desc
Sorry I would show the plan but I'm rebuilding the dev database atm. It was faster though and did pick up the correct child table. It was also a bitmap scan on the index IIRC.
Would I be better off creating many partial indexes instead of multiple tables AND multiple indexes?
Am I using a horrid method for partitioning the data? (% 10)
Should there be that big of an improvement for multiple tables given that all the data is still stored on the same filesystem?
Any advice on table splitting much appreciated.
Cheers,
Mike C.
pgsql-performance by date: