Thread: table clustering brings joy
This is just an FYI for other people out there with large tables: table clustering sped up my queries from 10-100 times, which I am very happy about. I'm posting this in case it's ever useful to anybody. If someone reading this feels that I did something wrong, let me know.
I conducted tests on a table with 13,982,464 rows that looks like this:
create table clustered_tagged_genes (
pmid integer,
mention text
);
The table is indexed on pmid and mention, and clustered on mention. The statistics target was set to 1000 for each index, and the table was vacuum analyzed. I haven't yet methodically compared varying the statistics target yet.
FYI, there are 1,348,398 distinct mention values and 4,162,269 distinct pmid values in the table.
For comparison, I had a non-clustered version of the same table:
create table tagged_genes (
pmid integer,
mention text
);
This table is indexed on pmid and mention, but not clustered. (Again, statistics targets set to 1000, and table vacuum analyzed).
I tried two different equivalent queries on these tables:
"group-by-query"
SELECT pmid
FROM {tablename}
WHERE mention IN ({term1} [, ...])
GROUP BY pmid
HAVING count(distinct mention) = {num_terms};
and
"intersect-query"
SELECT pmid FROM {table_name} WHERE mention = {term1}
[ INTERSECT
SELECT pmid FROM {table_name} WHERE mention = {term2}
...
]
Out of curiosity, I also created a denormalized version of the table, which ganged associated pmids together into an array column type:
create table array_tagged_genes (
pmids integer[],
mention text
);
The table is indexed on mention, statistics target set to 1000, and table vacuum analyzed.
For the array_tagged_genes query, I used this:
"array-query":
SELECT * FROM array_explode((select pmids FROM array_tagged_genes WHERE mention = {term1}
[ INTERSECT
SELECT * FROM array_explode((select pmids FROM array_tagged_genes WHERE mention = {term2}
...
]
where array_explode is a plpgsql function to unpack the pmids from their array and return them on separate rows.
The following test results were created by a perl program that dragged a file larger than memory through memory, restarted the postgresql server, ran explain analyze on the next query and averaged the results across three trials of each query. (Every individual trial hopefully performed a query without the benefit of any operating system or postgresql caching.) The computer was an otherwise unloaded dual 1.8GHz PowerMac G5 running Mac OS X 10.4.2 with 1GB RAM, running off a SATA drive (7200 rpm, 9ms seek, if you care).
-Kevin Murphy
I conducted tests on a table with 13,982,464 rows that looks like this:
create table clustered_tagged_genes (
pmid integer,
mention text
);
The table is indexed on pmid and mention, and clustered on mention. The statistics target was set to 1000 for each index, and the table was vacuum analyzed. I haven't yet methodically compared varying the statistics target yet.
FYI, there are 1,348,398 distinct mention values and 4,162,269 distinct pmid values in the table.
For comparison, I had a non-clustered version of the same table:
create table tagged_genes (
pmid integer,
mention text
);
This table is indexed on pmid and mention, but not clustered. (Again, statistics targets set to 1000, and table vacuum analyzed).
I tried two different equivalent queries on these tables:
"group-by-query"
SELECT pmid
FROM {tablename}
WHERE mention IN ({term1} [, ...])
GROUP BY pmid
HAVING count(distinct mention) = {num_terms};
and
"intersect-query"
SELECT pmid FROM {table_name} WHERE mention = {term1}
[ INTERSECT
SELECT pmid FROM {table_name} WHERE mention = {term2}
...
]
Out of curiosity, I also created a denormalized version of the table, which ganged associated pmids together into an array column type:
create table array_tagged_genes (
pmids integer[],
mention text
);
The table is indexed on mention, statistics target set to 1000, and table vacuum analyzed.
For the array_tagged_genes query, I used this:
"array-query":
SELECT * FROM array_explode((select pmids FROM array_tagged_genes WHERE mention = {term1}
[ INTERSECT
SELECT * FROM array_explode((select pmids FROM array_tagged_genes WHERE mention = {term2}
...
]
where array_explode is a plpgsql function to unpack the pmids from their array and return them on separate rows.
The following test results were created by a perl program that dragged a file larger than memory through memory, restarted the postgresql server, ran explain analyze on the next query and averaged the results across three trials of each query. (Every individual trial hopefully performed a query without the benefit of any operating system or postgresql caching.) The computer was an otherwise unloaded dual 1.8GHz PowerMac G5 running Mac OS X 10.4.2 with 1GB RAM, running off a SATA drive (7200 rpm, 9ms seek, if you care).
Rank | Query type/table | Search terms | Rows | Msecs | Msecs/row | Relative time |
1 | intersect-query/clustered | mycn | 479 | 49.84 | 0.10 | 1 |
2 | array-query/array | mycn | 479 | 79.18 | 0.17 | 1.59 |
3 | group-by-query/clustered | mycn | 479 | 163.43 | 0.34 | 3.28 |
4 | group-by-query/nonclustered | mycn | 479 | 2238.79 | 4.67 | 44.92 |
1 | group-by-query/clustered | mycn,trka | 20 | 109.08 | 5.45 | 1 |
2 | intersect-query/clustered | mycn,trka | 20 | 138.64 | 6.93 | 1.27 |
3 | array-query/array | mycn,trka | 20 | 264.25 | 13.21 | 2.42 |
4 | group-by-query/nonclustered | mycn,trka | 20 | 10076.79 | 503.84 | 92.38 |
1 | intersect-query/clustered | lcat | 1341 | 68.11 | 0.05 | 1 |
2 | group-by-query/clustered | lcat | 1341 | 86.61 | 0.06 | 1.27 |
3 | array-query/array | lcat | 1341 | 222.2 | 0.17 | 3.26 |
4 | group-by-query/nonclustered | lcat | 1341 | 7325.77 | 5.46 | 107.55 |
1 | intersect-query/clustered | eps15 | 148 | 61.05 | 0.41 | 1 |
2 | group-by-query/clustered | eps15 | 148 | 67.75 | 0.46 | 1.11 |
3 | array-query/array | eps15 | 148 | 80.45 | 0.54 | 1.32 |
4 | group-by-query/nonclustered | eps15 | 148 | 929.84 | 6.28 | 15.23 |
1 | intersect-query/clustered | brca1 | 2822 | 105.74 | 0.04 | 1 |
2 | group-by-query/clustered | brca1 | 2822 | 161.95 | 0.06 | 1.53 |
3 | array-query/array | brca1 | 2822 | 198.72 | 0.07 | 1.88 |
4 | group-by-query/nonclustered | brca1 | 2822 | 12113.8 | 4.29 | 114.56 |
1 | intersect-query/clustered | eps15,tax4,irs4 | 1 | 102.9 | 102.90 | 1 |
2 | group-by-query/clustered | eps15,tax4,irs4 | 1 | 112.58 | 112.58 | 1.09 |
3 | array-query/array | eps15,tax4,irs4 | 1 | 124.78 | 124.78 | 1.21 |
4 | group-by-query/nonclustered | eps15,tax4,irs4 | 1 | 1024.09 | 1024.09 | 9.95 |
-Kevin Murphy
Kevin Murphy <murphy@genome.chop.edu> writes: > This is just an FYI for other people out there with large tables: table > clustering sped up my queries from 10-100 times, which I am very happy about. > I'm posting this in case it's ever useful to anybody. If someone reading this > feels that I did something wrong, let me know. One thing you should realize is that cluster effectively removes all dead tuples from the table. If you had lots of dead tuples that could have been slowing things down. Vacuum only marks dead tuples for reuse. If you're not running vacuum often enough or you've done big batch updates then you may have accumulated lots of dead tuples and then your vacuum analyze doesn't actually remove them from the table. Moreover, if the tables are undergoing updates or deletes then you should expect to always have some steady state level of dead tuples in the table. Any tests conducted immediately after a "vacuum full" or "cluster" won't include that factor. All that said clustering is indeed often quite effective. Especially if it makes an index scan efficient enough to win over sequential scans you can see some huge effects. It's most useful for tables that aren't undergoing lots of updates and don't need to be reclustered often. -- greg
Greg Stark wrote: >All that said clustering is indeed often quite effective. Especially if it >makes an index scan efficient enough to win over sequential scans you can see >some huge effects. It's most useful for tables that aren't undergoing lots of >updates and don't need to be reclustered often. > > Which is my situation; most tables are read-only. Thanks for the comments. -Kevin
Quick questions: For big tables with frequent insert, no update, and frequent read (using indexes), will clustering help? what should be done on such table other than regular analyze? comments are appreciated. On 8/16/05, Kevin Murphy <murphy@genome.chop.edu> wrote: > Greg Stark wrote: > > >All that said clustering is indeed often quite effective. Especially if it > >makes an index scan efficient enough to win over sequential scans you can see > >some huge effects. It's most useful for tables that aren't undergoing lots of > >updates and don't need to be reclustered often. > > > > > Which is my situation; most tables are read-only. Thanks for the comments. > > -Kevin > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster >
Junaili Lie <junaili@gmail.com> writes: > Quick questions: > For big tables with frequent insert, no update, and frequent read > (using indexes), will clustering help? > what should be done on such table other than regular analyze? > comments are appreciated. If you never have any deletes or updates then you don't really need to vacuum the table regularly. (You still need to vacuum it before transaction id wraparound but that's a pretty long time.) So clustering won't help you by removing dead tuples and compacting the table. But it can still help by ordering the records in the same order as your index. The more the record order is correlated with the index the more effective the index is and the larger the result set that can use that index productively. That will only help if you're often retrieving moderately large result sets by one particular index. If you normally only retrieve one record at a time or from lots of different indexes then it probably won't really make much difference. New records won't be inserted in order though so periodically you'll want to recluster the table to maintain the order. -- greg
Greg Stark wrote: > clustering... > That will only help if you're often retrieving moderately large result sets by > one particular index. If you normally only retrieve one record at a time or > from lots of different indexes then it probably won't really make much > difference. It'll also help for columns whose values are related in some way. For example, clustering a table of addresses based on "zip code" will help lookups based on city or county or state (presumably because all the disk pages for a given city will be grouped together within the disk pages for the zip codes within the city).