table clustering brings joy - Mailing list pgsql-general

From Kevin Murphy
Subject table clustering brings joy
Date
Msg-id 43021891.5040500@genome.chop.edu
Whole thread Raw
Responses Re: table clustering brings joy  (Greg Stark <gsstark@mit.edu>)
List pgsql-general
This is just an FYI for other people out there with large tables: table clustering sped up my queries from 10-100 times, which I am very happy about.  I'm posting this in case it's ever useful to anybody.  If someone reading this feels that I did something wrong, let me know.

I conducted tests on a table  with 13,982,464 rows that looks like this:

create table clustered_tagged_genes (
    pmid integer,
    mention text
);

The table is indexed on pmid and mention, and clustered on mention.  The statistics target was set to 1000 for each index, and the table was vacuum analyzed.  I haven't yet methodically compared varying the statistics target yet.

FYI, there are 1,348,398 distinct mention values and 4,162,269 distinct pmid values in the table.

For comparison, I had a non-clustered version of the same table:

create table tagged_genes (
    pmid integer,
    mention text
);

This table is indexed on pmid and mention, but not clustered.  (Again, statistics targets set to 1000, and table vacuum analyzed).

I tried two different equivalent queries on these tables:

"group-by-query"

  SELECT pmid
    FROM  {tablename}
    WHERE mention IN ({term1} [, ...])
    GROUP BY pmid
    HAVING count(distinct mention) = {num_terms};

and

"intersect-query"

  SELECT pmid FROM {table_name} WHERE mention = {term1}
[      INTERSECT
  SELECT pmid FROM {table_name} WHERE mention = {term2}
...
]


Out of curiosity, I also created a denormalized version of the table, which ganged associated pmids together into an array column type:

create table  array_tagged_genes (
    pmids integer[],
    mention text
);

The table is indexed on mention, statistics target set to 1000, and table vacuum analyzed.

For the array_tagged_genes query, I used this:

"array-query":

  SELECT * FROM array_explode((select pmids FROM array_tagged_genes WHERE mention  = {term1}
[      INTERSECT
  SELECT * FROM array_explode((select pmids FROM array_tagged_genes WHERE mention  = {term2}
...
]

where array_explode is a plpgsql function to unpack the pmids from their array and return them on separate rows.

The following test results were created by a perl program that dragged a file larger than memory through memory, restarted the postgresql server, ran explain analyze on the next query and averaged the results across three trials of each query.  (Every individual trial hopefully performed a query without the benefit of any operating system or postgresql caching.)  The computer was an otherwise unloaded dual 1.8GHz PowerMac G5 running Mac OS X 10.4.2 with 1GB RAM, running off a SATA drive (7200 rpm, 9ms seek, if you care).


RankQuery type/tableSearch termsRowsMsecsMsecs/rowRelative time
1intersect-query/clusteredmycn47949.840.101
2array-query/arraymycn47979.180.171.59
3group-by-query/clusteredmycn479163.430.343.28
4group-by-query/nonclusteredmycn4792238.794.6744.92







1group-by-query/clusteredmycn,trka20109.085.451
2intersect-query/clusteredmycn,trka20138.646.931.27
3array-query/arraymycn,trka20264.2513.212.42
4group-by-query/nonclusteredmycn,trka2010076.79503.8492.38







1intersect-query/clusteredlcat134168.110.051
2group-by-query/clusteredlcat134186.610.061.27
3array-query/arraylcat1341222.20.173.26
4group-by-query/nonclusteredlcat13417325.775.46107.55







1intersect-query/clusteredeps1514861.050.411
2group-by-query/clusteredeps1514867.750.461.11
3array-query/arrayeps1514880.450.541.32
4group-by-query/nonclusteredeps15148929.846.2815.23







1intersect-query/clusteredbrca12822105.740.041
2group-by-query/clusteredbrca12822161.950.061.53
3array-query/arraybrca12822198.720.071.88
4group-by-query/nonclusteredbrca1282212113.84.29114.56







1intersect-query/clusteredeps15,tax4,irs41102.9102.901
2group-by-query/clusteredeps15,tax4,irs41112.58112.581.09
3array-query/arrayeps15,tax4,irs41124.78124.781.21
4group-by-query/nonclusteredeps15,tax4,irs411024.091024.099.95

-Kevin Murphy

pgsql-general by date:

Previous
From: marcelo Cortez
Date:
Subject: lock problem
Next
From: Richard Huxton
Date:
Subject: Re: lock problem