Re: same plan, add 1 condition, 1900x slower - Mailing list pgsql-performance

From Mitch Skinner
Subject Re: same plan, add 1 condition, 1900x slower
Date
Msg-id 1131718843.29496.197.camel@enzian
Whole thread Raw
In response to Re: same plan, add 1 condition, 1900x slower  (Richard Huxton <dev@archonet.com>)
List pgsql-performance
On Fri, 2005-11-11 at 11:51 +0000, Richard Huxton wrote:
> Does external_id_map really have 15 million rows? If not, try a VACUUM
> FULL on it. Be prepared to give it some time to complete.

Thanks for the reply.  It does indeed have that many rows:
statgen=> select count(*) from util.external_id_map ;
  count
----------
 15562513
(1 row)

That table never gets deletions or updates, only insertions and reads.
For fun and base-covering, I'm running a full vacuum now.  Usually
there's just a nightly lazy vacuum.

If it helps, here's some background on what we're doing and why (plus
some stuff at the end about how it relates to Postgres):

We get very similar data from multiple sources, and I want to be able to
combine it all into one schema.  The data from different sources is
similar enough (it's generally constrained by the underlying biology,
e.g., each person has a father and a mother, two versions of each
regular chromosome, etc.) that I think putting it all into one set of
tables makes sense.

Different people in our group use different tools (Python, R, Java), so
instead of integrating at the code level (like a shared class hierarchy)
we use the schema as our shared idea of the data.  This helps make my
analyses comparable to the analyses from my co-workers.  We don't all
want to have to write basic sanity checks in each of our languages, so
we want to be able to have foreign keys in the schema.  Having foreign
keys and multiple data sources means that we have to generate our own
internal identifiers (otherwise we'd expect to have ID collisions from
different sources).  I'd like to be able to have a stable
internal-external ID mapping (this is actually something we spent a lot
of time arguing about), so we have a table that does exactly that.

When we import data, we do a bunch of joins against the external_id_map
table to translate external IDs into internal IDs.  It means that the
external_id_map table gets pretty big and the joins can take a long time
(it takes four hours to import one 11-million row source table into our
canonical schema, because we have to do 5 ID translations per row on
that one), but we don't need to import data too often so it works.  The
main speed concern is that exploratory data analyses are pretty
interactive, and also sometimes you want to run a bunch of analyses in
parallel, and if the queries are slow that can be a bottleneck.

I'm looking forward to partitioning the external_id_map table with 8.1,
and when Greenplum comes out with their stuff we'll probably take a
look.  If the main Postgres engine had parallel query execution, I'd be
pretty happy.  I also followed the external sort thread with interest,
but I didn't get the impression that there was a very clear consensus
there.

Since some of our sources change over time, and I can't generally expect
them to have timestamps on their data, what we do when we re-import from
a source is delete everything out of the canonical tables from that
source and then re-insert.  It sounds like mass deletions are not such a
common thing to do; I think there was a thread about this recently and
Tom questioned the real-world need to worry about that workload.  I was
thinking that maybe the foreign key integrity checks might be better
done by a join rather than a per-deleted-row trigger queue, but since
all my foreign keys are indexed on both ends it doesn't look like a
bottleneck.

Anyway, all that probably has an effect on the data distribution in our
tables and indexes.  I'll report back on the effect of the full vacuum.

Thanks for reading,
Mitch

pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: same plan, add 1 condition, 1900x slower
Next
From: Mitch Skinner
Date:
Subject: Re: same plan, add 1 condition, 1900x slower