From: Mitch Skinner
Subject: Re: same plan, add 1 condition, 1900x slower
Msg-id: 1131718843.29496.197.camel@enzian
In response to: Re: same plan, add 1 condition, 1900x slower (Richard Huxton <dev@archonet.com>)
List: pgsql-performance
On Fri, 2005-11-11 at 11:51 +0000, Richard Huxton wrote:
> Does external_id_map really have 15 million rows? If not, try a VACUUM
> FULL on it. Be prepared to give it some time to complete.

Thanks for the reply. It does indeed have that many rows:

statgen=> select count(*) from util.external_id_map;
  count
----------
 15562513
(1 row)

That table never gets deletions or updates, only insertions and reads. For fun and base-covering, I'm running a full vacuum now; usually there's just a nightly lazy vacuum.

If it helps, here's some background on what we're doing and why (plus some notes at the end about how it relates to Postgres):

We get very similar data from multiple sources, and I want to be able to combine it all into one schema. The data from different sources is similar enough (it's generally constrained by the underlying biology, e.g., each person has a father and a mother, two copies of each regular chromosome, etc.) that I think putting it all into one set of tables makes sense. Different people in our group use different tools (Python, R, Java), so instead of integrating at the code level (like a shared class hierarchy) we use the schema as our shared model of the data. This helps make my analyses comparable to my co-workers'.

We don't all want to have to write basic sanity checks in each of our languages, so we want foreign keys in the schema. Having foreign keys and multiple data sources means we have to generate our own internal identifiers (otherwise we'd expect ID collisions between sources). I'd also like a stable internal-external ID mapping (something we actually spent a lot of time arguing about), so we have a table that does exactly that; there's a rough sketch below.

When we import data, we do a bunch of joins against the external_id_map table to translate external IDs into internal IDs. That makes the external_id_map table pretty big, and the joins can take a long time (one 11-million-row source table takes four hours to import into our canonical schema, because we have to do 5 ID translations per row), but we don't import data very often, so it works.

The main speed concern is that exploratory data analysis is pretty interactive, and sometimes you want to run a bunch of analyses in parallel; if the queries are slow, that becomes a bottleneck. I'm looking forward to partitioning the external_id_map table with 8.1, and when Greenplum comes out with their stuff we'll probably take a look. If the main Postgres engine had parallel query execution, I'd be pretty happy. I also followed the external sort thread with interest, but I didn't get the impression that there was a very clear consensus there.

Since some of our sources change over time, and I can't generally expect them to put timestamps on their data, when we re-import from a source we delete everything in the canonical tables from that source and then re-insert (also sketched below). It sounds like mass deletions are not such a common thing to do; I think there was a thread about this recently in which Tom questioned the real-world need to worry about that workload. I was thinking the foreign-key integrity checks might be better done as a join rather than a per-deleted-row trigger queue, but since all my foreign keys are indexed on both ends, it doesn't look like a bottleneck here.

Anyway, all of that probably affects the data distribution in our tables and indexes. I'll report back on the effect of the full vacuum.
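In case it makes the ID-mapping setup clearer, here's roughly the shape of the table and a translation join. Table and column names here are simplified stand-ins, not our exact schema:

    -- Hypothetical/simplified version of the mapping table:
    CREATE TABLE util.external_id_map (
        source      text   NOT NULL,  -- which upstream source the ID came from
        external_id text   NOT NULL,  -- the source's own identifier
        internal_id bigint NOT NULL,  -- our stable internal identifier
        PRIMARY KEY (source, external_id)
    );

    -- During import, each external-ID column in a staging table gets
    -- swapped for an internal ID with a join like this:
    SELECT m.internal_id, s.some_value
    FROM staging.some_table s
    JOIN util.external_id_map m
      ON m.source = 'source_a'
     AND m.external_id = s.person_ext_id;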
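And the re-import cycle for a given source is basically this (again with made-up names):

    BEGIN;

    -- Throw away everything previously loaded from this source...
    DELETE FROM canon.genotype WHERE source = 'source_a';

    -- ...then rebuild it, translating each external ID as we go.  The
    -- 11-million-row table mentioned above needs five joins like these.
    INSERT INTO canon.genotype (source, person_id, marker_id, allele1, allele2)
    SELECT 'source_a', p.internal_id, mk.internal_id, s.allele1, s.allele2
    FROM staging.genotype_a s
    JOIN util.external_id_map p
      ON p.source = 'source_a' AND p.external_id = s.person_ext_id
    JOIN util.external_id_map mk
      ON mk.source = 'source_a' AND mk.external_id = s.marker_ext_id;

    COMMIT;

Because the foreign keys are indexed on both ends, the integrity checks triggered by the DELETE stay index-driven instead of scanning the referencing tables, which is why it hasn't shown up as a bottleneck for us.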
Thanks for reading,
Mitch