Database slowness -- my design, hardware, or both? - Mailing list pgsql-general
From | Reuven M. Lerner |
---|---|
Subject | Database slowness -- my design, hardware, or both? |
Date | |
Msg-id | 45ECEE87.50709@lerner.co.il Whole thread Raw |
Responses |
Re: Database slowness -- my design, hardware, or both?
|
List | pgsql-general |
Hi, everyone. I've been using PostgreSQL for a decade, and it hasn't failed me yet. But I've been having some serious performance problems on a database that I've been using in my grad-school research group, and it's clear that I need help from some more experienced hands. Basically, we've collected large amounts of data from students who used our chemistry simulation software. That data, collected in XML logfiles and then imported into a normalized set of PostgreSQL tables, keeps track of every action the student took, as well as what the chemistry simulation was doing. So if the number of molecules changes, we keep track of that. If the user changes the color representation of the molecules, we keep track of that, too. There is a timestamp (unfortunately, with one-second resolution) associated with each action. The simulations take place in several different activities, each of which has a set of sections and nodes through which the user passes. We're trying to look for patterns in this data, which means that we're heading into the worlds of data warehousing and data mining. These are relatively new topics for me, although I'm trying to get up to speed on them. (The patterns for which we're searching have already been determined. So we're trying to describe particular patterns, rather than simply let data-mining software go wild and find correlations.) My first plan was to create a single fact table (which I called "transactions"), with foreign keys to various other relevant tables, and pre-computed data that I would need in my calculations. I then figured that I would write some queries, in a combination of Perl, Pl/PgSQL, and straight SQL, to tag particular rows has being in a pattern (or not). Once I've tagged the rows that are in the pattern, I can retrieve them with a SQL query, and hand them off to the researcher analyzing them. In theory, this would be the way to go. In practice, every step has become a performance nightmare, taking many more hours than I might have hoped or expected. For example, I've got a Pl/PgSQL function that goes through each variable-assignment row, and UPDATEs is previous_value column with whatever the previous value might have been. This function takes 4-6 hours to run, across 2 million rows, representing two of our six activities. (I've removed the other four activities' worth, in the hopes that I'll see a performance improvement.) When I only had 750,000 rows in our fact table, things ran at a somewhat acceptable speed. Now, the database is getting seriously bogged down with what I'm doing. I've tried this on a few different pieces of hardware, including some with multiple CPUs and lots of memory. And yet, nothing is going quickly, or even remotely quickly. I'm stuck with the nagging feeling that (a) I have some seriously unoptimized parts of my query, (b) faster disks would be helpful, and/or (c) what I'm doing is inherently slow and painful, and there's no way around it. Numerous invocations of EXPLAIN, and frequent uses of VACUUM tell me that I'm dealing with a fair amount of data here. But let's be honest; the whole database is about 5 GB unvacuumed, and I know that people are doing all sorts of things with 100 GB and larger disks. So perhaps I've hit a hardware bottleneck, and need some zippier disks? I'm not sure. So I'm looking for suggestions and advice from the PostgreSQL community. I haven't included a lot of details here, because I'm not even sure what details would be relevant, given the complexity of our situation. So if there are some particular metrics that I should share with this group, I'd be happy to do so, from hardware configurations to table definitions, to the queries that are getting bogged down. Thanks in advance, Reuven
pgsql-general by date: