Thread: Database slowness -- my design, hardware, or both?

Database slowness -- my design, hardware, or both?

From
"Reuven M. Lerner"
Date:
Hi, everyone.  I've been using PostgreSQL for a decade, and it hasn't
failed me yet.  But I've been having some serious performance problems
on a database that I've been using in my grad-school research group, and
it's clear that I need help from some more experienced hands.

Basically, we've collected large amounts of data from students who used
our chemistry simulation software.  That data, collected in XML logfiles
and then imported into a normalized set of PostgreSQL tables, keeps
track of every action the student took, as well as what the chemistry
simulation was doing.  So if the number of molecules changes, we keep
track of that.  If the user changes the color representation of the
molecules, we keep track of that, too.  There is a timestamp
(unfortunately, with one-second resolution) associated with each
action.  The simulations take place in several different activities,
each of which has a set of sections and nodes through which the user passes.

We're trying to look for patterns in this data, which means that we're
heading into the worlds of data warehousing and data mining.  These are
relatively new topics for me, although I'm trying to get up to speed on
them.  (The patterns for which we're searching have already been
determined.  So we're trying to describe particular patterns, rather
than simply let data-mining software go wild and find correlations.)

My first plan was to create a single fact table (which I called
"transactions"), with foreign keys to various other relevant tables, and
pre-computed data that I would need in my calculations.  I then figured
that I would write some queries, in a combination of Perl, Pl/PgSQL, and
straight SQL, to tag particular rows has being in a pattern (or not).
Once I've tagged the rows that are in the pattern, I can retrieve them
with a SQL query, and hand them off to the researcher analyzing them.

In theory, this would be the way to go.  In practice, every step has
become a performance nightmare, taking many more hours than I might have
hoped or expected.  For example, I've got a Pl/PgSQL function that goes
through each variable-assignment row, and UPDATEs is previous_value
column with whatever the previous value might have been.  This function
takes 4-6 hours to run, across 2 million rows, representing two of our
six activities.  (I've removed the other four activities' worth, in the
hopes that I'll see a performance improvement.)

When I only had 750,000 rows in our fact table, things ran at a somewhat
acceptable speed.  Now, the database is getting seriously bogged down
with what I'm doing.

I've tried this on a few different pieces of hardware, including some
with multiple CPUs and lots of memory.  And yet, nothing is going
quickly, or even remotely quickly.  I'm stuck with the nagging feeling
that (a) I have some seriously unoptimized parts of my query, (b) faster
disks would be helpful, and/or (c) what I'm doing is inherently slow and
painful, and there's no way around it.

Numerous invocations of EXPLAIN, and frequent uses of VACUUM tell me
that I'm dealing with a fair amount of data here.  But let's be honest;
the whole database is about 5 GB unvacuumed, and I know that people are
doing all sorts of things with 100 GB and larger disks.  So perhaps I've
hit a hardware bottleneck, and need some zippier disks?  I'm not sure.

So I'm looking for suggestions and advice from the PostgreSQL
community.  I haven't included a lot of details here, because I'm not
even sure what details would be relevant, given the complexity of our
situation.  So if there are some particular metrics that I should share
with this group, I'd be happy to do so, from hardware configurations to
table definitions, to the queries that are getting bogged down.

Thanks in advance,

Reuven


Re: Database slowness -- my design, hardware, or both?

From
Tom Lane
Date:
"Reuven M. Lerner" <reuven@lerner.co.il> writes:
> Hi, everyone.  I've been using PostgreSQL for a decade, and it hasn't
> failed me yet.  But I've been having some serious performance problems
> on a database that I've been using in my grad-school research group, and
> it's clear that I need help from some more experienced hands.

What PG version are you using?

> In theory, this would be the way to go.  In practice, every step has
> become a performance nightmare, taking many more hours than I might have
> hoped or expected.  For example, I've got a Pl/PgSQL function that goes
> through each variable-assignment row, and UPDATEs is previous_value
> column with whatever the previous value might have been.

I'd try to think of a way to eliminate the function altogether in favor
of a single UPDATE command.  In general, row-at-a-time thinking isn't
the way to win in SQL.

            regards, tom lane

Re: Database slowness -- my design, hardware, or both?

From
"Reuven M. Lerner"
Date:
Hi, Tom.  You wrote:
Hi, everyone.  I've been using PostgreSQL for a decade, and it hasn't 
failed me yet.  But I've been having some serious performance problems 
on a database that I've been using in my grad-school research group, and 
it's clear that I need help from some more experienced hands.   
What PG version are you using? 
I've been using 8.0, 8.1, and 8.2 at various points, depending on which machine I've been using.  My main machine is currently using 8.2.0.  (I wish that I had control over which version was being used, but my sysadmin powers are inversely proportional to the computer power made available to me.
I'd try to think of a way to eliminate the function altogether in favor
of a single UPDATE command.  In general, row-at-a-time thinking isn't
the way to win in SQL.
Well, I've tried to do massive UPDATEs as much as possible.  But the patterns that we're looking for are basically of the variety, "If the user clicks on X and then clicks on Y, but without Z between the two of them, and if these are all part of the same simulation run, then we tag action X as being of interest to us."  So it's oodles of keeping track of back-and-forth for each of the rows in the table, and looking forward and backward in the table.

I agree that row-at-a-time thinking isn't the best way to work, but I didn't see a good alternative for our purposes.  I'm open to any and all suggestions.

Reuven

Re: Database slowness -- my design, hardware, or both?

From
"Webb Sprague"
Date:
>  Well, I've tried to do massive UPDATEs as much as possible.  But the
> patterns that we're looking for are basically of the variety, "If the user
> clicks on X and then clicks on Y, but without Z between the two of them, and
> if these are all part of the same simulation run, then we tag action X as
> being of interest to us."  So it's oodles of keeping track of back-and-forth
> for each of the rows in the table, and looking forward and backward in the
> table.
>
>  I agree that row-at-a-time thinking isn't the best way to work, but I
> didn't see a good alternative for our purposes.  I'm open to any and all
> suggestions.

Can you post at least some table schemas, indexes, queries, and
explain output?  I think of database optimization as a serious case of
devil in the details, and generalities (like -- make sure you index,
make sure your indexes help using explain, avoid row-at-a-time
thinking) won't get you far. So if we had something concrete to work
with, well, we would have something concrete to work with.

Re: Database slowness -- my design, hardware, or both?

From
"Webb Sprague"
Date:
.
>  Heh.  Sure thing.  I wasn't sure how much detail to give when initially
> posting.

Looks like enough to get the real experts on the list started :)

I will try to look again tommorrow, but I bet other folks have better
intuition than me.

How much concurrency is there on your database?

Re: Database slowness -- my design, hardware, or both?

From
"Reuven M. Lerner"
Date:
Hi, Webb Sprague.  You wrote:
> How much concurrency is there on your database?
Almost none.  I'm generally running one client against the server.  I
often have a second client connected simultaneously, just for the
purpose of answering short questions.

I'm now thinking of separating each activity into its own database, in
the hopes that this will improve the system speed enough to do what I
need.  But I'm far from convinced that this will really help.

Reuven


Re: Database slowness -- my design, hardware, or both?

From
"Webb Sprague"
Date:
I am not able to look as closely as it deserves ...

... but I see two seq scans in your explain in a loop -- this is
probably not good.  If you can find a way to rewrite the IN clause
(either de-normalizing through triggers to save whatever you need on
an insert and not have to deal with a set, or by using except in the
query, or someing else more creative)...

Also -- there is a good book on temporal databases by Snodgrass that
might give some interesting ideas; maybe you have already seen it, but
still.  I am thinking you could increment a sequence variable to give
you a "tick" integer with each action, rather than trying to use
timestamps with all their overhead and inaccuracy (1 second is a long
time, really).  Lamport also did work on clocks that might apply.

Also have you tried dropping all your fk and checks just to see if you
get a difference in speed on an update?  It would be interesting,
perhaps.

If you could get rid of the sorted limit 1 clause in your function,
there would be less variablity and make it easier to understand; you
probably need to denormalize somehow, perhaps using ticks; I am not
sure....

Could a trigger set your previous_value on insert or update, rather
than querying for it later?

> I'm now thinking of separating each activity into its own database, in
> the hopes that this will improve the system speed enough to do what I
> need.  But I'm far from convinced that this will really help.

Better to figure out the real problem -- more interesting, more scalable.

Hope my incoherent brain dump lights a spark.

Re: Database slowness -- my design, hardware, or both?

From
"Reuven M. Lerner"
Date:
Hi, Webb Sprague.  You wrote:
> ... but I see two seq scans in your explain in a loop -- this is
> probably not good.  If you can find a way to rewrite the IN clause
> (either de-normalizing through triggers to save whatever you need on
> an insert and not have to deal with a set, or by using except in the
> query, or someing else more creative)...
I would normally agree that an IN clause is a place to worry -- except
that I'm using IN on a very small table, with about 4-5 rows.  That
might indeed affect things, and I could certainly pull out these values
into a Perl variable that I then interpolate literally into the SQL.
However, I have to assume that this wouldn't affect things all that much.
> Also -- there is a good book on temporal databases by Snodgrass that
> might give some interesting ideas; maybe you have already seen it, but
> still.
Heh -- that's one of the first references that I looked at when I
started this project.  Perhaps I'll look at it again; the design of my
data warehouse took some of those suggestions into account, but I
wouldn't be surprised if there's something more I could be doing.
>   I am thinking you could increment a sequence variable to give
> you a "tick" integer with each action, rather than trying to use
> timestamps with all their overhead and inaccuracy (1 second is a long
> time, really).  Lamport also did work on clocks that might apply.
I agree that 1 second is a ridiculously long time.  The logfiles were
created by people outside of my research group, and are a wonderful
lesson in how to use XML poorly and inefficiently.  The designers of the
logfile weren't expecting to get dozens or hundreds of values per
second.  You can be sure that my research group, which is now looking to
create logfiles of our own, will place a strong emphasis on
high-resolution timestamps.

Practically speaking, I've managed to get around this problem by using
the "id" column, which comes from a sequence, and is thus guaranteed to
be increasing.  We can thus be assured that a row with an ID of x will
come before a row whose ID is x + 10, so long as their transaction types
are the same.  So we can't guarantee that a node entry comes before a
variable setting via the ID, but we can order two variable settings
based on the ID.
> Also have you tried dropping all your fk and checks just to see if you
> get a difference in speed on an update?  It would be interesting,
> perhaps.
I sorta figured that because our UPDATE is taking place on a column
without a foreign key, that this wouldn't matter much.  Am I wrong?
> If you could get rid of the sorted limit 1 clause in your function,
> there would be less variablity and make it easier to understand; you
> probably need to denormalize somehow, perhaps using ticks; I am not
> sure....
Yeah, that's a good point.  I'm not sure how possible that'll be,
though, given that at *some* point, I need to find the "most recent"
setting of the same variable.  There might be dozens or hundreds of rows
separating the assignment of a new value to a particular variable.
>
> Could a trigger set your previous_value on insert or update, rather
> than querying for it later?
Well, what I've been doing is INSERTing each row into the "transactions"
table without previous_value.  Then I run the mass UPDATE for
previous_value, doing nothing else at the same time.  Then, after
previous_value is set, I do the query.

Reuven


Re: Database slowness -- my design, hardware, or both?

From
Alvaro Herrera
Date:
Reuven M. Lerner escribió:
> Hi, Webb Sprague.  You wrote:
> >... but I see two seq scans in your explain in a loop -- this is
> >probably not good.  If you can find a way to rewrite the IN clause
> >(either de-normalizing through triggers to save whatever you need on
> >an insert and not have to deal with a set, or by using except in the
> >query, or someing else more creative)...
> I would normally agree that an IN clause is a place to worry -- except
> that I'm using IN on a very small table, with about 4-5 rows.  That
> might indeed affect things, and I could certainly pull out these values
> into a Perl variable that I then interpolate literally into the SQL.
> However, I have to assume that this wouldn't affect things all that much.

Don't assume -- measure.  I had a query which ran orders of magnitude
faster because I interpolated the constant list in the big query.  The
table from which the interpolated values were being extracted had about
30 rows or so.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Database slowness -- my design, hardware, or both?

From
"Reuven M. Lerner"
Date:
Hi, Alvaro Herrera.  You wrote:
Don't assume -- measure.  I had a query which ran orders of magnitude
faster because I interpolated the constant list in the big query.  The
table from which the interpolated values were being extracted had about
30 rows or so. 
OK, I modified things to use interpolation.  Here's the updated query:

explain  UPDATE Transactions
               SET previous_value = previous_value(id)
             WHERE new_value IS NOT NULL
               AND new_value <> ''
               AND node_id IN (351, 169, 664, 240);

And here is the query plan produced by explain:
                                    QUERY PLAN                                   
----------------------------------------------------------------------------------
 Bitmap Heap Scan on transactions  (cost=8842.88..98283.93 rows=407288 width=249)
   Recheck Cond: (node_id = ANY ('{351,169,664,240}'::integer[]))
   Filter: ((new_value IS NOT NULL) AND (new_value <> ''::text))
   ->  Bitmap Index Scan on node_id_idx  (cost=0.00..8842.88 rows=434276 width=0)
         Index Cond: (node_id = ANY ('{351,169,664,240}'::integer[]))
(5 rows)
I'm still a bit surprised by how different the query plan came out with what would seem like a minor change. 

Reuven

Re: Database slowness -- my design, hardware, or both?

From
"Webb Sprague"
Date:
>  OK, I modified things to use interpolation.  Here's the updated query:
>
>
> explain  UPDATE Transactions
>                 SET previous_value = previous_value(id)
>               WHERE new_value IS NOT NULL
>                 AND new_value <> ''
>                 AND node_id IN (351, 169, 664, 240);
>
>  And here is the query plan produced by explain:
>
>                                     QUERY
> PLAN
> ----------------------------------------------------------------------------------
>   Bitmap Heap Scan on transactions  (cost=8842.88..98283.93 rows=407288
> width=249)
>     Recheck Cond: (node_id = ANY
> ('{351,169,664,240}'::integer[]))
>     Filter: ((new_value IS NOT NULL) AND (new_value <> ''::text))
>     ->  Bitmap Index Scan on node_id_idx  (cost=0.00..8842.88 rows=434276
> width=0)
>           Index Cond: (node_id = ANY
> ('{351,169,664,240}'::integer[]))
>  (5 rows)
>  I'm still a bit surprised by how different the query plan came out with
> what would seem like a minor change.

Do you have new \timings?

What you or I think is a minor change isn't necessarily what the
planner thinks is a minor change, especially when you change data from
something that requires a query to something that is determinate.  I
would suggest changing your function to remove as many such queries as
possible too (I am thinking of the order by limit 1).  This would be a
good move also in that you narrow down the amount of moving parts to
diagnose and it just makes the whole thing cleaner.

The meta-moral is that db optimization requires systematic
experimentation.  Use the database to store the results of the various
experiments!

In light of this, I would suggest you try removing the check clauses
and seeing if you get a difference too.  Just like Francis Bacon said
-- don't deduce from first principles, experiment!

I would also try amortizing the analysis with triggers, rather than
building the table all at once; this may be better or worse, depending
on the on-line character of the application (e.g., if they are waiting
at an ATM, in might be a deal breaker to add two seconds to do an
insert / update, but not if you are tailing off a log file that gets
updated every minute or so.)

W

Re: Database slowness -- my design, hardware, or both?

From
"Reuven M. Lerner"
Date:
Hi, Webb Sprague.  You wrote:

Do you have new \timings?
Yup.  It just finished executing a little while ago.  With the explicitly interpolated array in place, I got the following:
LOG:  statement: UPDATE Transactions
              SET previous_value = previous_value(id)
            WHERE new_value IS NOT NULL
              AND new_value <> ''
              AND node_id IN (351, 169, 664, 240)
LOG:  duration: 16842710.469 ms

The previous version, which included lots of calls to RAISE NOTICE and also used a subselect, had the following timing:
LOG:  statement: UPDATE Transactions
              SET previous_value = previous_value(id)
            WHERE new_value IS NOT NULL
              AND new_value <> ''
              AND node_id IN (SELECT node_id FROM NodeSegments)
LOG:  duration: 16687993.067 ms
(I keep timing information in the logfile, rather than using \timing.)

So it looks like this didn't make much of a timing difference at all.  And what little difference there was, was negative.  Bleah.
What you or I think is a minor change isn't necessarily what the
planner thinks is a minor change, especially when you change data from
something that requires a query to something that is determinate.  I
would suggest changing your function to remove as many such queries as
possible too (I am thinking of the order by limit 1).  This would be a
good move also in that you narrow down the amount of moving parts to
diagnose and it just makes the whole thing cleaner.
Good idea.  I'll see if I can get the function to be a bit cleaner, although I'm not sure if it is, given the constraints of the problem.  That's one of the reasons why I've been adding these "hints" to the database -- so that I can have many small queries, rather than one large one.
I would also try amortizing the analysis with triggers, rather than
building the table all at once; this may be better or worse, depending
on the on-line character of the application (e.g., if they are waiting
at an ATM, in might be a deal breaker to add two seconds to do an
insert / update, but not if you are tailing off a log file that gets
updated every minute or so.)
The data that I'm dealing with is not changing over time.  So I've been trying to build the transactions table (aka my data warehouse) slowly, adding one or two columns at a time with hints that will make extracting the data easier.  Unfortunately, building those hints has proven to be very slow going.

Reuven

Re: Database slowness -- my design, hardware, or both?

From
Richard Broersma Jr
Date:
>     LOG:  statement: UPDATE Transactions
>                   SET previous_value = previous_value(id)
>                 WHERE new_value IS NOT NULL
>                   AND new_value <> ''
>                   AND node_id IN (SELECT node_id FROM NodeSegments)
>     LOG:  duration: 16687993.067 ms

I hope that I can presume some suggestions that I gleened after finishing my celko book.  I don't
know if the suggestions presented will help in your case.

From the reading WHERE conditions such as <> '' or IS NOT NULL can be preformance killers as these
may discourge the optimizer from using an index scan.  The suggest was to replace this with:

     new_value > '<some minimum value possible in this field i.e. A>'

this WHERE conditions should only find non-NULL and non-empty strings.

Also, the IN is also know as a killer so the suggestion was to reform the query like so:


UPDATE Transactions
SET previous_value = previous_value(id)
FROM NodeSegments
WHERE Transactions.Node_id = NodeSegments.Node_id
AND Transactions.new_value > 'A'; --assuming A is a minimum value


I hope this can help.

Regards,
Richard Broersma Jr.

Re: Database slowness -- my design, hardware, or both?

From
"Martin Gainty"
Date:
Reuven--

a few years back this same scenario happened for me working on a project with a large Oracle DB which held clinical
data
 
We needed to implement a join condition but the column we were using wa not using an Index scan 
so since the cardinality of the results was extraordinary we re-implemented the column to use EXISTS (for large result
sets)
*following the omnipresent example used by Oracle books where IN is used for columns with low cardinality *
and following the docs from EXISTS where the EXISTS clause ALWAYS has to return something 

In our case we were joining on a doctorid with IN (which made no sense as there were millions of PatientIDs)  to find
allpatients whose doctors
 
were in PPO's so to increase performance we changed the IN clause for the column with high cardinality (doctorid) to
EXISTS

select p.PATIENT_NAME from PATIENT p where p.doctorid in (select doctorid from ppo_table)

/******there are many doctorids in this ppo table so we will change to EXISTS********/

select p.PATIENT_NAME from PATIENT p
    where exists (select 0 from  ppo_table ppo where p.doctorID = ppo.doctorID); 

Shalom/
Martin--
--------------------------------------------------------------------------- 
This e-mail message (including attachments, if any) is intended for the use of the individual or entity to which it is
addressedand may contain information that is privileged, proprietary , confidential and exempt from disclosure. If you
arenot the intended recipient, you are notified that any dissemination, distribution or copying of this communication
isstrictly prohibited.
 
--------------------------------------------------------------------------- 
Le présent message électronique (y compris les pièces qui y sont annexées, le cas échéant) s'adresse au destinataire
indiquéet peut contenir des renseignements de caractère privé ou confidentiel. Si vous n'êtes pas le destinataire de ce
document,nous vous signalons qu'il est strictement interdit de le diffuser, de le distribuer ou de le reproduire.
 
----- Original Message ----- 
From: "Alvaro Herrera" <alvherre@commandprompt.com>
To: "Reuven M. Lerner" <reuven@lerner.co.il>
Cc: "Webb Sprague" <webb.sprague@gmail.com>; <pgsql-general@postgresql.org>
Sent: Tuesday, March 06, 2007 4:19 PM
Subject: Re: [GENERAL] Database slowness -- my design, hardware, or both?


> Reuven M. Lerner escribió:
>> Hi, Webb Sprague.  You wrote:
>> >... but I see two seq scans in your explain in a loop -- this is
>> >probably not good.  If you can find a way to rewrite the IN clause
>> >(either de-normalizing through triggers to save whatever you need on
>> >an insert and not have to deal with a set, or by using except in the
>> >query, or someing else more creative)...
>> I would normally agree that an IN clause is a place to worry -- except 
>> that I'm using IN on a very small table, with about 4-5 rows.  That 
>> might indeed affect things, and I could certainly pull out these values 
>> into a Perl variable that I then interpolate literally into the SQL.  
>> However, I have to assume that this wouldn't affect things all that much.
> 
> Don't assume -- measure.  I had a query which ran orders of magnitude
> faster because I interpolated the constant list in the big query.  The
> table from which the interpolated values were being extracted had about
> 30 rows or so.
> 
> -- 
> Alvaro Herrera                                http://www.CommandPrompt.com/
> The PostgreSQL Company - Command Prompt, Inc.
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>