Thread: Database slowness -- my design, hardware, or both?
Hi, everyone. I've been using PostgreSQL for a decade, and it hasn't failed me yet. But I've been having some serious performance problems on a database that I've been using in my grad-school research group, and it's clear that I need help from some more experienced hands.

Basically, we've collected large amounts of data from students who used our chemistry simulation software. That data, collected in XML logfiles and then imported into a normalized set of PostgreSQL tables, keeps track of every action the student took, as well as what the chemistry simulation was doing. So if the number of molecules changes, we keep track of that. If the user changes the color representation of the molecules, we keep track of that, too. There is a timestamp (unfortunately, with one-second resolution) associated with each action. The simulations take place in several different activities, each of which has a set of sections and nodes through which the user passes.

We're trying to look for patterns in this data, which means that we're heading into the worlds of data warehousing and data mining. These are relatively new topics for me, although I'm trying to get up to speed on them. (The patterns we're searching for have already been determined, so we're trying to describe particular patterns, rather than simply letting data-mining software go wild and find correlations.)

My first plan was to create a single fact table (which I called "transactions"), with foreign keys to various other relevant tables, and pre-computed data that I would need in my calculations. I then figured that I would write some queries, in a combination of Perl, Pl/PgSQL, and straight SQL, to tag particular rows as being in a pattern (or not). Once I've tagged the rows that are in the pattern, I can retrieve them with a SQL query and hand them off to the researcher analyzing them.

In theory, this would be the way to go. In practice, every step has become a performance nightmare, taking many more hours than I might have hoped or expected. For example, I've got a Pl/PgSQL function that goes through each variable-assignment row and UPDATEs its previous_value column with whatever the previous value might have been. This function takes 4-6 hours to run, across 2 million rows, representing two of our six activities. (I've removed the other four activities' worth, in the hopes that I'll see a performance improvement.) When I only had 750,000 rows in our fact table, things ran at a somewhat acceptable speed. Now, the database is getting seriously bogged down with what I'm doing.

I've tried this on a few different pieces of hardware, including some with multiple CPUs and lots of memory. And yet, nothing is going quickly, or even remotely quickly. I'm stuck with the nagging feeling that (a) I have some seriously unoptimized parts of my query, (b) faster disks would be helpful, and/or (c) what I'm doing is inherently slow and painful, and there's no way around it. Numerous invocations of EXPLAIN and frequent uses of VACUUM tell me that I'm dealing with a fair amount of data here. But let's be honest: the whole database is about 5 GB unvacuumed, and I know that people are doing all sorts of things with 100 GB and larger disks. So perhaps I've hit a hardware bottleneck, and need some zippier disks? I'm not sure.

So I'm looking for suggestions and advice from the PostgreSQL community. I haven't included a lot of details here, because I'm not even sure what details would be relevant, given the complexity of our situation.
So if there are some particular metrics that I should share with this group, I'd be happy to do so, from hardware configurations to table definitions to the queries that are getting bogged down.

Thanks in advance,

Reuven
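[For reference while reading the rest of the thread, the fact table being described might look roughly like the sketch below. Only the names Transactions, id, node_id, new_value, previous_value, and the node_id_idx index actually appear in the thread; every other column, type, and name here is a guess for illustration.]

    -- Hypothetical sketch of the "transactions" fact table described above.
    -- Columns not mentioned in the thread are assumptions.
    CREATE TABLE Transactions (
        id             serial PRIMARY KEY,  -- sequence-generated, so it increases over time
        activity_id    integer,             -- which of the six activities (assumed name)
        node_id        integer,             -- node within the activity (referenced by NodeSegments later)
        action_at      timestamp,           -- one-second resolution, per the logfiles
        variable_name  text,                -- assumed: which simulation variable was set
        new_value      text,
        previous_value text                 -- filled in after the initial load
    );

    CREATE INDEX node_id_idx ON Transactions (node_id);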
"Reuven M. Lerner" <reuven@lerner.co.il> writes: > Hi, everyone. I've been using PostgreSQL for a decade, and it hasn't > failed me yet. But I've been having some serious performance problems > on a database that I've been using in my grad-school research group, and > it's clear that I need help from some more experienced hands. What PG version are you using? > In theory, this would be the way to go. In practice, every step has > become a performance nightmare, taking many more hours than I might have > hoped or expected. For example, I've got a Pl/PgSQL function that goes > through each variable-assignment row, and UPDATEs is previous_value > column with whatever the previous value might have been. I'd try to think of a way to eliminate the function altogether in favor of a single UPDATE command. In general, row-at-a-time thinking isn't the way to win in SQL. regards, tom lane
Hi, Tom. You wrote:

> What PG version are you using?

I've been using 8.0, 8.1, and 8.2 at various points, depending on which machine I've been using. My main machine is currently using 8.2.0. (I wish that I had control over which version was being used, but my sysadmin powers are inversely proportional to the computer power made available to me.)

> I'd try to think of a way to eliminate the function altogether in favor of a single UPDATE command. In general, row-at-a-time thinking isn't the way to win in SQL.

Well, I've tried to do massive UPDATEs as much as possible. But the patterns that we're looking for are basically of the variety, "If the user clicks on X and then clicks on Y, but without Z between the two of them, and if these are all part of the same simulation run, then we tag action X as being of interest to us." So it's oodles of keeping track of back-and-forth for each of the rows in the table, and looking forward and backward in the table.

I agree that row-at-a-time thinking isn't the best way to work, but I didn't see a good alternative for our purposes. I'm open to any and all suggestions.

Reuven
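[The "X then Y with no Z in between, within the same run" requirement described above can, at least in principle, be stated set-wise rather than by walking the rows procedurally. A sketch follows, with entirely hypothetical action, run_id, and interesting columns standing in for whatever identifies the click type, the simulation run, and the pattern tag in the real schema.]

    -- Hypothetical sketch: "action", "run_id", and "interesting" are made-up
    -- column names.  Tag every X that is followed, in the same run, by a Y
    -- with no Z between them, using the sequence-generated id for ordering.
    UPDATE Transactions
       SET interesting = true
     WHERE action = 'X'
       AND EXISTS (SELECT 1
                     FROM Transactions y
                    WHERE y.run_id = Transactions.run_id
                      AND y.action = 'Y'
                      AND y.id > Transactions.id
                      AND NOT EXISTS (SELECT 1
                                        FROM Transactions z
                                       WHERE z.run_id = Transactions.run_id
                                         AND z.action = 'Z'
                                         AND z.id > Transactions.id
                                         AND z.id < y.id));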
> Well, I've tried to do massive UPDATEs as much as possible. But the patterns that we're looking for are basically of the variety, "If the user clicks on X and then clicks on Y, but without Z between the two of them, and if these are all part of the same simulation run, then we tag action X as being of interest to us." So it's oodles of keeping track of back-and-forth for each of the rows in the table, and looking forward and backward in the table.
>
> I agree that row-at-a-time thinking isn't the best way to work, but I didn't see a good alternative for our purposes. I'm open to any and all suggestions.

Can you post at least some table schemas, indexes, queries, and explain output? I think of database optimization as a serious case of devil in the details, and generalities (like -- make sure you index, make sure your indexes help using explain, avoid row-at-a-time thinking) won't get you far. So if we had something concrete to work with, well, we would have something concrete to work with.
> Heh. Sure thing. I wasn't sure how much detail to give when initially posting.

Looks like enough to get the real experts on the list started :) I will try to look again tomorrow, but I bet other folks have better intuition than me. How much concurrency is there on your database?
Hi, Webb Sprague. You wrote:

> How much concurrency is there on your database?

Almost none. I'm generally running one client against the server. I often have a second client connected simultaneously, just for the purpose of answering short questions.

I'm now thinking of separating each activity into its own database, in the hopes that this will improve the system speed enough to do what I need. But I'm far from convinced that this will really help.

Reuven
I am not able to look as closely as it deserves ...

... but I see two seq scans in your explain in a loop -- this is probably not good. If you can find a way to rewrite the IN clause (either de-normalizing through triggers to save whatever you need on an insert and not have to deal with a set, or by using EXCEPT in the query, or something else more creative)...

Also -- there is a good book on temporal databases by Snodgrass that might give some interesting ideas; maybe you have already seen it, but still. I am thinking you could increment a sequence variable to give you a "tick" integer with each action, rather than trying to use timestamps with all their overhead and inaccuracy (1 second is a long time, really). Lamport also did work on clocks that might apply.

Also, have you tried dropping all your fk and checks just to see if you get a difference in speed on an update? It would be interesting, perhaps.

If you could get rid of the sorted LIMIT 1 clause in your function, there would be less variability and it would be easier to understand; you probably need to denormalize somehow, perhaps using ticks; I am not sure....

Could a trigger set your previous_value on insert or update, rather than querying for it later?

> I'm now thinking of separating each activity into its own database, in the hopes that this will improve the system speed enough to do what I need. But I'm far from convinced that this will really help.

Better to figure out the real problem -- more interesting, more scalable.

Hope my incoherent brain dump lights a spark.
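[Webb's last question, about a trigger setting previous_value at INSERT time, might look something like the sketch below. It assumes the same hypothetical variable_name column as earlier, and that rows are loaded in id order, so the "most recent earlier row" is simply the highest existing id for that variable.]

    -- Sketch only, not tested against the real schema.
    CREATE OR REPLACE FUNCTION set_previous_value() RETURNS trigger AS $$
    BEGIN
        -- Look up the latest non-empty value already stored for this variable.
        SELECT t.new_value INTO NEW.previous_value
          FROM Transactions t
         WHERE t.variable_name = NEW.variable_name
           AND t.new_value IS NOT NULL
           AND t.new_value <> ''
      ORDER BY t.id DESC
         LIMIT 1;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER transactions_prev_value
        BEFORE INSERT ON Transactions
        FOR EACH ROW EXECUTE PROCEDURE set_previous_value();

[The point would be to amortize the lookup across the load instead of paying for one huge UPDATE afterwards; the per-insert cost only stays small if the ORDER BY ... LIMIT 1 can use an index.]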
Hi, Webb Sprague. You wrote:

> ... but I see two seq scans in your explain in a loop -- this is probably not good. If you can find a way to rewrite the IN clause (either de-normalizing through triggers to save whatever you need on an insert and not have to deal with a set, or by using EXCEPT in the query, or something else more creative)...

I would normally agree that an IN clause is a place to worry -- except that I'm using IN on a very small table, with about 4-5 rows. That might indeed affect things, and I could certainly pull out these values into a Perl variable that I then interpolate literally into the SQL. However, I have to assume that this wouldn't affect things all that much.

> Also -- there is a good book on temporal databases by Snodgrass that might give some interesting ideas; maybe you have already seen it, but still.

Heh -- that's one of the first references that I looked at when I started this project. Perhaps I'll look at it again; the design of my data warehouse took some of those suggestions into account, but I wouldn't be surprised if there's something more I could be doing.

> I am thinking you could increment a sequence variable to give you a "tick" integer with each action, rather than trying to use timestamps with all their overhead and inaccuracy (1 second is a long time, really). Lamport also did work on clocks that might apply.

I agree that 1 second is a ridiculously long time. The logfiles were created by people outside of my research group, and are a wonderful lesson in how to use XML poorly and inefficiently. The designers of the logfile weren't expecting to get dozens or hundreds of values per second. You can be sure that my research group, which is now looking to create logfiles of our own, will place a strong emphasis on high-resolution timestamps.

Practically speaking, I've managed to get around this problem by using the "id" column, which comes from a sequence and is thus guaranteed to be increasing. We can thus be assured that a row with an ID of x will come before a row whose ID is x + 10, so long as their transaction types are the same. So we can't guarantee that a node entry comes before a variable setting via the ID, but we can order two variable settings based on the ID.

> Also, have you tried dropping all your fk and checks just to see if you get a difference in speed on an update? It would be interesting, perhaps.

I sorta figured that because our UPDATE is taking place on a column without a foreign key, this wouldn't matter much. Am I wrong?

> If you could get rid of the sorted LIMIT 1 clause in your function, there would be less variability and it would be easier to understand; you probably need to denormalize somehow, perhaps using ticks; I am not sure....

Yeah, that's a good point. I'm not sure how possible that'll be, though, given that at *some* point, I need to find the "most recent" setting of the same variable. There might be dozens or hundreds of rows separating the assignment of a new value to a particular variable.

> Could a trigger set your previous_value on insert or update, rather than querying for it later?

Well, what I've been doing is INSERTing each row into the "transactions" table without previous_value. Then I run the mass UPDATE for previous_value, doing nothing else at the same time. Then, after previous_value is set, I do the query.

Reuven
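[Given that the expensive step is repeatedly finding "the most recent setting of the same variable" ordered by id, and that the only index visible in the EXPLAIN output later in the thread is node_id_idx, a composite index is an obvious thing to try. variable_name is again an assumed column name.]

    -- Sketch only: if the per-row lookup is what EXPLAIN shows as a seq scan
    -- inside a loop, this lets it become a single backward index scan per row.
    CREATE INDEX transactions_var_id_idx ON Transactions (variable_name, id);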
Reuven M. Lerner wrote:

> Hi, Webb Sprague. You wrote:
> > ... but I see two seq scans in your explain in a loop -- this is probably not good. If you can find a way to rewrite the IN clause (either de-normalizing through triggers to save whatever you need on an insert and not have to deal with a set, or by using EXCEPT in the query, or something else more creative)...
>
> I would normally agree that an IN clause is a place to worry -- except that I'm using IN on a very small table, with about 4-5 rows. That might indeed affect things, and I could certainly pull out these values into a Perl variable that I then interpolate literally into the SQL. However, I have to assume that this wouldn't affect things all that much.

Don't assume -- measure. I had a query which ran orders of magnitude faster because I interpolated the constant list in the big query. The table from which the interpolated values were being extracted had about 30 rows or so.

-- 
Alvaro Herrera                 http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Hi, Alvaro Herrera. You wrote:

> Don't assume -- measure. I had a query which ran orders of magnitude faster because I interpolated the constant list in the big query. The table from which the interpolated values were being extracted had about 30 rows or so.

OK, I modified things to use interpolation. Here's the updated query:

explain UPDATE Transactions
    SET previous_value = previous_value(id)
    WHERE new_value IS NOT NULL
    AND new_value <> ''
    AND node_id IN (351, 169, 664, 240);

And here is the query plan produced by explain:

                                    QUERY PLAN
----------------------------------------------------------------------------------
 Bitmap Heap Scan on transactions  (cost=8842.88..98283.93 rows=407288 width=249)
   Recheck Cond: (node_id = ANY ('{351,169,664,240}'::integer[]))
   Filter: ((new_value IS NOT NULL) AND (new_value <> ''::text))
   ->  Bitmap Index Scan on node_id_idx  (cost=0.00..8842.88 rows=434276 width=0)
         Index Cond: (node_id = ANY ('{351,169,664,240}'::integer[]))
(5 rows)

I'm still a bit surprised by how different the query plan came out with what would seem like a minor change.

Reuven
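[A note on the plan above: plain EXPLAIN shows only the planner's estimates, and the per-row cost of previous_value(id) sits in the UPDATE's target list rather than appearing as a separate plan node, which is presumably part of why the estimated cost looks nothing like the multi-hour runtime. EXPLAIN ANALYZE would give actual times and row counts; since it really executes the UPDATE, it should be wrapped in a transaction and rolled back. The "id < 50000" bound below is just an arbitrary way to test a small slice first.]

    BEGIN;
    EXPLAIN ANALYZE
        UPDATE Transactions
           SET previous_value = previous_value(id)
         WHERE new_value IS NOT NULL
           AND new_value <> ''
           AND node_id IN (351, 169, 664, 240)
           AND id < 50000;   -- arbitrary cutoff, just to keep the test short
    ROLLBACK;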
> OK, I modified things to use interpolation. Here's the updated query:
>
> explain UPDATE Transactions
>     SET previous_value = previous_value(id)
>     WHERE new_value IS NOT NULL
>     AND new_value <> ''
>     AND node_id IN (351, 169, 664, 240);
>
> And here is the query plan produced by explain:
>
>  Bitmap Heap Scan on transactions  (cost=8842.88..98283.93 rows=407288 width=249)
>    Recheck Cond: (node_id = ANY ('{351,169,664,240}'::integer[]))
>    Filter: ((new_value IS NOT NULL) AND (new_value <> ''::text))
>    ->  Bitmap Index Scan on node_id_idx  (cost=0.00..8842.88 rows=434276 width=0)
>          Index Cond: (node_id = ANY ('{351,169,664,240}'::integer[]))
> (5 rows)
>
> I'm still a bit surprised by how different the query plan came out with what would seem like a minor change.

Do you have new \timings?

What you or I think is a minor change isn't necessarily what the planner thinks is a minor change, especially when you change data from something that requires a query to something that is determinate. I would suggest changing your function to remove as many such queries as possible too (I am thinking of the ORDER BY ... LIMIT 1). This would be a good move also in that you narrow down the amount of moving parts to diagnose, and it just makes the whole thing cleaner.

The meta-moral is that db optimization requires systematic experimentation. Use the database to store the results of the various experiments! In light of this, I would suggest you try removing the check clauses and seeing if you get a difference too. Just like Francis Bacon said -- don't deduce from first principles, experiment!

I would also try amortizing the analysis with triggers, rather than building the table all at once; this may be better or worse, depending on the on-line character of the application (e.g., if they are waiting at an ATM, it might be a deal breaker to add two seconds to do an insert/update, but not if you are tailing off a log file that gets updated every minute or so.)

W
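[Webb's "use the database to store the results of the various experiments" could be as simple as the sketch below; every name here is made up for illustration, and the duration would simply be copied from the log's "duration:" lines or from \timing.]

    -- Minimal sketch of an experiment log; all names are hypothetical.
    CREATE TABLE tuning_experiments (
        id          serial PRIMARY KEY,
        run_at      timestamp DEFAULT now(),
        description text,      -- e.g. 'IN list interpolated, fk constraints dropped'
        statement   text,      -- the exact SQL that was run
        duration_ms numeric    -- from the server log or \timing
    );

    INSERT INTO tuning_experiments (description, statement, duration_ms)
    VALUES ('interpolated IN list',
            'UPDATE Transactions SET previous_value = previous_value(id) ...',
            16842710.469);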
Hi, Webb Sprague. You wrote:

> Do you have new \timings?

Yup. It just finished executing a little while ago. (I keep timing information in the logfile, rather than using \timing.) With the explicitly interpolated array in place, I got the following:

LOG: statement: UPDATE Transactions
    SET previous_value = previous_value(id)
    WHERE new_value IS NOT NULL
    AND new_value <> ''
    AND node_id IN (351, 169, 664, 240)
LOG: duration: 16842710.469 ms

The previous version, which included lots of calls to RAISE NOTICE and also used a subselect, had the following timing:

LOG: statement: UPDATE Transactions
    SET previous_value = previous_value(id)
    WHERE new_value IS NOT NULL
    AND new_value <> ''
    AND node_id IN (SELECT node_id FROM NodeSegments)
LOG: duration: 16687993.067 ms

So it looks like this didn't make much of a timing difference at all. And what little difference there was, was negative. Bleah.

> What you or I think is a minor change isn't necessarily what the planner thinks is a minor change, especially when you change data from something that requires a query to something that is determinate. I would suggest changing your function to remove as many such queries as possible too (I am thinking of the ORDER BY ... LIMIT 1). This would be a good move also in that you narrow down the amount of moving parts to diagnose, and it just makes the whole thing cleaner.

Good idea. I'll see if I can get the function to be a bit cleaner, although I'm not sure if it is, given the constraints of the problem. That's one of the reasons why I've been adding these "hints" to the database -- so that I can have many small queries, rather than one large one.

> I would also try amortizing the analysis with triggers, rather than building the table all at once; this may be better or worse, depending on the on-line character of the application (e.g., if they are waiting at an ATM, it might be a deal breaker to add two seconds to do an insert/update, but not if you are tailing off a log file that gets updated every minute or so.)

The data that I'm dealing with is not changing over time. So I've been trying to build the transactions table (aka my data warehouse) slowly, adding one or two columns at a time with hints that will make extracting the data easier. Unfortunately, building those hints has proven to be very slow going.

Reuven
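[Both timings above work out to roughly 4.6-4.7 hours (16.7-16.8 million ms), which matches the 4-6 hours reported at the start of the thread and suggests that the per-row previous_value(id) call, not the shape of the IN clause, is what dominates. For what it's worth, on a PostgreSQL release newer than the 8.0-8.2 discussed here (8.4 or later), the whole column could be computed in one pass with the lag() window function; the sketch below assumes the same hypothetical variable_name column as earlier.]

    -- Sketch only: requires lag()/window functions (PostgreSQL 8.4+), and
    -- "variable_name" is an assumed column.  Compute every previous value in
    -- a single pass instead of one lookup per row.
    UPDATE Transactions
       SET previous_value = prev.prev_value
      FROM (SELECT id,
                   lag(new_value) OVER (PARTITION BY variable_name
                                        ORDER BY id) AS prev_value
              FROM Transactions
             WHERE new_value IS NOT NULL
               AND new_value <> '') AS prev
     WHERE Transactions.id = prev.id;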
> LOG: statement: UPDATE Transactions
>     SET previous_value = previous_value(id)
>     WHERE new_value IS NOT NULL
>     AND new_value <> ''
>     AND node_id IN (SELECT node_id FROM NodeSegments)
> LOG: duration: 16687993.067 ms

I hope I can pass along some suggestions that I gleaned after finishing my Celko book. I don't know if the suggestions presented will help in your case.

From the reading, WHERE conditions such as <> '' or IS NOT NULL can be performance killers, as they may discourage the optimizer from using an index scan. The suggestion was to replace them with:

new_value > '<some minimum value possible in this field, i.e. A>'

This WHERE condition should only find non-NULL and non-empty strings.

Also, IN is known as a killer, so the suggestion was to reform the query like so:

UPDATE Transactions
    SET previous_value = previous_value(id)
    FROM NodeSegments
    WHERE Transactions.Node_id = NodeSegments.Node_id
    AND Transactions.new_value > 'A';  -- assuming A is a minimum value

I hope this can help.

Regards,
Richard Broersma Jr.
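[A related, PostgreSQL-specific option that Richard's message doesn't mention -- offered here only as a hedged suggestion -- is a partial index that bakes the non-NULL/non-empty predicate into the index definition, rather than rewriting the predicate as a > 'A' comparison.]

    -- Sketch: a partial index matching the UPDATE's filter conditions.
    CREATE INDEX transactions_nodeid_nonempty_idx
        ON Transactions (node_id)
     WHERE new_value IS NOT NULL AND new_value <> '';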
Reuven -- a few years back this same scenario happened for me, working on a project with a large Oracle DB which held clinical data. We needed to implement a join condition, but the column we were using was not using an index scan. Since the cardinality of the results was extraordinary, we re-implemented the column to use EXISTS (for large result sets), following the omnipresent example used by Oracle books where IN is used for columns with low cardinality, and following the docs for EXISTS, where the EXISTS clause always has to return something.

In our case we were joining on a doctorid with IN (which made no sense, as there were millions of PatientIDs) to find all patients whose doctors were in PPOs. So, to increase performance, we changed the IN clause for the column with high cardinality (doctorid) to EXISTS:

select p.PATIENT_NAME
from PATIENT p
where p.doctorid in (select doctorid from ppo_table);

/* there are many doctorids in this ppo table, so we will change to EXISTS */

select p.PATIENT_NAME
from PATIENT p
where exists (select 0 from ppo_table ppo where p.doctorID = ppo.doctorID);

Shalom,
Martin
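[Applying Martin's IN-to-EXISTS rewrite to the NodeSegments query Reuven posted earlier would look roughly like the following. Whether the 8.2 planner actually produces a different plan for it would need to be checked with EXPLAIN, and given the timings above, the per-row function call seems to be the dominant cost either way.]

    -- The same IN -> EXISTS rewrite applied to the earlier query, as a sketch.
    UPDATE Transactions
       SET previous_value = previous_value(id)
     WHERE new_value IS NOT NULL
       AND new_value <> ''
       AND EXISTS (SELECT 1
                     FROM NodeSegments ns
                    WHERE ns.node_id = Transactions.node_id);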