Thread: The REAL cost of joins
Hello everyone,

I have inadvertently set off a furball on an unrelated newsgroup about the actual cost of SQL joins, but there's no reliable conclusion. I would like to actually know: are JOINs truly expensive?

As they say, one measurement is worth a thousand opinions, so I've done a measurement on my PG app:

$ time echo "\c hrs;
SELECT hosts.ip, reservation.start_date, architecture.architecture,
       os_kind.os_kind, os_rel.os_rel, os_version.os_version,
       project.project, email.email FROM hosts
INNER JOIN project ON project.id = hosts.project_id
INNER JOIN architecture ON hosts.architecture_id = architecture.id
INNER JOIN os_kind ON os_kind.id = hosts.os_kind_id
INNER JOIN os_rel ON hosts.os_rel_id = os_rel.id
INNER JOIN os_version ON hosts.os_version_id = os_version.id
INNER JOIN reservation_hosts ON hosts.id = reservation_hosts.host_id
INNER JOIN reservation ON reservation.id = reservation_hosts.reservation_id
INNER JOIN email ON reservation.email_id = email.id
;" | psql > /dev/null

real    0m0.099s
user    0m0.015s
sys     0m0.005s

$ time echo "\c hrs;
> SELECT hosts.ip FROM hosts;
> SELECT reservation.start_date FROM reservation;
> SELECT architecture.architecture FROM architecture;
> SELECT os_rel.os_rel FROM os_rel;
> SELECT os_version.os_version FROM os_version;
> SELECT project.project FROM project;
> SELECT email.email FROM email;
> " | psql > /dev/null

real    0m0.046s
user    0m0.008s
sys     0m0.004s

Note: I've created indexes on those tables, both on data columns like hosts.ip and on .id columns.

What do you think of this? And in general: when (if?) should one denormalize data?

Regards,
mk
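For what it's worth, EXPLAIN ANALYZE gives a view of the same question without the psql startup and connection overhead: it times only query execution and labels each plan step with the join algorithm the planner actually chose. A sketch against the schema above (shortened to three of the joins; real plan output will vary):

-- Run inside psql against the hrs database from the post.
-- Each plan node reports its actual time, and the node names show
-- the join strategy used (Nested Loop, Hash Join, or Merge Join).
EXPLAIN ANALYZE
SELECT hosts.ip, reservation.start_date, email.email
FROM hosts
INNER JOIN reservation_hosts ON hosts.id = reservation_hosts.host_id
INNER JOIN reservation ON reservation.id = reservation_hosts.reservation_id
INNER JOIN email ON reservation.email_id = email.id;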
On Wed, Mar 3, 2010 at 3:59 PM, Marcin Krol <mrkafk@gmail.com> wrote:
>
> Hello everyone,
>
> I have inadvertently set off a furball on an unrelated newsgroup about
> the actual cost of SQL joins, but there's no reliable conclusion. I
> would like to actually know: are JOINs truly expensive?
>
> As they say, one measurement is worth a thousand opinions, so I've done
> a measurement on my PG app:
>
> $ time echo "\c hrs;
> SELECT hosts.ip, reservation.start_date, architecture.architecture,
>        os_kind.os_kind, os_rel.os_rel, os_version.os_version,
>        project.project, email.email FROM hosts
> INNER JOIN project ON project.id = hosts.project_id
> INNER JOIN architecture ON hosts.architecture_id = architecture.id
> INNER JOIN os_kind ON os_kind.id = hosts.os_kind_id
> INNER JOIN os_rel ON hosts.os_rel_id = os_rel.id
> INNER JOIN os_version ON hosts.os_version_id = os_version.id
> INNER JOIN reservation_hosts ON hosts.id = reservation_hosts.host_id
> INNER JOIN reservation ON reservation.id = reservation_hosts.reservation_id
> INNER JOIN email ON reservation.email_id = email.id
> ;" | psql > /dev/null
>
> real    0m0.099s
> user    0m0.015s
> sys     0m0.005s
>
> $ time echo "\c hrs;
> > SELECT hosts.ip FROM hosts;
> > SELECT reservation.start_date FROM reservation;
> > SELECT architecture.architecture FROM architecture;
> > SELECT os_rel.os_rel FROM os_rel;
> > SELECT os_version.os_version FROM os_version;
> > SELECT project.project FROM project;
> > SELECT email.email FROM email;
> > " | psql > /dev/null
>
> real    0m0.046s
> user    0m0.008s
> sys     0m0.004s
>
> Note: I've created indexes on those tables, both on data columns like
> hosts.ip and on .id columns.
>
> What do you think of this? And in general: when (if?) should one
> denormalize data?

I don't see what these two examples are supposed to show. They do not, in general, produce equivalent output data. If you had to write procedural logic to do the same correlation between the individual result sets that the joins are performing, you'd be back to at least the same cost and probably worse - so what exactly is being compared here?

As to your second question: beyond the usual advice against premature optimization, I'll observe that for operational databases, denormalization rarely gives me enough savings to justify the extra data management required to maintain it. If you're talking about a data warehouse (and given the way you've worded the question I suspect you are not), that's a different matter; in that case I prefer to generate the denormalized data from the operational side of the house, so you're no longer paying an operational cost to maintain it (you've got a whole different set of costs instead).

--
Peter Hunsberger
On 03/03/10 21:59, Marcin Krol wrote:
> What do you think of this? And in general: when (if?) should one
> denormalize data?

As a last resort. No sooner.

The support cost of denormalising your database is such that if you can reasonably just buy more hardware / add caching / etc., do so.

Oh, and I'm afraid your tests are probably meaningless:

1. The times are too small to be accurate.
2. You have the overhead of starting psql, connecting to the database, starting a new backend, etc.
3. You have the overhead of parsing the query.
4. You're fetching all rows (which presumably aren't many) - not a terribly useful scenario.

If you wanted to measure actual join costs, you'd need to repeat the tests (say) 100-1000 times in a loop, optionally with prepared plans. Varying WHERE clauses might be useful too, if that's how your real application will work.

--
  Richard Huxton
  Archonet Ltd
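One way to run the kind of repeated measurement Richard describes is pgbench's custom-script mode, which replays a file of SQL over a single connection and reports transactions per second. A sketch, assuming the hrs database from the original post and a hypothetical script file join_test.sql:

$ cat join_test.sql
SELECT hosts.ip, reservation.start_date, email.email
FROM hosts
INNER JOIN reservation_hosts ON hosts.id = reservation_hosts.host_id
INNER JOIN reservation ON reservation.id = reservation_hosts.reservation_id
INNER JOIN email ON reservation.email_id = email.id;

# -n skips pgbench's own vacuum step, -c 1 uses one connection, and
# -t 1000 runs the script 1000 times, amortizing connection and
# parse overhead across the runs.
$ pgbench -n -f join_test.sql -c 1 -t 1000 hrs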
The reason you are using joins, most often, is that your schema is normalized. One way or another, de-normalisation plus the queries against it will cost you more than normalised tables and joins.
That's at least the short answer.
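A concrete illustration of that extra cost, using the schema from the original post (a sketch; reservations_flat is a hypothetical denormalized table that copies the email address into every reservation row):

-- Normalized: correcting a contact address touches one row, and
-- every join sees the new value immediately.
UPDATE email SET email = 'new@example.com' WHERE id = 42;

-- Denormalized: the same correction must find and rewrite every
-- copy of the value, and be repeated for every future change,
-- or the copies silently drift out of sync.
UPDATE reservations_flat
SET email = 'new@example.com'
WHERE email = 'old@example.com';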
On 4/03/2010 4:47 PM, Richard Huxton wrote:
> On 03/03/10 21:59, Marcin Krol wrote:
>> What do you think of this? And in general: when (if?) should one
>> denormalize data?
>
> As a last resort. No sooner.
>
> The support cost of denormalising your database is such that if you
> can reasonably just buy more hardware / add caching / etc., do so.

Well, materialized views are a denormalization of sorts, and a really useful one - but they're not the authoritative store of the data.

Things like views, materialized views, trigger-maintained side tables etc. are what you should look at before even considering denormalizing your main data storage.

--
Craig Ringer
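A minimal sketch of a trigger-maintained side table of the kind Craig mentions: a per-host reservation count kept current by an AFTER trigger, so readers get a precomputed figure instead of joining and aggregating. The side-table and function names are hypothetical; reservation_hosts is from the original post:

-- Hypothetical summary table, maintained by the trigger below.
CREATE TABLE host_reservation_counts (
    host_id integer PRIMARY KEY,
    n       integer NOT NULL DEFAULT 0
);

CREATE FUNCTION bump_host_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE host_reservation_counts SET n = n + 1
            WHERE host_id = NEW.host_id;
        IF NOT FOUND THEN
            INSERT INTO host_reservation_counts (host_id, n)
                VALUES (NEW.host_id, 1);
        END IF;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE host_reservation_counts SET n = n - 1
            WHERE host_id = OLD.host_id;
    END IF;
    RETURN NULL;  -- the return value of an AFTER trigger is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER reservation_hosts_count
    AFTER INSERT OR DELETE ON reservation_hosts
    FOR EACH ROW EXECUTE PROCEDURE bump_host_count();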
On Wed, Mar 3, 2010 at 4:59 PM, Marcin Krol <mrkafk@gmail.com> wrote:
> Hello everyone,
>
> I have inadvertently set off a furball on an unrelated newsgroup about
> the actual cost of SQL joins, but there's no reliable conclusion. I
> would like to actually know: are JOINs truly expensive?

There are a lot of variables that go into join performance: are the necessary columns indexed, what is the comparison operator, what is the size of the index, etc.? Complicating the calculation, postgres has access to several different algorithms that it chooses between depending on circumstances. In general, joins cost less than a lot of novice programmers think, because those programmers are simply unaware of the number of highly efficient ways to attack the problem of doing set operations on lists of sorted items - or perhaps they have had bad experiences on databases with a lousy selection of algorithms or poor (if any) statistics.

The three biggest factors in join performance are:

1. having indexes in the appropriate places;
2. whether there is a cache fault and the database has to get information from disk when executing the join; and
3. having enough working memory to do things like in-query sorts.

Point 2 is where denormalization can reap real benefits, because it can reduce the number of disk seeks needed to get the data (compared to the cpu and memory, disk seeks take eons).

Databases have a characteristic that reminds me of some of the challenges that aircraft engineers face. Jets that go supersonic operate under an entirely different set of principles, because above the speed of sound all the rules change in terms of thermals, vibration, stresses, etc. Similarly, in the database world, performance tends to lurch in a very unpleasant direction when your active working set of data can no longer be properly served from RAM. The 'sonic boom' you hear is your disk drives grinding as the clock ticks away... for many DBAs, hearing this sound for the first time is a life-changing experience. This is when the little things you never bothered to think about - like having the records that tend to be pulled up together also grouped together on disk - suddenly become very important... you discover the CLUSTER command and become best friends :-).

merlin
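For reference, the CLUSTER command Merlin mentions physically rewrites a table in the order of one of its indexes, so rows that tend to be fetched together end up on the same disk pages. A sketch against the schema from the original post (the index name is an assumption; any existing index on the join column would do):

-- Rewrite reservation_hosts in host_id order, so all the
-- reservation rows for one host sit on neighbouring pages.
CREATE INDEX reservation_hosts_host_id_idx
    ON reservation_hosts (host_id);
CLUSTER reservation_hosts USING reservation_hosts_host_id_idx;

-- CLUSTER is a one-time rewrite: the ordering is not maintained as
-- rows are added or updated, so it is typically rerun periodically.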
On Wed, 2010-03-03 at 22:59 +0100, Marcin Krol wrote:
> What do you think of this? And in general: when (if?) should one
> denormalize data?

Your example shows that it's quicker to put the ingredients on the table than it is to bake a cake. I'm not sure that's an argument against baking, nor is it an argument in favour of baking the night before.

--
  Simon Riggs
  www.2ndQuadrant.com