Thread: The REAL cost of joins

The REAL cost of joins

From
Marcin Krol
Date:
Hello everyone,

I have inadvertently set off a furball on an unrelated ng on what is the
actual cost of SQL joins. But there's no reliable conclusion. I would
like to actually know that, that is, are JOINs truly expensive?

As they say, one measurement is worth a thousand opinions, so I've done
measurement on my PG app:

$ time echo "\c hrs;
SELECT hosts.ip, reservation.start_date, architecture.architecture,
os_kind.os_kind, os_rel.os_rel, os_version.os_version, project.project,
email.email FROM hosts
  INNER JOIN project ON project.id = hosts.project_id
  INNER JOIN architecture ON hosts.architecture_id = architecture.id
  INNER JOIN os_kind ON os_kind.id = hosts.os_kind_id
  INNER JOIN os_rel ON hosts.os_rel_id = os_rel.id
  INNER JOIN os_version ON hosts.os_version_id = os_version.id
  INNER JOIN reservation_hosts ON hosts.id = reservation_hosts.host_id
  INNER JOIN reservation on reservation.id =
reservation_hosts.reservation_id
  INNER JOIN email ON reservation.email_id = email.id

;" | psql > /dev/null

real    0m0.099s
user    0m0.015s
sys     0m0.005s



$ time echo "\c hrs;
 > SELECT hosts.ip FROM hosts;
 > SELECT reservation.start_date FROM reservation;
 > SELECT architecture.architecture FROM architecture;
 > SELECT os_rel.os_rel FROM os_rel;
 > SELECT os_version.os_version FROM os_version;
 > SELECT project.project FROM project;
 > SELECT email.email FROM email;
 > " | psql > /dev/null

real    0m0.046s
user    0m0.008s
sys     0m0.004s

Note: I've created indexes on those tables, both on data columns like
hosts.ip and on .id columns.

What do you think of this? And in general: when (if?) should one
denormalize data?

Regards,
mk


Re: The REAL cost of joins

From
Peter Hunsberger
Date:
On Wed, Mar 3, 2010 at 3:59 PM, Marcin Krol <mrkafk@gmail.com> wrote:
>
> Hello everyone,
>
> I have inadvertently set off a furball on an unrelated ng on what is the actual cost of SQL joins. But there's no
reliableconclusion. I would like to actually know that, that is, are JOINs truly expensive? 
>
> As they say, one measurement is worth a thousand opinions, so I've done measurement on my PG app:
>
> $ time echo "\c hrs;
> SELECT hosts.ip, reservation.start_date, architecture.architecture, os_kind.os_kind, os_rel.os_rel,
os_version.os_version,project.project, email.email FROM hosts 
>  INNER JOIN project ON project.id = hosts.project_id
>  INNER JOIN architecture ON hosts.architecture_id = architecture.id
>  INNER JOIN os_kind ON os_kind.id = hosts.os_kind_id
>  INNER JOIN os_rel ON hosts.os_rel_id = os_rel.id
>  INNER JOIN os_version ON hosts.os_version_id = os_version.id
>  INNER JOIN reservation_hosts ON hosts.id = reservation_hosts.host_id
>  INNER JOIN reservation on reservation.id = reservation_hosts.reservation_id
>  INNER JOIN email ON reservation.email_id = email.id
>
> ;" | psql > /dev/null
>
> real    0m0.099s
> user    0m0.015s
> sys     0m0.005s
>
>
>
> $ time echo "\c hrs;
> > SELECT hosts.ip FROM hosts;
> > SELECT reservation.start_date FROM reservation;
> > SELECT architecture.architecture FROM architecture;
> > SELECT os_rel.os_rel FROM os_rel;
> > SELECT os_version.os_version FROM os_version;
> > SELECT project.project FROM project;
> > SELECT email.email FROM email;
> > " | psql > /dev/null
>
> real    0m0.046s
> user    0m0.008s
> sys     0m0.004s
>
> Note: I've created indexes on those tables, both on data columns like hosts.ip and on .id columns.
>
> What do you think of this? And in general: when (if?) should one denormalize data?

I don't get what these two examples are supposed to show?  They do
not, in general, create equivalent output data. If you had to write
procedural logic to do the same correlation between the individual
result sets as the joins are performing then you'd be back to at least
the same performance and probably worse, so what exactly is being
compared here?

As to your second question; well, outside of not doing premature
optimization I'll observe that for operational databases,
denormalization rarely gives me enough savings to justify the
resultant extra data management required to maintain it.  If you're
talking about a data warehouse (and given the way you've worded the
question I supsect you are not?) then that's a different question, in
that case I prefer to generate the data from the operational side of
the house so you're no longer paying an operational cost to maintain
the denormalized data (you've got a whole different set of costs
instead).

--
Peter Hunsberger

Re: The REAL cost of joins

From
Richard Huxton
Date:
On 03/03/10 21:59, Marcin Krol wrote:
> What do you think of this? And in general: when (if?) should one
> denormalize data?

As a last resort. No sooner.

The support costs of denormalising your database is such that if you can
reasonably just buy more hardware / add caching / etc, do so.

Oh, and I'm afraid your tests are probably meaningless.
1. The times are too small to be accurate.
2. You have the overhead of starting psql and connecting to the
database, starting a new backend etc.
3. You have the overhead of parsing the query
4. You're fetching all rows (which presumably aren't many) - not a
terribly useful scenario.

If you wanted to measure actual join costs, you'd need to repeat the
tests (say) 100-1000 times in a loop, optionally with prepared plans.
Varying WHERE clauses might be useful too, if that's how your real
application will work.

--
   Richard Huxton
   Archonet Ltd

Re: The REAL cost of joins

From
Grzegorz Jaśkiewicz
Date:
the reason you are using joins, most often is because your schema is normalized. One way or another, de-normalisation + queries will cost you more, than normalised tables and joins.
That's at least the short answer.

Re: The REAL cost of joins

From
Craig Ringer
Date:
On 4/03/2010 4:47 PM, Richard Huxton wrote:
> On 03/03/10 21:59, Marcin Krol wrote:
>> What do you think of this? And in general: when (if?) should one
>> denormalize data?
>
> As a last resort. No sooner.
>
> The support costs of denormalising your database is such that if you can
> reasonably just buy more hardware / add caching / etc, do so.

Well, materialized views are a denormalization of sorts, and a really
useful one - but they're not the authorative store of the data.

Things like views, materialized views, trigger-maintained side tables
etc are what you should look at before even considering denormalizing
your main data storage.

--
Craig Ringer

Re: The REAL cost of joins

From
Merlin Moncure
Date:
On Wed, Mar 3, 2010 at 4:59 PM, Marcin Krol <mrkafk@gmail.com> wrote:
> Hello everyone,
>
> I have inadvertently set off a furball on an unrelated ng on what is the
> actual cost of SQL joins. But there's no reliable conclusion. I would like
> to actually know that, that is, are JOINs truly expensive?

There's a lot of variables that go in to join performance.  Are the
necessary columns indexed, what is the comparison operator, size of
the index, etc?  Complicating the calculation, postgres has access to
different algorithms it chooses depending on circumstances.  In
general, joins cost less than a lot of novice programmers think
because they are simply unaware of the number of highly efficient ways
to attack the problem of doing set operations on lists of sorted
items, or perhaps have had bad experiences on databases that had a
lousy selection of algorithms or poor (if any) statistics.

The three biggest factors on join performance are 1: having index in
appropriate places,  2: if/when there is a full cache fault and the
database has to get information from disk when executing the join.
And 3: there being enough working memory to do things like 'in query'
sorts. Point 2 is where denormalization can reap real benefits because
it can reduce the number of disk seeks to get data (compared to the
cpu and memory, disk seeks take eons).

Databases have a characteristic that reminds me of some of the
challenges that aircraft engineers face.  Jets that go supersonic have
an entirely different set of operating principles because above the
speed of sound all the rules changes in terms of thermals, vibration,
stresses, etc.  Similarly in the database world performance tend to
lurch in a very unpleasant direction when you active working set of
data is not able to be properly served in RAM.   The 'sonic boom' you
hear is your disk drives grinding as the clock ticks away...for many
DBAs hearing this sound the first time is a life changing experience.
This is when those little things you never bothered to think about
like having the records that tend to be pulled up together by grouped
together on disk become suddenly very important...you discover the
CLUSTER command and become best friends :-).

merlin

Re: The REAL cost of joins

From
Simon Riggs
Date:
On Wed, 2010-03-03 at 22:59 +0100, Marcin Krol wrote:

> What do you think of this? And in general: when (if?) should one
> denormalize data?

Your example shows that its quicker to put the ingredients on the table
than it is to bake a cake. I'm not sure that's an argument against
baking, nor is it an argument in favour of baking the night before.

--
 Simon Riggs           www.2ndQuadrant.com