Thread: Database design - best practice

Database design - best practice

From
Niels Kristian Schjødt
Date:
Hi,

I'm on the hunt for some solid knowledge, on a theoretical level, about the performance of PostgreSQL. My question is
about best practices, and how architectural decisions might influence performance. First a little background:

The setup:
I have a database which holds information on used cars. The database has mainly 3 tables of interest for this case:
a cars table, an adverts table and a sellers table. One car has many adverts, one seller has many adverts, and one
advert belongs to one car and one seller.
The database powers a website for searching used cars. When searching for used cars, the cars table is the one mainly
used, and a lot of its columns should be directly available for searching, e.g. color, mileage, price,
has_automatic_transmission etc.

So my main concern is actually the cars table, since it currently has a lot of columns (151 - I expect that's
quite a lot?) and a lot of data (4 million rows, and growing). Now you might start by thinking this sounds
like a regular need for some normalization, but wait a second and let me explain :-)
The columns in this table are for the most part very short strings, integers, decimals or booleans. Take
has_automatic_transmission (boolean) for example: I can't see why it would make sense to put that into a separate
table and join in the values. The same goes for the mileage or the price. The cars table used for search is indexed quite a lot.

The questions:
Having the above setup in mind, what impact on performance, in terms of read and write performance, does each of the following have:
    1) In general, would reads and/or writes on the database be faster if I serialized some of the non-searched columns in the table into a single text column instead of, let's say, 20 booleans?
    2) Let's say I'm updating a timestamp in a single one of the 151 columns in the cars table, and the UPDATE statement uses the id to find the car. Would the write performance of that UPDATE be affected if the table had fewer columns?
    3) I know that adding a new column to the table becomes slower the more rows are in the table, but what about the "width" of the table - does that affect the performance of adding new columns?
    4) In general, what performance downsides do you get from adding a lot of columns to one table instead of having them in separate tables?
    5) Is it significantly faster to select * from a table with 20 columns than to select the same 20 columns from a table with 150 columns?

Hope there are some good answers out there :-)

Re: Database design - best practice

From
Willem Leenen
Date:
Niels,

" I can't see why it would make sense to put that into a separate table and join in the values "
You don't normalize for performance. People DEnormalize for performance.


Questions: (AFAIK)

1) This is a road to disaster. Get yourself a book on relational databases, for example by Celko. Do NOT go against the rules of the RDBMS, in this case rule #1: atomic values in a column.

2) This is not the big fish you are after. Benchmark your setup first and compare the results with your desired performance level. Quantify your problem, if there is any, before resorting to tricks.

3) A wider row needs more memory, and this may be amplified during hash joins.

4) People DEnormalize for performance.

5) " Is it significantly faster to select * from a table with 20 columns, than selecting the same 20 in a table with 150 columns?"

I know the answer, but I encourage you to simply test this. I have seen lots of urban legends about performance (including dropping referential integrity because that would supposedly make a difference...).
Of course, when it's a full table scan and the reads are ALL from disk (or ALL from memory), you can simply calculate it too. But get into the habit of testing in order to learn.


My advice:
- know what performance you need.
- test whether you have it, varying table content and system load.
- do not tamper with the RDBMS rules; this will haunt you.
- if you have the latest PostgreSQL version, you can use covering indexes: the table isn't accessed at all, which bypasses most of your questions. Check with peers that you've got the indexes right. (A small sketch of the idea follows below.)
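
A minimal sketch of that covering-index idea, with column names that are purely my assumptions about your schema (since PostgreSQL 9.2, a query that needs only indexed columns can often be answered with an index-only scan):

-- Hypothetical column names (price, color); adjust to the real schema.
CREATE INDEX cars_price_color_idx ON cars (price, color);

-- If the planner can answer this from the index alone (and the
-- visibility map allows it), the plan shows "Index Only Scan":
EXPLAIN (ANALYZE, BUFFERS)
SELECT price, color FROM cars WHERE price BETWEEN 5000 AND 10000;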

Regards,
Willem




Re: Database design - best practice

From
Niels Kristian Schjødt
Date:
Thanks for the advice.

Currently I see a lot of I/O related to update/inserts, so I'm trying to track down these guys at first. In relation to question 2, I read somewhere in the documentation that because of MVCC, the whole row has to be rewritten even though I just update one single column in that row. Hence if the table is wider (has more columns), the update will be slower. Does this match your understanding?


Re: Database design - best practice

From
Willem Leenen
Date:
" In relation to question 2, I read somewhere in the documentation that because of MVCC, the whole row has to be rewritten even though I just update one single column in that row. Hence if the table is wider (has more columns), the update will be slower. Does this match your understanding? "

No. I don't count the number of rows, but the number of blocks (pages) that are modified, which are 8 kB each.

My advice would be to first establish a solution direction by diagnosing the problem. My experience is that most solutions are not obscure at all.
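
One way to see how many pages a statement actually touches is EXPLAIN with the BUFFERS option. A sketch only; the column name and id are invented here, and EXPLAIN ANALYZE really executes the UPDATE, hence the rollback:

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE cars SET last_synced_at = now() WHERE id = 1602382;
ROLLBACK;
-- The "Buffers: shared hit=... read=... dirtied=..." lines report the
-- 8 kB pages that were read and modified.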





Re: Database design - best practice

From
Bèrto ëd Sèra
Date:
Hi Kristian,

> " I can't see why it would make sense to put that into a separate table and
> join in the values "
> You don't normalize for performance. People DEnormalize for performance.

Yes. In short, you seem to be more of a developer than an RDBMS guy. This is
not a personal fault, but it's a *very* dangerous state to be in, and
you should address the problem asap. Erase from your head everything you
know in terms of "putting it into a file" and read very basic texts about
normal forms. Like this:

http://en.wikipedia.org/wiki/Database_normalization

As Willem already said, learn to test your stuff. There is a
\timing command in psql; use it.

For example (addressing your other post), you want to check how long it takes to
UPDATE "adverts"
SET
  "last_observed_at" = '2012-11-28 00:02:30.265154',
  "data_source_id" ='83d024a57bc2958940f3ca281bddcbf4'
WHERE
  "adverts"."id" IN ( 1602382, 4916432, ...... 3637777 ) ;

as opposed to

UPDATE "adverts"
SET
  "last_observed_at" = '2012-11-28 00:02:30.265154',
  "data_source_id" ='83d024a57bc2958940f3ca281bddcbf4'
WHERE
  "adverts"."id" = 1602382 OR
  "adverts"."id" = 4916432 OR
   ......
  "adverts"."id" = 3637777;

My 5 pence
Bèrto
--
==============================
If Pac-Man had affected us as kids, we'd all be running around in a
darkened room munching pills and listening to repetitive music.


Re: Database design - best practice

From
"Kevin Grittner"
Date:
Niels Kristian Schjødt wrote:

> So my main concern is actually about the cars table, since this
> one currently has a lot of columns (151 - I expect that's quite a
> lot?),

That's pretty wide, but not outrageous.

> and a lot of data (4 mil. rows, and growing).

That's not a big deal. It's not unusual to have hundreds of
millions of rows in a PostgreSQL table. Properly indexed, that
should perform fine on queries. Sometimes partitioning rows into
sub-tables helps, but you didn't really mention anything which
suggests that would be helpful for you.

> Now you might start by thinking, this could sound like a regular
> need for some normalization

On the contrary, what you describe sounds well normalized. Breaking
off attributes of a car into separate tables would not advance
that.

> The columns in this table are for the most part very short strings,
> integers, decimals or booleans. So take for an example
> has_automatic_transmission (boolean) I can't see why it would
> make sense to put that into a separate table and join in the
> values. Or the mileage or the price as another example. The cars
> table used for search is indexed quite a lot.

On the face of it, it sounds like you should have some one-column
indexes on the columns most useful for selection (based on
frequency of use and how selective a selection on the column tends
to be).
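
For example, something along these lines (a sketch only; which columns deserve an index depends on the actual search patterns, and the column names here are just guesses):

CREATE INDEX cars_price_idx   ON cars (price);
CREATE INDEX cars_mileage_idx ON cars (mileage);
CREATE INDEX cars_color_idx   ON cars (color);
-- A low-selectivity boolean such as has_automatic_transmission rarely
-- deserves its own full index; a partial index is often a better fit.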

You might benefit from a technique called "vertical partitioning"
-- where you split off less frequently referenced columns and/or
columns which are updated more often into "sibling" tables, with
the same primary key as the car table. That can sometimes buy some
performance at the expense of programming complexity and more
difficulty maintaining data integrity. I wouldn't go there without
evidence that your performance is not adequate without it.
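
A sketch of what such a sibling table might look like (all names here are assumptions, and again: measure before committing to this):

-- Rarely read and/or frequently updated attributes move to a sibling
-- table that shares the cars primary key (assuming cars.id is the PK).
CREATE TABLE car_details (
    car_id bigint PRIMARY KEY REFERENCES cars (id),
    inspection_notes text,
    import_payload   text
);

-- Queries that need those attributes join back on the shared key:
SELECT c.*, d.inspection_notes
FROM cars c
JOIN car_details d ON d.car_id = c.id
WHERE c.id = 1602382;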

-Kevin


Re: Database design - best practice

From
Vitalii Tymchyshyn
Date:
Let me be the devil's advocate here :)
First of all, even if you read the basics about normalization, don't take them too much to heart :) Think.
Know that each normalization/denormalization step has its pros and cons. E.g. in the NoSQL world they often don't normalize much.
What's interesting about PostgreSQL is that it is quite well suited for NoSQL-like scenarios.
First of all, each unfilled (null) data column takes only 1 bit. This, BTW, leads to the interesting consequence that performance-wise it can be better to have a null/true boolean than false/true, especially if you've got a lot of "false".
So, PostgreSQL should be fine with tens, possibly hundreds of data columns with most columns empty. A record of 151 null columns would take header + roundup(151/8) = 19 bytes for the null bitmap. Not much. NoSQL stores usually put the column names into the records themselves, and that costs more.
Any null columns at the end of a record take no space at all (so you could think about reordering your columns to put the least used at the end of the record).
Adding a column with null as the default is a cheap operation that does not require a table scan.
You can have partial indexes to speed things up, like create index on car (car_id) where (has_automatic_transmission);

On the other side, when you normalize you need to join. Instead of "select * from car where has_automatic_transmission" (which can use the index above), you would have to write "select * from car where id in (select id from car_with_automatic_transmission)". The plan is much more complex here, and it will be slower.
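
To see that for yourself, compare the two plans (a sketch; car_with_automatic_transmission is the hypothetical split-off table from the normalized design):

-- Denormalized: one wide table, served by the partial index above.
EXPLAIN ANALYZE
SELECT * FROM car WHERE has_automatic_transmission;

-- Normalized: the same question has to go through a semi-join.
EXPLAIN ANALYZE
SELECT * FROM car
WHERE id IN (SELECT id FROM car_with_automatic_transmission);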

The main plus of normalization for you is that right now you work with the record as a whole, so if there is a lot of information in there that is rarely used, you "pay" for its access every time, both on selects and updates.

So, in conclusion, I agree with the others that you should test. But remember, joining two tables with millions of records is never cheap :)

Best regards, Vitalii Tymchyshyn



Re: Database design - best practice

From
Jeff Janes
Date:
On Wed, Nov 28, 2012 at 4:41 AM, Niels Kristian Schjødt
<nielskristian@autouncle.com> wrote:
>
> So my main concern is actually about the cars table, since this one currently has a lot of columns (151 - I expect
> that's quite a lot?), and a lot of data (4 million rows, and growing). Now you might start by thinking, this could
> sound like a regular need for some normalization, but wait a second and let me explain :-)

If you have 151 single-valued pieces of information, then that is what
you have.  You can't tell if something is normalized or not by
counting the columns.

> The columns in this table are for the most part very short strings, integers, decimals or booleans. So take for an
> example has_automatic_transmission (boolean): I can't see why it would make sense to put that into a separate table
> and join in the values.

I can't see why that would make sense, either.  Nor do I think that
doing so would increase the level of normalization.  What rule of
normalization would be served by creating gratuitous joins?

> Or the mileage or the price as another example. The cars table used for search is indexed quite a lot.

How useful are the indices?

> The questions:
> Having the above setup in mind, what impact on performance, in terms of read performance and write performance, does
> it have, whether I do the following:
>         1) In general would the read and/or the write on the database be faster, if I serialized some of the not
> searched columns in the table into a single text column instead of let's say 20 booleans?

Probably not. And could make it much worse, depending on how you
serialize it.  For example, if you use hstore or json, now the "column
names" for each of the 20 booleans are repeated in every row, rather
than being metadata stored only once.  But try it and see.
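
A sketch of what that repetition looks like with hstore (the column and key names below are made up):

-- Every row now stores the key text itself, whereas a real column's
-- name lives once in the system catalogs.
CREATE EXTENSION IF NOT EXISTS hstore;
ALTER TABLE cars ADD COLUMN extras hstore;

UPDATE cars
SET extras = hstore(ARRAY['has_roof_rack', 'has_heated_seats'],
                    ARRAY['true',          'false'])
WHERE id = 1602382;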

>         2) Let's say I'm updating a timestamp in a single one of the 151 columns in the cars table. The update
> statement is using the id to find the car. Would the write performance of that UPDATE be affected, if the table had
> fewer columns?

Yes, but probably not by much.  The biggest effect will be on whether
the timestamp column is indexed.  If it is, then updating it means
that all other indexes on the table will also need to be updated.  If
it is not indexed, then the update can be a HOT update.
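
Roughly, and with assumed column names:

-- No index on last_observed_at: this can be a HOT update, so only the
-- heap page changes and no index entries are written.
UPDATE cars SET last_observed_at = now() WHERE id = 1602382;

-- Once the column is indexed, the same UPDATE must add a new entry to
-- every index on the table for the new row version.
CREATE INDEX cars_last_observed_at_idx ON cars (last_observed_at);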


>         3) When adding a new column to the table I know that it becomes slower the more rows are in the table, but
> what about the "width" of the table - does that affect the performance when adding new columns?

Adding a new column to a table is pretty much instantaneous if the
default value is NULL.
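
For example (the column names are invented):

-- Catalog-only change: existing rows simply read the new column as NULL.
ALTER TABLE cars ADD COLUMN registration_notes text;

-- A non-NULL default, by contrast, forces every row to be rewritten.
ALTER TABLE cars ADD COLUMN is_archived boolean NOT NULL DEFAULT false;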

>         4) In general what performance downsides do you get when adding a lot of columns to one table instead of
> having them in separate tables?

This question cannot be answered in general.  If every time you use
the main table you have to join it to the separate table, then
performance will be bad.  If you almost never have to join to the
separate table, then performance will be better.

>         5) Is it significantly faster to select * from a table with 20 columns, than selecting the same 20 in a table
> with 150 columns?

If the extra 130 columns are mostly null, the difference will be very
small.  Or, if the where clause is such that you only do a single-row
lookup on a primary key column, for example, the difference will also
be small.

Cheers,

Jeff