Thread: tables with lots of columns - what alternative from performance point of view?
tables with lots of columns - what alternative from performance point of view?
From
hubert depesz lubaczewski
Date:
hi
jus recently there were some thread on postgresql list with people asying : i have 700 columns, i have 1000 columns and so on.
some people, imediatelly responded: change your schema.
this is what forced to me ask:
i have a situation where i ahve to store a number of "objects" in database.
all objects have 3 specific attributes (which go into objects table), and may have a lot of "custom fields".
basically - lsit of accessible custom fields for object depends on which object-category this object belongs to.
now.
i know, i could have written it in this way:
create table object_custom_fields (id serial primary key, object_id int8, field_id int8, field_value text);
but:
this approach has two very big drawbacks (for me):
1. the table cannot differentiate between custom fields of type "date", "number" and so on. - everything is stored as text.
2. it is rather slow. i have to do a non-unique index scan over object_custom_fields, get all records, and pivot it (on the client side of curse) to make it usable.
i did it differently, definitelly not nicely, but i dont see any other way to get this performance with unknown list of custom fields:
1. create table cf_types (id serial, codename text, representation text);
2. create table cf_definitions (id serial, category_id int8, type_id int8, field-number int4);
3. create table cf_values (id serial, object_id int8 (unique), ...................................................);
where
cf_types store information like this:
id | codename | representation
----+------------+----------------
1 | bool | boolean
2 | integer | integer
3 | number | number
4 | text | text
5 | note | text
6 | date | date
...
basically - there might be many "types" with the same representation.
then
cf_values have a lot of (128 at the moment) fields for all possible representations.
basically it looks like:
id, object_id, boolean_1 ... boolean_128, integer_1..integer_128, ...
the datatypes of this fields relate to their content (integer_* fields have datatype int8, and so on).
now.
in cf_definitions i specify, category, field_type_id, and a field-number - which relates to _NUMBER in fields in cf_values.
what i did achive is *very* fast retrieval of data for any given object.
the schema of cf_values table is absolutelly awful, and i will never say differently.
my point is - if somebody (tom lane for example) says - redesign your schema - whenever he reads about table with 700 column (i have more :) - then i must have missed something absolutelyl simple, fast and elegant. what is this?
depesz
jus recently there were some thread on postgresql list with people asying : i have 700 columns, i have 1000 columns and so on.
some people, imediatelly responded: change your schema.
this is what forced to me ask:
i have a situation where i ahve to store a number of "objects" in database.
all objects have 3 specific attributes (which go into objects table), and may have a lot of "custom fields".
basically - lsit of accessible custom fields for object depends on which object-category this object belongs to.
now.
i know, i could have written it in this way:
create table object_custom_fields (id serial primary key, object_id int8, field_id int8, field_value text);
but:
this approach has two very big drawbacks (for me):
1. the table cannot differentiate between custom fields of type "date", "number" and so on. - everything is stored as text.
2. it is rather slow. i have to do a non-unique index scan over object_custom_fields, get all records, and pivot it (on the client side of curse) to make it usable.
i did it differently, definitelly not nicely, but i dont see any other way to get this performance with unknown list of custom fields:
1. create table cf_types (id serial, codename text, representation text);
2. create table cf_definitions (id serial, category_id int8, type_id int8, field-number int4);
3. create table cf_values (id serial, object_id int8 (unique), ...................................................);
where
cf_types store information like this:
id | codename | representation
----+------------+----------------
1 | bool | boolean
2 | integer | integer
3 | number | number
4 | text | text
5 | note | text
6 | date | date
...
basically - there might be many "types" with the same representation.
then
cf_values have a lot of (128 at the moment) fields for all possible representations.
basically it looks like:
id, object_id, boolean_1 ... boolean_128, integer_1..integer_128, ...
the datatypes of this fields relate to their content (integer_* fields have datatype int8, and so on).
now.
in cf_definitions i specify, category, field_type_id, and a field-number - which relates to _NUMBER in fields in cf_values.
what i did achive is *very* fast retrieval of data for any given object.
the schema of cf_values table is absolutelly awful, and i will never say differently.
my point is - if somebody (tom lane for example) says - redesign your schema - whenever he reads about table with 700 column (i have more :) - then i must have missed something absolutelyl simple, fast and elegant. what is this?
depesz
contrib/hstore will save you. See http://www.sai.msu.su/~megera/postgres/gist/hstore/README.hstore for details. Oleg On Wed, 7 Dec 2005, hubert depesz lubaczewski wrote: > hi > jus recently there were some thread on postgresql list with people asying : > i have 700 columns, i have 1000 columns and so on. > some people, imediatelly responded: change your schema. > this is what forced to me ask: > > i have a situation where i ahve to store a number of "objects" in database. > all objects have 3 specific attributes (which go into objects table), and > may have a lot of "custom fields". > basically - lsit of accessible custom fields for object depends on which > object-category this object belongs to. > now. > i know, i could have written it in this way: > > create table object_custom_fields (id serial primary key, object_id int8, > field_id int8, field_value text); > but: > this approach has two very big drawbacks (for me): > 1. the table cannot differentiate between custom fields of type "date", > "number" and so on. - everything is stored as text. > 2. it is rather slow. i have to do a non-unique index scan over > object_custom_fields, get all records, and pivot it (on the client side of > curse) to make it usable. > > i did it differently, definitelly not nicely, but i dont see any other way > to get this performance with unknown list of custom fields: > 1. create table cf_types (id serial, codename text, representation text); > 2. create table cf_definitions (id serial, category_id int8, type_id int8, > field-number int4); > 3. create table cf_values (id serial, object_id int8 (unique), > ...................................................); > > where > cf_types store information like this: > id | codename | representation > ----+------------+---------------- > 1 | bool | boolean > 2 | integer | integer > 3 | number | number > 4 | text | text > 5 | note | text > 6 | date | date > ... > basically - there might be many "types" with the same representation. > then > cf_values have a lot of (128 at the moment) fields for all possible > representations. > basically it looks like: > id, object_id, boolean_1 ... boolean_128, integer_1..integer_128, ... > the datatypes of this fields relate to their content (integer_* fields have > datatype int8, and so on). > > now. > in cf_definitions i specify, category, field_type_id, and a field-number - > which relates to _NUMBER in fields in cf_values. > > what i did achive is *very* fast retrieval of data for any given object. > the schema of cf_values table is absolutelly awful, and i will never say > differently. > my point is - if somebody (tom lane for example) says - redesign your schema > - whenever he reads about table with 700 column (i have more :) - then i > must have missed something absolutelyl simple, fast and elegant. what is > this? > > depesz > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
Re: tables with lots of columns - what alternative from performance point of view?
From
hubert depesz lubaczewski
Date:
On 12/7/05, Oleg Bartunov <oleg@sai.msu.su> wrote:
thanks. i didn't know about it, and it looks great. but i'm not sure if we will be able to use it - my developers use java + hibernate, and they say it cannot work with any "fancy" datatypes (including such a base things like "INTERVAL").
i will definitelly use is though in my other (not hibernate-dependant) projects.
best regards
depesz
contrib/hstore will save you.
See http://www.sai.msu.su/~megera/postgres/gist/hstore/README.hstore
for details.
thanks. i didn't know about it, and it looks great. but i'm not sure if we will be able to use it - my developers use java + hibernate, and they say it cannot work with any "fancy" datatypes (including such a base things like "INTERVAL").
i will definitelly use is though in my other (not hibernate-dependant) projects.
best regards
depesz