Thread: a newbie question on table design
Hi all,

I have a large SDF file with many records of molecules and associated data items, and I want to save them in a PostgreSQL database. There are fewer than 40 data items for every molecule (the names of the data items for each molecule are the same, but the values differ). The number of molecules may exceed 20 million.

Now I have come up with two ways to construct the table:

1) a table with about 40 columns, where every data item has its own column, and one molecule corresponds to one row in the table.

This is direct and simple. The drawback is that if I want to add more data items to the database, I have to modify the structure of the table.

2) a table with just 3 columns:

    CREATE TABLE mytable(
        id serial,
        data_name text,
        data_value text
    );

Then a single molecule will correspond to about 40 rows in the database. If I need to add more data items, I just need to add new rows with new "data_name" values. The drawback of this design is that it has too many rows (40 times as many as the former) and wastes a lot of space.

Which one is better, or are there other smart ways?

I have another question. Since the data I need to save is huge, is it appropriate to save the data values in a compressed format?

Regards,
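[Editor's note: spelled out as DDL, option 1 would look something like the sketch below. The column names are hypothetical; the real ones would come from the SDF file's data item names.]

```sql
-- Option 1: one wide row per molecule (column names are illustrative).
CREATE TABLE molecules (
    molecule_id serial PRIMARY KEY,
    name        text NOT NULL,
    mol_weight  numeric,
    logp        numeric,
    smiles      text
    -- ... one column per data item, roughly 40 in total
);
```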
On Friday 15. February 2008, fdu.xiaojf@gmail.com wrote:
> Hi all,
>
> I have a large sdf file with many records of molecules and associated
> data items and I want to save them in a PostgreSQL database. There are
> about less than 40 data items for every molecule (the names of the data
> items for each molecule are the same, but values differ). The number
> of molecules may exceed 20 million.
>
> Now I have come up with two ways to construct the table:
>
> 1) a table with about 40 columns, every data item has its
>    corresponding column, and one molecule corresponds to one row in the
>    table.
>
> This is direct and simple. The drawback is if I want to add more data
> types to the database, I have to modify the structure of the table.
>
> 2) a table with just 3 columns:
>
>     CREATE TABLE mytable(
>         id serial,
>         data_name text,
>         data_value text
>     );
>
> Then a single molecule will correspond to about 40 rows in the
> database.

This is a sound concept, but I'd rather store the data_name in a separate table with an integer key, and replace data_name in mytable with a data_name_fk REFERENCES data_names (data_name_id). That's just Occam's Razor applied to database design, aka first normal form.

You'd probably store the name of the molecule in a third table. Then you have a model very similar to the classic 'book database' where a book can have multiple authors, and an author can have multiple books. There are examples of this design all over the place.

> If I need to add more data types to the table, I just need to add new
> rows with new "data_name" column values. The drawback of this table is
> it has too many rows (40 times the former one) and wastes a lot of
> space.
>
> Which one is better, or are there some other smart ways?
>
> I have another question. Since the data I need to save is huge, is it
> appropriate that I save the data value in compressed format?

That sounds a lot like premature optimization. Postgres is actually quite good at compacting data natively.
--
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE
My Jazz Jukebox: http://www.last.fm/user/leifbk/
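[Editor's note: the normalized three-table design described above might be sketched like this. Table and column names beyond data_name_fk and data_names are illustrative.]

```sql
-- Lookup table for the ~40 data item names.
CREATE TABLE data_names (
    data_name_id serial PRIMARY KEY,
    data_name    text NOT NULL UNIQUE
);

-- One row per molecule.
CREATE TABLE molecules (
    molecule_id serial PRIMARY KEY,
    name        text NOT NULL
);

-- One row per (molecule, data item) pair, with the name factored out
-- into an integer foreign key.
CREATE TABLE molecule_data (
    molecule_id  integer REFERENCES molecules (molecule_id),
    data_name_fk integer REFERENCES data_names (data_name_id),
    data_value   text,
    PRIMARY KEY (molecule_id, data_name_fk)
);
```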
If you can separate a stable structure (common columns) from additional columns, then you can:

1. Use a base class (table) with the common columns, plus inherited tables with their own additional columns.

2. Use contrib/hstore as storage for semistructured data - we store additional columns as key->value pairs in the hstore data type. This is a very flexible design.

On Fri, 15 Feb 2008, Leif B. Kristensen wrote:
> On Friday 15. February 2008, fdu.xiaojf@gmail.com wrote:
>> Hi all,
>>
>> I have a large sdf file with many records of molecules and associated
>> data items and I want to save them in a PostgreSQL database. There are
>> about less than 40 data items for every molecule (the names of the data
>> items for each molecule are the same, but values differ). The number
>> of molecules may exceed 20 million.
>>
>> Now I have come up with two ways to construct the table:
>>
>> 1) a table with about 40 columns, every data item has its
>>    corresponding column, and one molecule corresponds to one row in the
>>    table.
>>
>> This is direct and simple. The drawback is if I want to add more data
>> types to the database, I have to modify the structure of the table.
>>
>> 2) a table with just 3 columns:
>>
>>     CREATE TABLE mytable(
>>         id serial,
>>         data_name text,
>>         data_value text
>>     );
>>
>> Then a single molecule will correspond to about 40 rows in the
>> database.
>
> This is a sound concept, but I'd rather store the data_name in a
> separate table with an integer key, and replace data_name in mytable
> with a data_name_fk REFERENCES data_names (data_name_id). That's just
> Occam's Razor applied to database design, aka first normal form.
>
> You'd probably store the name of the molecule in a third table. Then you
> have a model very similar to the classic 'book database' where a book
> can have multiple authors, and an author can have multiple books. There
> are examples of this design all over the place.
>> If I need to add more data types to the table, I just need to add new
>> rows with new "data_name" column values. The drawback of this table is
>> it has too many rows (40 times the former one) and wastes a lot of
>> space.
>>
>> Which one is better, or are there some other smart ways?
>>
>> I have another question. Since the data I need to save is huge, is it
>> appropriate that I save the data value in compressed format?
>
> That sounds a lot like premature optimization. Postgres is actually
> quite good at compacting data natively.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
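[Editor's note: the two approaches suggested above might be sketched as follows. Both assume PostgreSQL with the contrib/hstore module installed for the second variant; all table and key names are illustrative.]

```sql
-- 1) Table inheritance: stable, common columns in a base table,
--    extra columns in child tables that INHERIT from it.
CREATE TABLE molecules (
    molecule_id serial PRIMARY KEY,
    name        text NOT NULL
);
CREATE TABLE molecules_extended (
    extra_item  text
) INHERITS (molecules);

-- 2) hstore: common columns plus one hstore column holding the
--    additional data items as key => value pairs.
CREATE TABLE molecules_h (
    molecule_id serial PRIMARY KEY,
    name        text NOT NULL,
    extra       hstore
);

INSERT INTO molecules_h (name, extra)
    VALUES ('caffeine', 'mol_weight=>194.19, formula=>C8H10N4O2');

-- Pull a single data item back out:
SELECT extra -> 'formula' FROM molecules_h WHERE name = 'caffeine';
```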
On Feb 15, 2008, at 4:49 AM, fdu.xiaojf@gmail.com wrote:
> Hi all,
>
> I have a large sdf file with many records of molecules and associated
> data items and I want to save them in a PostgreSQL database. There are
> about less than 40 data items for every molecule (the names of the data
> items for each molecule are the same, but values differ). The number
> of molecules may exceed 20 million.
>
> Now I have come up with two ways to construct the table:
>
> 1) a table with about 40 columns, every data item has its
>    corresponding column, and one molecule corresponds to one row in the
>    table.
>
> This is direct and simple. The drawback is if I want to add more data
> types to the database, I have to modify the structure of the table.
>
> 2) a table with just 3 columns:
>
>     CREATE TABLE mytable(
>         id serial,
>         data_name text,
>         data_value text
>     );

That looks kinda like an entity-attribute-value format. I'm guessing that either there'll be another column to define the entity, or the id isn't really a serial.

> Then a single molecule will correspond to about 40 rows in the
> database.
>
> If I need to add more data types to the table, I just need to add new
> rows with new "data_name" column values. The drawback of this table is
> it has too many rows (40 times the former one) and wastes a lot of
> space.
>
> Which one is better, or are there some other smart ways?

Somebody will, shortly, leap out and explain at great length why EAV is evil, evil, evil and would never be used by any right-thinking person. Don't take them too seriously. EAV is sometimes appropriate.

This is probably not one of those times, though. You're likely to get much more benefit from the power of SQL by putting one molecule per row of the table. 40 columns isn't excessive, and modifying the structure of the table to add or modify columns isn't really a problem, especially in PostgreSQL, where you can take advantage of DDL being transactional.

> I have another question.
> Since the data I need to save is huge, is it appropriate that I save
> the data value in compressed format?

Not on your first implementation. PostgreSQL does some basic compression and out-of-line storage of data automatically. If you have huge blobs of opaque data that you don't plan on querying from within the database, you might consider doing some client-side compression of them. Maybe. By the third or fourth iteration, if benchmarks suggest it'd be worthwhile. Right now, though, don't even consider it.

Cheers,
  Steve
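[Editor's note: the point about transactional DDL can be sketched as below. Adding a column later is an ordinary statement, and because PostgreSQL runs DDL inside transactions, a schema change can be rolled back. The column name is hypothetical.]

```sql
-- Schema changes participate in transactions in PostgreSQL.
BEGIN;
ALTER TABLE molecules ADD COLUMN new_data_item text;
-- If anything goes wrong, the schema change vanishes with the
-- rest of the transaction:
ROLLBACK;  -- or COMMIT to keep the new column
```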