Thread: Performance issues of one table vs. two split tables.
Is there any benefit of splitting up a table into two tables that will always have a one-to-one relationship?

Say I have a "user" table that has first, last, email, password, and last_accessed columns. This user table will be accessed often. (It's not really "user", but that's not important in this discussion.)

Say that there's also about 10 columns of settings or preferences for each user. Are there any cases or reasons to have a separate "user_preferences" table vs. just placing all the columns together in one table?

Another related question: Say I have a table "color" that has about 20 different rows ("red", "blue", "green", etc.). I want the user to be able to select zero or more favorite colors. I would typically use a link table:

    create table favorite_colors (
        color int references color(id),
        user int references user(id)
    );

Now, that table can have a large number of rows if I have a large number of users and if everyone likes all the colors.

For some value of "large", is there a time when one might consider using a single column in the user or user_prefs table to represent their color choices instead of a link table?

    table user_prefs (
        ...
        favorite_colors bit varying,
        ...
    );

Where each bit represents the primary key of the colors table.

Seems like poor design, but I'm wondering if there might be overriding concerns at times. For example, if I have 1 million users and they each like all colors and thus have a 20 million row link table, how much space would be saved by using a bit column as above?

--
Bill Moseley
moseley@hank.org
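P.S. A rough back-of-envelope for that last question (the numbers are approximate and assume roughly 28 bytes of per-row overhead per heap tuple, so please correct me):

    20,000,000 link rows x (~28 bytes overhead + 8 bytes of ints)
        = ~700 MB, before indexes
     1,000,000 users x ~12 bytes of "bit varying" data
        = ~12 MB extra in user_prefs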
> Say I have a "user" table that has first, last, email, password, and
> last_accessed columns. This user table will be accessed often. (It's
> not really "user", but that's not important in this discussion.)
>
> Say that there's also about 10 columns of settings or preferences for
> each user. Are there any cases or reasons to have a separate
> "user_preferences" table vs. just placing all the columns together in
> one table?

I did something like that on MySQL some time ago.

In the Users table there was stuff that other users need to see (like his login name, etc), and stuff that only this user needs to see (like his preferences).

So, when displaying posts in the forum, for instance, only a small part of the fields in the Users table was needed; the rest was just dead weight that made the table unable to fit in RAM. So I split the table, and it was faster.

However, you can also buy more RAM...
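A sketch of that kind of split (table and column names are illustrative; the prefs table shares the users table's key):

    create table users (
        id            serial primary key,
        login         text not null,
        email         text,
        password      text,
        last_accessed timestamptz
    );

    create table user_prefs (
        user_id int primary key references users(id),
        theme   text
        -- ... the other rarely-read preference columns
    );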
On Mon, May 14, 2007 at 10:52:13PM +0200, PFC wrote:
>
> > Say I have a "user" table that has first, last, email, password, and
> > last_accessed columns. This user table will be accessed often. (It's
> > not really "user", but that's not important in this discussion.)
> >
> > Say that there's also about 10 columns of settings or preferences for
> > each user. Are there any cases or reasons to have a separate
> > "user_preferences" table vs. just placing all the columns together in
> > one table?
>
> I did something like that on MySQL some time ago.
> In the Users table there was stuff that other users need to see
> (like his login name, etc), and stuff that only this user needs to see
> (like his preferences).
> So, when displaying posts in the forum, for instance, only a small
> part of the fields in the Users table was needed; the rest was just dead
> weight that made the table unable to fit in RAM.

Well, that's part of my question. If the common selects don't include those columns, how much "dead weight" is brought along due to the extra columns defined in the table, if any?

--
Bill Moseley
moseley@hank.org
On Mon, 14 May 2007, PFC wrote:

> I did something like that on MySQL some time ago.
> In the Users table there was stuff that other users need to see (like
> his login name, etc), and stuff that only this user needs to see (like
> his preferences).
> So, when displaying posts in the forum, for instance, only a small
> part of the fields in the Users table was needed; the rest was just dead
> weight that made the table unable to fit in RAM.
> So I split the table, and it was faster.
>
> However, you can also buy more RAM...

Or, use Views without paying more.

Rich

--
Richard B. Shepard, Ph.D.          | The Environmental Permitting
Applied Ecosystem Services, Inc.   | Accelerator(TM)
<http://www.appl-ecosys.com>   Voice: 503-667-4517   Fax: 503-667-8863
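P.S. A sketch of the view idea (assuming a single wide users table; as the next reply points out, this narrows the result set but not what gets cached):

    create view user_public as
        select id, login, last_accessed from users;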
On 5/14/07, Rich Shepard <rshepard@appl-ecosys.com> wrote:
> On Mon, 14 May 2007, PFC wrote:
>
> > I did something like that on MySQL some time ago.
> > In the Users table there was stuff that other users need to see (like
> > his login name, etc), and stuff that only this user needs to see (like
> > his preferences).
> > So, when displaying posts in the forum, for instance, only a small
> > part of the fields in the Users table was needed; the rest was just dead
> > weight that made the table unable to fit in RAM.
> > So I split the table, and it was faster.
> >
> > However, you can also buy more RAM...
>
> Or, use Views without paying more.

Well, views are not going to help with memory consumption here. It is the table contents that gets cached in the buffer cache, not the view's contents. So if you have a view which returns only one column from a 15-column table, you will be caching that 15-column data nonetheless. A view, as the name states, is converted into a select on a real table.

As for the actual colors table -- you might consider keeping colors as an int[] (integer array) column, though I'll never admit I suggested that. :D

Regards,
Dawid
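P.S. A minimal sketch of the int[] idea (column name is illustrative; @> is array containment):

    alter table user_prefs add column favorite_colors int[];

    update user_prefs set favorite_colors = '{1,3,7}' where user_id = 42;

    -- users whose favourites include color 1:
    select user_id from user_prefs where favorite_colors @> array[1];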
On Tue, May 15, 2007 at 06:33:26AM +0200, Dawid Kuroczko wrote:
> Well, views are not going to help with memory consumption here.
> It is the table contents that gets cached in the buffer cache, not the
> view's contents. So if you have a view which returns only one
> column from a 15-column table, you will be caching that 15-column
> data nonetheless. A view, as the name states, is converted into
> a select on a real table.

Are you saying that in PostgreSQL:

    select first_name, last_name from user_table;

uses the same memory as this?

    select first_name, last_name,
           password, email,
           [10 other columns]
    from user_table;

--
Bill Moseley
moseley@hank.org
On 5/15/07, Bill Moseley <moseley@hank.org> wrote:
> On Tue, May 15, 2007 at 06:33:26AM +0200, Dawid Kuroczko wrote:
> > Well, views are not going to help with memory consumption here.
> > It is the table contents that gets cached in the buffer cache, not the
> > view's contents. So if you have a view which returns only one
> > column from a 15-column table, you will be caching that 15-column
> > data nonetheless. A view, as the name states, is converted into
> > a select on a real table.
>
> Are you saying that in PostgreSQL:
>
>     select first_name, last_name from user_table;
>
> uses the same memory as this?
>
>     select first_name, last_name,
>            password, email,
>            [10 other columns]
>     from user_table;

Yes. You read the whole page (8 KB) into the buffer cache, then extract the requested columns from that buffer. From the buffer cache's point of view, the whole tuple is contained in the cache.

Say, if you first SELECT fname, lname FROM user_table; and then you issue SELECT * FROM user_table; -- the second select will be returned from the buffer cache, since all the rows are already in the cache. Having separate caches for every possible SELECT [column list] would be, well, not quite efficient.

Now, SELECT fname, lname will take less private memory, but that memory is freed as soon as the query finishes, so it won't help our cache much.

Regards,
Dawid
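P.S. One way to see the page-granularity cost for yourself (a sketch; assumes a table named user_table, and relpages is only as fresh as the last ANALYZE):

    select relpages,
           pg_size_pretty(pg_relation_size('user_table')) as on_disk
    from pg_class
    where relname = 'user_table';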
> Say, if you first SELECT fname, lname FROM user_table;
> and then you issue SELECT * FROM user_table; -- the
> second select will be returned from the buffer cache,
> since all the rows are already in the cache.

...Unless your table contains some large TEXT columns that have been stored out of line (TOASTed) by Postgres.
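A quick way to check whether a table has any out-of-line (TOAST) storage at all (a sketch; shows the table's TOAST relation, if one exists):

    select reltoastrelid::regclass
    from pg_class
    where relname = 'user_table';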
Can anyone provide input on this question? I'm curious how to look at this from a disk and memory usage perspective. Would using a bit column type help much?

I'm not thrilled by the loss of referential integrity.

On Mon, May 14, 2007 at 01:37:18PM -0700, Bill Moseley wrote:
>
> Say I have a table "color" that has about 20 different rows ("red",
> "blue", "green", etc.). I want the user to be able to select zero or
> more favorite colors. I would typically use a link table:
>
>     create table favorite_colors (
>         color int references color(id),
>         user int references user(id)
>     );
>
> Now, that table can have a large number of rows if I have a large
> number of users and if everyone likes all the colors.
>
> For some value of "large", is there a time when one might consider
> using a single column in the user or user_prefs table to represent
> their color choices instead of a link table?
>
>     table user_prefs (
>         ...
>         favorite_colors bit varying,
>         ...
>     );
>
> Where each bit represents the primary key of the colors table.
>
> Seems like poor design, but I'm wondering if there might be overriding
> concerns at times.
>
> For example, if I have 1 million users and they each like all colors
> and thus have a 20 million row link table, how much space would be
> saved by using a bit column as above?

--
Bill Moseley
moseley@hank.org
I'm sure there's a point where you'd be saving a "substantial" amount of disk space using a non-normalized scheme, but, like you say, you'd be missing out on other things. In general, disks are cheap while the man hours spent trying to fix data corruption are not.

On May 15, 2007, at 7:35 AM, Bill Moseley wrote:
> Can anyone provide input on this question? I'm curious how to look at
> this from a disk and memory usage perspective. Would using a bit
> column type help much?
>
> I'm not thrilled by the loss of referential integrity.
>
> On Mon, May 14, 2007 at 01:37:18PM -0700, Bill Moseley wrote:
>> Say I have a table "color" that has about 20 different rows ("red",
>> "blue", "green", etc.). I want the user to be able to select zero or
>> more favorite colors. I would typically use a link table:
>>
>>     create table favorite_colors (
>>         color int references color(id),
>>         user int references user(id)
>>     );
>>
>> Now, that table can have a large number of rows if I have a large
>> number of users and if everyone likes all the colors.
>>
>> For some value of "large", is there a time when one might consider
>> using a single column in the user or user_prefs table to represent
>> their color choices instead of a link table?
>>
>>     table user_prefs (
>>         ...
>>         favorite_colors bit varying,
>>         ...
>>     );
>>
>> Where each bit represents the primary key of the colors table.
>>
>> Seems like poor design, but I'm wondering if there might be overriding
>> concerns at times.
>>
>> For example, if I have 1 million users and they each like all colors
>> and thus have a 20 million row link table, how much space would be
>> saved by using a bit column as above?
>
> --
> Bill Moseley
> moseley@hank.org
On Tue, May 15, 2007 at 07:51:44AM +0200, Dawid Kuroczko wrote:
> On 5/15/07, Bill Moseley <moseley@hank.org> wrote:
> > On Tue, May 15, 2007 at 06:33:26AM +0200, Dawid Kuroczko wrote:
> > > Well, views are not going to help with memory consumption here.
> > > It is the table contents that gets cached in the buffer cache, not the
> > > view's contents. So if you have a view which returns only one
> > > column from a 15-column table, you will be caching that 15-column
> > > data nonetheless. A view, as the name states, is converted into
> > > a select on a real table.
> >
> > Are you saying that in PostgreSQL:
> >
> >     select first_name, last_name from user_table;
> >
> > uses the same memory as this?
> >
> >     select first_name, last_name,
> >            password, email,
> >            [10 other columns]
> >     from user_table;
>
> Yes. You read the whole page (8 KB) into the buffer cache,
> then extract the requested columns from that buffer. From the
> buffer cache's point of view, the whole tuple is contained in
> the cache.

Sorry, I don't mean to drag this thread out much longer. But I have one more question regarding joins.

Say I have a customer table and an order table. I want a list of all order id's for a given customer.

    SELECT o.id
    FROM order o
    JOIN customer c on o.customer = c.id

Does that bring into memory all columns from both order and customer? Maybe that's not a good example due to indexes.

See, I've seen this splitting of one-to-one tables a number of times (such as the user and user_preferences example) and I'm not sure if that's just poor schema design, premature optimization, or someone making smart use of their knowledge of the internal workings of PostgreSQL....

--
Bill Moseley
moseley@hank.org
> SELECT o.id
> FROM order o
> JOIN customer c on o.customer = c.id
>
> Does that bring into memory all columns from both order and customer?
> Maybe that's not a good example due to indexes.

No, it just pulls the columns you ask for from the table, nothing less, nothing more.

Splitting tables (vertical partitioning) is used to reduce the size of the working set that has to fit in RAM... this is a different reason than the one you're thinking about.
lists@peufeu.com (PFC) writes:
>> SELECT o.id
>> FROM order o
>> JOIN customer c on o.customer = c.id
>>
>> Does that bring into memory all columns from both order and customer?
>> Maybe that's not a good example due to indexes.
>
> No, it just pulls the columns you ask for from the table, nothing
> less, nothing more.

That's not quite 100% accurate.

In order to construct the join, the entire pages of the relevant tuples in tables "order" and "customer" will need to be drawn into memory.

Thus, if there are a whole bunch of columns on each table, the data in those extra columns (e.g. all columns aside from "id", the one that was asked for in the result set) will indeed be drawn into memory. They may not be drawn into the return set, but they will still be drawn into memory.

If you alter tables "customer" and "order", taking some columns off and stowing them in separate tables, then you'll find that more tuples of "customer" and "order" will fit into a buffer page, and that the join will be assembled with somewhat less memory usage.

Whether or not that is a worthwhile change to make will vary considerably.

--
(reverse (concatenate 'string "ofni.sesabatadxunil" "@" "enworbbc"))
http://linuxdatabases.info/info/languages.html
There are two kinds of pedestrians -- the quick and the dead.
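P.S. A rough way to put numbers on that (a sketch; pg_stats is populated by ANALYZE, and the ~28 bytes of per-row overhead is approximate):

    -- average data bytes per row of "customer":
    select sum(avg_width) as avg_row_width
    from pg_stats
    where tablename = 'customer';

    -- tuples per 8 KB page is then roughly 8192 / (avg_row_width + ~28)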
One other possible reason for splitting the table up into two chunks is to grant different rights on the two sets of columns.
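For example (a sketch, assuming the users/user_prefs split discussed above; the role name is illustrative):

    grant select on users to forum_reader;
    revoke all on user_prefs from forum_reader;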
Susan Cassidy
On 05/15/2007 09:44 AM, Bill Moseley <moseley@hank.org> wrote:
> Sorry, I don't mean to drag this thread out much longer. But I have
> one more question regarding joins.
>
> Say I have a customer table and an order table. I want a list of all
> order id's for a given customer.
>
>     SELECT o.id
>     FROM order o
>     JOIN customer c on o.customer = c.id
>
> Does that bring into memory all columns from both order and customer?
> Maybe that's not a good example due to indexes.
>
> See, I've seen this splitting of one-to-one tables a number of times
> (such as the user and user_preferences example) and I'm not sure if
> that's just poor schema design, premature optimization, or someone
> making smart use of their knowledge of the internal workings of
> PostgreSQL....
>
> --
> Bill Moseley
> moseley@hank.org
On Tue, May 15, 2007 at 01:42:18PM -0400, Chris Browne wrote:
> lists@peufeu.com (PFC) writes:
> >> SELECT o.id
> >> FROM order o
> >> JOIN customer c on o.customer = c.id
> >>
> >> Does that bring into memory all columns from both order and customer?
> >> Maybe that's not a good example due to indexes.
> >
> > No, it just pulls the columns you ask for from the table, nothing
> > less, nothing more.
>
> That's not quite 100% accurate.
>
> In order to construct the join, the entire pages of the relevant
> tuples in tables "order" and "customer" will need to be drawn into
> memory.
>
> Thus, if there are a whole bunch of columns on each table, the data in
> those extra columns (e.g. all columns aside from "id", the one that
> was asked for in the result set) will indeed be drawn into memory.

Is that specific to PostgreSQL? From an outside perspective it just seems odd that potentially a large amount of data would be pulled off disk into memory that is never used. Perhaps there's an overriding reason for this.

> If you alter tables "customer" and "order", taking some columns off
> and stowing them in separate tables, then you'll find that more tuples
> of "customer" and "order" will fit into a buffer page, and that the
> join will be assembled with somewhat less memory usage.
>
> Whether or not that is a worthwhile change to make will vary
> considerably.

Makes designing the schema a bit tough. ;)

--
Bill Moseley
moseley@hank.org
On May 15, 2007, at 12:02 PM, Bill Moseley wrote:
> On Tue, May 15, 2007 at 01:42:18PM -0400, Chris Browne wrote:
>> That's not quite 100% accurate.
>>
>> In order to construct the join, the entire pages of the relevant
>> tuples in tables "order" and "customer" will need to be drawn into
>> memory.
>>
>> Thus, if there are a whole bunch of columns on each table, the data in
>> those extra columns (e.g. all columns aside from "id", the one that
>> was asked for in the result set) will indeed be drawn into memory.
>
> Is that specific to PostgreSQL? From an outside perspective it just
> seems odd that potentially a large amount of data would be pulled off
> disk into memory that is never used. Perhaps there's an overriding
> reason for this.

The columns are next to each other on the disk. You need to read the entire block off disk into system cache, so you'll be reading all the columns of all the rows in that block into memory.

That's just the way that most (all?) modern filesystems work, and so the way that most filesystem-based databases are going to work. I've seen some databases that don't store all the main columns of a table together on disk, but they're fairly rare.

Pushing data into lookaside tables, either manually or automatically via TOAST, changes the tradeoffs.

Cheers,
Steve
>> Thus, if there are a whole bunch of columns on each table, the data in
>> those extra columns (e.g. all columns aside from "id", the one that
>> was asked for in the result set) will indeed be drawn into memory.
>
> Is that specific to PostgreSQL? From an outside perspective it just
> seems odd that potentially a large amount of data would be pulled off
> disk into memory that is never used. Perhaps there's an overriding
> reason for this.

Anything else would seem odd to me. Pulling a page into memory typically has OS support, and is thus very fast. Picking and choosing bits and pieces to read would be prohibitively slow.

Moreover, caching only those bits and pieces would require complicated code to decide whether the cached data is relevant to the next query. Validating cached data at the page level is much simpler, and thus faster.

Or so I assume ...

- John D. Burger
  MITRE
>> Thus, if there are a whole bunch of columns on each table, the data in
>> those extra columns (e.g. all columns aside from "id", the one that
>> was asked for in the result set) will indeed be drawn into memory.

Yeah, that's what I meant ;)

All the columns are loaded into memory (except the TOASTed ones which are not mentioned in the query), but only the requested ones are processed and returned to the client...

> Is that specific to PostgreSQL?

Nope. All databases do more or less the same.

> From an outside perspective it just
> seems odd that potentially a large amount of data would be pulled off
> disk into memory that is never used. Perhaps there's an overriding
> reason for this.

Yeah, where would you put this data if you didn't put it where it is now?

>> If you alter tables "customer" and "order", taking some columns off
>> and stowing them in separate tables, then you'll find that more tuples
>> of "customer" and "order" will fit into a buffer page, and that the
>> join will be assembled with somewhat less memory usage.
>>
>> Whether or not that is a worthwhile change to make will vary
>> considerably.
>
> Makes designing the schema a bit tough. ;)

"Premature optimization is the root of all evil."

Build a test database, fill it with data, and experiment.
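For instance, a quick sketch for generating a synthetic table to experiment with (names are illustrative):

    create table users_test as
        select i as id,
               'user_' || i as login,
               md5(i::text) as email
        from generate_series(1, 1000000) as i;

    analyze users_test;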
On Tue, May 15, 2007 at 09:20:53PM +0200, PFC wrote:
> > From an outside perspective it just
> > seems odd that potentially a large amount of data would be pulled off
> > disk into memory that is never used. Perhaps there's an overriding
> > reason for this.
>
> Yeah, where would you put this data if you didn't put it where it is
> now?

Swish-e isn't a database by any means, but it does have a way to store column-like metadata for each "row". When it does a search it only explicitly pulls from disk the metadata that it's asked to return. Granted, the OS is reading from disk more than the application is asking for, but the application is only allocating memory for the data it's going to return. And the column metadata is not always stored together on disk.

Without knowing Pg internals I wasn't aware of how the actual table data was organized and fetched into memory.

> "Premature optimization is the root of all evil"

Exactly what prompted this thread. ;)

--
Bill Moseley
moseley@hank.org
On May 14, 2007, at 4:37 PM, Bill Moseley wrote:
> Say that there's also about 10 columns of settings or preferences for
> each user. Are there any cases or reasons to have a separate
> "user_preferences" table vs. just placing all the columns together in
> one table?

When you have, say, 65 million users, it makes sense to push the ancillary info to another table to keep from having to copy too much data when you update the main info (like last access time).
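The point being that under MVCC an UPDATE writes a complete new row version, so even a one-column change copies the whole row (a sketch, reusing the earlier illustrative names):

    -- rewrites the entire row version, however wide the row is:
    update users set last_accessed = now() where id = 42;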
On May 15, 2007, at 10:35 AM, Bill Moseley wrote:
>> For some value of "large", is there a time when one might consider
>> using a single column in the user or user_prefs table to represent
>> their color choices instead of a link table?

We use bitfields on our large user table. It is becoming unworkable to scan for matches, since overall most people have very few selections made.

We are moving it to a model like your favorite_colors table, which just links the option and the user.

We find that joins on large tables are very fast in Postgres when they can be indexed to avoid full table scans, since the index can do much of the culling of potential matching rows. With bitfields, you are more or less forced into doing a sequential scan to find everyone who likes the color red.

Of course, if you're playing with only a few thousand users, either approach works well.
> We use bitfields on our large user table. It is becoming unworkable to
> scan for matches, since overall most people have very few selections
> made.
>
> We are moving it to a model like your favorite_colors table, which just
> links the option and the user.
>
> We find that joins on large tables are very fast in Postgres when they
> can be indexed to avoid full table scans, since the index can do much
> of the culling of potential matching rows. With bitfields, you are
> more or less forced into doing a sequential scan to find everyone who
> likes the color red.
>
> Of course, if you're playing with only a few thousand users, either
> approach works well.

Things you could try:

* Use an integer array instead of a bitfield (for instance, in the users table, the column favourites would contain {1,2} if the user selected items 1 and 2). Then you can make a GiST index on it and use the indexed intersection operator. This is likely the optimal solution if the maximum number of items is small (say, 100 is good, 100000 is not). See the sketch after this list.

* Keep your bitfields and create conditional (partial) indexes. Note that a partial-index predicate must be boolean, so:

    CREATE INDEX ... WHERE (bitfield_column & 1) <> 0;
    CREATE INDEX ... WHERE (bitfield_column & 2) <> 0;
    CREATE INDEX ... WHERE (bitfield_column & 4) <> 0;
    CREATE INDEX ... WHERE (bitfield_column & 8) <> 0;
    CREATE INDEX ... WHERE (bitfield_column & 16) <> 0;
    etc...

Obviously this will only work if you have, say, 10 favourites. 100 indexes on a table would really suck. Then, when looking for users who chose bits 1 and 2, do:

    SELECT ... WHERE (bitfield_column & 1) <> 0 AND (bitfield_column & 2) <> 0;

Postgres will do a bitmap AND using the two indexes. (Note: when we get bitmap indexes, this will be even better.)

* Use tsearch2: store favourites = 'red blue' and full-text-search it.

* Use a favourites table. This makes queries hard to optimize. Consider the table (user_id, item_id), meaning the user selected this item as a favourite. If you want to know which users selected both items 1 and 2, you have to do a self-join, something like:

    SELECT ...
    FROM favourites a, favourites b
    WHERE a.user_id = b.user_id
      AND a.item_id = 1 AND b.item_id = 2;

This is likely to be "not very fast" if 1 million users check each option but only 100 check both.
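A minimal sketch of the first option (requires the contrib/intarray module; table and column names are illustrative):

    create index users_favs_idx on users using gist (favourites gist__int_ops);

    -- users who selected both items 1 and 2 (indexable containment):
    select id from users where favourites @> '{1,2}'::int[];

    -- users who selected item 1 or item 2 (indexable overlap):
    select id from users where favourites && '{1,2}'::int[];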
On 16-May-07, at 4:05 PM, PFC wrote:
> This makes queries hard to optimize. Consider the table (user_id,
> item_id), meaning the user selected this item as a favourite.
>
> If you want to know which users selected both items 1 and 2, you have
> to do a self-join, something like:
>
>     SELECT ...
>     FROM favourites a, favourites b
>     WHERE a.user_id = b.user_id
>       AND a.item_id = 1 AND b.item_id = 2;
You could get users who have selected both items 1 and 2 without doing a self-join with a query like the following:
    select user_id
    from favourites
    where item_id = 1 or item_id = 2
    group by user_id
    having count(*) = 2;
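That assumes (user_id, item_id) pairs are unique. A sketch of the generalization to any N items, using count(distinct ...) to be safe against duplicates:

    select user_id
    from favourites
    where item_id in (1, 2, 3)
    group by user_id
    having count(distinct item_id) = 3;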