Thread: Performance issues of one table vs. two split tables.
Is there any benefit of splitting up a table into two tables that will always have a one-to-one relationship?

Say I have a "user" table that has first, last, email, password, and last_accessed columns. This user table will be accessed often. (It's not really "user", but that's not important in this discussion.)

Say that there's also about 10 columns of settings or preferences for each user. Are there any cases or reasons to have a separate "user_preferences" table vs. just placing all the columns together in one table?

Another related question: Say I have a table "color" that has about 20 different rows ("red", "blue", "green", etc.). I want the user to be able to select zero or more favorite colors. I would typically use a link table:

    create table favorite_colors (
        color int references color(id),
        user int references user(id)
    );

Now, that table can have a large number of rows if I have a large number of users and if everyone likes all the colors.

For some value of "large", is there a time when one might consider using a single column in the user or user_prefs table to represent their color choices instead of a link table?

    table user_prefs (
        ...
        favorite_colors bit varying,
        ...
    );

Where each bit represents the primary key of the colors table.

Seems like poor design, but I'm wondering if there might be overriding concerns at times. For example, if I have 1 million users and they each like all colors and thus have a 20 million row link table, how much space would be saved by using a bit column as above?

--
Bill Moseley
moseley@hank.org
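P.S. A rough back-of-envelope for that last question (the numbers are approximate and assume roughly 28 bytes of per-row overhead per heap tuple, so please correct me):

    20,000,000 link rows x (~28 bytes overhead + 8 bytes of ints)
        = ~700 MB, before indexes
     1,000,000 users x ~12 bytes of "bit varying" data
        = ~12 MB extra in user_prefs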
> Say I have a "user" table that has first, last, email, password, and
> last_accessed columns. This user table will be accessed often. (It's
> not really "user", but that's not important in this discussion.)
>
> Say that there's also about 10 columns of settings or preferences for
> each user. Are there any cases or reasons to have a separate
> "user_preferences" table vs. just placing all the columns together in
> one table?

I did something like that on MySQL some time ago.

In the Users table there was stuff that other users need to see (like his login name, etc), and stuff that only this user needs to see (like his preferences).

So, when displaying posts in the forum, for instance, only a small part of the fields in the Users table was needed; the rest was just dead weight that made the table unable to fit in RAM. So I split the table, and it was faster.

However, you can also buy more RAM...
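A sketch of that kind of split (table and column names are illustrative; the prefs table shares the users table's key):

    create table users (
        id            serial primary key,
        login         text not null,
        email         text,
        password      text,
        last_accessed timestamptz
    );

    create table user_prefs (
        user_id int primary key references users(id),
        theme   text
        -- ... the other rarely-read preference columns
    );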
On Mon, May 14, 2007 at 10:52:13PM +0200, PFC wrote:
>
> > Say I have a "user" table that has first, last, email, password, and
> > last_accessed columns. This user table will be accessed often. (It's
> > not really "user", but that's not important in this discussion.)
> >
> > Say that there's also about 10 columns of settings or preferences for
> > each user. Are there any cases or reasons to have a separate
> > "user_preferences" table vs. just placing all the columns together in
> > one table?
>
> I did something like that on MySQL some time ago.
> In the Users table there was stuff that other users need to see
> (like his login name, etc), and stuff that only this user needs to see
> (like his preferences).
> So, when displaying posts in the forum, for instance, only a small
> part of the fields in the Users table was needed; the rest was just dead
> weight that made the table unable to fit in RAM.

Well, that's part of my question. If the common selects don't include those columns, how much "dead weight" is brought along due to the extra columns defined in the table, if any?

--
Bill Moseley
moseley@hank.org
On Mon, 14 May 2007, PFC wrote:

> I did something like that on MySQL some time ago.
> In the Users table there was stuff that other users need to see (like
> his login name, etc), and stuff that only this user needs to see (like
> his preferences).
> So, when displaying posts in the forum, for instance, only a small
> part of the fields in the Users table was needed; the rest was just dead
> weight that made the table unable to fit in RAM.
> So I split the table, and it was faster.
>
> However, you can also buy more RAM...

Or, use Views without paying more.

Rich

--
Richard B. Shepard, Ph.D.          | The Environmental Permitting
Applied Ecosystem Services, Inc.   | Accelerator(TM)
<http://www.appl-ecosys.com>   Voice: 503-667-4517   Fax: 503-667-8863
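P.S. A sketch of the view idea (assuming a single wide users table; as the next reply points out, this narrows the result set but not what gets cached):

    create view user_public as
        select id, login, last_accessed from users;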
On 5/14/07, Rich Shepard <rshepard@appl-ecosys.com> wrote:
> On Mon, 14 May 2007, PFC wrote:
>
> > I did something like that on MySQL some time ago.
> > In the Users table there was stuff that other users need to see (like
> > his login name, etc), and stuff that only this user needs to see (like
> > his preferences).
> > So, when displaying posts in the forum, for instance, only a small
> > part of the fields in the Users table was needed; the rest was just dead
> > weight that made the table unable to fit in RAM.
> > So I split the table, and it was faster.
> >
> > However, you can also buy more RAM...
>
> Or, use Views without paying more.

Well, views are not going to help with memory consumption here. It is the table contents that gets cached in the buffer cache, not the view's contents. So if you have a view which returns only one column from a 15-column table, you will be caching that 15-column data nonetheless. A view, as the name states, is converted into a select on a real table.

As for the actual colors table -- you might consider keeping colors as an int[] (integer array) column, though I'll never admit I suggested that. :D

Regards,
Dawid
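P.S. A minimal sketch of the int[] idea (column name is illustrative; @> is array containment):

    alter table user_prefs add column favorite_colors int[];

    update user_prefs set favorite_colors = '{1,3,7}' where user_id = 42;

    -- users whose favourites include color 1:
    select user_id from user_prefs where favorite_colors @> array[1];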
On Tue, May 15, 2007 at 06:33:26AM +0200, Dawid Kuroczko wrote:
> Well, views are not going to help with memory consumption here.
> It is the table contents that gets cached in the buffer cache, not the
> view's contents. So if you have a view which returns only one
> column from a 15-column table, you will be caching that 15-column
> data nonetheless. A view, as the name states, is converted into
> a select on a real table.

Are you saying that in PostgreSQL:

    select first_name, last_name from user_table;

uses the same memory as this?

    select first_name, last_name,
           password, email,
           [10 other columns]
    from user_table;

--
Bill Moseley
moseley@hank.org
On 5/15/07, Bill Moseley <moseley@hank.org> wrote:
> On Tue, May 15, 2007 at 06:33:26AM +0200, Dawid Kuroczko wrote:
> > Well, views are not going to help with memory consumption here.
> > It is the table contents that gets cached in the buffer cache, not the
> > view's contents. So if you have a view which returns only one
> > column from a 15-column table, you will be caching that 15-column
> > data nonetheless. A view, as the name states, is converted into
> > a select on a real table.
>
> Are you saying that in PostgreSQL:
>
>     select first_name, last_name from user_table;
>
> uses the same memory as this?
>
>     select first_name, last_name,
>            password, email,
>            [10 other columns]
>     from user_table;

Yes. You read the whole page (8 KB) into the buffer cache, then extract the requested columns from that buffer. From the buffer cache's point of view, the whole tuple is contained in the cache.

Say, if you first SELECT fname, lname FROM user_table; and then you issue SELECT * FROM user_table; -- the second select will be returned from the buffer cache, since all the rows are already in the cache. Having separate caches for every possible SELECT [column list] would be, well, not quite efficient.

Now, SELECT fname, lname will take less private memory, but that memory is freed as soon as the query finishes, so it won't help our cache much.

Regards,
Dawid
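P.S. One way to see the page-granularity cost for yourself (a sketch; assumes a table named user_table, and relpages is only as fresh as the last ANALYZE):

    select relpages,
           pg_size_pretty(pg_relation_size('user_table')) as on_disk
    from pg_class
    where relname = 'user_table';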
> Say, if you first SELECT fname, lname FROM user_table;
> and then you issue SELECT * FROM user_table; -- the
> second select will be returned from the buffer cache,
> since all the rows are already in the cache.

...Unless your table contains some large TEXT columns that have been stored out of line (TOASTed) by Postgres.
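A quick way to check whether a table has any out-of-line (TOAST) storage at all (a sketch; shows the table's TOAST relation, if one exists):

    select reltoastrelid::regclass
    from pg_class
    where relname = 'user_table';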
Can anyone provide input on this question? I'm curious how to look at this from a disk and memory usage perspective. Would using a bit column type help much?

I'm not thrilled by the loss of referential integrity.

On Mon, May 14, 2007 at 01:37:18PM -0700, Bill Moseley wrote:
>
> Say I have a table "color" that has about 20 different rows ("red",
> "blue", "green", etc.). I want the user to be able to select zero or
> more favorite colors. I would typically use a link table:
>
>     create table favorite_colors (
>         color int references color(id),
>         user int references user(id)
>     );
>
> Now, that table can have a large number of rows if I have a large
> number of users and if everyone likes all the colors.
>
> For some value of "large", is there a time when one might consider
> using a single column in the user or user_prefs table to represent
> their color choices instead of a link table?
>
>     table user_prefs (
>         ...
>         favorite_colors bit varying,
>         ...
>     );
>
> Where each bit represents the primary key of the colors table.
>
> Seems like poor design, but I'm wondering if there might be overriding
> concerns at times.
>
> For example, if I have 1 million users and they each like all colors
> and thus have a 20 million row link table, how much space would be
> saved by using a bit column as above?

--
Bill Moseley
moseley@hank.org
I'm sure there's a point where you'd be saving a "substantial" amount of disk space using a non-normalized scheme, but, like you say, you'd be missing out on other things. In general, disks are cheap while the man hours spent trying to fix data corruption are not.

On May 15, 2007, at 7:35 AM, Bill Moseley wrote:
> Can anyone provide input on this question? I'm curious how to look at
> this from a disk and memory usage perspective. Would using a bit
> column type help much?
>
> I'm not thrilled by the loss of referential integrity.
>
> On Mon, May 14, 2007 at 01:37:18PM -0700, Bill Moseley wrote:
>> Say I have a table "color" that has about 20 different rows ("red",
>> "blue", "green", etc.). I want the user to be able to select zero or
>> more favorite colors. I would typically use a link table:
>>
>>     create table favorite_colors (
>>         color int references color(id),
>>         user int references user(id)
>>     );
>>
>> Now, that table can have a large number of rows if I have a large
>> number of users and if everyone likes all the colors.
>>
>> For some value of "large", is there a time when one might consider
>> using a single column in the user or user_prefs table to represent
>> their color choices instead of a link table?
>>
>>     table user_prefs (
>>         ...
>>         favorite_colors bit varying,
>>         ...
>>     );
>>
>> Where each bit represents the primary key of the colors table.
>>
>> Seems like poor design, but I'm wondering if there might be overriding
>> concerns at times.
>>
>> For example, if I have 1 million users and they each like all colors
>> and thus have a 20 million row link table, how much space would be
>> saved by using a bit column as above?
>
> --
> Bill Moseley
> moseley@hank.org
On Tue, May 15, 2007 at 07:51:44AM +0200, Dawid Kuroczko wrote:
> On 5/15/07, Bill Moseley <moseley@hank.org> wrote:
> > On Tue, May 15, 2007 at 06:33:26AM +0200, Dawid Kuroczko wrote:
> > > Well, views are not going to help with memory consumption here.
> > > It is the table contents that gets cached in the buffer cache, not the
> > > view's contents. So if you have a view which returns only one
> > > column from a 15-column table, you will be caching that 15-column
> > > data nonetheless. A view, as the name states, is converted into
> > > a select on a real table.
> >
> > Are you saying that in PostgreSQL:
> >
> >     select first_name, last_name from user_table;
> >
> > uses the same memory as this?
> >
> >     select first_name, last_name,
> >            password, email,
> >            [10 other columns]
> >     from user_table;
>
> Yes. You read the whole page (8 KB) into the buffer cache,
> then extract the requested columns from that buffer. From the
> buffer cache's point of view, the whole tuple is contained in
> the cache.

Sorry, I don't mean to drag this thread out much longer. But I have one more question regarding joins.

Say I have a customer table and an order table. I want a list of all order id's for a given customer.

    SELECT o.id
    FROM order o
    JOIN customer c on o.customer = c.id

Does that bring into memory all columns from both order and customer? Maybe that's not a good example due to indexes.

See, I've seen this splitting of one-to-one tables a number of times (such as the user and user_preferences example) and I'm not sure if that's just poor schema design, premature optimization, or someone making smart use of their knowledge of the internal workings of PostgreSQL....

--
Bill Moseley
moseley@hank.org
> SELECT o.id
> FROM order o
> JOIN customer c on o.customer = c.id
>
> Does that bring into memory all columns from both order and customer?
> Maybe that's not a good example due to indexes.

No, it just pulls the columns you ask for from the table, nothing less, nothing more.

Splitting tables (vertical partitioning) is used to reduce the size of the working set that has to fit in RAM... this is a different reason than the one you're thinking about.
lists@peufeu.com (PFC) writes:
>> SELECT o.id
>> FROM order o
>> JOIN customer c on o.customer = c.id
>>
>> Does that bring into memory all columns from both order and customer?
>> Maybe that's not a good example due to indexes.
>
> No, it just pulls the columns you ask for from the table, nothing
> less, nothing more.

That's not quite 100% accurate.

In order to construct the join, the entire pages of the relevant tuples in tables "order" and "customer" will need to be drawn into memory.

Thus, if there are a whole bunch of columns on each table, the data in those extra columns (e.g. all columns aside from "id", the one that was asked for in the result set) will indeed be drawn into memory. They may not be drawn into the return set, but they will still be drawn into memory.

If you alter tables "customer" and "order", taking some columns off and stowing them in separate tables, then you'll find that more tuples of "customer" and "order" will fit into a buffer page, and that the join will be assembled with somewhat less memory usage.

Whether or not that is a worthwhile change to make will vary considerably.

--
(reverse (concatenate 'string "ofni.sesabatadxunil" "@" "enworbbc"))
http://linuxdatabases.info/info/languages.html
There are two kinds of pedestrians -- the quick and the dead.
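P.S. A rough way to put numbers on that (a sketch; pg_stats is populated by ANALYZE, and the ~28 bytes of per-row overhead is approximate):

    -- average data bytes per row of "customer":
    select sum(avg_width) as avg_row_width
    from pg_stats
    where tablename = 'customer';

    -- tuples per 8 KB page is then roughly 8192 / (avg_row_width + ~28)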
One other possible reason for splitting the table up into two chunks is to grant different rights on the two sets of columns.
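For example (a sketch, assuming the users/user_prefs split discussed above; the role name is illustrative):

    grant select on users to forum_reader;
    revoke all on user_prefs from forum_reader;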
Susan Cassidy
On 05/15/2007 09:44 AM, Bill Moseley <moseley@hank.org> wrote:
> Sorry, I don't mean to drag this thread out much longer. But I have
> one more question regarding joins.
>
> Say I have a customer table and an order table. I want a list of all
> order id's for a given customer.
>
>     SELECT o.id
>     FROM order o
>     JOIN customer c on o.customer = c.id
>
> Does that bring into memory all columns from both order and customer?
> Maybe that's not a good example due to indexes.
>
> See, I've seen this splitting of one-to-one tables a number of times
> (such as the user and user_preferences example) and I'm not sure if
> that's just poor schema design, premature optimization, or someone
> making smart use of their knowledge of the internal workings of
> PostgreSQL....
>
> --
> Bill Moseley
> moseley@hank.org
On Tue, May 15, 2007 at 01:42:18PM -0400, Chris Browne wrote:
> lists@peufeu.com (PFC) writes:
> >> SELECT o.id
> >> FROM order o
> >> JOIN customer c on o.customer = c.id
> >>
> >> Does that bring into memory all columns from both order and customer?
> >> Maybe that's not a good example due to indexes.
> >
> > No, it just pulls the columns you ask for from the table, nothing
> > less, nothing more.
>
> That's not quite 100% accurate.
>
> In order to construct the join, the entire pages of the relevant
> tuples in tables "order" and "customer" will need to be drawn into
> memory.
>
> Thus, if there are a whole bunch of columns on each table, the data in
> those extra columns (e.g. all columns aside from "id", the one that
> was asked for in the result set) will indeed be drawn into memory.

Is that specific to PostgreSQL? From an outside perspective it just seems odd that potentially a large amount of data would be pulled off disk into memory that is never used. Perhaps there's an overriding reason for this.

> If you alter tables "customer" and "order", taking some columns off
> and stowing them in separate tables, then you'll find that more tuples
> of "customer" and "order" will fit into a buffer page, and that the
> join will be assembled with somewhat less memory usage.
>
> Whether or not that is a worthwhile change to make will vary
> considerably.

Makes designing the schema a bit tough. ;)

--
Bill Moseley
moseley@hank.org
On May 15, 2007, at 12:02 PM, Bill Moseley wrote:
> On Tue, May 15, 2007 at 01:42:18PM -0400, Chris Browne wrote:
>> That's not quite 100% accurate.
>>
>> In order to construct the join, the entire pages of the relevant
>> tuples in tables "order" and "customer" will need to be drawn into
>> memory.
>>
>> Thus, if there are a whole bunch of columns on each table, the data in
>> those extra columns (e.g. all columns aside from "id", the one that
>> was asked for in the result set) will indeed be drawn into memory.
>
> Is that specific to PostgreSQL? From an outside perspective it just
> seems odd that potentially a large amount of data would be pulled off
> disk into memory that is never used. Perhaps there's an overriding
> reason for this.

The columns are next to each other on the disk. You need to read the entire block off disk into system cache, so you'll be reading all the columns of all the rows in that block into memory.

That's just the way that most (all?) modern filesystems work, and so the way that most filesystem-based databases are going to work. I've seen some databases that don't store all the main columns of a table together on disk, but they're fairly rare.

Pushing data into lookaside tables, either manually or automatically via TOAST, changes the tradeoffs.

Cheers,
Steve
>> Thus, if there are a whole bunch of columns on each table, the data in
>> those extra columns (e.g. all columns aside from "id", the one that
>> was asked for in the result set) will indeed be drawn into memory.
>
> Is that specific to PostgreSQL? From an outside perspective it just
> seems odd that potentially a large amount of data would be pulled off
> disk into memory that is never used. Perhaps there's an overriding
> reason for this.

Anything else would seem odd to me. Pulling a page into memory typically has OS support, and is thus very fast. Picking and choosing bits and pieces to read would be prohibitively slow.

Moreover, caching only those bits and pieces would require complicated code to decide whether the cached data is relevant to the next query. Validating cached data at the page level is much simpler, and thus faster.

Or so I assume ...

- John D. Burger
  MITRE
>> Thus, if there are a whole bunch of columns on each table, the data in
>> those extra columns (e.g. all columns aside from "id", the one that
>> was asked for in the result set) will indeed be drawn into memory.

Yeah, that's what I meant ;)

All the columns are loaded into memory (except the TOASTed ones which are not mentioned in the query), but only the requested ones are processed and returned to the client...

> Is that specific to PostgreSQL?

Nope. All databases do more or less the same.

> From an outside perspective it just
> seems odd that potentially a large amount of data would be pulled off
> disk into memory that is never used. Perhaps there's an overriding
> reason for this.

Yeah, where would you put this data if you didn't put it where it is now?

>> If you alter tables "customer" and "order", taking some columns off
>> and stowing them in separate tables, then you'll find that more tuples
>> of "customer" and "order" will fit into a buffer page, and that the
>> join will be assembled with somewhat less memory usage.
>>
>> Whether or not that is a worthwhile change to make will vary
>> considerably.
>
> Makes designing the schema a bit tough. ;)

"Premature optimization is the root of all evil."

Build a test database, fill it with data, and experiment.
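For instance, a quick sketch for generating a synthetic table to experiment with (names are illustrative):

    create table users_test as
        select i as id,
               'user_' || i as login,
               md5(i::text) as email
        from generate_series(1, 1000000) as i;

    analyze users_test;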
On Tue, May 15, 2007 at 09:20:53PM +0200, PFC wrote:
> > From an outside perspective it just
> > seems odd that potentially a large amount of data would be pulled off
> > disk into memory that is never used. Perhaps there's an overriding
> > reason for this.
>
> Yeah, where would you put this data if you didn't put it where it is
> now?

Swish-e isn't a database by any means, but it does have a way to store column-like metadata for each "row". When it does a search it only explicitly pulls from disk the metadata that it's asked to return. Granted, the OS is reading from disk more than the application is asking for, but the application is only allocating memory for the data it's going to return. And the column metadata is not always stored together on disk.

Without knowing Pg internals I wasn't aware of how the actual table data was organized and fetched into memory.

> "Premature optimization is the root of all evil"

Exactly what prompted this thread. ;)

--
Bill Moseley
moseley@hank.org
On May 14, 2007, at 4:37 PM, Bill Moseley wrote:
> Say that there's also about 10 columns of settings or preferences for
> each user. Are there any cases or reasons to have a separate
> "user_preferences" table vs. just placing all the columns together in
> one table?

When you have, say, 65 million users, it makes sense to push the ancillary info to another table to keep from having to copy too much data when you update the main info (like last access time).
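The point being that under MVCC an UPDATE writes a complete new row version, so even a one-column change copies the whole row (a sketch, reusing the earlier illustrative names):

    -- rewrites the entire row version, however wide the row is:
    update users set last_accessed = now() where id = 42;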
On May 15, 2007, at 10:35 AM, Bill Moseley wrote:
>> For some value of "large", is there a time when one might consider
>> using a single column in the user or user_prefs table to represent
>> their color choices instead of a link table?

We use bitfields on our large user table. It is becoming unworkable to scan for matches, since overall most people have very few selections made.

We are moving it to a model like your favorite_colors table, which just links the option and the user.

We find that joins on large tables are very fast in Postgres when they can be indexed to avoid full table scans, since the index can do much of the culling of potential matching rows. With bitfields, you are more or less forced into doing a sequential scan to find everyone who likes the color red.

Of course, if you're playing with only a few thousand users, either approach works well.
> We use bitfields on our large user table. It is becoming unworkable to
> scan for matches, since overall most people have very few selections
> made.
>
> We are moving it to a model like your favorite_colors table, which just
> links the option and the user.
>
> We find that joins on large tables are very fast in Postgres when they
> can be indexed to avoid full table scans, since the index can do much
> of the culling of potential matching rows. With bitfields, you are
> more or less forced into doing a sequential scan to find everyone who
> likes the color red.
>
> Of course, if you're playing with only a few thousand users, either
> approach works well.

Things you could try:

* Use an integer array instead of a bitfield (for instance, in the users table, the column favourites would contain {1,2} if the user selected items 1 and 2). Then you can make a GiST index on it and use the indexed intersection operator. This is likely the optimal solution if the maximum number of items is small (say, 100 is good, 100000 is not). See the sketch after this list.

* Keep your bitfields and create conditional (partial) indexes. Note that a partial-index predicate must be boolean, so:

    CREATE INDEX ... WHERE (bitfield_column & 1) <> 0;
    CREATE INDEX ... WHERE (bitfield_column & 2) <> 0;
    CREATE INDEX ... WHERE (bitfield_column & 4) <> 0;
    CREATE INDEX ... WHERE (bitfield_column & 8) <> 0;
    CREATE INDEX ... WHERE (bitfield_column & 16) <> 0;
    etc...

Obviously this will only work if you have, say, 10 favourites. 100 indexes on a table would really suck. Then, when looking for users who chose bits 1 and 2, do:

    SELECT ... WHERE (bitfield_column & 1) <> 0 AND (bitfield_column & 2) <> 0;

Postgres will do a bitmap AND using the two indexes. (Note: when we get bitmap indexes, this will be even better.)

* Use tsearch2: store favourites = 'red blue' and full-text-search it.

* Use a favourites table. This makes queries hard to optimize. Consider the table (user_id, item_id), meaning the user selected this item as a favourite. If you want to know which users selected both items 1 and 2, you have to do a self-join, something like:

    SELECT ...
    FROM favourites a, favourites b
    WHERE a.user_id = b.user_id
      AND a.item_id = 1 AND b.item_id = 2;

This is likely to be "not very fast" if 1 million users check each option but only 100 check both.
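A minimal sketch of the first option (requires the contrib/intarray module; table and column names are illustrative):

    create index users_favs_idx on users using gist (favourites gist__int_ops);

    -- users who selected both items 1 and 2 (indexable containment):
    select id from users where favourites @> '{1,2}'::int[];

    -- users who selected item 1 or item 2 (indexable overlap):
    select id from users where favourites && '{1,2}'::int[];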
On 16-May-07, at 4:05 PM, PFC wrote:
> This makes queries hard to optimize. Consider the table (user_id,
> item_id), meaning the user selected this item as a favourite.
>
> If you want to know which users selected both items 1 and 2, you have
> to do a self-join, something like:
>
>     SELECT ...
>     FROM favourites a, favourites b
>     WHERE a.user_id = b.user_id
>       AND a.item_id = 1 AND b.item_id = 2;
You could get users who have selected both items 1 and 2 without doing a self-join with a query like the following:
    select user_id
    from favourites
    where item_id = 1 or item_id = 2
    group by user_id
    having count(*) = 2;
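That assumes (user_id, item_id) pairs are unique. A sketch of the generalization to any N items, using count(distinct ...) to be safe against duplicates:

    select user_id
    from favourites
    where item_id in (1, 2, 3)
    group by user_id
    having count(distinct item_id) = 3;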