Thread: Modeling a table with arbitrary columns

Modeling a table with arbitrary columns

From
Andreas Hartmann
Date:
Hi everyone,

I want to model the following scenario for an online marketing application:

Users can create mailings. The list of recipients can be uploaded as
spreadsheets with arbitrary columns (each row is a recipient). I expect
the following maximum quantities the DB will contain:

* up to 5000 mailings
* 0-10'000 recipients per mailing, average maybe 2000
* approx. 20 columns per spreadsheet

I see basically two approaches to store the recipients:

A) A single table with a fixed number of generic columns. If the
spreadsheet has less columns than the table, the values will be null.

CREATE TABLE recipient (
   mailing integer,
   row integer,
   col_1 text,
   …
   col_50 text,
   PRIMARY KEY (mailing, row),
   FOREIGN KEY mailing REFERENCES mailing(id)
);


B) Two tables, one for the recipients and one for the values:

CREATE TABLE recipient (
   mailing integer,
   row integer,
   PRIMARY KEY (mailing, row),
   FOREIGN KEY mailing REFERENCES mailing(id)
);

CREATE TABLE recipient_value (
   mailing integer,
   row integer,
   column integer,
   value text,
   PRIMARY KEY (mailing, row, column),
   FOREIGN KEY mailing REFERENCES mailing(id),
   FOREIGN KEY row REFERENCES recipient(row)
);


I have the feeling that the second approach is cleaner. But since the
recipient_value table will contain approx. 20 times more rows than the
recipient table in approach A, I'd expect a performance degradation.

Is there a limit to the number of rows that should be stored in a table?
With approach B the maximum number of rows could be about 200'000'000,
which sounds quite a lot …

Thanks a lot in advance for any suggestions!

Best regards,
Andreas



--
Andreas Hartmann, CTO
BeCompany GmbH
http://www.becompany.ch
Tel.: +41 (0) 43 818 57 01

Re: Modeling a table with arbitrary columns

From
Thomas Kellerer
Date:
Andreas Hartmann wrote on 29.10.2009 21:52:
> Hi everyone,
>
> I want to model the following scenario for an online marketing application:
>
> Users can create mailings. The list of recipients can be uploaded as
> spreadsheets with arbitrary columns (each row is a recipient). I expect
> the following maximum quantities the DB will contain:
>
> * up to 5000 mailings
> * 0-10'000 recipients per mailing, average maybe 2000
> * approx. 20 columns per spreadsheet
>
[...]
>
> I have the feeling that the second approach is cleaner. But since the
> recipient_value table will contain approx. 20 times more rows than the
> recipient table in approach A, I'd expect a performance degradation.
>
> Is there a limit to the number of rows that should be stored in a table?
> With approach B the maximum number of rows could be about 200'000'000,
> which sounds quite a lot …
>

I don't think the number of rows is that critical (it sure doesn't hit any "limits". The question is how you want to
accessthat information and how quick that has to be. If you need sub-second response time for aggregates over that,
you'llprobably have to throw quite some hardware at it.  

You could also check out the hstore contrib module which lets you store key value pairs in a single column, which might
actuallybe what you are looking for (note that I have never used it, so I cannot tell how fast it acutally is) 

http://www.postgresql.org/docs/current/static/hstore.html

So something like

CREATE TABLE recipient (
   mailing integer NOT NULL,
   row integer NOT NULL,
   recipient_values hstore,
   PRIMARY KEY (mailing, row),
   FOREIGN KEY (mailing) REFERENCES mailing (id)
)

Btw: I would get rid of a column named "row", this more a "recipient_id", but that is just personal taste.

Regards
Thomas

Re: Modeling a table with arbitrary columns

From
Віталій Тимчишин
Date:


2009/10/29 Andreas Hartmann <andreas@apache.org>
Hi everyone,

I want to model the following scenario for an online marketing application:

Users can create mailings. The list of recipients can be uploaded as spreadsheets with arbitrary columns (each row is a recipient). I expect the following maximum quantities the DB will contain:

I see basically two approaches to store the recipients:

A) A single table with a fixed number of generic columns. If the spreadsheet has less columns than the table, the values will be null.

B) Two tables, one for the recipients and one for the values:

One more option is to use arrays (and single table).

Re: Modeling a table with arbitrary columns

From
Robert Haas
Date:
On Thu, Oct 29, 2009 at 4:52 PM, Andreas Hartmann <andreas@apache.org> wrote:
> Hi everyone,
>
> I want to model the following scenario for an online marketing application:
>
> Users can create mailings. The list of recipients can be uploaded as
> spreadsheets with arbitrary columns (each row is a recipient). I expect the
> following maximum quantities the DB will contain:
>
> * up to 5000 mailings
> * 0-10'000 recipients per mailing, average maybe 2000
> * approx. 20 columns per spreadsheet
>
> I see basically two approaches to store the recipients:
>
> A) A single table with a fixed number of generic columns. If the spreadsheet
> has less columns than the table, the values will be null.
>
> CREATE TABLE recipient (
>  mailing integer,
>  row integer,
>  col_1 text,
>  …
>  col_50 text,
>  PRIMARY KEY (mailing, row),
>  FOREIGN KEY mailing REFERENCES mailing(id)
> );
>
>
> B) Two tables, one for the recipients and one for the values:
>
> CREATE TABLE recipient (
>  mailing integer,
>  row integer,
>  PRIMARY KEY (mailing, row),
>  FOREIGN KEY mailing REFERENCES mailing(id)
> );
>
> CREATE TABLE recipient_value (
>  mailing integer,
>  row integer,
>  column integer,
>  value text,
>  PRIMARY KEY (mailing, row, column),
>  FOREIGN KEY mailing REFERENCES mailing(id),
>  FOREIGN KEY row REFERENCES recipient(row)
> );
>
>
> I have the feeling that the second approach is cleaner. But since the
> recipient_value table will contain approx. 20 times more rows than the
> recipient table in approach A, I'd expect a performance degradation.
>
> Is there a limit to the number of rows that should be stored in a table?
> With approach B the maximum number of rows could be about 200'000'000, which
> sounds quite a lot …
>
> Thanks a lot in advance for any suggestions!

Another possibility would be to create a table for each upload based
on the columns that are actually present.  Just have your upload
script read the spreadsheet, determine the format, and create an
appropriate table for that particular upload.

But a lot of it depends on what kinds of queries you want to write.

...Robert