Thread: Are there performance advantages in storing bulky field in separate table?

Are there performance advantages in storing bulky field in separate table?

From

Ian Mayo

Date:

08 April 2009, 12:26:15

Hi all,
I'm designing a Postgresql database, and would appreciate this design advice.

I've got a fairly straightforward table that's similar to a blog table
(entryId, date, title, author, etc).  There is, however, the
requirement to allow a single, fairly bulky binary attachment to
around 1% of the rows.

There will be a few million rows, and I value efficient searches by
date, title, and author.

Would there be a performance advantage in storing the attachment in a
separate table - linked by entryId foreign key?  Or shall I just
include it as an ALLOW NULL field my blog table?

[of course, I'd appreciate redirection to the 'right' list if this is
the wrong one].

cheers,
Ian

Re: Are there performance advantages in storing bulky field in separate table?

From

Tom Lane

Date:

08 April 2009, 12:47:47

Ian Mayo <ianmayo@tesco.net> writes:
> I've got a fairly straightforward table that's similar to a blog table
> (entryId, date, title, author, etc).  There is, however, the
> requirement to allow a single, fairly bulky binary attachment to
> around 1% of the rows.

> There will be a few million rows, and I value efficient searches by
> date, title, and author.

> Would there be a performance advantage in storing the attachment in a
> separate table - linked by entryId foreign key?

No.  You'd basically be manually reinventing the TOAST mechanism;
or the large object mechanism, if you choose to store the blob
as a large object rather than a plain bytea field.  Either way,
it won't physically be in the same table as the main row data.

If you're curious, this goes into some of the gory details:
http://www.postgresql.org/docs/8.3/static/storage-toast.html

            regards, tom lane

Re: Are there performance advantages in storing bulky field in separate table?

From

Ian Mayo

Date:

08 April 2009, 12:56:40

Cheers Tom,

On Wed, Apr 8, 2009 at 4:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ian Mayo <ianmayo@tesco.net> writes:
>> [snip]
> No.  You'd basically be manually reinventing the TOAST mechanism;
> or the large object mechanism, if you choose to store the blob
> as a large object rather than a plain bytea field.  Either way,
> it won't physically be in the same table as the main row data.

fine, that keeps the design simpler

> If you're curious, this goes into some of the gory details:
> http://www.postgresql.org/docs/8.3/static/storage-toast.html

Oooh, no, much too gory for me.

>                        regards, tom lane

thanks again.

One more thing:  hey, did you hear?  I just got some advice from Tom Lane!

Ian

Re: Are there performance advantages in storing bulky field in separate table?

From

Sam Mason

Date:

08 April 2009, 13:39:07

On Wed, Apr 08, 2009 at 04:56:35PM +0100, Ian Mayo wrote:
> One more thing:  hey, did you hear?  I just got some advice from Tom Lane!

Statistically speaking; he's the person most likely to answer you by
quite a long way.  Out of the ~24k emails going back to Oct 2007 I've
got from pgsql-general the most common people who wrote them are:

   who              num mails  of total
  Tom Lane              1,935    8.0%
  Scott Marlowe         1,077    4.5%
  Alvaro Herrera          521    2.2%
  Joshua Drake            468    1.9%
  Richard Huxton          432    1.8%
  Craig Ringer            338    1.4%
  Ivan Sergio Borgonovo   314    1.3%
  Sam Mason               310    1.3%
  Raymond O'Donnell       270    1.1%
  Martijn van Oosterhout  264    1.1%
  Greg Smith              252    1.0%

The remaining ~2000 distinct addresses were less than one percent each.

I didn't expect to see myself there; ho hum, maybe I should spend less
time on email!  It's also somewhat biased as I only have archives as
long as I've posted.  I just tried pulling numbers from markmail.org and
get somewhat different results.  Here it knows about 161k messages and
the top twenty posters are:

  Tom Lane               14,147    8.8%
  Bruce Momjian           3,400    2.1%
  Scott Marlowe           3,112    1.9%
  Richard Huxton          2,738    1.7%
  Martijn van Oosterhout  2,480    1.5%
  Alvaro Herrera          1,853    1.2%
  Stephan Szabo           1,783    1.1%
  Joshua D. Drake         1,720    1.1%
  Peter Eisentraut        1,488    0.9%
  Michael Fuhr            1,328    0.8%
  Bruno Wolff III         1,201    0.7%
  Andrew Sullivan           985    0.6%
  Doug McNaught             773    0.5%
  Jan Wieck                 764    0.5%
  Ron Johnson               764    0.5%
  Jim C. Nasby              745    0.5%
  Magnus Hagander           665    0.4%
  Marc G. Fournier          630    0.4%
  Dennis Gearon             602    0.4%
  The Hermit Hacker         601    0.4%

I've not been able to merge people where they use different email
address like with my own archive, but manually fiddled Scott Marlowe as
he appeared as "scott.marlowe" as well.  I'm glad to see I drop off the
bottom now!

--
  Sam  http://samason.me.uk/

Re: Are there performance advantages in storing bulky field in separate table?

From

Ivan Sergio Borgonovo

Date:

08 April 2009, 14:28:24

On Wed, 8 Apr 2009 17:39:02 +0100
Sam Mason <sam@samason.me.uk> wrote:

> On Wed, Apr 08, 2009 at 04:56:35PM +0100, Ian Mayo wrote:
> > One more thing:  hey, did you hear?  I just got some advice from
> > Tom Lane!
>
> Statistically speaking; he's the person most likely to answer you
> by quite a long way.  Out of the ~24k emails going back to Oct
> 2007 I've got from pgsql-general the most common people who wrote
> them are:

>    who              num mails  of total
>   Tom Lane              1,935    8.0%
>   Scott Marlowe         1,077    4.5%
>   Alvaro Herrera          521    2.2%
>   Joshua Drake            468    1.9%
>   Richard Huxton          432    1.8%
>   Craig Ringer            338    1.4%
>   Ivan Sergio Borgonovo   314    1.3%

I just wrote privately to Tom that I'm ashamed I ask so much and
answer so few.
But well I'm an exception ;) I'm the top of non-contributors.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it

Re: Are there performance advantages in storing bulky field in separate table?

From

Robert Treat

Date:

08 April 2009, 16:13:43

On Wednesday 08 April 2009 11:56:35 Ian Mayo wrote:
> Cheers Tom,
>
> On Wed, Apr 8, 2009 at 4:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Ian Mayo <ianmayo@tesco.net> writes:
> >> [snip]
> >
> > No.  You'd basically be manually reinventing the TOAST mechanism;
> > or the large object mechanism, if you choose to store the blob
> > as a large object rather than a plain bytea field.  Either way,
> > it won't physically be in the same table as the main row data.
>
> fine, that keeps the design simpler
>

Maybe I've been reading too much Pascal again lately, but if only 1% of your
rows are going to have data in this column, personally, I'd put it in a
separate table.

--
Robert Treat
Conjecture: http://www.xzilla.net
Consulting: http://www.omniti.com

Re: Are there performance advantages in storing bulky field in separate table?

From

Ian Mayo

Date:

08 April 2009, 16:30:40

On Wed, Apr 8, 2009 at 8:13 PM, Robert Treat
<xzilla@users.sourceforge.net> wrote:
> Maybe I've been reading too much Pascal again lately, but if only 1% of your
> rows are going to have data in this column, personally, I'd put it in a
> separate table.

thanks for that Robert - it does match my (completely groundless)
first impression.

In the nature of debate, would you mind passing on the pascal-related
reasons why you'd put the data in another table?

cheers,
Ian

Re: Are there performance advantages in storing bulky field in separate table?

From

Ron Mayer

Date:

08 April 2009, 17:06:01

Sam Mason wrote:
> On Wed, Apr 08, 2009 at 04:56:35PM +0100, Ian Mayo wrote:
>> One more thing:  hey, did you hear?  I just got some advice from Tom Lane!
>
> Statistically speaking; he's the person most likely to answer you by

Even so, this might be the #1 advantage of Postgres over Oracle (cost
being #2).

Unless you're one of their ten biggest customers, I imagine it'd take
quite some time to similar support from the core team's counterparts
of the other big databases.

Re: Are there performance advantages in storing bulky field in separate table?

From

Chris Browne

Date:

08 April 2009, 17:55:11

ianmayo@tesco.net (Ian Mayo) writes:
> On Wed, Apr 8, 2009 at 8:13 PM, Robert Treat
> <xzilla@users.sourceforge.net> wrote:
>> Maybe I've been reading too much Pascal again lately, but if only 1% of your
>> rows are going to have data in this column, personally, I'd put it in a
>> separate table.
>
> thanks for that Robert - it does match my (completely groundless)
> first impression.
>
> In the nature of debate, would you mind passing on the pascal-related
> reasons why you'd put the data in another table?

Fabian Pascal's thesis is that you shouldn't have NULLs altogether, as
this leads to having to support the 3-or-more-valued logic of NULLs.
The "Third Manifesto" declines to support having NULLs in relations.

Hugh Darwen wrote the relevant paper explaining how to avoid them:
   http://web.onetel.com/~hughdarwen/TheThirdManifesto/Missing-info-without-nulls.pdf

I tend to agree that it is reasonable to go to *some* effort to avoid
having NULL values.

Unfortunately, it seems to me that Darwen's papers elaboration on the
issue doesn't present a solution that is without points to criticize.
He does nicely describe how you may indicate various reasons why you
might have missing information.  This is both good and bad...

 - It is good because it provides unambiguous ways to determine why
   the data was missing.  NULL leaves that ambiguous.

 - It is Not So Good because it replaces the 3-value-logic of NULLs
   with an "as many values for logic as we have kinds of unknown
   values," which is more like a 5- or 6-value logic.

http://en.wikipedia.org/wiki/Fabian_Pascal

"Pascal is known for his sharp criticisms of the data management
industry, trade press, current state of higher education, Western
culture and alleged media bias. Pascal advocates strict adherence to
the principles of the relational model, and argues that departing from
the model in the name of pragmatism is responsible for serious data
management troubles. Criticism of Pascal's advocacy often centers
around his polemical style, which some perceive as overly
confrontational and unprofessional.

He has retired from the technological industry and now does political
commentary, specially on Middle East issues."
--
let name="cbbrowne" and tld="cbbrowne.com" in name ^ "@" ^ tld;;
http://linuxdatabases.info/info/linuxxian.html
"The only thing  better than TV with the  sound off is  Radio with the
sound off." -- Dave Moon

Re: Are there performance advantages in storing bulky field in separate table?

From

Robert Treat

Date:

08 April 2009, 18:07:03

On Wednesday 08 April 2009 15:30:28 Ian Mayo wrote:
> On Wed, Apr 8, 2009 at 8:13 PM, Robert Treat
>
> <xzilla@users.sourceforge.net> wrote:
> > Maybe I've been reading too much Pascal again lately, but if only 1% of
> > your rows are going to have data in this column, personally, I'd put it
> > in a separate table.
>
> thanks for that Robert - it does match my (completely groundless)
> first impression.
>
> In the nature of debate, would you mind passing on the pascal-related
> reasons why you'd put the data in another table?
>

You can be sure that discussion of this topic in this forum will soon be
visited by religious zealots, but the short answer is "nulls are bad, mmkay".
A slightly longer answer would be that, as a general rule, attributes of your
relations that only apply to 1% of the rows are better represented as a one
to N relationship using a second table. For a longer answer, see
http://www.databasedesign-resource.com/null-values-in-a-database.html
or http://www.dbazine.com/ofinterest/oi-articles/pascal27

--
Robert Treat
Conjecture: http://www.xzilla.net
Consulting: http://www.omniti.com

Re: Are there performance advantages in storing bulky field in separate table?

From

Sam Mason

Date:

08 April 2009, 19:07:34

On Wed, Apr 08, 2009 at 05:06:44PM -0400, Robert Treat wrote:
> A slightly longer answer would be that, as a general rule, attributes
> of your relations that only apply to 1% of the rows are better
> represented as a one to N relationship using a second table.

Have you tried to maintain a non-trivial schema that does this?  I'd be
interested to know how it works because I've only tried to work with
small examples that do this and it gets difficult to maintain very
quickly.

> For a longer answer, see
> http://www.databasedesign-resource.com/null-values-in-a-database.html
> or http://www.dbazine.com/ofinterest/oi-articles/pascal27

Both of those articles seem to be written by people who struggle with,
or have incomplete mental models of, the semantics of NULL values.  The
second also appears to be designed to sell a book so is of course going
to be presenting biased viewpoints.

How would outer joins work without some concept of a missing value.
Once you allow these missing values as the result of an outer join
you would be deliberately introducing limits if you couldn't also save
these values back into tables.  I would say that defaulting columns to
allowing NULLs was a mistake though.

I'd be happy without NULLs in databases if there was some other way
to handle missing values.  Parametric polymorphism and some sort of
option[1] or Maybe[2] type is what springs to mind for me.  NULL would
be represented as NONE or Nothing respectively and non-NULL values as
(SOME v) or (Just v).

--
  Sam  http://samason.me.uk/

 [1] http://www.standardml.org/Basis/option.html
 [2] http://www.haskell.org/onlinereport/maybe.html

Re: Are there performance advantages in storing bulky field in separate table?

From

Thomas Kellerer

Date:

08 April 2009, 19:22:45

Robert Treat wrote on 08.04.2009 23:06:

> http://www.databasedesign-resource.com/null-values-in-a-database.html

That is a very - hmm - strange article.

One of the proofs that nulls are bad is that "SELECT * FROM theTable" (theTable
being empty) returns nothing, whereas SELECT COUNT(*) FROM theTable returns a
single row with 0 (zero):

"This last example is even worse: The SELECT * returns 'No rows selected', but
the SELECT COUNT(*) returns ONE row with the value 0!"

I stopped reading the article at that point...

Re: Are there performance advantages in storing bulky field in separate table?

From

Ron Mayer

Date:

08 April 2009, 19:25:34

Robert Treat wrote:
>
> You can be sure that discussion of this topic in this forum will soon be
> visited by religious zealots, but the short answer is "nulls are bad, mmkay".
> A slightly longer answer would be that, as a general rule, attributes of your
> relations that only apply to 1% of the rows are better represented as a one

To fulfill your prophecy of zealotry, I've got a number of tables
with columns that are mostly null that I can't think of that nice a
way of refactoring.  I'd love ideas to improve the design, though.

One example's an address table. Most addresses have a few fields
that are typically present (building number, city, state, etc).
Others, as described in various government's address standards,
are fields that are typically absent.  For example in US addressing
rules, the "Urbanization Name" line:
    http://www.usps.com/ncsc/addressstds/addressformats.htm
    MRS MARIA SUAREZ              Name
    URB LAS GLADIOLAS             Urbanization name
    150 CALLE A                   House no. and st. name
    SAN JUAN PR 00926-3232        City, state, and ZIP+4
Similarly sparse columns in my address tables are,
titles, division/department Names and mailstop codes.
(described here: http://pe.usps.gov/text/pub28/pub28c3_011.htm)

While I realize I could stick in some string (empty string, or
some other magic string like "urbanization name doesn't apply to
this address") into a table, it sure is convenient to put nulls
in those columns.

I'm quite curious what you'd suggest a well-designed address table
would look like without nulls.

Re: Are there performance advantages in storing bulky field in separate table?

From

"Leif B. Kristensen"

Date:

08 April 2009, 20:57:12

On Wednesday 8. April 2009, Ron Mayer wrote:
>Sam Mason wrote:
>> On Wed, Apr 08, 2009 at 04:56:35PM +0100, Ian Mayo wrote:
>>> One more thing:  hey, did you hear?  I just got some advice from
>>> Tom Lane!
>>
>> Statistically speaking; he's the person most likely to answer you by
>
>Even so, this might be the #1 advantage of Postgres over Oracle (cost
>being #2).

I'll subscribe to that. Like almost everybody else on this list, I've
got some excellent replies from Tom. And of course it's one of the
major advantages of PostgreSQL. (The cost being a close second.)
--
Leif Biberg Kristensen | Registered Linux User #338009
Me And My Database: http://solumslekt.org/blog/

Re: Are there performance advantages in storing bulky field in separate table?

From

Richard Huxton

Date:

09 April 2009, 08:30:54

Ivan Sergio Borgonovo wrote:
> On Wed, 8 Apr 2009 17:39:02 +0100
> Sam Mason <sam@samason.me.uk> wrote:
>
>>    who              num mails  of total
>>   Tom Lane              1,935    8.0%
>>   Scott Marlowe         1,077    4.5%
>>   Alvaro Herrera          521    2.2%
>>   Joshua Drake            468    1.9%
>>   Richard Huxton          432    1.8%
>>   Craig Ringer            338    1.4%
>>   Ivan Sergio Borgonovo   314    1.3%
>
> I just wrote privately to Tom that I'm ashamed I ask so much and
> answer so few.
> But well I'm an exception ;) I'm the top of non-contributors.

Not so fast there citizen. I'll thank you to note that I've not
contributed any code either, and for a significant number of years too :-)

--
   Richard Huxton
   Archonet Ltd

Re: Are there performance advantages in storing bulky field in separate table?

From

Grzegorz Jaśkiewicz

Date:

09 April 2009, 08:54:57

If I may, I got an instance once, where table with bytea field was
pretty slow. Turned out, that queries modified everything apart from
bytea bit.
moving it to separate table actually helped performance.

But that only will happen providing that you have the
bytea/text/whatever that won't change, once inserted.

Re: Are there performance advantages in storing bulky field in separate table?

From

Robert Treat

Date:

09 April 2009, 09:23:52

On Wednesday 08 April 2009 18:25:25 Ron Mayer wrote:
> Robert Treat wrote:
> > You can be sure that discussion of this topic in this forum will soon be
> > visited by religious zealots, but the short answer is "nulls are bad,
> > mmkay". A slightly longer answer would be that, as a general rule,
> > attributes of your relations that only apply to 1% of the rows are better
> > represented as a one
>
> To fulfill your prophecy of zealotry, I've got a number of tables
> with columns that are mostly null that I can't think of that nice a
> way of refactoring.  I'd love ideas to improve the design, though.
>
> One example's an address table. Most addresses have a few fields
> that are typically present (building number, city, state, etc).
> Others, as described in various government's address standards,
> are fields that are typically absent.  For example in US addressing
> rules, the "Urbanization Name" line:
>     http://www.usps.com/ncsc/addressstds/addressformats.htm
>     MRS MARIA SUAREZ              Name
>     URB LAS GLADIOLAS             Urbanization name
>     150 CALLE A                   House no. and st. name
>     SAN JUAN PR 00926-3232        City, state, and ZIP+4
> Similarly sparse columns in my address tables are,
> titles, division/department Names and mailstop codes.
> (described here: http://pe.usps.gov/text/pub28/pub28c3_011.htm)
>
> While I realize I could stick in some string (empty string, or
> some other magic string like "urbanization name doesn't apply to
> this address") into a table, it sure is convenient to put nulls
> in those columns.
>
> I'm quite curious what you'd suggest a well-designed address table
> would look like without nulls.

The decision here would depend on your perticular sect of the anti-null
religion, but you have a couple of choices:

1) Break these fields out into one or more tables, containing entries only for
those address that have the additional information. Ideally you might be able
to do something like "extended_address_info" where all of these fields could
be kept, all of them being non-null. I suspect you can't do the ideal, so
you'd end up with a bunch of tables.

This would be used by the "normalization trumps nullification" sect

2) Given that all of these columns have an authoritarian source of what should
be allowed, you could use the "magic string" approach without requiring too
much magic, and these columns could even be a foriegn key into a table
containing the authoritarian options.

This could be justified by the all nulls are bad sect, but might also be used
by a null using crowd who take a strict approach to nulls meaning "unknown
value", since here it isn't that the value is unknown; there isn't a valid
value for these columns. (Adding the magic string to your FK table creates a
valid reference value for those entries that would otherwise not match)

Personally, if you force me into a "well-designed address table *without*
nulls"  decision, I would take this latter approach. HTH

--
Robert Treat
Conjecture: http://www.xzilla.net
Consulting: http://www.omniti.com