Thread: Are there performance advantages in storing bulky field in separate table?
Hi all, I'm designing a Postgresql database, and would appreciate this design advice. I've got a fairly straightforward table that's similar to a blog table (entryId, date, title, author, etc). There is, however, the requirement to allow a single, fairly bulky binary attachment to around 1% of the rows. There will be a few million rows, and I value efficient searches by date, title, and author. Would there be a performance advantage in storing the attachment in a separate table - linked by entryId foreign key? Or shall I just include it as an ALLOW NULL field my blog table? [of course, I'd appreciate redirection to the 'right' list if this is the wrong one]. cheers, Ian
Ian Mayo <ianmayo@tesco.net> writes: > I've got a fairly straightforward table that's similar to a blog table > (entryId, date, title, author, etc). There is, however, the > requirement to allow a single, fairly bulky binary attachment to > around 1% of the rows. > There will be a few million rows, and I value efficient searches by > date, title, and author. > Would there be a performance advantage in storing the attachment in a > separate table - linked by entryId foreign key? No. You'd basically be manually reinventing the TOAST mechanism; or the large object mechanism, if you choose to store the blob as a large object rather than a plain bytea field. Either way, it won't physically be in the same table as the main row data. If you're curious, this goes into some of the gory details: http://www.postgresql.org/docs/8.3/static/storage-toast.html regards, tom lane
Cheers Tom, On Wed, Apr 8, 2009 at 4:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Ian Mayo <ianmayo@tesco.net> writes: >> [snip] > No. You'd basically be manually reinventing the TOAST mechanism; > or the large object mechanism, if you choose to store the blob > as a large object rather than a plain bytea field. Either way, > it won't physically be in the same table as the main row data. fine, that keeps the design simpler > If you're curious, this goes into some of the gory details: > http://www.postgresql.org/docs/8.3/static/storage-toast.html Oooh, no, much too gory for me. > regards, tom lane thanks again. One more thing: hey, did you hear? I just got some advice from Tom Lane! Ian
On Wed, Apr 08, 2009 at 04:56:35PM +0100, Ian Mayo wrote: > One more thing: hey, did you hear? I just got some advice from Tom Lane! Statistically speaking; he's the person most likely to answer you by quite a long way. Out of the ~24k emails going back to Oct 2007 I've got from pgsql-general the most common people who wrote them are: who num mails of total Tom Lane 1,935 8.0% Scott Marlowe 1,077 4.5% Alvaro Herrera 521 2.2% Joshua Drake 468 1.9% Richard Huxton 432 1.8% Craig Ringer 338 1.4% Ivan Sergio Borgonovo 314 1.3% Sam Mason 310 1.3% Raymond O'Donnell 270 1.1% Martijn van Oosterhout 264 1.1% Greg Smith 252 1.0% The remaining ~2000 distinct addresses were less than one percent each. I didn't expect to see myself there; ho hum, maybe I should spend less time on email! It's also somewhat biased as I only have archives as long as I've posted. I just tried pulling numbers from markmail.org and get somewhat different results. Here it knows about 161k messages and the top twenty posters are: Tom Lane 14,147 8.8% Bruce Momjian 3,400 2.1% Scott Marlowe 3,112 1.9% Richard Huxton 2,738 1.7% Martijn van Oosterhout 2,480 1.5% Alvaro Herrera 1,853 1.2% Stephan Szabo 1,783 1.1% Joshua D. Drake 1,720 1.1% Peter Eisentraut 1,488 0.9% Michael Fuhr 1,328 0.8% Bruno Wolff III 1,201 0.7% Andrew Sullivan 985 0.6% Doug McNaught 773 0.5% Jan Wieck 764 0.5% Ron Johnson 764 0.5% Jim C. Nasby 745 0.5% Magnus Hagander 665 0.4% Marc G. Fournier 630 0.4% Dennis Gearon 602 0.4% The Hermit Hacker 601 0.4% I've not been able to merge people where they use different email address like with my own archive, but manually fiddled Scott Marlowe as he appeared as "scott.marlowe" as well. I'm glad to see I drop off the bottom now! -- Sam http://samason.me.uk/
Re: Are there performance advantages in storing bulky field in separate table?
From
Ivan Sergio Borgonovo
Date:
On Wed, 8 Apr 2009 17:39:02 +0100 Sam Mason <sam@samason.me.uk> wrote: > On Wed, Apr 08, 2009 at 04:56:35PM +0100, Ian Mayo wrote: > > One more thing: hey, did you hear? I just got some advice from > > Tom Lane! > > Statistically speaking; he's the person most likely to answer you > by quite a long way. Out of the ~24k emails going back to Oct > 2007 I've got from pgsql-general the most common people who wrote > them are: > who num mails of total > Tom Lane 1,935 8.0% > Scott Marlowe 1,077 4.5% > Alvaro Herrera 521 2.2% > Joshua Drake 468 1.9% > Richard Huxton 432 1.8% > Craig Ringer 338 1.4% > Ivan Sergio Borgonovo 314 1.3% I just wrote privately to Tom that I'm ashamed I ask so much and answer so few. But well I'm an exception ;) I'm the top of non-contributors. -- Ivan Sergio Borgonovo http://www.webthatworks.it
Re: Are there performance advantages in storing bulky field in separate table?
From
Robert Treat
Date:
On Wednesday 08 April 2009 11:56:35 Ian Mayo wrote: > Cheers Tom, > > On Wed, Apr 8, 2009 at 4:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Ian Mayo <ianmayo@tesco.net> writes: > >> [snip] > > > > No. You'd basically be manually reinventing the TOAST mechanism; > > or the large object mechanism, if you choose to store the blob > > as a large object rather than a plain bytea field. Either way, > > it won't physically be in the same table as the main row data. > > fine, that keeps the design simpler > Maybe I've been reading too much Pascal again lately, but if only 1% of your rows are going to have data in this column, personally, I'd put it in a separate table. -- Robert Treat Conjecture: http://www.xzilla.net Consulting: http://www.omniti.com
On Wed, Apr 8, 2009 at 8:13 PM, Robert Treat <xzilla@users.sourceforge.net> wrote: > Maybe I've been reading too much Pascal again lately, but if only 1% of your > rows are going to have data in this column, personally, I'd put it in a > separate table. thanks for that Robert - it does match my (completely groundless) first impression. In the nature of debate, would you mind passing on the pascal-related reasons why you'd put the data in another table? cheers, Ian
Sam Mason wrote: > On Wed, Apr 08, 2009 at 04:56:35PM +0100, Ian Mayo wrote: >> One more thing: hey, did you hear? I just got some advice from Tom Lane! > > Statistically speaking; he's the person most likely to answer you by Even so, this might be the #1 advantage of Postgres over Oracle (cost being #2). Unless you're one of their ten biggest customers, I imagine it'd take quite some time to similar support from the core team's counterparts of the other big databases.
Re: Are there performance advantages in storing bulky field in separate table?
From
Chris Browne
Date:
ianmayo@tesco.net (Ian Mayo) writes: > On Wed, Apr 8, 2009 at 8:13 PM, Robert Treat > <xzilla@users.sourceforge.net> wrote: >> Maybe I've been reading too much Pascal again lately, but if only 1% of your >> rows are going to have data in this column, personally, I'd put it in a >> separate table. > > thanks for that Robert - it does match my (completely groundless) > first impression. > > In the nature of debate, would you mind passing on the pascal-related > reasons why you'd put the data in another table? Fabian Pascal's thesis is that you shouldn't have NULLs altogether, as this leads to having to support the 3-or-more-valued logic of NULLs. The "Third Manifesto" declines to support having NULLs in relations. Hugh Darwen wrote the relevant paper explaining how to avoid them: http://web.onetel.com/~hughdarwen/TheThirdManifesto/Missing-info-without-nulls.pdf I tend to agree that it is reasonable to go to *some* effort to avoid having NULL values. Unfortunately, it seems to me that Darwen's papers elaboration on the issue doesn't present a solution that is without points to criticize. He does nicely describe how you may indicate various reasons why you might have missing information. This is both good and bad... - It is good because it provides unambiguous ways to determine why the data was missing. NULL leaves that ambiguous. - It is Not So Good because it replaces the 3-value-logic of NULLs with an "as many values for logic as we have kinds of unknown values," which is more like a 5- or 6-value logic. http://en.wikipedia.org/wiki/Fabian_Pascal "Pascal is known for his sharp criticisms of the data management industry, trade press, current state of higher education, Western culture and alleged media bias. Pascal advocates strict adherence to the principles of the relational model, and argues that departing from the model in the name of pragmatism is responsible for serious data management troubles. Criticism of Pascal's advocacy often centers around his polemical style, which some perceive as overly confrontational and unprofessional. He has retired from the technological industry and now does political commentary, specially on Middle East issues." -- let name="cbbrowne" and tld="cbbrowne.com" in name ^ "@" ^ tld;; http://linuxdatabases.info/info/linuxxian.html "The only thing better than TV with the sound off is Radio with the sound off." -- Dave Moon
Re: Are there performance advantages in storing bulky field in separate table?
From
Robert Treat
Date:
On Wednesday 08 April 2009 15:30:28 Ian Mayo wrote: > On Wed, Apr 8, 2009 at 8:13 PM, Robert Treat > > <xzilla@users.sourceforge.net> wrote: > > Maybe I've been reading too much Pascal again lately, but if only 1% of > > your rows are going to have data in this column, personally, I'd put it > > in a separate table. > > thanks for that Robert - it does match my (completely groundless) > first impression. > > In the nature of debate, would you mind passing on the pascal-related > reasons why you'd put the data in another table? > You can be sure that discussion of this topic in this forum will soon be visited by religious zealots, but the short answer is "nulls are bad, mmkay". A slightly longer answer would be that, as a general rule, attributes of your relations that only apply to 1% of the rows are better represented as a one to N relationship using a second table. For a longer answer, see http://www.databasedesign-resource.com/null-values-in-a-database.html or http://www.dbazine.com/ofinterest/oi-articles/pascal27 -- Robert Treat Conjecture: http://www.xzilla.net Consulting: http://www.omniti.com
On Wed, Apr 08, 2009 at 05:06:44PM -0400, Robert Treat wrote: > A slightly longer answer would be that, as a general rule, attributes > of your relations that only apply to 1% of the rows are better > represented as a one to N relationship using a second table. Have you tried to maintain a non-trivial schema that does this? I'd be interested to know how it works because I've only tried to work with small examples that do this and it gets difficult to maintain very quickly. > For a longer answer, see > http://www.databasedesign-resource.com/null-values-in-a-database.html > or http://www.dbazine.com/ofinterest/oi-articles/pascal27 Both of those articles seem to be written by people who struggle with, or have incomplete mental models of, the semantics of NULL values. The second also appears to be designed to sell a book so is of course going to be presenting biased viewpoints. How would outer joins work without some concept of a missing value. Once you allow these missing values as the result of an outer join you would be deliberately introducing limits if you couldn't also save these values back into tables. I would say that defaulting columns to allowing NULLs was a mistake though. I'd be happy without NULLs in databases if there was some other way to handle missing values. Parametric polymorphism and some sort of option[1] or Maybe[2] type is what springs to mind for me. NULL would be represented as NONE or Nothing respectively and non-NULL values as (SOME v) or (Just v). -- Sam http://samason.me.uk/ [1] http://www.standardml.org/Basis/option.html [2] http://www.haskell.org/onlinereport/maybe.html
Re: Are there performance advantages in storing bulky field in separate table?
From
Thomas Kellerer
Date:
Robert Treat wrote on 08.04.2009 23:06: > http://www.databasedesign-resource.com/null-values-in-a-database.html That is a very - hmm - strange article. One of the proofs that nulls are bad is that "SELECT * FROM theTable" (theTable being empty) returns nothing, whereas SELECT COUNT(*) FROM theTable returns a single row with 0 (zero): "This last example is even worse: The SELECT * returns 'No rows selected', but the SELECT COUNT(*) returns ONE row with the value 0!" I stopped reading the article at that point...
Robert Treat wrote: > > You can be sure that discussion of this topic in this forum will soon be > visited by religious zealots, but the short answer is "nulls are bad, mmkay". > A slightly longer answer would be that, as a general rule, attributes of your > relations that only apply to 1% of the rows are better represented as a one To fulfill your prophecy of zealotry, I've got a number of tables with columns that are mostly null that I can't think of that nice a way of refactoring. I'd love ideas to improve the design, though. One example's an address table. Most addresses have a few fields that are typically present (building number, city, state, etc). Others, as described in various government's address standards, are fields that are typically absent. For example in US addressing rules, the "Urbanization Name" line: http://www.usps.com/ncsc/addressstds/addressformats.htm MRS MARIA SUAREZ Name URB LAS GLADIOLAS Urbanization name 150 CALLE A House no. and st. name SAN JUAN PR 00926-3232 City, state, and ZIP+4 Similarly sparse columns in my address tables are, titles, division/department Names and mailstop codes. (described here: http://pe.usps.gov/text/pub28/pub28c3_011.htm) While I realize I could stick in some string (empty string, or some other magic string like "urbanization name doesn't apply to this address") into a table, it sure is convenient to put nulls in those columns. I'm quite curious what you'd suggest a well-designed address table would look like without nulls.
Re: Are there performance advantages in storing bulky field in separate table?
From
"Leif B. Kristensen"
Date:
On Wednesday 8. April 2009, Ron Mayer wrote: >Sam Mason wrote: >> On Wed, Apr 08, 2009 at 04:56:35PM +0100, Ian Mayo wrote: >>> One more thing: hey, did you hear? I just got some advice from >>> Tom Lane! >> >> Statistically speaking; he's the person most likely to answer you by > >Even so, this might be the #1 advantage of Postgres over Oracle (cost >being #2). I'll subscribe to that. Like almost everybody else on this list, I've got some excellent replies from Tom. And of course it's one of the major advantages of PostgreSQL. (The cost being a close second.) -- Leif Biberg Kristensen | Registered Linux User #338009 Me And My Database: http://solumslekt.org/blog/
Re: Are there performance advantages in storing bulky field in separate table?
From
Richard Huxton
Date:
Ivan Sergio Borgonovo wrote: > On Wed, 8 Apr 2009 17:39:02 +0100 > Sam Mason <sam@samason.me.uk> wrote: > >> who num mails of total >> Tom Lane 1,935 8.0% >> Scott Marlowe 1,077 4.5% >> Alvaro Herrera 521 2.2% >> Joshua Drake 468 1.9% >> Richard Huxton 432 1.8% >> Craig Ringer 338 1.4% >> Ivan Sergio Borgonovo 314 1.3% > > I just wrote privately to Tom that I'm ashamed I ask so much and > answer so few. > But well I'm an exception ;) I'm the top of non-contributors. Not so fast there citizen. I'll thank you to note that I've not contributed any code either, and for a significant number of years too :-) -- Richard Huxton Archonet Ltd
Re: Are there performance advantages in storing bulky field in separate table?
From
Grzegorz Jaśkiewicz
Date:
If I may, I got an instance once, where table with bytea field was pretty slow. Turned out, that queries modified everything apart from bytea bit. moving it to separate table actually helped performance. But that only will happen providing that you have the bytea/text/whatever that won't change, once inserted.
Re: Are there performance advantages in storing bulky field in separate table?
From
Robert Treat
Date:
On Wednesday 08 April 2009 18:25:25 Ron Mayer wrote: > Robert Treat wrote: > > You can be sure that discussion of this topic in this forum will soon be > > visited by religious zealots, but the short answer is "nulls are bad, > > mmkay". A slightly longer answer would be that, as a general rule, > > attributes of your relations that only apply to 1% of the rows are better > > represented as a one > > To fulfill your prophecy of zealotry, I've got a number of tables > with columns that are mostly null that I can't think of that nice a > way of refactoring. I'd love ideas to improve the design, though. > > One example's an address table. Most addresses have a few fields > that are typically present (building number, city, state, etc). > Others, as described in various government's address standards, > are fields that are typically absent. For example in US addressing > rules, the "Urbanization Name" line: > http://www.usps.com/ncsc/addressstds/addressformats.htm > MRS MARIA SUAREZ Name > URB LAS GLADIOLAS Urbanization name > 150 CALLE A House no. and st. name > SAN JUAN PR 00926-3232 City, state, and ZIP+4 > Similarly sparse columns in my address tables are, > titles, division/department Names and mailstop codes. > (described here: http://pe.usps.gov/text/pub28/pub28c3_011.htm) > > While I realize I could stick in some string (empty string, or > some other magic string like "urbanization name doesn't apply to > this address") into a table, it sure is convenient to put nulls > in those columns. > > I'm quite curious what you'd suggest a well-designed address table > would look like without nulls. The decision here would depend on your perticular sect of the anti-null religion, but you have a couple of choices: 1) Break these fields out into one or more tables, containing entries only for those address that have the additional information. Ideally you might be able to do something like "extended_address_info" where all of these fields could be kept, all of them being non-null. I suspect you can't do the ideal, so you'd end up with a bunch of tables. This would be used by the "normalization trumps nullification" sect 2) Given that all of these columns have an authoritarian source of what should be allowed, you could use the "magic string" approach without requiring too much magic, and these columns could even be a foriegn key into a table containing the authoritarian options. This could be justified by the all nulls are bad sect, but might also be used by a null using crowd who take a strict approach to nulls meaning "unknown value", since here it isn't that the value is unknown; there isn't a valid value for these columns. (Adding the magic string to your FK table creates a valid reference value for those entries that would otherwise not match) Personally, if you force me into a "well-designed address table *without* nulls" decision, I would take this latter approach. HTH -- Robert Treat Conjecture: http://www.xzilla.net Consulting: http://www.omniti.com