Re: The "many nulls" problem - Mailing list pgsql-performance
From | Oleg Bartunov |
---|---|
Subject | Re: The "many nulls" problem |
Date | |
Msg-id | Pine.LNX.4.64.0803142158140.27666@sn.sai.msu.ru Whole thread Raw |
In response to | The "many nulls" problem ("Kynn Jones" <kynnjo@gmail.com>) |
Responses |
Re: The "many nulls" problem
|
List | pgsql-performance |
Kynn, have you seen contrib/hstore ? You can have one table with common attributes and hide others in hstore Oleg On Fri, 14 Mar 2008, Kynn Jones wrote: > It often happens that a particular pieces of information is non-null for a > small minority of cases. A superficially different manifestation of this is > when two pieces of information are identical in all but a small minority of > cases. This can be easily mapped to the previous description by defining a > null in one column to mean that its contents should be obtained from those > of another column. A further variant of this is when one piece of > information is a simple function of another one in all but a small minority > of cases. > > (BTW, I vaguely recall that RDb theorists have a technical term for this > particular design issue, but I don't remember it.) > > In all these cases, the design choice, at least according to RDb's 101, is > between including a column in the table that will be NULL most of the time, > or defining a second auxiliary column that references the first one and > holds the non-redundant information for the minority of cases for which this > is necessary (and maybe define a VIEW that includes all the columns). > > But for me it is a frequent occurrence that my quaint and simple RDb's 101 > reasoning doesn't really apply for PostgreSQL. Basically, Pg is too smart > for it! For example, does a large proportion of NULLs really imply a lot of > wasted space? Maybe this is true for fixed-length data types, but what > about for type TEXT or VARCHAR? > > Just to be concrete, consider the case of a customers database for some home > shopping website. Suppose that, as it happens, for the majority of this > site's customers, the shipping and billing addresses are identical. Or > consider the scenario of a company in which, for most employees, the email > address can be readily computed from the first and last name using the rule > First M. Last => first_last@acme.com, but the company allows some > flexibility for special cases (e.g. for people like Yasuhiro Tanaka who's > known to everyone by his nickname, Yaz, the email is > yaz_tanaka@acme.combecause hardly anyone remembers or even knows his > full name.) > > What's your schema design approach for such situations? How would you go > about deciding whether the number of exceptional cases is small enough to > warrant a second table? Of course, one could do a systematic profiling of > various possible scenarios, but as a first approximation what's your > rule-of-thumb? > > TIA! > > Kynn > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
pgsql-performance by date: