Reducing data type space usage - Mailing list pgsql-hackers

From Gregory Stark
Subject Reducing data type space usage
Date
Msg-id 87mz91ow7m.fsf@enterprisedb.com
Responses Re: Reducing data type space usage  (Martijn van Oosterhout <kleptog@svana.org>)
Re: Reducing data type space usage  (Mark Dilger <pgsql@markdilger.com>)
Re: Reducing data type space usage  (Bruce Momjian <bruce@momjian.us>)
List pgsql-hackers

Following up on the recent discussion on list about wasted space in data
representations, I want to summarise what we found and make some proposals:

As I see it there are two cases:

Case 1) Data types that are variable length but often quite small. This
  includes things like NUMERIC, which in common use will rarely be larger
  than 12-20 bytes, and often things like text.

  In cases like these we really only need 1 or sometimes 2 bytes of varlena
  header overhead, not 4 as we currently do. In fact we *never* need more
  than 2 bytes of varlena header on disk anyway with the standard
  configuration: with the default 8k block size, no datum stored inline
  comes anywhere near the 64k a 2-byte length can describe.
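
To make that concrete, here's a minimal standalone sketch of a variable
sized length header (not PostgreSQL's actual on-disk format, just the
idea): the high bit of the first byte says whether the header is 1 or 2
bytes.

    #include <stdint.h>
    #include <stddef.h>

    /* Write a compact length header into buf; return the header size.
     * Lengths up to 127 fit in one byte (high bit clear); anything
     * larger takes two bytes (high bit set, 15 bits of length, which
     * covers up to 32k). Assumes datalen < 32768. */
    static size_t
    write_compact_header(uint8_t *buf, uint32_t datalen)
    {
        if (datalen <= 0x7F)
        {
            buf[0] = (uint8_t) datalen;
            return 1;
        }
        buf[0] = 0x80 | (uint8_t) (datalen >> 8);
        buf[1] = (uint8_t) (datalen & 0xFF);
        return 2;
    }

    /* Read the header back; return its size and store the length. */
    static size_t
    read_compact_header(const uint8_t *buf, uint32_t *datalen)
    {
        if ((buf[0] & 0x80) == 0)
        {
            *datalen = buf[0];
            return 1;
        }
        *datalen = ((uint32_t) (buf[0] & 0x7F) << 8) | buf[1];
        return 2;
    }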
 

Case 2) Data types whose size depends on the typmod but is always the same
  for a given typmod, so it can be determined statically. In an
  ASCII-encoded database CHAR(n) fits this category, and in any case we'll
  eventually have per-column encoding. NUMERIC(a,b) could also be made to
  fit it.

  In cases like these we don't need *any* varlena header, provided we can
  arrange for the functions to have enough information to know how large
  the data must be.
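
As a sketch of what "statically determined" means here (the helper name is
made up): given the column's typmod, the datum size follows directly, with
no per-datum header at all. PostgreSQL stores the typmod for CHAR(n) as n
plus the regular header allowance.

    #include <stdint.h>

    #define VARHDRSZ 4      /* the regular 4-byte varlena header */

    /* In a single-byte encoding such as ASCII, a CHAR(n) datum's size
     * follows from the column typmod alone (typmod is n + VARHDRSZ). */
    static int32_t
    bpchar_size_from_typmod(int32_t atttypmod)
    {
        return atttypmod - VARHDRSZ;    /* n bytes, one per character */
    }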
 

Solutions proposed:

Case 1) We've discussed the variable sized varlena headers and I think it's
  clear that that's the most realistic way to approach it.

  I don't think any other approaches were even suggested. Tom said he
  wanted a second varlena format for numeric that would have 2-byte
  alignment. But I think we could just say that we always use the 2-byte
  varlena header on data types with 2-byte alignment and the 4-byte header
  on data types with 4-byte alignment needs. Or heap_form_tuple could be
  even cleverer about it, but I'm not sure it's worth it. (A sketch of the
  alignment rule follows below.)

  This limits the wasted space to 1-2% for most variable sized data of 50
  bytes or more. But for very small data, such as the quite common case of
  values that are only 1-4 bytes, it still means 25-100% overhead: one
  header byte on a 4-byte value is 25% extra, and on a 1-byte value it is
  100%.
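
The alignment rule could be as simple as the following sketch (using
pg_type's typalign codes; this isn't a worked-out heap_form_tuple change):

    /* Pick the on-disk header size from the type's alignment
     * requirement: char- and short-aligned types get the 2-byte
     * header, int- and double-aligned types keep the 4-byte one. */
    static int
    header_size_for_align(char typalign)
    {
        switch (typalign)
        {
            case 'c':       /* char alignment */
            case 's':       /* short (2-byte) alignment */
                return 2;
            default:        /* 'i' or 'd' alignment */
                return 4;
        }
    }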
 

Case 2) Solving this is quite difficult without introducing major
  performance problems or security holes. The one approach we have that's
  practical right now is introducing special data types such as the
  oft-mentioned "char" data type. "char" doesn't have quite the right
  semantics to use as a transparent substitute for CHAR, but we could
  define a CHAR(1) with exactly the right semantics and substitute it
  transparently in parser/analyze.c (btw, having two files named analyze.c
  is pretty annoying). We could do the same with NUMERIC(a,b) for
  sufficiently small values of a and b, with something like D'Arcy's CASH
  data type (which uses an integer internally).

  The problem with defining lots of data types is that the number of casts
  and cross-data-type comparisons grows quadratically as the number of
  data types grows. In theory we would save space by defining a CHAR(n)
  for whatever size n the user needs, but I can't really see anything
  other than CHAR(1) being worthwhile. Similarly a 4-byte NUMERIC
  substitute like CASH (though with full NUMERIC semantics), and maybe
  2-byte and 8-byte substitutes, might be reasonable, but anything else
  would be pointless.
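
For flavour, here's a standalone sketch of the CASH-style representation
(the names are made up, and full NUMERIC semantics would need careful
overflow and rounding handling): the value is just a scaled integer, so
storage is a bare 4 bytes with no header at all.

    #include <stdint.h>

    /* A NUMERIC(a,2)-style value stored as a count of hundredths in a
     * plain int32; roughly NUMERIC(7,2) and smaller fits, and an int64
     * variant would cover the 8-byte case mentioned above. */
    typedef int32_t cash32;

    static cash32
    cash32_from_parts(int32_t whole, int32_t hundredths)
    {
        return whole * 100 + hundredths;    /* 19.99 -> 1999 */
    }

    /* Arithmetic and comparisons are plain integer ops; only the I/O
     * functions need to know where the decimal point goes. */
    static cash32
    cash32_add(cash32 a, cash32 b)
    {
        return a + b;
    }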
 

I see these two solutions as complementary. The variable varlena headers
take care of the larger data and the special-purpose data types take care
of the extremely small data. And it's important to cover both cases: data
that fits in 1-4 bytes is quite common. You often see databases with dozens
of CHAR(1) flag columns or NUMERIC(10,2) currency columns.

With a CHAR(1) and a CASH-style NUMERIC substitute we won't lose 25-100% to
header overhead on the things that would fit in 1-4 bytes. And with the
variable sized varlena headers we'll limit the space wasted on larger data
to 25% at worst, and usually 1-2%.

Doing better would require a complete solution for data types that can
understand how large they are based on their typmod. That would imply more
dramatic solutions, like the ones I mused about involving passing around
structures that contain the Datum as well as the attlen or atttypmod. The
more I think about these ideas the more I think they may have merit, but
they would be awfully invasive and require more thought.
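
Roughly, that musing amounts to something like the following (entirely
hypothetical, just to show the shape of it):

    #include <stdint.h>

    /* Pass a small struct instead of a bare Datum, so functions can
     * tell how big a value is without any per-datum header. */
    typedef struct SizedDatum
    {
        uintptr_t   value;      /* the Datum itself */
        int16_t     attlen;     /* fixed length, or -1 for varlena */
        int32_t     atttypmod;  /* column typmod, e.g. n for CHAR(n) */
    } SizedDatum;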

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

