Reducing data type space usage - Mailing list pgsql-hackers

From Gregory Stark
Subject Reducing data type space usage
Date
Msg-id 87mz91ow7m.fsf@enterprisedb.com
Responses Re: Reducing data type space usage  (Martijn van Oosterhout <kleptog@svana.org>)
Re: Reducing data type space usage  (Mark Dilger <pgsql@markdilger.com>)
Re: Reducing data type space usage  (Bruce Momjian <bruce@momjian.us>)
List pgsql-hackers

Following up on the recent discussion on list about wasted space in data
representations, I want to summarise what we found and make some proposals:

As I see it there are two cases:

Case 1) Data types that are variable length but often quite small. This
  includes things like NUMERIC, which in common use will rarely be larger
  than 12-20 bytes, and often things like text.

  In cases like these we really only need 1 or sometimes 2 bytes of varlena
  header overhead, not 4 as we currently do. In fact we *never* need more
  than 2 bytes of varlena header on disk anyway with the standard
  configuration: with the default 8k block size, no datum stored inline
  comes anywhere near the 64k a 2-byte length can describe.
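
To make that concrete, here's a minimal standalone sketch of a variable
sized length header (not PostgreSQL's actual on-disk format, just the
idea): the high bit of the first byte says whether the header is 1 or 2
bytes.

    #include <stdint.h>
    #include <stddef.h>

    /* Write a compact length header into buf; return the header size.
     * Lengths up to 127 fit in one byte (high bit clear); anything
     * larger takes two bytes (high bit set, 15 bits of length, which
     * covers up to 32k). Assumes datalen < 32768. */
    static size_t
    write_compact_header(uint8_t *buf, uint32_t datalen)
    {
        if (datalen <= 0x7F)
        {
            buf[0] = (uint8_t) datalen;
            return 1;
        }
        buf[0] = 0x80 | (uint8_t) (datalen >> 8);
        buf[1] = (uint8_t) (datalen & 0xFF);
        return 2;
    }

    /* Read the header back; return its size and store the length. */
    static size_t
    read_compact_header(const uint8_t *buf, uint32_t *datalen)
    {
        if ((buf[0] & 0x80) == 0)
        {
            *datalen = buf[0];
            return 1;
        }
        *datalen = ((uint32_t) (buf[0] & 0x7F) << 8) | buf[1];
        return 2;
    }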
 

Case 2) Data types whose size depends on the typmod but is always the same
  for a given typmod, so it can be determined statically. In an
  ASCII-encoded database CHAR(n) fits this category, and in any case we'll
  eventually have per-column encoding. NUMERIC(a,b) could also be made to
  fit it.

  In cases like these we don't need *any* varlena header, provided we can
  arrange for the functions to have enough information to know how large
  the data must be.
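
As a sketch of what "statically determined" means here (the helper name is
made up): given the column's typmod, the datum size follows directly, with
no per-datum header at all. PostgreSQL stores the typmod for CHAR(n) as n
plus the regular header allowance.

    #include <stdint.h>

    #define VARHDRSZ 4      /* the regular 4-byte varlena header */

    /* In a single-byte encoding such as ASCII, a CHAR(n) datum's size
     * follows from the column typmod alone (typmod is n + VARHDRSZ). */
    static int32_t
    bpchar_size_from_typmod(int32_t atttypmod)
    {
        return atttypmod - VARHDRSZ;    /* n bytes, one per character */
    }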
 

Solutions proposed:

Case 1) We've discussed the variable sized varlena headers and I think it's
  clear that that's the most realistic way to approach it.

  I don't think any other approaches were even suggested. Tom said he
  wanted a second varlena format for numeric that would have 2-byte
  alignment. But I think we could just say that we always use the 2-byte
  varlena header on data types with 2-byte alignment and the 4-byte header
  on data types with 4-byte alignment needs. Or heap_form_tuple could be
  even cleverer about it, but I'm not sure it's worth it. (A sketch of the
  alignment rule follows below.)

  This limits the wasted space to 1-2% for most variable sized data of 50
  bytes or more. But for very small data, such as the quite common case of
  values that are only 1-4 bytes, it still means 25-100% overhead: one
  header byte on a 4-byte value is 25% extra, and on a 1-byte value it is
  100%.
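
The alignment rule could be as simple as the following sketch (using
pg_type's typalign codes; this isn't a worked-out heap_form_tuple change):

    /* Pick the on-disk header size from the type's alignment
     * requirement: char- and short-aligned types get the 2-byte
     * header, int- and double-aligned types keep the 4-byte one. */
    static int
    header_size_for_align(char typalign)
    {
        switch (typalign)
        {
            case 'c':       /* char alignment */
            case 's':       /* short (2-byte) alignment */
                return 2;
            default:        /* 'i' or 'd' alignment */
                return 4;
        }
    }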
 

Case 2) Solving this is quite difficult without introducing major
  performance problems or security holes. The one approach we have that's
  practical right now is introducing special data types such as the
  oft-mentioned "char" data type. "char" doesn't have quite the right
  semantics to use as a transparent substitute for CHAR, but we could
  define a CHAR(1) with exactly the right semantics and substitute it
  transparently in parser/analyze.c (btw, having two files named analyze.c
  is pretty annoying). We could do the same with NUMERIC(a,b) for
  sufficiently small values of a and b, with something like D'Arcy's CASH
  data type (which uses an integer internally).

  The problem with defining lots of data types is that the number of casts
  and cross-data-type comparisons grows quadratically as the number of
  data types grows. In theory we would save space by defining a CHAR(n)
  for whatever size n the user needs, but I can't really see anything
  other than CHAR(1) being worthwhile. Similarly a 4-byte NUMERIC
  substitute like CASH (though with full NUMERIC semantics), and maybe
  2-byte and 8-byte substitutes, might be reasonable, but anything else
  would be pointless.
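
For flavour, here's a standalone sketch of the CASH-style representation
(the names are made up, and full NUMERIC semantics would need careful
overflow and rounding handling): the value is just a scaled integer, so
storage is a bare 4 bytes with no header at all.

    #include <stdint.h>

    /* A NUMERIC(a,2)-style value stored as a count of hundredths in a
     * plain int32; roughly NUMERIC(7,2) and smaller fits, and an int64
     * variant would cover the 8-byte case mentioned above. */
    typedef int32_t cash32;

    static cash32
    cash32_from_parts(int32_t whole, int32_t hundredths)
    {
        return whole * 100 + hundredths;    /* 19.99 -> 1999 */
    }

    /* Arithmetic and comparisons are plain integer ops; only the I/O
     * functions need to know where the decimal point goes. */
    static cash32
    cash32_add(cash32 a, cash32 b)
    {
        return a + b;
    }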
 

I see these two solutions as complementary. The variable varlena headers
take care of the larger data and the special-purpose data types take care
of the extremely small data. And it's important to cover both cases: data
that fits in 1-4 bytes is quite common. You often see databases with dozens
of CHAR(1) flag columns or NUMERIC(10,2) currency columns.

With a CHAR(1) and a CASH-style NUMERIC substitute we won't lose 25-100% to
header overhead on the things that would fit in 1-4 bytes. And with the
variable sized varlena headers we'll limit the space wasted on larger data
to 25% at worst, and usually 1-2%.

Doing better would require a complete solution for data types that can
understand how large they are based on their typmod. That would imply more
dramatic solutions, like the ones I mused about involving passing around
structures that contain the Datum as well as the attlen or atttypmod. The
more I think about these ideas the more I think they may have merit, but
they would be awfully invasive and require more thought.
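
Roughly, that musing amounts to something like the following (entirely
hypothetical, just to show the shape of it):

    #include <stdint.h>

    /* Pass a small struct instead of a bare Datum, so functions can
     * tell how big a value is without any per-datum header. */
    typedef struct SizedDatum
    {
        uintptr_t   value;      /* the Datum itself */
        int16_t     attlen;     /* fixed length, or -1 for varlena */
        int32_t     atttypmod;  /* column typmod, e.g. n for CHAR(n) */
    } SizedDatum;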

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

