Varchar vs foreign key vs enumerator - table and index size - Mailing list pgsql-performance

From Łukasz Walkowski
Subject Varchar vs foreign key vs enumerator - table and index size
Date
Msg-id 3AAC8AFE-E908-4AF7-ACDE-F9461B4A62D1@homplex.pl
Whole thread Raw
Responses Re: Varchar vs foreign key vs enumerator - table and index size
Re: Varchar vs foreign key vs enumerator - table and index size
Re: Varchar vs foreign key vs enumerator - table and index size
List pgsql-performance
Hi,
This is my first post on this group so welcome everyone! Currently I'm working on optimizing a quite simple database
usedto store events from one website. Every event is a set of data describing user behaviour. The main table that
storesall events is built using schema:  

     Column      |            Type             | Modifiers
-----------------+-----------------------------+-----------
 id              | bigint                      | not null
 browser         | character varying(255)      |
 created         | timestamp without time zone |
 eventsource     | character varying(255)      |
 eventtype       | character varying(255)      |
 ipaddress       | character varying(255)      |
 objectid        | bigint                      |
 sessionid       | character varying(255)      |
 shopids         | integer[]                   |
 source          | character varying(255)      |
 sourceid        | bigint                      |
 supplierid      | bigint                      |
 cookieuuid      | uuid                        |
 serializeddata  | bytea                       |
 devicetype      | character varying(255)      |
 operatingsystem | character varying(255)      |

 It was a quick project to play with EclipseLink, Hibernate and some Jersey Rest services, so isn't perfect. However
thedatabase became quite usefull and we decided to optimize this table as it grew quite large (128GB right now without
indexes,about 630M records). There is only primary key index on this table. Here is the list of changes that I'd like
tomake to the table (some of them should be done from the scratch):  

1. Changing ipaddress from varchar to inet - this should save some space and lower the size of potential index.

2. Changing id for some composite id with created contained in it.

3. And this part is most interesting for me. Columns browser, eventsource, eventtype, devicetype, operatingsystem
containa small pool of strings - for example for devicetype this is set to Computer, Mobile, Tablet or Unknown. Browser
isset to normalized browser name. In every case I can store those data using one of 3 different methods:  

- store as varchar as it is now - nice and easy, but index on those columns is quite big and I think storing many of
similarstrings is waste of space.  

- store only id's and join external tables as needed, for example for browsers I only need smallint key, as there is a
limitednumber of browsers. The column browser becomes smallint and we have additional table with two columns (id,
browservarchar). This should save some space on event table, but if I want name of the browser in some report I need to
jointables. Second thing - on every insert there is constraint that is checked for this field and this can affect
performance.I was thinking about the same strategy for the remaining fields - this would give me 5 additional tables
and5 additional constraints on event table. Browser table will have about ~100 records, eventtype and eventsource will
haveabout 8-12 records each, devicetype - 4 records, operatingsystem - didn't really check this one, but I think
somethingaround 100 like browser.  

- introduce enumerator type for each of the column and store those values as enumerator. This one should be the most
spaceefficient, but it will be problematic in case of changing column values like browser or operatingsystem as
alteringenumerator isn't that simple.  

For browser average text length is 19 characters, for eventsource and eventtype eventsource average text lenght is 24
characters.Database encoding is set to UTF8.  

My question is - what is estimated difference in table size between those 3 variants of storing columns? In theory
thirdone should give me the smallest database and index size but is the most problematic from all of the above.  

Lukasz Walkowski

pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: Query plan change with multiple elements in IN clause
Next
From: Kevin Grittner
Date:
Subject: Re: How clustering for scale out works in PostgreSQL