Varchar vs foreign key vs enumerator - table and index size - Mailing list pgsql-performance
From | Łukasz Walkowski |
---|---|
Subject | Varchar vs foreign key vs enumerator - table and index size |
Date | |
Msg-id | 3AAC8AFE-E908-4AF7-ACDE-F9461B4A62D1@homplex.pl Whole thread Raw |
Responses |
Re: Varchar vs foreign key vs enumerator - table and index size
Re: Varchar vs foreign key vs enumerator - table and index size Re: Varchar vs foreign key vs enumerator - table and index size |
List | pgsql-performance |
Hi, This is my first post on this group so welcome everyone! Currently I'm working on optimizing a quite simple database usedto store events from one website. Every event is a set of data describing user behaviour. The main table that storesall events is built using schema: Column | Type | Modifiers -----------------+-----------------------------+----------- id | bigint | not null browser | character varying(255) | created | timestamp without time zone | eventsource | character varying(255) | eventtype | character varying(255) | ipaddress | character varying(255) | objectid | bigint | sessionid | character varying(255) | shopids | integer[] | source | character varying(255) | sourceid | bigint | supplierid | bigint | cookieuuid | uuid | serializeddata | bytea | devicetype | character varying(255) | operatingsystem | character varying(255) | It was a quick project to play with EclipseLink, Hibernate and some Jersey Rest services, so isn't perfect. However thedatabase became quite usefull and we decided to optimize this table as it grew quite large (128GB right now without indexes,about 630M records). There is only primary key index on this table. Here is the list of changes that I'd like tomake to the table (some of them should be done from the scratch): 1. Changing ipaddress from varchar to inet - this should save some space and lower the size of potential index. 2. Changing id for some composite id with created contained in it. 3. And this part is most interesting for me. Columns browser, eventsource, eventtype, devicetype, operatingsystem containa small pool of strings - for example for devicetype this is set to Computer, Mobile, Tablet or Unknown. Browser isset to normalized browser name. In every case I can store those data using one of 3 different methods: - store as varchar as it is now - nice and easy, but index on those columns is quite big and I think storing many of similarstrings is waste of space. - store only id's and join external tables as needed, for example for browsers I only need smallint key, as there is a limitednumber of browsers. The column browser becomes smallint and we have additional table with two columns (id, browservarchar). This should save some space on event table, but if I want name of the browser in some report I need to jointables. Second thing - on every insert there is constraint that is checked for this field and this can affect performance.I was thinking about the same strategy for the remaining fields - this would give me 5 additional tables and5 additional constraints on event table. Browser table will have about ~100 records, eventtype and eventsource will haveabout 8-12 records each, devicetype - 4 records, operatingsystem - didn't really check this one, but I think somethingaround 100 like browser. - introduce enumerator type for each of the column and store those values as enumerator. This one should be the most spaceefficient, but it will be problematic in case of changing column values like browser or operatingsystem as alteringenumerator isn't that simple. For browser average text length is 19 characters, for eventsource and eventtype eventsource average text lenght is 24 characters.Database encoding is set to UTF8. My question is - what is estimated difference in table size between those 3 variants of storing columns? In theory thirdone should give me the smallest database and index size but is the most problematic from all of the above. Lukasz Walkowski
pgsql-performance by date: