Thread: A space-efficient, user-friendly way to store categorical data
Hi,
I'm hoping to get feedback on an idea for a new data type to allow for efficient storage of text values while keeping reads and writes user-friendly. Suppose you want to store categorical data like current city for users. There will be a long list of cities, and many users will have the same city. Some options are:
- Use a text column
- Use an enum column - saves space, but labels must be set ahead of time
- Create another table for cities (normalize) - saves space, but complicates reads and writes
A better option could be a new "dynamic enum" type, which would have similar storage requirements as an enum, but instead of labels being declared ahead of time, they would be added as data is inserted.
It'd be great to hear what others think of this (or if I'm missing something). Another direction could be to deduplicate values for TOAST-able data types.
Thanks,
Andrew
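To make the trade-off concrete, here is a toy Python sketch (not PostgreSQL internals; the 4-byte code width and the sample data are assumptions) comparing raw text storage against the dictionary-encoded storage a "dynamic enum" would provide:

```python
# Toy sketch: compare raw text storage vs dictionary-encoded storage
# for a highly repetitive categorical column.
cities = ["Chicago", "New York", "Chicago", "Springfield", "Chicago"] * 200_000

# Option 1: text column -- every row stores the full string.
text_bytes = sum(len(c) for c in cities)

# Dynamic-enum idea: labels added on first insert, rows store a fixed-width code.
labels = {}          # label -> code, grows as data arrives
codes = []
for c in cities:
    code = labels.setdefault(c, len(labels))
    codes.append(code)

CODE_WIDTH = 4       # assume a 4-byte per-row code, like an enum's OID
enum_bytes = len(codes) * CODE_WIDTH + sum(len(l) for l in labels)

print(len(labels), text_bytes, enum_bytes)
```

Even with the label table included, the encoded form comes out roughly half the size in this toy case; longer average strings widen the gap.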
Andrew Kane <andrew@chartkick.com> writes:
> A better option could be a new "dynamic enum" type, which would have
> similar storage requirements as an enum, but instead of labels being
> declared ahead of time, they would be added as data is inserted.

You realize, of course, that it's possible to add labels to an enum type
today. (Removing them is another story.)

You haven't explained exactly what you have in mind that is going to be
able to duplicate the advantages of the current enum implementation
without its disadvantages, so it's hard to evaluate this proposal.

			regards, tom lane
On Mon, Feb 12, 2018 at 9:10 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andrew Kane <andrew@chartkick.com> writes:
>> A better option could be a new "dynamic enum" type, which would have
>> similar storage requirements as an enum, but instead of labels being
>> declared ahead of time, they would be added as data is inserted.
>
> You realize, of course, that it's possible to add labels to an enum type
> today. (Removing them is another story.)
>
> You haven't explained exactly what you have in mind that is going to be
> able to duplicate the advantages of the current enum implementation
> without its disadvantages, so it's hard to evaluate this proposal.

This sounds rather like the idea I have been tossing around in my head
for a while, and in sporadic discussions with a few people, for a
dictionary object. The idea is to have an append-only list of labels
which would not obey transactional semantics, and would thus help us
avoid the pitfalls of enums - there wouldn't be any rollback of an
addition. The use case would be a jsonb representation which would
replace object keys with the oid of the corresponding dictionary entry,
rather like enums now. We could have a per-table dictionary which in
most typical json use cases would be very small, and we know from some
experimental data that the space savings from such a change would often
be substantial.

This would have to be modifiable dynamically rather than requiring
explicit additions to the dictionary, to be of practical use for the
jsonb case, I believe.

I hadn't thought about this as a sort of super enum that was usable
directly by users, but it makes sense.

I have no idea how hard or even possible it would be to implement.

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
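A minimal Python sketch of the append-only dictionary idea described above (illustrative only: the class and function names are invented, and this says nothing about an actual on-disk format):

```python
# Sketch of an append-only label dictionary: labels get a stable integer
# id on first use and are never removed, so an id handed out once can
# always be resolved -- even if the transaction that first used it
# rolled back.
class AppendOnlyDict:
    def __init__(self):
        self._by_label = {}
        self._labels = []          # append-only: list position == id

    def intern(self, label):
        """Return the id for label, assigning a new one on first use."""
        if label not in self._by_label:
            self._by_label[label] = len(self._labels)
            self._labels.append(label)
        return self._by_label[label]

    def resolve(self, ident):
        return self._labels[ident]

def compress_keys(obj, d):
    """Replace jsonb-style object keys with dictionary ids."""
    return {d.intern(k): v for k, v in obj.items()}

d = AppendOnlyDict()
docs = [{"city": "Chicago", "zip": "60601"}, {"city": "Boston", "zip": "02108"}]
packed = [compress_keys(doc, d) for doc in docs]
print(packed)   # both documents now share the same small key ids
```

The point of the per-table dictionary is visible even in this toy: every document with the same key set pays for the key strings once, in the dictionary, rather than once per document.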
On Mon, Feb 12, 2018 at 12:24 PM, Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
> This sounds rather like the idea I have been tossing around in my head
> for a while, and in sporadic discussions with a few people, for a
> dictionary object. The idea is to have an append-only list of labels
> which would not obey transactional semantics, and would thus help us
> avoid the pitfalls of enums - there wouldn't be any rollback of an
> addition.
> [...]

I have had thoughts over the years about something similar, but going
the other way and hiding it from the end user. If you could declare a
column to have a special compressed property (independently of the
type) then it could either automatically maintain a dictionary, or at
least build a new dictionary for you when you next run some kind of
COMPRESS operation. There would be no user-visible difference except
footprint. In ancient DB2 they had a column property along those lines
called "VALUE COMPRESSION" (they also have a row-level version, and now
they have much more advanced kinds of adaptive compression that I
haven't kept up with). In some ways it'd be a bit like TOAST with
shared entries, but I haven't seriously looked into how such a thing
might be implemented.

--
Thomas Munro
http://www.enterprisedb.com
On 02/11/2018 10:06 PM, Thomas Munro wrote:
> I have had thoughts over the years about something similar, but going
> the other way and hiding it from the end user. If you could declare a
> column to have a special compressed property (independently of the
> type) then it could either automatically maintain a dictionary, or at
> least build a new dictionary for you when you next run some kind of
> COMPRESS operation. There would be no user-visible difference except
> footprint.
> [...]

For what it is worth, there is a similar concept in R called "factors".
When categorical data is stored in a data.frame (the R-language
equivalent of a relation) it is transparently and automatically
converted. I believe this is both for storage compression and to
facilitate some of the analytics. In R you can also explicitly specify
to *not* convert strings to factors as a performance optimization,
because that conversion does have a noticeable impact during ingestion
and is not always needed.

I can also envision usefulness of this type of mechanism in other
security-related scenarios.

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
> On Feb 10, 2018, at 7:46 PM, Andrew Kane <andrew@chartkick.com> wrote:
>
> I'm hoping to get feedback on an idea for a new data type to allow for
> efficient storage of text values while keeping reads and writes
> user-friendly. Suppose you want to store categorical data like current
> city for users. There will be a long list of cities, and many users
> will have the same city. Some options are:
>
> - Use a text column
> - Use an enum column - saves space, but labels must be set ahead of time
> - Create another table for cities (normalize) - saves space, but
>   complicates reads and writes
>
> A better option could be a new "dynamic enum" type, which would have
> similar storage requirements as an enum, but instead of labels being
> declared ahead of time, they would be added as data is inserted.
>
> It'd be great to hear what others think of this (or if I'm missing
> something). Another direction could be to deduplicate values for
> TOAST-able data types.

I have already done this and use it in production. My implementation is
specific to my needs, and would have to be made more general purpose if
it were ever to be accepted by this community, but I'll lay out the
basic design for your consideration.

I maintain multiple mappings of text to/from Oid.

Some of the names, typically the most common ones that I am likely to
encounter, are known at compile time. I have a quick hardcoded lookup
that maps these names to their compile-time-assigned Oid. Conversions
from the name to Oid or vice versa happen without entailing any shared
lock overhead.

For each mapping, I maintain a separate catalog table. The compile-time
known values are inserted into the catalog with DATA lines, per the
usual mechanism for populating catalog tables during bootstrap.

For now, keeping the functions that map Oids to and from compile-time
known strings synchronized with the DATA lines is a manual PITA process,
except that I almost never need to extend the mappings, and as such have
not yet bothered to write a perl script for managing it.

For each mapping, I also have a cache. See syscache.h. I have
functions defined that convert between names and Oids that check the
hardcoded values, then the cache, then the full catalog table if not
found in the cache. It all works transactionally and plays well with
other processes that are adding or looking up values simultaneously.
There are more than just 'get' and 'set' functions defined, because
sometimes when you 'get' a value, you want it to be assigned an entry if
it does not exist yet, and sometimes you don't.

For a few of the mappings, where I know the list will never get too
large, I map from text to one or two bytes rather than to a full
four-byte Oid. This helps, because I have other tables with multiple
columns that are mappings of this sort, such that I can pack two or
three mapping columns into just 4 bytes. Whether you'd get a similar
payoff depends a lot on your table structure. Since I'm using syscache,
I have to store a full four bytes there, but the tables that reference
it don't have to spend a full four bytes. To make the conversion
easier, I have types smalloid (2 byte) and tinyoid (1 byte) defined,
and an Oid range above 10000 that is reserved for their use, with a
mapping to/from the lower range [0..255] or [0..65535] automatically
performed when casting to/from Oid.

The main reason I prefer this over the built-in enum type is that for
the data distribution I am handling, greater than 90% of the lookups
succeed without looking beyond the compile-time known values. (For some
mappings, it's really more like 100%, with run-time added values being a
theoretical possibility that I don't want to exclude, but as yet has
never happened in practice.)

The primary drawback of my implementation from the point of view of
making it general purpose is that you need to know something about your
data distribution before you compile postgres. AFAIK, most postgres
users don't compile their own copy, but use an RPM or some such, and
that puts my type of solution out of reach for them.

The second most obvious problem with my implementation is that it lacks
any means for removing values from the mappings. Once added, they are
never removed. For me, this is a good thing, because it means I don't
need to create indexes referencing a master table of mappings to
guarantee data integrity, and for my workload, none of the mappings are
the kind of thing where mapping data would ever get expired.

I am in a really small niche situation that would likely not be
applicable for most other users. Extending this to help the community
as a whole in the form of something general purpose seems difficult,
which is why I never tried to contribute it.

Could you perhaps tell me how similar your situation is to mine, and
whether my approach is anything like what you were contemplating? I am
curious to hear an answer to Tom's question, upthread, about why you
would not just use the enumeration type that postgres already supplies.

Mark Dilger
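A simplified Python sketch of the layered lookup described in this message (all names and the 10000 id floor are illustrative, loosely following the description; the real implementation is C code working against the syscache):

```python
# Sketch of a tiered text-to-id lookup: a compile-time table of
# well-known labels is consulted first (lock-free in the real design),
# then a per-backend cache, then the authoritative catalog, which
# assigns a new id on a miss.
COMPILE_TIME = {"Chicago": 1, "New York": 2}   # hardcoded tier

class Mapping:
    def __init__(self):
        self.cache = {}                         # syscache-like tier
        self.catalog = dict(COMPILE_TIME)       # authoritative tier
        self.next_id = 10001                    # runtime ids above a floor

    def get_or_assign(self, name):
        if name in COMPILE_TIME:
            return COMPILE_TIME[name]           # fast path, >90% of lookups
        if name in self.cache:
            return self.cache[name]
        if name not in self.catalog:            # miss: assign a new id
            self.catalog[name] = self.next_id
            self.next_id += 1
        self.cache[name] = self.catalog[name]
        return self.cache[name]

m = Mapping()
print(m.get_or_assign("Chicago"))      # hits the compile-time tier
print(m.get_or_assign("Springfield"))  # assigned at runtime
```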
Thanks everyone for the feedback. The current enum implementation requires you to create a new type and add labels outside a transaction prior to an insert.
-- on table creation
CREATE TYPE city AS ENUM ();
CREATE TABLE "users" ("city" city);
-- on insert
ALTER TYPE city ADD VALUE IF NOT EXISTS 'Chicago';
BEGIN;
INSERT INTO "users" ("city") VALUES ('Chicago');
COMMIT;
What would be ideal:
-- on table creation
CREATE TABLE "users" ("city" dynamic_enum);
-- on insert
BEGIN;
INSERT INTO "users" ("city") VALUES ('Chicago');
COMMIT;
Since enums have a fixed number of labels, this type of feature may be better off as a property you could add to text columns (as Thomas mentions). This would avoid issues with hitting the max number of labels.
- Andrew
Andrew Kane <andrew@chartkick.com> writes:
> Thanks everyone for the feedback. The current enum implementation requires
> you to create a new type and add labels outside a transaction prior to an
> insert.

Right ...

> Since enums have a fixed number of labels, this type of feature may be
> better off as a property you could add to text columns (as Thomas
> mentions). This would avoid issues with hitting the max number of labels.

... but you're not saying how you'd avoid the need for prior commit of the
labels. The sticking point for enums is that once a value has gotten into
a btree index, we can't ever lose the ability to compare that value to
others, or the index will be broken. So inserting an uncommitted value
into user tables has to be prevented.

Maybe there's a way to assign the labels so that they can be compared
without reference to any outside data, but it's not clear to me how
that would work.

			regards, tom lane
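Tom's point about comparability can be illustrated in a few lines of Python: codes handed out in insertion order give a stable ordering, but not the textual one, so comparing two codes without consulting the dictionary cannot answer text-order questions:

```python
# Codes assigned in insertion order (like enum OIDs) do not sort the
# way the labels do, so a btree built on the codes alone only yields a
# stable ordering, not a textual one.
labels = {}
def code_for(label):
    return labels.setdefault(label, len(labels))

arrival = ["Zurich", "Athens", "Boston"]
codes = [code_for(c) for c in arrival]

by_code = sorted(arrival, key=code_for)     # insertion order wins
by_text = sorted(arrival)                   # textual order
print(by_code == by_text)                   # False: the orders disagree
```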
> On Feb 12, 2018, at 5:08 PM, Andrew Kane <andrew@chartkick.com> wrote:
>
> Thanks everyone for the feedback. The current enum implementation
> requires you to create a new type and add labels outside a transaction
> prior to an insert.
> [...]
> Since enums have a fixed number of labels, this type of feature may be
> better off as a property you could add to text columns (as Thomas
> mentions). This would avoid issues with hitting the max number of
> labels.

In your proposed feature, what happens if I create two tables:

CREATE TABLE myusers (city dynamic_enum);
CREATE TABLE yourusers (city dynamic_enum);

Do you imagine that myusers and yourusers are referring to the same enum
or to two different enums? Are the enums stored in a new table within
pg_catalog, or are they stored in something akin to a toast table? If
you insert billions of rows into a table, but only have 30 distinct
values, can you quickly query for all 30 distinct enum values, or would
you have to walk billions of rows to find them all?

mark
> On Feb 12, 2018, at 6:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> ... but you're not saying how you'd avoid the need for prior commit of the
> labels. The sticking point for enums is that once a value has gotten into
> a btree index, we can't ever lose the ability to compare that value to
> others, or the index will be broken. So inserting an uncommitted value
> into user tables has to be prevented.
>
> Maybe there's a way to assign the labels so that they can be compared
> without reference to any outside data, but it's not clear to me how
> that would work.

When I implemented this, I wrote the comparators to work on the Oid for
the value, not the string representation. That works fine. If you want
to sort the data on the stringified version, cast to text first. That
works well enough for me, since I'm typically not interested in what
sort order is used, as long as it is deterministic and works for
indexing, group by, and so forth.
They'd refer to separate enums.
I originally thought an enum was a good comparison for this feature, but I'm no longer sure that it is. A text-based ordering would be desired rather than the label index.
A better comparison may be a two-column lookup table:
-- create
CREATE TABLE cities (id bigserial primary key, name text);
CREATE UNIQUE INDEX ON cities (name);
CREATE TABLE users (city_id bigint);
-- insert
BEGIN;
INSERT INTO cities (name) VALUES ('Chicago') ON CONFLICT (name) DO NOTHING RETURNING id;
INSERT INTO users (city_id) VALUES (<city id returned from earlier>);
COMMIT;
-- select
SELECT * FROM users INNER JOIN cities ON cities.id = users.city_id WHERE cities.name = 'Chicago';
Ideally, the lookup table could be maintained by Postgres to make reads and writes easier.
-- create
CREATE TABLE users (city text DEDUPED);
-- insert
INSERT INTO users (city) VALUES ('Chicago');
-- query
SELECT * FROM users WHERE city = 'Chicago';
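The two-column lookup-table pattern above can be demonstrated end to end with SQLite from the Python standard library (SQLite's INSERT OR IGNORE standing in for PostgreSQL's ON CONFLICT DO NOTHING):

```python
# The cities/users lookup-table pattern, in SQLite so it runs anywhere.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE users (city_id INTEGER REFERENCES cities(id));
""")

def insert_user(city):
    # Insert-or-ignore the label, then look its id up: two statements
    # inside one transaction.
    with con:
        con.execute("INSERT OR IGNORE INTO cities (name) VALUES (?)", (city,))
        (city_id,) = con.execute(
            "SELECT id FROM cities WHERE name = ?", (city,)).fetchone()
        con.execute("INSERT INTO users (city_id) VALUES (?)", (city_id,))

for city in ["Chicago", "Chicago", "Boston"]:
    insert_user(city)

rows = con.execute("""
    SELECT count(*) FROM users
    JOIN cities ON cities.id = users.city_id
    WHERE name = 'Chicago'
""").fetchone()
print(rows[0])   # 2
```

One caveat with the PostgreSQL version sketched above: ON CONFLICT (name) DO NOTHING ... RETURNING id returns no row when the conflict fires, so a follow-up SELECT (as done here) is still needed to fetch the existing id.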
I'm not really sure the best place to store this lookup table.
- Andrew
Hi,

On 2018-02-12 09:54:29 +1030, Andrew Dunstan wrote:
> The idea is to have an append-only list of labels
> which would not obey transactional semantics, and would thus help us
> avoid the pitfalls of enums - there wouldn't be any rollback of an
> addition.

FWIW, I think we can resolve the issues of enum transactionality. While
not trivial, I think the solution is to allow additions of enum values
to be made non-transactional (by basically inserting them with immediate
visibility) *while* in a transaction, and to change deletion of values
to set a 'deleted' column to true.

With some additional effort we could even make additions transactional
by storing the xid that added the enum label along with the row.
Visibility for index comparisons would regard labels as visible
regardless of that xid, but when creating new values we'd make a
visibility check. The big issue with that is that autovacuum has to
know about it...

Greetings,

Andres Freund