Re: is there a way to automate deduplication of strings? - Mailing list pgsql-novice
| From | Tim Anderson |
|---|---|
| Subject | Re: is there a way to automate deduplication of strings? |
| Date | |
| Msg-id | 75e0f4dd-cc89-4305-bb8b-4e19c0060ff4@timando.net Whole thread Raw |
| In response to | RE: is there a way to automate deduplication of strings? (Chris Papademetrious <Christopher.Papademetrious@synopsys.com>) |
| Responses |
RE: is there a way to automate deduplication of strings?
|
| List | pgsql-novice |
One thing you could do if you're OK with the space overhead of uuid vs int, is use a hash of the user agent e.g. `md5(user_agent)::uuid` which would reduce the need to lookup the value when inserting. Then when you get a foreign key violation, add the user agent to the user_agent table. On 1/1/26 10:25, Chris Papademetrious wrote: > > Hi Greg, > > Thanks for the reply! I tried to be vague to avoid getting distracted > by the details, but I think I overdid it! > > Let’s say I have a table of transactions like this: > > CREATE TABLE transaction ( > > id UUID PRIMARY KEY DEFAULT gen_random_uuid(), > > user_agent TEXT NOT NULL, > > -- > > -- ...more columns here... > > -- > > ); > > The table can contain millions of transactions. The *user_agent*field > stores information about the application that performed each > transaction. These user-agent values will be populated from a > relatively small set of unique values. For example, > > MyFictitiousApp/1.0 (Linux; x86_64; Ubuntu 22.04.3 LTS) Desktop > (BuildID 00000000-0000-0000-0000-000000000000) > > MyFictitiousApp/1.0 (Linux; x86_64; Ubuntu 22.04.3 LTS) Mobile > (BuildID 00000000-0000-0000-0000-000000000000) > > MyFictitiousApp/1.0 (Windows 11 25H2) Desktop (BuildID > 00000000-0000-0000-0000-000000000000) > > MyFictitiousApp/1.0 (Windows 11 25H2) Mobile (BuildID > 00000000-0000-0000-0000-000000000000) > > MyFictitiousApp/1.1 (Linux; x86_64; Ubuntu 22.04.3 LTS) Desktop > (BuildID 11111111-1111-1111-1111-111111111111) > > MyFictitiousApp/1.1 (Linux; x86_64; Ubuntu 22.04.3 LTS) Mobile > (BuildID 11111111-1111-1111-1111-111111111111) > > MyFictitiousApp/1.1 (Windows 11 25H2) Desktop (BuildID > 11111111-1111-1111-1111-111111111111) > > MyFictitiousApp/1.1 (Windows 11 25H2) Mobile (BuildID > 11111111-1111-1111-1111-111111111111) > > The values themselves will vary over time (as new versions appear and > old versions age out) so the set cannot be hardcoded, but the column > will always contain large numbers of duplicate values. > > I could store the user-agent values in a separate table and reference > them by a UUID computed from their value: > > CREATE TABLE user_agent ( > > id UUID PRIMARY KEY DEFAULT gen_random_uuid(), > > user_agent TEXT NOT NULL UNIQUE, > > ); > > CREATE TABLE transaction ( > > id UUID PRIMARY KEY DEFAULT gen_random_uuid(), > > user_agent_id UUID NOT NULL, > > CONSTRAINT fk_user_agent FOREIGN KEY (user_agent_id) REFERENCES > user_agent(id) ON DELETE CASCADE > > ); > > but this adds transactional complexity for storage and retrieval, > along with cleanup of no-longer-referenced values over time. > > I’m wishing for a magic “sparsely stored texts” column in Postgres > that performs this deduplication automatically, but I don’t think it > exists. So I’m wondering, is there an extension or some other trick to > get the space savings without the transactional complexity? > > * Chris > > *From:*Greg Sabino Mullane <htamfids@gmail.com> > *Sent:* Wednesday, December 31, 2025 10:12 AM > *To:* Chris Papademetrious <chrispy@synopsys.com> > *Cc:* pgsql-novice@lists.postgresql.org > *Subject:* Re: is there a way to automate deduplication of strings? > > It is not quite clear what you are trying to do. Can you provide a > small test table to show what you want to achieve? > > Cheers, > > Greg > > -- > > Crunchy Data - https://www.crunchydata.com > <https://urldefense.com/v3/__https:/www.crunchydata.com__;!!A4F2R9G_pg!dkWhEcznp8l1toENuJsDgY1GABWqYLWZBHncXtqiCaZguLzkN0U0-3JBHLCm7wE8gN0L4ZIwyXqM5pGNLIs7ORVTFh7KnJI$> > > Enterprise Postgres Software Products & Tech Support >
pgsql-novice by date: