Re: Updating large tables without dead tuples - Mailing list pgsql-performance

From Vik Fearing
Subject Re: Updating large tables without dead tuples
Date
Msg-id b9135b32-81ba-a589-8909-ac84adc20284@2ndquadrant.com
Whole thread Raw
In response to Updating large tables without dead tuples  ("ldh@laurent-hasson.com" <ldh@laurent-hasson.com>)
Responses RE: Updating large tables without dead tuples  ("ldh@laurent-hasson.com" <ldh@laurent-hasson.com>)
List pgsql-performance
On 02/24/2018 12:27 AM, ldh@laurent-hasson.com wrote:
> Hello
> 
>  
> 
> I work with a large and wide table (about 300 million rows, about 50
> columns), and from time to time, we get business requirements to make
> some modifications. But sometimes, it’s just some plain mistake. This
> has happened to us a few weeks ago where someone made a mistake and we
> had to update a single column of a large and wide table. Literally, the
> source data screwed up a zip code and we had to patch on our end.
> 
>  
> 
> Anyways… Query ran was:
> 
>     update T set source_id = substr(sourceId, 2, 10);
> 
> Took about 10h and created 100’s of millions of dead tuples, causing
> another couple of hours of vacuum.
> 
>  
> 
> This was done during a maintenance window, and that table is read-only
> except when we ETL data to it on a weekly basis, and so I was just
> wondering why I should pay the “bloat” penalty for this type of
> transaction. Is there a trick that could be use here?
Yes, there is a trick I like to use here, as long as you don't mind
locking the table (even against reads).

I'll assume T.source_id is of type text.  If it's not, use whatever the
actual type is.

ALTER TABLE T
    ALTER COLUMN source_id TYPE text USING substr(sourceId, 2, 10);

I copied what you had verbatim, I earnestly hope you don't have two
columns source_id and sourceId in your table.

This will rewrite the entire table just the same as a VACUUM FULL after
your UPDATE would.

Don't forget to VACUUM ANALYZE this table after the operation.  Even
though there will be no dead rows, you still need to VACUUM it to
generate the visibility map and you need to ANALYZE it for statistics on
your "new" column.

Foreign keys remain intact with this solution and you don't have double
wal logging like for an UPDATE.
-- 
Vik Fearing                                          +33 6 46 75 15 36
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: why does this query not use a parallel query
Next
From: "John van Breda"
Date:
Subject: Slow index scan backward.