Re: Logical replication, need to reclaim big disk space - Mailing list pgsql-general

From: Moreno Andreo
Subject: Re: Logical replication, need to reclaim big disk space
Msg-id: 468e0359-0670-4dc9-9014-87dca6c69821@evolu-s.it
In response to: Logical replication, need to reclaim big disk space (Moreno Andreo <moreno.andreo@evolu-s.it>)
List: pgsql-general
On 16/05/25 21:33, Achilleas Mantzios wrote:
> On 16/5/25 18:45, Moreno Andreo wrote:
>
>> Hi,
>>     we are moving away from our old binary data approach, migrating
>> the data from bytea fields in a table to external storage (making the
>> database smaller and related operations faster and easier to manage).
>> In short, we have a job that runs in the background, copies the data
>> from the table to an external file, and then sets the bytea field to NULL.
>> (UPDATE tbl SET blob = NULL, ref = 'path/to/file' WHERE id = <uuid>)
>>
>> This results, at the end of the operation, in a table that's less
>> than one tenth of its original size.
>> We have a multi-tenant architecture (hundreds of schemas with an
>> identical structure, all inheriting from public), and we are
>> performing the task on one table per schema.
>>
> So? TOASTed data are kept in separate TOAST tables; unless those bytea
> cols are actually selected, you won't even touch them. I cannot
> understand what you are trying to achieve here.
>
> Years ago, when I made the mistake of going for a coffee and letting my
> developers "improvise", the result was a design similar to what you are
> trying to achieve. Years later, I am seriously considering moving that
> data back to PostgreSQL.
The "related operations" I was talking about are backups and database 
maintenance when needed, cluster/replica management, etc. With a smaller 
database size they would be easier in timing and effort, right?
We are mostly talking about costs here. To call things by their name:
I'm moving the bytea contents (85% of our total data) into files in
Google Cloud Storage buckets, which cost a fraction of the disks holding
my database (on GCE, to be clear).
This data is not accessed frequently (only by its owner, when he needs
it), so there's no need to keep it on expensive storage.
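For anyone curious, here's a quick way to see how much of a table
actually lives in TOAST (schema and table names below are just
placeholders):

    -- heap vs. heap+TOAST vs. everything incl. indexes, for one table
    SELECT pg_size_pretty(pg_relation_size('some_schema.tbl'))       AS heap_only,
           pg_size_pretty(pg_table_size('some_schema.tbl'))          AS heap_plus_toast,
           pg_size_pretty(pg_total_relation_size('some_schema.tbl')) AS with_indexes;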
Over the years I've read that keeping many big bytea fields in a
database is not recommended, but I might have misunderstood this.
Another option would have been to move these tables to a different
tablespace on cheaper storage, but that would still have cost about
three times as much as the buckets.
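For the record, that alternative would have looked roughly like this
(path and names are made up, and note that SET TABLESPACE rewrites the
table while holding an ACCESS EXCLUSIVE lock):

    CREATE TABLESPACE cheap_disk LOCATION '/mnt/cheap-disk/pgdata';
    ALTER TABLE some_schema.tbl SET TABLESPACE cheap_disk;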

Why are you considering moving that data back into database tables?
>
>
>> The problem is: this is generating BIG table bloat, as you may imagine.
>> Running a VACUUM FULL on an ex-22GB table on a standalone test server
>> is almost immediate.
>> If I had only one server, I'd process one table at a time with a
>> nightly script, issuing a VACUUM FULL on tables that have already
>> been processed.
>>
>> But I'm in a logical replication architecture (we are using a
>> multimaster system called pgEdge, but I don't think that makes a big
>> difference, since it's based on logical replication), and I'm
>> building a test cluster.
>>
> So you use pgEdge, but you want to lose all the benefits of
> multi-master, since your binary data won't be replicated ...
I don't think I need it to be replicated, since this data cannot be
"edited": either it's there or it's been deleted. Buckets have
protections against data deletion and against events like ransomware
attacks and such.
Also, multi-master was an absolute requirement one year ago because of a
project we were building, but that project has been abandoned and now
plain logical replication would be enough. But let's do one thing at a
time.
>> I've been instructed to issue VACUUM FULL on both nodes, nightly, but
>> before proceeding I read in the docs that VACUUM FULL can disrupt
>> logical replication, so I'm a bit concerned about how to proceed.
>> Rows are cleared one at a time (one transaction, one row, to confine
>> errors to the record that raised them).
>>
> pgEdge is based on the old pglogical, the 2ndQuadrant extension, not
> the native logical replication we have had since PostgreSQL 10. But I
> might be mistaken.
I don't know about that; it keeps running on the latest PostgreSQL
versions (we are about to upgrade to 17.4, if I'm not wrong), but I'll
ask.
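In the meantime I can at least check what is actually installed, and
keep an eye on slot lag while testing. These are standard catalog
views, so they should apply whatever extension pgEdge uses underneath:

    SELECT extname, extversion FROM pg_extension;

    -- how far each logical slot's consumer is behind the current WAL position
    SELECT slot_name, active,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),
                                          confirmed_flush_lsn)) AS lag
      FROM pg_replication_slots
     WHERE slot_type = 'logical';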
>> I read about extensions like pg_squeeze, but I wonder whether they,
>> too, are dangerous for replication.
>>
> What's pgEdge's take on that, I mean on the bytea thing you are trying
> to achieve here?
They are positive about it; it's they who suggested running VACUUM FULL
on both nodes... I'm quite new to replication, so I'm looking for some
advice here.
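The nightly script I have in mind is just a psql loop over the schemas,
something like the following ('tbl' is a placeholder for the per-schema
table name; \gexec runs each generated VACUUM as its own statement, so
one schema at a time):

    SELECT format('VACUUM FULL ANALYZE %I.%I', schemaname, relname)
      FROM pg_stat_user_tables
     WHERE relname = 'tbl'
     ORDER BY schemaname
    \gexec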
>> Thanks for your help.
>> Moreno.-