
From: David Zhang
Subject: enhance the efficiency of migrating particularly large tables
Msg-id: 480e2d1e-9ed9-41e1-877d-a4c62e736975@highgo.ca
List: pgsql-hackers
Hi Postgres hackers,

I'm reaching out to gather some comments on enhancing the efficiency of 
migrating particularly large tables with significant data volumes in 
PostgreSQL.

When migrating a particularly large table, users often split it into 
multiple segments and use multiple sessions to process the segments in 
parallel, aiming to improve throughput. Segmenting a large table is 
challenging, however, if it lacks a column suitable for segmentation or 
if the data distribution is uneven. I believe the data volume in each 
block should be relatively balanced as long as vacuum runs regularly, 
so the ctid can be used to segment a large table, and I am thinking the 
entire process can be outlined as follows:
1) determine the minimum and maximum ctid.
2) calculate the number of data blocks based on the maximum and minimum 
ctid.
3) generate multiple SQL queries, such as SELECT * FROM tbl WHERE ctid 
 >= '(xx,1)' AND ctid < '(xxx,1)'.
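To illustrate the idea, the segment boundaries could also be derived 
directly from the table's block count rather than from min(ctid) and 
max(ctid), since heap blocks are numbered 0 through nblocks - 1. A 
rough sketch (the table name tbl, the segment count of 4, and the 
boundary values are placeholders; pg_relation_size reports the exact 
physical size, whereas pg_class.relpages is only an estimate maintained 
by VACUUM/ANALYZE):

```sql
-- number of heap blocks, from the physical relation size
SELECT pg_relation_size('tbl') / current_setting('block_size')::bigint
    AS nblocks;

-- with nblocks known (assume nblocks = 10000 here), issue one query
-- per segment, e.g. 4 segments of roughly nblocks / 4 blocks each:
SELECT * FROM tbl WHERE ctid >= '(0,1)'    AND ctid < '(2500,1)';
SELECT * FROM tbl WHERE ctid >= '(2500,1)' AND ctid < '(5000,1)';
SELECT * FROM tbl WHERE ctid >= '(5000,1)' AND ctid < '(7500,1)';
SELECT * FROM tbl WHERE ctid >= '(7500,1)' AND ctid < '(10000,1)';
```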

However, when executing SELECT min(ctid) and max(ctid), it performs a 
Seq Scan, which can be slow for a large table. Is there a way to 
retrieve the minimum and maximum ctid other than using the system 
functions min() and max()?
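One possible SQL-level workaround (a sketch, not part of the attached 
POC): on PostgreSQL 14 and later, a range qualification on ctid can be 
executed as a TID Range Scan, which reads only the referenced blocks 
instead of the whole heap. Combined with the block count from 
pg_relation_size, only the first and last blocks need to be probed 
(tbl and the block number 9999 are placeholders; boundary blocks can be 
empty after deletes, so the probe may need to widen its range):

```sql
-- minimum ctid: scan only block 0 (widen the range if it is empty)
SELECT min(ctid) FROM tbl WHERE ctid < '(1,0)';

-- maximum ctid: scan only the final block, assuming nblocks = 10000
SELECT max(ctid) FROM tbl WHERE ctid >= '(9999,0)';
```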

Since ctids are physically ordered, retrieving the minimum ctid should, 
in theory, only require scanning forward from the first block and 
stopping at the first live tuple found. Similarly, retrieving the 
maximum ctid should only require scanning backward from the last block 
and stopping at the first live tuple found. Here's a code snippet:

         /* scan the relation for the minimum or maximum ctid */
         if (find_max_ctid)
             dir = BackwardScanDirection;   /* start from the last block */
         else
             dir = ForwardScanDirection;    /* start from the first block */

         /* stop at the first live tuple returned in the chosen direction */
         while ((tuple = heap_getnext(scan, dir)) != NULL)
         ...

Attached is a simple POC, written by referring to the pgstattuple 
extension. Any feedback, suggestions, or alternative solutions from the 
community would be greatly appreciated.

Thank you,

David


