Thread: Modification of data in base folder and very large tables

Modification of data in base folder and very large tables

From
Ogden Brash
Date:
I have a question about the files in .../data/postgresql/11/main/base, specifically in relation to very large tables and how they are written.

I have been attempting to restore a relatively large database with pg_restore and it has been running for more than a week. (I also have another thread about the same restore, related to network vs. local disk I/O.)

I ran the pg_restore in verbose mode using multiple jobs so I can tell what has finished and what has not. The database had 66 tables and most of them have been restored. Two of the tables were quite large (billions of rows, translating to between 1 and 2 TB of data on disk for those two tables) and those two tables are pretty much the only things remaining that have not been reported as finished by pg_restore.

As the process has been going for a week, I have been tracking the machine (a dedicated piece of hardware, non-virtualized) and have been noticing a progressive slowdown (as tracked by iostat). There is nothing running on the machine besides postgresql and the server is only doing this restore, nothing else. It is now, on average, running at less than 25% of the speed that it was running four days ago (as measured by rate of I/O). 

I started to dig into what was happening on the machine and I noticed the following:

iotop reports that two postgres processes (I assume each is processing one of the two tables that need to be processed) are doing all the I/O. That makes sense:

Total DISK READ :    1473.81 K/s | Total DISK WRITE :     617.30 K/s
Actual DISK READ:    1473.81 K/s | Actual DISK WRITE:       0.00 B/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
 6601 be/4 postgres  586.44 K/s    7.72 K/s  0.00 % 97.39 % postgres: 11/main: postg~s thebruteff [local] COPY
 6600 be/4 postgres  887.37 K/s  601.87 K/s  0.00 % 93.42 % postgres: 11/main: postg~s thebruteff [local] COPY
  666 be/3 root        0.00 B/s    7.72 K/s  0.00 %  5.73 % [jbd2/sda1-8]
    1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % init
    2 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kthreadd]
    4 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kworker/0:0H]

So, the next thing I thought I would do was an "lsof" command for each of those two processes to see what they were writing. That was a bit of a surprise:

# lsof -p 6600 | wc -l;
840

# lsof -p 6601 | wc -l;
906

Is that normal? Should there be so many open file pointers? ~900 open file pointers for each of the processes?

The next thing I did was go look at the actual data files, to see how many there are. In my case they are in postgresql/11/main/base/24576 and there are 2076 files there. That made sense. However, when I list them by modification date I see something interesting:

-rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.7
-rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.8
-rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.9
-rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.10
-rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.11
-rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.12
-rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.13
-rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.14
-rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.16
-rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.15
-rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.17
-rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.18
-rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.19
-rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.21
-rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.22
-rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.23
-rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.24
-rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.25
-rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.26
-rw------- 1 postgres postgres   19062784 Oct  8 13:05 27082_fsm
-rw------- 1 postgres postgres  544489472 Oct  8 13:05 27077.34
-rw------- 1 postgres postgres  169705472 Oct  8 13:05 27082.72
-rw------- 1 postgres postgres  978321408 Oct  8 13:05 27083.27
-rw------- 1 postgres postgres  342925312 Oct  8 13:05 27076.88

If you notice, each file is capped at 1 GB, and as the giant table has grown, more files have been added to this directory. However, the mysterious thing to me is that it keeps modifying those files constantly - even the ones that are completely full. So for the two large tables it has been restoring all week, the modification time for the ever-growing list of files is being updated constantly.

Could it be that that's why I am seeing a slowdown over the course of the week - that as the number of files for the table has grown, the system is spending more and more time seeking around the disk to touch all of those files?

Does anyone who understands the details of postgresql's interaction with the file system have an explanation for why all those files which are full are being touched constantly?


Re: Modification of data in base folder and very large tables

From
Andrew Gierth
Date:
>>>>> "Ogden" == Ogden Brash <info@litika.com> writes:

 Ogden> I have a question about the files in
 Ogden> .../data/postgresql/11/main/base, specifically in relation to
 Ogden> very large tables and how they are written.

 Ogden> I have been attempting to restore a relatively large database
 Ogden> with pg_restore and it has been running for more than a week.

Did you do the restore into a completely fresh database? Or did you make
the mistake of creating tables and indexes first?

What relation does the filenode 27083 correspond to? You can find that
with:

select oid::regclass from pg_class
 where pg_relation_filenode(oid) = '27083';
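
(The reverse direction is sometimes handy too: pg_relation_filepath shows which file on disk a relation currently starts in. The table and index names here are just placeholders:)

select pg_relation_filepath('some_table');        -- returns something like base/<dboid>/<filenode>
select pg_relation_filepath('some_table_pkey');   -- likewise for its primary key index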

-- 
Andrew (irc:RhodiumToad)



Re: Modification of data in base folder and very large tables

From
Jeff Janes
Date:
On Wed, Oct 9, 2019 at 4:33 AM Ogden Brash <info@litika.com> wrote:
> # lsof -p 6600 | wc -l;
> 840
>
> # lsof -p 6601 | wc -l;
> 906
>
> Is that normal? That there be so many open file pointers? ~900 open file pointers for each of the processes?

I don't think PostgreSQL makes any effort to conserve file handles, until it starts reaching the max.  So any file that has ever been opened will remain open, unless it was somehow invalidated (e.g. the file needs to be deleted).  If those processes were previously loading smaller tables before they got bogged down in the huge ones, a large number of handles would not be unexpected.
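
(The ceiling I mean is max_files_per_process; if you are curious what it is set to, just ask the server:)

show max_files_per_process;   -- per-backend cap on open file descriptors, 1000 by default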
 
 
> The next I did was go to see the actual data files, to see how many there are. In my case they are in postgresql/11/main/base/24576 and there are 2076 files there. That made sense. However, I found that when I list them by modification date I see something interesting:
>
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.7
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.8
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.9
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.10
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.11
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.12
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.13
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.14
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.16
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.15
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.17
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.18
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.19
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.21
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.22
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.23
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.24
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.25
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.26
> -rw------- 1 postgres postgres   19062784 Oct  8 13:05 27082_fsm
> -rw------- 1 postgres postgres  544489472 Oct  8 13:05 27077.34
> -rw------- 1 postgres postgres  169705472 Oct  8 13:05 27082.72
> -rw------- 1 postgres postgres  978321408 Oct  8 13:05 27083.27
> -rw------- 1 postgres postgres  342925312 Oct  8 13:05 27076.88
>
> If you notice, the file size is capped at 1 GB and as the giant table has grown it has added more files in this directory. However, the mysterious thing to me is that it keeps modifying those files constantly - even the ones that are completely full. So for the two large tables it has been restoring all week, the modification time for the ever growing list of files is being updating constantly.

The bgwriter, the checkpointer, and autovac, plus any backends that decide they need a clean page from the buffer cache, can all touch those files.  They might touch them in ways that are not I/O intensive, but still cause the modification time to get updated. In my hands, once all dirty buffers for a given file have been flushed and all contents in the file have been vacuumed, its mtime stops changing just due to the COPY, which is directed at later files.
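
(If you want a rough sense of how much of that background write activity is happening, the cumulative counters in pg_stat_bgwriter are one place to look; just a sketch of the sort of query I mean:)

select checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_clean, buffers_backend
from pg_stat_bgwriter;   -- cumulative counts since the last stats reset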

It is also squishy what it even means to modify a file.  I think filesystems have heuristics to avoid "gratuitous" updates to mtime, which make it hard to reckon with.
 

> Could it be that thats why I am seeing a slowdown over the course of the week - that for some reason as the number of files for the table has grown, the system is spending more and more time seeking around the disk to touch all those files for some reason?

I don't think lsof or mtime are effective ways to research this.  How about running strace -ttt -T -y on those processes?
 
Cheers,

Jeff

Re: Modification of data in base folder and very large tables

From
Jerry Sievers
Date:
Ogden Brash <info@litika.com> writes:

> I have a question about the files in .../data/postgresql/11/main/
> base, specifically in relation to very large tables and how they are
> written.
>
> I have been attempting to restore a relatively large database with
> pg_restore and it has been running for more than a week. (I also 
> have another thread about the same restore related to network vs.
> local disk I/O)
>
> I ran the pg_restore in verbose mode using multiple jobs so I can
> tell what has finished and what has not. The database had 66 tables
> and most of them have been restored. Two of the tables were quite
> large (billions of rows translating to between 1 and 2TB of data on
> disk for those two tables) and those two tables are pretty much the
> only things remaining that has not been reported as finished by
> pg_restore.
>
> As the process has been going for a week, I have been tracking the
> machine (a dedicated piece of hardware, non-virtualized) and have
> been noticing a progressive slowdown (as tracked by iostat). There is

Do the tables that are being loaded have any indexes on them?

> nothing running on the machine besides postgresql and the server is
> only doing this restore, nothing else. It is now, on average, running
> at less than 25% of the speed that it was running four days ago (as
> measured by rate of I/O). 
>
> I started to dig into what was happening on the machine and I noticed
> the following:
>
> iotop reports that two postgres processes (I assume each processing
> one of the two tables that needs to be processed) are doing all the I
> /O. That makes sense
>
> Total DISK READ :    1473.81 K/s | Total DISK WRITE :     617.30 K/s
> Actual DISK READ:    1473.81 K/s | Actual DISK WRITE:       0.00 B/s
>   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>   
> COMMAND
>  6601 be/4 postgres  586.44 K/s    7.72 K/s  0.00 % 97.39 % postgres:
> 11/main: postg~s thebruteff [local] COPY
>  6600 be/4 postgres  887.37 K/s  601.87 K/s  0.00 % 93.42 % postgres:
> 11/main: postg~s thebruteff [local] COPY
>   666 be/3 root        0.00 B/s    7.72 K/s  0.00 %  5.73 % [jbd2/
> sda1-8]
>     1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % init
>     2 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 %
> [kthreadd]
>     4 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kworker/
> 0:0H]
>
> So, the next thing I though I would do is an "lsof" command for each
> of those two processes to see what they were writing. That was a bit
> of a surpise:
>
> # lsof -p 6600 | wc -l;
> 840
>
> # lsof -p 6601 | wc -l;
> 906
>
> Is that normal? That there be so many open file pointers? ~900 open
> file pointers for each of the processes?
>
> The next I did was go to see the actual data files, to see how many
> there are. In my case they are in postgresql/11/main/base/24576 and
> there are 2076 files there. That made sense. However, I found that
> when I list them by modification date I see something interesting:
>
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.7
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.8
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.9
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.10
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.11
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.12
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.13
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.14
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.16
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.15
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.17
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.18
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.19
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.21
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.22
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.23
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.24
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.25
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.26
> -rw------- 1 postgres postgres   19062784 Oct  8 13:05 27082_fsm
> -rw------- 1 postgres postgres  544489472 Oct  8 13:05 27077.34
> -rw------- 1 postgres postgres  169705472 Oct  8 13:05 27082.72
> -rw------- 1 postgres postgres  978321408 Oct  8 13:05 27083.27
> -rw------- 1 postgres postgres  342925312 Oct  8 13:05 27076.88
>
> If you notice, the file size is capped at 1 GB and as the giant table
> has grown it has added more files in this directory. However, the
> mysterious thing to me is that it keeps modifying those files
> constantly - even the ones that are completely full. So for the two
> large tables it has been restoring all week, the modification time
> for the ever growing list of files is being updating constantly.
>
> Could it be that thats why I am seeing a slowdown over the course of
> the week - that for some reason as the number of files for the table
> has grown, the system is spending more and more time seeking around
> the disk to touch all those files for some reason?
>
> Does anyone who understands the details of postgresql's interaction
> with the file system have an explanation for why all those files
> which are full are being touched constantly?
>
>
>
>

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net



Re: Modification of data in base folder and very large tables

From
Ogden Brash
Date:
First of all, thanks to Jeff for pointing out strace. I had not used it before and it is quite informative. This is the rather depressing one-minute summary for the pg_restore:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 80.23    0.176089          65      2726           read
 16.51    0.036226          12      3040           lseek
  1.69    0.003710          11       326           write
  0.79    0.001733           7       232           recvfrom
  0.77    0.001684          18        96        96 openat
  0.02    0.000035          12         3         1 futex
------ ----------- ----------- --------- --------- ----------------
100.00    0.219477                  6423        97 total

The two pg-related processes which are doing 99% of the I/O on the machine are spending less than 2% of all that time actually writing data. No wonder it is going slowly!

Which takes me to Jerry's question as to whether there are any indexes. My immediate off-the-cuff answer would be "no". I deleted all indexes in the target database before I did the pg_restore, and the tables do not have any foreign keys or any other constraints or triggers. I did the restore as a data-only restore so that it would not try to recreate any tables.

However, after studying the details of the strace output and reading about other people's experiences, I believe that the more accurate answer is that I did not take into consideration the index that gets created as a result of the primary key. In this case, the primary key is a bigint that is automatically assigned by a sequence.

If each of the tables has about 3+ billion rows, the index is still going to be pretty large and spread over many files. In the source database that was backed up, the primary key values were sequentially assigned and written, but as various postprocessing operations were applied and the rows modified, the data is probably in a relatively random, evenly distributed order. So I now believe that all the files that are being constantly touched are not actually the files for the data rows, but the files for the index, and as the system is reading the data it is jumping around recreating the index for the primary key based on the random order of the data rows it reads.

Sound plausible? I'm still a postgres noob.
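
(One way I think I can check that is to map the filenodes back to relations and see whether the hot files belong to the table or to its primary key index; something roughly like this, with my real table name substituted in:)

select c.relname, c.relkind, pg_relation_filenode(c.oid) as filenode
from pg_class c
where c.oid = 'my_big_table'::regclass
   or c.oid in (select indexrelid from pg_index
                where indrelid = 'my_big_table'::regclass);
-- relkind 'r' is the table itself, 'i' is an index on it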

As an experiment, I am in the process of clustering the source database tables by the primary key constraint. I am hoping that if I redo the pg_dump after that, it will contain the records in more-or-less primary key order and on the subsequent pg_restore it should not have to spend the vast majority of the time on reading and seeking.
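
(For reference, what I am running on the source tables is roughly the following; the table and index names are placeholders for my real ones:)

cluster verbose my_big_table using my_big_table_pkey;   -- rewrites the table in primary key order
analyze my_big_table;                                   -- refresh planner statistics afterwards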

It is surprising to me that the cluster operations (which also have to churn through the entire index and all the records) are going *much* faster than pg_restore. At the rate they are going, they should be finished in about 12 hours, whereas the pg_restore is nowhere close to finishing and has been churning for 10 days.



Re: Modification of data in base folder and very large tables

From
Andrew Gierth
Date:
>>>>> "Ogden" == Ogden Brash <info@litika.com> writes:

 Ogden> I did the restore as a data only restore so that it would not
 Ogden> try to recreate any tables.

Doing data-only restores is almost always a mistake.

pg_dump/pg_restore are very careful to create things in an order that
allows the data part of the restore to run quickly: tables are created
first without any indexes or constraints, then data is loaded, then
indexes and constraints are created in bulk afterwards.

If you do a data-only restore into an existing table, then it's up to
you to avoid performance problems.

 Ogden> As an experiment, I am in the process of clustering the source
 Ogden> database tables by the primary key constraint. I am hoping that
 Ogden> if I redo the pg_dump after that, it will contain the records in
 Ogden> more-or-less primary key order and on the subsequent pg_restore
 Ogden> it should not have to spend the vast majority of the time on
 Ogden> reading and seeking.

This is a waste of time; just restore the data without the primary key
in place and then create it at the end.
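
Roughly this, with placeholder names and assuming the key is a single bigint column:

alter table my_big_table drop constraint my_big_table_pkey;
-- ... run the data-only pg_restore for this table ...
alter table my_big_table add constraint my_big_table_pkey primary key (id);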

 Ogden> It is surprising to me that the cluster operations (which also
 Ogden> have to churn through the entire index and all the records) are
 Ogden> going *much* faster than pg_restore.

That's because cluster, as with creation of a fresh index, can do a bulk
index build: sequentially read the table, sort the values (spilling to
temporary files if need be, but these are also read and written
sequentially), and write out the index data in one sequential pass.
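
Giving that sort room to work helps too; something like this before building the index (the value is only an example, size it to your hardware):

set maintenance_work_mem = '2GB';   -- bigger sort memory for the bulk index build, so it spills less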

-- 
Andrew (irc:RhodiumToad)



Re: Modification of data in base folder and very large tables

From
Jeff Janes
Date:
On Thu, Oct 10, 2019 at 3:40 AM Ogden Brash <info@litika.com> wrote:
> If each of the tables has about 3+ billion rows, the index is still going to be pretty large and spread over many files. In the source database that was backed up, the primary key sequence was sequentially assigned and written, but as various posprocessing operations were applied and the rows modified,  the data, is probably in a relatively random evenly distributed order. So I now believe that all the files that are being constantly touched are not actually the files for the data rows, but the files for the index, and as the system is reading the data it is jumping around recreating the index for the primary key based on the random order of the dta rows it reads.
>
> Sound plausible? I'm still a postgres noob.

Yes, perfectly plausible.
 

> As an experiment, I am in the process of clustering the source database tables by the primary key constraint. I am hoping that if I redo the pg_dump after that, it will contain the records in more-or-less primary key order and on the subsequent pg_restore it should not have to spend the vast majority of the time on reading and seeking.
>
> It is surprising to me that the cluster operations (which also have to churn through the entire index and all the records) are going *much* faster than pg_restore.

The cluster gets to lock the table against any concurrent changes, and then rebuild the indexes from scratch in bulk. You could have done more-or-less the same thing just by dropping the primary key while doing the load. Alas, there is no way to do that now without losing the work already done. When you do a data-only pg_restore, you are dis-inviting it from doing such optimizations.  Isn't the whole point of a data-only restore that you leave the table open for other business while it happens?  (Or maybe there is some other point to it that I am missing. If you want some halfway measure between creating the table from scratch and leaving it completely open for business as usual, then you have to evaluate each of those steps and implement them yourself; there is no way pg_restore can reasonably guess which constraints and indexes it is allowed to drop and which it is not.)

Perhaps https://www.postgresql.org/docs/current/populate.html#POPULATE-RM-INDEXES should mention the indexes associated with primary keys, since dropping them requires a different syntax and they might be overlooked.
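
(For anyone who hits this later, the difference in syntax is roughly the following; the names are placeholders:)

drop index my_big_table_pkey;                               -- fails: the index backs the primary key constraint
alter table my_big_table drop constraint my_big_table_pkey; -- this is the form that actually works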

Cheers,

Jeff