Re: Modification of data in base folder and very large tables - Mailing list pgsql-performance

From Ogden Brash
Subject Re: Modification of data in base folder and very large tables
Date
Msg-id CAFCR_K2+T=iS_oTTOEdK67Za7gN0-2+LeK+m7=vjw9Zu0Jsdtg@mail.gmail.com
Whole thread Raw
In response to Re: Modification of data in base folder and very large tables  (Jerry Sievers <gsievers19@comcast.net>)
Responses Re: Modification of data in base folder and very large tables
Re: Modification of data in base folder and very large tables
List pgsql-performance
First of all, thanks to Jeff for pointing out strace. I had not used it before and it is quite informative. This is the rather depressing one minute summary for the pg_restore:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 80.23    0.176089          65      2726           read
 16.51    0.036226          12      3040           lseek
  1.69    0.003710          11       326           write
  0.79    0.001733           7       232           recvfrom
  0.77    0.001684          18        96        96 openat
  0.02    0.000035          12         3         1 futex
------ ----------- ----------- --------- --------- ----------------
100.00    0.219477                  6423        97 total

The two pg_related processes which are doing 99% of the i/o on the machine are spending less than 2% of all that time actually writing data. No wonder it is going slowly!

Which takes me to Jerry's question as to whether there are any indexes. My immediate off-the cuff answer would be "no". I deleted all indexes in the target database before I did the pg_restore, and the tables do not have any foreign keys or any other constraints or triggers. I did the restore as a data only restore so that it would not try to recreate any tables.

However, after studying the details of the strace output and reading about other people's experiences, I believe that the more accurate answer is that I did not take into consideration the index that gets created as a result of the primary key. In this case, the primary key is big int that is automatically assigned by a sequence.

If each of the tables has about 3+ billion rows, the index is still going to be pretty large and spread over many files. In the source database that was backed up, the primary key sequence was sequentially assigned and written, but as various posprocessing operations were applied and the rows modified,  the data, is probably in a relatively random evenly distributed order. So I now believe that all the files that are being constantly touched are not actually the files for the data rows, but the files for the index, and as the system is reading the data it is jumping around recreating the index for the primary key based on the random order of the dta rows it reads.

Sound plausible? I'm still a postgres noob.

As an experiment, I am in the process of clustering the source database tables by the primary key constraint. I am hoping that if I redo the pg_dump after that, it will contain the records in more-or-less primary key order and on the subsequent pg_restore it should not have to spend the vast majority of the time on reading and seeking.

It is surprising to me that the cluster operations (which also have to churn through the entire index and all the records) are going *much* faster than pg_restore. At the rate they are going, they should be finished in about 12 hours, whereas the pg_restore is nowhere close to finishing and has been churning for 10 days.


On Wed, Oct 9, 2019 at 4:05 PM Jerry Sievers <gsievers19@comcast.net> wrote:
Ogden Brash <info@litika.com> writes:

> I have a question about the files in .../data/postgresql/11/main/
> base, specifically in relation to very large tables and how they are
> written.
>
> I have been attempting to restore a relatively large database with
> pg_restore and it has been running for more than a week. (I also 
> have another thread about the same restore related to network vs.
> local disk I/O)
>
> I ran the pg_restore in verbose mode using multiple jobs so I can
> tell what has finished and what has not. The database had 66 tables
> and most of them have been restored. Two of the tables were quite
> large (billions of rows translating to between 1 and 2TB of data on
> disk for those two tables) and those two tables are pretty much the
> only things remaining that has not been reported as finished by
> pg_restore.
>
> As the process has been going for a week, I have been tracking the
> machine (a dedicated piece of hardware, non-virtualized) and have
> been noticing a progressive slowdown (as tracked by iostat). There is

Do the tables that are being loaded have any indexes on them?

> nothing running on the machine besides postgresql and the server is
> only doing this restore, nothing else. It is now, on average, running
> at less than 25% of the speed that it was running four days ago (as
> measured by rate of I/O). 
>
> I started to dig into what was happening on the machine and I noticed
> the following:
>
> iotop reports that two postgres processes (I assume each processing
> one of the two tables that needs to be processed) are doing all the I
> /O. That makes sense
>
> Total DISK READ :    1473.81 K/s | Total DISK WRITE :     617.30 K/s
> Actual DISK READ:    1473.81 K/s | Actual DISK WRITE:       0.00 B/s
>   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>   
> COMMAND
>  6601 be/4 postgres  586.44 K/s    7.72 K/s  0.00 % 97.39 % postgres:
> 11/main: postg~s thebruteff [local] COPY
>  6600 be/4 postgres  887.37 K/s  601.87 K/s  0.00 % 93.42 % postgres:
> 11/main: postg~s thebruteff [local] COPY
>   666 be/3 root        0.00 B/s    7.72 K/s  0.00 %  5.73 % [jbd2/
> sda1-8]
>     1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % init
>     2 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 %
> [kthreadd]
>     4 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kworker/
> 0:0H]
>
> So, the next thing I though I would do is an "lsof" command for each
> of those two processes to see what they were writing. That was a bit
> of a surpise:
>
> # lsof -p 6600 | wc -l;
> 840
>
> # lsof -p 6601 | wc -l;
> 906
>
> Is that normal? That there be so many open file pointers? ~900 open
> file pointers for each of the processes?
>
> The next I did was go to see the actual data files, to see how many
> there are. In my case they are in postgresql/11/main/base/24576 and
> there are 2076 files there. That made sense. However, I found that
> when I list them by modification date I see something interesting:
>
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.7
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.8
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.9
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.10
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.11
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.12
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.13
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.14
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.16
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.15
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.17
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.18
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.19
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.21
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.22
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.23
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.24
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.25
> -rw------- 1 postgres postgres 1073741824 Oct  8 13:05 27083.26
> -rw------- 1 postgres postgres   19062784 Oct  8 13:05 27082_fsm
> -rw------- 1 postgres postgres  544489472 Oct  8 13:05 27077.34
> -rw------- 1 postgres postgres  169705472 Oct  8 13:05 27082.72
> -rw------- 1 postgres postgres  978321408 Oct  8 13:05 27083.27
> -rw------- 1 postgres postgres  342925312 Oct  8 13:05 27076.88
>
> If you notice, the file size is capped at 1 GB and as the giant table
> has grown it has added more files in this directory. However, the
> mysterious thing to me is that it keeps modifying those files
> constantly - even the ones that are completely full. So for the two
> large tables it has been restoring all week, the modification time
> for the ever growing list of files is being updating constantly.
>
> Could it be that thats why I am seeing a slowdown over the course of
> the week - that for some reason as the number of files for the table
> has grown, the system is spending more and more time seeking around
> the disk to touch all those files for some reason?
>
> Does anyone who understands the details of postgresql's interaction
> with the file system have an explanation for why all those files
> which are full are being touched constantly?
>
>
>
>

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net

pgsql-performance by date:

Previous
From: Jerry Sievers
Date:
Subject: Re: Modification of data in base folder and very large tables
Next
From: Andrew Gierth
Date:
Subject: Re: Modification of data in base folder and very large tables