Thread: Sudden insert performance degradation

Sudden insert performance degradation

From
Henrique Montenegro
Date:
Hello list,

I am having issues with performance inserting data in Postgres and would like
to ask for help figuring out the problem as I ran out of ideas.

I have a process that generates a CSV file with 1 million records every 5
minutes; each file is about 240MB. I need this data inserted into a table in
Postgres. A peculiarity is that the data in these files may contain
duplicates: a row in the first file might also be present in the second or the
third and so on. I don't care about the duplicates, so I have a unique
constraint on my table to discard them.

The data in the CSV is pretty simple:

```
user_id,name,url
```

The table is defined like this:

```
create unlogged table users_no_dups (
    created_ts timestamp without time zone,
    user_id bigint not null,
    name text,
    url text,
    unique(user_id)
);
```

The table is created as `unlogged` to improve performance. I am aware that this can cause data loss.

My process for inserting data into the table is as follows:

* Create an intermediary table `users` as follows:

```
create unlogged table users (
    created_ts timestamp without time zone default current_timestamp,
    user_id bigint,
    name text,
    url text
) with (autovacuum_enabled = false, toast.autovacuum_enabled = false)
```

* Use `COPY` to load the data from the CSV file into the intermediary table

```
copy users(user_id, name, url) from
'myfile.csv' with(format csv, header true, delimiter ',', quote '"', escape '\\')
```

* Insert the data from the `users` table into the `users_no_dups` table

```
insert into users_no_dups (
    created_ts,
    user_id,
    name,
    url
) (
    select
        created_ts,
        user_id,
        name,
        url
    from
        users
) on conflict do nothing
```

* Drop the `users` table

* Repeat the whole thing for the next file (the full per-file cycle is sketched below).
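For reference, here is the whole per-file cycle condensed into one SQL batch. This is only a sketch of the steps above; the file path is a placeholder.

```
-- Sketch of one per-file iteration; 'myfile.csv' is a placeholder path.
create unlogged table users (
    created_ts timestamp without time zone default current_timestamp,
    user_id bigint,
    name text,
    url text
) with (autovacuum_enabled = false, toast.autovacuum_enabled = false);

copy users(user_id, name, url)
from 'myfile.csv' with (format csv, header true, delimiter ',', quote '"', escape '\');

insert into users_no_dups (created_ts, user_id, name, url)
select created_ts, user_id, name, url from users
on conflict do nothing;

drop table users;
```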


Running the above loop worked fine for about 12 hours. Each file was taking
about 30 seconds to be processed. About 4 seconds to create the `users` table
and have the CSV data loaded into it and anything between 20 and 30 seconds to
insert the data from `users` into `users_no_dups`.

All of a sudden inserting from `users` into `users_no_dups` started taking 20+
minutes.

I recreated the table with a `fillfactor` of `30` and tried again and things
were running well again with that same 30 seconds for processing. Again after
about 12 hours, things got really slow.

Recreating the table now isn't really providing any improvements. I tried
recreating it with a `fillfactor` of `10`, but it was taking too long and too
much space (the table had 300GB with the fillfactor set to 30; with it set to
10 it went up to almost 1TB).

Watching iotop, the `INSERT` statement's `WRITE` speed is now always between
20 and 100 KB/s. When I first started inserting, the `WRITE` speed was
consistently above 100 MB/s.

If I try to copy the `users_no_dups` table to another table (say
users_no_dups_2 with the same structure), the `WRITE` speed also goes to
100 MB/s or more until it gets to the last 2 GB of data being copied. Then the
speed goes down to 20 to 100 KB/s again and stays there (I know this from
watching `iotop`).

I have the following custom configuration on my postgres installation that
I've done in order to try to improve the performance:

```
ssl = off
shared_buffers = 8GB
work_mem = 12GB
maintenance_work_mem = 12GB
max_stack_depth = 4MB
synchronous_commit = off
wal_writer_flush_after = 128MB
max_wal_size = 32GB
min_wal_size = 80MB
effective_cache_size = 96GB
```

Information about the machine:

```
Processor: 2x Intel(R) Xeon(R) CPU E5-2630L v2 @ 2.40GHz (6 cores each, 12 threads)
RAM: 256GB


Disk1: 2TB SSD SATA-3 Samsung Evo 860
Disk2: 6TB HDD SATA-3 Seagate Exos Enterprise 7200RPM
Disk3: 8TB HDD SATA-3 Seagate Exos Enterprise 7200RPM

Disk1 and Disk2 are configured as a single logical volume.

Table `users_no_dups` is in a tablespace on `Disk3`. The default tablespace is
on the logical volume composed of `Disk1` and `Disk2`.

OS: Ubuntu Linux 19.10
Postgres version: PostgreSQL 11.7 (Ubuntu 11.7-0ubuntu0.19.10.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008, 64-bit
```

Any ideas why I am seeing this decrease in performance with the insert or any
suggestions on how I can try to figure this out?

Sorry for the wall of text. Just trying to give as much info as I have.

Henrique




Re: Sudden insert performance degradation

From
Sebastian Dressler
Date:
Hi Henrique,

On 13. Jul 2020, at 16:23, Henrique Montenegro <typoon@gmail.com> wrote:

[...]

* Insert the data from the `users` table into the `users_no_dups` table

```
insert into users_no_dups (
    created_ts,
    user_id,
    name,
    url
) (
    select
        created_ts,
        user_id,
        name,
        url
    from
        users
) on conflict do nothing
```

How do you check constraints here? Is this enforced with a UK/PK?

Running the above loop worked fine for about 12 hours. Each file was taking
about 30 seconds to be processed. About 4 seconds to create the `users` table
and have the CSV data loaded into it and anything between 20 and 30 seconds to
insert the data from `users` into `users_no_dups`.

Do you see anything suspicious in the logs, i.e. something in the realms of running out of transaction IDs?

[...]

Recreating the table now isn't really providing any improvements. I tried
recreating it with a `fillfactor` of `10`, but it was taking too long and too
much space (the table had 300GB with the fillfactor set to 30; with it set to
10 it went up to almost 1TB).

To me it sounds like the UK/PK is getting too much to write. A possible solution could be to start partitioning the table.
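For illustration only, range partitioning on user_id could look roughly like this in PostgreSQL 11; the boundary values below are invented and would have to match the actual user_id distribution:

```
-- Sketch only: range-partitioned parent with a unique constraint on the partition key.
create table users_no_dups_part (
    created_ts timestamp without time zone,
    user_id bigint not null,
    name text,
    url text,
    unique (user_id)
) partition by range (user_id);

-- One partition per user_id range; the boundaries here are made up.
create unlogged table users_no_dups_p0
    partition of users_no_dups_part for values from (0) to (100000000);
create unlogged table users_no_dups_p1
    partition of users_no_dups_part for values from (100000000) to (200000000);
```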

[...]
```
ssl = off
shared_buffers = 8GB
work_mem = 12GB
maintenance_work_mem = 12GB
max_stack_depth = 4MB
synchronous_commit = off
wal_writer_flush_after = 128MB
max_wal_size = 32GB
min_wal_size = 80MB
effective_cache_size = 96GB
```

Another suggestion would be to increase the min_wal_size here, but since you use UNLOGGED tables it does not matter much.


Information about the machine:

```
Processor: 2x Intel(R) Xeon(R) CPU E5-2630L v2 @ 2.40GHz (6 cores each, 12 threads)
RAM: 256GB


Disk1: 2TB SSD SATA-3 Samsung Evo 860
Disk2: 6TB HDD SATA-3 Seagate Exos Enterprise 7200RPM
Disk3: 8TB HDD SATA-3 Seagate Exos Enterprise 7200RPM

Disk1 and Disk2 are configured as a single logical volume.

Just curious: does that mean you mix up SSD + HDD?

Cheers,
Sebastian


--

Sebastian Dressler, Solution Architect 
+49 30 994 0496 72 | sebastian@swarm64.com 

Swarm64 AS
Parkveien 41 B | 0258 Oslo | Norway
Registered at Brønnøysundregistrene in Norway under Org.-Number 911 662 787
CEO/Geschäftsführer (Daglig Leder): Thomas Richter; Chairman/Vorsitzender (Styrets Leder): Dr. Sverre Munck 

Swarm64 AS Zweigstelle Hive
Ullsteinstr. 120 | 12109 Berlin | Germany
Registered at Amtsgericht Charlottenburg - HRB 154382 B

Re: Sudden insert performance degradation

From
Michael Lewis
Date:
Is this an insert-only table and perhaps not being picked up by autovacuum? If so, try a manual "vacuum analyze" before/after each batch run perhaps. You don't mention updates, but you have also been adjusting fillfactor so I am not sure.
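Something along these lines, run between batch loads (table name taken from the earlier message):

```
-- Keep planner statistics and the visibility map up to date between loads.
vacuum (analyze) users_no_dups;
```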

Re: Sudden insert performance degradation

From
Henrique Montenegro
Date:


On Mon, Jul 13, 2020 at 11:20 AM Sebastian Dressler <sebastian@swarm64.com> wrote:
Hi Henrique,

On 13. Jul 2020, at 16:23, Henrique Montenegro <typoon@gmail.com> wrote:

[...]

* Insert the data from the `users` table into the `users_no_dups` table

```
insert into users_no_dups (
    created_ts,
    user_id,
    name,
    url
) (
    select
        created_ts,
        user_id,
        name,
        url
    from
        users
) on conflict do nothing
```

How do you check constraints here? Is this enforced with a UK/PK?
 
The unique key is supposed to do the constraint enforcing here. The `users` table will have duplicate data and the maximum number of records in it is 1 million. Then I just try to insert it into the `users_no_dups` table with `on conflict do nothing` to ignore and discard the duplicates.



Running the above loop worked fine for about 12 hours. Each file was taking
about 30 seconds to be processed. About 4 seconds to create the `users` table
and have the CSV data loaded into it and anything between 20 and 30 seconds to
insert the data from `users` into `users_no_dups`.

Do you see anything suspicious in the logs, i.e. something in the realms of running out of transaction IDs?

I set the log to debug1. I haven't seen anything that called my attention, but I am not really sure what to look for, so perhaps I missed it. Any suggestions on what to look for or any specific log configuration to do?
 

[...]

Recreating the table now isn't really providing any improvements. I tried
recreating it with a `fillfactor` of `10`, but it was taking too long and too
much space (the table had 300GB with the fillfactor set to 30; with it set to
10 it went up to almost 1TB).

To me it sounds like the UK/PK is getting too much to write. A possible solution could be to start partitioning the table.

I thought about partitioning it, but I can't figure out on what. The `user_id` column is a number that is somewhat random so I don't know what kinds of range I would use for it. I will try to look at the values again and see if there is something that I could perhaps use as a range. Any other suggestions?
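One way to get a feel for possible range boundaries would be an exploratory query along these lines (not from the thread, just a sketch):

```
-- Exploratory: overall spread of user_id.
select min(user_id), max(user_id), count(*) from users_no_dups;

-- Split the observed range into 100 equal-width buckets and see how evenly they fill.
select width_bucket(user_id, b.mn, b.mx, 100) as bucket, count(*)
from users_no_dups,
     (select min(user_id) as mn, max(user_id) as mx from users_no_dups) b
group by bucket
order by bucket;
```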
 

[...]
```
ssl = off
shared_buffers = 8GB
work_mem = 12GB
maintenance_work_mem = 12GB
max_stack_depth = 4MB
synchronous_commit = off
wal_writer_flush_after = 128MB
max_wal_size = 32GB
min_wal_size = 80MB
effective_cache_size = 96GB
```

Another suggestion would be to increase the min_wal_size here, but since you use UNLOGGED tables it does not matter much.


Information about the machine:

```
Processor: 2x Intel(R) Xeon(R) CPU E5-2630L v2 @ 2.40GHz (6 cores each, 12 threads)
RAM: 256GB


Disk1: 2TB SSD SATA-3 Samsung Evo 860
Disk2: 6TB HDD SATA-3 Seagate Exos Enterprise 7200RPM
Disk3: 8TB HDD SATA-3 Seagate Exos Enterprise 7200RPM

Disk1 and Disk2 are configured as a single logical volume.

Just curious: does that mean you mix up SSD + HDD?

Yeah, I did that. Probably not very smart of me. I plan on undoing it soon. I assumed that is not what is causing my issue since the tablespace where the table is stored is on `Disk3` which is not part of the Logical Volume.
 

Cheers,
Sebastian




Thanks!

Henrique

Re: Sudden insert performance degradation

From
Henrique Montenegro
Date:


On Mon, Jul 13, 2020 at 12:28 PM Michael Lewis <mlewis@entrata.com> wrote:
Is this an insert-only table and perhaps not being picked up by autovacuum? If so, try a manual "vacuum analyze" before/after each batch run perhaps. You don't mention updates, but you have also been adjusting fillfactor so I am not sure.

It is mostly an insert-only table. The only queries I need to run on it aggregate the count of IDs inserted per hour.
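Presumably something along these lines (the exact query isn't shown in the thread):

```
-- Assumed shape of the hourly count query.
select date_trunc('hour', created_ts) as hour, count(*) as ids_inserted
from users_no_dups
group by 1
order by 1;
```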

I did vacuum the table; it didn't help. I tried both vacuum(analyze) and vacuum(full) ... they took a looooong time and brought no improvement.

I adjusted the `fillfactor` because the documentation didn't make it clear whether `updates to the table` means updating the values of existing rows or updating the table itself (which, in my understanding, would include adding new data to it). I just started messing with the `fillfactor` to see if it would give me any improvement.

It seems to me it did: the first time I created the table I didn't change the fillfactor and stumbled on the performance issue after 12 hours; I then recreated it with a fillfactor of 30 and was good again for about 12 more hours. Could be a coincidence though. I tried to recreate the table using fillfactor 10, but it was taking too long to load the data (12+ hours running, not done yet, and the WRITE speed on iotop was around 20K/s... I ended up just canceling it).

As of now, the table has about 280 million records in it.

Henrique

Re: Sudden insert performance degradation

From
Sebastian Dressler
Date:
Hi Henrique,

On 13. Jul 2020, at 18:42, Henrique Montenegro <typoon@gmail.com> wrote:

On Mon, Jul 13, 2020 at 11:20 AM Sebastian Dressler <sebastian@swarm64.com> wrote:


Running the above loop worked fine for about 12 hours. Each file was taking
about 30 seconds to be processed. About 4 seconds to create the `users` table
and have the CSV data loaded into it and anything between 20 and 30 seconds to
insert the data from `users` into `users_no_dups`.

Do you see anything suspicious in the logs, i.e. something in the realms of running out of transaction IDs?

I set the log to debug1. I haven't seen anything that called my attention, but I am not really sure what to look for, so perhaps I missed it. Any suggestions on what to look for or any specific log configuration to do?

Not necessarily; if you had run out of tx IDs you would notice that clearly, I guess. I also think that this is not the issue.

 

[...]

Recreating the table now isn't really providing any improvements. I tried
recreating it with a `fillfactor` of `10`, but it was taking too long and too
much space (the table had 300GB with the fillfactor set to 30; with it set to
10 it went up to almost 1TB).

To me it sounds like the UK/PK is getting too much to write. A possible solution could be to start partitioning the table.

I thought about partitioning it, but I can't figure out on what. The `user_id` column is a number that is somewhat random so I don't know what kinds of range I would use for it. I will try to look at the values again and see if there is something that I could perhaps use as a range. Any other suggestions?

Depending on granularity, maybe partition on `created_ts`? 

Cheers,
Sebastian

--

Sebastian Dressler, Solution Architect 
+49 30 994 0496 72 | sebastian@swarm64.com 

Swarm64 AS
Parkveien 41 B | 0258 Oslo | Norway
Registered at Brønnøysundregistrene in Norway under Org.-Number 911 662 787
CEO/Geschäftsführer (Daglig Leder): Thomas Richter; Chairman/Vorsitzender (Styrets Leder): Dr. Sverre Munck 

Swarm64 AS Zweigstelle Hive
Ullsteinstr. 120 | 12109 Berlin | Germany
Registered at Amtsgericht Charlottenburg - HRB 154382 B

Re: Sudden insert performance degradation

From
Henrique Montenegro
Date:


On Mon, Jul 13, 2020 at 12:50 PM Sebastian Dressler <sebastian@swarm64.com> wrote:
Hi Henrique,

On 13. Jul 2020, at 18:42, Henrique Montenegro <typoon@gmail.com> wrote:

On Mon, Jul 13, 2020 at 11:20 AM Sebastian Dressler <sebastian@swarm64.com> wrote:


Running the above loop worked fine for about 12 hours. Each file was taking
about 30 seconds to be processed. About 4 seconds to create the `users` table
and have the CSV data loaded into it and anything between 20 and 30 seconds to
insert the data from `users` into `users_no_dups`.

Do you see anything suspicious in the logs, i.e. something in the realms of running out of transaction IDs?

I set the log to debug1. I haven't seen anything that called my attention, but I am not really sure what to look for, so perhaps I missed it. Any suggestions on what to look for or any specific log configuration to do?

Not necessarily; if you had run out of tx IDs you would notice that clearly, I guess. I also think that this is not the issue.

 

[...]

Recreating the table now isn't really providing any improvements. I tried
recreating it with a `fillfactor` of `10`, but it was taking too long and too
much space (the table had 300GB with the fillfactor set to 30; with it set to
10 it went up to almost 1TB).

To me it sounds like the UK/PK is getting too much to write. A possible solution could be to start partitioning the table.

I thought about partitioning it, but I can't figure out on what. The `user_id` column is a number that is somewhat random so I don't know what kinds of range I would use for it. I will try to look at the values again and see if there is something that I could perhaps use as a range. Any other suggestions?

Depending on granularity, maybe partition on `created_ts`? 

I could give it a try. The reason I didn't try that yet was that I thought that since the UK is on the `user_id` column it wouldn't give me any benefit, but I can't really justify why I was thinking that. I would assume that the constraint would be validated against the index and not the whole table, so this might work. I will give it a try.

Thanks!

Henrique
 

Cheers,
Sebastian


Re: Sudden insert performance degradation

From
Jeff Janes
Date:
On Mon, Jul 13, 2020 at 10:23 AM Henrique Montenegro <typoon@gmail.com> wrote:

insert into users_no_dups (
    created_ts,
    user_id,
    name,
    url
) (
    select
        created_ts,
        user_id,
        name,
        url
    from
        users
) on conflict do nothing

Once the size of the only index exceeds shared_buffers by a bit (the amount of "a bit" depends on your RAM, kernel version, settings for dirty_background_ratio, dirty_expire_centisecs, and probably other things, and is not easy to predict) the performance falls off a cliff when inserting values in a random order.  Every insert dirties a random index leaf page, which quickly gets evicted from shared_buffers to make room for other random leaf pages to be read in, and then turns into flush calls when the kernel freaks out about the amount and age of dirty pages held in memory.
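A quick way to check whether the unique index has outgrown shared_buffers; the index name below is the one PostgreSQL generates by default for `unique(user_id)`, so adjust it if yours differs:

```
-- Compare the size of the unique index against shared_buffers.
show shared_buffers;
select pg_size_pretty(pg_relation_size('users_no_dups_user_id_key'));
```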

What happens if you add an "ORDER BY user_id" to your above select?
 
shared_buffers = 8GB
RAM: 256GB

Or, crank up shared_buffers by a lot. Like, beyond the size of the growing index, or up to 240GB if the index ever becomes larger than that. And make the time between checkpoints longer. If the dirty buffers are retained in shared_buffers longer, the chances of them getting dirtied repeatedly between writes are much higher than if you just toss them to the kernel and hope for the best.
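Concretely, that could look something like the following; the values are illustrative only, not recommendations from the thread:

```
-- Illustrative values; shared_buffers should exceed the index size and needs a restart.
alter system set shared_buffers = '160GB';
alter system set checkpoint_timeout = '30min';   -- stretch the time between checkpoints
alter system set max_wal_size = '64GB';          -- avoid forced checkpoints from WAL volume
select pg_reload_conf();                         -- reloads the reloadable settings; shared_buffers still needs a restart
```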

Cheers,

Jeff

Re: Sudden insert performance degradation

From
Henrique Montenegro
Date:


On Mon, Jul 13, 2020 at 8:05 PM Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Jul 13, 2020 at 10:23 AM Henrique Montenegro <typoon@gmail.com> wrote:

insert into users_no_dups (
    created_ts,
    user_id,
    name,
    url
) (
    select
        created_ts,
        user_id,
        name,
        url
    from
        users
) on conflict do nothing

Once the size of the only index exceeds shared_buffers by a bit (the amount of "a bit" depends on your RAM, kernel version, settings for dirty_background_ratio, dirty_expire_centisecs, and probably other things, and is not easy to predict) the performance falls off a cliff when inserting values in a random order.  Every insert dirties a random index leaf page, which quickly gets evicted from shared_buffers to make room for other random leaf pages to be read in, and then turns into flush calls when the kernel freaks out about the amount and age of dirty pages held in memory.
 
That is interesting to know. I will do some research on those things.


What happens if you add an "ORDER BY user_id" to your above select?

I don't know. I will give it a try right now.
 
shared_buffers = 8GB
RAM: 256GB

Or, crank up shared_buffers by a lot. Like, beyond the size of the growing index, or up to 240GB if the index ever becomes larger than that. And make the time between checkpoints longer. If the dirty buffers are retained in shared_buffers longer, the chances of them getting dirtied repeatedly between writes are much higher than if you just toss them to the kernel and hope for the best.


I cranked it up to 160GB to see how it goes.

Cheers,

Jeff

I also created the partitions, as mentioned before. I was able to partition the table based on user_id (found some logic to it). I was transferring the data from the original table (about 280 million records; 320GB) to the new partitioned table and things were going well, with write speeds between 30MB/s and 50MB/s. After reading 270GB of the 320GB (in 4 and a half hours) and writing it to the new partitioned table, the write speed went down to 7KB/s. It is so frustrating.

I will keep the partitions and try your suggestions to see how it goes.

I apologize for the long time between replies; it is just that testing this stuff takes 4+ hours each run.

If there are any other suggestions of things for me to look at in the meantime, please keep them coming.

Thanks!

Henrique

Re: Sudden insert performance degradation

From
Henrique Montenegro
Date:
Alright, so it seems like partitioning and changing shared_buffers, as well as adding the order by, helped to a certain extent, but the writes are still slow. Inserting a 1-million-record file is taking almost 3 minutes (way better than the 20+ minutes, but still pretty slow compared to the 20 seconds it used to take).

The interesting thing for me right now is: If I try to insert the data from a file that has already been inserted (meaning all the data will end up being rejected due to the unique constraint), it only takes between 1 and 4 seconds for the insertion to finish executing. For regular files (which usually have 30% new unique records (meaning about 300k new records)), it is taking those 3 minutes.

**UPDATE**

I started writing this email and then something occurred to me that I should try. I'm leaving the information above for historical reasons.

Basically I went ahead and ran a `reindex` on all the partitions to see if it would improve the performance, and it seems like that did it! I used the following script to reindex all of the partitions (the names of my partitions all start with ubp_):

```
DO $$DECLARE r record;
BEGIN
    FOR r IN select indexname from pg_indexes where tablename like 'ubp_%'
    LOOP
        EXECUTE 'reindex index ' || r.indexname;
    END LOOP;
END$$;
```

After doing this, processing of each file is taking anything between 8 and 20 seconds (most of them seem to be taking 8 seconds though). So, this is great!

In summary, what I ended up having to do was:

* Raise shared_buffers to 160GB
* Add an `order by` to the `select` subquery in the `insert` statement
* Partition the table
* Tune postgres configurations as shown below:

~~~
ssl = off
shared_buffers = 160GB
work_mem = 12GB
maintenance_work_mem = 12GB
max_stack_depth = 4MB
synchronous_commit = off
wal_writer_flush_after = 128MB
max_wal_size = 32GB
min_wal_size = 80MB
effective_cache_size = 96GB
~~~

I can't tell whether raising `shared_buffers` or adding the `order by` was responsible for the performance gains. It doesn't hurt to do both anyway. I know for a fact that the `reindex` of each partition made a huge difference in the end, as explained above (bringing insert time down from 3 minutes to 8 seconds).

I have about 1800 files in my backlog to be processed now (18 billion records). I have started processing them and will report back in case performance degrades once again.

Thanks everybody for the help so far! I really appreciate it.

Henrique

PS: I checked the `dirty` ratios for the OS:

$ sysctl vm.dirty_ratio
vm.dirty_ratio = 20

$ sysctl vm.dirty_background_ratio
vm.dirty_background_ratio = 10

$ sysctl vm.dirty_expire_centisecs
vm.dirty_expire_centisecs = 3000

These are default values; if what I understood from them is right, it seems to me that these values should be fine.

On Mon, Jul 13, 2020 at 9:02 PM Henrique Montenegro <typoon@gmail.com> wrote:


[...]

Re: Sudden insert performance degradation

From
Henrique Montenegro
Date:


On Tue, Jul 14, 2020 at 9:05 AM Henrique Montenegro <typoon@gmail.com> wrote:
[...]

Hello again list,

Turns out that the good performance didn't last long. After processing about
300 CSV files with 1 million records each (inserting between 200k and 300k new
records per file into the DB), performance went downhill again :(



Table `users_basic_profile_no_dups_partitioned` stats:
- 1530 partitions (based on user_id)
- 473,316,776 rows
- Unlogged
- Stored in an 8TB 7200 RPM HDD

Table `users_basic_profile` stats:
- Unlogged
- 1 million rows
- Stored in memory (using tmpfs)
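
For context, the `ramdisk` tablespace used further down can be created roughly like this; the tmpfs mount point is an assumption:

```
-- Assumes a tmpfs mount at /mnt/ramdisk that is writable by the postgres user.
create tablespace ramdisk location '/mnt/ramdisk';
```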

Configuration file has the following custom configurations for the tests
executed below:

```
ssl = off
shared_buffers = 160GB                  # min 128kB
work_mem = 96GB                         # min 64kB
maintenance_work_mem = 12GB             # min 1MB
max_stack_depth = 4MB                   # min 100kB
dynamic_shared_memory_type = posix      # the default is the first option
synchronous_commit = off                # synchronization level;
commit_delay = 100000                   # range 0-100000, in microseconds
max_wal_size = 3GB
min_wal_size = 1GB
min_parallel_index_scan_size = 64kB
effective_cache_size = 96GB
log_min_messages = debug1 # values in order of decreasing detail:
log_checkpoints = on
log_error_verbosity = verbose # terse, default, or verbose messages
log_line_prefix = '%m [%p] %q%u@%d '            # special values:
log_lock_waits = on                     # log lock waits >= deadlock_timeout
log_timezone = 'America/New_York'
log_executor_stats = on
datestyle = 'iso, mdy'
```

(max_wal_size was 80GB before and min_wal_size was 80MB; I changed the max
because the first restart I did to the service took a long time since it had
to sync 80+GB of data to the disk)

I restarted the postgres service and ran this query:

```
select user_id from users_basic_profile_no_dups_partitioned
where
    user_id in (
        select user_id from users_basic_profile order by user_id
    );
```

The above query took 659 seconds to run and read 73.64 GB of data from the
disk. From observing the `top` output I assume that all this data was loaded
into RAM and kept there.

I then ran the same query again and it ran in 195 seconds. This second time,
no data was read from the disk and CPU usage stayed at 100% the whole time.
I am not sure why it took so long since it seems the whole data was in memory.

I then ran the following query 6 times while increasing the limit as shown in
the table below:

```
select user_id from users_basic_profile_no_dups_partitioned
where
    user_id in (
        select user_id from users_basic_profile order by user_id
        limit 10
    );
```

  Limit  |  Time (seconds)
---------|------------------
10       | 0.6
100      | 0.6
1000     | 1.3
10000    | 116.9
100000   | 134.8
1000000  | 193.2

Notice the jump in execution time from the 1k limit to the 10k limit: the
amount of data went up 10x, but the execution time went up 100x.

It seems to me that inserting the data in this case is slow because
identifying the duplicate records (which I assume is done in a fashion similar
to the queries above) takes a long time.

I have attached the `explain analyze` output for the 1k and 10k queries to
this email (they are 4k+ lines each; I didn't want to make this message bigger
than it already is).

* exp1k.txt
* exp10k.txt

One thing to keep in mind is: all the data in the `users_basic_profile` table
already exists in the `users_basic_profile_no_dups_partitioned` table. So if I
try to insert the data now again, it goes SUPER fast:

```

insert into users_basic_profile_no_dups_partitioned(
                       created_ts,
                       user_id,
                       name,
                       profile_picture
                   ) (
                   select
                       created_ts,
                       user_id,
                       name,
                       profile_picture
                   from
                        users_basic_profile
                    order by
                        user_id limit 10000
                   ) on conflict do nothing;
INSERT 0 0
Time: 276.905 ms
```

I dropped the `users_basic_profile` table, recreated it, and then loaded a new
file into it that had not been previously loaded:

```
drop table users_basic_profile;

create unlogged table users_basic_profile (
   created_ts timestamp without time zone default current_timestamp,
   user_id bigint,
   name text,
   profile_picture text
)
with (autovacuum_enabled = false, toast.autovacuum_enabled = false)
tablespace ramdisk;

copy users_basic_profile(user_id, name, profile_picture)
from '/tmp/myfile.csv' with (
    format csv,
    header true,
    delimiter ',',
    quote '"',
    escape '\'
);
```

The `COPY` command took 3 seconds.

I then ran the `SELECT` queries above again:


  Limit  |  Time (seconds)
---------|------------------
10       | 0.7
100      | 0.6
1000     | 1
10000    | 5.3
100000   | 68.8
1000000  | Did not complete

The 1 million query ran for 54 minutes when I finally decided to cancel it.
Disk reads at this point were at 1.4MB per second by the process performing
the `SELECT`. No other process was using the disk.

This execution was not fair, since the new data was probably not cached in RAM
yet. So I re-ran all the queries again:


  Limit  |  Time (seconds)
---------|------------------
10       | 0.7
100      | 0.7
1000     | 0.8
10000    | 1.9
100000   | 11.2
1000000  | Did not complete

The 1 million query didn't complete again. The disk read speed was again at
1.4MB/s and if it didn't complete in 10 minutes it wasn't gonna complete any
time soon.

While these numbers look better, I find the 5x increase from the 10k to
100k a bit suspicious.

The `explain analyze` plans for the 1k, 10k and 100k queries are attached:

* exp1k-secondtime.txt
* exp10k-secondtime.txt
* exp100k-secondtime.txt

The `explain` for the 1million query is also attached:
* exp1million.txt

I then tried to insert the data into the table with this query:

```
begin;
explain analyze insert into users_basic_profile_no_dups_partitioned(created_ts,
   user_id,
   name,
   profile_picture
) (
select
   created_ts,
   user_id,
   name,
   profile_picture
from
    users_basic_profile
order by
    user_id
) on conflict do nothing;
```

Disk read speed during this query was around 9MB/s with writes around 500KB/s.

The result of the explain analyze is as follows:

```
                                                                        QUERY PLAN                                              
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Insert on users_basic_profile_no_dups_partitioned  (cost=386110.19..423009.25 rows=2951925 width=80) (actual time=156773.296..156773.296 rows=0 loops=1)
   Conflict Resolution: NOTHING
   Tuples Inserted: 293182
   Conflicting Tuples: 706818
   ->  Sort  (cost=386110.19..393490.00 rows=2951925 width=80) (actual time=777.295..1423.577 rows=1000000 loops=1)
         Sort Key: users_basic_profile.user_id
         Sort Method: quicksort  Memory: 540206kB
         ->  Seq Scan on users_basic_profile  (cost=0.00..68878.25 rows=2951925 width=80) (actual time=0.019..173.278 rows=1000000 loops=1)
 Planning Time: 0.139 ms
 Execution Time: 156820.603 ms
(10 rows)
```

This query took 156 seconds to complete. 156 seconds is not too bad, but I was
getting between 8 and 20 seconds this morning, as I mentioned before. So
something still seems off. I was able to process 300 files this morning, each
one containing 1 million records and inserting anything between 200k and 300k
new records into the table per file. This means that while running these
tests, I have about 70 million more rows in the table than I did this morning.

After completing the `INSERT` I executed a `COMMIT` that took 0.03 seconds.

I decided to run the `SELECT` queries one last time:

  Limit  |  Time (seconds)
---------|------------------
10       | 0.6
100      | 0.6
1000     | 0.7
10000    | 1.6
100000   | 10.7
1000000  | 110.7

This time the 1 million query completed. Most likely due to some caching
mechanism I'd guess. Still 110 seconds seems somewhat slow.

So, does anyone have any suggestions on what could be wrong? The questions
that come to mind are:

* Why are these execution times so crazy?
* Why is the read speed from the disk so low?
* What is causing the sudden drop in performance?
* Any idea how to fix any of this?
* Any suggestions on what I should do/test/look for?

= Extra Information =

Before starting all these tests, I had executed the following
`REINDEX` command on all partitions of
`users_basic_profile_no_dups_partitioned`:


```
DO $$DECLARE r record;
BEGIN
    FOR r IN select indexname from pg_indexes where tablename like 'ubp_%'
    LOOP
        raise notice 'Processing index [%]', r.indexname;
        EXECUTE 'alter index ' || r.indexname || ' set (fillfactor=50)';
        EXECUTE 'reindex index ' || r.indexname;
    END LOOP;
END$$;
```

Before setting the `fillfactor` to 50, I tried just a regular `REINDEX`
keeping the original `fillfactor` but the results were still the same.

Structure of table `users_basic_profile_no_dups_partitioned`:

```

# \d users_basic_profile_no_dups_partitioned

        Unlogged table "public.users_basic_profile_no_dups_partitioned"
     Column      |            Type             | Collation | Nullable | Default
-----------------+-----------------------------+-----------+----------+---------
 created_ts      | timestamp without time zone |           | not null |
 user_id         | bigint                      |           | not null |
 name            | text                        |           |          |
 profile_picture | text                        |           |          |
Partition key: RANGE (user_id)
Indexes:
    "users_basic_profile_no_dups_partitioned_pkey" PRIMARY KEY, btree (user_id)
Number of partitions: 1530 (Use \d+ to list them.)
```

The `profile_picture` column stores a `URL` to the picture, not a blob of the
picture.
 

Re: Sudden insert performance degradation

From
Sebastian Dressler
Date:
Hi Henrique,

On 15. Jul 2020, at 03:13, Henrique Montenegro <typoon@gmail.com> wrote:
[...]

```
ssl = off
shared_buffers = 160GB                  # min 128kB
work_mem = 96GB                         # min 64kB
maintenance_work_mem = 12GB             # min 1MB
max_stack_depth = 4MB                   # min 100kB
dynamic_shared_memory_type = posix      # the default is the first option
synchronous_commit = off                # synchronization level;
commit_delay = 100000                   # range 0-100000, in microseconds
max_wal_size = 3GB
min_wal_size = 1GB
min_parallel_index_scan_size = 64kB
effective_cache_size = 96GB
log_min_messages = debug1 # values in order of decreasing detail:
log_checkpoints = on
log_error_verbosity = verbose # terse, default, or verbose messages
log_line_prefix = '%m [%p] %q%u@%d '            # special values:
log_lock_waits = on                     # log lock waits >= deadlock_timeout
log_timezone = 'America/New_York'
log_executor_stats = on
datestyle = 'iso, mdy'
```

[...]

  Limit  |  Time (seconds)
---------|------------------
10       | 0.6
100      | 0.6
1000     | 1.3
10000    | 116.9
100000   | 134.8
1000000  | 193.2

Notice the jump in execution time from the 1k limit to the 10k limit: the
amount of data went up 10x, but the execution time went up 100x.

It seems to me that inserting the data in this case is slow because
identifying the duplicate records (which I assume is done in a fashion similar
to the queries above) takes a long time.

I have attached the `explain analyze` output for the 1k and 10k queries to
this email (they are 4k+ lines each; I didn't want to make this message bigger
than it already is).

* exp1k.txt
* exp10k.txt

[...]

I quickly glanced at the exp10k plan and there are some things I noticed (sorry for not going over all the mail, have to re-read it again):

- There are a lot of partitions now; you may want to consider reducing the number. To me it seems that you are overloading the system. Scan times are low, but the overhead to start a scan is likely quite high.
- work_mem = 96GB seems very high to me; I guess you'd be better off with e.g. 4GB as a start but many more parallel workers. For instance, depending on your machine, try adjusting max_worker_processes, max_parallel_workers and max_parallel_workers_per_gather. Values depend a bit on your system; make sure that max_parallel_workers_per_gather is much lower than max_parallel_workers, which in turn must be lower than max_worker_processes. You can try large values, for instance 128, 120, 12 (see the sketch after this list).
- You may want to test with min_parallel_table_scan_size = 0
- Did you enable partition pruning, partitionwise join and aggregate?
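
Put together, that could be something like the following; the numbers are rough guesses for a 24-thread machine, not values from the thread:

```
-- Rough starting point only; adjust to the machine.
alter system set work_mem = '4GB';
alter system set max_worker_processes = 24;              -- needs a restart
alter system set max_parallel_workers = 20;
alter system set max_parallel_workers_per_gather = 8;
alter system set min_parallel_table_scan_size = 0;
alter system set enable_partitionwise_join = on;
alter system set enable_partitionwise_aggregate = on;
select pg_reload_conf();
```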

Thanks,
Sebastian

--

Sebastian Dressler, Solution Architect 
+49 30 994 0496 72 | sebastian@swarm64.com 

Swarm64 AS
Parkveien 41 B | 0258 Oslo | Norway
Registered at Brønnøysundregistrene in Norway under Org.-Number 911 662 787
CEO/Geschäftsführer (Daglig Leder): Thomas Richter; Chairman/Vorsitzender (Styrets Leder): Dr. Sverre Munck 

Swarm64 AS Zweigstelle Hive
Ullsteinstr. 120 | 12109 Berlin | Germany
Registered at Amtsgericht Charlottenburg - HRB 154382 B

Re: Sudden insert performance degradation

From
Henrique Montenegro
Date:


On Wed, Jul 15, 2020 at 4:03 AM Sebastian Dressler <sebastian@swarm64.com> wrote:
[...]


Hi Sebastian,

That is a good idea about the parallel workers. I have tried to update them and will post the results as soon as I have them.
Regarding partition pruning, it is set to the default (which is on). partitionwise_join and partitionwise_aggregate are both set to off; I will turn them on as well and see how it goes.

Thanks for the suggestions! I will keep the list updated.

Henrique

Re: Sudden insert performance degradation

From
Henrique Montenegro
Date:
On Wed, Jul 15, 2020 at 8:24 AM Henrique Montenegro <typoon@gmail.com> wrote:


[...]


Changing those parameters had almost no effect on performance. I just executed the following `SELECT` again:

```
explain analyze
select user_id from users_basic_profile_no_dups_partitioned
where
    user_id in (
        select user_id from users_basic_profile order by user_id
    );
```

I am looking at the plan and seeing things like this:

```
 Index Only Scan using ubp_from_100036700000000_to_100036800000000_pkey on ubp_from_100036700000000_to_100036800000000 (cost=0.42..1.99 rows=1 width=8) (actual time=3.276..3.276 rows=1 loops=611)

    Index Cond: (user_id = users_basic_profile.user_id)
    Heap Fetches: 0
    Buffers: shared hit=1,688 read=146
```

Any idea why the actual time is in the 3ms range? If I query that partition directly, like this:

```
explain analyze select user_id from ubp_from_100036700000000_to_100036800000000 where user_id in (select user_id from users_basic_profile order by user_id);
```

I get this:

```
        ->  Index Only Scan using ubp_from_100036700000000_to_100036800000000_pkey on ubp_from_100036700000000_to_100036800000000  (cost=0.42..4.12 rows=1 width=8) (actual time=0.002..0.002 rows=0 loops=984904)
               Index Cond: (user_id = users_basic_profile.user_id)
               Heap Fetches: 0
```

As you can see, the `actual time` when querying the partition table directly goes down to 0.002 ms, which is almost 2000x faster.

My google fu is also coming up short on figuring that out. Any suggestions?

Thanks!

Henrique

Re: Sudden insert performance degradation

From
Justin Pryzby
Date:
On Wed, Jul 15, 2020 at 02:49:16PM -0400, Henrique Montenegro wrote:
> Any idea why the actual time is in the 3ms range? If I query that partition
> directly, like this:
> 
> As you can see, the `actual_time` when querying the partition table
> directly goes to 0.002 which is almost 2000x faster.

Because querying parents of 1000s of tables is slow.
That's improved in v12.  You can read a previous discussion about it here:
https://www.postgresql.org/message-id/20200223151038.GW31889@telsasoft.com

But I think you need to know more about partitioning.  It doesn't magically
make things faster for you, and if you just guess, then it's likely to perform
worse for reading and/or writing.

Partitioning only helps for INSERTs if nearly all the insertions happening at a
given time go into a small number of partitions.  Like inserting data
partitioned by "timestamp", where all the new data goes into a partition for
the current date.  Otherwise instead of one gigantic index which doesn't fit in
shared_buffers or RAM, you have some hundreds of indexes which also don't
simultaneously fit into RAM.  That doesn't help writes, and hurts planning
time.
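
To make the contrast concrete, the pattern described above would look roughly like this (sketch only); note that a unique constraint on user_id alone is not possible with this layout, since the partition key would have to be part of it:

```
-- Time-partitioned layout: all new rows land in the current day's partition,
-- so only that partition's index has to stay hot in memory.
create table users_by_day (
    created_ts timestamp without time zone not null,
    user_id bigint not null,
    name text,
    url text
) partition by range (created_ts);

create table users_by_day_2020_07_15 partition of users_by_day
    for values from ('2020-07-15') to ('2020-07-16');
```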

-- 
Justin