Thread: Faster data load

Faster data load

From
Lok P
Date:
Hi, 

We have a requirement to create approximately 50 billion rows in a partitioned table (~1 billion rows per partition, 200+ GB daily partitions) for a performance test. We are currently using the 'insert into <target table_partition> select .. from <source_table_partition> or <some transformed query>;' method. We dropped all indexes and constraints first and then do the load, but it is still taking 2-3 hours to populate one partition. Is there a faster way to achieve this?

A few teammates are suggesting we use the COPY command with a file load instead, which they say will be faster. So I wanted to understand: what does COPY do differently behind the scenes compared to an INSERT ... SELECT, which deals only with the SQL engine?

Additionally, when we tried to create indexes on one partition after the data load, it took 30+ minutes. Is there any way to make that faster?

Is there any way to drive the above operations in parallel, utilizing the database's full resources?

It's postgres 15.4

Regards
Lok

Re: Faster data load

From
Ron Johnson
Date:
On Thu, Sep 5, 2024 at 4:14 PM Lok P <loknath.73@gmail.com> wrote:
Hi, 

We have a requirement to create approximately 50 billion rows in a partitioned table (~1 billion rows per partition, 200+ GB daily partitions) for a performance test. We are currently using the 'insert into <target table_partition> select .. from <source_table_partition> or <some transformed query>;' method. We dropped all indexes and constraints first and then do the load, but it is still taking 2-3 hours to populate one partition.

At three hours, that's 92,593 records/second.  Seems pretty slow.

How much of that time is taken by <some transformed query>?
How big are the records?
How fast is the hardware?

Is there a faster way to achieve this? 

Testing is the only way to know for sure.
 
A few teammates are suggesting we use the COPY command with a file load instead, which they say will be faster. So I wanted to understand: what does COPY do differently behind the scenes compared to an INSERT ... SELECT, which deals only with the SQL engine?

COPY is highly optimized for buffered operation.  INSERT... maybe not so much.

But if the source data is already in a table, that would require piping the data to stdout and then back into the database.

psql appdb -c "COPY (SELECT ...) TO STDOUT;" | psql appdb -c "COPY some_table FROM STDIN;"  Use binary mode, so text conversion isn't required.

Maybe that's faster, maybe not.
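A sketch of the binary-mode version (same placeholder names as above; binary COPY requires the column types to match exactly on both ends):

psql appdb -c "COPY (SELECT ...) TO STDOUT (FORMAT binary)" \
  | psql appdb -c "COPY some_table FROM STDIN (FORMAT binary)"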

Additionally, when we tried to create indexes on one partition after the data load, it took 30+ minutes. Is there any way to make that faster?

Is there any way to drive the above operations in parallel, utilizing the database's full resources?

Put the destination tables in a different tablespace on a different controller.
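For instance, something along these lines (tablespace name, path, and partition bounds are hypothetical):

CREATE TABLESPACE load_ts LOCATION '/mnt/other_controller/pgdata';
CREATE TABLE target_p2024_09_01 PARTITION OF target_table
    FOR VALUES FROM ('2024-09-01') TO ('2024-09-02')
    TABLESPACE load_ts;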
 
It's postgres 15.4

Why not 15.8?

--
Death to America, and butter sauce.
Iraq lobster!

Re: Faster data load

From
Jeff Ross
Date:
On 9/5/24 14:14, Lok P wrote:

> Hi,
>
> We have a requirement to create approximately 50 billion rows in a
> partitioned table (~1 billion rows per partition, 200+ GB daily
> partitions) for a performance test. We are currently using the 'insert
> into <target table_partition> select .. from <source_table_partition>
> or <some transformed query>;' method. We dropped all indexes and
> constraints first and then do the load, but it is still taking 2-3
> hours to populate one partition. Is there a faster way to achieve this?
>
> A few teammates are suggesting we use the COPY command with a file load
> instead, which they say will be faster. So I wanted to understand: what
> does COPY do differently behind the scenes compared to an INSERT ...
> SELECT, which deals only with the SQL engine?
>
> Additionally, when we tried to create indexes on one partition after
> the data load, it took 30+ minutes. Is there any way to make that faster?
>
> Is there any way to drive the above operations in parallel, utilizing
> the database's full resources?
>
> It's postgres 15.4
>
> Regards
> Lok

Try pg_bulkload to load the data--it takes a little setup but it is very 
fast.  Do pay attention to the caveats; for a performance test they 
probably won't be relevant.

https://github.com/ossc-db/pg_bulkload?tab=readme-ov-file
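If memory serves, usage is a control file plus a command-line run, roughly like this (table, file, and database names are placeholders; verify the exact keys against the README above):

OUTPUT = public.target_table_p1    # target table
INPUT = /data/part1.csv            # input data file
TYPE = CSV                         # input format
WRITER = PARALLEL                  # reader and writer run as separate processes

pg_bulkload -d appdb part1.ctl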

Jeff




Re: Faster data load

From
Muhammad Usman Khan
Date:
Hi,

You can use pg_partman. If your table is partitioned, you can manage partitions in parallel by distributing the load across them concurrently. Alternatively, you can use Citus; it can be an excellent solution, especially for handling large data volumes and parallelizing data operations.
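A minimal sketch of driving per-partition loads concurrently from the shell (partition names, database name, and the query are placeholders):

# one INSERT ... SELECT per daily partition, launched in parallel
for p in p20240901 p20240902 p20240903; do
  psql appdb -c "INSERT INTO target_$p SELECT * FROM source_$p;" &
done
wait   # block until all background loads have finished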


On Fri, 6 Sept 2024 at 01:14, Lok P <loknath.73@gmail.com> wrote:
Hi, 

We have a requirement to create approximately 50 billion rows in a partitioned table (~1 billion rows per partition, 200+ GB daily partitions) for a performance test. We are currently using the 'insert into <target table_partition> select .. from <source_table_partition> or <some transformed query>;' method. We dropped all indexes and constraints first and then do the load, but it is still taking 2-3 hours to populate one partition. Is there a faster way to achieve this?

A few teammates are suggesting we use the COPY command with a file load instead, which they say will be faster. So I wanted to understand: what does COPY do differently behind the scenes compared to an INSERT ... SELECT, which deals only with the SQL engine?

Additionally, when we tried to create indexes on one partition after the data load, it took 30+ minutes. Is there any way to make that faster?

Is there any way to drive the above operations in parallel, utilizing the database's full resources?

It's postgres 15.4

Regards
Lok

Re: Faster data load

From
Lok P
Date:

On Fri, 6 Sept, 2024, 9:20 am Muhammad Usman Khan, <usman.k@bitnine.net> wrote:
Hi,

You can use pg_partman. If your table is partitioned, you can manage partitions in parallel by distributing the load across them concurrently. Alternatively, you can use Citus; it can be an excellent solution, especially for handling large data volumes and parallelizing data operations.


Thank you. 
The tables are partitioned. Also, during index creation we are trying to build indexes on multiple partitions at the same time from multiple sessions, but we are seeing an out-of-memory error in the 5th or 6th session. Even then, each session takes ~30 minutes per partition for the index creation. Attaching the index partitions happens in seconds, though.

Re: Faster data load

From
Ron Johnson
Date:
On Fri, Sep 6, 2024 at 12:43 AM Lok P <loknath.73@gmail.com> wrote:
Also, during index creation we are trying to build indexes on multiple partitions at the same time from multiple sessions, but we are seeing an out-of-memory error in the 5th or 6th session.

Had that same problem during pg_restore.  Reduced maintenance_work_mem and the problem went away.
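Each concurrent CREATE INDEX session can use up to maintenance_work_mem, so five or six sessions need roughly that many multiples of it in RAM; lowering the setting lets more sessions fit. A sketch with hypothetical names and values:

SET maintenance_work_mem = '1GB';           -- per-session cap; parallel workers share this budget
SET max_parallel_maintenance_workers = 2;   -- workers used by a single CREATE INDEX
CREATE INDEX idx_target_p1_col1 ON target_p1 (col1);   -- hypothetical index and partition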
 
--
Death to America, and butter sauce.
Iraq lobster!

Re: Faster data load

From
Adrian Klaver
Date:
On 9/8/24 10:45, Peter J. Holzer wrote:
> On 2024-09-06 01:44:00 +0530, Lok P wrote:
>> We have a requirement to create approximately 50 billion rows in a
>> partitioned table (~1 billion rows per partition, 200+ GB daily
>> partitions) for a performance test. We are currently using the 'insert
>> into <target table_partition> select .. from <source_table_partition> or
>> <some transformed query>;' method. We dropped all indexes and constraints
>> first and then do the load, but it is still taking 2-3 hours to populate
>> one partition.
> 
> That seems quite slow. Is the table very wide or does it have a large
> number of indexes?
> 
>> Is there a faster way to achieve this?
>>
>> A few teammates are suggesting we use the COPY command with a file load
>> instead, which they say will be faster.
> 
> I doubt that.
> 
> I benchmarked several strategies for populating tables 5 years ago and
> (for my test data and on our hardware at the time - YMMV) a simple
> INSERT ... SELECT was more than twice as fast as 8 parallel COPY
> operations (and about 8 times as fast as a single COPY).
>
> Details will have changed since then (I should rerun that benchmark on
> a current system), but I'd be surprised if COPY became that much faster
> relative to INSERT ... SELECT.

Yeah, they seem to have changed a great deal. Though you are correct in 
saying COPY is not faster than INSERT ... SELECT:


select version();
                                  version
---------------------------------------------------------------------------
 PostgreSQL 16.4 (Ubuntu 16.4-1.pgdg22.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit


select count(*) from nyc_taxi_duckdb ;
   count
---------
  2846722

ll -h nyc_taxi.csv
-rw-rw-r-- 1 aklaver aklaver 252M Sep  8 10:54 nyc_taxi.csv

insert into nyc_duplicate select * from nyc_taxi_duckdb ;
INSERT 0 2846722
Time: 7015.072 ms (00:07.015)

truncate  nyc_duplicate ;

\copy nyc_duplicate from 'nyc_taxi.csv' with csv header
COPY 2846722
Time: 8760.197 ms (00:08.760)

copy nyc_duplicate from '/tmp/nyc_taxi.csv' with csv header;
COPY 2846722
Time: 7904.279 ms (00:07.904)

Just to see what the coming attraction offers:

select version();
                                  version
---------------------------------------------------------------------------
 PostgreSQL 17rc1 (Ubuntu 17~rc1-1.pgdg22.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit


insert into nyc_duplicate select * from nyc_taxi_duckdb ;
INSERT 0 2846722
Time: 5315.878 ms (00:05.316)

\copy nyc_duplicate from 'nyc_taxi.csv' with csv header
COPY 2846722
Time: 10042.129 ms (00:10.042)

copy nyc_duplicate from '/tmp/nyc_taxi.csv' with csv header;
COPY 2846722
Time: 8422.503 ms (00:08.423)

> 
>          hp
> 

-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: Faster data load

From
Dominique Devienne
Date:
On Sun, Sep 8, 2024 at 8:27 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
> > a simple INSERT ... SELECT was more than twice as fast as 8 parallel COPY
> > operations (and about 8 times as fast as a single COPY).
>
> Yeah, they seem to have changed a great deal. Though you are correct in
> saying COPY is not faster than INSERT ... SELECT

Interesting. Thanks for sharing. Although, to be fair, that adds CSV parsing
to the mix, something INSERT-FROM-SELECT does not have to do, which may
skew the results a little. Comparing against COPY BINARY would be fairer,
but less practical, I guess.
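For reference, the binary round trip would look something like this in psql (not benchmarked here; binary format ties the file to the exact column types):

\copy nyc_taxi_duckdb to 'nyc_taxi.bin' with (format binary)
\copy nyc_duplicate from 'nyc_taxi.bin' with (format binary)

--DD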