Thread: Load a csv or a avro?
Hello all,
It's a Postgres database. We have the option of receiving files as CSV and/or Avro format messages from another system to load into our Postgres database. The volume will be ~300 million messages per day, across many files, in batches.
My question is: which format should we choose for faster data-loading performance? And are there any other aspects to consider apart from loading performance?
pá 5. 7. 2024 v 11:08 odesílatel sud <suds1434@gmail.com> napsal:
>
> Hello all,
>
> Its postgres database. We have option of getting files in csv and/or in avro format messages from another system to load it into our postgres database. The volume will be 300million messages per day across many files in batches.
>
> My question was, which format should we chose in regards to faster data loading performance ? and if any other aspects to it also should be considered apart from just loading performance?
We are able to load ~300 million rows per day using CSV and the COPY
functions (https://www.postgresql.org/docs/current/libpq-copy.html#LIBPQ-COPY-SEND).
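For reference, COPY from a CSV stream can be driven from client code. Here is a minimal sketch, assuming the psycopg2 driver; the table name `events` and the column list are hypothetical placeholders, not from the thread:

```python
# Sketch: bulk-load a CSV file with COPY ... FROM STDIN via psycopg2
# (assumed driver). Table/column names below are illustrative only.

def build_copy_sql(table, columns):
    """Build a COPY ... FROM STDIN statement for CSV input with a header row."""
    return "COPY {} ({}) FROM STDIN WITH (FORMAT csv, HEADER true)".format(
        table, ", ".join(columns))

def load_csv(conn, path, table, columns):
    """Stream one CSV file into Postgres in a single COPY round trip."""
    with open(path) as f, conn.cursor() as cur:
        cur.copy_expert(build_copy_sql(table, columns), f)
    conn.commit()

# Hypothetical usage:
#   conn = psycopg2.connect("dbname=mydb")
#   load_csv(conn, "batch_0001.csv", "events", ["id", "payload"])
```

Streaming the file straight through `copy_expert` avoids building per-row INSERT statements, which is what makes COPY practical at this volume.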
> Hello all,
>
> Its postgres database. We have option of getting files in csv and/or in avro format messages from another system to load it into our postgres database. The volume will be 300million messages per day across many files in batches.
>
> My question was, which format should we chose in regards to faster data loading performance ?

What application will be loading the data? If psql, then go with CSV; COPY is really efficient.

> and if any other aspects to it also should be considered apart from just loading performance?
Hi,

There are different data formats available; here are a few points on their performance implications:
1. CSV: easy to use and widely supported, but it can be slower due to parsing overhead.
2. Binary: faster to load, but not human-readable.

Hope this helps.

Regards,
Kashif Zeeshan
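To make the parsing-overhead point concrete, here is a small self-contained illustration (not Avro itself, just Python's stdlib `csv` module versus a fixed `struct` binary layout): the text form needs field splitting plus per-field type conversion, while the binary form unpacks directly but is opaque to a human reader.

```python
# Illustration only: the same three-field record as CSV text vs a fixed
# binary layout. The field values are made up for the example.
import csv
import io
import struct

# CSV: split the text, then convert each field's type by hand.
row = next(csv.reader(io.StringIO("42,3.14,ok\n")))
record_csv = (int(row[0]), float(row[1]), row[2])

# Binary: one struct.unpack call, no text parsing, but not human-readable.
packed = struct.pack("<id2s", 42, 3.14, b"ok")
i, f, s = struct.unpack("<id2s", packed)
record_bin = (i, f, s.decode())

assert record_csv == record_bin  # both yield the same record
```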
On 7/5/24 02:08, sud wrote:
> Hello all,
>
> Its postgres database. We have option of getting files in csv and/or in
> avro format messages from another system to load it into our postgres
> database. The volume will be 300million messages per day across many
> files in batches.
Are you dumping the entire contents of each file, or are you pulling
out a portion of the data?
On Fri, Jul 5, 2024 at 8:24 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
> On 7/5/24 02:08, sud wrote:
> Hello all,
>
> Its postgres database. We have option of getting files in csv and/or in
> avro format messages from another system to load it into our postgres
> database. The volume will be 300million messages per day across many
> files in batches.
> Are you dumping the entire contents of each file, or are you pulling
> out a portion of the data?

Yes, all the fields in the file have to be loaded to the columns in the tables in postgres.
But how does that matter for deciding whether we should ask the outside system for the data in .csv or .avro format, given it has to be loaded into the Postgres database in row-and-column form? My understanding was that, regardless of anything else, the .csv load will always be faster, because the data is already stored in row-and-column form, whereas with an .avro file the parser has to do additional work to turn it into rows and columns or map it to the table's columns. Is my understanding correct here?
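The "additional job" for Avro can be pictured as a flattening step: decoded records arrive as dicts and must be written out in column order before COPY can consume them. Below is a sketch using simulated records (a real Avro reader such as fastavro would supply them); the column names `id` and `payload` are hypothetical:

```python
# Sketch of the Avro-to-COPY mapping step. The "decoded" records here are
# hand-written stand-ins for what an Avro reader would yield.
import csv
import io

def records_to_csv(records, columns):
    """Flatten decoded records (dicts) into CSV text in a fixed column order."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for rec in records:
        writer.writerow([rec[c] for c in columns])
    return buf.getvalue()

decoded = [{"id": 1, "payload": "a"}, {"id": 2, "payload": "b"}]
csv_text = records_to_csv(decoded, ["id", "payload"])
```

The resulting text can then be fed to COPY; if the source system already emits CSV in the table's column order, this whole step disappears.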
On 7/6/24 13:09, sud wrote:
> Yes, all the fields in the file have to be loaded to the columns in the
> tables in postgres. [...] Is my understanding correct here?

If you are going to use complete rows and all rows then COPY of CSV in
Postgres would be your best choice.

--
Adrian Klaver
adrian.klaver@aklaver.com