Thread: COPY TO STDOUT Apache Arrow support

COPY TO STDOUT Apache Arrow support

From: Adam Lippai
Date: 21 April 2022, 17:41:17

Hi,

Would it be possible to add the Apache Arrow streaming format to the COPY backend and frontend?

The use case is fetching (or storing) tens or hundreds of millions of rows for client-side data science purposes (Pandas, Apache Arrow compute kernels, Parquet conversion, etc.). It looks like the serialization overhead when using the PostgreSQL wire format can be significant.
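
A minimal sketch of where that overhead comes from, using only the Python standard library. It is not the Arrow IPC streaming format and not PostgreSQL's actual wire encoding; the sample rows and all function names are hypothetical. It only contrasts a row-oriented text payload (similar in spirit to COPY's text format), where every field is split and parsed on its own, with a column-oriented layout, where each column is one contiguous fixed-width buffer that can be loaded in bulk.

```python
import array

# Hypothetical sample data: (id, value) rows standing in for a query result.
rows = [(i, i * 0.5) for i in range(5)]

def encode_rows_as_text(rows):
    """Row-oriented, tab-separated text, one line per row."""
    return "".join(f"{i}\t{v!r}\n" for i, v in rows).encode("ascii")

def decode_rows_from_text(payload):
    """Every field has to be split and parsed individually."""
    ids, values = [], []
    for line in payload.decode("ascii").splitlines():
        id_text, value_text = line.split("\t")
        ids.append(int(id_text))
        values.append(float(value_text))
    return ids, values

def encode_columns(rows):
    """Column-oriented: one contiguous fixed-width buffer per column."""
    ids = array.array("q", (i for i, _ in rows))      # 64-bit signed integers
    values = array.array("d", (v for _, v in rows))   # 64-bit floats
    return ids.tobytes(), values.tobytes()

def decode_columns(id_bytes, value_bytes):
    """Each column is loaded with one bulk copy, with no per-field parsing."""
    ids = array.array("q")
    ids.frombytes(id_bytes)
    values = array.array("d")
    values.frombytes(value_bytes)
    return ids.tolist(), values.tolist()

# Both encodings round-trip to the same columns.
text_result = decode_rows_from_text(encode_rows_as_text(rows))
column_result = decode_columns(*encode_columns(rows))
```

In the row-oriented path the work grows with the number of fields that must be parsed, while the columnar path does a fixed number of buffer copies per column. A real Arrow stream additionally carries a schema, validity bitmaps and record-batch framing, none of which is modelled here.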

Best regards,

Adam Lippai

Re: COPY TO STDOUT Apache Arrow support

From: Adam Lippai
Date: 13 April 2023, 21:35:48

Hi,

There have been two major developments on this topic:

Pandas 2.0 has been released, and it can use Apache Arrow as a backend.
Apache Arrow ADBC has been released, which standardizes the client API. It currently uses the PostgreSQL wire protocol underneath.

Best regards,

Adam Lippai

Re: COPY TO STDOUT Apache Arrow support

From: Adam Lippai
Date: 03 May 2023, 06:14:44

Hi,

There is also a new Arrow C library, nanoarrow (one .h and one .c file), which makes it easier to use Arrow from the PostgreSQL codebase.

https://arrow.apache.org/blog/2023/03/07/nanoarrow-0.1.0-release/

https://github.com/apache/arrow-nanoarrow/tree/main/dist

Best regards,

Adam Lippai

Re: COPY TO STDOUT Apache Arrow support

From: Pavel Stehule
Date: 03 May 2023, 07:01:27

With commit 9fcdf2c787ac6da330165ea3cd50ec5155943a2b, it can be implemented in an extension.

Regards

Pavel
