Thread: speed up full table scan using psql

speed up full table scan using psql

From
Lian Jiang
Date:
hi,

I am using psql to periodically dump the postgres tables into json files which are imported into snowflake. For large tables (e.g. 70M rows), it takes hours for psql to complete. Using spark to read the postgres table does not seem to work: the postgres read-only replica is the bottleneck, so the spark cluster never uses more than one worker node and the working node times out or runs out of memory.

Will vertically scaling the postgres db speed up psql? Or can any thread-related parameter of psql help? Thanks for any hints.

Regards
Lian

Re: speed up full table scan using psql

From
Rob Sargent
Date:
On 5/30/23 22:25, Lian Jiang wrote:
hi,

I am using psql to periodically dump the postgres tables into json files which are imported into snowflake. For large tables (e.g. 70M rows), it takes hours for psql to complete. Using spark to read the postgres table does not seem to work: the postgres read-only replica is the bottleneck, so the spark cluster never uses more than one worker node and the working node times out or runs out of memory.

Will vertically scaling the postgres db speed up psql? Or can any thread-related parameter of psql help? Thanks for any hints.

Regards
Lian
Have you looked into COPY command?  Or CopyManager java class?
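
For instance (just a rough sketch; the table and output file names are placeholders), a server-side COPY streamed through psql:

psql <name of database> -c "copy public.mytable to stdout" > mytable.txt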

Re: speed up full table scan using psql

From
David Rowley
Date:
On Wed, 31 May 2023 at 16:26, Lian Jiang <jiangok2006@gmail.com> wrote:
> I am using psql to periodically dump the postgres tables into json files which are imported into snowflake. For large tables (e.g. 70M rows), it takes hours for psql to complete. Using spark to read the postgres table does not seem to work: the postgres read-only replica is the bottleneck, so the spark cluster never uses more than one worker node and the working node times out or runs out of memory.
>
> Will vertically scaling the postgres db speed up psql? Or can any thread-related parameter of psql help? Thanks for any hints.

This is probably slow due to psql's alignment.  It needs to read the
entire result to know how much whitespace to pad columns with before
it can output anything. Naturally, that requires quite a bit of memory
when large tables are queried, and also a lot of length checking.
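
One rough sketch of working around that while staying with plain psql output (the flags are standard psql but worth double-checking for your version: -A for unaligned output, -t for tuples only, and the FETCH_COUNT variable to fetch in batches through a cursor rather than buffering the whole result):

echo "select * from your_table" | psql -At -v FETCH_COUNT=10000 <name of database> > your_table.txt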

As mentioned by Rob, you'll probably want to use COPY, or you could
look at using pg_dump.  Something like:

pg_dump --inserts --rows-per-insert=100 --table=<name of your table to export> --data-only <database name>

should give you something that you might have some luck directly
importing to Snowflake without any transformation tool required.

However, if you do need to do some transformation before loading, then
you might want to do that in PostgreSQL.  For that, something like:

psql -c "copy (select col1,col2,col3+col4 from your_table) to stdout"
<name of database>

would allow you to run a query in which you could do your
transformations before importing into Snowflake.

David



Re: speed up full table scan using psql

From
Lian Jiang
Date:
Thanks! Will try.

Is there a way to speed up the Spark ingestion (postgres -> spark -> snowflake), which does not have issues such as alignment or an inefficient json file format? I favor spark since it simplifies the ingest path of postgres -> local json.gz -> s3 -> snowflake stage -> snowflake table. Hope this makes sense.
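
For what it's worth, a rough sketch of the current non-Spark path (the bucket, stage name, and part size are hypothetical, and the actual sed/tr formatting steps are left out):

psql $db_url -c "copy (select row_to_json(t) from public.mytable t) to stdout" | split -l 1000000 --filter='gzip > $FILE.json.gz' - mytable_
aws s3 cp . s3://my-bucket/mytable/ --recursive --exclude '*' --include 'mytable_*.json.gz'
# then, in Snowflake: COPY INTO mytable FROM @my_s3_stage/mytable/ FILE_FORMAT = (TYPE = 'JSON');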

On Tue, May 30, 2023 at 10:17 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 31 May 2023 at 16:26, Lian Jiang <jiangok2006@gmail.com> wrote:
> I am using psql to periodically dump the postgres tables into json files which are imported into snowflake. For large tables (e.g. 70M rows), it takes hours for psql to complete. Using spark to read the postgres table does not seem to work: the postgres read-only replica is the bottleneck, so the spark cluster never uses more than one worker node and the working node times out or runs out of memory.
>
> Will vertically scaling the postgres db speed up psql? Or can any thread-related parameter of psql help? Thanks for any hints.

This is probably slow due to psql's alignment.  It needs to read the
entire result to know how much whitespace to pad columns with before
it can output anything. Naturally, that requires quite a bit of memory
when large tables are queried, and also a lot of length checking.

As mentioned by Rob, you'll probably want to use COPY, or you could
look at using pg_dump.  Something like:

pg_dump --inserts --rows-per-insert=100 --table=<name of your table to export> --data-only <database name>

should give you something that you might have some luck directly
importing to Snowflake without any transformation tool required.

However, if you do need to do some transformation before loading, then
you might want to do that in PostgreSQL.  For that, something like:

psql -c "copy (select col1,col2,col3+col4 from your_table) to stdout"
<name of database>

would allow you to run a query in which you could do your
transformations before importing into Snowflake.

David



Re: speed up full table scan using psql

From
Adrian Klaver
Date:
On 5/30/23 21:25, Lian Jiang wrote:
> hi,
> 
> I am using psql to periodically dump the postgres tables into json files 
> which are imported into snowflake. For large tables (e.g. 70M rows), it 

The command you are using is?

Postgres version?

> takes hours for psql to complete. Using spark to read the postgres table 
> does not seem to work: the postgres read-only replica is the bottleneck, 
> so the spark cluster never uses more than one worker node and the working 
> node times out or runs out of memory.
> 
> Will vertically scaling the postgres db speed up psql? Or can any 
> thread-related parameter of psql help? Thanks for any hints.
> 
> Regards
> Lian

-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: speed up full table scan using psql

From
Lian Jiang
Date:
The command is: psql $db_url -c "copy (select row_to_json(x_tmp_uniq) from public.mytable x_tmp_uniq) to stdout"
postgres version:  14.7
Does this mean COPY and java CopyManager may not help since my psql command already uses copy?

Regarding pg_dump, it does not support json format, which means extra work is needed to convert one of the supported formats to jsonl (or parquet) so that it can be imported into snowflake. Still exploring but want to call it out early. Maybe 'custom' format can be parquet?


Thanks
Lian

Re: speed up full table scan using psql

From
Adrian Klaver
Date:
On 5/31/23 13:57, Lian Jiang wrote:
> The command is: psql $db_url -c "copy (select row_to_json(x_tmp_uniq) 
> from public.mytable x_tmp_uniq) to stdout"
> postgres version:  14.7
> Does this mean COPY and java CopyManager may not help since my psql 
> command already uses copy?

I don't think the issue is COPY itself but row_to_json(x_tmp_uniq).

This:

https://towardsdatascience.com/spark-essentials-how-to-read-and-write-data-with-pyspark-5c45e29227cd

indicates Spark can use CSV as an input source.

Given that, I would just COPY the data out as CSV.
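
For example (a rough sketch; the database URL, table, and output path are placeholders):

psql $db_url -c "copy public.mytable to stdout with (format csv, header)" > mytable.csv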

> 
> Regarding pg_dump, it does not support json format which means extra 
> work is needed to convert the supported format to jsonl (or parquet) so 
> that they can be imported into snowflake. Still exploring but want to 
> call it out early. Maybe 'custom' format can be parquet?
> 
> 
> Thanks
> Lian

-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: speed up full table scan using psql

From
Adrian Klaver
Date:
On 5/31/23 13:57, Lian Jiang wrote:
> The command is: psql $db_url -c "copy (select row_to_json(x_tmp_uniq) 
> from public.mytable x_tmp_uniq) to stdout"
> postgres version:  14.7
> Does this mean COPY and java CopyManager may not help since my psql 
> command already uses copy?
> 
> Regarding pg_dump, it does not support json format which means extra 
> work is needed to convert the supported format to jsonl (or parquet) so 
> that they can be imported into snowflake. Still exploring but want to 
> call it out early. Maybe 'custom' format can be parquet?

Oops I read this:

'...Using spark to read the postgres table...'

and missed that you are trying to load into Snowflake.

It seems Snowflake supports CSV as well:

https://docs.snowflake.com/en/user-guide/data-load-prepare

So the previous advice should still hold.


> 
> 
> Thanks
> Lian

-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: speed up full table scan using psql

From
Thorsten Glaser
Date:
On Wed, 31 May 2023, Adrian Klaver wrote:

> Given that, I would just COPY the data out as CSV.

I recently did something similar. I found the JSON functions not quite
satisfying and the extra spaces redundant, but it turns out that, for a
numerical table, exporting as CSV, loading that via AJAX and then (on
xhr.responseText) substituting newlines with '],[', prepending '[[' and
appending ']]' was enough to let JSON.parse eat it.

With strings this is more complex ofc (though partial use of JSON
functions, e.g. to convert strings to JSONString already, might help).

bye,
//mirabilos
--
Infrastrukturexperte • tarent solutions GmbH
Am Dickobskreuz 10, D-53121 Bonn • http://www.tarent.de/
Telephon +49 228 54881-393 • Fax: +49 228 54881-235
HRB AG Bonn 5168 • USt-ID (VAT): DE122264941
Geschäftsführer: Dr. Stefan Barth, Kai Ebenrett, Boris Esser, Alexander Steeg

                        ****************************************************
/⁀\ The UTF-8 Ribbon
╲ ╱ Campaign against      Mit dem tarent-Newsletter nichts mehr verpassen:
 ╳  HTML eMail! Also,     https://www.tarent.de/newsletter
╱ ╲ header encryption!
                        ****************************************************



Re: speed up full table scan using psql

From
Adrian Klaver
Date:
On 5/31/23 13:57, Lian Jiang wrote:
> The command is: psql $db_url -c "copy (select row_to_json(x_tmp_uniq) 
> from public.mytable x_tmp_uniq) to stdout"

What is taking the stdout and what is it doing?

> postgres version:  14.7
> Does this mean COPY and java CopyManager may not help since my psql 
> command already uses copy?
> 
> Regarding pg_dump, it does not support json format which means extra 
> work is needed to convert the supported format to jsonl (or parquet) so 
> that they can be imported into snowflake. Still exploring but want to 
> call it out early. Maybe 'custom' format can be parquet?
> 
> 
> Thanks
> Lian

-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: speed up full table scan using psql

From
Lian Jiang
Date:
The whole command is:

psql %(pg_uri)s -c %(sql)s | %(sed)s | %(pv)s | %(split)s) 2>&1 | %(tr)s

where:
sql is "copy (select row_to_json(x_tmp_uniq) from public.mytable x_tmp_uniq) to stdout"
sed, pv, split, tr together format and split the stdout into jsonl files.

Hope this helps.


On Wed, May 31, 2023 at 9:16 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 5/31/23 13:57, Lian Jiang wrote:
> The command is: psql $db_url -c "copy (select row_to_json(x_tmp_uniq)
> from public.mytable x_tmp_uniq) to stdout"

What is taking the stdout and what is it doing?

> postgres version:  14.7
> Does this mean COPY and java CopyManager may not help since my psql
> command already uses copy?
>
> Regarding pg_dump, it does not support json format which means extra
> work is needed to convert the supported format to jsonl (or parquet) so
> that they can be imported into snowflake. Still exploring but want to
> call it out early. Maybe 'custom' format can be parquet?
>
>
> Thanks
> Lian

--
Adrian Klaver
adrian.klaver@aklaver.com




Re: speed up full table scan using psql

From
Adrian Klaver
Date:
On 5/31/23 22:51, Lian Jiang wrote:
> The whole command is:
> 
> psql %(pg_uri)s -c %(sql)s | %(sed)s | %(pv)s | %(split)s) 2>&1 | %(tr)s
> 
> where:
> sql is "copy (select row_to_json(x_tmp_uniq) from public.mytable 
> x_tmp_uniq) to stdout"
> sed, pv, split, tr together format and split the stdout into jsonl files.

Well, that is quite the pipeline. At this point I think you need to do 
some testing on your end. First create a table that is a subset of the 
original data to make testing a little quicker.  Then break the process 
down into smaller actions. Start with just doing a COPY directly to CSV 
and one with the row_to_json to see if that makes a difference. Then 
COPY directly to a file before applying the above pipeline. There are 
more ways you can slice this depending on what the preceding shows you.
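
Something along these lines, say (a sketch only; the sample size, table, and file names are placeholders):

psql $db_url -c "create table mytable_sample as select * from public.mytable limit 1000000"
time psql $db_url -c "copy mytable_sample to stdout with (format csv)" > /dev/null
time psql $db_url -c "copy (select row_to_json(t) from mytable_sample t) to stdout" > /dev/null
time psql $db_url -c "copy (select row_to_json(t) from mytable_sample t) to stdout" > sample.jsonl

and then point the sed/pv/split/tr steps at sample.jsonl to time that part on its own.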

> 
> Hope this helps.

-- 
Adrian Klaver
adrian.klaver@aklaver.com