Thread: Breaking up a PostgreSQL COPY command into chunks?

Breaking up a PostgreSQL COPY command into chunks?

From: Victor Hooi
Date:
Hi,

We're using psycopg2 with COPY to dump CSV output from a large query.

The SELECT query itself is large (both in the number of records/columns and in the width of the column values), but it still completes in under a minute on the server.

However, when we run the same query through COPY, it often times out.

We're using psycopg2 to run the command; the traceback we get is something like:

Traceback (most recent call last):
 File "foo.py", line 259, in <module>
   jobs[job].run_all()
 File "foo.py", line 127, in run_all
   self.export_to_csv()
 File "foo.py", line 168, in export_to_csv
   cur.copy_expert(self.export_sql_statement, f)
psycopg2.extensions.TransactionRollbackError: canceling statement due to conflict with recovery
DETAIL:  User was holding shared buffer pin for too long.
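
For reference, the export method boils down to something like the sketch below (the connection string, query and file name here are placeholders, not our real ones):

    import psycopg2

    def export_to_csv(dsn, select_sql, out_path):
        # Stream the query result as CSV straight into a local file.
        copy_sql = "COPY ({0}) TO STDOUT WITH CSV HEADER".format(select_sql)
        with psycopg2.connect(dsn) as conn:
            with conn.cursor() as cur, open(out_path, "w") as f:
                cur.copy_expert(copy_sql, f)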

My question is: what are some simple ways to chunk up the query?

Could we pull down a list of all the ids (an auto-incrementing integer), split that list into groups, and then run multiple COPY commands with a WHERE clause for each group?

Or would it be better to use LIMIT/OFFSET to break it up? I'm not sure how we'd know when we'd reached the end of the result set, though (apart from just counting the rows?).

Or are there other approaches you guys could recommend?

Cheers,
Victor

Re: Breaking up a PostgreSQL COPY command into chunks?

From: wd
Date:
Try this,

max_standby_archive_delay = 600s        # max delay before canceling queries
                                        # when reading WAL from archive;
                                        # -1 allows indefinite delay
max_standby_streaming_delay = 600s      # max delay before canceling queries
                                        # when reading streaming WAL;
                                        # -1 allows indefinite delay

or try
pg_xlog_replay_pause() / pg_xlog_replay_resume()
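
For example, something like this rough, untested sketch (connection details and the query are placeholders; the pause/resume functions need superuser rights and only make sense on a hot standby):

    import psycopg2

    conn = psycopg2.connect("host=standby.example.com dbname=mydb user=postgres")
    conn.autocommit = True
    cur = conn.cursor()

    # Stop WAL replay so the long COPY cannot conflict with recovery,
    # then make sure replay is resumed afterwards.
    cur.execute("SELECT pg_xlog_replay_pause()")
    try:
        with open("dump.csv", "w") as f:
            cur.copy_expert("COPY (SELECT * FROM my_table) TO STDOUT WITH CSV HEADER", f)
    finally:
        cur.execute("SELECT pg_xlog_replay_resume()")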



Re: Breaking up a PostgreSQL COPY command into chunks?

From: Victor Hooi
Date:
Hi,

Hmm, ok, I'll pass that on to our DBA/operations guys and see if that helps.

Do these settings still work if you only have a single Postgres instance? (I'll need to check our setup.)

So my understanding is that the default is 30 seconds (http://www.postgresql.org/docs/9.3/static/runtime-config-replication.html) - but we're increasing it to 600 seconds, and that should give the COPY command enough time to pull down the data?

As a rough guide, the dumped CSV file is around 600 MB.

Is there any other background you might be able to give on what you think might be happening, or how this might fix it?

And you'd recommend tweaking these values over trying to chunk up the COPY/SELECT, is that right?

I've just realised the LIMIT/OFFSET approach may not paginate reliably, since records may be inserted or deleted between each time I call it?

Cheers,
Victor



Re: Breaking up a PostgreSQL COPY command into chunks?

From: Victor Hooi
Date:
Hi,

Aha, I spoke to somebody here; apparently we've actually got those values set to 15 minutes currently...

They think that it might be limited by the network, and how fast the PostgreSQL server can push the data across the internet. (The Postgres server and the box running the query are connected over the internet).

Cheers,
Victor


Re: Breaking up a PostgreSQL COPY command into chunks?

From: Francisco Olarte
Date:
On Fri, Nov 8, 2013 at 5:09 AM, Victor Hooi <victorhooi@yahoo.com> wrote:
> They think that it might be limited by the network, and how fast the
> PostgreSQL server can push the data across the internet. (The Postgres
> server and the box running the query are connected over the internet).

You previously said you had 600 MB. Over the internet. Is it a very fat pipe? Because otherwise the limiting factor is probably not the speed at which Postgres can push the results, but the throughput of your link.

If, as you stated, you need a single transaction to get a 600 MB snapshot, I would recommend dumping it to disk, compressing on the fly (you should easily get a four- or five-fold reduction on a CSV file with any decent compressor), and then sending the file. If you do not have disk space for the dump but can run programs near the server, you can try compressing on the fly there. If you have none of this but do have space for a spare table, use a SELECT INTO, paginate that output, and drop the table afterwards. Or just look at the configs and allow longer query times; if your app NEEDS two-hour queries, they can be enabled. But anyway, running a long transaction over the internet does not seem like a good idea to me.
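
For the first option, a small script run on or next to the database server could look roughly like this (names and paths are invented); you then ship the much smaller .gz file over the slow link however you like:

    import gzip
    import psycopg2

    query = "SELECT * FROM my_big_table"        # placeholder for the real query
    conn = psycopg2.connect("dbname=mydb")      # local, fast connection
    with conn, conn.cursor() as cur:
        # Write the CSV through gzip as it arrives, so only the compressed
        # file ever touches the disk.
        with gzip.open("/tmp/export.csv.gz", "wt") as f:
            cur.copy_expert("COPY ({0}) TO STDOUT WITH CSV HEADER".format(query), f)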

Francisco Olarte


Re: Breaking up a PostgreSQL COPY command into chunks?

From: Victor Hooi
Date:
Hi,

@Francisco - Yeah, the file is around 600 MB currently, uncompressed.

You're right, our internet connection is going to be the limiting factor.

Essentially, the PostgreSQL server is in a datacentre, the server we're dumping to is in the office.

Running a script on the PostgreSQL server in the datacentre is going to be tricky (not so much technically, just from a procedures/security point of view).

Dumping to a spare table seems like an interesting idea - so we'd just create the table, COPY the results into it, use LIMIT/OFFSET to paginate through it, and then drop the table afterwards?

Currently, I'm doing a quick hack where we download an ordered list of the ids (auto-incrementing integer) into Python, chunk it up into groups of ids, then use a WHERE IN clause to download each chunk via COPY.
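
In code the hack looks roughly like this (table and column names simplified, chunk size picked arbitrarily):

    import psycopg2

    CHUNK = 50000                               # ids per COPY, arbitrary
    conn = psycopg2.connect("host=db.example.com dbname=mydb")
    conn.autocommit = True

    with conn.cursor() as cur:
        cur.execute("SELECT id FROM my_table ORDER BY id")
        ids = [row[0] for row in cur.fetchall()]

    with open("dump.csv", "w") as f:
        for i in range(0, len(ids), CHUNK):
            id_list = ",".join(str(x) for x in ids[i:i + CHUNK])
            sql = ("COPY (SELECT * FROM my_table WHERE id IN ({0}) ORDER BY id) "
                   "TO STDOUT WITH CSV").format(id_list)
            with conn.cursor() as cur:
                cur.copy_expert(sql, f)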

Would dumping to a spare table and paginating be a better approach? Reasons? (Not challenging it, I just want to understand everything).

Cheers,
Victor


Re: Breaking up a PostgreSQL COPY command into chunks?

From: Francisco Olarte
Date:
Hi Victor:

> You're right, our internet connection is going to be the limiting factor.
...

Good to know. So you have one restriction: your bandwidth...

> Essentially, the PostgreSQL server is in a datacentre, the server we're
> dumping to is in the office.
> Running a script on the PostgreSQL server in the datacentre is going to be
> tricky (not so much technically, just from a procedures/security point of
> view).

... another restriction, you cannot run code on the server.

An aside here: Postgres can do a server-side COPY, and can even pipe the results to a program. So if you have 600 MB of spare disk (or about 150 MB plus access to gzip), or access to an ssh CLIENT on the server and an ssh SERVER in the office, you could transfer the file easily. I.e., if you have a directory which the server account can write to, you could do

    COPY the_table_or_query TO '/the/dir/the/filename' FORMAT ...

or

    COPY the_table_or_query TO PROGRAM 'gzip -c > /the/dir/the/filename.gz' FORMAT ...

and then transfer the appropriate file to the office at your leisure, using whichever method you are used to.

In fact, if you have an ssh server in your office accessible from the database server, you could transfer it there directly, or use a listening socket in your program plus netcat, but in that case you'll run into the same timeout problems (except that compressing shortens the transfer time).

> Dumping to a spare table seems like an interesting point - so we'd just
> create the table, COPY the results to that table, then use LIMIT/OFFSET to
> paginate through that, then drop the table afterwards?

That is more or less it, except you do not need COPY to fill it; you just create the table with the proper data directly, i.e., instead of "COPY (my query) ..." you do a 'CREATE TABLE tmpxxx AS SELECT ...'.
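
A rough sketch of that flow from the client side, assuming the exported rows keep an id column (table name, query and page size are invented):

    import psycopg2

    PAGE = 100000                               # rows per chunk, invented
    export_query = "SELECT * FROM my_table"     # placeholder for the real query

    conn = psycopg2.connect("host=db.example.com dbname=mydb")
    conn.autocommit = True
    cur = conn.cursor()

    # One short transaction to materialise a stable snapshot server-side,
    # plus an index so LIMIT/OFFSET pagination stays cheap.
    cur.execute("CREATE TABLE tmp_export AS " + export_query)
    cur.execute("CREATE INDEX ON tmp_export (id)")
    cur.execute("SELECT count(*) FROM tmp_export")
    total = cur.fetchone()[0]

    with open("dump.csv", "w") as f:
        for offset in range(0, total, PAGE):
            cur.copy_expert(
                "COPY (SELECT * FROM tmp_export ORDER BY id "
                "LIMIT {0} OFFSET {1}) TO STDOUT WITH CSV".format(PAGE, offset), f)

    cur.execute("DROP TABLE tmp_export")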

> Currently, I'm doing a quick hack where we download an ordered list of the
> ids (auto-incrementing integer) into Python, chunk it up into groups of ids,
> then use a WHERE IN clause to download each chunk via COPY.

This has a problem: you need to wrap everything in a single transaction to avoid the insert/update/delete-in-the-middle problem you mentioned previously, so your transaction time is going to be even longer. Also, if the integers are auto-incrementing, you can do better. First get the min and max id (I do not remember whether Postgres already optimizes these, or whether you'll need the 'ORDER BY ... LIMIT 1' trick). Then just loop over that range in appropriately sized steps (I am assuming the table is reasonably dense, but you can check the stats, and you should know your data patterns).
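
Something along these lines, with invented names and step size:

    import psycopg2

    STEP = 100000                               # id range per chunk, invented

    conn = psycopg2.connect("host=db.example.com dbname=mydb")
    conn.autocommit = True
    cur = conn.cursor()

    # min()/max() are cheap on an indexed id column; each chunk is then
    # a simple range scan, and a failed chunk can simply be re-run.
    cur.execute("SELECT min(id), max(id) FROM my_table")
    lo, hi = cur.fetchone()

    with open("dump.csv", "w") as f:
        start = lo
        while start <= hi:
            cur.copy_expert(
                "COPY (SELECT * FROM my_table WHERE id >= {0} AND id < {1} "
                "ORDER BY id) TO STDOUT WITH CSV".format(start, start + STEP), f)
            start += STEP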

> Would dumping to a spare table and paginating be a better approach? Reasons?
> (Not challenging it, I just want to understand everything).

The table is created as a single operation in a single transaction, which has no delay problems because it does not send any output to the client.

From there on the table is read-only, so you can use a transaction for
each chunk, and you know the ids do not change. Even if a chunk copy
fails due to the internet connection farting a bit, you just retry it.

If you are going to use LIMIT/OFFSET you'll need an index on the field anyway, but you can build the index in a separate transaction after filling the table (so you do not interfere with anyone).

You can even do everything in autocommit mode, saving server roundtrips.

Anyway, this can be done, but if you can do the server-side COPY and ssh thing, that is much better. I do not know how to do it on other OSs, but on Unix, if you have an ssh-accessible account with enough quota on the server, you could do something like create a directory in that user's home, give the Postgres server user adequate access to it, and let the server dump the data there.

Francisco Olarte.