Thread: Restoring large tables with COPY
Maybe I am missing something obvious, but I am unable to load larger
tables (~300k rows) with the COPY command that pg_dump produces by
default.  Yes, dump as INSERTs works, but is slow.  "Can't" as in "it
does not work with the default setup I have running on my devel
machine" - 128M mem, 128M swap, basically default postgresql.conf:

1) Too few WAL files.
   - well, increase the wal_files (eg to 32),
2) Machine runs out of swap, PostgreSQL seems to keep the whole TX
   in memory.
   - So I must put in 1G of swap?  But what if I have 1G of rows?

Or shortly: during pg_restore the resource requirements are an order
of magnitude higher than during pg_dump, which is non-obvious and may
be a bad surprise when in real trouble.  This is annoying, especially
as dump-as-COPYs should be preferred, being faster and smaller.  Of
course dump-as-INSERTs also has a positive side - eg. ALTER TABLE
DROP COLUMN with sed...

Patch below implements a '-m NUM' switch to pg_dump, which splits
each COPY command into chunks of at most NUM rows each.

Comments?  What am I missing?

-- 
marko


Index: doc/src/sgml/ref/pg_dump.sgml
===================================================================
RCS file: /opt/cvs/pgsql/pgsql/doc/src/sgml/ref/pg_dump.sgml,v
retrieving revision 1.41
diff -u -c -r1.41 pg_dump.sgml
*** doc/src/sgml/ref/pg_dump.sgml	8 Dec 2001 03:24:37 -0000	1.41
--- doc/src/sgml/ref/pg_dump.sgml	11 Dec 2001 03:58:30 -0000
***************
*** 35,40 ****
--- 35,41 ----
    <arg>-f <replaceable>file</replaceable></arg>
    <arg>-F <replaceable>format</replaceable></arg>
    <arg>-i</arg>
+   <arg>-m <replaceable>num_rows</replaceable></arg>
    <group> <arg>-n</arg> <arg>-N</arg> </group>
    <arg>-o</arg>
    <arg>-O</arg>
***************
*** 301,306 ****
--- 302,321 ----
        if you need to override the version check (and if
        <command>pg_dump</command> then fails, don't
        say you weren't warned).
+      </para>
+     </listitem>
+    </varlistentry>
+ 
+    <varlistentry>
+     <term>-m <replaceable class="parameter">num_rows</replaceable></term>
+     <term>--maxrows=<replaceable class="parameter">num_rows</replaceable></term>
+     <listitem>
+      <para>
+       Set the maximum number of rows to put into one COPY statement.
+       This starts a new COPY command after every
+       <replaceable class="parameter">num_rows</replaceable> rows.
+       This is useful on large tables, to avoid restoring a whole table
+       in one transaction, which may consume a lot of resources.
+      </para>
+     </listitem>
+    </varlistentry>

Index: src/bin/pg_dump/pg_dump.c
===================================================================
RCS file: /opt/cvs/pgsql/pgsql/src/bin/pg_dump/pg_dump.c,v
retrieving revision 1.236
diff -u -c -r1.236 pg_dump.c
*** src/bin/pg_dump/pg_dump.c	28 Oct 2001 06:25:58 -0000	1.236
--- src/bin/pg_dump/pg_dump.c	11 Dec 2001 04:48:42 -0000
***************
*** 116,121 ****
--- 116,123 ----
  bool		dataOnly;
  bool		aclsSkip;
  
+ int			g_max_copy_rows = 0;
+ 
  char		g_opaque_type[10];		/* name for the opaque type */
  
  /* placeholders for the delimiters for comments */
***************
*** 151,156 ****
--- 153,159 ----
  	"  -h, --host=HOSTNAME      database server host name\n"
  	"  -i, --ignore-version     proceed even when server version mismatches\n"
  	"                           pg_dump version\n"
+ 	"   m, --maxrows=NUM        max rows in one COPY command\n"
  	"  -n, --no-quotes          suppress most quotes around identifiers\n"
  	"  -N, --quotes             enable most quotes around identifiers\n"
  	"  -o, --oids               include oids in dump\n"
***************
*** 187,192 ****
--- 190,196 ----
  	"                           pg_dump version\n"
  	"  -n                       suppress most quotes around identifiers\n"
  	"  -N                       enable most quotes around identifiers\n"
+ 	"   m NUM                   max rows in one COPY command\n"
  	"  -o                       include oids in dump\n"
  	"  -O                       do not output \\connect commands in plain\n"
  	"                           text format\n"
***************
*** 244,249 ****
--- 248,255 ----
  	int			ret;
  	bool		copydone;
  	char		copybuf[COPYBUFSIZ];
+ 	int			cur_row;
+ 	int			linestart;
  
  	if (g_verbose)
  		write_msg(NULL, "dumping out the contents of table %s\n", classname);
***************
*** 297,302 ****
--- 303,310 ----
  	else
  	{
  		copydone = false;
+ 		linestart = 1;
+ 		cur_row = 0;
  
  		while (!copydone)
  		{
***************
*** 310,316 ****
--- 318,338 ----
  			}
  			else
  			{
+ 				/*
+ 				 * Avoid too large transactions by breaking them up.
+ 				 */
+ 				if (g_max_copy_rows > 0 && linestart
+ 					&& cur_row >= g_max_copy_rows)
+ 				{
+ 					cur_row = 0;
+ 					archputs("\\.\n", fout);
+ 					archprintf(fout, "COPY %s %sFROM stdin;\n",
+ 							   fmtId(classname, force_quotes),
+ 							   (oids && hasoids) ? "WITH OIDS " : "");
+ 				}
+ 
  				archputs(copybuf, fout);
+ 
  				switch (ret)
  				{
  					case EOF:
***************
*** 318,325 ****
--- 340,350 ----
  						/* FALLTHROUGH */
  					case 0:
  						archputc('\n', fout);
+ 						cur_row++;
+ 						linestart = 1;
  						break;
  					case 1:
+ 						linestart = 0;
  						break;
  				}
  			}
***************
*** 696,701 ****
--- 721,727 ----
  		{"compress", required_argument, NULL, 'Z'},
  		{"help", no_argument, NULL, '?'},
  		{"version", no_argument, NULL, 'V'},
+ 		{"maxrows", required_argument, NULL, 'm'},
  
  		/*
  		 * the following options don't have an equivalent short option
***************
*** 748,756 ****
  	}
  
  #ifdef HAVE_GETOPT_LONG
! 	while ((c = getopt_long(argc, argv, "abcCdDf:F:h:inNoOp:RsS:t:uU:vWxX:zZ:V?", long_options, &optindex)) != -1)
  #else
! 	while ((c = getopt(argc, argv, "abcCdDf:F:h:inNoOp:RsS:t:uU:vWxX:zZ:V?-")) != -1)
  #endif
  	{
--- 774,782 ----
  	}
  
  #ifdef HAVE_GETOPT_LONG
! 	while ((c = getopt_long(argc, argv, "abcCdDf:F:h:im:nNoOp:RsS:t:uU:vWxX:zZ:V?", long_options, &optindex)) != -1)
  #else
! 	while ((c = getopt(argc, argv, "abcCdDf:F:h:im:nNoOp:RsS:t:uU:vWxX:zZ:V?-")) != -1)
  #endif
  	{
***************
*** 798,803 ****
--- 824,833 ----
  			case 'i':			/* ignore database version mismatch */
  				ignore_version = true;
+ 				break;
+ 
+ 			case 'm':
+ 				g_max_copy_rows = atoi(optarg);
  				break;
  
  			case 'n':			/* Do not force double-quotes on
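For illustration: with -m 2 a table is dumped as several short COPY
blocks instead of one big one, roughly like this (made-up table name
and tab-separated data, only a sketch of the relevant part of the
dump):

    COPY "mytable" FROM stdin;
    1	foo
    2	bar
    \.
    COPY "mytable" FROM stdin;
    3	baz
    4	quux
    \.
    COPY "mytable" FROM stdin;
    5	fin
    \.

During restore each COPY then runs as its own transaction, which is
what keeps the resource usage per transaction bounded.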
Marko Kreen <marko@l-t.ee> writes:
> Maybe I am missing something obvious, but I am unable to load
> larger tables (~300k rows) with the COPY command that pg_dump
> produces by default.

I'd like to find out what the problem is, rather than work around it
with such an ugly hack.

> 1) Too few WAL files.
>    - well, increase the wal_files (eg to 32),

What PG version are you running?  7.1.3 or later should not have a
problem with WAL file growth.

> 2) Machine runs out of swap, PostgreSQL seems to keep the whole TX
>    in memory.

That should not happen either.  Could we see the full schema of the
table you are having trouble with?

> Or shortly: during pg_restore the resource requirements are an
> order of magnitude higher than during pg_dump,

We found some client-side memory leaks in pg_restore recently; is that
what you're talking about?

			regards, tom lane
On Tue, Dec 11, 2001 at 10:55:30AM -0500, Tom Lane wrote:
> Marko Kreen <marko@l-t.ee> writes:
> > Maybe I am missing something obvious, but I am unable to load
> > larger tables (~300k rows) with the COPY command that pg_dump
> > produces by default.
>
> I'd like to find out what the problem is, rather than work around it
> with such an ugly hack.
>
> > 1) Too few WAL files.
> >    - well, increase the wal_files (eg to 32),
>
> What PG version are you running?  7.1.3 or later should not have a
> problem with WAL file growth.

7.1.3

> > 2) Machine runs out of swap, PostgreSQL seems to keep the whole TX
> >    in memory.
>
> That should not happen either.  Could we see the full schema of the
> table you are having trouble with?

Well, there are several such tables; I will reproduce it, then send
the schema.  I guess it's the first one, but maybe not.  The postgres
process gets killed by the Linux OOM handler, so I can't tell from the
messages which one it was.  (Hmm, I should probably run it as
psql -q -a > log.)

> > Or shortly: during pg_restore the resource requirements are an
> > order of magnitude higher than during pg_dump,
>
> We found some client-side memory leaks in pg_restore recently; is that
> what you're talking about?

No, it's the postgres process that's memory-hungry; it happens with
"psql < db.dump" too.  If I run a dump produced with "pg_dump -m 5000"
then it loops between 20M and 10M, which is much better.  (The 10M
depends on shared_buffers, I guess.)

-- 
marko
----- Original Message -----
From: Marko Kreen <marko@l-t.ee>
Sent: Tuesday, December 11, 2001 10:10 AM

If this thing ever gets through, shouldn't this

>  /* placeholders for the delimiters for comments */
>  ***************
>  *** 151,156 ****
>  --- 153,159 ----
>  	"  -h, --host=HOSTNAME      database server host name\n"
>  	"  -i, --ignore-version     proceed even when server version mismatches\n"
>  	"                           pg_dump version\n"
>  + 	"   m, --maxrows=NUM        max rows in one COPY command\n"

say '-m'

>  + 	"   m NUM                   max rows in one COPY command\n"

and this one too?
On Tue, Dec 11, 2001 at 10:55:30AM -0500, Tom Lane wrote:
> Marko Kreen <marko@l-t.ee> writes:
> > Maybe I am missing something obvious, but I am unable to load
> > larger tables (~300k rows) with the COPY command that pg_dump
> > produces by default.
>
> I'd like to find out what the problem is, rather than work around it
> with such an ugly hack.

Hmm, the problem was more 'interesting' than I thought.  Basically:

1) pg_dump of 7.1.3 dumps constraints and primary keys with the table
   definitions in this case, so they are run during COPY.
2) I have some tricky CHECK constraints.

Look at the attached Python script, it reproduces the problem.
Sorry, cannot test on 7.2 at the moment.

-- 
marko
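For the archives, the setup is roughly of the following shape - the
table names, columns and function body here are only invented
stand-ins, the real thing is in the attached script:

    CREATE TABLE codes (
        code    text PRIMARY KEY
    );

    -- stand-in body: the real check_code does its own validation
    CREATE FUNCTION check_code(text) RETURNS boolean AS '
        SELECT count(*) > 0 FROM codes WHERE code = $1;
    ' LANGUAGE 'sql';

    CREATE TABLE big_table (
        id      integer,
        code    text CHECK (check_code(code))
    );

Loading a few hundred thousand rows into big_table then evaluates
check_code once per row, all inside the single COPY transaction.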
On Tue, Dec 11, 2001 at 12:29:07PM -0500, Serguei Mokhov wrote:
> If this thing ever gets through, shouldn't this
>
> >  /* placeholders for the delimiters for comments */
> >  ***************
> >  *** 151,156 ****
> >  --- 153,159 ----
> >  	"  -h, --host=HOSTNAME      database server host name\n"
> >  	"  -i, --ignore-version     proceed even when server version mismatches\n"
> >  	"                           pg_dump version\n"
> >  + 	"   m, --maxrows=NUM        max rows in one COPY command\n"
>
> say '-m'
>
> >  + 	"   m NUM                   max rows in one COPY command\n"
>
> and this one too?

One is for systems that have 'getopt_long', the second for
short-getopt-only ones.  The '-h, --host=HOSTNAME' means that
'-h HOSTNAME' and '--host=HOSTNAME' are the same.

-- 
marko
Marko Kreen <marko@l-t.ee> writes:
> Look at the attached Python script, it reproduces the problem.

Hmm.  You'd probably have much better luck if you rewrote the check_code
function in plpgsql: that should eliminate the memory-leak problem, and
also speed things up, because plpgsql knows about caching query plans
across function calls.  IIRC, sql functions don't do that.

The memory leakage is definitely a bug, but not one that is going to
get fixed for 7.2.  It'll take some nontrivial work on the SQL function
executor...

			regards, tom lane
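Concretely, a plpgsql version of the stand-in function sketched
earlier in the thread would look roughly like this (the real
check_code body will differ, and plpgsql has to be installed in the
database first, e.g. with createlang):

    -- plpgsql prepares the SELECT's plan on the first call and reuses
    -- it, instead of re-parsing and re-planning it for every row
    CREATE FUNCTION check_code(text) RETURNS boolean AS '
    DECLARE
        n integer;
    BEGIN
        SELECT INTO n count(*) FROM codes WHERE code = $1;
        RETURN n > 0;
    END;
    ' LANGUAGE 'plpgsql';

Defining the function this way in the schema, before the data is
loaded, leaves the CHECK constraint itself unchanged.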
----- Original Message -----
From: Marko Kreen <marko@l-t.ee>
Sent: Tuesday, December 11, 2001 12:38 PM

> On Tue, Dec 11, 2001 at 12:29:07PM -0500, Serguei Mokhov wrote:
> > If this thing ever gets through, shouldn't this
> >
> > >  /* placeholders for the delimiters for comments */
> > >  ***************
> > >  *** 151,156 ****
> > >  --- 153,159 ----
> > >  	"  -h, --host=HOSTNAME      database server host name\n"
> > >  	"  -i, --ignore-version     proceed even when server version mismatches\n"
> > >  	"                           pg_dump version\n"
> > >  + 	"   m, --maxrows=NUM        max rows in one COPY command\n"
> >
> > say '-m'
> >
> > >  + 	"   m NUM                   max rows in one COPY command\n"
> >
> > and this one too?
>
> One is for systems that have 'getopt_long', the second for
> short-getopt-only ones.  The '-h, --host=HOSTNAME' means that
> '-h HOSTNAME' and '--host=HOSTNAME' are the same.

I know, I know.  I was just trying to point out a typo :)
You forgot to add '-' in the messages before 'm'.
> > > >  + 	"   m, --maxrows=NUM        max rows in one COPY command\n"
> > >
> > > say '-m'

> You forgot to add '-' in the messages before 'm'.

Ah.  On my screen it looks a lot like a '-', but od shows 0xAD...
Well, that's VIM's digraph feature in action ;)

-- 
marko
On Tue, Dec 11, 2001 at 01:06:13PM -0500, Tom Lane wrote:
> Marko Kreen <marko@l-t.ee> writes:
> > Look at the attached Python script, it reproduces the problem.
>
> Hmm.  You'd probably have much better luck if you rewrote the check_code
> function in plpgsql: that should eliminate the memory-leak problem, and
> also speed things up, because plpgsql knows about caching query plans
> across function calls.  IIRC, sql functions don't do that.

And I thought 'sql' was the more lightweight approach...

Thanks, now it seems to work.

-- 
marko