Restoring large tables with COPY

From
Marko Kreen
Date:
Maybe I am missing something obvious, but I am unable to load
larger tables (~300k rows) with COPY command that pg_dump by
default produces.  Yes, dumping as INSERTs works, but it is slow.

"Cant" as in "it does not work with the default setup I have
running on devel machine" - 128M mem, 128M swap, basically
default postgresql.conf:

1) Too few WAL files.
   - well, increase the wal_files (eg to 32),

2) Machine runs out of swap, PostgreSQL seems to keep whole TX
   in memory.
   - So I must put 1G of swap?  But what if I have 1G of rows?

Or shortly: during pg_restore the resource requirements are
order of magnitude higher than during pg_dump, which is
non-obvious and may come as a bad surprise when you are in real trouble.

This is annoying, especially as dumping as COPYs should be
preferred, since it is faster and smaller.  Of course the
dump-as-INSERTs form has its positive side too - eg. ALTER TABLE DROP
COLUMN with sed...

The patch below implements a '-m NUM' switch for pg_dump, which splits
each COPY command into chunks of at most NUM rows each.
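
For example, with '-m 2' the data section for a table would come out
roughly like this (made-up table "mytable" and rows, just to show the
splitting - same data, several COPY commands):

    COPY "mytable" FROM stdin;
    1	foo
    2	bar
    \.
    COPY "mytable" FROM stdin;
    3	baz
    \.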

Comments?  What am I missing?

-- 
marko



Index: doc/src/sgml/ref/pg_dump.sgml
===================================================================
RCS file: /opt/cvs/pgsql/pgsql/doc/src/sgml/ref/pg_dump.sgml,v
retrieving revision 1.41
diff -u -c -r1.41 pg_dump.sgml
*** doc/src/sgml/ref/pg_dump.sgml    8 Dec 2001 03:24:37 -0000    1.41
--- doc/src/sgml/ref/pg_dump.sgml    11 Dec 2001 03:58:30 -0000
***************
*** 35,40 ****
--- 35,41 ----
     <arg>-f <replaceable>file</replaceable></arg>
     <arg>-F <replaceable>format</replaceable></arg>
     <arg>-i</arg>
+    <arg>-m <replaceable>num_rows</replaceable></arg>
     <group> <arg>-n</arg> <arg>-N</arg> </group>
     <arg>-o</arg>
     <arg>-O</arg>
  
***************
*** 301,306 ****
--- 302,321 ----
      if you need to override the version check (and if
      <command>pg_dump</command> then fails, don't
      say you weren't warned).
  
+        </para>
+       </listitem>
+      </varlistentry>
+ 
+      <varlistentry>
+       <term>-m <replaceable class="parameter">num_rows</replaceable></term>
+       <term>--maxrows=<replaceable class="parameter">num_rows</replaceable></term>
+       <listitem>
+        <para>
+     Set maximum number of rows to put into one COPY statement.
+     This starts new COPY command after every
+     <replaceable class="parameter">num_rows</replaceable>.
+     This is useful on large tables to avoid restoring whole table in
+     one transaction which may consume lot of resources.
         </para>
        </listitem>
       </varlistentry>
Index: src/bin/pg_dump/pg_dump.c
===================================================================
RCS file: /opt/cvs/pgsql/pgsql/src/bin/pg_dump/pg_dump.c,v
retrieving revision 1.236
diff -u -c -r1.236 pg_dump.c
*** src/bin/pg_dump/pg_dump.c    28 Oct 2001 06:25:58 -0000    1.236
--- src/bin/pg_dump/pg_dump.c    11 Dec 2001 04:48:42 -0000
***************
*** 116,121 ****
--- 116,123 ----
  bool        dataOnly;
  bool        aclsSkip;
  
+ int         g_max_copy_rows = 0;
+ 
  char        g_opaque_type[10];    /* name for the opaque type */
  
  /* placeholders for the delimiters for comments */
***************
*** 151,156 ****
--- 153,159 ----
           "  -h, --host=HOSTNAME      database server host name\n"
           "  -i, --ignore-version     proceed even when server version mismatches\n"
           "                           pg_dump version\n"
+          "  ­m, --maxrows=NUM        max rows in one COPY command\n"
           "  -n, --no-quotes          suppress most quotes around identifiers\n"
           "  -N, --quotes             enable most quotes around identifiers\n"
           "  -o, --oids               include oids in dump\n"
***************
*** 187,192 ****
--- 190,196 ----
           "                           pg_dump version\n"
           "  -n                       suppress most quotes around identifiers\n"
           "  -N                       enable most quotes around identifiers\n"
+          "  ­m NUM                   max rows in one COPY command\n"
           "  -o                       include oids in dump\n"
           "  -O                       do not output \\connect commands in plain\n"
           "                           text format\n"
***************
*** 244,249 ****
--- 248,255 ----
      int         ret;
      bool        copydone;
      char        copybuf[COPYBUFSIZ];
+     int         cur_row;
+     int         linestart;
  
      if (g_verbose)
          write_msg(NULL, "dumping out the contents of table %s\n", classname);
***************
*** 297,302 ****
--- 303,310 ----
          else
          {
              copydone = false;
+             linestart = 1;
+             cur_row = 0;
  
              while (!copydone)
              {
***************
*** 310,316 ****
--- 318,338 ----
                  }
                  else
                  {
+                     /*
+                      * Avoid too large transactions by breaking them up.
+                      */
+                     if (g_max_copy_rows > 0 && linestart
+                             && cur_row >= g_max_copy_rows)
+                     {
+                         cur_row = 0;
+                         archputs("\\.\n", fout);
+                         archprintf(fout, "COPY %s %sFROM stdin;\n",
+                                 fmtId(classname, force_quotes),
+                                 (oids && hasoids) ? "WITH OIDS " : "");
+                     }
+ 
                      archputs(copybuf, fout);
+ 
                      switch (ret)
                      {
                          case EOF:
***************
*** 318,325 ****
--- 340,350 ----
                              /* FALLTHROUGH */
                          case 0:
                              archputc('\n', fout);
+                             cur_row++;
+                             linestart = 1;
                              break;
                          case 1:
+                             linestart = 0;
                              break;
                      }
                  }
***************
*** 696,701 ****
--- 721,727 ----
          {"compress", required_argument, NULL, 'Z'},
          {"help", no_argument, NULL, '?'},
          {"version", no_argument, NULL, 'V'},
+         {"maxrows", required_argument, NULL, 'm'},
  
          /*
           * the following options don't have an equivalent short option
***************
*** 748,756 ****
      }
  
  #ifdef HAVE_GETOPT_LONG
!     while ((c = getopt_long(argc, argv, "abcCdDf:F:h:inNoOp:RsS:t:uU:vWxX:zZ:V?", long_options, &optindex)) != -1)
  #else
!     while ((c = getopt(argc, argv, "abcCdDf:F:h:inNoOp:RsS:t:uU:vWxX:zZ:V?-")) != -1)
  #endif
      {
--- 774,782 ----
      }
  
  #ifdef HAVE_GETOPT_LONG
!     while ((c = getopt_long(argc, argv, "abcCdDf:F:h:im:nNoOp:RsS:t:uU:vWxX:zZ:V?", long_options, &optindex)) != -1)
  #else
!     while ((c = getopt(argc, argv, "abcCdDf:F:h:im:nNoOp:RsS:t:uU:vWxX:zZ:V?-")) != -1)
  #endif
      {
***************
*** 798,803 ****
--- 824,833 ----
              case 'i':            /* ignore database version mismatch */
                  ignore_version = true;
+                 break;
+ 
+             case 'm':
+                 g_max_copy_rows = atoi(optarg);
                  break;
  
              case 'n':            /* Do not force double-quotes on


Re: Restoring large tables with COPY

From
Tom Lane
Date:
Marko Kreen <marko@l-t.ee> writes:
> Maybe I am missing something obvious, but I am unable to load
> larger tables (~300k rows) with COPY command that pg_dump by
> default produces.

I'd like to find out what the problem is, rather than work around it
with such an ugly hack.

> 1) Too few WAL files.
>    - well, increase the wal_files (eg to 32),

What PG version are you running?  7.1.3 or later should not have a
problem with WAL file growth.

> 2) Machine runs out of swap, PostgreSQL seems to keep whole TX
>    in memory.

That should not happen either.  Could we see the full schema of the
table you are having trouble with?

> Or shortly: during pg_restore the resource requirements are
> order of magnitude higher than during pg_dump,

We found some client-side memory leaks in pg_restore recently; is that
what you're talking about?
        regards, tom lane


Re: Restoring large tables with COPY

From
Marko Kreen
Date:
On Tue, Dec 11, 2001 at 10:55:30AM -0500, Tom Lane wrote:
> Marko Kreen <marko@l-t.ee> writes:
> > Maybe I am missing something obvious, but I am unable to load
> > larger tables (~300k rows) with COPY command that pg_dump by
> > default produces.
> 
> I'd like to find out what the problem is, rather than work around it
> with such an ugly hack.
> 
> > 1) Too few WAL files.
> >    - well, increase the wal_files (eg to 32),
> 
> What PG version are you running?  7.1.3 or later should not have a
> problem with WAL file growth.

7.1.3

> > 2) Machine runs out of swap, PostgreSQL seems to keep whole TX
> >    in memory.
> 
> That should not happen either.  Could we see the full schema of the
> table you are having trouble with?

Well, there are several such tables; I will reproduce it,
then send the schema.  I guess it's the first one, but maybe
not.  postgres gets killed by the Linux OOM handler, so I can't
tell from the messages which one it was.  (Hmm, I should probably
run it as psql -q -a > log).

> > Or shortly: during pg_restore the resource requirements are
> > order of magnitude higher than during pg_dump,
> 
> We found some client-side memory leaks in pg_restore recently; is that
> what you're talking about?

No, it's the postgres process that's memory-hungry; it happens
with "psql < db.dump" too.

If I run a dump that's produced with "pg_dump -m 5000", then
its memory usage loops between 20M and 10M, which is much better.
(The 10M depends on shared_buffers, I guess.)

-- 
marko



Re: Restoring large tables with COPY

From
"Serguei Mokhov"
Date:
----- Original Message -----
From: Marko Kreen <marko@l-t.ee>
Sent: Tuesday, December 11, 2001 10:10 AM

If this thing ever gets through, shouldn't this

>   /* placeholders for the delimiters for comments */
> ***************
> *** 151,156 ****
> --- 153,159 ----
>   "  -h, --host=HOSTNAME      database server host name\n"
>   "  -i, --ignore-version     proceed even when server version mismatches\n"
>   "                           pg_dump version\n"
> + "  ­m, --maxrows=NUM        max rows in one COPY command\n"

say '-m'

> + "  ­m NUM                   max rows in one COPY command\n"

and this one too?




Re: Restoring large tables with COPY

From
Marko Kreen
Date:
On Tue, Dec 11, 2001 at 10:55:30AM -0500, Tom Lane wrote:
> Marko Kreen <marko@l-t.ee> writes:
> > Maybe I am missing something obvious, but I am unable to load
> > larger tables (~300k rows) with COPY command that pg_dump by
> > default produces.
>
> I'd like to find out what the problem is, rather than work around it
> with such an ugly hack.

Hmm, the problem was more 'interesting' than I thought.
Basically:

1) pg_dump of 7.1.3 dumps constraints and primary keys
   with table defs in this case, so they are run during COPY.
2) I have some tricky CHECK constraints.

Look at the attached Python script; it reproduces the problem.
Sorry, cannot test on 7.2 at the moment.
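
In case the attachment gets lost: the shape of the setup - a made-up
sketch here, not the actual script - is a table whose CHECK constraint
calls an 'sql'-language function, so the function runs once for every
row during the COPY:

    CREATE TABLE codes (code text);

    CREATE FUNCTION check_code(text) RETURNS boolean AS '
        SELECT count(*) > 0 FROM codes WHERE code = $1;
    ' LANGUAGE 'sql';

    CREATE TABLE big_table (
        code    text CHECK (check_code(code)),
        payload text
    );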

--
marko


Attachment

Re: Restoring large tables with COPY

From
Marko Kreen
Date:
On Tue, Dec 11, 2001 at 12:29:07PM -0500, Serguei Mokhov wrote:
> If this thing ever gets through, shouldn't this
> 
> >   /* placeholders for the delimiters for comments */
> > ***************
> > *** 151,156 ****
> > --- 153,159 ----
> >   "  -h, --host=HOSTNAME      database server host name\n"
> >   "  -i, --ignore-version     proceed even when server version mismatches\n"
> >   "                           pg_dump version\n"
> > + "  ­m, --maxrows=NUM        max rows in one COPY command\n"
> 
> say '-m'
> 
> > + "  ­m NUM                   max rows in one COPY command\n"
> 
> and this one too?

One is for systems that have 'getopt_long', the second is for
short-getopt-only ones.  The '-h, --host=HOSTNAME' means
that '-h HOSTNAME' and '--host=HOSTNAME' are the same.

-- 
marko



Re: Restoring large tables with COPY

From
Tom Lane
Date:
Marko Kreen <marko@l-t.ee> writes:
> Look at the attached Python script, it reproduces the problem.

Hmm.  You'd probably have much better luck if you rewrote the check_code
function in plpgsql: that should eliminate the memory-leak problem, and
also speed things up because plpgsql knows about caching query plans
across function calls.  IIRC, sql functions don't do that.
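
Roughly (hypothetical body - the real function is in your attachment),
the plpgsql version of such a check would look like:

    CREATE FUNCTION check_code(text) RETURNS boolean AS '
    DECLARE
        n integer;
    BEGIN
        -- plpgsql prepares the plan for this SELECT once and reuses it
        SELECT INTO n count(*) FROM codes WHERE code = $1;
        RETURN n > 0;
    END;
    ' LANGUAGE 'plpgsql';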

The memory leakage is definitely a bug, but not one that is going to get
fixed for 7.2.  It'll take some nontrivial work on the SQL function
executor...
        regards, tom lane


Re: Restoring large tables with COPY

From
"Serguei Mokhov"
Date:
----- Original Message -----
From: Marko Kreen <marko@l-t.ee>
Sent: Tuesday, December 11, 2001 12:38 PM

> On Tue, Dec 11, 2001 at 12:29:07PM -0500, Serguei Mokhov wrote:
> > If this thing ever gets through, shouldn't this
> >
> > >   /* placeholders for the delimiters for comments */
> > > ***************
> > > *** 151,156 ****
> > > --- 153,159 ----
> > >   "  -h, --host=HOSTNAME      database server host name\n"
> > >   "  -i, --ignore-version     proceed even when server version mismatches\n"
> > >   "                           pg_dump version\n"
> > > + "  ­m, --maxrows=NUM        max rows in one COPY command\n"
> >
> > say '-m'
> >
> > > + "  ­m NUM                   max rows in one COPY command\n"
> >
> > and this one too?
>
> One is for systems that have 'getopt_long', second for
> short-getopt-only ones.  The '-h, --host=HOSTNAME' means
> that '-h HOSTNAME' and '--host=HOSTNAME' are same.

I know, I know. I was just trying to point out a typo :)
You forgot to add the '-' before 'm' in the messages.





Re: Restoring large tables with COPY

From
Marko Kreen
Date:
> > > > + "  ­m, --maxrows=NUM        max rows in one COPY command\n"
> > >
> > > say '-m'

> You forgot to add '-' in the messages before 'm'.

Ah.  On my screen it looks a lot like a '-', but od shows 0xAD...
Well, that's VIM's digraph feature in action ;)

-- 
marko



Re: Restoring large tables with COPY

From
Marko Kreen
Date:
On Tue, Dec 11, 2001 at 01:06:13PM -0500, Tom Lane wrote:
> Marko Kreen <marko@l-t.ee> writes:
> > Look at the attached Python script, it reproduces the problem.
> 
> Hmm.  You'd probably have much better luck if you rewrote the check_code
> function in plpgsql: that should eliminate the memory-leak problem, and
> also speed things up because plpgsql knows about caching query plans
> across function calls.  IIRC, sql functions don't do that.

And I thought that 'sql' was the more lightweight approach...

Thanks, now it seems to work.

-- 
marko