Thread: PyGreSQL inserttable patch

PyGreSQL inserttable patch

From
"Christoph Zwerschke"
Date:
PyGreSQL inserttable patch
=====================

I suggested an improvement of the inserttable in the PyGreSQL interface
already in January, but seemingly it was never implemented. I was told this
is the right place to get patches in for PyGreSQL, so I'm reposting my patch
here.

I consider the inserttable methode essential in populating the database
because of its benefits in performance compared to insert, so I think this
patch is quite essential. The attachment is an improved version of the
corresponding pg_inserttable function in pgmodule.c, which fixes the
following problems:

* The function raised exceptions because PyList_GetItem was used beyond the
size of the list. This was checked by comparing the result with NULL, but
the exception was not cleaned up, which could result in mysterious errors in
the following Python code. Instead of clearing the exception using
PyErr_Clear or something like that, I avoided throwing the exception at all
by at first requesting the size of the list. Using this opportunity, I also
checked the uniformity of the size of the rows passed in the lists/tuples.
The function also accepts (and silently ignores) empty lists and sublists.
* Python "None" values are now accepted and properly converted to PostgreSQL
NULL values
* The function now generates an error message in case of a line buffer
overflow
* It copes with tabulators, newlines and backslashes in strings now
* Rewrote the buffer filling code which should now run faster by avoiding
unnecessary string copy operations forth and back

[Side Mark: I think the following should be added to the to do list: "Write
an option for inserttable which uses 'copy binary from' instead of 'copy
from.'" This could speed up things considerably. In particular, numbers
would not be needed to be converted to strings forth and back. This is a
pretty difficult and laborious task and I can't do it. However, it should be
annotated as a possible 'to do'.]

Christoph Zwerschke
Zentrale Universitaetsverwaltung Heidelberg

Attachment

Re: PyGreSQL inserttable patch

From
Bruce Momjian
Date:
Your patch has been added to the PostgreSQL unapplied patches list at:

    http://momjian.postgresql.org/cgi-bin/pgpatches

I will try to apply it within the next 48 hours.

---------------------------------------------------------------------------


Christoph Zwerschke wrote:
> PyGreSQL inserttable patch
> =====================
>
> I suggested an improvement of the inserttable in the PyGreSQL interface
> already in January, but seemingly it was never implemented. I was told this
> is the right place to get patches in for PyGreSQL, so I'm reposting my patch
> here.
>
> I consider the inserttable methode essential in populating the database
> because of its benefits in performance compared to insert, so I think this
> patch is quite essential. The attachment is an improved version of the
> corresponding pg_inserttable function in pgmodule.c, which fixes the
> following problems:
>
> * The function raised exceptions because PyList_GetItem was used beyond the
> size of the list. This was checked by comparing the result with NULL, but
> the exception was not cleaned up, which could result in mysterious errors in
> the following Python code. Instead of clearing the exception using
> PyErr_Clear or something like that, I avoided throwing the exception at all
> by at first requesting the size of the list. Using this opportunity, I also
> checked the uniformity of the size of the rows passed in the lists/tuples.
> The function also accepts (and silently ignores) empty lists and sublists.
> * Python "None" values are now accepted and properly converted to PostgreSQL
> NULL values
> * The function now generates an error message in case of a line buffer
> overflow
> * It copes with tabulators, newlines and backslashes in strings now
> * Rewrote the buffer filling code which should now run faster by avoiding
> unnecessary string copy operations forth and back
>
> [Side Mark: I think the following should be added to the to do list: "Write
> an option for inserttable which uses 'copy binary from' instead of 'copy
> from.'" This could speed up things considerably. In particular, numbers
> would not be needed to be converted to strings forth and back. This is a
> pretty difficult and laborious task and I can't do it. However, it should be
> annotated as a possible 'to do'.]
>
> Christoph Zwerschke
> Zentrale Universitaetsverwaltung Heidelberg

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: PyGreSQL inserttable patch

From
Bruce Momjian
Date:
Function updated.  I assume you wanted to replace pg_inserttable in
pgmodule.c with your version, which I have done.

As for your comment about COPY BINARY, binary is not portable between
machines so I can't see why we would perfer that.  Is that something
that we would never do between machines?

---------------------------------------------------------------------------

Christoph Zwerschke wrote:
> PyGreSQL inserttable patch
> =====================
>
> I suggested an improvement of the inserttable in the PyGreSQL interface
> already in January, but seemingly it was never implemented. I was told this
> is the right place to get patches in for PyGreSQL, so I'm reposting my patch
> here.
>
> I consider the inserttable methode essential in populating the database
> because of its benefits in performance compared to insert, so I think this
> patch is quite essential. The attachment is an improved version of the
> corresponding pg_inserttable function in pgmodule.c, which fixes the
> following problems:
>
> * The function raised exceptions because PyList_GetItem was used beyond the
> size of the list. This was checked by comparing the result with NULL, but
> the exception was not cleaned up, which could result in mysterious errors in
> the following Python code. Instead of clearing the exception using
> PyErr_Clear or something like that, I avoided throwing the exception at all
> by at first requesting the size of the list. Using this opportunity, I also
> checked the uniformity of the size of the rows passed in the lists/tuples.
> The function also accepts (and silently ignores) empty lists and sublists.
> * Python "None" values are now accepted and properly converted to PostgreSQL
> NULL values
> * The function now generates an error message in case of a line buffer
> overflow
> * It copes with tabulators, newlines and backslashes in strings now
> * Rewrote the buffer filling code which should now run faster by avoiding
> unnecessary string copy operations forth and back
>
> [Side Mark: I think the following should be added to the to do list: "Write
> an option for inserttable which uses 'copy binary from' instead of 'copy
> from.'" This could speed up things considerably. In particular, numbers
> would not be needed to be converted to strings forth and back. This is a
> pretty difficult and laborious task and I can't do it. However, it should be
> annotated as a possible 'to do'.]
>
> Christoph Zwerschke
> Zentrale Universitaetsverwaltung Heidelberg

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: PyGreSQL inserttable patch

From
"Christoph Zwerschke"
Date:
> Function updated.  I assume you wanted to replace pg_inserttable in
> pgmodule.c with your version, which I have done.

Yes, thank you. I'm using the patched function intensely in a project for
nearly a year so there shouldn't be any great bugs in it.

> As for your comment about COPY BINARY, binary is not portable between
> machines so I can't see why we would perfer that.  Is that something
> that we would never do between machines?

You are right, as pg_inserttable is running on the client side, it does not
know which binary format to provide for the server. However, I understand
the format might become machine independant in the future (the new file
format since 7.1 already indicates the integer layout etc.) Also,
pg_inserttable could check whether the database host of the connection is
the localhost and switch to binary mode only in that case. I just had the
idea that this could reduce the number of conversions and speed up the
process a lot. For instance, a Python floating point number will be
converted three times, first with PyFloat_AS_DOUBLE to a C double, then with
sprintf("%g") to a string representation and then again in PostgreSQL "COPY
TO" to PostgreSQL double precision. My idea was to convert the Python float
directly to a PostgreSQL float.

Christoph Zwerschke
Zentrale Universitaetsverwaltung Heidelberg


Re: PyGreSQL inserttable patch

From
Bruce Momjian
Date:
Christoph Zwerschke wrote:
> > Function updated.  I assume you wanted to replace pg_inserttable in
> > pgmodule.c with your version, which I have done.
>
> Yes, thank you. I'm using the patched function intensely in a project for
> nearly a year so there shouldn't be any great bugs in it.
>
> > As for your comment about COPY BINARY, binary is not portable between
> > machines so I can't see why we would perfer that.  Is that something
> > that we would never do between machines?
>
> You are right, as pg_inserttable is running on the client side, it does not
> know which binary format to provide for the server. However, I understand
> the format might become machine independant in the future (the new file
> format since 7.1 already indicates the integer layout etc.) Also,
> pg_inserttable could check whether the database host of the connection is
> the localhost and switch to binary mode only in that case. I just had the
> idea that this could reduce the number of conversions and speed up the
> process a lot. For instance, a Python floating point number will be
> converted three times, first with PyFloat_AS_DOUBLE to a C double, then with
> sprintf("%g") to a string representation and then again in PostgreSQL "COPY
> TO" to PostgreSQL double precision. My idea was to convert the Python float
> directly to a PostgreSQL float.

There are no concrete plans for a binary-portable COPY format, and in
fact if we did one it would probably have the same overhead as ASCII
does now.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073