Thread: PyGreSQL inserttable patch
PyGreSQL inserttable patch
==========================

I already suggested an improvement of the inserttable method in the PyGreSQL
interface in January, but apparently it was never implemented. I was told this
is the right place to submit patches for PyGreSQL, so I'm reposting my patch
here.

I consider the inserttable method essential for populating the database
because of its performance benefits over plain INSERT, so I think this patch
is quite important. The attachment is an improved version of the corresponding
pg_inserttable function in pgmodule.c, which fixes the following problems:

* The function raised exceptions because PyList_GetItem was used beyond the
  size of the list. This was checked by comparing the result with NULL, but
  the exception was not cleared, which could cause mysterious errors in the
  following Python code. Instead of clearing the exception with PyErr_Clear
  or similar, I avoid raising it at all by requesting the size of the list
  first. While at it, I also check that all rows passed in the lists/tuples
  have the same length. The function also accepts (and silently ignores)
  empty lists and sublists.
* Python "None" values are now accepted and properly converted to PostgreSQL
  NULL values.
* The function now reports an error in case of a line buffer overflow.
* It now copes with tabs, newlines and backslashes in strings.
* The buffer-filling code was rewritten and should run faster by avoiding
  unnecessary string copies back and forth.

(A sketch of the resulting row formatting follows after the attachment note.)

[Side note: I think the following should be added to the TODO list: "Add an
option for inserttable that uses 'COPY BINARY FROM' instead of 'COPY FROM'."
This could speed things up considerably; in particular, numbers would not
need to be converted to strings and back. This is a fairly difficult and
laborious task and I can't do it myself, but it should be noted as a possible
TODO.]

Christoph Zwerschke
Zentrale Universitaetsverwaltung Heidelberg
Attachment
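A minimal sketch of the row formatting described in the patch notes above
(hypothetical names, not the attached patch itself; written against the
Python 3 C API for brevity, whereas the actual patch targets the Python 2
API used by pgmodule.c):

/*
 * Hypothetical sketch: build one COPY ... FROM STDIN text-format line from
 * a Python row (list or tuple).  The row length is taken up front with
 * PySequence_Size() instead of calling PyList_GetItem() until it fails,
 * None becomes the COPY null marker \N, tab, newline and backslash are
 * escaped, and overflow of the fixed line buffer is reported rather than
 * silently truncated.
 */
#include <Python.h>

/* Append one character, refusing to overflow the buffer. */
static int
put_char(char *buf, size_t bufsize, size_t *pos, char c)
{
	if (*pos + 1 >= bufsize)		/* keep room for the trailing '\0' */
		return -1;
	buf[(*pos)++] = c;
	return 0;
}

/* Append one value, escaping tab, newline and backslash; None -> \N. */
static int
put_field(char *buf, size_t bufsize, size_t *pos, PyObject *item)
{
	PyObject   *str;
	const char *s;
	Py_ssize_t	len, i;

	if (item == Py_None)
		return (put_char(buf, bufsize, pos, '\\') ||
				put_char(buf, bufsize, pos, 'N')) ? -1 : 0;

	str = PyObject_Str(item);		/* text form of numbers, strings, ... */
	if (str == NULL)
		return -1;
	s = PyUnicode_AsUTF8AndSize(str, &len);
	if (s == NULL)
	{
		Py_DECREF(str);
		return -1;
	}
	for (i = 0; i < len; i++)
	{
		char	c = s[i];
		int		rc;

		if (c == '\t' || c == '\n' || c == '\\')
			rc = put_char(buf, bufsize, pos, '\\') ||
				 put_char(buf, bufsize, pos,
						  c == '\t' ? 't' : c == '\n' ? 'n' : '\\');
		else
			rc = put_char(buf, bufsize, pos, c);
		if (rc)
		{
			Py_DECREF(str);
			return -1;
		}
	}
	Py_DECREF(str);
	return 0;
}

/* Build a whole line; returns 0 on success, -1 on error or overflow
 * (the caller reports the appropriate error to the Python level). */
static int
build_copy_line(PyObject *row, char *buf, size_t bufsize)
{
	Py_ssize_t	ncols = PySequence_Size(row);	/* row size known up front */
	size_t		pos = 0;
	Py_ssize_t	col;

	if (ncols < 0)
		return -1;
	for (col = 0; col < ncols; col++)
	{
		PyObject   *item = PySequence_GetItem(row, col);	/* new reference */
		int			rc;

		if (item == NULL)
			return -1;
		if (col > 0 && put_char(buf, bufsize, &pos, '\t'))
		{
			Py_DECREF(item);
			return -1;
		}
		rc = put_field(buf, bufsize, &pos, item);
		Py_DECREF(item);
		if (rc)
			return -1;
	}
	if (put_char(buf, bufsize, &pos, '\n'))
		return -1;
	buf[pos] = '\0';
	return 0;
}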
Your patch has been added to the PostgreSQL unapplied patches list at:

	http://momjian.postgresql.org/cgi-bin/pgpatches

I will try to apply it within the next 48 hours.

--
  Bruce Momjian  |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us  |  (610) 359-1001
Function updated. I assume you wanted to replace pg_inserttable in
pgmodule.c with your version, which I have done.

As for your comment about COPY BINARY: binary is not portable between
machines, so I can't see why we would prefer that. Is that something
that we would never do between machines?

--
  Bruce Momjian  |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us  |  (610) 359-1001
> Function updated. I assume you wanted to replace pg_inserttable in
> pgmodule.c with your version, which I have done.

Yes, thank you. I have been using the patched function intensively in a
project for nearly a year, so there shouldn't be any major bugs in it.

> As for your comment about COPY BINARY: binary is not portable between
> machines, so I can't see why we would prefer that. Is that something
> that we would never do between machines?

You are right: since pg_inserttable runs on the client side, it does not
know which binary format to provide for the server. However, I understand
the format might become machine independent in the future (the new file
format since 7.1 already indicates the integer layout etc.). Also,
pg_inserttable could check whether the database host of the connection is
localhost and switch to binary mode only in that case. My thought was that
this could reduce the number of conversions and speed up the process a lot.
For instance, a Python floating point number is currently converted three
times: first with PyFloat_AS_DOUBLE to a C double, then with sprintf("%g")
to a string representation, and then again by PostgreSQL "COPY FROM" to a
double precision value. My idea was to convert the Python float directly
to a PostgreSQL float.

Christoph Zwerschke
Zentrale Universitaetsverwaltung Heidelberg
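For illustration, the conversion path described above might look roughly
like this in C (hypothetical helper, not code from the patch):

#include <Python.h>
#include <stdio.h>

/*
 * Illustrative only: the three conversions a Python float currently goes
 * through on its way into the table.
 */
static void
show_float_path(PyObject *item, char *field, size_t fieldsize)
{
	double	d = PyFloat_AS_DOUBLE(item);	/* 1. Python float -> C double */

	snprintf(field, fieldsize, "%g", d);	/* 2. C double -> text for COPY */
	/* 3. the server parses the text again into a double precision value
	 *    when it executes COPY FROM */
}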
Christoph Zwerschke wrote:
> For instance, a Python floating point number is currently converted three
> times: first with PyFloat_AS_DOUBLE to a C double, then with sprintf("%g")
> to a string representation, and then again by PostgreSQL "COPY FROM" to a
> double precision value. My idea was to convert the Python float directly
> to a PostgreSQL float.

There are no concrete plans for a binary-portable COPY format, and in fact,
if we did one, it would probably have the same overhead as ASCII does now.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  + If your life is a hard drive,      |  13 Roberts Road
  + Christ can be your backup.         |  Newtown Square, Pennsylvania 19073