Re: [psycopg] Turbo ODBC - Mailing list psycopg
From | Wes McKinney |
---|---|
Subject | Re: [psycopg] Turbo ODBC |
Date | |
Msg-id | CAJPUwMA2gnH+ebzGpWfKJL1xa5r2VFcdHUibK_aiLLccDQPc6g@mail.gmail.com |
In response to | [psycopg] Turbo ODBC (Jim Nasby <Jim.Nasby@BlueTreble.com>) |
List | psycopg |
Adding Uwe Korn -- he's been working on adding C++-level Arrow support to
TurbODBC: https://github.com/blue-yonder/turbodbc/pull/26

We're using Arrow data structures as an efficient (column-oriented)
intermediary en route to pandas -- you could also implement normal Python
tuple iteration on top of Arrow record batches.

The main Python-compatible Arrow implementation is in C++11, so if you wanted
to use it in C you'd need to do a little work to create C bindings.

On Mon, Jan 16, 2017 at 6:24 PM, Matthew Rocklin <mrocklin@continuum.io> wrote:
> CC'ing Wes McKinney in case he's interested in this problem (how do we
> efficiently move data from Postgres into Python).
>
> On Mon, Jan 16, 2017 at 6:14 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>>
>> On 1/15/17 3:25 PM, Adrian Klaver wrote:
>>>
>>> Have you looked at asyncpg:
>>
>>
>> I'm pulling Matt Rocklin in, who I've been working with on some plpython
>> improvements. Here are his off-list replies to me:
>>
>>> On 1/15/17 5:15 PM, Matthew Rocklin wrote:
>>>>
>>>> You might also contact Wes McKinney and check out
>>>> Arrow: http://pyarrow.readthedocs.io/en/latest/
>>>>
>>>> This project was designed for inter-system data interchange. It would
>>>> help with conversion to Spark and other popular tools as well.
>>>>
>>>> On Sun, Jan 15, 2017 at 6:01 PM, Matthew Rocklin <mrocklin@continuum.io
>>>> <mailto:mrocklin@continuum.io>> wrote:
>>>>
>>>> I haven't taken a look at asyncpg.
>>>>
>>>> Quick note, you might want to be careful about mixing HDFS and
>>>> HDF5. Postgres's competition for data storage is HDF5, not HDFS
>>>> (which stands for the Hadoop Distributed File System).
>>>>
>>>> I still think that the best thing to do here would be to dump out
>>>> Python arrays for each of the columns in a result set. I suspect
>>>> that you could beat /any/ system that doesn't do this handily. This
>>>> would avoid any performance pain of building up Python objects, and
>>>> can continue to use just the standard library.
>>>>
>>>> You can stick to pure Python and still send arrays encoded as
>>>> bytes. We can convert these to NumPy or Pandas trivially with zero
>>>> overhead.
>>>>
>>>> In [1]: from array import array  # this is in the standard library
>>>> In [2]: b = b'\x00' * 16         # 16 zeroes as a bytestring
>>>> In [3]: a = array('L', b)        # eight-byte unsigned integers
>>>> In [4]: a
>>>> Out[4]: array('L', [0, 0])
>>>>
>>>> We would only be bound by
>>>>
>>>> 1. The cost in Postgres to transpose the data from row-major to
>>>>    column-major
>>>> 2. The cost to move bytes across a socket
>>>>
>>>> This should run at gigabytes per second. Everything else is likely
>>>> to be competing in the 50-100 MB/s range at top speed, is my guess.
>>
>>
>> My guess is that the big gains for external communication would come from:
>>
>> 1) reducing latency effects
>> 2) reducing data marshaling
>>
>> 1 could be accomplished in various ways, and some changes to the FE-BE
>> protocol might be helpful. If we can demonstrate a clear win we might be
>> able to get protocol changes. I do suspect this needs to be tunable,
>> though. My 4x improvement to plpy.execute comes from changing the
>> equivalent of a large batch into single-row operations. Moving data to
>> Python over a local filesystem socket would be hurt by a large batch size,
>> while something on a high-latency connection would presumably benefit from
>> moderately large batches, since that reduces overall TCP overhead.
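A minimal sketch of the zero-copy handoff Matt describes above, assuming NumPy
is available; the payload and variable names are illustrative, not from the
thread:

    from array import array
    import numpy as np

    # Two 8-byte little-endian unsigned integers packed into one bytestring,
    # standing in for a column sent over the wire as raw bytes.
    payload = (123).to_bytes(8, "little") + (456).to_bytes(8, "little")

    a = array("Q", payload)                    # stdlib only: 'Q' = 8-byte unsigned, native order
    col = np.frombuffer(payload, dtype="<u8")  # NumPy view over the same bytes, no per-element copy

    print(list(a))   # [123, 456] on a little-endian machine
    print(col)       # [123 456]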
>>
>> I think 2 would necessitate FE-BE changes, although some of that could be
>> tested without the changes. Right now everything gets marshaled into a
>> simple text format (i.e. ints become a string of digits), but for test
>> purposes there are ways you could run that through the binary output
>> functions instead (in the case of int4, you'd get 4 bytes in network
>> order, which should be faster to handle and could be passed directly to
>> something like an ndarray).
>>
>> The two problems you run into with the array type are handling NULLs and
>> building columns from rows. You could allow the option of specifying that
>> NULLs won't be allowed in specific fields, or you could use something like
>> a Pandas Series that provides other ways of handling the equivalent of
>> NULL.
>>
>> For the row-to-column conversion, there's simply no way to get around the
>> pattern of appending items one at a time to an array-like container (list,
>> ndarray, Series, etc.) and dealing with the memory-allocation problem.
>> I've looked at Python's list code and it's certainly no worse at that than
>> what Postgres would do. The big reason I would do that work in Python,
>> though, is that you then have the ability to use any number of types for
>> it; they just have to support the equivalent of append().
>>
>> --
>> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
>> Experts in Analytics, Data Architecture and PostgreSQL
>> Data in Trouble? Get it in Treble! http://BlueTreble.com
>> 855-TREBLE2 (855-873-2532)
>
>
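A minimal sketch of the two paths Jim describes above: int4 values arriving as
4-byte network-order binary and reinterpreted as a column, versus the
append()-based row-to-column fallback. It assumes NumPy and uses made-up
payload and row data; it is not FE-BE protocol or psycopg code:

    import numpy as np

    # Pretend wire payload: three int4 values in network byte order (big-endian).
    wire = (1).to_bytes(4, "big") + (2).to_bytes(4, "big") + (3).to_bytes(4, "big")

    # Binary path: reinterpret the bytes directly as a big-endian int32 column.
    col = np.frombuffer(wire, dtype=">i4")
    print(col)                                # [1 2 3]

    # Row path: append one value at a time to any container supporting append().
    rows = [("a", 1), ("b", 2), ("c", 3)]     # hypothetical result-set rows
    names, values = [], []
    for name, value in rows:
        names.append(name)
        values.append(value)
    print(np.asarray(values, dtype="int32"))  # [1 2 3]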