Re: In-Memory Columnar Store - Mailing list pgsql-hackers
From | knizhnik |
---|---|
Subject | Re: In-Memory Columnar Store |
Date | |
Msg-id | 52A88DFC.5020601@garret.ru Whole thread Raw |
In response to | Re: In-Memory Columnar Store (Merlin Moncure <mmoncure@gmail.com>) |
Responses |
Re: In-Memory Columnar Store
|
List | pgsql-hackers |
Hello! Implementation of IMCS itself took me about two months (with testing and writing documentation). But huge part of the code was previously written by me for other projects, so I have reused them. Most of the time I have spent in integration of this code with PostgreSQL (I was not so familiar with it before). Certainly implementations of columnar store for Oracle (Oracle Database In-Memory Option), DB2 (BLU Acceleration), ... are more convenient for users: them can execute normal SQL queries and do not require users to learn new functions and approach. But it requires complete redesign of query engine (or providing alternative implementation). I was not able to do it. This is why I try to provide advantages of vertical data representation (vector operation, parallel execution, data skipping) as well as advantages of fast access to in-memory data as standard PostgreSQL extension. There are obviously some limitations and queries look more complicated than in case of standard SQL... But from the other side it is possible to write queries which are hardly to be expressed using standard SQL. For example calculating split-adjusted prices can not be done in SQL without using stored procedures. To make usage of IMCS functions as simple as possible I defined a larger number of various operators for most popular operations. For example Volume-Weighted-Average-Price can be calculated just as: select Volume//Close as VWAP from Quote_get(); It is even shore than analog SQL statement: select sum(Close*Volume)/sum(Volume) as VWAP from Quote; Concerning integration with PostgreSQL, there were several problems. Some of them seems to have no easy solution, but other are IMHO imperfections in PostgreSQL which I hope will be fixed sometime: 1. Calls in PL/pgSQL are very slow - about 1-2 micsroseconds at my computer. Just defining insertion per-row trigger with empty procedure increase time of insertion of 6 million records twice - from 7 till 15 seconds. If trigger procedure is not empty, then time is increased proportionally number of performed calls. In my case inserting data with propagation it in columnar store using trigger takes about 80 seconds. But if I first load data without triggers in PostgreSQL table and then insert it in columnar store using load function (implemented in C), then time will be 7+9=16 seconds. Certainly I realize that plpgsql is interpreted language. But for example also interpreted Python is able to do 100 times more calls per second. Unfortunately profiler doesn;t show some bottleneck - looks like long calltime is caused by large overhead of initializing and resetting memory context and copying arguments data. 2. Inefficient implementation of expanding composite type columns using (foo()).* clause. In this case function foo() will be invoked as much times as there are fields in the returned composite type. Even in case of placing call in FROM list (thanks to lateral joins in 9.3), PostgreSQL still sometimes performs redundant calls which can be avoided using hack with adding "OFFSET 1" clause. 3. 256Gb limit for used shared memory segment size at Linux. Concerning last problem - I have included in IMCS distributive much simpler patch which just set MAP_HUGETLB flags when a) is it defined in system headers b) requested memory size is larger than 256Gb In this case right now PostgreSQL will just fail to start. But certainly it is more correct to trigger this flag through configuration parameter, because large pages can minimize MMU overhead and so increase speed even if size of used memory is less than 256Gb (this is why Oracle is widely using it). . Вызов функции занимает прядка 2 микросекунд. Т.е. если я напишу триггер с пустой процедурой, то вставка 6 миллионов объектов займёт 15 секунд. Это при том, что без триггера вставка занимает всего 7 секунд... On 12/11/2013 06:33 PM, Merlin Moncure wrote: > On Mon, Dec 9, 2013 at 1:40 PM, knizhnik <knizhnik@garret.ru> wrote: >> Hello! >> >> I want to annouce my implementation of In-Memory Columnar Store extension >> for PostgreSQL: >> >> Documentation: http://www.garret.ru/imcs/user_guide.html >> Sources: http://www.garret.ru/imcs-1.01.tar.gz >> >> Any feedbacks, bug reports and suggestions are welcome. >> >> Vertical representation of data is stored in PostgreSQL shared memory. >> This is why it is important to be able to utilize all available physical >> memory. >> Now servers with Tb or more RAM are not something exotic, especially in >> financial world. >> But there is limitation in Linux with standard 4kb pages for maximal size >> of mapped memory segment: 256Gb. >> It is possible to overcome this limitation either by creating multiple >> segments - but it requires too much changes in PostgreSQL memory manager. >> Or just set MAP_HUGETLB flag (assuming that huge pages were allocated in the >> system). >> >> I found several messages related with MAP_HUGETLB flag, the most recent one >> was from 21 of November: >> http://www.postgresql.org/message-id/20131125032920.GA23793@toroid.org >> >> I wonder what is the current status of this patch? > I looked over your extension. I think it's a pretty amazing example > of the postgres extension and type systems -- up there with postgis. > Very well done. How long did this take you to write? > > MAP_HUGETLB patch was marked 'returned with feedback'. > https://commitfest.postgresql.org/action/patch_view?id=1308. It seems > likely to be revived, perhaps in time for 9.4. > > Honestly, I think your efforts here provide more argument for adding > huge tbl support. > > merlin
pgsql-hackers by date: