Re: I'd like to discuss scaleout at PGCon - Mailing list pgsql-hackers
From | Konstantin Knizhnik |
---|---|
Subject | Re: I'd like to discuss scaleout at PGCon |
Date | |
Msg-id | a147739a-dd03-73e1-0187-1bafa14dec5e@postgrespro.ru |
In response to | Re: I'd like to discuss scaleout at PGCon ("MauMau" <maumau307@gmail.com>) |
Responses |
Re: I'd like to discuss scaleout at PGCon
RE: I'd like to discuss scaleout at PGCon |
List | pgsql-hackers |
On 05.06.2018 20:17, MauMau wrote:
> From: Merlin Moncure
>> FWIW, Distributed analytical queries is the right market to be in.
>> This is the field in which I work, and this is where the action is at.
>> I am very, very, sure about this. My view is that many of the
>> existing solutions to this problem (in particular hadoop class
>> solutions) have major architectural downsides that make them
>> inappropriate in use cases that postgres really shines at; direct
>> hookups to low latency applications for example. postgres is
>> fundamentally a more capable 'node' with its multiple man-millennia of
>> engineering behind it. Unlimited vertical scaling (RAC etc) is
>> interesting too, but this is not the way the market is moving as
>> hardware advancements have reduced or eliminated the need for that in
>> many spheres.
>
> I'm feeling the same. As the Moore's Law ceases to hold, software
> needs to make most of the processor power. Hadoop and Spark are
> written in Java and Scala. According to Google [1] (see Fig. 8), Java
> is slower than C++ by 3.7x - 12.6x, and Scala is slower than C++ by
> 2.5x - 3.6x.
>
> Won't PostgreSQL be able to cover the workloads of Hadoop and Spark
> someday, when PostgreSQL supports scaleout, in-memory database,
> multi-model capability, and in-database filesystem? That may be a
> pipedream, but why do people have to tolerate the separation of the
> relational-based data warehouse and Hadoop-based data lake?
>
> [1] Robert Hundt. "Loop Recognition in C++/Java/Go/Scala".
> Proceedings of Scala Days 2011
>
> Regards
> MauMau

I cannot completely agree with this. I have done a lot of benchmarking of PostgreSQL, CitusDB, SparkSQL and native C/Scala code generated for TPC-H queries. The picture is not so obvious... All these systems provide different scalability and so show their best performance on different hardware configurations. Also, the Java JIT has made good progress since 2011.

Calculation-intensive code (like matrix multiplication) implemented in Java is about 2 times slower than optimized C code. But DBMSes are rarely CPU bound. Even if the whole database fits in memory (which is not such a common scenario for big data applications), the speed of a modern CPU is much higher than RAM access speed...

Java applications are slower than C/C++ mostly because of garbage collection. This is why SparkSQL is moving to an off-heap approach, where objects are allocated outside the Java heap and so do not affect the Java GC. New versions of SparkSQL with off-heap memory and native code generation show very good performance. And high scalability has always been one of the major features of SparkSQL.

So it is naive to expect that Postgres will be 4 times faster than SparkSQL on analytic queries just because it is written in C and SparkSQL in Scala. Postgres has made very good progress in OLAP support in recent releases: it now supports parallel query execution, JIT, partitioning... But its scalability is still very limited compared with SparkSQL. I am not sure about GreenPlum with its sophisticated distributed query optimizer, but most other OLAP solutions for Postgres are not able to efficiently handle complex queries (with a lot of joins on non-partitioning keys).

I do not want to say that it is not possible to implement a good analytic platform for OLAP on top of Postgres. But it is a very challenging task. And IMHO the choice of programming language is not so important. What is more important is the format in which data is stored. The best systems for data analytics (Vertica, HyPer, KDB, ...) use a vertical (columnar) data model. SparkSQL also uses the Parquet file format, which provides efficient extraction and processing of data. With the abstract storage API, Postgres is also given a chance to implement efficient storage for OLAP data processing. But a huge amount of work has to be done here.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
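
To illustrate the point about the vertical data model made in the message above, here is a minimal sketch in Python. It is not taken from Postgres, SparkSQL or Parquet; the toy table and the names rows, columns, sum_price_row_store and sum_price_column_store are all hypothetical. It only shows why an aggregate over one attribute touches less data when each attribute is stored contiguously.

```python
from array import array

# Hypothetical toy table of (order_id, quantity, price) records.
# All names and values here are illustrative assumptions.
rows = [(i, i % 7 + 1, float(i % 100)) for i in range(1000)]


def sum_price_row_store(records):
    """Row layout: every whole record is visited even though
    only the price attribute is needed."""
    return sum(record[2] for record in records)


# Column layout: each attribute lives in its own contiguous array.
columns = {
    "order_id": array("q", (r[0] for r in rows)),
    "quantity": array("q", (r[1] for r in rows)),
    "price": array("d", (r[2] for r in rows)),
}


def sum_price_column_store(cols):
    """Column layout: only the price array is scanned; the other
    attributes are never read."""
    return sum(cols["price"])


if __name__ == "__main__":
    # Both layouts hold the same data and give the same answer.
    print(sum_price_row_store(rows), sum_price_column_store(columns))
```

Both functions return the same total; the difference is in how much data has to be read to get it. In a real columnar format the per-column arrays would also be compressed and read in chunks, which this sketch does not attempt to model.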