Re: PostgreSQL suitable? - Mailing list pgsql-general

From Vincenzo Romano
Subject Re: PostgreSQL suitable?
Msg-id CAHjZ2x654f=YP7faiUyXmY+aDSGgqfBtERkWcBgFcauYcg-e=g@mail.gmail.com
In response to Re: PostgreSQL suitable?  (Stephen Frost <sfrost@snowman.net>)
List pgsql-general
Hi.

2017-12-19 17:32 GMT+01:00 Stephen Frost <sfrost@snowman.net>:
> Greetings,
>
> * Vincenzo Romano (vincenzo.romano@notorand.it) wrote:
>> 2017-12-19 16:48 GMT+01:00 Stephen Frost <sfrost@snowman.net>:
>> > * Vincenzo Romano (vincenzo.romano@notorand.it) wrote:
>> >> Sorry, my bad: I confused V10 with v11.
>> But according to a discussion with Bruce Momjian, table partitioning
>> in V10 is little more than syntactic sugar around old-fashioned table
>> partitioning.
>> >
>> > Well, it's a bit more than that since there's tuple-routing, but you're
>> > right that the partition elimination is the same as it was in earlier
>> > versions and based on constraint exclusion.  That said, as noted in the
>> > email you replied to, reasonable numbers of partitions aren't too bad
>> > even with the planning cost; it's when you have many thousands of
>> > partitions that you get into cases where planning time for queries is
>> > really bad.
>>
>> When you have to handle a 100TB table, the number of partitions
>> shouldn't be 10 or 12
>> as seen in most examples and tests.
>
> Probably not, but it would depend on the data and what you're doing with
> it.  While there are some general rules-of-thumb, there's no one right
> answer when it comes to the size of individual partitions.

I think partitioning is done in order to make tables smaller, or to
balance table size against table count.
Today you still have to keep a limit on the number of sub-tables in
mind, and that makes partitioning less effective.
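
Just to fix ideas, here is what V10 declarative partitioning looks like
(the schema is made up for illustration). Queries against the parent
still prune sub-tables by constraint exclusion, so every extra
partition adds linear work at planning time:

  -- A hypothetical big time-series table, split by month.
  CREATE TABLE measurements (
      device_id   bigint      NOT NULL,
      recorded_at timestamptz NOT NULL,
      value       double precision
  ) PARTITION BY RANGE (recorded_at);

  CREATE TABLE measurements_2017_01 PARTITION OF measurements
      FOR VALUES FROM ('2017-01-01') TO ('2017-02-01');
  CREATE TABLE measurements_2017_02 PARTITION OF measurements
      FOR VALUES FROM ('2017-02-01') TO ('2017-03-01');
  -- ...one CREATE per month; with a 100TB parent you would script this.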

>> This is the same type of issue you hit with partial indexes (this is
>> why I mentioned them earlier).
>> The sub-table (and partial index) selection algorithm should be
>> logarithmic or sub-linear.
>
> Sure, and work is being done to improve PG in that exact area, but that
> doesn't mean it can't handle workloads like this today; you just have to
> be aware of the costs associated with today's partitions.
>
>> As long as it'll be linear, you'll hear about "reasonable number of partitions".
>> One thousand partitions for a 100TB table would make "manageable"
>> 100GB sub-tables.
>
> Yup, and 100G tables are certainly large and a bit awkward but they can
> be managed.

Of course. But I would bet they are still considered "very large tables".

>> I could easily be wrong, but this is an area where PG needs improvement.
>
> Of course, and work is being done to improve it.  What's important is
> knowing that there's a cost to having more partitions when querying
> through the parent when you get to a point where you have thousands of
> partitions.  That cost may be perfectly fine in some use-cases and in
> others it might not be, and instead you'd likely have to build logic
> into the application layer to address it.  That's not ideal, which is
> why there's work being done to improve PG, but it's not necessarily a
> big deal either.  Dynamic SQL is another approach.
>
>> One could maybe think about multi-level partitioning, though.
>
> Not sure exactly what you're referring to here, but a simple multi-level
> partitioning setup with PG wouldn't actually change the cost for
> partition elimination today.

It'd be a tree of tables, leading to logarithmic sub-table selection.
This is why we love tree data structures.
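
In V10 syntax the tree would look something like this (again a made-up
schema, restructuring the example above); note that today's planner
still considers the leaf tables linearly, so the tree shape alone does
not buy logarithmic elimination:

  -- First level: one partition per year, itself partitioned by month.
  CREATE TABLE measurements_2017 PARTITION OF measurements
      FOR VALUES FROM ('2017-01-01') TO ('2018-01-01')
      PARTITION BY RANGE (recorded_at);

  -- Second level: the monthly leaves hang off the yearly node.
  CREATE TABLE measurements_2017_01 PARTITION OF measurements_2017
      FOR VALUES FROM ('2017-01-01') TO ('2017-02-01');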

>> > Also as noted on this thread, PG could handle this data volume, but to
>> > be efficient there would be work to be done in normalization,
>> > aggregation, and analyzing the system to ensure you're storing and
>> > querying on the data efficiently.
>>
>> Normalization will grow the number of tables (and later joins) and
>> you'll very likely end up with at least a table with a
>> "gazillion" rows.
>
> Naturally, but that "gazillion" rows table would be much smaller for
> having the data normalized; if you don't normalize it then the gazillion
> row table is a huge amount of duplicated data, making the entire system
> much larger than necessary.

Partitioning is done for reasons other than de-duplicating data.
The number of rows to be analyzed would still be the same, and the
indexes over those columns would still be rather large.
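
To make the trade-off concrete (schema made up for illustration):
normalization stores the repeated device attributes once, so the big
table's rows get narrower, but its row count, and the indexes on its
remaining columns, stay just as large:

  -- Attributes live once in a small lookup table...
  CREATE TABLE devices (
      device_id bigint PRIMARY KEY,
      name      text,
      location  text
  );

  -- ...instead of being repeated on every one of the "gazillion" rows.
  CREATE TABLE readings (
      device_id   bigint      NOT NULL REFERENCES devices (device_id),
      recorded_at timestamptz NOT NULL,
      value       double precision
  );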

>> I fear normalization, provided it's really needed, would provide little help.
>
> I seriously doubt that's the case.  Normalization might reduce that
> 100TB down to 10's of TB instead, or perhaps even smaller.

A 10x duplication factor seems very bad to me.
If that's the case, then normalization would provide some improvement,
while costing resources to carry out.
Storage nowadays isn't a big issue. Querying is.

>> With tables that big I usually do "software partitioning".
>> I make the software aware of the partition schema so it can direct the
>> queries to the needed (sub-)tables.
>
> Yes, implementing partition elimination at the application level is
> certainly an approach that can work with PG.  Even then, ideally, you'd
> normalize the data so that the footprint is much smaller, making more
> efficient use of the storage and memory available.

Normalizing can save storage resources that you later spend joining
the tables back together.
It's a matter of trade-offs, and in my opinion it strictly depends on
the queries to be run.
Effective table partitioning, instead, seems to me to be of greater
importance.
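
For the record, here is a sketch of that "software partitioning" done
with the dynamic SQL you mentioned (the naming scheme and the helper
are made up): the caller computes the monthly sub-table name and hits
it directly, bypassing the parent and thus the linear partition scan.

  CREATE OR REPLACE FUNCTION read_device_month(p_device bigint, p_month date)
  RETURNS SETOF measurements
  LANGUAGE plpgsql AS $$
  BEGIN
      -- format() with %I safely quotes the computed sub-table name.
      RETURN QUERY EXECUTE format(
          'SELECT * FROM %I WHERE device_id = $1',
          'measurements_' || to_char(p_month, 'YYYY_MM'))
      USING p_device;
  END;
  $$;

  -- e.g.: SELECT * FROM read_device_month(42, DATE '2017-01-01');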

My bottom line here is: table partitioning is nice, but not yet ready
for the big-data scenarios where it makes the most sense.

> Thanks!

I do thank you for your acute remarks.

> Stephen

-- 
Vincenzo Romano - NotOrAnd.IT
Information Technologies
--
NON QVIETIS MARIBVS NAVTA PERITVS

