Re: Dynamic Partitioning using Segment Visibility Maps - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Dynamic Partitioning using Segment Visibility Maps
Msg-id 1200044579.4266.933.camel@ebony.site
In response to Re: Dynamic Partitioning using Segment Visibility Maps  (Gavin Sherry <swm@alcove.com.au>)
List pgsql-hackers
Gavin,

I've responded to the technical questions you raise here:

On Thu, 2008-01-10 at 02:22 +0100, Gavin Sherry wrote:

> > After we note that a segment is read-only we can scan the segment and
> > record min/max values for all columns. These are then "implicit
> > constraints", which can then be used for segment exclusion in a similar
> > manner as we do with the current constraint exclusion mechanism.
> 
> What about columns which are varlen? The maximum value could be very long -- 1
> GB in fact. Say we decide to not support those. Well, what about
> numeric? What about user defined types?

No problem with any datatypes that I can see. Why would there be?

Very long values as boundary values would be a problem with any type of
partitioning, so some sensible heuristics would need to apply in any
case.
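
To make that concrete, here's a rough SQL sketch of the kind of
per-segment boundary metadata I have in mind. The table and column
names (seg_boundaries, segno, min_val, max_val) are invented for the
example, not a proposed catalog definition:

  -- Hypothetical boundary metadata for one tracked column of one table.
  -- Boundary values use the column's own ordering, so any datatype with
  -- a sort order works; oversized varlena values could simply be skipped
  -- by the heuristics mentioned above.
  CREATE TABLE seg_boundaries (
      relid    oid,     -- table the segment belongs to
      attnum   int2,    -- column the min/max describe
      segno    int4,    -- 1GB segment number
      min_val  date,    -- lowest value present in the segment
      max_val  date,    -- highest value present in the segment
      PRIMARY KEY (relid, attnum, segno)
  );

  -- A scan with WHERE order_date = '2007-06-01' need only touch segments
  -- whose implicit constraint admits that value; conceptually (relid and
  -- attnum filters omitted for brevity):
  SELECT segno
  FROM   seg_boundaries
  WHERE  date '2007-06-01' BETWEEN min_val AND max_val;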

> > This would also allow a Merge Join to utilise exclusion for the inner
> > plan at execution time. Instead of scanning the whole inner table plan
> > from the start, we would be able to start the scan from the appropriate
> > segment. This would require us to pass the current value of the outer
> > plan down into the inner plan. The typical executor nodes on the inner
> > plan would be a SortNode and below that a SeqScan. In that case the
> > executor would need to pass the outer value from the MergeJoinNode down
> > thru the SortNode to the SeqScan node. The SeqScan node could then
> > perform partition exclusion, reducing the time for that scan and also
> > reducing the time for the resulting sort. This sounds like it might be
> > worth doing in normal cases also. It might turn out that the potentially
> > applicable cases are already excluded during planning, I haven't thought
> > about that aspect in enough detail yet.
> 
> I don't think that would work at all. Say you have skew of data
> across the partitions. You may end up doing a seq scan for every outer
> tuple. It would look like a really expensive nested loop.

No, you've misunderstood. I said start the inner scan from the
appropriate segment, i.e. do this once when the scan starts, not once
for every outer row. That would be ridiculous.
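
To illustrate, the effect is as if the executor added a one-time lower
bound to the inner scan, taken from the first outer tuple when the
MergeJoin starts pulling from the inner side. Table and column names
below are invented:

  -- Outer side: orders, sorted on order_date; suppose its first tuple
  -- has order_date = '2007-06-01'.  The inner SeqScan + Sort could then
  -- behave as if this qual had been present all along, evaluated once:
  SELECT *
  FROM   order_lines
  WHERE  order_date >= date '2007-06-01';  -- bound from the first outer tuple
  -- Segments whose max(order_date) is below that bound are skipped,
  -- shrinking both the scan and the sort that feeds the MergeJoin.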

> > If we collect data for all columns then many of our implicit constraints
> > would be useless. e.g. if a column only has a few values and these are
> > present in all segments. Matching our predicate against all constraints
> > would be expensive, so we must weed out poor constraints. We would do
> > this by removing any constraint that overlapped more than 10% of other
> > segments. Various heuristics would likely need to be discovered during
> > development to make this work without resorting to manual commands.
> > 
> > Note that all of this exclusion is happening in the executor, not the
> > planner. That allows this form of exclusion to work with stable
> > functions and parameters without problem.
> 
> So, in that case, you've totally undercut the planner. All the steps up
> the tree would be based on the stats for scanning the entire table but
> you've only returned part of it.

Answered already. Not an issue solely for this particular approach.

> To solve that, you could put the min/max values in a catalog somewhere.
> For a 16 TB table, that's 16000 min/max values. That's pretty expensive.

We agreed the partitioning could and would be adjustable, so the number
need not be 16000.

The number of partitions we can support needs to be fairly high to
support a wide range of applications. Requests for 1-2000 partitions
seem fairly typical to me and we need to be able to handle that, however
we do partitioning.
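
For scale, a quick calculation of boundary rows per tracked column of a
16 TB table at various segment/partition sizes (1 GB is today's segment
size; the larger units assume we make it adjustable):

  SELECT seg_mb, 16 * 1024 * 1024 / seg_mb AS boundary_rows
  FROM   (VALUES (1024), (16 * 1024), (64 * 1024)) AS s(seg_mb);
  -- => 16384 rows at 1 GB, 1024 at 16 GB, 256 at 64 GB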

> And how do we handle making things non-read-only? You'd have to wait for
> all current queries to finish... that could take a long time.

No, I explained that it would not require locking or waiting. The
existing queries would just not take advantage of the newly read-only
state until they complete.

> How do we handle crashes during setting of min/max? I presume it needs
> WAL support.

It's in a catalog table, as we've said. Handled.

> What happens if the free space map gives us a free block in a read-only
> segment? Then we need to look at concurrency and transactionality of
> min/max values, don't we?

I discussed that and explained how it would be handled. 

> > Visibility Map
> > --------------
> [snip]
> > No dynamic shared memory cache is required because any concurrent
> > changes to the table would be ignored by a Scan anyway, so it doesn't
> > matter if an INSERT, UPDATE or DELETE occurs while we are scanning. Any
> > new scans that start will attempt to lock the table and then perform a
> > rel cache check before continuing. So the visibility will be set
> > correctly for *that* scan at least.
> 
> Is that true in the presence of READ COMMITTED?

Yes, because READ COMMITTED has nothing to do with it.

The SVM would be refreshed via the relcache invalidation mechanism,
which gets checked every time we obtain a lock we did not already hold.
In the case of a table with bits set in the SVM we would need to recheck
that each time we try to obtain a lock, even if we already hold it.

> > In most cases the visibility map can be summarised as a single boolean
> > to show whether any 100% visible segments exist. That makes accessing
> > the map very cheap in the common, unset case.
> 
> What do we do in failure scenarios? Does this go into WAL?

Yes

> Is it
> replicated via log shipping techniques?

Yes

> > Setting the Visibility Map
> > --------------------------
> > 
> > VACUUM would scan all READ_WRITE_ALLOWED segments and mark some of
> > them as EFFECTIVE_READ_ONLY if 100% of the remaining rows are visible to
> > all current backends. Marking the map will cause a rel cache
> > invalidation.
> 
> Argh. Remember, we're talking about 16 TB tables here. With the best of
> tuning, that takes all day. So, day 1, load data; day 2 VACUUM it? With
> the existing partitioning approach, there's no such thing.

I would differ with you there. The first query after a large COPY will
be forced to set the hint bits and re-write the table. If you just
loaded 16TB then you'd have to rewrite 16TB, which as you point out can
take a while. The VACUUM will kick in automatically following a large
load and either it or a query will perform the re-writing.

For a 16TB table with declared partitions, each one of those tuples
would need to have its partition calculated prior to insertion. I
think the cost of post-assessing the partition boundaries is the same or
cheaper than the cost of pre-assessing the partition location.
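
To compare the two load paths side by side (statements illustrative
only; the exact commands would depend on how the declared partitions
were set up):

  -- Declared partitions: each loaded row is routed to its partition at
  -- INSERT time, e.g. by a trigger or rule; the routing code runs once
  -- per row across the whole 16TB load:
  COPY orders FROM '/data/orders.dat';

  -- Segment exclusion: load first, derive boundaries afterwards:
  COPY orders FROM '/data/orders.dat';   -- plain bulk load, no routing
  VACUUM orders;                         -- would mark filled segments read-only
  -- ...followed by one sequential pass per read-only segment to record
  -- min/max boundary values.
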
> > We would never mark the highest numbered segment
> > EFFECTIVE_READ_ONLY, since it is likely to expand when INSERTs occur.
> > This is also a useful way of excluding smaller tables from the overheads
> > involved for this feature.
> 
> What about concurrent updates or deletes -- during VACUUM? Surely we
> don't have to vacuum full? Concurrent updates and deletes could change
> the partition boundaries and that will impact on the accuracy of segment
> exclusion (I do not think it leads to data corruption but it might lead
> to the inclusion of a segment that shouldn't have been included). That
> leads to unpredictable behaviour.

The VACUUM would need to set a read-only pending flag. After the VACUUM
has finished, any update or delete would clear the flag. As you say, it's
possible that concurrent updates and deletes could have changed rows in
blocks behind the first VACUUM scan. After the VACUUM we re-scan a
read-only pending segment to derive boundary values. During this scan,
if we find a change made by a transaction after the VACUUM's xmin then
we would clear the flag. So concurrent changes of any kind would be
caught and we don't need locks or VACUUM FULL.
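
As a sketch of the ordering that makes this safe (flag names are from
the proposal above, the statements just show the interleaving):

  -- Session A:
  VACUUM orders;   -- finds segment 42 fully visible, sets read-only pending

  -- Session B, concurrently:
  UPDATE orders SET status = 'shipped' WHERE order_id = 123;  -- hits segment 42

  -- Session A, boundary re-scan of segment 42: it sees a tuple whose xmin
  -- is later than the VACUUM's xmin, so it clears the pending flag and the
  -- segment stays READ_WRITE_ALLOWED.  No lock, no VACUUM FULL.
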
> > In an insert-only table this would mean that only the last 1 or 2
> > segments would be read write, so VACUUM would always run in a bounded
> > time, no matter how large the table. 
> 
> You said elsewhere that unload was low cost. What about segments which
> cross the unload boundary. Half of their data needs to be removed, say.
> That sounds expensive. And, they'll have to be vacuumed.

Well, yes, but if you wanted this to be properly optimised you would
need to delete using the boundary values of the partition, so it's
effectively equivalent to deleting a declared partition.
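
For example, with date-ordered data the unload would be written against
the boundary values themselves (dates and names invented):

  -- The scan for this DELETE touches only segments whose min/max admit
  -- the predicate, and those segments are wholly dead afterwards:
  DELETE FROM orders WHERE order_date < date '2007-01-01';
  -- ...which removes effectively the same data as dropping the
  -- equivalent declared partition (e.g. DROP TABLE orders_2006q4).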

> With the existing partitioning, we could just truncate a partition. This
> seems really expensive in comparison.

I explained it would be possible to do that if it was required.

> > There would be additional complexity in selectivity estimation and plan
> > costing. The above proposal allows dynamic segment exclusion, which
> > cannot be assessed at planning time anyway, so suggestions welcome...
> 
> As I've noted elsewhere, you cannot just undermine the planner like
> this. Your plans will be totally bogus. I hope you're not proposing
> that. If you're proposing tracking information for each segment then
> there are problems there too.

Discussed already, not an issue related solely to dynamic partitioning.

> > Comparison with other Partitioning approaches
> > ---------------------------------------------
> > 
> > Once I realised this was possible in a fairly automatic way, I've tried
> > hard to keep away from manual overrides, commands and new DDL.
> > 
> > Declarative partitioning is a big overhead, though worth it for large
> > databases. No overhead is *much* better though.
> 
> I think you're making this out to be harder than it is. Dealing with
> lots of data is difficult, issuing one DDL command is not such a big
> deal.

I agree, that's not really the hard part.

--
 Simon Riggs
 2ndQuadrant  http://www.2ndQuadrant.com


