Thread: Index over only uncommon values in table

Index over only uncommon values in table

From
Steven Schlansker
Date:
Hi everyone,

I assume this is not easy with standard PG but I wanted to double check.

I have a column that has a very uneven distribution of values.  ~95% of the values will be the same, with some long
tailof another few dozens of values. 

I want to have an index over this value.  Queries that select the most common value will not use the index, because it
isa overwhelming percentage of the table.  This means that ~95% of the disk space and IOPS to maintain the index is
"wasted".

I cannot use a hardcoded partial index because:
1) The common value is not known at schema definition time, and may change (very slowly) over time.
2) JDBC uses prepared statements for everything, and the value to be selected is not known at statement prepare time,
soany partial indices are ignored (this is a really really obnoxious behavior and makes partial indices almost useless
combinedwith prepared statements, sadly…) 

The table size is expected to approach the 0.5 billion row mark within the next few months, hence my eagerness to save
evenseemingly small amounts of per-row costs. 

Curious if anyone has a good way to approach this problem.
Thanks,
Steven



Re: Index over only uncommon values in table

From
John R Pierce
Date:
On 6/18/2013 12:17 PM, Steven Schlansker wrote:
> 1) The common value is not known at schema definition time, and may change (very slowly) over time.

how could a value thats constant in 95% of the rows change, unless you
added 20 times more rows with a new value (and for a big portion of the
time, no value would meet your 95% criteria).

--
john r pierce                                      37N 122W
somewhere on the middle of the left coast



Re: Index over only uncommon values in table

From
Steven Schlansker
Date:
On Jun 18, 2013, at 12:23 PM, John R Pierce <pierce@hogranch.com> wrote:

> On 6/18/2013 12:17 PM, Steven Schlansker wrote:
>> 1) The common value is not known at schema definition time, and may change (very slowly) over time.
>
> how could a value thats constant in 95% of the rows change, unless you added 20 times more rows with a new value (and
fora big portion of the time, no value would meet your 95% criteria). 

The table is a denormalized version of some packed data.  The packed data is constant, but the extractor code changes
overtime.  The value in question is a "extractor version used to create this row". 

There is a periodic job that attempts to find batches of rows that have fields extracted by an old version of the
extractor. These rows are re-extracted from the packed data. 

So, most of the time the vast majority of rows will have CURRENT_VERSION as their version, and a small percentage of
rowswill have a previous version.  The job will select rows where extracted_version != CURRENT_VERSION.  If this query
isnot indexed, even doing a periodic check if any rows exist takes an absurd amount of time. 

At some point, the code changes, and CURRENT_VERSION gets incremented.  Rows then slowly (over a period of days /
weeks)get "upgraded" to the new current version, in batches of thousands. 


This is what I mean by a very slowly changing mostly-constant value.

Hope that makes sense,
Steven




Re: Index over only uncommon values in table

From
David Johnston
Date:
Steven Schlansker-3 wrote
> 1) The common value is not known at schema definition time, and may change
> (very slowly) over time.
>
> 2) JDBC uses prepared statements for everything, and the value to be
> selected is not known at statement prepare time, so any partial indices
> are ignored (this is a really really obnoxious behavior and makes partial
> indices almost useless combined with prepared statements, sadly…)

I'm not conversant enough to explain, in recent versions of PostgreSQL,
where this behavior has been modified so that parameter values are indeed
taken into account by the planner.

Thinking out loud here...

For your partial-index solution I guess you would have to implement a
routine where you query the statistics table for the most common values
array and the create a partial index where those values are excluded.
Periodically you would have to create a new index as the most frequent
values change.  To get the planner to use said partial index you would have
to append the same "NOT IN ('a','b','c')" clause to the query that you use
to construct the index.  Solving the planner and index problem at the same
time you should consider encapsulating this logic in a view that you can
replace as necessary.

You should include the version of PostgreSQL you are using since possible
solutions will vary depending on the presence of newer features.

Dave




--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Index-over-only-uncommon-values-in-table-tp5759735p5759743.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Index over only uncommon values in table

From
David Johnston
Date:
Steven Schlansker-3 wrote
> At some point, the code changes, and CURRENT_VERSION gets incremented.
> Rows then slowly (over a period of days / weeks) get "upgraded" to the new
> current version, in batches of thousands.
>
> This is what I mean by a very slowly changing mostly-constant value.

This seems insane without knowing the details.  This seems like it would be
more of a cache invalidation problem.  What percentage of your rows are
being updated multiple times without ever being queried for other reasons?

I was going to say that table partitioning (INHERITS) seems like a
possibility; then I thought maybe not; now I'm back to suggesting you
consider it.

Every version of the extractor would get its own table.  To "upgrade" you
remove the record from the older table and add it to the newer one.  Maybe
even consider calling the these "version_upgraded" to distinguish them from
records originally insert using the newest version.  Or have "original
version" as the partition key and a second "current version" field that
varies.  Not sure how the planner would be able to use constraint exclusion
to limiting the scanning though...

Hope this helps.

David J.



--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Index-over-only-uncommon-values-in-table-tp5759735p5759748.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Index over only uncommon values in table

From
Steven Schlansker
Date:
On Jun 18, 2013, at 1:49 PM, David Johnston <polobo@yahoo.com> wrote:

> Steven Schlansker-3 wrote
>> At some point, the code changes, and CURRENT_VERSION gets incremented.
>> Rows then slowly (over a period of days / weeks) get "upgraded" to the new
>> current version, in batches of thousands.
>>
>> This is what I mean by a very slowly changing mostly-constant value.
>
> This seems insane without knowing the details.  This seems like it would be
> more of a cache invalidation problem.  What percentage of your rows are
> being updated multiple times without ever being queried for other reasons?

I am open to suggestions of how to do it better.  The problem I face is that
doing any sort of updates in one big go -- whether it be by ALTER TABLE
statements or large UPDATE queries -- is all but unworkable.  It takes days
or weeks depending on what the update is, so any locking causes the entire
system to grind to a halt.  And there is nothing more depressing than losing
5 days of work on a huge UPDATE because something hiccuped.

Hence, allowing "outdated" versions in the table, which then over time get upgraded
in reasonably-sized batches.

>
> I was going to say that table partitioning (INHERITS) seems like a
> possibility; then I thought maybe not; now I'm back to suggesting you
> consider it.
>
> Every version of the extractor would get its own table.  To "upgrade" you
> remove the record from the older table and add it to the newer one.  Maybe
> even consider calling the these "version_upgraded" to distinguish them from
> records originally insert using the newest version.  Or have "original
> version" as the partition key and a second "current version" field that
> varies.  Not sure how the planner would be able to use constraint exclusion
> to limiting the scanning though…
>

Interesting idea.  I have been trying to avoid making code changes require
schema changes as well -- it is very nice to not have to make schema changes for every
code deployment.  The code may get changed multiple times in the same day, if I am
busy hacking on it.  Having to muck around with table inheritance and changing partition
definitions on code deployments seems unpleasant.  Perhaps I am overestimating the work
involved, but I am very much trying to keep the deployment process as brain-dead-simple
as possible.

Thanks for the input.
Steven



Re: Index over only uncommon values in table

From
Jeff Janes
Date:
On Tue, Jun 18, 2013 at 12:17 PM, Steven Schlansker <steven@likeness.com> wrote:
Hi everyone,

I assume this is not easy with standard PG but I wanted to double check.

I have a column that has a very uneven distribution of values.  ~95% of the values will be the same, with some long tail of another few dozens of values.

I want to have an index over this value.  Queries that select the most common value will not use the index, because it is a overwhelming percentage of the table.  This means that ~95% of the disk space and IOPS to maintain the index is "wasted".

I cannot use a hardcoded partial index because:
1) The common value is not known at schema definition time, and may change (very slowly) over time.


I think this is going to turn into a game of whack-a-mole.  There is going to have to be some transition period during which all or most of the rows need to be indexed.  So that amount of space is going to have to be available, and given that it is available, what advantage is there to using it some of the time and not using it some of the time? You can't feasibly use it for something else during the off periods, because then you will run into emergency out of space situations.
 
 
2) JDBC uses prepared statements for everything, and the value to be selected is not known at statement prepare time, so any partial indices are ignored (this is a really really obnoxious behavior and makes partial indices almost useless combined with prepared statements, sadly…)


What version are you using?  This has been improved in 9.2.0.

Cheers,

Jeff

Re: Index over only uncommon values in table

From
Steven Schlansker
Date:
On Jun 18, 2013, at 2:29 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

> On Tue, Jun 18, 2013 at 12:17 PM, Steven Schlansker <steven@likeness.com> wrote:
> Hi everyone,
>
> I assume this is not easy with standard PG but I wanted to double check.
>
> I have a column that has a very uneven distribution of values.  ~95% of the values will be the same, with some long
tailof another few dozens of values. 
>
> I want to have an index over this value.  Queries that select the most common value will not use the index, because
itis a overwhelming percentage of the table.  This means that ~95% of the disk space and IOPS to maintain the index is
"wasted".
>
> I cannot use a hardcoded partial index because:
> 1) The common value is not known at schema definition time, and may change (very slowly) over time.
>
>
> I think this is going to turn into a game of whack-a-mole.  There is going to have to be some transition period
duringwhich all or most of the rows need to be indexed.  So that amount of space is going to have to be available, and
giventhat it is available, what advantage is there to using it some of the time and not using it some of the time? You
can'tfeasibly use it for something else during the off periods, because then you will run into emergency out of space
situations.

This is a good point.

I could define it further to e.g. only take values which make up < 10% of the table, for example, which may solve this
problem.

But from the responses I've been getting it sounds like there's nothing standard that will "just solve my problem" and
it'snot worth the effort to me to cook up something intelligent right now. 

>
>
> 2) JDBC uses prepared statements for everything, and the value to be selected is not known at statement prepare time,
soany partial indices are ignored (this is a really really obnoxious behavior and makes partial indices almost useless
combinedwith prepared statements, sadly…) 
>
>
> What version are you using?  This has been improved in 9.2.0.

Thank goodness!  We are in the process of migrating to the 9.2 branch now, so I am thrilled to get this.  I'm sorry I
didn'tmention the version earlier, funny how the simplest things escape your mind when you're trying to make a good
mailinglist post... 

Thanks again for all the input.



Re: Index over only uncommon values in table

From
David Johnston
Date:
Steven Schlansker-3 wrote
> The code may get changed multiple times in the same day, if I am
> busy hacking on it.

On the production table???

The other thought-line is just use a (primary key, version) index and make
use LIMIT / OFFSET with an ORDER BY on the PK and the filter on version
AFTER the initial rows are selected.

UPDATE tbl SET ....
FROM (
WITH possible_set AS ( SELECT pkid, version FROM tbl ORDER BY pkid DESC
LIMIT 5000 OFFSET 10000 )
SELECT pkid FROM possible_set WHERE version <> 'CURRENT_VERSION'
) src
tbl.pkid = src.pkid
;

DESC pkid means that newer records are prioritized over older ones.

Probably want a table to keep track of which ranges have already been
reserved and/or updated then client can connect and reserve a range and
perform an update.  Concurrency and failure modes would need attention.  You
can also keep track on entire ranges and identify which version all the
records are at and select ranges based upon how far behind the current
version they are (or whatever priority algorithm you desire - including
manual tweaks).

David J.





--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Index-over-only-uncommon-values-in-table-tp5759735p5759765.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.