Thread: Statistics and Multi-Column indexes

Statistics and Multi-Column indexes

From
lars
Date:
I know this has been discussed various times...

We are maintaining a large multi tenant database where *all* tables have
a tenant-id and all indexes and PKs lead with the tenant-id.
Statistics and counts for the all other columns are only really
meaningful within the context of the tenant they belong to.

There appear to be five options for me:
1. Using single column indexes on all interesting columns and rely on
PostgreSQLs bitmap indexes to combine them (which are pretty cool).
2. Use multi column indexes and accept that sometimes Postgres pick the
wrong index (because a non-tenant-id
column might seem highly selective over the table, but it is not for a
particular tenant - or vice versa).
3. Use a functional index that combines multiple columns and only query
via these, that causes statistics
gathering for the expression.
I.e. create index i on t((tenantid||column1)) and SELECT ... FROM t
WHERE tenantid||column1 = '...'
4. Play with n_distinct and/or set the statistics for the inner columns
to some fixed values that lead to the plans that we want.
5. Have a completely different schema and maybe a database per tenant.

Currently we use Oracle and query hinting, but I do not like that
practice at all (and Postgres does not have hints anyway).
Are there any other options?

#1 would be the simplest, but I am concerned about the overhead, both
maintaining two indexes and building the bitmap during queries - for
every query.

I don't think #2 is actually an option. We have some tenants with many
(sometimes 100s) millions of rows per table,
and picking the wrong index would be disastrous.

Could something like #3 be generally added to Postgres? I.e. if there is
a multi column index keep combined statistics for
the involved columns. Of course in that case is it no longer possible to
query the index by prefix.
#3 also seems expensive as the expression needs to be evaluated for each
changed row.

Still trying #4. I guess it involves setting the stat target for the
inner columns to 0 and then inserting my own records into
pg_statistic. Probably only setting n_distinct, i.e. set it "low" if the
inner column is not selective within the context of a tenant and "high"
otherwise.

For various reasons #5 is also not an option.

And of course the same set of questions comes up with joins.

Thanks.

-- Lars


Re: Statistics and Multi-Column indexes

From
Samuel Gendler
Date:
On Sun, Jul 10, 2011 at 2:16 PM, lars <lhofhansl@yahoo.com> wrote:
I know this has been discussed various times...

We are maintaining a large multi tenant database where *all* tables have a tenant-id and all indexes and PKs lead with the tenant-id.
Statistics and counts for the all other columns are only really meaningful within the context of the tenant they belong to.

There appear to be five options for me:
1. Using single column indexes on all interesting columns and rely on PostgreSQLs bitmap indexes to combine them (which are pretty cool).
2. Use multi column indexes and accept that sometimes Postgres pick the wrong index (because a non-tenant-id
column might seem highly selective over the table, but it is not for a particular tenant - or vice versa).
3. Use a functional index that combines multiple columns and only query via these, that causes statistics
gathering for the expression.
I.e. create index i on t((tenantid||column1)) and SELECT ... FROM t WHERE tenantid||column1 = '...'
4. Play with n_distinct and/or set the statistics for the inner columns to some fixed values that lead to the plans that we want.
5. Have a completely different schema and maybe a database per tenant.

 
What about partitioning tables by tenant id and then maintaining indexes on each partition independent of tenant id, since constraint exclusion should handle filtering by tenant id for you.  That seems like a potentially more tolerable variant of #5 How many tenants are we talking about?  I gather partitioning starts to become problematic when the number of partitions gets large.


Re: Statistics and Multi-Column indexes

From
"Kevin Grittner"
Date:
lars <lhofhansl@yahoo.com> wrote:

> We are maintaining a large multi tenant database where *all*
> tables have a tenant-id and all indexes and PKs lead with the
> tenant-id. Statistics and counts for the all other columns are
> only really meaningful within the context of the tenant they
> belong to.
>
> There appear to be five options for me:
> 1. Using single column indexes on all interesting columns and rely
> on PostgreSQLs bitmap indexes to combine them (which are pretty
> cool).

Those are cool -- when programmers are skeptical that they should
just say *what* they want and let the database figure out how to get
it, I like to point out that this option is available to the
planner, but not to application programming.  Of course, there are a
great many other reason *I* find more compelling, but this one tends
to impress application programmers.

Assuming you keep the primary key as a multi-column index, this
seems like a good place to start.

> 2. Use multi column indexes and accept that sometimes Postgres
> pick the wrong index (because a non-tenant-id column might seem
> highly selective over the table, but it is not for a particular
> tenant - or vice versa).

If you have a lot of queries which access data based on matching
some set of columns, an occasional multicolumn index in addition to
the individual column index may be worth it.

You might want to avoid prepared statements, since these are planned
for the general case and can fall down badly for the extremes.

> 3. Use a functional index that combines multiple columns and only
> query via these, that causes statistics gathering for the
> expression. I.e. create index i on t((tenantid||column1)) and
> SELECT ... FROM t WHERE tenantid||column1 = '...'

I would hold off on that until I saw evidence of a specific need.

> 4. Play with n_distinct and/or set the statistics for the inner
> columns to some fixed values that lead to the plans that we want.

Try not to think in terms of "plans we want", but in terms of
modeling your costs so that, given your tables and indexes, the
PostgreSQL planner will do a good job of picking a fast plan.  You
normally need to tweak a few of the costing factors to match the
reality of your server and load.

> 5. Have a completely different schema and maybe a database per
> tenant.

> Are there any other options?

If most queries operate within a single tenant and you have less
than 100 tenants, you might think about partitioned tables.  Beyond
100, or if most queries need to look at many partitions, it becomes
problematic.

> I don't think #2 is actually an option. We have some tenants with
> many (sometimes 100s) millions of rows per table, and picking the
> wrong index would be disastrous.

You can always drop counter-productive or unused indexes.  I think
it's best to look at indexes, as much as possible, as database
tuning options, rather than something with any semantic meaning.  If
you're doing things right, an index will never change the result of
any query, so you are free to try different tunings and see what's
fastest with your schema and your production load.

We have tables with 100s of millions of rows where everything is
indexed by county number.  The biggest county has about 20% of the
rows; the smallest about 0.05%.  We generally do pretty well with
(1) and (2).  We do occasionally find it useful to create an index
with a WHERE clause, though.  You might want to consider those.

-Kevin

Re: Statistics and Multi-Column indexes

From
lars
Date:
On 07/10/2011 02:31 PM, Samuel Gendler wrote:
> What about partitioning tables by tenant id and then maintaining
> indexes on each partition independent of tenant id, since constraint
> exclusion should handle filtering by tenant id for you.  That seems
> like a potentially more tolerable variant of #5 How many tenants are
> we talking about?  I gather partitioning starts to become problematic
> when the number of partitions gets large.
>
I thought I had replied... Apparently I didn't.

The database can grow in two dimensions: The number of tenants and the
number of rows per tenant.
We have many tenants with relatively little data and a few with a lot of
data. So the number of tenants
is known ahead of time and might be 1000's.

-- Lars