Thread: Alternatives to very large tables with many performance-killing indices?

Alternatives to very large tables with many performance-killing indices?

From: Wells Oliver
Hey folks, a question. We have a table that's getting large (6 million rows right now, but hey, no end in sight). It's wide-ish, too, 98 columns.

The problem is that each of these columns needs to be searchable quickly at an application level, and I'm far too responsible an individual to put 98 indexes on a table. Wondering what you folks have come across in terms of creative solutions that might be native to postgres. I can build something that indexes the data and caches it and runs separately from PG, but I wanted to exhaust all native options first.

Thanks!

--
Wells Oliver
wellsoliver@gmail.com

Re: Alternatives to very large tables with many performance-killing indices?

From: Merlin Moncure
On Thu, Aug 16, 2012 at 3:54 PM, Wells Oliver <wellsoliver@gmail.com> wrote:
> Hey folks, a question. We have a table that's getting large (6 million rows
> right now, but hey, no end in sight). It's wide-ish, too, 98 columns.
>
> The problem is that each of these columns needs to be searchable quickly at
> an application level, and I'm far too responsible an individual to put 98
> indexes on a table. Wondering what you folks have come across in terms of
> creative solutions that might be native to postgres. I can build something
> that indexes the data and caches it and runs separately from PG, but I
> wanted to exhaust all native options first.

Well, you could explore normalizing your table, particularly if many
of your 98 columns are null most of the time.  Another option would be
to implement hstore for attributes and index with GIN/GIST --
especially if you need to filter on multiple columns.  Organizing big
data for fast searching is a complicated topic and requires
significant thought in terms of optimization.

merlin
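
A minimal sketch of the hstore-plus-GIN idea Merlin describes, assuming the sparse searchable attributes are pulled into a single hstore column; the table name "widgets" and column name "attrs" are placeholders, not the actual schema:

-- Hypothetical layout: key columns stay as real columns, the many
-- sparse searchable attributes move into one hstore column.
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE widgets (
    id    bigserial PRIMARY KEY,
    attrs hstore
);

-- One GIN index covers lookups on any key in attrs.
CREATE INDEX widgets_attrs_gin ON widgets USING gin (attrs);

-- "WHERE col17 = '42'" becomes a containment query:
SELECT id FROM widgets WHERE attrs @> 'col17=>42'::hstore;

The trade-off, discussed further down-thread, is write-time maintenance: one GIN index serves every key, but keeping it up to date under heavy writes can be slower than maintaining individual btree indexes.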


Re: Alternatives to very large tables with many performance-killing indices?

On 2012-08-16, Wells Oliver <wellsoliver@gmail.com> wrote:
> Hey folks, a question. We have a table that's getting large (6 million rows
> right now, but hey, no end in sight). It's wide-ish, too, 98 columns.
>
> The problem is that each of these columns needs to be searchable quickly at
> an application level, and I'm far too responsible an individual to put 98
> indexes on a table. Wondering what you folks have come across in terms of
> creative solutions that might be native to postgres. I can build something
> that indexes the data and caches it and runs separately from PG, but I
> wanted to exhaust all native options first.

get rid of some of the columns?



--
⚂⚃ 100% natural

Re: Alternatives to very large tables with many performance-killing indices?

From: Jeff Janes

On Thu, Aug 16, 2012 at 1:54 PM, Wells Oliver <wellsoliver@gmail.com> wrote:
> Hey folks, a question. We have a table that's getting large (6 million rows
> right now, but hey, no end in sight).

Does it grow in chunks, or one row at a time?

> It's wide-ish, too, 98 columns.

How many of the columns are NULL for any given row?  Or perhaps
better, what is the distribution of values for any given column?  For
a given column, is there some magic value (NULL, 0, 1, -1, 9999, '')
which most of the rows have?

> The problem is that each of these columns needs to be searchable quickly at
> an application level, and I'm far too responsible an individual to put 98
> indexes on a table.

That is somewhat melodramatic.  Sure, creating 98 indexes does not
come for free.  And it is great that you are aware of this.  But just
because they are not free does not mean they are not worth their cost.
Look at all the other costs of using an RDBMS.  Each letter of ACID
does not come for free.  But it is often worth the price.

In the generic case, you have a large amount of data to index.
Indexing a lot of data requires a lot of resources.  There is no magic
bullet for this.

> Wondering what you folks have come across in terms of
> creative solutions that might be native to postgres. I can build something
> that indexes the data and caches it and runs separately from PG, but I
> wanted to exhaust all native options first.

If the data is frequently updated/inserted, then how would you
invalidate the cache when needed?  And if the data is not frequently
updated/inserted, then what is the problem with the obvious PG solution
(building 96 indexes)?

If your queries are of the nature of:

where
  col1=:1 or
  col2=:1 or
  col3=:1 or
...
  col96=:1

then a full text index would probably be a better option.
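
As a rough sketch of that full-text option, assuming the columns are text or can be cast to text, and using placeholder table and column names (note the query must repeat the exact indexed expression for the planner to use the index):

-- One GIN index over a tsvector built from all the searchable columns.
CREATE INDEX t_search_idx ON t USING gin (
    to_tsvector('simple',
        coalesce(col1::text, '') || ' ' ||
        coalesce(col2::text, '') || ' ' ||
        coalesce(col3::text, '')   -- ...repeat for the remaining columns
    )
);

-- The "col1=:1 OR col2=:1 OR ..." search becomes one text-search match;
-- 'sometext' stands in for the :1 parameter above.
SELECT *
FROM   t
WHERE  to_tsvector('simple',
           coalesce(col1::text, '') || ' ' ||
           coalesce(col2::text, '') || ' ' ||
           coalesce(col3::text, '')
       ) @@ plainto_tsquery('simple', 'sometext');

Keep in mind this matches tokens rather than exact column values, so it behaves somewhat differently from strict equality.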

Otherwise, it is hard to say.  You could replace 96 columns with a
single hstore column which has 96 different keys.  But from what I can
tell, maintaining a gin index on that hstore column would probably be
slower than maintaining 96 individual btree indexes.

And if you go with a gist index on the single hstore column, the cost
of maintenance is greatly reduced relative to gin.  But the index is
then basically useless; you might as well just drop it and do a full
table scan instead.

Cheers,

Jeff


Re: Alternatives to very large tables with many performance-killing indices?

From: Scott Marlowe
On Thu, Aug 16, 2012 at 2:54 PM, Wells Oliver <wellsoliver@gmail.com> wrote:
> Hey folks, a question. We have a table that's getting large (6 million rows
> right now, but hey, no end in sight). It's wide-ish, too, 98 columns.
>
> The problem is that each of these columns needs to be searchable quickly at
> an application level, and I'm far too responsible an individual to put 98
> indexes on a table. Wondering what you folks have come across in terms of
> creative solutions that might be native to postgres. I can build something
> that indexes the data and caches it and runs separately from PG, but I
> wanted to exhaust all native options first.

I submit that you're far better off working with the users to see
which fields they really need indexes on, and especially which
combinations of fields with functional and / or partial indexes serve
them the best.

To start with, you can create indexes willy-nilly if you want and then
use the pg_stat_*_indexes views to see which are or are not getting
used and start pruning them as time goes by.  But keep an eye out for
long running queries in your logs and investigate which specialized
indexes might help the most for those queries.  Often a simple index on
(a,b) where x is not null, or something similar, can give greater
improvements than combining multiple single-column indexes in a bitmap
scan ever could, especially on large data sets.
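
For the pruning step, something along these lines against pg_stat_user_indexes shows which indexes have never been used since the statistics were last reset (ordering by size is just one way to rank the candidates):

-- Indexes with zero scans since the last stats reset, largest first.
SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan
FROM   pg_stat_user_indexes
WHERE  idx_scan = 0
ORDER  BY pg_relation_size(indexrelid) DESC;

The counters are cumulative, so let the workload run for a while before dropping anything based on this.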


Re: Alternatives to very large tables with many performance-killing indices?

From: Martijn van Oosterhout
On Wed, Aug 22, 2012 at 05:56:27PM -0700, Jeff Janes wrote:
> > It's wide-ish, too, 98 columns.
>
> How many of the columns are NULL for any given row?  Or perhaps
> better, what is the distribution of values for any given column?  For
> a given column, is there some magic value (NULL, 0, 1, -1, 9999, '')
> which most of the rows have?

In particular, if the data is sparse, as in lots of NULLs, and you
don't need to search on those, you might consider partial indexes.  If
you create a partial index covering only the non-NULL entries, postgres is
smart enough to use it when you query for something that is not NULL.
Example:

db=# create temp table foo (a int4, b int4);
CREATE TABLE
db=# insert into foo (a) select generate_series(1,100000);
INSERT 0 100000
db=# update foo set b=1 where a=1;
UPDATE 1
db=# create index bar on foo(b) where b is not null;
CREATE INDEX
db=# explain select * from foo where b=1;
                             QUERY PLAN
--------------------------------------------------------------------
 Bitmap Heap Scan on foo  (cost=4.38..424.59 rows=500 width=8)
   Recheck Cond: (b = 1)
   ->  Bitmap Index Scan on bar  (cost=0.00..4.26 rows=500 width=0)
         Index Cond: (b = 1)
(4 rows)

In this case a row update only has to touch the index when the indexed
column is non-NULL, which may cut the overhead considerably.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
   -- Arthur Schopenhauer
