Re: maximum columns for brin bloom indexes - Mailing list pgsql-hackers

From Jaime Casanova
Subject Re: maximum columns for brin bloom indexes
Date
Msg-id 20210408144918.GA3086@ahch-to
Whole thread Raw
In response to Re: maximum columns for brin bloom indexes  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: maximum columns for brin bloom indexes  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On Thu, Apr 08, 2021 at 12:18:36PM +0200, Tomas Vondra wrote:
> On 4/8/21 9:08 AM, Jaime Casanova wrote:
> > Hi everyone,
> > 
> > When testing brin bloom indexes I noted that we need to reduce the
> > PAGES_PER_RANGE parameter of the index to allow more columns on it.
> > 
> > Sadly, this could be a problem if you create the index before the table
> > grows, once it reaches some number of rows (i see the error as early as
> > 1000 rows) it starts error out.
> > 
> >     create table t1(i int, j int);
> >     
> >     -- uses default PAGES_PER_RANGE=128
> >     create index on t1 using brin(i int4_bloom_ops, j int4_bloom_ops ) ;
> >     
> >     insert into t1 
> >         select random()*1000, random()*1000 from generate_series(1, 1000);
> >     ERROR:  index row size 8968 exceeds maximum 8152 for index "t1_i_j_idx"
> > 
> > if instead you create the index with a minor PAGES_PER_RANGE it goes
> > fine, in this case it works once you reduce it to at least 116
> > 
> >     create index on t1 using brin(i int4_bloom_ops, j int4_bloom_ops ) 
> >         with (pages_per_range=116);
> > 
> > 
> > so, for having:
> > two int columns - PAGES_PER_RANGE should be max 116
> > three int columns - PAGES_PER_RANGE should be max 77
> > one int and one timestamp - PAGES_PER_RANGE should be max 121 
> > 
> > and so on
> > 
> 
> No, because this very much depends on the number if distinct values in
> the page page range, which determines how well the bloom filter
> compresses. You used 1000, but that's just an arbitrary value and the
> actual data might have any other value. And it's unlikely that all three
> columns will have the same number of distinct values.
>

Ok, that makes sense. Still I see a few odd things: 

    """
    drop table if exists t1;
    create table t1(i int, j int);
    create index on t1 using brin(i int4_bloom_ops, j int4_bloom_ops ) ;

    -- This one will succeed, I guess because it has less different
    -- values
    insert into t1
    select random()*20, random()*100 from generate_series(1, 1000);

    -- succeed
    insert into t1
    select random()*20, random()*100 from generate_series(1, 100000);

    -- succeed
    insert into t1
    select random()*200, random()*1000 from generate_series(1, 1000);

    -- succeed
    insert into t1
    select random()*200, random()*1000 from generate_series(1, 1000);

    -- succeed? This is the case it has been causing problems before
    insert into t1
    select random()*1000, random()*1000 from generate_series(1, 1000);
    """

Maybe this makes sense, but it looks random to me. If it makes sense
this is something we should document better. 

Let's try another combination:

    """
    drop table if exists t1;
    create table t1(i int, j int);
    create index on t1 using brin(i int4_bloom_ops, j int4_bloom_ops ) ;

    -- this fails again
    insert into t1
    select random()*1000, random()*1000 from generate_series(1, 1000);

    -- and this starts to fail now, but this worked before
    insert into t1
    select random()*20, random()*100 from generate_series(1, 1000);
    """

> Of course, this also depends on the false positive rate.
> 

How the false positive rate work?

> FWIW I doubt people are using multi-column BRIN indexes very often.
> 

true. 

Another question, should we allow to create a brin multi column index
that uses different opclasses?

CREATE INDEX ON t1 USING brin (i int4_bloom_ops, j int4_minmax_ops);

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL



pgsql-hackers by date:

Previous
From: David Steele
Date:
Subject: Re: allow partial union-all and improve parallel subquery costing
Next
From: Julien Rouhaud
Date:
Subject: Re: pg_stat_statements oddity with track = all