Thread: How to inject knowledge into a Postgres database

How to inject knowledge into a Postgres database

From
han.holl@informationslogik.nl
Date:
Hello,

We have a table of people with a date-of-birth and a surname, both indexed.
We have queries like this:
select report from table where dateofbirth = '1966-12-12' and surname like
'boer%'

The planner estimates that in a 1.5M record table 'boer%' will have one record
and 1966-12-12 about 40.
Unfortunately, names are not evenly distributed, and in some combinations it
will have to process many thousands of records. On some older and slower
systems, this hurts.

Is there a way to tell the planner to always prefer the dateofbirth index ?
Alternatively, to inform it about the wildly uneven distribution of surnames
(this must be even worse in China).

A third possibility would be to rewrite such a query as a nested query: is
there a rewrite query - hook (like apaches mod_rewrite) available ? Or is
there a somewhere a proxy-server that could do this ?

Cheers,

Han Holl

Re: How to inject knowledge into a Postgres database

From
Tom Lane
Date:
han.holl@informationslogik.nl writes:
> Is there a way to tell the planner to always prefer the dateofbirth index ?
> Alternatively, to inform it about the wildly uneven distribution of surnames
> (this must be even worse in China).

Increase the statistics target for that column, and re-ANALYZE the
table.

            regards, tom lane

Re: How to inject knowledge into a Postgres database

From
Oleg Bartunov
Date:
You could increase statistics or try contrib/tsearch2

     Oleg
On Fri, 7 Oct 2005, han.holl@informationslogik.nl wrote:

>
> Hello,
>
> We have a table of people with a date-of-birth and a surname, both indexed.
> We have queries like this:
> select report from table where dateofbirth = '1966-12-12' and surname like
> 'boer%'
>
> The planner estimates that in a 1.5M record table 'boer%' will have one record
> and 1966-12-12 about 40.
> Unfortunately, names are not evenly distributed, and in some combinations it
> will have to process many thousands of records. On some older and slower
> systems, this hurts.
>
> Is there a way to tell the planner to always prefer the dateofbirth index ?
> Alternatively, to inform it about the wildly uneven distribution of surnames
> (this must be even worse in China).
>
> A third possibility would be to rewrite such a query as a nested query: is
> there a rewrite query - hook (like apaches mod_rewrite) available ? Or is
> there a somewhere a proxy-server that could do this ?
>
> Cheers,
>
> Han Holl
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: How to inject knowledge into a Postgres database

From
Yonatan Ben-Nes
Date:
Oleg Bartunov wrote:
> You could increase statistics or try contrib/tsearch2
>
>     Oleg
> On Fri, 7 Oct 2005, han.holl@informationslogik.nl wrote:
>
>>
>> Hello,
>>
>> We have a table of people with a date-of-birth and a surname, both
>> indexed.
>> We have queries like this:
>> select report from table where dateofbirth = '1966-12-12' and surname
>> like
>> 'boer%'
>>
>> The planner estimates that in a 1.5M record table 'boer%' will have
>> one record
>> and 1966-12-12 about 40.
>> Unfortunately, names are not evenly distributed, and in some
>> combinations it
>> will have to process many thousands of records. On some older and slower
>> systems, this hurts.
>>
>> Is there a way to tell the planner to always prefer the dateofbirth
>> index ?
>> Alternatively, to inform it about the wildly uneven distribution of
>> surnames
>> (this must be even worse in China).
>>
>> A third possibility would be to rewrite such a query as a nested
>> query: is
>> there a rewrite query - hook (like apaches mod_rewrite) available ? Or is
>> there a somewhere a proxy-server that could do this ?
>>
>> Cheers,
>>
>> Han Holl
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 6: explain analyze is your friend
>>
>
>     Regards,
>         Oleg
> _____________________________________________________________
> Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
> Sternberg Astronomical Institute, Moscow University (Russia)
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(095)939-16-83, +007(095)939-23-83
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>       choose an index scan if your joining column's datatypes do not
>       match

You can also "force" it to use the index using ORDER BY dateofbirth.

Cheers,
   Ben-Nes Yonatan

Re: How to inject knowledge into a Postgres database

From
han.holl@informationslogik.nl
Date:
Tom, Oleg, Yonathan,

thanks for the suggestions.
Indeed, upping the statistics from 10 to 100 helped.

But order by did not:
palga=# explain analyze select rapnaam from udps where geboortedatum =
'1966-01-01' and naamvrouw like 'vos%' order by geboortedatum;
                                                        QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=18.07..18.08 rows=1 width=18) (actual time=774.941..774.941
rows=0 loops=1)
   Sort Key: main.geboortedatum
   ->  Index Scan using nv on main  (cost=0.00..18.06 rows=1 width=18) (actual
time=746.121..746.121 rows=0 loops=1)
         Index Cond: (((naamvrouw)::text >= 'vos'::character varying) AND
((naamvrouw)::text < 'vot'::character varying))
         Filter: ((geboortedatum = '1966-01-01'::date) AND ((naamvrouw)::text
~~ 'vos%'::text))
 Total runtime: 775.068 ms
(6 rows)

I got a similar problem with a functional index, but I guess my only option is
to create a real column with the results of the function, and replace the
functional index with a real one.

Thanks to all,

Han Holl

Re: How to inject knowledge into a Postgres database

From
"Jim C. Nasby"
Date:
What's the datatype on naamvrouw? Notice that it's being casted to text,
which means an index on that column won't be used.

On Mon, Oct 10, 2005 at 11:29:38AM +0200, han.holl@informationslogik.nl wrote:
>
> Tom, Oleg, Yonathan,
>
> thanks for the suggestions.
> Indeed, upping the statistics from 10 to 100 helped.
>
> But order by did not:
> palga=# explain analyze select rapnaam from udps where geboortedatum =
> '1966-01-01' and naamvrouw like 'vos%' order by geboortedatum;
>                                                         QUERY PLAN
>
--------------------------------------------------------------------------------------------------------------------------
>  Sort  (cost=18.07..18.08 rows=1 width=18) (actual time=774.941..774.941
> rows=0 loops=1)
>    Sort Key: main.geboortedatum
>    ->  Index Scan using nv on main  (cost=0.00..18.06 rows=1 width=18) (actual
> time=746.121..746.121 rows=0 loops=1)
>          Index Cond: (((naamvrouw)::text >= 'vos'::character varying) AND
> ((naamvrouw)::text < 'vot'::character varying))
>          Filter: ((geboortedatum = '1966-01-01'::date) AND ((naamvrouw)::text
> ~~ 'vos%'::text))
>  Total runtime: 775.068 ms
> (6 rows)
>
> I got a similar problem with a functional index, but I guess my only option is
> to create a real column with the results of the function, and replace the
> functional index with a real one.
>
> Thanks to all,
>
> Han Holl
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org
>

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: How to inject knowledge into a Postgres database

From
han.holl@informationslogik.nl
Date:
On Tuesday 11 October 2005 01:50, Jim C. Nasby wrote:
> What's the datatype on naamvrouw? Notice that it's being casted to text,
> which means an index on that column won't be used.
>
Datatype is text. My problem is that I want it to use the date of birth index.

I've increased the statistics to 1000, and only occasionally, because of the
very uneven distribution of surnames, the planner does the wrong thing.
(Thinking that only 14 rows will match 'blaa' instead of 767.)

I've tried to set an even higher limit, but 1000 seems to be the maximum.
Is this hardcoded, or can I set a higher maximum somewhere ?

If the only penalty is slower analyzing, I don't care: we analyze at night
when these system are idle.

Cheers,

Han Holl

Re: How to inject knowledge into a Postgres database

From
Tom Lane
Date:
han.holl@informationslogik.nl writes:
> I've increased the statistics to 1000, and only occasionally, because of the
> very uneven distribution of surnames, the planner does the wrong thing.
> I've tried to set an even higher limit, but 1000 seems to be the maximum.
> Is this hardcoded, or can I set a higher maximum somewhere ?

You'd have to change the source code, but it's a simple tweak in the
ALTER SET STATISTICS code.

> If the only penalty is slower analyzing, I don't care: we analyze at night
> when these system are idle.

You'd be wrong about that --- the planner operations that use the data
would necessarily be slower, too.  I don't have any concrete information
about how much slower, but I'd be hesitant to raise the figure much
beyond 1000 ...

However, if you can show you have a real-world case that benefits, I'd
be willing to think about raising the wired-in limit to 10000 or so.

            regards, tom lane

Re: How to inject knowledge into a Postgres database

From
han.holl@informationslogik.nl
Date:
On Thursday 13 October 2005 16:06, Tom Lane wrote:
>
> You'd have to change the source code, but it's a simple tweak in the
> ALTER SET STATISTICS code.
>
I don't think I'd want to do that.

> > If the only penalty is slower analyzing, I don't care: we analyze at
> > night when these system are idle.
>
> You'd be wrong about that --- the planner operations that use the data
> would necessarily be slower, too.  I don't have any concrete information
> about how much slower, but I'd be hesitant to raise the figure much
> beyond 1000 ...
>
> However, if you can show you have a real-world case that benefits, I'd
> be willing to think about raising the wired-in limit to 10000 or so.
>
The example I gave earlier in the thread, date_of_birth = 'some-date' and
surname like 'blaa%', was a real life example, but I had to pull it from a
logfile that logs queries longer than 500 ms. It happens two or three times a
day in a laboratory with 50 people querying the database all day.

Estimates for date_of_birth number of rows are quite good (even at the default
stats of 10) but surnames are just too unevenly distributed.

But in 99% of all cases the guess is right, and by making it a nested query I
could improve 1% and worsen 99%.

Cheers,

Han Holl