Thread: Searching in varchar column having 100M records

Searching in varchar column having 100M records

From
mayank rupareliya
Date:
My table has data like the sample below, with 100M records (all dummy data). I have a btree index on the column ("field").
Searching for any text in that column takes a long time (more than 1 minute).

user_id                               field
d848f466-5e12-46e7-acf4-e12aff592241  Northern Arkansas College
24c32757-e6a8-4dbd-aac7-1efd867156ce  female
6e225c57-c1d1-48a5-b9aa-513223efc81b  1.0, 3.67, 3.67, 4.67, 7.0, 3.0
088c6342-a240-45a7-9d12-e0e707292031  Weber
b05088cf-cba6-4bd7-8f8f-1469226874d0  addd#$e@aaa.com


The table and index are created using the following queries.

create table fields(user_id varchar(64), field varchar(64));
CREATE INDEX index_field ON public.fields USING btree (field);

Search Query: 
EXPLAIN (ANALYZE, BUFFERS) select * from fields where field='Mueller';

Bitmap Heap Scan on fields  (cost=72.61..10069.32 rows=2586 width=55) (actual time=88.017..65358.548 rows=31882 loops=1)
  Recheck Cond: ((field)::text = 'Mueller'::text)
  Heap Blocks: exact=31403
  Buffers: shared hit=2 read=31492
  ->  Bitmap Index Scan on index_field  (cost=0.00..71.96 rows=2586 width=0) (actual time=55.960..55.960 rows=31882 loops=1)
        Index Cond: ((field)::text = 'Mueller'::text)
        Buffers: shared read=91
Planning Time: 0.331 ms
Execution Time: 65399.314 ms



Any suggestions for improvement?

Best Regards,
Mayank

Re: Searching in varchar column having 100M records

From
Sergei Kornilov
Date:
Hello

Please recheck with track_io_timing = on in the configuration. explain (analyze, buffers) with this option will report how much time we spend on I/O.
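
For example, something like this (just a sketch; track_io_timing can also be enabled only for the current session by a superuser, instead of in postgresql.conf):

    SET track_io_timing = on;
    EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM fields WHERE field = 'Mueller';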
 

>   Buffers: shared hit=2 read=31492

31492 blocks / 65 sec ~ 480 IOPS, not bad if you are using HDD

Your query reads table data from disks (well, or from OS cache). You need more RAM for shared_buffers or disks with better performance.
 

regards, Sergei



Re: Searching in varchar column having 100M records

From
Tomas Vondra
Date:
On Wed, Jul 17, 2019 at 02:53:20PM +0300, Sergei Kornilov wrote:
>Hello
>
>Please recheck with track_io_timing = on in configuration. explain
>(analyze,buffers) with this option will report how many time we spend
>during i/o
>
>>   Buffers: shared hit=2 read=31492
>
>31492 blocks / 65 sec ~ 480 IOPS, not bad if you are using HDD
>
>Your query reads table data from disks (well, or from OS cache). You need
>more RAM for shared_buffers or disks with better performance.
>

Either that, or try creating a covering index, so that the query can do an
index-only scan. That might reduce the amount of IO against the table, and
in the index the data should be located close to each other (same page or
pages close to each other).

So try something like

    CREATE INDEX ios_idx ON table (field, user_id);

and make sure the table is vacuumed often enough (so that the visibility
map is up to date).
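
For instance, after creating the index, a minimal way to verify it is being used (a sketch, using the table and column names from the original post):

    VACUUM (ANALYZE) fields;
    EXPLAIN (ANALYZE, BUFFERS) SELECT user_id, field FROM fields WHERE field = 'Mueller';

Ideally the plan then shows an Index Only Scan with few (or zero) heap fetches.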


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Searching in varchar column having 100M records

From
Andreas Kretschmer
Date:

Am 17.07.19 um 14:48 schrieb Tomas Vondra:
> Either that, or try creating a covering index, so that the query can 
> do an
> index-only scan. That might reduce the amount of IO against the table, 
> and
> in the index the data should be located close to each other (same page or
> pages close to each other).
>
> So try something like
>
>    CREATE INDEX ios_idx ON table (field, user_id);
>
> and make sure the table is vacuumed often enough (so that the visibility
> map is up to date). 

yeah, and please don't use varchar(64) but UUID for the user_id field, to save space on disk and for faster comparisons.
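
A possible migration sketch (this assumes every existing user_id value is a valid UUID string):

    ALTER TABLE fields ALTER COLUMN user_id TYPE uuid USING user_id::uuid;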


Regards, Andreas

-- 
2ndQuadrant - The PostgreSQL Support Company.
www.2ndQuadrant.com




Re: Searching in varchar column having 100M records

From
"David G. Johnston"
Date:
On Wed, Jul 17, 2019 at 4:04 AM mayank rupareliya <mayankjr03@gmail.com> wrote:
create table fields(user_id varchar(64), field varchar(64));
CREATE INDEX index_field ON public.fields USING btree (field);

Any suggestions for improvement?

Reduce the number of rows by constructing a relationally normalized data model.
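
For example, a rough sketch (the column names are purely illustrative, since the posted sample mixes colleges, genders, scores and e-mail addresses in a single column):

    CREATE TABLE users (
        user_id uuid PRIMARY KEY,
        college text,
        gender  text,
        scores  numeric[],
        email   text
    );
    CREATE INDEX ON users (college);

With one row per user instead of one row per attribute, the same lookup touches far fewer rows and pages.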

David J.

Re: Searching in varchar column having 100M records

From
Gavin Flower
Date:
On 17/07/2019 23:03, mayank rupareliya wrote:
[...]
> Table and index are created using following query.
>
> create table fields(user_id varchar(64), field varchar(64));
> CREATE INDEX index_field ON public.fields USING btree (field);

[...]

Any particular reason for using varchar instead of text, for field?

Also, as Andreas pointed out, use UUID for the user_id.


Cheers,
Gavin




Re: Searching in varchar column having 100M records

From
mayank rupareliya
Date:
> Please recheck with track_io_timing = on in configuration. explain (analyze,buffers) with this option will report how many time we spend during i/o
>
> >   Buffers: shared hit=2 read=31492
>
> 31492 blocks / 65 sec ~ 480 IOPS, not bad if you are using HDD
>
> Your query reads table data from disks (well, or from OS cache). You need more RAM for shared_buffers or disks with better performance.

Thanks Sergei. track_io_timing = on helps. Following is the result after changing that config.

Aggregate  (cost=10075.78..10075.79 rows=1 width=8) (actual time=63088.198..63088.199 rows=1 loops=1)
  Buffers: shared read=31089
  I/O Timings: read=61334.252
  ->  Bitmap Heap Scan on fields  (cost=72.61..10069.32 rows=2586 width=0) (actual time=69.509..63021.448 rows=31414 loops=1)
        Recheck Cond: ((field)::text = 'Klein'::text)
        Heap Blocks: exact=30999
        Buffers: shared read=31089
        I/O Timings: read=61334.252
        ->  Bitmap Index Scan on index_field  (cost=0.00..71.96 rows=2586 width=0) (actual time=58.671..58.671 rows=31414 loops=1)
              Index Cond: ((field)::text = 'Klein'::text)
              Buffers: shared read=90
              I/O Timings: read=45.316
Planning Time: 66.435 ms
Execution Time: 63088.774 ms


> So try something like
>
>     CREATE INDEX ios_idx ON table (field, user_id);
>
> and make sure the table is vacuumed often enough (so that the visibility
> map is up to date).

Thanks Tomas... I tried this and the result improved, but not much.

Thanks Andreas, David, Gavin

> Any particular reason for using varchar instead of text, for field?

No.

> use UUID for the user_id.

Agreed.

Regards,
Mayank

On Thu, Jul 18, 2019 at 4:25 AM Gavin Flower <GavinFlower@archidevsys.co.nz> wrote:
On 17/07/2019 23:03, mayank rupareliya wrote:
[...]
> Table and index are created using following query.
>
> create table fields(user_id varchar(64), field varchar(64));
> CREATE INDEX index_field ON public.fields USING btree (field);

[...]

Any particular reason for using varchar instead of text, for field?

Also, as Andreas pointed out, use UUID for the user_id.


Cheers,
Gavin



Re: Searching in varchar column having 100M records

From
Tomas Vondra
Date:
On Thu, Jul 18, 2019 at 05:21:49PM +0530, mayank rupareliya wrote:
>Please recheck with track_io_timing = on in configuration. explain
>(analyze,buffers) with this option will report how many time we spend
>during i/o
>
>>   Buffers: shared hit=2 read=31492
>
>31492 blocks / 65 sec ~ 480 IOPS, not bad if you are using HDD
>
>Your query reads table data from disks (well, or from OS cache). You need
>more RAM for shared_buffers or disks with better performance.
>
>
>Thanks Sergei..
>track_io_timing = on helps.. Following is the result after changing that
>config.
>
>Aggregate  (cost=10075.78..10075.79 rows=1 width=8) (actual
>time=63088.198..63088.199 rows=1 loops=1)
>  Buffers: shared read=31089
>  I/O Timings: read=61334.252
>  ->  Bitmap Heap Scan on fields  (cost=72.61..10069.32 rows=2586 width=0)
>(actual time=69.509..63021.448 rows=31414 loops=1)
>        Recheck Cond: ((field)::text = 'Klein'::text)
>        Heap Blocks: exact=30999
>        Buffers: shared read=31089
>        I/O Timings: read=61334.252
>        ->  Bitmap Index Scan on index_field  (cost=0.00..71.96 rows=2586
>width=0) (actual time=58.671..58.671 rows=31414 loops=1)
>              Index Cond: ((field)::text = 'Klein'::text)
>              Buffers: shared read=90
>              I/O Timings: read=45.316
>Planning Time: 66.435 ms
>Execution Time: 63088.774 ms
>

How did that help? It only gives you insight that it's really the I/O that
takes time. You need to reduce that, somehow.

>
>So try something like
>
>    CREATE INDEX ios_idx ON table (field, user_id);
>
>and make sure the table is vacuumed often enough (so that the visibility
>map is up to date).
>
>Thanks Tomas... I tried this and result improved but not much.
>

Well, you haven't shown us the execution plan, so it's hard to check why
it did not help much and give you further advice.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Searching in varchar column having 100M records

From
Gavin Flower
Date:
On 18/07/2019 23:51, mayank rupareliya wrote:
[...]
> Thanks Andreas, David, Gavin
>
> Any particular reason for using varchar instead of text, for field? No
>
> use UUID for the user_id. Agreed

[...]

Use of text is preferred, but I can't see it making any significant
difference to performance -- but I could be wrong!

Cheers,
Gavin




Re: Searching in varchar column having 100M records

From
mayank rupareliya
Date:
> Well, you haven't shown us the execution plan, so it's hard to check why
> it did not help much and give you further advice.

This is the latest query execution with explain after adding indexing on both columns.

Aggregate  (cost=174173.57..174173.58 rows=1 width=8) (actual time=65087.657..65087.658 rows=1 loops=1)
  ->  Bitmap Heap Scan on fields  (cost=1382.56..174042.61 rows=52386 width=0) (actual time=160.340..65024.533 rows=31857 loops=1)
        Recheck Cond: ((field)::text = 'Champlin'::text)
        Heap Blocks: exact=31433
        ->  Bitmap Index Scan on index_field  (cost=0.00..1369.46 rows=52386 width=0) (actual time=125.078..125.079 rows=31857 loops=1)
              Index Cond: ((field)::text = 'Champlin'::text)
Planning Time: 8.595 ms
Execution Time: 65093.508 ms

On Thu, Jul 18, 2019 at 6:11 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Thu, Jul 18, 2019 at 05:21:49PM +0530, mayank rupareliya wrote:
>Please recheck with track_io_timing = on in configuration. explain
>(analyze,buffers) with this option will report how many time we spend
>during i/o
>
>>   Buffers: shared hit=2 read=31492
>
>31492 blocks / 65 sec ~ 480 IOPS, not bad if you are using HDD
>
>Your query reads table data from disks (well, or from OS cache). You need
>more RAM for shared_buffers or disks with better performance.
>
>
>Thanks Sergei..
>track_io_timing = on helps.. Following is the result after changing that
>config.
>
>Aggregate  (cost=10075.78..10075.79 rows=1 width=8) (actual
>time=63088.198..63088.199 rows=1 loops=1)
>  Buffers: shared read=31089
>  I/O Timings: read=61334.252
>  ->  Bitmap Heap Scan on fields  (cost=72.61..10069.32 rows=2586 width=0)
>(actual time=69.509..63021.448 rows=31414 loops=1)
>        Recheck Cond: ((field)::text = 'Klein'::text)
>        Heap Blocks: exact=30999
>        Buffers: shared read=31089
>        I/O Timings: read=61334.252
>        ->  Bitmap Index Scan on index_field  (cost=0.00..71.96 rows=2586
>width=0) (actual time=58.671..58.671 rows=31414 loops=1)
>              Index Cond: ((field)::text = 'Klein'::text)
>              Buffers: shared read=90
>              I/O Timings: read=45.316
>Planning Time: 66.435 ms
>Execution Time: 63088.774 ms
>

How did that help? It only gives you insight that it's really the I/O that
takes time. You need to reduce that, somehow.

>
>So try something like
>
>    CREATE INDEX ios_idx ON table (field, user_id);
>
>and make sure the table is vacuumed often enough (so that the visibility
>map is up to date).
>
>Thanks Tomas... I tried this and result improved but not much.
>

Well, you haven't shown us the execution plan, so it's hard to check why
it did not help much and give you further advice.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Searching in varchar column having 100M records

From
Michael Lewis
Date:


On Fri, Jul 19, 2019 at 8:13 AM mayank rupareliya <mayankjr03@gmail.com> wrote:
Well, you haven't shown us the execution plan, so it's hard to check why
it did not help much and give you further advice.

This is the latest query execution with explain after adding indexing on both columns.

Aggregate  (cost=174173.57..174173.58 rows=1 width=8) (actual time=65087.657..65087.658 rows=1 loops=1)
  ->  Bitmap Heap Scan on fields  (cost=1382.56..174042.61 rows=52386 width=0) (actual time=160.340..65024.533 rows=31857 loops=1)
        Recheck Cond: ((field)::text = 'Champlin'::text)
        Heap Blocks: exact=31433
        ->  Bitmap Index Scan on index_field  (cost=0.00..1369.46 rows=52386 width=0) (actual time=125.078..125.079 rows=31857 loops=1)
              Index Cond: ((field)::text = 'Champlin'::text)
Planning Time: 8.595 ms
Execution Time: 65093.508 ms



Are you on a solid state drive? If so, have you tried setting effective_io_concurrency to 200 or 300 and checking performance? Given that nearly all of the execution time is spent in the bitmap heap scan, I wonder about adjusting this.

effective_io_concurrency
"The allowed range is 1 to 1000, or zero to disable issuance of asynchronous I/O requests. Currently, this setting only affects bitmap heap scans."
"The default is 1 on supported systems, otherwise 0. " 

Re: Searching in varchar column having 100M records

From
Tomas Vondra
Date:
On Fri, Jul 19, 2019 at 07:43:26PM +0530, mayank rupareliya wrote:
>Well, you haven't shown us the execution plan, so it's hard to check why
>it did not help much and give you further advice.
>
>
>This is the latest query execution with explain after adding indexing on
>both columns.
>
>Aggregate  (cost=174173.57..174173.58 rows=1 width=8) (actual
>time=65087.657..65087.658 rows=1 loops=1)
>  ->  Bitmap Heap Scan on fields  (cost=1382.56..174042.61 rows=52386
>width=0) (actual time=160.340..65024.533 rows=31857 loops=1)
>        Recheck Cond: ((field)::text = 'Champlin'::text)
>        Heap Blocks: exact=31433
>        ->  Bitmap Index Scan on index_field  (cost=0.00..1369.46
>rows=52386 width=0) (actual time=125.078..125.079 rows=31857 loops=1)
>              Index Cond: ((field)::text = 'Champlin'::text)
>Planning Time: 8.595 ms
>Execution Time: 65093.508 ms
>

That very clearly does not use the index-only scan, so it's not
surprising it's not any faster. You need to find out why the planner
makes that decision.
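
For example, one way to investigate (a sketch, not the only approach) is to refresh the visibility map and then check which plan the planner picks once the bitmap scan is discouraged in a test session:

    VACUUM (ANALYZE) fields;
    SET enable_bitmapscan = off;
    EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM fields WHERE field = 'Champlin';
    RESET enable_bitmapscan;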

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Searching in varchar column having 100M records

From
mgbii bax
Date:
Another suggestion: try clustering the table using the index on the "field" column, then run ANALYZE. If you're on a spinning disk, it will also help to sort the data by the search "field" column during bulk insert.
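
For example (a sketch; note that CLUSTER takes an exclusive lock and rewrites the whole table, so it is best done in a maintenance window):

    CLUSTER fields USING index_field;
    ANALYZE fields;
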
--

regards

marie g. bacuno ii


On Fri, Jul 19, 2019 at 11:39 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Fri, Jul 19, 2019 at 07:43:26PM +0530, mayank rupareliya wrote:
>Well, you haven't shown us the execution plan, so it's hard to check why
>it did not help much and give you further advice.
>
>
>This is the latest query execution with explain after adding indexing on
>both columns.
>
>Aggregate  (cost=174173.57..174173.58 rows=1 width=8) (actual
>time=65087.657..65087.658 rows=1 loops=1)
>  ->  Bitmap Heap Scan on fields  (cost=1382.56..174042.61 rows=52386
>width=0) (actual time=160.340..65024.533 rows=31857 loops=1)
>        Recheck Cond: ((field)::text = 'Champlin'::text)
>        Heap Blocks: exact=31433
>        ->  Bitmap Index Scan on index_field  (cost=0.00..1369.46
>rows=52386 width=0) (actual time=125.078..125.079 rows=31857 loops=1)
>              Index Cond: ((field)::text = 'Champlin'::text)
>Planning Time: 8.595 ms
>Execution Time: 65093.508 ms
>

That very clearly does not use the index-only scan, so it's not
surprising it's not any faster. You need to find out why the planner
makes that decision.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services