Re: insert and query performance on big string table with pg_trgm - Mailing list pgsql-performance

From Matthew Hall
Subject Re: insert and query performance on big string table with pg_trgm
Date
Msg-id 56B84857-06EA-4534-A13F-F7AA3B924446@mhcomputing.net
Whole thread Raw
In response to Re: insert and query performance on big string table with pg_trgm  (Jeff Janes <jeff.janes@gmail.com>)
Responses Re: insert and query performance on big string table with pg_trgm
Re: insert and query performance on big string table with pg_trgm
List pgsql-performance
Hi Jeff,

Thanks so much for writing. You've got some great points.

> On Nov 20, 2017, at 5:42 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> While I have not done exhaustive testing, from the tests I have done I've never found gist to be better than gin with
trgmindexes. 

Thanks, this helps considerably, as the documentation was kind of confusing and I didn't want to get it wrong if I
couldavoid it. 

> Do you really need the artificial primary key, when you already have another column that would be used as the primary
key? If you need to use this it a foreign key in another type, then very well might.  But maintaining two unique
indexesdoesn't come free. 

OK, fair enough, I'll test with it removed and see what happens.

> Are all indexes present at the time you insert?  It will probably be much faster to insert without the gin index (at
least)and build it after the load. 

There is some flexibility on the initial load, but the updates in the future will require the de-duplication
capability.I'm willing to accept that might be somewhat slower on the load process, to get the accurate updates,
providedwe could try meeting the read-side goal I wrote about, or at least figure out why it's impossible, so I can
understandwhat I need to fix to make it possible. 

> Without knowing this key fact, it is hard to interpret the rest of your data.

I'm assuming you're referring to the part about the need for the primary key, and the indexes during loading? I did try
todescribe that in the earlier mail, but obviously I'm new at writing these, so sorry if I didn't make it more clear. I
canget rid of the bigserial PK and the indexes could be made separately, but I would need a way to de-duplicate on
futurereloading... that's why I had the ON CONFLICT DO NOTHING expression on the INSERT. So we'd still want to learn
whythe INSERT is slow to fix up the update processes that would happen in the future. 

> * maintenance_work_mem 512 MB
>
> Building a gin index in bulk could benefit from more memory here.

Fixed it; I will re-test w/ 1 GB. Have you got any recommended values so I don't screw it up?

> * synchronous_commit off
>
> If you already are using unlogged tables, this might not be so helpful, but does increase the risk of the rest of
yoursystem. 

Fixed it; the unlogged mode change came later than this did.

>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> 29578 postgres  20   0 6575672 6.149g 6.139g R  86.0 39.7  45:24.97 postgres
>
> You should expand the command line (by hitting 'c', at least in my version of top) so we can see which postgres
processthis is. 

Good point, I'll write back once I retry w/ your other advice.

> explain (analyze, buffers), please.  And hopefully with track_io_timing=on.

track_io_timing was missing because sadly I had only found it in one document at the very end of the investigation,
afterdoing the big job which generated all of the material posted. It's there now, so here is some better output on the
query:

explain (analyze, buffers) select * from huge_table where value ilike '%canada%';
Bitmap Heap Scan on huge_table  (cost=273.44..61690.09 rows=16702 width=33) (actual time=5701.511..76469.688
rows=110166loops=1)  Recheck Cond: ((value)::text ~~* '%canada%'::text)  Rows Removed by Index Recheck: 198  Heap
Blocks:exact=66657  Buffers: shared hit=12372 read=56201 dirtied=36906  I/O Timings: read=74195.734  ->  Bitmap Index
Scanon huge_table_value_trgm  (cost=0.00..269.26 rows=16702 width=0) (actual time=5683.032..5683.032 rows=110468
loops=1)       Index Cond: ((value)::text ~~* '%canada%'::text)        Buffers: shared hit=888 read=1028        I/O
Timings:read=5470.839Planning time: 0.271 msExecution time: 76506.949 ms 

I will work some more on the insert piece.

> If you repeat the same query, is it then faster, or is it still slow?

If you keep the expression exactly the same, it still takes a few seconds as could be expected for such a torture test
query,but it's still WAY faster than the first such query. If you change it out to a different expression, it's longer
againof course. There does seem to be a low-to-medium correlation between the number of rows found and the query
completiontime. 

> Cheers,
> Jeff

Thanks,
Matthew.

pgsql-performance by date:

Previous
From: phb07
Date:
Subject: Re: [PERFORM] POWA doesn't show queries executed
Next
From: Zakhar Shapurau
Date:
Subject: Using GROUPING SETS with more than one set disables predicate pushdown?