Re: How to implement Gin method? - Mailing list pgsql-hackers

From kenji uno
Subject Re: How to implement Gin method?
Date
Msg-id 201307080620.r686KoYI030408@mail.digitaldolphins.jp
Whole thread Raw
In response to Re: How to implement Gin method?  (Martijn van Oosterhout <kleptog@svana.org>)
Responses Re: How to implement Gin method?
List pgsql-hackers
Hi.

Ok, ok thanks.

My problem is to shorten time of searching full text stored in text field.

The table definition is like following:
CREATE TABLE xxx(...  title character varying,...  fts1body text,...)

If user requests keywords, we use a kind of "stristr" that is targeting Japanese text encoded in UTF-8.

"aaa bbb ccc" [Click here to search!]
SELECT * FROM xxx WHERE TRUEAND (ddstrike(title,'aaa') OR ddstrike(fts1body,'aaa') OR ...)AND (ddstrike(title,'bbb') OR
ddstrike(fts1body,'bbb')OR ...)AND ...
 

As you can imagine easily, yes, it is very slow!

So I need trial and error for speeding up.

My trial is "Insert a light weight filter done by integer key, before text searching!"

For example,filter('A') -> 1filter('B') -> 2filter('C') -> 4filter('AAABBC') -> 7 or {1,2,4}

It may fit to inverse index like GIN!

So I began to study GIN.

I'm sorry to say. Today I found I could apply int array GIN support at contrib/_int.sql.

I made GIN index.
CREATE INDEX xxx_idx_filter ON xxx USING GIN (filter(fts1body) gist__int_ops);


The following sample query is very very fast! 11065 hits in 22 milli secs (total 215,278 records).
SELECT COUNT(*) FROM xxx WHERE filter(fts1body) @> filter('ABC');


However the following query is very slow! 9,400ms. It uses "Seq Scan" lol.
SELECT * FROM xxx WHERE TRUEAND (ddstrike(title,'ABC') OR (filter(fts1body) @> filter('AAA') AND
ddstrike(fts1body,'AAA')))


Apply filter to "title" column too.

The best query result costs 3,700ms. 18 hits. It surely uses expected query plan: two "Bitmap index scan" -> "Bitmap
Or"-> "Bitmap Heap Scan".
 
SELECT * FROM xxx WHERE TRUEAND (filter(title) @> filter('ABC') OR filter(fts1body) @> filter('ABC')) AND
(ddstrike(title,'ABC')OR ddstrike(fts1body,'ABC'))
 


The pure query costs 3,800ms. 18 hits. Single "Seq Scan".
SELECT * FROM xxxWHERE TRUEAND (ddstrike(title,'ABC') OR ddstrike(fts1body,'ABC'))


Finally I noticed I had spent useless time, and need to find another good one.

Sorry.


However, I may think good idea which uses inverted index.

So I want to know...
- the actual work of extractQuery and consistant.
- the detail interface of extractValue/extractQuery/consistant. It may help understanding.


I looked at contrib/_int.sql of PG8.2.22

There are definitions of int[] GIN support.

---
CREATE OPERATOR CLASS gin__int_ops
FOR TYPE _int4 USING gin
ASOPERATOR    3    &&,OPERATOR    6    = (anyarray, anyarray)    RECHECK,OPERATOR    7    @>,OPERATOR    8    <@
RECHECK,OPERATOR   13    @,OPERATOR    14    ~    RECHECK,OPERATOR    20    @@ (_int4, query_int),FUNCTION    1
btint4cmp(int4, int4),FUNCTION    2    ginarrayextract (anyarray, internal),FUNCTION    3    ginint4_queryextract
(internal,internal, int2),FUNCTION    4    ginint4_consistent (internal, int2, internal),STORAGE        int4;
 
---

I checked the PG8.2.22 source code.

Both ginint4_queryextract and ginint4_consistent assume that "query" argument is a PGARRAY (ArrayType *). Where is it
decided?Is it array of STORAGE type?
 

Both extractQuery(ginint4_queryextract) and extractValue(ginarrayextract) seem to return similar value type. They
returnDatum array of int4. Is it array of STORAGE type?
 

I want to understand the overview of GIN extension.

Thanks

kenji uno




pgsql-hackers by date:

Previous
From: Jaime Casanova
Date:
Subject: Re: in-catalog Extension Scripts and Control parameters (templates?)
Next
From: Heikki Linnakangas
Date:
Subject: Re: Review: extension template