Re: [GENERAL] full text search on hstore or json with materialized view? - Mailing list pgsql-general

From Rj Ewing
Subject Re: [GENERAL] full text search on hstore or json with materialized view?
Date
Msg-id CAOSSsV1MQjX-OUObRWbmEHep=+7CXn9Rh3a=Rarpv1ru++pb8A@mail.gmail.com
In response to Re: [GENERAL] full text search on hstore or json with materialized view?  (Rj Ewing <ewing.rj@gmail.com>)
List pgsql-general
okay, after messing around a bit more with the secondary k,v table, it seems like this could be a good solution.

I created a keys table to hold the 63 key values, then I dropped and recreated the secondary table, using a FK referencing the keys table. I'm not really sure why, but a basic full text query on 44 million rows is taking approx. 20 ms.

my table structure is:

 Table "public.samples_lg_txt"
 Column |   Type   | Modifiers
--------+----------+-----------
 id     | integer  |
 key    | integer  |
 val    | text     |
 tsv    | tsvector |
Indexes:
    "idx_tsv_samples_lg_text" gin (tsv)
Foreign-key constraints:
    "samples_lg_txt_id_fkey" FOREIGN KEY (id) REFERENCES samples_lg(id) ON DELETE CASCADE
    "samples_lg_txt_key_fkey" FOREIGN KEY (key) REFERENCES keys(id)


how would I write an AND query that filters on 2 separate keys from the samples_lg_txt table?

something like:

SELECT COUNT(*) FROM samples WHERE id IN ( SELECT DISTINCT(s.id) FROM samples_lg_txt s JOIN keys k ON s.key = k.id WHERE (k.name = 'key1' AND s.tsv @@ to_tsquery('value1')) AND (k.name = 'key2' AND s.tsv @@ to_tsquery('value2')) );
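
Though I suspect that WHERE clause can never match, since a single row only has one key. Maybe it needs one join per key (or an INTERSECT of two subqueries), something like this untested sketch:

-- one alias per key, so each condition can match a different row for the same id
SELECT COUNT(*) FROM samples WHERE id IN (
    SELECT s1.id
    FROM samples_lg_txt s1
    JOIN keys k1 ON s1.key = k1.id
    JOIN samples_lg_txt s2 ON s2.id = s1.id
    JOIN keys k2 ON s2.key = k2.id
    WHERE k1.name = 'key1' AND s1.tsv @@ to_tsquery('value1')
      AND k2.name = 'key2' AND s2.tsv @@ to_tsquery('value2')
);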

On Wed, Apr 19, 2017 at 11:57 AM, Rj Ewing <ewing.rj@gmail.com> wrote:
I did some testing using a secondary table with key, value columns. However, I don't think this will provide the performance that we need. Queries were taking 60+ seconds just for a count.

With 1 million rows in the primary table, this resulted in 44 million rows in the secondary k,v table for full text searching. The same query in ES takes ~50 ms on my local machine, with 1/10th the RAM allocated to ES compared to what was allocated to psql.

I'm going to test using trigram indexes on approx 10 JSON fields, and see if that gives us what we are looking for, roughly as sketched below.
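
Roughly along these lines, one expression index per field (a sketch; the table, column, and field names here are placeholders, and it assumes the pg_trgm extension):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- one trigram expression index per jsonb field we want to search
CREATE INDEX idx_samples_field1_trgm ON samples
    USING gin ((data ->> 'field1') gin_trgm_ops);

-- which should be usable by queries like
SELECT COUNT(*) FROM samples WHERE data ->> 'field1' ILIKE '%value1%';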

any thoughts on getting sub-1-second queries on a table with 44 million rows?

RJ

On Tue, Apr 18, 2017 at 10:35 PM, George Neuner <gneuner2@comcast.net> wrote:
On Tue, 18 Apr 2017 14:38:15 -0700, Rj Ewing <ewing.rj@gmail.com>
wrote:

>I am evaluating postgres as a datastore for our webapp. We are moving
>away from a triple store db due to performance issues.
>
>Our data model consists of sets of user-defined attributes. Approx 10% of
>the attributes tend to be 100% filled, while 50% of the attributes are only
>approx 25% filled. This is fairly sparse data, and it seems that jsonb or
>hstore will be best for us.
>
>Unfortunately, from my understanding, postgres doesn't support full text
>search across hstore or jsonb key:values or even the entire document. While
>this is not a deal breaker, this would be a great feature to have. We have
>been experimenting w/ elasticsearch a bit, and particularly enjoy this
>feature, however we don't really want to involve the complexity and
>overhead of adding elasticsearch in front of our datasource right now.

hstore and JSON values all really are just formatted text with a
custom column type.  You can create tsvectors from the values if you
cast them to text.
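
E.g. something along these lines (untested; assuming a jsonb column named attrs):

-- index the document's text -- keys and values alike
CREATE INDEX ON mytable
    USING gin (to_tsvector('english', attrs::text));

SELECT * FROM mytable
 WHERE to_tsvector('english', attrs::text) @@ to_tsquery('foo & bar');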

Note that a tsvector can only work on a /flat/ key:value structure: it
won't understand nesting, and even with a flat store it won't
understand the difference between keys/tags and the associated values.

E.g., you will be able to see that a value contains both "foo" and
"bar", but to distinguish 'foo:bar' from 'bar:foo' or 'foo:q,bar:z'
you either must check the token positions (from the tsvector) or *try*
to extract the key(s) you are interested in and check the associated
value(s).

This might work ok if you search only for keys in a "document" ... but
trying to search values, I think, would be far too complicated.

It might help if you stored a 2D array instead of a flat structure,
but even that would be fairly complicated to work with.



>An idea that has come up is to use a materialized view or secondary table
>with triggers, where we would have 3 columns (id, key, value).
>
>I think this would allow us to store a tsvector and gin index, giving us
>the ability to use fulltext search on k:v pairs, then join the original
>data on the id field to return the entire record.

This is a much better idea because it separates the key from the
value, and unlike the full "document" case [above], you will know that
the FTS index is covering only the values.

If you need to preserve key order to reconstruct records, you will
need an additional column to maintain that ordering.
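
E.g., with a jsonb source column, such a table could be populated with
something like this (a sketch; the table and column names are made up):

-- one row per key/value pair of each document
INSERT INTO doc_kv (id, key, val)
SELECT d.id, j.key, j.value
FROM docs d,
     jsonb_each_text(d.attrs) AS j(key, value);

(Keep in mind that jsonb itself doesn't preserve key order, so that
ordering column would have to come from somewhere else.)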


>is anyone currently doing this? Is there a better alternative? Any
>performance issues that immediately jump out ( I realize the writes will
>take longer)?
>
>the nature of our data is "relatively" static with bulk uploads (100 - 1000
>records). So we can sacrifice some write performance.
>
>RJ

Having to "reconstruct" records will make reads take longer as well,
but I think separating the keys and values is the best way to do it.


YMMV,
George



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

