Re: [GENERAL] full text search on hstore or json with materialized view? - Mailing list pgsql-general

From Rj Ewing
Subject Re: [GENERAL] full text search on hstore or json with materialized view?
Date
Msg-id CAOSSsV0Lsga_ek0nwo_9L-aMapE_3qWYN3Du23S6ejeZO10Cyg@mail.gmail.com
Whole thread Raw
In response to Re: [GENERAL] full text search on hstore or json with materialized view?  (George Neuner <gneuner2@comcast.net>)
Responses Re: [GENERAL] full text search on hstore or json with materialized view?  (Rj Ewing <ewing.rj@gmail.com>)
List pgsql-general
I did some testing using a secondary table with the key, value column. However I don't think this will provide the performance that we need. Queries we taking 60+ seconds just for a count.

With 1 million rows in the primary table, this resulted in 44 million rows in the secondary k,v table for full text searching. The same query is es takes ~50 ms on my local machine with 1/10th the ram allocated to es then was allocated to psql.

I'm gonna test using trigrams indexes on approx 10 json fields, and see if that gives us what we are looking for.

any thought on getting sub 1 sec queries on a table with 44 million rows?

RJ

On Tue, Apr 18, 2017 at 10:35 PM, George Neuner <gneuner2@comcast.net> wrote:
On Tue, 18 Apr 2017 14:38:15 -0700, Rj Ewing <ewing.rj@gmail.com>
wrote:

>I am evaluating postgres for as a datastore for our webapp. We are moving
>away from a triple store db due to performance issues.
>
>Our data model consists of sets of user defined attributes. Approx 10% of
>the attributes tend to be 100% filled with 50% of the attributes having
>approx 25% filled. This is fairly sparse data, and it seems that jsonb or
>hstore will be best for us.
>
>Unfortunately, from my understanding, postres doesn't support fulltext
>search across hstore or jsonb key:values or even the entire document. While
>this is not a deal breaker, this would be a great feature to have. We have
>been experimenting w/ elasticsearch a bit, and particularly enjoy this
>feature, however we don't really want to involve the complexity and
>overhead of adding elasticsearch in front of our datasource right now.

hstore and JSON values all really are just formatted text with a
custom column type.  You can create tsvectors from the values if you
cast them to text.

Note that a tsvector can only work on a /flat/ key:value structure: it
won't understand nesting, and it and even with a flat store it won't
understand the difference between keys/tags and the associated values.

E.g., you will be able to see that a value contains both "foo" and
"bar", but to distinguish 'foo:bar' from 'bar:foo' or 'foo:q,bar:z'
you either must check the token positions (from the tsvector) or *try*
to extract the key(s) you are interested in and check the associated
value(s).

This might work ok if you search only for keys in a "document" ... but
trying to search values, I think would be far too complicated.

It might help if you stored a 2D array instead of a flat structure,
but even that would be fairly complicated to work with.



>An idea that has come up is to use a materialized view or secondary table
>with triggers, where we would have 3 columns (id, key, value).
>
>I think this would allow us to store a tsvector and gin index. Giving us
>the ability to use fulltext search on k:v pairs, then join the original
>data on the id field to return the entire record.

This is a much better idea because it separates the key from the
value, and unlike the full "document" case [above], you will know that
the FTS index is covering only the values.

If you need to preserve key order to reconstruct records, you will
need an additional column to maintain that ordering.


>is anyone currently doing this? Is there a better alternative? Any
>performance issues that immediately jump out ( I realize the writes will
>take longer)?
>
>the nature of our data is "relatively" static with bulk uploads (100 - 1000
>records). So we can sacrifice some write performance.
>
>RJ

Having to "reconstruct" records will make reads take longer as well,
but I think separating the keys and values is the best way to do it.


YMMV,
George



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

pgsql-general by date:

Previous
From: Luciano Mittmann
Date:
Subject: Re: [GENERAL] streaming replication and archive_status
Next
From: Henry M
Date:
Subject: [GENERAL] referential integrity between elements of an array and another table?