Re: Finding common hstore key=>value pairs with hstore - Mailing list pgsql-general
From | Paul Norman |
---|---|
Subject | Re: Finding common hstore key=>value pairs with hstore |
Date | |
Msg-id | 109d01cdfc88$6ca72ea0$45f58be0$@mac.com Whole thread Raw |
In response to | Re: Finding common hstore key=>value pairs with hstore (John R Pierce <pierce@hogranch.com>) |
List | pgsql-general |
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general- > owner@postgresql.org] On Behalf Of John R Pierce > Subject: Re: [GENERAL] Finding common hstore key=>value pairs with > hstore > > On 1/27/2013 3:09 AM, Paul Norman wrote: > > I am in a situation where I have two tables, a and b, each with a > > hstore column called tags. Both tags columns have a GIN index on them. > > I want to find rows of a and b where the both have a particular hstore > > key and that key is the same. > > > > One way to do this would be SELECT * FROM a JOIN b ON a.tags -> 'foo' > > = b.tags -> 'foo'; This would not use the indexes. > > > > I would like some way that makes use of the two indexes. ON a.tags @> > > hstore('foo', b.tags -> 'foo') would be better, making use of the > > a.tags index, but not the b.tags one. Is there any way to use both? > > > > I suppose the other problem is the hstore statistics gathering is > > questionable enough that the query plan it comes up with could be > > absolutely abysmal. This is of course not unique to this particular > > problem but is a common problem with hstore columns > > Seems to me like you should be using a separate field for this 'foo' > defined as a foreign key. I am cross-referencing two independent data sources. I had desired to get into the full complexities of the data schema, but I'll summarize the important points, leaving aside some of the postgis matter that doesn't directly impact on the hstore issue. I'll also keep calling it foo to avoid escaping the tag name, and because the name doesn't really matter. I actually have three tables, but I'll just explain for one because the others differ only in the nature of the geometries stored in them. All three are kept updated with continually changing data though a complicated process I have one table, "nodes", with OpenStreetMap node data. It has 3 columns that matter for this, id, tags and geom. id is a bigint and primary key, tags is a hstore and geom is a postgis geometry in EPSG:4326. I have a btree index on id, gist on (geom,tags) and gin on tags. The table is 175GB, 1.75B rows and the indexes are 39GB, 154GB and 24GB. I also have some other indexes on other columns and some for testing purposes. For the tags column, about 90% are empty hstores and about 1% have the key I am interested in. There are approximately 370k unique values for this key. For this data there is nothing special about the key I am interested in, it is simply one of many. The other table, import_foo is a table created by the software I am writing. It also has three columns that matter, id, tags and geom. id is an int, but otherwise the columns are the same type. The id from this table bears absolutely no relation to the id from the nodes table. For my test dataset this table is about 100k rows. I am only interested in rows in the nodes table so all queries against this table use ST_Intersects to filter to the region of interest. There are on the order of 500k rows in the test region of which about 82k have the foo key. The first thing my software does is delete rows from import_foo that it can exactly match against with both the foo key and the bar key (another key, similar statistics as foo) against the OSM data, as well as delete the corresponding objects from its memory. This reduces import_foo down to approximately 2k rows. This step is the first one where the situation in my original message arises. It should be noted that not all rows of import_foo will have bar keys. The "easy" rows being dealt with, my program then goes on to consider more complex matches on the remaining 2k rows. Here it matches them on the foo key and the position (i.e. they are within N meters of each other), ignoring the bar key. There may be multiple rows from nodes matching the same row from import_foo. The behavior if multiple rows from import_foo match the same row from nodes is currently undefined in the application, but it is unlikely this particular combination will come up. This is another step that the situation in my original message arises. Given that there is nothing special about foo in nodes and that the two data sources are entirely independent, I don't see that it makes sense to consider a foreign key here. This was probably excessively long, but explains my use case in moderate detail.
pgsql-general by date: