Whether to use "IN" clause - Mailing list pgsql-sql

From Mark Williams
Subject Whether to use "IN" clause
Date
Msg-id 002901d5bfc3$4c05e3c0$e411ab40$@gmail.com
Whole thread Raw
List pgsql-sql

I have a table (DOCS_TABLE) which can vary greatly in size.

 

DOCS_TABLE has a timestamp field (LAST_UPDATED). As the name suggests this records the date on which data in the record last changed.

 

When user’s open an app if they haven’t queried DOCS_TABLE previously it is loaded via the server using a fairly complicated WHERE statement (COMPLICATED_WHERE_STATEMENT) to establish which items from the table the user is permitted to access.

 

When the user closes the app the data from DOCS_TABLE is stored locally along with a timestamp to record the date and time the data was last refreshed (STORED_TMESTAMP). Next time the app opens it loads the data from the locally stored file.

 

It then needs to ensure the user is working with up-to-date data. At the moment I am running a refresh query SELECT [fields] FROM DOCS_TABLE WHERE LAST_UPDATED >[STORED_TIMESTAMP] AND [COMPLICATED_WHERE_STATEMENT].

 

I use the resulting data from the refresh query to update the data displayed to the user.

 

This works, although one obvious problem is that it does not deal with any rows from DOCS_TABLE which were available to the user when the data was stored locally, but have since been deleted or access denied. As such, within the app, I run a check to make sure the user still has access to that row before trying to do anything with it, but that does not strike me as a particularly elegant solution. It would be better if such items could be removed from the data displayed to the user.

 

The method I am contemplating and on which I would appreciate feedback is as follows:

 

  1. Load the data from the local file.
  2. Run a thread for the following.
  3. Run a query (ID_QUERY) to ascertain which rows are now available to the user: SELECT id FROM DOCS_TABLE WHERE [COMPLICATED_WHERE_STATEMENT].
  4. Check the locally saved data against the result of this query to see what rows are no longer available to the user and remove them.
  5. Build a list of ids from the locally saved data (EXISTING_ID_ARRAY).
  6. Check the locally saved data against the results from ID_QUERY to see whether there are any new records to be added and build a list of the ids (NEW_ID_ARRAY).
  7. Run the refresh query using the arrays: SELECT [fields] FROM DOCS_TABLE WHERE (id in ([NEW_ID_ARRAY])) OR (id in [EXISTING_ID_ARRAY] AND LAST_UPDATED >[STORED_TIMESTAMP]).

 

Unless there is a completely different proposal as to how to approach this refresh, I am happy that use of NEW_ID_ARRAY is the way to go.

 

But is it sensible to use EXISTING_ID_ARRAY? It will cut out the use of the COMPLICATED_WHERE_STATEMENT and enable the query to focus explicitly on a group of records clearly identified. However, the size of the array, could become quite large. In this respect, I suppose my question is whether there is a law of diminishing returns with such an IN clause. For example, if there were 1M records in the table and 20 items in the array, using the array must evidently be much quicker. But what if the array contained 800K items? I assume that it will then probably be significantly less efficient to use EXISTING_ID_ARRAY and more efficient to use COMPLICATED_WHERE_STATEMENT.

 

I appreciate without providing full details of the structure of DOCS_TABLE, the data being retrieved from it and the full nature of the COMPLICATED_WHERE_STATEMENT, I am asking for a comparison between apples and pears. What I am really interested in is whether my logic set out above is sound or idiotic and any suggestions on how best to achieve what I am trying to achieve.

 

Many thanks and Happy New Year!

 

Regards,

 

Mark

__

 

pgsql-sql by date:

Previous
From: Rob Sargent
Date:
Subject: Re: Solving my query needs with Rank and may be CrossTab
Next
From: Mike Martin
Date:
Subject: For each statement trigger and update table