Thread: Advice - indexing on varchar fields where only last x characters known

Advice - indexing on varchar fields where only last x characters known

From
Gary Stainburn
Date:
I have a problem with a field that appears on a number of my tables.

The field is the Vehicle Identification Number. Every vehicle has one and it 
uniquely identifies that vehicle.

Traditionally this was a 11 character string but a number of years ago was 
extended to 17 characters by adding a 6 character prefix.


The problem that I have is that these VIN numbers are provided by a number of 
data systems including manufacturer feeds, logistics companies as well as 
internal systems. Some use the full 17 character string while others only use 
the last 11.

On top of this, my users are used to only having to type the last 6 characters 
for speed and usability reasons.

However, it means that every time I'm trying to connect various tables up 
using foreign keys or doing searches I have to make allowences for this which 
means I'm using things like substring, like, regex etc. all of which are very 
slow. 

Can anyone suggest a better / more efficient way of handling these.

Gary



Re: Advice - indexing on varchar fields where only last x characters known

From
Jayadevan M
Date:
Hi,
PostgreSQL does have the ability to index on expressions. Will that help?
http://www.postgresql.org/docs/9.1/static/indexes-expressional.html
I think mentioning the version of PostgreSQL you are using, sample queries/plans etc will help.
Jayadevan



On Thu, Oct 17, 2013 at 4:50 PM, Gary Stainburn <gary.stainburn@ringways.co.uk> wrote:
I have a problem with a field that appears on a number of my tables.

The field is the Vehicle Identification Number. Every vehicle has one and it
uniquely identifies that vehicle.

Traditionally this was a 11 character string but a number of years ago was
extended to 17 characters by adding a 6 character prefix.


The problem that I have is that these VIN numbers are provided by a number of
data systems including manufacturer feeds, logistics companies as well as
internal systems. Some use the full 17 character string while others only use
the last 11.

On top of this, my users are used to only having to type the last 6 characters
for speed and usability reasons.

However, it means that every time I'm trying to connect various tables up
using foreign keys or doing searches I have to make allowences for this which
means I'm using things like substring, like, regex etc. all of which are very
slow.

Can anyone suggest a better / more efficient way of handling these.

Gary


--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql

Re: Advice - indexing on varchar fields where only last x characters known

From
skinner@britvault.co.uk (Craig R. Skinner)
Date:
On 2013-10-17 Thu 12:20 PM |, Gary Stainburn wrote:
> 
> The problem that I have is that these VIN numbers are provided by a number of 
> data systems including manufacturer feeds, logistics companies as well as 
> internal systems. Some use the full 17 character string while others only use 
> the last 11.
> 

Split the vin into 2/3 columns?
vin_prefix (NULL), vin_suffix (NOT NULL)
vin_prefix, vin_centre, vin_suffix

UNIQUE all together in 1 constraint & index each separately.

Maybe...
-- 
Craig Skinner | http://twitter.com/Craig_Skinner | http://linkd.in/yGqkv7



Re: Advice - indexing on varchar fields where only last x characters known

From
Gavin Flower
Date:
On 18/10/13 00:20, Gary Stainburn wrote:
> I have a problem with a field that appears on a number of my tables.
>
> The field is the Vehicle Identification Number. Every vehicle has one and it
> uniquely identifies that vehicle.
>
> Traditionally this was a 11 character string but a number of years ago was
> extended to 17 characters by adding a 6 character prefix.
>
>
> The problem that I have is that these VIN numbers are provided by a number of
> data systems including manufacturer feeds, logistics companies as well as
> internal systems. Some use the full 17 character string while others only use
> the last 11.
>
> On top of this, my users are used to only having to type the last 6 characters
> for speed and usability reasons.
>
> However, it means that every time I'm trying to connect various tables up
> using foreign keys or doing searches I have to make allowences for this which
> means I'm using things like substring, like, regex etc. all of which are very
> slow.
>
> Can anyone suggest a better / more efficient way of handling these.
>
> Gary
>
>
Use 2 fields, one for the 6 character prefix, and the other for the 
original 11 digits.

Search for the 6 character prefix, or a null prefix AND the first 6 
characters of the 11 digit field.

It might be better to have a string for the prefix and make it blank 
rather than null, when nothing is entered there.


Cheers,
Gavin





Re: Advice - indexing on varchar fields where only last x characters known

From
Alvaro Herrera
Date:
Gary Stainburn wrote:

> The problem that I have is that these VIN numbers are provided by a number of 
> data systems including manufacturer feeds, logistics companies as well as 
> internal systems. Some use the full 17 character string while others only use 
> the last 11.
> 
> On top of this, my users are used to only having to type the last 6 characters 
> for speed and usability reasons.

Try creating an index on reverse(vin) and using the same function in
queries; you can put a % at the end of the sought-for literal to match
suffixes.  That works quite well and is very simple to implement.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Advice - indexing on varchar fields where only last x characters known

From
Gavin Flower
Date:
On 18/10/13 07:39, Alvaro Herrera wrote:
> Gary Stainburn wrote:
>
>> The problem that I have is that these VIN numbers are provided by a number of
>> data systems including manufacturer feeds, logistics companies as well as
>> internal systems. Some use the full 17 character string while others only use
>> the last 11.
>>
>> On top of this, my users are used to only having to type the last 6 characters
>> for speed and usability reasons.
> Try creating an index on reverse(vin) and using the same function in
> queries; you can put a % at the end of the sought-for literal to match
> suffixes.  That works quite well and is very simple to implement.
>
That is extremely cunning,
and obvious in retrospect! :-)


Cheers,
Gavin



Re: Advice - indexing on varchar fields where only last x characters known

From
David Johnston
Date:
Gary Stainburn wrote
> However, it means that every time I'm trying to connect various tables up 
> using foreign keys 

The degree to which each input source guarantees uniqueness of a given VIN
matters.  Keep in mind that any system that required the user to manually
enter the VIN has the propensity for errors.  Either outright invalid VINs
or marginally correct VINs with typos (which mean the VIN might be less or
more than 17 characters even if the 17-character version was intended). 
Specifically it is not uncommon for the VIN to be made-up when it is a
required field but the user does not know what the VIN is.

6 characters are unique within a model year but you need at least 8
characters to be generally unique for a given manufacturer.

For foreign key purposes it may be worthwhile to generate a "matching" table
and then during import use an algorithm to match up different records.  Then
during general queries that table can be used for joins.  In this way you
only pay the price of matching once and that during import as opposed to
during user requests.  Having a canonical VIN table helps here though during
import that table then has to be maintained.  The added advantage is that
such a mapping table allows you to search against a single table and such a
table (and likely its indexes) should be fairly small so as to make good use
of memory.

David J.






--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Advice-indexing-on-varchar-fields-where-only-last-x-characters-known-tp5774839p5774944.html
Sent from the PostgreSQL - sql mailing list archive at Nabble.com.



Re: Re: Advice - indexing on varchar fields where only last x characters known

From
Gary Stainburn
Date:
On Thursday 17 October 2013 20:16:30 David Johnston wrote:
> Gary Stainburn wrote
>
> > However, it means that every time I'm trying to connect various tables up
> > using foreign keys
>
> The degree to which each input source guarantees uniqueness of a given VIN
> matters.  Keep in mind that any system that required the user to manually
> enter the VIN has the propensity for errors.  Either outright invalid VINs
> or marginally correct VINs with typos (which mean the VIN might be less or
> more than 17 characters even if the 17-character version was intended).
> Specifically it is not uncommon for the VIN to be made-up when it is a
> required field but the user does not know what the VIN is.
>
> 6 characters are unique within a model year but you need at least 8
> characters to be generally unique for a given manufacturer.
>
> For foreign key purposes it may be worthwhile to generate a "matching"
> table and then during import use an algorithm to match up different
> records.  Then during general queries that table can be used for joins.  In
> this way you only pay the price of matching once and that during import as
> opposed to during user requests.  Having a canonical VIN table helps here
> though during import that table then has to be maintained.  The added
> advantage is that such a mapping table allows you to search against a
> single table and such a table (and likely its indexes) should be fairly
> small so as to make good use of memory.
>
> David J.

David

You raise a number of valid points here. Thankfully it is very rare that the 
user creates a VIN number so the chances of creating the wrong VIN is slim. 
Most data is created from data feeds from external sources. Although these 
may also be wrong (even Ford get it wrong some times) the chances are slim.

Where a user enters a 6 character VIN the times that more than one vehicle is 
returned is negligable, and where it does I return an error and ask for a 
longer VIN, which may be anywhere between 8 and the full 17 characters.

I like the idea of using reverse() to create the index, and will investigate 
that. Will it work on an old server running 8.3.3?

On one system I did generate a second field fixed at 11 characters for use 
only with foreign keys and used a function and trigger to automatically 
update it based on the value of the original VIN field.

I don't like it because it goes against the principles of relational databases 
in that it is redundant information, and was wondering if there were better 
options.  

While the reverse() option may work for searching it will not be ideal for 
foreign keys.



-- 
Gary Stainburn
Group I.T. Manager
Ringways Garages
http://www.ringways.co.uk 



Re: Re: Advice - indexing on varchar fields where only last x characters known

From
skinner@britvault.co.uk (Craig R. Skinner)
Date:
On 2013-10-18 Fri 08:22 AM |, Gary Stainburn wrote:
> 
> While the reverse() option may work for searching it will not be ideal for 
> foreign keys.
> 

Use an integer (serial) as the PK/FK & unique the VIN.

Then you can use reverse() for searching.

Kitcars, boat/garden/equipment/truck trailers, caravans, etc.
might not have any VIN.



Re: Advice - indexing on varchar fields where only last x characters known

From
David Johnston
Date:
Gary Stainburn wrote
> On Thursday 17 October 2013 20:16:30 David Johnston wrote:
>> Gary Stainburn wrote
>>
>> > However, it means that every time I'm trying to connect various tables
>> up
>> > using foreign keys
>>
>> The degree to which each input source guarantees uniqueness of a given
>> VIN
>> matters.  Keep in mind that any system that required the user to manually
>> enter the VIN has the propensity for errors.  Either outright invalid
>> VINs
>> or marginally correct VINs with typos (which mean the VIN might be less
>> or
>> more than 17 characters even if the 17-character version was intended).
>> Specifically it is not uncommon for the VIN to be made-up when it is a
>> required field but the user does not know what the VIN is.
>>
>> 6 characters are unique within a model year but you need at least 8
>> characters to be generally unique for a given manufacturer.
>>
>> For foreign key purposes it may be worthwhile to generate a "matching"
>> table and then during import use an algorithm to match up different
>> records.  Then during general queries that table can be used for joins. 
>> In
>> this way you only pay the price of matching once and that during import
>> as
>> opposed to during user requests.  Having a canonical VIN table helps here
>> though during import that table then has to be maintained.  The added
>> advantage is that such a mapping table allows you to search against a
>> single table and such a table (and likely its indexes) should be fairly
>> small so as to make good use of memory.
>>
>> David J.
> 
> David
> 
> You raise a number of valid points here. Thankfully it is very rare that
> the 
> user creates a VIN number so the chances of creating the wrong VIN is
> slim. 
> Most data is created from data feeds from external sources. Although these 
> may also be wrong (even Ford get it wrong some times) the chances are
> slim.
> 
> Where a user enters a 6 character VIN the times that more than one vehicle
> is 
> returned is negligable, and where it does I return an error and ask for a 
> longer VIN, which may be anywhere between 8 and the full 17 characters.

Unless they scan barcodes the service department usually has a frequent need
to key in VINs manually.

A better UI would be to list all matches and let the user pick.  Zero
matches could also result in a similarity search...

The nature of VINs make them poor FKs.  This is one case where I either use
a surrogate key or, in most cases, some other natural key and leae the VIN
as a descriptive attribute.  The VIN is often of secondary priority. Stock
numbers and invoice numbers are the primaries.

I haven't done much work with cross-department linking of vehicles; mostly
worked in a vertical fashion so a vehicle is informative.  I also only have
USA exposure.

As to the trigger usage: it is not redundant information.  One column
contains raw data and the other is normalized.  While they both represent
the same concept generally the specific characteristics makes them
non-redundant in the model.  A surrogate key is, in many ways, redundant to
the table's natural key but it does have different characteristics and
purpose so my dislike for them is not because they add redundant information
to the model.

David J.





--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Advice-indexing-on-varchar-fields-where-only-last-x-characters-known-tp5774839p5775023.html
Sent from the PostgreSQL - sql mailing list archive at Nabble.com.



Re: Re: Advice - indexing on varchar fields where only last x characters known

From
Gary Stainburn
Date:
Craig,

We only deal with Ford and Kia new cars. Thankfully, I am only dealing with 
pre-sales / sales at the moment. 

After-sales is already catered for. VIN validation there is already dealt with 
as once the VIN is entered it is used to access manufacturer's warranty / 
aftersales databases.

On Friday 18 October 2013 14:44:12 David Johnston wrote:
> A better UI would be to list all matches and let the user pick.  Zero
> matches could also result in a similarity search...

I have built this facility into the UI using onKeyUp events to trigger AJAX 
(onchange only works once the field loses focus). Although this greatly 
increases accuracy on user input it does increase the load on the SQL server 
as entering a single VIN generates a number of searches.

>
> The nature of VINs make them poor FKs.  This is one case where I either use
> a surrogate key or, in most cases, some other natural key and leae the VIN
> as a descriptive attribute.  The VIN is often of secondary priority. Stock
> numbers and invoice numbers are the primaries.

Where possible I do use SERIAL for primary keys and foreign keys. However, 
this method still requires the initial search to be done using real data to 
make the initial pairing.  There are a number of stages that have to to be 
actioned between vehicle order and customer handover. Unfortunately, many of 
these actions occur asynchronously so I cannot guarantee certain pairs will 
exist before others are required.

You make a valid point about stock numbers making a far better PK and where I 
have one I use it. 

However, as an example. a factory order from Ford won't even have a VIN 
allocated until it reached sheduled (build start date allocated), so the only 
reference I have is the Ford Order Number. The FON is the link from our 
internal order to the vehicle. Then once the vehicle is scheduled I get a VIN 
too. When the vehicle is received into our compound it is booked in by 
scanning the VIN. This then links to the FON, and back to the order. The 
stock number only gets allocated after the vehicle has been booked into 
stock. Things then become much easier, which is good because that's also when 
it becomes busier.

>
> I haven't done much work with cross-department linking of vehicles; mostly
> worked in a vertical fashion so a vehicle is informative.  I also only have
> USA exposure.
>
> As to the trigger usage: it is not redundant information.  One column
> contains raw data and the other is normalized.  While they both represent
> the same concept generally the specific characteristics makes them
> non-redundant in the model.  A surrogate key is, in many ways, redundant to
> the table's natural key but it does have different characteristics and
> purpose so my dislike for them is not because they add redundant
> information to the model.

You make a valid point here.  I am looking to make a number of changes to my 
systems as I add enhancements and I think that  I will use all of the methods 
mentioned. 

Using reverse() may not be as much use as I first thought because of the AJAX 
style of the user input, but could still be of use on programmatic searching.

Adding a SERIAL as an index for my compound stock would also be of use and 
could be implemented  progressively.

Using the fixed length 11 character VIN field and a trigger has already proved 
it's worth on one table so I think I will expand it's use more to improve 
creating foreign key pairs as different data feeds allow.

Thanks to all replies.

-- 
Gary Stainburn
Group I.T. Manager
Ringways Garages
http://www.ringways.co.uk