# Re: Need an SQL or Cobol function to evaluate the similarity of two chains - IBM AS400

This is a discussion on Re: Need an SQL or Cobol function to evaluate the similarity of two chains - IBM AS400 ; I second the 'Why not?' question, but an alternative might be to sort the names into alphabetic order and look for matches. (e.g. Sorting the names CABDE and CABED would give you ABCDE both times.) But this approach has obvious ...

# Thread: Re: Need an SQL or Cobol function to evaluate the similarity of two chains

1. ## Re: Need an SQL or Cobol function to evaluate the similarity of two chains

I second the 'Why not?' question, but an alternative might be to sort
the names into alphabetic order and look for matches. (e.g. Sorting the
names CABDE and CABED would give you ABCDE both times.) But this
approach has obvious flaws. SOUNDEX exists to solve the problem you
have, so why not use it?

2. ## Re: Need an SQL or Cobol function to evaluate the similarity of two chains

How soundex will tell me whether :

JULIE
and
STE-JULIE

are similar ? That's why !

I would need a function that tells me the % of similarity, may 50% in
that case.

Thanks for help.

walker.l2 wrote:
> I second the 'Why not?' question, but an alternative might be to sort
> the names into alphabetic order and look for matches. (e.g. Sorting the
> names CABDE and CABED would give you ABCDE both times.) But this
> approach has obvious flaws. SOUNDEX exists to solve the problem you
> have, so why not use it?

3. ## Re: Need an SQL or Cobol function to evaluate the similarity of two chains

Well that is slightly different to the cases you mentioned to start
with. (Which a SOUNDEX would work well for.)

But how similar do you think "JULIE" and "STE-JULIE" are? None of the
first 5 characters in "JULIE" positionally match "STE-J", so position
matching gives you 0% similarity.
Only 2 characters in common are found in the first 5 - 40% similarity.
All characters in "JULIE" are in "STE-JULIE" but they are also all in
"EILUJ-ETS" - is "JULIE" more similar to "STE-JULIE" than "EILUJ-ETS"
or the same? If so, how much % similarity do you assign to each case,
and how do you determine that?
If you do a substring match you will find "JULIE" in "STE-JULIE", but
not in "STE-JULEI" or "STE-JUILE", so again how much % similarity do
you assign to those cases?

The first step to solving this is to work out what the business rules
are that define 'similarity', then you can attack the problem of
finding an algorithm that fulfills those business rules.

4. ## Re: Need an SQL or Cobol function to evaluate the similarity of two chains

"Karlharv" wrote in message
news:1161873557.662754.57400@f16g2000cwb.googlegro ups.com...
> How soundex will tell me whether :
>
> JULIE
> and
> STE-JULIE
>
> are similar ? That's why !
>

Ah, that isn't quite what your original posting was about" :-)

So would "NEW YORK" (USA) be similar to "YORK" (England)?

or would "NEW ORLEANS" be similar to "NEW YORK"?

And if so, would "YORK" being similar to "NEW YORK" and "NEW YORK" being
similar to "NEW ORLEANS" imply that "YORK" was similar to "NEW ORLEANS"?
(Stretching the point I know but you get the idea)...

Are you trying to find the same places? i.e. geographic - if so do you have
Zip Codes or Post Codes?

Tell us more!

GB

5. ## Re: Need an SQL or Cobol function to evaluate the similarity of two chains

gb wrote:
> "Karlharv" wrote in message
> news:1161873557.662754.57400@f16g2000cwb.googlegro ups.com...
> > How soundex will tell me whether :
> >
> > JULIE
> > and
> > STE-JULIE
> >
> > are similar ? That's why !
> >

>
> Ah, that isn't quite what your original posting was about" :-)
>
> So would "NEW YORK" (USA) be similar to "YORK" (England)?
>
> or would "NEW ORLEANS" be similar to "NEW YORK"?
>
> And if so, would "YORK" being similar to "NEW YORK" and "NEW YORK" being
> similar to "NEW ORLEANS" imply that "YORK" was similar to "NEW ORLEANS"?
> (Stretching the point I know but you get the idea)...
>
> Are you trying to find the same places? i.e. geographic - if so do you have
> Zip Codes or Post Codes?
>
> Tell us more!

This kind of thing is not a trivial issue at all. I did some work for
several years for a reinsurance company, and the issue of name search
and assignment was very important. The name of the underlying insured
was not part of the reinsurance company's database - their insureds
were other insurance companies, either primary (ceding) or reinsurance
(retroceding) - but it was a necessary data element. As just one
example, consider the Royal Dutch Shell oil company. This might appear
in various claims as "Shell", "Shell Oil", "Shell Oil Co.", "Royal
Dutch Shell", and numerous other variants. They were all the same
company, however, and the reinsurance company would want to know how
much exposure it had on all claims originating from this one company.

6. ## Re: Need an SQL or Cobol function to evaluate the similarity of two chains

"Jonathan Ball" wrote in message
news:1161893197.025793.168900@k70g2000cwa.googlegr oups.com...
| gb wrote:
| > "Karlharv" wrote in message
| > news:1161873557.662754.57400@f16g2000cwb.googlegro ups.com...
| > > How soundex will tell me whether :
| > >
| > > JULIE
| > > and
| > > STE-JULIE
| > >
| > > are similar ? That's why !
| > >
| >
| > Ah, that isn't quite what your original posting was about" :-)
| >
| > So would "NEW YORK" (USA) be similar to "YORK" (England)?
| >
| > or would "NEW ORLEANS" be similar to "NEW YORK"?
| >
| > And if so, would "YORK" being similar to "NEW YORK" and "NEW YORK"
being
| > similar to "NEW ORLEANS" imply that "YORK" was similar to "NEW
ORLEANS"?
| > (Stretching the point I know but you get the idea)...
| >
| > Are you trying to find the same places? i.e. geographic - if so do
you have
| > Zip Codes or Post Codes?
| >
| > Tell us more!
|
| This kind of thing is not a trivial issue at all. I did some work
for
| several years for a reinsurance company, and the issue of name
search
| and assignment was very important. The name of the underlying
insured
| was not part of the reinsurance company's database - their insureds
| were other insurance companies, either primary (ceding) or
reinsurance
| (retroceding) - but it was a necessary data element. As just one
| example, consider the Royal Dutch Shell oil company. This might
appear
| in various claims as "Shell", "Shell Oil", "Shell Oil Co.", "Royal
| Dutch Shell", and numerous other variants. They were all the same
| company, however, and the reinsurance company would want to know how
| much exposure it had on all claims originating from this one
company.

In a former life, I also designed a system where we needed to look for
duplicate people and / or addresses in a large insurance database.

We had a fancy system that maintained a table of "possible duplicate
record pairs". The system looked for first / last name reversal,
various spelling differences and abbreviations in addresses, data
entry errors on social security numbers, member numbers, dates of
birth, etc.

Our solution was based on two tables:

a "field / string alternate key table" and a
a "similar pairs table".

The alternate key table contained only three fields, a pointer key
that points to the source record in the primary production database, a
code that identified the source field in the database and a value code
"token" based on whatever string was found in the database on that
record.

We had an extract / pre-processing program that read the production
data base and generate multiple records in this alternate key table
based on every "identifying field" (name, address, zip, social
security number, member number, date of birth, etc) in the database.

The pre-processing module was table driven. Each identifying field
and each sub-string or word in an identifying field in the primary
database was used to make a table lookup in a standard spelling /
standard abbreviation control table. If the string was found, the
table value was used to replace whatever was in the production
database.

Using your example and skipping the field identifier code, St, Sainte
and Ste would all be converted into a standard token of St and the
alternate key file would have ten records that would look like:

1,St
1,Julie
2,St
2,Julie
3,St
3,Jullie
4,St
4,Julie
5,St
5,Julei

Where 1 to 5 are keys to your five sample records. This alternate key
file is actually indexed by the token field and used to look up groups
of records in the primary database that have common words and
substrings on different records. In my system we actually used a
Sondex code in this field for all alphabetic token strings rather than
duplicate data from the production database. This saved space and
improved coverage of different spellings which reduced the need to
handle every possible misspelling.

Note: this approach eliminates any concern for confusion on family and
given name and allows for extreme flexibility in address fields. It
also lets you put Social Security number and Member number and Account
number and Date of Birth and Zip code in the same look-up field.

The second table which was built from the alternate key table had the
following fields:

· k1 & k2 which point to a pair of similar records in the
production system and in our example would contain the numbers 1 to 5.

· the date when two records were first found to be similar
enough to be loaded into the pairs table.

· a similarity rating (discussed below).

· a manually entered code field which if non blank meant that
the pair was a false positive and after human review was determined to
be two different people and should be excluded from future reports of
possible duplicates.

The similarity rating (a value between zero and one) was based on the
"route mean square" of the weighted value of similarities of all
tokens that were common to the two primary database records. Our
definition of similarity had several parts. Which primary database
fields contributed the common tokens? How many tokens were common
across the pair of records? How close were they to being identical?
How rare was each token in the whole database?

For example, two primary records with the names John Smith but
different addressed, different dates of birth, etc. did not generate a
similarity rating high enough to create a record in the possible pairs
table. That is because there were thousands of Johns and thousands of
Smiths in the file. So having these tokens in common did not
contribute many points to the total similarity rating. However, one
record with Zigfried Zapatowskie and another Zigfried Zapatowsky
probably would yield a high enough similarity rating even with
different values in all other fields. Likewise, if date of birth and
zip code were the same and if not too many records came from the same
zip code it would probably have generated a possible duplicate record
even if there were differences to both name and address. We found a
lot of twins this way!

In any case, I think the weights and other particulars of the formula
will have to be tuned to the specifics of your data and your driving
business issues. For example, what is the cost of researching false
positives vs. the cost of missing real duplicates?

I hope this was valuable or at least interesting. I enjoyed the walk
down this particular memory lane.

Mike Sicilian

msicilian at earthlink dot net

7. ## Re: Need an SQL or Cobol function to evaluate the similarity of two chains

In a former life with a non-profit where we had to find people based on last
name, we kept a key based on a slightly modified soundex, which gave us a
base set of records to look at, then narrowed it down by generating a
percentage match based on a letter by letter compare which I think counted
100 percent match for the same letter in the same place, but if that wasn't
true it counted 50 percent for the same letter in the next position. I
didn't write it, so I don't recall the exact logic, just the general
approach. The results were displayed in subfile, so a human always made the
final selection. The desired percentage match was variable so you could get
more or fewer records in the subfile. I think we defaulted to 75%.

Soundex by itself is not the answer, because sometimes it gives you
outlandish matches.

Sam

"Karlharv" wrote in message
news:1161873557.662754.57400@f16g2000cwb.googlegro ups.com...
> How soundex will tell me whether :
>
> JULIE
> and
> STE-JULIE
>
> are similar ? That's why !
>
> I would need a function that tells me the % of similarity, may 50% in
> that case.
>
> Thanks for help.
>
> walker.l2 wrote:
>> I second the 'Why not?' question, but an alternative might be to sort
>> the names into alphabetic order and look for matches. (e.g. Sorting the
>> names CABDE and CABED would give you ABCDE both times.) But this
>> approach has obvious flaws. SOUNDEX exists to solve the problem you
>> have, so why not use it?

>