Re: Need an SQL or Cobol function to evaluate the similarity of two chains - IBM AS400

This is a discussion on Re: Need an SQL or Cobol function to evaluate the similarity of two chains - IBM AS400 ; I second the 'Why not?' question, but an alternative might be to sort the names into alphabetic order and look for matches. (e.g. Sorting the names CABDE and CABED would give you ABCDE both times.) But this approach has obvious ...

+ Reply to Thread
Results 1 to 7 of 7

Thread: Re: Need an SQL or Cobol function to evaluate the similarity of two chains

  1. Re: Need an SQL or Cobol function to evaluate the similarity of two chains

    I second the 'Why not?' question, but an alternative might be to sort
    the names into alphabetic order and look for matches. (e.g. Sorting the
    names CABDE and CABED would give you ABCDE both times.) But this
    approach has obvious flaws. SOUNDEX exists to solve the problem you
    have, so why not use it?


  2. Re: Need an SQL or Cobol function to evaluate the similarity of two chains

    How soundex will tell me whether :

    JULIE
    and
    STE-JULIE

    are similar ? That's why !

    I would need a function that tells me the % of similarity, may 50% in
    that case.

    Thanks for help.

    walker.l2 wrote:
    > I second the 'Why not?' question, but an alternative might be to sort
    > the names into alphabetic order and look for matches. (e.g. Sorting the
    > names CABDE and CABED would give you ABCDE both times.) But this
    > approach has obvious flaws. SOUNDEX exists to solve the problem you
    > have, so why not use it?



  3. Re: Need an SQL or Cobol function to evaluate the similarity of two chains

    Well that is slightly different to the cases you mentioned to start
    with. (Which a SOUNDEX would work well for.)

    But how similar do you think "JULIE" and "STE-JULIE" are? None of the
    first 5 characters in "JULIE" positionally match "STE-J", so position
    matching gives you 0% similarity.
    Only 2 characters in common are found in the first 5 - 40% similarity.
    All characters in "JULIE" are in "STE-JULIE" but they are also all in
    "EILUJ-ETS" - is "JULIE" more similar to "STE-JULIE" than "EILUJ-ETS"
    or the same? If so, how much % similarity do you assign to each case,
    and how do you determine that?
    If you do a substring match you will find "JULIE" in "STE-JULIE", but
    not in "STE-JULEI" or "STE-JUILE", so again how much % similarity do
    you assign to those cases?

    The first step to solving this is to work out what the business rules
    are that define 'similarity', then you can attack the problem of
    finding an algorithm that fulfills those business rules.


  4. Re: Need an SQL or Cobol function to evaluate the similarity of two chains


    "Karlharv" wrote in message
    news:1161873557.662754.57400@f16g2000cwb.googlegro ups.com...
    > How soundex will tell me whether :
    >
    > JULIE
    > and
    > STE-JULIE
    >
    > are similar ? That's why !
    >


    Ah, that isn't quite what your original posting was about" :-)

    So would "NEW YORK" (USA) be similar to "YORK" (England)?

    or would "NEW ORLEANS" be similar to "NEW YORK"?

    And if so, would "YORK" being similar to "NEW YORK" and "NEW YORK" being
    similar to "NEW ORLEANS" imply that "YORK" was similar to "NEW ORLEANS"?
    (Stretching the point I know but you get the idea)...

    Are you trying to find the same places? i.e. geographic - if so do you have
    Zip Codes or Post Codes?

    Tell us more!

    GB




  5. Re: Need an SQL or Cobol function to evaluate the similarity of two chains

    gb wrote:
    > "Karlharv" wrote in message
    > news:1161873557.662754.57400@f16g2000cwb.googlegro ups.com...
    > > How soundex will tell me whether :
    > >
    > > JULIE
    > > and
    > > STE-JULIE
    > >
    > > are similar ? That's why !
    > >

    >
    > Ah, that isn't quite what your original posting was about" :-)
    >
    > So would "NEW YORK" (USA) be similar to "YORK" (England)?
    >
    > or would "NEW ORLEANS" be similar to "NEW YORK"?
    >
    > And if so, would "YORK" being similar to "NEW YORK" and "NEW YORK" being
    > similar to "NEW ORLEANS" imply that "YORK" was similar to "NEW ORLEANS"?
    > (Stretching the point I know but you get the idea)...
    >
    > Are you trying to find the same places? i.e. geographic - if so do you have
    > Zip Codes or Post Codes?
    >
    > Tell us more!


    This kind of thing is not a trivial issue at all. I did some work for
    several years for a reinsurance company, and the issue of name search
    and assignment was very important. The name of the underlying insured
    was not part of the reinsurance company's database - their insureds
    were other insurance companies, either primary (ceding) or reinsurance
    (retroceding) - but it was a necessary data element. As just one
    example, consider the Royal Dutch Shell oil company. This might appear
    in various claims as "Shell", "Shell Oil", "Shell Oil Co.", "Royal
    Dutch Shell", and numerous other variants. They were all the same
    company, however, and the reinsurance company would want to know how
    much exposure it had on all claims originating from this one company.


  6. Re: Need an SQL or Cobol function to evaluate the similarity of two chains


    "Jonathan Ball" wrote in message
    news:1161893197.025793.168900@k70g2000cwa.googlegr oups.com...
    | gb wrote:
    | > "Karlharv" wrote in message
    | > news:1161873557.662754.57400@f16g2000cwb.googlegro ups.com...
    | > > How soundex will tell me whether :
    | > >
    | > > JULIE
    | > > and
    | > > STE-JULIE
    | > >
    | > > are similar ? That's why !
    | > >
    | >
    | > Ah, that isn't quite what your original posting was about" :-)
    | >
    | > So would "NEW YORK" (USA) be similar to "YORK" (England)?
    | >
    | > or would "NEW ORLEANS" be similar to "NEW YORK"?
    | >
    | > And if so, would "YORK" being similar to "NEW YORK" and "NEW YORK"
    being
    | > similar to "NEW ORLEANS" imply that "YORK" was similar to "NEW
    ORLEANS"?
    | > (Stretching the point I know but you get the idea)...
    | >
    | > Are you trying to find the same places? i.e. geographic - if so do
    you have
    | > Zip Codes or Post Codes?
    | >
    | > Tell us more!
    |
    | This kind of thing is not a trivial issue at all. I did some work
    for
    | several years for a reinsurance company, and the issue of name
    search
    | and assignment was very important. The name of the underlying
    insured
    | was not part of the reinsurance company's database - their insureds
    | were other insurance companies, either primary (ceding) or
    reinsurance
    | (retroceding) - but it was a necessary data element. As just one
    | example, consider the Royal Dutch Shell oil company. This might
    appear
    | in various claims as "Shell", "Shell Oil", "Shell Oil Co.", "Royal
    | Dutch Shell", and numerous other variants. They were all the same
    | company, however, and the reinsurance company would want to know how
    | much exposure it had on all claims originating from this one
    company.


    In a former life, I also designed a system where we needed to look for
    duplicate people and / or addresses in a large insurance database.

    We had a fancy system that maintained a table of "possible duplicate
    record pairs". The system looked for first / last name reversal,
    various spelling differences and abbreviations in addresses, data
    entry errors on social security numbers, member numbers, dates of
    birth, etc.

    Our solution was based on two tables:

    a "field / string alternate key table" and a
    a "similar pairs table".

    The alternate key table contained only three fields, a pointer key
    that points to the source record in the primary production database, a
    code that identified the source field in the database and a value code
    "token" based on whatever string was found in the database on that
    record.

    We had an extract / pre-processing program that read the production
    data base and generate multiple records in this alternate key table
    based on every "identifying field" (name, address, zip, social
    security number, member number, date of birth, etc) in the database.

    The pre-processing module was table driven. Each identifying field
    and each sub-string or word in an identifying field in the primary
    database was used to make a table lookup in a standard spelling /
    standard abbreviation control table. If the string was found, the
    table value was used to replace whatever was in the production
    database.

    Using your example and skipping the field identifier code, St, Sainte
    and Ste would all be converted into a standard token of St and the
    alternate key file would have ten records that would look like:

    1,St
    1,Julie
    2,St
    2,Julie
    3,St
    3,Jullie
    4,St
    4,Julie
    5,St
    5,Julei

    Where 1 to 5 are keys to your five sample records. This alternate key
    file is actually indexed by the token field and used to look up groups
    of records in the primary database that have common words and
    substrings on different records. In my system we actually used a
    Sondex code in this field for all alphabetic token strings rather than
    duplicate data from the production database. This saved space and
    improved coverage of different spellings which reduced the need to
    handle every possible misspelling.

    Note: this approach eliminates any concern for confusion on family and
    given name and allows for extreme flexibility in address fields. It
    also lets you put Social Security number and Member number and Account
    number and Date of Birth and Zip code in the same look-up field.

    The second table which was built from the alternate key table had the
    following fields:

    k1 & k2 which point to a pair of similar records in the
    production system and in our example would contain the numbers 1 to 5.

    the date when two records were first found to be similar
    enough to be loaded into the pairs table.

    a similarity rating (discussed below).

    a manually entered code field which if non blank meant that
    the pair was a false positive and after human review was determined to
    be two different people and should be excluded from future reports of
    possible duplicates.

    The similarity rating (a value between zero and one) was based on the
    "route mean square" of the weighted value of similarities of all
    tokens that were common to the two primary database records. Our
    definition of similarity had several parts. Which primary database
    fields contributed the common tokens? How many tokens were common
    across the pair of records? How close were they to being identical?
    How rare was each token in the whole database?

    For example, two primary records with the names John Smith but
    different addressed, different dates of birth, etc. did not generate a
    similarity rating high enough to create a record in the possible pairs
    table. That is because there were thousands of Johns and thousands of
    Smiths in the file. So having these tokens in common did not
    contribute many points to the total similarity rating. However, one
    record with Zigfried Zapatowskie and another Zigfried Zapatowsky
    probably would yield a high enough similarity rating even with
    different values in all other fields. Likewise, if date of birth and
    zip code were the same and if not too many records came from the same
    zip code it would probably have generated a possible duplicate record
    even if there were differences to both name and address. We found a
    lot of twins this way!

    In any case, I think the weights and other particulars of the formula
    will have to be tuned to the specifics of your data and your driving
    business issues. For example, what is the cost of researching false
    positives vs. the cost of missing real duplicates?

    I hope this was valuable or at least interesting. I enjoyed the walk
    down this particular memory lane.

    Mike Sicilian

    msicilian at earthlink dot net



  7. Re: Need an SQL or Cobol function to evaluate the similarity of two chains

    In a former life with a non-profit where we had to find people based on last
    name, we kept a key based on a slightly modified soundex, which gave us a
    base set of records to look at, then narrowed it down by generating a
    percentage match based on a letter by letter compare which I think counted
    100 percent match for the same letter in the same place, but if that wasn't
    true it counted 50 percent for the same letter in the next position. I
    didn't write it, so I don't recall the exact logic, just the general
    approach. The results were displayed in subfile, so a human always made the
    final selection. The desired percentage match was variable so you could get
    more or fewer records in the subfile. I think we defaulted to 75%.

    Soundex by itself is not the answer, because sometimes it gives you
    outlandish matches.

    Sam

    "Karlharv" wrote in message
    news:1161873557.662754.57400@f16g2000cwb.googlegro ups.com...
    > How soundex will tell me whether :
    >
    > JULIE
    > and
    > STE-JULIE
    >
    > are similar ? That's why !
    >
    > I would need a function that tells me the % of similarity, may 50% in
    > that case.
    >
    > Thanks for help.
    >
    > walker.l2 wrote:
    >> I second the 'Why not?' question, but an alternative might be to sort
    >> the names into alphabetic order and look for matches. (e.g. Sorting the
    >> names CABDE and CABED would give you ABCDE both times.) But this
    >> approach has obvious flaws. SOUNDEX exists to solve the problem you
    >> have, so why not use it?

    >




+ Reply to Thread