As the first step in the decommissioning of sasCommunity.org the site has been converted to read-only mode.


Here are some tips for How to share your SAS knowledge with your professional network.


Combine Datasets using Inexact Character Variables in SAS

From sasCommunity
Jump to: navigation, search

Combine Datasets using Inexact Character Variables in SAS

I have coded an algorithm to calculate proportion of common letters between two text strings. When this code is used together with PROC SQL it allows combining the datasets using inexact character variables. Below is the description of this process; although this process works, the proc sql statement is very inefficient because it lacks “where” clause. Any suggestion to improve its efficiency will be much appreciated.

Algorithm description To obtain proportion of common letters in the two text strings (call them string1, and string2) search for the shorter length string in the longer string from left-to-right and record the maximum number of letters that match, call it “match_left2right”. Then do the same right-to-left and record it as “match_right2left.” Then calculate proportion of common letters as PCL = sum(match_left2right,match_right2left) / sum(length(string1),length(string2))

SAS Code The above algorithm is coded as a user defined function in SAS software using “PROC FCMP;” SAS.9.2 is required to use this function in DATA step or “PROC SQL.” Below is the complete code with description:

Invoke PROC FCMP, and name the function “PCL”, short for proportion of common letters, which will take input of two string variables:

      proc fcmp outlib=sasuser.funcs.trial;
      function pcl(string1 $, string2 $);

Create a few “dummy” variables and initialize them to zero:

      match_left2right=0;
      match_left2right1=0;
      match_right2left=0;
      match_right2left1=0;

Check if string1 has greater length then string2, and if so, obtain the maximum number of character in string2 that are also in the same order in string1 from left-to-right and record them in variable “match_left2right.”

      if length(string1) >= length(string2) then do;
       do i=1 to min(length(string1),length(string2)) by 1;
              do j=min(length(string1),length(string2))+1-i to 1 by -1;
     	      if index(string1,substr(string2,i,j)) >0 then match_left2right1=j;
                   if match_left2right1 > match_left2right then match_left2right=match_left2right1;
       end;
              end;

Also, obtain the maximum number of character in string2 that are also in the same order in string1 from right-to-left and record them in variable “match_right2left.”

       do i=1 to min(length(string1),length(string2)) by 1;
               do j=i to 1 by -1;
     	      if index(string1,substr(string2,i-j+1,j)) > 0 then match_right2left1=j;
     	      if match_right2left1 > match_right2left then match_right2left=match_right2left1;
       end;
              end;
      end;

However, if string2 has greater length then string1, obtain the maximum number of character in string1 that are also in the same order in string2 from left-to-right and record them in variable “match_left2right.”

       else if length(string2) >= length(string1) then do;
 
       do i=1 to min(length(string2),length(string1)) by 1;
              do j=min(length(string2),length(string1))+1-i to 1 by -1;
     	       if index(string2,substr(string1,i,j)) >0 then match_left2right1=j;
     	       if match_left2right1 > match_left2right then match_left2right=match_left2right1;
       end;
              end;

Also, obtain the maximum number of character in string1 that are also in the same order in string2 from right-to-left and record them in variable “match_right2left.”

       do i=1 to min(length(string2),length(string1)) by 1;
              do j=i to 1 by -1;
     	      if index(string2,substr(string1,i-j+1,j)) > 0 then match_right2left1=j;
     	      if match_right2left1 > match_right2left then match_right2left=match_right2left1;
       end;
              end;
      end;

Now we can calculate the proportion of common letters, return its value and end the function:

      pcl =  sum(match_left2right,match_right2left)/sum(length(string1),length(string2));
      if string1='' or string2=''  then pcl=.;
      return(pcl);
      endsub;

Merging two datasets using SQL and PCL function Say we have two datasets as follows, and we need to find the matches for master dataset in base dataset. Master dataset:

Masterid String1

101 ACCESS ANYTIME BANCORP INC

103 Fin Svcs, Ladenburg Thalmann Inc


Base dataset:

Baseid String2

3 ACCESS ANYTIME BANCORP

18 PEOPLES BANCTRUST CO INC

19 EUROTECH LTD

20 MED DESIGN CORP

21 SI DIAMOND TECH INC

22 Ladenburg Thalmann Financial Services Inc


The following SQL statements will provide us with the best match for the string variable in master dataset:

      proc sql;
       create table matched_output as
       select *,   pcl(masterdataset.string1, basebasedataset.string2) as pcl,
      from masterdataset ,  basebasedataset
      group by masterdataset.masterid
      having max(pcl)=pcl;
      quit;

Following is the output produced by the above sql statement:


masterid string1 string2 baseid pcl

101 ACCESS ANYTIME BANCORP INC ACCESS ANYTIME BANCORP 3 0.92

102 NETWORK CONNECTION INC (THE) NETWORK CONNECTION INC 4 0.88

103 Fin Svcs, Ladenburg Thalmann Inc Ladenburg Thalmann Financial Services Inc 22 0.52


This procedure is working but the Proc Sql statement is very inefficient any suggestion to make it more efficient is much appreciated. Any other suggestion is welcome as well.