sas generate all possible miss spelling

1.6k Views Asked by At

Does any one know how to generate the possible misspelling ?

Example : unemployment - uemployment - onemploymnet -- etc.

3

There are 3 best solutions below

2
On

If you just want to generate a list of possible misspellings, you might try a tool like this one. Otherwise, in SAS you might be able to use a function like COMPGED to compute a measure of the similarity between the string someone entered, and the one you wanted them to type. If the two are "close enough" by your standard, replace their text with the one you wanted.

Here is an example that computes the Generalized Edit Distance between "unemployment" and a variety of plausible mispellings.

data misspell;
  input misspell $16.;
  length misspell string $16.;
  retain string "unemployment";
  GED=compged(misspell, string,'iL');
datalines;
nemployment
uemployment
unmployment
uneployment
unemloyment
unempoyment
unemplyment
unemploment
unemployent
unemploymnt
unemploymet
unemploymen
unemploymenyt
unemploymenty
unemploymenht
unemploymenth
unemploymengt
unemploymentg
unemploymenft
unemploymentf
blahblah
;
proc print data=misspell label;
   label GED='Generalized Edit Distance';
   var misspell string GED;
run;
1
On

If you are looking for a general spell checker, SAS does have proc spell.

It will take some tweaking to get it working for your situation; it's very old and clunky. It doesn't work well in this case, but you may have better results if you try and use another dictionary? A Google search will show other examples.

filename name temp lrecl=256;
options caps;

data _null_;
  file name;
  informat name $256.;
  input name &;
  put name;
  cards;
uemployment 
onemploymnet 
;

proc spell in=name
  dictionary=SASHELP.BASE.NAMES
  suggest;
run;

options nocaps;
0
On

Essentially you are trying to develop a list of text strings based on some rule of thumb, such as one letter is missing from the word, that a letter is misplaced into the wrong spot, that one letter was mistyped, etc. The problem is that these rules have to be explicitly defined before you can write the code, in SAS or any other language (this is what Chris was referring to). If your requirement is reduced to this one-wrong-letter scenario then this might be managable; otherwise, the commenters are correct and you can easily create massive lists of incorrect spellings (after all, all combinations except "unemployment" constitute a misspelling of that word).

Having said that, there are many ways in SAS to accomplish this text manipulation (rx functions, some combination of other text-string functions, macros); however, there are probably better ways to accomplish this. I would suggest an external Perl process to generate a text file that can be read into SAS, but other programmers might have better alternatives.