I have a dataframe with a column that has many acronyms in it.
I would like to (a) list the acronyms found in each cell in a new adjacent column and (b) produce a list of all unique acronyms found (no duplicates).
My plan is to use pyspellchecker to find any word that is misspelled and treat it as an acronym.
I know this method will also flag non-acronyms that are simply misspelled words, but I can't think of any other way to do it (unless we assume that all acronyms are in uppercase, which is unfortunately not the case in my dataset).
For example, I have:
| Column 1 |
|---|
| I worked for the NBA |
| I worked at the CIA |
| I am seeing a pt |
| CIA and NBA are both cool places to work |
| I also worked at NSA catedslf |
Desired output:
| Column 1 | Column 2 |
|---|---|
| I worked for the NBA | NBA |
| I worked at the CIA | CIA |
| I am seeing a pt | pt |
| CIA and NBA are both cool places to work | CIA, NBA |
| I also worked at NSA catedslf | NSA, catedslf |
and the set of unique acronyms:
{NBA, CIA, pt, NSA, catedslf}
I threw `catedslf` in there just to show that it's okay if I also catch misspelled words (I know it's unavoidable).
Not sure if this is exactly what you want, but maybe it helps. I assume you have a DataFrame like this (not a Series):
Then this
results in
Then
produces
and
finally
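
The steps above can be sketched roughly as follows, assuming pandas and pyspellchecker. The helper names `make_pyspellchecker_predicate`, `unknown_words`, and `tag_acronyms` are hypothetical, and this is one possible arrangement, not necessarily the code this answer originally used. Each word is checked individually so that its original casing is kept, since (as far as I know) pyspellchecker compares words in lowercase by default. Words are split on whitespace only.

```python
import pandas as pd


def make_pyspellchecker_predicate():
    """Build an is_known(word) predicate backed by pyspellchecker.

    The import is done lazily so the helpers below work with any checker.
    """
    from spellchecker import SpellChecker  # pip install pyspellchecker

    spell = SpellChecker()
    # unknown() returns the subset of the given words missing from the dictionary
    return lambda word: not spell.unknown([word])


def unknown_words(text, is_known):
    """Return the words in `text` rejected by is_known: in order, original case, no repeats."""
    found = []
    for word in text.split():
        if not is_known(word) and word not in found:
            found.append(word)
    return found


def tag_acronyms(df, is_known, source="Column 1", target="Column 2"):
    """Return a copy of df with a comma-joined `target` column, plus the set of unique hits."""
    per_row = df[source].apply(lambda text: unknown_words(text, is_known))
    out = df.assign(**{target: per_row.apply(", ".join)})
    unique = set().union(*per_row)
    return out, unique


# Sample data taken from the question
df = pd.DataFrame(
    {
        "Column 1": [
            "I worked for the NBA",
            "I worked at the CIA",
            "I am seeing a pt",
            "CIA and NBA are both cool places to work",
            "I also worked at NSA catedslf",
        ]
    }
)
```

To run it against the real dictionary, call `tag_acronyms(df, make_pyspellchecker_predicate())`. Which words end up flagged depends on the dictionary pyspellchecker ships with, so a short token like `pt` may or may not be caught.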
But there might be other problems ahead, for example punctuation. Maybe you should do something like:
beforehand (there might be better ways to do that).
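
One way to handle the punctuation, as a minimal sketch: replace anything that is neither a word character nor whitespace with a space before splitting. The function name `strip_punctuation` is hypothetical, and the regex is an assumption about what counts as punctuation in your data.

```python
import re


def strip_punctuation(text):
    """Replace every character that is neither a word character nor whitespace with a space."""
    return re.sub(r"[^\w\s]", " ", text)


# "CIA," and "NBA." become clean tokens once the text is split on whitespace
tokens = strip_punctuation("I worked at the CIA, then the NBA.").split()
print(tokens)
```

On a DataFrame this could be applied with `df["Column 1"].map(strip_punctuation)`, or directly with `df["Column 1"].str.replace(r"[^\w\s]", " ", regex=True)`. Note that this also splits contractions such as `don't` and dotted acronyms such as `N.B.A.` into pieces, so it may need adjusting.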