Is there a way to identify and create a list of all acronyms in a dataframe?

Asked by user2520842 on 25 March 2022 at 17:02

I have a dataframe with a column that has many acronyms in it.

For each cell, I would like to (a) list the acronyms it contains in the next column and (b) produce a list of all unique acronyms found (no duplicates).

My plan is to use pyspellchecker to find any word that is misspelled and treat it as an acronym.

I know that method will also flag non-acronyms that are simply misspelled words, but I can't think of any other way to do it (unless we assume that all acronyms are in uppercase, which is unfortunately not the case in my dataset).

For example, I have:

Column 1
I worked for the NBA
I worked at the CIA
I am seeing a pt
CIA and NBA are both cool places to work
I also worked at NSA catedslf

Desired output:

Column 1	Column 2
I worked for the NBA	NBA
I worked at the CIA	CIA
I am seeing a pt	pt
CIA and NBA are both cool places to work	CIA, NBA
I also worked at NSA catedslf	NSA, catedslf

and

{NBA, CIA, pt, NSA, catedslf}

I threw catedslf in there just to show that it's okay if I also catch misspelled words (I know it's unavoidable).

**Timus** · Answer 1 · 2022-03-25T20:46:05.897000

Not sure if this is exactly what you want, but maybe it helps. I suppose you have a dataframe like this (not a series):

df =

                                   Column 1
0                      I worked for the NBA
1                       I worked at the CIA
2                          I am seeing a pt
3  CIA and NBA are both cool places to work
4             I also worked at NSA catedslf

Then this

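from spellchecker import SpellChecker

spell = SpellChecker()
df["Column 2"] = df.assign(
    # words the spell checker does not recognise (one set per row)
    misspelled=df["Column 1"].str.split().map(spell.unknown),
    # runs of two or more uppercase letters (one set per row)
    acronyms=df["Column 1"].str.findall(r"([A-Z]{2,})").map(set)
# merge both sets for each row
)[["misspelled", "acronyms"]].apply(lambda row: set.union(*row), axis=1)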

results in

                                   Column 1         Column 2
0                      I worked for the NBA            {NBA}
1                       I worked at the CIA            {CIA}
2                          I am seeing a pt             {pt}
3  CIA and NBA are both cool places to work       {NBA, CIA}
4             I also worked at NSA catedslf  {catedslf, NSA}

Then

result = set.union(*df["Column 2"])

produces

{'NSA', 'CIA', 'catedslf', 'NBA', 'pt'}
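
One caveat with the line above: `set.union(*...)` called on the class needs at least one set, so it raises a `TypeError` when there are no rows. A minimal sketch of a variant that tolerates that case, using plain Python sets in place of the dataframe column (the `rows` list is made-up sample data):

```python
# Per-row sets, standing in for the values of df["Column 2"] (sample data).
rows = [{"NBA"}, {"CIA"}, {"pt"}, {"NBA", "CIA"}, {"catedslf", "NSA"}]

# Starting from an empty set works even when there are no rows at all.
unique_items = set().union(*rows)
no_items = set().union(*[])

# Sorting gives a reproducible order for display; sets themselves are unordered.
print(sorted(unique_items))
```

With the dataframe, the same idea would presumably read `set().union(*df["Column 2"])`.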

To turn each set into a comma-separated string, use

df["Column 2"] = df["Column 2"].map(", ".join)

which finally gives

                                   Column 1       Column 2
0                      I worked for the NBA            NBA
1                       I worked at the CIA            CIA
2                          I am seeing a pt             pt
3  CIA and NBA are both cool places to work       CIA, NBA
4             I also worked at NSA catedslf  NSA, catedslf
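
Sets have no defined order, so the order of the items inside the joined strings is not guaranteed. If the order of appearance matters, a sketch along these lines keeps it. It uses only the standard library: `KNOWN_WORDS` is a hypothetical stand-in for the spell checker's dictionary, and `find_candidates` is an illustrative helper, not part of pandas or pyspellchecker. Unlike the regex above, it only treats whole words as uppercase acronyms.

```python
import re

# Hypothetical stand-in for a real dictionary lookup (e.g. a spell checker).
KNOWN_WORDS = {"i", "worked", "for", "at", "the", "am", "seeing", "a", "and",
               "are", "both", "cool", "places", "to", "work", "also"}

def find_candidates(text):
    """Return acronym candidates in order of first appearance, without duplicates."""
    hits = []
    for word in re.findall(r"[A-Za-z]+", text):
        is_upper_acronym = re.fullmatch(r"[A-Z]{2,}", word) is not None
        is_unknown = word.lower() not in KNOWN_WORDS
        if is_upper_acronym or is_unknown:
            hits.append(word)
    # dict.fromkeys drops duplicates while keeping insertion order
    return list(dict.fromkeys(hits))

sentences = [
    "I worked for the NBA",
    "I worked at the CIA",
    "I am seeing a pt",
    "CIA and NBA are both cool places to work",
    "I also worked at NSA catedslf",
]
per_row = [find_candidates(s) for s in sentences]
# Unique candidates across all rows, again in order of first appearance.
all_unique = list(dict.fromkeys(w for row in per_row for w in row))
print([", ".join(row) for row in per_row])
print(all_unique)
```

On a dataframe, a helper like this could be applied with `df["Column 1"].map(find_candidates)`, with the dictionary lookup swapped for the real spell checker.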

But there might be other problems ahead, for example punctuation: splitting on whitespace leaves tokens such as "NBA," with the comma still attached. Maybe you should do something like:

from string import punctuation

df["Column 1"] = df["Column 1"].str.translate(str.maketrans("", "", punctuation))

beforehand (there might be better ways to do that).
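
To see what that translation does to individual strings, here is a small standard-library sketch (the sample strings are made up). It also shows a side effect worth knowing about: the characters are deleted, not replaced by spaces, so hyphenated words get fused together.

```python
from string import punctuation

# Translation table that deletes every ASCII punctuation character.
strip_table = str.maketrans("", "", punctuation)

samples = ["I worked for the NBA.", "CIA, NBA and NSA!", "a well-known pt"]
cleaned = [s.translate(strip_table) for s in samples]
print(cleaned)
```

Note that `string.punctuation` only covers ASCII characters, so typographic quotes or dashes would survive this step.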
