I'm a neophyte in R.
I have a data frame of about 4,000 conversations between two people. It's structured roughly like this:
Unique Identifier | column1 | column2 |
---|---|---|
123456 | blahblah | blahblah |
789412 | blahblah | blahblah |
My goal is to get a similarity score between the two messages (column1 and column2) in each row. So eventually the data frame would look like:
Unique Identifier | column1 | column2 | cosine |
---|---|---|---|
123456 | blahblah | blahblah | .562 |
789412 | blahblah | blahblah | .264 |
Ultimately, I’d have about 4,000 scores (one for each row). I’m assuming that costring is the correct function for this, but I keep getting errors. I suspect it's because R doesn't know that I want to compare column1 with column2 within each row.
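Consider the stringdist package instead. Its stringdist() function is vectorised, so given two columns it compares them row by row and returns one cosine distance per row.
We take 1 - the cosine distance to get cosine similarity.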
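
A minimal sketch of that suggestion, assuming the data frame is called `df` and the two text columns are named `column1` and `column2` (the sample rows and the `id` column name are made up for illustration). Note that `stringdist()` with `method = "cosine"` compares character q-gram profiles, so this measures surface overlap between the strings rather than semantic similarity; the choice of `q = 2` is an assumption you may want to tune.

```r
# install.packages("stringdist")  # run once if the package is not installed
library(stringdist)

# Hypothetical stand-in for the real ~4000-row data frame
df <- data.frame(
  id      = c(123456, 789412),
  column1 = c("hello how are you", "see you tomorrow"),
  column2 = c("hello how are things", "what time is it"),
  stringsAsFactors = FALSE
)

# stringdist() is vectorised: element i of column1 is compared with
# element i of column2, giving one cosine distance per row.
# Subtracting from 1 turns the distance into a similarity.
df$cosine <- 1 - stringdist(df$column1, df$column2, method = "cosine", q = 2)

df
```

If the text columns were read in as factors, wrapping them in `as.character()` before the call may avoid type errors.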