How to find the co-occurrences of a specific term with udpipe in R?


I am new to the udpipe package, and I think it has great potential for the social sciences.

A current project of mine is to study how news articles write about networks and networking (i.e. the people kind, not computer networks). For this, I web-scraped 500 articles matching the search string "network" from a Dutch site for news about the flexible economy (this is the major source of news and discussion about e.g. self-employment). The data is in Dutch, but that should not matter for my question.

What I would like to use udpipe for is to find out in what context the noun "netwerk" or the verb "netwerken" is used. I tried kwic (from quanteda) to get this, but that only gives me the window of words in which the term occurs.
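
In case it helps, the kwic attempt looked roughly like this (corp is a placeholder name for the quanteda corpus of scraped articles):

library(quanteda)
kwic(tokens(corp), pattern = "netwerk*", window = 5)   # only returns a fixed window of words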

I would like to use the lemma (netwerk/netwerken) with the co-occurrence functionality, but without specifying a second term, and limited to that specific lemma rather than calculating all co-occurrences.

Is this possible, and how? A plain-language example:

"In my network, I contact a lot of people through Facebook" -> here I would like to get the co-occurrence of "network" and "contact" (a verb).
"I found most of my clients through my network" -> here I would like "my network" + "found my clients".

Any help is mightily appreciated!

BEST ANSWER

It looks like udpipe gives you more to work with for "context" than kwic does. If working at the sentence level, with lemmas, and with a restricted set of word types is enough, it should be rather straightforward. udpipe also has a prebuilt Dutch model available.

#install.packages("udpipe")
library(udpipe)
#dl <- udpipe_download_model(language = "english")
# Check the name on download result
udmodel_en <- udpipe_load_model(file = "english-ud-2.0-170801.udpipe")
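
For the Dutch data in the question, a Dutch model can be fetched and loaded the same way; a minimal sketch (dl_nl and udmodel_nl are placeholder names):
# dl_nl <- udpipe_download_model(language = "dutch")
# udmodel_nl <- udpipe_load_model(file = dl_nl$file_model)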

# Single and multisentence samples
txt <- c("Is this possible, and how? A normal language example: In
my network, I contact a lot of people through Facebook -> I would like to get co-occurrence of
network and contact (a verb) I found most of my clients through my network")
txtb <- c("I found most of my clients through my network")
x <- udpipe_annotate(udmodel_en, x = txt)
x <- as.data.frame(x)
xb <- udpipe_annotate(udmodel_en, x = txtb)
xb <- as.data.frame(xb)

# Raw preview
table(x$sentence[x$lemma == 'network'])

# Use x or xb here 
xn <- udpipe_annotate(udmodel_en, x = x$sentence[x$lemma == 'network'])
xdf <- as.data.frame(xn)

# Reduce noise (keep pronouns, nouns, verbs and proper nouns), then group lemmas per sentence (doc_id)
df_view <- subset(xdf, upos %in% c('PRON', 'NOUN', 'VERB', 'PROPN'))
library(tidyverse)
df_view %>%
  group_by(doc_id) %>%
  summarize(lemma = paste(sort(unique(lemma)), collapse = ", "))

On a quick test, the prebuilt model treats "network" and "networking" as independent root lemmas, so some rough stemming (or matching several lemma forms at once, see the sketch after the output below) might work better. I did, however, verify that adding "networks" to a sentence still produced a new match.

                    I found most of my clients through my network 
                                                                1 
I would like to get co-occurrence of network and contact (a verb) 
                                                                1 
     In my network, I contact a lot of people through Facebook -> 
                                                                1 
# A tibble: 3 × 2
  doc_id  lemma
  <chr>   <chr>
1 doc1    contact, Facebook, I, lot, my, network, people
2 doc2    co-occurrence, contact, get, I, like, network, verb
3 doc3    client, find, I, my, network
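
Regarding "network" and "networking" ending up as separate lemmas: a rough workaround is to widen the lemma filter to a small set of forms instead of a single string; a sketch, reusing x from above (the vector of forms is my own choice):

target <- c("network", "networking", "networks")   # for the Dutch data: c("netwerk", "netwerken")
table(x$sentence[x$lemma %in% target])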

It is also totally possible to pick up the preceding and following words as context by stepping up and down from the matching lemma indexes, but that felt closer to what kwic was already doing. I did not include dynamic co-occurrence tabulation and ordering, but I would imagine that is a rather trivial step now that the contextual words are extracted. It will probably also need some stop words etc., but those should become more apparent with bigger data.
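
For that tabulation step, udpipe also ships a cooccurrence() function; a sketch of running it on the annotated data and then limiting the result to the lemma of interest (term1/term2 are the column names it returns, xdf is reused from above):

# Which lemmas co-occur within the same sentence (nouns, verbs, proper nouns only)
cooc <- cooccurrence(x = subset(xdf, upos %in% c('NOUN', 'VERB', 'PROPN')),
                     term = "lemma",
                     group = c("doc_id", "sentence_id"))
# Keep only the pairs that involve the target lemma
subset(cooc, term1 == "network" | term2 == "network")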