Find a Pattern across Multiple lines with R

Question

637 Views Asked by Johann At 24 May 2020 at 18:58

I am trying to identify a pattern that spans multiple lines (exactly two). The pattern on either line alone is not unique, so I need to match both lines together.

So far I have tried the function grep, but I think I am missing the correct regular expression:

grep("^Item\\s{0,}2[^A]", f.text, ignore.case = TRUE)

This line is a modified part of the edgar package function getFilings and tries to extract only the Management's Comment/Item 2 section for quarterly results. If possible, I would add something after 2[^A] in the pattern that matches the line break and then the string "Management...".

The pattern in the plain-text files I have looks like this:

Item 2.
Management Discussion and Analysis of Financial Condition and Results of Operations

I would appreciate any comment on how to capture this best in a regular expression with R.

Example input:

21 Item 2.
Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10 Item 3.
Quantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk

and the desired output would be

Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10

I need to match "Item 2." followed on the next line by "Management Discussion", since "Item 2" alone is not unique. How can I write a regular expression that spans two lines?

There are 2 best solutions below

Chris Ruehlemann On 25 May 2020 at 11:07

You can simply remove the line break:

gsub("\\n", "", text)
[1] "21 Item 2.Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10 Item 3.Quantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk"

Now you have everything in one long line and can extract whatever pattern you have in mind. For example, using str_extract from package stringr:

library(stringr)
str_extract(gsub("\\n", "", text), "Management.*on Form 10")
[1] "Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10"
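
The end marker "on Form 10" above is hard-coded to this sample. As a minimal sketch of a more general variant, assuming every section starts with "Item <number>." and that the heading text follows directly after it, the match can be anchored on "Item 2." plus "Management" and stopped before the next item heading. The names flat, pattern and item2 are illustrative only, and base R is used here instead of stringr to avoid a dependency:

```r
text <- "21 Item 2.\nManagement Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10 Item 3.\nQuantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk"

# Replace each line break (and surrounding whitespace) with a single space,
# so "Item 2." and "Management" end up on the same line
flat <- gsub("\\s*\\n\\s*", " ", text, perl = TRUE)

# (?<=Item 2\\.\\s)  lookbehind: the match must be preceded by "Item 2. "
# Management.*?      lazy match starting at the section heading
# (?=...)            lookahead: stop before the next "Item <number>." or at the end
pattern <- "(?<=Item 2\\.\\s)Management.*?(?=\\s+Item\\s+\\d+\\.|$)"

# perl = TRUE is required for lookbehind and lookahead in base R
item2 <- regmatches(flat, regexpr(pattern, flat, perl = TRUE, ignore.case = TRUE))
item2
```

On the sample data this returns the same string as the str_extract call above. If the body of Item 2 itself contains text of the form "Item <number>.", the lazy match stops there, so the lookahead may need tightening for real filings.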

Data:

text <- "21 Item 2.
Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10 Item 3.
Quantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk"

text
[1] "21 Item 2.\nManagement Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10 Item 3.\nQuantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk"
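
The code above assumes the whole filing is a single string. The grep call in the question suggests f.text may instead be a character vector with one element per line (for example from readLines), in which case no single element ever contains a line break. A rough sketch for that case, with a made-up f.text and illustrative variable names, either tests neighbouring elements or joins the lines first:

```r
# Hypothetical input: one element per line, e.g. as returned by readLines()
f.text <- c(
  "21 Item 2.",
  "Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10 Item 3.",
  "Quantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk"
)

# TRUE where a line ends with "Item 2." (other text may precede it)
is_item2 <- grepl("Item\\s*2\\.\\s*$", f.text, ignore.case = TRUE)

# TRUE where the *following* line starts with "Management";
# the last line has no successor, so FALSE is appended for it
next_is_mgmt <- c(grepl("^\\s*Management", f.text[-1], ignore.case = TRUE), FALSE)

# Index of the line that opens the section: both conditions must hold
start <- which(is_item2 & next_is_mgmt)
start

# Alternative: join the lines so that one regex can span the line break
joined <- paste(f.text, collapse = "\n")
spans_break <- grepl("Item\\s*2\\.\\s*\\nManagement", joined, ignore.case = TRUE)
spans_break
```

Here start is 1, the position of the "Item 2." line, so the section text begins at f.text[start + 1]. The joined string can be passed to gsub or str_extract exactly like text above.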

**Martin Gal** · Accepted Answer · 2020-05-25T10:53:25.243000

This is not very sophisticated, since I'm no expert in string manipulation, but the tidyverse (specifically its stringr package) provides some powerful tools to get your desired output.

text <- "21 Item 2.
Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10 Item 3.
Quantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk Item 4.
Fluffy Text example Item 5.
Lorem ipsum dolor sit amet, consectetur adipisici elit"

Now

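library(tidyverse)  # attaches stringr and the %>% pipe

text %>%
  # keep the rest of each line that follows "Item <digit>." and a line break
  str_extract_all("(?<=Item\\s\\d[[:punct:]]\\n).*", simplify = TRUE) %>%
  # drop the trailing heading of the next item
  str_remove("\\s+Item\\s\\d[[:punct:]]")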

gives you

[1] "Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10"
[2] "Quantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk"                           
[3] "Fluffy Text example"                                                                                                                                 
[4] "Lorem ipsum dolor sit amet, consectetur adipisici elit"

If you just want to extract Item 2, replace the \\d inside str_extract_all with 2.
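
As a minimal sketch of that substitution, assuming stringr is installed and using the shorter sample from the question: str_extract is used instead of str_extract_all because only the first match is wanted, and the variable names are illustrative.

```r
library(stringr)

text <- "21 Item 2.\nManagement Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10 Item 3.\nQuantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk"

# Lookbehind fixed to "Item 2" + punctuation + line break; ".*" stops at the
# end of the line because "." does not match a line break by default
item2_line <- str_extract(text, "(?<=Item\\s2[[:punct:]]\\n).*")

# Remove the trailing heading of the next item ("Item 3.")
item2 <- str_remove(item2_line, "\\s+Item\\s\\d[[:punct:]]")
item2
```

This relies on the line break sitting directly after "Item 2."; if the files may contain trailing spaces or Windows line endings (\r\n), the lookbehind would need to allow for them.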
