Grouping occurrences of a string into a row


tl;dr Is there a way to group together a large number of values to a single column without truncation of those values?


I am working on a data frame with 48,178 entries in RStudio. The data frame has 2 columns: the first contains unique numeric values, and the second contains repeated strings.

----------
id    name
1     forest
2     forest
3     park
4     riverbank
.
.
.
.
.
48178   water
----------

I would like to group together all entries on the basis of the unique values in the 2nd column. I have used the function ddply (from the plyr package) to achieve the result. I now have the following derived table:

----------
type         V1
forest       forest,forest,forest
park         park,park,park,park
riverbank    riverbank,riverbank,
water        water,water,water,water
----------

However, on applying the str function to the derived data frame, I find that the column appears to contain truncated values rather than every instance of each string.

The output of str is:

'data.frame':   4 obs. of  2 variables:
 $ type: chr  "forest" "park" "riverbank" "water"
 $ V1  : chr  "forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,f"| __truncated__ "park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,pa"| __truncated__ "riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverba"| __truncated__ "water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,w"| __truncated__

How do I group together same strings and push them to a row, without truncation?


There are 4 best solutions below

Best answer

Extending the answer of HubertL, the str() function does exactly what it is supposed to but is perhaps the wrong choice for what you intend to do.

From the (rather limited) information you have given in your question, it seems that you have already achieved what you are looking for, i.e., concatenating all strings of the same type.

However, it appears that you are stuck with the output of the str() function.

Please refer to the help page ?str.

From the Description section:

Compactly display the internal structure of an R object, a diagnostic function and an alternative to summary (and to some extent, dput). Ideally, only one line for each ‘basic’ structure is displayed. It is especially well suited to compactly display the (abbreviated) contents of (possibly nested) lists. The idea is to give reasonable output for any R object.

str() has a parameter nchar.max which defaults to 128.

nchar.max maximal number of characters to show for character strings. Longer strings are truncated, see longch example below.

The longch example in the Examples section illustrates the effect of this parameter:

nchar(longch <- paste(rep(letters,100), collapse = ""))
#[1] 2600
str(longch)
# chr "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvw"| __truncated__
str(longch, nchar.max = 52)
# chr "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxy"| __truncated__
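
Applied to the situation in the question, a minimal sketch may help. The data frame `df` below is made up for illustration, and `tapply()` stands in for the ddply call, which the question does not show. Under those assumptions, nchar() confirms that the complete strings are stored, and a larger nchar.max only changes how much of them str() displays:

```r
# Hypothetical stand-in for the asker's data: 4 names, 50 rows each
df <- data.frame(id = 1:200,
                 name = rep(c("forest", "park", "riverbank", "water"), times = 50),
                 stringsAsFactors = FALSE)

# Concatenate all names per group (base R stand-in for the ddply step)
V1 <- tapply(df$name, df$name, paste, collapse = ",")
res <- data.frame(type = names(V1), V1 = as.vector(V1),
                  stringsAsFactors = FALSE)

# The stored strings are complete: each is longer than the 128-character
# default at which str() cuts off its display
nchar(res$V1)

# Splitting on the separator recovers every one of the 50 entries per group
lengths(strsplit(res$V1, ","))

# Default display is shortened; a larger nchar.max shows the whole string
str(res$V1)
str(res$V1, nchar.max = 600)
```

To inspect a single full value without str(), `cat(res$V1[1], "\n")` prints it untruncated.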

Maximum length of a character string

According to ?"Memory-limits", the number of bytes in a character string is limited to 2^31 - 1 ~ 2*10^9. Given the number of rows in your data frame and the length of the entries in name, the concatenated strings won't exceed 0.6*10^6 characters, which is far from the limit.
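
The estimate can be checked with a few lines of arithmetic. This is only a sketch: it assumes that "riverbank" is the longest name in the data (it is the longest of the four shown in the question) and takes the worst case in which every row holds that name:

```r
n_rows  <- 48178                # number of entries stated in the question
longest <- nchar("riverbank")   # assumed longest name: 9 characters

# Worst case: all rows carry the longest name, joined by single commas
upper_bound <- n_rows * longest + (n_rows - 1)
upper_bound                     # 481779

upper_bound < 0.6e6             # TRUE: below the 0.6*10^6 estimate
upper_bound / (2^31 - 1)        # fraction of the per-string byte limit
```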


If all you want is a count of occurrences, then why not simply use table?

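df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "id    name
1     forest
2     forest
3     park
4     riverbank")
df
# table() gives the number of times each word occurs
df1 <- as.data.frame(table(df$name))

# If for some reason you want the repetition itself, then:
# SIMPLIFY = FALSE keeps x a list even when all counts happen to be equal
x <- mapply(rep, as.character(df1$Var1), df1$Freq,
            SIMPLIFY = FALSE, USE.NAMES = FALSE)
y <- sapply(x, paste, collapse = ",")
data.frame(type = df1$Var1, V1 = y)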
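
As a possible variation on the same idea, the repetition can also be built directly from the table of counts with vapply(), which guarantees one string per name. This is a sketch on the same four-row toy data, not a claim about the best approach; the object names `counts` and `V1` are chosen here for illustration:

```r
df <- data.frame(id = 1:4,
                 name = c("forest", "forest", "park", "riverbank"),
                 stringsAsFactors = FALSE)

counts <- table(df$name)   # named counts, one per distinct name

# For each name, repeat it as many times as it occurs and join with commas
V1 <- vapply(names(counts),
             function(nm) paste(rep(nm, counts[[nm]]), collapse = ","),
             character(1))

data.frame(type = names(counts), V1 = unname(V1), stringsAsFactors = FALSE)
```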

Your strings are not really truncated; only their display by str() is truncated:

size <- 48000
df <- data.frame(1:size, 
                 type=sample(c("forest", "park", "riverbank", "water" ), 
                             size, replace = TRUE), 
                 stringsAsFactors = FALSE)

res <- by(df$type , df$type, paste, collapse=",")


str(res)
 'by' chr [1:4(1d)] "forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,f"| __truncated__ ...
 - attr(*, "dimnames")=List of 1
  ..$ df$type: chr [1:4] "forest" "park" "riverbank" "water"
 - attr(*, "call")= language by.default(data = df$type, INDICES = df$type, FUN = paste, collapse = ",")


lengths( strsplit(res, ','))
   forest      park riverbank     water 
    11993     12017     11953     12037 

sum(lengths( strsplit(res, ',')))
[1] 48000

Try storing the results in a list using the base R split() function:

new.list <- split(df, f=df$type)

This will split the data frame into a list of data frames, one per type, each of which can be accessed using double square brackets (for example, new.list[["forest"]]). Because the records stay in separate rows, the character strings are never combined, so nothing is truncated in the display.
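
A short sketch of how the resulting list might be used follows. The six-row data frame is made up for illustration, with a column named `type` to match the call above:

```r
# Hypothetical small data frame with a 'type' column
df <- data.frame(id = 1:6,
                 type = c("forest", "park", "forest", "water", "park", "forest"),
                 stringsAsFactors = FALSE)

new.list <- split(df, f = df$type)

names(new.list)          # one list element per distinct type
new.list[["forest"]]     # the rows whose type is "forest"
sapply(new.list, nrow)   # number of rows in each group
```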