Dropping repeat characters of a long string in Stata (left-truncation)

21 Views Asked by At

I have a dataset of what used to be ~8000 rows of a long 80 character string variable. I split up the string so that every row is now one character of the string where caseid identifies the individual (which used to be one row when the string was still together) and method is the name of the variable with the strings split up. Here is an example of what my string looked like before I separated it: ‘1110000333333333333333333333333NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN’ and now this string is split between rows 1-67 with CASE ID 1 10 1, for example. I need to drop all the characters that are at the end of the string that repeat for each case ID. In this example, I would drop rows 36-67 that repeat the N for Case ID 1 10 1. The characters that can be dropped are 1, 2, 3, 5, 6, 7, 8, 9, C, E, L, M, N, S, W. However, the number of characters (i.e., number of rows) I would drop for each individual varies as one person may only have 3 characters that repeat, for example, and another person may have 20, and someone else the whole string is the same character and therefore I would drop the individual/ all the rows. Essentially, I am trying to left-truncate my data. I would like to create a new variable of these dropped characters called dropped_characters to see which rows /characters were dropped. I tried doing this when the string as still together (i.e., before I separated it out into different rows) and it didn’t work. So now I am trying to do it with the string already split up. I have tried various versions of the code below but I keep getting a syntax error and it properly create dropped_characters

  • Create a new variable to identify the characters to be dropped gen dropped_characters = ""

  • Identify groups based on caseid egen group_id = group(caseid)

  • Loop through each group levelsof group_id, local(groups) foreach group in `groups' {

    • Loop through each character to identify and drop the unnecessary ones forval i = 1/max_rows' { local char : substr(method, i', 1) if inlist("char'", "1", "2", "3", "5", "6", "7", "8", "9", "C", "E", "L", "M", "N", "S", "W") { * Append to the variable indicating dropped characters replace dropped_characters = dropped_characters + "char'" if group_id == group' * Set the character to missing replace method = subinstr(method, "char'", "", i') if group_id == group' } } }
  • Drop empty rows drop if missing(method)

  • Display the dropped characters for each case ID tabulate caseid dropped_characters, gen(dropped_char_summary)

  • Save the cleaned dataset save cleaned_data.dta, replace

I tried the code above as well as substr which Stata returned as an error

0

There are 0 best solutions below