Merging aggregate data frames creates NAs


Hi there, complete data analysis newbie here, with only the basics of R.

I want to analyze the relationship between aggregate polygyny levels and ideal family size (IFS) among Kenyan women, using DHS survey data. The aggregation levels I want to check are Region, Ethnic Group and Religious Group.

My steps thus far:

  • Create aggregate values for regions: I determine the percentage of women in polygynous unions for each region, then assign polygyny level based on thresholds (worked)

  • Create an aggregate values frame for ethnic groups, essentially the same procedure

  • Create an aggregate values frame for religious groups, same procedure again

  • Lastly, bring all the values together in a regression data frame, where I can use my aggregate polygyny levels and check their effect on individual IFS. Here, each row should represent one respondent and each column one of the aggregate variables, including aggregate polygyny for their respective religious or ethnic group.
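The steps above can be sketched in base R roughly as follows. This is only a minimal sketch on made-up data: the `women` frame, the `Polygynous` indicator, the 25%/60% thresholds and the `RegionPolygynyLevel` column name are all assumptions for illustration, not the actual DHS variables or cut-offs.

```r
# Toy respondent-level data; all column names and values are made up.
women <- data.frame(
  Region          = rep(1:3, each = 4),
  Polygynous      = c(0, 0, 0, 1,  0, 1, 1, 0,  1, 1, 1, 0),
  IdealFamilySize = c(3, 4, 2, 5,  4, 6, 5, 3,  6, 7, 5, 4)
)

# Step 1: one row per region with the percentage of women in polygynous unions.
region_agg <- aggregate(Polygynous ~ Region, data = women,
                        FUN = function(x) 100 * mean(x))
names(region_agg)[2] <- "PolygynyPct"

# Step 2: assign a level from thresholds (25% and 60% are arbitrary here).
region_agg$RegionPolygynyLevel <- cut(region_agg$PolygynyPct,
                                      breaks = c(-Inf, 25, 60, Inf),
                                      labels = c("Low", "Middle", "High"))

# Step 3: attach the level to every respondent via the shared key column.
regression_data <- merge(women,
                         region_agg[, c("Region", "RegionPolygynyLevel")],
                         by = "Region", all.x = TRUE)
```

The aggregate frame has one row per group and the merged frame keeps one row per respondent, so `nrow(regression_data)` should equal `nrow(women)` after each merge.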

However, whenever I try to merge them, half of the columns end up containing almost only NAs, and I don't understand why. I assume the problem is the naming of the columns in the different frames, but the confusing part is that each of those columns still contains a handful of values (there should be around 10,000). The aggregate labels I created worked perfectly for the regional variable, but not for any of the others.

I tried different versions of the merge, this being the latest:

> # Merge with Religious Data
> regression_data <- merge(regression_data, merged_religious_data[, c("ReligiousGroup", "PolygynyLevel")], by.x = "ReligiousGroup", by.y = "ReligiousGroup", all.x = TRUE)
Warning message:
In merge.data.frame(regression_data, merged_religious_data[, c("ReligiousGroup",  :
  column names ‘PolygynyLevel.x’, ‘PolygynyLevel.y’ are duplicated in the result
> 
> # Merge with Ethnic Data
> regression_data <- merge(regression_data, merged_ethnic_data[, c("V131", "PolygynyLevel")], by.x = "V131", by.y = "V131", all.x = TRUE)
Warning message:
In merge.data.frame(regression_data, merged_ethnic_data[, c("V131",  :
  column names ‘PolygynyLevel.x’, ‘PolygynyLevel.y’, ‘PolygynyLevel.x’, ‘PolygynyLevel.y’ are duplicated in the result

The idea was that I could clean up the column names afterwards, as long as all the information showed up. However, this is the output: NAs suddenly abound. That doesn't make sense given the data, so I must be making some stupid mistake while merging. It's my first large-scale analysis project and I had to guess my way through a lot of it.
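The `.x`/`.y` warnings appear because `merge()` adds those suffixes whenever both frames have a non-key column with the same name, here `PolygynyLevel`. A possible way to sidestep that is to give each aggregate column its own name before merging. The sketch below uses tiny stand-in frames; the values and the new names `RegionPolygynyLevel`, `ReligionPolygynyLevel` and `EthnicPolygynyLevel` are hypothetical.

```r
# Hypothetical stand-ins for the frames in the question.
regression_data <- data.frame(
  V131           = c(1, 2, 1, 2),
  ReligiousGroup = c(1, 1, 2, 2),
  PolygynyLevel  = c("Low", "High", "Low", "High")  # regional level, merged earlier
)
merged_religious_data <- data.frame(ReligiousGroup = c(1, 2),
                                    PolygynyLevel  = c("Middle", "High"))
merged_ethnic_data <- data.frame(V131          = c(1, 2),
                                 PolygynyLevel = c("Low", "Middle"))

# Give each aggregate column a unique name before merging.
names(regression_data)[names(regression_data) == "PolygynyLevel"] <-
  "RegionPolygynyLevel"
names(merged_religious_data)[names(merged_religious_data) == "PolygynyLevel"] <-
  "ReligionPolygynyLevel"
names(merged_ethnic_data)[names(merged_ethnic_data) == "PolygynyLevel"] <-
  "EthnicPolygynyLevel"

# With unique names, no .x/.y suffixes are needed and nothing is duplicated.
regression_data <- merge(regression_data, merged_religious_data,
                         by = "ReligiousGroup", all.x = TRUE)
regression_data <- merge(regression_data, merged_ethnic_data,
                         by = "V131", all.x = TRUE)
```

Starting each run from a fresh copy of the respondent frame, rather than merging into an already merged `regression_data`, also keeps suffixed columns from piling up.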

summary(regression_data)
      V131       ReligiousGroup      Region     PolygynyLevel.x IdealFamilySize    UrbanRural    EducationLevel 
 Min.   : 1.0    Min.   :1.0     Min.   :1.00   Low   :2702     Min.   : 0.000   Min.   :0.000   Min.   :1.000  
 1st Qu.: 4.5    1st Qu.:2.0     1st Qu.:2.75   Middle:5404     1st Qu.: 3.000   1st Qu.:1.500   1st Qu.:1.000  
 Median : 8.0    Median :3.0     Median :4.50   High  :2702     Median : 4.000   Median :2.000   Median :1.000  
 Mean   :13.4    Mean   :3.2     Mean   :4.50                   Mean   : 4.445   Mean   :1.875   Mean   :1.625  
 3rd Qu.:11.5    3rd Qu.:4.0     3rd Qu.:6.25                   3rd Qu.: 5.000   3rd Qu.:3.000   3rd Qu.:2.250  
 Max.   :96.0    Max.   :6.0     Max.   :8.00                   Max.   :20.000   Max.   :3.000   Max.   :3.000  
 NA's   :10793   NA's   :10803                                                                                  
 PolygynyLevel.y PolygynyLevel.x PolygynyLevel.y PolygynyLevel.x PolygynyLevel.y
 Low   :    1    Low   :    3    Low   :    1    Low   :    1    Low   :    3   
 Middle:    2    Middle:    5    Middle:    2    Middle:    2    Middle:    5   
 High  :    2    High  :    7    High  :    2    High  :    2    High  :    7   
 NA's  :10803    NA's  :10793    NA's  :10803    NA's  :10803    NA's  :10793 

V131 is just the code for the ethnic marker. It's there because I thought I could save myself trouble just using the column names from the original.

EDIT: I've scrolled up and down the Viewer a bit, and it seems that, for whatever reason, R treats the group labels as individuals. It lists each of the aggregate labels once and then fills all the other rows with NAs. I have no idea why that happens.
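A few quick checks on the key columns can help narrow down a symptom like this before merging. The sketch below is only illustrative: the two small frames are hypothetical stand-ins, built so that most respondent keys are `NA`, similar to what the `summary()` output reports for `V131`.

```r
# Hypothetical frames: the respondent keys are mostly NA here.
regression_data <- data.frame(V131            = c(1, 2, NA, NA, NA),
                              IdealFamilySize = c(3, 4, 5, 2, 6))
merged_ethnic_data <- data.frame(V131                = c(1, 2, 3),
                                 EthnicPolygynyLevel = c("Low", "Middle", "High"))

# 1. How many respondents have no key at all?
n_missing_key <- sum(is.na(regression_data$V131))

# 2. Do both key columns have the same type (e.g. both numeric)?
same_class <- identical(class(regression_data$V131),
                        class(merged_ethnic_data$V131))

# 3. Which respondent keys have no counterpart in the aggregate frame?
unmatched <- setdiff(unique(regression_data$V131), merged_ethnic_data$V131)

# 4. Is the aggregate frame really one row per group?
one_row_per_group <- anyDuplicated(merged_ethnic_data$V131) == 0
```

If `n_missing_key` is close to the number of rows, the keys were probably lost in an earlier step, so the NAs come from before the merge rather than from the merge itself.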

There is 1 best solution below

SadGypsy On

Sorry, this question wasn't asked very competently, but I solved the problem yesterday. It arose because I had named the corresponding columns differently for the merge, which led to confusion and mismatches.

I coded all the data frames again, keeping the coded names in the aggregates. They then corresponded perfectly and I got the regression frame I needed for my analysis.
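For anyone hitting the same thing, here is a rough sketch of one way such a mismatch can produce NAs, and of how keeping the original coded key avoids it. The data are made up, and the labelled variant (`EthnicGroup` with placeholder labels "A", "B", "C") is only an assumption about what a differently named key might look like, not the exact frames from the question.

```r
# Toy respondent data keyed by the original coded variable.
women <- data.frame(V131       = c(1, 1, 2, 2, 3, 3),
                    Polygynous = c(0, 0, 1, 0, 1, 1))

# Aggregate keyed by the original coded column: same name, same codes.
ethnic_agg <- aggregate(Polygynous ~ V131, data = women, FUN = mean)
names(ethnic_agg)[2] <- "EthnicPolygynyShare"

# For contrast: the same aggregate with the key renamed and recoded to labels.
ethnic_agg_labelled <- data.frame(
  EthnicGroup         = c("A", "B", "C"),
  EthnicPolygynyShare = ethnic_agg$EthnicPolygynyShare
)

# Keys correspond: every respondent gets a value.
good <- merge(women, ethnic_agg, by = "V131", all.x = TRUE)

# Keys do not correspond: merge() runs without error but fills in NA.
bad <- merge(women, ethnic_agg_labelled,
             by.x = "V131", by.y = "EthnicGroup", all.x = TRUE)
```

Note that the mismatched merge fails silently, so checking `anyNA()` on the merged column right after each merge is a cheap safeguard.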

Sorry, future requests will hopefully be less stupid.