Hi there, complete data analysis newbie here, with only basic knowledge of R.
I want to do an analysis of the relationship between aggregate polygyny levels and ideal family size (IFS) in Kenyan women from the DHS Survey data. The aggregate levels I wanted to check were Region, Ethnic Group and Religious Group.
My steps thus far:
1. Create aggregate values for regions: I determine the percentage of women in polygynous unions for each region, then assign a polygyny level based on thresholds (this worked).
2. Create an aggregate values frame for ethnic groups, using essentially the same procedure.
3. Create an aggregate values frame for religious groups, again with the same procedure.
4. Lastly, bring all the values together in a regression data frame, where I can use my aggregate polygyny levels and check their effect on individual IFS. Here, each row should represent one respondent and each column one of the aggregate variables, including the aggregate polygyny level for their respective religious or ethnic group.
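For readers following along, the steps above can be sketched in base R roughly as follows. This is only an illustration on toy data, not my actual code: the data frame `respondents`, the columns `Polygynous`, `PolygynyShare` and `PolygynyLevel_Religion`, and the threshold values are all made up for the example.

```r
# Toy respondent-level data; all names and values are hypothetical.
respondents <- data.frame(
  ReligiousGroup  = c(1, 1, 2, 2, 3, 3),
  Polygynous      = c(0, 1, 0, 0, 1, 1),   # 1 = woman is in a polygynous union
  IdealFamilySize = c(3, 5, 2, 4, 6, 7)
)

# Step 1: share of women in polygynous unions per group.
religious_agg <- aggregate(Polygynous ~ ReligiousGroup,
                           data = respondents, FUN = mean)
names(religious_agg)[2] <- "PolygynyShare"

# Step 2: turn the share into a level (cut points are arbitrary placeholders).
religious_agg$PolygynyLevel_Religion <- cut(
  religious_agg$PolygynyShare,
  breaks = c(-Inf, 0.25, 0.75, Inf),
  labels = c("Low", "Middle", "High")
)

# Step 3: attach the group-level value to every respondent via the shared key.
regression_data <- merge(
  respondents,
  religious_agg[, c("ReligiousGroup", "PolygynyLevel_Religion")],
  by    = "ReligiousGroup",
  all.x = TRUE
)
```

The point of the sketch is that the key column (`ReligiousGroup` here) has the same name and the same coding in both frames, so every respondent row finds its group.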
However, whenever I try to merge them, half of the columns end up containing almost only NAs, and I don't understand why. I assume the problem is the naming of the columns in the different frames, but the confusing part is that each of these columns still contains a handful of values (there should be around 10,000). The aggregate labels I created worked perfectly for the regional variable, but not for any of the others.
I tried different versions of the merge, this being the latest:
> # Merge with Religious Data
> regression_data <- merge(regression_data, merged_religious_data[, c("ReligiousGroup", "PolygynyLevel")], by.x = "ReligiousGroup", by.y = "ReligiousGroup", all.x = TRUE)
Warning message:
In merge.data.frame(regression_data, merged_religious_data[, c("ReligiousGroup", :
column names ‘PolygynyLevel.x’, ‘PolygynyLevel.y’ are duplicated in the result
>
> # Merge with Ethnic Data
> regression_data <- merge(regression_data, merged_ethnic_data[, c("V131", "PolygynyLevel")], by.x = "V131", by.y = "V131", all.x = TRUE)
Warning message:
In merge.data.frame(regression_data, merged_ethnic_data[, c("V131", :
column names ‘PolygynyLevel.x’, ‘PolygynyLevel.y’, ‘PolygynyLevel.x’, ‘PolygynyLevel.y’ are duplicated in the result
The idea is that I can clean up the column names afterwards, as long as all the information shows up. However, this is the output, and NAs suddenly abound. That doesn't make sense given the data, so I must be making some stupid mistake while merging. It's my first large-scale analysis project and I had to guess my way through a lot of it.
> summary(regression_data)
V131 ReligiousGroup Region PolygynyLevel.x IdealFamilySize UrbanRural EducationLevel
Min. : 1.0 Min. :1.0 Min. :1.00 Low :2702 Min. : 0.000 Min. :0.000 Min. :1.000
1st Qu.: 4.5 1st Qu.:2.0 1st Qu.:2.75 Middle:5404 1st Qu.: 3.000 1st Qu.:1.500 1st Qu.:1.000
Median : 8.0 Median :3.0 Median :4.50 High :2702 Median : 4.000 Median :2.000 Median :1.000
Mean :13.4 Mean :3.2 Mean :4.50 Mean : 4.445 Mean :1.875 Mean :1.625
3rd Qu.:11.5 3rd Qu.:4.0 3rd Qu.:6.25 3rd Qu.: 5.000 3rd Qu.:3.000 3rd Qu.:2.250
Max. :96.0 Max. :6.0 Max. :8.00 Max. :20.000 Max. :3.000 Max. :3.000
NA's :10793 NA's :10803
PolygynyLevel.y PolygynyLevel.x PolygynyLevel.y PolygynyLevel.x PolygynyLevel.y
Low : 1 Low : 3 Low : 1 Low : 1 Low : 3
Middle: 2 Middle: 5 Middle: 2 Middle: 2 Middle: 5
High : 2 High : 7 High : 2 High : 2 High : 7
NA's :10803 NA's :10793 NA's :10803 NA's :10803 NA's :10793
V131 is just the code for the ethnic marker. It's there because I thought I could save myself trouble by using the column names from the original data.
EDIT: I've scrolled up and down a bit in the Viewer, and it seems that, for whatever reason, R treats the group labels as individuals. It lists each of the aggregate labels once and then fills all the other rows with NAs. I have no idea why that happens.
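A pattern like this is what `merge(..., all.x = TRUE)` produces when the key values on the two sides do not line up. One way to check for that, sketched on made-up data (the frames `respondents` and `ethnic_agg` and their contents are hypothetical stand-ins, and the numeric-versus-label mismatch is just one possible cause):

```r
# Hypothetical stand-ins: respondents carry numeric codes,
# while the aggregate table carries text labels in the same column.
respondents <- data.frame(V131 = c(1, 1, 2, 3))
ethnic_agg  <- data.frame(
  V131          = c("Group A", "Group B", "Group C"),
  PolygynyLevel = c("Low", "Middle", "High"),
  stringsAsFactors = FALSE
)

# Compare the type of the key column on both sides.
class(respondents$V131)
class(ethnic_agg$V131)

# Key values that occur among respondents but not in the aggregate table.
unmatched <- setdiff(unique(respondents$V131), unique(ethnic_agg$V131))
unmatched

# With all.x = TRUE, every respondent row without a matching key
# is kept, but gets NA in the columns coming from the aggregate table.
merged <- merge(respondents, ethnic_agg, by = "V131", all.x = TRUE)
sum(is.na(merged$PolygynyLevel))
```

If `unmatched` is non-empty, the merge keys need to be recoded or renamed before the aggregate values can attach to the respondents.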
Sorry, this question wasn't asked very competently, but I solved the problem yesterday. The problem arose because I had named corresponding columns differently across the data frames I was merging, which led to confusion and mismatches.
I coded all the data frames again, keeping the coded names in the aggregates. They then corresponded perfectly and I got the regression frame I needed for my analysis.
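In case it helps someone else, here is a minimal sketch of the general idea, not my actual code. The toy frames `respondents`, `ethnic_levels` and `religious_levels`, their values, and the suffixed column names `PolygynyLevel_Ethnic` and `PolygynyLevel_Religion` are assumptions made up for the example. Renaming the level columns before merging is an extra step that avoids the duplicated `PolygynyLevel.x` / `PolygynyLevel.y` warning.

```r
# Toy respondent-level data using the original coded column names.
respondents <- data.frame(
  V131            = c(1, 2, 2, 3),
  ReligiousGroup  = c(1, 1, 2, 2),
  IdealFamilySize = c(3, 4, 5, 6)
)

# Aggregate tables keep the same key names and the same coding as above.
ethnic_levels <- data.frame(
  V131          = c(1, 2, 3),
  PolygynyLevel = c("Low", "Middle", "High"),
  stringsAsFactors = FALSE
)
religious_levels <- data.frame(
  ReligiousGroup = c(1, 2),
  PolygynyLevel  = c("Low", "High"),
  stringsAsFactors = FALSE
)

# Give each level column a distinct name so merge() has no clash to resolve.
names(ethnic_levels)[names(ethnic_levels) == "PolygynyLevel"] <- "PolygynyLevel_Ethnic"
names(religious_levels)[names(religious_levels) == "PolygynyLevel"] <- "PolygynyLevel_Religion"

# Merge one aggregate at a time on the shared key.
regression_data <- merge(respondents, ethnic_levels,
                         by = "V131", all.x = TRUE)
regression_data <- merge(regression_data, religious_levels,
                         by = "ReligiousGroup", all.x = TRUE)

# Sanity check: every respondent should have found its group.
stopifnot(!anyNA(regression_data$PolygynyLevel_Ethnic),
          !anyNA(regression_data$PolygynyLevel_Religion))
```

The `stopifnot()` at the end is a cheap guard: if the keys stop matching again, the script fails immediately instead of silently filling the columns with NAs.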
Sorry, future requests will hopefully be less stupid.