rxMerge for factored levels

I'm new to RRE and I'm having an issue with the rxMerge function.

I want to merge two xdf datasets on a factor column that has a different number of levels in each dataset. I want an inner join that keeps only the matching levels. I get the following error:

ERROR: Factor key 'mat' has mismatched levels. Call rxFactors to make the levels the same, then call rxSort on the input files.

Here is my merge call:

rxMergeXdf(inFile1 = cible_2015_xdf, inFile2 = data_2015,
       outFile = all_data_2015,
       matchVars = "mat",
       type = "inner",
       varsToDrop2 = "ref",
       overwrite=TRUE
       )

I've seen an example in the documentation with origin and destination flights (http://www.revolutionanalytics.com/sites/default/files/data-step-white-paper.pdf), but I want my output to contain only the matching levels. The levels are unique in both datasets; they are ID numbers that contain letters, so I cannot convert them to numeric values.

Thanks a lot in advance

Ouriel

There are 2 best solutions below

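As the error message says, the key factor must have the same levels in both files, so you will need to re-level both factors to a common set of levels (here, the union of the two) before merging.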

# Union of the levels of "mat" found in the two xdf files
new_levels <- unique(c(rxGetVarInfo(cible_2015_xdf, varsToKeep = "mat")[[1]][["levels"]],
                       rxGetVarInfo(data_2015, varsToKeep = "mat")[[1]][["levels"]]))

# Rewrite each file in place so that "mat" uses the common set of levels
rxFactors(inData = cible_2015_xdf, outFile = cible_2015_xdf,
          factorInfo = list(mat = list(newLevels = new_levels)),
          overwrite = TRUE)
rxFactors(inData = data_2015, outFile = data_2015,
          factorInfo = list(mat = list(newLevels = new_levels)),
          overwrite = TRUE)

rxMergeXdf(inFile1 = cible_2015_xdf, inFile2 = data_2015,
           outFile = all_data_2015,
           matchVars = "mat",
           type = "inner",
           varsToDrop2 = "ref",
           overwrite=TRUE)
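
To see what the re-levelling step does, here is a minimal base R sketch of the same idea on two small in-memory data frames. The data frames `cible_df` and `data_df` and their values are made up for illustration, and base R's `merge()` and `droplevels()` stand in for the RevoScaleR calls, so treat this as an analogy rather than what `rxMergeXdf` does internally.

```r
# Hypothetical toy data standing in for the two xdf files
cible_df <- data.frame(mat = factor(c("A1", "B2", "C3")),
                       y   = c(1, 2, 3))
data_df  <- data.frame(mat = factor(c("B2", "C3", "D4", "E5")),
                       x   = c(10, 20, 30, 40))

# Union of the levels found in either data frame
new_levels <- unique(c(levels(cible_df$mat), levels(data_df$mat)))

# Re-level both key columns so they share exactly the same levels
cible_df$mat <- factor(cible_df$mat, levels = new_levels)
data_df$mat  <- factor(data_df$mat,  levels = new_levels)

# merge() does an inner join by default (all = FALSE)
merged <- merge(cible_df, data_df, by = "mat")

# The join keeps only matching rows, but the factor still carries every
# level of the union; drop the unused ones to keep only the matches
merged$mat <- droplevels(merged$mat)
levels(merged$mat)
```

The point is that the inner join already restricts the rows to the matching IDs; the common set of levels only has to be agreed on beforehand, and any unused levels can be dropped from the result afterwards.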

In addition to what Derek said, you can also use the dplyrXdf package, which will handle this and similar factor-related issues for you.

devtools::install_github("RevolutionAnalytics/dplyrXdf")
library(dplyrXdf)

all_data_2015 <- inner_join(cible_2015_xdf, data_2015, by="mat")

Disclosure: I wrote dplyrXdf.