Merging two datasets from different indexation plataforms (same variables with different headers)

59 Views Asked by At

It might be silly, but I am having trouble merging two datasets (in a new file); I want to have the structure of Excel A (which I am calling BehaviouralEconomicsTourism.xlsx) with the information from the variables (same variables with different spelling from scopus.csv). So, I want them to appear as new rows, adding the information from these variables in common. Does someone have any suggestions? I tried to follow the following code (I tried some other variations before but did not save). But it did not work for hte purpose intended

# Install and load necessary packages if not already installed
if (!requireNamespace("tidyverse", quietly = TRUE)) install.packages("tidyverse")
if (!requireNamespace("readxl", quietly = TRUE)) install.packages("readxl")
if (!requireNamespace("writexl", quietly = TRUE)) install.packages("writexl")
if (!requireNamespace("janitor", quietly = TRUE)) install.packages("janitor")

library(tidyverse)
library(readxl)
library(writexl)
library(janitor)

# Load datasets from Sheet2 in BehaviouralEconomicsTourism.xlsx
behavioral_economics_df <- read_excel("/Users/s5320381/Downloads/BehaviouralEconomicsTourism.xlsx", sheet = "Sheet2")
scopus_df <- read.csv("/Users/s5320381/Downloads/scopus.csv")

# Clean column names
scopus_df <- janitor::clean_names(scopus_df)
behavioral_economics_df <- janitor::clean_names(behavioral_economics_df)

# Mapping columns between the two datasets
column_mapping <- c(
  "authors" = "authors",
  "author_full_names" = "author_full_names",
  "article_title" = "title",
  "publication_year" = "year",
  "volume" = "volume",
  "issue" = "issue",
  "publisher" = "publisher",
  "source_title" = "source_title",
  "author_keywords" = "author_keywords",
  "abstract" = "abstract",
  "issn" = "issn",
  "isbn" = "isbn",
  "start_page" = "page_start",
  "end_page" = "page_end",
  "doi" = "doi",
  "language" = "language_of_original_document",
  "doi_link" = "link",
  "affiliations" = "affiliations"
)

# Convert relevant columns to character in scopus_df
scopus_df <- scopus_df %>%
  mutate(across(c(volume, issue, start_page, end_page), as.character))

# Convert relevant columns to double in behavioral_economics_df
behavioral_economics_df <- behavioral_economics_df %>%
  mutate(across(c(volume, issue, start_page, end_page), as.double))

# Merge datasets based on common columns
merged_df <- bind_rows(behavioral_economics_df, scopus_df %>% select(column_mapping, references))

# Save the merged dataset to a new Excel file
writexl::write_xlsx(merged_df, "/Users/s5320381/Downloads/test201223.xlsx")

I tried to merge two datasets with same variables that possess different header for these variables.

1

There are 1 best solutions below

0
Jannn On

Assuming you have the following dataframes (given you didn't share a piece of code):

  behavioral_economics_df = tibble(article_title = letters[1:3],
                                   publication_year = seq(1990, 1992, 1),
                                   start_page = seq(1, 3, 1),
                                   end_page = seq(4, 6, 1),
                                   language = letters[4:6],
                                   doi_link = letters[7:9])
  
  
  scopus_df = tibble(title = letters[10:12],
                     year = seq(1993, 1995, 1),
                     page_start = seq(10, 12, 1),
                     page_end = seq(13, 15, 1),
                     language_of_original_document = letters[13:15],
                     link = letters[16:18],
                     other_column = c("A", 1, NA))

Then you can use dplyr::select to select and rename the variables you need as follows:

 merge_df = bind_rows(behavioral_economics_df,
                       scopus_df %>% 
                         select(article_title = title, 
                                publication_year = year,
                                start_page = page_start,
                                end_page = page_end,
                                language = language_of_original_document,
                                doi_link = link)) 

Note that for bind_rows to work you need that variables in both dataframes have same names and types so I believe the error might be related to this detail.