Create AVRO File AWS Glue Dynamic Frame One to Many Join

342 Views Asked by At

Is the following behavior possible in AWS Glue? I am trying to create a single AVRO file by joining two DynamicFrames in a one-to-many fasion.

For example I have a DyF with many Teacher types: teacher_id teacher_name

and a Dyf with many Student types: student_id teacher_id student_name

I am trying to combine these so that a teacher may have many students for example:

[
  {
    teacher_id: 1,
    teacher_name: 'John',
    students: [
      {
        student_id: 100,
        teacher_id: 1
        student_name: 'Sally'
      },
      {
        student_id: 200,
        teacher_id: 1,
        student_name: 'Jack'
      }
    ]
  },
  ...
]

Using Join.apply(teacher, student, 'teacher_id', 'teacher_id') only results in repeated rows as such:

[
  {
    teacher_id: 1,
    teacher_name: 'John',
    student_id: 100,
    teacher_id: 1
    student_name: 'Sally'
  },
  {
    teacher_id: 1,
    teacher_name: 'John',
    student_id: 200,
    teacher_id: 1
    student_name: 'Jack'
  }
  ...
]
]

1

There are 1 best solutions below

0
On

There are probably better ways to do this than what I am suggesting below, but I expect that the following approach would work:

from pyspark.sql.functions import col,struct

#first make your two tables into dataframes so we can use Spark
students = students.toDF()
teachers = teachers.toDF()

#then convert your students DF to having a foreign key and a struct
students = students.select(
  col("teacher_id").alias("student_teacher_id"),
  struct("student_id","teacher_id","student_name").alias("student_data"))#I'm not sure you want to keep the teacher_id here, but up to you :)

#then perform your join
result = teachers.join(students, teachers.teacher_id == students.student_teacher_id)

After this you should end up with a row with all the teacher data, and the teacher data will have struct column containing the students that are related to the teacher. If you were to serialize or output into a hierarchical format (e.g. JSON), it should show each student as a child of the teacher.