Automatic Spark schema inference for a custom data source

I'm implementing a Spark (1.5.2) SQL RelationProvider for a custom data source (properties files).

Can someone please explain how the automatic schema inference algorithm should be implemented?

Answered by David Griffin

In general, you need to create a StructType that represents your schema. A StructType contains an Array[StructField], where each element of the array corresponds to a column in your schema. Each StructField has a name and a DataType; the DataType can be any supported type, including another StructType for nested schemas.

Creating a schema can be as simple as:

import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("col1", StringType),
  StructField("col2", LongType)
))
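
Because a StructField's DataType can itself be another StructType, a nested schema is built the same way. Here is a small illustrative sketch (the column names are made up for the example):

import org.apache.spark.sql.types._

// "address" is itself a record, so its DataType is another StructType
val nestedSchema = StructType(Array(
  StructField("name", StringType),
  StructField("address", StructType(Array(
    StructField("street", StringType),
    StructField("zip", IntegerType)
  )))
))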

If you want to generate a schema from a complex dataset -- one that includes nested StructTypes -- then you most likely need to create a recursive function. A good example of what such a function looks like can be found in the spark-avro integration library. The function toSqlType takes an Avro schema and converts it into a Spark StructType.
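
As a rough illustration of the same idea (a hypothetical sketch, not the spark-avro code itself), a recursive function that infers a StructType from a sample record already parsed into plain Scala types, with nested records represented as Maps, might look like this:

import org.apache.spark.sql.types._

// Hypothetical sketch: map a sample value (already parsed into plain Scala
// types) to a Spark DataType.
def inferType(value: Any): DataType = value match {
  case _: String        => StringType
  case _: Int | _: Long => LongType
  case _: Double        => DoubleType
  case _: Boolean       => BooleanType
  case m: Map[_, _]     =>
    // a nested record becomes a nested StructType, built recursively
    val fields = m.asInstanceOf[Map[String, Any]].map {
      case (name, v) => StructField(name, inferType(v))
    }
    StructType(fields.toArray)
  case _                => StringType // fall back to strings for anything else
}

// Example: infer the schema from one sample record
val sample = Map("name" -> "alice", "age" -> 42, "address" -> Map("zip" -> 10001))
val schemaFromSample = inferType(sample).asInstanceOf[StructType]

The resulting StructType is what the schema method of your BaseRelation would return.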