Difference between Alluxio (Tachyon) and Tungsten in Spark?


Tachyon is a distributed, in-memory storage system developed separately from Spark. It can be used as off-heap persistent storage for a Spark application.

Tungsten is a new Spark SQL component that makes Spark operations more efficient by working directly at the byte level. Since Tungsten no longer depends on Java objects, it can use either on-heap (in the JVM) or off-heap storage.

In off-heap mode, both reduce garbage collection overhead, since the data is not stored as Java objects.

So could I simply say that Tachyon benefits general RDDs, whereas Spark SQL benefits from Tungsten?

Suppose we have the following code:

import org.apache.spark.storage.StorageLevel

val df = spark.range(10)
val rdd = df.rdd

df.persist(StorageLevel.OFF_HEAP) // in Tungsten format (bytes)?
df.show()

rdd.persist(StorageLevel.OFF_HEAP) // in Tachyon storage?
rdd.count()

There are 3 best solutions below

1
On BEST ANSWER

In short, both of your statements are incorrect:

  • Since Spark 1.6, OFF_HEAP storage no longer uses Alluxio; it uses Spark's internal off-heap store instead. See, for example, SPARK-16025.
  • All storage modes in Spark SQL store data in internal binary format, which can be further configured using spark.sql.inMemoryColumnarStorage.* properties.
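
As a rough sketch of where these settings live, assuming Spark 2.x or later running with a local master. The application name, the 512 MB size, and the batch size are placeholder values picked for illustration, not recommendations:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("offheap-sketch")   // hypothetical name
  .master("local[*]")
  // Spark's own off-heap memory is disabled by default; enable and size it.
  .config("spark.memory.offHeap.enabled", true)
  .config("spark.memory.offHeap.size", 512L * 1024 * 1024)   // bytes, placeholder
  // Tuning knobs for the Spark SQL in-memory columnar cache.
  .config("spark.sql.inMemoryColumnarStorage.compressed", true)
  .config("spark.sql.inMemoryColumnarStorage.batchSize", 10000L)
  .getOrCreate()

val df = spark.range(10)
df.persist(StorageLevel.OFF_HEAP)   // cached in Spark's internal off-heap store
df.count()                          // an action is needed to materialize the cache
```

Nothing in this sketch refers to Alluxio: the cached blocks stay inside the Spark executors.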
0
On

Spark interacts with Alluxio and Tungsten for data at different stages.

For Spark, Alluxio is an external distributed storage system, like HDFS. Spark interacts with Alluxio through the filesystem interface (see the following example). It is essentially the same interface through which Spark accesses HDFS or the local filesystem, except that the storage service is provided by Alluxio, which may use memory as its storage medium.

// save data as a text file to Alluxio
rdd.saveAsTextFile("alluxio://localhost:19998/rdd1")
// read data as a text file from Alluxio
val textRdd = sc.textFile("alluxio://localhost:19998/rdd1")
// save data as an object file to Alluxio
rdd.saveAsObjectFile("alluxio://localhost:19998/rdd2")
// read data as an object file from Alluxio (element type String is assumed here)
val objectRdd = sc.objectFile[String]("alluxio://localhost:19998/rdd2")

Spark interacts with Alluxio only when reading input data files and writing output files.

Tungsten is Spark's internal data representation, designed for memory and CPU efficiency. Essentially, the default memory layout of JVM objects is considered inefficient for Spark applications because of its memory footprint and GC overhead (see the blog post on Project Tungsten from Databricks). Tungsten lets Spark process data directly in a binary format without requiring the JVM to construct objects.
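
To make the idea of working on bytes rather than JVM objects concrete, here is a minimal sketch that stores fixed-width values in a direct (off-heap) `java.nio.ByteBuffer` and reads them back by offset. This is only an illustration of the principle; it is not Tungsten's actual implementation or row format, and `OffHeapSketch`, `packLongs`, and `sum` are hypothetical names.

```scala
import java.nio.ByteBuffer

object OffHeapSketch {
  // Write each value as a raw 8-byte field into a direct (off-heap) buffer.
  def packLongs(values: Seq[Long]): ByteBuffer = {
    val buf = ByteBuffer.allocateDirect(values.length * java.lang.Long.BYTES)
    values.foreach(v => buf.putLong(v))
    buf.flip()   // limit = bytes written, position = 0
    buf
  }

  // Sum the values by reading fixed-width fields at computed byte offsets.
  def sum(buf: ByteBuffer): Long = {
    var total = 0L
    var offset = 0
    while (offset < buf.limit()) {
      total += buf.getLong(offset)   // absolute read, no per-record object
      offset += java.lang.Long.BYTES
    }
    total
  }

  def main(args: Array[String]): Unit = {
    val buf = packLongs(0L until 10L)
    println(sum(buf))   // prints 45
  }
}
```

The buffer's contents live outside the garbage-collected heap, so the collector has one small buffer object to track instead of one object per record.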

As a result, a Spark application may read input files from Alluxio: Alluxio sends Spark the bytes without interpreting them, and Spark then parses the data and represents it internally according to the format Tungsten defines.

1
On

Alluxio offers memory-speed read and write operations. Spark can read data from Alluxio, an in-memory storage system, which avoids input/output (I/O) against hard disks (that is, against any file system, such as HDFS, that sits on hard disk).

Tungsten is a backend optimization engine of Spark. Code written with the DataFrame/Dataset APIs or in Spark SQL is optimized by the Catalyst optimizer into logical and optimized logical plans. Once this stage is over, the Tungsten engine takes over and generates code on the fly (known as 'codegen') that is highly optimized for execution in a distributed environment.
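
One way to look at both stages yourself is to print the query plans and the generated code. This is a small sketch assuming Spark 2.x or later with a local master; the application name and the query are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug._   // adds debugCodegen() to Datasets

val spark = SparkSession.builder()
  .appName("codegen-sketch")   // hypothetical name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// An arbitrary small query, used only to have something to inspect.
val df = spark.range(10).filter($"id" % 2 === 0).selectExpr("id * 2 AS doubled")

df.explain(true)    // logical plans from Catalyst, followed by the physical plan
df.debugCodegen()   // Java source generated for the whole-stage codegen subtrees
```

The exact text of the plans and of the generated code varies between Spark versions.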

To me, the two serve different purposes, and I prefer to treat them separately.

Hope it helps to some extent.