Recommended way to access HBase using Scala

728 Views Asked by Ellen Spertus At 18 May 2018 at 17:20

Now that SpyGlass is no longer being maintained, what is the recommended way to access HBase using Scala/Scalding? A similar question was asked in 2013, but most of the suggested links are either dead or point to defunct projects. The only link that seems useful is to Apache Flink. Is that considered the best option nowadays? Are people still recommending SpyGlass for new projects even though it isn't being maintained? Performance (massively parallel) and testability are priorities.


stefanobaghino On 23 May 2018 at 10:06 BEST ANSWER

It depends on what you mean by "recommended", I guess.

DIY

Eel

If you just want to access data on HBase from a Scala application, you may want to have a look at Eel, which includes libraries to interact with many storage formats and systems in the Big Data landscape and is natively written in Scala.

You'll most likely be interested in the eel-hbase module, which for the past few releases has included an HBaseSource class (as well as an HBaseSink). It's actually so recent that I just noticed the README still says HBase is not supported. There are no explicit examples with Hive, but sources and sinks work in similar ways.

Kite

Another alternative could be Kite, which also has quite an extensive set of examples you can draw inspiration from (including with HBase), but it looks like a less active project than Eel.

Big Data frameworks

You may instead want a framework that helps you, rather than brewing your own solution with libraries. Of course, you'll have to account for some learning curve.

Spark

Spark is a fairly mature project, and the HBase project itself has built a connector for Spark 2.1.1 (Scaladocs here). Here is an introductory talk that may help you.

The general idea is that you could use this custom data source as suggested in this example:

sqlContext
  .read
  .options(Map(HBaseTableCatalog.tableCatalog -> cat, HBaseRelation.HBASE_CONFIGFILE -> conf))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

This gives you access to HBase data through the Spark SQL API. Here is a short extract from the same example:

val df1 = withCatalog(cat1, conf1)
val df2 = withCatalog(cat2, conf2)
val s1 = df1.filter($"col0" <= "row120" && $"col0" > "row090").select("col0", "col2")
val s2 = df2.filter($"col0" <= "row150" && $"col0" > "row100").select("col0", "col5")
val result = s1.join(s2, Seq("col0"))

Performance considerations aside, as you can see, the language can feel pretty natural for data manipulation.
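The cat values passed as the table catalog in the snippets above are JSON strings that map HBase column families and qualifiers to DataFrame columns. The following is only a sketch of what such a catalog might look like, assuming the catalog layout I remember from the connector's examples; the table name, column families and types are hypothetical and not taken from the linked example.

```scala
object CatalogSketch {
  // Hypothetical catalog: "example_table", "cf2" and the column types are
  // illustrative assumptions, not values from the linked example.
  // "col0" is mapped to the HBase row key, "col2" to a regular column.
  val cat: String =
    """{
      |  "table": {"namespace": "default", "name": "example_table"},
      |  "rowkey": "key",
      |  "columns": {
      |    "col0": {"cf": "rowkey", "col": "key", "type": "string"},
      |    "col2": {"cf": "cf2", "col": "col2", "type": "double"}
      |  }
      |}""".stripMargin
}
```

With a mapping like this, col0 is a string row key, which is why it can be compared against values such as "row120" in the filters above.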

Flink

Two answers already dealt with Flink, so I won't add much more, except for a link to an example from the latest stable release at the time of writing (1.4.2) that you may want to look at.

Soheil Pourbafrani On 19 May 2018 at 05:37

Based on my experience writing data to Cassandra using the Flink Cassandra connector, I think the best way is to use Flink's built-in connectors. Since Flink 1.4.3 you can use the HBase Flink connector. See here

Vitaly Tsvetkoff On 22 May 2018 at 10:56

I connect to HBase in Flink using Java. Just create an HBase Connection object in the open method and close it in the close method of a RichFunction (e.g. RichSinkFunction). These methods are called once by each Flink slot.

I think you can do something like this in Scala too.
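A minimal Scala sketch of that approach might look like the following. It assumes the Flink 1.x RichSinkFunction API and the standard HBase client on the classpath; the class name HBaseSink, the (rowKey, value) record shape and the table, family and qualifier parameters are all hypothetical.

```scala
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Put, Table}
import org.apache.hadoop.hbase.util.Bytes

// Hypothetical sink that writes (rowKey, value) pairs into a single column.
class HBaseSink(tableName: String, family: String, qualifier: String)
    extends RichSinkFunction[(String, String)] {

  // The connection is not serializable, so it is created in open()
  // rather than in the constructor.
  @transient private var connection: Connection = _
  @transient private var table: Table = _

  // Called once per parallel instance before any record is processed.
  override def open(parameters: Configuration): Unit = {
    // HBaseConfiguration.create() picks up hbase-site.xml from the classpath.
    connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    table = connection.getTable(TableName.valueOf(tableName))
  }

  // Called for every record: one Put per (rowKey, value) pair.
  override def invoke(value: (String, String)): Unit = {
    val (rowKey, cell) = value
    val put = new Put(Bytes.toBytes(rowKey))
    put.addColumn(Bytes.toBytes(family), Bytes.toBytes(qualifier), Bytes.toBytes(cell))
    table.put(put)
  }

  // Called once per parallel instance on shutdown; releases HBase resources.
  override def close(): Unit = {
    if (table != null) table.close()
    if (connection != null) connection.close()
  }
}
```

You would then attach it to a stream of (String, String) records with something like stream.addSink(new HBaseSink("example_table", "cf", "value")). Writing one Put per record keeps the sketch short; for real throughput you would probably want to batch the writes.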
