This is a simple "how to" question: we can bring data into the Spark environment through com.databricks.spark.csv. I know how to create an HBase table through Spark and how to write data to HBase tables manually. But is it even possible to load text/CSV/JSON files directly into HBase through Spark? I cannot find anybody talking about it, so I am just checking. If it is possible, please point me to a good website that explains the Scala code in detail.

Thank you,

1 Answer

There are multiple ways you can do that.

  1. Spark HBase connector (SHC):

https://github.com/hortonworks-spark/shc

You can see a lot of examples at that link.
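As a rough sketch of how SHC is typically used (this assumes the shc-core dependency is on the classpath; the table name "payments", the column names, and the CSV path are illustrative, not from the question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

val spark = SparkSession.builder().appName("CsvToHBase").getOrCreate()

// Catalog mapping DataFrame columns to an HBase table and column family.
val catalog =
  s"""{
     |"table":{"namespace":"default", "name":"payments"},
     |"rowkey":"key",
     |"columns":{
       |"key":{"cf":"rowkey", "col":"key", "type":"string"},
       |"vendor":{"cf":"cf", "col":"VendorName", "type":"string"},
       |"amount":{"cf":"cf", "col":"Amount", "type":"string"}
     |}
     |}""".stripMargin

// Read the CSV (with a header row) and write it straight to HBase;
// newTable asks the connector to create the table with 5 regions.
val df = spark.read.option("header", "true").csv("/path/to/payments.csv")
df.write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog,
               HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
```

So with SHC you never write Put objects yourself; the catalog JSON does the column mapping.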

  2. You can also use Spark core to load the data into HBase via HBaseConfiguration and saveAsNewAPIHadoopDataset.

Code Example:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.client.Put
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable
  import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
  import org.apache.hadoop.hbase.util.Bytes

  val fileRDD = sc.textFile(args(0), 2)
  val transformedRDD = fileRDD.map { line => convertToKeyValuePairs(line) }

  val conf = HBaseConfiguration.create()
  conf.set(TableOutputFormat.OUTPUT_TABLE, "tableName")
  conf.set("hbase.zookeeper.quorum", "localhost:2181")
  conf.set("hbase.master", "localhost:60000")
  conf.set("fs.default.name", "hdfs://localhost:8020")
  conf.set("hbase.rootdir", "/hbase")

  val jobConf = new Configuration(conf)
  // The RDD contains (ImmutableBytesWritable, Put) pairs, so the job's output
  // key/value classes must match them, not Text/LongWritable.
  jobConf.set("mapreduce.job.output.key.class", classOf[ImmutableBytesWritable].getName)
  jobConf.set("mapreduce.job.output.value.class", classOf[Put].getName)
  jobConf.set("mapreduce.job.outputformat.class", classOf[TableOutputFormat[ImmutableBytesWritable]].getName)

  transformedRDD.saveAsNewAPIHadoopDataset(jobConf)



  def convertToKeyValuePairs(line: String): (ImmutableBytesWritable, Put) = {
    // The pipe must be escaped: String.split takes a regex, so an unescaped
    // "|" splits the line into single characters. Split once and reuse.
    val fields = line.split("\\|")
    val cfDataBytes = Bytes.toBytes("cf")
    val rowkey = Bytes.toBytes(fields(1))
    val put = new Put(rowkey)

    put.addColumn(cfDataBytes, Bytes.toBytes("PaymentDate"), Bytes.toBytes(fields(0)))
    put.addColumn(cfDataBytes, Bytes.toBytes("PaymentNumber"), Bytes.toBytes(fields(1)))
    put.addColumn(cfDataBytes, Bytes.toBytes("VendorName"), Bytes.toBytes(fields(2)))
    put.addColumn(cfDataBytes, Bytes.toBytes("Category"), Bytes.toBytes(fields(3)))
    put.addColumn(cfDataBytes, Bytes.toBytes("Amount"), Bytes.toBytes(fields(4)))
    (new ImmutableBytesWritable(rowkey), put)
  }
  3. You can also use the nerdammer spark-hbase-connector:

https://github.com/nerdammer/spark-hbase-connector
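A minimal sketch of that connector's RDD-based API (this assumes the spark-hbase-connector dependency is on the classpath and an existing SparkContext sc; the table name "payments", column names, and file path are illustrative):

```scala
import it.nerdammer.spark.hbase._

// Parse each pipe-delimited line once; the payment number becomes the row key.
val rows = sc.textFile("/path/to/payments.txt").map { line =>
  val f = line.split("\\|")
  // (rowkey, PaymentDate, VendorName, Category, Amount)
  (f(1), f(0), f(2), f(3), f(4))
}

// The first tuple element is used as the row key; the remaining elements
// are written to the listed columns inside the given column family.
rows.toHBaseTable("payments")
  .toColumns("PaymentDate", "VendorName", "Category", "Amount")
  .inColumnFamily("cf")
  .save()
```

This is the most compact option of the three: no Put objects and no Hadoop job configuration, at the cost of less control over the write path.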
