This is a simple "how to" question: we can bring data into the Spark environment through com.databricks.spark.csv. I know how to create an HBase table through Spark and how to write data to HBase tables manually. But is it even possible to load text/CSV/JSON files directly into HBase through Spark? I cannot find anybody talking about it, so I am just checking. If it is possible, please point me to a good website that explains the Scala code in detail.

Thank you,

1 Answer

There are multiple ways you can do that.

  1. Spark HBase connector (SHC):

https://github.com/hortonworks-spark/shc

You can see a lot of examples at that link.
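As a rough sketch of how SHC is typically used (this assumes the shc-core dependency is on the classpath; the table name "payments", the column names, and the CSV path are illustrative, not from the question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

val spark = SparkSession.builder().appName("CsvToHBase").getOrCreate()

// Catalog mapping DataFrame columns to an HBase table and column family.
val catalog =
  s"""{
     |"table":{"namespace":"default", "name":"payments"},
     |"rowkey":"key",
     |"columns":{
       |"key":{"cf":"rowkey", "col":"key", "type":"string"},
       |"vendor":{"cf":"cf", "col":"VendorName", "type":"string"},
       |"amount":{"cf":"cf", "col":"Amount", "type":"string"}
     |}
     |}""".stripMargin

// Read the CSV (with a header row) and write it straight to HBase;
// newTable asks the connector to create the table with 5 regions.
val df = spark.read.option("header", "true").csv("/path/to/payments.csv")
df.write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog,
               HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
```

So with SHC you never write Put objects yourself; the catalog JSON does the column mapping.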

  2. You can also use Spark core to load the data into HBase via HBaseConfiguration and saveAsNewAPIHadoopDataset.

Code Example:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.client.Put
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable
  import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
  import org.apache.hadoop.hbase.util.Bytes

  val fileRDD = sc.textFile(args(0), 2)
  val transformedRDD = fileRDD.map { line => convertToKeyValuePairs(line) }

  val conf = HBaseConfiguration.create()
  conf.set(TableOutputFormat.OUTPUT_TABLE, "tableName")
  conf.set("hbase.zookeeper.quorum", "localhost:2181")
  conf.set("hbase.master", "localhost:60000")
  conf.set("fs.default.name", "hdfs://localhost:8020")
  conf.set("hbase.rootdir", "/hbase")

  val jobConf = new Configuration(conf)
  // The RDD contains (ImmutableBytesWritable, Put) pairs, so the job's output
  // key/value classes must match them, not Text/LongWritable.
  jobConf.set("mapreduce.job.output.key.class", classOf[ImmutableBytesWritable].getName)
  jobConf.set("mapreduce.job.output.value.class", classOf[Put].getName)
  jobConf.set("mapreduce.job.outputformat.class", classOf[TableOutputFormat[ImmutableBytesWritable]].getName)

  transformedRDD.saveAsNewAPIHadoopDataset(jobConf)



  def convertToKeyValuePairs(line: String): (ImmutableBytesWritable, Put) = {
    // The pipe must be escaped: String.split takes a regex, so an unescaped
    // "|" splits the line into single characters. Split once and reuse.
    val fields = line.split("\\|")
    val cfDataBytes = Bytes.toBytes("cf")
    val rowkey = Bytes.toBytes(fields(1))
    val put = new Put(rowkey)

    put.addColumn(cfDataBytes, Bytes.toBytes("PaymentDate"), Bytes.toBytes(fields(0)))
    put.addColumn(cfDataBytes, Bytes.toBytes("PaymentNumber"), Bytes.toBytes(fields(1)))
    put.addColumn(cfDataBytes, Bytes.toBytes("VendorName"), Bytes.toBytes(fields(2)))
    put.addColumn(cfDataBytes, Bytes.toBytes("Category"), Bytes.toBytes(fields(3)))
    put.addColumn(cfDataBytes, Bytes.toBytes("Amount"), Bytes.toBytes(fields(4)))
    (new ImmutableBytesWritable(rowkey), put)
  }
  3. You can also use the nerdammer spark-hbase-connector:

https://github.com/nerdammer/spark-hbase-connector
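A minimal sketch of that connector's RDD-based API (this assumes the spark-hbase-connector dependency is on the classpath and an existing SparkContext sc; the table name "payments", column names, and file path are illustrative):

```scala
import it.nerdammer.spark.hbase._

// Parse each pipe-delimited line once; the payment number becomes the row key.
val rows = sc.textFile("/path/to/payments.txt").map { line =>
  val f = line.split("\\|")
  // (rowkey, PaymentDate, VendorName, Category, Amount)
  (f(1), f(0), f(2), f(3), f(4))
}

// The first tuple element is used as the row key; the remaining elements
// are written to the listed columns inside the given column family.
rows.toHBaseTable("payments")
  .toColumns("PaymentDate", "VendorName", "Category", "Amount")
  .inColumnFamily("cf")
  .save()
```

This is the most compact option of the three: no Put objects and no Hadoop job configuration, at the cost of less control over the write path.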
