HBase Series 4: Common APIs and Reading/Writing HBase with Spark

2017-10-14

1. Java API

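The Java API covers every HBase operation. This section gives a brief introduction to its basic operations; advanced features such as batch processing and filters are left for further reading. CRUD operations go through the methods of the HTable class, while creating and managing tables goes through the HBaseAdmin class. In HBase 1.0 and later, constructing HTable and HBaseAdmin directly is deprecated in favor of ConnectionFactory.createConnection(conf) followed by Connection.getTable and Connection.getAdmin; the snippets in this section keep the older style.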

1.1 Get

Get supports both single and batched requests. A single Get request looks like this:

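// Imports needed by this snippet
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "testtable");
Get get = new Get(Bytes.toBytes("row1"));
get.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));
Result result = table.get(get);
byte[] val = result.getValue(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));
table.close();
System.out.println("Value: " + Bytes.toString(val));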

1.2 Put

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "testtable");
Put put = new Put(Bytes.toBytes("row1"));
put.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"), Bytes.toBytes("val1")); // addColumn replaces the deprecated add (HBase 1.0+)
table.put(put);
table.close();

1.3 Delete

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "testtable");
Delete delete = new Delete(Bytes.toBytes("row1"));
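delete.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"), 1); // delete only the version with timestamp 1 of this column in this row (addColumn replaces the deprecated deleteColumn in HBase 1.0+)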
table.delete(delete);
table.close();

1.4 Scan

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "testtable");
Scan scan = new Scan();
// Chain the setters to restrict the scan to one column and one row-key range
scan.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("col-5"))
.setStartRow(Bytes.toBytes("row-10"))
.setStopRow(Bytes.toBytes("row-20"));
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
	System.out.println(res);
}
scanner.close(); // release the server-side scanner resources
table.close();
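
HBase compares row keys byte by byte, so the range in the scan above is lexicographic rather than numeric, with the start row inclusive and the stop row exclusive. The sketch below mimics that ordering in plain Java with no HBase dependency; the class name RowKeyOrder, its helper methods, and the sample keys are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RowKeyOrder {
    // Unsigned byte-by-byte comparison; a shorter key that is a prefix sorts first.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Returns the keys that fall in [start, stop): start inclusive, stop exclusive.
    static List<String> inRange(List<String> keys, String start, String stop) {
        byte[] lo = start.getBytes(StandardCharsets.UTF_8);
        byte[] hi = stop.getBytes(StandardCharsets.UTF_8);
        List<String> hits = new ArrayList<>();
        for (String key : keys) {
            byte[] k = key.getBytes(StandardCharsets.UTF_8);
            if (compare(k, lo) >= 0 && compare(k, hi) < 0) {
                hits.add(key);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
            "row-1", "row-10", "row-100", "row-15", "row-2", "row-20", "row-3");
        // prints [row-10, row-100, row-15, row-2]
        System.out.println(inRange(keys, "row-10", "row-20"));
    }
}
```

Note that "row-100" and "row-2" both land inside the range while "row-20" does not. If numeric ordering matters, a common workaround is to zero-pad the numeric part of the row key.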

1.5 HBaseAdmin

HBaseAdmin can create tables and column families, check whether a table exists, alter table and column-family schemas, and delete tables.

Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
// The table name must be a string literal wrapped in a TableName
HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("testtable"));
HColumnDescriptor coldef = new HColumnDescriptor(Bytes.toBytes("colfam1"));
desc.addFamily(coldef);
admin.createTable(desc);
boolean avail = admin.isTableAvailable(TableName.valueOf("testtable"));
System.out.println("Table available: " + avail);
admin.close();

2. Basic Shell Commands

Run ./bin/hbase shell from the HBase installation directory to enter the HBase shell, where you can carry out most HBase operations interactively.

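status  # show cluster status
version  # show the HBase version
create 't1', {NAME => 'f1', VERSIONS => 5}  # create a table
alter 't1', NAME => 'f1', VERSIONS => 5  # alter a table
describe 't1'  # show the table's schema and whether it is enabled
disable 't1'  # take the table offline
enable 't1'  # bring the table online
drop 't1'  # delete the table
exists 't1'  # check whether the table exists
list  # list all table names
count 't1'  # count the rows in the table
delete 't1', 'r1', 'c1', ts1  # delete a specific cell
get 't1', 'r1', {COLUMN => ['c1', 'c2', 'c3']}  # read several columns of one row
put 't1', 'r1', 'c1', 'value', ts1  # write a value
scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}  # scan the table with conditions
truncate 't1'  # remove all data from the table

These are the most frequently used shell commands. There are also tool commands (compact, flush, and so on) and replication commands, which are worth exploring further.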

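3. Reading and Writing HBase with Spark

There are two ways to work with HBase from Spark. The first is to call the HBase Java API inside Spark code, in which case the program has to follow Spark's distributed execution model (for example, HBase connections are opened on the executors rather than on the driver).
The second relies on HBase's support for the Hadoop MapReduce framework: Spark's new-API Hadoop interfaces (newAPIHadoopRDD and saveAsNewAPIHadoopDataset) read and write HBase through TableInputFormat and TableOutputFormat. The examples in this section use the second approach.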

3.1 Writing to HBase from Spark

import org.apache.hadoop.hbase.client.{Put, Result}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark._

object HBaseWrite {
    def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("HBaseTest")
        sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        sparkConf.registerKryoClasses(Array(classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],classOf[org.apache.hadoop.hbase.client.Result]))

        val sc = new SparkContext(sparkConf)
        val tablename = "test1"
        sc.hadoopConfiguration.set("hbase.zookeeper.quorum", "hadoop001,hadoop002,hadoop003")
        sc.hadoopConfiguration.set("hbase.zookeeper.property.clientPort", "2181")
        sc.hadoopConfiguration.set(TableOutputFormat.OUTPUT_TABLE, tablename)

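        // Job.getInstance replaces the deprecated Job constructor
        val job = Job.getInstance(sc.hadoopConfiguration)
        job.setOutputKeyClass(classOf[ImmutableBytesWritable])
        // The values written to HBase are Put objects
        job.setOutputValueClass(classOf[Put])
        job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])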

        val indataRDD = sc.makeRDD(Array("1,jack,15", "2,Lily,16", "3,mike,16"))
        val rdd = indataRDD.map(_.split(',')).map { arr => {
            val put = new Put(Bytes.toBytes(arr(0)))
            put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"), Bytes.toBytes(arr(1)))
            put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("age"), Bytes.toBytes(arr(2).toInt))
            (new ImmutableBytesWritable, put)
        }
        }

        rdd.saveAsNewAPIHadoopDataset(job.getConfiguration)
        sc.stop()
    }
}
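
The write job above stores age with Bytes.toBytes(arr(2).toInt), which yields a 4-byte big-endian integer rather than the text "15". That is why the read side has to decode the cell with Bytes.toInt instead of Bytes.toString. The sketch below reproduces the same byte layout with java.nio.ByteBuffer (big-endian by default) so it runs without HBase; the class name IntCellEncoding and its helpers are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class IntCellEncoding {
    // 4-byte big-endian layout, matching what Bytes.toBytes(int) produces.
    static byte[] encodeInt(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Inverse of encodeInt; expects exactly the 4-byte layout above.
    static int decodeInt(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }

    public static void main(String[] args) {
        byte[] asInt = encodeInt(15);
        byte[] asText = "15".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(asInt));  // [0, 0, 0, 15]
        System.out.println(Arrays.toString(asText)); // [49, 53]
        System.out.println(decodeInt(asInt));        // 15
    }
}
```

The two encodings are not interchangeable: a cell written as text is only two bytes long here and cannot be decoded as a 4-byte integer, so writer and reader need to agree on one representation per column.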

3.2 Reading from HBase with Spark

import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor, TableName}
import org.apache.spark._

object HBaseRead {
    def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("HBaseTest")
        sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        sparkConf.registerKryoClasses(Array(classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],classOf[org.apache.hadoop.hbase.client.Result]))
        val sc = new SparkContext(sparkConf)

        val tablename = "test1"
        val conf = HBaseConfiguration.create()
        conf.set("hbase.zookeeper.quorum", "hadoop001,hadoop002,hadoop003")
        conf.set("hbase.zookeeper.property.clientPort", "2181")
        conf.set(TableInputFormat.INPUT_TABLE, tablename)

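        // Create the table (with column family cf1) if it does not exist yet
        val admin = new HBaseAdmin(conf)
        if (!admin.tableExists(tablename)) {
            val tableDesc = new HTableDescriptor(TableName.valueOf(tablename))
            tableDesc.addFamily(new org.apache.hadoop.hbase.HColumnDescriptor("cf1"))
            admin.createTable(tableDesc)
        }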

        // Read the table and expose it as an RDD of (row key, Result) pairs
        val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
            classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
            classOf[org.apache.hadoop.hbase.client.Result])

        val count = hBaseRDD.count()
        println(count)
        hBaseRDD.collect.foreach { case (_, result) => {
            // Get the row key
            val key = Bytes.toString(result.getRow)
            // Get the cell values by column family and qualifier
            val name = Bytes.toString(result.getValue("cf1".getBytes, "name".getBytes))
            val age = Bytes.toInt(result.getValue("cf1".getBytes, "age".getBytes))
            println("Row key:" + key + " Name:" + name + " Age:" + age)
        }
        }
        admin.close()
        sc.stop()
    }
}

3.3 Scanning HBase from Spark

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.protobuf.ProtobufUtil
import org.apache.hadoop.hbase.protobuf.generated.ClientProtos
import org.apache.hadoop.hbase.util.{Base64, Bytes}
import org.apache.spark.{SparkConf, SparkContext}

object HBaseScan {
    def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("HBaseTest")
        sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        sparkConf.registerKryoClasses(Array(classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result]))
        val sc = new SparkContext(sparkConf)

        val tablename = "test1"
        val conf = HBaseConfiguration.create()
        val scan = new Scan()
        scan.setStartRow(Bytes.toBytes(args(0)))
        scan.setStopRow(Bytes.toBytes(args(1)))
        scan.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"))

        def convertScanToString(scan: Scan) = {
            val proto: ClientProtos.Scan = ProtobufUtil.toScan(scan)
            Base64.encodeBytes(proto.toByteArray)
        }

        conf.set("hbase.zookeeper.quorum", "hadoop001,hadoop002,hadoop003")
        conf.set("hbase.zookeeper.property.clientPort", "2181")

        /** TableInputFormat defines several parameters for restricting what is read; see its static constants for the full list. */
        conf.set(org.apache.hadoop.hbase.mapreduce.TableInputFormat.SCAN,
            convertScanToString(scan))

        conf.set(org.apache.hadoop.hbase.mapreduce.TableInputFormat.INPUT_TABLE, tablename)
        val rdd = sc.newAPIHadoopRDD(conf, classOf[org.apache.hadoop.hbase.mapreduce.TableInputFormat],
            classOf[ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])

        rdd.collect.foreach { case (_, result) => {
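            // Get the row key
            val key = Bytes.toString(result.getRow)
            // Get the cell value by column family and qualifier
            val name = Bytes.toString(result.getValue("cf1".getBytes, "name".getBytes))
            // cf1:age is not available here because the scan only selects cf1:name
            // val age = Bytes.toInt(result.getValue("cf1".getBytes, "age".getBytes))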
            println("Row key:" + key + " Name:" + name )
        }
        }
        sc.stop()
    }
}
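
Hadoop Configuration values are strings, which is why convertScanToString above serializes the Scan to protobuf bytes and then Base64-encodes them before storing the result under TableInputFormat.SCAN; the input format decodes the string again when it runs. The sketch below shows only that bytes-to-string round trip. It uses java.util.Base64 as a stand-in for HBase's own Base64 utility, and the byte array is a made-up placeholder rather than a real serialized Scan; the class name ScanStringRoundTrip is hypothetical.

```java
import java.util.Arrays;
import java.util.Base64;

class ScanStringRoundTrip {
    // Turn arbitrary bytes into a string that is safe to store in a Configuration.
    static String encode(byte[] serialized) {
        return Base64.getEncoder().encodeToString(serialized);
    }

    // Recover the original bytes from the stored string.
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Placeholder bytes standing in for a protobuf-serialized Scan (assumption).
        byte[] fakeScanBytes = {0x0a, 0x03, (byte) 0xff, 0x00, 0x7f};
        String stored = encode(fakeScanBytes);
        byte[] restored = decode(stored);
        System.out.println(stored);
        System.out.println(Arrays.equals(fakeScanBytes, restored)); // true
    }
}
```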