Apache HBase on Amazon EMR

Finally! Amazon EMR is going to leverage the power of HBase. I love EMR, but I have always hated its restrictions. This is a step in the right direction, for sure…

AWS has already given you a lot of storage and processing options to choose from, and today we are adding a really important one.

You can now use Apache HBase to store and process extremely large amounts of data (think billions of rows and millions of columns per row) on AWS. HBase offers a number of powerful features including:

  • Strictly consistent reads and writes.
  • High write throughput.
  • Automatic sharding of tables.
  • Efficient storage of sparse data.
  • Low-latency data access via in-memory operations.
  • Direct input and output to Hadoop jobs.
  • Integration with Apache Hive for SQL-like queries over HBase tables, joins, and JDBC support.
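To make "automatic sharding" and "efficient storage of sparse data" a little more concrete, here is a minimal toy sketch in Python. It is not the HBase API: the `SparseTable` class, its method names, and the fixed split key are all hypothetical, and real HBase chooses and moves region boundaries on its own as tables grow. The sketch only shows the two ideas: rows are routed to a region by sorted row key, and only the cells that were actually written take up space.

```python
from bisect import bisect_right


class SparseTable:
    """Toy model of a sparse, key-sharded table (illustrative only, not HBase)."""

    def __init__(self, split_keys):
        # Hypothetical region boundaries; real HBase picks these automatically.
        self.split_keys = sorted(split_keys)
        # One dict per region: row key -> {column name -> value}.
        self.regions = [dict() for _ in range(len(self.split_keys) + 1)]

    def region_for(self, row_key):
        # A row lands in the region whose [start, end) key range contains it.
        return bisect_right(self.split_keys, row_key)

    def put(self, row_key, column, value):
        region = self.regions[self.region_for(row_key)]
        region.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        region = self.regions[self.region_for(row_key)]
        # Columns that were never written are simply absent: no stored nulls.
        return region.get(row_key, {}).get(column)

    def stored_cells(self):
        # Count only the cells that were actually written.
        return sum(len(cols) for region in self.regions for cols in region.values())


table = SparseTable(split_keys=["m"])
table.put("alice", "info:email", "alice@example.com")
table.put("zed", "info:phone", "555-0100")
```

In this sketch the two rows have different columns and end up in different regions, yet only two cells are stored in total. That is the property that makes very wide, mostly empty rows cheap to keep.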

HBase is an Apache project built on top of Apache Hadoop, and it runs within Amazon Elastic MapReduce. You can launch HBase jobs (version 0.92.0) from the command line or the AWS Management Console.