Hadoop has gained familiarity over the years. The Hadoop Ecosystem stack . Apache Phoenix works as an SQL skin for HBase that requires basic HBase understanding and, to some extent, an understanding of its native calls behavior. Knowledge about other Hadoop ecosystem components, along with HBase, will be an added advantage in best understanding the big data landscape and in utilizing Phoenix and its best available features. In this chapter we provide an overview of these components and their place in the ecosystem.
HDFS (Hadoop Distributed File System) is a distributed file-system that provides high throughput access to data. HDFS stores data in the form of blocks. Each block is 64 MB in size for older versions and 128 MB for newer Hadoop versions. A file larger than a block in size will automatically split into multiple blocks and be stored, replicated on the nodes, by a default replication factor of three to each block; this means each block will be available on three nodes to ensure high availability and fault tolerance. The replication number is configurable and can be changed in the HDFS configuration file.
Hadoop MapReduce is a software framework with which we can easily write applications to process large amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. MapReduce is a programming technique containing two types of algorithm, namely Map and Reduce.
HBase is a NoSQL column family database that runs on top of Hadoop HDFS. HBase was developed to handle large storage tables which have billions of rows and millions of columns with fault tolerance capability and horizontal scalability. The HBase concept was inspired by Google’s Big Table. Hadoop is mainly meant for batch processing, in which data will be accessed only in a sequential manner, where HBase is used for quick random access of huge data.
Hive is an interactive, easy, SQL-like scripting language used to query data stored in HDFS. Although we can use Java to work with HDFS, many data programmers are most comfortable using SQL. Hive was initially created by Facebook for its own infrastructure processing, later they made it open source and donated it to the Apache Software Foundation. The advantage of Hive is that it runs MapReduce jobs behind the scenes, but the programmer does not have to worry about how this is happening. The programmer simply writes HQL (Hive Query Language), and results will be displayed on the console.
Hive is a part of the Hadoop ecosystem and provides an SQL-like interactive interface to Hadoop’s underlying HDFS. You can write ad-hoc queries and analyze large datasets Stored in HDFS. Programmers can plug in their custom mappers and reducers when it is inconvenient or inefficient to write this logic in Hive Query Language.
Apache Hadoop YARN is a cluster management technology and a sub project of Apache Hadoop in Apache Software Foundation (ASF) like other HDFS, Hadoop Common and MapReduce. YARN stands for Yet Another Resource Negotiator. YARN is a general purpose, distributed, application management framework that supersedes the classic MapReduce framework for processing data in Hadoop clusters.
What is Hadoop Ecosystem