Components of Hadoop
Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. The core components of Hadoop are:
Hadoop Common: The common utilities that support the other Hadoop modules. This includes libraries and utilities needed by other Hadoop components.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data. It stores data in blocks replicated across multiple machines and is designed to be fault-tolerant; a minimal read/write sketch using the HDFS Java API follows this list.
Yet Another Resource Negotiator (YARN): A resource-management platform responsible for managing compute resources in clusters and using them for scheduling users' applications.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets. A job is split into map tasks that process chunks of the input independently, and reduce tasks then aggregate the intermediate results; a condensed version of the classic word-count job is sketched after this list.
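To make the HDFS item concrete, here is a minimal sketch of writing and reading a file through the HDFS Java API (org.apache.hadoop.fs.FileSystem). The /tmp/hello.txt path is a placeholder, and the sketch assumes a cluster whose core-site.xml (with fs.defaultFS) is on the classpath.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and other settings from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS replicates its blocks across DataNodes.
        Path path = new Path("/tmp/hello.txt"); // placeholder path
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back.
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buf = new byte[32];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
    }
}
```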
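The map and reduce phases are easiest to see in the classic word-count job from the Hadoop MapReduce tutorial, condensed here as a sketch: the mapper emits (word, 1) pairs and the reducer sums them per word. The input and output paths come from the command line and would normally point at HDFS directories.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a JAR, the job would be launched with a command along the lines of "hadoop jar wordcount.jar WordCount <input> <output>", with YARN allocating containers for the map and reduce tasks.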
Additionally, the Hadoop ecosystem includes several other tools and components that enhance its functionality:
Apache Pig: A high-level platform for creating programs that run on Hadoop. Pig scripts are used to process and analyze large data sets.
Apache Hive: A data warehouse infrastructure built on top of Hadoop. It provides data summarization, query, and analysis. Hive uses a SQL-like language called HiveQL for querying data; a small HiveQL query issued over JDBC is sketched after this list.
Apache HBase: A distributed, scalable, big data store. It runs on top of HDFS and provides random, real-time read/write access to data; a minimal client sketch follows this list.
Apache Sqoop: A tool designed for efficiently transferring bulk data between Hadoop and structured data stores such as relational databases.
Apache Flume: A distributed service for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.
Apache Oozie: A workflow scheduler system to manage Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work.
Apache Zookeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Apache Spark: Although not originally part of the Hadoop ecosystem, Spark is a fast, general-purpose engine for large-scale data processing that works well with HDFS and can run on YARN in Hadoop clusters; a small word-count sketch follows this list.
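As a rough illustration of the Hive item above, the sketch below issues a HiveQL query through Hive's JDBC driver (org.apache.hive.jdbc.HiveDriver). The HiveServer2 host, credentials, and the page_views table are all hypothetical, and the hive-jdbc driver must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (requires hive-jdbc on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 URL; host, port, database, and credentials are placeholders.
        String url = "jdbc:hive2://hive-server.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // HiveQL looks like SQL; the page_views table is hypothetical.
             ResultSet rs = stmt.executeQuery(
                     "SELECT country, COUNT(*) AS views FROM page_views GROUP BY country")) {
            while (rs.next()) {
                System.out.println(rs.getString("country") + "\t" + rs.getLong("views"));
            }
        }
    }
}
```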
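To illustrate HBase's random, real-time read/write access, here is a minimal client sketch that writes and reads back a single cell. The users table, its info column family, and the row key are hypothetical and would need to exist first (for example, created via the HBase shell).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) { // hypothetical table

            // Random write: one row, one column value.
            Put put = new Put(Bytes.toBytes("row-1001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random read of the same row.
            Get get = new Get(Bytes.toBytes("row-1001"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}
```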
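Finally, as a sketch of how Spark sits alongside HDFS and YARN, the word count below uses Spark's Java API to read from and write to placeholder HDFS paths; it would typically be packaged and submitted to the cluster with spark-submit --master yarn.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Read a text file from HDFS; the path is a placeholder.
            JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input.txt");

            // Split lines into words, pair each with 1, and sum per word.
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            // Write the results back to HDFS (placeholder output directory).
            counts.saveAsTextFile("hdfs:///tmp/output");
        }
    }
}
```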
These components together form a powerful suite of tools for handling large-scale data processing, storage, and analysis in a distributed environment.
For more information, visit: https://hkrtrainings.com/components-of-hadoop