Apache Hadoop is a powerful, scalable, and versatile open-source framework used for distributed storage and processing of large data sets. It enables organizations to manage and analyze big data efficiently. If you’re looking to set up Hadoop on an Ubuntu machine, you’ve come to the right place. This step-by-step guide will walk you through the installation process.
Prerequisites
Before starting, make sure your system meets the following requirements:
- Operating System: Ubuntu 18.04 or higher
- Java: OpenJDK 8 or higher
- Memory: Minimum 4GB of RAM
- Disk Space: Minimum 10GB of free disk space
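You can check the memory and disk prerequisites from a terminal before you begin:

```shell
# Total RAM reported by the kernel (in kB)
grep MemTotal /proc/meminfo

# Free space on the root filesystem, human-readable
df -h /
```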
Step 1: Update and Upgrade the System
First, ensure that your system is up-to-date. Run the following commands to update and upgrade the software packages:
sudo apt update
sudo apt upgrade -y
Step 2: Install Java
Hadoop requires Java to run. Let’s install OpenJDK 8 on the system:
sudo apt install openjdk-8-jdk -y
To verify the installation, check the Java version:
java -version
The output should display the installed Java version, similar to:
openjdk version "1.8.0_292"
Step 3: Download Hadoop
Next, download a stable Hadoop release from the Apache Hadoop official website (this guide uses 3.3.6). Alternatively, use wget to download it directly from the terminal.
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
Once downloaded, extract the tarball:
tar -xvzf hadoop-3.3.6.tar.gz
Move the extracted directory to /usr/local/hadoop:
sudo mv hadoop-3.3.6 /usr/local/hadoop
Step 4: Configure Hadoop Environment Variables
Set up the environment variables for Hadoop. Open the .bashrc file for editing:
nano ~/.bashrc
Add the following lines at the end of the file:
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Save and close the file (Ctrl + X, then Y to confirm).
To apply the changes, run:
source ~/.bashrc
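One more piece of groundwork: the start-dfs.sh and start-yarn.sh scripts used later connect to localhost over SSH, so a single-node setup also needs an SSH server and passwordless key login. A typical sketch (skip the key generation if you already have a key pair):

```shell
# Install the OpenSSH server
sudo apt install openssh-server -y

# Generate an RSA key pair with an empty passphrase
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

# Authorize the key for SSH logins to localhost
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

# Confirm that logging in to localhost works without a password prompt
ssh -o StrictHostKeyChecking=no localhost exit
```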
Step 5: Configure Hadoop XML Files
Hadoop requires configuration of several XML files to set up its environment. These files are located in the etc/hadoop directory inside your Hadoop installation folder.
Navigate to the configuration folder:
cd $HADOOP_HOME/etc/hadoop
Edit the following files:
hadoop-env.sh: Set the Java home directory.
Change the line export JAVA_HOME=${JAVA_HOME} to:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
core-site.xml: Define the default Hadoop file system and NameNode address.
Add the following configuration between the <configuration> tags:
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
hdfs-site.xml: Configure the replication factor and the NameNode and DataNode storage directories.
Add the following:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///usr/local/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///usr/local/hadoop/hdfs/datanode</value>
</property>
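The directories referenced above must exist and be writable by the user that will run the Hadoop daemons. Assuming your regular login user will run them, you can create the directories like this:

```shell
# Create the NameNode and DataNode storage directories
sudo mkdir -p /usr/local/hadoop/hdfs/namenode /usr/local/hadoop/hdfs/datanode

# Hand ownership of the Hadoop tree to the current user so the
# daemons can write to it without root
sudo chown -R "$USER":"$USER" /usr/local/hadoop
```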
mapred-site.xml: Set the MapReduce framework.
nano mapred-site.xml
(In Hadoop 3.x, mapred-site.xml already exists; the mapred-site.xml.template copy step applies only to older 2.x releases.)
Add the following:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
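On some Hadoop 3.x setups, MapReduce jobs submitted to YARN also need to be told where Hadoop lives, or they fail to find the MapReduce application master. The official single-node setup guide adds properties along these lines to mapred-site.xml (verify the exact values against the documentation for your release):

```xml
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
```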
yarn-site.xml: Configure the NodeManager's auxiliary shuffle service for MapReduce.
Add the following:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
Step 6: Format the Hadoop NameNode
Format the Hadoop NameNode using the command:
hdfs namenode -format
You should see output indicating that the formatting was successful.
Step 7: Start Hadoop Services
Start the Hadoop Distributed File System (HDFS) and YARN services:
start-dfs.sh
start-yarn.sh
To verify that the Hadoop daemons are running, use the jps command:
jps
The output should list the running Java processes, including:
NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
Jps
Step 8: Access the Hadoop Web Interfaces
Hadoop provides web interfaces to monitor the cluster:
- NameNode (HDFS) web UI: http://localhost:9870
- YARN ResourceManager web UI: http://localhost:8088
Step 9: Run a Test
To ensure everything is working, try running a sample MapReduce job included with Hadoop:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 16 1000
The job should run successfully and print an estimated value of Pi.
Conclusion
You have successfully installed Hadoop on your Ubuntu machine! From here, you can begin building your distributed computing environment and explore the various functionalities that Hadoop offers. Whether you’re working on big data analytics, machine learning, or building a robust data lake, Hadoop is a great choice.
Feel free to experiment with different configurations, set up a multi-node cluster, and explore the world of big data with Hadoop! If you have any questions, let me know in the comments below.
Happy Hadooping!