How to Install Hadoop on Ubuntu

Published October 6, 2024

Installing Hadoop on Ubuntu: A Step-by-Step Guide
Apache Hadoop is a powerful, scalable, and versatile open-source framework used for distributed storage and processing of large data sets. It enables organizations to manage and analyze big data efficiently. If you’re looking to set up Hadoop on an Ubuntu machine, you’ve come to the right place. This step-by-step guide will walk you through the installation process.

Prerequisites

Before starting, make sure your system meets the following requirements:

  • Operating System: Ubuntu 18.04 or higher
  • Java: OpenJDK 8 or higher
  • Memory: Minimum 4GB of RAM
  • Disk Space: Minimum 10GB of free disk space

Step 1: Update and Upgrade the System

First, ensure that your system is up-to-date. Run the following commands to update and upgrade the software packages:

sudo apt update
sudo apt upgrade -y

Step 2: Install Java

Hadoop requires Java to run. Let’s install OpenJDK 8 on the system:

sudo apt install openjdk-8-jdk -y

To verify the installation, check the Java version:

java -version

The output should display the Java version installed, similar to:

openjdk version "1.8.0_292"

Step 3: Download Hadoop

Next, download a stable Hadoop release from the Apache Hadoop official website, or fetch it directly in the terminal with wget (this guide uses version 3.3.6):

wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
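
Before extracting, it's good practice to verify the tarball against the SHA-512 checksum Apache publishes alongside each release. The sketch below demonstrates the mechanics with a local stand-in file; for the real download you would fetch hadoop-3.3.6.tar.gz.sha512 from the same directory and check the tarball against it (if the published file uses a different layout, compare the printed hash by eye instead).

```shell
# Demo of checksum verification using a stand-in file; substitute the real
# hadoop-3.3.6.tar.gz and its published .sha512 file in practice.
printf 'stand-in for the Hadoop tarball\n' > hadoop-demo.tar.gz
sha512sum hadoop-demo.tar.gz > hadoop-demo.tar.gz.sha512
sha512sum -c hadoop-demo.tar.gz.sha512   # prints "hadoop-demo.tar.gz: OK"
```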

Once downloaded, extract the tarball:

tar -xvzf hadoop-3.3.6.tar.gz

Move the extracted files to /usr/local/hadoop:

sudo mv hadoop-3.3.6 /usr/local/hadoop

Step 4: Configure Hadoop Environment Variables

Set up the environment variables for Hadoop. Open the .bashrc file for editing:

 

nano ~/.bashrc

Add the following lines at the end of the file:

# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Save and close the file (Ctrl + X, then Y to confirm).

To apply the changes, run:

 

source ~/.bashrc
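
If you script this setup, a small guard keeps the exports from being appended twice on a re-run. A minimal sketch, using a demo file in place of ~/.bashrc (PROFILE and append_hadoop_env are illustrative names, not part of Hadoop):

```shell
# Append the Hadoop exports to a profile file only once, so re-running the
# setup never duplicates them. PROFILE points at a demo file here; set it
# to "$HOME/.bashrc" for real use.
PROFILE="./bashrc.demo"
append_hadoop_env() {
  grep -q 'HADOOP_HOME=/usr/local/hadoop' "$PROFILE" 2>/dev/null && return 0
  cat >> "$PROFILE" <<'EOF'
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
EOF
}
append_hadoop_env
append_hadoop_env          # second call is a no-op
grep -c 'HADOOP_HOME=/usr/local/hadoop' "$PROFILE"   # prints 1
```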

Step 5: Configure Hadoop XML Files

Hadoop requires configuration of several XML files to set up its environment. These files are located in the etc/hadoop directory inside your Hadoop installation folder.

Navigate to the configuration folder:

 

cd $HADOOP_HOME/etc/hadoop

Edit the following files:

  1. hadoop-env.sh: Set the Java home directory.
 

nano hadoop-env.sh

Find the line that sets JAVA_HOME (it may be commented out), uncomment it if necessary, and set it to:

 

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

  2. core-site.xml: Define the Hadoop file system and NameNode.

nano core-site.xml

Add the following configuration between the <configuration> tags:

 

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>

  3. hdfs-site.xml: Configure the replication factor and the NameNode and DataNode storage directories.

nano hdfs-site.xml

Add the following:

 

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///usr/local/hadoop/hdfs/namenode</value>
</property>

<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///usr/local/hadoop/hdfs/datanode</value>
</property>

  4. mapred-site.xml: Set the MapReduce framework. In Hadoop 3.x this file already exists; only in older 2.x releases did you first need to copy it from mapred-site.xml.template.

nano mapred-site.xml

Add the following:

 

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

  5. yarn-site.xml: Enable the MapReduce shuffle auxiliary service for the YARN NodeManager.

nano yarn-site.xml

Add the following:

 
 

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
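
All five edits above are mechanical, so they can be scripted for repeatable installs. A sketch of the idea for core-site.xml, writing into a demo directory (CONF_DIR here stands in for $HADOOP_HOME/etc/hadoop):

```shell
# Write core-site.xml non-interactively with a heredoc (demo directory;
# in a real install CONF_DIR would be $HADOOP_HOME/etc/hadoop).
CONF_DIR="./hadoop-conf-demo"
mkdir -p "$CONF_DIR"
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
grep 'fs.defaultFS' "$CONF_DIR/core-site.xml"
```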

Step 6: Format the Hadoop NameNode

Format the Hadoop NameNode using the command:

 

hdfs namenode -format

You should see a message like "Storage directory /usr/local/hadoop/hdfs/namenode has been successfully formatted." in the output.
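
If formatting fails with a "cannot create directory" or permission error, make sure the directories referenced in hdfs-site.xml exist and are writable by your user. A sketch, using a demo base path (the real paths are /usr/local/hadoop/hdfs/namenode and /usr/local/hadoop/hdfs/datanode):

```shell
# Create the HDFS storage directories ahead of time (demo base directory;
# use /usr/local/hadoop/hdfs with appropriate ownership in a real install).
HDFS_BASE="./hdfs-demo"
mkdir -p "$HDFS_BASE/namenode" "$HDFS_BASE/datanode"
ls "$HDFS_BASE"
```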

Step 7: Start Hadoop Services
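
Note: start-dfs.sh and start-yarn.sh launch each daemon over SSH, even on a single machine, so passwordless SSH to localhost must work first. The usual setup generates a key and authorizes it; the sketch below writes to a local demo path instead of ~/.ssh so it is safe to run anywhere (the commented lines show the real steps).

```shell
# Generate an SSH key non-interactively (demo path; in a real install the
# key would go to ~/.ssh/id_rsa).
KEY="./demo_id_rsa"
ssh-keygen -t rsa -N "" -f "$KEY" -q
# In a real install:
#   cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
#   chmod 600 ~/.ssh/authorized_keys
#   ssh localhost   # should now connect without a password prompt
ls "$KEY" "$KEY.pub"
```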

Start the Hadoop Distributed File System (HDFS) and YARN services:

 

start-dfs.sh
start-yarn.sh

To verify that the Hadoop daemons are running, use the jps command:

 

jps

You should see output similar to the following (the SecondaryNameNode and the Jps process itself will also be listed):

NameNode
DataNode
ResourceManager
NodeManager
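
This check is easy to script, for example as a health-check in a cron job. A sketch that greps for the expected daemon names (SAMPLE stands in for live jps output so the snippet is self-contained):

```shell
# Verify the expected Hadoop daemons appear in (simulated) jps output;
# replace `echo "$SAMPLE"` with `jps` on a real node.
SAMPLE="12001 NameNode
12102 DataNode
12233 ResourceManager
12340 NodeManager
12450 Jps"
MISSING=0
for d in NameNode DataNode ResourceManager NodeManager; do
  echo "$SAMPLE" | grep -q " $d$" || { echo "$d missing"; MISSING=1; }
done
[ "$MISSING" -eq 0 ] && echo "all expected daemons present" > jps-check.txt
cat jps-check.txt
```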

Step 8: Access the Hadoop Web Interfaces

Hadoop provides web interfaces to monitor the cluster:

  • HDFS NameNode UI: http://localhost:9870
  • YARN ResourceManager UI: http://localhost:8088

Step 9: Run a Test

To ensure everything is working, run one of the sample MapReduce jobs that ship with Hadoop:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 16 1000

The job should run to completion and print an estimated value of Pi.

Conclusion

You have successfully installed Hadoop on your Ubuntu machine! From here, you can begin building your distributed computing environment and explore the various functionalities that Hadoop offers. Whether you’re working on big data analytics, machine learning, or building a robust data lake, Hadoop is a great choice.

Feel free to experiment with different configurations, set up a multi-node cluster, and explore the world of big data with Hadoop! If you have any questions, let me know in the comments below.

Happy Hadooping!


 

Frequently Asked Questions

 

What version of Ubuntu is recommended for Hadoop installation?

It is recommended to use Ubuntu 18.04 or higher for a stable Hadoop environment.

What should I do if I get a "JAVA_HOME not set" error?

Ensure that you’ve correctly set the JAVA_HOME variable in the hadoop-env.sh file, and that the Java path is accurate (e.g., /usr/lib/jvm/java-8-openjdk-amd64).

How can I check if Hadoop services are running properly?

Run the jps command after starting the services. You should see NameNode, DataNode, ResourceManager, and NodeManager in the output.

How can I access Hadoop's web UI?

Visit http://localhost:9870 for the HDFS NameNode UI and http://localhost:8088 for the YARN ResourceManager UI.