Boost Your Big Data Skills: Hadoop Installation on Ubuntu

Published September 20, 2024

Hadoop Installation on Ubuntu

Hadoop is a distributed computing framework for processing and storing massive datasets. It runs on Ubuntu and offers scalable data storage and parallel processing capabilities.

Installing Hadoop enables you to efficiently handle big data challenges and extract valuable insights from your data.

Hadoop installation on Ubuntu involves the following steps:

  • Install Java.
  • Create a User.
  • Download Hadoop.
  • Configure Environment.
  • Configure Hadoop.
  • Start Hadoop.
  • Access Web Interface.

Requirements for Hadoop Installation on Ubuntu

Before installing Hadoop on Ubuntu, make sure your system meets the following requirements:

  • A Linux VPS running Ubuntu.
  • A non-root user with sudo privileges.
  • Access to Terminal/Command line.

Hadoop Installation

Once you have met the requirements above and purchased a Linux VPS, you're ready to follow the steps in this guide. By the end, you'll be fully equipped to harness Hadoop's power to manage and analyze large datasets efficiently.

Step 1: Install Java Development Kit (JDK)

  • Since Hadoop requires Java to run, use the following command to install the default JDK and JRE:

sudo apt install default-jdk default-jre -y
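
If apt cannot find these packages, refreshing the package index first usually resolves it (a routine apt step, not specific to this guide):

sudo apt update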

  • Then, verify the installation by checking the Java version:

java -version

Output:

java version "11.0.16" 2021-08-09 LTS
OpenJDK 64-Bit Server VM (build 11.0.16+8-Ubuntu-

If Java is installed correctly, you'll see version information similar to the above.

Step 2: Create a dedicated user for Hadoop & Configure SSH

  • Run the command below to create a dedicated hadoop user:

sudo adduser hadoop

  • To add the user to the sudo group, type:

sudo usermod -aG sudo hadoop

  • Run the command below to switch to the Hadoop user:

sudo su - hadoop

  • To install OpenSSH server and client, run:

sudo apt install openssh-server openssh-client -y

  • Then, generate SSH keys by running the following command:

ssh-keygen -t rsa
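
For Hadoop to start its daemons over SSH without prompting for a password, the generated public key is typically appended to the hadoop user's authorized_keys file (a minimal sketch; accept the defaults that ssh-keygen offers):

# append the public key so SSH to localhost works without a password
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# restrict permissions, which sshd requires
chmod 600 ~/.ssh/authorized_keys
# confirm passwordless login works, then type exit to return
ssh localhost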

Step 3: Download the latest stable release

To get started, head over to the Apache Hadoop download page and locate the latest stable release (for example, version 3.3.4). Copy its download link, then download the release with the wget command:

wget https://downloads.apache.org/hadoop/common/stable/hadoop-3.3.4.tar.gz
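
Optionally, you can verify the integrity of the download before extracting it; Apache publishes a .sha512 checksum file alongside each release (a sketch assuming the same mirror and filename; compare the two hashes manually):

wget https://downloads.apache.org/hadoop/common/stable/hadoop-3.3.4.tar.gz.sha512
sha512sum hadoop-3.3.4.tar.gz
cat hadoop-3.3.4.tar.gz.sha512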

  • Then, extract the downloaded file:

tar -xvzf hadoop-3.3.4.tar.gz

  • To move the extracted directory, run:

sudo mv hadoop-3.3.4 /usr/local/hadoop

  • Use the command below to create a directory for logs:

sudo mkdir /usr/local/hadoop/logs

  • Now, you need to change ownership of the Hadoop directory. So, use:

sudo chown -R hadoop:hadoop /usr/local/hadoop

Step 4: Configure Hadoop Environment Variables

  • Edit the .bashrc file using the command below:

sudo nano ~/.bashrc

Add the following environment variables to the end of the file:

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

  • Save the file, then apply the changes by sourcing .bashrc:

source ~/.bashrc

  • When you are finished, the Hadoop commands will be available on your PATH and you are ready to move on to configuring Hadoop itself. A quick sanity check is shown below.
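
To confirm the new variables are in effect in the current shell, you can run (a simple, optional check):

echo $HADOOP_HOME
which hadoop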

Step 5: Configure Hadoop

  • First, edit the hadoop-env.sh file by running the command below:

sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

  • Now, you must add the path to Java. If you haven’t already added the JAVA_HOME variable in your .bashrc file, include it here:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_CLASSPATH+=" $HADOOP_HOME/lib/*.jar"

Save changes and exit when you are done.
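
If you're not sure of the exact JDK path on your system, one common way to find it (assuming the apt-installed OpenJDK) is:

readlink -f /usr/bin/java | sed 's:/bin/java::'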

  • Then, change your current working directory to /usr/local/hadoop/lib:

cd /usr/local/hadoop/lib

  • The command below downloads the javax.activation API jar (note that the JCenter repository has been sunset, so if this link is unavailable, the same javax.activation-api 1.2.0 artifact is also published on Maven Central):

sudo wget https://jcenter.bintray.com/javax/activation/javax.activation-api/1.2.0/javax.activation-api-1.2.0.jar

  • When you are finished, you can check the Hadoop version:

hadoop version

  • If the previous steps have completed successfully, you can now configure the Hadoop core site. To edit the core-site.xml file, run:

sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml

  • Add the default filesystem URI (fs.default.name still works, although newer Hadoop releases prefer the equivalent fs.defaultFS property):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9000</value>
    <description>The default file system URI</description>
  </property>
</configuration>

Save changes and exit.

  • Use the following command to create directories for NameNode and DataNode:

sudo mkdir -p /home/hadoop/hdfs/{namenode,datanode}

  • Then, change ownership of the directories:

sudo chown -R hadoop:hadoop /home/hadoop/hdfs

  • To edit the hdfs-site.xml file, first run:

sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

  • Then, paste the following block to set the replication factor (a value of 1 is sufficient for a single-node setup). If you want HDFS to store its data in the NameNode and DataNode directories created above, you can also point the dfs.namenode.name.dir and dfs.datanode.data.dir properties at them:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Save changes and exit.

  • At this point, you can configure MapReduce. Run the command below to edit the mapred-site.xml file:

sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

  • To set YARN as the MapReduce framework, paste the following block:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Save changes and exit.

  • To configure YARN, run the command below and edit the yarn-site.xml file:

sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

  • Paste the following to enable the MapReduce shuffle service:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

Save changes and exit.

  • Format the NameNode by running the following command:

hdfs namenode -format

This initializes the Hadoop Distributed File System (HDFS).

Step 6: Start the Hadoop Cluster

  • Run the command below to start the NameNode and DataNode:

start-dfs.sh

  • To start the ResourceManager and NodeManager, run:

start-yarn.sh

  • Check running processes by running the command below:

jps

You should see processes like NameNode, DataNode, ResourceManager, and NodeManager running.

If all is correct, you are ready to access the Hadoop Web Interface.
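
Assuming the default Hadoop 3.x ports haven't been changed, the web interfaces are available at:

http://<your-server-ip>:9870 (NameNode / HDFS web UI)
http://<your-server-ip>:8088 (YARN ResourceManager web UI)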

What Are the Key Features and Advantages of Hadoop on Ubuntu?

  • Scalability: Easily scale Hadoop clusters to handle growing data volumes by adding more nodes.
  • Fault Tolerance: Data is replicated across multiple nodes, ensuring data durability and availability.
  • Parallel Processing: Hadoop distributes data processing tasks across multiple nodes, accelerating performance.
  • Cost-Effective: Hadoop can run on commodity hardware, making it a cost-effective solution for big data processing.
  • Open Source: Hadoop is freely available and has a large, active community providing support and development.
  • Integration with Other Tools: Hadoop integrates seamlessly with other big data tools like Spark, Hive, and Pig, expanding its capabilities.
  • Flexibility: Hadoop supports various data formats and can be customized to meet specific use cases.

After Hadoop Installation on Ubuntu

  • Configure and start the Hadoop cluster: Begin by setting up essential services such as the NameNode, DataNode, ResourceManager, and NodeManager to get your cluster running smoothly.
  • Load data into HDFS: Next, upload your data files into the Hadoop Distributed File System (HDFS) to store and prepare them for processing.
  • Run MapReduce jobs: Once your data is ready, you can execute MapReduce jobs for tasks like word counting, filtering, or aggregation (a short example follows this list).
  • Utilize other Hadoop components: Additionally, explore advanced tools such as Hive, Pig, and Spark for enhanced data analysis and machine learning capabilities.
  • Monitor and manage the cluster: Don’t forget to regularly monitor resource usage and job execution using the Hadoop web interface for troubleshooting and management.
  • Integrate with other systems: Finally, connect Hadoop with other applications and databases to create seamless data pipelines and workflows for broader functionality.
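
As a concrete sketch of the "load data" and "run MapReduce jobs" items above (assuming the single-node cluster from this guide is running, that the examples jar shipped with Hadoop 3.3.4 is in its default location, and that sample.txt is a hypothetical local file):

hdfs dfs -mkdir -p /input
hdfs dfs -put ~/sample.txt /input
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar wordcount /input /output
hdfs dfs -cat /output/part-r-00000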

How to Monitor the Performance of the Hadoop Cluster?

Use the Hadoop web interface to monitor resource usage, job execution, and other metrics.

You can also use tools like Ganglia or Nagios for more advanced monitoring.
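
In addition, a quick command-line health check is available with the tools installed above (a brief sketch; both commands report on the running cluster):

hdfs dfsadmin -report
yarn node -list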

Why Are Hadoop Services Not Starting on Ubuntu?

There could be several reasons for this. To troubleshoot, consider:

  • Configuration errors: Verify that your configuration files (core-site.xml, hdfs-site.xml, etc.) are correct and contain the necessary properties.
  • NameNode format: Ensure that you’ve formatted the NameNode using hdfs namenode -format.
  • Port conflicts: Check if other applications are using the ports specified in your Hadoop configuration (e.g., 9000 for the NameNode); a quick way to check is shown after this list.
  • Firewall issues: Make sure your firewall is configured to allow Hadoop services to communicate.
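
A minimal sketch for spotting port conflicts with the ss utility that ships with Ubuntu (adjust the port list to match your configuration):

sudo ss -tlnp | grep -E ':(9000|9870|8088)'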

How to Troubleshoot Issues with HDFS?

Use the hdfs dfs -ls command to list files and directories in HDFS.

If you encounter errors, check the daemon logs under /usr/local/hadoop/logs for clues. You can also use the hdfs dfs -tail <filename> command to view the last kilobyte of a file stored in HDFS.
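
For reference, a few everyday HDFS commands (a sketch; /data and report.csv are hypothetical paths):

hdfs dfs -mkdir -p /data
hdfs dfs -put report.csv /data/
hdfs dfs -ls /data
hdfs dfs -tail /data/report.csv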

Why Are My MapReduce Jobs Failing?

There could be several reasons for job failures, including:

  • Input/output errors: Ensure that your input and output paths are correct and that the data format is compatible with your MapReduce job.
  • Job configuration issues: Check your job configuration for errors or inconsistencies.
  • Resource limitations: If your cluster is under heavy load, your job might fail due to insufficient resources.
  • Programming errors: Review your MapReduce code for logical errors or bugs.

Read More: Ubuntu Server vs. Ubuntu Desktop: Which One Should You Choose?