Apache Hadoop, or simply Hadoop, is an open-source, Java-based framework for storing and processing large data sets across clusters of computers. Instead of relying on a single large machine, it clusters multiple computers so data can be stored and processed more quickly. Hadoop consists of four main modules:
– HDFS (Hadoop Distributed File System)
– YARN (Yet Another Resource Negotiator)
– MapReduce
– Hadoop Common
In this tutorial, we will explain how to install Hadoop on Debian 11.
Prerequisites
- Debian 11
- SSH root access or a normal system user with sudo privileges
Step 1. Log in to the Server
First, log in to your Debian 11 server through SSH as the root user:
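The command takes this shape (IP_Address and Port_Number are placeholders for your server’s values):

```shell
# Connect as root; substitute your server's IP address and SSH port
ssh root@IP_Address -p Port_Number
```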
Replace “root” with a user that has sudo privileges if necessary. Additionally, replace “IP_Address” and “Port_Number” with your server’s respective IP address and SSH port number.
You can check whether you have the proper Debian version installed on your server with the following command; the output should report Debian GNU/Linux 11 (bullseye).
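Either of these will print the release information (lsb_release may require the lsb-release package to be installed):

```shell
# Print the Debian release details
lsb_release -a
# Alternatively, read the os-release file directly
cat /etc/os-release
```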
Before starting, you have to make sure that all Debian OS packages installed on the server are up to date.
You can do this by running the following commands:
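A typical update sequence, run as root (or with sudo), looks like this:

```shell
# Refresh the package index and upgrade all installed packages
apt-get update -y
apt-get upgrade -y
```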
Step 2. Create a System User and Generate SSH Key
It is not a good idea to run Hadoop as root, so for security reasons, we will create a new system user:
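A sketch of the command, run as root; setting the home directory to /opt/hadoop is an assumption that matches the paths used later in this tutorial:

```shell
# Create the 'hadoop' user; the command prompts for a password and details.
# --home /opt/hadoop is an assumption matching the later configuration steps.
adduser --home /opt/hadoop hadoop
```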
A user ‘hadoop’ has been created; let’s log in as that user.
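Switching to the new user can be done with:

```shell
# Switch to the 'hadoop' user with a login shell
su - hadoop
```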
Hadoop requires SSH access to manage its nodes, whether remote or local. To access the nodes without a password, we can generate an SSH key and copy the public key to the ~/.ssh/authorized_keys file.
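A typical invocation; press Enter at the prompts to accept the default location and, for passwordless login, an empty passphrase:

```shell
# Generate an RSA key pair under ~/.ssh
ssh-keygen -t rsa
```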
The command prints the new key’s fingerprint and a randomart image.
Next, let’s add hadoop’s public key to the authorized keys file, so that user ‘hadoop’ can log in to the system with the SSH key alone, no password required.
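Assuming the key pair was generated at the default location, this appends the public key and restricts the file’s permissions:

```shell
# Authorize the key for logins as this user
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# authorized_keys must not be readable or writable by other users
chmod 600 ~/.ssh/authorized_keys
```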
Log in to the system through SSH now.
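For a single-node setup, the node is the local machine:

```shell
# The first connection will ask you to confirm the host fingerprint
ssh localhost
```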
You should be able to log in to SSH without a password now.
Let’s exit from user ‘hadoop’ and then continue to the next step.
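```shell
# Return to the previous (root) shell
exit
```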
Step 3. Install Java
Hadoop is written in Java, so Java must be installed on the system to run it. Let’s run the command below to install the default JDK from the repository.
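On Debian 11, the default-jdk package currently pulls in OpenJDK 11:

```shell
# Install the default JDK and JRE from the Debian repositories
apt install default-jdk default-jre -y
```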
Java should be installed now, you can check and verify it by invoking this command:
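```shell
# Print the installed Java version
java -version
```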
Step 4. Download and Install Hadoop
At the time of writing this article, the latest stable version of Hadoop is version 3.3.2. You can go to the download page at https://hadoop.apache.org/releases.html to check for a more recent version, if any.
Let’s log in as user ‘hadoop’ to download and extract it, so we do not need to change the file and directory permission.
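A sketch of the download-and-extract steps; extracting into /opt/hadoop assumes that is the hadoop user’s home directory, and the URL points at the Apache archive, where the 3.3.2 release now lives:

```shell
su - hadoop
# Download the 3.3.2 release tarball
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.2/hadoop-3.3.2.tar.gz
# Extract it directly into /opt/hadoop, dropping the top-level directory
tar -xzf hadoop-3.3.2.tar.gz -C /opt/hadoop --strip-components=1
```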
Before continuing to the next steps, make sure JAVA_HOME points to the correct directory; you can check this by listing /usr/lib/jvm.
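```shell
# The JDK directory name shown here is what JAVA_HOME should point to
ls /usr/lib/jvm
```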
Now, let’s edit /opt/hadoop/.bashrc
Insert the following lines into the file.
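The entries below are a sketch: JAVA_HOME is assumed to be OpenJDK 11 on amd64 and must match the directory you saw under /usr/lib/jvm, and HADOOP_HOME is assumed to be /opt/hadoop:

```shell
# Hadoop environment variables (adjust JAVA_HOME to your system)
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
```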
Save the file and exit, then run the command below to activate the newly added environment variables.
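```shell
# Reload the profile so the variables take effect in the current shell
source /opt/hadoop/.bashrc
```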
Step 5. Configure Hadoop
Hadoop can be configured to run on a single node or as a multi-node cluster. In this tutorial, we will show you how to set up a Hadoop single-node cluster, also known as pseudo-distributed mode. There are several files we need to modify in this step; let’s edit the Hadoop environment file first.
Add the following line to the file.
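Open $HADOOP_HOME/etc/hadoop/hadoop-env.sh with your editor and set JAVA_HOME; the path here is an assumption for Debian 11’s default OpenJDK 11 and should match what you saw under /usr/lib/jvm:

```shell
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
```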
Edit core-site.xml file.
Add these lines to the configuration tag.
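The file can be opened with nano $HADOOP_HOME/etc/hadoop/core-site.xml. The value below is a typical single-node setting; port 9000 is the conventional default for fs.defaultFS:

```xml
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
</property>
```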
Edit hdfs-site.xml file.
Add these lines to the configuration tag.
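Typical single-node values: replication is 1 because there is only one DataNode, and the two directory paths are assumptions that must match the directories created in a later step:

```xml
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/hadoop/hadoopdata/hdfs/namenode</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/hadoop/hadoopdata/hdfs/datanode</value>
</property>
```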
Save the file by pressing CTRL + O and exit with CTRL + X.
Edit yarn-site.xml file.
Add these lines to the configuration tag.
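This is the standard setting that lets MapReduce jobs shuffle data through YARN’s NodeManager:

```xml
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
```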
The last file to modify is the mapred-site.xml.
Add these lines to the configuration tag.
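This tells MapReduce to run on top of YARN rather than in local mode:

```xml
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
```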
Do not forget to save the file and then exit from the nano editor.
With the files above modified, we need to create some directories; run this command:
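As user ‘hadoop’, create the NameNode and DataNode directories; the paths are assumptions and must match the dfs.namenode.name.dir and dfs.datanode.data.dir values set in hdfs-site.xml:

```shell
# Create the local storage directories for HDFS metadata and blocks
mkdir -p ~/hadoopdata/hdfs/namenode ~/hadoopdata/hdfs/datanode
```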
Prior to starting Hadoop services for the first time, we need to format the namenode.
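With $HADOOP_HOME/bin on the PATH, the format command is:

```shell
# One-time formatting of the HDFS metadata store
hdfs namenode -format
```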
Start the NameNode and DataNode:
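```shell
# Starts the NameNode, DataNode, and SecondaryNameNode daemons
start-dfs.sh
```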
If you see a warning message like “WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable”, it means your server OS is 64-bit but the bundled Hadoop native library is 32-bit. This is expected, and you can safely ignore the warning. If you are not comfortable with it, you can download the Hadoop source and compile it yourself to get a 64-bit shared library.
Now let’s start the YARN resource and node managers.
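```shell
# Starts the ResourceManager and NodeManager daemons
start-yarn.sh
```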
Finally, run this command to verify that everything is running:
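jps, the JVM process status tool shipped with the JDK, lists the running Java daemons:

```shell
jps
```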
The output will list the running Java processes; in pseudo-distributed mode you should see NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager, and Jps.
Now, you can go to http://YOUR_SERVER_IP_ADDRESS:9870/ and see the namenode, datanode, etc.
To check the YARN web portal, you can navigate to http://YOUR_SERVER_IP_ADDRESS:8088/
That’s it. You have successfully installed and configured Hadoop on Debian 11 VPS.
Of course, you don’t have to install Hadoop on Debian 11 if you have a Managed Debian Server with us. You can simply ask our support team to install Hadoop on Debian 11 for you. They are available 24/7 and will be able to help you with the installation.
PS. If you enjoyed reading this blog post on how to install Hadoop on Debian 11, feel free to share it on social networks or simply leave a comment in the comments section. Thanks.