How to Install Apache Hive on Ubuntu


Apache Hive is an enterprise-grade data warehouse system that provides a high-level, SQL-like interface for querying, managing, and analyzing data stored in the Hadoop Distributed File System (HDFS).

In this article, we walk through each step of installing Apache Hive on Ubuntu and explain what each step does along the way.

Understanding Apache Hive

Queries in Hive are written in the Hive Query Language (HiveQL) and can be run from the Hive CLI shell. Beeline is a JDBC client that connects to HiveServer2 from any environment, allowing you to submit and run queries remotely.

Hive translates these SQL-like queries into MapReduce (or Tez/Spark) jobs that the data-processing engines of a Hadoop cluster execute in a distributed manner. This is useful for organizations analyzing large datasets, as their teams do not have to learn complex MapReduce programming.
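As an illustration, here is the kind of HiveQL you will be able to run once the installation is complete. The table and column names are hypothetical; Hive compiles a query like this into a distributed job behind the scenes:

SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;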

How to Install Apache Hive on Ubuntu

Hive is data warehouse software built on top of Hadoop: it relies on HDFS for data storage and on an execution engine (MapReduce, Tez, or Spark) for query processing. Without a working Hadoop installation, Hive is non-functional.

Prerequisites and System Requirements

Check the following requirements before starting the Hive installation on Ubuntu:

  • Java 8 or Higher: Java 8 or later must be installed and the JAVA_HOME environment variable configured.
  • Working Hadoop Installation: A working Hadoop installation with its environment variables configured (see the quick check after this list).
  • Ubuntu Version: This guide uses Ubuntu 24.04, but any release from Ubuntu 18.04 LTS onward works.
  • System Resources: A minimum of 4GB RAM; 8GB is recommended for production use.
  • Storage: 20GB of free disk space.
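A quick way to confirm the Java and Hadoop prerequisites is to print the versions and variables below; the exact output will differ on your system:

java -version
echo $JAVA_HOME
hadoop version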

Step 1: Download and Extract Apache Hive

The first step in installing Hive on Ubuntu is to check the installed Hadoop version so that you can choose a compatible Hive release:

hadoop version

This command prints details about your Hadoop installation, such as the version number, compilation date, and other relevant information. When choosing your Hive version, note at least the major and minor version numbers (for instance, 3.3 or 3.4).


Next, download the appropriate Hive release from the official downloads page, making sure it is compatible with your local Hadoop installation.


The Apache Hive site provides a compatibility matrix showing which Hive versions work with which Hadoop versions.

Download Using wget (Recommended):

wget https://downloads.apache.org/hive/hive-4.0.1/apache-hive-4.0.1-bin.tar.gz

The wget command downloads files directly to your current directory. If needed, you can rename the downloaded file with the -O option. This method is preferred because it’s scriptable and doesn’t require a web browser.
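For example, to save the archive under a shorter name (the filename here is only an illustration):

wget -O hive.tar.gz https://downloads.apache.org/hive/hive-4.0.1/apache-hive-4.0.1-bin.tar.gz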


Extract the Archive:

The binary distribution (bin.tar.gz) contains precompiled Hive components that are ready for immediate use, in contrast to source distributions, which must be compiled first and take longer to set up. Extract the archive with:

tar xzf apache-hive-4.0.1-bin.tar.gz

After extracting files, a directory named apache-hive-4.0.1-bin is created, which contains all the files and directories of Hive.
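To inspect what was extracted, list the top-level contents of the new directory. The exact layout varies by release, but directories such as bin, conf, and lib are typically present:

ls apache-hive-4.0.1-bin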


Step 2: Configure Hive Environment Variables

Setting environment variables lets the shell find Hive commands and allows other components of the Hadoop ecosystem to locate Hive. Every new bash session runs the ~/.bashrc script, so variables set there are available in every terminal session.

Let’s modify the .bashrc file:

nano ~/.bashrc

Here, set environment variables for Hive:

export HIVE_HOME="/home/hdoop/apache-hive-4.0.1-bin"

The HIVE_HOME variable points to the Hive installation folder on your system. Several Hive scripts and configuration files reference this variable. Make sure to substitute /home/hdoop/apache-hive-4.0.1-bin with the actual path where you extracted Hive.

export PATH=$PATH:$HIVE_HOME/bin

The PATH variable tells the system where to search for executable commands. Adding $HIVE_HOME/bin makes it possible to run Hive commands (hive, beeline, schematool) from any directory.


Save and exit the editor.

Apply Changes:

Run the source command to load the new variables from .bashrc into the current terminal session. If you skip this step, you will need to open a fresh terminal session before the variables become available.

source ~/.bashrc

Verify Configuration:

This command verifies that the environment variables were set properly.

echo $HIVE_HOME

Now, verify that your PATH contains the Hive bin directory.

echo $PATH | grep hive

Step 3: Configure Hadoop for Hive Integration

Hadoop requires a few configuration changes before Hive can integrate with it, particularly around permissions and proxy user settings.

The core-site.xml file is one of the most critical configuration files in Hadoop, since it contains core properties that affect the entire ecosystem. Adding the Hive-related settings here ensures that authentication and authorization work correctly.

Edit core-site.xml:

Open core-site.xml for editing with:

sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml

Add Proxy User Configurations:

Proxy users in Hadoop are designed to act on behalf of other users. Services like Hive, which need access to HDFS and other Hadoop components, rely on proxy users to maintain appropriate security boundaries. Add the following properties inside the <configuration> element:

<configuration>
    <property>
        <name>hadoop.proxyuser.db_user.groups</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.db_user.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.server.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.server.groups</name>
        <value>*</value>
    </property>
</configuration>

Here,

  • hadoop.proxyuser.db_user.groups: Defines the groups on whose behalf the db_user proxy user may act; the wildcard * means all groups.
  • hadoop.proxyuser.db_user.hosts: Defines the hosts from which the db_user proxy user may act; again, the wildcard * means all hosts.
  • hadoop.proxyuser.server.hosts: Allows the server user to act as a proxy from any host.
  • hadoop.proxyuser.server.groups: Allows the server user to act as a proxy for any group.

Replace db_user with the name of the user that will run Hive (for example, hdoop in this guide).

Security Considerations: For production environments, replace the wildcard * with precise group names and host names to improve security. This configuration is useful in development and testing environments.  
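For instance, a more restrictive configuration might look like the following sketch; the group name hadoop and the host hive-node1.example.com are placeholders for your own environment:

<property>
    <name>hadoop.proxyuser.db_user.groups</name>
    <value>hadoop</value>
</property>
<property>
    <name>hadoop.proxyuser.db_user.hosts</name>
    <value>hive-node1.example.com</value>
</property>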

Step 4: Create Essential HDFS Directories

Hive needs specific directories in HDFS to store temporary data and warehouse tables. Understanding the purpose of each directory helps keep Hive running smoothly.

In HDFS, the /tmp directory acts as a temporary workspace for Hive operations. To start, execute the following command to make the directory:  

hadoop fs -mkdir /tmp

Hive uses this directory to store intermediate files and datasets generated while executing complex queries.


Set appropriate permissions:

hadoop fs -chmod g+w /tmp

The g+w permission allows members of the directory's group to write to it, so multiple users can create temporary files during their Hive sessions.


Verify creation:

hadoop fs -ls /
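Hive also expects a warehouse directory in HDFS, which the hive.metastore.warehouse.dir setting in Step 5 points to. If it does not exist yet, you can typically create it in the same way; the commands below assume the default path /user/hive/warehouse:

hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -chmod g+w /user/hive/warehouse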

Step 5: Configure the hive-site.xml File

Custom configuration is not mandatory to run Hive, but it helps Hive perform better and fit your environment. Every Apache Hive distribution ships with configuration template files, located in the Hive conf directory, that contain the default settings.

Navigate to the configuration directory:

Proceed to the configuration directory using the cd command:

cd apache-hive-4.0.1-bin/conf

List available configuration files:

See available configuration files using the ls -l command:

ls -l

You will see several template files, with hive-default.xml.template being the main template configuration file.


Copy and modify the template:

cp hive-default.xml.template hive-site.xml

Edit the configuration:

Open the hive-site.xml file and edit it:

nano hive-site.xml

One key parameter is hive.metastore.warehouse.dir, which must point to the warehouse directory in HDFS:

<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
</property>

Importance of Custom Configurations

  • Performance Optimization: Custom configurations let you tune memory settings, execution engine preferences, and other parameters to match your hardware and workload.
  • Integration with the Environment: They help guarantee that Hive works well with your Hadoop cluster configuration, security settings, and other ecosystem components.
  • Compliance Considerations: Production settings often differ from the defaults in logging, monitoring, security, and data governance.

Step 6: Initialize the Metadata Database

Within Hive, there is a database dedicated to storing metadata of each table, its columns, partitions, and even additional structural details. This metadata is distinct from the actual data stored in HDFS.

By default, Apache Hive stores this metadata in Derby, a lightweight embedded Java database suitable for development and small-scale deployments. Now go to the Hive home directory:

cd apache-hive-4.0.1-bin

Now, initialize the schema:

bin/schematool -dbType derby -initSchema

Schema initialization creates the requisite database tables and structures that are necessary to store Hive metadata. These are:

  • Database definitions
  • Table schemas and associated metadata
  • Column data types and attributes
  • Partition details
  • Security and authorization information

If the schema is not properly set up, it will not be possible for Hive to maintain and keep track of table structures, column definitions, or even where the data is stored. Moreover, the metadata store is critical for mapping SQL queries to the required data access methods.

Derby is the default metastore database for Hive. To use a different database, such as MySQL or PostgreSQL, configure the metastore's JDBC connection properties in hive-site.xml and initialize the schema with the matching -dbType value.
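As a rough sketch, a MySQL-backed metastore is usually configured with properties along these lines in hive-site.xml; the host, database name, user, and password below are placeholders, and the MySQL JDBC driver JAR must be available to Hive:

<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
</property>

The schema would then be initialized with bin/schematool -dbType mysql -initSchema.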

Step 7: Launch and Test Hive

Now that all the components are configured, you can start the Hive services and test the installation. HiveServer2 is the server-side component that lets multiple clients connect to Hive at the same time.

Launch HiveServer2:

bin/hiveserver2

HiveServer2 provides a Thrift-based interface supporting multiple client connections, which makes it the preferred access method to Hive in multi-user environments. 


Wait for the server to finish starting. It will log messages indicating a successful startup, along with a Hive Session ID that identifies your HiveServer2 instance.

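With HiveServer2 running, you can connect from a second terminal using Beeline. A typical connection, assuming the default port 10000, no authentication configured, and hdoop as the example user, looks like this:

bin/beeline -u jdbc:hive2://localhost:10000 -n hdoop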

To verify the installation, execute a simple query:

CREATE TABLE IF NOT EXISTS student_details(
name STRING,
marks FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

The statement creates a table named student_details with the columns defined above, and Hive confirms the successful creation in its output.
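As an optional sanity check beyond the original steps, you can insert a sample row and read it back; the values are arbitrary:

INSERT INTO student_details VALUES ('Alice', 87.5);
SELECT * FROM student_details;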


Conclusion

You have now installed and configured Apache Hive on Ubuntu by following each step of this guide. This installation provides a solid starting point for big data analytics using SQL-like queries on data stored in Hadoop. Make sure you meet the prerequisites, such as Java 8 or higher and a working Hadoop installation, and you are good to go with Apache Hive on your Ubuntu machine.

Experience Ultahost’s cheap Linux VPS hosting for better performance and reliability at a budget-friendly cost. Ultahost offers complete flexibility and control while handling all server management, ensuring everything runs smoothly and reliably with guaranteed uptime!

