
Learn to Install Apache Spark on Ubuntu 18.04 Server


Introduction to Apache Spark

Apache Spark is an open-source, general-purpose framework for distributed cluster computing. It is designed to deliver the computational speed needed for workloads ranging from machine learning and stream processing to complex SQL queries, and it makes it easy to distribute work on large datasets across multiple computers.

Moreover, it uses in-memory cluster computing to speed up applications by reducing the need to write to disk. Spark provides APIs for multiple programming languages such as Python, R, and Scala, which eliminate much of the lower-level work that might otherwise be needed to manage big data.

Data is being collected and produced at an ever-increasing rate, and this has led to new methodologies for analyzing it. Speed is essential for individuals and industries alike when sifting through the large amounts of information arriving from all fronts, so the technology tasked with making sense of it all has to be fast. Apache Spark is one of the newer open-source technologies that offers this capability. In this tutorial, you will learn how to install Apache Spark on Ubuntu.

Prerequisites

  • This tutorial is performed on a Self-Managed Ubuntu 18.04 server as the root user.

Install Dependencies

First, make sure that all of your system packages are up to date by executing the below commands:
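For example, with apt, the default package manager on Ubuntu 18.04:

    apt update
    apt upgrade -y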

Since Java is needed to run Apache Spark, make sure that Java is installed. To check this, execute the below command:
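A typical check, followed by an install of the default OpenJDK package if Java is not already present (on Ubuntu 18.04, default-jdk installs OpenJDK 11):

    java -version
    apt install -y default-jdk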

Download Apache Spark

Next, download Apache Spark to the server. At the time this article was written, version 3.0.1 was the newest release. Download Apache Spark using the below command:
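A typical download command, assuming the Spark 3.0.1 binary built for Hadoop 2.7; adjust the version and mirror if a newer release is available:

    wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz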

After you have finished downloading, extract the Apache Spark tar file with the below command:
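For example, assuming the file downloaded in the previous step:

    tar -xvzf spark-3.0.1-bin-hadoop2.7.tgz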

Finally, move the extracted directory to /opt as below:
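A typical move, renaming the directory to /opt/spark, which is the path assumed by the environment variables configured in the next section:

    mv spark-3.0.1-bin-hadoop2.7 /opt/spark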

Configure the Environment

Before you start the Spark master server, you need to configure a few environment variables. First, set the environment variables in the .profile file with the below commands:
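A minimal sketch, assuming Spark was moved to /opt/spark in the previous step; the PYSPARK_PYTHON line points pyspark at the system Python 3 interpreter:

    echo 'export SPARK_HOME=/opt/spark' >> ~/.profile
    echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> ~/.profile
    echo 'export PYSPARK_PYTHON=/usr/bin/python3' >> ~/.profile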

To make sure the new environment variables are accessible within the shell and available to Apache Spark, run the below command:
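For example, reload the profile and print one of the variables to confirm it is set:

    source ~/.profile
    echo $SPARK_HOME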

Start Apache Spark

When the environment is configured, you need to start the Spark master server. The essential directory was added to the system PATH variable by the previous command, so you can easily run the below command from any directory:
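Since $SPARK_HOME/sbin is now on the PATH, the master can be started with:

    start-master.sh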

The Apache Spark web interface runs locally on the remote server, so to view it you need to use SSH tunneling to forward a port from your local machine to the server. Log out of the server and then execute the below command, replacing the placeholder with your server's hostname or IP:
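A typical tunnel command, run from your local machine; your_server_ip below is a placeholder for your server's hostname or IP address:

    ssh -L 8080:localhost:8080 root@your_server_ip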

You can now view the web interface from a browser on your local machine by visiting http://localhost:8080/. Once the web interface loads, copy the Spark master URL, as you will need it in the next step.

Start Spark Worker Process

In this tutorial, Apache Spark is installed on a single machine, so the worker process will also be started on this server. In the terminal, start the worker with the below command, pasting in the Spark URL copied from the web interface:
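For example, assuming the spark:// URL shown at the top of the web interface (the hostname below is a placeholder; 7077 is the default master port). Note that on Spark 3.1 and later this script is named start-worker.sh:

    start-slave.sh spark://your_server_hostname:7077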

Once the worker is running, you will see it listed in the web interface.

Verify Spark Shell

The web interface is easy to use, but you should also make sure that Spark's command-line environment works as expected. Open the Spark shell by executing the below command in the terminal:
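Since the Spark bin directory is on the PATH, the Scala shell can be launched directly:

    spark-shell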

A Spark shell is available for both Scala and Python. Press CTRL + D to exit the current Spark shell. To test pyspark, run the below command:
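For example:

    pyspark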

Note: You may see warnings when starting these shells if one of the newer versions of the Java JDK is installed; Java 8 does not produce them. As per https://github.com/apache/spark/pull/24825 and https://issues.apache.org/jira/browse/HADOOP-10848, this issue has been resolved.

Shut Down Apache Spark

If for any reason you need to stop the master and worker Spark processes, run the below commands:
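For example (on Spark 3.1 and later the worker script is named stop-worker.sh):

    stop-slave.sh
    stop-master.sh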

Conclusion

Apache Spark offers an intuitive interface for working with big datasets. In this tutorial, you have learned how to get a basic setup running on a single system, but Apache Spark truly thrives on distributed systems. We hope this information helps you get up and running with your next big data project!


Pallavi Godse
Pallavi is a Digital Marketing Executive at MilesWeb with over 4 years of experience in content development. She is interested in writing engaging content on business, technology, web hosting, and other topics related to information technology.