Saturday, May 26, 2018

How to install Apache Spark on Oracle Enterprise Linux



Below are the steps to install Apache Spark on Oracle Enterprise Linux Update 8.

Prerequisites:

  • JDK
  • Python - optional; needed only if you want to use the Python shell for Spark (pyspark)

JDK

OpenJDK was already installed on my host, but I ran into an issue later while starting the Spark shell, so I eventually installed the latest Oracle JDK and set JAVA_HOME to point to it (details below).
$ java -version
openjdk version "1.8.0_121"
OpenJDK Runtime Environment (build 1.8.0_121-b13)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)

Python

Python 2.6.6 was already installed.

$ python --version
Python 2.6.6

Download Apache Spark


Download the latest stable version from the Apache Spark download page: https://spark.apache.org/downloads.html
I downloaded spark-2.3.0-bin-hadoop2.7.tgz

Install Apache Spark binary


cd /scratch/rajiv/softwares/
tar -xvzf spark-2.3.0-bin-hadoop2.7.tgz
Apache Spark is extracted into the folder spark-2.3.0-bin-hadoop2.7.

Set environment variables (bash):
export SPARK_HOME=/scratch/rajiv/softwares/spark-2.3.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
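
These exports last only for the current shell. To make them persist across sessions, they can be appended to ~/.bashrc — a minimal sketch, assuming a bash shell and the same install path as above:

# Append the Spark environment settings to ~/.bashrc so new shells pick them up
cat >> ~/.bashrc <<'EOF'
export SPARK_HOME=/scratch/rajiv/softwares/spark-2.3.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
EOF
source ~/.bashrc   # reload the settings in the current shell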

Start Spark Shell (Scala)


cd $SPARK_HOME/bin
bash-4.1$ ./spark-shell
But I got the error below. Somehow the Spark shell was expecting the JDK under “/scratch/rajiv/softwares/jdk7/bin/java”, while in my case OpenJDK had been installed from an RPM under “/usr/bin/java”.

/scratch/rajiv/softwares/spark-2.3.0-bin-hadoop2.7/bin/spark-class: line 71: /scratch/rajiv/softwares/jdk7/bin/java: No such file or directory

The bin/spark-class script runs ${JAVA_HOME}/bin/java whenever JAVA_HOME is set, so a stale JAVA_HOME pointing at a removed jdk7 directory explains the error above. I felt that installing the JDK from a tar.gz and setting the JAVA_HOME environment variable would solve the problem. I downloaded the latest Oracle JDK (tar.gz), set JAVA_HOME, and it worked after that.
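
For reference, the Java-resolution logic in bin/spark-class looks roughly like this (paraphrased from the Spark 2.3 script, not copied verbatim):

# bin/spark-class (abridged): java is taken from JAVA_HOME first,
# falling back to whatever "java" is on the PATH
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
elif [ "$(command -v java)" ]; then
  RUNNER="java"
else
  echo "JAVA_HOME is not set" >&2
  exit 1
fi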

Install Oracle JDK


Get the latest JDK from http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
Accept the license and download the tar.gz relevant to your box (32-bit vs. 64-bit) and operating system.
I downloaded jdk-8u172-linux-x64.tar.gz (Linux x64).
Extract the tar.gz:

tar -xvzf jdk-8u172-linux-x64.tar.gz
Set the JAVA_HOME environment variable:
bash-4.1$ export JAVA_HOME=`pwd`/jdk1.8.0_172
bash-4.1$ echo $JAVA_HOME
/scratch/rajiv/softwares/jdk1.8.0_172
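
To confirm Spark will pick up this JDK, it helps to verify that the path resolves to a runnable binary — a quick check, assuming the layout above:

# Verify that the JDK JAVA_HOME points to exists and runs
ls "$JAVA_HOME/bin/java"          # should list the java binary
"$JAVA_HOME/bin/java" -version    # should report 1.8.0_172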
Now try to start the Spark shell again:

$ ./bin/spark-shell
2018-04-23 23:37:14 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://<yourhostname>:4040
Spark context available as 'sc' (master = local[*], app id = local-1524551841157).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_172)
Type in expressions to have them evaluated.
Type :help for more information.



Spark Quickstart


I followed the commands in the Spark Quickstart: https://spark.apache.org/docs/latest/quick-start.html

Read a file into a Spark Dataset


The command below opens the README.md file (in the current directory, here the Spark home) and reads it into a Spark Dataset of strings.

scala> val textFile = spark.read.textFile("README.md")


There were some warnings, as shown below, but it worked.

2018-04-23 23:37:39 WARN  ObjectStore:6666 - Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
2018-04-23 23:37:39 WARN  ObjectStore:568 - Failed to get database default, returning NoSuchObjectException
2018-04-23 23:37:40 WARN  ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
textFile: org.apache.spark.sql.Dataset[String] = [value: string]
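
Actions can be invoked on the Dataset right away. A small follow-up step from the same quickstart (the output depends on the file contents):

scala> textFile.first()   // returns the first line of README.md as a String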



Get the number of items in the Dataset


scala> textFile.count()
res0: Long = 103

scala>
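
The quickstart continues by chaining transformations and actions. A minimal sketch of those next steps, entered at the scala> prompt (counts will vary with your README.md):

// Transform the Dataset to the subset of lines mentioning Spark, then count them
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark.count()

// Find the word count of the longest line (a map followed by a reduce)
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)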

Use “Control + D” to quit the shell.

Using pyspark


Set environment variables (bash):
export SPARK_HOME=/scratch/rajiv/softwares/spark-2.3.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export JAVA_HOME=/scratch/rajiv/softwares/jdk1.8.0_172
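
If more than one Python is installed, the interpreter pyspark uses can be selected with the PYSPARK_PYTHON environment variable — an optional setting; /usr/bin/python here is just an example path:

# Optional: point pyspark at a specific Python interpreter
export PYSPARK_PYTHON=/usr/bin/python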

Run pyspark


$ ./bin/pyspark

Python 2.6.6 (r266:84292, Jul 23 2015, 05:13:40)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
2018-04-25 03:16:24 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Python version 2.6.6 (r266:84292, Jul 23 2015 05:13:40)
SparkSession available as 'spark'.
>>> 

Spark predefines several variables in the shell; the SparkContext is available as sc. Evaluating it confirms that the variable has been initialized:

>>> sc
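
The Scala quickstart commands have Python equivalents as well; a minimal sketch using the predefined spark session (run pyspark from the Spark home so README.md resolves):

# Read README.md into a DataFrame of rows and run simple actions on it
textFile = spark.read.text("README.md")
textFile.count()   # number of rows (lines) in the file
textFile.first()   # the first row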