Prerequisites:
- JDK
- Python - optional, needed only if you want to use the Python shell for Spark (pyspark)
JDK
OpenJDK was already installed on my host, but I ran into an
issue later while starting the Spark shell. So I finally installed the latest
Oracle JDK, and JAVA_HOME needs to be set to point to the Oracle JDK.
$ java -version
openjdk version "1.8.0_121"
OpenJDK Runtime Environment (build 1.8.0_121-b13)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)
Python
Python 2.6.6 was installed already.
$ python --version
Python 2.6.6
Download Apache Spark
Download the latest stable version from the Apache Spark download
page: https://spark.apache.org/downloads.html
I downloaded spark-2.3.0-bin-hadoop2.7.tgz.
Install Apache Spark binary
cd /scratch/rajiv/softwares/
tar -xvzf spark-2.3.0-bin-hadoop2.7.tgz
Apache Spark is extracted into the folder
spark-2.3.0-bin-hadoop2.7
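To confirm the extraction, list the folder contents:
ls spark-2.3.0-bin-hadoop2.7
You should see bin/, conf/, jars/ and the README.md file used later in this post, among others.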
Set environment variables
export SPARK_HOME=/scratch/rajiv/softwares/spark-2.3.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
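These exports only last for the current shell session. To make them permanent, you can append the same lines to ~/.bashrc (paths here are from my install; adjust them for yours):
echo 'export SPARK_HOME=/scratch/rajiv/softwares/spark-2.3.0-bin-hadoop2.7' >> ~/.bashrc
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc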
Start Spark Shell (Scala)
cd $SPARK_HOME/bin
bash-4.1$ ./spark-shell
But I got the error below. Somehow the Spark shell was expecting a JDK
under “/scratch/rajiv/softwares/jdk7/bin/java”, whereas on my host OpenJDK was
installed via rpm under “/usr/bin/java”.
/scratch/rajiv/softwares/spark-2.3.0-bin-hadoop2.7/bin/spark-class: line 71: /scratch/rajiv/softwares/jdk7/bin/java: No such file or directory
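spark-class runs $JAVA_HOME/bin/java whenever JAVA_HOME is set, so the likely cause here is a stale JAVA_HOME pointing at a JDK 7 install that no longer exists. A quick way to check (assuming a bash shell):
echo $JAVA_HOME
which java
If JAVA_HOME points at a path that does not exist, unset it or point it at a valid JDK, as done below.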
I felt that installing the JDK from a tar.gz and setting the
JAVA_HOME environment variable would solve the problem. I downloaded the latest
Oracle JDK (tar.gz), set JAVA_HOME, and the shell worked after that.
Install Oracle JDK
Get the latest jdk version from -
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
Accept the license and download the tar.gz relevant to your
machine (32-bit vs 64-bit) and operating system.
I downloaded jdk-8u172-linux-x64.tar.gz (Linux x64).
Extract tar.gz
tar -xvzf jdk-8u172-linux-x64.tar.gz
Set JAVA_HOME environment variable
bash-4.1$ export JAVA_HOME=`pwd`/jdk1.8.0_172
bash-4.1$ echo $JAVA_HOME
/scratch/rajiv/softwares/jdk1.8.0_172
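Before retrying the Spark shell, it is worth confirming that the new JDK is the one that gets picked up:
bash-4.1$ $JAVA_HOME/bin/java -version
This should report version 1.8.0_172.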
Now try to start the Spark shell again.
bash-4.1$ ./bin/spark-shell
2018-04-23 23:37:14 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://<yourhostname>:4040
Spark context available as 'sc' (master = local[*], app id = local-1524551841157).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_172)
Type in expressions to have them evaluated.
Type :help for more information.
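As the startup banner notes, the log level can be changed from inside the shell, e.g. to show only errors:
scala> sc.setLogLevel("ERROR")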
Spark Quickstart
I followed the commands in the Spark quick start guide: https://spark.apache.org/docs/latest/quick-start.html
Read file and make Spark Dataset
The command below opens the README.md file (under $SPARK_HOME) and reads it into a Spark
Dataset of strings.
scala> val textFile = spark.read.textFile("README.md")
There were some warnings, as shown below, but it worked.
2018-04-23 23:37:39 WARN ObjectStore:6666 - Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
2018-04-23 23:37:39 WARN ObjectStore:568 - Failed to get database default, returning NoSuchObjectException
2018-04-23 23:37:40 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
textFile: org.apache.spark.sql.Dataset[String] = [value: string]
Get the number of items in the Dataset
scala> textFile.count()
res0: Long = 103
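The quick start guide has a few more one-liners worth trying in the same session (outputs omitted here):
scala> textFile.first()
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
scala> linesWithSpark.count()
first() returns the first line of the file, and the filter/count pair tells you how many lines mention "Spark".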
Use “Control + D” (or the :quit command) to quit the shell.
Using pyspark
Set environment variables
export SPARK_HOME=/scratch/rajiv/softwares/spark-2.3.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export JAVA_HOME=/scratch/rajiv/softwares/jdk1.8.0_172
Run pyspark
$ ./bin/pyspark
Python 2.6.6 (r266:84292, Jul 23 2015, 05:13:40)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
2018-04-25 03:16:24 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Python version 2.6.6 (r266:84292, Jul 23 2015 05:13:40)
SparkSession available as 'spark'.
>>>
>>> sc
Spark has some predefined variables; the SparkContext variable
is named sc. The output of the command above shows that the variable is initialized.
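The quick start read-and-count from the Scala section works the same way in pyspark (run from $SPARK_HOME so README.md is found; outputs omitted):
>>> textFile = spark.read.text("README.md")
>>> textFile.count()
>>> sc.parallelize(range(100)).sum()
The last line is a small sanity check that the sc SparkContext itself is usable.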