Prerequisites:
- JDK
- Python - optional, needed only if you want to use the Python shell for Spark (pyspark)
JDK
OpenJDK was already installed on my host, but I ran into an
issue later while starting the Spark shell. So I finally installed the latest
Oracle JDK, and JAVA_HOME needs to be set to point to the Oracle JDK.
$ java -version
openjdk version "1.8.0_121"
OpenJDK Runtime Environment (build 1.8.0_121-b13)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)
Python
Python 2.6.6 was installed already.
$ python --version
Python 2.6.6
Download Apache Spark
Download the latest stable version from the Apache Spark download
page: https://spark.apache.org/downloads.html
I downloaded spark-2.3.0-bin-hadoop2.7.tgz.
Install Apache Spark binary
cd /scratch/rajiv/softwares/
tar -xvzf spark-2.3.0-bin-hadoop2.7.tgz
Apache Spark is extracted into the folder
spark-2.3.0-bin-hadoop2.7
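To confirm the extraction, list the folder contents:
ls spark-2.3.0-bin-hadoop2.7
You should see bin/, conf/, jars/ and the README.md file used later in this post, among others.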
Set environment variables
export SPARK_HOME=/scratch/rajiv/softwares/spark-2.3.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
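These exports only last for the current shell session. To make them permanent, you can append the same lines to ~/.bashrc (paths here are from my install; adjust them for yours):
echo 'export SPARK_HOME=/scratch/rajiv/softwares/spark-2.3.0-bin-hadoop2.7' >> ~/.bashrc
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc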
Start Spark Shell (Scala)
cd $SPARK_HOME/bin
bash-4.1$ ./spark-shell
But I got the error below. Somehow the Spark shell was expecting a JDK
under “/scratch/rajiv/softwares/jdk7/bin/java”, whereas on my host OpenJDK was
installed via rpm under “/usr/bin/java”.
/scratch/rajiv/softwares/spark-2.3.0-bin-hadoop2.7/bin/spark-class: line 71: /scratch/rajiv/softwares/jdk7/bin/java: No such file or directory
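spark-class runs $JAVA_HOME/bin/java whenever JAVA_HOME is set, so the likely cause here is a stale JAVA_HOME pointing at a JDK 7 install that no longer exists. A quick way to check (assuming a bash shell):
echo $JAVA_HOME
which java
If JAVA_HOME points at a path that does not exist, unset it or point it at a valid JDK, as done below.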
I felt that installing the JDK from a tar.gz and setting the
JAVA_HOME environment variable would solve the problem. I downloaded the latest
Oracle JDK (tar.gz), set JAVA_HOME, and the shell worked after that.
Install Oracle JDK
Get the latest jdk version from -
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
Accept the license and download the tar.gz relevant to your
machine (32-bit vs 64-bit) and operating system.
I downloaded jdk-8u172-linux-x64.tar.gz (Linux x64).
Extract tar.gz
tar -xvzf jdk-8u172-linux-x64.tar.gz
Set JAVA_HOME environment variable
bash-4.1$ export JAVA_HOME=`pwd`/jdk1.8.0_172
bash-4.1$ echo $JAVA_HOME
/scratch/rajiv/softwares/jdk1.8.0_172
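Before retrying the Spark shell, it is worth confirming that the new JDK is the one that gets picked up:
bash-4.1$ $JAVA_HOME/bin/java -version
This should report version 1.8.0_172.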
Now try to start the Spark shell again.
bash-4.1$ ./bin/spark-shell
2018-04-23 23:37:14 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://<yourhostname>:4040
Spark context available as 'sc' (master = local[*], app id = local-1524551841157).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_172)
Type in expressions to have them evaluated.
Type :help for more information.
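As the startup banner notes, the log level can be changed from inside the shell, e.g. to show only errors:
scala> sc.setLogLevel("ERROR")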
Spark Quickstart
I followed the commands in the Spark quick start guide: https://spark.apache.org/docs/latest/quick-start.html
Read file and make Spark Dataset
The command below opens the README.md file (under $SPARK_HOME) and reads it into a Spark
Dataset of strings.
scala> val textFile = spark.read.textFile("README.md")
There were some warnings, as shown below, but it worked.
2018-04-23 23:37:39 WARN ObjectStore:6666 - Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
2018-04-23 23:37:39 WARN ObjectStore:568 - Failed to get database default, returning NoSuchObjectException
2018-04-23 23:37:40 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
textFile: org.apache.spark.sql.Dataset[String] = [value: string]
Get the number of items in the Dataset
scala> textFile.count()
res0: Long = 103
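The quick start guide has a few more one-liners worth trying in the same session (outputs omitted here):
scala> textFile.first()
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
scala> linesWithSpark.count()
first() returns the first line of the file, and the filter/count pair tells you how many lines mention "Spark".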
Use “Control + D” (or the :quit command) to quit the shell.
Using pyspark
Set environment variables
export SPARK_HOME=/scratch/rajiv/softwares/spark-2.3.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export JAVA_HOME=/scratch/rajiv/softwares/jdk1.8.0_172
Run pyspark
$ ./bin/pyspark
Python 2.6.6 (r266:84292, Jul 23 2015, 05:13:40)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
2018-04-25 03:16:24 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Python version 2.6.6 (r266:84292, Jul 23 2015 05:13:40)
SparkSession available as 'spark'.
>>>
>>> sc
Spark has some predefined variables; the SparkContext variable
is named sc. The output of the command above shows that the variable is initialized.
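The quick start read-and-count from the Scala section works the same way in pyspark (run from $SPARK_HOME so README.md is found; outputs omitted):
>>> textFile = spark.read.text("README.md")
>>> textFile.count()
>>> sc.parallelize(range(100)).sum()
The last line is a small sanity check that the sc SparkContext itself is usable.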