SystemML Beginner Tutorial

Level: Beginner   |   Time: 20 minutes

How to set up and run Apache SystemML locally.



If you haven’t run Apache SystemML before, make sure to set up your environment first.

1. Install Homebrew

# macOS
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
# Linux
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Linuxbrew/install/master/install)"

2. Install Java and Python

brew tap caskroom/cask
brew install Caskroom/cask/java

brew install python
pip install jupyter matplotlib numpy


Download Spark and SystemML.

3. Download and Install Spark 1.6

brew tap homebrew/versions
brew install apache-spark16

4. Download and Install SystemML

Download systemml-0.10.0-incubating.zip and unzip it to a location of your choice.
This step is optional, but will make things much easier in the long run: add the SystemML location to your bash profile as an environment variable.

vim ~/.bash_profile
# Add the following line (adjust the path to your unzip location):
export SYSTEMML_HOME=/Users/stc/Documents/systemml-0.10.0-incubating
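With the variable set, later commands can point at the jar without hard-coding a location. A quick sketch (the install path here is an assumption — use wherever you actually unzipped the release):

```shell
# Assumed install location -- adjust to wherever you unzipped SystemML.
export SYSTEMML_HOME="$HOME/systemml-0.10.0-incubating"

# The jar shipped in the zip can then be referenced from any directory, e.g.:
#   spark-shell --jars "$SYSTEMML_HOME/SystemML.jar"
echo "$SYSTEMML_HOME/SystemML.jar"
```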

Similar to the Scala API, SystemML also provides a Python MLContext API. In addition to the regular SystemML.jar file, you’ll need to install the Python API as follows:

Python 2:
pip install systemml
# Bleeding edge: pip install git+git://github.com/apache/incubator-systemml.git#subdirectory=src/main/python

Python 3:
pip3 install systemml
# Bleeding edge: pip3 install git+git://github.com/apache/incubator-systemml.git#subdirectory=src/main/python

Ways to Use

How to Use SystemML.

5. In a Jupyter Notebook

Get Started

In your terminal, change to the folder containing your notebook, then start Jupyter with PySpark by copying and pasting the appropriate line below:

Python 2:
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master local[*] --driver-class-path SystemML.jar --jars SystemML.jar --conf "spark.driver.memory=12g" --conf spark.driver.maxResultSize=0 --conf spark.akka.frameSize=128 --conf spark.default.parallelism=100

Python 3:
PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master local[*] --driver-class-path SystemML.jar --jars SystemML.jar --conf "spark.driver.memory=12g" --conf spark.driver.maxResultSize=0 --conf spark.akka.frameSize=128 --conf spark.default.parallelism=100

6. Run SystemML in the Spark Shell

Start Spark Shell with SystemML

To use SystemML with Spark Shell, the SystemML jar can be referenced using Spark Shell’s --jars option. Start the Spark Shell with SystemML with the following line of code in your terminal:

spark-shell --executor-memory 4G --driver-memory 4G --jars SystemML.jar

Create the MLContext

To begin, create an MLContext by typing the code below. Once successful, you should see a “Welcome to Apache SystemML!” message.

import org.apache.sysml.api.mlcontext._
import org.apache.sysml.api.mlcontext.ScriptFactory._
val ml = new MLContext(sc)

Hello World

The ScriptFactory class allows DML and PYDML scripts to be created from Strings, Files, URLs, and InputStreams. Here, we’ll use the dml method to create a DML “hello world” script from a String. We execute the script using MLContext’s execute method, which displays “hello world” in the console. The execute method returns an MLResults object, which contains no results since the script has no outputs.

val helloScript = dml("print('hello world')")
ml.execute(helloScript)

DataFrame Example

As an example of how to use SystemML, we’ll first use Spark to create a DataFrame called df of random doubles from 0 to 1 consisting of 10,000 rows and 1,000 columns.

import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType,StructField,DoubleType}
import scala.util.Random
val numRows = 10000
val numCols = 1000
val data = sc.parallelize(0 to numRows-1).map { _ => Row.fromSeq(Seq.fill(numCols)(Random.nextDouble)) }
val schema = StructType((0 to numCols-1).map { i => StructField("C" + i, DoubleType, true) } )
val df = sqlContext.createDataFrame(data, schema)

We’ll create a DML script using the ScriptFactory dml method to find the minimum, maximum, and mean values in a matrix. This script has one input variable, matrix Xin, and three output variables, minOut, maxOut, and meanOut. For performance, we’ll specify metadata indicating that the matrix has 10,000 rows and 1,000 columns. We execute the script and obtain the results as a Tuple by calling getTuple on the results, specifying the types and names of the output variables.

val minMaxMean =
"""
minOut = min(Xin)
maxOut = max(Xin)
meanOut = mean(Xin)
"""
val mm = new MatrixMetadata(numRows, numCols)
val minMaxMeanScript = dml(minMaxMean).in("Xin", df, mm).out("minOut", "maxOut", "meanOut")
val (min, max, mean) = ml.execute(minMaxMeanScript).getTuple[Double, Double, Double]("minOut", "maxOut", "meanOut")

Many different types of input and output variables are supported. These types include Boolean, Long, Double, String, Array[Array[Double]], RDD and JavaRDD in CSV (dense) and IJV (sparse) formats, DataFrame, BinaryBlockMatrix, Matrix, and Frame. RDDs and JavaRDDs are assumed to be in CSV format unless MatrixMetadata is supplied indicating IJV format.
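To make the two RDD text formats concrete, here is a plain-Python illustration (independent of Spark and SystemML; the helper functions are hypothetical, for illustration only): CSV lines carry one dense comma-separated row each, while IJV lines carry one 1-based `row col value` triple per non-zero entry.

```python
# The same 2x2 matrix [[1.0, 2.0], [0.0, 4.0]] in both text formats
# that SystemML accepts for RDD inputs.

# CSV (dense): one comma-separated row per line.
csv_lines = ["1.0,2.0", "0.0,4.0"]

# IJV (sparse): 1-based "row col value" triples; zeros are omitted.
ijv_lines = ["1 1 1.0", "1 2 2.0", "2 2 4.0"]

def csv_to_dense(lines):
    """Parse CSV lines into a list-of-lists matrix."""
    return [[float(v) for v in line.split(",")] for line in lines]

def ijv_to_dense(lines, nrows, ncols):
    """Parse IJV triples into a dense matrix, filling missing cells with 0.0."""
    m = [[0.0] * ncols for _ in range(nrows)]
    for line in lines:
        i, j, v = line.split()
        m[int(i) - 1][int(j) - 1] = float(v)
    return m

# Both encodings describe the same matrix.
assert csv_to_dense(csv_lines) == ijv_to_dense(ijv_lines, 2, 2)
```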

RDD Example:

Let’s take a look at an example of input matrices as RDDs in CSV format. We’ll create two 2x2 matrices and input these into a DML script. This script will sum each matrix and create a message based on which sum is greater. We will output the sums and the message.

val rdd1 = sc.parallelize(Array("1.0,2.0", "3.0,4.0"))
val rdd2 = sc.parallelize(Array("5.0,6.0", "7.0,8.0"))
val sums = """
s1 = sum(m1);
s2 = sum(m2);
if (s1 > s2) {
  message = "s1 is greater"
} else if (s2 > s1) {
  message = "s2 is greater"
} else {
  message = "s1 and s2 are equal"
}
"""
val sumScript = dml(sums).in(Map("m1" -> rdd1, "m2" -> rdd2)).out("s1", "s2", "message")
val sumResults = ml.execute(sumScript)
val s1 = sumResults.getDouble("s1")
val s2 = sumResults.getDouble("s2")
val message = sumResults.getString("message")
Alternatively, we can supply MatrixMetadata for each RDD so SystemML knows the matrix dimensions up front:

val rdd1Metadata = new MatrixMetadata(2, 2)
val rdd2Metadata = new MatrixMetadata(2, 2)
val sumScript = dml(sums).in(Seq(("m1", rdd1, rdd1Metadata), ("m2", rdd2, rdd2Metadata))).out("s1", "s2", "message")
val (firstSum, secondSum, sumMessage) = ml.execute(sumScript).getTuple[Double, Double, String]("s1", "s2", "message")

Congratulations! You’ve now run examples in Apache SystemML!