# Overview

The Spark MLContext API offers a programmatic interface for interacting with SystemML from Spark using languages such as Scala, Java, and Python. As a result, it provides a convenient way to interact with SystemML from the Spark Shell and from notebooks such as Jupyter and Zeppelin.

# Spark Shell Example

## Start Spark Shell with SystemML

To use SystemML with Spark Shell, the SystemML jar can be referenced using Spark Shell’s --jars option.
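For example (the jar path and memory settings below are illustrative; adjust them for your environment):

```bash
spark-shell --executor-memory 4G --driver-memory 4G --jars SystemML.jar
```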

## Create MLContext

All primary classes that a user interacts with are located in the org.apache.sysml.api.mlcontext package. For convenience, we can additionally add a static import of ScriptFactory to shorten the syntax for creating Script objects. An MLContext object can be created by passing its constructor a reference to the SparkSession (spark) or SparkContext (sc). If successful, you should see a “Welcome to Apache SystemML!” message.
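A minimal sketch in the Spark Shell:

```scala
import org.apache.sysml.api.mlcontext._
import org.apache.sysml.api.mlcontext.ScriptFactory._

// create an MLContext from the SparkSession (a SparkContext can be passed instead)
val ml = new MLContext(spark)
```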

## Hello World

The ScriptFactory class allows DML and PYDML scripts to be created from Strings, Files, URLs, and InputStreams. Here, we’ll use the dml method to create a DML “hello world” script based on a String. Notice that the script reports that it has no inputs or outputs.
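For example:

```scala
// create a Script object from a String of DML
val helloScript = dml("print('hello world')")
```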

We execute the script using MLContext’s execute method, which displays “hello world” to the console. The execute method returns an MLResults object, which contains no results since the script has no outputs.
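Executing the script looks like this:

```scala
// prints "hello world" and returns an (empty) MLResults object
val helloResults = ml.execute(helloScript)
```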

## LeNet on MNIST Example

SystemML features the DML-based nn library for deep learning.

At project build time, SystemML automatically generates wrapper classes for DML scripts to enable convenient access to scripts and execution of functions. In the example below, we obtain a reference (clf) to the LeNet on MNIST example. We generate dummy data, train a convolutional net using the LeNet architecture, compute the class probability predictions, and then evaluate the convolutional net.
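A sketch of what this can look like in the Spark Shell; the wrapper accessor and function names below (ml.nn.examples.Mnist_lenet, generate_dummy_data, train, predict, eval) are assumptions based on the nn library's script functions and may differ in your build:

```scala
// Hypothetical: names are illustrative, not verbatim API
val clf = ml.nn.examples.Mnist_lenet        // generated wrapper for the LeNet on MNIST example

val dummyTrain = clf.generate_dummy_data    // dummy training data
val dummyVal = clf.generate_dummy_data      // dummy validation data

// train a convolutional net using the LeNet architecture (1 epoch)
val params = clf.train(dummyTrain.X, dummyTrain.Y, dummyVal.X, dummyVal.Y, dummyTrain.C, dummyTrain.Hin, dummyTrain.Win, 1)

// compute class probability predictions and evaluate the net
val probs = clf.predict(dummyTrain.X, dummyTrain.C, dummyTrain.Hin, dummyTrain.Win, params)
val perf = clf.eval(probs, dummyTrain.Y)
```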

Note that these automatic script wrappers are currently not available in Python but will be made available in the near future.

## DataFrame Example

For demonstration purposes, we’ll use Spark to create a DataFrame called df of random doubles from 0 to 1 consisting of 10,000 rows and 100 columns.
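A sketch using standard Spark APIs:

```scala
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType, StructField, DoubleType}
import scala.util.Random

val numRows = 10000
val numCols = 100
// one Row of random doubles in [0, 1) per matrix row
val data = sc.parallelize(0 until numRows).map { _ => Row.fromSeq(Seq.fill(numCols)(Random.nextDouble)) }
val schema = StructType((0 until numCols).map { i => StructField("C" + i, DoubleType, true) })
val df = spark.createDataFrame(data, schema)
```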

We’ll create a DML script to find the minimum, maximum, and mean values in a matrix. This script has one input variable, matrix Xin, and three output variables, minOut, maxOut, and meanOut.
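A sketch of the script String:

```scala
val minMaxMean =
"""
minOut = min(Xin)
maxOut = max(Xin)
meanOut = mean(Xin)
"""
```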

For performance, we’ll specify metadata indicating that the matrix has 10,000 rows and 100 columns.
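For example:

```scala
// metadata stating the matrix dimensions (10,000 x 100)
val mm = new MatrixMetadata(numRows, numCols)
```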

We’ll create a DML script using the ScriptFactory dml method with the minMaxMean script String. The input variable is specified to be our DataFrame df with MatrixMetadata mm. The output variables are specified to be minOut, maxOut, and meanOut. Notice that inputs are supplied by the in method, and outputs are supplied by the out method.
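Putting this together:

```scala
val minMaxMeanScript = dml(minMaxMean).in("Xin", df, mm).out("minOut", "maxOut", "meanOut")
```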

We execute the script and obtain the results as a Tuple by calling getTuple on the results, specifying the types and names of the output variables.
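For example:

```scala
val (min, max, mean) = ml.execute(minMaxMeanScript).getTuple[Double, Double, Double]("minOut", "maxOut", "meanOut")
```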

Many different types of input and output variables are supported. These types include Boolean, Long, Double, String, Array[Array[Double]], RDD&lt;String&gt; and JavaRDD&lt;String&gt; in CSV (dense) and IJV (sparse) formats, DataFrame, Matrix, and Frame. RDDs and JavaRDDs are assumed to be in CSV format unless MatrixMetadata is supplied indicating IJV format.

## RDD Example

Let’s take a look at an example of input matrices as RDDs in CSV format. We’ll create two 2x2 matrices and input these into a DML script. This script will sum each matrix and create a message based on which sum is greater. We will output the sums and the message.
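A sketch of this example:

```scala
val rdd1 = sc.parallelize(Array("1.0,2.0", "3.0,4.0"))
val rdd2 = sc.parallelize(Array("5.0,6.0", "7.0,8.0"))

val sums = """
s1 = sum(m1)
s2 = sum(m2)
if (s1 > s2) {
  message = "s1 is greater"
} else if (s2 > s1) {
  message = "s2 is greater"
} else {
  message = "s1 and s2 are equal"
}
"""

val sumScript = dml(sums).in("m1", rdd1).in("m2", rdd2).out("s1", "s2", "message")
val sumResults = ml.execute(sumScript)
```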

For fun, we’ll write the script String to a file and then use ScriptFactory’s dmlFromFile method (in Python, this method is under the systemml package) to create the script object based on the file. We’ll also specify the inputs using a Map, although we could have also chained together two in methods to specify the same inputs.
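For example (using java.nio to write the file):

```scala
java.nio.file.Files.write(java.nio.file.Paths.get("sums.dml"), sums.getBytes)
val sumScript = dmlFromFile("sums.dml").in(Map("m1" -> rdd1, "m2" -> rdd2)).out("s1", "s2", "message")
val (s1, s2, message) = ml.execute(sumScript).getTuple[Double, Double, String]("s1", "s2", "message")
```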

If you have metadata that you would like to supply along with the input matrices, this can be accomplished using a Scala Seq, List, or Array. This feature is currently not available in Python.
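A sketch using a Seq of (name, value, metadata) tuples:

```scala
val rdd1Metadata = new MatrixMetadata(2, 2)
val rdd2Metadata = new MatrixMetadata(2, 2)
val sumScript = dmlFromFile("sums.dml").in(Seq(("m1", rdd1, rdd1Metadata), ("m2", rdd2, rdd2Metadata))).out("s1", "s2", "message")
```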

The same inputs with metadata can be supplied by chaining in methods, as in the example below, which shows that out methods can also be chained.
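For example:

```scala
val sumScript = dmlFromFile("sums.dml").in("m1", rdd1, rdd1Metadata).in("m2", rdd2, rdd2Metadata).out("s1").out("s2").out("message")
val (firstSum, secondSum, message) = ml.execute(sumScript).getTuple[Double, Double, String]("s1", "s2", "message")
```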

## Matrix Output

Let’s look at an example of reading a matrix out of SystemML. We’ll create a DML script in which we create a 2x2 matrix m. We’ll set the variable n to be the sum of the cells in the matrix.
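A sketch of the script String (the cell values 11, 22, 33, and 44 sum to the 110.0 mentioned below):

```scala
val s = """
m = matrix("11 22 33 44", rows=2, cols=2)
n = sum(m)
"""
```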

We create a script object using String s, and we set m and n as the outputs. We execute the script, and in the results we see that we have Matrix m and Double n. The n output variable has a value of 110.0.
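For example:

```scala
val scr = dml(s).out("m", "n")
val res = ml.execute(scr)
```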

We get Matrix m and Double n as a Tuple of values x and y.
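For example:

```scala
val (x, y) = res.getTuple[Matrix, Double]("m", "n")
```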

In Scala, we then convert Matrix m to an RDD of IJV values, an RDD of CSV values, a DataFrame, and a two-dimensional Double Array, and we display the values in each of these data structures.
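A sketch of these conversions (assuming the Matrix conversion methods shown below):

```scala
val rddStringIJV = x.toRDDStringIJV
val rddStringCSV = x.toRDDStringCSV
val dfM = x.toDF
val arr2d = x.to2DDoubleArray

rddStringIJV.collect.foreach(println)
rddStringCSV.collect.foreach(println)
dfM.show
arr2d.foreach(row => println(row.mkString(" ")))
```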

In Python, we use the toDF() and toNumPy() methods to get the matrix as a PySpark DataFrame or a NumPy array, respectively.

## Univariate Statistics on Haberman Data

Our next example will involve Haberman’s Survival Data Set in CSV format from the Center for Machine Learning and Intelligent Systems. We will run the SystemML Univariate Statistics (“Univar-Stats.dml”) script on this data.

We’ll pull the data from a URL and convert it to an RDD, habermanRDD. Next, we’ll create metadata, habermanMetadata, stating that the matrix consists of 306 rows and 4 columns.
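A sketch, pulling the data from the UCI repository:

```scala
val habermanUrl = "http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data"
val habermanList = scala.io.Source.fromURL(habermanUrl).mkString.split("\n")
val habermanRDD = sc.parallelize(habermanList)
val habermanMetadata = new MatrixMetadata(306, 4)
```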

As we can see from the comments in the script here, the script requires a ‘TYPES’ input matrix that lists the types of the features (1 for scale, 2 for nominal, 3 for ordinal), so we create a typesRDD matrix consisting of 1 row and 4 columns, with corresponding metadata, typesMetadata.
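A sketch of the types matrix and the script execution; this assumes the script reads its data and types matrices as A and K and honors a $CONSOLE_OUTPUT parameter:

```scala
// age, year, and nodes are scale features; survival status is nominal
val typesRDD = sc.parallelize(Array("1.0,1.0,1.0,2.0"))
val typesMetadata = new MatrixMetadata(1, 4)

val scriptUrl = "https://raw.githubusercontent.com/apache/systemml/master/scripts/algorithms/Univar-Stats.dml"
val uni = dmlFromUrl(scriptUrl).in("A", habermanRDD, habermanMetadata).in("K", typesRDD, typesMetadata).in("$CONSOLE_OUTPUT", true)
ml.execute(uni)
```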

## Script Information

The info method on a Script object can provide useful information about a DML or PyDML script, such as the inputs, outputs, symbol table, script string, and the script execution string that is passed to the internals of SystemML.
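For example:

```scala
println(minMaxMeanScript.info)
```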

## Clearing Scripts and MLContext

Dealing with large matrices can require a significant amount of memory. To help deal with this, you can call a Script object’s clearAll method to clear the inputs, outputs, symbol table, and script string. In terms of memory, the symbol table is the most important of these, since it holds references to matrices.

In this example, we display the symbol table of the minMaxMeanScript, call clearAll on the script, and then display the symbol table, which is empty.
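A sketch (assuming a displaySymbolTable method on Script):

```scala
println(minMaxMeanScript.displaySymbolTable)
minMaxMeanScript.clearAll
println(minMaxMeanScript.displaySymbolTable)   // now empty
```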

The MLContext object holds references to the scripts that have been executed. Calling clear on the MLContext clears all scripts that it has references to and then removes the references to these scripts.
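For example:

```scala
ml.clear
```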

## Statistics

Statistics about script executions can be output to the console by calling MLContext’s setStatistics method with a value of true.
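For example:

```scala
ml.setStatistics(true)   // subsequent script executions print statistics to the console
```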

## GPU

If the driver node has a GPU, SystemML may be able to utilize it, subject to memory constraints and the instructions used in the DML script.
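A sketch (assuming a setGPU method on MLContext):

```scala
ml.setGPU(true)   // allow GPU instructions, subject to GPU memory constraints
```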

Note that GPU instructions appear prefixed with “gpu” in the statistics output.

## Explain

A DML or PyDML script is converted into a SystemML program during script execution. Information about this program can be displayed by calling MLContext’s setExplain method with a value of true.

Different explain levels can be set. The explain levels are NONE, HOPS, RUNTIME, RECOMPILE_HOPS, and RECOMPILE_RUNTIME.
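A sketch (assuming the ExplainLevel enum nested in MLContext):

```scala
ml.setExplain(true)
ml.setExplainLevel(MLContext.ExplainLevel.RUNTIME)
```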

## Script Creation and ScriptFactory

Script objects can be created using standard Script constructors. A Script can be one of two types: DML (R-based syntax) and PYDML (Python-based syntax). If no ScriptType is specified, the default Script type is DML.
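For example:

```scala
val dmlScript = new Script()                     // defaults to DML
val pydmlScript = new Script(ScriptType.PYDML)   // explicitly PYDML
```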

The ScriptFactory class offers convenient methods for creating DML and PYDML scripts from a variety of sources. ScriptFactory can create a script object from a String, File, URL, or InputStream.

Script from URL:

Here we create Script object s1 by reading Univar-Stats.dml from a URL.
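For example:

```scala
val uniUrl = "https://raw.githubusercontent.com/apache/systemml/master/scripts/algorithms/Univar-Stats.dml"
val s1 = dmlFromUrl(uniUrl)
```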

Script from String:

We create Script objects s2 and s3 from Strings using ScriptFactory’s dml and dmlFromString methods. Both methods perform the same action. This example reads an algorithm at a URL to String uniString and then creates two script objects based on this String.
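For example:

```scala
val uniString = scala.io.Source.fromURL(uniUrl).mkString
val s2 = dml(uniString)
val s3 = dmlFromString(uniString)
```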

Script from File:

We create Script object s4 based on a path to a file using ScriptFactory’s dmlFromFile method. This example reads a URL to a String, writes this String to a file, and then uses the path to the file to create a Script object.
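For example:

```scala
java.nio.file.Files.write(java.nio.file.Paths.get("uni.dml"), uniString.getBytes)
val s4 = dmlFromFile("uni.dml")
```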

Script from InputStream:

The SystemML jar file contains all the primary algorithm scripts. We can read one of these scripts as an InputStream and use this to create a Script object.
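A sketch (the resource path inside the jar is an assumption):

```scala
val inStream = getClass.getResourceAsStream("/scripts/algorithms/Univar-Stats.dml")
val s5 = dmlFromInputStream(inStream)
```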

Script from Resource:

As mentioned, the SystemML jar file contains all the primary algorithm script files. For convenience, we can read these script files or other script files on the classpath using ScriptFactory’s dmlFromResource and pydmlFromResource methods.
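A sketch (again, the resource path is an assumption):

```scala
val s6 = dmlFromResource("/scripts/algorithms/Univar-Stats.dml")
```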

## ScriptExecutor

A Script is executed by a ScriptExecutor. If no ScriptExecutor is specified, a default ScriptExecutor will be created to execute a Script. Script execution consists of several steps, as detailed in SystemML’s Optimizer: Plan Generation for Large-Scale Machine Learning Programs. Additional information can be found in the Javadocs for ScriptExecutor.

Advanced users may find it useful to supply their own ScriptExecutor or to override ScriptExecutor methods by subclassing ScriptExecutor.

In this example, we override the parseScript and validateScript methods to display messages to the console during these execution steps.
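A sketch of such a subclass, assuming MLContext’s execute overload that accepts a ScriptExecutor:

```scala
val scriptExecutor = new ScriptExecutor {
  override def parseScript() = {
    println("About to parse script")
    super.parseScript()
  }
  override def validateScript() = {
    println("About to validate script")
    super.validateScript()
  }
}
ml.execute(helloScript, scriptExecutor)
```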

## Matrix Metadata

When supplying matrix data to Apache SystemML using the MLContext API, matrix metadata can be supplied using a MatrixMetadata object. Supplying characteristics about a matrix can significantly improve performance. For some types of input matrices, supplying metadata is mandatory. Metadata at a minimum typically consists of the number of rows and columns in a matrix. The number of non-zeros can also be supplied.

Additionally, the number of rows and columns per block can be supplied, although in typical usage it’s probably fine to use the default values used by SystemML (1,000 rows and 1,000 columns per block). SystemML handles a matrix internally by splitting the matrix into chunks, or blocks. The number of rows and columns per block refers to the size of these matrix blocks.

Here we see an example of inputting an RDD of Strings in CSV format with no metadata. Note that in general it is recommended that metadata is supplied. We output the sum and mean of the cells in the matrix.
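For example:

```scala
val dmlSumMean = """
total = sum(m)
avg = mean(m)
"""
val rddCSV = sc.parallelize(Array("1.0,2.0", "3.0,4.0"))
val csvScript = dml(dmlSumMean).in("m", rddCSV).out("total", "avg")
val (total, avg) = ml.execute(csvScript).getTuple[Double, Double]("total", "avg")
```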

Next, we’ll supply an RDD in IJV format. IJV is a sparse format where each line has three space-separated values. The first value indicates the row number, the second value indicates the column number, and the third value indicates the cell value. Since the total numbers of rows and columns can’t be determined from these IJV rows, we need to supply metadata describing the matrix size.

Here, we specify that our matrix has 3 rows and 3 columns.
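For example:

```scala
val rddIJV = sc.parallelize(Array("1 1 1.0", "2 2 2.0", "3 3 3.0"))
val mmIJV = new MatrixMetadata(MatrixFormat.IJV, 3, 3)
val ijvScript = dml(dmlSumMean).in("m", rddIJV, mmIJV).out("total", "avg")
val (total, avg) = ml.execute(ijvScript).getTuple[Double, Double]("total", "avg")
```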

Next, we’ll run the same DML, but this time we’ll specify that the input matrix is 4x4 instead of 3x3.
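For example:

```scala
// same IJV data, now declared as a 4x4 matrix (the extra row and column are empty)
val mmIJV4 = new MatrixMetadata(MatrixFormat.IJV, 4, 4)
val ijvScript4 = dml(dmlSumMean).in("m", rddIJV, mmIJV4).out("total", "avg")
val (total, avg) = ml.execute(ijvScript4).getTuple[Double, Double]("total", "avg")
```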

## Matrix Data Conversions and Performance

Internally, Apache SystemML uses a binary-block matrix representation, where a matrix is represented as a grouping of blocks. Each block is equal in size to the other blocks in the matrix and consists of a number of rows and columns. The default block size is 1,000 rows by 1,000 columns.

Conversion of a large set of data to a SystemML matrix representation can potentially be time-consuming. Therefore, if you use a set of data multiple times, one way to potentially improve performance is to convert it to a SystemML matrix representation and then use this representation rather than performing the data conversion each time.

If you have an input DataFrame, it can be converted to a Matrix, and this Matrix can be passed as an input rather than passing in the DataFrame as an input.

For example, suppose we had a 10000x100 matrix represented as a DataFrame, as we saw in an earlier example. Now suppose we create two Script objects with the DataFrame as an input, as shown below. In the Spark Shell, when executing this code, you can see that each of the two Script object creations requires the time-consuming data conversion step.
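A sketch of the two Script creations (each one triggers a DataFrame-to-binary-block conversion):

```scala
val script1 = dml("meanOut = mean(Xin)").in("Xin", df, mm).out("meanOut")
val script2 = dml("minOut = min(Xin)").in("Xin", df, mm).out("minOut")
```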

Rather than passing in a DataFrame each time to the Script object creation, let’s instead create a Matrix object based on the DataFrame and pass this Matrix to the Script object creation. If we run the code below in the Spark Shell, we see that the data conversion step occurs when the Matrix object is created. However, when we create a Script object twice, we see that no conversion penalty occurs, since this conversion occurred when the Matrix was created.
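A sketch; the Matrix constructor accepting a DataFrame and metadata is an assumption (check the org.apache.sysml.api.mlcontext.Matrix Javadocs for the exact signature):

```scala
// conversion to SystemML's binary-block representation happens here, once
val matrix = new Matrix(df, mm)

// no conversion penalty for either script creation
val script1 = dml("meanOut = mean(Xin)").in("Xin", matrix).out("meanOut")
val script2 = dml("minOut = min(Xin)").in("Xin", matrix).out("minOut")
```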

When a matrix is returned as an output, it is returned as a Matrix object, which is a wrapper around a SystemML MatrixObject. As a result, an output Matrix is already in a SystemML representation, meaning that it can be passed as an input with no data conversion penalty.

As an example, here we read in matrix x as an RDD in CSV format. We create a Script that adds one to all values in the matrix. We obtain the resulting matrix y as a Matrix. We execute the script five times, feeding the output matrix as the input matrix for the next script execution.
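A sketch of this loop:

```scala
val xRdd = sc.parallelize(Array("1.0,2.0", "3.0,4.0"))

// first execution converts the RDD input
var y = ml.execute(dml("y = X + 1").in("X", xRdd).out("y")).getMatrix("y")

// four more executions, each feeding the previous output Matrix back in
// (no conversion penalty, since a Matrix is already in SystemML representation)
for (i <- 1 to 4) {
  y = ml.execute(dml("y = X + 1").in("X", y).out("y")).getMatrix("y")
}
y.toRDDStringCSV.collect.foreach(println)
```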

## Project Information

SystemML project information such as the version and build time can be obtained through the MLContext API. The project version can be obtained by calling ml.version, and the build time by calling ml.buildTime. The contents of the project manifest can be displayed using ml.info, and individual manifest properties can be obtained using the ml.info.property method, as shown below.
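For example (the manifest property name below is illustrative):

```scala
println(ml.version)
println(ml.buildTime)
println(ml.info)
println(ml.info.property("Minimum-Recommended-Spark-Version"))   // illustrative property name
```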

# Jupyter (PySpark) Notebook Example - Poisson Nonnegative Matrix Factorization

Similar to the Scala API, SystemML also provides a Python MLContext API. Before using it, you’ll need to install the SystemML Python package.

Here, we’ll explore the use of SystemML via PySpark in a Jupyter notebook. This Jupyter notebook example can be viewed in a rendered state on GitHub and downloaded to a directory of your choice.

To launch Jupyter with PySpark, start pyspark with Jupyter as the driver Python (use the second form to explicitly run on Python 3):

```bash
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark

PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
```

This will open Jupyter in a browser.

We can then open up the SystemML-PySpark-Recommendation-Demo notebook.

# Recommended Spark Configuration Settings

For best performance, we recommend setting the following flags when running SystemML with Spark: --conf spark.driver.maxResultSize=0 --conf spark.akka.frameSize=128.