systemml package

Submodules

systemml.classloader module

systemml.classloader.createJavaObject(sc, obj_type)

Performs the appropriate check to verify that SystemML.jar is available, and returns a handle to the MLContext object on the JVM.

Parameters:
  • sc (SparkContext) – SparkContext
  • obj_type (Type of object to create ('mlcontext' or 'dummy')) –
class systemml.classloader.jvm_stdout(parallel_flush=False)

Bases: object

This is a useful utility class for getting the output of the driver JVM from within a Jupyter notebook.

Parameters:parallel_flush (boolean) – Whether to flush stdout in parallel
flush_stdout()
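
A minimal usage sketch (assuming ml is an MLContext instance and script a Script instance, as defined in the systemml.mlcontext module below); jvm_stdout is used as a context manager:

>>> from systemml.classloader import jvm_stdout
>>> with jvm_stdout(parallel_flush=True):
...     results = ml.execute(script)
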
systemml.classloader.set_default_jvm_stdout(enable, parallel_flush=True)

This is a useful utility method for getting the output of the driver JVM from within a Jupyter notebook.

Parameters:
  • enable (boolean) – Whether to flush stdout by default when mlcontext.execute is invoked
  • parallel_flush (boolean) – Whether to flush stdout in parallel
systemml.classloader.get_spark_context()

Internal method to get the already initialized SparkContext. Developers should always use get_spark_context() instead of SparkContext._active_spark_context to ensure that SystemML is loaded.

Returns:sc – SparkContext
Return type:SparkContext

systemml.converters module

systemml.converters.getNumCols(numPyArr)
systemml.converters.convertToMatrixBlock(sc, src, maxSizeBlockInMB=8)
systemml.converters.convert_caffemodel(sc, deploy_file, caffemodel_file, output_dir, format='binary', is_caffe_installed=False)

Saves the weights and biases in the caffemodel file to output_dir in the specified format. This method does not require Caffe to be installed.

Parameters:
  • sc (SparkContext) – SparkContext
  • deploy_file (string) – Path to the input network file
  • caffemodel_file (string) – Path to the input caffemodel file
  • output_dir (string) – Path to the output directory
  • format (string) – Format of the weights and bias (can be binary, csv or text)
  • is_caffe_installed (bool) – True if caffe is installed
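
A usage sketch (the file paths below are hypothetical placeholders; sc is an active SparkContext):

>>> from systemml.converters import convert_caffemodel
>>> convert_caffemodel(sc, 'VGG_ILSVRC_19_layers_deploy.proto',
...                    'VGG_ILSVRC_19_layers.caffemodel',
...                    '/tmp/vgg_weights', format='binary')
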
systemml.converters.convert_lmdb_to_jpeg(lmdb_img_file, output_dir)

Saves the images in the lmdb file as JPEG files in output_dir. This method requires Caffe to be installed along with the lmdb and cv2 packages. To install the cv2 package, run pip install opencv-python.

Parameters:
  • lmdb_img_file (string) – Path to the input lmdb file
  • output_dir (string) – Output directory for images (local filesystem)
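
A usage sketch (paths are hypothetical placeholders):

>>> from systemml.converters import convert_lmdb_to_jpeg
>>> convert_lmdb_to_jpeg('train_lmdb', '/tmp/jpeg_images')
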
systemml.converters.convertToNumPyArr(sc, mb)
systemml.converters.convertToPandasDF(X)
systemml.converters.convertToLabeledDF(sparkSession, X, y=None)
systemml.converters.convertImageToNumPyArr(im, img_shape=None, add_rotated_images=False, add_mirrored_images=False, color_mode='RGB', mean=None)
systemml.converters.getDatasetMean(dataset_name)
Parameters:dataset_name (string) – Name of the dataset used to train the model. This is an artificial name based on the dataset the model was trained on.
Returns:mean
Return type:Mean value of the model if it is defined in the DATASET_MEAN list, else None.
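
A usage sketch (the dataset name below is illustrative; the call returns None for names not present in DATASET_MEAN):

>>> from systemml.converters import getDatasetMean
>>> mean = getDatasetMean('VGG_ILSVRC_19_2014')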

systemml.defmatrix module

systemml.defmatrix.setSparkContext(sc)

Before using matrix, the user needs to invoke this function if a SparkContext was not previously created in the session.

Parameters:sc (SparkContext) – SparkContext
class systemml.defmatrix.matrix(data, op=None)

Bases: object

The matrix class is a Python wrapper that implements basic matrix operators and matrix functions, as well as converters to common Python types (for example: NumPy arrays, PySpark DataFrames and Pandas DataFrames).

The operators supported are:

  1. Arithmetic operators: +, -, *, /, //, %, ** as well as dot (i.e. matrix multiplication)
  2. Indexing in the matrix
  3. Relational/Boolean operators: <, <=, >, >=, ==, !=, &, |

In addition, the following functions are supported for matrix:

  1. transpose
  2. Aggregation functions: sum, mean, var, sd, max, min, argmin, argmax, cumsum
  3. Global statistical built-in functions: exp, log, abs, sqrt, round, floor, ceil, ceiling, sin, cos, tan, asin, acos, atan, sign, solve

All of the above functions return a two-dimensional matrix; this is worth noting for aggregation functions with an axis argument. For example: assuming m1 is a matrix of shape (3, n), NumPy returns a 1d vector of dimension (3,) for the operation m1.sum(axis=1), whereas SystemML returns a 2d matrix of dimension (3, 1).

Note: an evaluated matrix contains a data field computed by the eval method, as a DataFrame or NumPy array.

Examples

>>> import systemml as sml
>>> import numpy as np
>>> sml.setSparkContext(sc)

Welcome to Apache SystemML!

>>> m1 = sml.matrix(np.ones((3,3)) + 2)
>>> m2 = sml.matrix(np.ones((3,3)) + 3)
>>> m2 = m1 * (m2 + m1)
>>> m4 = 1.0 - m2
>>> m4
# This matrix (mVar5) is backed by below given PyDML script (which is not yet evaluated). To fetch the data of this matrix, invoke toNumPy() or toDF() or toPandas() methods.
mVar1 = load(" ", format="csv")
mVar2 = load(" ", format="csv")
mVar3 = mVar2 + mVar1
mVar4 = mVar1 * mVar3
mVar5 = 1.0 - mVar4
save(mVar5, " ")
>>> m2.eval()
>>> m2
# This matrix (mVar4) is backed by NumPy array. To fetch the NumPy array, invoke toNumPy() method.
>>> m4
# This matrix (mVar5) is backed by below given PyDML script (which is not yet evaluated). To fetch the data of this matrix, invoke toNumPy() or toDF() or toPandas() methods.
mVar4 = load(" ", format="csv")
mVar5 = 1.0 - mVar4
save(mVar5, " ")
>>> m4.sum(axis=1).toNumPy()
array([[-60.],
       [-60.],
       [-60.]])

Design Decisions:

  1. Until the eval() method is invoked, we create an AST (not exposed to the user) that consists of unevaluated operations and the data required by those operations. As an analogy, a Spark user can treat the eval() method as similar to calling RDD.persist() followed by RDD.count().

  2. The AST consists of two kinds of nodes: either of type matrix or of type DMLOp. Both these classes expose a _visit method, which helps in traversing the AST in a DFS manner.

  3. A matrix object can either be evaluated or not. If evaluated, the attribute ‘data’ is set to one of the supported types (for example: NumPy array or DataFrame) and the attribute ‘op’ is set to None. If not evaluated, the attribute ‘op’ refers to one of the intermediate nodes of the AST and is of type DMLOp, and the attribute ‘data’ is set to None.

  4. DMLOp has an attribute ‘inputs’ which contains a list of matrix objects or DMLOps.

  5. To simplify the traversal, every matrix object is considered immutable and a matrix operation creates a new matrix object. As an example: m1 = sml.matrix(np.ones((3,3))) creates a matrix object backed by ‘data=np.ones((3,3))’. m1 = m1 * 2 will create a new matrix object backed by ‘op=DMLOp( … )’ whose input is the earlier created matrix object.

  6. Left indexing (implemented in the __setitem__ method) is a special case, where Python expects the existing object to be mutated. To preserve the above property, we make a deep copy of the existing object and point any references to the left-indexed matrix to the newly created object. The left-indexed matrix is then set to be backed by a DMLOp consisting of the following PyDML: left-indexed-matrix = new-deep-copied-matrix; left-indexed-matrix[index] = value.

  7. Please use m.print_ast() and/or type m for debugging. Here is a sample session:

    >>> npm = np.ones((3,3))
    >>> m1 = sml.matrix(npm + 3)
    >>> m2 = sml.matrix(npm + 5)
    >>> m3 = m1 + m2
    >>> m3
    mVar2 = load(" ", format="csv")
    mVar1 = load(" ", format="csv")
    mVar3 = mVar1 + mVar2
    save(mVar3, " ")
    >>> m3.print_ast()
    - [mVar3] (op).
      - [mVar1] (data).
      - [mVar2] (data).
    
THROW_ARRAY_CONVERSION_ERROR = False
abs()
acos()
arccos()
arcsin()
arctan()
argmax(axis=None)

Returns the indices of the maximum values along an axis.

Parameters:axis (int, optional (only axis=1, i.e. rowIndexMax is supported in this version)) –
argmin(axis=None)

Returns the indices of the minimum values along an axis.

Parameters:axis (int, optional (only axis=1, i.e. rowIndexMin is supported in this version)) –
asfptype()
asin()
astype(t)
atan()
ceil()
ceiling()
cos()
cosh()
cumsum(axis=None)

Returns the cumulative sum of the matrix along the given axis.

Parameters:axis (int, optional (only axis=0, i.e. cumsum along the rows is supported in this version)) –
deg2rad()

Convert angles from degrees to radians.

dml = []
dot(other)

NumPy way of performing matrix multiplication.
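
A short sketch (continuing the setup from the class example above, where sml, np and a SparkContext are available):

>>> m1 = sml.matrix(np.ones((3, 2)))
>>> m2 = sml.matrix(np.ones((2, 4)))
>>> m1.dot(m2).toNumPy().shape
(3, 4)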

eval()

This is a convenience function that calls the global eval method

exp()
exp2()
expm1()
floor()
get_shape()
hstack(other)

Stack matrices horizontally (column wise). Invokes cbind internally.
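
For example (continuing the setup from the class example above):

>>> m1 = sml.matrix(np.ones((3, 3)))
>>> m1.hstack(m1).toNumPy().shape
(3, 6)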

ldexp(other)
log(y=None)
log10()
log1p()
log2()
logaddexp(other)
logaddexp2(other)
logical_not()
max(other=None, axis=None)

Compute the maximum value along the specified axis

Parameters:
  • other (matrix or numpy array (& other supported types) or scalar) –
  • axis (int, optional) –
mean(axis=None)

Compute the arithmetic mean along the specified axis

Parameters:axis (int, optional) –
min(other=None, axis=None)

Compute the minimum value along the specified axis

Parameters:
  • other (matrix or numpy array (& other supported types) or scalar) –
  • axis (int, optional) –
ml = None
mod(other)
moment(moment=1, axis=None)

Calculates the nth moment about the mean

Parameters:
  • moment (int) – can be 1, 2, 3 or 4
  • axis (int, optional) –
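
A sketch (continuing the setup from the class example above):

>>> m = sml.matrix(np.array([[1.0], [2.0], [3.0], [4.0]]))
>>> second_moment = m.moment(moment=2, axis=None).toNumPy()
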
ndim = 2
negative()
ones_like()
print_ast()

Please use m.print_ast() and/or type m for debugging. Here is a sample session:

>>> npm = np.ones((3,3))
>>> m1 = sml.matrix(npm + 3)
>>> m2 = sml.matrix(npm + 5)
>>> m3 = m1 + m2
>>> m3
mVar2 = load(" ", format="csv")
mVar1 = load(" ", format="csv")
mVar3 = mVar1 + mVar2
save(mVar3, " ")
>>> m3.print_ast()
- [mVar3] (op).
  - [mVar1] (data).
  - [mVar2] (data).
prod()

Return the product of all cells in the matrix.

rad2deg()

Convert angles from radians to degrees.

reciprocal()
remainder(other)
remove_empty(axis=None)

Removes all empty rows or columns from the input matrix X, according to the specified axis.

Parameters:axis (int (0 or 1)) –
replace(pattern=None, replacement=None)

Replaces all occurrences of the given pattern in the matrix with the given replacement value.

Parameters:
  • pattern (float or int) –
  • replacement (float or int) –
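
For example (continuing the setup from the class example above), replacing zeros with -1:

>>> m = sml.matrix(np.array([[1.0, 0.0], [0.0, 2.0]]))
>>> cleaned = m.replace(pattern=0.0, replacement=-1.0)
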
round()
save(file, format='csv')

Allows the user to save a matrix to the filesystem.

Parameters:
  • file (filepath) –
  • format (can be csv, text or binary or mm) –
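
A round-trip sketch (the path below is a hypothetical placeholder):

>>> m = sml.matrix(np.ones((3, 3)))
>>> m.save('/tmp/m.csv', format='csv')
>>> m2 = sml.load('/tmp/m.csv', format='csv')
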
script = None
sd(axis=None)

Compute the standard deviation along the specified axis

Parameters:axis (int, optional) –
set_shape(shape)
shape
sign()
sin()
sinh()
sqrt()
square()
sum(axis=None)

Compute the sum along the specified axis

Parameters:axis (int, optional) –
systemmlVarID = 0
tan()
tanh()
toDF()

This is a convenience function that calls the global eval method and then converts the matrix object into a PySpark DataFrame.

toNumPy()

This is a convenience function that calls the global eval method and then converts the matrix object into a NumPy array.

toPandas()

This is a convenience function that calls the global eval method and then converts the matrix object into a Pandas DataFrame.

trace()

Return the sum of the cells on the main diagonal of a square matrix.

transpose()

Transposes the matrix.

var(axis=None)

Compute the variance along the specified axis. We assume that the delta degrees of freedom is 1 (unlike NumPy, which assumes ddof=0).

Parameters:axis (int, optional) –
visited = []
vstack(other)

Stack matrices vertically (row wise). Invokes rbind internally.

zeros_like()
systemml.defmatrix.eval(outputs, execute=True)

Executes the unevaluated DML script and computes the matrices specified by outputs.

Parameters:
  • outputs (list of matrices or a matrix object) –
  • execute (boolean) – specifies whether to execute the unevaluated operations or just return the script.
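
A sketch of evaluating several unevaluated matrices in a single script (assuming the setup from the matrix class example):

>>> m1 = sml.matrix(np.ones((3, 3)))
>>> m2 = m1 * 2
>>> m3 = m1 + m2
>>> sml.eval([m2, m3])
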
systemml.defmatrix.solve(A, b)

Computes the least squares solution for the system of linear equations A %*% x = b.

Examples

>>> import numpy as np
>>> from sklearn import datasets
>>> import systemml as sml
>>> from pyspark.sql import SparkSession
>>> diabetes = datasets.load_diabetes()
>>> diabetes_X = diabetes.data[:, np.newaxis, 2]
>>> X_train = diabetes_X[:-20]
>>> X_test = diabetes_X[-20:]
>>> y_train = diabetes.target[:-20]
>>> y_test = diabetes.target[-20:]
>>> sml.setSparkContext(sc)
>>> X = sml.matrix(X_train)
>>> y = sml.matrix(y_train)
>>> A = X.transpose().dot(X)
>>> b = X.transpose().dot(y)
>>> beta = sml.solve(A, b).toNumPy()
>>> y_predicted = X_test.dot(beta)
>>> print('Residual sum of squares: %.2f' % np.mean((y_predicted - y_test) ** 2))
Residual sum of squares: 25282.12
class systemml.defmatrix.DMLOp(inputs, dml=None)

Bases: object

Represents an intermediate node of the abstract syntax tree created to generate the PyDML script.

MAX_DEPTH = 0
systemml.defmatrix.set_lazy(isLazy)

This method allows users to set whether matrix operations should be executed in a lazy manner.

Parameters:isLazy (boolean) – True if matrix operations should be evaluated in a lazy manner.
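
For example (assuming the setup from the matrix class example; lazy evaluation is the default):

>>> sml.set_lazy(False)  # evaluate operations eagerly from now on
>>> m = sml.matrix(np.ones((3, 3))) + 1
>>> sml.set_lazy(True)   # restore the default lazy evaluation
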
systemml.defmatrix.debug_array_conversion(throwError)
systemml.defmatrix.load(file, format='csv')

Allows the user to load a matrix from the filesystem.

Parameters:
  • file (filepath) –
  • format (can be csv, text or binary or mm) –
systemml.defmatrix.full(shape, fill_value)

Return a new array of given shape filled with fill_value.

Parameters:
  • shape (tuple of length 2) –
  • fill_value (float or int) –
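
For example (assuming the setup from the matrix class example):

>>> m = sml.full((2, 3), 7.0)
>>> m.toNumPy().shape
(2, 3)
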
systemml.defmatrix.seq(start=None, stop=None, step=1)

Creates a single-column vector with values starting from <start>, to <stop>, in increments of <step>. Note: unlike NumPy’s arange, which returns a row vector, this returns a column vector. Also, unlike NumPy’s arange, which does not include stop, this method includes stop in the interval.

Parameters:
  • start (int or float [Optional: default = 0]) –
  • stop (int or float) –
  • step (int or float [Optional: default = 1]) –
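
For example, sml.seq(0, 10, 2) creates a column vector holding 0, 2, 4, 6, 8, 10 (assuming the setup from the matrix class example):

>>> v = sml.seq(0, 10, 2)
>>> v.toNumPy().shape
(6, 1)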

systemml.mlcontext module

class systemml.mlcontext.MLResults(results, sc)

Bases: object

Wrapper around a Java ML Results object.

Parameters:
  • results (JavaObject) – A Java MLResults object as returned by calling ml.execute().
  • sc (SparkContext) – SparkContext
get(*outputs)
Parameters:outputs (string, list of strings) – Output variables as defined inside the DML script.
class systemml.mlcontext.MLContext(sc)

Bases: object

Wrapper around the new SystemML MLContext.

Parameters:sc (SparkContext or SparkSession) – An instance of pyspark.SparkContext or pyspark.sql.SparkSession.
buildTime()

Display the project build time.

close()

Closes this MLContext instance to clean up the buffer pool, static/local state and scratch space. Note that the SparkContext is not explicitly closed, to allow external reuse.

execute(script)

Execute a DML / PyDML script.

Parameters:script (Script instance) – Script instance defined with the appropriate input and output variables.
Returns:ml_results – MLResults instance.
Return type:MLResults
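
A minimal end-to-end sketch (assuming an active SparkContext sc):

>>> from systemml import MLContext, dml
>>> ml = MLContext(sc)
>>> script = dml("s = 'Hello World'").output("s")
>>> ml.execute(script).get("s")
'Hello World'
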
info()

Display the project information.

isExplain()

Returns True if program instruction details should be output, False otherwise.

isForceGPU()

Returns True if “force” GPU mode is enabled, False otherwise.

isGPU()

Returns True if GPU mode is enabled, False otherwise.

isStatistics()

Returns True if program execution statistics should be output, False otherwise.

resetConfig()

Reset configuration settings to default values.

setConfig(configFilePath)

Set SystemML configuration based on a configuration file.

Parameters:configFilePath (String) –
setConfigProperty(propertyName, propertyValue)

Set configuration property, such as setConfigProperty(“sysml.localtmpdir”, “/tmp/systemml”).

Parameters:
  • propertyName (String) –
  • propertyValue (String) –
setExplain(explain)

Whether to output an explanation of the program. Mainly intended for developers.

Parameters:explain (boolean) –
setExplainLevel(explainLevel)

Set explain level.

Parameters:explainLevel (string) – Can be one of “hops”, “runtime”, “recompile_hops”, “recompile_runtime”, or any of the above in upper case.
setForceGPU(enable)

Whether or not to force the usage of GPU operators.

Parameters:enable (boolean) –
setGPU(enable)

Whether or not to enable GPU.

Parameters:enable (boolean) –
setStatistics(statistics)

Whether or not to output statistics (such as execution time, elapsed time) about script executions.

Parameters:statistics (boolean) –
setStatisticsMaxHeavyHitters(maxHeavyHitters)

The maximum number of heavy hitters that are printed as part of the statistics.

Parameters:maxHeavyHitters (int) –
version()

Display the project version.

class systemml.mlcontext.Script(scriptString, scriptType='dml', isResource=False, scriptFormat='auto')

Bases: object

Instance of a DML/PyDML Script.

Parameters:
  • scriptString (string) – Can be either a file path to a DML script or a DML script itself.
  • scriptType (string) – Script language, either “dml” for DML (R-like) or “pydml” for PyDML (Python-like).
  • isResource (boolean) – If true, scriptString is a path to a resource on the classpath
  • scriptFormat (string) – Optional script format, either “auto” or “url” or “file” or “resource” or “string”
clearAll()

Clear the script string, inputs, outputs, and symbol table.

clearIO()

Clear the inputs and outputs, but not the symbol table.

clearIOS()

Clear the inputs, outputs, and symbol table.

clearInputs()

Clear the inputs.

clearOutputs()

Clear the outputs.

clearSymbolTable()

Clear the symbol table.

displayInputParameters()

Display the script input parameters.

displayInputVariables()

Display the script input variables.

displayInputs()

Display the script inputs.

displayOutputVariables()

Display the script output variables.

displayOutputs()

Display the script outputs.

displaySymbolTable()

Display the script symbol table.

getInputVariables()

Obtain the input variable names.

getName()

Obtain the script name.

getOutputVariables()

Obtain the output variable names.

getResults()

Obtain the results of the script execution.

getScriptExecutionString()

Generate the script execution string, which adds read/load/write/save statements to the beginning and end of the script to execute.

getScriptString()

Obtain the script string (in unicode).

getScriptType()

Obtain the script type.

info()

Display information about the script as a String. This consists of the script type, inputs, outputs, input parameters, input variables, output variables, the symbol table, the script string, and the script execution string.

input(*args, **kwargs)
Parameters:
  • args (name, value tuple) – where name is a string, and currently supported value formats are double, string, dataframe, rdd, and lists of such objects.
  • kwargs (dict of name, value pairs) – To know what formats are supported for name and value, see above.
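
A binding sketch (assuming ml is an MLContext instance):

>>> from systemml import dml
>>> script = dml("z = x + y").input(x=1.0, y=2.0).output("z")
>>> ml.execute(script).get("z")
3.0
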
isDML()

Is the script type DML?

isPYDML()

Is the script type PyDML?

output(*names)
Parameters:names (string, list of strings) – Output variables as defined inside the DML script.
results()

Obtain the results of the script execution.

setName(name)

Set the script name.

setResults(results)

Set the results of the script execution.

setScriptString(scriptString)

Set the script string.

Parameters:scriptString (string) – Can be either a file path to a DML script or a DML script itself.
class systemml.mlcontext.Matrix(javaMatrix, sc)

Bases: object

Wrapper around a Java Matrix object.

Parameters:
  • javaMatrix (JavaObject) – A Java Matrix object as returned by calling ml.execute().get().
  • sc (SparkContext) – SparkContext
toDF()

Convert the Matrix to a PySpark SQL DataFrame.

Returns:A PySpark SQL DataFrame representing the matrix, with one “__INDEX” column containing the row index (since Spark DataFrames are unordered), followed by columns of doubles for each column in the matrix.
Return type:PySpark SQL DataFrame
toNumPy()

Convert the Matrix to a NumPy Array.

Returns:A NumPy Array representing the Matrix object.
Return type:NumPy Array
systemml.mlcontext.dml(scriptString)

Create a dml script object based on a string.

Parameters:scriptString (string) – Can be a path to a dml script or a dml script itself.
Returns:script – Instance of a script object.
Return type:Script instance
systemml.mlcontext.pydml(scriptString)

Create a pydml script object based on a string.

Parameters:scriptString (string) – Can be a path to a pydml script or a pydml script itself.
Returns:script – Instance of a script object.
Return type:Script instance
systemml.mlcontext.dmlFromResource(resourcePath)

Create a dml script object based on a resource path.

Parameters:resourcePath (string) – Path to a dml script on the classpath.
Returns:script – Instance of a script object.
Return type:Script instance
systemml.mlcontext.pydmlFromResource(resourcePath)

Create a pydml script object based on a resource path.

Parameters:resourcePath (string) – Path to a pydml script on the classpath.
Returns:script – Instance of a script object.
Return type:Script instance
systemml.mlcontext.dmlFromFile(filePath)

Create a dml script object based on a file path.

Parameters:filePath (string) – Path to a dml script.
Returns:script – Instance of a script object.
Return type:Script instance
systemml.mlcontext.pydmlFromFile(filePath)

Create a pydml script object based on a file path.

Parameters:filePath (string) – Path to a pydml script.
Returns:script – Instance of a script object.
Return type:Script instance
systemml.mlcontext.dmlFromUrl(url)

Create a dml script object based on a url.

Parameters:url (string) – URL to a dml script.
Returns:script – Instance of a script object.
Return type:Script instance
systemml.mlcontext.pydmlFromUrl(url)

Create a pydml script object based on a url.

Parameters:url (string) – URL to a pydml script.
Returns:script – Instance of a script object.
Return type:Script instance
systemml.mlcontext.getHopDAG(ml, script, lines=None, conf=None, apply_rewrites=True, with_subgraph=False)

Compile a DML / PyDML script.

Parameters:
  • ml (MLContext instance) – MLContext instance.
  • script (Script instance) – Script instance defined with the appropriate input and output variables.
  • lines (list of integers) – Optional: only display the hops whose begin and end line numbers equal the given integers.
  • conf (SparkConf instance) – Optional spark configuration
  • apply_rewrites (boolean) – If True, perform static rewrites, perform intra-/inter-procedural analysis to propagate size information into functions and apply dynamic rewrites
  • with_subgraph (boolean) – If False, the dot graph will be created without subgraphs for statement blocks.
Returns:hopDAG – hop DAG in dot format
Return type:string
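
A usage sketch (assuming an active SparkContext sc; the returned dot string can be rendered with graphviz):

>>> from systemml import MLContext, dml, getHopDAG
>>> ml = MLContext(sc)
>>> script = dml("x = rand(rows=10, cols=10)\ny = sum(x)\nprint(y)")
>>> hopDAG = getHopDAG(ml, script)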

systemml.mlcontext.getNumCols(numPyArr)
systemml.mlcontext.convertToMatrixBlock(sc, src, maxSizeBlockInMB=8)
systemml.mlcontext.convert_caffemodel(sc, deploy_file, caffemodel_file, output_dir, format='binary', is_caffe_installed=False)

Saves the weights and biases in the caffemodel file to output_dir in the specified format. This method does not require Caffe to be installed.

Parameters:
  • sc (SparkContext) – SparkContext
  • deploy_file (string) – Path to the input network file
  • caffemodel_file (string) – Path to the input caffemodel file
  • output_dir (string) – Path to the output directory
  • format (string) – Format of the weights and bias (can be binary, csv or text)
  • is_caffe_installed (bool) – True if caffe is installed
systemml.mlcontext.convert_lmdb_to_jpeg(lmdb_img_file, output_dir)

Saves the images in the lmdb file as JPEG files in output_dir. This method requires Caffe to be installed along with the lmdb and cv2 packages. To install the cv2 package, run pip install opencv-python.

Parameters:
  • lmdb_img_file (string) – Path to the input lmdb file
  • output_dir (string) – Output directory for images (local filesystem)
systemml.mlcontext.convertToNumPyArr(sc, mb)
systemml.mlcontext.convertToPandasDF(X)
systemml.mlcontext.convertToLabeledDF(sparkSession, X, y=None)
systemml.mlcontext.convertImageToNumPyArr(im, img_shape=None, add_rotated_images=False, add_mirrored_images=False, color_mode='RGB', mean=None)
systemml.mlcontext.getDatasetMean(dataset_name)
Parameters:dataset_name (string) – Name of the dataset used to train the model. This is an artificial name based on the dataset the model was trained on.
Returns:mean
Return type:Mean value of the model if it is defined in the DATASET_MEAN list, else None.

Module contents

class systemml.MLResults(results, sc)

Bases: object

Wrapper around a Java ML Results object.

Parameters:
  • results (JavaObject) – A Java MLResults object as returned by calling ml.execute().
  • sc (SparkContext) – SparkContext
get(*outputs)
Parameters:outputs (string, list of strings) – Output variables as defined inside the DML script.
class systemml.MLContext(sc)

Bases: object

Wrapper around the new SystemML MLContext.

Parameters:sc (SparkContext or SparkSession) – An instance of pyspark.SparkContext or pyspark.sql.SparkSession.
buildTime()

Display the project build time.

close()

Closes this MLContext instance to clean up the buffer pool, static/local state and scratch space. Note that the SparkContext is not explicitly closed, to allow external reuse.

execute(script)

Execute a DML / PyDML script.

Parameters:script (Script instance) – Script instance defined with the appropriate input and output variables.
Returns:ml_results – MLResults instance.
Return type:MLResults
info()

Display the project information.

isExplain()

Returns True if program instruction details should be output, False otherwise.

isForceGPU()

Returns True if “force” GPU mode is enabled, False otherwise.

isGPU()

Returns True if GPU mode is enabled, False otherwise.

isStatistics()

Returns True if program execution statistics should be output, False otherwise.

resetConfig()

Reset configuration settings to default values.

setConfig(configFilePath)

Set SystemML configuration based on a configuration file.

Parameters:configFilePath (String) –
setConfigProperty(propertyName, propertyValue)

Set configuration property, such as setConfigProperty(“sysml.localtmpdir”, “/tmp/systemml”).

Parameters:
  • propertyName (String) –
  • propertyValue (String) –
setExplain(explain)

Whether to output an explanation of the program. Mainly intended for developers.

Parameters:explain (boolean) –
setExplainLevel(explainLevel)

Set explain level.

Parameters:explainLevel (string) – Can be one of “hops”, “runtime”, “recompile_hops”, “recompile_runtime”, or any of the above in upper case.
setForceGPU(enable)

Whether or not to force the usage of GPU operators.

Parameters:enable (boolean) –
setGPU(enable)

Whether or not to enable GPU.

Parameters:enable (boolean) –
setStatistics(statistics)

Whether or not to output statistics (such as execution time, elapsed time) about script executions.

Parameters:statistics (boolean) –
setStatisticsMaxHeavyHitters(maxHeavyHitters)

The maximum number of heavy hitters that are printed as part of the statistics.

Parameters:maxHeavyHitters (int) –
version()

Display the project version.

class systemml.Script(scriptString, scriptType='dml', isResource=False, scriptFormat='auto')

Bases: object

Instance of a DML/PyDML Script.

Parameters:
  • scriptString (string) – Can be either a file path to a DML script or a DML script itself.
  • scriptType (string) – Script language, either “dml” for DML (R-like) or “pydml” for PyDML (Python-like).
  • isResource (boolean) – If true, scriptString is a path to a resource on the classpath
  • scriptFormat (string) – Optional script format, either “auto” or “url” or “file” or “resource” or “string”
clearAll()

Clear the script string, inputs, outputs, and symbol table.

clearIO()

Clear the inputs and outputs, but not the symbol table.

clearIOS()

Clear the inputs, outputs, and symbol table.

clearInputs()

Clear the inputs.

clearOutputs()

Clear the outputs.

clearSymbolTable()

Clear the symbol table.

displayInputParameters()

Display the script input parameters.

displayInputVariables()

Display the script input variables.

displayInputs()

Display the script inputs.

displayOutputVariables()

Display the script output variables.

displayOutputs()

Display the script outputs.

displaySymbolTable()

Display the script symbol table.

getInputVariables()

Obtain the input variable names.

getName()

Obtain the script name.

getOutputVariables()

Obtain the output variable names.

getResults()

Obtain the results of the script execution.

getScriptExecutionString()

Generate the script execution string, which adds read/load/write/save statements to the beginning and end of the script to execute.

getScriptString()

Obtain the script string (in unicode).

getScriptType()

Obtain the script type.

info()

Display information about the script as a String. This consists of the script type, inputs, outputs, input parameters, input variables, output variables, the symbol table, the script string, and the script execution string.

input(*args, **kwargs)
Parameters:
  • args (name, value tuple) – where name is a string, and currently supported value formats are double, string, dataframe, rdd, and lists of such objects.
  • kwargs (dict of name, value pairs) – To know what formats are supported for name and value, see above.
isDML()

Is the script type DML?

isPYDML()

Is the script type PyDML?

output(*names)
Parameters:names (string, list of strings) – Output variables as defined inside the DML script.
results()

Obtain the results of the script execution.

setName(name)

Set the script name.

setResults(results)

Set the results of the script execution.

setScriptString(scriptString)

Set the script string.

Parameters:scriptString (string) – Can be either a file path to a DML script or a DML script itself.
class systemml.Matrix(javaMatrix, sc)

Bases: object

Wrapper around a Java Matrix object.

Parameters:
  • javaMatrix (JavaObject) – A Java Matrix object as returned by calling ml.execute().get().
  • sc (SparkContext) – SparkContext
toDF()

Convert the Matrix to a PySpark SQL DataFrame.

Returns:A PySpark SQL DataFrame representing the matrix, with one “__INDEX” column containing the row index (since Spark DataFrames are unordered), followed by columns of doubles for each column in the matrix.
Return type:PySpark SQL DataFrame
toNumPy()

Convert the Matrix to a NumPy Array.

Returns:A NumPy Array representing the Matrix object.
Return type:NumPy Array
systemml.dml(scriptString)

Create a dml script object based on a string.

Parameters:scriptString (string) – Can be a path to a dml script or a dml script itself.
Returns:script – Instance of a script object.
Return type:Script instance
systemml.pydml(scriptString)

Create a pydml script object based on a string.

Parameters:scriptString (string) – Can be a path to a pydml script or a pydml script itself.
Returns:script – Instance of a script object.
Return type:Script instance
systemml.dmlFromResource(resourcePath)

Create a dml script object based on a resource path.

Parameters:resourcePath (string) – Path to a dml script on the classpath.
Returns:script – Instance of a script object.
Return type:Script instance
systemml.pydmlFromResource(resourcePath)

Create a pydml script object based on a resource path.

Parameters:resourcePath (string) – Path to a pydml script on the classpath.
Returns:script – Instance of a script object.
Return type:Script instance
systemml.dmlFromFile(filePath)

Create a dml script object based on a file path.

Parameters:filePath (string) – Path to a dml script.
Returns:script – Instance of a script object.
Return type:Script instance
systemml.pydmlFromFile(filePath)

Create a pydml script object based on a file path.

Parameters:filePath (string) – Path to a pydml script.
Returns:script – Instance of a script object.
Return type:Script instance
systemml.dmlFromUrl(url)

Create a dml script object based on a url.

Parameters:url (string) – URL to a dml script.
Returns:script – Instance of a script object.
Return type:Script instance
systemml.pydmlFromUrl(url)

Create a pydml script object based on a url.

Parameters:url (string) – URL to a pydml script.
Returns:script – Instance of a script object.
Return type:Script instance
systemml.getHopDAG(ml, script, lines=None, conf=None, apply_rewrites=True, with_subgraph=False)

Compile a DML / PyDML script.

Parameters:
  • ml (MLContext instance) – MLContext instance.
  • script (Script instance) – Script instance defined with the appropriate input and output variables.
  • lines (list of integers) – Optional: only display the hops whose begin and end line numbers equal the given integers.
  • conf (SparkConf instance) – Optional spark configuration
  • apply_rewrites (boolean) – If True, perform static rewrites, perform intra-/inter-procedural analysis to propagate size information into functions and apply dynamic rewrites
  • with_subgraph (boolean) – If False, the dot graph will be created without subgraphs for statement blocks.
Returns:hopDAG – hop DAG in dot format
Return type:string

systemml.setSparkContext(sc)

Before using matrix, the user needs to invoke this function if a SparkContext was not previously created in the session.

Parameters:sc (SparkContext) – SparkContext
class systemml.matrix(data, op=None)

Bases: object

The matrix class is a Python wrapper that implements basic matrix operators and matrix functions, as well as converters to common Python types (for example: NumPy arrays, PySpark DataFrames and Pandas DataFrames).

The operators supported are:

  1. Arithmetic operators: +, -, *, /, //, %, ** as well as dot (i.e. matrix multiplication)
  2. Indexing in the matrix
  3. Relational/Boolean operators: <, <=, >, >=, ==, !=, &, |

In addition, the following functions are supported for matrix:

  1. transpose
  2. Aggregation functions: sum, mean, var, sd, max, min, argmin, argmax, cumsum
  3. Global statistical built-in functions: exp, log, abs, sqrt, round, floor, ceil, ceiling, sin, cos, tan, asin, acos, atan, sign, solve

All of the above functions return a two-dimensional matrix; this is worth noting for aggregation functions with an axis argument. For example: assuming m1 is a matrix of shape (3, n), NumPy returns a 1d vector of dimension (3,) for the operation m1.sum(axis=1), whereas SystemML returns a 2d matrix of dimension (3, 1).

Note: an evaluated matrix contains a data field computed by the eval method, as a DataFrame or NumPy array.

Examples

>>> import systemml as sml
>>> import numpy as np
>>> sml.setSparkContext(sc)

Welcome to Apache SystemML!

>>> m1 = sml.matrix(np.ones((3,3)) + 2)
>>> m2 = sml.matrix(np.ones((3,3)) + 3)
>>> m2 = m1 * (m2 + m1)
>>> m4 = 1.0 - m2
>>> m4
# This matrix (mVar5) is backed by below given PyDML script (which is not yet evaluated). To fetch the data of this matrix, invoke toNumPy() or toDF() or toPandas() methods.
mVar1 = load(" ", format="csv")
mVar2 = load(" ", format="csv")
mVar3 = mVar2 + mVar1
mVar4 = mVar1 * mVar3
mVar5 = 1.0 - mVar4
save(mVar5, " ")
>>> m2.eval()
>>> m2
# This matrix (mVar4) is backed by NumPy array. To fetch the NumPy array, invoke toNumPy() method.
>>> m4
# This matrix (mVar5) is backed by below given PyDML script (which is not yet evaluated). To fetch the data of this matrix, invoke toNumPy() or toDF() or toPandas() methods.
mVar4 = load(" ", format="csv")
mVar5 = 1.0 - mVar4
save(mVar5, " ")
>>> m4.sum(axis=1).toNumPy()
array([[-60.],
       [-60.],
       [-60.]])

Design Decisions:

  1. Until the eval() method is invoked, we create an AST (not exposed to the user) that consists of unevaluated operations and the data required by those operations. As an analogy, a Spark user can treat the eval() method as similar to calling RDD.persist() followed by RDD.count().

  2. The AST consists of two kinds of nodes: either of type matrix or of type DMLOp. Both these classes expose a _visit method, which helps in traversing the AST in a DFS manner.

  3. A matrix object can either be evaluated or not. If evaluated, the attribute ‘data’ is set to one of the supported types (for example: NumPy array or DataFrame) and the attribute ‘op’ is set to None. If not evaluated, the attribute ‘op’ refers to one of the intermediate nodes of the AST and is of type DMLOp, and the attribute ‘data’ is set to None.

  4. DMLOp has an attribute ‘inputs’ which contains a list of matrix objects or DMLOps.

  5. To simplify the traversal, every matrix object is considered immutable and a matrix operation creates a new matrix object. As an example: m1 = sml.matrix(np.ones((3,3))) creates a matrix object backed by ‘data=np.ones((3,3))’. m1 = m1 * 2 will create a new matrix object backed by ‘op=DMLOp( … )’ whose input is the earlier created matrix object.

  6. Left indexing (implemented in the __setitem__ method) is a special case, where Python expects the existing object to be mutated. To preserve the above property, we make a deep copy of the existing object and point any references to the left-indexed matrix to the newly created object. The left-indexed matrix is then set to be backed by a DMLOp consisting of the following PyDML: left-indexed-matrix = new-deep-copied-matrix; left-indexed-matrix[index] = value.

  7. Please use m.print_ast() and/or type m for debugging. Here is a sample session:

    >>> npm = np.ones((3,3))
    >>> m1 = sml.matrix(npm + 3)
    >>> m2 = sml.matrix(npm + 5)
    >>> m3 = m1 + m2
    >>> m3
    mVar2 = load(" ", format="csv")
    mVar1 = load(" ", format="csv")
    mVar3 = mVar1 + mVar2
    save(mVar3, " ")
    >>> m3.print_ast()
    - [mVar3] (op).
      - [mVar1] (data).
      - [mVar2] (data).
    
THROW_ARRAY_CONVERSION_ERROR = False
abs()
acos()
arccos()
arcsin()
arctan()
argmax(axis=None)

Returns the indices of the maximum values along an axis.

Parameters:axis (int, optional (only axis=1, i.e. rowIndexMax is supported in this version)) –
argmin(axis=None)

Returns the indices of the minimum values along an axis.

Parameters:axis (int, optional (only axis=1, i.e. rowIndexMin is supported in this version)) –
asfptype()
asin()
astype(t)
atan()
ceil()
ceiling()
cos()
cosh()
cumsum(axis=None)

Returns the cumulative sum of the matrix along the given axis.

Parameters:axis (int, optional (only axis=0, i.e. cumsum along the rows is supported in this version)) –
deg2rad()

Convert angles from degrees to radians.

dml = []
dot(other)

NumPy way of performing matrix multiplication.

eval()

This is a convenience function that calls the global eval method

exp()
exp2()
expm1()
floor()
get_shape()
hstack(other)

Stack matrices horizontally (column wise). Invokes cbind internally.

ldexp(other)
log(y=None)
log10()
log1p()
log2()
logaddexp(other)
logaddexp2(other)
logical_not()
max(other=None, axis=None)

Compute the maximum value along the specified axis

Parameters:
  • other (matrix or numpy array (& other supported types) or scalar) –
  • axis (int, optional) –
mean(axis=None)

Compute the arithmetic mean along the specified axis

Parameters:axis (int, optional) –
min(other=None, axis=None)

Compute the minimum value along the specified axis

Parameters:
  • other (matrix or numpy array (& other supported types) or scalar) –
  • axis (int, optional) –
ml = None
mod(other)
moment(moment=1, axis=None)

Calculates the nth moment about the mean

Parameters:
  • moment (int) – can be 1, 2, 3 or 4
  • axis (int, optional) –
ndim = 2
negative()
ones_like()
print_ast()

Please use m.print_ast() and/or type m for debugging. Here is a sample session:

>>> npm = np.ones((3,3))
>>> m1 = sml.matrix(npm + 3)
>>> m2 = sml.matrix(npm + 5)
>>> m3 = m1 + m2
>>> m3
mVar2 = load(" ", format="csv")
mVar1 = load(" ", format="csv")
mVar3 = mVar1 + mVar2
save(mVar3, " ")
>>> m3.print_ast()
- [mVar3] (op).
  - [mVar1] (data).
  - [mVar2] (data).
prod()

Return the product of all cells in the matrix.

rad2deg()

Convert angles from radians to degrees.

reciprocal()
remainder(other)
remove_empty(axis=None)

Removes all empty rows or columns from the input matrix X, according to the specified axis.

Parameters:axis (int (0 or 1)) –
replace(pattern=None, replacement=None)

Replaces all occurrences of the given pattern in the matrix with the given replacement value.

Parameters:
  • pattern (float or int) –
  • replacement (float or int) –
round()
save(file, format='csv')

Allows the user to save a matrix to the filesystem.

Parameters:
  • file (filepath) –
  • format (can be csv, text or binary or mm) –
script = None
sd(axis=None)

Compute the standard deviation along the specified axis

Parameters:axis (int, optional) –
set_shape(shape)
shape
sign()
sin()
sinh()
sqrt()
square()
sum(axis=None)

Compute the sum along the specified axis

Parameters:axis (int, optional) –
systemmlVarID = 0
tan()
tanh()
toDF()

This is a convenience function that calls the global eval method and then converts the matrix object into a PySpark DataFrame.

toNumPy()

This is a convenience function that calls the global eval method and then converts the matrix object into a NumPy array.

toPandas()

This is a convenience function that calls the global eval method and then converts the matrix object into a Pandas DataFrame.

trace()

Return the sum of the cells on the main diagonal of a square matrix.

transpose()

Transposes the matrix.

var(axis=None)

Compute the variance along the specified axis. We assume that the delta degrees of freedom is 1 (unlike NumPy, which assumes ddof=0).

Parameters:axis (int, optional) –
visited = []
vstack(other)

Stack matrices vertically (row wise). Invokes rbind internally.

zeros_like()
systemml.eval(outputs, execute=True)

Executes the unevaluated DML script and computes the matrices specified by outputs.

Parameters:
  • outputs (list of matrices or a matrix object) –
  • execute (boolean) – specifies whether to execute the unevaluated operations or just return the script.
systemml.solve(A, b)

Computes the least squares solution for the system of linear equations A %*% x = b.

Examples

>>> import numpy as np
>>> from sklearn import datasets
>>> import systemml as sml
>>> from pyspark.sql import SparkSession
>>> diabetes = datasets.load_diabetes()
>>> diabetes_X = diabetes.data[:, np.newaxis, 2]
>>> X_train = diabetes_X[:-20]
>>> X_test = diabetes_X[-20:]
>>> y_train = diabetes.target[:-20]
>>> y_test = diabetes.target[-20:]
>>> sml.setSparkContext(sc)
>>> X = sml.matrix(X_train)
>>> y = sml.matrix(y_train)
>>> A = X.transpose().dot(X)
>>> b = X.transpose().dot(y)
>>> beta = sml.solve(A, b).toNumPy()
>>> y_predicted = X_test.dot(beta)
>>> print('Residual sum of squares: %.2f' % np.mean((y_predicted - y_test) ** 2))
Residual sum of squares: 25282.12
class systemml.DMLOp(inputs, dml=None)

Bases: object

Represents an intermediate node of the abstract syntax tree created to generate the PyDML script.

MAX_DEPTH = 0
systemml.set_lazy(isLazy)

This method allows users to set whether matrix operations should be executed in a lazy manner.

Parameters:isLazy (boolean) – True if matrix operations should be evaluated in a lazy manner.
systemml.debug_array_conversion(throwError)
systemml.load(file, format='csv')

Allows the user to load a matrix from the filesystem.

Parameters:
  • file (filepath) –
  • format (can be csv, text or binary or mm) –
systemml.full(shape, fill_value)

Return a new array of given shape filled with fill_value.

Parameters:
  • shape (tuple of length 2) –
  • fill_value (float or int) –
systemml.seq(start=None, stop=None, step=1)

Creates a single-column vector with values starting from <start>, to <stop>, in increments of <step>. Note: unlike NumPy’s arange, which returns a row vector, this returns a column vector. Also, unlike NumPy’s arange, which does not include stop, this method includes stop in the interval.

Parameters:
  • start (int or float [Optional: default = 0]) –
  • stop (int or float) –
  • step (int or float [Optional: default = 1]) –
systemml.getNumCols(numPyArr)
systemml.convertToMatrixBlock(sc, src, maxSizeBlockInMB=8)
systemml.convert_caffemodel(sc, deploy_file, caffemodel_file, output_dir, format='binary', is_caffe_installed=False)

Saves the weights and biases in the caffemodel file to output_dir in the specified format. This method does not require Caffe to be installed.

Parameters:
  • sc (SparkContext) – SparkContext
  • deploy_file (string) – Path to the input network file
  • caffemodel_file (string) – Path to the input caffemodel file
  • output_dir (string) – Path to the output directory
  • format (string) – Format of the weights and bias (can be binary, csv or text)
  • is_caffe_installed (bool) – True if caffe is installed
systemml.convert_lmdb_to_jpeg(lmdb_img_file, output_dir)

Saves the images in the lmdb file as JPEG files in output_dir. This method requires Caffe to be installed along with the lmdb and cv2 packages. To install the cv2 package, run pip install opencv-python.

Parameters:
  • lmdb_img_file (string) – Path to the input lmdb file
  • output_dir (string) – Output directory for images (local filesystem)
systemml.convertToNumPyArr(sc, mb)
systemml.convertToPandasDF(X)
systemml.convertToLabeledDF(sparkSession, X, y=None)
systemml.convertImageToNumPyArr(im, img_shape=None, add_rotated_images=False, add_mirrored_images=False, color_mode='RGB', mean=None)
systemml.getDatasetMean(dataset_name)
Parameters:dataset_name (string) – Name of the dataset used to train the model. This is an artificial name based on the dataset the model was trained on.
Returns:mean
Return type:Mean value of the model if it is defined in the DATASET_MEAN list, else None.