org.apache.sysml.runtime.instructions.spark.utils

Class RDDConverterUtils

  • java.lang.Object
    • org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtils


  • public class RDDConverterUtils
    extends Object
    • Constructor Detail

      • RDDConverterUtils

        public RDDConverterUtils()
    • Method Detail

      • binaryBlockToLabeledPoints

        public static org.apache.spark.api.java.JavaRDD<org.apache.spark.ml.feature.LabeledPoint> binaryBlockToLabeledPoints(org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> in)
        Converter from a binary block RDD to an RDD of labeled points. Note that the input needs to be reblocked beforehand to satisfy the 'clen <= bclen' constraint, i.e., all columns of a row must reside in a single column block.
        Parameters:
        in - matrix as JavaPairRDD<MatrixIndexes, MatrixBlock>
        Returns:
        JavaRDD of labeled points
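        Example usage (a sketch for a spark-shell session with SystemML on the classpath; Xbin is an assumed binary block RDD, and the last column is assumed to hold the label):
        
         import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtils
         // Xbin: an assumed JavaPairRDD[MatrixIndexes, MatrixBlock] that already
         // satisfies the 'clen <= bclen' constraint
         val points = RDDConverterUtils.binaryBlockToLabeledPoints(Xbin)
         points.first()
         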
      • csvToBinaryBlock

        public static org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> csvToBinaryBlock(org.apache.spark.api.java.JavaSparkContext sc,
                                                                                                        org.apache.spark.api.java.JavaRDD<String> input,
                                                                                                        MatrixCharacteristics mcOut,
                                                                                                        boolean hasHeader,
                                                                                                        String delim,
                                                                                                        boolean fill,
                                                                                                        double fillValue)
                                                                                                 throws DMLRuntimeException
        Example usage:
        
         import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtils
         import org.apache.sysml.runtime.matrix.MatrixCharacteristics
         import org.apache.spark.api.java.JavaSparkContext
         val A = sc.textFile("ranA.csv")
         val Amc = new MatrixCharacteristics
         val Abin = RDDConverterUtils.csvToBinaryBlock(new JavaSparkContext(sc), A, Amc, false, ",", false, 0)
         
        Parameters:
        sc - java spark context
        input - rdd of strings
        mcOut - matrix characteristics
        hasHeader - if true, the first line of the input is treated as a header and skipped
        delim - delimiter as a string
        fill - if true, fill in empty values with fillValue
        fillValue - fill value used to fill empty values
        Returns:
        matrix as JavaPairRDD<MatrixIndexes, MatrixBlock>
        Throws:
        DMLRuntimeException - if the CSV-to-binary-block conversion fails
      • dataFrameToBinaryBlock

        public static org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> dataFrameToBinaryBlock(org.apache.spark.api.java.JavaSparkContext sc,
                                                                                                              org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> df,
                                                                                                              MatrixCharacteristics mc,
                                                                                                              boolean containsID,
                                                                                                              boolean isVector)
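        Example usage (a sketch; the DataFrame df of numeric columns and its source file are assumptions):
        
         import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtils
         import org.apache.sysml.runtime.matrix.MatrixCharacteristics
         import org.apache.spark.api.java.JavaSparkContext
         // a hypothetical DataFrame whose columns are all numeric
         val df = spark.read.option("inferSchema", "true").csv("ranA.csv")
         val mc = new MatrixCharacteristics
         val Xbin = RDDConverterUtils.dataFrameToBinaryBlock(new JavaSparkContext(sc), df, mc, false, false)
         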
      • binaryBlockToDataFrame

        public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> binaryBlockToDataFrame(org.apache.spark.sql.SparkSession sparkSession,
                                                                                                    org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> in,
                                                                                                    MatrixCharacteristics mc,
                                                                                                    boolean toVector)
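        Example usage (a sketch; Xbin and mc are assumed to come from an earlier conversion, and spark is the SparkSession of a spark-shell session):
        
         import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtils
         // Xbin: an assumed JavaPairRDD[MatrixIndexes, MatrixBlock]; mc: its matrix characteristics
         val df = RDDConverterUtils.binaryBlockToDataFrame(spark, Xbin, mc, false)
         df.printSchema()
         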
      • binaryBlockToDataFrame

        @Deprecated
        public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> binaryBlockToDataFrame(org.apache.spark.sql.SQLContext sqlContext,
                                                                                                    org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> in,
                                                                                                    MatrixCharacteristics mc,
                                                                                                    boolean toVector)
        Deprecated. Use the SparkSession-based overload of binaryBlockToDataFrame instead.
      • libsvmToBinaryBlock

        public static void libsvmToBinaryBlock(org.apache.spark.api.java.JavaSparkContext sc,
                                               String pathIn,
                                               String pathX,
                                               String pathY,
                                               MatrixCharacteristics mcOutX)
                                        throws DMLRuntimeException
        Converts a libsvm text input file into two binary block matrices for features and labels, and saves these to the specified output files. This call also deletes existing files at the specified output locations, as well as determines and writes the meta data files of both output matrices.

        Note: We use org.apache.spark.mllib.util.MLUtils.loadLibSVMFile for parsing the libsvm input files in order to ensure consistency with Spark.

        Parameters:
        sc - java spark context
        pathIn - path to libsvm input file
        pathX - path to binary block output file of features
        pathY - path to binary block output file of labels
        mcOutX - matrix characteristics of output matrix X
        Throws:
        DMLRuntimeException - if output path not writable or conversion failure
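        Example usage (a sketch; all file paths are hypothetical):
        
         import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtils
         import org.apache.sysml.runtime.matrix.MatrixCharacteristics
         import org.apache.spark.api.java.JavaSparkContext
         val mcX = new MatrixCharacteristics
         // note: existing files at the output paths are deleted by this call
         RDDConverterUtils.libsvmToBinaryBlock(new JavaSparkContext(sc), "data.libsvm", "X.bin", "y.bin", mcX)
         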
      • stringToSerializableText

        public static org.apache.spark.api.java.JavaPairRDD<org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text> stringToSerializableText(org.apache.spark.api.java.JavaPairRDD<Long,String> in)
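        Example usage (a sketch; the input pair RDD is an assumption). The method wraps each (Long, String) pair into Hadoop LongWritable/Text writables, presumably in serializable form for downstream Spark or Hadoop I/O:
        
         import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtils
         // in: an assumed JavaPairRDD[java.lang.Long, String] of (line number, line) pairs
         val writables = RDDConverterUtils.stringToSerializableText(in)
         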

Copyright © 2017 The Apache Software Foundation. All rights reserved.