org.apache.sysml.parser.AParserWrapper performs syntactic validation and parses the input DML script using ANTLR into a hierarchy of Statements as defined by control structures. Another important class of the language component is DMLTranslator, which performs live variable analysis and semantic validation. During that process we also retrieve input data characteristics -- i.e., format, number of rows, columns, and non-zero values -- as well as infrastructure characteristics, which are used for subsequent optimizations. Finally, we construct directed acyclic graphs (DAGs) of high-level operators (Hop) per statement block.
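The live variable analysis performed by DMLTranslator can be illustrated with a minimal backward dataflow sketch. The Stmt class, its def/use fields, and the driver below are illustrative assumptions, not SystemML's actual classes (the real analysis runs over StatementBlocks):

```java
import java.util.*;

// Minimal sketch of backward live-variable analysis over straight-line code.
// Stmt and this driver are hypothetical; SystemML's analysis operates on
// StatementBlocks inside DMLTranslator.
public class LiveVars {
    static class Stmt {
        final Set<String> def;  // variables written by the statement
        final Set<String> use;  // variables read by the statement
        Stmt(Set<String> def, Set<String> use) { this.def = def; this.use = use; }
    }

    // live(s) = use(s) ∪ (liveOut(s) \ def(s)); computed in one backward pass
    static List<Set<String>> liveOut(List<Stmt> prog) {
        List<Set<String>> out = new ArrayList<>();
        for (int i = 0; i < prog.size(); i++) out.add(new HashSet<>());
        Set<String> live = new HashSet<>();
        for (int i = prog.size() - 1; i >= 0; i--) {
            out.get(i).addAll(live);           // variables live after statement i
            live.removeAll(prog.get(i).def);   // kill definitions
            live.addAll(prog.get(i).use);      // gen uses
        }
        return out;
    }

    public static void main(String[] args) {
        // Models: X = read(...); y = colSums(X); z = sum(y); print(z)
        List<Stmt> prog = List.of(
            new Stmt(Set.of("X"), Set.of()),
            new Stmt(Set.of("y"), Set.of("X")),
            new Stmt(Set.of("z"), Set.of("y")),
            new Stmt(Set.of(), Set.of("z")));
        // X is live after statement 0, y after 1, z after 2, nothing after 3
        System.out.println(liveOut(prog));
    }
}
```

Knowing which variables are dead after a statement block is what lets the compiler free or avoid materializing intermediates.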
The compiler then applies rewrites such as InterProceduralAnalysis for statistics propagation into functions and over entire programs, and optimizes the operator ordering of matrix multiplication chains. We compute memory estimates for all HOPs, reflecting the memory requirements of in-memory single-node operations and intermediates. Each HOP DAG is compiled to a DAG of low-level operators (Lop), such as grouping and aggregate, which are backend-specific physical operators. Operator selection picks the best physical operators for a given HOP based on memory estimates, data characteristics, and cluster characteristics. Individual LOPs have corresponding runtime implementations, called instructions, and the optimizer generates an executable runtime program of instructions.
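Memory-estimate-driven operator selection can be sketched as follows. The size formulas, the 70% budget fraction, and the method names are simplifications for illustration, not SystemML's actual estimator:

```java
// Sketch: choose an in-memory (CP) vs. distributed (Spark) physical operator
// from a worst-case output-size estimate. All constants here are illustrative.
public class OpSelect {
    // Dense double matrix: 8 bytes per cell; the sparse estimate charges each
    // non-zero with an 8-byte value plus a 4-byte column index (a common
    // simplification of CSR-like layouts).
    static double estimateBytes(long rows, long cols, double sparsity) {
        double dense = 8.0 * rows * cols;
        double sparse = sparsity * rows * cols * (8.0 + 4.0);
        return Math.min(dense, sparse);
    }

    // Pick CP if the estimated output fits well within the driver budget.
    static String selectExecType(long rows, long cols, double sparsity, double driverBytes) {
        return estimateBytes(rows, cols, sparsity) < 0.7 * driverBytes ? "CP" : "SPARK";
    }

    public static void main(String[] args) {
        // 10^6 x 10^3 dense matrix = 8 GB: exceeds a 2 GB driver budget
        System.out.println(selectExecType(1_000_000, 1_000, 1.0, 2e9));
        // same shape at 1% sparsity ≈ 120 MB: fits in the driver
        System.out.println(selectExecType(1_000_000, 1_000, 0.01, 2e9));
    }
}
```

The same estimates also drive dynamic recompilation: when sizes are unknown at initial compile time, conservative estimates force distributed operators until runtime statistics prove an operation fits in memory.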
The runtime executes in-memory single-node operations as CPInstructions (some of which are multi-threaded), maintains an in-memory buffer pool, and launches MR or Spark jobs if the runtime plan contains distributed computations in the form of MRInstructions or Spark instructions (SPInstruction). For the MR backend, the SystemML compiler groups LOPs -- and thus, MR instructions -- into a minimal number of MR jobs (MR-job instructions). This procedure, implemented in Dag, is referred to as piggybacking. For the Spark backend, we rely on Spark's lazy evaluation and stage construction. CP instructions may also be backed by GPU kernels (GPUInstruction). The multi-level buffer pool caches local matrices in memory, evicts them if necessary, and handles data exchange between local and distributed runtime backends. The core of SystemML's runtime instructions is an adaptive matrix block library, which is sparsity-aware and operates on the entire matrix in CP, or on blocks of a matrix in a distributed setting. Further key features include parallel for-loops for task-parallel computations, and dynamic recompilation for runtime plan adaptation to address initial unknowns.
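The idea of a sparsity-aware block can be sketched with a toy matrix type that switches between dense and sparse storage by a threshold. The 0.4 cutoff and the coordinate-map layout are illustrative assumptions; SystemML's real MatrixBlock uses optimized sparse row structures:

```java
import java.util.*;

// Toy sparsity-aware matrix block: stores values densely or as a coordinate
// map depending on a sparsity threshold. Layout and threshold are illustrative,
// not SystemML's actual MatrixBlock design.
public class MiniBlock {
    static final double SPARSE_THRESHOLD = 0.4;
    final int rows, cols;
    double[] dense;            // row-major values, used when the block is dense
    Map<Long, Double> sparse;  // (row*cols+col) -> value, used when sparse

    MiniBlock(int rows, int cols, double[][] data) {
        this.rows = rows; this.cols = cols;
        long nnz = 0;
        for (double[] r : data) for (double v : r) if (v != 0) nnz++;
        if ((double) nnz / (rows * (long) cols) < SPARSE_THRESHOLD) {
            sparse = new HashMap<>();
            for (int i = 0; i < rows; i++)
                for (int j = 0; j < cols; j++)
                    if (data[i][j] != 0) sparse.put((long) i * cols + j, data[i][j]);
        } else {
            dense = new double[rows * cols];
            for (int i = 0; i < rows; i++)
                for (int j = 0; j < cols; j++) dense[i * cols + j] = data[i][j];
        }
    }

    boolean isSparse() { return sparse != null; }

    // Sparsity-aware operation: sum() touches only stored non-zeros when sparse.
    double sum() {
        double s = 0;
        if (isSparse()) { for (double v : sparse.values()) s += v; }
        else { for (double v : dense) s += v; }
        return s;
    }
}
```

In CP, one such block holds the entire matrix; in the distributed backends, the same block type represents one tile of a blocked matrix, so every operator is written once against the block abstraction.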
Copyright © 2017 The Apache Software Foundation. All rights reserved.