From 85c8af6b35e23af1729fa81fcf94af3c388536d0 Mon Sep 17 00:00:00 2001
From: Hong
Date: Thu, 15 Dec 2022 14:12:37 +0800
Subject: [PATCH] [ML-172] Update documents for OAP 1.5.0 (#241)

---
 CHANGELOG.md                   | 144 ++++++++++++++++++++++++++++++++-
 README.md                      |   2 +-
 docs/OAP-Developer-Guide.md    |   8 +-
 docs/OAP-Installation-Guide.md |   2 +-
 docs/index.md                  |  42 ++++++----
 mkdocs.yml                     |   4 +-
 6 files changed, 177 insertions(+), 25 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 83a6e75a8..168bcb267 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,147 @@
 # Change log
-Generated on 2022-07-05
+Generated on 2022-12-14
+
+## Release 1.5.0
+
+### Gazelle Plugin
+
+#### Features
+|||
+|:---|:---|
+|[#931](https://github.com/oap-project/gazelle_plugin/issues/931)|Reuse partition vectors for arrow scan|
+|[#955](https://github.com/oap-project/gazelle_plugin/issues/955)|implement missing expressions|
+|[#1120](https://github.com/oap-project/gazelle_plugin/issues/1120)|Support aggregation window functions with order by|
+|[#1135](https://github.com/oap-project/gazelle_plugin/issues/1135)|Supports Spark 3.2.2 shims|
+|[#1114](https://github.com/oap-project/gazelle_plugin/issues/1114)|Remove tmp directory after application exits|
+|[#862](https://github.com/oap-project/gazelle_plugin/issues/862)|implement row_number window function|
+|[#1007](https://github.com/oap-project/gazelle_plugin/issues/1007)|Document how to test columnar UDF|
+|[#942](https://github.com/oap-project/gazelle_plugin/issues/942)|Use hash aggregate for string type input|
+
+#### Performance
+|||
+|:---|:---|
+|[#1144](https://github.com/oap-project/gazelle_plugin/issues/1144)|Optimize cast WSCG performance|
+
+#### Bugs Fixed
+|||
+|:---|:---|
+|[#1170](https://github.com/oap-project/gazelle_plugin/issues/1170)|Segfault on data source v2|
+|[#1164](https://github.com/oap-project/gazelle_plugin/issues/1164)|Limit the column num in WSCG|
+|[#1166](https://github.com/oap-project/gazelle_plugin/issues/1166)|Peers' values should be considered in window function for CURRENT ROW in range mode|
+|[#1149](https://github.com/oap-project/gazelle_plugin/issues/1149)|Vulnerability issues|
+|[#1112](https://github.com/oap-project/gazelle_plugin/issues/1112)|Validate Error: “Invalid: Length spanned by binary offsets (21) larger than values array (size 20)”|
+|[#1103](https://github.com/oap-project/gazelle_plugin/issues/1103)|wrong hashagg results|
+|[#929](https://github.com/oap-project/gazelle_plugin/issues/929)|Failed to add user extension while using gazelle|
+|[#1100](https://github.com/oap-project/gazelle_plugin/issues/1100)|Wildcard in json path is not supported|
+|[#1079](https://github.com/oap-project/gazelle_plugin/issues/1079)|Like function gets wrong result when default escape char is contained|
+|[#1046](https://github.com/oap-project/gazelle_plugin/issues/1046)|Fall back to use row-based operators, error is makeStructField is unable to parse from conv|
+|[#1053](https://github.com/oap-project/gazelle_plugin/issues/1053)|Exception when there is function expression in pos or len of substring|
+|[#1024](https://github.com/oap-project/gazelle_plugin/issues/1024)|ShortType is not supported in ColumnarLiteral|
+|[#1034](https://github.com/oap-project/gazelle_plugin/issues/1034)|Exception when there is unix_timestamp in CaseWhen |
+|[#1032](https://github.com/oap-project/gazelle_plugin/issues/1032)|Missing WSCG check for ExistenceJoin|
+|[#1027](https://github.com/oap-project/gazelle_plugin/issues/1027)|partition by literal in window function|
+|[#1019](https://github.com/oap-project/gazelle_plugin/issues/1019)|Support more date formats for from_unixtime & unix_timestamp|
+|[#999](https://github.com/oap-project/gazelle_plugin/issues/999)|The performance of using ColumnarSort operator to sort string type is significantly lower than that of native spark Sortexec|
+|[#984](https://github.com/oap-project/gazelle_plugin/issues/984)|concat_ws|
+|[#958](https://github.com/oap-project/gazelle_plugin/issues/958)|JVM/Native R2C and CoalesceBatcth process time inaccuracy|
+|[#979](https://github.com/oap-project/gazelle_plugin/issues/979)|Failed to find column while reading parquet with case insensitive|
+
+#### PRs
+|||
+|:---|:---|
+|[#1175](https://github.com/oap-project/gazelle_plugin/pull/1175)|[NSE-1171] Support merge parquet schema and read missing schema|
+|[#1178](https://github.com/oap-project/gazelle_plugin/pull/1178)|[NSE-1161][FOLLOWUP] Remove extra compression type check|
+|[#1162](https://github.com/oap-project/gazelle_plugin/pull/1162)|[NSE-1161] Support read-write parquet conversion to read-write arrow|
+|[#1014](https://github.com/oap-project/gazelle_plugin/pull/1014)|[NSE-956] allow to write parquet with compression|
+|[#1176](https://github.com/oap-project/gazelle_plugin/pull/1176)|bump h2/pgsql version|
+|[#1173](https://github.com/oap-project/gazelle_plugin/pull/1173)|[NSE-1171] Throw RuntimeException when reading duplicate fields in case-insensitive mode|
+|[#1172](https://github.com/oap-project/gazelle_plugin/pull/1172)|[NSE-1170] Setting correct row number in batch scan w/ partition columns|
+|[#1169](https://github.com/oap-project/gazelle_plugin/pull/1169)|[NSE-1161] Format sql config string key|
+|[#1167](https://github.com/oap-project/gazelle_plugin/pull/1167)|[NSE-1166] Cover peers' values in sum window function in range mode|
+|[#1165](https://github.com/oap-project/gazelle_plugin/pull/1165)|[NSE-1164] Limit the max column num in WSCG|
+|[#1160](https://github.com/oap-project/gazelle_plugin/pull/1160)|[NSE-1149] upgrade guava to 30.1.1|
+|[#1158](https://github.com/oap-project/gazelle_plugin/pull/1158)|[NSE-1149] upgrade guava to 30.1.1|
+|[#1152](https://github.com/oap-project/gazelle_plugin/pull/1152)|[NSE-1149] upgrade guava to 24.1.1|
+|[#1153](https://github.com/oap-project/gazelle_plugin/pull/1153)|[NSE-1149] upgrade pgsql to 42.3.3|
+|[#1150](https://github.com/oap-project/gazelle_plugin/pull/1150)|[NSE-1149] Remove log4j in shims module|
+|[#1146](https://github.com/oap-project/gazelle_plugin/pull/1146)|[NSE-1135] Introduce shim layer for supporting spark 3.2.2|
+|[#1145](https://github.com/oap-project/gazelle_plugin/pull/1145)|[NSE-1144] Optimize cast wscg performance|
+|[#1136](https://github.com/oap-project/gazelle_plugin/pull/1136)|Remove project from wscg when it's the child of window|
+|[#1122](https://github.com/oap-project/gazelle_plugin/pull/1122)|[NSE-1120] Support sum window function with order by statement|
+|[#1131](https://github.com/oap-project/gazelle_plugin/pull/1131)|[NSE-1114] Remove temp directory without FileUtils.forceDeleteOnExit|
+|[#1129](https://github.com/oap-project/gazelle_plugin/pull/1129)|[NSE-1127] Use larger buffer for hash agg|
+|[#1130](https://github.com/oap-project/gazelle_plugin/pull/1130)|[NSE-610] fix hashjoin build time metric|
+|[#1126](https://github.com/oap-project/gazelle_plugin/pull/1126)|[NSE-1125] Add status check for hashing GetOrInsert|
+|[#1056](https://github.com/oap-project/gazelle_plugin/pull/1056)|[NSE-955] Support window function lag|
+|[#1123](https://github.com/oap-project/gazelle_plugin/pull/1123)|[NSE-1118] fix codegen on TPCDS Q88|
+|[#1119](https://github.com/oap-project/gazelle_plugin/pull/1119)|[NSE-1118] adding more checks for SMJ codegen|
+|[#1058](https://github.com/oap-project/gazelle_plugin/pull/1058)|[NSE-981] Add a test suite for projection codegen|
+|[#1117](https://github.com/oap-project/gazelle_plugin/pull/1117)|[NSE-1116] Disable columnar url_decoder|
+|[#1113](https://github.com/oap-project/gazelle_plugin/pull/1113)|[NSE-1112] Fix Arrow array meta data validating issue when writing parquet files|
+|[#1039](https://github.com/oap-project/gazelle_plugin/pull/1039)|[NSE-1019] fix codegen for all expressions|
+|[#1115](https://github.com/oap-project/gazelle_plugin/pull/1115)|[NSE-1114] Remove tmp directory after application exits|
+|[#1111](https://github.com/oap-project/gazelle_plugin/pull/1111)|remove debug log|
+|[#1098](https://github.com/oap-project/gazelle_plugin/pull/1098)|[NSE-1108] allow to use different cases in column names|
+|[#1082](https://github.com/oap-project/gazelle_plugin/pull/1082)|[NSE-1071] Refactor vector resizing in hash aggregate|
+|[#1036](https://github.com/oap-project/gazelle_plugin/pull/1036)|[NSE-987] fix string date|
+|[#948](https://github.com/oap-project/gazelle_plugin/pull/948)|[NSE-947] Add a whole stage fallback strategy|
+|[#1099](https://github.com/oap-project/gazelle_plugin/pull/1099)|[NSE-1104] fix hashagg w/ empty string|
+|[#1102](https://github.com/oap-project/gazelle_plugin/pull/1102)|[NSE-400] Fix memory leak for native C2R and R2C.|
+|[#1101](https://github.com/oap-project/gazelle_plugin/pull/1101)|[NSE-1100] Fall back get_json_object when wildcard is contained in json path|
+|[#1090](https://github.com/oap-project/gazelle_plugin/pull/1090)|[NSE-1065] fix on count distinct w/ keys|
+|[#1097](https://github.com/oap-project/gazelle_plugin/pull/1097)|Ignore two unit tests|
+|[#1081](https://github.com/oap-project/gazelle_plugin/pull/1081)|[NSE-1075] Support dynamic merge file partition|
+|[#1080](https://github.com/oap-project/gazelle_plugin/pull/1080)|[NSE-1079] Set the default escape char for like function|
+|[#1078](https://github.com/oap-project/gazelle_plugin/pull/1078)|[NSE-610] support big keys in hashagg|
+|[#1072](https://github.com/oap-project/gazelle_plugin/pull/1072)|[NSE-1071] Add tiny optimizations for hash aggregation functions|
+|[#1069](https://github.com/oap-project/gazelle_plugin/pull/1069)|[NSE-800] Remove spark-arrow-datasource-parquet in assembly|
+|[#1066](https://github.com/oap-project/gazelle_plugin/pull/1066)|[NSE-1065] Adding hashagg w/ filter support|
+|[#1067](https://github.com/oap-project/gazelle_plugin/pull/1067)|[NSE-958] Fix JVM R2C operator metrics|
+|[#935](https://github.com/oap-project/gazelle_plugin/pull/935)|[NSE-931] Reuse partition vectors for arrow scan|
+|[#1064](https://github.com/oap-project/gazelle_plugin/pull/1064)|[NSE-955] Implement parse_url|
+|[#1063](https://github.com/oap-project/gazelle_plugin/pull/1063)|[NSE-955] Support more date format in unix timestamp|
+|[#930](https://github.com/oap-project/gazelle_plugin/pull/930)|[NSE-929] Support user defined spark extensions|
+|[#1038](https://github.com/oap-project/gazelle_plugin/pull/1038)|[NSE-928] allow to sort with big partitions |
+|[#1057](https://github.com/oap-project/gazelle_plugin/pull/1057)|[NSE-1019] fix codegen for unixtimestamp|
+|[#1055](https://github.com/oap-project/gazelle_plugin/pull/1055)|[NSE-955] Support md5/sha1/sha2 functions|
+|[#903](https://github.com/oap-project/gazelle_plugin/pull/903)|[NSE-610] hashagg opt#3|
+|[#1044](https://github.com/oap-project/gazelle_plugin/pull/1044)|[NE-400] fix memory leakage in native columnartorow|
+|[#1041](https://github.com/oap-project/gazelle_plugin/pull/1041)|[NSE-1023] [NSE-1046] Cover more supported expressions in getting AttributeReference|
+|[#1054](https://github.com/oap-project/gazelle_plugin/pull/1054)|[NSE-1053] Support function in substring's pos and len|
+|[#1049](https://github.com/oap-project/gazelle_plugin/pull/1049)|[NSE-955] Support bin function|
+|[#1048](https://github.com/oap-project/gazelle_plugin/pull/1048)|[NSE-955] Support power function|
+|[#1042](https://github.com/oap-project/gazelle_plugin/pull/1042)|[NSE-955] Support find_in_set function|
+|[#1025](https://github.com/oap-project/gazelle_plugin/pull/1025)|[NSE-1024] Support ShortType in ColumnarLiteral|
+|[#1037](https://github.com/oap-project/gazelle_plugin/pull/1037)|[NSE-955] Turn on the support for get_json_object|
+|[#1033](https://github.com/oap-project/gazelle_plugin/pull/1033)|[NSE-1032] Adding WSCG check for keys in Join|
+|[#1035](https://github.com/oap-project/gazelle_plugin/pull/1035)|[NSE-1034] Add timeZoneId in ColumnarUnixTimestamp|
+|[#1028](https://github.com/oap-project/gazelle_plugin/pull/1028)|[NSE-1027] Problem with Literal in window function|
+|[#1017](https://github.com/oap-project/gazelle_plugin/pull/1017)|[NSE-999] use TimSort for STRING/DECIMAL onekey based sorting|
+|[#1022](https://github.com/oap-project/gazelle_plugin/pull/1022)|[NSE-955] Support remainder function|
+|[#1021](https://github.com/oap-project/gazelle_plugin/pull/1021)|[NSE-1019] [NSE-1020] Support more date formats and be aware of local time zone in handling unix timestamp|
+|[#1009](https://github.com/oap-project/gazelle_plugin/pull/1009)|[NSE-999] s/string/string_view in sort|
+|[#990](https://github.com/oap-project/gazelle_plugin/pull/990)|[NSE-943] Improve rowtocolumn operator|
+|[#1000](https://github.com/oap-project/gazelle_plugin/pull/1000)|[NSE-862] improve row_number()|
+|[#1013](https://github.com/oap-project/gazelle_plugin/pull/1013)|[NSE-955] Add Murmur3Hash expression support|
+|[#995](https://github.com/oap-project/gazelle_plugin/pull/995)|[NSE-981] Add more codegen checking in BHJ & SHJ|
+|[#1006](https://github.com/oap-project/gazelle_plugin/pull/1006)|[NSE-1007] Add a test guide for columnar UDF|
+|[#969](https://github.com/oap-project/gazelle_plugin/pull/969)|[NSE-943] Optimize data conversion for String/Binary type in Row2Columnar|
+|[#973](https://github.com/oap-project/gazelle_plugin/pull/973)|[NSE-928] Add ARROW_CHECK for batch_size check|
+|[#992](https://github.com/oap-project/gazelle_plugin/pull/992)|[NSE-984] fix concat_ws|
+|[#991](https://github.com/oap-project/gazelle_plugin/pull/991)|[NSE-981] check all expressions in HashAgg|
+|[#993](https://github.com/oap-project/gazelle_plugin/pull/993)|[NSE-979] fix data source|
+|[#980](https://github.com/oap-project/gazelle_plugin/pull/980)|[NSE-979] Support reading parquet with case sensitive|
+|[#985](https://github.com/oap-project/gazelle_plugin/pull/985)|[NSE-981] Implement supportColumnarCodegen to reflect the actual support state|
+|[#964](https://github.com/oap-project/gazelle_plugin/pull/964)|[NSE-955] implement lpad/rpad|
+|[#963](https://github.com/oap-project/gazelle_plugin/pull/963)|[NSE-955] implement concat_ws|
+|[#971](https://github.com/oap-project/gazelle_plugin/pull/971)|[NSE-955] Support hex expression|
+|[#968](https://github.com/oap-project/gazelle_plugin/pull/968)|[NSE-955] implement lower function |
+|[#965](https://github.com/oap-project/gazelle_plugin/pull/965)|[NSE-955] Support expression conv|
+|[#949](https://github.com/oap-project/gazelle_plugin/pull/949)|[NSE-862] implement row_number function|
+|[#960](https://github.com/oap-project/gazelle_plugin/pull/960)|[NSE-955] doc: Add columnar expression development guide|
+|[#941](https://github.com/oap-project/gazelle_plugin/pull/941)|[NSE-942] Force to use hash aggregate for string type input|
+|[#959](https://github.com/oap-project/gazelle_plugin/pull/959)|[NSE-958] Fix SQLMetrics inaccuracy in JVM/Native R2C and CoalesceBatcth|
+
 
 ## Release 1.4.0
 
diff --git a/README.md b/README.md
index 708e493b0..fd6ab8233 100644
--- a/README.md
+++ b/README.md
@@ -38,7 +38,7 @@ You can find the all the OAP MLlib documents on the [project web page](https://o
 
 ## Java/Scala Users Preferred
 
-Use a pre-built OAP MLlib JAR to get started, you can download OAP MLlib JAR from [Release Page](https://github.com/oap-project/oap-mllib/releases/download/v1.4.0/oap-mllib-1.4.0.jar).
+Use a pre-built OAP MLlib JAR to get started, you can download OAP MLlib JAR from [Release Page](https://github.com/oap-project/oap-mllib/releases/download/v1.5.0/oap-mllib-1.5.0.jar).
 
 Then you can refer to the following [Running](#running) section to try out.
 
diff --git a/docs/OAP-Developer-Guide.md b/docs/OAP-Developer-Guide.md
index 9dddf10a4..da8c0dc34 100644
--- a/docs/OAP-Developer-Guide.md
+++ b/docs/OAP-Developer-Guide.md
@@ -4,8 +4,8 @@ This document contains the instructions & scripts on installing necessary depend
 
 You can get more detailed information from OAP each module below.
 
-* [OAP MLlib](https://github.com/oap-project/oap-mllib/tree/v1.4.0)
-* [Gazelle Plugin](https://github.com/oap-project/gazelle_plugin/tree/v1.4.0)
+* [OAP MLlib](https://github.com/oap-project/oap-mllib/tree/v1.5.0)
+* [Gazelle Plugin](https://github.com/oap-project/gazelle_plugin/tree/v1.5.0)
 
 ## Building OAP
 
@@ -18,14 +18,14 @@ We provide scripts to help automatically install dependencies required, please c
 # cd oap-tools
 # sh dev/install-compile-time-dependencies.sh
 ```
-*Note*: oap-tools tag version `v1.4.0` corresponds to all OAP modules' tag version `v1.4.0`.
+*Note*: oap-tools tag version `v1.5.0` corresponds to all OAP modules' tag version `v1.5.0`.
 
 Then the dependencies below will be installed:
 
 * [Cmake](https://cmake.org/install/)
 * [GCC > 9](https://gcc.gnu.org/wiki/InstallingGCC)
 * [OneAPI](https://software.intel.com/content/www/us/en/develop/tools/oneapi.html)
-* [Arrow](https://github.com/oap-project/arrow/tree/v4.0.0-oap-1.4.0)
+* [Arrow](https://github.com/oap-project/arrow/tree/v4.0.0-oap-1.5.0)
 * [LLVM](https://llvm.org/)
 
 
diff --git a/docs/OAP-Installation-Guide.md b/docs/OAP-Installation-Guide.md
index 695c5be3d..22dd0454c 100644
--- a/docs/OAP-Installation-Guide.md
+++ b/docs/OAP-Installation-Guide.md
@@ -28,7 +28,7 @@ To test your installation, run the command `conda list` in your terminal window
 Create a Conda environment and install OAP Conda package.
 
 ```bash
-$ conda create -n oapenv -c conda-forge -c intel -y oap=1.4.0.spark32
+$ conda create -n oapenv -c conda-forge -c intel -y oap=1.5.0.spark32
 ```
 
 Once finished steps above, you have completed OAP dependencies installation and OAP building, and will find built OAP jars under `$HOME/miniconda2/envs/oapenv/oap_jars`
diff --git a/docs/index.md b/docs/index.md
index 8453ba3ef..748528b1b 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,10 +1,24 @@
-# OAP MLlib
+# Introduction
 
-## Overview
+## The Problem
 
-OAP MLlib is an optimized package to accelerate machine learning algorithms in [Apache Spark MLlib](https://spark.apache.org/mllib). It is compatible with Spark MLlib and leverages open source [Intel® oneAPI Data Analytics Library (oneDAL)](https://github.com/oneapi-src/oneDAL) to provide highly optimized algorithms and get most out of CPU and GPU capabilities. It also take advantage of open source [Intel® oneAPI Collective Communications Library (oneCCL)](https://github.com/oneapi-src/oneCCL) to provide efficient communication patterns in multi-node multi-GPU clusters.
+[Apache Spark MLlib](https://spark.apache.org/mllib) is a scalable machine learning library based on Spark unified platform. It seamlessly integrates with Spark SQL, Spark Streaming and other machine learning and deep learning frameworks without additional glue code for the entire pipeline.
 
-## Compatibility
+However, JVM-based MLlib only has limited use of BLAS acceleration and Spark shuffle is also slow for communication during distributed training. It doesn't fully utilize modern CPU and GPU capabilities to achieve best performance.
+
+## OAP MLlib Solution
+
+OAP MLlib is a platform optimized package to accelerate machine learning algorithms in Apache Spark MLlib. It is compatible with Spark MLlib and leverages open source [Intel® oneAPI Data Analytics Library (oneDAL)](https://github.com/oneapi-src/oneDAL) to provide highly optimized algorithms and get most out of CPU and GPU capabilities. It also take advantage of open source [Intel® oneAPI Collective Communications Library (oneCCL)](https://github.com/oneapi-src/oneCCL) to provide efficient communication patterns in multi-node multi-GPU clusters.
+
+## Who will use OAP MLlib
+
+This solution is intended for researchers, data scientists and enterprise users to accelerate their Spark MLlib algorithms with minimum configuration changes.
+
+# Architecture
+
+The following diagram shows the high-level architecture of OAP MLlib.
+
+![OAP MLlib Architecture](./images/arch.png)
 
 OAP MLlib maintains the same API interfaces with Spark MLlib. That means the application built with Spark MLlib can be running directly with minimum configuration.
 
@@ -12,33 +26,29 @@ Most of the algorithms can produce the same results that are identical with Spar
 
 For those algorithms that are not accelerated by OAP MLlib, the original Spark MLlib one will be used.
 
-## Online Documentation
-
-You can find the all the OAP MLlib documents on the [project web page](https://oap-project.github.io/oap-mllib).
-
-## Getting Started
+# Getting Started
 
-### Java/Scala Users Preferred
+## Java/Scala Users Preferred
 
-Use a pre-built OAP MLlib JAR to get started, you can download OAP MLlib JAR from [Release Page](https://github.com/oap-project/oap-mllib/releases/download/v1.4.0/oap-mllib-1.4.0.jar).
+Use a pre-built OAP MLlib JAR to get started, you can download OAP MLlib JAR from [Release Page](https://github.com/oap-project/oap-mllib/releases/download/v1.5.0/oap-mllib-1.5.0.jar).
 
 Then you can refer to the following [Running](#running) section to try out.
 
-### Python/PySpark Users Preferred
+## Python/PySpark Users Preferred
 
 Use a pre-built JAR to get started. If you have finished [OAP Installation Guide](./OAP-Installation-Guide.md), you can find compiled OAP MLlib JAR `oap-mllib-x.x.x.jar` in `$HOME/miniconda2/envs/oapenv/oap_jars/`.
 
 Then you can refer to the following [Running](#running) section to try out.
 
-### Building From Scratch
+## Building From Scratch
 
-You can also build the package from source code, please refer to [Building](#building) section.
+You can also build the package from source code, please refer to [Building Code](#building-code) section.
 
 ## Running
 
 ### Supported Spark Versions
 
-OAP MLlib's latest version supports multiple Spark versions as below.
+OAP MLlib's latest version supports multiple Spark versions as below.
 
 * Apache Spark 3.1.1
 * Apache Spark 3.1.2
@@ -115,7 +125,7 @@ Edit related variables in "`Minimun Settings`" of `env.sh`
 
 As PySpark-based applications call their Scala counterparts, they shall be supported out-of-box. Examples can be found in the [Examples](#examples) section.
 
-## Building
+## Building Code
 
 ### Prerequisites
 
diff --git a/mkdocs.yml b/mkdocs.yml
index b7396415c..f451c0455 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -19,5 +19,5 @@ theme: readthedocs
 plugins:
   - search
   - mkdocs-versioning:
-      version: 1.4.0
-      exclude_from_nav: ["image", "js", "css", "fonts", "img"]
+      version: 1.5.0
+      exclude_from_nav: ["images", "js", "css", "fonts", "img"]