Getting Started on Spark workload qualification
The RAPIDS Accelerator for Apache Spark runs as many operations as possible on the GPU. Operators that do not yet run on the GPU seamlessly fall back to the CPU, though this may add some performance overhead from transfers between host memory and GPU memory. When converting an existing Spark workload from CPU to GPU, it is recommended to first analyze whether the workload uses any features (functions, expressions, data types, data formats) that do not yet run on the GPU. Understanding this will help you prioritize the workloads that are best suited to the GPU.
Significant performance benefits can be gained even if not all operations are fully supported on the GPU; it depends on how critical the portion that still executes on the CPU is to the overall performance of the query.
This article describes the tools we provide and how to do gap analysis and workload qualification.
1. Qualification and Profiling tool
Requirements
- Spark event logs from Spark 2.x or 3.x
- Spark 3.0.1+ jars
- The `rapids-4-spark-tools` jar
How to use
If you have Spark event logs from prior runs of the applications on Spark 2.x or 3.x, you can use the Qualification tool and Profiling tool to analyze them. The Qualification tool outputs a score, a rank and some of the potentially unsupported features for each Spark application. For example, the CSV output can list `Unsupported Read File Formats and Types`, `Unsupported Write Data Format` and `Potential Problems`, all of which indicate unsupported features. Its output can help you focus on the Spark applications that are best suited for the GPU.
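As a sketch, the Qualification tool can be run directly against a directory of event logs; the main class below is the one documented for the tools jar, while the jar path, version and event log location are placeholders:

java -cp /PathTo/rapids-4-spark-tools_<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.qualification.QualificationMain \
  /PathTo/eventLogDir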
The Profiling tool outputs SQL plan metrics and also prints out actual query plans to provide more insight. In the following example, the Profiling tool output for a specific Spark application shows that it has a query with a large `HashAggregate` and `SortMergeJoin`. Those are indicators of a good candidate application for the RAPIDS Accelerator.
+--------+-----+------+----------------------------------------------------+-------------+------------------------------------+-------------+----------+
|appIndex|sqlID|nodeID|nodeName |accumulatorId|name |max_value |metricType|
+--------+-----+------+----------------------------------------------------+-------------+------------------------------------+-------------+----------+
|1 |88 |8 |SortMergeJoin |11111 |number of output rows |500000000 |sum |
|1 |88 |9 |HashAggregate |22222 |number of output rows |600000000 |sum |
Since the two tools only analyze Spark event logs, they do not have the detail that could be captured from a running Spark job. However, they are very convenient because you can run them on existing logs and do not need a GPU cluster.
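A similar sketch for the Profiling tool (again, the main class is the documented one and the paths and version are placeholders):

java -cp /PathTo/rapids-4-spark-tools_<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.profiling.ProfileMain \
  /PathTo/eventLogDir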
2. Function explainPotentialGpuPlan
Requirements
- A Spark 3.x CPU cluster
- The `rapids-4-spark` and `cudf` jars
- Ability to modify the existing Spark application code
How to use
Starting with version 21.12 of the RAPIDS Accelerator, a new function named `explainPotentialGpuPlan` is available that can help us understand the potential GPU plan and whether there are any unsupported features, all while running on a CPU cluster. Basically, it returns the same output that the driver logs would contain with `spark.rapids.sql.explain=all`.
- In `spark-shell`, add the `rapids-4-spark` and `cudf` jars with the `--jars` option or put them in the Spark classpath. For example:

  spark-shell --jars /PathTo/cudf-<version>.jar,/PathTo/rapids-4-spark_<version>.jar
- Test whether the class can be loaded successfully:

  import com.nvidia.spark.rapids.ExplainPlan.explainPotentialGpuPlan
- Enable optional RAPIDS Accelerator parameters based on your setup.

  Enabling optional parameters may allow more operations to run on the GPU, but please understand the meaning and risk of each parameter before enabling it. Please refer to the configs doc for details of the RAPIDS Accelerator parameters.

  For example, if your jobs have `double`, `float` and `decimal` operators together with some Scala UDFs, you can set the following parameters:

  spark.conf.set("spark.rapids.sql.incompatibleOps.enabled", true)
  spark.conf.set("spark.rapids.sql.variableFloatAgg.enabled", true)
  spark.conf.set("spark.rapids.sql.decimalType.enabled", true)
  spark.conf.set("spark.rapids.sql.castFloatToDecimal.enabled", true)
  spark.conf.set("spark.rapids.sql.castDecimalToFloat.enabled", true)
  spark.conf.set("spark.rapids.sql.udfCompiler.enabled", true)
- Run the function `explainPotentialGpuPlan` on the query DataFrame. For example:

  val jdbcDF = spark.read.format("jdbc").
    option("driver", "com.mysql.jdbc.Driver").
    option("url", "jdbc:mysql://localhost:3306/hive?useSSL=false").
    option("dbtable", "TBLS").
    option("user", "xxx").
    option("password", "xxx").
    load()
  jdbcDF.createOrReplaceTempView("t")
  val mydf = spark.sql("select count(distinct TBL_ID) from t")

  val output = com.nvidia.spark.rapids.ExplainPlan.explainPotentialGpuPlan(mydf)
  println(output)
Below are sample driver log messages; lines starting with `!` indicate features unsupported in this version:

!NOT_FOUND <RowDataSourceScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.RowDataSourceScanExec could be found
This log shows which operators (and on which data types) cannot run on the GPU, and why. If it names a specific RAPIDS Accelerator parameter that can be turned on to enable that feature, you should first understand the risk and applicability of that parameter based on the configs doc, then enable it and try the tool again.
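As a concrete sketch of that retry loop, reusing the mydf DataFrame from the step above and using `spark.rapids.sql.incompatibleOps.enabled` purely as an illustrative parameter:

  // Enable the parameter suggested by the log message (illustrative choice)
  spark.conf.set("spark.rapids.sql.incompatibleOps.enabled", true)
  // Re-run the check on the same DataFrame to see whether the gap is closed
  println(com.nvidia.spark.rapids.ExplainPlan.explainPotentialGpuPlan(mydf))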
Since its output is based directly on a specific version of the `rapids-4-spark` jar, the gap analysis is quite accurate.
3. Run Spark applications with the RAPIDS Accelerator on a GPU Spark cluster
Requirements
- A Spark 3.x GPU cluster
- The `rapids-4-spark` and `cudf` jars
How to use
Follow the getting-started guides to start a Spark 3.x GPU cluster and run the existing Spark workloads on the GPU cluster with the parameter `spark.rapids.sql.explain=all`. Then collect the Spark driver log and check it for not-supported messages. This is the most accurate way to do gap analysis.
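As a minimal sketch, launching a shell with this setting might look like the following (the plugin class is the standard one from the getting-started guides; jar paths and versions are placeholders):

spark-shell --jars /PathTo/cudf-<version>.jar,/PathTo/rapids-4-spark_<version>.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.explain=all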
For example, log lines starting with `!` are the so-called not-supported messages:
!Exec <GenerateExec> cannot run on GPU because not all expressions can be replaced
  !NOT_FOUND <ReplicateRows> replicaterows(sum#99L, gender#76) cannot run on GPU because no GPU enabled version of expression class
The indentation indicates the parent-child relationship between those expressions. If not all of the child expressions can run on the GPU, the parent cannot run on the GPU either. So the example above shows that the missing feature is the `ReplicateRows` expression, and we filed feature request issue-4104 based on the 21.12 version.