RAPIDS Accelerator for Apache Spark Configuration
The following is the list of options that rapids-plugin-4-spark supports.
On startup use: --conf [conf key]=[conf value]. For example:
${SPARK_HOME}/bin/spark-shell --jars 'rapids-4-spark_2.12-21.12.0.jar,cudf-21.12.2-cuda11.jar' \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.sql.incompatibleOps.enabled=true
At runtime use: spark.conf.set("[conf key]", [conf value]). For example:
scala> spark.conf.set("spark.rapids.sql.incompatibleOps.enabled", true)
All configs can be set on startup, but some configs, especially for shuffle, will not work if they are set at runtime.
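Because some settings only take effect at startup, it can be convenient to keep them in one list and expand that list into --conf flags. The following is a minimal sketch of that idea, not part of the plugin: the helper name conf_flags is hypothetical, and the concurrentGpuTasks value is only an illustration, not a recommendation.

```shell
# Hypothetical helper (not part of the plugin): print one
# "--conf key=value" flag per argument, one flag per line.
conf_flags() {
  for pair in "$@"; do
    printf '%s %s\n' --conf "$pair"
  done
}

# Illustrative startup settings; the keys come from this page,
# the values are only examples.
conf_flags \
  "spark.plugins=com.nvidia.spark.SQLPlugin" \
  "spark.rapids.sql.concurrentGpuTasks=2"
```

The printed flags can be appended to the launch command shown above through an unquoted command substitution. That relies on shell word splitting, so this sketch only suits values that contain no whitespace.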
General Configuration
Name | Description | Default Value |
---|---|---|
spark.rapids.alluxio.pathsToReplace | List of paths to be replaced with the corresponding Alluxio scheme. For example, when this is set to “s3:/foo->alluxio://0.1.2.3:19998/foo,gcs:/bar->alluxio://0.1.2.3:19998/bar”, s3:/foo/a.csv will be replaced with alluxio://0.1.2.3:19998/foo/a.csv and gcs:/bar/b.csv will be replaced with alluxio://0.1.2.3:19998/bar/b.csv | None |
spark.rapids.cloudSchemes | Comma separated list of additional URI schemes that are to be considered cloud based filesystems. Schemes already included: abfs, abfss, dbfs, gs, s3, s3a, s3n, wasbs. Cloud based stores are generally totally separate from the executors and likely have a higher I/O read cost. Cloud filesystems also often get better throughput when you have multiple readers in parallel. This is used with spark.rapids.sql.format.parquet.reader.type | None |
spark.rapids.gpu.resourceName | The name of the Spark resource that represents a GPU that you want the plugin to use if using custom resources with Spark. | gpu |
spark.rapids.memory.gpu.allocFraction | The fraction of available (free) GPU memory that should be allocated for pooled memory. This must be less than or equal to the maximum limit configured via spark.rapids.memory.gpu.maxAllocFraction, and greater than or equal to the minimum limit configured via spark.rapids.memory.gpu.minAllocFraction. | 1.0 |
spark.rapids.memory.gpu.debug | Provides a log of GPU memory allocations and frees. If set to STDOUT or STDERR the logging will go there. Setting it to NONE disables logging. All other values are reserved for possible future expansion and in the meantime will disable logging. | NONE |
spark.rapids.memory.gpu.direct.storage.spill.batchWriteBuffer.size | The size of the GPU memory buffer used to batch small buffers when spilling to GDS. Note that this buffer is mapped to the PCI Base Address Register (BAR) space, which may be very limited on some GPUs (e.g. the NVIDIA T4 only has 256 MiB), and it is also used by UCX bounce buffers. | 8388608 |
spark.rapids.memory.gpu.direct.storage.spill.enabled | Should GPUDirect Storage (GDS) be used to spill GPU memory buffers directly to disk. GDS must be enabled and the directory spark.local.dir must support GDS. This is an experimental feature. For more information on GDS, see https://docs.nvidia.com/gpudirect-storage/. | false |
spark.rapids.memory.gpu.maxAllocFraction | The fraction of total GPU memory that limits the maximum size of the RMM pool. The value must be greater than or equal to the setting for spark.rapids.memory.gpu.allocFraction. Note that this limit will be reduced by the reserve memory configured in spark.rapids.memory.gpu.reserve. | 1.0 |
spark.rapids.memory.gpu.minAllocFraction | The fraction of total GPU memory that limits the minimum size of the RMM pool. The value must be less than or equal to the setting for spark.rapids.memory.gpu.allocFraction. | 0.25 |
spark.rapids.memory.gpu.oomDumpDir | The path to a local directory where a heap dump will be created if the GPU encounters an unrecoverable out-of-memory (OOM) error. The filename will be of the form: “gpu-oom- | None |
spark.rapids.memory.gpu.pool | Select the RMM pooling allocator to use. Valid values are “DEFAULT”, “ARENA”, “ASYNC”, and “NONE”. With “DEFAULT”, the RMM pool allocator is used; with “ARENA”, the RMM arena allocator is used; with “ASYNC”, the new CUDA stream-ordered memory allocator in CUDA 11.2+ is used. If set to “NONE”, pooling is disabled and RMM just passes through to CUDA memory allocation directly. Note: “ARENA” is the recommended pool allocator if CUDF is built with Per-Thread Default Stream (PTDS), as “DEFAULT” is known to be unstable (https://github.com/NVIDIA/spark-rapids/issues/1141) | ARENA |
spark.rapids.memory.gpu.pooling.enabled | Should RMM act as a pooling allocator for GPU memory, or should it just pass through to CUDA memory allocation directly. DEPRECATED: please use spark.rapids.memory.gpu.pool instead. | true |
spark.rapids.memory.gpu.reserve | The amount of GPU memory that should remain unallocated by RMM and left for system use such as memory needed for kernels and kernel launches. | 671088640 |
spark.rapids.memory.gpu.unspill.enabled | When a spilled GPU buffer is needed again, should it be unspilled, or only copied back into GPU memory temporarily. Unspilling may be useful for GPU buffers that are needed frequently, for example, broadcast variables; however, it may also increase GPU memory usage | false |
spark.rapids.memory.host.spillStorageSize | Amount of off-heap host memory to use for buffering spilled GPU data before spilling to local disk | 1073741824 |
spark.rapids.memory.pinnedPool.size | The size of the pinned memory pool in bytes unless otherwise specified. Use 0 to disable the pool. | 0 |
spark.rapids.python.concurrentPythonWorkers | Set the number of Python worker processes that can execute concurrently per GPU. Python worker processes may temporarily block when the number of concurrent Python worker processes started by the same executor exceeds this amount. Allowing too many concurrent tasks on the same GPU may lead to GPU out of memory errors. >0 means enabled, while <=0 means unlimited | 0 |
spark.rapids.python.memory.gpu.allocFraction | The fraction of total GPU memory that should be initially allocated for pooled memory for all the Python workers. It should be less than (1 - $(spark.rapids.memory.gpu.allocFraction)), since the executor shares the GPU with the Python workers it owns. If not specified, half of the remaining memory is used | None |
spark.rapids.python.memory.gpu.maxAllocFraction | The fraction of total GPU memory that limits the maximum size of the RMM pool for all the Python workers. It should be less than (1 - $(spark.rapids.memory.gpu.maxAllocFraction)), since the executor shares the GPU with the Python workers it owns. Setting it to 0 means no limit. | 0.0 |
spark.rapids.python.memory.gpu.pooling.enabled | Should RMM in Python workers act as a pooling allocator for GPU memory, or should it just pass through to CUDA memory allocation directly. When not specified, it will honor the value of the config ‘spark.rapids.memory.gpu.pooling.enabled’ | None |
spark.rapids.shuffle.enabled | Enable or disable the RAPIDS Shuffle Manager at runtime. The RAPIDS Shuffle Manager must already be configured. When set to false, the built-in Spark shuffle will be used. | true |
spark.rapids.shuffle.transport.earlyStart | Enable early connection establishment for RAPIDS Shuffle | true |
spark.rapids.shuffle.transport.earlyStart.heartbeatInterval | Shuffle early start heartbeat interval (milliseconds). Executors will send a heartbeat RPC message to the driver at this interval | 5000 |
spark.rapids.shuffle.transport.earlyStart.heartbeatTimeout | Shuffle early start heartbeat timeout (milliseconds). Executors that don’t heartbeat within this timeout will be considered stale. This timeout must be higher than the value for spark.rapids.shuffle.transport.earlyStart.heartbeatInterval | 10000 |
spark.rapids.shuffle.transport.maxReceiveInflightBytes | Maximum aggregate number of bytes that can be fetched at any given time from peers during shuffle | 1073741824 |
spark.rapids.shuffle.ucx.activeMessages.forceRndv | Set to true to force ‘rndv’ mode for all UCX Active Messages. This should only be required with UCX 1.10.x. UCX 1.11.x deployments should set to false. | false |
spark.rapids.shuffle.ucx.managementServerHost | The host to be used to start the management server | null |
spark.rapids.shuffle.ucx.useWakeup | When set to true, use UCX’s event-based progress (epoll) in order to wake up the progress thread when needed, instead of a hot loop. | true |
spark.rapids.sql.batchSizeBytes | Set the target number of bytes for a GPU batch. Split sizes for input data are covered by separate configs. The maximum setting is 2 GB to avoid exceeding the cudf row count limit of a column. | 2147483647 |
spark.rapids.sql.castDecimalToFloat.enabled | Casting from decimal to floating point types on the GPU returns results that differ slightly from the results returned by the CPU. | false |
spark.rapids.sql.castDecimalToString.enabled | When set to true, casting from decimal to string is supported on the GPU. The GPU does NOT produce exactly the same string as Spark, but produces strings that are semantically equal. For instance, given input BigDecimal(123, -2), the GPU produces “12300”, while Spark produces “1.23E+4”. | false |
spark.rapids.sql.castFloatToDecimal.enabled | Casting from floating point types to decimal on the GPU returns results that differ slightly from the results returned by the CPU. | false |
spark.rapids.sql.castFloatToIntegralTypes.enabled | Casting from floating point types to integral types on the GPU supports a slightly different range of values when using Spark 3.1.0 or later. Refer to the CAST documentation for more details. | false |
spark.rapids.sql.castFloatToString.enabled | Casting from floating point types to string on the GPU returns results that have a different precision than the default results of Spark. | false |
spark.rapids.sql.castStringToDecimal.enabled | When set to true, enables casting from strings to decimal type on the GPU. Currently, casting strings to decimal on the GPU might produce results that differ slightly from the correct results when the string represents a number exceeding the max precision that CAST_STRING_TO_FLOAT can keep. For instance, the GPU returns 99999999999999987 given input string “99999999999999999”. The cause of the divergence is that strings containing scientific notation cannot be cast to decimal directly, so strings are first cast to floats and the floats are then cast to decimals. The first step may lead to precision loss. | false |
spark.rapids.sql.castStringToFloat.enabled | When set to true, enables casting from strings to float types (float, double) on the GPU. Currently hex values aren’t supported on the GPU. Also note that casting from string to float types on the GPU returns incorrect results when the string represents any number “1.7976931348623158E308” <= x < “1.7976931348623159E308” or “-1.7976931348623158E308” >= x > “-1.7976931348623159E308”. In both of these cases the GPU returns Double.MaxValue while the CPU returns “+Infinity” and “-Infinity” respectively | false |
spark.rapids.sql.castStringToTimestamp.enabled | When set to true, casting from string to timestamp is supported on the GPU. The GPU only supports a subset of formats when casting strings to timestamps. Refer to the CAST documentation for more details. | false |
spark.rapids.sql.concurrentGpuTasks | Set the number of tasks that can execute concurrently per GPU. Tasks may temporarily block when the number of concurrent tasks in the executor exceeds this amount. Allowing too many concurrent tasks on the same GPU may lead to GPU out of memory errors. | 1 |
spark.rapids.sql.createMap.enabled | The GPU-enabled version of the CreateMap expression (map SQL function) does not detect duplicate keys in all cases and does not guarantee which key wins if there are duplicates. When this config is set to true, CreateMap will be enabled to run on the GPU even when there might be duplicate keys. | false |
spark.rapids.sql.csv.read.bool.enabled | Parsing an invalid CSV boolean value produces true instead of null | false |
spark.rapids.sql.csv.read.byte.enabled | Parsing CSV bytes is much more lenient and will return 0 for some malformed values instead of null | false |
spark.rapids.sql.csv.read.date.enabled | Parsing invalid CSV dates produces different results from Spark | false |
spark.rapids.sql.csv.read.double.enabled | Parsing CSV doubles has some issues at the min and max values for floating point numbers and can be more lenient on parsing inf and -inf values | false |
spark.rapids.sql.csv.read.float.enabled | Parsing CSV floats has some issues at the min and max values for floating point numbers and can be more lenient on parsing inf and -inf values | false |
spark.rapids.sql.csv.read.integer.enabled | Parsing CSV integers is much more lenient and will return 0 for some malformed values instead of null | false |
spark.rapids.sql.csv.read.long.enabled | Parsing CSV longs is much more lenient and will return 0 for some malformed values instead of null | false |
spark.rapids.sql.csv.read.short.enabled | Parsing CSV shorts is much more lenient and will return 0 for some malformed values instead of null | false |
spark.rapids.sql.csvTimestamps.enabled | When set to true, enables the CSV parser to read timestamps. The default output format for Spark includes a timezone at the end. Anything except the UTC timezone is not supported. Timestamps after 2038 and before 1902 are also not supported. | false |
spark.rapids.sql.decimalOverflowGuarantees | FOR TESTING ONLY. DO NOT USE IN PRODUCTION. Please see the decimal section of the compatibility documents for more information on this config. | true |
spark.rapids.sql.enabled | Enable (true) or disable (false) sql operations on the GPU | true |
spark.rapids.sql.explain | Explain why some parts of a query were or were not placed on the GPU. Possible values are ALL: print everything, NONE: print nothing, NOT_ON_GPU: print only parts of a query that did not go on the GPU | NONE |
spark.rapids.sql.format.csv.enabled | When set to false, disables all csv input and output acceleration. (Only input is currently supported anyway.) | true |
spark.rapids.sql.format.csv.read.enabled | When set to false disables csv input acceleration | true |
spark.rapids.sql.format.orc.enabled | When set to false disables all orc input and output acceleration | true |
spark.rapids.sql.format.orc.multiThreadedRead.maxNumFilesParallel | A limit on the maximum number of files per task processed in parallel on the CPU side before the file is sent to the GPU. This affects the amount of host memory used when reading the files in parallel. Used with MULTITHREADED reader, see spark.rapids.sql.format.orc.reader.type | 2147483647 |
spark.rapids.sql.format.orc.multiThreadedRead.numThreads | The maximum number of threads, on the executor, to use for reading small orc files in parallel. This can not be changed at runtime after the executor has started. Used with MULTITHREADED reader, see spark.rapids.sql.format.orc.reader.type. | 20 |
spark.rapids.sql.format.orc.read.enabled | When set to false disables orc input acceleration | true |
spark.rapids.sql.format.orc.reader.type | Sets the orc reader type. We support different types that are optimized for different environments. The original Spark style reader can be selected by setting this to PERFILE which individually reads and copies files to the GPU. Loading many small files individually has high overhead, and using either COALESCING or MULTITHREADED is recommended instead. The COALESCING reader is good when using a local file system where the executors are on the same nodes or close to the nodes the data is being read on. This reader coalesces all the files assigned to a task into a single host buffer before sending it down to the GPU. It copies blocks from a single file into a host buffer in separate threads in parallel, see spark.rapids.sql.format.orc.multiThreadedRead.numThreads. MULTITHREADED is good for cloud environments where you are reading from a blobstore that is totally separate and likely has a higher I/O read cost. Many times the cloud environments also get better throughput when you have multiple readers in parallel. This reader uses multiple threads to read each file in parallel and each file is sent to the GPU separately. This allows the CPU to keep reading while GPU is also doing work. See spark.rapids.sql.format.orc.multiThreadedRead.numThreads and spark.rapids.sql.format.orc.multiThreadedRead.maxNumFilesParallel to control the number of threads and amount of memory used. By default this is set to AUTO so we select the reader we think is best. This will either be the COALESCING or the MULTITHREADED based on whether we think the file is in the cloud. See spark.rapids.cloudSchemes. | AUTO |
spark.rapids.sql.format.orc.write.enabled | When set to false disables orc output acceleration | true |
spark.rapids.sql.format.parquet.enabled | When set to false disables all parquet input and output acceleration | true |
spark.rapids.sql.format.parquet.multiThreadedRead.maxNumFilesParallel | A limit on the maximum number of files per task processed in parallel on the CPU side before the file is sent to the GPU. This affects the amount of host memory used when reading the files in parallel. Used with MULTITHREADED reader, see spark.rapids.sql.format.parquet.reader.type | 2147483647 |
spark.rapids.sql.format.parquet.multiThreadedRead.numThreads | The maximum number of threads, on the executor, to use for reading small parquet files in parallel. This can not be changed at runtime after the executor has started. Used with COALESCING and MULTITHREADED reader, see spark.rapids.sql.format.parquet.reader.type. | 20 |
spark.rapids.sql.format.parquet.read.enabled | When set to false disables parquet input acceleration | true |
spark.rapids.sql.format.parquet.reader.type | Sets the parquet reader type. We support different types that are optimized for different environments. The original Spark style reader can be selected by setting this to PERFILE which individually reads and copies files to the GPU. Loading many small files individually has high overhead, and using either COALESCING or MULTITHREADED is recommended instead. The COALESCING reader is good when using a local file system where the executors are on the same nodes or close to the nodes the data is being read on. This reader coalesces all the files assigned to a task into a single host buffer before sending it down to the GPU. It copies blocks from a single file into a host buffer in separate threads in parallel, see spark.rapids.sql.format.parquet.multiThreadedRead.numThreads. MULTITHREADED is good for cloud environments where you are reading from a blobstore that is totally separate and likely has a higher I/O read cost. Many times the cloud environments also get better throughput when you have multiple readers in parallel. This reader uses multiple threads to read each file in parallel and each file is sent to the GPU separately. This allows the CPU to keep reading while GPU is also doing work. See spark.rapids.sql.format.parquet.multiThreadedRead.numThreads and spark.rapids.sql.format.parquet.multiThreadedRead.maxNumFilesParallel to control the number of threads and amount of memory used. By default this is set to AUTO so we select the reader we think is best. This will either be the COALESCING or the MULTITHREADED based on whether we think the file is in the cloud. See spark.rapids.cloudSchemes. | AUTO |
spark.rapids.sql.format.parquet.write.enabled | When set to false disables parquet output acceleration | true |
spark.rapids.sql.format.parquet.writer.int96.enabled | When set to false, disables accelerated parquet write if the spark.sql.parquet.outputTimestampType is set to INT96 | true |
spark.rapids.sql.hasExtendedYearValues | Spark 3.2.0+ extended parsing of years in dates and timestamps to support the full range of possible values. Prior to this it was limited to a positive 4 digit year. The Accelerator does not support the extended range yet. This config indicates if your data includes this extended range or not, or if you don’t care about getting the correct values on values with the extended range. | true |
spark.rapids.sql.hasNans | Config to indicate if your data has NaN’s. Cudf doesn’t currently support NaN’s properly so you can get corrupt data if you have NaN’s in your data and it runs on the GPU. | true |
spark.rapids.sql.hashOptimizeSort.enabled | Whether sorts should be inserted after some hashed operations to improve output ordering. This can improve output file sizes when saving to columnar formats. | false |
spark.rapids.sql.improvedFloatOps.enabled | For some floating point operations spark uses one way to compute the value and the underlying cudf implementation can use an improved algorithm. In some cases this can result in cudf producing an answer when spark overflows. Because this is not as compatible with spark, we have it disabled by default. | false |
spark.rapids.sql.improvedTimeOps.enabled | When set to true, some operators will avoid overflowing by converting epoch days directly to seconds without first converting to microseconds | false |
spark.rapids.sql.incompatibleDateFormats.enabled | When parsing strings as dates and timestamps in functions like unix_timestamp, some formats are fully supported on the GPU and some are unsupported and will fall back to the CPU. Some formats behave differently on the GPU than the CPU. Spark on the CPU interprets date formats with unsupported trailing characters as nulls, while Spark on the GPU will parse the date with invalid trailing characters. More detail can be found at parsing strings as dates or timestamps. | false |
spark.rapids.sql.incompatibleOps.enabled | For operations that work but are not 100% compatible with their Spark equivalents, sets whether they should be enabled by default or disabled by default. | false |
spark.rapids.sql.join.cross.enabled | When set to true cross joins are enabled on the GPU | true |
spark.rapids.sql.join.fullOuter.enabled | When set to true full outer joins are enabled on the GPU | true |
spark.rapids.sql.join.inner.enabled | When set to true inner joins are enabled on the GPU | true |
spark.rapids.sql.join.leftAnti.enabled | When set to true left anti joins are enabled on the GPU | true |
spark.rapids.sql.join.leftOuter.enabled | When set to true left outer joins are enabled on the GPU | true |
spark.rapids.sql.join.leftSemi.enabled | When set to true left semi joins are enabled on the GPU | true |
spark.rapids.sql.join.rightOuter.enabled | When set to true right outer joins are enabled on the GPU | true |
spark.rapids.sql.metrics.level | GPU plans can produce a lot more metrics than CPU plans do. In very large queries this can sometimes result in going over the max result size limit for the driver. Supported values are: DEBUG, which enables all supported metrics and typically only needs to be enabled when debugging the plugin; MODERATE, which should output enough metrics to understand how long each part of the query is taking and how much data is going to each part of the query; and ESSENTIAL, which disables most metrics except those that Apache Spark CPU plans also report, or their equivalents. | MODERATE |
spark.rapids.sql.python.gpu.enabled | This is an experimental feature and is likely to change in the future. Enable (true) or disable (false) support for scheduling Python Pandas UDFs with GPU resources. When enabled, pandas UDFs are assumed to share the same GPU that the RAPIDS Accelerator uses and will honor the Python GPU configs | false |
spark.rapids.sql.reader.batchSizeBytes | Soft limit on the maximum number of bytes the reader reads per batch. The readers will read chunks of data until this limit is met or exceeded. Note that the reader may estimate the number of bytes that will be used on the GPU in some cases based on the schema and number of rows in each batch. | 2147483647 |
spark.rapids.sql.reader.batchSizeRows | Soft limit on the maximum number of rows the reader will read per batch. The orc and parquet readers will read row groups until this limit is met or exceeded. The limit is respected by the csv reader. | 2147483647 |
spark.rapids.sql.replaceSortMergeJoin.enabled | Allow replacing sortMergeJoin with HashJoin | true |
spark.rapids.sql.rowBasedUDF.enabled | When set to true, optimizes a row-based UDF in a GPU operation by transferring only the data it needs between GPU and CPU inside a query operation, instead of falling this operation back to CPU. This is an experimental feature, and this config might be removed in the future. | false |
spark.rapids.sql.shuffle.spillThreads | Number of threads used to spill shuffle data to disk in the background. | 6 |
spark.rapids.sql.stableSort.enabled | Enable or disable stable sorting. Apache Spark’s sorting is typically a stable sort, but sort stability cannot be guaranteed in distributed workloads because the order in which upstream data arrives at a task is not guaranteed. Sort stability then only matters when reading and sorting data from a file using a single task/partition. Because of limitations in the plugin, when you enable stable sorting all of the data for a single task will be combined into a single batch before sorting. This currently disables spilling from GPU memory if the data size is too large. | false |
spark.rapids.sql.suppressPlanningFailure | Option to fall back an individual query to the CPU if an unexpected condition prevents the query plan from being converted to a GPU-enabled one. Note this is different from a normal CPU fallback for a yet-to-be-supported Spark SQL feature. If this happens the error should be reported and investigated as a GitHub issue. | false |
spark.rapids.sql.udfCompiler.enabled | When set to true, Scala UDFs will be considered for compilation as Catalyst expressions | false |
spark.rapids.sql.variableFloatAgg.enabled | Spark assumes that all operations produce the exact same result each time. This is not true for some floating point aggregations, which can produce slightly different results on the GPU as the aggregation is done in parallel. This can enable those operations if you know the query is only computing it once. | false |
spark.rapids.sql.window.range.byte.enabled | When the order-by column of a range based window is byte type and the range boundary calculated for a value overflows, the CPU and GPU will produce different results. When set to false, disables the range window acceleration for the byte type order-by column | false |
spark.rapids.sql.window.range.int.enabled | When the order-by column of a range based window is int type and the range boundary calculated for a value overflows, the CPU and GPU will produce different results. When set to false, disables the range window acceleration for the int type order-by column | true |
spark.rapids.sql.window.range.long.enabled | When the order-by column of a range based window is long type and the range boundary calculated for a value overflows, the CPU and GPU will produce different results. When set to false, disables the range window acceleration for the long type order-by column | true |
spark.rapids.sql.window.range.short.enabled | When the order-by column of a range based window is short type and the range boundary calculated for a value overflows, the CPU and GPU will produce different results. When set to false, disables the range window acceleration for the short type order-by column | false |
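
Several rows in the table constrain one another: allocFraction must lie between minAllocFraction and maxAllocFraction, and the early start heartbeat timeout must be higher than the heartbeat interval. The sketch below shows one way to sanity-check a set of values before launching. The function names are hypothetical and the plugin performs its own validation; this only mirrors the relationships stated in the descriptions.

```shell
# Hypothetical checks (not part of the plugin) for constraints stated in the table.

# Succeeds when minAllocFraction <= allocFraction <= maxAllocFraction.
# Arguments: min alloc max
check_alloc_fractions() {
  awk -v min="$1" -v alloc="$2" -v max="$3" \
    'BEGIN { exit !((min + 0 <= alloc + 0) && (alloc + 0 <= max + 0)) }'
}

# Succeeds when heartbeatTimeout is strictly greater than heartbeatInterval.
# Arguments: interval_ms timeout_ms
check_heartbeat() {
  [ "$2" -gt "$1" ]
}

# Table defaults: min 0.25, alloc 1.0, max 1.0; interval 5000, timeout 10000.
if check_alloc_fractions 0.25 1.0 1.0 && check_heartbeat 5000 10000; then
  echo "settings are consistent"
else
  echo "settings are inconsistent"
fi
```

Note that the check does not account for spark.rapids.memory.gpu.reserve, which the table says further reduces the maximum limit.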
Supported GPU Operators and Fine Tuning
The RAPIDS Accelerator for Apache Spark can be configured to enable or disable specific GPU accelerated expressions. Enabled expressions are candidates for GPU execution. If the expression is configured as disabled, the accelerator plugin will not attempt replacement, and it will run on the CPU.
Use the spark.rapids.sql.explain setting to get feedback from the plugin as to why parts of a query may not be executing on the GPU.
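As a small illustration of the explain setting, the sketch below builds the corresponding startup flag and rejects values other than the three listed in the configuration table (ALL, NONE, NOT_ON_GPU). The helper name explain_flag is hypothetical and is not part of the plugin.

```shell
# Hypothetical helper (not part of the plugin): print the --conf flag for
# spark.rapids.sql.explain, accepting only the values listed in the table.
explain_flag() {
  case "$1" in
    ALL|NONE|NOT_ON_GPU)
      printf '%s %s\n' --conf "spark.rapids.sql.explain=$1"
      ;;
    *)
      echo "unsupported explain value: $1" >&2
      return 1
      ;;
  esac
}

# Request output only for the parts of a query that did not go on the GPU.
explain_flag NOT_ON_GPU
```

The same key can also be set at runtime with spark.conf.set, following the pattern shown at the top of this page.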
NOTE: Setting spark.rapids.sql.incompatibleOps.enabled=true will enable all the settings in the table below which are not enabled by default due to incompatibilities.