| Option | Description |
| --- | --- |
| `url` | The JDBC URL to connect to. Source-specific connection properties may be specified in the URL; for example: `jdbc:postgresql://localhost/analytics?user=pg&password=secret`. See chapter 8 for examples. |
| `dbtable` | The JDBC table to read. Anything that is valid in a `FROM` clause of a SQL query can be used; instead of a table, you could also use a subquery in parentheses. See chapter 8 for examples. |
| `driver` | The class name of the JDBC driver to use to connect to this URL. See chapter 8 for examples. |
| `partitionColumn`, `lowerBound`, `upperBound` | These options must all be specified if any of them is specified, and `numPartitions` must be specified as well. They describe how to partition the table when reading in parallel from multiple workers. `partitionColumn` must be a numeric column from the table in question. Note that `lowerBound` and `upperBound` are used to decide the partition stride, not to filter rows in the table; all rows in the table will be partitioned and returned. These options apply only to reading. See chapter 8 for examples. |
| `numPartitions` | The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, Spark decreases it to this limit by calling `coalesce(numPartitions)` before writing. See chapter 8 for a reading example. |
| `fetchsize` | The JDBC fetch size, which determines how many rows to fetch per round trip. This can help performance on JDBC drivers that default to a low fetch size (for example, Oracle defaults to 10 rows). This option applies only to reading. |
| `batchsize` | The JDBC batch size, which determines how many rows to insert per round trip. This can help performance on JDBC drivers. This option applies only to writing. It defaults to `1000`. |
| `isolationLevel` | The transaction isolation level, which applies to the current connection. It can be `NONE`, `READ_COMMITTED`, `READ_UNCOMMITTED`, `REPEATABLE_READ`, or `SERIALIZABLE`, corresponding to the standard transaction isolation levels defined by JDBC's `Connection` object. It defaults to `READ_UNCOMMITTED`. This option applies only to writing. Refer to the documentation for `java.sql.Connection`. |
| `sessionInitStatement` | A custom SQL statement (or PL/SQL block) that is executed after each database session is opened to the remote database and before reading data begins. Use this to implement session initialization code. Example: `option("sessionInitStatement", """BEGIN execute immediate 'alter session set "_serial_direct_read"=true'; END;""")` |
| `truncate` | When `SaveMode.Overwrite` is enabled, this option causes Spark to truncate an existing table instead of dropping and re-creating it. This can be more efficient, and it prevents the table metadata (for example, indices) from being removed. However, it will not work in some cases, such as when the new data has a different schema. It defaults to `false`. This option applies only to writing. |
| `createTableOptions` | If specified, this option allows setting database-specific table and partition options when creating a table. This option applies only to writing. Example: `CREATE TABLE t (name string) ENGINE=InnoDB` |
| `createTableColumnTypes` | The database column data types to use, instead of the defaults, when creating the table. Data type information should be specified in the same format as `CREATE TABLE` column syntax; for example: `name CHAR(64), comments VARCHAR(1024)`. The specified types should be valid Spark SQL data types. This option applies only to writing. |
| `customSchema` | The custom schema to use when reading data from JDBC connectors; for example, `id DECIMAL(38, 0), name STRING`. You can also specify partial fields, and the others use the default type mapping; for example: `id DECIMAL(38, 0)`. The column names should be identical to the corresponding column names of the JDBC table. You can specify the corresponding Spark SQL data types instead of using the defaults. This option applies only to reading. |
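To make the read-side options concrete, here is a minimal Scala sketch of a partitioned JDBC read. The PostgreSQL URL, the `events` table, and its numeric `id` column are hypothetical stand-ins, not names from this book's examples:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-read-sketch").getOrCreate()

// Hypothetical PostgreSQL source: database "analytics", table "events"
// with a numeric "id" column.
val events = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/analytics?user=pg&password=secret")
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "events")
  // All four options below must be set together for a parallel read.
  // lowerBound/upperBound only decide the partition stride over the id
  // column; rows outside [0, 1000000] are still read, not filtered out.
  .option("partitionColumn", "id")
  .option("lowerBound", "0")
  .option("upperBound", "1000000")
  .option("numPartitions", "10")
  // Raise the fetch size for drivers that default low (e.g., Oracle's 10).
  .option("fetchsize", "1000")
  .load()
```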
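Because `dbtable` accepts anything valid in a `FROM` clause, a parenthesized subquery works too. This sketch also overrides the default type mapping for the listed columns with `customSchema`; the table and column names are again assumptions:

```scala
// A parenthesized subquery needs an alias on most databases. customSchema
// overrides the default Spark SQL type mapping only for the listed columns.
val bigIds = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/analytics?user=pg&password=secret")
  .option("dbtable", "(SELECT id, name FROM events WHERE id > 500000) AS big_ids")
  .option("customSchema", "id DECIMAL(38, 0), name STRING")
  .load()
```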
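On the write side, a sketch combining the write-only options from the table above; the target table name and option values are illustrative assumptions, not recommendations:

```scala
import org.apache.spark.sql.SaveMode

events.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/analytics?user=pg&password=secret")
  .option("dbtable", "events_copy")
  .option("batchsize", "10000")               // rows inserted per round trip
  .option("isolationLevel", "READ_COMMITTED") // instead of the READ_UNCOMMITTED default
  // With Overwrite, truncate the existing table rather than drop and
  // re-create it, keeping indices and other metadata intact.
  .option("truncate", "true")
  // Applies only if Spark has to create the table (i.e., it does not exist yet).
  .option("createTableColumnTypes", "name VARCHAR(1024)")
  .mode(SaveMode.Overwrite)
  .save()
```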