From b2dc49ae121c04a531944e09a433c7327fe24377 Mon Sep 17 00:00:00 2001 From: Manfred Moser Date: Mon, 25 Nov 2024 14:51:13 -0800 Subject: [PATCH] Improve and expand client protocol docs Include the new spooling protocol and its configuration for CLI and JDBC driver. --- docs/src/main/sphinx/admin.md | 1 + .../admin/properties-client-protocol.md | 232 ++++++++++++++++++ .../main/sphinx/admin/properties-general.md | 19 -- docs/src/main/sphinx/admin/properties.md | 1 + docs/src/main/sphinx/client.md | 54 ++-- docs/src/main/sphinx/client/cli.md | 14 ++ .../src/main/sphinx/client/client-protocol.md | 145 +++++++++++ docs/src/main/sphinx/client/jdbc.md | 19 ++ .../main/sphinx/develop/client-protocol.md | 3 + docs/src/main/sphinx/overview/concepts.md | 21 +- 10 files changed, 461 insertions(+), 48 deletions(-) create mode 100644 docs/src/main/sphinx/admin/properties-client-protocol.md create mode 100644 docs/src/main/sphinx/client/client-protocol.md diff --git a/docs/src/main/sphinx/admin.md b/docs/src/main/sphinx/admin.md index fe0efc1eba1c..cb7879475eed 100644 --- a/docs/src/main/sphinx/admin.md +++ b/docs/src/main/sphinx/admin.md @@ -63,6 +63,7 @@ admin/properties * [Properties reference overview](admin/properties) * [](admin/properties-general) +* [](admin/properties-client-protocol) * [](admin/properties-http-server) * [](admin/properties-resource-management) * [](admin/properties-query-management) diff --git a/docs/src/main/sphinx/admin/properties-client-protocol.md b/docs/src/main/sphinx/admin/properties-client-protocol.md new file mode 100644 index 000000000000..9c79fcb650bc --- /dev/null +++ b/docs/src/main/sphinx/admin/properties-client-protocol.md @@ -0,0 +1,232 @@ +# Client protocol properties + +The following sections provide a reference for all properties related to the +[client protocol](/client/client-protocol). + +(prop-protocol-spooling)= +## Spooling protocol properties + +The following properties are related to the [](protocol-spooling). 
+ +### `protocol.spooling.enabled` + +- **Type:** [](prop-type-boolean) +- **Default value:** `true` + +Enable the support for the client [](protocol-spooling). The protocol is used if +client drivers and applications request usage, otherwise the direct protocol is +used automatically. + +### `protocol.spooling.shared-secret-key` + +- **Type:** [](prop-type-string) + +A required 256 bit, base64-encoded secret key used to secure spooled metadata +exchanged with the client. + +### `protocol.spooling.retrieval-mode` + +- **Type:** [](prop-type-string) +- **Default value:** `STORAGE` + +Determines how the client retrieves the segment. Following are possible values: + +* `STORAGE` - client accesses the storage directly with the pre-signed URI. Uses + one client HTTP request per data segment. +* `COORDINATOR_STORAGE_REDIRECT` - client first accesses the coordinator, which + redirects the client to the storage with the pre-signed URI. Uses two client + HTTP requests per data segment. +* `COORDINATOR_PROXY` - client accesses the coordinator and gets data segment + through it. Uses one client HTTP request per data segment, but requires a + coordinator HTTP request to the storage. +* `WORKER_PROXY` - client accesses the coordinator, which redirects to an + available worker node. It fetches the data from the storage and provides it + to the client. Uses two client HTTP requests, and requires a worker request to + the storage. + +### `protocol.spooling.encoding.json.enabled` + +- **Type:** [](prop-type-boolean) +- **Default value:** `true` + +Activate support for using uncompressed JSON encoding for spooled segments. + +### `protocol.spooling.encoding.json+zstd.enabled` + +- **Type:** [](prop-type-boolean) +- **Default value:** `true` + +Activate support for using JSON encoding with Zstandard compression for spooled +segments. 
+ +### `protocol.spooling.encoding.json+lz4.enabled` + +- **Type:** [](prop-type-boolean) +- **Default value:** `true` + +Activate support for using JSON encoding with LZ4 compression for spooled +segments. + +### `protocol.spooling.initial-segment-size` + +- **Type:** [](prop-type-data-size) +- **Default value:** `8MB` + +Initial size of the spooled segments. + +### `protocol.spooling.maximum-segment-size` + +- **Type:** [](prop-type-data-size) +- **Default value:** `16MB` + +Maximum size for each spooled segment. + +### `protocol.spooling.inlining.enabled` + +- **Type:** [](prop-type-boolean) +- **Default value:** `true` + +Allow the spooling protocol to inline initial rows to decrease the time to +return the first row. + +### `protocol.spooling.inlining.max-rows` + +- **Type:** [](prop-type-integer) +- **Default value:** `1000` + +Maximum number of rows to inline per worker. + +### `protocol.spooling.inlining.max-size` + +- **Type:** [](prop-type-data-size) +- **Default value:** `128kB` + +Maximum size of rows to inline per worker. + +(prop-spooling-file-system)= +## Spooling file system properties + +The following properties are used to configure the object storage used with the +[](protocol-spooling). + +### `fs.azure.enabled` + +- **Type:** [](prop-type-boolean) +- **Default value:** `false` + +Activate [](/object-storage/file-system-azure) for spooling segments. + +### `fs.s3.enabled` + +- **Type:** [](prop-type-boolean) +- **Default value:** `false` + +Activate [](/object-storage/file-system-s3) for spooling segments. + +### `fs.gcs.enabled` + +- **Type:** [](prop-type-boolean) +- **Default value:** `false` + +Activate [](/object-storage/file-system-gcs) for spooling segments. + +### `fs.location` + +- **Type:** [](prop-type-string) + +The object storage location to use for spooling segments. Must be accessible by +the coordinator and all workers.
With the `protocol.spooling.retrieval-mode` +retrieval modes `STORAGE` and `COORDINATOR_STORAGE_REDIRECT` the location must +also be accessible by all clients. Valid location values vary by object storage +type, and typically follow a pattern of `scheme://bucketName/path/`. + +Examples: + +* `s3://my-spooling-bucket/my-segments/` + +:::{caution} +When using the same object storage for spooling from multiple Trino clusters, +you must use separate locations for each cluster. For example: + +* `s3://my-spooling-bucket/my-segments/cluster1` +* `s3://my-spooling-bucket/my-segments/cluster2` +::: + +### `fs.segment.ttl` + +- **Type:** [](prop-type-duration) +- **Default value:** `12h` + +Maximum available time for the client to retrieve a spooled segment before it +expires and is pruned. + +### `fs.segment.direct.ttl` + +- **Type:** [](prop-type-duration) +- **Default value:** `1h` + +Maximum available time for the client to retrieve a spooled segment using the +pre-signed URI. + +### `fs.segment.encryption` + +- **Type:** [](prop-type-boolean) +- **Default value:** `true` + +Encrypt segments with ephemeral keys using Server-Side Encryption with Customer +key (SSE-C). + +### `fs.segment.explicit-ack` + +- **Type:** [](prop-type-boolean) +- **Default value:** `true` + +Activate pruning of segments on client acknowledgment of a successful read of +each segment. + +### `fs.segment.pruning.enabled` + +- **Type:** [](prop-type-boolean) +- **Default value:** `true` + +Activate periodic pruning of expired segments. + +### `fs.segment.pruning.interval` + +- **Type:** [](prop-type-duration) +- **Default value:** `5m` + +Interval to prune expired segments. + +### `fs.segment.pruning.batch-size` + +- **Type:** [](prop-type-integer) +- **Default value:** `250` + +Number of expired segments to prune as a single batch operation.
+ +(prop-protocol-shared)= +## Shared protocol properties + +The following properties are related to the [](protocol-spooling) and the +[](protocol-direct), formerly named the V1 protocol. + +### `protocol.v1.prepared-statement-compression.length-threshold` + +- **Type:** [](prop-type-integer) +- **Default value:** `2048` + +Prepared statements that are submitted to Trino for processing, and are longer +than the value of this property, are compressed for transport via the HTTP +header to improve handling, and to avoid failures due to hitting HTTP header +size limits. + +### `protocol.v1.prepared-statement-compression.min-gain` + +- **Type:** [](prop-type-integer) +- **Default value:** `512` + +Prepared statement compression is not applied if the size gain is less than the +configured value. Smaller statements do not benefit from compression, and are +left uncompressed. + diff --git a/docs/src/main/sphinx/admin/properties-general.md b/docs/src/main/sphinx/admin/properties-general.md index 141328bd4aac..5b2e56a25b58 100644 --- a/docs/src/main/sphinx/admin/properties-general.md +++ b/docs/src/main/sphinx/admin/properties-general.md @@ -33,25 +33,6 @@ across nodes in the cluster. It can be disabled, when it is known that the output data set is not skewed, in order to avoid the overhead of hashing and redistributing all the data across the network. -## `protocol.v1.prepared-statement-compression.length-threshold` - -- **Type:** {ref}`prop-type-integer` -- **Default value:** `2048` - -Prepared statements that are submitted to Trino for processing, and are longer -than the value of this property, are compressed for transport via the HTTP -header to improve handling, and to avoid failures due to hitting HTTP header -size limits. - -## `protocol.v1.prepared-statement-compression.min-gain` - -- **Type:** {ref}`prop-type-integer` -- **Default value:** `512` - -Prepared statement compression is not applied if the size gain is less than the -configured value. 
Smaller statements do not benefit from compression, and are -left uncompressed. - (file-compression)= ## File compression and decompression diff --git a/docs/src/main/sphinx/admin/properties.md b/docs/src/main/sphinx/admin/properties.md index 4789f7e1e5af..9f2f3e6ddd2d 100644 --- a/docs/src/main/sphinx/admin/properties.md +++ b/docs/src/main/sphinx/admin/properties.md @@ -15,6 +15,7 @@ properties, refer to the {doc}`connector documentation `. :titlesonly: true General +Client protocol HTTP server Resource management Query management diff --git a/docs/src/main/sphinx/client.md b/docs/src/main/sphinx/client.md index cb53c3e32d25..502bf3552520 100644 --- a/docs/src/main/sphinx/client.md +++ b/docs/src/main/sphinx/client.md @@ -1,27 +1,49 @@ # Clients -A [client](trino-concept-client) is used to send queries to Trino and receive -results, or otherwise interact with Trino and the connected data sources. +A [client](trino-concept-client) is used to send SQL queries to Trino, and +therefore any [connected data sources](trino-concept-data-source), and receive +results. -Some clients, such as the [command line interface](/client/cli), can provide a -user interface directly. Clients like the [JDBC driver](/client/jdbc), provide a -mechanism for other applications, including your own custom applications, to -connect to Trino. +## Client drivers -The following clients are available as part of every Trino release: +Client drivers, also called client libraries, provide a mechanism for other +applications to connect to Trino. These applications are called client +applications and include your own custom applications or scripts.
The Trino project maintains the +following client drivers: + +* [Trino JDBC driver](/client/jdbc) +* [trino-go-client](https://github.com/trinodb/trino-go-client) +* [trino-js-client](https://github.com/trinodb/trino-js-client) +* [trino-python-client](https://github.com/trinodb/trino-python-client) + +Other communities and vendors provide [other client +drivers](https://trino.io/ecosystem/client.html). + +## Client applications + +Client applications provide a user interface and other user-facing features to +run queries with Trino. You can inspect the results, perform analytics with +further queries, and create visualizations. Client applications typically use a +client driver. + +The Trino project maintains the [Trino command line interface](/client/cli) as a +client application. + +Other communities and vendors provide [numerous other client +applications](https://trino.io/ecosystem/client.html). + +## Client protocol + +All client drivers and client applications communicate with the Trino +coordinator using the [client protocol](/client/client-protocol). + +Configure support for the [spooling protocol](protocol-spooling) on the cluster +to improve throughput for client interactions with higher data transfer demands.
```{toctree} :maxdepth: 1 +client/client-protocol client/cli client/jdbc ``` - -The Trino project maintains the following other client libraries: - -* [trino-go-client](https://github.com/trinodb/trino-go-client) -* [trino-js-client](https://github.com/trinodb/trino-js-client) -* [trino-python-client](https://github.com/trinodb/trino-python-client) - -In addition, other communities and vendors provide [numerous other client -libraries, drivers, and applications](https://trino.io/ecosystem/client) diff --git a/docs/src/main/sphinx/client/cli.md b/docs/src/main/sphinx/client/cli.md index 64e23a52c969..d96f357d2ff6 100644 --- a/docs/src/main/sphinx/client/cli.md +++ b/docs/src/main/sphinx/client/cli.md @@ -604,6 +604,20 @@ Query 20200707_170726_00030_2iup9 failed: line 1:25: Column 'region' cannot be r SELECT nationkey, name, region FROM tpch.sf1.nation LIMIT 3 ``` +(cli-spooling-protocol)= +## Spooling protocol + +The Trino CLI automatically uses the spooling protocol to improve throughput +for client interactions with higher data transfer demands, if the +[](protocol-spooling) is configured on the cluster. + +Optionally use the `--encoding` option to configure a different desired +encoding, compared to the default on the cluster. The available values are +`json+zstd` (recommended) for JSON with Zstandard compression, `json+lz4` +for JSON with LZ4 compression, and `json` for uncompressed JSON. + +The CLI process must have network access to the spooling object storage. + (cli-output-format)= ## Output formats diff --git a/docs/src/main/sphinx/client/client-protocol.md b/docs/src/main/sphinx/client/client-protocol.md new file mode 100644 index 000000000000..d822804d535f --- /dev/null +++ b/docs/src/main/sphinx/client/client-protocol.md @@ -0,0 +1,145 @@ +# Client protocol + +The Trino client protocol is an HTTP-based protocol that allows +[clients](/client) to submit SQL queries and receive results.
+ +The protocol is a sequence of REST API calls to the +[coordinator](trino-concept-coordinator) of the Trino +[cluster](trino-concept-cluster). Following is a high-level overview: + +1. The client submits SQL query text to the coordinator of the Trino cluster. +2. The coordinator starts processing the query. +3. The coordinator returns a result set and a URI `nextUri` on the coordinator. +4. The client receives the result set and initiates another request for more + data from the URI `nextUri`. +5. The coordinator continues processing the query and returns further data with + a new URI. +6. The client and coordinator continue with steps 4 and 5 until all + result set data is returned to the client or the client stops requesting + more data. + +The client protocol supports two modes. Configure the [spooling +protocol](protocol-spooling) for optimal throughput for your clients. + +(protocol-spooling)= +## Spooling protocol + +The spooling protocol uses an object storage location to store the data for +retrieval by the client. The coordinator and all workers can write result set +data to the storage in parallel. The coordinator only provides the URLs to all +the individual data segments on the object storage to the client. The spooling +protocol also allows compression of the data. + +Data on the object storage is automatically removed after download by the +client. + +The spooling protocol has the following characteristics, compared to the [direct +protocol](protocol-direct). + +* Provides higher throughput for data transfer, specifically for queries that + return more data. +* Results in faster query processing completion on the cluster, independent of + the client retrieving all data, since data is read from the object storage. +* Requires object storage and configuration on the Trino cluster. +* Reduces CPU and I/O load on the coordinator. +* Automatically falls back to the direct protocol for queries that don't benefit + from using the spooling protocol.
+* Requires newer client drivers or client applications that support the spooling + protocol and actively request usage of the spooling protocol. +* Clients must have access to the object storage. +* Works with older client drivers and client applications by automatically + falling back to the direct protocol if the spooling protocol is not supported. + +### Configuration + +The following steps are necessary to configure support for the spooling protocol +on a Trino cluster: + +* Configure the spooling protocol usage in [](config-properties) using the + [](prop-protocol-spooling). +* Choose a suitable object storage that is accessible to your Trino cluster and + your clients. +* Configure the object storage in `etc/spooling-manager.properties` using the + [](prop-spooling-file-system). + +Minimal configuration in [](config-properties): + +```properties +protocol.spooling.enabled=true +protocol.spooling.shared-secret-key=jxTKysfCBuMZtFqUf8UJDQ1w9ez8rynEJsJqgJf66u0= +``` + +Refer to [](prop-protocol-spooling) for further optional configuration. + +Suitable object storage systems for spooling are S3 and compatible systems, +Azure Storage, and Google Cloud Storage. The object storage system must provide +good connectivity for all cluster nodes as well as any clients. + +Activate the desired system with +`fs.s3.enabled=true`, `fs.azure.enabled=true`, or `fs.gcs.enabled=true` in +`etc/spooling-manager.properties` and configure further details using relevant +properties from [](prop-spooling-file-system), +[](/object-storage/file-system-s3), [](/object-storage/file-system-azure), and +[](/object-storage/file-system-gcs). + +The `spooling-manager.name` property must be set to `filesystem`.
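The `protocol.spooling.shared-secret-key` value in the minimal configuration above must be a 256 bit, base64-encoded key. The following is a minimal sketch of one way to generate such a value, assuming Python 3 is available on the administrator's machine. The function name `generate_spooling_secret_key` is illustrative and not part of Trino.

```python
import base64
import secrets


def generate_spooling_secret_key() -> str:
    """Return 32 random bytes (256 bits) encoded as base64 text."""
    # secrets.token_bytes draws from the operating system's CSPRNG.
    raw_key = secrets.token_bytes(32)
    return base64.b64encode(raw_key).decode("ascii")


if __name__ == "__main__":
    # Hypothetical usage: copy the printed value into the
    # protocol.spooling.shared-secret-key property.
    print(generate_spooling_secret_key())
```

Any tool that produces 32 random bytes from a cryptographically secure source and base64-encodes them should work equally well. Generate a fresh key rather than reusing the example value from the documentation.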
+ +Following is a minimal example for using the S3-compatible MinIO object +storage: + +```properties +spooling-manager.name=filesystem +fs.s3.enabled=true +fs.location=s3://spooling +s3.endpoint=http://minio:9080/ +s3.region=fake-value +s3.aws-access-key=minio-access-key +s3.aws-secret-key=minio-secret-key +s3.path-style-access=true +``` + +Refer to [](prop-spooling-file-system) for further configuration properties. + +The system assumes the object storage to be unbounded in terms of data and data +transfer volume. Spooled segments on object storage are automatically removed by +the clients after reads as well as the coordinator in specific intervals. Sizing +and transfer demands vary with the query workload on your cluster. + +Segments on object storage are encrypted, compressed, and can only be used by +the specific client that initiated the query. + +The following client drivers and client applications support the spooling +protocol: + +* [Trino JDBC driver](jdbc-spooling-protocol), version 466 and newer +* [Trino command line interface](cli-spooling-protocol), version 466 and newer + +Refer to the documentation for your specific client drivers and client +applications for up-to-date information. + +(protocol-direct)= +## Direct protocol + +The direct protocol transfers all data from the workers to the coordinator, and +from there directly to the client. + +The direct protocol, also known as the `v1` protocol, has the following +characteristics, compared to the spooling protocol: + +* Provides lower performance, specifically for queries that return more data. +* Results in slower query processing completion on the cluster, since data is + provided by the coordinator and read by the client sequentially. +* Requires **no** object storage or configuration in the Trino cluster. +* Increases CPU and I/O load on the coordinator. +* Works with older client drivers and client applications without support for + the spooling protocol.
+ +### Configuration + +Use of the direct protocol requires no configuration. Find optional +configuration properties in [](prop-protocol-shared). + +## Development and reference information + +Further technical details about the client protocol, including information +useful for developing a client driver, are available in the [Trino client REST +API developer reference](/develop/client-protocol). diff --git a/docs/src/main/sphinx/client/jdbc.md b/docs/src/main/sphinx/client/jdbc.md index 016236c68d13..c3d3344fdb03 100644 --- a/docs/src/main/sphinx/client/jdbc.md +++ b/docs/src/main/sphinx/client/jdbc.md @@ -261,4 +261,23 @@ may not be specified using both methods. `PREPARE ` followed by `EXECUTE `. This reduces network overhead and uses smaller HTTP headers and requires Trino 431 or greater. +* - `encoding` + - Set the encoding when using the [spooling protocol](jdbc-spooling-protocol). + Valid values are JSON with Zstandard compression, `json+zstd` (recommended), + JSON with LZ4 compression, `json+lz4`, and uncompressed JSON, `json`. By + default, the default encoding configured on the cluster is used. + ::: + +(jdbc-spooling-protocol)= +## Spooling protocol + +The Trino JDBC driver automatically uses the spooling protocol to improve +throughput for client interactions with higher data transfer demands, if the +[](protocol-spooling) is configured on the cluster. + +Optionally use the `encoding` parameter to configure a different desired +encoding, compared to the default on the cluster. + +The JVM process using the JDBC driver must have network access to the spooling +object storage. diff --git a/docs/src/main/sphinx/develop/client-protocol.md b/docs/src/main/sphinx/develop/client-protocol.md index 346617d585d3..b7d353ef7d9f 100644 --- a/docs/src/main/sphinx/develop/client-protocol.md +++ b/docs/src/main/sphinx/develop/client-protocol.md @@ -6,6 +6,9 @@ the community. The preferred method to interact with Trino is using these existing clients.
This document provides details about the API for reference. It can also be used to implement your own client, if necessary. +Find more information about client drivers, client applications, and the client +protocol configuration in the [client documentation](/client). + ## HTTP methods - A `POST` to `/v1/statement` runs the query string in the `POST` body, diff --git a/docs/src/main/sphinx/overview/concepts.md b/docs/src/main/sphinx/overview/concepts.md index a6776de1dbad..fa8bcc7b9e6c 100644 --- a/docs/src/main/sphinx/overview/concepts.md +++ b/docs/src/main/sphinx/overview/concepts.md @@ -96,26 +96,21 @@ using a REST API. Clients allow you to connect to Trino, submit SQL queries, and receive the results. Clients can access all configured data sources using -[catalogs](trino-concept-catalog). Clients are full-featured applications or -libraries and drivers that allow you to connect to any application supporting -that driver, or even your own custom application or script. +[catalogs](trino-concept-catalog). Clients are full-featured client applications +or client drivers and libraries that allow you to connect with any application +supporting that driver, or even your own custom application or script. -Clients include command line tools, desktop applications, web-based +Client applications include command line tools, desktop applications, web-based applications, and software-as-a-service solutions with features such as interactive SQL query authoring with editors, or rich user interfaces for graphical query creation, query running and result rendering, visualizations with charts and graphs, reporting, and dashboard creation. -[A comprehensive list with more details for each client is available on the -Trino website](https://trino.io/ecosystem/client). - -Clients can process the returned data from Trino and be used for data pipelines -across catalogs and from other data sources to Trino catalogs.
- -From a technical perspective, clients only interact with the -[coordinator](trino-concept-coordinator) of the Trino cluster and use the -[](/develop/client-protocol). +Client applications that support other query languages or user interface +components to build a query must translate each request to [SQL as supported by +Trino](/language). +More details are available in the [Trino client documentation](/client). (trino-concept-data-source)= ## Data source