
Add generate_series() udtf (and introduce 'lazy' MemoryExec) #13540

Open · wants to merge 12 commits into base: main

Conversation

@2010YOUY01 (Contributor) commented Nov 23, 2024

Which issue does this PR close?

Closes #10069

Rationale for this change

To make it easier to test memory-limited queries, I was looking for a data source that can generate long streams without itself consuming much memory.
Several existing approaches come close, but none fully satisfies this requirement:

  • MemoryExec has to buffer all output data up front
  • select * from unnest(generate_series(1,10000)) also has to buffer all output during construction
  • It's possible to run such tests on large files, but they are hard to set up and take longer to run

So this PR introduces a new StreamingMemoryExec, which can generate batches lazily by defining an 'iterator' over RecordBatches in its interface. A new user-defined table function is then implemented on top of this new execution plan.

This function's behavior is kept consistent with DuckDB's generate_series() table function; only the 2-argument variant is implemented for now.

I also think this UDTF can be useful for tests and micro-benchmarks in general:

> select * from generate_series(1,3);
+-------+
| value |
+-------+
| 1     |
| 2     |
| 3     |
+-------+
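The core idea of the lazy execution plan can be illustrated with a simplified, self-contained sketch (hypothetical types; the real implementation produces Arrow RecordBatches, not Vec<i64>): a generator holds only a cursor and emits one bounded batch per call, so memory stays proportional to batch_size rather than to the total row count.

```rust
// Simplified sketch of the lazy-generation idea: instead of buffering
// all rows up front (as MemoryExec does), a generator produces one
// bounded batch at a time on demand.
trait LazyBatchGenerator {
    /// Returns the next batch of values, or None when exhausted.
    fn generate_next_batch(&mut self) -> Option<Vec<i64>>;
}

/// Emits the inclusive range [next, end] in chunks of `batch_size`.
struct GenerateSeries {
    next: i64,
    end: i64,
    batch_size: i64,
}

impl LazyBatchGenerator for GenerateSeries {
    fn generate_next_batch(&mut self) -> Option<Vec<i64>> {
        if self.next > self.end {
            return None;
        }
        let upper = self.end.min(self.next + self.batch_size - 1);
        let batch: Vec<i64> = (self.next..=upper).collect();
        self.next = upper + 1;
        Some(batch)
    }
}

fn main() {
    let mut gen = GenerateSeries { next: 1, end: 5, batch_size: 2 };
    let mut batches = Vec::new();
    while let Some(b) = gen.generate_next_batch() {
        batches.push(b);
    }
    // Only one small batch is ever materialized at a time.
    assert_eq!(batches, vec![vec![1, 2], vec![3, 4], vec![5]]);
    println!("{batches:?}");
}
```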

What changes are included in this PR?

  1. Move TableFunctionImpl from the datafusion crate to the datafusion-catalog crate, so new UDTFs defined in the datafusion-functions-table crate won't have a circular dependency on the datafusion crate (TableFunctionImpl depends on TableProvider in the datafusion-catalog crate, so I think that's the best place to move it to)
  2. Introduced StreamingMemoryExec
  3. Implemented a new UDTF generate_series()

Are these changes tested?

Unit tests for StreamingMemoryExec, and sqllogictests for the generate_series() UDTF

Are there any user-facing changes?

No

github-actions bot added labels on Nov 23, 2024: physical-expr (Physical Expressions), core (Core DataFusion crate), sqllogictest (SQL Logic Tests (.slt)), catalog (Related to the catalog crate)
) -> Result<Self> {
let cache = PlanProperties::new(
EquivalenceProperties::new(Arc::clone(&schema)),
Partitioning::RoundRobinBatch(generators.len()),
@2010YOUY01 (Contributor, Author) commented Nov 23, 2024:
Does this field mean: if an exec has only 1 generator, the optimizer will try to insert a RepartitionExec on the output of StreamingMemoryExec, and use round robin to repartition output batches into target_partitions partitions?
I found this works for some aggregate queries, but not for a sort query:

> explain select * from generate_series(1, 10000) as t1(v1) order by v1;
+---------------+----------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                           |
+---------------+----------------------------------------------------------------------------------------------------------------+
| logical_plan  | Sort: t1.v1 ASC NULLS LAST                                                                                     |
|               |   SubqueryAlias: t1                                                                                            |
|               |     Projection: tmp_table.value AS v1                                                                          |
|               |       TableScan: tmp_table projection=[value]                                                                  |
| physical_plan | SortExec: expr=[v1@0 ASC NULLS LAST], preserve_partitioning=[false]                                            |
|               |   ProjectionExec: expr=[value@0 as v1]                                                                         |
|               |     StreamingMemoryExec: partitions=1, batch_generators=[generate_series: start=1, end=10000, batch_size=8192] |
|               |                                                                                                                |
+---------------+----------------------------------------------------------------------------------------------------------------+

A parallelized sort should look like this:

> explain select * from lineitem order by l_orderkey;
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Sort: lineitem.l_orderkey ASC NULLS LAST                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|               |   TableScan: lineitem projection=[l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| physical_plan | SortPreservingMergeExec: [l_orderkey@0 ASC NULLS LAST]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|               |   SortExec: expr=[l_orderkey@0 ASC NULLS LAST], preserve_partitioning=[true]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|               |     ParquetExec: file_groups={14 groups: [[Users/yongting/Code/datafusion/benchmarks/data/tpch_sf1/lineitem/part-0.parquet:0..11525426], [Users/yongting/Code/datafusion/benchmarks/data/tpch_sf1/lineitem/part-0.parquet:11525426..20311205, Users/yongting/Code/datafusion/benchmarks/data/tpch_sf1/lineitem/part-1.parquet:0..2739647], [Users/yongting/Code/datafusion/benchmarks/data/tpch_sf1/lineitem/part-1.parquet:2739647..14265073], [Users/yongting/Code/datafusion/benchmarks/data/tpch_sf1/lineitem/part-1.parquet:14265073..20193593, Users/yongting/Code/datafusion/benchmarks/data/tpch_sf1/lineitem/part-2.parquet:0..5596906], [Users/yongting/Code/datafusion/benchmarks/data/tpch_sf1/lineitem/part-2.parquet:5596906..17122332], ...]}, projection=[l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment] |
|               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I want to know whether this is correct, and whether this sort query could in theory be parallelized by changing something in the optimization phase? 🤔

/// Stream that generates record batches on demand
pub struct StreamingMemoryStream {
schema: SchemaRef,
generator: Box<dyn StreamingBatchGenerator>,
Contributor commented:

Should we use Arc, given that it makes sense to have the generator generate across threads?

@2010YOUY01 (Contributor, Author) replied:

Good point, Arc is more flexible: an implementation can choose to share a StreamingBatchGenerator between multiple streams, or create a separate generator for each stream.

/// Schema representing the data
schema: SchemaRef,
/// Functions to generate batches for each partition
batch_generators: Vec<Box<dyn StreamingBatchGenerator>>,
Contributor commented:
Suggested change
batch_generators: Vec<Box<dyn StreamingBatchGenerator>>,
batch_generators: Vec<Arc<dyn StreamingBatchGenerator>>,

@@ -297,3 +299,40 @@ pub trait TableProviderFactory: Debug + Sync + Send {
cmd: &CreateExternalTable,
) -> Result<Arc<dyn TableProvider>>;
}

/// A trait for table function implementations
pub trait TableFunctionImpl: Debug + Sync + Send {
Contributor commented:
Suggested change
pub trait TableFunctionImpl: Debug + Sync + Send {
/// Returns this function's name
fn name(&self) -> &str;

This is much more consistent with other UDFs

@2010YOUY01 (Contributor, Author) replied:

This is part of the refactor; changing this requires several fixes. We can leave it to a separate PR to keep this PR smaller.

Contributor commented:

I also recommend moving name() inside TableFunctionImpl.

statement error DataFusion error: Error during planning: Second argument must be an integer literal
SELECT * FROM generate_series(1, NULL)

statement error DataFusion error: Error during planning: generate_series expects 2 arguments
Contributor commented, showing DuckDB's behavior for reference:
D SELECT * FROM generate_series(1, 5, null);
┌─────────────────┐
│ generate_series │
│      int64      │
├─────────────────┤
│     0 rows      │
└─────────────────┘

@2010YOUY01 (Contributor, Author) replied:

I'll open an issue for 3-arg support later

@2010YOUY01 (Contributor, Author) commented Nov 27, 2024

Thank you for the comments @jayzhan211, I have updated.

Now I think the concurrent generator might not be very straightforward, so I'll leave some rationale here first.
If we can reach agreement, I'll add more docs for it (otherwise I'll revert to the Box version for simplicity).


Rationale for StreamingMemoryStream design

Suppose we want to run select ... from generate_series(1,100) in two partitions,
and the underlying batch generator is wrapped in a Mutex:

pub struct StreamingMemoryStream {
    ...
    generator: Arc<Mutex<dyn StreamingBatchGenerator>>,
}

It's possible to implement the UDTF in 3 ways:

1.
[generator_1_50]   --- [StreamingMemoryStream  stream1] --> xxStream1
[generator_50_100] --- [StreamingMemoryStream  stream2] --> xxStream2

2.
[generator_1_100] --- [StreamingMemoryStream  stream1] --> Repartition --> xxStream1
                                                                       |-> xxStream2

3.
[generator_1_100] --- [StreamingMemoryStream  stream1] --> xxStream1
                  |-- [StreamingMemoryStream  stream2] --> xxStream2

Cases 1 and 2 are the common pattern DataFusion scanning operators use for plan-time parallelism; the generator won't be accessed by multiple threads, so the Mutex is redundant.
Case 3 makes the StreamingBatchGenerator concurrently accessible from multiple streams.

The Mutex is added to make case 3 possible (so the interface can be more general-purpose for future use cases).
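Case 3 can be sketched with a minimal, self-contained example (hypothetical types, using Vec<i64> batches instead of RecordBatches): two consumers share one generator behind Arc<Mutex<...>>, each locking just long enough to pull the next batch, so every row is produced exactly once regardless of which thread gets it.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Sketch of case 3: one generator shared by multiple streams. Each
// stream locks the generator, fetches one batch, and releases the
// lock, so batches are distributed dynamically across consumers.
trait BatchGenerator: Send {
    fn generate_next_batch(&mut self) -> Option<Vec<i64>>;
}

struct Series {
    next: i64,
    end: i64,
    batch_size: i64,
}

impl BatchGenerator for Series {
    fn generate_next_batch(&mut self) -> Option<Vec<i64>> {
        if self.next > self.end {
            return None;
        }
        let upper = self.end.min(self.next + self.batch_size - 1);
        let batch = (self.next..=upper).collect();
        self.next = upper + 1;
        Some(batch)
    }
}

fn main() {
    let gen: Arc<Mutex<dyn BatchGenerator>> =
        Arc::new(Mutex::new(Series { next: 1, end: 100, batch_size: 10 }));

    let handles: Vec<_> = (0..2)
        .map(|_| {
            let gen = Arc::clone(&gen);
            thread::spawn(move || {
                let mut rows = 0usize;
                loop {
                    // Hold the lock only long enough to fetch one batch.
                    let batch = gen.lock().unwrap().generate_next_batch();
                    match batch {
                        Some(b) => rows += b.len(),
                        None => break,
                    }
                }
                rows
            })
        })
        .collect();

    let total: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    // Every row is produced exactly once across both consumers.
    assert_eq!(total, 100);
    println!("total rows: {total}");
}
```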

@jayzhan211 (Contributor) replied, quoting the rationale above:

From a quick look, I'm not sure whether we need the 3rd case. The 1st case seems to run execution in parallel too; I'd need to think about the advantage of the 3rd case over the 1st.

For the 3rd case, if we do need it, we might use Arc<RwLock<dyn T>> if generate_next_batch takes &self. That might be more efficient.

@berkaysynnada (Contributor) left a review:

Thank you @2010YOUY01. I reviewed the changes and LGTM. I have a few minor comments and one question: I noticed another usage of generate_series(), which can be called like this:

SELECT generate_series(1, 5)

I assume it is not a UDTF in that context. Does this implementation leave room for that usage?


pub fn default_table_functions() -> Vec<Arc<TableFunction>> {
functions_table::all_default_table_functions()
}

Contributor commented:

Can we get the defaults from a singleton? Like other default getters, with get_or_init or something similar.

@2010YOUY01 (Contributor, Author) replied:

This is a good idea, updated.
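The reviewer's get_or_init suggestion could be sketched with std::sync::OnceLock, which initializes the defaults once and hands out cheap Arc clones afterwards. The TableFunction struct below is a stand-in, not DataFusion's real type:

```rust
use std::sync::{Arc, OnceLock};

// Hypothetical sketch: cache the default table functions in a
// process-wide singleton, so repeated calls reuse the same
// Arc-wrapped instances instead of rebuilding the Vec each time.
#[derive(Debug)]
struct TableFunction {
    name: &'static str,
}

fn all_default_table_functions() -> Vec<Arc<TableFunction>> {
    static FUNCTIONS: OnceLock<Vec<Arc<TableFunction>>> = OnceLock::new();
    FUNCTIONS
        .get_or_init(|| vec![Arc::new(TableFunction { name: "generate_series" })])
        .clone() // cheap: clones the Vec of Arcs, not the functions themselves
}

fn main() {
    let a = all_default_table_functions();
    let b = all_default_table_functions();
    // Both calls hand out Arcs pointing at the same underlying function.
    assert!(Arc::ptr_eq(&a[0], &b[0]));
    assert_eq!(a[0].name, "generate_series");
    println!("{} default table function(s)", a.len());
}
```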

#[derive(Debug, Clone)]
struct GenerateSeriesState {
schema: SchemaRef,
_start: i64,
Contributor commented:

Is there any reason to keep this?

@2010YOUY01 (Contributor, Author) replied:

This field is for display; I added a comment to explain.

// Check input `exprs` type and number. Input validity check (e.g. start <= end)
// will be performed in `TableProvider::scan`
fn call(&self, exprs: &[Expr]) -> Result<Arc<dyn TableProvider>> {
// TODO: support 3 arguments following DuckDB:
Contributor commented:

There is also a one-argument usage; perhaps it can be added as a TODO.

@@ -365,8 +366,165 @@ impl RecordBatchStream for MemoryStream {
}
}

pub trait StreamingBatchGenerator: Send + Sync + fmt::Debug + fmt::Display {
Contributor commented:

I am not very satisfied with this Streaming... naming. Can we describe the behavior better with different naming? Perhaps we could use the 'Lazy' term.

@2010YOUY01 (Contributor, Author) replied:

I agree Lazy* is less ambiguous; I updated all the prefixes.

@2010YOUY01 (Contributor, Author) replied, quoting @berkaysynnada's question above:

Thank you for the review @berkaysynnada

The answer to the question is yes; they are different functions with the same name.
Using the same name is probably not the best idea, but we do it to keep the behavior the same as DuckDB's.
I added a SQL test to make sure they can work together (like select generate_series() from generate_series()).

@jayzhan211 I thought about the generator-sharing issue again. I think making the interface more flexible is important, so I kept the lock and updated it to RwLock as you suggested.

Successfully merging this pull request may close these issues.

suggest add synax select from generate_series()