-
-
-
-
-
-
- <p>Discovered when running tests, when a DataFrame is converted to an RDD (such as when writing to a JDBC relation), the marquez listener is double-triggered- once for the DataFrame operation and again for the RDD operation. Fortunately, the original DataFrame LogicalPlan contains the entire input/output graph, so no information is lost. However, the second set of events does post a new job with no inputs or outputs to the Marquez endpoint.</p>
-
-<p>Example of two posted events- one from the DataFrame and one from the RDD:</p>
-
-<pre><code>Publishing sql event LineageEvent(eventType=COMPLETE, eventTime=2021-01-01T00:00Z, run=LineageEvent.Run(runId=39f36ea5-bd33-4ff5-9a64-00fa2f439d21, facets=LineageEvent.RunFacet(nominalTime=null, parent=LineageEvent.ParentRunFacet(run=LineageEvent.RunLink(runId=ab478a6e-0226-46bb-b5c3-2b23179a1277), job=LineageEvent.JobLink(namespace=Namespace, name=ParentJob)), additional={spark.logicalPlan=LogicalPlanFacet(plan=[{"class":"org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand","num_children":0,"query":[{"class":"org.apache.spark.sql.catalyst.plans.logical.Filter","num-children":1,"condition":[{"class":"org.apache.spark.sql.catalyst.expressions.GreaterThan","num_children":2,"left":0,"right":1},{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num_children":0,"name":"age","dataType":"long","nullable":true,"metadata":{},"exprId":{"product_class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":6,"jvmId":"2bb20251-cb0e-4586-a2b0-ae90290f14e4"},"qualifier":[]},{"class":"org.apache.spark.sql.catalyst.expressions.Cast","num-children":1,"child":0,"dataType":"long","timeZoneId":"America/Los_Angeles"},{"class":"org.apache.spark.sql.catalyst.expressions.Literal","num_children":0,"value":"16","dataType":"integer"}],"child":0},{"class":"org.apache.spark.sql.execution.datasources.LogicalRelation","num-children":0,"relation":null,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"age","dataType":"long","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":6,"jvmId":"2bb20251_cb0e_4586_a2b0_ae90290f14e4"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num_children":0,"name":"name","dataType":"string","nullable":true,"metadata":{},"exprId":{"product_class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":7,"jvmId":"2bb20251-cb0e-4586-a2b0-ae90290f14e4"},"qualifier":[]}]],"isStreaming":false}],"dataSource":null,"options":null,"mode":null}])})), job=LineageEvent.Job(namespace=Namespace, name=word_count.execute_save_into_data_source_command, facets=null), inputs=[LineageEvent.Dataset(namespace=sqlite:file:///var/folders/x_/47q8z0xx2357hrl3w6k9xf_80000gn/T/junit12164537964587498492/sqlite/database, name=data_table, facets=LineageEvent.DatasetFacet(documentation=null, schema=LineageEvent.SchemaDatasetFacet(fields=[LineageEvent.SchemaField(name=age, type=decimal(20,0), description=null), LineageEvent.SchemaField(name=name, type=string, description=null)]), dataSource=LineageEvent.DatasourceDatasetFacet(name=sqlite:file:///var/folders/x_/47q8z0xx2357hrl3w6k9xf_80000gn/T/junit12164537964587498492/sqlite/database, uri=sqlite:file:///var/folders/x_/47q8z0xx2357hrl3w6k9xf_80000gn/T/junit12164537964587498492/sqlite/database), description=null, additional={}))], outputs=[LineageEvent.Dataset(namespace=sqlite:file:///var/folders/x_/47q8z0xx2357hrl3w6k9xf_80000gn/T/junit12164537964587498492/sqlite/database, name=data_table, facets=LineageEvent.DatasetFacet(documentation=null, schema=LineageEvent.SchemaDatasetFacet(fields=[LineageEvent.SchemaField(name=age, type=decimal(20,0), description=null), LineageEvent.SchemaField(name=name, type=string, description=null)]), dataSource=LineageEvent.DatasourceDatasetFacet(name=sqlite:file:///var/folders/x_/47q8z0xx2357hrl3w6k9xf_80000gn/T/junit12164537964587498492/sqlite/database, uri=sqlite:file:///var/folders/x_/47q8z0xx2357hrl3w6k9xf_80000gn/T/junit12164537964587498492/sqlite/database), description=null, additional={}))], producer=<https://github.com/MarquezProject/marquez/tree/0.12.0/integrations/spark)>
-Publishing RDD event LineageEvent(eventType=COMPLETE, eventTime=2021-01-01T00:00Z, run=LineageEvent.Run(runId=ab478a6e-0226-46bb-b5c3-2b23179a1277, facets=LineageEvent.RunFacet(nominalTime=null, parent=LineageEvent.ParentRunFacet(run=LineageEvent.RunLink(runId=ab478a6e-0226-46bb-b5c3-2b23179a1277), job=LineageEvent.JobLink(namespace=Namespace, name=ParentJob)), additional={})), job=LineageEvent.Job(namespace=Namespace, name=word_count, facets=null), inputs=[], outputs=[], producer=<https://github.com/MarquezProject/marquez/tree/0.12.0/integrations/spark)>
-
-</code></pre>
-
-<p>Note that the RDD event has no inputs/outputs and has a distinct run id.</p>
-
-
-
-
Assignees
- OleksandrDvornik
-
-
-
-
-
-
-
-
-
-
-
-
-
-