diff --git a/channel/boston-meetup/index.html b/channel/boston-meetup/index.html
index 28d3a36..67d3cf1 100644
--- a/channel/boston-meetup/index.html
+++ b/channel/boston-meetup/index.html
@@ -52,6 +52,12 @@
+*Thread Reply:* Hi Michael, while I’m currently going through the internal review process before creating a PR, I can do a quick demo on the current OpenLineage sensor approach and get some initial feedback, if that is okay.
@@ -747,7 +753,7 @@
+*Thread Reply:* thanks for the demo!
@@ -777,7 +783,7 @@
+*Thread Reply:* Thanks for the opportunity!
@@ -884,7 +890,7 @@
+*Thread Reply:* Thanks for the PR! I'll take a look tomorrow.
@@ -914,7 +920,7 @@
+*Thread Reply:* @Dalin Kim looks great! I approved it. One thing to do is to rebase on main and force-with-lease push: it appears that there are some conflicts.
@@ -940,7 +946,7 @@
+*Thread Reply:* @Maciej Obuchowski Thank you for the review. Just had a small conflict in the CHANGELOG, which has been resolved. For the integration test, do you suggest a similar approach to the Airflow integration, using Flask?
@@ -966,7 +972,7 @@
+*Thread Reply:* Also, I reached out to Sandy from Dagster for a review to make sure this is good from the Dagster side of things.
@@ -992,7 +998,7 @@
+*Thread Reply:* > For the integration test, do you suggest a similar approach to the Airflow integration, using Flask?
+Yes, I think we can reuse that part. The most important thing is that we have real Dagster running "real" workloads.
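A mock backend for such an integration test can be tiny. A sketch in Flask, assuming events are POSTed as JSON to a Marquez-style /api/v1/lineage route (the route shape mirrors Marquez; everything else here is a hypothetical test helper, not the actual harness):
```
# Minimal sketch of a Flask "event sink" an integration test could point the
# OpenLineage client at; the /api/v1/lineage route mirrors Marquez's endpoint,
# but the helper itself is hypothetical.
from flask import Flask, jsonify, request

app = Flask(__name__)
received_events = []  # events collected for later assertions

@app.route("/api/v1/lineage", methods=["POST"])
def collect_event():
    received_events.append(request.get_json())
    return jsonify({"status": "ok"}), 200

@app.route("/api/v1/lineage", methods=["GET"])
def list_events():
    # lets the test fetch everything that was emitted during the run
    return jsonify(received_events), 200

if __name__ == "__main__":
    app.run(port=5000)
```
A test could point the client's OPENLINEAGE_URL at this server, run the Dagster job, and then assert on received_events.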
@@ -1023,7 +1029,7 @@
+*Thread Reply:* Documentation has been updated based on feedback from Yuhan from the Dagster team, and I believe everything is set from my end.
@@ -1049,7 +1055,7 @@
+*Thread Reply:* Great! Let's just get rid of this linting error and I'll merge it then.
@@ -1077,7 +1083,7 @@
+*Thread Reply:* Thank you. I overlooked this when updating the docstring. All should be good now.
+One final question - should we make the Dagster unit test job “required” in the CI, and how can that be configured?
@@ -1114,7 +1120,7 @@
+*Thread Reply:* @Willy Lulciuc I think you configured it, am I right?
+@Dalin Kim one more rebase, please 🙏 I've turned auto-merge on, but unfortunately I can't rebase your branch on a fork.
@@ -1147,7 +1153,7 @@
+*Thread Reply:* Rebased and pushed
@@ -1173,7 +1179,7 @@
+*Thread Reply:* @Dalin Kim merged!
@@ -1199,7 +1205,7 @@
+*Thread Reply:* @Maciej Obuchowski Awesome! Thank you so much for all your help!
@@ -1255,7 +1261,7 @@
+*Thread Reply:* Sorry - we're not running integration tests for Airflow on forks due to security reasons. If you're not touching any Airflow files then it should not affect you at all 🙂
@@ -1281,7 +1287,7 @@
+*Thread Reply:* In other words, it's just as expected.
@@ -1307,7 +1313,7 @@
+*Thread Reply:* Thank you for the clarification
@@ -1565,7 +1571,7 @@
+*Thread Reply:* Thanks for letting us know. We’ll take a look and follow up.
@@ -1591,7 +1597,7 @@
+*Thread Reply:* Just as an update on findings, it appears that this MR introduced a breaking change for the test helper function that creates a test EventLogRecord.
+*Thread Reply:* Thanks for the heads up. We’ll look into it and follow up
@@ -1893,7 +1899,7 @@
+*Thread Reply:* Here is the PR to fix the error with the latest Dagster version
+*Thread Reply:* Thanks! Merged.
@@ -1990,7 +1996,7 @@
+*Thread Reply:* Thanks!
@@ -2120,7 +2126,7 @@
+*Thread Reply:* That's definitely the advantage of the log-parsing method. The Airflow integration, especially the most recent version for Airflow 2.3+, has the advantage of being more robust when it comes to delivering lineage in real-time
@@ -2150,7 +2156,7 @@
+*Thread Reply:* Thanks for the heads up on this integration!
@@ -2176,7 +2182,7 @@
+*Thread Reply:* I suspect future integrations will move towards this pattern as well? Sorry, off-topic for this channel, but this was the first place I've seen this metadata collection method in this project.
@@ -2202,7 +2208,7 @@
+*Thread Reply:* I would be curious to explore similar external executor integrations for Airflow, say for Papermill or Kubernetes (via the corresponding operators). I suppose one would need to pass job metadata through to the respective platform where logs are actually collected.
@@ -2239,6 +2245,58 @@
+*Thread Reply:* hey look, more fun
+https://github.com/OpenLineage/OpenLineage/pull/2263
@@ -576,7 +582,7 @@
+*Thread Reply:* nice to have fun with you Jakub
@@ -606,7 +612,7 @@
+*Thread Reply:* Can't wait to see it on the 1st of January.
@@ -632,7 +638,7 @@
+*Thread Reply:* Ain’t no party like a dev ex improvement party
@@ -658,7 +664,7 @@
+*Thread Reply:* Gentoo installation party is in a similar category of fun
@@ -764,7 +770,7 @@
+*Thread Reply:* future 😉
@@ -790,7 +796,7 @@
+*Thread Reply:* ok
@@ -816,7 +822,7 @@
+*Thread Reply:* I will then replace enum with string
@@ -926,7 +932,7 @@
+*Thread Reply:* this is the next to go
@@ -952,7 +958,7 @@
+*Thread Reply:* and i consider it ready
@@ -978,7 +984,7 @@
+*Thread Reply:* Then we have a draft one with streaming support https://github.com/MarquezProject/marquez/pull/2682/files -> which has an integration test of the lineage endpoint working for streaming jobs
@@ -1004,7 +1010,7 @@
+*Thread Reply:* I still need to work on #2682 but you can review #2654. once you get some sleep, of course 😉
@@ -1064,7 +1070,7 @@
+*Thread Reply:* did you check if LineageCollector is instantiated once per process?
+*Thread Reply:* Using it only via get_hook_lineage_collector
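A sketch of the accessor pattern implied by get_hook_lineage_collector: a lazily created module-level singleton, so the collector is instantiated once per process (the class internals here are assumptions, not the actual source):
```
# Sketch of the accessor pattern being discussed: a module-level singleton so
# every call site shares one collector per process. Names follow the thread;
# the implementation details are assumptions.
from typing import Optional

class LineageCollector:
    def __init__(self) -> None:
        self.collected = []

    def add(self, entry) -> None:
        self.collected.append(entry)

_collector: Optional[LineageCollector] = None

def get_hook_lineage_collector() -> LineageCollector:
    # instantiate lazily, exactly once per process
    global _collector
    if _collector is None:
        _collector = LineageCollector()
    return _collector
```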
is it time to support hudi?
+*Thread Reply:* Maybe something like “OL has many desirable integrations, including a best-in-class Spark integration, but it’s like any other open standard in that it requires contributions in order to approach total coverage. Thankfully, we have many active contributors, and integrations are being added or improved upon all the time.”
@@ -1246,7 +1252,7 @@
+*Thread Reply:* Maybe rephrase pain points to "something we're not actively focusing on"
@@ -1298,7 +1304,7 @@
+*Thread Reply:* you are now admin
@@ -1354,7 +1360,7 @@
+*Thread Reply:* we have it in SQL operators
@@ -1380,7 +1386,7 @@
+*Thread Reply:* Ooh, any docs / code? or if you’d like to respond in the MQZ slack 🙏
@@ -1406,7 +1412,7 @@
+*Thread Reply:* I’ll reply there
@@ -1462,7 +1468,7 @@
+*Thread Reply:* What about GitHub projects?
@@ -1492,7 +1498,7 @@
+*Thread Reply:* Projects is the way to go, thanks
@@ -1518,7 +1524,7 @@
+*Thread Reply:* Set up a Projects board. New projects are private by default. We could make it public. The one thing that’s missing that we could use is a built-in date field for alerting about upcoming deadlines…
@@ -1574,7 +1580,7 @@
+*Thread Reply:* https://newsroom.accenture.com/news/2023/accenture-to-expand-government-transformation-capabilities-in-the-uk-with-acquisition-of-6point6
+*Thread Reply:* We should sell OL to governments
@@ -1655,7 +1661,7 @@
+*Thread Reply:* we may have to rebrand to ClosedLineage
@@ -1681,7 +1687,7 @@
+*Thread Reply:* not in this way; just emit any event a second time to a secret NSA endpoint
@@ -1707,7 +1713,7 @@
+*Thread Reply:* we would need to improve our stock photo game
@@ -1760,7 +1766,7 @@
+*Thread Reply:* thanks, updated the talks board
@@ -1786,7 +1792,7 @@
+*Thread Reply:* https://github.com/orgs/OpenLineage/projects/4/views/1
@@ -1812,7 +1818,7 @@
+*Thread Reply:* I'm in, will think what to talk about and appreciate any advice 🙂
@@ -1889,7 +1895,7 @@
+*Thread Reply:* It looks like the DataHub Airflow plugin uses OL, but turns it off:
+https://github.com/datahub-project/datahub/blob/2b0811b9875d7d7ea11fb01d0157a21fdd67f020/docs/lineage/airflow.md
+disable_openlineage_plugin | true | Disable the OpenLineage plugin to avoid duplicative processing.
+They reuse the extractors but then “patch” the behavior.
+*Thread Reply:* Of course this approach will need changing again with AF 2.7
@@ -2130,7 +2136,7 @@
+*Thread Reply:* It’s their choice 🤷
@@ -2156,7 +2162,7 @@
+*Thread Reply:* It looks like we can possibly learn from their approach in SQL parsing: https://datahubproject.io/docs/lineage/airflow/#automatic-lineage-extraction
+*Thread Reply:* what's that approach? I only know they have been claiming best SQL parsing capabilities
@@ -2229,7 +2235,7 @@
+*Thread Reply:* I haven’t looked into the details but I’m assuming it is in this repo. (my comment is entirely based on the claim here)
@@ -2255,7 +2261,7 @@
+*Thread Reply:* <https://www.acryldata.io/blog/extracting-column-level-lineage-from-sql>
+-> The interesting difference is that in order to find table schemas, they use their data catalog to evaluate column-level lineage instead of doing this on the client side.
+My understanding by example is: if you do
+create table x as select * from y
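then every column of x can be resolved by expanding the star against y's schema from the catalog. A toy sketch of that resolution step, with a plain dict standing in for the catalog service:
```
# Toy sketch of the idea: with table schemas available from a catalog,
# `select *` can be expanded so each output column of x maps back to a
# column of y. The catalog dict is a stand-in for a real metadata service.
catalog = {"y": ["id", "name", "amount"]}  # hypothetical schema lookup

def column_lineage_for_ctas(target: str, source: str) -> dict:
    # create table <target> as select * from <source>
    return {f"{target}.{col}": [f"{source}.{col}"] for col in catalog[source]}

print(column_lineage_for_ctas("x", "y"))
# {'x.id': ['y.id'], 'x.name': ['y.name'], 'x.amount': ['y.amount']}
```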
@@ -2290,7 +2296,7 @@
+*Thread Reply:* Ah! That would have been a good idea, but I can’t :(
@@ -2367,7 +2373,7 @@
+*Thread Reply:* Do you prefer an earlier meeting tomorrow?
@@ -2393,7 +2399,7 @@
+*Thread Reply:* maybe let's keep today's meeting then
@@ -2408,6 +2414,12684 @@
+*Thread Reply:* slack is the new email?
@@ -583,7 +589,7 @@
+*Thread Reply:* :(
@@ -609,7 +615,7 @@
+*Thread Reply:* I'd prefer a google group as well
@@ -635,7 +641,7 @@
+*Thread Reply:* I think that is better for keeping people engaged, since it isn't just a ton of history to go through
@@ -661,7 +667,7 @@
+*Thread Reply:* And I think it is also better for having thoughtful design discussions
@@ -687,7 +693,7 @@
+*Thread Reply:* I’m happy to create a google group if that would help.
@@ -713,7 +719,7 @@
+*Thread Reply:* Here it is: https://groups.google.com/g/openlineage
@@ -739,7 +745,7 @@
+*Thread Reply:* Slack is more of a way to nudge discussions along, we can use github issues or the mailing list to discuss specific points
@@ -765,7 +771,7 @@
+*Thread Reply:* @Ryan Blue and @Wes McKinney any recommendations on automating sending GitHub issue updates to that list?
@@ -791,7 +797,7 @@
+*Thread Reply:* I don't really know how to do that
@@ -817,7 +823,7 @@
+*Thread Reply:* @Julien Le Dem How about using GitHub Discussions? They are specifically meant to solve this problem. The feature is still in beta, but it can be enabled from repository settings. One positive side I see is that it will be really easy to follow, and it gives one separate place to go and look for the discussions and ideas being discussed.
@@ -843,7 +849,7 @@
+*Thread Reply:* I just enabled it: https://github.com/OpenLineage/OpenLineage/discussions
@@ -899,7 +905,7 @@
+*Thread Reply:* the plan is to use github issues for discussions on the spec. This is to supplement
@@ -1313,7 +1319,7 @@
+*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/1
+*Thread Reply:* Looks great!
@@ -1620,7 +1626,7 @@
+*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/1
+*Thread Reply:* Welcome!
@@ -1888,7 +1894,7 @@
+*Thread Reply:* Yes, that’s about right.
@@ -1914,7 +1920,7 @@
+*Thread Reply:* There’s some subtlety here.
@@ -1940,7 +1946,7 @@
+*Thread Reply:* The initial OpenLineage spec is pretty explicit about linking metadata primarily to execution of specific tasks, which is appropriate for ValidationResults in Great Expectations
@@ -1970,7 +1976,7 @@
+*Thread Reply:* There isn’t as strong a concept of persistent data objects (e.g. a specific table, or batches of data from a specific table)
@@ -2000,7 +2006,7 @@
+*Thread Reply:* (In the GE ecosystem, we call these DataAssets and Batches)
@@ -2026,7 +2032,7 @@
+*Thread Reply:* This is also an important conceptual unit, since it’s the level of analysis where Expectations and data docs would typically attach.
@@ -2056,7 +2062,7 @@
+*Thread Reply:* @James Campbell and I have had some productive conversations with @Julien Le Dem and others about this topic
@@ -2082,7 +2088,7 @@
+*Thread Reply:* Yep! The next step will be to open a few GitHub issues with proposals to add to or amend the spec. We would probably start with a Descriptive Dataset facet for a dataset profile (or dataset update profile). There are other aspects to clarify as well, as @Abe Gong is explaining above.
@@ -2138,7 +2144,7 @@
+*Thread Reply:* Totally! We had a quick discussion with Dagster. Looking forward to proposals along those lines.
@@ -2194,7 +2200,7 @@
+*Thread Reply:* Thanks, @Harikiran Nayak! It’s amazing to see such interest in the community on defining a standard for lineage metadata collection.
@@ -2220,7 +2226,7 @@
+*Thread Reply:* Yep! It’s a validation that the problem is real!
@@ -2277,7 +2283,7 @@
+*Thread Reply:* Welcome!
@@ -2307,7 +2313,7 @@
+*Thread Reply:* What are your current use cases for lineage?
@@ -2440,7 +2446,7 @@
+*Thread Reply:* Welcome!
@@ -2466,7 +2472,7 @@
+*Thread Reply:* Would you mind sharing your main use cases for collecting lineage?
@@ -2548,7 +2554,7 @@
+*Thread Reply:* Definitely, each dashboard should become a node in the lineage graph. That way you can understand all the dependencies of a given dashboard. Some examples of interesting metadata around this: is the dashboard updated in a timely fashion (data freshness); is the data correct (data quality)? Observing changes upstream of the dashboard will provide insights into what’s happening when freshness or quality suffer.
@@ -2574,7 +2580,7 @@
+*Thread Reply:* 100%. On a granular scale, the difference between a visualization and a dashboard can be interesting. One visualization can be connected to multiple dashboards. But of course this depends on the BI tool; Redash would be an example in this case.
@@ -2600,7 +2606,7 @@
+*Thread Reply:* We would need to decide how to model those things. Possibly as a Job type for dashboard and visualization.
@@ -2626,7 +2632,7 @@
+*Thread Reply:* It could be. It’s interesting: in Redash, for example, you create custom queries that run at certain intervals to produce the data you need to visualize. Pretty much equivalent to a job. But you then build certain visualizations off of that “job”. Then you build dashboards off of visualizations. So you could model it as a job, or it could make sense for it to be modeled more like a dataset.
+That’s the hard part of this. How do you model a visualization/dashboard to cover all the possible ways they can be created, since it differs depending on how the tool you use abstracts away creating a visualization?
@@ -2688,7 +2694,7 @@
+*Thread Reply:* Part of my role at Netflix is to oversee our data lineage story, so I'm very interested in this effort and hope to be able to participate in its success
@@ -2714,7 +2720,7 @@
+*Thread Reply:* Hi Jason and welcome
@@ -2804,7 +2810,7 @@
+*Thread Reply:* The OpenLineage reference implementation in Marquez will be presented this morning, Thursday (01/07) at 10AM PST, at the Marquez Community meeting.
+When: Thursday, January 7th at 10AM PST
+Where: https://us02web.zoom.us/j/89344845719?pwd=Y09RZkxMZHc2U3pOTGZ6SnVMUUVoQT09
@@ -2833,7 +2839,7 @@
+*Thread Reply:* https://wiki.lfaidata.foundation/pages/viewpage.action?pageId=18481442
@@ -2859,7 +2865,7 @@
+*Thread Reply:* that’s in 15 min
@@ -2885,7 +2891,7 @@
+*Thread Reply:* https://lists.lfaidata.foundation/g/marquez-technical-discuss/viewevent?repeatid=32038&eventid=987925&calstart=2021-01-07
@@ -2911,7 +2917,7 @@
+*Thread Reply:* And it’s merged!
@@ -2937,7 +2943,7 @@
+*Thread Reply:* Marquez now has a reference implementation of the initial OpenLineage spec
@@ -3019,7 +3025,7 @@
+*Thread Reply:* There’s no explicit roadmap so far. With the initial spec defined and the reference implementation implemented, next steps are to define more facets (for example, data shape, dataset size, etc), provide clients to facilitate integrations (java, python, …), and implement more integrations (Spark in the works). Members of the community are welcome to drive their own initiatives around the core spec. One of the design goals of facets is to enable numerous and independent parallel efforts
@@ -3045,7 +3051,7 @@
+*Thread Reply:* Is there something in particular you are interested in?
@@ -3230,7 +3236,7 @@
+*Thread Reply:* You can find the initial spec here: https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md
+The process to contribute to the model is described here: https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md
+In particular, now we’d want to contribute more facets and integrations. Marquez has a reference implementation: https://github.com/MarquezProject/marquez/pull/880
@@ -3294,7 +3300,7 @@
+*Thread Reply:* It is not on general, as it can be a bit noisy
@@ -3346,7 +3352,7 @@
+*Thread Reply:* Hi @Zachary Friedman! I replied below. https://openlineage.slack.com/archives/C01CK9T7HKR/p1612918909009900
+*Thread Reply:* Feel free to comment, this is ready to merge
@@ -3532,7 +3538,7 @@
+*Thread Reply:* Thanks, Julien. The new spec format looks great 👍
@@ -3764,7 +3770,7 @@
+*Thread Reply:* @Kevin Mellott and @Petr Šimeček Thanks for the confirmation on this slack message. To make your comment visible to the wider community, please chime in on the github issue as well: https://github.com/OpenLineage/OpenLineage/issues/20
+Thank you.
+*Thread Reply:* The PR is out for this: https://github.com/OpenLineage/OpenLineage/pull/23
+*Thread Reply:* Okay, I found the spark integration in Marquez calls the /lineage endpoint. But I am still curious about the future plan to integrate with other tools, like airflow?
@@ -4070,7 +4076,7 @@
+*Thread Reply:* Just restating some of my answers from the Marquez slack for the benefit of folks here.
+• OpenLineage defines the schema to collect metadata
+• Marquez has a /lineage endpoint implementing the OpenLineage spec to receive this metadata, implemented by the OpenLineageResource you pointed out
@@ -4103,7 +4109,7 @@
+*Thread Reply:* thank you! very clear answer🙂
@@ -4155,7 +4161,7 @@
+*Thread Reply:* I forgot to reply in thread 🙂 https://openlineage.slack.com/archives/C01CK9T7HKR/p1614725462008300
+*Thread Reply:* Ha, I signed up just to ask this precise question!
@@ -4303,7 +4309,7 @@
+*Thread Reply:* I’m still looking into the spec myself. Are we required to have 1 or more runs per Job? Or can a Job exist without a run event?
@@ -4329,7 +4335,7 @@
+*Thread Reply:* A run event can be emitted when it starts, and it can stay in the RUNNING state unless something happens to the job. Additionally, you could send events periodically with state RUNNING to inform the system that the job is healthy.
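A sketch of that heartbeat pattern with the openlineage-python client; the namespace, job name, and backend URL are placeholders:
```
# Sketch of the periodic-heartbeat idea using the openlineage-python client;
# one runId is shared by all state transitions of the same run.
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # placeholder backend
run_id = str(uuid4())

def emit_state(state: RunState) -> None:
    client.emit(RunEvent(
        eventType=state,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=run_id),
        job=Job(namespace="my-namespace", name="my-job"),
        producer="https://example.com/my-producer",
    ))

emit_state(RunState.START)
emit_state(RunState.RUNNING)   # re-sent periodically as a liveness signal
emit_state(RunState.COMPLETE)
```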
@@ -4515,7 +4521,7 @@
+*Thread Reply:* That's a good idea.
+Now I'm wondering: let's say we want to track on which offset a checkpoint ended processing. That would mean we want to expose checkpoint id, time, and offset. I suppose we don't want to overwrite previous checkpoint info, so we want to have some collection of data in this facet.
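One possible shape for such a facet, keeping checkpoints as an append-only list so earlier entries survive; the facet name and fields are hypothetical, not part of the spec:
```
# Hypothetical run facet accumulating checkpoints as a growing list,
# so earlier checkpoint info is not overwritten.
checkpoint_facet = {
    "_producer": "https://example.com/my-producer",
    "_schemaURL": "https://example.com/schemas/CheckpointRunFacet.json",
    "checkpoints": [
        {"id": "chk-41", "time": "2021-06-01T12:00:00Z", "offset": 1041},
        {"id": "chk-42", "time": "2021-06-01T12:05:00Z", "offset": 2210},
    ],
}
```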
@@ -4572,7 +4578,7 @@
+*Thread Reply:* Thanks Julien! I will try to wrap my head around some use-cases and see how it maps to the current spec. From there, I can see if I can figure out any proposals
@@ -4598,7 +4604,7 @@
+*Thread Reply:* You can use the proposal issue template to propose a new facet, for example: https://github.com/OpenLineage/OpenLineage/issues/new/choose
@@ -4794,7 +4800,7 @@
+*Thread Reply:* One thing we were considering is just adding these in as Facets (Tags as per Marquez), and then plugging into some external people-managing system. However, I think the question can be generalized to “should there be some sort of generic entity that can enable relationships between itself and Datasets, Jobs, Runs” as part of an integration element?
+*Thread Reply:* That’s a great topic of discussion. I would definitely use the OpenLineage facets to capture what you describe as aspect above. The current Marquez model has a simple notion of ownership at the namespace level, but this needs to be extended to enable the use cases you are describing (owning a dataset or a job). Right now the owner is just a generic identifier as a string (a user id or a group id, for example). Once things are tagged (in some way), you can use the lineage API to find all the downstream or upstream jobs and datasets. In OpenLineage I would start by being able to capture the owner identifier in a facet, with contact info optional if it’s available at runtime. It will have the advantage of keeping track of how that changed over time. This definitely deserves its own discussion.
+*Thread Reply:* And also, to make sure I understand your use case: you want to be able to notify the consumers of a dataset that it is being discontinued/replaced/…? What else are you thinking about?
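Following that idea, an ownership facet on a dataset could carry a list of owner identifiers with optional contact info. A sketch of the payload; the field names are illustrative, and only _producer/_schemaURL follow the facet convention:
```
# Hypothetical ownership facet on a dataset: multiple owners of different
# types, with optional contact info, tracked over time via run events.
owners_facet = {
    "_producer": "https://example.com/my-producer",
    "_schemaURL": "https://example.com/schemas/OwnershipDatasetFacet.json",
    "owners": [
        {"name": "group:data-platform", "type": "MAINTAINER"},
        {"name": "user:olessia", "type": "STAKEHOLDER", "contact": "olessia@example.com"},
    ],
}
```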
@@ -4872,7 +4878,7 @@
+*Thread Reply:* Let me pull in my colleagues
@@ -4898,7 +4904,7 @@
+*Thread Reply:* Standby
@@ -4924,7 +4930,7 @@
+*Thread Reply:* 👋 Hi Julien. I’m Olessia, I’m working on the metadata collection implementation with Adam. Some thoughts on this:
@@ -4950,7 +4956,7 @@
+*Thread Reply:* To start off, we’re thinking that there often isn’t a single owner, but rather a set of Stakeholders that evolve over time. So we’d like to be able to attach multiple entries, possibly of different types, to a Dataset. We’re also thinking that a dataset should have at least one owner. So a few things I’d like to confirm/discuss options:
+Curious to hear your thoughts on all of this!
@@ -4979,7 +4985,7 @@
+*Thread Reply:* > To start off, we’re thinking that there often isn’t a single owner, but rather a set of Stakeholders that evolve over time. So we’d like to be able to attach multiple entries, possibly of different types, to a Dataset. We’re also thinking that a dataset should have at least one owner. So a few things I’d like to confirm/discuss options:
+> If I were to stay true to the spec as it’s defined atm I wouldn’t be able to add a required facet. True/false?
+Correct. The spec defines what facets look like (and how you can make your own custom facets), but it does not make statements about whether facets are required. However, you can have your own validation and make certain things required if you wish to on the client side.
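Such client-side validation could be a small pre-emit check. A sketch, assuming a purely local convention that every dataset must carry an owners facet:
```
# Sketch of client-side validation: refuse to emit a dataset that lacks an
# owners facet. Entirely a local convention, not something the spec enforces.
def validate_dataset(dataset: dict) -> None:
    facets = dataset.get("facets", {})
    if not facets.get("owners", {}).get("owners"):
        raise ValueError(f"dataset {dataset.get('name')} must declare at least one owner")

validate_dataset({
    "name": "db.table",
    "facets": {"owners": {"owners": [{"name": "user:abc"}]}},
})
```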
@@ -5025,7 +5031,7 @@
+*Thread Reply:* Marquez existed before OpenLineage. In particular, the /run endpoint to create and update runs will be deprecated as the OpenLineage /lineage endpoint replaces it. At the moment we are mapping OpenLineage metadata to Marquez. Soon Marquez will have all the facets exposed in the Marquez API. (See: https://github.com/MarquezProject/marquez/pull/894/files) We could make Marquez configurable or pluggable for validation purposes. There is already a notion of LineageListener, for example, although Marquez collects the metadata. I feel like this validation would be better upstream or with some other mechanism. The question is: when do you create a dataset vs when do you become a stakeholder? What are the various stakeholders, and what is the responsibility of the minimum one stakeholder? I would probably make it required, for deploying the job, that the stakeholder is defined. This would apply to the output dataset and would be collected in Marquez.
@@ -5059,7 +5065,7 @@
+*Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1621887895004200
+*Thread Reply:* Marquez is the reference implementation for receiving events and tracking changes. But the definition of the API lets others receive them (and also enables using OpenLineage events to sync between systems)
@@ -5292,7 +5298,7 @@
+*Thread Reply:* In particular, Egeria is involved in enabling receiving and emitting OpenLineage
@@ -5318,7 +5324,7 @@
+*Thread Reply:* Thanks @Julien Le Dem. So to get specific, if dbt were to emit OpenLineage events, how would this work? Would dbt Cloud hypothetically allow users to configure an endpoint to send OpenLineage events to, similar in UI implementation to configuring a Stripe webhook perhaps? And then whatever server the user would input here would point to somewhere that implements receipt of OpenLineage payloads? This is all a very hypothetical example, but trying to ground it in something I have a solid mental model for.
@@ -5344,7 +5350,7 @@
+*Thread Reply:* hypothetically speaking, that all sounds right. so a user who, e.g., has a dbt pipeline and an AWS Glue pipeline could configure both of those projects to point to the same OpenLineage service and get their entire lineage graph even if the two pipelines aren't connected.
@@ -5370,7 +5376,7 @@
+*Thread Reply:* Yeah, OpenLineage events need to be published to a backend (can be Kafka, can be a graphDB, etc). Your Stripe webhook analogy is aligned with how events can be received. For example, in Marquez, we expose a /lineage endpoint that consumes OpenLineage events. We then map an OpenLineage event to the Marquez model (sources, datasets, jobs, runs) that’s persisted in postgres.
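Publishing to that endpoint is a plain HTTP POST. A minimal sketch against a local Marquez (default port 5000); all payload values are placeholders:
```
# Minimal sketch of publishing an event to Marquez's /api/v1/lineage endpoint
# with requests; host, port, and payload values are placeholders.
import requests

event = {
    "eventType": "COMPLETE",
    "eventTime": "2021-06-01T12:00:00Z",
    "run": {"runId": "3f1a3b7e-0000-0000-0000-000000000000"},
    "job": {"namespace": "my-namespace", "name": "my-job"},
    "inputs": [],
    "outputs": [],
    "producer": "https://example.com/my-producer",
}

resp = requests.post("http://localhost:5000/api/v1/lineage", json=event)
resp.raise_for_status()
```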
+*Thread Reply:* Thanks both!
@@ -5422,7 +5428,7 @@
+*Thread Reply:* sorry, I was away last week. Yes that sounds right.
@@ -5474,7 +5480,7 @@
+*Thread Reply:* Welcome, @Jakub Moravec 👋. Given that you're able to retrieve metadata using the Marquez API, you should be able to also view dataset and job metadata in the UI. Mind using the search bar in the top right-hand corner in the UI to see if your metadata is searchable? The UI only renders jobs and datasets that are connected in the lineage graph. We're working towards a more general metadata exploration experience, but currently the lineage graph is the main experience.
@@ -5530,7 +5536,7 @@
+*Thread Reply:* Your intuition is right here. I think we should define an input facet that specifies which dataset version is being read. Similarly, you would have an output facet that specifies what version is being produced. This would apply to storage layers like Delta Lake and Iceberg as well.
@@ -5556,7 +5562,7 @@
+*Thread Reply:* Regarding the race condition, input and output facets are attached to the run. The version of the dataset that was read is an attribute of a run and should not modify the dataset itself.
@@ -5582,7 +5588,7 @@
+*Thread Reply:* See the Dataset description here: https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md#core-lineage-model
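Such an input facet might record the snapshot or version identifier alongside the dataset reference on the run. A sketch with illustrative facet and field names:
```
# Rough shape of the "which version was read" idea: a facet on the *input*
# dataset of a run, so the version is recorded on the run rather than
# mutating the dataset. Facet and field names here are illustrative.
input_dataset = {
    "namespace": "s3://my-bucket",
    "name": "warehouse/my_table",
    "facets": {
        "version": {
            "_producer": "https://example.com/my-producer",
            "_schemaURL": "https://example.com/schemas/DatasetVersionFacet.json",
            "datasetVersion": "iceberg-snapshot-8772871033276156000",
        }
    },
}
```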
+*Thread Reply:* I think for Trino integration you'd be looking at writing a Trino extractor if I'm not mistaken, yes?
@@ -5813,7 +5819,7 @@
+*Thread Reply:* But the extractor would obviously be at the Marquez layer, not OpenLineage
@@ -5839,7 +5845,7 @@
+*Thread Reply:* And hopefully the metadata you'd be looking to extract from Trino wouldn't have any connector-specific syntax restrictions.
@@ -5891,7 +5897,7 @@
+*Thread Reply:* This is a proper usage. That schema is optional if it’s not available.
@@ -5917,7 +5923,7 @@
+*Thread Reply:* You would model it as a job reading from a folder (the input dataset) in the input bucket and writing to a folder (the output dataset) in the output bucket
@@ -5943,7 +5949,7 @@
+*Thread Reply:* This is similar to how this is modeled in the spark integration (spark job reading and writing to s3 buckets)
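Concretely, with the S3 naming convention (namespace = s3://bucket, name = object key prefix), the two folders become the run's input and output datasets. A sketch with placeholder bucket and path names:
```
# Sketch of the folder-to-folder modeling described above, using the S3
# naming convention from the spec; bucket and path names are placeholders.
from openlineage.client.run import Dataset

inputs = [Dataset(namespace="s3://input-bucket", name="raw/events")]
outputs = [Dataset(namespace="s3://output-bucket", name="curated/events")]
```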
@@ -5969,7 +5975,7 @@
+*Thread Reply:* for reference: getting the urls for the inputs: https://github.com/MarquezProject/marquez/blob/c5e5d7b8345e347164aa5aa173e8cf35062[…]marquez/spark/agent/lifecycle/plan/HadoopFsRelationVisitor.java
+*Thread Reply:* getting the output URL: https://github.com/MarquezProject/marquez/blob/c5e5d7b8345e347164aa5aa173e8cf35062[…]ark/agent/lifecycle/plan/InsertIntoHadoopFsRelationVisitor.java
+*Thread Reply:* See the spec (comments welcome) for the naming of S3 datasets: https://github.com/OpenLineage/OpenLineage/pull/31/files#diff-e3a8184544e9bc70d8a12e76b58b109051c182a914f0b28529680e6ced0e2a1cR87
@@ -6097,7 +6103,7 @@
+*Thread Reply:* Hey Julien, thank you so much for getting back to me. I'll take a look at the documentation/implementations you've sent me and will reach out if I have any more questions. Thanks again!
@@ -6123,7 +6129,7 @@
+*Thread Reply:* @Julien Le Dem I left a quick comment on that spec PR you mentioned. Just wanted to let you know.
@@ -6149,7 +6155,7 @@
+*Thread Reply:* thanks
@@ -6207,7 +6213,7 @@
+*Thread Reply:* Thank you! Please do fix typos, I’ll approve your PR.
@@ -6233,7 +6239,7 @@
+*Thread Reply:* No problem. Here's the PR. https://github.com/OpenLineage/OpenLineage/pull/47
@@ -6259,7 +6265,7 @@
+*Thread Reply:* Once I fixed the ones I saw I figured "Why not just run it through a spell checker just in case..." and found a few additional ones.
@@ -6406,7 +6412,7 @@
+*Thread Reply:* Huge milestone! 🙌💯🎊
@@ -6492,7 +6498,7 @@
+*Thread Reply:* We don’t have S3 (or distributed filesystem-specific) facets at the moment, but such support would be a great addition! @Julien Le Dem would be best to answer if any work has been done in this area 🙂
@@ -6518,7 +6524,7 @@
+*Thread Reply:* Also, happy to answer any Marquez-specific questions, @Jonathon Mitchal, when you’re thinking of making the move. Marquez supports OpenLineage out of the box 🙌
@@ -6544,7 +6550,7 @@
+*Thread Reply:* @Jonathon Mitchal You can follow the naming strategy here for referring to a S3 dataset: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md#s3
+*Thread Reply:* There is no facet yet for the attributes of a Parquet file. I can give you feedback if you want to start defining one. https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md#proposing-changes
+*Thread Reply:* Adding Parquet metadata as a facet would make a lot of sense. It is mainly a matter of specifying what the JSON would look like
@@ -6860,7 +6866,7 @@
+*Thread Reply:* for reference the parquet metadata is defined here: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift
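Picking a few fields from that FileMetaData definition, the facet JSON could look something like this; the facet itself is hypothetical and would go through the proposal process above:
```
# Hypothetical Parquet dataset facet, borrowing a few fields from the
# parquet-format FileMetaData definition linked above; all values are examples.
parquet_facet = {
    "_producer": "https://example.com/my-producer",
    "_schemaURL": "https://example.com/schemas/ParquetDatasetFacet.json",
    "numRows": 1200042,
    "createdBy": "parquet-mr version 1.12.0",
    "rowGroups": 8,
    "compression": "SNAPPY",
}
```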
+*Thread Reply:* That's awesome, thanks for the guidance Willy and Julien ... will report back on how we get on
@@ -7240,7 +7246,7 @@
+*Thread Reply:* Welcome! Let us know if you have any questions
@@ -7309,7 +7315,7 @@
+*Thread Reply:* > Is it the case that Open Lineage defines the general framework but doesn’t actually enforce push or pull-based implementations, it just so happens that the reference implementation (Marquez) uses push?
+Yes, at its core OpenLineage just enforces the format of the event. We also aim to provide clients - REST, later Kafka, etc. - and some reference implementations, which are now in the Marquez repo. https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/doc/Scope.png
+There are several differences between push and poll models. The most important one is that with the push model, latency between your job and emitting OpenLineage events is very low. With some systems, with an internal, push-based model you have more runtime metadata available than when looking from the outside. Another one would be that a naive poll implementation would need to "rebuild the world" on each change. There are also disadvantages, such as that it's usually easier to write a plugin that extracts data from outside the system than to hook up to the internals.
@@ -7394,7 +7400,7 @@
+*Thread Reply:* This is really helpful, thank you @Maciej Obuchowski!
@@ -7420,7 +7426,7 @@
+*Thread Reply:* Similar to what you say about push vs pull, I found DataHub’s comment to be interesting yesterday:
+> Push is better than pull: While pulling metadata directly from the source seems like the most straightforward way to gather metadata, developing and maintaining a centralized fleet of domain-specific crawlers quickly becomes a nightmare. It is more scalable to have individual metadata providers push the information to the central repository via APIs or messages. This push-based approach also ensures a more timely reflection of new and updated metadata.
@@ -7447,7 +7453,7 @@
+*Thread Reply:* yes. You can also “pull-to-push” for things that don’t push.
@@ -7473,7 +7479,7 @@
+*Thread Reply:* @Maciej Obuchowski any particular reason for bypassing databuilder and going directly to neo4j? By design databuilder is supposed to be very abstract so any kind of backend can be used with Amundsen. Currently there are at least 4 and neo4j is just one of them.
@@ -7499,7 +7505,7 @@
+*Thread Reply:* Databuilder's pull model is very different from OpenLineage's push model, where the events are generated while the dataset itself is generated.
+So, how would you see using it? Just to proxy the events to a concrete search and metadata backend?
@@ -7529,7 +7535,7 @@
+*Thread Reply:* @Mariusz Górski my slide that Maciej is referring to might be a bit misleading. The Amundsen integration does not exist yet. Please add your input in the ticket: https://github.com/OpenLineage/OpenLineage/issues/86
+*Thread Reply:* thanks Julien! will take a look
@@ -7631,7 +7637,7 @@
+*Thread Reply:* Welcome! You can look here: https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md
+*Thread Reply:* We’re starting a monthly call, I will publish more details here
@@ -7751,7 +7757,7 @@
+*Thread Reply:* Do you have a specific use case in mind?
@@ -7777,7 +7783,7 @@
+*Thread Reply:* Nothing specific yet
@@ -7854,7 +7860,7 @@
+*Thread Reply:* Hey @Julien Le Dem, I can’t add a link to my calendar… Can you send an invite?
@@ -7880,7 +7886,7 @@
+*Thread Reply:* Same!
@@ -7906,7 +7912,7 @@*Thread Reply:* Will do. Also if you send your email in dm you can get added to the invite
+*Thread Reply:* Will do. Also if you send your email in dm you can get added to the invite
@@ -7932,7 +7938,7 @@*Thread Reply:* You can find the invitation on the tsc mailing list: https://lists.lfaidata.foundation/g/openlineage-tsc/topic/invitation_openlineage/83423919?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,0,83423919
+*Thread Reply:* You can find the invitation on the tsc mailing list: https://lists.lfaidata.foundation/g/openlineage-tsc/topic/invitation_openlineage/83423919?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,0,83423919
@@ -7958,7 +7964,7 @@*Thread Reply:* @Julien Le Dem Can't access the calendar.
+*Thread Reply:* @Julien Le Dem Can't access the calendar.
@@ -7984,7 +7990,7 @@*Thread Reply:* Can you please share the meeting details
+*Thread Reply:* Can you please share the meeting details
@@ -8010,7 +8016,7 @@*Thread Reply:*
+*Thread Reply:*
*Thread Reply:*
+*Thread Reply:*
*Thread Reply:* The calendar invite says 9am PDT, not 10am. Which is right?
+*Thread Reply:* The calendar invite says 9am PDT, not 10am. Which is right?
@@ -8106,7 +8112,7 @@*Thread Reply:* Thanks
+*Thread Reply:* Thanks
@@ -8132,7 +8138,7 @@*Thread Reply:* it is 9am, thanks
+*Thread Reply:* it is 9am, thanks
@@ -8158,7 +8164,7 @@*Thread Reply:* I have posted the notes on the wiki (includes link to recording) https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+meeting+archive
+*Thread Reply:* I have posted the notes on the wiki (includes link to recording) https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+meeting+archive
@@ -8218,7 +8224,7 @@*Thread Reply:* We’ve recently worked on a getting started guide for OpenLineage that we’d like to publish on the OpenLineage website. That should help with making things a bit more clear on usage. @Ross Turk / @Julien Le Dem might know of when that might become available. Otherwise, happy to answer any immediate questions you might have about posting/collecting OpenLineage events
+*Thread Reply:* We’ve recently worked on a getting started guide for OpenLineage that we’d like to publish on the OpenLineage website. That should help with making things a bit more clear on usage. @Ross Turk / @Julien Le Dem might know of when that might become available. Otherwise, happy to answer any immediate questions you might have about posting/collecting OpenLineage events
@@ -8244,7 +8250,7 @@*Thread Reply:* Here's a sample of what I'm producing, would appreciate any feedback if it's on the right track. One of our challenges is that 'dataset' is a little loosely defined for us as outputs since we take data from a warehouse/database and output to things like Salesforce, Airtable, Hubspot and even Slack.
+*Thread Reply:* Here's a sample of what I'm producing, would appreciate any feedback if it's on the right track. One of our challenges is that 'dataset' is a little loosely defined for us as outputs since we take data from a warehouse/database and output to things like Salesforce, Airtable, Hubspot and even Slack.
{
eventType: 'START',
@@ -8315,7 +8321,7 @@
*Thread Reply:* One other question I have is really around how customers might take the metadata we emit at Hightouch and integrate that with OpenLineage metadata emitted from other tools like dbt, Airflow, and other integrations to create a true lineage of their data.
+*Thread Reply:* One other question I have is really around how customers might take the metadata we emit at Hightouch and integrate that with OpenLineage metadata emitted from other tools like dbt, Airflow, and other integrations to create a true lineage of their data.
For example, if the data goes from S3 -> Snowflake via Airflow, and then from Snowflake -> Salesforce via Hightouch, this would mean both Airflow and Hightouch would need to define the Snowflake dataset in exactly the same way to get the benefits of lineage?
*Thread Reply:* Hey, @Dejan Peretin! Sorry for the late reply here! Your OL events look solid; I only have a few suggestions:
+*Thread Reply:* Hey, @Dejan Peretin! Sorry for the late reply here! Your OL events look solid; I only have a few suggestions:
• the input datasets don't need to be resent in the COMPLETE event, as the input datasets have already been associated with the run ID
*Thread Reply:* You can now reference our OL getting started guide for a close-to-real example 🙂 , see http://openlineage.io/getting-started
+*Thread Reply:* You can now reference our OL getting started guide for a close-to-real example 🙂 , see http://openlineage.io/getting-started
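For readers following along, here is a minimal sketch of what posting such an event to a local Marquez backend can look like. The job and dataset names are invented, and Marquez is assumed to be listening on localhost:5000 as in the getting started guide:
```python
# Minimal sketch: emit START and COMPLETE run events over HTTP.
# All names/URIs below are illustrative, not Hightouch's real metadata.
import uuid
from datetime import datetime, timezone

import requests

MARQUEZ_URL = "http://localhost:5000/api/v1/lineage"  # assumed local Marquez
run_id = str(uuid.uuid4())  # one run ID shared by START and COMPLETE

def emit(event_type, inputs=(), outputs=()):
    event = {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": "hightouch", "name": "sync_accounts"},  # hypothetical
        "inputs": list(inputs),
        "outputs": list(outputs),
        "producer": "https://example.com/my-producer",  # hypothetical producer URI
    }
    requests.post(MARQUEZ_URL, json=event).raise_for_status()

emit("START", inputs=[{"namespace": "snowflake://my-account", "name": "analytics.public.accounts"}])
emit("COMPLETE", outputs=[{"namespace": "salesforce", "name": "Account"}])  # inputs already tied to the run ID
```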
*Thread Reply:* > … this would mean both Airflow/Hightouch would need to define the Snowflake dataset in exactly the same way to get the benefits of lineage? +
*Thread Reply:* > … this would mean both Airflow/Hightouch would need to define the Snowflake dataset in exactly the same way to get the benefits of lineage? Yes, the dataset and the namespace that it was registered under would have to be the same to properly build the lineage graph. We’re working on defining unique dataset names and have made some good progress in this area. I’d suggest reviewing the OL naming conventions if you haven’t already: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md
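To make that concrete, a small sketch of what "exactly the same way" means in practice: both producers must emit identical namespace and name strings. The account and table below are made up; see Naming.md for the exact scheme:
```python
# The Airflow task writing the table and the Hightouch sync reading it
# must both reference this exact namespace + name pair:
snowflake_orders = {
    "namespace": "snowflake://my-account",  # hypothetical account
    "name": "analytics.public.orders",      # database.schema.table
}
# Airflow lists it under "outputs", Hightouch under "inputs"; since the
# strings match, the backend can stitch both events into one lineage graph.
```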
@@ -8450,7 +8456,7 @@*Thread Reply:* Thanks! I'm really excited to see what the future holds, I think there are so many great possibilities here. Will be keeping a watchful eye. 🙂
+*Thread Reply:* Thanks! I'm really excited to see what the future holds, I think there are so many great possibilities here. Will be keeping a watchful eye. 🙂
@@ -8476,7 +8482,7 @@*Thread Reply:* 🙂
+*Thread Reply:* 🙂
@@ -8530,7 +8536,7 @@*Thread Reply:* Sounds like problem is with Marquez - might be worth to open issue here: https://github.com/MarquezProject/marquez/issues
+*Thread Reply:* Sounds like problem is with Marquez - might be worth to open issue here: https://github.com/MarquezProject/marquez/issues
@@ -8556,7 +8562,7 @@*Thread Reply:* Thank you! Will do.
+*Thread Reply:* Thank you! Will do.
@@ -8582,7 +8588,7 @@*Thread Reply:* Thanks for reporting Antonio
+*Thread Reply:* Thanks for reporting Antonio
@@ -8732,7 +8738,7 @@*Thread Reply:* Very nice!
+*Thread Reply:* Very nice!
@@ -8784,7 +8790,7 @@*Thread Reply:* Here is the error...
+*Thread Reply:* Here is the error...
21/06/20 11:02:56 WARN ArgumentParser: missing jobs in [, api, v1, namespaces, spark_integration] at 5
21/06/20 11:02:56 WARN ArgumentParser: missing runs in [, api, v1, namespaces, spark_integration] at 7
@@ -8829,7 +8835,7 @@
*Thread Reply:* Here is my code ...
+*Thread Reply:* Here is my code ...
```from pyspark.sql import SparkSession
from pyspark.sql.functions import lit```
@@ -8899,7 +8905,7 @@*Thread Reply:* After this execution, I can see just the source from the first dataframe, called dfsourcetrip...
+*Thread Reply:* After this execution, I can see just the source from the first dataframe, called dfsourcetrip...
@@ -8925,7 +8931,7 @@*Thread Reply:*
+*Thread Reply:*
*Thread Reply:* I was expecting to see all source dataframes, target dataframes and the job
+*Thread Reply:* I was expecting to see all source dataframes, target dataframes and the job
@@ -8986,7 +8992,7 @@*Thread Reply:* I'm running Spark locally on my laptop, and I followed the Marquez getting started guide to bring it up
+*Thread Reply:* I'm running Spark locally on my laptop, and I followed the Marquez getting started guide to bring it up
@@ -9012,7 +9018,7 @@*Thread Reply:* Can anyone help me?
+*Thread Reply:* Can anyone help me?
@@ -9038,7 +9044,7 @@*Thread Reply:* I think there's a race condition that causes the context to be missing when the job finishes too quickly. If I just add +
*Thread Reply:* I think there's a race condition that causes the context to be missing when the job finishes too quickly. If I just add
spark.sparkContext.setLogLevel('info')
to the setup code, everything works reliably. It also works if you remove the master('local[1]') - at least when running in a notebook
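For context, a sketch of the kind of setup being discussed, with the listener registered before the session is created. The spark.openlineage.* keys and the artifact version are assumptions; check them against your integration version:
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[1]")
    .appName("openlineage_test")
    # listener class name matches the integration's io.openlineage.spark.agent package
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.2.2")  # illustrative version
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.host", "http://localhost:5000")    # assumed config key
    .config("spark.openlineage.namespace", "spark_integration")   # assumed config key
    .getOrCreate()
)

# The workaround from this thread: bumping the log level keeps very short
# local jobs from completing before the listener can grab the job context.
spark.sparkContext.setLogLevel("info")
```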
*Thread Reply:* Hey, @anup agrawal. This is a great question! The OpenLineage spec is defined using the Json Schema format, and it’s mainly for the transport layer of OL events. In terms of how OL events are eventually stored, that’s determined by the backend consumer of the events. For example, Marquez stores the raw event in a lineage_events table, but that’s mainly for convenience and replayability of events. As for importing / exporting OL events from storage, as long as you can translate the CSV to an OL event, then HTTP backends like Marquez that support OL can consume them
+*Thread Reply:* Hey, @anup agrawal. This is a great question! The OpenLineage spec is defined using the Json Schema format, and it’s mainly for the transport layer of OL events. In terms of how OL events are eventually stored, that’s determined by the backend consumer of the events. For example, Marquez stores the raw event in a lineage_events table, but that’s mainly for convenience and replayability of events. As for importing / exporting OL events from storage, as long as you can translate the CSV to an OL event, then HTTP backends like Marquez that support OL can consume them
*Thread Reply:* > as part of this i need to convert the information fetched from graph db (neo4j) to CSV format and send in response. +
*Thread Reply:* > as part of this i need to convert the information fetched from graph db (neo4j) to CSV format and send in response. Depending on the exported CSV, I would translate the CSV to an OL event, see https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.json
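As a rough illustration of that translation step (the CSV columns here are invented; a real export from neo4j would differ):
```python
# Sketch: turn rows of a hypothetical lineage CSV into OL run events.
import csv
import uuid
from datetime import datetime, timezone

def csv_to_events(path):
    events = []
    with open(path, newline="") as f:
        # assumed columns: job_namespace, job_name, input_dataset, output_dataset
        for row in csv.DictReader(f):
            events.append({
                "eventType": "COMPLETE",
                "eventTime": datetime.now(timezone.utc).isoformat(),
                "run": {"runId": str(uuid.uuid4())},
                "job": {"namespace": row["job_namespace"], "name": row["job_name"]},
                "inputs": [{"namespace": "neo4j", "name": row["input_dataset"]}],   # namespace is illustrative
                "outputs": [{"namespace": "neo4j", "name": row["output_dataset"]}],
                "producer": "https://example.com/csv-importer",  # hypothetical
            })
    return events  # each dict can then be POSTed to an OL-compatible backend
```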
@@ -9227,7 +9233,7 @@*Thread Reply:* When you say “send in response”, who would be the consumer of the lineage metadata exported for the graph db?
+*Thread Reply:* When you say “send in response”, who would be the consumer of the lineage metadata exported for the graph db?
@@ -9253,7 +9259,7 @@*Thread Reply:* so far what I understood about my requirement is that: 1. my service will receive OL events
+*Thread Reply:* so far what I understood about my requirement is that: 1. my service will receive OL events
@@ -9279,7 +9285,7 @@*Thread Reply:* 2. store it in graph db (neo4j)
+*Thread Reply:* 2. store it in graph db (neo4j)
@@ -9305,7 +9311,7 @@*Thread Reply:* 3. this lineage information will be displayed on ui, based on the request.
+*Thread Reply:* 3. this lineage information will be displayed on ui, based on the request.
*Thread Reply:* I see. @Julien Le Dem might have some thoughts on how an OL event would be represented in different formats like CSV (but, of course, there’s also avro, parquet, etc). The Json Schema is the recommended format for importing / exporting lineage metadata. And, for a file, each line would be an OL event. But, given that CSV is a requirement, I’m not sure how that would be structured. Or at least, it’s something we haven’t previously discussed
+*Thread Reply:* I see. @Julien Le Dem might have some thoughts on how an OL event would be represented in different formats like CSV (but, of course, there’s also avro, parquet, etc). The Json Schema is the recommended format for importing / exporting lineage metadata. And, for a file, each line would be an OL event. But, given that CSV is a requirement, I’m not sure how that would be structured. Or at least, it’s something we haven’t previously discussed
@@ -9393,7 +9399,7 @@*Thread Reply:* There are no silly questions! 😉
+*Thread Reply:* There are no silly questions! 😉
@@ -9453,7 +9459,7 @@*Thread Reply:* Welcome, @Abdulmalik AN 👋 Hopefully the talks / podcasts have been informative! And, sure, happy to clarify a few things:
+*Thread Reply:* Welcome, @Abdulmalik AN 👋 Hopefully the talks / podcasts have been informative! And, sure, happy to clarify a few things:
> What are events and facets and what are their purpose? An OpenLineage event is used to capture the lineage metadata at a point in time for a given run in execution. That is, the run's state transition, the inputs and outputs consumed/produced, and the job associated with the run are part of the event. The metadata defined in the event can then be consumed by an HTTP backend (as well as other transport layers). Marquez is an HTTP backend implementation that consumes OL events via a REST API call. The OL core model only defines the metadata that should be captured in the context of a run, while the processing of the event is up to the backend implementation consuming the event (think consumer / producer model here). For Marquez, the end-to-end lineage metadata is stored for pipelines (composed of multiple jobs) with built-in metadata versioning support. Now, for the second part of your question: the OL core model is highly extensible via facets. A facet is user-defined metadata and enables entity enrichment. I’d recommend checking out the getting started guide for OL 🙂
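As a sketch of that extensibility point: a facet is just a JSON object hung off the run, job, or dataset, carrying _producer and _schemaURL fields per the spec. The myTeam facet below is entirely made up:
```python
run = {
    "runId": "3f5e6d7c-0a31-4d2b-8f6e-9c1b2a3d4e5f",
    "facets": {
        # custom user-defined facet; the name and fields are hypothetical
        "myTeam": {
            "_producer": "https://example.com/my-producer",
            "_schemaURL": "https://example.com/schemas/MyTeamRunFacet.json",
            "owner": "data-platform",
            "costCenter": "cc-123",
        }
    },
}
```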
@@ -9591,7 +9597,7 @@*Thread Reply:* I based my code on this previous post https://openlineage.slack.com/archives/C01CK9T7HKR/p1624198123045800
+*Thread Reply:* I based my code on this previous post https://openlineage.slack.com/archives/C01CK9T7HKR/p1624198123045800
*Thread Reply:*
+*Thread Reply:*
*Thread Reply:* In your first cell, you have +
*Thread Reply:* In your first cell, you have
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
spark.sparkContext.setLogLevel('info')
@@ -9754,7 +9760,7 @@
*Thread Reply:* It’s absolutely in scope! We’ve primarily focused on the batch use case (ETL jobs, etc), but the OpenLineage standard supports both batch and streaming jobs. You can check out our roadmap here, where you’ll find Flink and Beam on our list of future integrations.
+*Thread Reply:* It’s absolutely in scope! We’ve primarily focused on the batch use case (ETL jobs, etc), but the OpenLineage standard supports both batch and streaming jobs. You can check out our roadmap here, where you’ll find Flink and Beam on our list of future integrations.
*Thread Reply:* Is there a streaming framework you’d like to see added to our roadmap?
+*Thread Reply:* Is there a streaming framework you’d like to see added to our roadmap?
@@ -9832,7 +9838,7 @@*Thread Reply:* Welcome, @mohamed chorfa 👋 . Let us know if you have any questions!
+*Thread Reply:* Welcome, @mohamed chorfa 👋 . Let us know if you have any questions!
@@ -9862,7 +9868,7 @@*Thread Reply:* Really looking forward to following the evolution of the specification from RawData to the ML-Model
+*Thread Reply:* Really looking forward to following the evolution of the specification from RawData to the ML-Model
@@ -9980,7 +9986,7 @@*Thread Reply:* Based on community feedback, +
*Thread Reply:* Based on community feedback, The new proposed mission statement: “to enable the industry at-large to collect real-time lineage metadata consistently across complex ecosystems, creating a deeper understanding of how data is produced and used”
@@ -10137,7 +10143,7 @@*Thread Reply:* It is on the roadmap and there’s a ticket open but nobody is working on it at the moment. You are very welcome to contribute a spec and implementation
+*Thread Reply:* It is on the roadmap and there’s a ticket open but nobody is working on it at the moment. You are very welcome to contribute a spec and implementation
@@ -10163,7 +10169,7 @@*Thread Reply:* Please comment here and feel free to make a proposal: https://github.com/OpenLineage/OpenLineage/issues/35
+*Thread Reply:* Please comment here and feel free to make a proposal: https://github.com/OpenLineage/OpenLineage/issues/35
*Thread Reply:* cc @Julien Le Dem @Maciej Obuchowski
+*Thread Reply:* cc @Julien Le Dem @Maciej Obuchowski
@@ -10334,7 +10340,7 @@*Thread Reply:* @Michael Collado
+*Thread Reply:* @Michael Collado
@@ -10403,7 +10409,7 @@*Thread Reply:* Feel free to share your email with me if you want to be added to the gcal invite
+*Thread Reply:* Feel free to share your email with me if you want to be added to the gcal invite
@@ -10429,7 +10435,7 @@*Thread Reply:* It is starting now
+*Thread Reply:* It is starting now
@@ -10538,7 +10544,7 @@*Thread Reply:* In particular, please review the versioning proposal: https://github.com/OpenLineage/OpenLineage/issues/63
+*Thread Reply:* In particular, please review the versioning proposal: https://github.com/OpenLineage/OpenLineage/issues/63
*Thread Reply:* and the mission statement: https://github.com/OpenLineage/OpenLineage/issues/84
+*Thread Reply:* and the mission statement: https://github.com/OpenLineage/OpenLineage/issues/84
*Thread Reply:* for this one, please give explicit approval in the ticket
+*Thread Reply:* for this one, please give explicit approval in the ticket
@@ -10735,7 +10741,7 @@*Thread Reply:* @Zhamak Dehghani @Daniel Henneberger @Drew Banin @James Campbell @Ryan Blue @Maciej Obuchowski @Willy Lulciuc ^
+*Thread Reply:* @Zhamak Dehghani @Daniel Henneberger @Drew Banin @James Campbell @Ryan Blue @Maciej Obuchowski @Willy Lulciuc ^
@@ -10761,7 +10767,7 @@*Thread Reply:* Per the votes in the github ticket, I have finalized the charter here: https://docs.google.com/document/d/11xo2cPtuYHmqRLnR-vt9ln4GToe0y60H/edit
+*Thread Reply:* Per the votes in the github ticket, I have finalized the charter here: https://docs.google.com/document/d/11xo2cPtuYHmqRLnR-vt9ln4GToe0y60H/edit
@@ -10854,7 +10860,7 @@*Thread Reply:* The demo in this does not really use the openlineage spec, does it?
+*Thread Reply:* The demo in this does not really use the openlineage spec, does it?
Did I miss something - the API that was shown for lineage was that of Marquez; how does Marquez use the open lineage spec?
@@ -10882,7 +10888,7 @@*Thread Reply:* I have a question about the SQLJobFacet in the job schema - isn't it better to call it the TransformationJobFacet or the ProcessJobFacet so that any logic in the appropriate language can be described? Am I misinterpreting the intention of SQLJobFacet - is it to capture the logic that runs for a job?
+*Thread Reply:* I have a question about the SQLJobFacet in the job schema - isn't it better to call it the TransformationJobFacet or the ProcessJobFacet so that any logic in the appropriate language can be described? Am I misinterpreting the intention of SQLJobFacet - is it to capture the logic that runs for a job?
@@ -10908,7 +10914,7 @@*Thread Reply:* > The demo in this does not really use the openlineage spec does it? +
*Thread Reply:* > The demo in this does not really use the openlineage spec does it?
@Samia Rahman In our Airflow talk, the demo used the marquez-airflow lib that sends OpenLineage events to Marquez’s
*Thread Reply:* > Did I miss something - the API that was shown for lineage was that of Marquez, how does Marquez use the open lineage spec? +
*Thread Reply:* > Did I miss something - the API that was shown for lineage was that of Marquez, how does Marquez use the open lineage spec?
Yes, Marquez ingests OpenLineage events that conform to the spec via the
*Thread Reply:* I would say this is more of an ingestion pattern than something the OpenLineage spec would support directly. Though I completely agree, query logs are a great source of lineage metadata with minimal effort. On our roadmap, we have Kafka as a supported backend, which would enable streaming lineage metadata from query logs into a topic. That said, Confluent has some great blog posts on Change Data Capture: +
*Thread Reply:* I would say this is more of an ingestion pattern than something the OpenLineage spec would support directly. Though I completely agree, query logs are a great source of lineage metadata with minimal effort. On our roadmap, we have Kafka as a supported backend, which would enable streaming lineage metadata from query logs into a topic. That said, Confluent has some great blog posts on Change Data Capture:
• https://www.confluent.io/blog/no-more-silos-how-to-integrate-your-databases-with-apache-kafka-and-cdc/
• https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-1/
*Thread Reply:* Q: @Kenton (swiple.io) Are you planning on using Kafka Connect? If so, I see 2 reasonable options:
+*Thread Reply:* Q: @Kenton (swiple.io) Are you planning on using Kafka Connect? If so, I see 2 reasonable options:
*Thread Reply:* Either way, I think this is a great question and a common ingestion pattern we should document or have best practices for. Also, more details on how you plan to ingest the query logs would help drive the discussion.
+*Thread Reply:* Either way, I think this is a great question and a common ingestion pattern we should document or have best practices for. Also, more details on how you plan to ingest the query logs would help drive the discussion.
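For anyone prototyping that Kafka route before it lands on the roadmap, a sketch of the producer side. The topic name and broker address are assumptions; this uses the confluent-kafka client:
```python
import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed broker

def emit_to_kafka(ol_event: dict):
    # one OpenLineage event per message; a backend consumer (e.g. a future
    # Marquez Kafka backend) would read and process the topic
    producer.produce("openlineage.events", json.dumps(ol_event).encode("utf-8"))
    producer.flush()
```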
@@ -11144,7 +11150,7 @@*Thread Reply:* Using something like sqlflow could be a good starting point? Demo https://sqlflow.gudusoft.com/?utm_source=gspsite&utm_medium=blog&utm_campaign=support_article#/
+*Thread Reply:* Using something like sqlflow could be a good starting point? Demo https://sqlflow.gudusoft.com/?utm_source=gspsite&utm_medium=blog&utm_campaign=support_article#/
*Thread Reply:* @Kenton (swiple.io) I haven’t heard of sqlflow but it does look promising. It’s not on our current roadmap, but I think there is a need to have support for parsing query logs as OpenLineage events. Do you mind opening an issue and outlining your thoughts? It’d be great to start the discussion if you’d like to drive this feature and help prioritize this 💯
+*Thread Reply:* @Kenton (swiple.io) I haven’t heard of sqlflow but it does look promising. It’s not on our current roadmap, but I think there is a need to have support for parsing query logs as OpenLineage events. Do you mind opening an issue and outlining your thoughts? It’d be great to start the discussion if you’d like to drive this feature and help prioritize this 💯
*Thread Reply:* Hey, @Samia Rahman 👋. Yeah, great question! The SQLJobFacet is used only for SQL-based jobs. That is, it’s not intended to capture the code being executed, but rather just the SQL, if it’s present. The SQL facet can be used later for display purposes. For example, in Marquez, we use the SQLJobFacet to display the SQL executed by a given job to the user via the UI.
+*Thread Reply:* Hey, @Samia Rahman 👋. Yeah, great question! The SQLJobFacet is used only for SQL-based jobs. That is, it’s not intended to capture the code being executed, but rather just the SQL, if it’s present. The SQL facet can be used later for display purposes. For example, in Marquez, we use the SQLJobFacet to display the SQL executed by a given job to the user via the UI.
*Thread Reply:* To capture the logic of the job (meaning, the code being executed), the OpenLineage spec defines the SourceCodeLocationJobFacet that builds the link to source in version control
+*Thread Reply:* To capture the logic of the job (meaning, the code being executed), the OpenLineage spec defines the SourceCodeLocationJobFacet that builds the link to source in version control
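Putting the two facets side by side, a sketch of how they sit under a job. The facet keys follow the spec; the names and URLs are placeholders:
```python
job = {
    "namespace": "my-scheduler",    # hypothetical
    "name": "daily_orders_rollup",  # hypothetical
    "facets": {
        "sql": {  # SQLJobFacet: just the SQL, for display purposes
            "_producer": "https://example.com/my-producer",
            "_schemaURL": "https://example.com/schemas/SQLJobFacet.json",  # placeholder
            "query": "INSERT INTO rollup SELECT order_id, sum(total) FROM orders GROUP BY order_id",
        },
        "sourceCodeLocation": {  # SourceCodeLocationJobFacet: link to source in version control
            "_producer": "https://example.com/my-producer",
            "_schemaURL": "https://example.com/schemas/SourceCodeLocationJobFacet.json",  # placeholder
            "type": "git",
            "url": "https://github.com/acme/pipelines/blob/main/rollup.sql",  # hypothetical repo
        },
    },
}
```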
*Thread Reply:* The run event is always tied to exactly one job. It's up to the backend to store the relationship between the job and its inputs/outputs. E.g., in marquez, this is where we associate the input datasets with the job- https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/db/OpenLineageDao.java#L132-L143
+*Thread Reply:* The run event is always tied to exactly one job. It's up to the backend to store the relationship between the job and its inputs/outputs. E.g., in marquez, this is where we associate the input datasets with the job- https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/db/OpenLineageDao.java#L132-L143
*Thread Reply:* /|~~~
///|
/////|
///////|
+*Thread Reply:* /|~~~
///|
/////|
///////|
@@ -11713,7 +11719,7 @@
*Thread Reply:* ⛵
+*Thread Reply:* ⛵
@@ -11972,7 +11978,7 @@*Thread Reply:* Also interested in use case / implementation differences between Spline and OL. Watching this thread.
+*Thread Reply:* Also interested in use case / implementation differences between Spline and OL. Watching this thread.
@@ -11998,7 +12004,7 @@*Thread Reply:* It would be great to have the option to produce the spline lineage info as OpenLineage. +
*Thread Reply:* It would be great to have the option to produce the spline lineage info as OpenLineage. To capture the column level lineage, you would want to add a ColumnLineage facet to the Output dataset facets, which is something that is needed in the spec. Here is a proposal, please chime in: https://github.com/OpenLineage/OpenLineage/issues/148 @@ -12136,7 +12142,7 @@
*Thread Reply:* regarding the difference of implementation, the OpenLineage spark integration focuses on extracting metadata and exposing it as a standard representation. (The OpenLineage LineageEvents described in the JSON-Schema spec). The goal is really to have a common language to express lineage and related metadata across everything. We’d be happy if Spline can produce or consume OpenLineage as well and be part of that ecosystem.
+*Thread Reply:* regarding the difference of implementation, the OpenLineage spark integration focuses on extracting metadata and exposing it as a standard representation. (The OpenLineage LineageEvents described in the JSON-Schema spec). The goal is really to have a common language to express lineage and related metadata across everything. We’d be happy if Spline can produce or consume OpenLineage as well and be part of that ecosystem.
*Thread Reply:* Does anyone know if the Spline developers are in this slack group?
+*Thread Reply:* Does anyone know if the Spline developers are in this slack group?
@@ -12212,7 +12218,7 @@*Thread Reply:* @Luke Smith how have things progressed on your side the past year?
+*Thread Reply:* @Luke Smith how have things progressed on your side the past year?
@@ -12442,7 +12448,7 @@*Thread Reply:* Just a reminder that this is in 2 hours
+*Thread Reply:* Just a reminder that this is in 2 hours
@@ -12468,7 +12474,7 @@*Thread Reply:* I have added the notes to the meeting page: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
+*Thread Reply:* I have added the notes to the meeting page: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
@@ -12494,7 +12500,7 @@*Thread Reply:* The recording of the meeting is linked there: +
*Thread Reply:* The recording of the meeting is linked there: https://us02web.zoom.us/rec/share/2k4O-Rjmmd5TYXzT-pEQsbYXt6o4V6SnS6Vi7a27BPve9aoMmjm-bP8UzBBzsFzg.uY1je-PyT4qTgYLZ?startTime=1628697944000 • Passcode: =RBUj01C
@@ -12548,7 +12554,7 @@*Thread Reply:* @Daniel Avancini I'm working on it. Some changes in airflow made current approach unfeasible, so slight change in a way how we capture events is needed. You can take a look at progress here: https://github.com/OpenLineage/OpenLineage/tree/airflow/2
+*Thread Reply:* @Daniel Avancini I'm working on it. Some changes in airflow made current approach unfeasible, so slight change in a way how we capture events is needed. You can take a look at progress here: https://github.com/OpenLineage/OpenLineage/tree/airflow/2
@@ -12574,7 +12580,7 @@*Thread Reply:* Thank you Maciej. I'll take a look
+*Thread Reply:* Thank you Maciej. I'll take a look
@@ -12742,7 +12748,7 @@*Thread Reply:* The logicalPlan facet currently returns the Logical Plan, not the physical plan. This means you end up with expressions like Aggregate and Join rather than WholeStageCodegen and Exchange. I don't know if it's possible to reverse engineer the SQL- it's worth looking into the API and trying to find a way to generate that
+*Thread Reply:* The logicalPlan facet currently returns the Logical Plan, not the physical plan. This means you end up with expressions like Aggregate and Join rather than WholeStageCodegen and Exchange. I don't know if it's possible to reverse engineer the SQL- it's worth looking into the API and trying to find a way to generate that
*Thread Reply:* Hey, @Erick Navarro 👋 . Are you using the openlineage-spark lib? (Note, the marquez-spark lib has been deprecated)
+*Thread Reply:* Hey, @Erick Navarro 👋 . Are you using the openlineage-spark lib? (Note, the marquez-spark lib has been deprecated)
*Thread Reply:* My team had this issue as well. Our read of the error is that Databricks attempts to register the listener before installing packages defined with either spark.jars or spark.jars.packages. Since the listener lib is not yet installed, the listener cannot be found. To solve the issue, we
+*Thread Reply:* My team had this issue as well. Our read of the error is that Databricks attempts to register the listener before installing packages defined with either spark.jars or spark.jars.packages. Since the listener lib is not yet installed, the listener cannot be found. To solve the issue, we
• use an init script (in /dbfs/databricks/init/lineage) to copy the listener jar into /mnt/driver-daemon/jars
• create a .conf file in /databricks/driver/conf (we use open-lineage.conf)
The .conf file will be read by the driver on initialization. It should follow this format (lineagehosturl should point to your API):
@@ -12965,7 +12971,7 @@ *Thread Reply:* Absolutely, thanks for elaborating on your spark + OL deployment process and I think that’d be great to document. @Michael Collado what are your thoughts?
+*Thread Reply:* Absolutely, thanks for elaborating on your spark + OL deployment process and I think that’d be great to document. @Michael Collado what are your thoughts?
@@ -12991,7 +12997,7 @@*Thread Reply:* I haven't tried with Databricks specifically, but there should be no issue registering the OL listener in the Spark config as long as it's done before the Spark session is created- e.g., this example from the README works fine in a vanilla Jupyter notebook- https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#openlineagesparklistener-as-a-plain-spark-listener
+*Thread Reply:* I haven't tried with Databricks specifically, but there should be no issue registering the OL listener in the Spark config as long as it's done before the Spark session is created- e.g., this example from the README works fine in a vanilla Jupyter notebook- https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#openlineagesparklistener-as-a-plain-spark-listener
@@ -13017,7 +13023,7 @@*Thread Reply:* Looks like Databricks' notebooks come with a Spark instance pre-configured- configuring lineage within the SparkSession configuration doesn't seem possible- https://docs.databricks.com/notebooks/notebooks-manage.html#attach-a-notebook-to-a-cluster 😞
+*Thread Reply:* Looks like Databricks' notebooks come with a Spark instance pre-configured- configuring lineage within the SparkSession configuration doesn't seem possible- https://docs.databricks.com/notebooks/notebooks-manage.html#attach-a-notebook-to-a-cluster 😞
*Thread Reply:*
+*Thread Reply:*
*Thread Reply:* Right, Databricks provides preconfigured spark context / session objects. With Spline, you can set some cluster level config (e.g. spark.spline.lineageDispatcher.http.producer.url) and install the library on the cluster, but then enable tracking at a notebook level with:
+*Thread Reply:* Right, Databricks provides preconfigured spark context / session objects. With Spline, you can set some cluster level config (e.g. spark.spline.lineageDispatcher.http.producer.url) and install the library on the cluster, but then enable tracking at a notebook level with:
%scala
import za.co.absa.spline.harvester.SparkLineageInitializer._
@@ -13130,7 +13136,7 @@
*Thread Reply:* Seems, at the very least, we need to provide a way to specify the job name at the notebook level
+*Thread Reply:* Seems, at the very least, we need to provide a way to specify the job name at the notebook level
@@ -13160,7 +13166,7 @@*Thread Reply:* Agreed. I'd like a default that uses the notebook name that can also be overridden in the notebook.
+*Thread Reply:* Agreed. I'd like a default that uses the notebook name that can also be overridden in the notebook.
@@ -13186,7 +13192,7 @@*Thread Reply:* if you have some insight into the available options, it would be great if you can open an issue on the OL project. I'll have to carve out some time to play with a databricks cluster and learn what options we have
+*Thread Reply:* if you have some insight into the available options, it would be great if you can open an issue on the OL project. I'll have to carve out some time to play with a databricks cluster and learn what options we have
@@ -13216,7 +13222,7 @@*Thread Reply:* Thank you @Luke Smith, the method you recommend works for me. The cluster is running and apparently it fetches the configuration. This was my first progress in over a week of testing OpenLineage in Azure Databricks. Thank you!
+*Thread Reply:* Thank you @Luke Smith, the method you recommend works for me. The cluster is running and apparently it fetches the configuration. This was my first progress in over a week of testing OpenLineage in Azure Databricks. Thank you!
Now I have this:
@@ -13253,7 +13259,7 @@*Thread Reply:* Is this error thrown during init or job execution?
+*Thread Reply:* Is this error thrown during init or job execution?
@@ -13279,7 +13285,7 @@*Thread Reply:* this is likely a race condition- I've seen it happen for jobs that start and complete very quickly- things like defining temp views or similar
+*Thread Reply:* this is likely a race condition- I've seen it happen for jobs that start and complete very quickly- things like defining temp views or similar
@@ -13305,7 +13311,7 @@*Thread Reply:* During the execution of the job, @Luke Smith. Thank you @Michael Collado, that was exactly the scenario: the job that I executed was empty. Now the cluster is running ok and I don't have errors. I have run some jobs successfully, but I don't see any information in my Datakin explorer
+*Thread Reply:* During the execution of the job, @Luke Smith. Thank you @Michael Collado, that was exactly the scenario: the job that I executed was empty. Now the cluster is running ok and I don't have errors. I have run some jobs successfully, but I don't see any information in my Datakin explorer
@@ -13331,7 +13337,7 @@*Thread Reply:* Awesome! Great to hear you’re up and running. For datakin specific questions, mind if we move the discussion to the datakin user slack channel?
+*Thread Reply:* Awesome! Great to hear you’re up and running. For datakin specific questions, mind if we move the discussion to the datakin user slack channel?
@@ -13357,7 +13363,7 @@*Thread Reply:* Yes Willy, thank you!
+*Thread Reply:* Yes Willy, thank you!
@@ -13383,7 +13389,7 @@*Thread Reply:* Hi, @Luke Smith, thank you for your help. Are you familiar with this error in Azure Databricks when you use OL?
+*Thread Reply:* Hi, @Luke Smith, thank you for your help. Are you familiar with this error in Azure Databricks when you use OL?
*Thread Reply:*
+*Thread Reply:*
*Thread Reply:* I found the solution here: +
*Thread Reply:* I found the solution here: https://docs.microsoft.com/en-us/answers/questions/170730/handshake-fails-trying-to-connect-from-azure-datab.html
*Thread Reply:* It works now! 😄
+*Thread Reply:* It works now! 😄
@@ -13540,7 +13546,7 @@*Thread Reply:* @Erick Navarro This might be helpful to add to our openlineage spark docs for others trying out openlineage-spark with Databricks. Let me know if that’s something you’d like to contribute 🙂
+*Thread Reply:* @Erick Navarro This might be helpful to add to our openlineage spark docs for others trying out openlineage-spark with Databricks. Let me know if that’s something you’d like to contribute 🙂
*Thread Reply:* Yes of course @Willy Lulciuc, I will prepare a small tutorial for my colleagues and I will share it with you 🙂
+*Thread Reply:* Yes of course @Willy Lulciuc, I will prepare a small tutorial for my colleagues and I will share it with you 🙂
@@ -13592,7 +13598,7 @@*Thread Reply:* Awesome. Thanks!
+*Thread Reply:* Awesome. Thanks!
@@ -13644,7 +13650,7 @@*Thread Reply:* Hey! If you mean this picture, it provides the concept of how OpenLineage works, not the current state of integration. We don't have Prefect support yet; however, it's on our roadmap.
+*Thread Reply:* Hey! If you mean this picture, it provides the concept of how OpenLineage works, not the current state of integration. We don't have Prefect support yet; however, it's on our roadmap.
*Thread Reply:* great, thanks 🙂
+*Thread Reply:* great, thanks 🙂
@@ -13745,7 +13751,7 @@*Thread Reply:* @Thomas Fredriksen Feel free to chime in the github issue Maciej linked if you want.
+*Thread Reply:* @Thomas Fredriksen Feel free to chime in the github issue Maciej linked if you want.
@@ -13797,7 +13803,7 @@*Thread Reply:* It is being worked on right now. @Oleksandr Dvornik is adding an integration test in the build so that we run tests for both Spark 2.4 and Spark 3. Please open an issue with the stack trace if you can. From our perspective, it should be mostly compatible, with a few exceptions like this one that we’d want to add test cases for.
+*Thread Reply:* It is being worked on right now. @Oleksandr Dvornik is adding an integration test in the build so that we run tests for both Spark 2.4 and Spark 3. Please open an issue with the stack trace if you can. From our perspective, it should be mostly compatible, with a few exceptions like this one that we’d want to add test cases for.
@@ -13823,7 +13829,7 @@*Thread Reply:* The goal is to be able to make a release in the next few weeks. The integration is being used with Spark 3 already.
+*Thread Reply:* The goal is to be able to make a release in the next few weeks. The integration is being used with Spark 3 already.
@@ -13853,7 +13859,7 @@*Thread Reply:* Great, I'll take some time to open an issue for this particular issue and a few others.
+*Thread Reply:* Great, I'll take some time to open an issue for this particular issue and a few others.
@@ -13879,7 +13885,7 @@*Thread Reply:* are you actually using the DatasetSource interface in any capacity? Or are you just scanning the source code to find incompatibilities?
+*Thread Reply:* are you actually using the DatasetSource interface in any capacity? Or are you just scanning the source code to find incompatibilities?
*Thread Reply:* Turns out this has more to do with how Databricks handles the delta format. It's related to https://github.com/AbsaOSS/spline-spark-agent/issues/96.
+*Thread Reply:* Turns out this has more to do with how Databricks handles the delta format. It's related to https://github.com/AbsaOSS/spline-spark-agent/issues/96.
*Thread Reply:* I haven't been chasing this issue down on my team -- turns out some things were lost in communication. There are really two problems here:
+*Thread Reply:* I haven't been chasing this issue down on my team -- turns out some things were lost in communication. There are really two problems here:
insert into . . . values . . .
@@ -14069,7 +14075,7 @@ *Thread Reply:* So, the first issue is OpenLineage related directly, and the second issue applies to both OpenLineage and Spline?
+*Thread Reply:* So, the first issue is OpenLineage related directly, and the second issue applies to both OpenLineage and Spline?
@@ -14095,7 +14101,7 @@*Thread Reply:* Yes, that's my read of what I'm getting from others on the team.
+*Thread Reply:* Yes, that's my read of what I'm getting from others on the team.
@@ -14121,7 +14127,7 @@*Thread Reply:* For the first issue- can you give some details about the target of the INSERT INTO...? Is it a data source defined in Databricks? a Hive table? a view on GCS?
+*Thread Reply:* For the first issue- can you give some details about the target of the INSERT INTO...? Is it a data source defined in Databricks? a Hive table? a view on GCS?
*Thread Reply:* oh, it's a Delta table?
+*Thread Reply:* oh, it's a Delta table?
@@ -14173,7 +14179,7 @@*Thread Reply:* Yes, it's created via
+*Thread Reply:* Yes, it's created via
CREATE TABLE . . . using DELTA location "/dbfs/mnt/ . . . "
*Thread Reply:* Apache Beam integration? I have a very crude integration at the moment. Maybe it’s better to integrate on the orchestration level (airflow, luigi). Thoughts?
+*Thread Reply:* Apache Beam integration? I have a very crude integration at the moment. Maybe it’s better to integrate on the orchestration level (airflow, luigi). Thoughts?
@@ -14312,7 +14318,7 @@*Thread Reply:* I think it makes a lot of sense to have a Beam level integration similar to the spark one. Feel free to post a draft PR if you want to share.
+*Thread Reply:* I think it makes a lot of sense to have a Beam level integration similar to the spark one. Feel free to post a draft PR if you want to share.
@@ -14338,7 +14344,7 @@*Thread Reply:* I have added Beam as a topic for the roadmap discussion slide: https://docs.google.com/presentation/d/1fI0u8aE0iX9vG4GGrnQYAEcsJM9z7Rlv/edit#slide=id.ge7d4b64ef4_0_0
+*Thread Reply:* I have added Beam as a topic for the roadmap discussion slide: https://docs.google.com/presentation/d/1fI0u8aE0iX9vG4GGrnQYAEcsJM9z7Rlv/edit#slide=id.ge7d4b64ef4_0_0
@@ -14390,7 +14396,7 @@*Thread Reply:* There will be a quick demo of the dbt integration (thanks @Willy Lulciuc!)
+*Thread Reply:* There will be a quick demo of the dbt integration (thanks @Willy Lulciuc!)
@@ -14420,7 +14426,7 @@*Thread Reply:* Information to join and archive of previous meetings: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
+*Thread Reply:* Information to join and archive of previous meetings: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
@@ -14446,7 +14452,7 @@*Thread Reply:* The recording and notes are now available: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
+*Thread Reply:* The recording and notes are now available: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
@@ -14472,7 +14478,7 @@*Thread Reply:* Good meeting today. @Julien Le Dem. Thanks
+*Thread Reply:* Good meeting today. @Julien Le Dem. Thanks
@@ -14572,7 +14578,7 @@*Thread Reply:* We're working on updating our integration to airflow 2. Some changes in airflow made current approach unfeasible, so slight change in a way how we capture events is needed. You can take a look at progress here: https://github.com/OpenLineage/OpenLineage/tree/airflow/2
+*Thread Reply:* We're working on updating our integration to airflow 2. Some changes in airflow made current approach unfeasible, so slight change in a way how we capture events is needed. You can take a look at progress here: https://github.com/OpenLineage/OpenLineage/tree/airflow/2
*Thread Reply:* Thanks @Maciej Obuchowski When is this expected to land in a release ?
+*Thread Reply:* Thanks @Maciej Obuchowski When is this expected to land in a release ?
@@ -14648,7 +14654,7 @@*Thread Reply:* hi @Maciej Obuchowski I wanted to follow up on this to understand when the more recent BQ Operators will be supported, specifically BigQueryInsertJobOperator
+*Thread Reply:* hi @Maciej Obuchowski I wanted to follow up on this to understand when the more recent BQ Operators will be supported, specifically BigQueryInsertJobOperator
@@ -14765,7 +14771,7 @@*Thread Reply:* Welcome, @Jose Badeau 👋. That’s exciting to hear, as we have Beam, Flink and Iceberg on our roadmap! You’re welcome to join the discussion :)
+*Thread Reply:* Welcome, @Jose Badeau 👋. That’s exciting to hear, as we have Beam, Flink and Iceberg on our roadmap! You’re welcome to join the discussion :)
@@ -14979,7 +14985,7 @@*Thread Reply:* Hello Tomas. We are currently working on support for Spark v3. Can you please raise an issue with the stack trace? That would help us to track and solve it. We are currently adding integration tests. The next step would be to fix changes in method signatures for v3 (that's what you're hitting).
+*Thread Reply:* Hello Tomas. We are currently working on support for Spark v3. Can you please raise an issue with the stack trace? That would help us to track and solve it. We are currently adding integration tests. The next step would be to fix changes in method signatures for v3 (that's what you're hitting).
@@ -15005,7 +15011,7 @@*Thread Reply:* Hi @Oleksandr Dvornik i raised https://github.com/OpenLineage/OpenLineage/issues/272
+*Thread Reply:* Hi @Oleksandr Dvornik i raised https://github.com/OpenLineage/OpenLineage/issues/272
*Thread Reply:* I think this might be actually spark issue: https://stackoverflow.com/questions/53787624/spark-throwing-arrayindexoutofboundsexception-when-parallelizing-list/53787847
+*Thread Reply:* I think this might be actually spark issue: https://stackoverflow.com/questions/53787624/spark-throwing-arrayindexoutofboundsexception-when-parallelizing-list/53787847
*Thread Reply:* Can you try a newer version in the 2.4.* line, like 2.4.7?
+*Thread Reply:* Can you try a newer version in the 2.4.* line, like 2.4.7?
@@ -15389,7 +15395,7 @@*Thread Reply:* This might also be a Spark 2.4 with Scala 2.12 issue - I'd recommend the 2.11 versions.
+*Thread Reply:* This might also be a Spark 2.4 with Scala 2.12 issue - I'd recommend the 2.11 versions.
@@ -15415,7 +15421,7 @@*Thread Reply:* @Maciej Obuchowski with 2.4.7 i get following exc:
+*Thread Reply:* @Maciej Obuchowski with 2.4.7 i get following exc:
@@ -15441,7 +15447,7 @@*Thread Reply:* 21/09/14 15:03:25 WARN RddExecutionContext: Unable to access job conf from RDD +
*Thread Reply:* 21/09/14 15:03:25 WARN RddExecutionContext: Unable to access job conf from RDD java.lang.NoSuchFieldException: config$1 at java.base/java.lang.Class.getDeclaredField(Class.java:2411)
@@ -15469,7 +15475,7 @@*Thread Reply:* i can also try to switch to 2.11 scala
+*Thread Reply:* i can also try to switch to 2.11 scala
@@ -15495,7 +15501,7 @@*Thread Reply:* or do you have some recommended setup that works for sure?
+*Thread Reply:* or do you have some recommended setup that works for sure?
@@ -15521,7 +15527,7 @@*Thread Reply:* One more check - you're using Java 8 with this, right?
+*Thread Reply:* One more check - you're using Java 8 with this, right?
@@ -15547,7 +15553,7 @@*Thread Reply:* This is what works for me: +
*Thread Reply:* This is what works for me:
-> % cat tools/spark-2.4/RELEASE
Spark 2.4.8 (git revision 4be4064) built for Hadoop 2.7.3
Build flags: -B -Pmesos -Pyarn -Pkubernetes -Pflume -Psparkr -Pkafka-0-8 -Phadoop-2.7 -Phive -Phive-thriftserver -DzincPort=3036
*Thread Reply:* spark-shell: +
*Thread Reply:* spark-shell:
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
*Thread Reply:* awesome let me try 🙂
+*Thread Reply:* awesome let me try 🙂
@@ -15629,7 +15635,7 @@*Thread Reply:* data has been sent to marquez. coolio. However I noticed a nullpointer being thrown: 21/09/14 15:23:53 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception
+
*Thread Reply:* data has been sent to marquez. coolio. However I noticed a nullpointer being thrown: 21/09/14 15:23:53 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception
java.lang.NullPointerException
at io.openlineage.spark.agent.OpenLineageSparkListener.onJobEnd(OpenLineageSparkListener.java:164)
at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:39)
@@ -15670,7 +15676,7 @@
*Thread Reply:* closed related issue #274
+*Thread Reply:* closed related issue #274
@@ -15750,7 +15756,7 @@*Thread Reply:* Not at the moment, but it is in scope. You are welcome to open an issue with your example to track this or even propose an implementation if you have the time.
+*Thread Reply:* Not at the moment, but it is in scope. You are welcome to open an issue with your example to track this or even propose an implementation if you have the time.
@@ -15776,7 +15782,7 @@*Thread Reply:* @Tomas Satka it would be great, if you can add an containerized integration test for kafka with your test case. You can take this as an example here
+*Thread Reply:* @Tomas Satka it would be great, if you can add an containerized integration test for kafka with your test case. You can take this as an example here
*Thread Reply:* Hi @Oleksandr Dvornik, I wrote a test for a simple read/write from a kafka topic using the kafka testcontainer. However, I discovered a bug. When writing to a kafka topic I'm getting java.lang.IllegalArgumentException: One of the following options must be specified for Kafka source: subscribe, subscribepattern, assign. See the docs for more details.
+*Thread Reply:* Hi @Oleksandr Dvornik, I wrote a test for a simple read/write from a kafka topic using the kafka testcontainer. However, I discovered a bug. When writing to a kafka topic I'm getting java.lang.IllegalArgumentException: One of the following options must be specified for Kafka source: subscribe, subscribepattern, assign. See the docs for more details.
• How would you like me to add the test? Fork openlineage and create PR
@@ -16061,7 +16067,7 @@*Thread Reply:* • Shall I raise a bug for writing to kafka that should have only "topic" instead of "subscribe"?
+*Thread Reply:* • Shall I raise a bug for writing to kafka that should have only "topic" instead of "subscribe"?
@@ -16087,7 +16093,7 @@*Thread Reply:* • Since I don't know the expected payload to the openlineage mock server, can somebody help me to create it?
+*Thread Reply:* • Since I don't know the expected payload to the openlineage mock server, can somebody help me to create it?
@@ -16113,7 +16119,7 @@*Thread Reply:* Hi @Tomas Satka, yes you should create a fork and raise a PR from that. For more details, please take a look at. Not sure about kafka, cause we don't have that integration yet. About expected payload, as a first step, I would suggest to leave that test without assertion for now. Second step would be investigation (what we can get from that plan node). Third step - implementation and asserting a payload. Basically we parse spark optimized plan, and get as much information as we can for specific implementation. You can take a look at recent PR for HIVE. We visit root node and leaves to get output datasets and input datasets accordingly.
+*Thread Reply:* Hi @Tomas Satka, yes you should create a fork and raise a PR from that. For more details, please take a look at. Not sure about kafka, cause we don't have that integration yet. About expected payload, as a first step, I would suggest to leave that test without assertion for now. Second step would be investigation (what we can get from that plan node). Third step - implementation and asserting a payload. Basically we parse spark optimized plan, and get as much information as we can for specific implementation. You can take a look at recent PR for HIVE. We visit root node and leaves to get output datasets and input datasets accordingly.
*Thread Reply:* Hi @Oleksandr Dvornik PR for step one : https://github.com/OpenLineage/OpenLineage/pull/279
+*Thread Reply:* Hi @Oleksandr Dvornik PR for step one : https://github.com/OpenLineage/OpenLineage/pull/279
*Thread Reply:* Hey Thomas!
+*Thread Reply:* Hey Thomas!
Following what we do with Airflow, yes, I think that each task should be mapped to a job.
@@ -16349,7 +16355,7 @@*Thread Reply:* this is very helpful, thank you
+*Thread Reply:* this is very helpful, thank you
@@ -16375,7 +16381,7 @@*Thread Reply:* what would be the namespace used in the Job definition of each task?
+*Thread Reply:* what would be the namespace used in the Job definition of each task?
*Thread Reply:* In contrast to dataset namespaces - which we try to standardize, job namespaces should be provided by user, or operator of particular scheduler.
+*Thread Reply:* In contrast to dataset namespaces - which we try to standardize, job namespaces should be provided by user, or operator of particular scheduler.
For example, it would be good if it helped you identify the Prefect instance where the job was run.
@@ -16429,7 +16435,7 @@*Thread Reply:* If you use the openlineage-python client, you can provide the namespace either in the client constructor, or via the OPENLINEAGE_NAMESPACE env variable.
+*Thread Reply:* If you use the openlineage-python client, you can provide the namespace either in the client constructor, or via the OPENLINEAGE_NAMESPACE env variable.
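A sketch of both routes (the client API shown follows openlineage-python as described above; double-check it against the version you install):
```python
import os

from openlineage.client import OpenLineageClient

# Route 1: environment variable, picked up by the client/integration.
os.environ.setdefault("OPENLINEAGE_NAMESPACE", "my-prefect-instance")  # hypothetical namespace

# Route 2: point the client at the backend explicitly when constructing it.
client = OpenLineageClient(url="http://localhost:5000")  # assumed local backend
```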
*Thread Reply:* awesome, thank you 🙂
+*Thread Reply:* awesome, thank you 🙂
@@ -16481,7 +16487,7 @@*Thread Reply:* Hey @Thomas Fredriksen - just chiming in, I’m also keen for a prefect integration. Let me know if I can help out at all
+*Thread Reply:* Hey @Thomas Fredriksen - just chiming in, I’m also keen for a prefect integration. Let me know if I can help out at all
@@ -16507,7 +16513,7 @@*Thread Reply:* Please chime in on https://github.com/OpenLineage/OpenLineage/issues/81
+*Thread Reply:* Please chime in on https://github.com/OpenLineage/OpenLineage/issues/81
*Thread Reply:* Done!
+*Thread Reply:* Done!
@@ -16587,7 +16593,7 @@*Thread Reply:* For now I'm prototyping in a separate repo https://github.com/limx0/caching_flow_runner/tree/open_lineage
+*Thread Reply:* For now I'm prototyping in a separate repo https://github.com/limx0/caching_flow_runner/tree/open_lineage
@@ -16613,7 +16619,7 @@*Thread Reply:* I really like your PR, @Brad. I think that using FlowRunner and TaskRunner may be a more "proper" way of doing this, as opposed to adding a state-handler to each task the way I do it.
+*Thread Reply:* I really like your PR, @Brad. I think that using FlowRunner and TaskRunner may be a more "proper" way of doing this, as opposed to adding a state-handler to each task the way I do it.
How are you dealing with Prefect-library tasks such as the included BigQuery-tasks and such? Is it necessary to create DatasetTask for them to show up in the lineage graph?
*Thread Reply:* Hey @Thomas Fredriksen! At the moment I'm not dealing with any task-specific things. The plan (in my head, and after speaking with another prefect user @davzucky) would be that we add a LineageTask subclass where you could define custom facets on a per task basis
+*Thread Reply:* Hey @Thomas Fredriksen! At the moment I'm not dealing with any task-specific things. The plan (in my head, and after speaking with another prefect user @davzucky) would be that we add a LineageTask subclass where you could define custom facets on a per task basis
*Thread Reply:* or some sort of other hook where basically you would define some lineage attribute or put something in the prefect.context that the TaskRunner would find and attach
+*Thread Reply:* or some sort of other hook where basically you would define some lineage attribute or put something in the prefect.context that the TaskRunner would find and attach
*Thread Reply:* Sorry I misread your question - any tasks should be automatically tracked (I believe but have not tested yet!)
+*Thread Reply:* Sorry I misread your question - any tasks should be automatically tracked (I believe but have not tested yet!)
@@ -16719,7 +16725,7 @@*Thread Reply:* @Brad Could you elaborate a bit on your ideas around adding custom context attributes?
+*Thread Reply:* @Brad Could you elaborate a bit on your ideas around adding custom context attributes?
@@ -16745,7 +16751,7 @@*Thread Reply:* yeah so basically we just need some hooks that you can easily access from the task decorator or somewhere else that we can pass through to the open lineage adapter to do things like custom facets
+*Thread Reply:* yeah so basically we just need some hooks that you can easily access from the task decorator or somewhere else that we can pass through to the open lineage adapter to do things like custom facets
*Thread Reply:* like for your bigquery example - you might want to record some facets like in https://github.com/OpenLineage/OpenLineage/blob/main/integration/common/openlineage/common/provider/bigquery.py and we need a way to do that with the Prefect bigquery task
*Thread Reply:* @davzucky
*Thread Reply:* I see. Is this supported by the airflow-integration?
*Thread Reply:* I think so, yes
*Thread Reply:* The airflow code is here https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/extractors/bigquery_extractor.py
*Thread Reply:* (I don't actually use airflow or bigquery - but for my own use case I can see wanting to do things like this)
*Thread Reply:* Interesting, I like how dynamic this is
*Thread Reply:* Hey.
Generally, the dataSource name should be the namespace of a particular dataset.
In some cases, like Postgres, the dataSource facet is used to additionally provide connection strings, with info like the particular host and port that we're connected to.
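As a concrete illustration of that convention, here is what the dataset fragment of an event might look like for Postgres (hosts and table names made up):
```
# The dataset namespace doubles as the dataSource name; the dataSource
# facet additionally carries the connection string.
dataset = {
    "namespace": "postgres://db.example.com:5432",
    "name": "public.orders",
    "facets": {
        "dataSource": {
            "name": "postgres://db.example.com:5432",
            "uri": "postgres://db.example.com:5432/analytics",
        }
    },
}
```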
*Thread Reply:* Thanks. So then perhaps marquez could differentiate a bit more between job & dataset namespaces. Right now it doesn't quite feel right to have a single global list of namespaces for jobs & datasets, especially as they also have a separate concept of sources (which are not in a namespace).
*Thread Reply:* @Willy Lulciuc what do you think?
*Thread Reply:* As an example, in marquez I have this list of namespaces (from some sample data): dbt-sales, default, snowflake://my-account1, snowflake://my-account2.
I think the new marquez UI with the nice namespace dropdown and job/dataset search is awesome, and I'd expect to be able to filter by job namespace everywhere, but how about being able to filter datasets by source (which would be populated by the OL dataset namespace) and not persist dataset namespaces in the global namespace table?
*Thread Reply:* Last point is that we should persist the configuration and not just have it in environment variables. What is the best way to do this in dbt?
*Thread Reply:* We could have something similar to https://docs.getdbt.com/dbt-cli/configure-your-profile - or even put our config in there
*Thread Reply:* I think we should assume that variables/config should be set and valid - and fail the run if they aren't. After all, if someone wouldn't need lineage events, they wouldn't use our wrapper.
*Thread Reply:* 3rd point would be easy to address if we could send events async/in parallel. But there could be dataset version dependencies, and we don't want to get into the needless complexity of recognizing that, building a dag etc.
We could batch events if the network roundtrips are responsible for majority of the slowdown. However, we can't assume any particular environment.
*Thread Reply:* About the second point, I want to add recognizing if we already have a parent run - for example, if running via airflow. If not, creating a run for this purpose is a good idea.
*Thread Reply:* @Maciej Obuchowski can you open github issues to propose those changes?
*Thread Reply:* Done
*Thread Reply:* FWIW, I have been putting my config in ~/.openlineage/config so it can be mapped into a container
*Thread Reply:* Makes sense, also, all clients could use that config
*Thread Reply:* if dbt could actually stream the events, that would be great.
*Thread Reply:* Unfortunately, this seems very unlikely for now, due to the fact that we rely on metadata files that dbt only produces after end of execution.
*Thread Reply:* This might be useful to others https://github.com/OpenLineage/OpenLineage/pull/284
*Thread Reply:* So I'm trying to push a simple event to marquez, but getting the following response:
'{"code":400,"message":"Unable to process JSON"}'
The JSON I'm pushing:
*Thread Reply:* What I did previously when debugging something like this was to remove half of the payload until I found the culprit. Binary search essentially. I was running Marquez locally, so probably could’ve enabled better logging as well.
Aren’t inputs and facets arrays?
*Thread Reply:* Thanks for the response @marko - this is a greatly reduced payload already (but I'll keep going). Yep they are supposed to be arrays (I've since fixed that)
*Thread Reply:* okay it was my timestamp 🥲
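For anyone hitting the same 400, a minimal sketch of emitting a valid event with the openlineage-python client - the main gotcha being that eventTime must be a zoned ISO-8601 timestamp (namespace/job names below are placeholders):
```
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")

client.emit(RunEvent(
    eventType=RunState.START,
    # e.g. "2021-10-07T12:00:00.000000+00:00" - a malformed timestamp
    # gets rejected with "Unable to process JSON"
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid4())),
    job=Job(namespace="my-namespace", name="my-job"),
    producer="https://github.com/my-org/my-pipeline",
))
```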
*Thread Reply:* Okay - I've got a simple working example now https://github.com/limx0/caching_flow_runner/blob/open_lineage/caching_flow_runner/task_runner.py
*Thread Reply:* I might move this into a proper PR @Julien Le Dem
*Thread Reply:* Successfully got a basic prefect flow working
*Thread Reply:* I think there was some talk somewhere to actually drop the DatasetType concept; can't find where though.
*Thread Reply:* I've taken a look at your repo. Looks great so far!
One thing I've noticed: I don't think you need to use any stuff from Marquez to emit events. Its lineage ingestion API is deprecated - you can just use the openlineage-python client. If there's something you think is missing from it, feel free to write that here or open an issue.
*Thread Reply:* And would that be replaced by just some Input/Output notion @Maciej Obuchowski?
*Thread Reply:* Oh yeah I got a little confused by the single lineage endpoint - but I’ve realised how it all works now. I’m still using the marquez backend to view things but I’ll use the openlineage-client to talk to it
*Thread Reply:* Yes 🙌
*Thread Reply:* Hey, Airflow integration tests do not pass env variables to PRs from forks due to security reasons - anyone could create a malicious PR and dump secrets
*Thread Reply:* So, they will fail and there's nothing to do from your side 🙂
*Thread Reply:* We probably should split those into ones that don't touch external systems, and run those for all PRs
*Thread Reply:* ah okie. good to know. And in build-integration-spark: Could not resolve all artifacts. Is that also a known issue? Or something from my side that I could fix?
*Thread Reply:* Looks like a gradle server problem?
> Could not get resource '<https://plugins.gradle.org/m2/com/diffplug/spotless/spotless-lib/2.13.2/spotless-lib-2.13.2.module>'.
> Could not GET '<https://plugins.gradle.org/m2/com/diffplug/spotless/spotless-lib/2.13.2/spotless-lib-2.13.2.module>'. Received status code 500 from server: Internal Server Error
*Thread Reply:* After retry, there's a spotless error:
········.orElse(Collections.emptyList()).stream()
*Thread Reply:* I think this is due to a mismatch between the behavior of spotless in Java 8 and Java 11+ - which you probably used 🙂
*Thread Reply:* ah.. i used java11. so shall i rerun something with java8 setup as sdk?
*Thread Reply:* For spotless, you can just fix this one line 🙂
Though I don't guarantee that tests that run later will pass, so you might need Java 8 for later testing
*Thread Reply:* yup looks better now
*Thread Reply:* thanks
*Thread Reply:* will somebody please review my PR? had to already adjust due to updates on same test class 🙂
*Thread Reply:* @Thomas Fredriksen would love any feedback
*Thread Reply:* nicely done! As we discussed in another thread - the way you have implemented lineage using FlowRunner and TaskRunner is likely the best way to do this. Let me know if you need any help, I would love to see this PR get merged!
*Thread Reply:* Hey @Brad, it looks great!
I've seen you're using task_qualified_name to name datasets and I don't think it's the right way.
I'd take a look at naming conventions here: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md
*Thread Reply:* Hey @Maciej Obuchowski thanks for the feedback. Yep the naming was a bit of a placeholder. Open to any recommendations.. I think things like dbt or pyspark are straightforward (we could add special handling for tasks like that) but what about regular transformation type tasks that run in a scheduler? Do you have any naming preference? Say I just had some pandas transform task in prefect for example
*Thread Reply:* First of all, not all tasks are producing and consuming datasets. For example, I wouldn't expect any of the Github tasks to have any datasets.
Second, in Airflow we have a concept of Extractor where you can write specialized code to expose datasets. For example, for BigQuery we extract datasets from query plan. Now, I'm not sure if this concept would translate well to Prefect - but if yes, then we have some helpers inside openlineage common library that could be reused. Also, this way allows to emit additional facets, some of which are really useful - like query statistics for BigQuery, and data quality tests for dbt.
*Thread Reply:* You've raised some good points @Maciej Obuchowski - I might have been thinking about this integration in slightly the wrong way. I think based on your comments I'll refactor some of the code to hook into the Results object in prefect (the Result object is the way in which data is serialized and persisted).
> Now, I'm not sure if this concept would translate well to Prefect - but if yes, then we have some helpers inside openlineage common library that could be reused
This definitely applies to prefect and the similar tasks exist in prefect and we should definitely leverage the common library in this case.
*Thread Reply:* > Yeah I agree with this. I'd like to make it as easy as possible to opt-in, but I think you're right that there needs to be some hooks for user defined lineage. I'll think about this a little more.
First version of an integration doesn't have to be perfect. In particular, not handling this use case would be okay, since it does not lock us into some particular way of doing it later.
> My initial thoughts here were that it would still be good to have lineage as these tasks do have side effects, and downstream consumers of the lineage data might want to know about these tasks. However I don't have a good feeling yet how best to do this, so I'm going to park those thoughts for now.
*Thread Reply:* > Won’t existence of an event be enough? After all, we’ll still have it despite it not having any input and output datasets.
Duh, yep you’re right @Maciej Obuchowski, I’m overthinking this. I’m going to clean this up based on your comments
*Thread Reply:* Hi @Brad. How will this integration work for Prefect flows running in Prefect Cloud or on Prefect Server?
*Thread Reply:* Hi @Thomas Fredriksen - it'll relate to the agent actually - you'll need to pass the flow runner class to the agent when running
*Thread Reply:* nice!
*Thread Reply:* Unfortunately I've been a little busy the past week, and I will be for the rest of this week
*Thread Reply:* but I do plan to pick this up next week
*Thread Reply:* (the additional changes I mention above)
*Thread Reply:* looking forward to it 🙂 let me know if you need any help!
*Thread Reply:* yeah when I get this next lot of stuff in - I'd love for people to test it out
*Thread Reply:* I think you can refer to https://openlineage.io/
*Thread Reply:* It should be really easy to add additional databases. Basically, we'd need to know how to get the namespace for that database: https://github.com/OpenLineage/OpenLineage/blob/main/integration/common/openlineage/common/provider/dbt.py#L467
The first step should be to add SQLServer or Dremio to the dataset naming schema here https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md
*Thread Reply:* Thank you @Maciej Obuchowski,
I tried to give it a try but without success yet. Not sure where I am supposed to add the sqlserver naming schema... If you have any documentation that I could read I would be glad =) Many thanks
*Thread Reply:* This would be adding a paragraph similar to this one: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md#snowflake
*Thread Reply:* Snowflake
See: Object Identifiers — Snowflake Documentation
Datasource hierarchy:
• account name
*Thread Reply:* For clarity on the provenance vs lineage point (in case I'm using those terms incorrectly...)
Our platform performs automated enrichment and processing of data. In doing so, we often make calls to functions or out to other data services (such as APIs, or SELECTs against databases). We capture the inputs that pass to these, along with the outputs. (And, if the input is derived from other outputs, we capture the full chain, right back to the root).
*Thread Reply:* Not sure I understand you right, but are you interested in tracking individual API calls, and for example, values of some parameters passed for one call?
*Thread Reply:* I guess that's not in OpenLineage scope, as we're interested more in tracking metadata for whole datasets. But I might be wrong, some other people might chime in.
We could of course model this situation, but that would capture for example schema of those parameters. Not their values.
*Thread Reply:* I think this might be better suited for https://opentelemetry.io/
*Thread Reply:* Kinda, but not really. Telemetry data is metadata about the API calls. We have that, but it's not interesting to our customers. It's the metadata about the data that Vyne provides that we want to expose.
Our customers use Vyne to fetch data from lots of different sources. Eg:
*Thread Reply:* The core OpenLineage model is documented at https://github.com/OpenLineage/OpenLineage/#core-model . The model is really focused on Jobs and Datasets. Jobs have Runs which have start and end times (typically scheduled start/end times as well) and read from and/or write to the target datasets. If your transformation chain fits within that model, then I think you can definitely record and share the lineage information with your customers. The existing implementations are all focused on batch data access, though streaming should be possible to capture as well
*Thread Reply:* Hello @Drew Bittenbender!
For your first two questions:
*Thread Reply:* Thank you @Faouzi. Is there any documentation/best practices to write your own extractor, or is it "read the code"? We use the Python, Docker and SSH operators a lot. Maybe those don't fit into the lineage paradigm well, but want to give it a shot
*Thread Reply:* To the best of my knowledge there is no documentation to guide you through the design of your own extractor. So yes, we need to read the code. Here is a link where you can see how they did it for the postgres extractor and others: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow/openlineage/airflow/extractors
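To make the shape of an extractor concrete, a rough skeleton modeled on the postgres/bigquery ones - the import paths and base-class contract here are assumptions from that era of the code and have changed between openlineage-airflow versions:
```
from openlineage.airflow.extractors.base import BaseExtractor, StepMetadata

class SSHOperatorExtractor(BaseExtractor):
    # Operator this extractor handles (string or class, depending on version)
    operator_class = 'SSHOperator'

    def extract(self) -> StepMetadata:
        # self.operator is the Airflow operator instance; map whatever
        # metadata it exposes to input/output datasets here.
        return StepMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            inputs=[],
            outputs=[],
        )
```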
*Thread Reply:* I think in the case of "bring your own code" operators like the Python or Docker ones, it might be better to use the lineage_run_id macro and use the openlineage-python library inside, instead of implementing an extractor.
*Thread Reply:* I think @Maciej Obuchowski is right here. The airflow integration will create the parent jobs, but to get the dataset input/output links, it's best to do that directly from the python/docker scripts. If you report the parent run id, Marquez will link the jobs together correctly
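A sketch of that pattern - passing the lineage run id into a "bring your own code" task via the template macro. The macro's exact signature has varied between openlineage-airflow releases, so treat this as illustrative; my_script.py is hypothetical:
```
from airflow.operators.bash import BashOperator

run_script = BashOperator(
    task_id="run_script",
    # The (assumed) lineage_run_id macro renders the OpenLineage run id
    # for this task instance; the child script can then emit its own
    # events carrying a ParentRunFacet that points back at it.
    bash_command=(
        "OPENLINEAGE_PARENT_RUN_ID='{{ lineage_run_id(run_id, task) }}' "
        "python my_script.py"
    ),
)
```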
*Thread Reply:* To clarify on what airflow operators are supported out of the box:
• postgres
• bigquery
• snowflake
*Thread Reply:* OpenLineage doesn’t have field level lineage yet. Here is the proposal for adding it: https://github.com/OpenLineage/OpenLineage/issues/148
*Thread Reply:* Those two specs look compatible, so Datahub should be able to consume this lineage metadata in the future
*Thread Reply:* Hi! Airflow 2.x support is currently in development - you can follow along with the progress here:
https://github.com/OpenLineage/OpenLineage/issues/205
*Thread Reply:* Thank you for your reply!
*Thread Reply:* There should be a first version of Airflow 2.X support soon: https://github.com/OpenLineage/OpenLineage/pull/305
We’re labelling it experimental because the config step might change as discussions in the airflow github evolve. It will track successful jobs in its current state.
*Thread Reply:* Take a look at this: https://docs.getdbt.com/docs/dbt-cloud/dbt-cloud-api/metadata/metadata-overview
*Thread Reply:* @SAM Let us know of your progress.
*Thread Reply:* Hey, can you create a ticket in the OpenLineage repository? FWIW Redshift is very similar to postgres, so supporting it won't be hard.
*Thread Reply:* Hey @Maciej Obuchowski 😊
Yep, will do now! Thanks!
*Thread Reply:* Well...will do tomorrow morning 😅
*Thread Reply:* Here’s the issue: https://github.com/OpenLineage/OpenLineage/issues/318
*Thread Reply:* Thanks a lot. I pulled it in the current project.
*Thread Reply:* @Julien Le Dem @Maciej Obuchowski I’m not familiar with the dbt-ol codebase, but I’m willing to help on this if you guys can give me a bit of guidance 😅
*Thread Reply:* @ale can you help us define a naming schema for redshift, as we have for other databases? https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md
*Thread Reply:* Sure!
*Thread Reply:* will work on this today and I’ll try to submit a PR by EOD
*Thread Reply:* There you go https://github.com/OpenLineage/OpenLineage/pull/324
*Thread Reply:* Host would be something like
examplecluster.<XXXXXXXXXXXX>.us-west-2.redshift.amazonaws.com
right?
*Thread Reply:* Yep, let me update the PR
*Thread Reply:* Done
*Thread Reply:* 🙌
*Thread Reply:* If you want to look at the dbt integration itself, there are two things:
We need to determine how Redshift adapter reports metrics https://github.com/OpenLineage/OpenLineage/blob/610a687bf69df2b52ec4ac4da80b4a05580e8d32/integration/common/openlineage/common/provider/dbt.py#L412
*Thread Reply:* I figured out how to generate the namespace.
But I can’t understand which of the JSON files is inspected for metrics. Is it run_results.json?
*Thread Reply:* yes, run_results.json - it's different in bigquery and snowflake, so I presume it's different in redshift too
*Thread Reply:* Ok thanks!
*Thread Reply:* Should be stats:rows:value
*Thread Reply:* Regarding namespace: if env_var is used in profiles.yml, how is this handled now?
*Thread Reply:* Well, it isn't. This is relevant only if you passed the cluster hostname this way, right?
*Thread Reply:* Exactly
*Thread Reply:* If you think it makes sense, I can submit a PR to handle dbt profiles with env_var
*Thread Reply:* Do you want to run jinja on the dbt profile?
*Thread Reply:* Theoretically, we'd need to run it also on dbt_project.yml, but we only take target path and profile name from it.
*Thread Reply:* The env_var syntax in the profile is quite simple, I was thinking of extracting the env var name using re and then retrieving the value from os
*Thread Reply:* It would work, but we can actually use jinja - if you're using dbt, it's already included.
The method is pretty simple:
```
@contextmember
@staticmethod
def env_var(var: str, default: Optional[str] = None) -> str:
    ...
```
*Thread Reply:* Oh cool!
I will definitely use this one!
*Thread Reply:* We'd be sure that our implementation matches dbt's one, right? Also, you'd support default method for free
*Thread Reply:* So this env_var method is defined in dbt and not in the OpenLineage codebase, right?
*Thread Reply:* yes
*Thread Reply:* dbt is on Apache license 🙂
*Thread Reply:* Should we import the dbt package and use the method, or should we just copy/paste the method inside the OpenLineage codebase?
*Thread Reply:* I’m asking for guidance here 😊
*Thread Reply:* I think we should just do basic jinja template rendering in our code like in the quick example: https://realpython.com/primer-on-jinja-templating/#quick-examples
just with the env_var method passed to the render method 🙂
*Thread Reply:* basically, here in the code we should read the file, do the jinja render, and load yaml from string instead of straight from file:
https://github.com/OpenLineage/OpenLineage/blob/610a687bf69df2b52ec4ac4da80b4a05580e8d32/integration/common/openlineage/common/provider/dbt.py#L176
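A rough sketch of that change (the function name is made up; env_var mimics dbt's behavior):
```
import os
from typing import Optional

import yaml
from jinja2 import Template

def env_var(var: str, default: Optional[str] = None) -> str:
    # Resolve from the environment, fall back to the default, else fail.
    if var in os.environ:
        return os.environ[var]
    if default is not None:
        return default
    raise KeyError(f"Env var required but not provided: '{var}'")

def load_yaml_with_jinja(path: str) -> dict:
    # Read the raw file, render jinja with env_var available, then load
    # YAML from the rendered string instead of straight from the file.
    with open(path) as f:
        rendered = Template(f.read()).render(env_var=env_var)
    return yaml.safe_load(rendered)
```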
*Thread Reply:* ok, got it.
Will try to implement following your suggestions. Thanks @Maciej Obuchowski 🙌
*Thread Reply:* We need to:
• read profile.yml
• render it with jinja2.Template
*Thread Reply:* The dbt method implements that:
```
@contextmember
@staticmethod
def env_var(var: str, default: Optional[str] = None) -> str:
    ...
```
*Thread Reply:* Ok, but I need to pass var to the env_var method.
And to pass the var value, I need to look into the loaded Template and search for env var names…
*Thread Reply:* that's what jinja does - you're passing the function to jinja render, and it's calling it itself
*Thread Reply:* you can try the quick example from here, but just pass the env_var method (slightly adjusted - as a standalone function and without the undefined error) and call it inside the template: https://realpython.com/primer-on-jinja-templating/#quick-examples
*Thread Reply:* Ok, will try
*Thread Reply:* I’m trying to run
pip install -e ".[dev]"
so that I can test my changes, but I get
ERROR: Could not find a version that satisfies the requirement openlineage-integration-common[dbt]==0.2.3 (from openlineage-dbt[dev]) (from versions: 0.0.1rc7, 0.0.1rc8, 0.0.1, 0.1.0rc5, 0.1.0, 0.2.0, 0.2.1, 0.2.2)
*Thread Reply:* can you try installing it manually?
pip install openlineage-integration-common[dbt]==0.2.3
*Thread Reply:* I mean, it exists in pypi: https://pypi.org/project/openlineage-integration-common/#files
*Thread Reply:* Yep, maybe it’s our internal Pypi repo which is not synced.
Installing from the public pypi resolved the issue
*Thread Reply:* Can’t seem to make env_var work as the render method of a Template 😅
*Thread Reply:* try this:
*Thread Reply:* ```
import os
from typing import Optional
from jinja2 import Template
...
```
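The snippet above is cut off in the archive; a minimal reconstruction of the jinja_example.py being discussed, consistent with the run shown below, might be:
```
import os
from typing import Optional

from jinja2 import Template

def env_var(var: str, default: Optional[str] = None) -> str:
    # Standalone take on dbt's env_var, without the compiler error.
    if var in os.environ:
        return os.environ[var]
    if default is not None:
        return default
    raise KeyError(f"Env var required but not provided: '{var}'")

# jinja calls env_var itself during render - no need to pre-extract
# variable names from the template.
template = Template("Hello {{ env_var('ENV_VAR', 'unknown') }}!")
print(template.render(env_var=env_var))
```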
*Thread Reply:* works for me:
mobuchowski@thinkpad [18:57:14] [~]
-> % ENV_VAR=world python jinja_example.py
Hello world!
*Thread Reply:* Finally 😅
https://github.com/OpenLineage/OpenLineage/pull/328
There are minimal tests for Redshift and env vars.
*Thread Reply:* Hi @Maciej Obuchowski 😊
Regarding this comment https://github.com/OpenLineage/OpenLineage/pull/328#discussion_r726586564
How can we distinguish between snowflake, bigquery and redshift in this method?
*Thread Reply:* Well, why not do something like
```
bytes = get_from_multiple_chains(... rest of stuff)
```
*Thread Reply:* we can store the adapter type in the class
*Thread Reply:* well, I've looked at the last commit and that's exactly what you did 👍
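For readers following along: the real get_from_multiple_chains lives in openlineage/common/provider/dbt.py; a standalone sketch of what such a helper does, with a made-up sample input:
```
from typing import Any, List, Optional

def get_from_multiple_chains(node: dict, chains: List[List[str]]) -> Optional[Any]:
    # Try each chain of keys against the nested dict; first hit wins.
    for chain in chains:
        current: Any = node
        for key in chain:
            if isinstance(current, dict) and key in current:
                current = current[key]
            else:
                current = None
                break
        if current is not None:
            return current
    return None

# Redshift-style run_results metrics, as discussed above
sample = {"stats": {"rows": {"value": 42}}}
print(get_from_multiple_chains(sample, [["stats", "rows", "value"]]))  # 42
```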
*Thread Reply:* Now, have you tested your branch on a real redshift cluster? I don't think we 100% need automated tests for that now, but it would be nice to have confirmation that it works.
*Thread Reply:* Not yet, but I'll try to do that this afternoon.
Need to figure out how to build the lib locally, then I can use it to test with Redshift
*Thread Reply:* I think pip install -e .[dbt] in the common directory should be enough
*Thread Reply:* I was able to run my local branch with my Redshift cluster and metadata is pushed to Marquez.
However, I’m not sure about the namespace.
I also see exceptions in Marquez logs
*Thread Reply:* namespace: well, if it matches what you put into your profile, there's not much we can do. I don't understand why you connect to redshift via host, maybe this is related to IAM?
*Thread Reply:* I think the marquez error is because we don't send SourceCodeLocationJobFacet
*Thread Reply:* Regarding the namespace, I will check it and figure it out 😊
Regarding the error: in the context of this PR, is it something I should worry about or not?
*Thread Reply:* I think not in the context of the PR. It certainly deserves a separate issue in the Marquez repository.
*Thread Reply:* 👍
*Thread Reply:* Is there anything else I can do to improve the PR?
*Thread Reply:* did you figure out the namespace stuff?
I think it's ready to be merged outside of that
*Thread Reply:* Not yet
*Thread Reply:* Ok i figured it out.
When running dbt locally, we connect to Redshift using an SSH tunnel.
dbt runs on Docker, hence it can access the tunnel using host.docker.internal
*Thread Reply:* So the namespace is correct
*Thread Reply:* Makes sense. So, let's merge it, after the DCO bot gets up again.
*Thread Reply:* 👍
*Thread Reply:* merged your PR 🙌
*Thread Reply:* 🎉
*Thread Reply:* I think I'm going to change it up a bit.
The problem is that we can try to render jinja everywhere, including comments. I tried to make it skip unknown methods and values here, but I think the right solution is to load the yaml, and then try to render jinja for values.
*Thread Reply:* Ok sounds good to me!
*Thread Reply:* Does it work with simply dbt run?
*Thread Reply:* also, do you have dbt-snowflake installed?
*Thread Reply:* it works with dbt run
*Thread Reply:* no i haven’t installed dbt-snowflake
*Thread Reply:* what the dbt says - the snowflake profile with dev target - is that what you meant to run or was it something else?
*Thread Reply:* it feels very weird to me, since the dbt-ol script just runs dbt run underneath
*Thread Reply:* this is my profiles.yml file:
```
snowflake:
  target: dev
  outputs:
    ...
```
*Thread Reply:* Yes, it looks like everything is okay on your side...
*Thread Reply:* maybe I’ll restart my machine and try again
*Thread Reply:* can you try
OPENLINEAGE_URL=http://localhost:5000 dbt-ol debug
*Thread Reply:* Actually i had to use venv - that fixed the above issue. However, i ran into another problem: no jobs / datasets found in marquez:
*Thread Reply:* Good that you fixed that one 🙂 Regarding the last one, I found it independently yesterday and a PR fixing it is already waiting for review: https://github.com/OpenLineage/OpenLineage/pull/322
*Thread Reply:* oh, thanks a lot
*Thread Reply:* There will be a release soon: https://openlineage.slack.com/archives/C01CK9T7HKR/p1633631825147900
*Thread Reply:* Hi,
openlineage-dbt==0.2.3 worked, thanks a lot for the quick fix.
*Thread Reply:* So, as a quick fix, shutting down and re-starting the docker container resets everything.
./docker/up.sh
*Thread Reply:* I guess that's the easiest way now. There should be an API for that.
*Thread Reply:* @Alex P Yeah, we're realizing that being able to delete metadata is becoming very important. And, as @Maciej Obuchowski mentioned, dropping your entire database is the only way currently (not ideal!). We do have an issue in the Marquez backlog to expose delete APIs: https://github.com/MarquezProject/marquez/issues/754
*Thread Reply:* A bit more discussion is needed though. Like what if a dataset is deleted, but you still want to keep track that it existed at some point? (i.e. soft vs hard deletes). But, for the case that you just want to clear metadata because you were testing things out, then yeah, that's more obvious and requires little discussion of the API upfront.
*Thread Reply:* @Alex P I moved the delete APIs to the Marquez 0.20.0 release
*Thread Reply:* Thanks Willy.
*Thread Reply:* I have also updated a corresponding issue to track this in OpenLineage: https://github.com/OpenLineage/OpenLineage/issues/323
*Thread Reply:* Reminder that the meeting is today. See you soon
*Thread Reply:* The recording and notes of the meeting are now available:
https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting#MonthlyTSCmeeting-Oct13th2021
*Thread Reply:* For people following along, dbt changed the schema of its metadata, which broke the openlineage integration. However, we were a bit too stringent on validating the schema version (they increment it every time, even if it’s backwards compatible, which it is in this case). We will fix that so that future compatible changes don’t prevent the ol integration from working.
*Thread Reply:* As one of the main integrations, it would be good to connect more within the dbt community for the next releases, by testing the release candidates 👍
Thanks for the PR
*Thread Reply:* Yeah, I totally agree with you. We also should be more proactive and be more aware of what’s coming in future dbt releases. Sorry if you were affected by this bug :ladybug:
*Thread Reply:* We’ve released OpenLineage 0.2.3 with the hotfix for adding dbt v3 manifest support, see https://github.com/OpenLineage/OpenLineage/releases/tag/0.2.3
You can download and install openlineage-dbt 0.2.3 with the fix using:
pip install openlineage-dbt==0.2.3
*Thread Reply:* @Maciej Obuchowski might know
*Thread Reply:* @Drew Bittenbender dbt-ol always calls the dbt command now, without spawning a shell - so it does not have access to bash aliases.
Can you elaborate about your use case? Do you mean that dbt in your path does docker run or something like this? It still might be a problem if we won't have access to artifacts generated by dbt in the target directory.
*Thread Reply:* I am running on a mac and I have aliased (.zshrc) dbt to execute docker run against the fishtownanalytics docker image rather than installing dbt natively (homebrew, etc). I am doing this so that the dbt configuration is portable and reusable by others.
It seems that by installing openlineage-dbt in a virtual environment, it pulls down its own version of dbt which it calls inline rather than shelling out and executing the dbt setup resident in the host system. I understand that opening a shell is a security risk so that is understandable.
*Thread Reply:* It does not pull down, it just assumes that it's in the system. It would fail if it isn't.
For now I think you could build your own image based on official one, and install openlineage-dbt inside, something like:
*Thread Reply:* and then pass OPENLINEAGE_URL in env while doing docker run
*Thread Reply:* Also, to make sure that using a shell would help in your case: do you bind mount your dbt directory to home? dbt-ol can't run without access to dbt's target directory, so if it's not visible on the host, the only option is to have dbt-ol in the container.
*Thread Reply:* Regarding 2), the data is only visible after the next dbt-ol run - dbt docs generate does not emit events itself, but generates data that the run takes into account.
*Thread Reply:* oh got it, since it's in default, i need to click on it and choose my dbt profile’s account name. thnx
*Thread Reply:* May I know why these highlighted ones don't have a schema? FYI, I used sources in dbt.
*Thread Reply:* Do they have it in dbt docs?
*Thread Reply:* I prepared this yaml file, not sure this is what u asked
*Thread Reply:* I don't think anything is wrong with your branch. It's also not working on mine. Maybe it's globally stuck?
*Thread Reply:* > Is there a way to generate OpenLineage output that contains a mapping between input and output fields?
OpenLineage defines discrete classes for both OpenLineage.InputDataset and OpenLineage.OutputDataset datasets. But, for clarification, are you asking:
*Thread Reply:* > In Azure Databricks sources often map to ADB mount points. We are looking for a way to translate this into source metadata in the OL output. Is there some configuration that would make this possible, or any other suggestions?
I would look into our OutputDatasetVisitors class (as a starting point) that extracts metadata from the spark logical plan to construct a mapping between a logical plan and one or more OpenLineage.Dataset for the spark job. But, I think @Michael Collado will have a more detailed suggestion / approach to what you’re asking
*Thread Reply:* are the sources mounted like local filesystem mounts? are you ending up with datasources that point to the local filesystem rather than some dbfs url? (sorry, I'm not familiar with databricks or azure at this point)
*Thread Reply:* I think under the covers they are an os level fs mount, but it is using an ADB specific api, dbutils.fs.mount. It is using the ADB filesystem.
*Thread Reply:* Do you use the dbfs scheme to access the files from Spark as in the example on that page?
df = spark.read.text("dbfs:/mymount/my_file.txt")
*Thread Reply:* @Willy Lulciuc In our project, @Will Johnson had generated some sample OL output from just reading in and writing out a dataset to blob storage. In the resulting output, I see the columns represented as fields under the schema element with a set represented for output and another for input. I would need the mapping of in and out columns to generate column level lineage so wondering if it is possible to get or am I just missing it somewhere? Thanks for your help!
*Thread Reply:* Ahh, well currently, no, but it has been discussed and is on the OpenLineage roadmap. Here’s a proposal opened by @Julien Le Dem, column level lineage facet, that starts the discussion to add the columnLineage facet to the datasets model in order to support column-level lineage. Would be great to get your thoughts!
*Thread Reply:* @Michael Collado - Databricks allows you to reference a file called /mnt/someMount/some/file/path
The way you have referenced it would let you hit the file with local file system stuff like pandas / local python.
*Thread Reply:* @Michael Collado - Databricks allows you to reference a file called /mnt/someMount/some/file/path
The way you have referenced it would let you hit the file with local file system stuff like pandas / local python.
*Thread Reply:* For column level lineage, you can add your own custom facets: Here’s an example in the Spark integration: (LogicalPlanFacet) https://github.com/OpenLineage/OpenLineage/blob/5f189a94990dad715745506c0282e16fd8[…]openlineage/spark/agent/lifecycle/SparkSQLExecutionContext.java +
*Thread Reply:* For column level lineage, you can add your own custom facets: Here’s an example in the Spark integration: (LogicalPlanFacet) https://github.com/OpenLineage/OpenLineage/blob/5f189a94990dad715745506c0282e16fd8[…]openlineage/spark/agent/lifecycle/SparkSQLExecutionContext.java Here is the paragraph about this in the spec: https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md#custom-facet-naming
*Thread Reply:* This example adds facets to the run, but you can also add them to the job
+*Thread Reply:* This example adds facets to the run, but you can also add them to the job
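To illustrate the custom-facet naming convention from the spec linked above, here is a hedged sketch of what a custom facet could look like inside an event. The "myCompany" prefix, facet name, and "mapping" payload are all hypothetical, not an existing OpenLineage facet:

# Illustrative only: a custom run facet following the spec's naming convention
# (producer-specific prefix + facet name). Payload fields are made up.
custom_run_facets = {
    "myCompany_columnMapping": {
        "_producer": "https://github.com/my-org/my-integration",         # who emitted the facet
        "_schemaURL": "https://example.com/schemas/ColumnMapping.json",  # where its schema lives
        "mapping": {"out_col": ["in_col_a", "in_col_b"]},                # hypothetical payload
    }
}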
@@ -25006,7 +25012,7 @@*Thread Reply:* unfortunately, there's not yet a way to add your own custom facets to the spark integration- there's some work on extensibility to be done
+*Thread Reply:* unfortunately, there's not yet a way to add your own custom facets to the spark integration- there's some work on extensibility to be done
@@ -25032,7 +25038,7 @@*Thread Reply:* for the hackathon's sake, you can check out the package and just add in whatever you want
+*Thread Reply:* for the hackathon's sake, you can check out the package and just add in whatever you want
@@ -25058,7 +25064,7 @@*Thread Reply:* Thank you guys!!
+*Thread Reply:* Thank you guys!!
@@ -25145,7 +25151,7 @@*Thread Reply:* You can also pass the settings independently if you want something more flexible: https://github.com/OpenLineage/OpenLineage/blob/8afc4ff88b8dd8090cd9c45061a9f669fe[…]n/java/io/openlineage/spark/agent/OpenLineageSparkListener.java
+*Thread Reply:* You can also pass the settings independently if you want something more flexible: https://github.com/OpenLineage/OpenLineage/blob/8afc4ff88b8dd8090cd9c45061a9f669fe[…]n/java/io/openlineage/spark/agent/OpenLineageSparkListener.java
*Thread Reply:* SparkSession.builder()
+
SparkSession.builder()
.config("spark.jars.packages", "io.openlineage:openlineage_spark:0.2.+")
.config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
.config("spark.openlineage.host", "<https://localhost>")
@@ -25228,7 +25234,7 @@
*Thread Reply:* It is going to add /lineage at the end: https://github.com/OpenLineage/OpenLineage/blob/8afc4ff88b8dd8090cd9c45061a9f669fe[…]rc/main/java/io/openlineage/spark/agent/OpenLineageContext.java
+*Thread Reply:* It is going to add /lineage at the end: https://github.com/OpenLineage/OpenLineage/blob/8afc4ff88b8dd8090cd9c45061a9f669fe[…]rc/main/java/io/openlineage/spark/agent/OpenLineageContext.java
*Thread Reply:* the apiKey setting is sent in an “Authorization” header
+*Thread Reply:* the apiKey setting is sent in an “Authorization” header
@@ -25305,7 +25311,7 @@*Thread Reply:* “Bearer $KEY”
+*Thread Reply:* “Bearer $KEY”
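Putting the settings from this thread together, a minimal PySpark sketch (the host, namespace, and key values are placeholders; the listener then sends the apiKey as an "Authorization: Bearer &lt;key&gt;" header):

# Sketch: wiring up the OpenLineage Spark listener with an API key.
# All values below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.3.+")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.host", "https://backend.example.com")
    .config("spark.openlineage.version", "v1")
    .config("spark.openlineage.namespace", "my_namespace")
    .config("spark.openlineage.apiKey", "my-api-key")  # sent as "Bearer my-api-key"
    .getOrCreate()
)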
@@ -25331,7 +25337,7 @@*Thread Reply:* Thank you @Julien Le Dem it seems in both cases (defining the url endpoint with spark.openlineage.url and with the components: spark.openlineage.host / openlineage.version / openlineage.namespace / etc.) OpenLineage will strip out url parameters and rebuild the url endpoint with /lineage.
+*Thread Reply:* Thank you @Julien Le Dem it seems in both cases (defining the url endpoint with spark.openlineage.url and with the components: spark.openlineage.host / openlineage.version / openlineage.namespace / etc.) OpenLineage will strip out url parameters and rebuild the url endpoint with /lineage.
I think we might need to add in a url parameter configuration for our hackathon. We're using a bit of serverless code to shuttle open lineage events to a queue so that another job and/or serverless application can read that queue at its leisure.
@@ -25416,7 +25422,7 @@*Thread Reply:* Yes, please open an issue detailing the use case.
+*Thread Reply:* Yes, please open an issue detailing the use case.
@@ -25497,7 +25503,7 @@*Thread Reply:* This is because this specific action is not covered yet. You can see the “spark_unknown” facet is describing things that are not understood yet +
*Thread Reply:* This is because this specific action is not covered yet. You can see the “spark_unknown” facet is describing things that are not understood yet
run": {
...
"facets": {
@@ -25536,7 +25542,7 @@
*Thread Reply:* I think this is part of the Spark 3 gap
+*Thread Reply:* I think this is part of the Spark 3 gap
@@ -25562,7 +25568,7 @@*Thread Reply:* an unknown output will cause missing output lineage
+*Thread Reply:* an unknown output will cause missing output lineage
@@ -25588,7 +25594,7 @@*Thread Reply:* Output handling is here: https://github.com/OpenLineage/OpenLineage/blob/e0f1852422f325dc019b0eab0e466dc905[…]io/openlineage/spark/agent/lifecycle/OutputDatasetVisitors.java
+*Thread Reply:* Output handling is here: https://github.com/OpenLineage/OpenLineage/blob/e0f1852422f325dc019b0eab0e466dc905[…]io/openlineage/spark/agent/lifecycle/OutputDatasetVisitors.java
*Thread Reply:* Ah! Thank you so much, Julien! This is very helpful to understand where that is set. This is a big gap that we want to help address after our hackathon. Thank you!
+*Thread Reply:* Ah! Thank you so much, Julien! This is very helpful to understand where that is set. This is a big gap that we want to help address after our hackathon. Thank you!
@@ -25753,7 +25759,7 @@*Thread Reply:* the github wiki is backed by a git repo but it does not allow PRs. (people do hacks but I’d rather avoid those)
+*Thread Reply:* the github wiki is backed by a git repo but it does not allow PRs. (people do hacks but I’d rather avoid those)
@@ -25892,7 +25898,7 @@*Thread Reply:* Maybe you want to contribute it? +
*Thread Reply:* Maybe you want to contribute it? It's not that hard, mostly testing, and figuring out what would be the naming of openlineage namespace for Trino, and how some additional statistics work.
For example, support for Redshift was recently added by community member @ale
@@ -26068,7 +26074,7 @@*Thread Reply:* That would make sense. I think you are the first person to request this. Is this something you would want to contribute to the project?
+*Thread Reply:* That would make sense. I think you are the first person to request this. Is this something you would want to contribute to the project?
@@ -26094,7 +26100,7 @@*Thread Reply:* I would like to, Julien, but I am not sure how I can do it. Could you guide me on how to start, or show me another integration?
+*Thread Reply:* I would like to, Julien, but I am not sure how I can do it. Could you guide me on how to start, or show me another integration?
@@ -26120,7 +26126,7 @@*Thread Reply:* @David Virgil look at the pull request for the addition of Redshift as a starting guide. https://github.com/OpenLineage/OpenLineage/pull/328
+*Thread Reply:* @David Virgil look at the pull request for the addition of Redshift as a starting guide. https://github.com/OpenLineage/OpenLineage/pull/328
*Thread Reply:* Thanks @Matthew Mullins, I’ll try to add the dbt-spark integration
+*Thread Reply:* Thanks @Matthew Mullins, I’ll try to add the dbt-spark integration
@@ -26232,7 +26238,7 @@*Thread Reply:* OK, was changed here: https://github.com/OpenLineage/OpenLineage/pull/286
+*Thread Reply:* OK, was changed here: https://github.com/OpenLineage/OpenLineage/pull/286
Did you think about this?
*Thread Reply:* In Marquez there was a mechanism to do that. Something like OPENLINEAGE_BACKEND=HTTP|LOG
+*Thread Reply:* In Marquez there was a mechanism to do that. Something like OPENLINEAGE_BACKEND=HTTP|LOG
@@ -26323,7 +26329,7 @@*Thread Reply:* @Mario Measic We're going to add Transport mechanism, that will address use cases like yours. Please comment on this PR what would you expect: https://github.com/OpenLineage/OpenLineage/pull/344
+*Thread Reply:* @Mario Measic We're going to add Transport mechanism, that will address use cases like yours. Please comment on this PR what would you expect: https://github.com/OpenLineage/OpenLineage/pull/344
*Thread Reply:* Nice, thanks @Julien Le Dem and @Maciej Obuchowski.
+*Thread Reply:* Nice, thanks @Julien Le Dem and @Maciej Obuchowski.
@@ -26412,7 +26418,7 @@*Thread Reply:* Also, dbt build
is not working, which is kind of the biggest feature of version 0.21.0. I will try testing the code with modifications to https://github.com/OpenLineage/OpenLineage/blob/c3aa70e161244091969951d0da4f37619bcbe36f/integration/dbt/scripts/dbt-ol#L141
*Thread Reply:* Also, dbt build
is not working, which is kind of the biggest feature of version 0.21.0. I will try testing the code with modifications to https://github.com/OpenLineage/OpenLineage/blob/c3aa70e161244091969951d0da4f37619bcbe36f/integration/dbt/scripts/dbt-ol#L141
I guess there's a reason for it that I didn't see since you support v3 of the manifest.
*Thread Reply:* Also, is it normal not to see the column descriptions for the model/table even though these are provided in the YAML file, persisted in Redshift and also dbt docs generate
has been run before dbt-ol run
?
*Thread Reply:* Also, is it normal not to see the column descriptions for the model/table even though these are provided in the YAML file, persisted in Redshift and also dbt docs generate
has been run before dbt-ol run
?
*Thread Reply:* Tried with dbt
versions 0.20.2 and 0.21.0
, openlineage-dbt==0.3.1
*Thread Reply:* Tried with dbt
versions 0.20.2 and 0.21.0
, openlineage-dbt==0.3.1
*Thread Reply:* I'll take a look at that. Supporting descriptions might be simple, but dbt build
might be a little larger task.
*Thread Reply:* I'll take a look at that. Supporting descriptions might be simple, but dbt build
might be a little larger task.
*Thread Reply:* I opened a ticket to track this: https://github.com/OpenLineage/OpenLineage/issues/376
+*Thread Reply:* I opened a ticket to track this: https://github.com/OpenLineage/OpenLineage/issues/376
*Thread Reply:* The column description issue should be fixed here: https://github.com/OpenLineage/OpenLineage/pull/383
+*Thread Reply:* The column description issue should be fixed here: https://github.com/OpenLineage/OpenLineage/pull/383
*Thread Reply:* If anyone has any questions or comments - happy to discuss here
+*Thread Reply:* If anyone has any questions or comments - happy to discuss here
@@ -26808,7 +26814,7 @@*Thread Reply:* @davzucky
+*Thread Reply:* @davzucky
@@ -26834,7 +26840,7 @@*Thread Reply:* Thanks for updating the community, Brad!
+*Thread Reply:* Thanks for updating the community, Brad!
@@ -26860,7 +26866,7 @@*Thread Reply:* Thank you Brad. Looking forward to seeing how to integrate that with v2
+*Thread Reply:* Thank you Brad. Looking forward to seeing how to integrate that with v2
@@ -26924,7 +26930,7 @@*Thread Reply:* Welcome, @Kevin Kho 👋. Really excited to see this integration kick off! 💯🚀
+*Thread Reply:* Welcome, @Kevin Kho 👋. Really excited to see this integration kick off! 💯🚀
@@ -26989,7 +26995,7 @@*Thread Reply:* While it’s not something we’re supporting at the moment, it’s definitely something that we’re considering!
+*Thread Reply:* While it’s not something we’re supporting at the moment, it’s definitely something that we’re considering!
If you can give me a little more detail on what your system infrastructure is like, it’ll help us set priority and design
@@ -27017,7 +27023,7 @@*Thread Reply:* So, a basic data lake architecture. We are using airflow to trigger jobs. Every job is a pipeline that runs a spark job (in our case it spins up an EMR cluster). So the idea of lineage would be defining inlets and outlets in the dags based on the airflow lineage:
+*Thread Reply:* So, a basic data lake architecture. We are using airflow to trigger jobs. Every job is a pipeline that runs a spark job (in our case it spins up an EMR cluster). So the idea of lineage would be defining inlets and outlets in the dags based on the airflow lineage:
https://airflow.apache.org/docs/apache-airflow/stable/lineage.html
@@ -27047,7 +27053,7 @@*Thread Reply:* Why not use spark integration? https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark
+*Thread Reply:* Why not use spark integration? https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark
@@ -27073,7 +27079,7 @@*Thread Reply:* because there are some other jobs that are not spark: some jobs run in dbt, other jobs run in redshift @Maciej Obuchowski
+*Thread Reply:* because there are some other jobs that are not spark: some jobs run in dbt, other jobs run in redshift @Maciej Obuchowski
@@ -27099,7 +27105,7 @@*Thread Reply:* So, combo of https://github.com/OpenLineage/OpenLineage/tree/main/integration/dbt and PostgresExtractor
from airflow integration should cover Redshift if you're using it from PostgresOperator
🙂
*Thread Reply:* So, combo of https://github.com/OpenLineage/OpenLineage/tree/main/integration/dbt and PostgresExtractor
from airflow integration should cover Redshift if you're using it from PostgresOperator
🙂
It's definitely interesting use case - you'd be using most of the existing integrations we have.
@@ -27127,7 +27133,7 @@*Thread Reply:* @Maciej Obuchowski Do i need to define any extractor in the airflow startup?
+*Thread Reply:* @Maciej Obuchowski Do i need to define any extractor in the airflow startup?
@@ -27153,7 +27159,7 @@*Thread Reply:* I am using Redshift with PostgresOperator and it is returning…
+*Thread Reply:* I am using Redshift with PostgresOperator and it is returning…
[2021-11-06 03:43:06,541] {{__init__.py:92}} ERROR - Failed to extract metadata 'NoneType' object has no attribute 'host' task_type=PostgresOperator airflow_dag_id=counter task_id=inc airflow_run_id=scheduled__2021-11-06T03:42:00+00:00
Traceback (most recent call last):
@@ -27272,7 +27278,7 @@
*Thread Reply:* 1. Please use openlineage.lineage_backend.OpenLineageBackend
as AIRFLOW__LINEAGE__BACKEND
+*Thread Reply:* 1. Please use openlineage.lineage_backend.OpenLineageBackend
as AIRFLOW__LINEAGE__BACKEND
2. Please tell us where you found openlineage.airflow.backend.OpenLineageBackend
, so we can fix the documentation 🙂
*Thread Reply:* https://pypi.org/project/openlineage-airflow/
+*Thread Reply:* https://pypi.org/project/openlineage-airflow/
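For reference, a minimal sketch of the working configuration described above (the URL and namespace values are placeholders; shown as Python for clarity, though normally these are set in the scheduler's environment):

# Sketch: environment the Airflow scheduler needs for the OpenLineage backend.
import os

os.environ["AIRFLOW__LINEAGE__BACKEND"] = "openlineage.lineage_backend.OpenLineageBackend"
os.environ["OPENLINEAGE_URL"] = "http://marquez:5000"   # placeholder OpenLineage-compatible endpoint
os.environ["OPENLINEAGE_NAMESPACE"] = "my_namespace"    # placeholder job namespace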
*Thread Reply:* (I googled it and found that page that seems to have an outdated doc)
+*Thread Reply:* (I googled it and found that page that seems to have an outdated doc)
@@ -27378,7 +27384,7 @@*Thread Reply:* @Maciej Obuchowski +
*Thread Reply:* @Maciej Obuchowski @Julien Le Dem that’s the page I followed. Please revise the documentation, guys, as it is very important
@@ -27405,7 +27411,7 @@*Thread Reply:* It should just copy actual readme
+*Thread Reply:* It should just copy actual readme
*Thread Reply:* PyPi is using the README at the time of the release 0.3.1, rather than the current README, which is 0.4.0. If we send the new release to PyPi it should also update the README
+*Thread Reply:* PyPi is using the README at the time of the release 0.3.1, rather than the current README, which is 0.4.0. If we send the new release to PyPi it should also update the README
@@ -27508,7 +27514,7 @@*Thread Reply:* I set it up in the scheduler and it starts to log data to marquez. But it fails with this error:
+*Thread Reply:* I set it up in the scheduler and it starts to log data to marquez. But it fails with this error:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/openlineage/client/client.py", line 49, in __init__
@@ -27539,7 +27545,7 @@
*Thread Reply:* why is it not a valid URL?
+*Thread Reply:* why is it not a valid URL?
@@ -27565,7 +27571,7 @@*Thread Reply:* Which version of the OpenLineage client are you using? On first check it should be fine
+*Thread Reply:* Which version of the OpenLineage client are you using? On first check it should be fine
@@ -27591,7 +27597,7 @@*Thread Reply:* @John Thomas I was appending double quotes as part of the url. Forget about this error
+*Thread Reply:* @John Thomas I was appending double quotes as part of the url. Forget about this error
@@ -27617,7 +27623,7 @@*Thread Reply:* aaaah, gotcha, good catch!
+*Thread Reply:* aaaah, gotcha, good catch!
@@ -27673,7 +27679,7 @@*Thread Reply:* Are you sure that openlineage-airflow
is present in the container?
*Thread Reply:* Are you sure that openlineage-airflow
is present in the container?
*Thread Reply:* Maybe ADDITIONAL_PYTHON_DEPS
are dependencies needed by the tasks, and are installed after Airflow tries to initialize LineageBackend
?
*Thread Reply:* Maybe ADDITIONAL_PYTHON_DEPS
are dependencies needed by the tasks, and are installed after Airflow tries to initialize LineageBackend
?
*Thread Reply:* I am checking this by accessing the Kubernetes pod
+*Thread Reply:* I am checking this by accessing the Kubernetes pod
@@ -27920,7 +27926,7 @@*Thread Reply:* Yes
+*Thread Reply:* Yes
@@ -27946,7 +27952,7 @@*Thread Reply:* Probably what you want is job hierarchy: https://github.com/MarquezProject/marquez/issues/1737
+*Thread Reply:* Probably what you want is job hierarchy: https://github.com/MarquezProject/marquez/issues/1737
*Thread Reply:* I do not see any benefit of just having some airflow task metadata. I do not see the relationship between tasks. Every task is a job. When I started working on my company's integration with openlineage, I thought that openlineage would give me the relationship between tasks or datasets, and the only thing I see is some metadata on the history of airflow runs that is already provided by airflow
+*Thread Reply:* I do not see any benefit of just having some airflow task metadata. I do not see the relationship between tasks. Every task is a job. When I started working on my company's integration with openlineage, I thought that openlineage would give me the relationship between tasks or datasets, and the only thing I see is some metadata on the history of airflow runs that is already provided by airflow
@@ -28027,7 +28033,7 @@*Thread Reply:* I was expecting to see a nice graph. I think it is missing some features
+*Thread Reply:* I was expecting to see a nice graph. I think it is missing some features
@@ -28053,7 +28059,7 @@*Thread Reply:* at this early stage
+*Thread Reply:* at this early stage
@@ -28079,7 +28085,7 @@*Thread Reply:* It probably depends on whether those tasks are covered by the extractors: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow/openlineage/airflow/extractors
+*Thread Reply:* It probably depends on whether those tasks are covered by the extractors: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow/openlineage/airflow/extractors
@@ -28105,7 +28111,7 @@*Thread Reply:* We are not using any of those operators: bigquery, postgres or snowflake.
+*Thread Reply:* We are not using any of those operators: bigquery, postgres or snowflake.
And what does the GreatExpectations extractor do?
@@ -28135,7 +28141,7 @@*Thread Reply:* And that the same dag graph can be seen in marquez, and not one job per task.
+*Thread Reply:* And that the same dag graph can be seen in marquez, and not one job per task.
@@ -28161,7 +28167,7 @@*Thread Reply:* > It would be good if there is one extractor that relies on the inlets and outlets that you can define in any Airflow task +
*Thread Reply:* > It would be good if there is one extractor that relies on the inlets and outlets that you can define in any Airflow task I think this is a good idea. Overall, OpenLineage strongly focuses on automatic metadata collection. However, using them would be a nice fallback for not-covered-yet cases.
> And that the same dag graph can be seen in marquez, and not one job per task.
@@ -28218,7 +28224,7 @@
*Thread Reply:* Created issue for the manual fallback: https://github.com/OpenLineage/OpenLineage/issues/384
+*Thread Reply:* Created issue for the manual fallback: https://github.com/OpenLineage/OpenLineage/issues/384
*Thread Reply:* @Maciej Obuchowski how many people are working full time on this library? I really would like to adopt it in my company, as we use airflow and spark, but I see that it does not yet have the features we would like.
+*Thread Reply:* @Maciej Obuchowski how many people are working full time on this library? I really would like to adopt it in my company, as we use airflow and spark, but I see that it does not yet have the features we would like.
At the moment, the same info we have in marquez related to the tasks is available in the airflow UI or via the airflow API.
@@ -28305,7 +28311,7 @@*Thread Reply:* > how many people are working full time on this library? +
*Thread Reply:* > how many people are working full time on this library? On Airflow integration or on OpenLineage overall? 🙂
> The game changer for us would be that it could give us features/metadata that we cannot query directly from airflow.
@@ -28362,7 +28368,7 @@
*Thread Reply:* But first, before implementing the last option, I'd like to get consensus about it - so feel free to comment there about your use case
+*Thread Reply:* But first, before implementing the last option, I'd like to get consensus about it - so feel free to comment there about your use case
@@ -28444,7 +28450,7 @@*Thread Reply:* It should only affect you if you're using https://greatexpectations.io/
+*Thread Reply:* It should only affect you if you're using https://greatexpectations.io/
@@ -28470,7 +28476,7 @@*Thread Reply:* I have a similar message after installing openlineage into Amazon MWAA from the scheduler logs:
+*Thread Reply:* I have a similar message after installing openlineage into Amazon MWAA from the scheduler logs:
WARNING:/usr/local/airflow/.local/lib/python3.7/site-packages/openlineage/airflow/extractors/great_expectations_extractor.py:Did not find great_expectations_provider library or failed to import it
*Thread Reply:* I don't think 1) is a good idea. You can have multiple tasks in one dag, processing different datasets and producing different datasets. If you want visual linking of jobs that produce disjoint datasets, then I think you want this: https://github.com/MarquezProject/marquez/issues/1737 +
*Thread Reply:* I don't think 1) is a good idea. You can have multiple tasks in one dag, processing different datasets and producing different datasets. If you want visual linking of jobs that produce disjoint datasets, then I think you want this: https://github.com/MarquezProject/marquez/issues/1737 which will affect the visual layer.
Regarding 2), I think we need to get along with Airflow maintainers regarding long term mechanism on which OL will work: https://github.com/apache/airflow/issues/17984
@@ -28587,7 +28593,7 @@*Thread Reply:* https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#custom-extractors
+*Thread Reply:* https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#custom-extractors
I think you can base your code on any existing extractor, like PostgresExtractor: https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/extractors/postgres_extractor.py#L53
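As a rough sketch of what basing a custom extractor on PostgresExtractor could look like (the base class and method names below are assumptions based on the integration linked above; the operator name and bodies are hypothetical):

# Hypothetical custom extractor, modeled on PostgresExtractor.
from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata

class MyOperatorExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls):
        return ["MyOperator"]  # hypothetical operator this extractor handles

    def extract(self) -> TaskMetadata:
        # Derive inputs/outputs from the operator instance (self.operator),
        # e.g. by parsing its SQL, as PostgresExtractor does.
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            inputs=[],   # OpenLineage Dataset objects for sources
            outputs=[],  # OpenLineage Dataset objects for sinks
        )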
@@ -28645,7 +28651,7 @@*Thread Reply:* Thank you very much @Maciej Obuchowski
+*Thread Reply:* Thank you very much @Maciej Obuchowski
@@ -28697,7 +28703,7 @@*Thread Reply:* It worked like that in Airflow 1.10.
+*Thread Reply:* It worked like that in Airflow 1.10.
This is an unfortunate limitation of LineageBackend API that we're using for Airflow 2. We're trying to work out solution for this with Airflow maintainers: https://github.com/apache/airflow/issues/17984
*Thread Reply:* This job with no output is a symptom of the output not being understood. you should be able to see the facets for that job. There will be a spark_unknown
facet with more information about the problem. If you put that into an issue with some more details about this job we should be able to help.
*Thread Reply:* This job with no output is a symptom of the output not being understood. you should be able to see the facets for that job. There will be a spark_unknown
facet with more information about the problem. If you put that into an issue with some more details about this job we should be able to help.
*Thread Reply:* I'll try to put all the info in a ticket, as it is not working as I would expect
+*Thread Reply:* I'll try to put all the info in a ticket, as it is not working as I would expect
@@ -29089,7 +29095,7 @@*Thread Reply:* Is there an error in the browser javascript console? (example on chrome: View -> Developer -> Javascript console)
+*Thread Reply:* Is there an error in the browser javascript console? (example on chrome: View -> Developer -> Javascript console)
@@ -29171,7 +29177,7 @@*Thread Reply:* Hi Taleb - I don’t know of a generalized example of lineage tracking with Pandas, but you should be able to accomplish this by sending the runEvents manually to the OpenLineage API in your code: +
*Thread Reply:* Hi Taleb - I don’t know of a generalized example of lineage tracking with Pandas, but you should be able to accomplish this by sending the runEvents manually to the OpenLineage API in your code: https://openlineage.io/docs/openapi/
@@ -29198,7 +29204,7 @@*Thread Reply:* Is this a work in progress, that we can investigate? Because I see it in this image https://github.com/OpenLineage/OpenLineage/blob/main/doc/Scope.png
+*Thread Reply:* Is this a work in progress, that we can investigate? Because I see it in this image https://github.com/OpenLineage/OpenLineage/blob/main/doc/Scope.png
*Thread Reply:* To my knowledge, while there are a few proposals around adding a wrapper on some Pandas methods to output runEvents, it’s not something that’s had work started on it yet
+*Thread Reply:* To my knowledge, while there are a few proposals around adding a wrapper on some Pandas methods to output runEvents, it’s not something that’s had work started on it yet
@@ -29278,7 +29284,7 @@*Thread Reply:* I sent some feelers out to get a little more context from folks who are more informed about this than I am, so I’ll get you more info about potential future plans and the considerations around them when I know more
+*Thread Reply:* I sent some feelers out to get a little more context from folks who are more informed about this than I am, so I’ll get you more info about potential future plans and the considerations around them when I know more
@@ -29304,7 +29310,7 @@*Thread Reply:* So, Pandas is tricky because unlike Airflow, DBT, or Spark, Pandas doesn’t own the whole flow, and you might dip in and out of it to use other Python Packages (at least I did when I was doing more Data Science).
+*Thread Reply:* So, Pandas is tricky because unlike Airflow, DBT, or Spark, Pandas doesn’t own the whole flow, and you might dip in and out of it to use other Python Packages (at least I did when I was doing more Data Science).
We have this issue open in OpenLineage that you should go +1 to help with our planning 🙂
*Thread Reply:* interesting... what if it were instead on all the read_**
to_**
functions?
*Thread Reply:* interesting... what if it were instead on all the read_**
to_**
functions?
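As a thought experiment, wrapping the pandas I/O functions might look something like this (purely illustrative; emit_run_event is a hypothetical stand-in for posting an OpenLineage runEvent):

# Purely illustrative: monkey-patch a pandas reader so each call emits lineage.
import functools

import pandas as pd

def emit_run_event(source):
    # Hypothetical emitter; a real one would POST an OpenLineage runEvent.
    print(f"would emit an OpenLineage runEvent for input: {source}")

def with_lineage(fn):
    @functools.wraps(fn)
    def wrapper(path, *args, **kwargs):
        result = fn(path, *args, **kwargs)
        emit_run_event(str(path))
        return result
    return wrapper

pd.read_csv = with_lineage(pd.read_csv)
# pd.read_csv("my_file.csv") would now also emit a lineage event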
*Thread Reply:* The Marquez write APIs are artifacts from before OpenLineage existed, and they’re already slated for deprecation soon.
+*Thread Reply:* The Marquez write APIs are artifacts from before OpenLineage existed, and they’re already slated for deprecation soon.
If you POST an OpenLineage runEvent to the /lineage
endpoint in Marquez, it’ll create any missing jobs or datasets that are relevant.
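For example, a minimal sketch of such a POST (the namespace, job, and dataset names, the producer URL, and the Marquez address are all placeholders):

# Sketch: POST one OpenLineage runEvent to Marquez; it creates the job and
# datasets if they don't exist yet.
import uuid
from datetime import datetime, timezone

import requests

event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "my-namespace", "name": "my-job"},
    "inputs": [{"namespace": "my-namespace", "name": "input_table"}],
    "outputs": [{"namespace": "my-namespace", "name": "output_table"}],
    "producer": "https://github.com/my-org/my-pipeline",
}

resp = requests.post("http://localhost:5000/api/v1/lineage", json=event)
resp.raise_for_status()  # Marquez returns 201 Created on success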
*Thread Reply:* Thanks for the response. That sounds good. Does this include the query interface e.g. +
*Thread Reply:* Thanks for the response. That sounds good. Does this include the query interface e.g.
http://localhost:5000/api/v1/namespaces/testing_java/datasets/incremental_data
as that currently returns the Marquez version of a dataset including default set fields for type
and the above mentioned properties.
*Thread Reply:* I believe the intention for type
is to support a new facet- TBH, it hasn't been the most pressing concern for most users, as most people are only recording tables, not streams. However, there's been some recent work to support Kafka in Spark- maybe it's time to address that deficiency.
*Thread Reply:* I believe the intention for type
is to support a new facet- TBH, it hasn't been the most pressing concern for most users, as most people are only recording tables, not streams. However, there's been some recent work to support Kafka in Spark- maybe it's time to address that deficiency.
I don't actually know what happened to the datasource type field- maybe @Julien Le Dem can comment on whether that field was dropped intentionally or whether it was an oversight.
@@ -29508,7 +29514,7 @@*Thread Reply:* It looks like an oversight, currently Marquez hard codes it to POSTGRESQL
: https://github.com/MarquezProject/marquez/blob/734bfd691636cb00212d7d22b1a489bd4870fb04/api/src/main/java/marquez/db/OpenLineageDao.java#L438
*Thread Reply:* It looks like an oversight, currently Marquez hard codes it to POSTGRESQL
: https://github.com/MarquezProject/marquez/blob/734bfd691636cb00212d7d22b1a489bd4870fb04/api/src/main/java/marquez/db/OpenLineageDao.java#L438
*Thread Reply:* The source has a name though: https://github.com/OpenLineage/OpenLineage/blob/8afc4ff88b8dd8090cd9c45061a9f669fea2151e/spec/facets/DatasourceDatasetFacet.json#L12
+*Thread Reply:* The source has a name though: https://github.com/OpenLineage/OpenLineage/blob/8afc4ff88b8dd8090cd9c45061a9f669fea2151e/spec/facets/DatasourceDatasetFacet.json#L12
*Thread Reply:* If you want to add something please chime in this thread
+*Thread Reply:* If you want to add something please chime in this thread
@@ -29771,7 +29777,7 @@*Thread Reply:* https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
+*Thread Reply:* https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
@@ -29797,7 +29803,7 @@*Thread Reply:* The monthly meeting is happening tomorrow. +
*Thread Reply:* The monthly meeting is happening tomorrow. The purview team will present at the December meeting instead. See the full agenda here: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting You are welcome to contribute.
@@ -29826,7 +29832,7 @@*Thread Reply:* The slides for the meeting later today: https://docs.google.com/presentation/d/1z2NTkkL8hg_2typHRYhcFPyD5az-5-tl/edit#slide=id.ge7d4b64ef4_0_0
+*Thread Reply:* The slides for the meeting later today: https://docs.google.com/presentation/d/1z2NTkkL8hg_2typHRYhcFPyD5az-5-tl/edit#slide=id.ge7d4b64ef4_0_0
@@ -29852,7 +29858,7 @@*Thread Reply:* It’s happening now ^
+*Thread Reply:* It’s happening now ^
@@ -29878,7 +29884,7 @@*Thread Reply:* I have posted the notes and the recording from the last instance of our monthly meeting: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting#MonthlyTSCmeeting-Nov10th2021(9amPT) +
*Thread Reply:* I have posted the notes and the recording from the last instance of our monthly meeting: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting#MonthlyTSCmeeting-Nov10th2021(9amPT) I have a few TODOs to follow up on tickets
@@ -29959,7 +29965,7 @@*Thread Reply:* How do you model continuous processes (not batch processes)? For example a flume or spark job that does some real time processing on data.
+*Thread Reply:* How do you model continuous processes (not batch processes)? For example a flume or spark job that does some real time processing on data.
Maybe it’s simply a “Job”. But then what is a run?
@@ -29987,7 +29993,7 @@*Thread Reply:* How do you model consumers at the end - they can be reports? Data applications, ML model deployments, APIs, GUIs consumed by end users?
+*Thread Reply:* How do you model consumers at the end - they can be reports? Data applications, ML model deployments, APIs, GUIs consumed by end users?
Have you considered having some examples of different use cases like those?
@@ -30015,9 +30021,9 @@*Thread Reply:* By definition, Job is a process definition that consumes and produces datasets. Is it a many-to-many relation? I’ve been wondering about that. Shouldn’t it be more restrictive? +
*Thread Reply:* By definition, Job is a process definition that consumes and produces datasets. Is it a many-to-many relation? I’ve been wondering about that. Shouldn’t it be more restrictive? For example, an important use-case for lineage is troubleshooting or error notifications (e.g. mark a report or job as temporarily in a bad state if an upstream data integration is broken). -In order to be able to do that you need to be able to traverse the graph to find the original error. So having multiple inputs produce a single output makes sense (e.g insert into output_1 select * from x,y group by a,b) . +In order to be able to do that you need to be able to traverse the graph to find the original error. So having multiple inputs produce a single output makes sense (e.g insert into output_1 select ** from x,y group by a,b) .
But what are the cases where you’d want to see multiple outputs? You can have a single process produce multiple tables (as in the above example) but they’d always be separate queries. The actual inputs for each output would be different.
@@ -30046,7 +30052,7 @@*Thread Reply:* > How do you model continuous processes (not batch processes)? For example a flume or spark job that does some real time processing on data. +
*Thread Reply:* > How do you model continuous processes (not batch processes)? For example a flume or spark job that does some real time processing on data. > > Maybe it’s simply a “Job”. But then what is a run? Every continuous process eventually has an end - for example, you can deploy a new version of your Flink pipeline. The new version would be the next Run for the same Job.
@@ -30079,7 +30085,7 @@*Thread Reply:* > How do you model consumers at the end - they can be reports? Data applications, ML model deployments, APIs, GUIs consumed by end users? +
*Thread Reply:* > How do you model consumers at the end - they can be reports? Data applications, ML model deployments, APIs, GUIs consumed by end users? Our reference implementation is a web application https://marquezproject.github.io/marquez/
We definitely do not exclude any of the things you're talking about - and it would make a lot of sense to talk more about potential usages.
@@ -30108,7 +30114,7 @@*Thread Reply:* > By definition, Job is a process definition that consumes and produces datasets. It is many to many relations? I’ve been wondering about that.Shouldn’t be more restrictive? +
*Thread Reply:* > By definition, Job is a process definition that consumes and produces datasets. It is many to many relations? I’ve been wondering about that.Shouldn’t be more restrictive? I think this is too SQL-centric view 🙂
Not everything is a query. For example, those Flink streaming jobs can produce side outputs, or even push data to multiple sinks. We need to model those types of jobs too.
@@ -30139,7 +30145,7 @@*Thread Reply:* > We definitely do not exclude any of the things you’re talking about - and it would make a lot of sense to talk more about potential usages. +
*Thread Reply:* > We definitely do not exclude any of the things you’re talking about - and it would make a lot of sense to talk more about potential usages. Yes, I think it would be great if we expand on potential usages, if the OpenLineage documentation (perhaps) had all kinds of examples for different use-cases or case studies: a financial or healthcare industry case study, and how someone would do an integration with OpenLineage. It would be easier to understand the concepts and make sure things are modeled consistently.
@@ -30166,7 +30172,7 @@*Thread Reply:* > I think this is too SQL-centric view 🙂 +
*Thread Reply:* > I think this is too SQL-centric view 🙂 > > Not everything is a query. For example, those Flink streaming jobs can produce side outputs, or even push data to multiple sinks. We need to model those types of jobs too. Thanks for answering @Maciej Obuchowski
@@ -30253,7 +30259,7 @@*Thread Reply:* @Anthony Ivanov I see what you're trying to model. I think this could be solved by column level lineage though - when we'll have it. OL consumer could look at particular columns and derive which table contained particular error.
+*Thread Reply:* @Anthony Ivanov I see what you're trying to model. I think this could be solved by column level lineage though - when we'll have it. OL consumer could look at particular columns and derive which table contained particular error.
> 2. Within a single flink job and even task: Inventory is written only to S3, UI is written only to DB Does that actually happen? I understand this in case of job, but having single operator write to two different systems seems like bad design. Wouldn't that leave the possibility of breaking exactly-once unless you're going full into two phase commit?
@@ -30282,7 +30288,7 @@*Thread Reply:* > Does that actually happen? I understand this in case of job, but having single operator write to two different systems seems like bad design +
*Thread Reply:* > Does that actually happen? I understand this in case of job, but having single operator write to two different systems seems like bad design In a Spark or flink job it is less likely, now that you mention it. But in a batch job (airflow python or kubernetes operator for example) users could do anything, and then they’d need lineage to figure out what is wrong, even if what they did is suboptimal 🙂
> I see what you’re trying to model.
@@ -30315,7 +30321,7 @@
*Thread Reply:* @Anthony Ivanov take a look at this: https://github.com/OpenLineage/OpenLineage/issues/148
+*Thread Reply:* @Anthony Ivanov take a look at this: https://github.com/OpenLineage/OpenLineage/issues/148
@@ -30367,7 +30373,7 @@*Thread Reply:* We’re adding APIs to delete metadata in Marquez 0.20.0. Here’s the related issue, https://github.com/MarquezProject/marquez/issues/1736
+*Thread Reply:* We’re adding APIs to delete metadata in Marquez 0.20.0. Here’s the related issue, https://github.com/MarquezProject/marquez/issues/1736
*Thread Reply:* Until then, you can connect to the DB directly and drop the rows from both the datasets
and jobs
tables (I know, not ideal)
*Thread Reply:* Until then, you can connect to the DB directly and drop the rows from both the datasets
and jobs
tables (I know, not ideal)
*Thread Reply:* Thanks! I assume deleting information will remain a Marquez only feature rather than becoming part of OpenLineage itself?
+*Thread Reply:* Thanks! I assume deleting information will remain a Marquez only feature rather than becoming part of OpenLineage itself?
@@ -30469,7 +30475,7 @@*Thread Reply:* Yes! Delete operations will be an action supported by consumers of OpenLineage events
+*Thread Reply:* Yes! Delete operations will be an action supported by consumers of OpenLineage events
@@ -30522,7 +30528,7 @@*Thread Reply:* I've been skimming this page: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md
+*Thread Reply:* I've been skimming this page: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md
*Thread Reply:* Yes!
+*Thread Reply:* Yes!
@@ -30819,7 +30825,7 @@*Thread Reply:* Excellent, I think I had mistakenly conflated the two originally. This document makes it a little clearer. +
*Thread Reply:* Excellent, I think I had mistakenly conflated the two originally. This document makes it a little clearer. As an additional question: When viewing a Dataset in Marquez will it cross the job namespace bounds? As in, will I see jobs from different job namespaces?
@@ -30847,7 +30853,7 @@*Thread Reply:* In this example I have 1 job namespace and 2 dataset namespaces: +
*Thread Reply:* In this example I have 1 job namespace and 2 dataset namespaces:
sql-runner-dev
is the job namespace.
I cannot see a graph of my job now. Is this something to do with the namespace names?
*Thread Reply:* The above document seems to have implied a namespace could be like a connection string for a database
+*Thread Reply:* The above document seems to have implied a namespace could be like a connection string for a database
@@ -30910,7 +30916,7 @@*Thread Reply:* Wait, it does work? Marquez was being temperamental
+*Thread Reply:* Wait, it does work? Marquez was being temperamental
@@ -30936,7 +30942,7 @@*Thread Reply:* Yes, marquez is unable to fetch lineage for either dataset
+*Thread Reply:* Yes, marquez is unable to fetch lineage for either dataset
@@ -30962,7 +30968,7 @@*Thread Reply:* Here's what I mean:
+*Thread Reply:* Here's what I mean:
*Thread Reply:* I think you might have hit this issue: https://github.com/MarquezProject/marquez/issues/1744
+*Thread Reply:* I think you might have hit this issue: https://github.com/MarquezProject/marquez/issues/1744
*Thread Reply:* or, maybe not? It was released already.
+*Thread Reply:* or, maybe not? It was released already.
Can you create issue on github with those helpful gifs? @Lyndon Armitage
@@ -31085,7 +31091,7 @@*Thread Reply:* I think you are right Maciej
+*Thread Reply:* I think you are right Maciej
@@ -31111,7 +31117,7 @@*Thread Reply:* Was that patched in 0.19.1?
+*Thread Reply:* Was that patched in 0.19.1?
@@ -31137,7 +31143,7 @@*Thread Reply:* As far as I see yes: https://github.com/MarquezProject/marquez/releases/tag/0.19.1
+*Thread Reply:* As far as I see yes: https://github.com/MarquezProject/marquez/releases/tag/0.19.1
Haven't tested this myself unfortunately.
@@ -31165,7 +31171,7 @@*Thread Reply:* Perhaps not. It is urlencoding them: +
*Thread Reply:* Perhaps not. It is urlencoding them:
<http://localhost:3000/lineage/dataset/jdbc%3Ah2%3Amem%3Asql_tests_like/HBMOFA.ORDDETP>
But the error seems to be in marquez getting them.
*Thread Reply:* This is an example Lineage event JSON I am sending.
+*Thread Reply:* This is an example Lineage event JSON I am sending.
*Thread Reply:* I did run into another issue with really long names not being supported due to Marquez's DB using a fixed size string for a column, but that is understandable and probably a non-issue (my test code was generating temporary folders with long names).
+*Thread Reply:* I did run into another issue with really long names not being supported due to Marquez's DB using a fixed size string for a column, but that is understandable and probably a non-issue (my test code was generating temporary folders with long names).
@@ -31258,7 +31264,7 @@*Thread Reply:* A 404 is returned for: http://localhost:3000/api/v1/lineage/?nodeId=dataset:jdbc%3Ah2%3Amem%3Asql_tests_like:HBMOFA.ORDDETP
+*Thread Reply:* A 404 is returned for: http://localhost:3000/api/v1/lineage/?nodeId=dataset:jdbc%3Ah2%3Amem%3Asql_tests_like:HBMOFA.ORDDETP
@@ -31284,7 +31290,7 @@*Thread Reply:* @Lyndon Armitage can you create issue on the Marquez repo? https://github.com/MarquezProject/marquez/issues
+*Thread Reply:* @Lyndon Armitage can you create issue on the Marquez repo? https://github.com/MarquezProject/marquez/issues
@@ -31310,7 +31316,7 @@*Thread Reply:* https://github.com/MarquezProject/marquez/issues/1761 Is this sufficient?
+*Thread Reply:* https://github.com/MarquezProject/marquez/issues/1761 Is this sufficient?
*Thread Reply:* Yup, thanks!
+*Thread Reply:* Yup, thanks!
@@ -31503,7 +31509,7 @@*Thread Reply:* Hi Francis, for the event is it creating a new table with new data in glue / adding new data to an existing one or is it simply reformatting an existing table or making an empty one?
+*Thread Reply:* Hi Francis, for the event is it creating a new table with new data in glue / adding new data to an existing one or is it simply reformatting an existing table or making an empty one?
@@ -31529,7 +31535,7 @@*Thread Reply:* The table does not exist in the Glue catalog until …
+*Thread Reply:* The table does not exist in the Glue catalog until …
A Glue crawler connects to one or more data stores (in this case S3), determines the data structures, and writes tables into the Data Catalog.
@@ -31559,7 +31565,7 @@*Thread Reply:* Hmm, interesting, so the lineage of interest here would be of the metadata flow not of the data itself?
+*Thread Reply:* Hmm, interesting, so the lineage of interest here would be of the metadata flow not of the data itself?
In that case I’d say that the glue Crawler is a job that outputs a dataset.
@@ -31587,7 +31593,7 @@*Thread Reply:* The crawler is a job that discovers a dataset. It doesn't create it. If you're posting lineage yourself, I'd post it as an input event, not an output. The thing that actually wrote the data - generated the records and stored them in S3 - is the thing that would be outputting the dataset
+*Thread Reply:* The crawler is a job that discovers a dataset. It doesn't create it. If you're posting lineage yourself, I'd post it as an input event, not an output. The thing that actually wrote the data - generated the records and stored them in S3 - is the thing that would be outputting the dataset
@@ -31613,7 +31619,7 @@*Thread Reply:* @Michael Collado I agree the crawler discovers the S3 dataset. It also creates an event which creates/updates the HIVE/Glue table.
+*Thread Reply:* @Michael Collado I agree the crawler discovers the S3 dataset. It also creates an event which creates/updates the HIVE/Glue table.
If the Glue table isn’t a distinct dataset from the S3 data, how does this compare to a view in a database on top of a table. Are they 2 datasets or just one?
@@ -31643,7 +31649,7 @@*Thread Reply:* @John Thomas yes, its the metadata flow.
+*Thread Reply:* @John Thomas yes, its the metadata flow.
@@ -31669,7 +31675,7 @@*Thread Reply:* that's how the Spark integration currently treats Hive datasets- I'd like to add a facet to attach that indicates that it is being read as a Hive table, and include all the appropriate metadata, but it uses the dataset's location in S3 as the canonical dataset identifier
+*Thread Reply:* that's how the Spark integration currently treats Hive datasets- I'd like to add a facet to attach that indicates that it is being read as a Hive table, and include all the appropriate metadata, but it uses the dataset's location in S3 as the canonical dataset identifier
@@ -31695,7 +31701,7 @@*Thread Reply:* @Francis McGregor-Macdonald I think the way to represent this is predicated on what you’re looking to accomplish by sending a runEvent for the Glue crawler. What are your broader objectives in adding this?
+*Thread Reply:* @Francis McGregor-Macdonald I think the way to represent this is predicated on what you’re looking to accomplish by sending a runEvent for the Glue crawler. What are your broader objectives in adding this?
@@ -31721,7 +31727,7 @@*Thread Reply:* I am working through AWS native services seeing how they could, can, or do best integrate with openlineage (I’m an AWS SA). Hence the questions on best practice.
+*Thread Reply:* I am working through AWS native services seeing how they could, can, or do best integrate with openlineage (I’m an AWS SA). Hence the questions on best practice.
Aligning with the Spark integration sounds like it might make sense then. Is there an example I could build from?
@@ -31749,7 +31755,7 @@*Thread Reply:* an example of reporting lineage? you can look at the Spark integration here - https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/
+*Thread Reply:* an example of reporting lineage? you can look at the Spark integration here - https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/
@@ -31775,7 +31781,7 @@*Thread Reply:* Ahh, in that case I would have to agree with Michael’s approach to things!
+*Thread Reply:* Ahh, in that case I would have to agree with Michael’s approach to things!
@@ -31805,7 +31811,7 @@*Thread Reply:* @Michael Collado I am following the Spark integration you recommended (for a Glue job) and while everything appears to be set up correct, I am getting no lineage appear in marquez (a request.get from the pyspark script can reach the endpoint). Is there a way to enable a debug log so I can look to identify where the issue is? +
*Thread Reply:* @Michael Collado I am following the Spark integration you recommended (for a Glue job) and while everything appears to be set up correct, I am getting no lineage appear in marquez (a request.get from the pyspark script can reach the endpoint). Is there a way to enable a debug log so I can look to identify where the issue is? Is there a specific place to look in the regular logs?
@@ -31832,7 +31838,7 @@*Thread Reply:* listener output should be present in the driver logs. you can turn on debug logging in your log4j config (or whatever logging tool you use) for the package io.openlineage.spark.agent
*Thread Reply:* listener output should be present in the driver logs. you can turn on debug logging in your log4j config (or whatever logging tool you use) for the package io.openlineage.spark.agent
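For example, with a classic log4j.properties (assuming log4j 1.x, which Spark used at the time) this amounts to a single line:

# log4j.properties: enable debug output for the OpenLineage listener package
log4j.logger.io.openlineage.spark.agent=DEBUG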
*Thread Reply:* Output event:
+*Thread Reply:* Output event:
2021-11-22 08:12:15,513 INFO [spark-listener-group-shared] agent.OpenLineageContext (OpenLineageContext.java:emit(50)): Lineage completed successfully: ResponseMessage(responseCode=201, body=, error=null) { “eventType”: “COMPLETE”,
@@ -32033,7 +32039,7 @@
*Thread Reply:* This sink record is missing details …
+*Thread Reply:* This sink record is missing details …
2021-11-22 08:12:15,481 INFO [Thread-7] sinks.HadoopDataSink (HadoopDataSink.scala:$anonfun$writeDynamicFrame$1(275)): nameSpace: , table:
@@ -32061,7 +32067,7 @@*Thread Reply:* I can also see multiple history events (presumably for each transform, each as above) emitted for the same Glue Job, with different RunId, with the same inputs and the same (null) output.
+*Thread Reply:* I can also see multiple history events (presumably for each transform, each as above) emitted for the same Glue Job, with different RunId, with the same inputs and the same (null) output.
@@ -32087,7 +32093,7 @@*Thread Reply:* Are you using the existing spark integration for the spark lineage?
+*Thread Reply:* Are you using the existing spark integration for the spark lineage?
@@ -32113,7 +32119,7 @@*Thread Reply:* I followed: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark +
*Thread Reply:* I followed: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark
In the Glue context I was not clear on the correct settings for “spark.openlineage.parentJobName” and “spark.openlineage.parentRunId”, I put in static values (which may be incorrect)?
I injected these via: "--conf": "spark.openlineage.parentJobName=nyc-taxi-raw-stage",
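For what it's worth, a sketch of how those settings might be injected when defining the Glue job with boto3 (the role, script, and jar paths are placeholders; note that Glue's DefaultArguments accepts a single "--conf" key, so multiple Spark settings typically have to be chained into that one value):

# Sketch: creating a Glue job with OpenLineage settings. All values are placeholders.
import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="nyc-taxi-raw-stage",
    Role="arn:aws:iam::123456789012:role/MyGlueJobRole",
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/job.py"},
    DefaultArguments={
        "--extra-jars": "s3://my-bucket/jars/openlineage-spark.jar",
        "--conf": "spark.openlineage.parentJobName=nyc-taxi-raw-stage",
    },
)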
*Thread Reply:* Happy to share what is working when I am done, I can’t seem to find an AWS Glue specific example to walk me through.
+*Thread Reply:* Happy to share what is working when I am done, I can’t seem to find an AWS Glue specific example to walk me through.
@@ -32167,7 +32173,7 @@*Thread Reply:* yeah, we haven’t spent any significant time with AWS Glue, but we just released the Databricks integration, which might help guide the way you’re working a little bit more
+*Thread Reply:* yeah, we haven’t spent any significant time with AWS Glue, but we just released the Databricks integration, which might help guide the way you’re working a little bit more
@@ -32193,7 +32199,7 @@*Thread Reply:* from what I can see in the DBX integration (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/databricks) all of what is being done here I am doing in Glue (upload the jar, embed the settings into the Glue spark job). +
*Thread Reply:* from what I can see in the DBX integration (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/databricks) all of what is being done here I am doing in Glue (upload the jar, embed the settings into the Glue spark job). It is emitting the above for each transform in the Glue job, but does not seem to capture the output …
@@ -32220,7 +32226,7 @@*Thread Reply:* Is there a standard Spark test script in use with openlineage I could put into Glue to test without using any Glue specific functionality (without for example the GlueContext, or Glue dynamic frames)?
+*Thread Reply:* Is there a standard Spark test script in use with openlineage I could put into Glue to test without using any Glue specific functionality (without for example the GlueContext, or Glue dynamic frames)?
@@ -32246,7 +32252,7 @@*Thread Reply:* The initialisation does appear to be working if I compare it to the DBX README +
*Thread Reply:* The initialisation does appear to be working if I compare it to the DBX README
Mine from AWS Glue…
21/11/22 18:48:48 INFO SparkContext: Registered listener io.openlineage.spark.agent.OpenLineageSparkListener
21/11/22 18:48:49 INFO OpenLineageContext: Init OpenLineageContext: Args: ArgumentParser(host=http://ec2-….compute-1.amazonaws.com:5000, version=v1, namespace=spark_integration, jobName=default, parentRunId=null, apiKey=Optional.empty) URI: http://ec2-….compute-1.amazonaws.com:5000/api/v1/lineage
@@ -32276,7 +32282,7 @@
*Thread Reply:* We don’t have a test run, unfortunately, but you could follow this blog post’s processes in each and see what the differences are? https://openlineage.io/blog/openlineage-spark/
+*Thread Reply:* We don’t have a test run, unfortunately, but you could follow this blog post’s processes in each and see what the differences are? https://openlineage.io/blog/openlineage-spark/
*Thread Reply:* Thanks, I have been looking at that. I will create a Glue job aligned with that. What is the best way to pass feedback? Keep it here?
+*Thread Reply:* Thanks, I have been looking at that. I will create a Glue job aligned with that. What is the best way to pass feedback? Keep it here?
@@ -32349,7 +32355,7 @@*Thread Reply:* yeah, this thread will work great 🙂
+*Thread Reply:* yeah, this thread will work great 🙂
@@ -32375,7 +32381,7 @@*Thread Reply:* @Francis McGregor-Macdonald did you manage to enable it?
+*Thread Reply:* @Francis McGregor-Macdonald did you manage to enable it?
@@ -32401,7 +32407,7 @@*Thread Reply:* Just DM’d you the code I used a while back (app.py + CDK code). I haven’t used it in a while, and there is some duplication in it. I had openlineage enabled, but dynamic frames not working yet with lineage. Let me know how you go. +
*Thread Reply:* Just DM’d you the code I used a while back (app.py + CDK code). I haven’t used it in a while, and there is some duplication in it. I had openlineage enabled, but dynamic frames not working yet with lineage. Let me know how you go. I haven’t had the space to look at it in a while, but happy to support if you are looking at it.
@@ -32454,7 +32460,7 @@*Thread Reply:* You can use this: https://github.com/amundsen-io/amundsen/pull/1444
+*Thread Reply:* You can use this: https://github.com/amundsen-io/amundsen/pull/1444
*Thread Reply:* you can also check out this section from the Amundsen Community Meeting in october: https://www.youtube.com/watch?v=7WgECcmLSRk
+*Thread Reply:* you can also check out this section from the Amundsen Community Meeting in october: https://www.youtube.com/watch?v=7WgECcmLSRk
*Thread Reply:* No, I believe the databuilder OpenLineage extractor for Amundsen will continue to store lineage metadata in Atlas
+*Thread Reply:* No, I believe the databuilder OpenLineage extractor for Amundsen will continue to store lineage metadata in Atlas
@@ -32644,7 +32650,7 @@*Thread Reply:* We've spoken to the Amundsen team, and though using Marquez to store lineage metadata isn't an option, it's an integration that makes sense but hasn't yet been prioritized
+*Thread Reply:* We've spoken to the Amundsen team, and though using Marquez to store lineage metadata isn't an option, it's an integration that makes sense but hasn't yet been prioritized
@@ -32670,7 +32676,7 @@*Thread Reply:* Thanks. Right now Amundsen has no support for lineage extraction from Spark or Airflow. In this case, do we need to use Marquez for the OpenLineage implementation to capture the lineage from Airflow & Spark?
+*Thread Reply:* Thanks. Right now Amundsen has no support for lineage extraction from Spark or Airflow. In this case, do we need to use Marquez for the OpenLineage implementation to capture the lineage from Airflow & Spark?
@@ -32696,7 +32702,7 @@*Thread Reply:* Maybe, but that would mean running the full Amundsen stack as well as the Marquez stack alongside each other (not ideal). The OpenLineage integration for Amundsen is very recent, so we haven't had a chance to look deeply into the implementation. But, briefly looking over the config for Openlineagetablelineageextractor, you can only send metadata to Atlas
+*Thread Reply:* Maybe, but that would mean running the full Amundsen stack as well as the Marquez stack alongside each other (not ideal). The OpenLineage integration for Amundsen is very recent, so we haven't had a chance to look deeply into the implementation. But, briefly looking over the config for Openlineagetablelineageextractor, you can only send metadata to Atlas
@@ -32722,7 +32728,7 @@*Thread Reply:* @Willy Lulciuc that's our real concern: running the two stacks would make a messy environment. Let me explain our Amundsen setup: we have neo4j as the backend (frontend, search service, metadata service, Elasticsearch & neo4j). Our requirement is to capture lineage from Spark and Airflow and import it into Amundsen
+*Thread Reply:* @Willy Lulciuc that's our real concern: running the two stacks would make a messy environment. Let me explain our Amundsen setup: we have neo4j as the backend (frontend, search service, metadata service, Elasticsearch & neo4j). Our requirement is to capture lineage from Spark and Airflow and import it into Amundsen
@@ -32748,7 +32754,7 @@*Thread Reply:* We are running into a similar issue. @Dinakar Sundar were you able to get the Amundsen OpenLineage integration to work with a neo4j backend?
+*Thread Reply:* We are running into a similar issue. @Dinakar Sundar were you able to get the Amundsen OpenLineage integration to work with a neo4j backend?
@@ -32800,7 +32806,7 @@*Thread Reply:* You can take a look at https://github.com/OpenLineage/OpenLineage/projects
+*Thread Reply:* You can take a look at https://github.com/OpenLineage/OpenLineage/projects
@@ -32888,7 +32894,7 @@*Thread Reply:* Anyone? Thanks!
+*Thread Reply:* Anyone? Thanks!
@@ -33022,7 +33028,7 @@*Thread Reply:* Have you tried reflection? https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/reflect/FieldUtils.html#getDeclaredField-jav[…].lang.String-boolean-
+*Thread Reply:* Have you tried reflection? https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/reflect/FieldUtils.html#getDeclaredField-jav[…].lang.String-boolean-
@@ -33048,7 +33054,7 @@*Thread Reply:* I have not! Will give it a try, Maciej! Thank you for the reply!
+*Thread Reply:* I have not! Will give it a try, Maciej! Thank you for the reply!
@@ -33078,7 +33084,7 @@*Thread Reply:* 🙏 @Maciej Obuchowski we're not worthy! That was the magic we needed. Seems like a hack since we're snooping in on private classes but if it works...
+*Thread Reply:* 🙏 @Maciej Obuchowski we're not worthy! That was the magic we needed. Seems like a hack since we're snooping in on private classes but if it works...
Thank you so much for pointing to those utilities!
@@ -33110,7 +33116,7 @@*Thread Reply:* Glad I could help!
+*Thread Reply:* Glad I could help!
@@ -33162,7 +33168,7 @@*Thread Reply:* Different concepts. OL is focused on describing the lineage and metadata of the running jobs. So it keeps track of all the metadata (schema, ...) of inputs and outputs at the time transformation occurs + transformation metadata (code version, cost, etc.)
+*Thread Reply:* Different concepts. OL is focused on describing the lineage and metadata of the running jobs. So it keeps track of all the metadata (schema, ...) of inputs and outputs at the time transformation occurs + transformation metadata (code version, cost, etc.)
OM I am not an expert on, but it's a metadata model with clients and an API around it.
@@ -33216,7 +33222,7 @@*Thread Reply:* Hey. For technical reasons, we can't automatically register a macro that does this job, as we could in the Airflow 1 integration. You could put it in yourself:
+*Thread Reply:* Hey. For technical reasons, we can't automatically register a macro that does this job, as we could in the Airflow 1 integration. You could put it in yourself:
@@ -33242,7 +33248,7 @@*Thread Reply:* ```def lineage_parent_id(run_id, task): +
*Thread Reply:* ```def lineage_parent_id(run_id, task): """ Macro function which returns the generated job and run id for a given task. This can be used to forward the ids from a task to a child run so the job @@ -33300,7 +33306,7 @@
*Thread Reply:* from here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/dag.py#L77
+*Thread Reply:* from here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/dag.py#L77
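For anyone following along, once the macro is registered on the DAG it can be used from any templated field - a minimal sketch (the DAG and operator names here are illustrative; the module path follows the link above):
```python
from datetime import datetime

from airflow.models import DAG
from airflow.operators.bash import BashOperator
from openlineage.airflow.dag import lineage_parent_id  # module per the link above

dag = DAG(
    dag_id="parent_dag",
    start_date=datetime(2022, 1, 1),
    user_defined_macros={"lineage_parent_id": lineage_parent_id},
)

# renders to an identifier the child run can record in its ParentRunFacet
forward_ids = BashOperator(
    task_id="forward_ids",
    bash_command="echo {{ lineage_parent_id(run_id, task) }}",
    dag=dag,
)
```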
*Thread Reply:* the quickest response ever! And that works like a charm 🙌
+*Thread Reply:* the quickest response ever! And that works like a charm 🙌
@@ -33381,7 +33387,7 @@*Thread Reply:* Glad I could help!
+*Thread Reply:* Glad I could help!
@@ -33433,7 +33439,7 @@*Thread Reply:* For SQL, EXPLAIN EXTENDED and show() in the scala-shell are helpful:
*Thread Reply:* For SQL, EXPLAIN EXTENDED and show() in the scala-shell are helpful:
spark.sql("EXPLAIN EXTENDED CREATE TABLE tbl USING delta LOCATION '/tmp/delta' AS SELECT ** FROM tmp").show(false)
```== Parsed Logical Plan ==
@@ -33481,7 +33487,7 @@
*Thread Reply:* For the dataframe api, I'm usually just either logging the plan to console from the OpenLineage listener, or looking at the spark_logicalPlan or spark_unknown facets sent by the listener - even when the particular write operation isn't supported by the integration, those facets should have some relevant info.
+*Thread Reply:* For the dataframe api, I'm usually just either logging the plan to console from the OpenLineage listener, or looking at the spark_logicalPlan or spark_unknown facets sent by the listener - even when the particular write operation isn't supported by the integration, those facets should have some relevant info.
@@ -33511,7 +33517,7 @@*Thread Reply:* For example, for the query I've sent in the comment above, the spark_logicalPlan facet looks like this:
+*Thread Reply:* For example, for the query I've sent in the comment above, the spark_logicalPlan facet looks like this:
"spark.logicalPlan": {
"_producer": "<https://github.com/OpenLineage/OpenLineage/tree/0.4.0-SNAPSHOT/integration/spark>",
@@ -33601,7 +33607,7 @@
Supress success
*Thread Reply:* Okay! That is very helpful! I wasn't sure if there was a fancier trick but I can definitely do logging 🙂 Our challenge was that our proprietary packages were resulting in Null Pointer Exceptions when it tried to push to OpenLineage 😞
+*Thread Reply:* Okay! That is very helpful! I wasn't sure if there was a fancier trick but I can definitely do logging 🙂 Our challenge was that our proprietary packages were resulting in Null Pointer Exceptions when it tried to push to OpenLineage 😞
@@ -33627,7 +33633,7 @@*Thread Reply:* Thank you as usual!!
+*Thread Reply:* Thank you as usual!!
@@ -33653,7 +33659,7 @@*Thread Reply:* You can always add test cases and add breakpoints to debug in your IDE. That doesn't work for the container tests, but it does work for the other ones
+*Thread Reply:* You can always add test cases and add breakpoints to debug in your IDE. That doesn't work for the container tests, but it does work for the other ones
@@ -33679,7 +33685,7 @@*Thread Reply:* Ah! That's a great point! I definitely would appreciate being able to poke at the objects interactively in a debug mode. Thank you for the guidance as well!
+*Thread Reply:* Ah! That's a great point! I definitely would appreciate being able to poke at the objects interactively in a debug mode. Thank you for the guidance as well!
@@ -33897,7 +33903,7 @@*Thread Reply:* Hi Ricardo - OpenLineage doesn’t currently have support for field-level lineage, but it’s definitely something we’ve been looking into. This is a great collection of resources 🙂
+*Thread Reply:* Hi Ricardo - OpenLineage doesn’t currently have support for field-level lineage, but it’s definitely something we’ve been looking into. This is a great collection of resources 🙂
We’ve to-date been working on our integrations library, making it as easy to set up as possible.
@@ -33925,7 +33931,7 @@*Thread Reply:* Thanks John! I was checking the issues on github and other posts here. Just wanted to clarify that. +
*Thread Reply:* Thanks John! I was checking the issues on github and other posts here. Just wanted to clarify that. I’ll keep an eye on it
@@ -33985,7 +33991,7 @@*Thread Reply:* The link to join the meeting is on the wiki: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
+*Thread Reply:* The link to join the meeting is on the wiki: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
@@ -34011,7 +34017,7 @@*Thread Reply:* Please reach out to me if you’d like to be added to a gcal invite
+*Thread Reply:* Please reach out to me if you’d like to be added to a gcal invite
@@ -34063,7 +34069,7 @@*Thread Reply:* Hi Dinakar. Can you give some specifics regarding what kind of problem you're running into?
+*Thread Reply:* Hi Dinakar. Can you give some specifics regarding what kind of problem you're running into?
@@ -34089,7 +34095,7 @@*Thread Reply:* Hi @Michael Collado, we were able to set the Spark configuration for the extra listener & placed the jars as well, but when I ran the Spark job, lineage did not get tracked into Marquez
+*Thread Reply:* Hi @Michael Collado, we were able to set the Spark configuration for the extra listener & placed the jars as well, but when I ran the Spark job, lineage did not get tracked into Marquez
@@ -34115,7 +34121,7 @@*Thread Reply:* {"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark","schemaURL":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark/facets/spark/v1/output-statistics-facet.json","rowCount":0,"size":-1,"status":"DEPRECATED"}},"outputFacets":{"outputStatistics":{"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark","schemaURL":"https://openlineage.io/spec/facets/1-0-0/OutputStatisticsOutputDatasetFacet.json#/$defs/OutputStatisticsOutputDatasetFacet","rowCount":0,"size":-1}}}],"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark","schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunEvent"} +
*Thread Reply:* {"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark","schemaURL":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark/facets/spark/v1/output-statistics-facet.json","rowCount":0,"size":-1,"status":"DEPRECATED"}},"outputFacets":{"outputStatistics":{"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark","schemaURL":"https://openlineage.io/spec/facets/1-0-0/OutputStatisticsOutputDatasetFacet.json#/$defs/OutputStatisticsOutputDatasetFacet","rowCount":0,"size":-1}}}],"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark","schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunEvent"}
OpenLineageHttpException(code=0, message=java.lang.IllegalArgumentException: Cannot construct instance of io.openlineage.spark.agent.client.HttpError
(although at least one Creator exists): no String-argument constructor/factory method to deserialize from String value ('{"code":404,"message":"HTTP 404 Not Found"}')
at [Source: UNKNOWN; line: -1, column: -1], details=java.util.concurrent.CompletionException: java.lang.IllegalArgumentException: Cannot construct instance of io.openlineage.spark.agent.client.HttpError
(although at least one Creator exists): no String-argument constructor/factory method to deserialize from String value ('{"code":404,"message":"HTTP 404 Not Found"}')
at [Source: UNKNOWN; line: -1, column: -1])
@@ -34163,7 +34169,7 @@
*Thread Reply:* Issue solved - I mentioned the version wrongly as 1 instead of v1
+*Thread Reply:* Issue solved - I mentioned the version wrongly as 1 instead of v1
@@ -34249,7 +34255,7 @@*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/400/files
+*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/400/files
there’s an ongoing PR for the proxy backend, which opens an HTTP API and redirects events to Kafka.
@@ -34277,7 +34283,7 @@*Thread Reply:* Hi Kavuri, as minkyu said, there's currently work going on to simplify this process.
+*Thread Reply:* Hi Kavuri, as minkyu said, there's currently work going on to simplify this process.
For now, you'll need to make something to capture the HTTP api events and send them to the Kafka topic. Changing the spark.openlineage.url
parameter will send the runEvents wherever you like, but obviously you can't directly produce HTTP events to a topic
*Thread Reply:* Many thanks for the reply. As I understand it, pushing lineage to a Kafka topic is not there yet; it is under implementation. If you can help me understand which version it is going to be present in, that will help me a lot. Thanks in advance.
+*Thread Reply:* Many thanks for the reply. As I understand it, pushing lineage to a Kafka topic is not there yet; it is under implementation. If you can help me understand which version it is going to be present in, that will help me a lot. Thanks in advance.
@@ -34331,7 +34337,7 @@*Thread Reply:* Not sure about the release plan, but the http endpoint is just a regular RESTful API, and you will be able to write a super simple proxy for your own use case if you want.
+*Thread Reply:* Not sure about the release plan, but the http endpoint is just a regular RESTful API, and you will be able to write a super simple proxy for your own use case if you want.
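For reference, such a "super simple proxy" can be sketched in a few lines (assumes Flask and kafka-python are installed; the topic name and port are illustrative) - point spark.openlineage.url at it and it relays each runEvent to Kafka untouched:
```python
from flask import Flask, request
from kafka import KafkaProducer

app = Flask(__name__)
producer = KafkaProducer(bootstrap_servers="localhost:9092")

@app.route("/api/v1/lineage", methods=["POST"])
def lineage():
    # forward the raw OpenLineage runEvent JSON bytes to the topic unchanged
    producer.send("openlineage-events", request.get_data())
    return "", 200

if __name__ == "__main__":
    app.run(port=5000)
```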
@@ -34456,7 +34462,7 @@*Thread Reply:* One thing, you're looking at Ryan's fork of Spark, which is a few thousand commits behind head 🙂
+*Thread Reply:* One thing, you're looking at Ryan's fork of Spark, which is a few thousand commits behind head 🙂
This one should be good: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala#L72
@@ -34495,7 +34501,7 @@*Thread Reply:* In this case, we basically need more info about nature of this DataSourceV2Relation, because this is provider-dependent. We have Iceberg in main branch and Delta here: https://github.com/OpenLineage/OpenLineage/pull/393/files#diff-7b66a9bd5905f4ba42914b73a87d834c1321ebcf75137c1e2a2413c0d85d9db6
+*Thread Reply:* In this case, we basically need more info about nature of this DataSourceV2Relation, because this is provider-dependent. We have Iceberg in main branch and Delta here: https://github.com/OpenLineage/OpenLineage/pull/393/files#diff-7b66a9bd5905f4ba42914b73a87d834c1321ebcf75137c1e2a2413c0d85d9db6
@@ -34521,7 +34527,7 @@*Thread Reply:* Ah! Maciej! As always, thank you! Looking through the DataSourceV2RelationVisitor you provided, it looks like the connector (Azure Cosmos Db) doesn't provide that Provider property 😞 😞 😞
+*Thread Reply:* Ah! Maciej! As always, thank you! Looking through the DataSourceV2RelationVisitor you provided, it looks like the connector (Azure Cosmos Db) doesn't provide that Provider property 😞 😞 😞
Is there any other method for determining the type of DataSourceV2Relation?
@@ -34549,7 +34555,7 @@*Thread Reply:* And, to make sure I close out on my original question, it was as simple as the code that Maciej was using:
+*Thread Reply:* And, to make sure I close out on my original question, it was as simple as the code that Maciej was using:
I merely needed to use DataSourceV2Relation rather than NamedRelation!
@@ -34581,7 +34587,7 @@*Thread Reply:* Are we talking about this connector? https://github.com/Azure/azure-sdk-for-java/blob/934200f63dc5bc7d5502a95f8daeb8142[…]/src/main/scala/com/azure/cosmos/spark/ItemsReadOnlyTable.scala
+*Thread Reply:* Are we talking about this connector? https://github.com/Azure/azure-sdk-for-java/blob/934200f63dc5bc7d5502a95f8daeb8142[…]/src/main/scala/com/azure/cosmos/spark/ItemsReadOnlyTable.scala
*Thread Reply:* I guess you can use object.getClass.getCanonicalName()
to find if the passed class matches the one that the Cosmos provider uses.
*Thread Reply:* I guess you can use object.getClass.getCanonicalName()
to find if the passed class matches the one that the Cosmos provider uses.
*Thread Reply:* Yes! That's the one, Maciej! I will give getCanonicalName a try but also make a PR into that repo to get the provider property set up correctly 🙂
+*Thread Reply:* Yes! That's the one, Maciej! I will give getCanonicalName a try but also make a PR into that repo to get the provider property set up correctly 🙂
@@ -34684,7 +34690,7 @@*Thread Reply:* Thank you so much!
+*Thread Reply:* Thank you so much!
@@ -34710,7 +34716,7 @@*Thread Reply:* Glad to help 😄
+*Thread Reply:* Glad to help 😄
@@ -34736,7 +34742,7 @@*Thread Reply:* @Will Johnson could you tell on which commands from https://github.com/OpenLineage/OpenLineage/issues/368#issue-1038510649 you'll be working?
+*Thread Reply:* @Will Johnson could you tell on which commands from https://github.com/OpenLineage/OpenLineage/issues/368#issue-1038510649 you'll be working?
@@ -34762,7 +34768,7 @@*Thread Reply:* If any, of course 🙂
+*Thread Reply:* If any, of course 🙂
@@ -34788,7 +34794,7 @@*Thread Reply:* From all of our tests on that Cosmos connector, it looks like it strictly uses the AppendData operation. However @Harish Sune is looking at more of these commands from a Delta data source.
+*Thread Reply:* From all of our tests on that Cosmos connector, it looks like it strictly uses the AppendData operation. However @Harish Sune is looking at more of these commands from a Delta data source.
@@ -34818,7 +34824,7 @@*Thread Reply:* Just to close the loop on this one - I submitted a PR for the work we've been doing. Looking forward to any feedback! https://github.com/OpenLineage/OpenLineage/pull/450
+*Thread Reply:* Just to close the loop on this one - I submitted a PR for the work we've been doing. Looking forward to any feedback! https://github.com/OpenLineage/OpenLineage/pull/450
*Thread Reply:* Thanks @Will Johnson! I added one question about dataset naming.
+*Thread Reply:* Thanks @Will Johnson! I added one question about dataset naming.
@@ -35007,7 +35013,7 @@*Thread Reply:* Yes! This is awesome!! How might this work for an existing command like the DataSourceV2Visitor?
+*Thread Reply:* Yes! This is awesome!! How might this work for an existing command like the DataSourceV2Visitor?
Right now, OpenLineage checks based on the provider property if it's an Iceberg or Delta provider.
@@ -35039,7 +35045,7 @@*Thread Reply:* Resolving this would be a nice addition to the doc (and to the implementation) - currently, we're just returning the result of the first function for which isDefinedAt
is satisfied.
*Thread Reply:* Resolving this would be a nice addition to the doc (and to the implementation) - currently, we're just returning the result of the first function for which isDefinedAt
is satisfied.
This means that we end up depending on the order of the visitors...
@@ -35067,7 +35073,7 @@*Thread Reply:* great question. For posterity, I'd like to move this to the PR discussion. I'll address the question there.
+*Thread Reply:* great question. For posterity, I'd like to move this to the PR discussion. I'll address the question there.
@@ -35161,7 +35167,7 @@*Thread Reply:* hmm, actually the only documentation we have right now is on the demo.datakin.com site https://demo.datakin.com/onboarding . The great expectations tab should be enough to get you started
+*Thread Reply:* hmm, actually the only documentation we have right now is on the demo.datakin.com site https://demo.datakin.com/onboarding . The great expectations tab should be enough to get you started
*Thread Reply:* I'll open a ticket to copy that documentation to the OpenLineage site repo
+*Thread Reply:* I'll open a ticket to copy that documentation to the OpenLineage site repo
@@ -35264,7 +35270,7 @@*Thread Reply:* Hi! We don't have any integration with deequ yet. We have a structure for recording data quality assertions and statistics, though - see https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DataQualityAssertionsDatasetFacet.json and https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DataQualityMetricsInputDatasetFacet.json for the specs.
+*Thread Reply:* Hi! We don't have any integration with deequ yet. We have a structure for recording data quality assertions and statistics, though - see https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DataQualityAssertionsDatasetFacet.json and https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DataQualityMetricsInputDatasetFacet.json for the specs.
Check the great expectations integration to see how those facets are being used
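For a rough feel of the shape (values illustrative - the two spec files above are authoritative), a dataQualityMetrics input facet looks something like:
```
"dataQualityMetrics": {
  "_producer": "<https://github.com/OpenLineage/OpenLineage>",
  "_schemaURL": "<https://openlineage.io/spec/facets/1-0-0/DataQualityMetricsInputDatasetFacet.json>",
  "rowCount": 1000,
  "columnMetrics": {
    "user_id": {"nullCount": 0, "distinctCount": 998}
  }
}
```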
@@ -35292,7 +35298,7 @@*Thread Reply:* This is great. Thanks @Michael Collado!
+*Thread Reply:* This is great. Thanks @Michael Collado!
@@ -35381,7 +35387,7 @@*Thread Reply:* There is nothing to show here when you click on the test node, right? What about the run node?
+*Thread Reply:* There is nothing to show here when you click on the test node, right? What about the run node?
@@ -35407,7 +35413,7 @@*Thread Reply:* There are no details about the failure.
+*Thread Reply:* There are no details about the failure.
```dbt-ol build -t DEV --profile cdp --profiles-dir /c/Work/dbt/cdp100/profiles --project-dir /c/Work/dbt/cdp100 --select +riskrawmastersharedshareclass
Running OpenLineage dbt wrapper version 0.4.0
@@ -35495,7 +35501,7 @@ Supress success
*Thread Reply:* I'm talking about clicking on a non-test node in the Marquez UI - the screenshots shared show you clicked on the one ending in test
*Thread Reply:* I'm talking about clicking on a non-test node in the Marquez UI - the screenshots shared show you clicked on the one ending in test
*Thread Reply:* There are two types of failures: tests failed on the stage model (relationships), and a physical error in the master model (no table with such a name). The stage test node in Marquez does not show any indication of failures, and the dataset node indicates failure but without the number of failed records or the table name for persistent test storage. The failed master model shows in red but no details of the failure. Master model tests were skipped because of the model failure but the UI reports "Complete".
+*Thread Reply:* There are two types of failures: tests failed on the stage model (relationships), and a physical error in the master model (no table with such a name). The stage test node in Marquez does not show any indication of failures, and the dataset node indicates failure but without the number of failed records or the table name for persistent test storage. The failed master model shows in red but no details of the failure. Master model tests were skipped because of the model failure but the UI reports "Complete".
*Thread Reply:* If I understood correctly, for the model you would like OpenLineage to capture the error message, like this one +
*Thread Reply:* If I understood correctly, for the model you would like OpenLineage to capture the error message, like this one
22:52:07 Database Error in model customers (models/customers.sql)
22:52:07 Syntax error: Expected "(" or keyword SELECT or keyword WITH but got identifier "PLEASE_REMOVE" at [56:12]
22:52:07 compiled SQL at target/run/jaffle_shop/models/customers.sql
@@ -35649,7 +35655,7 @@
*Thread Reply:* We actually do the first one for Airflow and Spark, I've missed it for dbt 😞
+*Thread Reply:* We actually do the first one for Airflow and Spark, I've missed it for dbt 😞
Created issue to add it to spec in a generic way: https://github.com/OpenLineage/OpenLineage/issues/446
@@ -35764,7 +35770,7 @@*Thread Reply:* Sounds great. Failed/Skipped Tests/Models could be color-coded as well. Thanks.
+*Thread Reply:* Sounds great. Failed/Skipped Tests/Models could be color-coded as well. Thanks.
@@ -35825,7 +35831,7 @@*Thread Reply:* Hey. If you're using Airflow 2, you should use LineageBackend
method described here: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#airflow-21-experimental
*Thread Reply:* Hey. If you're using Airflow 2, you should use LineageBackend
method described here: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#airflow-21-experimental
*Thread Reply:* You don't need to do anything with DAG import then.
+*Thread Reply:* You don't need to do anything with DAG import then.
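In practice that method boils down to environment configuration along these lines (the URL and namespace values are yours to fill in; the linked README has the authoritative variable names):
```
AIRFLOW__LINEAGE__BACKEND=openlineage.lineage_backend.OpenLineageBackend
OPENLINEAGE_URL=http://<marquez-host>:5000
OPENLINEAGE_NAMESPACE=my_namespace
```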
@@ -35881,7 +35887,7 @@*Thread Reply:* Thanks!!!!! i'll try
+*Thread Reply:* Thanks!!!!! i'll try
@@ -35937,7 +35943,7 @@*Thread Reply:* @Will Johnson @Maciej Obuchowski FYI 👆
+*Thread Reply:* @Will Johnson @Maciej Obuchowski FYI 👆
@@ -36158,7 +36164,7 @@*Thread Reply:* ^ It’s happening now!
+*Thread Reply:* ^ It’s happening now!
@@ -36210,7 +36216,7 @@*Thread Reply:* Just to be clear, were you able to get the datasource information from the API but it's just not showing up in the UI? Or were you not able to get it from the API either?
+*Thread Reply:* Just to be clear, were you able to get the datasource information from the API but it's just not showing up in the UI? Or were you not able to get it from the API either?
@@ -36262,7 +36268,7 @@*Thread Reply:* It does generally work, but, there's a known limitation in that only successful task runs are reported to the lineage backend. This is planned to be fixed in Airflow 2.3.
+*Thread Reply:* It does generally work, but, there's a known limitation in that only successful task runs are reported to the lineage backend. This is planned to be fixed in Airflow 2.3.
@@ -36296,7 +36302,7 @@*Thread Reply:* thank you. 🙂
+*Thread Reply:* thank you. 🙂
@@ -36362,7 +36368,7 @@*Thread Reply:* hey, I'm aware of one small bug ( which will be fixed in the upcoming OpenLineage 0.5.0 ) which means you would also have to include google-cloud-bigquery
in your requirements.txt. This is the bug: https://github.com/OpenLineage/OpenLineage/issues/438
*Thread Reply:* hey, I'm aware of one small bug ( which will be fixed in the upcoming OpenLineage 0.5.0 ) which means you would also have to include google-cloud-bigquery
in your requirements.txt. This is the bug: https://github.com/OpenLineage/OpenLineage/issues/438
*Thread Reply:* The other thing I think you should check is, did you def define the AIRFLOW__LINEAGE__BACKEND
variable correctly? What you pasted above looks a little odd with the 2 =
signs
*Thread Reply:* The other thing I think you should check is, did you def define the AIRFLOW__LINEAGE__BACKEND
variable correctly? What you pasted above looks a little odd with the 2 =
signs
*Thread Reply:* I'm looking at a task log inside my own Airflow and I see msgs like: +
*Thread Reply:* I'm looking at a task log inside my own Airflow and I see msgs like:
INFO - Constructing openlineage client to send events to
*Thread Reply:* ^ i.e. I think checking the task logs you can see if it's at least attempting to send data
+*Thread Reply:* ^ i.e. I think checking the task logs you can see if it's at least attempting to send data
@@ -36471,7 +36477,7 @@*Thread Reply:* hope this helps!
+*Thread Reply:* hope this helps!
@@ -36497,7 +36503,7 @@*Thread Reply:* Thank you, will try again.
+*Thread Reply:* Thank you, will try again.
@@ -36557,7 +36563,7 @@*Thread Reply:* BTW, this was actually the 0.5.1
release. Because, pypi... 🤷‍♂️
*Thread Reply:* BTW, this was actually the 0.5.1
release. Because, pypi... 🤷‍♂️
*Thread Reply:* nice on the dbt-spark
support 👍
*Thread Reply:* nice on the dbt-spark
support 👍
*Thread Reply:* Are you using docker on mac with "Use Docker Compose V2" enabled?
+*Thread Reply:* Are you using docker on mac with "Use Docker Compose V2" enabled?
We've just found yesterday that it somehow breaks our example...
@@ -36671,7 +36677,7 @@*Thread Reply:* yes i just installed docker on mac
+*Thread Reply:* yes i just installed docker on mac
@@ -36697,7 +36703,7 @@*Thread Reply:* and docker compose version 1.29.2
+*Thread Reply:* and docker compose version 1.29.2
@@ -36723,7 +36729,7 @@*Thread Reply:* What you can do is to uncheck this, do docker system prune -a
and try again.
*Thread Reply:* What you can do is to uncheck this, do docker system prune -a
and try again.
*Thread Reply:* done but i get this : Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
+*Thread Reply:* done but i get this : Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
@@ -36775,7 +36781,7 @@*Thread Reply:* Try to restart docker for mac
+*Thread Reply:* Try to restart docker for mac
@@ -36801,7 +36807,7 @@*Thread Reply:* It needs to show Docker Desktop is running
:
*Thread Reply:* It needs to show Docker Desktop is running
:
*Thread Reply:* yeah done. I will try to implement the example again and see, thank you very much
+*Thread Reply:* yeah done. I will try to implement the example again and see, thank you very much
@@ -36862,7 +36868,7 @@*Thread Reply:* i don't know why i'm getting this when i run $ docker-compose up :
+*Thread Reply:* i don't know why i'm getting this when i run $ docker-compose up :
WARNING: The TAG variable is not set. Defaulting to a blank string.
WARNING: The API_PORT variable is not set. Defaulting to a blank string.
@@ -36898,7 +36904,7 @@ Supress success
*Thread Reply:* are you running it exactly like here, with respect to directories, etc?
+*Thread Reply:* are you running it exactly like here, with respect to directories, etc?
https://github.com/MarquezProject/marquez/tree/main/examples/airflow
*Thread Reply:* yeah yeah my bad . everything works fine now . I see the graph in the ui
+*Thread Reply:* yeah yeah my bad . everything works fine now . I see the graph in the ui
@@ -36952,7 +36958,7 @@*Thread Reply:* one more question plz . As i said our ETLs are based on Postgres, Redshift and Airflow . any advice you have for us to integrate OL into our pipeline ?
+*Thread Reply:* one more question plz . As i said our ETLs are based on Postgres, Redshift and Airflow . any advice you have for us to integrate OL into our pipeline ?
@@ -37034,7 +37040,7 @@*Thread Reply:* Hi Kevin - to my understanding that's correct. Do you guys have a custom extractor using this?
+*Thread Reply:* Hi Kevin - to my understanding that's correct. Do you guys have a custom extractor using this?
@@ -37060,7 +37066,7 @@*Thread Reply:* Thanks John! We have custom code emitting OL events within our ingestion pipeline and it includes a custom facet. I’ll refactor the code to the new format and should be good to go.
+*Thread Reply:* Thanks John! We have custom code emitting OL events within our ingestion pipeline and it includes a custom facet. I’ll refactor the code to the new format and should be good to go.
@@ -37086,7 +37092,7 @@*Thread Reply:* Just to follow up, this code update worked as expected and we are all good on the upgrade.
+*Thread Reply:* Just to follow up, this code update worked as expected and we are all good on the upgrade.
@@ -37165,7 +37171,7 @@*Thread Reply:* Hm, that is odd. Usually there are a few lines in the DAG log from the OpenLineage bits. I’d expect to see something about not having an extractor for the operator you are using.
+*Thread Reply:* Hm, that is odd. Usually there are a few lines in the DAG log from the OpenLineage bits. I’d expect to see something about not having an extractor for the operator you are using.
@@ -37191,7 +37197,7 @@*Thread Reply:* If you open a shell in your Airflow Scheduler container and check for the presence of AIRFLOW__LINEAGE__BACKEND
, is it properly set? It's possible the env isn't making it all the way there.
*Thread Reply:* If you open a shell in your Airflow Scheduler container and check for the presence of AIRFLOW__LINEAGE__BACKEND
, is it properly set? It's possible the env isn't making it all the way there.
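A quick way to verify (the container name is illustrative - substitute yours):
```
docker exec -it airflow-scheduler env | grep -E "AIRFLOW__LINEAGE__BACKEND|OPENLINEAGE"
```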
*Thread Reply:* Hi Lena! That’s correct, Openlineage is designed to send events to an HTTP backend. There’s a ticket on the future section of the roadmap to support pushing to Amundsen, but it’s not yet been worked on (Ref: Roadmap Issue #86)
+*Thread Reply:* Hi Lena! That’s correct, Openlineage is designed to send events to an HTTP backend. There’s a ticket on the future section of the roadmap to support pushing to Amundsen, but it’s not yet been worked on (Ref: Roadmap Issue #86)
*Thread Reply:* Thank you for the info!
+*Thread Reply:* Thank you for the info!
@@ -37349,7 +37355,7 @@*Thread Reply:* what do you mean by “Integrate Openlineage”?
+*Thread Reply:* what do you mean by “Integrate Openlineage”?
Can you give a little more information on what you’re trying to accomplish and what the existing project is?
@@ -37377,7 +37383,7 @@*Thread Reply:* I work in a datalake team and we are trying to implement data lineage property in our project using openlineage. our project basically keeps track of datasets coming from different sources(hive, redshift, elasticsearch etc.) and jobs.
+*Thread Reply:* I work in a datalake team and we are trying to implement data lineage property in our project using openlineage. our project basically keeps track of datasets coming from different sources(hive, redshift, elasticsearch etc.) and jobs.
@@ -37403,7 +37409,7 @@*Thread Reply:* Gotcha!
+*Thread Reply:* Gotcha!
Broadly speaking, all an integration needs to do is to send runEvents to Marquez.
@@ -37581,7 +37587,7 @@*Thread Reply:* I suppose OpenLineage itself only defines the standard/protocol to design your data model. To be able to visualize/trace the lineage, you either have to implement it yourself with the standard data models or include Marquez in your project. You would need to use the HTTP API to send lineage events from your Java project to Marquez in this case.
+*Thread Reply:* I suppose OpenLineage itself only defines the standard/protocol to design your data model. To be able to visualize/trace the lineage, you either have to implement it yourself with the standard data models or include Marquez in your project. You would need to use the HTTP API to send lineage events from your Java project to Marquez in this case.
@@ -37607,7 +37613,7 @@*Thread Reply:* Exactly! This project also includes connectors for more common data tools (Airflow, dbt, Spark, etc), but at its core OpenLineage is a standard and protocol
+*Thread Reply:* Exactly! This project also includes connectors for more common data tools (Airflow, dbt, Spark, etc), but at its core OpenLineage is a standard and protocol
@@ -37697,7 +37703,7 @@*Thread Reply:* Hello!
+*Thread Reply:* Hello!
@@ -37750,7 +37756,7 @@*Thread Reply:* Hey Albert! There aren't yet any issues or proposals around Apache Atlas yet, but that's definitely something you can help with!
+*Thread Reply:* Hey Albert! There aren't yet any issues or proposals around Apache Atlas yet, but that's definitely something you can help with!
I'm not super familiar with Atlas, were you thinking in terms of enabling Atlas to receive runEvents from OpenLineage connectors?
@@ -37778,7 +37784,7 @@*Thread Reply:* Hi John! +
*Thread Reply:* Hi John! Yes, exactly, it’d be nice to see Atlas as a receiver side of the OpenLineage events. Are there any guidelines on how to implement it? I guess we need an OpenLineage-compatible server implementation so we could receive events and send them to Atlas, right?
@@ -37805,7 +37811,7 @@*Thread Reply:* exactly - This would be a change on the Atlas side. I’d start by opening an issue in the atlas repo about making an API endpoint that can receive OpenLineage events. +
*Thread Reply:* exactly - This would be a change on the Atlas side. I’d start by opening an issue in the atlas repo about making an API endpoint that can receive OpenLineage events. Marquez is our reference implementation of OpenLineage, so I’d look around in that repo to see how it’s been implemented :)
@@ -37832,7 +37838,7 @@*Thread Reply:* Got it, thanks! Did that: https://issues.apache.org/jira/browse/ATLAS-4550 +
*Thread Reply:* Got it, thanks! Did that: https://issues.apache.org/jira/browse/ATLAS-4550 If it’d not get any traction we at New Work might contribute as well
@@ -37859,7 +37865,7 @@*Thread Reply:* awesome! if you guys have any questions, reach out and I can get you in touch with some of the engineers on our end
+*Thread Reply:* awesome! if you guys have any questions, reach out and I can get you in touch with some of the engineers on our end
@@ -37889,7 +37895,7 @@*Thread Reply:* @Albert Bikeev one minor thing that could be helpful: java OpenLineage library contains server model classes: https://github.com/OpenLineage/OpenLineage/pull/300#issuecomment-923489097
+*Thread Reply:* @Albert Bikeev one minor thing that could be helpful: java OpenLineage library contains server model classes: https://github.com/OpenLineage/OpenLineage/pull/300#issuecomment-923489097
*Thread Reply:* Got it, thank you!
+*Thread Reply:* Got it, thank you!
@@ -38241,7 +38247,7 @@*Thread Reply:* This is a quite old discussion, but isn't it possible to use the openlineage proxy to send json to a kafka topic and let Atlas read that json without any modification? +
*Thread Reply:* This is a quite old discussion, but isn't it possible to use the openlineage proxy to send json to a kafka topic and let Atlas read that json without any modification? It would be necessary to create a new model for spark, other than https://github.com/apache/atlas/blob/release-2.1.0-rc3/addons/models/1000-Hadoop/1100-spark_model.json, and upload it to atlas (which could be done with a call to the atlas Api). Does that make sense?
*Thread Reply:* @Juan Carlos Fernández Rodríguez - You still need to build a bridge between the OpenLineage Spec and the Apache Atlas entity JSON. So far, no one has contributed something like that to the open source community... yet!
+*Thread Reply:* @Juan Carlos Fernández Rodríguez - You still need to build a bridge between the OpenLineage Spec and the Apache Atlas entity JSON. So far, no one has contributed something like that to the open source community... yet!
@@ -38644,7 +38650,7 @@*Thread Reply:* sorry for the ignorance, +
*Thread Reply:* sorry for the ignorance, but what is the purpose of the bridge? The communication with atlas should be done through kafka, and those messages can be sent by the proxy. What am I missing?
@@ -38671,7 +38677,7 @@*Thread Reply:* "bridge" in this case refers to a service of some sort that converts from OpenLineage run event to Atlas entity JSON, since there's currently nothing that will do that
+*Thread Reply:* "bridge" in this case refers to a service of some sort that converts from OpenLineage run event to Atlas entity JSON, since there's currently nothing that will do that
@@ -38697,7 +38703,7 @@*Thread Reply:* If OpenLineage sends an event to kafka, I think we can use Kafka Streams or Kafka Connect to rebuild the message into an atlas event.
+*Thread Reply:* If OpenLineage sends an event to kafka, I think we can use Kafka Streams or Kafka Connect to rebuild the message into an atlas event.
@@ -38723,7 +38729,7 @@*Thread Reply:* @John Thomas Our company used to use atlas as a metadata service. I just came to know this project. After I learned how openlineage works, I think I can create an issue to describe my design first.
+*Thread Reply:* @John Thomas Our company used to use atlas as a metadata service. I just came to know this project. After I learned how openlineage works, I think I can create an issue to describe my design first.
@@ -38749,7 +38755,7 @@*Thread Reply:* @Juan Carlos Fernández Rodríguez If you already have some experience and design, can you directly create an issue so that we can discuss it in more detail ?
+*Thread Reply:* @Juan Carlos Fernández Rodríguez If you already have some experience and design, can you directly create an issue so that we can discuss it in more detail ?
@@ -38775,7 +38781,7 @@*Thread Reply:* Hi @xiang chen we are discussing internally in my company whether to rewrite to atlas or another alternative. If we do this, we will share it and could involve you in some way.
+*Thread Reply:* Hi @xiang chen we are discussing internally in my company whether to rewrite to atlas or another alternative. If we do this, we will share it and could involve you in some way.
@@ -38860,7 +38866,7 @@*Thread Reply:* Hi Luca, I agree this looks really promising. I’m working on getting it to run on Databricks, but I’m only just starting out 🙂
+*Thread Reply:* Hi Luca, I agree this looks really promising. I’m working on getting it to run on Databricks, but I’m only just starting out 🙂
@@ -38986,7 +38992,7 @@*Thread Reply:* By duplicated, you mean with the same runId?
+*Thread Reply:* By duplicated, you mean with the same runId?
@@ -39012,7 +39018,7 @@*Thread Reply:* It’s only one example; it could also be a duplicated job name or anything else. The question is whether there is a mechanism to report that
+*Thread Reply:* It’s only one example; it could also be a duplicated job name or anything else. The question is whether there is a mechanism to report that
@@ -39068,7 +39074,7 @@*Thread Reply:* I think this log should be dropped to debug: https://github.com/OpenLineage/OpenLineage/blob/d66c41872f3cc7f7cd5c99664d401e070e[…]c/main/common/java/io/openlineage/spark/agent/EventEmitter.java
+*Thread Reply:* I think this log should be dropped to debug: https://github.com/OpenLineage/OpenLineage/blob/d66c41872f3cc7f7cd5c99664d401e070e[…]c/main/common/java/io/openlineage/spark/agent/EventEmitter.java
*Thread Reply:* @Maciej Obuchowski that is a good one! It would be nice to still have SOME logging at info level to know that the event completed successfully, but that response and event are very verbose.
+*Thread Reply:* @Maciej Obuchowski that is a good one! It would be nice to still have SOME logging at info level to know that the event completed successfully, but that response and event are very verbose.
I was also thinking about here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/src/main/common/java/io/openlineage/spark/agent/lifecycle/OpenLineageRunEventBuilder.java#L337-L340
@@ -39155,7 +39161,7 @@*Thread Reply:* Yes, that would be a good solution for now. Later it would be nice to have some option to raise the log level - OL logs are absolutely drowning in logs from the rest of the Spark cluster when set to debug.
+*Thread Reply:* Yes, that would be a good solution for now. Later it would be nice to have some option to raise the log level - OL logs are absolutely drowning in logs from the rest of the Spark cluster when set to debug.
@@ -39274,7 +39280,7 @@*Thread Reply:* Hey, I responded on the issue, but just to make it clear for everyone, the OL events for a run are not expected to be an accumulation of all past events. Events should be treated as additive by the backend - each event can post what information it has about the run and the backend is responsible for constructing a holistic picture of the run
+*Thread Reply:* Hey, I responded on the issue, but just to make it clear for everyone, the OL events for a run are not expected to be an accumulation of all past events. Events should be treated as additive by the backend - each event can post what information it has about the run and the backend is responsible for constructing a holistic picture of the run
*Thread Reply:* e.g., here is the marquez code that fetches the facets for a run. Note that all of the facets are included from all events with the requested run_uuid
. If the env facet is present on any event, it will be returned by the API
*Thread Reply:* e.g., here is the marquez code that fetches the facets for a run. Note that all of the facets are included from all events with the requested run_uuid
. If the env facet is present on any event, it will be returned by the API
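In other words, the backend folds facets across every event it has seen for a given run - conceptually something like this sketch (not Marquez's actual code):
```python
def merged_run_facets(events):
    """Fold run facets across all events for one runId; later events win on conflicting keys."""
    merged = {}
    for event in sorted(events, key=lambda e: e["eventTime"]):
        merged.update(event.get("run", {}).get("facets", {}))
    return merged
```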
*Thread Reply:* Ah! Thanks for that @Michael Collado it's good to understand the OpenLineage perspective.
+*Thread Reply:* Ah! Thanks for that @Michael Collado it's good to understand the OpenLineage perspective.
So, we do need to maintain some state. That makes total sense, Mike.
@@ -39437,7 +39443,7 @@*Thread Reply:* If I were building the backend, I would store events, then calculate the end state later, rather than trying to "maintain some state" (maybe we mean the same thing, but using different words here 😀). +
*Thread Reply:* If I were building the backend, I would store events, then calculate the end state later, rather than trying to "maintain some state" (maybe we mean the same thing, but using different words here 😀).
Re: the failure events, I think job failures will currently result in one FAIL
event and one COMPLETE
event. The SparkListenerJobEnd
event will trigger a FAIL
event but the SparkListenerSQLExecutionEnd
event will trigger the COMPLETE
event.
*Thread Reply:* Oooh! I did not know we already could get a FAIL event! That is super helpful to know, Mike! Thank you so much!
+*Thread Reply:* Oooh! I did not know we already could get a FAIL event! That is super helpful to know, Mike! Thank you so much!
@@ -39643,7 +39649,7 @@*Thread Reply:* I think this scenario only happens when spark job spawns another "sub-job", right?
+*Thread Reply:* I think this scenario only happens when spark job spawns another "sub-job", right?
I think that maybe you can check sparkContext.getLocalProperty("spark.sql.execution.id")
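In PySpark that check is a one-liner (assuming an active spark session inside the job being correlated):
```python
# returns the current SQL execution id as a string, or None outside a SQL execution
exec_id = spark.sparkContext.getLocalProperty("spark.sql.execution.id")
print(exec_id)
```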
*Thread Reply:* @Maciej Obuchowski - I was hoping they'd have the same run id as well but they do not 😞
+*Thread Reply:* @Maciej Obuchowski - I was hoping they'd have the same run id as well but they do not 😞
But that is the expectation? A SparkSQLExecutionStart and JobStart SHOULD have the same execution ID, right?
@@ -39732,7 +39738,7 @@*Thread Reply:* SparkSQLExecutionStart and SparkSQLExecutionEnd should have the same runId, as well as JobStart and JobEnd events. Beyond those it can get wild. For example, some jobs don't emit JobStart/JobEnd events. Some jobs, like Delta emit multiple, that aren't easily tied to SQL event.
+*Thread Reply:* SparkSQLExecutionStart and SparkSQLExecutionEnd should have the same runId, as well as JobStart and JobEnd events. Beyond those it can get wild. For example, some jobs don't emit JobStart/JobEnd events. Some jobs, like Delta emit multiple, that aren't easily tied to SQL event.
@@ -39758,7 +39764,7 @@*Thread Reply:* Okay, I dug into the Databricks Synapse Connector and it does the following:
+*Thread Reply:* Okay, I dug into the Databricks Synapse Connector and it does the following:
I've attached the logs and a screenshot of what I'm seeing in the Spark UI. If you had a chance to take a look, it's a bit verbose, but I'd appreciate a second pair of eyes on my analysis. Hopefully I got something wrong 😅
*Thread Reply:* I think we've encountered the same stuff in Delta before 🙂
+*Thread Reply:* I think we've encountered the same stuff in Delta before 🙂
https://github.com/OpenLineage/OpenLineage/issues/388#issuecomment-964401860
@@ -39842,7 +39848,7 @@*Thread Reply:* @Will Johnson , am I reading your report correctly that the SparkListenerJobStart
event is reported with a spark.sql.execution.id
that differs from the execution id of the SparkSQLExecutionStart
?
*Thread Reply:* @Will Johnson , am I reading your report correctly that the SparkListenerJobStart
event is reported with a spark.sql.execution.id
that differs from the execution id of the SparkSQLExecutionStart
?
*Thread Reply:* WILLJ: We're deep inside this thing and have an executionid |9|
+
*Thread Reply:* WILLJ: We're deep inside this thing and have an executionid |9|
😂
*Thread Reply:* Hah @Michael Collado I see you found my method of debugging in Databricks 😅
+*Thread Reply:* Hah @Michael Collado I see you found my method of debugging in Databricks 😅
But you're exactly right, there's a SparkSQLExecutionStart event with execution id 8 and then a set of JobStart events all with execution id 9!
@@ -39997,7 +40003,7 @@*Thread Reply:* You won’t want to miss this talk!
+*Thread Reply:* You won’t want to miss this talk!
@@ -40051,7 +40057,7 @@*Thread Reply:* hi Martin - when you talk about a DataHub integration, did you mean a method to collect information from DataHub? I don't see a current issue open for that, but I recommend you make one to kick off the discussion around it.
+*Thread Reply:* hi Martin - when you talk about a DataHub integration, did you mean a method to collect information from DataHub? I don't see a current issue open for that, but I recommend you make one to kick off the discussion around it.
If you mean sending information to DataHub, that should already be possible if users pass a datahub api endpoint to the OPENLINEAGE_ENDPOINT variable
@@ -40079,7 +40085,7 @@*Thread Reply:* Hi, thanks for the reply! I meant emitting the OpenLineage JSON structure to datahub.
+*Thread Reply:* Hi, thanks for the reply! I meant emitting the OpenLineage JSON structure to datahub.
Could you please be more specific, possibly link an article on how to find the endpoint on the datahub side? Many thanks!
@@ -40107,7 +40113,7 @@*Thread Reply:* ooooh, sorry I misread - I thought you meant that datahub had built an endpoint. Your integration should emit openlineage events to an endpoint, but datahub would have to build that support into their product likely? I'm not sure how to go about it
+*Thread Reply:* ooooh, sorry I misread - I thought you meant that datahub had built an endpoint. Your integration should emit openlineage events to an endpoint, but datahub would have to build that support into their product likely? I'm not sure how to go about it
@@ -40133,7 +40139,7 @@*Thread Reply:* I'd reach out to datahub, potentially?
+*Thread Reply:* I'd reach out to datahub, potentially?
@@ -40159,7 +40165,7 @@*Thread Reply:* i see. ok, will do!
+*Thread Reply:* i see. ok, will do!
@@ -40185,7 +40191,7 @@*Thread Reply:* It has been discussed in the past but I don’t think there is something yet. The Kafka transport PR that is in flight should facilitate this
+*Thread Reply:* It has been discussed in the past but I don’t think there is something yet. The Kafka transport PR that is in flight should facilitate this
@@ -40211,7 +40217,7 @@*Thread Reply:* Thanks for the response! though dragging Kafka in just for the data-delivery bit is too much. I think the clearest way would be to push Datahub to make an API endpoint and parser for the OL /lineage data structure.
+*Thread Reply:* Thanks for the response! though dragging Kafka in just for the data-delivery bit is too much. I think the clearest way would be to push Datahub to make an API endpoint and parser for the OL /lineage data structure.
I see this is more a political thing that would require a joint effort of the DataHub team and OpenLineage with a common goal.
@@ -40429,7 +40435,7 @@*Thread Reply:* Hi marco - we don't have documentation on that yet, but the Postgres extractor is a pretty good example of how they're implemented.
+*Thread Reply:* Hi marco - we don't have documentation on that yet, but the Postgres extractor is a pretty good example of how they're implemented.
all the included extractors are here: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow/openlineage/airflow/extractors
@@ -40457,7 +40463,7 @@*Thread Reply:* Thanks. I can follow that to build my own. Also, I am installing this environment right now in Airflow 2. It seems I need Marquez and the openlineage-airflow library. It seems that, per this example, I can put my extractors in any path as long as it is referenced in the environment variable. Is that correct? +
*Thread Reply:* Thanks. I can follow that to build my own. Also, I am installing this environment right now in Airflow 2. It seems I need Marquez and the openlineage-airflow library. It seems that, per this example, I can put my extractors in any path as long as it is referenced in the environment variable. Is that correct?
OPENLINEAGE_EXTRACTOR_<operator>=full.path.to.ExtractorClass
Also do I need anything else other than Marquez and openlineage_airflow?
*Thread Reply:* Yes, as long as the extractors are in the python path.
+*Thread Reply:* Yes, as long as the extractors are in the python path.
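A skeleton for such an extractor might look like the sketch below (the class, table, and connection names are illustrative, and the base-class API differs between openlineage-airflow versions - treat the Postgres extractor as the authoritative reference), registered via OPENLINEAGE_EXTRACTOR_MyOperator=my_module.MyOperatorExtractor:
```python
from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata
from openlineage.common.dataset import Dataset, Source

class MyOperatorExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls):
        # operator(s) this extractor should handle; method name may vary by version
        return ["MyOperator"]

    def extract(self) -> TaskMetadata:
        # describe the datasets the operator reads and writes
        source = Source(
            scheme="postgres",
            authority="myhost:5432",
            connection_url="postgres://myhost:5432/mydb",
        )
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            # depending on the version, these may need converting via to_openlineage_dataset()
            inputs=[Dataset(source=source, name="public.input_table")],
            outputs=[Dataset(source=source, name="public.output_table")],
        )
```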
@@ -40511,7 +40517,7 @@*Thread Reply:* I built one a little while ago for a custom operator, I'd be happy to share what I did. I put it in the same file as the operator class for convenience.
+*Thread Reply:* I built one a little while ago for a custom operator, I'd be happy to share what I did. I put it in the same file as the operator class for convenience.
@@ -40537,7 +40543,7 @@*Thread Reply:* That will be great help. Thanks
+*Thread Reply:* That will be great help. Thanks
@@ -40563,10 +40569,10 @@*Thread Reply:* This is the one I wrote:
+*Thread Reply:* This is the one I wrote:
*Thread Reply:* to make it work, I set this environment variable:
+*Thread Reply:* to make it work, I set this environment variable:
OPENLINEAGE_EXTRACTOR_HttpToBigQueryOperator=http_to_bigquery.HttpToBigQueryExtractor
*Thread Reply:* the extractor starts at line 183, and the really important bits start at line 218
+*Thread Reply:* the extractor starts at line 183, and the really important bits start at line 218
@@ -40712,7 +40718,7 @@*Thread Reply:* You would see that the job was run - but we couldn't extract dataset lineage from it.
+*Thread Reply:* You would see that the job was run - but we couldn't extract dataset lineage from it.
@@ -40738,7 +40744,7 @@*Thread Reply:* The good news is that we're working to solve this problem in general.
+*Thread Reply:* The good news is that we're working to solve this problem in general.
@@ -40764,7 +40770,7 @@*Thread Reply:* I see, so i definitely will need the custom extractor built. I just need to understand where to set the path to the extractor. I can build one by following the postgres extractor you have built.
+*Thread Reply:* I see, so i definitely will need the custom extractor built. I just need to understand where to set the path to the extractor. I can build one by following the postgres extractor you have built.
@@ -40790,7 +40796,7 @@*Thread Reply:* That depends how you deploy Airflow. Our tests use environment in docker-compose: https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/tests/integration/tests/docker-compose-2.yml#L34
+*Thread Reply:* That depends how you deploy Airflow. Our tests use environment in docker-compose: https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/tests/integration/tests/docker-compose-2.yml#L34
*Thread Reply:* Thanks for the example. I can show this to my infra support person for his reference.
+*Thread Reply:* Thanks for the example. I can show this to my infra support person for his reference.
@@ -40975,7 +40981,7 @@*Thread Reply:* Do you need to specify a namespace that is not « default »?
+*Thread Reply:* Do you need to specify a namespace that is not « default »?
@@ -41027,7 +41033,7 @@*Thread Reply:* @Kevin Mellott aren't you the chart wizard? Maybe you could help 🙂
+*Thread Reply:* @Kevin Mellott aren't you the chart wizard? Maybe you could help 🙂
@@ -41057,7 +41063,7 @@*Thread Reply:* Ok so I had to update a chart dependency
+*Thread Reply:* Ok so I had to update a chart dependency
@@ -41083,7 +41089,7 @@*Thread Reply:* Now I installed the service in amazon using this +
*Thread Reply:* Now I installed the service in amazon using this
helm install marquez . --dependency-update --set marquez.db.host=myhost --set marquez.db.user=myuser --set marquez.db.password=mypassword --namespace marquez --atomic --wait
*Thread Reply:* i can see marquez-web running and marquez as well as the database i set up manually
+*Thread Reply:* i can see marquez-web running and marquez as well as the database i set up manually
@@ -41136,7 +41142,7 @@*Thread Reply:* however I can not fetch initial data when logging into the endpoint
+*Thread Reply:* however I can not fetch initial data when logging into the endpoint
*Thread Reply:* 👋 @Marco Diaz happy to hear that the Helm install is completing without error! To help troubleshoot the error above, can you please let me know if this endpoint is available and working?
+*Thread Reply:* 👋 @Marco Diaz happy to hear that the Helm install is completing without error! To help troubleshoot the error above, can you please let me know if this endpoint is available and working?
http://localhost:5000/api/v1/namespaces
@@ -41199,7 +41205,7 @@*Thread Reply:* i got this +
*Thread Reply:* i got this
{"namespaces":[{"name":"default","createdAt":"2022_03_10T18:05:55.780593Z","updatedAt":"2022-03-10T19:03:31.309713Z","ownerName":"anonymous","description":"The default global namespace for dataset, job, and run metadata not belonging to a user-specified namespace."}]}
*Thread Reply:* I have to use the namespace marquez to redirect there +
*Thread Reply:* I have to use the namespace marquez to redirect there
kubectl port-forward svc/marquez 5000:80 -n marquez
*Thread Reply:* Is there something I need to change in a config file?
+*Thread Reply:* Is there something I need to change in a config file?
@@ -41279,7 +41285,7 @@*Thread Reply:* also how would I change the "localhost" address to something that is accessible in amazon without the need to redirect?
+*Thread Reply:* also how would I change the "localhost" address to something that is accessible in amazon without the need to redirect?
@@ -41305,7 +41311,7 @@*Thread Reply:* Sorry for all the questions. I am not an infra guy and have had to do all this by myself
+*Thread Reply:* Sorry for all the questions. I am not an infra guy and have had to do all this by myself
@@ -41331,7 +41337,7 @@*Thread Reply:* No problem at all, I think there are a couple of things at play here. With the local setup, it appears that the web is attempting to access the API on the wrong port number (3000 instead of 5000). I’ll create an issue for that one so that we can fix it.
+*Thread Reply:* No problem at all, I think there are a couple of things at play here. With the local setup, it appears that the web is attempting to access the API on the wrong port number (3000 instead of 5000). I’ll create an issue for that one so that we can fix it.
As to the EKS installation (or any non-local install), this is where you would need to use what’s called an ingress controller to expose the services outside of the Kubernetes cluster. There are different flavors of these (NGINX is popular), and I believe that AWS EKS has some built-in capabilities that might help as well.
@@ -41386,7 +41392,7 @@*Thread Reply:* So how do i fix this issue?
+*Thread Reply:* So how do i fix this issue?
@@ -41412,7 +41418,7 @@*Thread Reply:* If your goal is to deploy to AWS, then you would need to get the EKS ingress configured. It’s not a trivial task, but they do have a bit of a walkthrough at https://www.eksworkshop.com/beginner/130_exposing-service/.
+*Thread Reply:* If your goal is to deploy to AWS, then you would need to get the EKS ingress configured. It’s not a trivial task, but they do have a bit of a walkthrough at https://www.eksworkshop.com/beginner/130_exposing-service/.
However, if you are just seeking to explore Marquez and try things out, then I would highly recommend the “Open in Gitpod” functionality at https://github.com/MarquezProject/marquez#try-it. That will perform a full deployment for you in a temporary environment very quickly.
*Thread Reply:* I need to use it in AWS for a POC
+*Thread Reply:* I need to use it in AWS for a POC
@@ -41491,7 +41497,7 @@*Thread Reply:* Is there a better guide on how to install and set up Marquez in AWS? +
*Thread Reply:* Is there a better guide on how to install and set up Marquez in AWS? This guide omits many steps https://marquezproject.github.io/marquez/running-on-aws.html
*Thread Reply:* I still see this +
*Thread Reply:* I still see this https://files.slack.com/files-pri/T01CWUYP5AR-F036JKN77EW/image.png
*Thread Reply:* I created my own database and changed the values for host, user and password inside the chart.yml
+*Thread Reply:* I created my own database and changed the values for host, user and password inside the chart.yml
@@ -41709,7 +41715,7 @@*Thread Reply:* Does it show that within the AWS deployment? It looks to show localhost in your screenshot.
+*Thread Reply:* Does it show that within the AWS deployment? It looks to show localhost in your screenshot.
@@ -41735,7 +41741,7 @@*Thread Reply:* Or are you working through the local deploy right now?
+*Thread Reply:* Or are you working through the local deploy right now?
@@ -41761,7 +41767,7 @@*Thread Reply:* It shows the same using the exposed service
+*Thread Reply:* It shows the same using the exposed service
@@ -41787,7 +41793,7 @@*Thread Reply:* I just didn't take another screenshot
+*Thread Reply:* I just didn't take another screenshot
@@ -41813,7 +41819,7 @@*Thread Reply:* Could it be communication with the DB?
+*Thread Reply:* Could it be communication with the DB?
@@ -41839,7 +41845,7 @@*Thread Reply:* What do you see if you view the network traffic within your web browser (right click -> Inspect -> Network). Specifically, wondering what the response code from the Marquez API URL looks like.
+*Thread Reply:* What do you see if you view the network traffic within your web browser (right click -> Inspect -> Network). Specifically, wondering what the response code from the Marquez API URL looks like.
@@ -41865,7 +41871,7 @@*Thread Reply:* i see this error +
*Thread Reply:* i see this error
Error occured while trying to proxy to: xxxxxxxxxxxxxxxxxxxxxxxxx.us-east-1.elb.amazonaws.com/api/v1/namespaces
*Thread Reply:* it seems to be trying to use the same address to access the api endpoint
+*Thread Reply:* it seems to be trying to use the same address to access the api endpoint
@@ -41918,7 +41924,7 @@*Thread Reply:* however the api service is in a different endpoint
+*Thread Reply:* however the api service is in a different endpoint
@@ -41944,7 +41950,7 @@*Thread Reply:* The API resides here +
*Thread Reply:* The API resides here
<a href="http://Xxxxxxxxxxxxxxxxxxxxxx-2064419849.us-east-1.elb.amazonaws.com">Xxxxxxxxxxxxxxxxxxxxxx-2064419849.us-east-1.elb.amazonaws.com</a>
*Thread Reply:* The web service resides here +
*Thread Reply:* The web service resides here
<a href="http://xxxxxxxxxxxxxxxxxxxxxxxxxxx-335729662.us-east-1.elb.amazonaws.com">xxxxxxxxxxxxxxxxxxxxxxxxxxx-335729662.us-east-1.elb.amazonaws.com</a>
*Thread Reply:* do they both need to be under the same LB?
+*Thread Reply:* do they both need to be under the same LB?
@@ -42024,7 +42030,7 @@*Thread Reply:* How would I do that if they install as separate services?
+*Thread Reply:* How would I do that if they install as separate services?
@@ -42050,7 +42056,7 @@*Thread Reply:* You are correct, both the website and API are expecting to be exposed on the same ALB. This will give you a single URL that can reach your Kubernetes cluster, and then the ALB will allow you to configure Ingress rules to route the traffic based on the request.
+*Thread Reply:* You are correct, both the website and API are expecting to be exposed on the same ALB. This will give you a single URL that can reach your Kubernetes cluster, and then the ALB will allow you to configure Ingress rules to route the traffic based on the request.
Here is an example from one of the AWS repos - in the ingress resource you can see the single rule setup to point traffic to a given service.
@@ -42167,7 +42173,7 @@*Thread Reply:* Thanks for the help. Now I know what the issue is
+*Thread Reply:* Thanks for the help. Now I know what the issue is
@@ -42193,7 +42199,7 @@*Thread Reply:* Great to hear!!
+*Thread Reply:* Great to hear!!
@@ -42248,7 +42254,7 @@*Thread Reply:* Hi! Yes, OpenLineage is free. It is an open source standard for collection, and it provides the agents that integrate with pipeline tools to capture lineage metadata. You also need a metadata server, and there is an open source one called Marquez that you can use.
+*Thread Reply:* Hi! Yes, OpenLineage is free. It is an open source standard for collection, and it provides the agents that integrate with pipeline tools to capture lineage metadata. You also need a metadata server, and there is an open source one called Marquez that you can use.
@@ -42274,7 +42280,7 @@*Thread Reply:* It supports the databases listed here: https://openlineage.io/integration
+*Thread Reply:* It supports the databases listed here: https://openlineage.io/integration
*Thread Reply:* Not sure I understand - are you looking for example code in Python that shows how to make OpenLineage calls?
+*Thread Reply:* Not sure I understand - are you looking for example code in Python that shows how to make OpenLineage calls?
@@ -42373,7 +42379,7 @@*Thread Reply:* yup
+*Thread Reply:* yup
@@ -42399,7 +42405,7 @@*Thread Reply:* how to run
+*Thread Reply:* how to run
@@ -42425,7 +42431,7 @@*Thread Reply:* this is a good post for getting started with Marquez: https://openlineage.io/blog/explore-lineage-api/
+*Thread Reply:* this is a good post for getting started with Marquez: https://openlineage.io/blog/explore-lineage-api/
*Thread Reply:* once you have run ./docker/up.sh, you should be able to run through that and see how the system runs
*Thread Reply:* once you have run ./docker/up.sh, you should be able to run through that and see how the system runs
*Thread Reply:* There is a python client you can find here: https://github.com/OpenLineage/OpenLineage/tree/main/client/python
+*Thread Reply:* There is a python client you can find here: https://github.com/OpenLineage/OpenLineage/tree/main/client/python
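(As a sketch of what using that client looks like, assuming openlineage-python is installed; the namespace, job name, and producer URL below are made up.)

import uuid
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")

# Emit a START event for a made-up job; the server registers the job and run
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid.uuid4())),
    job=Job(namespace="my-namespace", name="my-job"),
    producer="https://example.com/my-producer",  # placeholder producer URI
))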
@@ -42524,7 +42530,7 @@*Thread Reply:* Thank you
+*Thread Reply:* Thank you
@@ -42550,7 +42556,7 @@*Thread Reply:* You are welcome 🙂
+*Thread Reply:* You are welcome 🙂
@@ -42576,7 +42582,7 @@*Thread Reply:* Hey @Ross Turk, (and potentially @Maciej Obuchowski) - what are the plans for OL Python client? I'd like to use it, but without a pip package it's not really project-friendly.
+*Thread Reply:* Hey @Ross Turk, (and potentially @Maciej Obuchowski) - what are the plans for OL Python client? I'd like to use it, but without a pip package it's not really project-friendly.
Is there any work in that direction, is the current client code considered mature and just needs re-packaging, or is it just a thought sketch and some serious work is needed?
@@ -42606,7 +42612,7 @@*Thread Reply:* What do you mean without pip-package?
+*Thread Reply:* What do you mean without pip-package?
@@ -42632,7 +42638,7 @@*Thread Reply:* https://pypi.org/project/openlineage-python/
+*Thread Reply:* https://pypi.org/project/openlineage-python/
@@ -42658,7 +42664,7 @@*Thread Reply:* It's still developed, for example next release will have pluggable backends - like Kafka +
*Thread Reply:* It's still developed, for example next release will have pluggable backends - like Kafka https://github.com/OpenLineage/OpenLineage/pull/530
@@ -42685,7 +42691,7 @@*Thread Reply:* My apologies Maciej! +
*Thread Reply:* My apologies Maciej! In my defense - looking for "open lineage" on pypi doesn't show this in the first 20 results. Still, should have checked setup.py. My bad, and thank you for the pointer!
@@ -42712,7 +42718,7 @@*Thread Reply:* We might need to add some keywords to setup.py - right now we have only "openlineage" there 😉
+*Thread Reply:* We might need to add some keywords to setup.py - right now we have only "openlineage" there 😉
@@ -42738,7 +42744,7 @@*Thread Reply:* My mistake was that I was expecting a separate repo for the clients. But now I'm playing around with the package and trying to figure out the OL concepts. Thank you for your contribution, it's much nicer to experiment from ipynb than curl 🙂
+*Thread Reply:* My mistake was that I was expecting a separate repo for the clients. But now I'm playing around with the package and trying to figure out the OL concepts. Thank you for your contribution, it's much nicer to experiment from ipynb than curl 🙂
@@ -42837,7 +42843,7 @@*Thread Reply:* the seed data is being inserted by this command here: https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/cli/SeedCommand.java
+*Thread Reply:* the seed data is being inserted by this command here: https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/cli/SeedCommand.java
*Thread Reply:* Got it, but if I change the code in this Java file, let's say I add another job here satisfying the syntax, it's not appearing in the lineage flow
+*Thread Reply:* Got it, but if I change the code in this Java file, let's say I add another job here satisfying the syntax, it's not appearing in the lineage flow
@@ -43234,7 +43240,7 @@*Thread Reply:* I’m not too familiar with the 504 error in ALB, but found a guide with troubleshooting steps. If this is an issue with connectivity to the Postgres database, then you should be able to see errors within the marquez pod in EKS (kubectl logs <marquez pod name>) to confirm.
+*Thread Reply:* I’m not too familiar with the 504 error in ALB, but found a guide with troubleshooting steps. If this is an issue with connectivity to the Postgres database, then you should be able to see errors within the marquez pod in EKS (kubectl logs <marquez pod name>) to confirm.
I know that EKS needs to have connectivity established to the Postgres database, even in the case of RDS, so that could be the culprit.
@@ -43262,7 +43268,7 @@*Thread Reply:* @Kevin Mellott This is the error I am seeing in the logs +
*Thread Reply:* @Kevin Mellott This is the error I am seeing in the logs
[HPM] Proxy created: /api/v1 -> <http://localhost:5000/>
App listening on port 3000!
[HPM] Error occurred while trying to proxy request /api/v1/namespaces from marquez-interface-test.di.rbx.com to <http://localhost:5000/> (ECONNREFUSED) (<https://nodejs.org/api/errors.html#errors_common_system_errors>)
*Thread Reply:* It looks like the website is attempting to find the API on localhost. I believe this can be resolved by setting the following Helm chart value within your deployment.
+*Thread Reply:* It looks like the website is attempting to find the API on localhost. I believe this can be resolved by setting the following Helm chart value within your deployment.
marquez.hostname=marquez-interface-test.di.rbx.com
@@ -43319,7 +43325,7 @@*Thread Reply:* assuming that is the DNS used by the website
+*Thread Reply:* assuming that is the DNS used by the website
@@ -43345,7 +43351,7 @@*Thread Reply:* thanks, that did it. I have a question regarding the database
+*Thread Reply:* thanks, that did it. I have a question regarding the database
@@ -43371,7 +43377,7 @@*Thread Reply:* I made my own database manually. Should the marquez tables be created automatically when installing marquez?
+*Thread Reply:* I made my own database manually. Should the marquez tables be created automatically when installing marquez?
@@ -43397,7 +43403,7 @@*Thread Reply:* Also, could you put both the API and interface on the same port (3000)?
+*Thread Reply:* Also, could you put both the API and interface on the same port (3000)?
@@ -43423,7 +43429,7 @@*Thread Reply:* Seems I am still having the forwarding issue +
*Thread Reply:* Seems I am still having the forwarding issue
[HPM] Proxy created: /api/v1 -> <http://marquez-interface-test.di.rbx.com:5000/>
App listening on port 3000!
[HPM] Error occurred while trying to proxy request /api/v1/namespaces from marquez-interface-test.di.rbx.com to <http://marquez-interface-test.di.rbx.com:5000/> (ECONNRESET) (<https://nodejs.org/api/errors.html#errors_common_system_errors>)
*Thread Reply:* It's always Delta, isn't it?
+*Thread Reply:* It's always Delta, isn't it?
When I originally worked on Delta support I tried to find an answer on Delta slack and got one:
@@ -43522,7 +43528,7 @@*Thread Reply:* I haven't touched how it works in Spark 2 - wanted to make it work with Spark 3's new catalogs, so can't help you there.
+*Thread Reply:* I haven't touched how it works in Spark 2 - wanted to make it work with Spark 3's new catalogs, so can't help you there.
@@ -43548,7 +43554,7 @@*Thread Reply:* Argh!! It's always Databricks doing something 🙄
+*Thread Reply:* Argh!! It's always Databricks doing something 🙄
Thanks, Maciej!
@@ -43576,7 +43582,7 @@*Thread Reply:* One last question for you, @Maciej Obuchowski, any thoughts on how I could identify WHY a particular JobStart event fired? Is it just stepping through every event? Was that your approach to getting Spark3 Delta working? Thank you so much for the insights!
+*Thread Reply:* One last question for you, @Maciej Obuchowski, any thoughts on how I could identify WHY a particular JobStart event fired? Is it just stepping through every event? Was that your approach to getting Spark3 Delta working? Thank you so much for the insights!
@@ -43602,7 +43608,7 @@*Thread Reply:* Before that, we were using just JobStart/JobEnd events and I couldn't find events that correspond to logical plan that has anything to do with what job was actually doing. I just found out that SQLExecution events have what I want, so I just started using them and stopped worrying about Projection or Aggregate, or other events that don't really matter here - and that's how filtering idea was born: https://github.com/OpenLineage/OpenLineage/issues/423
+*Thread Reply:* Before that, we were using just JobStart/JobEnd events and I couldn't find events that correspond to logical plan that has anything to do with what job was actually doing. I just found out that SQLExecution events have what I want, so I just started using them and stopped worrying about Projection or Aggregate, or other events that don't really matter here - and that's how filtering idea was born: https://github.com/OpenLineage/OpenLineage/issues/423
@@ -43628,7 +43634,7 @@*Thread Reply:* Are you trying to get environment info from those events, or do you actually get Job event with proper logical plans like SaveIntoDataSourceCommand?
+*Thread Reply:* Are you trying to get environment info from those events, or do you actually get Job event with proper logical plans like SaveIntoDataSourceCommand?
Might be worth just posting here all the events + logical plans that are generated for a particular job, as I've done in that issue
@@ -43656,7 +43662,7 @@*Thread Reply:* scala> spark.sql("CREATE TABLE tbl USING delta AS SELECT * FROM tmp")
+
scala> spark.sql("CREATE TABLE tbl USING delta AS SELECT * FROM tmp")
21/11/09 19:01:46 WARN SparkSQLExecutionContext: SparkListenerSQLExecutionStart - executionId: 3
21/11/09 19:01:46 WARN SparkSQLExecutionContext: org.apache.spark.sql.catalyst.plans.logical.CreateTableAsSelect
21/11/09 19:01:46 WARN SparkSQLExecutionContext: SparkListenerSQLExecutionStart - executionId: 4
@@ -43702,7 +43708,7 @@
Suppress success
*Thread Reply:* The JobStart event contains a Properties field and that contains a bunch of fields we want to extract to get more precise lineage information within Databricks.
+*Thread Reply:* The JobStart event contains a Properties field and that contains a bunch of fields we want to extract to get more precise lineage information within Databricks.
As far as we know, the SQLExecutionStart event does not have any way to get these properties :(
@@ -43797,7 +43803,7 @@*Thread Reply:* I started down this path with the Project statement but I agree with @Michael Collado that a ProjectVisitor isn't a great idea.
+*Thread Reply:* I started down this path with the Project statement but I agree with @Michael Collado that a ProjectVisitor isn't a great idea.
https://github.com/OpenLineage/OpenLineage/issues/617
*Thread Reply:* Marquez and OpenLineage are job-focused lineage tools, so once you run a job in an OL-integrated instance of Airflow (or any other supported integration), you should see the jobs and DBs appear in the marquez ui
+*Thread Reply:* Marquez and OpenLineage are job-focused lineage tools, so once you run a job in an OL-integrated instance of Airflow (or any other supported integration), you should see the jobs and DBs appear in the marquez ui
@@ -44074,7 +44080,7 @@*Thread Reply:* If you want to seed it with some data, just to try it out, you can run docker/up.sh -s and it will run a seeding job as it starts.
*Thread Reply:* If you want to seed it with some data, just to try it out, you can run docker/up.sh -s and it will run a seeding job as it starts.
*Thread Reply:* Yep! Marquez will register all in/out datasets present in the OL event as well as link them to the run
+*Thread Reply:* Yep! Marquez will register all in/out datasets present in the OL event as well as link them to the run
@@ -44152,7 +44158,7 @@*Thread Reply:* FYI, @Peter Hicks is working on displaying the dataset version to run relationship in the web UI, see https://github.com/MarquezProject/marquez/pull/1929
+*Thread Reply:* FYI, @Peter Hicks is working on displaying the dataset version to run relationship in the web UI, see https://github.com/MarquezProject/marquez/pull/1929
*Thread Reply:* Hi Marco,
+*Thread Reply:* Hi Marco,
Datakin is a reporting tool built on the Marquez API, and therefore designed to take in Lineage using the OpenLineage specification.
@@ -44295,7 +44301,7 @@*Thread Reply:* No, that is it. Got it. So, I can install Datakin and still use openlineage and marquez?
+*Thread Reply:* No, that is it. Got it. So, I can install Datakin and still use openlineage and marquez?
@@ -44321,7 +44327,7 @@*Thread Reply:* if you set up a datakin account, you'll have to change the environment variables used by your OpenLineage integrations, and the runEvents will be sent to Datakin rather than Marquez. You shouldn't have any loss of functionality, and you also won't have to keep manually hosting Marquez
+*Thread Reply:* if you set up a datakin account, you'll have to change the environment variables used by your OpenLineage integrations, and the runEvents will be sent to Datakin rather than Marquez. You shouldn't have any loss of functionality, and you also won't have to keep manually hosting Marquez
@@ -44347,7 +44353,7 @@*Thread Reply:* Will I still be able to use facets for backfills?
+*Thread Reply:* Will I still be able to use facets for backfills?
@@ -44373,7 +44379,7 @@*Thread Reply:* yeah it works in the same way - Datakin actually submodules the Marquez API
+*Thread Reply:* yeah it works in the same way - Datakin actually submodules the Marquez API
@@ -44488,7 +44494,7 @@*Thread Reply:* Yes, you don't need to modify dags in Airflow 2.1+
+*Thread Reply:* Yes, you don't need to modify dags in Airflow 2.1+
@@ -44514,7 +44520,7 @@*Thread Reply:* ok, I added that environment variable. Now my question is how do I configure my other variables. +
*Thread Reply:* ok, I added that environment variable. Now my question is how do I configure my other variables. I have marquez running in AWS with an ingress. Do I use OpenLineageURL or Marquez_URL?
@@ -44551,7 +44557,7 @@*Thread Reply:* Also would a new namespace be created if i add the variable?
+*Thread Reply:* Also would a new namespace be created if i add the variable?
@@ -44603,7 +44609,7 @@*Thread Reply:* Hi Datafool - I'm not familiar with how trino works, but the DBT-OL integration works by wrapping the dbt run command with dbt-ol run, and capturing lineage data from the run_results file
*Thread Reply:* Hi Datafool - I'm not familiar with how trino works, but the DBT-OL integration works by wrapping the dbt run command with dbt-ol run, and capturing lineage data from the run_results file
These things don't necessarily preclude you from using OpenLineage on trino, so it may work already.
@@ -44631,7 +44637,7 @@*Thread Reply:* hey @John Thomas yep, tried to use dbt-ol run command but it seems trino is not supported, only bigquery, redshift and few others.
+*Thread Reply:* hey @John Thomas yep, tried to use dbt-ol run command but it seems trino is not supported, only bigquery, redshift and few others.
@@ -44657,7 +44663,7 @@*Thread Reply:* aaah I misunderstood what Trino is - yeah we don't currently support jobs that are running outside of those environments.
+*Thread Reply:* aaah I misunderstood what Trino is - yeah we don't currently support jobs that are running outside of those environments.
We don't currently have plans for this, but a great first step would be opening an issue in the OpenLineage repo.
@@ -44687,7 +44693,7 @@*Thread Reply:* oh okay, got it, yes I can contribute, I'll see if I can get some time in the next few weeks. Thanks @John Thomas
+*Thread Reply:* oh okay, got it, yes I can contribute, I'll see if I can get some time in the next few weeks. Thanks @John Thomas
@@ -44963,7 +44969,7 @@*Thread Reply:* Hmmmm. Are you running in Docker? Is it possible for you to shell into your scheduler container and make sure the ENV is properly set?
+*Thread Reply:* Hmmmm. Are you running in Docker? Is it possible for you to shell into your scheduler container and make sure the ENV is properly set?
@@ -44989,7 +44995,7 @@*Thread Reply:* looks to me like the value you posted is correct, and return ['QueryOperator'] seems right to me
*Thread Reply:* looks to me like the value you posted is correct, and return ['QueryOperator'] seems right to me
*Thread Reply:* It is in an EKS cluster +
*Thread Reply:* It is in an EKS cluster
I checked and the variable is there
OPENLINEAGE_EXTRACTOR_QUERYOPERATOR=shared.plugins.ol_custom_extractors.QueryOperatorExtractor
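(For context, a custom extractor registered via that env variable is roughly shaped like this. This is a hedged sketch: the module path and class name come from Marco's variable above, and the body is hypothetical.)

from typing import List, Optional

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata

class QueryOperatorExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        # The operator class names this extractor should be applied to
        return ['QueryOperator']

    def extract(self) -> Optional[TaskMetadata]:
        # Pull whatever the operator exposes (sql, connection, ...) off self.operator
        return TaskMetadata(name=f"{self.operator.dag_id}.{self.operator.task_id}")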
*Thread Reply:* I am wondering if it is an issue with my extractor code. Something not rendering well
+*Thread Reply:* I am wondering if it is an issue with my extractor code. Something not rendering well
@@ -45069,7 +45075,7 @@*Thread Reply:* I don’t think it’s even executing your extractor code. The error message traces back to here: +
*Thread Reply:* I don’t think it’s even executing your extractor code. The error message traces back to here: https://github.com/OpenLineage/OpenLineage/blob/249868fa9b97d218ee35c4a198bcdf231a9b874b/integration/airflow/openlineage/lineage_backend/__init__.py#L77
*Thread Reply:* I am currently digging into _get_extractor to see where it might be missing yours 🤔
*Thread Reply:* I am currently digging into _get_extractor to see where it might be missing yours 🤔
*Thread Reply:* Thanks
+*Thread Reply:* Thanks
@@ -45173,7 +45179,7 @@*Thread Reply:* silly idea, but you could add a log message to __init__ in your extractor.
*Thread Reply:* silly idea, but you could add a log message to __init__ in your extractor.
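(That suggestion might look like this inside the extractor module; a sketch only, reusing the hypothetical class from above.)

import logging

from openlineage.airflow.extractors.base import BaseExtractor

logger = logging.getLogger(__name__)

class QueryOperatorExtractor(BaseExtractor):
    def __init__(self, operator):
        super().__init__(operator)
        # If this shows up in the task logs, the extractor was imported and instantiated
        logger.info("QueryOperatorExtractor loaded for task %s", operator.task_id)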
*Thread Reply:* the openlineage client actually tries to import the value of that env variable from position 22 (i.e. the part after OPENLINEAGE_EXTRACTOR_). if that happens, but for some reason it fails to register the extractor, we can at least know that it’s importing
+*Thread Reply:* the openlineage client actually tries to import the value of that env variable from position 22 (i.e. the part after OPENLINEAGE_EXTRACTOR_). if that happens, but for some reason it fails to register the extractor, we can at least know that it’s importing
@@ -45276,7 +45282,7 @@*Thread Reply:* if you add a log line, you can verify that your PYTHONPATH and env are correct
*Thread Reply:* if you add a log line, you can verify that your PYTHONPATH and env are correct
*Thread Reply:* will try that
+*Thread Reply:* will try that
@@ -45328,7 +45334,7 @@*Thread Reply:* and let you know
+*Thread Reply:* and let you know
@@ -45354,7 +45360,7 @@*Thread Reply:* ok!
+*Thread Reply:* ok!
@@ -45380,7 +45386,7 @@*Thread Reply:* @Marco Diaz can you try env variable OPENLINEAGE_EXTRACTOR_QueryOperator instead of full caps?
*Thread Reply:* @Marco Diaz can you try env variable OPENLINEAGE_EXTRACTOR_QueryOperator instead of full caps?
*Thread Reply:* Will try that too
+*Thread Reply:* Will try that too
@@ -45436,7 +45442,7 @@*Thread Reply:* Thanks for helping
+*Thread Reply:* Thanks for helping
@@ -45462,7 +45468,7 @@*Thread Reply:* @Maciej Obuchowski My setup does not allow me to submit environment variables with lowercase letters. Is the name of the variable used to register the extractor?
+*Thread Reply:* @Maciej Obuchowski My setup does not allow me to submit environment variables with lowercase letters. Is the name of the variable used to register the extractor?
@@ -45488,7 +45494,7 @@*Thread Reply:* yes, it's case sensitive...
+*Thread Reply:* yes, it's case sensitive...
@@ -45514,7 +45520,7 @@*Thread Reply:* I see
+*Thread Reply:* I see
@@ -45540,7 +45546,7 @@*Thread Reply:* So it is definitely the name of the variable. I changed the name of the operator to capitals and now it is being registered
+*Thread Reply:* So it is definitely the name of the variable. I changed the name of the operator to capitals and now it is being registered
@@ -45566,7 +45572,7 @@*Thread Reply:* Could there be a way not to make this case sensitive?
+*Thread Reply:* Could there be a way not to make this case sensitive?
@@ -45592,7 +45598,7 @@*Thread Reply:* yes - could you create issue on OpenLineage repository?
+*Thread Reply:* yes - could you create issue on OpenLineage repository?
@@ -45618,7 +45624,7 @@*Thread Reply:* sure
+*Thread Reply:* sure
@@ -45694,7 +45700,7 @@*Thread Reply:* Yes, it should. This line is the likely culprit: +
*Thread Reply:* Yes, it should. This line is the likely culprit: https://github.com/OpenLineage/OpenLineage/blob/431251d25f03302991905df2dc24357823d9c9c3/integration/common/openlineage/common/sql/parser.py#L30
*Thread Reply:* I bet if that said ['INTO','OVERWRITE'] it would work
*Thread Reply:* I bet if that said ['INTO','OVERWRITE'] it would work
*Thread Reply:* @Maciej Obuchowski do you agree? should OVERWRITE be a token we look for? if so, I can submit a short PR.
*Thread Reply:* @Maciej Obuchowski do you agree? should OVERWRITE be a token we look for? if so, I can submit a short PR.
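(To illustrate the problem under discussion - a toy keyword scan, not the actual parser.py code - showing why INSERT OVERWRITE slips through when only INTO is searched for.)

def next_token_table(sql, keywords=('INTO',)):
    # Toy version of keyword-based output table extraction
    tokens = sql.replace(';', ' ').split()
    for i, tok in enumerate(tokens[:-1]):
        if tok.upper() in keywords:
            nxt = tokens[i + 1]
            if nxt.upper() == 'TABLE' and i + 2 < len(tokens):
                return tokens[i + 2]
            return nxt
    return None

print(next_token_table("INSERT INTO foo SELECT * FROM bar"))            # -> foo
print(next_token_table("INSERT OVERWRITE TABLE foo SELECT * FROM bar")) # -> None (missed!)
print(next_token_table("INSERT OVERWRITE TABLE foo SELECT * FROM bar",
                       keywords=('INTO', 'OVERWRITE')))                 # -> foo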
*Thread Reply:* we have a better solution
+*Thread Reply:* we have a better solution
@@ -45824,7 +45830,7 @@*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/644
+*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/644
*Thread Reply:* ah! I heard there was a new SQL parser, but did not know it was imminent!
+*Thread Reply:* ah! I heard there was a new SQL parser, but did not know it was imminent!
@@ -45904,7 +45910,7 @@*Thread Reply:* I've added this case as a test and it works: https://github.com/OpenLineage/OpenLineage/blob/764dfdb885112cd0840ebc7384ff958bf20d4a70/integration/sql/tests/tests_insert.rs
+*Thread Reply:* I've added this case as a test and it works: https://github.com/OpenLineage/OpenLineage/blob/764dfdb885112cd0840ebc7384ff958bf20d4a70/integration/sql/tests/tests_insert.rs
*Thread Reply:* let me review this PR
+*Thread Reply:* let me review this PR
@@ -46073,7 +46079,7 @@*Thread Reply:* Do I have to download a new version of the openlineage-airflow python library
+*Thread Reply:* Do I have to download a new version of the openlineage-airflow python library
@@ -46099,7 +46105,7 @@*Thread Reply:* If so which version?
+*Thread Reply:* If so which version?
@@ -46125,7 +46131,7 @@*Thread Reply:* this PR isn’t merged yet 😞 so if you wanted to try this you’d have to build the python client from the sql/rust-parser-impl branch
*Thread Reply:* this PR isn’t merged yet 😞 so if you wanted to try this you’d have to build the python client from the sql/rust-parser-impl branch
*Thread Reply:* ok, np. I am not in a hurry yet. Do you have an ETA for the merge?
+*Thread Reply:* ok, np. I am not in a hurry yet. Do you have an ETA for the merge?
@@ -46177,7 +46183,7 @@*Thread Reply:* Hard to say, it’s currently in-review. Let me pull some strings, see if I can get eyes on it.
+*Thread Reply:* Hard to say, it’s currently in-review. Let me pull some strings, see if I can get eyes on it.
@@ -46203,7 +46209,7 @@*Thread Reply:* I will check again next week don't worry. I still need to make some things in my extractor work
+*Thread Reply:* I will check again next week don't worry. I still need to make some things in my extractor work
@@ -46229,7 +46235,7 @@*Thread Reply:* after it’s merged, we’ll have to do an OpenLineage release as well - perhaps next week?
+*Thread Reply:* after it’s merged, we’ll have to do an OpenLineage release as well - perhaps next week?
@@ -46259,7 +46265,7 @@*Thread Reply:* 👍
+*Thread Reply:* 👍
@@ -46311,7 +46317,7 @@*Thread Reply:* Hm - what version of dbt are you using?
+*Thread Reply:* Hm - what version of dbt are you using?
@@ -46337,7 +46343,7 @@*Thread Reply:* @Tien Nguyen The dbt schema version changes with different versions of dbt. If you have recently updated, you may have to make some changes: https://docs.getdbt.com/docs/guides/migration-guide/upgrading-to-v1.0
+*Thread Reply:* @Tien Nguyen The dbt schema version changes with different versions of dbt. If you have recently updated, you may have to make some changes: https://docs.getdbt.com/docs/guides/migration-guide/upgrading-to-v1.0
*Thread Reply:* also make sure you are on the latest version of openlineage-dbt - I believe we have made it a bit more tolerant of dbt schema changes.
*Thread Reply:* also make sure you are on the latest version of openlineage-dbt - I believe we have made it a bit more tolerant of dbt schema changes.
*Thread Reply:* @Ross Turk Thank you very much for your answer. I will update those and see if I can resolve the issues.
+*Thread Reply:* @Ross Turk Thank you very much for your answer. I will update those and see if I can resolve the issues.
@@ -46440,7 +46446,7 @@*Thread Reply:* @Ross Turk Thank you very much for your help. The latest version of dbt didn't work, but version 0.20.0 works for this problem.
+*Thread Reply:* @Ross Turk Thank you very much for your help. The latest version of dbt didn't work, but version 0.20.0 works for this problem.
@@ -46466,7 +46472,7 @@*Thread Reply:* Hmm. Interesting, I remember when dbt 1.0 came out we fixed a very similar issue: https://github.com/OpenLineage/OpenLineage/pull/397
+*Thread Reply:* Hmm. Interesting, I remember when dbt 1.0 came out we fixed a very similar issue: https://github.com/OpenLineage/OpenLineage/pull/397
*Thread Reply:* if you run pip3 list | grep openlineage-dbt, what version does it show?
*Thread Reply:* if you run pip3 list | grep openlineage-dbt, what version does it show?
*Thread Reply:* I wonder if you have somehow ended up with an older version of the integration
+*Thread Reply:* I wonder if you have somehow ended up with an older version of the integration
@@ -46584,7 +46590,7 @@*Thread Reply:* it is 0.1.0
+*Thread Reply:* it is 0.1.0
@@ -46610,7 +46616,7 @@*Thread Reply:* Is 0.1.0 an older version of openlineage?
+*Thread Reply:* Is 0.1.0 an older version of openlineage?
@@ -46636,7 +46642,7 @@*Thread Reply:* ❯ pip3 list | grep openlineage-dbt
+
❯ pip3 list | grep openlineage-dbt
openlineage-dbt 0.6.2
logger = logging.getLogger(__name__)
*Thread Reply:* the latest is 0.6.2 - that might be your issue
+*Thread Reply:* the latest is 0.6.2 - that might be your issue
@@ -46689,7 +46695,7 @@*Thread Reply:* How are you going about installing it?
+*Thread Reply:* How are you going about installing it?
@@ -46715,7 +46721,7 @@*Thread Reply:* @Ross Turk. I followed the instructions from OpenLineage: "pip3 install openlineage-dbt"
+*Thread Reply:* @Ross Turk. I followed the instructions from OpenLineage: "pip3 install openlineage-dbt"
@@ -46741,7 +46747,7 @@*Thread Reply:* Hm! Interesting. I did the same thing to get 0.6.2.
+*Thread Reply:* Hm! Interesting. I did the same thing to get 0.6.2.
@@ -46767,7 +46773,7 @@*Thread Reply:* @Ross Turk Yes. I have tried to reinstall and clear the cache but it still installs 0.1.0
+*Thread Reply:* @Ross Turk Yes. I have tried to reinstall and clear the cache but it still installs 0.1.0
@@ -46793,7 +46799,7 @@*Thread Reply:* But thanks for the version. I reinstalled 0.6.2 by specifying the version
+*Thread Reply:* But thanks for the version. I reinstalled 0.6.2 by specifying the version
@@ -46866,7 +46872,7 @@*Thread Reply:* they'll work with the new parser - added tests for those
+*Thread Reply:* they'll work with the new parser - added tests for those
*Thread Reply:* btw, thank you very much for notifying us about multiple bugs @Marco Diaz!
+*Thread Reply:* btw, thank you very much for notifying us about multiple bugs @Marco Diaz!
@@ -46943,7 +46949,7 @@*Thread Reply:* @Maciej Obuchowski thank you for making sure these cases are taken into account. I am getting more familiar with the OpenLineage code as I build my extractors. If I see anything else I will let you know. Any ETA on the new parser release date?
+*Thread Reply:* @Maciej Obuchowski thank you for making sure these cases are taken into account. I am getting more familiar with the OpenLineage code as I build my extractors. If I see anything else I will let you know. Any ETA on the new parser release date?
@@ -46969,7 +46975,7 @@*Thread Reply:* it should be a week or two, unless anything comes up
+*Thread Reply:* it should be a week or two, unless anything comes up
@@ -46995,7 +47001,7 @@*Thread Reply:* I see. Keeping my fingers crossed this is the only thing delaying me right now.
+*Thread Reply:* I see. Keeping my fingers crossed this is the only thing delaying me right now.
@@ -47047,7 +47053,7 @@*Thread Reply:* current one handles cases where you have one CTE (like this test) but not multiple - next one will handle arbitrary number of CTEs (like this test)
+*Thread Reply:* current one handles cases where you have one CTE (like this test) but not multiple - next one will handle arbitrary number of CTEs (like this test)
@@ -47099,7 +47105,7 @@*Thread Reply:* I've mentioned it before but I want to talk a bit about new SQL parser
+*Thread Reply:* I've mentioned it before but I want to talk a bit about new SQL parser
@@ -47129,7 +47135,7 @@*Thread Reply:* Will the parser be released after the 13th?
+*Thread Reply:* Will the parser be released after the 13th?
@@ -47155,7 +47161,7 @@*Thread Reply:* @Michael Robinson added an additional item to the agenda - the client transports feature that we'll have in the next release
+*Thread Reply:* @Michael Robinson added an additional item to the agenda - the client transports feature that we'll have in the next release
@@ -47185,7 +47191,7 @@*Thread Reply:* Thanks, Maciej
+*Thread Reply:* Thanks, Maciej
@@ -47241,7 +47247,7 @@*Thread Reply:* I had kind of the same question. +
*Thread Reply:* I had kind of the same question. I found https://marquezproject.github.io/marquez/openapi.html#tag/Lineage With some of the entries marked Deprecated, I am not sure how to proceed.
@@ -47269,7 +47275,7 @@*Thread Reply:* Hey folks, are you looking for the OpenAPI specification found here?
+*Thread Reply:* Hey folks, are you looking for the OpenAPI specification found here?
@@ -47295,7 +47301,7 @@*Thread Reply:* @Patrick Mol, Marquez's deprecated endpoints were the old methods for creating lineage (making jobs, dataset, and runs independently), they were deprecated because we moved over to using the OpenLineage spec for all lineage collection purposes.
+*Thread Reply:* @Patrick Mol, Marquez's deprecated endpoints were the old methods for creating lineage (making jobs, dataset, and runs independently), they were deprecated because we moved over to using the OpenLineage spec for all lineage collection purposes.
The GET methods for jobs/datasets/etc are still functional
@@ -47323,7 +47329,7 @@*Thread Reply:* Hey John,
+*Thread Reply:* Hey John,
Thanks for sharing the OpenAPI docs. I was wondering if there is any means to set up an OpenLineage API that will receive events without a consumer like Marquez, or is it essential to always pair it with a consumer to receive the events?
@@ -47351,7 +47357,7 @@*Thread Reply:* the OpenLineage integrations don’t have any way to receive events, since they’re designed to send events to other apps - what were you expecting OpenLineage to do?
+*Thread Reply:* the OpenLineage integrations don’t have any way to receive events, since they’re designed to send events to other apps - what were you expecting OpenLineage to do?
Marquez is our reference implementation of an OpenLineage consumer, but egeria also has a functional endpoint
@@ -47379,7 +47385,7 @@*Thread Reply:* Hi @John Thomas, +
*Thread Reply:* Hi @John Thomas, Would creation of Sources and Datasets have an equivalent in the OpenLineage specification? So far I only see the Inputs and Outputs in the Run Event spec.
@@ -47407,7 +47413,7 @@*Thread Reply:* Inputs and outputs in the OL spec are Datasets in the old MZ spec, so they're equivalent
+*Thread Reply:* Inputs and outputs in the OL spec are Datasets in the old MZ spec, so they're equivalent
@@ -47461,7 +47467,7 @@*Thread Reply:* Hi Marco - it looks like LivyOperator itself does derive from BaseOperator, have you seen any other errors around this problem?
+*Thread Reply:* Hi Marco - it looks like LivyOperator itself does derive from BaseOperator, have you seen any other errors around this problem?
@Maciej Obuchowski might be more help here
@@ -47489,7 +47495,7 @@*Thread Reply:* It is the operators that inherit from LivyOperator. It doesn't find the parameters like sql, connection etc
+*Thread Reply:* It is the operators that inherit from LivyOperator. It doesn't find the parameters like sql, connection etc
@@ -47515,7 +47521,7 @@*Thread Reply:* My guess is that operators that inherit from other operators (not baseoperator) will have the same problem
+*Thread Reply:* My guess is that operators that inherit from other operators (not baseoperator) will have the same problem
@@ -47541,7 +47547,7 @@*Thread Reply:* interesting! I'm not sure about that. I can look into it if I have time, but Maciej is definitely the person who would know the most.
+*Thread Reply:* interesting! I'm not sure about that. I can look into it if I have time, but Maciej is definitely the person who would know the most.
@@ -47567,7 +47573,7 @@*Thread Reply:* @Marco Diaz I wonder - perhaps it would be better to instrument spark with OpenLineage. It doesn’t seem that Airflow will know much about what’s happening underneath here. Have you looked into openlineage-spark?
*Thread Reply:* @Marco Diaz I wonder - perhaps it would be better to instrument spark with OpenLineage. It doesn’t seem that Airflow will know much about what’s happening underneath here. Have you looked into openlineage-spark?
*Thread Reply:* I have not tried that library yet. I need to see how to implement it because we have several custom Spark operators that use Livy
+*Thread Reply:* I have not tried that library yet. I need to see how to implement it because we have several custom Spark operators that use Livy
@@ -47619,7 +47625,7 @@*Thread Reply:* Do you have any examples?
+*Thread Reply:* Do you have any examples?
@@ -47645,7 +47651,7 @@*Thread Reply:* there is a good blog post from @Michael Collado: https://openlineage.io/blog/openlineage-spark/
+*Thread Reply:* there is a good blog post from @Michael Collado: https://openlineage.io/blog/openlineage-spark/
*Thread Reply:* and the doc page here has a good overview: +
*Thread Reply:* and the doc page here has a good overview: https://openlineage.io/integration/apache-spark/
*Thread Reply:* is this all we need to pass? +
*Thread Reply:* is this all we need to pass?
spark-submit --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener" \
--packages "io.openlineage:openlineage_spark:0.2.+" \
--conf "spark.openlineage.host=http://<your_ol_endpoint>" \
@@ -47771,7 +47777,7 @@
logger = logging.getLogger(__name__)
*Thread Reply:* If so, yes our operators have a way to pass configurations to spark and we may be able to implement it.
+*Thread Reply:* If so, yes our operators have a way to pass configurations to spark and we may be able to implement it.
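(Passing those settings programmatically, e.g. from a custom operator that builds the Spark session, might look like this sketch; the endpoint and namespace are placeholders, and the config keys mirror the spark-submit flags above.)

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("openlineage-example")
    # Pull the listener jar and register it, mirroring the spark-submit flags
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.6.1")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.host", "http://marquez-api:5000")  # placeholder endpoint
    .config("spark.openlineage.namespace", "my-namespace")        # placeholder namespace
    .getOrCreate()
)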
@@ -47797,7 +47803,7 @@*Thread Reply:* Looks right to me
+*Thread Reply:* Looks right to me
@@ -47823,7 +47829,7 @@*Thread Reply:* Will give it a try
+*Thread Reply:* Will give it a try
@@ -47849,7 +47855,7 @@*Thread Reply:* Do we have to install the library on the spark side or the airflow side?
+*Thread Reply:* Do we have to install the library on the spark side or the airflow side?
@@ -47875,7 +47881,7 @@*Thread Reply:* I assume it's the spark side
+*Thread Reply:* I assume it's the spark side
@@ -47901,7 +47907,7 @@*Thread Reply:* The --packages argument tells spark where to get the jar (you'll want to upgrade to 0.6.1)
*Thread Reply:* The --packages argument tells spark where to get the jar (you'll want to upgrade to 0.6.1)
*Thread Reply:* sounds good
+*Thread Reply:* sounds good
@@ -47979,7 +47985,7 @@*Thread Reply:* @Will Johnson
+*Thread Reply:* @Will Johnson
@@ -48005,7 +48011,7 @@*Thread Reply:* Hey @Varun Singh! We are building a github repository that deploys a few resources that will support a limited number of Azure data sources being pushed into Azure Purview. You can expect a public release near the end of the month! Feel free to direct message me if you'd like more details!
+*Thread Reply:* Hey @Varun Singh! We are building a github repository that deploys a few resources that will support a limited number of Azure data sources being pushed into Azure Purview. You can expect a public release near the end of the month! Feel free to direct message me if you'd like more details!
@@ -48091,7 +48097,7 @@*Thread Reply:* Are both airflow2 and Marquez installed locally on your computer?
+*Thread Reply:* Are both airflow2 and Marquez installed locally on your computer?
@@ -48117,7 +48123,7 @@*Thread Reply:* yes Marco
+*Thread Reply:* yes Marco
@@ -48143,7 +48149,7 @@*Thread Reply:* can you open marquez on +
*Thread Reply:* can you open marquez on
<http://localhost:3000>
*Thread Reply:* and get a response from +
*Thread Reply:* and get a response from
<http://localhost:5000/api/v1/namespaces>
*Thread Reply:* yes, I used this guide https://openlineage.io/getting-started and executed a POST to marquez correctly
+*Thread Reply:* yes, I used this guide https://openlineage.io/getting-started and executed a POST to marquez correctly
*Thread Reply:* In theory you should receive events in jobs under the airflow namespace
+*Thread Reply:* In theory you should receive events in jobs under the airflow namespace
@@ -48305,7 +48311,7 @@*Thread Reply:* It looks like you need to add a payment method to your DBT account
+*Thread Reply:* It looks like you need to add a payment method to your DBT account
@@ -48357,7 +48363,7 @@*Thread Reply:* It does, but admittedly not very well. It can't recognize what you're doing inside your tasks. The good news is that we're working on it and long term everything should work well.
+*Thread Reply:* It does, but admittedly not very well. It can't recognize what you're doing inside your tasks. The good news is that we're working on it and long term everything should work well.
@@ -48387,7 +48393,7 @@*Thread Reply:* Thanks for the quick reply Maciej.
+*Thread Reply:* Thanks for the quick reply Maciej.
@@ -48447,7 +48453,7 @@*Thread Reply:* Hi Sandeep,
+*Thread Reply:* Hi Sandeep,
1&3: We don't currently have Hive or Presto on the roadmap! The best way to start the conversation around them would be to create a proposal in the OpenLineage repo, outlining your thoughts on implementation and benefits.
@@ -48484,7 +48490,7 @@*Thread Reply:* Thank you John +
*Thread Reply:* Thank you John
The standard facets link to the github issues currently
@@ -48511,7 +48517,7 @@*Thread Reply:* ah here -https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md#standard-facets
+*Thread Reply:* ah here -https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md#standard-facets
@@ -48537,7 +48543,7 @@*Thread Reply:* Will check it out thank you
+*Thread Reply:* Will check it out thank you
@@ -48660,7 +48666,7 @@*Thread Reply:* is there anything in your marquez-api logs that might indicate issues?
+*Thread Reply:* is there anything in your marquez-api logs that might indicate issues?
What guide did you follow to setup the spark integration?
@@ -48688,7 +48694,7 @@*Thread Reply:* Followed this guide https://openlineage.io/integration/apache-spark/ and used the spark-defaults.conf approach
+*Thread Reply:* Followed this guide https://openlineage.io/integration/apache-spark/ and used the spark-defaults.conf approach
@@ -48714,7 +48720,7 @@*Thread Reply:* The logs from dataproc side show no errors, let me check from the marquez api side +
*Thread Reply:* The logs from dataproc side show no errors, let me check from the marquez api side. To confirm, we should be able to see the datasets from the marquez UI with the spark integration, right?
@@ -48741,7 +48747,7 @@*Thread Reply:* I'm not super familiar with the spark integration, since I work more with airflow - I'd start with looking through the readme for the spark integration here
+*Thread Reply:* I'm not super familiar with the spark integration, since I work more with airflow - I'd start with looking through the readme for the spark integration here
@@ -48767,7 +48773,7 @@*Thread Reply:* Hmm, the readme says it aims to generate the input and output datasets
+*Thread Reply:* Hmm, the readme says it aims to generate the input and output datasets
@@ -48793,7 +48799,7 @@*Thread Reply:* Are you looking at the same namespace?
+*Thread Reply:* Are you looking at the same namespace?
@@ -48819,7 +48825,7 @@*Thread Reply:* Yes, the same one where I can see the job
+*Thread Reply:* Yes, the same one where I can see the job
@@ -48845,7 +48851,7 @@*Thread Reply:* Tailing the API logs and rerunning the spark job now to hopefully catch errors if any, will ping back here
+*Thread Reply:* Tailing the API logs and rerunning the spark job now to hopefully catch errors if any, will ping back here
@@ -48871,7 +48877,7 @@*Thread Reply:* Don’t see any failures in the logs, any suggestions on how to debug this?
+*Thread Reply:* Don’t see any failures in the logs, any suggestions on how to debug this?
@@ -48897,7 +48903,7 @@*Thread Reply:* I'd next set up a basic spark notebook and see if you can't get it to send dataset information on something simple in order to check if it's a setup issue or a problem with your spark job specifically
+*Thread Reply:* I'd next set up a basic spark notebook and see if you can't get it to send dataset information on something simple in order to check if it's a setup issue or a problem with your spark job specifically
@@ -48923,7 +48929,7 @@*Thread Reply:* ok, that sounds good, will try that
+*Thread Reply:* ok, that sounds good, will try that
@@ -48949,7 +48955,7 @@*Thread Reply:* before that, I see that spark-lineage integration posts lineage to the api +
*Thread Reply:* before that, I see that spark-lineage integration posts lineage to the api https://marquezproject.github.io/marquez/openapi.html#tag/Lineage/paths/~1lineage/post We don’t seem to add a DataSet in this, does marquez internally create this “dataset” based on Output and fields?
@@ -48977,7 +48983,7 @@*Thread Reply:* yeah, you should be seeing "input" and "output" in the runEvents - that's where datasets come from
+*Thread Reply:* yeah, you should be seeing "input" and "output" in the runEvents - that's where datasets come from
@@ -49003,7 +49009,7 @@*Thread Reply:* I'm not sure if it's a problem with your specific spark job or with the integration itself, however
+*Thread Reply:* I'm not sure if it's a problem with your specific spark job or with the integration itself, however
@@ -49029,7 +49035,7 @@*Thread Reply:* By runEvents, do you mean a job Object or lineage Object ? +
*Thread Reply:* By runEvents, do you mean a job Object or lineage Object ? The integration seems to be only POSTing lineage objects
@@ -49056,7 +49062,7 @@*Thread Reply:* yep, a runEvent is the body that gets POSTed to the /lineage endpoint:
+*Thread Reply:* yep, a runEvent is the body that gets POSTed to the /lineage endpoint:
https://openlineage.io/docs/openapi/
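(Concretely, a runEvent with an output dataset might be POSTed like this; a sketch with made-up namespace, job, and dataset names, following the spec linked above.)

import uuid
from datetime import datetime, timezone

import requests

event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "producer": "https://example.com/my-producer",  # placeholder producer URI
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "my-namespace", "name": "my-job"},
    "inputs": [],
    # Marquez derives datasets from these entries - no separate "create dataset" call
    "outputs": [{"namespace": "my-namespace", "name": "my_table"}],
}
requests.post("http://localhost:5000/api/v1/lineage", json=event)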
@@ -49088,7 +49094,7 @@*Thread Reply:* > Yes, the same one where I can see the job +
*Thread Reply:* > Yes, the same one where I can see the job I think you should look at another namespace, whose name depends on what systems you're actually using
@@ -49115,7 +49121,7 @@*Thread Reply:* Shouldn’t the dataset be created in the same namespace we define in the spark properties?
+*Thread Reply:* Shouldn’t the dataset be created in the same namespace we define in the spark properties?
@@ -49141,7 +49147,7 @@*Thread Reply:* I found a few datasets in the table location, I ran it in a similar (hive metastore, gcs, sparksql and scala spark jobs) setup to the one mentioned in this post https://openlineage.slack.com/archives/C01CK9T7HKR/p1649967405659519
+*Thread Reply:* I found a few datasets in the table location, I ran it in a similar (hive metastore, gcs, sparksql and scala spark jobs) setup to the one mentioned in this post https://openlineage.slack.com/archives/C01CK9T7HKR/p1649967405659519
*Thread Reply:* This was my experience as well, I was under the impression we would see the table as a dataset. +
*Thread Reply:* This was my experience as well, I was under the impression we would see the table as a dataset. Looking forward to understanding the expected behavior
@@ -49315,7 +49321,7 @@*Thread Reply:* relevant: https://github.com/OpenLineage/OpenLineage/issues/435
+*Thread Reply:* relevant: https://github.com/OpenLineage/OpenLineage/issues/435
*Thread Reply:* Ah! Thank you both for confirming this! And it's great to see the proposal, Maciej!
+*Thread Reply:* Ah! Thank you both for confirming this! And it's great to see the proposal, Maciej!
@@ -49397,7 +49403,7 @@*Thread Reply:* Is there a timeline around when we can expect this fix?
+*Thread Reply:* Is there a timeline around when we can expect this fix?
@@ -49423,7 +49429,7 @@*Thread Reply:* Not a simple fix, but I guess we'll start working on this relatively soon.
+*Thread Reply:* Not a simple fix, but I guess we'll start working on this relatively soon.
@@ -49449,7 +49455,7 @@*Thread Reply:* I see, thanks for the update! We are very much interested in this feature.
+*Thread Reply:* I see, thanks for the update! We are very much interested in this feature.
@@ -49566,7 +49572,7 @@*Thread Reply:* You probably need to change dataset from default
+*Thread Reply:* You probably need to change dataset from default
@@ -49592,7 +49598,7 @@*Thread Reply:* I clicked on everything 😕 I manually created a namespace (joining the pod and sending curl to the marquez local endpoint) to check if there is a network issue, and I was ok. I created a namespace called data-dev. The airflow is mounted over k8s using the helm chart.
+
*Thread Reply:* I clicked on everything 😕 I manually created a namespace (joining the pod and sending curl to the marquez local endpoint) to check if there is a network issue, and I was ok. I created a namespace called data-dev. The airflow is mounted over k8s using the helm chart.
``` config:
  AIRFLOW__WEBSERVER__BASE_URL: "http://airflow.dev.test.io"
  PYTHONPATH: "/opt/airflow/dags/repo/config"
@@ -49644,7 +49650,7 @@ logger = logging.getLogger(__name__)
*Thread Reply:* I think the answer is somewhere in the airflow logs 🙂 +
For some reason, OpenLineage events aren't sent to Marquez.
@@ -49671,7 +49677,7 @@*Thread Reply:* Thanks, in the end it was my error... I created a dummy dag to see if maybe it's an issue with the dag, and now I can see something in Marquez
+*Thread Reply:* Thanks, in the end it was my error... I created a dummy dag to see if maybe it's an issue with the dag, and now I can see something in Marquez
*Thread Reply:* That's more of a Marquez question 🙂 +
*Thread Reply:* That's more of a Marquez question 🙂 We have a long-standing issue to add that API https://github.com/MarquezProject/marquez/issues/1736
@@ -49759,7 +49765,7 @@*Thread Reply:* I see it already got skipped for 2 releases, and my only conclusion is that people using Marquez don't make mistakes - ergo, API not needed 🙂 Lets see if I can stick around the project long enough to offer a bit of help, now I just need to showcase it and get interest in my org.
+*Thread Reply:* I see it already got skipped for 2 releases, and my only conclusion is that people using Marquez don't make mistakes - ergo, API not needed 🙂 Lets see if I can stick around the project long enough to offer a bit of help, now I just need to showcase it and get interest in my org.
@@ -49794,7 +49800,7 @@*Thread Reply:* Namespace is defined by you, it does not have to be name of the spark cluster.
+*Thread Reply:* Namespace is defined by you, it does not have to be name of the spark cluster.
@@ -49945,7 +49951,7 @@*Thread Reply:* And I definitely recommend using a newer version than 0.2.+ 🙂
*Thread Reply:* And I definitely recommend using a newer version than 0.2.+ 🙂
*Thread Reply:* oh I see that someone mentioned that it has to be replaced with the name of the spark cluster
+*Thread Reply:* oh I see that someone mentioned that it has to be replaced with the name of the spark cluster
@@ -49997,7 +50003,7 @@*Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1634089656188400?thread_ts=1634085740.187700&cid=C01CK9T7HKR
+*Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1634089656188400?thread_ts=1634085740.187700&cid=C01CK9T7HKR
*Thread Reply:* @Maciej Obuchowski may I know if I can add the --packages "io.openlineage:openlineage-spark:0.2.+" as part of the spark jar file, that is, as part of the pom.xml
*Thread Reply:* @Maciej Obuchowski may I know if I can add the --packages "io.openlineage:openlineage-spark:0.2.+" as part of the spark jar file, that is, as part of the pom.xml
*Thread Reply:* I think it needs to run on the driver
+*Thread Reply:* I think it needs to run on the driver
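For anyone taking the pom.xml route asked about above, a minimal sketch of what that dependency could look like (groupId/artifactId/version are illustrative; verify the current openlineage-spark coordinates on Maven Central):

```
<!-- Sketch: bundle the OpenLineage Spark integration as a project dependency
     instead of passing it via the packages flag. Version is illustrative. -->
<dependency>
  <groupId>io.openlineage</groupId>
  <artifactId>openlineage-spark</artifactId>
  <version>0.8.2</version>
</dependency>
```

Even bundled this way, the listener still has to be registered on the driver, e.g. by setting spark.extraListeners to io.openlineage.spark.agent.OpenLineageSparkListener, which matches the note above.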
@@ -50145,7 +50151,7 @@*Thread Reply:* > OpenLineage API is very limited in attributes that can be passed. +
*Thread Reply:* > OpenLineage API is very limited in attributes that can be passed. Can you specify where you think it's limited? The way to solve those problems would be to evolve OpenLineage.
> One practical question/example: how do we create a job of type STREAMING, @@ -50177,7 +50183,7 @@
*Thread Reply:* Hey Maciej
+*Thread Reply:* Hey Maciej
first off - thanks for being active on the channel!
@@ -50214,7 +50220,7 @@*Thread Reply:* > I just gave an example of how would you express a specific job type creation +
*Thread Reply:* > I just gave an example of how you would express a specific job type creation Yes, but you're trying to achieve something by passing this parameter or creating a job in a certain way. We're trying to cover everything in the OpenLineage API. Even if we don't have everything, the spec has been designed from the beginning to allow emitting custom data via the custom facet mechanism.
> I have the feeling I'm still missing some key concepts on how OpenLineage is designed. @@ -50244,7 +50250,7 @@
*Thread Reply:* > Any pointers are welcome! +
*Thread Reply:* > Any pointers are welcome! BTW: OpenLineage is an open standard. Everyone is welcome to contribute and discuss. Every feedback ultimately helps us build better systems.
@@ -50271,7 +50277,7 @@*Thread Reply:* I agree, but for now I'm more likely to be in the I didn't get it category, and not in the brilliant new idea category 🙂
+*Thread Reply:* I agree, but for now I'm more likely to be in the I didn't get it category, and not in the brilliant new idea category 🙂
My temporary goal is to go over the documentation and write up the gaps that confused me (and the solutions), and maybe publish that as an article for a wider audience. So far I realized that: • I don't get the naming convention - the Naming examples made it clearer why it matters, but more info is needed @@ -50339,7 +50345,7 @@
*Thread Reply:* Driver logs from Databricks:
+*Thread Reply:* Driver logs from Databricks:
```22/04/21 14:05:07 INFO EventEmitter: Lineage completed successfully: ResponseMessage(responseCode=200, body={}, error=null) {"eventType":"COMPLETE",[...], "schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunEvent"}
@@ -50391,7 +50397,7 @@*Thread Reply:* @Karatuğ Ozan BİRCAN are you running on Spark 3.2? If yes, then new release should have fixed your problem: https://github.com/OpenLineage/OpenLineage/issues/609
+*Thread Reply:* @Karatuğ Ozan BİRCAN are you running on Spark 3.2? If yes, then new release should have fixed your problem: https://github.com/OpenLineage/OpenLineage/issues/609
@@ -50417,7 +50423,7 @@*Thread Reply:* Spark 3.1.2 with Scala 2.12
+*Thread Reply:* Spark 3.1.2 with Scala 2.12
@@ -50443,7 +50449,7 @@*Thread Reply:* In fact, I couldn't make it work in Spark 3.2. But I'll test it again. Thanks for the info.
+*Thread Reply:* In fact, I couldn't make it work in Spark 3.2. But I'll test it again. Thanks for the info.
@@ -50469,7 +50475,7 @@*Thread Reply:* Has this been resolved? +
*Thread Reply:* Has this been resolved? I am facing the same issue with spark 3.2.
@@ -50522,7 +50528,7 @@*Thread Reply:* I don't think that the facets are particularly strongly defined, but I would expect that it could be possible to see both on a pythonOperator that's executing SQL queries, depending on how the extractor was written
+*Thread Reply:* I don't think that the facets are particularly strongly defined, but I would expect that it could be possible to see both on a pythonOperator that's executing SQL queries, depending on how the extractor was written
@@ -50548,7 +50554,7 @@*Thread Reply:* ah sure, that makes sense
+*Thread Reply:* ah sure, that makes sense
@@ -50600,7 +50606,7 @@*Thread Reply:* We're actively working on it - expect it in next OpenLineage release. https://github.com/OpenLineage/OpenLineage/pull/645
+*Thread Reply:* We're actively working on it - expect it in next OpenLineage release. https://github.com/OpenLineage/OpenLineage/pull/645
*Thread Reply:* nice - thanks!
+*Thread Reply:* nice - thanks!
@@ -50686,7 +50692,7 @@*Thread Reply:* Assuming we don't need to do anything except using the next update? Or do you expect that we need to change quite a lot of configs?
+*Thread Reply:* Assuming we don't need to do anything except using the next update? Or do you expect that we need to change quite a lot of configs?
@@ -50712,7 +50718,7 @@*Thread Reply:* No, it should be automatic.
+*Thread Reply:* No, it should be automatic.
@@ -50766,7 +50772,7 @@*Thread Reply:* Hey Will,
+*Thread Reply:* Hey Will,
while I would not consider myself in the team, I'm dabbling in OL, hitting walls and learning as I go. If I don't have enough experience to contribute, I'd be happy to at least proof-read and point out things which are not clear from a novice perspective. Let me know!
@@ -50798,7 +50804,7 @@*Thread Reply:* I'll hold you to that @Mirko Raca 😉
+*Thread Reply:* I'll hold you to that @Mirko Raca 😉
@@ -50824,7 +50830,7 @@*Thread Reply:* I will support! I’ve done a few recent presentations on the internals of OpenLineage that might also be useful - maybe some diagrams can be reused.
+*Thread Reply:* I will support! I’ve done a few recent presentations on the internals of OpenLineage that might also be useful - maybe some diagrams can be reused.
@@ -50850,7 +50856,7 @@*Thread Reply:* Any chance you have links to those old presentations? Would be great to build off of an existing one and then update for some of the new naming conventions.
+*Thread Reply:* Any chance you have links to those old presentations? Would be great to build off of an existing one and then update for some of the new naming conventions.
@@ -50876,7 +50882,7 @@*Thread Reply:* the most recent one was an astronomer webinar
+*Thread Reply:* the most recent one was an astronomer webinar
happy to share the slides with you if you want 👍 here’s a PDF:
@@ -50917,7 +50923,7 @@*Thread Reply:* the other ones have not been public, unfortunately 😕
+*Thread Reply:* the other ones have not been public, unfortunately 😕
@@ -50943,7 +50949,7 @@*Thread Reply:* architecture, object model, run lifecycle, naming conventions == the basics IMO
+*Thread Reply:* architecture, object model, run lifecycle, naming conventions == the basics IMO
@@ -50969,7 +50975,7 @@*Thread Reply:* Thank you so much, Ross! This is a great base to work from.
+*Thread Reply:* Thank you so much, Ross! This is a great base to work from.
@@ -51081,7 +51087,7 @@*Thread Reply:* For a spark job like that, you'd have at least four events:
+*Thread Reply:* For a spark job like that, you'd have at least four events:
*Thread Reply:* Hi @Will Johnson good evening. We are seeing an issue while using the spark integration: when we provide the openlineage.host property a value like <http://lineage.com/common/marquez>
where my marquez api is running, I see that the line below modifies the host to become <http://lineage.com/api/v1/lineage>
instead of <http://lineage.com/common/marquez/api/v1/lineage>
which is causing the problem
+
*Thread Reply:* Hi @Will Johnson good evening. We are seeing an issue while using the spark integration: when we provide the openlineage.host property a value like <http://lineage.com/common/marquez>
where my marquez api is running, I see that the line below modifies the host to become <http://lineage.com/api/v1/lineage>
instead of <http://lineage.com/common/marquez/api/v1/lineage>
which is causing the problem
https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/src/main/common/java/io/openlineage/spark/agent/EventEmitter.java#L49
I see that it was added 5 months ago and released as part of 0.4.0; is there any way we can fix the line to be like the below?
this.lineageURI =
@@ -51175,7 +51181,7 @@
logger = logging.getLogger(__name__)
*Thread Reply:* Can you open up a Github issue for this? I had this same issue, and so our implementation always has to feature the /api/v1/lineage. The host config is literally just the host, but you're specifying a host and a path. I'd be happy to see greater flexibility with the api endpoint, but the /v1/ is important to know which version of OpenLineage's specification you're communicating with.
+*Thread Reply:* Can you open up a Github issue for this? I had this same issue, and so our implementation always has to feature the /api/v1/lineage. The host config is literally just the host, but you're specifying a host and a path. I'd be happy to see greater flexibility with the api endpoint, but the /v1/ is important to know which version of OpenLineage's specification you're communicating with.
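To make the behavior concrete, here is a rough, illustrative reconstruction (not the actual EventEmitter code linked above) of how the endpoint ends up being built, which is why the /common/marquez path segment disappears:

```
from urllib.parse import urlparse

# Illustrative sketch of the behavior discussed above (not the real code):
host = "http://lineage.com/common/marquez"  # value given to openlineage.host
version = "1"                               # value given to openlineage.version

parsed = urlparse(host)
endpoint = f"{parsed.scheme}://{parsed.netloc}/api/v{version}/lineage"
print(endpoint)  # -> http://lineage.com/api/v1/lineage (the /common/marquez path is dropped)
```

Until the endpoint is configurable, the practical workaround is to expose the Marquez API at the root of the host.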
@@ -51227,7 +51233,7 @@*Thread Reply:* @Michael Collado made one for a recent webinar:
+*Thread Reply:* @Michael Collado made one for a recent webinar:
https://gist.github.com/collado-mike/d1854958b7b1672f5a494933f80b8b58
@@ -51255,7 +51261,7 @@*Thread Reply:* it's not exactly for an operator that has source-destination, but it shows how to format lineage events for a few different kinds of datasets
+*Thread Reply:* it's not exactly for an operator that has source-destination, but it shows how to format lineage events for a few different kinds of datasets
@@ -51281,7 +51287,7 @@*Thread Reply:* Thanks! I'm going to take a look
+*Thread Reply:* Thanks! I'm going to take a look
@@ -51337,7 +51343,7 @@*Thread Reply:* Thanks for your input. The release is authorized. Look for it tomorrow!
+*Thread Reply:* Thanks for your input. The release is authorized. Look for it tomorrow!
@@ -51390,7 +51396,7 @@*Thread Reply:* What's the spark job that's running - this looks similar to an error that can happen when jobs have a very short lifecycle
+*Thread Reply:* What's the spark job that's running - this looks similar to an error that can happen when jobs have a very short lifecycle
@@ -51416,7 +51422,7 @@*Thread Reply:* nothing in the spark job, it's just a simple csv to parquet conversion file
+*Thread Reply:* nothing in the spark job, it's just a simple csv to parquet conversion file
@@ -51442,7 +51448,7 @@*Thread Reply:* ah yeah that's probably it - when the job finishes before the Openlineage integration can poll it for information, this error is thrown. Since the job is very quick, it creates a race condition
+*Thread Reply:* ah yeah that's probably it - when the job finishes before the Openlineage integration can poll it for information, this error is thrown. Since the job is very quick, it creates a race condition
@@ -51472,7 +51478,7 @@*Thread Reply:* @John Thomas may I know how to solve this kind of issue?
+*Thread Reply:* @John Thomas may I know how to solve this kind of issue?
@@ -51498,7 +51504,7 @@*Thread Reply:* This is probably an issue with the integration - for now you can either open an issue, or see if you're still getting a subset of events and take it as is. I'm not sure what you could do on your end aside from adding a sleep call or similar
+*Thread Reply:* This is probably an issue with the integration - for now you can either open an issue, or see if you're still getting a subset of events and take it as is. I'm not sure what you could do on your end aside from adding a sleep call or similar
@@ -51524,7 +51530,7 @@*Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/src/main/common/java/io/openlineage/spark/agent/OpenLineageSparkListener.java#L151 do you mean that if we add a sleep in this method, it will solve this?
+*Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/src/main/common/java/io/openlineage/spark/agent/OpenLineageSparkListener.java#L151 do you mean that if we add a sleep in this method, it will solve this?
*Thread Reply:* oh no I meant making sure your jobs don't close too quickly
+*Thread Reply:* oh no I meant making sure your jobs don't close too quickly
@@ -51601,7 +51607,7 @@*Thread Reply:* Hi @John Thomas we figured out the error - it was indeed caused by conflicting versions. With shadowJar and shading, we are not seeing it anymore.
+*Thread Reply:* Hi @John Thomas we figured out the error - it was indeed caused by conflicting versions. With shadowJar and shading, we are not seeing it anymore.
@@ -51661,7 +51667,7 @@*Thread Reply:* Amazing work on the new sql parser @Maciej Obuchowski 💯 :firstplacemedal:
+*Thread Reply:* Amazing work on the new sql parser @Maciej Obuchowski 💯 :firstplacemedal:
@@ -51747,7 +51753,7 @@*Thread Reply:* Hi @Hubert Dulay,
+*Thread Reply:* Hi @Hubert Dulay,
while I'm not an expert, I can offer the following:
• Marquez has had the logger = logging.getLogger(name)
*Thread Reply:* Thanks for the details
+*Thread Reply:* Thanks for the details
@@ -51831,7 +51837,7 @@*Thread Reply:* Hi Kostikey - this blog has an example with Spark and jupyter, which might be a good place to start!
+*Thread Reply:* Hi Kostikey - this blog has an example with Spark and jupyter, which might be a good place to start!
*Thread Reply:* Hi @John Thomas, thanks for the reply. I think I am close but my cluster is unable to talk to the marquez server. After looking at log4j I see the following rows:
+*Thread Reply:* Hi @John Thomas, thanks for the reply. I think I am close but my cluster is unable to talk to the marquez server. After looking at log4j I see the following rows:
22/05/02 18:43:39 INFO SparkContext: Registered listener io.openlineage.spark.agent.OpenLineageSparkListener
22/05/02 18:43:40 INFO EventEmitter: Init OpenLineageContext: Args: ArgumentParser(host=<http://135.170.226.91:8400>, version=v1, namespace=gus-namespace, jobName=default, parentRunId=null, apiKey=Optional.empty, urlParams=Optional[{}]) URI: <http://135.170.226.91:8400/api/v1/lineage>?
@@ -51955,7 +51961,7 @@
logger = logging.getLogger(__name__)
*Thread Reply:* hmmm, interesting - if it's easy could you spin both up locally and check that it's just a communication issue? It helps with diagnosis
+*Thread Reply:* hmmm, interesting - if it's easy could you spin both up locally and check that it's just a communication issue? It helps with diagnosis
It might also be a firewall issue, but your cURL should preclude that
@@ -51983,7 +51989,7 @@*Thread Reply:* Since it's Databricks, I was having a hard time figuring out how to try it locally. Other than just using plain ol' spark on my laptop and a localhost Marquez...
+*Thread Reply:* Since it's Databricks, I was having a hard time figuring out how to try it locally. Other than just using plain ol' spark on my laptop and a localhost Marquez...
@@ -52009,7 +52015,7 @@*Thread Reply:* hmm, that could be an interesting test to see if it's a databricks issue - the databricks integration is pretty much the same as the spark integration, just with a little bit of a wrapper and the init script
+*Thread Reply:* hmm, that could be an interesting test to see if it's a databricks issue - the databricks integration is pretty much the same as the spark integration, just with a little bit of a wrapper and the init script
@@ -52035,7 +52041,7 @@*Thread Reply:* yeah, I was going to try that, but it just didn't seem like helpful troubleshooting for exactly that reason... but I may just do that anyway, just so I can see something working 🙂 (morale booster)
+*Thread Reply:* yeah, I was going to try that, but it just didn't seem like helpful troubleshooting for exactly that reason... but I may just do that anyway, just so I can see something working 🙂 (morale booster)
@@ -52061,7 +52067,7 @@*Thread Reply:* oh totally! Network issues are a huge pain in the ass, and if you're still seeing issues locally with spark/mz then we'll know a lot more than we do now 🙂
+*Thread Reply:* oh totally! Network issues are a huge pain in the ass, and if you're still seeing issues locally with spark/mz then we'll know a lot more than we do now 🙂
@@ -52087,7 +52093,7 @@*Thread Reply:* sounds good, i will give it a go!
+*Thread Reply:* sounds good, i will give it a go!
@@ -52113,7 +52119,7 @@*Thread Reply:* @Kostikey Mustakas - I think spark.openlineage.version should be equal to 1 not v1.
+*Thread Reply:* @Kostikey Mustakas - I think spark.openlineage.version should be equal to 1 not v1.
In addition, is http://135.170.226.91:8400 accessible to Databricks? Could you try doing a %sh command inside of a databricks notebook and see if you can ping that IP address (https://linux.die.net/man/8/ping)?
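For reference, the cluster Spark config this thread converges on would look roughly like the following (host and namespace are taken from the log above; treat the values as illustrative, and note version is 1, not v1):

```
spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.host http://135.170.226.91:8400
spark.openlineage.version 1
spark.openlineage.namespace gus-namespace
```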
@@ -52164,7 +52170,7 @@*Thread Reply:* Ya know, I meant to ask about that. Docs say 1 like you mention: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/databricks. I second-guessed based on this thread https://openlineage.slack.com/archives/C01CK9T7HKR/p1638848249159700.
+*Thread Reply:* Ya know, I meant to ask about that. Docs say 1 like you mention: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/databricks. I second-guessed based on this thread https://openlineage.slack.com/archives/C01CK9T7HKR/p1638848249159700.
*Thread Reply:* @Will Johnson, ping fails... this is surprising as the curl command mentioned above works fine.
+*Thread Reply:* @Will Johnson, ping fails... this is surprising as the curl command mentioned above works fine.
@@ -52251,7 +52257,7 @@*Thread Reply:* I’m also trying to set up Databricks according to Running Marquez on AWS. Right now I’m stuck on the database part rather than the Marquez part — I can’t connect my EKS cluster to the RDS database which I described in more detail on the Marquez slack.
+*Thread Reply:* I’m also trying to set up Databricks according to Running Marquez on AWS. Right now I’m stuck on the database part rather than the Marquez part — I can’t connect my EKS cluster to the RDS database which I described in more detail on the Marquez slack.
@Kostikey Mustakas Sorry for the distraction, but I’m curious how you have set up your networking to make the API requests work with Databricks. Good luck with your issue!
@@ -52280,7 +52286,7 @@*Thread Reply:* @Julius Rentergent We are using Azure and leverage Private Endpoints to connect resources in separate subscriptions. There is a Bastion proxy in place that we can map http traffic through, and I have a Load Balancer Inbound NAT rule I set up that maps one of our whitelisted port ranges (8400) to 5000.
+*Thread Reply:* @Julius Rentergent We are using Azure and leverage Private Endpoints to connect resources in separate subscriptions. There is a Bastion proxy in place that we can map http traffic through, and I have a Load Balancer Inbound NAT rule I set up that maps one of our whitelisted port ranges (8400) to 5000.
@@ -52310,7 +52316,7 @@*Thread Reply:* @Will Johnson a little progress maybe... I created a private endpoint and updated dns to point to it. Now I get a 404 Not Found error instead of a timeout
+*Thread Reply:* @Will Johnson a little progress maybe... I created a private endpoint and updated dns to point to it. Now I get a 404 Not Found error instead of a timeout
@@ -52340,7 +52346,7 @@*Thread Reply:* 22/05/03 00:09:24 ERROR EventEmitter: Could not emit lineage [responseCode=404]: {"eventType":"START","eventTime":"2022-05-03T00:09:22.498Z","run":{"runId":"f41575a0-e59d-4cbc-a401-9b52d2b020e0","facets":{"spark_version":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","spark-version":"3.2.1","openlineage_spark_version":"0.8.1"},"spark.logicalPlan":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","plan":[{"class":"org.apache.spark.sql.catalyst.plans.logical.ShowNamespaces","num-children":1,"namespace":0,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"databaseName","dataType":"string","nullable":false,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":4,"jvmId":"aad3656d_8903_4db3_84f0_fe6d773d71c3"},"qualifier":[]}]]},{"class":"org.apache.spark.sql.catalyst.analysis.ResolvedNamespace","num_children":0,"catalog":null,"namespace":[]}]},"spark_unknown":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","output":{"description":"Unable to serialize logical plan due to: Infinite recursion (StackOverflowError) (through reference chain: org.apache.spark.sql.catalyst.expressions.AttributeReference[\"preCanonicalized\"] ....
+
22/05/03 00:09:24 ERROR EventEmitter: Could not emit lineage [responseCode=404]: {"eventType":"START","eventTime":"2022-05-03T00:09:22.498Z","run":{"runId":"f41575a0-e59d-4cbc-a401-9b52d2b020e0","facets":{"spark_version":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","spark-version":"3.2.1","openlineage_spark_version":"0.8.1"},"spark.logicalPlan":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","plan":[{"class":"org.apache.spark.sql.catalyst.plans.logical.ShowNamespaces","num-children":1,"namespace":0,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"databaseName","dataType":"string","nullable":false,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":4,"jvmId":"aad3656d_8903_4db3_84f0_fe6d773d71c3"},"qualifier":[]}]]},{"class":"org.apache.spark.sql.catalyst.analysis.ResolvedNamespace","num_children":0,"catalog":null,"namespace":[]}]},"spark_unknown":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","output":{"description":"Unable to serialize logical plan due to: Infinite recursion (StackOverflowError) (through reference chain: org.apache.spark.sql.catalyst.expressions.AttributeReference[\"preCanonicalized\"] ....
OpenLineageHttpException(code=null, message={"code":404,"message":"HTTP 404 Not Found"}, details=null)
at io.openlineage.spark.agent.EventEmitter.emit(EventEmitter.java:68)
logger = logging.getLogger(__name__)
*Thread Reply:* Following up on this as I encounter the same issue with the Openlineage Databricks integration. This issue seems quite pernicious, as it crashes the Spark Context and requires a restart.
+*Thread Reply:* Following up on this as I encounter the same issue with the Openlineage Databricks integration. This issue seems quite pernicious, as it crashes the Spark Context and requires a restart.
I have marquez running on AWS EKS; I’m using Openlineage 0.8.2 on Databricks 10.4 (Spark 3.2.1) and my Spark config looks like this:
spark.openlineage.host <https://internal-xxxxxxxxxxxxxxxxxxxxxxx-xxxxxxxxxxx.us-east-1.elb.amazonaws.com>
@@ -52410,7 +52416,7 @@
logger = logging.getLogger(__name__)
*Thread Reply:* I tried two more things: +
*Thread Reply:* I tried two more things:
• curl works, ping fails, just like in the previous report
• Databricks allows providing spark configs without quotes, whereas quotes are generally required for Spark. So I added the quotes to the host name, but now I’m getting: ERROR OpenLineageSparkListener: Unable to parse open lineage endpoint. Lineage events will not be collected
*Thread Reply:* @Kostikey Mustakas May I ask what is the reason for migration from Palantir? Sorry for this off-topic question!
+*Thread Reply:* @Kostikey Mustakas May I ask what is the reason for migration from Palantir? Sorry for this off-topic question!
@@ -52464,7 +52470,7 @@*Thread Reply:* @Julius Rentergent created issue on project github: https://github.com/OpenLineage/OpenLineage/issues/795
+*Thread Reply:* @Julius Rentergent created issue on project github: https://github.com/OpenLineage/OpenLineage/issues/795
*Thread Reply:* Thank you @Maciej Obuchowski. +
*Thread Reply:* Thank you @Maciej Obuchowski. Just to clarify, the Spark Context crashes with and without port; it’s just that adding the port causes it to crash more quickly (on the 1st attempt).
I will run some more experiments when I have time, and add the results to the ticket.
@@ -52616,7 +52622,7 @@*Thread Reply:* @Michael Robinson - just to be sure, is the 5/19 meeting at 10 AM PT as well?
+*Thread Reply:* @Michael Robinson - just to be sure, is the 5/19 meeting at 10 AM PT as well?
@@ -52642,7 +52648,7 @@*Thread Reply:* Yes, and I’ll update the msg for others. Thank you
+*Thread Reply:* Yes, and I’ll update the msg for others. Thank you
@@ -52668,7 +52674,7 @@*Thread Reply:* Thank you!
+*Thread Reply:* Thank you!
@@ -52746,7 +52752,7 @@*Thread Reply:* It seems we can't for now. This was the same question I had last week:
+*Thread Reply:* It seems we can't for now. This was the same question I had last week:
https://github.com/MarquezProject/marquez/issues/1736
*Thread Reply:* Seems that it's a really popular request 🙂
+*Thread Reply:* Seems that it's a really popular request 🙂
@@ -52876,7 +52882,7 @@*Thread Reply:* @Mirko Raca pointed out that I was missing eventType.
*Thread Reply:* @Mirko Raca pointed out that I was missing eventType.
Mirko Raca:
"From a quick glance - you're missing the "eventType": "START" attribute. It's also worth noting that metadata typically shows up after the second event (type COMPLETE)"
*Thread Reply:* As far as I understand, OpenLineage has tools to extract metadata from sources. Depending on your source, you may find an existing integration; if one doesn't exist, you should write your own integration (and collaborate with the project)
+*Thread Reply:* As far as I understand, OpenLineage has tools to extract metadata from sources. Depending on your source, you may find an existing integration; if one doesn't exist, you should write your own integration (and collaborate with the project)
@@ -52963,7 +52969,7 @@*Thread Reply:* @Sandeep Bhat take a look at https://openlineage.io/integration - there is some info there on the different integrations that can be used to automatically pull metadata.
+*Thread Reply:* @Sandeep Bhat take a look at https://openlineage.io/integration - there is some info there on the different integrations that can be used to automatically pull metadata.
*Thread Reply:* The Airflow integration, in particular, uses a SQL parser to determine input/output tables (in cases where the data store can't be queried for that info)
+*Thread Reply:* The Airflow integration, in particular, uses a SQL parser to determine input/output tables (in cases where the data store can't be queried for that info)
@@ -53062,7 +53068,7 @@*Thread Reply:* To my understanding - datasets model the structure, not the content. So, as long as your table doesn't change its number of columns, it's the same thing.
+*Thread Reply:* To my understanding - datasets model the structure, not the content. So, as long as your table doesn't change its number of columns, it's the same thing.
The catch-all would be to create a Dataset facet which would record the distinction between append/overwrite per run. But, while this is supported by the standard, Marquez does not handle custom facets at the moment (I'll happily be corrected).
@@ -53090,7 +53096,7 @@*Thread Reply:* Thanks, that makes sense. We're looking for a way to get the lineage of table contents. We may have to opt for new names on overwrite, or indeed extend a facet to flag these.
+*Thread Reply:* Thanks, that makes sense. We're looking for a way to get the lineage of table contents. We may have to opt for new names on overwrite, or indeed extend a facet to flag these.
@@ -53116,7 +53122,7 @@*Thread Reply:* The use case is compliance, where we need to show how a certain delivered data product (at a given point in time) was constructed. We have all our transforms/transfers as code, but there are a few parts where datasets get recreated in the process after fixes have been made, and I wouldn't want to bother the auditors with those stray paths
+*Thread Reply:* The use case is compliance, where we need to show how a certain delivered data product (at a given point in time) was constructed. We have all our transforms/transfers as code, but there are a few parts where datasets get recreated in the process after fixes have been made, and I wouldn't want to bother the auditors with those stray paths
@@ -53142,7 +53148,7 @@*Thread Reply:* We have LifecycleStateChangeDataset facet that captures this information. It's currently emitted when using Spark integration
+*Thread Reply:* We have LifecycleStateChangeDataset facet that captures this information. It's currently emitted when using Spark integration
@@ -53168,7 +53174,7 @@*Thread Reply:* > But, while this is supported by the standard, Marquez does not handle custom facets at the moment (I'll happily be corrected). +
*Thread Reply:* > But, while this is supported by the standard, Marquez does not handle custom facets at the moment (I'll happily be corrected). It displays this information when it exists
@@ -53199,7 +53205,7 @@*Thread Reply:* Oh that looks perfect! I completely missed that, thanks!
+*Thread Reply:* Oh that looks perfect! I completely missed that, thanks!
@@ -53251,7 +53257,7 @@*Thread Reply:* Work with Spark is not yet fully merged
+*Thread Reply:* Work with Spark is not yet fully merged
@@ -53367,7 +53373,7 @@*Thread Reply:* I think you can either use proxy backend: https://github.com/OpenLineage/OpenLineage/tree/main/proxy
+*Thread Reply:* I think you can either use proxy backend: https://github.com/OpenLineage/OpenLineage/tree/main/proxy
or configure OL client to send data to kafka: https://github.com/OpenLineage/OpenLineage/tree/main/client/python#kafka
@@ -53400,7 +53406,7 @@*Thread Reply:* Thank you very much for the useful pointers. The proxy solution could indeed work in our case, but it implies creating another service in front of Kafka, and thus another layer of complexity in the architecture. If there is a more "native" way of streaming events directly from the Airflow backend, that'd be great to know
+*Thread Reply:* Thank you very much for the useful pointers. The proxy solution could indeed work in our case, but it implies creating another service in front of Kafka, and thus another layer of complexity in the architecture. If there is a more "native" way of streaming events directly from the Airflow backend, that'd be great to know
@@ -53426,7 +53432,7 @@*Thread Reply:* The second link 😉
+*Thread Reply:* The second link 😉
@@ -53452,7 +53458,7 @@*Thread Reply:* Sure, we already implemented the python client for jobs outside airflow and it works great 🙂 +
*Thread Reply:* Sure, we already implemented the python client for jobs outside airflow and it works great 🙂 You are saying that there is a way to use this python client in conjunction with the MWAA lineage backend to relay the job events that come with the airflow integration (without including it in the DAGs)? Our strategy is to use both the airflow backend to collect automatic lineage events without modifying any existing DAGs, and the in-code implementation to allow our data engineers to send their own events if they want to. The second option works perfectly, but the first one is where we struggle a bit, especially with MWAA.
@@ -53481,7 +53487,7 @@*Thread Reply:* If you can mount a file to MWAA, then yes - it should work with the config file option: https://github.com/OpenLineage/OpenLineage/tree/main/client/python#config-file
+*Thread Reply:* If you can mount a file to MWAA, then yes - it should work with the config file option: https://github.com/OpenLineage/OpenLineage/tree/main/client/python#config-file
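A minimal sketch of such a config file, following the client README linked above (the URL is a placeholder; the file just needs to be readable by the MWAA workers):

```
# openlineage.yml - path and URL are illustrative
transport:
  type: http
  url: http://marquez.example.com:5000
```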
@@ -53507,7 +53513,7 @@*Thread Reply:* Brilliant! I'm going to test that. Thank you Maciej!
+*Thread Reply:* Brilliant! I'm going to test that. Thank you Maciej!
@@ -53663,7 +53669,7 @@*Thread Reply:* Hey Tyler - A custom extractor just needs to be able to assemble the runEvents and send the information out to the lineage backends.
+*Thread Reply:* Hey Tyler - A custom extractor just needs to be able to assemble the runEvents and send the information out to the lineage backends.
If the things you're sending/receiving with TaskFlow are accessible in terms of metadata in the environment the DAG is running in, then you should be able to make one that would work!
@@ -53695,7 +53701,7 @@*Thread Reply:* Taskflow internally is just PythonOperator. If you write an extractor that assumes something more than it just being a PythonOperator, then you could probably make it work 🙂
+*Thread Reply:* Taskflow internally is just PythonOperator. If you write an extractor that assumes something more than it just being a PythonOperator, then you could probably make it work 🙂
@@ -53721,7 +53727,7 @@*Thread Reply:* Thanks @John Thomas @Maciej Obuchowski, Your answers both make sense. I just keep running into this error in my logs: +
*Thread Reply:* Thanks @John Thomas @Maciej Obuchowski, Your answers both make sense. I just keep running into this error in my logs:
[2022-05-18, 20:52:34 UTC] {__init__.py:97} WARNING - Unable to find an extractor. task_type=_PythonDecoratedOperator airflow_dag_id=Postgres_1_to_Snowflake_1_v3 task_id=Postgres_1 airflow_run_id=scheduled__2022-05-18T20:51:34.334045+00:00
The picture is my custom extractor, it's not doing anything currently as this is just a test.
*Thread Reply:* thanks again for the help, y'all
+*Thread Reply:* thanks again for the help, y'all
@@ -53784,7 +53790,7 @@*Thread Reply:* did you set the environment variable with the path to your extractor?
+*Thread Reply:* did you set the environment variable with the path to your extractor?
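For reference, the variable takes the full import path of the extractor class (semicolon-separated if there are several); the module path below is hypothetical:

```
# hypothetical module path - point it at wherever your extractor class lives
OPENLINEAGE_EXTRACTORS=include.custom_extractor.MyPythonExtractor
```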
@@ -53810,7 +53816,7 @@*Thread Reply:*
+*Thread Reply:*
*Thread Reply:* I believe that's correct @John Thomas
+*Thread Reply:* I believe that's correct @John Thomas
@@ -53871,7 +53877,7 @@*Thread Reply:* and the versions I'm using: +
*Thread Reply:* and the versions I'm using: Astronomer Runtime 5.0.0 based on Airflow 2.3.0+astro.1
@@ -53898,7 +53904,7 @@*Thread Reply:* this might not be the problem, but you should have only one of extract and extract_on_complete - which one are you meaning to use?
*Thread Reply:* this might not be the problem, but you should have only one of extract and extract_on_complete - which one are you meaning to use?
*Thread Reply:* ahh thanks John, as of right now extract_on_complete.
*Thread Reply:* ahh thanks John, as of right now extract_on_complete.
This is a similar setup to the one Michael had in the video.
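To tie the thread together, a hedged skeleton of such an extractor (class and method names follow the openlineage-airflow BaseExtractor interface of this era; details may differ across versions):

```
from typing import List, Optional

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata


class MyPythonExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        # TaskFlow-decorated tasks surface as _PythonDecoratedOperator,
        # matching the task_type in the warning logs above
        return ["_PythonDecoratedOperator"]

    def extract(self) -> Optional[TaskMetadata]:
        # per the advice above, do the work in extract_on_complete instead
        return None

    def extract_on_complete(self, task_instance) -> Optional[TaskMetadata]:
        # runs after the task finishes, so runtime state on task_instance is available
        return TaskMetadata(name=f"{self.operator.dag_id}.{self.operator.task_id}")
```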
@@ -53961,7 +53967,7 @@*Thread Reply:* if it's still not working I'm not really sure at this point - that's about what I had when I spun up my own custom extractor
+*Thread Reply:* if it's still not working I'm not really sure at this point - that's about what I had when I spun up my own custom extractor
@@ -53987,7 +53993,7 @@*Thread Reply:* is there anything in logs regarding extractors?
+*Thread Reply:* is there anything in logs regarding extractors?
@@ -54013,7 +54019,7 @@*Thread Reply:* just this: +
*Thread Reply:* just this:
[2022-05-18, 21:36:59 UTC] {__init__.py:97} WARNING - Unable to find an extractor. task_type=_PythonDecoratedOperator airflow_dag_id=competitive_oss_projects_git_to_snowflake task_id=Transform_git_logs_to_S3 airflow_run_id=scheduled__2022-05-18T21:35:57.694690+00:00
*Thread Reply:* @John Thomas Thanks, I appreciate your help.
+*Thread Reply:* @John Thomas Thanks, I appreciate your help.
@@ -54066,7 +54072,7 @@*Thread Reply:* No Failed to import messages?
*Thread Reply:* No Failed to import messages?
*Thread Reply:* @Maciej Obuchowski None that I can see. Here is the full log:
-``` Failed to verify remote log exists s3:///dag_id=Postgres_1_to_Snowflake_1_v3/run_id=scheduled__2022-05-19T15:23:49.248097+00:00/task_id=Postgres_1/attempt=1.log.
+ *Thread Reply:* @Maciej Obuchowski None that I can see. Here is the full log:
+``` Failed to verify remote log exists s3:///dag_id=Postgres_1_to_Snowflake_1_v3/run_id=scheduled__2022-05-19T15:23:49.248097+00:00/task_id=Postgres_1/attempt=1.log.
Please provide a bucket_name instead of "s3:///dag_id=Postgres_1_to_Snowflake_1_v3/run_id=scheduled__2022-05-19T15:23:49.248097+00:00/task_id=Postgres_1/attempt=1.log"
Falling back to local log
* Reading local file: /usr/local/airflow/logs/dag_id=Postgres_1_to_Snowflake_1_v3/run_id=scheduled__2022-05-19T15:23:49.248097+00:00/task_id=Postgres_1/attempt=1.log
@@ -54163,7 +54169,7 @@ [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -
*Thread Reply:* @Maciej Obuchowski is our ENV var wrong maybe? Do we need to mention the file to import somewhere else that we may have missed?
+*Thread Reply:* @Maciej Obuchowski is our ENV var wrong maybe? Do we need to mention the file to import somewhere else that we may have missed?
@@ -54189,7 +54195,7 @@*Thread Reply:* @Josh Owens one thing I can think of is that you might have an older openlineage integration version, as the OPENLINEAGE_EXTRACTORS variable was added very recently: https://github.com/OpenLineage/OpenLineage/pull/694
*Thread Reply:* @Josh Owens one thing I can think of is that you might have an older openlineage integration version, as the OPENLINEAGE_EXTRACTORS variable was added very recently: https://github.com/OpenLineage/OpenLineage/pull/694
*Thread Reply:* @Maciej Obuchowski, that was it! For some reason, my requirements.txt wasn't pulling the latest version of openlineage-airflow. Working now with 0.8.2
*Thread Reply:* @Maciej Obuchowski, that was it! For some reason, my requirements.txt wasn't pulling the latest version of openlineage-airflow. Working now with 0.8.2
*Thread Reply:* 🙌
+*Thread Reply:* 🙌
@@ -54295,7 +54301,7 @@*Thread Reply:* Currently, it assumes the latest version. +
*Thread Reply:* Currently, it assumes the latest version. There's an effort with DatasetVersionDatasetFacet to be able to specify it manually - or extract this information from cases like Iceberg or Delta Lake tables.
@@ -54322,7 +54328,7 @@*Thread Reply:* Ah ok. Is it Marquez assuming the latest version when it records the OpenLineage event?
+*Thread Reply:* Ah ok. Is it Marquez assuming the latest version when it records the OpenLineage event?
@@ -54348,7 +54354,7 @@*Thread Reply:* yes
+*Thread Reply:* yes
@@ -54378,7 +54384,7 @@*Thread Reply:* Thanks, that's very helpful 👍
+*Thread Reply:* Thanks, that's very helpful 👍
@@ -54466,7 +54472,7 @@*Thread Reply:* @Howard Yoo What version of airflow?
+*Thread Reply:* @Howard Yoo What version of airflow?
@@ -54492,7 +54498,7 @@*Thread Reply:* it's 2.3
+*Thread Reply:* it's 2.3
@@ -54518,7 +54524,7 @@*Thread Reply:* (sorry, it's 2.4)
+*Thread Reply:* (sorry, it's 2.4)
@@ -54544,7 +54550,7 @@*Thread Reply:* https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow I'd refer to the docs again.
+*Thread Reply:* https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow I'd refer to the docs again.
"Airflow 2.3+ Integration automatically registers itself for Airflow 2.3 if it's installed on Airflow worker's python. This means you don't have to do anything besides configuring it, which is described in Configuration section."
@@ -54573,7 +54579,7 @@*Thread Reply:* Right, configuring I don't see any issues
+*Thread Reply:* Right, configuring I don't see any issues
@@ -54599,7 +54605,7 @@*Thread Reply:* so you don't need:
+*Thread Reply:* so you don't need:
from openlineage.airflow import DAG
*Thread Reply:* Okay... that makes sense then
+*Thread Reply:* Okay... that makes sense then
@@ -54655,7 +54661,7 @@*Thread Reply:* so if you need to import DAG it would just be: +
*Thread Reply:* so if you need to import DAG it would just be:
from airflow import DAG
*Thread Reply:* Thanks!
+*Thread Reply:* Thanks!
@@ -54772,7 +54778,7 @@*Thread Reply:* What changes do you mean by letting openlineage support it? +
*Thread Reply:* What changes do you mean by letting openlineage support it? Or do you mean writing an Apache Camel integration?
@@ -54799,7 +54805,7 @@*Thread Reply:* @Maciej Obuchowski Yes, let openlineage work the same way as it does with airflow
+*Thread Reply:* @Maciej Obuchowski Yes, let openlineage work the same way as it does with airflow
@@ -54825,7 +54831,7 @@*Thread Reply:* I think this is a very valuable thing. I wish openlineage could support some commonly used pipeline tools, and abstract out some general interfaces so that users can extend it themselves
+*Thread Reply:* I think this is a very valuable thing. I wish openlineage could support some commonly used pipeline tools, and abstract out some general interfaces so that users can extend it themselves
@@ -54851,7 +54857,7 @@*Thread Reply:* For Python, we have the OL client, common libraries (well, at least the beginnings of them) and the SQL parser
+*Thread Reply:* For Python, we have the OL client, common libraries (well, at least the beginnings of them) and the SQL parser
@@ -54877,7 +54883,7 @@*Thread Reply:* As we support more systems, the general libraries will grow as well.
+*Thread Reply:* As we support more systems, the general libraries will grow as well.
@@ -54939,7 +54945,7 @@*Thread Reply:* This looks like a bug. we are probably not looking at the right instance in the TaskInstanceListener
+*Thread Reply:* This looks like a bug. we are probably not looking at the right instance in the TaskInstanceListener
@@ -54965,7 +54971,7 @@*Thread Reply:* @Howard Yoo I filed: https://github.com/OpenLineage/OpenLineage/issues/767 for this
+*Thread Reply:* @Howard Yoo I filed: https://github.com/OpenLineage/OpenLineage/issues/767 for this
*Thread Reply:* Thank you so much, Michael!
+*Thread Reply:* Thank you so much, Michael!
@@ -55159,7 +55165,7 @@*Thread Reply:* In Python or Java?
+*Thread Reply:* In Python or Java?
@@ -55185,7 +55191,7 @@*Thread Reply:* In python just inherit BaseFacet and add a _get_schema static method that points to some place where you host the json schema of your facet. For example our DbtVersionRunFacet
*Thread Reply:* In python just inherit BaseFacet and add a _get_schema static method that points to some place where you host the json schema of your facet. For example our DbtVersionRunFacet
In Java you can take a look at Spark's custom facets.
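A hedged Python sketch of that pattern, modeled loosely on DbtVersionRunFacet (the facet name and schema URL below are placeholders you would replace with your own):

```
import attr

from openlineage.client.facet import BaseFacet


@attr.s
class MyVersionRunFacet(BaseFacet):
    # hypothetical custom facet carrying a single version string
    version: str = attr.ib()

    @staticmethod
    def _get_schema() -> str:
        # placeholder URL - point this at your hosted JSON schema
        return "https://example.com/schemas/my-version-run-facet.json"
```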
*Thread Reply:* Thanks, @Maciej Obuchowski, I was asking in regards to Python, sorry I should have clarified.
+*Thread Reply:* Thanks, @Maciej Obuchowski, I was asking in regards to Python, sorry I should have clarified.
I'm not sure what the disconnect is, but the facets aren't showing up in the inputs and outputs. The Lineage event is sent successfully to my astrocloud.
@@ -55410,7 +55416,7 @@*Thread Reply:* _get_schema should return the address of the schema hosted somewhere else - afaik sending an object field where the server expects a string field might cause some problems
*Thread Reply:* _get_schema should return the address of the schema hosted somewhere else - afaik sending an object field where the server expects a string field might cause some problems
*Thread Reply:* can you register ManualLineageFacet as facets, not as inputFacets or outputFacets?
*Thread Reply:* can you register ManualLineageFacet as facets, not as inputFacets or outputFacets?
*Thread Reply:* Thanks for the advice @Maciej Obuchowski, I was able to get it working! +
*Thread Reply:* Thanks for the advice @Maciej Obuchowski, I was able to get it working! Also great talk today at the airflow summit.
@@ -55489,7 +55495,7 @@*Thread Reply:* Thanks 🙇
+*Thread Reply:* Thanks 🙇
@@ -55611,7 +55617,7 @@*Thread Reply:* We're also getting data quality from dbt if you're running dbt test or dbt build
+
*Thread Reply:* We're also getting data quality from dbt if you're running dbt test or dbt build
https://github.com/OpenLineage/OpenLineage/blob/main/integration/common/openlineage/common/provider/dbt.py#L399
*Thread Reply:* Generally, you'd need to construct DataQualityAssertionsDatasetFacet and/or DataQualityMetricsInputDatasetFacet and attach it to the tested dataset
+*Thread Reply:* Generally, you'd need to construct DataQualityAssertionsDatasetFacet and/or DataQualityMetricsInputDatasetFacet and attach it to the tested dataset
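A hedged sketch of what attaching such a facet could look like with the Python client (namespace, table name, and metric values are illustrative; double-check the facet field names against your client version):

```
from openlineage.client.facet import (
    ColumnMetric,
    DataQualityMetricsInputDatasetFacet,
)
from openlineage.client.run import Dataset

# Illustrative metrics for the dataset under test
metrics = DataQualityMetricsInputDatasetFacet(
    rowCount=1500,
    columnMetrics={"id": ColumnMetric(nullCount=0, distinctCount=1500)},
)
tested_dataset = Dataset(
    namespace="my-namespace",   # placeholder
    name="public.my_table",     # placeholder
    facets={"dataQualityMetrics": metrics},
)
```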
*Thread Reply:* Thanks @Maciej Obuchowski!!!
+*Thread Reply:* Thanks @Maciej Obuchowski!!!
@@ -55871,7 +55877,7 @@*Thread Reply:* We're working on this feature - should be in the next release from OpenLineage side
+*Thread Reply:* We're working on this feature - should be in the next release from OpenLineage side
@@ -55901,7 +55907,7 @@*Thread Reply:* Thanks! I will keep an eye on updates.
+*Thread Reply:* Thanks! I will keep an eye on updates.
@@ -55999,7 +56005,7 @@*Thread Reply:* Looks great! Thanks for sharing!
+*Thread Reply:* Looks great! Thanks for sharing!
@@ -56148,7 +56154,7 @@*Thread Reply:* Might be that case if you're using Spark 3.2
+*Thread Reply:* Might be that case if you're using Spark 3.2
@@ -56174,7 +56180,7 @@*Thread Reply:* There were some changes to those operators
+*Thread Reply:* There were some changes to those operators
@@ -56200,7 +56206,7 @@*Thread Reply:* If you're not using 3.2, please share more details 🙂
+*Thread Reply:* If you're not using 3.2, please share more details 🙂
@@ -56226,7 +56232,7 @@*Thread Reply:* Yeap, I'm using spark version 3.2.1
+*Thread Reply:* Yeap, I'm using spark version 3.2.1
@@ -56252,7 +56258,7 @@*Thread Reply:* is it an open issue, or do I have some option to force them to be sent?)
+*Thread Reply:* is it an open issue, or do I have some option to force them to be sent?)
@@ -56278,7 +56284,7 @@*Thread Reply:* btw thank you for the quick response @Maciej Obuchowski
+*Thread Reply:* btw thank you for the quick response @Maciej Obuchowski
@@ -56304,7 +56310,7 @@*Thread Reply:* Yes, we have an issue for AlterTable at least
+*Thread Reply:* Yes, we have an issue for AlterTable at least
@@ -56330,7 +56336,7 @@*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/616 -> that’s the issue for altering tables in Spark 3.2. +
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/616 -> that’s the issue for altering tables in Spark 3.2.
@Ilqar Memmedov Did you mean drop table or drop columns? I am not aware of any drop table issue.
*Thread Reply:* @Paweł Leszczyński drop table statement.
+*Thread Reply:* @Paweł Leszczyński drop table statement.
@@ -56419,7 +56425,7 @@*Thread Reply:* To reproduce it, I just created a simple spark job. +
*Thread Reply:* To reproduce it, I just created a simple spark job: create a table as select from another, select data from the table, and then drop the entire table.
@@ -56493,7 +56499,7 @@*Thread Reply:* hi xiang 👋 lineage in airflow depends on the operator. some operators have extractors as part of the integration, but when they are missing you only see job information in Marquez.
+*Thread Reply:* hi xiang 👋 lineage in airflow depends on the operator. some operators have extractors as part of the integration, but when they are missing you only see job information in Marquez.
@@ -56519,7 +56525,7 @@*Thread Reply:* take a look at https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#extractors--sending-the-correct-data-from-your-dags for a bit more detail
+*Thread Reply:* take a look at https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#extractors--sending-the-correct-data-from-your-dags for a bit more detail
@@ -56676,7 +56682,7 @@*Thread Reply:* I think internal naming is more important 🙂
+*Thread Reply:* I think internal naming is more important 🙂
I guess, for now, try to match what the local directory has.
@@ -56704,7 +56710,7 @@*Thread Reply:* Thanks @Maciej Obuchowski
+*Thread Reply:* Thanks @Maciej Obuchowski
@@ -56809,7 +56815,7 @@*Thread Reply:* Those jobs are actually what Spark does underneath 🙂
+*Thread Reply:* Those jobs are actually what Spark does underneath 🙂
@@ -56835,7 +56841,7 @@*Thread Reply:* Are you using Delta Lake btw?
+*Thread Reply:* Are you using Delta Lake btw?
@@ -56861,7 +56867,7 @@*Thread Reply:* No, this is not Delta Lake. It is a normal Spark app .
+*Thread Reply:* No, this is not Delta Lake. It is a normal Spark app .
@@ -56887,7 +56893,7 @@*Thread Reply:* @Maciej Obuchowski i think David posted about this before. https://openlineage.slack.com/archives/C01CK9T7HKR/p1636011698055200
+*Thread Reply:* @Maciej Obuchowski i think David posted about this before. https://openlineage.slack.com/archives/C01CK9T7HKR/p1636011698055200
*Thread Reply:* I agree that it looks bad in the UI, but I also think the integration is doing a good job here. The eventual "aggregation" should be done by the event consumer.
+*Thread Reply:* I agree that it looks bad in the UI, but I also think the integration is doing a good job here. The eventual "aggregation" should be done by the event consumer.
If anything, we should filter out some 'useless' nodes like collect_limit, since they add nothing.
*Thread Reply:* @Maciej Obuchowski but we only see these 2 jobs in the namespace, no other jobs were part of the lineage metadata, are we doing something wrong?
+*Thread Reply:* @Maciej Obuchowski but we only see these 2 jobs in the namespace, no other jobs were part of the lineage metadata, are we doing something wrong?
@@ -57225,7 +57231,7 @@*Thread Reply:* @Michael Robinson On this note, may we know how to form a lineage if we have a different set of APIs before calling the spark job (already integrated with the OpenLineageSparkListener)? We want to see how the different sets of params pass through these components before landing in the spark job. If we use the openlineage client to post the lineage events into Marquez, do we need to mention the same Run UUID across the lineage events for the run, or is there another way to do this? Can you please advise?
+*Thread Reply:* @Michael Robinson On this note, may we know how to form a lineage if we have a different set of APIs before calling the spark job (already integrated with the OpenLineageSparkListener)? We want to see how the different sets of params pass through these components before landing in the spark job. If we use the openlineage client to post the lineage events into Marquez, do we need to mention the same Run UUID across the lineage events for the run, or is there another way to do this? Can you please advise?
@@ -57251,7 +57257,7 @@*Thread Reply:* I think I understand what you are asking -
+*Thread Reply:* I think I understand what you are asking -
The runID is used to correlate different state updates (i.e., start, fail, complete, abort) across the lifespan of a run. So if you are trying to add additional metadata to the same job run, you’d use the same runID.
@@ -57287,7 +57293,7 @@*Thread Reply:* thanks @Ross Turk but what i am looking for is lets say for example, if we have 4 components in the system then we want to show the 4 components as job icons in the graph and the datasets between them would show the input/output parameters that these components use. +
*Thread Reply:* thanks @Ross Turk but what i am looking for is lets say for example, if we have 4 components in the system then we want to show the 4 components as job icons in the graph and the datasets between them would show the input/output parameters that these components use. A(job) --> DS1(dataset) --> B(job) --> DS2(dataset) --> C(job) --> DS3(dataset) --> D(job)
@@ -57314,7 +57320,7 @@*Thread Reply:* then you would need to have separate Jobs for each, with inputs and outputs defined
+*Thread Reply:* then you would need to have separate Jobs for each, with inputs and outputs defined
@@ -57340,7 +57346,7 @@*Thread Reply:* so there would be a Run of job B that shows DS1 as an input and DS2 as an output
+*Thread Reply:* so there would be a Run of job B that shows DS1 as an input and DS2 as an output
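Putting the thread together, a hedged Python-client sketch of job B: one runId reused across START and COMPLETE, with DS1 as an input and DS2 as an output (namespace, names, URL, and producer URI are placeholders):

```
import uuid
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # Marquez backend (placeholder URL)

run_id = str(uuid.uuid4())  # the same runId correlates START and COMPLETE
job_b = Job(namespace="my-namespace", name="job_B")
ds1 = Dataset(namespace="my-namespace", name="DS1")
ds2 = Dataset(namespace="my-namespace", name="DS2")

for state in (RunState.START, RunState.COMPLETE):
    client.emit(
        RunEvent(
            eventType=state,
            eventTime=datetime.now(timezone.utc).isoformat(),
            run=Run(runId=run_id),
            job=job_b,
            producer="https://example.com/my-pipeline",  # placeholder producer URI
            inputs=[ds1],
            outputs=[ds2],
        )
    )
```

Jobs A, C, and D would each follow the same pattern with their own runIds, chaining through DS1, DS2, and DS3.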
@@ -57366,7 +57372,7 @@*Thread Reply:* got it
+*Thread Reply:* got it
@@ -57392,7 +57398,7 @@*Thread Reply:* (fyi: I know openlineage but my understanding stops at spark 😄)
+*Thread Reply:* (fyi: I know openlineage but my understanding stops at spark 😄)
@@ -57422,7 +57428,7 @@*Thread Reply:* > The eventual “aggregation” should be done by event consumer. +
*Thread Reply:* > The eventual “aggregation” should be done by event consumer. @Maciej Obuchowski Are there any known client-side libraries that support this aggregation already? In the case of spark applications running as part of ETL pipelines, most of the time our end user is interested in seeing only the aggregated view, where all jobs spawned as part of a single application are rolled up into 1 job.
@@ -57449,7 +57455,7 @@*Thread Reply:* I believe Microsoft @Will Johnson has something similar to that, but it's probably proprietary.
+*Thread Reply:* I believe Microsoft @Will Johnson has something similar to that, but it's probably proprietary.
We'd love to have something like it, but AFAIK it affects only some percentage of Spark jobs and we can only do so much.
@@ -57479,7 +57485,7 @@*Thread Reply:* @Maciej Obuchowski Microsoft ❤️ OSS!
+*Thread Reply:* @Maciej Obuchowski Microsoft ❤️ OSS!
Apache Atlas doesn't have the same model as Marquez. It only knows of effectively one entity that represents the complete asset.
@@ -57542,7 +57548,7 @@*Thread Reply:* thank you !
+*Thread Reply:* thank you !
@@ -57640,7 +57646,7 @@*Thread Reply:* Hi, is the link correct? The meeting room is empty
+*Thread Reply:* Hi, is the link correct? The meeting room is empty
@@ -57666,7 +57672,7 @@*Thread Reply:* sorry about that, thanks for letting us know
+*Thread Reply:* sorry about that, thanks for letting us know
@@ -57736,7 +57742,7 @@*Thread Reply:* Hey @Mark Beebe!
+*Thread Reply:* Hey @Mark Beebe!
In this case, nodeId is going to be either a dataset or a job. You need to tell Marquez where to start, since there is likely to be more than one graph. So you need to get your hands on an identifier for that starting node.
*Thread Reply:* You can do this in a few ways (that I can think of). First, by looking for a namespace, then querying for the datasets in that namespace:
+*Thread Reply:* You can do this in a few ways (that I can think of). First, by looking for a namespace, then querying for the datasets in that namespace:
*Thread Reply:* Or you can search, if you know the name of the dataset:
+*Thread Reply:* Or you can search, if you know the name of the dataset:
*Thread Reply:* aaaaannnnd that’s actually all the ways I can think of.
+*Thread Reply:* aaaaannnnd that’s actually all the ways I can think of.
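For reference, those lookups map to Marquez API calls roughly like the following (host and names are placeholders):

```
# list datasets in a namespace
curl "http://localhost:5000/api/v1/namespaces/my-namespace/datasets"

# search by dataset name
curl "http://localhost:5000/api/v1/search?q=my_table"

# fetch the graph around a node once you have its nodeId
curl "http://localhost:5000/api/v1/lineage?nodeId=dataset:my-namespace:my_table"
```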
@@ -57860,7 +57866,7 @@*Thread Reply:* That worked, thank you so much!
+*Thread Reply:* That worked, thank you so much!
@@ -57946,7 +57952,7 @@*Thread Reply:* tnazarew - Tomasz Nazarewicz
+*Thread Reply:* tnazarew - Tomasz Nazarewicz
@@ -57972,7 +57978,7 @@*Thread Reply:* 🙌
+*Thread Reply:* 🙌
@@ -57998,7 +58004,7 @@*Thread Reply:* 👍
+*Thread Reply:* 👍
@@ -58390,7 +58396,7 @@*Thread Reply:* I think you’re referencing the deprecation of the DatasetAPI in Marquez? A milestone for Marquez is to only collect metadata via OpenLineage events. This includes metadata for datasets, jobs, and runs. The DatasetAPI won’t be removed until support for collecting dataset metadata via OpenLineage has been added, see https://github.com/OpenLineage/OpenLineage/issues/323
*Thread Reply:* I think you’re referencing the deprecation of the DatasetAPI in Marquez? A milestone for Marquez is to only collect metadata via OpenLineage events. This includes metadata for datasets, jobs, and runs. The DatasetAPI won’t be removed until support for collecting dataset metadata via OpenLineage has been added, see https://github.com/OpenLineage/OpenLineage/issues/323
*Thread Reply:* Once the spec supports dataset metadata, we’ll outline steps in the Marquez project to switch to using the new dataset event type
+*Thread Reply:* Once the spec supports dataset metadata, we’ll outline steps in the Marquez project to switch to using the new dataset event type
@@ -58518,7 +58524,7 @@*Thread Reply:* The DatasetAPI was also deprecated to avoid confusion around which API to use
+*Thread Reply:* The DatasetAPI was also deprecated to avoid confusion around which API to use
@@ -58596,7 +58602,7 @@*Thread Reply:* Do you want to register just datasets? Or are you extracting metadata for a job that would include input / output datasets? (outside of Airflow of course)
+*Thread Reply:* Do you want to register just datasets? Or are you extracting metadata for a job that would include input / output datasets? (outside of Airflow of course)
@@ -58622,7 +58628,7 @@*Thread Reply:* Sorry, didn't notice you over here! lol
+*Thread Reply:* Sorry, didn't notice you over here! lol
@@ -58648,7 +58654,7 @@*Thread Reply:* So ideally I would like to map out our current data flow from on prem to aws
+*Thread Reply:* So ideally I would like to map out our current data flow from on prem to aws
@@ -58674,7 +58680,7 @@*Thread Reply:* What do you mean by mapping to AWS? Like send OL events to a service on AWS that would process the lineage metadata?
+*Thread Reply:* What do you mean by mapping to AWS? Like send OL events to a service on AWS that would process the lineage metadata?
@@ -58700,7 +58706,7 @@*Thread Reply:* no, just visualize the current migration flow.
+*Thread Reply:* no, just visualize the current migration flow.
@@ -58726,7 +58732,7 @@*Thread Reply:* Ah I see, you're doing an infra migration from on-prem to AWS 👌
+*Thread Reply:* Ah I see, you're doing an infra migration from on-prem to AWS 👌
@@ -58752,7 +58758,7 @@*Thread Reply:* really AWS is irrelevant. Source sink -> migration scripts -> s3 -> additional processing -> final sink
+*Thread Reply:* really AWS is irrelevant. Source sink -> migration scripts -> s3 -> additional processing -> final sink
@@ -58778,7 +58784,7 @@*Thread Reply:* correct
+*Thread Reply:* correct
@@ -58804,7 +58810,7 @@*Thread Reply:* right right. so you want to map out that flow and visualize it in Marquez? (or some other meta service)
+*Thread Reply:* right right. so you want to map out that flow and visualize it in Marquez? (or some other meta service)
@@ -58830,7 +58836,7 @@*Thread Reply:* yes
+*Thread Reply:* yes
@@ -58856,7 +58862,7 @@*Thread Reply:* which I think I can do once the first nodes exist
+*Thread Reply:* which I think I can do once the first nodes exist
@@ -58882,7 +58888,7 @@*Thread Reply:* But I don't know how to get that initial node. I tried using the input facet at job start, that didn't do it. I also can't get the sql context that is in these examples.
+*Thread Reply:* But I don't know how to get that initial node. I tried using the input facet at job start, that didn't do it. I also can't get the sql context that is in these examples.
@@ -58908,7 +58914,7 @@*Thread Reply:* really just want to re-create food_delivery using my own biz context
+*Thread Reply:* really just want to re-create food_delivery using my own biz context
@@ -58934,7 +58940,7 @@*Thread Reply:* Have you looked over our workshops and this example? (assuming you’re using python?)
+*Thread Reply:* Have you looked over our workshops and this example? (assuming you’re using python?)
*Thread Reply:* that goes over the py client with some OL examples, but really calling the openlineage.emit(...) method with RunEvents and specifying Marquez as the backend will get you up and running!
+*Thread Reply:* that goes over the py client with some OL examples, but really calling the openlineage.emit(...) method with RunEvents and specifying Marquez as the backend will get you up and running!
*Thread Reply:* Don’t forget to configure the transport for the client
+*Thread Reply:* Don’t forget to configure the transport for the client
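(A minimal sketch with the OpenLineage Python client; the namespace, job, and dataset names are illustrative. Marquez is assumed at localhost:5000 — alternatively, set OPENLINEAGE_URL in the environment to configure the transport.)
```python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")
run = Run(runId=str(uuid4()))
job = Job(namespace="my-namespace", name="my-job")
producer = "https://example.com/my-pipeline"  # hypothetical producer URI

# START event
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run,
    job=job,
    producer=producer,
))

# COMPLETE event, declaring the input and output datasets
client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run,
    job=job,
    producer=producer,
    inputs=[Dataset(namespace="my-namespace", name="source_table")],
    outputs=[Dataset(namespace="my-namespace", name="target_table")],
))
```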
@@ -59084,7 +59090,7 @@*Thread Reply:* sweet. Thank you! I'll take a look. Also.. Just came across datakin for the first time. very nice 🙂
+*Thread Reply:* sweet. Thank you! I'll take a look. Also.. Just came across datakin for the first time. very nice 🙂
@@ -59110,7 +59116,7 @@*Thread Reply:* thanks! …. but we’re now part of astronomer.io 😉
+*Thread Reply:* thanks! …. but we’re now part of astronomer.io 😉
@@ -59136,7 +59142,7 @@*Thread Reply:* making airflow oh-so-easy-to-use one DAG at a time
+*Thread Reply:* making airflow oh-so-easy-to-use one DAG at a time
@@ -59162,7 +59168,7 @@*Thread Reply:* saw that too !
+*Thread Reply:* saw that too !
@@ -59188,7 +59194,7 @@*Thread Reply:* you’re on top of it!
+*Thread Reply:* you’re on top of it!
@@ -59214,7 +59220,7 @@*Thread Reply:* ha. Thanks again!
+*Thread Reply:* ha. Thanks again!
@@ -59347,7 +59353,7 @@*Thread Reply:* Honestly, I’ve been a huge fan of using / falling back on inlets and outlets since day 1. AND if you’re willing to contribute this support, you get a +1 from me (I’ll add some minor comments to the issue) /cc @Julien Le Dem
+*Thread Reply:* Honestly, I’ve been a huge fan of using / falling back on inlets and outlets since day 1. AND if you’re willing to contribute this support, you get a +1 from me (I’ll add some minor comments to the issue) /cc @Julien Le Dem
@@ -59377,7 +59383,7 @@*Thread Reply:* would be great to get @Maciej Obuchowski thoughts on this as well
+*Thread Reply:* would be great to get @Maciej Obuchowski thoughts on this as well
@@ -59407,7 +59413,7 @@*Thread Reply:* I have created a draft PR for this here.
+*Thread Reply:* I have created a draft PR for this here. Please let me know if the changes make sense.
*Thread Reply:* I think this effort: https://github.com/OpenLineage/OpenLineage/pull/904 ultimately makes more sense, since it will allow getting lineage on Airflow 2.3+ too
+*Thread Reply:* I think this effort: https://github.com/OpenLineage/OpenLineage/pull/904 ultimately makes more sense, since it will allow getting lineage on Airflow 2.3+ too
*Thread Reply:* I have made the changes in line with the mentioned comments here.
+*Thread Reply:* I have made the changes in line with the mentioned comments here. Does this look good?
*Thread Reply:* I think it looks good! Would be great to have tests for this feature though.
+*Thread Reply:* I think it looks good! Would be great to have tests for this feature though.
@@ -59692,7 +59698,7 @@*Thread Reply:* I have added the tests! Would really appreciate it if someone can take a look and let me know if anything else needs to be done.
+*Thread Reply:* I have added the tests! Would really appreciate it if someone can take a look and let me know if anything else needs to be done. Thank you for the support! 😄
@@ -59723,7 +59729,7 @@*Thread Reply:* One change and I think it will be good for now.
+*Thread Reply:* One change and I think it will be good for now.
@@ -59749,7 +59755,7 @@*Thread Reply:* Have you tested it manually?
+*Thread Reply:* Have you tested it manually?
@@ -59775,7 +59781,7 @@*Thread Reply:* Thanks a lot for the review! Appreciate it 🙌
+*Thread Reply:* Thanks a lot for the review! Appreciate it 🙌 Yes, I tested it manually (for Airflow versions 2.1.4 and 2.3.3) and it works 🎉
@@ -59802,7 +59808,7 @@*Thread Reply:* I think this is such a useful feature to have, thank you! Would you mind adding a little example to the PR of how to use it? Like a little example DAG or something? ( either in a comment or edit the PR description )
+*Thread Reply:* I think this is such a useful feature to have, thank you! Would you mind adding a little example to the PR of how to use it? Like a little example DAG or something? ( either in a comment or edit the PR description )
@@ -59832,7 +59838,7 @@*Thread Reply:* Yes, Sure! I will add it in the PR description
+*Thread Reply:* Yes, Sure! I will add it in the PR description
@@ -59858,7 +59864,7 @@*Thread Reply:* I think it would be easy to convert to integration test then if you provided example dag
+*Thread Reply:* I think it would be easy to convert to integration test then if you provided example dag
@@ -59888,7 +59894,7 @@*Thread Reply:* ping @Fenil Doshi if possible I would really love to see the example DAG on there 🙂 🙏
+*Thread Reply:* ping @Fenil Doshi if possible I would really love to see the example DAG on there 🙂 🙏
@@ -59914,7 +59920,7 @@*Thread Reply:* Yes, I was going to but the PR got merged so did not update the description. Should I just update the description of merged PR? Or should I add it somewhere in the docs?
+*Thread Reply:* Yes, I was going to but the PR got merged so did not update the description. Should I just update the description of merged PR? Or should I add it somewhere in the docs?
@@ -59940,7 +59946,7 @@*Thread Reply:* ^ @Ross Turk is it easy for @Fenil Doshi to contribute doc for manual inlet definition on the new doc site?
+*Thread Reply:* ^ @Ross Turk is it easy for @Fenil Doshi to contribute doc for manual inlet definition on the new doc site?
@@ -59966,7 +59972,7 @@*Thread Reply:* It is easy 🙂 it's just markdown: https://github.com/openlineage/docs/
+*Thread Reply:* It is easy 🙂 it's just markdown: https://github.com/openlineage/docs/
@@ -59992,7 +59998,7 @@*Thread Reply:* @Fenil Doshi feel free to create a new page here and don't sweat where to put it, we're still figuring out the structure of it and will move it then
+*Thread Reply:* @Fenil Doshi feel free to create a new page here and don't sweat where to put it, we're still figuring out the structure of it and will move it then
@@ -60022,7 +60028,7 @@*Thread Reply:* exactly, yes - don’t be worried about the doc quality right now, the doc site is still in a pre-release state. so whatever you write will be likely edited or moved before it becomes official 👍
+*Thread Reply:* exactly, yes - don’t be worried about the doc quality right now, the doc site is still in a pre-release state. so whatever you write will be likely edited or moved before it becomes official 👍
@@ -60052,7 +60058,7 @@*Thread Reply:* I added documentation here - https://github.com/OpenLineage/docs/pull/16
+*Thread Reply:* I added documentation here - https://github.com/OpenLineage/docs/pull/16
Also, I have added an example for it. 🙂 Let me know if something is unclear and needs to be updated.
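(A hedged sketch of manual lineage via Airflow inlets/outlets, along the lines of the docs PR above; the operator, namespaces, and table names are illustrative.)
```python
from airflow.operators.bash import BashOperator
from openlineage.client.run import Dataset

copy_orders = BashOperator(
    task_id="copy_orders",
    bash_command="echo 'copying orders...'",
    # OpenLineage Datasets declared directly on the task:
    inlets=[Dataset(namespace="postgres://prod-db", name="public.orders")],
    outlets=[Dataset(namespace="s3://warehouse", name="staging.orders")],
)
```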
@@ -60109,7 +60115,7 @@*Thread Reply:* Thanks! very cool.
+*Thread Reply:* Thanks! very cool.
@@ -60135,7 +60141,7 @@*Thread Reply:* Does Airflow check the types of the inlets/outlets btw?
+*Thread Reply:* Does Airflow check the types of the inlets/outlets btw?
Like I wonder if a user could directly define an OpenLineage DataSet (which might even have various other facets included on it) and specify it in the inlets/outlets?
*Thread Reply:* Yeah, I was also curious about using the models from airflow.lineage.entities as opposed to openlineage.client.run.
+*Thread Reply:* Yeah, I was also curious about using the models from airflow.lineage.entities as opposed to openlineage.client.run.
*Thread Reply:* I am accustomed to creating OpenLineage entities like this:
+*Thread Reply:* I am accustomed to creating OpenLineage entities like this:
taxes = Dataset(namespace="postgres://foobar", name="schema.table")
*Thread Reply:* I don’t dislike the airflow.lineage.entities models especially, but if we only support one of them…
+*Thread Reply:* I don’t dislike the airflow.lineage.entities models especially, but if we only support one of them…
*Thread Reply:* yeah, if Airflow allows that class within inlets/outlets it'd be nice to support both imo.
+*Thread Reply:* yeah, if Airflow allows that class within inlets/outlets it'd be nice to support both imo.
Like we would suggest users use openlineage.client.run.Dataset, but if a user already has DAGs that use Table then they'd still work in a best-effort way.
*Thread Reply:* either Airflow depends on OpenLineage or we can probably change those entities as part of AIP-48 overhaul to more openlineage-like ones
+*Thread Reply:* either Airflow depends on OpenLineage or we can probably change those entities as part of AIP-48 overhaul to more openlineage-like ones
@@ -60322,7 +60328,7 @@*Thread Reply:* hm, not sure I understand the dependency issue. isn’t this extractor living in openlineage-airflow?
+*Thread Reply:* hm, not sure I understand the dependency issue. isn’t this extractor living in openlineage-airflow?
*Thread Reply:* I gave manual lineage a try with native OL Datasets specified in the Airflow inlets/outlets and it seems to work! Had to make some small tweaks which I have attempted here: https://github.com/OpenLineage/OpenLineage/pull/1015
+*Thread Reply:* I gave manual lineage a try with native OL Datasets specified in the Airflow inlets/outlets and it seems to work! Had to make some small tweaks which I have attempted here: https://github.com/OpenLineage/OpenLineage/pull/1015
( I left the support for converting the Airflow Table to Dataset because I think that's nice to have also )
*Thread Reply:* Ahh great question! I actually just updated the seeding cmd for Marquez to do just this (but in java of course)
+*Thread Reply:* Ahh great question! I actually just updated the seeding cmd for Marquez to do just this (but in java of course)
@@ -60500,7 +60506,7 @@*Thread Reply:* Give me a sec to send you over the diff…
+*Thread Reply:* Give me a sec to send you over the diff…
@@ -60530,7 +60536,7 @@*Thread Reply:* … continued here https://openlineage.slack.com/archives/C01CK9T7HKR/p1656456734272809?thread_ts=1656456141.097229&cid=C01CK9T7HKR
+*Thread Reply:* … continued here https://openlineage.slack.com/archives/C01CK9T7HKR/p1656456734272809?thread_ts=1656456141.097229&cid=C01CK9T7HKR
*Thread Reply:* I think you should declare those as sources? Or do you need something different?
+*Thread Reply:* I think you should declare those as sources? Or do you need something different?
@@ -60706,7 +60712,7 @@*Thread Reply:* I'll try to experiment with this.
+*Thread Reply:* I'll try to experiment with this.
@@ -60759,7 +60765,7 @@*Thread Reply:* this should already be working if you run dbt-ol test or dbt-ol build
+*Thread Reply:* this should already be working if you run dbt-ol test or dbt-ol build
*Thread Reply:* oh, nice!
+*Thread Reply:* oh, nice!
@@ -60837,7 +60843,7 @@*Thread Reply:* Maybe @Maciej Obuchowski knows? You need to check that it's using the dbt-ol command and that the configuration is available (environment variables or conf file).
+*Thread Reply:* Maybe @Maciej Obuchowski knows? You need to check that it's using the dbt-ol command and that the configuration is available (environment variables or conf file).
@@ -60863,7 +60869,7 @@*Thread Reply:* Maybe some aws networking stuff? I'm not really sure how mwaa works internally (or, at all - never used it)
+*Thread Reply:* Maybe some aws networking stuff? I'm not really sure how mwaa works internally (or, at all - never used it)
@@ -60889,7 +60895,7 @@*Thread Reply:* anyway, any logs/errors should be in the same space where your task logs are
+*Thread Reply:* anyway, any logs/errors should be in the same space where your task logs are
@@ -60941,7 +60947,7 @@*Thread Reply:* What is the status on the Flink / Streaming decisions being made for OpenLineage / Marquez?
+*Thread Reply:* What is the status on the Flink / Streaming decisions being made for OpenLineage / Marquez?
A few months ago, Flink was being introduced and it was said that more thought was needed around supporting streaming services in OpenLineage.
@@ -60975,7 +60981,7 @@*Thread Reply:* @Will Johnson added your item
+*Thread Reply:* @Will Johnson added your item
@@ -61093,7 +61099,7 @@*Thread Reply:* would appreciate a TSC discussion on OL philosophy for Streaming in general and where/if it fits in the vision and strategy for OL. fully appreciate current maturity, moreso just validating how OL is being positioned from a vision perspective. as we consider aligning enterprise lineage solution around OL want to make sure we're not making bad assumptions. neat discussion might be "imagine that Confluent decided to make Stream Lineage OL compliant/capable - are we cool with that and what are the implications?".
+*Thread Reply:* would appreciate a TSC discussion on OL philosophy for Streaming in general and where/if it fits in the vision and strategy for OL. fully appreciate current maturity, moreso just validating how OL is being positioned from a vision perspective. as we consider aligning enterprise lineage solution around OL want to make sure we're not making bad assumptions. neat discussion might be "imagine that Confluent decided to make Stream Lineage OL compliant/capable - are we cool with that and what are the implications?".
@@ -61123,7 +61129,7 @@*Thread Reply:* @Michael Robinson could I also have a quick 5m to talk about plans for a documentation site?
+*Thread Reply:* @Michael Robinson could I also have a quick 5m to talk about plans for a documentation site?
@@ -61153,7 +61159,7 @@*Thread Reply:* @David Cecchi @Ross Turk Added your items to the agenda. Thanks and looking forward to the discussion!
+*Thread Reply:* @David Cecchi @Ross Turk Added your items to the agenda. Thanks and looking forward to the discussion!
@@ -61179,7 +61185,7 @@*Thread Reply:* this is great - will keep an eye out for recording. if it got tabled due to lack of attendance will pick it up next TSC.
+*Thread Reply:* this is great - will keep an eye out for recording. if it got tabled due to lack of attendance will pick it up next TSC.
@@ -61205,7 +61211,7 @@*Thread Reply:* I think OpenLineage should have some representation at https://impactdatasummit.com/2022
+*Thread Reply:* I think OpenLineage should have some representation at https://impactdatasummit.com/2022
I’m happy to help craft the abstract, look over slides, etc. (I could help present, but all I’ve done with OpenLineage is one tutorial, so I’m hardly an expert).
@@ -61286,7 +61292,7 @@*Thread Reply:* @Michael Collado would you have any thoughts on how to extend the Facets without having to alter OpenLineage itself?
+*Thread Reply:* @Michael Collado would you have any thoughts on how to extend the Facets without having to alter OpenLineage itself?
@@ -61312,7 +61318,7 @@*Thread Reply:* This is described here. Notably:
+*Thread Reply:* This is described here. Notably:
> Custom implementations are registered by following Java's ServiceLoader conventions. A file called io.openlineage.spark.api.OpenLineageEventHandlerFactory must exist in the application or jar's META-INF/services directory. Each line of that file must be the fully qualified class name of a concrete implementation of OpenLineageEventHandlerFactory. More than one implementation can be present in a single file. This might be useful to separate extensions that are targeted toward different environments - e.g., one factory may contain Azure-specific extensions, while another factory may contain GCP extensions.
*Thread Reply:* This example is present in the test package - https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]ervices/io.openlineage.spark.api.OpenLineageEventHandlerFactory
+*Thread Reply:* This example is present in the test package - https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]ervices/io.openlineage.spark.api.OpenLineageEventHandlerFactory
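(For illustration, the registration file might look like this, assuming a hypothetical factory class com.example.MyEventHandlerFactory; ServiceLoader files allow `#` comments.)
```
# File: META-INF/services/io.openlineage.spark.api.OpenLineageEventHandlerFactory
com.example.MyEventHandlerFactory
```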
*Thread Reply:* @Michael Collado you are amazing! Thank you so much for pointing me to the docs and example!
+*Thread Reply:* @Michael Collado you are amazing! Thank you so much for pointing me to the docs and example!
@@ -61496,7 +61502,7 @@*Thread Reply:* Can you expand on what's not working exactly?
+*Thread Reply:* Can you expand on what's not working exactly?
This is not something we're aware of.
@@ -61524,7 +61530,7 @@*Thread Reply:* @Maciej Obuchowski Sure, I have my own library where I am creating a shadowJar. This includes the open lineage library into the new uber jar. This worked fine till 0.9.0 but now building the shadowJar gives this error
+*Thread Reply:* @Maciej Obuchowski Sure, I have my own library where I am creating a shadowJar. This includes the open lineage library into the new uber jar. This worked fine till 0.9.0 but now building the shadowJar gives this error
Could not determine the dependencies of task ':shadowJar'.
> Could not resolve all dependencies for configuration ':runtimeClasspath'.
> Could not find spark:app:0.10.0.
@@ -61576,7 +61582,7 @@
[2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -
*Thread Reply:* Can you try 0.11? I think we might have already fixed that.
+*Thread Reply:* Can you try 0.11? I think we might have already fixed that.
@@ -61602,7 +61608,7 @@*Thread Reply:* Tried with that as well. Doesn't work
+*Thread Reply:* Tried with that as well. Doesn't work
@@ -61628,7 +61634,7 @@*Thread Reply:* Same error with 0.11.0 as well
+*Thread Reply:* Same error with 0.11.0 as well
@@ -61654,7 +61660,7 @@*Thread Reply:* I think I see - we removed internal dependencies from maven's pom.xml but we also publish gradle metadata: https://repo1.maven.org/maven2/io/openlineage/openlineage-spark/0.11.0/openlineage-spark-0.11.0.module
+*Thread Reply:* I think I see - we removed internal dependencies from maven's pom.xml but we also publish gradle metadata: https://repo1.maven.org/maven2/io/openlineage/openlineage-spark/0.11.0/openlineage-spark-0.11.0.module
*Thread Reply:* we should remove the dependencies or disable the gradle metadata altogether, it's not required
+*Thread Reply:* we should remove the dependencies or disable the gradle metadata altogether, it's not required
@@ -61706,7 +61712,7 @@*Thread Reply:* @Varun Singh For now I think you can try ignoring gradle metadata: https://docs.gradle.org/current/userguide/declaring_repositories.html#sec:supported_metadata_sources
+*Thread Reply:* @Varun Singh For now I think you can try ignoring gradle metadata: https://docs.gradle.org/current/userguide/declaring_repositories.html#sec:supported_metadata_sources
@@ -61732,7 +61738,7 @@*Thread Reply:* @Varun Singh did you find out how to build the shadowJar successfully with release 0.10.0? I can build the shadowJar with 0.9.0, but not higher versions. If your problem is already resolved, could you share some suggestions? thanks ^^
+*Thread Reply:* @Varun Singh did you find out how to build the shadowJar successfully with release 0.10.0? I can build the shadowJar with 0.9.0, but not higher versions. If your problem is already resolved, could you share some suggestions? thanks ^^
@@ -61758,7 +61764,7 @@*Thread Reply:* @Hanbing Wang I followed @Maciej Obuchowski's instructions (Thank you!) and added this to my build.gradle file:
+*Thread Reply:* @Hanbing Wang I followed @Maciej Obuchowski's instructions (Thank you!) and added this to my build.gradle file:
repositories {
mavenCentral() {
metadataSources {
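(The diff cuts the snippet off here; per the Gradle docs linked above, the complete form would look something like this sketch, which tells Gradle to resolve from the POM and artifacts while ignoring module metadata:)
```
repositories {
    mavenCentral() {
        metadataSources {
            mavenPom()
            artifact()
        }
    }
}
```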
@@ -61793,7 +61799,7 @@
[2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -
*Thread Reply:* Also, I am not able to see the 3rd party dependencies in the dependency lock file, but they are present in some folder inside the jar (relocated in subproject's build file). But this is a different problem ig
+*Thread Reply:* Also, I am not able to see the 3rd party dependencies in the dependency lock file, but they are present in some folder inside the jar (relocated in subproject's build file). But this is a different problem ig
@@ -61819,7 +61825,7 @@*Thread Reply:* Thanks @Varun Singh for the very helpful info. I will also try updating build.gradle and rebuilding the shadowJar again.
+*Thread Reply:* Thanks @Varun Singh for the very helpful info. I will also try updating build.gradle and rebuilding the shadowJar again.
*Thread Reply:* Could you unzip the created jar and verify that the classes you’re trying to use are present? Perhaps there’s some relocate in the shadowJar plugin, which renames the classes. Making sure the classes are present in the jar is a good point to start.
+*Thread Reply:* Could you unzip the created jar and verify that the classes you’re trying to use are present? Perhaps there’s some relocate in the shadowJar plugin, which renames the classes. Making sure the classes are present in the jar is a good point to start.
Then you can try doing Class.forName just from the spark-shell without any listeners added. The classes should be available there.
@@ -62014,7 +62020,7 @@*Thread Reply:* Thank you for the reply Pawel! Hanna and I just wrapped up some testing.
+*Thread Reply:* Thank you for the reply Pawel! Hanna and I just wrapped up some testing.
It looks like Databricks AND open source spark does some magic when you install a library OR use --jars on the spark-shell. In both Databricks and Apache Spark, the thread running the SparkListener cannot see the additional libraries installed unless they're on the original / main class path.
@@ -62050,7 +62056,7 @@*Thread Reply:* We run tests using --packages for external stuff like Delta - which is the same as --jars, but getting them from maven central, not local disk, and it works, like in KafkaRelationVisitor.
+*Thread Reply:* We run tests using --packages for external stuff like Delta - which is the same as --jars, but getting them from maven central, not local disk, and it works, like in KafkaRelationVisitor.
What if you did it like that? By that I mean adding it to your code with compileOnly in gradle or provided in maven, compiling with it, then using a static method to check if it loads?
*Thread Reply:* > • When we use a classloader on a class we installed and imported, it DOES find the class. myImportedClass.getClass().getClassLoader().getResource("com/microsoft/kusto/spark/datasource/KustoSourceOptions.class")
+*Thread Reply:* > • When we use a classloader on a class we installed and imported, it DOES find the class. myImportedClass.getClass().getClassLoader().getResource("com/microsoft/kusto/spark/datasource/KustoSourceOptions.class") Isn't that this actual scenario?
@@ -62105,7 +62111,7 @@*Thread Reply:* Thank you for the reply, Maciej!
+*Thread Reply:* Thank you for the reply, Maciej!
I will try the compileOnly route tonight!
@@ -62137,7 +62143,7 @@*Thread Reply:* The difference between --jars and --packages is that for packages all transitive dependencies will be handled. But this does not seem to be the case here.
+*Thread Reply:* The difference between --jars and --packages is that for packages all transitive dependencies will be handled. But this does not seem to be the case here.
More doc can be found here: (https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management)
@@ -62171,7 +62177,7 @@*Thread Reply:* @Paweł Leszczyński thank you for the reply! Hanna and I experimented with jars vs extraClassPath.
+*Thread Reply:* @Paweł Leszczyński thank you for the reply! Hanna and I experimented with jars vs extraClassPath.
When using jars, the spark listener does NOT find the class using a classloader.
@@ -62207,7 +62213,7 @@*Thread Reply:* So, this is only relevant to Databricks now? Because I don't understand what do you do different than us with Kafka/Iceberg/Delta
+*Thread Reply:* So, this is only relevant to Databricks now? Because I don't understand what do you do different than us with Kafka/Iceberg/Delta
@@ -62233,7 +62239,7 @@*Thread Reply:* I'm not the spark/classpath expert though - maybe @Michael Collado have something to add?
+*Thread Reply:* I'm not the spark/classpath expert though - maybe @Michael Collado have something to add?
@@ -62259,7 +62265,7 @@*Thread Reply:* @Maciej Obuchowski that's a super good question on Iceberg. How do you instantiate a spark job with Iceberg installed?
+*Thread Reply:* @Maciej Obuchowski that's a super good question on Iceberg. How do you instantiate a spark job with Iceberg installed?
@@ -62285,7 +62291,7 @@*Thread Reply:* It is still relevant to apache spark because I can't get OpenLineage to find the installed package UNLESS I use extraClassPath.
+*Thread Reply:* It is still relevant to apache spark because I can't get OpenLineage to find the installed package UNLESS I use extraClassPath.
@@ -62311,7 +62317,7 @@*Thread Reply:* Basically, by adding --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:0.13.0
+*Thread Reply:* Basically, by adding --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:0.13.0
*Thread Reply:* Trying with --packages right now.
+*Thread Reply:* Trying with --packages right now.
@@ -62390,7 +62396,7 @@*Thread Reply:* Using --packages wouldn't let me find the Spark relation's default source:
+*Thread Reply:* Using --packages wouldn't let me find the Spark relation's default source:
Spark Shell command
spark-shell --master local[4] \
@@ -62440,7 +62446,7 @@
[2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -
*Thread Reply:* what if you load your listener using also packages?
+*Thread Reply:* what if you load your listener using also packages?
@@ -62466,7 +62472,7 @@*Thread Reply:* That's how I'm doing it locally using spark.conf:
+*Thread Reply:* That's how I'm doing it locally using spark.conf:
spark.jars.packages com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.2,io.delta:delta-core_2.12:1.0.0,org.apache.iceberg:iceberg-spark3-runtime:0.12.1,io.openlineage:openlineage-spark:0.9.0
*Thread Reply:* @Maciej Obuchowski - You beautiful bearded man!
+*Thread Reply:* @Maciej Obuchowski - You beautiful bearded man!
🙏
2022-07-14 11:14:21,266 INFO MyLogger: Trying LogicalRelation
2022-07-14 11:14:21,266 INFO MyLogger: Got logical relation
@@ -62541,7 +62547,7 @@
[2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -
*Thread Reply:* This is an annoying detail about Java ClassLoaders and the way Spark loads extra jars/packages
+*Thread Reply:* This is an annoying detail about Java ClassLoaders and the way Spark loads extra jars/packages
Remember Java's ClassLoaders are hierarchical - there are parent ClassLoaders and child ClassLoaders. Parents can't see their children's classes, but children can see their parent's classes.
@@ -62575,7 +62581,7 @@*Thread Reply:* In Databricks, we were not able to simply use the --packages argument to load the listener, which is why we have that init script that copies the jar into the classpath that Databricks uses for application startup (the main ClassLoader). You need to copy your visitor jar into the same location so that both jars are loaded by the same ClassLoader and can see each other
+*Thread Reply:* In Databricks, we were not able to simply use the --packages argument to load the listener, which is why we have that init script that copies the jar into the classpath that Databricks uses for application startup (the main ClassLoader). You need to copy your visitor jar into the same location so that both jars are loaded by the same ClassLoader and can see each other
*Thread Reply:* (as an aside, this is one of the major drawbacks of the java agent approach and one reason why all the documentation recommends using the spark.jars.packages configuration parameter for loading the OL library - it guarantees that any DataSource nodes loaded by the Spark ClassLoader can be seen by the OL library and we don't have to use reflection for everything)
+*Thread Reply:* (as an aside, this is one of the major drawbacks of the java agent approach and one reason why all the documentation recommends using the spark.jars.packages configuration parameter for loading the OL library - it guarantees that any DataSource nodes loaded by the Spark ClassLoader can be seen by the OL library and we don't have to use reflection for everything)
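(A hedged sketch of that recommended setup in PySpark: loading openlineage-spark through spark.jars.packages so the listener shares the main ClassLoader with other packages. The version, host, and namespace values are illustrative.)
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ol-example")
    # resolve the OL listener from Maven Central at startup
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.11.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.host", "http://localhost:5000")
    .config("spark.openlineage.namespace", "my-namespace")
    .getOrCreate()
)
```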
*Thread Reply:* @Michael Collado Thank you so much for the reply. The challenge is that Databricks has their own mechanism for installing libraries / packages.
+*Thread Reply:* @Michael Collado Thank you so much for the reply. The challenge is that Databricks has their own mechanism for installing libraries / packages.
https://docs.microsoft.com/en-us/azure/databricks/libraries/
@@ -62709,7 +62715,7 @@*Thread Reply:* Unfortunately, I can't ask users to install their packages on Databricks in a non-standard way (e.g. via an init script) because no one will follow that recommendation.
+*Thread Reply:* Unfortunately, I can't ask users to install their packages on Databricks in a non-standard way (e.g. via an init script) because no one will follow that recommendation.
@@ -62735,7 +62741,7 @@*Thread Reply:* yeah, I'd prefer if we didn't need an init script to get OL on Databricks either 🤷‍♂️
+*Thread Reply:* yeah, I'd prefer if we didn't need an init script to get OL on Databricks either 🤷‍♂️
@@ -62765,7 +62771,7 @@*Thread Reply:* Quick update:
+*Thread Reply:* Quick update:
• Turns out using a class loader from a Scala spark listener does not have this problem.
• https://stackoverflow.com/questions/7671888/scala-classloaders-confusion
• I'm trying to use URLClassLoader as recommended by a few MSFT folks and point it at the /local_disk0/tmp folder.
@@ -62846,7 +62852,7 @@
*Thread Reply:* Can't help you now, but I'd love if you dumped the knowledge you've gained through this process into some doc on new OpenLineage doc site 🙏
+*Thread Reply:* Can't help you now, but I'd love if you dumped the knowledge you've gained through this process into some doc on new OpenLineage doc site 🙏
@@ -62876,7 +62882,7 @@*Thread Reply:* We'll definitely put all of it together as a reference for others, and hopefully have a solution by the end of it too
+*Thread Reply:* We'll definitely put all of it together as a reference for others, and hopefully have a solution by the end of it too
@@ -63025,7 +63031,7 @@*Thread Reply:* Soo cool, @David Cecchi 💯💯💯. I’m not familiar with marklogic, but pretty awesome ETL platform and the lineage graph looks 👌! Did you have to write any custom integration code? Or were you able to use our off-the-shelf integrations to get things working? (Also, thanks for sharing!)
+*Thread Reply:* Soo cool, @David Cecchi 💯💯💯. I’m not familiar with marklogic, but pretty awesome ETL platform and the lineage graph looks 👌! Did you have to write any custom integration code? Or were you able to use our off-the-shelf integrations to get things working? (Also, thanks for sharing!)
@@ -63051,7 +63057,7 @@*Thread Reply:* team had to write some custom stuff but it's all framework so it can be repurposed not rewritten over and over. i would see this as another "Platform" in the context of the integrations semantic OL uses, so no, we didn't start w/ an existing solution. just used internal hooks and then called lineage APIs.
+*Thread Reply:* team had to write some custom stuff but it's all framework so it can be repurposed not rewritten over and over. i would see this as another "Platform" in the context of the integrations semantic OL uses, so no, we didn't start w/ an existing solution. just used internal hooks and then called lineage APIs.
*Thread Reply:* Ah totally make sense. Would you be open to a brief presentation and/or demo in a future OL community meeting? The community is always looking to hear how OL is used in the wild, and this seems aligned with that (assuming you can talk about the implementation at a high-level)
+*Thread Reply:* Ah totally make sense. Would you be open to a brief presentation and/or demo in a future OL community meeting? The community is always looking to hear how OL is used in the wild, and this seems aligned with that (assuming you can talk about the implementation at a high-level)
@@ -63103,7 +63109,7 @@*Thread Reply:* No pressure, of course 😉
+*Thread Reply:* No pressure, of course 😉
@@ -63129,7 +63135,7 @@*Thread Reply:* ha not feeling any pressure. familiar with the intentions and dynamic. let's keep that on radar - i don't keep tabs on community meetings but mid/late august would be workable. and to be clear, this is being used in the wild in a sandbox 🙂.
+*Thread Reply:* ha not feeling any pressure. familiar with the intentions and dynamic. let's keep that on radar - i don't keep tabs on community meetings but mid/late august would be workable. and to be clear, this is being used in the wild in a sandbox 🙂.
@@ -63155,7 +63161,7 @@*Thread Reply:* Sounds great, and a reasonable timeline! (cc @Michael Robinson can follow up). Even if it’s in a sandbox, talking about the level of effort helps with improving our APIs or sharing with others how smooth it can be!
+*Thread Reply:* Sounds great, and a reasonable timeline! (cc @Michael Robinson can follow up). Even if it’s in a sandbox, talking about the level of effort helps with improving our APIs or sharing with others how smooth it can be!
@@ -63185,7 +63191,7 @@*Thread Reply:* chiming in as well to say this is really cool 👍
+*Thread Reply:* chiming in as well to say this is really cool 👍
@@ -63211,7 +63217,7 @@*Thread Reply:* Nice! Would this become a product feature in Marklogic Data Hub?
+*Thread Reply:* Nice! Would this become a product feature in Marklogic Data Hub?
@@ -63237,7 +63243,7 @@*Thread Reply:* MarkLogic is a multi-model database and search engine. This implementation triggers off the MarkLogic Datahub Github batch records created when running the datahub flows. Just a toe in the water so far.
+*Thread Reply:* MarkLogic is a multi-model database and search engine. This implementation triggers off the MarkLogic Datahub Github batch records created when running the datahub flows. Just a toe in the water so far.
*Thread Reply:* That sounds like a good idea to me. Be good to have some guidance on that.
+*Thread Reply:* That sounds like a good idea to me. Be good to have some guidance on that.
The repo is open for business! Feel free to add the page where you think it fits.
*Thread Reply:* OK! Let’s do this!
+*Thread Reply:* OK! Let’s do this!
@@ -63414,7 +63420,7 @@*Thread Reply:* @Ross Turk, feel free to assign to me https://github.com/OpenLineage/docs/issues/1!
+*Thread Reply:* @Ross Turk, feel free to assign to me https://github.com/OpenLineage/docs/issues/1!
*Thread Reply:* Thanks, @Ross Turk for finding a home for more technical / how-to docs… long overdue 💯
+*Thread Reply:* Thanks, @Ross Turk for finding a home for more technical / how-to docs… long overdue 💯
@@ -63553,7 +63559,7 @@*Thread Reply:* BTW you can see the current site at http://openlineage.io/docs/ - merges to main will ship a new site.
+*Thread Reply:* BTW you can see the current site at http://openlineage.io/docs/ - merges to main will ship a new site.
*Thread Reply:* great, was using docs.openlineage.io … we’ll eventually want the docs to live under the docs subdomain though?
+*Thread Reply:* great, was using docs.openlineage.io … we’ll eventually want the docs to live under the docs subdomain though?
*Thread Reply:* TBH I activated GitHub Pages on the repo expecting it to live at openlineage.github.io/docs, thinking we could look at it there before it's ready to be published and linked in to the website
+*Thread Reply:* TBH I activated GitHub Pages on the repo expecting it to live at openlineage.github.io/docs, thinking we could look at it there before it's ready to be published and linked in to the website
@@ -63652,7 +63658,7 @@*Thread Reply:* and it came live at openlineage.io/docs 😄
+*Thread Reply:* and it came live at openlineage.io/docs 😄
@@ -63678,7 +63684,7 @@*Thread Reply:* nice and sounds good 👍
+*Thread Reply:* nice and sounds good 👍
@@ -63704,7 +63710,7 @@*Thread Reply:* still do not understand why, but I'll take it as a happy accident. we can move to docs.openlineage.io easily - just need to add the A record in the LF infra + the CNAME file in the static dir of this repo
+*Thread Reply:* still do not understand why, but I'll take it as a happy accident. we can move to docs.openlineage.io easily - just need to add the A record in the LF infra + the CNAME file in the static dir of this repo
@@ -63808,7 +63814,7 @@*Thread Reply:* yes - openlineage is job -> dataset -> job. particularly, the model is designed to observe the movement of data
+*Thread Reply:* yes - openlineage is job -> dataset -> job. particularly, the model is designed to observe the movement of data
@@ -63834,7 +63840,7 @@*Thread Reply:* the spec is based around run events, which are observed states of job runs. jobs are observed to see how they affect datasets, and that relationship is what OpenLineage traces
+*Thread Reply:* the spec is based around run events, which are observed states of job runs. jobs are observed to see how they affect datasets, and that relationship is what OpenLineage traces
@@ -63938,7 +63944,7 @@*Thread Reply:* This thread covers glue in some detail: https://openlineage.slack.com/archives/C01CK9T7HKR/p1637605977118000?thread_ts=1637605977.118000&cid=C01CK9T7HKR
+*Thread Reply:* This thread covers glue in some detail: https://openlineage.slack.com/archives/C01CK9T7HKR/p1637605977118000?thread_ts=1637605977.118000&cid=C01CK9T7HKR
*Thread Reply:* TL;DR: you can use the spark integration to capture some lineage, but it's not comprehensive
+*Thread Reply:* TL;DR: you can use the spark integration to capture some lineage, but it's not comprehensive
@@ -64025,7 +64031,7 @@*Thread Reply:* i suspect there will be opportunities to influence AWS to be a "fast follower" if OL adoption and buy-in starts to feel authentically real in non-aws portions of the stack. i discussed OL casually with AWS analytics leadership (Rahul Pathak) last winter and he seemed curious and open to this type of idea. to be clear, ~95% chance he's forgotten that conversation now but hey it's still something.
+*Thread Reply:* i suspect there will be opportunities to influence AWS to be a "fast follower" if OL adoption and buy-in starts to feel authentically real in non-aws portions of the stack. i discussed OL casually with AWS analytics leadership (Rahul Pathak) last winter and he seemed curious and open to this type of idea. to be clear, ~95% chance he's forgotten that conversation now but hey it's still something.
@@ -64055,7 +64061,7 @@*Thread Reply:* There are a couple of aws people here (including me) following.
+*Thread Reply:* There are a couple of aws people here (including me) following.
@@ -64120,7 +64126,7 @@*Thread Reply:* I believe what you want is the DocumentationJobFacet. It adds a description property to a job.
+*Thread Reply:* I believe what you want is the DocumentationJobFacet. It adds a description property to a job.
*Thread Reply:* You can see a Python example here, in the Airflow integration: https://github.com/OpenLineage/OpenLineage/blob/65a5f021a1ba3035d5198e759587737a05b242e1/integration/airflow/openlineage/airflow/adapter.py#L217
+*Thread Reply:* You can see a Python example here, in the Airflow integration: https://github.com/OpenLineage/OpenLineage/blob/65a5f021a1ba3035d5198e759587737a05b242e1/integration/airflow/openlineage/airflow/adapter.py#L217
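(A minimal sketch of the same idea with the OpenLineage Python client; the job and namespace names are illustrative.)
```python
from openlineage.client.facet import DocumentationJobFacet
from openlineage.client.run import Job

job = Job(
    namespace="my-namespace",
    name="my-job",
    facets={
        # rendered as the job description in the Marquez UI
        "documentation": DocumentationJobFacet(description="Loads daily orders."),
    },
)
```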
@@ -64176,7 +64182,7 @@*Thread Reply:* (looking for a curl example…)
+*Thread Reply:* (looking for a curl example…)
*Thread Reply:* I see, so there are special facet keys which will get translated into something special in the ui, is that correct?
+*Thread Reply:* I see, so there are special facet keys which will get translated into something special in the ui, is that correct?
Are these documented anywhere?
@@ -64230,7 +64236,7 @@*Thread Reply:* Correct - info from the various OpenLineage facets are used in the Marquez UI.
+*Thread Reply:* Correct - info from the various OpenLineage facets are used in the Marquez UI.
@@ -64256,7 +64262,7 @@*Thread Reply:* I couldn’t find a curl example with a description field, but I did generate this one with a sql field:
+*Thread Reply:* I couldn’t find a curl example with a description field, but I did generate this one with a sql field:
{
"job": {
@@ -64309,7 +64315,7 @@
[2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -
*Thread Reply:* The facets (at least, those in the core spec) are here: https://github.com/OpenLineage/OpenLineage/tree/65a5f021a1ba3035d5198e759587737a05b242e1/spec/facets
+*Thread Reply:* The facets (at least, those in the core spec) are here: https://github.com/OpenLineage/OpenLineage/tree/65a5f021a1ba3035d5198e759587737a05b242e1/spec/facets
@@ -64335,7 +64341,7 @@*Thread Reply:* it’s designed so that facets can exist outside the core, in other repos, as well
+*Thread Reply:* it’s designed so that facets can exist outside the core, in other repos, as well
@@ -64361,7 +64367,7 @@*Thread Reply:* Thank you for sharing these, I was able to get the sql query highlighting to work. But I failed to get the location link or the documentation to work. My facet attempt looked like:
+*Thread Reply:* Thank you for sharing these, I was able to get the sql query highlighting to work. But I failed to get the location link or the documentation to work. My facet attempt looked like:
{
"facets": {
"description": "test-description-job",
@@ -64405,7 +64411,7 @@
[2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -
*Thread Reply:* I got the documentation link to work by renaming the property from documentation -> description. I still haven't been able to get the external link to work
+*Thread Reply:* I got the documentation link to work by renaming the property from documentation -> description. I still haven't been able to get the external link to work
*Thread Reply:* There’s a good integration between OL and Databricks for pulling metadata out of running Spark clusters. But there’s not currently a connection between OL and the Unity Catalog.
+*Thread Reply:* There’s a good integration between OL and Databricks for pulling metadata out of running Spark clusters. But there’s not currently a connection between OL and the Unity Catalog.
I think it would be cool to see some discussions start to develop around it 👍
@@ -64544,7 +64550,7 @@*Thread Reply:* Absolutely. I saw some mention of APIs and access, and was wondering if maybe they used OpenLineage as a framework, which would be awesome.
+*Thread Reply:* Absolutely. I saw some mention of APIs and access, and was wondering if maybe they used OpenLineage as a framework, which would be awesome.
@@ -64570,7 +64576,7 @@*Thread Reply:* (and since Azure Databricks uses it - https://openlineage.io/blog/openlineage-microsoft-purview/ I wasn’t sure about Unity Catalog)
+*Thread Reply:* (and since Azure Databricks uses it - https://openlineage.io/blog/openlineage-microsoft-purview/ I wasn’t sure about Unity Catalog)
*Thread Reply:* We're in the early stages of discussion regarding an OpenLineage integration for Unity. You showing interest would help increase the priority of that on the DB side.
+*Thread Reply:* We're in the early stages of discussion regarding an OpenLineage integration for Unity. You showing interest would help increase the priority of that on the DB side.
@@ -64651,7 +64657,7 @@*Thread Reply:* I'm interested in Databricks enabling an openlineage endpoint, serving as a catalogue. Similar to how they provide hosted MLFlow. I can mention this to our Databricks reps as well
+*Thread Reply:* I'm interested in Databricks enabling an openlineage endpoint, serving as a catalogue. Similar to how they provide hosted MLFlow. I can mention this to our Databricks reps as well
@@ -64706,7 +64712,7 @@*Thread Reply:* Link to spec where I looked https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.json
+*Thread Reply:* Link to spec where I looked https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.json
*Thread Reply:* My bad. I realize now that column lineage has been implemented as a facet, hence not visible in the main spec https://github.com/OpenLineage/OpenLineage/search?q=ColumnLineageDatasetFacet
+*Thread Reply:* My bad. I realize now that column lineage has been implemented as a facet, hence not visible in the main spec https://github.com/OpenLineage/OpenLineage/search?q=ColumnLineageDatasetFacet
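(For reference, a sketch of what the facet looks like on an output dataset; the field names follow the ColumnLineageDatasetFacet spec, while the dataset and column names are illustrative.)
```json
{
  "columnLineage": {
    "fields": {
      "order_total": {
        "inputFields": [
          {"namespace": "postgres://prod-db", "name": "public.orders", "field": "amount"}
        ]
      }
    }
  }
}
```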
@@ -65023,7 +65029,7 @@*Thread Reply:* It is supported in the Spark integration
+*Thread Reply:* It is supported in the Spark integration
@@ -65049,7 +65055,7 @@*Thread Reply:* @Paweł Leszczyński could you add the Column Lineage facet here in the spec? https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md#standard-facets
+*Thread Reply:* @Paweł Leszczyński could you add the Column Lineage facet here in the spec? https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md#standard-facets
@@ -65116,7 +65122,7 @@*Thread Reply:* @Ross Turk I still want to contribute something like this to the OpenLineage docs / new site but the bar for an internal doc is lower in my mind 😅
+*Thread Reply:* @Ross Turk I still want to contribute something like this to the OpenLineage docs / new site but the bar for an internal doc is lower in my mind 😅
@@ -65142,7 +65148,7 @@*Thread Reply:* 😄
+*Thread Reply:* 😄
@@ -65168,7 +65174,7 @@*Thread Reply:* @Will Johnson happy to help you with docs, when the time comes! sketching outline --> editing, whatever you need
+*Thread Reply:* @Will Johnson happy to help you with docs, when the time comes! sketching outline --> editing, whatever you need
@@ -65194,7 +65200,7 @@*Thread Reply:* This looks nice by the way.
+*Thread Reply:* This looks nice by the way.
@@ -65268,7 +65274,7 @@*Thread Reply:* do you have proper execute permissions?
+*Thread Reply:* do you have proper execute permissions?
@@ -65294,7 +65300,7 @@*Thread Reply:* not sure how that works on windows, but it just looks like it does not recognize dbt-ol as executable
+*Thread Reply:* not sure how that works on windows, but it just looks like it does not recognize dbt-ol as executable
@@ -65320,7 +65326,7 @@*Thread Reply:* yes i have admin rights. how to make this as executable?
+*Thread Reply:* yes i have admin rights. how to make this as executable?
@@ -65346,7 +65352,7 @@*Thread Reply:* btw do we have a sample docker image where dbt-ol can run?
+*Thread Reply:* btw do we have a sample docker image where dbt-ol can run?
@@ -65372,7 +65378,7 @@*Thread Reply:* I have also never tried on Windows 😕 but you might try python3 dbt-ol run?
+*Thread Reply:* I have also never tried on Windows 😕 but you might try python3 dbt-ol run?
*Thread Reply:* will try that
+*Thread Reply:* will try that
@@ -65480,7 +65486,7 @@*Thread Reply:* This may be a result of splitting the Spark integration into multiple submodules: app, shared, spark2, spark3, spark32, etc. If the test case is from the shared submodule (this one looks like that), you could try running:
+*Thread Reply:* This may be a result of splitting the Spark integration into multiple submodules: app, shared, spark2, spark3, spark32, etc. If the test case is from the shared submodule (this one looks like that), you could try running:
./gradlew :shared:test --tests io.openlineage.spark.agent.OpenLineageSparkListenerTest
*Thread Reply:* @Paweł Leszczyński, I tried running that command, and I get the following error:
+*Thread Reply:* @Paweł Leszczyński, I tried running that command, and I get the following error:
```> Task :shared:test FAILED
@@ -65557,7 +65563,7 @@*Thread Reply:* When running build and test for all the submodules, I can see outputs for tests in different submodules (spark3, spark2 etc), but for some reason, I cannot find any indication that the tests in
+*Thread Reply:* When running build and test for all the submodules, I can see outputs for tests in different submodules (spark3, spark2 etc), but for some reason, I cannot find any indication that the tests in OpenLineage/integration/spark/app/src/test/java/io/openlineage/spark/agent/lifecycle/plan are being run at all.
*Thread Reply:* That’s interesting. Let’s ask @Tomasz Nazarewicz about that.
+*Thread Reply:* That’s interesting. Let’s ask @Tomasz Nazarewicz about that.
@@ -65615,11 +65621,11 @@*Thread Reply:* For reference, I attached the stdout and stderr messages from running the following:
+*Thread Reply:* For reference, I attached the stdout and stderr messages from running the following:
./gradlew :shared:spotlessApply && ./gradlew :app:spotlessApply && ./gradlew clean build test
*Thread Reply:* I'll look into it
+*Thread Reply:* I'll look into it
@@ -65677,7 +65683,7 @@*Thread Reply:* Update: some tests appeared to not be visible after the split; that's fixed, but now I have to solve some dependency issues
+*Thread Reply:* Update: some tests appeared to not be visible after the split; that's fixed, but now I have to solve some dependency issues
@@ -65707,7 +65713,7 @@*Thread Reply:* That's great, thank you!
+*Thread Reply:* That's great, thank you!
@@ -65733,7 +65739,7 @@*Thread Reply:* Hi Tomasz, thanks so much for looking into this. Is this your PR (https://github.com/OpenLineage/OpenLineage/pull/953) that fixes the whole issue, or is there still some work to do to solve the dependency issues you mentioned?
+*Thread Reply:* Hi Tomasz, thanks so much for looking into this. Is this your PR (https://github.com/OpenLineage/OpenLineage/pull/953) that fixes the whole issue, or is there still some work to do to solve the dependency issues you mentioned?
*Thread Reply:* I'm still testing it, should've changed it to draft, sorry
+*Thread Reply:* I'm still testing it, should've changed it to draft, sorry
@@ -65829,7 +65835,7 @@*Thread Reply:* No worries! If I can help with testing or anything please let me know!
+*Thread Reply:* No worries! If I can help with testing or anything please let me know!
@@ -65855,7 +65861,7 @@*Thread Reply:* Will do! Thanks :)
+*Thread Reply:* Will do! Thanks :)
@@ -65881,7 +65887,7 @@*Thread Reply:* Hi @Tomasz Nazarewicz, if possible, could you please share an estimated timeline for resolving the issue? We have 3 PRs which we are either waiting to open or to update which are dependent on the tests.
+*Thread Reply:* Hi @Tomasz Nazarewicz, if possible, could you please share an estimated timeline for resolving the issue? We have 3 PRs which we are either waiting to open or to update which are dependent on the tests.
@@ -65907,7 +65913,7 @@*Thread Reply:* @Hanna Moazam hi, it's quite difficult to do that because the issue is that all the tests are passing when I execute ./gradlew app:test but one is failing with ./gradlew app:build. If it fixes your problem, I can disable this test for now and make a PR without it; then you can maybe unblock your stuff and I will have more time to investigate the issue.
+*Thread Reply:* @Hanna Moazam hi, it's quite difficult to do that because the issue is that all the tests are passing when I execute ./gradlew app:test but one is failing with ./gradlew app:build. If it fixes your problem, I can disable this test for now and make a PR without it; then you can maybe unblock your stuff and I will have more time to investigate the issue.
@@ -65936,7 +65942,7 @@*Thread Reply:* Oh that's a strange issue. Yes that would be really helpful if you can, because we have some tests we implemented which we need to make sure pass as expected.
+*Thread Reply:* Oh that's a strange issue. Yes that would be really helpful if you can, because we have some tests we implemented which we need to make sure pass as expected.
@@ -65962,7 +65968,7 @@*Thread Reply:* Thank you for your help Tomasz!
+*Thread Reply:* Thank you for your help Tomasz!
@@ -65988,7 +65994,7 @@*Thread Reply:* @Hanna Moazam https://github.com/OpenLineage/OpenLineage/pull/980 here is the pull request with the changes
+*Thread Reply:* @Hanna Moazam https://github.com/OpenLineage/OpenLineage/pull/980 here is the pull request with the changes
*Thread Reply:* it's waiting for review currently
+*Thread Reply:* it's waiting for review currently
@@ -66086,7 +66092,7 @@*Thread Reply:* Thank you!
+*Thread Reply:* Thank you!
@@ -66261,7 +66267,7 @@*Thread Reply:* The doc site would benefit from a page about it. Maybe @Paweł Leszczyński?
+*Thread Reply:* The doc site would benefit from a page about it. Maybe @Paweł Leszczyński?
@@ -66287,7 +66293,7 @@*Thread Reply:* Sure, it’s already on my list, will do
+*Thread Reply:* Sure, it’s already on my list, will do
@@ -66317,7 +66323,7 @@*Thread Reply:* https://openlineage.io/docs/integrations/spark/spark_column_lineage
+*Thread Reply:* https://openlineage.io/docs/integrations/spark/spark_column_lineage
*Thread Reply:* To be honest, I have never seen that in action and would love to have that in our documentation.
+*Thread Reply:* To be honest, I have never seen that in action and would love to have that in our documentation.
@Michael Collado or @Maciej Obuchowski: are you able to create some doc? I think one of you was working on that.
@@ -66431,7 +66437,7 @@*Thread Reply:* Yes, parent run
+*Thread Reply:* Yes, parent run
@@ -66487,7 +66493,7 @@*Thread Reply:* Will take a look
+*Thread Reply:* Will take a look
@@ -66513,7 +66519,7 @@*Thread Reply:* But generally this support message is just a warning
+*Thread Reply:* But generally this support message is just a warning
@@ -66539,7 +66545,7 @@*Thread Reply:* @shweta p any actual error you've found?
+*Thread Reply:* @shweta p any actual error you've found? I've tested it with dbt-bigquery on 1.1.0 and it works despite warning:
➜ small OPENLINEAGE_URL=http://localhost:5050 dbt-ol build
@@ -66620,7 +66626,7 @@
SundayFunday
*Thread Reply:* I think it's safe to say we'll see a release by the end of next week
+*Thread Reply:* I think it's safe to say we'll see a release by the end of next week
@@ -66719,7 +66725,7 @@*Thread Reply:* Welcome to the community!
+*Thread Reply:* Welcome to the community!
We talked about this exact topic in the most recent community call. https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting#MonthlyTSCmeeting-Nextmeeting:Nov10th2021(9amPT)
@@ -66764,7 +66770,7 @@*Thread Reply:* > or do you mean reporting start job/end job should be processed each event?
+*Thread Reply:* > or do you mean reporting start job/end job should be processed each event? We definitely want to avoid tracking every single event 🙂
One thing worth mentioning is that OpenLineage events are meant to be cumulative - the streaming jobs start, run, and eventually finish or restart. In the meantime, we capture additional events "in the middle" - for example, on Apache Flink checkpoint, or every few minutes - where we can emit additional information connected to the state of the job.
@@ -66797,7 +66803,7 @@*Thread Reply:* @Will Johnson and @Maciej Obuchowski Thank you for your answer
+*Thread Reply:* @Will Johnson and @Maciej Obuchowski Thank you for your answer
jobs start, run, and eventually finish or restart
*Thread Reply:*
+*Thread Reply:*
*Thread Reply:* The idea is that jobs usually get upgraded - for example, you change the Apache Flink version, increase resources, or change the structure of a job - that's what marks a new run for us. The stop events make sense because, if you for example changed the SQL of your Flink SQL job, you would probably want this to be captured - from X to Y the job was running well with the older SQL version, but after the change, the second run started and throughput dropped to 10% of the previous one.
+*Thread Reply:* The idea is that jobs usually get upgraded - for example, you change the Apache Flink version, increase resources, or change the structure of a job - that's what marks a new run for us. The stop events make sense because, if you for example changed the SQL of your Flink SQL job, you would probably want this to be captured - from X to Y the job was running well with the older SQL version, but after the change, the second run started and throughput dropped to 10% of the previous one.
> if you do start/stop events from the checkpoints on Flink it might be the wrong representation; instead use the concept of event-driven, for example reporting state. But this is a misunderstanding 🙂
@@ -66903,7 +66909,7 @@*Thread Reply:* The checkpoint would be captured as a new eventType: RUNNING - am I missing something about why you want to add a StateFacet?
+*Thread Reply:* The checkpoint would be captured as a new eventType: RUNNING - am I missing something about why you want to add a StateFacet?
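For illustration, a minimal sketch of emitting such a RUNNING event with the Python client - the endpoint, namespace, and job name are placeholders, not from this thread:
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # placeholder backend URL
run_id = str(uuid4())  # stays the same for every event of this job run

# Emitted "in the middle" - e.g. on a Flink checkpoint or every few minutes
client.emit(RunEvent(
    eventType=RunState.RUNNING,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=run_id),
    job=Job(namespace="my-namespace", name="my-flink-job"),
    producer="https://example.com/my-producer",
))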
*Thread Reply:* About the argument - it depends on the definition of a job in streaming mode; I agree that if you already have a ‘job’ you want to know more information about the job.
+*Thread Reply:* About the argument - it depends on the definition of a job in streaming mode; I agree that if you already have a ‘job’ you want to know more information about the job.
Should each event entering the sub-process (job) make a REST call for “Start job” and “End job”?
@@ -66999,7 +67005,7 @@*Thread Reply:* Thanks for the +1s. We will initiate the release by Tuesday.
+*Thread Reply:* Thanks for the +1s. We will initiate the release by Tuesday.
@@ -67054,7 +67060,7 @@*Thread Reply:* I think this is an interesting idea, however, just the static analysis does not convey any runtime information.
+*Thread Reply:* I think this is an interesting idea, however, just the static analysis does not convey any runtime information.
We're doing something similar within Airflow now, but as a fallback mechanism: https://github.com/OpenLineage/OpenLineage/pull/914
@@ -67084,7 +67090,7 @@*Thread Reply:* Thanks for the detailed response @Maciej Obuchowski! It seems like this solution is specific only to Airflow, and I wonder why we wouldn't generalize this outside of just Airflow? My thinking is that there are other areas where there is vast scope (e.g. arbitrary code that does data manipulations), and without such an option, the only path is to provide full runtime information via building your own extractor, which might be a bit hard/expensive to do. +
*Thread Reply:* Thanks for the detailed response @Maciej Obuchowski! It seems like this solution is specific only to Airflow, and I wonder why we wouldn't generalize this outside of just Airflow? My thinking is that there are other areas where there is vast scope (e.g. arbitrary code that does data manipulations), and without such an option, the only path is to provide full runtime information via building your own extractor, which might be a bit hard/expensive to do.
If I understand your response correctly, then you assume that OpenLineage can get wide enough "native" support across the stack without resorting to a fallback like 'static code analysis'. Is that your base assumption?
@@ -67164,7 +67170,7 @@*Thread Reply:* The query API has a different spec than the reporting API, so what you’d get from Marquez would look different from what Marquez receives.
+*Thread Reply:* The query API has a different spec than the reporting API, so what you’d get from Marquez would look different from what Marquez receives.
A few ideas:
@@ -67195,7 +67201,7 @@*Thread Reply:* ok, now I understand, thank you
+*Thread Reply:* ok, now I understand, thank you
@@ -67225,7 +67231,7 @@*Thread Reply:* FYI we want to have something like that too: https://github.com/MarquezProject/marquez/issues/1927
+*Thread Reply:* FYI we want to have something like that too: https://github.com/MarquezProject/marquez/issues/1927
But if you need just the raw events endpoint, without UI, then Marquez might be overkill for your needs
*Thread Reply:* Short of implementing an OpenLineage endpoint in Amundsen, yes that's the right approach.
+*Thread Reply:* Short of implementing an OpenLineage endpoint in Amundsen, yes that's the right approach.
The Lineage endpoint in Marquez can output the whole graph centered on a node ID, and you can use the jobs/datasets apis to grab lists of each for reference
@@ -67341,7 +67347,7 @@*Thread Reply:* Is your lineage information coming via OpenLineage? if so - you can quickly use the Amundsen scripts in order to load data into Amundsen, for example, see this script here: https://github.com/amundsen-io/amundsendatabuilder/blob/master/example/scripts/sample_data_loader.py
+*Thread Reply:* Is your lineage information coming via OpenLineage? if so - you can quickly use the Amundsen scripts in order to load data into Amundsen, for example, see this script here: https://github.com/amundsen-io/amundsendatabuilder/blob/master/example/scripts/sample_data_loader.py
Where is your lineage coming from?
*Thread Reply:* yes @Barak F we are using open lineage
+*Thread Reply:* yes @Barak F we are using open lineage
@@ -67598,7 +67604,7 @@*Thread Reply:* So, have you tried using Amundsen data builder scripts to load the lineage information into Amundsen? (maybe you'll have to "play" with those a bit)
+*Thread Reply:* So, have you tried using Amundsen data builder scripts to load the lineage information into Amundsen? (maybe you'll have to "play" with those a bit)
@@ -67624,7 +67630,7 @@*Thread Reply:* AFAIK there is OpenLineage extractor: https://www.amundsen.io/amundsen/databuilder/#openlineagetablelineageextractor
+*Thread Reply:* AFAIK there is OpenLineage extractor: https://www.amundsen.io/amundsen/databuilder/#openlineagetablelineageextractor
Not sure it solves your issue though 🙂
@@ -67652,7 +67658,7 @@*Thread Reply:* thanks
+*Thread Reply:* thanks
@@ -67815,7 +67821,7 @@*Thread Reply:* I think the reason for the server model being very generic is that new facets can be added later (also as custom facets) - and generally the server wants to accept all valid events and get the facet information that it can actually use, rather than reject an event because it has an unknown field.
+*Thread Reply:* I think the reason for the server model being very generic is that new facets can be added later (also as custom facets) - and generally the server wants to accept all valid events and get the facet information that it can actually use, rather than reject an event because it has an unknown field.
Server model was added here after some discussion in Marquez which is relevant - I think @Michael Collado @Willy Lulciuc can add to that
*Thread Reply:* Thanks for the response. I realize the server stubs were created to support flexibility, but it also makes the parsing logic on the server side a bit more complex, as we need to maintain code on the server side to look for specific facets & their properties from maps, or, like Marquez, duplicate the OL model on our end with the facets we care about. Wanted to know what's the guidance around managing this server side. @Willy Lulciuc @Michael Collado Any suggestions?
+*Thread Reply:* Thanks for the response. I realize the server stubs were created to support flexibility, but it also makes the parsing logic on the server side a bit more complex, as we need to maintain code on the server side to look for specific facets & their properties from maps, or, like Marquez, duplicate the OL model on our end with the facets we care about. Wanted to know what's the guidance around managing this server side. @Willy Lulciuc @Michael Collado Any suggestions?
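A minimal sketch of the tolerant approach described above - pick out the facets you care about from the raw event and ignore the rest. The facet names follow the spec; the handler itself is hypothetical:
from typing import Any, Dict

# Facets this (hypothetical) backend actually consumes; unknown facets are ignored, not rejected.
KNOWN_RUN_FACETS = {"parent", "nominalTime"}

def extract_known_run_facets(event: Dict[str, Any]) -> Dict[str, Any]:
    """Accept any valid event and keep only the run facets we can use."""
    facets = event.get("run", {}).get("facets", {}) or {}
    return {name: facets[name] for name in KNOWN_RUN_FACETS if name in facets}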
@@ -67972,7 +67978,7 @@*Thread Reply:* @Paweł Leszczyński do we collect column level lineage on renames?
+*Thread Reply:* @Paweł Leszczyński do we collect column level lineage on renames?
@@ -67998,7 +68004,7 @@*Thread Reply:* @Maciej Obuchowski no, we don’t. +
*Thread Reply:* @Maciej Obuchowski no, we don’t. @Varun Singh create table as select may suit you well. Other examples are within tests like: • https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/spark3/src/[…]lifecycle/plan/column/ColumnLevelLineageUtilsV2CatalogTest.java • https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/spark3/src/[…]ecycle/plan/column/ColumnLevelLineageUtilsNonV2CatalogTest.java
@@ -68075,7 +68081,7 @@*Thread Reply:* I’ve created an issue for column lineage in case of renaming: +
*Thread Reply:* I’ve created an issue for column lineage in case of renaming: https://github.com/OpenLineage/OpenLineage/issues/993
*Thread Reply:* Thanks @Paweł Leszczyński!
+*Thread Reply:* Thanks @Paweł Leszczyński!
@@ -68183,7 +68189,7 @@*Thread Reply:* Fivetran is a tool that copies data from source systems to target databases. One of these source systems might be SalesForce, for example.
+*Thread Reply:* Fivetran is a tool that copies data from source systems to target databases. One of these source systems might be SalesForce, for example.
This copying results in thousands of SQL queries run against the target database for each sync. I don’t think each of these queries should map to an OpenLineage job, I think the entire synchronization should. Maybe I’m wrong here.
@@ -68211,7 +68217,7 @@*Thread Reply:* But if I’m right, that means that there needs to be a way to specify “SalesForce Account #45123452233” as a dataset.
+*Thread Reply:* But if I’m right, that means that there needs to be a way to specify “SalesForce Account #45123452233” as a dataset.
@@ -68237,7 +68243,7 @@*Thread Reply:* or it ends up just being a job with outputs and no inputs…but that’s not very illuminating
+*Thread Reply:* or it ends up just being a job with outputs and no inputs…but that’s not very illuminating
@@ -68263,7 +68269,7 @@*Thread Reply:* or is that good enough?
+*Thread Reply:* or is that good enough?
@@ -68289,7 +68295,7 @@*Thread Reply:* You are looking at a pretty big topic here 🙂
+*Thread Reply:* You are looking at a pretty big topic here 🙂
Basically you're asking what is a job in OpenLineage - and it's not fully answered yet.
@@ -68319,7 +68325,7 @@*Thread Reply:* my 2 cents on this is that in the Salesforce example, the system is too complex to capture as a single dataset. and so maybe different objects within a salesforce account (org/account/opportunity/etc…) could be treated as individual datasets. But as @Maciej Obuchowski pointed out, this is quite a large topic 🙂
+*Thread Reply:* my 2 cents on this is that in the Salesforce example, the system is too complex to capture as a single dataset. and so maybe different objects within a salesforce account (org/account/opportunity/etc…) could be treated as individual datasets. But as @Maciej Obuchowski pointed out, this is quite a large topic 🙂
@@ -68345,7 +68351,7 @@*Thread Reply:* I guess it depends on whether you actually care about the table/column level lineage for an operation like “copy salesforce to snowflake”.
+*Thread Reply:* I guess it depends on whether you actually care about the table/column level lineage for an operation like “copy salesforce to snowflake”.
I can see it being a nuisance having all of that on a lineage graph. OTOH, I can see it being useful to know that a datum can be traced back to a specific endpoint at SFDC.
@@ -68373,7 +68379,7 @@*Thread Reply:* this is a design decision, IMO.
+*Thread Reply:* this is a design decision, IMO.
@@ -68567,7 +68573,7 @@*Thread Reply:* I am so sad I'm going to miss this month's meeting 😰 Looking forward to the recording!
+*Thread Reply:* I am so sad I'm going to miss this month's meeting 😰 Looking forward to the recording!
@@ -68593,7 +68599,7 @@*Thread Reply:* We missed you too @Will Johnson 😉
+*Thread Reply:* We missed you too @Will Johnson 😉
@@ -68688,7 +68694,7 @@*Thread Reply:* From the first look, you're missing outputs field in your event - this might break something
+*Thread Reply:* From the first look, you're missing outputs field in your event - this might break something
*Thread Reply:* If not, then Marquez logs might help to see something
+*Thread Reply:* If not, then Marquez logs might help to see something
@@ -68740,7 +68746,7 @@*Thread Reply:* Does the START event need to have an output?
+*Thread Reply:* Does the START event need to have an output?
@@ -68766,7 +68772,7 @@*Thread Reply:* It can have empty output 🙂
+*Thread Reply:* It can have empty output 🙂
@@ -68792,7 +68798,7 @@*Thread Reply:* well, in your case you need to send COMPLETE event
+*Thread Reply:* well, in your case you need to send COMPLETE event
*Thread Reply:* Internally, Marquez does not create a dataset version until the COMPLETE event. It makes sense when your semantics are transactional - you can still read from the previous dataset version until writing is finished.
+*Thread Reply:* Internally, Marquez does not create a dataset version until the COMPLETE event. It makes sense when your semantics are transactional - you can still read from the previous dataset version until writing is finished.
@@ -68844,7 +68850,7 @@*Thread Reply:* After I send COMPLETE event with the same information I can see the dataset.
+*Thread Reply:* After I send COMPLETE event with the same information I can see the dataset.
*Thread Reply:* Thanks for the explanation @Maciej Obuchowski So, if I understand this correctly, I won't see the my-test-input dataset till I have the COMPLETE event with input and output?
+*Thread Reply:* Thanks for the explanation @Maciej Obuchowski So, if I understand this correctly, I won't see the my-test-input dataset till I have the COMPLETE event with input and output?
*Thread Reply:* @Raj Mishra Yes and no 🙂
+*Thread Reply:* @Raj Mishra Yes and no 🙂
Basically your COMPLETE event does not need to contain any input and output datasets at all - the OpenLineage model is cumulative, so it's enough to have datasets on either start or complete.
That also means you can add different datasets at different moments of a run lifecycle - for example, you know inputs, but not outputs, so you emit inputs on START, but not COMPLETE.
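As a sketch of that cumulative behavior with the Python client - the dataset and job names below are made up for the example:
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # placeholder backend URL
run = Run(runId=str(uuid4()))
job = Job(namespace="my-namespace", name="my-job")
producer = "https://example.com/my-producer"

# START carries only the inputs known up front...
client.emit(RunEvent(RunState.START, datetime.now(timezone.utc).isoformat(), run, job, producer,
                     inputs=[Dataset(namespace="my-namespace", name="my-test-input")]))

# ...COMPLETE adds the outputs; the backend merges both under the same runId.
client.emit(RunEvent(RunState.COMPLETE, datetime.now(timezone.utc).isoformat(), run, job, producer,
                     outputs=[Dataset(namespace="my-namespace", name="my-test-output")]))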
*Thread Reply:* @Maciej Obuchowski Thank you so much! This is great explanation.
+*Thread Reply:* @Maciej Obuchowski Thank you so much! This is great explanation.
@@ -68999,7 +69005,7 @@*Thread Reply:* Would adding support for alias/grouping as a config on the OL client side be valuable to other users? i.e. the OL client could pass down an alias/grouping facet. Or should this be treated purely as a server-side feature?
+*Thread Reply:* Would adding support for alias/grouping as a config on the OL client side be valuable to other users? i.e. the OL client could pass down an alias/grouping facet. Or should this be treated purely as a server-side feature?
@@ -69025,7 +69031,7 @@*Thread Reply:* Agreed 🙂
+*Thread Reply:* Agreed 🙂
How do you produce this dataset? Spark integration? Are you using any system like Apache Iceberg/Delta Lake or just writing raw files?
@@ -69053,7 +69059,7 @@*Thread Reply:* these are raw files written from Spark or map reduce jobs. And downstream Spark jobs read these raw files to produce tables
+*Thread Reply:* these are raw files written from Spark or map reduce jobs. And downstream Spark jobs read these raw files to produce tables
@@ -69079,7 +69085,7 @@*Thread Reply:* written using Spark dataframe API, like +
*Thread Reply:* written using Spark dataframe API, like
df.write.format("parquet").save("/tmp/spark_output/parquet")
or RDD?
*Thread Reply:* the actual API used matters, because we're handling different cases separately
+*Thread Reply:* the actual API used matters, because we're handling different cases separately
@@ -69133,7 +69139,7 @@*Thread Reply:* I see. Let me look that up to be absolutely sure
+*Thread Reply:* I see. Let me look that up to be absolutely sure
@@ -69159,7 +69165,7 @@*Thread Reply:* It is like. this : df.write.format("parquet").save("/tmp/spark_output/parquet")
*Thread Reply:* It is like. this : df.write.format("parquet").save("/tmp/spark_output/parquet")
*Thread Reply:* @Maciej Obuchowski curious what you had in mind with respect to RDDs & Dataframes. Also, what if we cannot integrate OL with the frameworks that produce this dataset, but only those that consume from the already produced datasets. Is there a way we could still capture the dataset appropriately?
+*Thread Reply:* @Maciej Obuchowski curious what you had in mind with respect to RDDs & Dataframes. Also, what if we cannot integrate OL with the frameworks that produce this dataset, but only those that consume from the already produced datasets. Is there a way we could still capture the dataset appropriately?
@@ -69211,7 +69217,7 @@*Thread Reply:* @Sharanya Santhanam the naming should be consistent between reading and writing, so it wouldn't change much if you can't integrate OL into writers. For the rest, can you create an issue on OL GitHub so someone can pick it up? I'm on vacation now.
+*Thread Reply:* @Sharanya Santhanam the naming should be consistent between reading and writing, so it wouldn't change much if you can't integrate OL into writers. For the rest, can you create an issue on OL GitHub so someone can pick it up? I'm on vacation now.
@@ -69237,7 +69243,7 @@*Thread Reply:* Sounds good, ty!
+*Thread Reply:* Sounds good, ty!
@@ -69316,7 +69322,7 @@*Thread Reply:* yes
+*Thread Reply:* yes
@@ -69342,7 +69348,7 @@*Thread Reply:* please do create 🙂
+*Thread Reply:* please do create 🙂
@@ -69398,7 +69404,7 @@*Thread Reply:* it's per project. +
*Thread Reply:* it's per project.
java based: ./gradlew test
python based: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#development
@@ -69510,7 +69516,7 @@*Thread Reply:* Hey Will! A bunch of folks are on vacation or out this week. Sorry for the delay, I am personally not sure, but if it's not too urgent you can have an answer when knowledgeable folks are back.
+*Thread Reply:* Hey Will! A bunch of folks are on vacation or out this week. Sorry for the delay, I am personally not sure, but if it's not too urgent you can have an answer when knowledgeable folks are back.
@@ -69536,7 +69542,7 @@*Thread Reply:* Hah! No worries, @Julien Le Dem! I can definitely wait for the lucky people who are enjoying the last few weeks of summer unlike the rest of us 😋
+*Thread Reply:* Hah! No worries, @Julien Le Dem! I can definitely wait for the lucky people who are enjoying the last few weeks of summer unlike the rest of us 😋
@@ -69562,7 +69568,7 @@*Thread Reply:* @Paweł Leszczyński might want to look at that
+*Thread Reply:* @Paweł Leszczyński might want to look at that
@@ -69617,7 +69623,7 @@*Thread Reply:* Hello Hanbing, the spark integration works for PySpark since pyspark is wrapped into regular spark operators.
+*Thread Reply:* Hello Hanbing, the spark integration works for PySpark since pyspark is wrapped into regular spark operators.
@@ -69643,7 +69649,7 @@*Thread Reply:* @Julien Le Dem Thanks a lot for your help. I searched around, but I couldn't find any doc introducing how pyspark is supported in OpenLineage. +
*Thread Reply:* @Julien Le Dem Thanks a lot for your help. I searched around, but I couldn't find any doc introducing how pyspark is supported in OpenLineage.
My company wants to integrate with openLineage-spark; I am working on figuring out what info OpenLineage makes available for non-sql and whether it at least has support for logging the logical plan?
@@ -69670,7 +69676,7 @@*Thread Reply:* Yes, it does send the logical plan as part of the event
+*Thread Reply:* Yes, it does send the logical plan as part of the event
@@ -69696,7 +69702,7 @@*Thread Reply:* This configuration here should work as well for pyspark https://openlineage.io/docs/integrations/spark/
+*Thread Reply:* This configuration here should work as well for pyspark https://openlineage.io/docs/integrations/spark/
*Thread Reply:* --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener"
+*Thread Reply:* --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener"
@@ -69769,7 +69775,7 @@*Thread Reply:* you need to add the jar, set the listener and pass your OL config
+*Thread Reply:* you need to add the jar, set the listener and pass your OL config
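For example, a minimal PySpark session wiring those three things together - the jar version, URL, and namespace are placeholders, not from this thread:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("openlineage_example")
    # pull the OpenLineage Spark agent jar (version is only an example)
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.13.1")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # where to send events and how to group them
    .config("spark.openlineage.host", "http://localhost:5000")
    .config("spark.openlineage.namespace", "my-namespace")
    .getOrCreate())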
@@ -69795,7 +69801,7 @@*Thread Reply:* Actually I'm demoing this at 27:10 right here 🙂 https://pretalx.com/bbuzz22/talk/FHEHAL/
+*Thread Reply:* Actually I'm demoing this at 27:10 right here 🙂 https://pretalx.com/bbuzz22/talk/FHEHAL/
*Thread Reply:* you can see the parameters I'm passing to the pyspark command line in the video
+*Thread Reply:* you can see the parameters I'm passing to the pyspark command line in the video
@@ -69872,7 +69878,7 @@*Thread Reply:* @Julien Le Dem Thanks for the info, Let me take a look at the video now.
+*Thread Reply:* @Julien Le Dem Thanks for the info, Let me take a look at the video now.
@@ -69898,7 +69904,7 @@*Thread Reply:* The full demo starts at 24:40. It shows lineage connected together in Marquez coming from 3 different sources: Airflow, Spark and a custom integration
+*Thread Reply:* The full demo starts at 24:40. It shows lineage connected together in Marquez coming from 3 different sources: Airflow, Spark and a custom integration
@@ -69979,7 +69985,7 @@*Thread Reply:* @Michael Robinson can we start posting the “Unreleased” section in the changelog along with the release request? That way, we / the community will know what will be in the upcoming release
+*Thread Reply:* @Michael Robinson can we start posting the “Unreleased” section in the changelog along with the release request? That way, we / the community will know what will be in the upcoming release
@@ -70009,7 +70015,7 @@*Thread Reply:* The release is approved. Thanks @Willy Lulciuc, @Minkyu Park, @Harel Shein
+*Thread Reply:* The release is approved. Thanks @Willy Lulciuc, @Minkyu Park, @Harel Shein
@@ -70092,7 +70098,7 @@*Thread Reply:* Cool! Are the new ownership facets populated by the Airflow integration?
+*Thread Reply:* Cool! Are the new ownership facets populated by the Airflow integration?
@@ -70144,7 +70150,7 @@*Thread Reply:* There is not that I know of besides the Amundsen integration example you pointed at. +
*Thread Reply:* There is not that I know of besides the Amundsen integration example you pointed at. A basic idea to do such a thing would be to implement an OpenLineage endpoint (receive the lineage events through http posts) and convert them to a format the graph db understand. If others in the community have ideas, please chime in
@@ -70171,7 +70177,7 @@*Thread Reply:* Understood, thanks a lot Julien. Makes sense.
+*Thread Reply:* Understood, thanks a lot Julien. Makes sense.
@@ -70227,7 +70233,7 @@*Thread Reply:* @Michael Robinson ^
+*Thread Reply:* @Michael Robinson ^
@@ -70253,7 +70259,7 @@*Thread Reply:* Thanks, Harel. 3 +1s from committers is all we need to make this happen today.
+*Thread Reply:* Thanks, Harel. 3 +1s from committers is all we need to make this happen today.
@@ -70279,7 +70285,7 @@*Thread Reply:* 🙏
+*Thread Reply:* 🙏
@@ -70305,7 +70311,7 @@*Thread Reply:* Thanks, all. The release is authorized
+*Thread Reply:* Thanks, all. The release is authorized
@@ -70335,7 +70341,7 @@*Thread Reply:* can you also state the main purpose for this release?
+*Thread Reply:* can you also state the main purpose for this release?
@@ -70361,7 +70367,7 @@*Thread Reply:* I believe (correct me if wrong, @Harel Shein) that this is to make available a fix of a bug in the compare functionality
+*Thread Reply:* I believe (correct me if wrong, @Harel Shein) that this is to make available a fix of a bug in the compare functionality
@@ -70387,7 +70393,7 @@*Thread Reply:* The ParentRunFacet from the airflow integration is not compliant with the OpenLineage spec, and this release includes the fix for that so that Marquez can handle parent run/job information.
+*Thread Reply:* The ParentRunFacet from the airflow integration is not compliant with the OpenLineage spec, and this release includes the fix for that so that Marquez can handle parent run/job information.
@@ -70486,7 +70492,7 @@*Thread Reply:* What version of Spark are you using? it should be enabled by default for Spark 3 +
*Thread Reply:* What version of Spark are you using? it should be enabled by default for Spark 3 https://openlineage.io/docs/integrations/spark/spark_column_lineage
*Thread Reply:* Thanks. Good to hear that. I am using 0.9.+. I will try again
+*Thread Reply:* Thanks. Good to hear that. I am using 0.9.+. I will try again
@@ -70560,7 +70566,7 @@*Thread Reply:* I tested 0.9.+ and 0.12.+ with spark 3.0 and 3.2 versions. There still is no dataset facet columnLineage. This is strange. I saw the column lineage design proposal 148. It should be supported from 0.9.+. Do I miss something?
+*Thread Reply:* I tested 0.9.+ and 0.12.+ with spark 3.0 and 3.2 versions. There still is no dataset facet columnLineage. This is strange. I saw the column lineage design proposal 148. It should be supported from 0.9.+. Do I miss something?
@@ -70586,7 +70592,7 @@*Thread Reply:* @Harel Shein
+*Thread Reply:* @Harel Shein
@@ -70612,7 +70618,7 @@*Thread Reply:* @Jason it depends on the data source. What sort of data are you trying to read? Is it in a hive metastore? Is it on an S3 bucket? Is it a delta file format?
+*Thread Reply:* @Jason it depends on the data source. What sort of data are you trying to read? Is it in a hive metastore? Is it on an S3 bucket? Is it a delta file format?
@@ -70638,7 +70644,7 @@*Thread Reply:* I tried reading a hive metastore on s3 and a csv file on local. All are missing the columnLineage
+*Thread Reply:* I tried reading a hive metastore on s3 and a csv file on local. All are missing the columnLineage
@@ -70664,7 +70670,7 @@*Thread Reply:* @Jason - Sorry, you'll have to translate a bit for me. Can you share a snippet of code you're using to do the read and write? Is it a special package you need to install or is it just using the hadoop standard for S3? https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html
+*Thread Reply:* @Jason - Sorry, you'll have to translate a bit for me. Can you share a snippet of code you're using to do the read and write? Is it a special package you need to install or is it just using the hadoop standard for S3? https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html
@@ -70690,7 +70696,7 @@*Thread Reply:* spark.read \ +
*Thread Reply:* spark.read \
  .option("header", "true") \
  .option("inferschema", "true") \
  .csv("data/input/batch/wikidata.csv") \
@@ -70722,7 +70728,7 @@
*Thread Reply:* This is simple code run on my local for testing
+*Thread Reply:* This is simple code run on my local for testing
@@ -70748,7 +70754,7 @@*Thread Reply:* Which version of OpenLineage are you running? You might look at the code on the main branch. This looks like a HadoopFSRelation which I implemented for column lineage but the latest release (0.13.1) does not include it yet.
+*Thread Reply:* Which version of OpenLineage are you running? You might look at the code on the main branch. This looks like a HadoopFSRelation which I implemented for column lineage but the latest release (0.13.1) does not include it yet.
@@ -70774,7 +70780,7 @@*Thread Reply:* Specifically this commit is what implemented it. +
*Thread Reply:* Specifically this commit is what implemented it. https://github.com/OpenLineage/OpenLineage/commit/ce30178cc81b63b9930be11ac7500ed34808edd3
@@ -70801,7 +70807,7 @@*Thread Reply:* I see. I use 0.13.0
+*Thread Reply:* I see. I use 0.13.0
@@ -70827,7 +70833,7 @@*Thread Reply:* @Jason we have our monthly release coming up now, so it should be included in 0.14.0 when released today/tomorrow
+*Thread Reply:* @Jason we have our monthly release coming up now, so it should be included in 0.14.0 when released today/tomorrow
@@ -70853,7 +70859,7 @@*Thread Reply:* Great. Thanks Harel.
+*Thread Reply:* Great. Thanks Harel.
@@ -70957,7 +70963,7 @@*Thread Reply:* Reading dataset - which input dataset implies - does not mutate the dataset 🙂
+*Thread Reply:* Reading dataset - which input dataset implies - does not mutate the dataset 🙂
*Thread Reply:* If you change the dataset, it would be represented as some other job with this dataset in the outputs list
+*Thread Reply:* If you change the dataset, it would be represented as some other job with this dataset in the outputs list
*Thread Reply:* So, changing the input dataset will always create new output data versions? Sorry, I have trouble understanding this, but if the input is changing, shouldn't the input dataset have different versions?
+*Thread Reply:* So, changing the input dataset will always create new output data versions? Sorry, I have trouble understanding this, but if the input is changing, shouldn't the input dataset have different versions?
*Thread Reply:* @Raj Mishra if input is changing, there should be something else in your data infrastructure that changes this dataset - and it should emit this dataset as output
+*Thread Reply:* @Raj Mishra if input is changing, there should be something else in your data infrastructure that changes this dataset - and it should emit this dataset as output
*Thread Reply:* I think you want to emit raw events using python or java client: https://openlineage.io/docs/client/python
+*Thread Reply:* I think you want to emit raw events using python or java client: https://openlineage.io/docs/client/python
*Thread Reply:* (docs in progress 😉)
+*Thread Reply:* (docs in progress 😉)
@@ -71160,7 +71166,7 @@*Thread Reply:* can you give a hint what I should look for for modeling a dataset on top of another dataset? potentially also map columns?
+*Thread Reply:* can you give a hint what I should look for for modeling a dataset on top of another dataset? potentially also map columns?
@@ -71186,7 +71192,7 @@*Thread Reply:* I can only see that I can have a dataset as input to a job run and not as input to another dataset
+*Thread Reply:* I can only see that I can have a dataset as input to a job run and not as input to another dataset
@@ -71212,7 +71218,7 @@*Thread Reply:* Not sure I understand - jobs process input datasets into output datasets. There is always something that can be modeled into a job that consumes input and produces output.
+*Thread Reply:* Not sure I understand - jobs process input datasets into output datasets. There is always something that can be modeled into a job that consumes input and produces output.
@@ -71238,7 +71244,7 @@*Thread Reply:* so openlineage forces me to put a job between datasets? that does not fit our use case
+*Thread Reply:* so openlineage forces me to put a job between datasets? that does not fit our use case
@@ -71264,7 +71270,7 @@*Thread Reply:* unless we can somehow easily hide the process that does that on the graph.
+*Thread Reply:* unless we can somehow easily hide the process that does that on the graph.
@@ -71316,7 +71322,7 @@*Thread Reply:* I don't think it will work for Spark 2.X.
+*Thread Reply:* I don't think it will work for Spark 2.X.
@@ -71342,7 +71348,7 @@*Thread Reply:* Is there a plan to support spark 2.x?
+*Thread Reply:* Is there a plan to support spark 2.x?
@@ -71368,7 +71374,7 @@*Thread Reply:* Nope - on the other hand we plan to drop any support for it, as it's unmaintained for quite a bit and vendors are dropping support for it too - afaik Databricks in April 2023.
+*Thread Reply:* Nope - on the other hand we plan to drop any support for it, as it's unmaintained for quite a bit and vendors are dropping support for it too - afaik Databricks in April 2023.
@@ -71394,7 +71400,7 @@*Thread Reply:* I see. Thanks. Amazon EMR still supports spark 2.x
+*Thread Reply:* I see. Thanks. Amazon EMR still supports spark 2.x
@@ -71493,7 +71499,7 @@*Thread Reply:* At least for Iceberg I've done it, since I want to emit DatasetVersionDatasetFacet for the input dataset only at START - and after I finish writing, the dataset might have a different version than before writing.
+*Thread Reply:* At least for Iceberg I've done it, since I want to emit DatasetVersionDatasetFacet for the input dataset only at START - and after I finish writing, the dataset might have a different version than before writing.
*Thread Reply:* Same should be for output AFAIK - output version should be emitted only on COMPLETE, since the version changes after I finish writing.
+*Thread Reply:* Same should be for output AFAIK - output version should be emitted only on COMPLETE, since the version changes after I finish writing.
@@ -71545,7 +71551,7 @@*Thread Reply:* Ah! Okay, so this still requires us to truly combine START and COMPLETE to get a TOTAL picture of the entire run. Is that fair?
+*Thread Reply:* Ah! Okay, so this still requires us to truly combine START and COMPLETE to get a TOTAL picture of the entire run. Is that fair?
@@ -71571,7 +71577,7 @@*Thread Reply:* Yes
+*Thread Reply:* Yes
@@ -71601,7 +71607,7 @@*Thread Reply:* As usual, thank you Maciej for the responses and insights!
+*Thread Reply:* As usual, thank you Maciej for the responses and insights!
@@ -71657,7 +71663,7 @@*Thread Reply:* Can anyone help with it? Did I miss something?
+*Thread Reply:* Can anyone help with it? Did I miss something?
@@ -71748,7 +71754,7 @@*Thread Reply:* Thanks. The release is authorized. It will be initiated within 2 business days.
+*Thread Reply:* Thanks. The release is authorized. It will be initiated within 2 business days.
@@ -71804,7 +71810,7 @@*Thread Reply:* Which integration are you looking to implement?
+*Thread Reply:* Which integration are you looking to implement?
And what environment are you looking to deploy it on? The Cloud? On-Prem?
@@ -71832,7 +71838,7 @@*Thread Reply:* We are planning to deploy on premise with Kerberos as authentication for postgres
+*Thread Reply:* We are planning to deploy on premise with Kerberos as authentication for postgres
@@ -71858,7 +71864,7 @@*Thread Reply:* Ah! Are you planning on running Marquez as well and that is your main concern or are you planning on building your own store of OpenLineage Events and using the SQL integration to generate those events?
+*Thread Reply:* Ah! Are you planning on running Marquez as well and that is your main concern or are you planning on building your own store of OpenLineage Events and using the SQL integration to generate those events?
https://github.com/OpenLineage/OpenLineage/tree/main/integration
@@ -71886,7 +71892,7 @@*Thread Reply:* I am looking to deploy Marquez on-prem with onprem postgres as back-end with Kerberos authentication.
+*Thread Reply:* I am looking to deploy Marquez on-prem with onprem postgres as back-end with Kerberos authentication.
@@ -71912,7 +71918,7 @@*Thread Reply:* Is this the right forum for Marquez as well, or is there a different slack channel for Marquez available?
+*Thread Reply:* Is this the right forum for Marquez as well, or is there a different slack channel for Marquez available?
@@ -71938,7 +71944,7 @@*Thread Reply:* https://bit.ly/MarquezSlack
+*Thread Reply:* https://bit.ly/MarquezSlack
@@ -71964,7 +71970,7 @@*Thread Reply:* There is another slack channel just for Marquez! That might be a better spot with more dedicated Marquez developers.
+*Thread Reply:* There is another slack channel just for Marquez! That might be a better spot with more dedicated Marquez developers.
@@ -72039,7 +72045,7 @@*Thread Reply:* Thanks for breaking up the changes in the release! Love the new format 💯
+*Thread Reply:* Thanks for breaking up the changes in the release! Love the new format 💯
@@ -72099,7 +72105,7 @@*Thread Reply:* Is PR #1069 all that’s going in 0.14.1?
+*Thread Reply:* Is PR #1069 all that’s going in 0.14.1?
*Thread Reply:* There’s also 1058. 1069 is urgently needed. We can technically wait…
+*Thread Reply:* There’s also 1058. 1069 is urgently needed. We can technically wait…
@@ -72155,7 +72161,7 @@*Thread Reply:* (edited prior message because I’m not sure how accurately I was describing the issue)
+*Thread Reply:* (edited prior message because I’m not sure how accurately I was describing the issue)
@@ -72181,7 +72187,7 @@*Thread Reply:* Thanks for clarifying!
+*Thread Reply:* Thanks for clarifying!
@@ -72207,7 +72213,7 @@*Thread Reply:* Thanks, all. The release is authorized.
+*Thread Reply:* Thanks, all. The release is authorized.
@@ -72237,7 +72243,7 @@*Thread Reply:* 1058 also fixes some bugs
+*Thread Reply:* 1058 also fixes some bugs
@@ -72289,7 +72295,7 @@*Thread Reply:* Usually there is something creating the view, for example dbt materialization: https://docs.getdbt.com/docs/building-a-dbt-project/building-models/materializations
+*Thread Reply:* Usually there is something creating the view, for example dbt materialization: https://docs.getdbt.com/docs/building-a-dbt-project/building-models/materializations
Besides that, there is this proposal that did not get enough love yet https://github.com/OpenLineage/OpenLineage/issues/323
@@ -72317,7 +72323,7 @@*Thread Reply:* but we are not working with dbt. we try to model lineage of our internal view/tables hierarchy, which is related to a proprietary application of ours. so we like OpenLineage in that it lets me explicitly model stuff and not only via scanning some DW. but in that case we don't want a job in between.
+*Thread Reply:* but we are not working with dbt. we try to model lineage of our internal view/tables hierarchy, which is related to a proprietary application of ours. so we like OpenLineage in that it lets me explicitly model stuff and not only via scanning some DW. but in that case we don't want a job in between.
@@ -72343,7 +72349,7 @@*Thread Reply:* this PR does not seem to support lineage between datasets
+*Thread Reply:* this PR does not seem to support lineage between datasets
@@ -72369,7 +72375,7 @@*Thread Reply:* This is something core to the OpenLineage design - the lineage relationships are defined as dataset-job-dataset, not dataset-dataset.
+*Thread Reply:* This is something core to the OpenLineage design - the lineage relationships are defined as dataset-job-dataset, not dataset-dataset.
In OpenLineage, something observes the lineage relationship being created.
@@ -72397,7 +72403,7 @@*Thread Reply:*
+*Thread Reply:*
*Thread Reply:* It’s a bit different from some other lineage approaches, but OL is intended to be a push model. A job is observed as it runs, metadata is pushed to the backend.
+*Thread Reply:* It’s a bit different from some other lineage approaches, but OL is intended to be a push model. A job is observed as it runs, metadata is pushed to the backend.
@@ -72462,7 +72468,7 @@*Thread Reply:* so in this case, according to openlineage 🙂, the job would be whatever runs within the pipeline that creates the view. very operational point of view.
+*Thread Reply:* so in this case, according to openlineage 🙂, the job would be whatever runs within the pipeline that creates the view. very operational point of view.
@@ -72488,7 +72494,7 @@*Thread Reply:* but what about the view definition use case? you have lineage of columns in view/base table relationships
+*Thread Reply:* but what about the view definition use case? you have lineage of columns in view/base table relationships
@@ -72514,7 +72520,7 @@*Thread Reply:* how would you model that in OpenLineage? would you create a dummy job?
+*Thread Reply:* how would you model that in OpenLineage? would you create a dummy job?
@@ -72540,7 +72546,7 @@*Thread Reply:* would you say that because this is my use case i might better choose some other lineage tool?
+*Thread Reply:* would you say that because this is my use case i might better choose some other lineage tool?
@@ -72566,7 +72572,7 @@*Thread Reply:* for the context: I am not talking about view and table definitions in some warehouse, e.g. SF, but an internal data processing mechanism with proprietary view/table definitions (in Flink SQL), and we want to push this metadata for visibility
+*Thread Reply:* for the context: I am not talking about view and table definitions in some warehouse, e.g. SF, but an internal data processing mechanism with proprietary view/table definitions (in Flink SQL), and we want to push this metadata for visibility
@@ -72592,7 +72598,7 @@*Thread Reply:* Ah, gotcha. Yeah, I would say it’s probably best to create a job in this case. You can send the view definition using a sourcecodefacet, so it will be collected as well. You’d want to send START and STOP events for it.
+*Thread Reply:* Ah, gotcha. Yeah, I would say it’s probably best to create a job in this case. You can send the view definition using a sourcecodefacet, so it will be collected as well. You’d want to send START and STOP events for it.
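A hedged sketch of that "job carrying the view definition" idea with the Python client - the SQL, dataset, and job names are invented, and SqlJobFacet is used here as a stand-in for the source code facet the reply mentions:
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.facet import SqlJobFacet
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # placeholder backend URL
view_sql = "CREATE VIEW my_view AS SELECT id, name FROM base_table"  # made-up definition

run = Run(runId=str(uuid4()))
# The job is what "observes" the relationship; its facet carries the view SQL
job = Job(namespace="my-namespace", name="create_my_view",
          facets={"sql": SqlJobFacet(query=view_sql)})
producer = "https://example.com/my-producer"

for state in (RunState.START, RunState.COMPLETE):
    client.emit(RunEvent(state, datetime.now(timezone.utc).isoformat(), run, job, producer,
                         inputs=[Dataset(namespace="my-namespace", name="base_table")],
                         outputs=[Dataset(namespace="my-namespace", name="my_view")]))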
@@ -72618,7 +72624,7 @@*Thread Reply:* regarding the PR linked before, you are right - I wonder if someday the spec should have a way to express “the system was made aware that these datasets are related, but did not observe the relationship being created so it can’t tell you i.e. how long it took or whether it changed over time”
+*Thread Reply:* regarding the PR linked before, you are right - I wonder if someday the spec should have a way to express “the system was made aware that these datasets are related, but did not observe the relationship being created so it can’t tell you i.e. how long it took or whether it changed over time”
@@ -72713,7 +72719,7 @@*Thread Reply:* Hey, @data_fool! Not in the near term. but of course we’d love to see this happen. We’re open to having an Airbyte integration driven by the community. Want to open an issue to start the discussion?
+*Thread Reply:* Hey, @data_fool! Not in the near term. but of course we’d love to see this happen. We’re open to having an Airbyte integration driven by the community. Want to open an issue to start the discussion?
@@ -72739,7 +72745,7 @@*Thread Reply:* hey @Willy Lulciuc, Yep, will open an issue. Thanks!
+*Thread Reply:* hey @Willy Lulciuc, Yep, will open an issue. Thanks!
@@ -72795,7 +72801,7 @@*Thread Reply:* yes!
+*Thread Reply:* yes!
@@ -72821,7 +72827,7 @@*Thread Reply:* Any example or ticket on how to do lineage across namespaces?
+*Thread Reply:* Any example or ticket on how to do lineage across namespaces?
@@ -72873,7 +72879,7 @@*Thread Reply:* Yes https://openlineage.io/blog/column-lineage/
+*Thread Reply:* Yes https://openlineage.io/blog/column-lineage/
*Thread Reply:* • More details on Spark & Column level lineage integration: https://openlineage.io/docs/integrations/spark/spark_column_lineage +
*Thread Reply:* • More details on Spark & Column level lineage integration: https://openlineage.io/docs/integrations/spark/spark_column_lineage • Proposal on how to implement column level lineage in Marquez (implementation is currently work in progress): https://github.com/MarquezProject/marquez/blob/main/proposals/2045-column-lineage-endpoint.md @Iftach Schonbaum let us know if you find the information useful.
*Thread Reply:* or is it automatic for anything that exists in extractors/?
+*Thread Reply:* or is it automatic for anything that exists in extractors/?
*Thread Reply:* Yes
+*Thread Reply:* Yes
@@ -73229,7 +73235,7 @@*Thread Reply:* so anything I add to the extractors directory with the same name as the operator will automatically extract the metadata from the operator, is that correct?
+*Thread Reply:* so anything I add to the extractors directory with the same name as the operator will automatically extract the metadata from the operator, is that correct?
*Thread Reply:* Well, not entirely
+*Thread Reply:* Well, not entirely
@@ -73281,7 +73287,7 @@*Thread Reply:* please take a look at the source code of one of the extractors
+*Thread Reply:* please take a look at the source code of one of the extractors
@@ -73311,7 +73317,7 @@*Thread Reply:* also, there are docs available at openlineage.io/docs
+*Thread Reply:* also, there are docs available at openlineage.io/docs
@@ -73341,7 +73347,7 @@*Thread Reply:* ok, I'll take a look. I think one thing that would be helpful is having a custom setup without marquez. a lot of the docs or videos I found were integrated with marquez
+*Thread Reply:* ok, I'll take a look. I think one thing that would be helpful is having a custom setup without marquez. a lot of the docs or videos I found were integrated with marquez
@@ -73367,7 +73373,7 @@*Thread Reply:* I see. Marquez is an OpenLineage backend that stores the lineage data, so many examples do use it.
+*Thread Reply:* I see. Marquez is an OpenLineage backend that stores the lineage data, so many examples do use it.
@@ -73393,7 +73399,7 @@*Thread Reply:* If you do not want to run marquez but just test out the openlineage, you can also take a look at OpenLineage Proxy.
+*Thread Reply:* If you do not want to run marquez but just test out the openlineage, you can also take a look at OpenLineage Proxy.
@@ -73423,7 +73429,7 @@*Thread Reply:* awesome thanks Howard! i'll take a look at these resources and come back around if i need to
+*Thread Reply:* awesome thanks Howard! i'll take a look at these resources and come back around if i need to
@@ -73453,7 +73459,7 @@*Thread Reply:* http://openlineage.io/docs/integrations/airflow/extractor - this is the doc you might want to read
+*Thread Reply:* http://openlineage.io/docs/integrations/airflow/extractor - this is the doc you might want to read
*Thread Reply:* yeah, saw that doc earlier. thanks @Maciej Obuchowski appreciate it 🙏
+*Thread Reply:* yeah, saw that doc earlier. thanks @Maciej Obuchowski appreciate it 🙏
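For reference, a bare-bones custom extractor sketch in the spirit of that doc - the operator and dataset names are placeholders, and exact import paths may vary by openlineage-airflow version:
from typing import List, Optional

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata
from openlineage.client.run import Dataset


class MyOperatorExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        # operators this extractor handles, matched by class name
        return ["MyOperator"]

    def extract(self) -> Optional[TaskMetadata]:
        # self.operator is the task being observed; pull whatever metadata it exposes
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            inputs=[Dataset(namespace="my-namespace", name="raw_table")],
            outputs=[Dataset(namespace="my-namespace", name="clean_table")],
        )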
@@ -73562,7 +73568,7 @@*Thread Reply:* In my head, I have:
+*Thread Reply:* In my head, I have:
Pyspark script -> Store metadata in S3 -> Marquez UI gets data from S3 and displays it
@@ -73592,10 +73598,10 @@*Thread Reply:* It’s more like: you add the openlineage jar to the Spark job and configure what to do with the events. Popular options are:
- * send to a rest endpoint (like Marquez),
- * send as an event onto Kafka,
- * print it onto console
+*Thread Reply:* It’s more like: you add the openlineage jar to the Spark job and configure what to do with the events. Popular options are:
+ * send to a rest endpoint (like Marquez),
+ * send as an event onto Kafka,
+ * print it onto console
@@ -73623,7 +73629,7 @@*Thread Reply:* Yeah S3 was just an example for a storage option.
+*Thread Reply:* Yeah S3 was just an example for a storage option.
I actually found the answer I was looking for, turns out I had to look at Marquez documentation: https://marquezproject.ai/resources/deployment/
@@ -73688,7 +73694,7 @@*Thread Reply:* The Spark execution model follows:
+*Thread Reply:* The Spark execution model follows:
*Thread Reply:* Thanks @Will Johnson +
*Thread Reply:* Thanks @Will Johnson
Can I get an example of how the proposed plan can be used to distinguish between start and job start events? Comparing the 2 start events I got, only the event_time is different; all other information is the same.
@@ -73748,7 +73754,7 @@*Thread Reply:* One followup question, if I process multiple queries in one command, for example (Drop + Create Table + Insert Overwrite), should I expect: +
*Thread Reply:* One followup question, if I process multiple queries in one command, for example (Drop + Create Table + Insert Overwrite), should I expect:
(1) 1 Spark SQL execution start event
(2) 3 Spark job start events (each query has a job start event)
(3) 3 Spark job end events (each query has a job end event)
@@ -73778,7 +73784,7 @@
*Thread Reply:* Re: Distinguish between start and job start events. There was a proposal to differentiate the two (https://github.com/OpenLineage/OpenLineage/issues/636) but the current discussion is here: https://github.com/OpenLineage/OpenLineage/issues/599 As it currently stands, there is not a way to tell which one is which (I believe). The design of OpenLineage is such that you should consume ALL events under the same run id and job name / namespace.
+*Thread Reply:* Re: Distinguish between start and job start events. There was a proposal to differentiate the two (https://github.com/OpenLineage/OpenLineage/issues/636) but the current discussion is here: https://github.com/OpenLineage/OpenLineage/issues/599 As it currently stands, there is not a way to tell which one is which (I believe). The design of OpenLineage is such that you should consume ALL events under the same run id and job name / namespace.
Re: Multiple Queries in One Command: This is where Spark's execution model comes into play. I believe each one of those commands are executed sequentially and as a result, you'd actually get three execution start and three execution end. If you chose DROP + Create Table As Select, that would be only two commands and thus only two execution start events.
*Thread Reply:* Thanks a lot for your help 🙏 @Will Johnson, +
*Thread Reply:* Thanks a lot for your help 🙏 @Will Johnson,
For multiple queries in one command, I'm still confused about why Drop + CreateTable and Drop + CreateTableAsSelect act differently. When I test Drop + Create Table
@@ -73894,7 +73900,7 @@
*Thread Reply:* @Hanbing Wang are you running this on Databricks with a hive metastore that is defaulting to Delta by any chance?
+*Thread Reply:* @Hanbing Wang are you running this on Databricks with a hive metastore that is defaulting to Delta by any chance?
I THINK there are some gaps in OpenLineage because of the way Databricks Delta handles things and now there is Unity catalog that is causing some hiccups as well.
@@ -73922,7 +73928,7 @@*Thread Reply:* > For multiple queries in one command, I'm still confused about why Drop + CreateTable and Drop + CreateTableAsSelect act differently. +
*Thread Reply:* > For multiple queries in one command, I'm still confused about why Drop + CreateTable and Drop + CreateTableAsSelect act differently.
@Hanbing Wang That's basically why we capture all the events (SQL Execution, Job) instead of one of them. We're just inconsistently notified of them by Spark.
Some computations emit SQL Execution events, some emit Job events - I think the majority emits both. This also differs by Spark version.
*Thread Reply:* @Will Johnson and @Maciej Obuchowski Thanks a lot for your help +
*Thread Reply:* @Will Johnson and @Maciej Obuchowski Thanks a lot for your help
We are not running on Databricks. We implemented the OpenLineage Spark listener and a custom Event Transport, which emits the events to our own events pipeline, with a hive metastore. We are using Spark version 3.2.1
@@ -73986,7 +73992,7 @@
*Thread Reply:* Ooof! @Hanbing Wang then I'm not certain why you're not receiving the extra event 😞 You may need to run your spark cluster in debug mode to step through the Spark Listener.
+*Thread Reply:* Ooof! @Hanbing Wang then I'm not certain why you're not receiving the extra event 😞 You may need to run your spark cluster in debug mode to step through the Spark Listener.
@@ -74012,7 +74018,7 @@*Thread Reply:* @Maciej Obuchowski - I'll add it to my list!
+*Thread Reply:* @Maciej Obuchowski - I'll add it to my list!
@@ -74038,7 +74044,7 @@*Thread Reply:* @Will Johnson Thanks a lot for your help. Let us debug and continue investigating on this issue.
+*Thread Reply:* @Will Johnson Thanks a lot for your help. Let us debug and continue investigating on this issue.
@@ -74096,7 +74102,7 @@*Thread Reply:* One of assumptions was to create a stateless integration model where multiple events can be sent for a single job run. This has several advantages like sending events for jobs which suddenly fail, sending events immediately, etc.
+*Thread Reply:* One of assumptions was to create a stateless integration model where multiple events can be sent for a single job run. This has several advantages like sending events for jobs which suddenly fail, sending events immediately, etc.
The events can be merged then at the backend side. The behavior, you describe, can be then achieved by using backends like Marquez and Marquez API to obtain combined data.
@@ -74336,7 +74342,7 @@*Thread Reply:* Hello @srutikanta hota, could you elaborate a bit on your use case? I'm not sure what you are trying to achieve. Possibly @Paweł Leszczyński will know.
+*Thread Reply:* Hello @srutikanta hota, could you elaborate a bit on your use case? I'm not sure what you are trying to achieve. Possibly @Paweł Leszczyński will know.
@@ -74362,7 +74368,7 @@*Thread Reply:* @srutikanta hota - Not sure what MDC properties stands for but you might take inspiration from the DatabricksEnvironmentHandler Facet Builder: https://github.com/OpenLineage/OpenLineage/blob/65a5f021a1ba3035d5198e759587737a05[…]ark/agent/facets/builder/DatabricksEnvironmentFacetBuilder.java
+*Thread Reply:* @srutikanta hota - Not sure what MDC properties stands for but you might take inspiration from the DatabricksEnvironmentHandler Facet Builder: https://github.com/OpenLineage/OpenLineage/blob/65a5f021a1ba3035d5198e759587737a05[…]ark/agent/facets/builder/DatabricksEnvironmentFacetBuilder.java
You can create a facet that could extract out the properties that you might set from within the spark session.
@@ -74454,7 +74460,7 @@*Thread Reply:* Many thanks for the details. My use case is simple: I'd like to default the spark job group id as the openlineage parent runId if there is no parent run id set. +
*Thread Reply:* Many thanks for the details. My use case is simple: I'd like to default the spark job group id as the openlineage parent runId if there is no parent run id set.
sc.setJobGroup("myjobgroupid", "job description goes here")
This sets the value in spark as setLocalProperty(SparkContext.SPARK_JOB_GROUP_ID, group_id)
@@ -74485,7 +74491,7 @@*Thread Reply:* MDC is an ability to add extra key -> value pairs to a log entry, while not doing this within the message body. So the question here is (I believe): how to add custom entries / custom facets to OpenLineage events?
+*Thread Reply:* MDC is an ability to add extra key -> value pairs to a log entry, while not doing this within the message body. So the question here is (I believe): how to add custom entries / custom facets to OpenLineage events?
@srutikanta hota What information would you like to include? There is a great chance we already have some fields for that. If not, it’s still worth putting it in the right place: is this info job specific, run specific, or does it relate to some of the input / output datasets?
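If a custom facet really is needed, a minimal sketch with the Python client could look like this - the facet name and fields are entirely made up for the example:
import attr

from openlineage.client.facet import BaseFacet


@attr.s
class TeamContextRunFacet(BaseFacet):
    """Hypothetical MDC-style key/value context attached to a run."""
    team: str = attr.ib()
    cost_center: str = attr.ib()

# attached under the run's facets, e.g.:
# Run(runId=..., facets={"teamContext": TeamContextRunFacet(team="data-eng", cost_center="42")})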
@@ -74513,7 +74519,7 @@*Thread Reply:* @srutikanta hota sounds like you want to set up +
*Thread Reply:* @srutikanta hota sounds like you want to set up
spark.openlineage.parentJobName
spark.openlineage.parentRunId
https://openlineage.io/docs/integrations/spark/
*Thread Reply:* @… we are having a long-running spark context (the context may run for a week) where we submit jobs. Setting the parentrunid at the beginning won't help. We are submitting the job with the spark job group id. I'd like to use the group id as the parentRunId
+*Thread Reply:* @… we are having a long-running spark context (the context may run for a week) where we submit jobs. Setting the parentrunid at the beginning won't help. We are submitting the job with the spark job group id. I'd like to use the group id as the parentRunId
https://spark.apache.org/docs/1.6.1/api/R/setJobGroup.html
@@ -74612,7 +74618,7 @@*Thread Reply:* Hi Trevor, thank you for reaching out. I’d be happy to discuss with you how we can help you support OpenLineage. Let me send you an email.
+*Thread Reply:* Hi Trevor, thank you for reaching out. I’d be happy to discuss with you how we can help you support OpenLineage. Let me send you an email.
@@ -74732,7 +74738,7 @@*Thread Reply:* Is this in the context of an OpenLineage extractor?
+*Thread Reply:* Is this in the context of an OpenLineage extractor?
@@ -74758,7 +74764,7 @@*Thread Reply:* Yes! I was specifically looking at the PostgresOperator
+*Thread Reply:* Yes! I was specifically looking at the PostgresOperator
@@ -74784,7 +74790,7 @@*Thread Reply:* (as Snowflake lineage can be retrieved from their internal ACCESS_HISTORY tables, we wouldn’t need to use Airflow’s SnowflakeOperator to get lineage, we’d use the method on the openlineage blog)
+*Thread Reply:* (as Snowflake lineage can be retrieved from their internal ACCESS_HISTORY tables, we wouldn’t need to use Airflow’s SnowflakeOperator to get lineage, we’d use the method on the openlineage blog)
@@ -74810,7 +74816,7 @@*Thread Reply:* The extractor for the SQL operators gets the query like this: +
*Thread Reply:* The extractor for the SQL operators gets the query like this: https://github.com/OpenLineage/OpenLineage/blob/45fda47d8ef29dd6d25103bb491fb8c443[…]gration/airflow/openlineage/airflow/extractors/sql_extractor.py
*Thread Reply:* let me see if I can find the corresponding part of the Airflow API docs...
+*Thread Reply:* let me see if I can find the corresponding part of the Airflow API docs...
@@ -74892,7 +74898,7 @@*Thread Reply:* aha! I’m not so far behind the times, it was only put in during July https://github.com/OpenLineage/OpenLineage/pull/907
+*Thread Reply:* aha! I’m not so far behind the times, it was only put in during July https://github.com/OpenLineage/OpenLineage/pull/907
*Thread Reply:* Hm. The PostgresOperator seems to extend BaseOperator directly: +
*Thread Reply:* Hm. The PostgresOperator seems to extend BaseOperator directly: https://github.com/apache/airflow/blob/029ebacd9cbbb5e307a03530bdaf111c2c3d4f51/airflow/providers/postgres/operators/postgres.py#L58
*Thread Reply:* yeah 😞 I couldn’t find a way to make that work as an end-user.
+*Thread Reply:* yeah 😞 I couldn’t find a way to make that work as an end-user.
@@ -75035,7 +75041,7 @@*Thread Reply:* perhaps that can't be assumed for all operators that deal with SQL. I know that @Maciej Obuchowski has spent a lot of time on this.
+*Thread Reply:* perhaps that can't be assumed for all operators that deal with SQL. I know that @Maciej Obuchowski has spent a lot of time on this.
@@ -75061,7 +75067,7 @@*Thread Reply:* I don't know enough about the airflow internals 😞
+*Thread Reply:* I don't know enough about the airflow internals 😞
@@ -75087,7 +75093,7 @@*Thread Reply:* No worries. In case it saves you work, I also had a look at https://github.com/apache/airflow/blob/029ebacd9cbbb5e307a03530bdaf111c2c3d4f51/airflow/providers/common/sql/operators/sql.py - which also extends BaseOperator but not with a way to just get the SQL.
+*Thread Reply:* No worries. In case it saves you work, I also had a look at https://github.com/apache/airflow/blob/029ebacd9cbbb5e307a03530bdaf111c2c3d4f51/airflow/providers/common/sql/operators/sql.py - which also extends BaseOperator but not with a way to just get the SQL.
*Thread Reply:* that's more of an Airflow question indeed. As far as I understand, you need to read a file with a SQL statement within an Airflow operator and do something other than run the query (like pass it as an XCom)? The SQLExtractors we have get the same SQL that the operators render and use it to extract additional information, like table schema, straight from the database
+*Thread Reply:* that's more of an Airflow question indeed. As far as I understand, you need to read a file with a SQL statement within an Airflow operator and do something other than run the query (like pass it as an XCom)? The SQLExtractors we have get the same SQL that the operators render and use it to extract additional information, like table schema, straight from the database
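As context, a minimal sketch of how such an extractor plugs in, assuming the openlineage-airflow BaseExtractor / TaskMetadata interface linked above; the operator name and facet wiring are illustrative:
```
from typing import List, Optional

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata
from openlineage.client.facet import SqlJobFacet


class MySqlFileExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        # Operator(s) this extractor should handle (hypothetical name).
        return ["MySqlFileOperator"]

    def extract(self) -> Optional[TaskMetadata]:
        # SQL operators keep the rendered query on the operator's `sql` attribute.
        sql = getattr(self.operator, "sql", None)
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            job_facets={"sql": SqlJobFacet(query=sql)} if sql else {},
        )
```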
@@ -75444,7 +75450,7 @@*Thread Reply:* There's a design doc linked from the PR: https://github.com/apache/airflow/pull/20443 +
*Thread Reply:* There's a design doc linked from the PR: https://github.com/apache/airflow/pull/20443 https://docs.google.com/document/d/1L3xfdlWVUrdnFXng1Di4nMQYQtzMfhvvWDR9K4wXnDU/edit
*Thread Reply:* amazing thank you I will take a look
+*Thread Reply:* amazing thank you I will take a look
@@ -75595,7 +75601,7 @@*Thread Reply:* Hey @Michael Robinson. Removal of Airflow 1.x support is planned for next release after 0.15.0
+*Thread Reply:* Hey @Michael Robinson. Removal of Airflow 1.x support is planned for next release after 0.15.0
@@ -75625,7 +75631,7 @@*Thread Reply:* 0.15.0 would be the last release supporting Airflow 1.x
+*Thread Reply:* 0.15.0 would be the last release supporting Airflow 1.x
@@ -75651,7 +75657,7 @@*Thread Reply:* just caught this myself. I’ll make the change
+*Thread Reply:* just caught this myself. I’ll make the change
@@ -75677,7 +75683,7 @@*Thread Reply:* we’re still on 1.10.15 at the moment so i guess our team would have to rely on <=0.15.0?
+*Thread Reply:* we’re still on 1.10.15 at the moment so i guess our team would have to rely on <=0.15.0?
@@ -75703,7 +75709,7 @@*Thread Reply:* Is this something you want to continue doing or do you want to migrate relatively soon?
+*Thread Reply:* Is this something you want to continue doing or do you want to migrate relatively soon?
We want to remove 1.10 integration because for multiple PRs, maintaining compatibility with it takes a lot of time; the code is littered with checks like this.
if parse_version(AIRFLOW_VERSION) >= parse_version("2.0.0"):
*Thread Reply:* hey Maciej, we do have plans to migrate in the coming months but for right now we need to stay on 1.10.15.
+*Thread Reply:* hey Maciej, we do have plans to migrate in the coming months but for right now we need to stay on 1.10.15.
@@ -75762,7 +75768,7 @@*Thread Reply:* Thanks, all. The release is authorized, and you can expect it by Thursday.
+*Thread Reply:* Thanks, all. The release is authorized, and you can expect it by Thursday.
@@ -75816,7 +75822,7 @@*Thread Reply:* running airflow 2.3.4 with openlineage-airflow 0.14.1
+*Thread Reply:* running airflow 2.3.4 with openlineage-airflow 0.14.1
@@ -75842,7 +75848,7 @@*Thread Reply:* if you're talking about LineageBackend, it is used in Airflow 2.1-2.2. It did not have functionality where you can be notified on task start or failure, so we wanted to expand the functionality: https://github.com/apache/airflow/issues/17984
+*Thread Reply:* if you're talking about LineageBackend, it is used in Airflow 2.1-2.2. It did not have functionality where you can be notified on task start or failure, so we wanted to expand the functionality: https://github.com/apache/airflow/issues/17984
Consensus of Airflow maintainers wasn't positive about changing this interface, so we went with another direction: https://github.com/apache/airflow/pull/20443
*Thread Reply:* Why nothing happens? https://github.com/OpenLineage/OpenLineage/blob/895160423643398348154a87e0682c3ab5c8704b/integration/airflow/openlineage/lineage_backend/__init__.py#L91
+*Thread Reply:* Why nothing happens? https://github.com/OpenLineage/OpenLineage/blob/895160423643398348154a87e0682c3ab5c8704b/integration/airflow/openlineage/lineage_backend/__init__.py#L91
*Thread Reply:* ah hmm ok, i will double check. i commented that part out so technically it should run but maybe i missed something
+*Thread Reply:* ah hmm ok, i will double check. i commented that part out so technically it should run but maybe i missed something
@@ -75995,7 +76001,7 @@*Thread Reply:* thank you for your fast response @Maciej Obuchowski ! i appreciate it
+*Thread Reply:* thank you for your fast response @Maciej Obuchowski ! i appreciate it
@@ -76021,7 +76027,7 @@*Thread Reply:* it seems like it doesn't use my custom wrapper but instead uses the openlineage implementation.
+*Thread Reply:* it seems like it doesn't use my custom wrapper but instead uses the openlineage implementation.
*Thread Reply:* @Maciej Obuchowski ok, after checking we are emitting events with our custom backend but an odd thing is an attempt is always made with the openlineage backend. is there something obvious i am perhaps missing 🤔
+*Thread Reply:* @Maciej Obuchowski ok, after checking we are emitting events with our custom backend but an odd thing is an attempt is always made with the openlineage backend. is there something obvious i am perhaps missing 🤔
ends up with requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url immediately after task start. but by the end on task success/failure it emits the event with our custom backend, both RunState.COMPLETE and RunState.START, into our own pipeline.
*Thread Reply:* If you're on 2.3 and trying to use some wrapped LineageBackend, what I think is happening is that OpenLineagePlugin automatically registers via the setup.py entrypoint https://github.com/OpenLineage/OpenLineage/blob/65a5f021a1ba3035d5198e759587737a05b242e1/integration/airflow/openlineage/airflow/plugin.py#L30
+*Thread Reply:* If you're on 2.3 and trying to use some wrapped LineageBackend, what I think is happening is that OpenLineagePlugin automatically registers via the setup.py entrypoint https://github.com/OpenLineage/OpenLineage/blob/65a5f021a1ba3035d5198e759587737a05b242e1/integration/airflow/openlineage/airflow/plugin.py#L30
*Thread Reply:* I think if you want to extend it with proprietary code there are two good options.
+*Thread Reply:* I think if you want to extend it with proprietary code there are two good options.
First, if your code only needs to touch the HTTP client side - which I guess is the case due to the 401 error - then you can create a custom Transport.
*Thread Reply:* amazing thank you for your help. i will take a look
+*Thread Reply:* amazing thank you for your help. i will take a look
@@ -76186,7 +76192,7 @@*Thread Reply:* @Maciej Obuchowski is there a way to extend the plugin like how we can wrap the custom backend with 2.2? or would it be necessary to fork it.
+*Thread Reply:* @Maciej Obuchowski is there a way to extend the plugin like how we can wrap the custom backend with 2.2? or would it be necessary to fork it.
we're trying to not fork and instead opt with extending.
@@ -76214,7 +76220,7 @@*Thread Reply:* I think it's best to fork, since it's getting loaded by Airflow as an entrypoint: https://github.com/OpenLineage/OpenLineage/blob/133110300e8ea4e42e3640608cfed459683d5a8d/integration/airflow/setup.py#L70
+*Thread Reply:* I think it's best to fork, since it's getting loaded by Airflow as an entrypoint: https://github.com/OpenLineage/OpenLineage/blob/133110300e8ea4e42e3640608cfed459683d5a8d/integration/airflow/setup.py#L70
*Thread Reply:* got it. and in terms of the openlineage.yml and defining a custom transport is there a way i can define where openlineage-python should look for the custom transport? e.g. different path
+*Thread Reply:* got it. and in terms of the openlineage.yml and defining a custom transport is there a way i can define where openlineage-python should look for the custom transport? e.g. different path
*Thread Reply:* because from the docs i can't tell except for the file i'm supposed to copy and implement.
+*Thread Reply:* because from the docs i can't tell except for the file i'm supposed to copy and implement.
*Thread Reply:* @Paul Lee you should derive from Transport base class and register type as full python import path to your custom transport, for example https://github.com/OpenLineage/OpenLineage/blob/f8533266491acea2159f602f782a99a4f8a82cca/client/python/tests/openlineage.yml#L2
+*Thread Reply:* @Paul Lee you should derive from Transport base class and register type as full python import path to your custom transport, for example https://github.com/OpenLineage/OpenLineage/blob/f8533266491acea2159f602f782a99a4f8a82cca/client/python/tests/openlineage.yml#L2
*Thread Reply:* your custom transport should also define a custom Config class, and this class should implement the from_dict method
+*Thread Reply:* your custom transport should also define a custom Config class, and this class should implement the from_dict method
*Thread Reply:* the whole process is here: https://github.com/OpenLineage/OpenLineage/blob/a62484ec14359a985d283c639ac7e8b9cfc54c2e/client/python/openlineage/client/transport/factory.py#L47
+*Thread Reply:* the whole process is here: https://github.com/OpenLineage/OpenLineage/blob/a62484ec14359a985d283c639ac7e8b9cfc54c2e/client/python/openlineage/client/transport/factory.py#L47
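Putting those pieces together, a minimal sketch of a custom transport under the interface described above (class and module names are illustrative; the openlineage.yml type field would point at the full import path, e.g. mycompany.transport.AuthHttpTransport):
```
import requests

from openlineage.client.serde import Serde
from openlineage.client.transport import Config, Transport


class AuthHttpConfig(Config):
    def __init__(self, url: str, api_key: str):
        self.url = url
        self.api_key = api_key

    @classmethod
    def from_dict(cls, params: dict) -> "AuthHttpConfig":
        # Called by the transport factory with the parsed openlineage.yml section.
        return cls(url=params["url"], api_key=params["api_key"])


class AuthHttpTransport(Transport):
    kind = "auth_http"
    config = AuthHttpConfig

    def __init__(self, config: AuthHttpConfig):
        self.config = config

    def emit(self, event):
        # Attach the proprietary auth header, then POST the serialized event.
        requests.post(
            self.config.url,
            data=Serde.to_json(event),
            headers={"Authorization": f"Bearer {self.config.api_key}"},
            timeout=5,
        )
```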
*Thread Reply:* and I know we need to document this better 🙂
+*Thread Reply:* and I know we need to document this better 🙂
@@ -76483,7 +76489,7 @@*Thread Reply:* amazing, thanks for all your help 🙂 +1 to the docs, if i have some time when done i will push up some docs to document what i've done
+*Thread Reply:* amazing, thanks for all your help 🙂 +1 to the docs, if i have some time when done i will push up some docs to document what i've done
@@ -76509,7 +76515,7 @@*Thread Reply:* https://github.com/openlineage/docs/ - let me know and I'll review 🙂
+*Thread Reply:* https://github.com/openlineage/docs/ - let me know and I'll review 🙂
*Thread Reply:* Thanks, all. The release is authorized.
+*Thread Reply:* Thanks, all. The release is authorized.
@@ -76751,7 +76757,7 @@*Thread Reply:* would love to add improvement in docs :) for newcomers
+*Thread Reply:* would love to add improvement in docs :) for newcomers
@@ -76781,7 +76787,7 @@*Thread Reply:* also, what’s TSC?
+*Thread Reply:* also, what’s TSC?
@@ -76807,7 +76813,7 @@*Thread Reply:* Technical Steering Committee, but it’s open to everyone
+*Thread Reply:* Technical Steering Committee, but it’s open to everyone
@@ -76837,7 +76843,7 @@*Thread Reply:* and we encourage newcomers to attend
+*Thread Reply:* and we encourage newcomers to attend
@@ -76889,7 +76895,7 @@*Thread Reply:* is there any error/warn message logged maybe?
+*Thread Reply:* is there any error/warn message logged maybe?
@@ -76915,7 +76921,7 @@*Thread Reply:* none that i'm seeing on our workers. i do see that our custom http transport is being utilized on START.
+*Thread Reply:* none that i'm seeing on our workers. i do see that our custom http transport is being utilized on START.
but on SUCCESS nothing fires.
@@ -76943,7 +76949,7 @@*Thread Reply:* which makes me believe the listeners themselves aren't being utilized? 🤔
+*Thread Reply:* which makes me believe the listeners themselves aren't being utilized? 🤔
@@ -76969,7 +76975,7 @@*Thread Reply:* uhm, any chance you're experiencing this with custom extractors?
+*Thread Reply:* uhm, any chance you're experiencing this with custom extractors?
@@ -76995,7 +77001,7 @@*Thread Reply:* I'd be happy to jump on a quick call if you wish
+*Thread Reply:* I'd be happy to jump on a quick call if you wish
@@ -77021,7 +77027,7 @@*Thread Reply:* but in more EU friendly hours 🙂
+*Thread Reply:* but in more EU friendly hours 🙂
@@ -77047,7 +77053,7 @@*Thread Reply:* no custom extractors, it's using the base extractor. a call would be 👍. let me look at my calendar and EU hours.
+*Thread Reply:* no custom extractors, it's using the base extractor. a call would be 👍. let me look at my calendar and EU hours.
@@ -77144,7 +77150,7 @@*Thread Reply:* Apparently the value is hard-coded somewhere in the code that I couldn't figure out, but I at least learned that on my Mac the port 5000 being held up can be freed by following the simple step below.
+*Thread Reply:* Apparently the value is hard-coded somewhere in the code that I couldn't figure out, but I at least learned that on my Mac the port 5000 being held up can be freed by following the simple step below.
*Thread Reply:* Hi @Hanna Moazam,
+*Thread Reply:* Hi @Hanna Moazam,
Within recent PR https://github.com/OpenLineage/OpenLineage/pull/1111, I removed BigQuery dependencies from the spark2, spark32 and spark3 subprojects. It has to stay in shared because of BigQueryNodeVisitor. The usage of BigQueryNodeVisitor is tricky, as we never know if the BigQuery classes are available at runtime or not. The check is done in io.openlineage.spark.agent.lifecycle.BaseVisitorFactory:
if (BigQueryNodeVisitor.hasBigQueryClasses()) {
@@ -77303,7 +77309,7 @@
SundayFunday
*Thread Reply:* So, if we wanted to add Snowflake, we would need to:
+*Thread Reply:* So, if we wanted to add Snowflake, we would need to:
*Thread Reply:* Yes. Please note that the Snowflake library will not be included in the target OpenLineage jar. So you may test it manually against multiple Snowflake library versions, or even adjust the code in case of minor differences.
+*Thread Reply:* Yes. Please note that the Snowflake library will not be included in the target OpenLineage jar. So you may test it manually against multiple Snowflake library versions, or even adjust the code in case of minor differences.
@@ -77362,7 +77368,7 @@*Thread Reply:* Thank you Pawel!
+*Thread Reply:* Thank you Pawel!
@@ -77388,7 +77394,7 @@*Thread Reply:* Basically the same pattern you've already done with Kusto 😉 +
*Thread Reply:* Basically the same pattern you've already done with Kusto 😉 https://github.com/OpenLineage/OpenLineage/blob/a96ecdabe66567151e7739e25cd9dd03d6[…]va/io/openlineage/spark/agent/lifecycle/BaseVisitorFactory.java
*Thread Reply:* We actually used only reflection for Kusto and were hoping to do it the 'better' way with the package itself for snowflake - if it's possible :)
+*Thread Reply:* We actually used only reflection for Kusto and were hoping to do it the 'better' way with the package itself for snowflake - if it's possible :)
@@ -77496,7 +77502,7 @@*Thread Reply:* Reference implementation of OpenLineage consumer is Marquez: https://github.com/MarquezProject/marquez
+*Thread Reply:* Reference implementation of OpenLineage consumer is Marquez: https://github.com/MarquezProject/marquez
*Thread Reply:* Hey @Austin Poulton, welcome! 👋
+*Thread Reply:* Hey @Austin Poulton, welcome! 👋
@@ -77902,7 +77908,7 @@*Thread Reply:* thanks Harel 🙂
+*Thread Reply:* thanks Harel 🙂
@@ -77973,7 +77979,7 @@*Thread Reply:* Thanks, all! The release is authorized. We will initiate it within 48 hours.
+*Thread Reply:* Thanks, all! The release is authorized. We will initiate it within 48 hours.
@@ -78025,7 +78031,7 @@*Thread Reply:* I think amundsen-openlineage dataloader precedes column-level lineage in OL by a bit, so I doubt this works
+*Thread Reply:* I think amundsen-openlineage dataloader precedes column-level lineage in OL by a bit, so I doubt this works
@@ -78051,7 +78057,7 @@*Thread Reply:* do you want to open up an issue for it @Iftach Schonbaum?
+*Thread Reply:* do you want to open up an issue for it @Iftach Schonbaum?
@@ -78170,7 +78176,7 @@*Thread Reply:* Thanks, all. The release is authorized and will be initiated shortly.
+*Thread Reply:* Thanks, all. The release is authorized and will be initiated shortly.
@@ -78268,7 +78274,7 @@*Thread Reply:* > Are there any tutorials or documentation on how to create an OpenLineage connector? +
*Thread Reply:* > Are there any tutorials or documentation on how to create an OpenLineage connector?
We have somewhat of a start of a doc: https://openlineage.io/docs/development/developing/
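Until that doc is fleshed out, a minimal sketch of what a connector ultimately emits, using the Python client (endpoint, namespaces and names are illustrative):
```
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # assumed Marquez endpoint
producer = "https://github.com/my-org/my-connector"      # illustrative producer URI
run = Run(runId=str(uuid4()))
job = Job(namespace="my-namespace", name="my_connector_job")

# START when the observed work begins...
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run, job=job, producer=producer,
))
# ...COMPLETE (with inputs/outputs) when it finishes.
client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run, job=job, producer=producer,
    inputs=[Dataset(namespace="postgres://db", name="public.source_table")],
    outputs=[Dataset(namespace="postgres://db", name="public.target_table")],
))
```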
@@ -78440,7 +78446,7 @@*Thread Reply:* Welcome Kenton! Happy to help 👍
+*Thread Reply:* Welcome Kenton! Happy to help 👍
@@ -78497,7 +78503,7 @@*Thread Reply:* What kind of data? My first feeling is that you need to extend the Spark integration
+*Thread Reply:* What kind of data? My first feeling is that you need to extend the Spark integration
@@ -78523,7 +78529,7 @@*Thread Reply:* Yes, we wanted to add information like user/job description that we can use later with the rest of the openlineage event fields in our system
+*Thread Reply:* Yes, we wanted to add information like user/job description that we can use later with the rest of the openlineage event fields in our system
@@ -78549,7 +78555,7 @@*Thread Reply:* I can see in this PR https://github.com/OpenLineage/OpenLineage/pull/490 that env values can be captured which we can use to add some custom metadata but it seems it is specific to Databricks only.
+*Thread Reply:* I can see in this PR https://github.com/OpenLineage/OpenLineage/pull/490 that env values can be captured which we can use to add some custom metadata but it seems it is specific to Databricks only.
*Thread Reply:* I think it makes sense to have something like that, but generic, if you want to contribute it
+*Thread Reply:* I think it makes sense to have something like that, but generic, if you want to contribute it
@@ -78666,7 +78672,7 @@*Thread Reply:* @Maciej Obuchowski Do you mean adding something like +
*Thread Reply:* @Maciej Obuchowski Do you mean adding something like
spark.openlineage.jobFacet.FacetName.Key=Value
to the spark conf should add a new job facet like
"FacetName": {
    "Key": "Value"
}
@@ -78696,7 +78702,7 @@
SundayFunday
*Thread Reply:* We can argue about name of that key, but yes, something like that. Just notice that while it's possible to attach something to run and job facets directly, it would be much harder to do this with datasets
+*Thread Reply:* We can argue about name of that key, but yes, something like that. Just notice that while it's possible to attach something to run and job facets directly, it would be much harder to do this with datasets
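To make the proposal concrete, a hypothetical PySpark sketch - this jobFacet conf convention is what is being proposed in the thread, not an existing OpenLineage setting:
```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("facets-from-conf")
    # Proposed convention: spark.openlineage.jobFacet.<FacetName>.<Key> = <Value>
    .config("spark.openlineage.jobFacet.ownership.team", "data-platform")
    .config("spark.openlineage.jobFacet.ownership.owner", "jane.doe")
    .getOrCreate()
)
# The integration would then emit a job facet like:
# "ownership": {"team": "data-platform", "owner": "jane.doe"}
```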
*Thread Reply:* Hi @Varun Singh, what version of openlineage-spark were you using? Are you able to copy the lineage event here?
+*Thread Reply:* Hi @Varun Singh, what version of openlineage-spark were you using? Are you able to copy the lineage event here?
*Thread Reply:* What if we just don't include it in the BaseVisitorFactory but only in the Spark3 visitor factories?
+*Thread Reply:* What if we just don't include it in the BaseVisitorFactory but only in the Spark3 visitor factories?
*Thread Reply:* You might look here: https://github.com/OpenLineage/OpenLineage/blob/f7049c599a0b1416408860427f0759624326677d/client/python/openlineage/client/serde.py#L51
+*Thread Reply:* You might look here: https://github.com/OpenLineage/OpenLineage/blob/f7049c599a0b1416408860427f0759624326677d/client/python/openlineage/client/serde.py#L51
*Thread Reply:* I don’t think this is possible at the moment.
+*Thread Reply:* I don’t think this is possible at the moment.
@@ -79135,7 +79141,7 @@*Thread Reply:* Thanks, all. The release is authorized.
+*Thread Reply:* Thanks, all. The release is authorized.
@@ -79161,7 +79167,7 @@*Thread Reply:* The PR for the changelog updates: https://github.com/OpenLineage/OpenLineage/pull/1306
+*Thread Reply:* The PR for the changelog updates: https://github.com/OpenLineage/OpenLineage/pull/1306
@@ -79213,7 +79219,7 @@*Thread Reply:* I think we had similar request before, but nothing was implemented.
+*Thread Reply:* I think we had similar request before, but nothing was implemented.
@@ -79333,7 +79339,7 @@*Thread Reply:* It's kind of a hard problem UI side. Backend can express that relationship
+*Thread Reply:* It's kind of a hard problem UI side. Backend can express that relationship
@@ -79359,7 +79365,7 @@*Thread Reply:* Thanks for replying. Could you please point me to the API that allows me to do that? I've been calling GET /lineage with dataset in the node ID, e g., nodeId=dataset:my_dataset . Where could I specify the version of my dataset?
+*Thread Reply:* Thanks for replying. Could you please point me to the API that allows me to do that? I've been calling GET /lineage with dataset in the node ID, e g., nodeId=dataset:my_dataset . Where could I specify the version of my dataset?
@@ -79411,7 +79417,7 @@*Thread Reply:* Templated fields are rendered before generating lineage data. Do you have some sample code or logs, preferably?
+*Thread Reply:* Templated fields are rendered before generating lineage data. Do you have some sample code or logs, preferably?
@@ -79437,7 +79443,7 @@*Thread Reply:* If you're on 1.10 then I think it won't work
+*Thread Reply:* If you're on 1.10 then I think it won't work
@@ -79463,7 +79469,7 @@*Thread Reply:* @Maciej Obuchowski we are still on airflow 1.10.15 unfortunately.
+*Thread Reply:* @Maciej Obuchowski we are still on airflow 1.10.15 unfortunately.
cc. @Eli Schachar @Allison Suarez
@@ -79491,7 +79497,7 @@*Thread Reply:* is there no workaround we can make work?
+*Thread Reply:* is there no workaround we can make work?
@@ -79517,7 +79523,7 @@*Thread Reply:* @Jakub Dardziński is this for airflow versions 2.0+?
+*Thread Reply:* @Jakub Dardziński is this for airflow versions 2.0+?
@@ -79569,7 +79575,7 @@*Thread Reply:* Yeah. However, to add it, just a tiny bit of code would be required.
+*Thread Reply:* Yeah. However, to add it, just a tiny bit of code would be required.
Either in the URL version https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/main/java/io/openlineage/spark/agent/EventEmitter.java#L48
@@ -79699,7 +79705,7 @@*Thread Reply:* I think that's what NominalTimeFacet covers
+*Thread Reply:* I think that's what NominalTimeFacet covers
*Thread Reply:* Hey @Rahul Sharma, what version of Airflow are you running?
+*Thread Reply:* Hey @Rahul Sharma, what version of Airflow are you running?
@@ -79798,7 +79804,7 @@*Thread Reply:* i am using airflow 2.x
+*Thread Reply:* i am using airflow 2.x
@@ -79824,7 +79830,7 @@*Thread Reply:* can we connect if you have time ?
+*Thread Reply:* can we connect if you have time ?
@@ -79850,7 +79856,7 @@*Thread Reply:* did you see these docs before? https://openlineage.io/integration/apache-airflow/#airflow-20
+*Thread Reply:* did you see these docs before? https://openlineage.io/integration/apache-airflow/#airflow-20
*Thread Reply:* yes
+*Thread Reply:* yes
@@ -79923,7 +79929,7 @@*Thread Reply:* i already set configuration in airflow.cfg file
+*Thread Reply:* i already set configuration in airflow.cfg file
@@ -79949,7 +79955,7 @@*Thread Reply:* where are you sending the events to?
+*Thread Reply:* where are you sending the events to?
@@ -79975,7 +79981,7 @@*Thread Reply:* i have a docker machine on which marquez is working
+*Thread Reply:* i have a docker machine on which marquez is working
@@ -80001,7 +80007,7 @@*Thread Reply:* so, what is the issue you are seeing?
+*Thread Reply:* so, what is the issue you are seeing?
@@ -80027,7 +80033,7 @@*Thread Reply:* there is no error
+*Thread Reply:* there is no error
@@ -80053,7 +80059,7 @@*Thread Reply:* ```[lineage]
+*Thread Reply:* ```[lineage]
*Thread Reply:* above config i have set
+*Thread Reply:* above config i have set
@@ -80120,7 +80126,7 @@*Thread Reply:* please let me know if there is anything else I need to do
+*Thread Reply:* please let me know if there is anything else I need to do
@@ -80172,7 +80178,7 @@*Thread Reply:* please have a look at openapi definition of the event: https://openlineage.io/apidocs/openapi/
+*Thread Reply:* please have a look at openapi definition of the event: https://openlineage.io/apidocs/openapi/
@@ -80224,7 +80230,7 @@*Thread Reply:* hey, I'll DM you.
+*Thread Reply:* hey, I'll DM you.
@@ -80287,7 +80293,7 @@*Thread Reply:* Thanks, all. The release is authorized and will be initiated within two business days.
+*Thread Reply:* Thanks, all. The release is authorized and will be initiated within two business days.
@@ -80370,7 +80376,7 @@*Thread Reply:* Microsoft https://openlineage.io/blog/openlineage-microsoft-purview/
+*Thread Reply:* Microsoft https://openlineage.io/blog/openlineage-microsoft-purview/
*Thread Reply:* I think we can share that we have over 2,000 installs of that Microsoft solution accelerator using OpenLineage.
+*Thread Reply:* I think we can share that we have over 2,000 installs of that Microsoft solution accelerator using OpenLineage.
That means we have thousands of companies having experimented with OpenLineage and Microsoft Purview.
@@ -80526,7 +80532,7 @@*Thread Reply:* For open discussion, I'd like to ask the team for an overview of how the different gradle files are working together for the Spark implementation. I'm terribly confused on where dependencies need to be added (whether it's in shared, app, or a spark version specific folder). Maybe @Maciej Obuchowski...?
+*Thread Reply:* For open discussion, I'd like to ask the team for an overview of how the different gradle files are working together for the Spark implementation. I'm terribly confused on where dependencies need to be added (whether it's in shared, app, or a spark version specific folder). Maybe @Maciej Obuchowski...?
@@ -80556,7 +80562,7 @@*Thread Reply:* Unfortunately I'll be unable to attend the meeting @Will Johnson 😞
+*Thread Reply:* Unfortunately I'll be unable to attend the meeting @Will Johnson 😞
@@ -80586,7 +80592,7 @@*Thread Reply:* This is starting now. CC @Will Johnson
+*Thread Reply:* This is starting now. CC @Will Johnson
@@ -80612,7 +80618,7 @@*Thread Reply:* @Will Johnson Check the notes and the recording. @Michael Collado did a pass at explaining the relationship between shared, app and the versions
+*Thread Reply:* @Will Johnson Check the notes and the recording. @Michael Collado did a pass at explaining the relationship between shared, app and the versions
@@ -80638,7 +80644,7 @@*Thread Reply:* feel free to follow up here as well
+*Thread Reply:* feel free to follow up here as well
@@ -80664,7 +80670,7 @@*Thread Reply:* ascii art to the rescue! (top “depends on” bottom)
+*Thread Reply:* ascii art to the rescue! (top “depends on” bottom)
[ascii dependency diagram, mangled in the archive: app at the top, the spark-version modules in the middle, shared at the bottom]
@@ -80719,7 +80725,7 @@ MARQUEZ_API_KEY=[YOUR_API_KEY]
*Thread Reply:* 😍
+*Thread Reply:* 😍
@@ -80745,7 +80751,7 @@*Thread Reply:* (btw, we should have written datakin to output ascii art; it’s obviously the superior way to generate graphs 😜)
+*Thread Reply:* (btw, we should have written datakin to output ascii art; it’s obviously the superior way to generate graphs 😜)
@@ -80771,7 +80777,7 @@*Thread Reply:* Hi, is there a recording for this meeting?
+*Thread Reply:* Hi, is there a recording for this meeting?
@@ -80823,7 +80829,7 @@*Thread Reply:* The namespace is the bucket and the dataset name is the path. Is there a blob storage provider in particular you are thinking of?
+*Thread Reply:* The namespace is the bucket and the dataset name is the path. Is there a blob storage provider in particular you are thinking of?
@@ -80849,7 +80855,7 @@*Thread Reply:* Thanks, that makes sense. We use GCS, so it is already covered by the naming conventions documented. I was just not sure if I was understanding the document correctly or not.
+*Thread Reply:* Thanks, that makes sense. We use GCS, so it is already covered by the naming conventions documented. I was just not sure if I was understanding the document correctly or not.
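For reference, that convention rendered with the Python client's Dataset class, assuming GCS (bucket and path are illustrative):
```
from openlineage.client.run import Dataset

# Namespace is the bucket (with scheme); the dataset name is the path within it.
ds = Dataset(namespace="gs://my-bucket", name="/warehouse/orders/2022-10-01")
```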
@@ -80875,7 +80881,7 @@*Thread Reply:* No problem. Let us know if you have suggestions on the wording to make the doc clearer
+*Thread Reply:* No problem. Let us know if you have suggestions on the wording to make the doc clearer
@@ -80975,7 +80981,7 @@*Thread Reply:* Dataset dependencies are represented through a common relationship with a Job - e.g., the task that performed the transformation.
+*Thread Reply:* Dataset dependencies are represented through a common relationship with a Job - e.g., the task that performed the transformation.
@@ -81001,7 +81007,7 @@*Thread Reply:* Is it possible to populate table-level dependency without any transformation using OpenLineage specifications? Like to define that dataset 1 is dependent on table 1 and table 2, which can be represented as separate datasets
+*Thread Reply:* Is it possible to populate table-level dependency without any transformation using OpenLineage specifications? Like to define that dataset 1 is dependent on table 1 and table 2, which can be represented as separate datasets
@@ -81027,7 +81033,7 @@*Thread Reply:* Not explicitly, in today's spec. The guiding principle is that something created that dependency, and the dependency changes over time in a way that is important to study.
+*Thread Reply:* Not explicitly, in today's spec. The guiding principle is that something created that dependency, and the dependency changes over time in a way that is important to study.
@@ -81053,7 +81059,7 @@*Thread Reply:* I say this to explain why it is the way it is - but the spec can change over time to serve new use cases, certainly!
+*Thread Reply:* I say this to explain why it is the way it is - but the spec can change over time to serve new use cases, certainly!
@@ -81105,7 +81111,7 @@*Thread Reply:* Hi @Anirudh Shrinivason, you could start with column-lineage & spark workshop available here -> https://github.com/OpenLineage/workshops/tree/main/spark
+*Thread Reply:* Hi @Anirudh Shrinivason, you could start with column-lineage & spark workshop available here -> https://github.com/OpenLineage/workshops/tree/main/spark
@@ -81135,7 +81141,7 @@*Thread Reply:* Hi @Paweł Leszczyński Thanks for the link! But this does not really answer the concern.
+*Thread Reply:* Hi @Paweł Leszczyński Thanks for the link! But this does not really answer the concern.
@@ -81161,7 +81167,7 @@*Thread Reply:* I am already able to capture column lineage
+*Thread Reply:* I am already able to capture column lineage
@@ -81187,7 +81193,7 @@*Thread Reply:* What I would like is to capture some extra environment variables, and send it to the server along with the lineage
+*Thread Reply:* What I would like is to capture some extra environment variables, and send it to the server along with the lineage
@@ -81213,7 +81219,7 @@*Thread Reply:* i remember we already have a facet for that: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/facets/EnvironmentFacet.java
+*Thread Reply:* i remember we already have a facet for that: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/facets/EnvironmentFacet.java
*Thread Reply:* but it is only used at the moment to capture some databricks environment attributes
+*Thread Reply:* but it is only used at the moment to capture some databricks environment attributes
@@ -81316,7 +81322,7 @@*Thread Reply:* so you can contribute to the project and add a feature which adds specified/all environment variables to the lineage event.
+*Thread Reply:* so you can contribute to the project and add a feature which adds specified/all environment variables to the lineage event.
you can also have a look at the extending section of the spark integration docs (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#extending) and create a class that adds a run facet builder according to your needs.
*Thread Reply:* the third way is to create an issue related to this, because being able to send selected/all environment variables in the OL event seems like a really cool feature.
+*Thread Reply:* the third way is to create an issue related to this, because being able to send selected/all environment variables in the OL event seems like a really cool feature.
@@ -81374,7 +81380,7 @@*Thread Reply:* That is great! Thank you so much! This really helps!
+*Thread Reply:* That is great! Thank you so much! This really helps!
*Thread Reply:* List<String> dbPropertiesKeys =
+*Thread Reply:* List<String> dbPropertiesKeys =
    Arrays.asList(
        "orgId",
        "spark.databricks.clusterUsageTags.clusterOwnerOrgId",
@@ -81443,7 +81449,7 @@
MARQUEZ_API_KEY=[YOUR_API_KEY]
*Thread Reply:* I have opened an issue in the community here: https://github.com/OpenLineage/OpenLineage/issues/1419
+*Thread Reply:* I have opened an issue in the community here: https://github.com/OpenLineage/OpenLineage/issues/1419
*Thread Reply:* Hi @Paweł Leszczyński I have opened a PR for helping to add this use case. Please do help to see if we can merge it in. Thanks! +
*Thread Reply:* Hi @Paweł Leszczyński I have opened a PR for helping to add this use case. Please do help to see if we can merge it in. Thanks! https://github.com/OpenLineage/OpenLineage/pull/1545
*Thread Reply:* Hey @Anirudh Shrinivason, sorry for late reply, but I reviewed the PR.
+*Thread Reply:* Hey @Anirudh Shrinivason, sorry for late reply, but I reviewed the PR.
@@ -81621,7 +81627,7 @@*Thread Reply:* Hey thanks a lot! I have made the requested changes! Thanks!
+*Thread Reply:* Hey thanks a lot! I have made the requested changes! Thanks!
@@ -81647,7 +81653,7 @@*Thread Reply:* @Maciej Obuchowski ^ 🙂
+*Thread Reply:* @Maciej Obuchowski ^ 🙂
@@ -81677,7 +81683,7 @@*Thread Reply:* Hey @Anirudh Shrinivason, took a look at it but it unfortunately fails integration tests (throws NPE), can you take a look again?
+*Thread Reply:* Hey @Anirudh Shrinivason, took a look at it but it unfortunately fails integration tests (throws NPE), can you take a look again?
23/02/06 12:18:39 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception
java.lang.NullPointerException
@@ -81723,7 +81729,7 @@
MARQUEZ_API_KEY=[YOUR_API_KEY]
*Thread Reply:* Hi yeah my bad. It should be fixed in the latest push. But I think the tests are not running in the CI because of some GCP environment issue? I am not really sure how to fix it...
+*Thread Reply:* Hi yeah my bad. It should be fixed in the latest push. But I think the tests are not running in the CI because of some GCP environment issue? I am not really sure how to fix it...
@@ -81749,7 +81755,7 @@*Thread Reply:* I can make them run, it's just that running them on forks is disabled. We need to make it more clear I suppose
+*Thread Reply:* I can make them run, it's just that running them on forks is disabled. We need to make it more clear I suppose
@@ -81779,7 +81785,7 @@*Thread Reply:* Ahh I see thanks! Also, some of the tests are failing on my local, such as https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/test/java/io/openlineage/spark/agent/lifecycle/DeltaDataSourceTest.java. Is this expected behaviour?
+*Thread Reply:* Ahh I see thanks! Also, some of the tests are failing on my local, such as https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/test/java/io/openlineage/spark/agent/lifecycle/DeltaDataSourceTest.java. Is this expected behaviour?
*Thread Reply:* tests failing isn't expected behaviour 🙂
+*Thread Reply:* tests failing isn't expected behaviour 🙂
@@ -81943,7 +81949,7 @@*Thread Reply:* Ahh yeap it was a local ide issue on my side. I added some tests to verify the presence of env variables too.
+*Thread Reply:* Ahh yeap it was a local ide issue on my side. I added some tests to verify the presence of env variables too.
@@ -81969,7 +81975,7 @@*Thread Reply:* @Anirudh Shrinivason let me know then when you'll push fixed version, I can run full tests then
+*Thread Reply:* @Anirudh Shrinivason let me know then when you'll push fixed version, I can run full tests then
@@ -81995,7 +82001,7 @@*Thread Reply:* I have pushed just now
+*Thread Reply:* I have pushed just now
@@ -82021,7 +82027,7 @@*Thread Reply:* You can run the tests
+*Thread Reply:* You can run the tests
@@ -82047,7 +82053,7 @@*Thread Reply:* @Maciej Obuchowski mb I pushed again rn. Missed out a closing bracket.
+*Thread Reply:* @Maciej Obuchowski mb I pushed again rn. Missed out a closing bracket.
@@ -82077,7 +82083,7 @@*Thread Reply:* @Maciej Obuchowski Hi, could we merge this PR in? I'd like to see if we can have these changes in the new release...
+*Thread Reply:* @Maciej Obuchowski Hi, could we merge this PR in? I'd like to see if we can have these changes in the new release...
@@ -82129,7 +82135,7 @@*Thread Reply:* Hi 👋 this would require a series of JSON calls:
+*Thread Reply:* Hi 👋 this would require a series of JSON calls:
*Thread Reply:* in OpenLineage relationships are typically Job -> Dataset -> Job, so +
*Thread Reply:* in OpenLineage relationships are typically Job -> Dataset -> Job, so
• you create a relationship between datasets by referring to them in the same job - i.e., this task ran and read from these datasets and wrote to those datasets
• you create a relationship between tasks by referring to the same datasets across both of them - i.e., this task wrote that dataset and this other task read from it
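A sketch of that Job -> Dataset -> Job pattern with the Python client - two jobs become related by referring to the same dataset (endpoint and names are illustrative):
```
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # assumed Marquez endpoint
shared = Dataset(namespace="example", name="shared_table")
producer = "https://github.com/my-org/manual-lineage"    # illustrative producer URI

def emit_complete(job_name, inputs, outputs):
    # One COMPLETE event tying a job to its input/output datasets.
    client.emit(RunEvent(
        eventType=RunState.COMPLETE,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=str(uuid4())),
        job=Job(namespace="example", name=job_name),
        producer=producer, inputs=inputs, outputs=outputs,
    ))

emit_complete("task_a", inputs=[], outputs=[shared])  # task_a wrote the dataset
emit_complete("task_b", inputs=[shared], outputs=[])  # task_b read it -> related
```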
@@ -82186,7 +82192,7 @@*Thread Reply:* @Bramha Aelem if you look in this directory, you can find example start/complete JSON calls that show how to specify input/output datasets.
+*Thread Reply:* @Bramha Aelem if you look in this directory, you can find example start/complete JSON calls that show how to specify input/output datasets.
(it’s an airflow workshop, but those examples are for a part of the workshop that doesn’t involve airflow)
@@ -82214,7 +82220,7 @@*Thread Reply:* (these can also be found in the docs)
+*Thread Reply:* (these can also be found in the docs)
*Thread Reply:* @Ross Turk - Thanks for the details. will try and get back to you on it
+*Thread Reply:* @Ross Turk - Thanks for the details. will try and get back to you on it
@@ -82291,7 +82297,7 @@*Thread Reply:* @Ross Turk - Good Evening, It worked as expected. I am able to replicate the scenarios which I am looking for.
+*Thread Reply:* @Ross Turk - Good Evening, It worked as expected. I am able to replicate the scenarios which I am looking for.
@@ -82321,7 +82327,7 @@*Thread Reply:* @Ross Turk - Thanks for your response.
+*Thread Reply:* @Ross Turk - Thanks for your response.
@@ -82347,7 +82353,7 @@*Thread Reply:* @Ross Turk - First activity: I am making an HTTP call to pull the lookup data and store it in ADLS. +
*Thread Reply:* @Ross Turk - First activity: I am making an HTTP call to pull the lookup data and store it in ADLS. Second activity: after the completion of the first activity, I am making an Azure Databricks call to use the lookup file and generate the output tables. How can I refer to the Databricks-generated tables' facets as an input to the subsequent activities in the pipeline? When I refer to them as an input, the Spark tables' metadata does not show up. How can this be achieved? After the execution of each activity in the ADF pipeline, I am sending start and complete/fail lineage events to Marquez.
@@ -82460,7 +82466,7 @@*Thread Reply:* welcome! let’s us know if you have any questions
+*Thread Reply:* welcome! let’s us know if you have any questions
@@ -82512,7 +82518,7 @@*Thread Reply:* Hey Matt! Happy to have you here! Feel free to reach out if you have any questions
+*Thread Reply:* Hey Matt! Happy to have you here! Feel free to reach out if you have any questions
@@ -82728,7 +82734,7 @@*Thread Reply:* @Ross Turk maybe you can help with this?
+*Thread Reply:* @Ross Turk maybe you can help with this?
@@ -82754,7 +82760,7 @@*Thread Reply:* hmm, I believe the dbt-ol integration does capture bytes/rows, but only for some data sources: https://github.com/OpenLineage/OpenLineage/blob/6ae1fd5665d5fd539b05d044f9b6fb831ce9d475/integration/common/openlineage/common/provider/dbt.py#L567
+*Thread Reply:* hmm, I believe the dbt-ol integration does capture bytes/rows, but only for some data sources: https://github.com/OpenLineage/OpenLineage/blob/6ae1fd5665d5fd539b05d044f9b6fb831ce9d475/integration/common/openlineage/common/provider/dbt.py#L567
*Thread Reply:* I haven't personally tried it with Snowflake in a few versions, but the code suggests that it's one of them.
+*Thread Reply:* I haven't personally tried it with Snowflake in a few versions, but the code suggests that it's one of them.
@@ -82831,7 +82837,7 @@*Thread Reply:* @Max you say your dbt-ol run is resulting in only jobs and no datasets emitted, is that correct?
+*Thread Reply:* @Max you say your dbt-ol run is resulting in only jobs and no datasets emitted, is that correct?
@@ -82857,7 +82863,7 @@*Thread Reply:* if so, I'd say something rather strange is going on because in my experience each model should result in a Job and a Dataset.
+*Thread Reply:* if so, I'd say something rather strange is going on because in my experience each model should result in a Job and a Dataset.
@@ -82909,7 +82915,7 @@*Thread Reply:* I was looking for something similar to the airflow integration
+*Thread Reply:* I was looking for something similar to the airflow integration
@@ -82935,7 +82941,7 @@*Thread Reply:* hey @Kuldeep - i don't think there's something for Luigi right now - is that something you'd potentially be interested in?
+*Thread Reply:* hey @Kuldeep - i don't think there's something for Luigi right now - is that something you'd potentially be interested in?
@@ -82961,7 +82967,7 @@*Thread Reply:* @Viraj Parekh Yes this is something we are interested in! There are a lot of projects out there that use luigi
+*Thread Reply:* @Viraj Parekh Yes this is something we are interested in! There are a lot of projects out there that use luigi
@@ -83026,7 +83032,7 @@*Thread Reply:* Hi @Michael Robinson a new, improved mechanism for passing conf parameters to the OpenLineage client in the Spark integration
+
*Thread Reply:* Hi @Michael Robinson a new, improved mechanism for passing conf parameters to the OpenLineage client in the Spark integration
Would it be possible to have more details on what this entails please? Thanks!
*Thread Reply:* @Tomasz Nazarewicz might explain this better
+*Thread Reply:* @Tomasz Nazarewicz might explain this better
@@ -83079,7 +83085,7 @@*Thread Reply:* @Anirudh Shrinivason until now, if you wanted to add a new property to the OL client, you had to also implement it in the integration, because the integration had to parse all properties, create appropriate objects etc. The new implementation makes client properties transparent to the integration; they are only passed through, and parsing happens inside the client.
+*Thread Reply:* @Anirudh Shrinivason until now, if you wanted to add a new property to the OL client, you had to also implement it in the integration, because the integration had to parse all properties, create appropriate objects etc. The new implementation makes client properties transparent to the integration; they are only passed through, and parsing happens inside the client.
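A sketch of the pass-through idea, assuming transport-style keys: everything under spark.openlineage.** is forwarded to the client, which does the parsing (the key names are illustrative of the mechanism):
```
from pyspark.sql import SparkSession

# The integration no longer interprets these; the client parses them itself.
spark = (
    SparkSession.builder.appName("client-conf-passthrough")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")
    .getOrCreate()
)
```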
@@ -83105,7 +83111,7 @@*Thread Reply:* Thanks, all. The release is authorized and will commence shortly 🙂
+*Thread Reply:* Thanks, all. The release is authorized and will commence shortly 🙂
@@ -83131,7 +83137,7 @@*Thread Reply:* @Tomasz Nazarewicz Ahh I see. Okay thanks!
+*Thread Reply:* @Tomasz Nazarewicz Ahh I see. Okay thanks!
@@ -83192,7 +83198,7 @@*Thread Reply:* @Michael Robinson Will there be a recording?
+*Thread Reply:* @Michael Robinson Will there be a recording?
@@ -83218,7 +83224,7 @@*Thread Reply:* @Anirudh Shrinivason Yes, and the recording will be here: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
+*Thread Reply:* @Anirudh Shrinivason Yes, and the recording will be here: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
@@ -83364,7 +83370,7 @@*Thread Reply:* Hi @Hanna Moazam, we've written a Jupyter notebook to demo the dataset symlinks feature: +
*Thread Reply:* Hi @Hanna Moazam, we've written a Jupyter notebook to demo the dataset symlinks feature: https://github.com/OpenLineage/workshops/blob/main/spark/dataset_symlinks.ipynb
For the scenario you describe, there should be a symlink facet sent, similar to:
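The archived JSON example is cut off here; as a rough sketch, such a facet built with the Python client's facet classes (class names assumed from the client of that era) would look like:
```
from openlineage.client.facet import SymlinksDatasetFacet, SymlinksDatasetFacetIdentifiers
from openlineage.client.run import Dataset

# Physical location as the dataset, with the logical (Hive) table as a symlink.
dataset = Dataset(
    namespace="file",
    name="/tmp/warehouse/mytable",
    facets={
        "symlinks": SymlinksDatasetFacet(
            identifiers=[SymlinksDatasetFacetIdentifiers(
                namespace="hive://metastore:9083",
                name="default.mytable",
                type="TABLE",
            )]
        )
    },
)
```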
@@ -83464,7 +83470,7 @@
*Thread Reply:* This is so awesome, @Paweł Leszczyński - Thank you so much for sharing this! I'm wondering if we could extend this to capture the hive JDBC Connection URL. I will explore this and put in an issue and PR to try and extend it. Thank you for the insights!
+*Thread Reply:* This is so awesome, @Paweł Leszczyński - Thank you so much for sharing this! I'm wondering if we could extend this to capture the hive JDBC Connection URL. I will explore this and put in an issue and PR to try and extend it. Thank you for the insights!
@@ -83591,7 +83597,7 @@*Thread Reply:* @Varun Singh why not just use the KafkaTransport and the Event Hub's Kafka endpoint?
+*Thread Reply:* @Varun Singh why not just use the KafkaTransport and the Event Hub's Kafka endpoint?
@@ -83747,7 +83753,7 @@*Thread Reply:* @Julien Le Dem is there a channel to discuss the community call / ask follow-up questions on the community call topics? For example, I wanted to ask more about the AirflowFacet and if we expected to introduce more tool-specific facets into the spec. Where's the right place to ask that question? On the PR?
+*Thread Reply:* @Julien Le Dem is there a channel to discuss the community call / ask follow-up questions on the community call topics? For example, I wanted to ask more about the AirflowFacet and if we expected to introduce more tool-specific facets into the spec. Where's the right place to ask that question? On the PR?
@@ -83773,7 +83779,7 @@*Thread Reply:* I think asking in #general is the right place. If there’s a specific github issue/PR, this is a good place as well. You can tag the relevant folks as well to get their attention
+*Thread Reply:* I think asking in #general is the right place. If there’s a specific github issue/PR, this is a good place as well. You can tag the relevant folks as well to get their attention
@@ -83825,7 +83831,7 @@*Thread Reply:* Hello @Allison Suarez, in case of InsertIntoHadoopFsRelationCommand
, Spark Openlineage implementation uses method:
+
*Thread Reply:* Hello @Allison Suarez, in case of InsertIntoHadoopFsRelationCommand
, Spark Openlineage implementation uses method:
DatasetIdentifier di = PathUtils.fromURI(command.outputPath().toUri(), "file");
(https://github.com/OpenLineage/OpenLineage/blob/0.19.2/integration/spark/shared/sr[…]ark/agent/lifecycle/plan/InsertIntoHadoopFsRelationVisitor.java)
*Thread Reply:* @Allison Suarez it would also be good to know what compute engine you're using to run your code on? On-Prem Apache Spark? Azure/AWS/GCP Databricks?
+*Thread Reply:* @Allison Suarez it would also be good to know what compute engine you're using to run your code on? On-Prem Apache Spark? Azure/AWS/GCP Databricks?
@@ -83911,7 +83917,7 @@*Thread Reply:* I created a custom visitor and fixed the issue that way, thank you!
+*Thread Reply:* I created a custom visitor and fixed the issue that way, thank you!
@@ -83993,7 +83999,7 @@*Thread Reply:* seems like a bug to me, but tagging @Tomasz Nazarewicz / @Paweł Leszczyński
+*Thread Reply:* seems like a bug to me, but tagging @Tomasz Nazarewicz / @Paweł Leszczyński
@@ -84019,7 +84025,7 @@*Thread Reply:* So we needed a generic way of passing parameters to the client and made an assumption that every field with ; will be treated as an array
+*Thread Reply:* So we needed a generic way of passing parameters to the client and made an assumption that every field with ; will be treated as an array
*Thread Reply:* Thanks for the confirmation, should I add a condition to split only if it's a key that can have array values? We can have a list of such keys like facets.disabled
+*Thread Reply:* Thanks for the confirmation, should I add a condition to split only if it's a key that can have array values? We can have a list of such keys like facets.disabled
*Thread Reply:* We thought about this solution but it forces us to know the structure of each config and we wanted to avoid that as much as possible
+*Thread Reply:* We thought about this solution but it forces us to know the structure of each config and we wanted to avoid that as much as possible
@@ -84097,7 +84103,7 @@*Thread Reply:* Maybe the condition could be having ; and [] in the value
+*Thread Reply:* Maybe the condition could be having ; and [] in the value
*Thread Reply:* Makes sense, I can add this check. Thanks @Tomasz Nazarewicz!
+*Thread Reply:* Makes sense, I can add this check. Thanks @Tomasz Nazarewicz!
@@ -84153,7 +84159,7 @@*Thread Reply:* Created issue https://github.com/OpenLineage/OpenLineage/issues/1506 for this
+*Thread Reply:* Created issue https://github.com/OpenLineage/OpenLineage/issues/1506 for this
*Thread Reply:* Thank you, Ross!! I appreciate it. I might have coordinated it, but it’s been a team effort. Lots of folks shared knowledge and time to help us check all the boxes, literally and figuratively (lots of boxes). ;)
+*Thread Reply:* Thank you, Ross!! I appreciate it. I might have coordinated it, but it’s been a team effort. Lots of folks shared knowledge and time to help us check all the boxes, literally and figuratively (lots of boxes). ;)
@@ -84440,7 +84446,7 @@*Thread Reply:* What are the errors?
+*Thread Reply:* What are the errors?
@@ -84466,7 +84472,7 @@*Thread Reply:* 'dbt-ol' is not recognized as an internal or external command, +
*Thread Reply:* 'dbt-ol' is not recognized as an internal or external command, operable program or batch file.
@@ -84493,7 +84499,7 @@*Thread Reply:* Hm, I think this is due to different windows conventions around scripts.
+*Thread Reply:* Hm, I think this is due to different windows conventions around scripts.
*Thread Reply:* I have not tried it on Windows before myself, but on mac/linux if you make a Python virtual environment in venv/ and run pip install openlineage-dbt, the script winds up in ./venv/bin/dbt-ol.
+*Thread Reply:* I have not tried it on Windows before myself, but on mac/linux if you make a Python virtual environment in venv/ and run pip install openlineage-dbt, the script winds up in ./venv/bin/dbt-ol.
*Thread Reply:* (maybe that helps!)
+*Thread Reply:* (maybe that helps!)
@@ -84571,7 +84577,7 @@*Thread Reply:* This might not work, but I think I have an idea that would allow it to run as python -m dbt-ol run ...
+*Thread Reply:* This might not work, but I think I have an idea that would allow it to run as python -m dbt-ol run ...
*Thread Reply:* That needs one fix though
+*Thread Reply:* That needs one fix though
@@ -84623,7 +84629,7 @@*Thread Reply:* Hi @Maciej Obuchowski, thanks for the input, when I try to use python -m dbt-ol run, I see the below error :( +
*Thread Reply:* Hi @Maciej Obuchowski, thanks for the input, when I try to use python -m dbt-ol run, I see the below error :( \python.exe: No module named dbt-ol
@@ -84650,7 +84656,7 @@*Thread Reply:* We’re seeing a similar issue with the Great Expectations integration at the moment. This is purely a guess, but what happens when you try with openlineage-dbt 0.18.0?
+*Thread Reply:* We’re seeing a similar issue with the Great Expectations integration at the moment. This is purely a guess, but what happens when you try with openlineage-dbt 0.18.0?
@@ -84676,7 +84682,7 @@*Thread Reply:* @Michael Robinson GE issue is on Windows?
+*Thread Reply:* @Michael Robinson GE issue is on Windows?
@@ -84702,7 +84708,7 @@*Thread Reply:* No, not Windows
+*Thread Reply:* No, not Windows
@@ -84728,7 +84734,7 @@*Thread Reply:* (that I know of)
+*Thread Reply:* (that I know of)
@@ -84754,7 +84760,7 @@*Thread Reply:* @Michael Robinson - I see the same error. I used 2 Combinations
+*Thread Reply:* @Michael Robinson - I see the same error. I used 2 Combinations
*Thread Reply:* Hm. You should be able to find the dbt-ol command wherever pip is installing the packages. In my case, that's usually in a virtual environment.
+*Thread Reply:* Hm. You should be able to find the dbt-ol command wherever pip is installing the packages. In my case, that's usually in a virtual environment.
But if I am not in a virtual environment, it installs the packages in my PYTHONPATH. You might try this to see if the dbt-ol script can be found in one of the directories in sys.path.
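A quick diagnostic sketch for that check - where pip puts console scripts and whether dbt-ol is on PATH (pure standard library):
```
import shutil
import sys
import sysconfig

print("dbt-ol on PATH:", shutil.which("dbt-ol"))      # None means PATH misses the scripts dir
print("scripts dir:", sysconfig.get_path("scripts"))  # where pip installs console scripts
for p in sys.path:
    print("sys.path entry:", p)
```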
*Thread Reply:* this can help you verify that your PYTHONPATH and PATH are correct - installing an unrelated python command-line tool and seeing if you can execute it:
+*Thread Reply:* this can help you verify that your PYTHONPATH and PATH are correct - installing an unrelated python command-line tool and seeing if you can execute it:
*Thread Reply:* Again, I think this is windows issue
+*Thread Reply:* Again, I think this is windows issue
*Thread Reply:* @Maciej Obuchowski you think even if dbt-ol could be found in the path, that might not be the issue?
+*Thread Reply:* @Maciej Obuchowski you think even if dbt-ol could be found in the path, that might not be the issue?
*Thread Reply:* Hi @Ross Turk - I could not find the dbt-ol in the site-packages.
+*Thread Reply:* Hi @Ross Turk - I could not find the dbt-ol in the site-packages.
@@ -84933,7 +84939,7 @@*Thread Reply:* Hm 😕 then perhaps @Maciej Obuchowski is right and there is a bigger issue here
+*Thread Reply:* Hm 😕 then perhaps @Maciej Obuchowski is right and there is a bigger issue here
@@ -84959,7 +84965,7 @@*Thread Reply:* @Ross Turk & @Maciej Obuchowski I see the issue even when I do the install using the https://pypi.org/project/openlineage-dbt/#files - openlineage-dbt-0.19.2.tar.gz.
+*Thread Reply:* @Ross Turk & @Maciej Obuchowski I see the issue even when I do the install using the https://pypi.org/project/openlineage-dbt/#files - openlineage-dbt-0.19.2.tar.gz.
For some reason, I see only the following folder created
@@ -85056,7 +85062,7 @@*Thread Reply:* This is excellent! May we promote it on openlineage and marquez social channels?
+*Thread Reply:* This is excellent! May we promote it on openlineage and marquez social channels?
@@ -85082,7 +85088,7 @@*Thread Reply:* This is an amazing write up! 🔥 💯 🚀
+*Thread Reply:* This is an amazing write up! 🔥 💯 🚀
@@ -85108,7 +85114,7 @@*Thread Reply:* Happy to have it promoted. 😄 +
*Thread Reply:* Happy to have it promoted. 😄 Vish posted on LinkedIn: https://www.linkedin.com/posts/vishwanatha-nayak-b8462054_automate-data-lineage-on-amazon-mwaa-with-activity-7021589819763945473-yMHF?utm_source=share&utm_medium=member_ios if you want something to repost there.
*Thread Reply:* Hello @Anirudh Shrinivason, you need to build your openlineage-java package first. Possibly you built it some time ago in a different version
+*Thread Reply:* Hello @Anirudh Shrinivason, you need to build your openlineage-java package first. Possibly you built it some time ago in a different version
@@ -85226,7 +85232,7 @@*Thread Reply:* ./gradlew clean build publishToMavenLocal
+
*Thread Reply:* ./gradlew clean build publishToMavenLocal in /client/java should help.
*Thread Reply:* Ahh yeap this works thanks! 🙂
+*Thread Reply:* Ahh yeap this works thanks! 🙂
@@ -85305,7 +85311,7 @@*Thread Reply:* It's been a while since I looked at Atlas, but does it even now support anything other than very Java Apache-adjacent projects like Hive and HBase?
+*Thread Reply:* It's been a while since I looked at Atlas, but does it even now support anything other than very Java Apache-adjacent projects like Hive and HBase?
@@ -85331,7 +85337,7 @@*Thread Reply:* To directly answer your question @Sheeri Cabral (Collibra): I am not aware of any resources currently that explain this 😞 but I would welcome the creation of one & pitch in where possible!
+*Thread Reply:* To directly answer your question @Sheeri Cabral (Collibra): I am not aware of any resources currently that explain this 😞 but I would welcome the creation of one & pitch in where possible!
@@ -85361,7 +85367,7 @@*Thread Reply:* I don’t know enough about Atlas to make that doc.
+*Thread Reply:* I don’t know enough about Atlas to make that doc.
@@ -85421,7 +85427,7 @@*Thread Reply:* I think most of your questions will be answered by this video: https://www.youtube.com/watch?v=LRr-ja8_Wjs
+*Thread Reply:* I think most of your questions will be answered by this video: https://www.youtube.com/watch?v=LRr-ja8_Wjs
*Thread Reply:* I agree - a lot of the answers are in that overview video. You might also take a look at the docs, they do a pretty good job of explaining how it works.
+*Thread Reply:* I agree - a lot of the answers are in that overview video. You might also take a look at the docs, they do a pretty good job of explaining how it works.
@@ -85507,7 +85513,7 @@*Thread Reply:* More explicitly: +
*Thread Reply:* More explicitly:
• Airflow is an interesting platform to observe because it runs a large variety of workloads and lineage can only be automatically extracted for some of them
• In general, OpenLineage is essentially a standard and data model for lineage. There are integrations for various systems, including Airflow, that cause them to emit lineage events to an OpenLineage compatible backend. It's a push model.
• Marquez is one such backend, and the one I recommend for testing & development
@@ -85545,7 +85551,7 @@
*Thread Reply:* FWIW I did a workshop on openlineage and airflow a while back, and it's all in this repo. You can find slides + a quick Python example + a simple Airflow example in there.
+*Thread Reply:* FWIW I did a workshop on openlineage and airflow a while back, and it's all in this repo. You can find slides + a quick Python example + a simple Airflow example in there.
@@ -85571,7 +85577,7 @@*Thread Reply:* Thanks a lot!! Very helpful!
+*Thread Reply:* Thanks a lot!! Very helpful!
@@ -85597,7 +85603,7 @@*Thread Reply:* 👍
+*Thread Reply:* 👍
@@ -85649,7 +85655,7 @@*Thread Reply:* Hello Brad, off the top of my head:
+*Thread Reply:* Hello Brad, off the top of my head:
@@ -85675,7 +85681,7 @@*Thread Reply:* • Marquez uses the API HTTP Post. So does Astro. +
*Thread Reply:*
• Marquez uses the API HTTP Post. So does Astro.
• Egeria and Purview prefer consuming through a Kafka topic. There is a ProxyBackend that takes HTTP Posts and writes to Kafka. The client can also be configured to write to Kafka.
@@ -85706,7 +85712,7 @@*Thread Reply:* @Will Johnson @Mandy Chessell might have opinions
+*Thread Reply:* @Will Johnson @Mandy Chessell might have opinions
@@ -85732,7 +85738,7 @@*Thread Reply:* The Microsoft Purview approach is documented here: https://learn.microsoft.com/en-us/samples/microsoft/purview-adb-lineage-solution-accelerator/azure-databricks-to-purview-lineage-connector/
+*Thread Reply:* The Microsoft Purview approach is documented here: https://learn.microsoft.com/en-us/samples/microsoft/purview-adb-lineage-solution-accelerator/azure-databricks-to-purview-lineage-connector/
*Thread Reply:* There’s a blog post about Egeria here: https://openlineage.io/blog/openlineage-egeria/
+*Thread Reply:* There’s a blog post about Egeria here: https://openlineage.io/blog/openlineage-egeria/
*Thread Reply:* @Brad Paskewitz at Microsoft, the solution that Julien linked above, we are using the HTTP Transport (REST API) as we are consuming the OpenLineage Events and transforming them to Apache Atlas / Microsoft Purview.
+*Thread Reply:* @Brad Paskewitz at Microsoft, the solution that Julien linked above, we are using the HTTP Transport (REST API) as we are consuming the OpenLineage Events and transforming them to Apache Atlas / Microsoft Purview.
However, there is a good deal of interest in using the Kafka transport instead, and that's on our future roadmap.
@@ -85914,7 +85920,7 @@*Thread Reply:* Hmm, if Databricks is shutting the process down without waiting for the ListenerBus to clear, I don’t know that there’s a lot we can do. The best thing is to somehow delay the main application thread from exiting. One thing you could try is to subclass the OpenLineageSparkListener and generate a lock for each SparkListenerSQLExecutionStart and release it when the accompanying SparkListenerSQLExecutionEnd event is processed. Then, in the main application, block until all such locks are released. If you try it and it works, let us know!
*Thread Reply:* Hmm, if Databricks is shutting the process down without waiting for the ListenerBus to clear, I don’t know that there’s a lot we can do. The best thing is to somehow delay the main application thread from exiting. One thing you could try is to subclass the OpenLineageSparkListener and generate a lock for each SparkListenerSQLExecutionStart and release it when the accompanying SparkListenerSQLExecutionEnd event is processed. Then, in the main application, block until all such locks are released. If you try it and it works, let us know!
*Thread Reply:* Ok thanks for the idea! I'll tell you if I try this and if it works 🤞
+*Thread Reply:* Ok thanks for the idea! I'll tell you if I try this and if it works 🤞
@@ -85992,7 +85998,7 @@*Thread Reply:* Hey Petr. Please DM me or describe the issue here 🙂
+*Thread Reply:* Hey Petr. Please DM me or describe the issue here 🙂
@@ -86044,7 +86050,7 @@*Thread Reply:* Command +
*Thread Reply:* Command
spark-submit --packages "io.openlineage:openlineage-spark:0.19.+" \
--conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener" \
--conf "spark.openlineage.transport.type=kafka" \
@@ -86076,7 +86082,7 @@
MARQUEZ_API_KEY=[YOUR_API_KEY]
*Thread Reply:* 23/01/27 17:29:06 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception
+
23/01/27 17:29:06 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception
java.lang.NullPointerException
at io.openlineage.client.transports.TransportFactory.build(TransportFactory.java:44)
at io.openlineage.spark.agent.EventEmitter.<init>(EventEmitter.java:40)
@@ -86121,7 +86127,7 @@
MARQUEZ_API_KEY=[YOUR_API_KEY]
*Thread Reply:* I would appreciate any pointers on getting started with using openlineage-spark with Kafka.
+*Thread Reply:* I would appreciate any pointers on getting started with using openlineage-spark with Kafka.
@@ -86147,7 +86153,7 @@*Thread Reply:* Also this might seem a little elementary but the kafka topic itself, should it be hosted on the spark cluster or could it be any kafka topic?
+*Thread Reply:* Also this might seem a little elementary but the kafka topic itself, should it be hosted on the spark cluster or could it be any kafka topic?
@@ -86173,7 +86179,7 @@*Thread Reply:* 👀 Could I get some help on this, please?
+*Thread Reply:* 👀 Could I get some help on this, please?
@@ -86199,7 +86205,7 @@*Thread Reply:* I think any NullPointerException is clearly our bug, can you open an issue on OL GitHub?
*Thread Reply:* I think any NullPointerException is clearly our bug, can you open an issue on OL GitHub?
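For anyone retrying the Kafka transport later: it usually needs more configuration than just the type. A sketch of the additional spark-submit flags, where the exact property names (topicName, the properties. prefix) are assumptions based on later openlineage-spark docs rather than anything confirmed in this thread:

--conf "spark.openlineage.transport.type=kafka" \
--conf "spark.openlineage.transport.topicName=openlineage.events" \
--conf "spark.openlineage.transport.properties.bootstrap.servers=broker:9092" \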
*Thread Reply:* @Maciej Obuchowski Another interesting thing is if I use 0.19.2 version specifically, I get +
*Thread Reply:* @Maciej Obuchowski Another interesting thing is if I use 0.19.2 version specifically, I get
23/01/30 14:28:33 INFO RddExecutionContext: RDDs are empty: skipping sending OpenLineage event
I am trying to print to console at the moment. I haven't been able to get Kafka transport type working though.
@@ -86258,7 +86264,7 @@*Thread Reply:* Are you getting events printed on the console though? This log should not affect you if you're running, for example Spark SQL jobs
+*Thread Reply:* Are you getting events printed on the console though? This log should not affect you if you're running, for example Spark SQL jobs
@@ -86284,7 +86290,7 @@*Thread Reply:* I am trying to run a python file using pyspark. +
*Thread Reply:* I am trying to run a python file using pyspark.
23/01/30 14:40:49 INFO RddExecutionContext: RDDs are empty: skipping sending OpenLineage event
I see this and don't see any events on the console.
*Thread Reply:* Any logs matching the pattern +
*Thread Reply:* Any logs matching the pattern
log.warn("Unable to access job conf from RDD", nfe);
or
log.info("Found job conf from RDD {}", jc);
@@ -86341,7 +86347,7 @@
*Thread Reply:* 23/01/30 14:40:48 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[2] at reduceByKey at /tmp/spark-20487725-f49b-4587-986d-e63a61890673/status_api_demo.py:47), which has no missing parents +
*Thread Reply:* 23/01/30 14:40:48 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[2] at reduceByKey at /tmp/spark-20487725-f49b-4587-986d-e63a61890673/status_api_demo.py:47), which has no missing parents
23/01/30 14:40:49 WARN RddExecutionContext: Unable to access job conf from RDD
java.lang.NoSuchFieldException: Field is not instance of HadoopMapRedWriteConfigUtil
	at io.openlineage.spark.agent.lifecycle.RddExecutionContext.lambda$setActiveJob$0(RddExecutionContext.java:117)
@@ -86396,7 +86402,7 @@
*Thread Reply:* I think this is same problem as this: https://github.com/OpenLineage/OpenLineage/issues/1521
+*Thread Reply:* I think this is same problem as this: https://github.com/OpenLineage/OpenLineage/issues/1521
@@ -86422,7 +86428,7 @@*Thread Reply:* and I think I might have a solution on a branch for it, just need to polish it up to release
+*Thread Reply:* and I think I might have a solution on a branch for it, just need to polish it up to release
@@ -86448,7 +86454,7 @@*Thread Reply:* Aah got it. I will give it a try with SQL and a jar.
+*Thread Reply:* Aah got it. I will give it a try with SQL and a jar.
Do you have an ETA on when the python issue will be fixed?
@@ -86476,7 +86482,7 @@*Thread Reply:* @Maciej Obuchowski Well I run into the same errors if I run spark-submit on a jar.
+*Thread Reply:* @Maciej Obuchowski Well I run into the same errors if I run spark-submit on a jar.
@@ -86502,7 +86508,7 @@*Thread Reply:* I think that has nothing to do with python
+*Thread Reply:* I think that has nothing to do with python
@@ -86528,7 +86534,7 @@*Thread Reply:* BTW, which Spark version are you using?
+*Thread Reply:* BTW, which Spark version are you using?
@@ -86554,7 +86560,7 @@*Thread Reply:* We are on 3.3.1
+*Thread Reply:* We are on 3.3.1
@@ -86584,7 +86590,7 @@*Thread Reply:* @Maciej Obuchowski Do you have an estimated release date for the fix? Our team is specifically interested in using the Emitter to write out to Kafka.
+*Thread Reply:* @Maciej Obuchowski Do you have an estimated release date for the fix? Our team is specifically interested in using the Emitter to write out to Kafka.
@@ -86610,7 +86616,7 @@*Thread Reply:* I think we plan to release somewhere in the next week
+*Thread Reply:* I think we plan to release somewhere in the next week
@@ -86640,7 +86646,7 @@*Thread Reply:* @Susmitha Anandarao PR fixing this has been merged, release should be today
+*Thread Reply:* @Susmitha Anandarao PR fixing this has been merged, release should be today
@@ -86699,7 +86705,7 @@*Thread Reply:* hmmm, I am not sure. perhaps @Benji Lampel can help, he’s very familiar with those operators.
+*Thread Reply:* hmmm, I am not sure. perhaps @Benji Lampel can help, he’s very familiar with those operators.
@@ -86729,7 +86735,7 @@*Thread Reply:* @Benji Lampel any help would be appreciated!
+*Thread Reply:* @Benji Lampel any help would be appreciated!
@@ -86755,7 +86761,7 @@*Thread Reply:* Hey Paul, the SQLCheckExtractors were written with the intent that they would be used by a provider that inherits from them - they are all treated as a sort of base class. What is the exact error message you're getting? And what is the operator code? +
*Thread Reply:* Hey Paul, the SQLCheckExtractors were written with the intent that they would be used by a provider that inherits from them - they are all treated as a sort of base class. What is the exact error message you're getting? And what is the operator code?
Could you try this with a PostgresCheckOperator?
(Also, only the SqlColumnCheckOperator and SqlTableCheckOperator will provide data quality facets in their output; those functions are not implementable in the other operators at this time)
*Thread Reply:* @Benji Lampel here is the error message. i am not sure what the operator code is.
+*Thread Reply:* @Benji Lampel here is the error message. i am not sure what the operator code is.
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - Traceback (most recent call last):
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
@@ -86832,7 +86838,7 @@
MARQUEZ_API_KEY=[YOUR_API_KEY]
*Thread Reply:* and above that
+*Thread Reply:* and above that
[2023-01-31, 00:32:38 UTC] {connection.py:424} ERROR - Unable to retrieve connection from secrets backend (EnvironmentVariablesBackend). Checking subsequent secrets backend.
Traceback (most recent call last):
@@ -86867,7 +86873,7 @@
MARQUEZ_API_KEY=[YOUR_API_KEY]
*Thread Reply:* sorry, i should mention we're wrapping over the CheckOperator as we're still migrating from 1.10.15 @Benji Lampel
*Thread Reply:* sorry, i should mention we're wrapping over the CheckOperator as we're still migrating from 1.10.15 @Benji Lampel
*Thread Reply:* What do you mean by wrapping the CheckOperator? Like how so, exactly? Can you show me the operator code you're using in the DAG?
*Thread Reply:* What do you mean by wrapping the CheckOperator? Like how so, exactly? Can you show me the operator code you're using in the DAG?
*Thread Reply:* like so
+*Thread Reply:* like so
class CustomSQLCheckOperator(CheckOperator):
....
*Thread Reply:* i think i found the issue though: we have our own get_hook function, so we don't follow the traditional Airflow way of setting CONN_ID, which is why CONN_ID is always None - and that path only gets called through OpenLineage, which doesn't ever get called with our custom wrapper
*Thread Reply:* i think i found the issue though: we have our own get_hook function, so we don't follow the traditional Airflow way of setting CONN_ID, which is why CONN_ID is always None - and that path only gets called through OpenLineage, which doesn't ever get called with our custom wrapper
*Thread Reply:* The dbt integration uses the python client; you should be able to do something similar to the java client. See here: https://github.com/OpenLineage/OpenLineage/tree/main/client/python#kafka
+*Thread Reply:* The dbt integration uses the python client; you should be able to do something similar to the java client. See here: https://github.com/OpenLineage/OpenLineage/tree/main/client/python#kafka
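A minimal sketch of the same idea in code, assuming the class and field names exposed by the Python client's Kafka transport module (and that the optional Kafka dependency is installed):

from openlineage.client import OpenLineageClient
from openlineage.client.transport.kafka import KafkaConfig, KafkaTransport

# Assumed fields: a librdkafka-style config dict, the target topic, and
# whether to flush after each emit.
config = KafkaConfig(
    config={"bootstrap.servers": "broker:9092"},
    topic="openlineage.events",
    flush=True,
)
client = OpenLineageClient(transport=KafkaTransport(config))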
@@ -87056,7 +87062,7 @@*Thread Reply:* Thank you for this!
+*Thread Reply:* Thank you for this!
I created an openlineage.yml file with the following data to test out the integration locally.
transport:
@@ -87130,7 +87136,7 @@
MARQUEZ_API_KEY=[YOUR_API_KEY]
*Thread Reply:* @Susmitha Anandarao It's not installed because it's a large binary package. We don't want to install for every user something giant that the majority won't use, and it's 100x bigger than the rest of the client.
+*Thread Reply:* @Susmitha Anandarao It's not installed because it's a large binary package. We don't want to install for every user something giant that the majority won't use, and it's 100x bigger than the rest of the client.
We need to indicate this way better, and not throw this error directly at the user though, both in docs and code.
@@ -87349,7 +87355,7 @@*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/999#issuecomment-1209048556
+*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/999#issuecomment-1209048556
Does this fit your experience?
@@ -87377,7 +87383,7 @@*Thread Reply:* We sometimes experience this in context of very small, quick jobs
+*Thread Reply:* We sometimes experience this in context of very small, quick jobs
@@ -87403,7 +87409,7 @@*Thread Reply:* Yes, my scenarios are dealing with quick jobs. +
*Thread Reply:* Yes, my scenarios are dealing with quick jobs. Good to know that we will be able to solve it with future spark versions. Thanks!
@@ -87509,7 +87515,7 @@*Thread Reply:* exciting to see the client proxy work being released by @Minkyu Park 💯
+*Thread Reply:* exciting to see the client proxy work being released by @Minkyu Park 💯
@@ -87535,7 +87541,7 @@*Thread Reply:* This was without a doubt among the fastest release votes we’ve ever had 😉 . Thank you! You can expect the release to happen on Monday.
+*Thread Reply:* This was without a doubt among the fastest release votes we’ve ever had 😉 . Thank you! You can expect the release to happen on Monday.
@@ -87561,7 +87567,7 @@*Thread Reply:* Lol the proxy is still in development and not ready for use
+*Thread Reply:* Lol the proxy is still in development and not ready for use
@@ -87587,7 +87593,7 @@*Thread Reply:* Good point! Let’s make that clear in the release / docs?
+*Thread Reply:* Good point! Let’s make that clear in the release / docs?
@@ -87617,7 +87623,7 @@*Thread Reply:* But it doesn’t block anything anyway, so happy to see the release
+*Thread Reply:* But it doesn’t block anything anyway, so happy to see the release
@@ -87643,7 +87649,7 @@*Thread Reply:* We can celebrate that the proposal for the proxy is merged. I’m happy with that 🥳
+*Thread Reply:* We can celebrate that the proposal for the proxy is merged. I’m happy with that 🥳
@@ -87699,7 +87705,7 @@*Thread Reply:* Hey @Daniel Joanes! thanks for the question.
+*Thread Reply:* Hey @Daniel Joanes! thanks for the question.
I am not aware of an issue that captures this. Column-level lineage is a somewhat new facet in the spec, and implementations across the various integrations are in varying states of readiness.
@@ -87729,7 +87735,7 @@*Thread Reply:* Go for it, once it's created i'll add a watch
+*Thread Reply:* Go for it, once it's created i'll add a watch
@@ -87759,7 +87765,7 @@*Thread Reply:* Thanks Ross!
+*Thread Reply:* Thanks Ross!
@@ -87785,7 +87791,7 @@*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1581
+*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1581
*Thread Reply:* Thanks for requesting a release. 3 +1s from committers will authorize an immediate release.
+*Thread Reply:* Thanks for requesting a release. 3 +1s from committers will authorize an immediate release.
@@ -87971,7 +87977,7 @@*Thread Reply:* 0.20.5?
*Thread Reply:* 0.20.5?
*Thread Reply:* @Michael Robinson auth'd
+*Thread Reply:* @Michael Robinson auth'd
@@ -88027,7 +88033,7 @@*Thread Reply:* 👍 the release is authorized
+*Thread Reply:* 👍 the release is authorized
@@ -88126,14 +88132,14 @@*Thread Reply:* I am also exploring something similar, but writing to kafka, and would want to know more on how we could add custom metadata from spark.
+*Thread Reply:* I am also exploring something similar, but writing to kafka, and would want to know more on how we could add custom metadata from spark.
@@ -88415,7 +88421,7 @@*Thread Reply:* Hi @Avinash Pancham @Susmitha Anandarao, it's great to hear about successful experimenting on your side.
+*Thread Reply:* Hi @Avinash Pancham @Susmitha Anandarao, it's great to hear about successful experimenting on your side.
Although the OpenLineage spec provides some built-in facet definitions, a facet object can be anything you want (https://openlineage.io/apidocs/openapi/#tag/OpenLineage/operation/postRunEvent). The example metadata provided in this chat could be put into job or run facets, I believe.
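To make that concrete, a sketch of a custom run facet in Python; the facet name and fields below are purely hypothetical, and the attr-based BaseFacet pattern is assumed from the Python client:

import attr
from openlineage.client.facet import BaseFacet

# Hypothetical facet carrying extra metadata; any JSON-serializable shape works.
@attr.s
class TeamMetadataRunFacet(BaseFacet):
    owningTeam: str = attr.ib()
    costCenter: str = attr.ib()

# Attach it under a key of your choosing in the run's facets dict:
run_facets = {"teamMetadata": TeamMetadataRunFacet(owningTeam="data-platform", costCenter="cc-123")}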
@@ -88445,7 +88451,7 @@*Thread Reply:* (I would love to see what your specs are! I’m not with Astronomer, just a community member, but I am finding that many of the customizations people are making to the spec are valuable ones that we should consider adding to core)
+*Thread Reply:* (I would love to see what your specs are! I’m not with Astronomer, just a community member, but I am finding that many of the customizations people are making to the spec are valuable ones that we should consider adding to core)
@@ -88471,7 +88477,7 @@*Thread Reply:* Are there any examples out there of customizations already done in Spark? An example would definitely help!
+*Thread Reply:* Are there any examples out there of customizations already done in Spark? An example would definitely help!
@@ -88497,7 +88503,7 @@*Thread Reply:* I think @Will Johnson might have something to add about customization
+*Thread Reply:* I think @Will Johnson might have something to add about customization
@@ -88523,7 +88529,7 @@*Thread Reply:* Oh man... Mike Collado did a nice write up on Slack of how many different ways there are to customize / extend OpenLineage. I know we all talked about doing a blog post at one point!
+*Thread Reply:* Oh man... Mike Collado did a nice write up on Slack of how many different ways there are to customize / extend OpenLineage. I know we all talked about doing a blog post at one point!
@Susmitha Anandarao - You might take a look at https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/[…]ark/agent/facets/builder/DatabricksEnvironmentFacetBuilder.java which has a hard coded set of properties we are extracting.
@@ -88797,7 +88803,7 @@*Thread Reply:* https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/6398/workflows/9d2d19c8-f2d9-4148-a4f3-5dad3ba99eb1/jobs/97759 +
*Thread Reply:* https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/6398/workflows/9d2d19c8-f2d9-4148-a4f3-5dad3ba99eb1/jobs/97759
ERROR: Missing environment variable {i}
*Thread Reply:* @Quentin Nambot this happens because we run additional integration tests against real databases (like BigQuery) which aren't ever configured on forks, since we don't want to expose our secrets. We need to figure out how to make this experience better, but in the meantime we've pushed your code using git-push-fork-to-upstream-branch and it passes all the tests.
*Thread Reply:* @Quentin Nambot this happens because we run additional integration tests against real databases (like BigQuery) which aren't ever configured on forks, since we don't want to expose our secrets. We need to figure out how to make this experience better, but in the meantime we've pushed your code using git-push-fork-to-upstream-branch and it passes all the tests.
*Thread Reply:* Feel free to un-draft your PR if you think it's ready for review
+*Thread Reply:* Feel free to un-draft your PR if you think it's ready for review
@@ -88876,7 +88882,7 @@*Thread Reply:* Ok nice thanks 👍
+*Thread Reply:* Ok nice thanks 👍
@@ -88902,7 +88908,7 @@*Thread Reply:* I think it's ready, however should I update the version somewhere?
+*Thread Reply:* I think it's ready, however should I update the version somewhere?
@@ -88928,7 +88934,7 @@*Thread Reply:* @Quentin Nambot I don't think so - it's just that you opened the PR as Draft, so I'm not sure if you want to add something else to it.
*Thread Reply:* @Quentin Nambot I don't think so - it's just that you opened the PR as Draft, so I'm not sure if you want to add something else to it.
*Thread Reply:* No I don't want to add anything so I opened it 👍
+*Thread Reply:* No I don't want to add anything so I opened it 👍
@@ -89035,7 +89041,7 @@*Thread Reply:* You can add your own EventHandlerFactory https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/api/OpenLineageEventHandlerFactory.java . See the docs about extending here - https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#extending
+*Thread Reply:* You can add your own EventHandlerFactory https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/api/OpenLineageEventHandlerFactory.java . See the docs about extending here - https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#extending
*Thread Reply:* I have that set up already like this: +
*Thread Reply:* I have that set up already like this:
public class LyftOpenLineageEventHandlerFactory implements OpenLineageEventHandlerFactory {
@Override
public Collection<PartialFunction<LogicalPlan, List<OutputDataset>>>
@@ -89122,7 +89128,7 @@
MARQUEZ_API_KEY=[YOUR_API_KEY]
*Thread Reply:* do I just add a constructor? the visitorFactory is private so I wasn't sure if that's something that was intended to change
+*Thread Reply:* do I just add a constructor? the visitorFactory is private so I wasn't sure if that's something that was intended to change
@@ -89148,7 +89154,7 @@*Thread Reply:* .
+*Thread Reply:* .
@@ -89174,7 +89180,7 @@*Thread Reply:* @Michael Collado
+*Thread Reply:* @Michael Collado
@@ -89200,7 +89206,7 @@*Thread Reply:* The VisitorFactory is only used by the internal EventHandlerFactory. It shouldn’t be needed for your custom one
+*Thread Reply:* The VisitorFactory is only used by the internal EventHandlerFactory. It shouldn’t be needed for your custom one
@@ -89226,7 +89232,7 @@*Thread Reply:* Have you added the file to the META-INF folder of your jar?
+*Thread Reply:* Have you added the file to the META-INF folder of your jar?
@@ -89252,7 +89258,7 @@*Thread Reply:* yes, I am able to use my custom event handler factory with a list of visitors, but for some reason I can't access the visitors for some commands - AlterTableAddPartitionCommand is one
+*Thread Reply:* yes, I am able to use my custom event handler factory with a list of visitors, but for some reason I can't access the visitors for some commands - AlterTableAddPartitionCommand is one
@@ -89278,7 +89284,7 @@*Thread Reply:* so even if I set up everything correctly I am unable to reach the code for that specific visitor
+*Thread Reply:* so even if I set up everything correctly I am unable to reach the code for that specific visitor
@@ -89304,7 +89310,7 @@*Thread Reply:* and my assumption is I can reach other commands but not this one because the command is not defined in the BaseVisitorFactory, but maybe im wrong @Michael Collado
*Thread Reply:* and my assumption is I can reach other commands but not this one because the command is not defined in the BaseVisitorFactory, but maybe im wrong @Michael Collado
*Thread Reply:* the VisitorFactory is loaded by the InternalEventHandlerFactory here. However, the createOutputDatasetQueryPlanVisitors should contain a union of everything defined by the VisitorFactory as well as your custom visitors: see this code.
*Thread Reply:* the VisitorFactory is loaded by the InternalEventHandlerFactory here. However, the createOutputDatasetQueryPlanVisitors should contain a union of everything defined by the VisitorFactory as well as your custom visitors: see this code.
*Thread Reply:* there might be a conflict with another visitor that’s being matched against that command. Can you turn on debug logging and look for this line to see what visitor is being applied to that command?
+*Thread Reply:* there might be a conflict with another visitor that’s being matched against that command. Can you turn on debug logging and look for this line to see what visitor is being applied to that command?
*Thread Reply:* This was helpful, it works now, thank you so much Michael!
+*Thread Reply:* This was helpful, it works now, thank you so much Michael!
@@ -89517,7 +89523,7 @@*Thread Reply:* what is the curl cmd you are running? and what endpoint are you hitting? (assuming Marquez?)
*Thread Reply:* what is the curl cmd you are running? and what endpoint are you hitting? (assuming Marquez?)
*Thread Reply:* yep +
*Thread Reply:* yep I am running
curl -X POST http://localhost:5000/api/v1/namespaces/test ^
-H 'Content-Type: application/json' ^
-d '{ownerName:"me", description:"no description"^
@@ -89578,7 +89584,7 @@
*Thread Reply:* Marquez logs all supported endpoints (and methods) on start up. For example, here are all the supported methods on /api/v1/namespaces/{namespace}: +
*Thread Reply:* Marquez logs all supported endpoints (and methods) on start up. For example, here are all the supported methods on /api/v1/namespaces/{namespace}:
marquez-api | DELETE /api/v1/namespaces/{namespace} (marquez.api.NamespaceResource)
marquez-api | GET /api/v1/namespaces/{namespace} (marquez.api.NamespaceResource)
marquez-api | PUT /api/v1/namespaces/{namespace} (marquez.api.NamespaceResource)
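Given that listing, creating a namespace is a PUT rather than a POST. A sketch of the equivalent call in Python (the namespace name and payload mirror the attempt above):

import requests

response = requests.put(
    "http://localhost:5000/api/v1/namespaces/test",
    json={"ownerName": "me", "description": "no description"},
)
print(response.status_code, response.json())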
@@ -89608,7 +89614,7 @@
*Thread Reply:* 3rd stupid question of the night +
*Thread Reply:* 3rd stupid question of the night
Sorry, kept on trying POST, who knows why
@@ -89635,7 +89641,7 @@*Thread Reply:* no worries! keep the questions coming!
+*Thread Reply:* no worries! keep the questions coming!
@@ -89661,7 +89667,7 @@*Thread Reply:* well, maybe because it’s so late on your end! get some rest!
+*Thread Reply:* well, maybe because it’s so late on your end! get some rest!
@@ -89687,7 +89693,7 @@*Thread Reply:* Yeah but I want to see how it works +
*Thread Reply:* Yeah but I want to see how it works. Right now I have a response 200 for the creation of the names ... but it seems that nothing occurred, neither on the marquez front end (localhost:3000) nor in the database
@@ -89716,7 +89722,7 @@*Thread Reply:* can you curl the list namespaces endpoint?
+*Thread Reply:* can you curl the list namespaces endpoint?
@@ -89742,7 +89748,7 @@*Thread Reply:* yep : nothing changed +
*Thread Reply:* yep: nothing changed, only default and food_delivery
@@ -89769,7 +89775,7 @@*Thread Reply:* can you post your server logs? you should see the request
+*Thread Reply:* can you post your server logs? you should see the request
@@ -89795,7 +89801,7 @@*Thread Reply:* marquez-api | XXX.23.0.4 - - [17/Feb/2023:00:30:38 +0000] "PUT /api/v1/namespaces/ciro HTTP/1.1" 500 110 "-" "-" 7 +
*Thread Reply:* marquez-api | XXX.23.0.4 - - [17/Feb/2023:00:30:38 +0000] "PUT /api/v1/namespaces/ciro HTTP/1.1" 500 110 "-" "-" 7 marquez-api | INFO [2023-02-17 00:32:07,072] marquez.logging.LoggingMdcFilter: status: 200
@@ -89822,7 +89828,7 @@*Thread Reply:* the server is returning a 500?
*Thread Reply:* the server is returning a 500?
*Thread Reply:* odd that LoggingMdcFilter is logging 200
*Thread Reply:* odd that LoggingMdcFilter is logging 200
*Thread Reply:* Bit confused because now I realize that postman is returning bad request
+*Thread Reply:* Bit confused because now I realize that postman is returning bad request
@@ -89900,7 +89906,7 @@*Thread Reply:*
+*Thread Reply:*
*Thread Reply:* You'll notice that I go to use 3000 in the url +
*Thread Reply:* You'll notice that I go to use 3000 in the url
If I use 5000 I get No host
@@ -89962,7 +89968,7 @@*Thread Reply:* odd, the API should be using port 5000, have you followed our quickstart for Marquez?
*Thread Reply:* odd, the API should be using port 5000, have you followed our quickstart for Marquez?
*Thread Reply:* Hello Willy +
*Thread Reply:* Hello Willy
I am starting from scratch following the instructions from https://openlineage.io/docs/getting-started/
I am on Windows
Instead of
@@ -90038,7 +90044,7 @@
*Thread Reply:* The issue is related to PostMan
+*Thread Reply:* The issue is related to PostMan
@@ -90090,7 +90096,7 @@*Thread Reply:* Hi @Anirudh Shrinivason, this is a great question. We included extra fields in the OpenLineage spec to contain that information: +
*Thread Reply:* Hi @Anirudh Shrinivason, this is a great question. We included extra fields in the OpenLineage spec to contain that information:
"transformationDescription": {
"type": "string",
"description": "a string representation of the transformation applied"
@@ -90125,7 +90131,7 @@
MARQUEZ_API_KEY=[YOUR_API_KEY]
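For reference, those fields sit on each entry of the column lineage dataset facet. A sketch of the payload shape per the spec, where the dataset, field, producer, and _schemaURL version values are placeholders:

column_lineage_facet = {
    "_producer": "https://example.com/my-producer",  # placeholder
    "_schemaURL": "https://openlineage.io/spec/facets/1-0-1/ColumnLineageDatasetFacet.json",  # version assumed
    "fields": {
        "out_col": {
            "inputFields": [
                {"namespace": "my-namespace", "name": "my-input-dataset", "field": "in_col"}
            ],
            "transformationDescription": "identity copy of in_col",
            "transformationType": "IDENTITY",
        }
    },
}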
*Thread Reply:* Thanks a lot! That is great. Is there a potential plan in the roadmap to support this for spark?
+*Thread Reply:* Thanks a lot! That is great. Is there a potential plan in the roadmap to support this for spark?
@@ -90151,7 +90157,7 @@*Thread Reply:* I think there will be a growing interest in that. In general, a dependency may be really difficult to express if many Spark operators are used on input columns to produce an output one. The simple version would be just to detect identity operations or some kind of hashing.
+*Thread Reply:* I think there will be a growing interest in that. In general, a dependency may be really difficult to express if many Spark operators are used on input columns to produce an output one. The simple version would be just to detect identity operations or some kind of hashing.
To sum up, we don't have a proposal on that yet, but this seems to be a natural next step in enriching column lineage features.
@@ -90179,7 +90185,7 @@*Thread Reply:* Got it. Thanks! If this item potentially comes on the roadmap, then I'd be happy to work with other interested developers to help contribute! 🙂
+*Thread Reply:* Got it. Thanks! If this item potentially comes on the roadmap, then I'd be happy to work with other interested developers to help contribute! 🙂
@@ -90205,7 +90211,7 @@*Thread Reply:* Great to hear that. What you could perhaps start with is to come to our monthly OpenLineage meetings and ask @Michael Robinson to put this item on the discussions' list. There are many strategies to address this issue, and hearing your story, usage scenario, and what you are trying to achieve would be super helpful in the design and implementation phase.
+*Thread Reply:* Great to hear that. What you could perhaps start with is to come to our monthly OpenLineage meetings and ask @Michael Robinson to put this item on the discussions' list. There are many strategies to address this issue, and hearing your story, usage scenario, and what you are trying to achieve would be super helpful in the design and implementation phase.
@@ -90231,7 +90237,7 @@*Thread Reply:* Got it! The monthly meeting might be a bit hard for me to attend live, because of the time zone. But I'll try my best to make it to the next one! thanks!
+*Thread Reply:* Got it! The monthly meeting might be a bit hard for me to attend live, because of the time zone. But I'll try my best to make it to the next one! thanks!
@@ -90257,7 +90263,7 @@*Thread Reply:* Thank you for bringing this up, @Anirudh Shrinivason. I’ll add it to the agenda of our next meeting because there might be interest from others in adding this to the roadmap.
+*Thread Reply:* Thank you for bringing this up, @Anirudh Shrinivason. I’ll add it to the agenda of our next meeting because there might be interest from others in adding this to the roadmap.
@@ -90319,7 +90325,7 @@*Thread Reply:* Hi @thebruuu, pls take a look at the logging documentation of Dropwizard (https://www.dropwizard.io/en/latest/manual/core.html#logging) - the framework Marquez is implemented in. The logging configuration section is present in marquez.yml.
*Thread Reply:* Hi @thebruuu, pls take a look at the logging documentation of Dropwizard (https://www.dropwizard.io/en/latest/manual/core.html#logging) - the framework Marquez is implemented in. The logging configuration section is present in marquez.yml.
*Thread Reply:* Thank You Pavel
+*Thread Reply:* Thank You Pavel
@@ -90397,7 +90403,7 @@*Thread Reply:* We generally schedule a release every month, next one will be in the next week - is that okay @Anirudh Shrinivason?
+*Thread Reply:* We generally schedule a release every month, next one will be in the next week - is that okay @Anirudh Shrinivason?
@@ -90423,7 +90429,7 @@*Thread Reply:* Yes, there’s one scheduled for next Wednesday, if that suits.
+*Thread Reply:* Yes, there’s one scheduled for next Wednesday, if that suits.
@@ -90449,7 +90455,7 @@*Thread Reply:* Okay yeah sure that works. Thanks
+*Thread Reply:* Okay yeah sure that works. Thanks
@@ -90479,7 +90485,7 @@*Thread Reply:* @Anirudh Shrinivason we’re expecting the release to happen today or tomorrow, FYI
+*Thread Reply:* @Anirudh Shrinivason we’re expecting the release to happen today or tomorrow, FYI
@@ -90505,7 +90511,7 @@*Thread Reply:* Awesome thanks
+*Thread Reply:* Awesome thanks
@@ -90569,7 +90575,7 @@*Thread Reply:* This is my checkpoint configuration in GE. +
*Thread Reply:* This is my checkpoint configuration in GE.
name: 'openlineage_checkpoint'
config_version: 1.0
template_name:
@@ -90632,7 +90638,7 @@ MARQUEZ_API_KEY=[YOUR_API_KEY]
*Thread Reply:* What version of GX are you running? And is this being run directly through GX or through Airflow with the operator?
+*Thread Reply:* What version of GX are you running? And is this being run directly through GX or through Airflow with the operator?
@@ -90658,7 +90664,7 @@*Thread Reply:* I use the latest version of Great Expectations. This error occurs either directly through Great Expectations or airflow
+*Thread Reply:* I use the latest version of Great Expectations. This error occurs either directly through Great Expectations or airflow
@@ -90684,7 +90690,7 @@*Thread Reply:* I noticed another issue in the latest version as well. Try dropping to GE version great-expectations==0.15.44 for now. That is the latest one that works for me.
*Thread Reply:* I noticed another issue in the latest version as well. Try dropping to GE version great-expectations==0.15.44 for now. That is the latest one that works for me.
*Thread Reply:* You should definitely open an issue here, and you can tag me @denimalpaca in the comment
+*Thread Reply:* You should definitely open an issue here, and you can tag me @denimalpaca in the comment
@@ -90736,7 +90742,7 @@*Thread Reply:* Thanks Benji, but I still have the same problem after I drop to great-expectations==0.15.44, this is my requirement file
*Thread Reply:* Thanks Benji, but I still have the same problem after I drop to great-expectations==0.15.44, this is my requirement file
great_expectations==0.15.44
sqlalchemy
@@ -90771,7 +90777,7 @@ MARQUEZ_API_KEY=[YOUR_API_KEY]
*Thread Reply:* interesting... I do think this may be a GX issue so let's see if they say anything. I can also cross post this thread to their slack
+*Thread Reply:* interesting... I do think this may be a GX issue so let's see if they say anything. I can also cross post this thread to their slack
@@ -90823,7 +90829,7 @@*Thread Reply:* I know I’m responding to an older post - I’m not sure if this would work in your environment? https://aws.amazon.com/blogs/big-data/build-data-lineage-for-data-lakes-using-aws-glue-amazon-neptune-and-spline/ +
*Thread Reply:* I know I’m responding to an older post - I’m not sure if this would work in your environment? https://aws.amazon.com/blogs/big-data/build-data-lineage-for-data-lakes-using-aws-glue-amazon-neptune-and-spline/
Are you using AWS Glue with Spark jobs?
*Thread Reply:* This was proposed by our AWS Solution architect but we are not seeing much improvement compared to open lineage. Have you deployed the above solution to prod?
+*Thread Reply:* This was proposed by our AWS Solution architect but we are not seeing much improvement compared to open lineage. Have you deployed the above solution to prod?
@@ -90901,7 +90907,7 @@*Thread Reply:* We are currently in the research phase, so we have not deployed to prod. We have customers with thousands of existing scripts that they don’t want to rewrite to add openlineage libraries - i would imagine that if you are already integrating OpenLineage in your code, the spark listener isn’t an improvement. Our research is on magically getting lineage from existing scripts 😄
+*Thread Reply:* We are currently in the research phase, so we have not deployed to prod. We have customers with thousands of existing scripts that they don’t want to rewrite to add openlineage libraries - i would imagine that if you are already integrating OpenLineage in your code, the spark listener isn’t an improvement. Our research is on magically getting lineage from existing scripts 😄
@@ -90962,7 +90968,7 @@*Thread Reply:* Thanks, all. The release is authorized and will be initiated as soon as possible.
+*Thread Reply:* Thanks, all. The release is authorized and will be initiated as soon as possible.
@@ -91014,7 +91020,7 @@*Thread Reply:* GitHub has a new issue template for reporting vulnerabilities, actually, if you use a config that enables this issue template.
+*Thread Reply:* GitHub has a new issue template for reporting vulnerabilities, actually, if you use a config that enables this issue template.
@@ -91202,7 +91208,7 @@*Thread Reply:* Hey @Paul Lee, are you seeing this happen for Async operators?
+*Thread Reply:* Hey @Paul Lee, are you seeing this happen for Async operators?
@@ -91228,7 +91234,7 @@*Thread Reply:* might be related to this issue https://github.com/OpenLineage/OpenLineage/pull/1601 +
*Thread Reply:* might be related to this issue https://github.com/OpenLineage/OpenLineage/pull/1601 that was fixed in 0.20.6
*Thread Reply:* hmm perhaps.
+*Thread Reply:* hmm perhaps.
@@ -91305,7 +91311,7 @@*Thread Reply:* @Harel Shein if i want to turn off openlineage listener how do i do that? do i just remove the package?
+*Thread Reply:* @Harel Shein if i want to turn off openlineage listener how do i do that? do i just remove the package?
@@ -91331,7 +91337,7 @@*Thread Reply:* meaning, you don’t want openlineage to collect any information from your Airflow deployment?
+*Thread Reply:* meaning, you don’t want openlineage to collect any information from your Airflow deployment?
@@ -91361,7 +91367,7 @@*Thread Reply:* in that case, you could either remove it from your requirements file, or set OPENLINEAGE_DISABLED=True in your Airflow env vars
+*Thread Reply:* in that case, you could either remove it from your requirements file, or set OPENLINEAGE_DISABLED=True in your Airflow env vars
@@ -91391,7 +91397,7 @@*Thread Reply:* removed it from requirements and also the backend key in airflow config. needed both
+*Thread Reply:* removed it from requirements and also the backend key in airflow config. needed both
@@ -91482,7 +91488,7 @@*Thread Reply:* Are you seeing duplicate START events or do you see two events one that is a START and one that is COMPLETE?
+*Thread Reply:* Are you seeing duplicate START events or do you see two events one that is a START and one that is COMPLETE?
OpenLineage's events may send partial information. You should expect to collect all events for a given RunId and merge them together to get the complete events.
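As a sketch of that collect-and-merge idea (the event shape follows the OpenLineage JSON; the "later facets win" policy is just one reasonable choice, not prescribed by the spec):

from collections import defaultdict

def merge_by_run_id(events):
    """Collapse multiple OpenLineage events for the same run into one view."""
    merged = defaultdict(lambda: {"states": [], "facets": {}})
    for event in sorted(events, key=lambda e: e["eventTime"]):
        run = merged[event["run"]["runId"]]
        run["states"].append(event["eventType"])
        # Later events win on conflicting run-facet keys.
        run["facets"].update(event["run"].get("facets", {}))
    return dict(merged)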
@@ -91512,7 +91518,7 @@*Thread Reply:* Hmm...I'm seeing 2 start events for the same runnable command
+*Thread Reply:* Hmm...I'm seeing 2 start events for the same runnable command
@@ -91538,7 +91544,7 @@*Thread Reply:* And 2 complete
+*Thread Reply:* And 2 complete
@@ -91564,7 +91570,7 @@*Thread Reply:* I am currently only testing on parquet tables...
+*Thread Reply:* I am currently only testing on parquet tables...
@@ -91590,7 +91596,7 @@*Thread Reply:* One of openlineage's assumptions is the ability to merge lineage events in the backend to make client integrations stateless. So, it is possible that Spark can emit multiple events for the same job. However, sometimes it does not make any sense to send or collect some events, which happened to us some time ago with delta. In that case we decided to filter them and created a filtering mechanism (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/filters) that can be extended in case of other unwanted events being generated and sent.
+*Thread Reply:* One of openlineage's assumptions is the ability to merge lineage events in the backend to make client integrations stateless. So, it is possible that Spark can emit multiple events for the same job. However, sometimes it does not make any sense to send or collect some events, which happened to us some time ago with delta. In that case we decided to filter them and created a filtering mechanism (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/filters) that can be extended in case of other unwanted events being generated and sent.
@@ -91616,7 +91622,7 @@*Thread Reply:* Ahh I see...okay thanks!
+*Thread Reply:* Ahh I see...okay thanks!
@@ -91642,7 +91648,7 @@*Thread Reply:* in general, you should build any event consumer system with at-least-once semantics. Even if this issue is fixed, there is a possibility of duplicates for other valid scenarios
+*Thread Reply:* in general, you should build any event consumer system with at-least-once semantics. Even if this issue is fixed, there is a possibility of duplicates for other valid scenarios
@@ -91672,7 +91678,7 @@*Thread Reply:* Hi..I compared some duplicate 'START' events just now, and noticed that they are exactly the same, with the only exception of one of them having an 'environment-properties' field... Could I just quickly check if this is a bug or a feature haha?
+*Thread Reply:* Hi..I compared some duplicate 'START' events just now, and noticed that they are exactly the same, with the only exception of one of them having an 'environment-properties' field... Could I just quickly check if this is a bug or a feature haha?
@@ -91698,7 +91704,7 @@*Thread Reply:* CC: @Paweł Leszczyński ^
+*Thread Reply:* CC: @Paweł Leszczyński ^
@@ -91821,7 +91827,7 @@*Thread Reply:* if you can set up env variables for particular notebooks, you can set OPENLINEAGE_DISABLED=true
*Thread Reply:* if you can set up env variables for particular notebooks, you can set OPENLINEAGE_DISABLED=true
*Thread Reply:* is DAG run capturing enabled starting airflow 2.5.1 ? https://github.com/apache/airflow/pull/27113
+*Thread Reply:* is DAG run capturing enabled starting airflow 2.5.1 ? https://github.com/apache/airflow/pull/27113
*Thread Reply:* you're right, only the change was included in 2.5.0
+*Thread Reply:* you're right, only the change was included in 2.5.0
@@ -92072,7 +92078,7 @@*Thread Reply:* Thanks Jakub
+*Thread Reply:* Thanks Jakub
@@ -92196,7 +92202,7 @@*Thread Reply:* I don’t see it currently on the AirflowRunFacet, but probably not a big deal to add it? @Benji Lampel wdyt?
+*Thread Reply:* I don’t see it currently on the AirflowRunFacet, but probably not a big deal to add it? @Benji Lampel wdyt?
*Thread Reply:* Definitely could be a good thing to have--is there not some info facet that could hold this data already? I don't see an issue with adding to the AirflowRunFacet tho (full disclosure, I'm not super familiar with this facet)
+*Thread Reply:* Definitely could be a good thing to have--is there not some info facet that could hold this data already? I don't see an issue with adding to the AirflowRunFacet tho (full disclosure, I'm not super familiar with this facet)
@@ -92454,7 +92460,7 @@*Thread Reply:* Perhaps DocumentationJobFacet or DocumentationDatasetFacet?
*Thread Reply:* Perhaps DocumentationJobFacet or DocumentationDatasetFacet?
*Thread Reply:* dbt-ol does not read from openlineage.yml so you need to pass this information in the OPENLINEAGE_NAMESPACE environment variable
*Thread Reply:* dbt-ol does not read from openlineage.yml so you need to pass this information in the OPENLINEAGE_NAMESPACE environment variable
*Thread Reply:* Hmmm. Interesting! I thought that it used client = OpenLineageClient.from_environment(), I’ll do some testing with Kafka backends.
*Thread Reply:* Hmmm. Interesting! I thought that it used client = OpenLineageClient.from_environment(), I’ll do some testing with Kafka backends.
*Thread Reply:* Thank you for the hint. I was able to make it work by specifying the env OPENLINEAGE_CONFIG to point to the yml file holding transport info, plus OPENLINEAGE_NAMESPACE
*Thread Reply:* Thank you for the hint. I was able to make it work by specifying the env OPENLINEAGE_CONFIG to point to the yml file holding transport info, plus OPENLINEAGE_NAMESPACE
*Thread Reply:* Awesome! That’s exactly what I was going to test.
+*Thread Reply:* Awesome! That’s exactly what I was going to test.
*Thread Reply:* I think it also works if you put it in $HOME/.openlineage/openlineage.yml.
*Thread Reply:* I think it also works if you put it in $HOME/.openlineage/openlineage.yml.
*Thread Reply:* @Susmitha Anandarao I might have provided misleading information. I meant that dbt-ol does not read the OL namespace from openlineage.yml but from the OPENLINEAGE_NAMESPACE env var instead
*Thread Reply:* @Susmitha Anandarao I might have provided misleading information. I meant that dbt-ol does not read the OL namespace from openlineage.yml but from the OPENLINEAGE_NAMESPACE env var instead
*Thread Reply:* It should be represented by multiple datasets, unless I misunderstood what you mean by multi-table
+*Thread Reply:* It should be represented by multiple datasets, unless I misunderstood what you mean by multi-table
@@ -92919,7 +92925,7 @@*Thread Reply:* here at Fivetran when we sync data it is generally 1 schema with multiple tables (sometimes many) so we would want to represent all of that
+*Thread Reply:* here at Fivetran when we sync data it is generally 1 schema with multiple tables (sometimes many) so we would want to represent all of that
@@ -92945,7 +92951,7 @@*Thread Reply:* So what I understand:
+*Thread Reply:* So what I understand:
*Thread Reply:* Yes your statements are correct. Thanks for sharing that model, that makes sense to me
+*Thread Reply:* Yes your statements are correct. Thanks for sharing that model, that makes sense to me
@@ -93052,7 +93058,7 @@*Thread Reply:* I think it's better to just create a POJO. This is what we do in the Spark integration, for example.
+*Thread Reply:* I think it's better to just create a POJO. This is what we do in the Spark integration, for example.
For now, the JSON Schema generator isn't flexible enough to generate custom facets from whatever schema we give it, so it would be unnecessary complexity
*Thread Reply:* Agreed, just a POJO would work. This is using Jackson, so you would use annotations as needed. You can also use a Jackson JSONNode or even Map.
+*Thread Reply:* Agreed, just a POJO would work. This is using Jackson, so you would use annotations as needed. You can also use a Jackson JSONNode or even Map.
@@ -93165,7 +93171,7 @@*Thread Reply:* That depends on the OL consumer, but for something like SchemaDatasetFacet it seems to be okay to assume the schema stays the same if not sent.
*Thread Reply:* That depends on the OL consumer, but for something like SchemaDatasetFacet it seems to be okay to assume the schema stays the same if not sent.
For others, like OutputStatisticsOutputDatasetFacet, you definitely can't assume that, as the data is unique to each run.
*Thread Reply:* ok great thanks, that makes sense to me
+*Thread Reply:* ok great thanks, that makes sense to me
@@ -93254,7 +93260,7 @@*Thread Reply:* OpenLineage API: https://openlineage.io/docs/getting-started/
+*Thread Reply:* OpenLineage API: https://openlineage.io/docs/getting-started/
*Thread Reply:* I think it would be great to support V2SessionCatalog, and it would very much help if you created a GitHub issue with more explanation and examples of its use.
+*Thread Reply:* I think it would be great to support V2SessionCatalog, and it would very much help if you created a GitHub issue with more explanation and examples of its use.
@@ -93353,7 +93359,7 @@*Thread Reply:* Sure thanks!
+*Thread Reply:* Sure thanks!
@@ -93379,7 +93385,7 @@*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1747 +
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1747 I have opened an issue here. Thanks! 🙂
*Thread Reply:* Hi @Maciej Obuchowski Just curious, is this issue on the potential roadmap for the next Openlineage release?
+*Thread Reply:* Hi @Maciej Obuchowski Just curious, is this issue on the potential roadmap for the next Openlineage release?
@@ -93516,7 +93522,7 @@*Thread Reply:* The blog post is a bit old and in the meantime there were changes in OpenLineage Python Client introduced. +
*Thread Reply:* The blog post is a bit old and in the meantime there were changes in OpenLineage Python Client introduced. May I ask if you want just to test the flow or looking for any viable Snowflake data lineage solution?
*Thread Reply:* I believe that this will work if you change the line to client.transport.emit()
*Thread Reply:* I believe that this will work if you change the line to client.transport.emit()
*Thread Reply:* (this would be in the dags/lineage folder, if memory serves)
*Thread Reply:* (this would be in the dags/lineage folder, if memory serves)
*Thread Reply:* Ross is right, that should work
+*Thread Reply:* Ross is right, that should work
@@ -93621,7 +93627,7 @@*Thread Reply:* This works! Thank you so much!
+*Thread Reply:* This works! Thank you so much!
@@ -93647,7 +93653,7 @@*Thread Reply:* @Jakub Dardziński I want to use a viable Snowflake data lineage solution alongside a Amazon DataZone Catalog 🙂
+*Thread Reply:* @Jakub Dardziński I want to use a viable Snowflake data lineage solution alongside a Amazon DataZone Catalog 🙂
@@ -93673,7 +93679,7 @@*Thread Reply:* I have been meaning to revisit that tutorial 👍
+*Thread Reply:* I have been meaning to revisit that tutorial 👍
@@ -93740,7 +93746,7 @@*Thread Reply:* Thanks, all. The release is authorized and will be initiated within 48 hours.
+*Thread Reply:* Thanks, all. The release is authorized and will be initiated within 48 hours.
@@ -93901,7 +93907,7 @@*Thread Reply:* Do you have small configuration and job to replicate this?
+*Thread Reply:* Do you have small configuration and job to replicate this?
@@ -93927,7 +93933,7 @@*Thread Reply:* Yeah. For configs: +
*Thread Reply:* Yeah. For configs:
spark.driver.bindAddress: "localhost"
spark.master: "local[*]"
spark.sql.catalogImplementation: "hive"
@@ -93963,7 +93969,7 @@
MARQUEZ_API_KEY=[YOUR_API_KEY]
*Thread Reply:* The issue is because of the spark.jars.packages config... the spark.jars config also runs into the same issue, because the executor tries to fetch the jar from the driver for some reason even though there are no executors set...
*Thread Reply:* The issue is because of the spark.jars.packages config... the spark.jars config also runs into the same issue, because the executor tries to fetch the jar from the driver for some reason even though there are no executors set...
*Thread Reply:* TBH I'm not sure if we can do anything about it. Seems like just having any SparkListener which is not in Spark jars would fall under the same problems, right?
*Thread Reply:* TBH I'm not sure if we can do anything about it. Seems like just having any SparkListener which is not in Spark jars would fall under the same problems, right?
*Thread Reply:* Yeah... Actually, this was because of binding the driver IP to localhost. In that case, the executor was not able to get the jar from the driver. But yeah, I don't think we could have done anything from the openlineage end anyway for this. Was just an interesting error to encounter lol
+*Thread Reply:* Yeah... Actually, this was because of binding the driver IP to localhost. In that case, the executor was not able to get the jar from the driver. But yeah, I don't think we could have done anything from the openlineage end anyway for this. Was just an interesting error to encounter lol
@@ -94067,7 +94073,7 @@*Thread Reply:* In this case you would have four runevents:
+*Thread Reply:* In this case you would have four runevents:
1. START event on my-job where my-input is the input and my-output is the output, with a runId you generate on the client
2. COMPLETE event on my-job with the same runId from #1
3. START event on my-job2 where the input is my-output and the output is my-final-output, with a separate runId you generate
4. COMPLETE event on my-job2 with the same runId from #3
*Thread Reply:* thanks for the response. I tried it but now the UI only shows for like one second and then turns blank. I had a similar issue before. It seems to me every time I add a bad lineage, the UI stops working. I have to delete the docker image :-( Not sure whether it is a MacOS M1 related issue.
+*Thread Reply:* thanks for the response. I tried it but now the UI only shows for like one second and then turns blank. I had a similar issue before. It seems to me every time I add a bad lineage, the UI stops working. I have to delete the docker image :-( Not sure whether it is a MacOS M1 related issue.
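For concreteness, a sketch of emitting the four events described above with the Python client (API names per openlineage-python; the namespace and producer URI are placeholders):

from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient.from_environment()  # reads e.g. OPENLINEAGE_URL
PRODUCER = "https://example.com/my-pipeline"   # placeholder producer URI

def emit_start_complete(job_name, inputs, outputs):
    run = Run(runId=str(uuid4()))  # one fresh runId shared by START and COMPLETE
    job = Job(namespace="my-namespace", name=job_name)
    for state in (RunState.START, RunState.COMPLETE):
        client.emit(RunEvent(
            eventType=state,
            eventTime=datetime.now(timezone.utc).isoformat(),
            run=run,
            job=job,
            producer=PRODUCER,
            inputs=inputs,
            outputs=outputs,
        ))

def ds(name):
    return Dataset(namespace="my-namespace", name=name)

emit_start_complete("my-job", [ds("my-input")], [ds("my-output")])
emit_start_complete("my-job2", [ds("my-output")], [ds("my-final-output")])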
@@ -94122,7 +94128,7 @@*Thread Reply:* Hmmm, that's interesting. Not sure I've seen that before. If you happen to catch it in that state again, perhaps capture the contents of the lineage_events
table so it can be replicated.
*Thread Reply:* Hmmm, that's interesting. Not sure I've seen that before. If you happen to catch it in that state again, perhaps capture the contents of the lineage_events
table so it can be replicated.
*Thread Reply:* I can fairly easily reproduce this blank UI issue. Apparently I used the same runId for two different jobs. If I use a different runId (which I should), the lineage displays correctly. Thanks again!
+*Thread Reply:* I can fairly easily reproduce this blank UI issue. Apparently I used the same runId for two different jobs. If I use a different runId (which I should), the lineage displays correctly. Thanks again!
@@ -94213,7 +94219,7 @@*Thread Reply:* You can add https://openlineage.io/docs/spec/facets/dataset-facets/column_lineage_facet/ to your datasets.
+*Thread Reply:* You can add https://openlineage.io/docs/spec/facets/dataset-facets/column_lineage_facet/ to your datasets.
However, I don't think you can currently do any filtering over it
*Thread Reply:* you can see a good example here, @Lq Dodo: https://github.com/MarquezProject/marquez/blob/289fa3eef967c8f7915b074325bb6f8f55480030/docker/metadata.json#L430
+*Thread Reply:* you can see a good example here, @Lq Dodo: https://github.com/MarquezProject/marquez/blob/289fa3eef967c8f7915b074325bb6f8f55480030/docker/metadata.json#L430
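For reference, the facet itself is plain JSON attached to a dataset; here is a sketch following the spec linked above (all namespace/table/field names and the facet version in _schemaURL are illustrative):
```
output_dataset = {
    "namespace": "example-namespace",
    "name": "public.my_output",
    "facets": {
        "columnLineage": {
            "_producer": "https://example.com/my-producer",
            "_schemaURL": "https://openlineage.io/spec/facets/1-0-1/ColumnLineageDatasetFacet.json",
            "fields": {
                # one entry per output column, listing the input columns it came from
                "user_name": {
                    "inputFields": [
                        {"namespace": "example-namespace", "name": "public.my_input", "field": "user_name"}
                    ],
                    "transformationDescription": "identical",
                    "transformationType": "IDENTITY",
                }
            },
        }
    },
}
```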
*Thread Reply:* those examples really help. I can at least build the lineage with column-level info using the APIs. thanks a lot! Ideally I'd like to select one column from the UI and have it show me the column-level graph. Seems that's not possible.
+*Thread Reply:* those examples really help. I can at least build the lineage with column-level info using the APIs. thanks a lot! Ideally I'd like to select one column from the UI and have it show me the column-level graph. Seems that's not possible.
@@ -94339,7 +94345,7 @@*Thread Reply:* correct, right now there isn't column-level metadata on the lineage graph 😞
+*Thread Reply:* correct, right now there isn't column-level metadata on the lineage graph 😞
@@ -94393,7 +94399,7 @@*Thread Reply:* something needs to trigger lineage collection, are you using some sort of scheduler / execution engine?
+*Thread Reply:* something needs to trigger lineage collection, are you using some sort of scheduler / execution engine?
@@ -94419,7 +94425,7 @@*Thread Reply:* Nope... We currently don't have a scheduling tool. Isn't it possible to use the OpenLineage API and collect the details?
+*Thread Reply:* Nope... We currently don't have a scheduling tool. Isn't it possible to use the OpenLineage API and collect the details?
@@ -94609,7 +94615,7 @@*Thread Reply:* E.g. do I need to update my custom operators to inherit from DefaultExtractor
?
*Thread Reply:* E.g. do I need to update my custom operators to inherit from DefaultExtractor
?
*Thread Reply:* FWIW, I can tell some level of connectivity to my Marquez deployment is working since I can see it created the default namespace I defined in my OPENLINEAGE_NAMESPACE
env var.
*Thread Reply:* FWIW, I can tell some level of connectivity to my Marquez deployment is working since I can see it created the default namespace I defined in my OPENLINEAGE_NAMESPACE
env var.
*Thread Reply:* hey John, it is enough to add the method to your custom operator. Perhaps something breaks inside the method. Did anything show up in the logs?
+*Thread Reply:* hey John, it is enough to add the method to your custom operator. Perhaps something breaks inside the method. Did anything show up in the logs?
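For context, "the method" here is the get_openlineage_facets_on_start (or _on_complete) hook that openlineage-airflow's DefaultExtractor looks for on an operator. A rough sketch assuming that interface; the import path and all operator/dataset names are assumptions that may vary by version:
```
from airflow.models import BaseOperator
from openlineage.airflow.extractors.base import OperatorLineage  # path is an assumption
from openlineage.client.run import Dataset

class MyCustomOperator(BaseOperator):
    def execute(self, context):
        ...  # the task's real work

    # DefaultExtractor should pick this up without registering a dedicated extractor
    def get_openlineage_facets_on_start(self) -> OperatorLineage:
        return OperatorLineage(
            inputs=[Dataset(namespace="my-namespace", name="raw.events")],
            outputs=[Dataset(namespace="my-namespace", name="mart.events_daily")],
        )
```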
@@ -94687,7 +94693,7 @@*Thread Reply:* That’s the strange part. I’m not seeing anything to suggest that the method is ever getting called. I’m also expecting that the listener created by the plugin should at least be calling this log line when the task runs. However, I’m not seeing that either. I’m able to verify the plugin is registered using airflow plugins
and have debug level logging enabled via AIRFLOW__LOGGING__LOGGING_LEVEL='DEBUG'
. This is the output of airflow plugins
*Thread Reply:* That’s the strange part. I’m not seeing anything to suggest that the method is ever getting called. I’m also expecting that the listener created by the plugin should at least be calling this log line when the task runs. However, I’m not seeing that either. I’m able to verify the plugin is registered using airflow plugins
and have debug level logging enabled via AIRFLOW__LOGGING__LOGGING_LEVEL='DEBUG'
. This is the output of airflow plugins
name | macros | listeners | source
==================+================================================+==============================+=================================================
@@ -94746,7 +94752,7 @@
*Thread Reply:* Figured this out. Just needed to run the airflow scheduler and trigger tasks through the DAGs vs. airflow tasks test …
*Thread Reply:* Figured this out. Just needed to run the airflow scheduler and trigger tasks through the DAGs vs. airflow tasks test …
*Thread Reply:* (they are importing dagkit sdk stuff like Job, JobContext, ExecutionContext, and NodeContext.)
+*Thread Reply:* (they are importing dagkit sdk stuff like Job, JobContext, ExecutionContext, and NodeContext.)
@@ -94826,7 +94832,7 @@*Thread Reply:* Do they run those scripts in PythonOperator? If so, they should receive some events but with no datasets extracted
+*Thread Reply:* Do they run those scripts in PythonOperator? If so, they should receive some events but with no datasets extracted
@@ -94852,7 +94858,7 @@*Thread Reply:* How can I know that? Would it be in the scripts or the airflow configuration or...
+*Thread Reply:* How can I know that? Would it be in the scripts or the airflow configuration or...
@@ -94878,7 +94884,7 @@*Thread Reply:* And "with no datasets extracted" that means I wouldn't have the schema of the input and output datasets? (I need the db/schema/table/column names for my purposes)
+*Thread Reply:* And "with no datasets extracted" that means I wouldn't have the schema of the input and output datasets? (I need the db/schema/table/column names for my purposes)
@@ -94904,7 +94910,7 @@*Thread Reply:* That really depends on the current code, but in general any custom code in Airflow does not extract any extra information, especially datasets. One can write their own extractors (more in the docs)
+*Thread Reply:* That really depends on the current code, but in general any custom code in Airflow does not extract any extra information, especially datasets. One can write their own extractors (more in the docs)
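A minimal sketch of such a custom extractor against openlineage-airflow's BaseExtractor interface (the operator classname, connection URIs, and table names are all made up):
```
from typing import List, Optional

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata
from openlineage.client.run import Dataset

class MyScriptExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        # Operator classes this extractor handles
        return ["MyScriptOperator"]

    def extract(self) -> Optional[TaskMetadata]:
        # Pull whatever your operator knows about its data;
        # the operator attributes used here are assumptions.
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            inputs=[Dataset(namespace="mssql://host:1433", name="db.schema.source_table")],
            outputs=[Dataset(namespace="mssql://host:1433", name="db.schema.target_table")],
        )
```
Custom extractors can then be pointed to via the OPENLINEAGE_EXTRACTORS environment variable (a semicolon-separated list of import paths).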
*Thread Reply:* Thanks! This is very helpful. Exactly what I needed.
+*Thread Reply:* Thanks! This is very helpful. Exactly what I needed.
@@ -95011,7 +95017,7 @@*Thread Reply:* Currently there's no extractor implemented for MS-SQL. We try to keep the list of supported databases up to date here: https://openlineage.io/docs/integrations/about/
+*Thread Reply:* Currently there's no extractor implemented for MS-SQL. We try to keep the list of supported databases up to date here: https://openlineage.io/docs/integrations/about/
*Thread Reply:* > The snowflake.connector import connect
package seems to be outdated here in extract_openlineage.py
and is not working for airflow.
+
*Thread Reply:* > The snowflake.connector import connect
package seems to be outdated here in extract_openlineage.py
and is not working for airflow.
What's the error?
> Does anyone know how to rewrite this code (e.g., with SnowflakeOperator
)
@@ -95429,7 +95435,7 @@
*Thread Reply:* Hi Maciej! Thank you so much for the reply! I managed to generate a working combination on Windows between the airflow example in the marquez git and the snowflake openlineage git. The only error I still get is: +
*Thread Reply:* Hi Maciej! Thank you so much for the reply! I managed to generate a working combination on Windows between the airflow example in the marquez git and the snowflake openlineage git. The only error I still get is:
*** Log file does not exist: /opt/bitnami/airflow/logs/dag_id=etl_openlineage/run_id=manual__2023-04-10T14:12:53.764783+00:00/task_id=send_ol_events/attempt=1.log
*** Fetching from: <http://1c8bb4a78f14:8793/log/dag_id=etl_openlineage/run_id=manual__2023-04-10T14:12:53.764783+00:00/task_id=send_ol_events/attempt=1.log>
*** !!!! Please make sure that all your Airflow components (e.g. schedulers, webservers and workers) have the same 'secret_key' configured in 'webserver' section and time is synchronized on all your machines (for example with ntpd) !!!!!
@@ -95462,7 +95468,7 @@
*Thread Reply:* I think it's an Airflow error related to getting logs from the worker
+*Thread Reply:* I think it's an Airflow error related to getting logs from the worker
@@ -95488,7 +95494,7 @@*Thread Reply:* snowflake.connector
is a Snowflake connector library that SnowflakeOperator
uses underneath to connect to Snowflake
*Thread Reply:* snowflake.connector
is a Snowflake connector library that SnowflakeOperator
uses underneath to connect to Snowflake
*Thread Reply:* Ah alright! Thanks for pointing that out! 🙂 Do you know how to solve it? Or do you have any recommendations on how to look for the solution?
+*Thread Reply:* Ah alright! Thanks for pointing that out! 🙂 Do you know how to solve it? Or do you have any recommendations on how to look for the solution?
@@ -95540,7 +95546,7 @@*Thread Reply:* I have no experience with Windows, and I think it's the issue: https://github.com/apache/airflow/issues/10388
+*Thread Reply:* I have no experience with Windows, and I think it's the issue: https://github.com/apache/airflow/issues/10388
I would try running it in Docker TBH
*Thread Reply:* Yeah I was running Airflow in Docker but this didn't work. I'll try to use my Macbook for now because I don't think there is a solution for this in the short term. Thank you so much for the support though!!
+*Thread Reply:* Yeah I was running Airflow in Docker but this didn't work. I'll try to use my Macbook for now because I don't think there is a solution for this in the short term. Thank you so much for the support though!!
@@ -95729,7 +95735,7 @@*Thread Reply:* Very interesting!
+*Thread Reply:* Very interesting!
@@ -95755,7 +95761,7 @@*Thread Reply:* that’s awesome 🙂
+*Thread Reply:* that’s awesome 🙂
@@ -95807,7 +95813,7 @@*Thread Reply:* Thanks Ernie, I’m looking at Airflow as well as GE and would like to contribute back to the project as well… we’re close to getting a public preview release of our product done and then we want to help build out OpenLineage
+*Thread Reply:* Thanks Ernie, I’m looking at Airflow as well as GE and would like to contribute back to the project as well… we’re close to getting a public preview release of our product done and then we want to help build out OpenLineage
@@ -95863,7 +95869,7 @@*Thread Reply:* The dag is basically just: +
*Thread Reply:* The dag is basically just:
```dag = DAG(
dag_id="asana_example_dag",
default_args=default_args,
@@ -95902,7 +95908,7 @@
*Thread Reply:* This is the error I’m getting, seems to be coming from this line: +
*Thread Reply:* This is the error I’m getting, seems to be coming from this line:
[2023-04-13, 17:45:33 UTC] {logging_mixin.py:115} WARNING - Exception in thread Thread-1:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/threading.py", line 926, in _bootstrap_inner
@@ -96024,7 +96030,7 @@
*Thread Reply:* This is with Airflow 2.3.2 and openlineage-airflow 0.22.0
+*Thread Reply:* This is with Airflow 2.3.2 and openlineage-airflow 0.22.0
@@ -96050,7 +96056,7 @@*Thread Reply:* Seems like it might be some issue like this with a circular structure? https://stackoverflow.com/questions/46283738/attributeerror-when-using-python-deepcopy
+*Thread Reply:* Seems like it might be some issue like this with a circular structure? https://stackoverflow.com/questions/46283738/attributeerror-when-using-python-deepcopy
*Thread Reply:* Just by a quick look at it, it will definitely be fixed with Airflow 2.6, as it won't need to deepcopy anything.
+*Thread Reply:* Just by a quick look at it, it will definitely be fixed with Airflow 2.6, as it won't need to deepcopy anything.
@@ -96131,7 +96137,7 @@*Thread Reply:* I can't seem to reproduce the issue. I ran the following example DAG with the same Airflow and OL versions as yours: +
*Thread Reply:* I can't seem to reproduce the issue. I ran the following example DAG with the same Airflow and OL versions as yours: ```import datetime
from airflow.lineage.entities import Table
@@ -96180,7 +96186,7 @@
*Thread Reply:* is there possibly any extra configuration you made?
+*Thread Reply:* is there possibly any extra configuration you made?
@@ -96206,7 +96212,7 @@*Thread Reply:* @John Lukenoff, I was finally able to reproduce this when passing xcom as task.output
+
*Thread Reply:* @John Lukenoff, I was finally able to reproduce this when passing xcom as task.output
looks like this was reported here and solved by this PR (not sure if this was released in 2.3.3 or later)
*Thread Reply:* Ah interesting. Let me see if bumping my Airflow version resolves this. Haven’t had a chance to tinker with it much since yesterday.
+*Thread Reply:* Ah interesting. Let me see if bumping my Airflow version resolves this. Haven’t had a chance to tinker with it much since yesterday.
@@ -96307,7 +96313,7 @@*Thread Reply:* I ran it against 2.4 and same dag works
+*Thread Reply:* I ran it against 2.4 and same dag works
@@ -96333,7 +96339,7 @@*Thread Reply:* 👍 Looks like a fix for that issue was rolled out in 2.3.3. I’m gonna try that for now (my company has a notoriously difficult time with airflow major version updates 😅)
+*Thread Reply:* 👍 Looks like a fix for that issue was rolled out in 2.3.3. I’m gonna try that for now (my company has a notoriously difficult time with airflow major version updates 😅)
@@ -96359,7 +96365,7 @@*Thread Reply:* 👍
+*Thread Reply:* 👍
@@ -96385,7 +96391,7 @@*Thread Reply:* Got this working! We just monkey patched the __deepcopy__
method of the BaseOperator
for now until we can get bandwidth for an airflow upgrade. Thanks for the help here!
*Thread Reply:* Got this working! We just monkey patched the __deepcopy__
method of the BaseOperator
for now until we can get bandwidth for an airflow upgrade. Thanks for the help here!
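The thread doesn't show the exact patch; here is a sketch of the general shape such a workaround could take, assuming a shallow copy is an acceptable fallback:
```
from airflow.models.baseoperator import BaseOperator

def _shallow_deepcopy(self, memo):
    # Fall back to a shallow copy so deepcopy never recurses into attributes
    # holding circular references (e.g. XComArg -> DAG -> task).
    cls = self.__class__
    result = cls.__new__(cls)
    memo[id(self)] = result
    result.__dict__.update(self.__dict__)
    return result

BaseOperator.__deepcopy__ = _shallow_deepcopy
```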
*Thread Reply:* This is the spark submit command: +
*Thread Reply:* This is the spark submit command:
spark-submit --py-files /usr/local/lib/common_utils.zip,/usr/local/lib/team_utils.zip,/usr/local/lib/project_utils.zip
--conf spark.executor.cores=16
--conf spark.hadoop.fs.s3a.connection.maximum=100 --conf spark.sql.shuffle.partitions=1000
@@ -96489,7 +96495,7 @@
*Thread Reply:* @Anirudh Shrinivason pls create an issue for this and I will look at it. Although it may be difficult to find the root cause, a null pointer exception should always be avoided and this seems to be a bug.
+*Thread Reply:* @Anirudh Shrinivason pls create an issue for this and I will look at it. Although it may be difficult to find the root cause, a null pointer exception should always be avoided and this seems to be a bug.
@@ -96515,7 +96521,7 @@*Thread Reply:* Hmm yeah sure. I'll create an issue on GitHub for this. Thanks!
+*Thread Reply:* Hmm yeah sure. I'll create an issue on GitHub for this. Thanks!
@@ -96541,7 +96547,7 @@*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1784 +
*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1784 Opened an issue here
*Thread Reply:* Hi @Allison Suarez, CustomColumnLineageVisitor
should definitely be public. I'll prepare a fix PR for that. We do have a test for custom column lineage visitors (CustomColumnLineageVisitorTestImpl
), but they're in the same package. Thanks for bringing this up.
+*Thread Reply:* Hi @Allison Suarez, CustomColumnLineageVisitor
should definitely be public. I'll prepare a fix PR for that. We do have a test for custom column lineage visitors (CustomColumnLineageVisitorTestImpl
), but they're in the same package. Thanks for bringing this up.
*Thread Reply:* This PR should resolve problem: +
*Thread Reply:* This PR should resolve problem: https://github.com/OpenLineage/OpenLineage/pull/1788
*Thread Reply:* Thank you so much @Paweł Leszczyński 🙂
+*Thread Reply:* Thank you so much @Paweł Leszczyński 🙂
@@ -96766,7 +96772,7 @@*Thread Reply:* How does the release process work for OL? Do we have to wait a certain amount of time to get this change in a new release?
+*Thread Reply:* How does the release process work for OL? Do we have to wait a certain amount of time to get this change in a new release?
@@ -96792,7 +96798,7 @@*Thread Reply:* @Maciej Obuchowski ^
+*Thread Reply:* @Maciej Obuchowski ^
@@ -96818,7 +96824,7 @@*Thread Reply:* 0.22.0 was released two weeks ago, so the next scheduled release should be in the next two weeks. We can ask @Michael Robinson his opinion on releasing 0.22.1 before that.
+*Thread Reply:* 0.22.0 was released two weeks ago, so the next scheduled release should be in the next two weeks. We can ask @Michael Robinson his opinion on releasing 0.22.1 before that.
@@ -96844,7 +96850,7 @@*Thread Reply:* Hi Allison 👋, +
*Thread Reply:* Hi Allison 👋, Anyone can request a release in the #general channel. I encourage you to go this route. You’ll need three +1s (there’s more info about the process here: https://github.com/OpenLineage/OpenLineage/blob/main/GOVERNANCE.md), but I don’t know of any reasons why we can’t do a mid-cycle release. 🙂
@@ -96875,7 +96881,7 @@*Thread Reply:* seems like we got enough +1s
+*Thread Reply:* seems like we got enough +1s
@@ -96901,7 +96907,7 @@*Thread Reply:* We need three committers to give a +1. I’ll reach out again to see if I can recruit a third
+*Thread Reply:* We need three committers to give a +1. I’ll reach out again to see if I can recruit a third
@@ -96931,7 +96937,7 @@*Thread Reply:* oooh
+*Thread Reply:* oooh
@@ -96957,7 +96963,7 @@*Thread Reply:* Yeah, sorry I forgot to mention that!
+*Thread Reply:* Yeah, sorry I forgot to mention that!
@@ -96983,7 +96989,7 @@*Thread Reply:* we have it now
+*Thread Reply:* we have it now
@@ -97138,7 +97144,7 @@*Thread Reply:* The release is authorized and will be initiated within 2 business days (not including tomorrow).
+*Thread Reply:* The release is authorized and will be initiated within 2 business days (not including tomorrow).
@@ -97246,7 +97252,7 @@*Thread Reply:* Hi Sai,
+*Thread Reply:* Hi Sai,
Perhaps you could try printing OpenLineage events into the logs. This can be achieved with the Spark config parameter:
spark.openlineage.transport.type
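As a sketch, the same parameters set from PySpark (the version pin matches the 0.22.0 mentioned in this thread; the app name is a placeholder):
```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ol-console-debug")  # placeholder
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.22.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # "console" prints each OpenLineage event into the driver log instead of sending it anywhere
    .config("spark.openlineage.transport.type", "console")
    .getOrCreate()
)
```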
@@ -97278,7 +97284,7 @@
*Thread Reply:* Hi @Paweł Leszczyński I passed this config as below, but could not see any changes in the logs. The events are getting generated sometimes like below:
+*Thread Reply:* Hi @Paweł Leszczyński I passed this config as below, but could not see any changes in the logs. The events are getting generated sometimes like below:
23/04/20 10:00:15 INFO ConsoleTransport: {"eventType":"START","eventTime":"2023-04-20T10:00:15.085Z","run":{"runId":"ef4f46d1-d13a-420a-87c3-19fbf6ffa231","facets":{"spark.logicalPlan":{"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.22.0/integration/spark","schemaURL":"https://openlineage.io/spec/1-0-5/OpenLineage.json#/$defs/RunFacet","plan":[{"class":"org.apache.spark.sql.catalyst.plans.logical.CreateTableAsSelect","num-children":2,"name":0,"partitioning":[],"query":1,"tableSpec":null,"writeOptions":null,"ignoreIfExists":false},{"class":"org.apache.spark.sql.catalyst.analysis.ResolvedTableName","num-children":0,"catalog":null,"ident":null},{"class":"org.apache.spark.sql.catalyst.plans.logical.Project","num-children":1,"projectList":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num_children":0,"name":"workorderid","dataType":"integer","nullable":true,"metadata":{},"exprId":{"product-cl
@@ -97315,7 +97321,7 @@*Thread Reply:* Ok, great. This means the issue is related to Spark <-> Marquez connection
+*Thread Reply:* Ok, great. This means the issue is related to Spark <-> Marquez connection
@@ -97341,7 +97347,7 @@*Thread Reply:* The Spark config changed some time ago; here is the up-to-date documentation: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark
+*Thread Reply:* The Spark config changed some time ago; here is the up-to-date documentation: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark
@@ -97367,7 +97373,7 @@*Thread Reply:* please note that spark.openlineage.transport.url
has to be used, which is different from what you have in the attached screenshot
+*Thread Reply:* please note that spark.openlineage.transport.url
has to be used, which is different from what you have in the attached screenshot
*Thread Reply:* You mean instead of "spark.openlineage.host" I need to use "spark.openlineage.transport.url"?
+*Thread Reply:* You mean instead of "spark.openlineage.host" I need to use "spark.openlineage.transport.url"?
@@ -97419,7 +97425,7 @@*Thread Reply:* yes, please give it a try
+*Thread Reply:* yes, please give it a try
@@ -97445,7 +97451,7 @@*Thread Reply:* sure will give a try and let you know the outcome
+*Thread Reply:* sure will give a try and let you know the outcome
@@ -97471,7 +97477,7 @@*Thread Reply:* and set spark.openlineage.transport.type
to http
*Thread Reply:* and set spark.openlineage.transport.type
to http
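Putting the last few messages together, a sketch of the http-transport setup from PySpark (the Marquez URL and namespace are assumptions):
```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ol-http-demo")  # placeholder
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.22.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")  # assumed Marquez endpoint
    .config("spark.openlineage.namespace", "my-namespace")               # assumed namespace
    .getOrCreate()
)
```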
*Thread Reply:* okay
+*Thread Reply:* okay
@@ -97523,7 +97529,7 @@*Thread Reply:* do these configs suffice or do I need to add anything else?
+*Thread Reply:* do these configs suffice or do I need to add anything else?
spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.consoleTransport true
@@ -97555,7 +97561,7 @@
*Thread Reply:* spark.openlineage.consoleTransport true
this one can be removed
*Thread Reply:* spark.openlineage.consoleTransport true
this one can be removed
*Thread Reply:* otherwise shall be OK
+*Thread Reply:* otherwise shall be OK
@@ -97607,7 +97613,7 @@*Thread Reply:* I added these configs and ran it, but still the same issue. Now I am not able to see the events in the log file either.
+*Thread Reply:* I added these configs and ran it, but still the same issue. Now I am not able to see the events in the log file either.
@@ -97633,7 +97639,7 @@*Thread Reply:* 23/04/20 13:51:22 INFO SparkSQLExecutionContext: OpenLineage received Spark event that is configured to be skipped: SparkListenerSQLExecutionStart +
*Thread Reply:* 23/04/20 13:51:22 INFO SparkSQLExecutionContext: OpenLineage received Spark event that is configured to be skipped: SparkListenerSQLExecutionStart 23/04/20 13:51:22 INFO SparkSQLExecutionContext: OpenLineage received Spark event that is configured to be skipped: SparkListenerSQLExecutionEnd
Does this need any changes on the config side?
@@ -97773,7 +97779,7 @@*Thread Reply:* @Anirudh Shrinivason Thanks for linking this as it contains a clear explanation on Spark catalogs. However, I am still unable to write a failing integration test that reproduces the scenario. Could you provide an example of Spark which is failing on V2SessionCatalog
and provide more details how are you trying to read/write data?
*Thread Reply:* @Anirudh Shrinivason Thanks for linking this as it contains a clear explanation on Spark catalogs. However, I am still unable to write a failing integration test that reproduces the scenario. Could you provide an example of Spark which is failing on V2SessionCatalog
and provide more details how are you trying to read/write data?
*Thread Reply:* Hi @Paweł Leszczyński I noticed this issue on one of our pipelines before actually. I didn't note down which pipeline the issue was occurring in, unfortunately. I'll keep checking from my end to identify the spark job that ran into this error. In the meantime, I'll also try to see for which cases deltaCatalog
makes use of the V2SessionCatalog
to understand this better. Thanks!
+*Thread Reply:* Hi @Paweł Leszczyński I noticed this issue on one of our pipelines before actually. I didn't note down which pipeline the issue was occurring in, unfortunately. I'll keep checking from my end to identify the spark job that ran into this error. In the meantime, I'll also try to see for which cases deltaCatalog
makes use of the V2SessionCatalog
to understand this better. Thanks!
*Thread Reply:* Hi @Paweł Leszczyński +
*Thread Reply:* Hi @Paweł Leszczyński
```
CREATE TABLE IF NOT EXISTS TABLE_NAME (
SOME COLUMNS
@@ -97863,7 +97869,7 @@
*Thread Reply:* Thanks @Anirudh Shrinivason, will look into that.
+*Thread Reply:* Thanks @Anirudh Shrinivason, will look into that.
@@ -97889,7 +97895,7 @@*Thread Reply:* which spark & delta versions are you using?
+*Thread Reply:* which spark & delta versions are you using?
@@ -97915,7 +97921,7 @@*Thread Reply:* I am not 100% sure if this is something you described, but this was an error I was able to replicate and fix. Please look at the exception stacktrace and let me know if it is the same on your side. +
*Thread Reply:* I am not 100% sure if this is something you described, but this was an error I was able to replicate and fix. Please look at the exception stacktrace and let me know if it is the same on your side. https://github.com/OpenLineage/OpenLineage/pull/1798
*Thread Reply:* Hi
+*Thread Reply:* Hi
@@ -98063,7 +98069,7 @@*Thread Reply:* Hmm actually I am noticing this error on my local
+*Thread Reply:* Hmm actually I am noticing this error on my local
@@ -98089,7 +98095,7 @@*Thread Reply:* But on the prod job, I am seeing no such error in the logs...
+*Thread Reply:* But on the prod job, I am seeing no such error in the logs...
@@ -98115,7 +98121,7 @@*Thread Reply:* Also, I was using spark 3.1.2
+*Thread Reply:* Also, I was using spark 3.1.2
@@ -98145,7 +98151,7 @@*Thread Reply:* then perhaps it's sth different :face_palm: will try to replicate on spark 3.1.2
+*Thread Reply:* then perhaps it's sth different :face_palm: will try to replicate on spark 3.1.2
@@ -98175,7 +98181,7 @@*Thread Reply:* Not too sure which delta version the prod job was using...
+*Thread Reply:* Not too sure which delta version the prod job was using...
@@ -98201,7 +98207,7 @@*Thread Reply:* I was running the following command on Spark 3.1.2: +
*Thread Reply:* I was running the following command on Spark 3.1.2:
spark.sql(
"CREATE TABLE t_partitioned (a int, b int) USING delta "
+ "PARTITIONED BY (a) LOCATION '/tmp/delta/tbl'"
@@ -98236,7 +98242,7 @@