From ce469696b209e390f88b815d345a3d8a75d73000 Mon Sep 17 00:00:00 2001
From: merobi-hub
Date: Mon, 29 Jan 2024 17:35:19 -0500
Subject: [PATCH] End of Jan 24 update.

Signed-off-by: merobi-hub
---
 channel/boston-meetup/index.html              |     6 +
 channel/dagster-integration/index.html        |   114 +-
 channel/data-council-meetup/index.html        |    32 +
 channel/dev-discuss/index.html                | 12774 +++++++-
 channel/general/index.html                    | 24017 +++++++++++++---
 channel/gx-integration/index.html             |     8 +-
 channel/jobs/index.html                       |   520 +
 channel/london-meetup/index.html              |     6 +
 channel/mark-grover/index.html                |    10 +-
 channel/nyc-meetup/index.html                 |     6 +
 channel/open-lineage-plus-bacalhau/index.html |    74 +-
 channel/providence-meetup/index.html          |     8 +-
 channel/sf-meetup/index.html                  |    22 +-
 channel/spec-compliance/index.html            |    20 +-
 channel/toronto-meetup/index.html             |     6 +
 index.html                                    | 24017 +++++++++++++---
 16 files changed, 54302 insertions(+), 7338 deletions(-)
 create mode 100644 channel/jobs/index.html

diff --git a/channel/boston-meetup/index.html b/channel/boston-meetup/index.html
index 28d3a36..67d3cf1 100644
--- a/channel/boston-meetup/index.html
+++ b/channel/boston-meetup/index.html
@@ -52,6 +52,12 @@
     Public Channels

+  •
+    # jobs
+  •
     # london-meetup

diff --git a/channel/dagster-integration/index.html b/channel/dagster-integration/index.html
index 7aa61e4..523ef8c 100644
--- a/channel/dagster-integration/index.html
+++ b/channel/dagster-integration/index.html
@@ -52,6 +52,12 @@
     Public Channels

+  •
+    # jobs
+  •
     # london-meetup

@@ -721,7 +727,7 @@
     Group Direct Messages

     2022-02-04 17:18:32
     *Thread Reply:* Hi Michael, while I'm currently going through the internal review process before creating a PR, I can do a quick demo of the current OpenLineage sensor approach and get some initial feedback, if that's okay.

@@ -747,7 +753,7 @@
     2022-02-09 13:04:24
     *Thread Reply:* thanks for the demo!

@@ -777,7 +783,7 @@
     2022-02-10 22:37:06
     *Thread Reply:* Thanks for the opportunity!

@@ -884,7 +890,7 @@
     2022-02-15 15:39:14
     *Thread Reply:* Thanks for the PR! I'll take a look tomorrow.

@@ -914,7 +920,7 @@
     2022-02-16 09:37:27
     *Thread Reply:* @Dalin Kim looks great! I approved it. One thing to do is to rebase on main and force-with-lease push: it appears that there are some conflicts.

@@ -940,7 +946,7 @@
     2022-02-16 11:59:49
     *Thread Reply:* @Maciej Obuchowski Thank you for the review. I just had a small conflict in CHANGELOG, which has been resolved. For the integration test, do you suggest an approach similar to the Airflow integration's, using Flask?

@@ -966,7 +972,7 @@
     2022-02-16 12:03:08
     *Thread Reply:* Also, I reached out to Sandy from Dagster for a review to make sure this is good from the Dagster side of things.

@@ -992,7 +998,7 @@
     2022-02-16 12:09:12
     *Thread Reply:* > For the integration test, do you suggest an approach similar to the Airflow integration's, using Flask?
     Yes, I think we can reuse that part. The most important thing is that we have real Dagster running "real" workloads.
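     [A minimal sketch of the Flask-based pattern referenced above: a local endpoint that collects the OpenLineage events a real run emits, so the test can assert on them afterward. The routes and structure here are illustrative assumptions, not the actual test harness.]

        from flask import Flask, jsonify, request

        app = Flask(__name__)
        collected_events = []  # OpenLineage RunEvents received during the test run

        @app.route("/api/v1/lineage", methods=["POST"])
        def receive_lineage():
            # The integration under test points its OpenLineage transport here.
            collected_events.append(request.get_json())
            return jsonify({}), 200

        @app.route("/events", methods=["GET"])
        def list_events():
            # The test suite fetches what was collected and compares it
            # against expected event payloads.
            return jsonify(collected_events), 200

        if __name__ == "__main__":
            app.run(port=5000)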

@@ -1023,7 +1029,7 @@
     2022-02-17 00:45:37
     *Thread Reply:* Documentation has been updated based on feedback from Yuhan from the Dagster team, and I believe everything is set from my end.

@@ -1049,7 +1055,7 @@
     2022-02-17 05:25:16
     *Thread Reply:* Great! Let's just get rid of this linting error and I'll merge it then.
     https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/2204/workflows/cdb412e6-b41a-4fab-bc8e-d8bee71d051d/jobs/18963

@@ -1077,7 +1083,7 @@
     2022-02-17 11:27:41
     *Thread Reply:* Thank you. I overlooked this when updating the docstring. All should be good now.
     One final question: should we make the Dagster unit test job "required" in the CI, and how can that be configured?

@@ -1114,7 +1120,7 @@
     2022-02-17 11:54:08
     *Thread Reply:* @Willy Lulciuc I think you configured it, am I right?
     @Dalin Kim one more rebase, please 🙏 I've turned auto-merge on, but unfortunately I can't rebase your branch on the fork.

@@ -1147,7 +1153,7 @@
     2022-02-17 11:58:34
     *Thread Reply:* Rebased and pushed

@@ -1173,7 +1179,7 @@
     2022-02-17 12:23:44
     *Thread Reply:* @Dalin Kim merged!

@@ -1199,7 +1205,7 @@
     2022-02-17 12:24:28
     *Thread Reply:* @Maciej Obuchowski Awesome! Thank you so much for all your help!

@@ -1255,7 +1261,7 @@
     2022-02-15 14:47:49
     *Thread Reply:* Sorry - we're not running integration tests for Airflow on forks for security reasons. If you're not touching any Airflow files, it should not affect you at all 🙂

@@ -1281,7 +1287,7 @@
     2022-02-15 14:48:23
     *Thread Reply:* In other words, it's just as expected.

@@ -1307,7 +1313,7 @@
     2022-02-15 15:09:42
     *Thread Reply:* Thank you for the clarification

@@ -1565,7 +1571,7 @@
     2022-03-04 09:21:58
     *Thread Reply:* Thanks for letting us know. We'll take a look and follow up.

@@ -1591,7 +1597,7 @@
     2022-03-04 12:32:26
     *Thread Reply:* Just as an update on findings, it appears that this MR introduced a breaking change in the test helper function that creates a test EventLogRecord.

@@ -1867,7 +1873,7 @@
     2022-03-21 12:56:33
     *Thread Reply:* Thanks for the heads up. We'll look into it and follow up.

@@ -1893,7 +1899,7 @@
     2022-03-22 16:02:01
     *Thread Reply:* Here is the PR to fix the error with the latest Dagster version.

@@ -1964,7 +1970,7 @@
     2022-03-23 08:20:41
     *Thread Reply:* Thanks! Merged.

@@ -1990,7 +1996,7 @@
     2022-03-23 09:49:53
     *Thread Reply:* Thanks!

@@ -2120,7 +2126,7 @@
     2022-03-31 15:59:33
     *Thread Reply:* That's definitely the advantage of the log-parsing method. The Airflow integration, especially the most recent version for Airflow 2.3+, has the advantage of being more robust when it comes to delivering lineage in real time.

@@ -2150,7 +2156,7 @@
     2022-03-31 19:18:29
     *Thread Reply:* Thanks for the heads up on this integration!

@@ -2176,7 +2182,7 @@
     2022-03-31 19:20:10
     *Thread Reply:* I suspect future integrations will move towards this pattern as well? Sorry, off-topic for this channel, but this was the first place I've seen this metadata collection method in this project.

@@ -2202,7 +2208,7 @@
     2022-03-31 19:22:14
     *Thread Reply:* I would be curious to explore similar external executor integrations for Airflow, say for Papermill or Kubernetes (via the corresponding operators). I suppose one would need to pass job metadata through to the respective platform where the logs are actually collected.

@@ -2239,6 +2245,58 @@
     Fraser Marlow (fraser@elementl.com)
     2023-12-02 21:43:01
     On Tuesday, the Dagster team will be showcasing Embedded ELT, a way to save tens of thousands of dollars on data movement tasks. Come catch the session with @Pedram (Dagster)
     https://dagster.io/events/embedded-elt-dec-2023

diff --git a/channel/data-council-meetup/index.html b/channel/data-council-meetup/index.html
index bf5cbd0..759bc70 100644
--- a/channel/data-council-meetup/index.html
+++ b/channel/data-council-meetup/index.html
@@ -52,6 +52,12 @@
     Public Channels

+  •
+    # jobs
+  •
     # london-meetup

@@ -377,6 +383,32 @@
     Group Direct Messages

     Gowthaman Chinnathambi (gowthamancdev@gmail.com)
     2024-01-18 23:27:54
     @Gowthaman Chinnathambi has joined the channel

diff --git a/channel/dev-discuss/index.html b/channel/dev-discuss/index.html
index aee329f..68bd692 100644
--- a/channel/dev-discuss/index.html
+++ b/channel/dev-discuss/index.html
@@ -52,6 +52,12 @@
     Public Channels

+  •
+    # jobs
+  •
     # london-meetup

@@ -549,7 +555,7 @@
     Group Direct Messages

     2023-11-15 04:35:37
     *Thread Reply:* hey look, more fun
     https://github.com/OpenLineage/OpenLineage/pull/2263

@@ -576,7 +582,7 @@
     2023-11-15 05:03:58
     *Thread Reply:* nice to have fun with you Jakub

@@ -606,7 +612,7 @@
     2023-11-15 05:42:34
     *Thread Reply:* Can't wait to see it on the 1st of January.

@@ -632,7 +638,7 @@
     2023-11-15 06:56:03
     *Thread Reply:* Ain't no party like a dev ex improvement party

@@ -658,7 +664,7 @@
     2023-11-15 11:45:53
     *Thread Reply:* A Gentoo installation party is in a similar category of fun

@@ -764,7 +770,7 @@
     2023-11-15 03:36:02
     *Thread Reply:* future 😉

@@ -790,7 +796,7 @@
     2023-11-15 03:36:10
     *Thread Reply:* ok

@@ -816,7 +822,7 @@
     2023-11-15 03:36:23
     *Thread Reply:* I will then replace enum with string

@@ -926,7 +932,7 @@
     2023-11-15 03:36:33
     *Thread Reply:* this is the next to go

@@ -952,7 +958,7 @@
     2023-11-15 03:36:38
     *Thread Reply:* and I consider it ready

@@ -978,7 +984,7 @@
     2023-11-15 03:37:31
     *Thread Reply:* Then we have a draft one with streaming support, https://github.com/MarquezProject/marquez/pull/2682/files -> which has an integration test of the lineage endpoint working for streaming jobs

@@ -1004,7 +1010,7 @@
     2023-11-15 03:38:32
     *Thread Reply:* I still need to work on #2682 but you can review #2654. Once you get some sleep, of course 😉

@@ -1064,7 +1070,7 @@
     2023-11-15 12:24:27
     *Thread Reply:* did you check if LineageCollector is instantiated once per process?

@@ -1090,7 +1096,7 @@
     2023-11-15 12:26:37
     *Thread Reply:* Using it only via get_hook_lineage_collector
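     [A minimal sketch of the once-per-process pattern under discussion: one module-level instance behind an accessor, so every call site in a process shares the same collector. An illustrative assumption, not the actual implementation.]

        from typing import Optional

        class LineageCollector:
            """Accumulates lineage metadata reported by hooks during a run."""

            def __init__(self) -> None:
                self.collected_datasets: list = []

        _collector: Optional[LineageCollector] = None

        def get_hook_lineage_collector() -> LineageCollector:
            # Created lazily, once per process; every later call returns
            # the same object, which answers the question above.
            global _collector
            if _collector is None:
                _collector = LineageCollector()
            return _collector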

@@ -1119,12 +1125,12 @@
     is it time to support hudi?

@@ -1220,7 +1226,7 @@
     2023-11-15 14:58:08
     *Thread Reply:* Maybe something like "OL has many desirable integrations, including a best-in-class Spark integration, but it's like any other open standard in that it requires contributions in order to approach total coverage. Thankfully, we have many active contributors, and integrations are being added or improved upon all the time."

@@ -1246,7 +1252,7 @@
     2023-11-15 16:04:51
     *Thread Reply:* Maybe rephrase pain points to "something we're not actively focusing on"

@@ -1298,7 +1304,7 @@
     2023-11-15 16:53:09
     *Thread Reply:* you are now admin

@@ -1354,7 +1360,7 @@
     2023-11-15 17:33:19
     *Thread Reply:* we have it in SQL operators

@@ -1380,7 +1386,7 @@
     2023-11-15 17:34:25
     *Thread Reply:* Ooh, any docs/code? Or, if you'd like, respond in the MQZ Slack 🙏

@@ -1406,7 +1412,7 @@
     2023-11-15 17:35:19
     *Thread Reply:* I'll reply there

@@ -1462,7 +1468,7 @@
     2023-11-15 19:32:17
     *Thread Reply:* What about GitHub Projects?

@@ -1492,7 +1498,7 @@
     2023-11-16 09:27:46
     *Thread Reply:* Projects is the way to go, thanks

@@ -1518,7 +1524,7 @@
     2023-11-16 10:23:34
     *Thread Reply:* Set up a Projects board. New projects are private by default. We could make it public. The one thing that's missing that we could use is a built-in date field for alerting about upcoming deadlines…

@@ -1574,7 +1580,7 @@
     2023-11-16 09:31:59
     *Thread Reply:* https://newsroom.accenture.com/news/2023/accenture-to-expand-government-transformation-capabilities-in-the-uk-with-acquisition-of-6point6

@@ -1625,7 +1631,7 @@
     2023-11-16 10:03:27
     *Thread Reply:* We should sell OL to governments

@@ -1655,7 +1661,7 @@
     2023-11-16 10:20:36
     *Thread Reply:* we may have to rebrand to ClosedLineage

@@ -1681,7 +1687,7 @@
     2023-11-16 10:23:37
     *Thread Reply:* not in this way; just emit every event a second time to a secret NSA endpoint

@@ -1707,7 +1713,7 @@
     2023-11-16 11:13:17
     *Thread Reply:* we would need to improve our stock photo game

@@ -1760,7 +1766,7 @@
     2023-11-16 12:42:56
     *Thread Reply:* thanks, updated the talks board

@@ -1786,7 +1792,7 @@
     2023-11-16 12:43:10
     *Thread Reply:* https://github.com/orgs/OpenLineage/projects/4/views/1

@@ -1812,7 +1818,7 @@
     2023-11-16 15:19:53
     *Thread Reply:* I'm in, will think about what to talk about and appreciate any advice 🙂

@@ -1889,7 +1895,7 @@
     2023-11-17 13:47:21
     *Thread Reply:* It looks like the datahub airflow plugin uses OL, but turns it off:
     https://github.com/datahub-project/datahub/blob/2b0811b9875d7d7ea11fb01d0157a21fdd67f020/docs/lineage/airflow.md
     disable_openlineage_plugin true - Disable the OpenLineage plugin to avoid duplicative processing.
     They reuse the extractors but then "patch" the behavior.

@@ -2104,7 +2110,7 @@
     2023-11-17 13:48:52
     *Thread Reply:* Of course this approach will need changing again with AF 2.7

@@ -2130,7 +2136,7 @@
     2023-11-17 13:49:02
     *Thread Reply:* It's their choice 🤷

@@ -2156,7 +2162,7 @@
     2023-11-17 13:51:23
     *Thread Reply:* It looks like we can possibly learn from their approach to SQL parsing: https://datahubproject.io/docs/lineage/airflow/#automatic-lineage-extraction

@@ -2203,7 +2209,7 @@
     2023-11-17 16:42:51
     *Thread Reply:* what's that approach? I only know they have been claiming the best SQL parsing capabilities

@@ -2229,7 +2235,7 @@
     2023-11-17 20:54:48
     *Thread Reply:* I haven't looked into the details, but I'm assuming it is in this repo. (My comment is entirely based on the claim here.)

@@ -2255,7 +2261,7 @@
     2023-11-20 02:58:07
     *Thread Reply:* https://www.acryldata.io/blog/extracting-column-level-lineage-from-sql -> The interesting difference is that, in order to find table schemas, they use their data catalog to evaluate column-level lineage instead of doing this on the client side.
     My understanding by example is: if you do create table x as select * from y
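     [A sketch of the catalog-side resolution being described: expanding select * into column-level lineage requires a schema lookup that the client alone may not be able to do. The catalog contents and names are hypothetical.]

        # Hypothetical schema catalog: table name -> ordered column list.
        CATALOG = {"y": ["id", "name", "amount"]}

        def column_lineage_for_ctas(target: str, source: str) -> dict:
            """For `create table <target> as select * from <source>`, resolve
            the star against the catalog so each target column maps back to
            its origin column."""
            columns = CATALOG[source]  # without the catalog, `*` is opaque
            return {f"{target}.{col}": [f"{source}.{col}"] for col in columns}

        # column_lineage_for_ctas("x", "y") ->
        # {"x.id": ["y.id"], "x.name": ["y.name"], "x.amount": ["y.amount"]}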
@@ -2290,7 +2296,7 @@
-    ❤️ Jakub Dardziński, Maciej Obuchowski, Paweł Leszczyński, Harel Shein, Ross Turk
+    ❤️ Jakub Dardziński, Maciej Obuchowski, Paweł Leszczyński, Harel Shein, Ross Turk, Willy Lulciuc

@@ -2341,7 +2347,7 @@
     2023-11-21 09:27:22
     *Thread Reply:* Ah! That would have been a good idea, but I can't :(

@@ -2367,7 +2373,7 @@
     2023-11-21 09:27:44
     *Thread Reply:* Do you prefer an earlier meeting tomorrow?

@@ -2393,7 +2399,7 @@
     2023-11-21 09:28:54
     *Thread Reply:* maybe let's keep today's meeting then

@@ -2408,6 +2414,12684 @@
     Michael Robinson (michael.robinson@astronomer.io)
     2023-11-22 09:23:31
     The full project history is now available at https://openlineage.github.io/slack-archives/. Check it out!
     🙌 Harel Shein, Maciej Obuchowski, Paweł Leszczyński

     Jakub Dardziński (jakub.dardzinski@getindata.com)
     2023-11-22 09:25:38
     *Thread Reply:* nice one! 👏

     Harel Shein (harel.shein@gmail.com)
     2023-11-22 09:49:12
     *Thread Reply:* very cool!

     Ross Turk (ross@rossturk.com)
     2023-11-22 12:08:24
     *Thread Reply:* tfw you thought the scrollback was gone 😳

     Harel Shein (harel.shein@gmail.com)
     2023-11-22 12:09:41
     *Thread Reply:* slack has a good activation story, I wonder how much longer they can keep this up for

     Ross Turk (ross@rossturk.com)
     2023-11-22 12:10:08
     *Thread Reply:* always nice to be reminded that there are no actual incremental costs on their end

     Harel Shein (harel.shein@gmail.com)
     2023-11-22 12:11:23
     *Thread Reply:* I guess it's the difference between storing your data in memory vs. on a glacier 🧊

     Ross Turk (ross@rossturk.com)
     2023-11-22 12:36:12
     *Thread Reply:* ah yes, surely there is some tiering going on

     Harel Shein (harel.shein@gmail.com)
     2023-11-27 11:44:20
     anyone seen this PR from decathlon?

     Willy Lulciuc (willy@datakin.com)
     2023-11-27 14:32:03
     *Thread Reply:* i might get to marquez slack/PRs today, but most likely tmr morning

     Harel Shein (harel.shein@gmail.com)
     2023-11-27 14:32:52
     *Thread Reply:* If you're looking for priorities, it would be really great if you could give feedback on one of @Paweł Leszczyński's streaming support PRs today
     👍 Willy Lulciuc

     Willy Lulciuc (willy@datakin.com)
     2023-11-27 14:34:01
     *Thread Reply:* ok, I'll get to the streaming PR first

     Willy Lulciuc (willy@datakin.com)
     2023-11-27 14:34:29
     *Thread Reply:* FYI, the namespace filtering is a good idea, just needs some feedback on impl / naming

     Michael Robinson (michael.robinson@astronomer.io)
     2023-11-27 14:31:11
     Jens would like to know if there's anything we want included in the welcome portion of the slide deck. Suggestions? (Aside from the usual links)

     Willy Lulciuc (willy@datakin.com)
     2023-11-28 02:35:56
     @Paweł Leszczyński I reviewed your PR today (mainly the logic on versioning for streaming jobs); here is the main versioning limitation for jobs: a new JobVersion is created only when a job run completes or fails (or is in the done state); that is, we don't know if we have received all the input/output datasets, so we hold off on creating a new job version until we do.
     For streaming, we'll need to create a job version on start. Do we assume we have all input/output datasets associated with the streaming job? Does OpenLineage guarantee this to be the case for streaming jobs? Having versioning logic for batch vs streaming is a reasonable solution, just want to clarify.

     Paweł Leszczyński (pawel.leszczynski@getindata.com)
     2023-11-28 02:38:00
     *Thread Reply:* yes, the logic adds a distinction on how to create the job version per processing type. For streaming, I think it makes more sense to create it at the beginning. Then, within other events of the same run, we need to check if the version has changed, and create a new version in that case.

     Willy Lulciuc (willy@datakin.com)
     2023-11-28 02:48:40
     *Thread Reply:* would we want to use the same versioning func Utils.newJobVersionFor() for streaming? That is, should we assume the input/output datasets contained within the OL event are the "current" set for the streaming job?

     Willy Lulciuc (willy@datakin.com)
     2023-11-28 02:50:19
     *Thread Reply:* that is,
     2 input streams, 1 output stream (version 1)
     then, 1 input stream, 2 output streams (version 2)
     ...

     Willy Lulciuc (willy@datakin.com)
     2023-11-28 02:51:16
     *Thread Reply:* but what about the case when the in/out streams are not present:
     1 input stream, 2 output streams (version 2)
     then, 1 input stream, 0 output streams (version 3)
     ...

     Paweł Leszczyński (pawel.leszczynski@getindata.com)
     2023-11-28 03:08:36
     *Thread Reply:* The meaning for the streaming events should be slightly different.
     For batch, input and output datasets are cumulative across all the events. If we have an event with output datasets A + B, then another event with output datasets B + C, then we assume the job has output datasets A + B + C.
     For streaming, we may have a streaming job that for a week was reading data from topics A + B, and then in the next week it was reading from B + C. I think this should be mimicked in different job versions. Making it cumulative for jobs that run for several weeks does not make that much sense to me. The problem here is: what happens if a producer sends some extra events with no input/output datasets specified, like the amount of bytes read? Shall we treat it as a new version? If not, why not?
     This part is missing in the PR, and our Flink integration always sends all the inputs & outputs. I can add extra logic that will prevent creating a new job version if an event has no input nor output datasets. However, I can't see any clean and generic solution to this.

     Willy Lulciuc (willy@datakin.com)
     2023-11-28 14:07:01
     *Thread Reply:* > The problem here is: what happens if a producer sends some extra events with no input/output datasets specified, like the amount of bytes read? Shall we treat it as a new version? If not, why not?
     We can view the bytes read as additional metadata about the job's inputs/outputs that wouldn't trigger a new version (for the job or dataset). I would associate the bytes with the current dataset version and sum them up (I've read X bytes from dataset version D); you can also view tags in a similar way. In our current versioning logic for datasets, we create a new dataset version when a job completes; I think we'll want to do something similar for streaming jobs, i.e., when X bytes are written to a given dataset, that would trigger a new version.

     Willy Lulciuc (willy@datakin.com)
     2023-11-28 14:11:17
     *Thread Reply:* > I can add extra logic that will prevent creating a new job version if an event has no input nor output datasets
     Yes, if no in/out datasets are present, then I wouldn't create a new job version. @Julien Le Dem opened an issue a while back about this: https://github.com/MarquezProject/marquez/issues/1513. That is, there's a difference between an empty set [ ] and null.
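     [A sketch of the rule this thread converges on: streaming jobs get a version at start and a new one only when a declared input/output set changes; events carrying no datasets at all (e.g., bytes-read updates) never trigger a version; and an empty list counts as a declared, empty set, unlike null. The names and event shape below are hypothetical, not the Marquez implementation.]

        def should_create_job_version(event: dict, current_version) -> bool:
            inputs, outputs = event.get("inputs"), event.get("outputs")

            # No datasets at all (null, not empty): a metadata-only event,
            # such as a bytes-read update -- never a new version.
            if inputs is None and outputs is None:
                return False

            # Streaming: version on the first event of the run, then only
            # when the declared I/O set differs from the current version's.
            if event["job_processing_type"] == "STREAMING":
                if current_version is None:
                    return True
                return (inputs, outputs) != current_version.io_sets  # hypothetical attr

            # Batch keeps the existing behavior: version once the run is
            # done and all inputs/outputs are known. An empty list [] still
            # counts as a declared (empty) set, unlike null.
            return event["event_type"] in ("COMPLETE", "FAIL")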
     Willy Lulciuc (willy@datakin.com)
     2023-11-28 14:11:51
     *Thread Reply:* > This part is missing in the PR, and our Flink integration always sends all the inputs & outputs
     This is very important to note in the code and/or API docs.

     Paweł Leszczyński (pawel.leszczynski@getindata.com)
     2023-11-29 04:48:36
     *Thread Reply:* Sure, we should. Just wanted to make sure this is the way we want to go.
     🙏 Willy Lulciuc

     Paweł Leszczyński (pawel.leszczynski@getindata.com)
     2023-11-30 09:53:22
     *Thread Reply:* @Willy Lulciuc did you have a chance to look at this as well: https://github.com/MarquezProject/marquez/pull/2654 ? This should be merged before streaming support, I believe.
     👀 Willy Lulciuc

     Willy Lulciuc (willy@datakin.com)
     2023-11-30 14:13:27
     *Thread Reply:* ahh sorry, I hadn't realized they were related / dependent on one another. sure, I'll give the PR a pass

     Paweł Leszczyński (pawel.leszczynski@getindata.com)
     2023-12-01 09:15:30
     *Thread Reply:* I looked into your comments and found them, as always, really useful. I introduced changes based on most of them. Please take a look at my responses within the Job model class. I think there is one issue we still need to discuss.
     What to do with the existing type field? I would opt for deprecating it, as within the introduced job facet, the notion of jobType stands for QUERY|COMMAND|DAG|TASK|JOB|MODEL, while processingType determines if a job is batch or streaming.
     (1) One solution I see is deprecating type and introducing a JobLabels class as a property within Job, with fields like jobType, processingType, integration.
     (2) Another would be to send processingType within the existing type field. This would mimic the existing API but require further work. The disadvantage is that we still have a mismatch between the job type in Marquez and the OpenLineage spec.
     I would opt for (2), but (1) works for me as well.
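     [For reference, a sketch of the job type facet fields being mapped here, written as a Python dict; the values shown are illustrative examples, not a normative spec excerpt.]

        # Sketch of an OpenLineage job carrying a "jobType" facet.
        job = {
            "namespace": "example-namespace",
            "name": "example-job",
            "facets": {
                "jobType": {
                    "processingType": "STREAMING",  # BATCH or STREAMING
                    "integration": "FLINK",         # producing integration
                    "jobType": "JOB",               # QUERY|COMMAND|DAG|TASK|JOB|MODEL
                }
            },
        }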
     Michael Robinson (michael.robinson@astronomer.io)
     2023-11-28 16:57:50
     I'm working on a redesign of the Ecosystem page for a more modern, user-friendly layout. It's a work in progress, and feedback is welcome: https://github.com/OpenLineage/docs/pull/258

     Michael Robinson (michael.robinson@astronomer.io)
     2023-11-29 11:37:41
     Can someone count the folks in the room please? Can't see anyone other than the speaker

     Maciej Obuchowski (maciej.obuchowski@getindata.com)
     2023-11-29 12:08:38
     @Michael Robinson can you hear the questions?
     👍 Michael Robinson

     Michael Robinson (michael.robinson@astronomer.io)
     2023-11-29 12:16:11
     *Thread Reply:* I could hear all but one of the questions after the first talk

     Maciej Obuchowski (maciej.obuchowski@getindata.com)
     2023-11-29 12:16:39
     *Thread Reply:* Oh, then it's better than I thought

     Ross Turk (ross@rossturk.com)
     2023-11-29 12:43:44
     I just had a lovely conversation at re:Invent with the CTO of dbt, Connor, and didn't even know it was him until the end 🤯
     🙃 Harel Shein, Willy Lulciuc, Jakub Dardziński, Paweł Leszczyński, Maciej Obuchowski

     Michael Robinson (michael.robinson@astronomer.io)
     2023-11-29 14:38:01
     Congrats on a great event!
     🙌 Maciej Obuchowski

     Maciej Obuchowski (maciej.obuchowski@getindata.com)
     2023-11-30 05:35:30
     *Thread Reply:* Yeah, it was pretty nice 🙂
     A lot of good discussions with Google people. Also, Jarek Potiuk was there

     Maciej Obuchowski (maciej.obuchowski@getindata.com)
     2023-11-30 05:36:29
     *Thread Reply:* I think it won't be the last Warsaw OpenLineage meetup
     🎉 Paweł Leszczyński

     Jakub Dardziński (jakub.dardzinski@getindata.com)
     2023-11-30 07:17:49
     https://openlineage.slack.com/archives/C01CK9T7HKR/p1701288000527449
     putting it here. I don't feel like I'm the best person to answer, but I feel like the operational lineage we're trying to provide is the thing
     [Preview: message from Stefan Krawczyk]

     Harel Shein (harel.shein@gmail.com)
     2023-11-30 09:26:25
     created a project for blog post ideas: https://github.com/orgs/OpenLineage/projects/5/views/1
     :gratitude_thank_you: Michael Robinson

     Michael Robinson (michael.robinson@astronomer.io)
     2023-11-30 11:15:28
     Release update: we're due for an OpenLineage release and overdue for a Marquez release. As tomorrow, the 1st, is a Friday, we should wait until Monday at the earliest. I'm planning to open a vote for an OL release then, but Marquez CI is red, so I'm holding off on a Marquez release for the time being.
     Willy Lulciuc (willy@datakin.com)
     2023-11-30 19:50:45
     *Thread Reply:* I can address the red CI status; it's bc we're seeing issues publishing our snapshots

     Willy Lulciuc (willy@datakin.com)
     2023-11-30 19:51:09
     *Thread Reply:* I think we should release Marquez on Mon. as well

     Michael Robinson (michael.robinson@astronomer.io)
     2023-11-30 19:52:12
     *Thread Reply:* 👍

     Maciej Obuchowski (maciej.obuchowski@getindata.com)
     2023-12-01 05:04:26
     *Thread Reply:* I want to get this into the OL release: https://github.com/OpenLineage/OpenLineage/pull/2284

     Willy Lulciuc (willy@datakin.com)
     2023-11-30 19:51:43
     would be interesting if we can use this comparison as a learning (improve docs, etc.): https://blog.singleorigin.tech/race-to-the-finish-line-age/
     ➕ Harel Shein, Paweł Leszczyński

     Willy Lulciuc (willy@datakin.com)
     2023-11-30 19:52:10
     or rather, use the format for comparing OL with other tools 😉
     👀 Michael Robinson

     Michael Robinson (michael.robinson@astronomer.io)
     2023-12-01 10:12:21
     *Thread Reply:* It would be nice to have something like this (I would want it to be a little more even-handed, though). It will be interesting to see if they ever update this now that there's automated lineage from Airflow supported by OL

     Michael Robinson (michael.robinson@astronomer.io)
     2023-12-01 08:58:19
     Review needed of the newsletter section on Airflow Provider progress, @Jakub Dardziński @Maciej Obuchowski, when you have a moment. It will ship by 5 PM ET today, fyi. Already shared it with you. Thanks!
     👍 Maciej Obuchowski

     Jakub Dardziński (jakub.dardzinski@getindata.com)
     2023-12-01 10:45:56
     *Thread Reply:* LGTM 👍

     Michael Robinson (michael.robinson@astronomer.io)
     2023-12-01 11:02:04
     *Thread Reply:* Thanks @Jakub Dardziński

     Michael Robinson (michael.robinson@astronomer.io)
     2023-12-01 10:21:23
     They finally uploaded the OpenLineage Airflow Summit videos to the Airflow channel on YT: https://www.youtube.com/@ApacheAirflow/videos

     Michael Robinson (michael.robinson@astronomer.io)
     2023-12-01 11:13:14
     On Monday I'm meeting with someone at Confluent about organizing a meetup in London in January. I'm thinking I'll suggest Jan. 24 or 31, as mid-week days work better and folks need time to come back from vacation. If you have thoughts on this, would you please let me know by 10:00 am ET on Monday? Also, standup will be happening before the meeting; perhaps we can discuss it then. @Harel Shein
     👍 Harel Shein, Maciej Obuchowski

     Michael Robinson (michael.robinson@astronomer.io)
     2023-12-04 10:20:59
     *Thread Reply:* Confluent says January 31st will work for them for a London meetup, and they'll be providing a speaker as well. Is it safe to firm this up with them?
     🎉 Harel Shein, Maciej Obuchowski

     Jakub Dardziński (jakub.dardzinski@getindata.com)
     2023-12-04 10:37:42
     *Thread Reply:* I'd say yes; eventually, if Maciej doesn't get a new passport by then, I can speak

     Maciej Obuchowski (maciej.obuchowski@getindata.com)
     2023-12-04 10:44:17
     *Thread Reply:* I already got the photos 😂

     Jakub Dardziński (jakub.dardzinski@getindata.com)
     2023-12-04 10:44:41
     *Thread Reply:* you gotta share them

     Maciej Obuchowski (maciej.obuchowski@getindata.com)
     2023-12-04 10:45:36
     *Thread Reply:* Also, apparently it's possible to get a temporary passport at the airport in 15 minutes

     Michael Robinson (michael.robinson@astronomer.io)
     2023-12-04 10:57:40
     *Thread Reply:* How civilized...

     Harel Shein (harel.shein@gmail.com)
     2023-12-04 17:07:10
     *Thread Reply:* if the price is right 🙂

     Jakub Dardziński (jakub.dardzinski@getindata.com)
     2023-12-04 17:08:55
     *Thread Reply:* you can get it at the Warsaw airport just like a last-minute passport; it costs barely anything (30 PLN, which is ~7-8 USD)

     Harel Shein (harel.shein@gmail.com)
     2023-12-04 17:09:21
     *Thread Reply:* wow, that's impressive!

     Jakub Dardziński (jakub.dardzinski@getindata.com)
     2023-12-04 17:10:19
     *Thread Reply:* yeah, many people are surprised how developed our public services can be

     Maciej Obuchowski (maciej.obuchowski@getindata.com)
     2023-12-05 06:44:04
     *Thread Reply:* tbh it's always random, can be good, can be shit 🙂

     Maciej Obuchowski (maciej.obuchowski@getindata.com)
     2023-12-05 06:44:21
     *Thread Reply:* lately it's definitely been better than 10 years ago tho

     Michael Robinson (michael.robinson@astronomer.io)
     2023-12-01 15:34:09
     Feedback requested on the newsletter:
     Jakub Dardziński (jakub.dardzinski@getindata.com)
     2023-12-04 14:38:30
     https://blog.datahubproject.io/extracting-column-level-lineage-from-sql-779b8ce17567
     https://datastation.multiprocess.io/blog/2022-04-11-sql-parsers.html

     Jakub Dardziński (jakub.dardzinski@getindata.com)
     2023-12-04 14:40:02
     *Thread Reply:* [6] Note that this isn't a fully fair comparison, since the DataHub one had access to the underlying schemas whereas the other parsers don't accept that information.
     🙂

     Harel Shein (harel.shein@gmail.com)
     2023-12-04 17:00:33
     *Thread Reply:* it's open source, should we consider testing it out?

     Harel Shein (harel.shein@gmail.com)
     2023-12-04 17:00:53
     *Thread Reply:* I'm not sure about the methodology, but these numbers are pretty significant

     Jakub Dardziński (jakub.dardzinski@getindata.com)
     2023-12-04 17:06:43
     *Thread Reply:* We tested on a corpus of ~7000 BigQuery SELECT statements and ~2000 CREATE TABLE ... AS SELECT (CTAS) statements.⁶

     Paweł Leszczyński (pawel.leszczynski@getindata.com)
     2023-12-05 04:09:18
     *Thread Reply:* More doctors smoke Camels than any other cigarette 😉 If you test on BigQuery, you will not get comparable results for Snowflake, for example.
     Wondering if we can do anything about this. We could write a blog post on lineage extraction from Snowflake SQL queries. This is something we spent time on, and possibly we support dialect-specific queries that others don't.

     Jakub Dardziński (jakub.dardzinski@getindata.com)
     2023-12-05 04:12:45
     *Thread Reply:* it all comes down to the question of whether we should start publishing comparisons

     Paweł Leszczyński (pawel.leszczynski@getindata.com)
     2023-12-05 04:18:46
     *Thread Reply:* We could also accept schema information in our SQL lineage parser. Actually, that would have been a good idea, I believe.

     Jakub Dardziński (jakub.dardzinski@getindata.com)
     2023-12-05 04:24:59
     *Thread Reply:* for the select * use case?
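     [A sketch of what accepting schema information might look like; the schemas parameter and this parse function are hypothetical, not the current openlineage-sql API. It is the client-side counterpart of the catalog-side expansion sketched earlier.]

        from typing import Dict, List, Optional

        def parse(sql: str, schemas: Optional[Dict[str, List[str]]] = None) -> List[str]:
            """Return the query's output columns; with `schemas` supplied,
            `select *` becomes expandable. Sketch only: a real parser
            would build an AST instead of splitting strings."""
            source = sql.lower().rsplit("from", 1)[-1].strip().rstrip(";")
            if schemas and "*" in sql:
                return schemas.get(source, [])
            return []  # without schemas, the star stays unresolved

        print(parse("SELECT * FROM y", {"y": ["id", "name", "amount"]}))
        # ['id', 'name', 'amount']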
     👍 Paweł Leszczyński

     Michael Robinson (michael.robinson@astronomer.io)
     2023-12-04 15:42:36
     Release vote is here when you get a moment: https://openlineage.slack.com/archives/C01CK9T7HKR/p1701722066253149
     👍 Harel Shein

     Kacper Muda (kacper.muda@getindata.com)
     2023-12-05 05:15:18
     @Kacper Muda has joined the channel

     Maciej Obuchowski (maciej.obuchowski@getindata.com)
     2023-12-05 05:15:30
     Should we disable openlineage-airflow on Airflow 2.8 to force people to use the provider?
     👍 Kacper Muda, Harel Shein

     Michael Robinson (michael.robinson@astronomer.io)
     2023-12-05 11:12:52
     *Thread Reply:* it sounds like maybe something about this should be included in the 2.8 docs. The dev rel team is talking about the marketing around 2.8 right now…

     Michael Robinson (michael.robinson@astronomer.io)
     2023-12-05 11:14:48
     *Thread Reply:* also, the release will evidently be out next Thursday

     Maciej Obuchowski (maciej.obuchowski@getindata.com)
     2023-12-05 11:15:38
     *Thread Reply:* I mean, openlineage-airflow is not part of Airflow

     Maciej Obuchowski (maciej.obuchowski@getindata.com)
     2023-12-05 11:15:50
     *Thread Reply:* We'd have the provider for 2.8

     Michael Robinson (michael.robinson@astronomer.io)
     2023-12-05 11:16:40
     *Thread Reply:* so maybe the Airflow newsletter would be better

     Michael Robinson (michael.robinson@astronomer.io)
     2023-12-05 11:18:29
     *Thread Reply:* is there anything about the provider that should be in the 2.8 marketing?

     Maciej Obuchowski (maciej.obuchowski@getindata.com)
     2023-12-05 11:19:04
     *Thread Reply:* I don't think so

     Michael Robinson (michael.robinson@astronomer.io)
     2023-12-05 11:25:09
     *Thread Reply:* Kenten wants to mention that it will be turned off in the 2.8 docs, so please lmk if anything about this changes

     Maciej Obuchowski (maciej.obuchowski@getindata.com)
     2023-12-05 06:20:00
     https://github.com/OpenLineage/docs/pull/263
     ✅ Michael Robinson

     Michael Robinson (michael.robinson@astronomer.io)
     2023-12-05 14:32:00
     Changelog PR for 1.6.0: https://github.com/OpenLineage/OpenLineage/pull/2298

     Jakub Dardziński (jakub.dardzinski@getindata.com)
     2023-12-05 14:36:27
     *Thread Reply:* that's weird, ruff-lint found issues, especially when it has the ruff version pinned

     Jakub Dardziński (jakub.dardzinski@getindata.com)
     2023-12-05 14:36:46
     *Thread Reply:* CHANGELOG.md:10: acccording ==> according
     this change is accurate though 🙂

     Michael Robinson (michael.robinson@astronomer.io)
     2023-12-05 14:45:42
     *Thread Reply:* I tried to sneak in a fix in dev but the linter didn't like it, so I changed it back. All set now

     Michael Robinson (michael.robinson@astronomer.io)
     2023-12-05 14:45:58
     *Thread Reply:* The release is in progress

     Jakub Dardziński (jakub.dardzinski@getindata.com)
     2023-12-05 14:47:05
     *Thread Reply:* ah, gotcha
     dev/get_changes.py:49:17: E722 Do not use bare `except`
     dev/get_changes.py:49:17: S112 `try`-`except`-`continue` detected, consider logging the exception
     for next time, just add except Exception: instead of except: 🙂
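     [The suggested fix, sketched against a hypothetical loop like the one ruff flagged in dev/get_changes.py:]

        import logging

        logger = logging.getLogger(__name__)

        def process_all(changes):
            # `changes` and `process` are hypothetical stand-ins for the
            # real loop body in dev/get_changes.py.
            for change in changes:
                try:
                    process(change)
                except Exception:  # E722: avoid a bare `except:`
                    # S112: log the exception instead of silently continuing
                    logger.exception("failed to process change %r", change)
                    continue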

    + + + +
    + 👍 Michael Robinson +
    + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Michael Robinson + (michael.robinson@astronomer.io) +
    +
    2023-12-05 14:50:29
    +
    +

    *Thread Reply:* GTK, thank you

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Michael Robinson + (michael.robinson@astronomer.io) +
    +
    2023-12-05 14:50:02
    +
    +

    The release-integration-flink job failed with this error message: +Execution failed for task ':examples:stateful:compileJava'. +&gt; Could not resolve all files for configuration ':examples:stateful:compileClasspath'. + &gt; Could not find io.**********************:**********************_java:1.6.0-SNAPSHOT. + Required by: + project :examples:stateful

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2023-12-05 15:06:17
    +
    +

    *Thread Reply:* No cache is found for key: v1-release-client-java--rOhZzScpK7x+jzwfqkQVwOVgqXO91M7VEEtzYHNvSmY= +Found a cache from build 155811 at v1-release-client-java- +is this standard behaviour?

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2023-12-05 15:08:03
    +
    +

    *Thread Reply:* well, same happened for 1.5.0 and it worked

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-05 15:09:11

    *Thread Reply:* we gotta wait for Maciej/Pawel :<

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-05 18:07:20

    *Thread Reply:* Looks like Gradle version got bumped and gives some problems

Michael Robinson (michael.robinson@astronomer.io)
2023-12-06 12:47:10

    *Thread Reply:* Think we can release by midday tomorrow?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-06 12:57:31

    *Thread Reply:* oh forgot about this totally

Michael Robinson (michael.robinson@astronomer.io)
2023-12-05 16:29:52

    Feedback sought on a redesign of the ecosystem page that (hopefully) freshens and modernizes the page: https://github.com/OpenLineage/docs/pull/258

Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 10:49:06

    Changelog PR for 1.6.1: https://github.com/OpenLineage/OpenLineage/pull/2301

Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 11:19:29

    1.6.1 release is in progress

Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 11:23:41

    *Thread Reply:* @Maciej Obuchowski the flink job failed again

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-07 11:34:24

    *Thread Reply:* 😢

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-07 11:35:09

    *Thread Reply:* well, at least it's a different error

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-07 11:39:27

    *Thread Reply:* one more try? https://github.com/OpenLineage/OpenLineage/pull/2302 @Michael Robinson

✅ Michael Robinson

Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 12:22:32

    *Thread Reply:* 1.6.2 changelog PR: https://github.com/OpenLineage/OpenLineage/pull/2304

Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 12:22:55

    *Thread Reply:* @Maciej Obuchowski 👆

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-07 12:24:45

    *Thread Reply:* merged 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-07 12:25:05

    *Thread Reply:* going out for a few hours, so next try would be tomorrow if it fails again...

👍 Michael Robinson

Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 13:14:14

    *Thread Reply:* Thanks, Maciej. That worked, and 1.6.2 is out.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-07 13:20:12

    *Thread Reply:* Great

Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 13:04:15

    Starting a thread for collaboration on the community meeting next week

Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 13:05:27

    *Thread Reply:* Releases: 1.6.2

Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 13:09:51

    *Thread Reply:* Open proposals: 2186, 2187, 2218, 2243, 2273, 2281, 2289, 2163, 2162, 2161

Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 13:10:13

    *Thread Reply:* 2023 recap/“best-of”?

Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 13:13:28

    *Thread Reply:* @Harel Shein any thoughts? Also, does anyone know if Julien will be back from vacation?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-07 13:20:39

*Thread Reply:* We should probably try to do something with the Google proposal

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-07 13:21:00

    *Thread Reply:* Not sure if it needs additional discussion, maybe just implementation?

Harel Shein (harel.shein@gmail.com)
2023-12-07 15:28:13

    *Thread Reply:* I can ask him, but it would probably be good if you could facilitate next week @Michael Robinson?

👍 Michael Robinson

Harel Shein (harel.shein@gmail.com)
2023-12-07 15:29:43

    *Thread Reply:* I agree that we need to address those Google proposals, we should ask Jens if he’s up for presenting and discussing them first?

✅ Michael Robinson

Harel Shein (harel.shein@gmail.com)
2023-12-07 15:30:48

    *Thread Reply:* maybe Pawel wants to present progress with https://github.com/OpenLineage/OpenLineage/issues/2162?

👍 Paweł Leszczyński

Michael Robinson (michael.robinson@astronomer.io)
2023-12-12 12:14:40

    *Thread Reply:* Still waiting on a response from Jens

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-12 12:22:11

    *Thread Reply:* I think Jens does not have a lot of time now

Michael Robinson (michael.robinson@astronomer.io)
2023-12-12 12:25:30

    *Thread Reply:* Emailed him in case he didn’t see the message

Michael Robinson (michael.robinson@astronomer.io)
2023-12-13 11:22:56

    *Thread Reply:* Jens confirmed

🎉 Harel Shein, Maciej Obuchowski

Michael Robinson (michael.robinson@astronomer.io)
2023-12-13 13:28:47

    *Thread Reply:* He will have to join about 15 minutes late

Harel Shein (harel.shein@gmail.com)
2023-12-10 12:13:00

Hey, I have a meeting scheduled with a few Ray committers from Anyscale for Tuesday December 12th at 8pm CET / 2pm ET / 11am PT. Would anyone like to join?
I think this would be a good entryway to lineage tracking for AI/LLM workflows. JFYI Ray is in Python.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-11 04:59:03

*Thread Reply:* would love to come but I'm at a friend's birthday at that time 😐

Willy Lulciuc (willy@datakin.com)
2023-12-11 20:26:40

*Thread Reply:* I’d love to as well, but have dinner plans 😕

Harel Shein (harel.shein@gmail.com)
2023-12-11 20:27:02

    *Thread Reply:* It’s 11am PT..

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-12 09:45:23

    *Thread Reply:* count me in if not too late

Harel Shein (harel.shein@gmail.com)
2023-12-12 09:46:36

    *Thread Reply:* not too late, adding you

Willy Lulciuc (willy@datakin.com)
2023-12-11 20:25:50

    @Paweł Leszczyński mind giving this PR a quick look? https://github.com/MarquezProject/marquez/pull/2700 … it’s a dep on https://github.com/MarquezProject/marquez/pull/2698

👀 Paweł Leszczyński

Willy Lulciuc (willy@datakin.com)
2023-12-12 04:39:08

    *Thread Reply:* thanks @Paweł Leszczyński for the +1 ❤️

Willy Lulciuc (willy@datakin.com)
2023-12-12 12:36:32

@Jakub Dardziński: In Marquez, metrics are exposed via the endpoint /metrics using prometheus (most of the custom metrics defined are here). Oddly enough, the prometheus roadmap states that they have yet to adopt OpenMetrics! But, you can backfill the metrics into prometheus. So, knowing this, I would move to using metrics core by Dropwizard and use an exporter to export metrics to datadog using metrics-datadog. The one major benefit here is that we can define a framework around defining custom metrics internally within Marquez using core Dropwizard libraries, and then enable the reporter via configuration to emit metrics in marquez.yml. For example:
metrics:
  frequency: 1 minute # Default is 1 second.
  reporters:
    - type: datadog
      .
      .
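(For context, a minimal sketch of Dropwizard's metrics block in marquez.yml using the built-in console reporter, just to show the shape of the configuration; the datadog reporter type above comes from the metrics-datadog module, so its exact options should be checked against that project:)

metrics:
  frequency: 1 minute  # how often reporters poll the metrics registry
  reporters:
    - type: console    # ships with Dropwizard; swap for the datadog reporter once wired in
      timeZone: UTC
      output: stdout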

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-12 12:41:26

*Thread Reply:* I tested this actually and it works
the only thing is traces, I found it very poor to just have metrics around function name

Willy Lulciuc (willy@datakin.com)
2023-12-12 12:43:07

    *Thread Reply:* I totally agree, although I feel metrics and tracing are two separate things here

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-12 12:44:39

    *Thread Reply:* I really appreciate your help and advice! 🙂

🙏 Willy Lulciuc

Willy Lulciuc (willy@datakin.com)
2023-12-12 12:46:00

    *Thread Reply:* Of course, happy to chime in here

Willy Lulciuc (willy@datakin.com)
2023-12-12 12:47:02

    *Thread Reply:* I’m just happy this is getting some much needed love 😄

Willy Lulciuc (willy@datakin.com)
2023-12-12 12:51:15

*Thread Reply:* Also, it seems like datadog uses OpenTelemetry:
> Datadog Distributed Tracing allows you easily ingest traces via the Datadog libraries and agent or via OpenTelemetry
And looks like OpenTelemetry has support for Dropwizard

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-12 12:52:24

    *Thread Reply:* yep, that's why I liked otel idea

Willy Lulciuc (willy@datakin.com)
2023-12-12 12:54:11

    *Thread Reply:* Also, here are the docs for DD + OpenTelemetry … so enabling OpenTelemetry in Marquez would be doable

Willy Lulciuc (willy@datakin.com)
2023-12-12 12:54:40

*Thread Reply:* and we can make all of that configurable via marquez.yml

Willy Lulciuc (willy@datakin.com)
2023-12-12 12:55:00

    *Thread Reply:* hit me up with any questions! (just know, there will be a delay)

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-03 16:11:56

    *Thread Reply:* > and we can make all of the configurable via marquez.yml +it ain’t that easy - we would need to build extended jar with OTEL agent which I think is way too much work compared to benefits. you can still configure via env vars or system properties

Willy Lulciuc (willy@datakin.com)
2023-12-12 18:47:37

    I’ve been looking into partitioning for psql, think there’s potential here for huge perf gains. Anyone have experience?

Willy Lulciuc (willy@datakin.com)
2023-12-12 18:50:50

    *Thread Reply:* partition ranges will give a boost by default

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-13 05:05:12

    *Thread Reply:* Which tables do you want to partition? Event ones?

Willy Lulciuc (willy@datakin.com)
2023-12-13 05:14:18

*Thread Reply:* • runs
• job_versions
• dataset_versions
• lineage_events
• and all the facets tables
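(For illustration, a minimal sketch of declarative range partitioning in Postgres for one of these tables; the columns are simplified stand-ins, not the actual Marquez schema:)

CREATE TABLE lineage_events (
    event_time  timestamptz NOT NULL,
    event       jsonb
) PARTITION BY RANGE (event_time);

-- one partition per month; queries constrained on event_time only scan matching partitions
CREATE TABLE lineage_events_2023_12 PARTITION OF lineage_events
    FOR VALUES FROM ('2023-12-01') TO ('2024-01-01');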

👍 Maciej Obuchowski

Willy Lulciuc (willy@datakin.com)
2023-12-12 20:03:11

@Paweł Leszczyński
• PR 2682 approved with minor comments on stream versioning logic / suggestions ✅
• PR 2654 approved with minor comment (we’ll want to do a follow-up analysis on the query perf improvements) ✅

❤️ Harel Shein, Paweł Leszczyński

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-13 05:03:23

*Thread Reply:* Thanks @Willy Lulciuc. I applied all the recent comments and merged 2654.

There is one discussion left in 2682, which I would like to resolve before merging. I added an extra comment on the implemented approach and I am open to hearing whether this is the approach we can go with.

@Julien Le Dem @Maciej Obuchowski the discussion is about when to create a new job version for a streaming job. No deep dive into the code is required to take part in it.
https://github.com/MarquezProject/marquez/pull/2682#discussion_r1425108745

Willy Lulciuc (willy@datakin.com)
2023-12-13 18:08:11

    *Thread Reply:* awesome, left my final thoughts 👍

Harel Shein (harel.shein@gmail.com)
2023-12-13 06:52:57

Maybe we should clarify the documentation on adding custom facets at the integration level? Wdyt?
https://openlineage.slack.com/archives/C01CK9T7HKR/p1702446541936589?thread_ts=1702033180.635339&channel=C01CK9T7HKR&message_ts=1702446541.936589

Kacper Muda (kacper.muda@getindata.com)
2023-12-13 08:23:49

Hey, I think it would help some people using the Airflow integration (with Airflow 2.6) if we release a patch version of the OL package with this PR included #2305. I am not sure what the release cycle is here, but maybe there is already an ETA on the next patch release? If so, please let me know 🙂 Thanks!

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-13 08:25:13

    *Thread Reply:* you gotta ask for the release in #general, 3 votes from committers approve immediate release 🙂

👍 Kacper Muda

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-14 05:14:10

    *Thread Reply:* @Michael Robinson 3 votes are in 🙂

Kacper Muda (kacper.muda@getindata.com)
2023-12-14 05:22:04

    *Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1702474416084989

Michael Robinson (michael.robinson@astronomer.io)
2023-12-14 12:42:10

    *Thread Reply:* Thanks for the ping. I replied in #general and will initiate the release as soon as possible.

Harel Shein (harel.shein@gmail.com)
2023-12-14 14:16:21

seems that we don’t output the correct namespace as in the naming doc for Kafka. we output the kafka server/broker URL as namespace (in the Flink integration specifically)
https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md#kafka
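(Concretely, per Naming.md the expected form vs. what the integration currently emits would look like this, with a placeholder broker and topic:)

expected: namespace = kafka://my-broker:9092, name = my-topic
emitted:  namespace = my-broker:9092, name = my-topic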

Harel Shein (harel.shein@gmail.com)
2023-12-14 18:01:15

    *Thread Reply:* @Paweł Leszczyński, would you be able to add the Kafka: prefix to the Kafka visitors in the flink integration tomorrow?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-15 02:51:13

*Thread Reply:* I am happy to do this. Just to make sure: the docs are correct and the flink implementation is missing the kafka:// prefix, right?

Harel Shein (harel.shein@gmail.com)
2023-12-15 05:46:57

    *Thread Reply:* Exactly

👍 Paweł Leszczyński

Harel Shein (harel.shein@gmail.com)
2023-12-15 08:24:47

    *Thread Reply:* Thanks @Paweł Leszczyński. made a couple of suggestions, but we can def merge without

Harel Shein (harel.shein@gmail.com)
2023-12-14 14:16:29

    we’re also missing Iceberg naming schema

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-15 02:53:21

*Thread Reply:* would love to discuss this first. If a user stores an iceberg table in S3, then should it conform to S3 naming or iceberg naming?

it's the S3 location which defines a dataset. iceberg is a format for accessing data, not an identifier as such.

Harel Shein (harel.shein@gmail.com)
2023-12-15 05:57:19

    *Thread Reply:* No rush, just something we noticed and that some people in the community are implementing their own patch for it.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-14 14:26:24

my next year goal is to have a programmatic way of using the naming convention

👍 Julien Le Dem, Harel Shein
❤️ Willy Lulciuc, Paweł Leszczyński

Harel Shein (harel.shein@gmail.com)
2023-12-14 15:59:26

    anyone heard of https://bitol.io/?

Willy Lulciuc (willy@datakin.com)
2023-12-14 16:17:55

    *Thread Reply:* nope but would be worth reaching out to them to see how we could collaborate? they’re part of the LFAI (sandbox): https://github.com/bitol-io/open-data-contract-standard

Willy Lulciuc (willy@datakin.com)
2023-12-14 16:18:39

    *Thread Reply:* background https://medium.com/profitoptics/data-contract-101-568a9adbf9a9

Willy Lulciuc (willy@datakin.com)
2023-12-14 16:20:17

    *Thread Reply:* ugh it’s yaml-based

Harel Shein (harel.shein@gmail.com)
2023-12-14 17:54:05

*Thread Reply:* We should still have a conversation :)

Willy Lulciuc (willy@datakin.com)
2023-12-14 19:20:29

    *Thread Reply:* @Michael Robinson soft ping 😉

Harel Shein (harel.shein@gmail.com)
2023-12-14 17:53:38

    If you want to join the conversation on Ray.io integration: https://join.slack.com/t/ray-distributed/shared_invite/zt-2635sz8uo-VW076XU6bKMEiFPCJWr65Q

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-14 18:10:49

    *Thread Reply:* is there any specific channel/conversation?

Harel Shein (harel.shein@gmail.com)
2023-12-14 19:50:05

*Thread Reply:* Yeah, but it’s private. Added you.
For everyone else, ping me on Slack when you join and I’ll add you.

Michael Robinson (michael.robinson@astronomer.io)
2023-12-15 12:07:39

    A vote to release Marquez 0.43.0 is open. We need one more: https://marquezproject.slack.com/archives/C01E8MQGJP7/p1702657403267769

🙌 Willy Lulciuc, Harel Shein

Michael Robinson (michael.robinson@astronomer.io)
2023-12-15 14:28:28

    *Thread Reply:* the changelog PR is RFR

Willy Lulciuc (willy@datakin.com)
2023-12-15 13:19:49

    AWS is making moves! https://github.com/aws-samples/aws-mwaa-openlineage

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-15 13:27:59

*Thread Reply:* the repo itself is pretty old, last updated 2mo ago, and it uses the OL package, not the provider (1.4.1)

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-15 13:28:13

    *Thread Reply:* still it's nice they're doing this :)

Michael Robinson (michael.robinson@astronomer.io)
2023-12-15 15:27:26

*Thread Reply:* since they’re using MWAA they won’t be affected by the turn-off coming with Airflow 2.8 for a while. Otherwise that would be a good excuse to get in touch with them

🙏 Willy Lulciuc

Shubham Mehta (shubhammehta.93@gmail.com)
2023-12-23 03:54:50

*Thread Reply:* I think this repo was related to the blog which was authored a while back - https://aws.amazon.com/blogs/big-data/automate-data-lineage-on-amazon-mwaa-with-openlineage/
No other moves from our end so far, at least from the MWAA team : )

👍 Jakub Dardziński, Willy Lulciuc, Maciej Obuchowski

Michael Robinson (michael.robinson@astronomer.io)
2023-12-15 14:47:24

    The release is finished. Slack post, etc., coming soon

❤️ Harel Shein, Willy Lulciuc

Willy Lulciuc (willy@datakin.com)
2023-12-15 16:24:52

    have we thought of making the SQL parser pluggable?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-15 16:35:47

    *Thread Reply:* what do you mean by that?

Willy Lulciuc (willy@datakin.com)
2023-12-15 16:43:51

*Thread Reply:* (this is coming from apple) like what if a user wanted to provide their own parser for SQL in place of the one shipped with our integrations

Willy Lulciuc (willy@datakin.com)
2023-12-15 16:44:27

*Thread Reply:* for example, if/when we integrate with DataHub, can they use their parser instead of the one provided

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-15 17:33:30

    *Thread Reply:* that would be difficult, we would need strong motivation for that 🫥

👍 Willy Lulciuc

Harel Shein (harel.shein@gmail.com)
2023-12-18 12:15:59

This question fits with what we said we would try to document more, can someone help them out with it this week?
https://openlineage.slack.com/archives/C063PLL312R/p1702683569726449

Michael Robinson (michael.robinson@astronomer.io)
2023-12-18 14:15:48

    Airflow 2.8 has been released. Are we still “turning off” the external Airflow integration with this one? What do Airflow users need to know to avoid unpleasant surprises? Kenten is open to including a note in the 2.8 blog post.

Kacper Muda (kacper.muda@getindata.com)
2023-12-19 08:52:59

    *Thread Reply:* As a newcomer here, I believe it would be wise to avoid supporting Airflow 2.8+ in the openlineage-airflow package. This approach would encourage users to transition to the provider package. It's important to clearly communicate that ongoing development and enhancements will be focused on the apache-airflow-providers-openlineage package, while the openlineage-airflow will primarily be updated for bug fixes. I'll look into whether this strategy is already noted in the documentation. If not, I will propose a documentation update.

➕ Harel Shein, Jakub Dardziński
:gratitude_thank_you: Michael Robinson

Kacper Muda (kacper.muda@getindata.com)
2023-12-21 08:10:17

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2330
Please let me know if some changes are required, I was not sure how to properly implement it.

Harel Shein (harel.shein@gmail.com)
2023-12-19 09:51:18

This looks cool, might be useful for us?
https://github.com/aklivity/zilla

Harel Shein (harel.shein@gmail.com)
2023-12-19 09:54:26

*Thread Reply:* the license is a bit weird, but should be ok for us.
it’s apache, unless you directly compete with the company that built it.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-21 09:32:34

    *Thread Reply:* tbh not sure how

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-21 09:33:43

*Thread Reply:* I think we should be focused on
1) being compatible with most popular solutions (kafka...)
2) being easy to integrate with (custom transports)
rather than forcing our opinionated way on how OpenLineage events should flow in customer architecture

Michael Robinson (michael.robinson@astronomer.io)
2023-12-19 10:09:42

    Apologies for having to miss today’s committer sync — I’ll be picking up my daughter from school

Harel Shein (harel.shein@gmail.com)
2023-12-20 14:13:00

WDYT about starting to add integration-specific channels and adding a little welcome bot for people when they join?

• spark-openlineage-dev
• spark-openlineage-users
• airflow-openlineage-dev
• airflow-openlineage-users
• flink-openlineage-dev
• flink-openlineage-users
etc…
👀 Michael Robinson

Willy Lulciuc (willy@datakin.com)
2023-12-20 15:10:22

    *Thread Reply:* the -dev and -users seems like overkill, but also understand that we may want to split user questions from development

Willy Lulciuc (willy@datakin.com)
2023-12-20 15:11:55

*Thread Reply:* maybe just shorten to spark-integration, flink-integration, etc. Or integrations-spark etc

Willy Lulciuc (willy@datakin.com)
2023-12-20 15:37:24

*Thread Reply:* we probably should consider a development and a welcome channel

Harel Shein (harel.shein@gmail.com)
2023-12-20 17:19:00

    *Thread Reply:* yeah.. makes sense to me. let’s leave this thread open for a few days so more people can chime in and then I’ll make a proposal based on that.

👍 Willy Lulciuc

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-21 01:57:27

    *Thread Reply:* makes sense to me

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-21 05:18:53

    *Thread Reply:* I think there is not enough discussion for it to make sense

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-21 05:19:01

    *Thread Reply:* and empty channels do not invite to discussion

Michael Robinson (michael.robinson@astronomer.io)
2023-12-21 09:01:22

    *Thread Reply:* Maybe worth it for spark questions alone? And then for equal coverage we need the others. It’s getting easy to overlook questions in general due to the increased volume and long code snippets, IMO.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-21 09:23:22

    *Thread Reply:* yeah I think the volume is still quite low

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-21 09:24:15

    *Thread Reply:* something like Airflow's #troubleshooting channel easily has order of magnitude more messages

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-21 09:24:41

    *Thread Reply:* and even then, I'd split between something like #troubleshooting and #development rather than between integrations

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-21 09:25:13

*Thread Reply:* not only because it's too granular, but also there's development that isn't strictly integration-related or that touches multiple ones

👍 Willy Lulciuc

Michael Robinson (michael.robinson@astronomer.io)
2023-12-20 14:45:33

    Link to the vote to release the hot fix in Marquez: https://marquezproject.slack.com/archives/C01E8MQGJP7/p1703101476368589

Michael Robinson (michael.robinson@astronomer.io)
2023-12-21 09:04:15

    For the newsletter this time around, I’m thinking that a year-end review issue might be nice in mid-January when folks are back from vacation. And then a “double issue” at the end of January with the usual updates. We’ve still got a rather, um, “select” readership, so the stakes are low. If you have an opinion, please lmk.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-21 09:06:17

    *Thread Reply:* I’m for mid-January option

👍 Michael Robinson, Maciej Obuchowski

Michael Robinson (michael.robinson@astronomer.io)
2023-12-21 11:33:40

    1.7.0 changelog PR needs a review: https://github.com/OpenLineage/OpenLineage/pull/2331

Michael Robinson (michael.robinson@astronomer.io)
2023-12-21 12:48:58

Notice for the release notes (WDYT?):
COMPATIBILITY NOTICE
Starting in 1.7.0, the Airflow integration will no longer support Airflow versions >=2.8.0.
Please use the OpenLineage Airflow Provider instead.
It includes a link to here: https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/index.html
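(The notice could also include the provider install one-liner, since the package linked above is separate from openlineage-airflow:)

pip install apache-airflow-providers-openlineage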

👍 Kacper Muda

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-22 12:46:44

https://eu.communityovercode.org/ is that a proper conference to talk about OL?

Willy Lulciuc (willy@datakin.com)
2023-12-22 16:06:32

*Thread Reply:* Yes! At least to a broader audience

Shubham Mehta (shubhammehta.93@gmail.com)
2023-12-23 03:52:59

    @Shubham Mehta has joined the channel

Harel Shein (harel.shein@gmail.com)
2024-01-02 08:50:13

    Happy new year all!

🎉 Kacper Muda, Maciej Obuchowski, Jakub Dardziński, Paweł Leszczyński, Willy Lulciuc

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-02 12:28:58

    Am I the only one that sees Free trial in progress here in OL slack?

Kacper Muda (kacper.muda@getindata.com)
2024-01-02 12:32:25

    *Thread Reply:* I see it too:

[image attachment]

Harel Shein (harel.shein@gmail.com)
2024-01-02 13:04:34

    *Thread Reply:* same for me, I think Slack initiated that

Harel Shein (harel.shein@gmail.com)
2024-01-02 13:04:37

    *Thread Reply:* or @Michael Robinson?

Harel Shein (harel.shein@gmail.com)
2024-01-02 13:05:01

    *Thread Reply:* we’re on the Slack Pro trial (we were on the free plan before)

Michael Robinson (michael.robinson@astronomer.io)
2024-01-02 13:07:18

    *Thread Reply:* I think Slack initiated it

Harel Shein (harel.shein@gmail.com)
2024-01-03 15:37:11

    Spotted!

[image attachment]
🔥 Shubham Mehta
❤️ Willy Lulciuc

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-03 15:44:21

*Thread Reply:* a moment earlier - it gives more context

[image attachment]
😂 Harel Shein, Maciej Obuchowski

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-04 08:29:30

    https://github.com/OpenLineage/OpenLineage/issues/2349 - this issue is really interesting. I am hoping to see follow-up from David.

Michael Robinson (michael.robinson@astronomer.io)
2024-01-04 10:27:12

    So who wants to speak at our meetup with Confluent in London on Jan. 31st?

🙂 Maciej Obuchowski

Harel Shein (harel.shein@gmail.com)
2024-01-04 11:08:20

    *Thread Reply:* do we have sponsorship to fly over?

Michael Robinson (michael.robinson@astronomer.io)
2024-01-04 11:09:39

    *Thread Reply:* Not currently but I can look into it

Harel Shein (harel.shein@gmail.com)
2024-01-04 11:11:17

    *Thread Reply:* do we have other active community members based in the UK?

Michael Robinson (michael.robinson@astronomer.io)
2024-01-04 11:12:00

    *Thread Reply:* Yes

Michael Robinson (michael.robinson@astronomer.io)
2024-01-04 11:12:07

    *Thread Reply:* I’ll ask around

Michael Robinson (michael.robinson@astronomer.io)
2024-01-04 12:42:16

    *Thread Reply:* Abdallah Terrab at Decathlon has volunteered

Harel Shein (harel.shein@gmail.com)
2024-01-04 12:43:34

    *Thread Reply:* woohoo!

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:16:55

    *Thread Reply:* does it mean we're still looking for someone?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:17:19

*Thread Reply:* I already said last month that I'll go, but not sure if it's still needed

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:17:43

    *Thread Reply:* and does Confluent have a talk there?

Michael Robinson (michael.robinson@astronomer.io)
2024-01-05 11:52:04

    *Thread Reply:* Sorry about that, Maciej. I’ll ask Viraj if Astronomer would cover your ticket. There will be a Confluent speaker.

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:53:54

    *Thread Reply:* if we need to choose between kafka summit and a meetup - I think we should go for kafka summit 🙂

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:58:21

    *Thread Reply:* I think so too

Michael Robinson (michael.robinson@astronomer.io)
2024-01-05 13:12:51

    *Thread Reply:* Viraj has requested approval for summit, and we can expect to hear from finance soon

Michael Robinson (michael.robinson@astronomer.io)
2024-01-05 13:18:01

    *Thread Reply:* Also, question from him about the talk: what does “streaming” refer to in the title — Kafka only?

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 13:20:26

    *Thread Reply:* Kafka, Spark, Flink

👍 Michael Robinson, Willy Lulciuc

Willy Lulciuc (willy@datakin.com)
2024-01-08 15:13:41

    *Thread Reply:* If someone wants to do a joint talk let me know 😉

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-08 15:55:57

    *Thread Reply:* @Willy Lulciuc will you be in UK then?

Willy Lulciuc (willy@datakin.com)
2024-01-08 16:02:10

    *Thread Reply:* I can be, if astronomer approves? but also realizing it’s Jan 31st, so a bit tight

Harel Shein (harel.shein@gmail.com)
2024-01-08 16:16:30

    *Thread Reply:* yeah, that sounds… unlikely 🙂

👍 Willy Lulciuc

Willy Lulciuc (willy@datakin.com)
2024-01-08 16:17:16

*Thread Reply:* also thinking about all the traveling I’ve been doing recently and things I need to work on. would be great to have some focus time

Harel Shein (harel.shein@gmail.com)
2024-01-04 12:42:33

    did anyone submit a talk to https://www.databricks.com/dataaisummit/call-for-presentations/?

Harel Shein (harel.shein@gmail.com)
2024-01-04 13:27:05

    *Thread Reply:* tagging @Julien Le Dem on this one. since the deadline is tomorrow.

Julien Le Dem (julien@apache.org)
2024-01-04 22:16:01

    *Thread Reply:* I don’t think I did.

Julien Le Dem (julien@apache.org)
2024-01-04 22:17:07

    *Thread Reply:* I don’t have my computer with me. ⛷️

Julien Le Dem (julien@apache.org)
2024-01-04 22:18:06

    *Thread Reply:* Does @Willy Lulciuc want to submit? Happy to be co-speaker (if you want. But not necessary)

Harel Shein (harel.shein@gmail.com)
2024-01-05 08:39:14

    *Thread Reply:* Willy is also on vacation, I’m happy to submit for the both of us. I’ll try to get something out today

👍 Julien Le Dem

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:23:52

*Thread Reply:* would be great to have something at the Databricks conference 🙂

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:25:50

    *Thread Reply:* agreed. ideas welcome!

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:26:34

*Thread Reply:* what I’m currently thinking: learnings from openlineage adoption in Airflow and Flink, and what can be learned / applied to Spark lineage.

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:26:37

    *Thread Reply:* catchy title!

Michael Robinson (michael.robinson@astronomer.io)
2024-01-04 13:22:12

    This month’s TSC meeting is next Thursday. Anyone have any items to add to the agenda?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-04 14:10:20

    *Thread Reply:* @Kacper Muda would you want to talk about doc changes in Airflow provider maybe?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-04 14:13:36

    *Thread Reply:* no pressure if it's too late for you 🙂

Kacper Muda (kacper.muda@getindata.com)
2024-01-04 14:16:08

    *Thread Reply:* It's fine, I could probably mention something about it - the problem is that I have a standing commitment every Thursday from 5:30 to 7:30 PM (Polish time, GMT+1) which means I'm unable to participate. 😞

Michael Robinson (michael.robinson@astronomer.io)
2024-01-04 15:46:35

    *Thread Reply:* @Kacper Muda would you be open to recording something? We could play it during the meeting. Something to consider if you’d like to participate but the time doesn’t work.

Kacper Muda (kacper.muda@getindata.com)
2024-01-05 10:04:33

*Thread Reply:* Let me see how much time I'll have during the weekend and get back to you 🙂

👍 Michael Robinson

Kacper Muda (kacper.muda@getindata.com)
2024-01-08 02:48:35

*Thread Reply:* Sorry, I got sick and won't be able to do it. Maybe I'll try to make it personally to the next meeting; by then the docs should already be released 🙂

👍 Michael Robinson

Michael Robinson (michael.robinson@astronomer.io)
2024-01-10 13:56:46

    *Thread Reply:* @Julien Le Dem are there updates on open proposals that you would like to cover?

Michael Robinson (michael.robinson@astronomer.io)
2024-01-10 13:58:54

    *Thread Reply:* @Paweł Leszczyński as you authored about half of the changes in 1.7.0, would you be willing to talk about the Flink fixes? No slides necessary

Julien Le Dem (julien@apache.org)
2024-01-10 19:21:39

    *Thread Reply:* @Michael Robinson no updates from me

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2024-01-11 07:56:57

*Thread Reply:* sry @Michael Robinson, I won't be able to join this time. My changes in 1.7.0 are rather small fixes. Perhaps someone else can present them briefly.

👍 Michael Robinson

Harel Shein (harel.shein@gmail.com)
2024-01-04 16:28:04

I saw some weird behavior with openlineage-airflow where it will not respect the transport config for the client, even when setting the OPENLINEAGE_CONFIG to point to a config file.
the workaround is that if you set the OPENLINEAGE_URL env var, it will pick that up and read the config.
this bug doesn’t seem to exist in the airflow provider since the loading method is completely different.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2024-01-05 10:08:32

    *Thread Reply:* would you mind creating an issue?

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:23:58

    *Thread Reply:* will do. let me repro on the latest version of openlineage-airflow and see if I can repro on the provider

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:25:05

    *Thread Reply:* config is definitely more complex than it needs to be...

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:29:44

    *Thread Reply:* hmm.. so in 1.7.0 if you define OPENLINEAGE_URL then it completely ignores whatever is in OPENLINEAGE_CONFIG yaml

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:30:47

    *Thread Reply:* but….

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:31:09

    *Thread Reply:* if you don’t define OPENLINEAGE_URL, and you do define OPENLINEAGE_CONFIG: then openlineage is disabled 😂

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:33:31

    *Thread Reply:* ok, I found the bug

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:33:38

    *Thread Reply:* are you sure OPENLINEAGE_CONFIG points to something valid?

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:33:41

    *Thread Reply:* it’s in transport factory

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:34:44

*Thread Reply:* ~look at the code flow for create:~
~the default factory doesn’t supply config, so it tried to set http config from env vars. if that doesn’t work, it just returns the console transport~

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:35:11

    *Thread Reply:* oh, no nvm. it’s a different flow.

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2024-01-05 11:35:14

    *Thread Reply:* not exactly

Harel Shein (harel.shein@gmail.com)
2024-01-05 11:35:31

    *Thread Reply:* but yes, OPENLINEAGE_CONFIG works for sure

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-05 11:36:00
    +
    +

    *Thread Reply:* on 0.21.1, it was working when the OPENLINEAGE_URL was supplied

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
    +
    2024-01-05 11:36:16
    +
    +

    *Thread Reply:* transport_config = None if "transport" not in self.config else self.config["transport"]
self.transport = factory.create(transport_config)
this self.config actually looks at
@property
def config(self) -&gt; dict[str, Any]:
    if self._config is None:
        self._config = load_config()
    return self._config
which then uses this:
```def load_config() -&gt; dict[str, Any]:
    file = _find_yaml()
    if file:
        try:
            with open(file) as f:
                config: dict[str, Any] = yaml.safe_load(f)
                return config
        except Exception:  # noqa: BLE001, S110
            # Just move to read env vars
            pass
    return defaultdict(dict)


def _find_yaml() -&gt; str | None:
    # Check OPENLINEAGE_CONFIG env variable
    path = os.getenv("OPENLINEAGE_CONFIG", None)
    try:
        if path and os.path.isfile(path) and os.access(path, os.R_OK):
            return path
    except Exception:  # noqa: BLE001
        if path:
            log.exception("Couldn't read file %s: ", path)
        else:
            pass  # We can get different errors depending on system

    # Check current working directory:
    try:
        cwd = os.getcwd()
        if "openlineage.yml" in os.listdir(cwd):
            return os.path.join(cwd, "openlineage.yml")
    except Exception:  # noqa: BLE001, S110
        pass  # We can get different errors depending on system

    # Check $HOME/.openlineage dir
    try:
        path = os.path.expanduser("~/.openlineage")
        if "openlineage.yml" in os.listdir(path):
            return os.path.join(path, "openlineage.yml")
    except Exception:  # noqa: BLE001, S110
        # We can get different errors depending on system
        pass
    return None```
    +
    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
    +
    2024-01-05 11:37:16
    +
    +

    *Thread Reply:* oh I think I see

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
    +
    2024-01-05 11:37:40
    +
    +

    *Thread Reply:* so this isn't passed if you have config but there is no transport field in this config
transport_config = None if "transport" not in self.config else self.config["transport"]
self.transport = factory.create(transport_config)
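A minimal sketch of the pitfall, assuming the YAML parsed into a plain dict: if the settings aren't nested under a top-level transport key, the factory receives None and nothing from the file reaches the transport:
```# hypothetical mis-nested config: parsed fine, but no "transport" key
config = {"type": "console"}

transport_config = None if "transport" not in config else config["transport"]
print(transport_config)  # None -> the file's settings are effectively ignored```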

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-05 11:38:22
    +
    +

    *Thread Reply:* here’s the config I’m using:
```transport:
  type: "kafka"
  config:
    bootstrap.servers: "kafka1,kafka2"
    security.protocol: "SSL"
    # CA certificate file for verifying the broker's certificate.
    ssl.ca.location=ca-cert
    # Client's certificate
    ssl.certificate.location=client_?????_client.pem
    # Client's key
    ssl.key.location=client_?????_client.key
    # Key password, if any.
    ssl.key.password=abcdefgh
  topic: "SOF0002248-afaas-lineage-DEV-airflow-lineage"
  flush: True```

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-05 11:38:35
    +
    +

    *Thread Reply:* it should load, but fail when actually trying to emit to kafka

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-05 11:38:47
    +
    +

    *Thread Reply:* but it should still init the transport

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
    +
    2024-01-05 11:39:22
    +
    +

    *Thread Reply:* 🤔

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
    +
    2024-01-05 11:40:09
    +
    +

    *Thread Reply:* no logs?

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-05 11:40:24
    +
    +

    *Thread Reply:* I’m testing on this image:
```FROM quay.io/astronomer/astro-runtime:6.4.0

COPY openlineage.yml /usr/local/airflow/

ENV OPENLINEAGE_CONFIG=/usr/local/airflow/openlineage.yml

ENV AIRFLOW__CORE__LOGGING_LEVEL=DEBUG

ENV OPENLINEAGE_URL=http://foo.bar/```

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
    +
    2024-01-05 11:40:39
    +
    +

    *Thread Reply:* are you sure there are no permission errors?

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
    +
    2024-01-05 11:41:02
    +
    +

    *Thread Reply:* def _find_yaml() -&gt; str | None: + # Check OPENLINEAGE_CONFIG env variable + path = os.getenv("OPENLINEAGE_CONFIG", None) + try: + if path and os.path.isfile(path) and os.access(path, os.R_OK): + return path + except Exception: # noqa: BLE001 + if path: + log.exception("Couldn't read file %s: ", path) + else: + pass # We can get different errors depending on system +it checks stuff like +os.access(path, os.R_OK)

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-05 11:41:02
    +
    +

    *Thread Reply:* I’m sure, because if I uncomment
ENV OPENLINEAGE_URL=http://foo.bar/
on 0.21.1, it works

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-05 11:41:29
    +
    +

    *Thread Reply:* ah, I can add a permissive chmod to the dockerfile to see if it helps

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-05 11:41:40
    +
    +

    *Thread Reply:* but I’m also not seeing any log/exception in the task logs

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
    +
    2024-01-05 11:47:49
    +
    +

    *Thread Reply:* will look at this later if you won't find solution 🙂

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
    +
    2024-01-05 11:48:09
    +
    +

    *Thread Reply:* one more thing, can you try with just +transport: + type: console

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-05 11:52:49
    +
    +

    *Thread Reply:* I tried that too

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-05 11:52:51
    +
    +

    *Thread Reply:* didn’t work

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-05 11:53:27
    +
    +

    *Thread Reply:* but yeah, there’s something not great about separation of concerns between client config and adapter config

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
    +
    2024-01-05 12:04:00
    +
    +

    *Thread Reply:* adapter should not care, unless you're using MARQUEZ_URL... which is backwards compatibility for when it was still marquez airflow integration

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Michael Robinson + (michael.robinson@astronomer.io) +
    +
    2024-01-05 11:50:11
    +
    +

    I’m starting to put together the year-in-review issue of the newsletter and wonder if anyone has thoughts on the “big stories” of 2023 in OpenLineage. So far I’ve got:
• Launched the Airflow Provider
• Added static (AKA design) lineage
• Welcomed new ecosystem partners (Google, Metaphor, Grai, Datahub)
• Started meeting up and held events with Metaphor, Google, Collibra, etc.
• Graduated from the LFAI
What am I missing? Wondering in particular about features. Is Iceberg support in Flink a “big” enough story? Custom transport types? SQL parser improvements?

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
    +
    2024-01-08 06:37:27
    +
    +

    Blog posts content about contributing to Openlineage-spark and code internals. The content comes from november meetup at google and I split it into two posts: https://docs.google.com/document/d/1Hu6clFckse1J_M1w2MMaTTJS0wUihtFsxbDQchtTVtA/edit?usp=sharing

    + + + +
    + ❤️ Jakub Dardziński +
    + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Kacper Muda + (kacper.muda@getindata.com) +
    +
    2024-01-08 11:22:32
    +
    +

    Does anyone remember why execution_date was chosen as part of the run_id for an Airflow task, instead of, for example, start_date?
Due to this decision, we can encounter duplicate run_ids if we delete the DagRun from the database, because the execution_date remains the same. If I run a backfill job for yesterday, then delete it and run it again, I get the same ids. I'm trying to understand the rationale behind this choice so we can determine whether it's a bug or a feature. 😉
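For illustration, a minimal sketch of the collision (a hypothetical helper, not the integration's actual code): any run ID derived deterministically from the DAG/task identity plus execution_date is identical across a delete-and-rerun of the same logical date:
```from uuid import NAMESPACE_URL, uuid3

# hypothetical stand-in for a run id built only from deterministic attributes
def run_id(dag_id: str, task_id: str, execution_date: str) -> str:
    return str(uuid3(NAMESPACE_URL, f"{dag_id}.{task_id}.{execution_date}"))

first = run_id("my_dag", "my_task", "2024-01-07T00:00:00+00:00")
# ... delete the DagRun, backfill the same logical date again ...
second = run_id("my_dag", "my_task", "2024-01-07T00:00:00+00:00")
print(first == second)  # True -> duplicate run id```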

    + +
    + + + + + + + + + +
    + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
    +
    2024-01-08 11:40:45
    +
    +

    *Thread Reply:* start_date is unreliable AFAIK, there can be no start date sometimes

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
    +
    2024-01-08 11:40:54
    +
    +

    *Thread Reply:* this might be true for only some version of airflow

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Kacper Muda + (kacper.muda@getindata.com) +
    +
    2024-01-08 11:45:02
    +
    +

    *Thread Reply:* Also, here where they define a combination of some deterministic attributes, execution_date is used and not start_date, so there might be something to it. That still leaves us with the behaviour I described.

    +
    + + + + + + + + + + + + + + + + +
    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Willy Lulciuc + (willy@datakin.com) +
    +
    2024-01-08 16:08:01
    +
    +

    *Thread Reply:* &gt; Due to this decision, we can encounter duplicate run_id if we delete the DagRun from the database, because the execution_date remains the same.
Hmm, so given that the OL runID uses the same params Airflow uses to generate the hash, this seems more like a limitation. A better question would be: if a user runs, deletes, then runs the same DAG again, is that an expected scenario we should handle? tl;dr yes, but Airflow hasn’t felt it important enough to address.

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-08 12:28:50
    +
    +

    https://www.snowflake.com/summit/call-for-papers/

    +
    +
    Snowflake
    + + + + + + + + + + + + + + + + + +
    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
    +
    2024-01-08 12:36:09
    +
    +

    *Thread Reply:* not concurrent with Databricks conference this year? 😂

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-08 12:38:34
    +
    +

    *Thread Reply:* nope, a week before so that everyone goes there and gets sick and can’t attend the databricks conf on the following week

    + + + +
    + 😅 Willy Lulciuc +
    + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
    +
    2024-01-08 12:40:34
    +
    +

    *Thread Reply:* outstanding business move again

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-08 15:55:12
    +
    +

    *Thread Reply:* @Willy Lulciuc wanna submit to this?

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Willy Lulciuc + (willy@datakin.com) +
    +
    2024-01-08 15:56:19
    +
    +

    *Thread Reply:* yeah, I’d love to: maybe something like “Detecting Snowflake table schema changes with OpenLineage events” + use cases + using lineage to detect impact?

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-08 16:24:27
    +
    +

    *Thread Reply:* yeah! that sounds like a fun talk!

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-08 16:25:18
    +
    +

    *Thread Reply:* idk if @Julien Le Dem will already be in 🇫🇷 that week? but I’d be happy to co-present if not.

    + + + +
    + 🔥 Maciej Obuchowski +
    + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Julien Le Dem + (julien@apache.org) +
    +
    2024-01-08 16:26:52
    +
    +

    *Thread Reply:* I don’t know when I’m flying out yet but it will be in the middle of that time frame.

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Julien Le Dem + (julien@apache.org) +
    +
    2024-01-08 16:28:18
    +
    +

    *Thread Reply:* +1 on Harel co-presenting :)

    + + + +
    + 🙏 Willy Lulciuc +
    + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Julien Le Dem + (julien@apache.org) +
    +
    2024-01-08 16:29:09
    +
    +

    *Thread Reply:* School last day is the 4th. I need to be in France (not jet lagged) before the 8th

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Willy Lulciuc + (willy@datakin.com) +
    +
    2024-01-08 16:45:37
    +
    +

    *Thread Reply:* ok, @Harel Shein I’ll work on getting a rough draft ready before the deadline (added to my TODO of tasks)

    + + + +
    + 👍 Harel Shein +
    + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2024-01-09 10:01:59
    +
    +

    hey, I'm not feeling well, will probably skip today's meeting

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-09 10:04:17
    +
    +

    *Thread Reply:* get well soon ❤️‍🩹

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
    +
    2024-01-09 12:55:04
    +
    +

    Added a comment to the discussion about the Spark parent job issue: https://github.com/OpenLineage/OpenLineage/issues/1672#issuecomment-1883524216
I think we have consensus, so I'll work on it.

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-10 15:41:39
    +
    +

    *Thread Reply:* @Maciej Obuchowski should we give an update on that at the TSC meeting tomorrow?

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-10 15:52:33
    +
    +

    *Thread Reply:* @Michael Robinson CC ^

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
    +
    2024-01-09 12:56:51
    +
    +

    Another issue: do you think we should somehow handle partitioning in the OpenLineage standard and integrations?
I'm thinking of situations where we already know how a dataset is partitioned - not about how to automagically detect that.
Some example situations:

    + +
    1. Job reads particular partitions of a dataset. Should we indicate this, possibly as an InputFacet? (see the sketch below the list)
    2. Run writes a particular partition to a dataset. Would it be useful information that a particular partition was written only by some run, while others were written by different runs, rather than treating the output dataset as a "global" modification that possibly changes all the data?
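To make case 1 concrete, a hypothetical sketch of what a partition-aware facet could look like (no such facet exists in the spec today; the facet name and schema URL are made up):
```# hypothetical partition facet on an input dataset
input_dataset = {
    "namespace": "warehouse",
    "name": "orders",
    "inputFacets": {
        "partitions": {  # made-up facet name, for illustration only
            "_producer": "https://github.com/OpenLineage/OpenLineage",
            "_schemaURL": "https://example.com/PartitionInputDatasetFacet.json",  # placeholder
            "partitions": ["date=2024-01-08"],  # the partitions this run actually read
        }
    },
}```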
    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-10 15:53:30
    +
    +

    *Thread Reply:* interesting. did you hear anyone with this usecase/requirement?

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Michael Robinson + (michael.robinson@astronomer.io) +
    +
    2024-01-12 10:04:24
    +
    +

    Hey, there’s a Windows user getting this error when trying to run Marquez: org.apache.tomcat.jdbc.pool.ConnectionPool: unable to create initial connection. Is it a driver issue? I’ll try to get more details and reproduce it, but if you know what this is probably related to, please lmk

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
    +
    2024-01-12 10:05:16
    +
    +

    *Thread Reply:* do we even support Windows? 😱

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Michael Robinson + (michael.robinson@astronomer.io) +
    +
    2024-01-12 10:50:00
    +
    +

    *Thread Reply:* Here’s more info about the use case:
Thanks Michael, this is really helpful. So I am working on a prj where I need to run Marquez and OpenLineage on top of Airflow DAGs which run dbt commands internally thru BashOperator. I need to present to my org whether we are going to benefit from bringing in Marquez metadata lineage
so was taking this approach of setting up Marquez first, then will explore how it integrates with Airflow using BashOperator

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Michael Robinson + (michael.robinson@astronomer.io) +
    +
    2024-01-12 10:50:16
    +
    +

    *Thread Reply:* We don’t support this operator, right? What kind of graph can they expect?

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
    +
    2024-01-12 10:54:56
    +
    +

    *Thread Reply:* They can use the dbt integration directly maybe?

    +
    +
    openlineage.io
    + + + + + + + + + + + + + + + +
    + + + +
    +
    +
    +
    + + + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-16 06:13:54
    +
    +

    *Thread Reply:* Nice!! ❤️

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-16 06:14:43
    +
    +

    *Thread Reply:* We now have a former astronomer as engineering director at DataHub

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
    +
    2024-01-16 06:43:02
    +
    +

    *Thread Reply:* ?

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-16 11:12:26
    +
    +

    *Thread Reply:* https://www.linkedin.com/in/samantha-clark-engineer/

    +
    +
    linkedin.com
    + + + + + + + + + + + + + + + + + +
    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2024-01-16 14:21:22
    +
    +

    taking on letting users run integration tests

    + + + +
    + ❤️ Harel Shein +
    + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2024-01-16 14:22:07
    +
    +

    *Thread Reply:* so there are two issues I think

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2024-01-16 14:22:37
    +
    +

    *Thread Reply:* in Airflow workflows there are:
filters:
  branches:
    ignore: /pull\/[0-9]+/
which are set only for PRs from forked repos

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2024-01-16 14:23:34
    +
    +

    *Thread Reply:* in Spark there are tests that strictly require env vars (that contain credentials to various external systems like databricks or bigquery). if there are no such env vars, the tests fail, which is confusing for new committers

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2024-01-16 14:24:27
    +
    +

    *Thread Reply:* the first behaviour is silent - which I think is bad because it’s easy to skip integration tests: the build is green (but should it be? we don’t know - the integration tests didn’t run, and someone needs to know and remember that before approving and merging)

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2024-01-16 14:25:51
    +
    +

    *Thread Reply:* the second is misleading because it hints there’s something wrong in the code while there doesn’t necessarily need to be. on the other hand, you shouldn’t approve and merge a failing build, so you see there’s some action required

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2024-01-16 14:27:21
    +
    +

    *Thread Reply:* reg. action required: for now we’ve been running a manual step (using https://github.com/jklukas/git-push-fork-to-upstream-branch) which is a workaround, but it’s not straightforward and requires manual work. it also doesn’t solve the two issues I mentioned above

    +
    + + + + + + + +
    +
    Stars
    + 10 +
    + +
    +
    Language
    + Shell +
    + + + + + + + + +
    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2024-01-16 14:28:46
    +
    +

    *Thread Reply:* what I propose is to simply add an approval step before integration tests: https://circleci.com/docs/workflows/#holding-a-workflow-for-a-manual-approval

    + +

    it’s a circleCI-only thing, so you need to log in to circleCI, check if there’s any pending task to approve, and then approve or not

    +
    +
    circleci.com
    + + + + + + + + + + + + + + + + + +
    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2024-01-16 14:30:14
    +
    +

    *Thread Reply:* it doesn’t allow for much configuration but I think it would work. you also can’t integrate it in any way with the github UI (e.g. there’s no option to click something in the PR’s UI to approve)

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2024-01-16 14:31:39
    +
    +

    *Thread Reply:* but that would let project maintainers manage when the code is safe to run, and it’s still visible and (I think) readable for everyone

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2024-01-16 14:34:00
    +
    +

    *Thread Reply:* the only thing I’m not sure about is who can approve

    + +

    Anyone who has push access to the repository can click the **Approval** button to continue the workflow

    + +

    but I’m not sure for which repo. if someone runs on a fork and has push access to the fork - can they approve? it wouldn’t make sense..

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2024-01-16 14:37:34
    +
    +

    *Thread Reply:* https://circleci.com/blog/access-control-cicd/

    + +

    that’s best I could find from circleCI on this subject

    +
    +
    CircleCI
    + + + + + + + + + + + + + + + + + +
    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2024-01-16 15:01:00
    +
    +

    *Thread Reply:* so I think the best solution would be to:

    + +
    1. add approval steps only before integration tests
    2. enable Pass secrets to builds from forked pull requests (requires careful review of the CI process)
    3. make sure release and integration tests contexts are set properly for tasks
    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2024-01-16 15:01:31
    +
    +

    *Thread Reply:* contexts give possibility to let users run e.g. unit tests in CI without exposing credentials

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-16 15:06:09
    +
    +

    *Thread Reply:* this approach makes sense to me, assuming the permission model is how you outlined it

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2024-01-16 15:07:21
    +
    +

    *Thread Reply:* one thing to add and test:
approval steps could have a condition to always run if the PR is not from a fork. not sure if that’s possible

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-16 15:08:50
    +
    +

    *Thread Reply:* that sounds like it should be doable within what’s available in circle. GHA can definitely do that

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2024-01-16 15:09:17
    +
    +

    *Thread Reply:* GHA can do things that circleCI can’t 😂

    + + + +
    + 😂 Harel Shein, Paweł Leszczyński +
    + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2024-01-17 09:46:38
    +
    +

    *Thread Reply:* &gt; approval steps could have a condition to always run if the PR is not from a fork
ffs it’s not that easy to set up

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2024-01-17 16:10:22
    +
    +

    *Thread Reply:* whenever I touch circleCI base changes I feel like a magician.
yet, here goes the PR with the magic (just look at the side effect, it bothered me for so long 🪄 )
🚨 🚨 🚨
https://github.com/OpenLineage/OpenLineage/pull/2374

    +
    + + + + + + + +
    +
    Labels
    + ci, tests +
    + + + + + + + + + + +
    + + + +
    + 🙌 Paweł Leszczyński +
    + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-17 16:29:18
    +
    +

    *Thread Reply:* I'm assuming the reason for the speedup in determine_changed_modules is that we don't go install yq every time this runs?

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2024-01-17 16:35:08
    +
    +

    *Thread Reply:* correct

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Michael Robinson + (michael.robinson@astronomer.io) +
    +
    2024-01-16 15:52:55
    +
    +

    What are your nominees/candidates for the most important releases of 2023? I’ll start (feel free to disagree with my choices, btw): +• 1.0.0 +• 1.7.0 (disabled the external Airflow integration for 2.8+) +• 0.26.0 (Fluentd) +• 0.19.2 (column lineage in SQL parser) +• …

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2024-01-16 15:53:54
    +
    +

    *Thread Reply:* • 1.7.0 (disabled the external Airflow integration for 2.8+) +that doesn’t sound like one of the most important

    + + + +
    + :gratitude_thank_you: Michael Robinson +
    + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2024-01-16 15:55:49
    +
    +

    *Thread Reply:* 1.0.0 had this: https://openlineage.io/docs/releases/1_0_0 which actually fixed the spec to match the JSON schema spec

    +
    +
    openlineage.io
    + + + + + + + + + + + + + + + +
    + + + +
    + 👍 Michael Robinson +
    + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Gowthaman Chinnathambi + (gowthamancdev@gmail.com) +
    +
    2024-01-18 23:26:45
    +
    +

    @Gowthaman Chinnathambi has joined the channel

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Michael Robinson + (michael.robinson@astronomer.io) +
    +
    2024-01-19 16:18:50
    + +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-19 21:06:26
    +
    +

    FYI: I'm trying to set up our meetings with the new LFAI tooling, you may get some emails. you can ignore for now.

    + + + +
    + 👍 Michael Robinson, Maciej Obuchowski +
    + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Krishnaraj Raol + (krishnaraj.raol@datavizz.in) +
    +
    2024-01-22 00:46:57
    +
    +

    @Krishnaraj Raol has joined the channel

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Michael Robinson + (michael.robinson@astronomer.io) +
    +
    2024-01-22 10:45:06
    +
    +

    @Harel Shein @Julien Le Dem @tati Python client releases on both Marquez and OpenLineage are failing because PyPI no longer supports password authentication. We need to configure the projects for Trusted Publishers or use an API token. I’ve looked and can’t find OpenLineage credentials for PyPI, but if I had them we’d still need to set up 2FA in order to make the change. How should we proceed here? Should we sync to sort this out? Thanks (btw I reached out to Willy separately when this came up during a Marquez release attempt last week)

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-22 12:15:48
    +
    +

    *Thread Reply:* ah! I can look into it now.

    + + + +
    + :gratitude_thank_you: Michael Robinson +
    + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-22 12:45:31
    +
    +

    *Thread Reply:* I'm failing to find the credentials for PyPI anywhere.

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-22 12:45:38
    +
    +

    *Thread Reply:* @Maciej Obuchowski any ideas? (git blame shows you wrote that part, and @Michael Collado did some circleCI setup at some point)

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-22 12:46:09
    +
    +

    *Thread Reply:* I just sent a reset password email to whoever registered for the openlineage user

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Michael Robinson + (michael.robinson@astronomer.io) +
    +
    2024-01-22 13:15:34
    +
    +

    *Thread Reply:* Thanks @Harel Shein. Can confirm I didn’t get it

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Julien Le Dem + (julien@apache.org) +
    +
    2024-01-22 14:21:06
    +
    +

    *Thread Reply:* The password should be in the CircleCI context right?

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-22 14:23:23
    +
    +

    *Thread Reply:* yes, it's there. I was trying to avoid writing a job that prints it out

    + + + +
    + 👍 Julien Le Dem +
    + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-22 14:23:57
    +
    +

    *Thread Reply:* that will be the fallback if no one responds 🙂

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-22 14:38:29
    +
    +

    *Thread Reply:* alright, I've setup 2FA and added a few more emails to the PyPI account as fallback.

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-22 14:39:09
    +
    +

    *Thread Reply:* unfortunately, there's only one Trusted Publisher for PyPI, which is GH Actions. so we'll have to use the API token route. PR incoming soon

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-22 14:43:39
    +
    +

    *Thread Reply:* didn't need to make any changes. I updated the circle context and re-ran the PyPI release - we're back to :large_green_circle:

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-22 14:43:46
    +
    +

    *Thread Reply:* ^ @Michael Robinson FYI

    + + + +
    + 👍 Michael Robinson +
    + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Michael Robinson + (michael.robinson@astronomer.io) +
    +
    2024-01-22 15:11:54
    +
    +

    *Thread Reply:*

    + + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Michael Robinson + (michael.robinson@astronomer.io) +
    +
    2024-01-22 15:14:47
    +
    +

    *Thread Reply:* thank you @Harel Shein. Releasing the jars now. Everything looks good

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
    +
    2024-01-22 17:16:55
    +
    +

    *Thread Reply:* Great!

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    tati + (tatiana.alchueyr@astronomer.io) +
    +
    2024-01-22 10:45:15
    +
    +

    @tati has joined the channel

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Michael Robinson + (michael.robinson@astronomer.io) +
    +
    2024-01-23 11:03:11
    +
    +

    It looks like my laptop bit the dust today, so I might miss the sync

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
    +
    2024-01-23 11:29:38
    +
    +

    Do I have to create an account to join the meeting?

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
    +
    2024-01-23 11:31:23
    +
    +

    *Thread Reply:* turns out you can just pass your email

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
    +
    2024-01-23 11:31:57
    +
    +

    *Thread Reply:* same on my side

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
    +
    2024-01-23 11:32:06
    +
    +

    *Thread Reply:* i don't remember if i have one

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-23 11:34:22
    + +
    +
    +
    + + + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2024-01-23 11:35:09
    +
    +

    *Thread Reply:* use the same e-mail you got invited to

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-23 13:21:41
    +
    +

    there's a data warehouse (https://www.firebolt.io/) and a streaming platform (https://memphis.dev/) written in Go. so I guess it's not futile to write a Go client? 🙂

    + + + +
    + 🤔 Maciej Obuchowski +
    + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Michael Robinson + (michael.robinson@astronomer.io) +
    +
    2024-01-23 15:40:21
    +
    +

    Any potential issues with scheduling a meetup on Tuesday, March 19th in Boston that you know of? The Astronomer all-hands is the preceding week

    + + + +
    + 👍 Harel Shein, Maciej Obuchowski +
    + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Michael Robinson + (michael.robinson@astronomer.io) +
    +
    2024-01-23 16:06:35
    +
    +

    PR to add 1.8 release notes to the docs needs a review: https://github.com/OpenLineage/docs/pull/274

    +
    + + + + + + + +
    +
    Comments
    + 1 +
    + + + + + + + + + + +
    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Michael Robinson + (michael.robinson@astronomer.io) +
    +
    2024-01-23 16:18:19
    +
    +

    *Thread Reply:* thanks @Jakub Dardziński

    + + + +
    + 🙂 Jakub Dardziński +
    + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    jayant joshi + (itsjayantjoshi@gmail.com) +
    +
    2024-01-24 01:10:16
    +
    +

    @jayant joshi has joined the channel

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Michael Robinson + (michael.robinson@astronomer.io) +
    +
    2024-01-24 15:19:34
    +
    +

    Feedback requested on this draft of the year-in-review issue of the newsletter: https://docs.google.com/document/d/1MJB9ughykq9O8roe2dlav6d8QbHZBV2A0bTkF4w0-jo/edit?usp=sharing. Did you give a talk that isn't in the talks section? Is there an important release that should be in the releases section but isn't? Other feedback? Please share.

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Michael Robinson + (michael.robinson@astronomer.io) +
    +
    2024-01-24 16:37:33
    +
    +

    Feedback requested on a new page for displaying the ecosystem survey results: https://github.com/OpenLineage/docs/pull/275. The image was created for us by Amp. @Julien Le Dem @Harel Shein @tati

    +
    + + + + + + + +
    +
    Comments
    + 1 +
    + + + + + + + + + + +
    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
    +
    2024-01-25 06:01:50
    +
    +

    *Thread Reply:* Looks great!

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-25 09:00:27
    +
    +

    *Thread Reply:* Very cool indeed! +I wonder if we should share the raw data as well?

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-25 09:00:53
    +
    +

    *Thread Reply:* Maybe if you could share it here first @Michael Robinson ?

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Michael Robinson + (michael.robinson@astronomer.io) +
    +
    2024-01-25 10:17:58
    +
    +

    *Thread Reply:* Yes, planning to include a link to the raw data as well and will share here first

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Michael Robinson + (michael.robinson@astronomer.io) +
    +
    2024-01-25 10:28:10
    +
    +

    *Thread Reply:* @Harel Shein thanks for the suggestion. Lmk if there's a better way to do this, but here's a link to Google's visualizations: https://docs.google.com/forms/d/1j1SyJH0LoRNwNS1oJy0qfnDn_NPOrQw_fMb7qwouVfU/viewanalytics. And a .csv is attached. Would you use this link on the page or link to a spreadsheet instead?

    + + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Michael Robinson + (michael.robinson@astronomer.io) +
    +
    2024-01-26 11:47:27
    +
    +

    *Thread Reply:* Going with linking to Google's charts for the raw data for now. LMK if you'd prefer another format, e.g. Google sheet

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-26 11:48:39
    +
    +

    *Thread Reply:* was just looking at it, looks great @Michael Robinson!

    + + + +
    + :gratitude_thank_you: Michael Robinson +
    + +
    +
    +
    +
    + + + + + + + +
    +
    + + + + +
    + +
    Julien Le Dem + (julien@apache.org) +
    +
    2024-01-24 18:36:40
    +
    +

    This looks nice!

    + + + +
    + :gratitude_thank_you: Michael Robinson +
    + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Michael Robinson + (michael.robinson@astronomer.io) +
    +
    2024-01-26 15:37:06
    +
    +

    Thanks for any feedback on the Mailchimp version of the newsletter special issue before it goes out on Monday:

    + + + + +
    + 🙌 Paweł Leszczyński, Maciej Obuchowski +
    + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Michael Robinson + (michael.robinson@astronomer.io) +
    +
    2024-01-26 15:41:53
    +
    +

    *Thread Reply:* @Jakub Dardziński @Maciej Obuchowski is the Airflow Provider stuff still current?

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Jakub Dardziński + (jakub.dardzinski@getindata.com) +
    +
    2024-01-26 15:43:44
    +
    +

    *Thread Reply:* yeah, looks current

    + + + +
    + :gratitude_thank_you: Michael Robinson +
    + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Laurent Paris + (laurent@datakin.com) +
    +
    2024-01-27 08:55:40
    +
    +

    @Laurent Paris has joined the channel

    + + + +
    + 👋 Maciej Obuchowski, Harel Shein, Michael Robinson +
    + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
    +
    2024-01-29 11:47:19
    +
    +

    Hey, created issue around using Spark metrics to increase operational ease of using Spark integration, feel free to comment: https://github.com/OpenLineage/OpenLineage/issues/2402

    + + + +
    + 💯 Harel Shein +
    + +
    +
    +
    +
    + + +
    diff --git a/channel/general/index.html b/channel/general/index.html index 3323de9..6b86e9a 100644 --- a/channel/general/index.html +++ b/channel/general/index.html @@ -52,6 +52,12 @@

    Public Channels

  • +
  • + + # jobs + +
  • +
  • # london-meetup @@ -557,7 +563,7 @@

    Group Direct Messages

    2020-10-25 17:22:09
    -

    *Thread Reply:* slack is the new email?

    +

    *Thread Reply:* slack is the new email?

    @@ -583,7 +589,7 @@

    Group Direct Messages

    2020-10-25 17:40:19
    -

    *Thread Reply:* :(

    +

    *Thread Reply:* :(

    @@ -609,7 +615,7 @@

    Group Direct Messages

    2020-10-27 12:27:04
    -

    *Thread Reply:* I'd prefer a google group as well

    +

    *Thread Reply:* I'd prefer a google group as well

    @@ -635,7 +641,7 @@

    Group Direct Messages

    2020-10-27 12:27:25
    -

    *Thread Reply:* I think that is better for keeping people engaged, since it isn't just a ton of history to go through

    +

    *Thread Reply:* I think that is better for keeping people engaged, since it isn't just a ton of history to go through

    @@ -661,7 +667,7 @@

    Group Direct Messages

    2020-10-27 12:27:38
    -

    *Thread Reply:* And I think it is also better for having thoughtful design discussions

    +

    *Thread Reply:* And I think it is also better for having thoughtful design discussions

    @@ -687,7 +693,7 @@

    Group Direct Messages

    2020-10-29 15:40:14
    -

    *Thread Reply:* I’m happy to create a google group if that would help.

    +

    *Thread Reply:* I’m happy to create a google group if that would help.

    @@ -713,7 +719,7 @@

    Group Direct Messages

    2020-10-29 15:45:23
    -

    *Thread Reply:* Here it is: https://groups.google.com/g/openlineage

    +

    *Thread Reply:* Here it is: https://groups.google.com/g/openlineage

    @@ -739,7 +745,7 @@

    Group Direct Messages

    2020-10-29 15:46:34
    -

    *Thread Reply:* Slack is more of a way to nudge discussions along, we can use github issues or the mailing list to discuss specific points

    +

    *Thread Reply:* Slack is more of a way to nudge discussions along, we can use github issues or the mailing list to discuss specific points

    @@ -765,7 +771,7 @@

    Group Direct Messages

    2020-11-03 17:34:53
    -

    *Thread Reply:* @Ryan Blue and @Wes McKinney any recommendations on automating sending github issues update to that list?

    +

    *Thread Reply:* @Ryan Blue and @Wes McKinney any recommendations on automating sending github issues update to that list?

    @@ -791,7 +797,7 @@

    Group Direct Messages

    2020-11-03 17:35:34
    -

    *Thread Reply:* I don't really know how to do that

    +

    *Thread Reply:* I don't really know how to do that

    @@ -817,7 +823,7 @@

    Group Direct Messages

    2021-04-02 07:18:25
    -

    *Thread Reply:* @Julien Le Dem How about using Github discussions? They are specifically meant to solve this problem. The feature is still in beta, but it can be enabled from repository settings. One positive side I see is that it will be really easy to follow through, and one separate place to go and look for discussions and ideas which are being discussed.

    +

    *Thread Reply:* @Julien Le Dem How about using Github discussions? They are specifically meant to solve this problem. The feature is still in beta, but it can be enabled from repository settings. One positive side I see is that it will be really easy to follow through, and one separate place to go and look for discussions and ideas which are being discussed.

    @@ -843,7 +849,7 @@

    Group Direct Messages

    2021-04-02 19:51:55
    -

    *Thread Reply:* I just enabled it: https://github.com/OpenLineage/OpenLineage/discussions

    +

    *Thread Reply:* I just enabled it: https://github.com/OpenLineage/OpenLineage/discussions

    @@ -899,7 +905,7 @@

    Group Direct Messages

    2020-10-25 17:21:44
    -

    *Thread Reply:* the plan is to use github issues for discussions on the spec. This is to supplement

    +

    *Thread Reply:* the plan is to use github issues for discussions on the spec. This is to supplement

    @@ -1313,7 +1319,7 @@

    Group Direct Messages

    2020-12-10 17:04:39
    -

    *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/1

    +

    *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/1

    @@ -1471,7 +1477,7 @@

    Group Direct Messages

    2020-12-14 17:55:46
    -

    *Thread Reply:* Looks great!

    +

    *Thread Reply:* Looks great!

    @@ -1620,7 +1626,7 @@

    Group Direct Messages

    2020-12-16 21:41:16
    -

    *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/1

    +

    *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/1

    @@ -1802,7 +1808,7 @@

    Group Direct Messages

    2020-12-21 12:15:42
    -

    *Thread Reply:* Welcome!

    +

    *Thread Reply:* Welcome!

    @@ -1888,7 +1894,7 @@

    Group Direct Messages

    2020-12-21 10:30:51
    -

    *Thread Reply:* Yes, that’s about right.

    +

    *Thread Reply:* Yes, that’s about right.

    @@ -1914,7 +1920,7 @@

    Group Direct Messages

    2020-12-21 10:31:45
    -

    *Thread Reply:* There’s some subtlety here.

    +

    *Thread Reply:* There’s some subtlety here.

    @@ -1940,7 +1946,7 @@

    Group Direct Messages

    2020-12-21 10:32:02
    -

    *Thread Reply:* The initial OpenLineage spec is pretty explicit about linking metadata primarily to execution of specific tasks, which is appropriate for ValidationResults in Great Expectations

    +

    *Thread Reply:* The initial OpenLineage spec is pretty explicit about linking metadata primarily to execution of specific tasks, which is appropriate for ValidationResults in Great Expectations

    @@ -1970,7 +1976,7 @@

    Group Direct Messages

    2020-12-21 10:32:57
    -

    *Thread Reply:* There isn’t as strong a concept of persistent data objects (e.g. a specific table, or batches of data from a specific table)

    +

    *Thread Reply:* There isn’t as strong a concept of persistent data objects (e.g. a specific table, or batches of data from a specific table)

    @@ -2000,7 +2006,7 @@

    Group Direct Messages

    2020-12-21 10:33:20
    -

    *Thread Reply:* (In the GE ecosystem, we call these DataAssets and Batches)

    +

    *Thread Reply:* (In the GE ecosystem, we call these DataAssets and Batches)

    @@ -2026,7 +2032,7 @@

    Group Direct Messages

    2020-12-21 10:33:56
    -

    *Thread Reply:* This is also an important conceptual unit, since it’s the level of analysis where Expectations and data docs would typically attach.

    +

    *Thread Reply:* This is also an important conceptual unit, since it’s the level of analysis where Expectations and data docs would typically attach.

    @@ -2056,7 +2062,7 @@

    Group Direct Messages

    2020-12-21 10:34:47
    -

    *Thread Reply:* @James Campbell and I have had some productive conversations with @Julien Le Dem and others about this topic

    +

    *Thread Reply:* @James Campbell and I have had some productive conversations with @Julien Le Dem and others about this topic

    @@ -2082,7 +2088,7 @@

    Group Direct Messages

    2020-12-21 12:20:53
    -

    *Thread Reply:* Yep! The next step will be to open a few github issues with proposals to add to or amend the spec. We would probably start with a Descriptive Dataset facet of a dataset profile (or dataset update profile). There are other aspects to clarify as well as @Abe Gong is explaining above.

    +

    *Thread Reply:* Yep! The next step will be to open a few github issues with proposals to add to or amend the spec. We would probably start with a Descriptive Dataset facet of a dataset profile (or dataset update profile). There are other aspects to clarify as well as @Abe Gong is explaining above.

    @@ -2138,7 +2144,7 @@

    Group Direct Messages

    2020-12-21 12:23:41
    -

    *Thread Reply:* Totally! We had a quick discussion with Dagster. Looking forward to proposals along those lines.

    +

    *Thread Reply:* Totally! We had a quick discussion with Dagster. Looking forward to proposals along those lines.

    @@ -2194,7 +2200,7 @@

    Group Direct Messages

    2020-12-21 14:48:11
    -

    *Thread Reply:* Thanks, @Harikiran Nayak! It’s amazing to see such interest in the community on defining a standard for lineage metadata collection.

    +

    *Thread Reply:* Thanks, @Harikiran Nayak! It’s amazing to see such interest in the community on defining a standard for lineage metadata collection.

    @@ -2220,7 +2226,7 @@

    Group Direct Messages

    2020-12-21 15:03:29
    -

    *Thread Reply:* Yep! It’s a validation that the problem is real!

    +

    *Thread Reply:* Yep! It’s a validation that the problem is real!

    @@ -2277,7 +2283,7 @@

    Group Direct Messages

    2020-12-22 13:23:43
    -

    *Thread Reply:* Welcome!

    +

    *Thread Reply:* Welcome!

    @@ -2307,7 +2313,7 @@

    Group Direct Messages

    2020-12-30 22:30:23
    -

    *Thread Reply:* What are you current use cases for lineage?

    +

    *Thread Reply:* What are you current use cases for lineage?

    @@ -2440,7 +2446,7 @@

    Group Direct Messages

    2020-12-30 22:28:54
    -

    *Thread Reply:* Welcome!

    +

    *Thread Reply:* Welcome!

    @@ -2466,7 +2472,7 @@

    Group Direct Messages

    2020-12-30 22:29:43
    -

    *Thread Reply:* Would you mind sharing your main use cases for collecting lineage?

    +

    *Thread Reply:* Would you mind sharing your main use cases for collecting lineage?

    @@ -2548,7 +2554,7 @@

    Group Direct Messages

    2021-01-05 12:55:01
    -

    *Thread Reply:* Definitely, each dashboard should become a node in the lineage graph. That way you can understand all the dependencies of a given dashboard. Some examples of interesting metadata around this: is the dashboard updated in a timely fashion (data freshness); is the data correct (data quality)? Observing changes upstream of the dashboard will provide insights into what’s happening when freshness or quality suffer

    +

    *Thread Reply:* Definitely, each dashboard should become a node in the lineage graph. That way you can understand all the dependencies of a given dashboard. Some examples of interesting metadata around this: is the dashboard updated in a timely fashion (data freshness); is the data correct (data quality)? Observing changes upstream of the dashboard will provide insights into what’s happening when freshness or quality suffer

    @@ -2574,7 +2580,7 @@

    Group Direct Messages

    2021-01-05 13:20:41
    -

    *Thread Reply:* 100%. On a granular scale, the difference between a visualization and dashboard can be interesting. One visualization can be connected to multiple dashboards. But of course this depends on the BI tool, Redash would be an example in this case.

    +

    *Thread Reply:* 100%. On a granular scale, the difference between a visualization and dashboard can be interesting. One visualization can be connected to multiple dashboards. But of course this depends on the BI tool, Redash would be an example in this case.

    @@ -2600,7 +2606,7 @@

    Group Direct Messages

    2021-01-05 15:15:23
    -

    *Thread Reply:* We would need to decide how to model those things. Possibly as a Job type for dashboard and visualization.

    +

    *Thread Reply:* We would need to decide how to model those things. Possibly as a Job type for dashboard and visualization.

    @@ -2626,7 +2632,7 @@

    Group Direct Messages

    2021-01-06 18:20:06
    -

    *Thread Reply:* It could be. It’s interesting: in Redash, for example, you create custom queries that run at certain intervals to produce the data you need to visualize. Pretty much equivalent to a job. But you then build certain visualizations off of that “job”. Then you build dashboards off of visualizations. So you could model it as a job, or it could make sense for it to be modeled more like a dataset.

    +

    *Thread Reply:* It could be. It’s interesting: in Redash, for example, you create custom queries that run at certain intervals to produce the data you need to visualize. Pretty much equivalent to a job. But you then build certain visualizations off of that “job”. Then you build dashboards off of visualizations. So you could model it as a job, or it could make sense for it to be modeled more like a dataset.

    That’s the hard part of this. How do you model a visualization/dashboard across all the possible ways they can be created, since it differs depending on how the tool you use abstracts away creating a visualization.

    @@ -2688,7 +2694,7 @@

    Group Direct Messages

    2021-01-05 17:10:22
    -

    *Thread Reply:* Part of my role at Netflix is to oversee our data lineage story so very interested in this effort and hope to be able to participate in its success

    +

    *Thread Reply:* Part of my role at Netflix is to oversee our data lineage story so very interested in this effort and hope to be able to participate in its success

    @@ -2714,7 +2720,7 @@

    Group Direct Messages

    2021-01-05 18:12:48
    -

    *Thread Reply:* Hi Jason and welcome

    +

    *Thread Reply:* Hi Jason and welcome

    @@ -2804,7 +2810,7 @@

    Group Direct Messages

    2021-01-07 12:46:19
    -

    *Thread Reply:* The OpenLineage reference implementation in Marquez will be presented this morning Thursday (01/07) at 10AM PST, at the Marquez Community meeting.

    +

    *Thread Reply:* The OpenLineage reference implementation in Marquez will be presented this morning Thursday (01/07) at 10AM PST, at the Marquez Community meeting.

When: Thursday, January 7th at 10AM PST
Where: https://us02web.zoom.us/j/89344845719?pwd=Y09RZkxMZHc2U3pOTGZ6SnVMUUVoQT09

    @@ -2833,7 +2839,7 @@

    Group Direct Messages

    2021-01-07 12:46:36
    2021-01-07 12:46:44
    -

    *Thread Reply:* that’s in 15 min

    +

    *Thread Reply:* that’s in 15 min

    @@ -2885,7 +2891,7 @@

    Group Direct Messages

    2021-01-07 12:48:35
    2021-01-12 17:10:23
    -

    *Thread Reply:* And it’s merged!

    +

    *Thread Reply:* And it’s merged!

    @@ -2937,7 +2943,7 @@

    Group Direct Messages

    2021-01-12 17:10:53
    -

    *Thread Reply:* Marquez now has a reference implementation of the initial OpenLineage spec

    +

    *Thread Reply:* Marquez now has a reference implementation of the initial OpenLineage spec

    @@ -3019,7 +3025,7 @@

    Group Direct Messages

    2021-01-13 19:06:34
    -

*Thread Reply:* There’s no explicit roadmap so far. With the initial spec defined and the reference implementation implemented, next steps are to define more facets (for example, data shape, dataset size, etc.), provide clients to facilitate integrations (java, python, …), and implement more integrations (Spark in the works). Members of the community are welcome to drive their own initiatives around the core spec. One of the design goals of facets is to enable numerous and independent parallel efforts

+

*Thread Reply:* There’s no explicit roadmap so far. With the initial spec defined and the reference implementation implemented, next steps are to define more facets (for example, data shape, dataset size, etc.), provide clients to facilitate integrations (java, python, …), and implement more integrations (Spark in the works). Members of the community are welcome to drive their own initiatives around the core spec. One of the design goals of facets is to enable numerous and independent parallel efforts

    @@ -3045,7 +3051,7 @@

    Group Direct Messages

    2021-01-13 19:06:48
    -

*Thread Reply:* Is there something in particular you are interested in?

+

*Thread Reply:* Is there something in particular you are interested in?

    @@ -3230,7 +3236,7 @@

    Group Direct Messages

    2021-02-01 17:55:59
    -

*Thread Reply:* You can find the initial spec here: https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md
+

*Thread Reply:* You can find the initial spec here: https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md
The process to contribute to the model is described here: https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md
In particular, now we’d want to contribute more facets and integrations.
Marquez has a reference implementation: https://github.com/MarquezProject/marquez/pull/880

@@ -3294,7 +3300,7 @@

    Group Direct Messages

    2021-02-01 17:57:05
    -

    *Thread Reply:* It is not on general, as it can be a bit noisy

    +

    *Thread Reply:* It is not on general, as it can be a bit noisy

    @@ -3346,7 +3352,7 @@

    Group Direct Messages

    2021-02-09 20:02:48
    -

*Thread Reply:* Hi @Zachary Friedman! I replied below. https://openlineage.slack.com/archives/C01CK9T7HKR/p1612918909009900

+

*Thread Reply:* Hi @Zachary Friedman! I replied below. https://openlineage.slack.com/archives/C01CK9T7HKR/p1612918909009900

    @@ -3506,7 +3512,7 @@

    Group Direct Messages

    2021-02-09 21:27:26
    -

    *Thread Reply:* Feel free to comment, this is ready to merge

    +

    *Thread Reply:* Feel free to comment, this is ready to merge

    @@ -3532,7 +3538,7 @@

    Group Direct Messages

    2021-02-11 20:12:18
    -

    *Thread Reply:* Thanks, Julien. The new spec format looks great 👍

    +

    *Thread Reply:* Thanks, Julien. The new spec format looks great 👍

    @@ -3764,7 +3770,7 @@

    Group Direct Messages

    2021-02-19 14:27:23
    -

    *Thread Reply:* @Kevin Mellott and @Petr Šimeček Thanks for the confirmation on this slack message. To make your comment visible to the wider community, please chime in on the github issue as well: https://github.com/OpenLineage/OpenLineage/issues/20 +

    *Thread Reply:* @Kevin Mellott and @Petr Šimeček Thanks for the confirmation on this slack message. To make your comment visible to the wider community, please chime in on the github issue as well: https://github.com/OpenLineage/OpenLineage/issues/20 Thank you.

    @@ -3860,7 +3866,7 @@

    Group Direct Messages

    2021-02-19 14:27:46
    -

    *Thread Reply:* The PR is out for this: https://github.com/OpenLineage/OpenLineage/pull/23

    +

    *Thread Reply:* The PR is out for this: https://github.com/OpenLineage/OpenLineage/pull/23

    @@ -4044,7 +4050,7 @@

    Group Direct Messages

    2021-02-19 05:03:21
    -

    *Thread Reply:* Okay, I found the spark integration in Marquez calls the /lineage endpoint. But I am still curious about the future plan to integrate with other tools, like airflow?

    +

    *Thread Reply:* Okay, I found the spark integration in Marquez calls the /lineage endpoint. But I am still curious about the future plan to integrate with other tools, like airflow?

    @@ -4070,7 +4076,7 @@

    Group Direct Messages

    2021-02-19 12:41:23
    -

*Thread Reply:* Just restating some of my answers from the Marquez slack for the benefit of folks here.

+

*Thread Reply:* Just restating some of my answers from the Marquez slack for the benefit of folks here.

• OpenLineage defines the schema to collect metadata
• Marquez has a /lineage endpoint implementing the OpenLineage spec to receive this metadata, implemented by the OpenLineageResource you pointed out

@@ -4103,7 +4109,7 @@

    Group Direct Messages

    2021-02-22 03:55:18
    -

    *Thread Reply:* thank you! very clear answer🙂

    +

    *Thread Reply:* thank you! very clear answer🙂

    @@ -4155,7 +4161,7 @@

    Group Direct Messages

    2021-03-02 17:51:29
    -

    *Thread Reply:* I forgot to reply in thread 🙂 https://openlineage.slack.com/archives/C01CK9T7HKR/p1614725462008300

    +

    *Thread Reply:* I forgot to reply in thread 🙂 https://openlineage.slack.com/archives/C01CK9T7HKR/p1614725462008300

    @@ -4273,7 +4279,7 @@

    Group Direct Messages

    2021-03-16 15:07:04
    -

    *Thread Reply:* Ha, I signed up just to ask this precise question!

    +

    *Thread Reply:* Ha, I signed up just to ask this precise question!

    @@ -4303,7 +4309,7 @@

    Group Direct Messages

    2021-03-16 15:07:44
    -

    *Thread Reply:* I’m still looking into the spec myself. Are we required to have 1 or more runs per Job? Or can a Job exist without a run event?

    +

    *Thread Reply:* I’m still looking into the spec myself. Are we required to have 1 or more runs per Job? Or can a Job exist without a run event?

    @@ -4329,7 +4335,7 @@

    Group Direct Messages

    2021-04-02 07:24:39
    -

*Thread Reply:* A run event can be emitted when it starts, and it can stay in the RUNNING state unless something happens to the job. Additionally, you could send events periodically with state RUNNING to inform the system that the job is healthy.

+

*Thread Reply:* A run event can be emitted when it starts, and it can stay in the RUNNING state unless something happens to the job. Additionally, you could send events periodically with state RUNNING to inform the system that the job is healthy.
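For illustration, a minimal sketch of that lifecycle in Python - the run ID, namespace, job name, and producer URL below are hypothetical, and the heartbeat cadence is up to the producer:

```
import uuid
from datetime import datetime, timezone

run_id = str(uuid.uuid4())  # one run ID shared by all events of this run

def run_event(event_type):
    # START once, then periodic RUNNING heartbeats, then COMPLETE/FAIL/ABORT
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": "my-namespace", "name": "my-job"},
        "producer": "https://example.com/my-producer/v1.0",
    }

lifecycle = [run_event("START"), run_event("RUNNING"), run_event("COMPLETE")]
```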

    @@ -4515,7 +4521,7 @@

    Group Direct Messages

    2021-03-17 10:05:53
    -

*Thread Reply:* That's a good idea.

+

*Thread Reply:* That's a good idea.

Now I'm wondering - let's say we want to track at which offset a checkpoint ended processing. That would mean we want to expose the checkpoint id, time, and offset. I suppose we don't want to overwrite previous checkpoint info, so we want to have some collection of data in this facet.

    @@ -4572,7 +4578,7 @@

    Group Direct Messages

    2021-03-17 09:18:49
    -

    *Thread Reply:* Thanks Julien! I will try to wrap my head around some use-cases and see how it maps to the current spec. From there, I can see if I can figure out any proposals

    +

    *Thread Reply:* Thanks Julien! I will try to wrap my head around some use-cases and see how it maps to the current spec. From there, I can see if I can figure out any proposals

    @@ -4598,7 +4604,7 @@

    Group Direct Messages

    2021-03-17 13:43:29
    -

    *Thread Reply:* You can use the proposal issue template to propose a new facet for example: https://github.com/OpenLineage/OpenLineage/issues/new/choose

    +

    *Thread Reply:* You can use the proposal issue template to propose a new facet for example: https://github.com/OpenLineage/OpenLineage/issues/new/choose

    @@ -4794,7 +4800,7 @@

    Group Direct Messages

    2021-03-17 16:42:24
    -

*Thread Reply:* One thing we were considering is just adding these in as Facets (Tags, as per Marquez), and then plugging into some external people-management system. However, I think the question can be generalized to “should there be some sort of generic entity that can enable relationships between itself and Datasets, Jobs, and Runs as part of an integration element?”

+

*Thread Reply:* One thing we were considering is just adding these in as Facets (Tags, as per Marquez), and then plugging into some external people-management system. However, I think the question can be generalized to “should there be some sort of generic entity that can enable relationships between itself and Datasets, Jobs, and Runs as part of an integration element?”

    @@ -4820,7 +4826,7 @@

    Group Direct Messages

    2021-03-18 16:03:55
    -

*Thread Reply:* That’s a great topic of discussion. I would definitely use the OpenLineage facets to capture what you describe as an aspect above. The current Marquez model has a simple notion of ownership at the namespace level, but this needs to be extended to enable the use cases you are describing (owning a dataset or a job). Right now the owner is just a generic identifier as a string (a user id or a group id, for example). Once things are tagged (in some way), you can use the lineage API to find all the downstream or upstream jobs and datasets. In OpenLineage I would start by being able to capture the owner identifier in a facet, with contact info optional if it’s available at runtime. It will have the advantage of keeping track of how that changed over time. This definitely deserves its own discussion.

+

*Thread Reply:* That’s a great topic of discussion. I would definitely use the OpenLineage facets to capture what you describe as an aspect above. The current Marquez model has a simple notion of ownership at the namespace level, but this needs to be extended to enable the use cases you are describing (owning a dataset or a job). Right now the owner is just a generic identifier as a string (a user id or a group id, for example). Once things are tagged (in some way), you can use the lineage API to find all the downstream or upstream jobs and datasets. In OpenLineage I would start by being able to capture the owner identifier in a facet, with contact info optional if it’s available at runtime. It will have the advantage of keeping track of how that changed over time. This definitely deserves its own discussion.

    @@ -4846,7 +4852,7 @@

    Group Direct Messages

    2021-03-18 17:52:13
    -

    *Thread Reply:* And also to make sure I understand your use case, you want to be able to notify the consumers of a dataset that it is being discontinued/replaced/… ? What else are you thinking about?

    +

    *Thread Reply:* And also to make sure I understand your use case, you want to be able to notify the consumers of a dataset that it is being discontinued/replaced/… ? What else are you thinking about?

    @@ -4872,7 +4878,7 @@

    Group Direct Messages

    2021-03-22 09:15:19
    -

    *Thread Reply:* Let me pull in my colleagues

    +

    *Thread Reply:* Let me pull in my colleagues

    @@ -4898,7 +4904,7 @@

    Group Direct Messages

    2021-03-22 09:15:24
    -

    *Thread Reply:* Standby

    +

    *Thread Reply:* Standby

    @@ -4924,7 +4930,7 @@

    Group Direct Messages

    2021-03-22 10:59:57
    -

    *Thread Reply:* 👋 Hi Julien. I’m Olessia, I’m working on the metadata collection implementation with Adam. Some thought on this:

    +

    *Thread Reply:* 👋 Hi Julien. I’m Olessia, I’m working on the metadata collection implementation with Adam. Some thought on this:

    @@ -4950,7 +4956,7 @@

    Group Direct Messages

    2021-03-22 11:00:45
    -

*Thread Reply:* To start off, we’re thinking that there often isn’t a single owner, but rather a set of Stakeholders that evolve over time. So we’d like to be able to attach multiple entries, possibly of different types, to a Dataset. We’re also thinking that a dataset should have at least one owner. So a few things I’d like to confirm/discuss options:

• If I were to stay true to the spec as it’s defined atm, I wouldn’t be able to add a required facet. True/false?
• According to the readme, “...emitting a new facet with the same name for the same entity replaces the previous facet instance for that entity entirely”. If we were to store multiple stakeholders, we’d have a field “stakeholders” and its value would be a list? This would make queries involving stakeholders not very straightforward. If the facet is overwritten every time, how do I a) add individuals to the list, b) track changes to the list over time? Let me know what I’m missing, because based on what you said above, tracking facet changes over time is possible.
• Run events are issued by a scheduler. Why should it be in the domain of the scheduler to know the entire list of Stakeholders?
• I noticed that Marquez has separate endpoints to capture information about Datasets, and some additional information beyond what’s described in the spec is required. In this context, we could add a required Stakeholder facet on a Dataset, and potentially even additional endpoints to add and remove Stakeholders. Is that a valid way to go about this, in your opinion?
+

  *Thread Reply:* To start off, we’re thinking that there often isn’t a single owner, but rather a set of Stakeholders that evolve over time. So we’d like to be able to attach multiple entries, possibly of different types, to a Dataset. We’re also thinking that a dataset should have at least one owner. So a few things I’d like to confirm/discuss options:

  • If I were to stay true to the spec as it’s defined atm, I wouldn’t be able to add a required facet. True/false?
  • According to the readme, “...emitting a new facet with the same name for the same entity replaces the previous facet instance for that entity entirely”. If we were to store multiple stakeholders, we’d have a field “stakeholders” and its value would be a list? This would make queries involving stakeholders not very straightforward. If the facet is overwritten every time, how do I a) add individuals to the list, b) track changes to the list over time? Let me know what I’m missing, because based on what you said above, tracking facet changes over time is possible.
  • Run events are issued by a scheduler. Why should it be in the domain of the scheduler to know the entire list of Stakeholders?
  • I noticed that Marquez has separate endpoints to capture information about Datasets, and some additional information beyond what’s described in the spec is required. In this context, we could add a required Stakeholder facet on a Dataset, and potentially even additional endpoints to add and remove Stakeholders. Is that a valid way to go about this, in your opinion?

  Curious to hear your thoughts on all of this!

      @@ -4979,7 +4985,7 @@

      Group Direct Messages

    2021-03-24 17:06:50
    -

*Thread Reply:* > To start off, we’re thinking that there often isn’t a single owner, but rather a set of Stakeholders that evolve over time. So we’d like to be able to attach multiple entries, possibly of different types, to a Dataset. We’re also thinking > that a dataset should have at least one owner. So a few things I’d like to confirm/discuss options:
+

*Thread Reply:* > To start off, we’re thinking that there often isn’t a single owner, but rather a set of Stakeholders that evolve over time. So we’d like to be able to attach multiple entries, possibly of different types, to a Dataset. We’re also thinking > that a dataset should have at least one owner. So a few things I’d like to confirm/discuss options:
> -> If I were to stay true to the spec as it’s defined atm I wouldn’t be able to add a required facet. True/false?
Correct. The spec defines what facets look like (and how you can make your own custom facets), but it does not make statements about whether facets are required. However, you can have your own validation and make certain things required if you wish on the client side.

@@ -5025,7 +5031,7 @@

    Group Direct Messages

    2021-03-24 17:06:50
    -

    *Thread Reply:* +

*Thread Reply:* Marquez existed before OpenLineage. In particular, the /run end-point to create and update runs will be deprecated as the OpenLineage /lineage endpoint replaces it. At the moment we are mapping OpenLineage metadata to Marquez. Soon Marquez will have all the facets exposed in the Marquez API. (See: https://github.com/MarquezProject/marquez/pull/894/files) We could make Marquez configurable or pluggable for validation purposes. There is already a notion of a LineageListener, for example. Although Marquez collects the metadata, I feel like this validation would be better upstream or with some other mechanism. The question is: when do you create a dataset vs when do you become a stakeholder? What are the various stakeholders, and what is the responsibility of the minimum one stakeholder? I would probably make it required that the stakeholder is defined in order to deploy the job. This would apply to the output dataset and would be collected in Marquez.
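As a rough illustration of the facet-based approach discussed above, a hypothetical custom stakeholders facet attached to an output dataset might look like the sketch below. The facet name, fields, and URLs are invented for the example; only the _producer and _schemaURL keys come from the spec's base facet definition.

```
# A hypothetical "stakeholders" dataset facet - illustrative only,
# not part of the OpenLineage spec.
stakeholders_facet = {
    "stakeholders": {
        "_producer": "https://example.com/my-producer/v1.0",
        "_schemaURL": "https://example.com/schemas/StakeholdersFacet.json",
        "stakeholders": [
            {"type": "owner", "id": "team-data-platform"},
            {"type": "consumer", "id": "user:olessia"},
        ],
    }
}

output_dataset = {
    "namespace": "my-source",
    "name": "my_db.my_schema.my_table",
    "facets": stakeholders_facet,
}
```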

    @@ -5059,7 +5065,7 @@

    Group Direct Messages

    2021-05-24 16:27:03
    -

    *Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1621887895004200

    +

    *Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1621887895004200

    @@ -5266,7 +5272,7 @@

    Group Direct Messages

    2021-04-02 19:54:52
    -

*Thread Reply:* Marquez is the reference implementation for receiving events and tracking changes. But the definition of the API lets others receive them (and also enables using openlineage events to sync between systems)

+

*Thread Reply:* Marquez is the reference implementation for receiving events and tracking changes. But the definition of the API lets others receive them (and also enables using openlineage events to sync between systems)

    @@ -5292,7 +5298,7 @@

    Group Direct Messages

    2021-04-02 19:55:32
    -

    *Thread Reply:* In particular, Egeria is involved in enabling receiving and emitting openlineage

    +

    *Thread Reply:* In particular, Egeria is involved in enabling receiving and emitting openlineage

    @@ -5318,7 +5324,7 @@

    Group Direct Messages

    2021-04-03 18:03:01
    -

    *Thread Reply:* Thanks @Julien Le Dem. So to get specific, if dbt were to emit OpenLineage events, how would this work? Would dbt Cloud hypothetically allow users to configure an endpoint to send OpenLineage events to, similar in UI implementation to configuring a Stripe webhook perhaps? And then whatever server the user would input here would point to somewhere that implements receipt of OpenLineage payloads? This is all a very hypothetical example, but trying to ground it in something I have a solid mental model for.

    +

    *Thread Reply:* Thanks @Julien Le Dem. So to get specific, if dbt were to emit OpenLineage events, how would this work? Would dbt Cloud hypothetically allow users to configure an endpoint to send OpenLineage events to, similar in UI implementation to configuring a Stripe webhook perhaps? And then whatever server the user would input here would point to somewhere that implements receipt of OpenLineage payloads? This is all a very hypothetical example, but trying to ground it in something I have a solid mental model for.

    @@ -5344,7 +5350,7 @@

    Group Direct Messages

    2021-04-05 17:51:57
    -

    *Thread Reply:* hypothetically speaking, that all sounds right. so a user, who, e.g., has a dbt pipeline and an AWS glue pipeline could configure both of those projects to point to the same open lineage service and get their entire lineage graph even if the two pipelines aren't connected.

    +

    *Thread Reply:* hypothetically speaking, that all sounds right. so a user, who, e.g., has a dbt pipeline and an AWS glue pipeline could configure both of those projects to point to the same open lineage service and get their entire lineage graph even if the two pipelines aren't connected.

    @@ -5370,7 +5376,7 @@

    Group Direct Messages

    2021-04-06 20:33:51
    -

    *Thread Reply:* Yeah, OpenLineage events need to be published to a backend (can be Kafka, can be a graphDB, etc). Your Stripe webhook analogy is aligned with how events can be received. For example, in Marquez, we expose a /lineage endpoint that consumes OpenLineage events. We then map an OpenLineage event to the Marquez model (sources, datasets, jobs, runs) that’s persisted in postgres.

    +

    *Thread Reply:* Yeah, OpenLineage events need to be published to a backend (can be Kafka, can be a graphDB, etc). Your Stripe webhook analogy is aligned with how events can be received. For example, in Marquez, we expose a /lineage endpoint that consumes OpenLineage events. We then map an OpenLineage event to the Marquez model (sources, datasets, jobs, runs) that’s persisted in postgres.
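A minimal sketch of that flow, assuming a local Marquez instance on its default port (the run ID, names, and producer URL are illustrative):

```
import requests

ol_event = {
    "eventType": "COMPLETE",
    "eventTime": "2021-04-06T00:00:00Z",
    "run": {"runId": "3f5e83fa-3480-44bf-8f32-5c1a622a1a44"},
    "job": {"namespace": "my-namespace", "name": "my-job"},
    "inputs": [],
    "outputs": [],
    "producer": "https://example.com/my-producer/v1.0",
}

# Marquez maps the received event to sources/datasets/jobs/runs in postgres.
resp = requests.post("http://localhost:5000/api/v1/lineage", json=ol_event)
resp.raise_for_status()
```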

    @@ -5396,7 +5402,7 @@

    Group Direct Messages

    2021-04-07 10:47:06
    -

    *Thread Reply:* Thanks both!

    +

    *Thread Reply:* Thanks both!

    @@ -5422,7 +5428,7 @@

    Group Direct Messages

    2021-04-13 20:52:53
    -

    *Thread Reply:* sorry, I was away last week. Yes that sounds right.

    +

    *Thread Reply:* sorry, I was away last week. Yes that sounds right.

    @@ -5474,7 +5480,7 @@

    Group Direct Messages

    2021-04-14 13:12:31
    -

*Thread Reply:* Welcome, @Jakub Moravec 👋 . Given that you're able to retrieve metadata using the Marquez API, you should be able to also view dataset and job metadata in the UI. Mind using the search bar in the top right-hand corner of the UI to see if your metadata is searchable? The UI only renders jobs and datasets that are connected in the lineage graph. We're working towards a more general metadata exploration experience, but currently the lineage graph is the main experience.

+

*Thread Reply:* Welcome, @Jakub Moravec 👋 . Given that you're able to retrieve metadata using the Marquez API, you should be able to also view dataset and job metadata in the UI. Mind using the search bar in the top right-hand corner of the UI to see if your metadata is searchable? The UI only renders jobs and datasets that are connected in the lineage graph. We're working towards a more general metadata exploration experience, but currently the lineage graph is the main experience.

    @@ -5530,7 +5536,7 @@

    Group Direct Messages

    2021-04-13 20:56:42
    -

    *Thread Reply:* Your intuition is right here. I think we should define an input facet that specifies which dataset version is being read. Similarly you would have an output facet that specifies what version is being produced. This would apply to storage layers like Deltalake and Iceberg as well.

    +

    *Thread Reply:* Your intuition is right here. I think we should define an input facet that specifies which dataset version is being read. Similarly you would have an output facet that specifies what version is being produced. This would apply to storage layers like Deltalake and Iceberg as well.
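Sketching the idea: a hypothetical input facet recording which dataset version a run read might look like the following. The facet name and fields are illustrative; no such facet was defined in the spec at this point.

```
# A hypothetical dataset-version input facet - illustrative only.
input_dataset = {
    "namespace": "my-source",
    "name": "warehouse.trips",
    "facets": {
        "version": {
            "_producer": "https://example.com/my-producer/v1.0",
            "_schemaURL": "https://example.com/schemas/DatasetVersionFacet.json",
            # e.g. an Iceberg/Delta snapshot id or a timestamped version
            "datasetVersion": "2021-04-13T00:00:00Z",
        }
    },
}
```

Because the facet is attached to the run's inputs, it records what that run read without modifying the dataset itself, which avoids the race condition mentioned above.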

    @@ -5556,7 +5562,7 @@

    Group Direct Messages

    2021-04-13 20:57:58
    -

    *Thread Reply:* Regarding the race condition, input and output facets are attached to the run. The version of the dataset that was read is an attribute of a run and should not modify the dataset itself.

    +

    *Thread Reply:* Regarding the race condition, input and output facets are attached to the run. The version of the dataset that was read is an attribute of a run and should not modify the dataset itself.

    @@ -5582,7 +5588,7 @@

    Group Direct Messages

    2021-04-13 21:01:34
    -

    *Thread Reply:* See the Dataset description here: https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md#core-lineage-model

    +

    *Thread Reply:* See the Dataset description here: https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md#core-lineage-model

    @@ -5787,7 +5793,7 @@

    Group Direct Messages

    2021-04-19 16:17:06
    -

    *Thread Reply:* I think for Trino integration you'd be looking at writing a Trino extractor if I'm not mistaken, yes?

    +

    *Thread Reply:* I think for Trino integration you'd be looking at writing a Trino extractor if I'm not mistaken, yes?

    @@ -5813,7 +5819,7 @@

    Group Direct Messages

    2021-04-19 16:17:23
    -

    *Thread Reply:* But extractor would obviously be at the Marquez layer not OpenLineage

    +

    *Thread Reply:* But extractor would obviously be at the Marquez layer not OpenLineage

    @@ -5839,7 +5845,7 @@

    Group Direct Messages

    2021-04-19 16:19:00
    -

    *Thread Reply:* And hopefully the metadata you'd be looking to extract from Trino wouldn't have any connector-specific syntax restrictions.

    +

    *Thread Reply:* And hopefully the metadata you'd be looking to extract from Trino wouldn't have any connector-specific syntax restrictions.

    @@ -5891,7 +5897,7 @@

    Group Direct Messages

    2021-04-19 21:42:14
    -

    *Thread Reply:* This is a proper usage. That schema is optional if it’s not available.

    +

    *Thread Reply:* This is a proper usage. That schema is optional if it’s not available.

    @@ -5917,7 +5923,7 @@

    Group Direct Messages

    2021-04-19 21:43:27
    -

    *Thread Reply:* You would model it as a job reading from a folder (the input dataset) in the input bucket and writing to a folder (the output dataset) in the output bucket

    +

    *Thread Reply:* You would model it as a job reading from a folder (the input dataset) in the input bucket and writing to a folder (the output dataset) in the output bucket

    @@ -5943,7 +5949,7 @@

    Group Direct Messages

    2021-04-19 21:43:58
    -

    *Thread Reply:* This is similar to how this is modeled in the spark integration (spark job reading and writing to s3 buckets)

    +

    *Thread Reply:* This is similar to how this is modeled in the spark integration (spark job reading and writing to s3 buckets)

    @@ -5969,7 +5975,7 @@

    Group Direct Messages

    2021-04-19 21:47:06
    -

    *Thread Reply:* for reference: getting the urls for the inputs: https://github.com/MarquezProject/marquez/blob/c5e5d7b8345e347164aa5aa173e8cf35062[…]marquez/spark/agent/lifecycle/plan/HadoopFsRelationVisitor.java

    +

    *Thread Reply:* for reference: getting the urls for the inputs: https://github.com/MarquezProject/marquez/blob/c5e5d7b8345e347164aa5aa173e8cf35062[…]marquez/spark/agent/lifecycle/plan/HadoopFsRelationVisitor.java

    @@ -6020,7 +6026,7 @@

    Group Direct Messages

    2021-04-19 21:47:54
    -

    *Thread Reply:* getting the output URL: https://github.com/MarquezProject/marquez/blob/c5e5d7b8345e347164aa5aa173e8cf35062[…]ark/agent/lifecycle/plan/InsertIntoHadoopFsRelationVisitor.java

    +

    *Thread Reply:* getting the output URL: https://github.com/MarquezProject/marquez/blob/c5e5d7b8345e347164aa5aa173e8cf35062[…]ark/agent/lifecycle/plan/InsertIntoHadoopFsRelationVisitor.java

    @@ -6071,7 +6077,7 @@

    Group Direct Messages

    2021-04-19 21:48:48
    -

    *Thread Reply:* See the spec (comments welcome) for the naming of S3 datasets: https://github.com/OpenLineage/OpenLineage/pull/31/files#diff-e3a8184544e9bc70d8a12e76b58b109051c182a914f0b28529680e6ced0e2a1cR87

    +

    *Thread Reply:* See the spec (comments welcome) for the naming of S3 datasets: https://github.com/OpenLineage/OpenLineage/pull/31/files#diff-e3a8184544e9bc70d8a12e76b58b109051c182a914f0b28529680e6ced0e2a1cR87
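Putting the above together, a sketch of the bucket-to-bucket job as a single OpenLineage event, following the s3://{bucket} namespace convention from the naming spec (bucket, folder, and job names are made up):

```
# Folders in S3 buckets modeled as input/output datasets of one job.
ol_event = {
    "eventType": "COMPLETE",
    "eventTime": "2021-04-19T00:00:00Z",
    "run": {"runId": "f46e465f-9c35-4fbd-9e25-bd0ae5d4d38d"},
    "job": {"namespace": "my-namespace", "name": "copy_input_to_output"},
    "inputs": [{"namespace": "s3://input-bucket", "name": "raw/trips"}],
    "outputs": [{"namespace": "s3://output-bucket", "name": "curated/trips"}],
    "producer": "https://example.com/my-producer/v1.0",
}
```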

    @@ -6097,7 +6103,7 @@

    Group Direct Messages

    2021-04-20 11:11:38
    -

    *Thread Reply:* Hey Julien, thank you so much for getting back to me. I'll take a look at the documentation/implementations you've sent me and will reach out if I have anymore questions. Thanks again!

    +

    *Thread Reply:* Hey Julien, thank you so much for getting back to me. I'll take a look at the documentation/implementations you've sent me and will reach out if I have anymore questions. Thanks again!

    @@ -6123,7 +6129,7 @@

    Group Direct Messages

    2021-04-20 17:39:24
    -

    *Thread Reply:* @Julien Le Dem I left a quick comment on that spec PR you mentioned. Just wanted to let you know.

    +

    *Thread Reply:* @Julien Le Dem I left a quick comment on that spec PR you mentioned. Just wanted to let you know.

    @@ -6149,7 +6155,7 @@

    Group Direct Messages

    2021-04-20 17:49:15
    -

    *Thread Reply:* thanks

    +

    *Thread Reply:* thanks

    @@ -6207,7 +6213,7 @@

    Group Direct Messages

    2021-04-28 20:56:53
    -

    *Thread Reply:* Thank you! Please do fix typos, I’ll approve your PR.

    +

    *Thread Reply:* Thank you! Please do fix typos, I’ll approve your PR.

    @@ -6233,7 +6239,7 @@

    Group Direct Messages

    2021-04-28 23:21:44
    -

    *Thread Reply:* No problem. Here's the PR. https://github.com/OpenLineage/OpenLineage/pull/47

    +

    *Thread Reply:* No problem. Here's the PR. https://github.com/OpenLineage/OpenLineage/pull/47

    @@ -6259,7 +6265,7 @@

    Group Direct Messages

    2021-04-28 23:22:41
    -

    *Thread Reply:* Once I fixed the ones I saw I figured "Why not just run it through a spell checker just in case... " and found a few additional ones.

    +

    *Thread Reply:* Once I fixed the ones I saw I figured "Why not just run it through a spell checker just in case... " and found a few additional ones.

    @@ -6406,7 +6412,7 @@

    Group Direct Messages

    2021-05-21 19:20:55
    -

    *Thread Reply:* Huge milestone! 🙌💯🎊

    +

    *Thread Reply:* Huge milestone! 🙌💯🎊

    @@ -6492,7 +6498,7 @@

    Group Direct Messages

    2021-06-01 17:56:16
    -

    *Thread Reply:* We currently don’t have S3 (or distributed filesystem specific facets) at the moment, but such support would be a great addition! @Julien Le Dem would be best to answer if any work has been done in this area 🙂

    +

    *Thread Reply:* We currently don’t have S3 (or distributed filesystem specific facets) at the moment, but such support would be a great addition! @Julien Le Dem would be best to answer if any work has been done in this area 🙂

    @@ -6518,7 +6524,7 @@

    Group Direct Messages

    2021-06-01 17:57:19
    -

    *Thread Reply:* Also, happy to answer any Marquez specific questions, @Jonathon Mitchal when you’re thinking of making the move. Marquez supports OpenLineage out of the box 🙌

    +

    *Thread Reply:* Also, happy to answer any Marquez specific questions, @Jonathon Mitchal when you’re thinking of making the move. Marquez supports OpenLineage out of the box 🙌

    @@ -6544,7 +6550,7 @@

    Group Direct Messages

    2021-06-01 19:58:21
    -

    *Thread Reply:* @Jonathon Mitchal You can follow the naming strategy here for referring to a S3 dataset: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md#s3

    +

    *Thread Reply:* @Jonathon Mitchal You can follow the naming strategy here for referring to a S3 dataset: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md#s3

    @@ -6740,7 +6746,7 @@

    Group Direct Messages

    2021-06-01 19:59:30
    -

    *Thread Reply:* There is no facet yet for the attributes of a Parquet file. I can give you feedback if you want to start defining one. https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md#proposing-changes

    +

    *Thread Reply:* There is no facet yet for the attributes of a Parquet file. I can give you feedback if you want to start defining one. https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md#proposing-changes

    @@ -6834,7 +6840,7 @@

    Group Direct Messages

    2021-06-01 20:00:50
    -

    *Thread Reply:* Adding Parquet metadata as a facet would make a lot of sense. It is mainly a matter of specifying what the json would look like

    +

    *Thread Reply:* Adding Parquet metadata as a facet would make a lot of sense. It is mainly a matter of specifying what the json would look like
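As a starting point for that discussion, here is one possible, purely hypothetical shape for such a facet, loosely derived from the thrift metadata linked below; none of these field names exist in the spec:

```
# A hypothetical Parquet metadata facet - illustrative only.
parquet_facet = {
    "parquet": {
        "_producer": "https://example.com/my-producer/v1.0",
        "_schemaURL": "https://example.com/schemas/ParquetFacet.json",
        "numRows": 1000000,
        "rowGroups": 8,
        "createdBy": "parquet-mr version 1.12.0",
        "columns": [
            {"name": "id", "type": "INT64",
             "compression": "SNAPPY", "nullCount": 0},
        ],
    }
}
```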

    @@ -6860,7 +6866,7 @@

    Group Direct Messages

    2021-06-01 20:01:54
    -

    *Thread Reply:* for reference the parquet metadata is defined here: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift

    +

    *Thread Reply:* for reference the parquet metadata is defined here: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift

    @@ -6919,7 +6925,7 @@

    Group Direct Messages

/** See LogicalTypes.md for conversion between ConvertedType and LogicalType. */
enum ConvertedType {
  /** a BYTE_ARRAY actually contains UTF8 encoded chars */
  UTF8 = 0;

  /** a map is converted as an optional field containing a repeated key/value pair */

@@ -7052,7 +7058,7 @@

    Group Direct Messages

/** Representation of Schemas */
enum FieldRepetitionType {
  /** This field is required (can not be null) and each record has exactly 1 value. */
  REQUIRED = 0;

  /** The field is optional (can be null) and each record has 0 or 1 values. */

@@ -7067,31 +7073,31 @@

    Group Direct Messages

/** All fields are optional. */
struct Statistics {
   /**
    * DEPRECATED: min and max value of the column. Use min_value and max_value.
    *
    * Values are encoded using PLAIN encoding, except that variable-length byte
    * arrays do not include a length prefix.
    *
    * These fields encode min and max values determined by signed comparison
    * only. New files should use the correct order for a column's logical type
    * and store the values in the min_value and max_value fields.
    *
    * To support older readers, these may be set when the column order is
    * signed.
    */
   1: optional binary max;
   2: optional binary min;
   /** count of null value in the column */
   3: optional i64 null_count;
   /** count of distinct values occurring */
   4: optional i64 distinct_count;
   /**
    * Min and max values for the column, determined by its ColumnOrder.
    *
    * Values are encoded using PLAIN encoding, except that variable-length byte
    * arrays do not include a length prefix.
    */
   5: optional binary max_value;
   6: optional binary min_value;
}

@@ -7180,7 +7186,7 @@

    Group Direct Messages

    2021-06-01 23:20:50
    -

*Thread Reply:* That's awesome, thanks for the guidance Willy and Julien ... will report back on how we get on

+

*Thread Reply:* That's awesome, thanks for the guidance Willy and Julien ... will report back on how we get on

    @@ -7240,7 +7246,7 @@

    Group Direct Messages

    2021-06-01 20:02:34
    -

*Thread Reply:* Welcome! Let us know if you have any questions

+

*Thread Reply:* Welcome! Let us know if you have any questions

    @@ -7309,7 +7315,7 @@

    Group Direct Messages

    2021-06-04 04:45:15
    -

    *Thread Reply:* > Is it the case that Open Lineage defines the general framework but doesn’t actually enforce push or pull-based implementations, it just so happens that the reference implementation (Marquez) uses push? +

*Thread Reply:* > Is it the case that Open Lineage defines the general framework but doesn’t actually enforce push or pull-based implementations, it just so happens that the reference implementation (Marquez) uses push?
Yes, at its core OpenLineage just enforces the format of the event. We also aim to provide clients - REST, later Kafka, etc. - and some reference implementations, which are now in the Marquez repo. https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/doc/Scope.png

There are several differences between push and poll models. The most important one is that with a push model, latency between your job and emitting OpenLineage events is very low. With some systems, an internal, push-based model gives you more runtime metadata than observing from the outside. Another is that a naive poll implementation would need to "rebuild the world" on each change. There are also disadvantages: usually, it's easier to write a plugin that extracts data from outside a system than to hook into its internals.

    @@ -7394,7 +7400,7 @@

    Group Direct Messages

    2021-06-04 10:39:51
    -

    *Thread Reply:* This is really helpful, thank you @Maciej Obuchowski!

    +

    *Thread Reply:* This is really helpful, thank you @Maciej Obuchowski!

    @@ -7420,7 +7426,7 @@

    Group Direct Messages

    2021-06-04 10:40:59
    -

    *Thread Reply:* Similar to what you say about push vs pull, I found DataHub’s comment to be interesting yesterday: +

    *Thread Reply:* Similar to what you say about push vs pull, I found DataHub’s comment to be interesting yesterday: > Push is better than pull: While pulling metadata directly from the source seems like the most straightforward way to gather metadata, developing and maintaining a centralized fleet of domain-specific crawlers quickly becomes a nightmare. It is more scalable to have individual metadata providers push the information to the central repository via APIs or messages. This push-based approach also ensures a more timely reflection of new and updated metadata.

    @@ -7447,7 +7453,7 @@

    Group Direct Messages

    2021-06-04 21:59:59
    -

    *Thread Reply:* yes. You can also “pull-to-push” for things that don’t push.

    +

    *Thread Reply:* yes. You can also “pull-to-push” for things that don’t push.

    @@ -7473,7 +7479,7 @@

    Group Direct Messages

    2021-06-17 10:01:37
    -

    *Thread Reply:* @Maciej Obuchowski any particular reason for bypassing databuilder and go directly to neo4j? By design databuilder is supposed to be very abstract so any kind of backend can be used with Amundsen. Currently there are at least 4 and neo4j is just one of them.

    +

    *Thread Reply:* @Maciej Obuchowski any particular reason for bypassing databuilder and go directly to neo4j? By design databuilder is supposed to be very abstract so any kind of backend can be used with Amundsen. Currently there are at least 4 and neo4j is just one of them.

    @@ -7499,7 +7505,7 @@

    Group Direct Messages

    2021-06-17 10:28:52
    -

    *Thread Reply:* Databuilder's pull model is very different than OpenLineage's push model, where the events are generated while the dataset itself is generated.

    +

    *Thread Reply:* Databuilder's pull model is very different than OpenLineage's push model, where the events are generated while the dataset itself is generated.

    So, how would you see using it? Just to proxy the events to concrete search and metadata backend?

    @@ -7529,7 +7535,7 @@

    Group Direct Messages

    2021-07-07 19:59:28
    -

    *Thread Reply:* @Mariusz Górski my slide that Maciej is referring to might be a bit misleading. The Amundsen integration does not exist yet. Please add your input in the ticket: https://github.com/OpenLineage/OpenLineage/issues/86

    +

    *Thread Reply:* @Mariusz Górski my slide that Maciej is referring to might be a bit misleading. The Amundsen integration does not exist yet. Please add your input in the ticket: https://github.com/OpenLineage/OpenLineage/issues/86

    @@ -7579,7 +7585,7 @@

    Group Direct Messages

    2021-07-09 02:22:06
    -

    *Thread Reply:* thanks Julien! will take a look

    +

    *Thread Reply:* thanks Julien! will take a look

    @@ -7631,7 +7637,7 @@

    Group Direct Messages

    2021-06-08 14:16:42
    -

    *Thread Reply:* Welcome! You can look here: https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md

    +

    *Thread Reply:* Welcome! You can look here: https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md

    @@ -7725,7 +7731,7 @@

    Group Direct Messages

    2021-06-08 14:17:19
    -

    *Thread Reply:* We’re starting a monthly call, I will publish more details here

    +

    *Thread Reply:* We’re starting a monthly call, I will publish more details here

    @@ -7751,7 +7757,7 @@

    Group Direct Messages

    2021-06-08 14:17:48
    -

    *Thread Reply:* Do you have a specific use case in mind?

    +

    *Thread Reply:* Do you have a specific use case in mind?

    @@ -7777,7 +7783,7 @@

    Group Direct Messages

    2021-06-08 21:32:02
    -

    *Thread Reply:* Nothing specific yet

    +

    *Thread Reply:* Nothing specific yet

    @@ -7854,7 +7860,7 @@

    Group Direct Messages

    2021-06-09 08:33:45
    -

    *Thread Reply:* Hey @Julien Le Dem, I can’t add a link to my calendar… Can you send an invite?

    +

    *Thread Reply:* Hey @Julien Le Dem, I can’t add a link to my calendar… Can you send an invite?

    @@ -7880,7 +7886,7 @@

    Group Direct Messages

    2021-06-09 11:00:05
    -

    *Thread Reply:* Same!

    +

    *Thread Reply:* Same!

    @@ -7906,7 +7912,7 @@

    Group Direct Messages

    2021-06-09 11:01:45
    -

    *Thread Reply:* Will do. Also if you send your email in dm you can get added to the invite

    +

    *Thread Reply:* Will do. Also if you send your email in dm you can get added to the invite

    @@ -7932,7 +7938,7 @@

    Group Direct Messages

    2021-06-09 11:59:22
    2021-06-09 12:00:30
    -

    *Thread Reply:* @Julien Le Dem Can't access the calendar.

    +

    *Thread Reply:* @Julien Le Dem Can't access the calendar.

    @@ -7984,7 +7990,7 @@

    Group Direct Messages

    2021-06-09 12:00:43
    -

    *Thread Reply:* Can you please share the meeting details

    +

    *Thread Reply:* Can you please share the meeting details

    @@ -8010,7 +8016,7 @@

    Group Direct Messages

    2021-06-09 12:01:12
    -

    *Thread Reply:*

    +

    *Thread Reply:*

    @@ -8045,7 +8051,7 @@

    Group Direct Messages

    2021-06-09 12:01:24
    -

    *Thread Reply:*

    +

    *Thread Reply:*

    @@ -8080,7 +8086,7 @@

    Group Direct Messages

    2021-06-09 12:01:55
    -

    *Thread Reply:* The calendar invite says 9am PDT, not 10am. Which is right?

    +

    *Thread Reply:* The calendar invite says 9am PDT, not 10am. Which is right?

    @@ -8106,7 +8112,7 @@

    Group Direct Messages

    2021-06-09 12:01:58
    -

    *Thread Reply:* Thanks

    +

    *Thread Reply:* Thanks

    @@ -8132,7 +8138,7 @@

    Group Direct Messages

    2021-06-09 13:25:13
    -

*Thread Reply:* it is 9am, thanks

+

*Thread Reply:* it is 9am, thanks

    @@ -8158,7 +8164,7 @@

    Group Direct Messages

    2021-06-09 18:37:02
    -

    *Thread Reply:* I have posted the notes on the wiki (includes link to recording) https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+meeting+archive

    +

    *Thread Reply:* I have posted the notes on the wiki (includes link to recording) https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+meeting+archive

    @@ -8218,7 +8224,7 @@

    Group Direct Messages

    2021-06-10 13:55:51
    -

    *Thread Reply:* We’ve recently worked on a getting started guide for OpenLineage that we’d like to publish on the OpenLineage website. That should help with making things a bit more clear on usage. @Ross Turk / @Julien Le Dem might know of when that might become available. Otherwise, happy to answer any immediate questions you might have about posting/collecting OpenLineage events

    +

    *Thread Reply:* We’ve recently worked on a getting started guide for OpenLineage that we’d like to publish on the OpenLineage website. That should help with making things a bit more clear on usage. @Ross Turk / @Julien Le Dem might know of when that might become available. Otherwise, happy to answer any immediate questions you might have about posting/collecting OpenLineage events

    @@ -8244,7 +8250,7 @@

    Group Direct Messages

    2021-06-10 13:58:58
    -

    *Thread Reply:* Here's a sample of what I'm producing, would appreciate any feedback if it's on the right track. One of our challenges is that 'dataset' is a little loosely defined for us as outputs since we take data from a warehouse/database and output to things like Salesforce, Airtable, Hubspot and even Slack.

    +

    *Thread Reply:* Here's a sample of what I'm producing, would appreciate any feedback if it's on the right track. One of our challenges is that 'dataset' is a little loosely defined for us as outputs since we take data from a warehouse/database and output to things like Salesforce, Airtable, Hubspot and even Slack.

{
  eventType: 'START',

@@ -8315,7 +8321,7 @@

    Group Direct Messages

    2021-06-10 14:02:59
    -

    *Thread Reply:* One other question I have is really around how customers might take the metadata we emit at Hightouch and integrate that with OpenLineage metadata emitted from other tools like dbt, Airflow, and other integrations to create a true lineage of their data.

    +

    *Thread Reply:* One other question I have is really around how customers might take the metadata we emit at Hightouch and integrate that with OpenLineage metadata emitted from other tools like dbt, Airflow, and other integrations to create a true lineage of their data.

    For example, if the data goes from S3 -&gt; Snowflake via Airflow and then from Snowflake -&gt; Salesforce via Hightouch, this would mean both Airflow/Hightouch would need to define the Snowflake dataset in exactly the same way to get the benefits of lineage?

    @@ -8343,7 +8349,7 @@

    Group Direct Messages

    2021-06-17 19:13:14
    -

*Thread Reply:* Hey, @Dejan Peretin! Sorry for the late reply here! Your OL events look solid and I only have a few suggestions:

+

*Thread Reply:* Hey, @Dejan Peretin! Sorry for the late reply here! Your OL events look solid and I only have a few suggestions (see the sketch below):

1. I would use a valid UUID for the run ID as the spec will standardize on that type, see https://github.com/OpenLineage/OpenLineage/pull/65
2. You don’t need to provide the input dataset again on the COMPLETE event as the input datasets have already been associated with the run ID
3. For the producer, I’d recommend using a link to the producer source code version to link the producer version with the OL event that was emitted.
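A sketch applying all three suggestions - a UUID run ID, inputs only on START, and a producer pointing at versioned source code (the job, dataset, and repo names are illustrative):

```
import uuid

run_id = str(uuid.uuid4())  # suggestion 1: a valid UUID
producer = "https://github.com/my-org/my-producer/tree/v1.2.0"  # suggestion 3

start = {
    "eventType": "START",
    "eventTime": "2021-06-17T00:00:00Z",
    "run": {"runId": run_id},
    "job": {"namespace": "hightouch", "name": "sync_to_salesforce"},
    "inputs": [{"namespace": "snowflake://my-account", "name": "db.schema.users"}],
    "producer": producer,
}

complete = {
    "eventType": "COMPLETE",
    "eventTime": "2021-06-17T00:05:00Z",
    "run": {"runId": run_id},  # same run ID ties the two events together
    "job": {"namespace": "hightouch", "name": "sync_to_salesforce"},
    # suggestion 2: no need to repeat the inputs here
    "outputs": [{"namespace": "salesforce", "name": "contacts"}],
    "producer": producer,
}
```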
    @@ -8372,7 +8378,7 @@

    Group Direct Messages

    2021-06-17 19:13:59
    -

    *Thread Reply:* You can now reference our OL getting started guide for a close-to-real example 🙂 , see http://openlineage.io/getting-started

    +

    *Thread Reply:* You can now reference our OL getting started guide for a close-to-real example 🙂 , see http://openlineage.io/getting-started

    @@ -8419,7 +8425,7 @@

    Group Direct Messages

    2021-06-17 19:18:19
    -

    *Thread Reply:* > … this would mean both Airflow/Hightouch would need to define the Snowflake dataset in exactly the same way to get the benefits of lineage? +

    *Thread Reply:* > … this would mean both Airflow/Hightouch would need to define the Snowflake dataset in exactly the same way to get the benefits of lineage? Yes, the dataset and the namespace that it was registered under would have to be the same to properly build the lineage graph. We’re working on defining unique dataset names and have made some good progress in this area. I’d suggest reviewing the OL naming conventions if you haven’t already: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md

    @@ -8450,7 +8456,7 @@

    Group Direct Messages

    2021-06-19 01:09:27
    -

    *Thread Reply:* Thanks! I'm really excited to see what the future holds, I think there are so many great possibilities here. Will be keeping a watchful eye. 🙂

    +

    *Thread Reply:* Thanks! I'm really excited to see what the future holds, I think there are so many great possibilities here. Will be keeping a watchful eye. 🙂

    @@ -8476,7 +8482,7 @@

    Group Direct Messages

    2021-06-22 15:14:39
    -

    *Thread Reply:* 🙂

    +

    *Thread Reply:* 🙂

    @@ -8530,7 +8536,7 @@

    Group Direct Messages

    2021-06-11 10:16:41
    -

    *Thread Reply:* Sounds like problem is with Marquez - might be worth to open issue here: https://github.com/MarquezProject/marquez/issues

    +

    *Thread Reply:* Sounds like problem is with Marquez - might be worth to open issue here: https://github.com/MarquezProject/marquez/issues

    @@ -8556,7 +8562,7 @@

    Group Direct Messages

    2021-06-11 10:25:58
    -

    *Thread Reply:* Thank you! Will do.

    +

    *Thread Reply:* Thank you! Will do.

    @@ -8582,7 +8588,7 @@

    Group Direct Messages

    2021-06-11 15:31:41
    -

    *Thread Reply:* Thanks for reporting Antonio

    +

    *Thread Reply:* Thanks for reporting Antonio

    @@ -8732,7 +8738,7 @@

    Group Direct Messages

    2021-06-18 15:09:18
    -

    *Thread Reply:* Very nice!

    +

    *Thread Reply:* Very nice!

    @@ -8784,7 +8790,7 @@

    Group Direct Messages

    2021-06-20 10:09:28
    -

    *Thread Reply:* Here is the error...

    +

    *Thread Reply:* Here is the error...

21/06/20 11:02:56 WARN ArgumentParser: missing jobs in [, api, v1, namespaces, spark_integration] at 5
21/06/20 11:02:56 WARN ArgumentParser: missing runs in [, api, v1, namespaces, spark_integration] at 7

@@ -8829,7 +8835,7 @@

    Group Direct Messages

    2021-06-20 10:10:41
    -

    *Thread Reply:* Here is my code ...

    +

    *Thread Reply:* Here is my code ...

```from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

    @@ -8899,7 +8905,7 @@

    Supress success

    2021-06-20 10:12:27
    -

*Thread Reply:* After this execution, I can see just the source from the first dataframe, called dfsourcetrip...

+

*Thread Reply:* After this execution, I can see just the source from the first dataframe, called dfsourcetrip...

    @@ -8925,7 +8931,7 @@

    Supress success

    2021-06-20 10:13:04
    -

    *Thread Reply:*

    +

    *Thread Reply:*

    @@ -8960,7 +8966,7 @@

    Supress success

    2021-06-20 10:13:45
    -

    *Thread Reply:* I was expecting to see all source dataframes, target dataframes and the job

    +

    *Thread Reply:* I was expecting to see all source dataframes, target dataframes and the job

    @@ -8986,7 +8992,7 @@

    Supress success

    2021-06-20 10:14:35
    -

*Thread Reply:* I'm running Spark locally on my laptop and I followed the Marquez getting started guide to bring it up

+

*Thread Reply:* I'm running Spark locally on my laptop and I followed the Marquez getting started guide to bring it up

    @@ -9012,7 +9018,7 @@

    Supress success

    2021-06-20 10:14:44
    -

    *Thread Reply:* Can anyone help me?

    +

    *Thread Reply:* Can anyone help me?

    @@ -9038,7 +9044,7 @@

    Supress success

    2021-06-22 14:42:03
    -

    *Thread Reply:* I think there's a race condition that causes the context to be missing when the job finishes too quickly. If I just add +

    *Thread Reply:* I think there's a race condition that causes the context to be missing when the job finishes too quickly. If I just add spark.sparkContext.setLogLevel('info') to the setup code, everything works reliably. Also works if you remove the master('local[1]') - at least when running in a notebook

    @@ -9174,7 +9180,7 @@

    Supress success

    2021-06-22 15:26:55
    -

    *Thread Reply:* Hey, @anup agrawal. This is a great question! The OpenLineage spec is defined using the Json Schema format, and it’s mainly for the transport layer of OL events. In terms of how OL events are eventually stored, that’s determined by the backend consumer of the events. For example, Marquez stores the raw event in a lineage_events table, but that’s mainly for convenience and replayability of events . As for importing / exporting OL events from storage, as long as you can translate the CSV to an OL event, then HTTP backends like Marquez that support OL can consume them

    +

    *Thread Reply:* Hey, @anup agrawal. This is a great question! The OpenLineage spec is defined using the Json Schema format, and it’s mainly for the transport layer of OL events. In terms of how OL events are eventually stored, that’s determined by the backend consumer of the events. For example, Marquez stores the raw event in a lineage_events table, but that’s mainly for convenience and replayability of events . As for importing / exporting OL events from storage, as long as you can translate the CSV to an OL event, then HTTP backends like Marquez that support OL can consume them

    @@ -9200,7 +9206,7 @@

    Supress success

    2021-06-22 15:27:29
    -

    *Thread Reply:* > as part of this i need to convert the information fetched from graph db (neo4j) to CSV format and send in response. +

    *Thread Reply:* > as part of this i need to convert the information fetched from graph db (neo4j) to CSV format and send in response. Depending on the exported CSV, I would translate the CSV to an OL event, see https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.json
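A minimal sketch of such a translation, assuming a hypothetical CSV column layout (the column names and file name are invented for the example):

```
import csv
import json

def row_to_event(row):
    # Map one exported CSV row onto the OpenLineage event structure.
    return {
        "eventType": row["event_type"],
        "eventTime": row["event_time"],
        "run": {"runId": row["run_id"]},
        "job": {"namespace": row["job_namespace"], "name": row["job_name"]},
        "inputs": json.loads(row.get("inputs", "[]")),
        "outputs": json.loads(row.get("outputs", "[]")),
        "producer": row["producer"],
    }

with open("lineage_export.csv") as f:
    events = [row_to_event(row) for row in csv.DictReader(f)]
```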

    @@ -9227,7 +9233,7 @@

    Supress success

    2021-06-22 15:29:58
    -

    *Thread Reply:* When you say “send in response”, who would be the consumer of the lineage metadata exported for the graph db?

    +

    *Thread Reply:* When you say “send in response”, who would be the consumer of the lineage metadata exported for the graph db?

    @@ -9253,7 +9259,7 @@

    Supress success

    2021-06-22 23:33:05
    -

*Thread Reply:* So far, what I understood about my requirement is: 1. my service will receive OL events

+

*Thread Reply:* So far, what I understood about my requirement is: 1. my service will receive OL events

    @@ -9279,7 +9285,7 @@

    Supress success

    2021-06-22 23:33:24
    -

    *Thread Reply:* 2. store it in graph db (neo4j)

    +

    *Thread Reply:* 2. store it in graph db (neo4j)

    @@ -9305,7 +9311,7 @@

    Supress success

    2021-06-22 23:38:28
    -

*Thread Reply:* 3. this lineage information will be displayed on the UI, based on the request.

+

*Thread Reply:* 3. this lineage information will be displayed on the UI, based on the request.

1. Now my part in that is to implement an Export functionality, so that someone can download it from the UI. In the UI there will be an option to download the report.
2. So I need to fetch data from storage, convert it into CSV format, and send it to the UI.
3. They can download the report from the UI.
    @@ -9341,7 +9347,7 @@

    Supress success

    2021-07-01 19:18:35
    -

    *Thread Reply:* I see. @Julien Le Dem might have some thoughts on how an OL event would be represented in different formats like CSV (but, of course, there’s also avro, parquet, etc). The Json Schema is the recommended format for importing / exporting lineage metadata. And, for a file, each line would be an OL event. But, given that CSV is a requirement, I’m not sure how that would be structured. Or at least, it’s something we haven’t previously discussed

    +

    *Thread Reply:* I see. @Julien Le Dem might have some thoughts on how an OL event would be represented in different formats like CSV (but, of course, there’s also avro, parquet, etc). The Json Schema is the recommended format for importing / exporting lineage metadata. And, for a file, each line would be an OL event. But, given that CSV is a requirement, I’m not sure how that would be structured. Or at least, it’s something we haven’t previously discussed

    @@ -9393,7 +9399,7 @@

    Supress success

    2021-06-22 20:29:22
    -

    *Thread Reply:* There are no silly questions! 😉

    +

    *Thread Reply:* There are no silly questions! 😉

    @@ -9453,7 +9459,7 @@

    Supress success

    2021-07-01 19:07:27
    -

    *Thread Reply:* Welcome, @Abdulmalik AN 👋 Hopefully the talks / podcasts have been informative! And, sure, happy to clarify a few things:

    +

    *Thread Reply:* Welcome, @Abdulmalik AN 👋 Hopefully the talks / podcasts have been informative! And, sure, happy to clarify a few things:

> What are events and facets and what is their purpose? An OpenLineage event is used to capture the lineage metadata at a point in time for a given run in execution. That is, the run’s state transition, the inputs and outputs consumed/produced, and the job associated with the run are part of the event. The metadata defined in the event can then be consumed by an HTTP backend (as well as other transport layers). Marquez is an HTTP backend implementation that consumes OL events via a REST API call. The OL core model only defines the metadata that should be captured in the context of a run, while the processing of the event is up to the backend implementation consuming the event (think consumer / producer model here). For Marquez, the end-to-end lineage metadata is stored for pipelines (composed of multiple jobs) with built-in metadata versioning support. Now, for the second part of your question: the OL core model is highly extensible via facets. A facet is user-defined metadata and enables entity enrichment. I’d recommend checking out the getting started guide for OL 🙂
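To make the run / job / facet split concrete, a minimal sketch of a single OL event carrying one user-defined run facet; all names and URLs here are made up for illustration:

{
  "eventType": "START",
  "eventTime": "2021-07-01T19:00:00Z",
  "run": {
    "runId": "d46e465b-d358-4d32-83d4-df660ff614dd",
    "facets": {
      "myCustomContext": {
        "_producer": "https://example.com/my-scheduler",
        "_schemaURL": "https://example.com/schemas/MyCustomContextRunFacet.json",
        "environment": "staging"
      }
    }
  },
  "job": {"namespace": "my-namespace", "name": "daily_sales_load"},
  "inputs": [{"namespace": "my-source", "name": "public.orders"}],
  "outputs": [{"namespace": "my-warehouse", "name": "analytics.sales_summary"}],
  "producer": "https://example.com/my-scheduler",
  "schemaURL": "https://openlineage.io/spec/1-0-0/OpenLineage.json"
}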

    @@ -9591,7 +9597,7 @@

    Supress success

    2021-06-30 18:16:58
    -

    *Thread Reply:* I based my code on this previous post https://openlineage.slack.com/archives/C01CK9T7HKR/p1624198123045800

    +

    *Thread Reply:* I based my code on this previous post https://openlineage.slack.com/archives/C01CK9T7HKR/p1624198123045800

    @@ -9652,7 +9658,7 @@

    Supress success

    2021-06-30 18:36:59
    -

    *Thread Reply:*

    +

    *Thread Reply:*

    @@ -9687,7 +9693,7 @@

    Supress success

    2021-07-01 13:45:42
    -

*Thread Reply:* In your first cell, you have +

*Thread Reply:* In your first cell, you have
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
spark.sparkContext.setLogLevel('info')
@@ -9754,7 +9760,7 @@

    Supress success

    2021-07-01 16:32:18
    -

    *Thread Reply:* It’s absolutely in scope! We’ve primarily focused on the batch use case (ETL jobs, etc), but the OpenLineage standard supports both batch and streaming jobs. You can check out our roadmap here, where you’ll find Flink and Beam on our list of future integrations.

    +

    *Thread Reply:* It’s absolutely in scope! We’ve primarily focused on the batch use case (ETL jobs, etc), but the OpenLineage standard supports both batch and streaming jobs. You can check out our roadmap here, where you’ll find Flink and Beam on our list of future integrations.

    @@ -9780,7 +9786,7 @@

    Supress success

    2021-07-01 16:32:57
    -

    *Thread Reply:* Is there a streaming framework you’d like to see added to our roadmap?

    +

    *Thread Reply:* Is there a streaming framework you’d like to see added to our roadmap?

    @@ -9832,7 +9838,7 @@

    Supress success

    2021-07-01 16:24:16
    -

*Thread Reply:* Welcome, @mohamed chorfa 👋. Let us know if you have any questions!

+

*Thread Reply:* Welcome, @mohamed chorfa 👋. Let us know if you have any questions!

    @@ -9862,7 +9868,7 @@

    Supress success

    2021-07-03 19:37:58
    -

*Thread Reply:* Really looking forward to following the evolution of the specification from RawData to the ML-Model

+

*Thread Reply:* Really looking forward to following the evolution of the specification from RawData to the ML-Model

    @@ -9980,7 +9986,7 @@

    Supress success

    2021-07-07 19:53:17
    -

    *Thread Reply:* Based on community feedback, +

    *Thread Reply:* Based on community feedback, The new proposed mission statement: “to enable the industry at-large to collect real-time lineage metadata consistently across complex ecosystems, creating a deeper understanding of how data is produced and used”

    @@ -10137,7 +10143,7 @@

    Supress success

    2021-07-08 18:57:44
    -

    *Thread Reply:* It is on the roadmap and there’s a ticket open but nobody is working on it at the moment. You are very welcome to contribute a spec and implementation

    +

    *Thread Reply:* It is on the roadmap and there’s a ticket open but nobody is working on it at the moment. You are very welcome to contribute a spec and implementation

    @@ -10163,7 +10169,7 @@

    Supress success

    2021-07-08 18:59:00
    -

    *Thread Reply:* Please comment here and feel free to make a proposal: https://github.com/OpenLineage/OpenLineage/issues/35

    +

    *Thread Reply:* Please comment here and feel free to make a proposal: https://github.com/OpenLineage/OpenLineage/issues/35

    @@ -10308,7 +10314,7 @@

    Supress success

    2021-07-09 02:23:42
    -

    *Thread Reply:* cc @Julien Le Dem @Maciej Obuchowski

    +

    *Thread Reply:* cc @Julien Le Dem @Maciej Obuchowski

    @@ -10334,7 +10340,7 @@

    Supress success

    2021-07-09 04:28:38
    -

    *Thread Reply:* @Michael Collado

    +

    *Thread Reply:* @Michael Collado

    @@ -10403,7 +10409,7 @@

    Supress success

    2021-07-12 21:18:04
    -

    *Thread Reply:* Feel free to share your email with me if you want to be added to the gcal invite

    +

    *Thread Reply:* Feel free to share your email with me if you want to be added to the gcal invite

    @@ -10429,7 +10435,7 @@

    Supress success

    2021-07-14 12:03:31
    -

    *Thread Reply:* It is starting now

    +

    *Thread Reply:* It is starting now

    @@ -10538,7 +10544,7 @@

    Supress success

    2021-07-14 17:04:18
    -

    *Thread Reply:* In particular, please review the versioning proposal: https://github.com/OpenLineage/OpenLineage/issues/63

    +

    *Thread Reply:* In particular, please review the versioning proposal: https://github.com/OpenLineage/OpenLineage/issues/63

    @@ -10641,7 +10647,7 @@

    Supress success

    2021-07-14 17:04:33
    -

    *Thread Reply:* and the mission statement: https://github.com/OpenLineage/OpenLineage/issues/84

    +

    *Thread Reply:* and the mission statement: https://github.com/OpenLineage/OpenLineage/issues/84

    @@ -10705,7 +10711,7 @@

    Supress success

    2021-07-14 17:05:02
    -

    *Thread Reply:* for this one, please give explicit approval in the ticket

    +

    *Thread Reply:* for this one, please give explicit approval in the ticket

    @@ -10735,7 +10741,7 @@

    Supress success

    2021-07-14 21:10:42
    -

    *Thread Reply:* @Zhamak Dehghani @Daniel Henneberger @Drew Banin @James Campbell @Ryan Blue @Maciej Obuchowski @Willy Lulciuc ^

    +

    *Thread Reply:* @Zhamak Dehghani @Daniel Henneberger @Drew Banin @James Campbell @Ryan Blue @Maciej Obuchowski @Willy Lulciuc ^

    @@ -10761,7 +10767,7 @@

    Supress success

    2021-07-27 18:58:35
    -

    *Thread Reply:* Per the votes in the github ticket, I have finalized the charter here: https://docs.google.com/document/d/11xo2cPtuYHmqRLnR-vt9ln4GToe0y60H/edit

    +

    *Thread Reply:* Per the votes in the github ticket, I have finalized the charter here: https://docs.google.com/document/d/11xo2cPtuYHmqRLnR-vt9ln4GToe0y60H/edit

    @@ -10854,7 +10860,7 @@

    Supress success

    2021-07-20 16:38:38
    -

    *Thread Reply:* The demo in this does not really use the openlineage spec does it?

    +

    *Thread Reply:* The demo in this does not really use the openlineage spec does it?

Did I miss something - the API that was shown for lineage was that of Marquez, how does Marquez use the open lineage spec?

    @@ -10882,7 +10888,7 @@

    Supress success

    2021-07-20 18:09:01
    -

*Thread Reply:* I have a question about the SQLJobFacet in the job schema - isn't it better to call it the TransformationJobFacet or the ProcessJobFacet, such that any logic in the appropriate language can be described? Am I misinterpreting the intention of SQLJobFacet? Is it to capture the logic that runs for a job?

+

*Thread Reply:* I have a question about the SQLJobFacet in the job schema - isn't it better to call it the TransformationJobFacet or the ProcessJobFacet, such that any logic in the appropriate language can be described? Am I misinterpreting the intention of SQLJobFacet? Is it to capture the logic that runs for a job?

    @@ -10908,7 +10914,7 @@

    Supress success

    2021-07-26 19:06:43
    -

*Thread Reply:* > The demo in this does not really use the openlineage spec does it? +

*Thread Reply:* > The demo in this does not really use the openlineage spec does it?
@Samia Rahman In our Airflow talk, the demo used the marquez-airflow lib that sends OpenLineage events to Marquez’s REST API. You can check out how Airflow works with OpenLineage + Marquez here: https://openlineage.io/integration/apache-airflow/

    @@ -10956,7 +10962,7 @@

    Supress success

    2021-07-26 19:07:51
    -

*Thread Reply:* > Did I miss something - the API that was shown for lineage was that of Marquez, how does Marquez use the open lineage spec? +

*Thread Reply:* > Did I miss something - the API that was shown for lineage was that of Marquez, how does Marquez use the open lineage spec?
Yes, Marquez ingests OpenLineage events that conform to the spec via the REST API. Hope this helps!

    @@ -11011,7 +11017,7 @@

    Supress success

    2021-07-26 18:54:46
    -

*Thread Reply:* I would say this is more of an ingestion pattern than something the OpenLineage spec would support directly. Though I completely agree, query logs are a great source of lineage metadata with minimal effort. On our roadmap, we have Kafka as a supported backend which would enable streaming lineage metadata from query logs into a topic. That said, Confluent has some great blog posts on Change Data Capture: +

*Thread Reply:* I would say this is more of an ingestion pattern than something the OpenLineage spec would support directly. Though I completely agree, query logs are a great source of lineage metadata with minimal effort. On our roadmap, we have Kafka as a supported backend which would enable streaming lineage metadata from query logs into a topic. That said, Confluent has some great blog posts on Change Data Capture:
• https://www.confluent.io/blog/no-more-silos-how-to-integrate-your-databases-with-apache-kafka-and-cdc/
• https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-1/

    Supress success
    2021-07-26 18:57:59
    -

    *Thread Reply:* Q: @Kenton (swiple.io) Are you planning on using Kafka connect? If so, I see 2 reasonable options:

    +

    *Thread Reply:* Q: @Kenton (swiple.io) Are you planning on using Kafka connect? If so, I see 2 reasonable options:

    1. Stream query logs to a topic using the JDBC source connector, then have a consumer read the query logs off the topic, parse the logs, then stream the result of the query parsing to another topic as an OpenLineage event
    2. Add direct support for OpenLineage to the JDBC connector or any other application you planned to use to read the query logs.
    @@ -11118,7 +11124,7 @@

    Supress success

    2021-07-26 19:01:31
    -

*Thread Reply:* Either way, I think this is a great question and a common ingestion pattern we should document or have best practices for. Also, more details on how you plan to ingest the query logs would help drive the discussion.

+

*Thread Reply:* Either way, I think this is a great question and a common ingestion pattern we should document or have best practices for. Also, more details on how you plan to ingest the query logs would help drive the discussion.
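For instance, option 1 above could look roughly like the following, assuming the kafka-python package, a line-oriented query-log topic, and placeholder topic names; the SQL parsing itself is elided, since that is the hard part:

import json
import uuid
from datetime import datetime, timezone
from kafka import KafkaConsumer, KafkaProducer  # assumes kafka-python

consumer = KafkaConsumer("query_logs", bootstrap_servers="localhost:9092")
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda e: json.dumps(e).encode("utf-8"))

for msg in consumer:
    sql = msg.value.decode("utf-8")
    # A real implementation would parse input/output tables out of the SQL here.
    event = {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/query-log-relay",  # hypothetical
        "schemaURL": "https://openlineage.io/spec/1-0-0/OpenLineage.json",
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "query-logs", "name": "adhoc_sql"},
        "inputs": [],   # filled in from the parsed SQL
        "outputs": [],  # filled in from the parsed SQL
    }
    producer.send("openlineage_events", event)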

    @@ -11144,7 +11150,7 @@

    Supress success

    2021-08-05 12:01:55
    -

    *Thread Reply:* Using something like sqlflow could be a good starting point? Demo https://sqlflow.gudusoft.com/?utm_source=gspsite&utm_medium=blog&utm_campaign=support_article#/

    +

    *Thread Reply:* Using something like sqlflow could be a good starting point? Demo https://sqlflow.gudusoft.com/?utm_source=gspsite&utm_medium=blog&utm_campaign=support_article#/

    sqlflow.gudusoft.com
    @@ -11220,7 +11226,7 @@

    Supress success

    2021-09-21 20:22:26
    -

*Thread Reply:* @Kenton (swiple.io) I haven’t heard of sqlflow but it does look promising. It’s not on our current roadmap, but I think there is a need to have support for parsing query logs as OpenLineage events. Do you mind opening an issue and outlining your thoughts? It’d be great to start the discussion if you’d like to drive this feature and help prioritize this 💯

+

*Thread Reply:* @Kenton (swiple.io) I haven’t heard of sqlflow but it does look promising. It’s not on our current roadmap, but I think there is a need to have support for parsing query logs as OpenLineage events. Do you mind opening an issue and outlining your thoughts? It’d be great to start the discussion if you’d like to drive this feature and help prioritize this 💯

    @@ -11379,7 +11385,7 @@

    Supress success

    2021-07-21 18:22:01
    -

*Thread Reply:* Hey, @Samia Rahman 👋. Yeah, great question! The SQLJobFacet is used only for SQL-based jobs. That is, it’s not intended to capture the code being executed, but rather just the SQL if it’s present. The SQL facet can be used later for display purposes. For example, in Marquez, we use the SQLJobFacet to display the SQL executed by a given job to the user via the UI.

+

*Thread Reply:* Hey, @Samia Rahman 👋. Yeah, great question! The SQLJobFacet is used only for SQL-based jobs. That is, it’s not intended to capture the code being executed, but rather just the SQL if it’s present. The SQL facet can be used later for display purposes. For example, in Marquez, we use the SQLJobFacet to display the SQL executed by a given job to the user via the UI.

    @@ -11405,7 +11411,7 @@

    Supress success

    2021-07-21 18:23:03
    -

    *Thread Reply:* To capture the logic of the job (meaning, the code being executed), the OpenLineage spec defines the SourceCodeLocationJobFacet that builds the link to source in version control

    +

    *Thread Reply:* To capture the logic of the job (meaning, the code being executed), the OpenLineage spec defines the SourceCodeLocationJobFacet that builds the link to source in version control
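Taken together with the SQLJobFacet above, a job carrying both facets looks roughly like this; the producer and schema URLs are illustrative:

"job": {
  "namespace": "my-namespace",
  "name": "daily_sales_load",
  "facets": {
    "sql": {
      "_producer": "https://example.com/producer",
      "_schemaURL": "https://example.com/schemas/SQLJobFacet.json",
      "query": "INSERT INTO sales_summary SELECT id, amount FROM orders"
    },
    "sourceCodeLocation": {
      "_producer": "https://example.com/producer",
      "_schemaURL": "https://example.com/schemas/SourceCodeLocationJobFacet.json",
      "type": "git",
      "url": "https://github.com/example/repo/blob/main/jobs/daily_sales_load.sql"
    }
  }
}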

    @@ -11560,7 +11566,7 @@

    Supress success

    2021-07-30 17:24:44
    -

    *Thread Reply:* The run event is always tied to exactly one job. It's up to the backend to store the relationship between the job and its inputs/outputs. E.g., in marquez, this is where we associate the input datasets with the job- https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/db/OpenLineageDao.java#L132-L143

    +

    *Thread Reply:* The run event is always tied to exactly one job. It's up to the backend to store the relationship between the job and its inputs/outputs. E.g., in marquez, this is where we associate the input datasets with the job- https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/db/OpenLineageDao.java#L132-L143

    @@ -11681,7 +11687,7 @@

    Supress success

    2021-08-03 15:11:56
    -

    *Thread Reply:* /|~~~ +

*Thread Reply:* /|~~~
///|
/////|
///////|
@@ -11713,7 +11719,7 @@

    Supress success

    2021-08-03 19:59:03
    -

    *Thread Reply:*

    +

    *Thread Reply:* ⛵

    @@ -11972,7 +11978,7 @@

    Supress success

    2021-08-05 12:56:48
    -

    *Thread Reply:* Also interested in use case / implementation differences between Spline and OL. Watching this thread.

    +

    *Thread Reply:* Also interested in use case / implementation differences between Spline and OL. Watching this thread.

    @@ -11998,7 +12004,7 @@

    Supress success

    2021-08-05 14:46:44
    -

*Thread Reply:* It would be great to have the option to produce the spline lineage info as OpenLineage. +

*Thread Reply:* It would be great to have the option to produce the spline lineage info as OpenLineage.
To capture the column level lineage, you would want to add a ColumnLineage facet to the Output dataset facets, which is something that is needed in the spec.
Here is a proposal, please chime in: https://github.com/OpenLineage/OpenLineage/issues/148
@@ -12136,7 +12142,7 @@
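For readers following along, the shape sketched in issue #148 is roughly a per-output-column map back to the input fields each column was derived from; this follows the proposal's direction and is not a finalized facet:

"outputs": [{
  "namespace": "my-warehouse",
  "name": "analytics.sales_summary",
  "facets": {
    "columnLineage": {
      "_producer": "https://example.com/producer",
      "_schemaURL": "https://example.com/schemas/ColumnLineageDatasetFacet.json",
      "fields": {
        "total_amount": {
          "inputFields": [
            {"namespace": "my-source", "name": "public.orders", "field": "amount"}
          ]
        }
      }
    }
  }
}]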

    Supress success

    2021-08-09 19:49:51
    -

    *Thread Reply:* regarding the difference of implementation, the OpenLineage spark integration focuses on extracting metadata and exposing it as a standard representation. (The OpenLineage LineageEvents described in the JSON-Schema spec). The goal is really to have a common language to express lineage and related metadata across everything. We’d be happy if Spline can produce or consume OpenLineage as well and be part of that ecosystem.

    +

    *Thread Reply:* regarding the difference of implementation, the OpenLineage spark integration focuses on extracting metadata and exposing it as a standard representation. (The OpenLineage LineageEvents described in the JSON-Schema spec). The goal is really to have a common language to express lineage and related metadata across everything. We’d be happy if Spline can produce or consume OpenLineage as well and be part of that ecosystem.

    @@ -12186,7 +12192,7 @@

    Supress success

    2021-08-18 08:09:38
    -

    *Thread Reply:* Does anyone know if the Spline developers are in this slack group?

    +

    *Thread Reply:* Does anyone know if the Spline developers are in this slack group?

    @@ -12212,7 +12218,7 @@

    Supress success

    2022-08-03 03:07:56
    -

    *Thread Reply:* @Luke Smith how have things progressed on your side the past year?

    +

    *Thread Reply:* @Luke Smith how have things progressed on your side the past year?

    @@ -12442,7 +12448,7 @@

    Supress success

    2021-08-11 10:05:27
    -

    *Thread Reply:* Just a reminder that this is in 2 hours

    +

    *Thread Reply:* Just a reminder that this is in 2 hours

    @@ -12468,7 +12474,7 @@

    Supress success

    2021-08-11 18:50:32
    -

    *Thread Reply:* I have added the notes to the meeting page: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

    +

    *Thread Reply:* I have added the notes to the meeting page: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

    @@ -12494,7 +12500,7 @@

    Supress success

    2021-08-11 18:51:19
    -

    *Thread Reply:* The recording of the meeting is linked there: +

    *Thread Reply:* The recording of the meeting is linked there: https://us02web.zoom.us/rec/share/2k4O-Rjmmd5TYXzT-pEQsbYXt6o4V6SnS6Vi7a27BPve9aoMmjm-bP8UzBBzsFzg.uY1je-PyT4qTgYLZ?startTime=1628697944000 • Passcode: =RBUj01C

    @@ -12548,7 +12554,7 @@

    Supress success

    2021-08-11 13:36:43
    -

*Thread Reply:* @Daniel Avancini I'm working on it. Some changes in airflow made the current approach unfeasible, so a slight change in the way we capture events is needed. You can take a look at progress here: https://github.com/OpenLineage/OpenLineage/tree/airflow/2

+

*Thread Reply:* @Daniel Avancini I'm working on it. Some changes in airflow made the current approach unfeasible, so a slight change in the way we capture events is needed. You can take a look at progress here: https://github.com/OpenLineage/OpenLineage/tree/airflow/2

    @@ -12574,7 +12580,7 @@

    Supress success

    2021-08-11 13:48:36
    -

    *Thread Reply:* Thank you Maciej. I'll take a look

    +

    *Thread Reply:* Thank you Maciej. I'll take a look

    @@ -12742,7 +12748,7 @@

    Supress success

    2021-08-26 16:46:53
    -

    *Thread Reply:* The logicalPlan facet currently returns the Logical Plan, not the physical plan. This means you end up with expressions like Aggregate and Join rather than WholeStageCodegen and Exchange. I don't know if it's possible to reverse engineer the SQL- it's worth looking into the API and trying to find a way to generate that

    +

    *Thread Reply:* The logicalPlan facet currently returns the Logical Plan, not the physical plan. This means you end up with expressions like Aggregate and Join rather than WholeStageCodegen and Exchange. I don't know if it's possible to reverse engineer the SQL- it's worth looking into the API and trying to find a way to generate that

    @@ -12893,7 +12899,7 @@

    Supress success

    2021-08-31 14:30:00
    -

    *Thread Reply:* Hey, @Erick Navarro 👋 . Are you using the openlineage-spark lib? (Note, the marquez-spark lib has been deprecated)

    +

    *Thread Reply:* Hey, @Erick Navarro 👋 . Are you using the openlineage-spark lib? (Note, the marquez-spark lib has been deprecated)

    @@ -12919,7 +12925,7 @@

    Supress success

    2021-08-31 14:43:20
    -

    *Thread Reply:* My team had this issue as well. Our read of the error is that Databricks attempts to register the listener before installing packages defined with either spark.jars or spark.jars.packages. Since the listener lib is not yet installed, the listener cannot be found. To solve the issue, we

    +

    *Thread Reply:* My team had this issue as well. Our read of the error is that Databricks attempts to register the listener before installing packages defined with either spark.jars or spark.jars.packages. Since the listener lib is not yet installed, the listener cannot be found. To solve the issue, we

    1. copy the OL JAR to a staging directory on DBFS (we use /dbfs/databricks/init/lineage)
    2. using an init script, copy the JAR from the staging directory to the default JAR location for the Databricks driver -- /mnt/driver-daemon/jars
3. Within the same init script, write the spark config parameters to a .conf file in /databricks/driver/conf (we use open-lineage.conf). The .conf file will be read by the driver on initialization. It should follow this format (lineage_host_url should point to your API):
@@ -12965,7 +12971,7 @@
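The format example itself was cut off in this archive. Roughly, the properties such a .conf file needs to set are the listener registration plus the OpenLineage connection settings; exact keys depend on your openlineage-spark version, and the Databricks driver conf file has its own syntax, so treat this as a sketch:

spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.host http://<lineage_host_url>    # your OpenLineage-compatible API, e.g. Marquez
spark.openlineage.namespace my-namespace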

      Supress success

    2021-08-31 14:51:46
    -

    *Thread Reply:* Absolutely, thanks for elaborating on your spark + OL deployment process and I think that’d be great to document. @Michael Collado what are your thoughts?

    +

    *Thread Reply:* Absolutely, thanks for elaborating on your spark + OL deployment process and I think that’d be great to document. @Michael Collado what are your thoughts?

    @@ -12991,7 +12997,7 @@

    Supress success

    2021-08-31 14:57:02
    -

    *Thread Reply:* I haven't tried with Databricks specifically, but there should be no issue registering the OL listener in the Spark config as long as it's done before the Spark session is created- e.g., this example from the README works fine in a vanilla Jupyter notebook- https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#openlineagesparklistener-as-a-plain-spark-listener

    +

    *Thread Reply:* I haven't tried with Databricks specifically, but there should be no issue registering the OL listener in the Spark config as long as it's done before the Spark session is created- e.g., this example from the README works fine in a vanilla Jupyter notebook- https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#openlineagesparklistener-as-a-plain-spark-listener
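That README pattern looks roughly like this in PySpark; the package version, host and namespace below are illustrative:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("openlineage_example")
    # The listener must be configured before the session is created.
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.2.2")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.host", "http://localhost:5000")
    .config("spark.openlineage.namespace", "spark_integration")
    .getOrCreate())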

    @@ -13017,7 +13023,7 @@

    Supress success

    2021-08-31 15:11:37
    -

    *Thread Reply:* Looks like Databricks' notebooks come with a Spark instance pre-configured- configuring lineage within the SparkSession configuration doesn't seem possible- https://docs.databricks.com/notebooks/notebooks-manage.html#attach-a-notebook-to-a-cluster 😞

    +

    *Thread Reply:* Looks like Databricks' notebooks come with a Spark instance pre-configured- configuring lineage within the SparkSession configuration doesn't seem possible- https://docs.databricks.com/notebooks/notebooks-manage.html#attach-a-notebook-to-a-cluster 😞

    docs.databricks.com
    @@ -13064,7 +13070,7 @@

    Supress success

    2021-08-31 15:11:53
    -

    *Thread Reply:*

    +

    *Thread Reply:*

    @@ -13099,7 +13105,7 @@

    Supress success

    2021-08-31 15:59:38
    -

    *Thread Reply:* Right, Databricks provides preconfigured spark context / session objects. With Spline, you can set some cluster level config (e.g. spark.spline.lineageDispatcher.http.producer.url ) and install the library on the cluster, but then enable tracking at a notebook level with:

    +

    *Thread Reply:* Right, Databricks provides preconfigured spark context / session objects. With Spline, you can set some cluster level config (e.g. spark.spline.lineageDispatcher.http.producer.url ) and install the library on the cluster, but then enable tracking at a notebook level with:

%scala
import za.co.absa.spline.harvester.SparkLineageInitializer._
@@ -13130,7 +13136,7 @@

    Supress success

    2021-08-31 16:01:00
    -

    *Thread Reply:* Seems, at the very least, we need to provide a way to specify the job name at the notebook level

    +

    *Thread Reply:* Seems, at the very least, we need to provide a way to specify the job name at the notebook level

    @@ -13160,7 +13166,7 @@

    Supress success

    2021-08-31 16:03:50
    -

    *Thread Reply:* Agreed. I'd like a default that uses the notebook name that can also be overridden in the notebook.

    +

    *Thread Reply:* Agreed. I'd like a default that uses the notebook name that can also be overridden in the notebook.

    @@ -13186,7 +13192,7 @@

    Supress success

    2021-08-31 16:10:42
    -

    *Thread Reply:* if you have some insight into the available options, it would be great if you can open an issue on the OL project. I'll have to carve out some time to play with a databricks cluster and learn what options we have

    +

    *Thread Reply:* if you have some insight into the available options, it would be great if you can open an issue on the OL project. I'll have to carve out some time to play with a databricks cluster and learn what options we have

    @@ -13216,7 +13222,7 @@

    Supress success

    2021-08-31 18:26:11
    -

*Thread Reply:* Thank you @Luke Smith, the method you recommend works for me; the cluster is running and apparently it fetches the configuration. It was my first progress in over a week of testing OpenLineage in Azure Databricks. Thank you!

+

*Thread Reply:* Thank you @Luke Smith, the method you recommend works for me; the cluster is running and apparently it fetches the configuration. It was my first progress in over a week of testing OpenLineage in Azure Databricks. Thank you!

    Now I have this:

    @@ -13253,7 +13259,7 @@

    Supress success

    2021-08-31 18:52:15
    -

    *Thread Reply:* Is this error thrown during init or job execution?

    +

    *Thread Reply:* Is this error thrown during init or job execution?

    @@ -13279,7 +13285,7 @@

    Supress success

    2021-08-31 18:55:30
    -

    *Thread Reply:* this is likely a race condition- I've seen it happen for jobs that start and complete very quickly- things like defining temp views or similar

    +

    *Thread Reply:* this is likely a race condition- I've seen it happen for jobs that start and complete very quickly- things like defining temp views or similar

    @@ -13305,7 +13311,7 @@

    Supress success

    2021-08-31 19:59:15
    -

*Thread Reply:* During the execution of the job, @Luke Smith. Thank you @Michael Collado, that was exactly the scenario - the job that I executed was empty. Now the cluster is running OK, I don't have errors, and I have run some jobs successfully, but I don't see any information in my Datakin explorer

+

*Thread Reply:* During the execution of the job, @Luke Smith. Thank you @Michael Collado, that was exactly the scenario - the job that I executed was empty. Now the cluster is running OK, I don't have errors, and I have run some jobs successfully, but I don't see any information in my Datakin explorer

    @@ -13331,7 +13337,7 @@

    Supress success

    2021-08-31 20:00:46
    -

    *Thread Reply:* Awesome! Great to hear you’re up and running. For datakin specific questions, mind if we move the discussion to the datakin user slack channel?

    +

    *Thread Reply:* Awesome! Great to hear you’re up and running. For datakin specific questions, mind if we move the discussion to the datakin user slack channel?

    @@ -13357,7 +13363,7 @@

    Supress success

    2021-08-31 20:01:17
    -

    *Thread Reply:* Yes Willy, thank you!

    +

    *Thread Reply:* Yes Willy, thank you!

    @@ -13383,7 +13389,7 @@

    Supress success

    2021-09-02 10:06:00
    -

*Thread Reply:* Hi, @Luke Smith, thank you for your help. Are you familiar with this error in Azure Databricks when you use OL?

+

*Thread Reply:* Hi, @Luke Smith, thank you for your help. Are you familiar with this error in Azure Databricks when you use OL?

    @@ -13418,7 +13424,7 @@

    Supress success

    2021-09-02 10:07:07
    -

    *Thread Reply:*

    +

    *Thread Reply:*

    @@ -13453,7 +13459,7 @@

    Supress success

    2021-09-02 10:17:17
    -

    *Thread Reply:* I found the solution here: +

    *Thread Reply:* I found the solution here: https://docs.microsoft.com/en-us/answers/questions/170730/handshake-fails-trying-to-connect-from-azure-datab.html

    @@ -13510,7 +13516,7 @@

    Supress success

    2021-09-02 10:17:28
    -

    *Thread Reply:* It works now! 😄

    +

    *Thread Reply:* It works now! 😄

    @@ -13540,7 +13546,7 @@

    Supress success

    2021-09-02 16:33:01
    -

*Thread Reply:* @Erick Navarro This might be helpful to add to our openlineage spark docs for others trying out openlineage-spark with Databricks. Let me know if that’s something you’d like to contribute 🙂

+

*Thread Reply:* @Erick Navarro This might be helpful to add to our openlineage spark docs for others trying out openlineage-spark with Databricks. Let me know if that’s something you’d like to contribute 🙂

    @@ -13566,7 +13572,7 @@

    Supress success

    2021-09-02 19:59:10
    -

    *Thread Reply:* Yes of course @Willy Lulciuc, I will prepare a small tutorial for my colleagues and I will share it with you 🙂

    +

    *Thread Reply:* Yes of course @Willy Lulciuc, I will prepare a small tutorial for my colleagues and I will share it with you 🙂

    @@ -13592,7 +13598,7 @@

    Supress success

    2021-09-02 20:44:36
    -

    *Thread Reply:* Awesome. Thanks!

    +

    *Thread Reply:* Awesome. Thanks!

    @@ -13644,7 +13650,7 @@

    Supress success

    2021-09-02 04:57:55
    -

*Thread Reply:* Hey! If you mean this picture, it shows the concept of how OpenLineage works, not the current state of integration. We don't have Prefect support yet; however, it's on our roadmap.

+

*Thread Reply:* Hey! If you mean this picture, it shows the concept of how OpenLineage works, not the current state of integration. We don't have Prefect support yet; however, it's on our roadmap.

    @@ -13719,7 +13725,7 @@

    Supress success

    2021-09-02 05:22:15
    -

    *Thread Reply:* great, thanks 🙂

    +

    *Thread Reply:* great, thanks 🙂

    @@ -13745,7 +13751,7 @@

    Supress success

    2021-09-02 11:49:48
    -

    *Thread Reply:* @Thomas Fredriksen Feel free to chime in the github issue Maciej linked if you want.

    +

    *Thread Reply:* @Thomas Fredriksen Feel free to chime in the github issue Maciej linked if you want.

    @@ -13797,7 +13803,7 @@

    Supress success

    2021-09-02 14:28:11
    -

    *Thread Reply:* It is being worked on right now. @Oleksandr Dvornik is adding an integration test in the build so that we run test for both spark 2.4 and spark 3. Please open an issue with the stack trace if you can. From our perspective, it should be mostly compatible with a few exceptions like this one that we’d want to add test cases for.

    +

    *Thread Reply:* It is being worked on right now. @Oleksandr Dvornik is adding an integration test in the build so that we run test for both spark 2.4 and spark 3. Please open an issue with the stack trace if you can. From our perspective, it should be mostly compatible with a few exceptions like this one that we’d want to add test cases for.

    @@ -13823,7 +13829,7 @@

    Supress success

    2021-09-02 14:36:19
    -

    *Thread Reply:* The goal is to be able to make a release in the next few weeks. The integration is being used with Spark 3 already.

    +

    *Thread Reply:* The goal is to be able to make a release in the next few weeks. The integration is being used with Spark 3 already.

    @@ -13853,7 +13859,7 @@

    Supress success

    2021-09-02 15:50:14
    -

*Thread Reply:* Great, I'll take some time to open an issue for this particular problem and a few others.

+

*Thread Reply:* Great, I'll take some time to open an issue for this particular problem and a few others.

    @@ -13879,7 +13885,7 @@

    Supress success

    2021-09-02 17:33:08
    -

    *Thread Reply:* are you actually using the DatasetSource interface in any capacity? Or are you just scanning the source code to find incompatibilities?

    +

    *Thread Reply:* are you actually using the DatasetSource interface in any capacity? Or are you just scanning the source code to find incompatibilities?

    @@ -13905,7 +13911,7 @@

    Supress success

    2021-09-03 12:36:20
    -

*Thread Reply:* Turns out this has more to do with how Databricks handles the delta format. It's related to https://github.com/AbsaOSS/spline-spark-agent/issues/96.

+

*Thread Reply:* Turns out this has more to do with how Databricks handles the delta format. It's related to https://github.com/AbsaOSS/spline-spark-agent/issues/96.

    @@ -13983,7 +13989,7 @@

    Supress success

    2021-09-03 13:42:43
    -

    *Thread Reply:* I haven't been chasing this issue down on my team -- turns out some things were lost in communication. There are really two problems here:

    +

    *Thread Reply:* I haven't been chasing this issue down on my team -- turns out some things were lost in communication. There are really two problems here:

    1. When attempting to do delta I/O with Spark 3 on Databricks, e.g. insert into . . . values . . . @@ -14069,7 +14075,7 @@

      Supress success

    2021-09-03 13:44:54
    -

    *Thread Reply:* So, the first issue is OpenLineage related directly, and the second issue applies to both OpenLineage and Spline?

    +

    *Thread Reply:* So, the first issue is OpenLineage related directly, and the second issue applies to both OpenLineage and Spline?

    @@ -14095,7 +14101,7 @@

    Supress success

    2021-09-03 13:45:49
    -

    *Thread Reply:* Yes, that's my read of what I'm getting from others on the team.

    +

    *Thread Reply:* Yes, that's my read of what I'm getting from others on the team.

    @@ -14121,7 +14127,7 @@

    Supress success

    2021-09-03 13:46:56
    -

    *Thread Reply:* For the first issue- can you give some details about the target of the INSERT INTO... ? Is it a data source defined in Databricks? a Hive table? a view on GCS?

    +

    *Thread Reply:* For the first issue- can you give some details about the target of the INSERT INTO... ? Is it a data source defined in Databricks? a Hive table? a view on GCS?

    @@ -14147,7 +14153,7 @@

    Supress success

    2021-09-03 13:47:40
    -

    *Thread Reply:* oh, it's a Delta table?

    +

    *Thread Reply:* oh, it's a Delta table?

    @@ -14173,7 +14179,7 @@

    Supress success

    2021-09-03 14:48:15
    -

    *Thread Reply:* Yes, it's created via

    +

    *Thread Reply:* Yes, it's created via

    CREATE TABLE . . . using DELTA location "/dbfs/mnt/ . . . "

    @@ -14286,7 +14292,7 @@

    Supress success

    2021-09-04 12:53:54
    -

    *Thread Reply:* Apache Beam integration? I have a very crude integration at the moment. Maybe it’s better to integrate on the orchestration level (airflow, luigi). Thoughts?

    +

    *Thread Reply:* Apache Beam integration? I have a very crude integration at the moment. Maybe it’s better to integrate on the orchestration level (airflow, luigi). Thoughts?

    @@ -14312,7 +14318,7 @@

    Supress success

    2021-09-05 13:06:19
    -

    *Thread Reply:* I think it makes a lot of sense to have a Beam level integration similar to the spark one. Feel free to post a draft PR if you want to share.

    +

    *Thread Reply:* I think it makes a lot of sense to have a Beam level integration similar to the spark one. Feel free to post a draft PR if you want to share.

    @@ -14338,7 +14344,7 @@

    Supress success

    2021-09-07 21:04:09
    -

    *Thread Reply:* I have added Beam as a topic for the roadmap discussion slide: https://docs.google.com/presentation/d/1fI0u8aE0iX9vG4GGrnQYAEcsJM9z7Rlv/edit#slide=id.ge7d4b64ef4_0_0

    +

    *Thread Reply:* I have added Beam as a topic for the roadmap discussion slide: https://docs.google.com/presentation/d/1fI0u8aE0iX9vG4GGrnQYAEcsJM9z7Rlv/edit#slide=id.ge7d4b64ef4_0_0

    @@ -14390,7 +14396,7 @@

    Supress success

    2021-09-07 21:03:32
    -

    *Thread Reply:* There will be a quick demo of the dbt integration (thanks @Willy Lulciuc!)

    +

    *Thread Reply:* There will be a quick demo of the dbt integration (thanks @Willy Lulciuc!)

    @@ -14420,7 +14426,7 @@

    Supress success

    2021-09-07 21:05:13
    -

    *Thread Reply:* Information to join and archive of previous meetings: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

    +

    *Thread Reply:* Information to join and archive of previous meetings: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

    @@ -14446,7 +14452,7 @@

    Supress success

    2021-09-08 14:49:52
    -

    *Thread Reply:* The recording and notes are now available: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

    +

    *Thread Reply:* The recording and notes are now available: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

    @@ -14472,7 +14478,7 @@

    Supress success

    2021-09-08 21:58:09
    -

    *Thread Reply:* Good meeting today. @Julien Le Dem. Thanks

    +

    *Thread Reply:* Good meeting today. @Julien Le Dem. Thanks

    @@ -14572,7 +14578,7 @@

    Supress success

    2021-09-08 04:27:04
    -

*Thread Reply:* We're working on updating our integration to airflow 2. Some changes in airflow made the current approach unfeasible, so a slight change in the way we capture events is needed. You can take a look at progress here: https://github.com/OpenLineage/OpenLineage/tree/airflow/2

+

*Thread Reply:* We're working on updating our integration to airflow 2. Some changes in airflow made the current approach unfeasible, so a slight change in the way we capture events is needed. You can take a look at progress here: https://github.com/OpenLineage/OpenLineage/tree/airflow/2

    @@ -14622,7 +14628,7 @@

    Supress success

    2021-09-08 04:27:38
    -

*Thread Reply:* Thanks @Maciej Obuchowski. When is this expected to land in a release?

+

*Thread Reply:* Thanks @Maciej Obuchowski. When is this expected to land in a release?

    @@ -14648,7 +14654,7 @@

    Supress success

    2021-11-11 06:35:24
    -

    *Thread Reply:* hi @Maciej Obuchowski I wanted to follow up on this to understand when the more recent BQ Operators will be supported, specifically BigQueryInsertJobOperator

    +

    *Thread Reply:* hi @Maciej Obuchowski I wanted to follow up on this to understand when the more recent BQ Operators will be supported, specifically BigQueryInsertJobOperator

    @@ -14765,7 +14771,7 @@

    Supress success

    2021-09-13 15:40:01
    -

*Thread Reply:* Welcome, @Jose Badeau 👋. That’s exciting to hear as we have Beam, Flink and Iceberg on our roadmap! You're welcome to join the discussion :)

+

*Thread Reply:* Welcome, @Jose Badeau 👋. That’s exciting to hear as we have Beam, Flink and Iceberg on our roadmap! You're welcome to join the discussion :)

    @@ -14979,7 +14985,7 @@

    Supress success

    2021-09-14 05:03:45
    -

*Thread Reply:* Hello Tomas. We are currently working on support for Spark v3. Can you please raise an issue with the stack trace? That would help us track and solve it. We are currently adding integration tests. The next step would be to fix changes in method signatures for v3 (that's what you have).

+

*Thread Reply:* Hello Tomas. We are currently working on support for Spark v3. Can you please raise an issue with the stack trace? That would help us track and solve it. We are currently adding integration tests. The next step would be to fix changes in method signatures for v3 (that's what you have).

    @@ -15005,7 +15011,7 @@

    Supress success

    2021-09-14 05:12:45
    -

    *Thread Reply:* Hi @Oleksandr Dvornik i raised https://github.com/OpenLineage/OpenLineage/issues/272

    +

    *Thread Reply:* Hi @Oleksandr Dvornik i raised https://github.com/OpenLineage/OpenLineage/issues/272

    @@ -15308,7 +15314,7 @@

    Supress success

    2021-09-14 08:55:44
    -

*Thread Reply:* I think this might actually be a spark issue: https://stackoverflow.com/questions/53787624/spark-throwing-arrayindexoutofboundsexception-when-parallelizing-list/53787847

+

*Thread Reply:* I think this might actually be a spark issue: https://stackoverflow.com/questions/53787624/spark-throwing-arrayindexoutofboundsexception-when-parallelizing-list/53787847

    Stack Overflow
    @@ -15359,7 +15365,7 @@

    Supress success

    2021-09-14 08:56:10
    -

*Thread Reply: Can you try a newer version in the 2.4.* line, like 2.4.7?

+

*Thread Reply:* Can you try a newer version in the 2.4.** line, like 2.4.7?

    @@ -15389,7 +15395,7 @@

    Supress success

    2021-09-14 08:57:30
    -

*Thread Reply:* This might also be a spark 2.4 with scala 2.12 issue - I'd recommend the 2.11 versions.

+

*Thread Reply:* This might also be a spark 2.4 with scala 2.12 issue - I'd recommend the 2.11 versions.

    @@ -15415,7 +15421,7 @@

    Supress success

    2021-09-14 09:04:26
    -

*Thread Reply:* @Maciej Obuchowski with 2.4.7 I get the following exception:

+

*Thread Reply:* @Maciej Obuchowski with 2.4.7 I get the following exception:

    @@ -15441,7 +15447,7 @@

    Supress success

    2021-09-14 09:04:27
    -

    *Thread Reply:* 21/09/14 15:03:25 WARN RddExecutionContext: Unable to access job conf from RDD +

*Thread Reply:* 21/09/14 15:03:25 WARN RddExecutionContext: Unable to access job conf from RDD
java.lang.NoSuchFieldException: config$1
    at java.base/java.lang.Class.getDeclaredField(Class.java:2411)

    @@ -15469,7 +15475,7 @@

    Supress success

    2021-09-14 09:04:48
    -

*Thread Reply:* I can also try to switch to 2.11 scala

+

*Thread Reply:* I can also try to switch to 2.11 scala

    @@ -15495,7 +15501,7 @@

    Supress success

    2021-09-14 09:05:37
    -

    *Thread Reply:* or do you have some recommended setup that works for sure?

    +

    *Thread Reply:* or do you have some recommended setup that works for sure?

    @@ -15521,7 +15527,7 @@

    Supress success

    2021-09-14 09:09:58
    -

    *Thread Reply:* One more check - you're using Java 8 with this, right?

    +

    *Thread Reply:* One more check - you're using Java 8 with this, right?

    @@ -15547,7 +15553,7 @@

    Supress success

    2021-09-14 09:10:17
    -

    *Thread Reply:* This is what works for me: +

*Thread Reply:* This is what works for me:
-> % cat tools/spark-2.4/RELEASE
Spark 2.4.8 (git revision 4be4064) built for Hadoop 2.7.3
Build flags: -B -Pmesos -Pyarn -Pkubernetes -Pflume -Psparkr -Pkafka-0-8 -Phadoop-2.7 -Phive -Phive-thriftserver -DzincPort=3036

    @@ -15576,7 +15582,7 @@

    Supress success

    2021-09-14 09:11:23
    -

    *Thread Reply:* spark-shell: +

*Thread Reply:* spark-shell:
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)

    @@ -15603,7 +15609,7 @@

    Supress success

    2021-09-14 09:12:05
    -

    *Thread Reply:* awesome let me try 🙂

    +

    *Thread Reply:* awesome let me try 🙂

    @@ -15629,7 +15635,7 @@

    Supress success

    2021-09-14 09:26:00
    -

    *Thread Reply:* data has been sent to marquez. coolio. however i noticed nullpointer being thrown: 21/09/14 15:23:53 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception +

*Thread Reply:* data has been sent to marquez. coolio. however I noticed a nullpointer being thrown:
21/09/14 15:23:53 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception
java.lang.NullPointerException
    at io.openlineage.spark.agent.OpenLineageSparkListener.onJobEnd(OpenLineageSparkListener.java:164)
    at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:39)
@@ -15670,7 +15676,7 @@

    Supress success

    2021-09-14 10:59:45
    -

    *Thread Reply:* closed related issue #274

    +

    *Thread Reply:* closed related issue #274

    @@ -15750,7 +15756,7 @@

    Supress success

    2021-09-14 13:38:09
    -

    *Thread Reply:* Not at the moment, but it is in scope. You are welcome to open an issue with your example to track this or even propose an implementation if you have the time.

    +

    *Thread Reply:* Not at the moment, but it is in scope. You are welcome to open an issue with your example to track this or even propose an implementation if you have the time.

    @@ -15776,7 +15782,7 @@

    Supress success

    2021-09-14 15:12:01
    -

*Thread Reply:* @Tomas Satka it would be great if you could add a containerized integration test for kafka with your test case. You can take this as an example here

+

*Thread Reply:* @Tomas Satka it would be great if you could add a containerized integration test for kafka with your test case. You can take this as an example here

    @@ -16033,7 +16039,7 @@

    Supress success

    2021-09-14 18:02:05
    -

*Thread Reply:* Hi @Oleksandr Dvornik I wrote a test for a simple read/write from a kafka topic using a kafka testcontainer. However I discovered a bug. When writing to a kafka topic I am getting java.lang.IllegalArgumentException: One of the following options must be specified for Kafka source: subscribe, subscribepattern, assign. See the docs for more details.

+

*Thread Reply:* Hi @Oleksandr Dvornik I wrote a test for a simple read/write from a kafka topic using a kafka testcontainer. However I discovered a bug. When writing to a kafka topic I am getting java.lang.IllegalArgumentException: One of the following options must be specified for Kafka source: subscribe, subscribepattern, assign. See the docs for more details.

• How would you like me to add the test? Fork OpenLineage and create a PR?

    @@ -16061,7 +16067,7 @@

    Supress success

    2021-09-14 18:02:50
    -

*Thread Reply:* • Shall I raise a bug for writing to kafka, which should have only "topic" instead of "subscribe"?

+

*Thread Reply:* • Shall I raise a bug for writing to kafka, which should have only "topic" instead of "subscribe"?
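For context, the asymmetry Tomas hit comes from Spark's Kafka connector options: reads take one of subscribe / subscribePattern / assign, while writes take topic. A minimal PySpark illustration, assuming an existing SparkSession named spark:

# Reading requires one of: subscribe, subscribePattern, assign
df = (spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "my_input_topic")
    .load())

# Writing uses the "topic" option instead
(df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .write.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "my_output_topic")
    .save())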

    @@ -16087,7 +16093,7 @@

    Supress success

    2021-09-14 18:03:42
    -

*Thread Reply:* • Since I don't know the expected payload for the openlineage mock server, can somebody help me create it?

+

*Thread Reply:* • Since I don't know the expected payload for the openlineage mock server, can somebody help me create it?

    @@ -16113,7 +16119,7 @@

    Supress success

    2021-09-14 19:06:41
    -

    *Thread Reply:* Hi @Tomas Satka, yes you should create a fork and raise a PR from that. For more details, please take a look at. Not sure about kafka, cause we don't have that integration yet. About expected payload, as a first step, I would suggest to leave that test without assertion for now. Second step would be investigation (what we can get from that plan node). Third step - implementation and asserting a payload. Basically we parse spark optimized plan, and get as much information as we can for specific implementation. You can take a look at recent PR for HIVE. We visit root node and leaves to get output datasets and input datasets accordingly.

    +

    *Thread Reply:* Hi @Tomas Satka, yes you should create a fork and raise a PR from that. For more details, please take a look at. Not sure about kafka, cause we don't have that integration yet. About expected payload, as a first step, I would suggest to leave that test without assertion for now. Second step would be investigation (what we can get from that plan node). Third step - implementation and asserting a payload. Basically we parse spark optimized plan, and get as much information as we can for specific implementation. You can take a look at recent PR for HIVE. We visit root node and leaves to get output datasets and input datasets accordingly.

    @@ -16172,7 +16178,7 @@

    Supress success

    2021-09-15 04:37:59
    -

    *Thread Reply:* Hi @Oleksandr Dvornik PR for step one : https://github.com/OpenLineage/OpenLineage/pull/279

    +

    *Thread Reply:* Hi @Oleksandr Dvornik PR for step one : https://github.com/OpenLineage/OpenLineage/pull/279

    @@ -16313,7 +16319,7 @@

    Supress success

    2021-09-15 09:19:23
    -

    *Thread Reply:* Hey Thomas!

    +

    *Thread Reply:* Hey Thomas!

Following what we do with Airflow, yes, I think that each task should be mapped to a job.

    @@ -16349,7 +16355,7 @@

    Supress success

    2021-09-15 09:26:21
    -

    *Thread Reply:* this is very helpful, thank you

    +

    *Thread Reply:* this is very helpful, thank you

    @@ -16375,7 +16381,7 @@

    Supress success

    2021-09-15 09:26:43
    -

*Thread Reply:* what would be the namespace used in the Job definition of each task?

+

*Thread Reply:* what would be the namespace used in the Job definition of each task?

    @@ -16401,7 +16407,7 @@

    Supress success

    2021-09-15 09:31:34
    -

*Thread Reply:* In contrast to dataset namespaces - which we try to standardize - job namespaces should be provided by the user, or the operator of a particular scheduler.

+

*Thread Reply:* In contrast to dataset namespaces - which we try to standardize - job namespaces should be provided by the user, or the operator of a particular scheduler.

For example, it would be good if it helped you identify the Prefect instance where the job was run.

    @@ -16429,7 +16435,7 @@

    Supress success

    2021-09-15 09:32:23
    -

*Thread Reply:* If you use the openlineage-python client, you can provide the namespace either in the client constructor or via the OPENLINEAGE_NAMESPACE env variable.

+

*Thread Reply:* If you use the openlineage-python client, you can provide the namespace either in the client constructor or via the OPENLINEAGE_NAMESPACE env variable.
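A minimal sketch with the openlineage-python client; class and argument names follow the client as of this discussion, so check your installed version:

from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")

event = RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid4())),
    # The job namespace is where you identify your Prefect (or other scheduler) instance.
    job=Job(namespace="my-prefect-instance", name="my_flow.my_task"),
    producer="https://example.com/my-prefect-adapter",  # hypothetical
)
client.emit(event)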

    @@ -16455,7 +16461,7 @@

    Supress success

    2021-09-15 09:32:55
    -

    *Thread Reply:* awesome, thank you 🙂

    +

    *Thread Reply:* awesome, thank you 🙂

    @@ -16481,7 +16487,7 @@

    Supress success

    2021-09-15 17:03:07
    -

    *Thread Reply:* Hey @Thomas Fredriksen - just chiming in, I’m also keen for a prefect integration. Let me know if I can help out at all

    +

    *Thread Reply:* Hey @Thomas Fredriksen - just chiming in, I’m also keen for a prefect integration. Let me know if I can help out at all

    @@ -16507,7 +16513,7 @@

    Supress success

    2021-09-15 17:27:20
    -

    *Thread Reply:* Please chime in on https://github.com/OpenLineage/OpenLineage/issues/81

    +

    *Thread Reply:* Please chime in on https://github.com/OpenLineage/OpenLineage/issues/81

    @@ -16557,7 +16563,7 @@

    Supress success

    2021-09-15 18:29:20
    -

    *Thread Reply:* Done!

    +

    *Thread Reply:* Done!

    @@ -16587,7 +16593,7 @@

    Supress success

    2021-09-16 00:06:41
    -

    *Thread Reply:* For now I'm prototyping in a separate repo https://github.com/limx0/caching_flow_runner/tree/open_lineage

    +

    *Thread Reply:* For now I'm prototyping in a separate repo https://github.com/limx0/caching_flow_runner/tree/open_lineage

    @@ -16613,7 +16619,7 @@

    Supress success

    2021-09-17 01:55:08
    -

*Thread Reply:* I really like your PR, @Brad. I think that using FlowRunner and TaskRunner may be a more "proper" way of doing this, as opposed to adding a state-handler to each task the way I do it.

+

*Thread Reply:* I really like your PR, @Brad. I think that using FlowRunner and TaskRunner may be a more "proper" way of doing this, as opposed to adding a state-handler to each task the way I do it.

    How are you dealing with Prefect-library tasks such as the included BigQuery-tasks and such? Is it necessary to create DatasetTask for them to show up in the lineage graph?

    @@ -16641,7 +16647,7 @@

    Supress success

    2021-09-17 02:04:19
    -

    *Thread Reply:* Hey @Thomas Fredriksen! At the moment I'm not dealing with any task-specific things. The plan (in my head, and after speaking with another prefect user @davzucky) would be that we add a LineageTask subclass where you could define custom facets on a per task basis

    +

    *Thread Reply:* Hey @Thomas Fredriksen! At the moment I'm not dealing with any task-specific things. The plan (in my head, and after speaking with another prefect user @davzucky) would be that we add a LineageTask subclass where you could define custom facets on a per task basis

    @@ -16667,7 +16673,7 @@

    Supress success

    2021-09-17 02:05:21
    -

    *Thread Reply:* or some sort of other hook where basically you would define some lineage attribute or put something in the prefect.context that the TaskRunner would find and attach

    +

    *Thread Reply:* or some sort of other hook where basically you would define some lineage attribute or put something in the prefect.context that the TaskRunner would find and attach

    @@ -16693,7 +16699,7 @@

    Supress success

    2021-09-17 02:06:23
    -

    *Thread Reply:* Sorry I misread your question - any tasks should be automatically tracked (I believe but have not tested yet!)

    +

    *Thread Reply:* Sorry I misread your question - any tasks should be automatically tracked (I believe but have not tested yet!)

    @@ -16719,7 +16725,7 @@

    Supress success

    2021-09-17 02:16:02
    -

    *Thread Reply:* @Brad Could you elaborate a bit on your ideas around adding custom context attributes?

    +

    *Thread Reply:* @Brad Could you elaborate a bit on your ideas around adding custom context attributes?

    @@ -16745,7 +16751,7 @@

    Supress success

    2021-09-17 02:21:57
    -

    *Thread Reply:* yeah so basically we just need some hooks that you can easily access from the task decorator or somewhere else that we can pass through to the open lineage adapter to do things like custom facets

    +

    *Thread Reply:* yeah so basically we just need some hooks that you can easily access from the task decorator or somewhere else that we can pass through to the open lineage adapter to do things like custom facets

    @@ -16771,7 +16777,7 @@

    Supress success

    2021-09-17 02:24:31
    -

    *Thread Reply:* like for your bigquery example - you might want to record some facets like in https://github.com/OpenLineage/OpenLineage/blob/main/integration/common/openlineage/common/provider/bigquery.py and we need a way to do that with the Prefect bigquery task

    +

    *Thread Reply:* like for your bigquery example - you might want to record some facets like in https://github.com/OpenLineage/OpenLineage/blob/main/integration/common/openlineage/common/provider/bigquery.py and we need a way to do that with the Prefect bigquery task

    @@ -16797,7 +16803,7 @@

    Supress success

    2021-09-17 02:28:28
    -

    *Thread Reply:* @davzucky

    +

    *Thread Reply:* @davzucky

    @@ -16823,7 +16829,7 @@

    Supress success

    2021-09-17 02:29:12
    -

    *Thread Reply:* I see. Is this supported by the airflow-integration?

    +

    *Thread Reply:* I see. Is this supported by the airflow-integration?

    @@ -16849,7 +16855,7 @@

    Supress success

    2021-09-17 02:29:32
    -

    *Thread Reply:* I think so, yes

    +

    *Thread Reply:* I think so, yes

    @@ -16875,7 +16881,7 @@

    Supress success

    2021-09-17 02:30:51
    2021-09-17 02:31:54
    -

    *Thread Reply:* (I don't actually use airflow or bigquery - but for my own use case I can see wanting to do thing like this)

    +

    *Thread Reply:* (I don't actually use airflow or bigquery - but for my own use case I can see wanting to do thing like this)

    @@ -16927,7 +16933,7 @@

    Supress success

    2021-09-17 03:18:27
    -

    *Thread Reply:* Interesting, I like how dynamic this is

    +

    *Thread Reply:* Interesting, I like how dynamic this is

    @@ -16980,7 +16986,7 @@

    Supress success

    2021-09-15 09:29:25
    -

*Thread Reply:* Hey.

+

    *Thread Reply:* Hey. Generally, dataSource name should be namespace of particular dataset.

In some cases, like Postgres, the dataSource facet is used to additionally provide connection strings, with info like the particular host and port that we're connected to.
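For illustration, a hedged sketch of a dataset built along those lines with the openlineage-python client - host and table names are invented, and the facet class is assumed from the client's facet module:

```
from openlineage.client.facet import DataSourceDatasetFacet
from openlineage.client.run import Dataset

# The dataset namespace and its dataSource agree; host/port carry the
# connection details for this particular Postgres instance.
users = Dataset(
    namespace="postgres://db.example.com:5432",
    name="public.users",
    facets={
        "dataSource": DataSourceDatasetFacet(
            name="postgres://db.example.com:5432",
            uri="postgres://db.example.com:5432",
        )
    },
)
```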

    @@ -17011,7 +17017,7 @@

    Supress success

    2021-09-15 10:11:19
    -

    *Thread Reply:* Thanks. So then perhaps marquez could differentiate a bit more between job & dataset namespaces. Right now it doesn't quite feel right to have a single global list of namespaces for jobs & datasets, especially as they also have a separate concept of sources (which are not in a namespace).

    +

    *Thread Reply:* Thanks. So then perhaps marquez could differentiate a bit more between job & dataset namespaces. Right now it doesn't quite feel right to have a single global list of namespaces for jobs & datasets, especially as they also have a separate concept of sources (which are not in a namespace).

    @@ -17037,7 +17043,7 @@

    Supress success

    2021-09-15 10:18:59
    -

    *Thread Reply:* @Willy Lulciuc what do you think?

    +

    *Thread Reply:* @Willy Lulciuc what do you think?

    @@ -17063,7 +17069,7 @@

    Supress success

    2021-09-15 10:41:20
    -

*Thread Reply:* As an example, in marquez I have this list of namespaces (from some sample data): dbt-sales, default, <snowflake://my-account1>, <snowflake://my-account2>.

+

    *Thread Reply:* As an example, in marquez I have this list of namespaces (from some sample data): dbt-sales, default, <snowflake://my-account1>, <snowflake://my-account2>. I think the new marquez UI with the nice namespace dropdown and job/dataset search is awesome, and I'd expect to be able to filter by job namespace everywhere, but how about being able to filter datasets by source (which would be populated by the OL dataset namespace) and not persist dataset namespaces in the global namespace table?

    @@ -17126,7 +17132,7 @@

    Supress success

    2021-09-15 18:43:18
    -

    *Thread Reply:* Last point is that we should persist the configuration and not just have it in environment variables. What is the best way to do this in dbt?

    +

    *Thread Reply:* Last point is that we should persist the configuration and not just have it in environment variables. What is the best way to do this in dbt?

    @@ -17152,7 +17158,7 @@

    Supress success

    2021-09-15 18:49:21
    -

    *Thread Reply:* We could have something similar to https://docs.getdbt.com/dbt-cli/configure-your-profile - or even put our config in there

    +

    *Thread Reply:* We could have something similar to https://docs.getdbt.com/dbt-cli/configure-your-profile - or even put our config in there

    @@ -17182,7 +17188,7 @@

    Supress success

    2021-09-15 18:51:42
    -

    *Thread Reply:* I think we should assume that variables/config should be set and valid - and fail the run if they aren't. After all, if someone wouldn't need lineage events, they wouldn't use our wrapper.

    +

    *Thread Reply:* I think we should assume that variables/config should be set and valid - and fail the run if they aren't. After all, if someone wouldn't need lineage events, they wouldn't use our wrapper.

    @@ -17208,7 +17214,7 @@

    Supress success

    2021-09-15 18:56:36
    -

    *Thread Reply:* 3rd point would be easy to address if we could send events async/in parallel. But there could be dataset version dependencies, and we don't want to get into needless complexity of recognizing that, building a dag etc.

    +

    *Thread Reply:* 3rd point would be easy to address if we could send events async/in parallel. But there could be dataset version dependencies, and we don't want to get into needless complexity of recognizing that, building a dag etc.

    We could batch events if the network roundtrips are responsible for majority of the slowdown. However, we can't assume any particular environment.

    @@ -17242,7 +17248,7 @@

    Supress success

    2021-09-15 18:58:22
    -

*Thread Reply:* About the second point: I want to add recognizing whether we already have a parent run - for example, if running via airflow. If not, creating a run for this purpose is a good idea.

    +

*Thread Reply:* About the second point: I want to add recognizing whether we already have a parent run - for example, if running via airflow. If not, creating a run for this purpose is a good idea.

    @@ -17268,7 +17274,7 @@

    Supress success

    2021-09-15 21:31:35
    -

    *Thread Reply:* @Maciej Obuchowski can you open github issues to propose those changes?

    +

    *Thread Reply:* @Maciej Obuchowski can you open github issues to propose those changes?

    @@ -17294,7 +17300,7 @@

    Supress success

    2021-09-16 09:11:31
    -

    *Thread Reply:* Done

    +

    *Thread Reply:* Done

    @@ -17320,7 +17326,7 @@

    Supress success

    2021-09-16 12:05:10
    -

    *Thread Reply:* FWIW, I have been putting my config in ~/.openlineage/config so it can be mapped into a container

    +

    *Thread Reply:* FWIW, I have been putting my config in ~/.openlineage/config so it can be mapped into a container

    @@ -17346,7 +17352,7 @@

    Supress success

    2021-09-16 17:56:23
    -

    *Thread Reply:* Makes sense, also, all clients could use that config

    +

    *Thread Reply:* Makes sense, also, all clients could use that config

    @@ -17372,7 +17378,7 @@

    Supress success

    2021-10-18 04:47:08
    -

    *Thread Reply:* if dbt could actually stream the events, that would be great.

    +

    *Thread Reply:* if dbt could actually stream the events, that would be great.

    @@ -17398,7 +17404,7 @@

    Supress success

    2021-10-18 09:59:12
    -

    *Thread Reply:* Unfortunately, this seems very unlikely for now, due to the fact that we rely on metadata files that dbt only produces after end of execution.

    +

    *Thread Reply:* Unfortunately, this seems very unlikely for now, due to the fact that we rely on metadata files that dbt only produces after end of execution.

    @@ -17515,7 +17521,7 @@

    Supress success

    2021-09-16 00:16:44
    -

    *Thread Reply:* This might be useful to others https://github.com/OpenLineage/OpenLineage/pull/284

    +

    *Thread Reply:* This might be useful to others https://github.com/OpenLineage/OpenLineage/pull/284

    @@ -17541,7 +17547,7 @@

    Supress success

    2021-09-16 00:18:44
    -

*Thread Reply:* So I'm trying to push a simple event to marquez, but getting the following response:

+

*Thread Reply:* So I'm trying to push a simple event to marquez, but getting the following response:
'{"code":400,"message":"Unable to process JSON"}'
The JSON I'm pushing:

    @@ -17586,7 +17592,7 @@

    Supress success

    2021-09-16 02:41:11
    -

*Thread Reply:* What I did previously when debugging something like this was to remove half of the payload until I found the culprit. Binary search essentially. I was running Marquez locally, so probably could’ve enabled better logging as well.

+

    *Thread Reply:* What I did previously when debugging something like this was to remove half of the payload until I found the culprit. Binary search essentially. I was running Marquez locally, so probably could’ve enabled better logging as well. Aren’t inputs and facets arrays?

    @@ -17617,7 +17623,7 @@

    Supress success

    2021-09-16 03:14:54
    -

    *Thread Reply:* Thanks for the response @marko - this is a greatly reduced payload already (but I'll keep going). Yep they are supposed to be arrays (I've since fixed that)

    +

    *Thread Reply:* Thanks for the response @marko - this is a greatly reduced payload already (but I'll keep going). Yep they are supposed to be arrays (I've since fixed that)

    @@ -17643,7 +17649,7 @@

    Supress success

    2021-09-16 03:46:01
    -

    *Thread Reply:* okay it was my timestamp 🥲

    +

    *Thread Reply:* okay it was my timestamp 🥲
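For reference, the eventTime that Marquez accepts is an ISO-8601 / RFC 3339 timestamp; a minimal sketch of producing one (this is not the poster's original payload):

```
from datetime import datetime, timezone

# Renders e.g. '2021-09-16T07:46:01.123456+00:00' - a malformed timestamp
# is exactly the kind of thing that triggers the 400 above.
event_time = datetime.now(timezone.utc).isoformat()
```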

    @@ -17669,7 +17675,7 @@

    Supress success

    2021-09-16 19:07:16
    -

*Thread Reply:* Okay - I've got a simple working example now https://github.com/limx0/caching_flow_runner/blob/open_lineage/caching_flow_runner/task_runner.py

    +

*Thread Reply:* Okay - I've got a simple working example now https://github.com/limx0/caching_flow_runner/blob/open_lineage/caching_flow_runner/task_runner.py

    @@ -17863,7 +17869,7 @@

    Supress success

    2021-09-16 19:07:37
    -

    *Thread Reply:* I might move this into a proper PR @Julien Le Dem

    +

    *Thread Reply:* I might move this into a proper PR @Julien Le Dem

    @@ -17889,7 +17895,7 @@

    Supress success

    2021-09-16 19:08:12
    -

    *Thread Reply:* Successfully got a basic prefect flow working

    +

    *Thread Reply:* Successfully got a basic prefect flow working

    @@ -17950,7 +17956,7 @@

    Supress success

    2021-09-16 09:53:24
    -

    *Thread Reply:* I think there was some talk somewhere to actually drop the DatasetType concept; can't find where though.

    +

    *Thread Reply:* I think there was some talk somewhere to actually drop the DatasetType concept; can't find where though.

    @@ -17976,7 +17982,7 @@

    Supress success

    2021-09-16 10:04:09
    -

    *Thread Reply:* I've taken a look at your repo. Looks great so far!

    +

    *Thread Reply:* I've taken a look at your repo. Looks great so far!

One thing I've noticed: I don't think you need to use anything from Marquez to emit events. Its lineage ingestion API is deprecated - you can just use the openlineage-python client. If there's something you think is missing from it, feel free to write that here or open an issue.
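A hedged sketch of emitting an event with the openlineage-python client instead - the job/run names are invented, and the client constructor may differ slightly between versions:

```
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

# Point the client at a Marquez instance (or any OpenLineage-compatible backend).
client = OpenLineageClient(url="http://localhost:5000")

client.emit(
    RunEvent(
        eventType=RunState.START,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=str(uuid4())),
        job=Job(namespace="prefect", name="my_flow.my_task"),  # assumed names
        producer="https://github.com/OpenLineage/OpenLineage/tree/main/client/python",
        inputs=[],
        outputs=[],
    )
)
```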

    @@ -18004,7 +18010,7 @@

    Supress success

    2021-09-16 17:12:31
    -

    *Thread Reply:* And would that be replaced by just some Input/Output notion @Maciej Obuchowski?

    +

    *Thread Reply:* And would that be replaced by just some Input/Output notion @Maciej Obuchowski?

    @@ -18030,7 +18036,7 @@

    Supress success

    2021-09-16 17:13:26
    -

    *Thread Reply:* Oh yeah I got a little confused by the single lineage endpoint - but I’ve realised how it all works now. I’m still using the marquez backend to view things but I’ll use the openlineage-client to talk to it

    +

    *Thread Reply:* Oh yeah I got a little confused by the single lineage endpoint - but I’ve realised how it all works now. I’m still using the marquez backend to view things but I’ll use the openlineage-client to talk to it

    @@ -18056,7 +18062,7 @@

    Supress success

    2021-09-16 17:34:46
    -

    *Thread Reply:* Yes 🙌

    +

    *Thread Reply:* Yes 🙌

    @@ -18136,7 +18142,7 @@

    Supress success

    2021-09-16 06:59:34
    -

*Thread Reply:* Hey, Airflow integration tests do not pass env variables to PRs from forks due to security reasons - anyone could create a malicious PR and dump secrets

    +

*Thread Reply:* Hey, Airflow integration tests do not pass env variables to PRs from forks due to security reasons - anyone could create a malicious PR and dump secrets

    @@ -18162,7 +18168,7 @@

    Supress success

    2021-09-16 07:00:29
    -

    *Thread Reply:* So, they will fail and there's nothing to do from your side 🙂

    +

    *Thread Reply:* So, they will fail and there's nothing to do from your side 🙂

    @@ -18188,7 +18194,7 @@

    Supress success

    2021-09-16 07:00:55
    -

    *Thread Reply:* We probably should split those into ones that don't touch external systems, and run those for all PRs

    +

    *Thread Reply:* We probably should split those into ones that don't touch external systems, and run those for all PRs

    @@ -18214,7 +18220,7 @@

    Supress success

    2021-09-16 07:08:03
    -

*Thread Reply:* ah okie. good to know.

+

*Thread Reply:* ah okie. good to know. And in build-integration-spark: Could not resolve all artifacts. Is that also a known issue? Or something on my side that I could fix?

    @@ -18241,7 +18247,7 @@

    Supress success

    2021-09-16 07:11:12
    -

*Thread Reply:* Looks like gradle server problem?

+

*Thread Reply:* Looks like gradle server problem?
> Could not get resource '<https://plugins.gradle.org/m2/com/diffplug/spotless/spotless-lib/2.13.2/spotless-lib-2.13.2.module>'.
> Could not GET '<https://plugins.gradle.org/m2/com/diffplug/spotless/spotless-lib/2.13.2/spotless-lib-2.13.2.module>'. Received status code 500 from server: Internal Server Error

    @@ -18269,7 +18275,7 @@

    Supress success

    2021-09-16 07:34:44
    -

    *Thread Reply:* After retry, there's spotless error:

    +

    *Thread Reply:* After retry, there's spotless error:

    +········.orElse(Collections.emptyList()).stream()

    @@ -18297,7 +18303,7 @@

    Supress success

    2021-09-16 07:35:15
    -

    *Thread Reply:* I think this is due to mismatch between behavior of spotless in Java 8 and Java 11+ - which you probably used 🙂

    +

    *Thread Reply:* I think this is due to mismatch between behavior of spotless in Java 8 and Java 11+ - which you probably used 🙂

    @@ -18323,7 +18329,7 @@

    Supress success

    2021-09-16 07:40:01
    -

    *Thread Reply:* ah.. i used java11. so shall i rerun something with java8 setup as sdk?

    +

    *Thread Reply:* ah.. i used java11. so shall i rerun something with java8 setup as sdk?

    @@ -18349,7 +18355,7 @@

    Supress success

    2021-09-16 07:44:31
    -

*Thread Reply:* For spotless, you can just fix this one line 🙂

+

    *Thread Reply:* For spotless, you can just fix this one line 🙂 Though I don't guarantee that tests that run later will pass, so you might need Java 8 for later testing

    @@ -18376,7 +18382,7 @@

    Supress success

    2021-09-16 08:04:36
    -

    *Thread Reply:* yup looks better now

    +

    *Thread Reply:* yup looks better now

    @@ -18402,7 +18408,7 @@

    Supress success

    2021-09-16 08:04:41
    -

    *Thread Reply:* thanks

    +

    *Thread Reply:* thanks

    @@ -18428,7 +18434,7 @@

    Supress success

    2021-09-16 14:27:02
    -

    *Thread Reply:* will somebody please review my PR? had to already adjust due to updates on same test class 🙂

    +

    *Thread Reply:* will somebody please review my PR? had to already adjust due to updates on same test class 🙂

    @@ -18484,7 +18490,7 @@

    Supress success

    2021-09-16 20:37:27
    -

    *Thread Reply:* @Thomas Fredriksen would love any feedback

    +

    *Thread Reply:* @Thomas Fredriksen would love any feedback

    @@ -18510,7 +18516,7 @@

    Supress success

    2021-09-17 04:21:13
    -

    *Thread Reply:* nicely done! As we discussed in another thread - the way you have implemented lineage using FlowRunner and TaskRunner is likely the best way to do this. Let me know if you need any help, I would love to see this PR get merged!

    +

    *Thread Reply:* nicely done! As we discussed in another thread - the way you have implemented lineage using FlowRunner and TaskRunner is likely the best way to do this. Let me know if you need any help, I would love to see this PR get merged!

    @@ -18536,7 +18542,7 @@

    Supress success

    2021-09-17 07:28:33
    -

    *Thread Reply:* Hey @Brad, it looks great!

    +

    *Thread Reply:* Hey @Brad, it looks great!

    I've seen you're using task_qualified_name to name datasets and I don't think it's the right way. I'd take a look at naming conventions here: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md

    @@ -18567,7 +18573,7 @@

    Supress success

    2021-09-17 08:12:50
    -

    *Thread Reply:* Hey @Maciej Obuchowski thanks for the feedback. Yep the naming was a bit of a placeholder. Open to any recommendations.. I think things like dbt or pyspark are straight forward (we could add special handling for tasks like that) but what about regular transformation type tasks that run in a scheduler? Do you have any naming preference? Say I just had some pandas transform task in prefect for example

    +

    *Thread Reply:* Hey @Maciej Obuchowski thanks for the feedback. Yep the naming was a bit of a placeholder. Open to any recommendations.. I think things like dbt or pyspark are straight forward (we could add special handling for tasks like that) but what about regular transformation type tasks that run in a scheduler? Do you have any naming preference? Say I just had some pandas transform task in prefect for example

    @@ -18593,7 +18599,7 @@

    Supress success

    2021-09-17 08:28:04
    -

    *Thread Reply:* First of all, not all tasks are producing and consuming datasets. For example, I wouldn't expect any of the Github tasks to have any datasets.

    +

    *Thread Reply:* First of all, not all tasks are producing and consuming datasets. For example, I wouldn't expect any of the Github tasks to have any datasets.

Second, in Airflow we have a concept of Extractor where you can write specialized code to expose datasets. For example, for BigQuery we extract datasets from the query plan. Now, I'm not sure if this concept would translate well to Prefect - but if yes, then we have some helpers inside the openlineage common library that could be reused. Also, this way allows emitting additional facets, some of which are really useful - like query statistics for BigQuery, and data quality tests for dbt.
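To give a sense of the shape of such an extractor, a hypothetical sketch - the base-class module, metadata type, and registration method have changed names across openlineage-airflow versions, so treat every identifier here as an assumption:

```
from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata
from openlineage.client.run import Dataset


class MyOperatorExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> list:
        # The operator(s) this extractor knows how to inspect.
        return ["MyOperator"]

    def extract(self) -> TaskMetadata:
        # Inspect self.operator and report the datasets it reads/writes.
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            inputs=[Dataset(namespace="postgres://host:5432", name="public.src")],
            outputs=[Dataset(namespace="postgres://host:5432", name="public.dst")],
        )
```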

    @@ -18623,7 +18629,7 @@

    Supress success

    2021-09-19 23:03:14
    -

    *Thread Reply:* You've raised some good points @Maciej Obuchowski - I might have been thinking about this integration in slightly the wrong way. I think based on your comments I'll refactor some of the code to hook into the Results object in prefect (The Result object is the way in which data is serialized and persisted).

    +

    *Thread Reply:* You've raised some good points @Maciej Obuchowski - I might have been thinking about this integration in slightly the wrong way. I think based on your comments I'll refactor some of the code to hook into the Results object in prefect (The Result object is the way in which data is serialized and persisted).

> Now, I'm not sure if this concept would translate well to Prefect - but if yes, then we have some helpers inside openlineage common library that could be reused
This definitely applies to prefect - similar tasks exist in prefect and we should definitely leverage the common library in this case.

    @@ -18662,7 +18668,7 @@

    Supress success

    2021-09-20 06:30:51
    -

*Thread Reply:* > Yeah I agree with this. I'd like to make it as easy as possible to opt-in, but I think you're right that there needs to be some hooks for user defined lineage. I'll think about this a little more.

+

*Thread Reply:* > Yeah I agree with this. I'd like to make it as easy as possible to opt-in, but I think you're right that there needs to be some hooks for user defined lineage. I'll think about this a little more.
First version of an integration doesn't have to be perfect. In particular, not handling this use case would be okay, since it does not lock us into some particular way of doing it later.

> My initial thoughts here were that it would still be good to have lineage as these tasks do have side effects, and downstream consumers of the lineage data might want to know about these tasks. However I don't have a good feeling yet how best to do this, so I'm going to park those thoughts for now.

@@ -18694,7 +18700,7 @@

    Supress success

    2021-09-23 17:27:49
    -

*Thread Reply:* > Won’t existence of an event be enough? After all, we’ll still have it despite it not having any input and output datasets.

+

*Thread Reply:* > Won’t existence of an event be enough? After all, we’ll still have it despite it not having any input and output datasets.
Duh, yep you’re right @Maciej Obuchowski, I’m overthinking this. I’m going to clean this up based on your comments

    @@ -18721,7 +18727,7 @@

    Supress success

    2021-10-06 03:39:28
    -

    *Thread Reply:* Hi @Brad. How will this integration work for Prefect flows running in Prefect Cloud or on Prefect Server?

    +

    *Thread Reply:* Hi @Brad. How will this integration work for Prefect flows running in Prefect Cloud or on Prefect Server?

    @@ -18747,7 +18753,7 @@

    Supress success

    2021-10-06 03:40:44
    -

    *Thread Reply:* Hi @Thomas Fredriksen - it'll relate to the agent actually - you'll need to pass the flow runner class to the agent when running

    +

    *Thread Reply:* Hi @Thomas Fredriksen - it'll relate to the agent actually - you'll need to pass the flow runner class to the agent when running

    @@ -18773,7 +18779,7 @@

    Supress success

    2021-10-06 03:48:14
    -

    *Thread Reply:* nice!

    +

    *Thread Reply:* nice!

    @@ -18799,7 +18805,7 @@

    Supress success

    2021-10-06 03:48:54
    -

    *Thread Reply:* Unfortunately I've been a little busy the past week, and I will be for the rest of this week

    +

    *Thread Reply:* Unfortunately I've been a little busy the past week, and I will be for the rest of this week

    @@ -18825,7 +18831,7 @@

    Supress success

    2021-10-06 03:49:09
    -

    *Thread Reply:* but I do plan to pick this up next week

    +

    *Thread Reply:* but I do plan to pick this up next week

    @@ -18851,7 +18857,7 @@

    Supress success

    2021-10-06 03:49:23
    -

    *Thread Reply:* (the additional changes I mention above)

    +

    *Thread Reply:* (the additional changes I mention above)

    @@ -18877,7 +18883,7 @@

    Supress success

    2021-10-06 03:50:08
    -

    *Thread Reply:* looking forward to it 🙂 let me know if you need any help!

    +

    *Thread Reply:* looking forward to it 🙂 let me know if you need any help!

    @@ -18903,7 +18909,7 @@

    Supress success

    2021-10-06 03:50:34
    -

    *Thread Reply:* yeah when I get this next lot of stuff in - I'd love for people to test it out

    +

    *Thread Reply:* yeah when I get this next lot of stuff in - I'd love for people to test it out

    @@ -18963,7 +18969,7 @@

    Supress success

    2021-09-20 19:18:53
    -

*Thread Reply:* I think you can refer to https://openlineage.io/

    +

*Thread Reply:* I think you can refer to https://openlineage.io/

    openlineage.io
    @@ -19132,7 +19138,7 @@

    Supress success

    2021-09-24 05:15:31
    -

    *Thread Reply:* It should be really easy to add additional databases. Basically, we'd need to know how to get namespace for that database: https://github.com/OpenLineage/OpenLineage/blob/main/integration/common/openlineage/common/provider/dbt.py#L467

    +

    *Thread Reply:* It should be really easy to add additional databases. Basically, we'd need to know how to get namespace for that database: https://github.com/OpenLineage/OpenLineage/blob/main/integration/common/openlineage/common/provider/dbt.py#L467

    The first step should be to add SQLServer or Dremio to the dataset naming schema here https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md

    Supress success
    2021-10-04 16:22:59
    -

*Thread Reply:* Thank you @Maciej Obuchowski,

+

*Thread Reply:* Thank you @Maciej Obuchowski,
I tried to give it a try but without success yet. Not sure where I am supposed to add the sqlserver naming schema... If you have any documentation that I could read I would be glad =) Many thanks

    @@ -19237,7 +19243,7 @@

    Supress success

    2021-10-07 15:13:43
    -

    *Thread Reply:* This would be adding a paragraph similar to this one: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md#snowflake

    +

    *Thread Reply:* This would be adding a paragraph similar to this one: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md#snowflake

    @@ -19472,7 +19478,7 @@

    Supress success

    2021-10-07 15:14:30
    -

*Thread Reply:* Snowflake

+

*Thread Reply:* Snowflake
See: Object Identifiers — Snowflake Documentation
Datasource hierarchy:
• account name

@@ -19541,7 +19547,7 @@

    Supress success

    2021-09-24 07:06:20
    -

*Thread Reply:* For clarity on the provenance vs lineage point (in case I'm using those terms incorrectly...)

    +

*Thread Reply:* For clarity on the provenance vs lineage point (in case I'm using those terms incorrectly...)

    Our platform performs automated enrichment and processing of data. In doing so, we often make calls to functions or out to other data services (such as APIs, or SELECTs against databases). We capture the inputs that pass to these, along with the outputs. (And, if the input is derived from other outputs, we capture the full chain, right back to the root).

    @@ -19571,7 +19577,7 @@

    Supress success

    2021-09-24 08:47:35
    -

    *Thread Reply:* Not sure I understand you right, but are you interested in tracking individual API calls, and for example, values of some parameters passed for one call?

    +

    *Thread Reply:* Not sure I understand you right, but are you interested in tracking individual API calls, and for example, values of some parameters passed for one call?

    @@ -19597,7 +19603,7 @@

    Supress success

    2021-09-24 08:51:16
    -

    *Thread Reply:* I guess that's not in OpenLineage scope, as we're interested more in tracking metadata for whole datasets. But I might be wrong, some other people might chime in.

    +

    *Thread Reply:* I guess that's not in OpenLineage scope, as we're interested more in tracking metadata for whole datasets. But I might be wrong, some other people might chime in.

    We could of course model this situation, but that would capture for example schema of those parameters. Not their values.

    @@ -19625,7 +19631,7 @@

    Supress success

    2021-09-24 08:52:16
    -

    *Thread Reply:* I think this might be better suited for https://opentelemetry.io/

    +

    *Thread Reply:* I think this might be better suited for https://opentelemetry.io/

    @@ -19651,7 +19657,7 @@

    Supress success

    2021-09-24 10:55:54
    -

*Thread Reply:* Kinda, but not really. Telemetry data is metadata about the API calls. We have that, but it's not interesting to our customers. It's the metadata about the data that Vyne provides that we want to expose.

    +

*Thread Reply:* Kinda, but not really. Telemetry data is metadata about the API calls. We have that, but it's not interesting to our customers. It's the metadata about the data that Vyne provides that we want to expose.

    Our customers use Vyne to fetch data from lots of different sources. Eg:

    @@ -19685,7 +19691,7 @@

    Supress success

    2021-09-30 15:10:51
    -

    *Thread Reply:* The core OpenLineage model is documented at https://github.com/OpenLineage/OpenLineage/#core-model . The model is really focused on Jobs and Datasets. Jobs have Runs which have start and end times (typically scheduled start/end times as well) and read from and/or write to the target datasets. If your transformation chain fits within that model, then I think you can definitely record and share the lineage information with your customers. The existing implementations are all focused on batch data access, though streaming should be possible to capture as well

    +

    *Thread Reply:* The core OpenLineage model is documented at https://github.com/OpenLineage/OpenLineage/#core-model . The model is really focused on Jobs and Datasets. Jobs have Runs which have start and end times (typically scheduled start/end times as well) and read from and/or write to the target datasets. If your transformation chain fits within that model, then I think you can definitely record and share the lineage information with your customers. The existing implementations are all focused on batch data access, though streaming should be possible to capture as well

    @@ -19740,7 +19746,7 @@

    Supress success

    2021-09-29 13:46:59
    -

    *Thread Reply:* Hello @Drew Bittenbender!

    +

    *Thread Reply:* Hello @Drew Bittenbender!

    For your two first questions:

    @@ -19771,7 +19777,7 @@

    Supress success

    2021-09-29 13:49:41
    -

    *Thread Reply:* Thank you @Faouzi. Is there any documentation/best practices to write your own extractor, or is it "read the code"? We use the Python, Docker and SSH operators a lot. Maybe those don't fit into the lineage paradigm well, but want to give it a shot

    +

    *Thread Reply:* Thank you @Faouzi. Is there any documentation/best practices to write your own extractor, or is it "read the code"? We use the Python, Docker and SSH operators a lot. Maybe those don't fit into the lineage paradigm well, but want to give it a shot

    @@ -19797,7 +19803,7 @@

    Supress success

    2021-09-29 13:52:16
    -

*Thread Reply:* To the best of my knowledge there is no documentation to guide you through the design of your own extractor, so yes, we need to read the code. Here's a link where you can see how they did it for the postgres extractor and others: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow/openlineage/airflow/extractors

    +

*Thread Reply:* To the best of my knowledge there is no documentation to guide you through the design of your own extractor, so yes, we need to read the code. Here's a link where you can see how they did it for the postgres extractor and others: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow/openlineage/airflow/extractors

    @@ -19827,7 +19833,7 @@

    Supress success

    2021-09-30 05:08:53
    -

    *Thread Reply:* I think in case of "bring your own code" operators like Python or Docker ones, it might be better to use lineage_run_id macro and use openlineage-python library inside, instead of implementing extractor.

    +

    *Thread Reply:* I think in case of "bring your own code" operators like Python or Docker ones, it might be better to use lineage_run_id macro and use openlineage-python library inside, instead of implementing extractor.

    @@ -19853,7 +19859,7 @@

    Supress success

    2021-09-30 15:14:47
    -

    *Thread Reply:* I think @Maciej Obuchowski is right here. The airflow integration will create the parent jobs, but to get the dataset input/output links, it's best to do that directly from the python/docker scripts. If you report the parent run id, Marquez will link the jobs together correctly

    +

    *Thread Reply:* I think @Maciej Obuchowski is right here. The airflow integration will create the parent jobs, but to get the dataset input/output links, it's best to do that directly from the python/docker scripts. If you report the parent run id, Marquez will link the jobs together correctly
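A hedged sketch of what "reporting the parent run id" can look like from inside such a python script - the namespace, job name, and env-var hand-off are invented; ParentRunFacet.create is the openlineage-python helper:

```
import os
from uuid import uuid4

from openlineage.client.facet import ParentRunFacet
from openlineage.client.run import Run

# Hypothetical hand-off: the Airflow task's run id, passed into the script
# (e.g. via the lineage_run_id macro mentioned above).
airflow_run_id = os.environ["AIRFLOW_LINEAGE_RUN_ID"]

parent = ParentRunFacet.create(
    runId=airflow_run_id,
    namespace="my-airflow-instance",  # job namespace of the Airflow DAG
    name="my_dag.my_task",
)

# Attach the parent facet so Marquez links this run under the Airflow job.
child_run = Run(runId=str(uuid4()), facets={"parent": parent})
```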

    @@ -19879,7 +19885,7 @@

    Supress success

    2021-10-07 15:09:55
    -

*Thread Reply:* To clarify on what airflow operators are supported out of the box:

+

*Thread Reply:* To clarify on what airflow operators are supported out of the box:
• postgres
• bigquery
• snowflake

@@ -19992,7 +19998,7 @@

    Supress success

    2021-10-07 15:03:40
    -

    *Thread Reply:* OpenLineage doesn’t have field level lineage yet. Here is the proposal for adding it: https://github.com/OpenLineage/OpenLineage/issues/148

    +

    *Thread Reply:* OpenLineage doesn’t have field level lineage yet. Here is the proposal for adding it: https://github.com/OpenLineage/OpenLineage/issues/148

    @@ -20141,7 +20147,7 @@

    Supress success

    2021-10-07 15:04:36
    -

    *Thread Reply:* Those two specs look compatible, so Datahub should be able to consume this lineage metadata in the future

    +

    *Thread Reply:* Those two specs look compatible, so Datahub should be able to consume this lineage metadata in the future

    @@ -20197,7 +20203,7 @@

    Supress success

    2021-10-04 15:38:47
    -

*Thread Reply:* Hi! Airflow 2.x is currently in development - you can follow along with the progress here:

+

    *Thread Reply:* Hi! Airflow 2.x is currently in development - you can follow along with the progress here: https://github.com/OpenLineage/OpenLineage/issues/205

    @@ -20264,7 +20270,7 @@

    Supress success

    2021-10-05 03:01:54
    -

    *Thread Reply:* Thank you for your reply!

    +

    *Thread Reply:* Thank you for your reply!

    @@ -20290,7 +20296,7 @@

    Supress success

    2021-10-07 15:02:23
    -

*Thread Reply:* There should be a first version of Airflow 2.X support soon: https://github.com/OpenLineage/OpenLineage/pull/305

+

*Thread Reply:* There should be a first version of Airflow 2.X support soon: https://github.com/OpenLineage/OpenLineage/pull/305
We’re labelling it experimental because the config step might change as discussions in the airflow github evolve. It will track successful jobs in its current state.

    @@ -20382,7 +20388,7 @@

    Supress success

    2021-10-05 04:18:42
    -

    *Thread Reply:* Take a look at this: https://docs.getdbt.com/docs/dbt-cloud/dbt-cloud-api/metadata/metadata-overview

    +

    *Thread Reply:* Take a look at this: https://docs.getdbt.com/docs/dbt-cloud/dbt-cloud-api/metadata/metadata-overview

    docs.getdbt.com
    @@ -20437,7 +20443,7 @@

    Supress success

    2021-10-07 14:58:24
    -

    *Thread Reply:* @SAM Let us know of your progress.

    +

    *Thread Reply:* @SAM Let us know of your progress.

    @@ -20509,7 +20515,7 @@

    Supress success

    2021-10-05 16:41:30
    -

    *Thread Reply:* Hey, can you create ticket in OpenLineage repository? FWIW Redshift is very similar to postgres, so supporting it won't be hard.

    +

    *Thread Reply:* Hey, can you create ticket in OpenLineage repository? FWIW Redshift is very similar to postgres, so supporting it won't be hard.

    @@ -20535,7 +20541,7 @@

    Supress success

    2021-10-05 16:43:39
    -

*Thread Reply:* Hey @Maciej Obuchowski 😊

+

    *Thread Reply:* Hey @Maciej Obuchowski 😊 Yep, will do now! Thanks!

    @@ -20562,7 +20568,7 @@

    Supress success

    2021-10-05 16:46:26
    -

    *Thread Reply:* Well...will do tomorrow morning 😅

    +

    *Thread Reply:* Well...will do tomorrow morning 😅

    @@ -20588,7 +20594,7 @@

    Supress success

    2021-10-06 03:03:16
    -

    *Thread Reply:* Here’s the issue: https://github.com/OpenLineage/OpenLineage/issues/318

    +

    *Thread Reply:* Here’s the issue: https://github.com/OpenLineage/OpenLineage/issues/318

    @@ -20656,7 +20662,7 @@

    Supress success

    2021-10-07 14:51:08
    -

    *Thread Reply:* Thanks a lot. I pulled it in the current project.

    +

    *Thread Reply:* Thanks a lot. I pulled it in the current project.

    @@ -20686,7 +20692,7 @@

    Supress success

    2021-10-08 05:48:28
    -

    *Thread Reply:* @Julien Le Dem @Maciej Obuchowski I’m not familiar with dbt-ol codebase, but I’m willing to help on this if you guys can give me a bit of guidance 😅

    +

    *Thread Reply:* @Julien Le Dem @Maciej Obuchowski I’m not familiar with dbt-ol codebase, but I’m willing to help on this if you guys can give me a bit of guidance 😅

    @@ -20712,7 +20718,7 @@

    Supress success

    2021-10-08 05:53:05
    -

    *Thread Reply:* @ale can you help us define naming schema for redshift, as we have for other databases? https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md

    +

    *Thread Reply:* @ale can you help us define naming schema for redshift, as we have for other databases? https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md

    @@ -20738,7 +20744,7 @@

    Supress success

    2021-10-08 05:53:21
    -

    *Thread Reply:* Sure!

    +

    *Thread Reply:* Sure!

    @@ -20764,7 +20770,7 @@

    Supress success

    2021-10-08 05:54:21
    -

    *Thread Reply:* will work on this today and I’ll try to submit a PR by EOD

    +

    *Thread Reply:* will work on this today and I’ll try to submit a PR by EOD

    @@ -20790,7 +20796,7 @@

    Supress success

    2021-10-08 06:36:12
    -

    *Thread Reply:* There you go https://github.com/OpenLineage/OpenLineage/pull/324

    +

    *Thread Reply:* There you go https://github.com/OpenLineage/OpenLineage/pull/324

    @@ -20842,7 +20848,7 @@

    Supress success

    2021-10-08 06:39:35
    -

*Thread Reply:* Host would be something like

+

*Thread Reply:* Host would be something like
examplecluster.<XXXXXXXXXXXX>.us-west-2.redshift.amazonaws.com
right?

    @@ -20870,7 +20876,7 @@

    Supress success

    2021-10-08 07:13:51
    -

    *Thread Reply:* Yep, let me update the PR

    +

    *Thread Reply:* Yep, let me update the PR

    @@ -20896,7 +20902,7 @@

    Supress success

    2021-10-08 07:27:42
    -

    *Thread Reply:* Done

    +

    *Thread Reply:* Done

    @@ -20922,7 +20928,7 @@

    Supress success

    2021-10-08 07:31:40
    -

    *Thread Reply:* 🙌

    +

    *Thread Reply:* 🙌

    @@ -20948,7 +20954,7 @@

    Supress success

    2021-10-08 07:35:30
    -

    *Thread Reply:* If you want to look at dbt integration itself, there are two things:

    +

    *Thread Reply:* If you want to look at dbt integration itself, there are two things:

    We need to determine how Redshift adapter reports metrics https://github.com/OpenLineage/OpenLineage/blob/610a687bf69df2b52ec4ac4da80b4a05580e8d32/integration/common/openlineage/common/provider/dbt.py#L412

    @@ -21006,7 +21012,7 @@

    Supress success

    2021-10-08 08:33:31
    -

*Thread Reply:* I figured out how to generate the namespace.

+

*Thread Reply:* I figured out how to generate the namespace.
But I can’t understand which of the JSON files is inspected for metrics. Is it run_results.json?

    @@ -21033,7 +21039,7 @@

    Supress success

    2021-10-08 09:48:50
    -

    *Thread Reply:* yes, run_results.json - it's different in bigquery and snowflake, so I presume it's different in redshift too

    +

    *Thread Reply:* yes, run_results.json - it's different in bigquery and snowflake, so I presume it's different in redshift too

    @@ -21059,7 +21065,7 @@

    Supress success

    2021-10-08 11:02:32
    -

    *Thread Reply:* Ok thanks!

    +

    *Thread Reply:* Ok thanks!

    @@ -21085,7 +21091,7 @@

    Supress success

    2021-10-08 11:11:57
    -

    *Thread Reply:* Should be stats:rows:value

    +

    *Thread Reply:* Should be stats:rows:value

    @@ -21111,7 +21117,7 @@

    Supress success

    2021-10-08 11:19:59
    -

*Thread Reply:* Regarding namespace: if env_var is used in profiles.yml, how is this handled now?

    +

*Thread Reply:* Regarding namespace: if env_var is used in profiles.yml, how is this handled now?

    @@ -21137,7 +21143,7 @@

    Supress success

    2021-10-08 11:44:50
    -

    *Thread Reply:* Well, it isn't. This is relevant only if you passed cluster hostname this way, right?

    +

    *Thread Reply:* Well, it isn't. This is relevant only if you passed cluster hostname this way, right?

    @@ -21163,7 +21169,7 @@

    Supress success

    2021-10-08 11:53:52
    -

    *Thread Reply:* Exactly

    +

    *Thread Reply:* Exactly

    @@ -21189,7 +21195,7 @@

    Supress success

    2021-10-11 07:10:38
    -

*Thread Reply:* If you think it makes sense, I can submit a PR to handle dbt profiles with env_var

    +

*Thread Reply:* If you think it makes sense, I can submit a PR to handle dbt profiles with env_var

    @@ -21215,7 +21221,7 @@

    Supress success

    2021-10-11 07:18:01
    -

    *Thread Reply:* Do you want to run jinja on the dbt profile?

    +

    *Thread Reply:* Do you want to run jinja on the dbt profile?

    @@ -21241,7 +21247,7 @@

    Supress success

    2021-10-11 07:20:18
    -

    *Thread Reply:* Theoretically, we'd need to run it also on dbt_project.yml , but we only take target path and profile name from it.

    +

    *Thread Reply:* Theoretically, we'd need to run it also on dbt_project.yml , but we only take target path and profile name from it.

    @@ -21267,7 +21273,7 @@

    Supress success

    2021-10-11 07:20:32
    -

    *Thread Reply:* The env_var syntax in the profile is quite simple, I was thinking of extracting the env var name using re and then retrieving the value from os

    +

    *Thread Reply:* The env_var syntax in the profile is quite simple, I was thinking of extracting the env var name using re and then retrieving the value from os

    @@ -21293,7 +21299,7 @@

    Supress success

    2021-10-11 07:23:59
    -

*Thread Reply:* It would work, but we can actually use jinja - if you're using dbt, it's already included.

+

*Thread Reply:* It would work, but we can actually use jinja - if you're using dbt, it's already included.
The method is pretty simple:
```
@contextmember
@staticmethod

@@ -21336,7 +21342,7 @@

    Supress success

    2021-10-11 07:25:07
    -

*Thread Reply:* Oh cool!

+

    *Thread Reply:* Oh cool! I will definitely use this one!

    @@ -21363,7 +21369,7 @@

    Supress success

    2021-10-11 07:25:09
    -

    *Thread Reply:* We'd be sure that our implementation matches dbt's one, right? Also, you'd support default method for free

    +

    *Thread Reply:* We'd be sure that our implementation matches dbt's one, right? Also, you'd support default method for free

    @@ -21389,7 +21395,7 @@

    Supress success

    2021-10-11 07:26:34
    -

*Thread Reply:* So this env_var method is defined in dbt and not in the OpenLineage codebase, right?

    +

*Thread Reply:* So this env_var method is defined in dbt and not in the OpenLineage codebase, right?

    @@ -21415,7 +21421,7 @@

    Supress success

    2021-10-11 07:27:01
    -

    *Thread Reply:* yes

    +

    *Thread Reply:* yes

    @@ -21441,7 +21447,7 @@

    Supress success

    2021-10-11 07:27:14
    -

    *Thread Reply:* dbt is on Apache license 🙂

    +

    *Thread Reply:* dbt is on Apache license 🙂

    @@ -21467,7 +21473,7 @@

    Supress success

    2021-10-11 07:28:06
    -

    *Thread Reply:* Should we import dbt package and use the method or should we just copy/paste the method inside OpenLineage codebase?

    +

    *Thread Reply:* Should we import dbt package and use the method or should we just copy/paste the method inside OpenLineage codebase?

    @@ -21493,7 +21499,7 @@

    Supress success

    2021-10-11 07:28:28
    -

    *Thread Reply:* I’m asking for guidance here 😊

    +

    *Thread Reply:* I’m asking for guidance here 😊

    @@ -21519,7 +21525,7 @@

    Supress success

    2021-10-11 07:34:44
    -

    *Thread Reply:* I think we should just do basic jinja template rendering in our code like in the quick example: https://realpython.com/primer-on-jinja-templating/#quick-examples

    +

    *Thread Reply:* I think we should just do basic jinja template rendering in our code like in the quick example: https://realpython.com/primer-on-jinja-templating/#quick-examples

    just with the env_var method passed to the render method 🙂

    @@ -21547,7 +21553,7 @@

    Supress success

    2021-10-11 07:37:05
    -

*Thread Reply:* basically, here in the code we should read the file, do the jinja render, and load yaml from string instead of straight from file

+

    *Thread Reply:* basically, here in the code we should read the file, do the jinja render, and load yaml from string instead of straight from file https://github.com/OpenLineage/OpenLineage/blob/610a687bf69df2b52ec4ac4da80b4a05580e8d32/integration/common/openlineage/common/provider/dbt.py#L176

    @@ -21599,7 +21605,7 @@

    Supress success

    2021-10-11 07:38:53
    -

*Thread Reply:* ok, got it.

+

    *Thread Reply:* ok, got it. Will try to implement following your suggestions. Thanks @Maciej Obuchowski 🙌

    @@ -21631,7 +21637,7 @@

    Supress success

    2021-10-11 08:36:13
    -

    *Thread Reply:* We need to:

    +

    *Thread Reply:* We need to:

    1. load the template profile from the profile.yml
2. replace any env vars we found
For the first step, we can use jinja2.Template

@@ -21662,7 +21668,7 @@

      Supress success

    2021-10-11 08:43:06
    -

*Thread Reply:* The dbt method implements that:

+

*Thread Reply:* The dbt method implements that:
```
@contextmember
@staticmethod
def env_var(var: str, default: Optional[str] = None) -> str:

@@ -21704,7 +21710,7 @@

    Supress success

    2021-10-11 08:45:54
    -

*Thread Reply:* Ok, but I need to pass var to the env_var method.

+

    *Thread Reply:* Ok, but I need to pass var to the env_var method. And to pass the var value, I need to look into the loaded Template and search for env var names…

    @@ -21731,7 +21737,7 @@

    Supress success

    2021-10-11 08:46:54
    -

    *Thread Reply:* that's what jinja does - you're passing function to jinja render, and it's calling it itself

    +

    *Thread Reply:* that's what jinja does - you're passing function to jinja render, and it's calling it itself

    @@ -21757,7 +21763,7 @@

    Supress success

    2021-10-11 08:47:45
    -

    *Thread Reply:* you can try the quick example from here, but just pass the env_var method (slightly adjusted - as a standalone function and without undefined error) and call it inside the template: https://realpython.com/primer-on-jinja-templating/#quick-examples

    +

    *Thread Reply:* you can try the quick example from here, but just pass the env_var method (slightly adjusted - as a standalone function and without undefined error) and call it inside the template: https://realpython.com/primer-on-jinja-templating/#quick-examples

    @@ -21783,7 +21789,7 @@

    Supress success

    2021-10-11 08:51:19
    -

    *Thread Reply:* Ok, will try

    +

    *Thread Reply:* Ok, will try

    @@ -21809,7 +21815,7 @@

    Supress success

    2021-10-11 09:37:49
    -

*Thread Reply:* I’m trying to run

+

*Thread Reply:* I’m trying to run
pip install -e ".[dev]"
so that I can test my changes, but I get
ERROR: Could not find a version that satisfies the requirement openlineage-integration-common[dbt]==0.2.3 (from openlineage-dbt[dev]) (from versions: 0.0.1rc7, 0.0.1rc8, 0.0.1, 0.1.0rc5, 0.1.0, 0.2.0, 0.2.1, 0.2.2)

@@ -21840,7 +21846,7 @@

    Supress success

    2021-10-11 09:41:47
    -

    *Thread Reply:* can you try installing it manually?

    +

    *Thread Reply:* can you try installing it manually?

    pip install openlineage-integration-common[dbt]==0.2.3

    @@ -21868,7 +21874,7 @@

    Supress success

    2021-10-11 09:42:13
    -

    *Thread Reply:* I mean, it exists in pypi: https://pypi.org/project/openlineage-integration-common/#files

    +

    *Thread Reply:* I mean, it exists in pypi: https://pypi.org/project/openlineage-integration-common/#files

    PyPI
    @@ -21919,7 +21925,7 @@

    Supress success

    2021-10-11 09:44:57
    -

*Thread Reply:* Yep, maybe it’s our internal Pypi repo which is not synced.

+

    *Thread Reply:* Yep, maybe it’s our internal Pypi repo which is not synced. Installing from the public pypi resolved the issue

    @@ -21946,7 +21952,7 @@

    Supress success

    2021-10-11 12:04:55
    -

*Thread Reply:* Can’t seem to make env_var work as the render method of a Template 😅

    +

*Thread Reply:* Can’t seem to make env_var work as the render method of a Template 😅

    @@ -21972,7 +21978,7 @@

    Supress success

    2021-10-11 12:57:07
    -

    *Thread Reply:* try this:

    +

    *Thread Reply:* try this:

    @@ -21998,7 +22004,7 @@

    Supress success

    2021-10-11 12:57:09
    -

*Thread Reply:* ```import os

+

    *Thread Reply:* ```import os from typing import Optional from jinja2 import Template

    @@ -22045,7 +22051,7 @@

    Supress success

    2021-10-11 12:57:42
    -

*Thread Reply:* works for me:

+

*Thread Reply:* works for me:
mobuchowski@thinkpad [18:57:14] [~] -> % ENV_VAR=world python jinja_example.py
Hello world!

    @@ -22074,7 +22080,7 @@

    Supress success

    2021-10-11 16:59:13
    -

*Thread Reply:* Finally 😅

+

    *Thread Reply:* Finally 😅 https://github.com/OpenLineage/OpenLineage/pull/328

There are minimal tests for Redshift and env vars.

@@ -22133,7 +22139,7 @@

    Supress success

    2021-10-12 03:10:45
    -

*Thread Reply:* Hi @Maciej Obuchowski 😊

+

    *Thread Reply:* Hi @Maciej Obuchowski 😊 Regarding this comment https://github.com/OpenLineage/OpenLineage/pull/328#discussion_r726586564

    How can we distinguish between snowflake, bigquery and redshift in this method?

    @@ -22175,7 +22181,7 @@

    Supress success

    2021-10-12 05:35:00
    -

    *Thread Reply:* Well, why not do something like

    +

    *Thread Reply:* Well, why not do something like

```bytes = get_from_multiple_chains(... rest of stuff)

    @@ -22206,7 +22212,7 @@

    Supress success

    2021-10-12 05:36:49
    -

    *Thread Reply:* we can store adapter type in the class

    +

    *Thread Reply:* we can store adapter type in the class

    @@ -22232,7 +22238,7 @@

    Supress success

    2021-10-12 05:38:47
    -

    *Thread Reply:* well, I've looked at last commit and that's exactly what you did 👍

    +

    *Thread Reply:* well, I've looked at last commit and that's exactly what you did 👍

    @@ -22258,7 +22264,7 @@

    Supress success

    2021-10-12 05:40:35
    -

    *Thread Reply:* Now, have you tested your branch on real redshift cluster? I don't think we 100% need automated tests for that now, but would be nice to have confirmation that it works.

    +

    *Thread Reply:* Now, have you tested your branch on real redshift cluster? I don't think we 100% need automated tests for that now, but would be nice to have confirmation that it works.

    @@ -22284,7 +22290,7 @@

    Supress success

    2021-10-12 06:35:04
    -

*Thread Reply:* Not yet, but I'll try to do that this afternoon.

+

    *Thread Reply:* Not yet, but I'll try to do that this afternoon. Need to figure out how to build the lib locally, then I can use it to test with Redshift

    @@ -22311,7 +22317,7 @@

    Supress success

    2021-10-12 06:40:58
    -

    *Thread Reply:* I think pip install -e .[dbt] in common directory should be enough

    +

    *Thread Reply:* I think pip install -e .[dbt] in common directory should be enough

    @@ -22337,7 +22343,7 @@

    Supress success

    2021-10-12 09:29:13
    -

*Thread Reply:* I was able to run my local branch with my Redshift cluster and metadata is pushed to Marquez.

+

*Thread Reply:* I was able to run my local branch with my Redshift cluster and metadata is pushed to Marquez.
However, I’m not sure about the namespace. I also see exceptions in Marquez logs

    @@ -22392,7 +22398,7 @@

    Supress success

    2021-10-12 09:33:26
    -

    *Thread Reply:* namespace: well, if it matches what you put into your profile, there's not much we can do. I don't understand why you connect to redshift via host, maybe this is related to IAM?

    +

    *Thread Reply:* namespace: well, if it matches what you put into your profile, there's not much we can do. I don't understand why you connect to redshift via host, maybe this is related to IAM?

    @@ -22418,7 +22424,7 @@

    Supress success

    2021-10-12 09:44:17
    -

    *Thread Reply:* I think the marquez error is because we don't send SourceCodeLocationJobFacet

    +

    *Thread Reply:* I think the marquez error is because we don't send SourceCodeLocationJobFacet

    @@ -22444,7 +22450,7 @@

    Supress success

    2021-10-12 09:46:17
    -

*Thread Reply:* Regarding the namespace, I will check it and figure it out 😊

+

    *Thread Reply:* Regarding the namespace, I will check it and figure it out 😊 Regarding the error: in the context of this PR, is it something I should worry about or not?

    @@ -22471,7 +22477,7 @@

    Supress success

    2021-10-12 09:54:17
    -

    *Thread Reply:* I think not in the context of the PR. It certainly deserves separate issue in Marquez repository.

    +

    *Thread Reply:* I think not in the context of the PR. It certainly deserves separate issue in Marquez repository.

    @@ -22497,7 +22503,7 @@

    Supress success

    2021-10-12 10:24:38
    -

    *Thread Reply:* 👍

    +

    *Thread Reply:* 👍

    @@ -22523,7 +22529,7 @@

    Supress success

    2021-10-12 10:24:51
    -

    *Thread Reply:* Is there anything else I can do to improve the PR?

    +

    *Thread Reply:* Is there anything else I can do to improve the PR?

    @@ -22549,7 +22555,7 @@

    Supress success

    2021-10-12 10:27:44
    -

*Thread Reply:* did you figure out the namespace stuff?

+

    *Thread Reply:* did you figure out the namespace stuff? I think it's ready to be merged outside of that

    @@ -22576,7 +22582,7 @@

    Supress success

    2021-10-12 10:49:06
    -

    *Thread Reply:* Not yet

    +

    *Thread Reply:* Not yet

    @@ -22602,7 +22608,7 @@

    Supress success

    2021-10-12 10:58:07
    -

*Thread Reply:* Ok, I figured it out.

+

*Thread Reply:* Ok, I figured it out.
When running dbt locally, we connect to Redshift using an SSH tunnel. dbt runs on Docker, hence it can access the tunnel using host.docker.internal

    @@ -22630,7 +22636,7 @@

    Supress success

    2021-10-12 10:58:16
    -

    *Thread Reply:* So the namespace is correct

    +

    *Thread Reply:* So the namespace is correct

    @@ -22656,7 +22662,7 @@

    Supress success

    2021-10-12 11:04:12
    -

    *Thread Reply:* Makes sense. So, let's merge it, after DCO bot gets up again.

    +

    *Thread Reply:* Makes sense. So, let's merge it, after DCO bot gets up again.

    @@ -22682,7 +22688,7 @@

    Supress success

    2021-10-12 11:04:37
    -

    *Thread Reply:* 👍

    +

    *Thread Reply:* 👍

    @@ -22708,7 +22714,7 @@

    Supress success

    2021-10-13 05:29:48
    -

    *Thread Reply:* merged your PR 🙌

    +

    *Thread Reply:* merged your PR 🙌

    @@ -22734,7 +22740,7 @@

    Supress success

    2021-10-13 10:54:09
    -

    *Thread Reply:* 🎉

    +

    *Thread Reply:* 🎉

    @@ -22760,7 +22766,7 @@

    Supress success

    2021-10-13 12:01:20
    -

*Thread Reply:* I think I'm going to change it up a bit.

+

    *Thread Reply:* I think I'm going to change it up a bit. The problem is that we can try to render jinja everywhere, including comments. I tried to make it skip unknown methods and values here, but I think the right solution is to load the yaml, and then try to render jinja for values.

    @@ -22788,7 +22794,7 @@

    Supress success

    2021-10-13 14:27:37
    -

    *Thread Reply:* Ok sounds good to me!

    +

    *Thread Reply:* Ok sounds good to me!

    @@ -22849,7 +22855,7 @@

    Supress success

    2021-10-06 10:54:32
    -

    *Thread Reply:* Does it work with simply dbt run?

    +

    *Thread Reply:* Does it work with simply dbt run?

    @@ -22875,7 +22881,7 @@

    Supress success

    2021-10-06 10:55:51
    -

    *Thread Reply:* also, do you have dbt-snowflake installed?

    +

    *Thread Reply:* also, do you have dbt-snowflake installed?

    @@ -22901,7 +22907,7 @@

    Supress success

    2021-10-06 11:00:42
    -

    *Thread Reply:* it works with dbt run

    +

    *Thread Reply:* it works with dbt run

    @@ -22931,7 +22937,7 @@

    Supress success

    2021-10-06 11:01:22
    -

    *Thread Reply:* no i haven’t installed dbt-snowflake

    +

    *Thread Reply:* no i haven’t installed dbt-snowflake

    @@ -22957,7 +22963,7 @@

    Supress success

    2021-10-06 12:04:19
    -

*Thread Reply:* what the dbt says - the snowflake profile with dev target - is that what you meant to run, or was it something else?

    +

*Thread Reply:* what the dbt says - the snowflake profile with dev target - is that what you meant to run, or was it something else?

    @@ -22983,7 +22989,7 @@

    Supress success

    2021-10-06 12:04:46
    -

    *Thread Reply:* it feels very weird to me, since the dbt-ol script just runs dbt run underneath

    +

    *Thread Reply:* it feels very weird to me, since the dbt-ol script just runs dbt run underneath

    @@ -23009,7 +23015,7 @@

    Supress success

    2021-10-06 12:19:27
    -

*Thread Reply:* this is my profiles.yml file:
+

*Thread Reply:* this is my profiles.yml file:
```
snowflake:
  target: dev
  outputs:
@@ -23054,7 +23060,7 @@

    Supress success

    2021-10-06 12:26:39
    -

*Thread Reply:* Yes, it looks like everything is okay on your side...

+

*Thread Reply:* Yes, it looks like everything is okay on your side...

    @@ -23080,7 +23086,7 @@

    Supress success

    2021-10-06 12:28:19
    -

*Thread Reply:* maybe I’ll restart my machine and try again

+

*Thread Reply:* maybe I’ll restart my machine and try again

    @@ -23106,7 +23112,7 @@

    Supress success

    2021-10-06 12:30:25
    -

*Thread Reply:* can you try
+

    *Thread Reply:* can you try OPENLINEAGE_URL=<http://localhost:5000> dbt-ol debug

    @@ -23133,7 +23139,7 @@

    Supress success

    2021-10-07 05:59:03
    -

*Thread Reply:* Actually I had to use a venv; that fixed the above issue. However, I ran into another problem: no jobs / datasets found in Marquez:

+

*Thread Reply:* Actually I had to use a venv; that fixed the above issue. However, I ran into another problem: no jobs / datasets found in Marquez:

    @@ -23177,7 +23183,7 @@

    Supress success

    2021-10-07 06:00:28
    -

*Thread Reply:* Good that you fixed that one 🙂 Regarding the last one, I found it independently yesterday, and a PR fixing it is already waiting for review: https://github.com/OpenLineage/OpenLineage/pull/322

+

*Thread Reply:* Good that you fixed that one 🙂 Regarding the last one, I found it independently yesterday, and a PR fixing it is already waiting for review: https://github.com/OpenLineage/OpenLineage/pull/322

    @@ -23230,7 +23236,7 @@

    Supress success

    2021-10-07 06:00:46
    -

    *Thread Reply:* oh, thanks a lot

    +

    *Thread Reply:* oh, thanks a lot

    @@ -23256,7 +23262,7 @@

    Supress success

    2021-10-07 14:50:01
    -

    *Thread Reply:* There will be a release soon: https://openlineage.slack.com/archives/C01CK9T7HKR/p1633631825147900

    +

    *Thread Reply:* There will be a release soon: https://openlineage.slack.com/archives/C01CK9T7HKR/p1633631825147900

    @@ -23327,7 +23333,7 @@

    Supress success

    2021-10-07 23:23:26
    -

*Thread Reply:* Hi,
+

    *Thread Reply:* Hi, openlineage-dbt==0.2.3 worked, thanks a lot for the quick fix.

    @@ -23389,7 +23395,7 @@

    Supress success

    2021-10-07 07:46:59
    -

    *Thread Reply:*

    +

    *Thread Reply:*

    @@ -23424,7 +23430,7 @@

    Supress success

    2021-10-07 09:51:40
    -

*Thread Reply:* So, as a quick fix, shutting down and re-starting the docker container resets everything.
+

*Thread Reply:* So, as a quick fix, shutting down and re-starting the docker container resets everything.
./docker/up.sh

    @@ -23455,7 +23461,7 @@

    Supress success

    2021-10-07 12:28:25
    -

*Thread Reply:* I guess that it's the easiest way for now. There should be an API for that.

+

*Thread Reply:* I guess that it's the easiest way for now. There should be an API for that.

    @@ -23481,7 +23487,7 @@

    Supress success

    2021-10-07 14:09:50
    -

    *Thread Reply:* @Alex P Yeah, we're realizing that being able to delete metadata is becoming very important. And, as @Maciej Obuchowski mentioned, dropping your entire database is the only way currently (not ideal!). We do have an issue in the Marquez backlog to expose delete APIs: https://github.com/MarquezProject/marquez/issues/754

    +

    *Thread Reply:* @Alex P Yeah, we're realizing that being able to delete metadata is becoming very important. And, as @Maciej Obuchowski mentioned, dropping your entire database is the only way currently (not ideal!). We do have an issue in the Marquez backlog to expose delete APIs: https://github.com/MarquezProject/marquez/issues/754

    @@ -23548,7 +23554,7 @@

    Supress success

    2021-10-07 14:10:36
    -

    *Thread Reply:* A bit more discussion is needed though. Like what if a dataset is deleted, but you still want to keep track that it existed at some point? (i.e. soft vs hard deletes). But, for the case that you just want to clear metadata because you were testing things out, then yeah, that's more obvious and requires little discussion of the API upfront.

    +

    *Thread Reply:* A bit more discussion is needed though. Like what if a dataset is deleted, but you still want to keep track that it existed at some point? (i.e. soft vs hard deletes). But, for the case that you just want to clear metadata because you were testing things out, then yeah, that's more obvious and requires little discussion of the API upfront.

    @@ -23574,7 +23580,7 @@

    Supress success

    2021-10-07 14:12:52
    -

    *Thread Reply:* @Alex P I moved the delete APIs to the Marquez 0.20.0 release

    +

    *Thread Reply:* @Alex P I moved the delete APIs to the Marquez 0.20.0 release

    @@ -23600,7 +23606,7 @@

    Supress success

    2021-10-07 14:39:03
    -

    *Thread Reply:* Thanks Willy.

    +

    *Thread Reply:* Thanks Willy.

    @@ -23630,7 +23636,7 @@

    Supress success

    2021-10-07 14:48:37
    -

    *Thread Reply:* I have also updated a corresponding issue to track this in OpenLineage: https://github.com/OpenLineage/OpenLineage/issues/323

    +

    *Thread Reply:* I have also updated a corresponding issue to track this in OpenLineage: https://github.com/OpenLineage/OpenLineage/issues/323

    @@ -23762,7 +23768,7 @@

    Supress success

    2021-10-13 10:47:49
    -

    *Thread Reply:* Reminder that the meeting is today. See you soon

    +

    *Thread Reply:* Reminder that the meeting is today. See you soon

    @@ -23788,7 +23794,7 @@

    Supress success

    2021-10-13 19:49:21
    -

*Thread Reply:* The recording and notes of the meeting are now available:
+

    *Thread Reply:* The recording and notes of the meeting are now available: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting#MonthlyTSCmeeting-Oct13th2021

    @@ -23910,7 +23916,7 @@

    Supress success

    2021-10-07 14:55:35
    -

*Thread Reply:* For people following along, dbt changed the schema of its metadata, which broke the OpenLineage integration. However, we were a bit too stringent on validating the schema version (they increment it every time, even if it's backwards compatible, which it is in this case). We will fix that so that future compatible changes don't prevent the OL integration from working.

+

*Thread Reply:* For people following along, dbt changed the schema of its metadata, which broke the OpenLineage integration. However, we were a bit too stringent on validating the schema version (they increment it every time, even if it's backwards compatible, which it is in this case). We will fix that so that future compatible changes don't prevent the OL integration from working.
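A sketch of the lenient check described above (assumed for illustration, not the integration's actual code): dbt records the manifest schema version as a URL, so the check can accept any version known to be backwards compatible rather than pinning one exact version.

```python
# Illustrative only: accept any dbt manifest schema version known to be
# backwards compatible, instead of requiring one exact version.
KNOWN_COMPATIBLE = {"v2", "v3"}  # hypothetical set of accepted versions

def is_supported(dbt_schema_version: str) -> bool:
    # dbt metadata stores something like
    # "https://schemas.getdbt.com/dbt/manifest/v3.json"
    version = dbt_schema_version.rsplit("/", 1)[-1].removesuffix(".json")  # Python 3.9+
    return version in KNOWN_COMPATIBLE
```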

    @@ -23936,7 +23942,7 @@

    Supress success

    2021-10-07 16:44:28
    -

*Thread Reply:* As one of the main integrations, it would be good to connect more with the dbt community for the next releases, e.g. by testing the release candidates 👍

+

*Thread Reply:* As one of the main integrations, it would be good to connect more with the dbt community for the next releases, e.g. by testing the release candidates 👍

    Thanks for the PR

    @@ -23968,7 +23974,7 @@

    Supress success

    2021-10-07 16:46:40
    -

*Thread Reply:* Yeah, I totally agree with you. We also should be more proactive and more aware of what's coming in future dbt releases. Sorry if you were affected by this bug :ladybug:

+

*Thread Reply:* Yeah, I totally agree with you. We also should be more proactive and more aware of what's coming in future dbt releases. Sorry if you were affected by this bug :ladybug:

    @@ -23994,7 +24000,7 @@

    Supress success

    2021-10-07 18:12:22
    -

*Thread Reply:* We’ve released OpenLineage 0.2.3 with the hotfix for adding dbt v3 manifest support, see https://github.com/OpenLineage/OpenLineage/releases/tag/0.2.3

+

*Thread Reply:* We’ve released OpenLineage 0.2.3 with the hotfix for adding dbt v3 manifest support, see https://github.com/OpenLineage/OpenLineage/releases/tag/0.2.3

    You can download and install openlineage-dbt 0.2.3 with the fix using:

    @@ -24050,7 +24056,7 @@

    Supress success

    2021-10-07 21:10:36
    -

    *Thread Reply:* @Maciej Obuchowski might know

    +

    *Thread Reply:* @Maciej Obuchowski might know

    @@ -24076,7 +24082,7 @@

    Supress success

    2021-10-08 04:23:17
    -

    *Thread Reply:* @Drew Bittenbender dbt-ol always calls dbt command now, without spawning shell - so it does not have access to bash aliases.

    +

    *Thread Reply:* @Drew Bittenbender dbt-ol always calls dbt command now, without spawning shell - so it does not have access to bash aliases.

    Can you elaborate about your use case? Do you mean that dbt in your path does docker run or something like this? It still might be a problem if we won't have access to artifacts generated by dbt in target directory.

    @@ -24104,7 +24110,7 @@

    Supress success

    2021-10-08 10:59:32
    -

    *Thread Reply:* I am running on a mac and I have aliased (.zshrc) dbt to execute docker run against the fishtownanalytics docker image rather than installing dbt natively (homebrew, etc). I am doing this so that the dbt configuration is portable and reusable by others.

    +

    *Thread Reply:* I am running on a mac and I have aliased (.zshrc) dbt to execute docker run against the fishtownanalytics docker image rather than installing dbt natively (homebrew, etc). I am doing this so that the dbt configuration is portable and reusable by others.

It seems that by installing openlineage-dbt in a virtual environment, it pulls down its own version of dbt which it calls inline rather than shelling out and executing the dbt setup resident in the host system. I understand that opening a shell is a security risk so that is understandable.

    @@ -24132,7 +24138,7 @@

    Supress success

    2021-10-08 11:05:00
    -

    *Thread Reply:* It does not pull down, it just assumes that it's in the system. It would fail if it isn't.

    +

    *Thread Reply:* It does not pull down, it just assumes that it's in the system. It would fail if it isn't.

    For now I think you could build your own image based on official one, and install openlineage-dbt inside, something like:

    @@ -24164,7 +24170,7 @@

    Supress success

    2021-10-08 11:05:15
    -

    *Thread Reply:* and then pass OPENLINEAGE_URL in env while doing docker run

    +

    *Thread Reply:* and then pass OPENLINEAGE_URL in env while doing docker run

    @@ -24190,7 +24196,7 @@

    Supress success

    2021-10-08 11:06:55
    -

    *Thread Reply:* Also, to make sure that using shell would help in your case: do you bind mount your dbt directory to home? dbt-ol can't run without access to dbt's target directory, so if it's not visible in host, the only option is to have dbt-ol in container.

    +

    *Thread Reply:* Also, to make sure that using shell would help in your case: do you bind mount your dbt directory to home? dbt-ol can't run without access to dbt's target directory, so if it's not visible in host, the only option is to have dbt-ol in container.

    @@ -24263,7 +24269,7 @@

    Supress success

    2021-10-08 08:16:37
    -

*Thread Reply:* Regarding 2), the data is only visible after the next dbt-ol run - dbt docs generate does not emit events itself, but generates data that the run takes into account.

+

*Thread Reply:* Regarding 2), the data is only visible after the next dbt-ol run - dbt docs generate does not emit events itself, but generates data that the run takes into account.

    @@ -24289,7 +24295,7 @@

    Supress success

    2021-10-08 08:24:57
    -

*Thread Reply:* oh got it, since it's in default, I need to click on it and choose my dbt profile's account name. thnx

+

*Thread Reply:* oh got it, since it's in default, I need to click on it and choose my dbt profile's account name. thnx

    @@ -24324,7 +24330,7 @@

    Supress success

    2021-10-08 11:25:22
    -

*Thread Reply:* May I know why these highlighted ones don't have a schema? FYI, I used sources in dbt.

+

*Thread Reply:* May I know why these highlighted ones don't have a schema? FYI, I used sources in dbt.

    @@ -24359,7 +24365,7 @@

    Supress success

    2021-10-08 11:26:18
    -

    *Thread Reply:* Do they have it in dbt docs?

    +

    *Thread Reply:* Do they have it in dbt docs?

    @@ -24385,7 +24391,7 @@

    Supress success

    2021-10-08 11:33:59
    -

    *Thread Reply:* I prepared this yaml file, not sure this is what u asked

    +

    *Thread Reply:* I prepared this yaml file, not sure this is what u asked

    @@ -24484,7 +24490,7 @@

    Supress success

    2021-10-12 07:21:33
    -

*Thread Reply:* I don't think anything is wrong with your branch. It's also not working on mine. Maybe it's globally stuck?

+

*Thread Reply:* I don't think anything is wrong with your branch. It's also not working on mine. Maybe it's globally stuck?

    @@ -24543,7 +24549,7 @@

    Supress success

    2021-10-12 15:50:20
    -

*Thread Reply:* > Is there a way to generate OpenLineage output that contains a mapping between input and output fields?
+

*Thread Reply:* > Is there a way to generate OpenLineage output that contains a mapping between input and output fields?
OpenLineage defines discrete classes for both OpenLineage.InputDataset and OpenLineage.OutputDataset datasets. But, for clarification, are you asking:

1. If a job reads / writes to the same dataset, how can OpenLineage track which fields were used in the job’s logic as input and which fields were used to write back to the resulting output?
2. Or, if a job reads / writes from two different datasets, how can OpenLineage track which input fields were used in the job’s logic for the resulting output dataset? (i.e. column-level lineage)

@@ -24573,7 +24579,7 @@

      Supress success

    2021-10-12 15:56:18
    -

*Thread Reply:* > In Azure Databricks sources often map to ADB mount points.  We are looking for a way to translate this into source metadata in the OL output.  Is there some configuration that would make this possible, or any other suggestions?
+

*Thread Reply:* > In Azure Databricks sources often map to ADB mount points.  We are looking for a way to translate this into source metadata in the OL output.  Is there some configuration that would make this possible, or any other suggestions?
I would look into our OutputDatasetVisitors class (as a starting point), which extracts metadata from the Spark logical plan to construct a mapping from a logical plan to one or more OpenLineage.Dataset for the Spark job. But I think @Michael Collado will have a more detailed suggestion / approach to what you’re asking

    @@ -24600,7 +24606,7 @@

    Supress success

    2021-10-12 15:59:41
    -

    *Thread Reply:* are the sources mounted like local filesystem mounts? are you ending up with datasources that point to the local filesystem rather than some dbfs url? (sorry, I'm not familiar with databricks or azure at this point)

    +

    *Thread Reply:* are the sources mounted like local filesystem mounts? are you ending up with datasources that point to the local filesystem rather than some dbfs url? (sorry, I'm not familiar with databricks or azure at this point)

    @@ -24626,7 +24632,7 @@

    Supress success

    2021-10-12 16:59:38
    -

    *Thread Reply:* I think under the covers they are an os level fs mount, but it is using an ADB specific api, dbutils.fs.mount. It is using the ADB filesystem.

    +

    *Thread Reply:* I think under the covers they are an os level fs mount, but it is using an ADB specific api, dbutils.fs.mount. It is using the ADB filesystem.

    docs.microsoft.com
    @@ -24677,7 +24683,7 @@

    Supress success

    2021-10-12 17:01:23
    -

*Thread Reply:* Do you use the dbfs scheme to access the files from Spark as in the example on that page?
+

*Thread Reply:* Do you use the dbfs scheme to access the files from Spark as in the example on that page?
df = spark.read.text("dbfs:/mymount/my_file.txt")

    @@ -24704,7 +24710,7 @@

    Supress success

    2021-10-12 17:04:52
    -

    *Thread Reply:* @Willy Lulciuc In our project, @Will Johnson had generated some sample OL output from just reading in and writing out a dataset to blob storage. In the resulting output, I see the columns represented as fields under the schema element with a set represented for output and another for input. I would need the mapping of in and out columns to generate column level lineage so wondering if it is possible to get or am I just missing it somewhere? Thanks for your help!

    +

    *Thread Reply:* @Willy Lulciuc In our project, @Will Johnson had generated some sample OL output from just reading in and writing out a dataset to blob storage. In the resulting output, I see the columns represented as fields under the schema element with a set represented for output and another for input. I would need the mapping of in and out columns to generate column level lineage so wondering if it is possible to get or am I just missing it somewhere? Thanks for your help!

    @@ -24730,7 +24736,7 @@

    Supress success

    2021-10-12 17:26:35
    -

*Thread Reply:* Ahh, well currently, no, but it has been discussed and is on the OpenLineage roadmap. Here’s a proposal opened by @Julien Le Dem, column level lineage facet, that starts the discussion to add the columnLineage facet to the datasets model in order to support column-level lineage. Would be great to get your thoughts!

+

*Thread Reply:* Ahh, well currently, no, but it has been discussed and is on the OpenLineage roadmap. Here’s a proposal opened by @Julien Le Dem, column level lineage facet, that starts the discussion to add the columnLineage facet to the datasets model in order to support column-level lineage. Would be great to get your thoughts!

    @@ -24879,7 +24885,7 @@

    Supress success

    2021-10-12 17:41:41
    -

    *Thread Reply:* @Michael Collado - Databricks allows you to reference a file called /mnt/someMount/some/file/path The way you have referenced it would let you hit the file with local file system stuff like pandas / local python.

    +

    *Thread Reply:* @Michael Collado - Databricks allows you to reference a file called /mnt/someMount/some/file/path The way you have referenced it would let you hit the file with local file system stuff like pandas / local python.

    @@ -24905,7 +24911,7 @@

    Supress success

    2021-10-12 17:49:37
    -

*Thread Reply:* For column level lineage, you can add your own custom facets: Here’s an example in the Spark integration: (LogicalPlanFacet) https://github.com/OpenLineage/OpenLineage/blob/5f189a94990dad715745506c0282e16fd8[…]openlineage/spark/agent/lifecycle/SparkSQLExecutionContext.java
+

    *Thread Reply:* For column level lineage, you can add your own custom facets: Here’s an example in the Spark integration: (LogicalPlanFacet) https://github.com/OpenLineage/OpenLineage/blob/5f189a94990dad715745506c0282e16fd8[…]openlineage/spark/agent/lifecycle/SparkSQLExecutionContext.java Here is the paragraph about this in the spec: https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md#custom-facet-naming
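As a rough illustration of the custom-facet idea with the OpenLineage Python client (the facet class, name, and fields below are made up for the example; as noted just after this, the Spark integration itself did not yet expose such an extension point):

```python
import uuid
import attr
from openlineage.client.facet import BaseFacet
from openlineage.client.run import Run

# Hypothetical custom run facet; the class name and fields are illustrative.
@attr.s
class ColumnMappingRunFacet(BaseFacet):
    inputColumns = attr.ib(factory=list)
    outputColumns = attr.ib(factory=list)

# Attach the custom facet to a run under a vendor-prefixed key.
run = Run(
    runId=str(uuid.uuid4()),
    facets={"columnMapping": ColumnMappingRunFacet(["a", "b"], ["c"])},
)
```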

    @@ -24980,7 +24986,7 @@

    Supress success

    2021-10-12 17:51:24
    -

    *Thread Reply:* This example adds facets to the run, but you can also add them to the job

    +

    *Thread Reply:* This example adds facets to the run, but you can also add them to the job

    @@ -25006,7 +25012,7 @@

    Supress success

    2021-10-12 17:52:46
    -

    *Thread Reply:* unfortunately, there's not yet a way to add your own custom facets to the spark integration- there's some work on extensibility to be done

    +

    *Thread Reply:* unfortunately, there's not yet a way to add your own custom facets to the spark integration- there's some work on extensibility to be done

    @@ -25032,7 +25038,7 @@

    Supress success

    2021-10-12 17:54:07
    -

    *Thread Reply:* for the hackathon's sake, you can check out the package and just add in whatever you want

    +

    *Thread Reply:* for the hackathon's sake, you can check out the package and just add in whatever you want

    @@ -25058,7 +25064,7 @@

    Supress success

    2021-10-12 18:26:44
    -

    *Thread Reply:* Thank you guys!!

    +

    *Thread Reply:* Thank you guys!!

    @@ -25145,7 +25151,7 @@

    Supress success

    2021-10-12 21:46:12
    -

    *Thread Reply:* You can also pass the settings independently if you want something more flexible: https://github.com/OpenLineage/OpenLineage/blob/8afc4ff88b8dd8090cd9c45061a9f669fe[…]n/java/io/openlineage/spark/agent/OpenLineageSparkListener.java

    +

    *Thread Reply:* You can also pass the settings independently if you want something more flexible: https://github.com/OpenLineage/OpenLineage/blob/8afc4ff88b8dd8090cd9c45061a9f669fe[…]n/java/io/openlineage/spark/agent/OpenLineageSparkListener.java

    @@ -25196,7 +25202,7 @@

    Supress success

    2021-10-12 21:47:36
    -

*Thread Reply:* SparkSession.builder()
+

*Thread Reply:* SparkSession.builder()
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.2.+")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.host", "<https://localhost>")
@@ -25228,7 +25234,7 @@
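The same configuration from PySpark might look like this (a sketch; the app name, host, and namespace values are placeholders):

```python
from pyspark.sql import SparkSession

# Sketch: equivalent OpenLineage listener configuration from PySpark.
spark = (
    SparkSession.builder
    .appName("openlineage_example")  # placeholder app name
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.2.+")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.host", "http://localhost:5000")   # placeholder host
    .config("spark.openlineage.namespace", "my_namespace")       # placeholder namespace
    .getOrCreate()
)
```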

    Supress success

    2021-10-12 21:48:57
    -

    *Thread Reply:* It is going to add /lineage in the end: https://github.com/OpenLineage/OpenLineage/blob/8afc4ff88b8dd8090cd9c45061a9f669fe[…]rc/main/java/io/openlineage/spark/agent/OpenLineageContext.java

    +

    *Thread Reply:* It is going to add /lineage in the end: https://github.com/OpenLineage/OpenLineage/blob/8afc4ff88b8dd8090cd9c45061a9f669fe[…]rc/main/java/io/openlineage/spark/agent/OpenLineageContext.java

    @@ -25279,7 +25285,7 @@

    Supress success

    2021-10-12 21:49:37
    -

    *Thread Reply:* the apiKey setting is sent in an “Authorization” header

    +

    *Thread Reply:* the apiKey setting is sent in an “Authorization” header

    @@ -25305,7 +25311,7 @@

    Supress success

    2021-10-12 21:49:55
    -

    *Thread Reply:* “Bearer $KEY”

    +

    *Thread Reply:* “Bearer $KEY”

    @@ -25331,7 +25337,7 @@

    Supress success

    2021-10-12 21:51:09
    -

    *Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/a6eea7a55fef444b6561005164869a9082[…]n/java/io/openlineage/spark/agent/client/OpenLineageClient.java

    +

    *Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/a6eea7a55fef444b6561005164869a9082[…]n/java/io/openlineage/spark/agent/client/OpenLineageClient.java

    @@ -25382,7 +25388,7 @@

    Supress success

    2021-10-12 22:54:22
    -

    *Thread Reply:* Thank you @Julien Le Dem it seems in both cases (defining the url endpoint with spark.openlineage.url and with the components: spark.openlineage.host / openlineage.version / openlineage.namespace / etc.) OpenLineage will strip out url parameters and rebuild the url endpoint with /lineage.

    +

    *Thread Reply:* Thank you @Julien Le Dem it seems in both cases (defining the url endpoint with spark.openlineage.url and with the components: spark.openlineage.host / openlineage.version / openlineage.namespace / etc.) OpenLineage will strip out url parameters and rebuild the url endpoint with /lineage.

    I think we might need to add in a url parameter configuration for our hackathon. We're using a bit of serverless code to shuttle open lineage events to a queue so that another job and/or serverless application can read that queue at its leisure.

    @@ -25416,7 +25422,7 @@

    Supress success

    2021-10-13 10:46:22
    -

    *Thread Reply:* Yes, please open an issue detailing the use case.

    +

    *Thread Reply:* Yes, please open an issue detailing the use case.

    @@ -25497,7 +25503,7 @@

    Supress success

    2021-10-13 17:54:22
    -

*Thread Reply:* This is because this specific action is not covered yet. You can see the “spark_unknown” facet is describing things that are not understood yet
+

*Thread Reply:* This is because this specific action is not covered yet. You can see the “spark_unknown” facet is describing things that are not understood yet
"run": {
  ...
  "facets": {
@@ -25536,7 +25542,7 @@

    Supress success

    2021-10-13 17:54:43
    -

    *Thread Reply:* I think this is part of the Spark 3 gap

    +

    *Thread Reply:* I think this is part of the Spark 3 gap

    @@ -25562,7 +25568,7 @@

    Supress success

    2021-10-13 17:55:46
    -

    *Thread Reply:* an unknown output will cause missing output lineage

    +

    *Thread Reply:* an unknown output will cause missing output lineage

    @@ -25588,7 +25594,7 @@

    Supress success

    2021-10-13 18:05:57
    -

    *Thread Reply:* Output handling is here: https://github.com/OpenLineage/OpenLineage/blob/e0f1852422f325dc019b0eab0e466dc905[…]io/openlineage/spark/agent/lifecycle/OutputDatasetVisitors.java

    +

    *Thread Reply:* Output handling is here: https://github.com/OpenLineage/OpenLineage/blob/e0f1852422f325dc019b0eab0e466dc905[…]io/openlineage/spark/agent/lifecycle/OutputDatasetVisitors.java

    @@ -25643,7 +25649,7 @@

    Supress success

    2021-10-13 22:49:08
    -

    *Thread Reply:* Ah! Thank you so much, Julien! This is very helpful to understand where that is set. This is a big gap that we want to help address after our hackathon. Thank you!

    +

    *Thread Reply:* Ah! Thank you so much, Julien! This is very helpful to understand where that is set. This is a big gap that we want to help address after our hackathon. Thank you!

    @@ -25753,7 +25759,7 @@

    Supress success

    2021-10-13 20:13:02
    -

    *Thread Reply:* the github wiki is backed by a git repo but it does not allow PRs. (people do hacks but I’d rather avoid those)

    +

    *Thread Reply:* the github wiki is backed by a git repo but it does not allow PRs. (people do hacks but I’d rather avoid those)

    @@ -25892,7 +25898,7 @@

    Supress success

    2021-10-22 05:44:17
    -

*Thread Reply:* Maybe you want to contribute it?
+

    *Thread Reply:* Maybe you want to contribute it? It's not that hard, mostly testing, and figuring out what would be the naming of openlineage namespace for Trino, and how some additional statistics work.

    For example, recently we had added support for Redshift by community member @ale

    @@ -26068,7 +26074,7 @@

    Supress success

    2021-10-28 12:20:27
    -

    *Thread Reply:* That would make sense. I think you are the first person to request this. Is this something you would want to contribute to the project?

    +

    *Thread Reply:* That would make sense. I think you are the first person to request this. Is this something you would want to contribute to the project?

    @@ -26094,7 +26100,7 @@

    Supress success

    2021-10-28 17:37:53
    -

*Thread Reply:* I would like to, Julien, but I'm not sure how. Could you guide me on how to start, or show me another integration?

+

*Thread Reply:* I would like to, Julien, but I'm not sure how. Could you guide me on how to start, or show me another integration?

    @@ -26120,7 +26126,7 @@

    Supress success

    2021-10-31 07:57:55
    -

    *Thread Reply:* @David Virgil look at the pull request for the addition of Redshift as a starting guide. https://github.com/OpenLineage/OpenLineage/pull/328

    +

    *Thread Reply:* @David Virgil look at the pull request for the addition of Redshift as a starting guide. https://github.com/OpenLineage/OpenLineage/pull/328

    @@ -26180,7 +26186,7 @@

    Supress success

    2021-11-01 12:01:41
    -

*Thread Reply:* Thanks @Matthew Mullins, I'll try to add the dbt-spark integration

+

*Thread Reply:* Thanks @Matthew Mullins, I'll try to add the dbt-spark integration

    @@ -26232,7 +26238,7 @@

    Supress success

    2021-10-28 10:05:09
    -

    *Thread Reply:* OK, was changed here: https://github.com/OpenLineage/OpenLineage/pull/286

    +

    *Thread Reply:* OK, was changed here: https://github.com/OpenLineage/OpenLineage/pull/286

    Did you think about this?

    Supress success
    2021-10-28 12:19:27
    -

    *Thread Reply:* In Marquez there was a mechanism to do that. Something like OPENLINEAGE_BACKEND=HTTP|LOG

    +

    *Thread Reply:* In Marquez there was a mechanism to do that. Something like OPENLINEAGE_BACKEND=HTTP|LOG

    @@ -26323,7 +26329,7 @@

    Supress success

    2021-10-28 13:56:42
    -

    *Thread Reply:* @Mario Measic We're going to add Transport mechanism, that will address use cases like yours. Please comment on this PR what would you expect: https://github.com/OpenLineage/OpenLineage/pull/344

    +

    *Thread Reply:* @Mario Measic We're going to add Transport mechanism, that will address use cases like yours. Please comment on this PR what would you expect: https://github.com/OpenLineage/OpenLineage/pull/344

    @@ -26386,7 +26392,7 @@

    Supress success

    2021-10-28 15:29:50
    -

    *Thread Reply:* Nice, thanks @Julien Le Dem and @Maciej Obuchowski.

    +

    *Thread Reply:* Nice, thanks @Julien Le Dem and @Maciej Obuchowski.

    @@ -26412,7 +26418,7 @@

    Supress success

    2021-10-28 15:46:45
    -

    *Thread Reply:* Also, dbt build is not working which is kind of the biggest feature of the version 0.21.0, I will try testing the code with modifications to the https://github.com/OpenLineage/OpenLineage/blob/c3aa70e161244091969951d0da4f37619bcbe36f/integration/dbt/scripts/dbt-ol#L141

    +

    *Thread Reply:* Also, dbt build is not working which is kind of the biggest feature of the version 0.21.0, I will try testing the code with modifications to the https://github.com/OpenLineage/OpenLineage/blob/c3aa70e161244091969951d0da4f37619bcbe36f/integration/dbt/scripts/dbt-ol#L141

    I guess there's a reason for it that I didn't see since you support v3 of the manifest.

    Supress success
    2021-10-29 03:45:27
    -

    *Thread Reply:* Also, is it normal not to see the column descriptions for the model/table even though these are provided in the YAML file, persisted in Redshift and also dbt docs generate has been run before dbt-ol run?

    +

    *Thread Reply:* Also, is it normal not to see the column descriptions for the model/table even though these are provided in the YAML file, persisted in Redshift and also dbt docs generate has been run before dbt-ol run?

    @@ -26491,7 +26497,7 @@

    Supress success

    2021-10-29 04:26:22
    -

    *Thread Reply:* Tried with dbt versions 0.20.2 and 0.21.0, openlineage-dbt==0.3.1

    +

    *Thread Reply:* Tried with dbt versions 0.20.2 and 0.21.0, openlineage-dbt==0.3.1

    @@ -26517,7 +26523,7 @@

    Supress success

    2021-10-29 10:39:10
    -

    *Thread Reply:* I'll take a look at that. Supporting descriptions might be simple, but dbt build might be a little larger task.

    +

    *Thread Reply:* I'll take a look at that. Supporting descriptions might be simple, but dbt build might be a little larger task.

    @@ -26543,7 +26549,7 @@

    Supress success

    2021-11-01 19:12:01
    -

    *Thread Reply:* I opened a ticket to track this: https://github.com/OpenLineage/OpenLineage/issues/376

    +

    *Thread Reply:* I opened a ticket to track this: https://github.com/OpenLineage/OpenLineage/issues/376

    @@ -26599,7 +26605,7 @@

    Supress success

    2021-11-02 05:48:06
    -

    *Thread Reply:* The column description issue should be fixed here: https://github.com/OpenLineage/OpenLineage/pull/383

    +

    *Thread Reply:* The column description issue should be fixed here: https://github.com/OpenLineage/OpenLineage/pull/383

    @@ -26782,7 +26788,7 @@

    Supress success

    2021-10-28 18:51:10
    -

    *Thread Reply:* If anyone has any questions or comments - happy to discuss here

    +

    *Thread Reply:* If anyone has any questions or comments - happy to discuss here

    @@ -26808,7 +26814,7 @@

    Supress success

    2021-10-28 18:51:15
    -

    *Thread Reply:* @davzucky

    +

    *Thread Reply:* @davzucky

    @@ -26834,7 +26840,7 @@

    Supress success

    2021-10-28 23:01:29
    -

    *Thread Reply:* Thanks for updating the community, Brad!

    +

    *Thread Reply:* Thanks for updating the community, Brad!

    @@ -26860,7 +26866,7 @@

    Supress success

    2021-10-28 23:47:02
    -

*Thread Reply:* Thank you, Brad. Looking forward to seeing how to integrate that with v2

+

*Thread Reply:* Thank you, Brad. Looking forward to seeing how to integrate that with v2

    @@ -26924,7 +26930,7 @@

    Supress success

    2021-10-28 18:54:56
    -

    *Thread Reply:* Welcome, @Kevin Kho 👋. Really excited to see this integration kick off! 💯🚀

    +

    *Thread Reply:* Welcome, @Kevin Kho 👋. Really excited to see this integration kick off! 💯🚀

    @@ -26989,7 +26995,7 @@

    Supress success

    2021-11-01 12:23:11
    -

    *Thread Reply:* While it’s not something we’re supporting at the moment, it’s definitely something that we’re considering!

    +

    *Thread Reply:* While it’s not something we’re supporting at the moment, it’s definitely something that we’re considering!

    If you can give me a little more detail on what your system infrastructure is like, it’ll help us set priority and design

    @@ -27017,7 +27023,7 @@

    Supress success

    2021-11-01 13:57:34
    -

*Thread Reply:* So, the basic architecture of a data lake. We are using Airflow to trigger jobs. Every job is a pipeline that runs a Spark job (in our case it spins up an EMR cluster). So the idea of lineage would be defining inlets and outlets in the dags, based on the Airflow lineage:

+

*Thread Reply:* So, the basic architecture of a data lake. We are using Airflow to trigger jobs. Every job is a pipeline that runs a Spark job (in our case it spins up an EMR cluster). So the idea of lineage would be defining inlets and outlets in the dags, based on the Airflow lineage:

    https://airflow.apache.org/docs/apache-airflow/stable/lineage.html

    @@ -27047,7 +27053,7 @@

    Supress success

    2021-11-01 14:01:24
    -

    *Thread Reply:* Why not use spark integration? https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark

    +

    *Thread Reply:* Why not use spark integration? https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark

    @@ -27073,7 +27079,7 @@

    Supress success

    2021-11-01 14:05:02
    -

*Thread Reply:* because there are some other jobs that are not Spark; some jobs run in dbt, other jobs run in Redshift @Maciej Obuchowski

+

*Thread Reply:* because there are some other jobs that are not Spark; some jobs run in dbt, other jobs run in Redshift @Maciej Obuchowski

    @@ -27099,7 +27105,7 @@

    Supress success

    2021-11-01 14:08:58
    -

    *Thread Reply:* So, combo of https://github.com/OpenLineage/OpenLineage/tree/main/integration/dbt and PostgresExtractor from airflow integration should cover Redshift if you're using it from PostgresOperator 🙂

    +

    *Thread Reply:* So, combo of https://github.com/OpenLineage/OpenLineage/tree/main/integration/dbt and PostgresExtractor from airflow integration should cover Redshift if you're using it from PostgresOperator 🙂

    It's definitely interesting use case - you'd be using most of the existing integrations we have.

    @@ -27127,7 +27133,7 @@

    Supress success

    2021-11-01 15:04:44
    -

    *Thread Reply:* @Maciej Obuchowski Do i need to define any extractor in the airflow startup?

    +

    *Thread Reply:* @Maciej Obuchowski Do i need to define any extractor in the airflow startup?

    @@ -27153,7 +27159,7 @@

    Supress success

    2021-11-05 23:48:21
    -

    *Thread Reply:* I am using Redshift with PostgresOperator and it is returning…

    +

    *Thread Reply:* I am using Redshift with PostgresOperator and it is returning…

[2021-11-06 03:43:06,541] {{__init__.py:92}} ERROR - Failed to extract metadata 'NoneType' object has no attribute 'host' task_type=PostgresOperator airflow_dag_id=counter task_id=inc airflow_run_id=scheduled__2021-11-06T03:42:00+00:00
Traceback (most recent call last):
@@ -27272,7 +27278,7 @@

    Supress success

    2021-11-01 14:06:12
    -

    *Thread Reply:* 1. Please use openlineage.lineage_backend.OpenLineageBackend as AIRFLOW__LINEAGE__BACKEND

    +

    *Thread Reply:* 1. Please use openlineage.lineage_backend.OpenLineageBackend as AIRFLOW__LINEAGE__BACKEND

2. Please tell us where you've seen openlineage.airflow.backend.OpenLineageBackend, so we can fix the documentation 🙂
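A minimal sketch of that configuration via environment variables (the backend path is quoted above; the URL and namespace values are placeholders):

```python
import os

# Sketch: configure the OpenLineage lineage backend for Airflow 2.
# In practice these are set in the scheduler/worker environment, not in a DAG.
os.environ["AIRFLOW__LINEAGE__BACKEND"] = "openlineage.lineage_backend.OpenLineageBackend"
os.environ["OPENLINEAGE_URL"] = "http://localhost:5000"   # placeholder Marquez URL
os.environ["OPENLINEAGE_NAMESPACE"] = "my_namespace"      # placeholder namespace
```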
    @@ -27301,7 +27307,7 @@

    Supress success

    2021-11-01 19:07:21
    -

    *Thread Reply:* https://pypi.org/project/openlineage-airflow/

    +

    *Thread Reply:* https://pypi.org/project/openlineage-airflow/

    PyPI
    @@ -27352,7 +27358,7 @@

    Supress success

    2021-11-01 19:08:03
    -

    *Thread Reply:* (I googled it and found that page that seems to have an outdated doc)

    +

    *Thread Reply:* (I googled it and found that page that seems to have an outdated doc)

    @@ -27378,7 +27384,7 @@

    Supress success

    2021-11-02 02:38:59
    -

*Thread Reply:* @Maciej Obuchowski
+

*Thread Reply:* @Maciej Obuchowski @Julien Le Dem that's the page I followed. Please revise the documentation, guys, as it is very important

    @@ -27405,7 +27411,7 @@

    Supress success

    2021-11-02 04:34:14
    -

    *Thread Reply:* It should just copy actual readme

    +

    *Thread Reply:* It should just copy actual readme

    @@ -27456,7 +27462,7 @@

    Supress success

    2021-11-03 16:30:00
    -

    *Thread Reply:* PyPi is using the README at the time of the release 0.3.1, rather than the current README, which is 0.4.0. If we send the new release to PyPi it should also update the README

    +

    *Thread Reply:* PyPi is using the README at the time of the release 0.3.1, rather than the current README, which is 0.4.0. If we send the new release to PyPi it should also update the README

    @@ -27508,7 +27514,7 @@

    Supress success

    2021-11-01 15:19:18
    -

*Thread Reply:* I set it up in the scheduler and it starts to log data to Marquez. But it fails with this error:

+

*Thread Reply:* I set it up in the scheduler and it starts to log data to Marquez. But it fails with this error:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/openlineage/client/client.py", line 49, in __init__
@@ -27539,7 +27545,7 @@

    Supress success

    2021-11-01 15:19:26
    -

    *Thread Reply:* why is it not a valid URL?

    +

    *Thread Reply:* why is it not a valid URL?

    @@ -27565,7 +27571,7 @@

    Supress success

    2021-11-01 18:39:58
    -

    *Thread Reply:* Which version of the OpenLineage client are you using? On first check it should be fine

    +

    *Thread Reply:* Which version of the OpenLineage client are you using? On first check it should be fine

    @@ -27591,7 +27597,7 @@

    Supress success

    2021-11-02 05:14:30
    -

    *Thread Reply:* @John Thomas I was appending double quotes as part of the url. Forget about this error

    +

    *Thread Reply:* @John Thomas I was appending double quotes as part of the url. Forget about this error

    @@ -27617,7 +27623,7 @@

    Supress success

    2021-11-02 10:35:28
    -

    *Thread Reply:* aaaah, gotcha, good catch!

    +

    *Thread Reply:* aaaah, gotcha, good catch!

    @@ -27673,7 +27679,7 @@

    Supress success

    2021-11-02 05:18:18
    -

    *Thread Reply:* Are you sure that openlineage-airflow is present in the container?

    +

    *Thread Reply:* Are you sure that openlineage-airflow is present in the container?

    @@ -27779,7 +27785,7 @@

    Supress success

    2021-11-02 05:25:43
    -

    *Thread Reply:* Maybe ADDITIONAL_PYTHON_DEPS are dependencies needed by the tasks, and are installed after Airflow tries to initialize LineageBackend?

    +

    *Thread Reply:* Maybe ADDITIONAL_PYTHON_DEPS are dependencies needed by the tasks, and are installed after Airflow tries to initialize LineageBackend?

    @@ -27805,7 +27811,7 @@

    Supress success

    2021-11-02 06:34:11
    -

*Thread Reply:* I am checking this by accessing the Kubernetes pod

+

*Thread Reply:* I am checking this by accessing the Kubernetes pod

    @@ -27920,7 +27926,7 @@

    Supress success

    2021-11-02 07:34:47
    -

    *Thread Reply:* Yes

    +

    *Thread Reply:* Yes

    @@ -27946,7 +27952,7 @@

    Supress success

    2021-11-02 07:35:53
    -

    *Thread Reply:* Probably what you want is job hierarchy: https://github.com/MarquezProject/marquez/issues/1737

    +

    *Thread Reply:* Probably what you want is job hierarchy: https://github.com/MarquezProject/marquez/issues/1737

    @@ -28001,7 +28007,7 @@

    Supress success

    2021-11-02 07:46:02
    -

*Thread Reply:* I do not see any benefit in just having some Airflow task metadata, and I do not see the relationships between tasks; every task is a job. When I started working on my company's integration with OpenLineage, I thought it would give me the relationships between tasks or datasets, but the only thing I see is some metadata on the history of Airflow runs that is already provided by Airflow

+

*Thread Reply:* I do not see any benefit in just having some Airflow task metadata, and I do not see the relationships between tasks; every task is a job. When I started working on my company's integration with OpenLineage, I thought it would give me the relationships between tasks or datasets, but the only thing I see is some metadata on the history of Airflow runs that is already provided by Airflow

    @@ -28027,7 +28033,7 @@

    Supress success

    2021-11-02 07:46:20
    -

*Thread Reply:* I was expecting to see a nice graph. I think it is missing some features

+

*Thread Reply:* I was expecting to see a nice graph. I think it is missing some features

    @@ -28053,7 +28059,7 @@

    Supress success

    2021-11-02 07:46:25
    -

    *Thread Reply:* at this early stage

    +

    *Thread Reply:* at this early stage

    @@ -28079,7 +28085,7 @@

    Supress success

    2021-11-02 07:50:10
    -

    *Thread Reply:* It probably depends on whether those tasks are covered by the extractors: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow/openlineage/airflow/extractors

    +

    *Thread Reply:* It probably depends on whether those tasks are covered by the extractors: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow/openlineage/airflow/extractors

    @@ -28105,7 +28111,7 @@

    Supress success

    2021-11-02 07:55:50
    -

*Thread Reply:* We are not using any of those operators: BigQuery, Postgres, or Snowflake.

+

*Thread Reply:* We are not using any of those operators: BigQuery, Postgres, or Snowflake.

And what does the GreatExpectations extractor do?

    @@ -28135,7 +28141,7 @@

    Supress success

    2021-11-02 07:56:30
    -

    *Thread Reply:* And that the same dag graph can be seen in marquez, and not one job per task.

    +

    *Thread Reply:* And that the same dag graph can be seen in marquez, and not one job per task.

    @@ -28161,7 +28167,7 @@

    Supress success

    2021-11-02 08:07:06
    -

*Thread Reply:* > It would be good if there is one extractor that relies in the inlets and outlets that you can define in any Airflow task
+

*Thread Reply:* > It would be good if there is one extractor that relies in the inlets and outlets that you can define in any Airflow task
I think this is a good idea. Overall, OpenLineage strongly focuses on automatic metadata collection. However, using them would be a nice fallback for not-covered-yet cases.

> And that the same dag graph can be seen in marquez, and not one job per task.
@@ -28218,7 +28224,7 @@

    Supress success

    2021-11-02 08:09:55
    -

    *Thread Reply:* Created issue for the manual fallback: https://github.com/OpenLineage/OpenLineage/issues/384

    +

    *Thread Reply:* Created issue for the manual fallback: https://github.com/OpenLineage/OpenLineage/issues/384
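For context, a sketch of what such a manual fallback might look like in a DAG, using Airflow's own lineage entities (hypothetical table names; at the time of this thread the OpenLineage backend did not yet consume these, which is what the issue above tracks):

```python
from airflow.lineage.entities import Table
from airflow.operators.bash import BashOperator

# Hypothetical manual annotation; cluster/database/table names are made up.
transform = BashOperator(
    task_id="transform",
    bash_command="echo transform",
    inlets=[Table(cluster="redshift", database="dev", name="raw.events")],
    outlets=[Table(cluster="redshift", database="dev", name="analytics.events_daily")],
)
```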

    @@ -28275,7 +28281,7 @@

    Supress success

    2021-11-02 08:28:29
    -

*Thread Reply:* @Maciej Obuchowski how many people are working full time on this library? I really would like to adopt it in my company, as we use Airflow and Spark, but I see that it does not yet have the features we would like.

+

*Thread Reply:* @Maciej Obuchowski how many people are working full time on this library? I really would like to adopt it in my company, as we use Airflow and Spark, but I see that it does not yet have the features we would like.

At the moment, the same info we have in Marquez about the tasks is available in the Airflow UI or via the Airflow API.

    @@ -28305,7 +28311,7 @@

    Supress success

    2021-11-02 09:33:31
    -

*Thread Reply:* > how many people are working full time in this library?
+

    *Thread Reply:* > how many people are working full time in this library? On Airflow integration or on OpenLineage overall? 🙂

    > The game changer for us would be that it could give us features/metadata that we cannot query directly from airflow. @@ -28362,7 +28368,7 @@

    Supress success

    2021-11-02 09:35:10
    -

    *Thread Reply:* But first, before implementing last option, I'd like to get consensus about it - so feel free to comment there about your use case

    +

    *Thread Reply:* But first, before implementing last option, I'd like to get consensus about it - so feel free to comment there about your use case

    @@ -28444,7 +28450,7 @@

    Supress success

    2021-11-03 08:00:34
    -

    *Thread Reply:* It should only affect you if you're using https://greatexpectations.io/

    +

    *Thread Reply:* It should only affect you if you're using https://greatexpectations.io/

    @@ -28470,7 +28476,7 @@

    Supress success

    2021-11-03 15:57:02
    -

    *Thread Reply:* I have a similar message after installing openlineage into Amazon MWAA from the scheduler logs:

    +

    *Thread Reply:* I have a similar message after installing openlineage into Amazon MWAA from the scheduler logs:

    WARNING:/usr/local/airflow/.local/lib/python3.7/site-packages/openlineage/airflow/extractors/great_expectations_extractor.py:Did not find great_expectations_provider library or failed to import it

    @@ -28530,7 +28536,7 @@

    Supress success

    2021-11-03 08:08:21
    -

*Thread Reply:* I don't think 1) is a good idea. You can have multiple tasks in one dag, processing different datasets and producing different datasets. If you want visual linking of jobs that produce disjoint datasets, then I think you want this: https://github.com/MarquezProject/marquez/issues/1737
+

*Thread Reply:* I don't think 1) is a good idea. You can have multiple tasks in one dag, processing different datasets and producing different datasets. If you want visual linking of jobs that produce disjoint datasets, then I think you want this: https://github.com/MarquezProject/marquez/issues/1737
which will affect the visual layer.

    Regarding 2), I think we need to get along with Airflow maintainers regarding long term mechanism on which OL will work: https://github.com/apache/airflow/issues/17984

    @@ -28587,7 +28593,7 @@

    Supress success

    2021-11-03 08:19:00
    2021-11-03 08:35:59
    -

    *Thread Reply:* Thank you very much @Maciej Obuchowski

    +

    *Thread Reply:* Thank you very much @Maciej Obuchowski

    @@ -28697,7 +28703,7 @@

    Supress success

    2021-11-03 08:41:14
    -

    *Thread Reply:* It worked like that in Airflow 1.10.

    +

    *Thread Reply:* It worked like that in Airflow 1.10.

    This is an unfortunate limitation of LineageBackend API that we're using for Airflow 2. We're trying to work out solution for this with Airflow maintainers: https://github.com/apache/airflow/issues/17984

    Supress success
    2021-11-04 18:47:41
    -

    *Thread Reply:* This job with no output is a symptom of the output not being understood. you should be able to see the facets for that job. There will be a spark_unknown facet with more information about the problem. If you put that into an issue with some more details about this job we should be able to help.

    +

    *Thread Reply:* This job with no output is a symptom of the output not being understood. you should be able to see the facets for that job. There will be a spark_unknown facet with more information about the problem. If you put that into an issue with some more details about this job we should be able to help.

    @@ -28965,7 +28971,7 @@

    Supress success

    2021-11-05 04:36:30
    -

*Thread Reply:* I'll try to put all the info in a ticket, as it is not working as I would expect

+

*Thread Reply:* I'll try to put all the info in a ticket, as it is not working as I would expect

    @@ -29089,7 +29095,7 @@

    Supress success

    2021-11-04 18:49:31
    -

    *Thread Reply:* Is there an error in the browser javascript console? (example on chrome: View -> Developer -> Javascript console)

    +

    *Thread Reply:* Is there an error in the browser javascript console? (example on chrome: View -> Developer -> Javascript console)

    @@ -29171,7 +29177,7 @@

    Supress success

    2021-11-04 21:35:16
    -

*Thread Reply:* Hi Taleb - I don’t know of a generalized example of lineage tracking with Pandas, but you should be able to accomplish this by sending the runEvents manually to the OpenLineage API in your code:
+

    *Thread Reply:* Hi Taleb - I don’t know of a generalized example of lineage tracking with Pandas, but you should be able to accomplish this by sending the runEvents manually to the OpenLineage API in your code: https://openlineage.io/docs/openapi/

    @@ -29198,7 +29204,7 @@

    Supress success

    2021-11-04 21:38:25
    -

    *Thread Reply:* Is this a work in progress, that we can investigate? Because I see it in this image https://github.com/OpenLineage/OpenLineage/blob/main/doc/Scope.png

    +

    *Thread Reply:* Is this a work in progress, that we can investigate? Because I see it in this image https://github.com/OpenLineage/OpenLineage/blob/main/doc/Scope.png

    @@ -29252,7 +29258,7 @@

    Supress success

    2021-11-04 21:54:51
    -

    *Thread Reply:* To my knowledge, while there are a few proposals around adding a wrapper on some Pandas methods to output runEvents, it’s not something that’s had work started on it yet

    +

    *Thread Reply:* To my knowledge, while there are a few proposals around adding a wrapper on some Pandas methods to output runEvents, it’s not something that’s had work started on it yet

    @@ -29278,7 +29284,7 @@

    Supress success

    2021-11-04 21:56:26
    -

    *Thread Reply:* I sent some feelers out to get a little more context from folks who are more informed about this than I am, so I’ll get you more info about potential future plans and the considerations around them when I know more

    +

    *Thread Reply:* I sent some feelers out to get a little more context from folks who are more informed about this than I am, so I’ll get you more info about potential future plans and the considerations around them when I know more

    @@ -29304,7 +29310,7 @@

    Supress success

    2021-11-04 23:04:47
    -

    *Thread Reply:* So, Pandas is tricky because unlike Airflow, DBT, or Spark, Pandas doesn’t own the whole flow, and you might dip in and out of it to use other Python Packages (at least I did when I was doing more Data Science).

    +

    *Thread Reply:* So, Pandas is tricky because unlike Airflow, DBT, or Spark, Pandas doesn’t own the whole flow, and you might dip in and out of it to use other Python Packages (at least I did when I was doing more Data Science).

    We have this issue open in OpenLineage that you should go +1 to help with our planning 🙂

    Supress success
    2021-11-05 15:08:09
    -

    *Thread Reply:* interesting... what if it were instead on all the read_** to_** functions?

    +

    *Thread Reply:* interesting... what if it were instead on all the read_** to_** functions?

    @@ -29424,7 +29430,7 @@

    Supress success

    2021-11-05 12:59:49
    -

    *Thread Reply:* The Marquez write APIs are artifacts from before OpenLineage existed, and they’re already slated for deprecation soon.

    +

    *Thread Reply:* The Marquez write APIs are artifacts from before OpenLineage existed, and they’re already slated for deprecation soon.

    If you POST an OpenLineage runEvent to the /lineage endpoint in Marquez, it’ll create any missing jobs or datasets that are relevant.

    @@ -29452,7 +29458,7 @@

    Supress success

    2021-11-05 13:06:06
    -

*Thread Reply:* Thanks for the response. That sounds good. Does this include the query interface e.g.
+

    *Thread Reply:* Thanks for the response. That sounds good. Does this include the query interface e.g. http://localhost:5000/api/v1/namespaces/testing_java/datasets/incremental_data as that currently returns the Marquez version of a dataset including default set fields for type and the above mentioned properties.

    @@ -29480,7 +29486,7 @@

    Supress success

    2021-11-05 17:01:55
    -

    *Thread Reply:* I believe the intention for type is to support a new facet- TBH, it hasn't been the most pressing concern for most users, as most people are only recording tables, not streams. However, there's been some recent work to support Kafka in Spark- maybe it's time to address that deficiency.

    +

    *Thread Reply:* I believe the intention for type is to support a new facet- TBH, it hasn't been the most pressing concern for most users, as most people are only recording tables, not streams. However, there's been some recent work to support Kafka in Spark- maybe it's time to address that deficiency.

    I don't actually know what happened to the datasource type field- maybe @Julien Le Dem can comment on whether that field was dropped intentionally or whether it was an oversight.

    @@ -29508,7 +29514,7 @@

    Supress success

    2021-11-05 18:18:06
    -

*Thread Reply:* It looks like an oversight; currently Marquez hard codes it to POSTGRESQL: https://github.com/MarquezProject/marquez/blob/734bfd691636cb00212d7d22b1a489bd4870fb04/api/src/main/java/marquez/db/OpenLineageDao.java#L438

+

*Thread Reply:* It looks like an oversight; currently Marquez hard codes it to POSTGRESQL: https://github.com/MarquezProject/marquez/blob/734bfd691636cb00212d7d22b1a489bd4870fb04/api/src/main/java/marquez/db/OpenLineageDao.java#L438

    @@ -29559,7 +29565,7 @@

    Supress success

    2021-11-05 18:18:25
    -

    *Thread Reply:* https://github.com/MarquezProject/marquez/blob/734bfd691636cb00212d7d22b1a489bd4870fb04/api/src/main/java/marquez/db/OpenLineageDao.java#L438-L440

    +

    *Thread Reply:* https://github.com/MarquezProject/marquez/blob/734bfd691636cb00212d7d22b1a489bd4870fb04/api/src/main/java/marquez/db/OpenLineageDao.java#L438-L440

    @@ -29612,7 +29618,7 @@

    Supress success

    2021-11-05 18:20:25
    -

    *Thread Reply:* The source has a name though: https://github.com/OpenLineage/OpenLineage/blob/8afc4ff88b8dd8090cd9c45061a9f669fea2151e/spec/facets/DatasourceDatasetFacet.json#L12

    +

    *Thread Reply:* The source has a name though: https://github.com/OpenLineage/OpenLineage/blob/8afc4ff88b8dd8090cd9c45061a9f669fea2151e/spec/facets/DatasourceDatasetFacet.json#L12

    @@ -29745,7 +29751,7 @@

    Supress success

    2021-11-05 18:07:57
    -

    *Thread Reply:* If you want to add something please chime in this thread

    +

    *Thread Reply:* If you want to add something please chime in this thread

    @@ -29771,7 +29777,7 @@

    Supress success

    2021-11-05 19:27:44
    2021-11-09 19:47:26
    -

*Thread Reply:* The monthly meeting is happening tomorrow.
+

*Thread Reply:* The monthly meeting is happening tomorrow.
The purview team will present at the December meeting instead.
See the full agenda here: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
You are welcome to contribute

    @@ -29826,7 +29832,7 @@

    Supress success

    2021-11-10 11:10:17
    -

    *Thread Reply:* The slides for the meeting later today: https://docs.google.com/presentation/d/1z2NTkkL8hg_2typHRYhcFPyD5az-5-tl/edit#slide=id.ge7d4b64ef4_0_0

    +

    *Thread Reply:* The slides for the meeting later today: https://docs.google.com/presentation/d/1z2NTkkL8hg_2typHRYhcFPyD5az-5-tl/edit#slide=id.ge7d4b64ef4_0_0

    @@ -29852,7 +29858,7 @@

    Supress success

    2021-11-10 12:02:23
    -

    *Thread Reply:* It’s happening now ^

    +

    *Thread Reply:* It’s happening now ^

    @@ -29878,7 +29884,7 @@

    Supress success

    2021-11-16 19:57:23
    -

    *Thread Reply:* I have posted the notes and the recording from the last instance of our monthly meeting: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting#MonthlyTSCmeeting-Nov10th2021(9amPT) +

    *Thread Reply:* I have posted the notes and the recording from the last instance of our monthly meeting: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting#MonthlyTSCmeeting-Nov10th2021(9amPT) I have a few TODOs to follow up on tickets

    @@ -29959,7 +29965,7 @@

    Supress success

    2021-11-09 08:18:58
    -

*Thread Reply:* How do you model a continuous process (not a batch process)? For example, a flume or spark job that does some real-time processing on data.

+

*Thread Reply:* How do you model a continuous process (not a batch process)? For example, a flume or spark job that does some real-time processing on data.

Maybe it's simply a “Job”. But then what is a run?

    @@ -29987,7 +29993,7 @@

    Supress success

    2021-11-09 08:19:44
    -

    *Thread Reply:* How do you model consumers at the end - they can be reports? Data applications, ML model deployments, APIs, GUI consumed by end users ?

    +

    *Thread Reply:* How do you model consumers at the end - they can be reports? Data applications, ML model deployments, APIs, GUI consumed by end users ?

    Have you considered having some examples of different use cases like those?

    @@ -30015,9 +30021,9 @@

    Supress success

    2021-11-09 08:21:43
    -

*Thread Reply: By definition, Job is a process definition that consumes and produces datasets. It is a many-to-many relation? I've been wondering about that. Shouldn't it be more restrictive? +

*Thread Reply:* By definition, Job is a process definition that consumes and produces datasets. It is a many-to-many relation? I've been wondering about that. Shouldn't it be more restrictive? For example, an important use-case for lineage is troubleshooting or error notifications (e.g. mark a report or job as temporarily in a bad state if an upstream data integration is broken). -In order to be able to do that you need to be able to traverse the graph to find the original error. So having multiple inputs produce a single output makes sense (e.g. insert into output_1 select * from x,y group by a,b) . +In order to be able to do that you need to be able to traverse the graph to find the original error. So having multiple inputs produce a single output makes sense (e.g. insert into output_1 select ** from x,y group by a,b) . But what are the cases where you'd want to see multiple outputs? You can have a single process produce multiple tables (as in the above example) but they'd always be separate queries. The actual inputs for each output would be different.

But having multiple outputs creates ambiguity: if x or y is broken, I do not know which output is really impacted.

    @@ -30046,7 +30052,7 @@

    Supress success

    2021-11-09 08:34:01
    -

*Thread Reply:* > How do you model a continuous process (not a batch process)? For example, a flume or spark job that does some real-time processing on data. +

*Thread Reply:* > How do you model a continuous process (not a batch process)? For example, a flume or spark job that does some real-time processing on data. > > Maybe it's simply a “Job”. But then what is a run? Every continuous process eventually has an end - for example, you can deploy a new version of your Flink pipeline. The new version would be the next Run for the same Job.
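As an illustrative sketch of that model (all names invented): one Job, and a fresh Run per deployment:

```python
import uuid
from datetime import datetime, timezone

# One Job (the pipeline definition), a new Run per deployed version.
job = {"namespace": "streaming", "name": "flink-enrich-orders"}

def start_event():
    # Called on each (re)deployment: same job, fresh runId.
    return {
        "eventType": "START",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": job,
        "producer": "https://example.com/my-producer",
    }

v1_run = start_event()  # first deployment
v2_run = start_event()  # new version deployed later: the next Run for the same Job
```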

    @@ -30079,7 +30085,7 @@

    Supress success

    2021-11-09 08:43:09
    -

    *Thread Reply:* > How do you model consumers at the end - they can be reports? Data applications, ML model deployments, APIs, GUI consumed by end users ? +

*Thread Reply:* > How do you model consumers at the end - they can be reports? Data applications, ML model deployments, APIs, GUI consumed by end users ? Our reference implementation is a web application https://marquezproject.github.io/marquez/

    We definitely do not exclude any of the things you're talking about - and it would make a lot of sense to talk more about potential usages.

    @@ -30108,7 +30114,7 @@

    Supress success

    2021-11-09 08:45:47
    -

*Thread Reply:* > By definition, Job is a process definition that consumes and produces datasets. It is a many-to-many relation? I've been wondering about that. Shouldn't it be more restrictive? +

*Thread Reply:* > By definition, Job is a process definition that consumes and produces datasets. It is a many-to-many relation? I've been wondering about that. Shouldn't it be more restrictive? I think this is too SQL-centric a view 🙂

    Not everything is a query. For example, those Flink streaming jobs can produce side outputs, or even push data to multiple sinks. We need to model those types of jobs too.

    @@ -30139,7 +30145,7 @@

    Supress success

    2021-11-17 12:11:37
    -

    *Thread Reply:* > We definitely do not exclude any of the things you’re talking about - and it would make a lot of sense to talk more about potential usages. +

*Thread Reply:* > We definitely do not exclude any of the things you're talking about - and it would make a lot of sense to talk more about potential usages. Yes, I think it would be great if we expanded on potential usages - if the OpenLineage documentation (perhaps) had all kinds of examples for different use-cases or case studies. A financial or healthcare industry case study showing how someone would do an integration with OpenLineage would make it easier to understand the concepts and make sure things are modeled consistently.

    @@ -30166,7 +30172,7 @@

    Supress success

    2021-11-17 14:19:19
    -

*Thread Reply:* > I think this is too SQL-centric a view 🙂 +

*Thread Reply:* > I think this is too SQL-centric a view 🙂 > > Not everything is a query. For example, those Flink streaming jobs can produce side outputs, or even push data to multiple sinks. We need to model those types of jobs too. Thanks for answering @Maciej Obuchowski

    @@ -30253,7 +30259,7 @@

    Supress success

    2021-11-23 05:38:11
    -

*Thread Reply:* @Anthony Ivanov I see what you're trying to model. I think this could be solved by column-level lineage though - once we have it. An OL consumer could look at particular columns and derive which table contained the particular error.

+

*Thread Reply:* @Anthony Ivanov I see what you're trying to model. I think this could be solved by column-level lineage though - once we have it. An OL consumer could look at particular columns and derive which table contained the particular error.

    > 2. Within a single flink job and even task: Inventory is written only to S3, UI is written only to DB Does that actually happen? I understand this in case of job, but having single operator write to two different systems seems like bad design. Wouldn't that leave the possibility of breaking exactly-once unless you're going full into two phase commit?

    @@ -30282,7 +30288,7 @@

    Supress success

    2021-11-23 17:02:36
    -

    *Thread Reply:* > Does that actually happen? I understand this in case of job, but having single operator write to two different systems seems like bad design +

*Thread Reply:* > Does that actually happen? I understand this in case of job, but having single operator write to two different systems seems like bad design In a Spark or flink job it is less likely, now that you mention it. But in a batch job (an airflow python or kubernetes operator, for example) users could do anything, and then they'd need lineage to figure out what is wrong, even if what they did is suboptimal 🙂

> I see what you're trying to model.

@@ -30315,7 +30321,7 @@

    Supress success

    2021-11-24 04:55:41
    -

    *Thread Reply:* @Anthony Ivanov take a look at this: https://github.com/OpenLineage/OpenLineage/issues/148

    +

    *Thread Reply:* @Anthony Ivanov take a look at this: https://github.com/OpenLineage/OpenLineage/issues/148

    @@ -30367,7 +30373,7 @@

    Supress success

    2021-11-10 16:17:10
    -

    *Thread Reply:* We’re adding APIs to delete metadata in Marquez 0.20.0. Here’s the related issue, https://github.com/MarquezProject/marquez/issues/1736

    +

    *Thread Reply:* We’re adding APIs to delete metadata in Marquez 0.20.0. Here’s the related issue, https://github.com/MarquezProject/marquez/issues/1736

    @@ -30417,7 +30423,7 @@

    Supress success

    2021-11-10 16:17:37
    -

*Thread Reply:* Until then, you can connect to the DB directly and drop the rows from both the datasets and jobs tables (I know, not ideal)

+

*Thread Reply:* Until then, you can connect to the DB directly and drop the rows from both the datasets and jobs tables (I know, not ideal)
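A rough sketch of that workaround, assuming a default Marquez Postgres setup and psycopg2 (names and credentials are placeholders; back up the database first, and note that rows in related tables may also need cleaning up):

```python
import psycopg2

# Connect straight to the Marquez database (connection details are placeholders).
conn = psycopg2.connect(host="localhost", dbname="marquez", user="marquez", password="marquez")
with conn, conn.cursor() as cur:
    # Drop the rows for the entities you want gone, as suggested above.
    cur.execute("DELETE FROM datasets WHERE name = %s", ("my_old_dataset",))
    cur.execute("DELETE FROM jobs WHERE name = %s", ("my_old_job",))
conn.close()
```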

    @@ -30443,7 +30449,7 @@

    Supress success

    2021-11-11 05:03:50
    -

    *Thread Reply:* Thanks! I assume deleting information will remain a Marquez only feature rather than becoming part of OpenLineage itself?

    +

    *Thread Reply:* Thanks! I assume deleting information will remain a Marquez only feature rather than becoming part of OpenLineage itself?

    @@ -30469,7 +30475,7 @@

    Supress success

    2021-12-10 14:07:57
    -

    *Thread Reply:* Yes! Delete operations will be an action supported by consumers of OpenLineage events

    +

    *Thread Reply:* Yes! Delete operations will be an action supported by consumers of OpenLineage events

    @@ -30522,7 +30528,7 @@

    Supress success

    2021-11-11 05:14:39
    -

    *Thread Reply:* I've been skimming this page: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md

    +

    *Thread Reply:* I've been skimming this page: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md

    @@ -30793,7 +30799,7 @@

    Supress success

    2021-11-11 05:46:06
    -

    *Thread Reply:* Yes!

    +

    *Thread Reply:* Yes!

    @@ -30819,7 +30825,7 @@

    Supress success

    2021-11-11 06:17:01
    -

    *Thread Reply:* Excellent, I think I had mistakenly conflated the two originally. This document makes it a little clearer. +

    *Thread Reply:* Excellent, I think I had mistakenly conflated the two originally. This document makes it a little clearer. As an additional question: When viewing a Dataset in Marquez will it cross the job namespace bounds? As in, will I see jobs from different job namespaces?

    @@ -30847,7 +30853,7 @@

    Supress success

    2021-11-11 09:20:14
    -

    *Thread Reply:* In this example I have 1 job namespace and 2 dataset namespaces: +

    *Thread Reply:* In this example I have 1 job namespace and 2 dataset namespaces: sql-runner-dev is the job namespace. I cannot see a graph of my job now. Is this something to do with the namespace names?
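For illustration (all values made up), an event in a setup like that carries the job namespace on the job and connection-string-style namespaces on the datasets:

```python
# One job namespace, two dataset namespaces (all values illustrative).
event_fragment = {
    "job": {"namespace": "sql-runner-dev", "name": "sql-runner-job"},
    "inputs": [{"namespace": "jdbc:h2:mem:source_db", "name": "PUBLIC.ORDERS"}],
    "outputs": [{"namespace": "jdbc:h2:mem:target_db", "name": "PUBLIC.ORDER_TOTALS"}],
}
```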

    @@ -30884,7 +30890,7 @@

    Supress success

    2021-11-11 09:21:46
    -

    *Thread Reply:* The above document seems to have implied a namespace could be like a connection string for a database

    +

    *Thread Reply:* The above document seems to have implied a namespace could be like a connection string for a database

    @@ -30910,7 +30916,7 @@

    Supress success

    2021-11-11 09:22:25
    -

    *Thread Reply:* Wait, it does work? Marquez was being temperamental

    +

    *Thread Reply:* Wait, it does work? Marquez was being temperamental

    @@ -30936,7 +30942,7 @@

    Supress success

    2021-11-11 09:24:01
    -

    *Thread Reply:* Yes, marquez is unable to fetch lineage for either dataset

    +

    *Thread Reply:* Yes, marquez is unable to fetch lineage for either dataset

    @@ -30962,7 +30968,7 @@

    Supress success

    2021-11-11 09:32:19
    -

    *Thread Reply:* Here's what I mean:

    +

    *Thread Reply:* Here's what I mean:

    @@ -30997,7 +31003,7 @@

    Supress success

    2021-11-11 09:59:24
    -

    *Thread Reply:* I think you might have hit this issue: https://github.com/MarquezProject/marquez/issues/1744

    +

    *Thread Reply:* I think you might have hit this issue: https://github.com/MarquezProject/marquez/issues/1744

    @@ -31057,7 +31063,7 @@

    Supress success

    2021-11-11 10:00:29
    -

    *Thread Reply:* or, maybe not? It was released already.

    +

    *Thread Reply:* or, maybe not? It was released already.

    Can you create issue on github with those helpful gifs? @Lyndon Armitage

    @@ -31085,7 +31091,7 @@

    Supress success

    2021-11-11 10:58:25
    -

    *Thread Reply:* I think you are right Maciej

    +

    *Thread Reply:* I think you are right Maciej

    @@ -31111,7 +31117,7 @@

    Supress success

    2021-11-11 10:58:52
    -

*Thread Reply:* Was that patched in 0.19.1?

+

*Thread Reply:* Was that patched in 0.19.1?

    @@ -31137,7 +31143,7 @@

    Supress success

    2021-11-11 11:06:06
    -

    *Thread Reply:* As far as I see yes: https://github.com/MarquezProject/marquez/releases/tag/0.19.1

    +

    *Thread Reply:* As far as I see yes: https://github.com/MarquezProject/marquez/releases/tag/0.19.1

    Haven't tested this myself unfortunately.

    @@ -31165,7 +31171,7 @@

    Supress success

    2021-11-11 11:07:07
    -

    *Thread Reply:* Perhaps not. It is urlencoding them: +

    *Thread Reply:* Perhaps not. It is urlencoding them: <http://localhost:3000/lineage/dataset/jdbc%3Ah2%3Amem%3Asql_tests_like/HBMOFA.ORDDETP> But the error seems to be in marquez getting them.
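The encoding itself is ordinary percent-encoding, e.g.:

```python
from urllib.parse import quote

# "jdbc:h2:mem:sql_tests_like" -> "jdbc%3Ah2%3Amem%3Asql_tests_like"
print(quote("jdbc:h2:mem:sql_tests_like", safe=""))
```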

    @@ -31193,10 +31199,10 @@

    Supress success

    2021-11-11 11:09:23
    -

    *Thread Reply:* This is an example Lineage event JSON I am sending.

    +

    *Thread Reply:* This is an example Lineage event JSON I am sending.

-

+

@@ -31232,7 +31238,7 @@

    Supress success

    2021-11-11 11:11:29
    -

    *Thread Reply:* I did run into another issue with really long names not being supported due to Marquez's DB using a fixed size string for a column, but that is understandable and probably a non-issue (my test code was generating temporary folders with long names).

    +

    *Thread Reply:* I did run into another issue with really long names not being supported due to Marquez's DB using a fixed size string for a column, but that is understandable and probably a non-issue (my test code was generating temporary folders with long names).

    @@ -31258,7 +31264,7 @@

    Supress success

    2021-11-11 11:22:00
    2021-11-11 11:36:01
    -

    *Thread Reply:* @Lyndon Armitage can you create issue on the Marquez repo? https://github.com/MarquezProject/marquez/issues

    +

    *Thread Reply:* @Lyndon Armitage can you create issue on the Marquez repo? https://github.com/MarquezProject/marquez/issues

    @@ -31310,7 +31316,7 @@

    Supress success

    2021-11-11 11:52:36
    -

    *Thread Reply:* https://github.com/MarquezProject/marquez/issues/1761 Is this sufficient?

    +

    *Thread Reply:* https://github.com/MarquezProject/marquez/issues/1761 Is this sufficient?

    @@ -31443,7 +31449,7 @@

    Supress success

    2021-11-11 11:54:41
    -

    *Thread Reply:* Yup, thanks!

    +

    *Thread Reply:* Yup, thanks!

    @@ -31503,7 +31509,7 @@

    Supress success

    2021-11-15 13:04:19
    -

    *Thread Reply:* Hi Francis, for the event is it creating a new table with new data in glue / adding new data to an existing one or is it simply reformatting an existing table or making an empty one?

    +

    *Thread Reply:* Hi Francis, for the event is it creating a new table with new data in glue / adding new data to an existing one or is it simply reformatting an existing table or making an empty one?

    @@ -31529,7 +31535,7 @@

    Supress success

    2021-11-15 13:35:00
    -

    *Thread Reply:* The table does not exist in the Glue catalog until …

    +

    *Thread Reply:* The table does not exist in the Glue catalog until …

    A Glue crawler connects to one or more data stores (in this case S3), determines the data structures, and writes tables into the Data Catalog.

    @@ -31559,7 +31565,7 @@

    Supress success

    2021-11-15 13:41:14
    -

    *Thread Reply:* Hmm, interesting, so the lineage of interest here would be of the metadata flow not of the data itself?

    +

    *Thread Reply:* Hmm, interesting, so the lineage of interest here would be of the metadata flow not of the data itself?

    In that case I’d say that the glue Crawler is a job that outputs a dataset.

    @@ -31587,7 +31593,7 @@

    Supress success

    2021-11-15 15:03:36
    -

    *Thread Reply:* The crawler is a job that discovers a dataset. It doesn't create it. If you're posting lineage yourself, I'd post it as an input event, not an output. The thing that actually wrote the data - generated the records and stored them in S3 - is the thing that would be outputting the dataset

    +

    *Thread Reply:* The crawler is a job that discovers a dataset. It doesn't create it. If you're posting lineage yourself, I'd post it as an input event, not an output. The thing that actually wrote the data - generated the records and stored them in S3 - is the thing that would be outputting the dataset
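So, as a sketch of that advice (names and bucket invented), a crawler runEvent would list the S3 dataset under inputs and leave outputs empty:

```python
import uuid
from datetime import datetime, timezone

# The crawler only discovers (reads) the data, so the dataset goes under "inputs".
crawler_event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "aws-glue", "name": "nyc-taxi-crawler"},
    "inputs": [{"namespace": "s3://my-bucket", "name": "nyc-taxi/raw"}],
    "outputs": [],
    "producer": "https://example.com/glue-crawler-wrapper",
}
```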

    @@ -31613,7 +31619,7 @@

    Supress success

    2021-11-15 15:23:23
    -

    *Thread Reply:* @Michael Collado I agree the crawler discovers the S3 dataset. It also creates an event which creates/updates the HIVE/Glue table.

    +

    *Thread Reply:* @Michael Collado I agree the crawler discovers the S3 dataset. It also creates an event which creates/updates the HIVE/Glue table.

    If the Glue table isn’t a distinct dataset from the S3 data, how does this compare to a view in a database on top of a table. Are they 2 datasets or just one?

    @@ -31643,7 +31649,7 @@

    Supress success

    2021-11-15 15:24:39
    -

    *Thread Reply:* @John Thomas yes, its the metadata flow.

    +

    *Thread Reply:* @John Thomas yes, its the metadata flow.

    @@ -31669,7 +31675,7 @@

    Supress success

    2021-11-15 15:24:52
    -

    *Thread Reply:* that's how the Spark integration currently treats Hive datasets- I'd like to add a facet to attach that indicates that it is being read as a Hive table, and include all the appropriate metadata, but it uses the dataset's location in S3 as the canonical dataset identifier

    +

    *Thread Reply:* that's how the Spark integration currently treats Hive datasets- I'd like to add a facet to attach that indicates that it is being read as a Hive table, and include all the appropriate metadata, but it uses the dataset's location in S3 as the canonical dataset identifier

    @@ -31695,7 +31701,7 @@

    Supress success

    2021-11-15 15:29:22
    -

    *Thread Reply:* @Francis McGregor-Macdonald I think the way to represent this is predicated on what you’re looking to accomplish by sending a runEvent for the Glue crawler. What are your broader objectives in adding this?

    +

    *Thread Reply:* @Francis McGregor-Macdonald I think the way to represent this is predicated on what you’re looking to accomplish by sending a runEvent for the Glue crawler. What are your broader objectives in adding this?

    @@ -31721,7 +31727,7 @@

    Supress success

    2021-11-15 15:50:37
    -

    *Thread Reply:* I am working through AWS native services seeing how they could, can, or do best integrate with openlineage (I’m an AWS SA). Hence the questions on best practice.

    +

    *Thread Reply:* I am working through AWS native services seeing how they could, can, or do best integrate with openlineage (I’m an AWS SA). Hence the questions on best practice.

    Aligning with the Spark integration sounds like it might make sense then. Is there an example I could build from?

    @@ -31749,7 +31755,7 @@

    Supress success

    2021-11-15 17:56:17
    -

    *Thread Reply:* an example of reporting lineage? you can look at the Spark integration here - https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/

    +

    *Thread Reply:* an example of reporting lineage? you can look at the Spark integration here - https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/

    @@ -31775,7 +31781,7 @@

    Supress success

    2021-11-15 17:59:14
    -

    *Thread Reply:* Ahh, in that case I would have to agree with Michael’s approach to things!

    +

    *Thread Reply:* Ahh, in that case I would have to agree with Michael’s approach to things!

    @@ -31805,7 +31811,7 @@

    Supress success

    2021-11-19 03:30:03
    -

*Thread Reply:* @Michael Collado I am following the Spark integration you recommended (for a Glue job) and while everything appears to be set up correctly, I am getting no lineage appearing in Marquez (a request.get from the pyspark script can reach the endpoint). Is there a way to enable a debug log so I can look to identify where the issue is? +

*Thread Reply:* @Michael Collado I am following the Spark integration you recommended (for a Glue job) and while everything appears to be set up correctly, I am getting no lineage appearing in Marquez (a request.get from the pyspark script can reach the endpoint). Is there a way to enable a debug log so I can look to identify where the issue is? Is there a specific place to look in the regular logs?

    @@ -31832,7 +31838,7 @@

    Supress success

    2021-11-19 13:39:01
    -

    *Thread Reply:* listener output should be present in the driver logs. you can turn on debug logging in your log4j config (or whatever logging tool you use) for the package io.openlineage.spark.agent

    +

    *Thread Reply:* listener output should be present in the driver logs. you can turn on debug logging in your log4j config (or whatever logging tool you use) for the package io.openlineage.spark.agent
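As a sketch, one way to flip that switch from inside a PySpark script (this pokes at the JVM's log4j 1.x via py4j, so it assumes a Spark version still on log4j 1.x and an active SparkSession named spark):

```python
# Enable DEBUG logging for the OpenLineage listener package at runtime.
# Assumes `spark` is an active SparkSession and Spark is using log4j 1.x.
log4j = spark._jvm.org.apache.log4j
log4j.LogManager.getLogger("io.openlineage.spark.agent").setLevel(log4j.Level.DEBUG)
```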

    @@ -31978,7 +31984,7 @@

    Supress success

    2021-11-22 13:34:15
    -

    *Thread Reply:* Output event:

    +

    *Thread Reply:* Output event:

2021-11-22 08:12:15,513 INFO [spark-listener-group-shared] agent.OpenLineageContext (OpenLineageContext.java:emit(50)): Lineage completed successfully: ResponseMessage(responseCode=201, body=, error=null)
{ “eventType”: “COMPLETE”,

@@ -32033,7 +32039,7 @@

    Supress success

    2021-11-22 13:34:59
    -

    *Thread Reply:* This sink record is missing details …

    +

    *Thread Reply:* This sink record is missing details …

    2021-11-22 08:12:15,481 INFO [Thread-7] sinks.HadoopDataSink (HadoopDataSink.scala:$anonfun$writeDynamicFrame$1(275)): nameSpace: , table:

    @@ -32061,7 +32067,7 @@

    Supress success

    2021-11-22 13:40:30
    -

    *Thread Reply:* I can also see multiple history events (presumably for each transform, each as above) emitted for the same Glue Job, with different RunId, with the same inputs and the same (null) output.

    +

    *Thread Reply:* I can also see multiple history events (presumably for each transform, each as above) emitted for the same Glue Job, with different RunId, with the same inputs and the same (null) output.

    @@ -32087,7 +32093,7 @@

    Supress success

    2021-11-22 14:31:06
    -

    *Thread Reply:* Are you using the existing spark integration for the spark lineage?

    +

    *Thread Reply:* Are you using the existing spark integration for the spark lineage?

    @@ -32113,7 +32119,7 @@

    Supress success

    2021-11-22 14:46:47
    -

    *Thread Reply:* I followed: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark +

    *Thread Reply:* I followed: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark In the Glue context I was not clear on the correct settings for “spark.openlineage.parentJobName” and “spark.openlineage.parentRunId”, I put in static values (which may be incorrect)? I injected these via: "--conf": "spark.openlineage.parentJobName=nyc-taxi-raw-stage",
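For reference, here is roughly that configuration expressed as a PySpark session sketch - the host, namespace, and package version are illustrative placeholders (in a Glue job these settings would go into the --conf arguments rather than into code):

```python
from pyspark.sql import SparkSession

# Sketch of the OpenLineage listener settings from the integration README;
# host/namespace/parent values here are illustrative placeholders.
spark = (
    SparkSession.builder.appName("nyc-taxi-raw-stage")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.3.1")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.host", "http://marquez-host:5000")
    .config("spark.openlineage.version", "v1")
    .config("spark.openlineage.namespace", "spark_integration")
    .config("spark.openlineage.parentJobName", "nyc-taxi-raw-stage")
    .getOrCreate()
)
```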

    @@ -32141,7 +32147,7 @@

    Supress success

    2021-11-22 14:47:54
    -

    *Thread Reply:* Happy to share what is working when I am done, I can’t seem to find an AWS Glue specific example to walk me through.

    +

    *Thread Reply:* Happy to share what is working when I am done, I can’t seem to find an AWS Glue specific example to walk me through.

    @@ -32167,7 +32173,7 @@

    Supress success

    2021-11-22 15:03:31
    -

    *Thread Reply:* yeah, We haven’t spent any significant time with AWS Glue, but we just released the Databricks integration, which might help guide the way you’re working a little bit more

    +

    *Thread Reply:* yeah, We haven’t spent any significant time with AWS Glue, but we just released the Databricks integration, which might help guide the way you’re working a little bit more

    @@ -32193,7 +32199,7 @@

    Supress success

    2021-11-22 15:12:15
    -

    *Thread Reply:* from what I can see in the DBX integration (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/databricks) all of what is being done here I am doing in Glue (upload the jar, embed the settings into the Glue spark job). +

    *Thread Reply:* from what I can see in the DBX integration (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/databricks) all of what is being done here I am doing in Glue (upload the jar, embed the settings into the Glue spark job). It is emitting the above for each transform in the Glue job, but does not seem to capture the output …

    @@ -32220,7 +32226,7 @@

    Supress success

    2021-11-22 15:13:54
    -

    *Thread Reply:* Is there a standard Spark test script in use with openlineage I could put into Glue to test without using any Glue specific functionality (without for example the GlueContext, or Glue dynamic frames)?

    +

    *Thread Reply:* Is there a standard Spark test script in use with openlineage I could put into Glue to test without using any Glue specific functionality (without for example the GlueContext, or Glue dynamic frames)?

    @@ -32246,7 +32252,7 @@

    Supress success

    2021-11-22 15:25:30
    -

    *Thread Reply:* The initialisation does appear to be working if I compare it to the DBX README +

*Thread Reply:* The initialisation does appear to be working if I compare it to the DBX README. Mine from AWS Glue…
21/11/22 18:48:48 INFO SparkContext: Registered listener io.openlineage.spark.agent.OpenLineageSparkListener
21/11/22 18:48:49 INFO OpenLineageContext: Init OpenLineageContext: Args: ArgumentParser(host=http://ec2-….compute-1.amazonaws.com:5000, version=v1, namespace=spark_integration, jobName=default, parentRunId=null, apiKey=Optional.empty)
URI: http://ec2-….compute-1.amazonaws.com:5000/api/v1/lineage

@@ -32276,7 +32282,7 @@

    Supress success

    2021-11-22 16:12:40
    -

    *Thread Reply:* We don’t have a test run, unfortunately, but you could follow this blog post’s processes in each and see what the differences are? https://openlineage.io/blog/openlineage-spark/

    +

    *Thread Reply:* We don’t have a test run, unfortunately, but you could follow this blog post’s processes in each and see what the differences are? https://openlineage.io/blog/openlineage-spark/

    openlineage.io
    @@ -32323,7 +32329,7 @@

    Supress success

    2021-11-22 16:43:23
    -

    *Thread Reply:* Thanks, I have been looking at that. I will create a Glue job aligned with that. What is the best way to pass feedback? Keep it here?

    +

    *Thread Reply:* Thanks, I have been looking at that. I will create a Glue job aligned with that. What is the best way to pass feedback? Keep it here?

    @@ -32349,7 +32355,7 @@

    Supress success

    2021-11-22 16:49:50
    -

    *Thread Reply:* yeah, this thread will work great 🙂

    +

    *Thread Reply:* yeah, this thread will work great 🙂

    @@ -32375,7 +32381,7 @@

    Supress success

    2022-07-18 11:37:02
    -

*Thread Reply:* @Francis McGregor-Macdonald did you manage to enable it?

+

*Thread Reply:* @Francis McGregor-Macdonald did you manage to enable it?

    @@ -32401,7 +32407,7 @@

    Supress success

    2022-07-18 15:14:47
    -

*Thread Reply:* Just DM'd you the code I used a while back (app.py + CDK code). I haven't used it in a while, and there is some duplication in it. I had openlineage enabled, but dynamic frames not working yet with lineage. Let me know how you go. +

*Thread Reply:* Just DM'd you the code I used a while back (app.py + CDK code). I haven't used it in a while, and there is some duplication in it. I had openlineage enabled, but dynamic frames not working yet with lineage. Let me know how you go. I haven't had the space to look at it in a while, but happy to support if you are looking at it.

    @@ -32454,7 +32460,7 @@

    Supress success

    2021-11-23 09:01:11
    -

    *Thread Reply:* You can use this: https://github.com/amundsen-io/amundsen/pull/1444

    +

    *Thread Reply:* You can use this: https://github.com/amundsen-io/amundsen/pull/1444

    @@ -32532,7 +32538,7 @@

    Supress success

    2021-11-23 09:38:44
    -

    *Thread Reply:* you can also check out this section from the Amundsen Community Meeting in october: https://www.youtube.com/watch?v=7WgECcmLSRk

    +

    *Thread Reply:* you can also check out this section from the Amundsen Community Meeting in october: https://www.youtube.com/watch?v=7WgECcmLSRk

    YouTube
    @@ -32618,7 +32624,7 @@

    Supress success

    2021-11-23 12:45:34
    -

    *Thread Reply:* No, I believe the databuilder OpenLineage extractor for Amundsen will continue to store lineage metadata in Atlas

    +

    *Thread Reply:* No, I believe the databuilder OpenLineage extractor for Amundsen will continue to store lineage metadata in Atlas

    @@ -32644,7 +32650,7 @@

    Supress success

    2021-11-23 12:47:01
    -

    *Thread Reply:* We've spoken to the Amundsen team, and though using Marquez to store lineage metadata isn't an option, it's an integration that makes sense but hasn't yet been prioritized

    +

    *Thread Reply:* We've spoken to the Amundsen team, and though using Marquez to store lineage metadata isn't an option, it's an integration that makes sense but hasn't yet been prioritized

    @@ -32670,7 +32676,7 @@

    Supress success

    2021-11-23 13:51:00
    -

*Thread Reply:* Thanks. Right now Amundsen has no support for lineage extraction from Spark or Airflow; in this case, do we need to use Marquez for the OpenLineage implementation to capture the lineage from Airflow & Spark?

+

*Thread Reply:* Thanks. Right now Amundsen has no support for lineage extraction from Spark or Airflow; in this case, do we need to use Marquez for the OpenLineage implementation to capture the lineage from Airflow & Spark?

    @@ -32696,7 +32702,7 @@

    Supress success

    2021-11-23 13:57:13
    -

*Thread Reply:* Maybe, but that would mean running the full Amundsen stack as well as the Marquez stack alongside each other (not ideal). The OpenLineage integration for Amundsen is very recent, so I haven't had a chance to look deeply into the implementation. But, briefly looking over the config for OpenLineageTableLineageExtractor, you can only send metadata to Atlas

+

*Thread Reply:* Maybe, but that would mean running the full Amundsen stack as well as the Marquez stack alongside each other (not ideal). The OpenLineage integration for Amundsen is very recent, so I haven't had a chance to look deeply into the implementation. But, briefly looking over the config for OpenLineageTableLineageExtractor, you can only send metadata to Atlas

    @@ -32722,7 +32728,7 @@

    Supress success

    2021-11-24 00:36:56
    -

*Thread Reply:* @Willy Lulciuc that's our real concern: running the two stacks will make a messy environment. Let me explain our Amundsen setup: we have neo4j as the backend (front end, search service, metadata service, elasticsearch & neo4j). Our requirement is to capture lineage from Spark and Airflow and import it into Amundsen

+

*Thread Reply:* @Willy Lulciuc that's our real concern: running the two stacks will make a messy environment. Let me explain our Amundsen setup: we have neo4j as the backend (front end, search service, metadata service, elasticsearch & neo4j). Our requirement is to capture lineage from Spark and Airflow and import it into Amundsen

    @@ -32748,7 +32754,7 @@

    Supress success

    2022-03-11 22:33:39
    -

    *Thread Reply:* We are running into a similar issue. @Dinakar Sundar were you able to get the Amundsen OpenLineage integration to work with a neo4j backend?

    +

    *Thread Reply:* We are running into a similar issue. @Dinakar Sundar were you able to get the Amundsen OpenLineage integration to work with a neo4j backend?

    @@ -32800,7 +32806,7 @@

    Supress success

    2021-11-24 11:49:14
    -

    *Thread Reply:* You can take a look at https://github.com/OpenLineage/OpenLineage/projects

    +

    *Thread Reply:* You can take a look at https://github.com/OpenLineage/OpenLineage/projects

    @@ -32888,7 +32894,7 @@

    Supress success

    2021-12-01 16:35:17
    -

    *Thread Reply:* Anyone? Thanks!

    +

    *Thread Reply:* Anyone? Thanks!

    @@ -33022,7 +33028,7 @@

    Supress success

    2021-11-30 07:03:30
    2021-11-30 11:21:34
    -

    *Thread Reply:* I have not! Will give it a try, Maciej! Thank you for the reply!

    +

    *Thread Reply:* I have not! Will give it a try, Maciej! Thank you for the reply!

    @@ -33078,7 +33084,7 @@

    Supress success

    2021-11-30 15:20:18
    -

    *Thread Reply:* 🙏 @Maciej Obuchowski we're not worthy! That was the magic we needed. Seems like a hack since we're snooping in on private classes but if it works...

    +

    *Thread Reply:* 🙏 @Maciej Obuchowski we're not worthy! That was the magic we needed. Seems like a hack since we're snooping in on private classes but if it works...

    Thank you so much for pointing to those utilities!

    @@ -33110,7 +33116,7 @@

    Supress success

    2021-11-30 15:48:25
    -

    *Thread Reply:* Glad I could help!

    +

    *Thread Reply:* Glad I could help!

    @@ -33162,7 +33168,7 @@

    Supress success

    2021-12-01 08:51:28
    -

    *Thread Reply:* Different concepts. OL is focused on describing the lineage and metadata of the running jobs. So it keeps track of all the metadata (schema, ...) of inputs and outputs at the time transformation occurs + transformation metadata (code version, cost, etc.)

    +

    *Thread Reply:* Different concepts. OL is focused on describing the lineage and metadata of the running jobs. So it keeps track of all the metadata (schema, ...) of inputs and outputs at the time transformation occurs + transformation metadata (code version, cost, etc.)

OM I am not an expert on, but it's a metadata model with clients and an API around it.

    @@ -33216,7 +33222,7 @@

    Supress success

    2021-12-01 12:40:53
    -

*Thread Reply:* Hey. For technical reasons, we can't automatically register a macro that does this job, as we could in the Airflow 1 integration. You could add it yourself:

+

*Thread Reply:* Hey. For technical reasons, we can't automatically register a macro that does this job, as we could in the Airflow 1 integration. You could add it yourself:

    @@ -33242,7 +33248,7 @@

    Supress success

    2021-12-01 12:41:02
    -

*Thread Reply:* ```def lineage_parent_id(run_id, task): +

*Thread Reply:* ```def lineage_parent_id(run_id, task):
    """
    Macro function which returns the generated job and run id for a given task. This
    can be used to forward the ids from a task to a child run so the job

@@ -33300,7 +33306,7 @@

    Supress success

    2021-12-01 12:41:13
    -

    *Thread Reply:* from here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/dag.py#L77

    +

    *Thread Reply:* from here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/dag.py#L77
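A sketch of wiring that up manually in Airflow 2, assuming the lineage_parent_id function above has been copied into the DAG file (the DAG id, date, and operator conf are placeholders):

```python
from datetime import datetime

from airflow import DAG

# Register the macro on the DAG yourself, since the Airflow 2 integration
# can't auto-register it; assumes lineage_parent_id was pasted into this file.
dag = DAG(
    dag_id="my_dag",
    start_date=datetime(2021, 12, 1),
    user_defined_macros={"lineage_parent_id": lineage_parent_id},
)

# A templated operator field can then forward the id to a child job, e.g.:
#   "spark.openlineage.parentRunId": "{{ lineage_parent_id(run_id, task) }}"
```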

    @@ -33351,7 +33357,7 @@

    Supress success

    2021-12-01 12:53:27
    -

    *Thread Reply:* the quickest response ever! And that works like a charm 🙌

    +

    *Thread Reply:* the quickest response ever! And that works like a charm 🙌

    @@ -33381,7 +33387,7 @@

    Supress success

    2021-12-01 13:21:16
    -

    *Thread Reply:* Glad I could help!

    +

    *Thread Reply:* Glad I could help!

    @@ -33433,7 +33439,7 @@

    Supress success

    2021-12-01 14:18:00
    -

    *Thread Reply:* For SQL, EXPLAIN EXTENDED and show() in scala-shell is helpful:

    +

    *Thread Reply:* For SQL, EXPLAIN EXTENDED and show() in scala-shell is helpful:

    spark.sql("EXPLAIN EXTENDED CREATE TABLE tbl USING delta LOCATION '/tmp/delta' AS SELECT ** FROM tmp").show(false) ```|== Parsed Logical Plan == @@ -33481,7 +33487,7 @@

    Supress success

    2021-12-01 14:27:25
    -

*Thread Reply:* For the dataframe API, I'm usually just either logging the plan to the console from the OpenLineage listener, or looking at the spark.logicalPlan or spark_unknown facets sent by the listener - even when the particular write operation isn't supported by the integration, those facets should have some relevant info.

+

*Thread Reply:* For the dataframe API, I'm usually just either logging the plan to the console from the OpenLineage listener, or looking at the spark.logicalPlan or spark_unknown facets sent by the listener - even when the particular write operation isn't supported by the integration, those facets should have some relevant info.

    @@ -33511,7 +33517,7 @@

    Supress success

    2021-12-01 14:27:40
    -

*Thread Reply:* For example, for the query I've sent in the comment above, the spark.logicalPlan facet looks like this:

+

*Thread Reply:* For example, for the query I've sent in the comment above, the spark.logicalPlan facet looks like this:

"spark.logicalPlan": { "_producer": "<https://github.com/OpenLineage/OpenLineage/tree/0.4.0-SNAPSHOT/integration/spark>",

@@ -33601,7 +33607,7 @@

    Supress success

    2021-12-01 14:38:55
    -

    *Thread Reply:* Okay! That is very helpful! I wasn't sure if there was a fancier trick but I can definitely do logging 🙂 Our challenge was that our proprietary packages were resulting in Null Pointer Exceptions when it tried to push to OpenLineage 😞

    +

    *Thread Reply:* Okay! That is very helpful! I wasn't sure if there was a fancier trick but I can definitely do logging 🙂 Our challenge was that our proprietary packages were resulting in Null Pointer Exceptions when it tried to push to OpenLineage 😞

    @@ -33627,7 +33633,7 @@

    Supress success

    2021-12-01 14:39:02
    -

    *Thread Reply:* Thank you as usual!!

    +

    *Thread Reply:* Thank you as usual!!

    @@ -33653,7 +33659,7 @@

    Supress success

    2021-12-01 14:40:25
    -

    *Thread Reply:* You can always add test cases and add breakpoints to debug in your IDE. That doesn't work for the container tests, but it does work for the other ones

    +

    *Thread Reply:* You can always add test cases and add breakpoints to debug in your IDE. That doesn't work for the container tests, but it does work for the other ones

    @@ -33679,7 +33685,7 @@

    Supress success

    2021-12-01 14:47:20
    -

    *Thread Reply:* Ah! That's a great point! I definitely would appreciate being able to poke at the objects interactively in a debug mode. Thank you for the guidance as well!

    +

    *Thread Reply:* Ah! That's a great point! I definitely would appreciate being able to poke at the objects interactively in a debug mode. Thank you for the guidance as well!

    @@ -33897,7 +33903,7 @@

    Supress success

    2021-12-03 11:55:17
    -

    *Thread Reply:* Hi Ricardo - OpenLineage doesn’t currently have support for field-level lineage, but it’s definitely something we’ve been looking into. This is a great collection of resources 🙂

    +

    *Thread Reply:* Hi Ricardo - OpenLineage doesn’t currently have support for field-level lineage, but it’s definitely something we’ve been looking into. This is a great collection of resources 🙂

    We’ve to-date been working on our integrations library, making it as easy to set up as possible.

    @@ -33925,7 +33931,7 @@

    Supress success

    2021-12-03 12:01:25
    -

    *Thread Reply:* Thanks John! I was checking the issues on github and other posts here. Just wanted to clarify that. +

    *Thread Reply:* Thanks John! I was checking the issues on github and other posts here. Just wanted to clarify that. I’ll keep an eye on it

    @@ -33985,7 +33991,7 @@

    Supress success

    2021-12-06 20:28:09
    -

    *Thread Reply:* The link to join the meeting is on the wiki: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

    +

    *Thread Reply:* The link to join the meeting is on the wiki: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

    @@ -34011,7 +34017,7 @@

    Supress success

    2021-12-06 20:28:25
    -

    *Thread Reply:* Please reach out to me if you’d like to be added to a gcal invite

    +

    *Thread Reply:* Please reach out to me if you’d like to be added to a gcal invite

    @@ -34063,7 +34069,7 @@

    Supress success

    2021-12-08 02:03:37
    -

    *Thread Reply:* Hi Dinakar. Can you give some specifics regarding what kind of problem you're running into?

    +

    *Thread Reply:* Hi Dinakar. Can you give some specifics regarding what kind of problem you're running into?

    @@ -34089,7 +34095,7 @@

    Supress success

    2021-12-09 10:15:50
    -

*Thread Reply:* Hi @Michael Collado, we were able to set the Spark configuration for the extra listener & placed the jars as well; when I ran the Spark job, lineage did not get tracked into Marquez

+

*Thread Reply:* Hi @Michael Collado, we were able to set the Spark configuration for the extra listener & placed the jars as well; when I ran the Spark job, lineage did not get tracked into Marquez

    @@ -34115,7 +34121,7 @@

    Supress success

    2021-12-09 10:34:39
    -

    *Thread Reply:* {"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark","schemaURL":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark/facets/spark/v1/output-statistics-facet.json","rowCount":0,"size":-1,"status":"DEPRECATED"}},"outputFacets":{"outputStatistics":{"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark","schemaURL":"https://openlineage.io/spec/facets/1-0-0/OutputStatisticsOutputDatasetFacet.json#/$defs/OutputStatisticsOutputDatasetFacet","rowCount":0,"size":-1}}}],"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark","schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunEvent"} +

    *Thread Reply:* {"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark","schemaURL":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark/facets/spark/v1/output-statistics-facet.json","rowCount":0,"size":-1,"status":"DEPRECATED"}},"outputFacets":{"outputStatistics":{"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark","schemaURL":"https://openlineage.io/spec/facets/1-0-0/OutputStatisticsOutputDatasetFacet.json#/$defs/OutputStatisticsOutputDatasetFacet","rowCount":0,"size":-1}}}],"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark","schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunEvent"} OpenLineageHttpException(code=0, message=java.lang.IllegalArgumentException: Cannot construct instance of io.openlineage.spark.agent.client.HttpError (although at least one Creator exists): no String-argument constructor/factory method to deserialize from String value ('{"code":404,"message":"HTTP 404 Not Found"}') at [Source: UNKNOWN; line: -1, column: -1], details=java.util.concurrent.CompletionException: java.lang.IllegalArgumentException: Cannot construct instance of io.openlineage.spark.agent.client.HttpError (although at least one Creator exists): no String-argument constructor/factory method to deserialize from String value ('{"code":404,"message":"HTTP 404 Not Found"}') at [Source: UNKNOWN; line: -1, column: -1]) @@ -34163,7 +34169,7 @@

    Supress success

    2021-12-09 13:29:42
    -

*Thread Reply:* Issue solved - I had mentioned the version wrongly as 1 instead of v1

+

*Thread Reply:* Issue solved - I had mentioned the version wrongly as 1 instead of v1

    @@ -34249,7 +34255,7 @@

    Supress success

    2021-12-08 12:15:38
    -

    *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/400/files

    +

    *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/400/files

there's an ongoing PR for the proxy backend, which opens an HTTP API and redirects events to Kafka.

    @@ -34277,7 +34283,7 @@

    Supress success

    2021-12-08 12:17:38
    -

    *Thread Reply:* Hi Kavuri, as minkyu said, there's currently work going on to simplify this process.

    +

    *Thread Reply:* Hi Kavuri, as minkyu said, there's currently work going on to simplify this process.

    For now, you'll need to make something to capture the HTTP api events and send them to the Kafka topic. Changing the spark.openlineage.url parameter will send the runEvents wherever you like, but obviously you can't directly produce HTTP events to a topic

    @@ -34305,7 +34311,7 @@

    Supress success

    2021-12-08 22:13:09
    -

*Thread Reply:* Many thanks for the reply. As I understand it, pushing lineage to a Kafka topic is not there yet; it is under implementation. If you can help me understand which version it is going to be present in, that will help me a lot. Thanks in advance.

+

*Thread Reply:* Many thanks for the reply. As I understand it, pushing lineage to a Kafka topic is not there yet; it is under implementation. If you can help me understand which version it is going to be present in, that will help me a lot. Thanks in advance.

    @@ -34331,7 +34337,7 @@

    Supress success

    2021-12-09 12:57:10
    -

*Thread Reply:* Not sure about the release plan, but the http endpoint is just a regular RESTful API, and you will be able to write a super simple proxy for your own use case if you want.

+

*Thread Reply:* Not sure about the release plan, but the http endpoint is just a regular RESTful API, and you will be able to write a super simple proxy for your own use case if you want.
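As an illustration of how small such a proxy can be, a toy sketch (assumes Flask and kafka-python; the topic name, ports, and broker address are made up):

```python
from flask import Flask, request
from kafka import KafkaProducer  # kafka-python

app = Flask(__name__)
producer = KafkaProducer(bootstrap_servers="localhost:9092")

@app.route("/api/v1/lineage", methods=["POST"])
def lineage():
    # Forward the raw OpenLineage runEvent bytes to a Kafka topic.
    producer.send("openlineage-events", request.get_data())
    return "", 201

if __name__ == "__main__":
    app.run(port=5000)
```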

    @@ -34456,7 +34462,7 @@

    Supress success

    2021-12-12 07:34:13
    -

*Thread Reply:* One thing: you're looking at Ryan's fork of Spark, which is a few thousand commits behind head 🙂

+

*Thread Reply:* One thing: you're looking at Ryan's fork of Spark, which is a few thousand commits behind head 🙂

    This one should be good: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala#L72

    @@ -34495,7 +34501,7 @@

    Supress success

    2021-12-12 07:37:04
    -

    *Thread Reply:* In this case, we basically need more info about nature of this DataSourceV2Relation, because this is provider-dependent. We have Iceberg in main branch and Delta here: https://github.com/OpenLineage/OpenLineage/pull/393/files#diff-7b66a9bd5905f4ba42914b73a87d834c1321ebcf75137c1e2a2413c0d85d9db6

    +

    *Thread Reply:* In this case, we basically need more info about nature of this DataSourceV2Relation, because this is provider-dependent. We have Iceberg in main branch and Delta here: https://github.com/OpenLineage/OpenLineage/pull/393/files#diff-7b66a9bd5905f4ba42914b73a87d834c1321ebcf75137c1e2a2413c0d85d9db6

    @@ -34521,7 +34527,7 @@

    Supress success

    2021-12-13 14:54:13
    -

    *Thread Reply:* Ah! Maciej! As always, thank you! Looking through the DataSourceV2RelationVisitor you provided, it looks like the connector (Azure Cosmos Db) doesn't provide that Provider property 😞 😞 😞

    +

    *Thread Reply:* Ah! Maciej! As always, thank you! Looking through the DataSourceV2RelationVisitor you provided, it looks like the connector (Azure Cosmos Db) doesn't provide that Provider property 😞 😞 😞

    Is there any other method for determining the type of DataSourceV2Relation?

    @@ -34549,7 +34555,7 @@

    Supress success

    2021-12-13 14:57:06
    -

    *Thread Reply:* And, to make sure I close out on my original question, it was as simple as the code that Maciej was using:

    +

    *Thread Reply:* And, to make sure I close out on my original question, it was as simple as the code that Maciej was using:

    I merely needed to use DataSourceV2Realtion rather than NamedRelation!

    @@ -34581,7 +34587,7 @@

    Supress success

    2021-12-15 06:20:31
    -

    *Thread Reply:* Are we talking about this connector? https://github.com/Azure/azure-sdk-for-java/blob/934200f63dc5bc7d5502a95f8daeb8142[…]/src/main/scala/com/azure/cosmos/spark/ItemsReadOnlyTable.scala

    +

    *Thread Reply:* Are we talking about this connector? https://github.com/Azure/azure-sdk-for-java/blob/934200f63dc5bc7d5502a95f8daeb8142[…]/src/main/scala/com/azure/cosmos/spark/ItemsReadOnlyTable.scala

    @@ -34632,7 +34638,7 @@

    Supress success

    2021-12-15 06:22:05
    -

    *Thread Reply:* I guess you can use object.getClass.getCanonicalName() to find if the passed class matches the one that Cosmos provider uses.

    +

    *Thread Reply:* I guess you can use object.getClass.getCanonicalName() to find if the passed class matches the one that Cosmos provider uses.

    @@ -34658,7 +34664,7 @@

    Supress success

    2021-12-15 09:53:24
    -

    *Thread Reply:* Yes! That's the one, Maciej! I will give getCanonicalName a try but also make a PR into that repo to get the provider property set up correctly 🙂

    +

    *Thread Reply:* Yes! That's the one, Maciej! I will give getCanonicalName a try but also make a PR into that repo to get the provider property set up correctly 🙂

    @@ -34684,7 +34690,7 @@

    Supress success

    2021-12-15 09:53:28
    -

    *Thread Reply:* Thank you so much!

    +

    *Thread Reply:* Thank you so much!

    @@ -34710,7 +34716,7 @@

    Supress success

    2021-12-15 10:09:39
    -

    *Thread Reply:* Glad to help 😄

    +

    *Thread Reply:* Glad to help 😄

    @@ -34736,7 +34742,7 @@

    Supress success

    2021-12-15 10:22:58
    -

    *Thread Reply:* @Will Johnson could you tell on which commands from https://github.com/OpenLineage/OpenLineage/issues/368#issue-1038510649 you'll be working?

    +

    *Thread Reply:* @Will Johnson could you tell on which commands from https://github.com/OpenLineage/OpenLineage/issues/368#issue-1038510649 you'll be working?

    @@ -34762,7 +34768,7 @@

    Supress success

    2021-12-15 10:24:14
    -

    *Thread Reply:* If any, of course 🙂

    +

    *Thread Reply:* If any, of course 🙂

    @@ -34788,7 +34794,7 @@

    Supress success

    2021-12-15 10:49:31
    -

*Thread Reply:* From all of our tests on that Cosmos connector, it looks like it strictly uses the AppendData operation. However, @Harish Sune is looking at more of these commands from a Delta data source.

+

*Thread Reply:* From all of our tests on that Cosmos connector, it looks like it strictly uses the AppendData operation. However, @Harish Sune is looking at more of these commands from a Delta data source.

    @@ -34818,7 +34824,7 @@

    Supress success

    2021-12-22 22:43:34
    -

    *Thread Reply:* Just to close the loop on this one - I submitted a PR for the work we've been doing. Looking forward to any feedback! https://github.com/OpenLineage/OpenLineage/pull/450

    +

    *Thread Reply:* Just to close the loop on this one - I submitted a PR for the work we've been doing. Looking forward to any feedback! https://github.com/OpenLineage/OpenLineage/pull/450

    @@ -34906,7 +34912,7 @@

    Supress success

    2021-12-23 05:04:36
    -

    *Thread Reply:* Thanks @Will Johnson! I added one question about dataset naming.

    +

    *Thread Reply:* Thanks @Will Johnson! I added one question about dataset naming.

    @@ -35007,7 +35013,7 @@

    Supress success

    2021-12-15 10:54:41
    -

    *Thread Reply:* Yes! This is awesome!! How might this work for an existing command like the DataSourceV2Visitor.

    +

    *Thread Reply:* Yes! This is awesome!! How might this work for an existing command like the DataSourceV2Visitor.

    Right now, OpenLineage checks based on the provider property if it's an Iceberg or Delta provider.

    @@ -35039,7 +35045,7 @@

    Supress success

    2021-12-15 11:13:20
    -

    *Thread Reply:* Resolving this would be nice addition to the doc (and, to the implementation) - currently, we're just returning result of first function for which isDefinedAt is satisfied.

    +

    *Thread Reply:* Resolving this would be nice addition to the doc (and, to the implementation) - currently, we're just returning result of first function for which isDefinedAt is satisfied.

    This means, that we can depend on the order of the visitors...

    @@ -35067,7 +35073,7 @@

    Supress success

    2021-12-15 13:59:12
    -

    *Thread Reply:* great question. For posterity, I'd like to move this to the PR discussion. I'll address the question there.

    +

    *Thread Reply:* great question. For posterity, I'd like to move this to the PR discussion. I'll address the question there.

    @@ -35161,7 +35167,7 @@

    Supress success

    2021-12-15 11:50:47
    -

    *Thread Reply:* hmm, actually the only documentation we have right now is on the demo.datakin.com site https://demo.datakin.com/onboarding . The great expectations tab should be enough to get you started

    +

    *Thread Reply:* hmm, actually the only documentation we have right now is on the demo.datakin.com site https://demo.datakin.com/onboarding . The great expectations tab should be enough to get you started

    demo.datakin.com
    @@ -35208,7 +35214,7 @@

    Supress success

    2021-12-15 11:51:04
    -

    *Thread Reply:* I'll open a ticket to copy that documentation to the OpenLineage site repo

    +

    *Thread Reply:* I'll open a ticket to copy that documentation to the OpenLineage site repo

    @@ -35264,7 +35270,7 @@

    Supress success

    2021-12-15 19:01:50
    -

    *Thread Reply:* Hi! We don't have any integration with deequ yet. We have a structure for recording data quality assertions and statistics, though - see https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DataQualityAssertionsDatasetFacet.json and https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DataQualityMetricsInputDatasetFacet.json for the specs.

    +

    *Thread Reply:* Hi! We don't have any integration with deequ yet. We have a structure for recording data quality assertions and statistics, though - see https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DataQualityAssertionsDatasetFacet.json and https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DataQualityMetricsInputDatasetFacet.json for the specs.

    Check the great expectations integration to see how those facets are being used
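For a rough idea of the shape those schemas define (all values below are invented - the linked JSON specs are authoritative), an input dataset carrying the metrics facet might look like:

```python
# Sketch of an input dataset with the data quality metrics facet attached.
input_dataset = {
    "namespace": "my-namespace",
    "name": "orders",
    "inputFacets": {
        "dataQualityMetrics": {
            "_producer": "https://example.com/my-producer",
            "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/DataQualityMetricsInputDatasetFacet.json",
            "rowCount": 1000,
            "columnMetrics": {"order_id": {"nullCount": 0, "distinctCount": 1000}},
        }
    },
}
```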

    @@ -35292,7 +35298,7 @@

    Supress success

    2022-05-24 06:20:50
    -

    *Thread Reply:* This is great. Thanks @Michael Collado!

    +

    *Thread Reply:* This is great. Thanks @Michael Collado!

    @@ -35381,7 +35387,7 @@

    Supress success

    2021-12-20 04:15:51
    -

*Thread Reply:* There is nothing to show here when you click on the test node, right? What about the run node?

+

*Thread Reply:* There is nothing to show here when you click on the test node, right? What about the run node?

    @@ -35407,7 +35413,7 @@

    Supress success

    2021-12-20 12:28:21
    -

*Thread Reply:* There are no details about the failure.

+

*Thread Reply:* There are no details about the failure.

```dbt-ol build -t DEV --profile cdp --profiles-dir /c/Work/dbt/cdp100/profiles --project-dir /c/Work/dbt/cdp100 --select +riskrawmastersharedshareclass
Running OpenLineage dbt wrapper version 0.4.0

@@ -35495,7 +35501,7 @@

    Supress success

    2021-12-20 12:30:20
    -

    *Thread Reply:* I'm talking on clicking on non-test node in Marquez UI - the screenshots shared show you clicked on the one ending in test

    +

    *Thread Reply:* I'm talking on clicking on non-test node in Marquez UI - the screenshots shared show you clicked on the one ending in test

    @@ -35521,7 +35527,7 @@

    Supress success

    2021-12-20 16:46:11
    -

    *Thread Reply:* There are two types of failures: tests failed on stage model (relationships) and physical error in master model (no table with such name). The stage test node in Marquez does not show any indication of failures and dataset node indicates failure but without number of failed records or table name for persistent test storage. The failed master model shows in red but no details of failure. Master model tests were skipped because of model failure but UI reports "Complete".

    +

    *Thread Reply:* There are two types of failures: tests failed on stage model (relationships) and physical error in master model (no table with such name). The stage test node in Marquez does not show any indication of failures and dataset node indicates failure but without number of failed records or table name for persistent test storage. The failed master model shows in red but no details of failure. Master model tests were skipped because of model failure but UI reports "Complete".

    @@ -35601,7 +35607,7 @@

    Supress success

    2021-12-20 18:11:50
    -

    *Thread Reply:* If I understood correctly, for model you would like OpenLineage to capture message error, like this one +

    *Thread Reply:* If I understood correctly, for model you would like OpenLineage to capture message error, like this one 22:52:07 Database Error in model customers (models/customers.sql) 22:52:07 Syntax error: Expected "(" or keyword SELECT or keyword WITH but got identifier "PLEASE_REMOVE" at [56:12] 22:52:07 compiled SQL at target/run/jaffle_shop/models/customers.sql @@ -35649,7 +35655,7 @@

    Supress success

    2021-12-20 18:23:12
    -

    *Thread Reply:* We actually do the first one for Airflow and Spark, I've missed it for dbt 😞

    +

    *Thread Reply:* We actually do the first one for Airflow and Spark, I've missed it for dbt 😞

    Created issue to add it to spec in a generic way: https://github.com/OpenLineage/OpenLineage/issues/446

    @@ -35764,7 +35770,7 @@

    Supress success

    2021-12-20 22:49:54
    -

    *Thread Reply:* Sounds great. Failed/Skipped Tests/Models could be color-coded as well. Thanks.

    +

    *Thread Reply:* Sounds great. Failed/Skipped Tests/Models could be color-coded as well. Thanks.

    @@ -35825,7 +35831,7 @@

2021-12-22 12:38:26

*Thread Reply:* Hey. If you're using Airflow 2, you should use the LineageBackend method described here: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#airflow-21-experimental
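For anyone skimming, the README method above boils down to a couple of environment variables on the scheduler. A minimal sketch, written out in Python for illustration - the URL and namespace values are typical examples, not from this thread:

```
# A minimal sketch of the LineageBackend configuration described in the
# README above. These are normally set in the scheduler's environment
# (e.g. docker-compose); the URL and namespace values are examples.
import os

# Tell Airflow's lineage mechanism to use openlineage-airflow's backend.
os.environ["AIRFLOW__LINEAGE__BACKEND"] = "openlineage.lineage_backend.OpenLineageBackend"
# Where events should be sent, e.g. a Marquez instance.
os.environ["OPENLINEAGE_URL"] = "http://localhost:5000"
# Optional namespace under which jobs are reported.
os.environ["OPENLINEAGE_NAMESPACE"] = "my_namespace"
```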

2021-12-22 12:39:06

*Thread Reply:* You don't need to do anything with DAG import then.

2021-12-22 12:40:30

*Thread Reply:* Thanks!!!!! i'll try

2021-12-27 16:49:56

*Thread Reply:* @Will Johnson @Maciej Obuchowski FYI 👆

2022-01-12 11:59:44

*Thread Reply:* ^ It's happening now!

2022-01-18 14:19:39

*Thread Reply:* Just to be clear, were you able to get the datasource information from the API, but it's just not showing up in the UI? Or weren't you able to get it from the API either?

2022-01-18 11:40:00

*Thread Reply:* It does generally work, but there's a known limitation in that only successful task runs are reported to the lineage backend. This is planned to be fixed in Airflow 2.3.

2022-01-18 20:35:52

*Thread Reply:* thank you. 🙂

2022-01-18 11:30:38

*Thread Reply:* hey, I'm aware of one small bug (which will be fixed in the upcoming OpenLineage 0.5.0) which means you would also have to include google-cloud-bigquery in your requirements.txt. This is the bug: https://github.com/OpenLineage/OpenLineage/issues/438

2022-01-18 11:31:51

*Thread Reply:* The other thing I think you should check is: did you definitely define the AIRFLOW__LINEAGE__BACKEND variable correctly? What you pasted above looks a little odd with the 2 = signs

2022-01-18 11:34:25

*Thread Reply:* I'm looking at a task log inside my own Airflow and I see msgs like:
INFO - Constructing openlineage client to send events to

2022-01-18 11:34:47

*Thread Reply:* ^ i.e. I think by checking the task logs you can see if it's at least attempting to send data

2022-01-18 11:34:52

*Thread Reply:* hope this helps!

2022-01-18 20:40:37

*Thread Reply:* Thank you, will try again.

2022-01-19 12:39:30

*Thread Reply:* BTW, this was actually the 0.5.1 release. Because, pypi... 🤷‍♂️

2022-01-27 06:45:08

*Thread Reply:* nice on the dbt-spark support 👍

2022-01-19 11:13:35

*Thread Reply:* Are you using docker on mac with "Use Docker Compose V2" enabled?

We just found yesterday that it somehow breaks our example...

2022-01-19 11:14:51

*Thread Reply:* yes i just installed docker on mac

2022-01-19 11:15:02

*Thread Reply:* and docker compose version 1.29.2

2022-01-19 11:20:24

*Thread Reply:* What you can do is to uncheck this, do docker system prune -a and try again.

2022-01-19 11:21:56

*Thread Reply:* done, but i get this: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

2022-01-19 11:22:15

*Thread Reply:* Try to restart docker for mac

2022-01-19 11:23:00

*Thread Reply:* It needs to show "Docker Desktop is running":

2022-01-19 11:24:01

*Thread Reply:* yeah, done. I will try to implement the example again and see. thank you very much

2022-01-19 11:32:55

*Thread Reply:* i don't know why i'm getting this when i run $ docker-compose up:
WARNING: The TAG variable is not set. Defaulting to a blank string.
WARNING: The APIPORT variable is not set. Defaulting to a blank string.

2022-01-19 11:46:12

*Thread Reply:* are you running it exactly like here, with respect to directories, etc?

https://github.com/MarquezProject/marquez/tree/main/examples/airflow

2022-01-19 11:59:36

*Thread Reply:* yeah yeah, my bad. everything works fine now. I see the graph in the ui

2022-01-19 12:04:01

*Thread Reply:* one more question plz. As i said, our etls are based on postgres, redshift, and airflow. any advice for us on integrating OL into our pipeline?

2022-01-19 17:50:39

*Thread Reply:* Hi Kevin - to my understanding that's correct. Do you guys have a custom extractor using this?

2022-01-19 20:49:49

*Thread Reply:* Thanks John! We have custom code emitting OL events within our ingestion pipeline and it includes a custom facet. I'll refactor the code to the new format and should be good to go.

2022-01-21 00:34:37

*Thread Reply:* Just to follow up, this code update worked as expected and we are all good on the upgrade.

2022-01-25 14:46:58

*Thread Reply:* Hm, that is odd. Usually there are a few lines in the DAG log from the OpenLineage bits. I'd expect to see something about not having an extractor for the operator you are using.

2022-01-25 14:47:53

*Thread Reply:* If you open a shell in your Airflow Scheduler container and check for the presence of AIRFLOW__LINEAGE__BACKEND, is it properly set? It's possible the env isn't making it all the way there.

2022-01-21 14:03:29

*Thread Reply:* Hi Lena! That's correct, Openlineage is designed to send events to an HTTP backend. There's a ticket in the future section of the roadmap to support pushing to Amundsen, but it's not yet been worked on (Ref: Roadmap Issue #86)

2022-01-21 14:08:35

*Thread Reply:* Thank you for the info!

2022-01-30 12:37:39

*Thread Reply:* what do you mean by "Integrate Openlineage"?

Can you give a little more information on what you're trying to accomplish and what the existing project is?

2022-01-31 03:49:22

*Thread Reply:* I work in a datalake team and we are trying to implement data lineage in our project using openlineage. our project basically keeps track of datasets coming from different sources (hive, redshift, elasticsearch, etc.) and jobs.

2022-01-31 15:01:31

*Thread Reply:* Gotcha!

Broadly speaking, all an integration needs to do is to send runEvents to Marquez.
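To make "send runEvents" concrete, here is a minimal sketch with the openlineage-python client - the namespace, job name, producer URI, and local Marquez URL are all illustrative assumptions:

```
# A minimal sketch (illustrative values): emitting START and COMPLETE
# runEvents for one run to a Marquez instance listening on localhost.
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")

run_id = str(uuid4())
job = Job(namespace="my-namespace", name="my-job")

# START and COMPLETE are separate, additive events tied by the same runId.
for state in (RunState.START, RunState.COMPLETE):
    client.emit(
        RunEvent(
            eventType=state,
            eventTime=datetime.now(timezone.utc).isoformat(),
            run=Run(runId=run_id),
            job=job,
            producer="https://example.com/my-integration",
        )
    )
```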

2022-02-15 15:28:03

*Thread Reply:* I suppose OpenLineage itself only defines the standard/protocol to design your data model. To be able to visualize/trace the lineage, you either have to implement it yourself with the standard data models or include Marquez in your project. You would need to use the HTTP API to send lineage events from your Java project to Marquez in this case.

2022-02-16 11:17:13

*Thread Reply:* Exactly! This project also includes connectors for more common data tools (Airflow, dbt, spark, etc), but at its core OpenLineage is a standard and protocol

2022-02-03 12:39:57

*Thread Reply:* Hello!

2022-02-04 15:07:07

*Thread Reply:* Hey Albert! There aren't any issues or proposals around Apache Atlas yet, but that's definitely something you can help with!

I'm not super familiar with Atlas, were you thinking in terms of enabling Atlas to receive runEvents from OpenLineage connectors?

2022-02-07 05:49:16

*Thread Reply:* Hi John!
Yes, exactly, it'd be nice to see Atlas as a receiver side of the OpenLineage events. Are there some guidelines on how to implement it? I guess we need an OpenLineage-compatible server implementation so we could receive events and send them to Atlas, right?

2022-02-07 11:30:14

*Thread Reply:* exactly - This would be a change on the Atlas side. I'd start by opening an issue in the atlas repo about making an API endpoint that can receive OpenLineage events.
Marquez is our reference implementation of OpenLineage, so I'd look around in that repo to see how it's been implemented :)

2022-02-07 11:50:27

*Thread Reply:* Got it, thanks! Did that: https://issues.apache.org/jira/browse/ATLAS-4550
If it doesn't get any traction, we at New Work might contribute as well

2022-02-07 11:56:09

*Thread Reply:* awesome! if you guys have any questions, reach out and I can get you in touch with some of the engineers on our end

2022-02-08 11:20:47

*Thread Reply:* @Albert Bikeev one minor thing that could be helpful: java OpenLineage library contains server model classes: https://github.com/OpenLineage/OpenLineage/pull/300#issuecomment-923489097

2022-02-08 11:32:12

*Thread Reply:* Got it, thank you!

2022-05-04 11:12:23

*Thread Reply:* This is a quite old discussion, but isn't it possible to use the openlineage proxy to send the json to a kafka topic and let Atlas read that json without any modification?
A new model for spark would need to be created, other than https://github.com/apache/atlas/blob/release-2.1.0-rc3/addons/models/1000-Hadoop/1100-spark_model.json, and uploaded to atlas (which could be done with a call to the atlas Api)
Does it make sense?

2022-05-04 11:24:02

*Thread Reply:* @Juan Carlos Fernández Rodríguez - You still need to build a bridge between the OpenLineage Spec and the Apache Atlas entity JSON. So far, no one has contributed something like that to the open source community... yet!

2022-05-04 14:24:28

*Thread Reply:* sorry for the ignorance, but what is the purpose of the bridge? The communication with atlas should be done through kafka, and those messages can be sent by the proxy. What am I missing?

2022-05-04 16:37:33

*Thread Reply:* "bridge" in this case refers to a service of some sort that converts from OpenLineage run event to Atlas entity JSON, since there's currently nothing that will do that
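As a toy sketch of that bridge idea (every name here - the topic, the Atlas type, the field mapping - is a placeholder assumption, not an existing integration):

```
# Toy sketch: consume OpenLineage events from Kafka and reshape them into
# an Atlas-entity-like dict. "openlineage-events" and "openlineage_job"
# are placeholders; a real bridge needs a proper Atlas type model.
import json

from kafka import KafkaConsumer  # kafka-python

def to_atlas_entity(ol_event: dict) -> dict:
    job = ol_event["job"]
    return {
        "typeName": "openlineage_job",  # placeholder Atlas type
        "attributes": {
            "qualifiedName": f"{job['namespace']}.{job['name']}",
            "name": job["name"],
        },
    }

consumer = KafkaConsumer("openlineage-events", bootstrap_servers="localhost:9092")
for message in consumer:
    entity = to_atlas_entity(json.loads(message.value))
    # ...post `entity` to the Atlas API (or an Atlas hook topic) here...
```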

2022-05-19 09:08:23

*Thread Reply:* If OpenLineage sends an event to kafka, I think we can use kafka streams or kafka connect to rebuild the message into an atlas event.

2022-05-19 09:11:37

*Thread Reply:* @John Thomas Our company used to use atlas as a metadata service. I just came to know about this project. After I learned how openlineage works, I think I can create an issue to describe my design first.

2022-05-19 09:13:36

*Thread Reply:* @Juan Carlos Fernández Rodríguez If you already have some experience and design, can you directly create an issue so that we can discuss it in more detail?

2022-05-19 12:42:31

*Thread Reply:* Hi @xiang chen we are discussing internally in my company whether to rewrite to atlas or another alternative. If we do this, we will share and could involve you in some way.

2022-02-25 13:06:16

*Thread Reply:* Hi Luca, I agree this looks really promising. I'm working on getting it to run on Databricks, but I'm only just starting out 🙂

2022-02-10 09:05:39

*Thread Reply:* By duplicated, you mean with the same runId?

2022-02-10 11:40:55

*Thread Reply:* It's only one example; it could also be a duplicated job name or anything else. The question is whether there is a mechanism to report that

2022-02-15 05:15:12

*Thread Reply:* I think this log should be dropped to debug: https://github.com/OpenLineage/OpenLineage/blob/d66c41872f3cc7f7cd5c99664d401e070e[…]c/main/common/java/io/openlineage/spark/agent/EventEmitter.java

2022-02-15 23:27:07

*Thread Reply:* @Maciej Obuchowski that is a good one! It would be nice to still have SOME logging at info to know that the event completed successfully, but that response and event is very verbose.

I was also thinking about here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/src/main/common/java/io/openlineage/spark/agent/lifecycle/OpenLineageRunEventBuilder.java#L337-L340

2022-02-16 04:59:17

*Thread Reply:* Yes, that would be a good solution for now. Later it would be nice to have some option to raise the log level - OL logs are absolutely drowning in logs from the rest of the Spark cluster when set to debug.

2022-02-16 13:44:30

*Thread Reply:* Hey, I responded on the issue, but just to make it clear for everyone, the OL events for a run are not expected to be an accumulation of all past events. Events should be treated as additive by the backend - each event can post what information it has about the run and the backend is responsible for constructing a holistic picture of the run

2022-02-16 13:47:18

*Thread Reply:* e.g., here is the marquez code that fetches the facets for a run. Note that all of the facets are included from all events with the requested run_uuid. If the env facet is present on any event, it will be returned by the API
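A toy illustration of that additive model (a sketch of the idea, not Marquez's actual code):

```
# Sketch: fold each event's run facets into accumulated per-run state,
# so any facet seen on any event for a run is available afterwards.
from collections import defaultdict
from typing import Dict

facets_by_run: Dict[str, dict] = defaultdict(dict)

def on_event(event: dict) -> None:
    run_id = event["run"]["runId"]
    facets_by_run[run_id].update(event["run"].get("facets", {}))

on_event({"run": {"runId": "r1", "facets": {"parent": {"job": "daily"}}}})
on_event({"run": {"runId": "r1", "facets": {"environment": {"spark": "3.1"}}}})
assert set(facets_by_run["r1"]) == {"parent", "environment"}
```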

2022-02-16 13:51:30

*Thread Reply:* Ah! Thanks for that @Michael Collado it's good to understand the OpenLineage perspective.

So, we do need to maintain some state. That makes total sense, Mike.

2022-02-16 14:00:03

*Thread Reply:* If I were building the backend, I would store events, then calculate the end state later, rather than trying to "maintain some state" (maybe we mean the same thing, but using different words here 😀).
Re: the failure events, I think job failures will currently result in one FAIL event and one COMPLETE event. The SparkListenerJobEnd event will trigger a FAIL event but the SparkListenerSQLExecutionEnd event will trigger the COMPLETE event.

2022-02-16 15:16:27

*Thread Reply:* Oooh! I did not know we already could get a FAIL event! That is super helpful to know, Mike! Thank you so much!

2022-02-22 09:02:48

*Thread Reply:* I think this scenario only happens when spark job spawns another "sub-job", right?

I think that maybe you can check sparkContext.getLocalProperty("spark.sql.execution.id")
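In PySpark terms, that check looks roughly like this (a sketch; the property name is as quoted above):

```
# Sketch: read the SQL execution id tied to the current thread's job.
# Returns a string when inside a SQL execution, otherwise None.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("execution-id-probe").getOrCreate()
execution_id = spark.sparkContext.getLocalProperty("spark.sql.execution.id")
print(f"spark.sql.execution.id = {execution_id}")
```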

2022-02-22 10:53:09

*Thread Reply:* @Maciej Obuchowski - I was hoping they'd have the same run id as well but they do not 😞

But that is the expectation? A SparkSQLExecutionStart and JobStart SHOULD have the same execution ID, right?

2022-02-22 10:57:24

*Thread Reply:* SparkSQLExecutionStart and SparkSQLExecutionEnd should have the same runId, as well as JobStart and JobEnd events. Beyond those it can get wild. For example, some jobs don't emit JobStart/JobEnd events. Some jobs, like Delta, emit multiple that aren't easily tied to the SQL event.

2022-02-23 03:48:38

*Thread Reply:* Okay, I dug into the Databricks Synapse Connector and it does the following:

1. A SparkSQLExecutionStart with execution id of 8 happens (so it gets a runid of abc123). It contains the real inputs and outputs that we want.
2. The Synapse connector starts executing JDBC commands. These commands prepare the synapse database to connect with data that Spark will land in a staging area in the cloud. (I don't know how it's executing arbitrary commands before the official job start begins 😞)
3. A SparkJobStart with execution id of 9 happens (so it gets a runid of jkl456). This contains the inputs and an output to a temp folder (NOT the real output we want but a staging location)
   a. There are four JobIds 0 - 3, all of which point back to execution id 9 with the same physical plan.

I've attached the logs and a screenshot of what I'm seeing in the Spark UI. If you had a chance to take a look, it's a bit verbose but I'd appreciate a second pair of eyes on my analysis. Hopefully I got something wrong 😅

2022-02-23 07:19:01

*Thread Reply:* I think we've encountered the same stuff in Delta before 🙂

https://github.com/OpenLineage/OpenLineage/issues/388#issuecomment-964401860

2022-02-23 14:13:18

*Thread Reply:* @Will Johnson, am I reading your report correctly that the SparkListenerJobStart event is reported with a spark.sql.execution.id that differs from the execution id of the SparkSQLExecutionStart?

2022-02-23 14:18:04

*Thread Reply:* WILLJ: We're deep inside this thing and have an executionid |9|
😂

2022-02-23 21:56:48

*Thread Reply:* Hah @Michael Collado I see you found my method of debugging in Databricks 😅

But you're exactly right, there's a SparkSQLExecutionStart event with execution id 8 and then a set of JobStart events all with execution id 9!

2022-02-28 14:00:56

*Thread Reply:* You won't want to miss this talk!

2022-02-28 15:29:58

*Thread Reply:* hi Martin - when you talk about a DataHub integration, did you mean a method to collect information from DataHub? I don't see a current issue open for that, but I recommend you make one to kick off the discussion around it.

If you mean sending information to DataHub, that should already be possible if users pass a datahub api endpoint to the OPENLINEAGE_ENDPOINT variable

2022-02-28 16:29:54

*Thread Reply:* Hi, thanks for the reply! I meant to emit the Openlineage JSON structure to datahub.

Could you please be more specific, possibly link an article on how to find the endpoint on the datahub side? Many thanks!

2022-02-28 17:15:31

*Thread Reply:* ooooh, sorry I misread - I thought you meant that datahub had built an endpoint. Your integration should emit openlineage events to an endpoint, but datahub would have to build that support into their product likely? I'm not sure how to go about it

2022-02-28 17:16:27

*Thread Reply:* I'd reach out to datahub, potentially?

2022-02-28 17:21:51

*Thread Reply:* i see. ok, will do!

2022-03-02 18:15:21

*Thread Reply:* It has been discussed in the past but I don't think there is something yet. The Kafka transport PR that is in flight should facilitate this

2022-03-02 18:33:45

*Thread Reply:* Thanks for the response! though dragging Kafka in just for the data delivery bit is too much. I think the clearest way would be to push Datahub to make an API endpoint and parser for the OL /lineage data structure.

I see this is more a political thing that would require a joint effort of the DataHub team and OpenLineage with a common goal.

2022-03-07 14:56:58

*Thread Reply:* Hi marco - we don't have documentation on that yet, but the Postgres extractor is a pretty good example of how they're implemented.

all the included extractors are here: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow/openlineage/airflow/extractors

2022-03-07 15:07:41

*Thread Reply:* Thanks. I can follow that to build my own. Also, I am installing this environment right now in Airflow 2. It seems I need Marquez and the openlineage-airflow library. It seems that by this example I can put my extractors in any path as long as it is referenced in the environment variable. Is that correct?
OPENLINEAGE_EXTRACTOR_<operator>=full.path.to.ExtractorClass
Also, do I need anything else other than Marquez and openlineage_airflow?

2022-03-07 15:30:45

*Thread Reply:* Yes, as long as the extractors are in the python path.

2022-03-07 15:31:59

*Thread Reply:* I built one a little while ago for a custom operator, I'd be happy to share what I did. I put it in the same file as the operator class for convenience.

2022-03-07 15:32:51

*Thread Reply:* That will be a great help. Thanks

2022-03-08 20:38:27

*Thread Reply:* This is the one I wrote:

2022-03-08 20:39:30

*Thread Reply:* to make it work, I set this environment variable:

OPENLINEAGE_EXTRACTOR_HttpToBigQueryOperator=http_to_bigquery.HttpToBigQueryExtractor

2022-03-08 20:40:57

*Thread Reply:* the extractor starts at line 183, and the really important bits start at line 218
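For readers who can't open the attachment, here is a skeleton of what such an extractor tends to look like against the openlineage-airflow BaseExtractor interface of that era - the operator, source, and table names are hypothetical:

```
# Skeleton sketch of a custom extractor (hypothetical names throughout).
# Registered via an env var such as:
#   OPENLINEAGE_EXTRACTOR_MyOperator=my_module.MyOperatorExtractor
from typing import List, Optional

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata
from openlineage.common.dataset import Dataset, Source


class MyOperatorExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        # Operator class names this extractor handles.
        return ["MyOperator"]

    def extract(self) -> Optional[TaskMetadata]:
        # self.operator is the task's operator instance; lineage is read
        # from its attributes (connection, source/target tables, etc.).
        source = Source(
            scheme="postgres",
            authority="mydb:5432",
            connection_url="postgres://mydb:5432",
        )
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            inputs=[Dataset(source=source, name="public.source_table")],
            outputs=[Dataset(source=source, name="public.target_table")],
        )
```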

2022-03-08 12:05:25

*Thread Reply:* You would see that the job was run - but we couldn't extract dataset lineage from it.

2022-03-08 12:05:49

*Thread Reply:* The good news is that we're working to solve this problem in general.

2022-03-08 12:15:52

*Thread Reply:* I see, so i definitely will need the custom extractor built. I just need to understand where to set the path to the extractor. I can build one by following the postgres extractor you have built.

2022-03-08 12:50:00

*Thread Reply:* That depends on how you deploy Airflow. Our tests use the environment in docker-compose: https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/tests/integration/tests/docker-compose-2.yml#L34

2022-03-08 13:19:37

*Thread Reply:* Thanks for the example. I can show this to my infra support person for his reference.

2022-03-10 12:46:18

*Thread Reply:* Do you need to specify a namespace that is not « default »?

2022-03-10 07:40:13

*Thread Reply:* @Kevin Mellott aren't you the chart wizard? Maybe you could help 🙂

2022-03-10 14:09:26

*Thread Reply:* Ok so I had to update a chart dependency

2022-03-10 14:10:39

*Thread Reply:* Now I installed the service in amazon using this:
helm install marquez . --dependency-update --set marquez.db.host=myhost --set marquez.db.user=myuser --set marquez.db.password=mypassword --namespace marquez --atomic --wait

2022-03-10 14:11:31

*Thread Reply:* i can see marquez-web running and marquez as well as the database i set up manually

2022-03-10 14:12:27

*Thread Reply:* however I cannot fetch initial data when logging into the endpoint

2022-03-10 14:52:06

*Thread Reply:* 👋 @Marco Diaz happy to hear that the Helm install is completing without error! To help troubleshoot the error above, can you please let me know if this endpoint is available and working?

http://localhost:5000/api/v1/namespaces
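That check can also be scripted, e.g. (a sketch; assumes the port-forward shown in the next message is active):

```
# Sketch: hit the Marquez namespaces endpoint and inspect the response.
import requests

resp = requests.get("http://localhost:5000/api/v1/namespaces", timeout=10)
print(resp.status_code)
print(resp.json())
```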

2022-03-10 15:13:16

*Thread Reply:* i got this:
{"namespaces":[{"name":"default","createdAt":"2022-03-10T18:05:55.780593Z","updatedAt":"2022-03-10T19:03:31.309713Z","ownerName":"anonymous","description":"The default global namespace for dataset, job, and run metadata not belonging to a user-specified namespace."}]}

2022-03-10 15:13:34

*Thread Reply:* i have to use the namespace marquez to redirect there:
kubectl port-forward svc/marquez 5000:80 -n marquez

2022-03-10 15:13:48

*Thread Reply:* is there something i need to change in a config file?

2022-03-10 15:14:39

*Thread Reply:* also how would i change the "localhost" address to something that is accessible in amazon without the need to redirect?

2022-03-10 15:14:59

*Thread Reply:* Sorry for all the questions. I am not an infra guy and have had to do all this by myself

2022-03-10 15:39:23

*Thread Reply:* No problem at all, I think there are a couple of things at play here. With the local setup, it appears that the web is attempting to access the API on the wrong port number (3000 instead of 5000). I'll create an issue for that one so that we can fix it.

As to the EKS installation (or any non-local install), this is where you would need to use what's called an ingress controller to expose the services outside of the Kubernetes cluster. There are different flavors of these (NGINX is popular), and I believe that AWS EKS has some built-in capabilities that might help as well.

2022-03-10 15:40:50

*Thread Reply:* So how do i fix this issue?

2022-03-10 15:46:56

*Thread Reply:* If your goal is to deploy to AWS, then you would need to get the EKS ingress configured. It's not a trivial task, but they do have a bit of a walkthrough at https://www.eksworkshop.com/beginner/130_exposing-service/.

However, if you are just seeking to explore Marquez and try things out, then I would highly recommend the "Open in Gitpod" functionality at https://github.com/MarquezProject/marquez#try-it. That will perform a full deployment for you in a temporary environment very quickly.

2022-03-10 16:02:05

*Thread Reply:* i need to use it in aws for a POC
2022-03-10 19:15:08

*Thread Reply:* Is there a better guide on how to install and set up Marquez in AWS?
This guide is omitting many steps: https://marquezproject.github.io/marquez/running-on-aws.html

2022-03-11 13:59:28

*Thread Reply:* I still see this:
https://files.slack.com/files-pri/T01CWUYP5AR-F036JKN77EW/image.png

2022-03-11 14:00:09

*Thread Reply:* I created my own database and changed the values for host, user and password inside the chart.yml

2022-03-11 14:00:23

*Thread Reply:* Does it show that within the AWS deployment? It looks to show localhost in your screenshot.

2022-03-11 14:00:52

*Thread Reply:* Or are you working through the local deploy right now?

2022-03-11 14:01:57

*Thread Reply:* It shows the same using the exposed service

2022-03-11 14:02:09

*Thread Reply:* i just didn't do another screenshot

2022-03-11 14:02:27

*Thread Reply:* Could it be communication with the DB?

2022-03-11 14:04:37

*Thread Reply:* What do you see if you view the network traffic within your web browser (right click -> Inspect -> Network)? Specifically, wondering what the response code from the Marquez API URL looks like.

2022-03-11 14:14:48

*Thread Reply:* i see this error:
Error occured while trying to proxy to: xxxxxxxxxxxxxxxxxxxxxxxxx.us-east-1.elb.amazonaws.com/api/v1/namespaces

2022-03-11 14:16:00

*Thread Reply:* it seems to be trying to use the same address to access the api endpoint

2022-03-11 14:16:26

*Thread Reply:* however the api service is at a different endpoint

2022-03-11 14:18:24

*Thread Reply:* The API resides here:
Xxxxxxxxxxxxxxxxxxxxxx-2064419849.us-east-1.elb.amazonaws.com

2022-03-11 14:19:13

*Thread Reply:* The web service resides here:
xxxxxxxxxxxxxxxxxxxxxxxxxxx-335729662.us-east-1.elb.amazonaws.com

2022-03-11 14:19:25

*Thread Reply:* do they both need to be under the same LB?

2022-03-11 14:19:56

*Thread Reply:* How would i do that if they install as separate services?

2022-03-11 14:27:15

*Thread Reply:* You are correct, both the website and API are expecting to be exposed on the same ALB. This will give you a single URL that can reach your Kubernetes cluster, and then the ALB will allow you to configure Ingress rules to route the traffic based on the request.

Here is an example from one of the AWS repos - in the ingress resource you can see the single rule setup to point traffic to a given service.

2022-03-11 14:36:40

*Thread Reply:* Thanks for the help. Now I know what the issue is

2022-03-11 14:51:34

*Thread Reply:* Great to hear!!

2022-03-16 10:29:06

*Thread Reply:* Hi! Yes, OpenLineage is free. It is an open source standard for collection, and it provides the agents that integrate with pipeline tools to capture lineage metadata. You also need a metadata server, and there is an open source one called Marquez that you can use.

2022-03-16 10:29:15

*Thread Reply:* It supports the databases listed here: https://openlineage.io/integration

2022-03-16 10:29:53

*Thread Reply:* Not sure I understand - are you looking for example code in Python that shows how to make OpenLineage calls?

2022-03-16 12:45:14

*Thread Reply:* yup

2022-03-16 13:10:04

*Thread Reply:* how to run

2022-03-16 23:08:31

*Thread Reply:* this is a good post for getting started with Marquez: https://openlineage.io/blog/explore-lineage-api/

2022-03-16 23:08:51

*Thread Reply:* once you have run ./docker/up.sh, you should be able to run through that and see how the system runs

2022-03-16 23:09:45

*Thread Reply:* There is a python client you can find here: https://github.com/OpenLineage/OpenLineage/tree/main/client/python

      Supress success

      2022-03-17 00:05:58
      -

      *Thread Reply:* Thank you

      +

      *Thread Reply:* Thank you

      @@ -42550,7 +42556,7 @@

      Supress success

      2022-03-19 00:00:32
      -

      *Thread Reply:* You are welcome 🙂

      +

      *Thread Reply:* You are welcome 🙂

      @@ -42576,7 +42582,7 @@

      Supress success

      2022-04-19 09:28:50
      -

      *Thread Reply:* Hey @Ross Turk, (and potentially @Maciej Obuchowski) - what are the plans for OL Python client? I'd like to use it, but without a pip package it's not really project-friendly.

      +

      *Thread Reply:* Hey @Ross Turk, (and potentially @Maciej Obuchowski) - what are the plans for OL Python client? I'd like to use it, but without a pip package it's not really project-friendly.

      Is there any work in that direction, is the current client code considered mature and just needs re-packaging, or is it just a thought sketch and some serious work is needed?

      @@ -42606,7 +42612,7 @@

      Supress success

      2022-04-19 09:32:17
      -

      *Thread Reply:* What do you mean without pip-package?

      +

      *Thread Reply:* What do you mean without pip-package?

      @@ -42632,7 +42638,7 @@

      Supress success

      2022-04-19 09:32:18
      -

      *Thread Reply:* https://pypi.org/project/openlineage-python/

      +

      *Thread Reply:* https://pypi.org/project/openlineage-python/

      @@ -42658,7 +42664,7 @@

      Supress success

      2022-04-19 09:35:08
      -

*Thread Reply:* It's still developed, for example next release will have pluggable backends - like Kafka

+

*Thread Reply:* It's still developed, for example next release will have pluggable backends - like Kafka
https://github.com/OpenLineage/OpenLineage/pull/530

      @@ -42685,7 +42691,7 @@

      Supress success

      2022-04-19 09:40:11
      -

*Thread Reply:* My apologies Maciej!

+

*Thread Reply:* My apologies Maciej!
In my defense - looking for "open lineage" on pypi doesn't show this in the first 20 results. Still, should have checked setup.py. My bad, and thank you for the pointer!

      @@ -42712,7 +42718,7 @@

      Supress success

      2022-04-19 10:00:49
      -

      *Thread Reply:* We might need to add some keywords to setup.py - right now we have only "openlineage" there 😉

      +

      *Thread Reply:* We might need to add some keywords to setup.py - right now we have only "openlineage" there 😉

      @@ -42738,7 +42744,7 @@

      Supress success

      2022-04-20 08:12:29
      -

      *Thread Reply:* My mistake was that I was expecting a separate repo for the clients. But now I'm playing around with the package and trying to figure out the OL concepts. Thank you for your contribution, it's much nicer to experiment from ipynb than curl 🙂

      +

      *Thread Reply:* My mistake was that I was expecting a separate repo for the clients. But now I'm playing around with the package and trying to figure out the OL concepts. Thank you for your contribution, it's much nicer to experiment from ipynb than curl 🙂

      @@ -42837,7 +42843,7 @@

      Supress success

      2022-03-16 14:35:14
      -

      *Thread Reply:* the seed data is being inserted by this command here: https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/cli/SeedCommand.java

      +

      *Thread Reply:* the seed data is being inserted by this command here: https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/cli/SeedCommand.java

      @@ -43095,7 +43101,7 @@

      Supress success

      2022-03-17 00:06:53
      -

*Thread Reply:* Got it, but if I change the code in this java file - let's say I add another job here satisfying the syntax - it's not appearing in the lineage flow

+

*Thread Reply:* Got it, but if I change the code in this java file - let's say I add another job here satisfying the syntax - it's not appearing in the lineage flow

      @@ -43234,7 +43240,7 @@

      Supress success

      2022-03-23 10:20:47
      -

      *Thread Reply:* I’m not too familiar with the 504 error in ALB, but found a guide with troubleshooting steps. If this is an issue with connectivity to the Postgres database, then you should be able to see errors within the marquez pod in EKS (kubectl logs <marquez pod name>) to confirm.

      +

      *Thread Reply:* I’m not too familiar with the 504 error in ALB, but found a guide with troubleshooting steps. If this is an issue with connectivity to the Postgres database, then you should be able to see errors within the marquez pod in EKS (kubectl logs <marquez pod name>) to confirm.

      I know that EKS needs to have connectivity established to the Postgres database, even in the case of RDS, so that could be the culprit.

      @@ -43262,7 +43268,7 @@

      Supress success

      2022-03-23 16:09:09
      -

*Thread Reply:* @Kevin Mellott This is the error I am seeing in the logs

+

*Thread Reply:* @Kevin Mellott This is the error I am seeing in the logs
[HPM] Proxy created: /api/v1 -> <http://localhost:5000/>
App listening on port 3000!
[HPM] Error occurred while trying to proxy request /api/v1/namespaces from marquez-interface-test.di.rbx.com to <http://localhost:5000/> (ECONNREFUSED) (<https://nodejs.org/api/errors.html#errors_common_system_errors>)

      @@ -43291,7 +43297,7 @@

      Supress success

      2022-03-23 16:22:13
      -

      *Thread Reply:* It looks like the website is attempting to find the API on localhost. I believe this can be resolved by setting the following Helm chart value within your deployment.

      +

      *Thread Reply:* It looks like the website is attempting to find the API on localhost. I believe this can be resolved by setting the following Helm chart value within your deployment.

      marquez.hostname=marquez-interface-test.di.rbx.com

      @@ -43319,7 +43325,7 @@

      Supress success

      2022-03-23 16:22:54
      -

      *Thread Reply:* assuming that is the DNS used by the website

      +

      *Thread Reply:* assuming that is the DNS used by the website

      @@ -43345,7 +43351,7 @@

      Supress success

      2022-03-23 16:48:53
      -

      *Thread Reply:* thanks, that did it. I have a question regarding the database

      +

      *Thread Reply:* thanks, that did it. I have a question regarding the database

      @@ -43371,7 +43377,7 @@

      Supress success

      2022-03-23 16:50:01
      -

*Thread Reply:* I made my own database manually. Should the marquez tables be created automatically when installing marquez?

+

*Thread Reply:* I made my own database manually. Should the marquez tables be created automatically when installing marquez?

      @@ -43397,7 +43403,7 @@

      Supress success

      2022-03-23 16:56:10
      -

      *Thread Reply:* Also could you put both the API and interface on the same port (3000)

      +

      *Thread Reply:* Also could you put both the API and interface on the same port (3000)

      @@ -43423,7 +43429,7 @@

      Supress success

      2022-03-23 17:21:58
      -

*Thread Reply:* Seems I am still having the forwarding issue

+

*Thread Reply:* Seems I am still having the forwarding issue
[HPM] Proxy created: /api/v1 -> <http://marquez-interface-test.di.rbx.com:5000/>
App listening on port 3000!
[HPM] Error occurred while trying to proxy request /api/v1/namespaces from marquez-interface-test.di.rbx.com to <http://marquez-interface-test.di.rbx.com:5000/> (ECONNRESET) (<https://nodejs.org/api/errors.html#errors_common_system_errors>)

      @@ -43488,7 +43494,7 @@

      Supress success

      2022-03-23 09:25:18
      -

      *Thread Reply:* It's always Delta, isn't it?

      +

      *Thread Reply:* It's always Delta, isn't it?

      When I originally worked on Delta support I tried to find answer on Delta slack and got an answer:

      @@ -43522,7 +43528,7 @@

      Supress success

      2022-03-23 09:25:48
      -

      *Thread Reply:* I haven't touched how it works in Spark 2 - wanted to make it work with Spark 3's new catalogs, so can't help you there.

      +

      *Thread Reply:* I haven't touched how it works in Spark 2 - wanted to make it work with Spark 3's new catalogs, so can't help you there.

      @@ -43548,7 +43554,7 @@

      Supress success

      2022-03-23 09:46:14
      -

      *Thread Reply:* Argh!! It's always Databricks doing something 🙄

      +

      *Thread Reply:* Argh!! It's always Databricks doing something 🙄

      Thanks, Maciej!

      @@ -43576,7 +43582,7 @@

      Supress success

      2022-03-23 09:51:59
      -

      *Thread Reply:* One last question for you, @Maciej Obuchowski, any thoughts on how I could identify WHY a particular JobStart event fired? Is it just stepping through every event? Was that your approach to getting Spark3 Delta working? Thank you so much for the insights!

      +

      *Thread Reply:* One last question for you, @Maciej Obuchowski, any thoughts on how I could identify WHY a particular JobStart event fired? Is it just stepping through every event? Was that your approach to getting Spark3 Delta working? Thank you so much for the insights!

      @@ -43602,7 +43608,7 @@

      Supress success

      2022-03-23 09:58:08
      -

      *Thread Reply:* Before that, we were using just JobStart/JobEnd events and I couldn't find events that correspond to logical plan that has anything to do with what job was actually doing. I just found out that SQLExecution events have what I want, so I just started using them and stopped worrying about Projection or Aggregate, or other events that don't really matter here - and that's how filtering idea was born: https://github.com/OpenLineage/OpenLineage/issues/423

      +

      *Thread Reply:* Before that, we were using just JobStart/JobEnd events and I couldn't find events that correspond to logical plan that has anything to do with what job was actually doing. I just found out that SQLExecution events have what I want, so I just started using them and stopped worrying about Projection or Aggregate, or other events that don't really matter here - and that's how filtering idea was born: https://github.com/OpenLineage/OpenLineage/issues/423

      @@ -43628,7 +43634,7 @@

      Supress success

      2022-03-23 09:59:37
      -

      *Thread Reply:* Are you trying to get environment info from those events, or do you actually get Job event with proper logical plans like SaveIntoDataSourceCommand?

      +

      *Thread Reply:* Are you trying to get environment info from those events, or do you actually get Job event with proper logical plans like SaveIntoDataSourceCommand?

      Might be worth to just post here all the events + logical plans that are generated for particular job, as I've done in that issue

      @@ -43656,7 +43662,7 @@

      Supress success

      2022-03-23 09:59:40
      -

*Thread Reply:* scala> spark.sql("CREATE TABLE tbl USING delta AS SELECT * FROM tmp")

+

*Thread Reply:* scala> spark.sql("CREATE TABLE tbl USING delta AS SELECT * FROM tmp")
21/11/09 19:01:46 WARN SparkSQLExecutionContext: SparkListenerSQLExecutionStart - executionId: 3
21/11/09 19:01:46 WARN SparkSQLExecutionContext: org.apache.spark.sql.catalyst.plans.logical.CreateTableAsSelect
21/11/09 19:01:46 WARN SparkSQLExecutionContext: SparkListenerSQLExecutionStart - executionId: 4

@@ -43702,7 +43708,7 @@

      Supress success

      2022-03-23 11:41:37
      -

      *Thread Reply:* The JobStart event contains a Properties field and that contains a bunch of fields we want to extract to get more precise lineage information within Databricks.

      +

      *Thread Reply:* The JobStart event contains a Properties field and that contains a bunch of fields we want to extract to get more precise lineage information within Databricks.

      As far as we know, the SQLExecutionStart event does not have any way to get these properties :(

      @@ -43797,7 +43803,7 @@

      Supress success

      2022-03-23 11:42:33
      -

      *Thread Reply:* I started down this path with the Project statement but I agree with @Michael Collado that a ProjectVisitor isn't a great idea.

      +

      *Thread Reply:* I started down this path with the Project statement but I agree with @Michael Collado that a ProjectVisitor isn't a great idea.

      https://github.com/OpenLineage/OpenLineage/issues/617

      Supress success
      2022-03-25 20:33:02
      -

      *Thread Reply:* Marquez and OpenLineage are job-focused lineage tools, so once you run a job in an OL-integrated instance of Airflow (or any other supported integration), you should see the jobs and DBs appear in the marquez ui

      +

      *Thread Reply:* Marquez and OpenLineage are job-focused lineage tools, so once you run a job in an OL-integrated instance of Airflow (or any other supported integration), you should see the jobs and DBs appear in the marquez ui

      @@ -44074,7 +44080,7 @@

      Supress success

      2022-03-25 21:44:54
      -

      *Thread Reply:* If you want to seed it with some data, just to try it out, you can run docker/up.sh -s and it will run a seeding job as it starts.

      +

      *Thread Reply:* If you want to seed it with some data, just to try it out, you can run docker/up.sh -s and it will run a seeding job as it starts.

      @@ -44126,7 +44132,7 @@

      Supress success

      2022-03-31 18:34:40
      -

      *Thread Reply:* Yep! Marquez will register all in/out datasets present in the OL event as well as link them to the run

      +

      *Thread Reply:* Yep! Marquez will register all in/out datasets present in the OL event as well as link them to the run

      @@ -44152,7 +44158,7 @@

      Supress success

      2022-03-31 18:35:47
      -

      *Thread Reply:* FYI, @Peter Hicks is working on displaying the dataset version to run relationship in the web UI, see https://github.com/MarquezProject/marquez/pull/1929

      +

      *Thread Reply:* FYI, @Peter Hicks is working on displaying the dataset version to run relationship in the web UI, see https://github.com/MarquezProject/marquez/pull/1929

      @@ -44265,7 +44271,7 @@

      Supress success

      2022-03-28 15:43:46
      -

      *Thread Reply:* Hi Marco,

      +

      *Thread Reply:* Hi Marco,

      Datakin is a reporting tool built on the Marquez API, and therefore designed to take in Lineage using the OpenLineage specification.

      @@ -44295,7 +44301,7 @@

      Supress success

      2022-03-28 15:47:53
      -

      *Thread Reply:* No, that is it. Got it. So, i can install Datakin and still use openlineage and marquez?

      +

      *Thread Reply:* No, that is it. Got it. So, i can install Datakin and still use openlineage and marquez?

      @@ -44321,7 +44327,7 @@

      Supress success

      2022-03-28 15:55:07
      -

      *Thread Reply:* if you set up a datakin account, you'll have to change the environment variables used by your OpenLineage integrations, and the runEvents will be sent to Datakin rather than Marquez. You shouldn't have any loss of functionality, and you also won't have to keep manually hosting Marquez

      +

      *Thread Reply:* if you set up a datakin account, you'll have to change the environment variables used by your OpenLineage integrations, and the runEvents will be sent to Datakin rather than Marquez. You shouldn't have any loss of functionality, and you also won't have to keep manually hosting Marquez

      @@ -44347,7 +44353,7 @@

      Supress success

      2022-03-28 16:10:25
      -

      *Thread Reply:* Will I still be able to use facets for backfills?

      +

      *Thread Reply:* Will I still be able to use facets for backfills?

      @@ -44373,7 +44379,7 @@

      Supress success

      2022-03-28 17:04:03
      -

      *Thread Reply:* yeah it works in the same way - Datakin actually submodules the Marquez API

      +

      *Thread Reply:* yeah it works in the same way - Datakin actually submodules the Marquez API

      @@ -44488,7 +44494,7 @@

      Supress success

      2022-03-29 06:24:21
      -

      *Thread Reply:* Yes, you don't need to modify dags in Airflow 2.1+

      +

      *Thread Reply:* Yes, you don't need to modify dags in Airflow 2.1+

      @@ -44514,7 +44520,7 @@

      Supress success

      2022-03-29 17:47:39
      -

*Thread Reply:* ok, I added that environment variable. Now my question is how do I configure my other variables.

+

*Thread Reply:* ok, I added that environment variable. Now my question is how do I configure my other variables.
I have marquez running in AWS with an ingress. Do I use OPENLINEAGE_URL or MARQUEZ_URL?

      @@ -44551,7 +44557,7 @@

      Supress success

      2022-03-29 17:48:09
      -

      *Thread Reply:* Also would a new namespace be created if i add the variable?

      +

      *Thread Reply:* Also would a new namespace be created if i add the variable?

      @@ -44603,7 +44609,7 @@

      Supress success

      2022-03-30 14:59:13
      -

*Thread Reply:* Hi Datafool - I'm not familiar with how trino works, but the DBT-OL integration works by wrapping the dbt run command with dbt-ol run, and capturing lineage data from the run_results file

+

*Thread Reply:* Hi Datafool - I'm not familiar with how trino works, but the DBT-OL integration works by wrapping the dbt run command with dbt-ol run, and capturing lineage data from the run_results file

      These things don't necessarily preclude you from using OpenLineage on trino, so it may work already.

      @@ -44631,7 +44637,7 @@

      Supress success

      2022-03-30 18:34:38
      -

      *Thread Reply:* hey @John Thomas yep, tried to use dbt-ol run command but it seems trino is not supported, only bigquery, redshift and few others.

      +

      *Thread Reply:* hey @John Thomas yep, tried to use dbt-ol run command but it seems trino is not supported, only bigquery, redshift and few others.

      @@ -44657,7 +44663,7 @@

      Supress success

      2022-03-30 18:36:41
      -

      *Thread Reply:* aaah I misunderstood what Trino is - yeah we don't currently support jobs that are running outside of those environments.

      +

      *Thread Reply:* aaah I misunderstood what Trino is - yeah we don't currently support jobs that are running outside of those environments.

      We don't currently have plans for this, but a great first step would be opening an issue in the OpenLineage repo.

      @@ -44687,7 +44693,7 @@

      Supress success

      2022-03-30 20:23:46
      -

      *Thread Reply:* oh okay, got it, yes I can contribute, I'll see if I can get some time in the next few weeks. Thanks @John Thomas

      +

      *Thread Reply:* oh okay, got it, yes I can contribute, I'll see if I can get some time in the next few weeks. Thanks @John Thomas

      @@ -44963,7 +44969,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:11:35
      -

      *Thread Reply:* Hmmmm. Are you running in Docker? Is it possible for you to shell into your scheduler container and make sure the ENV is properly set?

      +

      *Thread Reply:* Hmmmm. Are you running in Docker? Is it possible for you to shell into your scheduler container and make sure the ENV is properly set?

      @@ -44989,7 +44995,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:11:57
      -

      *Thread Reply:* looks to me like the value you posted is correct, and return ['QueryOperator'] seems right to me

      +

      *Thread Reply:* looks to me like the value you posted is correct, and return ['QueryOperator'] seems right to me

      @@ -45015,7 +45021,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:33:00
      -

*Thread Reply:* It is in an EKS cluster

+

*Thread Reply:* It is in an EKS cluster
I checked and the variable is there:
OPENLINEAGE_EXTRACTOR_QUERYOPERATOR=shared.plugins.ol_custom_extractors.QueryOperatorExtractor

      @@ -45043,7 +45049,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:33:56
      -

      *Thread Reply:* I am wondering if it is an issue with my extractor code. Something not rendering well

      +

      *Thread Reply:* I am wondering if it is an issue with my extractor code. Something not rendering well

      @@ -45069,7 +45075,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:40:17
      -

*Thread Reply:* I don’t think it’s even executing your extractor code. The error message traces back to here:

+

*Thread Reply:* I don’t think it’s even executing your extractor code. The error message traces back to here:
https://github.com/OpenLineage/OpenLineage/blob/249868fa9b97d218ee35c4a198bcdf231a9b874b/integration/airflow/openlineage/lineage_backend/__init__.py#L77

      @@ -45121,7 +45127,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:40:45
      -

      *Thread Reply:* I am currently digging into _get_extractor to see where it might be missing yours 🤔

      +

      *Thread Reply:* I am currently digging into _get_extractor to see where it might be missing yours 🤔

      @@ -45147,7 +45153,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:46:36
      -

      *Thread Reply:* Thanks

      +

      *Thread Reply:* Thanks

      @@ -45173,7 +45179,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:47:19
      -

      *Thread Reply:* silly idea, but you could add a log message to __init__ in your extractor.

      +

      *Thread Reply:* silly idea, but you could add a log message to __init__ in your extractor.
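
For reference, a minimal extractor skeleton with such a log line (a sketch against the 0.6.x openlineage-airflow base classes; the QueryOperator name mirrors the hypothetical operator from this thread):

import logging

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata

log = logging.getLogger(__name__)

class QueryOperatorExtractor(BaseExtractor):
    def __init__(self, operator):
        super().__init__(operator)
        # the "silly idea": if this line shows up in the task logs,
        # the env variable and PYTHONPATH are both correct
        log.info("QueryOperatorExtractor instantiated")

    @classmethod
    def get_operator_classnames(cls):
        return ['QueryOperator']

    def extract(self) -> TaskMetadata:
        # pull whatever the operator knows (sql, connection, ...) into lineage metadata
        return TaskMetadata(name=f"{self.operator.dag_id}.{self.operator.task_id}")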

      @@ -45199,7 +45205,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:47:25
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/249868fa9b97d218ee35c4a198bcdf231a[…]ntegration/airflow/openlineage/airflow/extractors/extractors.py

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/249868fa9b97d218ee35c4a198bcdf231a[…]ntegration/airflow/openlineage/airflow/extractors/extractors.py

      @@ -45250,7 +45256,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:48:20
      -

      *Thread Reply:* the openlineage client actually tries to import the value of that env variable from pos 22. if that happens, but for some reason it fails to register the extractor, we can at least know that it’s importing

      +

      *Thread Reply:* the openlineage client actually tries to import the value of that env variable from pos 22. if that happens, but for some reason it fails to register the extractor, we can at least know that it’s importing

      @@ -45276,7 +45282,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:48:54
      -

      *Thread Reply:* if you add a log line, you can verify that your PYTHONPATH and env are correct

      +

      *Thread Reply:* if you add a log line, you can verify that your PYTHONPATH and env are correct

      @@ -45302,7 +45308,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:49:23
      -

      *Thread Reply:* will try that

      +

      *Thread Reply:* will try that

      @@ -45328,7 +45334,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:49:29
      -

      *Thread Reply:* and let you know

      +

      *Thread Reply:* and let you know

      @@ -45354,7 +45360,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:49:39
      -

      *Thread Reply:* ok!

      +

      *Thread Reply:* ok!

      @@ -45380,7 +45386,7 @@

      logger = logging.getLogger(name)

      2022-03-31 15:04:05
      -

      *Thread Reply:* @Marco Diaz can you try env variable OPENLINEAGE_EXTRACTOR_QueryOperator instead of full caps?

      +

      *Thread Reply:* @Marco Diaz can you try env variable OPENLINEAGE_EXTRACTOR_QueryOperator instead of full caps?
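
The lookup is case sensitive because the operator class name is embedded verbatim in the variable name - roughly like this (a paraphrase, not the actual extractors.py source):

import os

def custom_extractor_path(operator_classname: str):
    # OPENLINEAGE_EXTRACTOR_QueryOperator and OPENLINEAGE_EXTRACTOR_QUERYOPERATOR
    # are therefore different variables
    return os.getenv(f"OPENLINEAGE_EXTRACTOR_{operator_classname}")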

      @@ -45410,7 +45416,7 @@

      logger = logging.getLogger(name)

      2022-03-31 15:13:37
      -

      *Thread Reply:* Will try that too

      +

      *Thread Reply:* Will try that too

      @@ -45436,7 +45442,7 @@

      logger = logging.getLogger(name)

      2022-03-31 15:13:44
      -

      *Thread Reply:* Thanks for helping

      +

      *Thread Reply:* Thanks for helping

      @@ -45462,7 +45468,7 @@

      logger = logging.getLogger(name)

      2022-03-31 16:58:24
      -

      *Thread Reply:* @Maciej Obuchowski My setup does not allow me to submit environment variables with lowercases. Is the name of the variable used to register the extractor?

      +

      *Thread Reply:* @Maciej Obuchowski My setup does not allow me to submit environment variables with lowercases. Is the name of the variable used to register the extractor?

      @@ -45488,7 +45494,7 @@

      logger = logging.getLogger(name)

      2022-03-31 17:15:57
      -

      *Thread Reply:* yes, it's case sensitive...

      +

      *Thread Reply:* yes, it's case sensitive...

      @@ -45514,7 +45520,7 @@

      logger = logging.getLogger(name)

      2022-03-31 17:18:42
      -

      *Thread Reply:* i see

      +

      *Thread Reply:* i see

      @@ -45540,7 +45546,7 @@

      logger = logging.getLogger(name)

      2022-03-31 17:39:16
      -

*Thread Reply:* So it is definitely the name of the variable. I changed the name of the operator to capitals and now it is being registered

+

*Thread Reply:* So it is definitely the name of the variable. I changed the name of the operator to capitals and now it is being registered

      @@ -45566,7 +45572,7 @@

      logger = logging.getLogger(name)

      2022-03-31 17:39:44
      -

      *Thread Reply:* Could there be a way not to make this case sensitive?

      +

      *Thread Reply:* Could there be a way not to make this case sensitive?

      @@ -45592,7 +45598,7 @@

      logger = logging.getLogger(name)

      2022-03-31 18:31:27
      -

      *Thread Reply:* yes - could you create issue on OpenLineage repository?

      +

      *Thread Reply:* yes - could you create issue on OpenLineage repository?

      @@ -45618,7 +45624,7 @@

      logger = logging.getLogger(name)

      2022-04-01 10:46:59
      -

      *Thread Reply:* sure

      +

      *Thread Reply:* sure

      @@ -45694,7 +45700,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:25:52
      -

*Thread Reply:* Yes, it should. This line is the likely culprit:

+

*Thread Reply:* Yes, it should. This line is the likely culprit:
https://github.com/OpenLineage/OpenLineage/blob/431251d25f03302991905df2dc24357823d9c9c3/integration/common/openlineage/common/sql/parser.py#L30

      @@ -45746,7 +45752,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:26:25
      -

      *Thread Reply:* I bet if that said ['INTO','OVERWRITE'] it would work

      +

      *Thread Reply:* I bet if that said ['INTO','OVERWRITE'] it would work
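
A toy illustration of why that keyword list matters (a simplification, not the parser's actual code): the old approach scans for the token that precedes the output table, so INSERT OVERWRITE is missed unless OVERWRITE is in the list.

from typing import Optional

def output_table(sql: str) -> Optional[str]:
    """Naive scan: the output table follows INTO or OVERWRITE [TABLE]."""
    tokens = sql.split()
    for i, tok in enumerate(tokens):
        if tok.upper() in ("INTO", "OVERWRITE"):
            nxt = tokens[i + 1]
            return tokens[i + 2] if nxt.upper() == "TABLE" else nxt
    return None

print(output_table("INSERT INTO schema.t1 SELECT col FROM t0"))            # schema.t1
print(output_table("INSERT OVERWRITE TABLE schema.t1 SELECT col FROM t0")) # schema.t1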

      @@ -45772,7 +45778,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:27:23
      -

      *Thread Reply:* @Maciej Obuchowski do you agree? should OVERWRITE be a token we look for? if so, I can submit a short PR.

      +

      *Thread Reply:* @Maciej Obuchowski do you agree? should OVERWRITE be a token we look for? if so, I can submit a short PR.

      @@ -45798,7 +45804,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:30:36
      -

      *Thread Reply:* we have a better solution

      +

      *Thread Reply:* we have a better solution

      @@ -45824,7 +45830,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:30:37
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/644

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/644

      @@ -45878,7 +45884,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:31:27
      -

      *Thread Reply:* ah! I heard there was a new SQL parser, but did not know it was imminent!

      +

      *Thread Reply:* ah! I heard there was a new SQL parser, but did not know it was imminent!

      @@ -45904,7 +45910,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:31:30
      -

      *Thread Reply:* I've added this case as a test and it works: https://github.com/OpenLineage/OpenLineage/blob/764dfdb885112cd0840ebc7384ff958bf20d4a70/integration/sql/tests/tests_insert.rs

      +

      *Thread Reply:* I've added this case as a test and it works: https://github.com/OpenLineage/OpenLineage/blob/764dfdb885112cd0840ebc7384ff958bf20d4a70/integration/sql/tests/tests_insert.rs

      @@ -46047,7 +46053,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:31:33
      -

      *Thread Reply:* let me review this PR

      +

      *Thread Reply:* let me review this PR

      @@ -46073,7 +46079,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:36:32
      -

*Thread Reply:* Do I have to download a new version of the openlineage-airflow python library

+

*Thread Reply:* Do I have to download a new version of the openlineage-airflow python library

      @@ -46099,7 +46105,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:36:41
      -

      *Thread Reply:* If so which version?

      +

      *Thread Reply:* If so which version?

      @@ -46125,7 +46131,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:37:22
      -

      *Thread Reply:* this PR isn’t merged yet 😞 so if you wanted to try this you’d have to build the python client from the sql/rust-parser-impl branch

      +

      *Thread Reply:* this PR isn’t merged yet 😞 so if you wanted to try this you’d have to build the python client from the sql/rust-parser-impl branch

      @@ -46151,7 +46157,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:38:17
      -

      *Thread Reply:* ok, np. I am not in a hurry yet. Do you have an ETA for the merge?

      +

      *Thread Reply:* ok, np. I am not in a hurry yet. Do you have an ETA for the merge?

      @@ -46177,7 +46183,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:39:50
      -

      *Thread Reply:* Hard to say, it’s currently in-review. Let me pull some strings, see if I can get eyes on it.

      +

      *Thread Reply:* Hard to say, it’s currently in-review. Let me pull some strings, see if I can get eyes on it.

      @@ -46203,7 +46209,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:40:34
      -

      *Thread Reply:* I will check again next week don't worry. I still need to make some things in my extractor work

      +

      *Thread Reply:* I will check again next week don't worry. I still need to make some things in my extractor work

      @@ -46229,7 +46235,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:40:36
      -

      *Thread Reply:* after it’s merged, we’ll have to do an OpenLineage release as well - perhaps next week?

      +

      *Thread Reply:* after it’s merged, we’ll have to do an OpenLineage release as well - perhaps next week?

      @@ -46259,7 +46265,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:40:41
      -

      *Thread Reply:* 👍

      +

      *Thread Reply:* 👍

      @@ -46311,7 +46317,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:34:10
      -

      *Thread Reply:* Hm - what version of dbt are you using?

      +

      *Thread Reply:* Hm - what version of dbt are you using?

      @@ -46337,7 +46343,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:47:50
      -

      *Thread Reply:* @Tien Nguyen The dbt schema version changes with different versions of dbt. If you have recently updated, you may have to make some changes: https://docs.getdbt.com/docs/guides/migration-guide/upgrading-to-v1.0

      +

      *Thread Reply:* @Tien Nguyen The dbt schema version changes with different versions of dbt. If you have recently updated, you may have to make some changes: https://docs.getdbt.com/docs/guides/migration-guide/upgrading-to-v1.0

      docs.getdbt.com
      @@ -46388,7 +46394,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:48:27
      -

      *Thread Reply:* also make sure you are on the latest version of openlineage-dbt - I believe we have made it a bit more tolerant of dbt schema changes.

      +

      *Thread Reply:* also make sure you are on the latest version of openlineage-dbt - I believe we have made it a bit more tolerant of dbt schema changes.

      @@ -46414,7 +46420,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:52:46
      -

      *Thread Reply:* @Ross Turk Thank you very much for your answer. I will update those and see if I can resolve the issues.

      +

      *Thread Reply:* @Ross Turk Thank you very much for your answer. I will update those and see if I can resolve the issues.

      @@ -46440,7 +46446,7 @@

      logger = logging.getLogger(name)

      2022-04-01 14:20:00
      -

*Thread Reply:* @Ross Turk Thank you very much for your help. The latest version of dbt didn't work, but version 0.20.0 works for this problem.

+

*Thread Reply:* @Ross Turk Thank you very much for your help. The latest version of dbt didn't work, but version 0.20.0 works for this problem.

      @@ -46466,7 +46472,7 @@

      logger = logging.getLogger(name)

      2022-04-01 14:22:42
      -

      *Thread Reply:* Hmm. Interesting, I remember when dbt 1.0 came out we fixed a very similar issue: https://github.com/OpenLineage/OpenLineage/pull/397

      +

      *Thread Reply:* Hmm. Interesting, I remember when dbt 1.0 came out we fixed a very similar issue: https://github.com/OpenLineage/OpenLineage/pull/397

      @@ -46532,7 +46538,7 @@

      logger = logging.getLogger(name)

      2022-04-01 14:25:17
      -

      *Thread Reply:* if you run pip3 list | grep openlineage-dbt, what version does it show?

      +

      *Thread Reply:* if you run pip3 list | grep openlineage-dbt, what version does it show?

      @@ -46558,7 +46564,7 @@

      logger = logging.getLogger(name)

      2022-04-01 14:26:26
      -

      *Thread Reply:* I wonder if you have somehow ended up with an older version of the integration

      +

      *Thread Reply:* I wonder if you have somehow ended up with an older version of the integration

      @@ -46584,7 +46590,7 @@

      logger = logging.getLogger(name)

      2022-04-01 14:33:43
      -

      *Thread Reply:* it is 0.1.0

      +

      *Thread Reply:* it is 0.1.0

      @@ -46610,7 +46616,7 @@

      logger = logging.getLogger(name)

      2022-04-01 14:34:23
      -

*Thread Reply:* is 0.1.0 an older version of openlineage?

+

*Thread Reply:* is 0.1.0 an older version of openlineage?

      @@ -46636,7 +46642,7 @@

      logger = logging.getLogger(name)

      2022-04-01 14:43:14
      -

*Thread Reply:* ❯ pip3 list | grep openlineage-dbt

+

*Thread Reply:* ❯ pip3 list | grep openlineage-dbt
openlineage-dbt 0.6.2

      @@ -46663,7 +46669,7 @@

      logger = logging.getLogger(name)

      2022-04-01 14:43:26
      -

      *Thread Reply:* the latest is 0.6.2 - that might be your issue

      +

      *Thread Reply:* the latest is 0.6.2 - that might be your issue

      @@ -46689,7 +46695,7 @@

      logger = logging.getLogger(name)

      2022-04-01 14:43:59
      -

      *Thread Reply:* How are you going about installing it?

      +

      *Thread Reply:* How are you going about installing it?

      @@ -46715,7 +46721,7 @@

      logger = logging.getLogger(name)

      2022-04-01 18:35:26
      -

*Thread Reply:* @Ross Turk. I followed the instructions from OpenLineage: "pip3 install openlineage-dbt"

+

*Thread Reply:* @Ross Turk. I followed the instructions from OpenLineage: "pip3 install openlineage-dbt"

      @@ -46741,7 +46747,7 @@

      logger = logging.getLogger(name)

      2022-04-01 18:36:00
      -

      *Thread Reply:* Hm! Interesting. I did the same thing to get 0.6.2.

      +

      *Thread Reply:* Hm! Interesting. I did the same thing to get 0.6.2.

      @@ -46767,7 +46773,7 @@

      logger = logging.getLogger(name)

      2022-04-01 18:51:36
      -

*Thread Reply:* @Ross Turk Yes. I have tried to reinstall and clear the cache but it still installs 0.1.0

+

*Thread Reply:* @Ross Turk Yes. I have tried to reinstall and clear the cache but it still installs 0.1.0

      @@ -46793,7 +46799,7 @@

      logger = logging.getLogger(name)

      2022-04-01 18:53:07
      -

*Thread Reply:* But thanks for the version. I reinstalled 0.6.2 by specifying the version

+

*Thread Reply:* But thanks for the version. I reinstalled 0.6.2 by specifying the version

      @@ -46866,7 +46872,7 @@

      logger = logging.getLogger(name)

      2022-04-03 15:07:09
      -

      *Thread Reply:* they'll work with new parser - added test for those

      +

      *Thread Reply:* they'll work with new parser - added test for those

      @@ -46917,7 +46923,7 @@

      logger = logging.getLogger(name)

      2022-04-03 15:07:39
      -

      *Thread Reply:* btw, thank you very much for notifying us about multiple bugs @Marco Diaz!

      +

      *Thread Reply:* btw, thank you very much for notifying us about multiple bugs @Marco Diaz!

      @@ -46943,7 +46949,7 @@

      logger = logging.getLogger(name)

      2022-04-03 15:20:55
      -

*Thread Reply:* @Maciej Obuchowski thank you for making sure these cases are taken into account. I am getting more familiar with the OpenLineage code as I build my extractors. If I see anything else I will let you know. Any ETA on the new parser release date?

+

*Thread Reply:* @Maciej Obuchowski thank you for making sure these cases are taken into account. I am getting more familiar with the OpenLineage code as I build my extractors. If I see anything else I will let you know. Any ETA on the new parser release date?

      @@ -46969,7 +46975,7 @@

      logger = logging.getLogger(name)

      2022-04-03 15:55:28
      -

*Thread Reply:* it should be a week or two, unless anything comes up

+

*Thread Reply:* it should be a week or two, unless anything comes up

      @@ -46995,7 +47001,7 @@

      logger = logging.getLogger(name)

      2022-04-03 17:10:02
      -

      *Thread Reply:* I see. Keeping my fingers crossed this is the only thing delaying me right now.

      +

      *Thread Reply:* I see. Keeping my fingers crossed this is the only thing delaying me right now.

      @@ -47047,7 +47053,7 @@

      logger = logging.getLogger(name)

      2022-04-03 15:02:13
      -

      *Thread Reply:* current one handles cases where you have one CTE (like this test) but not multiple - next one will handle arbitrary number of CTEs (like this test)

      +

      *Thread Reply:* current one handles cases where you have one CTE (like this test) but not multiple - next one will handle arbitrary number of CTEs (like this test)

      @@ -47099,7 +47105,7 @@

      logger = logging.getLogger(name)

      2022-04-04 11:11:53
      -

      *Thread Reply:* I've mentioned it before but I want to talk a bit about new SQL parser

      +

      *Thread Reply:* I've mentioned it before but I want to talk a bit about new SQL parser

      @@ -47129,7 +47135,7 @@

      logger = logging.getLogger(name)

      2022-04-04 13:25:17
      -

      *Thread Reply:* Will the parser be released after the 13?

      +

      *Thread Reply:* Will the parser be released after the 13?

      @@ -47155,7 +47161,7 @@

      logger = logging.getLogger(name)

      2022-04-08 11:47:05
      -

      *Thread Reply:* @Michael Robinson added additional item to Agenda - client transports feature that we'll have in next release

      +

      *Thread Reply:* @Michael Robinson added additional item to Agenda - client transports feature that we'll have in next release

      @@ -47185,7 +47191,7 @@

      logger = logging.getLogger(name)

      2022-04-08 12:56:44
      -

      *Thread Reply:* Thanks, Maciej

      +

      *Thread Reply:* Thanks, Maciej

      @@ -47241,7 +47247,7 @@

      logger = logging.getLogger(name)

      2022-04-05 10:28:08
      -

*Thread Reply:* I had kind of the same question.

+

*Thread Reply:* I had kind of the same question.
I found https://marquezproject.github.io/marquez/openapi.html#tag/Lineage
With some of the entries marked Deprecated, I am not sure how to proceed.

      @@ -47269,7 +47275,7 @@

      logger = logging.getLogger(name)

      2022-04-05 11:55:35
      -

      *Thread Reply:* Hey folks, are you looking for the OpenAPI specification found here?

      +

      *Thread Reply:* Hey folks, are you looking for the OpenAPI specification found here?

      @@ -47295,7 +47301,7 @@

      logger = logging.getLogger(name)

      2022-04-05 15:33:23
      -

      *Thread Reply:* @Patrick Mol, Marquez's deprecated endpoints were the old methods for creating lineage (making jobs, dataset, and runs independently), they were deprecated because we moved over to using the OpenLineage spec for all lineage collection purposes.

      +

      *Thread Reply:* @Patrick Mol, Marquez's deprecated endpoints were the old methods for creating lineage (making jobs, dataset, and runs independently), they were deprecated because we moved over to using the OpenLineage spec for all lineage collection purposes.

      The GET methods for jobs/datasets/etc are still functional

      @@ -47323,7 +47329,7 @@

      logger = logging.getLogger(name)

      2022-04-05 21:10:39
      -

      *Thread Reply:* Hey John,

      +

      *Thread Reply:* Hey John,

      Thanks for sharing the OpenAPI docs. Was wondering if there are any means to setup OpenLineage API that will receive events without a consumer like Marquez or is it essential to always pair with a consumer to receive the events?

      @@ -47351,7 +47357,7 @@

      logger = logging.getLogger(name)

      2022-04-05 21:47:13
      -

*Thread Reply:* the OpenLineage integrations don’t have any way to receive events, since they’re designed to send events to other apps - what were you expecting OpenLineage to do?

+

*Thread Reply:* the OpenLineage integrations don’t have any way to receive events, since they’re designed to send events to other apps - what were you expecting OpenLineage to do?

      Marquez is our reference implementation of an OpenLineage consumer, but egeria also has a functional endpoint
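
Since the events are plain JSON over HTTP, any server that accepts the POSTed runEvents can act as a consumer. A minimal sketch (the route and port here just mirror Marquez's defaults):

from flask import Flask, request

app = Flask(__name__)

@app.route("/api/v1/lineage", methods=["POST"])
def lineage():
    event = request.get_json()
    # do whatever you like with the runEvent
    print(event["eventType"], event["job"]["namespace"], event["job"]["name"])
    return "", 200

if __name__ == "__main__":
    app.run(port=5000)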

      @@ -47379,7 +47385,7 @@

      logger = logging.getLogger(name)

      2022-04-06 09:53:31
      -

*Thread Reply:* Hi @John Thomas,

+

*Thread Reply:* Hi @John Thomas,
Would creation of Sources and Datasets have an equivalent in the OpenLineage specification? So far I only see the Inputs and Outputs in the Run Event spec.

      @@ -47407,7 +47413,7 @@

      logger = logging.getLogger(name)

      2022-04-06 11:31:10
      -

      *Thread Reply:* Inputs and outputs in the OL spec are Datasets in the old MZ spec, so they're equivalent

      +

      *Thread Reply:* Inputs and outputs in the OL spec are Datasets in the old MZ spec, so they're equivalent

      @@ -47461,7 +47467,7 @@

      logger = logging.getLogger(name)

      2022-04-05 15:16:34
      -

      *Thread Reply:* Hi Marco - it looks like LivyOperator itself does derive from BaseOperator, have you seen any other errors around this problem?

      +

      *Thread Reply:* Hi Marco - it looks like LivyOperator itself does derive from BaseOperator, have you seen any other errors around this problem?

      @Maciej Obuchowski might be more help here

      @@ -47489,7 +47495,7 @@

      logger = logging.getLogger(name)

      2022-04-05 15:21:03
      -

      *Thread Reply:* It is the operators that inherit from LivyOperator. It doesn't find the parameters like sql, connection etc

      +

      *Thread Reply:* It is the operators that inherit from LivyOperator. It doesn't find the parameters like sql, connection etc

      @@ -47515,7 +47521,7 @@

      logger = logging.getLogger(name)

      2022-04-05 15:25:42
      -

      *Thread Reply:* My guess is that operators that inherit from other operators (not baseoperator) will have the same problem

      +

      *Thread Reply:* My guess is that operators that inherit from other operators (not baseoperator) will have the same problem

      @@ -47541,7 +47547,7 @@

      logger = logging.getLogger(name)

      2022-04-05 15:32:13
      -

      *Thread Reply:* interesting! I'm not sure about that. I can look into it if I have time, but Maciej is definitely the person who would know the most.

      +

      *Thread Reply:* interesting! I'm not sure about that. I can look into it if I have time, but Maciej is definitely the person who would know the most.

      @@ -47567,7 +47573,7 @@

      logger = logging.getLogger(name)

      2022-04-06 15:49:48
      -

      *Thread Reply:* @Marco Diaz I wonder - perhaps it would be better to instrument spark with OpenLineage. It doesn’t seem that Airflow will know much about what’s happening underneath here. Have you looked into openlineage-spark?

      +

      *Thread Reply:* @Marco Diaz I wonder - perhaps it would be better to instrument spark with OpenLineage. It doesn’t seem that Airflow will know much about what’s happening underneath here. Have you looked into openlineage-spark?

      @@ -47593,7 +47599,7 @@

      logger = logging.getLogger(name)

      2022-04-06 15:51:57
      -

*Thread Reply:* I have not tried that library yet. I need to see how to implement it because we have several spark custom operators that use livy

+

*Thread Reply:* I have not tried that library yet. I need to see how to implement it because we have several spark custom operators that use livy

      @@ -47619,7 +47625,7 @@

      logger = logging.getLogger(name)

      2022-04-06 15:52:59
      -

      *Thread Reply:* Do you have any examples?

      +

      *Thread Reply:* Do you have any examples?

      @@ -47645,7 +47651,7 @@

      logger = logging.getLogger(name)

      2022-04-06 15:54:01
      -

      *Thread Reply:* there is a good blog post from @Michael Collado: https://openlineage.io/blog/openlineage-spark/

      +

      *Thread Reply:* there is a good blog post from @Michael Collado: https://openlineage.io/blog/openlineage-spark/

      openlineage.io
      @@ -47692,7 +47698,7 @@

      logger = logging.getLogger(name)

      2022-04-06 15:54:37
      -

*Thread Reply:* and the doc page here has a good overview:

+

*Thread Reply:* and the doc page here has a good overview:
https://openlineage.io/integration/apache-spark/

      @@ -47740,7 +47746,7 @@

      logger = logging.getLogger(name)

      2022-04-06 16:38:15
      -

*Thread Reply:* is this all we need to pass?

+

*Thread Reply:* is this all we need to pass?
spark-submit --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener" \
  --packages "io.openlineage:openlineage_spark:0.2.+" \
  --conf "spark.openlineage.host=http://<your_ol_endpoint>" \

@@ -47771,7 +47777,7 @@

      logger = logging.getLogger(name)

      2022-04-06 16:38:49
      -

      *Thread Reply:* If so, yes our operators have a way to pass configurations to spark and we may be able to implement it.

      +

      *Thread Reply:* If so, yes our operators have a way to pass configurations to spark and we may be able to implement it.

      @@ -47797,7 +47803,7 @@

      logger = logging.getLogger(name)

      2022-04-06 16:41:27
      -

      *Thread Reply:* Looks right to me

      +

      *Thread Reply:* Looks right to me

      @@ -47823,7 +47829,7 @@

      logger = logging.getLogger(name)

      2022-04-06 16:42:03
      -

      *Thread Reply:* Will give it a try

      +

      *Thread Reply:* Will give it a try

      @@ -47849,7 +47855,7 @@

      logger = logging.getLogger(name)

      2022-04-06 16:42:50
      -

      *Thread Reply:* Do we have to install the library on the spark side or the airflow side?

      +

      *Thread Reply:* Do we have to install the library on the spark side or the airflow side?

      @@ -47875,7 +47881,7 @@

      logger = logging.getLogger(name)

      2022-04-06 16:42:58
      -

      *Thread Reply:* I assume is the spark side

      +

      *Thread Reply:* I assume is the spark side

      @@ -47901,7 +47907,7 @@

      logger = logging.getLogger(name)

      2022-04-06 16:44:25
      -

*Thread Reply:* The --packages argument tells spark where to get the jar (you'll want to upgrade to 0.6.1)

+

*Thread Reply:* The --packages argument tells spark where to get the jar (you'll want to upgrade to 0.6.1)
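
For operators that build the session themselves, the same settings can be passed in code rather than on the spark-submit command line - a sketch in PySpark (host and namespace values are placeholders):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("openlineage_example")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.6.1")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.host", "http://localhost:5000")
    .config("spark.openlineage.namespace", "my-namespace")
    .getOrCreate()
)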

      @@ -47927,7 +47933,7 @@

      logger = logging.getLogger(name)

      2022-04-06 16:44:54
      -

      *Thread Reply:* sounds good

      +

      *Thread Reply:* sounds good

      @@ -47979,7 +47985,7 @@

      logger = logging.getLogger(name)

      2022-04-06 04:54:27
      -

      *Thread Reply:* @Will Johnson

      +

      *Thread Reply:* @Will Johnson

      @@ -48005,7 +48011,7 @@

      logger = logging.getLogger(name)

      2022-04-07 12:43:27
      -

      *Thread Reply:* Hey @Varun Singh! We are building a github repository that deploys a few resources that will support a limited number of Azure data sources being pushed into Azure Purview. You can expect a public release near the end of the month! Feel free to direct message me if you'd like more details!

      +

      *Thread Reply:* Hey @Varun Singh! We are building a github repository that deploys a few resources that will support a limited number of Azure data sources being pushed into Azure Purview. You can expect a public release near the end of the month! Feel free to direct message me if you'd like more details!

      @@ -48091,7 +48097,7 @@

      logger = logging.getLogger(name)

      2022-04-07 01:00:43
      -

      *Thread Reply:* Are both airflow2 and Marquez installed locally on your computer?

      +

      *Thread Reply:* Are both airflow2 and Marquez installed locally on your computer?

      @@ -48117,7 +48123,7 @@

      logger = logging.getLogger(name)

      2022-04-07 09:04:19
      -

      *Thread Reply:* yes Marco

      +

      *Thread Reply:* yes Marco

      @@ -48143,7 +48149,7 @@

      logger = logging.getLogger(name)

      2022-04-07 15:00:18
      -

*Thread Reply:* can you open marquez on

+

*Thread Reply:* can you open marquez on
<http://localhost:3000>

      @@ -48170,7 +48176,7 @@

      logger = logging.getLogger(name)

      2022-04-07 15:00:40
      -

*Thread Reply:* and get a response from

+

*Thread Reply:* and get a response from
<http://localhost:5000/api/v1/namespaces>

      @@ -48197,7 +48203,7 @@

      logger = logging.getLogger(name)

      2022-04-07 15:26:41
      -

*Thread Reply:* yes, I used this guide https://openlineage.io/getting-started and executed a POST to marquez correctly

+

*Thread Reply:* yes, I used this guide https://openlineage.io/getting-started and executed a POST to marquez correctly

      openlineage.io
      @@ -48244,7 +48250,7 @@

      logger = logging.getLogger(name)

      2022-04-07 22:17:34
      -

      *Thread Reply:* In theory you should receive events in jobs under airflow namespace

      +

      *Thread Reply:* In theory you should receive events in jobs under airflow namespace

      @@ -48305,7 +48311,7 @@

      logger = logging.getLogger(name)

      2022-04-07 14:59:06
      -

      *Thread Reply:* It looks like you need to add a payment method to your DBT account

      +

      *Thread Reply:* It looks like you need to add a payment method to your DBT account

      @@ -48357,7 +48363,7 @@

      logger = logging.getLogger(name)

      2022-04-11 12:50:48
      -

      *Thread Reply:* It does, but admittedly not very well. It can't recognize what you're doing inside your tasks. The good news is that we're working on it and long term everything should work well.

      +

      *Thread Reply:* It does, but admittedly not very well. It can't recognize what you're doing inside your tasks. The good news is that we're working on it and long term everything should work well.

      @@ -48387,7 +48393,7 @@

      logger = logging.getLogger(name)

      2022-04-11 12:58:28
      -

      *Thread Reply:* Thanks for the quick reply Maciej.

      +

      *Thread Reply:* Thanks for the quick reply Maciej.

      @@ -48447,7 +48453,7 @@

      logger = logging.getLogger(name)

      2022-04-12 13:56:53
      -

      *Thread Reply:* Hi Sandeep,

      +

      *Thread Reply:* Hi Sandeep,

      1&3: We don't currently have Hive or Presto on the roadmap! The best way to start the conversation around them would be to create a proposal in the OpenLineage repo, outlining your thoughts on implementation and benefits.

      @@ -48484,7 +48490,7 @@

      logger = logging.getLogger(name)

      2022-04-12 15:39:23
      -

*Thread Reply:* Thank you John

+

*Thread Reply:* Thank you John
The standard facets links to the github issues currently

      @@ -48511,7 +48517,7 @@

      logger = logging.getLogger(name)

      2022-04-12 15:40:33
      2022-04-12 15:41:01
      -

      *Thread Reply:* Will check it out thank you

      +

      *Thread Reply:* Will check it out thank you

      @@ -48660,7 +48666,7 @@

      logger = logging.getLogger(name)

      2022-04-12 16:08:30
      -

      *Thread Reply:* is there anything in your marquez-api logs that might indicate issues?

      +

      *Thread Reply:* is there anything in your marquez-api logs that might indicate issues?

      What guide did you follow to setup the spark integration?

      @@ -48688,7 +48694,7 @@

      logger = logging.getLogger(name)

      2022-04-12 16:10:07
      -

      *Thread Reply:* Followed this guide https://openlineage.io/integration/apache-spark/ and used the spark-defaults.conf approach

      +

      *Thread Reply:* Followed this guide https://openlineage.io/integration/apache-spark/ and used the spark-defaults.conf approach

      @@ -48714,7 +48720,7 @@

      logger = logging.getLogger(name)

      2022-04-12 16:11:04
      -

*Thread Reply:* The logs from dataproc side show no errors, let me check from the marquez api side

+

*Thread Reply:* The logs from dataproc side show no errors, let me check from the marquez api side
To confirm, we should be able to see the datasets from the marquez UI with the spark integration right?

      @@ -48741,7 +48747,7 @@

      logger = logging.getLogger(name)

      2022-04-12 16:11:50
      -

      *Thread Reply:* I'm not super familiar with the spark integration, since I work more with airflow - I'd start with looking through the readme for the spark integration here

      +

      *Thread Reply:* I'm not super familiar with the spark integration, since I work more with airflow - I'd start with looking through the readme for the spark integration here

      @@ -48767,7 +48773,7 @@

      logger = logging.getLogger(name)

      2022-04-12 16:14:44
      -

      *Thread Reply:* Hmm, the readme says it aims to generate the input and output datasets

      +

      *Thread Reply:* Hmm, the readme says it aims to generate the input and output datasets

      @@ -48793,7 +48799,7 @@

      logger = logging.getLogger(name)

      2022-04-12 16:40:38
      -

      *Thread Reply:* Are you looking at the same namespace?

      +

      *Thread Reply:* Are you looking at the same namespace?

      @@ -48819,7 +48825,7 @@

      logger = logging.getLogger(name)

      2022-04-12 16:40:51
      -

      *Thread Reply:* Yes, the same one where I can see the job

      +

      *Thread Reply:* Yes, the same one where I can see the job

      @@ -48845,7 +48851,7 @@

      logger = logging.getLogger(name)

      2022-04-12 16:54:49
      -

      *Thread Reply:* Tailing the API logs and rerunning the spark job now to hopefully catch errors if any, will ping back here

      +

      *Thread Reply:* Tailing the API logs and rerunning the spark job now to hopefully catch errors if any, will ping back here

      @@ -48871,7 +48877,7 @@

      logger = logging.getLogger(name)

      2022-04-12 17:01:10
      -

      *Thread Reply:* Don’t see any failures in the logs, any suggestions on how to debug this ?

      +

      *Thread Reply:* Don’t see any failures in the logs, any suggestions on how to debug this ?

      @@ -48897,7 +48903,7 @@

      logger = logging.getLogger(name)

      2022-04-12 17:08:24
      -

      *Thread Reply:* I'd next set up a basic spark notebook and see if you can't get it to send dataset information on something simple in order to check if it's a setup issue or a problem with your spark job specifically

      +

      *Thread Reply:* I'd next set up a basic spark notebook and see if you can't get it to send dataset information on something simple in order to check if it's a setup issue or a problem with your spark job specifically
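A sketch of such a smoke test (paths are placeholders; assumes a notebook where spark is already configured with the OpenLineage listener). The read/write pair should produce runEvents with non-empty inputs/outputs:
```python
# Minimal CSV -> Parquet job to exercise the listener
df = spark.read.option("header", "true").csv("/tmp/input/events.csv")
df.write.mode("overwrite").parquet("/tmp/output/events_parquet/")
```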

      @@ -48923,7 +48929,7 @@

      logger = logging.getLogger(name)

      2022-04-12 17:14:43
      -

      *Thread Reply:* ok, that sounds good, will try that

      +

      *Thread Reply:* ok, that sounds good, will try that

      @@ -48949,7 +48955,7 @@

      logger = logging.getLogger(name)

      2022-04-12 17:16:06
      -

*Thread Reply:* before that, I see that the spark-lineage integration posts lineage to the api +

*Thread Reply:* before that, I see that the spark-lineage integration posts lineage to the api
https://marquezproject.github.io/marquez/openapi.html#tag/Lineage/paths/~1lineage/post
We don’t seem to add a DataSet in this; does marquez internally create this “dataset” based on Output and fields?

      @@ -48977,7 +48983,7 @@

      logger = logging.getLogger(name)

      2022-04-12 17:16:34
      -

      *Thread Reply:* yeah, you should be seeing "input" and "output" in the runEvents - that's where datasets come from

      +

      *Thread Reply:* yeah, you should be seeing "input" and "output" in the runEvents - that's where datasets come from

      @@ -49003,7 +49009,7 @@

      logger = logging.getLogger(name)

      2022-04-12 17:17:00
      -

      *Thread Reply:* I'm not sure if it's a problem with your specific spark job or with the integration itself, however

      +

      *Thread Reply:* I'm not sure if it's a problem with your specific spark job or with the integration itself, however

      @@ -49029,7 +49035,7 @@

      logger = logging.getLogger(name)

      2022-04-12 17:19:16
      -

*Thread Reply:* By runEvents, do you mean a job Object or a lineage Object? +

*Thread Reply:* By runEvents, do you mean a job Object or a lineage Object?
The integration seems to be only POSTing lineage objects

      @@ -49056,7 +49062,7 @@

      logger = logging.getLogger(name)

      2022-04-12 17:20:34
      -

*Thread Reply:* yep, a runEvent is the body that gets POSTed to the /lineage endpoint:

      +

*Thread Reply:* yep, a runEvent is the body that gets POSTed to the /lineage endpoint:

      https://openlineage.io/docs/openapi/
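For illustration, a minimal runEvent with inputs/outputs, POSTed the same way the integrations do (namespaces and names here are made up):
```python
import uuid
import requests
from datetime import datetime, timezone

event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "my-namespace", "name": "csv_to_parquet"},
    # Marquez derives datasets from these two arrays
    "inputs": [{"namespace": "my-namespace", "name": "raw.events_csv"}],
    "outputs": [{"namespace": "my-namespace", "name": "curated.events"}],
    "producer": "https://github.com/OpenLineage/OpenLineage",
    "schemaURL": "https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunEvent",
}
requests.post("http://localhost:5000/api/v1/lineage", json=event)
```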

      @@ -49088,7 +49094,7 @@

      logger = logging.getLogger(name)

      2022-04-12 17:41:01
      -

      *Thread Reply:* > Yes, the same one where I can see the job +

*Thread Reply:* > Yes, the same one where I can see the job
I think you should look at another namespace, whose name depends on what systems you're actually using

      @@ -49115,7 +49121,7 @@

      logger = logging.getLogger(name)

      2022-04-12 17:48:24
      -

*Thread Reply:* Shouldn’t the dataset be created in the same namespace we define in the spark properties?

      +

*Thread Reply:* Shouldn’t the dataset be created in the same namespace we define in the spark properties?

      @@ -49141,7 +49147,7 @@

      logger = logging.getLogger(name)

      2022-04-15 10:19:06
      -

*Thread Reply:* I found a few datasets in the table location. I ran it in a similar setup (hive metastore, gcs, sparksql and scala spark jobs) to the one mentioned in this post https://openlineage.slack.com/archives/C01CK9T7HKR/p1649967405659519

      +

*Thread Reply:* I found a few datasets in the table location. I ran it in a similar setup (hive metastore, gcs, sparksql and scala spark jobs) to the one mentioned in this post https://openlineage.slack.com/archives/C01CK9T7HKR/p1649967405659519

      @@ -49288,7 +49294,7 @@

      logger = logging.getLogger(name)

      2022-04-15 10:17:55
      -

      *Thread Reply:* This was my experience as well, I was under the impression we would see the table as a dataset. +

      *Thread Reply:* This was my experience as well, I was under the impression we would see the table as a dataset. Looking forward to understanding the expected behavior

      @@ -49315,7 +49321,7 @@

      logger = logging.getLogger(name)

      2022-04-15 10:39:34
      -

      *Thread Reply:* relevant: https://github.com/OpenLineage/OpenLineage/issues/435

      +

      *Thread Reply:* relevant: https://github.com/OpenLineage/OpenLineage/issues/435

      @@ -49371,7 +49377,7 @@

      logger = logging.getLogger(name)

      2022-04-15 12:36:08
      -

      *Thread Reply:* Ah! Thank you both for confirming this! And it's great to see the proposal, Maciej!

      +

      *Thread Reply:* Ah! Thank you both for confirming this! And it's great to see the proposal, Maciej!

      @@ -49397,7 +49403,7 @@

      logger = logging.getLogger(name)

      2022-06-10 12:37:41
      -

      *Thread Reply:* Is there a timeline around when we can expect this fix ?

      +

      *Thread Reply:* Is there a timeline around when we can expect this fix ?

      @@ -49423,7 +49429,7 @@

      logger = logging.getLogger(name)

      2022-06-10 12:46:47
      -

      *Thread Reply:* Not a simple fix, but I guess we'll start working on this relatively soon.

      +

      *Thread Reply:* Not a simple fix, but I guess we'll start working on this relatively soon.

      @@ -49449,7 +49455,7 @@

      logger = logging.getLogger(name)

      2022-06-10 13:10:31
      -

*Thread Reply:* I see, thanks for the update! We are very much interested in this feature.

      +

*Thread Reply:* I see, thanks for the update! We are very much interested in this feature.

      @@ -49566,7 +49572,7 @@

      logger = logging.getLogger(name)

      2022-04-19 13:40:54
      -

      *Thread Reply:* You probably need to change dataset from default

      +

      *Thread Reply:* You probably need to change dataset from default

      @@ -49592,7 +49598,7 @@

      logger = logging.getLogger(name)

      2022-04-19 13:47:04
      -

*Thread Reply:* I clicked on everything 😕 I manually created a namespace (joining the pod and sending curl to the local marquez endpoint) to check if there was a network issue; that was ok. I created a namespace called data-dev. Airflow is mounted on k8s using the helm chart. +

*Thread Reply:* I clicked on everything 😕 I manually created a namespace (joining the pod and sending curl to the local marquez endpoint) to check if there was a network issue; that was ok. I created a namespace called data-dev. Airflow is mounted on k8s using the helm chart.
```
config:
  AIRFLOW__WEBSERVER__BASE_URL: "http://airflow.dev.test.io"
  PYTHONPATH: "/opt/airflow/dags/repo/config"
@@ -49644,7 +49650,7 @@

      logger = logging.getLogger(name)

      2022-04-19 15:16:47
      -

*Thread Reply:* I think the answer is somewhere in the airflow logs 🙂 +

*Thread Reply:* I think the answer is somewhere in the airflow logs 🙂
For some reason, OpenLineage events aren't sent to Marquez.

      @@ -49671,7 +49677,7 @@

      logger = logging.getLogger(name)

      2022-04-20 11:08:09
      -

*Thread Reply:* Thanks, it turned out to be my error .. I created a dummy dag to see if maybe it was an issue with the dag, and now I can see something in Marquez

      +

*Thread Reply:* Thanks, it turned out to be my error .. I created a dummy dag to see if maybe it was an issue with the dag, and now I can see something in Marquez

      @@ -49732,7 +49738,7 @@

      logger = logging.getLogger(name)

      2022-04-20 08:20:35
      -

      *Thread Reply:* That's more of a Marquez question 🙂 +

      *Thread Reply:* That's more of a Marquez question 🙂 We have a long-standing issue to add that API https://github.com/MarquezProject/marquez/issues/1736

      @@ -49759,7 +49765,7 @@

      logger = logging.getLogger(name)

      2022-04-20 09:32:19
      -

*Thread Reply:* I see it already got skipped for 2 releases, and my only conclusion is that people using Marquez don't make mistakes - ergo, API not needed 🙂 Let's see if I can stick around the project long enough to offer a bit of help; for now I just need to showcase it and get interest in my org.

      +

*Thread Reply:* I see it already got skipped for 2 releases, and my only conclusion is that people using Marquez don't make mistakes - ergo, API not needed 🙂 Let's see if I can stick around the project long enough to offer a bit of help; for now I just need to showcase it and get interest in my org.

      @@ -49794,7 +49800,7 @@

      logger = logging.getLogger(name)

      Any thoughts?

      - + @@ -49803,7 +49809,7 @@

      logger = logging.getLogger(name)

      - + @@ -49919,7 +49925,7 @@

      logger = logging.getLogger(name)

      2022-04-20 18:11:19
      -

*Thread Reply:* Namespace is defined by you; it does not have to be the name of the spark cluster.

      +

*Thread Reply:* Namespace is defined by you; it does not have to be the name of the spark cluster.
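i.e. the integration reports into whatever you configure; the value below is arbitrary:
```
spark.openlineage.namespace my_spark_jobs
```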

      @@ -49945,7 +49951,7 @@

      logger = logging.getLogger(name)

      2022-04-20 18:11:42
      -

*Thread Reply:* And I definitely recommend using a newer version than 0.2.+ 🙂

      +

*Thread Reply:* And I definitely recommend using a newer version than 0.2.+ 🙂

      @@ -49971,7 +49977,7 @@

      logger = logging.getLogger(name)

      2022-04-20 18:13:32
      -

*Thread Reply:* oh I see that someone mentioned that it has to be replaced with the name of the spark cluster

      +

*Thread Reply:* oh I see that someone mentioned that it has to be replaced with the name of the spark cluster

      @@ -49997,7 +50003,7 @@

      logger = logging.getLogger(name)

      2022-04-20 18:13:57
      -

      *Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1634089656188400?thread_ts=1634085740.187700&cid=C01CK9T7HKR

      +

      *Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1634089656188400?thread_ts=1634085740.187700&cid=C01CK9T7HKR

      @@ -50064,7 +50070,7 @@

      logger = logging.getLogger(name)

      2022-04-20 18:19:19
      -

*Thread Reply:* @Maciej Obuchowski may I know if I can add the --packages "io.openlineage:openlineage_spark:0.2.+" as part of the spark jar file, that is, as part of the pom.xml?

      +

*Thread Reply:* @Maciej Obuchowski may I know if I can add the --packages "io.openlineage:openlineage_spark:0.2.+" as part of the spark jar file, that is, as part of the pom.xml?

      @@ -50090,7 +50096,7 @@

      logger = logging.getLogger(name)

      2022-04-21 03:54:25
      -

      *Thread Reply:* I think it needs to run on the driver

      +

      *Thread Reply:* I think it needs to run on the driver

      @@ -50145,7 +50151,7 @@

      logger = logging.getLogger(name)

      2022-04-21 07:16:44
      -

      *Thread Reply:* > OpenLineage API is very limited in attributes that can be passed. +

*Thread Reply:* > OpenLineage API is very limited in attributes that can be passed.
Can you specify where you think it's limited? The way to solve those problems would be to evolve OpenLineage.

      > One practical question/example: how do we create a job of type STREAMING, @@ -50177,7 +50183,7 @@

      logger = logging.getLogger(name)

      2022-04-21 07:23:11
      -

      *Thread Reply:* Hey Maciej

      +

      *Thread Reply:* Hey Maciej

      first off - thanks for being active on the channel!

      @@ -50214,7 +50220,7 @@

      logger = logging.getLogger(name)

      2022-04-21 07:53:59
      -

*Thread Reply:* > I just gave an example of how you would express a specific job type creation +

*Thread Reply:* > I just gave an example of how you would express a specific job type creation
Yes, but you're trying to achieve something by passing this parameter or creating a job in a certain way. We're trying to cover everything in the OpenLineage API. Even if we don't have everything, the spec from the beginning has focused on allowing custom data to be emitted via the custom facet mechanism.

      > I have the feeling I'm still missing some key concepts on how OpenLineage is designed. @@ -50244,7 +50250,7 @@

      logger = logging.getLogger(name)

      2022-04-21 11:29:20
      -

      *Thread Reply:* > Any pointers are welcome! +

*Thread Reply:* > Any pointers are welcome!
BTW: OpenLineage is an open standard. Everyone is welcome to contribute and discuss. All feedback ultimately helps us build better systems.

      @@ -50271,7 +50277,7 @@

      logger = logging.getLogger(name)

      2022-04-22 03:32:48
      -

      *Thread Reply:* I agree, but for now I'm more likely to be in the I didn't get it category, and not in the brilliant new idea category 🙂

      +

      *Thread Reply:* I agree, but for now I'm more likely to be in the I didn't get it category, and not in the brilliant new idea category 🙂

My temporary goal is to go over the documentation and write up the gaps that confused me (and the solutions), and maybe publish that as an article for a wider audience. So far I realized that:
• I don't get the naming convention - it became clearer that it's important with the Naming examples, but more info is needed
@@ -50339,7 +50345,7 @@

      logger = logging.getLogger(name)

      2022-04-21 10:21:33
      -

      *Thread Reply:* Driver logs from Databricks:

      +

      *Thread Reply:* Driver logs from Databricks:

      ```22/04/21 14:05:07 INFO EventEmitter: Lineage completed successfully: ResponseMessage(responseCode=200, body={}, error=null) {"eventType":"COMPLETE",[...], "schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunEvent"}

      @@ -50391,7 +50397,7 @@

      logger = logging.getLogger(name)

      2022-04-21 11:32:37
      -

      *Thread Reply:* @Karatuğ Ozan BİRCAN are you running on Spark 3.2? If yes, then new release should have fixed your problem: https://github.com/OpenLineage/OpenLineage/issues/609

      +

      *Thread Reply:* @Karatuğ Ozan BİRCAN are you running on Spark 3.2? If yes, then new release should have fixed your problem: https://github.com/OpenLineage/OpenLineage/issues/609

      @@ -50417,7 +50423,7 @@

      logger = logging.getLogger(name)

      2022-04-21 11:33:15
      -

      *Thread Reply:* Spark 3.1.2 with Scala 2.12

      +

      *Thread Reply:* Spark 3.1.2 with Scala 2.12

      @@ -50443,7 +50449,7 @@

      logger = logging.getLogger(name)

      2022-04-21 11:33:50
      -

      *Thread Reply:* In fact, I couldn't make it work in Spark 3.2. But I'll test it again. Thanks for the info.

      +

      *Thread Reply:* In fact, I couldn't make it work in Spark 3.2. But I'll test it again. Thanks for the info.

      @@ -50469,7 +50475,7 @@

      logger = logging.getLogger(name)

      2022-05-20 16:15:47
      -

      *Thread Reply:* Has this been resolved? +

*Thread Reply:* Has this been resolved?
I am facing the same issue with spark 3.2.

      @@ -50522,7 +50528,7 @@

      logger = logging.getLogger(name)

      2022-04-21 15:34:24
      -

      *Thread Reply:* I don't think that the facets are particularly strongly defined, but I would expect that it could be possible to see both on a pythonOperator that's executing SQL queries, depending on how the extractor was written

      +

      *Thread Reply:* I don't think that the facets are particularly strongly defined, but I would expect that it could be possible to see both on a pythonOperator that's executing SQL queries, depending on how the extractor was written

      @@ -50548,7 +50554,7 @@

      logger = logging.getLogger(name)

      2022-04-21 15:34:45
      -

      *Thread Reply:* ah sure, that makes sense

      +

      *Thread Reply:* ah sure, that makes sense

      @@ -50600,7 +50606,7 @@

      logger = logging.getLogger(name)

      2022-04-21 16:17:59
      -

      *Thread Reply:* We're actively working on it - expect it in next OpenLineage release. https://github.com/OpenLineage/OpenLineage/pull/645

      +

      *Thread Reply:* We're actively working on it - expect it in next OpenLineage release. https://github.com/OpenLineage/OpenLineage/pull/645

      @@ -50660,7 +50666,7 @@

      logger = logging.getLogger(name)

      2022-04-21 16:24:16
      -

      *Thread Reply:* nice -thanks!

      +

      *Thread Reply:* nice -thanks!

      @@ -50686,7 +50692,7 @@

      logger = logging.getLogger(name)

      2022-04-21 16:25:19
      -

      *Thread Reply:* Assuming we don't need to do anything except using the next update? Or do you expect that we need to change quite a lot of configs?

      +

      *Thread Reply:* Assuming we don't need to do anything except using the next update? Or do you expect that we need to change quite a lot of configs?

      @@ -50712,7 +50718,7 @@

      logger = logging.getLogger(name)

      2022-04-21 17:44:46
      -

      *Thread Reply:* No, it should be automatic.

      +

      *Thread Reply:* No, it should be automatic.

      @@ -50766,7 +50772,7 @@

      logger = logging.getLogger(name)

      2022-04-25 03:52:20
      -

      *Thread Reply:* Hey Will,

      +

      *Thread Reply:* Hey Will,

      while I would not consider myself in the team, I'm dabbling in OL, hitting walls and learning as I go. If I don't have enough experience to contribute, I'd be happy to at least proof-read and point out things which are not clear from a novice perspective. Let me know!

      @@ -50798,7 +50804,7 @@

      logger = logging.getLogger(name)

      2022-04-25 13:49:48
      -

      *Thread Reply:* I'll hold you to that @Mirko Raca 😉

      +

      *Thread Reply:* I'll hold you to that @Mirko Raca 😉

      @@ -50824,7 +50830,7 @@

      logger = logging.getLogger(name)

      2022-04-25 17:18:02
      -

      *Thread Reply:* I will support! I’ve done a few recent presentations on the internals of OpenLineage that might also be useful - maybe some diagrams can be reused.

      +

      *Thread Reply:* I will support! I’ve done a few recent presentations on the internals of OpenLineage that might also be useful - maybe some diagrams can be reused.

      @@ -50850,7 +50856,7 @@

      logger = logging.getLogger(name)

      2022-04-25 17:56:44
      -

      *Thread Reply:* Any chance you have links to those old presentations? Would be great to build off of an existing one and then update for some of the new naming conventions.

      +

      *Thread Reply:* Any chance you have links to those old presentations? Would be great to build off of an existing one and then update for some of the new naming conventions.

      @@ -50876,7 +50882,7 @@

      logger = logging.getLogger(name)

      2022-04-25 18:00:26
      -

      *Thread Reply:* the most recent one was an astronomer webinar

      +

      *Thread Reply:* the most recent one was an astronomer webinar

      happy to share the slides with you if you want 👍 here’s a PDF:

      @@ -50917,7 +50923,7 @@

      logger = logging.getLogger(name)

      2022-04-25 18:00:44
      -

      *Thread Reply:* the other ones have not been public, unfortunately 😕

      +

      *Thread Reply:* the other ones have not been public, unfortunately 😕

      @@ -50943,7 +50949,7 @@

      logger = logging.getLogger(name)

      2022-04-25 18:02:24
      -

      *Thread Reply:* architecture, object model, run lifecycle, naming conventions == the basics IMO

      +

      *Thread Reply:* architecture, object model, run lifecycle, naming conventions == the basics IMO

      @@ -50969,7 +50975,7 @@

      logger = logging.getLogger(name)

      2022-04-26 09:14:42
      -

      *Thread Reply:* Thank you so much, Ross! This is a great base to work from.

      +

      *Thread Reply:* Thank you so much, Ross! This is a great base to work from.

      @@ -51081,7 +51087,7 @@

      logger = logging.getLogger(name)

      2022-04-27 00:36:05
      -

      *Thread Reply:* For a spark job like that, you'd have at least four events:

      +

      *Thread Reply:* For a spark job like that, you'd have at least four events:

      1. START event - This represents the SparkSQLExecutionStart
      2. START event #2 - This represents a JobStart event
3. COMPLETE event - This represents a JobEnd event
4. COMPLETE event #2 - This represents a SparkSQLExecutionEnd event
For CSV to Parquet, you should be seeing inputs and outputs that match across each event. OpenLineage scans the logical plan and reports back the inputs / outputs / metadata across the different facets for each event BECAUSE each event might give you some different information.
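Condensed, a single action would emit something like this (runId and job name are illustrative):
```
START    runId=6f7c...  job=my_app.execute_insert_into_hadoop_fs_relation_command   (SparkSQLExecutionStart)
START    runId=6f7c...  job=my_app.execute_insert_into_hadoop_fs_relation_command   (JobStart)
COMPLETE runId=6f7c...  job=my_app.execute_insert_into_hadoop_fs_relation_command   (JobEnd)
COMPLETE runId=6f7c...  job=my_app.execute_insert_into_hadoop_fs_relation_command   (SparkSQLExecutionEnd)
```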
@@ -51115,7 +51121,7 @@

        logger = logging.getLogger(name)

      2022-04-27 21:51:07
      -

*Thread Reply:* Hi @Will Johnson, good evening. We are seeing an issue while using the spark integration: when we provide the openlineage.host property a value like <http://lineage.com/common/marquez>, where my marquez api is running, I see that the line below is modifying the host to become <http://lineage.com/api/v1/lineage> instead of <http://lineage.com/common/marquez/api/v1/lineage>, which is causing the problem +

*Thread Reply:* Hi @Will Johnson, good evening. We are seeing an issue while using the spark integration: when we provide the openlineage.host property a value like <http://lineage.com/common/marquez>, where my marquez api is running, I see that the line below is modifying the host to become <http://lineage.com/api/v1/lineage> instead of <http://lineage.com/common/marquez/api/v1/lineage>, which is causing the problem
https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/src/main/common/java/io/openlineage/spark/agent/EventEmitter.java#L49
I see that it was added 5 months ago and released as part of 0.4.0. Is there any way we can fix the line to be like the below?
this.lineageURI =
@@ -51175,7 +51181,7 @@

      logger = logging.getLogger(name)

      2022-04-28 14:31:42
      -

      *Thread Reply:* Can you open up a Github issue for this? I had this same issue and so our implementation always has to feature the /api/v1/lineage. The host config is literally the host. You're specifying a host and path. I'd be happy to see greater flexibility with the api endpoint but the /v1/ is important to know which version of OpenLineage's specification you're communicating with.

      +

      *Thread Reply:* Can you open up a Github issue for this? I had this same issue and so our implementation always has to feature the /api/v1/lineage. The host config is literally the host. You're specifying a host and path. I'd be happy to see greater flexibility with the api endpoint but the /v1/ is important to know which version of OpenLineage's specification you're communicating with.
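A sketch of the behavior being described (not the actual implementation): the client rebuilds the URL from the host's scheme and authority, so any path component in the configured host is dropped:
```python
from urllib.parse import urlparse

host = "http://lineage.com/common/marquez"  # value passed as openlineage.host
p = urlparse(host)
print(f"{p.scheme}://{p.netloc}/api/v1/lineage")
# -> http://lineage.com/api/v1/lineage  (the /common/marquez path is lost)
```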

      @@ -51227,7 +51233,7 @@

      logger = logging.getLogger(name)

      2022-04-27 15:10:24
      -

      *Thread Reply:* @Michael Collado made one for a recent webinar:

      +

      *Thread Reply:* @Michael Collado made one for a recent webinar:

      https://gist.github.com/collado-mike/d1854958b7b1672f5a494933f80b8b58

      @@ -51255,7 +51261,7 @@

      logger = logging.getLogger(name)

      2022-04-27 15:11:38
      -

      *Thread Reply:* it's not exactly for an operator that has source-destination, but it shows how to format lineage events for a few different kinds of datasets

      +

      *Thread Reply:* it's not exactly for an operator that has source-destination, but it shows how to format lineage events for a few different kinds of datasets

      @@ -51281,7 +51287,7 @@

      logger = logging.getLogger(name)

      2022-04-27 15:51:32
      -

      *Thread Reply:* Thanks! I'm going to take a look

      +

      *Thread Reply:* Thanks! I'm going to take a look

      @@ -51337,7 +51343,7 @@

      logger = logging.getLogger(name)

      2022-04-28 17:44:00
      -

      *Thread Reply:* Thanks for your input. The release is authorized. Look for it tomorrow!

      +

      *Thread Reply:* Thanks for your input. The release is authorized. Look for it tomorrow!

      @@ -51390,7 +51396,7 @@

      logger = logging.getLogger(name)

      2022-04-28 14:41:10
      -

      *Thread Reply:* What's the spark job that's running - this looks similar to an error that can happen when jobs have a very short lifecycle

      +

      *Thread Reply:* What's the spark job that's running - this looks similar to an error that can happen when jobs have a very short lifecycle

      @@ -51416,7 +51422,7 @@

      logger = logging.getLogger(name)

      2022-04-28 14:47:27
      -

*Thread Reply:* nothing in the spark job, it's just a simple csv to parquet conversion

      +

*Thread Reply:* nothing in the spark job, it's just a simple csv to parquet conversion

      @@ -51442,7 +51448,7 @@

      logger = logging.getLogger(name)

      2022-04-28 14:48:50
      -

      *Thread Reply:* ah yeah that's probably it - when the job is finished before the Openlineage integration can poll it for information this error is thrown. Since the job is very quick it creates a race condition

      +

      *Thread Reply:* ah yeah that's probably it - when the job is finished before the Openlineage integration can poll it for information this error is thrown. Since the job is very quick it creates a race condition

      @@ -51472,7 +51478,7 @@

      logger = logging.getLogger(name)

      2022-05-03 17:16:39
      -

      *Thread Reply:* @John Thomas may i know how to solve this kind of issue?

      +

      *Thread Reply:* @John Thomas may i know how to solve this kind of issue?

      @@ -51498,7 +51504,7 @@

      logger = logging.getLogger(name)

      2022-05-03 17:20:11
      -

      *Thread Reply:* This is probably an issue with the integration - for now you can either open an issue, or see if you're still getting a subset of events and take it as is. I'm not sure what you could do on your end aside from adding a sleep call or similar

      +

      *Thread Reply:* This is probably an issue with the integration - for now you can either open an issue, or see if you're still getting a subset of events and take it as is. I'm not sure what you could do on your end aside from adding a sleep call or similar

      @@ -51524,7 +51530,7 @@

      logger = logging.getLogger(name)

      2022-05-03 17:21:17
      -

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/src/main/common/java/io/openlineage/spark/agent/OpenLineageSparkListener.java#L151 - did you mean that if we add a sleep in this method it will solve this?

      +

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/src/main/common/java/io/openlineage/spark/agent/OpenLineageSparkListener.java#L151 - did you mean that if we add a sleep in this method it will solve this?

      @@ -51575,7 +51581,7 @@

      logger = logging.getLogger(name)

      2022-05-03 18:44:43
      -

      *Thread Reply:* oh no I meant making sure your jobs don't close too quickly

      +

      *Thread Reply:* oh no I meant making sure your jobs don't close too quickly

      @@ -51601,7 +51607,7 @@

      logger = logging.getLogger(name)

      2022-05-06 00:14:15
      -

*Thread Reply:* Hi @John Thomas, we figured out the error: it was indeed caused by conflicting versions; with shadowJar and shading, we are not seeing it anymore.

      +

*Thread Reply:* Hi @John Thomas, we figured out the error: it was indeed caused by conflicting versions; with shadowJar and shading, we are not seeing it anymore.

      @@ -51661,7 +51667,7 @@

      logger = logging.getLogger(name)

      2022-04-29 18:41:37
      -

      *Thread Reply:* Amazing work on the new sql parser @Maciej Obuchowski 💯 :firstplacemedal:

      +

      *Thread Reply:* Amazing work on the new sql parser @Maciej Obuchowski 💯 :firstplacemedal:

      @@ -51747,7 +51753,7 @@

      logger = logging.getLogger(name)

      2022-05-03 04:12:09
      -

      *Thread Reply:* Hi @Hubert Dulay,

      +

      *Thread Reply:* Hi @Hubert Dulay,

while I'm not an expert, I can offer the following:
• Marquez has had the but what I got here - that API is not encouraged
@@ -51779,7 +51785,7 @@

      logger = logging.getLogger(name)

      2022-05-03 21:19:06
      -

      *Thread Reply:* Thanks for the details

      +

      *Thread Reply:* Thanks for the details

      @@ -51831,7 +51837,7 @@

      logger = logging.getLogger(name)

      2022-05-02 13:26:52
      -

      *Thread Reply:* Hi Kostikey - this blog has an example with Spark and jupyter, which might be a good place to start!

      +

      *Thread Reply:* Hi Kostikey - this blog has an example with Spark and jupyter, which might be a good place to start!

      openlineage.io
      @@ -51878,7 +51884,7 @@

      logger = logging.getLogger(name)

      2022-05-02 14:58:29
      -

      *Thread Reply:* Hi @John Thomas, thanks for the reply. I think I am close but my cluster is unable to talk to the marquez server. After looking at log4j I see the following rows:

      +

      *Thread Reply:* Hi @John Thomas, thanks for the reply. I think I am close but my cluster is unable to talk to the marquez server. After looking at log4j I see the following rows:

      22/05/02 18:43:39 INFO SparkContext: Registered listener io.openlineage.spark.agent.OpenLineageSparkListener 22/05/02 18:43:40 INFO EventEmitter: Init OpenLineageContext: Args: ArgumentParser(host=<http://135.170.226.91:8400>, version=v1, namespace=gus-namespace, jobName=default, parentRunId=null, apiKey=Optional.empty, urlParams=Optional[{}]) URI: <http://135.170.226.91:8400/api/v1/lineage>? @@ -51955,7 +51961,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:03:40
      -

      *Thread Reply:* hmmm, interesting - if it's easy could you spin both up locally and check that it's just a communication issue? It helps with diagnosis

      +

      *Thread Reply:* hmmm, interesting - if it's easy could you spin both up locally and check that it's just a communication issue? It helps with diagnosis

      It might also be a firewall issue, but your cURL should preclude that

      @@ -51983,7 +51989,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:05:38
      -

*Thread Reply:* Since it's Databricks I was having a hard time figuring out how to try locally. Other than just using plain ol' spark on my laptop and a localhost Marquez...

      +

*Thread Reply:* Since it's Databricks I was having a hard time figuring out how to try locally. Other than just using plain ol' spark on my laptop and a localhost Marquez...

      @@ -52009,7 +52015,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:07:13
      -

      *Thread Reply:* hmm, that could be an interesting test to see if it's a databricks issue - the databricks integration is pretty much the same as the spark integration, just with a little bit of a wrapper and the init script

      +

      *Thread Reply:* hmm, that could be an interesting test to see if it's a databricks issue - the databricks integration is pretty much the same as the spark integration, just with a little bit of a wrapper and the init script

      @@ -52035,7 +52041,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:08:44
      -

*Thread Reply:* yeah, I was going to try that but it just didn't seem like helpful troubleshooting for exactly that reason... but I may just do that anyway, just so I can see something working 🙂 (morale booster)

      +

*Thread Reply:* yeah, I was going to try that but it just didn't seem like helpful troubleshooting for exactly that reason... but I may just do that anyway, just so I can see something working 🙂 (morale booster)

      @@ -52061,7 +52067,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:09:22
      -

      *Thread Reply:* oh totally! Network issues are a huge pain in the ass, and if you're still seeing issues locally with spark/mz then we'll know a lot more than we do now 🙂

      +

      *Thread Reply:* oh totally! Network issues are a huge pain in the ass, and if you're still seeing issues locally with spark/mz then we'll know a lot more than we do now 🙂

      @@ -52087,7 +52093,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:11:19
      -

      *Thread Reply:* sounds good, i will give it a go!

      +

      *Thread Reply:* sounds good, i will give it a go!

      @@ -52113,7 +52119,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:16:16
      -

      *Thread Reply:* @Kostikey Mustakas - I think spark.openlineage.version should be equal to 1 not v1.

      +

      *Thread Reply:* @Kostikey Mustakas - I think spark.openlineage.version should be equal to 1 not v1.

      In addition, is http://135.170.226.91:8400 accessible to Databricks? Could you try doing a %sh command inside of a databricks notebook and see if you can ping that IP address (https://linux.die.net/man/8/ping)?
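Putting those points together, the cluster's Spark config would look roughly like this (host and namespace taken from earlier in the thread):
```
spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.host http://135.170.226.91:8400
spark.openlineage.version 1
spark.openlineage.namespace gus-namespace
```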

      @@ -52164,7 +52170,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:19:22
      -

*Thread Reply:* Ya know, I meant to ask about that. Docs say 1 like you mention: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/databricks. I second-guessed from this thread https://openlineage.slack.com/archives/C01CK9T7HKR/p1638848249159700.

      +

*Thread Reply:* Ya know, I meant to ask about that. Docs say 1 like you mention: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/databricks. I second-guessed from this thread https://openlineage.slack.com/archives/C01CK9T7HKR/p1638848249159700.

      @@ -52225,7 +52231,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:23:42
      -

      *Thread Reply:* @Will Johnson, ping fails... this is surprising as the curl command mentioned above works fine.

      +

      *Thread Reply:* @Will Johnson, ping fails... this is surprising as the curl command mentioned above works fine.

      @@ -52251,7 +52257,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:37:00
      -

      *Thread Reply:* I’m also trying to set up Databricks according to Running Marquez on AWS. Right now I’m stuck on the database part rather than the Marquez part — I can’t connect my EKS cluster to the RDS database which I described in more detail on the Marquez slack.

      +

      *Thread Reply:* I’m also trying to set up Databricks according to Running Marquez on AWS. Right now I’m stuck on the database part rather than the Marquez part — I can’t connect my EKS cluster to the RDS database which I described in more detail on the Marquez slack.

      @Kostikey Mustakas Sorry for the distraction, but I’m curious how you have set up your networking to make the API requests work with Databricks. Good luck with your issue!

      @@ -52280,7 +52286,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:47:17
      -

*Thread Reply:* @Julius Rentergent We are using Azure and leverage Private Endpoints to connect resources in separate subscriptions. There is a Bastion proxy in place that we can map http traffic through, and I have a Load Balancer Inbound NAT rule I set up that maps one of our whitelisted ports (8400) to 5000.

      +

*Thread Reply:* @Julius Rentergent We are using Azure and leverage Private Endpoints to connect resources in separate subscriptions. There is a Bastion proxy in place that we can map http traffic through, and I have a Load Balancer Inbound NAT rule I set up that maps one of our whitelisted ports (8400) to 5000.

      @@ -52310,7 +52316,7 @@

      logger = logging.getLogger(name)

      2022-05-02 20:15:01
      -

      *Thread Reply:* @Will Johnson a little progress maybe... I created a private endpoint and updated dns to point to it. Now I get a 404 Not Found error instead of a timeout

      +

      *Thread Reply:* @Will Johnson a little progress maybe... I created a private endpoint and updated dns to point to it. Now I get a 404 Not Found error instead of a timeout

      @@ -52340,7 +52346,7 @@

      logger = logging.getLogger(name)

      2022-05-02 20:16:41
      -

      *Thread Reply:* 22/05/03 00:09:24 ERROR EventEmitter: Could not emit lineage [responseCode=404]: {"eventType":"START","eventTime":"2022-05-03T00:09:22.498Z","run":{"runId":"f41575a0-e59d-4cbc-a401-9b52d2b020e0","facets":{"spark_version":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","spark-version":"3.2.1","openlineage_spark_version":"0.8.1"},"spark.logicalPlan":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","plan":[{"class":"org.apache.spark.sql.catalyst.plans.logical.ShowNamespaces","num-children":1,"namespace":0,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"databaseName","dataType":"string","nullable":false,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":4,"jvmId":"aad3656d_8903_4db3_84f0_fe6d773d71c3"},"qualifier":[]}]]},{"class":"org.apache.spark.sql.catalyst.analysis.ResolvedNamespace","num_children":0,"catalog":null,"namespace":[]}]},"spark_unknown":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","output":{"description":"Unable to serialize logical plan due to: Infinite recursion (StackOverflowError) (through reference chain: org.apache.spark.sql.catalyst.expressions.AttributeReference[\"preCanonicalized\"] .... +

      *Thread Reply:* 22/05/03 00:09:24 ERROR EventEmitter: Could not emit lineage [responseCode=404]: {"eventType":"START","eventTime":"2022-05-03T00:09:22.498Z","run":{"runId":"f41575a0-e59d-4cbc-a401-9b52d2b020e0","facets":{"spark_version":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","spark-version":"3.2.1","openlineage_spark_version":"0.8.1"},"spark.logicalPlan":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","plan":[{"class":"org.apache.spark.sql.catalyst.plans.logical.ShowNamespaces","num-children":1,"namespace":0,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"databaseName","dataType":"string","nullable":false,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":4,"jvmId":"aad3656d_8903_4db3_84f0_fe6d773d71c3"},"qualifier":[]}]]},{"class":"org.apache.spark.sql.catalyst.analysis.ResolvedNamespace","num_children":0,"catalog":null,"namespace":[]}]},"spark_unknown":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","output":{"description":"Unable to serialize logical plan due to: Infinite recursion (StackOverflowError) (through reference chain: org.apache.spark.sql.catalyst.expressions.AttributeReference[\"preCanonicalized\"] .... OpenLineageHttpException(code=null, message={"code":404,"message":"HTTP 404 Not Found"}, details=null) at io.openlineage.spark.agent.EventEmitter.emit(EventEmitter.java:68)

      @@ -52368,7 +52374,7 @@

      logger = logging.getLogger(name)

      2022-05-27 00:03:30
      -

      *Thread Reply:* Following up on this as I encounter the same issue with the Openlineage Databricks integration. This issue seems quite malicious as it crashes the Spark Context and requires a restart.

      +

      *Thread Reply:* Following up on this as I encounter the same issue with the Openlineage Databricks integration. This issue seems quite malicious as it crashes the Spark Context and requires a restart.

      I have marquez running on AWS EKS; I’m using Openlineage 0.8.2 on Databricks 10.4 (Spark 3.2.1) and my Spark config looks like this: spark.openlineage.host <https://internal-xxxxxxxxxxxxxxxxxxxxxxx-xxxxxxxxxxx.us-east-1.elb.amazonaws.com> @@ -52410,7 +52416,7 @@

      logger = logging.getLogger(name)

      2022-05-27 00:22:27
      -

      *Thread Reply:* I tried two more things: +

*Thread Reply:* I tried two more things:
• curl works, ping fails, just like in the previous report
• Databricks allows providing spark configs without quotes, whereas quotes are generally required for Spark. So I added the quotes to the host name, but now I’m getting:
ERROR OpenLineageSparkListener: Unable to parse open lineage endpoint. Lineage events will not be collected

      @@ -52438,7 +52444,7 @@

      logger = logging.getLogger(name)

      2022-05-27 14:00:38
      -

*Thread Reply:* @Kostikey Mustakas May I ask the reason for migrating from Palantir? Sorry for this off-topic question!

      +

*Thread Reply:* @Kostikey Mustakas May I ask the reason for migrating from Palantir? Sorry for this off-topic question!

      @@ -52464,7 +52470,7 @@

      logger = logging.getLogger(name)

      2022-05-30 05:46:27
      -

      *Thread Reply:* @Julius Rentergent created issue on project github: https://github.com/OpenLineage/OpenLineage/issues/795

      +

      *Thread Reply:* @Julius Rentergent created issue on project github: https://github.com/OpenLineage/OpenLineage/issues/795

      @@ -52547,7 +52553,7 @@

      logger = logging.getLogger(name)

      2022-06-01 11:15:26
      -

      *Thread Reply:* Thank you @Maciej Obuchowski. +

      *Thread Reply:* Thank you @Maciej Obuchowski. Just to clarify, the Spark Context crashes with and without port; it’s just that adding the port causes it to crash more quickly (on the 1st attempt).

      I will run some more experiments when I have time, and add the results to the ticket.

      @@ -52616,7 +52622,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:09:42
      -

      *Thread Reply:* @Michael Robinson - just to be sure, is the 5/19 meeting at 10 AM PT as well?

      +

      *Thread Reply:* @Michael Robinson - just to be sure, is the 5/19 meeting at 10 AM PT as well?

      @@ -52642,7 +52648,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:14:11
      -

      *Thread Reply:* Yes, and I’ll update the msg for others. Thank you

      +

      *Thread Reply:* Yes, and I’ll update the msg for others. Thank you

      @@ -52668,7 +52674,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:16:25
      -

      *Thread Reply:* Thank you!

      +

      *Thread Reply:* Thank you!

      @@ -52746,7 +52752,7 @@

      logger = logging.getLogger(name)

      2022-05-04 07:04:48
      -

      *Thread Reply:* It seems we can't for now. This was the same question I had last week:

      +

      *Thread Reply:* It seems we can't for now. This was the same question I had last week:

      https://github.com/MarquezProject/marquez/issues/1736

      logger = logging.getLogger(name)
      2022-05-04 10:56:35
      -

      *Thread Reply:* Seems that it's really popular request 🙂

      +

      *Thread Reply:* Seems that it's really popular request 🙂

      @@ -52876,7 +52882,7 @@

      logger = logging.getLogger(name)

      2022-05-04 11:28:48
      -

      *Thread Reply:* @Mirko Raca pointed out that I was missing eventType.

      +

      *Thread Reply:* @Mirko Raca pointed out that I was missing eventType.

      Mirko Raca : "From a quick glance - you're missing "eventType": "START", attribute. It's also worth noting that metadata typically shows up after the second event (type COMPLETE)"

      @@ -52937,7 +52943,7 @@

      logger = logging.getLogger(name)

      2022-05-06 05:26:16
      -

*Thread Reply:* As far as I understand, OpenLineage has tools to extract metadata from sources. Depending on your source, you could find an integration; if it doesn't exist, you should write your own integration (and collaborate with the project)

      +

*Thread Reply:* As far as I understand, OpenLineage has tools to extract metadata from sources. Depending on your source, you could find an integration; if it doesn't exist, you should write your own integration (and collaborate with the project)

      @@ -52963,7 +52969,7 @@

      logger = logging.getLogger(name)

      2022-05-06 12:59:06
      -

      *Thread Reply:* @Sandeep Bhat take a look at https://openlineage.io/integration - there is some info there on the different integrations that can be used to automatically pull metadata.

      +

      *Thread Reply:* @Sandeep Bhat take a look at https://openlineage.io/integration - there is some info there on the different integrations that can be used to automatically pull metadata.

      openlineage.io
      @@ -53010,7 +53016,7 @@

      logger = logging.getLogger(name)

      2022-05-06 13:00:39
      -

      *Thread Reply:* The Airflow integration, in particular, uses a SQL parser to determine input/output tables (in cases where the data store can't be queried for that info)

      +

      *Thread Reply:* The Airflow integration, in particular, uses a SQL parser to determine input/output tables (in cases where the data store can't be queried for that info)

      @@ -53062,7 +53068,7 @@

      logger = logging.getLogger(name)

      2022-05-12 05:48:43
      -

      *Thread Reply:* To my understanding - datasets model the structure, not the content. So, as long as your table doesn't change number of columns, it's the same thing.

      +

      *Thread Reply:* To my understanding - datasets model the structure, not the content. So, as long as your table doesn't change number of columns, it's the same thing.

      The catch-all would be to create a Dataset facet which would record the distinction between append/overwrite per run. But, while this is supported by the standard, Marquez does not handle custom facets at the moment (I'll happily be corrected).

      @@ -53090,7 +53096,7 @@

      logger = logging.getLogger(name)

      2022-05-12 06:05:36
      -

      *Thread Reply:* Thanks, that makes sense. We're looking for a way to get the lineage of table contents. We may have to opt for new names on overwrite, or indeed extend a facet to flag these.

      +

      *Thread Reply:* Thanks, that makes sense. We're looking for a way to get the lineage of table contents. We may have to opt for new names on overwrite, or indeed extend a facet to flag these.

      @@ -53116,7 +53122,7 @@

      logger = logging.getLogger(name)

      2022-05-12 06:06:44
      -

*Thread Reply:* The use case is compliance, where we need to show how a certain delivered data product (at a given point in time) was constructed. We have all our transforms/transfers as code, but there are a few parts where datasets get recreated in the process after fixes have been made, and I wouldn't want to bother the auditors with those stray paths

      +

*Thread Reply:* The use case is compliance, where we need to show how a certain delivered data product (at a given point in time) was constructed. We have all our transforms/transfers as code, but there are a few parts where datasets get recreated in the process after fixes have been made, and I wouldn't want to bother the auditors with those stray paths

      @@ -53142,7 +53148,7 @@

      logger = logging.getLogger(name)

      2022-05-12 06:12:09
      -

      *Thread Reply:* We have LifecycleStateChangeDataset facet that captures this information. It's currently emitted when using Spark integration

      +

      *Thread Reply:* We have LifecycleStateChangeDataset facet that captures this information. It's currently emitted when using Spark integration
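On the wire it appears as a facet on the affected dataset, roughly like this (producer/schemaURL shortened, names made up):
```
"outputs": [{
  "namespace": "my-namespace",
  "name": "my_table",
  "facets": {
    "lifecycleStateChange": {
      "_producer": "https://github.com/OpenLineage/OpenLineage/...",
      "_schemaURL": "https://openlineage.io/spec/facets/...",
      "lifecycleStateChange": "OVERWRITE"
    }
  }
}]
```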

      @@ -53168,7 +53174,7 @@

      logger = logging.getLogger(name)

      2022-05-12 06:13:25
      -

      *Thread Reply:* > But, while this is supported by the standard, Marquez does not handle custom facets at the moment (I'll happily be corrected). +

      *Thread Reply:* > But, while this is supported by the standard, Marquez does not handle custom facets at the moment (I'll happily be corrected). It displays this information when it exists

      @@ -53199,7 +53205,7 @@

      logger = logging.getLogger(name)

      2022-05-12 06:13:29
      -

      *Thread Reply:* Oh that looks perfect! I completely missed that, thanks!

      +

      *Thread Reply:* Oh that looks perfect! I completely missed that, thanks!

      @@ -53251,7 +53257,7 @@

      logger = logging.getLogger(name)

      2022-05-13 05:19:47
      -

      *Thread Reply:* Work with Spark is not yet fully merged

      +

      *Thread Reply:* Work with Spark is not yet fully merged

      @@ -53367,7 +53373,7 @@

      logger = logging.getLogger(name)

      2022-05-16 11:31:17
      -

      *Thread Reply:* I think you can either use proxy backend: https://github.com/OpenLineage/OpenLineage/tree/main/proxy

      +

      *Thread Reply:* I think you can either use proxy backend: https://github.com/OpenLineage/OpenLineage/tree/main/proxy

      or configure OL client to send data to kafka: https://github.com/OpenLineage/OpenLineage/tree/main/client/python#kafka
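For the second option, it's just a different transport in the client config, something like the below (topic and brokers are placeholders, and exact keys may differ by client version):
```
transport:
  type: kafka
  topic: openlineage.events
  config:
    bootstrap.servers: "broker1:9092,broker2:9092"
```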

      @@ -53400,7 +53406,7 @@

      logger = logging.getLogger(name)

      2022-05-16 12:15:59
      -

*Thread Reply:* Thank you very much for the useful pointers. The proxy solution could indeed work in our case, but it implies creating another service in front of Kafka, and thus another layer of complexity in the architecture. If there is a more "native" way of streaming events directly from the Airflow backend, that would be great to know

      +

*Thread Reply:* Thank you very much for the useful pointers. The proxy solution could indeed work in our case, but it implies creating another service in front of Kafka, and thus another layer of complexity in the architecture. If there is a more "native" way of streaming events directly from the Airflow backend, that would be great to know

      @@ -53426,7 +53432,7 @@

      logger = logging.getLogger(name)

      2022-05-16 12:37:10
      -

      *Thread Reply:* The second link 😉

      +

      *Thread Reply:* The second link 😉

      @@ -53452,7 +53458,7 @@

      logger = logging.getLogger(name)

      2022-05-17 03:46:03
      -

      *Thread Reply:* Sure, we already implemented the python client for jobs outside airflow and it works great 🙂 +

*Thread Reply:* Sure, we already implemented the python client for jobs outside airflow and it works great 🙂
You are saying that there is a way to use this python client in conjunction with the MWAA lineage backend to relay the job events that come with the airflow integration (without including it in the DAGs)?
Our strategy is to use both the airflow backend to collect automatic lineage events without modifying any existing DAGs, and the in-code implementation to allow our data engineers to send their own events if they want to. The second option works perfectly but the first one is where we struggle a bit, especially with MWAA.

      @@ -53481,7 +53487,7 @@

      logger = logging.getLogger(name)

      2022-05-17 05:24:30
      -

      *Thread Reply:* If you can mount file to MWAA, then yes - it should work with config file option: https://github.com/OpenLineage/OpenLineage/tree/main/client/python#config-file

      +

      *Thread Reply:* If you can mount file to MWAA, then yes - it should work with config file option: https://github.com/OpenLineage/OpenLineage/tree/main/client/python#config-file
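i.e. ship an openlineage.yml alongside the DAGs and point OPENLINEAGE_CONFIG at it; a minimal sketch (url is a placeholder):
```
transport:
  type: http
  url: http://my-marquez-host:5000
```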

      @@ -53507,7 +53513,7 @@

      logger = logging.getLogger(name)

      2022-05-17 05:40:45
      -

      *Thread Reply:* Brilliant! I'm going to test that. Thank you Maciej!

      +

      *Thread Reply:* Brilliant! I'm going to test that. Thank you Maciej!

      @@ -53663,7 +53669,7 @@

      logger = logging.getLogger(name)

      2022-05-18 16:34:25
      -

      *Thread Reply:* Hey Tyler - A custom extractor just needs to be able to assemble the runEvents and send the information out to the lineage backends.

      +

      *Thread Reply:* Hey Tyler - A custom extractor just needs to be able to assemble the runEvents and send the information out to the lineage backends.

      If the things you're sending/receiving with TaskFlow are accessible in terms of metadata in the environment the DAG is running in, then you should be able to make one that would work!

      @@ -53695,7 +53701,7 @@

      logger = logging.getLogger(name)

      2022-05-18 16:41:16
      -

      *Thread Reply:* Taskflow internally is just PythonOperator. If you'd write extractor that assumes something more than just it being PythonOperator then you'd probably make it work 🙂

      +

      *Thread Reply:* Taskflow internally is just PythonOperator. If you'd write extractor that assumes something more than just it being PythonOperator then you'd probably make it work 🙂
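A minimal skeleton along those lines (class and module names are made up; based on the BaseExtractor interface in openlineage-airflow of that era), registered via the OPENLINEAGE_EXTRACTORS variable that comes up further down:
```python
from typing import List, Optional

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata


class TaskFlowExtractor(BaseExtractor):
    """Illustrative extractor for TaskFlow tasks (_PythonDecoratedOperator)."""

    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        # TaskFlow-decorated tasks run as this operator under the hood
        return ["_PythonDecoratedOperator"]

    def extract(self) -> Optional[TaskMetadata]:
        # self.operator is the running task's operator; map whatever metadata
        # is reachable here to inputs/outputs
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            inputs=[],
            outputs=[],
        )
```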

      @@ -53721,7 +53727,7 @@

      logger = logging.getLogger(name)

      2022-05-18 17:15:52
      -

*Thread Reply:* Thanks @John Thomas @Maciej Obuchowski, your answers both make sense. I just keep running into this error in my logs: +

*Thread Reply:* Thanks @John Thomas @Maciej Obuchowski, your answers both make sense. I just keep running into this error in my logs:
[2022-05-18, 20:52:34 UTC] {__init__.py:97} WARNING - Unable to find an extractor. task_type=_PythonDecoratedOperator airflow_dag_id=Postgres_1_to_Snowflake_1_v3 task_id=Postgres_1 airflow_run_id=scheduled__2022-05-18T20:51:34.334045+00:00
The picture is my custom extractor; it's not doing anything currently, as this is just a test.

      @@ -53758,7 +53764,7 @@

      logger = logging.getLogger(name)

      2022-05-18 17:16:05
      -

      *Thread Reply:* thanks again for the help yall

      +

      *Thread Reply:* thanks again for the help yall

      @@ -53784,7 +53790,7 @@

      logger = logging.getLogger(name)

      2022-05-18 17:16:34
      -

      *Thread Reply:* did you set the environment variable with the path to your extractor?

      +

      *Thread Reply:* did you set the environment variable with the path to your extractor?

      @@ -53810,7 +53816,7 @@

      logger = logging.getLogger(name)

      2022-05-18 17:16:46
      -

      *Thread Reply:*

      +

      *Thread Reply:*

      @@ -53845,7 +53851,7 @@

      logger = logging.getLogger(name)

      2022-05-18 17:17:13
      -

*Thread Reply:* I believe that's correct @John Thomas

      +

*Thread Reply:* I believe that's correct @John Thomas

      @@ -53871,7 +53877,7 @@

      logger = logging.getLogger(name)

      2022-05-18 17:18:35
      -

*Thread Reply:* and the versions I'm using: +

*Thread Reply:* and the versions I'm using:
Astronomer Runtime 5.0.0 based on Airflow 2.3.0+astro.1

      @@ -53898,7 +53904,7 @@

      logger = logging.getLogger(name)

      2022-05-18 17:25:58
      -

      *Thread Reply:* this might not be the problem, but you should have only one of extract and extract_on_complete - which one are you meaning to use?

      +

      *Thread Reply:* this might not be the problem, but you should have only one of extract and extract_on_complete - which one are you meaning to use?

      @@ -53924,7 +53930,7 @@

      logger = logging.getLogger(name)

      2022-05-18 17:32:26
      -

      *Thread Reply:* ahh thanks John, as of right now extract_on_complete.

      +

      *Thread Reply:* ahh thanks John, as of right now extract_on_complete.

      This is a similar setup as Michael had in the video.

      @@ -53961,7 +53967,7 @@

      logger = logging.getLogger(name)

      2022-05-18 17:33:31
      -

      *Thread Reply:* if it's still not working I'm not really sure at this point - that's about what I had when I spun up my own custom extractor

      +

      *Thread Reply:* if it's still not working I'm not really sure at this point - that's about what I had when I spun up my own custom extractor

      @@ -53987,7 +53993,7 @@

      logger = logging.getLogger(name)

      2022-05-18 17:39:44
      -

      *Thread Reply:* is there anything in logs regarding extractors?

      +

      *Thread Reply:* is there anything in logs regarding extractors?

      @@ -54013,7 +54019,7 @@

      logger = logging.getLogger(name)

      2022-05-18 17:40:36
      -

*Thread Reply:* just this:

+

*Thread Reply:* just this:
[2022-05-18, 21:36:59 UTC] {__init__.py:97} WARNING - Unable to find an extractor. task_type=_PythonDecoratedOperator airflow_dag_id=competitive_oss_projects_git_to_snowflake task_id=Transform_git_logs_to_S3 airflow_run_id=scheduled__2022-05-18T21:35:57.694690+00:00

      @@ -54040,7 +54046,7 @@

      logger = logging.getLogger(name)

      2022-05-18 17:41:11
      -

      *Thread Reply:* @John Thomas Thanks, I appreciate your help.

      +

      *Thread Reply:* @John Thomas Thanks, I appreciate your help.

      @@ -54066,7 +54072,7 @@

      logger = logging.getLogger(name)

      2022-05-19 06:01:52
      -

      *Thread Reply:* No Failed to import messages?

      +

      *Thread Reply:* No Failed to import messages?

      @@ -54092,8 +54098,8 @@

      logger = logging.getLogger(name)

      2022-05-19 11:26:34
      -

*Thread Reply:* @Maciej Obuchowski None that I can see. Here is the full log:

+

*Thread Reply:* @Maciej Obuchowski None that I can see. Here is the full log:
``` Failed to verify remote log exists s3:///dag_id=Postgres_1_to_Snowflake_1_v3/run_id=scheduled__2022-05-19T15:23:49.248097+00:00/task_id=Postgres_1/attempt=1.log.
Please provide a bucket_name instead of "s3:///dag_id=Postgres_1_to_Snowflake_1_v3/run_id=scheduled__2022-05-19T15:23:49.248097+00:00/task_id=Postgres_1/attempt=1.log"
Falling back to local log
* Reading local file: /usr/local/airflow/logs/dag_id=Postgres_1_to_Snowflake_1_v3/run_id=scheduled__2022-05-19T15:23:49.248097+00:00/task_id=Postgres_1/attempt=1.log

@@ -54163,7 +54169,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 16:57:38
      -

      *Thread Reply:* @Maciej Obuchowski is our ENV var wrong maybe? Do we need to mention the file to import somewhere else that we may have missed?

      +

      *Thread Reply:* @Maciej Obuchowski is our ENV var wrong maybe? Do we need to mention the file to import somewhere else that we may have missed?

      @@ -54189,7 +54195,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-20 10:26:01
      -

*Thread Reply:* @Josh Owens one thing I can think of is that you might have an older openlineage integration version, as the OPENLINEAGE_EXTRACTORS variable was added very recently: https://github.com/OpenLineage/OpenLineage/pull/694

+

*Thread Reply:* @Josh Owens one thing I can think of is that you might have an older openlineage integration version, as the OPENLINEAGE_EXTRACTORS variable was added very recently: https://github.com/OpenLineage/OpenLineage/pull/694
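
For reference, registration is just the fully qualified class path in that variable, with multiple extractors separated by semicolons (the module path below is a hypothetical placeholder):

```
OPENLINEAGE_EXTRACTORS=my_package.my_extractor.MyPythonExtractor
```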

      @@ -54215,7 +54221,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-20 11:58:28
      -

      *Thread Reply:* @Maciej Obuchowski, that was it! For some reason, my requirements.txt wasn't pulling the latest version of openlineage-airflow. Working now with 0.8.2

      +

      *Thread Reply:* @Maciej Obuchowski, that was it! For some reason, my requirements.txt wasn't pulling the latest version of openlineage-airflow. Working now with 0.8.2

      @@ -54241,7 +54247,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-20 11:59:01
      -

      *Thread Reply:* 🙌

      +

      *Thread Reply:* 🙌

      @@ -54295,7 +54301,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 05:59:59
      -

*Thread Reply:* Currently, it assumes latest version.

+

*Thread Reply:* Currently, it assumes latest version.
There’s an effort with DatasetVersionDatasetFacet to be able to specify it manually - or extract this information from cases like Iceberg or Delta Lake tables.

      @@ -54322,7 +54328,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 06:14:59
      -

      *Thread Reply:* Ah ok. Is it Marquez assuming the latest version when it records the OpenLineage event?

      +

      *Thread Reply:* Ah ok. Is it Marquez assuming the latest version when it records the OpenLineage event?

      @@ -54348,7 +54354,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 06:18:20
      -

      *Thread Reply:* yes

      +

      *Thread Reply:* yes

      @@ -54378,7 +54384,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 06:54:40
      -

      *Thread Reply:* Thanks, that's very helpful 👍

      +

      *Thread Reply:* Thanks, that's very helpful 👍

      @@ -54466,7 +54472,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 15:27:35
      -

      *Thread Reply:* @Howard Yoo What version of airflow?

      +

      *Thread Reply:* @Howard Yoo What version of airflow?

      @@ -54492,7 +54498,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 15:27:51
      -

      *Thread Reply:* it's 2.3

      +

      *Thread Reply:* it's 2.3

      @@ -54518,7 +54524,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 15:28:42
      -

      *Thread Reply:* (sorry, it's 2.4)

      +

      *Thread Reply:* (sorry, it's 2.4)

      @@ -54544,7 +54550,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 15:29:28
      -

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow I'd refer to the docs again.

+

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow I'd refer to the docs again.

      "Airflow 2.3+ Integration automatically registers itself for Airflow 2.3 if it's installed on Airflow worker's python. This means you don't have to do anything besides configuring it, which is described in Configuration section."

      @@ -54573,7 +54579,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 15:29:53
      -

      *Thread Reply:* Right, configuring I don't see any issues

      +

      *Thread Reply:* Right, configuring I don't see any issues

      @@ -54599,7 +54605,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 15:30:56
      -

*Thread Reply:* so you don't need:

+

*Thread Reply:* so you don't need:

      from openlineage.airflow import DAG

      @@ -54629,7 +54635,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 15:31:41
      -

      *Thread Reply:* Okay... that makes sense then

      +

      *Thread Reply:* Okay... that makes sense then

      @@ -54655,7 +54661,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 15:32:47
      -

*Thread Reply:* so if you need to import DAG it would just be:

+

*Thread Reply:* so if you need to import DAG it would just be:
from airflow import DAG
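
So, as a minimal sketch of the 2.3+ setup being described (backend URL and namespace are placeholders): install the integration and configure it through environment variables, with no OpenLineage imports in the DAG file:

```
pip install openlineage-airflow
export OPENLINEAGE_URL=http://localhost:5000    # e.g. a Marquez instance
export OPENLINEAGE_NAMESPACE=my_namespace       # optional namespace for emitted events
```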

      @@ -54686,7 +54692,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 15:56:19
      -

      *Thread Reply:* Thanks!

      +

      *Thread Reply:* Thanks!

      @@ -54772,7 +54778,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-20 10:23:55
      -

*Thread Reply:* What changes do you mean by letting OpenLineage support it?

+

*Thread Reply:* What changes do you mean by letting OpenLineage support it?
Or, do you mean, to write an Apache Camel integration?

      @@ -54799,7 +54805,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-22 19:54:17
      -

*Thread Reply:* @Maciej Obuchowski Yes, let OpenLineage work the same way as it does with Airflow

+

*Thread Reply:* @Maciej Obuchowski Yes, let OpenLineage work the same way as it does with Airflow

      @@ -54825,7 +54831,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-22 19:56:47
      -

*Thread Reply:* I think this is a very valuable thing. I wish OpenLineage could support some commonly used pipeline tools and abstract out some general interfaces so that users can extend it themselves

+

*Thread Reply:* I think this is a very valuable thing. I wish OpenLineage could support some commonly used pipeline tools and abstract out some general interfaces so that users can extend it themselves

      @@ -54851,7 +54857,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-23 05:20:30
      -

*Thread Reply:* For Python, we have the OL client, common libraries (well, at least the beginnings of them) and the SQL parser

+

*Thread Reply:* For Python, we have the OL client, common libraries (well, at least the beginnings of them) and the SQL parser

      @@ -54877,7 +54883,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-23 05:20:44
      -

      *Thread Reply:* As we support more systems, the general libraries will grow as well.

      +

      *Thread Reply:* As we support more systems, the general libraries will grow as well.

      @@ -54939,7 +54945,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-20 20:23:15
      -

*Thread Reply:* This looks like a bug. We are probably not looking at the right instance in the TaskInstanceListener

+

*Thread Reply:* This looks like a bug. We are probably not looking at the right instance in the TaskInstanceListener

      @@ -54965,7 +54971,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-21 14:17:19
      -

      *Thread Reply:* @Howard Yoo I filed: https://github.com/OpenLineage/OpenLineage/issues/767 for this

      +

      *Thread Reply:* @Howard Yoo I filed: https://github.com/OpenLineage/OpenLineage/issues/767 for this

      @@ -55107,7 +55113,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-21 09:42:21
      -

      *Thread Reply:* Thank you so much, Michael!

      +

      *Thread Reply:* Thank you so much, Michael!

      @@ -55159,7 +55165,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-24 06:41:11
      -

      *Thread Reply:* In Python or Java?

      +

      *Thread Reply:* In Python or Java?

      @@ -55185,7 +55191,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-24 06:44:32
      -

*Thread Reply:* In Python, just inherit BaseFacet and add a _get_schema static method that points to some place where you have the JSON schema of your facet. For example, our DbtVersionRunFacet

+

*Thread Reply:* In Python, just inherit BaseFacet and add a _get_schema static method that points to some place where you have the JSON schema of your facet. For example, our DbtVersionRunFacet

      In Java you can take a look at Spark's custom facets.
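
A bare-bones Python version of that might look like the following (the facet field and schema URL are invented for illustration; it assumes the attrs-based BaseFacet from openlineage.client.facet):

```python
import attr

from openlineage.client.facet import BaseFacet


@attr.s
class ManualLineageFacet(BaseFacet):
    environment: str = attr.ib(default=None)

    @staticmethod
    def _get_schema() -> str:
        # Address of the facet's JSON schema, hosted somewhere reachable
        return "https://example.com/schemas/ManualLineageFacet.json"
```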

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -
      2022-05-24 16:40:00
      -

      *Thread Reply:* Thanks, @Maciej Obuchowski, I was asking in regards to Python, sorry I should have clarified.

      +

      *Thread Reply:* Thanks, @Maciej Obuchowski, I was asking in regards to Python, sorry I should have clarified.

      I'm not sure what the disconnect is, but the facets aren't showing up in the inputs and outputs. The Lineage event is sent successfully to my astrocloud.

      @@ -55410,7 +55416,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-25 09:21:02
      -

*Thread Reply:* _get_schema should return the address of the schema hosted somewhere else - AFAIK sending an object field where the server expects a string field might cause some problems

+

*Thread Reply:* _get_schema should return the address of the schema hosted somewhere else - AFAIK sending an object field where the server expects a string field might cause some problems

      @@ -55436,7 +55442,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-25 09:21:59
      -

      *Thread Reply:* can you register ManualLineageFacet as facets not as inputFacets or outputFacets?

      +

      *Thread Reply:* can you register ManualLineageFacet as facets not as inputFacets or outputFacets?
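
That is, something along these lines (a sketch; the facet key is arbitrary and the namespace/name are placeholders):

```python
from openlineage.client.run import Dataset

output = Dataset(
    namespace="postgres://my-host",
    name="schema.table",
    # Registered under facets, not inputFacets/outputFacets
    facets={"manualLineage": ManualLineageFacet(environment="prod")},
)
```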

      @@ -55462,7 +55468,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-25 13:15:30
      -

*Thread Reply:* Thanks for the advice @Maciej Obuchowski, I was able to get it working!

+

*Thread Reply:* Thanks for the advice @Maciej Obuchowski, I was able to get it working!
Also great talk today at the airflow summit.

      @@ -55489,7 +55495,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-25 13:25:17
      -

      *Thread Reply:* Thanks 🙇

      +

      *Thread Reply:* Thanks 🙇

      @@ -55611,7 +55617,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-24 06:35:05
      -

*Thread Reply:* We're also getting data quality from dbt if you're running dbt test or dbt build

+

*Thread Reply:* We're also getting data quality from dbt if you're running dbt test or dbt build
https://github.com/OpenLineage/OpenLineage/blob/main/integration/common/openlineage/common/provider/dbt.py#L399

      @@ -55663,7 +55669,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-24 06:37:15
      -

*Thread Reply:* Generally, you'd need to construct DataQualityAssertionsDatasetFacet and/or DataQualityMetricsInputDatasetFacet and attach it to the tested dataset

+

*Thread Reply:* Generally, you'd need to construct DataQualityAssertionsDatasetFacet and/or DataQualityMetricsInputDatasetFacet and attach it to the tested dataset
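
A rough sketch of that, with invented values (assuming these facet classes as exposed in openlineage.client.facet):

```python
from openlineage.client.facet import (
    Assertion,
    DataQualityAssertionsDatasetFacet,
    DataQualityMetricsInputDatasetFacet,
)
from openlineage.client.run import Dataset

tested_dataset = Dataset(
    namespace="postgres://my-host",
    name="schema.table",
    facets={
        # One entry per test/assertion that ran against the dataset
        "dataQualityAssertions": DataQualityAssertionsDatasetFacet(
            assertions=[Assertion(assertion="not_null", success=True, column="id")]
        ),
        # Row/byte counts and per-column metrics, if available
        "dataQualityMetrics": DataQualityMetricsInputDatasetFacet(
            rowCount=1000,
            columnMetrics={},
        ),
    },
)
```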

      @@ -55714,7 +55720,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-24 13:23:34
      -

      *Thread Reply:* Thanks @Maciej Obuchowski!!!

      +

      *Thread Reply:* Thanks @Maciej Obuchowski!!!

      @@ -55871,7 +55877,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-25 16:59:47
      -

      *Thread Reply:* We're working on this feature - should be in the next release from OpenLineage side

      +

      *Thread Reply:* We're working on this feature - should be in the next release from OpenLineage side

      @@ -55901,7 +55907,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-25 17:06:12
      -

      *Thread Reply:* Thanks! I will keep an eye on updates.

      +

      *Thread Reply:* Thanks! I will keep an eye on updates.

      @@ -55999,7 +56005,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-26 06:05:33
      -

      *Thread Reply:* Looks great! Thanks for sharing!

      +

      *Thread Reply:* Looks great! Thanks for sharing!

      @@ -56148,7 +56154,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-30 05:38:41
      -

*Thread Reply:* Might be the case if you're using Spark 3.2

+

*Thread Reply:* Might be the case if you're using Spark 3.2

      @@ -56174,7 +56180,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-30 05:38:54
      -

      *Thread Reply:* There were some changes to those operators

      +

      *Thread Reply:* There were some changes to those operators

      @@ -56200,7 +56206,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-30 05:39:09
      -

      *Thread Reply:* If you're not using 3.2, please share more details 🙂

      +

      *Thread Reply:* If you're not using 3.2, please share more details 🙂

      @@ -56226,7 +56232,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-30 07:58:58
      -

*Thread Reply:* Yeap, I'm using Spark version 3.2.1

+

*Thread Reply:* Yeap, I'm using Spark version 3.2.1

      @@ -56252,7 +56258,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-30 07:59:35
      -

*Thread Reply:* is it an open issue, or do I have some option to force them to be sent?)

+

*Thread Reply:* is it an open issue, or do I have some option to force them to be sent?)

      @@ -56278,7 +56284,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-30 07:59:58
      -

      *Thread Reply:* btw thank you for quick response @Maciej Obuchowski

      +

      *Thread Reply:* btw thank you for quick response @Maciej Obuchowski

      @@ -56304,7 +56310,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-30 08:00:34
      -

      *Thread Reply:* Yes, we have issue for AlterTable at least

      +

      *Thread Reply:* Yes, we have issue for AlterTable at least

      @@ -56330,7 +56336,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-01 02:52:14
      -

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/616 -> that’s the issue for altering tables in Spark 3.2.

+

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/616 -> that’s the issue for altering tables in Spark 3.2.
@Ilqar Memmedov Did you mean drop table or drop columns? I am not aware of any drop table issue.

      @@ -56393,7 +56399,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-01 06:03:38
      -

      *Thread Reply:* @Paweł Leszczyński drop table statement.

      +

      *Thread Reply:* @Paweł Leszczyński drop table statement.

      @@ -56419,7 +56425,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-01 06:05:58
      -

*Thread Reply:* To reproduce it, I just created a simple Spark job.

+

*Thread Reply:* To reproduce it, I just created a simple Spark job.
Create a table as select from another, select data from the table, and then drop the entire table.

      @@ -56493,7 +56499,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-03 14:20:16
      -

      *Thread Reply:* hi xiang 👋 lineage in airflow depends on the operator. some operators have extractors as part of the integration, but when they are missing you only see job information in Marquez.

      +

      *Thread Reply:* hi xiang 👋 lineage in airflow depends on the operator. some operators have extractors as part of the integration, but when they are missing you only see job information in Marquez.

      @@ -56519,7 +56525,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-03 14:20:51
      -

      *Thread Reply:* take a look at https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#extractors--sending-the-correct-data-from-your-dags for a bit more detail

      +

      *Thread Reply:* take a look at https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#extractors--sending-the-correct-data-from-your-dags for a bit more detail

      @@ -56676,7 +56682,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 08:02:58
      -

      *Thread Reply:* I think internal naming is more important 🙂

      +

      *Thread Reply:* I think internal naming is more important 🙂

      I guess, for now, try to match what the local directory has.

      @@ -56704,7 +56710,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 10:59:39
      -

      *Thread Reply:* Thanks @Maciej Obuchowski

      +

      *Thread Reply:* Thanks @Maciej Obuchowski

      @@ -56809,7 +56815,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 11:00:41
      -

      *Thread Reply:* Those jobs are actually what Spark does underneath 🙂

      +

      *Thread Reply:* Those jobs are actually what Spark does underneath 🙂

      @@ -56835,7 +56841,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 11:00:57
      -

      *Thread Reply:* Are you using Delta Lake btw?

      +

      *Thread Reply:* Are you using Delta Lake btw?

      @@ -56861,7 +56867,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 12:02:39
      -

*Thread Reply:* No, this is not Delta Lake. It is a normal Spark app.

+

*Thread Reply:* No, this is not Delta Lake. It is a normal Spark app.

      @@ -56887,7 +56893,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 13:58:05
      -

      *Thread Reply:* @Maciej Obuchowski i think David posted about this before. https://openlineage.slack.com/archives/C01CK9T7HKR/p1636011698055200

      +

      *Thread Reply:* @Maciej Obuchowski i think David posted about this before. https://openlineage.slack.com/archives/C01CK9T7HKR/p1636011698055200

      @@ -56961,7 +56967,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 14:27:46
      -

*Thread Reply:* I agree that it looks bad on the UI, but I also think the integration is doing a good job here. The eventual "aggregation" should be done by the event consumer.

+

*Thread Reply:* I agree that it looks bad on the UI, but I also think the integration is doing a good job here. The eventual "aggregation" should be done by the event consumer.

      If anything, we should filter some 'useless' nodes like collect_limit since they add nothing.

      @@ -57199,7 +57205,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 14:33:09
      -

      *Thread Reply:* @Maciej Obuchowski but we only see these 2 jobs in the namespace, no other jobs were part of the lineage metadata, are we doing something wrong?

      +

      *Thread Reply:* @Maciej Obuchowski but we only see these 2 jobs in the namespace, no other jobs were part of the lineage metadata, are we doing something wrong?

      @@ -57225,7 +57231,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 16:09:15
      -

*Thread Reply:* @Michael Robinson On this note, may we know how to form a lineage if we have a different set of APIs before calling the Spark job (already integrated with OpenLineageSparkListener)? We want to see how the different sets of params pass through these components before landing in the Spark job. If we use the openlineage client to post the lineage events into Marquez, do we need to mention the same Run UUID across the lineage events for the run, or is there any other way to do this? Can you pls advise?

+

*Thread Reply:* @Michael Robinson On this note, may we know how to form a lineage if we have a different set of APIs before calling the Spark job (already integrated with OpenLineageSparkListener)? We want to see how the different sets of params pass through these components before landing in the Spark job. If we use the openlineage client to post the lineage events into Marquez, do we need to mention the same Run UUID across the lineage events for the run, or is there any other way to do this? Can you pls advise?

      @@ -57251,7 +57257,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 22:51:38
      -

      *Thread Reply:* I think I understand what you are asking -

      +

      *Thread Reply:* I think I understand what you are asking -

      The runID is used to correlate different state updates (i.e., start, fail, complete, abort) across the lifespan of a run. So if you are trying to add additional metadata to the same job run, you’d use the same runID.
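
A minimal sketch of what that looks like with the Python client (URL, namespace, and job name are placeholders):

```python
import uuid
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")
job = Job(namespace="my-namespace", name="my-job")
run = Run(runId=str(uuid.uuid4()))  # reuse the same runId for every state update

for state in (RunState.START, RunState.COMPLETE):
    client.emit(
        RunEvent(
            eventType=state,
            eventTime=datetime.now(timezone.utc).isoformat(),
            run=run,  # same run object correlates START and COMPLETE
            job=job,
            producer="https://example.com/my-producer",
        )
    )
```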

      @@ -57287,7 +57293,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 22:59:39
      -

*Thread Reply:* thanks @Ross Turk but what I am looking for is, let's say, for example: if we have 4 components in the system, then we want to show the 4 components as job icons in the graph, and the datasets between them would show the input/output parameters that these components use.

+

*Thread Reply:* thanks @Ross Turk but what I am looking for is, let's say, for example: if we have 4 components in the system, then we want to show the 4 components as job icons in the graph, and the datasets between them would show the input/output parameters that these components use.
A(job) --> DS1(dataset) --> B(job) --> DS2(dataset) --> C(job) --> DS3(dataset) --> D(job)

      @@ -57314,7 +57320,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 23:04:37
      -

      *Thread Reply:* then you would need to have separate Jobs for each, with inputs and outputs defined

      +

      *Thread Reply:* then you would need to have separate Jobs for each, with inputs and outputs defined

      @@ -57340,7 +57346,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 23:06:03
      -

      *Thread Reply:* so there would be a Run of job B that shows DS1 as an input and DS2 as an output

      +

      *Thread Reply:* so there would be a Run of job B that shows DS1 as an input and DS2 as an output

      @@ -57366,7 +57372,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 23:06:18
      -

      *Thread Reply:* got it

      +

      *Thread Reply:* got it

      @@ -57392,7 +57398,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 23:06:34
      -

      *Thread Reply:* (fyi: I know openlineage but my understanding stops at spark 😄)

      +

      *Thread Reply:* (fyi: I know openlineage but my understanding stops at spark 😄)

      @@ -57422,7 +57428,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-10 12:27:58
      -

*Thread Reply:* > The eventual “aggregation” should be done by event consumer.

+

*Thread Reply:* > The eventual “aggregation” should be done by event consumer.
@Maciej Obuchowski Are there any known client-side libraries that support this aggregation already? In case of Spark applications running as part of ETL pipelines, most of the time our end user is interested in seeing only the aggregated view where all jobs spawned as part of a single application are rolled up into 1 job.

      @@ -57449,7 +57455,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-10 12:32:14
      -

      *Thread Reply:* I believe Microsoft @Will Johnson has something similar to that, but it's probably proprietary.

      +

      *Thread Reply:* I believe Microsoft @Will Johnson has something similar to that, but it's probably proprietary.

      We'd love to have something like it, but AFAIK it affects only some percentage of Spark jobs and we can only do so much.

      @@ -57479,7 +57485,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-11 23:38:27
      -

      *Thread Reply:* @Maciej Obuchowski Microsoft ❤️ OSS!

      +

      *Thread Reply:* @Maciej Obuchowski Microsoft ❤️ OSS!

      Apache Atlas doesn't have the same model as Marquez. It only knows of effectively one entity that represents the complete asset.

      @@ -57542,7 +57548,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-19 09:37:06
      -

      *Thread Reply:* thank you !

      +

      *Thread Reply:* thank you !

      @@ -57640,7 +57646,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-09 13:04:00
      -

      *Thread Reply:* Hi, is the link correct? The meeting room is empty

      +

      *Thread Reply:* Hi, is the link correct? The meeting room is empty

      @@ -57666,7 +57672,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-09 16:04:23
      -

      *Thread Reply:* sorry about that, thanks for letting us know

      +

      *Thread Reply:* sorry about that, thanks for letting us know

      @@ -57736,7 +57742,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-13 18:49:37
      -

      *Thread Reply:* Hey @Mark Beebe!

      +

      *Thread Reply:* Hey @Mark Beebe!

      In this case, nodeId is going to be either a dataset or a job. You need to tell Marquez where to start since there is likely to be more than one graph. So you need to get your hands on an identifier for that starting node.

      @@ -57764,7 +57770,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-13 18:50:07
      -

      *Thread Reply:* You can do this in a few ways (that I can think of). First, by looking for a namespace, then querying for the datasets in that namespace:

      +

      *Thread Reply:* You can do this in a few ways (that I can think of). First, by looking for a namespace, then querying for the datasets in that namespace:

      @@ -57799,7 +57805,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-13 18:53:43
      -

      *Thread Reply:* Or you can search, if you know the name of the dataset:

      +

      *Thread Reply:* Or you can search, if you know the name of the dataset:
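
For reference, both approaches correspond to plain HTTP calls against the Marquez API (host and names are placeholders; nodeId takes the form dataset:&lt;namespace&gt;:&lt;name&gt; or job:&lt;namespace&gt;:&lt;name&gt;):

```
GET http://localhost:5000/api/v1/namespaces
GET http://localhost:5000/api/v1/namespaces/my-namespace/datasets
GET http://localhost:5000/api/v1/search?q=my_table
GET http://localhost:5000/api/v1/lineage?nodeId=dataset:my-namespace:my_table
```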

      @@ -57834,7 +57840,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-13 18:53:54
      -

      *Thread Reply:* aaaaannnnd that’s actually all the ways I can think of.

      +

      *Thread Reply:* aaaaannnnd that’s actually all the ways I can think of.

      @@ -57860,7 +57866,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-14 08:11:30
      -

      *Thread Reply:* That worked, thank you so much!

      +

      *Thread Reply:* That worked, thank you so much!

      @@ -57946,7 +57952,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-14 10:37:01
      -

      *Thread Reply:* tnazarew - Tomasz Nazarewicz

      +

      *Thread Reply:* tnazarew - Tomasz Nazarewicz

      @@ -57972,7 +57978,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-14 10:37:14
      -

      *Thread Reply:* 🙌

      +

      *Thread Reply:* 🙌

      @@ -57998,7 +58004,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-15 12:46:45
      -

      *Thread Reply:* 👍

      +

      *Thread Reply:* 👍

      @@ -58390,7 +58396,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:39:16
      -

*Thread Reply:* I think you’re referencing the deprecation of the DatasetAPI in Marquez? A milestone for Marquez is to only collect metadata via OpenLineage events. This includes metadata for datasets, jobs, and runs. The DatasetAPI won’t be removed until support for collecting dataset metadata via OpenLineage has been added, see https://github.com/OpenLineage/OpenLineage/issues/323

+

*Thread Reply:* I think you’re referencing the deprecation of the DatasetAPI in Marquez? A milestone for Marquez is to only collect metadata via OpenLineage events. This includes metadata for datasets, jobs, and runs. The DatasetAPI won’t be removed until support for collecting dataset metadata via OpenLineage has been added, see https://github.com/OpenLineage/OpenLineage/issues/323

      @@ -58492,7 +58498,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:40:28
      -

      *Thread Reply:* Once the spec supports dataset metadata, we’ll outline steps in the Marquez project to switch to using the new dataset event type

      +

      *Thread Reply:* Once the spec supports dataset metadata, we’ll outline steps in the Marquez project to switch to using the new dataset event type

      @@ -58518,7 +58524,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:43:20
      -

      *Thread Reply:* The DatasetAPI was also deprecated to avoid confusion around which API to use

      +

      *Thread Reply:* The DatasetAPI was also deprecated to avoid confusion around which API to use

      @@ -58596,7 +58602,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:44:49
      -

      *Thread Reply:* Do you want to register just datasets? Or are you extracting metadata for a job that would include input / output datasets? (outside of Airflow of course)

      +

      *Thread Reply:* Do you want to register just datasets? Or are you extracting metadata for a job that would include input / output datasets? (outside of Airflow of course)

      @@ -58622,7 +58628,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:45:09
      -

      *Thread Reply:* Sorry didn't notice you over here ! lol

      +

      *Thread Reply:* Sorry didn't notice you over here ! lol

      @@ -58648,7 +58654,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:45:53
      -

      *Thread Reply:* So ideally I would like to map out our current data flow from on prem to aws

      +

      *Thread Reply:* So ideally I would like to map out our current data flow from on prem to aws

      @@ -58674,7 +58680,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:47:39
      -

      *Thread Reply:* What do you mean by mapping to AWS? Like send OL events to a service on AWS that would process the lineage metadata?

      +

      *Thread Reply:* What do you mean by mapping to AWS? Like send OL events to a service on AWS that would process the lineage metadata?

      @@ -58700,7 +58706,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:48:14
      -

      *Thread Reply:* no, just visualize the current migration flow.

      +

      *Thread Reply:* no, just visualize the current migration flow.

      @@ -58726,7 +58732,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:48:53
      -

*Thread Reply:* Ah I see, you're doing an infra migration from on-prem to AWS 👌

+

*Thread Reply:* Ah I see, you're doing an infra migration from on-prem to AWS 👌

      @@ -58752,7 +58758,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:49:08
      -

*Thread Reply:* really AWS is irrelevant. Source sink -> migration scripts -> s3 -> additional processing -> final sink

+

*Thread Reply:* really AWS is irrelevant. Source sink -> migration scripts -> s3 -> additional processing -> final sink

      @@ -58778,7 +58784,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:49:19
      -

      *Thread Reply:* correct

      +

      *Thread Reply:* correct

      @@ -58804,7 +58810,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:49:45
      -

      *Thread Reply:* right right. so you want to map out that flow and visualize it in Marquez? (or some other meta service)

      +

      *Thread Reply:* right right. so you want to map out that flow and visualize it in Marquez? (or some other meta service)

      @@ -58830,7 +58836,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:50:05
      -

      *Thread Reply:* yes

      +

      *Thread Reply:* yes

      @@ -58856,7 +58862,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:50:26
      -

      *Thread Reply:* which I think I can do once the first nodes exist

      +

      *Thread Reply:* which I think I can do once the first nodes exist

      @@ -58882,7 +58888,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:51:18
      -

*Thread Reply:* But I don't know how to get that initial node. I tried using the input facet at job start, but that didn't do it. I also can't get the sql context that is in these examples.

+

*Thread Reply:* But I don't know how to get that initial node. I tried using the input facet at job start, but that didn't do it. I also can't get the sql context that is in these examples.

      @@ -58908,7 +58914,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:51:54
      -

*Thread Reply:* really just want to re-create food_delivery using my own biz context

+

*Thread Reply:* really just want to re-create food_delivery using my own biz context

      @@ -58934,7 +58940,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:52:14
      -

      *Thread Reply:* Have you looked over our workshops and this example? (assuming you’re using python?)

      +

      *Thread Reply:* Have you looked over our workshops and this example? (assuming you’re using python?)

      @@ -59032,7 +59038,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:53:49
      -

      *Thread Reply:* that goes over the py client with some OL examples, but really calling openlineage.emit(...) method with RunEvents and specifying Marquez as the backend will get you up and running!

      +

      *Thread Reply:* that goes over the py client with some OL examples, but really calling openlineage.emit(...) method with RunEvents and specifying Marquez as the backend will get you up and running!
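
As a sketch of that (names are placeholders; same client API as in the workshops), each component becomes a Job emitting a RunEvent whose inputs and outputs are the datasets on either side of it, which is what draws the edges in Marquez:

```python
import uuid
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")

ds1 = Dataset(namespace="my-namespace", name="DS1")
ds2 = Dataset(namespace="my-namespace", name="DS2")

# Job B reads DS1 and writes DS2, mirroring A -> DS1 -> B -> DS2
client.emit(
    RunEvent(
        eventType=RunState.COMPLETE,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=str(uuid.uuid4())),
        job=Job(namespace="my-namespace", name="B"),
        producer="https://example.com/my-producer",
        inputs=[ds1],
        outputs=[ds2],
    )
)
```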

      @@ -59058,7 +59064,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:54:32
      -

      *Thread Reply:* Don’t forget to configure the transport for the client

      +

      *Thread Reply:* Don’t forget to configure the transport for the client

      @@ -59084,7 +59090,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:54:45
      -

      *Thread Reply:* sweet. Thank you! I'll take a look. Also.. Just came across datakin for the first time. very nice 🙂

      +

      *Thread Reply:* sweet. Thank you! I'll take a look. Also.. Just came across datakin for the first time. very nice 🙂

      @@ -59110,7 +59116,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:55:25
      -

      *Thread Reply:* thanks! …. but we’re now part of astronomer.io 😉

      +

      *Thread Reply:* thanks! …. but we’re now part of astronomer.io 😉

      @@ -59136,7 +59142,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:55:48
      -

      *Thread Reply:* making airflow oh-so-easy-to-use one DAG at a time

      +

      *Thread Reply:* making airflow oh-so-easy-to-use one DAG at a time

      @@ -59162,7 +59168,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:55:52
      -

      *Thread Reply:* saw that too !

      +

      *Thread Reply:* saw that too !

      @@ -59188,7 +59194,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:56:03
      -

      *Thread Reply:* you’re on top of it!

      +

      *Thread Reply:* you’re on top of it!

      @@ -59214,7 +59220,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:56:28
      -

      *Thread Reply:* ha. Thanks again!

      +

      *Thread Reply:* ha. Thanks again!

      @@ -59347,7 +59353,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:59:38
      -

      *Thread Reply:* Honestly, I’ve been a huge fan of using / falling back on inlets and outlets since day 1. AND if you’re willing to contribute this support, you get a +1 from me (I’ll add some minor comments to the issue) /cc @Julien Le Dem

      +

      *Thread Reply:* Honestly, I’ve been a huge fan of using / falling back on inlets and outlets since day 1. AND if you’re willing to contribute this support, you get a +1 from me (I’ll add some minor comments to the issue) /cc @Julien Le Dem

      @@ -59377,7 +59383,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:59:59
      -

      *Thread Reply:* would be great to get @Maciej Obuchowski thoughts on this as well

      +

      *Thread Reply:* would be great to get @Maciej Obuchowski thoughts on this as well

      @@ -59407,7 +59413,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-08 12:40:39
      -

*Thread Reply:* I have created a draft PR for this here.

+

*Thread Reply:* I have created a draft PR for this here.
Please let me know if the changes make sense.

      @@ -59491,7 +59497,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-08 12:42:30
      -

      *Thread Reply:* I think this effort: https://github.com/OpenLineage/OpenLineage/pull/904 ultimately makes more sense, since it will allow getting lineage on Airflow 2.3+ too

      +

      *Thread Reply:* I think this effort: https://github.com/OpenLineage/OpenLineage/pull/904 ultimately makes more sense, since it will allow getting lineage on Airflow 2.3+ too

      @@ -59578,7 +59584,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-08 18:12:47
      -

*Thread Reply:* I have made the changes in-line to the mentioned comments here.

+

*Thread Reply:* I have made the changes in-line to the mentioned comments here.
Does this look good?

      @@ -59662,7 +59668,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-12 09:35:22
      -

      *Thread Reply:* I think it looks good! Would be great to have tests for this feature though.

      +

      *Thread Reply:* I think it looks good! Would be great to have tests for this feature though.

      @@ -59692,7 +59698,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-15 21:56:50
      -

*Thread Reply:* I have added the tests! Would really appreciate it if someone can take a look and let me know if anything else needs to be done.

+

*Thread Reply:* I have added the tests! Would really appreciate it if someone can take a look and let me know if anything else needs to be done.
Thank you for the support! 😄

      @@ -59723,7 +59729,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-18 06:48:03
      -

      *Thread Reply:* One change and I think it will be good for now.

      +

      *Thread Reply:* One change and I think it will be good for now.

      @@ -59749,7 +59755,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-18 06:48:07
      -

      *Thread Reply:* Have you tested it manually?

      +

      *Thread Reply:* Have you tested it manually?

      @@ -59775,7 +59781,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-20 13:22:04
      -

*Thread Reply:* Thanks a lot for the review! Appreciate it 🙌

+

*Thread Reply:* Thanks a lot for the review! Appreciate it 🙌
Yes, I tested it manually (for Airflow versions 2.1.4 and 2.3.3) and it works 🎉

      @@ -59802,7 +59808,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-20 13:24:55
      -

      *Thread Reply:* I think this is such a useful feature to have, thank you! Would you mind adding a little example to the PR of how to use it? Like a little example DAG or something? ( either in a comment or edit the PR description )

      +

      *Thread Reply:* I think this is such a useful feature to have, thank you! Would you mind adding a little example to the PR of how to use it? Like a little example DAG or something? ( either in a comment or edit the PR description )

      @@ -59832,7 +59838,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-20 15:20:32
      -

      *Thread Reply:* Yes, Sure! I will add it in the PR description

      +

      *Thread Reply:* Yes, Sure! I will add it in the PR description

      @@ -59858,7 +59864,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-21 05:30:56
      -

      *Thread Reply:* I think it would be easy to convert to integration test then if you provided example dag

      +

      *Thread Reply:* I think it would be easy to convert to integration test then if you provided example dag

      @@ -59888,7 +59894,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-27 12:20:43
      -

      *Thread Reply:* ping @Fenil Doshi if possible I would really love to see the example DAG on there 🙂 🙏

      +

      *Thread Reply:* ping @Fenil Doshi if possible I would really love to see the example DAG on there 🙂 🙏

      @@ -59914,7 +59920,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-27 12:26:22
      -

*Thread Reply:* Yes, I was going to, but the PR got merged so I did not update the description. Should I just update the description of the merged PR? Or should I add it somewhere in the docs?

+

*Thread Reply:* Yes, I was going to, but the PR got merged so I did not update the description. Should I just update the description of the merged PR? Or should I add it somewhere in the docs?

      @@ -59940,7 +59946,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-27 12:42:29
      -

      *Thread Reply:* ^ @Ross Turk is it easy for @Fenil Doshi to contribute doc for manual inlet definition on the new doc site?

      +

      *Thread Reply:* ^ @Ross Turk is it easy for @Fenil Doshi to contribute doc for manual inlet definition on the new doc site?

      @@ -59966,7 +59972,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-27 12:48:32
      -

      *Thread Reply:* It is easy 🙂 it's just markdown: https://github.com/openlineage/docs/

      +

      *Thread Reply:* It is easy 🙂 it's just markdown: https://github.com/openlineage/docs/

      @@ -59992,7 +59998,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-27 12:49:23
      -

*Thread Reply:* @Fenil Doshi feel free to create a new page here and don't sweat where to put it, we're still figuring the structure of it out and will move it then

+

*Thread Reply:* @Fenil Doshi feel free to create a new page here and don't sweat where to put it, we're still figuring the structure of it out and will move it then

      @@ -60022,7 +60028,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-27 13:12:31
      -

      *Thread Reply:* exactly, yes - don’t be worried about the doc quality right now, the doc site is still in a pre-release state. so whatever you write will be likely edited or moved before it becomes official 👍

      +

      *Thread Reply:* exactly, yes - don’t be worried about the doc quality right now, the doc site is still in a pre-release state. so whatever you write will be likely edited or moved before it becomes official 👍

      @@ -60052,7 +60058,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-27 20:37:34
      -

      *Thread Reply:* I added documentations here - https://github.com/OpenLineage/docs/pull/16

      +

      *Thread Reply:* I added documentations here - https://github.com/OpenLineage/docs/pull/16

      Also, have added an example for it. 🙂 Let me know if something is unclear and needs to be updated.

      @@ -60109,7 +60115,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-28 12:50:54
      -

      *Thread Reply:* Thanks! very cool.

      +

      *Thread Reply:* Thanks! very cool.

      @@ -60135,7 +60141,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-28 12:52:22
      -

      *Thread Reply:* Does Airflow check the types of the inlets/outlets btw?

      +

      *Thread Reply:* Does Airflow check the types of the inlets/outlets btw?

      Like I wonder if a user could directly define an OpenLineage DataSet ( which might even have various other facets included on it ) and specify it in the inlets/outlets ?

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -
      2022-07-28 12:54:56
      -

      *Thread Reply:* Yeah, I was also curious about using the models from airflow.lineage.entities as opposed to openlineage.client.run.

      +

      *Thread Reply:* Yeah, I was also curious about using the models from airflow.lineage.entities as opposed to openlineage.client.run.

      @@ -60214,7 +60220,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-28 12:55:42
      -

      *Thread Reply:* I am accustomed to creating OpenLineage entities like this:

      +

      *Thread Reply:* I am accustomed to creating OpenLineage entities like this:

      taxes = Dataset(namespace="<postgres://foobar>", name="schema.table")

      @@ -60242,7 +60248,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-28 12:56:45
      -

      *Thread Reply:* I don’t dislike the airflow.lineage.entities models especially, but if we only support one of them…

      +

      *Thread Reply:* I don’t dislike the airflow.lineage.entities models especially, but if we only support one of them…

      @@ -60268,7 +60274,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-28 12:58:18
      -

      *Thread Reply:* yeah, if Airflow allows that class within inlets/outlets it'd be nice to support both imo.

      +

      *Thread Reply:* yeah, if Airflow allows that class within inlets/outlets it'd be nice to support both imo.

      Like we would suggest users to use openlineage.client.run.Dataset but if a user already has DAGs that use Table then they'd still work in a best efforts way.

      @@ -60296,7 +60302,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-28 13:03:07
      -

      *Thread Reply:* either Airflow depends on OpenLineage or we can probably change those entities as part of AIP-48 overhaul to more openlineage-like ones

      +

      *Thread Reply:* either Airflow depends on OpenLineage or we can probably change those entities as part of AIP-48 overhaul to more openlineage-like ones

      @@ -60322,7 +60328,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-28 17:18:35
      -

      *Thread Reply:* hm, not sure I understand the dependency issue. isn’t this extractor living in openlineage-airflow?

      +

      *Thread Reply:* hm, not sure I understand the dependency issue. isn’t this extractor living in openlineage-airflow?

      @@ -60348,7 +60354,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-08-15 09:49:02
      -

      *Thread Reply:* I gave manual lineage a try with native OL Datasets specified in the Airflow inlets/outlets and it seems to work! Had to make some small tweaks which I have attempted here: https://github.com/OpenLineage/OpenLineage/pull/1015

      +

      *Thread Reply:* I gave manual lineage a try with native OL Datasets specified in the Airflow inlets/outlets and it seems to work! Had to make some small tweaks which I have attempted here: https://github.com/OpenLineage/OpenLineage/pull/1015

      ( I left the support for converting the Airflow Table to Dataset because I think that's nice to have also )
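
For anyone who wants to try it, usage looks roughly like this (table names and the callable are invented; assumes the PR's support for openlineage.client.run.Dataset in inlets/outlets):

```python
from airflow.operators.python import PythonOperator
from openlineage.client.run import Dataset


def my_transform():
    pass  # placeholder business logic


transform = PythonOperator(
    task_id="transform",
    python_callable=my_transform,
    inlets=[Dataset(namespace="postgres://my-host", name="schema.source_table")],
    outlets=[Dataset(namespace="postgres://my-host", name="schema.target_table")],
)
```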

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -
      2022-06-28 18:45:52
      -

      *Thread Reply:* Ahh great question! I actually just updated the seeding cmd for Marquez to do just this (but in java of course)

      +

      *Thread Reply:* Ahh great question! I actually just updated the seeding cmd for Marquez to do just this (but in java of course)

      @@ -60500,7 +60506,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:46:15
      -

      *Thread Reply:* Give me a sec to send you over the diff…

      +

      *Thread Reply:* Give me a sec to send you over the diff…

      @@ -60530,7 +60536,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:56:35
      -

      *Thread Reply:* … continued here https://openlineage.slack.com/archives/C01CK9T7HKR/p1656456734272809?thread_ts=1656456141.097229&cid=C01CK9T7HKR

      +

      *Thread Reply:* … continued here https://openlineage.slack.com/archives/C01CK9T7HKR/p1656456734272809?thread_ts=1656456141.097229&cid=C01CK9T7HKR

      @@ -60680,7 +60686,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-29 10:04:14
      -

      *Thread Reply:* I think you should declare those as sources? Or do you need something different?

      +

      *Thread Reply:* I think you should declare those as sources? Or do you need something different?

      @@ -60706,7 +60712,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-29 21:15:33
      -

      *Thread Reply:* I'll try to experiment with this.

      +

      *Thread Reply:* I'll try to experiment with this.

      @@ -60759,7 +60765,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-29 10:02:17
      -

      *Thread Reply:* this should already be working if you run dbt-ol test or dbt-ol build

      +

      *Thread Reply:* this should already be working if you run dbt-ol test or dbt-ol build

      @@ -60785,7 +60791,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-29 21:15:25
      -

      *Thread Reply:* oh, nice!

      +

      *Thread Reply:* oh, nice!

      @@ -60837,7 +60843,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-06 14:26:59
      -

*Thread Reply:* Maybe @Maciej Obuchowski knows? You need to check that it's using the dbt-ol command and that the configuration is available. (environment variables or conf file)

+

*Thread Reply:* Maybe @Maciej Obuchowski knows? You need to check that it's using the dbt-ol command and that the configuration is available. (environment variables or conf file)

      @@ -60863,7 +60869,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-06 15:31:20
      -

      *Thread Reply:* Maybe some aws networking stuff? I'm not really sure how mwaa works internally (or, at all - never used it)

      +

      *Thread Reply:* Maybe some aws networking stuff? I'm not really sure how mwaa works internally (or, at all - never used it)

      @@ -60889,7 +60895,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-06 15:35:06
      -

      *Thread Reply:* anyway, any logs/errors should be in the same space where your task logs are

      +

      *Thread Reply:* anyway, any logs/errors should be in the same space where your task logs are

      @@ -60941,7 +60947,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-06 10:21:50
      -

      *Thread Reply:* What is the status on the Flink / Streaming decisions being made for OpenLineage / Marquez?

      +

      *Thread Reply:* What is the status on the Flink / Streaming decisions being made for OpenLineage / Marquez?

      A few months ago, Flink was being introduced and it was said that more thought was needed around supporting streaming services in OpenLineage.

      @@ -60975,7 +60981,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-06 11:08:01
      -

      *Thread Reply:* @Will Johnson added your item

      +

      *Thread Reply:* @Will Johnson added your item

      @@ -61093,7 +61099,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-11 10:30:34
      -

      *Thread Reply:* would appreciate a TSC discussion on OL philosophy for Streaming in general and where/if it fits in the vision and strategy for OL. fully appreciate current maturity, moreso just validating how OL is being positioned from a vision perspective. as we consider aligning enterprise lineage solution around OL want to make sure we're not making bad assumptions. neat discussion might be "imagine that Confluent decided to make Stream Lineage OL compliant/capable - are we cool with that and what are the implications?".

      +

      *Thread Reply:* would appreciate a TSC discussion on OL philosophy for Streaming in general and where/if it fits in the vision and strategy for OL. fully appreciate current maturity, moreso just validating how OL is being positioned from a vision perspective. as we consider aligning enterprise lineage solution around OL want to make sure we're not making bad assumptions. neat discussion might be "imagine that Confluent decided to make Stream Lineage OL compliant/capable - are we cool with that and what are the implications?".

      @@ -61123,7 +61129,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-12 12:36:17
      -

      *Thread Reply:* @Michael Robinson could I also have a quick 5m to talk about plans for a documentation site?

      +

      *Thread Reply:* @Michael Robinson could I also have a quick 5m to talk about plans for a documentation site?

      @@ -61153,7 +61159,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-12 12:46:29
      -

      *Thread Reply:* @David Cecchi @Ross Turk Added your items to the agenda. Thanks and looking forward to the discussion!

      +

      *Thread Reply:* @David Cecchi @Ross Turk Added your items to the agenda. Thanks and looking forward to the discussion!

      @@ -61179,7 +61185,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-12 15:08:48
      -

      *Thread Reply:* this is great - will keep an eye out for recording. if it got tabled due to lack of attendance will pick it up next TSC.

      +

      *Thread Reply:* this is great - will keep an eye out for recording. if it got tabled due to lack of attendance will pick it up next TSC.

      @@ -61205,7 +61211,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-12 16:12:43
      -

      *Thread Reply:* I think OpenLineage should have some representation at https://impactdatasummit.com/2022

      +

      *Thread Reply:* I think OpenLineage should have some representation at https://impactdatasummit.com/2022

      I’m happy to help craft the abstract, look over slides, etc. (I could help present, but all I’ve done with OpenLineage is one tutorial, so I’m hardly an expert).

      @@ -61286,7 +61292,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-07 12:04:55
      -

      *Thread Reply:* @Michael Collado would you have any thoughts on how to extend the Facets without having to alter OpenLineage itself?

      +

      *Thread Reply:* @Michael Collado would you have any thoughts on how to extend the Facets without having to alter OpenLineage itself?

      @@ -61312,7 +61318,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-07 15:16:45
      -

*Thread Reply:* This is described here. Notably:

+

      *Thread Reply:* This is described here. Notably: > Custom implementations are registered by following Java's ServiceLoader conventions. A file called io.openlineage.spark.api.OpenLineageEventHandlerFactory must exist in the application or jar's META-INF/service directory. Each line of that file must be the fully qualified class name of a concrete implementation of OpenLineageEventHandlerFactory. More than one implementation can be present in a single file. This might be useful to separate extensions that are targeted toward different environments - e.g., one factory may contain Azure-specific extensions, while another factory may contain GCP extensions.
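[Editor's note: to make the convention concrete, here is a minimal, generic Java sketch of the ServiceLoader mechanism described above, using a hypothetical Greeter interface. For the Spark integration the real interface is io.openlineage.spark.api.OpenLineageEventHandlerFactory, and the registration file under META-INF/services is named after that fully qualified interface.]

```java
import java.util.ServiceLoader;

// Hypothetical service interface standing in for OpenLineageEventHandlerFactory.
interface Greeter {
    String greet();
}

// Concrete implementation. Its fully qualified name must appear, one per line,
// in a classpath resource named META-INF/services/Greeter (the file name is the
// fully qualified name of the interface being implemented).
class EnglishGreeter implements Greeter {
    @Override
    public String greet() {
        return "hello";
    }
}

public class ServiceLoaderDemo {
    public static void main(String[] args) {
        // ServiceLoader scans every META-INF/services registration file on the
        // classpath and instantiates each implementation listed there.
        for (Greeter greeter : ServiceLoader.load(Greeter.class)) {
            System.out.println(greeter.greet());
        }
    }
}
```

This is also why packaging matters in the threads below: the registration file and the implementation class must be visible to the same ClassLoader that loads the integration.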

      @@ -61339,7 +61345,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-07 15:17:55
      -

      *Thread Reply:* This example is present in the test package - https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]ervices/io.openlineage.spark.api.OpenLineageEventHandlerFactory

      +

      *Thread Reply:* This example is present in the test package - https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]ervices/io.openlineage.spark.api.OpenLineageEventHandlerFactory

      @@ -61390,7 +61396,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-07 20:19:01
      -

      *Thread Reply:* @Michael Collado you are amazing! Thank you so much for pointing me to the docs and example!

      +

      *Thread Reply:* @Michael Collado you are amazing! Thank you so much for pointing me to the docs and example!

      @@ -61496,7 +61502,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 06:18:29
      -

      *Thread Reply:* Can you expand on what's not working exactly?

      +

      *Thread Reply:* Can you expand on what's not working exactly?

      This is not something we're aware of.

      @@ -61524,7 +61530,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 04:09:39
      -

*Thread Reply:* @Maciej Obuchowski Sure, I have my own library where I am creating a shadowJar. This includes the OpenLineage library in the new uber jar. This worked fine till 0.9.0, but now building the shadowJar gives this error:

+

*Thread Reply:* @Maciej Obuchowski Sure, I have my own library where I am creating a shadowJar. This includes the OpenLineage library in the new uber jar. This worked fine till 0.9.0, but now building the shadowJar gives this error:
Could not determine the dependencies of task ':shadowJar'.
> Could not resolve all dependencies for configuration ':runtimeClasspath'.
   > Could not find spark:app:0.10.0.

@@ -61576,7 +61582,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 05:00:02
      -

*Thread Reply:* Can you try 0.11? I think we might have already fixed that.

+

*Thread Reply:* Can you try 0.11? I think we might have already fixed that.

      @@ -61602,7 +61608,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 05:50:03
      -

      *Thread Reply:* Tried with that as well. Doesn't work

      +

      *Thread Reply:* Tried with that as well. Doesn't work

      @@ -61628,7 +61634,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 05:56:50
      -

      *Thread Reply:* Same error with 0.11.0 as well

      +

      *Thread Reply:* Same error with 0.11.0 as well

      @@ -61654,7 +61660,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 08:11:13
      -

      *Thread Reply:* I think I see - we removed internal dependencies from maven's pom.xml but we also publish gradle metadata: https://repo1.maven.org/maven2/io/openlineage/openlineage-spark/0.11.0/openlineage-spark-0.11.0.module

      +

      *Thread Reply:* I think I see - we removed internal dependencies from maven's pom.xml but we also publish gradle metadata: https://repo1.maven.org/maven2/io/openlineage/openlineage-spark/0.11.0/openlineage-spark-0.11.0.module

      @@ -61680,7 +61686,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 08:11:34
      -

      *Thread Reply:* we should remove the dependencies or disable the gradle metadata altogether, it's not required

      +

      *Thread Reply:* we should remove the dependencies or disable the gradle metadata altogether, it's not required

      @@ -61706,7 +61712,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 08:16:18
      -

      *Thread Reply:* @Varun Singh For now I think you can try ignoring gradle metadata: https://docs.gradle.org/current/userguide/declaring_repositories.html#sec:supported_metadata_sources

      +

      *Thread Reply:* @Varun Singh For now I think you can try ignoring gradle metadata: https://docs.gradle.org/current/userguide/declaring_repositories.html#sec:supported_metadata_sources

      @@ -61732,7 +61738,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 14:18:45
      -

*Thread Reply:* @Varun Singh did you find out how to build the shadowJar successfully with release 0.10.0? I can build the shadowJar with 0.9.0, but not with a higher version. If your problem is already resolved, could you share some suggestions? thanks ^^

+

*Thread Reply:* @Varun Singh did you find out how to build the shadowJar successfully with release 0.10.0? I can build the shadowJar with 0.9.0, but not with a higher version. If your problem is already resolved, could you share some suggestions? thanks ^^

      @@ -61758,7 +61764,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-20 03:44:40
      -

*Thread Reply:* @Hanbing Wang I followed @Maciej Obuchowski's instructions (Thank you!) and added this to my build.gradle file:

+

*Thread Reply:* @Hanbing Wang I followed @Maciej Obuchowski's instructions (Thank you!) and added this to my build.gradle file:
repositories {
    mavenCentral() {
        metadataSources {

@@ -61793,7 +61799,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-20 05:26:04
      -

*Thread Reply:* Also, I am not able to see the 3rd party dependencies in the dependency lock file, but they are present in some folder inside the jar (relocated in the subproject's build file). But this is a different problem, I guess.

+

*Thread Reply:* Also, I am not able to see the 3rd party dependencies in the dependency lock file, but they are present in some folder inside the jar (relocated in the subproject's build file). But this is a different problem, I guess.

      @@ -61819,7 +61825,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-20 18:45:50
      -

*Thread Reply:* Thanks @Varun Singh for the very helpful info. I will also try updating build.gradle and rebuilding the shadowJar again.

+

*Thread Reply:* Thanks @Varun Singh for the very helpful info. I will also try updating build.gradle and rebuilding the shadowJar again.

      @@ -61986,7 +61992,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 08:50:15
      -

*Thread Reply:* Could you unzip the created jar and verify that classes you’re trying to use are present? Perhaps there’s some relocate in shadowJar plugin, which renames the classes. Making sure the classes are present in the jar is a good starting point.

+

*Thread Reply:* Could you unzip the created jar and verify that classes you’re trying to use are present? Perhaps there’s some relocate in shadowJar plugin, which renames the classes. Making sure the classes are present in the jar is a good starting point.

      Then you can try doing classForName just from the spark-shell without any listeners added. The classes should be available there.

      @@ -62014,7 +62020,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 11:42:25
      -

      *Thread Reply:* Thank you for the reply Pawel! Hanna and I just wrapped up some testing.

      +

      *Thread Reply:* Thank you for the reply Pawel! Hanna and I just wrapped up some testing.

It looks like Databricks AND open-source Spark do some magic when you install a library OR use --jars on the spark-shell. In both Databricks and Apache Spark, the thread running the SparkListener cannot see the additional libraries installed unless they're on the original / main classpath.

      @@ -62050,7 +62056,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 12:02:05
      -

      *Thread Reply:* We run tests using --packages for external stuff like Delta - which is the same as --jars , but getting them from maven central, not local disk, and it works, like in KafkaRelationVisitor.

      +

      *Thread Reply:* We run tests using --packages for external stuff like Delta - which is the same as --jars , but getting them from maven central, not local disk, and it works, like in KafkaRelationVisitor.

What if you did it like that? By that I mean adding it to your code with compileOnly in Gradle or provided in Maven, compiling with it, then using a static method to check if it loads?
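[Editor's note: for illustration, a hedged Java sketch of that pattern, loosely modeled on KafkaRelationVisitor's class-presence check. The Kusto class name is taken from this thread; the class and method names here are hypothetical.]

```java
// The Kusto dependency is declared compileOnly (Gradle) / provided (Maven), so
// it is available at compile time but not bundled into the integration jar.
public class KustoVisitorGuard {
    private static final String KUSTO_SOURCE =
        "com.microsoft.kusto.spark.datasource.KustoSourceOptions";

    // Returns true only when the class is visible to this class's own
    // ClassLoader, i.e. both jars ended up in the same (or a parent) loader.
    public static boolean hasKustoClasses() {
        try {
            KustoVisitorGuard.class.getClassLoader().loadClass(KUSTO_SOURCE);
            return true;
        } catch (ClassNotFoundException | NoClassDefFoundError e) {
            return false;
        }
    }
}
```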

      @@ -62078,7 +62084,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 12:02:36
      -

*Thread Reply:* > • When we use a classloader on a class we installed and imported, it DOES find the class. myImportedClass.getClass().getClassLoader().getResource("com/microsoft/kusto/spark/datasource/KustoSourceOptions.class")

+

*Thread Reply:* > • When we use a classloader on a class we installed and imported, it DOES find the class. myImportedClass.getClass().getClassLoader().getResource("com/microsoft/kusto/spark/datasource/KustoSourceOptions.class")
Isn't that this actual scenario?

      @@ -62105,7 +62111,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 12:36:47
      -

      *Thread Reply:* Thank you for the reply, Maciej!

      +

      *Thread Reply:* Thank you for the reply, Maciej!

      I will try the compileOnly route tonight!

      @@ -62137,7 +62143,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 06:43:47
      -

      *Thread Reply:* The difference between --jars and --packages is that for packages all transitive dependencies will be handled. But this does not seem to be the case here.

      +

      *Thread Reply:* The difference between --jars and --packages is that for packages all transitive dependencies will be handled. But this does not seem to be the case here.

      More doc can be found here: (https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management)

      @@ -62171,7 +62177,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 11:20:02
      -

      *Thread Reply:* @Paweł Leszczyński thank you for the reply! Hanna and I experimented with jars vs extraClassPath.

      +

      *Thread Reply:* @Paweł Leszczyński thank you for the reply! Hanna and I experimented with jars vs extraClassPath.

      When using jars, the spark listener does NOT find the class using a classloader.

      @@ -62207,7 +62213,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 11:22:01
      -

*Thread Reply:* So, this is only relevant to Databricks now? Because I don't understand what you do differently than us with Kafka/Iceberg/Delta

+

*Thread Reply:* So, this is only relevant to Databricks now? Because I don't understand what you do differently than us with Kafka/Iceberg/Delta

      @@ -62233,7 +62239,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 11:22:48
      -

*Thread Reply:* I'm not the spark/classpath expert though - maybe @Michael Collado has something to add?

+

*Thread Reply:* I'm not the spark/classpath expert though - maybe @Michael Collado has something to add?

      @@ -62259,7 +62265,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 11:24:12
      -

      *Thread Reply:* @Maciej Obuchowski that's a super good question on Iceberg. How do you instantiate a spark job with Iceberg installed?

      +

      *Thread Reply:* @Maciej Obuchowski that's a super good question on Iceberg. How do you instantiate a spark job with Iceberg installed?

      @@ -62285,7 +62291,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 11:26:04
      -

      *Thread Reply:* It is still relevant to apache spark because I can't get OpenLineage to find the installed package UNLESS I use extraClassPath.

      +

      *Thread Reply:* It is still relevant to apache spark because I can't get OpenLineage to find the installed package UNLESS I use extraClassPath.

      @@ -62311,7 +62317,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 11:29:13
      -

      *Thread Reply:* Basically, by adding --packages org.apache.iceberg:iceberg_spark_runtime_3.1_2.12:0.13.0

      +

      *Thread Reply:* Basically, by adding --packages org.apache.iceberg:iceberg_spark_runtime_3.1_2.12:0.13.0

      https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]a/io/openlineage/spark/agent/SparkContainerIntegrationTest.java

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -
      2022-07-14 11:29:51
      -

*Thread Reply:* Trying with --packages right now.

+

*Thread Reply:* Trying with --packages right now.

      @@ -62390,7 +62396,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 11:54:37
      -

      *Thread Reply:* Using --packages wouldn't let me find the Spark relation's default source:

      +

      *Thread Reply:* Using --packages wouldn't let me find the Spark relation's default source:

Spark Shell command
spark-shell --master local[4] \

@@ -62440,7 +62446,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 11:59:07
      -

*Thread Reply:* what if you also load your listener using packages?

+

*Thread Reply:* what if you also load your listener using packages?

      @@ -62466,7 +62472,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 12:00:38
      -

*Thread Reply:* That's how I'm doing it locally using spark.conf:

+

*Thread Reply:* That's how I'm doing it locally using spark.conf:
spark.jars.packages com.google.cloud.bigdataoss:gcs_connector:hadoop3-2.2.2,io.delta:delta_core_2.12:1.0.0,org.apache.iceberg:iceberg_spark3_runtime:0.12.1,io.openlineage:openlineage_spark:0.9.0

      @@ -62497,7 +62503,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 12:20:47
      -

*Thread Reply:* @Maciej Obuchowski - You beautiful bearded man!

+

*Thread Reply:* @Maciej Obuchowski - You beautiful bearded man!
🙏
2022-07-14 11:14:21,266 INFO MyLogger: Trying LogicalRelation
2022-07-14 11:14:21,266 INFO MyLogger: Got logical relation

@@ -62541,7 +62547,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 12:21:35
      -

      *Thread Reply:* This is an annoying detail about Java ClassLoaders and the way Spark loads extra jars/packages

      +

      *Thread Reply:* This is an annoying detail about Java ClassLoaders and the way Spark loads extra jars/packages

      Remember Java's ClassLoaders are hierarchical - there are parent ClassLoaders and child ClassLoaders. Parents can't see their children's classes, but children can see their parent's classes.
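[Editor's note: a small probe, assuming it is called from inside a Spark listener, makes the hierarchy visible: it checks whether a class can be resolved from the listener's own loader versus the thread context loader, which is exactly where the --jars vs. extraClassPath differences in this thread show up. The class and method names are illustrative.]

```java
public class ClassLoaderProbe {
    // Checks visibility of className from two different loaders.
    public static void probe(Object listener, String className) {
        report("listener's loader", listener.getClass().getClassLoader(), className);
        report("thread context loader",
               Thread.currentThread().getContextClassLoader(), className);
    }

    private static void report(String label, ClassLoader loader, String className) {
        try {
            // 'false' skips static initialization; we only care about visibility.
            Class.forName(className, false, loader);
            System.out.println(label + ": visible");
        } catch (ClassNotFoundException e) {
            System.out.println(label + ": NOT visible");
        }
    }
}
```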

      @@ -62575,7 +62581,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 12:26:15
      -

      *Thread Reply:* In Databricks, we were not able to simply use the --packages argument to load the listener, which is why we have that init script that copies the jar into the classpath that Databricks uses for application startup (the main ClassLoader). You need to copy your visitor jar into the same location so that both jars are loaded by the same ClassLoader and can see each other

      +

      *Thread Reply:* In Databricks, we were not able to simply use the --packages argument to load the listener, which is why we have that init script that copies the jar into the classpath that Databricks uses for application startup (the main ClassLoader). You need to copy your visitor jar into the same location so that both jars are loaded by the same ClassLoader and can see each other

      @@ -62626,7 +62632,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 12:29:09
      -

      *Thread Reply:* (as an aside, this is one of the major drawbacks of the java agent approach and one reason why all the documentation recommends using the spark.jars.packages configuration parameter for loading the OL library - it guarantees that any DataSource nodes loaded by the Spark ClassLoader can be seen by the OL library and we don't have to use reflection for everything)

      +

      *Thread Reply:* (as an aside, this is one of the major drawbacks of the java agent approach and one reason why all the documentation recommends using the spark.jars.packages configuration parameter for loading the OL library - it guarantees that any DataSource nodes loaded by the Spark ClassLoader can be seen by the OL library and we don't have to use reflection for everything)
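[Editor's note: as a sketch of that recommendation. The Maven coordinate follows Maven Central, and the spark.openlineage.** keys follow the 2022-era integration; both are assumptions to check against your release.]

```java
import org.apache.spark.sql.SparkSession;

public class OpenLineageSessionDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("ol-demo")
            .master("local[4]")
            // Both the OL integration and any extra DataSources resolved here
            // end up in the same ClassLoader, so no reflection tricks are needed.
            .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.11.0")
            .config("spark.extraListeners",
                    "io.openlineage.spark.agent.OpenLineageSparkListener")
            .config("spark.openlineage.host", "http://localhost:5000")  // assumed key
            .config("spark.openlineage.namespace", "demo")              // assumed key
            .getOrCreate();

        spark.range(10).count();  // any action triggers lineage events
        spark.stop();
    }
}
```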

      @@ -62652,7 +62658,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 12:30:25
      -

      *Thread Reply:* @Michael Collado Thank you so much for the reply. The challenge is that Databricks has their own mechanism for installing libraries / packages.

      +

      *Thread Reply:* @Michael Collado Thank you so much for the reply. The challenge is that Databricks has their own mechanism for installing libraries / packages.

      https://docs.microsoft.com/en-us/azure/databricks/libraries/

      @@ -62709,7 +62715,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 12:31:32
      -

      *Thread Reply:* Unfortunately, I can't ask users to install their packages on Databricks in a non-standard way (e.g. via an init script) because no one will follow that recommendation.

      +

      *Thread Reply:* Unfortunately, I can't ask users to install their packages on Databricks in a non-standard way (e.g. via an init script) because no one will follow that recommendation.

      @@ -62735,7 +62741,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 12:32:46
      -

*Thread Reply:* yeah, I'd prefer if we didn't need an init script to get OL on Databricks either 🤷‍♂️

+

*Thread Reply:* yeah, I'd prefer if we didn't need an init script to get OL on Databricks either 🤷‍♂️

      @@ -62765,7 +62771,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-17 01:03:02
      -

*Thread Reply:* Quick update:

+

*Thread Reply:* Quick update:
• Turns out using a class loader from a Scala spark listener does not have this problem.
• https://stackoverflow.com/questions/7671888/scala-classloaders-confusion
• I'm trying to use URLClassLoader as recommended by a few MSFT folks and point it at the /local_disk0/tmp folder.

@@ -62846,7 +62852,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-18 05:45:59
      -

      *Thread Reply:* Can't help you now, but I'd love if you dumped the knowledge you've gained through this process into some doc on new OpenLineage doc site 🙏

      +

      *Thread Reply:* Can't help you now, but I'd love if you dumped the knowledge you've gained through this process into some doc on new OpenLineage doc site 🙏

      @@ -62876,7 +62882,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-18 05:48:15
      -

      *Thread Reply:* We'll definitely put all of it together as a reference for others, and hopefully have a solution by the end of it too

      +

      *Thread Reply:* We'll definitely put all of it together as a reference for others, and hopefully have a solution by the end of it too

      @@ -63025,7 +63031,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 16:40:48
      -

*Thread Reply:* Soo cool, @David Cecchi 💯💯💯. I’m not familiar with MarkLogic, but pretty awesome ETL platform and the lineage graph looks 👌! Did you have to write any custom integration code? Or were you able to use our off-the-shelf integrations to get things working? (Also, thanks for sharing!)

+

*Thread Reply:* Soo cool, @David Cecchi 💯💯💯. I’m not familiar with MarkLogic, but pretty awesome ETL platform and the lineage graph looks 👌! Did you have to write any custom integration code? Or were you able to use our off-the-shelf integrations to get things working? (Also, thanks for sharing!)

      @@ -63051,7 +63057,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 16:57:29
      -

      *Thread Reply:* team had to write some custom stuff but it's all framework so it can be repurposed not rewritten over and over. i would see this as another "Platform" in the context of the integrations semantic OL uses, so no, we didn't start w/ an existing solution. just used internal hooks and then called lineage APIs.

      +

      *Thread Reply:* team had to write some custom stuff but it's all framework so it can be repurposed not rewritten over and over. i would see this as another "Platform" in the context of the integrations semantic OL uses, so no, we didn't start w/ an existing solution. just used internal hooks and then called lineage APIs.

      @@ -63077,7 +63083,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 17:02:53
      -

*Thread Reply:* Ah, totally makes sense. Would you be open to a brief presentation and/or demo in a future OL community meeting? The community is always looking to hear how OL is used in the wild, and this seems aligned with that (assuming you can talk about the implementation at a high level)

+

*Thread Reply:* Ah, totally makes sense. Would you be open to a brief presentation and/or demo in a future OL community meeting? The community is always looking to hear how OL is used in the wild, and this seems aligned with that (assuming you can talk about the implementation at a high level)

      @@ -63103,7 +63109,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 17:05:35
      -

      *Thread Reply:* No pressure, of course 😉

      +

      *Thread Reply:* No pressure, of course 😉

      @@ -63129,7 +63135,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 17:08:50
      -

      *Thread Reply:* ha not feeling any pressure. familiar with the intentions and dynamic. let's keep that on radar - i don't keep tabs on community meetings but mid/late august would be workable. and to be clear, this is being used in the wild in a sandbox 🙂.

      +

      *Thread Reply:* ha not feeling any pressure. familiar with the intentions and dynamic. let's keep that on radar - i don't keep tabs on community meetings but mid/late august would be workable. and to be clear, this is being used in the wild in a sandbox 🙂.

      @@ -63155,7 +63161,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 17:12:55
      -

      *Thread Reply:* Sounds great, and a reasonable timeline! (cc @Michael Robinson can follow up). Even if it’s in a sandbox, talking about the level of effort helps with improving our APIs or sharing with others how smooth it can be!

      +

      *Thread Reply:* Sounds great, and a reasonable timeline! (cc @Michael Robinson can follow up). Even if it’s in a sandbox, talking about the level of effort helps with improving our APIs or sharing with others how smooth it can be!

      @@ -63185,7 +63191,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 17:18:27
      -

      *Thread Reply:* chiming in as well to say this is really cool 👍

      +

      *Thread Reply:* chiming in as well to say this is really cool 👍

      @@ -63211,7 +63217,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 18:26:28
      -

      *Thread Reply:* Nice! Would this become a product feature in Marklogic Data Hub?

      +

      *Thread Reply:* Nice! Would this become a product feature in Marklogic Data Hub?

      @@ -63237,7 +63243,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 11:07:42
      -

      *Thread Reply:* MarkLogic is a multi-model database and search engine. This implementation triggers off the MarkLogic Datahub Github batch records created when running the datahub flows. Just a toe in the water so far.

      +

      *Thread Reply:* MarkLogic is a multi-model database and search engine. This implementation triggers off the MarkLogic Datahub Github batch records created when running the datahub flows. Just a toe in the water so far.

      @@ -63327,7 +63333,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 20:37:25
      -

      *Thread Reply:* That sounds like a good idea to me. Be good to have some guidance on that.

      +

      *Thread Reply:* That sounds like a good idea to me. Be good to have some guidance on that.

      The repo is open for business! Feel free to add the page where you think it fits.

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -
      2022-07-14 20:42:09
      -

      *Thread Reply:* OK! Let’s do this!

      +

      *Thread Reply:* OK! Let’s do this!

      @@ -63414,7 +63420,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 20:59:36
      -

      *Thread Reply:* @Ross Turk, feel free to assign to me https://github.com/OpenLineage/docs/issues/1!

      +

      *Thread Reply:* @Ross Turk, feel free to assign to me https://github.com/OpenLineage/docs/issues/1!

      @@ -63527,7 +63533,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 20:43:09
      -

      *Thread Reply:* Thanks, @Ross Turk for finding a home for more technical / how-to docs… long overdue 💯

      +

      *Thread Reply:* Thanks, @Ross Turk for finding a home for more technical / how-to docs… long overdue 💯

      @@ -63553,7 +63559,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 21:22:09
      -

      *Thread Reply:* BTW you can see the current site at http://openlineage.io/docs/ - merges to main will ship a new site.

      +

      *Thread Reply:* BTW you can see the current site at http://openlineage.io/docs/ - merges to main will ship a new site.

      openlineage.io
      @@ -63600,7 +63606,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 21:23:32
      -

*Thread Reply:* great, was using docs.openlineage.io … we’ll eventually want the docs to live under the docs subdomain though?

+

*Thread Reply:* great, was using docs.openlineage.io … we’ll eventually want the docs to live under the docs subdomain though?

      @@ -63626,7 +63632,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 21:25:32
      -

      *Thread Reply:* TBH I activated GitHub Pages on the repo expecting it to live at openlineage.github.io/docs, thinking we could look at it there before it's ready to be published and linked in to the website

      +

      *Thread Reply:* TBH I activated GitHub Pages on the repo expecting it to live at openlineage.github.io/docs, thinking we could look at it there before it's ready to be published and linked in to the website

      @@ -63652,7 +63658,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 21:25:39
      -

      *Thread Reply:* and it came live at openlineage.io/docs 😄

      +

      *Thread Reply:* and it came live at openlineage.io/docs 😄

      @@ -63678,7 +63684,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 21:26:06
      -

      *Thread Reply:* nice and sounds good 👍

      +

      *Thread Reply:* nice and sounds good 👍

      @@ -63704,7 +63710,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 21:26:31
      -

      *Thread Reply:* still do not understand why, but I'll take it as a happy accident. we can move to docs.openlineage.io easily - just need to add the A record in the LF infra + the CNAME file in the static dir of this repo

      +

      *Thread Reply:* still do not understand why, but I'll take it as a happy accident. we can move to docs.openlineage.io easily - just need to add the A record in the LF infra + the CNAME file in the static dir of this repo

      @@ -63808,7 +63814,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-15 14:41:30
      -

      *Thread Reply:* yes - openlineage is job -> dataset -> job. particularly, the model is designed to observe the movement of data

      +

      *Thread Reply:* yes - openlineage is job -> dataset -> job. particularly, the model is designed to observe the movement of data

      @@ -63834,7 +63840,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-15 14:43:41
      -

      *Thread Reply:* the spec is based around run events, which are observed states of job runs. jobs are observed to see how they affect datasets, and that relationship is what OpenLineage traces

      +

      *Thread Reply:* the spec is based around run events, which are observed states of job runs. jobs are observed to see how they affect datasets, and that relationship is what OpenLineage traces

      @@ -63938,7 +63944,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-18 15:16:54
      -

*Thread Reply:* This thread covers glue in some detail: https://openlineage.slack.com/archives/C01CK9T7HKR/p1637605977118000?threadts=1637605977.118000&cid=C01CK9T7HKR

+

*Thread Reply:* This thread covers glue in some detail: https://openlineage.slack.com/archives/C01CK9T7HKR/p1637605977118000?threadts=1637605977.118000&cid=C01CK9T7HKR

      @@ -63999,7 +64005,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-18 15:17:49
      -

*Thread Reply:* TL;DR: you can use the spark integration to capture some lineage, but it's not comprehensive

+

*Thread Reply:* TL;DR: you can use the spark integration to capture some lineage, but it's not comprehensive

      @@ -64025,7 +64031,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-18 16:29:02
      -

      *Thread Reply:* i suspect there will be opportunities to influence AWS to be a "fast follower" if OL adoption and buy-in starts to feel authentically real in non-aws portions of the stack. i discussed OL casually with AWS analytics leadership (Rahul Pathak) last winter and he seemed curious and open to this type of idea. to be clear, ~95% chance he's forgotten that conversation now but hey it's still something.

      +

      *Thread Reply:* i suspect there will be opportunities to influence AWS to be a "fast follower" if OL adoption and buy-in starts to feel authentically real in non-aws portions of the stack. i discussed OL casually with AWS analytics leadership (Rahul Pathak) last winter and he seemed curious and open to this type of idea. to be clear, ~95% chance he's forgotten that conversation now but hey it's still something.

      @@ -64055,7 +64061,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-18 19:34:32
      -

      *Thread Reply:* There are a couple of aws people here (including me) following.

      +

      *Thread Reply:* There are a couple of aws people here (including me) following.

      @@ -64120,7 +64126,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 20:11:38
      -

      *Thread Reply:* I believe what you want is the DocumentationJobFacet. It adds a description property to a job.

      +

      *Thread Reply:* I believe what you want is the DocumentationJobFacet. It adds a description property to a job.

      @@ -64146,7 +64152,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 20:13:03
      -

      *Thread Reply:* You can see a Python example here, in the Airflow integration: https://github.com/OpenLineage/OpenLineage/blob/65a5f021a1ba3035d5198e759587737a05b242e1/integration/airflow/openlineage/airflow/adapter.py#L217

      +

      *Thread Reply:* You can see a Python example here, in the Airflow integration: https://github.com/OpenLineage/OpenLineage/blob/65a5f021a1ba3035d5198e759587737a05b242e1/integration/airflow/openlineage/airflow/adapter.py#L217

      @@ -64176,7 +64182,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 20:13:18
      -

      *Thread Reply:* (looking for a curl example…)

      +

      *Thread Reply:* (looking for a curl example…)

      @@ -64202,7 +64208,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 20:25:49
      -

      *Thread Reply:* I see, so there are special facet keys which will get translated into something special in the ui, is that correct?

      +

      *Thread Reply:* I see, so there are special facet keys which will get translated into something special in the ui, is that correct?

      Are these documented anywhere?

      @@ -64230,7 +64236,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 20:27:55
      -

      *Thread Reply:* Correct - info from the various OpenLineage facets are used in the Marquez UI.

      +

      *Thread Reply:* Correct - info from the various OpenLineage facets are used in the Marquez UI.

      @@ -64256,7 +64262,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 20:28:28
      -

      *Thread Reply:* I couldn’t find a curl example with a description field, but I did generate this one with a sql field:

      +

      *Thread Reply:* I couldn’t find a curl example with a description field, but I did generate this one with a sql field:

      { "job": { @@ -64309,7 +64315,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 20:28:58
      -

      *Thread Reply:* The facets (at least, those in the core spec) are here: https://github.com/OpenLineage/OpenLineage/tree/65a5f021a1ba3035d5198e759587737a05b242e1/spec/facets

      +

      *Thread Reply:* The facets (at least, those in the core spec) are here: https://github.com/OpenLineage/OpenLineage/tree/65a5f021a1ba3035d5198e759587737a05b242e1/spec/facets

      @@ -64335,7 +64341,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 20:29:19
      -

      *Thread Reply:* it’s designed so that facets can exist outside the core, in other repos, as well

      +

      *Thread Reply:* it’s designed so that facets can exist outside the core, in other repos, as well

      @@ -64361,7 +64367,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 22:25:39
      -

*Thread Reply:* Thank you for sharing these, I was able to get the sql query highlighting to work. But I failed to get the location link or the documentation to work. My facet attempt looked like:

+

*Thread Reply:* Thank you for sharing these, I was able to get the sql query highlighting to work. But I failed to get the location link or the documentation to work. My facet attempt looked like:
{
  "facets": {
    "description": "test-description-job",

@@ -64405,7 +64411,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 22:36:55
      -

      *Thread Reply:* I got the documentation link to work by renaming the property from documentation -> description . I still haven't been able to get the external link to work

      +

      *Thread Reply:* I got the documentation link to work by renaming the property from documentation -> description . I still haven't been able to get the external link to work
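[Editor's note: for reference, a hedged sketch of the curl-style payload discussed above, sent with the JDK's built-in HTTP client. The documentation job facet carries the description field per the spec, the Marquez endpoint is /api/v1/lineage, and the producer/schema URLs are placeholders; as this thread shows, how Marquez's UI surfaced the facet varied at the time.]

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Instant;
import java.util.UUID;

public class EmitDocumentedJob {
    public static void main(String[] args) throws Exception {
        String body = """
            {
              "eventType": "COMPLETE",
              "eventTime": "%s",
              "run": {"runId": "%s"},
              "job": {
                "namespace": "my-namespace",
                "name": "my-job",
                "facets": {
                  "documentation": {
                    "_producer": "https://example.com/my-producer",
                    "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/DocumentationJobFacet.json",
                    "description": "test-description-job"
                  }
                }
              },
              "inputs": [],
              "outputs": [],
              "producer": "https://example.com/my-producer"
            }
            """.formatted(Instant.now(), UUID.randomUUID());

        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://localhost:5000/api/v1/lineage"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());  // expect a 2xx on success
    }
}
```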

      @@ -64512,7 +64518,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-21 16:25:47
      -

      *Thread Reply:* There’s a good integration between OL and Databricks for pulling metadata out of running Spark clusters. But there’s not currently a connection between OL and the Unity Catalog.

      +

      *Thread Reply:* There’s a good integration between OL and Databricks for pulling metadata out of running Spark clusters. But there’s not currently a connection between OL and the Unity Catalog.

      I think it would be cool to see some discussions start to develop around it 👍

      @@ -64544,7 +64550,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-21 16:26:44
      -

      *Thread Reply:* Absolutely. I saw some mention of APIs and access, and was wondering if maybe they used OpenLineage as a framework, which would be awesome.

      +

      *Thread Reply:* Absolutely. I saw some mention of APIs and access, and was wondering if maybe they used OpenLineage as a framework, which would be awesome.

      @@ -64570,7 +64576,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-21 16:30:55
      -

      *Thread Reply:* (and since Azure Databricks uses it - https://openlineage.io/blog/openlineage-microsoft-purview/ I wasn’t sure about Unity Catalog)

      +

      *Thread Reply:* (and since Azure Databricks uses it - https://openlineage.io/blog/openlineage-microsoft-purview/ I wasn’t sure about Unity Catalog)

      openlineage.io
      @@ -64621,7 +64627,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-21 16:56:24
      -

      *Thread Reply:* We're in the early stages of discussion regarding an OpenLineage integration for Unity. You showing interest would help increase the priority of that on the DB side.

      +

      *Thread Reply:* We're in the early stages of discussion regarding an OpenLineage integration for Unity. You showing interest would help increase the priority of that on the DB side.

      @@ -64651,7 +64657,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-27 11:41:48
      -

      *Thread Reply:* I'm interested in Databricks enabling an openlineage endpoint, serving as a catalogue. Similar to how they provide hosted MLFlow. I can mention this to our Databricks reps as well

      +

      *Thread Reply:* I'm interested in Databricks enabling an openlineage endpoint, serving as a catalogue. Similar to how they provide hosted MLFlow. I can mention this to our Databricks reps as well

      @@ -64706,7 +64712,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-23 04:12:26
      -

      *Thread Reply:* Link to spec where I looked https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.json

      +

      *Thread Reply:* Link to spec where I looked https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.json

      @@ -64993,7 +64999,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-23 04:37:11
      -

*Thread Reply:* My bad. I realize now that column lineage has been implemented as a facet, hence not visible in the main spec https://github.com/OpenLineage/OpenLineage/search?q=ColumnLineageDatasetFacet&type=

+

*Thread Reply:* My bad. I realize now that column lineage has been implemented as a facet, hence not visible in the main spec https://github.com/OpenLineage/OpenLineage/search?q=ColumnLineageDatasetFacet&type=
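[Editor's note: for anyone landing here, the shape of that facet looks roughly like the following, shown as a JSON literal in a Java text block for consistency with the other sketches. Field names follow the ColumnLineageDatasetFacet schema in the spec repo and should be checked against the current version.]

```java
public class ColumnLineageFacetShape {
    public static void main(String[] args) {
        // Each output column maps to the input columns it was derived from.
        String columnLineage = """
            {
              "columnLineage": {
                "_producer": "https://example.com/my-producer",
                "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/ColumnLineageDatasetFacet.json",
                "fields": {
                  "output_column": {
                    "inputFields": [
                      {"namespace": "example", "name": "db.input_table", "field": "input_column"}
                    ]
                  }
                }
              }
            }
            """;
        System.out.println(columnLineage);
    }
}
```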

      @@ -65023,7 +65029,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-26 19:37:54
      -

      *Thread Reply:* It is supported in the Spark integration

      +

      *Thread Reply:* It is supported in the Spark integration

      @@ -65049,7 +65055,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-26 19:39:13
      -

      *Thread Reply:* @Paweł Leszczyński could you add the Column Lineage facet here in the spec? https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md#standard-facets

      +

      *Thread Reply:* @Paweł Leszczyński could you add the Column Lineage facet here in the spec? https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md#standard-facets

      @@ -65116,7 +65122,7 @@

      SundayFunday

      2022-07-24 16:26:59
      -

      *Thread Reply:* @Ross Turk I still want to contribute something like this to the OpenLineage docs / new site but the bar for an internal doc is lower in my mind 😅

      +

      *Thread Reply:* @Ross Turk I still want to contribute something like this to the OpenLineage docs / new site but the bar for an internal doc is lower in my mind 😅

      @@ -65142,7 +65148,7 @@

      SundayFunday

      2022-07-25 11:49:54
      -

      *Thread Reply:* 😄

      +

      *Thread Reply:* 😄

      @@ -65168,7 +65174,7 @@

      SundayFunday

      2022-07-25 11:50:54
      -

      *Thread Reply:* @Will Johnson happy to help you with docs, when the time comes! sketching outline --> editing, whatever you need

      +

      *Thread Reply:* @Will Johnson happy to help you with docs, when the time comes! sketching outline --> editing, whatever you need

      @@ -65194,7 +65200,7 @@

      SundayFunday

      2022-07-26 19:39:56
      -

      *Thread Reply:* This looks nice by the way.

      +

      *Thread Reply:* This looks nice by the way.

      @@ -65268,7 +65274,7 @@

      SundayFunday

      2022-07-26 10:39:57
      -

      *Thread Reply:* do you have proper execute permissions?

      +

      *Thread Reply:* do you have proper execute permissions?

      @@ -65294,7 +65300,7 @@

      SundayFunday

      2022-07-26 10:41:09
      -

      *Thread Reply:* not sure how that works on windows, but it just looks like it does not recognize dbt-ol as executable

      +

      *Thread Reply:* not sure how that works on windows, but it just looks like it does not recognize dbt-ol as executable

      @@ -65320,7 +65326,7 @@

      SundayFunday

      2022-07-26 10:43:00
      -

*Thread Reply:* yes, I have admin rights. how do I make this executable?

+

*Thread Reply:* yes, I have admin rights. how do I make this executable?

      @@ -65346,7 +65352,7 @@

      SundayFunday

      2022-07-26 10:43:25
      -

      *Thread Reply:* btw do we have a sample docker image where dbt-ol can run?

      +

      *Thread Reply:* btw do we have a sample docker image where dbt-ol can run?

      @@ -65372,7 +65378,7 @@

      SundayFunday

      2022-07-26 17:33:08
      -

      *Thread Reply:* I have also never tried on Windows 😕 but you might try python3 dbt-ol run?

      +

      *Thread Reply:* I have also never tried on Windows 😕 but you might try python3 dbt-ol run?

      @@ -65398,7 +65404,7 @@

      SundayFunday

      2022-07-26 21:03:43
      -

      *Thread Reply:* will try that

      +

      *Thread Reply:* will try that

      @@ -65480,7 +65486,7 @@

      SundayFunday

      2022-07-27 01:54:31
      -

*Thread Reply:* This may be a result of splitting Spark integration into multiple submodules: app, shared, spark2, spark3, spark32, etc. If the test case is from shared submodule (this one looks like that), you could try running:

+

*Thread Reply:* This may be a result of splitting Spark integration into multiple submodules: app, shared, spark2, spark3, spark32, etc. If the test case is from shared submodule (this one looks like that), you could try running:
./gradlew :shared:test --tests io.openlineage.spark.agent.OpenLineageSparkListenerTest

      @@ -65507,7 +65513,7 @@

      SundayFunday

      2022-07-27 03:18:42
      -

      *Thread Reply:* @Paweł Leszczyński, I tried running that command, and I get the following error:

      +

      *Thread Reply:* @Paweł Leszczyński, I tried running that command, and I get the following error:

&gt; Task :shared:test FAILED

      @@ -65557,7 +65563,7 @@

      SundayFunday

      2022-07-27 03:24:41
      -

*Thread Reply:* When running build and test for all the submodules, I can see outputs for tests in different submodules (spark3, spark2, etc.), but for some reason, I cannot find any indication that the tests in

+

*Thread Reply:* When running build and test for all the submodules, I can see outputs for tests in different submodules (spark3, spark2, etc.), but for some reason, I cannot find any indication that the tests in OpenLineage/integration/spark/app/src/test/java/io/openlineage/spark/agent/lifecycle/plan are being run at all.

      @@ -65585,7 +65591,7 @@

      SundayFunday

      2022-07-27 03:42:43
      -

      *Thread Reply:* That’s interesting. Let’s ask @Tomasz Nazarewicz about that.

      +

      *Thread Reply:* That’s interesting. Let’s ask @Tomasz Nazarewicz about that.

      @@ -65615,11 +65621,11 @@

      SundayFunday

      2022-07-27 03:57:08
      -

*Thread Reply:* For reference, I attached the stdout and stderr messages from running the following:

+

*Thread Reply:* For reference, I attached the stdout and stderr messages from running the following:
./gradlew :shared:spotlessApply && ./gradlew :app:spotlessApply && ./gradlew clean build test

@@ -65651,7 +65657,7 @@

      SundayFunday

      2022-07-27 04:27:23
      -

      *Thread Reply:* I'll look into it

      +

      *Thread Reply:* I'll look into it

      @@ -65677,7 +65683,7 @@

      SundayFunday

      2022-07-28 05:17:36
      -

*Thread Reply:* Update: some tests appeared to not be visible after the split; that's fixed, but now I have to solve some dependency issues

+

*Thread Reply:* Update: some tests appeared to not be visible after the split; that's fixed, but now I have to solve some dependency issues

      @@ -65707,7 +65713,7 @@

      SundayFunday

      2022-07-28 05:19:16
      -

      *Thread Reply:* That's great, thank you!

      +

      *Thread Reply:* That's great, thank you!

      @@ -65733,7 +65739,7 @@

      SundayFunday

      2022-07-29 06:05:55
      -

      *Thread Reply:* Hi Tomasz, thanks so much for looking into this. Is this your PR (https://github.com/OpenLineage/OpenLineage/pull/953) that fixes the whole issue, or is there still some work to do to solve the dependency issues you mentioned?

      +

      *Thread Reply:* Hi Tomasz, thanks so much for looking into this. Is this your PR (https://github.com/OpenLineage/OpenLineage/pull/953) that fixes the whole issue, or is there still some work to do to solve the dependency issues you mentioned?

      @@ -65799,7 +65805,7 @@

      SundayFunday

      2022-07-29 06:07:58
      -

      *Thread Reply:* I'm still testing it, should've changed it to draft, sorry

      +

      *Thread Reply:* I'm still testing it, should've changed it to draft, sorry

      @@ -65829,7 +65835,7 @@

      SundayFunday

      2022-07-29 06:08:59
      -

      *Thread Reply:* No worries! If I can help with testing or anything please let me know!

      +

      *Thread Reply:* No worries! If I can help with testing or anything please let me know!

      @@ -65855,7 +65861,7 @@

      SundayFunday

      2022-07-29 06:09:29
      -

      *Thread Reply:* Will do! Thanks :)

      +

      *Thread Reply:* Will do! Thanks :)

      @@ -65881,7 +65887,7 @@

      SundayFunday

      2022-08-02 11:06:31
      -

      *Thread Reply:* Hi @Tomasz Nazarewicz, if possible, could you please share an estimated timeline for resolving the issue? We have 3 PRs which we are either waiting to open or to update which are dependent on the tests.

      +

      *Thread Reply:* Hi @Tomasz Nazarewicz, if possible, could you please share an estimated timeline for resolving the issue? We have 3 PRs which we are either waiting to open or to update which are dependent on the tests.

      @@ -65907,7 +65913,7 @@

      SundayFunday

      2022-08-02 13:45:34
      -

*Thread Reply:* @Hanna Moazam hi, it's quite difficult to do that because the issue is that all the tests are passing when I execute ./gradlew app:test

+

      *Thread Reply:* @Hanna Moazam hi, it's quite difficult to do that because the issue is that all the tests are passing when I execute ./gradlew app:test but one is failing with ./gradlew app:build

      but if it fixes your problem I can disable this test for now and make a PR without it, then you can maybe unblock your stuff and I will have more time to investigate the issue.

      @@ -65936,7 +65942,7 @@

      SundayFunday

      2022-08-02 14:54:45
      -

      *Thread Reply:* Oh that's a strange issue. Yes that would be really helpful if you can, because we have some tests we implemented which we need to make sure pass as expected.

      +

      *Thread Reply:* Oh that's a strange issue. Yes that would be really helpful if you can, because we have some tests we implemented which we need to make sure pass as expected.

      @@ -65962,7 +65968,7 @@

      SundayFunday

      2022-08-02 14:54:52
      -

      *Thread Reply:* Thank you for your help Tomasz!

      +

      *Thread Reply:* Thank you for your help Tomasz!

      @@ -65988,7 +65994,7 @@

      SundayFunday

      2022-08-03 06:12:07
      -

      *Thread Reply:* @Hanna Moazam https://github.com/OpenLineage/OpenLineage/pull/980 here is the pull request with the changes

      +

      *Thread Reply:* @Hanna Moazam https://github.com/OpenLineage/OpenLineage/pull/980 here is the pull request with the changes

      @@ -66060,7 +66066,7 @@

      SundayFunday

      2022-08-03 06:12:26
      -

*Thread Reply:* it's waiting for review currently

+

*Thread Reply:* it's waiting for review currently

      @@ -66086,7 +66092,7 @@

      SundayFunday

      2022-08-03 06:20:41
      -

      *Thread Reply:* Thank you!

      +

      *Thread Reply:* Thank you!

      @@ -66261,7 +66267,7 @@

      SundayFunday

      2022-07-26 19:41:13
      -

      *Thread Reply:* The doc site would benefit from a page about it. Maybe @Paweł Leszczyński?

      +

      *Thread Reply:* The doc site would benefit from a page about it. Maybe @Paweł Leszczyński?

      @@ -66287,7 +66293,7 @@

      SundayFunday

      2022-07-27 01:59:27
      -

      *Thread Reply:* Sure, it’s already on my list, will do

      +

      *Thread Reply:* Sure, it’s already on my list, will do

      @@ -66317,7 +66323,7 @@

      SundayFunday

      2022-07-29 07:55:40
      -

      *Thread Reply:* https://openlineage.io/docs/integrations/spark/spark_column_lineage

      +

      *Thread Reply:* https://openlineage.io/docs/integrations/spark/spark_column_lineage

      openlineage.io
      @@ -66403,7 +66409,7 @@

      SundayFunday

      2022-07-27 04:08:18
      -

      *Thread Reply:* To be honest, I have never seen that in action and would love to have that in our documentation.

      +

      *Thread Reply:* To be honest, I have never seen that in action and would love to have that in our documentation.

      @Michael Collado or @Maciej Obuchowski: are you able to create some doc? I think one of you was working on that.

      @@ -66431,7 +66437,7 @@

      SundayFunday

      2022-07-27 04:24:19
      -

      *Thread Reply:* Yes, parent run

      +

      *Thread Reply:* Yes, parent run

      @@ -66487,7 +66493,7 @@

      SundayFunday

      2022-07-27 04:47:18
      -

      *Thread Reply:* Will take a look

      +

      *Thread Reply:* Will take a look

      @@ -66513,7 +66519,7 @@

      SundayFunday

      2022-07-27 04:47:40
      -

      *Thread Reply:* But generally this support message is just a warning

      +

      *Thread Reply:* But generally this support message is just a warning

      @@ -66539,7 +66545,7 @@

      SundayFunday

      2022-07-27 10:04:20
      -

*Thread Reply:* @shweta p any actual error you've found?

+

*Thread Reply:* @shweta p any actual error you've found?
I've tested it with dbt-bigquery on 1.1.0 and it works despite warning:

➜ small OPENLINEAGE_URL=<http://localhost:5050> dbt-ol build

@@ -66620,7 +66626,7 @@

      SundayFunday

      2022-07-27 20:41:44
      -

      *Thread Reply:* I think it's safe to say we'll see a release by the end of next week

      +

      *Thread Reply:* I think it's safe to say we'll see a release by the end of next week

      @@ -66719,7 +66725,7 @@

      SundayFunday

      2022-07-28 08:50:44
      -

      *Thread Reply:* Welcome to the community!

      +

      *Thread Reply:* Welcome to the community!

      We talked about this exact topic in the most recent community call. https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting#MonthlyTSCmeeting-Nextmeeting:Nov10th2021(9amPT)

      @@ -66764,7 +66770,7 @@

      SundayFunday

      2022-07-28 09:56:05
      -

*Thread Reply:* > or do you mean reporting start job/end job should be processed each event?

+

      *Thread Reply:* > or do you mean reporting start job/end job should be processed each event? We definitely want to avoid tracking every single event 🙂

      One thing worth mentioning is that OpenLineage events are meant to be cumulative - the streaming jobs start, run, and eventually finish or restart. In the meantime, we capture additional events "in the middle" - for example, on Apache Flink checkpoint, or every few minutes - where we can emit additional information connected to the state of the job.

      @@ -66797,7 +66803,7 @@

      SundayFunday

      2022-07-28 11:11:17
      -

      *Thread Reply:* @Will Johnson and @Maciej Obuchowski Thank you for your answer

      +

      *Thread Reply:* @Will Johnson and @Maciej Obuchowski Thank you for your answer

      jobs start, run, and eventually finish or restart

      @@ -66836,7 +66842,7 @@

      SundayFunday

      2022-07-28 11:11:36
      -

      *Thread Reply:*

      +

      *Thread Reply:*

      @@ -66871,7 +66877,7 @@

      SundayFunday

      2022-07-28 12:00:34
      -

      *Thread Reply:* The idea is that jobs usually get upgraded - for example, you change Apache Flink version, increase resources, or change the structure of a job - that's the difference for us. The stop events make sense, because if you for example changed SQL of your Flink SQL job, you probably would want this to be captured - from X to Y job was running with older SQL version well, but after change, the second run started and throughput dropped to 10% of the previous one.

      +

      *Thread Reply:* The idea is that jobs usually get upgraded - for example, you change Apache Flink version, increase resources, or change the structure of a job - that's the difference for us. The stop events make sense, because if you for example changed SQL of your Flink SQL job, you probably would want this to be captured - from X to Y job was running with older SQL version well, but after change, the second run started and throughput dropped to 10% of the previous one.

> if you do start/stop events from the checkpoints on Flink it might be the wrong representation instead use the concept of event-driven for example reporting state.
But this is a misunderstanding 🙂

@@ -66903,7 +66909,7 @@

      SundayFunday

      2022-07-28 12:01:16
      -

*Thread Reply:* The checkpoint would be captured as a new eventType: RUNNING - am I missing something about why you want to add a StateFacet?

      +

*Thread Reply:* The checkpoint would be captured as a new eventType: RUNNING - am I missing something about why you want to add a StateFacet?
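
A minimal sketch of what such a checkpoint event could look like with the Python client - the job name, backend URL, and producer below are invented for illustration, and this assumes the RUNNING state discussed here is available in the client:

from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # assumed backend
run_id = str(uuid4())  # stays constant for the lifetime of the streaming job

# Emitted on every Flink checkpoint: same runId, eventType RUNNING.
client.emit(RunEvent(
    eventType=RunState.RUNNING,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=run_id),
    job=Job(namespace="streaming", name="flink_enrichment_job"),
    producer="https://github.com/OpenLineage/OpenLineage/tree/main/client/python",
))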

      @@ -66933,7 +66939,7 @@

      SundayFunday

      2022-07-28 14:24:03
      -

*Thread Reply:* About the argument - it depends on the definition of a job in streaming mode. I agree that if you already have a ‘job’, you want to know more information about it.

      +

*Thread Reply:* About the argument - it depends on the definition of a job in streaming mode. I agree that if you already have a ‘job’, you want to know more information about it.

Should each event entering the sub-process (job) make a REST call for “Start job” and “End job”?

      @@ -66999,7 +67005,7 @@

      SundayFunday

      2022-07-28 17:30:33
      -

      *Thread Reply:* Thanks for the +1s. We will initiate the release by Tuesday.

      +

      *Thread Reply:* Thanks for the +1s. We will initiate the release by Tuesday.

      @@ -67054,7 +67060,7 @@

      SundayFunday

      2022-08-03 08:30:15
      -

      *Thread Reply:* I think this is an interesting idea, however, just the static analysis does not convey any runtime information.

      +

      *Thread Reply:* I think this is an interesting idea, however, just the static analysis does not convey any runtime information.

      We're doing something similar within Airflow now, but as a fallback mechanism: https://github.com/OpenLineage/OpenLineage/pull/914

      @@ -67084,7 +67090,7 @@

      SundayFunday

      2022-08-03 14:25:31
      -

*Thread Reply:* Thanks for the detailed response @Maciej Obuchowski! It seems like this solution is specific only to Airflow, and I wonder why we wouldn't generalize this outside of just Airflow? My thinking is that there are other areas where there is vast scope (e.g. arbitrary code that does data manipulations), and without such an option, the only path is to provide full runtime information via building your own extractor, which might be a bit hard/expensive to do.

+

*Thread Reply:* Thanks for the detailed response @Maciej Obuchowski! It seems like this solution is specific only to Airflow, and I wonder why we wouldn't generalize this outside of just Airflow? My thinking is that there are other areas where there is vast scope (e.g. arbitrary code that does data manipulations), and without such an option, the only path is to provide full runtime information via building your own extractor, which might be a bit hard/expensive to do.
If I understand your response correctly, then you assume that OpenLineage can get wide enough "native" support across the stack without resorting to a fallback like 'static code analysis'. Is that your base assumption?

      @@ -67164,7 +67170,7 @@

      SundayFunday

      2022-07-29 12:58:38
      -

      *Thread Reply:* The query API has a different spec than the reporting API, so what you’d get from Marquez would look different from what Marquez receives.

      +

      *Thread Reply:* The query API has a different spec than the reporting API, so what you’d get from Marquez would look different from what Marquez receives.

      Few ideas:

      @@ -67195,7 +67201,7 @@

      SundayFunday

      2022-07-30 16:29:24
      -

      *Thread Reply:* ok, now I understand, thank you

      +

      *Thread Reply:* ok, now I understand, thank you

      @@ -67225,7 +67231,7 @@

      SundayFunday

      2022-08-03 08:25:57
      -

      *Thread Reply:* FYI we want to have something like that too: https://github.com/MarquezProject/marquez/issues/1927

      +

      *Thread Reply:* FYI we want to have something like that too: https://github.com/MarquezProject/marquez/issues/1927

      But if you need just the raw events endpoint, without UI, then Marquez might be overkill for your needs

      SundayFunday
      2022-07-30 13:49:25
      -

*Thread Reply:* Short of implementing an OpenLineage endpoint in Amundsen, yes that's the right approach.

      +

*Thread Reply:* Short of implementing an OpenLineage endpoint in Amundsen, yes that's the right approach.

The Lineage endpoint in Marquez can output the whole graph centered on a node ID, and you can use the jobs/datasets APIs to grab lists of each for reference
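
A sketch of those queries against a local Marquez, assuming the default API port and a made-up namespace and dataset (adjust both to your deployment):

import requests

MARQUEZ = "http://localhost:5000/api/v1"  # assumed Marquez API base

# Whole lineage graph centered on one node; nodeId format is dataset:<namespace>:<name>.
graph = requests.get(f"{MARQUEZ}/lineage",
                     params={"nodeId": "dataset:my-namespace:my-table"}).json()
print(len(graph.get("graph", [])), "nodes in the lineage graph")

# Lists of jobs and datasets in a namespace, for reference.
jobs = requests.get(f"{MARQUEZ}/namespaces/my-namespace/jobs").json()
datasets = requests.get(f"{MARQUEZ}/namespaces/my-namespace/datasets").json()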

      @@ -67341,7 +67347,7 @@

      SundayFunday

      2022-07-31 00:35:06
      -

      *Thread Reply:* Is your lineage information coming via OpenLineage? if so - you can quickly use the Amundsen scripts in order to load data into Amundsen, for example, see this script here: https://github.com/amundsen-io/amundsendatabuilder/blob/master/example/scripts/sample_data_loader.py

      +

      *Thread Reply:* Is your lineage information coming via OpenLineage? if so - you can quickly use the Amundsen scripts in order to load data into Amundsen, for example, see this script here: https://github.com/amundsen-io/amundsendatabuilder/blob/master/example/scripts/sample_data_loader.py

      Where is your lineage coming from?

      SundayFunday
      2022-08-01 20:17:22
      -

*Thread Reply:* yes @Barak F we are using OpenLineage

      +

*Thread Reply:* yes @Barak F we are using OpenLineage

      @@ -67598,7 +67604,7 @@

      SundayFunday

      2022-08-02 01:26:18
      -

      *Thread Reply:* So, have you tried using Amundsen data builder scripts to load the lineage information into Amundsen? (maybe you'll have to "play" with those a bit)

      +

      *Thread Reply:* So, have you tried using Amundsen data builder scripts to load the lineage information into Amundsen? (maybe you'll have to "play" with those a bit)

      @@ -67624,7 +67630,7 @@

      SundayFunday

      2022-08-03 08:24:58
      -

      *Thread Reply:* AFAIK there is OpenLineage extractor: https://www.amundsen.io/amundsen/databuilder/#openlineagetablelineageextractor

      +

      *Thread Reply:* AFAIK there is OpenLineage extractor: https://www.amundsen.io/amundsen/databuilder/#openlineagetablelineageextractor

      Not sure it solves your issue though 🙂

      @@ -67652,7 +67658,7 @@

      SundayFunday

      2022-08-05 04:46:45
      -

      *Thread Reply:* thanks

      +

      *Thread Reply:* thanks

      @@ -67815,7 +67821,7 @@

      SundayFunday

      2022-08-02 12:28:11
      -

*Thread Reply:* I think the reason for the server model being very generic is that new facets can be added later (also as custom facets) - and generally the server wants to accept all valid events and get the facet information that it can actually use, rather than reject an event because it has an unknown field.

      +

*Thread Reply:* I think the reason for the server model being very generic is that new facets can be added later (also as custom facets) - and generally the server wants to accept all valid events and get the facet information that it can actually use, rather than reject an event because it has an unknown field.

      Server model was added here after some discussion in Marquez which is relevant - I think @Michael Collado @Willy Lulciuc can add to that
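
A sketch of the tolerant-consumer pattern this implies, assuming events arrive as parsed JSON dicts - the facet names picked out below are just examples:

def extract_known_facets(event: dict) -> dict:
    """Pull out only the facets this server understands; skip everything else.

    Unknown facets are ignored instead of causing the event to be rejected.
    """
    known = {}
    run_facets = event.get("run", {}).get("facets", {})
    if "parent" in run_facets:
        known["parent"] = run_facets["parent"]
    for ds in event.get("inputs", []) + event.get("outputs", []):
        schema = ds.get("facets", {}).get("schema")
        if schema is not None:
            known[f"schema:{ds['name']}"] = schema
    return known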

      SundayFunday
      2022-08-02 15:54:24
      -

*Thread Reply:* Thanks for the response. I realize the server stubs were created to support flexibility, but it also makes the parsing logic on the server side a bit more complex, as we need to maintain code on the server side to look for specific facets & their properties from maps, or, like Marquez, duplicate the OL model on our end with the facets we care about. Wanted to know what's the guidance around managing this server side. @Willy Lulciuc @Michael Collado Any suggestions?

      +

*Thread Reply:* Thanks for the response. I realize the server stubs were created to support flexibility, but it also makes the parsing logic on the server side a bit more complex, as we need to maintain code on the server side to look for specific facets & their properties from maps, or, like Marquez, duplicate the OL model on our end with the facets we care about. Wanted to know what's the guidance around managing this server side. @Willy Lulciuc @Michael Collado Any suggestions?

      @@ -67972,7 +67978,7 @@

      SundayFunday

      2022-08-03 04:45:36
      -

      *Thread Reply:* @Paweł Leszczyński do we collect column level lineage on renames?

      +

      *Thread Reply:* @Paweł Leszczyński do we collect column level lineage on renames?

      @@ -67998,7 +68004,7 @@

      SundayFunday

      2022-08-03 04:56:23
      -

*Thread Reply:* @Maciej Obuchowski no, we don’t.

+

*Thread Reply:* @Maciej Obuchowski no, we don’t.
@Varun Singh create table as select may suit you well. Other examples are within tests like:
• https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/spark3/src/[…]lifecycle/plan/column/ColumnLevelLineageUtilsV2CatalogTest.java
• https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/spark3/src/[…]ecycle/plan/column/ColumnLevelLineageUtilsNonV2CatalogTest.java

      @@ -68075,7 +68081,7 @@

      SundayFunday

      2022-08-05 05:55:12
      -

*Thread Reply:* I’ve created an issue for column lineage in case of renaming:

+

      *Thread Reply:* I’ve created an issue for column lineage in case of renaming: https://github.com/OpenLineage/OpenLineage/issues/993

      @@ -68131,7 +68137,7 @@

      SundayFunday

      2022-08-08 09:37:43
      -

      *Thread Reply:* Thanks @Paweł Leszczyński!

      +

      *Thread Reply:* Thanks @Paweł Leszczyński!

      @@ -68183,7 +68189,7 @@

      SundayFunday

      2022-08-03 13:00:22
      -

      *Thread Reply:* Fivetran is a tool that copies data from source systems to target databases. One of these source systems might be SalesForce, for example.

      +

      *Thread Reply:* Fivetran is a tool that copies data from source systems to target databases. One of these source systems might be SalesForce, for example.

      This copying results in thousands of SQL queries run against the target database for each sync. I don’t think each of these queries should map to an OpenLineage job, I think the entire synchronization should. Maybe I’m wrong here.

      @@ -68211,7 +68217,7 @@

      SundayFunday

      2022-08-03 13:01:00
      -

      *Thread Reply:* But if I’m right, that means that there needs to be a way to specify “SalesForce Account #45123452233” as a dataset.

      +

      *Thread Reply:* But if I’m right, that means that there needs to be a way to specify “SalesForce Account #45123452233” as a dataset.

      @@ -68237,7 +68243,7 @@

      SundayFunday

      2022-08-03 13:01:44
      -

      *Thread Reply:* or it ends up just being a job with outputs and no inputs…but that’s not very illuminating

      +

      *Thread Reply:* or it ends up just being a job with outputs and no inputs…but that’s not very illuminating

      @@ -68263,7 +68269,7 @@

      SundayFunday

      2022-08-03 13:02:27
      -

      *Thread Reply:* or is that good enough?

      +

      *Thread Reply:* or is that good enough?

      @@ -68289,7 +68295,7 @@

      SundayFunday

      2022-08-04 10:31:11
      -

      *Thread Reply:* You are looking at a pretty big topic here 🙂

      +

      *Thread Reply:* You are looking at a pretty big topic here 🙂

      Basically you're asking what is a job in OpenLineage - and it's not fully answered yet.

      @@ -68319,7 +68325,7 @@

      SundayFunday

      2022-08-04 15:50:22
      -

*Thread Reply:* my 2 cents on this is that in the Salesforce example, the system is too complex to capture as a single dataset, and so maybe different objects within a salesforce account (org/account/opportunity/etc…) could be treated as individual datasets. But as @Maciej Obuchowski pointed out, this is quite a large topic 🙂

      +

*Thread Reply:* my 2 cents on this is that in the Salesforce example, the system is too complex to capture as a single dataset, and so maybe different objects within a salesforce account (org/account/opportunity/etc…) could be treated as individual datasets. But as @Maciej Obuchowski pointed out, this is quite a large topic 🙂

      @@ -68345,7 +68351,7 @@

      SundayFunday

      2022-08-08 13:46:31
      -

      *Thread Reply:* I guess it depends on whether you actually care about the table/column level lineage for an operation like “copy salesforce to snowflake”.

      +

      *Thread Reply:* I guess it depends on whether you actually care about the table/column level lineage for an operation like “copy salesforce to snowflake”.

      I can see it being a nuisance having all of that on a lineage graph. OTOH, I can see it being useful to know that a datum can be traced back to a specific endpoint at SFDC.

      @@ -68373,7 +68379,7 @@

      SundayFunday

      2022-08-08 13:46:55
      -

      *Thread Reply:* this is a design decision, IMO.

      +

      *Thread Reply:* this is a design decision, IMO.

      @@ -68567,7 +68573,7 @@

      SundayFunday

      2022-08-10 22:34:29
      -

      *Thread Reply:* I am so sad I'm going to miss this month's meeting 😰 Looking forward to the recording!

      +

      *Thread Reply:* I am so sad I'm going to miss this month's meeting 😰 Looking forward to the recording!

      @@ -68593,7 +68599,7 @@

      SundayFunday

      2022-08-12 06:19:58
      -

      *Thread Reply:* We missed you too @Will Johnson 😉

      +

      *Thread Reply:* We missed you too @Will Johnson 😉

      @@ -68688,7 +68694,7 @@

      SundayFunday

      2022-08-12 06:09:57
      -

*Thread Reply:* From the first look, you're missing the outputs field in your event - this might break something

      +

*Thread Reply:* From the first look, you're missing the outputs field in your event - this might break something

      @@ -68714,7 +68720,7 @@

      SundayFunday

      2022-08-12 06:10:20
      -

      *Thread Reply:* If not, then Marquez logs might help to see something

      +

      *Thread Reply:* If not, then Marquez logs might help to see something

      @@ -68740,7 +68746,7 @@

      SundayFunday

      2022-08-12 13:12:56
      -

*Thread Reply:* Does the START event need to have an output?

      +

*Thread Reply:* Does the START event need to have an output?

      @@ -68766,7 +68772,7 @@

      SundayFunday

      2022-08-12 13:19:24
      -

      *Thread Reply:* It can have empty output 🙂

      +

      *Thread Reply:* It can have empty output 🙂

      @@ -68792,7 +68798,7 @@

      SundayFunday

      2022-08-12 13:32:43
      -

      *Thread Reply:* well, in your case you need to send COMPLETE event

      +

      *Thread Reply:* well, in your case you need to send COMPLETE event

      @@ -68818,7 +68824,7 @@

      SundayFunday

      2022-08-12 13:33:44
      -

*Thread Reply:* Internally, Marquez does not create a dataset version until the COMPLETE event. It makes sense when your semantics are transactional - you can still read from the previous dataset version until it's finished writing.

      +

*Thread Reply:* Internally, Marquez does not create a dataset version until the COMPLETE event. It makes sense when your semantics are transactional - you can still read from the previous dataset version until it's finished writing.

      @@ -68844,7 +68850,7 @@

      SundayFunday

      2022-08-12 13:34:06
      -

      *Thread Reply:* After I send COMPLETE event with the same information I can see the dataset.

      +

      *Thread Reply:* After I send COMPLETE event with the same information I can see the dataset.

      @@ -68879,7 +68885,7 @@

      SundayFunday

      2022-08-12 13:56:37
      -

*Thread Reply:* Thanks for the explanation @Maciej Obuchowski. So, if I understand this correctly, I won't see the my-test-input dataset till I have the COMPLETE event with input and output?

      +

*Thread Reply:* Thanks for the explanation @Maciej Obuchowski. So, if I understand this correctly, I won't see the my-test-input dataset till I have the COMPLETE event with input and output?

      @@ -68905,7 +68911,7 @@

      SundayFunday

      2022-08-12 14:34:51
      -

      *Thread Reply:* @Raj Mishra Yes and no 🙂

      +

      *Thread Reply:* @Raj Mishra Yes and no 🙂

Basically your COMPLETE event does not need to contain any input and output datasets at all - the OpenLineage model is cumulative, so it's enough to have datasets on either start or complete. That also means you can add different datasets at different moments of a run lifecycle - for example, you know inputs, but not outputs, so you emit inputs on START, but not COMPLETE.
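
A sketch of that lifecycle with the Python client - the dataset, job, and backend names below are invented for illustration:

from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # assumed backend
run = Run(runId=str(uuid4()))
job = Job(namespace="my-namespace", name="my-test-job")

def now() -> str:
    return datetime.now(timezone.utc).isoformat()

# START: inputs are known up front, outputs are not yet.
client.emit(RunEvent(RunState.START, now(), run, job, "my-producer",
                     inputs=[Dataset("my-namespace", "my-test-input")]))

# COMPLETE: same runId; the model is cumulative, so outputs can arrive only here.
client.emit(RunEvent(RunState.COMPLETE, now(), run, job, "my-producer",
                     outputs=[Dataset("my-namespace", "my-test-output")]))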

      @@ -68947,7 +68953,7 @@

      SundayFunday

      2022-08-12 14:47:56
      -

      *Thread Reply:* @Maciej Obuchowski Thank you so much! This is great explanation.

      +

      *Thread Reply:* @Maciej Obuchowski Thank you so much! This is great explanation.

      @@ -68999,7 +69005,7 @@

      SundayFunday

      2022-08-11 20:35:33
      -

*Thread Reply:* Would adding support for alias/grouping as a config on the OL client side be valuable to other users? i.e. the OL client could pass down an alias/grouping facet. Or should this be treated purely as a server-side feature?

      +

*Thread Reply:* Would adding support for alias/grouping as a config on the OL client side be valuable to other users? i.e. the OL client could pass down an alias/grouping facet. Or should this be treated purely as a server-side feature?

      @@ -69025,7 +69031,7 @@

      SundayFunday

      2022-08-12 06:11:21
      -

      *Thread Reply:* Agreed 🙂

      +

      *Thread Reply:* Agreed 🙂

      How do you produce this dataset? Spark integration? Are you using any system like Apache Iceberg/Delta Lake or just writing raw files?

      @@ -69053,7 +69059,7 @@

      SundayFunday

      2022-08-12 12:59:48
      -

      *Thread Reply:* these are raw files written from Spark or map reduce jobs. And downstream Spark jobs read these raw files to produce tables

      +

      *Thread Reply:* these are raw files written from Spark or map reduce jobs. And downstream Spark jobs read these raw files to produce tables

      @@ -69079,7 +69085,7 @@

      SundayFunday

      2022-08-12 13:27:34
      -

*Thread Reply:* written using Spark dataframe API, like

+

*Thread Reply:* written using Spark dataframe API, like
df.write.format("parquet").save("/tmp/spark_output/parquet")
or RDD?

      @@ -69107,7 +69113,7 @@

      SundayFunday

      2022-08-12 13:27:59
      -

      *Thread Reply:* the actual API used matters, because we're handling different cases separately

      +

      *Thread Reply:* the actual API used matters, because we're handling different cases separately

      @@ -69133,7 +69139,7 @@

      SundayFunday

      2022-08-12 13:29:48
      -

      *Thread Reply:* I see. Let me look that up to be absolutely sure

      +

      *Thread Reply:* I see. Let me look that up to be absolutely sure

      @@ -69159,7 +69165,7 @@

      SundayFunday

      2022-08-12 19:21:41
      -

*Thread Reply:* It is like this: df.write.format("parquet").save("/tmp/spark_output/parquet")

      +

*Thread Reply:* It is like this: df.write.format("parquet").save("/tmp/spark_output/parquet")

      @@ -69185,7 +69191,7 @@

      SundayFunday

      2022-08-15 12:43:45
      -

*Thread Reply:* @Maciej Obuchowski curious what you had in mind with respect to RDDs & Dataframes. Also, what if we cannot integrate OL with the frameworks that produce this dataset, but only those that consume from the already produced datasets? Is there a way we could still capture the dataset appropriately?

      +

*Thread Reply:* @Maciej Obuchowski curious what you had in mind with respect to RDDs & Dataframes. Also, what if we cannot integrate OL with the frameworks that produce this dataset, but only those that consume from the already produced datasets? Is there a way we could still capture the dataset appropriately?

      @@ -69211,7 +69217,7 @@

      SundayFunday

      2022-08-16 05:30:57
      -

*Thread Reply:* @Sharanya Santhanam the naming should be consistent between reading and writing, so it wouldn't change much if you can't integrate OL into writers. For the rest, can you create an issue on the OL GitHub so someone can pick it up? I'm on vacation now.

      +

*Thread Reply:* @Sharanya Santhanam the naming should be consistent between reading and writing, so it wouldn't change much if you can't integrate OL into writers. For the rest, can you create an issue on the OL GitHub so someone can pick it up? I'm on vacation now.

      @@ -69237,7 +69243,7 @@

      SundayFunday

      2022-08-16 15:08:41
      -

*Thread Reply:* Sounds good, Ty!

      +

*Thread Reply:* Sounds good, Ty!

      @@ -69316,7 +69322,7 @@

      SundayFunday

      2022-08-12 06:09:11
      -

      *Thread Reply:* yes

      +

      *Thread Reply:* yes

      @@ -69342,7 +69348,7 @@

      SundayFunday

      2022-08-12 06:09:32
      -

      *Thread Reply:* please do create 🙂

      +

      *Thread Reply:* please do create 🙂

      @@ -69398,7 +69404,7 @@

      SundayFunday

      2022-08-17 13:54:19
      -

*Thread Reply:* it's per project.

+

*Thread Reply:* it's per project.
java based: ./gradlew test
python based: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#development

      @@ -69510,7 +69516,7 @@

      SundayFunday

      2022-08-19 18:38:10
      -

*Thread Reply:* Hey Will! A bunch of folks are on vacation or out this week. Sorry for the delay; I am personally not sure, but if it's not too urgent you can have an answer when knowledgeable folks are back.

      +

*Thread Reply:* Hey Will! A bunch of folks are on vacation or out this week. Sorry for the delay; I am personally not sure, but if it's not too urgent you can have an answer when knowledgeable folks are back.

      @@ -69536,7 +69542,7 @@

      SundayFunday

      2022-08-19 20:21:18
      -

      *Thread Reply:* Hah! No worries, @Julien Le Dem! I can definitely wait for the lucky people who are enjoying the last few weeks of summer unlike the rest of us 😋

      +

      *Thread Reply:* Hah! No worries, @Julien Le Dem! I can definitely wait for the lucky people who are enjoying the last few weeks of summer unlike the rest of us 😋

      @@ -69562,7 +69568,7 @@

      SundayFunday

      2022-08-29 05:31:32
      -

      *Thread Reply:* @Paweł Leszczyński might want to look at that

      +

      *Thread Reply:* @Paweł Leszczyński might want to look at that

      @@ -69617,7 +69623,7 @@

      SundayFunday

      2022-08-19 12:30:08
      -

      *Thread Reply:* Hello Hanbing, the spark integration works for PySpark since pyspark is wrapped into regular spark operators.

      +

      *Thread Reply:* Hello Hanbing, the spark integration works for PySpark since pyspark is wrapped into regular spark operators.

      @@ -69643,7 +69649,7 @@

      SundayFunday

      2022-08-19 13:49:35
      -

*Thread Reply:* @Julien Le Dem Thanks a lot for your help. I searched around, but I couldn't find any doc introducing how PySpark is supported in OpenLineage.

+

*Thread Reply:* @Julien Le Dem Thanks a lot for your help. I searched around, but I couldn't find any doc introducing how PySpark is supported in OpenLineage.
My company wants to integrate with openlineage-spark. I am working on figuring out what info OpenLineage makes available for non-SQL jobs, and whether it at least has support for logging the logical plan?

      @@ -69670,7 +69676,7 @@

      SundayFunday

      2022-08-19 18:26:48
      -

      *Thread Reply:* Yes, it does send the logical plan as part of the event

      +

      *Thread Reply:* Yes, it does send the logical plan as part of the event

      @@ -69696,7 +69702,7 @@

      SundayFunday

      2022-08-19 18:27:32
      -

      *Thread Reply:* This configuration here should work as well for pyspark https://openlineage.io/docs/integrations/spark/

      +

      *Thread Reply:* This configuration here should work as well for pyspark https://openlineage.io/docs/integrations/spark/

      openlineage.io
      @@ -69743,7 +69749,7 @@

      SundayFunday

      2022-08-19 18:28:11
      -

      *Thread Reply:* --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener"

      +

      *Thread Reply:* --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener"

      @@ -69769,7 +69775,7 @@

      SundayFunday

      2022-08-19 18:28:26
      -

      *Thread Reply:* you need to add the jar, set the listener and pass your OL config

      +

      *Thread Reply:* you need to add the jar, set the listener and pass your OL config
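
A sketch of those three pieces from PySpark itself - the package version, backend URL, and namespace are placeholders, and the config keys follow the Spark integration docs linked above:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("ol_example")
    # Add the jar (version is a placeholder - use the current release).
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.13.1")
    # Set the listener.
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # Pass the OL config: where to send events and under which namespace.
    .config("spark.openlineage.host", "http://localhost:5000")
    .config("spark.openlineage.namespace", "my-namespace")
    .getOrCreate()
)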

      @@ -69795,7 +69801,7 @@

      SundayFunday

      2022-08-19 18:31:11
      -

      *Thread Reply:* Actually I'm demoing this at 27:10 right here 🙂 https://pretalx.com/bbuzz22/talk/FHEHAL/

      +

      *Thread Reply:* Actually I'm demoing this at 27:10 right here 🙂 https://pretalx.com/bbuzz22/talk/FHEHAL/

      pretalx
      @@ -69846,7 +69852,7 @@

      SundayFunday

      2022-08-19 18:32:11
      -

      *Thread Reply:* you can see the parameters I'm passing to the pyspark command line in the video

      +

      *Thread Reply:* you can see the parameters I'm passing to the pyspark command line in the video

      @@ -69872,7 +69878,7 @@

      SundayFunday

      2022-08-19 18:35:50
      -

      *Thread Reply:* @Julien Le Dem Thanks for the info, Let me take a look at the video now.

      +

      *Thread Reply:* @Julien Le Dem Thanks for the info, Let me take a look at the video now.

      @@ -69898,7 +69904,7 @@

      SundayFunday

      2022-08-19 18:40:10
      -

      *Thread Reply:* The full demo starts at 24:40. It shows lineage connected together in Marquez coming from 3 different sources: Airflow, Spark and a custom integration

      +

      *Thread Reply:* The full demo starts at 24:40. It shows lineage connected together in Marquez coming from 3 different sources: Airflow, Spark and a custom integration

      @@ -69979,7 +69985,7 @@

      SundayFunday

      2022-08-22 14:38:58
      -

      *Thread Reply:* @Michael Robinson can we start posting the “Unreleased” section in the changelog along with the release request? That way, we / the community will know what will be in the upcoming release

      +

      *Thread Reply:* @Michael Robinson can we start posting the “Unreleased” section in the changelog along with the release request? That way, we / the community will know what will be in the upcoming release

      @@ -70009,7 +70015,7 @@

      SundayFunday

      2022-08-22 15:00:37
      -

      *Thread Reply:* The release is approved. Thanks @Willy Lulciuc, @Minkyu Park, @Harel Shein

      +

      *Thread Reply:* The release is approved. Thanks @Willy Lulciuc, @Minkyu Park, @Harel Shein

      @@ -70092,7 +70098,7 @@

      SundayFunday

      2022-08-23 03:55:24
      -

*Thread Reply:* Cool! Are the new ownership facets populated by the Airflow integration?

      +

*Thread Reply:* Cool! Are the new ownership facets populated by the Airflow integration?

      @@ -70144,7 +70150,7 @@

      SundayFunday

      2022-08-25 18:15:59
      -

      *Thread Reply:* There is not that I know of besides the Amundsen integration example you pointed at. +

      *Thread Reply:* There is not that I know of besides the Amundsen integration example you pointed at. A basic idea to do such a thing would be to implement an OpenLineage endpoint (receive the lineage events through http posts) and convert them to a format the graph db understand. If others in the community have ideas, please chime in

      @@ -70171,7 +70177,7 @@

      SundayFunday

      2022-09-01 13:48:09
      -

*Thread Reply:* Understood, thanks a lot Julien. Makes sense.

      +

*Thread Reply:* Understood, thanks a lot Julien. Makes sense.

      @@ -70227,7 +70233,7 @@

      SundayFunday

      2022-08-25 17:32:44
      -

      *Thread Reply:* @Michael Robinson ^

      +

      *Thread Reply:* @Michael Robinson ^

      @@ -70253,7 +70259,7 @@

      SundayFunday

      2022-08-25 17:34:04
      -

      *Thread Reply:* Thanks, Harel. 3 +1s from committers is all we need to make this happen today.

      +

      *Thread Reply:* Thanks, Harel. 3 +1s from committers is all we need to make this happen today.

      @@ -70279,7 +70285,7 @@

      SundayFunday

      2022-08-25 17:52:40
      -

      *Thread Reply:* 🙏

      +

      *Thread Reply:* 🙏

      @@ -70305,7 +70311,7 @@

      SundayFunday

      2022-08-25 18:09:51
      -

      *Thread Reply:* Thanks, all. The release is authorized

      +

      *Thread Reply:* Thanks, all. The release is authorized

      @@ -70335,7 +70341,7 @@

      SundayFunday

      2022-08-25 18:16:44
      -

      *Thread Reply:* can you also state the main purpose for this release?

      +

      *Thread Reply:* can you also state the main purpose for this release?

      @@ -70361,7 +70367,7 @@

      SundayFunday

      2022-08-25 18:25:49
      -

      *Thread Reply:* I believe (correct me if wrong, @Harel Shein) that this is to make available a fix of a bug in the compare functionality

      +

      *Thread Reply:* I believe (correct me if wrong, @Harel Shein) that this is to make available a fix of a bug in the compare functionality

      @@ -70387,7 +70393,7 @@

      SundayFunday

      2022-08-25 18:27:53
      -

*Thread Reply:* ParentRunFacet from the airflow integration is not compliant with the OpenLineage spec, and this release includes the fix for that so that Marquez can handle parent run/job information.

      +

*Thread Reply:* ParentRunFacet from the airflow integration is not compliant with the OpenLineage spec, and this release includes the fix for that so that Marquez can handle parent run/job information.

      @@ -70486,7 +70492,7 @@

      SundayFunday

      2022-08-26 19:26:22
      -

*Thread Reply:* What version of Spark are you using? It should be enabled by default for Spark 3

+

*Thread Reply:* What version of Spark are you using? It should be enabled by default for Spark 3:
https://openlineage.io/docs/integrations/spark/spark_column_lineage

      @@ -70534,7 +70540,7 @@

      SundayFunday

      2022-08-26 20:21:12
      -

*Thread Reply:* Thanks. Good to hear that. I am using 0.9.+. I will try again

      +

*Thread Reply:* Thanks. Good to hear that. I am using 0.9.+. I will try again

      @@ -70560,7 +70566,7 @@

      SundayFunday

      2022-08-29 13:14:01
      -

*Thread Reply:* I tested 0.9.+ and 0.12.+ with Spark 3.0 and 3.2. There is still no columnLineage dataset facet. This is strange. I saw the column lineage design proposal 148. It should be supported from 0.9.+. Am I missing something?

      +

*Thread Reply:* I tested 0.9.+ and 0.12.+ with Spark 3.0 and 3.2. There is still no columnLineage dataset facet. This is strange. I saw the column lineage design proposal 148. It should be supported from 0.9.+. Am I missing something?
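
For reference, a sketch of where to look in an emitted event - the facet (when produced) sits on each output dataset under facets.columnLineage, following the column lineage proposal; the printing here is illustrative:

def print_column_lineage(event: dict) -> None:
    """Walk a parsed OpenLineage run event and print the columnLineage facet, if any."""
    for output in event.get("outputs", []):
        col_lineage = output.get("facets", {}).get("columnLineage")
        if col_lineage is None:
            continue  # facet missing for this dataset
        for field, info in col_lineage["fields"].items():
            sources = [f"{i['namespace']}:{i['name']}.{i['field']}"
                       for i in info["inputFields"]]
            print(f"{output['name']}.{field} <- {sources}")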

      @@ -70586,7 +70592,7 @@

      SundayFunday

      2022-08-29 13:14:41
      -

      *Thread Reply:* @Harel Shein

      +

      *Thread Reply:* @Harel Shein

      @@ -70612,7 +70618,7 @@

      SundayFunday

      2022-08-30 00:56:18
      -

      *Thread Reply:* @Jason it depends on the data source. What sort of data are you trying to read? Is it in a hive metastore? Is it on an S3 bucket? Is it a delta file format?

      +

      *Thread Reply:* @Jason it depends on the data source. What sort of data are you trying to read? Is it in a hive metastore? Is it on an S3 bucket? Is it a delta file format?

      @@ -70638,7 +70644,7 @@

      SundayFunday

      2022-08-30 13:51:03
      -

*Thread Reply:* I tried reading a hive metastore on s3 and a csv file locally. All are missing the columnLineage

      +

*Thread Reply:* I tried reading a hive metastore on s3 and a csv file locally. All are missing the columnLineage

      @@ -70664,7 +70670,7 @@

      SundayFunday

      2022-08-31 00:33:17
      -

      *Thread Reply:* @Jason - Sorry, you'll have to translate a bit for me. Can you share a snippet of code you're using to do the read and write? Is it a special package you need to install or is it just using the hadoop standard for S3? https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html

      +

      *Thread Reply:* @Jason - Sorry, you'll have to translate a bit for me. Can you share a snippet of code you're using to do the read and write? Is it a special package you need to install or is it just using the hadoop standard for S3? https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html

      @@ -70690,7 +70696,7 @@

      SundayFunday

      2022-08-31 20:00:47
      -

*Thread Reply:* spark.read \

+

*Thread Reply:* spark.read \
  .option("header", "true") \
  .option("inferschema", "true") \
  .csv("data/input/batch/wikidata.csv") \

@@ -70722,7 +70728,7 @@

      SundayFunday

      2022-08-31 20:01:21
      -

      *Thread Reply:* This is simple code run on my local for testing

      +

      *Thread Reply:* This is simple code run on my local for testing

      @@ -70748,7 +70754,7 @@

      SundayFunday

      2022-08-31 21:41:31
      -

      *Thread Reply:* Which version of OpenLineage are you running? You might look at the code on the main branch. This looks like a HadoopFSRelation which I implemented for column lineage but the latest release (0.13.1) does not include it yet.

      +

      *Thread Reply:* Which version of OpenLineage are you running? You might look at the code on the main branch. This looks like a HadoopFSRelation which I implemented for column lineage but the latest release (0.13.1) does not include it yet.

      @@ -70774,7 +70780,7 @@

      SundayFunday

      2022-08-31 21:42:05
      -

      *Thread Reply:* Specifically this commit is what implemented it. +

      *Thread Reply:* Specifically this commit is what implemented it. https://github.com/OpenLineage/OpenLineage/commit/ce30178cc81b63b9930be11ac7500ed34808edd3

      @@ -70801,7 +70807,7 @@

      SundayFunday

      2022-08-31 22:02:16
      -

      *Thread Reply:* I see. I use 0.13.0

      +

      *Thread Reply:* I see. I use 0.13.0

      @@ -70827,7 +70833,7 @@

      SundayFunday

      2022-09-01 12:04:41
      -

      *Thread Reply:* @Jason we have our monthly release coming up now, so it should be included in 0.14.0 when released today/tomorrow

      +

      *Thread Reply:* @Jason we have our monthly release coming up now, so it should be included in 0.14.0 when released today/tomorrow

      @@ -70853,7 +70859,7 @@

      SundayFunday

      2022-09-01 12:52:52
      -

      *Thread Reply:* Great. Thanks Harel.

      +

      *Thread Reply:* Great. Thanks Harel.

      @@ -70957,7 +70963,7 @@

      SundayFunday

      2022-08-29 05:29:52
      -

      *Thread Reply:* Reading dataset - which input dataset implies - does not mutate the dataset 🙂

      +

      *Thread Reply:* Reading dataset - which input dataset implies - does not mutate the dataset 🙂

      @@ -70983,7 +70989,7 @@

      SundayFunday

      2022-08-29 05:30:14
      -

      *Thread Reply:* If you change the dataset, it would be represented as some other job with this datasets in the outputs list

      +

      *Thread Reply:* If you change the dataset, it would be represented as some other job with this datasets in the outputs list

      @@ -71009,7 +71015,7 @@

      SundayFunday

      2022-08-29 12:42:55
      -

*Thread Reply:* So, changing the input dataset will always create new output data versions? Sorry, I have trouble understanding this, but if the input is changing, shouldn't the input dataset have different versions?

      +

*Thread Reply:* So, changing the input dataset will always create new output data versions? Sorry, I have trouble understanding this, but if the input is changing, shouldn't the input dataset have different versions?

      @@ -71035,7 +71041,7 @@

      SundayFunday

      2022-09-01 08:35:42
      -

      *Thread Reply:* @Raj Mishra if input is changing, there should be something else in your data infrastructure that changes this dataset - and it should emit this dataset as output

      +

      *Thread Reply:* @Raj Mishra if input is changing, there should be something else in your data infrastructure that changes this dataset - and it should emit this dataset as output

      @@ -71087,7 +71093,7 @@

      SundayFunday

      2022-08-29 12:35:32
      -

      *Thread Reply:* I think you want to emit raw events using python or java client: https://openlineage.io/docs/client/python

      +

      *Thread Reply:* I think you want to emit raw events using python or java client: https://openlineage.io/docs/client/python

      openlineage.io
      @@ -71134,7 +71140,7 @@

      SundayFunday

      2022-08-29 12:35:46
      -

      *Thread Reply:* (docs in progress 😉)

      +

      *Thread Reply:* (docs in progress 😉)

      @@ -71160,7 +71166,7 @@

      SundayFunday

      2022-08-30 02:07:02
      -

*Thread Reply:* can you give a hint what I should look for to model a dataset on top of another dataset? potentially also map columns?

      +

*Thread Reply:* can you give a hint what I should look for to model a dataset on top of another dataset? potentially also map columns?

      @@ -71186,7 +71192,7 @@

      SundayFunday

      2022-08-30 02:12:50
      -

*Thread Reply:* I can only see that I can have a dataset as input to a job run, and not for another dataset

      +

*Thread Reply:* I can only see that I can have a dataset as input to a job run, and not for another dataset

      @@ -71212,7 +71218,7 @@

      SundayFunday

      2022-09-01 08:34:35
      -

      *Thread Reply:* Not sure I understand - jobs process input datasets into output datasets. There is always something that can be modeled into a job that consumes input and produces output.

      +

      *Thread Reply:* Not sure I understand - jobs process input datasets into output datasets. There is always something that can be modeled into a job that consumes input and produces output.

      @@ -71238,7 +71244,7 @@

      SundayFunday

      2022-09-01 10:30:51
      -

*Thread Reply:* so OpenLineage forces me to put a job between datasets? does not fit our use case

      +

*Thread Reply:* so OpenLineage forces me to put a job between datasets? does not fit our use case

      @@ -71264,7 +71270,7 @@

      SundayFunday

      2022-09-01 10:31:09
      -

*Thread Reply:* unless we can somehow easily hide the process that does that on the graph.

      +

*Thread Reply:* unless we can somehow easily hide the process that does that on the graph.

      @@ -71316,7 +71322,7 @@

      SundayFunday

      2022-08-30 04:44:06
      -

      *Thread Reply:* I don't think it will work for Spark 2.X.

      +

      *Thread Reply:* I don't think it will work for Spark 2.X.

      @@ -71342,7 +71348,7 @@

      SundayFunday

      2022-08-30 13:42:20
      -

*Thread Reply:* Is there a plan to support Spark 2.x?

      +

*Thread Reply:* Is there a plan to support Spark 2.x?

      @@ -71368,7 +71374,7 @@

      SundayFunday

      2022-08-30 14:00:38
      -

      *Thread Reply:* Nope - on the other hand we plan to drop any support for it, as it's unmaintained for quite a bit and vendors are dropping support for it too - afaik Databricks in April 2023.

      +

      *Thread Reply:* Nope - on the other hand we plan to drop any support for it, as it's unmaintained for quite a bit and vendors are dropping support for it too - afaik Databricks in April 2023.

      @@ -71394,7 +71400,7 @@

      SundayFunday

      2022-08-30 17:19:43
      -

*Thread Reply:* I see. Thanks. Amazon EMR still supports Spark 2.x

      +

*Thread Reply:* I see. Thanks. Amazon EMR still supports Spark 2.x

      @@ -71493,7 +71499,7 @@

      SundayFunday

      2022-09-01 05:56:13
      -

*Thread Reply:* At least for Iceberg I've done it, since I want to emit DatasetVersionDatasetFacet for the input dataset only at START - and after I finish writing, the dataset might have a different version than before writing.

      +

*Thread Reply:* At least for Iceberg I've done it, since I want to emit DatasetVersionDatasetFacet for the input dataset only at START - and after I finish writing, the dataset might have a different version than before writing.

      @@ -71519,7 +71525,7 @@

      SundayFunday

      2022-09-01 05:58:59
      -

      *Thread Reply:* Same should be for output AFAIK - output version should be emitted only on COMPLETE, since the version changes after I finish writing.

      +

      *Thread Reply:* Same should be for output AFAIK - output version should be emitted only on COMPLETE, since the version changes after I finish writing.

      @@ -71545,7 +71551,7 @@

      SundayFunday

      2022-09-01 09:52:30
      -

      *Thread Reply:* Ah! Okay, so this still requires us to truly combine START and COMPLETE to get a TOTAL picture of the entire run. Is that fair?

      +

      *Thread Reply:* Ah! Okay, so this still requires us to truly combine START and COMPLETE to get a TOTAL picture of the entire run. Is that fair?

      @@ -71571,7 +71577,7 @@

      SundayFunday

      2022-09-01 10:30:41
      -

      *Thread Reply:* Yes

      +

      *Thread Reply:* Yes

      @@ -71601,7 +71607,7 @@

      SundayFunday

      2022-09-01 10:31:21
      -

      *Thread Reply:* As usual, thank you Maciej for the responses and insights!

      +

      *Thread Reply:* As usual, thank you Maciej for the responses and insights!

      @@ -71657,7 +71663,7 @@

      SundayFunday

      2022-09-07 14:13:34
      -

*Thread Reply:* Can anyone help with it? Did I miss something?

      +

*Thread Reply:* Can anyone help with it? Did I miss something?

      @@ -71748,7 +71754,7 @@

      SundayFunday

      2022-09-01 13:18:59
      -

      *Thread Reply:* Thanks. The release is authorized. It will be initiated within 2 business days.

      +

      *Thread Reply:* Thanks. The release is authorized. It will be initiated within 2 business days.

      @@ -71804,7 +71810,7 @@

      SundayFunday

      2022-09-08 10:31:44
      -

      *Thread Reply:* Which integration are you looking to implement?

      +

      *Thread Reply:* Which integration are you looking to implement?

      And what environment are you looking to deploy it on? The Cloud? On-Prem?

      @@ -71832,7 +71838,7 @@

      SundayFunday

      2022-09-08 10:40:11
      -

      *Thread Reply:* We are planning to deploy on premise with Kerberos as authentication for postgres

      +

      *Thread Reply:* We are planning to deploy on premise with Kerberos as authentication for postgres

      @@ -71858,7 +71864,7 @@

      SundayFunday

      2022-09-08 11:27:06
      -

      *Thread Reply:* Ah! Are you planning on running Marquez as well and that is your main concern or are you planning on building your own store of OpenLineage Events and using the SQL integration to generate those events?

      +

      *Thread Reply:* Ah! Are you planning on running Marquez as well and that is your main concern or are you planning on building your own store of OpenLineage Events and using the SQL integration to generate those events?

      https://github.com/OpenLineage/OpenLineage/tree/main/integration

      @@ -71886,7 +71892,7 @@

      SundayFunday

      2022-09-08 11:33:44
      -

      *Thread Reply:* I am looking to deploy Marquez on-prem with onprem postgres as back-end with Kerberos authentication.

      +

      *Thread Reply:* I am looking to deploy Marquez on-prem with onprem postgres as back-end with Kerberos authentication.

      @@ -71912,7 +71918,7 @@

      SundayFunday

      2022-09-08 11:34:32
      -

*Thread Reply:* Is this the right forum for Marquez as well, or is there a different Slack channel available for Marquez?

      +

*Thread Reply:* Is this the right forum for Marquez as well, or is there a different Slack channel available for Marquez?

      @@ -71938,7 +71944,7 @@

      SundayFunday

      2022-09-08 11:46:35
      -

      *Thread Reply:* https://bit.ly/MarquezSlack

      +

      *Thread Reply:* https://bit.ly/MarquezSlack

      @@ -71964,7 +71970,7 @@

      SundayFunday

      2022-09-08 11:47:14
      -

      *Thread Reply:* There is another slack channel just for Marquez! That might be a better spot with more dedicated Marquez developers.

      +

      *Thread Reply:* There is another slack channel just for Marquez! That might be a better spot with more dedicated Marquez developers.

      @@ -72039,7 +72045,7 @@

      SundayFunday

      2022-09-06 15:54:30
      -

      *Thread Reply:* Thanks for breaking up the changes in the release! Love the new format 💯

      +

      *Thread Reply:* Thanks for breaking up the changes in the release! Love the new format 💯

      @@ -72099,7 +72105,7 @@

      SundayFunday

      2022-09-07 10:00:11
      -

      *Thread Reply:* Is PR #1069 all that’s going in 0.14.1 ?

      +

      *Thread Reply:* Is PR #1069 all that’s going in 0.14.1 ?

      @@ -72125,7 +72131,7 @@

      SundayFunday

      2022-09-07 10:27:39
      -

      *Thread Reply:* There’s also 1058. 1069 is urgently needed. We can technically wait…

      +

      *Thread Reply:* There’s also 1058. 1069 is urgently needed. We can technically wait…

      @@ -72155,7 +72161,7 @@

      SundayFunday

      2022-09-07 10:30:31
      -

      *Thread Reply:* (edited prior message because I’m not sure how accurately I was describing the issue)

      +

      *Thread Reply:* (edited prior message because I’m not sure how accurately I was describing the issue)

      @@ -72181,7 +72187,7 @@

      SundayFunday

      2022-09-07 10:39:32
      -

      *Thread Reply:* Thanks for clarifying!

      +

      *Thread Reply:* Thanks for clarifying!

      @@ -72207,7 +72213,7 @@

      SundayFunday

      2022-09-07 10:50:29
      -

      *Thread Reply:* Thanks, all. The release is authorized.

      +

      *Thread Reply:* Thanks, all. The release is authorized.

      @@ -72237,7 +72243,7 @@

      SundayFunday

      2022-09-07 11:04:39
      -

      *Thread Reply:* 1058 also fixes some bugs

      +

      *Thread Reply:* 1058 also fixes some bugs

      @@ -72289,7 +72295,7 @@

      SundayFunday

      2022-09-08 04:41:07
      -

      *Thread Reply:* Usually there is something creating the view, for example dbt materialization: https://docs.getdbt.com/docs/building-a-dbt-project/building-models/materializations

      +

      *Thread Reply:* Usually there is something creating the view, for example dbt materialization: https://docs.getdbt.com/docs/building-a-dbt-project/building-models/materializations

      Besides that, there is this proposal that did not get enough love yet https://github.com/OpenLineage/OpenLineage/issues/323

      @@ -72317,7 +72323,7 @@

      SundayFunday

      2022-09-08 04:53:23
      -

*Thread Reply:* but we are not working with dbt. We try to model the lineage of our internal view/tables hierarchy, which is related to a proprietary application of ours. So we like that OpenLineage lets me explicitly model stuff and not only via scanning some DW. But in that case we don't want a job in between.

      +

*Thread Reply:* but we are not working with dbt. We try to model the lineage of our internal view/tables hierarchy, which is related to a proprietary application of ours. So we like that OpenLineage lets me explicitly model stuff and not only via scanning some DW. But in that case we don't want a job in between.

      @@ -72343,7 +72349,7 @@

      SundayFunday

      2022-09-08 04:58:47
      -

      *Thread Reply:* this PR does not seem to support lineage between datasets

      +

      *Thread Reply:* this PR does not seem to support lineage between datasets

      @@ -72369,7 +72375,7 @@

      SundayFunday

      2022-09-08 12:49:48
      -

      *Thread Reply:* This is something core to the OpenLineage design - the lineage relationships are defined as dataset-job-dataset, not dataset-dataset.

      +

      *Thread Reply:* This is something core to the OpenLineage design - the lineage relationships are defined as dataset-job-dataset, not dataset-dataset.

      In OpenLineage, something observes the lineage relationship being created.

      @@ -72397,7 +72403,7 @@

      SundayFunday

      2022-09-08 12:50:13
      -

      *Thread Reply:*

      +

      *Thread Reply:*

      @@ -72436,7 +72442,7 @@

      SundayFunday

      2022-09-08 12:51:15
      -

      *Thread Reply:* It’s a bit different from some other lineage approaches, but OL is intended to be a push model. A job is observed as it runs, metadata is pushed to the backend.

      +

      *Thread Reply:* It’s a bit different from some other lineage approaches, but OL is intended to be a push model. A job is observed as it runs, metadata is pushed to the backend.

      @@ -72462,7 +72468,7 @@

      SundayFunday

      2022-09-08 12:54:27
      -

      *Thread Reply:* so in this case, according to openlineage 🙂, the job would be whatever runs within the pipeline that creates the view. very operational point of view.

      +

      *Thread Reply:* so in this case, according to openlineage 🙂, the job would be whatever runs within the pipeline that creates the view. very operational point of view.

      @@ -72488,7 +72494,7 @@

      SundayFunday

      2022-09-11 12:27:42
      -

*Thread Reply:* but what about the view definition use case? You have lineage of columns in view/base table relationships

      +

*Thread Reply:* but what about the view definition use case? You have lineage of columns in view/base table relationships

      @@ -72514,7 +72520,7 @@

      SundayFunday

      2022-09-11 12:28:05
      -

      *Thread Reply:* how would you model that in OpenLineage? would you create a dummy job ?

      +

      *Thread Reply:* how would you model that in OpenLineage? would you create a dummy job ?

      @@ -72540,7 +72546,7 @@

      SundayFunday

      2022-09-11 12:31:57
      -

*Thread Reply:* would you say that because this is my use case, I might be better off choosing some other lineage tool?

      +

*Thread Reply:* would you say that because this is my use case, I might be better off choosing some other lineage tool?

      @@ -72566,7 +72572,7 @@

      SundayFunday

      2022-09-11 12:33:04
      -

*Thread Reply:* for context: I am not talking about view and table definitions in some warehouse, e.g. SF, but an internal data processing mechanism with proprietary view/table definitions (in Flink SQL), and we want to push this metadata for visibility

      +

*Thread Reply:* for context: I am not talking about view and table definitions in some warehouse, e.g. SF, but an internal data processing mechanism with proprietary view/table definitions (in Flink SQL), and we want to push this metadata for visibility

      @@ -72592,7 +72598,7 @@

      SundayFunday

      2022-09-12 17:20:13
      -

      *Thread Reply:* Ah, gotcha. Yeah, I would say it’s probably best to create a job in this case. You can send the view definition using a sourcecodefacet, so it will be collected as well. You’d want to send START and STOP events for it.

      +

      *Thread Reply:* Ah, gotcha. Yeah, I would say it’s probably best to create a job in this case. You can send the view definition using a sourcecodefacet, so it will be collected as well. You’d want to send START and STOP events for it.
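
A sketch of that shape with the Python client, using SqlJobFacet to carry the view definition since the views here are in Flink SQL - all names and the backend URL are invented (the spec calls the closing event COMPLETE):

from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.facet import SqlJobFacet
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # assumed backend
run = Run(runId=str(uuid4()))
view_sql = "CREATE VIEW my_view AS SELECT id, name FROM base_table"
job = Job(namespace="my-app", name="create_my_view",
          facets={"sql": SqlJobFacet(query=view_sql)})

def now() -> str:
    return datetime.now(timezone.utc).isoformat()

# One START/COMPLETE pair "observes" the view being (re)defined.
client.emit(RunEvent(RunState.START, now(), run, job, "my-producer",
                     inputs=[Dataset("my-app", "base_table")]))
client.emit(RunEvent(RunState.COMPLETE, now(), run, job, "my-producer",
                     outputs=[Dataset("my-app", "my_view")]))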

      @@ -72618,7 +72624,7 @@

      SundayFunday

      2022-09-12 17:22:03
      -

      *Thread Reply:* regarding the PR linked before, you are right - I wonder if someday the spec should have a way to express “the system was made aware that these datasets are related, but did not observe the relationship being created so it can’t tell you i.e. how long it took or whether it changed over time”

      +

      *Thread Reply:* regarding the PR linked before, you are right - I wonder if someday the spec should have a way to express “the system was made aware that these datasets are related, but did not observe the relationship being created so it can’t tell you i.e. how long it took or whether it changed over time”

      @@ -72713,7 +72719,7 @@

      SundayFunday

      2022-09-09 14:01:13
      -

      *Thread Reply:* Hey, @data_fool! Not in the near term. but of course we’d love to see this happen. We’re open to having an Airbyte integration driven by the community. Want to open an issue to start the discussion?

      +

      *Thread Reply:* Hey, @data_fool! Not in the near term. but of course we’d love to see this happen. We’re open to having an Airbyte integration driven by the community. Want to open an issue to start the discussion?

      @@ -72739,7 +72745,7 @@

      SundayFunday

      2022-09-09 15:36:20
      -

      *Thread Reply:* hey @Willy Lulciuc, Yep, will open an issue. Thanks!

      +

      *Thread Reply:* hey @Willy Lulciuc, Yep, will open an issue. Thanks!

      @@ -72795,7 +72801,7 @@

      SundayFunday

      2022-09-12 19:26:25
      -

      *Thread Reply:* yes!

      +

      *Thread Reply:* yes!

      @@ -72821,7 +72827,7 @@

      SundayFunday

      2022-09-26 10:31:56
      -

*Thread Reply:* Any example or ticket on how to do lineage across namespaces?

      +

*Thread Reply:* Any example or ticket on how to do lineage across namespaces?

      @@ -72873,7 +72879,7 @@

      SundayFunday

      2022-09-12 04:56:13
      -

      *Thread Reply:* Yes https://openlineage.io/blog/column-lineage/

      +

      *Thread Reply:* Yes https://openlineage.io/blog/column-lineage/

      openlineage.io
      @@ -72920,7 +72926,7 @@

      SundayFunday

      2022-09-22 02:18:45
      -

*Thread Reply:* • More details on Spark & Column level lineage integration: https://openlineage.io/docs/integrations/spark/spark_column_lineage

+

*Thread Reply:* • More details on Spark & Column level lineage integration: https://openlineage.io/docs/integrations/spark/spark_column_lineage
• Proposal on how to implement column level lineage in Marquez (implementation is currently work in progress): https://github.com/MarquezProject/marquez/blob/main/proposals/2045-column-lineage-endpoint.md
@Iftach Schonbaum let us know if you find the information useful.

      SundayFunday
      2022-09-12 15:30:08
      -

      *Thread Reply:* or is it automatic for anything that exists in extractors/?

      +

      *Thread Reply:* or is it automatic for anything that exists in extractors/?

      @@ -73195,7 +73201,7 @@

      SundayFunday

      2022-09-12 15:30:16
      -

      *Thread Reply:* Yes

      +

      *Thread Reply:* Yes

      @@ -73229,7 +73235,7 @@

      SundayFunday

      2022-09-12 15:31:12
      -

      *Thread Reply:* so anything i add to extractors directory with the same name as the operator will automatically extract the metadata from the operator is that correct?

      +

      *Thread Reply:* so anything i add to extractors directory with the same name as the operator will automatically extract the metadata from the operator is that correct?

      @@ -73255,7 +73261,7 @@

      SundayFunday

      2022-09-12 15:31:31
      -

      *Thread Reply:* Well, not entirely

      +

      *Thread Reply:* Well, not entirely

      @@ -73281,7 +73287,7 @@

      SundayFunday

      2022-09-12 15:31:47
      -

      *Thread Reply:* please take a look at the source code of one of the extractors

      +

      *Thread Reply:* please take a look at the source code of one of the extractors

      @@ -73311,7 +73317,7 @@

      SundayFunday

      2022-09-12 15:32:13
      -

      *Thread Reply:* also, there are docs available at openlineage.io/docs

      +

      *Thread Reply:* also, there are docs available at openlineage.io/docs

      @@ -73341,7 +73347,7 @@

      SundayFunday

      2022-09-12 15:33:45
      -

      *Thread Reply:* ok, i'll take a look. i think one thing that would be helpful is having a custom setup without marquez. a lot of the docs or videos i found were integrated with marquez

      +

      *Thread Reply:* ok, i'll take a look. i think one thing that would be helpful is having a custom setup without marquez. a lot of the docs or videos i found were integrated with marquez

      @@ -73367,7 +73373,7 @@

      SundayFunday

      2022-09-12 15:34:29
      -

*Thread Reply:* I see. Marquez is an OpenLineage backend that stores the lineage data, so many examples do need it.

+

*Thread Reply:* I see. Marquez is an OpenLineage backend that stores the lineage data, so many examples do need it.

      @@ -73393,7 +73399,7 @@

      SundayFunday

      2022-09-12 15:34:47
      -

      *Thread Reply:* If you do not want to run marquez but just test out the openlineage, you can also take a look at OpenLineage Proxy.

      +

      *Thread Reply:* If you do not want to run marquez but just test out the openlineage, you can also take a look at OpenLineage Proxy.

      @@ -73423,7 +73429,7 @@

      SundayFunday

      2022-09-12 15:35:14
      -

      *Thread Reply:* awesome thanks Howard! i'll take a look at these resources and come back around if i need to

      +

      *Thread Reply:* awesome thanks Howard! i'll take a look at these resources and come back around if i need to

      @@ -73453,7 +73459,7 @@

      SundayFunday

      2022-09-12 16:01:45
      -

      *Thread Reply:* http://openlineage.io/docs/integrations/airflow/extractor - this is the doc you might want to read

      +

      *Thread Reply:* http://openlineage.io/docs/integrations/airflow/extractor - this is the doc you might want to read
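To make that concrete, a custom extractor is roughly a class like the sketch below (names are placeholders; per the doc above, registration works by pointing the OPENLINEAGE_EXTRACTORS environment variable at the full import path):

from typing import List, Optional

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata

class MyOperatorExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        # the integration matches extractors to operators by class name
        return ["MyOperator"]

    def extract(self) -> Optional[TaskMetadata]:
        # build lineage metadata from the operator instance (self.operator)
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            inputs=[],   # OpenLineage Dataset objects the task reads
            outputs=[],  # OpenLineage Dataset objects the task writes
        )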

      @@ -73504,7 +73510,7 @@

      SundayFunday

      2022-09-12 17:08:49
      -

      *Thread Reply:* yeah, saw that doc earlier. thanks @Maciej Obuchowski appreciate it 🙏

      +

      *Thread Reply:* yeah, saw that doc earlier. thanks @Maciej Obuchowski appreciate it 🙏

      @@ -73562,7 +73568,7 @@

      SundayFunday

      2022-09-21 20:57:39
      -

      *Thread Reply:* In my head, I have:

      +

      *Thread Reply:* In my head, I have:

      Pyspark script -> Store metadata in S3 -> Marquez UI gets data from S3 and displays it

      @@ -73592,10 +73598,10 @@

      SundayFunday

      2022-09-22 02:14:50
      -

*Thread Reply:* It’s more like: you add the openlineage jar to your Spark job and configure what to do with the events. Popular options are:
• send to a REST endpoint (like Marquez)
• send as an event onto Kafka
• print it onto the console

+

*Thread Reply:* It’s more like: you add the openlineage jar to your Spark job and configure what to do with the events. Popular options are:
• send to a REST endpoint (like Marquez)
• send as an event onto Kafka
• print it onto the console
There is no S3 in between Spark & Marquez by default. Marquez serves both as an API where events are sent and a UI to investigate them.
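A minimal PySpark sketch of that setup (config key names have shifted between openlineage-spark releases, and the version, URL, and namespace below are placeholder assumptions):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("openlineage_example")
    # pull in the integration jar
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.14.1")
    # register the listener that turns Spark events into OpenLineage events
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # where to send the events, here an HTTP backend such as Marquez
    .config("spark.openlineage.host", "http://localhost:5000")
    .config("spark.openlineage.namespace", "my_namespace")
    .getOrCreate()
)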

      @@ -73623,7 +73629,7 @@

      SundayFunday

      2022-09-22 17:36:10
      -

      *Thread Reply:* Yeah S3 was just an example for a storage option.

      +

      *Thread Reply:* Yeah S3 was just an example for a storage option.

      I actually found the answer I was looking for, turns out I had to look at Marquez documentation: https://marquezproject.ai/resources/deployment/

      @@ -73688,7 +73694,7 @@

      SundayFunday

      2022-09-26 00:27:05
      -

      *Thread Reply:* The Spark execution model follows:

      +

      *Thread Reply:* The Spark execution model follows:

      1. Spark SQL Execution Start event
      2. Spark Job Start event
      3. Spark Job End event
4. Spark SQL Execution End event
As a result, OpenLineage tracks all of those executions and jobs. There is a proposed plan to distinguish between those events (e.g. you wouldn't get two starts but one Start and one Job Start or something like that).

@@ -73720,7 +73726,7 @@

        SundayFunday

      2022-09-26 15:16:26
      -

*Thread Reply:* Thanks @Will Johnson

+

*Thread Reply:* Thanks @Will Johnson
Can I get an example of how the proposed plan can be used to distinguish between Start and Job Start events? When I compare the 2 start events I got, only the event_time is different; all other information is the same.

      @@ -73748,7 +73754,7 @@

      SundayFunday

      2022-09-26 15:30:34
      -

*Thread Reply:* One follow-up question: if I process multiple queries in one command, for example (Drop + Create Table + Insert Overwrite), should I expect

+

*Thread Reply:* One follow-up question: if I process multiple queries in one command, for example (Drop + Create Table + Insert Overwrite), should I expect
(1). 1 Spark SQL execution start event
(2). 3 Spark job start events (each query has a job start event)
(3). 3 Spark job end events (each query has a job end event)

@@ -73778,7 +73784,7 @@

      SundayFunday

      2022-09-27 10:25:47
      -

      *Thread Reply:* Re: Distinguish between start and job start events. There was a proposal to differentiate the two (https://github.com/OpenLineage/OpenLineage/issues/636) but the current discussion is here: https://github.com/OpenLineage/OpenLineage/issues/599 As it currently stands, there is not a way to tell which one is which (I believe). The design of OpenLineage is such that you should consume ALL events under the same run id and job name / namespace.

      +

      *Thread Reply:* Re: Distinguish between start and job start events. There was a proposal to differentiate the two (https://github.com/OpenLineage/OpenLineage/issues/636) but the current discussion is here: https://github.com/OpenLineage/OpenLineage/issues/599 As it currently stands, there is not a way to tell which one is which (I believe). The design of OpenLineage is such that you should consume ALL events under the same run id and job name / namespace.

      Re: Multiple Queries in One Command: This is where Spark's execution model comes into play. I believe each one of those commands are executed sequentially and as a result, you'd actually get three execution start and three execution end. If you chose DROP + Create Table As Select, that would be only two commands and thus only two execution start events.

      SundayFunday
      2022-09-27 16:49:37
      -

*Thread Reply:* Thanks a lot for your help 🙏 @Will Johnson,

+

*Thread Reply:* Thanks a lot for your help 🙏 @Will Johnson,
For multiple queries in one command, I am still confused about why Drop + CreateTable and Drop + CreateTableAsSelect act differently.

When I test Drop + Create Table

@@ -73894,7 +73900,7 @@

      SundayFunday

      2022-09-27 22:03:38
      -

      *Thread Reply:* @Hanbing Wang are you running this on Databricks with a hive metastore that is defaulting to Delta by any chance?

      +

      *Thread Reply:* @Hanbing Wang are you running this on Databricks with a hive metastore that is defaulting to Delta by any chance?

      I THINK there are some gaps in OpenLineage because of the way Databricks Delta handles things and now there is Unity catalog that is causing some hiccups as well.

      @@ -73922,7 +73928,7 @@

      SundayFunday

      2022-09-28 09:18:48
      -

      *Thread Reply:* > For multiple queries in one command, I still have a confused place why Drop + CreateTable and Drop + CreateTableAsSelect act different. +

      *Thread Reply:* > For multiple queries in one command, I still have a confused place why Drop + CreateTable and Drop + CreateTableAsSelect act different. @Hanbing Wang That's basically why we capture all the events (SQL Execution, Job) instead of one of them. We're just inconsistently notified of them by Spark.

      Some computations emit SQL Execution events, some emit Job events, I think majority emits both. This also differs by spark version.

      @@ -73956,7 +73962,7 @@

      SundayFunday

      2022-09-28 15:44:45
      -

*Thread Reply:* @Will Johnson and @Maciej Obuchowski Thanks a lot for your help

+

*Thread Reply:* @Will Johnson and @Maciej Obuchowski Thanks a lot for your help
We are not running on Databricks. We implemented the OpenLineage Spark listener and customized the event transport, which emits the events to our own events pipeline, with a hive metastore. We are using Spark version 3.2.1

@@ -73986,7 +73992,7 @@

      SundayFunday

      2022-09-29 15:16:28
      -

      *Thread Reply:* Ooof! @Hanbing Wang then I'm not certain why you're not receiving the extra event 😞 You may need to run your spark cluster in debug mode to step through the Spark Listener.

      +

      *Thread Reply:* Ooof! @Hanbing Wang then I'm not certain why you're not receiving the extra event 😞 You may need to run your spark cluster in debug mode to step through the Spark Listener.

      @@ -74012,7 +74018,7 @@

      SundayFunday

      2022-09-29 15:17:08
      -

      *Thread Reply:* @Maciej Obuchowski - I'll add it to my list!

      +

      *Thread Reply:* @Maciej Obuchowski - I'll add it to my list!

      @@ -74038,7 +74044,7 @@

      SundayFunday

      2022-09-30 15:34:01
      -

      *Thread Reply:* @Will Johnson Thanks a lot for your help. Let us debug and continue investigating on this issue.

      +

      *Thread Reply:* @Will Johnson Thanks a lot for your help. Let us debug and continue investigating on this issue.

      @@ -74096,7 +74102,7 @@

      SundayFunday

      2022-09-28 08:34:20
      -

      *Thread Reply:* One of assumptions was to create a stateless integration model where multiple events can be sent for a single job run. This has several advantages like sending events for jobs which suddenly fail, sending events immediately, etc.

      +

      *Thread Reply:* One of assumptions was to create a stateless integration model where multiple events can be sent for a single job run. This has several advantages like sending events for jobs which suddenly fail, sending events immediately, etc.

      The events can be merged then at the backend side. The behavior, you describe, can be then achieved by using backends like Marquez and Marquez API to obtain combined data.
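As a toy illustration of that backend-side merging (not Marquez's actual implementation; the event dicts are assumed to follow the OpenLineage JSON spec):

from collections import defaultdict

def merge_runs(events):
    # fold every event that shares a runId into one summary record
    runs = defaultdict(lambda: {"states": [], "inputs": set(), "outputs": set()})
    for ev in events:
        run = runs[ev["run"]["runId"]]
        run["states"].append((ev["eventTime"], ev["eventType"]))
        run["inputs"].update(ds["name"] for ds in ev.get("inputs", []))
        run["outputs"].update(ds["name"] for ds in ev.get("outputs", []))
    return runs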

      @@ -74336,7 +74342,7 @@

      SundayFunday

      2022-09-29 14:30:37
      -

      *Thread Reply:* Hello @srutikanta hota, could you elaborate a bit on your use case? I'm not sure what you are trying to achieve. Possibly @Paweł Leszczyński will know.

      +

      *Thread Reply:* Hello @srutikanta hota, could you elaborate a bit on your use case? I'm not sure what you are trying to achieve. Possibly @Paweł Leszczyński will know.

      @@ -74362,7 +74368,7 @@

      SundayFunday

      2022-09-29 15:24:26
      -

      *Thread Reply:* @srutikanta hota - Not sure what MDC properties stands for but you might take inspiration from the DatabricksEnvironmentHandler Facet Builder: https://github.com/OpenLineage/OpenLineage/blob/65a5f021a1ba3035d5198e759587737a05[…]ark/agent/facets/builder/DatabricksEnvironmentFacetBuilder.java

      +

      *Thread Reply:* @srutikanta hota - Not sure what MDC properties stands for but you might take inspiration from the DatabricksEnvironmentHandler Facet Builder: https://github.com/OpenLineage/OpenLineage/blob/65a5f021a1ba3035d5198e759587737a05[…]ark/agent/facets/builder/DatabricksEnvironmentFacetBuilder.java

      You can create a facet that could extract out the properties that you might set from within the spark session.

      @@ -74454,7 +74460,7 @@

      SundayFunday

      2022-09-30 04:56:25
      -

*Thread Reply:* Many thanks for the details. My use case is simple: I'd like to default the Spark job group id as the OpenLineage parent runId if there is no parent run id set.

+

*Thread Reply:* Many thanks for the details. My use case is simple: I'd like to default the Spark job group id as the OpenLineage parent runId if there is no parent run id set.
sc.setJobGroup("myjobgroupid", "job description goes here")
This sets the value in Spark as setLocalProperty(SparkContext.SPARK_JOB_GROUP_ID, group_id)
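For reference, a PySpark sketch of that mechanic ("spark.jobGroup.id" is the string value of SparkContext.SPARK_JOB_GROUP_ID; the group id is a placeholder):

sc.setJobGroup("myjobgroupid", "job description goes here")
group_id = sc.getLocalProperty("spark.jobGroup.id")  # -> "myjobgroupid"
# group_id could then be used as the OpenLineage parent run id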

      @@ -74485,7 +74491,7 @@

      SundayFunday

      2022-09-30 05:01:08
      -

      *Thread Reply:* MDC is an ability to add extra key -> value pairs to a log entry, while not doing this within message body. So the question here is (I believe): how to add custom entries / custom facets to OpenLineage events?

      +

      *Thread Reply:* MDC is an ability to add extra key -> value pairs to a log entry, while not doing this within message body. So the question here is (I believe): how to add custom entries / custom facets to OpenLineage events?

@srutikanta hota What information would you like to include? There is a great chance we already have some fields for that. If not, it’s still worth putting it in the right place, like: is this info job specific, run specific, or related to some of the input / output datasets?

      @@ -74513,7 +74519,7 @@

      SundayFunday

      2022-09-30 05:04:34
      -

*Thread Reply:* @srutikanta hota sounds like you want to set up

+

*Thread Reply:* @srutikanta hota sounds like you want to set up
spark.openlineage.parentJobName
spark.openlineage.parentRunId
https://openlineage.io/docs/integrations/spark/
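A sketch of wiring those in when building the session (key names per the doc above; the values are placeholders):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.openlineage.parentJobName", "my_scheduler.my_job")
    .config("spark.openlineage.parentRunId", "<uuid-of-the-parent-run>")
    .getOrCreate()
)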

      @@ -74542,7 +74548,7 @@

      SundayFunday

      2022-09-30 05:15:18
      -

*Thread Reply:* @… we are having a long-running spark context (the context may run for a week) where we submit jobs. Setting the parentRunId at the beginning won't help. We are submitting the job with a spark job group id. I'd like to use the group id as parentRunId.

+

*Thread Reply:* @… we are having a long-running spark context (the context may run for a week) where we submit jobs. Setting the parentRunId at the beginning won't help. We are submitting the job with a spark job group id. I'd like to use the group id as parentRunId.

      https://spark.apache.org/docs/1.6.1/api/R/setJobGroup.html

      @@ -74612,7 +74618,7 @@

      SundayFunday

      2022-09-29 14:22:06
      -

      *Thread Reply:* Hi Trevor, thank you for reaching out. I’d be happy to discuss with you how we can help you support OpenLineage. Let me send you an email.

      +

      *Thread Reply:* Hi Trevor, thank you for reaching out. I’d be happy to discuss with you how we can help you support OpenLineage. Let me send you an email.

      @@ -74732,7 +74738,7 @@

      SundayFunday

      2022-09-30 14:38:34
      -

      *Thread Reply:* Is this in the context of an OpenLineage extractor?

      +

      *Thread Reply:* Is this in the context of an OpenLineage extractor?

      @@ -74758,7 +74764,7 @@

      SundayFunday

      2022-09-30 14:40:47
      -

      *Thread Reply:* Yes! I was specifically looking at the PostgresOperator

      +

      *Thread Reply:* Yes! I was specifically looking at the PostgresOperator

      @@ -74784,7 +74790,7 @@

      SundayFunday

      2022-09-30 14:41:54
      -

      *Thread Reply:* (as Snowflake lineage can be retrieved from their internal ACCESS_HISTORY tables, we wouldn’t need to use Airflow’s SnowflakeOperator to get lineage, we’d use the method on the openlineage blog)

      +

      *Thread Reply:* (as Snowflake lineage can be retrieved from their internal ACCESS_HISTORY tables, we wouldn’t need to use Airflow’s SnowflakeOperator to get lineage, we’d use the method on the openlineage blog)

      @@ -74810,7 +74816,7 @@

      SundayFunday

      2022-09-30 14:43:08
      -

*Thread Reply:* The extractor for the SQL operators gets the query like this:

+

*Thread Reply:* The extractor for the SQL operators gets the query like this:
https://github.com/OpenLineage/OpenLineage/blob/45fda47d8ef29dd6d25103bb491fb8c443[…]gration/airflow/openlineage/airflow/extractors/sql_extractor.py

      @@ -74866,7 +74872,7 @@

      SundayFunday

      2022-09-30 14:43:48
      -

      *Thread Reply:* let me see if I can find the corresponding part of the Airflow API docs...

      +

      *Thread Reply:* let me see if I can find the corresponding part of the Airflow API docs...

      @@ -74892,7 +74898,7 @@

      SundayFunday

      2022-09-30 14:45:00
      -

      *Thread Reply:* aha! I’m not so far behind the times, it was only put in during July https://github.com/OpenLineage/OpenLineage/pull/907

      +

      *Thread Reply:* aha! I’m not so far behind the times, it was only put in during July https://github.com/OpenLineage/OpenLineage/pull/907

      @@ -74957,7 +74963,7 @@

      SundayFunday

      2022-09-30 14:47:28
      -

*Thread Reply:* Hm. The PostgresOperator seems to extend BaseOperator directly:

+

*Thread Reply:* Hm. The PostgresOperator seems to extend BaseOperator directly:
https://github.com/apache/airflow/blob/029ebacd9cbbb5e307a03530bdaf111c2c3d4f51/airflow/providers/postgres/operators/postgres.py#L58

      @@ -75009,7 +75015,7 @@

      SundayFunday

      2022-09-30 14:48:01
      -

      *Thread Reply:* yeah 😞 I couldn’t find a way to make that work as an end-user.

      +

      *Thread Reply:* yeah 😞 I couldn’t find a way to make that work as an end-user.

      @@ -75035,7 +75041,7 @@

      SundayFunday

      2022-09-30 14:48:08
      -

      *Thread Reply:* perhaps that can't be assumed for all operators that deal with SQL. I know that @Maciej Obuchowski has spent a lot of time on this.

      +

      *Thread Reply:* perhaps that can't be assumed for all operators that deal with SQL. I know that @Maciej Obuchowski has spent a lot of time on this.

      @@ -75061,7 +75067,7 @@

      SundayFunday

      2022-09-30 14:49:14
      -

      *Thread Reply:* I don't know enough about the airflow internals 😞

      +

      *Thread Reply:* I don't know enough about the airflow internals 😞

      @@ -75087,7 +75093,7 @@

      SundayFunday

      2022-09-30 14:50:00
      -

      *Thread Reply:* No worries. In case it saves you work, I also had a look at https://github.com/apache/airflow/blob/029ebacd9cbbb5e307a03530bdaf111c2c3d4f51/airflow/providers/common/sql/operators/sql.py - which also extends BaseOperator but not with a way to just get the SQL.

      +

      *Thread Reply:* No worries. In case it saves you work, I also had a look at https://github.com/apache/airflow/blob/029ebacd9cbbb5e307a03530bdaf111c2c3d4f51/airflow/providers/common/sql/operators/sql.py - which also extends BaseOperator but not with a way to just get the SQL.

      @@ -75366,7 +75372,7 @@

      SundayFunday

      2022-09-30 15:22:24
      -

*Thread Reply:* that's more of an Airflow question indeed. As far as I understand, you need to read a file with a SQL statement within an Airflow operator and do something other than run the query (like pass it as an XCom)? The SQLExtractors we have get the same SQL that operators render and use it to extract additional information, like table schema, straight from the database

+

*Thread Reply:* that's more of an Airflow question indeed. As far as I understand, you need to read a file with a SQL statement within an Airflow operator and do something other than run the query (like pass it as an XCom)? The SQLExtractors we have get the same SQL that operators render and use it to extract additional information, like table schema, straight from the database

      @@ -75444,7 +75450,7 @@

      SundayFunday

      2022-09-30 19:06:47
      -

*Thread Reply:* There's a design doc linked from the PR: https://github.com/apache/airflow/pull/20443

+

*Thread Reply:* There's a design doc linked from the PR: https://github.com/apache/airflow/pull/20443
https://docs.google.com/document/d/1L3xfdlWVUrdnFXng1Di4nMQYQtzMfhvvWDR9K4wXnDU/edit

      @@ -75528,7 +75534,7 @@

      SundayFunday

      2022-09-30 19:18:47
      -

      *Thread Reply:* amazing thank you I will take a look

      +

      *Thread Reply:* amazing thank you I will take a look

      @@ -75595,7 +75601,7 @@

      SundayFunday

      2022-10-03 11:33:30
      -

      *Thread Reply:* Hey @Michael Robinson. Removal of Airflow 1.x support is planned for next release after 0.15.0

      +

      *Thread Reply:* Hey @Michael Robinson. Removal of Airflow 1.x support is planned for next release after 0.15.0

      @@ -75625,7 +75631,7 @@

      SundayFunday

      2022-10-03 11:37:03
      -

      *Thread Reply:* 0.15.0 would be the last release supporting Airflow 1.x

      +

      *Thread Reply:* 0.15.0 would be the last release supporting Airflow 1.x

      @@ -75651,7 +75657,7 @@

      SundayFunday

      2022-10-03 11:37:07
      -

      *Thread Reply:* just caught this myself. I’ll make the change

      +

      *Thread Reply:* just caught this myself. I’ll make the change

      @@ -75677,7 +75683,7 @@

      SundayFunday

      2022-10-03 11:40:33
      -

      *Thread Reply:* we’re still on 1.10.15 at the moment so i guess our team would have to rely on <=0.15.0?

      +

      *Thread Reply:* we’re still on 1.10.15 at the moment so i guess our team would have to rely on <=0.15.0?

      @@ -75703,7 +75709,7 @@

      SundayFunday

      2022-10-03 11:49:47
      -

      *Thread Reply:* Is this something you want to continue doing or do you want to migrate relatively soon?

      +

      *Thread Reply:* Is this something you want to continue doing or do you want to migrate relatively soon?

We want to remove the 1.10 integration because, for multiple PRs, maintaining compatibility with it takes a lot of time; the code is littered with checks like this:
if parse_version(AIRFLOW_VERSION) >= parse_version("2.0.0"):

      @@ -75736,7 +75742,7 @@

      SundayFunday

      2022-10-03 12:03:40
      -

      *Thread Reply:* hey Maciej, we do have plans to migrate in the coming months but for right now we need to stay on 1.10.15.

      +

      *Thread Reply:* hey Maciej, we do have plans to migrate in the coming months but for right now we need to stay on 1.10.15.

      @@ -75762,7 +75768,7 @@

      SundayFunday

      2022-10-04 09:39:11
      -

      *Thread Reply:* Thanks, all. The release is authorized, and you can expect it by Thursday.

      +

      *Thread Reply:* Thanks, all. The release is authorized, and you can expect it by Thursday.

      @@ -75816,7 +75822,7 @@

      SundayFunday

      2022-10-03 17:56:36
      -

      *Thread Reply:* running airflow 2.3.4 with openlineage-airflow 0.14.1

      +

      *Thread Reply:* running airflow 2.3.4 with openlineage-airflow 0.14.1

      @@ -75842,7 +75848,7 @@

      SundayFunday

      2022-10-03 18:03:03
      -

      *Thread Reply:* if you're talking about LineageBackend, it is used in Airflow 2.1-2.2. It did not have functionality where you can be notified on task start or failure, so we wanted to expand the functionality: https://github.com/apache/airflow/issues/17984

      +

      *Thread Reply:* if you're talking about LineageBackend, it is used in Airflow 2.1-2.2. It did not have functionality where you can be notified on task start or failure, so we wanted to expand the functionality: https://github.com/apache/airflow/issues/17984

      Consensus of Airflow maintainers wasn't positive about changing this interface, so we went with another direction: https://github.com/apache/airflow/pull/20443

      SundayFunday
      2022-10-03 18:06:58
      -

      *Thread Reply:* Why nothing happens? https://github.com/OpenLineage/OpenLineage/blob/895160423643398348154a87e0682c3ab5c8704b/integration/airflow/openlineage/lineage_backend/__init__.py#L91

      +

      *Thread Reply:* Why nothing happens? https://github.com/OpenLineage/OpenLineage/blob/895160423643398348154a87e0682c3ab5c8704b/integration/airflow/openlineage/lineage_backend/__init__.py#L91

      @@ -75969,7 +75975,7 @@

      SundayFunday

      2022-10-03 18:30:32
      -

      *Thread Reply:* ah hmm ok, i will double check. i commented that part out so technically it should run but maybe i missed something

      +

      *Thread Reply:* ah hmm ok, i will double check. i commented that part out so technically it should run but maybe i missed something

      @@ -75995,7 +76001,7 @@

      SundayFunday

      2022-10-03 18:30:42
      -

      *Thread Reply:* thank you for your fast response @Maciej Obuchowski ! i appreciate it

      +

      *Thread Reply:* thank you for your fast response @Maciej Obuchowski ! i appreciate it

      @@ -76021,7 +76027,7 @@

      SundayFunday

      2022-10-03 18:31:13
      -

      *Thread Reply:* it seems like it doesn't use my custom wrapper but instead uses the openlineage implementation.

      +

      *Thread Reply:* it seems like it doesn't use my custom wrapper but instead uses the openlineage implementation.

      @@ -76047,7 +76053,7 @@

      SundayFunday

      2022-10-03 20:11:15
      -

      *Thread Reply:* @Maciej Obuchowski ok, after checking we are emitting events with our custom backend but an odd thing is an attempt is always made with the openlineage backend. is there something obvious i am perhaps missing 🤔

      +

      *Thread Reply:* @Maciej Obuchowski ok, after checking we are emitting events with our custom backend but an odd thing is an attempt is always made with the openlineage backend. is there something obvious i am perhaps missing 🤔

It ends up with requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url immediately after task start, but by the end, on task success/failure, it emits the event with our custom backend, both RunState.COMPLETE and RunState.START, into our own pipeline.

      @@ -76075,7 +76081,7 @@

      SundayFunday

      2022-10-04 06:19:06
      -

      *Thread Reply:* If you're on 2.3 and trying to use some wrapped LineageBackend, what I think is happening is OpenLineagePlugin that automatically registers via setup.py entrypoint https://github.com/OpenLineage/OpenLineage/blob/65a5f021a1ba3035d5198e759587737a05b242e1/integration/airflow/openlineage/airflow/plugin.py#L30

      +

      *Thread Reply:* If you're on 2.3 and trying to use some wrapped LineageBackend, what I think is happening is OpenLineagePlugin that automatically registers via setup.py entrypoint https://github.com/OpenLineage/OpenLineage/blob/65a5f021a1ba3035d5198e759587737a05b242e1/integration/airflow/openlineage/airflow/plugin.py#L30

      @@ -76126,7 +76132,7 @@

      SundayFunday

      2022-10-04 06:23:48
      -

      *Thread Reply:* I think if you want to extend it with proprietary code there are two good options.

      +

      *Thread Reply:* I think if you want to extend it with proprietary code there are two good options.

      First, if your code only needs to touch HTTP client side - which I guess is the case due to 401 error - then you can create custom Transport.

      @@ -76160,7 +76166,7 @@

      SundayFunday

      2022-10-04 14:23:33
      -

      *Thread Reply:* amazing thank you for your help. i will take a look

      +

      *Thread Reply:* amazing thank you for your help. i will take a look

      @@ -76186,7 +76192,7 @@

      SundayFunday

      2022-10-04 14:49:47
      -

      *Thread Reply:* @Maciej Obuchowski is there a way to extend the plugin like how we can wrap the custom backend with 2.2? or would it be necessary to fork it.

      +

      *Thread Reply:* @Maciej Obuchowski is there a way to extend the plugin like how we can wrap the custom backend with 2.2? or would it be necessary to fork it.

      we're trying to not fork and instead opt with extending.

      @@ -76214,7 +76220,7 @@

      SundayFunday

      2022-10-05 04:55:05
      -

      *Thread Reply:* I think it's best to fork, since it's getting loaded by Airflow as an entrypoint: https://github.com/OpenLineage/OpenLineage/blob/133110300e8ea4e42e3640608cfed459683d5a8d/integration/airflow/setup.py#L70

      +

      *Thread Reply:* I think it's best to fork, since it's getting loaded by Airflow as an entrypoint: https://github.com/OpenLineage/OpenLineage/blob/133110300e8ea4e42e3640608cfed459683d5a8d/integration/airflow/setup.py#L70
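Roughly, the relevant registration in that setup.py looks like this (approximate, paraphrased from the linked line):

entry_points={
    "airflow.plugins": [
        "OpenLineagePlugin = openlineage.airflow.plugin:OpenLineagePlugin",
    ],
},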

      @@ -76273,7 +76279,7 @@

      SundayFunday

      2022-10-05 13:29:24
      -

      *Thread Reply:* got it. and in terms of the openlineage.yml and defining a custom transport is there a way i can define where openlineage-python should look for the custom transport? e.g. different path

      +

      *Thread Reply:* got it. and in terms of the openlineage.yml and defining a custom transport is there a way i can define where openlineage-python should look for the custom transport? e.g. different path

      @@ -76299,7 +76305,7 @@

      SundayFunday

      2022-10-05 13:30:04
      -

*Thread Reply:* because from the docs i can't tell, except for the file i'm supposed to copy and implement.

+

*Thread Reply:* because from the docs i can't tell, except for the file i'm supposed to copy and implement.

      @@ -76325,7 +76331,7 @@

      SundayFunday

      2022-10-05 14:18:19
      -

      *Thread Reply:* @Paul Lee you should derive from Transport base class and register type as full python import path to your custom transport, for example https://github.com/OpenLineage/OpenLineage/blob/f8533266491acea2159f602f782a99a4f8a82cca/client/python/tests/openlineage.yml#L2

      +

      *Thread Reply:* @Paul Lee you should derive from Transport base class and register type as full python import path to your custom transport, for example https://github.com/OpenLineage/OpenLineage/blob/f8533266491acea2159f602f782a99a4f8a82cca/client/python/tests/openlineage.yml#L2

      @@ -76376,7 +76382,7 @@

      SundayFunday

      2022-10-05 14:20:48
      -

*Thread Reply:* your custom transport should also define a custom Config class, and this class should implement a from_dict method

+

*Thread Reply:* your custom transport should also define a custom Config class, and this class should implement a from_dict method

      @@ -76402,7 +76408,7 @@

      SundayFunday

      2022-10-05 14:20:56
      -

      *Thread Reply:* the whole process is here: https://github.com/OpenLineage/OpenLineage/blob/a62484ec14359a985d283c639ac7e8b9cfc54c2e/client/python/openlineage/client/transport/factory.py#L47

      +

      *Thread Reply:* the whole process is here: https://github.com/OpenLineage/OpenLineage/blob/a62484ec14359a985d283c639ac7e8b9cfc54c2e/client/python/openlineage/client/transport/factory.py#L47
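Putting the thread together, a hedged sketch of a custom transport (the base-class attribute names have shifted across openlineage-python versions, so check your release; FileTransport, FileConfig, and the paths are placeholders):

from openlineage.client.serde import Serde
from openlineage.client.transport import Config, Transport

class FileConfig(Config):
    def __init__(self, log_file):
        self.log_file = log_file

    @classmethod
    def from_dict(cls, params):
        return cls(log_file=params["log_file"])

class FileTransport(Transport):
    kind = "file"
    config = FileConfig  # newer releases name this attribute config_class

    def __init__(self, config):
        self.config = config

    def emit(self, event):
        # append each serialized event to a local file
        with open(self.config.log_file, "a") as f:
            f.write(Serde.to_json(event) + "\n")

and the matching openlineage.yml, with type set to the full import path:

transport:
  type: my_package.transport.FileTransport
  log_file: /tmp/openlineage_events.jsonl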

      @@ -76453,7 +76459,7 @@

      SundayFunday

      2022-10-05 14:21:09
      -

      *Thread Reply:* and I know we need to document this better 🙂

      +

      *Thread Reply:* and I know we need to document this better 🙂

      @@ -76483,7 +76489,7 @@

      SundayFunday

      2022-10-05 15:35:31
      -

      *Thread Reply:* amazing, thanks for all your help 🙂 +1 to the docs, if i have some time when done i will push up some docs to document what i've done

      +

      *Thread Reply:* amazing, thanks for all your help 🙂 +1 to the docs, if i have some time when done i will push up some docs to document what i've done

      @@ -76509,7 +76515,7 @@

      SundayFunday

      2022-10-05 15:50:29
      -

      *Thread Reply:* https://github.com/openlineage/docs/ - let me know and I'll review 🙂

      +

      *Thread Reply:* https://github.com/openlineage/docs/ - let me know and I'll review 🙂

      @@ -76648,7 +76654,7 @@

      SundayFunday

      2022-10-04 14:25:49
      -

      *Thread Reply:* Thanks, all. The release is authorized.

      +

      *Thread Reply:* Thanks, all. The release is authorized.

      @@ -76751,7 +76757,7 @@

      SundayFunday

      2022-10-06 13:29:30
      -

      *Thread Reply:* would love to add improvement in docs :) for newcomers

      +

      *Thread Reply:* would love to add improvement in docs :) for newcomers

      @@ -76781,7 +76787,7 @@

      SundayFunday

      2022-10-06 13:31:07
      -

      *Thread Reply:* also, what’s TSC?

      +

      *Thread Reply:* also, what’s TSC?

      @@ -76807,7 +76813,7 @@

      SundayFunday

      2022-10-06 15:20:23
      -

      *Thread Reply:* Technical Steering Committee, but it’s open to everyone

      +

      *Thread Reply:* Technical Steering Committee, but it’s open to everyone

      @@ -76837,7 +76843,7 @@

      SundayFunday

      2022-10-06 15:20:45
      -

      *Thread Reply:* and we encourage newcomers to attend

      +

      *Thread Reply:* and we encourage newcomers to attend

      @@ -76889,7 +76895,7 @@

      SundayFunday

      2022-10-06 14:39:27
      -

      *Thread Reply:* is there any error/warn message logged maybe?

      +

      *Thread Reply:* is there any error/warn message logged maybe?

      @@ -76915,7 +76921,7 @@

      SundayFunday

      2022-10-06 14:40:53
      -

      *Thread Reply:* none that i'm seeing on our workers. i do see that our custom http transport is being utilized on START.

      +

      *Thread Reply:* none that i'm seeing on our workers. i do see that our custom http transport is being utilized on START.

      but on SUCCESS nothing fires.

      @@ -76943,7 +76949,7 @@

      SundayFunday

      2022-10-06 14:41:21
      -

      *Thread Reply:* which makes me believe the listeners themselves aren't being utilized? 🤔

      +

      *Thread Reply:* which makes me believe the listeners themselves aren't being utilized? 🤔

      @@ -76969,7 +76975,7 @@

      SundayFunday

      2022-10-06 16:37:54
      -

      *Thread Reply:* uhm, any chance you're experiencing this with custom extractors?

      +

      *Thread Reply:* uhm, any chance you're experiencing this with custom extractors?

      @@ -76995,7 +77001,7 @@

      SundayFunday

      2022-10-06 16:38:13
      -

      *Thread Reply:* I'd be happy to jump on a quick call if you wish

      +

      *Thread Reply:* I'd be happy to jump on a quick call if you wish

      @@ -77021,7 +77027,7 @@

      SundayFunday

      2022-10-06 16:38:40
      -

      *Thread Reply:* but in more EU friendly hours 🙂

      +

      *Thread Reply:* but in more EU friendly hours 🙂

      @@ -77047,7 +77053,7 @@

      SundayFunday

      2022-10-07 16:19:47
      -

*Thread Reply:* no custom extractors, it's using the base extractor. a call would be 👍. let me look at my calendar and EU hours.

+

*Thread Reply:* no custom extractors, it's using the base extractor. a call would be 👍. let me look at my calendar and EU hours.

      @@ -77144,7 +77150,7 @@

      SundayFunday

      2022-10-20 05:11:09
      -

*Thread Reply:* Apparently the value is hard-coded somewhere in the code that I couldn't figure out, but at least I learnt that on my Mac, where this port 5000 is being held up, it can be freed by following the simple step below.

+

*Thread Reply:* Apparently the value is hard-coded somewhere in the code that I couldn't figure out, but at least I learnt that on my Mac, where this port 5000 is being held up, it can be freed by following the simple step below.

      @@ -77211,7 +77217,7 @@

      SundayFunday

      2022-10-11 03:47:31
      -

      *Thread Reply:* Hi @Hanna Moazam,

      +

      *Thread Reply:* Hi @Hanna Moazam,

Within recent PR https://github.com/OpenLineage/OpenLineage/pull/1111, I removed BigQuery dependencies from spark2, spark32 and spark3 subprojects. It has to stay in shared because of BigQueryNodeVisitor. The usage of BigQueryNodeVisitor is tricky as we never know if bigquery classes are available on runtime or not. The check is done in io.openlineage.spark.agent.lifecycle.BaseVisitorFactory:
if (BigQueryNodeVisitor.hasBigQueryClasses()) {

@@ -77303,7 +77309,7 @@

      SundayFunday

      2022-10-11 09:20:44
      -

      *Thread Reply:* So, if we wanted to add Snowflake, we would need to:

      +

      *Thread Reply:* So, if we wanted to add Snowflake, we would need to:

      1. Pick a version of snowflake's spark library
      2. Pick a version of scala that we target (i.e. we are only going to support Snowflake in Spark 3.2 so scala 2.12 will be hard coded)
      3. Add the visitor code to Shared
      4. Add the dependencies to app (ONLY if there is an integration test in app?? This is the confusing part still)
      @@ -77332,7 +77338,7 @@

      SundayFunday

      2022-10-12 03:51:54
      -

      *Thread Reply:* Yes. Please note that snowflake library will not be included in target OpenLineage jar. So you may test it manually against multiple Snowflake library versions or even adjust code in case of minor differences.

      +

      *Thread Reply:* Yes. Please note that snowflake library will not be included in target OpenLineage jar. So you may test it manually against multiple Snowflake library versions or even adjust code in case of minor differences.

      @@ -77362,7 +77368,7 @@

      SundayFunday

      2022-10-12 05:20:17
      -

      *Thread Reply:* Thank you Pawel!

      +

      *Thread Reply:* Thank you Pawel!

      @@ -77388,7 +77394,7 @@

      SundayFunday

      2022-10-12 12:18:16
      -

*Thread Reply:* Basically the same pattern you've already done with Kusto 😉

+

*Thread Reply:* Basically the same pattern you've already done with Kusto 😉
https://github.com/OpenLineage/OpenLineage/blob/a96ecdabe66567151e7739e25cd9dd03d6[…]va/io/openlineage/spark/agent/lifecycle/BaseVisitorFactory.java

      @@ -77440,7 +77446,7 @@

      SundayFunday

      2022-10-12 12:26:35
      -

      *Thread Reply:* We actually used only reflection for Kusto and were hoping to do it the 'better' way with the package itself for snowflake - if it's possible :)

      +

      *Thread Reply:* We actually used only reflection for Kusto and were hoping to do it the 'better' way with the package itself for snowflake - if it's possible :)

      @@ -77496,7 +77502,7 @@

      SundayFunday

      2022-10-11 05:03:26
      -

      *Thread Reply:* Reference implementation of OpenLineage consumer is Marquez: https://github.com/MarquezProject/marquez

      +

      *Thread Reply:* Reference implementation of OpenLineage consumer is Marquez: https://github.com/MarquezProject/marquez

      @@ -77876,7 +77882,7 @@

      SundayFunday

      2022-10-26 15:22:22
      -

      *Thread Reply:* Hey @Austin Poulton, welcome! 👋

      +

      *Thread Reply:* Hey @Austin Poulton, welcome! 👋

      @@ -77902,7 +77908,7 @@

      SundayFunday

      2022-10-31 06:09:41
      -

      *Thread Reply:* thanks Harel 🙂

      +

      *Thread Reply:* thanks Harel 🙂

      @@ -77973,7 +77979,7 @@

      SundayFunday

      2022-11-01 13:37:54
      -

      *Thread Reply:* Thanks, all! The release is authorized. We will initiate it within 48 hours.

      +

      *Thread Reply:* Thanks, all! The release is authorized. We will initiate it within 48 hours.

      @@ -78025,7 +78031,7 @@

      SundayFunday

      2022-11-02 09:19:43
      -

      *Thread Reply:* I think amundsen-openlineage dataloader precedes column-level lineage in OL by a bit, so I doubt this works

      +

      *Thread Reply:* I think amundsen-openlineage dataloader precedes column-level lineage in OL by a bit, so I doubt this works

      @@ -78051,7 +78057,7 @@

      SundayFunday

      2022-11-02 15:54:31
      -

      *Thread Reply:* do you want to open up an issue for it @Iftach Schonbaum?

      +

      *Thread Reply:* do you want to open up an issue for it @Iftach Schonbaum?

      @@ -78170,7 +78176,7 @@

      SundayFunday

      2022-11-03 12:25:29
      -

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated shortly.

      +

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated shortly.

      @@ -78268,7 +78274,7 @@

      SundayFunday

      2022-11-04 06:34:27
      -

      *Thread Reply:* > Are there any tutorial and documentation how to create an Openlinage connector. +

      *Thread Reply:* > Are there any tutorial and documentation how to create an Openlinage connector. We have somewhat of a start of a doc: https://openlineage.io/docs/development/developing/

      @@ -78440,7 +78446,7 @@

      SundayFunday

      2022-11-08 10:22:15
      -

      *Thread Reply:* Welcome Kenton! Happy to help 👍

      +

      *Thread Reply:* Welcome Kenton! Happy to help 👍

      @@ -78497,7 +78503,7 @@

      SundayFunday

      2022-11-08 10:06:10
      -

      *Thread Reply:* What kind of data? My first feeling is that you need to extend the Spark integration

      +

      *Thread Reply:* What kind of data? My first feeling is that you need to extend the Spark integration

      @@ -78523,7 +78529,7 @@

      SundayFunday

      2022-11-09 00:35:29
      -

      *Thread Reply:* Yes, we wanted to add information like user/job description that we can use later with rest of openlineage event fields in our system

      +

      *Thread Reply:* Yes, we wanted to add information like user/job description that we can use later with rest of openlineage event fields in our system

      @@ -78549,7 +78555,7 @@

      SundayFunday

      2022-11-09 00:41:35
      -

      *Thread Reply:* I can see in this PR https://github.com/OpenLineage/OpenLineage/pull/490 that env values can be captured which we can use to add some custom metadata but it seems it is specific to Databricks only.

      +

      *Thread Reply:* I can see in this PR https://github.com/OpenLineage/OpenLineage/pull/490 that env values can be captured which we can use to add some custom metadata but it seems it is specific to Databricks only.

      @@ -78636,7 +78642,7 @@

      SundayFunday

      2022-11-09 05:14:50
      -

      *Thread Reply:* I think it makes sense to have something like that, but generic, if you want to contribute it

      +

      *Thread Reply:* I think it makes sense to have something like that, but generic, if you want to contribute it

      @@ -78666,7 +78672,7 @@

      SundayFunday

      2022-11-14 03:28:35
      -

*Thread Reply:* @Maciej Obuchowski Do you mean adding something like

+

*Thread Reply:* @Maciej Obuchowski Do you mean adding something like
spark.openlineage.jobFacet.FacetName.Key=Value
to the spark conf should add a new job facet like
"FacetName": {
  "Key": "Value"
}

@@ -78696,7 +78702,7 @@

      SundayFunday

      2022-11-14 05:56:02
      -

      *Thread Reply:* We can argue about name of that key, but yes, something like that. Just notice that while it's possible to attach something to run and job facets directly, it would be much harder to do this with datasets

      +

      *Thread Reply:* We can argue about name of that key, but yes, something like that. Just notice that while it's possible to attach something to run and job facets directly, it would be much harder to do this with datasets

      @@ -78748,7 +78754,7 @@

      SundayFunday

      2022-11-10 02:22:18
      -

*Thread Reply:* Hi @Varun Singh, what version of openlineage-spark were you using? Are you able to copy the lineage event here?

+

*Thread Reply:* Hi @Varun Singh, what version of openlineage-spark were you using? Are you able to copy the lineage event here?

      @@ -78946,7 +78952,7 @@

      SundayFunday

      2022-11-11 16:28:07
      -

      *Thread Reply:* What if we just not include it in the BaseVisitorFactory but only in the Spark3 visitor factories?

      +

      *Thread Reply:* What if we just not include it in the BaseVisitorFactory but only in the Spark3 visitor factories?

      @@ -78998,7 +79004,7 @@

      SundayFunday

      2022-11-11 16:24:30
      -

      *Thread Reply:* You might look here: https://github.com/OpenLineage/OpenLineage/blob/f7049c599a0b1416408860427f0759624326677d/client/python/openlineage/client/serde.py#L51

      +

      *Thread Reply:* You might look here: https://github.com/OpenLineage/OpenLineage/blob/f7049c599a0b1416408860427f0759624326677d/client/python/openlineage/client/serde.py#L51

      @@ -79079,7 +79085,7 @@

      SundayFunday

      2022-11-15 02:09:25
      -

      *Thread Reply:* I don’t think this is possible at the moment.

      +

      *Thread Reply:* I don’t think this is possible at the moment.

      @@ -79135,7 +79141,7 @@

      SundayFunday

      2022-11-16 09:13:17
      -

      *Thread Reply:* Thanks, all. The release is authorized.

      +

      *Thread Reply:* Thanks, all. The release is authorized.

      @@ -79161,7 +79167,7 @@

      SundayFunday

      2022-11-16 10:41:07
      -

      *Thread Reply:* The PR for the changelog updates: https://github.com/OpenLineage/OpenLineage/pull/1306

      +

      *Thread Reply:* The PR for the changelog updates: https://github.com/OpenLineage/OpenLineage/pull/1306

      @@ -79213,7 +79219,7 @@

      SundayFunday

      2022-11-16 12:27:12
      -

      *Thread Reply:* I think we had similar request before, but nothing was implemented.

      +

      *Thread Reply:* I think we had similar request before, but nothing was implemented.

      @@ -79333,7 +79339,7 @@

      SundayFunday

      2022-11-22 13:40:44
      -

      *Thread Reply:* It's kind of a hard problem UI side. Backend can express that relationship

      +

      *Thread Reply:* It's kind of a hard problem UI side. Backend can express that relationship

      @@ -79359,7 +79365,7 @@

      SundayFunday

      2022-11-22 13:48:58
      -

      *Thread Reply:* Thanks for replying. Could you please point me to the API that allows me to do that? I've been calling GET /lineage with dataset in the node ID, e g., nodeId=dataset:my_dataset . Where could I specify the version of my dataset?

      +

      *Thread Reply:* Thanks for replying. Could you please point me to the API that allows me to do that? I've been calling GET /lineage with dataset in the node ID, e g., nodeId=dataset:my_dataset . Where could I specify the version of my dataset?
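(For reference, Marquez node IDs in that call take the form dataset:<namespace>:<name>, e.g. GET /api/v1/lineage?nodeId=dataset:my_namespace:my_dataset; whether a specific dataset version can be pinned there depends on the Marquez version, so verify against the Marquez API docs.)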

      @@ -79411,7 +79417,7 @@

      SundayFunday

      2022-11-19 04:54:13
      -

*Thread Reply:* Templated fields are rendered before generating lineage data. Do you have some sample code or logs, preferably?

+

*Thread Reply:* Templated fields are rendered before generating lineage data. Do you have some sample code or logs, preferably?

      @@ -79437,7 +79443,7 @@

      SundayFunday

      2022-11-22 13:40:11
      -

      *Thread Reply:* If you're on 1.10 then I think it won't work

      +

      *Thread Reply:* If you're on 1.10 then I think it won't work

      @@ -79463,7 +79469,7 @@

      SundayFunday

      2022-11-28 12:50:39
      -

      *Thread Reply:* @Maciej Obuchowski we are still on airflow 1.10.15 unfortunately.

      +

      *Thread Reply:* @Maciej Obuchowski we are still on airflow 1.10.15 unfortunately.

      cc. @Eli Schachar @Allison Suarez

      @@ -79491,7 +79497,7 @@

      SundayFunday

      2022-11-28 12:50:49
      -

      *Thread Reply:* is there no workaround we can make work?

      +

      *Thread Reply:* is there no workaround we can make work?

      @@ -79517,7 +79523,7 @@

      SundayFunday

      2022-11-28 12:51:01
      -

      *Thread Reply:* @Jakub Dardziński is this for airflow versions 2.0+?

      +

      *Thread Reply:* @Jakub Dardziński is this for airflow versions 2.0+?

      @@ -79569,7 +79575,7 @@

      SundayFunday

      2022-11-21 07:28:04
      -

*Thread Reply:* Yeah. However, to add it, just a tiny bit of code would be required.

+

*Thread Reply:* Yeah. However, to add it, just a tiny bit of code would be required.

      Either in the URL version https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/main/java/io/openlineage/spark/agent/EventEmitter.java#L48

      @@ -79699,7 +79705,7 @@

      SundayFunday

      2022-11-22 18:45:48
      -

      *Thread Reply:* I think that's what NominalTimeFacet covers

      +

      *Thread Reply:* I think that's what NominalTimeFacet covers

      @@ -79772,7 +79778,7 @@

      SundayFunday

      2022-11-28 10:29:58
      -

      *Thread Reply:* Hey @Rahul Sharma, what version of Airflow are you running?

      +

      *Thread Reply:* Hey @Rahul Sharma, what version of Airflow are you running?

      @@ -79798,7 +79804,7 @@

      SundayFunday

      2022-11-28 10:30:14
      -

      *Thread Reply:* i am using airflow 2.x

      +

      *Thread Reply:* i am using airflow 2.x

      @@ -79824,7 +79830,7 @@

      SundayFunday

      2022-11-28 10:30:27
      -

      *Thread Reply:* can we connect if you have time ?

      +

      *Thread Reply:* can we connect if you have time ?

      @@ -79850,7 +79856,7 @@

      SundayFunday

      2022-11-28 11:11:58
      -

      *Thread Reply:* did you see these docs before? https://openlineage.io/integration/apache-airflow/#airflow-20

      +

      *Thread Reply:* did you see these docs before? https://openlineage.io/integration/apache-airflow/#airflow-20

      @@ -79897,7 +79903,7 @@

      SundayFunday

      2022-11-28 11:12:22
      -

      *Thread Reply:* yes

      +

      *Thread Reply:* yes

      @@ -79923,7 +79929,7 @@

      SundayFunday

      2022-11-28 11:12:36
      -

      *Thread Reply:* i already set configuration in airflow.cfg file

      +

      *Thread Reply:* i already set configuration in airflow.cfg file

      @@ -79949,7 +79955,7 @@

      SundayFunday

      2022-11-28 11:12:57
      -

      *Thread Reply:* where are you sending the events to?

      +

      *Thread Reply:* where are you sending the events to?

      @@ -79975,7 +79981,7 @@

      SundayFunday

      2022-11-28 11:13:24
      -

      *Thread Reply:* i have a docker machine on which marquez is working

      +

      *Thread Reply:* i have a docker machine on which marquez is working

      @@ -80001,7 +80007,7 @@

      SundayFunday

      2022-11-28 11:13:47
      -

      *Thread Reply:* so, what is the issue you are seeing?

      +

      *Thread Reply:* so, what is the issue you are seeing?

      @@ -80027,7 +80033,7 @@

      SundayFunday

      2022-11-28 11:15:37
      -

      *Thread Reply:* there is no error

      +

      *Thread Reply:* there is no error

      @@ -80053,7 +80059,7 @@

      SundayFunday

      2022-11-28 11:16:01
      -

      *Thread Reply:* ```[lineage]

      +

      *Thread Reply:* ```[lineage]

# what lineage backend to use
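For context, the full snippet in the OpenLineage Airflow docs of that era looked roughly like this (a reconstruction; the URL and key are placeholders, and the MARQUEZ_** values are environment variables set alongside, not inside, airflow.cfg):

[lineage]
# what lineage backend to use
backend = openlineage.lineage_backend.OpenLineageBackend

MARQUEZ_URL=http://my_hosted_marquez.example.com:5000
MARQUEZ_API_KEY=[YOUR_API_KEY]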

      @@ -80094,7 +80100,7 @@

SundayFunday

      2022-11-28 11:16:09
      -

      *Thread Reply:* above config i have set

      +

      *Thread Reply:* above config i have set

      @@ -80120,7 +80126,7 @@

SundayFunday

      2022-11-28 11:16:22
      -

      *Thread Reply:* please let me know any other thing need to do

      +

      *Thread Reply:* please let me know any other thing need to do

      @@ -80172,7 +80178,7 @@

SundayFunday

      2022-11-25 02:20:40
      -

      *Thread Reply:* please have a look at openapi definition of the event: https://openlineage.io/apidocs/openapi/

      +

      *Thread Reply:* please have a look at openapi definition of the event: https://openlineage.io/apidocs/openapi/

      @@ -80224,7 +80230,7 @@

SundayFunday

      2022-11-30 14:10:10
      -

      *Thread Reply:* hey, I'll DM you.

      +

      *Thread Reply:* hey, I'll DM you.

      @@ -80287,7 +80293,7 @@

SundayFunday

      2022-12-06 13:56:17
      -

*Thread Reply:* Thanks, all. The release is authorized and will be initiated within two business days.

+

*Thread Reply:* Thanks, all. The release is authorized and will be initiated within two business days.

      @@ -80370,7 +80376,7 @@

SundayFunday

      2022-12-02 15:21:20
      -

      *Thread Reply:* Microsoft https://openlineage.io/blog/openlineage-microsoft-purview/

      +

      *Thread Reply:* Microsoft https://openlineage.io/blog/openlineage-microsoft-purview/

      @@ -80421,7 +80427,7 @@

SundayFunday

      2022-12-05 13:54:06
      -

      *Thread Reply:* I think we can share that we have over 2,000 installs of that Microsoft solution accelerator using OpenLineage.

      +

      *Thread Reply:* I think we can share that we have over 2,000 installs of that Microsoft solution accelerator using OpenLineage.

      That means we have thousands of companies having experimented with OpenLineage and Microsoft Purview.

      @@ -80526,7 +80532,7 @@

SundayFunday

      2022-12-07 14:22:58
      -

      *Thread Reply:* For open discussion, I'd like to ask the team for an overview of how the different gradle files are working together for the Spark implementation. I'm terribly confused on where dependencies need to be added (whether it's in shared, app, or a spark version specific folder). Maybe @Maciej Obuchowski...?

      +

      *Thread Reply:* For open discussion, I'd like to ask the team for an overview of how the different gradle files are working together for the Spark implementation. I'm terribly confused on where dependencies need to be added (whether it's in shared, app, or a spark version specific folder). Maybe @Maciej Obuchowski...?

      @@ -80556,7 +80562,7 @@

SundayFunday

      2022-12-07 14:25:12
      -

      *Thread Reply:* Unfortunately I'll be unable to attend the meeting @Will Johnson 😞

      +

      *Thread Reply:* Unfortunately I'll be unable to attend the meeting @Will Johnson 😞

      @@ -80586,7 +80592,7 @@

SundayFunday

      2022-12-08 13:03:08
      -

      *Thread Reply:* This is starting now. CC @Will Johnson

      +

      *Thread Reply:* This is starting now. CC @Will Johnson

      @@ -80612,7 +80618,7 @@

SundayFunday

      2022-12-09 19:24:15
      -

      *Thread Reply:* @Will Johnson Check the notes and the recording. @Michael Collado did a pass at explaining the relationship between shared, app and the versions

      +

      *Thread Reply:* @Will Johnson Check the notes and the recording. @Michael Collado did a pass at explaining the relationship between shared, app and the versions

      @@ -80638,7 +80644,7 @@

SundayFunday

      2022-12-09 19:24:30
      -

      *Thread Reply:* feel free to follow up here as well

      +

      *Thread Reply:* feel free to follow up here as well

      @@ -80664,7 +80670,7 @@

SundayFunday

      2022-12-09 19:39:37
      -

      *Thread Reply:* ascii art to the rescue! (top “depends on” bottom)

      +

      *Thread Reply:* ascii art to the rescue! (top “depends on” bottom)

[ascii dependency diagram, partially lost in extraction: app at the top, fanning out to the version-specific projects, which sit on top of shared]
      @@ -80719,7 +80725,7 @@ 

SundayFunday

      2022-12-09 19:40:05
      -

      *Thread Reply:* 😍

      +

      *Thread Reply:* 😍

      @@ -80745,7 +80751,7 @@

SundayFunday

      2022-12-09 19:41:13
      -

      *Thread Reply:* (btw, we should have written datakin to output ascii art; it’s obviously the superior way to generate graphs 😜)

      +

      *Thread Reply:* (btw, we should have written datakin to output ascii art; it’s obviously the superior way to generate graphs 😜)

      @@ -80771,7 +80777,7 @@

SundayFunday

      2022-12-14 05:18:53
      -

      *Thread Reply:* Hi, is there a recording for this meeting?

      +

      *Thread Reply:* Hi, is there a recording for this meeting?

      @@ -80823,7 +80829,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-07 22:05:25
      -

      *Thread Reply:* The namespace is the bucket and the dataset name is the path. Is there a blob storage provider in particular you are thinking of?

      +

      *Thread Reply:* The namespace is the bucket and the dataset name is the path. Is there a blob storage provider in particular you are thinking of?

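For illustration, a GCS reference built with the Python client might look like this (bucket and path are made up):

from openlineage.client.run import Dataset

# namespace is the bucket URI, name is the object path (illustrative values)
gcs_dataset = Dataset(namespace="gs://example-bucket", name="/warehouse/orders")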
      @@ -80849,7 +80855,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-07 23:13:41
      -

      *Thread Reply:* Thanks, that makes sense. We use GCS, so it is already covered by the naming conventions documented. I was just not sure if I was understanding the document correctly or not.

      +

      *Thread Reply:* Thanks, that makes sense. We use GCS, so it is already covered by the naming conventions documented. I was just not sure if I was understanding the document correctly or not.

      @@ -80875,7 +80881,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-07 23:34:33
      -

      *Thread Reply:* No problem. Let us know if you have suggestions on the wording to make the doc clearer

      +

      *Thread Reply:* No problem. Let us know if you have suggestions on the wording to make the doc clearer

      @@ -80975,7 +80981,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-09 11:51:16
      -

*Thread Reply:* Dataset dependencies are represented through a common relationship with a Job - e.g., the task that performed the transformation.

      +

*Thread Reply:* Dataset dependencies are represented through a common relationship with a Job - e.g., the task that performed the transformation.

      @@ -81001,7 +81007,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-11 09:01:19
      -

*Thread Reply:* Is it possible to populate a table-level dependency without any transformation using the OpenLineage specification? Like defining that dataset 1 is dependent on table 1 and table 2, which can be represented as separate datasets

      +

*Thread Reply:* Is it possible to populate a table-level dependency without any transformation using the OpenLineage specification? Like defining that dataset 1 is dependent on table 1 and table 2, which can be represented as separate datasets

      @@ -81027,7 +81033,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-13 15:24:20
      -

      *Thread Reply:* Not explicitly, in today's spec. The guiding principle is that something created that dependency, and the dependency changes over time in a way that is important to study.

      +

      *Thread Reply:* Not explicitly, in today's spec. The guiding principle is that something created that dependency, and the dependency changes over time in a way that is important to study.

      @@ -81053,7 +81059,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-13 15:25:12
      -

*Thread Reply:* I say this to explain why it is the way it is - but the spec can change over time to serve new use cases, certainly!

      +

*Thread Reply:* I say this to explain why it is the way it is - but the spec can change over time to serve new use cases, certainly!

      @@ -81105,7 +81111,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-14 09:56:22
      -

      *Thread Reply:* Hi @Anirudh Shrinivason, you could start with column-lineage & spark workshop available here -> https://github.com/OpenLineage/workshops/tree/main/spark

      +

      *Thread Reply:* Hi @Anirudh Shrinivason, you could start with column-lineage & spark workshop available here -> https://github.com/OpenLineage/workshops/tree/main/spark

      @@ -81135,7 +81141,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-14 10:05:54
      -

      *Thread Reply:* Hi @Paweł Leszczyński Thanks for the link! But this does not really answer the concern.

      +

      *Thread Reply:* Hi @Paweł Leszczyński Thanks for the link! But this does not really answer the concern.

      @@ -81161,7 +81167,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-14 10:06:08
      -

      *Thread Reply:* I am already able to capture column lineage

      +

      *Thread Reply:* I am already able to capture column lineage

      @@ -81187,7 +81193,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-14 10:06:33
      -

*Thread Reply:* What I would like is to capture some extra environment variables, and send them to the server along with the lineage

      +

*Thread Reply:* What I would like is to capture some extra environment variables, and send them to the server along with the lineage

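A rough sketch of the capture side of that idea (the variable names are made up; how they get attached to the event depends on the integration):

import os

# pick a few environment variables to attach to the lineage event (illustrative names)
wanted = ["TEAM", "DEPLOY_ENV"]
env_props = {k: os.environ.get(k) for k in wanted if k in os.environ}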
      @@ -81213,7 +81219,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-14 11:22:59
      -

      *Thread Reply:* i remember we already have a facet for that: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/facets/EnvironmentFacet.java

      +

      *Thread Reply:* i remember we already have a facet for that: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/facets/EnvironmentFacet.java

      @@ -81290,7 +81296,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-14 11:24:07
      -

      *Thread Reply:* but it is only used at the moment to capture some databricks environment attributes

      +

      *Thread Reply:* but it is only used at the moment to capture some databricks environment attributes

      @@ -81316,7 +81322,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-14 11:28:29
      -

*Thread Reply:* so you can contribute to the project and add a feature which adds specified/all environment variables to the lineage event.

      +

*Thread Reply:* so you can contribute to the project and add a feature which adds specified/all environment variables to the lineage event.

you can also have a look at the extending section of the Spark integration docs (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#extending) and create a class that adds a run facet builder according to your needs.

      @@ -81344,7 +81350,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-14 11:29:28
      -

*Thread Reply:* the third way is to create an issue related to this because being able to send selected/all environment variables in an OL event seems to be a really cool feature.

      +

*Thread Reply:* the third way is to create an issue related to this because being able to send selected/all environment variables in an OL event seems to be a really cool feature.

      @@ -81374,7 +81380,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-14 21:49:19
      -

      *Thread Reply:* That is great! Thank you so much! This really helps!

      +

      *Thread Reply:* That is great! Thank you so much! This really helps!

      @@ -81400,7 +81406,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-15 01:44:42
      -

*Thread Reply:* List&lt;String&gt; dbPropertiesKeys =

+

*Thread Reply:* List&lt;String&gt; dbPropertiesKeys =
Arrays.asList(
    "orgId",
    "spark.databricks.clusterUsageTags.clusterOwnerOrgId",

@@ -81443,7 +81449,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-15 01:57:05
      -

      *Thread Reply:* I have opened an issue in the community here: https://github.com/OpenLineage/OpenLineage/issues/1419

      +

      *Thread Reply:* I have opened an issue in the community here: https://github.com/OpenLineage/OpenLineage/issues/1419

      @@ -81501,7 +81507,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-01 02:24:39
      -

*Thread Reply:* Hi @Paweł Leszczyński I have opened a PR for helping to add this use case. Please do help to see if we can merge it in. Thanks!

+

*Thread Reply:* Hi @Paweł Leszczyński I have opened a PR for helping to add this use case. Please do help to see if we can merge it in. Thanks!
https://github.com/OpenLineage/OpenLineage/pull/1545

      @@ -81595,7 +81601,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-02 11:45:52
      -

      *Thread Reply:* Hey @Anirudh Shrinivason, sorry for late reply, but I reviewed the PR.

      +

      *Thread Reply:* Hey @Anirudh Shrinivason, sorry for late reply, but I reviewed the PR.

      @@ -81621,7 +81627,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-06 03:06:42
      -

      *Thread Reply:* Hey thanks a lot! I have made the requested changes! Thanks!

      +

      *Thread Reply:* Hey thanks a lot! I have made the requested changes! Thanks!

      @@ -81647,7 +81653,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-06 03:06:49
      -

      *Thread Reply:* @Maciej Obuchowski ^ 🙂

      +

      *Thread Reply:* @Maciej Obuchowski ^ 🙂

      @@ -81677,7 +81683,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-06 09:09:34
      -

      *Thread Reply:* Hey @Anirudh Shrinivason, took a look at it but it unfortunately fails integration tests (throws NPE), can you take a look again?

      +

      *Thread Reply:* Hey @Anirudh Shrinivason, took a look at it but it unfortunately fails integration tests (throws NPE), can you take a look again?

      23/02/06 12:18:39 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception java.lang.NullPointerException @@ -81723,7 +81729,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-07 04:17:02
      -

      *Thread Reply:* Hi yeah my bad. It should be fixed in the latest push. But I think the tests are not running in the CI because of some GCP environment issue? I am not really sure how to fix it...

      +

      *Thread Reply:* Hi yeah my bad. It should be fixed in the latest push. But I think the tests are not running in the CI because of some GCP environment issue? I am not really sure how to fix it...

      @@ -81749,7 +81755,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-07 04:18:46
      -

      *Thread Reply:* I can make them run, it's just that running them on forks is disabled. We need to make it more clear I suppose

      +

      *Thread Reply:* I can make them run, it's just that running them on forks is disabled. We need to make it more clear I suppose

      @@ -81779,7 +81785,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-07 04:24:38
      -

      *Thread Reply:* Ahh I see thanks! Also, some of the tests are failing on my local, such as https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/test/java/io/openlineage/spark/agent/lifecycle/DeltaDataSourceTest.java. Is this expected behaviour?

      +

      *Thread Reply:* Ahh I see thanks! Also, some of the tests are failing on my local, such as https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/test/java/io/openlineage/spark/agent/lifecycle/DeltaDataSourceTest.java. Is this expected behaviour?

      @@ -81917,7 +81923,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-07 07:20:11
      -

      *Thread Reply:* tests failing isn't expected behaviour 🙂

      +

      *Thread Reply:* tests failing isn't expected behaviour 🙂

      @@ -81943,7 +81949,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-08 03:37:23
      -

      *Thread Reply:* Ahh yeap it was a local ide issue on my side. I added some tests to verify the presence of env variables too.

      +

      *Thread Reply:* Ahh yeap it was a local ide issue on my side. I added some tests to verify the presence of env variables too.

      @@ -81969,7 +81975,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-08 03:47:22
      -

*Thread Reply:* @Anirudh Shrinivason let me know when you'll push the fixed version, I can run full tests then

      +

*Thread Reply:* @Anirudh Shrinivason let me know when you'll push the fixed version, I can run full tests then

      @@ -81995,7 +82001,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-08 03:49:35
      -

      *Thread Reply:* I have pushed just now

      +

      *Thread Reply:* I have pushed just now

      @@ -82021,7 +82027,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-08 03:49:39
      -

      *Thread Reply:* You can run the tests

      +

      *Thread Reply:* You can run the tests

      @@ -82047,7 +82053,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-08 04:13:07
      -

      *Thread Reply:* @Maciej Obuchowski mb I pushed again rn. Missed out a closing bracket.

      +

      *Thread Reply:* @Maciej Obuchowski mb I pushed again rn. Missed out a closing bracket.

      @@ -82077,7 +82083,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-10 00:47:04
      -

      *Thread Reply:* @Maciej Obuchowski Hi, could we merge this PR in? I'd like to see if we can have these changes in the new release...

      +

      *Thread Reply:* @Maciej Obuchowski Hi, could we merge this PR in? I'd like to see if we can have these changes in the new release...

      @@ -82129,7 +82135,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-16 13:29:44
      -

      *Thread Reply:* Hi 👋 this would require a series of JSON calls:

      +

      *Thread Reply:* Hi 👋 this would require a series of JSON calls:

      1. start the first task
      2. end the first task, specify output dataset
      3. start the second task, specify input dataset
      4. end the second task
      @@ -82158,7 +82164,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-16 13:32:08
      -

*Thread Reply:* in OpenLineage relationships are typically Job -> Dataset -> Job, so

+

*Thread Reply:* in OpenLineage relationships are typically Job -> Dataset -> Job, so
• you create a relationship between datasets by referring to them in the same job - i.e., this task ran and read from these datasets and wrote to those datasets
• you create a relationship between tasks by referring to the same datasets across both of them - i.e., this task wrote that dataset and this other task read from it

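As a rough sketch of that sequence with the Python client (the URL, namespace, and names are all made up; the shared dataset is what ties the two tasks together):

from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # e.g. a local Marquez
producer = "https://example.com/demo"
shared = Dataset(namespace="example", name="table_a")  # dataset linking both tasks

def now():
    return datetime.now(timezone.utc).isoformat()

# 1. start the first task; 2. end it, specifying table_a as its output
first_run, first_job = Run(runId=str(uuid4())), Job(namespace="example", name="first_task")
client.emit(RunEvent(RunState.START, now(), first_run, first_job, producer))
client.emit(RunEvent(RunState.COMPLETE, now(), first_run, first_job, producer, outputs=[shared]))

# 3. start the second task, specifying table_a as its input; 4. end it
second_run, second_job = Run(runId=str(uuid4())), Job(namespace="example", name="second_task")
client.emit(RunEvent(RunState.START, now(), second_run, second_job, producer, inputs=[shared]))
client.emit(RunEvent(RunState.COMPLETE, now(), second_run, second_job, producer))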
      @@ -82186,7 +82192,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-16 13:35:06
      -

      *Thread Reply:* @Bramha Aelem if you look in this directory, you can find example start/complete JSON calls that show how to specify input/output datasets.

      +

      *Thread Reply:* @Bramha Aelem if you look in this directory, you can find example start/complete JSON calls that show how to specify input/output datasets.

      (it’s an airflow workshop, but those examples are for a part of the workshop that doesn’t involve airflow)

      @@ -82214,7 +82220,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-16 13:35:46
      -

      *Thread Reply:* (these can also be found in the docs)

      +

      *Thread Reply:* (these can also be found in the docs)

      @@ -82265,7 +82271,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-16 14:49:30
      -

      *Thread Reply:* @Ross Turk - Thanks for the details. will try and get back to you on it

      +

      *Thread Reply:* @Ross Turk - Thanks for the details. will try and get back to you on it

      @@ -82291,7 +82297,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-17 19:53:21
      -

      *Thread Reply:* @Ross Turk - Good Evening, It worked as expected. I am able to replicate the scenarios which I am looking for.

      +

      *Thread Reply:* @Ross Turk - Good Evening, It worked as expected. I am able to replicate the scenarios which I am looking for.

      @@ -82321,7 +82327,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-17 19:53:48
      -

      *Thread Reply:* @Ross Turk - Thanks for your response.

      +

      *Thread Reply:* @Ross Turk - Thanks for your response.

      @@ -82347,7 +82353,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-12 13:23:56
      -

*Thread Reply:* @Ross Turk - First activity: I am making an HTTP call to pull the lookup data and store it in ADLS.

+

*Thread Reply:* @Ross Turk - First activity: I am making an HTTP call to pull the lookup data and store it in ADLS.
Second activity: After the completion of the first activity, I am making an Azure Databricks call to use the lookup file and generate the output tables. How can I refer to the Databricks-generated table facets as an input to the subsequent activities in the pipeline? When I refer to them as an input, the Spark table metadata is not showing up. How can this be achieved?
After the execution of each activity in the ADF pipeline I am sending start and complete/fail event lineage to Marquez.

      @@ -82460,7 +82466,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-20 14:27:59
      -

*Thread Reply:* welcome! let us know if you have any questions

      +

*Thread Reply:* welcome! let us know if you have any questions

      @@ -82512,7 +82518,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-29 10:47:00
      -

      *Thread Reply:* Hey Matt! Happy to have you here! Feel free to reach out if you have any questions

      +

      *Thread Reply:* Hey Matt! Happy to have you here! Feel free to reach out if you have any questions

      @@ -82728,7 +82734,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-03 13:20:14
      -

      *Thread Reply:* @Ross Turk maybe you can help with this?

      +

      *Thread Reply:* @Ross Turk maybe you can help with this?

      @@ -82754,7 +82760,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-03 13:34:23
      -

      *Thread Reply:* hmm, I believe the dbt-ol integration does capture bytes/rows, but only for some data sources: https://github.com/OpenLineage/OpenLineage/blob/6ae1fd5665d5fd539b05d044f9b6fb831ce9d475/integration/common/openlineage/common/provider/dbt.py#L567

      +

      *Thread Reply:* hmm, I believe the dbt-ol integration does capture bytes/rows, but only for some data sources: https://github.com/OpenLineage/OpenLineage/blob/6ae1fd5665d5fd539b05d044f9b6fb831ce9d475/integration/common/openlineage/common/provider/dbt.py#L567

      @@ -82805,7 +82811,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-03 13:34:58
      -

      *Thread Reply:* I haven't personally tried it with Snowflake in a few versions, but the code suggests that it's one of them.

      +

      *Thread Reply:* I haven't personally tried it with Snowflake in a few versions, but the code suggests that it's one of them.

      @@ -82831,7 +82837,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-03 13:35:42
      -

      *Thread Reply:* @Max you say your dbt-ol run is resulting in only jobs and no datasets emitted, is that correct?

      +

      *Thread Reply:* @Max you say your dbt-ol run is resulting in only jobs and no datasets emitted, is that correct?

      @@ -82857,7 +82863,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-03 13:38:06
      -

      *Thread Reply:* if so, I'd say something rather strange is going on because in my experience each model should result in a Job and a Dataset.

      +

      *Thread Reply:* if so, I'd say something rather strange is going on because in my experience each model should result in a Job and a Dataset.

      @@ -82909,7 +82915,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-03 01:53:10
      -

      *Thread Reply:* I was looking for something similar to the airflow integration

      +

      *Thread Reply:* I was looking for something similar to the airflow integration

      @@ -82935,7 +82941,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-03 13:21:18
      -

      *Thread Reply:* hey @Kuldeep - i don't think there's something for Luigi right now - is that something you'd potentially be interested in?

      +

      *Thread Reply:* hey @Kuldeep - i don't think there's something for Luigi right now - is that something you'd potentially be interested in?

      @@ -82961,7 +82967,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-03 13:23:53
      -

      *Thread Reply:* @Viraj Parekh Yes this is something we are interested in! There are a lot of projects out there that use luigi

      +

      *Thread Reply:* @Viraj Parekh Yes this is something we are interested in! There are a lot of projects out there that use luigi

      @@ -83026,7 +83032,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-03 23:07:59
      -

*Thread Reply:* Hi @Michael Robinson a new, improved mechanism for passing conf parameters to the OpenLineage client in the Spark integration

+

*Thread Reply:* Hi @Michael Robinson a new, improved mechanism for passing conf parameters to the OpenLineage client in the Spark integration
Would it be possible to have more details on what this entails please? Thanks!

      @@ -83053,7 +83059,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-04 09:21:46
      -

      *Thread Reply:* @Tomasz Nazarewicz might explain this better

      +

      *Thread Reply:* @Tomasz Nazarewicz might explain this better

      @@ -83079,7 +83085,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-04 10:04:22
      -

*Thread Reply:* @Anirudh Shrinivason until now, if you wanted to add a new property to the OL client, you also had to implement it in the integration because it had to parse all properties, create appropriate objects, etc. The new implementation makes client properties transparent to the integration; they are only passed through, and parsing happens inside the client.

      +

*Thread Reply:* @Anirudh Shrinivason until now, if you wanted to add a new property to the OL client, you also had to implement it in the integration because it had to parse all properties, create appropriate objects, etc. The new implementation makes client properties transparent to the integration; they are only passed through, and parsing happens inside the client.

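For instance, transport settings can be set straight from Spark conf and reach the client unparsed by the integration (a sketch; the URL is made up):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # these client options are passed through; the client itself parses them
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")
    .getOrCreate()
)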
      @@ -83105,7 +83111,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-04 13:02:39
      -

      *Thread Reply:* Thanks, all. The release is authorized and will commence shortly 🙂

      +

      *Thread Reply:* Thanks, all. The release is authorized and will commence shortly 🙂

      @@ -83131,7 +83137,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-04 22:00:55
      -

      *Thread Reply:* @Tomasz Nazarewicz Ahh I see. Okay thanks!

      +

      *Thread Reply:* @Tomasz Nazarewicz Ahh I see. Okay thanks!

      @@ -83192,7 +83198,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-05 23:45:38
      -

      *Thread Reply:* @Michael Robinson Will there be a recording?

      +

      *Thread Reply:* @Michael Robinson Will there be a recording?

      @@ -83218,7 +83224,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-06 09:10:50
      -

      *Thread Reply:* @Anirudh Shrinivason Yes, and the recording will be here: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

      +

      *Thread Reply:* @Anirudh Shrinivason Yes, and the recording will be here: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

      @@ -83364,7 +83370,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-09 02:28:43
      -

*Thread Reply:* Hi @Hanna Moazam, we've written a Jupyter notebook to demo the dataset symlinks feature:

+

*Thread Reply:* Hi @Hanna Moazam, we've written a Jupyter notebook to demo the dataset symlinks feature:
https://github.com/OpenLineage/workshops/blob/main/spark/dataset_symlinks.ipynb

For the scenario you describe, there should be a symlink facet sent, similar to:

@@ -83464,7 +83470,7 @@
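As a rough illustration of the facet's shape (identifier values are made up), a symlinks facet on a dataset carries a list of alternative identifiers:

# sketch of a symlinks dataset facet payload (values are illustrative)
symlinks_facet = {
    "symlinks": {
        "identifiers": [
            {"namespace": "hive://metastore-host:9083", "name": "db.table", "type": "TABLE"}
        ]
    }
}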

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-09 14:21:10
      -

      *Thread Reply:* This is so awesome, @Paweł Leszczyński - Thank you so much for sharing this! I'm wondering if we could extend this to capture the hive JDBC Connection URL. I will explore this and put in an issue and PR to try and extend it. Thank you for the insights!

      +

      *Thread Reply:* This is so awesome, @Paweł Leszczyński - Thank you so much for sharing this! I'm wondering if we could extend this to capture the hive JDBC Connection URL. I will explore this and put in an issue and PR to try and extend it. Thank you for the insights!

      @@ -83591,7 +83597,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-12 17:31:12
      -

      *Thread Reply:* @Varun Singh why not just use the KafkaTransport and the Event Hub's Kafka endpoint?

      +

      *Thread Reply:* @Varun Singh why not just use the KafkaTransport and the Event Hub's Kafka endpoint?

      https://github.com/yogyang/OpenLineage/blob/2b7fa8bbd19a2207d54756e79aea7a542bf7bb[…]/main/java/io/openlineage/client/transports/KafkaTransport.java

      @@ -83747,7 +83753,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-12 17:43:55
      -

*Thread Reply:* @Julien Le Dem is there a channel to discuss the community call / ask follow-up questions on the community call topics? For example, I wanted to ask more about the AirflowFacet and if we expected to introduce more tool-specific facets into the spec. Where's the right place to ask that question? On the PR?

      +

*Thread Reply:* @Julien Le Dem is there a channel to discuss the community call / ask follow-up questions on the community call topics? For example, I wanted to ask more about the AirflowFacet and if we expected to introduce more tool-specific facets into the spec. Where's the right place to ask that question? On the PR?

      @@ -83773,7 +83779,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-17 15:11:05
      -

*Thread Reply:* I think asking in #general is the right place. If there's a specific GitHub issue/PR, this is a good place as well. You can tag the relevant folks as well to get their attention

      +

*Thread Reply:* I think asking in #general is the right place. If there's a specific GitHub issue/PR, this is a good place as well. You can tag the relevant folks as well to get their attention

      @@ -83825,7 +83831,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-13 02:30:37
      -

*Thread Reply:* Hello @Allison Suarez, in case of InsertIntoHadoopFsRelationCommand, the Spark OpenLineage implementation uses the method:

+

*Thread Reply:* Hello @Allison Suarez, in case of InsertIntoHadoopFsRelationCommand, the Spark OpenLineage implementation uses the method:
DatasetIdentifier di = PathUtils.fromURI(command.outputPath().toUri(), "file");
(https://github.com/OpenLineage/OpenLineage/blob/0.19.2/integration/spark/shared/sr[…]ark/agent/lifecycle/plan/InsertIntoHadoopFsRelationVisitor.java)

      @@ -83885,7 +83891,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-22 10:04:56
      -

      *Thread Reply:* @Allison Suarez it would also be good to know what compute engine you're using to run your code on? On-Prem Apache Spark? Azure/AWS/GCP Databricks?

      +

      *Thread Reply:* @Allison Suarez it would also be good to know what compute engine you're using to run your code on? On-Prem Apache Spark? Azure/AWS/GCP Databricks?

      @@ -83911,7 +83917,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-13 18:18:52
      -

      *Thread Reply:* I created a custom visitor and fixed the issue that way, thank you!

      +

      *Thread Reply:* I created a custom visitor and fixed the issue that way, thank you!

      @@ -83993,7 +83999,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-13 14:39:51
      -

      *Thread Reply:* seems like a bug to me, but tagging @Tomasz Nazarewicz / @Paweł Leszczyński

      +

      *Thread Reply:* seems like a bug to me, but tagging @Tomasz Nazarewicz / @Paweł Leszczyński

      @@ -84019,7 +84025,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-13 15:22:19
      -

*Thread Reply:* So we needed a generic way of passing parameters to the client and made an assumption that every field with a ; will be treated as an array

      +

*Thread Reply:* So we needed a generic way of passing parameters to the client and made an assumption that every field with a ; will be treated as an array

      @@ -84045,7 +84051,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-14 02:00:04
      -

      *Thread Reply:* Thanks for the confirmation, should I add a condition to split only if it's a key that can have array values? We can have a list of such keys like facets.disabled

      +

      *Thread Reply:* Thanks for the confirmation, should I add a condition to split only if it's a key that can have array values? We can have a list of such keys like facets.disabled

      @@ -84071,7 +84077,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-14 02:28:41
      -

      *Thread Reply:* We thought about this solution but it forces us to know the structure of each config and we wanted to avoid that as much as possible

      +

      *Thread Reply:* We thought about this solution but it forces us to know the structure of each config and we wanted to avoid that as much as possible

      @@ -84097,7 +84103,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-14 02:34:06
      -

      *Thread Reply:* Maybe the condition could be having ; and [] in the value

      +

      *Thread Reply:* Maybe the condition could be having ; and [] in the value

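For example, the array convention under discussion looks like this (the facet names are purely illustrative):

# a ;-separated value wrapped in [] is what the client treats as an array
conf = {"spark.openlineage.facets.disabled": "[spark_unknown;spark.logicalPlan]"}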
      @@ -84127,7 +84133,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-15 08:14:14
      -

      *Thread Reply:* Makes sense, I can add this check. Thanks @Tomasz Nazarewicz!

      +

      *Thread Reply:* Makes sense, I can add this check. Thanks @Tomasz Nazarewicz!

      @@ -84153,7 +84159,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-16 01:15:19
      -

      *Thread Reply:* Created issue https://github.com/OpenLineage/OpenLineage/issues/1506 for this

      +

      *Thread Reply:* Created issue https://github.com/OpenLineage/OpenLineage/issues/1506 for this

      @@ -84325,7 +84331,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-17 15:56:38
      -

      *Thread Reply:* Thank you, Ross!! I appreciate it. I might have coordinated it, but it’s been a team effort. Lots of folks shared knowledge and time to help us check all the boxes, literally and figuratively (lots of boxes). ;)

      +

      *Thread Reply:* Thank you, Ross!! I appreciate it. I might have coordinated it, but it’s been a team effort. Lots of folks shared knowledge and time to help us check all the boxes, literally and figuratively (lots of boxes). ;)

      @@ -84440,7 +84446,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-18 11:17:14
      -

      *Thread Reply:* What are the errors?

      +

      *Thread Reply:* What are the errors?

      @@ -84466,7 +84472,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-18 11:18:09
      -

*Thread Reply:* 'dbt-ol' is not recognized as an internal or external command,

+

*Thread Reply:* 'dbt-ol' is not recognized as an internal or external command,
operable program or batch file.

      @@ -84493,7 +84499,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 11:11:09
      -

      *Thread Reply:* Hm, I think this is due to different windows conventions around scripts.

      +

      *Thread Reply:* Hm, I think this is due to different windows conventions around scripts.

      @@ -84519,7 +84525,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 14:26:35
      -

      *Thread Reply:* I have not tried it on Windows before myself, but on mac/linux if you make a Python virtual environment in venv/ and run pip install openlineage-dbt, the script winds up in ./venv/bin/dbt-ol.

      +

      *Thread Reply:* I have not tried it on Windows before myself, but on mac/linux if you make a Python virtual environment in venv/ and run pip install openlineage-dbt, the script winds up in ./venv/bin/dbt-ol.

      @@ -84545,7 +84551,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 14:27:04
      -

      *Thread Reply:* (maybe that helps!)

      +

      *Thread Reply:* (maybe that helps!)

      @@ -84571,7 +84577,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 14:38:23
      -

      *Thread Reply:* This might not work, but I think I have an idea that would allow it to run as python -m dbt-ol run ...

      +

      *Thread Reply:* This might not work, but I think I have an idea that would allow it to run as python -m dbt-ol run ...

      @@ -84597,7 +84603,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 14:38:27
      -

      *Thread Reply:* That needs one fix though

      +

      *Thread Reply:* That needs one fix though

      @@ -84623,7 +84629,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 14:40:52
      -

*Thread Reply:* Hi @Maciej Obuchowski, thanks for the input, when I try to use python -m dbt-ol run, I see the below error :(

+

*Thread Reply:* Hi @Maciej Obuchowski, thanks for the input, when I try to use python -m dbt-ol run, I see the below error :(
\python.exe: No module named dbt-ol

      @@ -84650,7 +84656,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-24 13:23:56
      -

      *Thread Reply:* We’re seeing a similar issue with the Great Expectations integration at the moment. This is purely a guess, but what happens when you try with openlineage-dbt 0.18.0?

      +

      *Thread Reply:* We’re seeing a similar issue with the Great Expectations integration at the moment. This is purely a guess, but what happens when you try with openlineage-dbt 0.18.0?

      @@ -84676,7 +84682,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-24 13:24:36
      -

      *Thread Reply:* @Michael Robinson GE issue is on Windows?

      +

      *Thread Reply:* @Michael Robinson GE issue is on Windows?

      @@ -84702,7 +84708,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-24 13:24:49
      -

      *Thread Reply:* No, not Windows

      +

      *Thread Reply:* No, not Windows

      @@ -84728,7 +84734,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-24 13:24:55
      -

      *Thread Reply:* (that I know of)

      +

      *Thread Reply:* (that I know of)

      @@ -84754,7 +84760,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-24 13:46:39
      -

      *Thread Reply:* @Michael Robinson - I see the same error. I used 2 Combinations

      +

      *Thread Reply:* @Michael Robinson - I see the same error. I used 2 Combinations

      1. Python 3.8.10 with openlineage-dbt 0.18.0 & Latest
      2. Python 3.9.7 with openlineage-dbt 0.18.0 & Latest
      @@ -84783,7 +84789,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-24 13:49:19
      -

      *Thread Reply:* Hm. You should be able to find the dbt-ol command wherever pip is installing the packages. In my case, that's usually in a virtual environment.

      +

      *Thread Reply:* Hm. You should be able to find the dbt-ol command wherever pip is installing the packages. In my case, that's usually in a virtual environment.

      But if I am not in a virtual environment, it installs the packages in my PYTHONPATH. You might try this to see if the dbt-ol script can be found in one of the directories in sys.path.

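One quick way to check, using only the standard library:

import shutil
import sys

print(shutil.which("dbt-ol"))   # is the entry point on PATH?
print(sys.prefix)               # which (virtual) environment is active?
print("\n".join(sys.path))      # where does this interpreter look for packages?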
      @@ -84820,7 +84826,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-24 13:58:38
      -

      *Thread Reply:* this can help you verify that your PYTHONPATH and PATH are correct - installing an unrelated python command-line tool and seeing if you can execute it:

      +

      *Thread Reply:* this can help you verify that your PYTHONPATH and PATH are correct - installing an unrelated python command-line tool and seeing if you can execute it:

      @@ -84855,7 +84861,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-24 13:59:42
      -

*Thread Reply:* Again, I think this is a Windows issue

      +

*Thread Reply:* Again, I think this is a Windows issue

      @@ -84881,7 +84887,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-24 14:00:54
      -

      *Thread Reply:* @Maciej Obuchowski you think even if dbt-ol could be found in the path, that might not be the issue?

      +

      *Thread Reply:* @Maciej Obuchowski you think even if dbt-ol could be found in the path, that might not be the issue?

      @@ -84907,7 +84913,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-24 14:15:13
      -

      *Thread Reply:* Hi @Ross Turk - I could not find the dbt-ol in the site-packages.

      +

      *Thread Reply:* Hi @Ross Turk - I could not find the dbt-ol in the site-packages.

      @@ -84933,7 +84939,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-24 14:16:48
      -

      *Thread Reply:* Hm 😕 then perhaps @Maciej Obuchowski is right and there is a bigger issue here

      +

      *Thread Reply:* Hm 😕 then perhaps @Maciej Obuchowski is right and there is a bigger issue here

      @@ -84959,7 +84965,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-24 14:31:15
      -

*Thread Reply:* @Ross Turk & @Maciej Obuchowski I see the issue even when I do the install using the https://pypi.org/project/openlineage-dbt/#files - openlineage-dbt-0.19.2.tar.gz.

      +

*Thread Reply:* @Ross Turk & @Maciej Obuchowski I see the issue even when I do the install using the https://pypi.org/project/openlineage-dbt/#files - openlineage-dbt-0.19.2.tar.gz.

      For some reason, I see only the following folder created

      @@ -85056,7 +85062,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-18 18:54:57
      -

      *Thread Reply:* This is excellent! May we promote it on openlineage and marquez social channels?

      +

      *Thread Reply:* This is excellent! May we promote it on openlineage and marquez social channels?

      @@ -85082,7 +85088,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-18 18:55:30
      -

      *Thread Reply:* This is an amazing write up! 🔥 💯 🚀

      +

      *Thread Reply:* This is an amazing write up! 🔥 💯 🚀

      @@ -85108,7 +85114,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-18 19:49:46
      -

*Thread Reply:* Happy to have it promoted. 😄

+

*Thread Reply:* Happy to have it promoted. 😄
Vish posted on LinkedIn: https://www.linkedin.com/posts/vishwanatha-nayak-b8462054automate-data-lineage-on-amazon-mwaa-with-activity-7021589819763945473-yMHF?utmsource=share&utmmedium=memberios if you want something to repost there.

      @@ -85200,7 +85206,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 03:58:42
      -

*Thread Reply:* Hello @Anirudh Shrinivason, you need to build your openlineage-java package first. Possibly you built it some time ago in a different version

      +

*Thread Reply:* Hello @Anirudh Shrinivason, you need to build your openlineage-java package first. Possibly you built it some time ago in a different version

      @@ -85226,7 +85232,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 03:59:28
      -

*Thread Reply:* ./gradlew clean build publishToMavenLocal

+

*Thread Reply:* ./gradlew clean build publishToMavenLocal
in /client/java should help.

      @@ -85253,7 +85259,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 04:34:33
      -

      *Thread Reply:* Ahh yeap this works thanks! 🙂

      +

      *Thread Reply:* Ahh yeap this works thanks! 🙂

      @@ -85305,7 +85311,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 11:03:39
      -

*Thread Reply:* It's been a while since I looked at Atlas, but does it even now support anything other than very Java Apache-adjacent projects like Hive and HBase?

      +

*Thread Reply:* It's been a while since I looked at Atlas, but does it even now support anything other than very Java Apache-adjacent projects like Hive and HBase?

      @@ -85331,7 +85337,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 13:10:11
      -

      *Thread Reply:* To directly answer your question @Sheeri Cabral (Collibra): I am not aware of any resources currently that explain this 😞 but I would welcome the creation of one & pitch in where possible!

      +

      *Thread Reply:* To directly answer your question @Sheeri Cabral (Collibra): I am not aware of any resources currently that explain this 😞 but I would welcome the creation of one & pitch in where possible!

      @@ -85361,7 +85367,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-20 17:00:25
      -

      *Thread Reply:* I don’t know enough about Atlas to make that doc.

      +

      *Thread Reply:* I don’t know enough about Atlas to make that doc.

      @@ -85421,7 +85427,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 11:00:54
      -

      *Thread Reply:* I think most of your questions will be answered by this video: https://www.youtube.com/watch?v=LRr-ja8_Wjs

      +

      *Thread Reply:* I think most of your questions will be answered by this video: https://www.youtube.com/watch?v=LRr-ja8_Wjs

      @@ -85481,7 +85487,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 13:10:58
      -

      *Thread Reply:* I agree - a lot of the answers are in that overview video. You might also take a look at the docs, they do a pretty good job of explaining how it works.

      +

      *Thread Reply:* I agree - a lot of the answers are in that overview video. You might also take a look at the docs, they do a pretty good job of explaining how it works.

      @@ -85507,7 +85513,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 13:19:34
      -

*Thread Reply:* More explicitly:

+

*Thread Reply:* More explicitly:
• Airflow is an interesting platform to observe because it runs a large variety of workloads and lineage can only be automatically extracted for some of them
• In general, OpenLineage is essentially a standard and data model for lineage. There are integrations for various systems, including Airflow, that cause them to emit lineage events to an OpenLineage compatible backend. It's a push model.
• Marquez is one such backend, and the one I recommend for testing & development

@@ -85545,7 +85551,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 13:23:33
      -

      *Thread Reply:* FWIW I did a workshop on openlineage and airflow a while back, and it's all in this repo. You can find slides + a quick Python example + a simple Airflow example in there.

      +

      *Thread Reply:* FWIW I did a workshop on openlineage and airflow a while back, and it's all in this repo. You can find slides + a quick Python example + a simple Airflow example in there.

      @@ -85571,7 +85577,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-20 03:44:22
      -

      *Thread Reply:* Thanks a lot!! Very helpful!

      +

      *Thread Reply:* Thanks a lot!! Very helpful!

      @@ -85597,7 +85603,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-20 11:42:43
      -

      *Thread Reply:* 👍

      +

      *Thread Reply:* 👍

      @@ -85649,7 +85655,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-20 19:45:55
      -

*Thread Reply:* Hello Brad, off the top of my head:

      +

*Thread Reply:* Hello Brad, off the top of my head:

      @@ -85675,7 +85681,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-20 19:47:15
      -

*Thread Reply:* • Marquez uses the API HTTP Post. so does Astro

+

*Thread Reply:* • Marquez uses the API HTTP Post. so does Astro
• Egeria and Purview prefer consuming through a Kafka topic. There is a ProxyBackend that takes HTTP Posts and writes to Kafka. The client can also be configured to write to Kafka

      @@ -85706,7 +85712,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-20 19:48:09
      -

      *Thread Reply:* @Will Johnson @Mandy Chessell might have opinions

      +

      *Thread Reply:* @Will Johnson @Mandy Chessell might have opinions

      @@ -85732,7 +85738,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-20 19:49:10
      -

      *Thread Reply:* The Microsoft Purview approach is documented here: https://learn.microsoft.com/en-us/samples/microsoft/purview-adb-lineage-solution-accelerator/azure-databricks-to-purview-lineage-connector/

      +

      *Thread Reply:* The Microsoft Purview approach is documented here: https://learn.microsoft.com/en-us/samples/microsoft/purview-adb-lineage-solution-accelerator/azure-databricks-to-purview-lineage-connector/

      @@ -85783,7 +85789,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-20 19:49:47
      -

      *Thread Reply:* There’s a blog post about Egeria here: https://openlineage.io/blog/openlineage-egeria/

      +

      *Thread Reply:* There’s a blog post about Egeria here: https://openlineage.io/blog/openlineage-egeria/

      @@ -85830,7 +85836,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-22 10:00:56
      -

*Thread Reply:* @Brad Paskewitz at Microsoft, in the solution that Julien linked above, we are using the HTTP Transport (REST API), consuming the OpenLineage events and transforming them to Apache Atlas / Microsoft Purview.

      +

*Thread Reply:* @Brad Paskewitz at Microsoft, in the solution that Julien linked above, we are using the HTTP Transport (REST API), consuming the OpenLineage events and transforming them to Apache Atlas / Microsoft Purview.

      However, there is a good deal of interest in using the kafka transport instead and that's our future roadmap.

      @@ -85914,7 +85920,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-25 12:04:48
      -

      *Thread Reply:* Hmm, if Databricks is shutting the process down without waiting for the ListenerBus to clear, I don’t know that there’s a lot we can do. The best thing is to somehow delay the main application thread from exiting. One thing you could try is to subclass the OpenLineageSparkListener and generate a lock for each SparkListenerSQLExecutionStart and release it when the accompanying SparkListenerSQLExecutionEnd event is processed. Then, in the main application, block until all such locks are released. If you try it and it works, let us know!

      +

      *Thread Reply:* Hmm, if Databricks is shutting the process down without waiting for the ListenerBus to clear, I don’t know that there’s a lot we can do. The best thing is to somehow delay the main application thread from exiting. One thing you could try is to subclass the OpenLineageSparkListener and generate a lock for each SparkListenerSQLExecutionStart and release it when the accompanying SparkListenerSQLExecutionEnd event is processed. Then, in the main application, block until all such locks are released. If you try it and it works, let us know!

      @@ -85940,7 +85946,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-26 05:46:35
      -

      *Thread Reply:* Ok thanks for the idea! I'll tell you if I try this and if it works 🤞

      +

      *Thread Reply:* Ok thanks for the idea! I'll tell you if I try this and if it works 🤞

      @@ -85992,7 +85998,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-25 10:38:03
      -

      *Thread Reply:* Hey Petr. Please DM me or describe the issue here 🙂

      +

      *Thread Reply:* Hey Petr. Please DM me or describe the issue here 🙂

      @@ -86044,7 +86050,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-27 15:25:04
      -

*Thread Reply:* Command

+

*Thread Reply:* Command
spark-submit --packages "io.openlineage:openlineage_spark:0.19.+" \
  --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener" \
  --conf "spark.openlineage.transport.type=kafka" \

@@ -86076,7 +86082,7 @@
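A fuller sketch of the Kafka settings for comparison (the topicName and properties.** key names are assumptions based on the client's Kafka transport config and may differ by version; broker and topic values are made up):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "kafka")
    # assumed key names for the topic and producer properties (verify against the docs)
    .config("spark.openlineage.transport.topicName", "openlineage.events")
    .config("spark.openlineage.transport.properties.bootstrap.servers", "broker:9092")
    .getOrCreate()
)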

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-27 15:25:14
      -

*Thread Reply:* 23/01/27 17:29:06 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception

+

*Thread Reply:* 23/01/27 17:29:06 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception
java.lang.NullPointerException
    at io.openlineage.client.transports.TransportFactory.build(TransportFactory.java:44)
    at io.openlineage.spark.agent.EventEmitter.&lt;init&gt;(EventEmitter.java:40)

@@ -86121,7 +86127,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-27 15:25:31
      -

      *Thread Reply:* I would appreciate any pointers on getting started with using openlineage-spark with Kafka.

      +

      *Thread Reply:* I would appreciate any pointers on getting started with using openlineage-spark with Kafka.

      @@ -86147,7 +86153,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-27 16:15:00
      -

      *Thread Reply:* Also this might seem a little elementary but the kafka topic itself, should it be hosted on the spark cluster or could it be any kafka topic?

      +

      *Thread Reply:* Also this might seem a little elementary but the kafka topic itself, should it be hosted on the spark cluster or could it be any kafka topic?

      @@ -86173,7 +86179,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 08:37:07
      -

      *Thread Reply:* 👀 Could I get some help on this, please?

      +

      *Thread Reply:* 👀 Could I get some help on this, please?

      @@ -86199,7 +86205,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 09:07:08
      -

*Thread Reply:* I think any NullPointerException is clearly our bug, can you open an issue on the OL GitHub?

      +

*Thread Reply:* I think any NullPointerException is clearly our bug, can you open an issue on the OL GitHub?

      @@ -86229,7 +86235,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 09:30:51
      -

*Thread Reply:* @Maciej Obuchowski Another interesting thing is if I use version 0.19.2 specifically, I get

+

*Thread Reply:* @Maciej Obuchowski Another interesting thing is if I use version 0.19.2 specifically, I get
23/01/30 14:28:33 INFO RddExecutionContext: RDDs are empty: skipping sending OpenLineage event

      I am trying to print to console at the moment. I haven't been able to get Kafka transport type working though.

      @@ -86258,7 +86264,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 09:41:12
      -

      *Thread Reply:* Are you getting events printed on the console though? This log should not affect you if you're running, for example Spark SQL jobs

      +

      *Thread Reply:* Are you getting events printed on the console though? This log should not affect you if you're running, for example Spark SQL jobs

      @@ -86284,7 +86290,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 09:42:28
      -

*Thread Reply:* I am trying to run a python file using pyspark.
23/01/30 14:40:49 INFO RddExecutionContext: RDDs are empty: skipping sending OpenLineage event

+

*Thread Reply:* I am trying to run a python file using pyspark.
23/01/30 14:40:49 INFO RddExecutionContext: RDDs are empty: skipping sending OpenLineage event
I see this and don't see any events on the console.

      @@ -86311,7 +86317,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 09:55:41
      -

*Thread Reply:* Any logs matching the pattern

+

*Thread Reply:* Any logs matching the pattern
log.warn("Unable to access job conf from RDD", nfe);
or
log.info("Found job conf from RDD {}", jc);

@@ -86341,7 +86347,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 09:57:20
      -

*Thread Reply:* ```23/01/30 14:40:48 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[2] at reduceByKey at /tmp/spark-20487725-f49b-4587-986d-e63a61890673/statusapidemo.py:47), which has no missing parents

+

*Thread Reply:* ```23/01/30 14:40:48 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[2] at reduceByKey at /tmp/spark-20487725-f49b-4587-986d-e63a61890673/statusapidemo.py:47), which has no missing parents
23/01/30 14:40:49 WARN RddExecutionContext: Unable to access job conf from RDD
java.lang.NoSuchFieldException: Field is not instance of HadoopMapRedWriteConfigUtil
    at io.openlineage.spark.agent.lifecycle.RddExecutionContext.lambda$setActiveJob$0(RddExecutionContext.java:117)

@@ -86396,7 +86402,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 10:03:35
      -

      *Thread Reply:* I think this is same problem as this: https://github.com/OpenLineage/OpenLineage/issues/1521

      +

      *Thread Reply:* I think this is same problem as this: https://github.com/OpenLineage/OpenLineage/issues/1521

      @@ -86422,7 +86428,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 10:04:14
      -

*Thread Reply:* and I think I might have a solution on a branch for it, just need to polish it up to release

      +

*Thread Reply:* and I think I might have a solution on a branch for it, just need to polish it up to release

      @@ -86448,7 +86454,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 10:13:37
      -

      *Thread Reply:* Aah got it. I will give it a try with SQL and a jar.

      +

      *Thread Reply:* Aah got it. I will give it a try with SQL and a jar.

Do you have an ETA on when the python issue would be fixed?

      @@ -86476,7 +86482,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 10:37:51
      -

      *Thread Reply:* @Maciej Obuchowski Well I run into the same errors if I run spark-submit on a jar.

      +

      *Thread Reply:* @Maciej Obuchowski Well I run into the same errors if I run spark-submit on a jar.

      @@ -86502,7 +86508,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 10:38:44
      -

      *Thread Reply:* I think that has nothing to do with python

      +

      *Thread Reply:* I think that has nothing to do with python

      @@ -86528,7 +86534,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 10:39:16
      -

      *Thread Reply:* BTW, which Spark version are you using?

      +

      *Thread Reply:* BTW, which Spark version are you using?

      @@ -86554,7 +86560,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 10:41:22
      -

      *Thread Reply:* We are on 3.3.1

      +

      *Thread Reply:* We are on 3.3.1

      @@ -86584,7 +86590,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 11:38:24
      -

*Thread Reply:* @Maciej Obuchowski Do you have an estimated release date for the fix? Our team is specifically interested in using the Emitter to write out to Kafka.

      +

*Thread Reply:* @Maciej Obuchowski Do you have an estimated release date for the fix? Our team is specifically interested in using the Emitter to write out to Kafka.

      @@ -86610,7 +86616,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 11:46:30
      -

      *Thread Reply:* I think we plan to release somewhere in the next week

      +

      *Thread Reply:* I think we plan to release somewhere in the next week

      @@ -86640,7 +86646,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-06 09:21:25
      -

      *Thread Reply:* @Susmitha Anandarao PR fixing this has been merged, release should be today

      +

      *Thread Reply:* @Susmitha Anandarao PR fixing this has been merged, release should be today

      @@ -86699,7 +86705,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-27 18:38:40
      -

      *Thread Reply:* hmmm, I am not sure. perhaps @Benji Lampel can help, he’s very familiar with those operators.

      +

      *Thread Reply:* hmmm, I am not sure. perhaps @Benji Lampel can help, he’s very familiar with those operators.

      @@ -86729,7 +86735,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-27 18:46:15
      -

      *Thread Reply:* @Benji Lampel any help would be appreciated!

      +

      *Thread Reply:* @Benji Lampel any help would be appreciated!

      @@ -86755,7 +86761,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 09:01:34
      -

*Thread Reply:* Hey Paul, the SQLCheckExtractors were written with the intent that they would be used by a provider that inherits from them - they are all treated as a sort of base class. What is the exact error message you're getting? And what is the operator code?

+

*Thread Reply:* Hey Paul, the SQLCheckExtractors were written with the intent that they would be used by a provider that inherits from them - they are all treated as a sort of base class. What is the exact error message you're getting? And what is the operator code?
Could you try this with a PostgresCheckOperator? (Also, only the SqlColumnCheckOperator and SqlTableCheckOperator will provide data quality facets in their output; those functions are not implementable in the other operators at this time)

      @@ -86787,7 +86793,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-31 14:36:07
      -

      *Thread Reply:* @Benji Lampel here is the error message. i am not sure what the operator code is.

      +

      *Thread Reply:* @Benji Lampel here is the error message. i am not sure what the operator code is.

[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - Traceback (most recent call last):
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner

@@ -86832,7 +86838,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-31 14:37:06
      -

      *Thread Reply:* and above that

      +

      *Thread Reply:* and above that

      [2023-01-31, 00:32:38 UTC] {connection.py:424} ERROR - Unable to retrieve connection from secrets backend (EnvironmentVariablesBackend). Checking subsequent secrets backend. Traceback (most recent call last): @@ -86867,7 +86873,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-31 14:39:31
      -

      *Thread Reply:* sorry, i should mention we're wrapping over the CheckOperator as we're still migrating from 1.10.15 @Benji Lampel

      +

      *Thread Reply:* sorry, i should mention we're wrapping over the CheckOperator as we're still migrating from 1.10.15 @Benji Lampel

      @@ -86893,7 +86899,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-31 15:09:51
      -

      *Thread Reply:* What do you mean by wrapping the CheckOperator? Like how so, exactly? Can you show me the operator code you're using in the DAG?

      +

      *Thread Reply:* What do you mean by wrapping the CheckOperator? Like how so, exactly? Can you show me the operator code you're using in the DAG?

      @@ -86919,7 +86925,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-31 17:38:45
      -

      *Thread Reply:* like so

      +

      *Thread Reply:* like so

      class CustomSQLCheckOperator(CheckOperator): ....

      @@ -86948,7 +86954,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-31 17:39:30
      -

*Thread Reply:* i think i found the issue though: we have our own get_hook function, so we don't follow the traditional Airflow way of setting CONN_ID, which is why CONN_ID is always None. That path is only reached through OpenLineage, which never gets called with our custom wrapper

      +

*Thread Reply:* i think i found the issue though: we have our own get_hook function, so we don't follow the traditional Airflow way of setting CONN_ID, which is why CONN_ID is always None. That path is only reached through OpenLineage, which never gets called with our custom wrapper

      @@ -87030,7 +87036,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-31 12:57:47
      -

*Thread Reply:* The dbt integration uses the python client, so you should be able to do something similar to the java client. See here: https://github.com/OpenLineage/OpenLineage/tree/main/client/python#kafka

      +

*Thread Reply:* The dbt integration uses the python client, so you should be able to do something similar to the java client. See here: https://github.com/OpenLineage/OpenLineage/tree/main/client/python#kafka

      @@ -87056,7 +87062,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-31 13:26:33
      -

      *Thread Reply:* Thank you for this!

      +

      *Thread Reply:* Thank you for this!

I created an openlineage.yml file with the following data to test out the integration locally. transport: @@ -87130,7 +87136,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-02 14:39:29
      -

*Thread Reply:* @Susmitha Anandarao It's not installed because it's a large binary package. We don't want to install for every user something the vast majority won't use, and it's 100x bigger than the rest of the client.

      +

*Thread Reply:* @Susmitha Anandarao It's not installed because it's a large binary package. We don't want to install for every user something the vast majority won't use, and it's 100x bigger than the rest of the client.

We need to indicate this way better, and not throw this error directly at the user though, both in docs and code.
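For later readers, a minimal openlineage.yml sketch for the Kafka transport, based on the python client docs linked above (the broker and topic are placeholders, and the kafka extra / confluent-kafka package must be installed separately, per the note above):

transport:
  type: kafka
  config:
    bootstrap.servers: localhost:9092
  topic: openlineage.events
  flush: true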

      @@ -87349,7 +87355,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-02 08:39:34
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/999#issuecomment-1209048556

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/999#issuecomment-1209048556

      Does this fit your experience?

      @@ -87377,7 +87383,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-02 08:39:59
      -

*Thread Reply:* We sometimes experience this in the context of very small, quick jobs

      +

*Thread Reply:* We sometimes experience this in the context of very small, quick jobs

      @@ -87403,7 +87409,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-02 08:43:24
      -

      *Thread Reply:* Yes, my scenarios are dealing with quick jobs. +

      *Thread Reply:* Yes, my scenarios are dealing with quick jobs. Good to know that we will be able to solve it with future spark versions. Thanks!

      @@ -87509,7 +87515,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-03 13:24:03
      -

      *Thread Reply:* exciting to see the client proxy work being released by @Minkyu Park 💯

      +

      *Thread Reply:* exciting to see the client proxy work being released by @Minkyu Park 💯

      @@ -87535,7 +87541,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-03 13:35:38
      -

      *Thread Reply:* This was without a doubt among the fastest release votes we’ve ever had 😉 . Thank you! You can expect the release to happen on Monday.

      +

      *Thread Reply:* This was without a doubt among the fastest release votes we’ve ever had 😉 . Thank you! You can expect the release to happen on Monday.

      @@ -87561,7 +87567,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-03 14:02:52
      -

      *Thread Reply:* Lol the proxy is still in development and not ready for use

      +

      *Thread Reply:* Lol the proxy is still in development and not ready for use

      @@ -87587,7 +87593,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-03 14:03:26
      -

      *Thread Reply:* Good point! Let’s make that clear in the release / docs?

      +

      *Thread Reply:* Good point! Let’s make that clear in the release / docs?

      @@ -87617,7 +87623,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-03 14:03:33
      -

      *Thread Reply:* But it doesn’t block anything anyway, so happy to see the release

      +

      *Thread Reply:* But it doesn’t block anything anyway, so happy to see the release

      @@ -87643,7 +87649,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-03 14:04:38
      -

      *Thread Reply:* We can celebrate that the proposal for the proxy is merged. I’m happy with that 🥳

      +

      *Thread Reply:* We can celebrate that the proposal for the proxy is merged. I’m happy with that 🥳

      @@ -87699,7 +87705,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-06 14:47:15
      -

      *Thread Reply:* Hey @Daniel Joanes! thanks for the question.

      +

      *Thread Reply:* Hey @Daniel Joanes! thanks for the question.

      I am not aware of an issue that captures this. Column-level lineage is a somewhat new facet in the spec, and implementations across the various integrations are in varying states of readiness.

      @@ -87729,7 +87735,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-06 14:50:59
      -

      *Thread Reply:* Go for it, once it's created i'll add a watch

      +

      *Thread Reply:* Go for it, once it's created i'll add a watch

      @@ -87759,7 +87765,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-06 14:51:13
      -

      *Thread Reply:* Thanks Ross!

      +

      *Thread Reply:* Thanks Ross!

      @@ -87785,7 +87791,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-06 23:10:30
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1581

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1581

      @@ -87945,7 +87951,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-09 10:50:49
      -

      *Thread Reply:* Thanks for requesting a release. 3 +1s from committers will authorize an immediate release.

      +

      *Thread Reply:* Thanks for requesting a release. 3 +1s from committers will authorize an immediate release.

      @@ -87971,7 +87977,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-09 11:15:35
      -

      *Thread Reply:* 0.20.5 ?

      +

      *Thread Reply:* 0.20.5 ?

      @@ -88001,7 +88007,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-09 11:28:20
      -

      *Thread Reply:* @Michael Robinson auth'd

      +

      *Thread Reply:* @Michael Robinson auth'd

      @@ -88027,7 +88033,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-09 11:32:06
      -

      *Thread Reply:* 👍 the release is authorized

      +

      *Thread Reply:* 👍 the release is authorized

      @@ -88126,14 +88132,14 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      "$ref": "#/$defs/Job" }, "inputs": { - "description": "The set of <strong><em>*input</em><em></strong> datasets.", + "description": "The set of <strong><em>*input</em></strong>* datasets.", "type": "array", "items": { "$ref": "#/$defs/InputDataset" } }, "outputs": { - "description": "The set of <strong></em><em>output</em>*</strong> datasets.", + "description": "The set of <strong><em>*output</em></strong>* datasets.", "type": "array", "items": { "$ref": "#/$defs/OutputDataset" @@ -88389,7 +88395,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-09 16:23:36
      -

      *Thread Reply:* I am also exploring something similar, but writing to kafka, and would want to know more on how we could add custom metadata from spark.

      +

      *Thread Reply:* I am also exploring something similar, but writing to kafka, and would want to know more on how we could add custom metadata from spark.

      @@ -88415,7 +88421,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-10 02:23:40
      -

      *Thread Reply:* Hi @Avinash Pancham @Susmitha Anandarao, it's great to hear about successful experimenting on your side.

      +

      *Thread Reply:* Hi @Avinash Pancham @Susmitha Anandarao, it's great to hear about successful experimenting on your side.

Although the OpenLineage spec provides some built-in facet definitions, a facet object can be anything you want (https://openlineage.io/apidocs/openapi/#tag/OpenLineage/operation/postRunEvent). The example metadata provided in this chat could be put into job or run facets, I believe.
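To make that concrete, a minimal Python client sketch that attaches a custom run facet to an event (the facet name and its fields are made up for the example; class names are as in the openlineage-python package):

import attr
from uuid import uuid4
from datetime import datetime, timezone
from openlineage.client import OpenLineageClient
from openlineage.client.facet import BaseFacet
from openlineage.client.run import Job, Run, RunEvent, RunState

@attr.s
class TeamRunFacet(BaseFacet):
    # hypothetical custom fields
    team = attr.ib()
    cost_center = attr.ib()

client = OpenLineageClient.from_environment()
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid4()), facets={"team": TeamRunFacet("data-eng", "42")}),
    job=Job(namespace="my-namespace", name="my-job"),
    producer="https://example.com/my-producer",
))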

      @@ -88445,7 +88451,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-10 09:09:10
      -

      *Thread Reply:* (I would love to see what your specs are! I’m not with Astronomer, just a community member, but I am finding that many of the customizations people are making to the spec are valuable ones that we should consider adding to core)

      +

      *Thread Reply:* (I would love to see what your specs are! I’m not with Astronomer, just a community member, but I am finding that many of the customizations people are making to the spec are valuable ones that we should consider adding to core)

      @@ -88471,7 +88477,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-14 16:51:28
      -

      *Thread Reply:* Are there any examples out there of customizations already done in Spark? An example would definitely help!

      +

      *Thread Reply:* Are there any examples out there of customizations already done in Spark? An example would definitely help!

      @@ -88497,7 +88503,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 08:43:08
      -

      *Thread Reply:* I think @Will Johnson might have something to add about customization

      +

      *Thread Reply:* I think @Will Johnson might have something to add about customization

      @@ -88523,7 +88529,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 23:58:36
      -

      *Thread Reply:* Oh man... Mike Collado did a nice write up on Slack of how many different ways there are to customize / extend OpenLineage. I know we all talked about doing a blog post at one point!

      +

      *Thread Reply:* Oh man... Mike Collado did a nice write up on Slack of how many different ways there are to customize / extend OpenLineage. I know we all talked about doing a blog post at one point!

      @Susmitha Anandarao - You might take a look at https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/[…]ark/agent/facets/builder/DatabricksEnvironmentFacetBuilder.java which has a hard coded set of properties we are extracting.

      @@ -88797,7 +88803,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 04:52:59
      2023-02-15 07:19:39
      -

      *Thread Reply:* @Quentin Nambot this happens because we run additional integration tests against real databases (like BigQuery) which aren't ever configured on forks, since we don't want to expose our secrets. We need to figure out how to make this experience better, but in the meantime we've pushed your code using git-push-fork-to-upstream-branch and it passes all the tests.

      +

      *Thread Reply:* @Quentin Nambot this happens because we run additional integration tests against real databases (like BigQuery) which aren't ever configured on forks, since we don't want to expose our secrets. We need to figure out how to make this experience better, but in the meantime we've pushed your code using git-push-fork-to-upstream-branch and it passes all the tests.

      @@ -88850,7 +88856,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 07:21:49
      -

      *Thread Reply:* Feel free to un-draft your PR if you think it's ready for review

      +

      *Thread Reply:* Feel free to un-draft your PR if you think it's ready for review

      @@ -88876,7 +88882,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 08:03:56
      -

      *Thread Reply:* Ok nice thanks 👍

      +

      *Thread Reply:* Ok nice thanks 👍

      @@ -88902,7 +88908,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 08:04:49
      -

      *Thread Reply:* I think it's ready, however should I update the version somewhere?

      +

      *Thread Reply:* I think it's ready, however should I update the version somewhere?

      @@ -88928,7 +88934,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 08:42:39
      -

*Thread Reply:* @Quentin Nambot I don't think so - it's just that you opened the PR as Draft, so I'm not sure if you want to add something else to it.

      +

*Thread Reply:* @Quentin Nambot I don't think so - it's just that you opened the PR as Draft, so I'm not sure if you want to add something else to it.

      @@ -88958,7 +88964,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 08:43:36
      -

      *Thread Reply:* No I don't want to add anything so I opened it 👍

      +

      *Thread Reply:* No I don't want to add anything so I opened it 👍

      @@ -89035,7 +89041,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 21:30:47
      -

      *Thread Reply:* You can add your own EventHandlerFactory https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/api/OpenLineageEventHandlerFactory.java . See the docs about extending here - https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#extending

      +

      *Thread Reply:* You can add your own EventHandlerFactory https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/api/OpenLineageEventHandlerFactory.java . See the docs about extending here - https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#extending

      @@ -89085,7 +89091,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 21:32:19
      -

      *Thread Reply:* I have that set up already like this: +

*Thread Reply:* I have that set up already like this:
public class LyftOpenLineageEventHandlerFactory implements OpenLineageEventHandlerFactory {
  @Override
  public Collection<PartialFunction<LogicalPlan, List<OutputDataset>>> @@ -89122,7 +89128,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 21:33:35
      -

      *Thread Reply:* do I just add a constructor? the visitorFactory is private so I wasn't sure if that's something that was intended to change

      +

      *Thread Reply:* do I just add a constructor? the visitorFactory is private so I wasn't sure if that's something that was intended to change

      @@ -89148,7 +89154,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 21:34:30
      -

      *Thread Reply:* .

      +

      *Thread Reply:* .

      @@ -89174,7 +89180,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 21:34:49
      -

      *Thread Reply:* @Michael Collado

      +

      *Thread Reply:* @Michael Collado

      @@ -89200,7 +89206,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 21:35:14
      -

      *Thread Reply:* The VisitorFactory is only used by the internal EventHandlerFactory. It shouldn’t be needed for your custom one

      +

      *Thread Reply:* The VisitorFactory is only used by the internal EventHandlerFactory. It shouldn’t be needed for your custom one

      @@ -89226,7 +89232,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 21:35:48
      -

      *Thread Reply:* Have you added the file to the META-INF folder of your jar?

      +

      *Thread Reply:* Have you added the file to the META-INF folder of your jar?
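For the record, the file in question follows the standard Java ServiceLoader convention: a resource in your jar at

META-INF/services/io.openlineage.spark.api.OpenLineageEventHandlerFactory

containing a single line with the fully qualified name of the implementation class (the package here is hypothetical):

com.example.lineage.LyftOpenLineageEventHandlerFactory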

      @@ -89252,7 +89258,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 11:01:56
      -

*Thread Reply:* yes, I am able to use my custom event handler factory with a list of visitors, but for some reason I can't access the visitors for some commands; AlterTableAddPartitionCommand is one

      +

*Thread Reply:* yes, I am able to use my custom event handler factory with a list of visitors, but for some reason I can't access the visitors for some commands; AlterTableAddPartitionCommand is one

      @@ -89278,7 +89284,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 11:02:29
      -

      *Thread Reply:* so even if I set up everything correctly I am unable to reach the code for that specific visitor

      +

      *Thread Reply:* so even if I set up everything correctly I am unable to reach the code for that specific visitor

      @@ -89304,7 +89310,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 11:05:22
      -

*Thread Reply:* and my assumption is I can reach other commands but not this one because the command is not defined in the BaseVisitorFactory, but maybe I'm wrong @Michael Collado

      +

*Thread Reply:* and my assumption is I can reach other commands but not this one because the command is not defined in the BaseVisitorFactory, but maybe I'm wrong @Michael Collado

      @@ -89330,7 +89336,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 15:05:19
      -

      *Thread Reply:* the VisitorFactory is loaded by the InternalEventHandlerFactory here. However, the createOutputDatasetQueryPlanVisitors should contain a union of everything defined by the VisitorFactory as well as your custom visitors: see this code.

      +

      *Thread Reply:* the VisitorFactory is loaded by the InternalEventHandlerFactory here. However, the createOutputDatasetQueryPlanVisitors should contain a union of everything defined by the VisitorFactory as well as your custom visitors: see this code.

      @@ -89404,7 +89410,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 15:09:21
      -

      *Thread Reply:* there might be a conflict with another visitor that’s being matched against that command. Can you turn on debug logging and look for this line to see what visitor is being applied to that command?

      +

      *Thread Reply:* there might be a conflict with another visitor that’s being matched against that command. Can you turn on debug logging and look for this line to see what visitor is being applied to that command?

      @@ -89461,7 +89467,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 16:54:46
      -

      *Thread Reply:* This was helpful, it works now, thank you so much Michael!

      +

      *Thread Reply:* This was helpful, it works now, thank you so much Michael!

      @@ -89517,7 +89523,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:09:49
      -

      *Thread Reply:* what is the curl cmd you are running? and what endpoint are you hitting? (assuming Marquez?)

      +

      *Thread Reply:* what is the curl cmd you are running? and what endpoint are you hitting? (assuming Marquez?)

      @@ -89543,7 +89549,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:18:28
      -

      *Thread Reply:* yep +

*Thread Reply:* yep, I am running:
curl -X POST http://localhost:5000/api/v1/namespaces/test ^ -H 'Content-Type: application/json' ^ -d '{ownerName:"me", description:"no description"^ @@ -89578,7 +89584,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:23:08
      -

      *Thread Reply:* Marquez logs all supported endpoints (and methods) on start up. For example, here are all the supported methods on /api/v1/namespaces/{namespace} : +

*Thread Reply:* Marquez logs all supported endpoints (and methods) on start up. For example, here are all the supported methods on /api/v1/namespaces/{namespace}:
marquez-api | DELETE /api/v1/namespaces/{namespace} (marquez.api.NamespaceResource)
marquez-api | GET /api/v1/namespaces/{namespace} (marquez.api.NamespaceResource)
marquez-api | PUT /api/v1/namespaces/{namespace} (marquez.api.NamespaceResource) @@ -89608,7 +89614,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:26:23
      -

      *Thread Reply:* 3rd stupid question of the night +

      *Thread Reply:* 3rd stupid question of the night Sorry kept on trying POST who knows why
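For reference, a request that should work against that endpoint, using PUT and a valid JSON body (note that the -d payload quoted earlier is not valid JSON, since the keys are unquoted):

curl -X PUT http://localhost:5000/api/v1/namespaces/test -H 'Content-Type: application/json' -d '{"ownerName": "me", "description": "no description"}'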

      @@ -89635,7 +89641,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:26:56
      -

      *Thread Reply:* no worries! keep the questions coming!

      +

      *Thread Reply:* no worries! keep the questions coming!

      @@ -89661,7 +89667,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:29:46
      -

      *Thread Reply:* well, maybe because it’s so late on your end! get some rest!

      +

      *Thread Reply:* well, maybe because it’s so late on your end! get some rest!

      @@ -89687,7 +89693,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:36:25
      -

      *Thread Reply:* Yeah but I want to see how it works +

*Thread Reply:* Yeah, but I want to see how it works. Right now I get a 200 response for the creation of the namespace ... but it seems that nothing occurred, neither on the Marquez front end (localhost:3000) nor in the database

      @@ -89716,7 +89722,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:37:13
      -

      *Thread Reply:* can you curl the list namespaces endpoint?

      +

      *Thread Reply:* can you curl the list namespaces endpoint?

      @@ -89742,7 +89748,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:38:14
      -

      *Thread Reply:* yep : nothing changed +

      *Thread Reply:* yep : nothing changed only default and food_delivery

      @@ -89769,7 +89775,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:38:47
      -

      *Thread Reply:* can you post your server logs? you should see the request

      +

      *Thread Reply:* can you post your server logs? you should see the request

      @@ -89795,7 +89801,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:40:41
      -

      *Thread Reply:* marquez-api | XXX.23.0.4 - - [17/Feb/2023:00:30:38 +0000] "PUT /api/v1/namespaces/ciro HTTP/1.1" 500 110 "-" "-" 7 +

*Thread Reply:* marquez-api | XXX.23.0.4 - - [17/Feb/2023:00:30:38 +0000] "PUT /api/v1/namespaces/ciro HTTP/1.1" 500 110 "-" "-" 7
marquez-api | INFO [2023-02-17 00:32:07,072] marquez.logging.LoggingMdcFilter: status: 200

      @@ -89822,7 +89828,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:41:12
      -

      *Thread Reply:* the server is returning a 500 ?

      +

      *Thread Reply:* the server is returning a 500 ?

      @@ -89848,7 +89854,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:41:57
      -

      *Thread Reply:* odd that LoggingMdcFilter is logging 200

      +

      *Thread Reply:* odd that LoggingMdcFilter is logging 200

      @@ -89874,7 +89880,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:43:24
      -

      *Thread Reply:* Bit confused because now I realize that postman is returning bad request

      +

      *Thread Reply:* Bit confused because now I realize that postman is returning bad request

      @@ -89900,7 +89906,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:43:51
      -

      *Thread Reply:*

      +

      *Thread Reply:*

      @@ -89935,7 +89941,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:44:30
      -

      *Thread Reply:* You'll notice that I go to use 3000 in the url +

*Thread Reply:* You'll notice that I go to use 3000 in the url. If I use 5000 I get No host

      @@ -89962,7 +89968,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-17 01:14:50
      -

      *Thread Reply:* odd, the API should be using port 5000, have you followed our quickstart for Marquez?

      +

      *Thread Reply:* odd, the API should be using port 5000, have you followed our quickstart for Marquez?

      @@ -89988,7 +89994,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-17 03:43:29
      -

      *Thread Reply:* Hello Willy +

*Thread Reply:* Hello Willy, I am starting from scratch following the instructions from https://openlineage.io/docs/getting-started/ I am on Windows. Instead of @@ -90038,7 +90044,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-20 03:40:48
      -

      *Thread Reply:* The issue is related to PostMan

      +

      *Thread Reply:* The issue is related to PostMan

      @@ -90090,7 +90096,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-17 03:54:47
      -

*Thread Reply:* Hi @Anirudh Shrinivason, this is a great question. We included extra fields in the OpenLineage spec to contain that information: +

*Thread Reply:* Hi @Anirudh Shrinivason, this is a great question. We included extra fields in the OpenLineage spec to contain that information:
"transformationDescription": {
  "type": "string",
  "description": "a string representation of the transformation applied" @@ -90125,7 +90131,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-17 04:02:30
      -

      *Thread Reply:* Thanks a lot! That is great. Is there a potential plan in the roadmap to support this for spark?

      +

      *Thread Reply:* Thanks a lot! That is great. Is there a potential plan in the roadmap to support this for spark?

      @@ -90151,7 +90157,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-17 04:08:16
      -

*Thread Reply:* I think there will be a growing interest in that. In general, a dependency may be really difficult to express if many Spark operators are used on input columns to produce an output one. The simple version would be just to detect an identity operation or some kind of hashing.

      +

*Thread Reply:* I think there will be a growing interest in that. In general, a dependency may be really difficult to express if many Spark operators are used on input columns to produce an output one. The simple version would be just to detect an identity operation or some kind of hashing.

To sum up, we don't yet have a proposal on that, but this seems to be a natural next step in enriching column lineage features.
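For anyone curious, a sketch of how those fields sit on an output field inside the column lineage dataset facet, as the spec looked at the time (the dataset, namespace, and column names are hypothetical):

"columnLineage": {
  "fields": {
    "total_price": {
      "inputFields": [
        {"namespace": "my-namespace", "name": "orders", "field": "price"}
      ],
      "transformationType": "IDENTITY",
      "transformationDescription": "identical copy of the input column"
    }
  }
}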

      @@ -90179,7 +90185,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-17 04:40:04
      -

      *Thread Reply:* Got it. Thanks! If this item potentially comes on the roadmap, then I'd be happy to work with other interested developers to help contribute! 🙂

      +

      *Thread Reply:* Got it. Thanks! If this item potentially comes on the roadmap, then I'd be happy to work with other interested developers to help contribute! 🙂

      @@ -90205,7 +90211,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-17 04:43:00
      -

*Thread Reply:* Great to hear that. What you could perhaps start with, is come to our monthly OpenLineage meetings and ask @Michael Robinson to put this item on the discussion list. There are many strategies to address this issue, and hearing your story, usage scenario, and what you are trying to achieve would be super helpful in the design and implementation phase.

      +

*Thread Reply:* Great to hear that. What you could perhaps start with, is come to our monthly OpenLineage meetings and ask @Michael Robinson to put this item on the discussion list. There are many strategies to address this issue, and hearing your story, usage scenario, and what you are trying to achieve would be super helpful in the design and implementation phase.

      @@ -90231,7 +90237,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-17 04:44:18
      -

      *Thread Reply:* Got it! The monthly meeting might be a bit hard for me to attend live, because of the time zone. But I'll try my best to make it to the next one! thanks!

      +

      *Thread Reply:* Got it! The monthly meeting might be a bit hard for me to attend live, because of the time zone. But I'll try my best to make it to the next one! thanks!

      @@ -90257,7 +90263,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-17 09:46:22
      -

      *Thread Reply:* Thank you for bringing this up, @Anirudh Shrinivason. I’ll add it to the agenda of our next meeting because there might be interest from others in adding this to the roadmap.

      +

      *Thread Reply:* Thank you for bringing this up, @Anirudh Shrinivason. I’ll add it to the agenda of our next meeting because there might be interest from others in adding this to the roadmap.

      @@ -90319,7 +90325,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-20 02:10:13
      -

*Thread Reply:* Hi @thebruuu, pls take a look at the logging documentation of Dropwizard (https://www.dropwizard.io/en/latest/manual/core.html#logging) - the framework Marquez is implemented in. The logging configuration section is present in marquez.yml.

      +

*Thread Reply:* Hi @thebruuu, pls take a look at the logging documentation of Dropwizard (https://www.dropwizard.io/en/latest/manual/core.html#logging) - the framework Marquez is implemented in. The logging configuration section is present in marquez.yml.

      @@ -90345,7 +90351,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-20 03:29:07
      -

      *Thread Reply:* Thank You Pavel

      +

      *Thread Reply:* Thank You Pavel

      @@ -90397,7 +90403,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-21 09:12:17
      -

      *Thread Reply:* We generally schedule a release every month, next one will be in the next week - is that okay @Anirudh Shrinivason?

      +

      *Thread Reply:* We generally schedule a release every month, next one will be in the next week - is that okay @Anirudh Shrinivason?

      @@ -90423,7 +90429,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-21 11:38:50
      -

      *Thread Reply:* Yes, there’s one scheduled for next Wednesday, if that suits.

      +

      *Thread Reply:* Yes, there’s one scheduled for next Wednesday, if that suits.

      @@ -90449,7 +90455,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-21 21:45:58
      -

      *Thread Reply:* Okay yeah sure that works. Thanks

      +

      *Thread Reply:* Okay yeah sure that works. Thanks

      @@ -90479,7 +90485,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-01 10:12:45
      -

      *Thread Reply:* @Anirudh Shrinivason we’re expecting the release to happen today or tomorrow, FYI

      +

      *Thread Reply:* @Anirudh Shrinivason we’re expecting the release to happen today or tomorrow, FYI

      @@ -90505,7 +90511,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-01 21:22:40
      -

      *Thread Reply:* Awesome thanks

      +

      *Thread Reply:* Awesome thanks

      @@ -90569,7 +90575,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-23 23:44:03
      -

*Thread Reply:* This is my checkpoint configuration in GE. +

*Thread Reply:* This is my checkpoint configuration in GE.
```name: 'openlineagecheckpoint'
config_version: 1.0
template_name: @@ -90632,7 +90638,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-24 11:31:05
      -

      *Thread Reply:* What version of GX are you running? And is this being run directly through GX or through Airflow with the operator?

      +

      *Thread Reply:* What version of GX are you running? And is this being run directly through GX or through Airflow with the operator?

      @@ -90658,7 +90664,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-26 20:05:12
      -

*Thread Reply:* I use the latest version of Great Expectations. This error occurs whether run directly through Great Expectations or through Airflow

      +

*Thread Reply:* I use the latest version of Great Expectations. This error occurs whether run directly through Great Expectations or through Airflow

      @@ -90684,7 +90690,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-27 09:10:00
      -

      *Thread Reply:* I noticed another issue in the latest version as well. Try dropping to GE version great-expectations==0.15.44 for now. That is the latest one that works for me.

      +

      *Thread Reply:* I noticed another issue in the latest version as well. Try dropping to GE version great-expectations==0.15.44 for now. That is the latest one that works for me.

      @@ -90710,7 +90716,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-27 09:11:34
      -

      *Thread Reply:* You should definitely open an issue here, and you can tag me @denimalpaca in the comment

      +

      *Thread Reply:* You should definitely open an issue here, and you can tag me @denimalpaca in the comment

      @@ -90736,7 +90742,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-27 18:07:29
      -

*Thread Reply:* Thanks Benji, but I still have the same problem after dropping to great-expectations==0.15.44; this is my requirements file

      +

*Thread Reply:* Thanks Benji, but I still have the same problem after dropping to great-expectations==0.15.44; this is my requirements file

      great_expectations==0.15.44
       sqlalchemy
      @@ -90771,7 +90777,7 @@ 

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-28 13:34:03
      -

      *Thread Reply:* interesting... I do think this may be a GX issue so let's see if they say anything. I can also cross post this thread to their slack

      +

      *Thread Reply:* interesting... I do think this may be a GX issue so let's see if they say anything. I can also cross post this thread to their slack

      @@ -90823,7 +90829,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-01 11:47:40
      -

      *Thread Reply:* I know I’m responding to an older post - I’m not sure if this would work in your environment? https://aws.amazon.com/blogs/big-data/build-data-lineage-for-data-lakes-using-aws-glue-amazon-neptune-and-spline/ +

      *Thread Reply:* I know I’m responding to an older post - I’m not sure if this would work in your environment? https://aws.amazon.com/blogs/big-data/build-data-lineage-for-data-lakes-using-aws-glue-amazon-neptune-and-spline/ Are you using AWS Glue with Spark jobs?

      @@ -90875,7 +90881,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-02 15:16:14
      -

*Thread Reply:* This was proposed by our AWS Solutions Architect, but we are not seeing much improvement compared to OpenLineage. Have you deployed the above solution to prod?

      +

*Thread Reply:* This was proposed by our AWS Solutions Architect, but we are not seeing much improvement compared to OpenLineage. Have you deployed the above solution to prod?

      @@ -90901,7 +90907,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-25 11:30:44
      -

      *Thread Reply:* We are currently in the research phase, so we have not deployed to prod. We have customers with thousands of existing scripts that they don’t want to rewrite to add openlineage libraries - i would imagine that if you are already integrating OpenLineage in your code, the spark listener isn’t an improvement. Our research is on magically getting lineage from existing scripts 😄

      +

      *Thread Reply:* We are currently in the research phase, so we have not deployed to prod. We have customers with thousands of existing scripts that they don’t want to rewrite to add openlineage libraries - i would imagine that if you are already integrating OpenLineage in your code, the spark listener isn’t an improvement. Our research is on magically getting lineage from existing scripts 😄

      @@ -90962,7 +90968,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-01 10:26:22
      -

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated as soon as possible.

      +

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated as soon as possible.

      @@ -91014,7 +91020,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-02 05:15:55
      -

*Thread Reply:* GitHub has a new issue template for reporting vulnerabilities, actually, if you use a config that enables this issue template.

      +

*Thread Reply:* GitHub has a new issue template for reporting vulnerabilities, actually, if you use a config that enables this issue template.

      @@ -91202,7 +91208,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-02 20:04:19
      -

      *Thread Reply:* Hey @Paul Lee, are you seeing this happen for Async operators?

      +

      *Thread Reply:* Hey @Paul Lee, are you seeing this happen for Async operators?

      @@ -91228,7 +91234,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-02 20:06:00
      -

      *Thread Reply:* might be related to this issue https://github.com/OpenLineage/OpenLineage/pull/1601 +

      *Thread Reply:* might be related to this issue https://github.com/OpenLineage/OpenLineage/pull/1601 that was fixed in 0.20.6

      @@ -91279,7 +91285,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-03 16:15:44
      -

      *Thread Reply:* hmm perhaps.

      +

      *Thread Reply:* hmm perhaps.

      @@ -91305,7 +91311,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-03 16:15:55
      -

      *Thread Reply:* @Harel Shein if i want to turn off openlineage listener how do i do that? do i just remove the package?

      +

      *Thread Reply:* @Harel Shein if i want to turn off openlineage listener how do i do that? do i just remove the package?

      @@ -91331,7 +91337,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-03 16:24:07
      -

      *Thread Reply:* meaning, you don’t want openlineage to collect any information from your Airflow deployment?

      +

      *Thread Reply:* meaning, you don’t want openlineage to collect any information from your Airflow deployment?

      @@ -91361,7 +91367,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-03 16:24:50
      -

      *Thread Reply:* in that case, you could either remove it from your requirements file, or set OPENLINEAGE_DISABLED=True in your Airflow env vars

      +

      *Thread Reply:* in that case, you could either remove it from your requirements file, or set OPENLINEAGE_DISABLED=True in your Airflow env vars

      @@ -91391,7 +91397,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-06 14:43:56
      -

      *Thread Reply:* removed it from requirements and also the backend key in airflow config. needed both

      +

      *Thread Reply:* removed it from requirements and also the backend key in airflow config. needed both

      @@ -91482,7 +91488,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-02 23:46:08
      -

      *Thread Reply:* Are you seeing duplicate START events or do you see two events one that is a START and one that is COMPLETE?

      +

      *Thread Reply:* Are you seeing duplicate START events or do you see two events one that is a START and one that is COMPLETE?

      OpenLineage's events may send partial information. You should expect to collect all events for a given RunId and merge them together to get the complete events.

      @@ -91512,7 +91518,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-03 00:45:19
      -

      *Thread Reply:* Hmm...I'm seeing 2 start events for the same runnable command

      +

      *Thread Reply:* Hmm...I'm seeing 2 start events for the same runnable command

      @@ -91538,7 +91544,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-03 00:45:27
      -

      *Thread Reply:* And 2 complete

      +

      *Thread Reply:* And 2 complete

      @@ -91564,7 +91570,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-03 00:46:08
      -

      *Thread Reply:* I am currently only testing on parquet tables...

      +

      *Thread Reply:* I am currently only testing on parquet tables...

      @@ -91590,7 +91596,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-03 02:31:28
      -

*Thread Reply:* One of OpenLineage's assumptions is the ability to merge lineage events in the backend to make client integrations stateless. So, it is possible that Spark can emit multiple events for the same job. However, sometimes it does not make any sense to send or collect some events, which happened to us some time ago with delta. In that case we decided to filter them and created a filtering mechanism (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/filters) that can be extended in case other unwanted events are generated and sent.

      +

*Thread Reply:* One of OpenLineage's assumptions is the ability to merge lineage events in the backend to make client integrations stateless. So, it is possible that Spark can emit multiple events for the same job. However, sometimes it does not make any sense to send or collect some events, which happened to us some time ago with delta. In that case we decided to filter them and created a filtering mechanism (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/filters) that can be extended in case other unwanted events are generated and sent.

      @@ -91616,7 +91622,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-05 22:59:06
      -

      *Thread Reply:* Ahh I see...okay thanks!

      +

      *Thread Reply:* Ahh I see...okay thanks!

      @@ -91642,7 +91648,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-07 00:05:48
      -

*Thread Reply:* in general, you should build any event consumer system with at-least-once semantics. Even if this issue is fixed, there is a possibility of duplicates for other valid scenarios

      +

*Thread Reply:* in general, you should build any event consumer system with at-least-once semantics. Even if this issue is fixed, there is a possibility of duplicates for other valid scenarios
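To make that concrete, a consumer-side sketch (not part of OpenLineage itself) that stays correct under at-least-once delivery by merging events per runId, so redelivered duplicates are harmless:

from collections import defaultdict

# Accumulated state per runId; a duplicate START or COMPLETE simply
# re-applies the same update, so processing stays idempotent.
runs = defaultdict(lambda: {"event_types": set(), "facets": {}})

def on_event(event: dict) -> None:
    run = runs[event["run"]["runId"]]
    run["event_types"].add(event["eventType"])
    run["facets"].update(event["run"].get("facets", {}))  # later partials win per key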

      @@ -91672,7 +91678,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-09 14:10:47
      -

      *Thread Reply:* Hi..I compared some duplicate 'START' events just now, and noticed that they are exactly the same, with the only exception of one of them having an 'environment-properties' field... Could I just quickly check if this is a bug or a feature haha?

      +

      *Thread Reply:* Hi..I compared some duplicate 'START' events just now, and noticed that they are exactly the same, with the only exception of one of them having an 'environment-properties' field... Could I just quickly check if this is a bug or a feature haha?

      @@ -91698,7 +91704,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-10 01:18:18
      -

      *Thread Reply:* CC: @Paweł Leszczyński ^

      +

      *Thread Reply:* CC: @Paweł Leszczyński ^

      @@ -91821,7 +91827,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-08 17:30:44
      -

      *Thread Reply:* if you can set up env variables for particular notebooks, you can set OPENLINEAGE_DISABLED=true

      +

      *Thread Reply:* if you can set up env variables for particular notebooks, you can set OPENLINEAGE_DISABLED=true

      @@ -91976,7 +91982,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-13 17:08:53
      -

      *Thread Reply:* is DAG run capturing enabled starting airflow 2.5.1 ? https://github.com/apache/airflow/pull/27113

      +

      *Thread Reply:* is DAG run capturing enabled starting airflow 2.5.1 ? https://github.com/apache/airflow/pull/27113

      @@ -92042,7 +92048,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-13 17:11:47
      -

      *Thread Reply:* you're right, only the change was included in 2.5.0

      +

      *Thread Reply:* you're right, only the change was included in 2.5.0

      @@ -92072,7 +92078,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-13 17:43:15
      -

      *Thread Reply:* Thanks Jakub

      +

      *Thread Reply:* Thanks Jakub

      @@ -92196,7 +92202,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-15 15:18:22
      -

      *Thread Reply:* I don’t see it currently on the AirflowRunFacet, but probably not a big deal to add it? @Benji Lampel wdyt?

      +

      *Thread Reply:* I don’t see it currently on the AirflowRunFacet, but probably not a big deal to add it? @Benji Lampel wdyt?

      @@ -92424,7 +92430,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-15 15:22:00
      -

      *Thread Reply:* Definitely could be a good thing to have--is there not some info facet that could hold this data already? I don't see an issue with adding to the AirflowRunFacet tho (full disclosure, I'm not super familiar with this facet)

      +

      *Thread Reply:* Definitely could be a good thing to have--is there not some info facet that could hold this data already? I don't see an issue with adding to the AirflowRunFacet tho (full disclosure, I'm not super familiar with this facet)

      @@ -92454,7 +92460,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-15 15:58:40
      -

      *Thread Reply:* Perhaps DocumentationJobFacet or DocumentationDatasetFacet?

      +

      *Thread Reply:* Perhaps DocumentationJobFacet or DocumentationDatasetFacet?

      @@ -92557,7 +92563,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-18 12:09:23
      -

      *Thread Reply:* dbt-ol does not read from openlineage.yml so you need to pass this information in OPENLINEAGE_NAMESPACE environment variable

      +

      *Thread Reply:* dbt-ol does not read from openlineage.yml so you need to pass this information in OPENLINEAGE_NAMESPACE environment variable

      @@ -92583,7 +92589,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-20 15:17:03
      -

      *Thread Reply:* Hmmm. Interesting! I thought that it used client = OpenLineageClient.from_environment(), I’ll do some testing with Kafka backends.

      +

      *Thread Reply:* Hmmm. Interesting! I thought that it used client = OpenLineageClient.from_environment(), I’ll do some testing with Kafka backends.

      @@ -92609,7 +92615,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-20 15:22:07
      -

*Thread Reply:* Thank you for the hint. I was able to make it work by setting the env OPENLINEAGE_CONFIG to point to the yml file holding transport info, along with OPENLINEAGE_NAMESPACE

      +

*Thread Reply:* Thank you for the hint. I was able to make it work by setting the env OPENLINEAGE_CONFIG to point to the yml file holding transport info, along with OPENLINEAGE_NAMESPACE

      @@ -92639,7 +92645,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-20 15:24:05
      -

      *Thread Reply:* Awesome! That’s exactly what I was going to test.

      +

      *Thread Reply:* Awesome! That’s exactly what I was going to test.

      @@ -92665,7 +92671,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-20 15:25:04
      -

      *Thread Reply:* I think it also works if you put it in $HOME/.openlineage/openlineage.yml.

      +

      *Thread Reply:* I think it also works if you put it in $HOME/.openlineage/openlineage.yml.

      @@ -92695,7 +92701,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-21 08:32:17
      -

      *Thread Reply:* @Susmitha Anandarao I might have provided misleading information. I meant that dbt-ol does not read OL namespace from openlineage.yml but from OPENLINEAGE_NAMESPACE env var instead

      +

      *Thread Reply:* @Susmitha Anandarao I might have provided misleading information. I meant that dbt-ol does not read OL namespace from openlineage.yml but from OPENLINEAGE_NAMESPACE env var instead
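Putting the thread together, a working invocation would look roughly like this (the config path and namespace are placeholders):

OPENLINEAGE_CONFIG=/path/to/openlineage.yml OPENLINEAGE_NAMESPACE=my_namespace dbt-ol run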

      @@ -92893,7 +92899,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-23 04:25:19
      -

      *Thread Reply:* It should be represented by multiple datasets, unless I misunderstood what you mean by multi-table

      +

      *Thread Reply:* It should be represented by multiple datasets, unless I misunderstood what you mean by multi-table

      @@ -92919,7 +92925,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-23 10:55:58
      -

      *Thread Reply:* here at Fivetran when we sync data it is generally 1 schema with multiple tables (sometimes many) so we would want to represent all of that

      +

      *Thread Reply:* here at Fivetran when we sync data it is generally 1 schema with multiple tables (sometimes many) so we would want to represent all of that

      @@ -92945,7 +92951,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-23 11:11:25
      -

      *Thread Reply:* So what I understand:

      +

      *Thread Reply:* So what I understand:

      1. your single job represents synchronization of multiple tables
      2. you want to have precise input-output dataset lineage? am I right?
      3. @@ -92979,7 +92985,7 @@

        MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-23 12:57:15
      -

      *Thread Reply:* Yes your statements are correct. Thanks for sharing that model, that makes sense to me

      +

      *Thread Reply:* Yes your statements are correct. Thanks for sharing that model, that makes sense to me

      @@ -93052,7 +93058,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-27 05:26:06
      -

*Thread Reply:* I think it's better to just create a POJO. This is what we do in the Spark integration, for example.

      +

*Thread Reply:* I think it's better to just create a POJO. This is what we do in the Spark integration, for example.

For now, the JSON Schema generator isn't flexible enough to generate custom facets from whatever schema we give it, so it would be unnecessary complexity

      MARQUEZAPIKEY=[YOURAPIKEY]
      2023-03-27 12:29:57
      -

*Thread Reply:* Agreed, just a POJO would work. This is using Jackson, so you would use annotations as needed. You can also use a Jackson JsonNode or even a Map.

      +

*Thread Reply:* Agreed, just a POJO would work. This is using Jackson, so you would use annotations as needed. You can also use a Jackson JsonNode or even a Map.

      @@ -93165,7 +93171,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-27 17:59:28
      -

*Thread Reply:* That depends on the OL consumer, but for something like SchemaDatasetFacet it seems to be okay to assume the schema stays the same if not sent.

      +

*Thread Reply:* That depends on the OL consumer, but for something like SchemaDatasetFacet it seems to be okay to assume the schema stays the same if not sent.

      For others, like OutputStatisticsOutputDatasetFacet you definitely can't assume that, as the data is unique to each run.

      @@ -93193,7 +93199,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-27 19:05:14
      -

      *Thread Reply:* ok great thanks, that makes sense to me

      +

      *Thread Reply:* ok great thanks, that makes sense to me

      @@ -93254,7 +93260,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-28 04:47:31
      -

      *Thread Reply:* OpenLineage API: https://openlineage.io/docs/getting-started/

      +

      *Thread Reply:* OpenLineage API: https://openlineage.io/docs/getting-started/

      @@ -93327,7 +93333,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-28 08:05:30
      -

*Thread Reply:* I think it would be great to support V2SessionCatalog, and it would very much help if you created a GitHub issue with more explanation and examples of its use.

      +

*Thread Reply:* I think it would be great to support V2SessionCatalog, and it would very much help if you created a GitHub issue with more explanation and examples of its use.

      @@ -93353,7 +93359,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-29 02:53:37
      -

      *Thread Reply:* Sure thanks!

      +

      *Thread Reply:* Sure thanks!

      @@ -93379,7 +93385,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-29 05:34:37
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1747 +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1747 I have opened an issue here. Thanks! 🙂

      @@ -93437,7 +93443,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-17 11:53:52
      -

      *Thread Reply:* Hi @Maciej Obuchowski Just curious, is this issue on the potential roadmap for the next Openlineage release?

      +

      *Thread Reply:* Hi @Maciej Obuchowski Just curious, is this issue on the potential roadmap for the next Openlineage release?

      @@ -93516,7 +93522,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-03 09:32:46
      -

*Thread Reply:* The blog post is a bit old, and in the meantime changes were introduced in the OpenLineage Python client. +

*Thread Reply:* The blog post is a bit old, and in the meantime changes were introduced in the OpenLineage Python client. May I ask if you just want to test the flow or are looking for a viable Snowflake data lineage solution?

      @@ -93543,7 +93549,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-03 10:47:57
      -

      *Thread Reply:* I believe that this will work if you change the line to client.transport.emit()

      +

      *Thread Reply:* I believe that this will work if you change the line to client.transport.emit()

      @@ -93569,7 +93575,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-03 10:49:05
      -

      *Thread Reply:* (this would be in the dags/lineage folder, if memory serves)

      +

      *Thread Reply:* (this would be in the dags/lineage folder, if memory serves)

      @@ -93595,7 +93601,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-03 10:57:23
      -

      *Thread Reply:* Ross is right, that should work

      +

      *Thread Reply:* Ross is right, that should work

      @@ -93621,7 +93627,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-04 12:23:13
      -

      *Thread Reply:* This works! Thank you so much!

      +

      *Thread Reply:* This works! Thank you so much!

      @@ -93647,7 +93653,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-04 12:24:40
      -

*Thread Reply:* @Jakub Dardziński I want to use a viable Snowflake data lineage solution alongside an Amazon DataZone Catalog 🙂

      +

*Thread Reply:* @Jakub Dardziński I want to use a viable Snowflake data lineage solution alongside an Amazon DataZone Catalog 🙂

      @@ -93673,7 +93679,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-04 13:03:58
      -

      *Thread Reply:* I have been meaning to revisit that tutorial 👍

      +

      *Thread Reply:* I have been meaning to revisit that tutorial 👍

      @@ -93740,7 +93746,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-03 15:39:46
      -

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated within 48 hours.

      +

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated within 48 hours.

      @@ -93901,7 +93907,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-04 13:39:19
      -

      *Thread Reply:* Do you have small configuration and job to replicate this?

      +

      *Thread Reply:* Do you have small configuration and job to replicate this?

      @@ -93927,7 +93933,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-04 22:21:35
      -

      *Thread Reply:* Yeah. For configs: +

*Thread Reply:* Yeah. For configs:
spark.driver.bindAddress: "localhost"
spark.master: "local[*]"
spark.sql.catalogImplementation: "hive" @@ -93963,7 +93969,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-04 22:23:19
      -

*Thread Reply:* The issue is because of the spark.jars.packages config... the spark.jars config also runs into the same issue, because the executor tries to fetch the jar from the driver for some reason even though there are no executors set...

      +

*Thread Reply:* The issue is because of the spark.jars.packages config... the spark.jars config also runs into the same issue, because the executor tries to fetch the jar from the driver for some reason even though there are no executors set...

      @@ -93989,7 +93995,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-05 05:38:55
      -

      *Thread Reply:* TBH I'm not sure if we can do anything about it. Seems like just having any SparkListener which is not in Spark jars would fall under the same problems, right?

      +

      *Thread Reply:* TBH I'm not sure if we can do anything about it. Seems like just having any SparkListener which is not in Spark jars would fall under the same problems, right?

      @@ -94015,7 +94021,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-10 06:07:11

*Thread Reply:* Yeah... Actually, this was because of binding the driver IP to localhost. In that case, the executor was not able to get the jar from the driver. But yeah, I don't think we could have done anything from the OpenLineage end anyway for this. Was just an interesting error to encounter lol

      @@ -94067,7 +94073,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-04 13:02:21

*Thread Reply:* In this case you would have four run events (a minimal sketch of events #1 and #2 follows the list):

      1. a START event on my-job where my-input is the input and my-output is the output, with a runId you generate on the client
      2. a COMPLETE event on my-job with the same runId from #1
      3. a START event on my-job2 where the input is my-output and the output is my-final-output, with a separate runId you generate
      4. a COMPLETE event on my-job2 with the same runId from #3
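
For illustration, a sketch of events #1 and #2 with the openlineage-python client; the namespace, producer, and Marquez URL are placeholders:

```
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # Marquez endpoint

run_id = str(uuid4())  # generated on the client, reused for the COMPLETE event
job = Job(namespace="my-namespace", name="my-job")

# 1. START event on my-job: my-input is the input, my-output is the output
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=run_id),
    job=job,
    producer="my-producer",
    inputs=[Dataset(namespace="my-namespace", name="my-input")],
    outputs=[Dataset(namespace="my-namespace", name="my-output")],
))

# 2. COMPLETE event on my-job with the same runId as in #1
client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=run_id),
    job=job,
    producer="my-producer",
))
```

Events #3 and #4 for my-job2 follow the same pattern with a separately generated runId.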
      @@ -94096,7 +94102,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-04 14:53:14

*Thread Reply:* Thanks for the response. I tried it, but now the UI shows for about one second and then turns blank. I had a similar issue before. It seems that every time I add a bad lineage event, the UI stops working and I have to delete the Docker image :-( Not sure whether it is a macOS M1 related issue.

      @@ -94122,7 +94128,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-04 16:07:06
      -

      *Thread Reply:* Hmmm, that's interesting. Not sure I've seen that before. If you happen to catch it in that state again, perhaps capture the contents of the lineage_events table so it can be replicated.

      +

      *Thread Reply:* Hmmm, that's interesting. Not sure I've seen that before. If you happen to catch it in that state again, perhaps capture the contents of the lineage_events table so it can be replicated.

      @@ -94148,7 +94154,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-04 16:24:28

*Thread Reply:* I can fairly easily reproduce this blank UI issue. Apparently I used the same runId for two different jobs. If I use different runIds (which I should), the lineage displays correctly. Thanks again!

      @@ -94213,7 +94219,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-05 05:35:02

*Thread Reply:* You can add https://openlineage.io/docs/spec/facets/dataset-facets/column_lineage_facet/ to your datasets.

      However, I don't think you can currently do any filtering over it
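
For reference, a dataset carrying the columnLineage facet from the linked spec page might look like this sketch; the names are illustrative and the _producer/_schemaURL values are placeholders:

```
output_dataset = {
    "namespace": "my-namespace",
    "name": "my-output",
    "facets": {
        "columnLineage": {
            "_producer": "https://github.com/OpenLineage/OpenLineage",
            "_schemaURL": "https://openlineage.io/spec/facets/1-0-1/ColumnLineageDatasetFacet.json",
            "fields": {
                # each output column lists the input columns it derives from
                "order_total": {
                    "inputFields": [
                        {"namespace": "my-namespace", "name": "my-input", "field": "amount"}
                    ]
                }
            },
        }
    },
}
```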

      MARQUEZAPIKEY=[YOURAPIKEY]
      2023-04-05 13:20:20

*Thread Reply:* you can see a good example here, @Lq Dodo: https://github.com/MarquezProject/marquez/blob/289fa3eef967c8f7915b074325bb6f8f55480030/docker/metadata.json#L430

      @@ -94313,7 +94319,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-06 11:48:48

*Thread Reply:* Those examples really help. I can at least build the lineage with column-level info using the APIs. Thanks a lot! Ideally I'd like to select one column from the UI and then see the column-level graph. That seems not to be possible.

      @@ -94339,7 +94345,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-06 12:46:54
      -

      *Thread Reply:* correct, right now there isn't column-level metadata on the lineage graph 😞

      +

      *Thread Reply:* correct, right now there isn't column-level metadata on the lineage graph 😞

      @@ -94393,7 +94399,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-06 10:22:17
      -

      *Thread Reply:* something needs to trigger lineage collection, are you using some sort of scheduler / execution engine?

      +

      *Thread Reply:* something needs to trigger lineage collection, are you using some sort of scheduler / execution engine?

      @@ -94419,7 +94425,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-06 11:26:13

*Thread Reply:* Nope... We currently don't have a scheduling tool. Isn't it possible to use the OpenLineage API and collect the details?

      @@ -94609,7 +94615,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-07 12:50:14
      -

      *Thread Reply:* E.g. do I need to update my custom operators to inherit from DefaultExtractor?

      +

      *Thread Reply:* E.g. do I need to update my custom operators to inherit from DefaultExtractor?

      @@ -94635,7 +94641,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-07 13:18:05
      -

      *Thread Reply:* FWIW, I can tell some level of connectivity to my Marquez deployment is working since I can see it created the default namespace I defined in my OPENLINEAGE_NAMESPACE env var.

      +

      *Thread Reply:* FWIW, I can tell some level of connectivity to my Marquez deployment is working since I can see it created the default namespace I defined in my OPENLINEAGE_NAMESPACE env var.

      @@ -94661,7 +94667,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-07 18:37:44
      -

      *Thread Reply:* hey John, it is enough to add the method to your custom operator. Perhaps something breaks inside the method. Did anything show up in the logs?

      +

      *Thread Reply:* hey John, it is enough to add the method to your custom operator. Perhaps something breaks inside the method. Did anything show up in the logs?

      @@ -94687,7 +94693,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-07 19:03:01

*Thread Reply:* That’s the strange part. I’m not seeing anything to suggest that the method is ever getting called. I’m also expecting that the listener created by the plugin should at least be calling this log line when the task runs. However, I’m not seeing that either. I’m able to verify the plugin is registered using airflow plugins and have debug-level logging enabled via AIRFLOW__LOGGING__LOGGING_LEVEL='DEBUG'. This is the output of airflow plugins:

name | macros | listeners | source
==================+================================================+==============================+=================================================

@@ -94746,7 +94752,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-11 13:09:05
      -

      *Thread Reply:* Figured this out. Just needed to run the airflow scheduler and trigger tasks through the DAGs vs. airflow tasks test …

      +

      *Thread Reply:* Figured this out. Just needed to run the airflow scheduler and trigger tasks through the DAGs vs. airflow tasks test …

      @@ -94800,7 +94806,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-07 16:30:00
      -

      *Thread Reply:* (they are importing dagkit sdk stuff like Job, JobContext, ExecutionContext, and NodeContext.)

      +

      *Thread Reply:* (they are importing dagkit sdk stuff like Job, JobContext, ExecutionContext, and NodeContext.)

      @@ -94826,7 +94832,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-07 18:40:39
      -

      *Thread Reply:* Do they run those scripts in PythonOperator? If so, they should receive some events but with no datasets extracted

      +

      *Thread Reply:* Do they run those scripts in PythonOperator? If so, they should receive some events but with no datasets extracted

      @@ -94852,7 +94858,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-07 21:28:25
      -

      *Thread Reply:* How can I know that? Would it be in the scripts or the airflow configuration or...

      +

      *Thread Reply:* How can I know that? Would it be in the scripts or the airflow configuration or...

      @@ -94878,7 +94884,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-08 07:13:56
      -

      *Thread Reply:* And "with no datasets extracted" that means I wouldn't have the schema of the input and output datasets? (I need the db/schema/table/column names for my purposes)

      +

      *Thread Reply:* And "with no datasets extracted" that means I wouldn't have the schema of the input and output datasets? (I need the db/schema/table/column names for my purposes)

      @@ -94904,7 +94910,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-11 02:49:07

*Thread Reply:* That really depends on the current code, but in general any custom code in Airflow does not extract any extra information, especially datasets. One can write their own extractors (more in the docs); a minimal sketch follows below.
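
A minimal custom extractor sketch for openlineage-airflow, assuming a hypothetical MyOperator; the class and method names follow the BaseExtractor API from the docs:

```
from typing import List, Optional

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata
from openlineage.client.run import Dataset


class MyOperatorExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        # operators this extractor should handle
        return ["MyOperator"]

    def extract(self) -> Optional[TaskMetadata]:
        # self.operator is the task being executed; map whatever it reads and
        # writes (db/schema/table/column names) to input/output datasets here
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            inputs=[Dataset(namespace="my-namespace", name="my-input")],
            outputs=[Dataset(namespace="my-namespace", name="my-output")],
        )
```

The extractor is then registered by pointing the OPENLINEAGE_EXTRACTORS environment variable at its full import path.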

      @@ -94955,7 +94961,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-12 16:52:04
      -

      *Thread Reply:* Thanks! This is very helpful. Exactly what I needed.

      +

      *Thread Reply:* Thanks! This is very helpful. Exactly what I needed.

      @@ -95011,7 +95017,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-12 02:30:19
      -

      *Thread Reply:* Currently there's no extractor implemented for MS-SQL. We try to update list of supported databases here: https://openlineage.io/docs/integrations/about/

      +

      *Thread Reply:* Currently there's no extractor implemented for MS-SQL. We try to update list of supported databases here: https://openlineage.io/docs/integrations/about/

      @@ -95297,7 +95303,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-11 17:05:35

*Thread Reply:* > The snowflake.connector import connect package seems to be outdated here in extract_openlineage.py and is not working for airflow.
What's the error?

> Does anyone know how to rewrite this code (e.g., with SnowflakeOperator )

@@ -95429,7 +95435,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-11 18:13:49

*Thread Reply:* Hi Maciej! Thank you so much for the reply! I managed to generate a working combination on Windows between the airflow example in the marquez git and the snowflake openlineage git. The only error I still get is:
****** Log file does not exist: /opt/bitnami/airflow/logs/dag_id=etl_openlineage/run_id=manual__2023-04-10T14:12:53.764783+00:00/task_id=send_ol_events/attempt=1.log
****** Fetching from: <http://1c8bb4a78f14:8793/log/dag_id=etl_openlineage/run_id=manual__2023-04-10T14:12:53.764783+00:00/task_id=send_ol_events/attempt=1.log>
****** !!!! Please make sure that all your Airflow components (e.g. schedulers, webservers and workers) have the same 'secret_key' configured in 'webserver' section and time is synchronized on all your machines (for example with ntpd) !!!!!

@@ -95462,7 +95468,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-12 05:35:41

*Thread Reply:* I think it's an Airflow error related to getting logs from the worker

      @@ -95488,7 +95494,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-12 05:36:07
      -

      *Thread Reply:* snowflake.connector is a Snowflake connector library that SnowflakeOperator uses underneath to connect to Snowflake

      +

      *Thread Reply:* snowflake.connector is a Snowflake connector library that SnowflakeOperator uses underneath to connect to Snowflake

      @@ -95514,7 +95520,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-12 10:15:21
      -

      *Thread Reply:* Ah alright! Thanks for pointing that out! 🙂 Do you know how to solve it? Or do you have any recommendations on how to look for the solution?

      +

      *Thread Reply:* Ah alright! Thanks for pointing that out! 🙂 Do you know how to solve it? Or do you have any recommendations on how to look for the solution?

      @@ -95540,7 +95546,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-12 10:19:53

*Thread Reply:* I have no experience with Windows, and I think it's the issue: https://github.com/apache/airflow/issues/10388

      I would try running it in Docker TBH

      MARQUEZAPIKEY=[YOURAPIKEY]
      2023-04-12 11:47:41

*Thread Reply:* Yeah, I was running Airflow in Docker but this didn't work. I'll try to use my MacBook for now because I don't think there is a solution for this in the short term. Thank you so much for the support though!!

      @@ -95729,7 +95735,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-13 11:19:57
      -

      *Thread Reply:* Very interesting!

      +

      *Thread Reply:* Very interesting!

      @@ -95755,7 +95761,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-13 13:28:53
      -

      *Thread Reply:* that’s awesome 🙂

      +

      *Thread Reply:* that’s awesome 🙂

      @@ -95807,7 +95813,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-13 08:36:01

*Thread Reply:* Thanks Ernie, I’m looking at Airflow as well as GE and would like to contribute back to the project as well… we’re close to getting a public preview release of our product done, and then we want to help build out OpenLineage

      @@ -95863,7 +95869,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-13 14:10:08

*Thread Reply:* The dag is basically just:
```dag = DAG(
    dag_id="asana_example_dag",
    default_args=default_args,

@@ -95902,7 +95908,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-13 14:11:02

*Thread Reply:* This is the error I’m getting, seems to be coming from this line:
[2023-04-13, 17:45:33 UTC] {logging_mixin.py:115} WARNING - Exception in thread Thread-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/threading.py", line 926, in _bootstrap_inner

@@ -96024,7 +96030,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-13 14:12:11
      -

      *Thread Reply:* This is with Airflow 2.3.2 and openlineage-airflow 0.22.0

      +

      *Thread Reply:* This is with Airflow 2.3.2 and openlineage-airflow 0.22.0

      @@ -96050,7 +96056,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-13 14:13:34
      -

      *Thread Reply:* Seems like it might be some issue like this with a circular structure? https://stackoverflow.com/questions/46283738/attributeerror-when-using-python-deepcopy

      +

      *Thread Reply:* Seems like it might be some issue like this with a circular structure? https://stackoverflow.com/questions/46283738/attributeerror-when-using-python-deepcopy

      @@ -96101,7 +96107,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-14 08:44:36

*Thread Reply:* Just by a quick look at it, it will definitely be fixed with Airflow 2.6, as it won't need to deepcopy anything.

      @@ -96131,7 +96137,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-14 08:47:16

*Thread Reply:* I can't seem to reproduce the issue. I ran the following example DAG with the same Airflow and OL versions as yours:
```import datetime

from airflow.lineage.entities import Table

@@ -96180,7 +96186,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-14 08:53:48
      -

      *Thread Reply:* is there any extra configuration you made possibly?

      +

      *Thread Reply:* is there any extra configuration you made possibly?

      @@ -96206,7 +96212,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-14 13:02:40

*Thread Reply:* @John Lukenoff, I was finally able to reproduce this when passing an XCom as task.output. It looks like this was reported here and solved by this PR (not sure if this was released in 2.3.3 or later)

      @@ -96281,7 +96287,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-14 13:06:59
      -

      *Thread Reply:* Ah interesting. Let me see if bumping my Airflow version resolves this. Haven’t had a chance to tinker with it much since yesterday.

      +

      *Thread Reply:* Ah interesting. Let me see if bumping my Airflow version resolves this. Haven’t had a chance to tinker with it much since yesterday.

      @@ -96307,7 +96313,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-14 13:13:21
      -

      *Thread Reply:* I ran it against 2.4 and same dag works

      +

      *Thread Reply:* I ran it against 2.4 and same dag works

      @@ -96333,7 +96339,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-14 13:15:35
      -

      *Thread Reply:* 👍 Looks like a fix for that issue was rolled out in 2.3.3. I’m gonna try that for now (my company has a notoriously difficult time with airflow major version updates 😅)

      +

      *Thread Reply:* 👍 Looks like a fix for that issue was rolled out in 2.3.3. I’m gonna try that for now (my company has a notoriously difficult time with airflow major version updates 😅)

      @@ -96359,7 +96365,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-14 13:17:06
      -

      *Thread Reply:* 👍

      +

      *Thread Reply:* 👍

      @@ -96385,7 +96391,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-17 12:29:09

*Thread Reply:* Got this working! We just monkey patched the __deepcopy__ method of the BaseOperator for now until we can get bandwidth for an airflow upgrade. Thanks for the help here!
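
One way such a patch might look (a sketch only, not the poster's actual code; it assumes the failure comes from operator attributes that can't be deep-copied):

```
import copy

from airflow.models.baseoperator import BaseOperator


def _tolerant_deepcopy(self, memo):
    # copy the operator field by field, falling back to a reference copy
    # for attributes that refuse to be deep-copied
    cls = self.__class__
    result = cls.__new__(cls)
    memo[id(self)] = result
    for key, value in self.__dict__.items():
        try:
            setattr(result, key, copy.deepcopy(value, memo))
        except (TypeError, AttributeError):
            setattr(result, key, value)
    return result


BaseOperator.__deepcopy__ = _tolerant_deepcopy
```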

      @@ -96458,7 +96464,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-17 03:56:30

*Thread Reply:* This is the spark submit command:
spark-submit --py-files /usr/local/lib/common_utils.zip,/usr/local/lib/team_utils.zip,/usr/local/lib/project_utils.zip
  --conf spark.executor.cores=16
  --conf spark.hadoop.fs.s3a.connection.maximum=100
  --conf spark.sql.shuffle.partitions=1000

@@ -96489,7 +96495,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-17 04:09:57

*Thread Reply:* @Anirudh Shrinivason pls create an issue for this and I will look at it. Although it may be difficult to find the root cause, a null pointer exception should always be avoided and this seems to be a bug.

      @@ -96515,7 +96521,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-17 04:14:41
      -

      *Thread Reply:* Hmm yeah sure. I'll create an issue on github for this issue. Thanks!

      +

      *Thread Reply:* Hmm yeah sure. I'll create an issue on github for this issue. Thanks!

      @@ -96541,7 +96547,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-17 05:13:54

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1784
Opened an issue here

      @@ -96626,7 +96632,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-18 02:25:04

*Thread Reply:* Hi @Allison Suarez, CustomColumnLineageVisitor should definitely be public. I'll prepare a fix PR for that. We do have a test for custom column lineage visitors (CustomColumnLineageVisitorTestImpl), but they're in the same package. Thanks for bringing this up.

      @@ -96652,7 +96658,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-18 03:07:11

*Thread Reply:* This PR should resolve the problem:
https://github.com/OpenLineage/OpenLineage/pull/1788

      @@ -96740,7 +96746,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-18 13:34:43
      -

      *Thread Reply:* Thank you so much @Paweł Leszczyński 🙂

      +

      *Thread Reply:* Thank you so much @Paweł Leszczyński 🙂

      @@ -96766,7 +96772,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-18 13:35:46
      -

      *Thread Reply:* How does the release process work for OL? Do we have to wait a certain amount of time to get this change in a new release?

      +

      *Thread Reply:* How does the release process work for OL? Do we have to wait a certain amount of time to get this change in a new release?

      @@ -96792,7 +96798,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-18 17:34:29
      -

      *Thread Reply:* @Maciej Obuchowski ^

      +

      *Thread Reply:* @Maciej Obuchowski ^

      @@ -96818,7 +96824,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-19 01:49:33

*Thread Reply:* 0.22.0 was released two weeks ago, so the next scheduled release should be in the next two weeks. We can ask @Michael Robinson his opinion on releasing 0.22.1 before that.

      @@ -96844,7 +96850,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-19 09:08:58

*Thread Reply:* Hi Allison 👋,
Anyone can request a release in the #general channel. I encourage you to go this route. You’ll need three +1s (there’s more info about the process here: https://github.com/OpenLineage/OpenLineage/blob/main/GOVERNANCE.md), but I don’t know of any reasons why we can’t do a mid-cycle release. 🙂

      @@ -96875,7 +96881,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-19 16:23:20
      -

      *Thread Reply:* seems like we got enough +1s

      +

      *Thread Reply:* seems like we got enough +1s

      @@ -96901,7 +96907,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-19 16:24:33
      -

      *Thread Reply:* We need three committers to give a +1. I’ll reach out again to see if I can recruit a third

      +

      *Thread Reply:* We need three committers to give a +1. I’ll reach out again to see if I can recruit a third

      @@ -96931,7 +96937,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-19 16:24:55
      -

      *Thread Reply:* oooh

      +

      *Thread Reply:* oooh

      @@ -96957,7 +96963,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-19 16:32:47
      -

      *Thread Reply:* Yeah, sorry I forgot to mention that!

      +

      *Thread Reply:* Yeah, sorry I forgot to mention that!

      @@ -96983,7 +96989,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-20 05:02:46
      -

      *Thread Reply:* we have it now

      +

      *Thread Reply:* we have it now

      @@ -97138,7 +97144,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-20 09:46:06
      -

      *Thread Reply:* The release is authorized and will be initiated within 2 business days (not including tomorrow).

      +

      *Thread Reply:* The release is authorized and will be initiated within 2 business days (not including tomorrow).

      @@ -97246,7 +97252,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-20 08:57:45

*Thread Reply:* Hi Sai,

Perhaps you could try printing OpenLineage events into the logs. This can be achieved with the Spark config parameter spark.openlineage.transport.type (a sketch follows below).

@@ -97278,7 +97284,7 @@
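
A PySpark sketch of the console option, assuming openlineage-spark 0.22.x; with the transport type set to console, events are printed by ConsoleTransport instead of being sent to Marquez:

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("ol-console-debug")
    # illustrative package version; match it to your environment
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.22.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "console")
    .getOrCreate()
)
```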

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-20 09:18:53

*Thread Reply:* Hi @Paweł Leszczyński I passed this config as below, but could not see any changes in the logs. The events are getting generated sometimes, like below:

      23/04/20 10:00:15 INFO ConsoleTransport: {"eventType":"START","eventTime":"2023-04-20T10:00:15.085Z","run":{"runId":"ef4f46d1-d13a-420a-87c3-19fbf6ffa231","facets":{"spark.logicalPlan":{"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.22.0/integration/spark","schemaURL":"https://openlineage.io/spec/1-0-5/OpenLineage.json#/$defs/RunFacet","plan":[{"class":"org.apache.spark.sql.catalyst.plans.logical.CreateTableAsSelect","num-children":2,"name":0,"partitioning":[],"query":1,"tableSpec":null,"writeOptions":null,"ignoreIfExists":false},{"class":"org.apache.spark.sql.catalyst.analysis.ResolvedTableName","num-children":0,"catalog":null,"ident":null},{"class":"org.apache.spark.sql.catalyst.plans.logical.Project","num-children":1,"projectList":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num_children":0,"name":"workorderid","dataType":"integer","nullable":true,"metadata":{},"exprId":{"product-cl

      @@ -97315,7 +97321,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-20 09:19:37
      -

      *Thread Reply:* Ok, great. This means the issue is related to Spark <-> Marquez connection

      +

      *Thread Reply:* Ok, great. This means the issue is related to Spark <-> Marquez connection

      @@ -97341,7 +97347,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-20 09:20:33
      -

      *Thread Reply:* Some time ago Spark config has changed and here is the up-to-date-documentation: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark

      +

      *Thread Reply:* Some time ago Spark config has changed and here is the up-to-date-documentation: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark

      @@ -97367,7 +97373,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-20 09:21:10

*Thread Reply:* please note that spark.openlineage.transport.url has to be used, which is different from what you have in the screenshot attached; see the sketch below
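
A sketch of the up-to-date HTTP transport settings, assuming a Marquez instance at http://localhost:5000; the URL and namespace are placeholders:

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("ol-http-example")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.22.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")
    .config("spark.openlineage.namespace", "my-namespace")
    .getOrCreate()
)
```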

      @@ -97393,7 +97399,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-20 09:22:40
      -

      *Thread Reply:* You mean instead of "spark.openlineage.host" I need to use "spark.openlineage.transport.url"?

      +

      *Thread Reply:* You mean instead of "spark.openlineage.host" I need to use "spark.openlineage.transport.url"?

      @@ -97419,7 +97425,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-20 09:23:04
      -

      *Thread Reply:* yes, please give it a try

      +

      *Thread Reply:* yes, please give it a try

      @@ -97445,7 +97451,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-20 09:23:40
      -

      *Thread Reply:* sure will give a try and let you know the outcome

      +

      *Thread Reply:* sure will give a try and let you know the outcome

      @@ -97471,7 +97477,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-20 09:23:48
      -

      *Thread Reply:* and set spark.openlineage.transport.type to http

      +

      *Thread Reply:* and set spark.openlineage.transport.type to http

      @@ -97497,7 +97503,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-20 09:24:04
      -

      *Thread Reply:* okay

      +

      *Thread Reply:* okay

      @@ -97523,7 +97529,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-20 09:26:42

*Thread Reply:* Do these configs suffice, or do I need to add anything else?

spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.consoleTransport true

@@ -97555,7 +97561,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-20 09:27:07
      -

      *Thread Reply:* spark.openlineage.consoleTransport true this one can be removed

      +

      *Thread Reply:* spark.openlineage.consoleTransport true this one can be removed

      @@ -97581,7 +97587,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-20 09:27:33
      -

      *Thread Reply:* otherwise shall be OK

      +

      *Thread Reply:* otherwise shall be OK

      @@ -97607,7 +97613,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-20 10:01:30

*Thread Reply:* I added these configs and ran it, but still the same issue. Now I am not able to see the events in the log file either.

      @@ -97633,7 +97639,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-20 10:04:27

*Thread Reply:* 23/04/20 13:51:22 INFO SparkSQLExecutionContext: OpenLineage received Spark event that is configured to be skipped: SparkListenerSQLExecutionStart
23/04/20 13:51:22 INFO SparkSQLExecutionContext: OpenLineage received Spark event that is configured to be skipped: SparkListenerSQLExecutionEnd

Does this need any changes on the config side?

      @@ -97773,7 +97779,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-24 06:49:43
      -

      *Thread Reply:* @Anirudh Shrinivason Thanks for linking this as it contains a clear explanation on Spark catalogs. However, I am still unable to write a failing integration test that reproduces the scenario. Could you provide an example of Spark which is failing on V2SessionCatalog and provide more details how are you trying to read/write data?

      +

      *Thread Reply:* @Anirudh Shrinivason Thanks for linking this as it contains a clear explanation on Spark catalogs. However, I am still unable to write a failing integration test that reproduces the scenario. Could you provide an example of Spark which is failing on V2SessionCatalog and provide more details how are you trying to read/write data?

      @@ -97799,7 +97805,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-24 07:14:04

*Thread Reply:* Hi @Paweł Leszczyński I noticed this issue on one of our pipelines before, actually. Unfortunately, I didn't note down which pipeline the issue was occurring in. I'll keep checking from my end to identify the Spark job that ran into this error. In the meantime, I'll also try to see in which cases deltaCatalog makes use of the V2SessionCatalog to understand this better. Thanks!

      @@ -97825,7 +97831,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-26 03:44:15

*Thread Reply:* Hi @Paweł Leszczyński
'''
CREATE TABLE IF NOT EXISTS TABLE_NAME (
SOME COLUMNS

@@ -97863,7 +97869,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-26 03:44:48
      -

      *Thread Reply:* Thanks @Anirudh Shrinivason, will look into that.

      +

      *Thread Reply:* Thanks @Anirudh Shrinivason, will look into that.

      @@ -97889,7 +97895,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-26 05:06:05
      -

      *Thread Reply:* which spark & delta versions are you using?

      +

      *Thread Reply:* which spark & delta versions are you using?

      @@ -97915,7 +97921,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-27 02:35:50

*Thread Reply:* I am not 100% sure if this is something you described, but this was an error I was able to replicate and fix. Please look at the exception stacktrace and let me know if it is the same on your side.
https://github.com/OpenLineage/OpenLineage/pull/1798

      @@ -98037,7 +98043,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-27 02:36:20
      -

      *Thread Reply:* Hi

      +

      *Thread Reply:* Hi

      @@ -98063,7 +98069,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-27 02:36:45
      -

      *Thread Reply:* Hmm actually I am noticing this error on my local

      +

      *Thread Reply:* Hmm actually I am noticing this error on my local

      @@ -98089,7 +98095,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-27 02:37:01
      -

      *Thread Reply:* But on the prod job, I am seeing no such error in the logs...

      +

      *Thread Reply:* But on the prod job, I am seeing no such error in the logs...

      @@ -98115,7 +98121,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-27 02:37:28
      -

      *Thread Reply:* Also, I was using spark 3.1.2

      +

      *Thread Reply:* Also, I was using spark 3.1.2

      @@ -98145,7 +98151,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-27 02:37:39
      -

      *Thread Reply:* then perhaps it's sth different :face_palm: will try to replicate on spark 3.1.2

      +

      *Thread Reply:* then perhaps it's sth different :face_palm: will try to replicate on spark 3.1.2

      @@ -98175,7 +98181,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-27 02:37:42
      -

      *Thread Reply:* Not too sure which delta version the prod job was using...

      +

      *Thread Reply:* Not too sure which delta version the prod job was using...

      @@ -98201,7 +98207,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-28 03:30:49

*Thread Reply:* I was running the following command on Spark 3.1.2:
spark.sql(
    "CREATE TABLE t_partitioned (a int, b int) USING delta " +
    "PARTITIONED BY (a) LOCATION '/tmp/delta/tbl'"
)

@@ -98236,7 +98242,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-28 03:31:47
      -

      *Thread Reply:* Oh... hmm... that is strange. Let me check more from my end too

      +

      *Thread Reply:* Oh... hmm... that is strange. Let me check more from my end too

      @@ -98262,7 +98268,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-28 03:33:01
      -

      *Thread Reply:* for spark 3.1, we're using delta 1.0.0

      +

      *Thread Reply:* for spark 3.1, we're using delta 1.0.0

      @@ -98320,7 +98326,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-24 05:38:09

*Thread Reply:* Hi Cory, I think the definition of ParentRunFacet (https://openlineage.io/docs/spec/facets/run-facets/parent_run) contains the answer to that:
Commonly, scheduler systems like Apache Airflow will trigger processes on remote systems, such as on Apache Spark or Apache Beam jobs. Those systems might have their own OpenLineage integration and report their own job runs and dataset inputs/outputs. The ParentRunFacet allows those downstream jobs to report which jobs spawned them to preserve job hierarchy. To do that, the scheduler system should have a way to pass its own job and run id to the child job.
For example, when Airflow is used to run a Spark job, we want the Spark events to contain some information on what triggered the Spark job, and the parameters you ask about are used to pass that information from the Airflow operator to the Spark job.
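
For reference, a parent facet attached to a child job's run might look like this sketch; the identifiers are made up, and the _producer/_schemaURL values are placeholders:

```
parent_run_facet = {
    "parent": {
        "_producer": "https://github.com/OpenLineage/OpenLineage",
        "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/ParentRunFacet.json",
        # the scheduler's own run and job, passed down to the child job
        "run": {"runId": "d46e465b-d358-4d32-83d4-df660ff614dd"},
        "job": {"namespace": "my-airflow-namespace", "name": "my_dag.my_task"},
    }
}
```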

      MARQUEZAPIKEY=[YOURAPIKEY]
      2023-04-26 17:28:39

*Thread Reply:* Thank you for pointing me at this documentation; I did not see it previously. In my setup, the calling system is AWS Step Functions, which has no integration with OpenLineage.

      So I've been essentially passing non-existing parent job information to OpenLineage. It has been useful as a data point for searches and reporting though.

      @@ -98399,7 +98405,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-27 04:59:39

*Thread Reply:* I think the parentRunId should be the same for the OpenLineage START and COMPLETE events. Is it like this in your case?

      @@ -98425,7 +98431,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-03 11:13:58

*Thread Reply:* That makes sense, and based on my configuration, I would think that it would be. However, given that I am seeing incomplete jobs in Marquez, I'm wondering if somehow the parentRunId is changing. I need to investigate.

      @@ -98525,7 +98531,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-21 09:06:06
      -

      *Thread Reply:* I think @Michael Robinson has to manually promote artifacts

      +

      *Thread Reply:* I think @Michael Robinson has to manually promote artifacts

      @@ -98551,7 +98557,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-21 09:08:06
      -

      *Thread Reply:* I promoted the artifacts, but there is a delay before they appear in Maven. A couple releases ago, the delay was about 24 hours long

      +

      *Thread Reply:* I promoted the artifacts, but there is a delay before they appear in Maven. A couple releases ago, the delay was about 24 hours long

      @@ -98577,7 +98583,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-21 09:26:09
      -

      *Thread Reply:* Ahh I see... Thanks!

      +

      *Thread Reply:* Ahh I see... Thanks!

      @@ -98603,7 +98609,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-21 10:10:38
      -

      *Thread Reply:* @Anirudh Shrinivason are you using search.maven.org by chance? Version 0.23.0 is not appearing there yet, but I do see it on central.sonatype.com.

      +

      *Thread Reply:* @Anirudh Shrinivason are you using search.maven.org by chance? Version 0.23.0 is not appearing there yet, but I do see it on central.sonatype.com.

      @@ -98629,7 +98635,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-21 10:15:00
      -

      *Thread Reply:* Hmm I can see it now on search.maven.org actually. But I still cannot see it on https://mvnrepository.com/artifact/io.openlineage/openlineage-spark ...

      +

      *Thread Reply:* Hmm I can see it now on search.maven.org actually. But I still cannot see it on https://mvnrepository.com/artifact/io.openlineage/openlineage-spark ...

      @@ -98655,7 +98661,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-21 10:19:38
      -

      *Thread Reply:* Understood. I believe you can download the 0.23.0 jars from central.sonatype.com. For Spark, try going here: https://central.sonatype.com/artifact/io.openlineage/openlineage-spark/0.23.0/versions

      +

      *Thread Reply:* Understood. I believe you can download the 0.23.0 jars from central.sonatype.com. For Spark, try going here: https://central.sonatype.com/artifact/io.openlineage/openlineage-spark/0.23.0/versions

      @@ -98702,7 +98708,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-22 06:11:10

*Thread Reply:* Yup. I can see it on all maven repos now haha. I think it's just the delay.

      @@ -98728,7 +98734,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-22 06:11:18
      -

      *Thread Reply:* ~24 hours ig

      +

      *Thread Reply:* ~24 hours ig

      @@ -98754,7 +98760,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-24 16:49:15
      -

      *Thread Reply:* 👍

      +

      *Thread Reply:* 👍

      @@ -98806,7 +98812,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-21 09:28:18
      -

      *Thread Reply:* Good news, everyone! The login worked on the second attempt after starting the Docker containers. Although it's unclear why it failed the first time.

      +

      *Thread Reply:* Good news, everyone! The login worked on the second attempt after starting the Docker containers. Although it's unclear why it failed the first time.

      @@ -98873,7 +98879,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-24 02:32:51
      -

      *Thread Reply:* Hi Natalie, it's wonderful to hear you're planning to contribute. Yes, you're right about TransportFactory . What other transport type was in your mind? If it is something generic, then it is surely OK to include it within TransportFactory. If it is a custom feature, we could follow ServiceLoader pattern that we're using to allow including custom plan visitors and dataset builders.

      +

      *Thread Reply:* Hi Natalie, it's wonderful to hear you're planning to contribute. Yes, you're right about TransportFactory . What other transport type was in your mind? If it is something generic, then it is surely OK to include it within TransportFactory. If it is a custom feature, we could follow ServiceLoader pattern that we're using to allow including custom plan visitors and dataset builders.

      @@ -98899,7 +98905,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-24 02:54:40

*Thread Reply:* Hi @Paweł Leszczyński
Yes, I was planning to change TransportFactory to support custom/generic transport types using the ServiceLoader pattern. After this change is done, I will be able to use our custom OpenMetadataTransport without changing anything in OpenLineage core. For now I don't have other types in mind, but once we add the customization support, anyone will be able to create their own transport type and report the lineage to different backends.

      @@ -98930,7 +98936,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-24 03:28:30
      -

      *Thread Reply:* Perhaps it's not strictly related to this particular usecase, but you may also find interesting our recent PoC about Fluentd & Openlineage integration. This will bring some cool backend features like: copy event and send it to multiple backends, send it to backends supported by fluentd output plugins etc. https://github.com/OpenLineage/OpenLineage/pull/1757/files?short_path=4fc5534#diff-4fc55343748f353fa1def0e00c553caa735f9adcb0da18baad50a989c0f2e935

      +

      *Thread Reply:* Perhaps it's not strictly related to this particular usecase, but you may also find interesting our recent PoC about Fluentd & Openlineage integration. This will bring some cool backend features like: copy event and send it to multiple backends, send it to backends supported by fluentd output plugins etc. https://github.com/OpenLineage/OpenLineage/pull/1757/files?short_path=4fc5534#diff-4fc55343748f353fa1def0e00c553caa735f9adcb0da18baad50a989c0f2e935

      @@ -98956,7 +98962,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-24 05:36:24
      -

      *Thread Reply:* Sounds interesting. Thanks, I will look into it

      +

      *Thread Reply:* Sounds interesting. Thanks, I will look into it

      @@ -99068,7 +99074,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-25 08:07:36
      -

      *Thread Reply:* What's the extract_openlineage.py file? Looks like your code?

      +

      *Thread Reply:* What's the extract_openlineage.py file? Looks like your code?

      @@ -99103,7 +99109,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-25 08:43:04

*Thread Reply:* import json
import os
from pendulum import datetime

      @@ -99204,7 +99210,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-25 09:52:33

*Thread Reply:* OpenLineageClient expects RunEvent classes and you're sending it raw JSON. I think at this point your options are either sending the events by constructing your own HTTP client, using something like requests, or using something like https://github.com/python-attrs/cattrs to structure the JSON into a RunEvent; see the sketch below
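
A sketch of the first option, assuming Marquez's standard /api/v1/lineage endpoint and a raw event dict named ol_event:

```
import requests

ol_event = {"eventType": "COMPLETE", "...": "..."}  # raw event from the Snowflake view

resp = requests.post(
    "http://localhost:5000/api/v1/lineage",  # Marquez endpoint (placeholder)
    json=ol_event,
    timeout=10,
)
resp.raise_for_status()
```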

      @@ -99259,7 +99265,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-25 10:05:57
      -

      *Thread Reply:* @Jakub Dardziński suggested that you can +

      *Thread Reply:* @Jakub Dardziński suggested that you can change client.emit(ol_event) to client.transport.emit(ol_event) and it should work

      @@ -99290,7 +99296,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-25 12:24:08
      -

      *Thread Reply:* @Maciej Obuchowski I believe this is from https://github.com/Snowflake-Labs/OpenLineage-AccessHistory-Setup/blob/main/examples/airflow/dags/lineage/extract_openlineage.py

      +

      *Thread Reply:* @Maciej Obuchowski I believe this is from https://github.com/Snowflake-Labs/OpenLineage-AccessHistory-Setup/blob/main/examples/airflow/dags/lineage/extract_openlineage.py

      @@ -99418,7 +99424,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-25 12:25:26
      -

      *Thread Reply:* I believe this example no longer works - perhaps a new access history pull/push example could be created that is simpler and doesn’t use airflow.

      +

      *Thread Reply:* I believe this example no longer works - perhaps a new access history pull/push example could be created that is simpler and doesn’t use airflow.

      @@ -99448,7 +99454,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-26 08:34:02

*Thread Reply:* I think separating the actual retrieval of data from the view from the Airflow DAG would make sense

      @@ -99474,7 +99480,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-26 13:57:34

*Thread Reply:* Yeah - I also think that Airflow confuses the issue. You don’t need Airflow to get lineage from Snowflake Access History; the only reasons Airflow is in the example are a) to simulate a pipeline that can be viewed in Marquez and b) to establish a mechanism that regularly pulls and emits lineage…

      but most people will already have A, and the simplest example doesn’t need to accomplish B.

      @@ -99502,7 +99508,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-26 13:58:59
      -

      *Thread Reply:* just a few weeks ago 🙂 I was working on a script that you could run like SNOWFLAKE_USER=foo ./process_snowflake_lineage.py --from-date=xxxx-xx-xx --to-date=xxxx-xx-xx

      +

      *Thread Reply:* just a few weeks ago 🙂 I was working on a script that you could run like SNOWFLAKE_USER=foo ./process_snowflake_lineage.py --from-date=xxxx-xx-xx --to-date=xxxx-xx-xx

      @@ -99528,7 +99534,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-27 11:13:58
      -

      *Thread Reply:* Hi @Ross Turk! Do you have a link to this script? Perhaps this script can fix the connection issue 🙂

      +

      *Thread Reply:* Hi @Ross Turk! Do you have a link to this script? Perhaps this script can fix the connection issue 🙂

      @@ -99554,7 +99560,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-27 11:47:20
      -

      *Thread Reply:* No, it never became functional before I stopped to take on another task 😕

      +

      *Thread Reply:* No, it never became functional before I stopped to take on another task 😕

      @@ -99611,7 +99617,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-25 09:54:07
      -

      *Thread Reply:* Looks like the first URL is proper, but there's something wrong with entity - Marquez logs would help here.

      +

      *Thread Reply:* Looks like the first URL is proper, but there's something wrong with entity - Marquez logs would help here.

      @@ -99637,7 +99643,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-25 09:57:36

*Thread Reply:* This is my log in Airflow; can you please provide more info on it?

      @@ -99672,7 +99678,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-25 10:13:37

*Thread Reply:* The Airflow log does not tell us why Marquez rejected the event. Marquez logs would be more helpful

      @@ -99698,7 +99704,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-26 05:48:08

*Thread Reply:* We investigated the Marquez container logs and were unable to locate the error. Could you please specify which log file belongs to Marquez when connecting Airflow or Snowflake?

Is it correct that the marquez-web log points to <http://api:5000/>?
[HPM] Proxy created: /api/v1 -> <http://api:5000/>

@@ -99750,7 +99756,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-26 11:26:36

*Thread Reply:* I have the same error at the moment but can provide some additional screenshots. The event data in Snowflake seems fine and the data is being retrieved correctly by the Airflow DAG. However, there seems to be a warning in the Marquez API logs. Hopefully we can troubleshoot this together!

      @@ -99794,7 +99800,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-26 11:33:35

*Thread Reply:*

      @@ -99838,7 +99844,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-26 13:06:30

*Thread Reply:* Possibly the Python part in between does some weird things, like double-JSONing the data? I can imagine it being wrapped in a second, unnecessary JSON object

      @@ -99864,7 +99870,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-26 13:08:18

*Thread Reply:* I guess the only way to check is to print one of those events - in the form they are sent in the Python part, not Snowflake - and see what they look like. For example, using ConsoleTransport or setting the DEBUG log level in Airflow

      @@ -99890,7 +99896,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-26 14:37:32

*Thread Reply:* Here is a code snippet using DEBUG logging on the Snowflake Python connector:

[2023-04-26T17:16:55.166+0000] {cursor.py:593} DEBUG - binding: [set currentorganization='[PRIVATE]';] with input=[None], processed=[{}]
[2023-04-26T17:16:55.166+0000] {cursor.py:800} INFO - query: [set currentorganization='[PRIVATE]';]

@@ -99978,7 +99984,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-26 14:45:26
      -

      *Thread Reply:* I don't see any Airflow standard logs here, but anyway I looked at it and debugging it would not work if you're bypassing OpenLineageClient.emit and going directly to transport - the logging is done on Client level https://github.com/OpenLineage/OpenLineage/blob/acc207d63e976db7c48384f04bc578409f08cc8a/client/python/openlineage/client/client.py#L73

      +

      *Thread Reply:* I don't see any Airflow standard logs here, but anyway I looked at it and debugging it would not work if you're bypassing OpenLineageClient.emit and going directly to transport - the logging is done on Client level https://github.com/OpenLineage/OpenLineage/blob/acc207d63e976db7c48384f04bc578409f08cc8a/client/python/openlineage/client/client.py#L73
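In other words, a sketch of the difference (from_environment and the attribute names follow the 2023-era Python client, so treat them as assumptions):

```
# OpenLineageClient.emit logs the serialized event at DEBUG before handing it
# to the transport; calling the transport directly skips that logging.
from openlineage.client import OpenLineageClient

client = OpenLineageClient.from_environment()  # reads OPENLINEAGE_URL etc.
# client.transport.emit(event)  # bypasses the client-level DEBUG logging
# client.emit(event)            # logged first, then sent
```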

      @@ -100029,7 +100035,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-27 11:16:20
      -

      *Thread Reply:* I'm sorry, do you have a code snippet on how to get these logs from https://github.com/Snowflake-Labs/OpenLineage-AccessHistory-Setup/blob/main/examples/airflow/dags/lineage/extract_openlineage.py? I still get the ValueError for OpenLineageClient.emit

      +

      *Thread Reply:* I'm sorry, do you have a code snippet on how to get these logs from https://github.com/Snowflake-Labs/OpenLineage-AccessHistory-Setup/blob/main/examples/airflow/dags/lineage/extract_openlineage.py? I still get the ValueError for OpenLineageClient.emit

      @@ -100157,7 +100163,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-04 10:56:34
      -

      *Thread Reply:* Hey does anyone have an idea on this? I'm still stuck on this issue 😞

      +

      *Thread Reply:* Hey does anyone have an idea on this? I'm still stuck on this issue 😞

      @@ -100183,7 +100189,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-05 08:58:49
      -

      *Thread Reply:* I've found the root cause. It's because facets don't have _producer and _schemaURL set. I'll provide a fix soon

      +

      *Thread Reply:* I've found the root cause. It's because facets don't have _producer and _schemaURL set. I'll provide a fix soon

      @@ -100274,7 +100280,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-26 11:36:57
      -

      *Thread Reply:* I’ll be there! I’m looking forward to see you all.

      +

      *Thread Reply:* I’ll be there! I’m looking forward to see you all.

      @@ -100300,7 +100306,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-26 11:37:23
      -

      *Thread Reply:* We’ll talk about the evolution of the spec.

      +

      *Thread Reply:* We’ll talk about the evolution of the spec.

      @@ -100358,7 +100364,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-28 04:23:24
      -

*Thread Reply:* I've created an integration test based on your example. The OpenLineage event gets sent; however, it does not contain an output dataset. I will look deeper into that.

+

*Thread Reply:* I've created an integration test based on your example. The OpenLineage event gets sent; however, it does not contain an output dataset. I will look deeper into that.

      @@ -100388,7 +100394,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-28 08:55:43
      -

      *Thread Reply:* Hey, sorry do you mean input dataset is empty? Or output dataset?

      +

      *Thread Reply:* Hey, sorry do you mean input dataset is empty? Or output dataset?

      @@ -100414,7 +100420,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-28 08:55:51
      -

      *Thread Reply:* I am seeing that input dataset is empty

      +

      *Thread Reply:* I am seeing that input dataset is empty

      @@ -100440,7 +100446,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-28 08:56:05
      -

      *Thread Reply:* ooh, I see input datasets

      +

      *Thread Reply:* ooh, I see input datasets

      @@ -100466,7 +100472,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-28 08:56:11
      -

      *Thread Reply:* Hmm

      +

      *Thread Reply:* Hmm

      @@ -100492,7 +100498,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-28 08:56:12
      -

      *Thread Reply:* I see

      +

      *Thread Reply:* I see

      @@ -100518,7 +100524,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-28 08:57:07
      -

*Thread Reply:* I created a test method in the SparkDeltaIntegrationTest class:

+

*Thread Reply:* I created a test method in the SparkDeltaIntegrationTest class:
```@Test
void testDeltaMergeInto() {
  Dataset<Row> dataset =
```

@@ -100579,7 +100585,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-28 08:59:14
      -

      *Thread Reply:* Oh yeah my bad. I am seeing output dataset is empty.

      +

      *Thread Reply:* Oh yeah my bad. I am seeing output dataset is empty.

      @@ -100605,7 +100611,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-28 08:59:21
      -

      *Thread Reply:* Checks out with your observation

      +

      *Thread Reply:* Checks out with your observation

      @@ -100631,7 +100637,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-03 23:23:36
      -

      *Thread Reply:* Hi @Paweł Leszczyński just curious, has a fix for this been implemented alr?

      +

      *Thread Reply:* Hi @Paweł Leszczyński just curious, has a fix for this been implemented alr?

      @@ -100657,7 +100663,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-04 02:40:11
      -

      *Thread Reply:* Hi @Anirudh Shrinivason, I had some days ooo. I will look into this soon.

      +

      *Thread Reply:* Hi @Anirudh Shrinivason, I had some days ooo. I will look into this soon.

      @@ -100683,7 +100689,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-04 07:37:52
      -

      *Thread Reply:* Ahh okie! Thanks so much! Hope you had a good rest!

      +

      *Thread Reply:* Ahh okie! Thanks so much! Hope you had a good rest!

      @@ -100709,7 +100715,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-04 07:38:38
      -

      *Thread Reply:* yeah. this was an amazing extended weekend 😉

      +

      *Thread Reply:* yeah. this was an amazing extended weekend 😉

      @@ -100739,7 +100745,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-05 02:09:10
      -

      *Thread Reply:* This should be it: https://github.com/OpenLineage/OpenLineage/pull/1823

      +

      *Thread Reply:* This should be it: https://github.com/OpenLineage/OpenLineage/pull/1823

      @@ -100838,7 +100844,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-05 02:43:24
      -

*Thread Reply:* Hi @Anirudh Shrinivason, please let me know if there is still something to be done within #1747 [PROPOSAL] Support for V2SessionCatalog. I could not reproduce exactly what you described but fixed some issue nearby.

+

*Thread Reply:* Hi @Anirudh Shrinivason, please let me know if there is still something to be done within #1747 [PROPOSAL] Support for V2SessionCatalog. I could not reproduce exactly what you described but fixed some issue nearby.

      @@ -100864,7 +100870,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-05 02:49:38
      -

      *Thread Reply:* Hmm yeah sure let me find out the exact cause of the issue. The pipeline that was causing the issue is now inactive haha. So I'm trying to backtrace from the limited logs I captured last time. Let me get back by next week thanks! 🙇

      +

      *Thread Reply:* Hmm yeah sure let me find out the exact cause of the issue. The pipeline that was causing the issue is now inactive haha. So I'm trying to backtrace from the limited logs I captured last time. Let me get back by next week thanks! 🙇

      @@ -100890,7 +100896,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-05 09:35:00
      -

      *Thread Reply:* Hi @Paweł Leszczyński I was trying to replicate the issue from my end, but couldn't do so. I think we can close the issue for now, and revisit later on if the issue resurfaces. Does that sound okay?

      +

      *Thread Reply:* Hi @Paweł Leszczyński I was trying to replicate the issue from my end, but couldn't do so. I think we can close the issue for now, and revisit later on if the issue resurfaces. Does that sound okay?

      @@ -100916,7 +100922,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-05 09:40:33
      -

      *Thread Reply:* sounds cool. we can surely create a new issue later on.

      +

      *Thread Reply:* sounds cool. we can surely create a new issue later on.

      @@ -100946,7 +100952,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-09 23:34:04
      -

*Thread Reply:* @Paweł Leszczyński - I was trying to implement these new changes in databricks. I was wondering which Java file I should use for building the jar file? Could you please help me?

+

*Thread Reply:* @Paweł Leszczyński - I was trying to implement these new changes in databricks. I was wondering which Java file I should use for building the jar file? Could you please help me?

      @@ -100972,7 +100978,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-09 00:46:34
      -

      *Thread Reply:* .

      +

      *Thread Reply:* .

      @@ -100998,7 +101004,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-09 02:37:49
      -

*Thread Reply:* Hi, I found that these merge operations have no input datasets/col lineage:

+

*Thread Reply:* Hi, I found that these merge operations have no input datasets/col lineage:
```df.write.format(fileformat).mode(mode).option("mergeSchema", mergeschema).option("overwriteSchema", overwriteSchema).save(path)

df.write.format(fileformat).mode(mode).option("mergeSchema", mergeschema).option("overwriteSchema", overwriteSchema)\
```

@@ -101038,7 +101044,7 @@
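For reference, a self-contained merge of the kind quoted above might look like this (an illustrative sketch: table and column names are invented, and a SparkSession `spark` with Delta Lake configured is assumed):

```
# Minimal Delta MERGE INTO; in the report above, the resulting OpenLineage
# event carried no input datasets or column lineage.
spark.sql(
    """
    MERGE INTO target t
    USING updates u
    ON t.id = u.id
    WHEN MATCHED THEN UPDATE SET t.value = u.value
    WHEN NOT MATCHED THEN INSERT (id, value) VALUES (u.id, u.value)
    """
)
```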

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-09 02:41:24
      -

      *Thread Reply:* Hi @Anirudh Shrinivason, great to hear from you. Could you create an issue out of this? I am working at the moment on Spark 3.4. Once this is ready, I will look at the spark issues. And this one seems to be nicely reproducible. Thanks for that.

      +

      *Thread Reply:* Hi @Anirudh Shrinivason, great to hear from you. Could you create an issue out of this? I am working at the moment on Spark 3.4. Once this is ready, I will look at the spark issues. And this one seems to be nicely reproducible. Thanks for that.

      @@ -101068,7 +101074,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-09 02:49:56
      -

      *Thread Reply:* Sure let me create an issue! Thanks!

      +

      *Thread Reply:* Sure let me create an issue! Thanks!

      @@ -101094,7 +101100,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-09 02:55:21
      -

*Thread Reply:* Created an issue here! https://github.com/OpenLineage/OpenLineage/issues/1919

+

*Thread Reply:* Created an issue here! https://github.com/OpenLineage/OpenLineage/issues/1919
Thanks! 🙇

      @@ -101162,7 +101168,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-15 10:39:50
      -

*Thread Reply:* Hi @Paweł Leszczyński I just realised, https://github.com/OpenLineage/OpenLineage/pull/1823/files

+

*Thread Reply:* Hi @Paweł Leszczyński I just realised, https://github.com/OpenLineage/OpenLineage/pull/1823/files
This PR doesn't actually capture column lineage for the MergeIntoCommand? It looks like there is no column-lineage field in the events JSON.

      @@ -101189,7 +101195,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-17 04:21:24
      -

      *Thread Reply:* Hi @Paweł Leszczyński Is there a potential timeline in mind to support column lineage for the MergeIntoCommand? We're really excited for this feature and would be a huge help to overcome a current blocker. Thanks!

      +

      *Thread Reply:* Hi @Paweł Leszczyński Is there a potential timeline in mind to support column lineage for the MergeIntoCommand? We're really excited for this feature and would be a huge help to overcome a current blocker. Thanks!

      @@ -101315,7 +101321,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-01 12:02:13
      -

      *Thread Reply:* “producer” is an element of the event run itself - e.g. what produced the JSON packet you’re studying. There is only one of these per event run. You can think of it as a top-level property.

      +

      *Thread Reply:* “producer” is an element of the event run itself - e.g. what produced the JSON packet you’re studying. There is only one of these per event run. You can think of it as a top-level property.

“_producer” (and “_schemaURL”) are elements of a facet. They are the 2 required elements for any customized facet (though I don’t agree they should be required, or at least I believe they should be able to be compatible with a blank value and a null value).

      @@ -101345,7 +101351,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-01 12:06:52
      -

      *Thread Reply:* just curious --- is/was there any specific reason for the underscore prefix? If they are in a facet, they would already be qualified.......

      +

      *Thread Reply:* just curious --- is/was there any specific reason for the underscore prefix? If they are in a facet, they would already be qualified.......

      @@ -101371,7 +101377,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-01 13:13:28
      -

      *Thread Reply:* The facet “BaseFacet” that’s used for customization, has 2 required elements - _producer and _schemaURL. so I don’t believe it’s related to qualification.

      +

      *Thread Reply:* The facet “BaseFacet” that’s used for customization, has 2 required elements - _producer and _schemaURL. so I don’t believe it’s related to qualification.

      @@ -101442,7 +101448,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-02 19:43:12
      -

      *Thread Reply:* Thanks for voting. The release will commence within 2 days.

      +

      *Thread Reply:* Thanks for voting. The release will commence within 2 days.

      @@ -101494,7 +101500,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-04 02:33:32
      -

      *Thread Reply:* Although it is not documented, we do have an integration test for that: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/test/resources/spark_scripts/spark_kafka.py

      +

      *Thread Reply:* Although it is not documented, we do have an integration test for that: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/test/resources/spark_scripts/spark_kafka.py

      The test reads and writes data to Kafka and verifies if input/output datasets are collected.
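A stripped-down read/write pair of the kind that test exercises could look like this (a sketch, not the test's verbatim code; it assumes a broker on localhost:9092, the spark-sql-kafka package on the classpath, and an existing SparkSession `spark`):

```
# Batch-read one Kafka topic and write to another; with the OpenLineage
# listener attached, both topics should show up as input/output datasets.
df = (
    spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "input_topic")
    .load()
)
(
    df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .write.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "output_topic")
    .save()
)
```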

      MARQUEZAPIKEY=[YOURAPIKEY]
      2023-05-01 22:37:25
      -

      *Thread Reply:* From my experience, yeah it works for pyspark

      +

      *Thread Reply:* From my experience, yeah it works for pyspark

      @@ -101670,7 +101676,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-01 22:39:18
      -

      *Thread Reply:* Also from my experience, I think we can only use one of them as we can only configure one spark listener... correct me if I'm wrong. But it seems like the latest releases of spline are already using openlineage to some capacity?

      +

      *Thread Reply:* Also from my experience, I think we can only use one of them as we can only configure one spark listener... correct me if I'm wrong. But it seems like the latest releases of spline are already using openlineage to some capacity?

      @@ -101700,7 +101706,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-08 09:46:15
      -

      *Thread Reply:* In spark.extraListeners you can configure multiple listeners by comma separating them - I think you can use multiple ones with OpenLineage without obvious problems. I think we do pretty similar things to Spline though

      +

      *Thread Reply:* In spark.extraListeners you can configure multiple listeners by comma separating them - I think you can use multiple ones with OpenLineage without obvious problems. I think we do pretty similar things to Spline though
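For illustration, the comma-separated form looks like this in PySpark (the second class name is a hypothetical placeholder; only OpenLineageSparkListener is a real class here):

```
# Register the OpenLineage listener alongside another SparkListener.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("multi-listener-demo")
    .config(
        "spark.extraListeners",
        "io.openlineage.spark.agent.OpenLineageSparkListener,"
        "com.example.MyOtherListener",  # placeholder for a second listener
    )
    .getOrCreate()
)
```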

      @@ -101734,7 +101740,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-25 11:28:41
      -

      *Thread Reply:* (I never said thank you for this, so, thank you!)

      +

      *Thread Reply:* (I never said thank you for this, so, thank you!)

      @@ -101797,7 +101803,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-02 10:45:37
      -

      *Thread Reply:* Hello Sai, thanks for your question! A number of folks who could help with this are OOO, but someone will reply as soon as possible.

      +

      *Thread Reply:* Hello Sai, thanks for your question! A number of folks who could help with this are OOO, but someone will reply as soon as possible.

      @@ -101823,7 +101829,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-04 02:44:46
      -

      *Thread Reply:* That is interesting @Sai. Are you able to reproduce this with a simple code snippet? Which Openlineage version are you using?

      +

      *Thread Reply:* That is interesting @Sai. Are you able to reproduce this with a simple code snippet? Which Openlineage version are you using?

      @@ -101849,7 +101855,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-05 01:16:20
      -

*Thread Reply:* Yes @Paweł Leszczyński. Each join query I run on top of delta tables has two start and two complete events. We are using the below jar for openlineage.

+

*Thread Reply:* Yes @Paweł Leszczyński. Each join query I run on top of delta tables has two start and two complete events. We are using the below jar for openlineage.

      openlineage-spark-0.22.0.jar

      @@ -101881,7 +101887,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-05 02:41:26
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1828

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1828

      @@ -101937,7 +101943,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-08 04:05:26
      -

      *Thread Reply:* Hi @Paweł Leszczyński any updates on this issue?

      +

      *Thread Reply:* Hi @Paweł Leszczyński any updates on this issue?

      Also, OL is not giving column level lineage for group by operations on tables. Is this expected?

      @@ -101965,7 +101971,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-08 04:07:04
      -

      *Thread Reply:* Hi @Sai, https://github.com/OpenLineage/OpenLineage/pull/1830 should fix duplication issue

      +

      *Thread Reply:* Hi @Sai, https://github.com/OpenLineage/OpenLineage/pull/1830 should fix duplication issue

      @@ -102059,7 +102065,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-08 04:08:06
      -

      *Thread Reply:* this would be part of next release?

      +

      *Thread Reply:* this would be part of next release?

      @@ -102085,7 +102091,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-08 04:08:30
      -

      *Thread Reply:* Regarding column lineage & group by issue, I think it's something on databricks side -> we do have an open issue for that #1821

      +

      *Thread Reply:* Regarding column lineage & group by issue, I think it's something on databricks side -> we do have an open issue for that #1821

      @@ -102111,7 +102117,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-08 04:09:24
      -

*Thread Reply:* once #1830 is reviewed and merged, it will be part of the next release

+

*Thread Reply:* once #1830 is reviewed and merged, it will be part of the next release

      @@ -102137,7 +102143,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-08 04:11:01
      -

      *Thread Reply:* sure.. thanks @Paweł Leszczyński

      +

      *Thread Reply:* sure.. thanks @Paweł Leszczyński

      @@ -102163,7 +102169,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-16 03:27:01
      -

      *Thread Reply:* @Paweł Leszczyński I have used the latest jar (0.25.0) and still this issue persists. I see two events for same input/output lineage.

      +

      *Thread Reply:* @Paweł Leszczyński I have used the latest jar (0.25.0) and still this issue persists. I see two events for same input/output lineage.

      @@ -102215,7 +102221,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-03 04:06:37
      -

*Thread Reply:* For example, MySQL is stored as producer + host + port + database + table as something like <mysql://db.foo.com:6543/metrics.orders>

+

*Thread Reply:* For example, MySQL is stored as producer + host + port + database + table as something like <mysql://db.foo.com:6543/metrics.orders>
For an API (especially one following REST conventions), I was thinking something like method + host + port + path or GET <https://api.service.com:433/v1/users>
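As a rough sketch of that proposal (the convention is the poster's idea, not part of the OpenLineage naming spec; the client's Dataset class is used only to show where the strings would go):

```
# Encode a REST endpoint as an OpenLineage dataset per the proposed
# method + host + port + path convention (illustrative values only).
from openlineage.client.run import Dataset

api_dataset = Dataset(
    namespace="https://api.service.com:433",
    name="GET /v1/users",
)
```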

      @@ -102242,7 +102248,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-03 10:13:25
      -

      *Thread Reply:* Hi Thomas, thanks for asking about this — it sounds cool! I don’t know of others working on this kind of thing, but I’ve been developing a SQLAlchemy integration and have been experimenting with job naming — which I realize isn’t exactly what you’re working on. Hopefully others will chime in here, but in the meantime, would you be willing to create an issue about this? It seems worth discussing how we could expand the spec for this kind of use case.

      +

      *Thread Reply:* Hi Thomas, thanks for asking about this — it sounds cool! I don’t know of others working on this kind of thing, but I’ve been developing a SQLAlchemy integration and have been experimenting with job naming — which I realize isn’t exactly what you’re working on. Hopefully others will chime in here, but in the meantime, would you be willing to create an issue about this? It seems worth discussing how we could expand the spec for this kind of use case.

      @@ -102268,7 +102274,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-03 10:58:32
      -

      *Thread Reply:* I suspect this will definitely be a bigger discussion. Let me ponder on the problem a bit more and come back with something a bit more concrete.

      +

      *Thread Reply:* I suspect this will definitely be a bigger discussion. Let me ponder on the problem a bit more and come back with something a bit more concrete.

      @@ -102294,7 +102300,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-03 10:59:21
      -

      *Thread Reply:* Looking forward to hearing more!

      +

      *Thread Reply:* Looking forward to hearing more!

      @@ -102320,7 +102326,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-03 11:05:47
      -

*Thread Reply:* On a tangential note, does OpenLineage's column-level lineage have support for (I see it can be extended but want to know if someone had to map this before):

+

*Thread Reply:* On a tangential note, does OpenLineage's column-level lineage have support for (I see it can be extended but want to know if someone had to map this before):
• Properties as a path in a structure (like a JSON structure, Avro schema, protobuf, etc) maybe using something like JSON Path or XPath notation.
• Fragments (when a column is a JSON blob, there is an entire sub-structure that needs to be described)
• Transformation description (how an input affects an output. Is it a direct copy of the value or is it part of a formula)

      @@ -102349,7 +102355,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-03 11:22:21
      -

      *Thread Reply:* I don’t know, but I’ll ping some folks who might.

      +

      *Thread Reply:* I don’t know, but I’ll ping some folks who might.

      @@ -102375,7 +102381,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-04 03:24:01
      -

*Thread Reply:* Hi @Thomas. Column-lineage support currently does not include JSON fields. We have included in the specification fields like transformationDescription and transformationType to store a string representation of the transformation applied and its type, like IDENTITY|MASKED. However, those fields aren't filled in by the Spark integration at the moment.

+

*Thread Reply:* Hi @Thomas. Column-lineage support currently does not include JSON fields. We have included in the specification fields like transformationDescription and transformationType to store a string representation of the transformation applied and its type, like IDENTITY|MASKED. However, those fields aren't filled in by the Spark integration at the moment.
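For orientation, those fields sit at the output-field level of the column-lineage facet, roughly like this (hand-written illustrative values; check the ColumnLineageDatasetFacet spec for the authoritative shape):

```
# One output field derived from one input field, with the transformation
# fields mentioned above filled in by hand (values are made up).
column_lineage_facet = {
    "fields": {
        "email_masked": {
            "inputFields": [
                {
                    "namespace": "mysql://db.foo.com:6543",
                    "name": "metrics.users",
                    "field": "email",
                }
            ],
            "transformationDescription": "one-way hash of the raw email",
            "transformationType": "MASKED",
        }
    }
}
```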

      @@ -102615,7 +102621,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-08 07:38:19
      -

      *Thread Reply:* Hi @Harshini Devathi, I think this the same as issue: https://github.com/OpenLineage/OpenLineage/issues/1821

      +

      *Thread Reply:* Hi @Harshini Devathi, I think this the same as issue: https://github.com/OpenLineage/OpenLineage/issues/1821

      @@ -102746,7 +102752,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-08 19:44:26
      -

*Thread Reply:* Thank you @Paweł Leszczyński. So, is this an issue with databricks? The issue thread says that it was able to work on AWS Glue. If so, is there some kind of solution to make it work on Databricks?

+

*Thread Reply:* Thank you @Paweł Leszczyński. So, is this an issue with databricks? The issue thread says that it was able to work on AWS Glue. If so, is there some kind of solution to make it work on Databricks?

      @@ -102798,7 +102804,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-09 20:17:38
      -

      *Thread Reply:* maybe @Will Johnson knows?

      +

      *Thread Reply:* maybe @Will Johnson knows?

      @@ -102861,7 +102867,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-09 10:06:02
      -

*Thread Reply:* There are a few possible issues:

+

*Thread Reply:* There are a few possible issues:

1. The column-level lineage is not implemented for a particular part of the Spark LogicalPlan
2. Azure SQL or Databricks have their own implementations of some Spark class, which does not exactly match our extractor. We've seen that happen
3. You're using a SQL JDBC connection with SELECT * - in which case we can't do anything for now, since we don't know the input columns.
4. Possibly something else 🙂 @Paweł Leszczyński might have an idea

To fully understand the issue, we'd have to see logs, the LogicalPlan of the Spark job, or the job code itself

@@ -102891,7 +102897,7 @@

        MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-10 02:35:32
      -

      *Thread Reply:* @Sai, providing a short code snippet that is able to reproduce this would be super helpful in examining that.

      +

      *Thread Reply:* @Sai, providing a short code snippet that is able to reproduce this would be super helpful in examining that.

      @@ -102917,7 +102923,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-10 02:59:24
      -

*Thread Reply:* sure Pawel

+

*Thread Reply:* sure Pawel
Will share the code I used in some time

      @@ -102944,10 +102950,10 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-10 03:37:54
      -

      *Thread Reply:* Here is the code we use.

      +

      *Thread Reply:* Here is the code we use.

@@ -102979,7 +102985,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-16 03:23:13
      -

      *Thread Reply:* Hi Team, Any updates on this?

      +

      *Thread Reply:* Hi Team, Any updates on this?

      @@ -103005,7 +103011,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-16 03:23:37
      -

*Thread Reply:* I tried putting a SQL query with column names in it; the lineage still didn't show up..

+

*Thread Reply:* I tried putting a SQL query with column names in it; the lineage still didn't show up..

      @@ -103082,7 +103088,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-09 10:02:25
      -

      *Thread Reply:* @Anirudh Shrinivason what's your OL and Spark version?

      +

      *Thread Reply:* @Anirudh Shrinivason what's your OL and Spark version?

      @@ -103108,7 +103114,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-09 10:03:29
      -

      *Thread Reply:* Some example job would also help, or logs/LogicalPlan 🙂

      +

      *Thread Reply:* Some example job would also help, or logs/LogicalPlan 🙂

      @@ -103134,7 +103140,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-09 10:05:54
      -

      *Thread Reply:* OL version is 0.23.0 and spark version is 3.3.1

      +

      *Thread Reply:* OL version is 0.23.0 and spark version is 3.3.1

      @@ -103164,7 +103170,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-09 11:00:22
      -

*Thread Reply:* Hmm, actually it seems like the error is intermittent. I ran the same job again, but did not notice any errors this time...

+

*Thread Reply:* Hmm, actually it seems like the error is intermittent. I ran the same job again, but did not notice any errors this time...

      @@ -103190,7 +103196,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-10 02:27:19
      -

*Thread Reply:* This is interesting and it happens within a line:

+

*Thread Reply:* This is interesting and it happens within a line:
job.finalStage().parents().forall(toScalaFn(stage -> stageMap.put(stage.id(), stage)));
The result of stageMap.put is Stage and for some reason which I don't understand it tries doing unboxToBoolean. We could rewrite that to:
job.finalStage().parents().forall(toScalaFn(stage -> {

@@ -103223,7 +103229,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-11 02:22:25
      -

      *Thread Reply:* @Anirudh Shrinivason, please let us know if it is still a valid issue. If so, we can create an issue for that.

      +

      *Thread Reply:* @Anirudh Shrinivason, please let us know if it is still a valid issue. If so, we can create an issue for that.

      @@ -103249,7 +103255,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-11 03:11:13
      -

      *Thread Reply:* Hi @Paweł Leszczyński Sflr. Yeah, I think if we are able to fix this, it'll be better. If this is the dedicated fix, then I can create an issue and raise an MR.

      +

      *Thread Reply:* Hi @Paweł Leszczyński Sflr. Yeah, I think if we are able to fix this, it'll be better. If this is the dedicated fix, then I can create an issue and raise an MR.

      @@ -103275,7 +103281,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-11 04:12:46
      -

*Thread Reply:* Opened an issue and PR. Do help check if it's okay, thanks!

+

*Thread Reply:* Opened an issue and PR. Do help check if it's okay, thanks!

      @@ -103301,7 +103307,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-11 04:29:33
      -

      *Thread Reply:* please run ./gradlew spotlessApply with Java 8

      +

      *Thread Reply:* please run ./gradlew spotlessApply with Java 8

      @@ -103365,7 +103371,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-10 07:36:33
      -

*Thread Reply:* > I mean I want to be able to publish one item at a time and then (maybe in a future release of the customer product) be able to exploit runtime lineage also

+

*Thread Reply:* > I mean I want to be able to publish one item at a time and then (maybe in a future release of the customer product) be able to exploit runtime lineage also
This is not something that we support yet - there are definitely a lot of plans and preliminary work for that.
      @@ -103392,7 +103398,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-10 07:57:44
      -

*Thread Reply:* Thanks for the response, btw I already took a look at the current capabilities provided by openlineage, so my “hidden” question is how to achieve what the customer wants in order to be integrated in some way with openlineage+marquez?

+

*Thread Reply:* Thanks for the response, btw I already took a look at the current capabilities provided by openlineage, so my “hidden” question is how to achieve what the customer wants in order to be integrated in some way with openlineage+marquez?
should I choose between make or buy (between already supported platforms) and then try to align “static” (aka declarative) lineage metadata with the openlineage conceptual model?

      @@ -103516,7 +103522,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-11 13:58:28
      -

*Thread Reply:* Hi John. Great question! [full disclosure, I am with Manta 🙂 ]. I'll let others answer as to their experience with ourselves or many other vendors that provide lineage, but want to mention that a variety of our customers are finding it beneficial to bring code-based static lineage together with the event-based runtime lineage that OpenLineage provides. This gives them the best of both worlds, from analyzing the lineage of their existing systems, where rich parsers already exist (for everything from legacy ETL tools, reporting tools, rdbms, etc.), to newer or home-grown technologies where applying OpenLineage is a viable alternative.

+

*Thread Reply:* Hi John. Great question! [full disclosure, I am with Manta 🙂 ]. I'll let others answer as to their experience with ourselves or many other vendors that provide lineage, but want to mention that a variety of our customers are finding it beneficial to bring code-based static lineage together with the event-based runtime lineage that OpenLineage provides. This gives them the best of both worlds, from analyzing the lineage of their existing systems, where rich parsers already exist (for everything from legacy ETL tools, reporting tools, rdbms, etc.), to newer or home-grown technologies where applying OpenLineage is a viable alternative.

      @@ -103546,7 +103552,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-11 14:12:04
      -

      *Thread Reply:* @Ernie Ostic do you see a single front-runner in the static lineage space? The static/event-based situation you describe is exactly the product roadmap I'm seeing here at Fivetran and I'm wondering if there's an opportunity to drive consensus towards a best-practice solution. If I'm not mistaken weren't there plans to start supporting non-run-based events in OL as well?

      +

      *Thread Reply:* @Ernie Ostic do you see a single front-runner in the static lineage space? The static/event-based situation you describe is exactly the product roadmap I'm seeing here at Fivetran and I'm wondering if there's an opportunity to drive consensus towards a best-practice solution. If I'm not mistaken weren't there plans to start supporting non-run-based events in OL as well?

      @@ -103576,7 +103582,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-11 14:16:34
      -

      *Thread Reply:* I definitely like the idea of a 3rd party solution being complementary to OSS tools we can maintain ourselves while allowing us to offload maintenance effort where possible. Currently I have strong opinions on both sides of the build vs. buy aisle and this seems like the best of both worlds.

      +

      *Thread Reply:* I definitely like the idea of a 3rd party solution being complementary to OSS tools we can maintain ourselves while allowing us to offload maintenance effort where possible. Currently I have strong opinions on both sides of the build vs. buy aisle and this seems like the best of both worlds.

      @@ -103602,7 +103608,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-11 14:52:40
      -

*Thread Reply:* @Brad Paskewitz that’s 100% our plan to extend the OL spec to support “run-less” events. We want to collect that static metadata for Datasets and Jobs outside of the context of a run through OpenLineage.

+

*Thread Reply:* @Brad Paskewitz that’s 100% our plan to extend the OL spec to support “run-less” events. We want to collect that static metadata for Datasets and Jobs outside of the context of a run through OpenLineage.
happy to get your feedback here as well: https://github.com/OpenLineage/OpenLineage/pull/1839

      @@ -103633,7 +103639,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-11 14:57:46
      -

      *Thread Reply:* Hi @John Lukenoff. Here at Atlan we've been working with the OpenLineage community for quite some time to unlock the use case you describe. These efforts are adjacent to our ongoing integration with Fivetran. Happy to connect and give you a demo of what we've built and dig into your use case specifics.

      +

      *Thread Reply:* Hi @John Lukenoff. Here at Atlan we've been working with the OpenLineage community for quite some time to unlock the use case you describe. These efforts are adjacent to our ongoing integration with Fivetran. Happy to connect and give you a demo of what we've built and dig into your use case specifics.

      @@ -103659,7 +103665,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-12 11:26:32
      -

      *Thread Reply:* Thanks all! These comments are really informative, it’s exciting to hear about vendors leaning into the project to let us continue to benefit from the tremendous progress being made by the community. Had a great discussion with Atlan yesterday and plan to connect with Manta next week to discuss our use-cases.

      +

      *Thread Reply:* Thanks all! These comments are really informative, it’s exciting to hear about vendors leaning into the project to let us continue to benefit from the tremendous progress being made by the community. Had a great discussion with Atlan yesterday and plan to connect with Manta next week to discuss our use-cases.

      @@ -103685,7 +103691,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-12 12:34:32
      -

      *Thread Reply:* Reach out anytime, John. @John Lukenoff Looking forward to engaging further with you on these topics!

      +

      *Thread Reply:* Reach out anytime, John. @John Lukenoff Looking forward to engaging further with you on these topics!

      @@ -103741,7 +103747,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-12 11:19:02
      -

      *Thread Reply:* Thank you for requesting an OpenLineage release. As stated here, three +1s from committers will authorize an immediate release. Our policy is not to release on Fridays, so the earliest we could initiate would be Monday.

      +

      *Thread Reply:* Thank you for requesting an OpenLineage release. As stated here, three +1s from committers will authorize an immediate release. Our policy is not to release on Fridays, so the earliest we could initiate would be Monday.

      @@ -103767,7 +103773,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-12 13:12:43
      -

      *Thread Reply:* A release on Monday is totally fine @Michael Robinson.

      +

      *Thread Reply:* A release on Monday is totally fine @Michael Robinson.

      @@ -103793,7 +103799,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-15 08:37:39
      -

      *Thread Reply:* The release will be initiated today. Thanks @Harshini Devathi

      +

      *Thread Reply:* The release will be initiated today. Thanks @Harshini Devathi

      @@ -103823,7 +103829,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-16 20:16:07
      -

      *Thread Reply:* Appreciate it @Michael Robinson and thanks to all the committers for the prompt response

      +

      *Thread Reply:* Appreciate it @Michael Robinson and thanks to all the committers for the prompt response

      @@ -103947,7 +103953,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-16 14:13:16
      -

      *Thread Reply:* Can’t wait! 💯

      +

      *Thread Reply:* Can’t wait! 💯

      @@ -104002,7 +104008,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-17 01:58:28
      -

*Thread Reply:* Hi @Anirudh Shrinivason, are you able to send the OL events to the console? This would let us confirm whether the issue is related to event generation or to emitting it and waiting for the backend to respond.

+

*Thread Reply:* Hi @Anirudh Shrinivason, are you able to send the OL events to the console? This would let us confirm whether the issue is related to event generation or to emitting it and waiting for the backend to respond.

      @@ -104028,7 +104034,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-17 01:59:03
      -

      *Thread Reply:* Ahh okay sure. Let me see if I can do that

      +

      *Thread Reply:* Ahh okay sure. Let me see if I can do that

      @@ -104063,7 +104069,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      @Paweł Leszczyński @Michael Robinson

@@ -104099,7 +104105,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 03:55:51
      -

      *Thread Reply:* Hi @Paweł Leszczyński

      +

      *Thread Reply:* Hi @Paweł Leszczyński

Thanks for fixing the issue - with the new release, merge is working. But I could not see any input and output datasets for it. Let me know if you need any further details to look into this.

      @@ -104137,7 +104143,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 04:00:01
      -

*Thread Reply:* Oh man, it's just that vanilla spark differs from the one available in the databricks platform. Our integration tests verify behaviour on vanilla spark, which still leaves a possibility for inconsistency. Will need to get back to it at some point.

+

*Thread Reply:* Oh man, it's just that vanilla spark differs from the one available in the databricks platform. Our integration tests verify behaviour on vanilla spark, which still leaves a possibility for inconsistency. Will need to get back to it at some point.

      @@ -104163,7 +104169,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 02:11:28
      -

      *Thread Reply:* Hi @Paweł Leszczyński

      +

      *Thread Reply:* Hi @Paweł Leszczyński

      Did you get chance to look into this issue?

      @@ -104191,7 +104197,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 02:13:18
      -

*Thread Reply:* Hi Sai, I am going back to spark. I am working on support for Spark 3.4, which is going to add some event filtering on internal delta operations that unnecessarily trigger events

+

*Thread Reply:* Hi Sai, I am going back to spark. I am working on support for Spark 3.4, which is going to add some event filtering on internal delta operations that unnecessarily trigger events

      @@ -104217,7 +104223,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 02:13:28
      -

*Thread Reply:* this may be related to the issue you created

+

*Thread Reply:* this may be related to the issue you created

      @@ -104243,7 +104249,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 02:14:13
      -

      *Thread Reply:* I do have planned creating integration test for databricks which will be helpful to tackle the issues you raised

      +

      *Thread Reply:* I do have planned creating integration test for databricks which will be helpful to tackle the issues you raised

      @@ -104269,7 +104275,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 02:14:27
      -

      *Thread Reply:* so yes, I am looking at the Spark

      +

      *Thread Reply:* so yes, I am looking at the Spark

      @@ -104295,7 +104301,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 02:20:06
      -

*Thread Reply:* thanks much Pawel.. I am looking more into the merge part as first priority, as we use it frequently.

+

*Thread Reply:* thanks much Pawel.. I am looking more into the merge part as first priority, as we use it frequently.

      @@ -104321,7 +104327,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 02:21:01
      -

      *Thread Reply:* I know, this is important.

      +

      *Thread Reply:* I know, this is important.

      @@ -104347,7 +104353,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 02:21:14
      -

      *Thread Reply:* It just need still some time.

      +

      *Thread Reply:* It just need still some time.

      @@ -104373,7 +104379,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 02:21:46
      -

      *Thread Reply:* thank you for your patience and being so proactive on those issues.

      +

      *Thread Reply:* thank you for your patience and being so proactive on those issues.

      @@ -104399,7 +104405,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 02:22:12
      -

      *Thread Reply:* no problem.. Please do keep us posted with updates..

      +

      *Thread Reply:* no problem.. Please do keep us posted with updates..

      @@ -104461,7 +104467,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 10:26:48
      -

      *Thread Reply:* The release request has been approved and will be initiated shortly.

      +

      *Thread Reply:* The release request has been approved and will be initiated shortly.

      @@ -104513,7 +104519,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 02:29:55
      -

*Thread Reply:* Hi Anirudh, I saw your issue and I think it is the same one as solved within #1858. Are you able to reproduce it on a version built on top of main?

+

*Thread Reply:* Hi Anirudh, I saw your issue and I think it is the same one as solved within #1858. Are you able to reproduce it on a version built on top of main?

      @@ -104539,7 +104545,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 06:21:05
      -

*Thread Reply:* Hi, I haven't managed to try with the main branch. But if it's the same error then all's good! If the error resurfaces then we can look into it.

+

*Thread Reply:* Hi, I haven't managed to try with the main branch. But if it's the same error then all's good! If the error resurfaces then we can look into it.

      @@ -104593,7 +104599,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 02:28:31
      -

      *Thread Reply:* I know this one: https://openlineage.io/docs/integrations/dbt

      +

      *Thread Reply:* I know this one: https://openlineage.io/docs/integrations/dbt

      openlineage.io
      @@ -104640,7 +104646,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 02:41:39
      -

*Thread Reply:* Hi @Paweł Leszczyński Thanks for the reply, I tried the same but am facing the below issue

+

*Thread Reply:* Hi @Paweł Leszczyński Thanks for the reply, I tried the same but am facing the below issue

      requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url:

      @@ -104670,7 +104676,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 02:44:09
      -

*Thread Reply:* @Lovenish Goyal, exactly. You need to start Marquez.

+

*Thread Reply:* @Lovenish Goyal, exactly. You need to start Marquez.
More about it: https://marquezproject.ai/quickstart
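Given the ConnectionError above, a quick sanity check that Marquez is actually listening before re-running dbt-ol (an ad-hoc check, not an official tool):

```
# Ping the Marquez API; a refused connection here reproduces the dbt-ol error.
import requests

resp = requests.get("http://localhost:5000/api/v1/namespaces", timeout=5)
print(resp.status_code, resp.json())
```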

      @@ -104697,7 +104703,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 10:27:52
      -

      *Thread Reply:* @Lovenish Goyal how are you running dbt core currently?

      +

      *Thread Reply:* @Lovenish Goyal how are you running dbt core currently?

      @@ -104723,7 +104729,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 01:55:20
      -

*Thread Reply:* Trying, but facing an issue while running Marquez @Jakub Dardziński

+

*Thread Reply:* Trying, but facing an issue while running Marquez @Jakub Dardziński

      @@ -104749,7 +104755,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 01:56:03
      -

      *Thread Reply:* @Harel Shein we have created custom docker image of DBT + Airflow and running it on an EC2

      +

      *Thread Reply:* @Harel Shein we have created custom docker image of DBT + Airflow and running it on an EC2

      @@ -104775,7 +104781,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 09:05:31
      -

*Thread Reply:* for running dbt core on Airflow, we have a utility that helps develop dbt natively on Airflow. There’s also built-in support for collecting lineage if you have the airflow-openlineage provider installed.

+

*Thread Reply:* for running dbt core on Airflow, we have a utility that helps develop dbt natively on Airflow. There’s also built-in support for collecting lineage if you have the airflow-openlineage provider installed.
https://astronomer.github.io/astronomer-cosmos/#quickstart

      @@ -104802,7 +104808,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 09:06:30
      -

      *Thread Reply:* RE issues running Marquez, can you share what those are? I’m guessing that since you are running both of them in individual docker images, the airflow deployment might not be able to communicate with the Marquez endpoints?

      +

      *Thread Reply:* RE issues running Marquez, can you share what those are? I’m guessing that since you are running both of them in individual docker images, the airflow deployment might not be able to communicate with the Marquez endpoints?

      @@ -104828,7 +104834,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 09:06:53
      -

      *Thread Reply:* @Harel Shein I've already helped with running Marquez 🙂

      +

      *Thread Reply:* @Harel Shein I've already helped with running Marquez 🙂

      @@ -104884,7 +104890,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 02:34:35
      -

      *Thread Reply:* I am not sure if this is the same. If you see OL events collected with column-lineage missing, then it's a different one.

      +

      *Thread Reply:* I am not sure if this is the same. If you see OL events collected with column-lineage missing, then it's a different one.

      @@ -104910,7 +104916,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 02:41:11
      -

*Thread Reply:* Please also be aware that it is extremely helpful to investigate the issues on your own before creating them.

+

*Thread Reply:* Please also be aware that it is extremely helpful to investigate the issues on your own before creating them.

Our integration traverses spark's logical plans and extracts lineage events from plan nodes that it understands. Some plan nodes are not supported yet and, from my experience, when working on an issue, 80% of the time is spent on reproducing the scenario.

      @@ -104940,7 +104946,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 03:52:30
      -

      *Thread Reply:* @Paweł Leszczyński Thanks for the prompt response.

      +

      *Thread Reply:* @Paweł Leszczyński Thanks for the prompt response.

      Provided sample codes with and without using aggregate functions and its respective lineage events for reference.

      @@ -105240,7 +105246,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 03:55:04
      -

      *Thread Reply:* 2. Please find the code using aggregate function:

      +

      *Thread Reply:* 2. Please find the code using aggregate function:

          final_df=spark.sql("""
           select productid
      @@ -105485,7 +105491,7 @@ 

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 04:09:17
      -

      *Thread Reply:* amazing. https://github.com/OpenLineage/OpenLineage/issues/1861

      +

      *Thread Reply:* amazing. https://github.com/OpenLineage/OpenLineage/issues/1861

      @@ -105571,7 +105577,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 04:11:56
      -

      *Thread Reply:* Thanks for considering the request and looking into it

      +

      *Thread Reply:* Thanks for considering the request and looking into it

      @@ -105706,7 +105712,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 20:13:09
      -

      *Thread Reply:* Hi @Bramha Aelem I replied in the ticket. Thank you for opening it.

      +

      *Thread Reply:* Hi @Bramha Aelem I replied in the ticket. Thank you for opening it.

      @@ -105732,7 +105738,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 21:15:30
      -

      *Thread Reply:* Hi @Julien Le Dem - Thanks for quick response. I replied in the ticket. Please let me know if you need any more details.

      +

      *Thread Reply:* Hi @Julien Le Dem - Thanks for quick response. I replied in the ticket. Please let me know if you need any more details.

      @@ -105758,7 +105764,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 02:13:57
      -

      *Thread Reply:* Hi @Bramha Aelem - asked for more details in the ticket.

      +

      *Thread Reply:* Hi @Bramha Aelem - asked for more details in the ticket.

      @@ -105784,7 +105790,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-22 11:08:58
      -

      *Thread Reply:* Hi @Paweł Leszczyński - I replied with necessary details in the ticket. Please let me know if you need any more details.

      +

      *Thread Reply:* Hi @Paweł Leszczyński - I replied with necessary details in the ticket. Please let me know if you need any more details.

      @@ -105810,7 +105816,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-25 15:22:42
      -

      *Thread Reply:* Hi @Paweł Leszczyński - any further updates on issue?

      +

      *Thread Reply:* Hi @Paweł Leszczyński - any further updates on issue?

      @@ -105836,7 +105842,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-26 01:56:47
      -

      *Thread Reply:* hi @Bramha Aelem, i was out of office for a few days. will get back into this soon. thanks for update.

      +

      *Thread Reply:* hi @Bramha Aelem, i was out of office for a few days. will get back into this soon. thanks for update.

      @@ -105862,7 +105868,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-27 18:46:27
      -

      *Thread Reply:* Hi @Paweł Leszczyński -Thanks for your reply. will wait for your response to proceed further on the issue.

      +

      *Thread Reply:* Hi @Paweł Leszczyński -Thanks for your reply. will wait for your response to proceed further on the issue.

      @@ -105888,7 +105894,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 19:29:08
      -

      *Thread Reply:* Hi @Paweł Leszczyński -Hope you are doing well. Did you get a chance to look into the samples which are provided in the ticket. Kindly let me know your observations/recommendations.

      +

      *Thread Reply:* Hi @Paweł Leszczyński -Hope you are doing well. Did you get a chance to look into the samples which are provided in the ticket. Kindly let me know your observations/recommendations.

      @@ -105914,7 +105920,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-09 12:43:54
      -

      *Thread Reply:* Hi @Paweł Leszczyński - Hope you are doing well. Did you get a chance to look into the samples which are provided in the ticket. Kindly let me know your observations/recommendations.

      +

      *Thread Reply:* Hi @Paweł Leszczyński - Hope you are doing well. Did you get a chance to look into the samples which are provided in the ticket. Kindly let me know your observations/recommendations.

      @@ -105944,7 +105950,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-06 10:29:01
      -

      *Thread Reply:* Hi @Paweł Leszczyński - Good day. Did you get a chance to look into query which I have posted. can you please provide any thoughts on my observation/query.

      +

      *Thread Reply:* Hi @Paweł Leszczyński - Good day. Did you get a chance to look into query which I have posted. can you please provide any thoughts on my observation/query.

      @@ -105998,7 +106004,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 03:49:27
      -

*Thread Reply:* Hello @John Doe, this mostly means there's something wrong with your transport config for emitting OpenLineage events.

+

*Thread Reply:* Hello @John Doe, this mostly means there's something wrong with your transport config for emitting OpenLineage events.

      @@ -106024,7 +106030,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 03:49:41
      -

      *Thread Reply:* what do you want to do with the events?

      +

      *Thread Reply:* what do you want to do with the events?

      @@ -106050,7 +106056,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 04:10:24
      -

      *Thread Reply:* Hi @Paweł Leszczyński, I am working on a PoC to understand the use cases of OL and how it build Lineages.

      +

      *Thread Reply:* Hi @Paweł Leszczyński, I am working on a PoC to understand the use cases of OL and how it build Lineages.

      As for the transport config I am using the codes from the documentation to setup OL. https://openlineage.io/docs/integrations/spark/quickstart_local

      @@ -106102,7 +106108,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 04:38:58
      -

      *Thread Reply:* ok, I am wondering if what you experience isn't similar to issue #1860. Could you try openlineage 0.23.0 to see if get the same error?

      +

      *Thread Reply:* ok, I am wondering if what you experience isn't similar to issue #1860. Could you try openlineage 0.23.0 to see if get the same error?

      <https://github.com/OpenLineage/OpenLineage/issues/1860>

      @@ -106130,7 +106136,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 10:05:59
      -

      *Thread Reply:* I tried with 0.23.0 still getting the same error

      +

      *Thread Reply:* I tried with 0.23.0 still getting the same error

      @@ -106156,7 +106162,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-23 02:34:52
      -

      *Thread Reply:* @Paweł Leszczyński any other way I can try to setup. The issue still persists

      +

      *Thread Reply:* @Paweł Leszczyński any other way I can try to setup. The issue still persists

      @@ -106182,7 +106188,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-29 03:53:04
      -

*Thread Reply:* hmyy, I've just redone the steps from https://openlineage.io/docs/integrations/spark/quickstart_local with 0.26.0 and could not reproduce the behaviour you encountered.

+

*Thread Reply:* hmyy, I've just redone the steps from https://openlineage.io/docs/integrations/spark/quickstart_local with 0.26.0 and could not reproduce the behaviour you encountered.

      openlineage.io
      @@ -106259,7 +106265,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-22 16:34:00
      -

      *Thread Reply:* Tom, thanks for your question. This is really exciting! I assume you’ve already started checking out the docs, but there are many other resources on the website, as well (on the blog and resources pages in particular). And don’t skip the YouTube channel, where we’ve recently started to upload short, more digestible excerpts from the community meetings. Please keep us updated as you make progress!

      +

      *Thread Reply:* Tom, thanks for your question. This is really exciting! I assume you’ve already started checking out the docs, but there are many other resources on the website, as well (on the blog and resources pages in particular). And don’t skip the YouTube channel, where we’ve recently started to upload short, more digestible excerpts from the community meetings. Please keep us updated as you make progress!

      @@ -106289,7 +106295,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-22 16:48:06
      -

      *Thread Reply:* Hi Michael! Thank you so much for sending these resources! I've been working on this thesis for quite some time already and it's almost finished. I just needed some additional information to help in accurately describing some of the processes in OpenLineage and Marquez. Will send you the case study chapter later this week to get some feedback if possible. Keep you posted on things such as publication! Perhaps it can make OpenLineage even more popular than it already is 😉

      +

      *Thread Reply:* Hi Michael! Thank you so much for sending these resources! I've been working on this thesis for quite some time already and it's almost finished. I just needed some additional information to help in accurately describing some of the processes in OpenLineage and Marquez. Will send you the case study chapter later this week to get some feedback if possible. Keep you posted on things such as publication! Perhaps it can make OpenLineage even more popular than it already is 😉

      @@ -106315,7 +106321,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-22 16:52:18
      -

      *Thread Reply:* Yes, please share it! Looking forward to checking it out. Super cool!

      +

      *Thread Reply:* Yes, please share it! Looking forward to checking it out. Super cool!

      @@ -106371,7 +106377,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-22 10:04:44
      -

      *Thread Reply:* Thank you so much Ernie! That sounds like a very interesting direction to keep in mind during research!

      +

      *Thread Reply:* Thank you so much Ernie! That sounds like a very interesting direction to keep in mind during research!

      @@ -106496,7 +106502,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-23 07:24:09
      -

      *Thread Reply:* yeah, it's a bug

      +

      *Thread Reply:* yeah, it's a bug

      @@ -106522,7 +106528,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-23 12:00:48
      -

      *Thread Reply:* so it's optional then? 😄 or bug in the example?

      +

      *Thread Reply:* so it's optional then? 😄 or bug in the example?

      @@ -106686,7 +106692,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-23 19:01:43
      -

      *Thread Reply:* @Michael Collado wondering if you could shed some light here

      +

      *Thread Reply:* @Michael Collado wondering if you could shed some light here

      @@ -106712,7 +106718,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-24 05:39:01
      -

*Thread Reply:* can you show the logical plan of your Spark job? I think using hive is not the most important part, but whether the job's LogicalPlan parses to CreateHiveTableAsSelectCommand or InsertIntoHiveTable

+

*Thread Reply:* can you show the logical plan of your Spark job? I think using hive is not the most important part, but whether the job's LogicalPlan parses to CreateHiveTableAsSelectCommand or InsertIntoHiveTable

      @@ -106738,7 +106744,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-24 19:37:02
      -

*Thread Reply:* It parses into InsertIntoHadoopFsRelationCommand. example +

*Thread Reply:* It parses into InsertIntoHadoopFsRelationCommand. Example:
== Optimized Logical Plan ==
InsertIntoHadoopFsRelationCommand <s3a://uchmsdev03/default/sharanyaOutputTable>, false, [id#89], Parquet, [serialization.format=1, mergeSchema=false, partitionOverwriteMode=dynamic], Append, CatalogTable(
Database: default
@@ -106793,7 +106799,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-24 19:39:54
      -

*Thread Reply:* using spark 3.2 & this is the query +

*Thread Reply:* using spark 3.2 & this is the query
spark.sql(s"INSERT INTO default.sharanyaOutput select ** from (SELECT ** from default.tableA union all " +
    s"select ** from default.tableB)")

      @@ -106847,7 +106853,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-24 05:37:06
      -

*Thread Reply:* I think we can't really get it from the Spark context, as Spark jobs are submitted in compiled JAR form, instead of as plain text like, for example, Airflow DAGs.

+

*Thread Reply:* I think we can't really get it from the Spark context, as Spark jobs are submitted in compiled JAR form, instead of as plain text like, for example, Airflow DAGs.

      @@ -106873,7 +106879,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-25 02:15:35
      -

      *Thread Reply:* How about Jupyter Notebook based spark job?

      +

      *Thread Reply:* How about Jupyter Notebook based spark job?

      @@ -106899,7 +106905,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-25 08:44:18
      -

      *Thread Reply:* I don't think it changes much - but maybe @Paweł Leszczyński knows more

      +

      *Thread Reply:* I don't think it changes much - but maybe @Paweł Leszczyński knows more

      @@ -106982,7 +106988,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-25 12:00:04
      -

*Thread Reply:* We developed a FileTransport class in order to save our metrics locally in a JSON file, if you're interested.

+

*Thread Reply:* We developed a FileTransport class in order to save our metrics locally in a JSON file, if you're interested.

      @@ -107008,7 +107014,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-25 12:00:37
      -

      *Thread Reply:* Does it also save the openlineage information, e.g. inputs/outputs?

      +

      *Thread Reply:* Does it also save the openlineage information, e.g. inputs/outputs?

      @@ -107034,7 +107040,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-25 12:02:07
      -

*Thread Reply:* yes, it saves all the JSON information, inputs/outputs included

+

*Thread Reply:* yes, it saves all the JSON information, inputs/outputs included

      @@ -107060,7 +107066,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-25 12:03:03
      -

*Thread Reply:* Yes! Then I am very interested. Is there guidance on how to use the FileTransport class?

+

*Thread Reply:* Yes! Then I am very interested. Is there guidance on how to use the FileTransport class?

      @@ -107086,7 +107092,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-25 13:06:22
      -

      *Thread Reply:* @alexandre bergere it would be pretty useful contribution if you can submit it 🙂

      +

      *Thread Reply:* @alexandre bergere it would be pretty useful contribution if you can submit it 🙂

      @@ -107116,7 +107122,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-25 13:08:28
      -

*Thread Reply:* We are using it in a transformed OpenLineage library we developed! I'm going to make a PR in order to share it with you :)

+

*Thread Reply:* We are using it in a transformed OpenLineage library we developed! I'm going to make a PR in order to share it with you :)

      @@ -107146,7 +107152,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-25 13:56:48
      -

      *Thread Reply:* would be great to have. I had it in mind to implement as an enabler for databricks integration tests. great to hear that!

      +

      *Thread Reply:* would be great to have. I had it in mind to implement as an enabler for databricks integration tests. great to hear that!

      @@ -107172,7 +107178,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-29 08:19:46
      -

      *Thread Reply:* PR sent: https://github.com/OpenLineage/OpenLineage/pull/1891 🙂 +

      *Thread Reply:* PR sent: https://github.com/OpenLineage/OpenLineage/pull/1891 🙂 @Maciej Obuchowski could you tell me how to update the documentation once approved please?
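For anyone following along before the PR and docs land, the general shape of a file-based transport with the Python client is something like this (a rough sketch, assuming you subclass the client's Transport and hand it to OpenLineageClient; the class name and path handling here are illustrative, not the actual PR code):

import os

from openlineage.client import OpenLineageClient
from openlineage.client.serde import Serde
from openlineage.client.transport import Transport

class SimpleFileTransport(Transport):
    """Append every emitted event to a local JSON-lines file."""

    def __init__(self, file_path="ol_events.jsonl"):
        self.file_path = file_path

    def emit(self, event):
        # Serde.to_json serializes the whole event, inputs/outputs included
        with open(self.file_path, "a") as f:
            f.write(Serde.to_json(event) + os.linesep)

client = OpenLineageClient(transport=SimpleFileTransport())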

      @@ -107261,7 +107267,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-29 08:36:21
      -

      *Thread Reply:* @alexandre bergere we have separate repo for website + docs: https://github.com/OpenLineage/docs

      +

      *Thread Reply:* @alexandre bergere we have separate repo for website + docs: https://github.com/OpenLineage/docs

      @@ -107376,7 +107382,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-26 13:08:56
      -

      *Thread Reply:* I wonder if this blog post might help? https://openlineage.io/blog/openlineage-microsoft-purview

      +

      *Thread Reply:* I wonder if this blog post might help? https://openlineage.io/blog/openlineage-microsoft-purview

      openlineage.io
      @@ -107423,7 +107429,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-26 16:13:38
      -

      *Thread Reply:* This might not fully match your use case, either, but might help: https://learn.microsoft.com/en-us/samples/microsoft/purview-adb-lineage-solution-accelerator/azure-databricks-to-purview-lineage-connector/

      +

      *Thread Reply:* This might not fully match your use case, either, but might help: https://learn.microsoft.com/en-us/samples/microsoft/purview-adb-lineage-solution-accelerator/azure-databricks-to-purview-lineage-connector/

      learn.microsoft.com
      @@ -107474,7 +107480,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-01 23:23:49
      -

      *Thread Reply:* Thanks @Michael Robinson

      +

      *Thread Reply:* Thanks @Michael Robinson

      @@ -107526,7 +107532,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-29 05:04:32
      -

*Thread Reply:* I think, for now, such a thing is not defined other than by the implementation of consumers.

+

*Thread Reply:* I think, for now, such a thing is not defined other than by the implementation of consumers.

      @@ -107552,7 +107558,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-30 10:32:09
      -

      *Thread Reply:* Any reason for that?

      +

      *Thread Reply:* Any reason for that?

      @@ -107578,7 +107584,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-01 10:25:33
      -

*Thread Reply:* The idea is that, for a particular run, facets can be attached to any event type.

+

*Thread Reply:* The idea is that, for a particular run, facets can be attached to any event type.

This has advantages: for example, a job that modifies a dataset it's also reading from can get the particular version of the dataset it's reading from and attach it on start; that would not work if you tried to do it on complete, as the dataset would have changed by then.

      @@ -107659,7 +107665,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-01 10:18:48
      -

      *Thread Reply:* I think it should be output-only, yes.

      +

      *Thread Reply:* I think it should be output-only, yes.

      @@ -107685,7 +107691,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-01 10:19:14
      -

      *Thread Reply:* @Paweł Leszczyński what do you think?

      +

      *Thread Reply:* @Paweł Leszczyński what do you think?

      @@ -107711,7 +107717,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 08:35:13
      -

      *Thread Reply:* yes, should be output only I think

      +

      *Thread Reply:* yes, should be output only I think

      @@ -107737,7 +107743,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-05 13:39:07
      -

      *Thread Reply:* should we move it over then? 😄

      +

      *Thread Reply:* should we move it over then? 😄

      @@ -107763,7 +107769,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-05 13:39:31
      -

      *Thread Reply:* under Output Dataset Facets that is

      +

      *Thread Reply:* under Output Dataset Facets that is

      @@ -107828,7 +107834,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-01 14:23:17
      -

      *Thread Reply:* Correction: Julien and Willy’s talk at Data+AI Summit will take place on June 28

      +

      *Thread Reply:* Correction: Julien and Willy’s talk at Data+AI Summit will take place on June 28

      @@ -107888,7 +107894,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 10:30:39
      -

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated on Monday in accordance with our policy here.

      +

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated on Monday in accordance with our policy here.

      @@ -108068,7 +108074,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-05 15:47:08
      -

      *Thread Reply:* @Bernat Gabor thanks for the PR!

      +

      *Thread Reply:* @Bernat Gabor thanks for the PR!

      @@ -108124,7 +108130,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-06 10:58:23
      -

      *Thread Reply:* Thanks @Maciej Obuchowski. The release is authorized and will be initiated as soon as possible.

      +

      *Thread Reply:* Thanks @Maciej Obuchowski. The release is authorized and will be initiated as soon as possible.

      @@ -108412,7 +108418,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 12:39:20
      -

      *Thread Reply:* Do you get any output in your devtools? I just ran into this yesterday and it looks like it’s related to this issue: https://github.com/MarquezProject/marquez/issues/2410

      +

      *Thread Reply:* Do you get any output in your devtools? I just ran into this yesterday and it looks like it’s related to this issue: https://github.com/MarquezProject/marquez/issues/2410

      @@ -108476,7 +108482,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 12:40:01
      -

      *Thread Reply:* Seems like more of a Marquez client-side issue than something with OL

      +

      *Thread Reply:* Seems like more of a Marquez client-side issue than something with OL

      @@ -108502,7 +108508,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 12:43:02
      -

*Thread Reply:* ohh, but if I try using the console output, it throws ClientProtocolError

+

*Thread Reply:* ohh, but if I try using the console output, it throws ClientProtocolError

      @@ -108537,7 +108543,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 12:43:41
      -

      *Thread Reply:* Sorry I mean in the dev console of your web browser

      +

      *Thread Reply:* Sorry I mean in the dev console of your web browser

      @@ -108563,7 +108569,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 12:44:43
      -

      *Thread Reply:* this is the dev console in browser

      +

      *Thread Reply:* this is the dev console in browser

      @@ -108598,7 +108604,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 12:47:59
      -

      *Thread Reply:* Seems like it’s coming from this line. Are there any job facets defined when you fetch from the API directly? That seems like kind of an old version of OL so maybe the schema is incompatible with the version Marquez is expecting

      +

      *Thread Reply:* Seems like it’s coming from this line. Are there any job facets defined when you fetch from the API directly? That seems like kind of an old version of OL so maybe the schema is incompatible with the version Marquez is expecting

      @@ -108649,7 +108655,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 12:51:21
      -

      *Thread Reply:* from pyspark.sql import SparkSession

      +

      *Thread Reply:* from pyspark.sql import SparkSession

spark = (SparkSession.builder.master('local')
    .appName('sample_spark')
@@ -108690,7 +108696,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 12:51:49
      -

*Thread Reply:* This is my only code, I haven't done anything apart from this

+

*Thread Reply:* This is my only code, I haven't done anything apart from this

      @@ -108716,7 +108722,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 12:52:30
      -

      *Thread Reply:* I would try a more recent version of OL. Looks like you’re using 0.12.0 and I think the project is on 0.27.x currently

      +

      *Thread Reply:* I would try a more recent version of OL. Looks like you’re using 0.12.0 and I think the project is on 0.27.x currently

      @@ -108742,7 +108748,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 12:55:07
      -

      *Thread Reply:* so i should change io.openlineage:openlineage_spark:0.12.0 to io.openlineage:openlineage_spark:0.27.1?

      +

      *Thread Reply:* so i should change io.openlineage:openlineage_spark:0.12.0 to io.openlineage:openlineage_spark:0.27.1?

      @@ -108772,7 +108778,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:10:03
      -

*Thread Reply:* it executed well, but I'm unable to see it in Marquez

+

*Thread Reply:* it executed well, but I'm unable to see it in Marquez

      @@ -108798,7 +108804,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:18:16
      -

*Thread Reply:* Marquez didn't get updated

+

*Thread Reply:* Marquez didn't get updated

      @@ -108833,7 +108839,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:20:44
      -

      *Thread Reply:* I am actually doing a POC on OpenLineage to find table and column level lineage for my team at Amazon. +

*Thread Reply:* I am actually doing a POC on OpenLineage to find table and column level lineage for my team at Amazon. If this goes through, the team could use OpenLineage to track data lineage on a larger scale.
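For context: when the integration can compute it, column-level lineage arrives in the columnLineage facet of an output dataset, shaped roughly like this (a sketch following the facet spec; the dataset and field names below are made up for illustration):

"columnLineage": {
  "fields": {
    "order_total": {
      "inputFields": [
        {"namespace": "s3a://bucket", "name": "orders", "field": "amount"},
        {"namespace": "s3a://bucket", "name": "orders", "field": "tax"}
      ]
    }
  }
}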

      @@ -108860,7 +108866,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:24:49
      -

      *Thread Reply:* Maybe marquez is still pulling the data from the previous run using the old OL version. Do you still get the same error in the browser console? Do you get the same result if you rebuild and start with a clean marquez db?

      +

      *Thread Reply:* Maybe marquez is still pulling the data from the previous run using the old OL version. Do you still get the same error in the browser console? Do you get the same result if you rebuild and start with a clean marquez db?

      @@ -108886,7 +108892,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:25:10
      -

      *Thread Reply:* yes i did that as well

      +

      *Thread Reply:* yes i did that as well

      @@ -108912,7 +108918,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:25:49
      -

      *Thread Reply:* the error was present only once you clicked on any of the jobs in marquez, +

*Thread Reply:* the error was present only once you clicked on any of the jobs in Marquez; since my job isn't showing up, I can't check for the error itself

      @@ -108939,7 +108945,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:26:29
      -

*Thread Reply:* docker run --network spark_default -p 3000:3000 -e MARQUEZ_HOST=marquez-api -e MARQUEZ_PORT=5000 --link marquez-api:marquez-api marquezproject/marquez-web:0.19.1

+

*Thread Reply:* docker run --network spark_default -p 3000:3000 -e MARQUEZ_HOST=marquez-api -e MARQUEZ_PORT=5000 --link marquez-api:marquez-api marquezproject/marquez-web:0.19.1

used this to rebuild marquez

      @@ -108967,7 +108973,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:26:54
      -

      *Thread Reply:* That’s odd, sorry, that’s probably the most I can help, I’m kinda new to OL/Marquez as well 😅

      +

      *Thread Reply:* That’s odd, sorry, that’s probably the most I can help, I’m kinda new to OL/Marquez as well 😅

      @@ -108993,7 +108999,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:27:41
      -

*Thread Reply:* no problem, can you refer me to someone who would know, so that I can ask them?

+

*Thread Reply:* no problem, can you refer me to someone who would know, so that I can ask them?

      @@ -109019,7 +109025,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:29:25
      -

*Thread Reply:* Actually, looking at it now, I think you're using a slightly outdated version of marquez-web too. I would update that tag to at least 0.33.0, which is what I'm using

+

*Thread Reply:* Actually, looking at it now, I think you're using a slightly outdated version of marquez-web too. I would update that tag to at least 0.33.0, which is what I'm using

      @@ -109045,7 +109051,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:30:10
      -

*Thread Reply:* Other than that, I would ask in the Marquez slack channel or raise an issue in GitHub on that project. Seems like more of an issue with Marquez, since at least some data is rendering in the UI initially

+

*Thread Reply:* Other than that, I would ask in the Marquez slack channel or raise an issue in GitHub on that project. Seems like more of an issue with Marquez, since at least some data is rendering in the UI initially

      @@ -109071,7 +109077,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:32:58
      -

*Thread Reply:* nope, that version also didn't help

+

*Thread Reply:* nope, that version also didn't help

      @@ -109097,7 +109103,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:33:19
      -

      *Thread Reply:* can you share their slack link?

      +

      *Thread Reply:* can you share their slack link?

      @@ -109123,7 +109129,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:34:52
      -

      *Thread Reply:* http://bit.ly/MarquezSlack

      +

      *Thread Reply:* http://bit.ly/MarquezSlack

      @@ -109149,7 +109155,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:35:08
      -

      *Thread Reply:* that link is no longer active

      +

      *Thread Reply:* that link is no longer active

      @@ -109175,7 +109181,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-09 18:44:25
      -

      *Thread Reply:* Hello @Rachana Gandhi could you point to the doc where you found the example .config(‘spark.jars.packages’, ‘io.openlineage:openlineage_spark:0.12.0’) ? We should update it to have the latest version instead.

      +

      *Thread Reply:* Hello @Rachana Gandhi could you point to the doc where you found the example .config(‘spark.jars.packages’, ‘io.openlineage:openlineage_spark:0.12.0’) ? We should update it to have the latest version instead.

      @@ -109201,7 +109207,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-09 18:54:49
      -

      *Thread Reply:* https://openlineage.io/docs/integrations/spark/quickstart_local/

      +

      *Thread Reply:* https://openlineage.io/docs/integrations/spark/quickstart_local/

      openlineage.io
      @@ -109248,7 +109254,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-09 18:59:17
      -

      *Thread Reply:* https://openlineage.io/docs/guides/spark

      +

      *Thread Reply:* https://openlineage.io/docs/guides/spark

      also the docker compose here has an earlier version of marquez

      MARQUEZAPIKEY=[YOURAPIKEY]
      2023-07-13 17:00:54
      -

      *Thread Reply:* Facing same issue with my initial POC. Did we get any solution for this?

      +

      *Thread Reply:* Facing same issue with my initial POC. Did we get any solution for this?

      @@ -109355,7 +109361,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 14:43:55
      -

      *Thread Reply:* Requesting a release? 3 +1s from committers will authorize. More info here: https://github.com/OpenLineage/OpenLineage/blob/main/GOVERNANCE.md

      +

      *Thread Reply:* Requesting a release? 3 +1s from committers will authorize. More info here: https://github.com/OpenLineage/OpenLineage/blob/main/GOVERNANCE.md

      @@ -109381,7 +109387,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 14:44:14
      -

      *Thread Reply:* Yeah, that one 😊

      +

      *Thread Reply:* Yeah, that one 😊

      @@ -109407,7 +109413,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 14:44:44
      -

      *Thread Reply:* Because the python client is broken as is today without a new release

      +

      *Thread Reply:* Because the python client is broken as is today without a new release

      @@ -109437,7 +109443,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 18:45:04
      -

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated by EOB next Tuesday, but in all likelihood well before then.

      +

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated by EOB next Tuesday, but in all likelihood well before then.

      @@ -109463,7 +109469,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 19:06:34
      -

      *Thread Reply:* cool

      +

      *Thread Reply:* cool

      @@ -109588,7 +109594,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-13 04:44:28
      -

      *Thread Reply:* Do you mean to just log events to logging backend?

      +

      *Thread Reply:* Do you mean to just log events to logging backend?

      @@ -109614,7 +109620,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-13 04:54:30
      -

      *Thread Reply:* Hmm more like have a separate logging config for sending all the logs to a backend

      +

      *Thread Reply:* Hmm more like have a separate logging config for sending all the logs to a backend

      @@ -109640,7 +109646,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-13 04:54:38
      -

      *Thread Reply:* Not the events itself

      +

      *Thread Reply:* Not the events itself

      @@ -109666,7 +109672,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-13 05:01:10
      -

      *Thread Reply:* @Anirudh Shrinivason with Spark integration?

      +

      *Thread Reply:* @Anirudh Shrinivason with Spark integration?

      @@ -109692,7 +109698,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-13 05:01:59
      -

      *Thread Reply:* It uses slf4j so you should be able to set up your log4j logger

      +

      *Thread Reply:* It uses slf4j so you should be able to set up your log4j logger
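For example, with log4j 1.x (the default in Spark up to 3.2), a log4j.properties on the driver classpath along these lines scopes the integration's logs (a sketch; the appender name is whatever your setup defines):

# route OpenLineage integration logs to the console appender at DEBUG
log4j.logger.io.openlineage=DEBUG, console
log4j.additivity.io.openlineage=false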

      @@ -109718,7 +109724,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-13 05:10:55
      -

      *Thread Reply:* Yeah with the spark integration. Ahh I see. Okay sure thanks!

      +

      *Thread Reply:* Yeah with the spark integration. Ahh I see. Okay sure thanks!

      @@ -109744,7 +109750,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-21 23:21:14
      -

      *Thread Reply:* ~Hi @Maciej Obuchowski May I know what the class path I should be using for setting up the log4j if I want to set it up for OL related logs? Is there some guide or runbook to setting up the log4j with OL? Thanks!~ +

      *Thread Reply:* ~Hi @Maciej Obuchowski May I know what the class path I should be using for setting up the log4j if I want to set it up for OL related logs? Is there some guide or runbook to setting up the log4j with OL? Thanks!~ Nvm lol found it! 🙂

      @@ -109832,7 +109838,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-14 10:42:41
      -

*Thread Reply:* Hello, is anyone who has recently installed the latest version of marquez/open-lineage-spark using the docker image available to help Vamshi and me, or provide any pointers? Thank you

+

*Thread Reply:* Hello, is anyone who has recently installed the latest version of marquez/open-lineage-spark using the docker image available to help Vamshi and me, or provide any pointers? Thank you

      @@ -109858,7 +109864,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-15 03:38:38
      -

      *Thread Reply:* if you're working on mac, you can have an issue related to port 5000. Instructions here https://github.com/MarquezProject/marquez#quickstart provides a workaround for that ./docker/up.sh --api-port 9000

      +

      *Thread Reply:* if you're working on mac, you can have an issue related to port 5000. Instructions here https://github.com/MarquezProject/marquez#quickstart provides a workaround for that ./docker/up.sh --api-port 9000

      @@ -109884,7 +109890,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-15 08:43:33
      -

*Thread Reply:* @Paweł Leszczyński, thank you. We were using Ubuntu on an EC2 instance, and each time we run into different errors and are never able to access the application page, web server, or admin interface. We have run out of ideas of what else to try differently to get this setup up and running

+

*Thread Reply:* @Paweł Leszczyński, thank you. We were using Ubuntu on an EC2 instance, and each time we run into different errors and are never able to access the application page, web server, or admin interface. We have run out of ideas of what else to try differently to get this setup up and running

      @@ -109910,7 +109916,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-22 14:47:00
      -

      *Thread Reply:* @Michael Robinson Can you please help us here?

      +

      *Thread Reply:* @Michael Robinson Can you please help us here?

      @@ -109936,7 +109942,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-22 14:58:57
      -

      *Thread Reply:* @Vamshi krishna I’m sorry you’re still blocked. Thanks for the information about your system. Would you please share some of the errors you are getting? More details would help us reproduce and diagnose.

      +

      *Thread Reply:* @Vamshi krishna I’m sorry you’re still blocked. Thanks for the information about your system. Would you please share some of the errors you are getting? More details would help us reproduce and diagnose.

      @@ -109962,7 +109968,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-22 16:35:00
      -

      *Thread Reply:* @Michael Robinson, thank you, vamshi and i will share the errors that we are running into shortly

      +

      *Thread Reply:* @Michael Robinson, thank you, vamshi and i will share the errors that we are running into shortly

      @@ -109988,7 +109994,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 09:48:16
      -

*Thread Reply:* We are following the https://openlineage.io/getting-started/ guide and trying to set up Marquez on an Ubuntu EC2 instance. Following are the versions of Docker, Docker Compose and Ubuntu

+

*Thread Reply:* We are following the https://openlineage.io/getting-started/ guide and trying to set up Marquez on an Ubuntu EC2 instance. Following are the versions of Docker, Docker Compose and Ubuntu

      openlineage.io
      @@ -110044,7 +110050,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 09:49:51
      -

*Thread Reply:* @Michael Robinson When we follow the documentation without changing anything and run sudo ./docker/up.sh, we are seeing the following errors:

+

*Thread Reply:* @Michael Robinson When we follow the documentation without changing anything and run sudo ./docker/up.sh, we are seeing the following errors:

      @@ -110079,7 +110085,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:00:38
      -

*Thread Reply:* So, I edited the up.sh file, modified the docker compose command by removing the --log-level flag, ran sudo ./docker/up.sh, and found the following errors:

+

*Thread Reply:* So, I edited the up.sh file, modified the docker compose command by removing the --log-level flag, ran sudo ./docker/up.sh, and found the following errors:

      @@ -110114,7 +110120,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:02:29
      -

      *Thread Reply:* Then I copied .env.example to .env since compose needs .env file

      +

      *Thread Reply:* Then I copied .env.example to .env since compose needs .env file

      @@ -110149,7 +110155,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:05:04
      -

      *Thread Reply:* I got this error:

      +

      *Thread Reply:* I got this error:

      @@ -110184,7 +110190,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:09:24
      -

      *Thread Reply:* since I am getting timeouts, I thought it might be an issue with proxy. So, I followed this doc: https://stackoverflow.com/questions/58841014/set-proxy-on-docker and added my outbound proxy and tried

      +

      *Thread Reply:* since I am getting timeouts, I thought it might be an issue with proxy. So, I followed this doc: https://stackoverflow.com/questions/58841014/set-proxy-on-docker and added my outbound proxy and tried

      Stack Overflow
      @@ -110240,7 +110246,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:23:46
      -

*Thread Reply:* @Michael Robinson Then it kind of worked, but we are seeing the following errors:

+

*Thread Reply:* @Michael Robinson Then it kind of worked, but we are seeing the following errors:

      @@ -110275,7 +110281,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:24:31
      -

      *Thread Reply:*

      +

      *Thread Reply:*

      @@ -110310,7 +110316,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:25:29
      -

      *Thread Reply:* @Michael Robinson @Paweł Leszczyński Can you please see above steps and let us know what are we missing/doing wrong? I appreciate your help and time.

      +

      *Thread Reply:* @Michael Robinson @Paweł Leszczyński Can you please see above steps and let us know what are we missing/doing wrong? I appreciate your help and time.

      @@ -110336,7 +110342,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:45:39
      -

      *Thread Reply:* The latest errors look to me like they’re being caused by postgres and might reflect a port conflict. Are you using the default port for the API (5000)? You might try using a different port. More info about this in the Marquez readme: https://github.com/MarquezProject/marquez/blob/0.35.0/README.md.

      +

      *Thread Reply:* The latest errors look to me like they’re being caused by postgres and might reflect a port conflict. Are you using the default port for the API (5000)? You might try using a different port. More info about this in the Marquez readme: https://github.com/MarquezProject/marquez/blob/0.35.0/README.md.

      @@ -110362,7 +110368,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:46:55
      -

*Thread Reply:* Yes we are using default ports: +

*Thread Reply:* Yes we are using default ports:
API_PORT=5000
API_ADMIN_PORT=5001
WEB_PORT=3000
@@ -110392,7 +110398,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:47:40
      -

      *Thread Reply:* We see these postgres permission issues only occasionally. Other times we only see db and api containers up but not the web

      +

      *Thread Reply:* We see these postgres permission issues only occasionally. Other times we only see db and api containers up but not the web

      @@ -110418,7 +110424,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:52:38
      -

      *Thread Reply:* I would try running ./docker/up.sh --api-port 9000 (see Pawel’s message above for more context.)

      +

      *Thread Reply:* I would try running ./docker/up.sh --api-port 9000 (see Pawel’s message above for more context.)

      @@ -110448,7 +110454,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:54:18
      -

      *Thread Reply:* Still no luck. Seeing same errors.

      +

      *Thread Reply:* Still no luck. Seeing same errors.

2023-06-23 14:53:23.971 GMT [1] LOG: could not open configuration file "/etc/postgresql/postgresql.conf": Permission denied
marquez-db | 2023-06-23 14:53:23.971 GMT [1] FATAL: configuration file "/etc/postgresql/postgresql.conf" contains errors

      @@ -110477,7 +110483,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:54:43
      -

      *Thread Reply:* ERROR [2023-06-23 14:53:42,269] org.apache.tomcat.jdbc.pool.ConnectionPool: Unable to create initial connections of pool. +

*Thread Reply:* ERROR [2023-06-23 14:53:42,269] org.apache.tomcat.jdbc.pool.ConnectionPool: Unable to create initial connections of pool.
marquez-api | ! java.net.UnknownHostException: postgres
marquez-api | ! at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:567)
marquez-api | ! at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
@@ -110541,7 +110547,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 11:06:32
      -

*Thread Reply:* Why do you run docker up with sudo? Some of your screenshots suggest docker is not able to access the docker registry. The last error, java.net.UnknownHostException: postgres, may just be a result of the container being down. Could you verify if all the containers are up and running and, if not, what's the error? Are you able to test this docker.up on your laptop or in another environment?

+

*Thread Reply:* Why do you run docker up with sudo? Some of your screenshots suggest docker is not able to access the docker registry. The last error, java.net.UnknownHostException: postgres, may just be a result of the container being down. Could you verify if all the containers are up and running and, if not, what's the error? Are you able to test this docker.up on your laptop or in another environment?

      @@ -110567,7 +110573,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 11:08:34
      -

*Thread Reply:* Docker commands require sudo and cannot be run as another user. +

*Thread Reply:* Docker commands require sudo and cannot be run as another user. The Postgres container is not coming up. It is failing with the following errors:

2023-06-23 14:53:23.971 GMT [1] LOG: could not open configuration file "/etc/postgresql/postgresql.conf": Permission denied
@@ -110597,7 +110603,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 11:10:19
      -

      *Thread Reply:* and what does docker ps -a say about postgres container? why did it fail?

      +

      *Thread Reply:* and what does docker ps -a say about postgres container? why did it fail?

      @@ -110623,7 +110629,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 11:11:36
      -

      *Thread Reply:*

      +

      *Thread Reply:*

      @@ -110658,7 +110664,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 11:25:17
      -

*Thread Reply:* hmyy, no changes have been made to postgresql.conf on our side since August 2022. Did you apply any changes, or do you have a clean clone of the repo?

+

*Thread Reply:* hmyy, no changes have been made to postgresql.conf on our side since August 2022. Did you apply any changes, or do you have a clean clone of the repo?

      @@ -110684,7 +110690,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 11:29:46
      -

      *Thread Reply:* No we didn't make any changes

      +

      *Thread Reply:* No we didn't make any changes

      @@ -110710,7 +110716,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 11:32:21
      -

      *Thread Reply:* you did write earlier Note: I had to modify docker-compose command in up.sh as per docker compose V2.

      +

      *Thread Reply:* you did write earlier Note: I had to modify docker-compose command in up.sh as per docker compose V2.

      @@ -110736,7 +110742,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 11:34:54
      -

*Thread Reply:* Yes, all I did was modify this line: docker-compose --log-level ERROR $compose_files up $ARGS to +

*Thread Reply:* Yes, all I did was modify this line:
docker-compose --log-level ERROR $compose_files up $ARGS
to
docker compose $compose_files up $ARGS
since docker compose v2 doesn't support the --log-level flag

      @@ -110763,7 +110769,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 11:37:03
      -

      *Thread Reply:* Let me pull an older version and try

      +

      *Thread Reply:* Let me pull an older version and try

      @@ -110789,7 +110795,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 12:09:43
      -

*Thread Reply:* Still no luck, same exact errors. Tried on a different Ubuntu instance. Still seeing the same errors with postgres

+

*Thread Reply:* Still no luck, same exact errors. Tried on a different Ubuntu instance. Still seeing the same errors with postgres

      @@ -110815,7 +110821,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 15:06:32
      -

      *Thread Reply:* @Jeremy W

      +

      *Thread Reply:* @Jeremy W

      @@ -110867,7 +110873,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-15 10:49:42
      -

      *Thread Reply:* > Or could there be cases where the column lineage, and any output information is only present in one of the events, but not the other? +

*Thread Reply:* > Or could there be cases where the column lineage, and any output information is only present in one of the events, but not the other?
Yes. Generally, events regarding a single run are cumulative
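Because events are cumulative, a consumer typically folds all events for a runId together before reading facets. A rough sketch of that merge in Python, assuming events already parsed into dicts per the spec:

from collections import defaultdict

def merge_run_events(events):
    # one accumulated view per runId
    merged = defaultdict(lambda: {"inputs": {}, "outputs": {}})
    for ev in events:
        run = merged[ev["run"]["runId"]]
        for side in ("inputs", "outputs"):
            for ds in ev.get(side, []):
                key = (ds["namespace"], ds["name"])
                # later facets for the same dataset override earlier ones
                run[side].setdefault(key, {}).update(ds.get("facets", {}))
    return merged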

      @@ -110894,7 +110900,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-15 11:07:03
      -

      *Thread Reply:* Ahh I see... Is it fair to assume that if I see column lineage in a start event, it's the full column lineage? Or could it be possible that half the lineage is in the start event, and half the lineage is in the complete event?

      +

      *Thread Reply:* Ahh I see... Is it fair to assume that if I see column lineage in a start event, it's the full column lineage? Or could it be possible that half the lineage is in the start event, and half the lineage is in the complete event?

      @@ -110920,7 +110926,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-15 22:50:51
      -

      *Thread Reply:* Hi @Maciej Obuchowski just pinging in case you'd missed the above message. 🙇

      +

      *Thread Reply:* Hi @Maciej Obuchowski just pinging in case you'd missed the above message. 🙇

      @@ -110950,7 +110956,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-16 04:48:57
      -

      *Thread Reply:* Actually, in this case this definitely should not happen. @Paweł Leszczyński am I right?

      +

      *Thread Reply:* Actually, in this case this definitely should not happen. @Paweł Leszczyński am I right?

      @@ -110980,7 +110986,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-16 04:50:16
      -

*Thread Reply:* @Maciej Obuchowski yes, you're right

+

*Thread Reply:* @Maciej Obuchowski yes, you're right

      @@ -111036,7 +111042,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-16 02:14:19
      -

      *Thread Reply:* ~No, we do have an open issue for that: https://github.com/OpenLineage/OpenLineage/issues/1758~

      +

      *Thread Reply:* ~No, we do have an open issue for that: https://github.com/OpenLineage/OpenLineage/issues/1758~

      @@ -111086,7 +111092,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-16 05:02:26
      -

      *Thread Reply:* @nivethika R, I am sorry for misleading response, we've merged PR for that https://github.com/OpenLineage/OpenLineage/pull/1636. It does not support select ** but besides that, it should be operational.

      +

      *Thread Reply:* @nivethika R, I am sorry for misleading response, we've merged PR for that https://github.com/OpenLineage/OpenLineage/pull/1636. It does not support select ** but besides that, it should be operational.

      Could you please try a query from our integration tests to verify if this is working for you or not: https://github.com/OpenLineage/OpenLineage/pull/1636/files#diff-137aa17091138b69681510e13e3b7d66aa9c9c7c81fe8fe13f09f0de76448dd5R46 ?

      MARQUEZAPIKEY=[YOURAPIKEY]
      2023-06-21 11:07:04
      -

      *Thread Reply:* Hi Nagendra, sorry you’re running into this error. We’re looking into it!

      +

      *Thread Reply:* Hi Nagendra, sorry you’re running into this error. We’re looking into it!

      @@ -111424,7 +111430,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-19 09:37:20
      -

      *Thread Reply:* Hey @Anirudh Shrinivason, me and Paweł are at Berlin Buzzwords right now. Will definitely look at it later

      +

      *Thread Reply:* Hey @Anirudh Shrinivason, me and Paweł are at Berlin Buzzwords right now. Will definitely look at it later

      @@ -111450,7 +111456,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-19 10:47:06
      -

      *Thread Reply:* Oh nice! Thanks!

      +

      *Thread Reply:* Oh nice! Thanks!

      @@ -111503,10 +111509,10 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-20 03:47:50
      -

      *Thread Reply:* This is the event generated for above query.

      +

      *Thread Reply:* This is the event generated for above query.

@@ -111577,7 +111583,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      this is event for view for which no lineage is being generated

@@ -111643,7 +111649,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-20 17:34:08
      -

      *Thread Reply:* Hi John, glad to hear you figured out a path forward on this! Please let us know what you learn 🙂

      +

      *Thread Reply:* Hi John, glad to hear you figured out a path forward on this! Please let us know what you learn 🙂

      @@ -111763,7 +111769,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-20 17:33:36
      -

      *Thread Reply:* Hi Harshini, I’ve alerted our resident Spark and column-lineage expert about this. Hope to have an answer for you soon.

      +

      *Thread Reply:* Hi Harshini, I’ve alerted our resident Spark and column-lineage expert about this. Hope to have an answer for you soon.

      @@ -111789,7 +111795,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-20 19:39:46
      -

      *Thread Reply:* Thank you Michael, looking forward to it

      +

      *Thread Reply:* Thank you Michael, looking forward to it

      @@ -111815,7 +111821,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-21 02:58:41
      -

*Thread Reply:* Hello @Harshini Devathi. An interesting topic, which I have never thought about. The ordering of the fields we get for Spark apps comes from the Spark logical plans we extract information from, and we do not apply any sorting on them. So, if a Spark plan contains columns a, b, c, we trust that's the order of columns for the dataset and don't want to check it on our own.

+

*Thread Reply:* Hello @Harshini Devathi. An interesting topic, which I have never thought about. The ordering of the fields we get for Spark apps comes from the Spark logical plans we extract information from, and we do not apply any sorting on them. So, if a Spark plan contains columns a, b, c, we trust that's the order of columns for the dataset and don't want to check it on our own.
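The schema facet itself is an ordered JSON array, so field position survives from the plan into the event as-is (illustrative values):

"schema": {
  "fields": [
    {"name": "a", "type": "integer"},
    {"name": "b", "type": "string"},
    {"name": "c", "type": "double"}
  ]
}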

      @@ -111841,7 +111847,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-21 02:59:45
      -

*Thread Reply:* btw. please let us know how you obtain your lineage: within a Spark app or from some SQLs scheduled by Airflow?

+

*Thread Reply:* btw. please let us know how you obtain your lineage: within a Spark app or from some SQLs scheduled by Airflow?

      @@ -111867,7 +111873,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 14:40:31
      -

*Thread Reply:* Hello @Paweł Leszczyński, thank you for the response. We do not need you to check the ordering specifically, but I assume that the Spark logical plan maintains the column order based on the input datasets. Can we retain that order by adding a column id or some sequence number which helps to represent the lineage in the same order?

+

*Thread Reply:* Hello @Paweł Leszczyński, thank you for the response. We do not need you to check the ordering specifically, but I assume that the Spark logical plan maintains the column order based on the input datasets. Can we retain that order by adding a column id or some sequence number which helps to represent the lineage in the same order?

We are capturing the lineage using the Spark OpenLineage connector, by posting custom lineage to Marquez through API calls, and we are also in the process of leveraging the SQL connector feature using Airflow.

      @@ -111895,7 +111901,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-26 04:35:43
      -

*Thread Reply:* Hi @Harshini Devathi, are you asking about the schema facet within a dataset? This should have an order from the Spark logical plans. Or are you asking about the columnLineage facet? Or Marquez API responses? It's not clear to me why you need it. Each column is identified by a dataset (dataset namespace + dataset name) and a field name. You can, on your side, generate a column id based on that and order columns based on the id, but still I think I am missing some arguments behind doing so.

+

*Thread Reply:* Hi @Harshini Devathi, are you asking about the schema facet within a dataset? This should have an order from the Spark logical plans. Or are you asking about the columnLineage facet? Or Marquez API responses? It's not clear to me why you need it. Each column is identified by a dataset (dataset namespace + dataset name) and a field name. You can, on your side, generate a column id based on that and order columns based on the id, but still I think I am missing some arguments behind doing so.

      @@ -112119,7 +112125,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-27 03:45:31
      -

      *Thread Reply:* Is it because the logLevel is set to WARN or ERROR?

      +

      *Thread Reply:* Is it because the logLevel is set to WARN or ERROR?

      @@ -112145,7 +112151,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-27 12:07:12
      -

*Thread Reply:* No, I set it to INFO; maybe I need to add some jars?

+

*Thread Reply:* No, I set it to INFO; maybe I need to add some jars?

      @@ -112171,7 +112177,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-27 12:30:02
      -

      *Thread Reply:* Hmm have you set the relevant spark configs?

      +

      *Thread Reply:* Hmm have you set the relevant spark configs?

      @@ -112197,7 +112203,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-27 12:32:50
      -

*Thread Reply:* yep, I have http working. But not the console +

*Thread Reply:* yep, I have http working. But not the console
spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.transport.type=console

      @@ -112225,7 +112231,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-27 12:35:27
      -

      *Thread Reply:* Oh wait http works but not console...

      +

      *Thread Reply:* Oh wait http works but not console...

      @@ -112251,7 +112257,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-27 12:37:02
      -

*Thread Reply:* If you want to see the console events which are emitted, then you need to set the logLevel to DEBUG

+

*Thread Reply:* If you want to see the console events which are emitted, then you need to set the logLevel to DEBUG

      @@ -112277,7 +112283,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-27 12:37:44
      -

      *Thread Reply:* tried that too, still nothing

      +

      *Thread Reply:* tried that too, still nothing

      @@ -112303,7 +112309,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-27 12:38:54
      -

*Thread Reply:* Is the openlineage jar installed and added to the config?

+

*Thread Reply:* Is the openlineage jar installed and added to the config?

      @@ -112329,7 +112335,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-27 12:39:09
      -

      *Thread Reply:* yep, that’s why http works

      +

      *Thread Reply:* yep, that’s why http works

      @@ -112355,7 +112361,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-27 12:39:26
      -

      *Thread Reply:* the only thing I see in the logs is this: +

*Thread Reply:* the only thing I see in the logs is this:
23/06/27 07:39:11 INFO SparkSQLExecutionContext: OpenLineage received Spark event that is configured to be skipped: SparkListenerJobEnd

      @@ -112382,7 +112388,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-27 12:40:59
      -

      *Thread Reply:* Hmm if an event is still emitted for this case, but logs not showing up then I'm not sure... Maybe someone with more knowledge on this can help

      +

      *Thread Reply:* Hmm if an event is still emitted for this case, but logs not showing up then I'm not sure... Maybe someone with more knowledge on this can help

      @@ -112408,7 +112414,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-27 12:42:37
      -

      *Thread Reply:* sure, thanks for trying @Anirudh Shrinivason

      +

      *Thread Reply:* sure, thanks for trying @Anirudh Shrinivason

      @@ -112434,7 +112440,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-28 05:23:36
      -

      *Thread Reply:* What job are you trying this on? If there's this message, then logging is working afaik

      +

      *Thread Reply:* What job are you trying this on? If there's this message, then logging is working afaik

      @@ -112460,7 +112466,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-28 12:16:52
      -

*Thread Reply:* Hi @Maciej Obuchowski Actually, I also noticed a similar issue... For some Spark pipelines, the log level is set to debug, but I'm not seeing any events being logged. I am, however, receiving these events in the backend. Has any of the logging been removed from some places?

+

*Thread Reply:* Hi @Maciej Obuchowski Actually, I also noticed a similar issue... For some Spark pipelines, the log level is set to debug, but I'm not seeing any events being logged. I am, however, receiving these events in the backend. Has any of the logging been removed from some places?

      @@ -112486,7 +112492,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-28 20:57:45
      -

      *Thread Reply:* yep, exactly same thing here also @Maciej Obuchowski, I can get the events on http, but changing to console gets me nothing from ConsoleTransport.

      +

      *Thread Reply:* yep, exactly same thing here also @Maciej Obuchowski, I can get the events on http, but changing to console gets me nothing from ConsoleTransport.

      @@ -112564,7 +112570,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-29 05:57:59
      -

      *Thread Reply:* Hi @Anirudh Shrinivason, could you disable serializing spark.logicalPlan to see if the behaviour is the same?

      +

      *Thread Reply:* Hi @Anirudh Shrinivason, could you disable serializing spark.logicalPlan to see if the behaviour is the same?

      @@ -112590,7 +112596,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-29 05:58:28
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark -> spark.openlineage.facets.disabled -> [spark_unknown;spark.logicalPlan]

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark -> spark.openlineage.facets.disabled -> [spark_unknown;spark.logicalPlan]
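That is, when building the session, a config entry along these lines (a sketch; the key and value are from the integration README linked above):

.config('spark.openlineage.facets.disabled', '[spark_unknown;spark.logicalPlan]')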

      @@ -112616,7 +112622,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-29 05:59:55
      -

      *Thread Reply:* We do serialize logicalPlan because this is useful in many cases, but sometimes can lead to serializing things that shouldn't be serialized

      +

      *Thread Reply:* We do serialize logicalPlan because this is useful in many cases, but sometimes can lead to serializing things that shouldn't be serialized

      @@ -112642,7 +112648,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-29 15:49:35
      -

      *Thread Reply:* Ahh I see. Yeah okay let me try that

      +

      *Thread Reply:* Ahh I see. Yeah okay let me try that

      @@ -112704,7 +112710,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-30 08:05:53
      -

      *Thread Reply:* Thanks, all. The release is authorized.

      +

      *Thread Reply:* Thanks, all. The release is authorized.

      @@ -112929,7 +112935,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-07 10:37:49
      -

      *Thread Reply:* thanks for watching and sharing! the recording is also on youtube 😉 https://www.youtube.com/watch?v=rO3BPqUtWrI

      +

      *Thread Reply:* thanks for watching and sharing! the recording is also on youtube 😉 https://www.youtube.com/watch?v=rO3BPqUtWrI

      YouTube
      @@ -112989,7 +112995,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-07 10:38:01
      -

      *Thread Reply:* ❤️

      +

      *Thread Reply:* ❤️

      @@ -113015,7 +113021,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-08 13:35:10
      -

*Thread Reply:* Very much agree. I’ve even forwarded it to a few people here and there, those who I think should learn about it.

+

*Thread Reply:* Very much agree. I’ve even forwarded it to a few people here and there, those who I think should learn about it.

      @@ -113045,7 +113051,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-08 13:47:17
      -

      *Thread Reply:* You’re both too kind :) +

      *Thread Reply:* You’re both too kind :) Thank you for your support and being part of the community.

      @@ -113162,7 +113168,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-11 05:46:36
      -

      *Thread Reply:* I think yes this is a great use case. @Paweł Leszczyński is more familiar with the spark integration code than I. +

      *Thread Reply:* I think yes this is a great use case. @Paweł Leszczyński is more familiar with the spark integration code than I. I think in this case, we would add the datasetVersion facet with the underlying Iceberg version: https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DatasetVersionDatasetFacet.json We extract this information in a few places: https://github.com/search?q=repo%3AOpenLineage%2FOpenLineage%20%20DatasetVersionDatasetFacet&type=code
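
For illustration, the facet's shape per the spec linked above, sketched as a plain Python dict; the snapshot id value is made up:

```
dataset_version_facet = {
    "_producer": "https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark",
    "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/DatasetVersionDatasetFacet.json",
    "datasetVersion": "5937117590161046015",  # e.g. an Iceberg snapshot id
}
```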

      @@ -113215,7 +113221,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-11 05:57:17
      -

      *Thread Reply:* Yes, we do have datasetVersion which captures for output and input datasets their iceberg or delta version. Input versions are collected on START while output are collected on COMPLETE in case a job reads and writes to the same dataset. So, even though column-lineage facet is missing the version, it should be available within events related to a particular run.

      +

      *Thread Reply:* Yes, we do have datasetVersion which captures for output and input datasets their iceberg or delta version. Input versions are collected on START while output are collected on COMPLETE in case a job reads and writes to the same dataset. So, even though column-lineage facet is missing the version, it should be available within events related to a particular run.

If it is not, then perhaps the case here is the lack of support for the as of syntax. As far as I remember, we always get the current version of a dataset and this may be the missing part here.

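For context, the as of syntax in question looks like this, assuming Spark 3.3+ with a recent Iceberg runtime; the table name and ids are hypothetical:

```
# Current-version read vs. time travel reads:
spark.sql("SELECT * FROM db.events").show()
spark.sql("SELECT * FROM db.events VERSION AS OF 123456789").show()  # snapshot id
spark.sql("SELECT * FROM db.events TIMESTAMP AS OF '2023-07-01 00:00:00'").show()
```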
      @@ -113243,7 +113249,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-11 05:58:49
      -

      *Thread Reply:* link to a method that gets dataset version for iceberg: https://github.com/OpenLineage/OpenLineage/blob/0.29.2/integration/spark/spark3/sr[…]lineage/spark3/agent/lifecycle/plan/catalog/IcebergHandler.java

      +

      *Thread Reply:* link to a method that gets dataset version for iceberg: https://github.com/OpenLineage/OpenLineage/blob/0.29.2/integration/spark/spark3/sr[…]lineage/spark3/agent/lifecycle/plan/catalog/IcebergHandler.java

      @@ -113294,7 +113300,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-11 10:57:26
      -

      *Thread Reply:* Thank you @Julien Le Dem and @Paweł Leszczyński +

*Thread Reply:* Thank you @Julien Le Dem and @Paweł Leszczyński 
Based on what I’ve seen so far, indeed it seems that only the current snapshot is tracked when IcebergHandler.getDatasetVersion() is called. 
Initially I was expecting to be able to obtain the snapshotId from the SparkTable which comes within getDatasetVersion(), but now I realize that OL is using an older version of the Iceberg runtime (0.12.1), which does not support time travel (introduced in 0.14.1). The evidence is: 
@@ -113330,7 +113336,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-11 15:48:53
      -

      *Thread Reply:* Created issues: #1969 and #1970

      +

      *Thread Reply:* Created issues: #1969 and #1970

      @@ -113356,7 +113362,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 07:15:14
      -

*Thread Reply:* Thanks @Juan Manuel Cappi. The openlineage-spark jar contains modules like spark3, spark32, spark33 and a spark34 that is going to be merged soon (we do have a ready PR for that). spark34 will be compiled against the latest iceberg version. Once this is done #1969 can be closed. For 1970, one would need to implement a datasetBuilder within the spark34 module that visits the node within Spark's logical plan responsible for as of and creates the dataset for the OpenLineage event in some way other than getting the latest snapshot version.

+

*Thread Reply:* Thanks @Juan Manuel Cappi. The openlineage-spark jar contains modules like spark3, spark32, spark33 and a spark34 that is going to be merged soon (we do have a ready PR for that). spark34 will be compiled against the latest iceberg version. Once this is done #1969 can be closed. For 1970, one would need to implement a datasetBuilder within the spark34 module that visits the node within Spark's logical plan responsible for as of and creates the dataset for the OpenLineage event in some way other than getting the latest snapshot version.

      @@ -113382,7 +113388,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-13 12:51:19
      -

*Thread Reply:* @Paweł Leszczyński I’ve seen PR #1971 and I see a new spark34 project with the latest iceberg-spark dependency version, but other versions (spark33, spark32, etc) have not been upgraded in that PR. Since the change is small and does not break any tests, I’ve created PR #1976 to fix #1969. That alone is unlocking some time travel lineage (i.e. dataset identifier now becomes schema.table.version or schema.table.snapshot_id). Hope it makes sense

+

*Thread Reply:* @Paweł Leszczyński I’ve seen PR #1971 and I see a new spark34 project with the latest iceberg-spark dependency version, but other versions (spark33, spark32, etc) have not been upgraded in that PR. Since the change is small and does not break any tests, I’ve created PR #1976 to fix #1969. That alone is unlocking some time travel lineage (i.e. dataset identifier now becomes schema.table.version or schema.table.snapshot_id). Hope it makes sense

      @@ -113408,7 +113414,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-14 04:37:55
      -

      *Thread Reply:* Hi @Juan Manuel Cappi, You're right and after discussion with you I realized we support some version of iceberg (for spark 3.3 it's still 0.14.0) but this is not the latest iceberg version matching spark version.

      +

      *Thread Reply:* Hi @Juan Manuel Cappi, You're right and after discussion with you I realized we support some version of iceberg (for spark 3.3 it's still 0.14.0) but this is not the latest iceberg version matching spark version.

There's a tricky part here. Although we want our code to succeed with the latest spark, we don't want it to fail in a nasty way (class not found exception) when a user is working with an old iceberg version. There are places in our code where we check "are iceberg classes on the classpath?" We need to extend this to "are iceberg classes on the classpath and is the iceberg version above 0.14 or not". For sure this is the case for the merge into commands I am working on at the moment. Let's see if the other integration tests are affected in your PR

      @@ -113462,7 +113468,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-11 08:28:59
      -

*Thread Reply:* what do you mean by that? there is a pyspark & kafka integration test that verifies events being sent when reading from or writing to a kafka topic: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]a/io/openlineage/spark/agent/SparkContainerIntegrationTest.java

+

*Thread Reply:* what do you mean by that? there is a pyspark & kafka integration test that verifies events being sent when reading from or writing to a kafka topic: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]a/io/openlineage/spark/agent/SparkContainerIntegrationTest.java

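A minimal example of the kind of job the linked test covers, assuming the Kafka SQL connector is on the classpath; the bootstrap servers, topic, and output path are placeholders:

```
# Batch read from a Kafka topic, then a file write; both ends can show up
# as OpenLineage datasets.
df = (spark.read.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "my_topic")
      .load())
df.selectExpr("CAST(value AS STRING) AS value").write.mode("overwrite").parquet("/tmp/kafka_dump")
```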
      @@ -113513,7 +113519,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-11 09:28:56
      -

*Thread Reply:* We do have an old issue https://github.com/OpenLineage/OpenLineage/issues/372 to support more spark plans that are stream related. But if you have an example of streaming that is not working for you, that would be really helpful.

+

*Thread Reply:* We do have an old issue https://github.com/OpenLineage/OpenLineage/issues/372 to support more spark plans that are stream related. But if you have an example of streaming that is not working for you, that would be really helpful.

      @@ -113573,7 +113579,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-26 08:03:30
      -

*Thread Reply:* I have a pipeline which reads from a topic and sends data to 3 Hive tables and one Postgres table. It's not emitting any lineage for this pipeline

+

*Thread Reply:* I have a pipeline which reads from a topic and sends data to 3 Hive tables and one Postgres table. It's not emitting any lineage for this pipeline

      @@ -113599,7 +113605,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-26 08:06:51
      -

      *Thread Reply:* just one task is getting created

      +

      *Thread Reply:* just one task is getting created

      @@ -113677,7 +113683,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 06:14:41
      -

*Thread Reply:* Hi @Anirudh Shrinivason, I’m not familiar with Delta, but enabling debugging helped me a lot to understand what’s going on when things fail silently. Just add at the end: +

*Thread Reply:* Hi @Anirudh Shrinivason, I’m not familiar with Delta, but enabling debugging helped me a lot to understand what’s going on when things fail silently. Just add at the end: 
spark.sparkContext.setLogLevel("DEBUG")

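"At the end" here means after the session is created; a minimal sketch:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("debug_check").getOrCreate()
spark.sparkContext.setLogLevel("DEBUG")  # surfaces the integration's debug logs
```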
      @@ -113704,7 +113710,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 06:20:47
      -

      *Thread Reply:* Yeah I checked on debug

      +

      *Thread Reply:* Yeah I checked on debug

      @@ -113730,7 +113736,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 06:20:50
      -

      *Thread Reply:* There are no errors

      +

      *Thread Reply:* There are no errors

      @@ -113756,7 +113762,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 06:21:10
      -

      *Thread Reply:* Just that there is no environment-properties in the event that is being emitted

      +

      *Thread Reply:* Just that there is no environment-properties in the event that is being emitted

      @@ -113782,7 +113788,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 07:31:01
      -

*Thread Reply:* Hi @Anirudh Shrinivason, what spark version is that? I see your delta version is pretty old. Anyway, the observation is weird and I don't know how delta could interfere with the environment facet builder. These are such disjoint features. Are you sure you create a new session (there is getOrCreate)?

+

*Thread Reply:* Hi @Anirudh Shrinivason, what spark version is that? I see your delta version is pretty old. Anyway, the observation is weird and I don't know how delta could interfere with the environment facet builder. These are such disjoint features. Are you sure you create a new session (there is getOrCreate)?

      @@ -113808,7 +113814,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 19:29:06
      -

*Thread Reply:* @Paweł Leszczyński it's because of this line: https://github.com/OpenLineage/OpenLineage/blob/0.29.2/integration/spark/app/src/m[…]nlineage/spark/agent/lifecycle/InternalEventHandlerFactory.java

+

*Thread Reply:* @Paweł Leszczyński it's because of this line: https://github.com/OpenLineage/OpenLineage/blob/0.29.2/integration/spark/app/src/m[…]nlineage/spark/agent/lifecycle/InternalEventHandlerFactory.java

      @@ -113859,7 +113865,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 19:32:44
      -

*Thread Reply:* Assuming this is https://learn.microsoft.com/en-us/azure/databricks/delta/ ... delta .. which is Azure Databricks. @Anirudh Shrinivason

+

*Thread Reply:* Assuming this is https://learn.microsoft.com/en-us/azure/databricks/delta/ ... delta .. which is Azure Databricks. @Anirudh Shrinivason

      @@ -113885,7 +113891,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 22:58:13
      -

      *Thread Reply:* Hmm I wasn't using databricks

      +

      *Thread Reply:* Hmm I wasn't using databricks

      @@ -113911,7 +113917,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 22:59:12
      -

      *Thread Reply:* @Paweł Leszczyński I'm using spark 3.1 btw

      +

      *Thread Reply:* @Paweł Leszczyński I'm using spark 3.1 btw

      @@ -113937,7 +113943,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-13 08:05:49
      -

      *Thread Reply:* @Anirudh Shrinivason This should resolve the issue https://github.com/OpenLineage/OpenLineage/pull/1973

      +

      *Thread Reply:* @Anirudh Shrinivason This should resolve the issue https://github.com/OpenLineage/OpenLineage/pull/1973

      @@ -114030,7 +114036,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-13 08:06:11
      -

*Thread Reply:* The PR description contains info on how the observed behaviour was possible

+

*Thread Reply:* The PR description contains info on how the observed behaviour was possible

      @@ -114056,7 +114062,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-13 08:07:47
      -

      *Thread Reply:* As always, thank you @Anirudh Shrinivason for providing clear information on how to reproduce the issue 🚀 :medal: 👍

      +

      *Thread Reply:* As always, thank you @Anirudh Shrinivason for providing clear information on how to reproduce the issue 🚀 :medal: 👍

      @@ -114082,7 +114088,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-13 09:52:29
      -

*Thread Reply:* Ohh that is really great! Thanks so much for the help! 🙂

+

*Thread Reply:* Ohh that is really great! Thanks so much for the help! 🙂

      @@ -114211,7 +114217,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 15:22:52
      -

      *Thread Reply:* what do you mean by Access file?

      +

      *Thread Reply:* what do you mean by Access file?

      @@ -114237,7 +114243,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 16:09:03
      -

*Thread Reply:* ... accdb file, Microsoft Access File: I am in a reverse engineering project facing spaghetti-style development and would have loved to use airflow and openlineage as a magic wand to help me in this damn work

+

*Thread Reply:* ... accdb file, Microsoft Access File: I am in a reverse engineering project facing spaghetti-style development and would have loved to use airflow and openlineage as a magic wand to help me in this damn work

      @@ -114263,7 +114269,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 21:44:21
      -

      *Thread Reply:* oof.. I’d look into https://airflow.apache.org/docs/apache-airflow-providers-odbc/4.0.0/ +

      *Thread Reply:* oof.. I’d look into https://airflow.apache.org/docs/apache-airflow-providers-odbc/4.0.0/ but I really have no clue..

      @@ -114290,7 +114296,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-13 09:47:02
      -

      *Thread Reply:* Thank you Harel +

      *Thread Reply:* Thank you Harel I started from that too ... but it became foggy after the initial step

      @@ -114347,7 +114353,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 16:55:00
      -

      *Thread Reply:* Did you check the namespace menu in the top right for a food_delivery namespace?

      +

      *Thread Reply:* Did you check the namespace menu in the top right for a food_delivery namespace?

      @@ -114373,7 +114379,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 16:55:12
      -

      *Thread Reply:* (Hi Aaman!)

      +

      *Thread Reply:* (Hi Aaman!)

      @@ -114399,7 +114405,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 16:55:45
      -

      *Thread Reply:* Hi! Thank you that helped!

      +

      *Thread Reply:* Hi! Thank you that helped!

      @@ -114425,7 +114431,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 16:55:55
      -

      *Thread Reply:* I think that should be added to the quickstart guide

      +

      *Thread Reply:* I think that should be added to the quickstart guide

      @@ -114455,7 +114461,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 16:56:23
      -

      *Thread Reply:* Great idea, thank you

      +

      *Thread Reply:* Great idea, thank you

      @@ -114603,7 +114609,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-14 04:09:15
      -

*Thread Reply:* Hi @Steven, I assume you send your Openlineage events to Marquez. The 422 http code is a response from the backend and Marquez is still waiting for the PR https://github.com/MarquezProject/marquez/pull/2495 to be merged and released. This PR makes Marquez understand DatasetEvents. They won't be saved in the Marquez database (this is to be implemented in the future), but at least one will not experience an error response code.

+

*Thread Reply:* Hi @Steven, I assume you send your Openlineage events to Marquez. The 422 http code is a response from the backend and Marquez is still waiting for the PR https://github.com/MarquezProject/marquez/pull/2495 to be merged and released. This PR makes Marquez understand DatasetEvents. They won't be saved in the Marquez database (this is to be implemented in the future), but at least one will not experience an error response code.

      To sum up: what you do is correct. You are using a feature that is allowed on a client side but still not implemented on a backend.

      MARQUEZAPIKEY=[YOURAPIKEY]
      2023-07-14 04:10:30
      -

      *Thread Reply:* Thanks!!

      +

      *Thread Reply:* Thanks!!

      @@ -114773,7 +114779,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-18 02:35:18
      -

*Thread Reply:* Hi @Harshit Soni, where are you deploying your spark? locally or not? is it on mac? Calling @Maciej Obuchowski to help with the libopenlineage_sql_java architecture compilation issue

+

*Thread Reply:* Hi @Harshit Soni, where are you deploying your spark? locally or not? is it on mac? Calling @Maciej Obuchowski to help with the libopenlineage_sql_java architecture compilation issue

      @@ -114799,7 +114805,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-18 02:38:03
      -

      *Thread Reply:* Currently, was testing on local.

      +

      *Thread Reply:* Currently, was testing on local.

      @@ -114825,7 +114831,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-18 02:39:43
      -

*Thread Reply:* We have created a centralised utility for all data ingestion needs and want to see how lineage is created for it using OpenLineage.

+

*Thread Reply:* We have created a centralised utility for all data ingestion needs and want to see how lineage is created for it using OpenLineage.

      @@ -114851,7 +114857,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-18 05:16:55
      -

      *Thread Reply:* 👀

      +

      *Thread Reply:* 👀

      @@ -115003,7 +115009,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-17 08:13:49
      -

      *Thread Reply:* > or would it somewhat work to write a mapper in between to support their spec? +

      *Thread Reply:* > or would it somewhat work to write a mapper in between to support their spec? I think yeah - maybe https://github.com/Natural-Intelligence/openLineage-openMetadata-transporter would work out of the box if I understand correctly?

      @@ -115059,7 +115065,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-17 08:38:59
      -

      *Thread Reply:* Tagging @Natalie Zeller in case you want to collaborate

      +

      *Thread Reply:* Tagging @Natalie Zeller in case you want to collaborate

      @@ -115085,7 +115091,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-17 08:47:34
      -

      *Thread Reply:* Hi, +

      *Thread Reply:* Hi, We've implemented a transporter that transmits lineage from OpenLineage to OpenMetadata, you can find the github project here. I've also published a blog post that explains this integration and how to use it. I'll be happy to help if you have any question

      @@ -115147,7 +115153,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-17 09:49:30
      -

      *Thread Reply:* very cool! thanks a lot for responding so quickly

      +

      *Thread Reply:* very cool! thanks a lot for responding so quickly

      @@ -115239,7 +115245,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-18 05:10:31
      -

*Thread Reply:* runId is a UUID assigned per spark action (a compute trigger within a spark job). A single spark script can therefore result in multiple runs

+

*Thread Reply:* runId is a UUID assigned per spark action (a compute trigger within a spark job). A single spark script can therefore result in multiple runs

      @@ -115265,7 +115271,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-18 05:13:17
      -

      *Thread Reply:* adding an extra facet with applicationId looks like a good idea to me: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html#applicationId:String

      +

      *Thread Reply:* adding an extra facet with applicationId looks like a good idea to me: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html#applicationId:String

      spark.apache.org
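
A quick sketch of reading that id in PySpark, assuming an active session:

```
print(spark.sparkContext.applicationId)  # e.g. 'local-1690000000000' in local mode
```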
      @@ -115312,7 +115318,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-18 23:06:01
      -

      *Thread Reply:* Got it thanks!

      +

      *Thread Reply:* Got it thanks!

      @@ -115368,7 +115374,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-20 06:48:54
      -

*Thread Reply:* I think we don't have pandas support so far. So, if one uses pandas to read local files on disk, then there is perhaps little OpenLineage (OL) can do. There is an old pandas issue in our backlog (over 2 years old) -> https://github.com/OpenLineage/OpenLineage/issues/108

+

*Thread Reply:* I think we don't have pandas support so far. So, if one uses pandas to read local files on disk, then there is perhaps little OpenLineage (OL) can do. There is an old pandas issue in our backlog (over 2 years old) -> https://github.com/OpenLineage/OpenLineage/issues/108

Surely one can use the python OL client to create events manually and send them to MQZ, which may be less convenient (https://github.com/OpenLineage/OpenLineage/tree/main/client/python)

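A minimal sketch of that manual route with the Python client; the URL, namespaces, job and dataset names are all placeholders:

```
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # e.g. a local Marquez

# One COMPLETE event describing a hypothetical pandas job and its files.
event = RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid4())),
    job=Job(namespace="pandas_jobs", name="daily_csv_load"),
    producer="https://example.com/my-pandas-wrapper",
    inputs=[Dataset(namespace="file", name="/data/in.csv")],
    outputs=[Dataset(namespace="file", name="/data/out.parquet")],
)
client.emit(event)
```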
      @@ -115422,7 +115428,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-20 06:52:32
      -

      *Thread Reply:* Thanks Pawel for responding

      +

      *Thread Reply:* Thanks Pawel for responding

      @@ -115474,7 +115480,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-19 03:40:20
      -

*Thread Reply:* Hi @Anirudh Shrinivason, I am still working on that. It's kind of complex because I want to refactor column level lineage so that it can work with multiple Spark versions and multiple delta jars, as the merge into implementation for delta differs between delta releases. I thought it was ready, but this needs some extra work in the next few days. I am excited about that too!

+

*Thread Reply:* Hi @Anirudh Shrinivason, I am still working on that. It's kind of complex because I want to refactor column level lineage so that it can work with multiple Spark versions and multiple delta jars, as the merge into implementation for delta differs between delta releases. I thought it was ready, but this needs some extra work in the next few days. I am excited about that too!

      @@ -115500,7 +115506,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-19 03:54:37
      -

*Thread Reply:* Ahh I see... Got it! Is there a tentative timeline for when we can expect this? So sorry haha, don't mean to rush you. Just curious to know, that's all! 🙂

+

*Thread Reply:* Ahh I see... Got it! Is there a tentative timeline for when we can expect this? So sorry haha, don't mean to rush you. Just curious to know, that's all! 🙂

      @@ -115526,7 +115532,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-19 22:06:10
      -

      *Thread Reply:* Can we author a release sometime soon? Would like to use the CustomEnvironmentFacetBuilder for delta catalog!

      +

      *Thread Reply:* Can we author a release sometime soon? Would like to use the CustomEnvironmentFacetBuilder for delta catalog!

      @@ -115552,7 +115558,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-20 05:28:43
      -

*Thread Reply:* we're pretty close I think with merge into delta, which is under review. waiting for it would be nice. anyway, we're 3 weeks past the last release.

+

*Thread Reply:* we're pretty close I think with merge into delta, which is under review. waiting for it would be nice. anyway, we're 3 weeks past the last release.

      @@ -115578,7 +115584,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-20 06:50:56
      -

      *Thread Reply:* @Anirudh Shrinivason releases are available basically on-demand using our process in GOVERNANCE.md. I recommend watching 1958 and then making a request in #general once it’s been merged. But, as Paweł suggested, we have a scheduled release coming soon, anyway. Thanks for your interest in the fix!

      +

      *Thread Reply:* @Anirudh Shrinivason releases are available basically on-demand using our process in GOVERNANCE.md. I recommend watching 1958 and then making a request in #general once it’s been merged. But, as Paweł suggested, we have a scheduled release coming soon, anyway. Thanks for your interest in the fix!

      @@ -115652,7 +115658,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-20 11:01:14
      -

      *Thread Reply:* Ahh I see. Got it. Thanks! @Michael Robinson @Paweł Leszczyński

      +

      *Thread Reply:* Ahh I see. Got it. Thanks! @Michael Robinson @Paweł Leszczyński

      @@ -115678,7 +115684,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 03:12:22
      -

      *Thread Reply:* @Anirudh Shrinivason it's merged -> https://github.com/OpenLineage/OpenLineage/pull/1958

      +

      *Thread Reply:* @Anirudh Shrinivason it's merged -> https://github.com/OpenLineage/OpenLineage/pull/1958

      @@ -115778,7 +115784,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 04:19:15
      -

      *Thread Reply:* Awesome thanks so much! @Paweł Leszczyński

      +

      *Thread Reply:* Awesome thanks so much! @Paweł Leszczyński

      @@ -115834,7 +115840,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-19 07:14:54
      -

*Thread Reply:* One hacky approach would be to update the current dataset identifier to include the snapshot_id, so, for schema.table.tag we would have something like schema.table.tag-snapshot_id. The benefit is that it’s explicit and it doesn’t require a change in the OL schemas. The obvious downside (though not that serious in my opinion) is that it impacts readability. Not sure though if there are other non-obvious side-effects.

+

*Thread Reply:* One hacky approach would be to update the current dataset identifier to include the snapshot_id, so, for schema.table.tag we would have something like schema.table.tag-snapshot_id. The benefit is that it’s explicit and it doesn’t require a change in the OL schemas. The obvious downside (though not that serious in my opinion) is that it impacts readability. Not sure though if there are other non-obvious side-effects.

      Another alternative would be to add a dedicated property. For instance, the job > latestRun schema, the input/output dataset version objects could look like this: "inputDatasetVersions": [ @@ -115891,7 +115897,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-19 08:33:43
      -

      *Thread Reply:* @Paweł Leszczyński what do you think?

      +

      *Thread Reply:* @Paweł Leszczyński what do you think?

      @@ -115917,7 +115923,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-19 08:38:16
      -

*Thread Reply:* 1. How does snapshotId differ from version? Could one make the OL version property a string concat of iceberg-snapshot-id.iceberg-version?

+

*Thread Reply:* 1. How does snapshotId differ from version? Could one make the OL version property a string concat of iceberg-snapshot-id.iceberg-version?

1. I don't think it's necessary (or don't understand why) to add snapshot-id within column-lineage. Each entry within inputFields of columnLineage is already available within inputs of the OL event related to this run.
      @@ -115946,7 +115952,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-19 18:43:31
      -

*Thread Reply:* Yes, I think I follow the idea. The problem with that is the version is tied to the dataset name, i.e. my_namespace.table_A.tag_v1 which stays the same for the source dataset, which is the one being used with time travel. +

*Thread Reply:* Yes, I think I follow the idea. The problem with that is the version is tied to the dataset name, i.e. my_namespace.table_A.tag_v1 which stays the same for the source dataset, which is the one being used with time travel. 
Suppose the following sequence: 
step 1 => tableA.tagv1 has snapshot id 123-abc 
@@ -115991,7 +115997,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-20 06:35:22
      -

*Thread Reply:* So, my understanding of the problem is that the iceberg version is not unique. So, if you have version 3, revert to version 2, and then write something again, one ends up again with version 3.

+

*Thread Reply:* So, my understanding of the problem is that the iceberg version is not unique. So, if you have version 3, revert to version 2, and then write something again, one ends up again with version 3.

I would not like to mess with dataset names because on backends like Marquez, dataset names being the same across different jobs and runs is what allows building the lineage graph. If dataset names are different, then there is no way to build a lineage graph across multiple jobs.

      @@ -116053,7 +116059,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 13:09:39
      -

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated within two business days per our policy here.

      +

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated within two business days per our policy here.

      @@ -116079,7 +116085,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-25 13:44:47
      -

      *Thread Reply:* @Anirudh Shrinivason and others waiting on this release: the release process isn’t working as expected due to security improvements recently made to the website, ironically enough, which is the source for the spec. But we’re working on a fix and hope to complete the release soon.

      +

      *Thread Reply:* @Anirudh Shrinivason and others waiting on this release: the release process isn’t working as expected due to security improvements recently made to the website, ironically enough, which is the source for the spec. But we’re working on a fix and hope to complete the release soon.

      @@ -116105,7 +116111,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-25 15:19:49
      -

      *Thread Reply:* @Anirudh Shrinivason the release (0.30.1) is out now. Thanks for your patience 🙂

      +

      *Thread Reply:* @Anirudh Shrinivason the release (0.30.1) is out now. Thanks for your patience 🙂

      @@ -116131,7 +116137,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-25 23:21:14
      -

      *Thread Reply:* Hi @Michael Robinson Thanks a lot!

      +

      *Thread Reply:* Hi @Michael Robinson Thanks a lot!

      @@ -116157,7 +116163,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-26 08:52:24
      -

      *Thread Reply:* 👍

      +

      *Thread Reply:* 👍

      @@ -116218,7 +116224,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 08:10:58
      -

      *Thread Reply:* > I am running a job in Marquez with 180 rows of metadata +

      *Thread Reply:* > I am running a job in Marquez with 180 rows of metadata Do you mean that you have +100 rows of metadata in the jobs table for Marquez? Or that the job never finishes?

      @@ -116245,7 +116251,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 08:11:47
      -

*Thread Reply:* Also, yes, we have an event viewer that allows you to query the raw OL events

+

*Thread Reply:* Also, yes, we have an event viewer that allows you to query the raw OL events

      @@ -116280,7 +116286,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 08:12:19
      -

      *Thread Reply:* If you post a sample of your events, it’d be helpful to troubleshoot your issue

      +

      *Thread Reply:* If you post a sample of your events, it’d be helpful to troubleshoot your issue

      @@ -116306,10 +116312,10 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 08:53:25
      -

      *Thread Reply:*

      +

      *Thread Reply:*

@@ -116341,7 +116347,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 08:53:31
      -

*Thread Reply:* Sure Willy, thanks for your response. The job is still running. This is the code I am running from a jupyter notebook using the Python client:

+

*Thread Reply:* Sure Willy, thanks for your response. The job is still running. This is the code I am running from a jupyter notebook using the Python client:

      @@ -116367,7 +116373,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 08:54:33
      -

      *Thread Reply:* as you can see my input and output datasets are just 1 row

      +

      *Thread Reply:* as you can see my input and output datasets are just 1 row

      @@ -116393,7 +116399,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 08:55:02
      -

*Thread Reply:* I included column lineage, but the job keeps running so I don't know if it is working

+

*Thread Reply:* I included column lineage, but the job keeps running so I don't know if it is working

      @@ -116479,7 +116485,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 08:20:33
      -

      *Thread Reply:* spark integration is based on extracting lineage from optimized plans

      +

      *Thread Reply:* spark integration is based on extracting lineage from optimized plans
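
To see the plans it works from, the extended explain output includes the optimized logical plan; a quick sketch:

```
df = spark.range(10).selectExpr("id * 2 AS doubled")
df.explain(extended=True)  # prints parsed, analyzed, optimized logical plans and the physical plan
```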

      @@ -116505,7 +116511,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 08:25:35
      -

*Thread Reply:* https://youtu.be/rO3BPqUtWrI?t=1326 I recommend the whole presentation, but in case you're just interested in the Spark integration, there are a few mins that explain how this is achieved (the link points to 22:06 in the video)

+

*Thread Reply:* https://youtu.be/rO3BPqUtWrI?t=1326 I recommend the whole presentation, but in case you're just interested in the Spark integration, there are a few mins that explain how this is achieved (the link points to 22:06 in the video)

      YouTube
      @@ -116565,7 +116571,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 08:43:47
      -

      *Thread Reply:* Thanks Pawel for sharing. I will take a look. Have a nice weekend.

      +

      *Thread Reply:* Thanks Pawel for sharing. I will take a look. Have a nice weekend.

      @@ -116621,7 +116627,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 09:57:51
      -

      *Thread Reply:* Welcome, @Jens Pfau!

      +

      *Thread Reply:* Welcome, @Jens Pfau!

      @@ -116923,7 +116929,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-23 13:31:53
      -

      *Thread Reply:* That looks like your URL provided to OpenLineage is missing http:// or https:// in the front

      +

      *Thread Reply:* That looks like your URL provided to OpenLineage is missing http:// or https:// in the front

      @@ -116949,7 +116955,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-23 14:54:55
      -

*Thread Reply:* sorry, how can I resolve this? do I need to add this? I just followed the guide step by step. You don't mention anywhere to add anything. You provide something that

+

*Thread Reply:* sorry, how can I resolve this? do I need to add this? I just followed the guide step by step. You don't mention anywhere to add anything. You provide something that

      @@ -116975,7 +116981,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-23 14:55:05
      -

      *Thread Reply:* really does not work out of the box

      +

      *Thread Reply:* really does not work out of the box

      @@ -117001,7 +117007,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-23 14:55:13
      -

*Thread Reply:* and this is supposed to be a demo

+

*Thread Reply:* and this is supposed to be a demo

      @@ -117027,7 +117033,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-23 17:07:49
      -

*Thread Reply:* bumping e.g. to io.openlineage:openlineage-spark:0.29.2 seems to be fixing the issue

+

*Thread Reply:* bumping e.g. to io.openlineage:openlineage-spark:0.29.2 seems to be fixing the issue

      not sure why it stopped working for 0.12.0 but we’ll take a look and fix accordingly

      @@ -117055,7 +117061,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 04:51:34
      -

      *Thread Reply:* ...probably by bumping the version on this page 🙂

      +

      *Thread Reply:* ...probably by bumping the version on this page 🙂

      @@ -117081,7 +117087,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 05:00:28
      -

      *Thread Reply:* thank you both for coming back to me , I bumped to 0.29 and i think that it now runs.Is this the expected output ? +

      *Thread Reply:* thank you both for coming back to me , I bumped to 0.29 and i think that it now runs.Is this the expected output ? 23/07/24 08:43:55 INFO ConsoleTransport: {"eventTime":"2023_07_24T08:43:55.941Z","producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","schemaURL":"<https://openlineage.io/spec/2-0-0/OpenLineage.json#/$defs/RunEvent>","eventType":"COMPLETE","run":{"runId":"186c06c0_e79c_43cf_8bb7_08e1ab4c86a5","facets":{"spark.logicalPlan":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","_schemaURL":"<https://openlineage.io/spec/2-0-0/OpenLineage.json#/$defs/RunFacet>","plan":[{"class":"org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand","num-children":1,"table":{"product-class":"org.apache.spark.sql.catalyst.catalog.CatalogTable","identifier":{"product-class":"org.apache.spark.sql.catalyst.TableIdentifier","table":"temp2","database":"default"},"tableType":{"product-class":"org.apache.spark.sql.catalyst.catalog.CatalogTableType","name":"MANAGED"},"storage":{"product_class":"org.apache.spark.sql.catalyst.catalog.CatalogStorageFormat","compressed":false,"properties":null},"schema":{"type":"struct","fields":[]},"provider":"parquet","partitionColumnNames":[],"owner":"","createTime":1690188235517,"lastAccessTime":-1,"createVersion":"","properties":null,"unsupportedFeatures":[],"tracksPartitionsInCatalog":false,"schemaPreservesCase":true,"ignoredProperties":null},"mode":null,"query":0,"outputColumnNames":"[a, b]"},{"class":"org.apache.spark.sql.execution.LogicalRDD","num_children":0,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"a","dataType":"long","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":12,"jvmId":"173725f4_02c4_4174_9d18_3a61aa311d62"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num_children":0,"name":"b","dataType":"long","nullable":true,"metadata":{},"exprId":{"product_class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":13,"jvmId":"173725f4-02c4-4174-9d18-3a61aa311d62"},"qualifier":[]}]],"rdd":null,"outputPartitioning":{"product_class":"org.apache.spark.sql.catalyst.plans.physical.UnknownPartitioning","numPartitions":0},"outputOrdering":[],"isStreaming":false,"session":null}]},"spark_version":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","_schemaURL":"<https://openlineage.io/spec/2-0-0/OpenLineage.json#/$defs/RunFacet>","spark-version":"3.1.2","openlineage_spark_version":"0.29.2"}}},"job":{"namespace":"default","name":"sample_spark.execute_create_data_source_table_as_select_command","facets":{}},"inputs":[],"outputs":[{"namespace":"file","name":"/home/jovyan/spark-warehouse/temp2","facets":{"dataSource":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","_schemaURL":"<https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet>","name":"file","uri":"file"},"schema":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","_schemaURL":"<https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet>","fields":[{"name":"a","type":"long"},{"name":"b","type":"long"}]},"symlinks":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","_schemaURL":"<https://openlineage.io/spec/fa
cets/1-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet>","identifiers":[{"namespace":"/home/jovyan/spark-warehouse","name":"default.temp2","type":"TABLE"}]},"lifecycleStateChange":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","_schemaURL":"<https://openlineage.io/spec/facets/1-0-0/LifecycleStateChangeDatasetFacet.json#/$defs/LifecycleStateChangeDatasetFacet>","lifecycleStateChange":"CREATE"}},"outputFacets":{}}]} ? Also i then proceeded to run docker run --network spark_default -p 3000:3000 -e MARQUEZ_HOST=marquez-api -e MARQUEZ_PORT=5000 --link marquez-api:marquez-api marquezproject/marquez-web:0.19.1 @@ -117120,7 +117126,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 11:11:08
      -

      *Thread Reply:* You'd need to set up spark.openlineage.transport.url to send OpenLineage events to Marquez

      +

      *Thread Reply:* You'd need to set up spark.openlineage.transport.url to send OpenLineage events to Marquez
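
A sketch of that conf in a PySpark session; the host and port assume the Marquez docker-compose defaults discussed below:

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder.master("local")
         .appName("sample_spark")
         .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
         .config("spark.openlineage.transport.type", "http")
         .config("spark.openlineage.transport.url", "http://marquez-api:5000")
         .config("spark.openlineage.namespace", "spark_integration")  # illustrative
         .getOrCreate())
```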

      @@ -117146,7 +117152,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 11:12:28
      -

*Thread Reply:* where and how can I do this?

+

*Thread Reply:* where and how can I do this?

      @@ -117172,7 +117178,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 11:13:04
      -

      *Thread Reply:* do i need to edit the conf ?

      +

      *Thread Reply:* do i need to edit the conf ?

      @@ -117198,7 +117204,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 11:37:09
      -

      *Thread Reply:* yes, in the spark conf

      +

      *Thread Reply:* yes, in the spark conf

      @@ -117224,7 +117230,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 11:37:48
      -

*Thread Reply:* what should this url be?

+

*Thread Reply:* what should this url be?

      @@ -117250,7 +117256,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 11:37:51
      -

      *Thread Reply:* http://localhost:3000/ ?

      +

      *Thread Reply:* http://localhost:3000/ ?

      @@ -117276,7 +117282,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 11:43:30
      -

*Thread Reply:* That depends on how you ran Marquez, but looking at your screenshot the UI is at 3000, so I guess the API would be at 5000

+

*Thread Reply:* That depends on how you ran Marquez, but looking at your screenshot the UI is at 3000, so I guess the API would be at 5000

      @@ -117302,7 +117308,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 11:43:46
      -

      *Thread Reply:* as that's default in Marquez docker-compose

      +

      *Thread Reply:* as that's default in Marquez docker-compose

      @@ -117328,7 +117334,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 11:44:14
      -

      *Thread Reply:* i cannot see spark conf

      +

      *Thread Reply:* i cannot see spark conf

      @@ -117354,7 +117360,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 11:44:23
      -

      *Thread Reply:* is it in there or do i need to create it ?

      +

      *Thread Reply:* is it in there or do i need to create it ?

      @@ -117380,7 +117386,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 16:42:53
      -

      *Thread Reply:* Is something like +

      *Thread Reply:* Is something like ```from pyspark.sql import SparkSession

      spark = (SparkSession.builder.master('local') @@ -117416,7 +117422,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-25 05:08:08
      -

      *Thread Reply:* OK when i use the snippet you provided and then execute +

      *Thread Reply:* OK when i use the snippet you provided and then execute docker run --network sparkdefault -p 3000:3000 -e MARQUEZHOST=marquez-api -e MARQUEZ_PORT=5000 --link marquez-api:marquez-api marquezproject/marquez-web:0.19.1

      I can now see this

      @@ -117454,7 +117460,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-25 05:08:52
      -

      *Thread Reply:* but when i click on the job i then get this

      +

      *Thread Reply:* but when i click on the job i then get this

      @@ -117489,7 +117495,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-25 05:09:05
      -

      *Thread Reply:* so i cannot see any details of the job

      +

      *Thread Reply:* so i cannot see any details of the job

      @@ -117515,18 +117521,14 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 05:54:50
      -

      *Thread Reply:* @George Polychronopoulos Hi, I am facing the same issue. After adding spark conf and using the docker run command, marquez is still showing empty. Do I need to change something in the run command?

      +

      *Thread Reply:* @George Polychronopoulos Hi, I am facing the same issue. After adding spark conf and using the docker run command, marquez is still showing empty. Do I need to change something in the run command?

      @@ -117554,7 +117556,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 05:55:15
      -

      *Thread Reply:* yes i will tell you

      +

      *Thread Reply:* yes i will tell you

      @@ -117580,7 +117582,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 07:36:41
      -

      *Thread Reply:* For the docker command that I used, I updated the marquez-web version to 0.40.0 and I also updated the Marquez_host which I am not sure if I have to or not. The UI is running but not showing anything docker run --network spark_default -p 3000:3000 -e MARQUEZ_HOST=localhost -e MARQUEZ_PORT=5000 --link marquez-api:marquez-api marquez/marquez-web:0.40.0

      +

      *Thread Reply:* For the docker command that I used, I updated the marquez-web version to 0.40.0 and I also updated the Marquez_host which I am not sure if I have to or not. The UI is running but not showing anything docker run --network spark_default -p 3000:3000 -e MARQUEZ_HOST=localhost -e MARQUEZ_PORT=5000 --link marquez-api:marquez-api marquez/marquez-web:0.40.0

      @@ -117606,7 +117608,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 07:36:52
      -

*Thread Reply:* it's because you are running this command, right

+

*Thread Reply:* it's because you are running this command, right

      @@ -117632,7 +117634,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 07:36:55
      -

*Thread Reply:* yes that's it

+

*Thread Reply:* yes that's it

      @@ -117658,7 +117660,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 07:36:58
      -

      *Thread Reply:* you need 0.40

      +

      *Thread Reply:* you need 0.40

      @@ -117684,7 +117686,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 07:37:03
      -

      *Thread Reply:* and there is a lot of stuff

      +

      *Thread Reply:* and there is a lot of stuff

      @@ -117710,7 +117712,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 07:37:07
      -

*Thread Reply:* you need to change

+

*Thread Reply:* you need to change

      @@ -117736,7 +117738,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 07:37:10
      -

      *Thread Reply:* in the Docker

      +

      *Thread Reply:* in the Docker

      @@ -117762,7 +117764,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 07:37:24
      -

      *Thread Reply:* so the spark

      +

      *Thread Reply:* so the spark

      @@ -117788,7 +117790,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 07:37:25
      -

      *Thread Reply:* version

      +

      *Thread Reply:* version

      @@ -117814,7 +117816,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 07:37:27
      -

      *Thread Reply:* the python

      +

      *Thread Reply:* the python

      @@ -117840,7 +117842,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 07:38:05
      -

      *Thread Reply:* version: "3.10" +

      *Thread Reply:* version: "3.10" services: notebook: image: jupyter/pyspark-notebook:spark-3.4.1 @@ -117935,7 +117937,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:38:10
      -

*Thread Reply:* this is how mine looks

+

*Thread Reply:* this is how mine looks

      @@ -117961,7 +117963,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:38:20
      -

*Thread Reply:* it is all tested and the latest version

+

*Thread Reply:* it is all tested and the latest version

      @@ -117987,7 +117989,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:38:31
      -

      *Thread Reply:* postgres does not work beyond 12

      +

      *Thread Reply:* postgres does not work beyond 12

      @@ -118013,7 +118015,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:38:56
      -

      *Thread Reply:* if you run this docker-compose up

      +

      *Thread Reply:* if you run this docker-compose up

      @@ -118039,7 +118041,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:38:58
      -

      *Thread Reply:* the notebooks

      +

      *Thread Reply:* the notebooks

      @@ -118065,7 +118067,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:39:02
      -

*Thread Reply:* are 10x faster

+

*Thread Reply:* are 10x faster

      @@ -118091,7 +118093,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:39:06
      -

      *Thread Reply:* and give no errors

      +

      *Thread Reply:* and give no errors

      @@ -118117,7 +118119,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:39:14
      -

      *Thread Reply:* also you need to update other stuff

      +

      *Thread Reply:* also you need to update other stuff

      @@ -118143,7 +118145,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:39:18
      -

      *Thread Reply:* such as

      +

      *Thread Reply:* such as

      @@ -118169,7 +118171,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:39:26
      -

*Thread Reply:* don't run what is in the docs

+

*Thread Reply:* don't run what is in the docs

      @@ -118195,7 +118197,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:39:34
      -

      *Thread Reply:* but run what is in github

      +

      *Thread Reply:* but run what is in github

      @@ -118221,7 +118223,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:40:13
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark

      @@ -118247,7 +118249,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:40:22
      -

      *Thread Reply:* run in your notebooks what is in here

      +

      *Thread Reply:* run in your notebooks what is in here

      @@ -118273,7 +118275,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:40:32
      -

      *Thread Reply:* ```from pyspark.sql import SparkSession

      +

      *Thread Reply:* ```from pyspark.sql import SparkSession

      spark = (SparkSession.builder.master('local') .appName('samplespark') @@ -118306,7 +118308,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:40:38
      -

*Thread Reply:* they don't update the documentation

+

*Thread Reply:* they don't update the documentation

      @@ -118332,7 +118334,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:40:44
      -

      *Thread Reply:* it took me 4 weeks to get here

      +

      *Thread Reply:* it took me 4 weeks to get here

      @@ -118423,7 +118425,7 @@

      Marquez as an OpenLineage Client

      2023-07-24 00:14:43
      -

      *Thread Reply:* Like I want to change the facets of a specific dataset.

      +

      *Thread Reply:* Like I want to change the facets of a specific dataset.

      @@ -118449,7 +118451,7 @@

      Marquez as an OpenLineage Client

      2023-07-24 16:45:18
      -

      *Thread Reply:* @Willy Lulciuc do you have any idea? 🙂

      +

      *Thread Reply:* @Willy Lulciuc do you have any idea? 🙂

      @@ -118475,7 +118477,7 @@

      Marquez as an OpenLineage Client

      2023-07-25 05:02:47
      -

      *Thread Reply:* I solved this by adding a new job which outputs to the same dataset. This ended up in a newer dataset version.

      +

      *Thread Reply:* I solved this by adding a new job which outputs to the same dataset. This ended up in a newer dataset version.

      @@ -118501,7 +118503,7 @@

      Marquez as an OpenLineage Client

      2023-07-25 06:20:58
      -

      *Thread Reply:* @Steven great to hear that you solved the issue! but there are some minor logical inconsistencies that we’d like to address with versioning (for both datasets and jobs) in Marquez. The tl;dr is the version column wasn’t meant to be used externally, but internally within Marquez. The issue is “minor” as it’s more of a pointer thing. We’ll be addressing soon. For some background, you can look at: +

      *Thread Reply:* @Steven great to hear that you solved the issue! but there are some minor logical inconsistencies that we’d like to address with versioning (for both datasets and jobs) in Marquez. The tl;dr is the version column wasn’t meant to be used externally, but internally within Marquez. The issue is “minor” as it’s more of a pointer thing. We’ll be addressing soon. For some background, you can look at: • https://github.com/MarquezProject/marquez/issues/2071https://github.com/MarquezProject/marquez/pull/2153

      Marquez as an OpenLineage Client
      2023-07-25 06:43:32
      -

*Thread Reply:* @Steven ahh very good point, it’s technically not an “error” in the true sense, but annoying nonetheless. I think you’re referencing the init container in the Marquez helm chart? https://github.com/MarquezProject/marquez/blob/main/chart/templates/marquez/deployment.yaml#L37

+

*Thread Reply:* @Steven ahh very good point, it’s technically not an “error” in the true sense, but annoying nonetheless. I think you’re referencing the init container in the Marquez helm chart? https://github.com/MarquezProject/marquez/blob/main/chart/templates/marquez/deployment.yaml#L37

      @@ -118655,7 +118657,7 @@

      Marquez as an OpenLineage Client

      2023-07-25 06:45:24
      -

*Thread Reply:* hmm, actually what raises the error you’re referencing? the Marquez http server?

+

*Thread Reply:* hmm, actually what raises the error you’re referencing? the Marquez http server?

      @@ -118681,7 +118683,7 @@

      Marquez as an OpenLineage Client

      2023-07-25 06:49:08
      -

      *Thread Reply:* > Every time I restart the marquez deployment I have to drop all those tables otherwise it will raise table already exists ERROR +

      *Thread Reply:* > Every time I restart the marquez deployment I have to drop all those tables otherwise it will raise table already exists ERROR This shouldn’t be an error. I’m trying to understand the scenario in which this error is thrown (any info is helpful). We use flyway to manage our db schema, but you may have gotten in an odd state somehow

      @@ -118734,7 +118736,7 @@

      Marquez as an OpenLineage Client

      2023-07-31 03:35:58
      -

      *Thread Reply:* Nope, one should modify the cluster as per doc <https://openlineage.io/docs/integrations/spark/quickstart_databricks> but no changes in notebook are required.

      +

      *Thread Reply:* Nope, one should modify the cluster as per doc <https://openlineage.io/docs/integrations/spark/quickstart_databricks> but no changes in notebook are required.

      @@ -118760,7 +118762,7 @@

      Marquez as an OpenLineage Client

      2023-08-02 10:59:00
      -

      *Thread Reply:* Right, great, that’s exactly what I was hoping 😄

      +

      *Thread Reply:* Right, great, that’s exactly what I was hoping 😄

      @@ -118866,7 +118868,7 @@

      Marquez as an OpenLineage Client

      2023-07-27 14:47:18
      -

      *Thread Reply:* This is the first I’ve heard of someone trying to do this, but others have tried getting lineage from pandas. There isn’t support for this currently, but this thread contains a link to an issue that might be helpful: https://openlineage.slack.com/archives/C01CK9T7HKR/p1689850134978429?thread_ts=1689688067.729469&cid=C01CK9T7HKR.

      +

      *Thread Reply:* This is the first I’ve heard of someone trying to do this, but others have tried getting lineage from pandas. There isn’t support for this currently, but this thread contains a link to an issue that might be helpful: https://openlineage.slack.com/archives/C01CK9T7HKR/p1689850134978429?thread_ts=1689688067.729469&cid=C01CK9T7HKR.

      @@ -118931,7 +118933,7 @@

      Marquez as an OpenLineage Client

      2023-07-28 02:10:14
      -

      *Thread Reply:* Thank you for your response. We have implemented the “manual way” of emitting events with python OL client. We are now looking for a more automated way, so that updates to the scripts that run in Ray are minimal to none

      +

      *Thread Reply:* Thank you for your response. We have implemented the “manual way” of emitting events with python OL client. We are now looking for a more automated way, so that updates to the scripts that run in Ray are minimal to none

      @@ -118957,7 +118959,7 @@

      Marquez as an OpenLineage Client

      2023-07-28 13:03:43
      -

      *Thread Reply:* If you're actively using Ray, then you know way more about it than me, or probably any other OL contributor 🙂 +

*Thread Reply:* If you're actively using Ray, then you know way more about it than me, or probably any other OL contributor 🙂 I don't know how it works or is deployed, but I would recommend checking if there's a robust way of being notified in the runtime about processing occurring there.

      @@ -118984,7 +118986,7 @@

      Marquez as an OpenLineage Client

      2023-07-31 12:17:07
      -

      *Thread Reply:* Thank you for the tip. That’s the kind of details I’m looking for, but couldn’t find yet

      +

      *Thread Reply:* Thank you for the tip. That’s the kind of details I’m looking for, but couldn’t find yet

      @@ -119036,7 +119038,7 @@

      Marquez as an OpenLineage Client

      2023-07-28 10:53:35
      -

      *Thread Reply:* @Martin Fiser can you share any resources or pointers that might be helpful?

      +

      *Thread Reply:* @Martin Fiser can you share any resources or pointers that might be helpful?

      @@ -119062,7 +119064,7 @@

      Marquez as an OpenLineage Client

      2023-08-21 19:17:17
      -

      *Thread Reply:* Hi, apologies - vacation period has hit m. However here are the resources:

      +

*Thread Reply:* Hi, apologies - vacation period has hit me. However here are the resources:

API endpoint: https://app.swaggerhub.com/apis-docs/keboola/job-queue-api/1.3.4#/Jobs/getJobOpenApiLineage

@@ -119123,7 +119125,7 @@

      Marquez as an OpenLineage Client

      2023-08-01 05:14:58
      -

      *Thread Reply:* +1, I could also really use this!

      +

      *Thread Reply:* +1, I could also really use this!

      @@ -119149,7 +119151,7 @@

      Marquez as an OpenLineage Client

      2023-08-01 05:27:34
      -

      *Thread Reply:* Found a way: you download it as json in the above link (“Download OpenAPI specification”), then if you copy paste it to editor.swagger.io it asks f you want to convert to yaml :)

      +

*Thread Reply:* Found a way: you download it as json in the above link (“Download OpenAPI specification”), then if you copy paste it to editor.swagger.io it asks if you want to convert to yaml :)
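
If you'd rather not round-trip through editor.swagger.io, here is a minimal local sketch of the same JSON-to-YAML conversion (PyYAML assumed installed; the filenames are hypothetical):

import json

import yaml  # pip install pyyaml

# "openlineage.json" is the spec downloaded via "Download OpenAPI specification"
with open("openlineage.json") as f:
    spec = json.load(f)

# write it back out as YAML, preserving key order
with open("openlineage.yaml", "w") as f:
    yaml.safe_dump(spec, f, sort_keys=False)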

      @@ -119175,7 +119177,7 @@

      Marquez as an OpenLineage Client

      2023-08-01 10:25:49
      -

      *Thread Reply:* Whilst that works, it isn't complete. The issue is that the "facets" are not resolved. Exploring the website repository (https://github.com/OpenLineage/website/tree/main/static/spec) shows that facets aren't published alongside the spec, beyond 1.0.1 - which means its hard to know which revisions of the facets belong to which version of the spec.

      +

*Thread Reply:* Whilst that works, it isn't complete. The issue is that the "facets" are not resolved. Exploring the website repository (https://github.com/OpenLineage/website/tree/main/static/spec) shows that facets aren't published alongside the spec, beyond 1.0.1 - which means it's hard to know which revisions of the facets belong to which version of the spec.

      @@ -119201,7 +119203,7 @@

      Marquez as an OpenLineage Client

      2023-08-01 10:26:54
      -

      *Thread Reply:* Good point! Would be good if we could clarify how to get the full spec, in that case

      +

      *Thread Reply:* Good point! Would be good if we could clarify how to get the full spec, in that case

      @@ -119227,7 +119229,7 @@

      Marquez as an OpenLineage Client

      2023-08-01 10:30:57
      -

      *Thread Reply:* Granted. If the spec follows backwards compatible evolution rules, then this shouldn't be a problem, i.e., new fields must be optional, you can not remove existing fields, you can not modify existing fields, etc.

      +

      *Thread Reply:* Granted. If the spec follows backwards compatible evolution rules, then this shouldn't be a problem, i.e., new fields must be optional, you can not remove existing fields, you can not modify existing fields, etc.

      @@ -119257,7 +119259,7 @@

      Marquez as an OpenLineage Client

      2023-08-01 12:15:22
      -

      *Thread Reply:* We don't have facets with newer version than 1.1.0

      +

      *Thread Reply:* We don't have facets with newer version than 1.1.0

      @@ -119283,7 +119285,7 @@

      Marquez as an OpenLineage Client

      2023-08-01 12:15:56
      -

      *Thread Reply:* @Damien Hawes we've moved to merge docs and website repos here: https://github.com/OpenLineage/docs

      +

      *Thread Reply:* @Damien Hawes we've moved to merge docs and website repos here: https://github.com/OpenLineage/docs

      @@ -119338,7 +119340,7 @@

      Marquez as an OpenLineage Client

      2023-08-01 12:18:23
      -

      *Thread Reply:* > Would be good if we could clarify how to get the full spec, in that case +

*Thread Reply:* > Would be good if we could clarify how to get the full spec, in that case
Is using https://github.com/OpenLineage/OpenLineage/tree/main/spec not enough? We have separate files with facet definitions to be able to evolve them separately from the main spec

      @@ -119365,7 +119367,7 @@

      Marquez as an OpenLineage Client

      2023-08-02 04:53:03
      -

      *Thread Reply:* @Maciej Obuchowski - thanks for your input. I understand the desire to want to evolve the facets independently from the main spec, yet I keep running into a mental wall.

      +

      *Thread Reply:* @Maciej Obuchowski - thanks for your input. I understand the desire to want to evolve the facets independently from the main spec, yet I keep running into a mental wall.

      If I say, 'My application is compatible with OpenLineage 1.0.5' - what does that mean exactly? Does it mean that I am at least compatible with the base definition of RunEvent and its nested components, but not facets?

      @@ -119403,7 +119405,7 @@

      Marquez as an OpenLineage Client

      2023-08-02 04:56:36
      -

      *Thread Reply:* If I approach this from a conventional software engineering standpoint, where I provide a library to my consumers. The library has a version associated with it, and that version encompasses all the objects located within that particular library. If I release a new version of my library, it implies that some form of evolution has happened. Whether it is a bug fix, a documentation change, or evolving the API of my objects it means something has changed and the new version is there to indicate that.

      +

*Thread Reply:* If I approach this from a conventional software engineering standpoint, I provide a library to my consumers. The library has a version associated with it, and that version encompasses all the objects located within that particular library. If I release a new version of my library, it implies that some form of evolution has happened. Whether it is a bug fix, a documentation change, or evolving the API of my objects, it means something has changed, and the new version is there to indicate that.

      @@ -119429,7 +119431,7 @@

      Marquez as an OpenLineage Client

      2023-08-02 04:56:53
      -

      *Thread Reply:* Yes - it means you can read and understand base spec. Facets are completely optional - reading them might provide you additional information, but you as a event consumer need to define what you do with them. Basically, the needs can be very different between consumers, spec should not define behavior of a consumer.

      +

      *Thread Reply:* Yes - it means you can read and understand base spec. Facets are completely optional - reading them might provide you additional information, but you as a event consumer need to define what you do with them. Basically, the needs can be very different between consumers, spec should not define behavior of a consumer.

      @@ -119459,7 +119461,7 @@

      Marquez as an OpenLineage Client

      2023-08-02 05:01:26
      -

      *Thread Reply:* OK. Thanks for the clarification. That clears things up for me.

      +

      *Thread Reply:* OK. Thanks for the clarification. That clears things up for me.

      @@ -119586,7 +119588,7 @@

      Marquez as an OpenLineage Client

      2023-08-01 13:26:30
      -

      *Thread Reply:* Thanks, @Maciej Obuchowski

      +

      *Thread Reply:* Thanks, @Maciej Obuchowski

      @@ -119612,7 +119614,7 @@

      Marquez as an OpenLineage Client

      2023-08-01 15:43:00
      -

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated within two business days.

      +

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated within two business days.

      @@ -119753,7 +119755,7 @@

      Marquez as an OpenLineage Client

      2023-08-02 10:09:33
      -

      *Thread Reply:* I think the apidocs are not up to date 🙂

      +

      *Thread Reply:* I think the apidocs are not up to date 🙂

      @@ -119779,7 +119781,7 @@

      Marquez as an OpenLineage Client

      2023-08-02 10:09:43
      -

      *Thread Reply:* https://openlineage.io/spec/2-0-2/OpenLineage.json has the newest spec

      +

      *Thread Reply:* https://openlineage.io/spec/2-0-2/OpenLineage.json has the newest spec

      @@ -119805,7 +119807,7 @@

      Marquez as an OpenLineage Client

      2023-08-02 10:44:23
      -

      *Thread Reply:* thanks for the pointer @Maciej Obuchowski

      +

      *Thread Reply:* thanks for the pointer @Maciej Obuchowski

      @@ -119831,7 +119833,7 @@

      Marquez as an OpenLineage Client

      2023-08-02 10:49:17
      -

      *Thread Reply:* Also working on updating the apidocs

      +

      *Thread Reply:* Also working on updating the apidocs

      @@ -119857,7 +119859,7 @@

      Marquez as an OpenLineage Client

      2023-08-02 11:21:14
      -

      *Thread Reply:* The API docs are now up to date @Juan Luis Cano Rodríguez! Thank you for raising this issue.

      +

      *Thread Reply:* The API docs are now up to date @Juan Luis Cano Rodríguez! Thank you for raising this issue.

      @@ -119991,7 +119993,7 @@

      Marquez as an OpenLineage Client

      2023-08-03 05:36:04
      -

      *Thread Reply:* I think our underlying SQL parser does not hancle the Postgres versions of those queries

      +

*Thread Reply:* I think our underlying SQL parser does not handle the Postgres versions of those queries

      @@ -120017,7 +120019,7 @@

      Marquez as an OpenLineage Client

      2023-08-03 05:36:14
      -

      *Thread Reply:* Can you post the (anonymized?) queries?

      +

      *Thread Reply:* Can you post the (anonymized?) queries?

      @@ -120047,7 +120049,7 @@

      Marquez as an OpenLineage Client

      2023-08-03 07:03:09
      -

      *Thread Reply:* for example

      +

      *Thread Reply:* for example

      copy bi.marquez_test_2 from '******' iam_role '**********' delimiter as '^' gzi
       
      @@ -120076,7 +120078,7 @@

      Marquez as an OpenLineage Client

      2023-08-07 13:35:30
      -

      *Thread Reply:* @Zahi Fail iam_role suggests you want redshift version of this supported, not Postgres one right?

      +

      *Thread Reply:* @Zahi Fail iam_role suggests you want redshift version of this supported, not Postgres one right?

      @@ -120102,7 +120104,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 04:04:35
      -

      *Thread Reply:* @Maciej Obuchowski hey, actually I tried both Postgres and Redshift to S3 operators. +

*Thread Reply:* @Maciej Obuchowski hey, actually I tried both Postgres and Redshift to S3 operators. Both of them sent a new event through OL to Marquez, but still weren’t part of the entire flow.

      @@ -120160,7 +120162,7 @@

      Marquez as an OpenLineage Client

      2023-08-04 02:17:19
      -

      *Thread Reply:* Hey @Athitya Kumar,

      +

      *Thread Reply:* Hey @Athitya Kumar,

      1. For parsing SQL queries, we're using sqlparser-rs (https://github.com/sqlparser-rs/sqlparser-rs) which already has great coverage of sql syntax and supports different dialects. it's open source project and we already did contribute to it for snowflake dialect.
      2. We don't have such a benchmark, but if you like, you could contribute and help us providing such. We do support joins, subqueries, iceberg and delta tables, jdbc for Spark and much more. Everything we do support, is covered in our tests.
3. Not sure if I got it properly. Marquez is our reference backend implementation, which parses all the facets and stores them in a relational db in a relational manner (facets, jobs, datasets and runs in separate tables).
      @@ -120218,7 +120220,7 @@

      Marquez as an OpenLineage Client

      2023-08-04 02:29:53
      -

      *Thread Reply:* For (3), I was referring to where we call the sqlparser-rs in our spark-openlineage event listener / integration; and how customising/improving them would look like

      +

*Thread Reply:* For (3), I was referring to where we call the sqlparser-rs in our spark-openlineage event listener / integration; and what customising/improving them would look like

      @@ -120244,7 +120246,7 @@

      Marquez as an OpenLineage Client

      2023-08-04 02:37:20
      -

      *Thread Reply:* sqlparser-rs is a rust libary and we bundle it within iface-java (https://github.com/OpenLineage/OpenLineage/blob/main/integration/sql/iface-java/src/main/java/io/openlineage/sql/SqlMeta.java). It's capable of extracting input/output datasets, column lineage information from SQL

      +

*Thread Reply:* sqlparser-rs is a rust library and we bundle it within iface-java (https://github.com/OpenLineage/OpenLineage/blob/main/integration/sql/iface-java/src/main/java/io/openlineage/sql/SqlMeta.java). It's capable of extracting input/output datasets and column lineage information from SQL

      @@ -120376,7 +120378,7 @@

      Marquez as an OpenLineage Client

      2023-08-04 02:40:02
      -

      *Thread Reply:* and this is Spark code that extracts it from JdbcRelation -> https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/[…]ge/spark/agent/lifecycle/plan/handlers/JdbcRelationHandler.java

      +

      *Thread Reply:* and this is Spark code that extracts it from JdbcRelation -> https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/[…]ge/spark/agent/lifecycle/plan/handlers/JdbcRelationHandler.java

      @@ -120511,7 +120513,7 @@

      Marquez as an OpenLineage Client

      2023-08-04 04:08:53
      -

      *Thread Reply:* I think 3 question relates generally to Spark SQL handling, rather than handling JDBC connections inside Spark, right?

      +

*Thread Reply:* I think the 3rd question relates generally to Spark SQL handling, rather than handling JDBC connections inside Spark, right?

      @@ -120537,7 +120539,7 @@

      Marquez as an OpenLineage Client

      2023-08-04 04:24:57
      -

      *Thread Reply:* Yup, both actually. Related to getting the JDBC connection info in the input/output facet, as well as spark-sql queries we do on that JDBC connection

      +

      *Thread Reply:* Yup, both actually. Related to getting the JDBC connection info in the input/output facet, as well as spark-sql queries we do on that JDBC connection

      @@ -120563,7 +120565,7 @@

      Marquez as an OpenLineage Client

      2023-08-04 06:00:17
      -

      *Thread Reply:* For Spark SQL - it's translated to Spark's internal query LogicalPlan. We take that plan, and process it's nodes. From root node we can take output dataset, from leaf nodes we can take input datasets, and inside internal nodes we track columns to extract column-level lineage. We express those (table-level) operations by implementing classes like QueryPlanVisitor

      +

*Thread Reply:* For Spark SQL - it's translated to Spark's internal query LogicalPlan. We take that plan, and process its nodes. From the root node we can take the output dataset, from leaf nodes we can take input datasets, and inside internal nodes we track columns to extract column-level lineage. We express those (table-level) operations by implementing classes like QueryPlanVisitor

You can extend that, for example for additional types of nodes that we don't support, by implementing your own QueryPlanVisitor, and then implementing OpenLineageEventHandlerFactory and packaging this into a .jar deployed alongside the OpenLineage jar - this would be loaded by us using Java's ServiceLoader.

      Marquez as an OpenLineage Client
      2023-08-08 05:06:07
      -

      *Thread Reply:* @Maciej Obuchowski @Paweł Leszczyński - Thanks for your responses! I had a follow-up query regarding the sqlparser-rs that's used internally by open-lineage: we see that these are the SQL dialects supported by sqlparser-rs here doesn't include spark-sql / presto-sql dialects which means they'd fallback to generic dialect:

      +

*Thread Reply:* @Maciej Obuchowski @Paweł Leszczyński - Thanks for your responses! I had a follow-up query regarding the sqlparser-rs that's used internally by open-lineage: we see that the SQL dialects supported by sqlparser-rs here don't include spark-sql / presto-sql dialects, which means they'd fall back to the generic dialect:

      "--ansi" =&gt; Box::new(AnsiDialect {}), "--bigquery" =&gt; Box::new(BigQueryDialect {}), @@ -120707,7 +120709,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:21:32
      -

      *Thread Reply:* spark-sql integration is based on spark LogicalPlan's tree. Extracting input/output datasets from tree nodes which is more detailed than sql parsing

      +

*Thread Reply:* spark-sql integration is based on spark LogicalPlan's tree. Extracting input/output datasets from tree nodes is more detailed than sql parsing

      @@ -120733,7 +120735,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 07:04:52
      -

      *Thread Reply:* I think presto/trino dialect is very standard - there shouldn't be any problems with regular queries

      +

      *Thread Reply:* I think presto/trino dialect is very standard - there shouldn't be any problems with regular queries

      @@ -120759,7 +120761,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 11:19:53
      -

      *Thread Reply:* @Paweł Leszczyński - Got it, and would you be able to point me to where within the openlineage-spark integration do we:

      +

      *Thread Reply:* @Paweł Leszczyński - Got it, and would you be able to point me to where within the openlineage-spark integration do we:

      1. provide the Spark Logical Plan / query to sqlparser-rs
      2. get the output of sqlparser-rs (parsed query AST) & stitch back the inputs/outputs in the open-lineage events?
      @@ -120788,7 +120790,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 12:09:06
      -

      *Thread Reply:* For example, we'd like to understand which dialectname of sqlparser-rs would be used in which scenario by open-lineage and what're the interactions b/w open-lineage & sqlparser-rs

      +

*Thread Reply:* For example, we'd like to understand which dialect name of sqlparser-rs would be used in which scenario by open-lineage and what're the interactions b/w open-lineage & sqlparser-rs

      @@ -120814,7 +120816,7 @@

      Marquez as an OpenLineage Client

      2023-08-09 12:18:47
      -

      *Thread Reply:* @Paweł Leszczyński - Incase you missed the above messages ^

      +

*Thread Reply:* @Paweł Leszczyński - In case you missed the above messages ^

      @@ -120840,7 +120842,7 @@

      Marquez as an OpenLineage Client

      2023-08-10 03:31:32
      -

      *Thread Reply:* Sqlparser-rs is used within Spark integration only for spark jdbc queries (queries to external databases). That's the only scenario. For spark.sql(...) , instead of SQL parsing, we rely on a logical plan of a job and extract information from it. For jdbc queries, that user sqlparser-rs, dialect is extracted from url: +

*Thread Reply:* Sqlparser-rs is used within the Spark integration only for spark jdbc queries (queries to external databases). That's the only scenario. For spark.sql(...) , instead of SQL parsing, we rely on a logical plan of a job and extract information from it. For jdbc queries, which use sqlparser-rs, the dialect is extracted from the url: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/util/JdbcUtils.java#L69
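
For anyone who wants to poke at the parser directly, a minimal sketch using the openlineage-sql Python bindings; the dialect string and the SqlMeta field names shown are assumptions based on the interface referenced above:

from openlineage_sql import parse  # pip install openlineage-sql

# "postgres" stands in for whatever dialect the integration derives from the JDBC url
meta = parse(
    ["INSERT INTO bi.target SELECT id, name FROM bi.source"],
    dialect="postgres",
)
print(meta.in_tables)   # tables the query reads from
print(meta.out_tables)  # tables the query writes to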

      @@ -120922,7 +120924,7 @@

      Marquez as an OpenLineage Client

      2023-08-06 17:25:31
      -

      *Thread Reply:* No, it's not.

      +

      *Thread Reply:* No, it's not.

      @@ -120948,7 +120950,7 @@

      Marquez as an OpenLineage Client

      2023-08-06 23:53:17
      -

      *Thread Reply:* Is it only available for spark version 3+?

      +

      *Thread Reply:* Is it only available for spark version 3+?

      @@ -120974,7 +120976,7 @@

      Marquez as an OpenLineage Client

      2023-08-07 04:53:41
      -

      *Thread Reply:* Yes

      +

      *Thread Reply:* Yes

      @@ -121141,7 +121143,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 04:59:23
      -

      *Thread Reply:* Does this work with similar jobs, but with small amount of columns?

      +

      *Thread Reply:* Does this work with similar jobs, but with small amount of columns?

      @@ -121167,7 +121169,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:12:52
      -

      *Thread Reply:* thanks for reply @Maciej Obuchowski yes it works for small amount of columns +

*Thread Reply:* thanks for the reply @Maciej Obuchowski yes, it works for a small number of columns but not for a large number of columns

      @@ -121194,7 +121196,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:14:04
      -

      *Thread Reply:* one more question: how much data the jobs approximately process and how long does the execution take?

      +

      *Thread Reply:* one more question: how much data the jobs approximately process and how long does the execution take?

      @@ -121220,7 +121222,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:14:54
      -

      *Thread Reply:* ah… it’s like 20 min ~ 30 min various +

*Thread Reply:* ah… it’s like 20 min ~ 30 min, it varies. data size is like 2000,0000 rows with 100 ~ 1000 columns

      @@ -121247,7 +121249,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:15:17
      -

      *Thread Reply:* that's interesting. we could prepare integration test for that. 100 cols shouldn't make a difference

      +

      *Thread Reply:* that's interesting. we could prepare integration test for that. 100 cols shouldn't make a difference

      @@ -121273,7 +121275,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:15:37
      -

      *Thread Reply:* honestly sorry for typo its 1000 columns

      +

*Thread Reply:* honestly sorry for the typo, it's 1000 columns

      @@ -121299,7 +121301,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:15:44
      -

      *Thread Reply:* pivoting features

      +

      *Thread Reply:* pivoting features

      @@ -121325,7 +121327,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:16:09
      -

      *Thread Reply:* i check it works good for small numbers of columns

      +

*Thread Reply:* i checked it works well for small numbers of columns

      @@ -121351,7 +121353,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:16:39
      -

      *Thread Reply:* if it's 1000, then maybe we're over event size - event is too large and backend can't accept that

      +

      *Thread Reply:* if it's 1000, then maybe we're over event size - event is too large and backend can't accept that

      @@ -121377,7 +121379,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:17:06
      -

      *Thread Reply:* maybe debug logs could tell us something

      +

      *Thread Reply:* maybe debug logs could tell us something

      @@ -121403,7 +121405,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:19:27
      -

      *Thread Reply:* i’ll do spark.sparkContext.setLogLevel("DEBUG") ing

      +

      *Thread Reply:* i’ll do spark.sparkContext.setLogLevel("DEBUG") ing

      @@ -121429,7 +121431,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:19:30
      -

      *Thread Reply:* are there any errors in the logs? perhaps pivoting uses contains nodes in SparkPlan that we don't support yet

      +

*Thread Reply:* are there any errors in the logs? perhaps pivoting uses nodes in SparkPlan that we don't support yet

      @@ -121455,7 +121457,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:19:52
      -

      *Thread Reply:* did you check pivoting that results in less columns?

      +

      *Thread Reply:* did you check pivoting that results in less columns?

      @@ -121481,7 +121483,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:20:33
      -

      *Thread Reply:* @추호관 would also be good to disable logicalPlan facet: +

      *Thread Reply:* @추호관 would also be good to disable logicalPlan facet: spark.openlineage.facets.disabled: [spark_unknown;spark.logicalPlan] in spark conf

      @@ -121509,7 +121511,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:23:40
      -

      *Thread Reply:* got it can’t we do in python config +

*Thread Reply:* got it can’t we do in python config
.config("spark.dynamicAllocation.enabled", "true") \
.config("spark.dynamicAllocation.initialExecutors", "5") \
.config("spark.openlineage.facets.disabled", [spark_unknown;spark.logicalPlan]

      @@ -121538,7 +121540,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:24:31
      -

      *Thread Reply:* .config("spark.dynamicAllocation.enabled", "true") \ +

      *Thread Reply:* .config("spark.dynamicAllocation.enabled", "true") \ .config("spark.dynamicAllocation.initialExecutors", "5") \ .config("spark.openlineage.facets.disabled", "[spark_unknown;spark.logicalPlan]"

      @@ -121566,7 +121568,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:24:42
      -

      *Thread Reply:* ah.. string got it

      +

*Thread Reply:* ah.. a string, got it
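
For clarity, a minimal sketch of the corrected conf above inside a full SparkSession builder; it assumes the openlineage-spark jar is already on the classpath and the listener registration is part of your setup:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pivot-lineage-test")  # hypothetical app name
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.initialExecutors", "5")
    # note: one plain string, not a Python list
    .config("spark.openlineage.facets.disabled", "[spark_unknown;spark.logicalPlan]")
    .getOrCreate()
)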

      @@ -121592,7 +121594,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:36:03
      -

      *Thread Reply:* ah… there are no errors nor debug level issue successfully Registered listener ìo.openlineage.spark.agent.OpenLineageSparkListener

      +

*Thread Reply:* ah… there are no errors nor debug-level issues, just successfully Registered listener io.openlineage.spark.agent.OpenLineageSparkListener

      @@ -121618,7 +121620,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:39:40
      -

      *Thread Reply: maybe df.groupBy(some column).pivot(some_column).agg(*agg_cols) is not supported

      +

      *Thread Reply:* maybe df.groupBy(some column).pivot(some_column).agg(**agg_cols) is not supported

      @@ -121644,7 +121646,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:43:44
      -

      *Thread Reply:* oh.. interesting spark.openlineage.facets.disabled this option gives me output when eventType is START +

*Thread Reply:* oh.. interesting spark.openlineage.facets.disabled this option gives me output when eventType is START
“eventType”: “START”
“outputs”: [
…

@@ -121676,7 +121678,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:54:13
      -

      *Thread Reply:* Yes +

*Thread Reply:* Yes
"spark.openlineage.facets.disabled", "[spark_unknown;spark.logicalPlan]" <- this option gives output when eventType is START, but there is no output for the job with bunches of columns when that config is not set

      @@ -121703,7 +121705,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:55:18
      -

      *Thread Reply:* this option prevents logicalPlan being serialized and sent as a part of Openlineage event which included in one of the facets

      +

*Thread Reply:* this option prevents the logicalPlan from being serialized and sent as part of the OpenLineage event, where it is included in one of the facets

      @@ -121729,7 +121731,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:56:12
      -

      *Thread Reply:* possibly, serializing logicalPlans, in case of pivots, leads to size of the events that are not acceptable

      +

*Thread Reply:* possibly, serializing logicalPlans, in the case of pivots, leads to event sizes that are not acceptable

      @@ -121755,7 +121757,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:57:56
      -

      *Thread Reply:* Ah… so you mean pivot makes serializing logical plan not availble for generating event because of size. +

*Thread Reply:* Ah… so you mean the pivot makes the serialized logical plan too large for the event to be generated, and disabling the logical plan facet makes the event possible to generate because the pivot’s logical plan is then not serialized.

Can we overcome this?

      @@ -121784,7 +121786,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:58:48
      -

      *Thread Reply:* we've seen such issues for some plans some time ago

      +

      *Thread Reply:* we've seen such issues for some plans some time ago

      @@ -121814,7 +121816,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:59:29
      -

      *Thread Reply:* oh…. how did you solve it?

      +

      *Thread Reply:* oh…. how did you solve it?

      @@ -121840,7 +121842,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:59:32
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/c3a5211f919c01870a7f79f48588177a9b[…]io/openlineage/spark/agent/lifecycle/LogicalPlanSerializer.java

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/c3a5211f919c01870a7f79f48588177a9b[…]io/openlineage/spark/agent/lifecycle/LogicalPlanSerializer.java

      @@ -121895,7 +121897,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:59:51
      -

      *Thread Reply:* by excluding some properties from plan to be serialized

      +

      *Thread Reply:* by excluding some properties from plan to be serialized

      @@ -121921,7 +121923,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 06:01:14
      -

      *Thread Reply:* here https://github.com/OpenLineage/OpenLineage/blob/c3a5211f919c01870a7f79f48588177a9b[…]io/openlineage/spark/agent/lifecycle/LogicalPlanSerializer.java we exclude certain classes

      +

      *Thread Reply:* here https://github.com/OpenLineage/OpenLineage/blob/c3a5211f919c01870a7f79f48588177a9b[…]io/openlineage/spark/agent/lifecycle/LogicalPlanSerializer.java we exclude certain classes

      @@ -121976,7 +121978,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 06:02:00
      -

      *Thread Reply:* AH…. excluded properties cause ignore logical plan’s of pivointing

      +

*Thread Reply:* AH…. so excluded properties cause the logical plan of pivoting to be ignored

      @@ -122002,7 +122004,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 06:08:25
      -

      *Thread Reply:* you can start with writing a failing test here -> https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]/openlineage/spark/agent/lifecycle/SparkReadWriteIntegTest.java

      +

      *Thread Reply:* you can start with writing a failing test here -> https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]/openlineage/spark/agent/lifecycle/SparkReadWriteIntegTest.java

then you can try to debug the logical plan, trying to find out what should be excluded from it when it's being serialized. Even if you find this difficult, a failing integration test is super helpful to let others help you with that.

      Marquez as an OpenLineage Client
      2023-08-08 06:24:54
      -

      *Thread Reply:* okay i would look into and maybe pr thanks

      +

*Thread Reply:* okay, i would look into it and maybe open a PR, thanks

      @@ -122258,7 +122260,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 06:38:45
      -

      *Thread Reply:* Can I ask if there are any suspicious properties?

      +

      *Thread Reply:* Can I ask if there are any suspicious properties?

      @@ -122284,7 +122286,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 06:39:25
      -

      *Thread Reply:* sure

      +

      *Thread Reply:* sure

      @@ -122318,7 +122320,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 07:10:40
      -

      *Thread Reply:* Thanks I would also try to find the property too

      +

      *Thread Reply:* Thanks I would also try to find the property too

      @@ -122370,7 +122372,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 07:49:55
      -

      *Thread Reply:* Hi @Anirudh Shrinivason, +

*Thread Reply:* Hi @Anirudh Shrinivason, I think I would take a look at this https://sqlglot.com/sqlglot/diff.html
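
A minimal sketch of what that diff API looks like (pip install sqlglot); the queries are illustrative:

from sqlglot import diff, parse_one

source = parse_one("SELECT a, b FROM t")
target = parse_one("SELECT a, b, c FROM t")

# diff() returns an edit script of Insert/Remove/Move/Update/Keep nodes
# describing how to turn the source AST into the target AST.
for change in diff(source, target):
    print(change)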

      @@ -122398,7 +122400,7 @@

      Marquez as an OpenLineage Client

      2023-08-09 23:12:37
      -

      *Thread Reply:* Hey @Guy Biecher Yeah I was looking at this... but it seems to calculate similarity from a more textual context, as opposed to a more semantic one... +

*Thread Reply:* Hey @Guy Biecher Yeah I was looking at this... but it seems to calculate similarity from a more textual context, as opposed to a more semantic one... eg: SELECT ** FROM TABLE_1 and SELECT col1,col2,col3 FROM TABLE_1 could be the same semantic query, but sqlglot would give diffs in the ast because it's textual...

      @@ -122425,7 +122427,7 @@

      Marquez as an OpenLineage Client

      2023-08-10 02:26:51
      -

      *Thread Reply:* I totally get you. In such cases without the metadata of the TABLE_1, it's impossible what I would do I would replace all ** before you use the diff function.

      +

*Thread Reply:* I totally get you. In such cases, without the metadata of TABLE_1, it's impossible. What I would do is replace all ** before you use the diff function.
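
A sketch of that pre-processing idea using sqlglot's optimizer, assuming the table metadata is available as a schema dict; optimize() expands the star so both forms should normalize to the same AST:

from sqlglot import parse_one
from sqlglot.optimizer import optimize

# hypothetical schema for TABLE_1
schema = {"table_1": {"col1": "INT", "col2": "INT", "col3": "INT"}}

q1 = optimize(parse_one("SELECT * FROM table_1"), schema=schema)
q2 = optimize(parse_one("SELECT col1, col2, col3 FROM table_1"), schema=schema)

print(q1.sql() == q2.sql())  # should print True once the star is expanded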

      @@ -122451,7 +122453,7 @@

      Marquez as an OpenLineage Client

      2023-08-10 07:04:37
      -

      *Thread Reply:* Yeah I was thinking about the same... But the more nested and complex your queries get, the harder it'll become to accurately pre-process before running the ast diff too... +

*Thread Reply:* Yeah I was thinking about the same... But the more nested and complex your queries get, the harder it'll become to accurately pre-process before running the ast diff too... But yeah that's probably the approach I'd be taking haha... Happy to discuss and learn if there are better ways to do this

      @@ -122555,14 +122557,14 @@

      Marquez as an OpenLineage Client

      "$ref": "#/$defs/Job" }, "inputs": { - "description": "The set of <strong><em>*input</em><em></strong> datasets.", + "description": "The set of <strong><em>*input</em></strong>* datasets.", "type": "array", "items": { "$ref": "#/$defs/InputDataset" } }, "outputs": { - "description": "The set of <strong></em><em>output</em><em></strong> datasets.", + "description": "The set of <strong><em>*output</em></strong>* datasets.", "type": "array", "items": { "$ref": "#/$defs/OutputDataset" @@ -122598,14 +122600,14 @@

      Marquez as an OpenLineage Client

      "$ref": "#/$defs/Job" }, "inputs": { - "description": "The set of <strong></em><em>input</em><em></strong> datasets.", + "description": "The set of <strong><em>*input</em></strong>* datasets.", "type": "array", "items": { "$ref": "#/$defs/InputDataset" } }, "outputs": { - "description": "The set of <strong></em><em>output</em>*</strong> datasets.", + "description": "The set of <strong><em>*output</em></strong>* datasets.", "type": "array", "items": { "$ref": "#/$defs/OutputDataset" @@ -122824,7 +122826,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 09:53:05
      -

      *Thread Reply:* Hey @Luigi Scorzato! +

      *Thread Reply:* Hey @Luigi Scorzato! You can read about these 2 event types in this blog post: https://openlineage.io/blog/static-lineage

      @@ -122876,7 +122878,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 09:53:38
      -

      *Thread Reply:* we’ll work on getting the documentation improved to clarify the expected use cases for each event type. this is a relatively new addition to the spec.

      +

      *Thread Reply:* we’ll work on getting the documentation improved to clarify the expected use cases for each event type. this is a relatively new addition to the spec.

      @@ -122906,7 +122908,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 10:08:28
      -

      *Thread Reply:* this sounds relevant for my 3rd question, doesn't it? But I do not see scheduling information among the use cases, am I wrong?

      +

      *Thread Reply:* this sounds relevant for my 3rd question, doesn't it? But I do not see scheduling information among the use cases, am I wrong?

      @@ -122932,7 +122934,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 11:16:39
      -

      *Thread Reply:* you’re not wrong, these 2 events were not designed for runtime lineage, but rather “static” lineage that gets emitted after the fact

      +

      *Thread Reply:* you’re not wrong, these 2 events were not designed for runtime lineage, but rather “static” lineage that gets emitted after the fact

      @@ -122984,7 +122986,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 09:54:40
      -

      *Thread Reply:* great question! did you get a chance to look at the current Flink integration?

      +

      *Thread Reply:* great question! did you get a chance to look at the current Flink integration?

      @@ -123010,7 +123012,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 10:07:06
      -

      *Thread Reply:* to be honest, I only quickly went through this and I did not identfy what I needed. Can you please point me to the relevant section?

      +

*Thread Reply:* to be honest, I only quickly went through this and I did not identify what I needed. Can you please point me to the relevant section?

      @@ -123057,7 +123059,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 11:13:17
      -

      *Thread Reply:* here’s an example START event for Flink: https://github.com/OpenLineage/OpenLineage/blob/main/integration/flink/src/test/resources/events/expected_kafka.json

      +

      *Thread Reply:* here’s an example START event for Flink: https://github.com/OpenLineage/OpenLineage/blob/main/integration/flink/src/test/resources/events/expected_kafka.json

      @@ -123181,7 +123183,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 11:13:26
      -

      *Thread Reply:* or a checkpoint (RUNNING) event: https://github.com/OpenLineage/OpenLineage/blob/main/integration/flink/src/test/resources/events/expected_kafka_checkpoints.json

      +

      *Thread Reply:* or a checkpoint (RUNNING) event: https://github.com/OpenLineage/OpenLineage/blob/main/integration/flink/src/test/resources/events/expected_kafka_checkpoints.json

      @@ -123314,7 +123316,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 11:15:55
      -

      *Thread Reply:* generally speaking, you can see the execution contexts that invoke generation of OL events here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/flink/src/main/ja[…]/openlineage/flink/visitor/lifecycle/FlinkExecutionContext.java

      +

      *Thread Reply:* generally speaking, you can see the execution contexts that invoke generation of OL events here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/flink/src/main/ja[…]/openlineage/flink/visitor/lifecycle/FlinkExecutionContext.java

      @@ -123533,7 +123535,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 17:46:17
      -

      *Thread Reply:* thank you! So, if I understand correctly, the key is that even eventType=START, admits an output datasets. Correct? What determines how often are the eventType=RUNNING emitted?

      +

*Thread Reply:* thank you! So, if I understand correctly, the key is that even eventType=START admits output datasets. Correct? What determines how often the eventType=RUNNING events are emitted?

      @@ -123563,7 +123565,7 @@

      Marquez as an OpenLineage Client

      2023-08-09 03:25:16
      -

      *Thread Reply:* now I see, RUNNING events are emitted on onJobCheckpoint

      +

      *Thread Reply:* now I see, RUNNING events are emitted on onJobCheckpoint

      @@ -123615,7 +123617,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 11:10:47
      -

      *Thread Reply:* For Airflow, this is part of the AirflowRunFacet, here: https://github.com/OpenLineage/OpenLineage/blob/81372ca2bc2afecab369eab4a54cc6380dda49d0/integration/airflow/facets/AirflowRunFacet.json#L100

      +

      *Thread Reply:* For Airflow, this is part of the AirflowRunFacet, here: https://github.com/OpenLineage/OpenLineage/blob/81372ca2bc2afecab369eab4a54cc6380dda49d0/integration/airflow/facets/AirflowRunFacet.json#L100

      For other orchestrators / schedulers, that would depend..

      @@ -123673,7 +123675,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 12:15:40
      -

      *Thread Reply:* I might be wrong, but I believe it's unique for each cluster - the common part is dbfs\.

      +

      *Thread Reply:* I might be wrong, but I believe it's unique for each cluster - the common part is dbfs\.

      @@ -123699,7 +123701,7 @@

      Marquez as an OpenLineage Client

      2023-08-09 02:38:54
      -

      *Thread Reply:* dbfs is mounted to a databricks workspace which can run multiple clusters. so i think, it's common.

      +

      *Thread Reply:* dbfs is mounted to a databricks workspace which can run multiple clusters. so i think, it's common.

      Worth mentioning: init-scripts located in dbfs are becoming deprecated next month and we plan moving them into workspaces.

      @@ -123731,7 +123733,7 @@

      Marquez as an OpenLineage Client

      2023-08-11 01:33:24
      -

      *Thread Reply:* yes, the init scripts are moved at workspace level.

      +

      *Thread Reply:* yes, the init scripts are moved at workspace level.

      @@ -123846,7 +123848,7 @@

      Marquez as an OpenLineage Client

      2023-08-09 02:35:28
      -

      *Thread Reply:* great to hear. I still need some time as there are few corner cases. For example: what should be the behaviour when alter table rename is called 😉 But sure, you can test it if you like. ci is failing on integration tests but ./gradlew clean build with unit tests are fine.

      +

*Thread Reply:* great to hear. I still need some time as there are a few corner cases. For example: what should be the behaviour when alter table rename is called 😉 But sure, you can test it if you like. ci is failing on integration tests but ./gradlew clean build with unit tests is fine.

      @@ -123876,7 +123878,7 @@

      Marquez as an OpenLineage Client

      2023-08-10 03:33:50
      -

      *Thread Reply:* @GitHubOpenLineageIssues Feel invited to join todays community and advocate for the importance of this issue. Such discussions are extremely helpful in prioritising backlog the right way.

      +

*Thread Reply:* @GitHubOpenLineageIssues Feel invited to join today’s community meeting and advocate for the importance of this issue. Such discussions are extremely helpful in prioritising the backlog the right way.

      @@ -123929,7 +123931,7 @@

      Marquez as an OpenLineage Client

      2023-08-09 07:55:28
      -

      *Thread Reply:* Query: +

*Thread Reply:* Query:
INSERT INTO test.merchant_md
(Select m.`id`,

@@ -123968,7 +123970,7 @@

      Marquez as an OpenLineage Client

      2023-08-09 08:01:56
      -

      *Thread Reply:* "columnLineage":{ +

      *Thread Reply:* "columnLineage":{ "_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.30.1/integration/spark>", "_schemaURL":"<https://openlineage.io/spec/facets/1-0-1/ColumnLineageDatasetFacet.json#/$defs/ColumnLineageDatasetFacet>", "fields":{ @@ -124062,7 +124064,7 @@

      Marquez as an OpenLineage Client

      2023-08-09 08:23:57
      -

      *Thread Reply:* "contact_name":{ +

      *Thread Reply:* "contact_name":{ "inputFields":[ { "namespace":"<s3a://datalake>", @@ -124151,7 +124153,7 @@

      Marquez as an OpenLineage Client

      2023-08-09 11:23:42
      -

      *Thread Reply:* > OL-spark version to 0.6.+ +

      *Thread Reply:* > OL-spark version to 0.6.+ This OL version is ancient. You can try with 1.0.0

      I think you're hitting this issue which duplicates jobs: https://github.com/OpenLineage/OpenLineage/issues/1943

      @@ -124216,7 +124218,7 @@

      Marquez as an OpenLineage Client

      2023-08-10 01:46:08
      -

      *Thread Reply:* I haven’t mentioned that I tried multiple OL versions - 1.0.0 / 0.30.1 / 0.6.+ … +

      *Thread Reply:* I haven’t mentioned that I tried multiple OL versions - 1.0.0 / 0.30.1 / 0.6.+ … None of them worked for me. @Maciej Obuchowski

      @@ -124244,7 +124246,7 @@

      Marquez as an OpenLineage Client

      2023-08-10 05:25:49
      -

      *Thread Reply:* @Zahi Fail understood. Can you provide sample job that reproduces this behavior, and possibly some logs?

      +

      *Thread Reply:* @Zahi Fail understood. Can you provide sample job that reproduces this behavior, and possibly some logs?

      @@ -124270,7 +124272,7 @@

      Marquez as an OpenLineage Client

      2023-08-10 05:26:11
      -

      *Thread Reply:* If you can, it might be better to create issue at github and communicate there.

      +

*Thread Reply:* If you can, it might be better to create an issue on GitHub and communicate there.

      @@ -124296,7 +124298,7 @@

      Marquez as an OpenLineage Client

      2023-08-10 08:34:01
      -

      *Thread Reply:* Before creating an issue in GIT, I wanted to check if my issue only related to versions compatibility..

      +

*Thread Reply:* Before creating an issue on GitHub, I wanted to check if my issue is only related to version compatibility…

This is the sample of my test:
```from pyspark.sql import SparkSession

@@ -124360,7 +124362,7 @@

      csv_file = location.csv

      2023-08-10 08:40:13
      -

      *Thread Reply:* try spark.openlineage.transport.url instead of spark.openlineage.host

      +

      *Thread Reply:* try spark.openlineage.transport.url instead of spark.openlineage.host

      @@ -124386,7 +124388,7 @@

      csv_file = location.csv

      2023-08-10 08:40:27
      -

      *Thread Reply:* and possibly link the doc where you've seen spark.openlineage.host 🙂

      +

      *Thread Reply:* and possibly link the doc where you've seen spark.openlineage.host 🙂

      @@ -124412,7 +124414,7 @@

      csv_file = location.csv

      2023-08-10 08:59:27
      -

      *Thread Reply:* https://openlineage.io/blog/openlineage-spark/

      +

      *Thread Reply:* https://openlineage.io/blog/openlineage-spark/

      @@ -124472,7 +124474,7 @@

      csv_file = location.csv

      2023-08-10 09:04:56
      -

      *Thread Reply:* changing to “spark.openlineage.transport.url” didn’t make any change

      +

      *Thread Reply:* changing to “spark.openlineage.transport.url” didn’t make any change

      @@ -124498,7 +124500,7 @@

      csv_file = location.csv

      2023-08-10 09:09:42
      -

      *Thread Reply:* do you see the ConsoleTransport log? it suggests Spark integration did not register that you want to send events to Marquez

      +

      *Thread Reply:* do you see the ConsoleTransport log? it suggests Spark integration did not register that you want to send events to Marquez

      @@ -124524,7 +124526,7 @@

      csv_file = location.csv

      2023-08-10 09:10:09
      -

      *Thread Reply:* let's try adding spark.openlineage.transport.type to http

      +

      *Thread Reply:* let's try adding spark.openlineage.transport.type to http

      @@ -124550,7 +124552,7 @@

      csv_file = location.csv

      2023-08-10 09:14:50
      -

      *Thread Reply:* Now it works !

      +

      *Thread Reply:* Now it works !

      @@ -124576,7 +124578,7 @@

      csv_file = location.csv

      2023-08-10 09:14:58
      -

      *Thread Reply:* thanks @Maciej Obuchowski

      +

      *Thread Reply:* thanks @Maciej Obuchowski

      @@ -124602,7 +124604,7 @@

      csv_file = location.csv

      2023-08-10 09:23:04
      -

      *Thread Reply:* Cool 🙂 however it should not require it if you provide spark.openlineage.transport.url - I'll create issue for debugging that.

      +

*Thread Reply:* Cool 🙂 however it should not require it if you provide spark.openlineage.transport.url - I'll create an issue for debugging that.
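
For reference, a minimal sketch of the combination that ended up working in this thread; the url and namespace are placeholders:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ol-marquez-test")  # hypothetical app name
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")  # the setting that unblocked things here
    .config("spark.openlineage.transport.url", "http://localhost:5000")
    .config("spark.openlineage.namespace", "spark_integration")
    .getOrCreate()
)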

      @@ -124724,7 +124726,7 @@

      csv_file = location.csv

      2023-08-10 02:55:46
      -

      *Thread Reply:* Let me first rephrase my understanding of the question assume a user runs spark.sql('INSERT INTO ...'). Are we able to include sql queryINSERT INTO ...within SQL facet?

      +

*Thread Reply:* Let me first rephrase my understanding of the question: assume a user runs spark.sql('INSERT INTO ...'). Are we able to include the sql query INSERT INTO ... within the SQL facet?

      We once had a look at it and found it difficult. Given an SQL, spark immediately translates it to a logical plan (which our integration is based on) and we didn't find any place where we could inject our code and get access to sql being run.

      @@ -124752,7 +124754,7 @@

      csv_file = location.csv

      2023-08-10 04:27:51
      -

      *Thread Reply:* Got it. So for spark.sql() - there's no interaction with sqlparser-rs and we directly try stitching the input/output & column lineage from the spark logical plan. Would something like this fall under the spark.jdbc() route or the spark.sql() route (say, if the df is collected / written somewhere)?

      +

      *Thread Reply:* Got it. So for spark.sql() - there's no interaction with sqlparser-rs and we directly try stitching the input/output & column lineage from the spark logical plan. Would something like this fall under the spark.jdbc() route or the spark.sql() route (say, if the df is collected / written somewhere)?

      val df = spark.read.format("jdbc") .option("url", url) @@ -124785,7 +124787,7 @@

      csv_file = location.csv

      2023-08-10 05:15:17
      -

      *Thread Reply:* @Athitya Kumar I understand your issue. From my side, there's one problem with this - potentially there can be multiple queries for one spark job. You can imagine something like joining results of two queries - possible to separate systems - and then one SqlJobFacet would be misleading. This needs more thorough spec discussion

      +

*Thread Reply:* @Athitya Kumar I understand your issue. From my side, there's one problem with this - potentially there can be multiple queries for one spark job. You can imagine something like joining results of two queries - possibly from separate systems - and then one SqlJobFacet would be misleading. This needs more thorough spec discussion

      @@ -124871,7 +124873,7 @@

      csv_file = location.csv

      2023-08-10 05:42:20
      -

      *Thread Reply:* I'm using python OpenLineage package and extend the BaseFacet class

      +

*Thread Reply:* I'm using the python OpenLineage package and extending the BaseFacet class

      @@ -124897,7 +124899,7 @@

      csv_file = location.csv

      2023-08-10 05:53:57
      -

      *Thread Reply:* for custom facets, as long as it's valid json - go for it

      +

      *Thread Reply:* for custom facets, as long as it's valid json - go for it

      @@ -124923,7 +124925,7 @@

      csv_file = location.csv

      2023-08-10 05:55:03
      -

      *Thread Reply:* However I tried to insert a list of string. And I tried to get the dataset, the returned valued of that list field is empty.

      +

*Thread Reply:* However, I tried to insert a list of strings, and when I tried to get the dataset, the returned value of that list field was empty.

      @@ -124949,7 +124951,7 @@

      csv_file = location.csv

      2023-08-10 05:55:57
      -

      *Thread Reply:* @attr.s +

*Thread Reply:* @attr.s
class MyFacet(BaseFacet):
    columns: list[str] = attr.ib()
Here's my python code.

      @@ -124978,7 +124980,7 @@

      csv_file = location.csv

      2023-08-10 05:59:02
      -

      *Thread Reply:* How did you emit, serialized the event, and where did you look when you said you tried to get the dataset?

      +

*Thread Reply:* How did you emit and serialize the event, and where did you look when you said you tried to get the dataset?

      @@ -125004,7 +125006,7 @@

      csv_file = location.csv

      2023-08-10 06:00:27
      -

      *Thread Reply:* I assume the problem is somewhere there, not on the level of facet definition, since SchemaDatasetFacet looks pretty much the same and it works

      +

      *Thread Reply:* I assume the problem is somewhere there, not on the level of facet definition, since SchemaDatasetFacet looks pretty much the same and it works

      @@ -125039,7 +125041,7 @@

      csv_file = location.csv

      2023-08-10 06:00:54
      -

      *Thread Reply:* I use the python openlineage client to emit the RunEvent. +

*Thread Reply:* I use the python openlineage client to emit the RunEvent.
openlineage_client.emit(
    RunEvent(
        eventType=RunState.COMPLETE,

@@ -125076,7 +125078,7 @@
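
A self-contained sketch of the pattern in this thread - a custom facet holding a list of strings, attached to a run and emitted with the python client; the facet name, run id, namespace and producer are illustrative:

import uuid
from datetime import datetime, timezone

import attr
from openlineage.client import OpenLineageClient
from openlineage.client.facet import BaseFacet
from openlineage.client.run import Job, Run, RunEvent, RunState

@attr.s
class MyFacet(BaseFacet):
    # a facet field holding a plain list of strings, as above
    columns: list = attr.ib()

client = OpenLineageClient(url="http://localhost:5000")  # placeholder backend url

client.emit(
    RunEvent(
        eventType=RunState.COMPLETE,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(
            runId=str(uuid.uuid4()),
            facets={"my_facet": MyFacet(columns=["id", "name", "email"])},
        ),
        job=Job(namespace="example-namespace", name="example-job"),
        producer="https://example.com/my-producer",  # placeholder producer URI
    )
)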

      csv_file = location.csv

      2023-08-10 06:02:12
      -

      *Thread Reply:* Yah, list of objects is working, but list of string is not.😩

      +

      *Thread Reply:* Yah, list of objects is working, but list of string is not.😩

      @@ -125102,7 +125104,7 @@

      csv_file = location.csv

      2023-08-10 06:03:23
      -

      *Thread Reply:* I think the problem is related to the openlineage package openlineage.client.serde.py. The function Serde.to_json()

      +

      *Thread Reply:* I think the problem is related to the openlineage package openlineage.client.serde.py. The function Serde.to_json()

      @@ -125128,7 +125130,7 @@

      csv_file = location.csv

      2023-08-10 06:05:56
      -

      *Thread Reply:*

      +

      *Thread Reply:*

      @@ -125163,7 +125165,7 @@

      csv_file = location.csv

      2023-08-10 06:19:34
      -

      *Thread Reply:* I think the code here filters out those string values in the list

      +

      *Thread Reply:* I think the code here filters out those string values in the list

      @@ -125198,7 +125200,7 @@

      csv_file = location.csv

      2023-08-10 06:21:39
      -

      *Thread Reply:* 👀

      +

      *Thread Reply:* 👀

      @@ -125224,7 +125226,7 @@

      csv_file = location.csv

      2023-08-10 06:24:48
      -

*Thread Reply:* Yah, the value in list will end up False in this code and be filtered out

+

*Thread Reply:* Yah, the value in list will end up False in this code and be filtered out
isinstance(x, dict)

😳
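
To make the bug concrete, here is a hedged, simplified re-implementation of the filtering being described (not the actual Serde code, just the shape of the problem):

def filter_dict(obj):
    # keep dict values recursively, but in lists keep only dicts -
    # plain strings fail the isinstance check and get dropped
    if isinstance(obj, dict):
        return {k: filter_dict(v) for k, v in obj.items() if v is not None}
    if isinstance(obj, list):
        return [filter_dict(x) for x in obj if isinstance(x, dict)]
    return obj

print(filter_dict({"columns": ["a", "b"]}))   # prints {'columns': []} - the strings vanish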

      @@ -125253,7 +125255,7 @@

      csv_file = location.csv

      2023-08-10 06:26:33
      -

      *Thread Reply:* wow, that's right 😬

      +

      *Thread Reply:* wow, that's right 😬

      @@ -125279,7 +125281,7 @@

      csv_file = location.csv

      2023-08-10 06:26:47
      -

      *Thread Reply:* want to create PR fixing that?

      +

      *Thread Reply:* want to create PR fixing that?

      @@ -125305,7 +125307,7 @@

      csv_file = location.csv

      2023-08-10 06:27:20
      -

      *Thread Reply:* Sure! May do this later tomorrow.

      +

      *Thread Reply:* Sure! May do this later tomorrow.

      @@ -125335,7 +125337,7 @@

      csv_file = location.csv

      2023-08-10 23:59:28
      -

*Thread Reply:* I created the pr at https://github.com/OpenLineage/OpenLineage/pull/2044

+

*Thread Reply:* I created the pr at https://github.com/OpenLineage/OpenLineage/pull/2044
But the ci on integration-test-integration-spark FAILED

      @@ -125432,7 +125434,7 @@

      csv_file = location.csv

      2023-08-11 04:17:01
      -

      *Thread Reply:* @Steven sorry for that - some tests require credentials that are not present on the forked versions of CI. It will work once I push it to origin. Anyway Spark tests failing aren't blocker for this Python PR

      +

      *Thread Reply:* @Steven sorry for that - some tests require credentials that are not present on the forked versions of CI. It will work once I push it to origin. Anyway Spark tests failing aren't blocker for this Python PR

      @@ -125458,7 +125460,7 @@

      csv_file = location.csv

      2023-08-11 04:17:45
      -

      *Thread Reply:* I would only ask to add some tests for that case with facets containing list of string

      +

      *Thread Reply:* I would only ask to add some tests for that case with facets containing list of string

      @@ -125484,7 +125486,7 @@

      csv_file = location.csv

      2023-08-11 04:18:21
      -

      *Thread Reply:* Yeah sure, I will add them now

      +

      *Thread Reply:* Yeah sure, I will add them now

      @@ -125510,7 +125512,7 @@

      csv_file = location.csv

      2023-08-11 04:25:19
      -

*Thread Reply:* ah we had another CI problem, go version was too old in one of the jobs - nevertheless I won't judge your PR on stuff failing outside your PR anyway 🙂

+

*Thread Reply:* ah we had another CI problem, go version was too old in one of the jobs - nevertheless I won't judge your PR on stuff failing outside your PR anyway 🙂

      @@ -125536,7 +125538,7 @@

      csv_file = location.csv

      2023-08-11 04:36:57
      -

      *Thread Reply:* LOL🤣 I've added some tests and made a force push

      +

      *Thread Reply:* LOL🤣 I've added some tests and made a force push

      @@ -125562,7 +125564,7 @@

      csv_file = location.csv

      2023-10-20 08:31:45
      -

*Thread Reply:* @GitHubOpenLineageIssues

+

*Thread Reply:* @GitHubOpenLineageIssues
I am trying to contribute to Integration tests, which is listed here as a good first issue. The CONTRIBUTING.md mentions that I can trigger CI for integration tests from a forked branch using this tool.

@@ -125603,7 +125605,7 @@

      csv_file = location.csv

      2023-10-23 04:57:44
      -

*Thread Reply:* what PR is the problem related to? I can run git-push-fork-to-upstream-branch for you

+

*Thread Reply:* what PR is the problem related to? I can run git-push-fork-to-upstream-branch for you

      @@ -125629,7 +125631,7 @@

      csv_file = location.csv

      2023-10-25 01:08:41
      -

      *Thread Reply:* @Paweł Leszczyński thanks for approving my PR - ( link )

      +

      *Thread Reply:* @Paweł Leszczyński thanks for approving my PR - ( link )

I will make the changes needed for the new integration test case for drop table (good first issue) in another PR. I would need your help to run the integration tests again, thank you

      @@ -125682,7 +125684,7 @@

      csv_file = location.csv

      2023-10-26 07:48:52
      -

*Thread Reply:* @Paweł Leszczyński

+

*Thread Reply:* @Paweł Leszczyński
opened a PR ( link ) for the integration test for drop table. Can you please help run the integration test?

      csv_file = location.csv
      2023-10-26 07:50:29
      -

      *Thread Reply:* sure, some of our tests require access to S3/BigQuery secret keys, so will not work automatically from the fork, and require action on our side. working on that

      +

      *Thread Reply:* sure, some of our tests require access to S3/BigQuery secret keys, so will not work automatically from the fork, and require action on our side. working on that

      @@ -125800,7 +125802,7 @@

      csv_file = location.csv

      2023-10-29 09:31:22
      -

*Thread Reply:* thanks @Paweł Leszczyński

+

*Thread Reply:* thanks @Paweł Leszczyński
let me know if I can help in any way

      @@ -125827,7 +125829,250 @@

      csv_file = location.csv

      2023-11-15 02:31:50
      -

      *Thread Reply:* @Paweł Leszczyński any action item on my side?

      +

      *Thread Reply:* @Paweł Leszczyński any action item on my side?

savan (SavanSharan_Navalgi@intuit.com)

2023-12-01 02:57:20

      *Thread Reply:* @Paweł Leszczyński can you please take a look at this ? 🙂

Paweł Leszczyński (pawel.leszczynski@getindata.com)

2023-12-01 03:05:05

*Thread Reply:* Hi @savan, were you able to run integration tests locally on your side? It seems the generated OL event is missing the schema facet:
"outputs" : [ {
  "namespace" : "file",
  "name" : "/tmp/drop_test/drop_table_test",
  "facets" : {
    "dataSource" : {
      "_producer" : "<https://github.com/OpenLineage/OpenLineage/tree/1.5.0-SNAPSHOT/integration/spark>",
      "_schemaURL" : "<https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet>",
      "name" : "file",
      "uri" : "file"
    },
    "symlinks" : {
      "_producer" : "<https://github.com/OpenLineage/OpenLineage/tree/1.5.0-SNAPSHOT/integration/spark>",
      "_schemaURL" : "<https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet>",
      "identifiers" : [ {
        "namespace" : "/tmp/drop_test",
        "name" : "default.drop_table_test",
        "type" : "TABLE"
      } ]
    },
    "lifecycleStateChange" : {
      "_producer" : "<https://github.com/OpenLineage/OpenLineage/tree/1.5.0-SNAPSHOT/integration/spark>",
      "_schemaURL" : "<https://openlineage.io/spec/facets/1-0-0/LifecycleStateChangeDatasetFacet.json#/$defs/LifecycleStateChangeDatasetFacet>",
      "lifecycleStateChange" : "DROP"
    }
  },
  "outputFacets" : { }
} ]
which shouldn't be such a big problem I believe. This event intends to notify that the table is dropped, which is still ok I believe without schema.

savan (SavanSharan_Navalgi@intuit.com)

2023-12-01 03:06:40

*Thread Reply:* @Paweł Leszczyński
I am unable to run integration tests locally,
as you mentioned it requires S3/BigQuery secret keys
and won't work from a forked branch

Paweł Leszczyński (pawel.leszczynski@getindata.com)

2023-12-01 03:07:06

      *Thread Reply:* you can run this particular test you modify, don't need to run all of them

savan (SavanSharan_Navalgi@intuit.com)

2023-12-01 03:07:55

*Thread Reply:* can you please share any doc which will help me do that.
I did go through the readme doc,
I was stuck at
> you don't have permission to perform this action

Paweł Leszczyński (pawel.leszczynski@getindata.com)

2023-12-01 03:08:18

      *Thread Reply:* ./gradlew :app:integrationTest --tests io.openlineage.spark.agent.SparkIcebergIntegrationTest.testDropTable

savan (SavanSharan_Navalgi@intuit.com)

2023-12-01 03:08:33

*Thread Reply:* let me try
thanks!

Paweł Leszczyński (pawel.leszczynski@getindata.com)

2023-12-01 03:08:35

      *Thread Reply:* this should run the thing you modify

      @@ -125893,7 +126138,7 @@

      csv_file = location.csv

      2023-08-11 09:18:44
      -

      *Thread Reply:* Hmm... this is kinda "logical" difference - is column level lineage taken from actual "physical" operations - like in this case, we always take from t1 - or from "logical" where t2 is used only for predicate, yet we still want to indicate it as a source?

      +

      *Thread Reply:* Hmm... this is kinda "logical" difference - is column level lineage taken from actual "physical" operations - like in this case, we always take from t1 - or from "logical" where t2 is used only for predicate, yet we still want to indicate it as a source?

      @@ -125919,7 +126164,7 @@

      csv_file = location.csv

      2023-08-11 09:18:58
      -

      *Thread Reply:* I think your interpretation is more useful

      +

      *Thread Reply:* I think your interpretation is more useful

      @@ -125949,7 +126194,7 @@

      csv_file = location.csv

      2023-08-11 09:25:03
      -

      *Thread Reply:* @Maciej Obuchowski - Yup, especially for use-cases where we wanna depend on column lineage for impact analysis, I think we should be considering even predicates. For example, if t2.c1 / t2.c2 gets corrupted or dropped, the query would be impacted - which means that we should be including even predicates (t2.c1 / t2.c2) in the column lineage imo

      +

      *Thread Reply:* @Maciej Obuchowski - Yup, especially for use-cases where we wanna depend on column lineage for impact analysis, I think we should be considering even predicates. For example, if t2.c1 / t2.c2 gets corrupted or dropped, the query would be impacted - which means that we should be including even predicates (t2.c1 / t2.c2) in the column lineage imo

      But is there any technical limitation if we wanna implement this / make an OSS contribution for this (like logical predicate columns not being part of the spark logical plan object that we get in the PlanVisitor or something like that)?

      @@ -125977,7 +126222,7 @@

      csv_file = location.csv

      2023-08-11 11:14:58
      -

      *Thread Reply:* It's probably a bit of work, but can't think it's impossible on parser side - @Paweł Leszczyński will know better about spark collection

      +

      *Thread Reply:* It's probably a bit of work, but can't think it's impossible on parser side - @Paweł Leszczyński will know better about spark collection

      @@ -126003,7 +126248,7 @@

      csv_file = location.csv

      2023-08-11 12:45:34
      -

      *Thread Reply:* This is a case where it would be nice to have an alternate indication (perhaps in the Column lineage facet?) for this type of "suggested" lineage. As noted, this is especially important for impact analysis purposes. We (and I believe others do the same or similar) call that "indirect" lineage at Manta.

      +

      *Thread Reply:* This is a case where it would be nice to have an alternate indication (perhaps in the Column lineage facet?) for this type of "suggested" lineage. As noted, this is especially important for impact analysis purposes. We (and I believe others do the same or similar) call that "indirect" lineage at Manta.

      @@ -126029,7 +126274,7 @@

      csv_file = location.csv

      2023-08-11 12:49:10
      -

      *Thread Reply:* Something like additional flag in inputFields, right?

      +

      *Thread Reply:* Something like additional flag in inputFields, right?

      @@ -126059,7 +126304,7 @@

      csv_file = location.csv

      2023-08-14 02:36:34
      -

*Thread Reply:* Yes, this would require some extension to the spec. What do you mean by spark-sql: spark.sql() with some spark query, or SQL in spark JDBC?

+

*Thread Reply:* Yes, this would require some extension to the spec. What do you mean by spark-sql: spark.sql() with some spark query, or SQL in spark JDBC?

      @@ -126085,7 +126330,7 @@

      csv_file = location.csv

      2023-08-15 15:16:49
      -

      *Thread Reply:* Sorry, missed your question @Paweł Leszczyński. By spark-sql, I'm referring to the former: spark.sql() with some spark query

      +

      *Thread Reply:* Sorry, missed your question @Paweł Leszczyński. By spark-sql, I'm referring to the former: spark.sql() with some spark query

      @@ -126111,7 +126356,7 @@

      csv_file = location.csv

      2023-08-16 03:10:57
      -

      *Thread Reply:* cc @Jens Pfau - you may be also interested in extending column level lineage facet.

      +

      *Thread Reply:* cc @Jens Pfau - you may be also interested in extending column level lineage facet.

      @@ -126137,7 +126382,7 @@

      csv_file = location.csv

      2023-08-22 02:23:08
      -

      *Thread Reply:* Hi, is there a github issue for this feature? Seems like a really cool and exciting functionality to have!

      +

      *Thread Reply:* Hi, is there a github issue for this feature? Seems like a really cool and exciting functionality to have!

      @@ -126163,7 +126408,7 @@

      csv_file = location.csv

      2023-08-22 08:03:49
      -

      *Thread Reply:* @Anirudh Shrinivason - Are you referring to this issue: https://github.com/OpenLineage/OpenLineage/issues/2048?

      +

      *Thread Reply:* @Anirudh Shrinivason - Are you referring to this issue: https://github.com/OpenLineage/OpenLineage/issues/2048?

      @@ -126279,7 +126524,7 @@

      csv_file = location.csv

      2023-08-14 06:00:21
      -

      *Thread Reply:* Not really I think - the integration does not rely purely on the logical plan

      +

      *Thread Reply:* Not really I think - the integration does not rely purely on the logical plan

      @@ -126305,7 +126550,7 @@

      csv_file = location.csv

      2023-08-14 06:00:44
      -

      *Thread Reply:* At least, not in all cases. For some maybe

      +

      *Thread Reply:* At least, not in all cases. For some maybe

      @@ -126331,7 +126576,7 @@

      csv_file = location.csv

      2023-08-14 07:34:39
      -

      *Thread Reply:* We're using pretty similar approach in our column level lineage tests where we run some spark commands, register custom listener https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]eage/spark/agent/util/LastQueryExecutionSparkEventListener.java which catches the logical plan. Further we run our tests on the captured logical plan.

      +

      *Thread Reply:* We're using pretty similar approach in our column level lineage tests where we run some spark commands, register custom listener https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]eage/spark/agent/util/LastQueryExecutionSparkEventListener.java which catches the logical plan. Further we run our tests on the captured logical plan.

      The difference here, between what you're asking about, is that we still have an access to the same spark session.

      @@ -126431,7 +126676,7 @@

      csv_file = location.csv

      2023-08-14 11:03:28
      -

      *Thread Reply:* @Paweł Leszczyński - We're mainly interested to see the inputs/outputs (mainly column schema and column lineage) for different logical plans. Is that something that could be done in a static manner without running spark jobs in your opinion?

      +

      *Thread Reply:* @Paweł Leszczyński - We're mainly interested to see the inputs/outputs (mainly column schema and column lineage) for different logical plans. Is that something that could be done in a static manner without running spark jobs in your opinion?

      For example, I know that we can statically create logical plans

      @@ -126459,7 +126704,7 @@

      csv_file = location.csv

      2023-08-16 03:05:44
      -

      *Thread Reply:* The more we talk the more I am wondering what is the purpose of doing so? Do you want to test openlineage coverage or is there any production scenario where you would like to apply this?

      +

      *Thread Reply:* The more we talk the more I am wondering what is the purpose of doing so? Do you want to test openlineage coverage or is there any production scenario where you would like to apply this?

      @@ -126485,7 +126730,7 @@

      csv_file = location.csv

      2023-08-16 04:01:39
      -

      *Thread Reply:* @Paweł Leszczyński - This is for testing openlineage coverage so that we can be more confident on what're the happy path scenarios and what're the scenarios where it may not work / work partially etc

      +

      *Thread Reply:* @Paweł Leszczyński - This is for testing openlineage coverage so that we can be more confident on what're the happy path scenarios and what're the scenarios where it may not work / work partially etc

      @@ -126511,7 +126756,7 @@

      csv_file = location.csv

      2023-08-16 04:22:01
      -

*Thread Reply:* If this is for testing, then you're also capable of mocking some SparkSession/catalog methods when the Openlineage integration tries to access them. If you want to reuse LogicalPlans from your prod environment, you will encounter logicalplan serialization issues. On the other hand, if you generate logical plans from some example Spark jobs, then the same can be achieved more easily the way the integration tests are run with mockserver.

+

*Thread Reply:* If this is for testing, then you're also capable of mocking some SparkSession/catalog methods when the Openlineage integration tries to access them. If you want to reuse LogicalPlans from your prod environment, you will encounter logicalplan serialization issues. On the other hand, if you generate logical plans from some example Spark jobs, then the same can be achieved more easily the way the integration tests are run with mockserver.

      @@ -126598,7 +126843,7 @@

      csv_file = location.csv

      2023-08-15 01:13:49
      -

      *Thread Reply:* We're uploading the init scripts to s3 via tf. But yeah ig there are some access permissions that the user needs to have

      +

      *Thread Reply:* We're uploading the init scripts to s3 via tf. But yeah ig there are some access permissions that the user needs to have

      @@ -126628,7 +126873,7 @@

      csv_file = location.csv

      2023-08-16 07:32:00
      -

*Thread Reply:* Hello

+

*Thread Reply:* Hello
I am new here and I am asking why you need an init script? If it's a spark integration we can just specify --package=io.openlineage...

      @@ -126656,7 +126901,7 @@

      csv_file = location.csv

      2023-08-16 07:41:25
      -

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/databricks/open-lineage-init-script.sh -> I think the issue was in having the openlineage jar installed immediately on the classpath because it's required when OpenLineageSparkListener is instantiated. It didn't work without it.

+

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/databricks/open-lineage-init-script.sh -> I think the issue was in having the openlineage jar installed immediately on the classpath because it's required when OpenLineageSparkListener is instantiated. It didn't work without it.

      @@ -126725,7 +126970,7 @@

      csv_file = location.csv

      2023-08-16 07:43:55
      -

*Thread Reply:* Yes it happens if you use the --jars s3://.../...openlineage-spark-VERSION.jar parameter. (I made a ticket for this issue in Databricks support)

+

*Thread Reply:* Yes it happens if you use the --jars s3://.../...openlineage-spark-VERSION.jar parameter. (I made a ticket for this issue in Databricks support)
But if you use --package io.openlineage... (the package will be downloaded from maven) it works fine.
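
For anyone following along, the --packages route looks roughly like this (artifact coordinates, version, and config keys are illustrative - check the OpenLineage Spark docs and Maven Central for the current ones; my_job.py stands in for your application):

spark-submit \
  --packages io.openlineage:openlineage-spark:1.0.0 \
  --conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener \
  --conf spark.openlineage.transport.type=console \
  my_job.py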

      @@ -126756,7 +127001,7 @@

      csv_file = location.csv

      2023-08-16 07:47:50
      -

      *Thread Reply:* I think they don't use the right class loader.

      +

      *Thread Reply:* I think they don't use the right class loader.

      @@ -126782,7 +127027,7 @@

      csv_file = location.csv

      2023-08-16 08:36:14
      -

      *Thread Reply:* To make sure: are you able to run Openlineage & Spark on Databricks Runtime without init_scripts?

      +

      *Thread Reply:* To make sure: are you able to run Openlineage & Spark on Databricks Runtime without init_scripts?

      I was doing this a second ago and this ended up with Caused by: java.lang.ClassNotFoundException: io.openlineage.spark.agent.OpenLineageSparkListener not found in com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@1609ed55

      @@ -126845,7 +127090,7 @@

      csv_file = location.csv

      2023-08-15 12:19:34
      -

*Thread Reply:* Ok, nevermind. I figured it out. Port 5000 is reserved on macOS so I had to start on port 9000 instead.

+

*Thread Reply:* Ok, nevermind. I figured it out. Port 5000 is reserved on macOS so I had to start on port 9000 instead.

      @@ -126997,7 +127242,7 @@

      csv_file = location.csv

      2023-08-15 06:48:43
      -

      *Thread Reply:* Would be useful to see generated event or any logs

      +

      *Thread Reply:* Would be useful to see generated event or any logs

      @@ -127023,7 +127268,7 @@

      csv_file = location.csv

      2023-08-16 03:09:05
      -

      *Thread Reply:* @Anirudh Shrinivason what if there is just one union instead of four? What if there are just two columns selected instead of 10? What if inner join is skipped? Does merge into matter?

      +

      *Thread Reply:* @Anirudh Shrinivason what if there is just one union instead of four? What if there are just two columns selected instead of 10? What if inner join is skipped? Does merge into matter?

The smaller the SQL to reproduce the problem, the easier it is to find the root cause. Most of the issues are reproducible with just a few lines of code.

      @@ -127051,7 +127296,7 @@

      csv_file = location.csv

      2023-08-16 03:34:30
      -

*Thread Reply:* Yup let me try to identify the cause from my end. Give me some time haha. I'll reach out again once there is more clarity on the occurrence

+

*Thread Reply:* Yup let me try to identify the cause from my end. Give me some time haha. I'll reach out again once there is more clarity on the occurrence

      @@ -127179,7 +127424,7 @@

      csv_file = location.csv

      2023-08-16 09:24:09
      -

*Thread Reply:* thanks @Abdallah for the thoughtful issue that you submitted!

+

*Thread Reply:* thanks @Abdallah for the thoughtful issue that you submitted!
was wondering if you’d consider opening up a PR? would love to help you as a contributor if that’s something you are interested in.

      @@ -127206,7 +127451,7 @@

      csv_file = location.csv

      2023-08-17 11:59:51
      -

      *Thread Reply:* Hello

      +

      *Thread Reply:* Hello

      @@ -127232,7 +127477,7 @@

      csv_file = location.csv

      2023-08-17 11:59:58
      -

      *Thread Reply:* Yes I am working on it

      +

      *Thread Reply:* Yes I am working on it

      @@ -127258,7 +127503,7 @@

      csv_file = location.csv

      2023-08-17 12:00:14
      -

      *Thread Reply:* I deleted the line that has that filter.

      +

      *Thread Reply:* I deleted the line that has that filter.

      @@ -127284,7 +127529,7 @@

      csv_file = location.csv

      2023-08-17 12:00:24
      -

      *Thread Reply:* I am adding some tests now

      +

      *Thread Reply:* I am adding some tests now

      @@ -127310,7 +127555,7 @@

      csv_file = location.csv

      2023-08-17 12:00:45
      -

*Thread Reply:* But running

+

*Thread Reply:* But running
./gradlew --no-daemon databricksIntegrationTest -x test -Pspark.version=3.4.0 -PdatabricksHost=$DATABRICKS_HOST -PdatabricksToken=$DATABRICKS_TOKEN

      @@ -127337,7 +127582,7 @@

      csv_file = location.csv

      2023-08-17 12:01:11
      -

*Thread Reply:* gives me

+

*Thread Reply:* gives me
A problem occurred evaluating project ':app'.
> Could not resolve all files for configuration ':app:spark33'.
> Could not resolve io.openlineage:openlineage-java:1.1.0-SNAPSHOT.

@@ -127377,7 +127622,7 @@

      csv_file = location.csv

      2023-08-17 12:01:25
      -

      *Thread Reply:* And I am trying to understand what should I do.

      +

      *Thread Reply:* And I am trying to understand what should I do.

      @@ -127403,7 +127648,7 @@

      csv_file = location.csv

      2023-08-17 12:13:37
      -

      *Thread Reply:* I am compiling sql integration

      +

      *Thread Reply:* I am compiling sql integration

      @@ -127429,7 +127674,7 @@

      csv_file = location.csv

      2023-08-17 13:04:15
      -

      *Thread Reply:* I built the java client

      +

      *Thread Reply:* I built the java client

      @@ -127455,7 +127700,7 @@

      csv_file = location.csv

      2023-08-17 13:04:29
      -

*Thread Reply:* but having

+

*Thread Reply:* but having
A problem occurred evaluating project ':app'.
> Could not resolve all files for configuration ':app:spark33'.
> Could not resolve io.openlineage:openlineage-java:1.1.0-SNAPSHOT.

@@ -127495,7 +127740,7 @@

      csv_file = location.csv

      2023-08-17 14:47:41
      -

      *Thread Reply:* Please do ./gradlew publishToMavenLocal in client/java directory

      +

      *Thread Reply:* Please do ./gradlew publishToMavenLocal in client/java directory
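
Putting the steps from this thread together, the local workflow is roughly (a sketch based on the repo layout, not official docs):

cd client/java
./gradlew publishToMavenLocal        # makes openlineage-java 1.1.0-SNAPSHOT resolvable locally
cd ../../integration/spark
./gradlew --no-daemon databricksIntegrationTest -x test \
    -Pspark.version=3.4.0 \
    -PdatabricksHost=$DATABRICKS_HOST -PdatabricksToken=$DATABRICKS_TOKEN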

      @@ -127521,7 +127766,7 @@

      csv_file = location.csv

      2023-08-17 14:47:59
      -

      *Thread Reply:* Okay thanks

      +

      *Thread Reply:* Okay thanks

      @@ -127547,7 +127792,7 @@

      csv_file = location.csv

      2023-08-17 14:48:01
      -

      *Thread Reply:* will do

      +

      *Thread Reply:* will do

      @@ -127573,7 +127818,7 @@

      csv_file = location.csv

      2023-08-22 10:33:02
      -

      *Thread Reply:* Hello back

      +

      *Thread Reply:* Hello back

      @@ -127599,7 +127844,7 @@

      csv_file = location.csv

      2023-08-22 10:33:12
      -

      *Thread Reply:* I created a databricks cluster.

      +

      *Thread Reply:* I created a databricks cluster.

      @@ -127625,7 +127870,7 @@

      csv_file = location.csv

      2023-08-22 10:35:00
      -

*Thread Reply:* And I had some issues that -PdatabricksHost doesn't work with System.getProperty("databricksHost"). So I changed to -DdatabricksHost with System.getenv("databricksHost")

+

*Thread Reply:* And I had some issues that -PdatabricksHost doesn't work with System.getProperty("databricksHost"). So I changed to -DdatabricksHost with System.getenv("databricksHost")

      @@ -127651,7 +127896,7 @@

      csv_file = location.csv

      2023-08-22 10:36:19
      -

*Thread Reply:* Then I had an issue that the path dbfs:/databricks/openlineage/ doesn't exist, so I created the folder /dbfs/databricks/openlineage/

+

*Thread Reply:* Then I had an issue that the path dbfs:/databricks/openlineage/ doesn't exist, so I created the folder /dbfs/databricks/openlineage/

      @@ -127677,7 +127922,7 @@

      csv_file = location.csv

      2023-08-22 10:38:03
      -

*Thread Reply:* And now I am investigating this issue :

+

*Thread Reply:* And now I am investigating this issue :
java.lang.NullPointerException
    at io.openlineage.spark.agent.DatabricksUtils.uploadOpenlineageJar(DatabricksUtils.java:226)
    at io.openlineage.spark.agent.DatabricksUtils.init(DatabricksUtils.java:66)

@@ -127725,7 +127970,7 @@

      csv_file = location.csv

      2023-08-22 10:39:22
      -

      *Thread Reply:* Suppressed: com.databricks.sdk.core.DatabricksError: Missing required field: cluster_id

      +

      *Thread Reply:* Suppressed: com.databricks.sdk.core.DatabricksError: Missing required field: cluster_id

      @@ -127751,7 +127996,7 @@

      csv_file = location.csv

      2023-08-22 10:40:18
      -

      *Thread Reply:* at io.openlineage.spark.agent.DatabricksUtils.uploadOpenlineageJar(DatabricksUtils.java:226)

      +

      *Thread Reply:* at io.openlineage.spark.agent.DatabricksUtils.uploadOpenlineageJar(DatabricksUtils.java:226)

      @@ -127777,7 +128022,7 @@

      csv_file = location.csv

      2023-08-22 10:54:51
      -

*Thread Reply:* I did this !echo "xxx" > /dbfs/databricks/openlineage/openlineage-spark-V.jar

+

*Thread Reply:* I did this !echo "xxx" > /dbfs/databricks/openlineage/openlineage-spark-V.jar

      @@ -127803,7 +128048,7 @@

      csv_file = location.csv

      2023-08-22 10:55:29
      -

      *Thread Reply:* To create some fake file that can be deleted in uploadOpenlineageJar function.

      +

      *Thread Reply:* To create some fake file that can be deleted in uploadOpenlineageJar function.

      @@ -127829,7 +128074,7 @@

      csv_file = location.csv

      2023-08-22 10:56:09
      -

*Thread Reply:* Because if there is no file, this part fails

+

*Thread Reply:* Because if there is no file, this part fails
StreamSupport.stream(
        workspace.dbfs().list("dbfs:/databricks/openlineage/").spliterator(), false)
    .filter(f -> f.getPath().contains("openlineage-spark"))

@@ -127864,7 +128109,7 @@

      csv_file = location.csv

      2023-08-22 11:47:17
      -

*Thread Reply:* does this work after

+

*Thread Reply:* does this work after
!echo "xxx" > /dbfs/databricks/openlineage/openlineage-spark-V.jar ?

      @@ -127892,7 +128137,7 @@

      csv_file = location.csv

      2023-08-22 11:47:36
      -

      *Thread Reply:* Yes

      +

      *Thread Reply:* Yes

      @@ -127918,7 +128163,7 @@

      csv_file = location.csv

      2023-08-22 19:02:05
      -

      *Thread Reply:* I am now having another error in the driver

      +

      *Thread Reply:* I am now having another error in the driver

23/08/22 22:56:26 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Exception when registering SparkListener

@@ -127959,7 +128204,7 @@

      csv_file = location.csv

      2023-08-22 19:19:29
      -

      *Thread Reply:* Can you please share with me your json conf for the cluster ?

      +

      *Thread Reply:* Can you please share with me your json conf for the cluster ?

      @@ -127994,7 +128239,7 @@

      csv_file = location.csv

      2023-08-22 19:55:57
      -

*Thread Reply:* It's because in my build file I have

+

*Thread Reply:* It's because in my build file I have

      @@ -128029,7 +128274,7 @@

      csv_file = location.csv

      2023-08-22 19:56:27
      -

      *Thread Reply:* and the one that was copied is

      +

      *Thread Reply:* and the one that was copied is

      @@ -128064,7 +128309,7 @@

      csv_file = location.csv

      2023-08-22 20:01:12
      -

*Thread Reply:* due to the findAny 😕

+

*Thread Reply:* due to the findAny 😕
private static void uploadOpenlineageJar(WorkspaceClient workspace) {
    Path jarFile =
        Files.list(Paths.get("../build/libs/"))

@@ -128097,7 +128342,7 @@

      csv_file = location.csv

      2023-08-22 20:35:10
      -

      *Thread Reply:* It works finally 😄

      +

      *Thread Reply:* It works finally 😄

      @@ -128123,7 +128368,7 @@

      csv_file = location.csv

      2023-08-23 05:16:19
      -

*Thread Reply:* The PR 😄

+

*Thread Reply:* The PR 😄
https://github.com/OpenLineage/OpenLineage/pull/2061

      @@ -128216,7 +128461,7 @@

      csv_file = location.csv

      2023-08-23 08:23:49
      -

      *Thread Reply:* thanks for the pr 🙂

      +

      *Thread Reply:* thanks for the pr 🙂

      @@ -128242,7 +128487,7 @@

      csv_file = location.csv

      2023-08-23 08:24:02
      -

      *Thread Reply:* code formatting checks complain now

      +

      *Thread Reply:* code formatting checks complain now

      @@ -128268,7 +128513,7 @@

      csv_file = location.csv

      2023-08-23 08:25:09
      -

      *Thread Reply:* for the JAR issues, do you also want to create PR as you've fixed the issue on your end?

      +

      *Thread Reply:* for the JAR issues, do you also want to create PR as you've fixed the issue on your end?

      @@ -128294,7 +128539,7 @@

      csv_file = location.csv

      2023-08-23 09:06:26
      -

      *Thread Reply:* @Abdallah you're using newer version of Java than 8, right?

      +

      *Thread Reply:* @Abdallah you're using newer version of Java than 8, right?

      @@ -128320,7 +128565,7 @@

      csv_file = location.csv

      2023-08-23 09:07:07
      -

      *Thread Reply:* AFAIK googleJavaFormat behaves differently between Java versions

      +

      *Thread Reply:* AFAIK googleJavaFormat behaves differently between Java versions

      @@ -128346,7 +128591,7 @@

      csv_file = location.csv

      2023-08-23 09:15:41
      -

      *Thread Reply:* Okay I will switch back to another java version

      +

      *Thread Reply:* Okay I will switch back to another java version

      @@ -128372,7 +128617,7 @@

      csv_file = location.csv

      2023-08-23 09:25:06
      -

*Thread Reply:* terra@MacBook-Pro-M3 spark % java -version

+

*Thread Reply:* terra@MacBook-Pro-M3 spark % java -version
java version "1.8.0_381"
Java(TM) SE Runtime Environment (build 1.8.0_381-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.381-b09, mixed mode)

      @@ -128401,7 +128646,7 @@

      csv_file = location.csv

      2023-08-23 09:28:28
      -

      *Thread Reply:* Can you tell me which java version should I use ?

      +

      *Thread Reply:* Can you tell me which java version should I use ?

      @@ -128427,7 +128672,7 @@

      csv_file = location.csv

      2023-08-23 09:49:42
      -

*Thread Reply:* Hello, I have

+

*Thread Reply:* Hello, I have
@mobuchowski ERROR: Missing environment variable {i}
Can you please check where it comes from?

      csv_file = location.csv
      2023-08-23 09:50:18
      2023-08-23 09:50:24
      -

      *Thread Reply:* Can you help please ?

      +

      *Thread Reply:* Can you help please ?

      @@ -128551,7 +128796,7 @@

      csv_file = location.csv

      2023-08-23 10:08:43
      -

      *Thread Reply:* Java 8

      +

      *Thread Reply:* Java 8

      @@ -128577,7 +128822,7 @@

      csv_file = location.csv

      2023-08-23 10:10:14
      -

      *Thread Reply:* ```Hello, I have

      +

      *Thread Reply:* ```Hello, I have

@mobuchowski ERROR: Missing environment variable {i}
Can you please check where it comes from? (edited) ```

@@ -128607,7 +128852,7 @@

      csv_file = location.csv

      2023-08-23 10:11:10
      2023-08-23 10:53:34
      -

      *Thread Reply:* @Abdallah merged 🙂

      +

      *Thread Reply:* @Abdallah merged 🙂

      @@ -128659,7 +128904,7 @@

      csv_file = location.csv

      2023-08-23 10:59:22
      -

      *Thread Reply:* Thank you !

      +

      *Thread Reply:* Thank you !

      @@ -128800,7 +129045,7 @@

      csv_file = location.csv

      2023-08-21 20:26:56
      -

*Thread Reply:* openlineage-airflow is the package maintained in the OpenLineage project and to be used for versions of Airflow before 2.7. You could use it with 2.7 as well but you’d be staying on the “old” integration.

+

*Thread Reply:* openlineage-airflow is the package maintained in the OpenLineage project and to be used for versions of Airflow before 2.7. You could use it with 2.7 as well but you’d be staying on the “old” integration.
apache-airflow-providers-openlineage is the new package, maintained in the Airflow project, that can be used starting with Airflow 2.7 and is the recommended package moving forward. It is compatible with the configuration of the old package described in that usage page.
CC: @Maciej Obuchowski @Jakub Dardziński It looks like this page needs improvement.
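
In requirements.txt terms, the rule of thumb above comes down to (a sketch):

# Airflow >= 2.7: the provider package maintained in the Airflow project
apache-airflow-providers-openlineage

# Airflow < 2.7 (or staying on the old integration):
openlineage-airflow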

      @@ -128852,7 +129097,7 @@

      csv_file = location.csv

      2023-08-22 05:03:28
      -

      *Thread Reply:* Yeah, I'll fix that

      +

      *Thread Reply:* Yeah, I'll fix that

      @@ -128882,7 +129127,7 @@

      csv_file = location.csv

      2023-08-22 17:55:08
      -

      *Thread Reply:* https://github.com/apache/airflow/pull/33610

      +

      *Thread Reply:* https://github.com/apache/airflow/pull/33610

      fyi

      csv_file = location.csv
      2023-08-22 18:02:46
      -

*Thread Reply:* that really depends on the use case IMHO

+

*Thread Reply:* that really depends on the use case IMHO
if you consider a whole directory/folder as a dataset (meaning that each file inside folds into a larger whole) you should label the dataset as the directory

you might as well have a directory with each file being something different - in this case it would be best to set each file separately as a dataset
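
With the Python client, the two choices look something like this (namespace and names are illustrative):

from openlineage.client.run import Dataset

# one dataset for the whole folder of dumps:
folder_dataset = Dataset(namespace="s3://my-bucket", name="clients/daily_dump")

# versus one dataset per file:
file_dataset = Dataset(namespace="s3://my-bucket", name="clients/daily_dump/2022-01-01.csv")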

      @@ -129009,7 +129254,7 @@

      csv_file = location.csv

      2023-08-22 18:04:32
      -

      *Thread Reply:* there was also SymlinksDatasetFacet introduced to store alternative dataset names, might be useful: https://github.com/OpenLineage/OpenLineage/pull/936

      +

      *Thread Reply:* there was also SymlinksDatasetFacet introduced to store alternative dataset names, might be useful: https://github.com/OpenLineage/OpenLineage/pull/936

      @@ -129035,7 +129280,7 @@

      csv_file = location.csv

      2023-08-22 18:07:26
      -

      *Thread Reply:* cool, yeah in general each file is just a snapshot of data from a client (for example, daily dump). the parquet datasets are normally partitioned and might have small fragments and I definitely picture it as more of a table than individual files

      +

      *Thread Reply:* cool, yeah in general each file is just a snapshot of data from a client (for example, daily dump). the parquet datasets are normally partitioned and might have small fragments and I definitely picture it as more of a table than individual files

      @@ -129065,7 +129310,7 @@

      csv_file = location.csv

      2023-08-23 08:22:09
      -

      *Thread Reply:* Agree with Jakub here - with object storage, people use different patterns, but usually some directory layer vs file is the valid abstraction level, especially if your pattern is adding files with new data inside

      +

      *Thread Reply:* Agree with Jakub here - with object storage, people use different patterns, but usually some directory layer vs file is the valid abstraction level, especially if your pattern is adding files with new data inside

      @@ -129095,7 +129340,7 @@

      csv_file = location.csv

      2023-08-25 10:26:52
      -

      *Thread Reply:* I tested a dataset for each raw file versus the folder and the folder looks much cleaner (not sure if I can collapse individual datasets/files into a group?)

      +

      *Thread Reply:* I tested a dataset for each raw file versus the folder and the folder looks much cleaner (not sure if I can collapse individual datasets/files into a group?)

      from 2022, this particular source had 6 raw schema changes (client controlled, no warning). what should I do to make that as obvious as possible if I track the dataset at a folder level?

      @@ -129123,7 +129368,7 @@

      csv_file = location.csv

      2023-08-25 10:32:19
      -

      *Thread Reply:* I was thinking that I could name the dataset based on the schema_version (identified by the raw column names), so in this example I would have 6 OL datasets feeding into one "staging" dataset

      +

      *Thread Reply:* I was thinking that I could name the dataset based on the schema_version (identified by the raw column names), so in this example I would have 6 OL datasets feeding into one "staging" dataset

      @@ -129149,7 +129394,7 @@

      csv_file = location.csv

      2023-08-25 10:32:57
      -

      *Thread Reply:* not sure what the best practice would be in this scenario though

      +

      *Thread Reply:* not sure what the best practice would be in this scenario though

      @@ -129196,7 +129441,7 @@

      csv_file = location.csv

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-08-22 18:35:45
      @@ -129228,7 +129473,7 @@

      csv_file = location.csv

      2023-08-23 05:22:18
      -

      *Thread Reply:* Which version are you trying to use?

      +

      *Thread Reply:* Which version are you trying to use?

      @@ -129254,7 +129499,7 @@

      csv_file = location.csv

      2023-08-23 05:22:45
      -

      *Thread Reply:* Both OL and MWAA/Airflow 🙂

      +

      *Thread Reply:* Both OL and MWAA/Airflow 🙂

      @@ -129280,7 +129525,7 @@

      csv_file = location.csv

      2023-08-23 05:23:52
      -

*Thread Reply:* 'openlineage' is not a package

+

*Thread Reply:* 'openlineage' is not a package
suggests that something went wrong with the import process, for example a cycle in the import path

      @@ -129302,12 +129547,12 @@

      csv_file = location.csv

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-08-23 16:50:34
      -

*Thread Reply:* MWAA: 2.6.3

+

*Thread Reply:* MWAA: 2.6.3
OL: 1.0.0

I can see from the log that OL has been successfully installed to the webserver:

@@ -129354,7 +129599,7 @@

      csv_file = location.csv

      2023-08-24 08:18:36
      -

*Thread Reply:* It’s taking long to update the MWAA environment but I tested the 2.6.3 version with the following requirements.txt:

+

*Thread Reply:* It’s taking long to update the MWAA environment but I tested the 2.6.3 version with the following requirements.txt:
openlineage-airflow
and
openlineage-airflow==1.0.0

@@ -129379,12 +129624,12 @@

      csv_file = location.csv

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-08-24 08:29:30
      -

      *Thread Reply:* Yeah, it takes forever to update MWAA even for a simple change. If you open either the webserver log (in CloudWatch) or the AirFlow UI, you should see the above error message.

      +

      *Thread Reply:* Yeah, it takes forever to update MWAA even for a simple change. If you open either the webserver log (in CloudWatch) or the AirFlow UI, you should see the above error message.

      @@ -129410,7 +129655,7 @@

      csv_file = location.csv

      2023-08-24 08:33:53
      -

*Thread Reply:* The thing is that I don’t see any error messages.

+

*Thread Reply:* The thing is that I don’t see any error messages.
I wrote a simple DAG to test too:
```from __future__ import annotations

      @@ -129467,12 +129712,12 @@

      csv_file = location.csv

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-08-24 08:48:11
      -

      *Thread Reply:* Oh how interesting. I did have a plugin that sets the endpoint & key via env var. Let me try to disable that to see if it fixes the issue. Will report back after 30 mins, or however long it takes to update MWAA 😉

      +

      *Thread Reply:* Oh how interesting. I did have a plugin that sets the endpoint & key via env var. Let me try to disable that to see if it fixes the issue. Will report back after 30 mins, or however long it takes to update MWAA 😉

      @@ -129498,7 +129743,7 @@

      csv_file = location.csv

      2023-08-24 08:50:05
      -

*Thread Reply:* ohh, I see

+

*Thread Reply:* ohh, I see
you probably followed this guide: https://aws.amazon.com/blogs/big-data/automate-data-lineage-on-amazon-mwaa-with-openlineage/?

      @@ -129545,12 +129790,12 @@

      csv_file = location.csv

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-08-24 09:04:27
      -

      *Thread Reply:* Actually no. I'm not aware of this guide. I assume it's outdated already?

      +

      *Thread Reply:* Actually no. I'm not aware of this guide. I assume it's outdated already?

      @@ -129576,7 +129821,7 @@

      csv_file = location.csv

      2023-08-24 09:04:54
      -

      *Thread Reply:* tbh I don’t know

      +

      *Thread Reply:* tbh I don’t know

      @@ -129597,12 +129842,12 @@

      csv_file = location.csv

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-08-24 09:04:55
      -

      *Thread Reply:* Actually while we're on that topic, what's the recommended way to pass the URL & API Key in MWAA?

      +

      *Thread Reply:* Actually while we're on that topic, what's the recommended way to pass the URL & API Key in MWAA?

      @@ -129628,7 +129873,7 @@

      csv_file = location.csv

      2023-08-24 09:28:00
      -

      *Thread Reply:* I think it's still a plugin that sets env vars

      +

      *Thread Reply:* I think it's still a plugin that sets env vars
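
The plugin in question is typically just a few lines that export the OpenLineage settings at import time - a hedged sketch (env var names per the openlineage-airflow configuration docs; values are illustrative):

import os
from airflow.plugins_manager import AirflowPlugin

os.environ["OPENLINEAGE_URL"] = "https://your-backend.example.com"   # illustrative
os.environ["OPENLINEAGE_API_KEY"] = "your-api-key"                   # illustrative
os.environ["OPENLINEAGE_NAMESPACE"] = "my_mwaa_env"                  # illustrative

class EnvVarPlugin(AirflowPlugin):
    # importing this module is what sets the variables; the plugin class
    # just makes MWAA load the file
    name = "env_var_plugin"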

      @@ -129649,12 +129894,12 @@

      csv_file = location.csv

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-08-24 09:32:18
      -

      *Thread Reply:* Yeah based on the page you shared, secret manager + plugin seems like the way to go.

      +

      *Thread Reply:* Yeah based on the page you shared, secret manager + plugin seems like the way to go.

      @@ -129675,12 +129920,12 @@

      csv_file = location.csv

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-08-24 10:31:50
      -

      *Thread Reply:* Alas after disabling the plugin and restarting the cluster, I'm still getting the same error. Do you mind to share a screenshot of your cluster's settings so I can compare?

      +

      *Thread Reply:* Alas after disabling the plugin and restarting the cluster, I'm still getting the same error. Do you mind to share a screenshot of your cluster's settings so I can compare?

      @@ -129706,7 +129951,7 @@

      csv_file = location.csv

      2023-08-24 11:57:04
      -

      *Thread Reply:* Are you maybe importing some top level OpenLineage code anywhere? This error is most likely circular import

      +

      *Thread Reply:* Are you maybe importing some top level OpenLineage code anywhere? This error is most likely circular import

      @@ -129727,12 +129972,12 @@

      csv_file = location.csv

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-08-24 12:01:12
      -

      *Thread Reply:* Let me try removing all the dags to see if it helps.

      +

      *Thread Reply:* Let me try removing all the dags to see if it helps.

      @@ -129753,12 +129998,12 @@

      csv_file = location.csv

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-08-24 18:42:49
      -

      *Thread Reply:* @Maciej Obuchowski you were correct! It was indeed the DAGs. The errors are gone after removing all the dags. Now just need to figure what caused the circular import since I didn't import OL directly in DAG.

      +

      *Thread Reply:* @Maciej Obuchowski you were correct! It was indeed the DAGs. The errors are gone after removing all the dags. Now just need to figure what caused the circular import since I didn't import OL directly in DAG.

      @@ -129779,12 +130024,12 @@

      csv_file = location.csv

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-08-24 18:44:33
      -

*Thread Reply:* Could this be the issue?

+

*Thread Reply:* Could this be the issue?
from airflow.lineage.entities import File, Table
How could I declare lineage manually if I can't import these classes?
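
For context, manual lineage declaration with those entities normally looks like this (a sketch assuming a healthy install; the task and paths are illustrative):

from airflow.lineage.entities import File
from airflow.operators.bash import BashOperator

copy = BashOperator(
    task_id="copy_data",
    bash_command="cp /data/in.csv /data/out.csv",
    inlets=[File(url="/data/in.csv")],
    outlets=[File(url="/data/out.csv")],
)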

      @@ -129812,7 +130057,7 @@

      csv_file = location.csv

      2023-08-25 06:52:47
      -

      *Thread Reply:* @Mars Lan I'll look in more details next week, as I'm in transit now

      +

      *Thread Reply:* @Mars Lan (Metaphor) I'll look in more details next week, as I'm in transit now

      @@ -129838,7 +130083,7 @@

      csv_file = location.csv

      2023-08-25 06:53:18
      -

      *Thread Reply:* but if you could narrow down a problem to single dag that I or @Jakub Dardziński could reproduce, ideally locally, it would help a lot

      +

      *Thread Reply:* but if you could narrow down a problem to single dag that I or @Jakub Dardziński could reproduce, ideally locally, it would help a lot

      @@ -129859,12 +130104,12 @@

      csv_file = location.csv

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-08-25 07:07:11
      -

      *Thread Reply:* Thanks. I think I understand how this works much better now. Found a few useful BQ example dags. Will give them a try and report back.

      +

      *Thread Reply:* Thanks. I think I understand how this works much better now. Found a few useful BQ example dags. Will give them a try and report back.

      @@ -129921,7 +130166,7 @@

      csv_file = location.csv

      2023-08-23 07:32:19
      -

      *Thread Reply:* are you using Airflow to connect to Redshift?

      +

      *Thread Reply:* are you using Airflow to connect to Redshift?

      @@ -129947,7 +130192,7 @@

      csv_file = location.csv

      2023-08-24 06:50:05
      -

*Thread Reply:* Hi @Jakub Dardziński,

+

*Thread Reply:* Hi @Jakub Dardziński,
Thank you for your reply. No, we are not using Airflow. We are using load/unload commands with PySpark and also Pandas with a JDBC connection

      @@ -129976,7 +130221,7 @@

      csv_file = location.csv

      2023-08-25 13:28:37
      -

*Thread Reply:* @Paweł Leszczyński might know the answer to whether the Spark<->OL integration works with Redshift. Eventually JDBC is supported with sqlparser

+

*Thread Reply:* @Paweł Leszczyński might know the answer to whether the Spark<->OL integration works with Redshift. Eventually JDBC is supported with sqlparser

for Pandas I think there wasn’t too much work done

      @@ -130004,7 +130249,7 @@

      csv_file = location.csv

      2023-08-28 02:18:49
      -

      *Thread Reply:* @Nitin If you're using jdbc within Spark, the lineage should be obtained via sqlparser-rs library https://github.com/sqlparser-rs/sqlparser-rs. In case it's not, please try to provide some minimal SQL code (or pyspark) which leads to uncaught lineage.

      +

      *Thread Reply:* @Nitin If you're using jdbc within Spark, the lineage should be obtained via sqlparser-rs library https://github.com/sqlparser-rs/sqlparser-rs. In case it's not, please try to provide some minimal SQL code (or pyspark) which leads to uncaught lineage.
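
As a concrete example of the JDBC-within-Spark case (connection details are illustrative; with openlineage-spark attached, the query below is what gets handed to sqlparser-rs for input-table extraction):

df = (spark.read.format("jdbc")
    .option("url", "jdbc:redshift://example-cluster:5439/dev")   # illustrative
    .option("query", "SELECT id, name FROM public.customers")
    .option("user", "user")
    .option("password", "secret")
    .load())
df.write.parquet("s3://bucket/output/")   # the output side of the lineage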

      @@ -130059,7 +130304,7 @@

      csv_file = location.csv

      2023-08-28 04:53:03
      -

*Thread Reply:* Hi @Jakub Dardziński / @Paweł Leszczyński, thank you for taking the time to reply to my query. We need to capture only load and unload query lineage which we are running using Spark.

+

*Thread Reply:* Hi @Jakub Dardziński / @Paweł Leszczyński, thank you for taking the time to reply to my query. We need to capture only load and unload query lineage which we are running using Spark.

      If you have any sample implementation for reference, it will be indeed helpful

      @@ -130087,7 +130332,7 @@

      csv_file = location.csv

      2023-08-28 06:12:46
      -

      *Thread Reply:* I think we don't support load yet on our side: https://github.com/OpenLineage/OpenLineage/blob/main/integration/sql/impl/src/visitor.rs#L8

      +

      *Thread Reply:* I think we don't support load yet on our side: https://github.com/OpenLineage/OpenLineage/blob/main/integration/sql/impl/src/visitor.rs#L8

      @@ -130138,7 +130383,7 @@

      csv_file = location.csv

      2023-08-28 08:18:14
      -

*Thread Reply:* Yeah! any way you can think of, we can accommodate it, especially the load and unload statements.

+

*Thread Reply:* Yeah! any way you can think of, we can accommodate it, especially the load and unload statements.
Also, we would like to capture lineage information where our endpoints are Sagemaker and Redis

      @@ -130165,7 +130410,7 @@

      csv_file = location.csv

      2023-08-28 13:20:37
      -

      *Thread Reply:* @Paweł Leszczyński can we use this code base integration/common/openlineage/common/provider/redshift_data.py for redshift lineage capture

      +

      *Thread Reply:* @Paweł Leszczyński can we use this code base integration/common/openlineage/common/provider/redshift_data.py for redshift lineage capture

      @@ -130191,7 +130436,7 @@

      csv_file = location.csv

      2023-08-28 14:26:40
      -

      *Thread Reply:* it still expects input and output tables that are usually retrieved from sqlparser

      +

      *Thread Reply:* it still expects input and output tables that are usually retrieved from sqlparser

      @@ -130217,7 +130462,7 @@

      csv_file = location.csv

      2023-08-28 14:31:00
      -

*Thread Reply:* for Sagemaker there is an Airflow integration written, might be an example possibly

+

*Thread Reply:* for Sagemaker there is an Airflow integration written, might be an example possibly
https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/extractors/sagemaker_extractors.py

      @@ -130275,7 +130520,7 @@

      csv_file = location.csv

      2023-08-23 12:27:15
      -

      *Thread Reply:* Thank you for requesting a release @Abdallah. Three +1s from committers will authorize.

      +

      *Thread Reply:* Thank you for requesting a release @Abdallah. Three +1s from committers will authorize.

      @@ -130305,7 +130550,7 @@

      csv_file = location.csv

      2023-08-23 13:13:18
      -

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated within 2 business days.

      +

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated within 2 business days.

      @@ -130357,7 +130602,7 @@

      csv_file = location.csv

      2023-08-23 13:24:32
      2023-08-23 13:29:05
      -

*Thread Reply:* For custom transport, you have to provide an implementation of the interface https://github.com/OpenLineage/OpenLineage/blob/4a1a5c3bf9767467b71ca0e1b6d820ba9e[…]ain/java/io/openlineage/client/transports/TransportBuilder.java and point to it in a META-INF file

+

*Thread Reply:* For custom transport, you have to provide an implementation of the interface https://github.com/OpenLineage/OpenLineage/blob/4a1a5c3bf9767467b71ca0e1b6d820ba9e[…]ain/java/io/openlineage/client/transports/TransportBuilder.java and point to it in a META-INF file

      @@ -130434,7 +130679,7 @@

      csv_file = location.csv

      2023-08-23 13:29:52
      -

      *Thread Reply:* But if I understand correctly, if you want to change behavior rather than extend, the correct way may be to either contribute it to repo - if that behavior is useful to anyone, or fork the repo

      +

      *Thread Reply:* But if I understand correctly, if you want to change behavior rather than extend, the correct way may be to either contribute it to repo - if that behavior is useful to anyone, or fork the repo

      @@ -130460,7 +130705,7 @@

      csv_file = location.csv

      2023-08-23 15:14:43
      -

*Thread Reply:* @Maciej Obuchowski - Can you elaborate more on the "point to it in a META-INF file"? Let's say we have the custom transport type built in a standalone jar by extending transport builder - what're the exact next steps to use this custom transport in the standalone jar when doing spark-submit?

+

*Thread Reply:* @Maciej Obuchowski - Can you elaborate more on the "point to it in a META-INF file"? Let's say we have the custom transport type built in a standalone jar by extending transport builder - what're the exact next steps to use this custom transport in the standalone jar when doing spark-submit?

      @@ -130486,7 +130731,7 @@

      csv_file = location.csv

      2023-08-23 15:23:13
      -

*Thread Reply:* @Athitya Kumar your jar needs to have META-INF/services/io.openlineage.client.transports.TransportBuilder with fully qualified class names of your custom TransportBuilders there - like openlineage-spark has

+

*Thread Reply:* @Athitya Kumar your jar needs to have META-INF/services/io.openlineage.client.transports.TransportBuilder with fully qualified class names of your custom TransportBuilders there - like openlineage-spark has
io.openlineage.client.transports.HttpTransportBuilder
io.openlineage.client.transports.KafkaTransportBuilder
io.openlineage.client.transports.ConsoleTransportBuilder

@@ -130517,7 +130762,7 @@
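To make that registration concrete, here is a minimal sketch of the services file (the builder class name is illustrative, not from this thread; it relies on the ServiceLoader mechanism the client uses to discover TransportBuilder implementations):

```
# file inside your standalone jar:
# META-INF/services/io.openlineage.client.transports.TransportBuilder
com.example.lineage.MyTransportBuilder
```

The jar then goes on the Spark driver classpath (e.g. via spark-submit --jars), and the transport is selected by setting the OpenLineage transport type to whatever type string your builder declares.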

      csv_file = location.csv

      2023-08-25 01:49:29
      -

      *Thread Reply:* @Maciej Obuchowski - I think this change may be required for consumers to leverage custom transports, can you check & verify this GH comment? +

      *Thread Reply:* @Maciej Obuchowski - I think this change may be required for consumers to leverage custom transports, can you check & verify this GH comment? https://github.com/OpenLineage/OpenLineage/issues/2007#issuecomment-1690350630

      @@ -130574,7 +130819,7 @@

      csv_file = location.csv

      2023-08-25 06:52:30
      -

      *Thread Reply:* Probably, I will look at more details next week @Athitya Kumar as I'm in transit

      +

      *Thread Reply:* Probably, I will look at more details next week @Athitya Kumar as I'm in transit

      @@ -130625,7 +130870,7 @@

      csv_file = location.csv

- 👏 Ayoub Oudmane, Abdallah, Yuanli Wang, Athitya Kumar, Mars Lan, Maciej Obuchowski, Harel Shein, Kiran Hiremath, Thomas Abraham
+ 👏 Ayoub Oudmane, Abdallah, Yuanli Wang, Athitya Kumar, Mars Lan (Metaphor), Maciej Obuchowski, Harel Shein, Kiran Hiremath, Thomas Abraham
      @@ -130732,7 +130977,7 @@

      csv_file = location.csv

      2023-08-25 11:32:12
      -

      *Thread Reply:* there will certainly be more meetups, don’t worry about that!

      +

      *Thread Reply:* there will certainly be more meetups, don’t worry about that!

      @@ -130758,7 +131003,7 @@

      csv_file = location.csv

      2023-08-25 11:32:30
      -

      *Thread Reply:* where are you located? perhaps we can try to organize a meetup closer to where you are.

      +

      *Thread Reply:* where are you located? perhaps we can try to organize a meetup closer to where you are.

      @@ -130784,7 +131029,7 @@

      csv_file = location.csv

      2023-08-25 11:49:37
      -

      *Thread Reply:* Thanks a lot for the response, we are in London. We'd be glad to help you organise a meetup and also meet in person!

      +

      *Thread Reply:* Thanks a lot for the response, we are in London. We'd be glad to help you organise a meetup and also meet in person!

      @@ -130810,7 +131055,7 @@

      csv_file = location.csv

      2023-08-25 11:51:39
      -

      *Thread Reply:* This is awesome, thanks @George Polychronopoulos. I’ll start a channel and invite you

      +

      *Thread Reply:* This is awesome, thanks @George Polychronopoulos. I’ll start a channel and invite you

      @@ -130862,7 +131107,7 @@

      csv_file = location.csv

      2023-08-28 05:59:10
      -

      *Thread Reply:* Although you emit DatasetEvent, you still emit an event and eventTime is a valid marker.

      +

      *Thread Reply:* Although you emit DatasetEvent, you still emit an event and eventTime is a valid marker.

      @@ -130888,7 +131133,7 @@

      csv_file = location.csv

      2023-08-28 06:01:40
      -

      *Thread Reply:* so, should I use the current time at the moment of emitting it and that's it?

      +

      *Thread Reply:* so, should I use the current time at the moment of emitting it and that's it?

      @@ -130914,7 +131159,7 @@

      csv_file = location.csv

      2023-08-28 06:01:53
      -

      *Thread Reply:* yes, that should be it

      +

      *Thread Reply:* yes, that should be it

      @@ -130970,7 +131215,7 @@

      csv_file = location.csv

      2023-08-28 06:02:49
      -

      *Thread Reply:* marquez is not capable of reflecting DatasetEvents in DB but it should respond with Unsupported event type

      +

      *Thread Reply:* marquez is not capable of reflecting DatasetEvents in DB but it should respond with Unsupported event type

      @@ -130996,7 +131241,7 @@

      csv_file = location.csv

      2023-08-28 06:03:15
      -

      *Thread Reply:* and return 200 instead of 201 created

      +

      *Thread Reply:* and return 200 instead of 201 created

      @@ -131022,7 +131267,7 @@

      csv_file = location.csv

      2023-08-28 06:05:41
      -

      *Thread Reply:* I'll have a deeper look then, probably I'm doing something wrong. thanks @Paweł Leszczyński

      +

      *Thread Reply:* I'll have a deeper look then, probably I'm doing something wrong. thanks @Paweł Leszczyński
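As an aside for anyone following this thread: a DatasetEvent is plain JSON against the spec, so a minimal emission sketch in Python looks like this (the producer, namespace, and local Marquez URL are illustrative; pick the schemaURL matching the spec version you target):

```python
from datetime import datetime, timezone

import requests

event = {
    # as discussed above: the moment of emission is a valid eventTime
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "producer": "https://example.com/my-producer",
    "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/definitions/DatasetEvent",
    "dataset": {
        "namespace": "my_namespace",
        "name": "my_dataset",
        "facets": {},
    },
}

# Marquez's lineage endpoint; per the thread it may answer 200
# ("Unsupported event type") rather than 201 for a DatasetEvent.
resp = requests.post("http://localhost:5000/api/v1/lineage", json=event)
print(resp.status_code, resp.text)
```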

      @@ -131074,7 +131319,7 @@

      csv_file = location.csv

      2023-08-28 14:23:24
      -

      *Thread Reply:* I'd rather generate them from OL spec (compliant with JSON Schema)

      +

      *Thread Reply:* I'd rather generate them from OL spec (compliant with JSON Schema)

      @@ -131100,7 +131345,7 @@

      csv_file = location.csv

      2023-08-28 15:12:21
      -

      *Thread Reply:* I'll look into this. I take you to mean that I would use the OL spec which is available as a set of JSON schemas to create the data object and then HTTP POST it using vanilla Golang. Is that correct? Thank you for your help!

      +

      *Thread Reply:* I'll look into this. I take you to mean that I would use the OL spec which is available as a set of JSON schemas to create the data object and then HTTP POST it using vanilla Golang. Is that correct? Thank you for your help!

      @@ -131126,7 +131371,7 @@

      csv_file = location.csv

      2023-08-28 15:30:05
      -

      *Thread Reply:* Correct! You’re also very welcome to contribute Golang client (currently we have Python & Java clients) if you manage to send events using golang 🙂

      +

      *Thread Reply:* Correct! You’re also very welcome to contribute Golang client (currently we have Python & Java clients) if you manage to send events using golang 🙂

      @@ -131249,7 +131494,7 @@

      csv_file = location.csv

- 🎉 Drew Meyers, Harel Shein, Maciej Obuchowski, Julian LaNeve, Mars Lan
+ 🎉 Drew Meyers, Harel Shein, Maciej Obuchowski, Julian LaNeve, Mars Lan (Metaphor)
      @@ -131274,31 +131519,23 @@

      csv_file = location.csv

      2023-08-29 03:18:04
      -

      Hello, I'm currently in the process of following the instructions outlined in the provided getting started guide at https://openlineage.io/getting-started/. However, I've encountered a problem while attempting to complete *Step 1* of the guide. Unfortunately, I'm encountering an internal server error at this stage. I did manage to successfully run Marquez, but it appears that there might be an issue that needs to be addressed. I have attached screen shots.

      +

      Hello, I'm currently in the process of following the instructions outlined in the provided getting started guide at https://openlineage.io/getting-started/. However, I've encountered a problem while attempting to complete *Step 1* of the guide. Unfortunately, I'm encountering an internal server error at this stage. I did manage to successfully run Marquez, but it appears that there might be an issue that needs to be addressed. I have attached screen shots.

      @@ -131326,7 +131563,7 @@

      csv_file = location.csv

      2023-08-29 03:20:18
      -

*Thread Reply:* is port 5000 taken by any other application? or does ./docker/up.sh have some errors in logs?

+

*Thread Reply:* is port 5000 taken by any other application? or does ./docker/up.sh have some errors in logs?

      @@ -131352,18 +131589,14 @@

      csv_file = location.csv

      2023-08-29 05:23:01
      -

*Thread Reply:* @Jakub Dardziński port 5000 is not taken by any other application. The logs show some errors but I am not sure what is the issue here.

+

*Thread Reply:* @Jakub Dardziński port 5000 is not taken by any other application. The logs show some errors but I am not sure what is the issue here.

      @@ -131391,7 +131624,7 @@

      csv_file = location.csv

      2023-08-29 10:02:38
      -

      *Thread Reply:* I think Marquez is running on WSL while you're trying to connect from host computer?

      +

      *Thread Reply:* I think Marquez is running on WSL while you're trying to connect from host computer?

      @@ -131443,7 +131676,7 @@

      csv_file = location.csv

      2023-08-29 10:58:29
      -

      *Thread Reply:* reply by @Julian LaNeve: yes 🙂💯

      +

      *Thread Reply:* reply by @Julian LaNeve: yes 🙂💯

      @@ -131499,7 +131732,7 @@

      csv_file = location.csv

      2023-08-30 10:53:08
      -

      *Thread Reply:* > then should my namespace be based on the client I am working with? +

      *Thread Reply:* > then should my namespace be based on the client I am working with? I think each of those sources should be a different namespace?

      @@ -131526,7 +131759,7 @@

      csv_file = location.csv

      2023-08-30 12:59:53
      -

      *Thread Reply:* got it, yeah I was kind of picturing as one namespace for the client (we handle many clients but they are completely distinct entities). I was able to get it to work with multiple namespaces like you suggested and Marquez was able to plot everything correctly in the visualization

      +

      *Thread Reply:* got it, yeah I was kind of picturing as one namespace for the client (we handle many clients but they are completely distinct entities). I was able to get it to work with multiple namespaces like you suggested and Marquez was able to plot everything correctly in the visualization
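As an illustration of that layout (names invented; the patterns follow the OpenLineage dataset naming conventions):

```
# one namespace per distinct source system:
postgres://client-a-db:5432   -> datasets named schema.table
s3://client-a-bucket          -> datasets named /path/to/object
bigquery                      -> datasets named project.dataset.table
```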

      @@ -131552,7 +131785,7 @@

      csv_file = location.csv

      2023-08-30 13:01:18
      -

      *Thread Reply:* I noticed some of my Dataset facets make more sense as Run facets, for example, the name of the specific file I processed and how many rows of data / size of the data for that schedule. that won't impact the Run facets Airflow provides right? I can still have the schedule information + my custom run facets?

      +

      *Thread Reply:* I noticed some of my Dataset facets make more sense as Run facets, for example, the name of the specific file I processed and how many rows of data / size of the data for that schedule. that won't impact the Run facets Airflow provides right? I can still have the schedule information + my custom run facets?

      @@ -131578,7 +131811,7 @@

      csv_file = location.csv

      2023-08-30 13:06:38
      -

      *Thread Reply:* Yes, unless you name it the same as one of the Airflow facets 🙂

      +

      *Thread Reply:* Yes, unless you name it the same as one of the Airflow facets 🙂
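A custom run facet of the kind described above might be sketched like this (assuming the attrs-based facet classes of openlineage-python 1.x; the facet and field names are invented):

```python
import attr

from openlineage.client.facet import BaseFacet


@attr.s
class ProcessedFileRunFacet(BaseFacet):
    # run-scoped details: which file this run processed and how big it was
    filename: str = attr.ib()
    rowCount: int = attr.ib()
    byteSize: int = attr.ib()


# attached under the run's facets next to the Airflow-provided ones; any
# key that does not clash with a built-in facet name is fine
run_facets = {"processedFile": ProcessedFileRunFacet("location.csv", 1000, 52431)}
```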

      @@ -131630,7 +131863,7 @@

      csv_file = location.csv

      2023-08-30 12:23:18
      -

      *Thread Reply:* I’ve seen people do this through the ingress controller in Kubernetes. Unfortunately I don’t have documentation besides k8s specific ones you would find for the ingress controller you’re using. You’d redirect any unauthenticated request to your identity provider

      +

      *Thread Reply:* I’ve seen people do this through the ingress controller in Kubernetes. Unfortunately I don’t have documentation besides k8s specific ones you would find for the ingress controller you’re using. You’d redirect any unauthenticated request to your identity provider

      @@ -131727,7 +131960,7 @@

      csv_file = location.csv

      2023-08-30 12:15:31
      -

      *Thread Reply:* I’ll be there and looking forward to see @John Lukenoff ‘s presentation

      +

      *Thread Reply:* I’ll be there and looking forward to see @John Lukenoff ‘s presentation

      @@ -131783,7 +132016,7 @@

      csv_file = location.csv

      2023-08-30 23:25:21
      -

      *Thread Reply:* Sorry about that!

      +

      *Thread Reply:* Sorry about that!

      @@ -131887,7 +132120,7 @@

      csv_file = location.csv

      2023-08-31 02:58:41
      -

      *Thread Reply:* io.openlineage:openlineage_spark:0.12.0 -> could you repeat the steps with newer version?

      +

      *Thread Reply:* io.openlineage:openlineage_spark:0.12.0 -> could you repeat the steps with newer version?

      @@ -131992,7 +132225,7 @@

      csv_file = location.csv

      2023-09-01 05:15:28
      -

      *Thread Reply:* please use latest io.openlineage:openlineage_spark:1.1.0 instead. openlineage-java is already contained in the jar, no need to add it on your own.

      +

      *Thread Reply:* please use latest io.openlineage:openlineage_spark:1.1.0 instead. openlineage-java is already contained in the jar, no need to add it on your own.

      @@ -132044,7 +132277,7 @@

      csv_file = location.csv

      2023-09-01 06:00:53
      -

      *Thread Reply:* @Michael Robinson

      +

      *Thread Reply:* @Michael Robinson

      @@ -132070,7 +132303,7 @@

      csv_file = location.csv

      2023-09-01 17:13:32
      -

      *Thread Reply:* The recording is on the youtube channel here. I’ll update the wiki ASAP

      +

      *Thread Reply:* The recording is on the youtube channel here. I’ll update the wiki ASAP

      @@ -132195,7 +132428,7 @@

      csv_file = location.csv

- 🙌 Harel Shein, Willy Lulciuc, Mars Lan, Peter Hicks, Maciej Obuchowski, Paweł Leszczyński, Eric Veleker, Sheeri Cabral (Collibra), Ross Turk, Michael Robinson
+ 🙌 Harel Shein, Willy Lulciuc, Mars Lan (Metaphor), Peter Hicks, Maciej Obuchowski, Paweł Leszczyński, Eric Veleker, Sheeri Cabral (Collibra), Ross Turk, Michael Robinson
      @@ -132224,7 +132457,7 @@

      csv_file = location.csv

      2023-09-01 23:09:55
      -

      *Thread Reply:* https://www.youtube.com/watch?v=zvCdrNJsxBo&t=2260s

      +

      *Thread Reply:* https://www.youtube.com/watch?v=zvCdrNJsxBo&t=2260s

      @@ -132368,7 +132601,7 @@

      csv_file = location.csv

      2023-09-04 06:43:47
      -

      *Thread Reply:* Sounds good to me

      +

      *Thread Reply:* Sounds good to me

      @@ -132394,7 +132627,7 @@

      csv_file = location.csv

      2023-09-11 05:15:03
      -

      *Thread Reply:* Added this here: https://github.com/OpenLineage/OpenLineage/pull/2099

      +

      *Thread Reply:* Added this here: https://github.com/OpenLineage/OpenLineage/pull/2099

      @@ -132509,7 +132742,7 @@

      csv_file = location.csv

      2023-09-04 06:54:12
      -

      *Thread Reply:* I think it only depends on log4j configuration

      +

      *Thread Reply:* I think it only depends on log4j configuration

      @@ -132535,7 +132768,7 @@

      csv_file = location.csv

      2023-09-04 06:57:15
      -

*Thread Reply:* ```# Set everything to be logged to the console

+

*Thread Reply:* ```# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err

@@ -132571,7 +132804,7 @@

      set the log level for the openlineage spark library

      2023-09-04 11:28:11
      -

      *Thread Reply:* Hmm... I can see the logs for the other commands, like createViewCommand etc. I just cannot see it for any of the delta runs

      +

      *Thread Reply:* Hmm... I can see the logs for the other commands, like createViewCommand etc. I just cannot see it for any of the delta runs

      @@ -132597,7 +132830,7 @@

      set the log level for the openlineage spark library

      2023-09-05 03:33:03
      -

      *Thread Reply:* that's interesting. So, logging is done here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/main/java/io/openlineage/spark/agent/EventEmitter.java#L63 and this code is unaware of delta.

      +

      *Thread Reply:* that's interesting. So, logging is done here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/main/java/io/openlineage/spark/agent/EventEmitter.java#L63 and this code is unaware of delta.

      The possible problem could be filtering delta events (which we do bcz of delta being noisy)

      set the log level for the openlineage spark library
      2023-09-05 03:33:36
      -

*Thread Reply:* Recently, we've closed that https://github.com/OpenLineage/OpenLineage/issues/1982 which prevents generating events for `createOrReplaceTempView`

+

*Thread Reply:* Recently, we've closed that https://github.com/OpenLineage/OpenLineage/issues/1982 which prevents generating events for `createOrReplaceTempView`

      @@ -132716,7 +132949,7 @@

      set the log level for the openlineage spark library

      2023-09-05 03:35:12
      -

      *Thread Reply:* and this is the code change: https://github.com/OpenLineage/OpenLineage/pull/1987/files

      +

      *Thread Reply:* and this is the code change: https://github.com/OpenLineage/OpenLineage/pull/1987/files

      @@ -132742,7 +132975,7 @@

      set the log level for the openlineage spark library

      2023-09-05 05:19:22
      -

      *Thread Reply:* Hmm I'm a little confused here. I thought we are only filtering out events for certain specific commands, like show table etc. because its noisy right? Some important commands like MergeInto or SaveIntoDataSource used to be logged before, but I notice now that its not being logged anymore... +

      *Thread Reply:* Hmm I'm a little confused here. I thought we are only filtering out events for certain specific commands, like show table etc. because its noisy right? Some important commands like MergeInto or SaveIntoDataSource used to be logged before, but I notice now that its not being logged anymore... I'm using 0.23.0 openlineage version.

      @@ -132769,7 +133002,7 @@

      set the log level for the openlineage spark library

      2023-09-05 05:47:51
      -

      *Thread Reply:* yes, we do. it's just sometimes when doing a filter, we can remove too much. but SaveIntoDataSource and MergeInto should be fine, as we do check them within the tests

      +

      *Thread Reply:* yes, we do. it's just sometimes when doing a filter, we can remove too much. but SaveIntoDataSource and MergeInto should be fine, as we do check them within the tests

      @@ -132821,7 +133054,7 @@

      set the log level for the openlineage spark library

      2023-09-05 08:54:57
      -

      *Thread Reply:* map_index should be indeed included when calculating run ID (it’s deterministic in Airflow integration) +

      *Thread Reply:* map_index should be indeed included when calculating run ID (it’s deterministic in Airflow integration) what version of Airflow are you using btw?

      @@ -132848,7 +133081,7 @@

      set the log level for the openlineage spark library

      2023-09-05 09:04:14
      -

      *Thread Reply:* 2.7.0

      +

      *Thread Reply:* 2.7.0

      I do see this error log in all of my dynamic tasks which might explain it:

      @@ -132892,7 +133125,7 @@

      set the log level for the openlineage spark library

      2023-09-05 09:05:34
      -

      *Thread Reply:* I only have a few custom operators with the on_complete facet so I think this is a built in one - it runs before my task custom logs for example

      +

      *Thread Reply:* I only have a few custom operators with the on_complete facet so I think this is a built in one - it runs before my task custom logs for example

      @@ -132918,7 +133151,7 @@

      set the log level for the openlineage spark library

      2023-09-05 09:06:05
      -

      *Thread Reply:* and any time I messed up my custom facet, the error would be at the bottom of the logs. this is on top, probably an on_start facet?

      +

      *Thread Reply:* and any time I messed up my custom facet, the error would be at the bottom of the logs. this is on top, probably an on_start facet?

      @@ -132944,7 +133177,7 @@

      set the log level for the openlineage spark library

      2023-09-05 09:16:32
      -

      *Thread Reply:* seems like some circular import

      +

      *Thread Reply:* seems like some circular import

      @@ -132970,7 +133203,7 @@

      set the log level for the openlineage spark library

      2023-09-05 09:19:47
      -

      *Thread Reply:* I just tested it manually, it’s a bug in OL provider. let me fix that

      +

      *Thread Reply:* I just tested it manually, it’s a bug in OL provider. let me fix that

      @@ -132996,7 +133229,7 @@

      set the log level for the openlineage spark library

      2023-09-05 10:53:28
      -

      *Thread Reply:* cool, thanks. I am glad it is just a bug, I was afraid dynamic tasks were not supported for a minute there

      +

      *Thread Reply:* cool, thanks. I am glad it is just a bug, I was afraid dynamic tasks were not supported for a minute there

      @@ -133022,7 +133255,7 @@

      set the log level for the openlineage spark library

      2023-09-07 11:46:20
      -

      *Thread Reply:* how do the provider updates work? they can be released in between Airflow releases and issues for them are raised on the main Airflow repo?

      +

      *Thread Reply:* how do the provider updates work? they can be released in between Airflow releases and issues for them are raised on the main Airflow repo?

      @@ -133048,7 +133281,7 @@

      set the log level for the openlineage spark library

      2023-09-07 11:50:07
      -

      *Thread Reply:* generally speaking anything related to OL-Airflow should be placed to Airflow repo, important changes/bug fixes would be implemented in OL repo as well

      +

      *Thread Reply:* generally speaking anything related to OL-Airflow should be placed to Airflow repo, important changes/bug fixes would be implemented in OL repo as well

      @@ -133074,7 +133307,7 @@

      set the log level for the openlineage spark library

      2023-09-07 15:40:31
      -

      *Thread Reply:* got it, thanks

      +

      *Thread Reply:* got it, thanks

      @@ -133100,7 +133333,7 @@

      set the log level for the openlineage spark library

      2023-09-07 19:43:46
      -

      *Thread Reply:* is there a way for me to install the openlineage provider based on the commit you made to fix the circular imports?

      +

      *Thread Reply:* is there a way for me to install the openlineage provider based on the commit you made to fix the circular imports?

i was going to try to install from Airflow main branch but didn't want to mess anything up

      @@ -133128,7 +133361,7 @@

      set the log level for the openlineage spark library

      2023-09-07 19:44:39
      -

      *Thread Reply:* I saw it was merged to airflow main but it is not in 2.7.1 and there is no 1.0.3 provider version yet, so I wondered if I could manually install it for the time being

      +

      *Thread Reply:* I saw it was merged to airflow main but it is not in 2.7.1 and there is no 1.0.3 provider version yet, so I wondered if I could manually install it for the time being

      @@ -133154,7 +133387,7 @@

      set the log level for the openlineage spark library

      2023-09-08 05:45:48
      -

      *Thread Reply:* https://github.com/apache/airflow/blob/main/BREEZE.rst#preparing-provider-packages +

      *Thread Reply:* https://github.com/apache/airflow/blob/main/BREEZE.rst#preparing-provider-packages building the provider package on your own could be best idea probably? that depends on how you manage your Airflow instance

      @@ -133181,7 +133414,7 @@

      set the log level for the openlineage spark library

      2023-09-08 12:01:53
      -

      *Thread Reply:* there's 1.1.0rc1 btw

      +

      *Thread Reply:* there's 1.1.0rc1 btw

      @@ -133207,7 +133440,7 @@

      set the log level for the openlineage spark library

      2023-09-08 13:44:44
      -

      *Thread Reply:* perfect, thanks. I got started with breeze but then stopped haha

      +

      *Thread Reply:* perfect, thanks. I got started with breeze but then stopped haha

      @@ -133237,7 +133470,7 @@

      set the log level for the openlineage spark library

      2023-09-10 20:29:00
      -

      *Thread Reply:* The dynamic task mapping error is gone, I did run into this:

      +

      *Thread Reply:* The dynamic task mapping error is gone, I did run into this:

      File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/openlineage/extractors/base.py", line 70, in disabledoperators operator.strip() for operator in conf.get("openlineage", "disabledfor_operators").split(";") @@ -133271,7 +133504,7 @@

      set the log level for the openlineage spark library

      2023-09-10 20:49:17
      -

      *Thread Reply:* added "disabledforoperators" to my openlineage config and it worked (using Airflow helm chart - not sure if that means there is an error because the value I provided should just be the default value, not sure why I needed to explicitly specify it)

      +

      *Thread Reply:* added "disabledforoperators" to my openlineage config and it worked (using Airflow helm chart - not sure if that means there is an error because the value I provided should just be the default value, not sure why I needed to explicitly specify it)

      openlineage: disabledforoperators: "" @@ -133392,7 +133625,7 @@

      set the log level for the openlineage spark library

      2023-09-06 18:03:01
      -

      *Thread Reply:* @Willy Lulciuc wdyt?

      +

      *Thread Reply:* @Willy Lulciuc wdyt?

      @@ -133488,7 +133721,7 @@

      set the log level for the openlineage spark library

      2023-09-06 18:28:17
      -

      *Thread Reply:* Hello Bernat,

      +

      *Thread Reply:* Hello Bernat,

The primary mechanism to extend the model is through facets. You can either:
• create new standard facets in the spec: https://github.com/OpenLineage/OpenLineage/tree/main/spec/facets

@@ -133528,7 +133761,7 @@

      set the log level for the openlineage spark library

      2023-09-06 18:31:12
      -

*Thread Reply:* I see, so in general one is best off copying a standard facet and maintaining it under a different name. That way it can be made mandatory 🙂 and one does not need to be blocked for a long time until there's a community agreement 🤔

+

*Thread Reply:* I see, so in general one is best off copying a standard facet and maintaining it under a different name. That way it can be made mandatory 🙂 and one does not need to be blocked for a long time until there's a community agreement 🤔

      @@ -133554,7 +133787,7 @@

      set the log level for the openlineage spark library

      2023-09-06 18:35:43
      -

      *Thread Reply:* Yes, The goal of custom facets is to allow you to experiment and extend the spec however you want without having to wait for approval. +

      *Thread Reply:* Yes, The goal of custom facets is to allow you to experiment and extend the spec however you want without having to wait for approval. If the custom facet is very specific to a third party project/product then it makes sense for it to stay a custom facet. If it is more generic then it makes sense to add it to the core facets as part of the spec. Hopefully community agreement can be achieved relatively quickly. Unless someone is strongly against something, it can be added without too much red tape. Typically with support in at least one of the integrations to validate the model.

      @@ -133701,7 +133934,7 @@

      set the log level for the openlineage spark library

- 🙌 Mars Lan, Jarek Potiuk, Harel Shein, Maciej Obuchowski, Peter Hicks, Paweł Leszczyński, Dongjin Seo
+ 🙌 Mars Lan (Metaphor), Jarek Potiuk, Harel Shein, Maciej Obuchowski, Peter Hicks, Paweł Leszczyński, Dongjin Seo
      @@ -133757,7 +133990,7 @@

      set the log level for the openlineage spark library

      2023-09-11 17:09:40
      -

      *Thread Reply:* If I log this line I can tell the TokenProvider is the class instance I would expect: https://github.com/OpenLineage/OpenLineage/blob/45d94fb73b5488d34b8ca544b58317382ceb3980/client/python/openlineage/client/transport/http.py#L55

      +

      *Thread Reply:* If I log this line I can tell the TokenProvider is the class instance I would expect: https://github.com/OpenLineage/OpenLineage/blob/45d94fb73b5488d34b8ca544b58317382ceb3980/client/python/openlineage/client/transport/http.py#L55

      @@ -133808,7 +134041,7 @@

      set the log level for the openlineage spark library

      2023-09-11 17:11:14
      -

      *Thread Reply:* However, if I log the token_provider here I get the origin TokenProvider: https://github.com/OpenLineage/OpenLineage/blob/45d94fb73b5488d34b8ca544b58317382ceb3980/client/python/openlineage/client/transport/http.py#L154

      +

      *Thread Reply:* However, if I log the token_provider here I get the origin TokenProvider: https://github.com/OpenLineage/OpenLineage/blob/45d94fb73b5488d34b8ca544b58317382ceb3980/client/python/openlineage/client/transport/http.py#L154

      @@ -133859,7 +134092,7 @@

      set the log level for the openlineage spark library

      2023-09-11 17:18:56
      -

      *Thread Reply:* Ah I think I see the issue. Looks like this was introduced here, we are instantiating with the base token provider here when we should be using the subclass: https://github.com/OpenLineage/OpenLineage/pull/1869/files#diff-2f8ea6f9a22b5567de8ab56c6a63da8e7adf40cb436ee5e7e6b16e70a82afe05R57

      +

      *Thread Reply:* Ah I think I see the issue. Looks like this was introduced here, we are instantiating with the base token provider here when we should be using the subclass: https://github.com/OpenLineage/OpenLineage/pull/1869/files#diff-2f8ea6f9a22b5567de8ab56c6a63da8e7adf40cb436ee5e7e6b16e70a82afe05R57

      @@ -133885,7 +134118,7 @@

      set the log level for the openlineage spark library

      2023-09-11 17:37:42
      -

      *Thread Reply:* Opened a PR for this here: https://github.com/OpenLineage/OpenLineage/pull/2100

      +

      *Thread Reply:* Opened a PR for this here: https://github.com/OpenLineage/OpenLineage/pull/2100

      @@ -134085,18 +134318,14 @@

      set the log level for the openlineage spark library

      2023-09-12 08:15:19
      -

      *Thread Reply:* This is the error message:

      +

      *Thread Reply:* This is the error message:

      @@ -134124,7 +134353,7 @@

      set the log level for the openlineage spark library

      2023-09-12 10:38:41
      -

      *Thread Reply:* no permissions?

      +

      *Thread Reply:* no permissions?

      @@ -134153,15 +134382,11 @@

      set the log level for the openlineage spark library

I am trying to run Google Cloud Composer where i have added the openlineage-airflow pypi package as a dependency and have added the env OPENLINEAGE_EXTRACTORS to point to my custom extractor. I have added a folder by name dependencies and inside that i have placed my extractor file, and the path given to OPENLINEAGE_EXTRACTORS is dependencies.<filename>.<extractor_class_name>…still it fails with the exception saying No module named 'dependencies'. Can anyone kindly help me out on correcting my mistake

      @@ -134189,7 +134414,7 @@

      set the log level for the openlineage spark library

      2023-09-12 17:15:36
      -

      *Thread Reply:* Hey @Guntaka Jeevan Paul, can you share some details on which versions of airflow and openlineage you’re using?

      +

      *Thread Reply:* Hey @Guntaka Jeevan Paul, can you share some details on which versions of airflow and openlineage you’re using?

      @@ -134215,7 +134440,7 @@

      set the log level for the openlineage spark library

      2023-09-12 17:16:26
      -

*Thread Reply:* airflow ---> 2.5.3, openlineage-airflow ---> 1.1.0

+

*Thread Reply:* airflow ---> 2.5.3, openlineage-airflow ---> 1.1.0

      @@ -134241,7 +134466,7 @@

      set the log level for the openlineage spark library

      2023-09-12 17:45:08
      -

*Thread Reply:* ```import traceback

+

*Thread Reply:* ```import traceback
import uuid
from typing import List, Optional

      @@ -134295,7 +134520,7 @@

      set the log level for the openlineage spark library

      2023-09-12 17:45:24
      -

      *Thread Reply:* this is the custom extractor code that im trying with

      +

      *Thread Reply:* this is the custom extractor code that im trying with
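For context, a skeleton of such an extractor (a sketch against the openlineage-airflow 1.x base classes; the class and operator names mirror the ones discussed in this thread):

```python
from typing import List, Optional

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata


class BigQueryInsertJobExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        # operator class names this extractor should handle
        return ["BigQueryInsertJobOperator"]

    def extract(self) -> Optional[TaskMetadata]:
        # called around task execution; returns the lineage metadata
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            inputs=[],
            outputs=[],
            run_facets={},
            job_facets={},
        )
```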

      @@ -134321,7 +134546,7 @@

      set the log level for the openlineage spark library

      2023-09-12 21:10:02
      -

      *Thread Reply:* thanks @Guntaka Jeevan Paul, will try to take a deeper look tomorrow

      +

      *Thread Reply:* thanks @Guntaka Jeevan Paul, will try to take a deeper look tomorrow

      @@ -134347,7 +134572,7 @@

      set the log level for the openlineage spark library

      2023-09-13 07:54:26
      -

      *Thread Reply:* No module named 'dependencies'. +

      *Thread Reply:* No module named 'dependencies'. This sounds like general Python problem

      @@ -134374,7 +134599,7 @@

      set the log level for the openlineage spark library

      2023-09-13 07:55:12
      -

      *Thread Reply:* https://stackoverflow.com/questions/69991553/how-to-import-custom-modules-in-cloud-composer

      +

      *Thread Reply:* https://stackoverflow.com/questions/69991553/how-to-import-custom-modules-in-cloud-composer

      @@ -134426,7 +134651,7 @@

      set the log level for the openlineage spark library

      2023-09-13 07:56:28
      -

      *Thread Reply:* basically, if you're able to import the file from your dag code, OL should be able too

      +

      *Thread Reply:* basically, if you're able to import the file from your dag code, OL should be able too

      @@ -134452,7 +134677,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:01:12
      -

      *Thread Reply:* The Problem is in the GCS Composer there is a component called Triggerer, which they say is used for deferrable operators…i have logged into that pod and i could see that the GCS Bucket is not mounted on this, but i am unable to understand why is the initialisation happening inside the triggerer pod

      +

      *Thread Reply:* The Problem is in the GCS Composer there is a component called Triggerer, which they say is used for deferrable operators…i have logged into that pod and i could see that the GCS Bucket is not mounted on this, but i am unable to understand why is the initialisation happening inside the triggerer pod

      @@ -134478,18 +134703,14 @@

      set the log level for the openlineage spark library

      2023-09-13 08:01:32
      -

      *Thread Reply:*

      +

      *Thread Reply:*

      @@ -134517,7 +134738,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:01:47
      -

      *Thread Reply:* > The Problem is in the GCS Composer there is a component called Triggerer, which they say is used for deferrable operators…i have logged into that pod and i could see that the GCS Bucket is not mounted on this, but i am unable to understand why is the initialisation happening inside the triggerer pod +

      *Thread Reply:* > The Problem is in the GCS Composer there is a component called Triggerer, which they say is used for deferrable operators…i have logged into that pod and i could see that the GCS Bucket is not mounted on this, but i am unable to understand why is the initialisation happening inside the triggerer pod OL integration is not running on triggerer, only on worker and scheduler pods

      @@ -134544,18 +134765,14 @@

      set the log level for the openlineage spark library

      2023-09-13 08:01:53
      -

      *Thread Reply:*

      +

      *Thread Reply:*

      @@ -134583,7 +134800,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:03:26
      -

      *Thread Reply:* As you can see in this screenshot i am seeing the logs of the triggerer and it says clearly unable to import plugin openlineage

      +

      *Thread Reply:* As you can see in this screenshot i am seeing the logs of the triggerer and it says clearly unable to import plugin openlineage

      @@ -134609,18 +134826,14 @@

      set the log level for the openlineage spark library

      2023-09-13 08:03:29
      2023-09-13 08:10:32
      -

*Thread Reply:* I see. There are a few possible things to do here - composer could mount the user files, Airflow could not start plugins on triggerer, or we could detect we're on triggerer and not import anything there. However, does it impact OL or Airflow operation in any other way than this log?

+

*Thread Reply:* I see. There are a few possible things to do here - composer could mount the user files, Airflow could not start plugins on triggerer, or we could detect we're on triggerer and not import anything there. However, does it impact OL or Airflow operation in any other way than this log?

      @@ -134674,7 +134887,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:12:06
      -

      *Thread Reply:* Probably we'd have to do something if that really bothers you as there won't be further changes to Airflow 2.5

      +

      *Thread Reply:* Probably we'd have to do something if that really bothers you as there won't be further changes to Airflow 2.5

      @@ -134700,7 +134913,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:18:14
      -

*Thread Reply:* The Problem is it is actually not registering this custom extractor written by me, hence i am just receiving the DefaultExtractor things and my piece of extractor code is not even getting triggered

+

*Thread Reply:* The Problem is it is actually not registering this custom extractor written by me, hence i am just receiving the DefaultExtractor things and my piece of extractor code is not even getting triggered

      @@ -134726,7 +134939,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:22:49
      -

      *Thread Reply:* any suggestions to try @Maciej Obuchowski

      +

      *Thread Reply:* any suggestions to try @Maciej Obuchowski

      @@ -134752,7 +134965,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:27:48
      -

      *Thread Reply:* Could you share worker logs?

      +

      *Thread Reply:* Could you share worker logs?

      @@ -134778,7 +134991,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:27:56
      -

      *Thread Reply:* and check if module is importable from your dag code?

      +

      *Thread Reply:* and check if module is importable from your dag code?

      @@ -134804,10 +135017,10 @@

      set the log level for the openlineage spark library

      2023-09-13 08:31:25
      -

      *Thread Reply:* these are the worker pod logs…where there is no log of openlineageplugin

      +

      *Thread Reply:* these are the worker pod logs…where there is no log of openlineageplugin

      - + @@ -134839,7 +135052,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:31:52
      -

      *Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1694608076879469?thread_ts=1694545905.974339&cid=C01CK9T7HKR --> sure will check now on this one

      +

      *Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1694608076879469?thread_ts=1694545905.974339&cid=C01CK9T7HKR --> sure will check now on this one

      @@ -134900,7 +135113,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:38:32
      -

*Thread Reply:* {

+

*Thread Reply:* {
  "textPayload": "Traceback (most recent call last): File \"/opt/python3.8/lib/python3.8/site-packages/openlineage/airflow/utils.py\", line 427, in import_from_string module = importlib.import_module(module_path) File \"/opt/python3.8/lib/python3.8/importlib/__init__.py\", line 127, in import_module return _bootstrap._gcd_import(name[level:], package, level) File \"<frozen importlib._bootstrap>\", line 1014, in _gcd_import File \"<frozen importlib._bootstrap>\", line 991, in _find_and_load File \"<frozen importlib._bootstrap>\", line 961, in _find_and_load_unlocked File \"<frozen importlib._bootstrap>\", line 219, in _call_with_frames_removed File \"<frozen importlib._bootstrap>\", line 1014, in _gcd_import File \"<frozen importlib._bootstrap>\", line 991, in _find_and_load File \"<frozen importlib._bootstrap>\", line 961, in _find_and_load_unlocked File \"<frozen importlib._bootstrap>\", line 219, in _call_with_frames_removed File \"<frozen importlib._bootstrap>\", line 1014, in _gcd_import File \"<frozen importlib._bootstrap>\", line 991, in _find_and_load File \"<frozen importlib._bootstrap>\", line 973, in _find_and_load_unlocked ModuleNotFoundError: No module named 'airflow.gcs'",
  "insertId": "pt2eu6fl9z5vw",
  "resource": {

@@ -134946,18 +135159,14 @@

      set the log level for the openlineage spark library

      2023-09-13 08:44:11
      -

*Thread Reply:* this is one of the experiments that i have tried, but then i reverted it back to keeping it as dependencies.big_query_insert_job_extractor.BigQueryInsertJobExtractor…where dependencies is a module i have created inside my dags folder

+

*Thread Reply:* this is one of the experiments that i have tried, but then i reverted it back to keeping it as dependencies.big_query_insert_job_extractor.BigQueryInsertJobExtractor…where dependencies is a module i have created inside my dags folder

      @@ -134985,18 +135194,14 @@

      set the log level for the openlineage spark library

      2023-09-13 08:44:33
      2023-09-13 08:45:46
      -

      *Thread Reply:* these are the logs of the triggerer pod specifically

      +

      *Thread Reply:* these are the logs of the triggerer pod specifically

      @@ -135063,7 +135264,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:46:31
      -

      *Thread Reply:* yeah it would be expected to have this in triggerer where it's not mounted, but will it behave the same for worker where it's mounted?

      +

      *Thread Reply:* yeah it would be expected to have this in triggerer where it's not mounted, but will it behave the same for worker where it's mounted?

      @@ -135089,7 +135290,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:47:09
      -

*Thread Reply:* maybe __init__.py is missing for top-level dag path?

+

*Thread Reply:* maybe __init__.py is missing for top-level dag path?

      @@ -135115,18 +135316,14 @@

      set the log level for the openlineage spark library

      2023-09-13 08:49:01
      -

      *Thread Reply:* these are the logs of the worker pod at startup, where it does not complain of the plugin like in triggerer, but when tasks are run on this worker…somehow it is not picking up the extractor for the operator that i have written it for

      +

      *Thread Reply:* these are the logs of the worker pod at startup, where it does not complain of the plugin like in triggerer, but when tasks are run on this worker…somehow it is not picking up the extractor for the operator that i have written it for

      @@ -135154,7 +135351,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:49:54
      -

      *Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1694609229577469?thread_ts=1694545905.974339&cid=C01CK9T7HKR --> you mean to make the dags folder as well like a module by adding the init.py?

      +

      *Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1694609229577469?thread_ts=1694545905.974339&cid=C01CK9T7HKR --> you mean to make the dags folder as well like a module by adding the init.py?

      @@ -135215,7 +135412,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:55:24
      -

      *Thread Reply:* yes, I would put whole custom code directly in dags folder, to make sure import paths are the problem

      +

      *Thread Reply:* yes, I would put whole custom code directly in dags folder, to make sure import paths are the problem

      @@ -135241,7 +135438,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:55:48
      -

      *Thread Reply:* and would be nice if you could set +

      *Thread Reply:* and would be nice if you could set AIRFLOW__LOGGING__LOGGING_LEVEL="DEBUG"

      @@ -135268,15 +135465,15 @@

      set the log level for the openlineage spark library

      2023-09-13 09:14:58
      -

      *Thread Reply:* ```Starting the process, got command: triggerer +

*Thread Reply:* ```Starting the process, got command: triggerer
Initializing airflow.cfg.
airflow.cfg initialization is done.
[2023-09-13T13:11:46.620+0000] {settings.py:267} DEBUG - Setting up DB connection pool (PID 8)
[2023-09-13T13:11:46.622+0000] {settings.py:372} DEBUG - settings.prepare_engine_args(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=570, pid=8
[2023-09-13T13:11:46.742+0000] {cli_action_loggers.py:39} DEBUG - Adding <function default_action_log at 0x7ff39ca1d3a0> to pre execution callback
[2023-09-13T13:11:47.638+0000] {cli_action_loggers.py:65} DEBUG - Calling callbacks: [<function default_action_log at 0x7ff39ca1d3a0>]
[Airflow ASCII-art banner]

@@ -135387,7 +135584,7 @@

      set the log level for the openlineage spark library

      2023-09-13 09:15:10
      -

      *Thread Reply:* still the same error in the triggerer pod

      +

      *Thread Reply:* still the same error in the triggerer pod

      @@ -135413,18 +135610,14 @@

      set the log level for the openlineage spark library

      2023-09-13 09:16:23
      -

*Thread Reply:* have changed the dags folder where i have added the init file as you suggested and then have updated the OPENLINEAGE_EXTRACTORS to big_query_insert_job_extractor.BigQueryInsertJobExtractor…still the same thing

+

*Thread Reply:* have changed the dags folder where i have added the init file as you suggested and then have updated the OPENLINEAGE_EXTRACTORS to big_query_insert_job_extractor.BigQueryInsertJobExtractor…still the same thing

      @@ -135452,7 +135645,7 @@

      set the log level for the openlineage spark library

      2023-09-13 09:36:27
      -

      *Thread Reply:* > still the same error in the triggerer pod +

      *Thread Reply:* > still the same error in the triggerer pod it won't change, we're not trying to fix the triggerer import but worker, and should look only at worker pod at this point

      @@ -135479,7 +135672,7 @@

      set the log level for the openlineage spark library

      2023-09-13 09:43:34
      -

*Thread Reply:* ```extractor for <class 'airflow.providers.google.cloud.operators.bigquery.BigQueryInsertJobOperator'> is <class 'big_query_insert_job_extractor.BigQueryInsertJobExtractor'

+

*Thread Reply:* ```extractor for <class 'airflow.providers.google.cloud.operators.bigquery.BigQueryInsertJobOperator'> is <class 'big_query_insert_job_extractor.BigQueryInsertJobExtractor'

Using extractor BigQueryInsertJobExtractor task_type=BigQueryInsertJobOperator airflow_dag_id=data_analytics_dag task_id=join_bq_datasets.bq_join_holidays_weather_data_2021 airflow_run_id=manual_2023-09-13T13:24:08.946947+00:00

      @@ -135512,7 +135705,7 @@

      set the log level for the openlineage spark library

      2023-09-13 09:44:44
      -

      *Thread Reply:* able to see these logs in the worker pod…so what you said is right that it is able to get the extractor but i get the below error immediately where it says not a git repository

      +

      *Thread Reply:* able to see these logs in the worker pod…so what you said is right that it is able to get the extractor but i get the below error immediately where it says not a git repository

      @@ -135538,7 +135731,7 @@

      set the log level for the openlineage spark library

      2023-09-13 09:45:24
      -

      *Thread Reply:* seems like we are almost there nearby…am i missing something obvious

      +

      *Thread Reply:* seems like we are almost there nearby…am i missing something obvious

      @@ -135564,7 +135757,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:06:35
      -

*Thread Reply:* > fatal: not a git repository (or any parent up to mount point /home/airflow)

+

*Thread Reply:* > fatal: not a git repository (or any parent up to mount point /home/airflow)
> Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
> fatal: not a git repository (or any parent up to mount point /home/airflow)
> Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).

@@ -135594,7 +135787,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:06:51
      -

      *Thread Reply:* that’s casual log in composer

      +

      *Thread Reply:* that’s casual log in composer

      @@ -135620,7 +135813,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:12:16
      -

*Thread Reply:* extractor for <class 'airflow.providers.google.cloud.operators.bigquery.BigQueryInsertJobOperator'> is <class 'big_query_insert_job_extractor.BigQueryInsertJobExtractor'

+

*Thread Reply:* extractor for <class 'airflow.providers.google.cloud.operators.bigquery.BigQueryInsertJobOperator'> is <class 'big_query_insert_job_extractor.BigQueryInsertJobExtractor'
that’s actually class from your custom module, right?

      @@ -135647,18 +135840,14 @@

      set the log level for the openlineage spark library

      2023-09-13 10:14:03
      -

      *Thread Reply:* I’ve done experiment, that’s how gcs looks like

      +

      *Thread Reply:* I’ve done experiment, that’s how gcs looks like

      @@ -135686,18 +135875,14 @@

      set the log level for the openlineage spark library

      2023-09-13 10:14:09
      -

      *Thread Reply:* and env vars

      +

      *Thread Reply:* and env vars

      @@ -135725,7 +135910,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:14:19
      -

      *Thread Reply:* I have this extractor detected as expected

      +

      *Thread Reply:* I have this extractor detected as expected

      @@ -135751,7 +135936,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:15:06
      -

*Thread Reply:* seen as <class 'dependencies.bq.BigQueryInsertJobExtractor'>

+

*Thread Reply:* seen as <class 'dependencies.bq.BigQueryInsertJobExtractor'>

      @@ -135777,7 +135962,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:16:02
      -

      *Thread Reply:* no __init__.py in base dags folder

      +

      *Thread Reply:* no __init__.py in base dags folder

      @@ -135803,7 +135988,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:17:02
      -

      *Thread Reply:* I also checked that triggerer pod indeed has no gcsfuse set up, tbh no idea why, maybe some kind of optimization +

      *Thread Reply:* I also checked that triggerer pod indeed has no gcsfuse set up, tbh no idea why, maybe some kind of optimization the only effect is that when loading plugins in triggerer it throws some errors in logs, we don’t do anything at the moment there
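Putting the working setup together, the wiring amounts to something like this (values illustrative, mirroring the dags/dependencies/bq.py layout described above; OPENLINEAGE_EXTRACTORS takes fully qualified paths, semicolon-separated if there are several):

```
# env vars on the Airflow worker/scheduler pods:
OPENLINEAGE_EXTRACTORS=dependencies.bq.BigQueryInsertJobExtractor
OPENLINEAGE_URL=http://marquez:5000
```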

      @@ -135830,7 +136015,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:19:26
      -

      *Thread Reply:* okk…got it @Jakub Dardziński…so the init at the top level of dags is as well not reqd, got it. Just one more doubt, there is a requirement where i want to change the operators property in the extractor inside the extract function, will that be taken into account and the operator’s execute be called with the property that i have populated in my extractor?

      +

      *Thread Reply:* okk…got it @Jakub Dardziński…so the init at the top level of dags is as well not reqd, got it. Just one more doubt, there is a requirement where i want to change the operators property in the extractor inside the extract function, will that be taken into account and the operator’s execute be called with the property that i have populated in my extractor?

      @@ -135856,7 +136041,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:21:28
      -

*Thread Reply:* for example i want to add a custom job_id to the BigQueryInsertJobOperator, so whenever someone uses the BigQueryInsertJobOperator operator i want to intercept that and add this job_id property to the operator…will that work?

+

*Thread Reply:* for example i want to add a custom job_id to the BigQueryInsertJobOperator, so whenever someone uses the BigQueryInsertJobOperator operator i want to intercept that and add this job_id property to the operator…will that work?

      @@ -135882,7 +136067,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:24:46
      -

      *Thread Reply:* I’m not sure if using OL for such thing is best choice. Wouldn’t it be better to subclass the operator?

      +

      *Thread Reply:* I’m not sure if using OL for such thing is best choice. Wouldn’t it be better to subclass the operator?

      @@ -135908,7 +136093,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:25:37
      -

*Thread Reply:* but the answer is: it depends on the airflow version, in 2.3+ I’m pretty sure the changed property stays in execute method

+

*Thread Reply:* but the answer is: it depends on the airflow version, in 2.3+ I’m pretty sure the changed property stays in execute method

      @@ -135934,7 +136119,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:27:49
      -

*Thread Reply:* yeah ideally that is how we should have done this but the problem is our client has around 1000+ DAGs in different google cloud projects, which are owned by multiple teams…so they are not willing to change anything in their dag. Thankfully they are using airflow 2.4.3

+

*Thread Reply:* yeah ideally that is how we should have done this but the problem is our client has around 1000+ DAGs in different google cloud projects, which are owned by multiple teams…so they are not willing to change anything in their dag. Thankfully they are using airflow 2.4.3

      @@ -135960,7 +136145,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:31:15
      -

      *Thread Reply:* task_policy might be better tool for that: https://airflow.apache.org/docs/apache-airflow/2.6.0/administration-and-deployment/cluster-policies.html

      +

      *Thread Reply:* task_policy might be better tool for that: https://airflow.apache.org/docs/apache-airflow/2.6.0/administration-and-deployment/cluster-policies.html
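A minimal sketch of that approach (the attribute and value are illustrative; task_policy lives in airflow_local_settings.py, which must be importable wherever DAGs are parsed):

```python
# airflow_local_settings.py
from airflow.models import BaseOperator


def task_policy(task: BaseOperator) -> None:
    # applied at DAG parse time, cluster-wide, without touching DAG code
    if task.task_type == "BigQueryInsertJobOperator":
        task.job_id = "custom-job-id"  # hypothetical attribute/value
```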

      @@ -135990,7 +136175,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:35:30
      -

      *Thread Reply:* btw I double-checked - execute method is in different process so this would not change task’s attribute there

      +

      *Thread Reply:* btw I double-checked - execute method is in different process so this would not change task’s attribute there

      @@ -136016,7 +136201,7 @@

      set the log level for the openlineage spark library

      2023-09-16 03:32:49
      -

      *Thread Reply:* @Jakub Dardziński any idea how can we achieve this one. ---> https://openlineage.slack.com/archives/C01CK9T7HKR/p1694849427228709

      +

      *Thread Reply:* @Jakub Dardziński any idea how can we achieve this one. ---> https://openlineage.slack.com/archives/C01CK9T7HKR/p1694849427228709

      @@ -136106,12 +136291,12 @@

      set the log level for the openlineage spark library

      -
Mars Lan

+

Mars Lan (Metaphor)
(mars@metaphor.io)
      2023-09-12 17:34:29
      -

      *Thread Reply:* I'm getting quite close with MWAA. See https://openlineage.slack.com/archives/C01CK9T7HKR/p1692743745585879.

      +

      *Thread Reply:* I'm getting quite close with MWAA. See https://openlineage.slack.com/archives/C01CK9T7HKR/p1692743745585879.

      @@ -136204,7 +136389,7 @@

      set the log level for the openlineage spark library

      2023-09-13 02:54:37
      -

      *Thread Reply:* The Spark integration requires that two parameters are passed to it, namely:

      +

      *Thread Reply:* The Spark integration requires that two parameters are passed to it, namely:

spark.openlineage.parentJobName
spark.openlineage.parentRunId

@@ -136236,7 +136421,7 @@

      set the log level for the openlineage spark library

      2023-09-13 02:55:51
      -

      *Thread Reply:* Thanks, will check this out

      +

      *Thread Reply:* Thanks, will check this out
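
For reference, a hedged sketch of passing those two parameters when building the session (assumes the openlineage-spark jar is already on the classpath; the listener class is real, but the job name and run id values here are placeholders):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.parentJobName", "my_dag.my_task")
    .config("spark.openlineage.parentRunId", "01234567-0123-0123-0123-0123456789ab")
    .getOrCreate()
)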

      @@ -136262,7 +136447,7 @@

      set the log level for the openlineage spark library

      2023-09-13 02:57:43
      -

      *Thread Reply:* As for double accounting of events - that's a bit harder to diagnose.

      +

      *Thread Reply:* As for double accounting of events - that's a bit harder to diagnose.

      @@ -136288,7 +136473,7 @@

      set the log level for the openlineage spark library

      2023-09-13 04:33:03
      -

*Thread Reply:* Can you share the job and events?

+

*Thread Reply:* Can you share the job and events?
Also @Paweł Leszczyński

      @@ -136315,7 +136500,7 @@

      set the log level for the openlineage spark library

      2023-09-13 06:03:49
      -

      *Thread Reply:* Sure, sharing Job and events.

      +

      *Thread Reply:* Sure, sharing Job and events.

      @@ -136327,7 +136512,7 @@

      set the log level for the openlineage spark library

-
+

@@ -136359,10 +136544,10 @@

      set the log level for the openlineage spark library

      2023-09-13 06:06:21
      -

      *Thread Reply:*

      +

      *Thread Reply:*

      - + @@ -136394,7 +136579,7 @@

      set the log level for the openlineage spark library

      2023-09-13 06:39:02
      -

      *Thread Reply:* Hi @Suraj Gupta,

      +

      *Thread Reply:* Hi @Suraj Gupta,

      Thanks for providing such a detailed description of the problem.

      @@ -136430,7 +136615,7 @@

      set the log level for the openlineage spark library

      2023-09-13 07:28:03
      -

      *Thread Reply:* Thanks @Paweł Leszczyński, will confirm about writing to file and get back.

      +

      *Thread Reply:* Thanks @Paweł Leszczyński, will confirm about writing to file and get back.

      @@ -136456,7 +136641,7 @@

      set the log level for the openlineage spark library

      2023-09-13 07:33:35
      -

      *Thread Reply:* And yes, the issue is reproducible. Will raise an issue for this.

      +

      *Thread Reply:* And yes, the issue is reproducible. Will raise an issue for this.

      @@ -136482,7 +136667,7 @@

      set the log level for the openlineage spark library

      2023-09-13 07:33:54
      -

      *Thread Reply:* even if you write onto a file?

      +

      *Thread Reply:* even if you write onto a file?

      @@ -136508,7 +136693,7 @@

      set the log level for the openlineage spark library

      2023-09-13 07:37:21
      -

      *Thread Reply:* Yes, even when I write to a parquet file.

      +

      *Thread Reply:* Yes, even when I write to a parquet file.

      @@ -136534,7 +136719,7 @@

      set the log level for the openlineage spark library

      2023-09-13 07:49:28
      -

      *Thread Reply:* ok. i think i was able to reproduce it locally with https://github.com/OpenLineage/OpenLineage/pull/2103/files

      +

      *Thread Reply:* ok. i think i was able to reproduce it locally with https://github.com/OpenLineage/OpenLineage/pull/2103/files

      @@ -136560,7 +136745,7 @@

      set the log level for the openlineage spark library

      2023-09-13 07:56:11
      -

      *Thread Reply:* Opened an issue: https://github.com/OpenLineage/OpenLineage/issues/2104

      +

      *Thread Reply:* Opened an issue: https://github.com/OpenLineage/OpenLineage/issues/2104

      @@ -136586,7 +136771,7 @@

      set the log level for the openlineage spark library

      2023-09-25 16:32:09
      -

      *Thread Reply:* @Paweł Leszczyński I see that the PR is work in progress. Any rough estimate on when we can expect this fix to be released?

      +

      *Thread Reply:* @Paweł Leszczyński I see that the PR is work in progress. Any rough estimate on when we can expect this fix to be released?

      @@ -136612,7 +136797,7 @@

      set the log level for the openlineage spark library

      2023-09-26 03:32:03
      -

*Thread Reply:* @Suraj Gupta I put a comment within your issue. It's a bug we need to solve, but I cannot give any estimates today.

      +

*Thread Reply:* @Suraj Gupta I put a comment within your issue. It's a bug we need to solve, but I cannot give any estimates today.

      @@ -136638,7 +136823,7 @@

      set the log level for the openlineage spark library

      2023-09-26 04:33:03
      -

*Thread Reply:* Thanks for the update @Paweł Leszczyński, also please look into this comment. It might be related and I'm not sure if it's expected behaviour.

      +

*Thread Reply:* Thanks for the update @Paweł Leszczyński, also please look into this comment. It might be related and I'm not sure if it's expected behaviour.

      @@ -136907,7 +137092,7 @@

      set the log level for the openlineage spark library

      2023-09-14 07:32:25
      -

*Thread Reply:* Seems like an issue on our side. Do you know how the source is read? What LogicalPlan leaf is used to read src? Would love to find out how this is done differently

      +

*Thread Reply:* Seems like an issue on our side. Do you know how the source is read? What LogicalPlan leaf is used to read src? Would love to find out how this is done differently

      @@ -136933,7 +137118,7 @@

      set the log level for the openlineage spark library

      2023-09-14 09:16:58
      -

*Thread Reply:* Hmm, I'll have to run an explain plan to see what exactly it is.

      +

*Thread Reply:* Hmm, I'll have to run an explain plan to see what exactly it is.

However my sample job uses spark.sql("SELECT * FROM dhawes.ol_test_hadoop_src")

      @@ -136965,7 +137150,7 @@

      set the log level for the openlineage spark library

      2023-09-14 09:23:59
      -

*Thread Reply:* >>> spark.sql("SELECT * FROM dhawes.ol_test_hadoop_src").explain(True)

+

*Thread Reply:* >>> spark.sql("SELECT * FROM dhawes.ol_test_hadoop_src").explain(True)
== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation `dhawes`.`ol_test_hadoop_src`

      @@ -136996,7 +137181,7 @@

      set the log level for the openlineage spark library

-
+
      @@ -137029,7 +137214,7 @@

      set the log level for the openlineage spark library

-
+
      @@ -137039,7 +137224,7 @@

      set the log level for the openlineage spark library

      2023-09-14 10:05:19
      -

*Thread Reply:* Especially the first PR is affecting users of the astronomer-cosmos library: https://github.com/astronomer/astronomer-cosmos/issues/533

      +

*Thread Reply:* Especially the first PR is affecting users of the astronomer-cosmos library: https://github.com/astronomer/astronomer-cosmos/issues/533

      @@ -137065,7 +137250,7 @@

      set the log level for the openlineage spark library

      2023-09-14 10:05:24
      -

      *Thread Reply:* Thanks @tati for requesting your first OpenLineage release! Three +1s from committers will authorize

      +

      *Thread Reply:* Thanks @tati for requesting your first OpenLineage release! Three +1s from committers will authorize

      @@ -137095,7 +137280,7 @@

      set the log level for the openlineage spark library

      2023-09-14 11:59:55
      -

      *Thread Reply:* The release is authorized and will be initiated within two business days.

      +

      *Thread Reply:* The release is authorized and will be initiated within two business days.

      @@ -137115,7 +137300,7 @@

      set the log level for the openlineage spark library

-
+
      @@ -137125,7 +137310,7 @@

      set the log level for the openlineage spark library

      2023-09-15 04:40:12
      -

      *Thread Reply:* Thanks a lot, @Michael Robinson!

      +

      *Thread Reply:* Thanks a lot, @Michael Robinson!

      @@ -137184,7 +137369,7 @@

      set the log level for the openlineage spark library

      2023-10-03 20:33:35
      -

*Thread Reply:* I have cleaned up the registry proposal.

+

*Thread Reply:* I have cleaned up the registry proposal.
https://docs.google.com/document/d/1zIxKST59q3I6ws896M4GkUn7IsueLw8ejct5E-TR0vY/edit
In particular:
• I clarified that option 2 is preferred at this point.

@@ -137216,7 +137401,7 @@

      set the log level for the openlineage spark library

      2023-10-05 17:34:12
      -

*Thread Reply:* I have created a ticket to make this easier to find. Once I get more feedback I’ll turn it into an md file in the repo: https://docs.google.com/document/d/1zIxKST59q3I6ws896M4GkUn7IsueLw8ejct5E-TR0vY/edit#heading=h.enpbmvu7n8gu

+

*Thread Reply:* I have created a ticket to make this easier to find. Once I get more feedback I’ll turn it into an md file in the repo: https://docs.google.com/document/d/1zIxKST59q3I6ws896M4GkUn7IsueLw8ejct5E-TR0vY/edit#heading=h.enpbmvu7n8gu
https://github.com/OpenLineage/OpenLineage/issues/2161

      @@ -137394,7 +137579,7 @@

      set the log level for the openlineage spark library

      2023-09-16 04:22:47
      -

*Thread Reply:* you don't need the actual task instance to do that. You only need to set the additional argument as a Jinja template, same as above

      +

*Thread Reply:* you don't need the actual task instance to do that. You only need to set the additional argument as a Jinja template, same as above

      @@ -137420,7 +137605,7 @@

      set the log level for the openlineage spark library

      2023-09-16 04:25:28
      -

*Thread Reply:* task_instance in this case is just part of a string which is evaluated when the Jinja render happens

      +

*Thread Reply:* task_instance in this case is just part of a string which is evaluated when the Jinja render happens

      @@ -137446,7 +137631,7 @@

      set the log level for the openlineage spark library

      2023-09-16 04:27:10
      -

*Thread Reply:* ohh…then we could use the same example as above inside the task_policy to intercept the operator and add the OpenLineage-specific properties?

      +

*Thread Reply:* ohh…then we could use the same example as above inside the task_policy to intercept the operator and add the OpenLineage-specific properties?

      @@ -137472,7 +137657,7 @@

      set the log level for the openlineage spark library

      2023-09-16 04:30:59
      -

*Thread Reply:* correct, just remember not to override all properties, just add the OL-specific ones

      +

*Thread Reply:* correct, just remember not to override all properties, just add the OL-specific ones
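
Putting the two ideas together, a hedged sketch (the operator and its properties attribute are hypothetical; {{ dag.dag_id }}, {{ task.task_id }} and {{ run_id }} are standard Airflow template variables, rendered per task instance):

# airflow_local_settings.py
from airflow.models.baseoperator import BaseOperator

def task_policy(task: BaseOperator) -> None:
    if task.task_type == "DataprocSubmitJobOperator":  # hypothetical operator
        props = getattr(task, "properties", None) or {}
        # only add the OL-specific keys, leave everything else untouched
        props.setdefault("spark.openlineage.parentJobName", "{{ dag.dag_id }}.{{ task.task_id }}")
        props.setdefault("spark.openlineage.parentRunId", "{{ run_id }}")
        task.properties = props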

      @@ -137498,7 +137683,7 @@

      set the log level for the openlineage spark library

      2023-09-16 04:32:02
      -

      *Thread Reply:* yeah sure…thank you so much @Jakub Dardziński, will try this out and keep you posted

      +

      *Thread Reply:* yeah sure…thank you so much @Jakub Dardziński, will try this out and keep you posted

      @@ -137528,7 +137713,7 @@

      set the log level for the openlineage spark library

      2023-09-16 05:00:24
      -

      *Thread Reply:* We want to automate setting those options at some point inside the operator itself

      +

      *Thread Reply:* We want to automate setting those options at some point inside the operator itself

      @@ -137584,7 +137769,7 @@

      set the log level for the openlineage spark library

      2023-09-19 16:40:55
      -

      *Thread Reply:* there’s no out-of-the-box possibility to do that yet, you’re very welcome to create an issue in GitHub and maybe contribute as well! 🙂

      +

      *Thread Reply:* there’s no out-of-the-box possibility to do that yet, you’re very welcome to create an issue in GitHub and maybe contribute as well! 🙂

      @@ -137605,7 +137790,7 @@

      set the log level for the openlineage spark library

      -
Mars Lan
+
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-09-17 09:07:41
      @@ -137668,7 +137853,7 @@

      set the log level for the openlineage spark library

      2023-09-18 16:25:11
      -

*Thread Reply:* That’s correct. For now there’s no way to configure the endpoint via env var. You can do that by using a config file

      +

*Thread Reply:* That’s correct. For now there’s no way to configure the endpoint via env var. You can do that by using a config file
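
For reference, a minimal openlineage.yml sketch for the Python client (the URL and endpoint values are placeholders):

transport:
  type: http
  url: http://localhost:5000
  endpoint: api/v1/lineage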

      @@ -137689,12 +137874,12 @@

      set the log level for the openlineage spark library

      -
Mars Lan
+
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-09-18 16:30:39
      -

      *Thread Reply:* How do you do that in Airflow? Any particular reason for excluding endpoint override via env var? Happy to create a PR to fix that.

      +

      *Thread Reply:* How do you do that in Airflow? Any particular reason for excluding endpoint override via env var? Happy to create a PR to fix that.

      @@ -137720,7 +137905,7 @@

      set the log level for the openlineage spark library

      2023-09-18 16:52:48
      -

      *Thread Reply:* historical I guess? go for the PR, of course 🚀

      +

      *Thread Reply:* historical I guess? go for the PR, of course 🚀

      @@ -137741,12 +137926,12 @@

      set the log level for the openlineage spark library

      -
Mars Lan
+
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-10-03 08:52:16
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2151

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2151

      @@ -137858,7 +138043,7 @@

      set the log level for the openlineage spark library

      2023-09-18 16:26:07
      -

*Thread Reply:* there’s no actual single source of truth for what integrations are currently implemented in the OpenLineage Airflow provider. That’s something we should work on so it’s more visible

      +

*Thread Reply:* there’s no actual single source of truth for what integrations are currently implemented in the OpenLineage Airflow provider. That’s something we should work on so it’s more visible

      @@ -137884,7 +138069,7 @@

      set the log level for the openlineage spark library

      2023-09-18 16:26:46
      -

*Thread Reply:* answering this quickly - GE & MS SQL are not yet implemented in the provider

      +

*Thread Reply:* answering this quickly - GE & MS SQL are not yet implemented in the provider

      @@ -137910,7 +138095,7 @@

      set the log level for the openlineage spark library

      2023-09-18 16:26:58
      -

      *Thread Reply:* but I also invite you to contribute if you’re interested! 🙂

      +

      *Thread Reply:* but I also invite you to contribute if you’re interested! 🙂

      @@ -137963,7 +138148,7 @@

      set the log level for the openlineage spark library

      2023-09-19 16:40:06
      -

      *Thread Reply:* If you're using Airflow 2.7, take a look at https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html

      +

      *Thread Reply:* If you're using Airflow 2.7, take a look at https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html
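
In short, with the 2.7 provider the transport can be configured in airflow.cfg (values here are placeholders; the same options can also be set via AIRFLOW__OPENLINEAGE__* environment variables):

[openlineage]
transport = {"type": "http", "url": "http://localhost:5000", "endpoint": "api/v1/lineage"}
namespace = my_namespace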

      @@ -137993,7 +138178,7 @@

      set the log level for the openlineage spark library

      2023-09-19 16:40:54
      -

      *Thread Reply:* If you use one of the lower versions, take a look here https://openlineage.io/docs/integrations/airflow/usage

      +

      *Thread Reply:* If you use one of the lower versions, take a look here https://openlineage.io/docs/integrations/airflow/usage

      openlineage.io
      @@ -138040,7 +138225,7 @@

      set the log level for the openlineage spark library

      2023-09-20 06:26:56
      -

*Thread Reply:* Maciej,

+

*Thread Reply:* Maciej,
Thanks for sharing the link https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html
This should address the issue

      @@ -138073,7 +138258,7 @@

      set the log level for the openlineage spark library

- 🎉 Jakub Dardziński, Mars Lan, Ross Turk, Guntaka Jeevan Paul, Peter Hicks, Maciej Obuchowski, Athitya Kumar, John Lukenoff, Harel Shein, Francis McGregor-Macdonald, Laurent Paris
+ 🎉 Jakub Dardziński, Mars Lan (Metaphor), Ross Turk, Guntaka Jeevan Paul, Peter Hicks, Maciej Obuchowski, Athitya Kumar, John Lukenoff, Harel Shein, Francis McGregor-Macdonald, Laurent Paris
      @@ -138161,7 +138346,7 @@

      set the log level for the openlineage spark library

      2023-09-22 21:05:20
      -

*Thread Reply:* Hi @Michael Robinson Thank you! I love the job that you’ve done. If you have a few seconds, please hint at how I can push lineage gathered from Airflow and Spark jobs into DataHub for visualization? I didn’t find any solutions or official support at either OpenLineage or DataHub, but I still want to continue using OpenLineage

      +

*Thread Reply:* Hi @Michael Robinson Thank you! I love the job that you’ve done. If you have a few seconds, please hint at how I can push lineage gathered from Airflow and Spark jobs into DataHub for visualization? I didn’t find any solutions or official support at either OpenLineage or DataHub, but I still want to continue using OpenLineage

      @@ -138187,7 +138372,7 @@

      set the log level for the openlineage spark library

      2023-09-22 21:30:22
      -

      *Thread Reply:* Hi Yevhenii, thank you for using OpenLineage. The DataHub integration is new to us, but perhaps the experts on Spark and Airflow know more. @Paweł Leszczyński @Maciej Obuchowski @Jakub Dardziński

      +

      *Thread Reply:* Hi Yevhenii, thank you for using OpenLineage. The DataHub integration is new to us, but perhaps the experts on Spark and Airflow know more. @Paweł Leszczyński @Maciej Obuchowski @Jakub Dardziński

      @@ -138213,7 +138398,7 @@

      set the log level for the openlineage spark library

      2023-09-23 08:11:17
      -

*Thread Reply:* @Yevhenii Soboliev at Airflow Summit, Shirshanka Das from DataHub mentioned this as an upcoming feature.

      +

*Thread Reply:* @Yevhenii Soboliev at Airflow Summit, Shirshanka Das from DataHub mentioned this as an upcoming feature.

      @@ -138275,7 +138460,7 @@

      set the log level for the openlineage spark library

      2023-09-21 02:15:00
      -

*Thread Reply:* Also, do we have any docs on how OL works with the latest Airflow version? A few questions:

+

*Thread Reply:* Also, do we have any docs on how OL works with the latest Airflow version? A few questions:
• How is it replacing the concept of custom extractors and Manually Annotated Lineage in the latest version?
• Do we have any examples of setting up the integration to emit input/output datasets for non-supported operators like PythonOperator?

      @@ -138303,7 +138488,7 @@

      set the log level for the openlineage spark library

      2023-09-27 10:04:09
      -

*Thread Reply:* > Question: Now if we upgrade our Airflow version to 2.7 in the future, would our code be backward compatible?

+

      *Thread Reply:* > Question: Now if we upgrade our Airflow version to 2.7 in the future, would our code be backward compatible? It will be compatible, “default extractors” is generally the same concept as we’re using in the 2.7 integration. One thing that might be good to update is import paths, from openlineage.airflow to airflow.providers.openlineage but should work both ways

      @@ -138492,7 +138677,7 @@

      set the log level for the openlineage spark library

      I am attaching the log4j, there is no openlineagecontext

-
+

@@ -138528,7 +138713,7 @@

      set the log level for the openlineage spark library

      2023-09-21 23:47:22
      -

      *Thread Reply:* this issue is resolved, solution can be found here: https://openlineage.slack.com/archives/C01CK9T7HKR/p1691592987038929

      +

      *Thread Reply:* this issue is resolved, solution can be found here: https://openlineage.slack.com/archives/C01CK9T7HKR/p1691592987038929

      @@ -138599,7 +138784,7 @@

      set the log level for the openlineage spark library

      2023-09-25 08:59:10
      -

      *Thread Reply:* We were all out at Airflow Summit last week, so apologies for the delayed response. Glad you were able to resolve the issue!

      +

      *Thread Reply:* We were all out at Airflow Summit last week, so apologies for the delayed response. Glad you were able to resolve the issue!

      @@ -138655,7 +138840,7 @@

      set the log level for the openlineage spark library

      2023-09-25 08:56:53
      -

*Thread Reply:* Hey @Sangeeta Mishra, I’m not sure that I fully understand your question here. What do you mean by OpenLineage authentication?

+

      *Thread Reply:* Hey @Sangeeta Mishra, I’m not sure that I fully understand your question here. What do you mean by OpenLineage authentication? What are you using to generate OL events? What’s your OL receiving backend?

      @@ -138682,7 +138867,7 @@

      set the log level for the openlineage spark library

      2023-09-25 09:04:33
      -

*Thread Reply:* Hey @Harel Shein,

+

*Thread Reply:* Hey @Harel Shein,
I wanted to clarify the previous message. I apologize for any confusion. When I mentioned "OpenLineage authentication," I was actually referring to the authentication process for the OpenLineage backend, specifically using HTTP transport. This involves using my custom token provider, which utilizes access keys and secrets for authentication. The OL backend is an HTTP-based backend. I hope this clears things up!

      @@ -138709,7 +138894,7 @@

      set the log level for the openlineage spark library

      2023-09-25 09:05:12
      -

      *Thread Reply:* Are you using Marquez?

      +

      *Thread Reply:* Are you using Marquez?

      @@ -138735,7 +138920,7 @@

      set the log level for the openlineage spark library

      2023-09-25 09:05:55
      -

      *Thread Reply:* We are trying to leverage our own backend here.

      +

      *Thread Reply:* We are trying to leverage our own backend here.

      @@ -138761,7 +138946,7 @@

      set the log level for the openlineage spark library

      2023-09-25 09:07:03
      -

      *Thread Reply:* I see.. I’m not sure the OpenLineage community could help here. Which webserver framework are you using?

      +

      *Thread Reply:* I see.. I’m not sure the OpenLineage community could help here. Which webserver framework are you using?

      @@ -138787,7 +138972,7 @@

      set the log level for the openlineage spark library

      2023-09-25 09:08:56
      -

      *Thread Reply:* KTOR framework

      +

      *Thread Reply:* KTOR framework

      @@ -138813,7 +138998,7 @@

      set the log level for the openlineage spark library

      2023-09-25 09:15:33
      -

*Thread Reply:* Our backend authentication operates based on either a pair of keys or a single bearer token, with a limited time of expiry. Hence, I wanted to cache this information inside the token provider.

      +

*Thread Reply:* Our backend authentication operates based on either a pair of keys or a single bearer token, with a limited time of expiry. Hence, I wanted to cache this information inside the token provider.

      @@ -138839,7 +139024,7 @@

      set the log level for the openlineage spark library

      2023-09-25 09:26:57
      -

      *Thread Reply:* I see, I would ask this question here https://ktor.io/support/

      +

      *Thread Reply:* I see, I would ask this question here https://ktor.io/support/

      Ktor Framework
      @@ -138890,7 +139075,7 @@

      set the log level for the openlineage spark library

      2023-09-25 10:12:52
      -

      *Thread Reply:* Thank you

      +

      *Thread Reply:* Thank you

      @@ -138916,7 +139101,7 @@

      set the log level for the openlineage spark library

      2023-09-26 04:13:20
      -

      *Thread Reply:* @Sangeeta Mishra which openlineage client are you using: java or python?

      +

      *Thread Reply:* @Sangeeta Mishra which openlineage client are you using: java or python?

      @@ -138942,7 +139127,7 @@

      set the log level for the openlineage spark library

      2023-09-26 04:19:53
      -

      *Thread Reply:* @Paweł Leszczyński I am using python client

      +

      *Thread Reply:* @Paweł Leszczyński I am using python client
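
For anyone following along, a hedged sketch of a caching token provider for the Python client (recent openlineage-python versions expose a TokenProvider hook with a get_bearer method; _fetch_token here is hypothetical):

import time
from openlineage.client.transport.http import TokenProvider

class CachingTokenProvider(TokenProvider):
    def __init__(self, config: dict):
        super().__init__(config)
        self._token = None
        self._expires_at = 0.0

    def _fetch_token(self):
        # hypothetical: exchange the access key/secret pair for a short-lived bearer token
        return "new-token", time.time() + 300

    def get_bearer(self):
        # refresh the cached token only when it has expired
        if self._token is None or time.time() >= self._expires_at:
            self._token, self._expires_at = self._fetch_token()
        return f"Bearer {self._token}"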

      @@ -139006,7 +139191,7 @@

      set the log level for the openlineage spark library

      2023-09-27 09:56:00
      -

      *Thread Reply:* I am not sure it's stated in the doc. Here's the list of spark facets schemas: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/shared/facets/spark/v1

      +

      *Thread Reply:* I am not sure it's stated in the doc. Here's the list of spark facets schemas: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/shared/facets/spark/v1

      @@ -139062,7 +139247,7 @@

      set the log level for the openlineage spark library

      2023-09-27 09:47:39
      -

*Thread Reply:* For Spark we do send start and complete for each Spark action being run (a single operation that causes Spark processing to be run). However, it is difficult for us to know if we're dealing with the last action within a Spark job or a Spark script.

      +

*Thread Reply:* For Spark we do send start and complete for each Spark action being run (a single operation that causes Spark processing to be run). However, it is difficult for us to know if we're dealing with the last action within a Spark job or a Spark script.

      @@ -139088,7 +139273,7 @@

      set the log level for the openlineage spark library

      2023-09-27 09:49:35
      -

*Thread Reply:* I think we need to look deeper into that as there is a recurring need to capture such information

      +

*Thread Reply:* I think we need to look deeper into that as there is a recurring need to capture such information

      @@ -139114,7 +139299,7 @@

      set the log level for the openlineage spark library

      2023-09-27 09:49:57
      -

*Thread Reply:* and the Spark listener has methods like onApplicationStart and onApplicationEnd

      +

*Thread Reply:* and the Spark listener has methods like onApplicationStart and onApplicationEnd

      @@ -139140,7 +139325,7 @@

      set the log level for the openlineage spark library

      2023-09-27 09:50:13
      -

*Thread Reply:* We are using the SparkListener, which has a function called onApplicationStart which gets called whenever a Spark application starts, so I was thinking why can't we send one at start and similarly at end as well

      +

*Thread Reply:* We are using the SparkListener, which has a function called onApplicationStart which gets called whenever a Spark application starts, so I was thinking why can't we send one at start and similarly at end as well

      @@ -139166,7 +139351,7 @@

      set the log level for the openlineage spark library

      2023-09-27 09:50:33
      -

      *Thread Reply:* additionally, we would like to have a concept of a parent run for a spark job which aggregates all actions run within a single spark job context

      +

      *Thread Reply:* additionally, we would like to have a concept of a parent run for a spark job which aggregates all actions run within a single spark job context

      @@ -139192,7 +139377,7 @@

      set the log level for the openlineage spark library

      2023-09-27 09:51:11
      -

      *Thread Reply:* yeah exactly. the way that it works with airflow integration

      +

      *Thread Reply:* yeah exactly. the way that it works with airflow integration

      @@ -139218,7 +139403,7 @@

      set the log level for the openlineage spark library

      2023-09-27 09:51:26
      -

      *Thread Reply:* we do have an issue for that https://github.com/OpenLineage/OpenLineage/issues/2105

      +

      *Thread Reply:* we do have an issue for that https://github.com/OpenLineage/OpenLineage/issues/2105

      @@ -139287,7 +139472,7 @@

      set the log level for the openlineage spark library

      2023-09-27 09:52:08
      -

*Thread Reply:* what you can do is: come to our monthly OpenLineage open meetings and raise that issue and convince the community about its importance

      +

*Thread Reply:* what you can do is: come to our monthly OpenLineage open meetings and raise that issue and convince the community about its importance

      @@ -139313,7 +139498,7 @@

      set the log level for the openlineage spark library

      2023-09-27 09:53:32
      -

*Thread Reply:* yeah sure, would love to do that…how can I join them, will that be posted here in this slack channel?

      +

*Thread Reply:* yeah sure, would love to do that…how can I join them, will that be posted here in this slack channel?

      @@ -139339,7 +139524,7 @@

      set the log level for the openlineage spark library

      2023-09-27 09:54:08
      -

      *Thread Reply:* Hi, you can see the schedule and RSVP here: https://openlineage.io/community

      +

      *Thread Reply:* Hi, you can see the schedule and RSVP here: https://openlineage.io/community

      openlineage.io
      @@ -139487,74 +139672,54 @@

      set the log level for the openlineage spark library

- 🙌 Mars Lan, Harel Shein, Paweł Leszczyński
+ 🙌 Mars Lan (Metaphor), Harel Shein, Paweł Leszczyński
      @@ -139595,44 +139760,32 @@

      set the log level for the openlineage spark library

      2023-09-27 11:55:47
      -

      *Thread Reply:* A few more pics:

      +

      *Thread Reply:* A few more pics:

      @@ -139878,7 +140031,7 @@

      set the log level for the openlineage spark library

      2023-09-28 02:42:18
      -

      *Thread Reply:* responded within the issue

      +

      *Thread Reply:* responded within the issue

      @@ -139946,7 +140099,7 @@

      set the log level for the openlineage spark library

      2023-09-28 03:04:48
      -

*Thread Reply:* Hi @Erik Alfthan, the channel is totally OK. I am not an Airflow integration expert, but it looks to me like you're missing the openlineage-sql library, which is a Rust library used to extract lineage from SQL queries. This is how we do that in Circle CI:

+

*Thread Reply:* Hi @Erik Alfthan, the channel is totally OK. I am not an Airflow integration expert, but it looks to me like you're missing the openlineage-sql library, which is a Rust library used to extract lineage from SQL queries. This is how we do that in Circle CI:
https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/8080/workflows/aba53369-836c-48f5-a2dd-51bc0740a31c/jobs/140113

      and subproject page with build instructions: https://github.com/OpenLineage/OpenLineage/tree/main/integration/sql

      @@ -139975,7 +140128,7 @@

      set the log level for the openlineage spark library

      2023-09-28 03:07:23
      -

      *Thread Reply:* Ok, so I go and "manually" build the internal dependency so that it becomes available in the pip cache?

      +

      *Thread Reply:* Ok, so I go and "manually" build the internal dependency so that it becomes available in the pip cache?

      I was hoping for something more automagical, but that should work

      @@ -140003,7 +140156,7 @@

      set the log level for the openlineage spark library

      2023-09-28 03:08:06
      -

      *Thread Reply:* I think so. @Jakub Dardziński am I right?

      +

      *Thread Reply:* I think so. @Jakub Dardziński am I right?

      @@ -140029,7 +140182,7 @@

      set the log level for the openlineage spark library

      2023-09-28 03:18:27
      -

*Thread Reply:* https://openlineage.io/docs/development/developing/python/setup

+

*Thread Reply:* https://openlineage.io/docs/development/developing/python/setup
there’s a guide on how to set up the dev environment

> Typically, you first need to build openlineage-sql locally (see README). After each release you have to repeat this step in order to bump the local version of the package.

@@ -140059,7 +140212,7 @@

      set the log level for the openlineage spark library

      2023-09-28 03:27:20
      -

*Thread Reply:* It didn’t find the wheel in the cache, but if I used the line in the sql/README.md

+

*Thread Reply:* It didn’t find the wheel in the cache, but if I used the line in the sql/README.md
pip install openlineage-sql --no-index --find-links ../target/wheels --force-reinstall
it is installed and thus skipped/passed when pip later checks if it needs to be installed.

      @@ -140093,7 +140246,7 @@

      set the log level for the openlineage spark library

      2023-09-28 03:31:52
      -

*Thread Reply:* > It didn’t find the wheel in the cache, but if I used the line in the sql/README.md

+

*Thread Reply:* > It didn’t find the wheel in the cache, but if I used the line in the sql/README.md
> pip install openlineage-sql --no-index --find-links ../target/wheels --force-reinstall
> it is installed and thus skipped/passed when pip later checks if it needs to be installed.
That’s actually expected. You should build a new wheel locally and then install it.

      @@ -140133,7 +140286,7 @@

      set the log level for the openlineage spark library

      2023-09-28 03:32:04
      -

      *Thread Reply:* I just realized that I should probably skip setting up my wsl and just run the tests in the docker setup you prepared

      +

      *Thread Reply:* I just realized that I should probably skip setting up my wsl and just run the tests in the docker setup you prepared

      @@ -140159,7 +140312,7 @@

      set the log level for the openlineage spark library

      2023-09-28 03:35:46
      -

      *Thread Reply:* You could do that as well but if you want to test your changes vs many Airflow versions that wouldn’t be possible I think (run them with tox btw)

      +

      *Thread Reply:* You could do that as well but if you want to test your changes vs many Airflow versions that wouldn’t be possible I think (run them with tox btw)

      @@ -140185,7 +140338,7 @@

      set the log level for the openlineage spark library

      2023-09-28 04:54:39
      -

      *Thread Reply:* This is starting to feel like a rabbit hole 😞

      +

      *Thread Reply:* This is starting to feel like a rabbit hole 😞

When I run tox, I get a lot of build errors
• client needs to be built

@@ -140216,7 +140369,7 @@

      set the log level for the openlineage spark library

      2023-09-28 05:19:34
      -

      *Thread Reply:* would you like to share some exact log lines? I’ve never seen such errors, they probably are system specific

      +

      *Thread Reply:* would you like to share some exact log lines? I’ve never seen such errors, they probably are system specific

      @@ -140242,7 +140395,7 @@

      set the log level for the openlineage spark library

      2023-09-28 06:45:48
      -

*Thread Reply:* Getting requirements to build wheel did not run successfully.

+

*Thread Reply:* Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [62 lines of output]
/tmp/pip-build-env-q1pay0xo/overlay/lib/python3.10/site-packages/setuptools/config/setupcfg.py:293: _DeprecatedConfig: Deprecated config in `setup.cfg`

@@ -140338,7 +140491,7 @@

      set the log level for the openlineage spark library

      2023-09-28 06:53:54
      -

      *Thread Reply:* Then, for the actual error in my PR: Evidently you are not using isort, so what linter/fixer should I use for imports?

      +

      *Thread Reply:* Then, for the actual error in my PR: Evidently you are not using isort, so what linter/fixer should I use for imports?

      @@ -140364,7 +140517,7 @@

      set the log level for the openlineage spark library

      2023-09-28 06:58:15
      -

      *Thread Reply:* for the error - I think there’s a mistake in the docs. Could you please run maturin build --out target/wheels as a temp solution?

      +

      *Thread Reply:* for the error - I think there’s a mistake in the docs. Could you please run maturin build --out target/wheels as a temp solution?

      @@ -140394,7 +140547,7 @@

      set the log level for the openlineage spark library

      2023-09-28 06:58:57
      -

*Thread Reply:* we’re using ruff, tox runs it as one of the commands

      +

*Thread Reply:* we’re using ruff, tox runs it as one of the commands

      @@ -140420,7 +140573,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:00:37
      -

*Thread Reply:* Not in the airflow folder?

+

*Thread Reply:* Not in the airflow folder?
OpenLineage/integration/airflow$ maturin build --out target/wheels
💥 maturin failed
Caused by: pyproject.toml at /home/obr_erikal/projects/OpenLineage/integration/airflow/pyproject.toml is invalid

@@ -140454,7 +140607,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:02:32
      -

      *Thread Reply:* I meant change here https://github.com/OpenLineage/OpenLineage/blob/main/integration/sql/README.md

      +

      *Thread Reply:* I meant change here https://github.com/OpenLineage/OpenLineage/blob/main/integration/sql/README.md

so cd iface-py

@@ -140496,7 +140649,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:05:12
      -

*Thread Reply:* yes, that part I actually worked out myself, but I fail to understand the cause of the cython_sources error. I have python3-dev installed on WSL Ubuntu with Python version 3.10.12 in a virtualenv. Anything in there that could cause issues?

      +

*Thread Reply:* yes, that part I actually worked out myself, but I fail to understand the cause of the cython_sources error. I have python3-dev installed on WSL Ubuntu with Python version 3.10.12 in a virtualenv. Anything in there that could cause issues?

      @@ -140522,7 +140675,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:12:20
      -

*Thread Reply:* looks like it has something to do with the latest release of Cython?

+

*Thread Reply:* looks like it has something to do with the latest release of Cython?
pip install "Cython<3" maybe solves the issue?

      @@ -140549,7 +140702,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:15:06
      -

*Thread Reply:* I didn't have any Cython before the install. Also no change. Could it be some update to setuptools itself? Seems like the deprecation notice and the error are coming from inside setuptools

      +

*Thread Reply:* I didn't have any Cython before the install. Also no change. Could it be some update to setuptools itself? Seems like the deprecation notice and the error are coming from inside setuptools

      @@ -140575,7 +140728,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:16:59
      -

*Thread Reply:* (I.e. I tried the pip install "Cython<3" command without any change in the output)

      +

*Thread Reply:* (I.e. I tried the pip install "Cython<3" command without any change in the output)

      @@ -140601,7 +140754,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:20:30
      -

*Thread Reply:* Applying ruff lint on the converter.py file fixed the issue on the PR, though, so unless you have any feedback on the change itself, I will set it up on my own computer later instead (right now doing changes on behalf of a client on the client's computer)

      +

*Thread Reply:* Applying ruff lint on the converter.py file fixed the issue on the PR, though, so unless you have any feedback on the change itself, I will set it up on my own computer later instead (right now doing changes on behalf of a client on the client's computer)

      If the issue persists on my own computer, I'll dig a bit further

      @@ -140629,7 +140782,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:21:03
      -

      *Thread Reply:* It’s a bit hard for me to find the root cause as I cannot reproduce this locally and CI works fine as well

      +

      *Thread Reply:* It’s a bit hard for me to find the root cause as I cannot reproduce this locally and CI works fine as well

      @@ -140655,7 +140808,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:22:41
      -

      *Thread Reply:* Yeah, I am thinking that if I run into the same problem "at home", I might find it worthwhile to understand the issue. Right now, the client only wants the fix.

      +

      *Thread Reply:* Yeah, I am thinking that if I run into the same problem "at home", I might find it worthwhile to understand the issue. Right now, the client only wants the fix.

      @@ -140685,7 +140838,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:25:10
      -

      *Thread Reply:* Is there an official release cycle?

      +

      *Thread Reply:* Is there an official release cycle?

or more specifically, given that the PRs are approved, how soon can they reach openlineage-dbt and apache-airflow-providers-openlineage?

      @@ -140713,7 +140866,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:28:58
      -

      *Thread Reply:* we need to differentiate some things:

      +

      *Thread Reply:* we need to differentiate some things:

1. OpenLineage repository:
   a. dbt integration - this is the only place where it is maintained

@@ -140748,7 +140901,7 @@

        set the log level for the openlineage spark library

      2023-09-28 07:31:30
      -

      *Thread Reply:* it’s a bit complex for this split but needed temporarily

      +

      *Thread Reply:* it’s a bit complex for this split but needed temporarily

      @@ -140774,7 +140927,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:31:47
      -

      *Thread Reply:* oh, I did the fix in the wrong place! The client is on airflow 2.7 and is using the provider. Is it syncing?

      +

      *Thread Reply:* oh, I did the fix in the wrong place! The client is on airflow 2.7 and is using the provider. Is it syncing?

      @@ -140800,7 +140953,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:32:28
      -

      *Thread Reply:* it’s not, two separate places a~nd we haven’t even added the whole thing with converting old lineage objects to OL specific~

      +

      *Thread Reply:* it’s not, two separate places a~nd we haven’t even added the whole thing with converting old lineage objects to OL specific~

      editing, that’s not true

      @@ -140828,7 +140981,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:34:40
      -

*Thread Reply:* the code’s here:

+

      *Thread Reply:* the code’s here: https://github.com/apache/airflow/blob/main/airflow/providers/openlineage/extractors/manager.py#L154

      @@ -140880,7 +141033,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:35:17
      -

*Thread Reply:* sorry I did not mention this earlier. we definitely need to add some guidance on how to proceed with contributions to OL and the Airflow OL provider

      +

*Thread Reply:* sorry I did not mention this earlier. we definitely need to add some guidance on how to proceed with contributions to OL and the Airflow OL provider

      @@ -140906,7 +141059,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:36:10
      -

*Thread Reply:* anyway, the dbt fix is the blocking issue, so if that part comes next week, there is no real urgency in getting the columns. It is a nice-to-have for our ingest parquet files.

      +

*Thread Reply:* anyway, the dbt fix is the blocking issue, so if that part comes next week, there is no real urgency in getting the columns. It is a nice-to-have for our ingest parquet files.

      @@ -140932,7 +141085,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:37:12
      -

      *Thread Reply:* may I ask if you use some custom operator / python operator there?

      +

      *Thread Reply:* may I ask if you use some custom operator / python operator there?

      @@ -140958,7 +141111,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:37:33
      -

      *Thread Reply:* yeah, taskflow with inlets/outlets

      +

      *Thread Reply:* yeah, taskflow with inlets/outlets

      @@ -140984,7 +141137,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:38:38
      -

      *Thread Reply:* so we extract from sources and use pyarrow to create parquet files in storage that an mssql-server can use as external tables

      +

      *Thread Reply:* so we extract from sources and use pyarrow to create parquet files in storage that an mssql-server can use as external tables

      @@ -141010,7 +141163,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:39:54
      -

*Thread Reply:* awesome 👍

+

*Thread Reply:* awesome 👍
we have plans to integrate more with the Python operator as well, but not earlier than in Airflow 2.8

      @@ -141037,7 +141190,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:43:41
      -

*Thread Reply:* I guess writing a generic extractor for the Python operator is quite hard, but if you could support some inlet/outlet type for tabular file formats / their Python libraries like pyarrow or maybe even pandas and document it, I think a lot of people would understand how to use them

      +

*Thread Reply:* I guess writing a generic extractor for the Python operator is quite hard, but if you could support some inlet/outlet type for tabular file formats / their Python libraries like pyarrow or maybe even pandas and document it, I think a lot of people would understand how to use them
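
For context, a minimal sketch of taskflow inlets/outlets as the provider's extractors manager consumes them today (all entity values here are hypothetical):

from airflow.decorators import task
from airflow.lineage.entities import File, Table

@task(
    inlets=[Table(cluster="warehouse", database="src_db", name="customers")],
    outlets=[File(url="")],
)
def extract_customers():
    ...  # read the source and write a parquet file with pyarrow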

      @@ -141143,7 +141296,7 @@

      set the log level for the openlineage spark library

      2023-10-02 17:00:08
      -

      *Thread Reply:* Thanks all. The release is authorized and will be initiated within 2 business days.

      +

      *Thread Reply:* Thanks all. The release is authorized and will be initiated within 2 business days.

      @@ -141169,7 +141322,7 @@

      set the log level for the openlineage spark library

      2023-10-02 17:11:46
      -

      *Thread Reply:* looking forward to that, I am seeing inconsistent results in Databricks for Spark 3.4+, sometimes there's no inputs / outputs, hope that is fixed?

      +

      *Thread Reply:* looking forward to that, I am seeing inconsistent results in Databricks for Spark 3.4+, sometimes there's no inputs / outputs, hope that is fixed?

      @@ -141195,7 +141348,7 @@

      set the log level for the openlineage spark library

      2023-10-03 09:59:24
      -

      *Thread Reply:* @Jason Yip if it isn’t fixed for you, would love it if you could open up an issue that will allow us to reproduce and fix

      +

      *Thread Reply:* @Jason Yip if it isn’t fixed for you, would love it if you could open up an issue that will allow us to reproduce and fix

      @@ -141225,7 +141378,7 @@

      set the log level for the openlineage spark library

      2023-10-03 20:23:40
      -

      *Thread Reply:* @Harel Shein the issue still exists -> Spark 3.4 and above, including 3.5, saveAsTable and create table won't have inputs and outputs in Databricks

      +

      *Thread Reply:* @Harel Shein the issue still exists -> Spark 3.4 and above, including 3.5, saveAsTable and create table won't have inputs and outputs in Databricks

      @@ -141251,7 +141404,7 @@

      set the log level for the openlineage spark library

      2023-10-03 20:30:15
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/2124

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/2124

      @@ -141415,7 +141568,7 @@

      set the log level for the openlineage spark library

      2023-10-03 20:30:21
      -

      *Thread Reply:* and of course this issue still exists

      +

      *Thread Reply:* and of course this issue still exists

      @@ -141441,7 +141594,7 @@

      set the log level for the openlineage spark library

      2023-10-03 21:45:09
      -

      *Thread Reply:* thanks for posting, we’ll continue looking into this.. if you find any clues that might help, please let us know.

      +

      *Thread Reply:* thanks for posting, we’ll continue looking into this.. if you find any clues that might help, please let us know.

      @@ -141467,7 +141620,7 @@

      set the log level for the openlineage spark library

      2023-10-03 21:46:27
      -

*Thread Reply:* are there any instructions on how to hook up a debugger to OL?

      +

*Thread Reply:* are there any instructions on how to hook up a debugger to OL?
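
Not OL-specific, but one way to step through the Spark integration is to start the driver JVM with a JDWP agent and attach a remote debugger from your IDE; a sketch (the port and suspend settings are just examples):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # pause the driver JVM until a remote debugger attaches on port 5005
    .config("spark.driver.extraJavaOptions",
            "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005")
    .getOrCreate()
)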

      @@ -141493,7 +141646,7 @@

      set the log level for the openlineage spark library

      2023-10-04 09:04:16
      -

      *Thread Reply:* @Paweł Leszczyński has been working on adding a debug facet, but more suggestions are more than welcome!

      +

      *Thread Reply:* @Paweł Leszczyński has been working on adding a debug facet, but more suggestions are more than welcome!

      @@ -141519,7 +141672,7 @@

      set the log level for the openlineage spark library

      2023-10-04 09:05:58
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2147

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2147

      @@ -141623,7 +141776,7 @@

      set the log level for the openlineage spark library

      2023-10-05 03:20:11
      -

      *Thread Reply:* @Paweł Leszczyński do you have a build for the PR? Appreciated!

      +

      *Thread Reply:* @Paweł Leszczyński do you have a build for the PR? Appreciated!

      @@ -141649,7 +141802,7 @@

      set the log level for the openlineage spark library

      2023-10-05 15:05:08
      -

      *Thread Reply:* we’ll ask for a release once it’s reviewed and merged

      +

      *Thread Reply:* we’ll ask for a release once it’s reviewed and merged

      @@ -141758,7 +141911,7 @@

      set the log level for the openlineage spark library

      2023-10-03 09:01:38
      -

*Thread Reply:* Not sure if I follow your question. Whenever OL is released, there is a script new-version.sh - https://github.com/OpenLineage/OpenLineage/blob/main/new-version.sh - being run that modifies the codebase.

      +

*Thread Reply:* Not sure if I follow your question. Whenever OL is released, there is a script new-version.sh - https://github.com/OpenLineage/OpenLineage/blob/main/new-version.sh - being run that modifies the codebase.

So, if you pull the code, it contains an OL version that has not been released yet, and in the case of dependencies, one needs to build them on their own.

      @@ -141812,7 +141965,7 @@

      set the log level for the openlineage spark library

      2023-10-04 05:27:26
      -

      *Thread Reply:* Hmm. Let's elaborate my use case a bit.

      +

      *Thread Reply:* Hmm. Let's elaborate my use case a bit.

      We run Apache Hive on-premise. Hive provides query execution hooks for pre-query, post-query, and I think failed query.

      @@ -141848,7 +142001,7 @@

      set the log level for the openlineage spark library

      2023-10-04 05:27:36
      -

      *Thread Reply:* I hope that helps.

      +

      *Thread Reply:* I hope that helps.

      @@ -141878,7 +142031,7 @@

      set the log level for the openlineage spark library

      2023-10-04 09:03:02
      -

*Thread Reply:* I get it now. In Circle CI we do have 3 build steps:

+

*Thread Reply:* I get it now. In Circle CI we do have 3 build steps:
- build-integration-sql-x86
- build-integration-sql-arm
- build-integration-sql-macos

@@ -141948,7 +142101,7 @@

      set the log level for the openlineage spark library

      2023-10-23 11:56:12
      -

*Thread Reply:* It doesn't have the free resource class still 😞

+

*Thread Reply:* It doesn't have the free resource class still 😞
We're blocked on that unfortunately. Another solution would be to migrate to GH Actions, where most of our solution could be replaced by something like https://github.com/PyO3/maturin-action

      @@ -142025,7 +142178,7 @@

      set the log level for the openlineage spark library

- 👍 Jason Yip, Peter Hicks, Peter Huang, Mars Lan
+ 👍 Jason Yip, Peter Hicks, Peter Huang, Mars Lan (Metaphor)
      @@ -142049,12 +142202,12 @@

      set the log level for the openlineage spark library

      -
Mars Lan
+
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-10-04 07:42:59
      -

      *Thread Reply:* Any chance we can do a 1.3.2 soonish to include https://github.com/OpenLineage/OpenLineage/pull/2151 instead of waiting for the next monthly release?

      +

      *Thread Reply:* Any chance we can do a 1.3.2 soonish to include https://github.com/OpenLineage/OpenLineage/pull/2151 instead of waiting for the next monthly release?

      @@ -142169,7 +142322,7 @@

      set the log level for the openlineage spark library

      2023-10-04 03:01:02
      -

*Thread Reply:* That's a great use case for OpenLineage. Unfortunately, we don't have any doc or recommendation on that.

      +

*Thread Reply:* That's a great use case for OpenLineage. Unfortunately, we don't have any doc or recommendation on that.

I would try using the FluentD proxy we have (https://github.com/OpenLineage/OpenLineage/tree/main/proxy/fluentd) to copy the event stream (alerting is just one of the use cases for lineage events) and write a fluentd plugin to send it asynchronously further to an alerting service like PagerDuty.

      @@ -142314,7 +142467,7 @@

      set the log level for the openlineage spark library

      -
Mars Lan
+
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-10-06 07:19:01
      @@ -142413,12 +142566,12 @@

      set the log level for the openlineage spark library

      2023-10-06 19:16:02
      -

      *Thread Reply:* Thanks for requesting a release, @Mars Lan. It has been approved and will be initiated within 2 business days of next Monday.

      +

      *Thread Reply:* Thanks for requesting a release, @Mars Lan (Metaphor). It has been approved and will be initiated within 2 business days of next Monday.

- 🙏 Mars Lan
+ 🙏 Mars Lan (Metaphor)
      @@ -142446,20 +142599,16 @@

      set the log level for the openlineage spark library

@here I am trying out the OpenLineage integration of Spark on Databricks. There is no event getting emitted from OpenLineage; I see logs saying OpenLineage Event Skipped. I am attaching the notebook that I am trying to run and the cluster logs. Kindly can someone help me on this

-
+

@@ -142491,7 +142640,7 @@

      set the log level for the openlineage spark library

      2023-10-09 00:02:10
      -

*Thread Reply:* from my experience, it will only work on Spark 3.3.x or below, aka Runtime 12.2 or below. Anything above, the events will show up once in a blue moon

      +

*Thread Reply:* from my experience, it will only work on Spark 3.3.x or below, aka Runtime 12.2 or below. Anything above, the events will show up once in a blue moon

      @@ -142517,7 +142666,7 @@

      set the log level for the openlineage spark library

      2023-10-09 00:04:38
      -

*Thread Reply:* ohh, thanks for the information @Jason Yip, I am trying out with Databricks version 13.3 and Spark 3.4.1, will try using a lower version as you suggested. Is there any issue tracking this bug @Jason Yip

      +

*Thread Reply:* ohh, thanks for the information @Jason Yip, I am trying out with Databricks version 13.3 and Spark 3.4.1, will try using a lower version as you suggested. Is there any issue tracking this bug @Jason Yip

      @@ -142543,7 +142692,7 @@

      set the log level for the openlineage spark library

      2023-10-09 00:06:06
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/2124

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/2124

      @@ -142707,7 +142856,7 @@

      set the log level for the openlineage spark library

      2023-10-09 00:11:54
      -

      *Thread Reply:* tried with databricks 12.2 --> spark 3.3.2, still the same behaviour no event getting emitted

      +

      *Thread Reply:* tried with databricks 12.2 --> spark 3.3.2, still the same behaviour no event getting emitted

      @@ -142733,7 +142882,7 @@

      set the log level for the openlineage spark library

      2023-10-09 00:12:35
      -

      *Thread Reply:* you can do 11.3, its the most stable one I know

      +

      *Thread Reply:* you can do 11.3, its the most stable one I know

      @@ -142759,7 +142908,7 @@

      set the log level for the openlineage spark library

      2023-10-09 00:12:46
      -

      *Thread Reply:* sure, let me try that out

      +

      *Thread Reply:* sure, let me try that out

      @@ -142785,7 +142934,7 @@

      set the log level for the openlineage spark library

      2023-10-09 00:31:51
      -

      *Thread Reply:* still the same problem…the jar that i am using is the latest openlineage-spark-1.3.1.jar, do you think that can be the problem

      +

      *Thread Reply:* still the same problem…the jar that i am using is the latest openlineage-spark-1.3.1.jar, do you think that can be the problem

      @@ -142811,7 +142960,7 @@

      set the log level for the openlineage spark library

      2023-10-09 00:43:59
      -

      *Thread Reply:* tried with openlineage-spark-1.2.2.jar, still the same issue, seems like they are skipping some events

      +

      *Thread Reply:* tried with openlineage-spark-1.2.2.jar, still the same issue, seems like they are skipping some events

      @@ -142837,7 +142986,7 @@

      set the log level for the openlineage spark library

      2023-10-09 01:47:20
      -

      *Thread Reply:* Probably not all events will be captured, I have only tested create tables and jobs

      +

      *Thread Reply:* Probably not all events will be captured, I have only tested create tables and jobs

      @@ -142863,7 +143012,7 @@

      set the log level for the openlineage spark library

      2023-10-09 04:31:12
      -

      *Thread Reply:* Hi @Guntaka Jeevan Paul, how did you configure openlineage and what is your job doing?

      +

      *Thread Reply:* Hi @Guntaka Jeevan Paul, how did you configure openlineage and what is your job doing?

      We do have a bunch of integration tests on Databricks platform available here and they're passing on databricks runtime 13.0.x-scala2.12.

      @@ -142943,7 +143092,7 @@

      set the log level for the openlineage spark library

      2023-10-09 05:06:41
      -

*Thread Reply:* babynames = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("dbfs:/FileStore/babynames.csv")

+

*Thread Reply:* babynames = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("dbfs:/FileStore/babynames.csv")
babynames.createOrReplaceTempView("babynames_table")
years = spark.sql("select distinct(Year) from babynames_table").rdd.map(lambda row : row[0]).collect()
years.sort()

@@ -142974,7 +143123,7 @@

      set the log level for the openlineage spark library

      2023-10-09 05:08:09
      -

*Thread Reply:* this is the script that i am running @Paweł Leszczyński…kindly let me know if i’m making any mistake. I have added the init script at the cluster level and from the logs i could see that openlineage is configured as i see a log statement

      +

*Thread Reply:* this is the script that i am running @Paweł Leszczyński…kindly let me know if i’m making any mistake. I have added the init script at the cluster level and from the logs i could see that openlineage is configured as i see a log statement

      @@ -143000,7 +143149,7 @@

      set the log level for the openlineage spark library

      2023-10-09 05:10:30
      -

*Thread Reply:* there's nothing wrong in that script. It's just that we decided to limit the amount of OL events for jobs that don't write their data anywhere and just do a collect operation

      +

*Thread Reply:* there's nothing wrong in that script. It's just that we decided to limit the amount of OL events for jobs that don't write their data anywhere and just do a collect operation

      @@ -143026,7 +143175,7 @@

      set the log level for the openlineage spark library

      2023-10-09 05:11:02
      -

*Thread Reply:* this is also a potential reason why you can't see any events

      +

*Thread Reply:* this is also a potential reason why you can't see any events

      @@ -143052,7 +143201,7 @@

      set the log level for the openlineage spark library

      2023-10-09 05:14:33
      -

*Thread Reply:* ohh…okk, will try out the test script that you have mentioned above. Kindly correct me if my understanding is wrong: if there are a few transformations and it finally writes somewhere, that is where the OL events are expected to be emitted?

      +

*Thread Reply:* ohh…okk, will try out the test script that you have mentioned above. Kindly correct me if my understanding is wrong: if there are a few transformations and it finally writes somewhere, that is where the OL events are expected to be emitted?

      @@ -143078,7 +143227,7 @@

      set the log level for the openlineage spark library

      2023-10-09 05:16:54
      -

*Thread Reply:* yes. main purpose of the lineage is to track dependencies between the datasets, when a job reads from dataset A and writes to dataset B. In the case of a Databricks notebook that does show or collect and prints some query result on the screen, there may be no reason to track it in the sense of lineage.

      +

*Thread Reply:* yes. main purpose of the lineage is to track dependencies between the datasets, when a job reads from dataset A and writes to dataset B. In the case of a Databricks notebook that does show or collect and prints some query result on the screen, there may be no reason to track it in the sense of lineage.
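For illustration, a minimal sketch of the kind of job lineage is meant for: read dataset A, write dataset B. Because it writes somewhere, the OpenLineage listener (assumed to be configured on the cluster) emits START/COMPLETE events with A as input and B as output; the dbfs paths are hypothetical.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()

# Read dataset A and write dataset B -- the write is what makes this job
# interesting to lineage, unlike a bare collect() or show().
df = spark.read.format("csv").option("header", "true").load("dbfs:/data/a.csv")
df.write.mode("overwrite").parquet("dbfs:/data/b")
```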

      @@ -143107,7 +143256,7 @@

      set the log level for the openlineage spark library

@channel We released OpenLineage 1.4.1!
Additions:
-• Client: allow setting client’s endpoint via environment variable 2151 @Mars Lan
+• Client: allow setting client’s endpoint via environment variable 2151 @Mars Lan (Metaphor)
• Flink: expand Iceberg source types 2149 @Peter Huang
• Spark: add debug facet 2147 @Paweł Leszczyński
• Spark: enable Nessie REST catalog 2165 @julwin
@@ -143121,7 +143270,7 @@

      set the log level for the openlineage spark library

- 👍 Jason Yip, Ross Turk, Mars Lan, Harel Shein, Rodrigo Maia
+ 👍 Jason Yip, Ross Turk, Mars Lan (Metaphor), Harel Shein, Rodrigo Maia
      @@ -143172,7 +143321,7 @@

      set the log level for the openlineage spark library

      2023-10-09 18:56:13
      -

      *Thread Reply:* Hi Drew, thank you for using OpenLineage! I don’t know the details of your use case, but I believe this is expected, yes. In general, the dataset namespace is different. Jobs are namespaced separately from datasets, which are namespaced by their containing datasources. This is the case so datasets have the same name regardless of the job writing to them, as datasets are sometimes shared by jobs in different namespaces.

      +

      *Thread Reply:* Hi Drew, thank you for using OpenLineage! I don’t know the details of your use case, but I believe this is expected, yes. In general, the dataset namespace is different. Jobs are namespaced separately from datasets, which are namespaced by their containing datasources. This is the case so datasets have the same name regardless of the job writing to them, as datasets are sometimes shared by jobs in different namespaces.

      @@ -143256,7 +143405,7 @@

      set the log level for the openlineage spark library

      2023-10-11 03:46:13
      -

      *Thread Reply:* Is this related to any OL version? In OL 1.2.2. we've added extra variable spark.databricks.clusterUsageTags.clusterAllTags to be captured, but this should not break things.

      +

      *Thread Reply:* Is this related to any OL version? In OL 1.2.2. we've added extra variable spark.databricks.clusterUsageTags.clusterAllTags to be captured, but this should not break things.

      I think we're facing some issues on recent databricks runtime versions. Here is an issue for this: https://github.com/OpenLineage/OpenLineage/issues/2131

      @@ -143316,7 +143465,7 @@

      set the log level for the openlineage spark library

      2023-10-11 11:17:06
      -

      *Thread Reply:* yes, exactly Spark 3.4+

      +

      *Thread Reply:* yes, exactly Spark 3.4+

      @@ -143342,7 +143491,7 @@

      set the log level for the openlineage spark library

      2023-10-11 21:12:27
      -

      *Thread Reply:* Btw I don't understand the code flow entirely, if we are talking about a different classpath only, I see there's Unity Catalog handler in the code and it says it works the same as Delta, but I am not seeing it subclassing Delta. I suppose it will work the same.

      +

      *Thread Reply:* Btw I don't understand the code flow entirely, if we are talking about a different classpath only, I see there's Unity Catalog handler in the code and it says it works the same as Delta, but I am not seeing it subclassing Delta. I suppose it will work the same.

      I am happy to jump on a call to show you if needed

      @@ -143370,7 +143519,7 @@

      set the log level for the openlineage spark library

      2023-10-16 02:58:56
      -

      *Thread Reply:* @Paweł Leszczyński do you think in Spark 3.4+ only one event would happen?

      +

      *Thread Reply:* @Paweł Leszczyński do you think in Spark 3.4+ only one event would happen?

/**
 * We get exact copies of OL events for org.apache.spark.scheduler.SparkListenerJobStart and

@@ -143438,7 +143587,7 @@

      set the log level for the openlineage spark library

      2023-10-10 23:44:53
      -

      *Thread Reply:* I use it to get the table with database name

      +

      *Thread Reply:* I use it to get the table with database name

      @@ -143464,7 +143613,7 @@

      set the log level for the openlineage spark library

      2023-10-10 23:47:15
      -

*Thread Reply:* so can i think of it like: if there is a symlink, then that table is kind of a reference to the original dataset

      +

*Thread Reply:* so can i think of it like: if there is a symlink, then that table is kind of a reference to the original dataset

      @@ -143490,7 +143639,7 @@

      set the log level for the openlineage spark library

      2023-10-11 01:25:44
      -

      *Thread Reply:* yes

      +

      *Thread Reply:* yes

      @@ -143640,7 +143789,7 @@

      set the log level for the openlineage spark library

      2023-10-11 06:57:46
      -

*Thread Reply:* Hey Guntaka - can I ask you a favour? Can you please stop using @here or @channel - please keep in mind, you're pinging over 1000 people when you use that mention. It's incredibly distracting to have Slack notify me of a message that isn't pertinent to me.

      +

*Thread Reply:* Hey Guntaka - can I ask you a favour? Can you please stop using @here or @channel - please keep in mind, you're pinging over 1000 people when you use that mention. It's incredibly distracting to have Slack notify me of a message that isn't pertinent to me.

      @@ -143666,7 +143815,7 @@

      set the log level for the openlineage spark library

      2023-10-11 06:58:50
      -

      *Thread Reply:* sure noted @Damien Hawes

      +

      *Thread Reply:* sure noted @Damien Hawes

      @@ -143692,7 +143841,7 @@

      set the log level for the openlineage spark library

      2023-10-11 06:59:34
      -

      *Thread Reply:* Thank you!

      +

      *Thread Reply:* Thank you!

      @@ -143744,7 +143893,7 @@

      set the log level for the openlineage spark library

      2023-10-12 13:55:26
      -

*Thread Reply:* Make sure to provide a dataset field nodeId as a query param in your request. If you’ve seeded Marquez with test metadata, you can use:

+

*Thread Reply:* Make sure to provide a dataset field nodeId as a query param in your request. If you’ve seeded Marquez with test metadata, you can use:
curl -XGET "http://localhost:5002/api/v1/column-lineage?nodeId=datasetField%3Afood_delivery%3Apublic.delivery_7_days%3Acustomer_email"
You can view the API docs for column lineage here!

      @@ -143772,7 +143921,7 @@

      set the log level for the openlineage spark library

      2023-10-17 05:57:36
      -

*Thread Reply:* Thanks Willy. The documentation says 'name space' so i constructed the API like this:

+

*Thread Reply:* Thanks Willy. The documentation says 'name space' so i constructed the API like this:
'http://marquez-web:3000/api/v1/column-lineage/nodeId=datasetField:file:/home/jovyan/Downloads/event_attribute.csv:eventType'
but it is still not working 😞

      @@ -143800,7 +143949,7 @@

      set the log level for the openlineage spark library

      2023-10-17 06:07:06
      -

      *Thread Reply:* nodeId is constructed like this: datasetField:<namespace>:<dataset>:<field name>

      +

      *Thread Reply:* nodeId is constructed like this: datasetField:<namespace>:<dataset>:<field name>
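A sketch of querying column lineage for the dataset from this thread, building the nodeId from its parts. The host and port are hypothetical -- point this at the Marquez API, not the marquez-web UI.

```
import requests

namespace = "file"
dataset = "/home/jovyan/Downloads/event_attribute.csv"
field = "eventType"
node_id = f"datasetField:{namespace}:{dataset}:{field}"

resp = requests.get(
    "http://localhost:5000/api/v1/column-lineage",
    params={"nodeId": node_id},  # passed as a query param, URL-encoded by requests
)
print(resp.json())
```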

      @@ -143897,7 +144046,7 @@

      set the log level for the openlineage spark library

      2023-10-11 14:26:45
      -

*Thread Reply:* Newly added discussion topics:

+

*Thread Reply:* Newly added discussion topics:
• a proposal to add a Registry of Consumers and Producers
• a dbt issue to add OpenLineage Dataset names to the Manifest
• a proposal to add Dataset support in Spark LogicalPlan Nodes

@@ -143953,7 +144102,7 @@

      set the log level for the openlineage spark library

      2023-10-13 01:56:19
      -

      *Thread Reply:* just follow these instructions: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#build

      +

      *Thread Reply:* just follow these instructions: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#build

      @@ -143979,7 +144128,7 @@

      set the log level for the openlineage spark library

      2023-10-13 06:41:56
      -

*Thread Reply:* when trying to install openlineage-java in local via this command --> cd ../../client/java/ && ./gradlew publishToMavenLocal, i am receiving this error

+

*Thread Reply:* when trying to install openlineage-java in local via this command --> cd ../../client/java/ && ./gradlew publishToMavenLocal, i am receiving this error
```> Task :signMavenJavaPublication FAILED

      FAILURE: Build failed with an exception.

      @@ -144012,18 +144161,14 @@

      set the log level for the openlineage spark library

      2023-10-13 13:35:06
      -

      *Thread Reply:* @Paweł Leszczyński this is what I am getting

      +

      *Thread Reply:* @Paweł Leszczyński this is what I am getting

      @@ -144051,10 +144196,10 @@

      set the log level for the openlineage spark library

      2023-10-13 13:36:00
      -

      *Thread Reply:* attaching the html

      +

      *Thread Reply:* attaching the html

      2023-10-16 03:02:13
      -

*Thread Reply:* which java are you using? what is your operating system (is it windows?)?

      +

*Thread Reply:* which java are you using? what is your operating system (is it windows?)?

      @@ -144112,7 +144257,7 @@

      set the log level for the openlineage spark library

      2023-10-16 03:35:18
      -

      *Thread Reply:* yes it is Windows, i downloaded java 8 but I can try to build it with Linux subsystem or Mac

      +

      *Thread Reply:* yes it is Windows, i downloaded java 8 but I can try to build it with Linux subsystem or Mac

      @@ -144138,7 +144283,7 @@

      set the log level for the openlineage spark library

      2023-10-16 03:35:51
      -

      *Thread Reply:* In my case it is Mac

      +

      *Thread Reply:* In my case it is Mac

      @@ -144164,7 +144309,7 @@

      set the log level for the openlineage spark library

      2023-10-16 03:56:09
      -

*Thread Reply:* * Where:

+

*Thread Reply:* * Where:
Build file '/mnt/c/Users/jason/Downloads/github/OpenLineage/integration/spark/build.gradle' line: 9

* What went wrong:

@@ -144198,7 +144343,7 @@

      set the log level for the openlineage spark library

      2023-10-16 03:56:23
      -

      *Thread Reply:* tried with Linux subsystem

      +

      *Thread Reply:* tried with Linux subsystem

      @@ -144224,7 +144369,7 @@

      set the log level for the openlineage spark library

      2023-10-16 04:04:29
      -

      *Thread Reply:* we don't have any restrictions for windows builds, however it is something we don't test regularly. 2h ago we did have a successful build on circle CI https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/8271/workflows/0ec521ae-cd21-444a-bfec-554d101770ea

      +

      *Thread Reply:* we don't have any restrictions for windows builds, however it is something we don't test regularly. 2h ago we did have a successful build on circle CI https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/8271/workflows/0ec521ae-cd21-444a-bfec-554d101770ea

      @@ -144250,7 +144395,7 @@

      set the log level for the openlineage spark library

      2023-10-16 04:13:04
      -

*Thread Reply:* ... 111 more

+

*Thread Reply:* ... 111 more
Caused by: java.lang.ClassNotFoundException: org.gradle.api.provider.HasMultipleValues
... 117 more

      @@ -144278,7 +144423,7 @@

      set the log level for the openlineage spark library

      2023-10-17 00:26:07
      -

*Thread Reply:* @Paweł Leszczyński now I am doing gradlew instead of gradle on windows coz the Linux one doesn't work. The doc didn't mention setting up Spark / Hadoop and that's my original question -- do I need to set up local Spark? Now it's throwing an error on Hadoop: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.

      +

*Thread Reply:* @Paweł Leszczyński now I am doing gradlew instead of gradle on windows coz the Linux one doesn't work. The doc didn't mention setting up Spark / Hadoop and that's my original question -- do I need to set up local Spark? Now it's throwing an error on Hadoop: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.

      @@ -144304,7 +144449,7 @@

      set the log level for the openlineage spark library

      2023-10-21 23:33:48
      -

      *Thread Reply:* Got it working with Mac, couldn't get it working with Windows / Linux subsystem

      +

      *Thread Reply:* Got it working with Mac, couldn't get it working with Windows / Linux subsystem

      @@ -144330,7 +144475,7 @@

      set the log level for the openlineage spark library

      2023-10-22 13:08:40
      -

      *Thread Reply:* Now getting class not found despite build and test succeeded

      +

      *Thread Reply:* Now getting class not found despite build and test succeeded

      @@ -144356,7 +144501,7 @@

      set the log level for the openlineage spark library

      2023-10-22 21:46:23
      -

      *Thread Reply:* I uploaded the wrong jar.. there are so many jars, only the jar in the spark folder works, not subfolder

      +

      *Thread Reply:* I uploaded the wrong jar.. there are so many jars, only the jar in the spark folder works, not subfolder

      @@ -144434,7 +144579,7 @@

      set the log level for the openlineage spark library

      2023-10-16 00:01:42
      -

      *Thread Reply:* Hi @Paweł Leszczyński is this expected? CMIIW but we should expect to see the events being logged when running with debug log level right?

      +

      *Thread Reply:* Hi @Paweł Leszczyński is this expected? CMIIW but we should expect to see the events being logged when running with debug log level right?

      @@ -144460,7 +144605,7 @@

      set the log level for the openlineage spark library

      2023-10-16 04:17:30
      -

      *Thread Reply:* It's impossible to know without seeing how you've configured the listener.

      +

      *Thread Reply:* It's impossible to know without seeing how you've configured the listener.

      Can you show this configuration?

      @@ -144488,7 +144633,7 @@

      set the log level for the openlineage spark library

      2023-10-17 03:15:20
      -

*Thread Reply:* spark.openlineage.transport.url <url>

+

*Thread Reply:* spark.openlineage.transport.url <url>
spark.openlineage.transport.endpoint /<endpoint>
spark.openlineage.transport.type http
spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener

@@ -144520,7 +144665,7 @@

      set the log level for the openlineage spark library

      2023-10-17 04:40:03
      -

      *Thread Reply:* Two things:

      +

      *Thread Reply:* Two things:

      1. If you want debug logs, you're going to have to provide a log4j.properties file or log4j2.properties file depending on the version of spark you're running. In that file, you will need to configure the logging levels. If I am not mistaken, the sc.setLogLevel controls ONLY the log levels of Spark namespaced components (i.e., org.apache.spark)
2. You're telling the listener to emit to a URL. If you want to see the events emitted to the console, then set spark.openlineage.transport.type=console, and remove the other spark.openlineage.transport.* configurations. Do either (1) or (2); a sketch of (2) follows below.
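A minimal sketch of option (2), assuming the listener jar is pulled via packages; the version shown is illustrative:

```
from pyspark.sql import SparkSession

# Dump OpenLineage events to the console so you can inspect exactly what is
# emitted for a job, instead of sending them over HTTP.
spark = (
    SparkSession.builder.appName("ol-console-debug")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:1.4.1")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "console")
    .getOrCreate()
)
```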
@@ -144550,7 +144695,7 @@

        set the log level for the openlineage spark library

      2023-10-20 00:49:45
      -

      *Thread Reply:* @Damien Hawes Hi, sflr.

      +

      *Thread Reply:* @Damien Hawes Hi, sflr.

1. So enabling sc.setLogLevel does actually enable debug logs from Openlineage. I can see the events and everything being logged if I save it in parquet format instead of delta.
      2. I do want to emit events to the url. But, I would like to just see what exactly are the events being emitted for some specific jobs, since I see that the lineage is incorrect for some MergeInto cases
      @@ -144579,7 +144724,7 @@

      set the log level for the openlineage spark library

      2023-10-26 04:56:50
      -

      *Thread Reply:* Hi @Damien Hawes would like to check again on whether you'd have any thoughts about this... Thanks! 🙂

      +

      *Thread Reply:* Hi @Damien Hawes would like to check again on whether you'd have any thoughts about this... Thanks! 🙂

      @@ -144639,7 +144784,7 @@

      set the log level for the openlineage spark library

      2023-10-17 06:55:47
      -

*Thread Reply:* In general, we assume that OL events per run are cumulative. So, if you have 20 events with the same runId, then even if a single event contains some facet, we consider this OK and let the backend combine it together. That's what we do in the Marquez project (a reference backend architecture for OL) and that's why it is worth using Marquez as a REST API.

      +

*Thread Reply:* In general, we assume that OL events per run are cumulative. So, if you have 20 events with the same runId, then even if a single event contains some facet, we consider this OK and let the backend combine it together. That's what we do in the Marquez project (a reference backend architecture for OL) and that's why it is worth using Marquez as a REST API.

      Are you able to use job namespace to aggregate all the Spark actions run within the databricks notebook? This is something that should serve this purpose.
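A sketch of what "cumulative" means for a consumer, assuming events is a list of OpenLineage events already parsed from JSON:

```
from collections import defaultdict

def combine_runs(events):
    """Fold all events sharing a runId into one merged view of the run facets."""
    runs = defaultdict(dict)
    for event in events:
        run = event["run"]
        runs[run["runId"]].update(run.get("facets", {}))
    return runs
```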

      @@ -144667,7 +144812,7 @@

      set the log level for the openlineage spark library

      2023-10-17 12:48:33
      -

*Thread Reply:* @Rodrigo Maia for Spark 3.4 I don't see the environment-properties showing up at all, but if you run the code as it is, register a listener on SparkListenerJobStart and get the properties, all of those properties will show up. There's an event filter that filters out the SparkListenerJobStart; I suspect that filtered out the "unnecessary" events.. was trying to do a custom build to do that, but still trying to set up Hadoop and Spark on my local machine

      +

*Thread Reply:* @Rodrigo Maia for Spark 3.4 I don't see the environment-properties showing up at all, but if you run the code as it is, register a listener on SparkListenerJobStart and get the properties, all of those properties will show up. There's an event filter that filters out the SparkListenerJobStart; I suspect that filtered out the "unnecessary" events.. was trying to do a custom build to do that, but still trying to set up Hadoop and Spark on my local machine

      @@ -144693,18 +144838,14 @@

      set the log level for the openlineage spark library

      2023-10-18 05:23:16
      -

      *Thread Reply:* @Paweł Leszczyński you are right. This is what we are doing as well, combining events with the same runId to process the information on our backend. But even so, there are several runIds without this information. I went through these events to have a better view of what was happening. As you can see from 7 runIds, only 3 were showing the "environment-properties" attribute. Some condition is not being met here, or maybe it is what @Jason Yip suspects and there's some sort of filtering of unnecessary events

      +

      *Thread Reply:* @Paweł Leszczyński you are right. This is what we are doing as well, combining events with the same runId to process the information on our backend. But even so, there are several runIds without this information. I went through these events to have a better view of what was happening. As you can see from 7 runIds, only 3 were showing the "environment-properties" attribute. Some condition is not being met here, or maybe it is what @Jason Yip suspects and there's some sort of filtering of unnecessary events

      @@ -144732,7 +144873,7 @@

      set the log level for the openlineage spark library

      2023-10-19 02:28:03
      -

      *Thread Reply:* @Rodrigo Maia, If you are able to provide a small Spark script such that none of the OL events contain the environment-properties, but at least one should, please raise an issue for this.

      +

      *Thread Reply:* @Rodrigo Maia, If you are able to provide a small Spark script such that none of the OL events contain the environment-properties, but at least one should, please raise an issue for this.

      @@ -144758,7 +144899,7 @@

      set the log level for the openlineage spark library

      2023-10-19 02:29:11
      -

*Thread Reply:* It's extremely helpful when the community opens issues that are not only described well, but also contain a small piece of code needed to reproduce this.

      +

*Thread Reply:* It's extremely helpful when the community opens issues that are not only described well, but also contain a small piece of code needed to reproduce this.

      @@ -144784,7 +144925,7 @@

      set the log level for the openlineage spark library

      2023-10-19 02:59:39
      -

      *Thread Reply:* I know. that's the goal. that is why I wanted to understand in the first place if there was any condition preventing this from happening, but now i get that this is not expected behaviour.

      +

      *Thread Reply:* I know. that's the goal. that is why I wanted to understand in the first place if there was any condition preventing this from happening, but now i get that this is not expected behaviour.

      @@ -144814,7 +144955,7 @@

      set the log level for the openlineage spark library

      2023-10-19 13:44:00
      -

      *Thread Reply:* @Paweł Leszczyński @Rodrigo Maia I am referring to this: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/filters/DeltaEventFilter.java#L51

      +

      *Thread Reply:* @Paweł Leszczyński @Rodrigo Maia I am referring to this: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/filters/DeltaEventFilter.java#L51

      @@ -144865,7 +145006,7 @@

      set the log level for the openlineage spark library

      2023-10-19 14:49:03
      -

*Thread Reply:* Please note that I am getting the same behavior: no code is needed, Spark 3.4+ won't be generating events no matter what. I have been testing the same code for 2 months from this issue: https://github.com/OpenLineage/OpenLineage/issues/2124

      +

*Thread Reply:* Please note that I am getting the same behavior: no code is needed, Spark 3.4+ won't be generating events no matter what. I have been testing the same code for 2 months from this issue: https://github.com/OpenLineage/OpenLineage/issues/2124

      I tried the code without OL and it worked perfectly, so it is OL filtering out the event for sure. I will try posting the code I use to collect the properties.

      set the log level for the openlineage spark library
      2023-10-19 23:46:17
      -

*Thread Reply:* this code proves that the properties are still there; somehow they got filtered out by OL:

      +

*Thread Reply:* this code proves that the properties are still there; somehow they got filtered out by OL:

```%scala
import org.apache.spark.scheduler._

      @@ -145080,7 +145221,7 @@

      set the log level for the openlineage spark library

      2023-10-19 23:55:05
      -

*Thread Reply:* of course feel free to test this logic as well, it still works -- if not for the filtering:

      +

*Thread Reply:* of course feel free to test this logic as well, it still works -- if not for the filtering:

      https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/[…]ark/agent/facets/builder/DatabricksEnvironmentFacetBuilder.java

      set the log level for the openlineage spark library
      2023-10-30 04:46:16
      -

*Thread Reply:* Any ideas on how I could test it?

      +

*Thread Reply:* Any ideas on how I could test it?

      @@ -145333,7 +145474,7 @@

      set the log level for the openlineage spark library

      2023-10-18 01:32:01
      -

      *Thread Reply:* hey, did you try to follow one of these guides? +

      *Thread Reply:* hey, did you try to follow one of these guides? https://openlineage.io/docs/guides/about

      @@ -145360,7 +145501,7 @@

      set the log level for the openlineage spark library

      2023-10-18 09:14:08
      -

      *Thread Reply:* Which guide were you using, and what errors/issues are you encountering?

      +

      *Thread Reply:* Which guide were you using, and what errors/issues are you encountering?

      @@ -145386,7 +145527,7 @@

      set the log level for the openlineage spark library

      2023-10-21 15:43:14
      -

      *Thread Reply:* Thanks Jakub for the response.

      +

      *Thread Reply:* Thanks Jakub for the response.

      @@ -145412,18 +145553,14 @@

      set the log level for the openlineage spark library

      2023-10-21 15:45:42
      -

      *Thread Reply:* In docker, marquez-api image is not running and exiting with the exit code 127.

      +

      *Thread Reply:* In docker, marquez-api image is not running and exiting with the exit code 127.

      @@ -145451,7 +145588,7 @@

      set the log level for the openlineage spark library

      2023-10-22 09:34:53
      -

      *Thread Reply:* @ankit jain thanks. I don't recognize 127, but 9 times out of 10 if the API or DB container fails the reason is a port conflict. Have you checked if port 5000 is available?

      +

      *Thread Reply:* @ankit jain thanks. I don't recognize 127, but 9 times out of 10 if the API or DB container fails the reason is a port conflict. Have you checked if port 5000 is available?

      @@ -145477,7 +145614,7 @@

      set the log level for the openlineage spark library

      2023-10-22 09:54:10
      -

*Thread Reply:* could you please check what’s the output of

+

*Thread Reply:* could you please check what’s the output of
git config --get core.autocrlf
or
git config --global --get core.autocrlf

@@ -145507,7 +145644,7 @@

      set the log level for the openlineage spark library

      2023-10-24 08:09:14
      -

*Thread Reply:* @Michael Robinson thanks, I checked; port 5000 is not available.

+

*Thread Reply:* @Michael Robinson thanks, I checked; port 5000 is not available.
I tried deleting docker images and recreating them, but still the same issue persists, stating /usr/bin/env bash\r not found. Gradle build is successful.

      @@ -145536,7 +145673,7 @@

      set the log level for the openlineage spark library

      2023-10-24 08:09:54
      -

      *Thread Reply:* @Jakub Dardziński thanks, first command resulted as true and second command has no response

      +

      *Thread Reply:* @Jakub Dardziński thanks, first command resulted as true and second command has no response

      @@ -145562,7 +145699,7 @@

      set the log level for the openlineage spark library

      2023-10-24 08:15:57
      -

      *Thread Reply:* are you running docker and git in Windows or Mac OS before 10.0?

      +

      *Thread Reply:* are you running docker and git in Windows or Mac OS before 10.0?

      @@ -145616,7 +145753,7 @@

      set the log level for the openlineage spark library

      2023-10-20 02:57:50
      -

      *Thread Reply:* Hi, just checking, are you excluding the sparkPlan from the events? Or is it sending the spark plan too

      +

      *Thread Reply:* Hi, just checking, are you excluding the sparkPlan from the events? Or is it sending the spark plan too

      @@ -145642,7 +145779,7 @@

      set the log level for the openlineage spark library

      2023-10-23 11:59:40
      -

      *Thread Reply:* yeah - setting spark.openlineage.facets.disabled to [spark_unknown;spark.logicalPlan] should help

      +

      *Thread Reply:* yeah - setting spark.openlineage.facets.disabled to [spark_unknown;spark.logicalPlan] should help
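A sketch of that setting applied via session config; the bracketed, semicolon-separated value is taken verbatim from the reply above:

```
from pyspark.sql import SparkSession

# Disable the two heavyweight facets that tend to dominate event size.
spark = (
    SparkSession.builder
    .config("spark.openlineage.facets.disabled", "[spark_unknown;spark.logicalPlan]")
    .getOrCreate()
)
```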

      @@ -145668,7 +145805,7 @@

      set the log level for the openlineage spark library

      2023-10-24 17:50:26
      -

*Thread Reply:* sorry for the late reply - turns out this job is just whack 😄 we were going in circles trying to figure it out, and we ended up dropping events without OpenLineage enabled at all. But good to know that disabling the logical plan should speed us up if we run into this again

      +

*Thread Reply:* sorry for the late reply - turns out this job is just whack 😄 we were going in circles trying to figure it out, and we ended up dropping events without OpenLineage enabled at all. But good to know that disabling the logical plan should speed us up if we run into this again

      @@ -145732,7 +145869,7 @@

      set the log level for the openlineage spark library

      2023-10-23 04:56:25
      -

*Thread Reply:* Hmm, that is interesting. Did it occur on databricks runtime? Could you give it a try with Scala 2.12? I think we don't test scala 2.13.

      +

*Thread Reply:* Hmm, that is interesting. Did it occur on databricks runtime? Could you give it a try with Scala 2.12? I think we don't test scala 2.13.

      @@ -145758,7 +145895,7 @@

      set the log level for the openlineage spark library

      2023-10-23 12:02:13
      -

      *Thread Reply:* I believe our Scala 2.12 jobs are working fine. It's not databricks runtime. We run Spark on Kube.

      +

      *Thread Reply:* I believe our Scala 2.12 jobs are working fine. It's not databricks runtime. We run Spark on Kube.

      @@ -145784,7 +145921,59 @@

      set the log level for the openlineage spark library

      2023-10-24 06:47:14
      -

*Thread Reply:* Ok. I think you can raise an issue to support Scala 2.13 for the latest Spark versions.

      +

*Thread Reply:* Ok. I think you can raise an issue to support Scala 2.13 for the latest Spark versions.

+

Damien Hawes (damien.hawes@booking.com)
2023-12-08 05:57:55

*Thread Reply:* Yeah - this just hit me yesterday.

+

Damien Hawes (damien.hawes@booking.com)
2023-12-08 05:58:29

*Thread Reply:* I've created a ticket for it, it wasn't a fun surprise, that's for sure.

      @@ -145836,7 +146025,7 @@

      set the log level for the openlineage spark library

      2023-10-26 07:45:41
      -

      *Thread Reply:* Hi @priya narayana, please get familiar with Extending section on our docs: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#extending

      +

      *Thread Reply:* Hi @priya narayana, please get familiar with Extending section on our docs: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#extending

      @@ -145862,7 +146051,7 @@

      set the log level for the openlineage spark library

      2023-10-26 09:53:07
      -

*Thread Reply:* Okay, thank you. Just checking if there are any other docs or git code which could also help me

      +

*Thread Reply:* Okay, thank you. Just checking if there are any other docs or git code which could also help me

      @@ -145917,15 +146106,11 @@

      set the log level for the openlineage spark library

      Im upgrading the version from openlineage-airflow==0.24.0 to openlineage-airflow 1.4.1 but im seeing the following error, any help is appreciated

      @@ -145953,7 +146138,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:14:02
      -

      *Thread Reply:* @Jakub Dardziński any thoughts?

      +

      *Thread Reply:* @Jakub Dardziński any thoughts?

      @@ -145979,7 +146164,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:14:24
      -

      *Thread Reply:* what version of Airflow are you using?

      +

      *Thread Reply:* what version of Airflow are you using?

      @@ -146005,7 +146190,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:14:52
      -

      *Thread Reply:* 2.6.3 that satisfies the requirement

      +

      *Thread Reply:* 2.6.3 that satisfies the requirement

      @@ -146031,7 +146216,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:16:38
      -

      *Thread Reply:* is it possible you have some custom operator?

      +

      *Thread Reply:* is it possible you have some custom operator?

      @@ -146057,7 +146242,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:17:15
      -

      *Thread Reply:* i think its the base operator causing the issue

      +

      *Thread Reply:* i think its the base operator causing the issue

      @@ -146083,7 +146268,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:17:36
      -

      *Thread Reply:* so no i believe

      +

      *Thread Reply:* so no i believe

      @@ -146109,7 +146294,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:18:43
      -

      *Thread Reply:* BaseOperator is parent class for any other operators, it defines how to do deepcopy

      +

      *Thread Reply:* BaseOperator is parent class for any other operators, it defines how to do deepcopy

      @@ -146135,7 +146320,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:19:11
      -

      *Thread Reply:* yeah so its controlled by Airflow itself, I didnt customize it

      +

      *Thread Reply:* yeah so its controlled by Airflow itself, I didnt customize it

      @@ -146161,7 +146346,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:19:49
      -

      *Thread Reply:* uhm, maybe it's possible you could share dag code? you may hide sensitive data

      +

      *Thread Reply:* uhm, maybe it's possible you could share dag code? you may hide sensitive data

      @@ -146187,7 +146372,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:21:23
      -

      *Thread Reply:* let me try with lower versions of openlineage, what's say

      +

      *Thread Reply:* let me try with lower versions of openlineage, what's say

      @@ -146213,7 +146398,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:21:39
      -

      *Thread Reply:* its a big jump from 0.24.0 to 1.4.1

      +

      *Thread Reply:* its a big jump from 0.24.0 to 1.4.1

      @@ -146239,7 +146424,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:22:25
      -

      *Thread Reply:* but i will help here to investigate this issue

      +

      *Thread Reply:* but i will help here to investigate this issue

      @@ -146265,7 +146450,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:24:03
      -

      *Thread Reply:* for me it seems that within dag or task you're defining some object that is not easy to copy

      +

      *Thread Reply:* for me it seems that within dag or task you're defining some object that is not easy to copy

      @@ -146291,7 +146476,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:26:05
      -

*Thread Reply:* possible, but with 0.24.0 that issue is not occurring, so the worry is that the version upgrade could potentially break things

      +

*Thread Reply:* possible, but with 0.24.0 that issue is not occurring, so the worry is that the version upgrade could potentially break things

      @@ -146317,7 +146502,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:39:34
      -

      *Thread Reply:* 0.24.0 is not that old 🤔

      +

      *Thread Reply:* 0.24.0 is not that old 🤔

      @@ -146343,7 +146528,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:45:07
      -

*Thread Reply:* i see the issue with 0.24.0 I see it as warning

+

*Thread Reply:* i see the issue with 0.24.0 I see it as warning
[airflow/utils/log/logging_mixin.py::_propagate_log()::150] WARNING - File "/usr/lib64/python3.8/threading.py", line 932, in _bootstrap_inner
[2023-10-26, 17:40:50 UTC] [airflow/utils/log/logging_mixin.py::_propagate_log()::150] WARNING - self.run()
[2023-10-26, 17:40:50 UTC] [airflow/utils/log/logging_mixin.py::_propagate_log()::150] WARNING - File "/usr/lib64/python3.8/threading.py", line 870, in run

@@ -146427,18 +146612,14 @@

      set the log level for the openlineage spark library

      2023-10-26 14:18:08
      -

*Thread Reply:* I see the difference in calling between these 2 versions: the current version checks if Airflow is >2.6 and then directly runs on_running, but the earlier version was running it on a separate thread. Is this what's raising this exception?

      +

*Thread Reply:* I see the difference in calling between these 2 versions: the current version checks if Airflow is >2.6 and then directly runs on_running, but the earlier version was running it on a separate thread. Is this what's raising this exception?

      @@ -146466,7 +146647,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:24:49
      -

      *Thread Reply:* this is the issue - https://github.com/OpenLineage/OpenLineage/blob/c343835c1664eda94d5c315897ae6702854c81bd/integration/airflow/openlineage/airflow/listener.py#L89 while copying the task

      +

      *Thread Reply:* this is the issue - https://github.com/OpenLineage/OpenLineage/blob/c343835c1664eda94d5c315897ae6702854c81bd/integration/airflow/openlineage/airflow/listener.py#L89 while copying the task

      @@ -146517,7 +146698,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:25:21
      -

*Thread Reply:* since we are directly running if version > 2.6.0, it's throwing the error in the main process

      +

*Thread Reply:* since we are directly running if version > 2.6.0, it's throwing the error in the main process

      @@ -146543,7 +146724,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:28:02
      -

      *Thread Reply:* may i know which Airflow version we tested this process?

      +

      *Thread Reply:* may i know which Airflow version we tested this process?

      @@ -146569,7 +146750,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:28:39
      -

      *Thread Reply:* im on 2.6.3

      +

      *Thread Reply:* im on 2.6.3

      @@ -146595,7 +146776,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:30:53
      -

      *Thread Reply:* 2.1.4, 2.2.4, 2.3.4, 2.4.3, 2.5.2, 2.6.1 +

*Thread Reply:* 2.1.4, 2.2.4, 2.3.4, 2.4.3, 2.5.2, 2.6.1
usually there are not too many changes between minor versions

      I still believe it might be some code you might improve and probably is also an antipattern in airflow

      @@ -146624,7 +146805,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:34:26
      -

*Thread Reply:* hummm...that's a valid observation, but I don't write the DAGs, other teams do, so imagine if many people wrote such DAGs - I can't ask everyone to change their patterns, right? If something runs on the current openlineage version with a warning, it should still run on the upgraded version, shouldn't it?

      +

*Thread Reply:* hummm...that's a valid observation, but I don't write the DAGs, other teams do, so imagine if many people wrote such DAGs - I can't ask everyone to change their patterns, right? If something runs on the current openlineage version with a warning, it should still run on the upgraded version, shouldn't it?

      @@ -146650,7 +146831,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:38:04
      -

      *Thread Reply:* however I see ur point

      +

      *Thread Reply:* however I see ur point

      @@ -146676,7 +146857,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:49:52
      -

*Thread Reply:* So that specific task has a 570-line, pretty bulky query; let me split it into smaller units

      +

*Thread Reply:* So that specific task has a 570-line, pretty bulky query; let me split it into smaller units

      @@ -146702,7 +146883,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:50:15
      -

      *Thread Reply:* that should help right? @Jakub Dardziński

      +

      *Thread Reply:* that should help right? @Jakub Dardziński

      @@ -146728,7 +146909,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:51:27
      -

      *Thread Reply:* query length shouldn’t be the issue, rather any python code

      +

      *Thread Reply:* query length shouldn’t be the issue, rather any python code

      @@ -146754,7 +146935,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:51:50
      -

      *Thread Reply:* I get your point too, we might figure out some mechanism to skip irrelevant parts of task instance so that it doesn’t fail then

      +

      *Thread Reply:* I get your point too, we might figure out some mechanism to skip irrelevant parts of task instance so that it doesn’t fail then

      @@ -146780,7 +146961,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:52:12
      -

      *Thread Reply:* actually its failing on that task itself

      +

      *Thread Reply:* actually its failing on that task itself

      @@ -146806,7 +146987,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:52:33
      -

      *Thread Reply:* let me try it will be pretty quick

      +

      *Thread Reply:* let me try it will be pretty quick

      @@ -146832,7 +147013,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:58:58
      -

*Thread Reply:* @Jakub Dardziński but ur right, we have to fix this on the OpenLineage side as well. Because ideally OpenLineage shouldn't be causing any issue to the main DAG processing

      +

*Thread Reply:* @Jakub Dardziński but ur right, we have to fix this on the OpenLineage side as well. Because ideally OpenLineage shouldn't be causing any issue to the main DAG processing

      @@ -146858,7 +147039,7 @@

      set the log level for the openlineage spark library

      2023-10-26 17:51:05
      -

      *Thread Reply:* it doesn’t break any airflow functionality, execution is wrapped into try/except block, only exception traceback is logged as you can see

      +

      *Thread Reply:* it doesn’t break any airflow functionality, execution is wrapped into try/except block, only exception traceback is logged as you can see

      @@ -146884,7 +147065,7 @@

      set the log level for the openlineage spark library

      2023-10-27 05:25:54
      -

      *Thread Reply:* Can you migrate to Airflow 2.7 and use apache-airflow-providers-openlineage? Ideally we wouldn't make meaningful changes to openlineage-airflow

      +

      *Thread Reply:* Can you migrate to Airflow 2.7 and use apache-airflow-providers-openlineage? Ideally we wouldn't make meaningful changes to openlineage-airflow

      @@ -146910,7 +147091,7 @@

      set the log level for the openlineage spark library

      2023-10-27 11:35:44
      -

      *Thread Reply:* yup thats what im planning to do

      +

      *Thread Reply:* yup thats what im planning to do

      @@ -146936,7 +147117,7 @@

      set the log level for the openlineage spark library

      2023-10-27 13:59:03
      -

*Thread Reply:* referencing this conversation (https://openlineage.slack.com/archives/C01CK9T7HKR/p1698398754823079?threadts=1698340358.557159&cid=C01CK9T7HKR) - what it takes to move to the openlineage provider package from openlineage-airflow. I'm updating Airflow to 2.7.2 and moving off of openlineage-airflow to the provider package; I'm trying to estimate the amount of work it takes, any thoughts? Reading the changelogs I don't think it's too much of a change, but please share your thoughts, and if it's drafted somewhere please do share that as well

      +

*Thread Reply:* referencing this conversation (https://openlineage.slack.com/archives/C01CK9T7HKR/p1698398754823079?threadts=1698340358.557159&cid=C01CK9T7HKR) - what it takes to move to the openlineage provider package from openlineage-airflow. I'm updating Airflow to 2.7.2 and moving off of openlineage-airflow to the provider package; I'm trying to estimate the amount of work it takes, any thoughts? Reading the changelogs I don't think it's too much of a change, but please share your thoughts, and if it's drafted somewhere please do share that as well

      @@ -146962,7 +147143,7 @@

      set the log level for the openlineage spark library

      2023-10-30 08:21:10
      -

*Thread Reply:* Generally not much - I would maybe think of operator coverage. For example, for BigQuery the old openlineage-airflow supports BigQueryExecuteQueryOperator. However, the new apache-airflow-providers-openlineage supports BigQueryInsertJobOperator - because it's the intended replacement for BigQueryExecuteQueryOperator and the Airflow community does not want to accept contributions to deprecated operators.

      +

*Thread Reply:* Generally not much - I would maybe think of operator coverage. For example, for BigQuery the old openlineage-airflow supports BigQueryExecuteQueryOperator. However, the new apache-airflow-providers-openlineage supports BigQueryInsertJobOperator - because it's the intended replacement for BigQueryExecuteQueryOperator and the Airflow community does not want to accept contributions to deprecated operators.
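For reference, a sketch of the replacement operator in use; the project, dataset, and SQL are placeholders:

```
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

insert_job = BigQueryInsertJobOperator(
    task_id="bq_insert_job",
    configuration={
        "query": {
            "query": "SELECT * FROM `my_project.my_dataset.my_table`",
            "useLegacySql": False,
        }
    },
)
```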

      @@ -146992,7 +147173,7 @@

      set the log level for the openlineage spark library

      2023-10-31 15:00:38
      -

*Thread Reply:* one question if someone is around - when im keeping both openlineage-airflow and apache-airflow-providers-openlineage in my requirement file, i see the following error -

+

*Thread Reply:* one question if someone is around - when im keeping both openlineage-airflow and apache-airflow-providers-openlineage in my requirement file, i see the following error -
from openlineage.airflow.extractors import Extractors
ModuleNotFoundError: No module named 'openlineage.airflow'
any thoughts?

      @@ -147021,7 +147202,7 @@

      set the log level for the openlineage spark library

      2023-10-31 15:37:07
      -

      *Thread Reply:* I would usually do a pip freeze | grep openlineage as a sanity check to validate that the module is actually installed. Not sure how the provider and the module play together though

      +

      *Thread Reply:* I would usually do a pip freeze | grep openlineage as a sanity check to validate that the module is actually installed. Not sure how the provider and the module play together though

      @@ -147047,7 +147228,7 @@

      set the log level for the openlineage spark library

      2023-10-31 17:07:41
      -

*Thread Reply:* yeah so @John Lukenoff im not getting how i can use the specific extractor when i run my operator. Say for example, I have a custom datawarehouseOperator and i want to override get_openlineage_facets_on_start and get_openlineage_facets_on_complete using the redshift extractor - then how would i do that?

      +

*Thread Reply:* yeah so @John Lukenoff im not getting how i can use the specific extractor when i run my operator. Say for example, I have a custom datawarehouseOperator and i want to override get_openlineage_facets_on_start and get_openlineage_facets_on_complete using the redshift extractor - then how would i do that?
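Under the provider package, one approach is to implement the lineage methods on the operator itself rather than registering an extractor. A sketch, with hypothetical operator, namespace, and table names:

```
from airflow.models.baseoperator import BaseOperator
from airflow.providers.openlineage.extractors import OperatorLineage
from openlineage.client.run import Dataset

class DataWarehouseOperator(BaseOperator):
    def execute(self, context):
        ...  # run the warehouse query

    def get_openlineage_facets_on_complete(self, task_instance):
        # Return the datasets this task read and wrote.
        return OperatorLineage(
            inputs=[Dataset(namespace="redshift://cluster:5439", name="schema.source_table")],
            outputs=[Dataset(namespace="redshift://cluster:5439", name="schema.target_table")],
        )
```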

      @@ -147130,7 +147311,7 @@

      set the log level for the openlineage spark library

      2023-10-30 09:03:57
      -

*Thread Reply:* In general, I think this kind of use case is probably best served by facets, but what do you think @Paweł Leszczyński?

      +

*Thread Reply:* In general, I think this kind of use case is probably best served by facets, but what do you think @Paweł Leszczyński?

      openlineage.io
      @@ -147252,7 +147433,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:04:30
      -

      *Thread Reply:* Hmm, have you looked over our Running on AWS docs?

      +

      *Thread Reply:* Hmm, have you looked over our Running on AWS docs?

      @@ -147278,7 +147459,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:06:08
      -

      *Thread Reply:* More specifically, the AWS RDS section. How are you deploying Marquez on Ec2?

      +

      *Thread Reply:* More specifically, the AWS RDS section. How are you deploying Marquez on Ec2?

      @@ -147304,7 +147485,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:08:05
      -

      *Thread Reply:* we were primarily referencing this document on git - https://github.com/MarquezProject/marquez

      +

      *Thread Reply:* we were primarily referencing this document on git - https://github.com/MarquezProject/marquez

      @@ -147359,7 +147540,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:09:05
      -

      *Thread Reply:* leveraged docker and docker-compose

      +

      *Thread Reply:* leveraged docker and docker-compose

      @@ -147385,7 +147566,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:13:10
      -

      *Thread Reply:* hmm so you’re running docker-compose up on an Ec2 instance you’ve ssh’d into? (just trying to understand your setup better)

      +

      *Thread Reply:* hmm so you’re running docker-compose up on an Ec2 instance you’ve ssh’d into? (just trying to understand your setup better)

      @@ -147411,7 +147592,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:13:26
      -

      *Thread Reply:* yes, thats correct

      +

      *Thread Reply:* yes, thats correct

      @@ -147437,7 +147618,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:16:39
      -

*Thread Reply:* I’ve only used docker compose for local dev or integration tests. but, ok, you’re probably in the PoC phase. Can you run the docker cmd on your local machine successfully? What OS is installed on the Ec2 instance?

      +

*Thread Reply:* I’ve only used docker compose for local dev or integration tests. but, ok, you’re probably in the PoC phase. Can you run the docker cmd on your local machine successfully? What OS is installed on the Ec2 instance?

      @@ -147463,7 +147644,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:18:00
      -

*Thread Reply:* yes, I can run it, and the OS is Ubuntu 20.04.6 LTS

      +

*Thread Reply:* yes, I can run it, and the OS is Ubuntu 20.04.6 LTS

      @@ -147489,7 +147670,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:19:27
      -

*Thread Reply:* we initially ran into a permission denied error related to the postgresql.conf file and had to update the file permissions to 777, after which we started to see the errors below

      +

*Thread Reply:* we initially ran into a permission denied error related to the postgresql.conf file and had to update the file permissions to 777, after which we started to see the errors below

      @@ -147515,7 +147696,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:19:36
      -

*Thread Reply:* marquez-db | 2023-10-27 20:35:52.512 GMT [35] FATAL: no pg_hba.conf entry for host "172.18.0.5", user "marquez", database "marquez", no encryption +

*Thread Reply:* marquez-db | 2023-10-27 20:35:52.512 GMT [35] FATAL: no pg_hba.conf entry for host "172.18.0.5", user "marquez", database "marquez", no encryption marquez-db | 2023-10-27 20:35:52.529 GMT [36] FATAL: no pg_hba.conf entry for host "172.18.0.5", user "marquez", database "marquez", no encryption

      @@ -147542,7 +147723,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:20:12
      -

      *Thread Reply:* we then manually updated pg_hba.conf file to include host user and db details

      +

      *Thread Reply:* we then manually updated pg_hba.conf file to include host user and db details

      @@ -147568,7 +147749,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:20:42
      -

      *Thread Reply:* Did you also update the marquez.yml with the db user / password?

      +

      *Thread Reply:* Did you also update the marquez.yml with the db user / password?

      @@ -147594,7 +147775,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:20:48
      -

      *Thread Reply:* after which we started to see the errors posted in the github open issues page

      +

      *Thread Reply:* after which we started to see the errors posted in the github open issues page

      @@ -147620,7 +147801,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:21:33
      -

      *Thread Reply:* hmm are you using an external database or are you spinning up the entire Marquez stack with docker compose?

      +

      *Thread Reply:* hmm are you using an external database or are you spinning up the entire Marquez stack with docker compose?

      @@ -147646,7 +147827,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:21:56
      -

      *Thread Reply:* we are spinning up the entire Marquez stack with docker compose

      +

      *Thread Reply:* we are spinning up the entire Marquez stack with docker compose

      @@ -147672,7 +147853,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:23:24
      -

*Thread Reply:* we did not change anything in the marquez.yml - I think we did not find that file in the GitHub repo that we cloned onto our local instance

      +

*Thread Reply:* we did not change anything in the marquez.yml - I think we did not find that file in the GitHub repo that we cloned onto our local instance

      @@ -147698,7 +147879,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:26:31
      -

      *Thread Reply:* It’s important that the init-db.sh script runs, but I don’t think it is

      +

      *Thread Reply:* It’s important that the init-db.sh script runs, but I don’t think it is

      @@ -147724,7 +147905,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:26:56
      -

      *Thread Reply:* can you grab all the docker compose logs and share them? it’s hard to debug otherwise

      +

      *Thread Reply:* can you grab all the docker compose logs and share them? it’s hard to debug otherwise

      @@ -147750,10 +147931,10 @@

      set the log level for the openlineage spark library

      2023-10-27 17:29:59
      -

      *Thread Reply:*

      +

      *Thread Reply:*

      - + @@ -147785,7 +147966,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:33:15
      -

      *Thread Reply:* I would first suggest to remove the --build flag since you are specifying a version of Marquez to use via --tag

      +

      *Thread Reply:* I would first suggest to remove the --build flag since you are specifying a version of Marquez to use via --tag

      @@ -147811,7 +147992,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:33:49
      -

*Thread Reply:* not the issue per se, but it will help clear up some of the logs

      +

*Thread Reply:* not the issue per se, but it will help clear up some of the logs

      @@ -147837,7 +148018,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:35:06
      -

*Thread Reply:* for sure, thanks. We could get the logs without the --build portion; we tried with that option just once

      +

*Thread Reply:* for sure, thanks. We could get the logs without the --build portion; we tried with that option just once

      @@ -147863,7 +148044,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:35:40
      -

      *Thread Reply:* the errors were the same with/without --build option

      +

      *Thread Reply:* the errors were the same with/without --build option

      @@ -147889,7 +148070,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:36:02
      -

      *Thread Reply:* marquez-api | ERROR [2023-10-27 21:34:58,019] org.apache.tomcat.jdbc.pool.ConnectionPool: Unable to create initial connections of pool. +

      *Thread Reply:* marquez-api | ERROR [2023-10-27 21:34:58,019] org.apache.tomcat.jdbc.pool.ConnectionPool: Unable to create initial connections of pool. marquez-api | ! org.postgresql.util.PSQLException: FATAL: password authentication failed for user "marquez" marquez-api | ! at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:693) marquez-api | ! at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:203) @@ -147946,7 +148127,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:38:52
      -

      *Thread Reply:* debugging docker issues like this is so difficult

      +

      *Thread Reply:* debugging docker issues like this is so difficult

      @@ -147972,7 +148153,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:40:44
      -

      *Thread Reply:* it could be a number of things, but you are connected to the database it’s just that the marquez user hasn’t been created

      +

      *Thread Reply:* it could be a number of things, but you are connected to the database it’s just that the marquez user hasn’t been created

      @@ -147998,7 +148179,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:41:59
      -

      *Thread Reply:* the /init-db.sh is what manages user creation

      +

      *Thread Reply:* the /init-db.sh is what manages user creation
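For anyone debugging the same setup, two quick checks that can confirm whether the init script ran (the container name is assumed from the marquez-db log lines above):

```
# did the init script log anything?
docker compose logs | grep -i init-db
# does the marquez role exist? this fails if init-db.sh never ran
docker exec -it marquez-db psql -U marquez -d marquez -c '\du'
```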

      @@ -148024,7 +148205,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:42:17
      -

      *Thread Reply:* so it’s possible that the script isn’t running for whatever reason on your Ec2 instance

      +

      *Thread Reply:* so it’s possible that the script isn’t running for whatever reason on your Ec2 instance

      @@ -148050,7 +148231,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:44:20
      -

      *Thread Reply:* do you have other services running on that Ec2 instance? Like, other than Marquez

      +

      *Thread Reply:* do you have other services running on that Ec2 instance? Like, other than Marquez

      @@ -148076,7 +148257,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:44:52
      -

      *Thread Reply:* is there a postgres process running outside of docker?

      +

      *Thread Reply:* is there a postgres process running outside of docker?

      @@ -148102,7 +148283,7 @@

      set the log level for the openlineage spark library

      2023-10-27 20:34:50
      -

      *Thread Reply:* no other services except marquez on this EC2 instance

      +

      *Thread Reply:* no other services except marquez on this EC2 instance

      @@ -148128,7 +148309,7 @@

      set the log level for the openlineage spark library

      2023-10-27 20:35:49
      -

      *Thread Reply:* this was a new Ec2 instance that was spun up to install and use marquez

      +

      *Thread Reply:* this was a new Ec2 instance that was spun up to install and use marquez

      @@ -148154,7 +148335,7 @@

      set the log level for the openlineage spark library

      2023-10-27 20:36:09
      -

*Thread Reply:* and we can confirm that no postgres process runs outside of docker

      +

*Thread Reply:* and we can confirm that no postgres process runs outside of docker

      @@ -148206,7 +148387,7 @@

      set the log level for the openlineage spark library

      2023-10-30 09:59:53
      -

      *Thread Reply:* hi @Jason Yip could you provide an example of such a job?

      +

      *Thread Reply:* hi @Jason Yip could you provide an example of such a job?

      @@ -148232,7 +148413,7 @@

      set the log level for the openlineage spark library

      2023-10-30 16:51:55
      -

      *Thread Reply:* @Paweł Leszczyński same old:

      +

      *Thread Reply:* @Paweł Leszczyński same old:

      delete the old table if needed

      @@ -148384,7 +148565,7 @@

      show data

      2023-10-31 06:34:05
      -

      *Thread Reply:* Can you show the actual event? Should be in the events tab in Marquez

      +

      *Thread Reply:* Can you show the actual event? Should be in the events tab in Marquez

      @@ -148410,7 +148591,7 @@

      show data

      2023-10-31 11:59:07
      -

*Thread Reply:* @John Lukenoff, would you mind posting the link to the Marquez team's slack channel?

      +

*Thread Reply:* @John Lukenoff, would you mind posting the link to the Marquez team's slack channel?

      @@ -148436,7 +148617,7 @@

      show data

      2023-10-31 12:15:37
      -

      *Thread Reply:* yep here is the link: https://marquezproject.slack.com/archives/C01E8MQGJP7/p1698702140709439

      +

      *Thread Reply:* yep here is the link: https://marquezproject.slack.com/archives/C01E8MQGJP7/p1698702140709439

      This is the full event, sanitized of internal info: { @@ -148514,7 +148695,7 @@

      show data

      2023-10-31 12:43:07
      -

      *Thread Reply:* thank you!

      +

      *Thread Reply:* thank you!

      @@ -148540,7 +148721,7 @@

      show data

      2023-10-31 13:14:29
      -

*Thread Reply:* @John Lukenoff, sorry to trouble you again - is the slack channel still active? For whatever reason I can't get to this workspace

      +

*Thread Reply:* @John Lukenoff, sorry to trouble you again - is the slack channel still active? For whatever reason I can't get to this workspace

      @@ -148566,7 +148747,7 @@

      show data

      2023-10-31 13:15:26
      -

      *Thread Reply:* yep it’s still active, maybe you need to join the workspace first? https://join.slack.com/t/marquezproject/shared_invite/zt-266fdhg9g-TE7e0p~EHK50GJMMqNH4tg

      +

      *Thread Reply:* yep it’s still active, maybe you need to join the workspace first? https://join.slack.com/t/marquezproject/shared_invite/zt-266fdhg9g-TE7e0p~EHK50GJMMqNH4tg

      @@ -148592,7 +148773,7 @@

      show data

      2023-10-31 13:25:51
      -

      *Thread Reply:* that was a good call. the link you just shared worked! thank you!

      +

      *Thread Reply:* that was a good call. the link you just shared worked! thank you!

      @@ -148618,7 +148799,7 @@

      show data

      2023-10-31 13:27:55
      -

      *Thread Reply:* yeah from OL perspective this looks good - the inputs and outputs are there, the extraction error facet looks like it should

      +

      *Thread Reply:* yeah from OL perspective this looks good - the inputs and outputs are there, the extraction error facet looks like it should

      @@ -148644,7 +148825,7 @@

      show data

      2023-10-31 13:28:05
      -

      *Thread Reply:* must be some Marquez hiccup 🙂

      +

      *Thread Reply:* must be some Marquez hiccup 🙂

      @@ -148674,7 +148855,7 @@

      show data

      2023-10-31 13:28:45
      -

      *Thread Reply:* Makes sense, I’ll tail my marquez logs today to see if I can find anything

      +

      *Thread Reply:* Makes sense, I’ll tail my marquez logs today to see if I can find anything

      @@ -148700,7 +148881,7 @@

      show data

      2023-11-01 19:37:06
      -

      *Thread Reply:* Somehow this started working after we switched from our beta to prod infrastructure. I suspect something was failing due to constraints on the size of our db and the load of poor quality data it was under after months of testing against it

      +

      *Thread Reply:* Somehow this started working after we switched from our beta to prod infrastructure. I suspect something was failing due to constraints on the size of our db and the load of poor quality data it was under after months of testing against it

      @@ -148772,7 +148953,7 @@

      show data

      2023-11-02 05:11:58
      -

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated within 2 business days.

      +

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated within 2 business days.

      @@ -148825,7 +149006,7 @@

      show data

      - 👍 Mars Lan, harsh loomba + 👍 Mars Lan (Metaphor), harsh loomba
      @@ -148880,7 +149061,7 @@

      show data

      2023-11-02 03:34:15
      -

      *Thread Reply:* Hi John, we're always happy to help with the contribution.

      +

      *Thread Reply:* Hi John, we're always happy to help with the contribution.

      One of the possible solutions to this would be to do that just in openlineage-java client: • introduce config entry like normalizeDatasetNameToAscii : enabled/disabled @@ -148913,7 +149094,7 @@

      show data

      2023-11-02 03:34:47
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/main/client/java/src/main/java/io/openlineage/client/utils/DatasetIdentifier.java

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/main/client/java/src/main/java/io/openlineage/client/utils/DatasetIdentifier.java

      @@ -149037,7 +149218,7 @@

      show data

      2023-11-02 02:41:50
      -

      *Thread Reply:* there’s actually an issue for that: +

      *Thread Reply:* there’s actually an issue for that: https://github.com/OpenLineage/OpenLineage/issues/2189

      but the way to do this is imho to create new custom transport (it might inherit from HTTP transport) and register it in transport factory

      @@ -149066,7 +149247,7 @@

      show data

      2023-11-02 13:05:05
      -

      *Thread Reply:* I am thinking of just modifying the HTTP transport and using requests.auth.AuthBase to create different auth methods instead of a TokenProvider class

      +

      *Thread Reply:* I am thinking of just modifying the HTTP transport and using requests.auth.AuthBase to create different auth methods instead of a TokenProvider class

      Classes which subclass requests.auth.AuthBase can also just directly be given to the requests call in the auth parameter
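A minimal sketch of that idea - a requests.auth.AuthBase subclass handed straight to requests via the auth parameter (the header scheme and token value are illustrative):

```
import requests
from requests.auth import AuthBase

class BearerAuth(AuthBase):
    """Attaches a bearer token to every outgoing request."""
    def __init__(self, token: str):
        self.token = token

    def __call__(self, r):
        r.headers["Authorization"] = f"Bearer {self.token}"
        return r

requests.post("https://example.com/api/v1/lineage", json={}, auth=BearerAuth("my-token"))
```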

      @@ -149098,7 +149279,7 @@

      show data

      2023-11-02 14:40:24
      -

      *Thread Reply:* would you like to contribute? 🙂

      +

      *Thread Reply:* would you like to contribute? 🙂

      @@ -149124,7 +149305,7 @@

      show data

      2023-11-02 14:43:05
      -

*Thread Reply:* I was about to contribute, but I actually just realized that there is an existing way to provide a custom transport that would solve for my use case. My only question is how do I register this custom transport in my MWAA environment? Can I provide the custom transport as an Airflow plugin and then specify the class in the openlineage.yml config? Will it automatically pick it up?

      +

*Thread Reply:* I was about to contribute, but I actually just realized that there is an existing way to provide a custom transport that would solve for my use case. My only question is how do I register this custom transport in my MWAA environment? Can I provide the custom transport as an Airflow plugin and then specify the class in the openlineage.yml config? Will it automatically pick it up?

      @@ -149150,7 +149331,7 @@

      show data

      2023-11-02 15:45:56
      -

*Thread Reply:* although I did not test this in MWAA, only locally: I’ve created an Airflow plugin whose __init__.py defines (or imports) the following code: +

*Thread Reply:* although I did not test this in MWAA, only locally: I’ve created an Airflow plugin whose __init__.py defines (or imports) the following code: +```from openlineage.client.transport import register_transport, Transport, Config

      @register_transport @@ -149191,7 +149372,7 @@

      show data

      2023-11-02 15:47:45
      -

      *Thread Reply:* in setup.py it’s: +

      *Thread Reply:* in setup.py it’s: ..., entry_points={ 'airflow.plugins': [ @@ -149225,7 +149406,7 @@

      show data

      2023-11-03 12:52:55
      -

      *Thread Reply:* ok great thanks for following up on this, super helpful

      +

      *Thread Reply:* ok great thanks for following up on this, super helpful
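Putting the fragments above together, a sketch of what the plugin's __init__.py can look like (the transport class, its kind value, and the print-based emit are illustrative stand-ins; the factory instantiates whatever type you reference in openlineage.yml):

```
from openlineage.client.transport import register_transport, Transport, Config

@register_transport
class ConsoleDumpTransport(Transport):
    kind = "console_dump"  # referenced as `type: console_dump` in openlineage.yml
    config = Config

    def __init__(self, config: Config) -> None:
        self.config = config

    def emit(self, event) -> None:
        # stand-in for real delivery logic (custom auth, Kafka, etc.)
        print(event)
```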

      @@ -149269,7 +149450,7 @@

      show data

      - 👍 Jason Yip, Sophie LY, Tristan GUEZENNEC -CROIX-, Mars Lan, Sangeeta Mishra + 👍 Jason Yip, Sophie LY, Tristan GUEZENNEC -CROIX-, Mars Lan (Metaphor), Sangeeta Mishra
      @@ -149301,7 +149482,7 @@

      show data

@Paweł Leszczyński I tested 1.5.0 - it works great now, but the environment facet is gone in START... which I very much want... any thoughts?

      - + @@ -149363,7 +149544,7 @@

      show data

      2023-11-04 15:44:22
      -

*Thread Reply:* @Paweł Leszczyński looks like I need to bring bad news... 13.3 is fixed for specific scenarios, but 11.3 is still reading the output as dbfs... there are scenarios where it's not producing input and output, like:

      +

*Thread Reply:* @Paweł Leszczyński looks like I need to bring bad news... 13.3 is fixed for specific scenarios, but 11.3 is still reading the output as dbfs... there are scenarios where it's not producing input and output, like:

      create table table using delta as location 'abfss://....' @@ -149393,7 +149574,7 @@

      show data

      2023-11-04 15:44:31
      -

*Thread Reply:* Will test more and open issues

      +

*Thread Reply:* Will test more and open issues

      @@ -149419,7 +149600,7 @@

      show data

      2023-11-06 05:34:33
      -

*Thread Reply:* @Jason Yip how did you manage to get the environment attribute? It's not showing up for me at all. I've tried Databricks but also tried a local instance of Spark.

      +

*Thread Reply:* @Jason Yip how did you manage to get the environment attribute? It's not showing up for me at all. I've tried Databricks but also tried a local instance of Spark.

      @@ -149445,7 +149626,7 @@

      show data

      2023-11-07 18:32:02
      -

*Thread Reply:* @Rodrigo Maia it's showing up in one of the RUNNING events, not in the START event anymore

      +

*Thread Reply:* @Rodrigo Maia it's showing up in one of the RUNNING events, not in the START event anymore

      @@ -149471,7 +149652,7 @@

      show data

      2023-11-08 03:04:32
      -

      *Thread Reply:* I never had a running event 🫠 Am I filtering something?

      +

      *Thread Reply:* I never had a running event 🫠 Am I filtering something?

      @@ -149497,7 +149678,7 @@

      show data

      2023-11-08 13:03:26
      -

      *Thread Reply:* Umm.. ok show me your code, will try on my end

      +

      *Thread Reply:* Umm.. ok show me your code, will try on my end

      @@ -149523,7 +149704,7 @@

      show data

      2023-11-08 14:26:06
      -

      *Thread Reply:* @Paweł Leszczyński @Rodrigo Maia actually if you are using UC-enabled cluster, you won't get any RUNNING events

      +

      *Thread Reply:* @Paweł Leszczyński @Rodrigo Maia actually if you are using UC-enabled cluster, you won't get any RUNNING events

      @@ -149635,7 +149816,7 @@

      show data

      2023-11-04 07:11:46
      -

      *Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1698315220142929 +

      *Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1698315220142929 Do you need some more guidance than that?

      @@ -149697,7 +149878,7 @@

      show data

      2023-11-04 07:13:47
      -

      *Thread Reply:* yes

      +

      *Thread Reply:* yes

      @@ -149723,7 +149904,7 @@

      show data

      2023-11-04 07:15:21
      -

      *Thread Reply:* It seems pretty extensively described, what kind of help do you need?

      +

      *Thread Reply:* It seems pretty extensively described, what kind of help do you need?

      @@ -149749,7 +149930,7 @@

      show data

      2023-11-04 07:16:13
      -

*Thread Reply:* io.openlineage.spark.api.OpenLineageEventHandlerFactory - if I use this, how will I pass the custom listener to my spark submit?

      +

*Thread Reply:* io.openlineage.spark.api.OpenLineageEventHandlerFactory - if I use this, how will I pass the custom listener to my spark submit?

      @@ -149775,7 +149956,7 @@

      show data

      2023-11-04 07:17:25
      -

*Thread Reply:* I would like to know how I can customize my events using this. For example: in the "input" facet I want only the symlinks name - I am not interested in anything else

      +

*Thread Reply:* I would like to know how I can customize my events using this. For example: in the "input" facet I want only the symlinks name - I am not interested in anything else

      @@ -149801,7 +149982,7 @@

      show data

      2023-11-04 07:17:32
      -

      *Thread Reply:* can you please provide some guidance

      +

      *Thread Reply:* can you please provide some guidance

      @@ -149827,7 +150008,7 @@

      show data

      2023-11-04 07:18:36
      -

      *Thread Reply:* @Jakub Dardziński this is the doubt i have

      +

      *Thread Reply:* @Jakub Dardziński this is the doubt i have

      @@ -149853,7 +150034,7 @@

      show data

      2023-11-04 08:17:25
      -

*Thread Reply:* Someone who did the Spark integration, please throw some light

      +

*Thread Reply:* Someone who did the Spark integration, please throw some light

      @@ -149879,7 +150060,7 @@

      show data

      2023-11-04 08:21:22
      -

*Thread Reply:* it's the weekend for most of us, so you probably need to wait until Monday for precise answers

      +

*Thread Reply:* it's the weekend for most of us, so you probably need to wait until Monday for precise answers

      @@ -149984,7 +150165,7 @@

      show data

      2023-11-08 10:42:35
      -

      *Thread Reply:* Thanks for merging this @Maciej Obuchowski!

      +

      *Thread Reply:* Thanks for merging this @Maciej Obuchowski!

      @@ -150072,7 +150253,7 @@

      show data

      2023-11-07 05:56:09
      -

      *Thread Reply:* similarly to spark config, you can use flink config

      +

      *Thread Reply:* similarly to spark config, you can use flink config

      @@ -150098,7 +150279,7 @@

      show data

      2023-11-07 22:36:53
      -

*Thread Reply:* @Maciej Obuchowski - Got it. Our use-case is that we're trying to build a wrapper on top of openlineage-flink for productionising our Flink jobs.

      +

*Thread Reply:* @Maciej Obuchowski - Got it. Our use-case is that we're trying to build a wrapper on top of openlineage-flink for productionising our Flink jobs.

      We're trying to have a wrapper class that extends OpenLineageFlinkJobListener class, and overwrites the HTTP transport endpoint/url to a constant value (say, example.com and /api/v1/flink). But we see that the OpenLineageFlinkJobListener constructor is defined as a private constructor - just wanted to check with the team whether it was just a default scope, or intended to be private. If it was just a default scope, can we contribute a PR to make it public, to make it friendly for teams trying to adopt & extend openlineage?

      @@ -150128,7 +150309,7 @@

      show data

      2023-11-08 05:55:43
      -

      *Thread Reply:* We parse flink conf to get that information: https://github.com/OpenLineage/OpenLineage/blob/26494b596e9669d2ada164066a73c44e04[…]ink/src/main/java/io/openlineage/flink/client/EventEmitter.java

      +

      *Thread Reply:* We parse flink conf to get that information: https://github.com/OpenLineage/OpenLineage/blob/26494b596e9669d2ada164066a73c44e04[…]ink/src/main/java/io/openlineage/flink/client/EventEmitter.java

> But we see that the OpenLineageFlinkJobListener constructor is defined as a private constructor - just wanted to check with the team whether it was just a default scope, or intended to be private. The way to construct it is the public builder in the same class
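For completeness, the non-wrapper route mentioned earlier would be an openlineage.yml next to the job, reusing the endpoint values from the question above (treat the exact keys as assumptions against your client version):

```
transport:
  type: http
  url: https://example.com
  endpoint: /api/v1/flink
```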

      @@ -150184,7 +150365,7 @@

      show data

      2023-11-09 12:41:02
      -

      *Thread Reply:* > I think easier way than wrapper class would be use existing flink configuration, or to set up OPENLINEAGE_URL env variable, or have openlineage.yml config file - not sure why this is the way you've chosen? +

      *Thread Reply:* > I think easier way than wrapper class would be use existing flink configuration, or to set up OPENLINEAGE_URL env variable, or have openlineage.yml config file - not sure why this is the way you've chosen? @Maciej Obuchowski - The reasoning behind going with a wrapper class is that we can abstract out the nitty-gritty like how/where we're publishing openlineage events etc - especially for companies that have a lot of teams that may be adopting openlineage.

      For example, if we wanna move away from http transport to kafka transport - we'd be changing only this wrapper class and ask folks to update their wrapper class dependency version. If we went without the wrapper class, then the exact config changes would need to be synced and done by many different teams, who may not have enough context.

      @@ -150217,7 +150398,7 @@

      show data

      2023-11-09 13:03:56
      -

      *Thread Reply:* @Athitya Kumar that makes sense. Feel free to provide PR adding getters and stuff.

      +

      *Thread Reply:* @Athitya Kumar that makes sense. Feel free to provide PR adding getters and stuff.

      @@ -150234,6 +150415,78 @@

      show data

      +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-11-22 10:47:32
      +
      +

      *Thread Reply:* Created an issue for the same: https://github.com/OpenLineage/OpenLineage/issues/2273

      + +

      @Maciej Obuchowski - Can you assign this issue to me, if the proposal looks good?

      +
      + + + + + + + +
      +
      Labels
      + proposal +
      + + + + + + + + + + +
      + + + +
      + 👍 Maciej Obuchowski +
      + +
      +
      +
      +
      + + + + +
      @@ -150282,7 +150535,7 @@

      show data

      2023-11-07 06:07:51
      -

*Thread Reply:* hey, the name of the output dataset should be put at the end of the job name. This was introduced to help with jobs that call multiple spark actions

      +

*Thread Reply:* hey, the name of the output dataset should be put at the end of the job name. This was introduced to help with jobs that call multiple spark actions

      @@ -150308,7 +150561,7 @@

      show data

      2023-11-07 07:05:52
      -

      *Thread Reply:* Hi Paweł, +

      *Thread Reply:* Hi Paweł, Thanks for your answer, yes indeed with the newer version of OL, we automatically have the name of the output dataset at the end of the job name, but no App run id, nor any parent run facet.

      @@ -150335,7 +150588,7 @@

      show data

      2023-11-07 08:16:44
      -

*Thread Reply:* yes, you're right. I mean you can set spark.openlineage.parentJobName in the config, which will be shared through the whole app run, but this needs to be set manually

      +

*Thread Reply:* yes, you're right. I mean you can set spark.openlineage.parentJobName in the config, which will be shared through the whole app run, but this needs to be set manually
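For example, set it when building the session (the job name and run id values are illustrative; spark.openlineage.parentRunId is an assumption about the companion setting):

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.openlineage.parentJobName", "nightly_etl")
    .config("spark.openlineage.parentRunId", "<parent-run-uuid>")
    .getOrCreate())
```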

      @@ -150361,7 +150614,7 @@

      show data

      2023-11-07 08:36:58
      -

      *Thread Reply:* I see, thanks a lot for your reply we'll try that

      +

      *Thread Reply:* I see, thanks a lot for your reply we'll try that

      @@ -150413,7 +150666,7 @@

      show data

      2023-11-08 10:49:04
      -

      *Thread Reply:* Sounds like it, yes - if the logical dataset names are different but physical one is the same

      +

      *Thread Reply:* Sounds like it, yes - if the logical dataset names are different but physical one is the same

      @@ -150465,7 +150718,7 @@

      show data

      2023-11-08 13:01:16
      -

*Thread Reply:* No, but it should work the same - I tried on AWS, Google Colab, and Azure

      +

*Thread Reply:* No, but it should work the same - I tried on AWS, Google Colab, and Azure

      @@ -150495,7 +150748,7 @@

      show data

      2023-11-09 03:10:54
      -

      *Thread Reply:* Yes. @Abdallah could provide some details if needed.

      +

      *Thread Reply:* Yes. @Abdallah could provide some details if needed.

      @@ -150529,7 +150782,7 @@

      show data

      2023-11-20 11:29:26
      -

      *Thread Reply:* Thanks @Tristan GUEZENNEC -CROIX- +

      *Thread Reply:* Thanks @Tristan GUEZENNEC -CROIX- HI @Abdallah i was able to set up a spark cluster on AWS EMR but im struggling to configure the OL Listener. Ive tried with steps and bootstrap actions for the jar and it didn't work out. How did you manage to include the jar? Besides, what about the spark configuration? Could you send me a sample of these configs?

      @@ -150543,6 +150796,133 @@

      show data

      +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2023-11-28 03:52:44
      +
      +

*Thread Reply:* Hi @Abdallah, I've sent you a message with some more information. If you could provide some more details, that would be awesome. Thank you 😄

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Abdallah + (abdallah@terrab.me) +
      +
      2023-11-28 04:14:14
      +
      +

*Thread Reply:* Hi, sorry for the late reply. I didn't see your message at the right time.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Abdallah + (abdallah@terrab.me) +
      +
      2023-11-28 04:16:02
      +
      +

*Thread Reply:* If you want to test OL quickly without a bootstrap action, you can use the following submit.json

      + +

[ + { + "Name": "MyJob", + "Jar" : "command-runner.jar", + "Args": [ + "spark-submit", + "--deploy-mode", "cluster", + "--packages","io.openlineage:openlineage-spark:1.2.2", + "--conf","spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener", + "--conf","spark.openlineage.transport.type=http", + "--conf","spark.openlineage.transport.url=&lt;OPENLINEAGE-URL&gt;", + "--conf","spark.openlineage.version=v1", + "path/to/your/job.py" + ], + "ActionOnFailure": "CONTINUE", + "Type": "CUSTOM_JAR" + } +] +with

      + +

      aws emr add-steps --cluster-id &lt;cluster-id&gt; --steps file://&lt;path-to-your-json&gt;/submit.json

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Abdallah + (abdallah@terrab.me) +
      +
      2023-11-28 04:19:20
      +
      +

*Thread Reply:* "--packages","io.openlineage:openlineage-spark:1.2.2" +needs to be mentioned before the creation of the Spark session

      + + + +
      +
      +
      +
      + + + + +
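For readers testing outside of EMR steps, the same submission restated as a plain spark-submit call (a direct restatement of the JSON above, nothing new added):

```
spark-submit \
  --deploy-mode cluster \
  --packages io.openlineage:openlineage-spark:1.2.2 \
  --conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener \
  --conf spark.openlineage.transport.type=http \
  --conf spark.openlineage.transport.url=&lt;OPENLINEAGE-URL&gt; \
  --conf spark.openlineage.version=v1 \
  path/to/your/job.py
```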
      @@ -150822,7 +151202,7 @@

      show data

      2023-11-10 15:32:28
      -

      *Thread Reply:* Here's for more reference: https://dilorom.medium.com/finding-the-path-to-a-table-in-databricks-2c74c6009dbb

      +

      *Thread Reply:* Here's for more reference: https://dilorom.medium.com/finding-the-path-to-a-table-in-databricks-2c74c6009dbb

      Medium
      @@ -150964,7 +151344,7 @@

      show data

@Paweł Leszczyński I went back to 1.4.1, the output does show the adls location. But the environment facet is gone in 1.4.1. It shows up in 1.5.0 but the namespace is back to dbfs....

      - + @@ -151022,7 +151402,7 @@

      show data

      2023-11-13 04:52:07
      -

      *Thread Reply:* Thanks @Jason Yip for your engagement in finding the cause and solution to this issue.

      +

      *Thread Reply:* Thanks @Jason Yip for your engagement in finding the cause and solution to this issue.

      Among the technical problems, another problem here is that our databricks integration tests are run on AWS and the issue you describe occurs in Azure. I would consider this a primary issue as it is difficult for me to verify the behaviour you describe and fix it with a failing integration test at the start.

      @@ -151052,7 +151432,451 @@

      show data

      2023-11-13 18:06:44
      -

      *Thread Reply:* I didn't know Azure and AWS Databricks are different. Let me try it on AWS as well

      +

      *Thread Reply:* I didn't know Azure and AWS Databricks are different. Let me try it on AWS as well

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-23 04:51:38
      +
      +

*Thread Reply:* @Paweł Leszczyński finally got a chance to run it, but it's a different script - it's pretty interesting

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-23 04:51:59
      +
      +

      *Thread Reply:* "inputs": [ + { + "namespace": "<wasbs://publicwasb@mmlspark.blob.core.windows.net>", + "name": "AdultCensusIncome.parquet" + } + ], + "outputs": [ + { + "namespace": "<wasbs://publicwasb@mmlspark.blob.core.windows.net>", + "name": "AdultCensusIncome.parquet" + }

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-23 04:52:19
      +
      +

      *Thread Reply:* df.write.format("delta").mode("overwrite").option("path", "").saveAsTable("test.AdultCensusIncome")

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-23 04:52:41
      +
      +

      *Thread Reply:* it somehow got the input path as output path 😲

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-23 04:53:53
      +
      +

      *Thread Reply:* here's the full script:

      + +

      df = spark.read.parquet( "" +)

      + +

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "") +sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "")

      + +

      df.write.format("delta").mode("overwrite").option("path", "").saveAsTable("test.AdultCensusIncome")

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-23 04:54:23
      +
      +

      *Thread Reply:* yep, just 3 lines, will try it on Azure as well

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-11-23 08:14:23
      +
      +

*Thread Reply:* just to clarify, were you able to reproduce this issue just on AWS Databricks using s3? Asking this again because this is how our integration test environment looks.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-23 18:28:36
      +
      +

*Thread Reply:* @Paweł Leszczyński the issue is different. The issue before on Azure was that it says it's dbfs despite it being wasbs. This time around the destination is s3, but it says wasbs... it somehow took the path from the inputs

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-23 18:28:52
      +
      +

      *Thread Reply:* I have not tested this new script on Azure yet

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-24 03:53:21
      +
      +

*Thread Reply:* @Paweł Leszczyński tried on Azure, the result is the same

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-24 03:53:24
      +
      +

      *Thread Reply:* df = spark.read.parquet("")

      + +

      df.write\ + .format('delta')\ + .mode('overwrite')\ + .option('overwriteSchema', 'true')\ + .option('path', adlsRootPath + '/examples/data/parquet/AdultCensusIncome/silver/AdultCensusIncome')\ + .saveAsTable('AdultCensusIncome')

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-24 03:54:05
      +
      +

      *Thread Reply:* "outputs": [ + { + "namespace": "dbfs", + "name": "/user/hive/warehouse/adultcensusincome",

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-24 03:54:24
      +
      +

*Thread Reply:* Please note that 1.4.1 correctly identified the output as wasbs

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-24 04:14:39
      +
      +

      *Thread Reply:* so to summarize, on Azure it'd become dbfs

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-24 04:15:01
      +
      +

      *Thread Reply:* on AWS, it somehow becomes the same as input

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-24 04:15:34
      +
      +

      *Thread Reply:* 1.4.1 Azure is fine, I have not tested 1.4.1 on AWS

      @@ -151105,7 +151929,7 @@

      show data

      2023-11-15 07:27:34
      2023-11-15 07:27:55
      -

      *Thread Reply:* thank you @Maciej Obuchowski

      +

      *Thread Reply:* thank you @Maciej Obuchowski

      @@ -151185,7 +152009,7 @@

      show data

      2023-11-16 11:46:16
      -

      *Thread Reply:* Hey @Naresh reddy can you help me understand what you mean by competitors? +

      *Thread Reply:* Hey @Naresh reddy can you help me understand what you mean by competitors? OL is a specification that can be used to solve various problems, so if you have a clear problem statement, maybe I can help with pros/cons for that problem

      @@ -151199,6 +152023,62 @@

      show data

      +
      +
      + + + + +
      + +
      Naresh reddy + (naresh.naresh36@gmail.com) +
      +
      2023-11-22 23:49:05
      +
      +

*Thread Reply:* I wanted to integrate Airflow using OL but wanted to understand the pros and cons of OL - if you can shed light on that, it would be great

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Harel Shein + (harel.shein@gmail.com) +
      +
      2023-11-27 19:11:54
      +
      +

      *Thread Reply:* Airflow supports OL natively via a provider since 2.7. But it’s hard for me to tell you pros/cons without understanding your use case

      + + + +
      + 👀 Naresh reddy +
      + +
      +
      +
      +
      + + + + +
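For reference, the native provider mentioned here ships as its own package (name as published on PyPI):

```
pip install apache-airflow-providers-openlineage
```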
      @@ -151238,7 +152118,7 @@

      show data

      2023-11-16 13:38:42
      -

      *Thread Reply:* Hi @Naresh reddy, thanks for your question. We’ve heard that OpenLineage is attractive because of its desirable integrations, including a best-in-class Spark integration, its extensibility, the fact that it’s not destructive, and the fact that it’s open source. I’m not aware of pain points per se, but there are certainly features and integrations that we wish we could focus on but can’t at the moment — like the Dagster integration, which needs a new maintainer. OpenLineage is like any other open standard in that ecosystem coverage is a constant process rather than a journey, and it requires contributions in order to get close to 100%. Thankfully, we are gaining users and contributors all the time, and integrations are being added or improved upon daily. See the Ecosystem page on the website for a list of consumers and producers and links to more resources, and check out the GitHub repo for the codebase, commit history, contributors, governance procedures, and more. We’re quick to respond to messages here and issues on GitHub — usually within one day.

      +

      *Thread Reply:* Hi @Naresh reddy, thanks for your question. We’ve heard that OpenLineage is attractive because of its desirable integrations, including a best-in-class Spark integration, its extensibility, the fact that it’s not destructive, and the fact that it’s open source. I’m not aware of pain points per se, but there are certainly features and integrations that we wish we could focus on but can’t at the moment — like the Dagster integration, which needs a new maintainer. OpenLineage is like any other open standard in that ecosystem coverage is a constant process rather than a journey, and it requires contributions in order to get close to 100%. Thankfully, we are gaining users and contributors all the time, and integrations are being added or improved upon daily. See the Ecosystem page on the website for a list of consumers and producers and links to more resources, and check out the GitHub repo for the codebase, commit history, contributors, governance procedures, and more. We’re quick to respond to messages here and issues on GitHub — usually within one day.

      @@ -151271,6 +152151,10 @@

      show data

      +
      + 🙌 Naresh reddy +
      +
      @@ -151319,7 +152203,7 @@

      show data

      2023-11-20 06:07:36
      -

*Thread Reply:* Yes, it works with Airflow and Spark - there is a caveat that the number of operators that support it on the Airflow side is fairly small, limited generally to the most popular SQL operators. +

*Thread Reply:* Yes, it works with Airflow and Spark - there is a caveat that the number of operators that support it on the Airflow side is fairly small, limited generally to the most popular SQL operators. > will it also allow to connect to Power BI and derive the downstream column lineage ? No, there is no such feature yet 🙂 However, there's nothing preventing this - if you wish to work on such an implementation, we'd be happy to help.

      @@ -151348,7 +152232,7 @@

      show data

      2023-11-21 00:20:11
      -

*Thread Reply:* Thank you, Maciej Obuchowski, for the update. Currently we are looking for a tool which can support connecting to Power BI and pulling column-level lineage information for reports and dashboards. How can this be achieved with OL? Can you give some ideas?

      +

*Thread Reply:* Thank you, Maciej Obuchowski, for the update. Currently we are looking for a tool which can support connecting to Power BI and pulling column-level lineage information for reports and dashboards. How can this be achieved with OL? Can you give some ideas?

      @@ -151374,7 +152258,7 @@

      show data

      2023-11-21 07:59:10
      -

      *Thread Reply:* I don't think I can help you with that now, unless you want to work on your own integration with PowerBI 🙁

      +

      *Thread Reply:* I don't think I can help you with that now, unless you want to work on your own integration with PowerBI 🙁

      @@ -151456,7 +152340,7 @@

      show data

      2023-11-21 07:37:00
      -

      *Thread Reply:* Can you provide details about those issues? Like exceptions, logs, details of the jobs and how do you run them?

      +

      *Thread Reply:* Can you provide details about those issues? Like exceptions, logs, details of the jobs and how do you run them?

      @@ -151482,7 +152366,7 @@

      show data

      2023-11-21 07:45:37
      -

*Thread Reply:* Hi @Maciej Obuchowski - I re-ran the docker container after deleting the metadata_db folder possibly created by another local test, which fixed that one, but I got a problem with OpenLineageListener during the initialization of Spark: +

*Thread Reply:* Hi @Maciej Obuchowski - I re-ran the docker container after deleting the metadata_db folder possibly created by another local test, which fixed that one, but I got a problem with OpenLineageListener during the initialization of Spark: +while I execute: spark = (SparkSession.builder.master('local') .appName('Food Delivery') @@ -151562,7 +152446,7 @@

      show data

      2023-11-21 07:58:09
      -

      *Thread Reply:* 🤔 Jars are added during image building: https://github.com/getindata/openlineage-bbuzz2023-column-lineage/blob/main/Dockerfile#L12C1-L12C29

      +

      *Thread Reply:* 🤔 Jars are added during image building: https://github.com/getindata/openlineage-bbuzz2023-column-lineage/blob/main/Dockerfile#L12C1-L12C29

      @@ -151588,7 +152472,7 @@

      show data

      2023-11-21 07:58:28
      -

      *Thread Reply:* are you sure &lt;local-path&gt; is right?

      +

      *Thread Reply:* are you sure &lt;local-path&gt; is right?

      @@ -151614,7 +152498,7 @@

      show data

      2023-11-21 08:00:49
      -

*Thread Reply:* yes, it's the same as in the sample - wondering why it's not getting added: +

*Thread Reply:* yes, it's the same as in the sample - wondering why it's not getting added: +```from pyspark.sql import SparkSession

      spark = (SparkSession.builder.master('local') @@ -151653,7 +152537,7 @@

      show data

      2023-11-21 08:04:31
      -

      *Thread Reply:* can you make sure jars are in this directory? just by docker run --entrypoint /usr/local/bin/bash IMAGE_NAME "ls /home/jovyan/jars"

      +

      *Thread Reply:* can you make sure jars are in this directory? just by docker run --entrypoint /usr/local/bin/bash IMAGE_NAME "ls /home/jovyan/jars"

      @@ -151679,7 +152563,7 @@

      show data

      2023-11-21 08:06:27
      -

*Thread Reply:* another option to try is to replace spark.jars with spark.jars.packages io.openlineage:openlineage-spark:1.5.0,org.postgresql:postgresql:42.7.0

      +

*Thread Reply:* another option to try is to replace spark.jars with spark.jars.packages io.openlineage:openlineage-spark:1.5.0,org.postgresql:postgresql:42.7.0

      @@ -151705,7 +152589,7 @@

      show data

      2023-11-21 08:16:54
      -

*Thread Reply:* I think this was done for the purpose of the presentation, to make sure the demo would work without internet access. This can be the reason to add the jar manually to the docker image. openlineage-spark can be added to Spark via spark.jars.packages, like we do here https://openlineage.io/docs/integrations/spark/quickstart_local

      +

*Thread Reply:* I think this was done for the purpose of the presentation, to make sure the demo would work without internet access. This can be the reason to add the jar manually to the docker image. openlineage-spark can be added to Spark via spark.jars.packages, like we do here https://openlineage.io/docs/integrations/spark/quickstart_local

      openlineage.io
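A sketch of the resulting session builder, combining the spark.jars.packages suggestion above with the listener config used elsewhere in this thread (the transport URL is a placeholder):

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder.master('local')
    .appName('Food Delivery')
    .config('spark.jars.packages',
            'io.openlineage:openlineage-spark:1.5.0,org.postgresql:postgresql:42.7.0')
    .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
    .config('spark.openlineage.transport.type', 'http')
    .config('spark.openlineage.transport.url', '&lt;OPENLINEAGE-URL&gt;')
    .getOrCreate())
```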
      @@ -151752,7 +152636,7 @@

      show data

      2023-11-21 09:21:59
      -

*Thread Reply:* got it guys - thanks a lot for the help - it turns out that the Spark contexts from notebooks 2 and 3 have some kind of metadata conflict - when I combine those 2 and recreate the image to clean up the old metadata, it works. +

*Thread Reply:* got it guys - thanks a lot for the help - it turns out that the Spark contexts from notebooks 2 and 3 have some kind of metadata conflict - when I combine those 2 and recreate the image to clean up the old metadata, it works. +One more note is that sometimes kernels return weird results, but that may be caused by some local nuances - anyways, thx!

      @@ -151765,6 +152649,15905 @@

      show data

      + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2023-11-22 05:18:53
      +
      +

Hi everyone, +I created a custom operator in Airflow to extract metadata of a file (size, creation time, modification time) and used it in my DAG. It runs fine in Airflow and saves the metadata info of the file to a CSV, but I want that metadata info to be shown in the Marquez UI as an extra facet. How can we add that extra facet so it appears in the Marquez UI? +Thank you

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-11-22 05:20:34
      +
      +

      *Thread Reply:* are you adding job or run facet?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2023-11-22 05:21:32
      +
      +

      *Thread Reply:* job facet

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-11-22 05:22:24
      +
      +

*Thread Reply:* that should be a run facet, right? +that's a dynamic value dependent on an individual run

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2023-11-22 05:24:42
      +
      +

      *Thread Reply:* yes yes sorry run facet

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-11-22 05:25:12
      +
      +

      *Thread Reply:* it should appear then in Marquez as additional facet

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2023-11-22 05:31:31
      +
      +

*Thread Reply:* I can see these things already for the custom operator, but I want to add extra info under the root, like the metadata of the file

      + +

      like ( file_name, size, modification time, creation time )

      + +
      + + + + + + + + + +
      + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-11-22 05:33:20
      +
      +

      *Thread Reply:* yeah, for that you need to provide additional run facets under these keys you mentioned

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2023-11-23 01:31:01
      +
      +

*Thread Reply:* can you please tell me precisely where we can add an additional facet so it is visible in the UI?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-11-23 04:16:53
      +
      +

*Thread Reply:* custom extractors should return TaskMetadata from their extract methods; it takes run_facets as an argument: +https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/extractors/base.py#L28

      +
      + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
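A minimal sketch of such an extractor, assuming a hypothetical FileMetadataOperator that exposes the file attributes discussed above (every name here is illustrative, not a documented API beyond the imports):

```
import attr
from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata
from openlineage.client.facet import BaseFacet

@attr.s
class FileMetadataRunFacet(BaseFacet):
    # custom facet carrying the file attributes from the thread
    file_name: str = attr.ib()
    size: int = attr.ib()
    creation_time: str = attr.ib()
    modification_time: str = attr.ib()

class FileMetadataExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls):
        return ["FileMetadataOperator"]  # hypothetical operator

    def extract(self) -> TaskMetadata:
        op = self.operator
        return TaskMetadata(
            name=f"{op.dag_id}.{op.task_id}",
            run_facets={
                "file_metadata": FileMetadataRunFacet(
                    file_name=op.file_name,
                    size=op.size,
                    creation_time=op.creation_time,
                    modification_time=op.modification_time,
                )
            },
        )
```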
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2023-11-30 06:34:55
      +
      +

*Thread Reply:* Hi @Jakub Dardziński, finally today I am able to get the extra facet in the Marquez UI using a custom operator and a custom extractor. +Thanks for the help. It is a really nice community.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-11-30 06:43:55
      +
      +

      *Thread Reply:* Hey, that’s great to hear! +Sorry I didn’t answer yesterday but you managed on your own 🙂

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2023-11-30 10:51:09
      +
      +

*Thread Reply:* Yes, no problem - I tried and explored more by myself. +Thanks 😊.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rafał Wójcik + (rwojcik@griddynamics.com) +
      +
      2023-11-22 05:32:40
      +
      +

Hi guys, one more question - as we fixed the sample from the previous thread, I started playing with the example - when I execute in a notebook: +spark.sql('''SELECT * FROM public.restaurants r WHERE r.name = "BBUZZ DONER"''').show() +I get all raw events in Marquez but all job.facets.sql fields are null - is there any way to capture the SQL query that we use in Spark? I know that we can pull this out of spark.logicalPlan but plain SQL would be much more convenient

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-11-22 06:53:37
      +
      +

*Thread Reply:* I was looking into it some time ago and was not able to extract SQL from the logical plan. It seemed to me that the SQL string is translated into a LogicalPlan before the OpenLineage code gets called, and I wasn't able to find the SQL anywhere

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-11-22 06:54:31
      +
      +

      *Thread Reply:* Optional&lt;String&gt; query = ScalaConversionUtils.asJavaOptional(relation.jdbcOptions().parameters().get(JDBCOptions$.MODULE$.JDBC_QUERY_STRING())); +🙂

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-11-22 06:54:54
      +
      +

      *Thread Reply:* ah, not only from JDBCRelation

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-11-22 08:03:52
      +
      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1250#issuecomment-1306865798

      +
      + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-11-22 13:57:29
      +
      +

@general +The next OpenLineage meetup is happening one week from today on November 29th in Warsaw/remote (it’s hybrid) at 5:30 pm CET (8:30 am PT). Sign-up and more details can be found here: https://www.meetup.com/warsaw-openlineage-meetup-group/events/296705558/. Hoping to see many of you there!

      +
      +
      Meetup
      + + + + + + + + + + + + + + + + + +
      + + + +
      + 🙌 Harel Shein, Maciej Obuchowski, Sangeeta Mishra +
      + +
      + :flag_pl: Harel Shein, Maciej Obuchowski, Paweł Leszczyński, Stefan Krawczyk +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Ryhan Sunny + (rsunny@altinity.com) +
      +
      2023-11-23 12:26:31
      +
      +

      Hi all, don’t miss out on the upcoming Open Source Analytics (virtual) Conference OSA Con 2023 - the place to be for cutting-edge insights into open-source analytics and AI! Learn and share development tips on the hottest open source projects, like Kafka, Airflow, Grafana, ClickHouse, and DuckDB. See who’s speaking and save your spot today at https://osacon.io/

      +
      +
      osacon.io
      + + + + + + + + + + + + + + + + + +
      + + + +
      + ❤️ Jarek Potiuk, Jakub Dardziński, Julien Le Dem, Willy Lulciuc +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2023-11-24 10:47:55
      +
      +

with Airflow, I have operators and define input_datasets and output_datasets. If Task B uses the output_dataset of Task A as the input_dataset, does it overwrite the metadata such as the documentation facet etc.?

      + +

should I ensure that the input_dataset info is exactly the same between tasks, or do I move certain logic into the run_facet?

      + +

      for example, this input_facet is the output dataset from my previous task.

      + +

```
input_facet = {
    "dataSource": input_source,
    "schema": SchemaDatasetFacet(fields=fields),
    "deltaTable": DeltaTableFacet(
        path=self.source_model.dataset_uri,
        name=self.source_model.name,
        description=self.source_model.description,
        partition_columns=json.dumps(self.source_model.partition_columns or []),
        unique_constraint=json.dumps(self.source_model.unique_constraint or []),
        rows=self.source_delta_table.rows,
        file_count=self.source_delta_table.file_count,
        size=self.source_delta_table.size,
    ),
}
input_dataset = Dataset(
    namespace=input_namespace, name=input_name, facets=input_facet
)
```

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-11-24 10:51:55
      +
      +

      *Thread Reply:* First question, how do you do that? Define get_openlineage_facets methods on your custom operators?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2023-11-24 10:59:09
      +
      +

      *Thread Reply:* yeah

      + +

      I have this method:

      + +

```
def get_openlineage_facets_on_complete(self, task_instance: Any) -> Any:
    """Returns the OpenLineage facets for the task instance"""
    from airflow.providers.openlineage.extractors import OperatorLineage

    _ = task_instance
    inputs = self._create_input_datasets()
    outputs = self._create_output_datasets()
    run_facet = self._create_run_facet()
    job_facet = self._create_job_facet()
    return OperatorLineage(
        inputs=inputs, outputs=outputs, run_facets=run_facet, job_facets=job_facet
    )
```
      + +

      and then I define each facet

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2023-11-24 11:00:33
      +
      +

      *Thread Reply:* the input and output change

      + +

```
def _create_job_facet(self) -> dict[str, Any]:
    """Creates the Job facet for the OpenLineage Job"""
    from openlineage.client.facet import (
        DocumentationJobFacet,
        OwnershipJobFacet,
        OwnershipJobFacetOwners,
        SourceCodeJobFacet,
    )

    return {
        "documentation": DocumentationJobFacet(
            description=f"""Filters data from {self.source_model.dataset_uri} using
            Polars and writes the data to the path:
            {self.destination_model.dataset_uri}.
            """
        ),
        "ownership": OwnershipJobFacet(
            owners=[OwnershipJobFacetOwners(name=self.owner)]
        ),
        "sourceCode": SourceCodeJobFacet(
            language="python", source=self.transform_model_name
        ),
    }

def _create_run_facet(self) -> dict[str, Any]:
    """Creates the Run facet for the OpenLineage Run"""
    return {}
```
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2023-11-24 11:02:07
      +
      +

*Thread Reply:* but a lot of the time I am reading a dataset and filtering it or selecting a subset of columns and saving a new dataset. I just want to make sure my input_dataset remains consistent, basically, since a lot of different Airflow tasks might be using it

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2023-11-24 11:02:36
      +
      +

      *Thread Reply:* and these are all custom operators

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-11-24 11:03:29
      +
      +

      *Thread Reply:* Yeah, you should be okay sending those facets on output dataset only

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2023-11-24 11:07:19
      +
      +

      *Thread Reply:* so just ignore the input_facet completely? or pass an empty list or something?

      + +

```
return OperatorLineage(
    inputs=None, outputs=outputs, run_facets=run_facet, job_facets=job_facet
)
```

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2023-11-24 11:09:53
      +
      +

      *Thread Reply:* cool I'll try that, makes things cleaner for sure

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-11-24 11:10:20
      +
      +

      *Thread Reply:* pass input datasets, just don't include redundant facets

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-11-24 11:10:39
      +
      +

*Thread Reply:* if you don't pass datasets, you won't get lineage
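To make the suggestion above concrete, here is a minimal sketch (the namespace, dataset name, and description are illustrative, not from this thread): attach the rich facets to the output dataset once, and let the downstream task reference the same dataset by namespace and name only.

```python
# a minimal sketch, assuming openlineage-python; names below are illustrative
from openlineage.client.facet import DocumentationDatasetFacet
from openlineage.client.run import Dataset

# upstream task: attach rich facets to the output dataset once
output_dataset = Dataset(
    namespace="abfss://container@account",
    name="curated/restaurants",
    facets={
        "documentation": DocumentationDatasetFacet(
            description="Curated restaurants delta table"
        )
    },
)

# downstream task: namespace + name alone are enough to link lineage,
# so the documentation facet does not need to be repeated here
input_dataset = Dataset(
    namespace="abfss://container@account",
    name="curated/restaurants",
)
```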

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2023-11-24 11:26:12
      +
      +

      *Thread Reply:* got it, thanks

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Haneefa tasneem + (haneefa.tasneem.shaik@gmail.com) +
      +
      2023-11-28 01:12:47
      +
      +

I am trying to run this Spark script, but my Spark context is stopping on its own. Without the OpenLineage configuration (listener), the script works fine. I need the configuration to integrate with OpenLineage.

      + +

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

def execute_spark_script(query_num, output_path):
    # Create a Spark session
    spark = (SparkSession.builder.master('local[*]').appName('openlineage_spark_test')

             # Install and set up the OpenLineage listener
             .config('spark.jars.packages', 'io.openlineage:openlineage-spark:0.3.+')
             .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
             .config('spark.openlineage.host', 'http://localhost:3000')
             .config('spark.openlineage.namespace', 'airflow')
             .config('spark.openlineage.transport.type', 'console')
             .getOrCreate()
             )

    # DataFrame 1
    data = [[295, "South Bend", "Indiana", "IN", 101190, 112.9]]
    columns = ["rank", "city", "state", "code", "population", "price"]
    df1 = spark.createDataFrame(data, schema="rank LONG, city STRING, state STRING, code STRING, population LONG, price DOUBLE")

    print(f"Count after DataFrame 1: {df1.count()} rows, {len(df1.columns)} columns")

    # Save DataFrame 1 to the desired location
    df1.write.mode("overwrite").csv(output_path + "df1")

    # DataFrame 2
    df2 = (spark.read
          .format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("/home/haneefa/Downloads/export.csv")
          )

    # Save DataFrame 2 to the desired location
    df2.write.mode("overwrite").csv(output_path + "df2")

    # Returns a DataFrame that combines the rows of df1 and df2
    query_df = df1.union(df2)
    print(f"Count after combining DataFrame 1 and DataFrame 2: {query_df.count()} rows, {len(query_df.columns)} columns")

    # Save the combined DataFrame to the desired location
    query_df.write.mode("overwrite").csv(output_path + "query_df")

    # Query 1: Add a new column derived from existing columns
    query1_df = query_df.withColumn("population_price_ratio", col("population") / col("price"))
    print(f"Count after Query 1: {query1_df.count()} rows, {len(query1_df.columns)} columns")

    # Save Query 1 result to the desired location
    query1_df.write.mode("overwrite").csv(output_path + "query1_df")

    spark.stop()

if __name__ == "__main__":
    execute_spark_script(1, "/home/haneefa/airflow/dags/saved_files/")
```

      + + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Damien Hawes + (damien.hawes@booking.com) +
      +
      2023-11-28 06:19:40
      +
      +

      *Thread Reply:* Hello @Haneefa tasneem - can you confirm which version of Spark you're running?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Damien Hawes + (damien.hawes@booking.com) +
      +
      2023-11-28 06:20:44
      +
      +

      *Thread Reply:* Additionally, I noticed this line:

      + +

.config('spark.jars.packages', 'io.openlineage:openlineage-spark:0.3.+')
Could you try changing the version to 1.5.0 instead of 0.3.+?

      + + + +
      + 👍 Paweł Leszczyński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Haneefa tasneem + (haneefa.tasneem.shaik@gmail.com) +
      +
      2023-11-28 08:04:24
      +
      +

*Thread Reply:* Hello, my Spark version is 3.5.0.
```
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import urllib.request

def execute_spark_script(query_num, output_path):
    # Create a Spark session
    ol_jars = ['https://repo1.maven.org/maven2/io/openlineage/openlineage-spark/0.3.1/openlineage-spark-0.3.1.jar']
    files = [urllib.request.urlretrieve(url)[0] for url in ol_jars]
    spark = (SparkSession.builder.master('local[*]').appName('openlineage_spark_test').config('spark.jars', ",".join(files))

             # Install and set up the OpenLineage listener
             .config('spark.jars.packages', 'io.openlineage:openlineage-spark:0.3.1')
             .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
             .config('spark.openlineage.host', 'http://localhost:5000')
             .config('spark.openlineage.namespace', 'airflow')
             .getOrCreate()
             )

    # DataFrame 1
    data = [[295, "South Bend", "Indiana", "IN", 101190, 112.9]]
    columns = ["rank", "city", "state", "code", "population", "price"]
    df1 = spark.createDataFrame(data, schema="rank LONG, city STRING, state STRING, code STRING, population LONG, price DOUBLE")

    print(f"Count after DataFrame 1: {df1.count()} rows, {len(df1.columns)} columns")

    # Save DataFrame 1 to the desired location
    df1.write.mode("overwrite").csv(output_path + "df1")

    # DataFrame 2
    df2 = (spark.read
          .format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("/home/haneefa/Downloads/export.csv")
          )

    # Save DataFrame 2 to the desired location
    df2.write.mode("overwrite").csv(output_path + "df2")

    # Returns a DataFrame that combines the rows of df1 and df2
    query_df = df1.union(df2)
    print(f"Count after combining DataFrame 1 and DataFrame 2: {query_df.count()} rows, {len(query_df.columns)} columns")

    # Save the combined DataFrame to the desired location
    query_df.write.mode("overwrite").csv(output_path + "query_df")

    # Query 1: Add a new column derived from existing columns
    query1_df = query_df.withColumn("population_price_ratio", col("population") / col("price"))
    print(f"Count after Query 1: {query1_df.count()} rows, {len(query1_df.columns)} columns")

    # Save Query 1 result to the desired location
    query1_df.write.mode("overwrite").csv(output_path + "query1_df")

    spark.stop()

if __name__ == "__main__":
    execute_spark_script(1, "/home/haneefa/airflow/dags/saved_files/")
```

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Haneefa tasneem + (haneefa.tasneem.shaik@gmail.com) +
      +
      2023-11-28 08:04:45
      +
      +

      *Thread Reply:* above is the modified code

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Haneefa tasneem + (haneefa.tasneem.shaik@gmail.com) +
      +
      2023-11-28 08:05:20
      +
      +

      *Thread Reply:* It seems the issue is with the listener in spark config

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Haneefa tasneem + (haneefa.tasneem.shaik@gmail.com) +
      +
      2023-11-28 08:05:55
      +
      +

*Thread Reply:* please do let me know if I should make any changes

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Damien Hawes + (damien.hawes@booking.com) +
      +
      2023-11-28 08:20:46
      +
      +

*Thread Reply:* Yeah - I suspect it's because of the version of the connector that you're using.

      + +

      You're using 0.3.1, please try it with 1.5.0.

      + + + +
      + 🚀 Paweł Leszczyński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Haneefa tasneem + (haneefa.tasneem.shaik@gmail.com) +
      +
      2023-11-28 12:27:02
      +
      +

*Thread Reply:* Yes, it seems the issue was with the version. Thank you, it's resolved now.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Haneefa tasneem + (haneefa.tasneem.shaik@gmail.com) +
      +
      2023-11-29 03:29:09
      +
      +

*Thread Reply:* Hi. I was able to see some metadata information in Marquez. I wanted to know if there is a way to see the lineage of the data as it's getting transformed. That is, we are running a query here, and I wanted to know if we can see the lineage of the dataset. I tried modifying the code like this:
```
# Save Query 1 result to the desired location
query_df.write.mode("overwrite").csv(output_path + "query_df")

# Register the DataFrame as a temporary SQL table
query_df.write.mode("overwrite").saveAsTable("temp_table")

# Query 1: Add a new column derived from existing columns
query1_df = spark.sql("SELECT *, population / price as population_price_ratio FROM temp_table")
print(f"Count after Query 1: {query1_df.count()} rows, {len(query1_df.columns)} columns")

# Register the DataFrame as a temporary SQL table
query1_df.write.mode("overwrite").saveAsTable("temp_table2")

# Save Query 1 result to the desired location
query1_df.write.mode("overwrite").csv(output_path + "query1_df")
```
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Haneefa tasneem + (haneefa.tasneem.shaik@gmail.com) +
      +
      2023-11-29 03:32:25
      +
      +

*Thread Reply:* But I'm getting an error. Is there a way I can see how we are deriving a new column from the previous dataset?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-11-28 13:06:08
      +
      +

      @channel +Friendly reminder: our first Warsaw meetup is happening tomorrow at 5:30 PM CET (8:30 AM PT) — and it’s hybrid https://openlineage.slack.com/archives/C01CK9T7HKR/p1700679449568039

      +
      + + +
      + + + } + + Michael Robinson + (https://openlineage.slack.com/team/U02LXF3HUN7) +
      + + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      slackbot + +
      +
      2023-11-29 06:30:57
      +
      +

      This message was deleted.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-11-29 07:03:57
      +
      +

*Thread Reply:* Hey Shahid, I think we already discussed it here: https://openlineage.slack.com/archives/C01CK9T7HKR/p1700648333528029?thread_ts=1700648333.528029&cid=C01CK9T7HKR

      +
      + + +
      + + + } + + Shahid Shaikh + (https://openlineage.slack.com/team/U062FLR8WBZ) +
      + + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Haneefa tasneem + (haneefa.tasneem.shaik@gmail.com) +
      +
      2023-11-29 09:09:00
      +
      +

Hi. This code runs fine on my local Ubuntu (both in and out of the virtual environment). I have installed Airflow in a virtual environment on Ubuntu, but the script is not getting executed from Airflow. I'm getting the following error:
```
airflow.exceptions.AirflowException: Cannot execute: spark-submit --master spark://10.0.2.15:4041 --jars /home/haneefa/Downloads/openlineage-spark-1.5.0.jar --name SparkScript_query1 --deploy-mode client /home/haneefa/airflow/dags/custom_operators/sample_sql_spark.py 1. Error code is: 1.
[2023-11-29, 13:51:21 UTC] {taskinstance.py:1400} INFO - Marking task as FAILED. dag_id=spark_dagf, task_id=spark_submit_query1, execution_date=20231129T134548, start_date=20231129T134723, end_date=20231129T135121
```
The spark-submit command itself works fine on my Ubuntu machine.
```
from pyspark import SparkContext
from os.path import abspath
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import urllib.request

def execute_spark_script(query_num, output_path):
    # Create a Spark session
    #ol_jars = ['https://repo1.maven.org/maven2/io/openlineage/openlineage-spark/1.5.0/openlineage-spark-1.5.0.jar']
    warehouse_location = abspath('spark-warehouse')
    #files = [urllib.request.urlretrieve(url)[0] for url in ol_jars]
    spark = (SparkSession.builder.master('local[*]').appName('openlineage_spark_test')
             #.config('spark.jars', ",".join(files))

             # Install and set up the OpenLineage listener
             .config('spark.jars.packages', 'io.openlineage:openlineage-spark:1.5.0')
             .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
             .config('spark.openlineage.host', 'http://localhost:5000')
             .config('spark.openlineage.namespace', 'airflow')
             .config('spark.openlineage.transport.type', 'console')
             .config("spark.sql.warehouse.dir", warehouse_location)
             .getOrCreate()
             )

    spark.sparkContext.setLogLevel("INFO")

    # DataFrame 1
    data = [[295, "South Bend", "Indiana", "IN", 101190, 112.9]]
    columns = ["rank", "city", "state", "code", "population", "price"]
    df1 = spark.createDataFrame(data, schema="rank LONG, city STRING, state STRING, code STRING, population LONG, price DOUBLE")

    print(f"Count after DataFrame 1: {df1.count()} rows, {len(df1.columns)} columns")

    # Save DataFrame 1 to the desired location
    df1.write.mode("overwrite").csv(output_path + "df1")

    # DataFrame 2
    df2 = (spark.read
          .format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("/home/haneefa/Downloads/export.csv")
          )

    # Returns a DataFrame that combines the rows of df1 and df2
    query_df = df1.union(df2)
    query_df.count()

    # Save DataFrame 2 to the desired location
    query_df.write.mode("overwrite").csv(output_path + "query_df")

    # Register the DataFrame as a temporary SQL table
    query_df.write.mode("overwrite").saveAsTable("temp_table")

    # Query 1: Add a new column derived from existing columns
    query1_df = spark.sql("SELECT *, population / price as population_price_ratio FROM temp_table")
    print(f"Count after Query 1: {query1_df.count()} rows, {len(query1_df.columns)} columns")

    # Register the DataFrame as a temporary SQL table
    query1_df.write.mode("overwrite").saveAsTable("temp_table2")

    # Save Query 1 result to the desired location
    query1_df.write.mode("overwrite").csv(output_path + "query1_df")

    spark.stop()

if __name__ == "__main__":
    execute_spark_script(1, "/home/haneefa/airflow/dags/saved_files/")
```

spark-submit --master spark://10.0.2.15:4041 --name SparkScript_query1 --deploy-mode client /home/haneefa/airflow/dags/custom_operators/sample_sql_spark.py

./bin/spark-submit --class "SparkTest" --master local[*] --jars

This is my dag:
```
from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

default_args = {
    'owner': 'admin',
    'start_date': datetime(2023, 1, 1),
}

dag = DAG(
    'spark_dagf',
    default_args=default_args,
    schedule_interval=None,
)

# Set up SparkSubmitOperator for each query
queries = ['query1']
previous_task = None

for query in queries:
    task_id = f'spark_submit_{query}'
    script_path = '/home/haneefa/airflow/dags/custom_operators/sample_sql_spark.py'
    query_num = queries.index(query) + 1

    spark_task = SparkSubmitOperator(
        task_id=task_id,
        application=script_path,
        name=f"SparkScript_{query}",
        conn_id='spark_2',
        jars='/home/haneefa/Downloads/openlineage-spark-1.5.0.jar',
        application_args=[str(query_num)],
        dag=dag,
    )

    if previous_task:
        previous_task >> spark_task

    previous_task = spark_task

if __name__ == "__main__":
    dag.cli()
```

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-11-29 10:36:45
      +
      +

      @channel +Today’s hybrid meetup with Google starts in about one hour! DM me for the link. https://openlineage.slack.com/archives/C01CK9T7HKR/p1701194768476699

      +
      + + +
      + + + } + + Michael Robinson + (https://openlineage.slack.com/team/U02LXF3HUN7) +
      + + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      David Lauzon + (davidonlaptop@gmail.com) +
      +
      2023-11-29 16:35:03
      +
      +

*Thread Reply:* Thanks for organizing this meetup and sharing it online. Very good quality of talks, btw!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-11-29 16:37:26
      +
      +

      *Thread Reply:* Thanks for coming! @Jens Pfau did most of the heavy lifting

      + + + +
      + 🙌 Paweł Leszczyński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jens Pfau + (jenspfau@google.com) +
      +
      2023-11-30 04:52:17
      +
      +

      *Thread Reply:* Thank you to everyone who turned up! +I noticed there was a question from @Sheeri Cabral (Collibra) that we didn't answer: Can someone point me to information on the collector that sends to S3? I'm curious as to the implementation (e.g. does each api push result in one S3 file? so the bucket ends up having millions of files? Does it append to a file? are there rules (configurable or otherwise) about rotating a file or directory? etc

      + +

      I didn't quite catch the context of that. +@Paweł Leszczyński did you?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-11-30 06:46:27
      +
      +

*Thread Reply:* There was a question about querying lineage events in warehouses. Julien confirmed that, in the case of hundreds of thousands or even millions of events, this could still be accomplished within Marquez, as PostgreSQL, its backend, should handle this.

      + +

I was referring to the fluentd OpenLineage proxy, which lets users copy the event and send it to multiple backends. Fluentd has a list of out-of-the-box output plugins including BigQuery, S3, Redshift and others (https://www.fluentd.org/dataoutputs)
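To answer the S3 part of the question concretely, a fluentd pipeline for this would look roughly like the sketch below. This is a hedged illustration, not a tested config: the source port, match tag, bucket name, and path are placeholders, and the S3 output plugin buffers events and flushes them in time-keyed chunks (so the bucket gets one object per chunk, not one file per API call; the buffer settings control rotation).

```
<source>
  @type http
  port 9880
</source>

<match openlineage.**>
  @type copy
  <store>
    @type stdout
  </store>
  <store>
    @type s3
    s3_bucket my-lineage-bucket    # placeholder bucket
    path lineage/
    <buffer>
      timekey 3600                 # roll a new chunk every hour
      timekey_wait 10m
    </buffer>
  </store>
</match>
```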

      + + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-11-30 06:47:11
      +
      +

      *Thread Reply:* Some more info about fluentd proxy: https://github.com/OpenLineage/OpenLineage/tree/main/proxy/fluentd

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Sheeri Cabral (Collibra) + (sheeri.cabral@collibra.com) +
      +
      2023-12-05 12:35:44
      +
      +

      *Thread Reply:* This is extremely helpful, thanks @Paweł Leszczyński!!!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Stefan Krawczyk + (stefan@dagworks.io) +
      +
      2023-11-29 15:00:00
      +
      +

      Hi all. I’m Stefan, and I’m after some directional advice:

      + +

Context:
• I drive an open source project called Hamilton. TL;DR: it’s an opinionated way to write python code to express dataflows, e.g. great for feature engineering, data processing, doing LLM workflows, etc. in python.
• One of the features is that you get “lineage as code”, e.g. you can get “column/dataframe level” lineage for pandas (or pyspark) code. So given a dataflow definition and an execution of a Hamilton dataflow, we can emit this code provenance for any artifacts generated; and this works wherever python runs.
Ask:
• As I understand it, OpenLineage was built more for artifact-to-artifact lineage, e.g. this table -> table -> ML model, etc. Question for people who use/consume lineage: would this extra level of granularity (e.g. the code that ran to create an artifact) that we can provide with Hamilton be interesting to emit as part of an OpenLineage event? (e.g. see inside your python Airflow task, or Spark job). I’m trying to determine how to prioritize an OpenLineage integration, and whether someone would pick up Hamilton because of it.
• If you would find this extra level of granularity useful, could I schedule a call with you so I can learn more about your use case please?
CC @Jakub Dardziński since I saw at the meetup that you deal with the Airflow python operator & OpenLineage.

      +
      + + + + + + + +
      +
      Website
      + <https://hamilton.dagworks.io/en/latest/> +
      + +
      +
      Stars
      + 1051 +
      + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-11-30 09:57:11
      +
      +

*Thread Reply:* We try to include SourceCodeFacet when possible when emitting OpenLineage events; however, it's purely optional, as facets are, since we can't guarantee it will always be there - for example, it's not possible for us to get the actual SQL from Spark SQL jobs.

      + + + +
      + ✅ Sheeri Cabral (Collibra) +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-11-30 09:57:29
      +
      +

      *Thread Reply:* Not sure if that answers your question 🙂

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Stefan Krawczyk + (stefan@dagworks.io) +
      +
      2023-11-30 19:15:24
      +
      +

      *Thread Reply:* @Maciej Obuchowski kind of. That’s tactical implementation advice and definitely useful.

      + +

My question is more around whether that kind of detail is actually useful for someone / what use case it would power or enable. E.g., if I integrate OpenLineage and emit that level of detail, would someone use it? If so, why?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Mandy Chessell + (mandy.e.chessell@gmail.com) +
      +
      2023-12-02 12:45:25
      +
      +

*Thread Reply:* @Stefan Krawczyk the types of use cases I have seen of this style involve a business user/auditor who is not interested in how many jobs or intermediate stores are used in the end-to-end pipeline. They want to understand the transformations that the data underwent. For this type of user we extract details such as the source code facet, or facets that describe a rule or decision, from the lineage graph and just display these transformations/decisions. If Hamilton was providing the content for these types of facets, would they be consumable by such a user? Or perhaps I should say, "could" they be consumed by such a user if the pipeline developer was careful to use meaningful variable names and comments?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Stefan Krawczyk + (stefan@dagworks.io) +
      +
      2023-12-02 14:07:42
      +
      +

      *Thread Reply:* Awesome thanks @Mandy Chessell. That's useful context. What domain would this person be operating in? Yes I would think Hamilton could be useful here. It'll help standardize how the code is structured, and makes it easier to link an output with the code that created it.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Sheeri Cabral (Collibra) + (sheeri.cabral@collibra.com) +
      +
      2023-12-04 15:01:26
      +
      +

      *Thread Reply:* @Stefan Krawczyk speaking as a vendor of lineage technology, our customers want to see the transformations - SQL if it’s SQL-based, or as much description as possible if it’s an ETL tool without direct SQL. (e.g. IBM DataStage might show a few stages like “group” and “aggregation” and show what field was used for grouping, etc).

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Sheeri Cabral (Collibra) + (sheeri.cabral@collibra.com) +
      +
      2023-12-04 15:03:50
      +
      +

*Thread Reply:* I have seen 5 general use cases for lineage:

1. Data Provenance - “where the data comes from”. This is often used when explaining what data means - by showing where it came from. e.g. salary data - did it come from a Google form, or from an HR department?
The most visceral example I have is when a VP sees 2 similar reports with different numbers, e.g. Total Sales, but the reports have different numbers, and then the VP wants to know why, and how come we can’t trust the data, etc.

With lineage it’s easy to see that the data from report A had test data taken out and that’s why the sales $$ is less, but also that’s the accurate one.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Sheeri Cabral (Collibra) + (sheeri.cabral@collibra.com) +
      +
      2023-12-04 15:05:18
      +
      +

      *Thread Reply:* 2. Impact Analysis - if data provenance is “what’s upstream”, impact analysis is “what’s downstream”. You want to change something and want to know what it might affect. Perhaps you want to decommission a table or a whole server. Perhaps you are expanding, e.g. you started off in US Dollars and now are adding Japanese Yen…you have a field called “Amount” and now you want to add another field called “currency” and update every report that uses “Amount”…..Impact analysis is for that use case.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Sheeri Cabral (Collibra) + (sheeri.cabral@collibra.com) +
      +
      2023-12-04 15:05:41
      +
      +

      *Thread Reply:* 3. Compliance - you can show that your sensitive data stays in the sensitive places, because you can show where your data is flowing.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Sheeri Cabral (Collibra) + (sheeri.cabral@collibra.com) +
      +
      2023-12-04 15:06:31
      +
      +

      *Thread Reply:* 4. Migration verification - Compare the lineage from legacy system A with the lineage from new system B. When they have parity, your migration is complete.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Sheeri Cabral (Collibra) + (sheeri.cabral@collibra.com) +
      +
      2023-12-04 15:07:03
      +
      +

      *Thread Reply:* 5. Onboarding - lineage diagrams can be used like architecture diagrams to easily acquaint a user with what data you have, how it flows, what reports it goes to, etc.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Stefan Krawczyk + (stefan@dagworks.io) +
      +
      2023-12-05 00:42:27
      +
      +

      *Thread Reply:* Thanks, that context is helpful @Sheeri Cabral (Collibra) !

      + + + +
      + ❤️ Sheeri Cabral (Collibra), Jakub Dardziński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Juan Luis Cano Rodríguez + (juan_luis_cano@mckinsey.com) +
      +
      2023-12-11 05:08:31
      +
      +

      *Thread Reply:* loved this thread, thanks everyone!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-12-01 17:03:32
      +
      +

@channel
The November issue of OpenLineage News is here! This issue covers the latest updates to the OpenLineage Airflow Provider, a recap of the meetup in Warsaw, recent releases, and much more.
To get the newsletter directly in your inbox each month, sign up here.

      +
      +
      openlineage.us14.list-manage.com
      + + + + + + + + + + + + + + + +
      + + + +
      + 👍 Paweł Leszczyński, Jakub Dardziński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-12-04 15:34:26
      +
      +

      @channel +I’m opening a vote to release OpenLineage 1.6.0, including: +• a new JobTypeFacet containing additional job-related information to improve support for Flink and streaming in general +• an option for the Flink job listener to read from Flink conf +• in the dbt integration, a new command to send metadata of the last run without running the job +• bug fixes in the Spark and Flink integrations +• more. +Three +1s from committers will authorize. Thanks in advance.

      + + + +
      + ➕ Jakub Dardziński, Harel Shein, Damien Hawes, Mandy Chessell +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-05 08:16:25
      +
      +

      *Thread Reply:* Can we hold on for a day? I do have some doubts about this one https://github.com/OpenLineage/OpenLineage/pull/2293

      +
      + + + + + + + +
      +
      Labels
      + client/java, dependencies, dependabot +
      + +
      +
      Comments
      + 1 +
      + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-12-05 08:50:44
      +
      +

      *Thread Reply:* Our policy allows for 2 days, so there’s no problem as far as I’m concerned

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-12-05 11:32:29
      +
      +

      *Thread Reply:* Thanks, all, the release is authorized and will be initiated within 2 business days.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-05 11:33:35
      +
      +

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2297 - this should resolve the problem. We don't have to wait anymore. Thank you @Michael Robinson

      +
      + + + + + + + +
      +
      Labels
      + documentation, integration/spark +
      + +
      +
      Comments
      + 2 +
      + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Sathish Kumar J + (sathish.jeganathan@walmart.com) +
      +
      2023-12-06 21:51:29
      +
      +

      @Harel Shein thanks for the invite!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Harel Shein + (harel.shein@gmail.com) +
      +
      2023-12-06 21:52:39
      +
      +

      *Thread Reply:* Welcome :)

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-07 04:24:52
      +
      +

      Hi

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-07 04:27:51
      +
      +

      *Thread Reply:* Hey Zacay!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-07 04:28:51
      +
      +

I want to know: does OpenLineage have support for column-level lineage?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-07 04:30:12
      +
      +

*Thread Reply:* it’s currently supported in the Spark integration and for SQL-based Airflow operators

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-07 04:30:48
      +
      +

      *Thread Reply:* and DBT?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-07 04:31:11
      +
      +

      *Thread Reply:* DBT has only table-level lineage at the moment

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-07 04:36:28
      +
      +

      *Thread Reply:* would you be interested in contributing/helping to add this feature? 🙂
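To make the ask concrete, this is roughly what a column-level lineage facet looks like when attached to a dataset with the Python client. A sketch only: the facet classes are from openlineage-python, while the namespaces, table, and column names below are made up for illustration.

```python
# a minimal sketch of a column-level lineage facet (illustrative names)
from openlineage.client.facet import (
    ColumnLineageDatasetFacet,
    ColumnLineageDatasetFacetFieldsAdditional,
    ColumnLineageDatasetFacetFieldsAdditionalInputFields,
)
from openlineage.client.run import Dataset

col_lineage = ColumnLineageDatasetFacet(
    fields={
        "population_price_ratio": ColumnLineageDatasetFacetFieldsAdditional(
            inputFields=[
                ColumnLineageDatasetFacetFieldsAdditionalInputFields(
                    namespace="postgres://db", name="public.cities", field="population"
                ),
                ColumnLineageDatasetFacetFieldsAdditionalInputFields(
                    namespace="postgres://db", name="public.cities", field="price"
                ),
            ],
            transformationDescription="population divided by price",
            transformationType="EXPRESSION",
        )
    }
)

output = Dataset(
    namespace="postgres://db",
    name="public.city_ratios",
    facets={"columnLineage": col_lineage},
)
```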

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-07 04:38:12
      +
      +

*Thread Reply:* look at Gudu Soft

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-07 04:38:18
      +
      +

*Thread Reply:* it parses SQL

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-07 04:40:05
      +
      +

      *Thread Reply:* I’m sorry but I didn’t get it. What does the parser provide that would be helpful in terms of dbt?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Sumeet Gyanchandani + (sumeet.gyanchandani@gmail.com) +
      +
      2023-12-07 07:11:47
      +
      +

*Thread Reply:* @Jakub Dardziński I played around with column-level lineage in Spark recently, also listening to OL events and converting them to Apache Atlas entities to upload to Azure Purview. Works like a charm 🙂

      + +

Flink, not so much. Do you have column-level lineage in Flink yet, or is it on the roadmap for the future? Happy to contribute.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-07 07:12:29
      +
      +

      *Thread Reply:* @Maciej Obuchowski @Paweł Leszczyński know more about Flink 🙂

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-07 07:12:58
      +
      +

      *Thread Reply:* it's awesome you got it working with Purview!

      + + + +
      + 🙏 Sumeet Gyanchandani +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-12-07 07:19:58
      +
      +

*Thread Reply:* we're actively working with the Flink team to have a first-class integration for Flink - a lot of things, like column-level lineage, are unfortunately currently not available

      + + + +
      + 🙌 Sumeet Gyanchandani +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Sumeet Gyanchandani + (sumeet.gyanchandani@gmail.com) +
      +
      2023-12-07 09:09:33
      +
      +

      *Thread Reply:* @Jakub Dardziński and @Maciej Obuchowski thank you for the prompt responses. I really appreciate it!

      + + + +
      + 🙂 Jakub Dardziński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Simran Suri + (mailsimransuri@gmail.com) +
      +
      2023-12-07 05:21:13
      +
      +

Hi everyone, could someone help me with integrating OpenLineage with dbt? I'm particularly interested in sending events to a Kafka topic rather than using HTTP. Any guidance on this would be greatly appreciated.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-07 05:28:04
      +
      +

      *Thread Reply:* Hey, it’s essentially described here: https://openlineage.io/docs/client/python

      +
      +
      openlineage.io
      + + + + + + + + + + + + + + + +
      + + + +
      + 👍 Simran Suri +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-07 05:28:40
      +
      +

*Thread Reply:* in your case the best would be to:
Set an environment variable to the config file path: OPENLINEAGE_CONFIG=path/to/openlineage.yml.
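For the Kafka case specifically, the config file could look roughly like this. A minimal sketch, assuming a recent openlineage-python with Kafka support installed; the topic name and broker address are placeholders.

```yaml
# openlineage.yml
transport:
  type: kafka
  topic: openlineage_events          # placeholder topic
  config:
    bootstrap.servers: broker:9092   # placeholder broker address
  flush: true
```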

      + + + +
      + 👍 Simran Suri +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Simran Suri + (mailsimransuri@gmail.com) +
      +
      2023-12-07 05:29:43
      +
      +

*Thread Reply:* that will work with dbt, right? Would only setting up the environment variable be enough? Also, to get OL events I need to do dbt-ol run, correct?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-07 05:30:23
      +
      +

*Thread Reply:* setting the env var + putting the config file at the pointed path

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-07 05:30:31
      +
      +

*Thread Reply:* and yes, you need to run it with the dbt-ol wrapper script

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Simran Suri + (mailsimransuri@gmail.com) +
      +
      2023-12-07 05:31:50
      +
      +

      *Thread Reply:* Great, that's very helpful. I'll try the same and will definitely ask another question if I encounter any issues while trying this out.

      + + + +
      + 👍 Jakub Dardziński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Damien Hawes + (damien.hawes@booking.com) +
      +
      2023-12-07 07:15:55
      +
      +

      *Thread Reply:* @Jakub Dardziński - regarding the Python client, does it have a functionality similar to the Java client? For example, the Java client allows you to use the service provider interface to implement a custom ~client~ transport.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-07 07:17:37
      +
      +

      *Thread Reply:* do you mean custom transport?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Damien Hawes + (damien.hawes@booking.com) +
      +
      2023-12-07 07:17:47
      +
      +

      *Thread Reply:* Correct.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-07 07:21:23
      +
      +

      *Thread Reply:* it sure does, there's mention of it in the link above

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2023-12-07 12:26:46
      +
      +

Hi all, I'm still struggling to get lineage from Databricks Unity Catalog.
Is anyone here extracting lineage from Databricks/UC successfully?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2023-12-07 12:27:09
      +
      +

*Thread Reply:* OL 1.5 is showing this error:

      + +

```
23/12/07 17:20:15 INFO PlanUtils: apply method failed with
org.apache.spark.SparkException: There is no Credential Scope. Current env: Driver
  at com.databricks.unity.UCSDriver$Manager.$anonfun$currentScopeId$3(UCSDriver.scala:131)
  at scala.Option.getOrElse(Option.scala:189)
  at com.databricks.unity.UCSDriver$Manager.currentScopeId(UCSDriver.scala:131)
  at com.databricks.unity.UCSDriver$Manager.currentScope(UCSDriver.scala:134)
  at com.databricks.unity.UnityCredentialScope$.currentScope(UnityCredentialScope.scala:100)
  at com.databricks.unity.UnityCredentialScope$.getSAMRegistry(UnityCredentialScope.scala:120)
  at com.databricks.unity.SAMRegistry$.registerSAM(SAMRegistry.scala:307)
  at com.databricks.unity.SAMRegistry$.registerDefaultSAM(SAMRegistry.scala:323)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalogImpl.defaultTablePath(SessionCatalog.scala:1200)
  at com.databricks.sql.managedcatalog.ManagedCatalogSessionCatalog.defaultTablePath(ManagedCatalogSessionCatalog.scala:991)
  at io.openlineage.spark3.agent.lifecycle.plan.catalog.AbstractDatabricksHandler.getDatasetIdentifier(AbstractDatabricksHandler.java:92)
  at io.openlineage.spark3.agent.lifecycle.plan.catalog.CatalogUtils3.lambda$getDatasetIdentifier$2(CatalogUtils3.java:61)
  at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
  at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
  at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
  at java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1361)
  at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126)
  at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:499)
  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486)
  at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
  at java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152)
  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
  at java.util.stream.ReferencePipeline.findAny(ReferencePipeline.java:536)
  at io.openlineage.spark3.agent.lifecycle.plan.catalog.CatalogUtils3.getDatasetIdentifier(CatalogUtils3.java:63)
  at io.openlineage.spark3.agent.lifecycle.plan.catalog.CatalogUtils3.getDatasetIdentifier(CatalogUtils3.java:46)
  at io.openlineage.spark3.agent.utils.PlanUtils3.getDatasetIdentifier(PlanUtils3.java:79)
  at io.openlineage.spark35.agent.lifecycle.plan.CreateReplaceOutputDatasetBuilder.apply(CreateReplaceOutputDatasetBuilder.java:144)
  at io.openlineage.spark35.agent.lifecycle.plan.CreateReplaceOutputDatasetBuilder.lambda$apply$3(CreateReplaceOutputDatasetBuilder.java:116)
  at java.util.Optional.map(Optional.java:215)
  at io.openlineage.spark35.agent.lifecycle.plan.CreateReplaceOutputDatasetBuilder.apply(CreateReplaceOutputDatasetBuilder.java:114)
  at io.openlineage.spark35.agent.lifecycle.plan.CreateReplaceOutputDatasetBuilder.apply(CreateReplaceOutputDatasetBuilder.java:60)
  at io.openlineage.spark35.agent.lifecycle.plan.CreateReplaceOutputDatasetBuilder.apply(CreateReplaceOutputDatasetBuilder.java:39)
  at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder$1.apply(AbstractQueryPlanDatasetBuilder.java:94)
  at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder$1.apply(AbstractQueryPlanDatasetBuilder.java:85)
  at io.openlineage.spark.agent.util.PlanUtils.safeApply(PlanUtils.java:279)
  at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder.lambda$apply$0(AbstractQueryPlanDatasetBuilder.java:75)
  at java.util.Optional.map(Optional.java:215)
  at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder.apply(AbstractQueryPlanDatasetBuilder.java:67)
  at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder.apply(AbstractQueryPlanDatasetBuilder.java:39)
  at io.openlineage.spark.agent.util.PlanUtils.safeApply(PlanUtils.java:279)
  at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.lambda$null$23(OpenLineageRunEventBuilder.java:451)
  at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
  at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
  at java.util.Iterator.forEachRemaining(Iterator.java:116)
  at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
  at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
  at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
  at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
  at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
  at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272)
  at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
  at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
  at java.util.stream.StreamSpliterators$WrappingSpliterator.forEachRemaining(StreamSpliterators.java:313)
  at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
  at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
  at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
  at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
  at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildOutputDatasets(OpenLineageRunEventBuilder.java:410)
  at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.populateRun(OpenLineageRunEventBuilder.java:298)
  at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:281)
  at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:238)
  at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.end(SparkSQLExecutionContext.java:126)
  at io.openlineage.spark.agent.OpenLineageSparkListener.sparkSQLExecEnd(OpenLineageSparkListener.java:98)
  at io.openlineage.spark.agent.OpenLineageSparkListener.onOtherEvent(OpenLineageSparkListener.java:84)
  at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:102)
  at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
  at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:42)
  at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:42)
  at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:118)
  at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:102)
  at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:114)
  at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:114)
  at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
  at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
  at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:109)
  at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:105)
  at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1660)
  at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:105)
```

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2023-12-07 12:27:41
      +
      +

      *Thread Reply:* Event output is showing up empty.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2023-12-11 08:21:08
      +
      +

      *Thread Reply:* Anyone with the same issue or no issues at all regarding Unity Catalog?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-12-07 12:58:55
      +
      +

@channel
We released OpenLineage 1.6.2, including:
• Dagster: support Dagster 1.5.x #2220 @tsungchih
• Dbt: add a new command dbt-ol send-events to send metadata of the last run without running the job #2285 @sophiely
• Flink: add an option for the Flink job listener to read from Flink conf #2229 @ensctom
• Spark: get column-level lineage from the JDBC dbtable option #2284 @mobuchowski
• Spec: introduce JobTypeJobFacet to contain additional job-related information #2241 @pawel-big-lebowski
• SQL: add quote information from sqlparser-rs #2259 @JDarDagran
• bug fixes, tests, and more.
Thanks to all the contributors, including new contributors @tsungchih and @ensctom!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/1.6.2
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/1.5.0...1.6.2
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jorge + (jorge.varona@nike.com) +
      +
      2023-12-07 13:01:27
      +
      +

      👋 Hi everyone! Just lurking to get insights into how I might implement this at my company to solve our lineage challenge.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-12-07 13:02:06
      +
      +

      *Thread Reply:* 👋 Hi Jorge, welcome! Can you tell us a bit about your use case?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jorge + (jorge.varona@nike.com) +
      +
      2023-12-07 13:03:51
      +
      +

*Thread Reply:* Large company with a big ball of mud data ecosystem. There is active debate between doubling down on a vendor (which probably won't work) and doing the work to instrument our jobs/tooling in an agnostic way. I'm inclined to do the latter but want to understand what the implementation might look like.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Harel Shein + (harel.shein@gmail.com) +
      +
      2023-12-07 14:44:04
      +
      +

      *Thread Reply:* welcome! please feel free to ask any questions as they arise

      + + + +
      + 👍 Jorge +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Simran Suri + (mailsimransuri@gmail.com) +
      +
      2023-12-08 05:59:40
      +
      +

      Hello everyone, +I'm currently utilizing the OpenLineage JAR version - openlineage_spark_1_1_0.jar to capture lineage information in my environment. +I have a specific requirement to capture facets related to Input Datasets, especially focusing on Data Quality Metrics facets such as row count information for the input data, as well as Output Dataset facets encompassing output statistics, like row count information for the output data.

      + +

Although the tags "inputFacets":{}}] and "outputFacets":{}}] appear to be enabled in the event, the values within these tags never reflect the expected information; they are always blank.

      + +

This setup involves Databricks, and the cluster's Spark version is Apache Spark 3.3.2. I've configured the OpenLineage setup in the Global Init scripts within the Databricks workspace.

      + +

      Would greatly appreciate it if someone could provide guidance or insight into this issue.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Harel Shein + (harel.shein@gmail.com) +
      +
      2023-12-08 11:40:47
      +
      +

      *Thread Reply:* can you turn on the debug facet and share an example event so we can try to help? +spark.openlineage.debugFacet should be set to enabled +this is from https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark
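For reference, a minimal sketch of a Spark session with the debug facet turned on (the listener and package settings are the usual ones from the Spark integration docs; the app name and OL version are placeholders):
```
from pyspark.sql import SparkSession

# Sketch: enable the OpenLineage debug facet so events carry extra
# environment/runtime details useful for troubleshooting empty facets.
spark = (
    SparkSession.builder.appName("ol-debug-example")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:1.6.2")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "console")
    .config("spark.openlineage.debugFacet", "enabled")  # the setting mentioned above
    .getOrCreate()
)
```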

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Simran Suri + (mailsimransuri@gmail.com) +
      +
      2023-12-11 03:22:03
      +
      +

      *Thread Reply:* Hi @Harel Shein, sure.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Simran Suri + (mailsimransuri@gmail.com) +
      +
      2023-12-11 03:23:59
      +
      +

      *Thread Reply:* This text file contains a total of 10-11 events, including the start and completion events of one of my notebook runs. The process is simply reading from a Hive location and performing a full load to another Hive location.

      + + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2023-12-11 03:32:40
      +
      +

*Thread Reply:* @Simran Suri do you get any cluster logs with an error? I'm running a newer version of the OL jar and I'm getting inputs and outputs from Hive (but not for Unity Catalog)

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Simran Suri + (mailsimransuri@gmail.com) +
      +
      2023-12-11 03:54:28
      +
      +

      *Thread Reply:* No @Rodrigo Maia, I can't see any errors there

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Harel Shein + (harel.shein@gmail.com) +
      +
      2023-12-12 16:05:18
      +
      +

      *Thread Reply:* thanks for adding that! +per the docs here, did you extend the Input/OutputDatasetFacetBuilder with anything to track data quality metrics?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Simran Suri + (mailsimransuri@gmail.com) +
      +
      2023-12-13 00:49:01
      +
      +

*Thread Reply:* Actually no, I haven't tried that out. Can you give me a bit more detail on how it can be extended? Do I need to add some configs?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-12-08 15:04:43
      +
      +

      @channel +This month’s TSC meeting is next Thursday the 14th at 10am PT. On the tentative agenda: +• announcements +• recent releases +• proposal updates +• open discussion +• more (TBA) +More info and the meeting link can be found on the website. All are welcome! Do you have a discussion topic, use case or integration you’d like to demo? DM me to be added to the agenda.

      +
      +
      openlineage.io
      + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Joey Mukherjee + (joey.mukherjee@swri.org) +
      +
      2023-12-09 18:27:05
      +
      +

      Hi! I'm interested in using OpenLineage to track data files in my pipeline. Basically, I receive a data file and run some other Python code + ancillary files to produce other data files which are then used to produce more data files and onward. Each level of files is versioned and I would like to track this lineage. I don't use Spark, Airflow, dbt, etc. I do use Prefect though. My wrapper script is Python. Is OpenLineage appropriate? Seems like it... is my "DataSet" every individual file that I produce? I think I have to write my own integration and facet. Is that also correct? Any other advice? Thanks!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Harel Shein + (harel.shein@gmail.com) +
      +
      2023-12-10 12:09:47
      +
      +

*Thread Reply:* Hi @Joey Mukherjee, welcome to the community! This use case would work using the OpenLineage spec. You are right, unfortunately we don't currently have a Prefect integration, but we'd be happy to support you if you choose to write it! 🙂 +I don't know enough about Prefect to say what the right model to map to would be. Have you looked at the OpenLineage data model? +https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md
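As a starting point, a minimal sketch of emitting events from a custom Python wrapper with the openlineage-python client; the namespace, job name, producer URI, and file paths are placeholders:
```
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")
run = Run(runId=str(uuid4()))
job = Job(namespace="my-pipeline", name="process_level1_files")
producer = "https://example.com/my-pipeline"

# START: declare the known input file(s); each produced file can be its own dataset.
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run, job=job, producer=producer,
    inputs=[Dataset(namespace="file", name="/data/level0/input_v3.dat")],
    outputs=[],
))

# ... run the processing step, discover the produced files ...

# COMPLETE: report the outputs discovered while running; same runId as above.
client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run, job=job, producer=producer,
    inputs=[],
    outputs=[Dataset(namespace="file", name="/data/level1/output_v1.dat")],
))
```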

      + + + +
      + 👍 Jakub Dardziński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-12-11 08:12:40
      +
      +

      Hey team.

      + +

      In the spark integration: When we do a spark.jdbc() or spark.read() from a JDBC connection like mysql/postgres etc, does OpenLineage support capturing metadata on the JDBC connection (host URI / port / username etc) in the OL events or not?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-12-11 09:32:14
      +
      +

      *Thread Reply:* > OpenLineage support capturing metadata on the JDBC connection (host URI / port / username etc) +I think those are part of dataset namespace? Probably not username tho

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-12-11 12:34:46
      +
      +

      *Thread Reply:* Ack, and are these spark.jdbc inputs being captured for spark 3.x onwards or for spark 2.x as well?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-12-11 12:34:55
      +
      +

      *Thread Reply:* @Maciej Obuchowski ^

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-12-11 12:53:25
      +
      +

      *Thread Reply:* I would assume not for Spark 2

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      David Goss + (david.goss@matillion.com) +
      +
      2023-12-11 11:12:27
      +
      +

      ❓ Are there any particular standards/conventions around how to express data types in the dataset schema facet? The examples I’ve seen have been just types like integer etc. I think it would be useful for certain types to include modifiers for size, precision etc, so like varchar(50) and stuff like that. Would it just be a case of, stick to how the platform (mysql, snowflake, whatever) expresses it in DDL?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-12-11 12:54:12
      +
      +

      *Thread Reply:* We do express what's in DDL basically. It's database specific anyway
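So a schema facet carrying platform-native DDL types could look roughly like this (a sketch; the field names are made up):
```
# Sketch of a dataset "schema" facet with type modifiers kept as in the DDL.
schema_facet = {
    "fields": [
        {"name": "customer_name", "type": "varchar(50)"},
        {"name": "order_total", "type": "decimal(38,2)"},
        {"name": "created_at", "type": "timestamp"},
    ]
}
```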

      + + + +
      + 👍 David Goss +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2023-12-11 16:31:08
      +
      +

QQ - we have a multi-cluster Redshift architecture where table meta-information could exist in a different cluster. The way I see it, the extractor here requires table meta-information to create the lineage, right? Currently I don't see the tables that live outside that cluster in my input datasets. Any thoughts?

      +
      + + + + + + + + + + + + + + + + +
      +
      + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-11 16:34:13
      +
      +

*Thread Reply:* I'm not entirely sure how multi-cluster Redshift works. AFAIK the solution would be to take advantage of SVV_REDSHIFT_COLUMNS

      + + + +
      + 👀 harsh loomba +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-11 16:34:31
      +
      +

      *Thread Reply:* I’ve got opened PR for Airflow OL provider here: +https://github.com/apache/airflow/pull/35794

      + + + +
      + 👀 harsh loomba +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-14 04:56:15
      +
      +

*Thread Reply:* hey @harsh loomba, did you have a chance to check the above out?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2023-12-14 11:50:53
      +
      +

*Thread Reply:* I did check, and it looks promising for the problem statement we have, right @Willy Lulciuc?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Willy Lulciuc + (willy@datakin.com) +
      +
      2023-12-14 14:30:44
      +
      +

      *Thread Reply:* @harsh loomba yep, it does!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Willy Lulciuc + (willy@datakin.com) +
      +
      2023-12-14 14:31:20
      +
      +

      *Thread Reply:* thanks @Jakub Dardziński for the quick fix!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2023-12-14 14:31:44
      +
      +

      *Thread Reply:* ohhh great

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2023-12-14 14:31:58
      +
      +

      *Thread Reply:* thanks!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2023-12-14 14:32:07
      +
      +

*Thread Reply:* hopefully it will be added in the upcoming OL release

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Joey Mukherjee + (joey.mukherjee@swri.org) +
      +
      2023-12-11 19:47:37
      +
      +

      I'm playing with OpenLineage and for my start event and complete event, do they have to have the same input and output datasets? Say my input datasets generate files unknown at start time, can OpenLineage handle that? Right now, I am getting 422 Client Error: for url: http://xxx:5000/api/v1/lineage . How do I find out the error? I am not using any of the integrations.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-12-12 05:37:04
      +
      +

      *Thread Reply:* > I'm playing with OpenLineage and for my start event and complete event, do they have to have the same input and output datasets? Say my input datasets generate files unknown at start time, can OpenLineage handle that? +Idea of OpenLineage is to handle this kind of events - the events are generally ment to be cumulative. As you said, inputs or outputs can be not known at start time, but there can be opposite situation. You're reading version of the dataset that has been changing in the meantime, and you know the particular version only on start.

      + +

      That being said, the actual handling of those events depends on the consumer itself, so it depends on what you're using.

      + +

      > Right now, I am getting 422 Client Error: for url: http://xxx:5000/api/v1/lineage . How do I find out the error? I am not using any of the integrations. +It's probably consumer specific - hard to tell without knowledge of what you're using.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-12-13 03:06:36
      +
      +

      Hey folks. Is there a way to disable column-level lineage / "schema" facet in inputs/outputs for spark integration?

      + +

      Basically to have just table-level lineage and disable column-level lineage

      + + + +
      + ➕ Anirudh Shrinivason +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-12-13 03:22:17
      +
      +

      *Thread Reply:* cc @Honey Thakuria

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Damien Hawes + (damien.hawes@booking.com) +
      +
      2023-12-13 03:47:20
      +
      +

      *Thread Reply:* The spark integration provides the ability to disable facets, via the spark.openlineage.facets.disabled configuration.

      + +

      You provide values like this: +spark.openlineage.facets.disabled=[spark_unknown;spark.logicalPlan;&lt;more&gt;]

      + + + +
      + 👍 Jakub Dardziński, Paweł Leszczyński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-12-13 06:25:24
      +
      +

      *Thread Reply:* Right, but is there a specific facet we can disable here to get table-level lineage but skip column-level lineage in the events?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Damien Hawes + (damien.hawes@booking.com) +
      +
      2023-12-13 06:44:00
      +
      +

      *Thread Reply:* The facet name that you want to disable is "columnLineage" in this case
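Putting the two replies together, a sketch of a session that keeps table-level lineage but drops the column-level facet (syntax as quoted above; the app name is a placeholder):
```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("table-level-only")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # Disable column-level lineage (plus the verbose plan facets) while
    # keeping the input/output datasets themselves.
    .config("spark.openlineage.facets.disabled", "[spark_unknown;spark.logicalPlan;columnLineage]")
    .getOrCreate()
)
```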

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      rajeshree parmar + (rajeshreedatavizz@gmail.com) +
      +
      2023-12-13 05:05:07
      +
      +

Hi, I want to use OpenLineage on my platform. How can I use it? I already have metadata and profiler pipelines; how do I integrate them?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Kacper Muda + (kacper.muda@getindata.com) +
      +
      2023-12-13 08:33:36
      +
      +

Hi, I'd like to request a release (patch 1.6.3) that will include this PR: #2305 . It would help people using OL with the Airflow integration (with Airflow version 2.6).

      + + + +
      + 👍 Harel Shein, Maciej Obuchowski +
      + +
      + ➕ Jakub Dardziński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-12-14 08:40:03
      +
      +

      *Thread Reply:* Thanks for requesting a release. It is authorized and will be initiated within 2 business days (not including Friday).

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Kacper Muda + (kacper.muda@getindata.com) +
      +
      2023-12-14 08:57:34
      +
      +

      *Thread Reply:* Perfect, thank you !

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Joey Mukherjee + (joey.mukherjee@swri.org) +
      +
      2023-12-13 09:59:56
      +
      +

I have a question/misunderstanding about the output section from Marquez under I/O. For one of my Jobs, the outputs shown are the correct output plus all of the inputs from the previous Job in the pipeline. I'm not sure why, but how do I find the cause of this? I feel like I did everything correctly. To be clear, I am using a custom pipeline based on files and the example code from the Python section of the OpenLineage docs.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Joey Mukherjee + (joey.mukherjee@swri.org) +
      +
      2023-12-13 13:18:15
      +
      +

      *Thread Reply:* I notice only one of my three jobs has Run information. I have only six events, and all three have START and COMPLETE states. From what I can tell, the information in the events list is correct, but the GUI is not right. Open to any ideas on how to debug!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-12-13 11:34:15
      +
      +

      @channel +This month’s TSC meeting is tomorrow at 10 am PT https://openlineage.slack.com/archives/C01CK9T7HKR/p1702065883107479

      +
      + + +
      + + + } + + Michael Robinson + (https://openlineage.slack.com/team/U02LXF3HUN7) +
      + + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-14 02:01:32
      +
      +

      Hi team, I noticed this error: +ERROR ColumnLevelLineageUtils: Error when invoking static method 'buildColumnLineageDatasetFacet' for Spark3 +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875245402Z java.lang.reflect.InvocationTargetException +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875250061Z at jdk.internal.reflect.GeneratedMethodAccessor469.invoke(Unknown Source) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875253194Z at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875256744Z at java.base/java.lang.reflect.Method.invoke(Method.java:566) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875260324Z at io.openlineage.spark.agent.lifecycle.plan.column.ColumnLevelLineageUtils.buildColumnLineageDatasetFacet(ColumnLevelLineageUtils.java:35) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875263824Z at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.lambda$buildOutputDatasets$21(OpenLineageRunEventBuilder.java:434) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875267715Z at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875270666Z at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875273785Z at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875276824Z at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875279799Z at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875285515Z at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875288751Z at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875292022Z at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildOutputDatasets(OpenLineageRunEventBuilder.java:447) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875294766Z at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.populateRun(OpenLineageRunEventBuilder.java:306) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875297558Z at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:289) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875300352Z at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:232) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875303027Z at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.start(SparkSQLExecutionContext.java:70) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875305695Z at 
io.openlineage.spark.agent.OpenLineageSparkListener.lambda$sparkSQLExecStart$0(OpenLineageSparkListener.java:91) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875308769Z at java.base/java.util.Optional.ifPresent(Optional.java:183) +... +in some pipelines for OL version 0.30.1. May I check if this has already been reported/been fixed in the later releases of OL? Thanks!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-14 02:35:34
      +
      +

*Thread Reply:* No, it wasn't reported. Are you able to reproduce the same with a recent OpenLineage version?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-14 02:36:47
      +
      +

      *Thread Reply:* I have not tried actually... let me try that and get back if it persists. Thanks

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-14 02:37:54
      +
      +

*Thread Reply:* just to make sure: this is just an error in the logs and should not prevent the OL event from being generated (except for column-level lineage), right?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-14 02:45:04
      +
      +

*Thread Reply:* Yeah, but column-level lineage is what we'd actually want to capture, so I'm wondering why this error was being thrown

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-14 02:46:03
      +
      +

      *Thread Reply:* sure, could you also paste the end of the stacktrace?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-14 02:48:28
      +
      +

      *Thread Reply:* Yup, let me get that

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-14 02:49:43
      +
      +

      *Thread Reply:* INFO - 2023-12-12T15:06:34.726868651Z at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726871557Z at scala.collection.convert.Wrappers$IteratorWrapper.forEachRemaining(Wrappers.scala:31) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726874327Z at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726878805Z at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726881789Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.collect(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726884664Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.lambda$collect$0(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726887371Z at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726890358Z at scala.collection.convert.Wrappers$IteratorWrapper.forEachRemaining(Wrappers.scala:31) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726893117Z at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726895977Z at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726898933Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.collect(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726901694Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.lambda$collect$0(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726904581Z at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726907524Z at scala.collection.convert.Wrappers$IteratorWrapper.forEachRemaining(Wrappers.scala:31) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726910276Z at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726913201Z at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726916094Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.collect(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726918805Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.lambda$collect$0(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726921673Z at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726924364Z at 
scala.collection.convert.Wrappers$IteratorWrapper.forEachRemaining(Wrappers.scala:31) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726927240Z at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726930033Z at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726932890Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.collect(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726935677Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.lambda$collect$0(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726938562Z at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726941533Z at scala.collection.convert.Wrappers$IteratorWrapper.forEachRemaining(Wrappers.scala:31) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726944428Z at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726947297Z at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726951913Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.collect(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726954964Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.lambda$collect$0(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726957942Z at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726960897Z at scala.collection.convert.Wrappers$IteratorWrapper.forEachRemaining(Wrappers.scala:31) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726963830Z at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726968390Z at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726971331Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.collect(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726974162Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.lambda$collect$0(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726977051Z at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726979989Z at scala.collection.convert.Wrappers$IteratorWrapper.forEachRemaining(Wrappers.scala:31) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726982847Z at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) +[2023-12-12, 15:06:34 UTC] 
{subprocess.py:93} INFO - 2023-12-12T15:06:34.726985790Z at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726988700Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.collect(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726991547Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.lambda$collect$0(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726994338Z at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726997282Z at scala.collection.convert.Wrappers$IteratorWrapper.forEachRemaining(Wrappers.scala:31) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727000102Z at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727003005Z at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727006039Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.collect(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727009344Z at io.openlineage.spark3.agent.lifecycle.plan.column.ColumnLevelLineageUtils.collectInputsAndExpressionDependencies(ColumnLevelLineageUtils.java:72) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727012365Z at io.openlineage.spark3.agent.lifecycle.plan.column.ColumnLevelLineageUtils.lambda$null$2(ColumnLevelLineageUtils.java:83) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727015353Z at java.base/java.util.Optional.ifPresent(Optional.java:183) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727018363Z at io.openlineage.spark3.agent.lifecycle.plan.column.ColumnLevelLineageUtils.lambda$collectInputsAndExpressionDependencies$3(ColumnLevelLineageUtils.java:80) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727021227Z at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:174) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727024129Z at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1(TreeNode.scala:175) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727029039Z at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1$adapted(TreeNode.scala:175) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727032321Z at scala.collection.immutable.List.foreach(List.scala:431) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727035293Z at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:175) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727038585Z at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1(TreeNode.scala:175) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727041600Z at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1$adapted(TreeNode.scala:175) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727044499Z at 
scala.collection.immutable.List.foreach(List.scala:431) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727047389Z at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:175) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727050337Z at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1(TreeNode.scala:175) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727053034Z at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1$adapted(TreeNode.scala:175) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727055990Z at scala.collection.immutable.List.foreach(List.scala:431) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727060521Z at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:175) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727063671Z at io.openlineage.spark3.agent.lifecycle.plan.column.ColumnLevelLineageUtils.collectInputsAndExpressionDependencies(ColumnLevelLineageUtils.java:76) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727066753Z at io.openlineage.spark3.agent.lifecycle.plan.column.ColumnLevelLineageUtils.buildColumnLineageDatasetFacet(ColumnLevelLineageUtils.java:42) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727069592Z ... 35 more

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-14 02:45:58
      +
      +

      Another question, +I noticed that for a few cases, lineage is not being captured if running: +df.toPandas() +via pyspark, and then doing some pandas operations on it and writing it back to an s3 location. May I check if this is expected?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-14 02:47:18
      +
      +

*Thread Reply:* this is something we haven't tested nor included in our tests. Not sure what happens when Spark data goes to pandas.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-14 02:48:18
      +
      +

      *Thread Reply:* Got it. Thanks!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Damien Hawes + (damien.hawes@booking.com) +
      +
      2023-12-15 04:39:45
      +
      +

*Thread Reply:* I can speak from experience, because we had a similar issue in our custom Spark listener: toPandas breaks lineage because calling toPandas is like calling collect, which forces the data in the Spark DataFrame to the driver and pipes it to the running Python sidecar process. Whatever you do afterwards, you're running code in a private memory space, and Spark has no way of knowing what you're doing.
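In code terms, a sketch of the pattern in question (paths and column names are placeholders):
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The read goes through Spark's logical plan, so the listener can see the input.
df = spark.read.parquet("s3://bucket/in/")

# toPandas() is effectively collect(): the data is pulled to the driver and
# handed to plain pandas in the Python process.
pdf = df.toPandas()

# Everything below runs outside Spark. This write never appears in a logical
# plan, so no OL output dataset can be emitted for it.
pdf["ratio"] = pdf["a"] / pdf["b"]
pdf.to_parquet("s3://bucket/out/data.parquet")
```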

      + + + +
      + 👍 Anirudh Shrinivason +
      + +
      + :gratitude_thank_you: Anirudh Shrinivason +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-29 03:41:29
      +
      +

      *Thread Reply:* Just wondering, is it even technically possible to address this case where pipelines use toPandas to get lineage?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Damien Hawes + (damien.hawes@booking.com) +
      +
      2024-01-08 04:46:36
      +
      +

      *Thread Reply:* To an extent - yes. Though not necessarily with the OL connector. As hinted previously, we had to recently update our own listener that we wrote years ago in order to publish lineage for subquery alias / project operations from collect / toPandas operations.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-12-14 04:58:57
      +
      +

We're noticing significant delays in a few Spark jobs, wherein the OpenLineage Spark listener seems to keep running even after the Spark/YARN application has completed and requested a graceful exit. Even after the application has ended, we see a couple of events still being processed, and each event takes around 5-6 mins (rarely, we have seen 9 mins as well):

      + +

      2023-12-13 22:52:26.834 -0800] [INFO ] [spark-listener-group-shared] [org.apache.spark.scheduler.AsyncEventQueue.logInfo@57] - Process of event SparkListenerJobEnd(12,1702535631760,JobSucceeded) by listener OpenLineageSparkListener took 385.396979168s. +This kinda results in our actual jobs taking 30+ mins to be marked as completed, which impacts SLA.

      + +

Has anyone faced this issue, and are there any tips on how we can debug which event is causing this exact 5-6 min delay / which method in OpenLineageSparkListener is taking the time?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-12-14 10:59:48
      +
      +

      *Thread Reply:* Ping @Paweł Leszczyński @Maciej Obuchowski ^

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-12-14 11:56:17
      +
      +

*Thread Reply:* Can you try disabling the logicalPlan and spark_unknown facets? The only thing I can think of is serializing extremely large logical plans

      + + + +
      + 👍 Paweł Leszczyński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-14 12:32:33
      +
      +

*Thread Reply:* yes, omitting the plan and its serialization can be the first thing to try

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-14 12:33:27
      +
      +

*Thread Reply:* the other one would be to verify that the backend is responding in a reasonable time

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-12-14 13:57:22
      +
      +

*Thread Reply:* @Paweł Leszczyński @Maciej Obuchowski - These 2 facets have already been disabled - but what we see is that the event is still too huge for the OpenLineage Spark listener to realistically process in time.

      + +

For example, we have a job that reads from an S3 directory which has a lot of dirs/files - resulting in 1041 inputs * 8 columns per output 😅

      + +

Is there a max-limit config we can set on OpenLineage to only consider parsing events if the Spark event size is &lt; x MB or the # of inputs+outputs is &lt; y, or something like that?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-12-14 14:06:34
      +
      +

*Thread Reply:* And also, when we disable facets like spark.logicalPlan / schema / columnLineage - does it mean that that part of the event is not read from Spark at all, or is it only dropped while generating/emitting the OL event?

      + +

Basically, if we have a very large Spark event, would disabling facets help, or would it still take a lot of time?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-12-15 12:48:57
      +
      +

      *Thread Reply:* @Paweł Leszczyński @Maciej Obuchowski - WDYT? ^

      + +

      In our use-case, we figured out the scenario that caused this huge event issue. We had a job that did spark.read from <s3://bucket/dir1/dir2/dir3_**/**> (with a file-level wildcard) instead of <s3://bucket/dir1/dir2/dir3_**/> - we were able to remove the wildcard and fix this issue.

      + +

      But I think this is something that should be handled on spark listener side, so that we're not really dependent on the patterns of the spark job code itself 😄

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-18 03:59:46
      +
      +

*Thread Reply:* That's a problem that comes up from time to time. So far, we have never been able to come up with a general solution like a circuit breaker.

      + +

We rather solve the problems after they're identified. In the past we added an option to prevent sending the serialized LogicalPlan, as well as to trim it if it exceeds a certain number of kilobytes.

      + +

What caused the issue here? Was it the event being too large, or the Spark OpenLineage internals trying to resolve ** and causing long-lasting backend calls to S3?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Haneefa tasneem + (haneefa.tasneem.shaik@gmail.com) +
      +
      2023-12-14 06:22:28
      +
      +

Hi. I am trying to run a Spark script through an Airflow DAG, and I am not able to see any lineage information. In the Spark script I take a sample CSV file, create a DataFrame, make some transformations, and then save the result as a CSV file. Please do let me know if there is any way I can see lineage information. Here is my Spark script for reference.
```
from pyspark import SparkContext
from os.path import abspath
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import urllib.request


def execute_spark_script(query_num, output_path):
    # Create a Spark session
    #ol_jars = ['https://repo1.maven.org/maven2/io/openlineage/openlineage-spark/1.5.0/openlineage-spark-1.5.0.jar']
    warehouse_location = abspath('spark-warehouse')
    #files = [urllib.request.urlretrieve(url)[0] for url in ol_jars]
    spark = (SparkSession.builder
             .master('local[*]')
             .appName('openlineage_spark_test')
             #.config('spark.jars', ",".join(files))
             # Install and set up the OpenLineage listener
             .config('spark.jars.packages', 'io.openlineage:openlineage-spark:1.5.0')
             .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
             .config('spark.openlineage.host', 'http://localhost:5000')
             .config('spark.openlineage.namespace', 'airflow')
             .config('spark.openlineage.transport.type', 'console')
             .config('spark.sql.warehouse.dir', warehouse_location)
             .getOrCreate()
             )

    spark.sparkContext.setLogLevel("INFO")

    # DataFrame 1
    data = [[295, "South Bend", "Indiana", "IN", 101190, 112.9]]
    columns = ["rank", "city", "state", "code", "population", "price"]
    df1 = spark.createDataFrame(data, schema="rank LONG, city STRING, state STRING, code STRING, population LONG, price DOUBLE")

    print(f"Count after DataFrame 1: {df1.count()} rows, {len(df1.columns)} columns")

    # Save DataFrame 1 to the desired location
    df1.write.mode("overwrite").csv(output_path + "df1")

    # DataFrame 2
    df2 = (spark.read
           .format("csv")
           .option("header", "true")
           .option("inferSchema", "true")
           .load("/home/haneefa/Downloads/export.csv")
           )

    df2.write.mode("overwrite").csv(output_path + "df2")

    # Returns a DataFrame that combines the rows of df1 and df2
    query_df = df1.union(df2)
    query_df.count()

    # Save the combined DataFrame to the desired location
    query_df.write.mode("overwrite").csv(output_path + "query_df")

    # Register the DataFrame as a temporary SQL table
    query_df.write.saveAsTable("temp_tb1")

    # Query 1: Add a new column derived from existing columns
    query1_df = spark.sql("SELECT *, population / price as population_price_ratio FROM temp_tb1")
    print(f"Count after Query 1: {query1_df.count()} rows, {len(query1_df.columns)} columns")

    # Register the DataFrame as a temporary SQL table
    query1_df.write.saveAsTable("temp_tb2")

    # Save Query 1 result to the desired location
    query1_df.write.mode("overwrite").csv(output_path + "query1_df")

    # Read the saved DataFrame in parquet format
    #parquet_df = spark.read.parquet(output_path + "query1_df")

    spark.stop()


if __name__ == "__main__":
    execute_spark_script(1, "/home/haneefa/airflow/dags/saved_files/")
```

spark-submit --master --name SparkScriptquery1 --deploy-mode client /home/haneefa/airflow/dags/customoperators/samplesqlspark.py

./bin/spark-submit --class "SparkTest" --master local[*] --jars

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Abdallah + (abdallah@terrab.me) +
      +
      2023-12-14 09:28:27
      +
      +

      *Thread Reply:* Do you run your script through a spark-submit?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Haneefa tasneem + (haneefa.tasneem.shaik@gmail.com) +
      +
      2023-12-16 14:12:08
      +
      +

*Thread Reply:* yes I do. Here is my Airflow DAG code.
```
from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2023, 1, 1),
}

dag = DAG(
    'abcsparkdagedit',
    default_args=default_args,
    description='A simple Airflow DAG to run the provided Spark script',
    schedule_interval='@once',
)

spark_task = SparkSubmitOperator(
    task_id='runsparkscript',
    application='/home/haneefa/airflow/dags/customoperators/sparkedit.py',  # path to your Spark script
    name='examplesparkjob',
    conn_id='spark2',  # Spark connection ID configured in Airflow
    jars='/home/haneefa/Downloads/openlineage-spark-1.5.0.jar',
    verbose=False,
    dag=dag,
)
```
please do let me know if I can do anything

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Haneefa tasneem + (haneefa.tasneem.shaik@gmail.com) +
      +
      2023-12-18 02:30:50
      +
      +

*Thread Reply:* Hi. Just checking in. Please do let me know if I can try anything.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Abdallah + (abdallah@terrab.me) +
      +
      2024-01-03 08:16:36
      +
      +

*Thread Reply:* Can you share the driver logs, please?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Abdallah + (abdallah@terrab.me) +
      +
      2024-01-03 08:17:06
      +
      +

      *Thread Reply:* I see that you are mixing different types of transport. +... +.config('spark.openlineage.host', '<http://localhost:5000>') +... +.config('spark.openlineage.transport.type', 'console') +...
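For reference, a sketch with a single transport chosen explicitly, here http so events reach a backend such as Marquez (the URL and app name are placeholders; with transport.type set to console, events only go to the driver logs and the host setting does nothing):
```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("openlineage_spark_test")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:1.5.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # Pick one transport: http sends events to the backend below.
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")
    .config("spark.openlineage.namespace", "airflow")
    .getOrCreate()
)
```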

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Mariusz Górski + (gorskimariusz13@gmail.com) +
      +
      2023-12-15 03:38:06
      +
      +

hey, I have a question re the OpenLineage Spark integration. According to the docs, while configuring the Spark session the following parameters (amongst others) can be passed: +• spark.openlineage.parentJobName +• spark.openlineage.appName +• spark.openlineage.namespace +What I understand from the OL spec is that the first parameter (parentJobName) would go into the facets section of every lineage event, while spark.openlineage.appName would replace the spark.appName portion of the job.name property of the lineage event (and indeed that's what we are observing). spark.openlineage.namespace would materialize as job.namespace in lineage events (that also seems accurate). What I also found in the documentation is that the parent facet is used to materialize a job (like an Airflow DAG) under which a given event is emitted. That was also my expectation for parentJobName, however after configuring this option, the value that I am setting it to is nowhere to be found in the output lineage events. The question then is - is my understanding of this property wrong (and if so, what is the correct one), or is this value not being propagated properly to lineage events?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-15 04:19:51
      +
      +

*Thread Reply:* hi 😉 to sum up: you're setting the parentJobName property and you don't see it anywhere within the content of the OL event. ~Looking at the code, the facet should be attached to the job.~

      + + + +
      + 👍 Mariusz Górski +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-15 06:07:26
      +
      +

      *Thread Reply:* Looking at the code, the facet should be ingested into the event within this method https://github.com/OpenLineage/OpenLineage/blob/203052d663c4cd461ec38787434fc53a57[…]enlineage/spark/agent/lifecycle/OpenLineageRunEventBuilder.java

      + +

      However, I don't see any integration test for this. Feel free to create an issue for this in case parent job name is still missing.

      +
      + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-12-15 13:53:47
      +
      +

*Thread Reply:* +1, I've noticed this too - the namespace & appName from the Spark conf are reflected in the OpenLineage events, while the parentJobName isn't.

      + +

      But as a workaround, we kinda used namespace as a JSON string & provided the parentJobName as a key in this JSON

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Mariusz Górski + (gorskimariusz13@gmail.com) +
      +
      2023-12-16 04:10:49
      +
      +

*Thread Reply:* indeed you can put anything into the namespace, but this is a workaround and we'd like a generic, OL-spec-compliant approach. So far I've checked the tests, and the parent facet is present in some of them (like the BQ write integration test) but not in others, so I'm still not sure why it's sometimes there and sometimes not. Will keep digging.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Mariusz Górski + (gorskimariusz13@gmail.com) +
      +
      2023-12-16 09:23:28
      +
      +

*Thread Reply:* ok I figured it out 🙂 tl;dr when you use parentJobName you need to also define parentRunId (this implicit relationship could be better documented), and the tricky part is that parentRunId needs to be a proper UUID, not just a random string (which was a wrong assumption on my part; again, an improvement to the docs is also required). I will update the docs in the upcoming week based on this discovery. I tested this, and after making the changes described above, the parent facet is visible in Spark OL events 🙂
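Concretely, a sketch of the working combination (config keys as discussed above; the namespace and job names are placeholders, and the parent run id must be a real UUID):
```
import uuid
from pyspark.sql import SparkSession

parent_run_id = str(uuid.uuid4())  # a proper UUID, not an arbitrary string

spark = (
    SparkSession.builder.appName("child-spark-job")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.namespace", "my-namespace")
    .config("spark.openlineage.parentJobName", "my_dag.my_task")
    .config("spark.openlineage.parentRunId", parent_run_id)  # required for the parent facet to appear
    .getOrCreate()
)
```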

      + + + +
      + 🙌 Jakub Dardziński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Mariusz Górski + (gorskimariusz13@gmail.com) +
      +
      2023-12-27 04:11:27
      +
      +

      *Thread Reply:* https://github.com/OpenLineage/docs/pull/268 took a little longer but there it is, I've also changed the proposal for exemplary airflow task metadata so it's aligned with how parent run facet is populated in airflow since OL 1.7.0 🙂

      +
      + + + + + + + +
      +
      Comments
      + 1 +
      + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Simran Suri + (mailsimransuri@gmail.com) +
      +
      2023-12-17 11:33:34
      +
      +

Hi everyone, I have a question about Airflow's Job-to-Job level lineage. I've successfully obtained OpenLineage events for Airflow DAGs and now I'm looking to create inter-DAG dependencies (task-to-task), such as mapping D1.t1->D1.t2->D1.t3->D2.t1->D2.t2->D2.t3. Could you please advise on which fields I should consider to achieve this level of mapping?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-17 15:34:24
      +
      +

      *Thread Reply:* I’m not sure what you’re trying to achieve

      + +

      would you like to set relationships both between Airflow task <-> task and DAG <-> DAG?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Simran Suri + (mailsimransuri@gmail.com) +
      +
      2023-12-17 15:43:12
      +
      +

*Thread Reply:* Yes, I'm trying to set a relationship between both of them. Suppose I have 2 DAGs, D1 and D2, with 2 tasks in each. So the lineage would be D1.t1 -> D1.t2 (TriggerDagRunOperator) -> D2.t1 -> D2.t2
+This way, I'll be able to mark an inter-DAG dependency as well as task dependencies within the same DAG

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-17 15:55:47
      +
      +

*Thread Reply:* in terms of Marquez, the relationship between jobs is established by common input/output datasets
+there's also a parent/child relationship (e.g. between a DAG and a task)
+there's actually no facet that would point directly between two jobs - that would require some sort of customization
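For reference, the parent/child link mentioned above is expressed through the standard parent run facet under the child run's facets; a minimal sketch of its JSON shape (names and UUID hypothetical):
```"parent": {
  "run": {"runId": "6b7f3a1e-9c1d-4a8e-b6d2-4f1c0a9e2d37"},
  "job": {"namespace": "my-namespace", "name": "my_dag"}
}```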

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Simran Suri + (mailsimransuri@gmail.com) +
      +
      2023-12-17 16:12:28
      +
      +

*Thread Reply:* @Jakub Dardziński In terms of Airflow openlineage events, I can see task-level dependencies within the same DAG in the events.
+But where I have inter-DAG dependencies via TriggerDagRunOperator, I can't see that level of information in the lineage events, as mentioned here

      + +

I'm not able to find out what the upstream dependencies of a DAG are.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Parkash Pant + (ppant@tucowsinc.com) +
      +
      2023-12-17 21:24:16
      +
      +

      Hi Everyone, Need help! I am new to OpenLineage and Marquez and I am trying to test it with our local installation of airflow 2.6.3. Both airflow and marquez are running in a separate docker container and I have installed openlineage-airflow integration in airflow and set OPENLINEAGEURL and OPENLINEAGENAMESPACE. However, upon successful running a DAG, airflow is not able to emit OpenLineage event. Below is the error msg. I have also cross-checked that Marquez API is listening at localhost:5000. +2023-12-17 14:20:18 [2023-12-17T20:20:18.489+0000] {{adapter.py:98}} ERROR - Failed to emit OpenLineage event of id f0d30e1a-30cc-3ce5-9cbb-1d3529d5e206 +2023-12-17 14:20:18 Traceback (most recent call last): +2023-12-17 14:20:18 File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn +2023-12-17 14:20:18 conn = connection.create_connection( +2023-12-17 14:20:18 File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/util/connection.py", line 95, in create_connection +2023-12-17 14:20:18 raise err +2023-12-17 14:20:18 File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection +2023-12-17 14:20:18 sock.connect(sa) +2023-12-17 14:20:18 ConnectionRefusedError: [Errno 111] Connection refused +2023-12-17 14:20:18 +2023-12-17 14:20:18 During handling of the above exception, another exception occurred: +2023-12-17 14:20:18 +2023-12-17 14:20:18 Traceback (most recent call last): +2023-12-17 14:20:18 File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 714, in urlopen +2023-12-17 14:20:18 httplib_response = self._make_request( +2023-12-17 14:20:18 File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 415, in _make_request +2023-12-17 14:20:18 conn.request(method, url, ****httplib_request_kw) +2023-12-17 14:20:18 File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connection.py", line 244, in request +2023-12-17 14:20:18 super(HTTPConnection, self).request(method, url, body=body, headers=headers) +2023-12-17 14:20:18 File "/usr/lib/python3.10/http/client.py", line 1282, in request +2023-12-17 14:20:18 self._send_request(method, url, body, headers, encode_chunked) +2023-12-17 14:20:18 File "/usr/lib/python3.10/http/client.py", line 1328, in _send_request +2023-12-17 14:20:18 self.endheaders(body, encode_chunked=encode_chunked) +2023-12-17 14:20:18 File "/usr/lib/python3.10/http/client.py", line 1277, in endheaders +2023-12-17 14:20:18 self._send_output(message_body, encode_chunked=encode_chunked) +2023-12-17 14:20:18 File "/usr/lib/python3.10/http/client.py", line 1037, in _send_output +2023-12-17 14:20:18 self.send(msg) +2023-12-17 14:20:18 File "/usr/lib/python3.10/http/client.py", line 975, in send +2023-12-17 14:20:18 self.connect() +2023-12-17 14:20:18 File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connection.py", line 205, in connect +2023-12-17 14:20:18 conn = self._new_conn() +2023-12-17 14:20:18 File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connection.py", line 186, in _new_conn +2023-12-17 14:20:18 raise NewConnectionError( +2023-12-17 14:20:18 urllib3.exceptions.NewConnectionError: &lt;urllib3.connection.HTTPConnection object at 0x7f3244438a90&gt;: Failed to establish a new connection: [Errno 111] Connection refused +2023-12-17 14:20:18 +2023-12-17 14:20:18 During handling of the above exception, another exception occurred: 
+2023-12-17 14:20:18 +2023-12-17 14:20:18 Traceback (most recent call last): +2023-12-17 14:20:18 File "/usr/local/airflow/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send +2023-12-17 14:20:18 resp = conn.urlopen( +2023-12-17 14:20:18 File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 798, in urlopen +2023-12-17 14:20:18 retries = retries.increment( +2023-12-17 14:20:18 File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment +2023-12-17 14:20:18 raise MaxRetryError(_pool, url, error or ResponseError(cause)) +2023-12-17 14:20:18 urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url: /api/v1/lineage (Caused by NewConnectionError('&lt;urllib3.connection.HTTPConnection object at 0x7f3244438a90&gt;: Failed to establish a new connection: [Errno 111] Connection refused')) +2023-12-17 14:20:18 +2023-12-17 14:20:18 During handling of the above exception, another exception occurred: +2023-12-17 14:20:18 +2023-12-17 14:20:18 Traceback (most recent call last): +2023-12-17 14:20:18 File "/usr/local/airflow/.local/lib/python3.10/site-packages/openlineage/airflow/adapter.py", line 95, in emit +2023-12-17 14:20:18 return self.client.emit(event) +2023-12-17 14:20:18 File "/usr/local/airflow/.local/lib/python3.10/site-packages/openlineage/client/client.py", line 102, in emit +2023-12-17 14:20:18 self.transport.emit(event) +2023-12-17 14:20:18 File "/usr/local/airflow/.local/lib/python3.10/site-packages/openlineage/client/transport/http.py", line 159, in emit +2023-12-17 14:20:18 resp = <a href="http://session.post">session.post</a>( +2023-12-17 14:20:18 File "/usr/local/airflow/.local/lib/python3.10/site-packages/requests/sessions.py", line 637, in post +2023-12-17 14:20:18 return self.request("POST", url, data=data, json=json, ****kwargs) +2023-12-17 14:20:18 File "/usr/local/airflow/.local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request +2023-12-17 14:20:18 resp = self.send(prep, ****send_kwargs) +2023-12-17 14:20:18 File "/usr/local/airflow/.local/lib/python3.10/site-packages/requests/sessions.py", line 703, in send +2023-12-17 14:20:18 r = adapter.send(request, ****kwargs) +2023-12-17 14:20:18 File "/usr/local/airflow/.local/lib/python3.10/site-packages/requests/adapters.py", line 519, in send +2023-12-17 14:20:18 raise ConnectionError(e, request=request) +2023-12-17 14:20:18 requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url: /api/v1/lineage (Caused by NewConnectionError('&lt;urllib3.connection.HTTPConnection object at 0x7f3244438a90&gt;: Failed to establish a new connection: [Errno 111] Connection refused'))

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-18 02:28:33
      +
      +

      *Thread Reply:* try with host.docker.internal instead of localhost
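i.e., assuming the Marquez API is published on port 5000 of the Docker host, something like:
```OPENLINEAGE_URL=http://host.docker.internal:5000```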

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Parkash Pant + (ppant@tucowsinc.com) +
      +
      2023-12-18 11:51:48
      +
      +

      *Thread Reply:* It worked. Thanks!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Ourch + (michael.ourch@qonto.com) +
      +
      2023-12-18 05:02:29
      +
      +

Hey everyone,
+I started experimenting with OpenLineage on my stack (dbt + Snowflake).
+The data lineage works great, but I could not get the column lineage feature working.
+Is it only implemented for Spark at the moment?
+Thanks 🙏

      + + + + +
      + ✅ Michael Ourch +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-18 05:03:54
      +
      +

*Thread Reply:* it works from Spark (with the exception of JDBC) and Airflow SQL-based operators (except BigQuery Operator)

      + + + +
      + 🙏 Michael Ourch +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Ourch + (michael.ourch@qonto.com) +
      +
      2023-12-18 08:20:18
      +
      +

      *Thread Reply:* Thanks @Jakub Dardziński for the clarification 🙏

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Daniel Henneberger + (me@danielhenneberger.com) +
      +
      2023-12-18 16:16:30
      +
      +

Hey y'all, I'm trying out the Flink OpenLineage integration with Marquez. I cloned Marquez, did a ./docker/up.sh, and configured Flink using the yaml. However, when it tries to emit an event, I get:

      + +

      ERROR io.openlineage.flink.client.EventEmitter - Failed to emit OpenLineage event: +io.openlineage.client.OpenLineageClientException: code: 422, response: {"errors":["job.facets.jobType.integration must not be null"]} + at io.openlineage.client.transports.HttpTransport.throwOnHttpError(HttpTransport.java:150) ~[sqrl-cli.jar:0.4.1-SNAPSHOT] + at io.openlineage.client.transports.HttpTransport.emit(HttpTransport.java:124) ~[sqrl-cli.jar:0.4.1-SNAPSHOT] + at io.openlineage.client.transports.HttpTransport.emit(HttpTransport.java:111) ~[sqrl-cli.jar:0.4.1-SNAPSHOT] + at io.openlineage.client.OpenLineageClient.emit(OpenLineageClient.java:46) ~[sqrl-cli.jar:0.4.1-SNAPSHOT] +Is there something else I need to configure?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Daniel Henneberger + (me@danielhenneberger.com) +
      +
      2023-12-18 17:22:39
      +
      +

      I opened an issue: https://github.com/OpenLineage/OpenLineage/issues/2324

      +
      + + + + + + + +
      +
      Comments
      + 1 +
      + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-19 03:07:34
      +
      +

      *Thread Reply:* great finding, sorry for this. +this should help: +https://github.com/OpenLineage/OpenLineage/pull/2325

      +
      + + + + + + + +
      +
      Labels
      + documentation, integration/flink, streaming +
      + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-19 00:19:59
      +
      +

      Hi team, Noticed this OL error for spark 3.4.1: +23/12/15 09:51:35 ERROR PlanUtils: Apply failed: +java.lang.NoSuchMethodError: 'java.lang.String org.apache.spark.sql.execution.datasources.PartitionedFile.filePath()' + at io.openlineage.spark.agent.util.PlanUtils.lambda$null$4(PlanUtils.java:241) + at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) + at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) + at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658) + at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274) + at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133) + at scala.collection.convert.Wrappers$IteratorWrapper.forEachRemaining(Wrappers.scala:31) + at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) + at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) + at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) + at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) + at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) + at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) + at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497) + at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274) + at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655) + at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) + at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) + at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) + at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) + at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) + at io.openlineage.spark.agent.util.PlanUtils.findRDDPaths(PlanUtils.java:248) + at io.openlineage.spark.agent.lifecycle.plan.AbstractRDDNodeVisitor.findInputDatasets(AbstractRDDNodeVisitor.java:42) + at io.openlineage.spark.agent.lifecycle.plan.SqlExecutionRDDVisitor.apply(SqlExecutionRDDVisitor.java:43) + at io.openlineage.spark.agent.lifecycle.plan.SqlExecutionRDDVisitor.apply(SqlExecutionRDDVisitor.java:22) + at io.openlineage.spark.agent.util.PlanUtils$1.lambda$apply$2(PlanUtils.java:99) + at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) + at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177) + at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) + at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) + at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) + at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) + at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) + at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) + at io.openlineage.spark.agent.util.PlanUtils$1.apply(PlanUtils.java:115) + at io.openlineage.spark.agent.util.PlanUtils$1.apply(PlanUtils.java:79) + at scala.PartialFunction.applyOrElse(PartialFunction.scala:127) + at 
scala.PartialFunction.applyOrElse$(PartialFunction.scala:126) + at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:30) + at scala.PartialFunction$AndThen.applyOrElse(PartialFunction.scala:194) + at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.lambda$visitLogicalPlan$14(OpenLineageRunEventBuilder.java:400) + at io.openlineage.spark.agent.util.ScalaConversionUtils$3.apply(ScalaConversionUtils.java:131) + at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$map$1(TreeNode.scala:305) + at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$map$1$adapted(TreeNode.scala:305) + at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:285) + at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1(TreeNode.scala:286) + at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1$adapted(TreeNode.scala:286) + at scala.collection.Iterator.foreach(Iterator.scala:943) + at scala.collection.Iterator.foreach$(Iterator.scala:943) + at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) + at scala.collection.IterableLike.foreach(IterableLike.scala:74) + at scala.collection.IterableLike.foreach$(IterableLike.scala:73) + at scala.collection.AbstractIterable.foreach(Iterable.scala:56) + at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:286) + at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1(TreeNode.scala:286) + at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1$adapted(TreeNode.scala:286) + at scala.collection.Iterator.foreach(Iterator.scala:943) + at scala.collection.Iterator.foreach$(Iterator.scala:943) + at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) + at scala.collection.IterableLike.foreach(IterableLike.scala:74) + at scala.collection.IterableLike.foreach$(IterableLike.scala:73) + at scala.collection.AbstractIterable.foreach(Iterable.scala:56) + at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:286) + at org.apache.spark.sql.catalyst.trees.TreeNode.map(TreeNode.scala:305) + at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.lambda$buildInputDatasets$6(OpenLineageRunEventBuilder.java:351) + at java.base/java.util.Optional.map(Optional.java:265) + at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildInputDatasets(OpenLineageRunEventBuilder.java:349) + at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.populateRun(OpenLineageRunEventBuilder.java:305) + at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:289) + at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:241) + at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.end(SparkSQLExecutionContext.java:95) + at io.openlineage.spark.agent.OpenLineageSparkListener.sparkSQLExecEnd(OpenLineageSparkListener.java:98) + at io.openlineage.spark.agent.OpenLineageSparkListener.onOtherEvent(OpenLineageSparkListener.java:84) + at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100) + at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28) + at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) + at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) + at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117) + at 
org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101) + at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105) + at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105) + at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) + at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) + at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100) + at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96) + at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1471) + at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96) +OL version 0.30.1. May I check if this has already been reported/been fixed in the later releases of OL? Thanks!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-19 03:16:08
      + +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-19 22:35:34
      +
      +

      *Thread Reply:* Got it thanks!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-19 00:22:03
      +
      +

      Also, just checking, is there a way to set the log level for OL separately for spark? Or does it always use the underlying spark context log level?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-19 03:17:21
      +
      +

      *Thread Reply:* this is how we set it in tests -> https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/test/resources/log4j.properties
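The gist of the linked file is a dedicated logger entry for the io.openlineage package; a minimal sketch, assuming a log4j 1.x-style properties file like the one used in the tests:
```log4j.logger.io.openlineage=DEBUG```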

      +
      + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-19 22:35:16
      +
      +

      *Thread Reply:* Ahh I see... it goes in via log4j properties. Is there any plan to make this configurable via simpler means? Say env variable or using dedicated spark configs?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-20 02:27:06
      +
      +

*Thread Reply:* We haven't planned anything so far. But feel free to create an issue and justify why this is important. Perhaps more people share the same feeling about it.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-20 02:37:45
      +
      +

      *Thread Reply:* Sure, I'll do that then. Thanks! 🙂

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-19 04:45:05
      +
      +

hi, does someone use the OpenLineage solution to get metadata from Airflow?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-19 05:06:24
      +
      +

      *Thread Reply:* hey Zacay, do you have any issue with using Airflow integration?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 05:42:30
      +
      +

*Thread Reply:* Do I need to install openlineage or only configure it?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 05:43:03
      +
      +

*Thread Reply:* I installed Marquez
+and pointed the Airflow cfg at marquez:5000

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 05:43:12
      +
      +

      *Thread Reply:* what version of Airflow are you using?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 05:43:55
      +
      +

      *Thread Reply:* Version: v2.8.0

      +
      +
      PyPI
      + + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 05:44:19
      +
      +

      *Thread Reply:* 2.8.0

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 05:46:09
      +
      +

      *Thread Reply:* you should follow this guide then +https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html

      + +

like any other Airflow provider you need to install it, e.g. with pip install apache-airflow-providers-openlineage

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 05:47:07
      +
      +

*Thread Reply:* and are these variables kept in airflow.cfg or in .env?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 05:50:12
      +
      +

*Thread Reply:* these are Airflow config variables, so you can either set them in airflow.cfg or use environment variables as described here https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html#envvar-AIRFLOW__-SECTION-__-KEY
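e.g., following the AIRFLOW__{SECTION}__{KEY} convention, the openlineage section could be set like this sketch (values hypothetical):
```AIRFLOW__OPENLINEAGE__TRANSPORT='{"type": "http", "url": "http://marquez:5000"}'
AIRFLOW__OPENLINEAGE__NAMESPACE=my-namespace```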

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 05:54:21
      +
      +

*Thread Reply:* I created a DAG that creates a table and inserts one line into it
+then in the airflow.cfg:
+[openlineage]
+transport = '{"type": "http", "url": "http://10.0.19.7:5000"}'
+namespace='airflow'

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 05:54:38
      +
      +

*Thread Reply:* and on 10.0.19.7 there is the Marquez URL on port 300

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 05:54:51
      +
      +

*Thread Reply:* I run the DAG but see no lineage

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 05:54:58
      +
      +

      *Thread Reply:* is there any log to get?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 05:57:16
      +
      +

      *Thread Reply:* there would be if you enable debug logs

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:00:28
      +
      +

      *Thread Reply:* in the Airflow.cfg?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 07:14:07
      +
      +

      *Thread Reply:* Yes. I’m assuming you don’t see anything yet in logs without enabling debug level?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:15:05
      +
      +

*Thread Reply:* I can see the Airflow logs

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:15:17
      +
      +

*Thread Reply:* but nothing related to the lineage

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 07:17:04
      +
      +

      *Thread Reply:* in Admin > Plugins can you see whether you have OpenLineageProviderPlugin and if so, are there listeners?

      + +
      + + + + + + + + + +
      + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:19:02
      +
      +

*Thread Reply:* there is an OpenLineageProviderPlugin
+but no listeners

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 07:20:52
      +
      +

*Thread Reply:* listeners are disabled following this logic:
+def _is_disabled() -> bool:
+    return (
+        conf.getboolean("openlineage", "disabled", fallback=False)
+        or os.getenv("OPENLINEAGE_DISABLED", "false").lower() == "true"
+        or (
+            conf.get("openlineage", "transport", fallback="") == ""
+            and conf.get("openlineage", "config_path", fallback="") == ""
+            and os.getenv("OPENLINEAGE_URL", "") == ""
+            and os.getenv("OPENLINEAGE_CONFIG", "") == ""
+        )
+    )
+so maybe your config is not loaded properly?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:25:38
      +
      +

*Thread Reply:* Don't

      + +
      + + + + + + + +
      + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:25:44
      +
      +

      *Thread Reply:* here is the config

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:25:56
      +
      +

      *Thread Reply:*

      + +
      + + + + + + + +
      + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:25:59
      +
      +

      *Thread Reply:* here is also a .env

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 07:27:25
      +
      +

*Thread Reply:* you have transport and namespace twice under the openlineage section; the second ones are empty

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 07:32:23
      +
      +

*Thread Reply:* apparently the .env is not taken into account; not sure where the file is or what your deployment is

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 07:32:44
      +
      +

      *Thread Reply:* also, AIRFLOW__LINEAGE__BACKEND=openlineage.lineage_backend.OpenLineageBackend is needed only for Airflow <2.3

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:34:54
      +
      +

*Thread Reply:* ok
+so now I have an airflow.cfg with
+[openlineage]
+transport = '{"type": "http", "url": "http://10.0.19.7:5000"}'
+namespace='my-namespace'
+But still, when I run I see no lineage

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 07:41:15
      +
      +

*Thread Reply:* can you verify and confirm that the changes in your Airflow config are applied? I don't see any other reason, and I also can't tell what your deployment is

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:43:32
      +
      +

*Thread Reply:* Can you send me an example of an airflow.cfg that works and I will try to compare

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 07:44:34
      +
      +

*Thread Reply:* the one you sent seems ok, it may be a matter of how you configure Airflow to read it, i.e. where you put the changed config file

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:45:16
      +
      +

*Thread Reply:* it is in the same place as the logs and dags directories

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 07:47:22
      +
      +

      *Thread Reply:* okay, let’s try this

      + +

please temporarily change
+expose_config = False
+to
+expose_config = True
+and check whether you can see the config in the UI under Admin > Configuration

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:49:21
      +
      +

*Thread Reply:* Do I need to stop and start the docker?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 07:50:23
      +
      +

      *Thread Reply:* yes

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:53:41
      +
      +

*Thread Reply:* I changed it but see no change in the UI under Admin -> Configuration

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:54:11
      +
      +

*Thread Reply:* I wonder if the cfg file affects Airflow at all

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:54:29
      +
      +

*Thread Reply:* maybe it is related to the docker-compose.yaml?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 07:56:44
      +
      +

*Thread Reply:* again, I don't know how you deploy your Airflow instance
+if you need more help with that you might ask in the Airflow Slack or learn more from the Airflow docs (which are pretty good imho)

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 09:16:42
      +
      +

*Thread Reply:* It seems that the config is on one of the containers

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 09:16:47
      +
      +

*Thread Reply:* so I got inside

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 09:17:05
      +
      +

*Thread Reply:* but I think that if I stop and start, it doesn't save the change

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 09:21:53
      +
      +

*Thread Reply:* I would use env vars to override the Airflow config entries (I passed the link above) and set them in the compose file for all airflow containers - see the sketch below
+I'm assuming you're using the Airflow-provided docker compose yaml
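e.g., a sketch of what that could look like, assuming the stock Airflow compose file with its shared x-airflow-common environment block:
```x-airflow-common:
  environment:
    AIRFLOW__OPENLINEAGE__TRANSPORT: '{"type": "http", "url": "http://10.0.19.7:5000"}'
    AIRFLOW__OPENLINEAGE__NAMESPACE: my-namespace```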

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 09:22:25
      +
      +

*Thread Reply:* right, I use the docker-compose yaml

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 09:28:23
      +
      +

*Thread Reply:* so you mean to create an .env file in the location of the yaml and there put
+AIRFLOW__OPENLINEAGE__DISABLED=False

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 09:30:20
      +
      +

*Thread Reply:* no, to put this into the yaml file

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 09:30:37
      +
      +

      *Thread Reply:* OK

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 09:32:51
      +
      +

*Thread Reply:* I put in the yaml
+ OPENLINEAGE_DISABLED=false
+ OPENLINEAGE_URL=http://10.0.19.7:5000
+ AIRFLOW__OPENLINEAGE__NAMESPACE=food_delivery
+and I will stop and start and let's see how it goes

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2023-12-20 14:04:37
      +
      +

*Thread Reply:* out of curiosity, do you use the Astronomer bootstrap solution to spin up airflow with openlineage?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-19 04:45:13
      +
      +

I used this post
+https://openlineage.io/docs/guides/airflow_proxy/

      +
      +
      openlineage.io
      + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-12-19 08:49:49
      +
      +

      *Thread Reply:* For newest Airflow provider documentation, please look at https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/index.html

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-20 02:37:02
      +
      +

      Hi team, got this error with OL 1.6.2 on dbr aws: +23/12/20 07:10:18 ERROR DriverDaemon$: XXX Fatal uncaught exception. Terminating driver. +java.lang.IllegalStateException: LiveListenerBus is stopped. + at org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:109) + at org.apache.spark.scheduler.LiveListenerBus.addToSharedQueue(LiveListenerBus.scala:66) + at org.apache.spark.sql.QueryProfileListener$.initialize(QueryProfileListener.scala:122) + at com.databricks.backend.daemon.driver.DatabricksILoop$.$anonfun$executeDependedOperations$1(DatabricksILoop.scala:652) + at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) + at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1709) + at com.databricks.unity.UCSUniverseHelper$.withNewScope(UCSUniverseHelper.scala:8) + at com.databricks.backend.daemon.driver.DatabricksILoop$.executeDependedOperations(DatabricksILoop.scala:580) + at com.databricks.backend.daemon.driver.DatabricksILoop$.initializeSharedDriverContext(DatabricksILoop.scala:448) + at com.databricks.backend.daemon.driver.DatabricksILoop$.getOrCreateSharedDriverContext(DatabricksILoop.scala:294) + at com.databricks.backend.daemon.driver.DriverCorral.driverContext(DriverCorral.scala:292) + at com.databricks.backend.daemon.driver.DriverCorral.&lt;init&gt;(DriverCorral.scala:159) + at com.databricks.backend.daemon.driver.DriverDaemon.&lt;init&gt;(DriverDaemon.scala:71) + at com.databricks.backend.daemon.driver.DriverDaemon$.create(DriverDaemon.scala:452) + at com.databricks.backend.daemon.driver.DriverDaemon$.initialize(DriverDaemon.scala:546) + at com.databricks.backend.daemon.driver.DriverDaemon$.wrappedMain(DriverDaemon.scala:511) + at com.databricks.DatabricksMain.$anonfun$main$1(DatabricksMain.scala:149) + at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) + at com.databricks.DatabricksMain.$anonfun$withStartupProfilingData$1(DatabricksMain.scala:498) + at com.databricks.logging.UsageLogging.$anonfun$recordOperation$1(UsageLogging.scala:571) + at com.databricks.logging.UsageLogging.executeThunkAndCaptureResultTags$1(UsageLogging.scala:666) + at com.databricks.logging.UsageLogging.$anonfun$recordOperationWithResultTags$4(UsageLogging.scala:684) + at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:426) + at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) + at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:196) + at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:424) + at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:418) + at com.databricks.DatabricksMain.withAttributionContext(DatabricksMain.scala:91) + at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:470) + at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:455) + at com.databricks.DatabricksMain.withAttributionTags(DatabricksMain.scala:91) + at com.databricks.logging.UsageLogging.recordOperationWithResultTags(UsageLogging.scala:661) + at com.databricks.logging.UsageLogging.recordOperationWithResultTags$(UsageLogging.scala:580) + at com.databricks.DatabricksMain.recordOperationWithResultTags(DatabricksMain.scala:91) + at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:571) + at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:540) + at 
com.databricks.DatabricksMain.recordOperation(DatabricksMain.scala:91) + at com.databricks.DatabricksMain.withStartupProfilingData(DatabricksMain.scala:498) + at com.databricks.DatabricksMain.main(DatabricksMain.scala:148) + at com.databricks.backend.daemon.driver.DriverDaemon.main(DriverDaemon.scala) +Dbr spark 3.4, jre 1.8 +Is spark 3.4 not supported for OL as of yet?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-20 02:37:29
      +
      +

      *Thread Reply:* OL 1.3.1 works fine btw... This error only pops up with OL 1.6.2

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-20 03:00:26
      +
      +

      *Thread Reply:* Error happens when starting the cluster itself

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-20 04:56:09
      +
      +

*Thread Reply:* spark 3.4 yes, but databricks runtime 14.x was not tested so far (edited)

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-20 09:00:12
      +
      +

*Thread Reply:* I've just tested the existing databricks integration on the latest dbr 14.2 and spark 3.5. The databricks integration tests we have are passing. So please share more details on how you end up with logs like the above.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-21 01:59:54
      +
      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2328 -> link to tests being run

      +
      + + + + + + + +
      +
      Labels
      + integration/spark, ci, tests +
      + +
      +
      Comments
      + 1 +
      + + + + + + + + + + +
      + + + +
      + 👀 Anirudh Shrinivason +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-29 03:12:16
      +
      +

*Thread Reply:* Hi @Paweł Leszczyński sorry, I missed this earlier... Actually, I got this while trying to simply start the dbr cluster with the OL spark configs. Nothing else was done from my end. It works with OL 1.3.1 but not with 1.6.2

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-29 03:14:03
      +
      +

*Thread Reply:* there was an issue with 1.6.2 related to a logging class on the classpath which may be responsible for this, but it got solved in the most recent release.

      + + + +
      + 👍 Anirudh Shrinivason +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 05:39:45
      +
      +

hi
+does someone use Airflow lineage?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 05:40:58
      +
      +

      *Thread Reply:* hey Zacay, I’ve already tried to help you here: https://openlineage.slack.com/archives/C01CK9T7HKR/p1702980384927369?thread_ts=1702979105.626809&cid=C01CK9T7HKR

      +
      + + +
      + + + } + + Jakub Dardziński + (https://openlineage.slack.com/team/U02S6F54MAB) +
      + + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 08:57:07
      +
      +

      *Thread Reply:* thanks for the help

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 08:57:26
      +
      +

*Thread Reply:* did you manage to make it work?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2023-12-20 14:04:58
      +
      +

*Thread Reply:* out of curiosity, do you use the Astronomer bootstrap solution to spin up airflow with openlineage?

      + + + +
      +
      +
      +
      + + + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2023-12-20 14:05:55
      +
      +

Hi Everyone
+I was trying to add an extra facet at the job level but I am not able to do that.
+To explain more - assume that d1 (input) and d2 (output) are the databases, and a python file with classes and functions converts d1 to d2. Currently I am extracting some info about the python file using an external python parser and want to add that info to the jobs sitting between d1 and d2. I tried adding
+```custom_facets = {
+        "customKey": "customValue",
+        "anotherKey": "anotherValue"
+        # Add more custom facets as needed
+    }

      + +
              # Creating a Job with custom facets
      +        job = Job(
      +            namespace=file_name,
      +            name=single_fn_info.name,
      +            facets=custom_facets  # Include the custom facets here
      +        )```
      +
      + +

but it is not working. I did look into this documentation https://openlineage.io/docs/spec/facets/custom-facets/ and tried, but it's still not showing any facet in the UI.
+Currently, the facets section for every job only shows a root dict which is blank, and nothing else. So what am I missing - how can we implement this?
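For what it's worth, the linked custom-facets docs model facets as typed objects rather than plain string values; with the openlineage-python client that usually means a BaseFacet subclass, roughly like this sketch (class and field names are hypothetical):
```python
import attr
from openlineage.client.facet import BaseFacet
from openlineage.client.run import Job

@attr.s
class PythonSourceJobFacet(BaseFacet):
    # hypothetical fields produced by the external python parser
    functionName: str = attr.ib()
    sourceFile: str = attr.ib()

# attach the custom facet under a chosen key, as the built-in facets do
job = Job(
    namespace="my-namespace",  # hypothetical values
    name="my_function",
    facets={"pythonSource": PythonSourceJobFacet("my_function", "etl.py")},
)
```
Plain strings placed directly under facets won't serialize as facet objects, which may explain the blank root dict seen in the UI.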

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-21 05:12:26
      +
      +

@Michael Robinson can we vote for the OL release? #2319 brings a significant fix to Spark logging and I think it's worth releasing without waiting for the release cycle.

      + + + +
      + ➕ Paweł Leszczyński, Jakub Dardziński, Maciej Obuchowski, Rodrigo Maia, Michael Robinson, harsh loomba +
      + +
      + 👍 Michael Robinson +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-12-21 09:55:59
      +
      +

      *Thread Reply:* Thanks, @Paweł Leszczyński, the release is authorized and will be initiated as soon as possible within 2 business days.

      + + + +
      + 🙌 Paweł Leszczyński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-12-21 10:54:23
      +
      +

      *Thread Reply:* Changelog PR: https://github.com/OpenLineage/OpenLineage/pull/2331

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2023-12-21 13:00:32
      +
      +

*Thread Reply:* @Jakub Dardziński any progress on this one https://github.com/apache/airflow/pull/35794 - wondering if this could have been part of the release

      +
      + + + + + + + +
      +
      Labels
      + provider:amazon-aws, area:providers, provider:openlineage +
      + +
      +
      Comments
      + 3 +
      + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-21 13:02:36
      +
      +

      *Thread Reply:* I'm dependent on Airflow committers, pinging in the PR from time to time

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-21 13:02:57
      +
      +

      *Thread Reply:* if you wish to comment that you need the PR it might be taken into consideration as well

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2023-12-21 13:13:47
      +
      +

*Thread Reply:* wait, this will be supported by the openlineage-airflow package as well, right? I see you have made changes in the airflow provider package but I don't see changes in the standalone openlineage repo 🤔

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-21 13:15:22
      +
      +

*Thread Reply:* We aim to make as few changes as possible to the openlineage-airflow package to encourage users to use the provider one. This change would be quite big and probably won't be backported.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2023-12-21 13:17:54
      +
      +

*Thread Reply:* We have not yet moved to the provider packages because our team is still making a decision. And the move to the provider package would require me to change a lot, so I would rather have this feature in the standalone package.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Hitesh + (splicer9904@gmail.com) +
      +
      2023-12-21 05:25:29
      +
      +

      Hi Team +I am trying to send openlineage events to a Kafka eventhub and for that, I am passing spark.openlineage.transport.properties.sasl.jaas.config as org.apache.kafka.common.security.plain.PlainLoginModule required username=\"connection_string\" password=\"connection_string\";

      + +

      when I run the job, I get the error Value not specified for key 'username' in JAAS config

      + +

I am getting the connection string from the Kafka eventhub properties and I'm running openlineage-spark-1.4.1

      + +

      Can someone please help me out?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-12-21 09:20:53
      +
      +

      *Thread Reply:* I don't think we can parse those non-escaped spaces as regular properties

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-12-21 09:23:01
      +
      +

      *Thread Reply:* can you try something like +"org.apache.kafka.common.security.plain.PlainLoginModule required username='connection_string' password='connection_string'"
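i.e., the full set of conf entries would look roughly like this sketch (connection strings redacted; for an Azure Event Hubs Kafka endpoint the username is typically the literal $ConnectionString):
```spark.openlineage.transport.type=kafka
spark.openlineage.transport.properties.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username='$ConnectionString' password='Endpoint=sb://...';```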

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Abdallah + (abdallah@terrab.me) +
      +
      2023-12-21 11:14:44
      +
      +

      Hello,

      + +

      I hope you are doing well.

      + +

      I am wondering if any of you had the same issue before.

      + +

Thank you.
23/12/21 16:01:26 WARN DatabricksEnvironmentFacetBuilder: Failed to load dbutils in OpenLineageListener:
java.util.NoSuchElementException: None.get
    at scala.None$.get(Option.scala:529)
    at scala.None$.get(Option.scala:527)
    at com.databricks.dbutils_v1.impl.DbfsUtilsImpl._driverContext$lzycompute(DbfsUtilsImpl.scala:27)
    at com.databricks.dbutils_v1.impl.DbfsUtilsImpl._driverContext(DbfsUtilsImpl.scala:26)
    at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.$anonfun$driverContext$1(DbfsUtilsImpl.scala:29)
    at com.databricks.dbutils_v1.impl.DBUtilsV1Utils$.checkLocalDriver(DBUtilsV1Impl.scala:61)
    at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.driverContext(DbfsUtilsImpl.scala:29)
    at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.sc(DbfsUtilsImpl.scala:30)
    at com.databricks.dbutils_v1.impl.DbfsUtilsImpl._core$lzycompute(DbfsUtilsImpl.scala:32)
    at com.databricks.dbutils_v1.impl.DbfsUtilsImpl._core(DbfsUtilsImpl.scala:32)
    at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.$anonfun$core$1(DbfsUtilsImpl.scala:34)
    at com.databricks.dbutils_v1.impl.DBUtilsV1Utils$.checkLocalDriver(DBUtilsV1Impl.scala:61)
    at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.core(DbfsUtilsImpl.scala:34)
    at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.mounts(DbfsUtilsImpl.scala:166)
    at io.openlineage.spark.agent.facets.builder.DatabricksEnvironmentFacetBuilder.getDatabricksMountpoints(DatabricksEnvironmentFacetBuilder.java:142)
    at io.openlineage.spark.agent.facets.builder.DatabricksEnvironmentFacetBuilder.getDatabricksEnvironmentalAttributes(DatabricksEnvironmentFacetBuilder.java:98)
    at io.openlineage.spark.agent.facets.builder.DatabricksEnvironmentFacetBuilder.build(DatabricksEnvironmentFacetBuilder.java:60)
    at io.openlineage.spark.agent.facets.builder.DatabricksEnvironmentFacetBuilder.build(DatabricksEnvironmentFacetBuilder.java:32)
    at io.openlineage.spark.api.CustomFacetBuilder.accept(CustomFacetBuilder.java:40)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.lambda$null$27(OpenLineageRunEventBuilder.java:491)
    at java.lang.Iterable.forEach(Iterable.java:75)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.lambda$buildRunFacets$28(OpenLineageRunEventBuilder.java:491)
    at java.util.ArrayList.forEach(ArrayList.java:1259)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRunFacets(OpenLineageRunEventBuilder.java:491)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.populateRun(OpenLineageRunEventBuilder.java:313)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:289)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:250)
    at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.start(SparkSQLExecutionContext.java:167)
    at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$onJobStart$10(OpenLineageSparkListener.java:151)
    at java.util.Optional.ifPresent(Optional.java:159)
    at io.openlineage.spark.agent.OpenLineageSparkListener.onJobStart(OpenLineageSparkListener.java:147)
    at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
    at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:39)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:39)
    at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:118)
    at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:102)
    at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:107)
    at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:107)
    at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:102)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:98)
    at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1639)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:98)

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-12-21 12:01:59
      +
      +

      *Thread Reply:* Can you provide additional details: your versions of the OpenLineage integration, Spark, and Databricks; how you're running your job; and additional logs?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Abdallah + (abdallah@terrab.me) +
      +
      2023-12-21 12:02:42
      +
      +

      *Thread Reply:* Version of OL 1.2.2

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Abdallah + (abdallah@terrab.me) +
      +
      2023-12-21 12:03:07
      +
      +

      *Thread Reply:* spark_version: 11.3.x-scala2.12

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Abdallah + (abdallah@terrab.me) +
      +
      2023-12-21 12:03:32
      +
      +

      *Thread Reply:* Running job through spark-submit

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-12-22 06:03:06
      +
      +

      *Thread Reply:* @Abdallah could you try upgrading to 1.7.0, which was released recently?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-12-21 13:08:08
      +
      +

      @channel +We released OpenLineage 1.7.0! +In this release, we turned off support for Airflow versions >=2.8.0 in the Airflow integration and added a parent run facet to COMPLETE and FAIL events, also in Airflow. If you’re on the most recent release of Airflow and wish to continue receiving events from Airflow after upgrading, use the OpenLineage Airflow Provider instead.

      + +

      Added +• Airflow: add parent run facet to COMPLETE and FAIL events in Airflow integration #2320 @kacpermuda + Adds a parent run facet to all events in the Airflow integration.

      + +

      Removed +• Airflow: remove Airflow 2.8+ support #2330 @kacpermuda + To encourage use of the Provider, this removes the listener from the plugin if the Airflow version is >=2.8.0.

      + +

      A number of bug fixes were released as well, including: +• Airflow: repair up.sh for MacOS #2316 #2318 @kacpermuda + Some scripts were not working well on MacOS. This adjusts them. +• Airflow: repair run_id for FAIL event in Airflow 2.6+ #2305 @kacpermuda + The Run_id in a FAIL event was different than in the START event for Airflow 2.6+. +• Flink: name Kafka datasets according to the naming convention #2321 @pawel-big-lebowski + Adds a kafka:// prefix to Kafka topic datasets’ namespaces. +• Spec: fix inconsistency with Redshift authority format #2315 @davidjgoss + Amends the Authority format for consistency with other references in the same section.

      + +

      Thanks to all the contributors, including new contributor @Kacper Muda! +Release: https://github.com/OpenLineage/OpenLineage/releases/tag/1.7.0 +Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md +Commit history: https://github.com/OpenLineage/OpenLineage/compare/1.6.2...1.7.0 +Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage +PyPI: https://pypi.org/project/openlineage-python/

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-21 13:10:40
      +
      +

      *Thread Reply:* Shoutout to @Kacper Muda for a huge contribution from the very start 🚀

      + + + +
      + :gratitude_thank_you: Kacper Muda, Sheeri Cabral (Collibra) +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-12-21 13:11:22
      +
      +

      *Thread Reply:* +1 on that! I think there were even more changes from Kacper than are listed here

      + + + +
      + :gratitude_thank_you: Kacper Muda +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Kacper Muda + (kacper.muda@getindata.com) +
      +
      2023-12-21 14:43:38
      +
      +

      *Thread Reply:* Thanks! Just a quick note: the listener API was introduced in Airflow 2.3, so it was already missing in the plugin for Airflow < 2.3 - I made no changes there. I just removed it for >= 2.8.0, to encourage use of the provider, as You said 🙂 But the result is exactly as You said: there is no listener in <2.3 and >=2.8 😄

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Kacper Muda + (kacper.muda@getindata.com) +
      +
      2023-12-21 14:45:57
      +
      +

      *Thread Reply:* So, as a result: I don't think we turned off support for Airflow versions <2.3.0 🙂

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-12-21 14:46:48
      +
      +

      *Thread Reply:* This is good to know — thanks. I’ve updated the notes here and will do so elsewhere, as well.

      + + + +
      + 🙌 Kacper Muda, Jakub Dardziński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2023-12-30 14:33:21
      +
      +

      *Thread Reply:* Hi @Jakub Dardziński, before, for a custom operator we used to write a custom extractor file and save it with the other extractors under the Airflow integration folder.

      + +

      What is the procedure now with this new provider update?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Kacper Muda + (kacper.muda@getindata.com) +
      +
      2024-01-01 05:08:40
      +
      +

      *Thread Reply:* Hey @Shahid Shaikh, to my understanding nothing has changed regarding Custom Extractors; You can still use them if You wish (see the bottom of these docs for how they can be registered in the provider package). However, in my opinion, the best way to use the provider package is to implement OpenLineage methods directly in the operators, as described here. Let me know if this answers Your question.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2024-01-04 04:46:34
      +
      +

      *Thread Reply:* Yes, thanks @Kacper Muda. I referred to the docs and was able to do the work as you said by using the provider package directly.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2023-12-28 00:46:57
      +
      +

      Hi Everyone
I was looking into the Airflow integration with Marquez. I ran a number of DAGs and observed that, for a particular task in every DAG, it makes a new job on Marquez, and we are not able to see any job-to-job linkage on the Marquez map. Why is it not showing? How can we add this feature?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2023-12-28 05:57:49
      +
      +

      Hey All!
I've tested merge operations for Spark (on Databricks) and the OpenLineage result. For the moment I have failed to produce any inputs/outputs for the jobs, regardless of the data schema being present in the logical plan attribute of the JSON OL event.

      + +

      My test consisted of (a sketch follows below):
• Reading from a parquet/CSV file in DBFS (Databricks file storage)
• Creating a temporary table
• Performing the merge with spark.sql("merge... target...source ") with the target being a table in Hive.
Test variables:
• Source: Parquet/CSV
• Target: Hive Table
• OL x Spark versions:
  ◦ OL 1.6.2 -> Spark 3.3.2
  ◦ OL 1.6.2 -> Spark 3.2.1
  ◦ OL 1.0.0 -> Spark 3.2.1
I've created a PDF with some code samples and OL input and output attributes.
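      A minimal sketch of that merge scenario, with illustrative paths and table names (not the exact code from the PDF):
```
# Hypothetical reproduction of the merge test; adjust paths/tables to your workspace.
df = (
    spark.read
    .option("header", "true")
    .csv("dbfs:/FileStore/source_data.csv")  # source file in DBFS
)
df.createOrReplaceTempView("source_view")  # the temporary table

# MERGE with a Hive-registered (Delta) target table
spark.sql("""
    MERGE INTO target_db.target_table AS target
    USING source_view AS source
    ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```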

      + + +
      + + + + + + + +
      + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2023-12-28 05:59:53
      +
      +

      *Thread Reply:* @Sai @Anirudh Shrinivason Did you have any luck with the merge events?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2023-12-28 06:00:51
      +
      +

      *Thread Reply:* @Paweł Leszczyński I know you were working on this in the past. Am I missing something here?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2023-12-28 06:02:23
      +
      +

      *Thread Reply:* Should I test with more recent versions of Spark and the latest OL?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-28 06:31:04
      +
      +

      *Thread Reply:* Are you able to verify whether the issue is Databricks-runtime-specific or whether the same happens on vanilla Spark Docker containers?

      + +

      We do have an integration test for merge into and delta tables here -> https://github.com/OpenLineage/OpenLineage/blob/17fa874083850b711f364c4de656e14d79[…]/java/io/openlineage/spark/agent/SparkDeltaIntegrationTest.java

      + +

      would this test fail on databricks as well?

      +
      + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2023-12-28 08:30:56
      +
      +

      *Thread Reply:* the test on databricks fails to generate inputs and outputs for the merge operation

      + +

      "job": { + "namespace": "default", + "name": "dbc-a9e3696a-291f_cloud_databricks_com_execute_merge_into_command_edge", + "facets": {} + }, + "inputs": [], + "outputs": []

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2023-12-28 08:32:11
      +
      +

      *Thread Reply:* The create view also fails to generate inputs and outputs (but I don't know if this one is supposed to):
      "job": {
          "namespace": "default",
          "name": "dbc-a9e3696a-291f_cloud_databricks_com_execute_create_view_command",
          "facets": {}
      },
      "inputs": [],
      "outputs": []

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + + +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-28 23:00:10
      +
      +

      *Thread Reply:* Hey @Rodrigo Maia I'm using OL version 1.3.1, and it seems to work for most merge into cases, though not all...

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-29 02:36:14
      +
      +

      *Thread Reply:* @Rodrigo Maia will try to add the existing testMergeInto to be run on Databricks and see why it is failing

      + + + +
      + 🙌 Rodrigo Maia +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-02 08:39:50
      +
      +

      *Thread Reply:* looking into this and hopefully will have a solution this week

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-03 06:27:09
      +
      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2348

      +
      + + + + + + + +
      +
      Labels
      + documentation, integration/spark +
      + +
      +
      Comments
      + 1 +
      + + + + + + + + + + +
      + + + +
      + ❤️ Rodrigo Maia +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Sheeri Cabral (Collibra) + (sheeri.cabral@collibra.com) +
      +
      2024-01-02 10:47:11
      +
      +

      Hello! My company has a job opening for a Senior Data Engineer in our Data Office in the Czech Republic - https://www.linkedin.com/feed/update/urn:li:activity:7147975999253610496/ DM me here or on LinkedIn if you have questions.

      +
      +
      linkedin.com
      + + + + + + + + + + + + + + + + + +
      + + + +
      + 🔥 Maciej Obuchowski +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Harel Shein + (harel.shein@gmail.com) +
      +
      2024-01-02 11:00:58
      +
      +

      *Thread Reply:* created a #jobs channel, as we’re continuing to build a strong community here 🙂

      + + + +
      + 🚀 Jakub Dardziński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2024-01-02 11:30:00
      +
      +

      @channel +This Friday, Jan. 4, is the last day to respond to the 2023 Ecosystem Survey, which will close at 5 pm ET that day. It’s an opportunity to tell us about your organization’s lineage needs and priorities for the purpose of updating the project roadmap. For example, you can tell us: +• which data connectors you’d like OL to support +• which cloud data platforms you’d like OL to support +• which additional orchestrators you’d like OL to support +• and more. +Your feedback is very important. Thanks in advance!

      +
      +
      Google Docs
      + + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Martins Cardoso + (rodrigo.cardoso1@ibm.com) +
      +
      2024-01-02 12:27:27
      +
      +

      Hi everyone! Just passing by for a quick question: after reading SparkColumnLineage and ColumnLineageDatasetFacet.json I can see in the future work section of the 1st URL that

      + +
      +

      Current version of the mechanism allows finding input fields that were used to produce the output field but does not determine how were they used + This means that the transformationDescription and transformationType from ColumnLineageDatasetFacet are not sent in the events?

      +
      + +

      Thanks in advance for any inputs! Have a nice day.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-03 04:59:48
      +
      +

      *Thread Reply:* Currently, Spark integration does not fill transformation fields. Additionally, there's a proposal to improve this: https://github.com/OpenLineage/OpenLineage/issues/2186

      +
      + + + + + + + +
      +
      Labels
      + proposal +
      + +
      +
      Comments
      + 5 +
      + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Martins Cardoso + (rodrigo.cardoso1@ibm.com) +
      +
      2024-01-03 05:17:09
      +
      +

      *Thread Reply:* Thanks a lot for the clarification!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Vinay R + (vinayrme58@gmail.com) +
      +
      2024-01-03 11:33:45
      +
      +

      Hi,

      + +

      I'm currently working on extracting transformation logic from Redshift SQL insert statements. For instance, in the query 'SELECT A+B as Col1 FROM table1,' I'm trying to fetch the transformation logic 'A+B' for the column 'Col1.' I've been using regex, but it has limitations with subqueries, unions, and CTEs. Are there any specialized tools or suggestions for improving this process, especially when dealing with complex SQL structures like subqueries and unions?

      + +

      Queries are in rsql format. Thanks in advance; any help would be appreciated.
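      One option beyond regex is a full SQL parser such as sqlglot (an assumption on my part - it is not something OpenLineage ships); a minimal sketch of pulling the expression behind each projected column:
```
import sqlglot
from sqlglot import exp

sql = "SELECT A + B AS Col1 FROM table1"
tree = sqlglot.parse_one(sql, read="redshift")  # rsql is closest to the Redshift dialect

# walk every SELECT (including subqueries and CTEs) and print column <- expression
for select in tree.find_all(exp.Select):
    for projection in select.expressions:
        if isinstance(projection, exp.Alias):
            print(projection.alias, "<-", projection.this.sql())  # Col1 <- A + B
        else:
            print(projection.alias_or_name, "<-", projection.sql())
```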

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2024-01-03 14:04:27
      +
      +

      *Thread Reply:* We don’t parse the actual transformations in the SQL parser

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Mariusz Górski + (gorskimariusz13@gmail.com) +
      +
      2024-01-03 12:14:41
      +
      +

      Hey, I’ve submitted an update to the OL docs re Spark - anyone fancy conducting a review? 👀 🙏

      + +

      https://github.com/OpenLineage/docs/pull/268

      +
      + + + + + + + +
      +
      Comments
      + 1 +
      + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2024-01-04 04:43:48
      +
      +

      Hi everyone, is job-to-job linkage possible in OpenLineage?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2024-01-04 04:47:05
      +
      +

      *Thread Reply:* OpenLineage focuses on data lineage which means we don’t explicitly track job to job lineage but rather job -> dataset -> job

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2024-01-04 04:53:24
      +
      +

      *Thread Reply:* Thanks for the insight, Jakub! That makes sense. I'm curious, though: is there a way within OpenLineage to achieve this kind of linkage?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2024-01-04 04:54:53
      +
      +

      *Thread Reply:* sure, you can create a custom facet to expose such relations.
In terms of Airflow, there is an Airflow-specific run facet that contains upstream and downstream task IDs, but it's informative rather than a way to achieve explicitly what you're trying to do
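      For illustration, such a custom run facet might look something like the sketch below (a hypothetical facet - the names and URLs are made up, not part of the OpenLineage spec):
```
# Attached under run.facets of job2's events to expose explicit upstream jobs.
upstream_jobs_facet = {
    "_producer": "https://example.com/my-orchestrator/1.0",
    "_schemaURL": "https://example.com/schemas/UpstreamJobsRunFacet.json",
    "upstreamJobs": [
        {"namespace": "my-namespace", "name": "job1"},
    ],
}
```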

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2024-01-04 05:02:37
      +
      +

      *Thread Reply:* Thanks for clarifying that, Jakub. I'll look into that as you suggested.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2024-01-04 12:25:16
      +
      +

      @channel +UK-based members: a speaker is needed for a meetup with Confluent on January 31st in London. Please reply or DM me if you’re interested in speaking.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2024-01-04 13:21:25
      +
      +

      @channel +This month’s TSC meeting is next Thursday the 11th at 10am PT. On the tentative agenda: +• announcements +• recent releases +• open discussion +• more (TBA) +More info and the meeting link can be found on the website. All are welcome! Do you have a discussion topic, use case or integration you’d like to demo? DM me to be added to the agenda.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      William Sia + (willsia@amazon.com) +
      +
      2024-01-08 03:35:03
      +
      +

      Hi everyone, I am trying to use the datasetVersion facet but want to confirm that I am using it correctly; at the moment the behaviour seems a bit weird. Basically we have a pipeline like this:

      sourceData1 -> job1 -> datasetA

      datasetA v1, sourceData2 -> job2 -> datasetB

      But the job can be configured to use a specific version of the incoming dataset, so in the above example job2 has been configured to always use datasetA v1. So if I run job1 and it produces datasetA v2, and I then run job2, it should still pick v1.

      + +

      When we're integrating our pipeline by emitting events to OpenLineage (Marquez), we include:
      "inputs": [
          {
              "namespace": "test.namespace",
              "name": "datasetA",
              "facets": {
                  "version": {
                      "_producer": "https://some.producer.com/version/1.0",
                      "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/DatasetVersionDatasetFacet.json",
                      "datasetVersion": "1"
                  }
              }
          }
      ],
      "datasetVersion": "1" -> this refers to the "datasetVersion" of the output dataset of job1

      + +

      But what happens at the moment is that job2 always uses the latest version of datasetA, and it rewrites that latest version's version of datasetA to 1

      + +

      So I'm just wondering if I am emitting the events wrongly or if datasetVersion is not meant to be used for this purpose

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-08 04:57:43
      +
      +

      *Thread Reply:* Hi @William Sia, OpenLineage-Spark integration fetches latest version of the dataset. If you're deliberately reading some older one, OpenLineage event will still point to the latest which is kind of a bug or missing feature. feel free to create an issue for this. If possible, please provide some code snippet to reproduce this. Is this happening on delta/iceberg?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-08 09:46:16
      +
      +

      *Thread Reply:* > But what happened at the moment is, job2 always use the latest version of datasetA and it rewrites the latest version's version of datasetA to 1 +You mean that event contains different version than you emit? Or do you mean you see different thing on the Marquez UI?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      William Sia + (willsia@amazon.com) +
      +
      2024-01-08 21:56:19
      +
      +

      *Thread Reply:* @Maciej Obuchowski, what I meant is that I am expecting my job2 to have lineage to the input of datasetA v1 (because I specify the version facet of the dataset input as 1), but what happened is that job2 has lineage to the latest version of datasetA, and the datasetVersion parameter of the Version Facet of the latest version is now modified to 1 (it was 2 because there was a run of job1 that updated the version to 2). So job2, which has datasetA as an input, modified the Version Facet of the latter.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      William Sia + (willsia@amazon.com) +
      +
      2024-01-08 21:58:52
      +
      +

      *Thread Reply:* Cheers @Paweł Leszczyński, will raise an issue on GitHub with more details on this. At the moment we're enabling the application that my team is working on to emit events, so I am emitting the events manually to mimic the data flow in our system.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-09 07:32:02
      +
      +

      *Thread Reply:* > (because I specify the version facet of the dataset input to 1) +So what I understand is that you emit event for job2 that has datasetA as input with version 1 as shown above

      + +

      > but what happened is job2 has lineage latest version of datasetA +on UI I understand? I think it's Marquez issue - properly display which dataset version you're reading. Can you post screenshot on Marquez issue showing exactly where you'd expect to see different data?

      + +

      > datasetVersion parameter of the Version Facet of the latest version is now modified to 1 (it was 2 because there was a run on job1 that updates the version to 2 ). +Yeah, if job1 updated datasetA to next version, then it makse sense.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      William Sia + (willsia@amazon.com) +
      +
      2024-01-11 22:08:49
      +
      +

      *Thread Reply:* I raised this issue with some sample payloads here: https://github.com/MarquezProject/marquez/issues/2733. Let me know if you need more details

      +
      + + + + + + + +
      +
      Comments
      + 1 +
      + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2024-01-08 12:55:01
      +
      +

      @channel
Meetup alert: our first OpenLineage x Apache Kafka® meetup will be happening on January 31st (in-person) at 6:00 PM in Confluent’s London offices. Keep an eye out for more details and sign up here: https://www.meetup.com/london-openlineage-meetup-group/events/298420417/

      +
      +
      Meetup
      + + + + + + + + + + + + + + + + + +
      + + + +
      + 🚀 alexandre bergere, Maciej Obuchowski +
      + +
      + ❤️ Willy Lulciuc, Maciej Obuchowski, Eric Veleker +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2024-01-09 03:06:38
      +
      +

      Hi,
are you familiar with the Spline solution?
Do you know if we can use it on Databricks without adding any code to the notebooks in order to get the notebook name?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Damien Hawes + (damien.hawes@booking.com) +
      +
      2024-01-09 03:42:24
      +
      +

      *Thread Reply:* Spline is maintained by an entirely different group of people. Thus, we aren't familiar with the semantics of Spline. I suggest that you reach out to the maintainers of Spline to understand more about its capabilities.

      + + + +
      + ➕ Maciej Obuchowski +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2024-01-09 03:42:57
      +
      +

      *Thread Reply:* thanks

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anand Thamothara Dass + (anand_thamotharadass@cable.comcast.com) +
      +
      2024-01-09 08:57:33
      +
      +

      Hi folks, we are on MWAA-supported Airflow 2.7.2. For OpenLineage integration, is there a Kinesis transport type to emit the events?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2024-01-09 09:08:01
      +
      +

      *Thread Reply:* there’s no Kinesis transport available in Python at the moment

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anand Thamothara Dass + (anand_thamotharadass@cable.comcast.com) +
      +
      2024-01-09 09:12:04
      +
      +

      *Thread Reply:* Thank you @Jakub Dardziński. Can we expect support in the future? Any suggested alternative till then?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2024-01-09 09:13:36
      +
      +

      *Thread Reply:* that’s an open source project 🙂 if someone finds it useful to contribute, or the community decides to implement this, it’ll probably be supported

      + +

      I’m not sure what your use case is, but we have a Kafka transport

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anand Thamothara Dass + (anand_thamotharadass@cable.comcast.com) +
      +
      2024-01-09 09:20:48
      +
      +

      *Thread Reply:* Thanks @Jakub Dardziński. We have limited support for Kafka and we are widely using Kinesis. The use case is that we are trying to build an in-house lineage store where we can emit these events. If Kafka is the only streaming option available, we can give it a try.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Damien Hawes + (damien.hawes@booking.com) +
      +
      2024-01-09 09:52:16
      +
      +

      *Thread Reply:* I'd look to see if AWS offers something like KafkaProxy, but for Kinesis. That is, you emit via HTTP and it forwards to the Kinesis stream.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anand Thamothara Dass + (anand_thamotharadass@cable.comcast.com) +
      +
      2024-01-09 10:00:24
      +
      +

      *Thread Reply:* Nice, I’ll take a look into it.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Harel Shein + (harel.shein@gmail.com) +
      +
      2024-01-09 10:03:55
      +
      +

      *Thread Reply:* agree with @Jakub Dardziński & @Damien Hawes. +the transport interface is also pretty straightforward, so adding support for Kinesis might be trivial - in case you do decide to contribute
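      As a rough sketch of what such a contribution could look like - assuming the Transport/Config base classes from openlineage-python plus boto3, with details likely differing between client versions:
```
import boto3
from openlineage.client.serde import Serde
from openlineage.client.transport import Config, Transport


class KinesisConfig(Config):
    def __init__(self, stream_name: str, region: str):
        self.stream_name = stream_name
        self.region = region

    @classmethod
    def from_dict(cls, params: dict) -> "KinesisConfig":
        return cls(stream_name=params["stream_name"], region=params["region"])


class KinesisTransport(Transport):
    kind = "kinesis"
    config = KinesisConfig

    def __init__(self, config: KinesisConfig) -> None:
        self.stream_name = config.stream_name
        self.kinesis = boto3.client("kinesis", region_name=config.region)

    def emit(self, event) -> None:
        # one Kinesis record per OpenLineage event, partitioned by run id
        self.kinesis.put_record(
            StreamName=self.stream_name,
            Data=Serde.to_json(event).encode("utf-8"),
            PartitionKey=event.run.runId,
        )
```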

      + + + +
      + 👍 Anand Thamothara Dass +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2024-01-10 14:22:18
      +
      +

      Hello everyone,

      + +

      I'm exploring OpenLineage + Marquez for lineage in our warehouse solution connecting to Snowflake. We've set up ETL jobs for various operations like Select, Update, Delete, and Merge. Is there a way to capture Change Data Capture (CDC) at the column level using OpenLineage? Our goal is to present this CDC information as facets on Marquez. +Any insights or suggestions on this would be greatly appreciated!

      + + + +
      + 👍 alexandre bergere +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Harel Shein + (harel.shein@gmail.com) +
      +
      2024-01-11 05:14:03
      +
      +

      *Thread Reply:* What are you using for CDC?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2024-01-11 16:12:11
      +
      +

      *Thread Reply:* @Harel Shein In our current setup, we typically utilize Slowly Changing Dimension (SCD) Type 2 at the source level for Change Data Capture (CDC). In this specific scenario, the source is Snowflake.

      + +

      During the ETL process, we receive a CDC trigger or event from Snowflake.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Harel Shein + (harel.shein@gmail.com) +
      +
      2024-01-12 13:01:58
      +
      +

      *Thread Reply:* what do you use to run this ETL process? If that framework is supported by openlineage, I assume it would work

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2024-01-10 15:23:47
      +
      +

      @channel +This month’s TSC meeting, open to all, is tomorrow at 10 am PT https://openlineage.slack.com/archives/C01CK9T7HKR/p1704392485753579

      +
      + + +
      + + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2024-01-11 04:32:41
      +
      +

      Hi Everyone

      + +

      there is a search option on facets in the Marquez UI. Is there any functionality behind it?

      + +

      Do we have the functionality to search over the lineage we are getting?

      + +
      + + + + + + + + + +
      + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Harel Shein + (harel.shein@gmail.com) +
      +
      2024-01-11 05:43:04
      +
      +

      *Thread Reply:* great question @Shahid Shaikh! I’m actually not sure, would you mind posting this question on the Marquez slack? +https://join.slack.com/t/marquezproject/shared_invite/zt-2afft44fo-nWcmZmdrv7qSxfNl6iOKMg

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2024-01-11 13:35:47
      +
      +

      *Thread Reply:* Thanks @Harel Shein, I will follow up for updates on the Marquez Slack.

      + + + +
      + 👍 Harel Shein +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Simran Suri + (mailsimransuri@gmail.com) +
      +
      2024-01-11 04:38:39
      +
      +

      Hi everyone, +I'm currently running my Spark code on Azure Kubernetes Service (AKS), and I'm interested in knowing if OpenLineage can provide cluster details such as its name, etc. In my current setup and run events, I haven't been able to find these cluster-related details.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Harel Shein + (harel.shein@gmail.com) +
      +
      2024-01-11 05:47:13
      +
      +

      *Thread Reply:* I’m not sure if all of those details are readily available to the spark application to extract them. If you find a way, those can be reported in a custom facet. +The debug facet for Spark would be the closest option to what you’re looking for, but definitely doesn’t contain k8s cluster level data.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-11 07:01:18
      +
      +

      *Thread Reply:* Not out of the box, but if those are in environment variables, you can use +spark.openlineage.facets.custom_environment_variables=[YOUR_VARIABLE;] +to include it as CustomEnvironmentFacet
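      For example, something like this when building the session (CLUSTER_NAME is an illustrative variable name, and the version/transport here are placeholders):
```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aks-lineage-demo")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:1.7.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "console")
    # capture CLUSTER_NAME from the driver environment into each run event
    .config("spark.openlineage.facets.custom_environment_variables", "[CLUSTER_NAME;]")
    .getOrCreate()
)
```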

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Simran Suri + (mailsimransuri@gmail.com) +
      +
      2024-01-11 07:04:01
      +
      +

      *Thread Reply:* Thanks, can you please share the reference link for this so that I can try and implement it, @Maciej Obuchowski?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-11 07:32:21
      + +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2024-01-11 10:12:27
      +
      +

      *Thread Reply:* Would this work if these environment variables were set at runtime?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-11 10:13:45
      +
      +

      *Thread Reply:* during runtime of a job? I don't think so, we collect them when we spawn the OpenLineage integration

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2024-01-11 10:15:01
      +
      +

      *Thread Reply:* I thought so. thank you 🙂

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2024-01-11 10:23:42
      +
      +

      *Thread Reply:* @Maciej Obuchowski By the way, coming back to this, is there any way to pass some variable value from the executing code/job to the OL spark listener? The example above is something I was looking forward to, like, setting some (ENV) var at run time and having access to this value in the OL event.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Simran Suri + (mailsimransuri@gmail.com) +
      +
      2024-01-15 03:30:06
      +
      +

      *Thread Reply:* @Maciej Obuchowski, actually the cluster details are not in env variables. In my Airflow setup those are stored in the Airflow connections and I have a connection ID for that. I'm successfully getting the connection ID details, but not the associated Spark cluster name and details.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      David Goss + (david.goss@matillion.com) +
      +
      2024-01-11 09:01:29
      +
      +

      ❓ Is there a standard or well-used facet for recording the type of a dataset? e.g. Table vs View at a very simple level.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-11 09:28:11
      +
      +

      *Thread Reply:* I think we encode the type of the dataset in the dataset namespace scheme: something on s3 will always be a link to a path in a bucket.

      + +

      Table vs View is a specific categorization of datasets that makes a lot of sense when talking about database datasets, but would not make sense when talking about object storage ones, so my naive view is that it's something that should be unique to a particular data source type.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      David Goss + (david.goss@matillion.com) +
      +
      2024-01-11 09:30:10
      +
      +

      *Thread Reply:* Very fair! I think we’re probably going to use a custom facet for this for our own purposes, I just wanted to check if there was anything standard or standard-adjacent we should align to.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Julien Le Dem + (julien@apache.org) +
      +
      2024-01-11 18:59:34
      +
      +

      *Thread Reply:* This sounds good. Please share that custom facet’s schema if you’re willing. This might be a good candidate for a new optional core facet for views.

      + + + +
      + 👍 David Goss +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Julien Le Dem + (julien@apache.org) +
      +
      2024-01-11 19:16:03
      +
      +

      As discussed in the call today, I have updated the OpenLineage registry proposal. It also includes the contributions of @Sheeri Cabral (Collibra) and feedback from the review of the . If you are interested (in particular @Eric Veleker, @Ernie Ostic, @Sheeri Cabral (Collibra), @Jens Pfau, but others as well), please comment on the PR. I think we are close enough to implement a first version.

      +
      + + + + + + + +
      +
      Labels
      + documentation, proposal +
      + +
      +
      Comments
      + 2 +
      + + + + + + + + + + +
      + + + +
      + ❤️ Jarek Potiuk, Michael Robinson, Paweł Leszczyński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Honey Thakuria + (Honey_Thakuria@intuit.com) +
      +
      2024-01-12 00:08:47
      +
      +

      Hi everyone,
We're trying to use OpenLineage for the Presto on Spark script mentioned below, but we aren't getting CREATE or INSERT lifecycle events and are only able to get DROP events.

      + +

      Could anyone help us figure out how we can get proper events for Presto on Spark? Any tweaks or config changes?
cc @Kiran Hiremath @Athitya Kumar
```
drop table if exists schema_name.table1;
drop table if exists schema_name.table2;
drop table if exists schema_name.table3;

CREATE TABLE schema_name.table1 AS
SELECT * FROM (
    VALUES
        (1, 'a'),
        (2, 'b'),
        (3, 'c')
) AS temp1 (id, name);

CREATE TABLE schema_name.table2 AS
SELECT * FROM (
    VALUES
        (1, 'a'),
        (2, 'b'),
        (3, 'c')
) AS temp2 (id, name);

CREATE TABLE schema_name.table3 AS
SELECT *
FROM schema_name.table1
UNION ALL
SELECT *
FROM schema_name.table2;
```

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2024-01-13 10:54:45
      +
      +

      Hey team! 👋

      + +

      We're seeing some instances where the openlineage spark listener class seems to be running long even after the spark job has completed.

      + +

      While we debug the reason why it's running long (a huge Spark event due to partitions / JSON reads etc., which could be lagging the listener thread/JVM), we just wanted to see if there's a way we could ensure from the OpenLineage listener that none of the methods overridden from SparkListener execute for more than 2 mins (say)?

      + +

      For example, does having a configurable timeout value (like spark.openlineage.timeout.seconds=120 spark conf) and having an internal Executor + Future.get() with timeout / google's SimpleTimeLimiter make sense for this complexity for any spark event handled by OL spark listener?
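      To illustrate the idea only (a sketch in Python rather than the listener's actual Java, and spark.openlineage.timeout.seconds is the proposed setting, not an existing one):
```
from concurrent.futures import ThreadPoolExecutor, TimeoutError

executor = ThreadPoolExecutor(max_workers=1)

def handle_with_timeout(handler, event, timeout_s=120):
    """Run one listener callback, giving up after timeout_s seconds."""
    future = executor.submit(handler, event)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        future.cancel()  # drop this event instead of blocking the listener bus
        return None
```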

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-15 04:07:58
      +
      +

      *Thread Reply:* yes, this would definitely make sense

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2024-01-14 22:51:40
      +
      +

      Hey team.

      + +

      We're observing that the OpenLineage Spark listener runs long (sometimes even for a couple of hours) even though the Spark job has completed. We've seen a pattern where this mostly happens for jobs with 3 levels of subqueries - is there a known issue for this, wherein a huge Spark event object from the listener bus causes a large delay in the OpenLineage Spark listener's event processing due to JVM lag, OpenLineage code, etc.?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-15 04:07:11
      +
      +

      *Thread Reply:* there is no known issue for this. Just to make sure: did you disable serializing the LogicalPlan and sending it via the event?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2024-01-15 04:14:00
      +
      +

      *Thread Reply:* @Paweł Leszczyński - Yup, we've set this conf:
"spark.openlineage.facets.disabled": "[spark_unknown;spark.logicalPlan]"
We've also seen logs from Spark where it says that an event took more than 10-15 mins to process:
[INFO ] [spark-listener-group-shared] [org.apache.spark.scheduler.AsyncEventQueue.logInfo@57] - Process of event SparkListenerSQLExecutionEnd(54,1702566454593) by listener OpenLineageSparkListener took 614.914735426s.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-15 04:24:43
      +
      +

      *Thread Reply:* yeah, that's interesting. Are you able to verify whether this is related to event generation and not a backend issue (like a Marquez deadlock)?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2024-01-15 04:25:58
      +
      +

      *Thread Reply:* Yup yup, we use HTTP transport which has a 30 seconds API GW timeout on our side - but we also tried with console transport type to rule out the possibility of a backend issue

      + +

      Faced the same issue with console transport type too

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-15 06:32:28
      +
      +

      *Thread Reply:* When it runs for a long time, does it succeed or not? Perhaps there is a cycle in the logical plan. Could you turn on and attach the debug facet?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2024-01-15 06:45:02
      +
      +

      *Thread Reply:* It runs for a long time and succeeds - for some jobs it's a matter of 1 hour, whereas we've seen jobs delaying by 4 hours as well.

      + +

      @Paweł Leszczyński - How do we enable debugFacet? Any additional spark conf to add?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-15 06:47:16
      +
      +

      *Thread Reply:* yes, spark.openlineage.debugFacet=enabled

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Kacper Muda + (kacper.muda@getindata.com) +
      +
      2024-01-15 05:29:21
      +
      +

      Hey,
can I request a patch release that will include this fix? If there is already an upcoming release planned, let me know - maybe it will be soon enough 🙂

      + + + +
      + ➕ Jakub Dardziński, Harel Shein, Maciej Obuchowski +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2024-01-16 16:01:22
      +
      +

      *Thread Reply:* I’m bumping this request. I would love to include this https://github.com/OpenLineage/OpenLineage/pull/2373 as well

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2024-01-19 09:05:54
      +
      +

      *Thread Reply:* Thanks, all. The release is authorized.

      + + + +
      + 🙌 Jakub Dardziński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2024-01-15 07:24:23
      +
      +

      Hello All!
I want to try to help the community by testing some of the bugs I've been reporting, but also to try working with custom facets. I'm having trouble building the project.
Is there any step-by-step document (and dependencies) on how to build it?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2024-01-15 07:48:40
      +
      +

      *Thread Reply:*

      + +
      + + + + + + + + + +
      + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2024-01-15 07:51:20
      +
      +

      *Thread Reply:* gradle --version

      Gradle 8.5

      Build time:   2023-11-29 14:08:57 UTC
      Revision:     28aca86a7180baa17117e0e5ba01d8ea9feca598

      Kotlin:       1.9.20
      Groovy:       3.0.17
      Ant:          Apache Ant(TM) version 1.10.13 compiled on January 4 2023
      JVM:          21.0.1 (Homebrew 21.0.1)
      OS:           Mac OS X 14.2 aarch64

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-15 09:20:10
      +
      +

      *Thread Reply:* I think the easiest way would be to use Java 8 (it's still supported by Spark)

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-15 09:23:30
      +
      +

      *Thread Reply:* I recommend Azul Zulu, it works for mac M1 https://www.azul.com/downloads/?version=java-8-lts&os=macos&architecture=arm-64-bit&package=jdk#zulu

      +
      +
      Azul | Better Java Performance, Superior Java Support
      + + + + + + +
      +
      Est. reading time
      + 1 minute +
      + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2024-01-15 11:10:00
      +
      +

      *Thread Reply:* Java downgraded to 8, thanks.
But now, let's say I want to build the project (focusing on the Spark integration). What should I do?

      + +

      I tried gradle build (in integration/spark) and got the same error. Am I missing any step here?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Mattia Bertorello + (mattia.bertorello@booking.com) +
      +
      2024-01-15 16:37:05
      +
      +

      *Thread Reply:* If you're having issues with Java, I suggest using https://asdf-vm.com/ to ensure that Gradle doesn't accidentally use a different Java version.

      +
      +
      asdf-vm.com
      + + + + + + + + + + + + + + + +
      + +
      + + + + + + + +
      + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-16 05:01:01
      +
      +

      *Thread Reply:* @Rodrigo Maia I would try removing contents of ~/.m2/repository - this usually helps with this kind of error

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-16 05:01:12
      +
      +

      *Thread Reply:* The only downside is that you'll have to redownload dependencies

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2024-01-17 05:22:09
      +
      +

      *Thread Reply:* thank you for the ideas to solve the issue. I'll try that during the weekend 😄

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Dinesh Singh + (dinesh.r-singh@hpe.com) +
      +
      2024-01-16 07:00:17
      +
      +

      Hello Team, I am new to column-level lineage. I would like to know the fastest and easiest means of creating column-level data/JSON. Which integration will be helpful: Airflow, Spark, or a direct MySQL operator?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Dinesh Singh + (dinesh.r-singh@hpe.com) +
      +
      2024-01-16 07:04:36
      +
      +

      *Thread Reply:* Previously I worked with the DataHub integration with Spark to generate lineage. How do I skip DataHub and directly use OpenLineage adaptors?
Below is the previous example:

      + +

      from pyspark.sql import SparkSession

      spark = SparkSession \
          .builder \
          .master("local[*]") \
          .appName("NetflixMovies") \
          .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.23") \
          .config("spark.extraListeners", "datahub.spark.DatahubSparkListener") \
          .config("spark.datahub.rest.server", "http://4.labs.net:8080") \
          .enableHiveSupport() \
          .getOrCreate()

      print("Reading CSV File")
      movies_df = spark \
          .read \
          .format("csv") \
          .option("header", "true") \
          .option("inferSchema", "true") \
          .load("n_movies.csv")

      movies_above8_rating = movies_df.filter(movies_df.rating >= 8.0)

      print("Writing CSV File")
      movies_above8_rating \
          .coalesce(1) \
          .write \
          .format("csv") \
          .option("header", "true") \
          .mode("overwrite") \
          .save("movies_above_rating8.csv")
      print("Completed")
      spark.stop()

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-16 07:12:28
      +
      +

      *Thread Reply:* Have you tried directly doing the same just using OpenLineage Spark listener?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Dinesh Singh + (dinesh.r-singh@hpe.com) +
      +
      2024-01-16 07:16:21
      +
      +

      *Thread Reply:* I did try, no success yet!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-16 07:16:43
      +
      +

      *Thread Reply:* Can you share the resulting event?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Dinesh Singh + (dinesh.r-singh@hpe.com) +
      +
      2024-01-16 07:17:06
      +
      +

      *Thread Reply:* I need some time to reproduce.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Dinesh Singh + (dinesh.r-singh@hpe.com) +
      +
      2024-01-16 07:17:16
      +
      +

      *Thread Reply:* Is there any other example you can suggest?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-16 07:33:47
      +
      +

      *Thread Reply:* You can look at our integration tests: https://github.com/OpenLineage/OpenLineage/blob/31f8ce588526e9c7c4bc7d849699cb7ce2[…]va/io/openlineage/spark/agent/ColumnLineageIntegrationTest.java

      +
      + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Fabio Manganiello + (fabio.manganiello@booking.com) +
      +
      2024-01-19 05:06:31
      +
      +

      Hi channel, do we have a standard way to infer the type of the job from the OpenLineage events that we receive? Context: I'm building a product where we're expected to tell the users whether a particular job is an Airflow DAG, a dbt workflow, a Spark job etc. The (relatively) most reliable piece of data to extract this information is the producer string published in the event, but I've noticed that there's really no standard way of formatting it. After seeing some producer strings end in /dbt and /spark we thought "cool, we can just scrape the last slashed token of the producer URL and infer the type from there". Then we met the Airflow integration, which apparently publishes a producer in the format <https://github.com/apache/airflow/tree/providers-openlineage/1.3.0>.

      + +

      Is the producer the only way to infer the type of a job from an event? If so, are there any efforts in standardizing its possible values, or maybe create a registry of producers readily available in the OL libraries? If not, what's an alternative way of getting this info?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-19 05:53:21
      +
      +

      *Thread Reply:* We introduced JobTypeJobFacet https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/JobTypeJobFacet.json some time ago

      +
      + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-19 05:54:39
      +
      +

      *Thread Reply:* The initial idea was to distinguish streaming jobs from batch ones. Then it got extended to have jobType like QUERY|COMMAND|DAG|TASK|JOB|MODEL, or integration, which can be SPARK|DBT|AIRFLOW|FLINK
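      In the event JSON the facet looks roughly like this, per the spec linked above (the _producer and _schemaURL values here are placeholders):
```
job_type_facet = {
    "_producer": "https://example.com/producer",
    "_schemaURL": "https://openlineage.io/spec/facets/JobTypeJobFacet.json",
    "processingType": "STREAMING",  # or BATCH
    "integration": "FLINK",         # SPARK | DBT | AIRFLOW | FLINK
    "jobType": "JOB",               # QUERY | COMMAND | DAG | TASK | JOB | MODEL
}
```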

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Fabio Manganiello + (fabio.manganiello@booking.com) +
      +
      2024-01-19 05:56:30
      +
      +

      *Thread Reply:* Thanks, that should definitely address my concern! Is it supposed to be filled explicitly (e.g. through env variables) or is it filled by the OL connector automatically? I've taken a look at several dumps of OL events but I can't recall seeing that facet being populated

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-19 06:13:17
      +
      +

*Thread Reply:* I introduced this for Flink only https://github.com/OpenLineage/OpenLineage/pull/2241/files as that is what I needed at that point. You would need to add it to the Spark integration if you like

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2024-01-19 10:49:34
      +
      +

      *Thread Reply:* JobTypeJobFacet(processingType="BATCH", integration="AIRFLOW", jobType="QUERY"),

      + +

would you consider the jobType to be QUERY if Airflow is running a task that filters a Delta table and saves the output to a different Delta table? It is a task as well in that case, of course
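For readers landing here later, a minimal sketch of emitting an event that carries this facet with the openlineage-python client (import paths and the facet constructor follow the v1 client as in the snippet above and are an assumption; the endpoint, namespace, job name, and producer URL are placeholders):

```
# Hedged sketch: attach a JobTypeJobFacet to a job so consumers can read the
# integration (AIRFLOW) and job type (QUERY) instead of scraping the producer.
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.facet import JobTypeJobFacet  # assumed import path
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # placeholder endpoint

job = Job(
    namespace="my-namespace",
    name="filter_delta_table",  # hypothetical task name
    facets={
        "jobType": JobTypeJobFacet(
            processingType="BATCH", integration="AIRFLOW", jobType="QUERY"
        )
    },
)

client.emit(
    RunEvent(
        eventType=RunState.START,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=str(uuid4())),
        job=job,
        producer="https://github.com/my-org/my-pipelines",  # placeholder
    )
)
```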

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2024-01-19 10:00:04
      +
      +

any suggestions on naming for Graph API sources from Outlook? I pull a lot of data from email attachments with Airflow. Generally I am passing a resource (email address), the mailbox, and subfolder. From there I list messages and find attachments

      + +
      + + + + + + + + + +
      + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Kacper Muda + (kacper.muda@getindata.com) +
      +
      2024-01-19 10:08:06
      +
      +

*Thread Reply:* What is the difference between mailbox and email address? Could You give some examples of all the parts provided?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2024-01-19 10:22:42
      +
      +

*Thread Reply:* sure, so I definitely want to leave out the email address if possible (it is my name and I don't want it in OL metadata)

      + +

so we use the Graph API credentials to open the mailbox of a resource (my work email), then we open the main Inbox folder, and from there we open folders/subfolders, which is important to track

      + +

      hook = MicrosoftEmailHook(graph_conn_id="ms_email_api", resource="my@email", folder="client_name", subfolder="system_name")

      + +

```
def get_conn(self) -> MailBox:
    """Connects to the Office 365 account and opens the mailbox"""
    credentials = (self.graph_conn.login, self.graph_conn.password)
    account = Account(
        credentials, auth_flow_type="credentials", tenant_id=self.graph_conn.host
    )
    if not account.is_authenticated:
        account.authenticate()
    return account.mailbox(resource=self.resource)

@cached_property
def mailbox(self) -> MailBox:
    """Returns the authenticated account mailbox"""
    return self.get_conn()

@cached_property
def current_folder(self) -> Folder:
    """Property to lazily open and return the current folder"""
    if not self._current_folder:
        self.open_folder()
    return self._current_folder

def open_folder(self) -> Folder:
    """Opens the specified folder and sets it as the current folder"""
    inbox = self.mailbox.inbox_folder()
    f = inbox.get_folder(folder_name=self.folder)
    self._current_folder = (
        f.get_folder(folder_name=self.subfolder) if self.subfolder else f
    )

def get_messages(
    self,
    query: Query | None = None,
    download_attachments: bool = False,
) -> list[Message]:
    """Lists all messages in a folder. Optionally filters them based on an OData
    query

    Args:
        query: OData query object
        download_attachments: Whether attachments should be downloaded

    Returns:
        A Message object or list of Message objects. A tuple of the Message and
        Attachment is returned if return_attachments is True
    """
    messages = [
        message
        for message in self.current_folder.get_messages(
            limit=self.limit,
            batch=self.batch,
            query=query,
            download_attachments=download_attachments,
        )
    ]
    return messages
```
      +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2024-01-19 10:30:27
      +
      +

*Thread Reply:* I am not sure if something like this would be the namespace or the resource name, but I would consider a "source" of data to be this:
outlook://{self.folder}/{self.subfolder}

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2024-01-19 10:30:44
      +
      +

      *Thread Reply:* then we list attachments with filters (subject, received since and until)

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Kacper Muda + (kacper.muda@getindata.com) +
      +
      2024-01-19 10:33:56
      +
      +

*Thread Reply:* Ok, got it. I'm not sure if leaving out Your email is the best choice, as it's the best identifier of the attachments' location 🙂 It's just my opinion, but I think I would go with something like:

      + +

namespace: email://{email_address}
name: {folder}/{subfolder}/

      + +

      OR

      + +

namespace: email
name: {email_address}/{folder}/{subfolder}
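To make the second option concrete, a hedged sketch of mapping the hook's fields onto an OpenLineage dataset (Dataset is from openlineage-python; the convention itself is just the proposal above, not part of the official naming spec):

```
# Hypothetical mapping of MicrosoftEmailHook fields onto an OL dataset name,
# following namespace "email" and name "{email_address}/{folder}/{subfolder}".
from openlineage.client.run import Dataset

def email_source_dataset(resource: str, folder: str, subfolder: str | None = None) -> Dataset:
    name = f"{resource}/{folder}"
    if subfolder:
        name += f"/{subfolder}"
    return Dataset(namespace="email", name=name)

source = email_source_dataset("my@email", "client_name", "system_name")
# -> Dataset(namespace="email", name="my@email/client_name/system_name")
```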

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Kacper Muda + (kacper.muda@getindata.com) +
      +
      2024-01-19 10:34:44
      +
      +

      *Thread Reply:* I'm not sure if outlook://{self.folder}/{self.subfolder} is descriptive enough, as You can't really tell who received this attachment. But of course, it all depends on Your use case

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2024-01-19 10:35:14
      +
      +

*Thread Reply:* here is the output of message.build_url():
<https://graph.microsoft.com/v1.0/users/name@company.com/folder>

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2024-01-19 10:36:08
      +
      +

*Thread Reply:* which is a built-in method in the O365 library I am using. And yeah, I think I need the resource name no matter what

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2024-01-19 10:36:28
      +
      +

*Thread Reply:* it makes it clear it is my personal work email address and not the service account email address.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Kacper Muda + (kacper.muda@getindata.com) +
      +
      2024-01-19 10:37:52
      +
      +

*Thread Reply:* I think I would not base the namespace on the email provider like Outlook or Gmail, because it's something that can easily change over time, yet it does not influence the actual content of the email. If You suddenly transfer to Google from Microsoft, and move all Your folder/subfolder structure, does it really matter for You, from a lineage perspective?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Kacper Muda + (kacper.muda@getindata.com) +
      +
      2024-01-19 10:38:57
      +
      +

*Thread Reply:* This one I think is more debatable 😄

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2024-01-19 10:39:16
      +
      +

      *Thread Reply:* agree

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2024-01-19 10:39:59
      +
      +

*Thread Reply:* the folder name in Outlook corresponds to the GCS bucket name the files are loaded to as well, so I would want that to be nice and clear in Marquez. It is all based on the client name, which is the "folder" for emails

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Kacper Muda + (kacper.muda@getindata.com) +
      +
      2024-01-19 10:44:15
      +
      +

*Thread Reply:* If Your specific email folder structure is somehow mapped to GCS, then You can try looking at the naming spec for GCS and somehow mix it in. The namespace in GCS contains the bucket name, so maybe in Your case it's wise to put it there as well, so You have separate namespaces for each client? At the end of the day, You are using it and it should match Your needs

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2024-01-19 10:54:44
      +
      +

      *Thread Reply:* yep that is correct, it kind of falls apart with SFTP sources because the namespace includes the host which is not very descriptive (just some domain names).

      + +

      we try to keep things separate between client namespaces though in general, since they are unrelated and different teams work on them

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Abdallah + (abdallah@terrab.me) +
      +
      2024-01-19 10:11:36
      +
      +

      Hello,

      + +

      I hope you are doing all well.

      + +

We've observed an issue with symlinks generated by the Spark integration, specifically when dealing with a Glue Catalog Table. The dataset namespace ends up being the same as the dataset name. This issue arises from the method used to parse the dataset's URI.

      + +

      I would like to discuss this issue with the contributors.

      + +

      https://github.com/OpenLineage/OpenLineage/pull/2379

      +
      + + + + + + + +
      +
      Labels
      + integration/spark +
      + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2024-01-22 02:48:44
      +
      +

      Hey team

      + +

Does the openlineage-spark listener support parsing Kafka sinks from the logical plan to publish lineage events? If yes, since which openlineage-spark version has this been supported?

      + +

      We have a pattern like df.write.format("KAFKA").option("TOPIC", "topicName") wherein we don't see any openlineage events - we're using openlineage-spark:1.1.0 currently
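For comparison, the canonical structured form of a batch Kafka sink in PySpark is below (the broker address and topic are placeholders; this is plain Spark API usage, not a claim about what the listener extracts):

```
# Plain PySpark Kafka batch sink: Kafka expects string/binary key and value
# columns plus a bootstrap-servers option (placeholder values throughout).
(
    df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .write.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "topicName")
    .save()
)
```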

      + + + +
      +
      +
      +
      + + + + + + + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-22 03:10:18
      +
      +

*Thread Reply:* However, this feature has not been improved or developed for a long time (about 2 years), and surely has some gaps, like https://github.com/OpenLineage/OpenLineage/issues/372

      +
      + + + + + + + +
      +
      Labels
      + integration/spark, streaming +
      + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2024-01-22 15:42:55
      +
      +

      @channel +We released OpenLineage 1.8.0, featuring: +• Flink: support Flink 1.18 #2366 @HuangZhenQiu +• Spark: add Gradle plugins to simplify the build process to support Scala 2.13 #2376 @d-m-h +• Spark: support multiple Scala versions LogicalPlan implementation #2361 @mattiabertorello +• Spark: Use ScalaConversionUtils to convert Scala and Java collections #2357 @mattiabertorello +• Spark: support MERGE INTO queries on Databricks #2348 @pawel-big-lebowski +• Spark: Support Glue catalog in iceberg #2283 @nataliezeller1 +Plus many bug fixes and a change in Spark! +Thanks to all the contributors with a special shoutout to new contributor @Mattia Bertorello, who had no fewer than 5 contributions in this release! +Release: https://github.com/OpenLineage/OpenLineage/releases/tag/1.8.0 +Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md +Commit history: https://github.com/OpenLineage/OpenLineage/compare/1.7.0...1.8.0 +Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage +PyPI: https://pypi.org/project/openlineage-python/

      + + + +
      + 🚀 Jakub Dardziński, alexandre bergere, Harel Shein, Mattia Bertorello, Maciej Obuchowski, Julian LaNeve +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2024-01-22 16:43:32
      +
      +

Hello team, I see the following issue when I install apache-airflow-providers-openlineage==1.4.0

      + + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2024-01-22 16:43:56
      +
      +

      *Thread Reply:* @Jakub Dardziński @Willy Lulciuc

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-22 17:00:17
      +
      +

      *Thread Reply:* Have you modified your local files? Please try to uninstall and reinstall the package

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2024-01-22 17:10:47
      +
      +

*Thread Reply:* I didn't do anything, all other packages are fine

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2024-01-22 17:12:42
      +
      +

      *Thread Reply:* let me reinstall

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2024-01-22 17:30:10
      +
      +

      *Thread Reply:* it’s odd because it should be line 142 actually, are you sure you’re installing it properly?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2024-01-22 17:33:33
      +
      +

      *Thread Reply:* actually it could be my local

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2024-01-22 17:33:39
      +
      +

*Thread Reply:* I'm trying something

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      jayant joshi + (itsjayantjoshi@gmail.com) +
      +
      2024-01-25 06:42:58
      +
      +

Hi Team!
I want to create column-level lineage using PySpark.
I followed the https://openlineage.io/blog/openlineage-spark/ blog step by step.
When I run the next command, docker run --network spark_default -p 3000:3000 -e MARQUEZ_HOST=marquez-api -e MARQUEZ_PORT=5000 --link marquez-api:marquez-api marquezproject/marquez-web:0.19.1, I see the <http://localhost:3000> Marquez UI, but it continuously shows "Something went wrong while fetching search" and gives no result. In cmd: "Error occurred while trying to proxy request /api/v1/search/?q=p&sort=NAME&limit=100 from localhost:3000 to <http://marquez-api:5000/> (EAI_AGAIN) (<https://nodejs.org/api/errors.html#errors_common_system_errors>)". For the dataset I am reading a CSV file. For reference, attaching a screenshot.
Is there any solution?

      + + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-25 07:48:45
      +
      +

*Thread Reply:* First, can you try a newer version? The latest is 0.44.0; 0.19.1 is over two years old
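That is, the same command with only the image tag bumped (assuming the rest of the original invocation is unchanged):

```
docker run --network spark_default -p 3000:3000 -e MARQUEZ_HOST=marquez-api \
  -e MARQUEZ_PORT=5000 --link marquez-api:marquez-api marquezproject/marquez-web:0.44.0
```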

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Tamizharasi Karuppasamy + (tamizhsangami@gmail.com) +
      +
      2024-01-29 04:13:49
      +
      +

*Thread Reply:* Hi, I am trying to generate column-level lineage for a CSV file on AWS S3 using Spark and OpenLineage. I follow https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/README.md and https://openlineage.io/docs/integrations/spark/quickstart_local for reference, but I get the below error. Kindly help me resolve it.

      +
      +
      openlineage.io
      + + + + + + + + + + + + + + + +
      +
      + + + + + + + + + + + + + + + + +
      + +
      + + + + + + + + + +
      + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-29 05:00:53
      +
      +

      *Thread Reply:* looks like you have a typo in versioning @Tamizharasi Karuppasamy

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      jayant joshi + (itsjayantjoshi@gmail.com) +
      +
      2024-01-29 06:07:41
      +
      +

*Thread Reply:* Hi @Maciej Obuchowski, as per your suggestion I used 0.44.0 but the error still persists: "[HPM] Error occurred while proxying request localhost:3000/api/v1/events/lineage?limit=20&before=2024-01-29T23:59:59.000Z&after=2024-01-29T00:00:00.000Z&offset=0&sortDirection=desc to <http://marquez-api:5000/> [EAI_AGAIN] (<https://nodejs.org/api/errors.html#errors_common_system_errors>)". Even in docker, "marquez-api" is not running.
For your reference, sharing the log:
2024-01-29 16:35:29 marquez-db | 2024-01-29 11:05:29.547 UTC [39] FATAL: password authentication failed for user "marquez"
2024-01-29 16:35:29 marquez-db | 2024-01-29 11:05:29.547 UTC [39] DETAIL: Role "marquez" does not exist.
2024-01-29 16:35:29 marquez-db | Connection matched pg_hba.conf line 95: "host all all all md5"
2024-01-29 16:35:29 marquez-api | ERROR [2024-01-29 11:05:29,553] org.apache.tomcat.jdbc.pool.ConnectionPool: Unable to create initial connections of pool.
2024-01-29 16:35:29 marquez-api | ! org.postgresql.util.PSQLException: FATAL: password authentication failed for user "marquez"
2024-01-29 16:35:29 marquez-api | ! at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:693)
2024-01-29 16:35:29 marquez-api | ! at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:203)
2024-01-29 16:35:29 marquez-api | ! at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258)
2024-01-29 16:35:29 marquez-api | ! at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54)
2024-01-29 16:35:29 marquez-api | ! at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:263)
2024-01-29 16:35:29 marquez-api | ! at org.postgresql.Driver.makeConnection(Driver.java:443)
2024-01-29 16:35:29 marquez-api | ! at org.postgresql.Driver.connect(Driver.java:297)
2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.PooledConnection.connectUsingDriver(PooledConnection.java:346)
2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.PooledConnection.connect(PooledConnection.java:227)
2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.ConnectionPool.createConnection(ConnectionPool.java:772)
2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.ConnectionPool.borrowConnection(ConnectionPool.java:700)
2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.ConnectionPool.init(ConnectionPool.java:499)
2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.ConnectionPool.<init>(ConnectionPool.java:155)
2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.DataSourceProxy.pCreatePool(DataSourceProxy.java:118)
2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.DataSourceProxy.createPool(DataSourceProxy.java:107)
2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.DataSourceProxy.getConnection(DataSourceProxy.java:131)
2024-01-29 16:35:29 marquez-api | ! at org.flywaydb.core.internal.jdbc.JdbcUtils.openConnection(JdbcUtils.java:48)
2024-01-29 16:35:29 marquez-api | ! at org.flywaydb.core.internal.jdbc.JdbcConnectionFactory.<init>(JdbcConnectionFactory.java:75)
2024-01-29 16:35:29 marquez-api | ! at org.flywaydb.core.FlywayExecutor.execute(FlywayExecutor.java:147)
2024-01-29 16:35:29 marquez-api | ! at org.flywaydb.core.Flyway.info(Flyway.java:190)
2024-01-29 16:35:29 marquez-api | ! at marquez.db.DbMigration.hasPendingDbMigrations(DbMigration.java:78)
2024-01-29 16:35:29 marquez-api | ! at marquez.db.DbMigration.migrateDbOrError(DbMigration.java:33)
2024-01-29 16:35:29 marquez-api | ! at marquez.MarquezApp.run(MarquezApp.java:109)
2024-01-29 16:35:29 marquez-api | ! at marquez.MarquezApp.run(MarquezApp.java:51)
2024-01-29 16:35:29 marquez-api | ! at io.dropwizard.cli.EnvironmentCommand.run(EnvironmentCommand.java:67)
2024-01-29 16:35:29 marquez-api | ! at io.dropwizard.cli.ConfiguredCommand.run(ConfiguredCommand.java:98)
2024-01-29 16:35:29 marquez-api | ! at io.dropwizard.cli.Cli.run(Cli.java:78)
2024-01-29 16:35:29 marquez-api | ! at io.dropwizard.Application.run(Application.java:94)
2024-01-29 16:35:29 marquez-api | ! at marquez.MarquezApp.main(MarquezApp.java:63)
2024-01-29 16:35:29 marquez-api | INFO [2024-01-29 11:05:29,556] marquez.MarquezApp: Stopping app...

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-29 06:08:29
      +
      +

      *Thread Reply:* Please delete all docker volumes related to Marquez and try again
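A minimal sketch of that cleanup, assuming Marquez was started via docker compose and its volume names contain "marquez" (check docker volume ls for the actual names):

```
# Stop the stack and drop its named volumes; this wipes the Postgres data,
# so the "marquez" role is recreated cleanly on the next start.
docker compose down --volumes
# If volumes linger, find and remove them by hand:
docker volume ls | grep -i marquez
docker volume rm <volume-name>   # placeholder: substitute the listed names
```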

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2024-01-25 09:43:16
      +
      +

      @channel +Our first London meetup is happening next Wednesday, Jan. 31, at Confluent's offices in Covent Garden. Click through to the Meetup page to sign up and view an up-to-date agenda, featuring talks by @Abdallah, Kirill Kulikov at Confluent, and @Paweł Leszczyński! https://openlineage.slack.com/archives/C01CK9T7HKR/p1704736501486239

      +
      + + +
      + + + } + + Michael Robinson + (https://openlineage.slack.com/team/U02LXF3HUN7) +
      + + + + + + + + + + + + + + + + + +
      + + + +
      + ❤️ Maciej Obuchowski, Abdallah, Eric Veleker, Rodrigo Maia +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2024-01-29 12:51:10
      +
      +

      @channel +The 2023 Year in Review special issue of OpenLineage News is here! This issue contains an overview of the exciting events, developments and releases in 2023, including the Airflow Provider and Static Lineage. +To get the newsletter directly in your inbox each month, sign up here.

      +
      +
      openlineage.us14.list-manage.com
      + + + + + + + + + + + + + + + +
      + + + +
      + ❤️ Ross Turk, Harel Shein +
      + +
      + 🙌 Francis McGregor-Macdonald, Harel Shein +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Ross Turk + (ross@rossturk.com) +
      +
      2024-01-29 13:15:16
      +
      +

      *Thread Reply:* love it! so much activity!

      + + + +
      + :gratitude_thank_you: Michael Robinson +
      + +
      + ➕ Harel Shein +
      + +
      +
      +
      +
      + + + +
      diff --git a/channel/gx-integration/index.html b/channel/gx-integration/index.html index 2b0e130..112866a 100644 --- a/channel/gx-integration/index.html +++ b/channel/gx-integration/index.html @@ -52,6 +52,12 @@

      Public Channels

    4. +
    5. + + # jobs + +
    6. +
    7. # london-meetup @@ -511,7 +517,7 @@

      Group Direct Messages

    2023-10-13 15:00:37
    -

    *Thread Reply:* I'm a bit slammed today but can look on Tuesday.

    +

    *Thread Reply:* I'm a bit slammed today but can look on Tuesday.

    diff --git a/channel/jobs/index.html b/channel/jobs/index.html new file mode 100644 index 0000000..8777a8e --- /dev/null +++ b/channel/jobs/index.html @@ -0,0 +1,520 @@ + + + + + + Slack Export - #jobs + + + + + +
    + + + +
    + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-02 11:00:10
    +
    +

    @Harel Shein has joined the channel

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Harel Shein + (harel.shein@gmail.com) +
    +
    2024-01-02 11:00:27
    +
    +

    https://openlineage.slack.com/archives/C01CK9T7HKR/p1704210431604709

    +
    + + +
    + + + } + + Sheeri Cabral + (https://openlineage.slack.com/team/U0323HG8C8H) +
    + + + + + + + + + + + + + + + + + +
    + + + +
    + 👀 Maciej Obuchowski +
    + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Kavitha + (kkandaswamy@cardinalcommerce.com) +
    +
    2024-01-03 12:46:59
    +
    +

    *Thread Reply:* @Sheeri Cabral (Collibra), is this a remote position?

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Kavitha + (kkandaswamy@cardinalcommerce.com) +
    +
    2024-01-02 11:34:12
    +
    +

    @Kavitha has joined the channel

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
    +
    2024-01-02 11:59:08
    +
    +

    @Maciej Obuchowski has joined the channel

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Mattia Bertorello + (mattia.bertorello@booking.com) +
    +
    2024-01-02 12:24:20
    +
    +

    @Mattia Bertorello has joined the channel

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Laurent Paris + (laurent@datakin.com) +
    +
    2024-01-02 17:56:17
    +
    +

    @Laurent Paris has joined the channel

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Shahid Shaikh + (ssshahidwin@gmail.com) +
    +
    2024-01-02 18:59:50
    +
    +

    @Shahid Shaikh has joined the channel

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Rodrigo Maia + (rodrigo.maia@manta.io) +
    +
    2024-01-03 03:43:48
    +
    +

    @Rodrigo Maia has joined the channel

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
    +
    2024-01-03 05:00:11
    +
    +

    @Paweł Leszczyński has joined the channel

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    Michael Robinson + (michael.robinson@astronomer.io) +
    +
    2024-01-19 12:59:30
    +
    +

    @Michael Robinson has joined the channel

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    jayant joshi + (itsjayantjoshi@gmail.com) +
    +
    2024-01-24 01:10:25
    +
    +

    @jayant joshi has joined the channel

    + + + +
    +
    +
    +
    + + + +
    +
    + + + + \ No newline at end of file diff --git a/channel/london-meetup/index.html b/channel/london-meetup/index.html index 455ea1e..9734290 100644 --- a/channel/london-meetup/index.html +++ b/channel/london-meetup/index.html @@ -52,6 +52,12 @@

    Public Channels

  • +
  • + + # jobs + +
  • +
  • # london-meetup diff --git a/channel/mark-grover/index.html b/channel/mark-grover/index.html index 2cc8c15..b70bcbd 100644 --- a/channel/mark-grover/index.html +++ b/channel/mark-grover/index.html @@ -52,6 +52,12 @@

    Public Channels

  • +
  • + + # jobs + +
  • +
  • # london-meetup @@ -547,7 +553,7 @@

    Group Direct Messages

    2021-01-27 14:01:22
    -

    *Thread Reply:* He is presenting right now

    +

    *Thread Reply:* He is presenting right now

    @@ -703,7 +709,7 @@

    Group Direct Messages

    2021-01-27 14:35:49
    -

    *Thread Reply:* Anytime!

    +

    *Thread Reply:* Anytime!

    diff --git a/channel/nyc-meetup/index.html b/channel/nyc-meetup/index.html index 5ac62bd..18ea8f4 100644 --- a/channel/nyc-meetup/index.html +++ b/channel/nyc-meetup/index.html @@ -52,6 +52,12 @@

    Public Channels

  • +
  • + + # jobs + +
  • +
  • # london-meetup diff --git a/channel/open-lineage-plus-bacalhau/index.html b/channel/open-lineage-plus-bacalhau/index.html index 693df2e..a17ecf5 100644 --- a/channel/open-lineage-plus-bacalhau/index.html +++ b/channel/open-lineage-plus-bacalhau/index.html @@ -52,6 +52,12 @@

    Public Channels

  • +
  • + + # jobs + +
  • +
  • # london-meetup @@ -634,7 +640,7 @@

    Group Direct Messages

    2023-02-08 04:59:43
    -

    *Thread Reply:* Oh now that I’m reading your new operator draft AIP it all makes sense

    +

    *Thread Reply:* Oh now that I’m reading your new operator draft AIP it all makes sense

    @@ -660,7 +666,7 @@

    Group Direct Messages

    2023-02-08 05:21:02
    -

*Thread Reply:* ok so at this point the question is… for newborn operators, do you suggest starting with their own package or trying to merge into airflow.providers directly?

    +

*Thread Reply:* ok so at this point the question is… for newborn operators, do you suggest starting with their own package or trying to merge into airflow.providers directly?

    @@ -686,7 +692,7 @@

    Group Direct Messages

    2023-02-08 20:42:34
    -

    *Thread Reply:* I think you can create your own Provider package with your operator. This is more a question for the airflow mailing list.

    +

    *Thread Reply:* I think you can create your own Provider package with your operator. This is more a question for the airflow mailing list.

    @@ -712,7 +718,7 @@

    Group Direct Messages

    2023-02-08 20:43:25
    -

    *Thread Reply:* I would recommend this to add lineage support for your operator: +

    *Thread Reply:* I would recommend this to add lineage support for your operator: https://openlineage.io/docs/integrations/airflow/operator/ https://openlineage.io/docs/integrations/airflow/extractors/default-extractors

    Group Direct Messages
    2023-02-08 20:43:41
    -

*Thread Reply:* And nice to hear from you @Enrico Rotundo!

    +

*Thread Reply:* And nice to hear from you @Enrico Rotundo!

    @@ -808,7 +814,7 @@

    Group Direct Messages

    2023-02-08 20:44:11
    -

    *Thread Reply:* The next monthly meeting is tomorrow if you want to join

    +

    *Thread Reply:* The next monthly meeting is tomorrow if you want to join

    @@ -834,7 +840,7 @@

    Group Direct Messages

    2023-02-08 20:45:05
    -

    *Thread Reply:* I added you to the invite just in case

    +

    *Thread Reply:* I added you to the invite just in case

    @@ -860,7 +866,7 @@

    Group Direct Messages

    2023-02-08 20:46:17
    -

*Thread Reply:* To answer your original question, we started outside the Airflow community; now that the project is more established, it is easier to get this approved. Hoping to get this to a vote soon

    +

*Thread Reply:* To answer your original question, we started outside the Airflow community; now that the project is more established, it is easier to get this approved. Hoping to get this to a vote soon

    @@ -1031,6 +1037,58 @@

    Group Direct Messages

    + + +
    +
    + + + + +
    + +
    Rodrigo Maia + (rodrigo.maia@manta.io) +
    +
    2024-01-03 03:43:46
    +
    +

    @Rodrigo Maia has joined the channel

    + + + +
    +
    +
    +
    + + + + + +
    +
    + + + + +
    + +
    jayant joshi + (itsjayantjoshi@gmail.com) +
    +
    2024-01-24 01:10:30
    +
    +

    @jayant joshi has joined the channel

    + + + +
    +
    +
    +
    + + +
    diff --git a/channel/providence-meetup/index.html b/channel/providence-meetup/index.html index ba8dae9..de4882e 100644 --- a/channel/providence-meetup/index.html +++ b/channel/providence-meetup/index.html @@ -52,6 +52,12 @@

    Public Channels

  • +
  • + + # jobs + +
  • +
  • # london-meetup @@ -396,7 +402,7 @@

    Group Direct Messages

    2023-02-24 16:22:03
    -

    *Thread Reply:* @Viraj Parekh / @Benji Lampel anyone else?

    +

    *Thread Reply:* @Viraj Parekh / @Benji Lampel anyone else?

    diff --git a/channel/sf-meetup/index.html b/channel/sf-meetup/index.html index 73f22ed..3a6cdf2 100644 --- a/channel/sf-meetup/index.html +++ b/channel/sf-meetup/index.html @@ -52,6 +52,12 @@

    Public Channels

  • +
  • + + # jobs + +
  • +
  • # london-meetup @@ -551,28 +557,20 @@

    Group Direct Messages

    Some pictures from last night

    - + - - - +
    diff --git a/channel/spec-compliance/index.html b/channel/spec-compliance/index.html index 213f96a..482b4f5 100644 --- a/channel/spec-compliance/index.html +++ b/channel/spec-compliance/index.html @@ -52,6 +52,12 @@

    Public Channels

  • +
  • + + # jobs + +
  • +
  • # london-meetup @@ -507,7 +513,7 @@

    Group Direct Messages

    2023-01-13 18:37:43
    -

    *Thread Reply:* thanks for kick-starting / driving this @Sheeri Cabral (Collibra) (long overdue)

    +

    *Thread Reply:* thanks for kick-starting / driving this @Sheeri Cabral (Collibra) (long overdue)

    @@ -667,7 +673,7 @@

    Group Direct Messages

    2023-02-09 14:02:44
    -

    *Thread Reply:* Note that it’s possible this has already been implemented - that’s fine! We’re just gathering ideas, nothing is being set in stone here.

    +

    *Thread Reply:* Note that it’s possible this has already been implemented - that’s fine! We’re just gathering ideas, nothing is being set in stone here.

    @@ -693,7 +699,7 @@

    Group Direct Messages

    2023-02-09 14:03:04
    -

    *Thread Reply:* you mean output statistics?

    +

    *Thread Reply:* you mean output statistics?

    @@ -773,7 +779,7 @@

    Group Direct Messages

    2023-02-09 14:04:07
    -

*Thread Reply:* Yes! it’s already part of output statistics. I was just giving one example of the kind of ideas we are looking for people to throw out there. Whether or not the facet exists already. 😄 it’s brainstorming, no wrong answers. (in this case the answer is “congrats! it’s already there!”)

    +

*Thread Reply:* Yes! it’s already part of output statistics. I was just giving one example of the kind of ideas we are looking for people to throw out there. Whether or not the facet exists already. 😄 it’s brainstorming, no wrong answers. (in this case the answer is “congrats! it’s already there!”)

    @@ -803,7 +809,7 @@

    Group Direct Messages

    2023-02-09 14:14:42
    -

    *Thread Reply:* I’ve always been interested in finding a way for devs to publish business metrics from their apps - i.e., metrics specific to the job or dataset like rowsFiltered or averageValues or something. I think enabling comparison of these metrics with lineage would provide tremendous value to data pipeline engineers

    +

    *Thread Reply:* I’ve always been interested in finding a way for devs to publish business metrics from their apps - i.e., metrics specific to the job or dataset like rowsFiltered or averageValues or something. I think enabling comparison of these metrics with lineage would provide tremendous value to data pipeline engineers

    @@ -829,7 +835,7 @@

    Group Direct Messages

    2023-02-09 14:16:53
    -

    *Thread Reply:* oh yeah. even rowsExamined - e.g. rowsExamined - rowsFiltered = rowsAffected +

    *Thread Reply:* oh yeah. even rowsExamined - e.g. rowsExamined - rowsFiltered = rowsAffected some databases give those metrics from queries. And this is why they’re optional, BUT if you make a new company that has software to analyze pipeline metrics, that might be required for your software, even though it’s optional in the openlineage spec)

    @@ -856,7 +862,7 @@

    Group Direct Messages

    2023-02-10 10:07:13
    -

*Thread Reply:* I’m thinking about some custom facets that would allow interoperability between different frameworks (possibly vendors). for example, I can throw in a link to the Airflow UI for a specific task that we captured metadata for, and another tool (say, a data catalog) can also use my custom facet to help guide users directly to where the process is running. there’s a commercial aspect to this as well, but I think the community aspect is interesting. +

*Thread Reply:* I’m thinking about some custom facets that would allow interoperability between different frameworks (possibly vendors). for example, I can throw in a link to the Airflow UI for a specific task that we captured metadata for, and another tool (say, a data catalog) can also use my custom facet to help guide users directly to where the process is running. there’s a commercial aspect to this as well, but I think the community aspect is interesting. I don’t think it should be required to comply with the spec, but I’d like to be able to share this custom facet for others to see if it exists and decorate accordingly. does that make sense?
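As a sketch of that idea, a custom job facet carrying a UI link could be defined like this with openlineage-python (the facet name and url field are made up for illustration; BaseFacet as the client's extension point is an assumption about the v1 API):

```
# Hypothetical custom facet: a deep link back to the producing tool's UI
# (e.g. an Airflow task page) that a data catalog could surface next to the job.
import attr
from openlineage.client.facet import BaseFacet

@attr.s
class UiLinkJobFacet(BaseFacet):
    url: str = attr.ib()  # e.g. the Airflow task instance URL

job_facets = {"uiLink": UiLinkJobFacet(url="https://airflow.example.com/dags/my_dag")}
```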

    diff --git a/channel/toronto-meetup/index.html b/channel/toronto-meetup/index.html index 6b43ed5..423ab07 100644 --- a/channel/toronto-meetup/index.html +++ b/channel/toronto-meetup/index.html @@ -52,6 +52,12 @@

    Public Channels

  • +
  • + + # jobs + +
  • +
  • # london-meetup diff --git a/index.html b/index.html index b6a82de..d5a5c9f 100644 --- a/index.html +++ b/index.html @@ -52,6 +52,12 @@

    Public Channels

  • +
  • + + # jobs + +
  • +
  • # london-meetup @@ -557,7 +563,7 @@

    Group Direct Messages

    2020-10-25 17:22:09
    -

    *Thread Reply:* slack is the new email?

    +

    *Thread Reply:* slack is the new email?

    @@ -583,7 +589,7 @@

    Group Direct Messages

    2020-10-25 17:40:19
    -

    *Thread Reply:* :(

    +

    *Thread Reply:* :(

    @@ -609,7 +615,7 @@

    Group Direct Messages

    2020-10-27 12:27:04
    -

    *Thread Reply:* I'd prefer a google group as well

    +

    *Thread Reply:* I'd prefer a google group as well

    @@ -635,7 +641,7 @@

    Group Direct Messages

    2020-10-27 12:27:25
    -

    *Thread Reply:* I think that is better for keeping people engaged, since it isn't just a ton of history to go through

    +

    *Thread Reply:* I think that is better for keeping people engaged, since it isn't just a ton of history to go through

    @@ -661,7 +667,7 @@

    Group Direct Messages

    2020-10-27 12:27:38
    -

    *Thread Reply:* And I think it is also better for having thoughtful design discussions

    +

    *Thread Reply:* And I think it is also better for having thoughtful design discussions

    @@ -687,7 +693,7 @@

    Group Direct Messages

    2020-10-29 15:40:14
    -

    *Thread Reply:* I’m happy to create a google group if that would help.

    +

    *Thread Reply:* I’m happy to create a google group if that would help.

    @@ -713,7 +719,7 @@

    Group Direct Messages

    2020-10-29 15:45:23
    -

    *Thread Reply:* Here it is: https://groups.google.com/g/openlineage

    +

    *Thread Reply:* Here it is: https://groups.google.com/g/openlineage

    @@ -739,7 +745,7 @@

    Group Direct Messages

    2020-10-29 15:46:34
    -

    *Thread Reply:* Slack is more of a way to nudge discussions along, we can use github issues or the mailing list to discuss specific points

    +

    *Thread Reply:* Slack is more of a way to nudge discussions along, we can use github issues or the mailing list to discuss specific points

    @@ -765,7 +771,7 @@

    Group Direct Messages

    2020-11-03 17:34:53
    -

    *Thread Reply:* @Ryan Blue and @Wes McKinney any recommendations on automating sending github issues update to that list?

    +

    *Thread Reply:* @Ryan Blue and @Wes McKinney any recommendations on automating sending github issues update to that list?

    @@ -791,7 +797,7 @@

    Group Direct Messages

    2020-11-03 17:35:34
    -

    *Thread Reply:* I don't really know how to do that

    +

    *Thread Reply:* I don't really know how to do that

    @@ -817,7 +823,7 @@

    Group Direct Messages

    2021-04-02 07:18:25
    -

*Thread Reply:* @Julien Le Dem How about using GitHub Discussions? They are specifically meant to solve this problem. The feature is still in beta, but it can be enabled from the repository settings. One positive side I see is that it will be really easy to follow through, and there will be one separate place to go and look for the discussions and ideas being discussed.

    +

*Thread Reply:* @Julien Le Dem How about using GitHub Discussions? They are specifically meant to solve this problem. The feature is still in beta, but it can be enabled from the repository settings. One positive side I see is that it will be really easy to follow through, and there will be one separate place to go and look for the discussions and ideas being discussed.

    @@ -843,7 +849,7 @@

    Group Direct Messages

    2021-04-02 19:51:55
    -

    *Thread Reply:* I just enabled it: https://github.com/OpenLineage/OpenLineage/discussions

    +

    *Thread Reply:* I just enabled it: https://github.com/OpenLineage/OpenLineage/discussions

    @@ -899,7 +905,7 @@

    Group Direct Messages

    2020-10-25 17:21:44
    -

    *Thread Reply:* the plan is to use github issues for discussions on the spec. This is to supplement

    +

    *Thread Reply:* the plan is to use github issues for discussions on the spec. This is to supplement

    @@ -1313,7 +1319,7 @@

    Group Direct Messages

    2020-12-10 17:04:39
    -

    *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/1

    +

    *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/1

    @@ -1471,7 +1477,7 @@

    Group Direct Messages

    2020-12-14 17:55:46
    -

    *Thread Reply:* Looks great!

    +

    *Thread Reply:* Looks great!

    @@ -1620,7 +1626,7 @@

    Group Direct Messages

    2020-12-16 21:41:16
    -

    *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/1

    +

    *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/1

    @@ -1802,7 +1808,7 @@

    Group Direct Messages

    2020-12-21 12:15:42
    -

    *Thread Reply:* Welcome!

    +

    *Thread Reply:* Welcome!

    @@ -1888,7 +1894,7 @@

    Group Direct Messages

    2020-12-21 10:30:51
    -

    *Thread Reply:* Yes, that’s about right.

    +

    *Thread Reply:* Yes, that’s about right.

    @@ -1914,7 +1920,7 @@

    Group Direct Messages

    2020-12-21 10:31:45
    -

    *Thread Reply:* There’s some subtlety here.

    +

    *Thread Reply:* There’s some subtlety here.

    @@ -1940,7 +1946,7 @@

    Group Direct Messages

    2020-12-21 10:32:02
    -

    *Thread Reply:* The initial OpenLineage spec is pretty explicit about linking metadata primarily to execution of specific tasks, which is appropriate for ValidationResults in Great Expectations

    +

    *Thread Reply:* The initial OpenLineage spec is pretty explicit about linking metadata primarily to execution of specific tasks, which is appropriate for ValidationResults in Great Expectations

    @@ -1970,7 +1976,7 @@

    Group Direct Messages

    2020-12-21 10:32:57
    -

    *Thread Reply:* There isn’t as strong a concept of persistent data objects (e.g. a specific table, or batches of data from a specific table)

    +

    *Thread Reply:* There isn’t as strong a concept of persistent data objects (e.g. a specific table, or batches of data from a specific table)

    @@ -2000,7 +2006,7 @@

    Group Direct Messages

    2020-12-21 10:33:20
    -

    *Thread Reply:* (In the GE ecosystem, we call these DataAssets and Batches)

    +

    *Thread Reply:* (In the GE ecosystem, we call these DataAssets and Batches)

    @@ -2026,7 +2032,7 @@

    Group Direct Messages

    2020-12-21 10:33:56
    -

    *Thread Reply:* This is also an important conceptual unit, since it’s the level of analysis where Expectations and data docs would typically attach.

    +

    *Thread Reply:* This is also an important conceptual unit, since it’s the level of analysis where Expectations and data docs would typically attach.

    @@ -2056,7 +2062,7 @@

    Group Direct Messages

    2020-12-21 10:34:47
    -

    *Thread Reply:* @James Campbell and I have had some productive conversations with @Julien Le Dem and others about this topic

    +

    *Thread Reply:* @James Campbell and I have had some productive conversations with @Julien Le Dem and others about this topic

    @@ -2082,7 +2088,7 @@

    Group Direct Messages

    2020-12-21 12:20:53
    -

    *Thread Reply:* Yep! The next step will be to open a few github issues with proposals to add to or amend the spec. We would probably start with a Descriptive Dataset facet of a dataset profile (or dataset update profile). There are other aspects to clarify as well as @Abe Gong is explaining above.

    +

    *Thread Reply:* Yep! The next step will be to open a few github issues with proposals to add to or amend the spec. We would probably start with a Descriptive Dataset facet of a dataset profile (or dataset update profile). There are other aspects to clarify as well as @Abe Gong is explaining above.

    @@ -2138,7 +2144,7 @@

    Group Direct Messages

    2020-12-21 12:23:41
    -

    *Thread Reply:* Totally! We had a quick discussion with Dagster. Looking forward to proposals along those lines.

    +

    *Thread Reply:* Totally! We had a quick discussion with Dagster. Looking forward to proposals along those lines.

    @@ -2194,7 +2200,7 @@

    Group Direct Messages

    2020-12-21 14:48:11
    -

    *Thread Reply:* Thanks, @Harikiran Nayak! It’s amazing to see such interest in the community on defining a standard for lineage metadata collection.

    +

    *Thread Reply:* Thanks, @Harikiran Nayak! It’s amazing to see such interest in the community on defining a standard for lineage metadata collection.

    @@ -2220,7 +2226,7 @@

    Group Direct Messages

    2020-12-21 15:03:29
    -

    *Thread Reply:* Yep! Its a validation that the problem is real!

    +

    *Thread Reply:* Yep! Its a validation that the problem is real!

    @@ -2277,7 +2283,7 @@

    Group Direct Messages

    2020-12-22 13:23:43
    -

    *Thread Reply:* Welcome!

    +

    *Thread Reply:* Welcome!

    @@ -2307,7 +2313,7 @@

    Group Direct Messages

    2020-12-30 22:30:23
    -

*Thread Reply:* What are your current use cases for lineage?

    +

*Thread Reply:* What are your current use cases for lineage?

    @@ -2440,7 +2446,7 @@

    Group Direct Messages

    2020-12-30 22:28:54
    -

    *Thread Reply:* Welcome!

    +

    *Thread Reply:* Welcome!

    @@ -2466,7 +2472,7 @@

    Group Direct Messages

    2020-12-30 22:29:43
    -

    *Thread Reply:* Would you mind sharing your main use cases for collecting lineage?

    +

    *Thread Reply:* Would you mind sharing your main use cases for collecting lineage?

    @@ -2548,7 +2554,7 @@

    Group Direct Messages

    2021-01-05 12:55:01
    -

*Thread Reply:* Definitely, each dashboard should become a node in the lineage graph. That way you can understand all the dependencies of a given dashboard. Some examples of interesting metadata around this: is the dashboard updated in a timely fashion (data freshness); is the data correct (data quality)? Observing changes upstream of the dashboard will provide insights into what’s happening when freshness or quality suffer

    +

*Thread Reply:* Definitely, each dashboard should become a node in the lineage graph. That way you can understand all the dependencies of a given dashboard. Some examples of interesting metadata around this: is the dashboard updated in a timely fashion (data freshness); is the data correct (data quality)? Observing changes upstream of the dashboard will provide insights into what’s happening when freshness or quality suffer

    @@ -2574,7 +2580,7 @@

    Group Direct Messages

    2021-01-05 13:20:41
    -

    *Thread Reply:* 100%. On a granular scale, the difference between a visualization and dashboard can be interesting. One visualization can be connected to multiple dashboards. But of course this depends on the BI tool, Redash would be an example in this case.

    +

    *Thread Reply:* 100%. On a granular scale, the difference between a visualization and dashboard can be interesting. One visualization can be connected to multiple dashboards. But of course this depends on the BI tool, Redash would be an example in this case.

    @@ -2600,7 +2606,7 @@

    Group Direct Messages

    2021-01-05 15:15:23
    -

    *Thread Reply:* We would need to decide how to model those things. Possibly as a Job type for dashboard and visualization.

    +

    *Thread Reply:* We would need to decide how to model those things. Possibly as a Job type for dashboard and visualization.

    @@ -2626,7 +2632,7 @@

    Group Direct Messages

    2021-01-06 18:20:06
    -

*Thread Reply:* It could be. It’s interesting: in Redash, for example, you create custom queries that run at certain intervals to produce the data you need to visualize. Pretty much equivalent to a job. But you then build certain visualizations off of that “job”. Then you build dashboards off of visualizations. So you could model it as a job, or it could make sense for it to be modeled more like a dataset.

    +

*Thread Reply:* It could be. It’s interesting: in Redash, for example, you create custom queries that run at certain intervals to produce the data you need to visualize. Pretty much equivalent to a job. But you then build certain visualizations off of that “job”. Then you build dashboards off of visualizations. So you could model it as a job, or it could make sense for it to be modeled more like a dataset.

That’s the hard part of this. How do you model a visualization/dashboard across all the possible ways they can be created, since it differs depending on how the tool you use abstracts away creating a visualization.

    @@ -2688,7 +2694,7 @@

    Group Direct Messages

    2021-01-05 17:10:22
    -

    *Thread Reply:* Part of my role at Netflix is to oversee our data lineage story so very interested in this effort and hope to be able to participate in its success

    +

    *Thread Reply:* Part of my role at Netflix is to oversee our data lineage story so very interested in this effort and hope to be able to participate in its success

    @@ -2714,7 +2720,7 @@

    Group Direct Messages

    2021-01-05 18:12:48
    -

    *Thread Reply:* Hi Jason and welcome

    +

    *Thread Reply:* Hi Jason and welcome

    @@ -2804,7 +2810,7 @@

    Group Direct Messages

    2021-01-07 12:46:19
    -

    *Thread Reply:* The OpenLineage reference implementation in Marquez will be presented this morning Thursday (01/07) at 10AM PST, at the Marquez Community meeting.

    +

    *Thread Reply:* The OpenLineage reference implementation in Marquez will be presented this morning Thursday (01/07) at 10AM PST, at the Marquez Community meeting.

When: Thursday, January 7th at 10AM PST
Where: https://us02web.zoom.us/j/89344845719?pwd=Y09RZkxMZHc2U3pOTGZ6SnVMUUVoQT09

    @@ -2833,7 +2839,7 @@

    Group Direct Messages

    2021-01-07 12:46:36
    2021-01-07 12:46:44
    -

    *Thread Reply:* that’s in 15 min

    +

    *Thread Reply:* that’s in 15 min

    @@ -2885,7 +2891,7 @@

    Group Direct Messages

    2021-01-07 12:48:35
    2021-01-12 17:10:23
    -

    *Thread Reply:* And it’s merged!

    +

    *Thread Reply:* And it’s merged!

    @@ -2937,7 +2943,7 @@

    Group Direct Messages

    2021-01-12 17:10:53
    -

    *Thread Reply:* Marquez now has a reference implementation of the initial OpenLineage spec

    +

    *Thread Reply:* Marquez now has a reference implementation of the initial OpenLineage spec

    @@ -3019,7 +3025,7 @@

    Group Direct Messages

    2021-01-13 19:06:34
    -

*Thread Reply:* There’s no explicit roadmap so far. With the initial spec defined and the reference implementation implemented, next steps are to define more facets (for example, data shape, dataset size, etc), provide clients to facilitate integrations (java, python, …), implement more integrations (Spark in the works). Members of the community are welcome to drive their own initiatives around the core spec. One of the design goals of the facet is to enable numerous and independent parallel efforts

    +

*Thread Reply:* There’s no explicit roadmap so far. With the initial spec defined and the reference implementation implemented, next steps are to define more facets (for example, data shape, dataset size, etc), provide clients to facilitate integrations (java, python, …), implement more integrations (Spark in the works). Members of the community are welcome to drive their own initiatives around the core spec. One of the design goals of the facet is to enable numerous and independent parallel efforts

    @@ -3045,7 +3051,7 @@

    Group Direct Messages

    2021-01-13 19:06:48
    -

*Thread Reply:* Is there anything in particular you are interested in?

    +

*Thread Reply:* Is there anything in particular you are interested in?

    @@ -3230,7 +3236,7 @@

    Group Direct Messages

    2021-02-01 17:55:59
    -

    *Thread Reply:* You can find the initial spec here: https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md +

    *Thread Reply:* You can find the initial spec here: https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md The process to contribute to the model is described here: https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md In particular, now we’d want to contribute more facets and integrations. Marquez has a reference implementation: https://github.com/MarquezProject/marquez/pull/880 @@ -3294,7 +3300,7 @@

    Group Direct Messages

    2021-02-01 17:57:05
    -

    *Thread Reply:* It is not on general, as it can be a bit noisy

    +

    *Thread Reply:* It is not on general, as it can be a bit noisy

    @@ -3346,7 +3352,7 @@

    Group Direct Messages

    2021-02-09 20:02:48
    -

*Thread Reply:* Hi @Zachary Friedman! I replied below. https://openlineage.slack.com/archives/C01CK9T7HKR/p1612918909009900

    +

*Thread Reply:* Hi @Zachary Friedman! I replied below. https://openlineage.slack.com/archives/C01CK9T7HKR/p1612918909009900

    @@ -3506,7 +3512,7 @@

    Group Direct Messages

    2021-02-09 21:27:26
    -

    *Thread Reply:* Feel free to comment, this is ready to merge

    +

    *Thread Reply:* Feel free to comment, this is ready to merge

    @@ -3532,7 +3538,7 @@

    Group Direct Messages

    2021-02-11 20:12:18
    -

    *Thread Reply:* Thanks, Julien. The new spec format looks great 👍

    +

    *Thread Reply:* Thanks, Julien. The new spec format looks great 👍

    @@ -3764,7 +3770,7 @@

    Group Direct Messages

    2021-02-19 14:27:23
    -

    *Thread Reply:* @Kevin Mellott and @Petr Šimeček Thanks for the confirmation on this slack message. To make your comment visible to the wider community, please chime in on the github issue as well: https://github.com/OpenLineage/OpenLineage/issues/20 +

    *Thread Reply:* @Kevin Mellott and @Petr Šimeček Thanks for the confirmation on this slack message. To make your comment visible to the wider community, please chime in on the github issue as well: https://github.com/OpenLineage/OpenLineage/issues/20 Thank you.

    @@ -3860,7 +3866,7 @@

    Group Direct Messages

    2021-02-19 14:27:46
    -

    *Thread Reply:* The PR is out for this: https://github.com/OpenLineage/OpenLineage/pull/23

    +

    *Thread Reply:* The PR is out for this: https://github.com/OpenLineage/OpenLineage/pull/23

    @@ -4044,7 +4050,7 @@

    Group Direct Messages

    2021-02-19 05:03:21
    -

    *Thread Reply:* Okay, I found the spark integration in Marquez calls the /lineage endpoint. But I am still curious about the future plan to integrate with other tools, like airflow?

    +

    *Thread Reply:* Okay, I found the spark integration in Marquez calls the /lineage endpoint. But I am still curious about the future plan to integrate with other tools, like airflow?

    @@ -4070,7 +4076,7 @@

    Group Direct Messages

    2021-02-19 12:41:23
    -

*Thread Reply:* Just restating some of my answers from the Marquez Slack for the benefit of folks here.

    +

*Thread Reply:* Just restating some of my answers from the Marquez Slack for the benefit of folks here.

    • OpenLineage defines the schema to collect metadata • Marquez has a /lineage endpoint implementing the OpenLineage spec to receive this metadata, implemented by the OpenLineageResource you pointed out @@ -4103,7 +4109,7 @@

    Group Direct Messages

    2021-02-22 03:55:18
    -

    *Thread Reply:* thank you! very clear answer🙂

    +

    *Thread Reply:* thank you! very clear answer🙂

    @@ -4155,7 +4161,7 @@

    Group Direct Messages

    2021-03-02 17:51:29
    -

    *Thread Reply:* I forgot to reply in thread 🙂 https://openlineage.slack.com/archives/C01CK9T7HKR/p1614725462008300

    +

    *Thread Reply:* I forgot to reply in thread 🙂 https://openlineage.slack.com/archives/C01CK9T7HKR/p1614725462008300

    @@ -4273,7 +4279,7 @@

    Group Direct Messages

    2021-03-16 15:07:04
    -

    *Thread Reply:* Ha, I signed up just to ask this precise question!

    +

    *Thread Reply:* Ha, I signed up just to ask this precise question!

    @@ -4303,7 +4309,7 @@

    Group Direct Messages

    2021-03-16 15:07:44
    -

    *Thread Reply:* I’m still looking into the spec myself. Are we required to have 1 or more runs per Job? Or can a Job exist without a run event?

    +

    *Thread Reply:* I’m still looking into the spec myself. Are we required to have 1 or more runs per Job? Or can a Job exist without a run event?

    @@ -4329,7 +4335,7 @@

    Group Direct Messages

    2021-04-02 07:24:39
    -

*Thread Reply:* A run event can be emitted when the job starts, and it can stay in the RUNNING state unless something happens to the job. Additionally, you could send an event periodically with state RUNNING to inform the system that the job is healthy.

    +

*Thread Reply:* A run event can be emitted when the job starts, and it can stay in the RUNNING state unless something happens to the job. Additionally, you could send an event periodically with state RUNNING to inform the system that the job is healthy.
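
A minimal sketch of that heartbeat pattern (the job name, namespace, and Marquez’s default /lineage endpoint below are assumptions for illustration):

```
import uuid
from datetime import datetime, timezone

import requests

RUN_ID = str(uuid.uuid4())  # one UUID for the whole run

def emit(event_type):
    # Build and send a minimal OpenLineage run event.
    event = {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": RUN_ID},
        "job": {"namespace": "my-namespace", "name": "my-long-running-job"},
        "producer": "https://github.com/my-org/my-agent",
    }
    requests.post("http://localhost:5000/api/v1/lineage", json=event)

emit("START")
# ...then, from a periodic timer while the job is healthy:
emit("RUNNING")
```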

    @@ -4515,7 +4521,7 @@

    Group Direct Messages

    2021-03-17 10:05:53
    -

*Thread Reply:* That's a good idea.

    +

*Thread Reply:* That's a good idea.

Now I'm wondering - let's say we want to track the offset at which a checkpoint finished processing. That would mean we want to expose the checkpoint id, time, and offset. I suppose we don't want to overwrite previous checkpoint info, so we'd want some collection of data in this facet.
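
One illustrative (non-spec) shape for that: a run facet holding the collection of checkpoints, re-sent in full on each event rather than merged, since a new facet instance replaces the previous one:

```
# Hypothetical "checkpoints" run facet; the name and fields are invented.
# The whole facet is replaced on every event, so each event carries the
# full list of checkpoints known at emission time.
run_facets = {
    "checkpoints": {
        "_producer": "https://github.com/my-org/my-agent",
        "_schemaURL": "https://my-org.example/schemas/CheckpointsFacet.json",
        "checkpoints": [
            {"id": 41, "time": "2021-03-17T10:00:00Z", "offset": 128000},
            {"id": 42, "time": "2021-03-17T10:05:00Z", "offset": 131072},
        ],
    }
}
```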

    @@ -4572,7 +4578,7 @@

    Group Direct Messages

    2021-03-17 09:18:49
    -

    *Thread Reply:* Thanks Julien! I will try to wrap my head around some use-cases and see how it maps to the current spec. From there, I can see if I can figure out any proposals

    +

    *Thread Reply:* Thanks Julien! I will try to wrap my head around some use-cases and see how it maps to the current spec. From there, I can see if I can figure out any proposals

    @@ -4598,7 +4604,7 @@

    Group Direct Messages

    2021-03-17 13:43:29
    -

    *Thread Reply:* You can use the proposal issue template to propose a new facet for example: https://github.com/OpenLineage/OpenLineage/issues/new/choose

    +

    *Thread Reply:* You can use the proposal issue template to propose a new facet for example: https://github.com/OpenLineage/OpenLineage/issues/new/choose

    @@ -4794,7 +4800,7 @@

    Group Direct Messages

    2021-03-17 16:42:24
    -

*Thread Reply:* One thing we were considering is just adding these in as Facets (Tags, as per Marquez), and then plugging into some external people-management system. However, I think the question can be generalized to “should there be some sort of generic entity that can enable relationships between itself and (Datasets, Jobs, Runs) as part of an integration element?”

    +

*Thread Reply:* One thing we were considering is just adding these in as Facets (Tags, as per Marquez), and then plugging into some external people-management system. However, I think the question can be generalized to “should there be some sort of generic entity that can enable relationships between itself and (Datasets, Jobs, Runs) as part of an integration element?”

    @@ -4820,7 +4826,7 @@

    Group Direct Messages

    2021-03-18 16:03:55
    -

*Thread Reply:* That’s a great topic of discussion. I would definitely use the OpenLineage facets to capture what you describe as an aspect above. The current Marquez model has a simple notion of ownership at the namespace level, but this needs to be extended to enable the use cases you are describing (owning a dataset or a job). Right now the owner is just a generic identifier as a string (a user id or a group id, for example). Once things are tagged (in some way), you can use the lineage API to find all the downstream or upstream jobs and datasets. In OpenLineage I would start by being able to capture the owner identifier in a facet, with contact info optional if it’s available at runtime. It will have the advantage of keeping track of how that changed over time. This definitely deserves its own discussion.

    +

*Thread Reply:* That’s a great topic of discussion. I would definitely use the OpenLineage facets to capture what you describe as an aspect above. The current Marquez model has a simple notion of ownership at the namespace level, but this needs to be extended to enable the use cases you are describing (owning a dataset or a job). Right now the owner is just a generic identifier as a string (a user id or a group id, for example). Once things are tagged (in some way), you can use the lineage API to find all the downstream or upstream jobs and datasets. In OpenLineage I would start by being able to capture the owner identifier in a facet, with contact info optional if it’s available at runtime. It will have the advantage of keeping track of how that changed over time. This definitely deserves its own discussion.

    @@ -4846,7 +4852,7 @@

    Group Direct Messages

    2021-03-18 17:52:13
    -

    *Thread Reply:* And also to make sure I understand your use case, you want to be able to notify the consumers of a dataset that it is being discontinued/replaced/… ? What else are you thinking about?

    +

    *Thread Reply:* And also to make sure I understand your use case, you want to be able to notify the consumers of a dataset that it is being discontinued/replaced/… ? What else are you thinking about?

    @@ -4872,7 +4878,7 @@

    Group Direct Messages

    2021-03-22 09:15:19
    -

    *Thread Reply:* Let me pull in my colleagues

    +

    *Thread Reply:* Let me pull in my colleagues

    @@ -4898,7 +4904,7 @@

    Group Direct Messages

    2021-03-22 09:15:24
    -

    *Thread Reply:* Standby

    +

    *Thread Reply:* Standby

    @@ -4924,7 +4930,7 @@

    Group Direct Messages

    2021-03-22 10:59:57
    -

    *Thread Reply:* 👋 Hi Julien. I’m Olessia, I’m working on the metadata collection implementation with Adam. Some thought on this:

    +

    *Thread Reply:* 👋 Hi Julien. I’m Olessia, I’m working on the metadata collection implementation with Adam. Some thought on this:

    @@ -4950,7 +4956,7 @@

    Group Direct Messages

    2021-03-22 11:00:45
    -

    *Thread Reply:* To start off, we’re thinking that there often isn’t a single owner, but rather a set of Stakeholders that evolve over time. So we’d like to be able to attach multiple entries, possibly of different types, to a Dataset. We’re also thinking that a dataset should have at least one owner. So a few things I’d like to confirm/discuss options:

    • If I were to stay true to the spec as it’s defined atm I wouldn’t be able to add a required facet. True/false?
• According to the readme, “...emitting a new facet with the same name for the same entity replaces the previous facet instance for that entity entirely”. If we were to store multiple stakeholders, we’d have a field “stakeholders” and its value would be a list? This would make queries involving stakeholders not very straightforward. If the facet is overwritten every time, how do I a) add individuals to the list b) track changes to the list over time. Let me know what I’m missing, because based on what you said above tracking facet changes over time is possible.
    • Run events are issued by a scheduler. Why should it be in the domain of the scheduler to know the entire list of Stakeholders?
    • I noticed that Marquez has separate endpoints to capture information about Datasets, and some additional information beyond what’s described in the spec is required. In this context, we could add a required Stakeholder facets on a Dataset, and potentially even additional end points to add and remove Stakeholders. Is that a valid way to go about this, in your opinion?
    • +

      *Thread Reply:* To start off, we’re thinking that there often isn’t a single owner, but rather a set of Stakeholders that evolve over time. So we’d like to be able to attach multiple entries, possibly of different types, to a Dataset. We’re also thinking that a dataset should have at least one owner. So a few things I’d like to confirm/discuss options:

      • If I were to stay true to the spec as it’s defined atm I wouldn’t be able to add a required facet. True/false?
• According to the readme, “...emitting a new facet with the same name for the same entity replaces the previous facet instance for that entity entirely”. If we were to store multiple stakeholders, we’d have a field “stakeholders” and its value would be a list? This would make queries involving stakeholders not very straightforward. If the facet is overwritten every time, how do I a) add individuals to the list b) track changes to the list over time. Let me know what I’m missing, because based on what you said above tracking facet changes over time is possible.
      • Run events are issued by a scheduler. Why should it be in the domain of the scheduler to know the entire list of Stakeholders?
      • I noticed that Marquez has separate endpoints to capture information about Datasets, and some additional information beyond what’s described in the spec is required. In this context, we could add a required Stakeholder facets on a Dataset, and potentially even additional end points to add and remove Stakeholders. Is that a valid way to go about this, in your opinion?

      Curious to hear your thoughts on all of this!
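
For concreteness, one hypothetical shape for such a dataset facet (not defined by the spec; names and fields are illustrative). Because a facet is replaced wholesale, each event carries the full list, and point-in-time history comes from replaying stored events:

```
# Hypothetical "stakeholders" dataset facet.
dataset = {
    "namespace": "example",
    "name": "public.orders",
    "facets": {
        "stakeholders": {
            "_producer": "https://github.com/my-org/my-agent",
            "_schemaURL": "https://my-org.example/schemas/StakeholdersFacet.json",
            "stakeholders": [
                {"id": "team-data-platform", "type": "owner"},
                {"id": "jane.doe", "type": "steward"},
            ],
        }
    },
}
```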

      @@ -4979,7 +4985,7 @@

      Group Direct Messages

    2021-03-24 17:06:50
    -

    *Thread Reply:* > To start off, we’re thinking that there often isn’t a single owner, but rather a set of Stakeholders that evolve over time. So we’d like to be able to attach multiple entries, possibly of different types, to a Dataset. We’re also thinking > that a dataset should have at least one owner. So a few things I’d like to confirm/discuss options: +

*Thread Reply:* > To start off, we’re thinking that there often isn’t a single owner, but rather a set of Stakeholders that evolve over time. So we’d like to be able to attach multiple entries, possibly of different types, to a Dataset. We’re also thinking > that a dataset should have at least one owner. So a few things I’d like to confirm/discuss options: > -> If I were to stay true to the spec as it’s defined atm I wouldn’t be able to add a required facet. True/false? Correct. The spec defines what facets look like (and how you can make your own custom facets), but it does not make statements about whether facets are required. However, you can have your own validation and make certain things required, if you wish, on the client side.   @@ -5025,7 +5031,7 @@

    Group Direct Messages

    2021-03-24 17:06:50
    -

    *Thread Reply:* +

*Thread Reply:* Marquez existed before OpenLineage. In particular the /run end-point to create and update runs will be deprecated as the OpenLineage /lineage endpoint replaces it. At the moment we are mapping OpenLineage metadata to Marquez. Soon Marquez will have all the facets exposed in the Marquez API. (See: https://github.com/MarquezProject/marquez/pull/894/files) We could make Marquez configurable or pluggable for validation purposes. There is already a notion of LineageListener, for example. Although Marquez collects the metadata, I feel like this validation would be better upstream or with some other mechanism. The question is when do you create a dataset vs when do you become a stakeholder? What are the various stakeholders, and what is the responsibility of the minimum one stakeholder? I would probably make it required, when deploying the job, that the stakeholder is defined. This would apply to the output dataset and would be collected in Marquez.

    @@ -5059,7 +5065,7 @@

    Group Direct Messages

    2021-05-24 16:27:03
    -

    *Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1621887895004200

    +

    *Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1621887895004200

    @@ -5266,7 +5272,7 @@

    Group Direct Messages

    2021-04-02 19:54:52
    -

*Thread Reply:* Marquez is the reference implementation for receiving events and tracking changes. But the definition of the API lets others receive them (and also enables using OpenLineage events to sync between systems)

    +

*Thread Reply:* Marquez is the reference implementation for receiving events and tracking changes. But the definition of the API lets others receive them (and also enables using OpenLineage events to sync between systems)

    @@ -5292,7 +5298,7 @@

    Group Direct Messages

    2021-04-02 19:55:32
    -

    *Thread Reply:* In particular, Egeria is involved in enabling receiving and emitting openlineage

    +

    *Thread Reply:* In particular, Egeria is involved in enabling receiving and emitting openlineage

    @@ -5318,7 +5324,7 @@

    Group Direct Messages

    2021-04-03 18:03:01
    -

    *Thread Reply:* Thanks @Julien Le Dem. So to get specific, if dbt were to emit OpenLineage events, how would this work? Would dbt Cloud hypothetically allow users to configure an endpoint to send OpenLineage events to, similar in UI implementation to configuring a Stripe webhook perhaps? And then whatever server the user would input here would point to somewhere that implements receipt of OpenLineage payloads? This is all a very hypothetical example, but trying to ground it in something I have a solid mental model for.

    +

    *Thread Reply:* Thanks @Julien Le Dem. So to get specific, if dbt were to emit OpenLineage events, how would this work? Would dbt Cloud hypothetically allow users to configure an endpoint to send OpenLineage events to, similar in UI implementation to configuring a Stripe webhook perhaps? And then whatever server the user would input here would point to somewhere that implements receipt of OpenLineage payloads? This is all a very hypothetical example, but trying to ground it in something I have a solid mental model for.

    @@ -5344,7 +5350,7 @@

    Group Direct Messages

    2021-04-05 17:51:57
    -

    *Thread Reply:* hypothetically speaking, that all sounds right. so a user, who, e.g., has a dbt pipeline and an AWS glue pipeline could configure both of those projects to point to the same open lineage service and get their entire lineage graph even if the two pipelines aren't connected.

    +

    *Thread Reply:* hypothetically speaking, that all sounds right. so a user, who, e.g., has a dbt pipeline and an AWS glue pipeline could configure both of those projects to point to the same open lineage service and get their entire lineage graph even if the two pipelines aren't connected.

    @@ -5370,7 +5376,7 @@

    Group Direct Messages

    2021-04-06 20:33:51
    -

    *Thread Reply:* Yeah, OpenLineage events need to be published to a backend (can be Kafka, can be a graphDB, etc). Your Stripe webhook analogy is aligned with how events can be received. For example, in Marquez, we expose a /lineage endpoint that consumes OpenLineage events. We then map an OpenLineage event to the Marquez model (sources, datasets, jobs, runs) that’s persisted in postgres.

    +

    *Thread Reply:* Yeah, OpenLineage events need to be published to a backend (can be Kafka, can be a graphDB, etc). Your Stripe webhook analogy is aligned with how events can be received. For example, in Marquez, we expose a /lineage endpoint that consumes OpenLineage events. We then map an OpenLineage event to the Marquez model (sources, datasets, jobs, runs) that’s persisted in postgres.
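
Concretely, publishing an event to Marquez is a single HTTP POST; a sketch, assuming Marquez’s default API port (5000) and made-up job/dataset names:

```
import requests

event = {
    "eventType": "COMPLETE",
    "eventTime": "2021-04-06T20:00:00Z",
    "run": {"runId": "d46e465b-d358-4d32-83d4-df660ff614dd"},
    "job": {"namespace": "my-namespace", "name": "example-job"},
    "outputs": [{"namespace": "example", "name": "public.output_table"}],
    "producer": "https://github.com/my-org/my-agent",
}

resp = requests.post("http://localhost:5000/api/v1/lineage", json=event)
resp.raise_for_status()  # expect a 2xx when Marquez accepts the event
```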

    @@ -5396,7 +5402,7 @@

    Group Direct Messages

    2021-04-07 10:47:06
    -

    *Thread Reply:* Thanks both!

    +

    *Thread Reply:* Thanks both!

    @@ -5422,7 +5428,7 @@

    Group Direct Messages

    2021-04-13 20:52:53
    -

    *Thread Reply:* sorry, I was away last week. Yes that sounds right.

    +

    *Thread Reply:* sorry, I was away last week. Yes that sounds right.

    @@ -5474,7 +5480,7 @@

    Group Direct Messages

    2021-04-14 13:12:31
    -

*Thread Reply:* Welcome, @Jakub Moravec 👋 . Given that you're able to retrieve metadata using the Marquez API, you should be able to also view dataset and job metadata in the UI. Mind using the search bar in the top right-hand corner of the UI to see if your metadata is searchable? The UI only renders jobs and datasets that are connected in the lineage graph. We're working towards a more general metadata exploration experience, but currently the lineage graph is the main experience.

    +

*Thread Reply:* Welcome, @Jakub Moravec 👋 . Given that you're able to retrieve metadata using the Marquez API, you should be able to also view dataset and job metadata in the UI. Mind using the search bar in the top right-hand corner of the UI to see if your metadata is searchable? The UI only renders jobs and datasets that are connected in the lineage graph. We're working towards a more general metadata exploration experience, but currently the lineage graph is the main experience.

    @@ -5530,7 +5536,7 @@

    Group Direct Messages

    2021-04-13 20:56:42
    -

    *Thread Reply:* Your intuition is right here. I think we should define an input facet that specifies which dataset version is being read. Similarly you would have an output facet that specifies what version is being produced. This would apply to storage layers like Deltalake and Iceberg as well.

    +

    *Thread Reply:* Your intuition is right here. I think we should define an input facet that specifies which dataset version is being read. Similarly you would have an output facet that specifies what version is being produced. This would apply to storage layers like Deltalake and Iceberg as well.
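
A sketch of what such an input facet could look like (hypothetical shape and names; no version facet was defined in the spec at the time of this discussion):

```
# Hypothetical facet recording which snapshot of the input was read.
input_dataset = {
    "namespace": "s3://my-bucket",
    "name": "warehouse/db/events",
    "facets": {
        "version": {
            "_producer": "https://github.com/my-org/my-agent",
            "_schemaURL": "https://my-org.example/schemas/DatasetVersionFacet.json",
            "datasetVersion": "8744736658442914487",  # e.g. an Iceberg snapshot id
        }
    },
}
```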

    @@ -5556,7 +5562,7 @@

    Group Direct Messages

    2021-04-13 20:57:58
    -

    *Thread Reply:* Regarding the race condition, input and output facets are attached to the run. The version of the dataset that was read is an attribute of a run and should not modify the dataset itself.

    +

    *Thread Reply:* Regarding the race condition, input and output facets are attached to the run. The version of the dataset that was read is an attribute of a run and should not modify the dataset itself.

    @@ -5582,7 +5588,7 @@

    Group Direct Messages

    2021-04-13 21:01:34
    -

    *Thread Reply:* See the Dataset description here: https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md#core-lineage-model

    +

    *Thread Reply:* See the Dataset description here: https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md#core-lineage-model

    @@ -5787,7 +5793,7 @@

    Group Direct Messages

    2021-04-19 16:17:06
    -

    *Thread Reply:* I think for Trino integration you'd be looking at writing a Trino extractor if I'm not mistaken, yes?

    +

    *Thread Reply:* I think for Trino integration you'd be looking at writing a Trino extractor if I'm not mistaken, yes?

    @@ -5813,7 +5819,7 @@

    Group Direct Messages

    2021-04-19 16:17:23
    -

    *Thread Reply:* But extractor would obviously be at the Marquez layer not OpenLineage

    +

    *Thread Reply:* But extractor would obviously be at the Marquez layer not OpenLineage

    @@ -5839,7 +5845,7 @@

    Group Direct Messages

    2021-04-19 16:19:00
    -

    *Thread Reply:* And hopefully the metadata you'd be looking to extract from Trino wouldn't have any connector-specific syntax restrictions.

    +

    *Thread Reply:* And hopefully the metadata you'd be looking to extract from Trino wouldn't have any connector-specific syntax restrictions.

    @@ -5891,7 +5897,7 @@

    Group Direct Messages

    2021-04-19 21:42:14
    -

    *Thread Reply:* This is a proper usage. That schema is optional if it’s not available.

    +

    *Thread Reply:* This is a proper usage. That schema is optional if it’s not available.

    @@ -5917,7 +5923,7 @@

    Group Direct Messages

    2021-04-19 21:43:27
    -

    *Thread Reply:* You would model it as a job reading from a folder (the input dataset) in the input bucket and writing to a folder (the output dataset) in the output bucket

    +

    *Thread Reply:* You would model it as a job reading from a folder (the input dataset) in the input bucket and writing to a folder (the output dataset) in the output bucket
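
Sketched as an OpenLineage event, with the bucket as the dataset namespace and the folder path as the dataset name (all identifiers here are examples):

```
event = {
    "eventType": "COMPLETE",
    "eventTime": "2021-04-19T21:45:00Z",
    "run": {"runId": "bf2d1c49-0b96-4a45-8c0f-7eb794a5c1a3"},
    "job": {"namespace": "my-namespace", "name": "copy_trips"},
    # input: a folder in the input bucket
    "inputs": [{"namespace": "s3://input-bucket", "name": "raw/trips"}],
    # output: a folder in the output bucket
    "outputs": [{"namespace": "s3://output-bucket", "name": "curated/trips"}],
    "producer": "https://github.com/my-org/my-agent",
}
```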

    @@ -5943,7 +5949,7 @@

    Group Direct Messages

    2021-04-19 21:43:58
    -

    *Thread Reply:* This is similar to how this is modeled in the spark integration (spark job reading and writing to s3 buckets)

    +

    *Thread Reply:* This is similar to how this is modeled in the spark integration (spark job reading and writing to s3 buckets)

    @@ -5969,7 +5975,7 @@

    Group Direct Messages

    2021-04-19 21:47:06
    -

    *Thread Reply:* for reference: getting the urls for the inputs: https://github.com/MarquezProject/marquez/blob/c5e5d7b8345e347164aa5aa173e8cf35062[…]marquez/spark/agent/lifecycle/plan/HadoopFsRelationVisitor.java

    +

    *Thread Reply:* for reference: getting the urls for the inputs: https://github.com/MarquezProject/marquez/blob/c5e5d7b8345e347164aa5aa173e8cf35062[…]marquez/spark/agent/lifecycle/plan/HadoopFsRelationVisitor.java

    @@ -6020,7 +6026,7 @@

    Group Direct Messages

    2021-04-19 21:47:54
    -

    *Thread Reply:* getting the output URL: https://github.com/MarquezProject/marquez/blob/c5e5d7b8345e347164aa5aa173e8cf35062[…]ark/agent/lifecycle/plan/InsertIntoHadoopFsRelationVisitor.java

    +

    *Thread Reply:* getting the output URL: https://github.com/MarquezProject/marquez/blob/c5e5d7b8345e347164aa5aa173e8cf35062[…]ark/agent/lifecycle/plan/InsertIntoHadoopFsRelationVisitor.java

    @@ -6071,7 +6077,7 @@

    Group Direct Messages

    2021-04-19 21:48:48
    -

    *Thread Reply:* See the spec (comments welcome) for the naming of S3 datasets: https://github.com/OpenLineage/OpenLineage/pull/31/files#diff-e3a8184544e9bc70d8a12e76b58b109051c182a914f0b28529680e6ced0e2a1cR87

    +

    *Thread Reply:* See the spec (comments welcome) for the naming of S3 datasets: https://github.com/OpenLineage/OpenLineage/pull/31/files#diff-e3a8184544e9bc70d8a12e76b58b109051c182a914f0b28529680e6ced0e2a1cR87

    @@ -6097,7 +6103,7 @@

    Group Direct Messages

    2021-04-20 11:11:38
    -

    *Thread Reply:* Hey Julien, thank you so much for getting back to me. I'll take a look at the documentation/implementations you've sent me and will reach out if I have anymore questions. Thanks again!

    +

    *Thread Reply:* Hey Julien, thank you so much for getting back to me. I'll take a look at the documentation/implementations you've sent me and will reach out if I have anymore questions. Thanks again!

    @@ -6123,7 +6129,7 @@

    Group Direct Messages

    2021-04-20 17:39:24
    -

    *Thread Reply:* @Julien Le Dem I left a quick comment on that spec PR you mentioned. Just wanted to let you know.

    +

    *Thread Reply:* @Julien Le Dem I left a quick comment on that spec PR you mentioned. Just wanted to let you know.

    @@ -6149,7 +6155,7 @@

    Group Direct Messages

    2021-04-20 17:49:15
    -

    *Thread Reply:* thanks

    +

    *Thread Reply:* thanks

    @@ -6207,7 +6213,7 @@

    Group Direct Messages

    2021-04-28 20:56:53
    -

    *Thread Reply:* Thank you! Please do fix typos, I’ll approve your PR.

    +

    *Thread Reply:* Thank you! Please do fix typos, I’ll approve your PR.

    @@ -6233,7 +6239,7 @@

    Group Direct Messages

    2021-04-28 23:21:44
    -

    *Thread Reply:* No problem. Here's the PR. https://github.com/OpenLineage/OpenLineage/pull/47

    +

    *Thread Reply:* No problem. Here's the PR. https://github.com/OpenLineage/OpenLineage/pull/47

    @@ -6259,7 +6265,7 @@

    Group Direct Messages

    2021-04-28 23:22:41
    -

    *Thread Reply:* Once I fixed the ones I saw I figured "Why not just run it through a spell checker just in case... " and found a few additional ones.

    +

    *Thread Reply:* Once I fixed the ones I saw I figured "Why not just run it through a spell checker just in case... " and found a few additional ones.

    @@ -6406,7 +6412,7 @@

    Group Direct Messages

    2021-05-21 19:20:55
    -

    *Thread Reply:* Huge milestone! 🙌💯🎊

    +

    *Thread Reply:* Huge milestone! 🙌💯🎊

    @@ -6492,7 +6498,7 @@

    Group Direct Messages

    2021-06-01 17:56:16
    -

*Thread Reply:* We don’t currently have S3-specific (or distributed-filesystem-specific) facets, but such support would be a great addition! @Julien Le Dem would be best to answer if any work has been done in this area 🙂

    +

*Thread Reply:* We don’t currently have S3-specific (or distributed-filesystem-specific) facets, but such support would be a great addition! @Julien Le Dem would be best to answer if any work has been done in this area 🙂

    @@ -6518,7 +6524,7 @@

    Group Direct Messages

    2021-06-01 17:57:19
    -

    *Thread Reply:* Also, happy to answer any Marquez specific questions, @Jonathon Mitchal when you’re thinking of making the move. Marquez supports OpenLineage out of the box 🙌

    +

    *Thread Reply:* Also, happy to answer any Marquez specific questions, @Jonathon Mitchal when you’re thinking of making the move. Marquez supports OpenLineage out of the box 🙌

    @@ -6544,7 +6550,7 @@

    Group Direct Messages

    2021-06-01 19:58:21
    -

    *Thread Reply:* @Jonathon Mitchal You can follow the naming strategy here for referring to a S3 dataset: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md#s3

    +

    *Thread Reply:* @Jonathon Mitchal You can follow the naming strategy here for referring to a S3 dataset: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md#s3

    @@ -6740,7 +6746,7 @@

    Group Direct Messages

    2021-06-01 19:59:30
    -

    *Thread Reply:* There is no facet yet for the attributes of a Parquet file. I can give you feedback if you want to start defining one. https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md#proposing-changes

    +

    *Thread Reply:* There is no facet yet for the attributes of a Parquet file. I can give you feedback if you want to start defining one. https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md#proposing-changes

    @@ -6834,7 +6840,7 @@

    Group Direct Messages

    2021-06-01 20:00:50
    -

    *Thread Reply:* Adding Parquet metadata as a facet would make a lot of sense. It is mainly a matter of specifying what the json would look like

    +

    *Thread Reply:* Adding Parquet metadata as a facet would make a lot of sense. It is mainly a matter of specifying what the json would look like

    @@ -6860,7 +6866,7 @@

    Group Direct Messages

    2021-06-01 20:01:54
    -

    *Thread Reply:* for reference the parquet metadata is defined here: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift

    +

    *Thread Reply:* for reference the parquet metadata is defined here: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift

    @@ -6919,7 +6925,7 @@

    Group Direct Messages

/** See LogicalTypes.md for conversion between ConvertedType and LogicalType. */
enum ConvertedType {
  /** a BYTE_ARRAY actually contains UTF8 encoded chars */
  UTF8 = 0;

  /** a map is converted as an optional field containing a repeated key/value pair */

@@ -7052,7 +7058,7 @@

Group Direct Messages

/** Representation of Schemas */
enum FieldRepetitionType {
  /** This field is required (can not be null) and each record has exactly 1 value. */
  REQUIRED = 0;

  /** The field is optional (can be null) and each record has 0 or 1 values. */

@@ -7067,31 +7073,31 @@

Group Direct Messages

/** All fields are optional. */
struct Statistics {
   /**
    * DEPRECATED: min and max value of the column. Use min_value and max_value.
    *
    * Values are encoded using PLAIN encoding, except that variable-length byte
    * arrays do not include a length prefix.
    *
    * These fields encode min and max values determined by signed comparison
    * only. New files should use the correct order for a column's logical type
    * and store the values in the min_value and max_value fields.
    *
    * To support older readers, these may be set when the column order is
    * signed.
    */
   1: optional binary max;
   2: optional binary min;
   /** count of null value in the column */
   3: optional i64 null_count;
   /** count of distinct values occurring */
   4: optional i64 distinct_count;
   /**
    * Min and max values for the column, determined by its ColumnOrder.
    *
    * Values are encoded using PLAIN encoding, except that variable-length byte
    * arrays do not include a length prefix.
    */
   5: optional binary max_value;
   6: optional binary min_value;
}

@@ -7180,7 +7186,7 @@

    Group Direct Messages

    2021-06-01 23:20:50
    -

*Thread Reply:* That's awesome, thanks for the guidance Willy and Julien ... will report back on how we get on

    +

*Thread Reply:* That's awesome, thanks for the guidance Willy and Julien ... will report back on how we get on

    @@ -7240,7 +7246,7 @@

    Group Direct Messages

    2021-06-01 20:02:34
    -

*Thread Reply:* Welcome! Let us know if you have any questions

    +

*Thread Reply:* Welcome! Let us know if you have any questions

    @@ -7309,7 +7315,7 @@

    Group Direct Messages

    2021-06-04 04:45:15
    -

    *Thread Reply:* > Is it the case that Open Lineage defines the general framework but doesn’t actually enforce push or pull-based implementations, it just so happens that the reference implementation (Marquez) uses push? +

*Thread Reply:* > Is it the case that Open Lineage defines the general framework but doesn’t actually enforce push or pull-based implementations, it just so happens that the reference implementation (Marquez) uses push? Yes, at its core OpenLineage just enforces the format of the event. We also aim to provide clients - REST, later Kafka, etc. - and some reference implementations, which are now in the Marquez repo. https://raw.githubusercontent.com/OpenLineage/OpenLineage/main/doc/Scope.png

There are several differences between push and poll models. The most important one is that with a push model, the latency between your job and emitting OpenLineage events is very low. With some systems, an internal, push-based model gives you more runtime metadata than observing from the outside. Another is that a naive poll implementation would need to "rebuild the world" on each change. There are also disadvantages: usually it's easier to write a plugin that extracts data from outside the system than to hook into the internals.

    @@ -7394,7 +7400,7 @@

    Group Direct Messages

    2021-06-04 10:39:51
    -

    *Thread Reply:* This is really helpful, thank you @Maciej Obuchowski!

    +

    *Thread Reply:* This is really helpful, thank you @Maciej Obuchowski!

    @@ -7420,7 +7426,7 @@

    Group Direct Messages

    2021-06-04 10:40:59
    -

    *Thread Reply:* Similar to what you say about push vs pull, I found DataHub’s comment to be interesting yesterday: +

    *Thread Reply:* Similar to what you say about push vs pull, I found DataHub’s comment to be interesting yesterday: > Push is better than pull: While pulling metadata directly from the source seems like the most straightforward way to gather metadata, developing and maintaining a centralized fleet of domain-specific crawlers quickly becomes a nightmare. It is more scalable to have individual metadata providers push the information to the central repository via APIs or messages. This push-based approach also ensures a more timely reflection of new and updated metadata.

    @@ -7447,7 +7453,7 @@

    Group Direct Messages

    2021-06-04 21:59:59
    -

    *Thread Reply:* yes. You can also “pull-to-push” for things that don’t push.

    +

    *Thread Reply:* yes. You can also “pull-to-push” for things that don’t push.
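
A pull-to-push adapter can be a small polling loop that re-emits whatever it finds as OpenLineage events. A sketch, where fetch_recent_jobs stands in for whatever system-specific API you would poll:

```
import time

import requests

def fetch_recent_jobs():
    # Hypothetical: query the source system for recently finished jobs.
    return []  # replace with a real API call

while True:
    for job in fetch_recent_jobs():
        event = {
            "eventType": "COMPLETE",
            "eventTime": job["finished_at"],
            "run": {"runId": job["run_id"]},
            "job": {"namespace": "polled-system", "name": job["name"]},
            "producer": "https://github.com/my-org/poller",
        }
        requests.post("http://localhost:5000/api/v1/lineage", json=event)
    time.sleep(60)  # poll interval
```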

    @@ -7473,7 +7479,7 @@

    Group Direct Messages

    2021-06-17 10:01:37
    -

*Thread Reply:* @Maciej Obuchowski any particular reason for bypassing databuilder and going directly to neo4j? By design, databuilder is supposed to be very abstract so any kind of backend can be used with Amundsen. Currently there are at least 4, and neo4j is just one of them.

    +

*Thread Reply:* @Maciej Obuchowski any particular reason for bypassing databuilder and going directly to neo4j? By design, databuilder is supposed to be very abstract so any kind of backend can be used with Amundsen. Currently there are at least 4, and neo4j is just one of them.

    @@ -7499,7 +7505,7 @@

    Group Direct Messages

    2021-06-17 10:28:52
    -

    *Thread Reply:* Databuilder's pull model is very different than OpenLineage's push model, where the events are generated while the dataset itself is generated.

    +

    *Thread Reply:* Databuilder's pull model is very different than OpenLineage's push model, where the events are generated while the dataset itself is generated.

    So, how would you see using it? Just to proxy the events to concrete search and metadata backend?

    @@ -7529,7 +7535,7 @@

    Group Direct Messages

    2021-07-07 19:59:28
    -

    *Thread Reply:* @Mariusz Górski my slide that Maciej is referring to might be a bit misleading. The Amundsen integration does not exist yet. Please add your input in the ticket: https://github.com/OpenLineage/OpenLineage/issues/86

    +

    *Thread Reply:* @Mariusz Górski my slide that Maciej is referring to might be a bit misleading. The Amundsen integration does not exist yet. Please add your input in the ticket: https://github.com/OpenLineage/OpenLineage/issues/86

    @@ -7579,7 +7585,7 @@

    Group Direct Messages

    2021-07-09 02:22:06
    -

    *Thread Reply:* thanks Julien! will take a look

    +

    *Thread Reply:* thanks Julien! will take a look

    @@ -7631,7 +7637,7 @@

    Group Direct Messages

    2021-06-08 14:16:42
    -

    *Thread Reply:* Welcome! You can look here: https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md

    +

    *Thread Reply:* Welcome! You can look here: https://github.com/OpenLineage/OpenLineage/blob/main/CONTRIBUTING.md

    @@ -7725,7 +7731,7 @@

    Group Direct Messages

    2021-06-08 14:17:19
    -

    *Thread Reply:* We’re starting a monthly call, I will publish more details here

    +

    *Thread Reply:* We’re starting a monthly call, I will publish more details here

    @@ -7751,7 +7757,7 @@

    Group Direct Messages

    2021-06-08 14:17:48
    -

    *Thread Reply:* Do you have a specific use case in mind?

    +

    *Thread Reply:* Do you have a specific use case in mind?

    @@ -7777,7 +7783,7 @@

    Group Direct Messages

    2021-06-08 21:32:02
    -

    *Thread Reply:* Nothing specific yet

    +

    *Thread Reply:* Nothing specific yet

    @@ -7854,7 +7860,7 @@

    Group Direct Messages

    2021-06-09 08:33:45
    -

    *Thread Reply:* Hey @Julien Le Dem, I can’t add a link to my calendar… Can you send an invite?

    +

    *Thread Reply:* Hey @Julien Le Dem, I can’t add a link to my calendar… Can you send an invite?

    @@ -7880,7 +7886,7 @@

    Group Direct Messages

    2021-06-09 11:00:05
    -

    *Thread Reply:* Same!

    +

    *Thread Reply:* Same!

    @@ -7906,7 +7912,7 @@

    Group Direct Messages

    2021-06-09 11:01:45
    -

    *Thread Reply:* Will do. Also if you send your email in dm you can get added to the invite

    +

    *Thread Reply:* Will do. Also if you send your email in dm you can get added to the invite

    @@ -7932,7 +7938,7 @@

    Group Direct Messages

    2021-06-09 11:59:22
    2021-06-09 12:00:30
    -

    *Thread Reply:* @Julien Le Dem Can't access the calendar.

    +

    *Thread Reply:* @Julien Le Dem Can't access the calendar.

    @@ -7984,7 +7990,7 @@

    Group Direct Messages

    2021-06-09 12:00:43
    -

    *Thread Reply:* Can you please share the meeting details

    +

    *Thread Reply:* Can you please share the meeting details

    @@ -8010,7 +8016,7 @@

    Group Direct Messages

    2021-06-09 12:01:12
    -

    *Thread Reply:*

    +

    *Thread Reply:*

    @@ -8045,7 +8051,7 @@

    Group Direct Messages

    2021-06-09 12:01:24
    -

    *Thread Reply:*

    +

    *Thread Reply:*

    @@ -8080,7 +8086,7 @@

    Group Direct Messages

    2021-06-09 12:01:55
    -

    *Thread Reply:* The calendar invite says 9am PDT, not 10am. Which is right?

    +

    *Thread Reply:* The calendar invite says 9am PDT, not 10am. Which is right?

    @@ -8106,7 +8112,7 @@

    Group Direct Messages

    2021-06-09 12:01:58
    -

    *Thread Reply:* Thanks

    +

    *Thread Reply:* Thanks

    @@ -8132,7 +8138,7 @@

    Group Direct Messages

    2021-06-09 13:25:13
    -

*Thread Reply:* it is 9am, thanks

    +

*Thread Reply:* it is 9am, thanks

    @@ -8158,7 +8164,7 @@

    Group Direct Messages

    2021-06-09 18:37:02
    -

    *Thread Reply:* I have posted the notes on the wiki (includes link to recording) https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+meeting+archive

    +

    *Thread Reply:* I have posted the notes on the wiki (includes link to recording) https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+meeting+archive

    @@ -8218,7 +8224,7 @@

    Group Direct Messages

    2021-06-10 13:55:51
    -

    *Thread Reply:* We’ve recently worked on a getting started guide for OpenLineage that we’d like to publish on the OpenLineage website. That should help with making things a bit more clear on usage. @Ross Turk / @Julien Le Dem might know of when that might become available. Otherwise, happy to answer any immediate questions you might have about posting/collecting OpenLineage events

    +

    *Thread Reply:* We’ve recently worked on a getting started guide for OpenLineage that we’d like to publish on the OpenLineage website. That should help with making things a bit more clear on usage. @Ross Turk / @Julien Le Dem might know of when that might become available. Otherwise, happy to answer any immediate questions you might have about posting/collecting OpenLineage events

    @@ -8244,7 +8250,7 @@

    Group Direct Messages

    2021-06-10 13:58:58
    -

    *Thread Reply:* Here's a sample of what I'm producing, would appreciate any feedback if it's on the right track. One of our challenges is that 'dataset' is a little loosely defined for us as outputs since we take data from a warehouse/database and output to things like Salesforce, Airtable, Hubspot and even Slack.

    +

    *Thread Reply:* Here's a sample of what I'm producing, would appreciate any feedback if it's on the right track. One of our challenges is that 'dataset' is a little loosely defined for us as outputs since we take data from a warehouse/database and output to things like Salesforce, Airtable, Hubspot and even Slack.

{ eventType: 'START',

@@ -8315,7 +8321,7 @@

    Group Direct Messages

    2021-06-10 14:02:59
    -

    *Thread Reply:* One other question I have is really around how customers might take the metadata we emit at Hightouch and integrate that with OpenLineage metadata emitted from other tools like dbt, Airflow, and other integrations to create a true lineage of their data.

    +

    *Thread Reply:* One other question I have is really around how customers might take the metadata we emit at Hightouch and integrate that with OpenLineage metadata emitted from other tools like dbt, Airflow, and other integrations to create a true lineage of their data.

    For example, if the data goes from S3 -&gt; Snowflake via Airflow and then from Snowflake -&gt; Salesforce via Hightouch, this would mean both Airflow/Hightouch would need to define the Snowflake dataset in exactly the same way to get the benefits of lineage?

    @@ -8343,7 +8349,7 @@

    Group Direct Messages

    2021-06-17 19:13:14
    -

*Thread Reply:* Hey, @Dejan Peretin! Sorry for the late reply here! Your OL events look solid; I only have a few suggestions:

    +

*Thread Reply:* Hey, @Dejan Peretin! Sorry for the late reply here! Your OL events look solid; I only have a few suggestions:

    1. I would use a valid UUID for the run ID as the spec will standardize on that type, see https://github.com/OpenLineage/OpenLineage/pull/65
    2. You don’t need to provide the input dataset again on the COMPLETE event as the input datasets have already been associated with the run ID
    3. For the producer, I’d recommend using a link to the producer source code version to link the producer version with the OL event that was emitted.
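
Putting those three suggestions together, a COMPLETE event would look roughly like this (all values are placeholders):

```
complete_event = {
    "eventType": "COMPLETE",
    "eventTime": "2021-06-17T19:00:00Z",
    "run": {"runId": "ea041791-68bc-4ae1-bd89-4c8106a157e4"},  # (1) a real UUID
    "job": {"namespace": "hightouch", "name": "sync-to-salesforce"},
    # (2) no inputs here: they were already attached to this runId on START
    "outputs": [{"namespace": "salesforce", "name": "Account"}],
    # (3) producer links to a versioned source tree
    "producer": "https://github.com/my-org/my-agent/tree/v0.1.0",
}
```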
    @@ -8372,7 +8378,7 @@

    Group Direct Messages

    2021-06-17 19:13:59
    -

    *Thread Reply:* You can now reference our OL getting started guide for a close-to-real example 🙂 , see http://openlineage.io/getting-started

    +

    *Thread Reply:* You can now reference our OL getting started guide for a close-to-real example 🙂 , see http://openlineage.io/getting-started

    openlineage.io
    @@ -8419,7 +8425,7 @@

    Group Direct Messages

    2021-06-17 19:18:19
    -

    *Thread Reply:* > … this would mean both Airflow/Hightouch would need to define the Snowflake dataset in exactly the same way to get the benefits of lineage? +

    *Thread Reply:* > … this would mean both Airflow/Hightouch would need to define the Snowflake dataset in exactly the same way to get the benefits of lineage? Yes, the dataset and the namespace that it was registered under would have to be the same to properly build the lineage graph. We’re working on defining unique dataset names and have made some good progress in this area. I’d suggest reviewing the OL naming conventions if you haven’t already: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md

    @@ -8450,7 +8456,7 @@

    Group Direct Messages

    2021-06-19 01:09:27
    -

    *Thread Reply:* Thanks! I'm really excited to see what the future holds, I think there are so many great possibilities here. Will be keeping a watchful eye. 🙂

    +

    *Thread Reply:* Thanks! I'm really excited to see what the future holds, I think there are so many great possibilities here. Will be keeping a watchful eye. 🙂

    @@ -8476,7 +8482,7 @@

    Group Direct Messages

    2021-06-22 15:14:39
    -

    *Thread Reply:* 🙂

    +

    *Thread Reply:* 🙂

    @@ -8530,7 +8536,7 @@

    Group Direct Messages

    2021-06-11 10:16:41
    -

    *Thread Reply:* Sounds like problem is with Marquez - might be worth to open issue here: https://github.com/MarquezProject/marquez/issues

    +

    *Thread Reply:* Sounds like problem is with Marquez - might be worth to open issue here: https://github.com/MarquezProject/marquez/issues

    @@ -8556,7 +8562,7 @@

    Group Direct Messages

    2021-06-11 10:25:58
    -

    *Thread Reply:* Thank you! Will do.

    +

    *Thread Reply:* Thank you! Will do.

    @@ -8582,7 +8588,7 @@

    Group Direct Messages

    2021-06-11 15:31:41
    -

    *Thread Reply:* Thanks for reporting Antonio

    +

    *Thread Reply:* Thanks for reporting Antonio

    @@ -8732,7 +8738,7 @@

    Group Direct Messages

    2021-06-18 15:09:18
    -

    *Thread Reply:* Very nice!

    +

    *Thread Reply:* Very nice!

    @@ -8784,7 +8790,7 @@

    Group Direct Messages

    2021-06-20 10:09:28
    -

    *Thread Reply:* Here is the error...

    +

    *Thread Reply:* Here is the error...

21/06/20 11:02:56 WARN ArgumentParser: missing jobs in [, api, v1, namespaces, spark_integration] at 5
21/06/20 11:02:56 WARN ArgumentParser: missing runs in [, api, v1, namespaces, spark_integration] at 7

@@ -8829,7 +8835,7 @@

    Group Direct Messages

    2021-06-20 10:10:41
    -

    *Thread Reply:* Here is my code ...

    +

    *Thread Reply:* Here is my code ...

```from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

    @@ -8899,7 +8905,7 @@

    Supress success

    2021-06-20 10:12:27
    -

*Thread Reply:* After this execution, I can see just the source from the first dataframe, called dfsourcetrip...

    +

*Thread Reply:* After this execution, I can see just the source from the first dataframe, called dfsourcetrip...

    @@ -8925,7 +8931,7 @@

    Supress success

    2021-06-20 10:13:04
    -

    *Thread Reply:*

    +

    *Thread Reply:*

    @@ -8960,7 +8966,7 @@

    Supress success

    2021-06-20 10:13:45
    -

    *Thread Reply:* I was expecting to see all source dataframes, target dataframes and the job

    +

    *Thread Reply:* I was expecting to see all source dataframes, target dataframes and the job

    @@ -8986,7 +8992,7 @@

    Supress success

    2021-06-20 10:14:35
    -

*Thread Reply:* I'm running Spark locally on my laptop, and I followed the Marquez getting started guide to bring it up

    +

*Thread Reply:* I'm running Spark locally on my laptop, and I followed the Marquez getting started guide to bring it up

    @@ -9012,7 +9018,7 @@

    Supress success

    2021-06-20 10:14:44
    -

    *Thread Reply:* Can anyone help me?

    +

    *Thread Reply:* Can anyone help me?

    @@ -9038,7 +9044,7 @@

    Supress success

    2021-06-22 14:42:03
    -

    *Thread Reply:* I think there's a race condition that causes the context to be missing when the job finishes too quickly. If I just add +

    *Thread Reply:* I think there's a race condition that causes the context to be missing when the job finishes too quickly. If I just add spark.sparkContext.setLogLevel('info') to the setup code, everything works reliably. Also works if you remove the master('local[1]') - at least when running in a notebook

    @@ -9174,7 +9180,7 @@

    Supress success

    2021-06-22 15:26:55
    -

*Thread Reply:* Hey, @anup agrawal. This is a great question! The OpenLineage spec is defined using the JSON Schema format, and it’s mainly for the transport layer of OL events. In terms of how OL events are eventually stored, that’s determined by the backend consumer of the events. For example, Marquez stores the raw event in a lineage_events table, but that’s mainly for convenience and replayability of events. As for importing / exporting OL events from storage, as long as you can translate the CSV to an OL event, then HTTP backends like Marquez that support OL can consume them

    +

*Thread Reply:* Hey, @anup agrawal. This is a great question! The OpenLineage spec is defined using the JSON Schema format, and it’s mainly for the transport layer of OL events. In terms of how OL events are eventually stored, that’s determined by the backend consumer of the events. For example, Marquez stores the raw event in a lineage_events table, but that’s mainly for convenience and replayability of events. As for importing / exporting OL events from storage, as long as you can translate the CSV to an OL event, then HTTP backends like Marquez that support OL can consume them

    @@ -9200,7 +9206,7 @@

    Supress success

    2021-06-22 15:27:29
    -

    *Thread Reply:* > as part of this i need to convert the information fetched from graph db (neo4j) to CSV format and send in response. +

    *Thread Reply:* > as part of this i need to convert the information fetched from graph db (neo4j) to CSV format and send in response. Depending on the exported CSV, I would translate the CSV to an OL event, see https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.json
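
A sketch of that translation, assuming made-up CSV column names and Marquez's default endpoint:

```
import csv

import requests

# Assumed CSV columns: run_id, state, time, job_namespace, job_name
with open("lineage_export.csv", newline="") as f:
    for row in csv.DictReader(f):
        event = {
            "eventType": row["state"],  # e.g. START / COMPLETE
            "eventTime": row["time"],
            "run": {"runId": row["run_id"]},
            "job": {"namespace": row["job_namespace"], "name": row["job_name"]},
            "producer": "https://my-org.example/csv-importer",
        }
        requests.post("http://localhost:5000/api/v1/lineage", json=event)
```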

    @@ -9227,7 +9233,7 @@

    Supress success

    2021-06-22 15:29:58
    -

    *Thread Reply:* When you say “send in response”, who would be the consumer of the lineage metadata exported for the graph db?

    +

    *Thread Reply:* When you say “send in response”, who would be the consumer of the lineage metadata exported for the graph db?

    @@ -9253,7 +9259,7 @@

    Supress success

    2021-06-22 23:33:05
    -

*Thread Reply:* So far, what I understood about my requirement is: 1. my service will receive OL events

    +

*Thread Reply:* So far, what I understood about my requirement is: 1. my service will receive OL events

    @@ -9279,7 +9285,7 @@

    Supress success

    2021-06-22 23:33:24
    -

    *Thread Reply:* 2. store it in graph db (neo4j)

    +

    *Thread Reply:* 2. store it in graph db (neo4j)

    @@ -9305,7 +9311,7 @@

    Supress success

    2021-06-22 23:38:28
    -

*Thread Reply:* 3. this lineage information will be displayed in the UI, based on the request.

    +

*Thread Reply:* 3. this lineage information will be displayed in the UI, based on the request.

1. now my part in that is to implement an Export functionality, so that someone can download it from the UI; in the UI there will be an option to download the report.
2. so I need to fetch data from storage, convert it into CSV format, and send it to the UI
3. they can download the report from the UI.
    @@ -9341,7 +9347,7 @@

    Supress success

    2021-07-01 19:18:35
    -

    *Thread Reply:* I see. @Julien Le Dem might have some thoughts on how an OL event would be represented in different formats like CSV (but, of course, there’s also avro, parquet, etc). The Json Schema is the recommended format for importing / exporting lineage metadata. And, for a file, each line would be an OL event. But, given that CSV is a requirement, I’m not sure how that would be structured. Or at least, it’s something we haven’t previously discussed

    +

    *Thread Reply:* I see. @Julien Le Dem might have some thoughts on how an OL event would be represented in different formats like CSV (but, of course, there’s also avro, parquet, etc). The Json Schema is the recommended format for importing / exporting lineage metadata. And, for a file, each line would be an OL event. But, given that CSV is a requirement, I’m not sure how that would be structured. Or at least, it’s something we haven’t previously discussed

    @@ -9393,7 +9399,7 @@

    Supress success

    2021-06-22 20:29:22
    -

    *Thread Reply:* There are no silly questions! 😉

    +

    *Thread Reply:* There are no silly questions! 😉

    @@ -9453,7 +9459,7 @@

    Supress success

    2021-07-01 19:07:27
    -

    *Thread Reply:* Welcome, @Abdulmalik AN 👋 Hopefully the talks / podcasts have been informative! And, sure, happy to clarify a few things:

    +

    *Thread Reply:* Welcome, @Abdulmalik AN 👋 Hopefully the talks / podcasts have been informative! And, sure, happy to clarify a few things:

> What are events and facets, and what is their purpose? An OpenLineage event is used to capture the lineage metadata at a point in time for a given run in execution. That is, the run’s state transition, the inputs and outputs consumed/produced, and the job associated with the run are part of the event. The metadata defined in the event can then be consumed by an HTTP backend (as well as other transport layers). Marquez is an HTTP backend implementation that consumes OL events via a REST API call. The OL core model only defines the metadata that should be captured in the context of a run, while the processing of the event is up to the backend implementation consuming the event (think consumer / producer model here). For Marquez, the end-to-end lineage metadata is stored for pipelines (composed of multiple jobs) with built-in metadata versioning support. Now, for the second part of your question: the OL core model is highly extensible via facets. A facet is user-defined metadata and enables entity enrichment. I’d recommend checking out the getting started guide for OL 🙂

    @@ -9591,7 +9597,7 @@

    Supress success

    2021-06-30 18:16:58
    -

    *Thread Reply:* I based my code on this previous post https://openlineage.slack.com/archives/C01CK9T7HKR/p1624198123045800

    +

    *Thread Reply:* I based my code on this previous post https://openlineage.slack.com/archives/C01CK9T7HKR/p1624198123045800

    @@ -9652,7 +9658,7 @@

    Supress success

    2021-06-30 18:36:59
    -

    *Thread Reply:*

    +

    *Thread Reply:*

    @@ -9687,7 +9693,7 @@

    Supress success

    2021-07-01 13:45:42
    -

    *Thread Reply:* In your first cell, you have +

*Thread Reply:* In your first cell, you have
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
spark.sparkContext.setLogLevel('info')

@@ -9754,7 +9760,7 @@

    Supress success

    2021-07-01 16:32:18
    -

    *Thread Reply:* It’s absolutely in scope! We’ve primarily focused on the batch use case (ETL jobs, etc), but the OpenLineage standard supports both batch and streaming jobs. You can check out our roadmap here, where you’ll find Flink and Beam on our list of future integrations.

    +

    *Thread Reply:* It’s absolutely in scope! We’ve primarily focused on the batch use case (ETL jobs, etc), but the OpenLineage standard supports both batch and streaming jobs. You can check out our roadmap here, where you’ll find Flink and Beam on our list of future integrations.

    @@ -9780,7 +9786,7 @@

    Supress success

    2021-07-01 16:32:57
    -

    *Thread Reply:* Is there a streaming framework you’d like to see added to our roadmap?

    +

    *Thread Reply:* Is there a streaming framework you’d like to see added to our roadmap?

    @@ -9832,7 +9838,7 @@

    Supress success

    2021-07-01 16:24:16
    -

*Thread Reply:* Welcome, @mohamed chorfa 👋 . Let us know if you have any questions!

    +

*Thread Reply:* Welcome, @mohamed chorfa 👋 . Let us know if you have any questions!

    @@ -9862,7 +9868,7 @@

    Supress success

    2021-07-03 19:37:58
    -

*Thread Reply:* Really looking forward to following the evolution of the specification, from RawData to the ML-Model

    +

*Thread Reply:* Really looking forward to following the evolution of the specification, from RawData to the ML-Model

    @@ -9980,7 +9986,7 @@

    Supress success

    2021-07-07 19:53:17
    -

    *Thread Reply:* Based on community feedback, +

*Thread Reply:* Based on community feedback, the new proposed mission statement is: “to enable the industry at-large to collect real-time lineage metadata consistently across complex ecosystems, creating a deeper understanding of how data is produced and used”

    @@ -10137,7 +10143,7 @@

    Supress success

    2021-07-08 18:57:44
    -

    *Thread Reply:* It is on the roadmap and there’s a ticket open but nobody is working on it at the moment. You are very welcome to contribute a spec and implementation

    +

    *Thread Reply:* It is on the roadmap and there’s a ticket open but nobody is working on it at the moment. You are very welcome to contribute a spec and implementation

    @@ -10163,7 +10169,7 @@

    Supress success

    2021-07-08 18:59:00
    -

    *Thread Reply:* Please comment here and feel free to make a proposal: https://github.com/OpenLineage/OpenLineage/issues/35

    +

    *Thread Reply:* Please comment here and feel free to make a proposal: https://github.com/OpenLineage/OpenLineage/issues/35

    @@ -10308,7 +10314,7 @@

    Supress success

    2021-07-09 02:23:42
    -

    *Thread Reply:* cc @Julien Le Dem @Maciej Obuchowski

    +

    *Thread Reply:* cc @Julien Le Dem @Maciej Obuchowski

    @@ -10334,7 +10340,7 @@

    Supress success

    2021-07-09 04:28:38
    -

    *Thread Reply:* @Michael Collado

    +

    *Thread Reply:* @Michael Collado

    @@ -10403,7 +10409,7 @@

    Supress success

    2021-07-12 21:18:04
    -

    *Thread Reply:* Feel free to share your email with me if you want to be added to the gcal invite

    +

    *Thread Reply:* Feel free to share your email with me if you want to be added to the gcal invite

    @@ -10429,7 +10435,7 @@

    Supress success

    2021-07-14 12:03:31
    -

    *Thread Reply:* It is starting now

    +

    *Thread Reply:* It is starting now

    @@ -10538,7 +10544,7 @@

    Supress success

    2021-07-14 17:04:18
    -

    *Thread Reply:* In particular, please review the versioning proposal: https://github.com/OpenLineage/OpenLineage/issues/63

    +

    *Thread Reply:* In particular, please review the versioning proposal: https://github.com/OpenLineage/OpenLineage/issues/63

    @@ -10641,7 +10647,7 @@

    Supress success

    2021-07-14 17:04:33
    -

    *Thread Reply:* and the mission statement: https://github.com/OpenLineage/OpenLineage/issues/84

    +

    *Thread Reply:* and the mission statement: https://github.com/OpenLineage/OpenLineage/issues/84

    @@ -10705,7 +10711,7 @@

    Supress success

    2021-07-14 17:05:02
    -

    *Thread Reply:* for this one, please give explicit approval in the ticket

    +

    *Thread Reply:* for this one, please give explicit approval in the ticket

    @@ -10735,7 +10741,7 @@

    Supress success

    2021-07-14 21:10:42
    -

    *Thread Reply:* @Zhamak Dehghani @Daniel Henneberger @Drew Banin @James Campbell @Ryan Blue @Maciej Obuchowski @Willy Lulciuc ^

    +

    *Thread Reply:* @Zhamak Dehghani @Daniel Henneberger @Drew Banin @James Campbell @Ryan Blue @Maciej Obuchowski @Willy Lulciuc ^

    @@ -10761,7 +10767,7 @@

    Supress success

    2021-07-27 18:58:35
    -

    *Thread Reply:* Per the votes in the github ticket, I have finalized the charter here: https://docs.google.com/document/d/11xo2cPtuYHmqRLnR-vt9ln4GToe0y60H/edit

    +

    *Thread Reply:* Per the votes in the github ticket, I have finalized the charter here: https://docs.google.com/document/d/11xo2cPtuYHmqRLnR-vt9ln4GToe0y60H/edit

    @@ -10854,7 +10860,7 @@

    Supress success

    2021-07-20 16:38:38
    -

*Thread Reply:* The demo in this does not really use the OpenLineage spec, does it?

    +

*Thread Reply:* The demo in this does not really use the OpenLineage spec, does it?

Did I miss something - the API that was shown for lineage was that of Marquez; how does Marquez use the OpenLineage spec?

    @@ -10882,7 +10888,7 @@

    Supress success

    2021-07-20 18:09:01
    -

*Thread Reply:* I have a question about the SQLJobFacet in the job schema - isn't it better to call it the TransformationJobFacet or the ProcessJobFacet, so that logic in any appropriate language can be described? Am I misinterpreting, or is the intention of SQLJobFacet to capture the logic that runs for a job?

    +

*Thread Reply:* I have a question about the SQLJobFacet in the job schema - isn't it better to call it the TransformationJobFacet or the ProcessJobFacet, so that logic in any appropriate language can be described? Am I misinterpreting, or is the intention of SQLJobFacet to capture the logic that runs for a job?

    @@ -10908,7 +10914,7 @@

    Supress success

    2021-07-26 19:06:43
    -

*Thread Reply:* > The demo in this does not really use the OpenLineage spec, does it? +

*Thread Reply:* > The demo in this does not really use the OpenLineage spec, does it?
@Samia Rahman In our Airflow talk, the demo used the marquez-airflow lib that sends OpenLineage events to Marquez’s . You can check out how Airflow works with OpenLineage + Marquez here https://openlineage.io/integration/apache-airflow/

    @@ -10956,7 +10962,7 @@

    Supress success

    2021-07-26 19:07:51
    -

*Thread Reply:* > Did I miss something - the API that was shown for lineage was that of Marquez; how does Marquez use the OpenLineage spec? +

*Thread Reply:* > Did I miss something - the API that was shown for lineage was that of Marquez; how does Marquez use the OpenLineage spec?
Yes, Marquez ingests OpenLineage events that conform to the spec via the . Hope this helps!

    @@ -11011,7 +11017,7 @@

    Supress success

    2021-07-26 18:54:46
    -

*Thread Reply:* I would say this is more of an ingestion pattern than something the OpenLineage spec would support directly. Though I completely agree, query logs are a great source of lineage metadata with minimal effort. On our roadmap, we have Kafka as a supported backend, which would enable streaming lineage metadata from query logs into a topic. That said, Confluent has some great blog posts on Change Data Capture: +

*Thread Reply:* I would say this is more of an ingestion pattern than something the OpenLineage spec would support directly. Though I completely agree, query logs are a great source of lineage metadata with minimal effort. On our roadmap, we have Kafka as a supported backend, which would enable streaming lineage metadata from query logs into a topic. That said, Confluent has some great blog posts on Change Data Capture:
https://www.confluent.io/blog/no-more-silos-how-to-integrate-your-databases-with-apache-kafka-and-cdc/
https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-1/

    Supress success
    2021-07-26 18:57:59
    -

    *Thread Reply:* Q: @Kenton (swiple.io) Are you planning on using Kafka connect? If so, I see 2 reasonable options:

    +

    *Thread Reply:* Q: @Kenton (swiple.io) Are you planning on using Kafka connect? If so, I see 2 reasonable options:

1. Stream query logs to a topic using the JDBC source connector, then have a consumer read the query logs off the topic, parse the logs, then stream the result of the query parsing to another topic as an OpenLineage event (see the sketch after this list)
    2. Add direct support for OpenLineage to the JDBC connector or any other application you planned to use to read the query logs.
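A rough sketch of option 1's consumer half, assuming kafka-python, made-up topic and server names, and a placeholder parser (a real implementation would run a proper SQL parser):

import json
from datetime import datetime, timezone
from uuid import uuid4

from kafka import KafkaConsumer, KafkaProducer  # assumed: kafka-python

consumer = KafkaConsumer("query-logs", bootstrap_servers="localhost:9092")
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def parse_tables(sql):
    # Placeholder: a real implementation would use a SQL parser here.
    return [token for token in sql.split() if "." in token]

for record in consumer:
    sql = record.value.decode("utf-8")
    event = {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid4())},
        "job": {"namespace": "query-logs", "name": "parsed-query"},
        "inputs": [{"namespace": "postgres://db", "name": t} for t in parse_tables(sql)],
        "outputs": [],
        "producer": "https://example.com/query-log-parser",
    }
    # Publish the parsed result as an OpenLineage event on a second topic.
    producer.send("openlineage-events", event)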
    @@ -11118,7 +11124,7 @@

    Supress success

    2021-07-26 19:01:31
    -

*Thread Reply:* Either way, I think this is a great question and a common ingestion pattern we should document or have best practices for. Also, more details on how you plan to ingest the query logs would help drive the discussion.

    +

*Thread Reply:* Either way, I think this is a great question and a common ingestion pattern we should document or have best practices for. Also, more details on how you plan to ingest the query logs would help drive the discussion.

    @@ -11144,7 +11150,7 @@

    Supress success

    2021-08-05 12:01:55
    -

    *Thread Reply:* Using something like sqlflow could be a good starting point? Demo https://sqlflow.gudusoft.com/?utm_source=gspsite&utm_medium=blog&utm_campaign=support_article#/

    +

    *Thread Reply:* Using something like sqlflow could be a good starting point? Demo https://sqlflow.gudusoft.com/?utm_source=gspsite&utm_medium=blog&utm_campaign=support_article#/

    sqlflow.gudusoft.com
    @@ -11220,7 +11226,7 @@

    Supress success

    2021-09-21 20:22:26
    -

*Thread Reply:* @Kenton (swiple.io) I haven’t heard of sqlflow but it does look promising. It’s not on our current roadmap, but I think there is a need to have support for parsing query logs as OpenLineage events. Do you mind opening an issue and outlining your thoughts? It’d be great to start the discussion if you’d like to drive this feature and help prioritize it 💯

    +

*Thread Reply:* @Kenton (swiple.io) I haven’t heard of sqlflow but it does look promising. It’s not on our current roadmap, but I think there is a need to have support for parsing query logs as OpenLineage events. Do you mind opening an issue and outlining your thoughts? It’d be great to start the discussion if you’d like to drive this feature and help prioritize it 💯

    @@ -11379,7 +11385,7 @@

    Supress success

    2021-07-21 18:22:01
    -

*Thread Reply:* Hey, @Samia Rahman 👋. Yeah, great question! The SQLJobFacet is used only for SQL-based jobs. That is, it’s not intended to capture the code being executed, but rather just the SQL, if it’s present. The SQL facet can be used later for display purposes. For example, in Marquez, we use the SQLJobFacet to display the SQL executed by a given job to the user via the UI.

    +

*Thread Reply:* Hey, @Samia Rahman 👋. Yeah, great question! The SQLJobFacet is used only for SQL-based jobs. That is, it’s not intended to capture the code being executed, but rather just the SQL, if it’s present. The SQL facet can be used later for display purposes. For example, in Marquez, we use the SQLJobFacet to display the SQL executed by a given job to the user via the UI.

    @@ -11405,7 +11411,7 @@

    Supress success

    2021-07-21 18:23:03
    -

    *Thread Reply:* To capture the logic of the job (meaning, the code being executed), the OpenLineage spec defines the SourceCodeLocationJobFacet that builds the link to source in version control

    +

    *Thread Reply:* To capture the logic of the job (meaning, the code being executed), the OpenLineage spec defines the SourceCodeLocationJobFacet that builds the link to source in version control
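Taken together, a job object carrying both facets might look roughly like this sketch (names and values are made up; the required _producer/_schemaURL keys on each facet are omitted for brevity):

job = {
    "namespace": "my-namespace",
    "name": "load_orders",
    "facets": {
        # SQLJobFacet: just the SQL text, kept for display purposes.
        "sql": {"query": "INSERT INTO orders_summary SELECT order_id, amount FROM orders"},
        # SourceCodeLocationJobFacet: link back to the code in version control.
        "sourceCodeLocation": {
            "type": "git",
            "url": "https://github.com/example/pipelines/blob/main/load_orders.sql",
        },
    },
}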

    @@ -11560,7 +11566,7 @@

    Supress success

    2021-07-30 17:24:44
    -

    *Thread Reply:* The run event is always tied to exactly one job. It's up to the backend to store the relationship between the job and its inputs/outputs. E.g., in marquez, this is where we associate the input datasets with the job- https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/db/OpenLineageDao.java#L132-L143

    +

    *Thread Reply:* The run event is always tied to exactly one job. It's up to the backend to store the relationship between the job and its inputs/outputs. E.g., in marquez, this is where we associate the input datasets with the job- https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/db/OpenLineageDao.java#L132-L143

    @@ -11681,7 +11687,7 @@

    Supress success

    2021-08-03 15:11:56
    -

    *Thread Reply:* /|~~~ +

*Thread Reply:* /|~~~
///|
/////|
///////|

@@ -11713,7 +11719,7 @@

    Supress success

    2021-08-03 19:59:03
    -

    *Thread Reply:*

    +

    *Thread Reply:* ⛵

    @@ -11972,7 +11978,7 @@

    Supress success

    2021-08-05 12:56:48
    -

    *Thread Reply:* Also interested in use case / implementation differences between Spline and OL. Watching this thread.

    +

    *Thread Reply:* Also interested in use case / implementation differences between Spline and OL. Watching this thread.

    @@ -11998,7 +12004,7 @@

    Supress success

    2021-08-05 14:46:44
    -

    *Thread Reply:* It would be great to have the option to produce the spline lineage info as OpenLineage. +

*Thread Reply:* It would be great to have the option to produce the spline lineage info as OpenLineage.
To capture the column level lineage, you would want to add a ColumnLineage facet to the Output dataset facets, which is something that is needed in the spec. Here is a proposal, please chime in: https://github.com/OpenLineage/OpenLineage/issues/148
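Purely as a hypothetical illustration of the idea (the shape below is a guess for illustration, not the proposal in issue #148):

output_dataset = {
    "namespace": "warehouse",
    "name": "orders_summary",
    "facets": {
        # Hypothetical column-level facet: each output field lists the
        # input fields it was derived from.
        "columnLineage": {
            "fields": {
                "total_amount": {
                    "inputFields": [
                        {"namespace": "warehouse", "name": "orders", "field": "amount"}
                    ]
                }
            }
        }
    },
}

@@ -12136,7 +12142,7 @@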

    Supress success

    2021-08-09 19:49:51
    -

    *Thread Reply:* regarding the difference of implementation, the OpenLineage spark integration focuses on extracting metadata and exposing it as a standard representation. (The OpenLineage LineageEvents described in the JSON-Schema spec). The goal is really to have a common language to express lineage and related metadata across everything. We’d be happy if Spline can produce or consume OpenLineage as well and be part of that ecosystem.

    +

    *Thread Reply:* regarding the difference of implementation, the OpenLineage spark integration focuses on extracting metadata and exposing it as a standard representation. (The OpenLineage LineageEvents described in the JSON-Schema spec). The goal is really to have a common language to express lineage and related metadata across everything. We’d be happy if Spline can produce or consume OpenLineage as well and be part of that ecosystem.

    @@ -12186,7 +12192,7 @@

    Supress success

    2021-08-18 08:09:38
    -

    *Thread Reply:* Does anyone know if the Spline developers are in this slack group?

    +

    *Thread Reply:* Does anyone know if the Spline developers are in this slack group?

    @@ -12212,7 +12218,7 @@

    Supress success

    2022-08-03 03:07:56
    -

    *Thread Reply:* @Luke Smith how have things progressed on your side the past year?

    +

    *Thread Reply:* @Luke Smith how have things progressed on your side the past year?

    @@ -12442,7 +12448,7 @@

    Supress success

    2021-08-11 10:05:27
    -

    *Thread Reply:* Just a reminder that this is in 2 hours

    +

    *Thread Reply:* Just a reminder that this is in 2 hours

    @@ -12468,7 +12474,7 @@

    Supress success

    2021-08-11 18:50:32
    -

    *Thread Reply:* I have added the notes to the meeting page: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

    +

    *Thread Reply:* I have added the notes to the meeting page: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

    @@ -12494,7 +12500,7 @@

    Supress success

    2021-08-11 18:51:19
    -

    *Thread Reply:* The recording of the meeting is linked there: +

    *Thread Reply:* The recording of the meeting is linked there: https://us02web.zoom.us/rec/share/2k4O-Rjmmd5TYXzT-pEQsbYXt6o4V6SnS6Vi7a27BPve9aoMmjm-bP8UzBBzsFzg.uY1je-PyT4qTgYLZ?startTime=1628697944000 • Passcode: =RBUj01C

    @@ -12548,7 +12554,7 @@

    Supress success

    2021-08-11 13:36:43
    -

*Thread Reply:* @Daniel Avancini I'm working on it. Some changes in Airflow made the current approach unfeasible, so a slight change in the way we capture events is needed. You can take a look at progress here: https://github.com/OpenLineage/OpenLineage/tree/airflow/2

    +

*Thread Reply:* @Daniel Avancini I'm working on it. Some changes in Airflow made the current approach unfeasible, so a slight change in the way we capture events is needed. You can take a look at progress here: https://github.com/OpenLineage/OpenLineage/tree/airflow/2

    @@ -12574,7 +12580,7 @@

    Supress success

    2021-08-11 13:48:36
    -

    *Thread Reply:* Thank you Maciej. I'll take a look

    +

    *Thread Reply:* Thank you Maciej. I'll take a look

    @@ -12742,7 +12748,7 @@

    Supress success

    2021-08-26 16:46:53
    -

    *Thread Reply:* The logicalPlan facet currently returns the Logical Plan, not the physical plan. This means you end up with expressions like Aggregate and Join rather than WholeStageCodegen and Exchange. I don't know if it's possible to reverse engineer the SQL- it's worth looking into the API and trying to find a way to generate that

    +

    *Thread Reply:* The logicalPlan facet currently returns the Logical Plan, not the physical plan. This means you end up with expressions like Aggregate and Join rather than WholeStageCodegen and Exchange. I don't know if it's possible to reverse engineer the SQL- it's worth looking into the API and trying to find a way to generate that

    @@ -12893,7 +12899,7 @@

    Supress success

    2021-08-31 14:30:00
    -

    *Thread Reply:* Hey, @Erick Navarro 👋 . Are you using the openlineage-spark lib? (Note, the marquez-spark lib has been deprecated)

    +

    *Thread Reply:* Hey, @Erick Navarro 👋 . Are you using the openlineage-spark lib? (Note, the marquez-spark lib has been deprecated)

    @@ -12919,7 +12925,7 @@

    Supress success

    2021-08-31 14:43:20
    -

    *Thread Reply:* My team had this issue as well. Our read of the error is that Databricks attempts to register the listener before installing packages defined with either spark.jars or spark.jars.packages. Since the listener lib is not yet installed, the listener cannot be found. To solve the issue, we

    +

    *Thread Reply:* My team had this issue as well. Our read of the error is that Databricks attempts to register the listener before installing packages defined with either spark.jars or spark.jars.packages. Since the listener lib is not yet installed, the listener cannot be found. To solve the issue, we

    1. copy the OL JAR to a staging directory on DBFS (we use /dbfs/databricks/init/lineage)
    2. using an init script, copy the JAR from the staging directory to the default JAR location for the Databricks driver -- /mnt/driver-daemon/jars
3. Within the same init script, write the spark config parameters to a .conf file in /databricks/driver/conf (we use open-lineage.conf). The .conf file will be read by the driver on initialization. It should follow this format (lineage_host_url should point to your API; a sketch of the whole setup follows below):
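A sketch of steps 2-3 as a one-time setup cell, assuming this runs in a Databricks notebook (where dbutils is predefined), that the OL jar was already staged per step 1, and that paths, host, and config keys are placeholders:

# Assumes the OL jar was already copied to the staging directory (step 1).
init_script = """#!/bin/bash
# Step 2: copy the JAR from staging to the driver's default JAR location.
cp /dbfs/databricks/init/lineage/openlineage-spark.jar /mnt/driver-daemon/jars/
# Step 3: write the spark config the driver reads on initialization.
cat > /databricks/driver/conf/open-lineage.conf <<'EOF'
[driver] {
  "spark.extraListeners" = "io.openlineage.spark.agent.OpenLineageSparkListener"
  "spark.openlineage.host" = "https://my-lineage-api.example.com"
  "spark.openlineage.namespace" = "databricks"
}
EOF
"""
dbutils.fs.put("dbfs:/databricks/init/lineage/open-lineage-init.sh", init_script, True)

@@ -12965,7 +12971,7 @@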

      Supress success

    2021-08-31 14:51:46
    -

    *Thread Reply:* Absolutely, thanks for elaborating on your spark + OL deployment process and I think that’d be great to document. @Michael Collado what are your thoughts?

    +

    *Thread Reply:* Absolutely, thanks for elaborating on your spark + OL deployment process and I think that’d be great to document. @Michael Collado what are your thoughts?

    @@ -12991,7 +12997,7 @@

    Supress success

    2021-08-31 14:57:02
    -

    *Thread Reply:* I haven't tried with Databricks specifically, but there should be no issue registering the OL listener in the Spark config as long as it's done before the Spark session is created- e.g., this example from the README works fine in a vanilla Jupyter notebook- https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#openlineagesparklistener-as-a-plain-spark-listener

    +

    *Thread Reply:* I haven't tried with Databricks specifically, but there should be no issue registering the OL listener in the Spark config as long as it's done before the Spark session is created- e.g., this example from the README works fine in a vanilla Jupyter notebook- https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#openlineagesparklistener-as-a-plain-spark-listener
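A sketch of that plain-listener setup from PySpark (the package version, host, and namespace values are placeholders; the exact config keys are in the README):

from pyspark.sql import SparkSession

# Register the listener before the session exists; the session build pulls
# in the OL package and wires up the listener in one step.
spark = (
    SparkSession.builder.appName("ol_example")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.2.2")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.host", "http://localhost:5000")
    .config("spark.openlineage.namespace", "spark_integration")
    .getOrCreate()
)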

    @@ -13017,7 +13023,7 @@

    Supress success

    2021-08-31 15:11:37
    -

    *Thread Reply:* Looks like Databricks' notebooks come with a Spark instance pre-configured- configuring lineage within the SparkSession configuration doesn't seem possible- https://docs.databricks.com/notebooks/notebooks-manage.html#attach-a-notebook-to-a-cluster 😞

    +

    *Thread Reply:* Looks like Databricks' notebooks come with a Spark instance pre-configured- configuring lineage within the SparkSession configuration doesn't seem possible- https://docs.databricks.com/notebooks/notebooks-manage.html#attach-a-notebook-to-a-cluster 😞

    docs.databricks.com
    @@ -13064,7 +13070,7 @@

    Supress success

    2021-08-31 15:11:53
    -

    *Thread Reply:*

    +

    *Thread Reply:*

    @@ -13099,7 +13105,7 @@

    Supress success

    2021-08-31 15:59:38
    -

    *Thread Reply:* Right, Databricks provides preconfigured spark context / session objects. With Spline, you can set some cluster level config (e.g. spark.spline.lineageDispatcher.http.producer.url ) and install the library on the cluster, but then enable tracking at a notebook level with:

    +

    *Thread Reply:* Right, Databricks provides preconfigured spark context / session objects. With Spline, you can set some cluster level config (e.g. spark.spline.lineageDispatcher.http.producer.url ) and install the library on the cluster, but then enable tracking at a notebook level with:

%scala
import za.co.absa.spline.harvester.SparkLineageInitializer._

@@ -13130,7 +13136,7 @@

    Supress success

    2021-08-31 16:01:00
    -

    *Thread Reply:* Seems, at the very least, we need to provide a way to specify the job name at the notebook level

    +

    *Thread Reply:* Seems, at the very least, we need to provide a way to specify the job name at the notebook level

    @@ -13160,7 +13166,7 @@

    Supress success

    2021-08-31 16:03:50
    -

    *Thread Reply:* Agreed. I'd like a default that uses the notebook name that can also be overridden in the notebook.

    +

    *Thread Reply:* Agreed. I'd like a default that uses the notebook name that can also be overridden in the notebook.

    @@ -13186,7 +13192,7 @@

    Supress success

    2021-08-31 16:10:42
    -

    *Thread Reply:* if you have some insight into the available options, it would be great if you can open an issue on the OL project. I'll have to carve out some time to play with a databricks cluster and learn what options we have

    +

    *Thread Reply:* if you have some insight into the available options, it would be great if you can open an issue on the OL project. I'll have to carve out some time to play with a databricks cluster and learn what options we have

    @@ -13216,7 +13222,7 @@

    Supress success

    2021-08-31 18:26:11
    -

*Thread Reply:* Thank you @Luke Smith, the method you recommend works for me: the cluster is running and apparently it fetches the configuration. It was my first progress in over a week of testing OpenLineage in Azure Databricks. Thank you!

    +

*Thread Reply:* Thank you @Luke Smith, the method you recommend works for me: the cluster is running and apparently it fetches the configuration. It was my first progress in over a week of testing OpenLineage in Azure Databricks. Thank you!

    Now I have this:

    @@ -13253,7 +13259,7 @@

    Supress success

    2021-08-31 18:52:15
    -

    *Thread Reply:* Is this error thrown during init or job execution?

    +

    *Thread Reply:* Is this error thrown during init or job execution?

    @@ -13279,7 +13285,7 @@

    Supress success

    2021-08-31 18:55:30
    -

    *Thread Reply:* this is likely a race condition- I've seen it happen for jobs that start and complete very quickly- things like defining temp views or similar

    +

    *Thread Reply:* this is likely a race condition- I've seen it happen for jobs that start and complete very quickly- things like defining temp views or similar

    @@ -13305,7 +13311,7 @@

    Supress success

    2021-08-31 19:59:15
    -

*Thread Reply:* During the execution of the job, @Luke Smith. Thank you @Michael Collado, that was exactly the scenario: the job that I executed was empty. Now the cluster is running OK, I don't have errors, and I have run some jobs successfully, but I don't see any information in my Datakin explorer

    +

*Thread Reply:* During the execution of the job, @Luke Smith. Thank you @Michael Collado, that was exactly the scenario: the job that I executed was empty. Now the cluster is running OK, I don't have errors, and I have run some jobs successfully, but I don't see any information in my Datakin explorer

    @@ -13331,7 +13337,7 @@

    Supress success

    2021-08-31 20:00:46
    -

    *Thread Reply:* Awesome! Great to hear you’re up and running. For datakin specific questions, mind if we move the discussion to the datakin user slack channel?

    +

    *Thread Reply:* Awesome! Great to hear you’re up and running. For datakin specific questions, mind if we move the discussion to the datakin user slack channel?

    @@ -13357,7 +13363,7 @@

    Supress success

    2021-08-31 20:01:17
    -

    *Thread Reply:* Yes Willy, thank you!

    +

    *Thread Reply:* Yes Willy, thank you!

    @@ -13383,7 +13389,7 @@

    Supress success

    2021-09-02 10:06:00
    -

*Thread Reply:* Hi, @Luke Smith, thank you for your help. Are you familiar with this error in Azure Databricks when you use OL?

    +

*Thread Reply:* Hi, @Luke Smith, thank you for your help. Are you familiar with this error in Azure Databricks when you use OL?

    @@ -13418,7 +13424,7 @@

    Supress success

    2021-09-02 10:07:07
    -

    *Thread Reply:*

    +

    *Thread Reply:*

    @@ -13453,7 +13459,7 @@

    Supress success

    2021-09-02 10:17:17
    -

    *Thread Reply:* I found the solution here: +

    *Thread Reply:* I found the solution here: https://docs.microsoft.com/en-us/answers/questions/170730/handshake-fails-trying-to-connect-from-azure-datab.html

    @@ -13510,7 +13516,7 @@

    Supress success

    2021-09-02 10:17:28
    -

    *Thread Reply:* It works now! 😄

    +

    *Thread Reply:* It works now! 😄

    @@ -13540,7 +13546,7 @@

    Supress success

    2021-09-02 16:33:01
    -

*Thread Reply:* @Erick Navarro This might be helpful to add to our openlineage spark docs for others trying out openlineage-spark with Databricks. Let me know if that’s something you’d like to contribute 🙂

    +

*Thread Reply:* @Erick Navarro This might be helpful to add to our openlineage spark docs for others trying out openlineage-spark with Databricks. Let me know if that’s something you’d like to contribute 🙂

    @@ -13566,7 +13572,7 @@

    Supress success

    2021-09-02 19:59:10
    -

    *Thread Reply:* Yes of course @Willy Lulciuc, I will prepare a small tutorial for my colleagues and I will share it with you 🙂

    +

    *Thread Reply:* Yes of course @Willy Lulciuc, I will prepare a small tutorial for my colleagues and I will share it with you 🙂

    @@ -13592,7 +13598,7 @@

    Supress success

    2021-09-02 20:44:36
    -

    *Thread Reply:* Awesome. Thanks!

    +

    *Thread Reply:* Awesome. Thanks!

    @@ -13644,7 +13650,7 @@

    Supress success

    2021-09-02 04:57:55
    -

*Thread Reply:* Hey! If you mean this picture, it illustrates the concept of how OpenLineage works, not the current state of integration. We don't have Prefect support yet; however, it's on our roadmap.

    +

*Thread Reply:* Hey! If you mean this picture, it illustrates the concept of how OpenLineage works, not the current state of integration. We don't have Prefect support yet; however, it's on our roadmap.

    @@ -13719,7 +13725,7 @@

    Supress success

    2021-09-02 05:22:15
    -

    *Thread Reply:* great, thanks 🙂

    +

    *Thread Reply:* great, thanks 🙂

    @@ -13745,7 +13751,7 @@

    Supress success

    2021-09-02 11:49:48
    -

    *Thread Reply:* @Thomas Fredriksen Feel free to chime in the github issue Maciej linked if you want.

    +

    *Thread Reply:* @Thomas Fredriksen Feel free to chime in the github issue Maciej linked if you want.

    @@ -13797,7 +13803,7 @@

    Supress success

    2021-09-02 14:28:11
    -

*Thread Reply:* It is being worked on right now. @Oleksandr Dvornik is adding an integration test in the build so that we run tests for both Spark 2.4 and Spark 3. Please open an issue with the stack trace if you can. From our perspective, it should be mostly compatible, with a few exceptions like this one that we’d want to add test cases for.

    +

*Thread Reply:* It is being worked on right now. @Oleksandr Dvornik is adding an integration test in the build so that we run tests for both Spark 2.4 and Spark 3. Please open an issue with the stack trace if you can. From our perspective, it should be mostly compatible, with a few exceptions like this one that we’d want to add test cases for.

    @@ -13823,7 +13829,7 @@

    Supress success

    2021-09-02 14:36:19
    -

    *Thread Reply:* The goal is to be able to make a release in the next few weeks. The integration is being used with Spark 3 already.

    +

    *Thread Reply:* The goal is to be able to make a release in the next few weeks. The integration is being used with Spark 3 already.

    @@ -13853,7 +13859,7 @@

    Supress success

    2021-09-02 15:50:14
    -

    *Thread Reply:* Great, I'll take some time to open an issue for this particular issue and a few others.

    +

    *Thread Reply:* Great, I'll take some time to open an issue for this particular issue and a few others.

    @@ -13879,7 +13885,7 @@

    Supress success

    2021-09-02 17:33:08
    -

    *Thread Reply:* are you actually using the DatasetSource interface in any capacity? Or are you just scanning the source code to find incompatibilities?

    +

    *Thread Reply:* are you actually using the DatasetSource interface in any capacity? Or are you just scanning the source code to find incompatibilities?

    @@ -13905,7 +13911,7 @@

    Supress success

    2021-09-03 12:36:20
    -

*Thread Reply:* Turns out this has more to do with how Databricks handles the delta format. It's related to https://github.com/AbsaOSS/spline-spark-agent/issues/96.

    +

*Thread Reply:* Turns out this has more to do with how Databricks handles the delta format. It's related to https://github.com/AbsaOSS/spline-spark-agent/issues/96.

    @@ -13983,7 +13989,7 @@

    Supress success

    2021-09-03 13:42:43
    -

    *Thread Reply:* I haven't been chasing this issue down on my team -- turns out some things were lost in communication. There are really two problems here:

    +

    *Thread Reply:* I haven't been chasing this issue down on my team -- turns out some things were lost in communication. There are really two problems here:

1. When attempting to do delta I/O with Spark 3 on Databricks, e.g. insert into . . . values . . .

@@ -14069,7 +14075,7 @@

      Supress success

    2021-09-03 13:44:54
    -

    *Thread Reply:* So, the first issue is OpenLineage related directly, and the second issue applies to both OpenLineage and Spline?

    +

    *Thread Reply:* So, the first issue is OpenLineage related directly, and the second issue applies to both OpenLineage and Spline?

    @@ -14095,7 +14101,7 @@

    Supress success

    2021-09-03 13:45:49
    -

    *Thread Reply:* Yes, that's my read of what I'm getting from others on the team.

    +

    *Thread Reply:* Yes, that's my read of what I'm getting from others on the team.

    @@ -14121,7 +14127,7 @@

    Supress success

    2021-09-03 13:46:56
    -

    *Thread Reply:* For the first issue- can you give some details about the target of the INSERT INTO... ? Is it a data source defined in Databricks? a Hive table? a view on GCS?

    +

    *Thread Reply:* For the first issue- can you give some details about the target of the INSERT INTO... ? Is it a data source defined in Databricks? a Hive table? a view on GCS?

    @@ -14147,7 +14153,7 @@

    Supress success

    2021-09-03 13:47:40
    -

    *Thread Reply:* oh, it's a Delta table?

    +

    *Thread Reply:* oh, it's a Delta table?

    @@ -14173,7 +14179,7 @@

    Supress success

    2021-09-03 14:48:15
    -

    *Thread Reply:* Yes, it's created via

    +

    *Thread Reply:* Yes, it's created via

    CREATE TABLE . . . using DELTA location "/dbfs/mnt/ . . . "

    @@ -14286,7 +14292,7 @@

    Supress success

    2021-09-04 12:53:54
    -

    *Thread Reply:* Apache Beam integration? I have a very crude integration at the moment. Maybe it’s better to integrate on the orchestration level (airflow, luigi). Thoughts?

    +

    *Thread Reply:* Apache Beam integration? I have a very crude integration at the moment. Maybe it’s better to integrate on the orchestration level (airflow, luigi). Thoughts?

    @@ -14312,7 +14318,7 @@

    Supress success

    2021-09-05 13:06:19
    -

    *Thread Reply:* I think it makes a lot of sense to have a Beam level integration similar to the spark one. Feel free to post a draft PR if you want to share.

    +

    *Thread Reply:* I think it makes a lot of sense to have a Beam level integration similar to the spark one. Feel free to post a draft PR if you want to share.

    @@ -14338,7 +14344,7 @@

    Supress success

    2021-09-07 21:04:09
    -

    *Thread Reply:* I have added Beam as a topic for the roadmap discussion slide: https://docs.google.com/presentation/d/1fI0u8aE0iX9vG4GGrnQYAEcsJM9z7Rlv/edit#slide=id.ge7d4b64ef4_0_0

    +

    *Thread Reply:* I have added Beam as a topic for the roadmap discussion slide: https://docs.google.com/presentation/d/1fI0u8aE0iX9vG4GGrnQYAEcsJM9z7Rlv/edit#slide=id.ge7d4b64ef4_0_0

    @@ -14390,7 +14396,7 @@

    Supress success

    2021-09-07 21:03:32
    -

    *Thread Reply:* There will be a quick demo of the dbt integration (thanks @Willy Lulciuc!)

    +

    *Thread Reply:* There will be a quick demo of the dbt integration (thanks @Willy Lulciuc!)

    @@ -14420,7 +14426,7 @@

    Supress success

    2021-09-07 21:05:13
    -

    *Thread Reply:* Information to join and archive of previous meetings: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

    +

    *Thread Reply:* Information to join and archive of previous meetings: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

    @@ -14446,7 +14452,7 @@

    Supress success

    2021-09-08 14:49:52
    -

    *Thread Reply:* The recording and notes are now available: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

    +

    *Thread Reply:* The recording and notes are now available: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

    @@ -14472,7 +14478,7 @@

    Supress success

    2021-09-08 21:58:09
    -

    *Thread Reply:* Good meeting today. @Julien Le Dem. Thanks

    +

    *Thread Reply:* Good meeting today. @Julien Le Dem. Thanks

    @@ -14572,7 +14578,7 @@

    Supress success

    2021-09-08 04:27:04
    -

*Thread Reply:* We're working on updating our integration to Airflow 2. Some changes in Airflow made the current approach unfeasible, so a slight change in the way we capture events is needed. You can take a look at progress here: https://github.com/OpenLineage/OpenLineage/tree/airflow/2

    +

*Thread Reply:* We're working on updating our integration to Airflow 2. Some changes in Airflow made the current approach unfeasible, so a slight change in the way we capture events is needed. You can take a look at progress here: https://github.com/OpenLineage/OpenLineage/tree/airflow/2

    @@ -14622,7 +14628,7 @@

    Supress success

    2021-09-08 04:27:38
    -

*Thread Reply:* Thanks @Maciej Obuchowski. When is this expected to land in a release?

    +

*Thread Reply:* Thanks @Maciej Obuchowski. When is this expected to land in a release?

    @@ -14648,7 +14654,7 @@

    Supress success

    2021-11-11 06:35:24
    -

    *Thread Reply:* hi @Maciej Obuchowski I wanted to follow up on this to understand when the more recent BQ Operators will be supported, specifically BigQueryInsertJobOperator

    +

    *Thread Reply:* hi @Maciej Obuchowski I wanted to follow up on this to understand when the more recent BQ Operators will be supported, specifically BigQueryInsertJobOperator

    @@ -14765,7 +14771,7 @@

    Supress success

    2021-09-13 15:40:01
    -

*Thread Reply:* Welcome, @Jose Badeau 👋. That’s exciting to hear, as we have Beam, Flink and Iceberg on our roadmap! You’re welcome to join the discussion :)

    +

*Thread Reply:* Welcome, @Jose Badeau 👋. That’s exciting to hear, as we have Beam, Flink and Iceberg on our roadmap! You’re welcome to join the discussion :)

    @@ -14979,7 +14985,7 @@

    Supress success

    2021-09-14 05:03:45
    -

*Thread Reply:* Hello Tomas. We are currently working on support for Spark v3. Can you please raise an issue with the stack trace? That would help us track and solve it. We are currently adding integration tests. The next step would be to fix changes in method signatures for v3 (which is what you're hitting).

    +

*Thread Reply:* Hello Tomas. We are currently working on support for Spark v3. Can you please raise an issue with the stack trace? That would help us track and solve it. We are currently adding integration tests. The next step would be to fix changes in method signatures for v3 (which is what you're hitting).

    @@ -15005,7 +15011,7 @@

    Supress success

    2021-09-14 05:12:45
    -

    *Thread Reply:* Hi @Oleksandr Dvornik i raised https://github.com/OpenLineage/OpenLineage/issues/272

    +

    *Thread Reply:* Hi @Oleksandr Dvornik i raised https://github.com/OpenLineage/OpenLineage/issues/272

    @@ -15308,7 +15314,7 @@

    Supress success

    2021-09-14 08:55:44
    -

    *Thread Reply:* I think this might be actually spark issue: https://stackoverflow.com/questions/53787624/spark-throwing-arrayindexoutofboundsexception-when-parallelizing-list/53787847

    +

    *Thread Reply:* I think this might be actually spark issue: https://stackoverflow.com/questions/53787624/spark-throwing-arrayindexoutofboundsexception-when-parallelizing-list/53787847

    Stack Overflow
    @@ -15359,7 +15365,7 @@

    Supress success

    2021-09-14 08:56:10
    -

*Thread Reply: Can you try a newer version in the 2.4.* line, like 2.4.7?

    +

*Thread Reply:* Can you try a newer version in the 2.4.* line, like 2.4.7?

    @@ -15389,7 +15395,7 @@

    Supress success

    2021-09-14 08:57:30
    -

*Thread Reply:* This might also be a Spark 2.4 with Scala 2.12 issue - I'd recommend the 2.11 versions.

    +

*Thread Reply:* This might also be a Spark 2.4 with Scala 2.12 issue - I'd recommend the 2.11 versions.

    @@ -15415,7 +15421,7 @@

    Supress success

    2021-09-14 09:04:26
    -

    *Thread Reply:* @Maciej Obuchowski with 2.4.7 i get following exc:

    +

    *Thread Reply:* @Maciej Obuchowski with 2.4.7 i get following exc:

    @@ -15441,7 +15447,7 @@

    Supress success

    2021-09-14 09:04:27
    -

    *Thread Reply:* 21/09/14 15:03:25 WARN RddExecutionContext: Unable to access job conf from RDD +

*Thread Reply:* 21/09/14 15:03:25 WARN RddExecutionContext: Unable to access job conf from RDD
java.lang.NoSuchFieldException: config$1
    at java.base/java.lang.Class.getDeclaredField(Class.java:2411)

    @@ -15469,7 +15475,7 @@

    Supress success

    2021-09-14 09:04:48
    -

    *Thread Reply:* i can also try to switch to 2.11 scala

    +

    *Thread Reply:* i can also try to switch to 2.11 scala

    @@ -15495,7 +15501,7 @@

    Supress success

    2021-09-14 09:05:37
    -

    *Thread Reply:* or do you have some recommended setup that works for sure?

    +

    *Thread Reply:* or do you have some recommended setup that works for sure?

    @@ -15521,7 +15527,7 @@

    Supress success

    2021-09-14 09:09:58
    -

    *Thread Reply:* One more check - you're using Java 8 with this, right?

    +

    *Thread Reply:* One more check - you're using Java 8 with this, right?

    @@ -15547,7 +15553,7 @@

    Supress success

    2021-09-14 09:10:17
    -

    *Thread Reply:* This is what works for me: +

*Thread Reply:* This is what works for me:
-> % cat tools/spark-2.4/RELEASE
Spark 2.4.8 (git revision 4be4064) built for Hadoop 2.7.3
Build flags: -B -Pmesos -Pyarn -Pkubernetes -Pflume -Psparkr -Pkafka-0-8 -Phadoop-2.7 -Phive -Phive-thriftserver -DzincPort=3036

    @@ -15576,7 +15582,7 @@

    Supress success

    2021-09-14 09:11:23
    -

    *Thread Reply:* spark-shell: +

    *Thread Reply:* spark-shell: Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)

    @@ -15603,7 +15609,7 @@

    Supress success

    2021-09-14 09:12:05
    -

    *Thread Reply:* awesome let me try 🙂

    +

    *Thread Reply:* awesome let me try 🙂

    @@ -15629,7 +15635,7 @@

    Supress success

    2021-09-14 09:26:00
    -

*Thread Reply:* data has been sent to marquez. coolio. However I noticed a NullPointerException being thrown: 21/09/14 15:23:53 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception +

*Thread Reply:* data has been sent to marquez. coolio. However I noticed a NullPointerException being thrown:
21/09/14 15:23:53 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception
java.lang.NullPointerException
    at io.openlineage.spark.agent.OpenLineageSparkListener.onJobEnd(OpenLineageSparkListener.java:164)
    at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:39)

@@ -15670,7 +15676,7 @@

    Supress success

    2021-09-14 10:59:45
    -

    *Thread Reply:* closed related issue #274

    +

    *Thread Reply:* closed related issue #274

    @@ -15750,7 +15756,7 @@

    Supress success

    2021-09-14 13:38:09
    -

    *Thread Reply:* Not at the moment, but it is in scope. You are welcome to open an issue with your example to track this or even propose an implementation if you have the time.

    +

    *Thread Reply:* Not at the moment, but it is in scope. You are welcome to open an issue with your example to track this or even propose an implementation if you have the time.

    @@ -15776,7 +15782,7 @@

    Supress success

    2021-09-14 15:12:01
    -

*Thread Reply:* @Tomas Satka it would be great if you could add a containerized integration test for Kafka with your test case. You can take this as an example here

    +

*Thread Reply:* @Tomas Satka it would be great if you could add a containerized integration test for Kafka with your test case. You can take this as an example here

    @@ -16033,7 +16039,7 @@

    Supress success

    2021-09-14 18:02:05
    -

*Thread Reply:* Hi @Oleksandr Dvornik, I wrote a test for a simple read/write from a Kafka topic using the Kafka testcontainer. However, I discovered a bug. When writing to a Kafka topic I get java.lang.IllegalArgumentException: One of the following options must be specified for Kafka source: subscribe, subscribepattern, assign. See the docs for more details.

    +

*Thread Reply:* Hi @Oleksandr Dvornik, I wrote a test for a simple read/write from a Kafka topic using the Kafka testcontainer. However, I discovered a bug. When writing to a Kafka topic I get java.lang.IllegalArgumentException: One of the following options must be specified for Kafka source: subscribe, subscribepattern, assign. See the docs for more details.

• How would you like me to add the test? Fork OpenLineage and create a PR?

    @@ -16061,7 +16067,7 @@

    Supress success

    2021-09-14 18:02:50
    -

*Thread Reply:* • Shall I raise a bug for writing to Kafka, which should take only "topic" instead of "subscribe"?

    +

*Thread Reply:* • Shall I raise a bug for writing to Kafka, which should take only "topic" instead of "subscribe"?

    @@ -16087,7 +16093,7 @@

    Supress success

    2021-09-14 18:03:42
    -

*Thread Reply:* • Since I don't know the expected payload for the OpenLineage mock server, can somebody help me create it?

    +

*Thread Reply:* • Since I don't know the expected payload for the OpenLineage mock server, can somebody help me create it?

    @@ -16113,7 +16119,7 @@

    Supress success

    2021-09-14 19:06:41
    -

*Thread Reply:* Hi @Tomas Satka, yes you should create a fork and raise a PR from that. For more details, please take a look at. Not sure about Kafka, cause we don't have that integration yet. About the expected payload, as a first step, I would suggest leaving that test without an assertion for now. The second step would be investigation (what we can get from that plan node). The third step - implementation and asserting a payload. Basically we parse the Spark optimized plan, and get as much information as we can for a specific implementation. You can take a look at the recent PR for Hive. We visit the root node and leaves to get output datasets and input datasets accordingly.

    +

*Thread Reply:* Hi @Tomas Satka, yes you should create a fork and raise a PR from that. For more details, please take a look at. Not sure about Kafka, cause we don't have that integration yet. About the expected payload, as a first step, I would suggest leaving that test without an assertion for now. The second step would be investigation (what we can get from that plan node). The third step - implementation and asserting a payload. Basically we parse the Spark optimized plan, and get as much information as we can for a specific implementation. You can take a look at the recent PR for Hive. We visit the root node and leaves to get output datasets and input datasets accordingly.

    @@ -16172,7 +16178,7 @@

    Supress success

    2021-09-15 04:37:59
    -

    *Thread Reply:* Hi @Oleksandr Dvornik PR for step one : https://github.com/OpenLineage/OpenLineage/pull/279

    +

    *Thread Reply:* Hi @Oleksandr Dvornik PR for step one : https://github.com/OpenLineage/OpenLineage/pull/279

    @@ -16313,7 +16319,7 @@

    Supress success

    2021-09-15 09:19:23
    -

    *Thread Reply:* Hey Thomas!

    +

    *Thread Reply:* Hey Thomas!

Following what we do with Airflow, yes, I think that each task should be mapped to a job.

    @@ -16349,7 +16355,7 @@

    Supress success

    2021-09-15 09:26:21
    -

    *Thread Reply:* this is very helpful, thank you

    +

    *Thread Reply:* this is very helpful, thank you

    @@ -16375,7 +16381,7 @@

    Supress success

    2021-09-15 09:26:43
    -

*Thread Reply:* what would be the namespace used in the Job definition of each task?

    +

*Thread Reply:* what would be the namespace used in the Job definition of each task?

    @@ -16401,7 +16407,7 @@

    Supress success

    2021-09-15 09:31:34
    -

*Thread Reply:* In contrast to dataset namespaces - which we try to standardize - job namespaces should be provided by the user, or the operator of a particular scheduler.

    +

*Thread Reply:* In contrast to dataset namespaces - which we try to standardize - job namespaces should be provided by the user, or the operator of a particular scheduler.

For example, it would be good if it helped you identify the Prefect instance where the job was run.

    @@ -16429,7 +16435,7 @@

    Supress success

    2021-09-15 09:32:23
    -

*Thread Reply:* If you use the openlineage-python client, you can provide the namespace either in the client constructor, or via the OPENLINEAGE_NAMESPACE env variable.

    +

*Thread Reply:* If you use the openlineage-python client, you can provide the namespace either in the client constructor, or via the OPENLINEAGE_NAMESPACE env variable.
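The env-var route, sketched (the namespace value is made up):

import os

# Set before the client/integration is constructed; it becomes the default
# job namespace for emitted events.
os.environ["OPENLINEAGE_NAMESPACE"] = "prefect-prod"
namespace = os.getenv("OPENLINEAGE_NAMESPACE", "default")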

    @@ -16455,7 +16461,7 @@

    Supress success

    2021-09-15 09:32:55
    -

    *Thread Reply:* awesome, thank you 🙂

    +

    *Thread Reply:* awesome, thank you 🙂

    @@ -16481,7 +16487,7 @@

    Supress success

    2021-09-15 17:03:07
    -

    *Thread Reply:* Hey @Thomas Fredriksen - just chiming in, I’m also keen for a prefect integration. Let me know if I can help out at all

    +

    *Thread Reply:* Hey @Thomas Fredriksen - just chiming in, I’m also keen for a prefect integration. Let me know if I can help out at all

    @@ -16507,7 +16513,7 @@

    Supress success

    2021-09-15 17:27:20
    -

    *Thread Reply:* Please chime in on https://github.com/OpenLineage/OpenLineage/issues/81

    +

    *Thread Reply:* Please chime in on https://github.com/OpenLineage/OpenLineage/issues/81

    @@ -16557,7 +16563,7 @@

    Supress success

    2021-09-15 18:29:20
    -

    *Thread Reply:* Done!

    +

    *Thread Reply:* Done!

    @@ -16587,7 +16593,7 @@

    Supress success

    2021-09-16 00:06:41
    -

    *Thread Reply:* For now I'm prototyping in a separate repo https://github.com/limx0/caching_flow_runner/tree/open_lineage

    +

    *Thread Reply:* For now I'm prototyping in a separate repo https://github.com/limx0/caching_flow_runner/tree/open_lineage

    @@ -16613,7 +16619,7 @@

    Supress success

    2021-09-17 01:55:08
    -

*Thread Reply:* I really like your PR, @Brad. I think that using FlowRunner and TaskRunner may be a more "proper" way of doing this, as opposed to adding a state-handler to each task the way I do it.

    +

*Thread Reply:* I really like your PR, @Brad. I think that using FlowRunner and TaskRunner may be a more "proper" way of doing this, as opposed to adding a state-handler to each task the way I do it.

    How are you dealing with Prefect-library tasks such as the included BigQuery-tasks and such? Is it necessary to create DatasetTask for them to show up in the lineage graph?

    @@ -16641,7 +16647,7 @@

    Supress success

    2021-09-17 02:04:19
    -

    *Thread Reply:* Hey @Thomas Fredriksen! At the moment I'm not dealing with any task-specific things. The plan (in my head, and after speaking with another prefect user @davzucky) would be that we add a LineageTask subclass where you could define custom facets on a per task basis

    +

    *Thread Reply:* Hey @Thomas Fredriksen! At the moment I'm not dealing with any task-specific things. The plan (in my head, and after speaking with another prefect user @davzucky) would be that we add a LineageTask subclass where you could define custom facets on a per task basis
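Purely hypothetical, since this was still a plan: such a subclass might look like the sketch below (the lineage_facets attribute is invented for illustration; only the Task base class is Prefect's):

from typing import Optional

from prefect import Task  # Prefect 0.x/1.x-style task, as in the prototype repo

class LineageTask(Task):
    """Hypothetical: carries custom OL facets for a TaskRunner to attach."""

    def __init__(self, lineage_facets: Optional[dict] = None, **kwargs):
        super().__init__(**kwargs)
        # An OpenLineage-aware TaskRunner could read this attribute and merge
        # it into the facets of the events it emits for this task.
        # (A real task would also define run(); omitted in this sketch.)
        self.lineage_facets = lineage_facets or {}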

    @@ -16667,7 +16673,7 @@

    Supress success

    2021-09-17 02:05:21
    -

    *Thread Reply:* or some sort of other hook where basically you would define some lineage attribute or put something in the prefect.context that the TaskRunner would find and attach

    +

    *Thread Reply:* or some sort of other hook where basically you would define some lineage attribute or put something in the prefect.context that the TaskRunner would find and attach

    @@ -16693,7 +16699,7 @@

    Supress success

    2021-09-17 02:06:23
    -

    *Thread Reply:* Sorry I misread your question - any tasks should be automatically tracked (I believe but have not tested yet!)

    +

    *Thread Reply:* Sorry I misread your question - any tasks should be automatically tracked (I believe but have not tested yet!)

    @@ -16719,7 +16725,7 @@

    Supress success

    2021-09-17 02:16:02
    -

    *Thread Reply:* @Brad Could you elaborate a bit on your ideas around adding custom context attributes?

    +

    *Thread Reply:* @Brad Could you elaborate a bit on your ideas around adding custom context attributes?

    @@ -16745,7 +16751,7 @@

    Supress success

    2021-09-17 02:21:57
    -

    *Thread Reply:* yeah so basically we just need some hooks that you can easily access from the task decorator or somewhere else that we can pass through to the open lineage adapter to do things like custom facets

    +

    *Thread Reply:* yeah so basically we just need some hooks that you can easily access from the task decorator or somewhere else that we can pass through to the open lineage adapter to do things like custom facets

    @@ -16771,7 +16777,7 @@

    Supress success

    2021-09-17 02:24:31
    -

    *Thread Reply:* like for your bigquery example - you might want to record some facets like in https://github.com/OpenLineage/OpenLineage/blob/main/integration/common/openlineage/common/provider/bigquery.py and we need a way to do that with the Prefect bigquery task

    +

    *Thread Reply:* like for your bigquery example - you might want to record some facets like in https://github.com/OpenLineage/OpenLineage/blob/main/integration/common/openlineage/common/provider/bigquery.py and we need a way to do that with the Prefect bigquery task

    @@ -16797,7 +16803,7 @@

    Supress success

    2021-09-17 02:28:28
    -

    *Thread Reply:* @davzucky

    +

    *Thread Reply:* @davzucky

    @@ -16823,7 +16829,7 @@

    Supress success

    2021-09-17 02:29:12
    -

    *Thread Reply:* I see. Is this supported by the airflow-integration?

    +

    *Thread Reply:* I see. Is this supported by the airflow-integration?

    @@ -16849,7 +16855,7 @@

    Supress success

    2021-09-17 02:29:32
    -

    *Thread Reply:* I think so, yes

    +

    *Thread Reply:* I think so, yes

    @@ -16875,7 +16881,7 @@

    Supress success

    2021-09-17 02:30:51
    2021-09-17 02:31:54
    -

    *Thread Reply:* (I don't actually use airflow or bigquery - but for my own use case I can see wanting to do thing like this)

    +

    *Thread Reply:* (I don't actually use airflow or bigquery - but for my own use case I can see wanting to do thing like this)

    @@ -16927,7 +16933,7 @@

    Supress success

    2021-09-17 03:18:27
    -

    *Thread Reply:* Interesting, I like how dynamic this is

    +

    *Thread Reply:* Interesting, I like how dynamic this is

    @@ -16980,7 +16986,7 @@

    Supress success

    2021-09-15 09:29:25
    -

    *Thread Reply:* Hey. +

*Thread Reply:* Hey. Generally, the dataSource name should be the namespace of a particular dataset.

In some cases, like Postgres, the dataSource facet is additionally used to provide connection strings, with info like the particular host and port that we're connected to.
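For example, a Postgres dataset might carry the facet like this sketch (host, port, and database are made up; the required _producer/_schemaURL keys are omitted for brevity):

input_dataset = {
    "namespace": "postgres://db.example.com:5432",
    "name": "public.orders",
    "facets": {
        # dataSource facet: the connection info behind the dataset's namespace.
        "dataSource": {
            "name": "postgres://db.example.com:5432",
            "uri": "postgres://db.example.com:5432/sales",
        }
    },
}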

    @@ -17011,7 +17017,7 @@

    Supress success

    2021-09-15 10:11:19
    -

    *Thread Reply:* Thanks. So then perhaps marquez could differentiate a bit more between job & dataset namespaces. Right now it doesn't quite feel right to have a single global list of namespaces for jobs & datasets, especially as they also have a separate concept of sources (which are not in a namespace).

    +

    *Thread Reply:* Thanks. So then perhaps marquez could differentiate a bit more between job & dataset namespaces. Right now it doesn't quite feel right to have a single global list of namespaces for jobs & datasets, especially as they also have a separate concept of sources (which are not in a namespace).

    @@ -17037,7 +17043,7 @@

    Supress success

    2021-09-15 10:18:59
    -

    *Thread Reply:* @Willy Lulciuc what do you think?

    +

    *Thread Reply:* @Willy Lulciuc what do you think?

    @@ -17063,7 +17069,7 @@

    Supress success

    2021-09-15 10:41:20
    -

    *Thread Reply:* As an example, in marquez I have this list of namespaces (from some sample data): dbt-sales, default, <snowflake://my-account1>, <snowflake://my-account2>. +

    *Thread Reply:* As an example, in marquez I have this list of namespaces (from some sample data): dbt-sales, default, <snowflake://my-account1>, <snowflake://my-account2>. I think the new marquez UI with the nice namespace dropdown and job/dataset search is awesome, and I'd expect to be able to filter by job namespace everywhere, but how about being able to filter datasets by source (which would be populated by the OL dataset namespace) and not persist dataset namespaces in the global namespace table?

    @@ -17126,7 +17132,7 @@

    Supress success

    2021-09-15 18:43:18
    -

    *Thread Reply:* Last point is that we should persist the configuration and not just have it in environment variables. What is the best way to do this in dbt?

    +

    *Thread Reply:* Last point is that we should persist the configuration and not just have it in environment variables. What is the best way to do this in dbt?

    @@ -17152,7 +17158,7 @@

    Supress success

    2021-09-15 18:49:21
    -

    *Thread Reply:* We could have something similar to https://docs.getdbt.com/dbt-cli/configure-your-profile - or even put our config in there

    +

    *Thread Reply:* We could have something similar to https://docs.getdbt.com/dbt-cli/configure-your-profile - or even put our config in there

    @@ -17182,7 +17188,7 @@

    Supress success

    2021-09-15 18:51:42
    -

*Thread Reply:* I think we should assume that variables/config should be set and valid - and fail the run if they aren't. After all, if someone didn't need lineage events, they wouldn't use our wrapper.

    +

*Thread Reply:* I think we should assume that variables/config should be set and valid - and fail the run if they aren't. After all, if someone didn't need lineage events, they wouldn't use our wrapper.

    @@ -17208,7 +17214,7 @@

    Supress success

    2021-09-15 18:56:36
    -

    *Thread Reply:* 3rd point would be easy to address if we could send events async/in parallel. But there could be dataset version dependencies, and we don't want to get into needless complexity of recognizing that, building a dag etc.

    +

    *Thread Reply:* 3rd point would be easy to address if we could send events async/in parallel. But there could be dataset version dependencies, and we don't want to get into needless complexity of recognizing that, building a dag etc.

    We could batch events if the network roundtrips are responsible for majority of the slowdown. However, we can't assume any particular environment.

    @@ -17242,7 +17248,7 @@

    Supress success

    2021-09-15 18:58:22
    -

*Thread Reply:* About the second point, I want to add detection of whether we already have a parent run - for example, if running via airflow. If not, creating a run for this purpose is a good idea.

    +

*Thread Reply:* About the second point, I want to add detection of whether we already have a parent run - for example, if running via airflow. If not, creating a run for this purpose is a good idea.

    @@ -17268,7 +17274,7 @@

    Supress success

    2021-09-15 21:31:35
    -

    *Thread Reply:* @Maciej Obuchowski can you open github issues to propose those changes?

    +

    *Thread Reply:* @Maciej Obuchowski can you open github issues to propose those changes?

    @@ -17294,7 +17300,7 @@

    Supress success

    2021-09-16 09:11:31
    -

    *Thread Reply:* Done

    +

    *Thread Reply:* Done

    @@ -17320,7 +17326,7 @@

    Supress success

    2021-09-16 12:05:10
    -

    *Thread Reply:* FWIW, I have been putting my config in ~/.openlineage/config so it can be mapped into a container

    +

    *Thread Reply:* FWIW, I have been putting my config in ~/.openlineage/config so it can be mapped into a container

    @@ -17346,7 +17352,7 @@

    Supress success

    2021-09-16 17:56:23
    -

    *Thread Reply:* Makes sense, also, all clients could use that config

    +

    *Thread Reply:* Makes sense, also, all clients could use that config

    @@ -17372,7 +17378,7 @@

    Supress success

    2021-10-18 04:47:08
    -

    *Thread Reply:* if dbt could actually stream the events, that would be great.

    +

    *Thread Reply:* if dbt could actually stream the events, that would be great.

    @@ -17398,7 +17404,7 @@

    Supress success

    2021-10-18 09:59:12
    -

    *Thread Reply:* Unfortunately, this seems very unlikely for now, due to the fact that we rely on metadata files that dbt only produces after end of execution.

    +

    *Thread Reply:* Unfortunately, this seems very unlikely for now, due to the fact that we rely on metadata files that dbt only produces after end of execution.

    @@ -17515,7 +17521,7 @@

    Supress success

    2021-09-16 00:16:44
    -

    *Thread Reply:* This might be useful to others https://github.com/OpenLineage/OpenLineage/pull/284

    +

    *Thread Reply:* This might be useful to others https://github.com/OpenLineage/OpenLineage/pull/284

    @@ -17541,7 +17547,7 @@

    Supress success

    2021-09-16 00:18:44
    -

    *Thread Reply:* So I'm trying to push a simple event to marquez, but getting the following response: +

*Thread Reply:* So I'm trying to push a simple event to marquez, but getting the following response:
'{"code":400,"message":"Unable to process JSON"}'
The JSON I'm pushing:

    @@ -17586,7 +17592,7 @@

    Supress success

    2021-09-16 02:41:11
    -

    *Thread Reply:* What I did previously when debugging something like this was to remove half of the payload until I found the culprit. Binary search essentially. I was running Marquez locally, so probably could’ve enabled better logging as well. +

    *Thread Reply:* What I did previously when debugging something like this was to remove half of the payload until I found the culprit. Binary search essentially. I was running Marquez locally, so probably could’ve enabled better logging as well. Aren’t inputs and facets arrays?

    @@ -17617,7 +17623,7 @@

    Supress success

    2021-09-16 03:14:54
    -

    *Thread Reply:* Thanks for the response @marko - this is a greatly reduced payload already (but I'll keep going). Yep they are supposed to be arrays (I've since fixed that)

    +

    *Thread Reply:* Thanks for the response @marko - this is a greatly reduced payload already (but I'll keep going). Yep they are supposed to be arrays (I've since fixed that)

    @@ -17643,7 +17649,7 @@

    Supress success

    2021-09-16 03:46:01
    -

    *Thread Reply:* okay it was my timestamp 🥲

    +

    *Thread Reply:* okay it was my timestamp 🥲

    @@ -17669,7 +17675,7 @@

    Supress success

    2021-09-16 19:07:16
    -

*Thread Reply:* Okay - I've got a simple working example now https://github.com/limx0/caching_flow_runner/blob/open_lineage/caching_flow_runner/task_runner.py

    +

*Thread Reply:* Okay - I've got a simple working example now https://github.com/limx0/caching_flow_runner/blob/open_lineage/caching_flow_runner/task_runner.py

    @@ -17863,7 +17869,7 @@

    Supress success

    2021-09-16 19:07:37
    -

    *Thread Reply:* I might move this into a proper PR @Julien Le Dem

    +

    *Thread Reply:* I might move this into a proper PR @Julien Le Dem

    @@ -17889,7 +17895,7 @@

    Supress success

    2021-09-16 19:08:12
    -

    *Thread Reply:* Successfully got a basic prefect flow working

    +

    *Thread Reply:* Successfully got a basic prefect flow working

    @@ -17950,7 +17956,7 @@

    Supress success

    2021-09-16 09:53:24
    -

    *Thread Reply:* I think there was some talk somewhere to actually drop the DatasetType concept; can't find where though.

    +

    *Thread Reply:* I think there was some talk somewhere to actually drop the DatasetType concept; can't find where though.

    @@ -17976,7 +17982,7 @@

    Supress success

    2021-09-16 10:04:09
    -

    *Thread Reply:* I've taken a look at your repo. Looks great so far!

    +

    *Thread Reply:* I've taken a look at your repo. Looks great so far!

One thing I've noticed: I don't think you need to use any stuff from Marquez to emit events. Its lineage ingestion API is deprecated - you can just use the openlineage-python client. If there's something you think is missing from it, feel free to write that here or open an issue.
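
For reference, a minimal sketch of emitting an event with the openlineage-python client instead of the deprecated Marquez ingestion API (job and namespace names are made up, and the client API may differ slightly between versions):
```
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # e.g. a local Marquez

client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid4())),
    job=Job(namespace="prefect", name="example_flow.example_task"),  # hypothetical names
    producer="https://github.com/limx0/caching_flow_runner",  # illustrative producer URI
))
```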

    @@ -18004,7 +18010,7 @@

    Supress success

    2021-09-16 17:12:31
    -

    *Thread Reply:* And would that be replaced by just some Input/Output notion @Maciej Obuchowski?

    +

    *Thread Reply:* And would that be replaced by just some Input/Output notion @Maciej Obuchowski?

    @@ -18030,7 +18036,7 @@

    Supress success

    2021-09-16 17:13:26
    -

    *Thread Reply:* Oh yeah I got a little confused by the single lineage endpoint - but I’ve realised how it all works now. I’m still using the marquez backend to view things but I’ll use the openlineage-client to talk to it

    +

    *Thread Reply:* Oh yeah I got a little confused by the single lineage endpoint - but I’ve realised how it all works now. I’m still using the marquez backend to view things but I’ll use the openlineage-client to talk to it

    @@ -18056,7 +18062,7 @@

    Supress success

    2021-09-16 17:34:46
    -

    *Thread Reply:* Yes 🙌

    +

    *Thread Reply:* Yes 🙌

    @@ -18136,7 +18142,7 @@

    Supress success

    2021-09-16 06:59:34
    -

    *Thread Reply:* Hey, Airflow integration tests do not pass env variables to PRs from forks due to security reasons - everyone could create malicious PR and dump secrets

    +

    *Thread Reply:* Hey, Airflow integration tests do not pass env variables to PRs from forks due to security reasons - everyone could create malicious PR and dump secrets

    @@ -18162,7 +18168,7 @@

    Supress success

    2021-09-16 07:00:29
    -

    *Thread Reply:* So, they will fail and there's nothing to do from your side 🙂

    +

    *Thread Reply:* So, they will fail and there's nothing to do from your side 🙂

    @@ -18188,7 +18194,7 @@

    Supress success

    2021-09-16 07:00:55
    -

    *Thread Reply:* We probably should split those into ones that don't touch external systems, and run those for all PRs

    +

    *Thread Reply:* We probably should split those into ones that don't touch external systems, and run those for all PRs

    @@ -18214,7 +18220,7 @@

    Supress success

    2021-09-16 07:08:03
    -

    *Thread Reply:* ah okie. good to know. +

    *Thread Reply:* ah okie. good to know. and in build-integration-spark Could not resolve all artifacts. Is that also known issue? Or something from my side that i could fix?

    @@ -18241,7 +18247,7 @@

    Supress success

    2021-09-16 07:11:12
    -

    *Thread Reply:* Looks like gradle server problem? +

    *Thread Reply:* Looks like gradle server problem? &gt; Could not get resource '<https://plugins.gradle.org/m2/com/diffplug/spotless/spotless-lib/2.13.2/spotless-lib-2.13.2.module>'. &gt; Could not GET '<https://plugins.gradle.org/m2/com/diffplug/spotless/spotless-lib/2.13.2/spotless-lib-2.13.2.module>'. Received status code 500 from server: Internal Server Error

    @@ -18269,7 +18275,7 @@

    Supress success

    2021-09-16 07:34:44
    -

    *Thread Reply:* After retry, there's spotless error:

    +

    *Thread Reply:* After retry, there's spotless error:

    +········.orElse(Collections.emptyList()).stream()

    @@ -18297,7 +18303,7 @@

    Supress success

    2021-09-16 07:35:15
    -

    *Thread Reply:* I think this is due to mismatch between behavior of spotless in Java 8 and Java 11+ - which you probably used 🙂

    +

    *Thread Reply:* I think this is due to mismatch between behavior of spotless in Java 8 and Java 11+ - which you probably used 🙂

    @@ -18323,7 +18329,7 @@

    Supress success

    2021-09-16 07:40:01
    -

    *Thread Reply:* ah.. i used java11. so shall i rerun something with java8 setup as sdk?

    +

    *Thread Reply:* ah.. i used java11. so shall i rerun something with java8 setup as sdk?

    @@ -18349,7 +18355,7 @@

    Supress success

    2021-09-16 07:44:31
    -

    *Thread Reply:* For spotless, you can just fix this one line 🙂 +

    *Thread Reply:* For spotless, you can just fix this one line 🙂 Though I don't guarantee that tests that run later will pass, so you might need Java 8 for later testing

    @@ -18376,7 +18382,7 @@

    Supress success

    2021-09-16 08:04:36
    -

    *Thread Reply:* yup looks better now

    +

    *Thread Reply:* yup looks better now

    @@ -18402,7 +18408,7 @@

    Supress success

    2021-09-16 08:04:41
    -

    *Thread Reply:* thanks

    +

    *Thread Reply:* thanks

    @@ -18428,7 +18434,7 @@

    Supress success

    2021-09-16 14:27:02
    -

    *Thread Reply:* will somebody please review my PR? had to already adjust due to updates on same test class 🙂

    +

    *Thread Reply:* will somebody please review my PR? had to already adjust due to updates on same test class 🙂

    @@ -18484,7 +18490,7 @@

    Supress success

    2021-09-16 20:37:27
    -

    *Thread Reply:* @Thomas Fredriksen would love any feedback

    +

    *Thread Reply:* @Thomas Fredriksen would love any feedback

    @@ -18510,7 +18516,7 @@

    Supress success

    2021-09-17 04:21:13
    -

    *Thread Reply:* nicely done! As we discussed in another thread - the way you have implemented lineage using FlowRunner and TaskRunner is likely the best way to do this. Let me know if you need any help, I would love to see this PR get merged!

    +

    *Thread Reply:* nicely done! As we discussed in another thread - the way you have implemented lineage using FlowRunner and TaskRunner is likely the best way to do this. Let me know if you need any help, I would love to see this PR get merged!

    @@ -18536,7 +18542,7 @@

    Supress success

    2021-09-17 07:28:33
    -

    *Thread Reply:* Hey @Brad, it looks great!

    +

    *Thread Reply:* Hey @Brad, it looks great!

    I've seen you're using task_qualified_name to name datasets and I don't think it's the right way. I'd take a look at naming conventions here: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md
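
For flavour, the convention boils down to identifying a dataset by a (namespace, name) pair derived from where the data physically lives, not from the task that touched it - the values below are illustrative:
```
# Illustrative (namespace, name) pairs per the naming spec:
postgres_dataset = ("postgres://db.example.com:5432", "mydb.public.orders")
snowflake_dataset = ("snowflake://my-account", "MYDB.PUBLIC.ORDERS")
```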

    @@ -18567,7 +18573,7 @@

    Supress success

    2021-09-17 08:12:50
    -

*Thread Reply:* Hey @Maciej Obuchowski thanks for the feedback. Yep the naming was a bit of a placeholder. Open to any recommendations. I think things like dbt or pyspark are straightforward (we could add special handling for tasks like that), but what about regular transformation-type tasks that run in a scheduler? Do you have any naming preference? Say I just had some pandas transform task in prefect for example

    +

*Thread Reply:* Hey @Maciej Obuchowski thanks for the feedback. Yep the naming was a bit of a placeholder. Open to any recommendations. I think things like dbt or pyspark are straightforward (we could add special handling for tasks like that), but what about regular transformation-type tasks that run in a scheduler? Do you have any naming preference? Say I just had some pandas transform task in prefect for example

    @@ -18593,7 +18599,7 @@

    Supress success

    2021-09-17 08:28:04
    -

    *Thread Reply:* First of all, not all tasks are producing and consuming datasets. For example, I wouldn't expect any of the Github tasks to have any datasets.

    +

    *Thread Reply:* First of all, not all tasks are producing and consuming datasets. For example, I wouldn't expect any of the Github tasks to have any datasets.

Second, in Airflow we have a concept of Extractor where you can write specialized code to expose datasets. For example, for BigQuery we extract datasets from the query plan. Now, I'm not sure if this concept would translate well to Prefect - but if yes, then we have some helpers inside the openlineage common library that could be reused. Also, this way allows emitting additional facets, some of which are really useful - like query statistics for BigQuery, and data quality tests for dbt.
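
A rough sketch of the Extractor shape, assuming the interface openlineage-airflow had at the time (module path and method names here are approximate and may differ by version):
```
# Sketch only - not the actual BigQuery extractor.
from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata

class MyBigQueryExtractor(BaseExtractor):  # hypothetical subclass
    @classmethod
    def get_operator_classnames(cls):
        return ["BigQueryOperator"]

    def extract(self) -> TaskMetadata:
        # Inspect self.operator (its SQL, connection, query plan...) and
        # return input/output datasets plus extra facets, e.g. query stats.
        ...
```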

    @@ -18623,7 +18629,7 @@

    Supress success

    2021-09-19 23:03:14
    -

    *Thread Reply:* You've raised some good points @Maciej Obuchowski - I might have been thinking about this integration in slightly the wrong way. I think based on your comments I'll refactor some of the code to hook into the Results object in prefect (The Result object is the way in which data is serialized and persisted).

    +

    *Thread Reply:* You've raised some good points @Maciej Obuchowski - I might have been thinking about this integration in slightly the wrong way. I think based on your comments I'll refactor some of the code to hook into the Results object in prefect (The Result object is the way in which data is serialized and persisted).

> Now, I'm not sure if this concept would translate well to Prefect - but if yes, then we have some helpers inside openlineage common library that could be reused
This definitely applies to prefect - similar tasks exist in prefect, and we should definitely leverage the common library in this case.

    @@ -18662,7 +18668,7 @@

    Supress success

    2021-09-20 06:30:51
    -

*Thread Reply:* > Yeah I agree with this. I'd like to make it as easy as possible to opt-in, but I think you're right that there needs to be some hooks for user-defined lineage. I'll think about this a little more. +

*Thread Reply:* > Yeah I agree with this. I'd like to make it as easy as possible to opt-in, but I think you're right that there needs to be some hooks for user-defined lineage. I'll think about this a little more.
First version of an integration doesn't have to be perfect. In particular, not handling this use case would be okay, since it does not lock us into some particular way of doing it later.

> My initial thoughts here were that it would still be good to have lineage as these tasks do have side effects, and downstream consumers of the lineage data might want to know about these tasks. However I don't have a good feeling yet how best to do this, so I'm going to park those thoughts for now.

@@ -18694,7 +18700,7 @@

    Supress success

    2021-09-23 17:27:49
    -

*Thread Reply:* > Won’t existence of an event be enough? After all, we’ll still have it despite it not having any input and output datasets. +

*Thread Reply:* > Won’t existence of an event be enough? After all, we’ll still have it despite it not having any input and output datasets.
Duh, yep you’re right @Maciej Obuchowski, I’m overthinking this. I’m going to clean this up based on your comments

    @@ -18721,7 +18727,7 @@

    Supress success

    2021-10-06 03:39:28
    -

    *Thread Reply:* Hi @Brad. How will this integration work for Prefect flows running in Prefect Cloud or on Prefect Server?

    +

    *Thread Reply:* Hi @Brad. How will this integration work for Prefect flows running in Prefect Cloud or on Prefect Server?

    @@ -18747,7 +18753,7 @@

    Supress success

    2021-10-06 03:40:44
    -

    *Thread Reply:* Hi @Thomas Fredriksen - it'll relate to the agent actually - you'll need to pass the flow runner class to the agent when running

    +

    *Thread Reply:* Hi @Thomas Fredriksen - it'll relate to the agent actually - you'll need to pass the flow runner class to the agent when running

    @@ -18773,7 +18779,7 @@

    Supress success

    2021-10-06 03:48:14
    -

    *Thread Reply:* nice!

    +

    *Thread Reply:* nice!

    @@ -18799,7 +18805,7 @@

    Supress success

    2021-10-06 03:48:54
    -

    *Thread Reply:* Unfortunately I've been a little busy the past week, and I will be for the rest of this week

    +

    *Thread Reply:* Unfortunately I've been a little busy the past week, and I will be for the rest of this week

    @@ -18825,7 +18831,7 @@

    Supress success

    2021-10-06 03:49:09
    -

    *Thread Reply:* but I do plan to pick this up next week

    +

    *Thread Reply:* but I do plan to pick this up next week

    @@ -18851,7 +18857,7 @@

    Supress success

    2021-10-06 03:49:23
    -

    *Thread Reply:* (the additional changes I mention above)

    +

    *Thread Reply:* (the additional changes I mention above)

    @@ -18877,7 +18883,7 @@

    Supress success

    2021-10-06 03:50:08
    -

    *Thread Reply:* looking forward to it 🙂 let me know if you need any help!

    +

    *Thread Reply:* looking forward to it 🙂 let me know if you need any help!

    @@ -18903,7 +18909,7 @@

    Supress success

    2021-10-06 03:50:34
    -

    *Thread Reply:* yeah when I get this next lot of stuff in - I'd love for people to test it out

    +

    *Thread Reply:* yeah when I get this next lot of stuff in - I'd love for people to test it out

    @@ -18963,7 +18969,7 @@

    Supress success

    2021-09-20 19:18:53
    -

*Thread Reply:* I think you can refer to https://openlineage.io/

    +

*Thread Reply:* I think you can refer to https://openlineage.io/

    openlineage.io
    @@ -19132,7 +19138,7 @@

    Supress success

    2021-09-24 05:15:31
    -

    *Thread Reply:* It should be really easy to add additional databases. Basically, we'd need to know how to get namespace for that database: https://github.com/OpenLineage/OpenLineage/blob/main/integration/common/openlineage/common/provider/dbt.py#L467

    +

    *Thread Reply:* It should be really easy to add additional databases. Basically, we'd need to know how to get namespace for that database: https://github.com/OpenLineage/OpenLineage/blob/main/integration/common/openlineage/common/provider/dbt.py#L467

    The first step should be to add SQLServer or Dremio to the dataset naming schema here https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md
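
The per-adapter namespace logic is roughly a dispatch on the dbt profile type; a hedged sketch of how a new adapter could slot in (profile keys are approximate, see the linked dbt.py):
```
def extract_namespace(profile: dict) -> str:
    # Sketch only: map a dbt profile to an OpenLineage dataset namespace.
    if profile["type"] == "snowflake":
        return f"snowflake://{profile['account']}"
    if profile["type"] == "bigquery":
        return "bigquery"
    if profile["type"] == "sqlserver":  # hypothetical new adapter
        return f"mssql://{profile['server']}:{profile.get('port', 1433)}"
    raise NotImplementedError(f"adapter not supported yet: {profile['type']}")
```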

    Supress success
    2021-10-04 16:22:59
    -

    *Thread Reply:* Thank you @Maciej Obuchowski, +

*Thread Reply:* Thank you @Maciej Obuchowski, I tried to give it a try but without success yet. Not sure where I am supposed to add the sqlserver naming schema... If you have any documentation that I could read I would be glad =) Many thanks

    @@ -19237,7 +19243,7 @@

    Supress success

    2021-10-07 15:13:43
    -

    *Thread Reply:* This would be adding a paragraph similar to this one: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md#snowflake

    +

    *Thread Reply:* This would be adding a paragraph similar to this one: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md#snowflake

    @@ -19472,7 +19478,7 @@

    Supress success

    2021-10-07 15:14:30
    -

    *Thread Reply:* Snowflake +

*Thread Reply:* Snowflake
See: Object Identifiers — Snowflake Documentation
Datasource hierarchy:
• account name

@@ -19541,7 +19547,7 @@

    Supress success

    2021-09-24 07:06:20
    -

*Thread Reply:* For clarity on the provenance vs lineage point (in case I'm using those terms incorrectly...)

    +

*Thread Reply:* For clarity on the provenance vs lineage point (in case I'm using those terms incorrectly...)

    Our platform performs automated enrichment and processing of data. In doing so, we often make calls to functions or out to other data services (such as APIs, or SELECTs against databases). We capture the inputs that pass to these, along with the outputs. (And, if the input is derived from other outputs, we capture the full chain, right back to the root).

    @@ -19571,7 +19577,7 @@

    Supress success

    2021-09-24 08:47:35
    -

    *Thread Reply:* Not sure I understand you right, but are you interested in tracking individual API calls, and for example, values of some parameters passed for one call?

    +

    *Thread Reply:* Not sure I understand you right, but are you interested in tracking individual API calls, and for example, values of some parameters passed for one call?

    @@ -19597,7 +19603,7 @@

    Supress success

    2021-09-24 08:51:16
    -

    *Thread Reply:* I guess that's not in OpenLineage scope, as we're interested more in tracking metadata for whole datasets. But I might be wrong, some other people might chime in.

    +

    *Thread Reply:* I guess that's not in OpenLineage scope, as we're interested more in tracking metadata for whole datasets. But I might be wrong, some other people might chime in.

    We could of course model this situation, but that would capture for example schema of those parameters. Not their values.

    @@ -19625,7 +19631,7 @@

    Supress success

    2021-09-24 08:52:16
    -

    *Thread Reply:* I think this might be better suited for https://opentelemetry.io/

    +

    *Thread Reply:* I think this might be better suited for https://opentelemetry.io/

    @@ -19651,7 +19657,7 @@

    Supress success

    2021-09-24 10:55:54
    -

*Thread Reply:* Kinda, but not really. Telemetry data is metadata about the API calls. We have that, but it's not interesting to our customers. It's the metadata about the data that Vyne provides that we want to expose.

    +

*Thread Reply:* Kinda, but not really. Telemetry data is metadata about the API calls. We have that, but it's not interesting to our customers. It's the metadata about the data that Vyne provides that we want to expose.

    Our customers use Vyne to fetch data from lots of different sources. Eg:

    @@ -19685,7 +19691,7 @@

    Supress success

    2021-09-30 15:10:51
    -

    *Thread Reply:* The core OpenLineage model is documented at https://github.com/OpenLineage/OpenLineage/#core-model . The model is really focused on Jobs and Datasets. Jobs have Runs which have start and end times (typically scheduled start/end times as well) and read from and/or write to the target datasets. If your transformation chain fits within that model, then I think you can definitely record and share the lineage information with your customers. The existing implementations are all focused on batch data access, though streaming should be possible to capture as well

    +

    *Thread Reply:* The core OpenLineage model is documented at https://github.com/OpenLineage/OpenLineage/#core-model . The model is really focused on Jobs and Datasets. Jobs have Runs which have start and end times (typically scheduled start/end times as well) and read from and/or write to the target datasets. If your transformation chain fits within that model, then I think you can definitely record and share the lineage information with your customers. The existing implementations are all focused on batch data access, though streaming should be possible to capture as well

    @@ -19740,7 +19746,7 @@

    Supress success

    2021-09-29 13:46:59
    -

    *Thread Reply:* Hello @Drew Bittenbender!

    +

    *Thread Reply:* Hello @Drew Bittenbender!

For your first two questions:

    @@ -19771,7 +19777,7 @@

    Supress success

    2021-09-29 13:49:41
    -

    *Thread Reply:* Thank you @Faouzi. Is there any documentation/best practices to write your own extractor, or is it "read the code"? We use the Python, Docker and SSH operators a lot. Maybe those don't fit into the lineage paradigm well, but want to give it a shot

    +

    *Thread Reply:* Thank you @Faouzi. Is there any documentation/best practices to write your own extractor, or is it "read the code"? We use the Python, Docker and SSH operators a lot. Maybe those don't fit into the lineage paradigm well, but want to give it a shot

    @@ -19797,7 +19803,7 @@

    Supress success

    2021-09-29 13:52:16
    -

*Thread Reply:* To the best of my knowledge there is no documentation guiding you through the design of your own extractor. So yes, we need to read the code. Here's a link where you can see how they did it for the postgres extractor and others. https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow/openlineage/airflow/extractors

    +

*Thread Reply:* To the best of my knowledge there is no documentation guiding you through the design of your own extractor. So yes, we need to read the code. Here's a link where you can see how they did it for the postgres extractor and others. https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow/openlineage/airflow/extractors

    @@ -19827,7 +19833,7 @@

    Supress success

    2021-09-30 05:08:53
    -

    *Thread Reply:* I think in case of "bring your own code" operators like Python or Docker ones, it might be better to use lineage_run_id macro and use openlineage-python library inside, instead of implementing extractor.

    +

    *Thread Reply:* I think in case of "bring your own code" operators like Python or Docker ones, it might be better to use lineage_run_id macro and use openlineage-python library inside, instead of implementing extractor.
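
A sketch of that pattern, assuming the lineage_run_id macro exposed by openlineage-airflow (the macro signature here is a guess and may differ by version):
```
from airflow.operators.python import PythonOperator

def transform(openlineage_run_id: str):
    # Emit START/COMPLETE events with openlineage-python from inside the task,
    # attaching openlineage_run_id as the parent run id.
    ...

task = PythonOperator(
    task_id="transform",
    python_callable=transform,
    op_kwargs={
        # Templated at runtime; the exact macro signature is an assumption here.
        "openlineage_run_id": "{{ lineage_run_id(run_id, task) }}",
    },
)
```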

    @@ -19853,7 +19859,7 @@

    Supress success

    2021-09-30 15:14:47
    -

    *Thread Reply:* I think @Maciej Obuchowski is right here. The airflow integration will create the parent jobs, but to get the dataset input/output links, it's best to do that directly from the python/docker scripts. If you report the parent run id, Marquez will link the jobs together correctly

    +

    *Thread Reply:* I think @Maciej Obuchowski is right here. The airflow integration will create the parent jobs, but to get the dataset input/output links, it's best to do that directly from the python/docker scripts. If you report the parent run id, Marquez will link the jobs together correctly

    @@ -19879,7 +19885,7 @@

    Supress success

    2021-10-07 15:09:55
    -

    *Thread Reply:* To clarify on what airflow operators are supported out of the box: +

*Thread Reply:* To clarify on what airflow operators are supported out of the box:
• postgres
• bigquery
• snowflake

@@ -19992,7 +19998,7 @@

    Supress success

    2021-10-07 15:03:40
    -

    *Thread Reply:* OpenLineage doesn’t have field level lineage yet. Here is the proposal for adding it: https://github.com/OpenLineage/OpenLineage/issues/148

    +

    *Thread Reply:* OpenLineage doesn’t have field level lineage yet. Here is the proposal for adding it: https://github.com/OpenLineage/OpenLineage/issues/148

    @@ -20141,7 +20147,7 @@

    Supress success

    2021-10-07 15:04:36
    -

    *Thread Reply:* Those two specs look compatible, so Datahub should be able to consume this lineage metadata in the future

    +

    *Thread Reply:* Those two specs look compatible, so Datahub should be able to consume this lineage metadata in the future

    @@ -20197,7 +20203,7 @@

    Supress success

    2021-10-04 15:38:47
    -

    *Thread Reply:* Hi! Airflow 2.x is currently in development - you can follow along with the progress here: +

    *Thread Reply:* Hi! Airflow 2.x is currently in development - you can follow along with the progress here: https://github.com/OpenLineage/OpenLineage/issues/205

    @@ -20264,7 +20270,7 @@

    Supress success

    2021-10-05 03:01:54
    -

    *Thread Reply:* Thank you for your reply!

    +

    *Thread Reply:* Thank you for your reply!

    @@ -20290,7 +20296,7 @@

    Supress success

    2021-10-07 15:02:23
    -

    *Thread Reply:* There should be a first version of Airflow 2.X support soon: https://github.com/OpenLineage/OpenLineage/pull/305 +

*Thread Reply:* There should be a first version of Airflow 2.X support soon: https://github.com/OpenLineage/OpenLineage/pull/305 We’re labelling it experimental because the config step might change as discussions in the airflow github evolve. It will track successful jobs in its current state.

    @@ -20382,7 +20388,7 @@

    Supress success

    2021-10-05 04:18:42
    -

    *Thread Reply:* Take a look at this: https://docs.getdbt.com/docs/dbt-cloud/dbt-cloud-api/metadata/metadata-overview

    +

    *Thread Reply:* Take a look at this: https://docs.getdbt.com/docs/dbt-cloud/dbt-cloud-api/metadata/metadata-overview

    docs.getdbt.com
    @@ -20437,7 +20443,7 @@

    Supress success

    2021-10-07 14:58:24
    -

    *Thread Reply:* @SAM Let us know of your progress.

    +

    *Thread Reply:* @SAM Let us know of your progress.

    @@ -20509,7 +20515,7 @@

    Supress success

    2021-10-05 16:41:30
    -

    *Thread Reply:* Hey, can you create ticket in OpenLineage repository? FWIW Redshift is very similar to postgres, so supporting it won't be hard.

    +

    *Thread Reply:* Hey, can you create ticket in OpenLineage repository? FWIW Redshift is very similar to postgres, so supporting it won't be hard.

    @@ -20535,7 +20541,7 @@

    Supress success

    2021-10-05 16:43:39
    -

    *Thread Reply:* Hey @Maciej Obuchowski 😊 +

    *Thread Reply:* Hey @Maciej Obuchowski 😊 Yep, will do now! Thanks!

    @@ -20562,7 +20568,7 @@

    Supress success

    2021-10-05 16:46:26
    -

    *Thread Reply:* Well...will do tomorrow morning 😅

    +

    *Thread Reply:* Well...will do tomorrow morning 😅

    @@ -20588,7 +20594,7 @@

    Supress success

    2021-10-06 03:03:16
    -

    *Thread Reply:* Here’s the issue: https://github.com/OpenLineage/OpenLineage/issues/318

    +

    *Thread Reply:* Here’s the issue: https://github.com/OpenLineage/OpenLineage/issues/318

    @@ -20656,7 +20662,7 @@

    Supress success

    2021-10-07 14:51:08
    -

    *Thread Reply:* Thanks a lot. I pulled it in the current project.

    +

    *Thread Reply:* Thanks a lot. I pulled it in the current project.

    @@ -20686,7 +20692,7 @@

    Supress success

    2021-10-08 05:48:28
    -

    *Thread Reply:* @Julien Le Dem @Maciej Obuchowski I’m not familiar with dbt-ol codebase, but I’m willing to help on this if you guys can give me a bit of guidance 😅

    +

    *Thread Reply:* @Julien Le Dem @Maciej Obuchowski I’m not familiar with dbt-ol codebase, but I’m willing to help on this if you guys can give me a bit of guidance 😅

    @@ -20712,7 +20718,7 @@

    Supress success

    2021-10-08 05:53:05
    -

    *Thread Reply:* @ale can you help us define naming schema for redshift, as we have for other databases? https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md

    +

    *Thread Reply:* @ale can you help us define naming schema for redshift, as we have for other databases? https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md

    @@ -20738,7 +20744,7 @@

    Supress success

    2021-10-08 05:53:21
    -

    *Thread Reply:* Sure!

    +

    *Thread Reply:* Sure!

    @@ -20764,7 +20770,7 @@

    Supress success

    2021-10-08 05:54:21
    -

    *Thread Reply:* will work on this today and I’ll try to submit a PR by EOD

    +

    *Thread Reply:* will work on this today and I’ll try to submit a PR by EOD

    @@ -20790,7 +20796,7 @@

    Supress success

    2021-10-08 06:36:12
    -

    *Thread Reply:* There you go https://github.com/OpenLineage/OpenLineage/pull/324

    +

    *Thread Reply:* There you go https://github.com/OpenLineage/OpenLineage/pull/324

    @@ -20842,7 +20848,7 @@

    Supress success

    2021-10-08 06:39:35
    -

    *Thread Reply:* Host would be something like +

*Thread Reply:* Host would be something like
examplecluster.<XXXXXXXXXXXX>.us-west-2.redshift.amazonaws.com
right?

    @@ -20870,7 +20876,7 @@

    Supress success

    2021-10-08 07:13:51
    -

    *Thread Reply:* Yep, let me update the PR

    +

    *Thread Reply:* Yep, let me update the PR

    @@ -20896,7 +20902,7 @@

    Supress success

    2021-10-08 07:27:42
    -

    *Thread Reply:* Done

    +

    *Thread Reply:* Done

    @@ -20922,7 +20928,7 @@

    Supress success

    2021-10-08 07:31:40
    -

    *Thread Reply:* 🙌

    +

    *Thread Reply:* 🙌

    @@ -20948,7 +20954,7 @@

    Supress success

    2021-10-08 07:35:30
    -

    *Thread Reply:* If you want to look at dbt integration itself, there are two things:

    +

    *Thread Reply:* If you want to look at dbt integration itself, there are two things:

    We need to determine how Redshift adapter reports metrics https://github.com/OpenLineage/OpenLineage/blob/610a687bf69df2b52ec4ac4da80b4a05580e8d32/integration/common/openlineage/common/provider/dbt.py#L412

    @@ -21006,7 +21012,7 @@

    Supress success

    2021-10-08 08:33:31
    -

    *Thread Reply:* I figured out how to generate the namespace. +

    *Thread Reply:* I figured out how to generate the namespace. But I can’t understand which of the JSON files is inspected for metrics. Is it run_results.json ?

    @@ -21033,7 +21039,7 @@

    Supress success

    2021-10-08 09:48:50
    -

    *Thread Reply:* yes, run_results.json - it's different in bigquery and snowflake, so I presume it's different in redshift too

    +

    *Thread Reply:* yes, run_results.json - it's different in bigquery and snowflake, so I presume it's different in redshift too

    @@ -21059,7 +21065,7 @@

    Supress success

    2021-10-08 11:02:32
    -

    *Thread Reply:* Ok thanks!

    +

    *Thread Reply:* Ok thanks!

    @@ -21085,7 +21091,7 @@

    Supress success

    2021-10-08 11:11:57
    -

    *Thread Reply:* Should be stats:rows:value

    +

    *Thread Reply:* Should be stats:rows:value

    @@ -21111,7 +21117,7 @@

    Supress success

    2021-10-08 11:19:59
    -

    *Thread Reply:* Regarding namespace: if env_var is used in profiles.yml , how is this handled now?

    +

    *Thread Reply:* Regarding namespace: if env_var is used in profiles.yml , how is this handled now?

    @@ -21137,7 +21143,7 @@

    Supress success

    2021-10-08 11:44:50
    -

    *Thread Reply:* Well, it isn't. This is relevant only if you passed cluster hostname this way, right?

    +

    *Thread Reply:* Well, it isn't. This is relevant only if you passed cluster hostname this way, right?

    @@ -21163,7 +21169,7 @@

    Supress success

    2021-10-08 11:53:52
    -

    *Thread Reply:* Exactly

    +

    *Thread Reply:* Exactly

    @@ -21189,7 +21195,7 @@

    Supress success

    2021-10-11 07:10:38
    -

*Thread Reply:* If you think it makes sense, I can submit a PR to handle dbt profiles with env_var

    +

*Thread Reply:* If you think it makes sense, I can submit a PR to handle dbt profiles with env_var

    @@ -21215,7 +21221,7 @@

    Supress success

    2021-10-11 07:18:01
    -

    *Thread Reply:* Do you want to run jinja on the dbt profile?

    +

    *Thread Reply:* Do you want to run jinja on the dbt profile?

    @@ -21241,7 +21247,7 @@

    Supress success

    2021-10-11 07:20:18
    -

    *Thread Reply:* Theoretically, we'd need to run it also on dbt_project.yml , but we only take target path and profile name from it.

    +

    *Thread Reply:* Theoretically, we'd need to run it also on dbt_project.yml , but we only take target path and profile name from it.

    @@ -21267,7 +21273,7 @@

    Supress success

    2021-10-11 07:20:32
    -

    *Thread Reply:* The env_var syntax in the profile is quite simple, I was thinking of extracting the env var name using re and then retrieving the value from os

    +

    *Thread Reply:* The env_var syntax in the profile is quite simple, I was thinking of extracting the env var name using re and then retrieving the value from os

    @@ -21293,7 +21299,7 @@

    Supress success

    2021-10-11 07:23:59
    -

    *Thread Reply:* It would work, but we can actually use jinja - if you're using dbt, it's already included. +

*Thread Reply:* It would work, but we can actually use jinja - if you're using dbt, it's already included.
The method is pretty simple: ```
@contextmember
@staticmethod

@@ -21336,7 +21342,7 @@

    Supress success

    2021-10-11 07:25:07
    -

    *Thread Reply:* Oh cool! +

    *Thread Reply:* Oh cool! I will definitely use this one!

    @@ -21363,7 +21369,7 @@

    Supress success

    2021-10-11 07:25:09
    -

    *Thread Reply:* We'd be sure that our implementation matches dbt's one, right? Also, you'd support default method for free

    +

    *Thread Reply:* We'd be sure that our implementation matches dbt's one, right? Also, you'd support default method for free

    @@ -21389,7 +21395,7 @@

    Supress success

    2021-10-11 07:26:34
    -

*Thread Reply:* So this env_var method is defined in dbt and not in the OpenLineage codebase, right?

    +

*Thread Reply:* So this env_var method is defined in dbt and not in the OpenLineage codebase, right?

    @@ -21415,7 +21421,7 @@

    Supress success

    2021-10-11 07:27:01
    -

    *Thread Reply:* yes

    +

    *Thread Reply:* yes

    @@ -21441,7 +21447,7 @@

    Supress success

    2021-10-11 07:27:14
    -

    *Thread Reply:* dbt is on Apache license 🙂

    +

    *Thread Reply:* dbt is on Apache license 🙂

    @@ -21467,7 +21473,7 @@

    Supress success

    2021-10-11 07:28:06
    -

    *Thread Reply:* Should we import dbt package and use the method or should we just copy/paste the method inside OpenLineage codebase?

    +

    *Thread Reply:* Should we import dbt package and use the method or should we just copy/paste the method inside OpenLineage codebase?

    @@ -21493,7 +21499,7 @@

    Supress success

    2021-10-11 07:28:28
    -

    *Thread Reply:* I’m asking for guidance here 😊

    +

    *Thread Reply:* I’m asking for guidance here 😊

    @@ -21519,7 +21525,7 @@

    Supress success

    2021-10-11 07:34:44
    -

    *Thread Reply:* I think we should just do basic jinja template rendering in our code like in the quick example: https://realpython.com/primer-on-jinja-templating/#quick-examples

    +

    *Thread Reply:* I think we should just do basic jinja template rendering in our code like in the quick example: https://realpython.com/primer-on-jinja-templating/#quick-examples

    just with the env_var method passed to the render method 🙂
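
A minimal sketch of that suggestion - env_var mirrors dbt's helper, and jinja calls it while rendering (variable names are illustrative):
```
import os
from typing import Optional

from jinja2 import Template

def env_var(var: str, default: Optional[str] = None) -> str:
    # Mirror of dbt's helper: read an env var, fall back to default, else fail.
    if var in os.environ:
        return os.environ[var]
    if default is not None:
        return default
    raise Exception(f"Env var required but not provided: '{var}'")

# The function is passed into render, so {{ env_var(...) }} resolves at render time:
template = Template("host: {{ env_var('REDSHIFT_HOST', 'localhost') }}")
print(template.render(env_var=env_var))  # "host: localhost" unless REDSHIFT_HOST is set
```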

    @@ -21547,7 +21553,7 @@

    Supress success

    2021-10-11 07:37:05
    -

    *Thread Reply:* basically, here in the code we should read the file, do the jinja render, and load yaml from string instead of straight from file +

    *Thread Reply:* basically, here in the code we should read the file, do the jinja render, and load yaml from string instead of straight from file https://github.com/OpenLineage/OpenLineage/blob/610a687bf69df2b52ec4ac4da80b4a05580e8d32/integration/common/openlineage/common/provider/dbt.py#L176
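
Roughly, the change at that spot would render before parsing - a sketch (path illustrative, env_var as in the snippet above):
```
import yaml
from jinja2 import Template

with open("profiles.yml") as f:
    rendered = Template(f.read()).render(env_var=env_var)

profiles = yaml.safe_load(rendered)
```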

    @@ -21599,7 +21605,7 @@

    Supress success

    2021-10-11 07:38:53
    -

    *Thread Reply:* ok, got it. +

    *Thread Reply:* ok, got it. Will try to implement following your suggestions. Thanks @Maciej Obuchowski 🙌

    @@ -21631,7 +21637,7 @@

    Supress success

    2021-10-11 08:36:13
    -

    *Thread Reply:* We need to:

    +

    *Thread Reply:* We need to:

    1. load the template profile from the profile.yml
2. replace any env vars we found
For the first step, we can use jinja2.Template

@@ -21662,7 +21668,7 @@

      Supress success

    2021-10-11 08:43:06
    -

    *Thread Reply:* The dbt method implements that: +

*Thread Reply:* The dbt method implements that: ```
@contextmember
@staticmethod
def env_var(var: str, default: Optional[str] = None) -> str:

@@ -21704,7 +21710,7 @@

    Supress success

    2021-10-11 08:45:54
    -

    *Thread Reply:* Ok, but I need to pass var to the env_var method. +

    *Thread Reply:* Ok, but I need to pass var to the env_var method. And to pass the var value, I need to look into the loaded Template and search for env var names…

    @@ -21731,7 +21737,7 @@

    Supress success

    2021-10-11 08:46:54
    -

    *Thread Reply:* that's what jinja does - you're passing function to jinja render, and it's calling it itself

    +

    *Thread Reply:* that's what jinja does - you're passing function to jinja render, and it's calling it itself

    @@ -21757,7 +21763,7 @@

    Supress success

    2021-10-11 08:47:45
    -

    *Thread Reply:* you can try the quick example from here, but just pass the env_var method (slightly adjusted - as a standalone function and without undefined error) and call it inside the template: https://realpython.com/primer-on-jinja-templating/#quick-examples

    +

    *Thread Reply:* you can try the quick example from here, but just pass the env_var method (slightly adjusted - as a standalone function and without undefined error) and call it inside the template: https://realpython.com/primer-on-jinja-templating/#quick-examples

    @@ -21783,7 +21789,7 @@

    Supress success

    2021-10-11 08:51:19
    -

    *Thread Reply:* Ok, will try

    +

    *Thread Reply:* Ok, will try

    @@ -21809,7 +21815,7 @@

    Supress success

    2021-10-11 09:37:49
    -

    *Thread Reply:* I’m trying to run +

*Thread Reply:* I’m trying to run
pip install -e ".[dev]"
so that I can test my changes, but I get
ERROR: Could not find a version that satisfies the requirement openlineage-integration-common[dbt]==0.2.3 (from openlineage-dbt[dev]) (from versions: 0.0.1rc7, 0.0.1rc8, 0.0.1, 0.1.0rc5, 0.1.0, 0.2.0, 0.2.1, 0.2.2)

@@ -21840,7 +21846,7 @@

    Supress success

    2021-10-11 09:41:47
    -

    *Thread Reply:* can you try installing it manually?

    +

    *Thread Reply:* can you try installing it manually?

    pip install openlineage-integration-common[dbt]==0.2.3

    @@ -21868,7 +21874,7 @@

    Supress success

    2021-10-11 09:42:13
    -

    *Thread Reply:* I mean, it exists in pypi: https://pypi.org/project/openlineage-integration-common/#files

    +

    *Thread Reply:* I mean, it exists in pypi: https://pypi.org/project/openlineage-integration-common/#files

    PyPI
    @@ -21919,7 +21925,7 @@

    Supress success

    2021-10-11 09:44:57
    -

    *Thread Reply:* Yep, maybe it’s our internal Pypi repo which is not synced. +

    *Thread Reply:* Yep, maybe it’s our internal Pypi repo which is not synced. Installing from the public pypi resolved the issue

    @@ -21946,7 +21952,7 @@

    Supress success

    2021-10-11 12:04:55
    -

*Thread Reply:* Can’t seem to make env_var work as the render method of a Template 😅

    +

*Thread Reply:* Can’t seem to make env_var work as the render method of a Template 😅

    @@ -21972,7 +21978,7 @@

    Supress success

    2021-10-11 12:57:07
    -

    *Thread Reply:* try this:

    +

    *Thread Reply:* try this:

    @@ -21998,7 +22004,7 @@

    Supress success

    2021-10-11 12:57:09
    -

    *Thread Reply:* ```import os +

*Thread Reply:* ```
import os
from typing import Optional
from jinja2 import Template

    @@ -22045,7 +22051,7 @@

    Supress success

    2021-10-11 12:57:42
    -

    *Thread Reply:* works for me: +

*Thread Reply:* works for me:
mobuchowski@thinkpad [18:57:14] [~]
-> % ENV_VAR=world python jinja_example.py
Hello world!

    @@ -22074,7 +22080,7 @@

    Supress success

    2021-10-11 16:59:13
    -

    *Thread Reply:* Finally 😅 +

    *Thread Reply:* Finally 😅 https://github.com/OpenLineage/OpenLineage/pull/328

There are minimal tests for Redshift and env vars.

@@ -22133,7 +22139,7 @@

    Supress success

    2021-10-12 03:10:45
    -

    *Thread Reply:* Hi @Maciej Obuchowski 😊 +

    *Thread Reply:* Hi @Maciej Obuchowski 😊 Regarding this comment https://github.com/OpenLineage/OpenLineage/pull/328#discussion_r726586564

    How can we distinguish between snowflake, bigquery and redshift in this method?

    @@ -22175,7 +22181,7 @@

    Supress success

    2021-10-12 05:35:00
    -

    *Thread Reply:* Well, why not do something like

    +

    *Thread Reply:* Well, why not do something like

```bytes = get_from_multiple_chains(... rest of stuff)

    @@ -22206,7 +22212,7 @@

    Supress success

    2021-10-12 05:36:49
    -

    *Thread Reply:* we can store adapter type in the class

    +

    *Thread Reply:* we can store adapter type in the class

    @@ -22232,7 +22238,7 @@

    Supress success

    2021-10-12 05:38:47
    -

    *Thread Reply:* well, I've looked at last commit and that's exactly what you did 👍

    +

    *Thread Reply:* well, I've looked at last commit and that's exactly what you did 👍

    @@ -22258,7 +22264,7 @@

    Supress success

    2021-10-12 05:40:35
    -

    *Thread Reply:* Now, have you tested your branch on real redshift cluster? I don't think we 100% need automated tests for that now, but would be nice to have confirmation that it works.

    +

    *Thread Reply:* Now, have you tested your branch on real redshift cluster? I don't think we 100% need automated tests for that now, but would be nice to have confirmation that it works.

    @@ -22284,7 +22290,7 @@

    Supress success

    2021-10-12 06:35:04
    -

    *Thread Reply:* Not yet, but I'll try to do that this afternoon. +

    *Thread Reply:* Not yet, but I'll try to do that this afternoon. Need to figure out how to build the lib locally, then I can use it to test with Redshift

    @@ -22311,7 +22317,7 @@

    Supress success

    2021-10-12 06:40:58
    -

    *Thread Reply:* I think pip install -e .[dbt] in common directory should be enough

    +

    *Thread Reply:* I think pip install -e .[dbt] in common directory should be enough

    @@ -22337,7 +22343,7 @@

    Supress success

    2021-10-12 09:29:13
    -

    *Thread Reply:* I was able to run my local branch with my Redshift cluster and metadata is pushed to Marquez. +

    *Thread Reply:* I was able to run my local branch with my Redshift cluster and metadata is pushed to Marquez. However, I’m not sure about the namespace . I also see exceptions in Marquez logs

    @@ -22392,7 +22398,7 @@

    Supress success

    2021-10-12 09:33:26
    -

    *Thread Reply:* namespace: well, if it matches what you put into your profile, there's not much we can do. I don't understand why you connect to redshift via host, maybe this is related to IAM?

    +

    *Thread Reply:* namespace: well, if it matches what you put into your profile, there's not much we can do. I don't understand why you connect to redshift via host, maybe this is related to IAM?

    @@ -22418,7 +22424,7 @@

    Supress success

    2021-10-12 09:44:17
    -

    *Thread Reply:* I think the marquez error is because we don't send SourceCodeLocationJobFacet

    +

    *Thread Reply:* I think the marquez error is because we don't send SourceCodeLocationJobFacet

    @@ -22444,7 +22450,7 @@

    Supress success

    2021-10-12 09:46:17
    -

    *Thread Reply:* Regarding the namespace, I will check it and figure it out 😊 +

    *Thread Reply:* Regarding the namespace, I will check it and figure it out 😊 Regarding the error: in the context of this PR, is it something I should worry about or not?

    @@ -22471,7 +22477,7 @@

    Supress success

    2021-10-12 09:54:17
    -

    *Thread Reply:* I think not in the context of the PR. It certainly deserves separate issue in Marquez repository.

    +

    *Thread Reply:* I think not in the context of the PR. It certainly deserves separate issue in Marquez repository.

    @@ -22497,7 +22503,7 @@

    Supress success

    2021-10-12 10:24:38
    -

    *Thread Reply:* 👍

    +

    *Thread Reply:* 👍

    @@ -22523,7 +22529,7 @@

    Supress success

    2021-10-12 10:24:51
    -

    *Thread Reply:* Is there anything else I can do to improve the PR?

    +

    *Thread Reply:* Is there anything else I can do to improve the PR?

    @@ -22549,7 +22555,7 @@

    Supress success

    2021-10-12 10:27:44
    -

    *Thread Reply:* did you figure out the namespace stuff? +

    *Thread Reply:* did you figure out the namespace stuff? I think it's ready to be merged outside of that

    @@ -22576,7 +22582,7 @@

    Supress success

    2021-10-12 10:49:06
    -

    *Thread Reply:* Not yet

    +

    *Thread Reply:* Not yet

    @@ -22602,7 +22608,7 @@

    Supress success

    2021-10-12 10:58:07
    -

    *Thread Reply:* Ok i figured it out. +

    *Thread Reply:* Ok i figured it out. When running dbt locally, we connect to Redshift using an SSH tunnel. dbt runs on Docker, hence it can access the tunnel using host.docker.internal

    @@ -22630,7 +22636,7 @@

    Supress success

    2021-10-12 10:58:16
    -

    *Thread Reply:* So the namespace is correct

    +

    *Thread Reply:* So the namespace is correct

    @@ -22656,7 +22662,7 @@

    Supress success

    2021-10-12 11:04:12
    -

    *Thread Reply:* Makes sense. So, let's merge it, after DCO bot gets up again.

    +

    *Thread Reply:* Makes sense. So, let's merge it, after DCO bot gets up again.

    @@ -22682,7 +22688,7 @@

    Supress success

    2021-10-12 11:04:37
    -

    *Thread Reply:* 👍

    +

    *Thread Reply:* 👍

    @@ -22708,7 +22714,7 @@

    Supress success

    2021-10-13 05:29:48
    -

    *Thread Reply:* merged your PR 🙌

    +

    *Thread Reply:* merged your PR 🙌

    @@ -22734,7 +22740,7 @@

    Supress success

    2021-10-13 10:54:09
    -

    *Thread Reply:* 🎉

    +

    *Thread Reply:* 🎉

    @@ -22760,7 +22766,7 @@

    Supress success

    2021-10-13 12:01:20
    -

    *Thread Reply:* I think I'm going to change it up a bit. +

    *Thread Reply:* I think I'm going to change it up a bit. The problem is that we can try to render jinja everywhere, including comments. I tried to make it skip unknown methods and values here, but I think the right solution is to load the yaml, and then try to render jinja for values.
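
A hedged sketch of that "render values, not the whole file" idea - parse the YAML first, then run jinja only on leaf strings so comments and unknown constructs are never touched:
```
import yaml
from jinja2 import Template

def render_values(node):
    # Recurse through parsed YAML; only string leaves go through jinja.
    if isinstance(node, dict):
        return {key: render_values(value) for key, value in node.items()}
    if isinstance(node, list):
        return [render_values(value) for value in node]
    if isinstance(node, str):
        return Template(node).render(env_var=env_var)  # env_var as earlier in the thread
    return node

with open("profiles.yml") as f:  # path illustrative
    profiles = render_values(yaml.safe_load(f))
```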

    @@ -22788,7 +22794,7 @@

    Supress success

    2021-10-13 14:27:37
    -

    *Thread Reply:* Ok sounds good to me!

    +

    *Thread Reply:* Ok sounds good to me!

    @@ -22849,7 +22855,7 @@

    Supress success

    2021-10-06 10:54:32
    -

    *Thread Reply:* Does it work with simply dbt run?

    +

    *Thread Reply:* Does it work with simply dbt run?

    @@ -22875,7 +22881,7 @@

    Supress success

    2021-10-06 10:55:51
    -

    *Thread Reply:* also, do you have dbt-snowflake installed?

    +

    *Thread Reply:* also, do you have dbt-snowflake installed?

    @@ -22901,7 +22907,7 @@

    Supress success

    2021-10-06 11:00:42
    -

    *Thread Reply:* it works with dbt run

    +

    *Thread Reply:* it works with dbt run

    @@ -22931,7 +22937,7 @@

    Supress success

    2021-10-06 11:01:22
    -

    *Thread Reply:* no i haven’t installed dbt-snowflake

    +

    *Thread Reply:* no i haven’t installed dbt-snowflake

    @@ -22957,7 +22963,7 @@

    Supress success

    2021-10-06 12:04:19
    -

*Thread Reply:* what the dbt says - the snowflake profile with dev target - is that what you meant to run or was it something else?

    +

*Thread Reply:* what the dbt says - the snowflake profile with dev target - is that what you meant to run or was it something else?

    @@ -22983,7 +22989,7 @@

    Supress success

    2021-10-06 12:04:46
    -

    *Thread Reply:* it feels very weird to me, since the dbt-ol script just runs dbt run underneath

    +

    *Thread Reply:* it feels very weird to me, since the dbt-ol script just runs dbt run underneath

    @@ -23009,7 +23015,7 @@

    Supress success

    2021-10-06 12:19:27
    -

    *Thread Reply:* this is my profiles.yml file: +

*Thread Reply:* this is my profiles.yml file: ```
snowflake:
  target: dev
  outputs:

@@ -23054,7 +23060,7 @@

    Supress success

    2021-10-06 12:26:39
    -

*Thread Reply:* Yes, it looks like everything is okay on your side...

    +

*Thread Reply:* Yes, it looks like everything is okay on your side...

    @@ -23080,7 +23086,7 @@

    Supress success

    2021-10-06 12:28:19
    -

*Thread Reply:* maybe I’ll restart my machine and try again

    +

*Thread Reply:* maybe I’ll restart my machine and try again

    @@ -23106,7 +23112,7 @@

    Supress success

    2021-10-06 12:30:25
    -

    *Thread Reply:* can you try +

*Thread Reply:* can you try
OPENLINEAGE_URL=<http://localhost:5000> dbt-ol debug

    @@ -23133,7 +23139,7 @@

    Supress success

    2021-10-07 05:59:03
    -

*Thread Reply:* Actually I had to use a venv - that fixed the above issue. However, I ran into another problem, which is no jobs / datasets found in marquez:

    +

*Thread Reply:* Actually I had to use a venv - that fixed the above issue. However, I ran into another problem, which is no jobs / datasets found in marquez:

    @@ -23177,7 +23183,7 @@

    Supress success

    2021-10-07 06:00:28
    -

*Thread Reply:* Good that you fixed that one 🙂 Regarding the last one, I found it independently yesterday and a PR fixing it is already waiting for review: https://github.com/OpenLineage/OpenLineage/pull/322

    +

*Thread Reply:* Good that you fixed that one 🙂 Regarding the last one, I found it independently yesterday and a PR fixing it is already waiting for review: https://github.com/OpenLineage/OpenLineage/pull/322

    @@ -23230,7 +23236,7 @@

    Supress success

    2021-10-07 06:00:46
    -

    *Thread Reply:* oh, thanks a lot

    +

    *Thread Reply:* oh, thanks a lot

    @@ -23256,7 +23262,7 @@

    Supress success

    2021-10-07 14:50:01
    -

    *Thread Reply:* There will be a release soon: https://openlineage.slack.com/archives/C01CK9T7HKR/p1633631825147900

    +

    *Thread Reply:* There will be a release soon: https://openlineage.slack.com/archives/C01CK9T7HKR/p1633631825147900

    @@ -23327,7 +23333,7 @@

    Supress success

    2021-10-07 23:23:26
    -

    *Thread Reply:* Hi, +

    *Thread Reply:* Hi, openlineage-dbt==0.2.3 worked, thanks a lot for the quick fix.

    @@ -23389,7 +23395,7 @@

    Supress success

    2021-10-07 07:46:59
    -

    *Thread Reply:*

    +

    *Thread Reply:*

    @@ -23424,7 +23430,7 @@

    Supress success

    2021-10-07 09:51:40
    -

    *Thread Reply:* So, as a quick fix, shutting down and re-starting the docker container resets everything. +

*Thread Reply:* So, as a quick fix, shutting down and re-starting the docker container resets everything.
./docker/up.sh

    @@ -23455,7 +23461,7 @@

    Supress success

    2021-10-07 12:28:25
    -

    *Thread Reply:* I guess that it's the easiest way now. There should be API for that.

    +

    *Thread Reply:* I guess that it's the easiest way now. There should be API for that.

    @@ -23481,7 +23487,7 @@

    Supress success

    2021-10-07 14:09:50
    -

    *Thread Reply:* @Alex P Yeah, we're realizing that being able to delete metadata is becoming very important. And, as @Maciej Obuchowski mentioned, dropping your entire database is the only way currently (not ideal!). We do have an issue in the Marquez backlog to expose delete APIs: https://github.com/MarquezProject/marquez/issues/754

    +

    *Thread Reply:* @Alex P Yeah, we're realizing that being able to delete metadata is becoming very important. And, as @Maciej Obuchowski mentioned, dropping your entire database is the only way currently (not ideal!). We do have an issue in the Marquez backlog to expose delete APIs: https://github.com/MarquezProject/marquez/issues/754

    @@ -23548,7 +23554,7 @@

    Supress success

    2021-10-07 14:10:36
    -

    *Thread Reply:* A bit more discussion is needed though. Like what if a dataset is deleted, but you still want to keep track that it existed at some point? (i.e. soft vs hard deletes). But, for the case that you just want to clear metadata because you were testing things out, then yeah, that's more obvious and requires little discussion of the API upfront.

    +

    *Thread Reply:* A bit more discussion is needed though. Like what if a dataset is deleted, but you still want to keep track that it existed at some point? (i.e. soft vs hard deletes). But, for the case that you just want to clear metadata because you were testing things out, then yeah, that's more obvious and requires little discussion of the API upfront.

    @@ -23574,7 +23580,7 @@

    Supress success

    2021-10-07 14:12:52
    -

    *Thread Reply:* @Alex P I moved the delete APIs to the Marquez 0.20.0 release

    +

    *Thread Reply:* @Alex P I moved the delete APIs to the Marquez 0.20.0 release

    @@ -23600,7 +23606,7 @@

    Supress success

    2021-10-07 14:39:03
    -

    *Thread Reply:* Thanks Willy.

    +

    *Thread Reply:* Thanks Willy.

    @@ -23630,7 +23636,7 @@

    Supress success

    2021-10-07 14:48:37
    -

    *Thread Reply:* I have also updated a corresponding issue to track this in OpenLineage: https://github.com/OpenLineage/OpenLineage/issues/323

    +

    *Thread Reply:* I have also updated a corresponding issue to track this in OpenLineage: https://github.com/OpenLineage/OpenLineage/issues/323

    @@ -23762,7 +23768,7 @@

    Supress success

    2021-10-13 10:47:49
    -

    *Thread Reply:* Reminder that the meeting is today. See you soon

    +

    *Thread Reply:* Reminder that the meeting is today. See you soon

    @@ -23788,7 +23794,7 @@

    Supress success

    2021-10-13 19:49:21
    -

*Thread Reply:* The recording and notes of the meeting are now available:

+

*Thread Reply:* The recording and notes of the meeting are now available:
https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting#MonthlyTSCmeeting-Oct13th2021

    @@ -23910,7 +23916,7 @@

    Supress success

    2021-10-07 14:55:35
    -

*Thread Reply:* For people following along, dbt changed the schema of its metadata, which broke the OpenLineage integration. However, we were a bit too stringent in validating the schema version (they increment it every time, even if it’s backwards compatible, which it is in this case). We will fix that so that future compatible changes don’t prevent the OL integration from working.

+

*Thread Reply:* For people following along, dbt changed the schema of its metadata, which broke the OpenLineage integration. However, we were a bit too stringent in validating the schema version (they increment it every time, even if it’s backwards compatible, which it is in this case). We will fix that so that future compatible changes don’t prevent the OL integration from working.
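
A minimal sketch of the lenient check being described (the helper names here are hypothetical, not the actual dbt-ol code; dbt does record a metadata.dbt_schema_version like https://schemas.getdbt.com/dbt/manifest/v3.json):

    import re

    def manifest_schema_version(manifest: dict) -> int:
        # dbt writes a schema id like https://schemas.getdbt.com/dbt/manifest/v3.json
        schema_id = manifest["metadata"]["dbt_schema_version"]
        return int(re.search(r"/v(\d+)\.json", schema_id).group(1))

    def assert_compatible(manifest: dict, min_version: int = 2) -> None:
        # accept any newer, backwards-compatible manifest version
        # instead of rejecting every version we have not seen yet
        if manifest_schema_version(manifest) < min_version:
            raise ValueError("dbt manifest schema too old")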

    @@ -23936,7 +23942,7 @@

    Supress success

    2021-10-07 16:44:28
    -

    *Thread Reply:* As one of the main integrations, would be good to connect more within the dbt community for the next releases, by testing the release candidates 👍

    +

    *Thread Reply:* As one of the main integrations, would be good to connect more within the dbt community for the next releases, by testing the release candidates 👍

    Thanks for the PR

    @@ -23968,7 +23974,7 @@

    Supress success

    2021-10-07 16:46:40
    -

*Thread Reply:* Yeah, I totally agree with you. We also should be more proactive and more aware of what’s coming in future dbt releases. Sorry if you were affected by this bug :ladybug:

+

*Thread Reply:* Yeah, I totally agree with you. We also should be more proactive and more aware of what’s coming in future dbt releases. Sorry if you were affected by this bug :ladybug:

    @@ -23994,7 +24000,7 @@

    Supress success

    2021-10-07 18:12:22
    -

*Thread Reply:* We’ve released OpenLineage 0.2.3 with the hotfix adding dbt v3 manifest support; see https://github.com/OpenLineage/OpenLineage/releases/tag/0.2.3

+

*Thread Reply:* We’ve released OpenLineage 0.2.3 with the hotfix adding dbt v3 manifest support; see https://github.com/OpenLineage/OpenLineage/releases/tag/0.2.3

    You can download and install openlineage-dbt 0.2.3 with the fix using:

    @@ -24050,7 +24056,7 @@

    Supress success

    2021-10-07 21:10:36
    -

    *Thread Reply:* @Maciej Obuchowski might know

    +

    *Thread Reply:* @Maciej Obuchowski might know

    @@ -24076,7 +24082,7 @@

    Supress success

    2021-10-08 04:23:17
    -

    *Thread Reply:* @Drew Bittenbender dbt-ol always calls dbt command now, without spawning shell - so it does not have access to bash aliases.

    +

    *Thread Reply:* @Drew Bittenbender dbt-ol always calls dbt command now, without spawning shell - so it does not have access to bash aliases.

    Can you elaborate about your use case? Do you mean that dbt in your path does docker run or something like this? It still might be a problem if we won't have access to artifacts generated by dbt in target directory.

    @@ -24104,7 +24110,7 @@

    Supress success

    2021-10-08 10:59:32
    -

    *Thread Reply:* I am running on a mac and I have aliased (.zshrc) dbt to execute docker run against the fishtownanalytics docker image rather than installing dbt natively (homebrew, etc). I am doing this so that the dbt configuration is portable and reusable by others.

    +

    *Thread Reply:* I am running on a mac and I have aliased (.zshrc) dbt to execute docker run against the fishtownanalytics docker image rather than installing dbt natively (homebrew, etc). I am doing this so that the dbt configuration is portable and reusable by others.

It seems that by installing openlineage-dbt in a virtual environment, it pulls down its own version of dbt which it calls inline rather than shelling out and executing the dbt setup resident in the host system. I understand that opening a shell is a security risk so that is understandable.

    @@ -24132,7 +24138,7 @@

    Supress success

    2021-10-08 11:05:00
    -

    *Thread Reply:* It does not pull down, it just assumes that it's in the system. It would fail if it isn't.

    +

    *Thread Reply:* It does not pull down, it just assumes that it's in the system. It would fail if it isn't.

    For now I think you could build your own image based on official one, and install openlineage-dbt inside, something like:

    @@ -24164,7 +24170,7 @@

    Supress success

    2021-10-08 11:05:15
    -

    *Thread Reply:* and then pass OPENLINEAGE_URL in env while doing docker run

    +

    *Thread Reply:* and then pass OPENLINEAGE_URL in env while doing docker run

    @@ -24190,7 +24196,7 @@

    Supress success

    2021-10-08 11:06:55
    -

    *Thread Reply:* Also, to make sure that using shell would help in your case: do you bind mount your dbt directory to home? dbt-ol can't run without access to dbt's target directory, so if it's not visible in host, the only option is to have dbt-ol in container.

    +

    *Thread Reply:* Also, to make sure that using shell would help in your case: do you bind mount your dbt directory to home? dbt-ol can't run without access to dbt's target directory, so if it's not visible in host, the only option is to have dbt-ol in container.

    @@ -24263,7 +24269,7 @@

    Supress success

    2021-10-08 08:16:37
    -

*Thread Reply:* Regarding 2), the data is only visible after the next dbt-ol run - dbt docs generate does not emit events itself, but generates data that the next run takes into account.

+

*Thread Reply:* Regarding 2), the data is only visible after the next dbt-ol run - dbt docs generate does not emit events itself, but generates data that the next run takes into account.

    @@ -24289,7 +24295,7 @@

    Supress success

    2021-10-08 08:24:57
    -

*Thread Reply:* oh got it, since it’s in default, I need to click on it and choose my dbt profile’s account name. thnx

+

*Thread Reply:* oh got it, since it’s in default, I need to click on it and choose my dbt profile’s account name. thnx

    @@ -24324,7 +24330,7 @@

    Supress success

    2021-10-08 11:25:22
    -

*Thread Reply:* May I know why these highlighted ones don't have a schema? FYI, I used sources in dbt.

+

*Thread Reply:* May I know why these highlighted ones don't have a schema? FYI, I used sources in dbt.

    @@ -24359,7 +24365,7 @@

    Supress success

    2021-10-08 11:26:18
    -

    *Thread Reply:* Do they have it in dbt docs?

    +

    *Thread Reply:* Do they have it in dbt docs?

    @@ -24385,7 +24391,7 @@

    Supress success

    2021-10-08 11:33:59
    -

*Thread Reply:* I prepared this yaml file, not sure if this is what you asked

+

*Thread Reply:* I prepared this yaml file, not sure if this is what you asked

    @@ -24484,7 +24490,7 @@

    Supress success

    2021-10-12 07:21:33
    -

*Thread Reply:* I don't think anything is wrong with your branch. It's also not working on mine. Maybe it's globally stuck?

+

*Thread Reply:* I don't think anything is wrong with your branch. It's also not working on mine. Maybe it's globally stuck?

    @@ -24543,7 +24549,7 @@

    Supress success

    2021-10-12 15:50:20
    -

*Thread Reply:* > Is there a way to generate OpenLineage output that contains a mapping between input and output fields?

+

*Thread Reply:* > Is there a way to generate OpenLineage output that contains a mapping between input and output fields?
OpenLineage defines discrete classes for both OpenLineage.InputDataset and OpenLineage.OutputDataset datasets. But, for clarification, are you asking:

    1. If a job reads / writes to the same dataset, how can OpenLineage track which fields were used in job’s logic as input and which fields were used to write back to the resulting output?
    2. Or, if a job reads / writes from two different dataset, how can OpenLineage track which input fields were used in the job’s logic for the resulting output dataset? (i.e. column-level lineage)
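
For reference, a minimal sketch of emitting an event with explicit input and output datasets through the Python client (module layout as in the openlineage-python client of this era; the namespaces, dataset names, and producer URI are placeholders):

    from datetime import datetime, timezone
    from uuid import uuid4

    from openlineage.client import OpenLineageClient
    from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

    client = OpenLineageClient(url="http://localhost:5000")
    client.emit(
        RunEvent(
            eventType=RunState.COMPLETE,
            eventTime=datetime.now(timezone.utc).isoformat(),
            run=Run(runId=str(uuid4())),
            job=Job(namespace="example", name="daily_aggregation"),
            producer="https://example.com/jobs/daily_aggregation",  # placeholder
            inputs=[Dataset(namespace="example", name="orders")],
            outputs=[Dataset(namespace="example", name="daily_orders")],
        )
    )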
@@ -24573,7 +24579,7 @@

      Supress success

    2021-10-12 15:56:18
    -

*Thread Reply:* > In Azure Databricks sources often map to ADB mount points.  We are looking for a way to translate this into source metadata in the OL output.  Is there some configuration that would make this possible, or any other suggestions?

+

*Thread Reply:* > In Azure Databricks sources often map to ADB mount points.  We are looking for a way to translate this into source metadata in the OL output.  Is there some configuration that would make this possible, or any other suggestions?
I would look into our OutputDatasetVisitors class (as a starting point) that extracts metadata from the spark logical plan to construct a mapping between a logical plan and one or more OpenLineage.Dataset for the spark job. But, I think @Michael Collado will have a more detailed suggestion / approach to what you’re asking

    @@ -24600,7 +24606,7 @@

    Supress success

    2021-10-12 15:59:41
    -

    *Thread Reply:* are the sources mounted like local filesystem mounts? are you ending up with datasources that point to the local filesystem rather than some dbfs url? (sorry, I'm not familiar with databricks or azure at this point)

    +

    *Thread Reply:* are the sources mounted like local filesystem mounts? are you ending up with datasources that point to the local filesystem rather than some dbfs url? (sorry, I'm not familiar with databricks or azure at this point)

    @@ -24626,7 +24632,7 @@

    Supress success

    2021-10-12 16:59:38
    -

    *Thread Reply:* I think under the covers they are an os level fs mount, but it is using an ADB specific api, dbutils.fs.mount. It is using the ADB filesystem.

    +

    *Thread Reply:* I think under the covers they are an os level fs mount, but it is using an ADB specific api, dbutils.fs.mount. It is using the ADB filesystem.

    docs.microsoft.com
    @@ -24677,7 +24683,7 @@

    Supress success

    2021-10-12 17:01:23
    -

*Thread Reply:* Do you use the dbfs scheme to access the files from Spark as in the example on that page?

+

*Thread Reply:* Do you use the dbfs scheme to access the files from Spark as in the example on that page?
df = spark.read.text("dbfs:/mymount/my_file.txt")

    @@ -24704,7 +24710,7 @@

    Supress success

    2021-10-12 17:04:52
    -

    *Thread Reply:* @Willy Lulciuc In our project, @Will Johnson had generated some sample OL output from just reading in and writing out a dataset to blob storage. In the resulting output, I see the columns represented as fields under the schema element with a set represented for output and another for input. I would need the mapping of in and out columns to generate column level lineage so wondering if it is possible to get or am I just missing it somewhere? Thanks for your help!

    +

    *Thread Reply:* @Willy Lulciuc In our project, @Will Johnson had generated some sample OL output from just reading in and writing out a dataset to blob storage. In the resulting output, I see the columns represented as fields under the schema element with a set represented for output and another for input. I would need the mapping of in and out columns to generate column level lineage so wondering if it is possible to get or am I just missing it somewhere? Thanks for your help!

    @@ -24730,7 +24736,7 @@

    Supress success

    2021-10-12 17:26:35
    -

*Thread Reply:* Ahh, well currently, no, but it has been discussed and is on the OpenLineage roadmap. Here’s a proposal opened by @Julien Le Dem, column level lineage facet, that starts the discussion to add the columnLineage facet to the datasets model in order to support column-level lineage. Would be great to get your thoughts!

+

*Thread Reply:* Ahh, well currently, no, but it has been discussed and is on the OpenLineage roadmap. Here’s a proposal opened by @Julien Le Dem, column level lineage facet, that starts the discussion to add the columnLineage facet to the datasets model in order to support column-level lineage. Would be great to get your thoughts!

    @@ -24879,7 +24885,7 @@

    Supress success

    2021-10-12 17:41:41
    -

    *Thread Reply:* @Michael Collado - Databricks allows you to reference a file called /mnt/someMount/some/file/path The way you have referenced it would let you hit the file with local file system stuff like pandas / local python.

    +

    *Thread Reply:* @Michael Collado - Databricks allows you to reference a file called /mnt/someMount/some/file/path The way you have referenced it would let you hit the file with local file system stuff like pandas / local python.

    @@ -24905,7 +24911,7 @@

    Supress success

    2021-10-12 17:49:37
    -

*Thread Reply:* For column level lineage, you can add your own custom facets: Here’s an example in the Spark integration: (LogicalPlanFacet) https://github.com/OpenLineage/OpenLineage/blob/5f189a94990dad715745506c0282e16fd8[…]openlineage/spark/agent/lifecycle/SparkSQLExecutionContext.java

+

*Thread Reply:* For column level lineage, you can add your own custom facets: Here’s an example in the Spark integration: (LogicalPlanFacet) https://github.com/OpenLineage/OpenLineage/blob/5f189a94990dad715745506c0282e16fd8[…]openlineage/spark/agent/lifecycle/SparkSQLExecutionContext.java
Here is the paragraph about this in the spec: https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md#custom-facet-naming
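
As a rough sketch of the custom-facet idea on the Python side (the facet class, its name, and its fields below are hypothetical and not part of the spec; BaseFacet is the Python client's base class):

    import attr
    from openlineage.client.facet import BaseFacet

    @attr.s
    class ColumnMappingRunFacet(BaseFacet):
        # hypothetical facet: output column -> contributing input columns
        columnMapping: dict = attr.ib()

    facet = ColumnMappingRunFacet(columnMapping={"total": ["price", "quantity"]})
    # attached to a run as: Run(runId=..., facets={"columnMapping": facet})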

    @@ -24980,7 +24986,7 @@

    Supress success

    2021-10-12 17:51:24
    -

    *Thread Reply:* This example adds facets to the run, but you can also add them to the job

    +

    *Thread Reply:* This example adds facets to the run, but you can also add them to the job

    @@ -25006,7 +25012,7 @@

    Supress success

    2021-10-12 17:52:46
    -

    *Thread Reply:* unfortunately, there's not yet a way to add your own custom facets to the spark integration- there's some work on extensibility to be done

    +

    *Thread Reply:* unfortunately, there's not yet a way to add your own custom facets to the spark integration- there's some work on extensibility to be done

    @@ -25032,7 +25038,7 @@

    Supress success

    2021-10-12 17:54:07
    -

    *Thread Reply:* for the hackathon's sake, you can check out the package and just add in whatever you want

    +

    *Thread Reply:* for the hackathon's sake, you can check out the package and just add in whatever you want

    @@ -25058,7 +25064,7 @@

    Supress success

    2021-10-12 18:26:44
    -

    *Thread Reply:* Thank you guys!!

    +

    *Thread Reply:* Thank you guys!!

    @@ -25145,7 +25151,7 @@

    Supress success

    2021-10-12 21:46:12
    -

    *Thread Reply:* You can also pass the settings independently if you want something more flexible: https://github.com/OpenLineage/OpenLineage/blob/8afc4ff88b8dd8090cd9c45061a9f669fe[…]n/java/io/openlineage/spark/agent/OpenLineageSparkListener.java

    +

    *Thread Reply:* You can also pass the settings independently if you want something more flexible: https://github.com/OpenLineage/OpenLineage/blob/8afc4ff88b8dd8090cd9c45061a9f669fe[…]n/java/io/openlineage/spark/agent/OpenLineageSparkListener.java

    @@ -25196,7 +25202,7 @@

    Supress success

    2021-10-12 21:47:36
    -

*Thread Reply:* SparkSession.builder()

+

*Thread Reply:* SparkSession.builder()
      .config("spark.jars.packages", "io.openlineage:openlineage_spark:0.2.+")
      .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
      .config("spark.openlineage.host", "<https://localhost>")

@@ -25228,7 +25234,7 @@
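
For anyone following along in PySpark, the same settings look roughly like this (a sketch; the host and namespace values are placeholders):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.jars.packages", "io.openlineage:openlineage_spark:0.2.+")
        .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
        .config("spark.openlineage.host", "https://localhost")   # placeholder
        .config("spark.openlineage.namespace", "my_namespace")   # placeholder
        .getOrCreate()
    )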

    Supress success

    2021-10-12 21:48:57
    -

    *Thread Reply:* It is going to add /lineage in the end: https://github.com/OpenLineage/OpenLineage/blob/8afc4ff88b8dd8090cd9c45061a9f669fe[…]rc/main/java/io/openlineage/spark/agent/OpenLineageContext.java

    +

    *Thread Reply:* It is going to add /lineage in the end: https://github.com/OpenLineage/OpenLineage/blob/8afc4ff88b8dd8090cd9c45061a9f669fe[…]rc/main/java/io/openlineage/spark/agent/OpenLineageContext.java

    @@ -25279,7 +25285,7 @@

    Supress success

    2021-10-12 21:49:37
    -

    *Thread Reply:* the apiKey setting is sent in an “Authorization” header

    +

    *Thread Reply:* the apiKey setting is sent in an “Authorization” header

    @@ -25305,7 +25311,7 @@

    Supress success

    2021-10-12 21:49:55
    -

    *Thread Reply:* “Bearer $KEY”

    +

    *Thread Reply:* “Bearer $KEY”
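In other words, the emitted request is roughly equivalent to this sketch (using the requests library; the /api/v1/lineage path is the one Marquez serves, and the URL, key, and payload here are placeholders):

    import requests

    OPENLINEAGE_URL = "https://localhost"  # placeholder
    API_KEY = "my-api-key"                 # placeholder
    event = {"eventType": "START"}         # a serialized RunEvent would go here

    resp = requests.post(
        f"{OPENLINEAGE_URL}/api/v1/lineage",
        json=event,
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    resp.raise_for_status()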

    @@ -25331,7 +25337,7 @@

    Supress success

    2021-10-12 21:51:09
    -

    *Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/a6eea7a55fef444b6561005164869a9082[…]n/java/io/openlineage/spark/agent/client/OpenLineageClient.java

    +

    *Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/a6eea7a55fef444b6561005164869a9082[…]n/java/io/openlineage/spark/agent/client/OpenLineageClient.java

    @@ -25382,7 +25388,7 @@

    Supress success

    2021-10-12 22:54:22
    -

    *Thread Reply:* Thank you @Julien Le Dem it seems in both cases (defining the url endpoint with spark.openlineage.url and with the components: spark.openlineage.host / openlineage.version / openlineage.namespace / etc.) OpenLineage will strip out url parameters and rebuild the url endpoint with /lineage.

    +

    *Thread Reply:* Thank you @Julien Le Dem it seems in both cases (defining the url endpoint with spark.openlineage.url and with the components: spark.openlineage.host / openlineage.version / openlineage.namespace / etc.) OpenLineage will strip out url parameters and rebuild the url endpoint with /lineage.

    I think we might need to add in a url parameter configuration for our hackathon. We're using a bit of serverless code to shuttle open lineage events to a queue so that another job and/or serverless application can read that queue at its leisure.

    @@ -25416,7 +25422,7 @@

    Supress success

    2021-10-13 10:46:22
    -

    *Thread Reply:* Yes, please open an issue detailing the use case.

    +

    *Thread Reply:* Yes, please open an issue detailing the use case.

    @@ -25497,7 +25503,7 @@

    Supress success

    2021-10-13 17:54:22
    -

*Thread Reply:* This is because this specific action is not covered yet. You can see the “spark_unknown” facet is describing things that are not understood yet

+

*Thread Reply:* This is because this specific action is not covered yet. You can see the “spark_unknown” facet is describing things that are not understood yet
"run": {
  ...
  "facets": {

@@ -25536,7 +25542,7 @@

    Supress success

    2021-10-13 17:54:43
    -

    *Thread Reply:* I think this is part of the Spark 3 gap

    +

    *Thread Reply:* I think this is part of the Spark 3 gap

    @@ -25562,7 +25568,7 @@

    Supress success

    2021-10-13 17:55:46
    -

    *Thread Reply:* an unknown output will cause missing output lineage

    +

    *Thread Reply:* an unknown output will cause missing output lineage

    @@ -25588,7 +25594,7 @@

    Supress success

    2021-10-13 18:05:57
    -

    *Thread Reply:* Output handling is here: https://github.com/OpenLineage/OpenLineage/blob/e0f1852422f325dc019b0eab0e466dc905[…]io/openlineage/spark/agent/lifecycle/OutputDatasetVisitors.java

    +

    *Thread Reply:* Output handling is here: https://github.com/OpenLineage/OpenLineage/blob/e0f1852422f325dc019b0eab0e466dc905[…]io/openlineage/spark/agent/lifecycle/OutputDatasetVisitors.java

    @@ -25643,7 +25649,7 @@

    Supress success

    2021-10-13 22:49:08
    -

    *Thread Reply:* Ah! Thank you so much, Julien! This is very helpful to understand where that is set. This is a big gap that we want to help address after our hackathon. Thank you!

    +

    *Thread Reply:* Ah! Thank you so much, Julien! This is very helpful to understand where that is set. This is a big gap that we want to help address after our hackathon. Thank you!

    @@ -25753,7 +25759,7 @@

    Supress success

    2021-10-13 20:13:02
    -

    *Thread Reply:* the github wiki is backed by a git repo but it does not allow PRs. (people do hacks but I’d rather avoid those)

    +

    *Thread Reply:* the github wiki is backed by a git repo but it does not allow PRs. (people do hacks but I’d rather avoid those)

    @@ -25892,7 +25898,7 @@

    Supress success

    2021-10-22 05:44:17
    -

*Thread Reply:* Maybe you want to contribute it?

+

*Thread Reply:* Maybe you want to contribute it?
It's not that hard, mostly testing, and figuring out what would be the naming of openlineage namespace for Trino, and how some additional statistics work.

    For example, recently we had added support for Redshift by community member @ale

    @@ -26068,7 +26074,7 @@

    Supress success

    2021-10-28 12:20:27
    -

    *Thread Reply:* That would make sense. I think you are the first person to request this. Is this something you would want to contribute to the project?

    +

    *Thread Reply:* That would make sense. I think you are the first person to request this. Is this something you would want to contribute to the project?

    @@ -26094,7 +26100,7 @@

    Supress success

    2021-10-28 17:37:53
    -

*Thread Reply:* I would like to, Julien, but I'm not sure how to do it. Could you guide me on how to start, or show me another integration?

+

*Thread Reply:* I would like to, Julien, but I'm not sure how to do it. Could you guide me on how to start, or show me another integration?

    @@ -26120,7 +26126,7 @@

    Supress success

    2021-10-31 07:57:55
    -

    *Thread Reply:* @David Virgil look at the pull request for the addition of Redshift as a starting guide. https://github.com/OpenLineage/OpenLineage/pull/328

    +

    *Thread Reply:* @David Virgil look at the pull request for the addition of Redshift as a starting guide. https://github.com/OpenLineage/OpenLineage/pull/328

    @@ -26180,7 +26186,7 @@

    Supress success

    2021-11-01 12:01:41
    -

*Thread Reply:* Thanks @Matthew Mullins, I'll try to add the dbt spark integration

+

*Thread Reply:* Thanks @Matthew Mullins, I'll try to add the dbt spark integration

    @@ -26232,7 +26238,7 @@

    Supress success

    2021-10-28 10:05:09
    -

    *Thread Reply:* OK, was changed here: https://github.com/OpenLineage/OpenLineage/pull/286

    +

    *Thread Reply:* OK, was changed here: https://github.com/OpenLineage/OpenLineage/pull/286

    Did you think about this?

    Supress success
    2021-10-28 12:19:27
    -

    *Thread Reply:* In Marquez there was a mechanism to do that. Something like OPENLINEAGE_BACKEND=HTTP|LOG

    +

    *Thread Reply:* In Marquez there was a mechanism to do that. Something like OPENLINEAGE_BACKEND=HTTP|LOG

    @@ -26323,7 +26329,7 @@

    Supress success

    2021-10-28 13:56:42
    -

    *Thread Reply:* @Mario Measic We're going to add Transport mechanism, that will address use cases like yours. Please comment on this PR what would you expect: https://github.com/OpenLineage/OpenLineage/pull/344

    +

    *Thread Reply:* @Mario Measic We're going to add Transport mechanism, that will address use cases like yours. Please comment on this PR what would you expect: https://github.com/OpenLineage/OpenLineage/pull/344
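
A rough sketch of what such a switch could look like on the client side (entirely hypothetical - the real design was being worked out in that PR):

    import json
    import logging
    import os
    import requests

    def emit(event: dict) -> None:
        # hypothetical: choose a backend the way OPENLINEAGE_BACKEND=HTTP|LOG did in Marquez
        if os.environ.get("OPENLINEAGE_BACKEND", "HTTP") == "LOG":
            logging.getLogger("openlineage").info(json.dumps(event))
        else:
            requests.post(os.environ["OPENLINEAGE_URL"] + "/api/v1/lineage", json=event)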

    @@ -26386,7 +26392,7 @@

    Supress success

    2021-10-28 15:29:50
    -

    *Thread Reply:* Nice, thanks @Julien Le Dem and @Maciej Obuchowski.

    +

    *Thread Reply:* Nice, thanks @Julien Le Dem and @Maciej Obuchowski.

    @@ -26412,7 +26418,7 @@

    Supress success

    2021-10-28 15:46:45
    -

    *Thread Reply:* Also, dbt build is not working which is kind of the biggest feature of the version 0.21.0, I will try testing the code with modifications to the https://github.com/OpenLineage/OpenLineage/blob/c3aa70e161244091969951d0da4f37619bcbe36f/integration/dbt/scripts/dbt-ol#L141

    +

    *Thread Reply:* Also, dbt build is not working which is kind of the biggest feature of the version 0.21.0, I will try testing the code with modifications to the https://github.com/OpenLineage/OpenLineage/blob/c3aa70e161244091969951d0da4f37619bcbe36f/integration/dbt/scripts/dbt-ol#L141

    I guess there's a reason for it that I didn't see since you support v3 of the manifest.

    Supress success
    2021-10-29 03:45:27
    -

    *Thread Reply:* Also, is it normal not to see the column descriptions for the model/table even though these are provided in the YAML file, persisted in Redshift and also dbt docs generate has been run before dbt-ol run?

    +

    *Thread Reply:* Also, is it normal not to see the column descriptions for the model/table even though these are provided in the YAML file, persisted in Redshift and also dbt docs generate has been run before dbt-ol run?

    @@ -26491,7 +26497,7 @@

    Supress success

    2021-10-29 04:26:22
    -

    *Thread Reply:* Tried with dbt versions 0.20.2 and 0.21.0, openlineage-dbt==0.3.1

    +

    *Thread Reply:* Tried with dbt versions 0.20.2 and 0.21.0, openlineage-dbt==0.3.1

    @@ -26517,7 +26523,7 @@

    Supress success

    2021-10-29 10:39:10
    -

    *Thread Reply:* I'll take a look at that. Supporting descriptions might be simple, but dbt build might be a little larger task.

    +

    *Thread Reply:* I'll take a look at that. Supporting descriptions might be simple, but dbt build might be a little larger task.

    @@ -26543,7 +26549,7 @@

    Supress success

    2021-11-01 19:12:01
    -

    *Thread Reply:* I opened a ticket to track this: https://github.com/OpenLineage/OpenLineage/issues/376

    +

    *Thread Reply:* I opened a ticket to track this: https://github.com/OpenLineage/OpenLineage/issues/376

    @@ -26599,7 +26605,7 @@

    Supress success

    2021-11-02 05:48:06
    -

    *Thread Reply:* The column description issue should be fixed here: https://github.com/OpenLineage/OpenLineage/pull/383

    +

    *Thread Reply:* The column description issue should be fixed here: https://github.com/OpenLineage/OpenLineage/pull/383

    @@ -26782,7 +26788,7 @@

    Supress success

    2021-10-28 18:51:10
    -

    *Thread Reply:* If anyone has any questions or comments - happy to discuss here

    +

    *Thread Reply:* If anyone has any questions or comments - happy to discuss here

    @@ -26808,7 +26814,7 @@

    Supress success

    2021-10-28 18:51:15
    -

    *Thread Reply:* @davzucky

    +

    *Thread Reply:* @davzucky

    @@ -26834,7 +26840,7 @@

    Supress success

    2021-10-28 23:01:29
    -

    *Thread Reply:* Thanks for updating the community, Brad!

    +

    *Thread Reply:* Thanks for updating the community, Brad!

    @@ -26860,7 +26866,7 @@

    Supress success

    2021-10-28 23:47:02
    -

*Thread Reply:* Thank you Brad. Looking forward to seeing how to integrate that with v2

+

*Thread Reply:* Thank you Brad. Looking forward to seeing how to integrate that with v2

    @@ -26924,7 +26930,7 @@

    Supress success

    2021-10-28 18:54:56
    -

    *Thread Reply:* Welcome, @Kevin Kho 👋. Really excited to see this integration kick off! 💯🚀

    +

    *Thread Reply:* Welcome, @Kevin Kho 👋. Really excited to see this integration kick off! 💯🚀

    @@ -26989,7 +26995,7 @@

    Supress success

    2021-11-01 12:23:11
    -

    *Thread Reply:* While it’s not something we’re supporting at the moment, it’s definitely something that we’re considering!

    +

    *Thread Reply:* While it’s not something we’re supporting at the moment, it’s definitely something that we’re considering!

    If you can give me a little more detail on what your system infrastructure is like, it’ll help us set priority and design

    @@ -27017,7 +27023,7 @@

    Supress success

    2021-11-01 13:57:34
    -

*Thread Reply:* So, basic architecture of a datalake. We are using Airflow to trigger jobs. Every job is a pipeline that runs a Spark job (in our case it spins up an EMR cluster). So the idea of lineage would be to define inlets and outlets in the DAGs based on the Airflow lineage API:

+

*Thread Reply:* So, basic architecture of a datalake. We are using Airflow to trigger jobs. Every job is a pipeline that runs a Spark job (in our case it spins up an EMR cluster). So the idea of lineage would be to define inlets and outlets in the DAGs based on the Airflow lineage API:

    https://airflow.apache.org/docs/apache-airflow/stable/lineage.html
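
For concreteness, manually declared inlets/outlets look roughly like this (a sketch based on the Airflow lineage docs; the DAG name and paths are illustrative):

    from datetime import datetime

    from airflow import DAG
    from airflow.lineage.entities import File
    from airflow.operators.bash import BashOperator

    with DAG("orders_pipeline", start_date=datetime(2021, 11, 1), schedule_interval="@daily") as dag:
        raw = File(url="s3://datalake/raw/orders")          # illustrative
        curated = File(url="s3://datalake/curated/orders")  # illustrative
        transform = BashOperator(
            task_id="transform_orders",
            bash_command="spark-submit transform.py",
            inlets=[raw],
            outlets=[curated],
        )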

    @@ -27047,7 +27053,7 @@

    Supress success

    2021-11-01 14:01:24
    -

    *Thread Reply:* Why not use spark integration? https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark

    +

    *Thread Reply:* Why not use spark integration? https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark

    @@ -27073,7 +27079,7 @@

    Supress success

    2021-11-01 14:05:02
    -

*Thread Reply:* because there are some other jobs that are not Spark; some jobs run in dbt, other jobs run in Redshift @Maciej Obuchowski

+

*Thread Reply:* because there are some other jobs that are not Spark; some jobs run in dbt, other jobs run in Redshift @Maciej Obuchowski

    @@ -27099,7 +27105,7 @@

    Supress success

    2021-11-01 14:08:58
    -

    *Thread Reply:* So, combo of https://github.com/OpenLineage/OpenLineage/tree/main/integration/dbt and PostgresExtractor from airflow integration should cover Redshift if you're using it from PostgresOperator 🙂

    +

    *Thread Reply:* So, combo of https://github.com/OpenLineage/OpenLineage/tree/main/integration/dbt and PostgresExtractor from airflow integration should cover Redshift if you're using it from PostgresOperator 🙂

    It's definitely interesting use case - you'd be using most of the existing integrations we have.

    @@ -27127,7 +27133,7 @@

    Supress success

    2021-11-01 15:04:44
    -

    *Thread Reply:* @Maciej Obuchowski Do i need to define any extractor in the airflow startup?

    +

    *Thread Reply:* @Maciej Obuchowski Do i need to define any extractor in the airflow startup?

    @@ -27153,7 +27159,7 @@

    Supress success

    2021-11-05 23:48:21
    -

    *Thread Reply:* I am using Redshift with PostgresOperator and it is returning…

    +

    *Thread Reply:* I am using Redshift with PostgresOperator and it is returning…

[2021-11-06 03:43:06,541] {{__init__.py:92}} ERROR - Failed to extract metadata 'NoneType' object has no attribute 'host' task_type=PostgresOperator airflow_dag_id=counter task_id=inc airflow_run_id=scheduled__2021-11-06T03:42:00+00:00
Traceback (most recent call last):

@@ -27272,7 +27278,7 @@

    Supress success

    2021-11-01 14:06:12
    -

    *Thread Reply:* 1. Please use openlineage.lineage_backend.OpenLineageBackend as AIRFLOW__LINEAGE__BACKEND

    +

    *Thread Reply:* 1. Please use openlineage.lineage_backend.OpenLineageBackend as AIRFLOW__LINEAGE__BACKEND

2. Please tell us where you've seen openlineage.airflow.backend.OpenLineageBackend, so we can fix the documentation 🙂
    @@ -27301,7 +27307,7 @@

    Supress success

    2021-11-01 19:07:21
    -

    *Thread Reply:* https://pypi.org/project/openlineage-airflow/

    +

    *Thread Reply:* https://pypi.org/project/openlineage-airflow/

    PyPI
    @@ -27352,7 +27358,7 @@

    Supress success

    2021-11-01 19:08:03
    -

    *Thread Reply:* (I googled it and found that page that seems to have an outdated doc)

    +

    *Thread Reply:* (I googled it and found that page that seems to have an outdated doc)

    @@ -27378,7 +27384,7 @@

    Supress success

    2021-11-02 02:38:59
    -

*Thread Reply:* @Maciej Obuchowski

+

*Thread Reply:* @Maciej Obuchowski @Julien Le Dem that's the page I followed. Please revise the documentation, guys, as it is very important

    @@ -27405,7 +27411,7 @@

    Supress success

    2021-11-02 04:34:14
    -

    *Thread Reply:* It should just copy actual readme

    +

    *Thread Reply:* It should just copy actual readme

    @@ -27456,7 +27462,7 @@

    Supress success

    2021-11-03 16:30:00
    -

    *Thread Reply:* PyPi is using the README at the time of the release 0.3.1, rather than the current README, which is 0.4.0. If we send the new release to PyPi it should also update the README

    +

    *Thread Reply:* PyPi is using the README at the time of the release 0.3.1, rather than the current README, which is 0.4.0. If we send the new release to PyPi it should also update the README

    @@ -27508,7 +27514,7 @@

    Supress success

    2021-11-01 15:19:18
    -

*Thread Reply:* I set it up in the scheduler and it starts to log data to marquez. But it fails with this error:

+

*Thread Reply:* I set it up in the scheduler and it starts to log data to marquez. But it fails with this error:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/openlineage/client/client.py", line 49, in __init__

@@ -27539,7 +27545,7 @@

    Supress success

    2021-11-01 15:19:26
    -

    *Thread Reply:* why is it not a valid URL?

    +

    *Thread Reply:* why is it not a valid URL?

    @@ -27565,7 +27571,7 @@

    Supress success

    2021-11-01 18:39:58
    -

    *Thread Reply:* Which version of the OpenLineage client are you using? On first check it should be fine

    +

    *Thread Reply:* Which version of the OpenLineage client are you using? On first check it should be fine

    @@ -27591,7 +27597,7 @@

    Supress success

    2021-11-02 05:14:30
    -

    *Thread Reply:* @John Thomas I was appending double quotes as part of the url. Forget about this error

    +

    *Thread Reply:* @John Thomas I was appending double quotes as part of the url. Forget about this error
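
For the record, the failure is easy to reproduce: a URL wrapped in literal quote characters has no valid scheme, which is why the client rejects it:

    from urllib.parse import urlparse

    print(urlparse("http://localhost:5000").scheme)    # 'http' - valid
    print(urlparse('"http://localhost:5000"').scheme)  # ''     - the quote breaks scheme parsing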

    @@ -27617,7 +27623,7 @@

    Supress success

    2021-11-02 10:35:28
    -

    *Thread Reply:* aaaah, gotcha, good catch!

    +

    *Thread Reply:* aaaah, gotcha, good catch!

    @@ -27673,7 +27679,7 @@

    Supress success

    2021-11-02 05:18:18
    -

    *Thread Reply:* Are you sure that openlineage-airflow is present in the container?

    +

    *Thread Reply:* Are you sure that openlineage-airflow is present in the container?

    @@ -27779,7 +27785,7 @@

    Supress success

    2021-11-02 05:25:43
    -

    *Thread Reply:* Maybe ADDITIONAL_PYTHON_DEPS are dependencies needed by the tasks, and are installed after Airflow tries to initialize LineageBackend?

    +

    *Thread Reply:* Maybe ADDITIONAL_PYTHON_DEPS are dependencies needed by the tasks, and are installed after Airflow tries to initialize LineageBackend?

    @@ -27805,7 +27811,7 @@

    Supress success

    2021-11-02 06:34:11
    -

    *Thread Reply:* I am checking this accessing the Kubernetes pod

    +

    *Thread Reply:* I am checking this accessing the Kubernetes pod

    @@ -27920,7 +27926,7 @@

    Supress success

    2021-11-02 07:34:47
    -

    *Thread Reply:* Yes

    +

    *Thread Reply:* Yes

    @@ -27946,7 +27952,7 @@

    Supress success

    2021-11-02 07:35:53
    -

    *Thread Reply:* Probably what you want is job hierarchy: https://github.com/MarquezProject/marquez/issues/1737

    +

    *Thread Reply:* Probably what you want is job hierarchy: https://github.com/MarquezProject/marquez/issues/1737

    @@ -28001,7 +28007,7 @@

    Supress success

    2021-11-02 07:46:02
    -

*Thread Reply:* I do not see any benefit of just having some Airflow task metadata. I do not see the relationship between tasks. Every task is a job. When I started working on my company's integration with OpenLineage, I thought that OpenLineage would give me the relationship between tasks or datasets, and the only thing I see is some metadata on the history of Airflow runs that is already provided by Airflow

+

*Thread Reply:* I do not see any benefit of just having some Airflow task metadata. I do not see the relationship between tasks. Every task is a job. When I started working on my company's integration with OpenLineage, I thought that OpenLineage would give me the relationship between tasks or datasets, and the only thing I see is some metadata on the history of Airflow runs that is already provided by Airflow

    @@ -28027,7 +28033,7 @@

    Supress success

    2021-11-02 07:46:20
    -

*Thread Reply:* I was expecting to see a nice graph. I think it is missing some features

+

*Thread Reply:* I was expecting to see a nice graph. I think it is missing some features

    @@ -28053,7 +28059,7 @@

    Supress success

    2021-11-02 07:46:25
    -

    *Thread Reply:* at this early stage

    +

    *Thread Reply:* at this early stage

    @@ -28079,7 +28085,7 @@

    Supress success

    2021-11-02 07:50:10
    -

    *Thread Reply:* It probably depends on whether those tasks are covered by the extractors: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow/openlineage/airflow/extractors

    +

    *Thread Reply:* It probably depends on whether those tasks are covered by the extractors: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow/openlineage/airflow/extractors

    @@ -28105,7 +28111,7 @@

    Supress success

    2021-11-02 07:55:50
    -

*Thread Reply:* We are not using any of those operators: BigQuery, Postgres, or Snowflake.

+

*Thread Reply:* We are not using any of those operators: BigQuery, Postgres, or Snowflake.

And what does the GreatExpectations extractor do?

    @@ -28135,7 +28141,7 @@

    Supress success

    2021-11-02 07:56:30
    -

    *Thread Reply:* And that the same dag graph can be seen in marquez, and not one job per task.

    +

    *Thread Reply:* And that the same dag graph can be seen in marquez, and not one job per task.

    @@ -28161,7 +28167,7 @@

    Supress success

    2021-11-02 08:07:06
    -

*Thread Reply:* > It would be good if there is one extractor that relies in the inlets and outlets that you can define in any Airflow task

+

*Thread Reply:* > It would be good if there is one extractor that relies in the inlets and outlets that you can define in any Airflow task
I think this is a good idea. Overall, OpenLineage strongly focuses on automatic metadata collection. However, using them would be a nice fallback for not-covered-yet cases.

> And that the same dag graph can be seen in marquez, and not one job per task.

@@ -28218,7 +28224,7 @@

    Supress success

    2021-11-02 08:09:55
    -

    *Thread Reply:* Created issue for the manual fallback: https://github.com/OpenLineage/OpenLineage/issues/384

    +

    *Thread Reply:* Created issue for the manual fallback: https://github.com/OpenLineage/OpenLineage/issues/384

    @@ -28275,7 +28281,7 @@

    Supress success

    2021-11-02 08:28:29
    -

*Thread Reply:* @Maciej Obuchowski how many people are working full time in this library? I really would like to adopt it in my company, as we use Airflow and Spark, but I see that it does not yet have the features we would like.

+

*Thread Reply:* @Maciej Obuchowski how many people are working full time in this library? I really would like to adopt it in my company, as we use Airflow and Spark, but I see that it does not yet have the features we would like.

At the moment the same info we have in Marquez related to the tasks is available in the Airflow UI or via the Airflow API.

    @@ -28305,7 +28311,7 @@

    Supress success

    2021-11-02 09:33:31
    -

*Thread Reply:* > how many people are working full time in this library?

+

*Thread Reply:* > how many people are working full time in this library?
On Airflow integration or on OpenLineage overall? 🙂

> The game changer for us would be that it could give us features/metadata that we cannot query directly from airflow.

@@ -28362,7 +28368,7 @@

    Supress success

    2021-11-02 09:35:10
    -

    *Thread Reply:* But first, before implementing last option, I'd like to get consensus about it - so feel free to comment there about your use case

    +

    *Thread Reply:* But first, before implementing last option, I'd like to get consensus about it - so feel free to comment there about your use case

    @@ -28444,7 +28450,7 @@

    Supress success

    2021-11-03 08:00:34
    -

    *Thread Reply:* It should only affect you if you're using https://greatexpectations.io/

    +

    *Thread Reply:* It should only affect you if you're using https://greatexpectations.io/

    @@ -28470,7 +28476,7 @@

    Supress success

    2021-11-03 15:57:02
    -

    *Thread Reply:* I have a similar message after installing openlineage into Amazon MWAA from the scheduler logs:

    +

    *Thread Reply:* I have a similar message after installing openlineage into Amazon MWAA from the scheduler logs:

    WARNING:/usr/local/airflow/.local/lib/python3.7/site-packages/openlineage/airflow/extractors/great_expectations_extractor.py:Did not find great_expectations_provider library or failed to import it

    @@ -28530,7 +28536,7 @@

    Supress success

    2021-11-03 08:08:21
    -

*Thread Reply:* I don't think 1) is a good idea. You can have multiple tasks in one dag, processing different datasets and producing different datasets. If you want visual linking of jobs that produce disjoint datasets, then I think you want this: https://github.com/MarquezProject/marquez/issues/1737

+

*Thread Reply:* I don't think 1) is a good idea. You can have multiple tasks in one dag, processing different datasets and producing different datasets. If you want visual linking of jobs that produce disjoint datasets, then I think you want this: https://github.com/MarquezProject/marquez/issues/1737
which will affect the visual layer.

Regarding 2), I think we need to get along with Airflow maintainers regarding the long-term mechanism on which OL will work: https://github.com/apache/airflow/issues/17984

    @@ -28587,7 +28593,7 @@

    Supress success

    2021-11-03 08:19:00
    2021-11-03 08:35:59
    -

    *Thread Reply:* Thank you very much @Maciej Obuchowski

    +

    *Thread Reply:* Thank you very much @Maciej Obuchowski

    @@ -28697,7 +28703,7 @@

    Supress success

    2021-11-03 08:41:14
    -

    *Thread Reply:* It worked like that in Airflow 1.10.

    +

    *Thread Reply:* It worked like that in Airflow 1.10.

    This is an unfortunate limitation of LineageBackend API that we're using for Airflow 2. We're trying to work out solution for this with Airflow maintainers: https://github.com/apache/airflow/issues/17984

    Supress success
    2021-11-04 18:47:41
    -

*Thread Reply:* This job with no output is a symptom of the output not being understood. You should be able to see the facets for that job. There will be a spark_unknown facet with more information about the problem. If you put that into an issue with some more details about this job we should be able to help.

+

*Thread Reply:* This job with no output is a symptom of the output not being understood. You should be able to see the facets for that job. There will be a spark_unknown facet with more information about the problem. If you put that into an issue with some more details about this job we should be able to help.

    @@ -28965,7 +28971,7 @@

    Supress success

    2021-11-05 04:36:30
    -

*Thread Reply:* I'll try to put all the info in a ticket, as it is not working as I would expect

+

*Thread Reply:* I'll try to put all the info in a ticket, as it is not working as I would expect

    @@ -29089,7 +29095,7 @@

    Supress success

    2021-11-04 18:49:31
    -

    *Thread Reply:* Is there an error in the browser javascript console? (example on chrome: View -> Developer -> Javascript console)

    +

    *Thread Reply:* Is there an error in the browser javascript console? (example on chrome: View -> Developer -> Javascript console)

    @@ -29171,7 +29177,7 @@

    Supress success

    2021-11-04 21:35:16
    -

*Thread Reply:* Hi Taleb - I don’t know of a generalized example of lineage tracking with Pandas, but you should be able to accomplish this by sending the runEvents manually to the OpenLineage API in your code:

+

*Thread Reply:* Hi Taleb - I don’t know of a generalized example of lineage tracking with Pandas, but you should be able to accomplish this by sending the runEvents manually to the OpenLineage API in your code:
https://openlineage.io/docs/openapi/
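
A minimal sketch of such a manual call (payload trimmed to the required RunEvent fields; the URL, namespaces, and names are placeholders):

    from datetime import datetime, timezone
    from uuid import uuid4

    import requests

    event = {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid4())},
        "job": {"namespace": "pandas-jobs", "name": "clean_orders"},  # placeholders
        "inputs": [{"namespace": "file", "name": "orders.csv"}],
        "outputs": [{"namespace": "file", "name": "orders_clean.csv"}],
        "producer": "https://example.com/pandas-script",
    }
    requests.post("http://localhost:5000/api/v1/lineage", json=event)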

    @@ -29198,7 +29204,7 @@

    Supress success

    2021-11-04 21:38:25
    -

    *Thread Reply:* Is this a work in progress, that we can investigate? Because I see it in this image https://github.com/OpenLineage/OpenLineage/blob/main/doc/Scope.png

    +

    *Thread Reply:* Is this a work in progress, that we can investigate? Because I see it in this image https://github.com/OpenLineage/OpenLineage/blob/main/doc/Scope.png

    @@ -29252,7 +29258,7 @@

    Supress success

    2021-11-04 21:54:51
    -

    *Thread Reply:* To my knowledge, while there are a few proposals around adding a wrapper on some Pandas methods to output runEvents, it’s not something that’s had work started on it yet

    +

    *Thread Reply:* To my knowledge, while there are a few proposals around adding a wrapper on some Pandas methods to output runEvents, it’s not something that’s had work started on it yet

    @@ -29278,7 +29284,7 @@

    Supress success

    2021-11-04 21:56:26
    -

    *Thread Reply:* I sent some feelers out to get a little more context from folks who are more informed about this than I am, so I’ll get you more info about potential future plans and the considerations around them when I know more

    +

    *Thread Reply:* I sent some feelers out to get a little more context from folks who are more informed about this than I am, so I’ll get you more info about potential future plans and the considerations around them when I know more

    @@ -29304,7 +29310,7 @@

    Supress success

    2021-11-04 23:04:47
    -

    *Thread Reply:* So, Pandas is tricky because unlike Airflow, DBT, or Spark, Pandas doesn’t own the whole flow, and you might dip in and out of it to use other Python Packages (at least I did when I was doing more Data Science).

    +

    *Thread Reply:* So, Pandas is tricky because unlike Airflow, DBT, or Spark, Pandas doesn’t own the whole flow, and you might dip in and out of it to use other Python Packages (at least I did when I was doing more Data Science).

    We have this issue open in OpenLineage that you should go +1 to help with our planning 🙂

    Supress success
    2021-11-05 15:08:09
    -

    *Thread Reply:* interesting... what if it were instead on all the read_** to_** functions?

    +

    *Thread Reply:* interesting... what if it were instead on all the read_** to_** functions?
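
Sketching that idea - wrapping a pandas I/O function so each call reports its input (entirely hypothetical):

    import pandas as pd

    def emit_openlineage_event(inputs=None, outputs=None):
        # stub: a real version would build and POST a runEvent as sketched earlier
        print("OL event:", {"inputs": inputs, "outputs": outputs})

    _read_csv = pd.read_csv

    def read_csv_with_lineage(path, **kwargs):
        emit_openlineage_event(inputs=[str(path)])
        return _read_csv(path, **kwargs)

    pd.read_csv = read_csv_with_lineage  # every read_csv now reports its input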

    @@ -29424,7 +29430,7 @@

    Supress success

    2021-11-05 12:59:49
    -

    *Thread Reply:* The Marquez write APIs are artifacts from before OpenLineage existed, and they’re already slated for deprecation soon.

    +

    *Thread Reply:* The Marquez write APIs are artifacts from before OpenLineage existed, and they’re already slated for deprecation soon.

    If you POST an OpenLineage runEvent to the /lineage endpoint in Marquez, it’ll create any missing jobs or datasets that are relevant.

    @@ -29452,7 +29458,7 @@

    Supress success

    2021-11-05 13:06:06
    -

*Thread Reply:* Thanks for the response. That sounds good. Does this include the query interface e.g.

+

*Thread Reply:* Thanks for the response. That sounds good. Does this include the query interface e.g.
http://localhost:5000/api/v1/namespaces/testing_java/datasets/incremental_data as that currently returns the Marquez version of a dataset including default set fields for type and the above mentioned properties.

    @@ -29480,7 +29486,7 @@

    Supress success

    2021-11-05 17:01:55
    -

    *Thread Reply:* I believe the intention for type is to support a new facet- TBH, it hasn't been the most pressing concern for most users, as most people are only recording tables, not streams. However, there's been some recent work to support Kafka in Spark- maybe it's time to address that deficiency.

    +

    *Thread Reply:* I believe the intention for type is to support a new facet- TBH, it hasn't been the most pressing concern for most users, as most people are only recording tables, not streams. However, there's been some recent work to support Kafka in Spark- maybe it's time to address that deficiency.

    I don't actually know what happened to the datasource type field- maybe @Julien Le Dem can comment on whether that field was dropped intentionally or whether it was an oversight.

    @@ -29508,7 +29514,7 @@

    Supress success

    2021-11-05 18:18:06
    -

*Thread Reply:* It looks like an oversight; currently Marquez hard-codes it to POSTGRESQL: https://github.com/MarquezProject/marquez/blob/734bfd691636cb00212d7d22b1a489bd4870fb04/api/src/main/java/marquez/db/OpenLineageDao.java#L438

+

*Thread Reply:* It looks like an oversight; currently Marquez hard-codes it to POSTGRESQL: https://github.com/MarquezProject/marquez/blob/734bfd691636cb00212d7d22b1a489bd4870fb04/api/src/main/java/marquez/db/OpenLineageDao.java#L438

    @@ -29559,7 +29565,7 @@

    Supress success

    2021-11-05 18:18:25
    -

    *Thread Reply:* https://github.com/MarquezProject/marquez/blob/734bfd691636cb00212d7d22b1a489bd4870fb04/api/src/main/java/marquez/db/OpenLineageDao.java#L438-L440

    +

    *Thread Reply:* https://github.com/MarquezProject/marquez/blob/734bfd691636cb00212d7d22b1a489bd4870fb04/api/src/main/java/marquez/db/OpenLineageDao.java#L438-L440

    @@ -29612,7 +29618,7 @@

    Supress success

    2021-11-05 18:20:25
    -

    *Thread Reply:* The source has a name though: https://github.com/OpenLineage/OpenLineage/blob/8afc4ff88b8dd8090cd9c45061a9f669fea2151e/spec/facets/DatasourceDatasetFacet.json#L12

    +

    *Thread Reply:* The source has a name though: https://github.com/OpenLineage/OpenLineage/blob/8afc4ff88b8dd8090cd9c45061a9f669fea2151e/spec/facets/DatasourceDatasetFacet.json#L12

    @@ -29745,7 +29751,7 @@

    Supress success

    2021-11-05 18:07:57
    -

    *Thread Reply:* If you want to add something please chime in this thread

    +

    *Thread Reply:* If you want to add something please chime in this thread

    @@ -29771,7 +29777,7 @@

    Supress success

    2021-11-05 19:27:44
    2021-11-09 19:47:26
    -

*Thread Reply:* The monthly meeting is happening tomorrow.

+

*Thread Reply:* The monthly meeting is happening tomorrow.
The purview team will present at the December meeting instead.
See the full agenda here: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting
You are welcome to contribute.

    @@ -29826,7 +29832,7 @@

    Supress success

    2021-11-10 11:10:17
    -

    *Thread Reply:* The slides for the meeting later today: https://docs.google.com/presentation/d/1z2NTkkL8hg_2typHRYhcFPyD5az-5-tl/edit#slide=id.ge7d4b64ef4_0_0

    +

    *Thread Reply:* The slides for the meeting later today: https://docs.google.com/presentation/d/1z2NTkkL8hg_2typHRYhcFPyD5az-5-tl/edit#slide=id.ge7d4b64ef4_0_0

    @@ -29852,7 +29858,7 @@

    Supress success

    2021-11-10 12:02:23
    -

    *Thread Reply:* It’s happening now ^

    +

    *Thread Reply:* It’s happening now ^

    @@ -29878,7 +29884,7 @@

    Supress success

    2021-11-16 19:57:23
    -

*Thread Reply:* I have posted the notes and the recording from the last instance of our monthly meeting: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting#MonthlyTSCmeeting-Nov10th2021(9amPT)

+

*Thread Reply:* I have posted the notes and the recording from the last instance of our monthly meeting: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting#MonthlyTSCmeeting-Nov10th2021(9amPT)
I have a few TODOs to follow up on tickets

    @@ -29959,7 +29965,7 @@

    Supress success

    2021-11-09 08:18:58
    -

*Thread Reply:* How do you model a continuous process (not a batch process)? For example a Flume or Spark job that does some real-time processing on data.

+

*Thread Reply:* How do you model a continuous process (not a batch process)? For example a Flume or Spark job that does some real-time processing on data.

Maybe it’s simply a “Job”. But then what is a run?

    @@ -29987,7 +29993,7 @@

    Supress success

    2021-11-09 08:19:44
    -

    *Thread Reply:* How do you model consumers at the end - they can be reports? Data applications, ML model deployments, APIs, GUI consumed by end users ?

    +

    *Thread Reply:* How do you model consumers at the end - they can be reports? Data applications, ML model deployments, APIs, GUI consumed by end users ?

    Have you considered having some examples of different use cases like those?

    @@ -30015,9 +30021,9 @@

    Supress success

    2021-11-09 08:21:43
    -

*Thread Reply:* By definition, Job is a process definition that consumes and produces datasets. It is a many-to-many relation? I’ve been wondering about that. Shouldn’t it be more restrictive?

+

*Thread Reply:* By definition, Job is a process definition that consumes and produces datasets. It is a many-to-many relation? I’ve been wondering about that. Shouldn’t it be more restrictive?
For example, an important use case for lineage is troubleshooting or error notifications (e.g. marking a report or job as temporarily in a bad state if an upstream data integration is broken).
In order to do that you need to be able to traverse the graph to find the original error. So having multiple inputs produce a single output makes sense (e.g. insert into output_1 select ** from x,y group by a,b).
But what are the cases where you’d want to see multiple outputs? You can have a single process produce multiple tables (as in the above example) but they’d always be separate queries. The actual inputs for each output would be different.

But having multiple outputs creates ambiguity: now if x or y is broken, with multiple outputs I do not know which is really impacted?

    @@ -30046,7 +30052,7 @@

    Supress success

    2021-11-09 08:34:01
    -

*Thread Reply:* > How do you model a continuous process (not a batch process)? For example a Flume or Spark job that does some real-time processing on data.

+

*Thread Reply:* > How do you model a continuous process (not a batch process)? For example a Flume or Spark job that does some real-time processing on data.
> 
> Maybe it’s simply a “Job”. But then what is a run?
Every continuous process eventually has an end - for example, you can deploy a new version of your Flink pipeline. The new version would be the next Run for the same Job.

    @@ -30079,7 +30085,7 @@

    Supress success

    2021-11-09 08:43:09
    -

*Thread Reply:* > How do you model consumers at the end - they can be reports? Data applications, ML model deployments, APIs, GUI consumed by end users ?

+

*Thread Reply:* > How do you model consumers at the end - they can be reports? Data applications, ML model deployments, APIs, GUI consumed by end users ?
Our reference implementation is a web application https://marquezproject.github.io/marquez/

    We definitely do not exclude any of the things you're talking about - and it would make a lot of sense to talk more about potential usages.

    @@ -30108,7 +30114,7 @@

    Supress success

    2021-11-09 08:45:47
    -

*Thread Reply:* > By definition, Job is a process definition that consumes and produces datasets. Is it a many-to-many relation? I’ve been wondering about that. Shouldn’t it be more restrictive?

+

*Thread Reply:* > By definition, Job is a process definition that consumes and produces datasets. Is it a many-to-many relation? I’ve been wondering about that. Shouldn’t it be more restrictive?
I think this is too SQL-centric a view 🙂

    Not everything is a query. For example, those Flink streaming jobs can produce side outputs, or even push data to multiple sinks. We need to model those types of jobs too.

    @@ -30139,7 +30145,7 @@

    Supress success

    2021-11-17 12:11:37
    -

*Thread Reply:* > We definitely do not exclude any of the things you’re talking about - and it would make a lot of sense to talk more about potential usages.

+

*Thread Reply:* > We definitely do not exclude any of the things you’re talking about - and it would make a lot of sense to talk more about potential usages.
Yes, I think it would be great if we expanded on potential usages - if the OpenLineage documentation (perhaps) had all kinds of examples for different use cases or case studies: a financial or healthcare industry case study and how someone would do an integration with OpenLineage. It would be easier to understand the concepts and make sure things are modeled consistently.

    @@ -30166,7 +30172,7 @@

    Supress success

    2021-11-17 14:19:19
    -

    *Thread Reply:* > I think this is too SQL-centric view 🙂 +

    *Thread Reply:* > I think this is too SQL-centric view 🙂 > > Not everything is a query. For example, those Flink streaming jobs can produce side outputs, or even push data to multiple sinks. We need to model those types of jobs too. Thanks for answering @Maciej Obuchowski

    @@ -30253,7 +30259,7 @@

    Supress success

    2021-11-23 05:38:11
    -

    *Thread Reply:* @Anthony Ivanov I see what you're trying to model. I think this could be solved by column level lineage though - when we'll have it. OL consumer could look at particular columns and derive which table contained particular error.

    +

    *Thread Reply:* @Anthony Ivanov I see what you're trying to model. I think this could be solved by column level lineage though - when we'll have it. OL consumer could look at particular columns and derive which table contained particular error.

> 2. Within a single flink job and even task: Inventory is written only to S3, UI is written only to DB
Does that actually happen? I understand this in case of a job, but having a single operator write to two different systems seems like bad design. Wouldn't that leave the possibility of breaking exactly-once unless you're going full into two-phase commit?

    @@ -30282,7 +30288,7 @@

    Supress success

    2021-11-23 17:02:36
    -

*Thread Reply:* > Does that actually happen? I understand this in case of a job, but having a single operator write to two different systems seems like bad design

+

*Thread Reply:* > Does that actually happen? I understand this in case of a job, but having a single operator write to two different systems seems like bad design
In a Spark or Flink job it is less likely, now that you mention it. But in a batch job (an Airflow Python or Kubernetes operator, for example) users could do anything, and then they’d need lineage to figure out what is wrong even if what they did is suboptimal 🙂

> I see what you’re trying to model.
@@ -30315,7 +30321,7 @@

    Supress success

    2021-11-24 04:55:41
    -

    *Thread Reply:* @Anthony Ivanov take a look at this: https://github.com/OpenLineage/OpenLineage/issues/148

    +

    *Thread Reply:* @Anthony Ivanov take a look at this: https://github.com/OpenLineage/OpenLineage/issues/148

    @@ -30367,7 +30373,7 @@

    Supress success

    2021-11-10 16:17:10
    -

    *Thread Reply:* We’re adding APIs to delete metadata in Marquez 0.20.0. Here’s the related issue, https://github.com/MarquezProject/marquez/issues/1736

    +

    *Thread Reply:* We’re adding APIs to delete metadata in Marquez 0.20.0. Here’s the related issue, https://github.com/MarquezProject/marquez/issues/1736

    @@ -30417,7 +30423,7 @@

    Supress success

    2021-11-10 16:17:37
    -

*Thread Reply:* Until then, you can connect to the DB directly and drop the rows from both the datasets and jobs tables (I know, not ideal)

+

*Thread Reply:* Until then, you can connect to the DB directly and drop the rows from both the datasets and jobs tables (I know, not ideal)
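
For anyone doing that by hand, a rough sketch of the cleanup, assuming Marquez's default PostgreSQL setup; verify the table and column names against your Marquez version before running anything:

import psycopg2  # assumes the Marquez PostgreSQL instance is reachable with these credentials

conn = psycopg2.connect(host="localhost", dbname="marquez", user="marquez", password="marquez")
with conn, conn.cursor() as cur:
    # Both names below are hypothetical placeholders.
    cur.execute("DELETE FROM datasets WHERE name = %s", ("my_stale_dataset",))
    cur.execute("DELETE FROM jobs WHERE name = %s", ("my_stale_job",))
conn.close()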

    @@ -30443,7 +30449,7 @@

    Supress success

    2021-11-11 05:03:50
    -

    *Thread Reply:* Thanks! I assume deleting information will remain a Marquez only feature rather than becoming part of OpenLineage itself?

    +

    *Thread Reply:* Thanks! I assume deleting information will remain a Marquez only feature rather than becoming part of OpenLineage itself?

    @@ -30469,7 +30475,7 @@

    Supress success

    2021-12-10 14:07:57
    -

    *Thread Reply:* Yes! Delete operations will be an action supported by consumers of OpenLineage events

    +

    *Thread Reply:* Yes! Delete operations will be an action supported by consumers of OpenLineage events

    @@ -30522,7 +30528,7 @@

    Supress success

    2021-11-11 05:14:39
    -

    *Thread Reply:* I've been skimming this page: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md

    +

    *Thread Reply:* I've been skimming this page: https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md

    @@ -30793,7 +30799,7 @@

    Supress success

    2021-11-11 05:46:06
    -

    *Thread Reply:* Yes!

    +

    *Thread Reply:* Yes!

    @@ -30819,7 +30825,7 @@

    Supress success

    2021-11-11 06:17:01
    -

    *Thread Reply:* Excellent, I think I had mistakenly conflated the two originally. This document makes it a little clearer. +

    *Thread Reply:* Excellent, I think I had mistakenly conflated the two originally. This document makes it a little clearer. As an additional question: When viewing a Dataset in Marquez will it cross the job namespace bounds? As in, will I see jobs from different job namespaces?

    @@ -30847,7 +30853,7 @@

    Supress success

    2021-11-11 09:20:14
    -

    *Thread Reply:* In this example I have 1 job namespace and 2 dataset namespaces: +

    *Thread Reply:* In this example I have 1 job namespace and 2 dataset namespaces: sql-runner-dev is the job namespace. I cannot see a graph of my job now. Is this something to do with the namespace names?

    @@ -30884,7 +30890,7 @@

    Supress success

    2021-11-11 09:21:46
    -

    *Thread Reply:* The above document seems to have implied a namespace could be like a connection string for a database

    +

    *Thread Reply:* The above document seems to have implied a namespace could be like a connection string for a database
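
That matches the Naming.md examples. As a sketch, a dataset identified that way would look like this inside an event (the two strings below are the ones from this thread; the shape follows the spec):

dataset = {
    "namespace": "jdbc:h2:mem:sql_tests_like",  # the connection string acts as the dataset namespace
    "name": "HBMOFA.ORDDETP",                   # schema.table within that database
}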

    @@ -30910,7 +30916,7 @@

    Supress success

    2021-11-11 09:22:25
    -

    *Thread Reply:* Wait, it does work? Marquez was being temperamental

    +

    *Thread Reply:* Wait, it does work? Marquez was being temperamental

    @@ -30936,7 +30942,7 @@

    Supress success

    2021-11-11 09:24:01
    -

    *Thread Reply:* Yes, marquez is unable to fetch lineage for either dataset

    +

    *Thread Reply:* Yes, marquez is unable to fetch lineage for either dataset

    @@ -30962,7 +30968,7 @@

    Supress success

    2021-11-11 09:32:19
    -

    *Thread Reply:* Here's what I mean:

    +

    *Thread Reply:* Here's what I mean:

    @@ -30997,7 +31003,7 @@

    Supress success

    2021-11-11 09:59:24
    -

    *Thread Reply:* I think you might have hit this issue: https://github.com/MarquezProject/marquez/issues/1744

    +

    *Thread Reply:* I think you might have hit this issue: https://github.com/MarquezProject/marquez/issues/1744

    @@ -31057,7 +31063,7 @@

    Supress success

    2021-11-11 10:00:29
    -

    *Thread Reply:* or, maybe not? It was released already.

    +

    *Thread Reply:* or, maybe not? It was released already.

    Can you create issue on github with those helpful gifs? @Lyndon Armitage

    @@ -31085,7 +31091,7 @@

    Supress success

    2021-11-11 10:58:25
    -

    *Thread Reply:* I think you are right Maciej

    +

    *Thread Reply:* I think you are right Maciej

    @@ -31111,7 +31117,7 @@

    Supress success

    2021-11-11 10:58:52
    -

*Thread Reply:* Was that patched in 0.19.1?

+

*Thread Reply:* Was that patched in 0.19.1?

    @@ -31137,7 +31143,7 @@

    Supress success

    2021-11-11 11:06:06
    -

    *Thread Reply:* As far as I see yes: https://github.com/MarquezProject/marquez/releases/tag/0.19.1

    +

    *Thread Reply:* As far as I see yes: https://github.com/MarquezProject/marquez/releases/tag/0.19.1

    Haven't tested this myself unfortunately.

    @@ -31165,7 +31171,7 @@

    Supress success

    2021-11-11 11:07:07
    -

*Thread Reply:* Perhaps not. It is URL-encoding them:

+

*Thread Reply:* Perhaps not. It is URL-encoding them: <http://localhost:3000/lineage/dataset/jdbc%3Ah2%3Amem%3Asql_tests_like/HBMOFA.ORDDETP>
But the error seems to be in Marquez getting them.

    @@ -31193,10 +31199,10 @@

    Supress success

    2021-11-11 11:09:23
    -

    *Thread Reply:* This is an example Lineage event JSON I am sending.

    +

    *Thread Reply:* This is an example Lineage event JSON I am sending.

-

+

@@ -31232,7 +31238,7 @@

    Supress success

    2021-11-11 11:11:29
    -

    *Thread Reply:* I did run into another issue with really long names not being supported due to Marquez's DB using a fixed size string for a column, but that is understandable and probably a non-issue (my test code was generating temporary folders with long names).

    +

    *Thread Reply:* I did run into another issue with really long names not being supported due to Marquez's DB using a fixed size string for a column, but that is understandable and probably a non-issue (my test code was generating temporary folders with long names).

    @@ -31258,7 +31264,7 @@

    Supress success

    2021-11-11 11:22:00
    2021-11-11 11:36:01
    -

    *Thread Reply:* @Lyndon Armitage can you create issue on the Marquez repo? https://github.com/MarquezProject/marquez/issues

    +

    *Thread Reply:* @Lyndon Armitage can you create issue on the Marquez repo? https://github.com/MarquezProject/marquez/issues

    @@ -31310,7 +31316,7 @@

    Supress success

    2021-11-11 11:52:36
    -

    *Thread Reply:* https://github.com/MarquezProject/marquez/issues/1761 Is this sufficient?

    +

    *Thread Reply:* https://github.com/MarquezProject/marquez/issues/1761 Is this sufficient?

    @@ -31443,7 +31449,7 @@

    Supress success

    2021-11-11 11:54:41
    -

    *Thread Reply:* Yup, thanks!

    +

    *Thread Reply:* Yup, thanks!

    @@ -31503,7 +31509,7 @@

    Supress success

    2021-11-15 13:04:19
    -

    *Thread Reply:* Hi Francis, for the event is it creating a new table with new data in glue / adding new data to an existing one or is it simply reformatting an existing table or making an empty one?

    +

    *Thread Reply:* Hi Francis, for the event is it creating a new table with new data in glue / adding new data to an existing one or is it simply reformatting an existing table or making an empty one?

    @@ -31529,7 +31535,7 @@

    Supress success

    2021-11-15 13:35:00
    -

    *Thread Reply:* The table does not exist in the Glue catalog until …

    +

    *Thread Reply:* The table does not exist in the Glue catalog until …

    A Glue crawler connects to one or more data stores (in this case S3), determines the data structures, and writes tables into the Data Catalog.

    @@ -31559,7 +31565,7 @@

    Supress success

    2021-11-15 13:41:14
    -

    *Thread Reply:* Hmm, interesting, so the lineage of interest here would be of the metadata flow not of the data itself?

    +

    *Thread Reply:* Hmm, interesting, so the lineage of interest here would be of the metadata flow not of the data itself?

    In that case I’d say that the glue Crawler is a job that outputs a dataset.

    @@ -31587,7 +31593,7 @@

    Supress success

    2021-11-15 15:03:36
    -

    *Thread Reply:* The crawler is a job that discovers a dataset. It doesn't create it. If you're posting lineage yourself, I'd post it as an input event, not an output. The thing that actually wrote the data - generated the records and stored them in S3 - is the thing that would be outputting the dataset

    +

    *Thread Reply:* The crawler is a job that discovers a dataset. It doesn't create it. If you're posting lineage yourself, I'd post it as an input event, not an output. The thing that actually wrote the data - generated the records and stored them in S3 - is the thing that would be outputting the dataset
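
A sketch of what that input-only runEvent could look like for the crawler; the endpoint and all names are hypothetical placeholders:

import uuid
import requests
from datetime import datetime, timezone

event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "aws_glue", "name": "nyc_taxi_crawler"},          # hypothetical names
    "inputs": [{"namespace": "s3://my-bucket", "name": "/nyc-taxi/raw"}],  # the dataset it discovered
    "outputs": [],  # the crawler discovered the data but did not produce it
    "producer": "https://example.com/my-crawler-wrapper",                  # placeholder
}
requests.post("http://localhost:5000/api/v1/lineage", json=event)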

    @@ -31613,7 +31619,7 @@

    Supress success

    2021-11-15 15:23:23
    -

    *Thread Reply:* @Michael Collado I agree the crawler discovers the S3 dataset. It also creates an event which creates/updates the HIVE/Glue table.

    +

    *Thread Reply:* @Michael Collado I agree the crawler discovers the S3 dataset. It also creates an event which creates/updates the HIVE/Glue table.

If the Glue table isn’t a distinct dataset from the S3 data, how does this compare to a view in a database on top of a table? Are they 2 datasets or just one?

    @@ -31643,7 +31649,7 @@

    Supress success

    2021-11-15 15:24:39
    -

    *Thread Reply:* @John Thomas yes, its the metadata flow.

    +

    *Thread Reply:* @John Thomas yes, its the metadata flow.

    @@ -31669,7 +31675,7 @@

    Supress success

    2021-11-15 15:24:52
    -

    *Thread Reply:* that's how the Spark integration currently treats Hive datasets- I'd like to add a facet to attach that indicates that it is being read as a Hive table, and include all the appropriate metadata, but it uses the dataset's location in S3 as the canonical dataset identifier

    +

    *Thread Reply:* that's how the Spark integration currently treats Hive datasets- I'd like to add a facet to attach that indicates that it is being read as a Hive table, and include all the appropriate metadata, but it uses the dataset's location in S3 as the canonical dataset identifier

    @@ -31695,7 +31701,7 @@

    Supress success

    2021-11-15 15:29:22
    -

    *Thread Reply:* @Francis McGregor-Macdonald I think the way to represent this is predicated on what you’re looking to accomplish by sending a runEvent for the Glue crawler. What are your broader objectives in adding this?

    +

    *Thread Reply:* @Francis McGregor-Macdonald I think the way to represent this is predicated on what you’re looking to accomplish by sending a runEvent for the Glue crawler. What are your broader objectives in adding this?

    @@ -31721,7 +31727,7 @@

    Supress success

    2021-11-15 15:50:37
    -

    *Thread Reply:* I am working through AWS native services seeing how they could, can, or do best integrate with openlineage (I’m an AWS SA). Hence the questions on best practice.

    +

    *Thread Reply:* I am working through AWS native services seeing how they could, can, or do best integrate with openlineage (I’m an AWS SA). Hence the questions on best practice.

    Aligning with the Spark integration sounds like it might make sense then. Is there an example I could build from?

    @@ -31749,7 +31755,7 @@

    Supress success

    2021-11-15 17:56:17
    -

    *Thread Reply:* an example of reporting lineage? you can look at the Spark integration here - https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/

    +

    *Thread Reply:* an example of reporting lineage? you can look at the Spark integration here - https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/

    @@ -31775,7 +31781,7 @@

    Supress success

    2021-11-15 17:59:14
    -

    *Thread Reply:* Ahh, in that case I would have to agree with Michael’s approach to things!

    +

    *Thread Reply:* Ahh, in that case I would have to agree with Michael’s approach to things!

    @@ -31805,7 +31811,7 @@

    Supress success

    2021-11-19 03:30:03
    -

*Thread Reply:* @Michael Collado I am following the Spark integration you recommended (for a Glue job) and while everything appears to be set up correctly, I am getting no lineage appearing in Marquez (a request.get from the pyspark script can reach the endpoint). Is there a way to enable a debug log so I can look to identify where the issue is?

+

*Thread Reply:* @Michael Collado I am following the Spark integration you recommended (for a Glue job) and while everything appears to be set up correctly, I am getting no lineage appearing in Marquez (a request.get from the pyspark script can reach the endpoint). Is there a way to enable a debug log so I can look to identify where the issue is? Is there a specific place to look in the regular logs?

    @@ -31832,7 +31838,7 @@

    Supress success

    2021-11-19 13:39:01
    -

    *Thread Reply:* listener output should be present in the driver logs. you can turn on debug logging in your log4j config (or whatever logging tool you use) for the package io.openlineage.spark.agent

    +

    *Thread Reply:* listener output should be present in the driver logs. you can turn on debug logging in your log4j config (or whatever logging tool you use) for the package io.openlineage.spark.agent
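
If editing log4j.properties is awkward (as it can be inside Glue), one possible workaround is flipping the logger level at runtime through PySpark's JVM gateway; this relies on PySpark internals and a log4j 1.x runtime, so treat it strictly as a sketch:

# Assumes an active SparkSession named `spark` and log4j 1.x (the Spark 2/3 default).
log4j = spark.sparkContext._jvm.org.apache.log4j
log4j.LogManager.getLogger("io.openlineage.spark.agent").setLevel(log4j.Level.DEBUG)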

    @@ -31978,7 +31984,7 @@

    Supress success

    2021-11-22 13:34:15
    -

    *Thread Reply:* Output event:

    +

    *Thread Reply:* Output event:

2021-11-22 08:12:15,513 INFO [spark-listener-group-shared] agent.OpenLineageContext (OpenLineageContext.java:emit(50)): Lineage completed successfully: ResponseMessage(responseCode=201, body=, error=null)
{ “eventType”: “COMPLETE”,

@@ -32033,7 +32039,7 @@

    Supress success

    2021-11-22 13:34:59
    -

    *Thread Reply:* This sink record is missing details …

    +

    *Thread Reply:* This sink record is missing details …

    2021-11-22 08:12:15,481 INFO [Thread-7] sinks.HadoopDataSink (HadoopDataSink.scala:$anonfun$writeDynamicFrame$1(275)): nameSpace: , table:

    @@ -32061,7 +32067,7 @@

    Supress success

    2021-11-22 13:40:30
    -

    *Thread Reply:* I can also see multiple history events (presumably for each transform, each as above) emitted for the same Glue Job, with different RunId, with the same inputs and the same (null) output.

    +

    *Thread Reply:* I can also see multiple history events (presumably for each transform, each as above) emitted for the same Glue Job, with different RunId, with the same inputs and the same (null) output.

    @@ -32087,7 +32093,7 @@

    Supress success

    2021-11-22 14:31:06
    -

    *Thread Reply:* Are you using the existing spark integration for the spark lineage?

    +

    *Thread Reply:* Are you using the existing spark integration for the spark lineage?

    @@ -32113,7 +32119,7 @@

    Supress success

    2021-11-22 14:46:47
    -

    *Thread Reply:* I followed: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark +

    *Thread Reply:* I followed: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark In the Glue context I was not clear on the correct settings for “spark.openlineage.parentJobName” and “spark.openlineage.parentRunId”, I put in static values (which may be incorrect)? I injected these via: "--conf": "spark.openlineage.parentJobName=nyc-taxi-raw-stage",
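
For comparison, roughly how those settings look when building the session directly; every value below (endpoint, namespace, parent job name, and the parent run id, which is expected to be a UUID from the parent run) is a placeholder:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("nyc_taxi_raw_stage")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.3.1")  # version assumed
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.host", "http://ec2-xx-xx.compute-1.amazonaws.com:5000")  # placeholder
    .config("spark.openlineage.version", "v1")
    .config("spark.openlineage.namespace", "spark_integration")
    .config("spark.openlineage.parentJobName", "nyc-taxi-raw-stage")                  # placeholder
    .config("spark.openlineage.parentRunId", "ea445b5c-22eb-457a-8007-01c7c52b6e54")  # placeholder UUID
    .getOrCreate()
)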

    @@ -32141,7 +32147,7 @@

    Supress success

    2021-11-22 14:47:54
    -

    *Thread Reply:* Happy to share what is working when I am done, I can’t seem to find an AWS Glue specific example to walk me through.

    +

    *Thread Reply:* Happy to share what is working when I am done, I can’t seem to find an AWS Glue specific example to walk me through.

    @@ -32167,7 +32173,7 @@

    Supress success

    2021-11-22 15:03:31
    -

    *Thread Reply:* yeah, We haven’t spent any significant time with AWS Glue, but we just released the Databricks integration, which might help guide the way you’re working a little bit more

    +

    *Thread Reply:* yeah, We haven’t spent any significant time with AWS Glue, but we just released the Databricks integration, which might help guide the way you’re working a little bit more

    @@ -32193,7 +32199,7 @@

    Supress success

    2021-11-22 15:12:15
    -

    *Thread Reply:* from what I can see in the DBX integration (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/databricks) all of what is being done here I am doing in Glue (upload the jar, embed the settings into the Glue spark job). +

    *Thread Reply:* from what I can see in the DBX integration (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/databricks) all of what is being done here I am doing in Glue (upload the jar, embed the settings into the Glue spark job). It is emitting the above for each transform in the Glue job, but does not seem to capture the output …

    @@ -32220,7 +32226,7 @@

    Supress success

    2021-11-22 15:13:54
    -

    *Thread Reply:* Is there a standard Spark test script in use with openlineage I could put into Glue to test without using any Glue specific functionality (without for example the GlueContext, or Glue dynamic frames)?

    +

    *Thread Reply:* Is there a standard Spark test script in use with openlineage I could put into Glue to test without using any Glue specific functionality (without for example the GlueContext, or Glue dynamic frames)?

    @@ -32246,7 +32252,7 @@

    Supress success

    2021-11-22 15:25:30
    -

*Thread Reply:* The initialisation does appear to be working if I compare it to the DBX README

+

*Thread Reply:* The initialisation does appear to be working if I compare it to the DBX README
Mine from AWS Glue…
21/11/22 18:48:48 INFO SparkContext: Registered listener io.openlineage.spark.agent.OpenLineageSparkListener
21/11/22 18:48:49 INFO OpenLineageContext: Init OpenLineageContext: Args: ArgumentParser(host=<http://ec2-….compute-1.amazonaws.com:5000>, version=v1, namespace=spark_integration, jobName=default, parentRunId=null, apiKey=Optional.empty) URI: <http://ec2-….compute-1.amazonaws.com:5000/api/v1/lineage>

@@ -32276,7 +32282,7 @@

    Supress success

    2021-11-22 16:12:40
    -

    *Thread Reply:* We don’t have a test run, unfortunately, but you could follow this blog post’s processes in each and see what the differences are? https://openlineage.io/blog/openlineage-spark/

    +

    *Thread Reply:* We don’t have a test run, unfortunately, but you could follow this blog post’s processes in each and see what the differences are? https://openlineage.io/blog/openlineage-spark/

    openlineage.io
    @@ -32323,7 +32329,7 @@

    Supress success

    2021-11-22 16:43:23
    -

    *Thread Reply:* Thanks, I have been looking at that. I will create a Glue job aligned with that. What is the best way to pass feedback? Keep it here?

    +

    *Thread Reply:* Thanks, I have been looking at that. I will create a Glue job aligned with that. What is the best way to pass feedback? Keep it here?

    @@ -32349,7 +32355,7 @@

    Supress success

    2021-11-22 16:49:50
    -

    *Thread Reply:* yeah, this thread will work great 🙂

    +

    *Thread Reply:* yeah, this thread will work great 🙂

    @@ -32375,7 +32381,7 @@

    Supress success

    2022-07-18 11:37:02
    -

*Thread Reply:* @Francis McGregor-Macdonald did you manage to enable it?

+

*Thread Reply:* @Francis McGregor-Macdonald did you manage to enable it?

    @@ -32401,7 +32407,7 @@

    Supress success

    2022-07-18 15:14:47
    -

*Thread Reply:* Just DM’d you the code I used a while back (app.py + CDK code). I haven’t used it in a while, and there is some duplication in it. I had openlineage enabled, but dynamic frames were not working yet with lineage. Let me know how you go.

+

*Thread Reply:* Just DM’d you the code I used a while back (app.py + CDK code). I haven’t used it in a while, and there is some duplication in it. I had openlineage enabled, but dynamic frames were not working yet with lineage. Let me know how you go.
I haven’t had the space to look at it in a while, but happy to support if you are looking at it.

    @@ -32454,7 +32460,7 @@

    Supress success

    2021-11-23 09:01:11
    -

    *Thread Reply:* You can use this: https://github.com/amundsen-io/amundsen/pull/1444

    +

    *Thread Reply:* You can use this: https://github.com/amundsen-io/amundsen/pull/1444

    @@ -32532,7 +32538,7 @@

    Supress success

    2021-11-23 09:38:44
    -

    *Thread Reply:* you can also check out this section from the Amundsen Community Meeting in october: https://www.youtube.com/watch?v=7WgECcmLSRk

    +

    *Thread Reply:* you can also check out this section from the Amundsen Community Meeting in october: https://www.youtube.com/watch?v=7WgECcmLSRk

    YouTube
    @@ -32618,7 +32624,7 @@

    Supress success

    2021-11-23 12:45:34
    -

    *Thread Reply:* No, I believe the databuilder OpenLineage extractor for Amundsen will continue to store lineage metadata in Atlas

    +

    *Thread Reply:* No, I believe the databuilder OpenLineage extractor for Amundsen will continue to store lineage metadata in Atlas

    @@ -32644,7 +32650,7 @@

    Supress success

    2021-11-23 12:47:01
    -

    *Thread Reply:* We've spoken to the Amundsen team, and though using Marquez to store lineage metadata isn't an option, it's an integration that makes sense but hasn't yet been prioritized

    +

    *Thread Reply:* We've spoken to the Amundsen team, and though using Marquez to store lineage metadata isn't an option, it's an integration that makes sense but hasn't yet been prioritized

    @@ -32670,7 +32676,7 @@

    Supress success

    2021-11-23 13:51:00
    -

*Thread Reply:* Thanks. Right now Amundsen has no support for lineage extraction from Spark or Airflow. In this case, do we need to use Marquez for the OpenLineage implementation to capture the lineage from Airflow & Spark?

+

*Thread Reply:* Thanks. Right now Amundsen has no support for lineage extraction from Spark or Airflow. In this case, do we need to use Marquez for the OpenLineage implementation to capture the lineage from Airflow & Spark?

    @@ -32696,7 +32702,7 @@

    Supress success

    2021-11-23 13:57:13
    -

*Thread Reply:* Maybe, that would mean running the full Amundsen stack as well as the Marquez stack alongside each other (not ideal). The OpenLineage integration for Amundsen is very recent, so we haven't had a chance to look deeply into the implementation. But, briefly looking over the config for OpenLineageTableLineageExtractor, you can only send metadata to Atlas

+

*Thread Reply:* Maybe, that would mean running the full Amundsen stack as well as the Marquez stack alongside each other (not ideal). The OpenLineage integration for Amundsen is very recent, so we haven't had a chance to look deeply into the implementation. But, briefly looking over the config for OpenLineageTableLineageExtractor, you can only send metadata to Atlas

    @@ -32722,7 +32728,7 @@

    Supress success

    2021-11-24 00:36:56
    -

*Thread Reply:* @Willy Lulciuc that’s our real concern - running the two stacks will make a messy environment. Let me explain our Amundsen setup: we have neo4j as the backend (front end, search service, metadata service, elasticsearch & neo4j). Our requirement is to capture lineage from Spark and Airflow, imported into Amundsen.

+

*Thread Reply:* @Willy Lulciuc that’s our real concern - running the two stacks will make a messy environment. Let me explain our Amundsen setup: we have neo4j as the backend (front end, search service, metadata service, elasticsearch & neo4j). Our requirement is to capture lineage from Spark and Airflow, imported into Amundsen.

    @@ -32748,7 +32754,7 @@

    Supress success

    2022-03-11 22:33:39
    -

    *Thread Reply:* We are running into a similar issue. @Dinakar Sundar were you able to get the Amundsen OpenLineage integration to work with a neo4j backend?

    +

    *Thread Reply:* We are running into a similar issue. @Dinakar Sundar were you able to get the Amundsen OpenLineage integration to work with a neo4j backend?

    @@ -32800,7 +32806,7 @@

    Supress success

    2021-11-24 11:49:14
    -

    *Thread Reply:* You can take a look at https://github.com/OpenLineage/OpenLineage/projects

    +

    *Thread Reply:* You can take a look at https://github.com/OpenLineage/OpenLineage/projects

    @@ -32888,7 +32894,7 @@

    Supress success

    2021-12-01 16:35:17
    -

    *Thread Reply:* Anyone? Thanks!

    +

    *Thread Reply:* Anyone? Thanks!

    @@ -33022,7 +33028,7 @@

    Supress success

    2021-11-30 07:03:30
    2021-11-30 11:21:34
    -

    *Thread Reply:* I have not! Will give it a try, Maciej! Thank you for the reply!

    +

    *Thread Reply:* I have not! Will give it a try, Maciej! Thank you for the reply!

    @@ -33078,7 +33084,7 @@

    Supress success

    2021-11-30 15:20:18
    -

    *Thread Reply:* 🙏 @Maciej Obuchowski we're not worthy! That was the magic we needed. Seems like a hack since we're snooping in on private classes but if it works...

    +

    *Thread Reply:* 🙏 @Maciej Obuchowski we're not worthy! That was the magic we needed. Seems like a hack since we're snooping in on private classes but if it works...

    Thank you so much for pointing to those utilities!

    @@ -33110,7 +33116,7 @@

    Supress success

    2021-11-30 15:48:25
    -

    *Thread Reply:* Glad I could help!

    +

    *Thread Reply:* Glad I could help!

    @@ -33162,7 +33168,7 @@

    Supress success

    2021-12-01 08:51:28
    -

    *Thread Reply:* Different concepts. OL is focused on describing the lineage and metadata of the running jobs. So it keeps track of all the metadata (schema, ...) of inputs and outputs at the time transformation occurs + transformation metadata (code version, cost, etc.)

    +

    *Thread Reply:* Different concepts. OL is focused on describing the lineage and metadata of the running jobs. So it keeps track of all the metadata (schema, ...) of inputs and outputs at the time transformation occurs + transformation metadata (code version, cost, etc.)

OM I am not an expert on, but it's a metadata model with clients and an API around it.

    @@ -33216,7 +33222,7 @@

    Supress success

    2021-12-01 12:40:53
    -

*Thread Reply:* Hey. For technical reasons, we can't automatically register a macro that does this job, as we could in the Airflow 1 integration. You could add it yourself:

+

*Thread Reply:* Hey. For technical reasons, we can't automatically register a macro that does this job, as we could in the Airflow 1 integration. You could add it yourself:

    @@ -33242,7 +33248,7 @@

    Supress success

    2021-12-01 12:41:02
    -

*Thread Reply:* def lineage_parent_id(run_id, task):

+

*Thread Reply:* def lineage_parent_id(run_id, task):
    """
    Macro function which returns the generated job and run id for a given task. This
    can be used to forward the ids from a task to a child run so the job

@@ -33300,7 +33306,7 @@

    Supress success

    2021-12-01 12:41:13
    -

    *Thread Reply:* from here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/dag.py#L77

    +

    *Thread Reply:* from here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/dag.py#L77
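
A sketch of wiring it up, assuming lineage_parent_id is importable from the linked module; the DAG, task, and child-script names are hypothetical, and how the child consumes the id (e.g. as spark.openlineage.parentRunId) depends on the integration:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from openlineage.airflow.dag import lineage_parent_id  # path from the link above; adjust if needed

with DAG(
    "parent_dag",  # hypothetical
    start_date=datetime(2021, 12, 1),
    schedule_interval=None,
    user_defined_macros={"lineage_parent_id": lineage_parent_id},
) as dag:
    trigger_child = BashOperator(
        task_id="trigger_child_job",
        bash_command='python run_child.py --parent "$OPENLINEAGE_PARENT"',  # hypothetical child job
        env={"OPENLINEAGE_PARENT": "{{ lineage_parent_id(run_id, task) }}"},
    )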

    @@ -33351,7 +33357,7 @@

    Supress success

    2021-12-01 12:53:27
    -

    *Thread Reply:* the quickest response ever! And that works like a charm 🙌

    +

    *Thread Reply:* the quickest response ever! And that works like a charm 🙌

    @@ -33381,7 +33387,7 @@

    Supress success

    2021-12-01 13:21:16
    -

    *Thread Reply:* Glad I could help!

    +

    *Thread Reply:* Glad I could help!

    @@ -33433,7 +33439,7 @@

    Supress success

    2021-12-01 14:18:00
    -

*Thread Reply:* For SQL, EXPLAIN EXTENDED and show() in the scala-shell are helpful:

+

*Thread Reply:* For SQL, EXPLAIN EXTENDED and show() in the scala-shell are helpful:

spark.sql("EXPLAIN EXTENDED CREATE TABLE tbl USING delta LOCATION '/tmp/delta' AS SELECT * FROM tmp").show(false)
== Parsed Logical Plan ==

@@ -33481,7 +33487,7 @@

    Supress success

    2021-12-01 14:27:25
    -

*Thread Reply:* For the dataframe API, I'm usually either logging the plan to console from the OpenLineage listener, or looking at the spark.logicalPlan or spark_unknown facets sent by the listener - even when the particular write operation isn't supported by the integration, those facets should have some relevant info.

+

*Thread Reply:* For the dataframe API, I'm usually either logging the plan to console from the OpenLineage listener, or looking at the spark.logicalPlan or spark_unknown facets sent by the listener - even when the particular write operation isn't supported by the integration, those facets should have some relevant info.

    @@ -33511,7 +33517,7 @@

    Supress success

    2021-12-01 14:27:40
    -

*Thread Reply:* For example, for the query I've sent in the comment above, the spark.logicalPlan facet looks like this:

+

*Thread Reply:* For example, for the query I've sent in the comment above, the spark.logicalPlan facet looks like this:

"spark.logicalPlan": { "_producer": "<https://github.com/OpenLineage/OpenLineage/tree/0.4.0-SNAPSHOT/integration/spark>",

@@ -33601,7 +33607,7 @@

    Supress success

    2021-12-01 14:38:55
    -

    *Thread Reply:* Okay! That is very helpful! I wasn't sure if there was a fancier trick but I can definitely do logging 🙂 Our challenge was that our proprietary packages were resulting in Null Pointer Exceptions when it tried to push to OpenLineage 😞

    +

    *Thread Reply:* Okay! That is very helpful! I wasn't sure if there was a fancier trick but I can definitely do logging 🙂 Our challenge was that our proprietary packages were resulting in Null Pointer Exceptions when it tried to push to OpenLineage 😞

    @@ -33627,7 +33633,7 @@

    Supress success

    2021-12-01 14:39:02
    -

    *Thread Reply:* Thank you as usual!!

    +

    *Thread Reply:* Thank you as usual!!

    @@ -33653,7 +33659,7 @@

    Supress success

    2021-12-01 14:40:25
    -

    *Thread Reply:* You can always add test cases and add breakpoints to debug in your IDE. That doesn't work for the container tests, but it does work for the other ones

    +

    *Thread Reply:* You can always add test cases and add breakpoints to debug in your IDE. That doesn't work for the container tests, but it does work for the other ones

    @@ -33679,7 +33685,7 @@

    Supress success

    2021-12-01 14:47:20
    -

    *Thread Reply:* Ah! That's a great point! I definitely would appreciate being able to poke at the objects interactively in a debug mode. Thank you for the guidance as well!

    +

    *Thread Reply:* Ah! That's a great point! I definitely would appreciate being able to poke at the objects interactively in a debug mode. Thank you for the guidance as well!

    @@ -33897,7 +33903,7 @@

    Supress success

    2021-12-03 11:55:17
    -

    *Thread Reply:* Hi Ricardo - OpenLineage doesn’t currently have support for field-level lineage, but it’s definitely something we’ve been looking into. This is a great collection of resources 🙂

    +

    *Thread Reply:* Hi Ricardo - OpenLineage doesn’t currently have support for field-level lineage, but it’s definitely something we’ve been looking into. This is a great collection of resources 🙂

    We’ve to-date been working on our integrations library, making it as easy to set up as possible.

    @@ -33925,7 +33931,7 @@

    Supress success

    2021-12-03 12:01:25
    -

    *Thread Reply:* Thanks John! I was checking the issues on github and other posts here. Just wanted to clarify that. +

    *Thread Reply:* Thanks John! I was checking the issues on github and other posts here. Just wanted to clarify that. I’ll keep an eye on it

    @@ -33985,7 +33991,7 @@

    Supress success

    2021-12-06 20:28:09
    -

    *Thread Reply:* The link to join the meeting is on the wiki: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

    +

    *Thread Reply:* The link to join the meeting is on the wiki: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

    @@ -34011,7 +34017,7 @@

    Supress success

    2021-12-06 20:28:25
    -

    *Thread Reply:* Please reach out to me if you’d like to be added to a gcal invite

    +

    *Thread Reply:* Please reach out to me if you’d like to be added to a gcal invite

    @@ -34063,7 +34069,7 @@

    Supress success

    2021-12-08 02:03:37
    -

    *Thread Reply:* Hi Dinakar. Can you give some specifics regarding what kind of problem you're running into?

    +

    *Thread Reply:* Hi Dinakar. Can you give some specifics regarding what kind of problem you're running into?

    @@ -34089,7 +34095,7 @@

    Supress success

    2021-12-09 10:15:50
    -

*Thread Reply:* Hi @Michael Collado, we were able to set the Spark configuration for the Spark extra listener & placed the jars as well. When I ran the Spark job, lineage did not get tracked into Marquez

+

*Thread Reply:* Hi @Michael Collado, we were able to set the Spark configuration for the Spark extra listener & placed the jars as well. When I ran the Spark job, lineage did not get tracked into Marquez

    @@ -34115,7 +34121,7 @@

    Supress success

    2021-12-09 10:34:39
    -

    *Thread Reply:* {"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark","schemaURL":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark/facets/spark/v1/output-statistics-facet.json","rowCount":0,"size":-1,"status":"DEPRECATED"}},"outputFacets":{"outputStatistics":{"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark","schemaURL":"https://openlineage.io/spec/facets/1-0-0/OutputStatisticsOutputDatasetFacet.json#/$defs/OutputStatisticsOutputDatasetFacet","rowCount":0,"size":-1}}}],"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark","schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunEvent"} +

    *Thread Reply:* {"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark","schemaURL":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark/facets/spark/v1/output-statistics-facet.json","rowCount":0,"size":-1,"status":"DEPRECATED"}},"outputFacets":{"outputStatistics":{"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark","schemaURL":"https://openlineage.io/spec/facets/1-0-0/OutputStatisticsOutputDatasetFacet.json#/$defs/OutputStatisticsOutputDatasetFacet","rowCount":0,"size":-1}}}],"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.3.1/integration/spark","schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunEvent"} OpenLineageHttpException(code=0, message=java.lang.IllegalArgumentException: Cannot construct instance of io.openlineage.spark.agent.client.HttpError (although at least one Creator exists): no String-argument constructor/factory method to deserialize from String value ('{"code":404,"message":"HTTP 404 Not Found"}') at [Source: UNKNOWN; line: -1, column: -1], details=java.util.concurrent.CompletionException: java.lang.IllegalArgumentException: Cannot construct instance of io.openlineage.spark.agent.client.HttpError (although at least one Creator exists): no String-argument constructor/factory method to deserialize from String value ('{"code":404,"message":"HTTP 404 Not Found"}') at [Source: UNKNOWN; line: -1, column: -1]) @@ -34163,7 +34169,7 @@

    Supress success

    2021-12-09 13:29:42
    -

*Thread Reply:* Issue solved - I had mentioned the version wrongly as 1 instead of v1

+

*Thread Reply:* Issue solved - I had mentioned the version wrongly as 1 instead of v1

    @@ -34249,7 +34255,7 @@

    Supress success

    2021-12-08 12:15:38
    -

    *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/400/files

    +

    *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/400/files

there’s an ongoing PR for the proxy backend, which opens an HTTP API and redirects events to Kafka.

    @@ -34277,7 +34283,7 @@

    Supress success

    2021-12-08 12:17:38
    -

    *Thread Reply:* Hi Kavuri, as minkyu said, there's currently work going on to simplify this process.

    +

    *Thread Reply:* Hi Kavuri, as minkyu said, there's currently work going on to simplify this process.

    For now, you'll need to make something to capture the HTTP api events and send them to the Kafka topic. Changing the spark.openlineage.url parameter will send the runEvents wherever you like, but obviously you can't directly produce HTTP events to a topic

    @@ -34305,7 +34311,7 @@

    Supress success

    2021-12-08 22:13:09
    -

*Thread Reply:* Many thanks for the reply. As I understand it, pushing lineage to a Kafka topic is not there yet; it is under implementation. If you can help me understand which version it is going to be present in, that will help me a lot. Thanks in advance.

+

*Thread Reply:* Many thanks for the reply. As I understand it, pushing lineage to a Kafka topic is not there yet; it is under implementation. If you can help me understand which version it is going to be present in, that will help me a lot. Thanks in advance.

    @@ -34331,7 +34337,7 @@

    Supress success

    2021-12-09 12:57:10
    -

*Thread Reply:* Not sure about the release plan, but the http endpoint is just a regular RESTful API, and you will be able to write a super simple proxy for your own use case if you want.

+

*Thread Reply:* Not sure about the release plan, but the http endpoint is just a regular RESTful API, and you will be able to write a super simple proxy for your own use case if you want.
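
A minimal sketch of such a proxy, assuming Flask and kafka-python are installed, a broker on localhost:9092, and a hypothetical topic name:

from flask import Flask, request
from kafka import KafkaProducer

app = Flask(__name__)
producer = KafkaProducer(bootstrap_servers="localhost:9092")

@app.route("/api/v1/lineage", methods=["POST"])
def lineage():
    # Forward the raw OpenLineage runEvent body to Kafka unchanged.
    producer.send("openlineage-events", request.get_data())  # topic name is hypothetical
    return "", 201

if __name__ == "__main__":
    app.run(port=5000)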

    @@ -34456,7 +34462,7 @@

    Supress success

    2021-12-12 07:34:13
    -

*Thread Reply:* One thing: you're looking at Ryan's fork of Spark, which is a few thousand commits behind head 🙂

+

*Thread Reply:* One thing: you're looking at Ryan's fork of Spark, which is a few thousand commits behind head 🙂

    This one should be good: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala#L72

    @@ -34495,7 +34501,7 @@

    Supress success

    2021-12-12 07:37:04
    -

    *Thread Reply:* In this case, we basically need more info about nature of this DataSourceV2Relation, because this is provider-dependent. We have Iceberg in main branch and Delta here: https://github.com/OpenLineage/OpenLineage/pull/393/files#diff-7b66a9bd5905f4ba42914b73a87d834c1321ebcf75137c1e2a2413c0d85d9db6

    +

    *Thread Reply:* In this case, we basically need more info about nature of this DataSourceV2Relation, because this is provider-dependent. We have Iceberg in main branch and Delta here: https://github.com/OpenLineage/OpenLineage/pull/393/files#diff-7b66a9bd5905f4ba42914b73a87d834c1321ebcf75137c1e2a2413c0d85d9db6

    @@ -34521,7 +34527,7 @@

    Supress success

    2021-12-13 14:54:13
    -

    *Thread Reply:* Ah! Maciej! As always, thank you! Looking through the DataSourceV2RelationVisitor you provided, it looks like the connector (Azure Cosmos Db) doesn't provide that Provider property 😞 😞 😞

    +

    *Thread Reply:* Ah! Maciej! As always, thank you! Looking through the DataSourceV2RelationVisitor you provided, it looks like the connector (Azure Cosmos Db) doesn't provide that Provider property 😞 😞 😞

    Is there any other method for determining the type of DataSourceV2Relation?

    @@ -34549,7 +34555,7 @@

    Supress success

    2021-12-13 14:57:06
    -

    *Thread Reply:* And, to make sure I close out on my original question, it was as simple as the code that Maciej was using:

    +

    *Thread Reply:* And, to make sure I close out on my original question, it was as simple as the code that Maciej was using:

I merely needed to use DataSourceV2Relation rather than NamedRelation!

    @@ -34581,7 +34587,7 @@

    Supress success

    2021-12-15 06:20:31
    -

    *Thread Reply:* Are we talking about this connector? https://github.com/Azure/azure-sdk-for-java/blob/934200f63dc5bc7d5502a95f8daeb8142[…]/src/main/scala/com/azure/cosmos/spark/ItemsReadOnlyTable.scala

    +

    *Thread Reply:* Are we talking about this connector? https://github.com/Azure/azure-sdk-for-java/blob/934200f63dc5bc7d5502a95f8daeb8142[…]/src/main/scala/com/azure/cosmos/spark/ItemsReadOnlyTable.scala

    @@ -34632,7 +34638,7 @@

    Supress success

    2021-12-15 06:22:05
    -

    *Thread Reply:* I guess you can use object.getClass.getCanonicalName() to find if the passed class matches the one that Cosmos provider uses.

    +

    *Thread Reply:* I guess you can use object.getClass.getCanonicalName() to find if the passed class matches the one that Cosmos provider uses.

    @@ -34658,7 +34664,7 @@

    Supress success

    2021-12-15 09:53:24
    -

    *Thread Reply:* Yes! That's the one, Maciej! I will give getCanonicalName a try but also make a PR into that repo to get the provider property set up correctly 🙂

    +

    *Thread Reply:* Yes! That's the one, Maciej! I will give getCanonicalName a try but also make a PR into that repo to get the provider property set up correctly 🙂

    @@ -34684,7 +34690,7 @@

    Supress success

    2021-12-15 09:53:28
    -

    *Thread Reply:* Thank you so much!

    +

    *Thread Reply:* Thank you so much!

    @@ -34710,7 +34716,7 @@

    Supress success

    2021-12-15 10:09:39
    -

    *Thread Reply:* Glad to help 😄

    +

    *Thread Reply:* Glad to help 😄

    @@ -34736,7 +34742,7 @@

    Supress success

    2021-12-15 10:22:58
    -

    *Thread Reply:* @Will Johnson could you tell on which commands from https://github.com/OpenLineage/OpenLineage/issues/368#issue-1038510649 you'll be working?

    +

    *Thread Reply:* @Will Johnson could you tell on which commands from https://github.com/OpenLineage/OpenLineage/issues/368#issue-1038510649 you'll be working?

    @@ -34762,7 +34768,7 @@

    Supress success

    2021-12-15 10:24:14
    -

    *Thread Reply:* If any, of course 🙂

    +

    *Thread Reply:* If any, of course 🙂

    @@ -34788,7 +34794,7 @@

    Supress success

    2021-12-15 10:49:31
    -

*Thread Reply:* From all of our tests on that Cosmos connector, it looks like it strictly uses the AppendData operation. However, @Harish Sune is looking at more of these commands from a Delta data source.

+

*Thread Reply:* From all of our tests on that Cosmos connector, it looks like it strictly uses the AppendData operation. However, @Harish Sune is looking at more of these commands from a Delta data source.

    @@ -34818,7 +34824,7 @@

    Supress success

    2021-12-22 22:43:34
    -

    *Thread Reply:* Just to close the loop on this one - I submitted a PR for the work we've been doing. Looking forward to any feedback! https://github.com/OpenLineage/OpenLineage/pull/450

    +

    *Thread Reply:* Just to close the loop on this one - I submitted a PR for the work we've been doing. Looking forward to any feedback! https://github.com/OpenLineage/OpenLineage/pull/450

    @@ -34906,7 +34912,7 @@

    Supress success

    2021-12-23 05:04:36
    -

    *Thread Reply:* Thanks @Will Johnson! I added one question about dataset naming.

    +

    *Thread Reply:* Thanks @Will Johnson! I added one question about dataset naming.

    @@ -35007,7 +35013,7 @@

    Supress success

    2021-12-15 10:54:41
    -

    *Thread Reply:* Yes! This is awesome!! How might this work for an existing command like the DataSourceV2Visitor.

    +

    *Thread Reply:* Yes! This is awesome!! How might this work for an existing command like the DataSourceV2Visitor.

    Right now, OpenLineage checks based on the provider property if it's an Iceberg or Delta provider.

    @@ -35039,7 +35045,7 @@

    Supress success

    2021-12-15 11:13:20
    -

*Thread Reply:* Resolving this would be a nice addition to the doc (and to the implementation) - currently, we're just returning the result of the first function for which isDefinedAt is satisfied.

+

*Thread Reply:* Resolving this would be a nice addition to the doc (and to the implementation) - currently, we're just returning the result of the first function for which isDefinedAt is satisfied.

This means that we can depend on the order of the visitors...

    @@ -35067,7 +35073,7 @@

    Supress success

    2021-12-15 13:59:12
    -

    *Thread Reply:* great question. For posterity, I'd like to move this to the PR discussion. I'll address the question there.

    +

    *Thread Reply:* great question. For posterity, I'd like to move this to the PR discussion. I'll address the question there.

    @@ -35161,7 +35167,7 @@

    Supress success

    2021-12-15 11:50:47
    -

    *Thread Reply:* hmm, actually the only documentation we have right now is on the demo.datakin.com site https://demo.datakin.com/onboarding . The great expectations tab should be enough to get you started

    +

    *Thread Reply:* hmm, actually the only documentation we have right now is on the demo.datakin.com site https://demo.datakin.com/onboarding . The great expectations tab should be enough to get you started

    demo.datakin.com
    @@ -35208,7 +35214,7 @@

    Supress success

    2021-12-15 11:51:04
    -

    *Thread Reply:* I'll open a ticket to copy that documentation to the OpenLineage site repo

    +

    *Thread Reply:* I'll open a ticket to copy that documentation to the OpenLineage site repo

    @@ -35264,7 +35270,7 @@

    Supress success

    2021-12-15 19:01:50
    -

    *Thread Reply:* Hi! We don't have any integration with deequ yet. We have a structure for recording data quality assertions and statistics, though - see https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DataQualityAssertionsDatasetFacet.json and https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DataQualityMetricsInputDatasetFacet.json for the specs.

    +

    *Thread Reply:* Hi! We don't have any integration with deequ yet. We have a structure for recording data quality assertions and statistics, though - see https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DataQualityAssertionsDatasetFacet.json and https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DataQualityMetricsInputDatasetFacet.json for the specs.

    Check the great expectations integration to see how those facets are being used
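
To give a feel for the shape, a sketch of an input dataset carrying the metrics facet; the field names are taken from the specs linked above (double-check them there), and the dataset/column names are hypothetical:

input_dataset = {
    "namespace": "s3://my-bucket",   # hypothetical
    "name": "/warehouse/orders",     # hypothetical
    "inputFacets": {
        "dataQualityMetrics": {
            "_producer": "https://example.com/my-deequ-wrapper",  # placeholder
            "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/DataQualityMetricsInputDatasetFacet.json",
            "rowCount": 1000,
            "columnMetrics": {
                "order_id": {"nullCount": 0, "distinctCount": 1000},
                "amount": {"min": 0.0, "max": 9999.0},
            },
        }
    },
}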

    @@ -35292,7 +35298,7 @@

    Supress success

    2022-05-24 06:20:50
    -

    *Thread Reply:* This is great. Thanks @Michael Collado!

    +

    *Thread Reply:* This is great. Thanks @Michael Collado!

    @@ -35381,7 +35387,7 @@

    Supress success

    2021-12-20 04:15:51
    -

    *Thread Reply:* This is nothing to show here when you click on test node, right? What about run node?

    +

    *Thread Reply:* This is nothing to show here when you click on test node, right? What about run node?

    @@ -35407,7 +35413,7 @@

    Supress success

    2021-12-20 12:28:21
    -

*Thread Reply:* There are no details about the failure.

+

*Thread Reply:* There are no details about the failure.

dbt-ol build -t DEV --profile cdp --profiles-dir /c/Work/dbt/cdp100/profiles --project-dir /c/Work/dbt/cdp100 --select +riskrawmastersharedshareclass
Running OpenLineage dbt wrapper version 0.4.0

@@ -35495,7 +35501,7 @@

    Supress success

    2021-12-20 12:30:20
    -

    *Thread Reply:* I'm talking on clicking on non-test node in Marquez UI - the screenshots shared show you clicked on the one ending in test

    +

    *Thread Reply:* I'm talking on clicking on non-test node in Marquez UI - the screenshots shared show you clicked on the one ending in test

    @@ -35521,7 +35527,7 @@

    Supress success

    2021-12-20 16:46:11
    -

*Thread Reply:* There are two types of failures: tests that failed on the stage model (relationships), and a physical error in the master model (no table with such a name). The stage test node in Marquez does not show any indication of failures, and the dataset node indicates failure but without the number of failed records or the table name for persistent test storage. The failed master model shows in red but no details of the failure. Master model tests were skipped because of the model failure, but the UI reports "Complete".

+

*Thread Reply:* There are two types of failures: tests that failed on the stage model (relationships), and a physical error in the master model (no table with such a name). The stage test node in Marquez does not show any indication of failures, and the dataset node indicates failure but without the number of failed records or the table name for persistent test storage. The failed master model shows in red but no details of the failure. Master model tests were skipped because of the model failure, but the UI reports "Complete".

    @@ -35601,7 +35607,7 @@

    Supress success

    2021-12-20 18:11:50
    -

*Thread Reply:* If I understood correctly, for a model you would like OpenLineage to capture the error message, like this one:

+

*Thread Reply:* If I understood correctly, for a model you would like OpenLineage to capture the error message, like this one:
22:52:07 Database Error in model customers (models/customers.sql)
22:52:07 Syntax error: Expected "(" or keyword SELECT or keyword WITH but got identifier "PLEASE_REMOVE" at [56:12]
22:52:07 compiled SQL at target/run/jaffle_shop/models/customers.sql

@@ -35649,7 +35655,7 @@

    Supress success

    2021-12-20 18:23:12
    -

    *Thread Reply:* We actually do the first one for Airflow and Spark, I've missed it for dbt 😞

    +

    *Thread Reply:* We actually do the first one for Airflow and Spark, I've missed it for dbt 😞

    Created issue to add it to spec in a generic way: https://github.com/OpenLineage/OpenLineage/issues/446

    @@ -35764,7 +35770,7 @@

    Supress success

    2021-12-20 22:49:54
    -

    *Thread Reply:* Sounds great. Failed/Skipped Tests/Models could be color-coded as well. Thanks.

    +

    *Thread Reply:* Sounds great. Failed/Skipped Tests/Models could be color-coded as well. Thanks.

    @@ -35825,7 +35831,7 @@

    Supress success

    2021-12-22 12:38:26
    -

    *Thread Reply:* Hey. If you're using Airflow 2, you should use LineageBackend method described here: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#airflow-21-experimental

    +

    *Thread Reply:* Hey. If you're using Airflow 2, you should use LineageBackend method described here: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#airflow-21-experimental

    @@ -35855,7 +35861,7 @@

    Supress success

    2021-12-22 12:39:06
    -

    *Thread Reply:* You don't need to do anything with DAG import then.

    +

    *Thread Reply:* You don't need to do anything with DAG import then.

    @@ -35881,7 +35887,7 @@

    Supress success

    2021-12-22 12:40:30
    -

    *Thread Reply:* Thanks!!!!! i'll try

    +

    *Thread Reply:* Thanks!!!!! i'll try

    @@ -35937,7 +35943,7 @@

    Supress success

    2021-12-27 16:49:56
    -

    *Thread Reply:* @Will Johnson @Maciej Obuchowski FYI 👆

    +

    *Thread Reply:* @Will Johnson @Maciej Obuchowski FYI 👆

    @@ -36158,7 +36164,7 @@

    Supress success

    2022-01-12 11:59:44
    -

    *Thread Reply:* ^ It’s happening now!

    +

    *Thread Reply:* ^ It’s happening now!

    @@ -36210,7 +36216,7 @@

    Supress success

    2022-01-18 14:19:39
    -

    *Thread Reply:* Just to be clear, were you able to get a datasource information from API but just now showing up in the UI? Or you weren’t able to get it from API too?

    +

    *Thread Reply:* Just to be clear, were you able to get a datasource information from API but just now showing up in the UI? Or you weren’t able to get it from API too?

    @@ -36262,7 +36268,7 @@

    Supress success

    2022-01-18 11:40:00
    -

    *Thread Reply:* It does generally work, but, there's a known limitation in that only successful task runs are reported to the lineage backend. This is planned to be fixed in Airflow 2.3.

    +

    *Thread Reply:* It does generally work, but, there's a known limitation in that only successful task runs are reported to the lineage backend. This is planned to be fixed in Airflow 2.3.

    @@ -36296,7 +36302,7 @@

    Supress success

    2022-01-18 20:35:52
    -

    *Thread Reply:* thank you. 🙂

    +

    *Thread Reply:* thank you. 🙂

    @@ -36362,7 +36368,7 @@

    Supress success

    2022-01-18 11:30:38
    -

    *Thread Reply:* hey, I'm aware of one small bug ( which will be fixed in the upcoming OpenLineage 0.5.0 ) which means you would also have to include google-cloud-bigquery in your requirements.txt. This is the bug: https://github.com/OpenLineage/OpenLineage/issues/438

    +

    *Thread Reply:* hey, I'm aware of one small bug ( which will be fixed in the upcoming OpenLineage 0.5.0 ) which means you would also have to include google-cloud-bigquery in your requirements.txt. This is the bug: https://github.com/OpenLineage/OpenLineage/issues/438

    @@ -36392,7 +36398,7 @@

    Supress success

    2022-01-18 11:31:51
    -

    *Thread Reply:* The other thing I think you should check is, did you def define the AIRFLOW__LINEAGE__BACKEND variable correctly? What you pasted above looks a little odd with the 2 = signs

    +

    *Thread Reply:* The other thing I think you should check is, did you def define the AIRFLOW__LINEAGE__BACKEND variable correctly? What you pasted above looks a little odd with the 2 = signs

    @@ -36418,7 +36424,7 @@

    Supress success

    2022-01-18 11:34:25
    -

*Thread Reply:* I'm looking at a task log inside my own Airflow and I see msgs like:

+

*Thread Reply:* I'm looking at a task log inside my own Airflow and I see msgs like:
INFO - Constructing openlineage client to send events to

    @@ -36445,7 +36451,7 @@

    Supress success

    2022-01-18 11:34:47
    -

    *Thread Reply:* ^ i.e. I think checking the task logs you can see if it's at least attempting to send data

    +

    *Thread Reply:* ^ i.e. I think checking the task logs you can see if it's at least attempting to send data

    @@ -36471,7 +36477,7 @@

    Supress success

    2022-01-18 11:34:52
    -

    *Thread Reply:* hope this helps!

    +

    *Thread Reply:* hope this helps!

    @@ -36497,7 +36503,7 @@

    Supress success

    2022-01-18 20:40:37
    -

    *Thread Reply:* Thank you, will try again.

    +

    *Thread Reply:* Thank you, will try again.

    @@ -36557,7 +36563,7 @@

    Supress success

    2022-01-19 12:39:30
    -

*Thread Reply:* BTW, this was actually the 0.5.1 release. Because, pypi... 🤷‍♂️

+

*Thread Reply:* BTW, this was actually the 0.5.1 release. Because, pypi... 🤷‍♂️

    @@ -36583,7 +36589,7 @@

    Supress success

    2022-01-27 06:45:08
    -

    *Thread Reply:* nice on the dbt-spark support 👍

    +

    *Thread Reply:* nice on the dbt-spark support 👍

    @@ -36639,7 +36645,7 @@

    Supress success

    2022-01-19 11:13:35
    -

    *Thread Reply:* Are you using docker on mac with "Use Docker Compose V2" enabled?

    +

    *Thread Reply:* Are you using docker on mac with "Use Docker Compose V2" enabled?

    We've just found yesterday that it somehow breaks our example...

    @@ -36671,7 +36677,7 @@

    Supress success

    2022-01-19 11:14:51
    -

    *Thread Reply:* yes i just installed docker on mac

    +

    *Thread Reply:* yes i just installed docker on mac

    @@ -36697,7 +36703,7 @@

    Supress success

    2022-01-19 11:15:02
    -

    *Thread Reply:* and docker compose version 1.29.2

    +

    *Thread Reply:* and docker compose version 1.29.2

    @@ -36723,7 +36729,7 @@

    Supress success

    2022-01-19 11:20:24
    -

    *Thread Reply:* What you can do is to uncheck this, do docker system prune -a and try again.

    +

    *Thread Reply:* What you can do is to uncheck this, do docker system prune -a and try again.

    @@ -36749,7 +36755,7 @@

    Supress success

    2022-01-19 11:21:56
    -

    *Thread Reply:* done but i get this : Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

    +

    *Thread Reply:* done but i get this : Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

    @@ -36775,7 +36781,7 @@

    Supress success

    2022-01-19 11:22:15
    -

    *Thread Reply:* Try to restart docker for mac

    +

    *Thread Reply:* Try to restart docker for mac

    @@ -36801,7 +36807,7 @@

    Supress success

    2022-01-19 11:23:00
    -

    *Thread Reply:* It needs to show Docker Desktop is running :

    +

    *Thread Reply:* It needs to show Docker Desktop is running :

    @@ -36836,7 +36842,7 @@

    Supress success

    2022-01-19 11:24:01
    -

    *Thread Reply:* yeah done . I will try to implement the example again and see thank you very much

    +

    *Thread Reply:* yeah done . I will try to implement the example again and see thank you very much

    @@ -36862,7 +36868,7 @@

    Supress success

    2022-01-19 11:32:55
    -

*Thread Reply:* i dont know why i'm getting this when i $ docker-compose up :

+

*Thread Reply:* i dont know why i'm getting this when i $ docker-compose up :

WARNING: The TAG variable is not set. Defaulting to a blank string.
WARNING: The API_PORT variable is not set. Defaulting to a blank string.

@@ -36898,7 +36904,7 @@

    Supress success

    2022-01-19 11:46:12
    -

    *Thread Reply:* are you running it exactly like here, with respect to directories, etc?

    +

    *Thread Reply:* are you running it exactly like here, with respect to directories, etc?

    https://github.com/MarquezProject/marquez/tree/main/examples/airflow

    @@ -36926,7 +36932,7 @@

    Supress success

    2022-01-19 11:59:36
    -

*Thread Reply:* yeah yeah my bad . everything works fine now . I see the graph in the ui

+

*Thread Reply:* yeah yeah my bad . everything works fine now . I see the graph in the ui

    @@ -36952,7 +36958,7 @@

    Supress success

    2022-01-19 12:04:01
    -

*Thread Reply:* one more question plz . As i said our etls are based on postgres, redshift and airflow . any advice you have for us on integrating OL into our pipeline ?

+

*Thread Reply:* one more question plz . As i said our etls are based on postgres, redshift and airflow . any advice you have for us on integrating OL into our pipeline ?

    @@ -37034,7 +37040,7 @@

    Supress success

    2022-01-19 17:50:39
    -

    *Thread Reply:* Hi Kevin - to my understanding that's correct. Do you guys have a custom extractor using this?

    +

    *Thread Reply:* Hi Kevin - to my understanding that's correct. Do you guys have a custom extractor using this?

    @@ -37060,7 +37066,7 @@

    Supress success

    2022-01-19 20:49:49
    -

    *Thread Reply:* Thanks John! We have custom code emitting OL events within our ingestion pipeline and it includes a custom facet. I’ll refactor the code to the new format and should be good to go.

    +

    *Thread Reply:* Thanks John! We have custom code emitting OL events within our ingestion pipeline and it includes a custom facet. I’ll refactor the code to the new format and should be good to go.

    @@ -37086,7 +37092,7 @@

    Supress success

    2022-01-21 00:34:37
    -

    *Thread Reply:* Just to follow up, this code update worked as expected and we are all good on the upgrade.

    +

    *Thread Reply:* Just to follow up, this code update worked as expected and we are all good on the upgrade.

    @@ -37165,7 +37171,7 @@

    Supress success

    2022-01-25 14:46:58
    -

    *Thread Reply:* Hm, that is odd. Usually there are a few lines in the DAG log from the OpenLineage bits. I’d expect to see something about not having an extractor for the operator you are using.

    +

    *Thread Reply:* Hm, that is odd. Usually there are a few lines in the DAG log from the OpenLineage bits. I’d expect to see something about not having an extractor for the operator you are using.

    @@ -37191,7 +37197,7 @@

    Supress success

    2022-01-25 14:47:53
    -

    *Thread Reply:* If you open a shell in your Airflow Scheduler container and check for the presence of AIRFLOW__LINEAGE__BACKEND is it properly set? Possible the env isn’t making it all the way there.

    +

    *Thread Reply:* If you open a shell in your Airflow Scheduler container and check for the presence of AIRFLOW__LINEAGE__BACKEND is it properly set? Possible the env isn’t making it all the way there.

    @@ -37247,7 +37253,7 @@

    Supress success

    2022-01-21 14:03:29
    -

    *Thread Reply:* Hi Lena! That’s correct, Openlineage is designed to send events to an HTTP backend. There’s a ticket on the future section of the roadmap to support pushing to Amundsen, but it’s not yet been worked on (Ref: Roadmap Issue #86)

    +

    *Thread Reply:* Hi Lena! That’s correct, Openlineage is designed to send events to an HTTP backend. There’s a ticket on the future section of the roadmap to support pushing to Amundsen, but it’s not yet been worked on (Ref: Roadmap Issue #86)

    @@ -37297,7 +37303,7 @@

    Supress success

    2022-01-21 14:08:35
    -

    *Thread Reply:* Thank you for the info!

    +

    *Thread Reply:* Thank you for the info!

    @@ -37349,7 +37355,7 @@

    Supress success

    2022-01-30 12:37:39
    -

    *Thread Reply:* what do you mean by “Integrate Openlineage”?

    +

    *Thread Reply:* what do you mean by “Integrate Openlineage”?

    Can you give a little more information on what you’re trying to accomplish and what the existing project is?

    @@ -37377,7 +37383,7 @@

    Supress success

    2022-01-31 03:49:22
    -

    *Thread Reply:* I work in a datalake team and we are trying to implement data lineage property in our project using openlineage. our project basically keeps track of datasets coming from different sources(hive, redshift, elasticsearch etc.) and jobs.

    +

    *Thread Reply:* I work in a datalake team and we are trying to implement data lineage property in our project using openlineage. our project basically keeps track of datasets coming from different sources(hive, redshift, elasticsearch etc.) and jobs.

    @@ -37403,7 +37409,7 @@

    Supress success

    2022-01-31 15:01:31
    -

    *Thread Reply:* Gotcha!

    +

    *Thread Reply:* Gotcha!

    Broadly speaking, all an integration needs to do is to send runEvents to Marquez.
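To make that concrete, here is a minimal sketch of posting a runEvent straight to Marquez over HTTP (the namespace, job, and dataset names are placeholders):

```python
from datetime import datetime, timezone
from uuid import uuid4

import requests

event = {
    "eventType": "START",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid4())},
    "job": {"namespace": "my-datalake", "name": "hive_to_redshift"},  # placeholders
    "inputs": [{"namespace": "hive", "name": "db.source_table"}],
    "outputs": [{"namespace": "redshift", "name": "db.target_table"}],
    "producer": "https://github.com/my-org/my-lineage-emitter",  # placeholder
}

# Marquez exposes the OpenLineage collection endpoint at /api/v1/lineage
requests.post("http://localhost:5000/api/v1/lineage", json=event)
```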

    @@ -37581,7 +37587,7 @@

    Supress success

    2022-02-15 15:28:03
    -

*Thread Reply:* I suppose OpenLineage itself only defines the standard/protocol to design your data model. To be able to visualize/trace the lineage, you either have to implement it yourself with the standard data models or include Marquez in your project. You would need to use the HTTP API to send lineage events from your Java project to Marquez in this case.

+

*Thread Reply:* I suppose OpenLineage itself only defines the standard/protocol to design your data model. To be able to visualize/trace the lineage, you either have to implement it yourself with the standard data models or include Marquez in your project. You would need to use the HTTP API to send lineage events from your Java project to Marquez in this case.

    @@ -37607,7 +37613,7 @@

    Supress success

    2022-02-16 11:17:13
    -

*Thread Reply:* Exactly! This project also includes connectors for more common data tools (Airflow, dbt, spark, etc), but at its core OpenLineage is a standard and protocol

+

*Thread Reply:* Exactly! This project also includes connectors for more common data tools (Airflow, dbt, spark, etc), but at its core OpenLineage is a standard and protocol

    @@ -37697,7 +37703,7 @@

    Supress success

    2022-02-03 12:39:57
    -

    *Thread Reply:* Hello!

    +

    *Thread Reply:* Hello!

    @@ -37750,7 +37756,7 @@

    Supress success

    2022-02-04 15:07:07
    -

    *Thread Reply:* Hey Albert! There aren't yet any issues or proposals around Apache Atlas yet, but that's definitely something you can help with!

    +

    *Thread Reply:* Hey Albert! There aren't yet any issues or proposals around Apache Atlas yet, but that's definitely something you can help with!

    I'm not super familiar with Atlas, were you thinking in terms of enabling Atlas to receive runEvents from OpenLineage connectors?

    @@ -37778,7 +37784,7 @@

    Supress success

    2022-02-07 05:49:16
    -

*Thread Reply:* Hi John!

+

*Thread Reply:* Hi John!
Yes, exactly, it’d be nice to see Atlas as a receiver side of the OpenLineage events. Are there any guidelines on how to implement it? I guess we need an OpenLineage-compatible server implementation so we could receive events and send them to Atlas, right?

    @@ -37805,7 +37811,7 @@

    Supress success

    2022-02-07 11:30:14
    -

*Thread Reply:* exactly - This would be a change on the Atlas side. I’d start by opening an issue in the atlas repo about making an API endpoint that can receive OpenLineage events.

+

*Thread Reply:* exactly - This would be a change on the Atlas side. I’d start by opening an issue in the atlas repo about making an API endpoint that can receive OpenLineage events.
Marquez is our reference implementation of OpenLineage, so I’d look around in that repo to see how it’s been implemented :)

    @@ -37832,7 +37838,7 @@

    Supress success

    2022-02-07 11:50:27
    -

*Thread Reply:* Got it, thanks! Did that: https://issues.apache.org/jira/browse/ATLAS-4550

+

*Thread Reply:* Got it, thanks! Did that: https://issues.apache.org/jira/browse/ATLAS-4550
If it doesn’t get any traction we at New Work might contribute as well

    @@ -37859,7 +37865,7 @@

    Supress success

    2022-02-07 11:56:09
    -

    *Thread Reply:* awesome! if you guys have any questions, reach out and I can get you in touch with some of the engineers on our end

    +

    *Thread Reply:* awesome! if you guys have any questions, reach out and I can get you in touch with some of the engineers on our end

    @@ -37889,7 +37895,7 @@

    Supress success

    2022-02-08 11:20:47
    -

    *Thread Reply:* @Albert Bikeev one minor thing that could be helpful: java OpenLineage library contains server model classes: https://github.com/OpenLineage/OpenLineage/pull/300#issuecomment-923489097

    +

    *Thread Reply:* @Albert Bikeev one minor thing that could be helpful: java OpenLineage library contains server model classes: https://github.com/OpenLineage/OpenLineage/pull/300#issuecomment-923489097

    @@ -38215,7 +38221,7 @@

    Supress success

    2022-02-08 11:32:12
    -

    *Thread Reply:* Got it, thank you!

    +

    *Thread Reply:* Got it, thank you!

    @@ -38241,7 +38247,7 @@

    Supress success

    2022-05-04 11:12:23
    -

*Thread Reply:* This is a quite old discussion, but isn't it possible to use the openlineage proxy to send json to a kafka topic and let Atlas read that json without any modification?

+

*Thread Reply:* This is a quite old discussion, but isn't it possible to use the openlineage proxy to send json to a kafka topic and let Atlas read that json without any modification?
A new model for spark would need to be created, other than https://github.com/apache/atlas/blob/release-2.1.0-rc3/addons/models/1000-Hadoop/1100-spark_model.json, and uploaded to atlas (which could be done with a call to the atlas Api). Does it make sense?

    Supress success
    2022-05-04 11:24:02
    -

    *Thread Reply:* @Juan Carlos Fernández Rodríguez - You still need to build a bridge between the OpenLineage Spec and the Apache Atlas entity JSON. So far, no one has contributed something like that to the open source community... yet!

    +

    *Thread Reply:* @Juan Carlos Fernández Rodríguez - You still need to build a bridge between the OpenLineage Spec and the Apache Atlas entity JSON. So far, no one has contributed something like that to the open source community... yet!

    @@ -38644,7 +38650,7 @@

    Supress success

    2022-05-04 14:24:28
    -

*Thread Reply:* sorry for the ignorance,

+

*Thread Reply:* sorry for the ignorance,
But what is the purpose of the bridge? The communication with atlas should be done through kafka, and those messages can be sent by the proxy. What am I missing?

    @@ -38671,7 +38677,7 @@

    Supress success

    2022-05-04 16:37:33
    -

    *Thread Reply:* "bridge" in this case refers to a service of some sort that converts from OpenLineage run event to Atlas entity JSON, since there's currently nothing that will do that

    +

    *Thread Reply:* "bridge" in this case refers to a service of some sort that converts from OpenLineage run event to Atlas entity JSON, since there's currently nothing that will do that

    @@ -38697,7 +38703,7 @@

    Supress success

    2022-05-19 09:08:23
    -

*Thread Reply:* If OpenLineage sends an event to kafka, I think we can use kafka streams or kafka connect to rebuild the message into an atlas event.

+

*Thread Reply:* If OpenLineage sends an event to kafka, I think we can use kafka streams or kafka connect to rebuild the message into an atlas event.

    @@ -38723,7 +38729,7 @@

    Supress success

    2022-05-19 09:11:37
    -

*Thread Reply:* @John Thomas Our company used to use atlas as a metadata service. I just came to know this project. After I learned how openlineage works, I think I can create an issue to describe my design first.

+

*Thread Reply:* @John Thomas Our company used to use atlas as a metadata service. I just came to know this project. After I learned how openlineage works, I think I can create an issue to describe my design first.

    @@ -38749,7 +38755,7 @@

    Supress success

    2022-05-19 09:13:36
    -

    *Thread Reply:* @Juan Carlos Fernández Rodríguez If you already have some experience and design, can you directly create an issue so that we can discuss it in more detail ?

    +

    *Thread Reply:* @Juan Carlos Fernández Rodríguez If you already have some experience and design, can you directly create an issue so that we can discuss it in more detail ?

    @@ -38775,7 +38781,7 @@

    Supress success

    2022-05-19 12:42:31
    -

*Thread Reply:* Hi @xiang chen we are discussing internally in my company whether to rewrite to atlas or another alternative. If we do this, we will share and could involve you in some way.

+

*Thread Reply:* Hi @xiang chen we are discussing internally in my company whether to rewrite to atlas or another alternative. If we do this, we will share and could involve you in some way.

    @@ -38860,7 +38866,7 @@

    Supress success

    2022-02-25 13:06:16
    -

    *Thread Reply:* Hi Luca, I agree this looks really promising. I’m working on getting it to run on Databricks, but I’m only just starting out 🙂

    +

    *Thread Reply:* Hi Luca, I agree this looks really promising. I’m working on getting it to run on Databricks, but I’m only just starting out 🙂

    @@ -38986,7 +38992,7 @@

    Supress success

    2022-02-10 09:05:39
    -

    *Thread Reply:* By duplicated, you mean with the same runId?

    +

    *Thread Reply:* By duplicated, you mean with the same runId?

    @@ -39012,7 +39018,7 @@

    Supress success

    2022-02-10 11:40:55
    -

*Thread Reply:* It’s only one example, it could also be a duplicated job name or anything else. The question is whether there is a mechanism to report that

+

*Thread Reply:* It’s only one example, it could also be a duplicated job name or anything else. The question is whether there is a mechanism to report that

    @@ -39068,7 +39074,7 @@

    Supress success

    2022-02-15 05:15:12
    -

    *Thread Reply:* I think this log should be dropped to debug: https://github.com/OpenLineage/OpenLineage/blob/d66c41872f3cc7f7cd5c99664d401e070e[…]c/main/common/java/io/openlineage/spark/agent/EventEmitter.java

    +

    *Thread Reply:* I think this log should be dropped to debug: https://github.com/OpenLineage/OpenLineage/blob/d66c41872f3cc7f7cd5c99664d401e070e[…]c/main/common/java/io/openlineage/spark/agent/EventEmitter.java

    @@ -39119,7 +39125,7 @@

    Supress success

    2022-02-15 23:27:07
    -

*Thread Reply:* @Maciej Obuchowski that is a good one! It would be nice to still have SOME logging at info to know that the event completed successfully, but that response and event are very verbose.

+

*Thread Reply:* @Maciej Obuchowski that is a good one! It would be nice to still have SOME logging at info to know that the event completed successfully, but that response and event are very verbose.

    I was also thinking about here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/src/main/common/java/io/openlineage/spark/agent/lifecycle/OpenLineageRunEventBuilder.java#L337-L340

    @@ -39155,7 +39161,7 @@

    Supress success

    2022-02-16 04:59:17
    -

*Thread Reply:* Yes, that would be a good solution for now. Later it would be nice to have some option to raise the log level - OL logs are absolutely drowning in logs from the rest of the Spark cluster when set to debug.

+

*Thread Reply:* Yes, that would be a good solution for now. Later it would be nice to have some option to raise the log level - OL logs are absolutely drowning in logs from the rest of the Spark cluster when set to debug.

    @@ -39274,7 +39280,7 @@

    Supress success

    2022-02-16 13:44:30
    -

    *Thread Reply:* Hey, I responded on the issue, but just to make it clear for everyone, the OL events for a run are not expected to be an accumulation of all past events. Events should be treated as additive by the backend - each event can post what information it has about the run and the backend is responsible for constructing a holistic picture of the run

    +

    *Thread Reply:* Hey, I responded on the issue, but just to make it clear for everyone, the OL events for a run are not expected to be an accumulation of all past events. Events should be treated as additive by the backend - each event can post what information it has about the run and the backend is responsible for constructing a holistic picture of the run
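In other words, a backend can fold facets together across all events that share a runId, roughly like this (a sketch of the accumulation idea, not Marquez's actual code):

```python
from collections import defaultdict

def merge_run_facets(events: list) -> dict:
    """Collect run facets per runId; later events overwrite earlier keys."""
    merged = defaultdict(dict)
    for event in sorted(events, key=lambda e: e["eventTime"]):
        merged[event["run"]["runId"]].update(event["run"].get("facets", {}))
    return dict(merged)
```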

    @@ -39326,7 +39332,7 @@

    Supress success

    2022-02-16 13:47:18
    -

    *Thread Reply:* e.g., here is the marquez code that fetches the facets for a run. Note that all of the facets are included from all events with the requested run_uuid. If the env facet is present on any event, it will be returned by the API

    +

    *Thread Reply:* e.g., here is the marquez code that fetches the facets for a run. Note that all of the facets are included from all events with the requested run_uuid. If the env facet is present on any event, it will be returned by the API

    @@ -39381,7 +39387,7 @@

    Supress success

    2022-02-16 13:51:30
    -

    *Thread Reply:* Ah! Thanks for that @Michael Collado it's good to understand the OpenLineage perspective.

    +

    *Thread Reply:* Ah! Thanks for that @Michael Collado it's good to understand the OpenLineage perspective.

    So, we do need to maintain some state. That makes total sense, Mike.

    @@ -39437,7 +39443,7 @@

    Supress success

    2022-02-16 14:00:03
    -

*Thread Reply:* If I were building the backend, I would store events, then calculate the end state later, rather than trying to "maintain some state" (maybe we mean the same thing, but using different words here 😀).

+

*Thread Reply:* If I were building the backend, I would store events, then calculate the end state later, rather than trying to "maintain some state" (maybe we mean the same thing, but using different words here 😀).
Re: the failure events, I think job failures will currently result in one FAIL event and one COMPLETE event. The SparkListenerJobEnd event will trigger a FAIL event but the SparkListenerSQLExecutionEnd event will trigger the COMPLETE event.

    @@ -39512,7 +39518,7 @@

    Supress success

    2022-02-16 15:16:27
    -

    *Thread Reply:* Oooh! I did not know we already could get a FAIL event! That is super helpful to know, Mike! Thank you so much!

    +

    *Thread Reply:* Oooh! I did not know we already could get a FAIL event! That is super helpful to know, Mike! Thank you so much!

    @@ -39643,7 +39649,7 @@

    Supress success

    2022-02-22 09:02:48
    -

    *Thread Reply:* I think this scenario only happens when spark job spawns another "sub-job", right?

    +

    *Thread Reply:* I think this scenario only happens when spark job spawns another "sub-job", right?

    I think that maybe you can check sparkContext.getLocalProperty("spark.sql.execution.id")
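For example, from PySpark that check would look like:

```python
# assuming an active SparkSession named `spark`
execution_id = spark.sparkContext.getLocalProperty("spark.sql.execution.id")
print(execution_id)  # None when no SQL execution is active on this thread
```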

    @@ -39702,7 +39708,7 @@

    Supress success

    2022-02-22 10:53:09
    -

    *Thread Reply:* @Maciej Obuchowski - I was hoping they'd have the same run id as well but they do not 😞

    +

    *Thread Reply:* @Maciej Obuchowski - I was hoping they'd have the same run id as well but they do not 😞

    But that is the expectation? A SparkSQLExecutionStart and JobStart SHOULD have the same execution ID, right?

    @@ -39732,7 +39738,7 @@

    Supress success

    2022-02-22 10:57:24
    -

    *Thread Reply:* SparkSQLExecutionStart and SparkSQLExecutionEnd should have the same runId, as well as JobStart and JobEnd events. Beyond those it can get wild. For example, some jobs don't emit JobStart/JobEnd events. Some jobs, like Delta emit multiple, that aren't easily tied to SQL event.

    +

    *Thread Reply:* SparkSQLExecutionStart and SparkSQLExecutionEnd should have the same runId, as well as JobStart and JobEnd events. Beyond those it can get wild. For example, some jobs don't emit JobStart/JobEnd events. Some jobs, like Delta emit multiple, that aren't easily tied to SQL event.

    @@ -39758,7 +39764,7 @@

    Supress success

    2022-02-23 03:48:38
    -

    *Thread Reply:* Okay, I dug into the Databricks Synapse Connector and it does the following:

    +

    *Thread Reply:* Okay, I dug into the Databricks Synapse Connector and it does the following:

    1. SparkSQLExecutionStart with execution id of 8 happens (so gets runid of abc123). It contains the real inputs and outputs that we want.
2. The Synapse connector starts executing JDBC commands. These commands prepare the synapse database to connect with data that Spark will land in a staging area in the cloud. (I don't know how it's executing arbitrary commands before the official job start begins 😞 )
3. A SparkJobStart with execution id of 9 happens (so it gets runid of jkl456). This contains the inputs and an output to a temp folder (NOT the real output we want but a staging location)
   a. There are four JobIds 0 - 3, all of which point back to execution id 9 with the same physical plan.

@@ -39773,7 +39779,7 @@

      Supress success

I've attached the logs and a screenshot of what I'm seeing in the Spark UI. If you had a chance to take a look, it's a bit verbose but I'd appreciate a second pair of eyes on my analysis. Hopefully I got something wrong 😅

-

+

@@ -39814,7 +39820,7 @@

      Supress success

      2022-02-23 07:19:01
      -

      *Thread Reply:* I think we've encountered the same stuff in Delta before 🙂

      +

      *Thread Reply:* I think we've encountered the same stuff in Delta before 🙂

      https://github.com/OpenLineage/OpenLineage/issues/388#issuecomment-964401860

      @@ -39842,7 +39848,7 @@

      Supress success

      2022-02-23 14:13:18
      -

      *Thread Reply:* @Will Johnson , am I reading your report correctly that the SparkListenerJobStart event is reported with a spark.sql.execution.id that differs from the execution id of the SparkSQLExecutionStart?

      +

      *Thread Reply:* @Will Johnson , am I reading your report correctly that the SparkListenerJobStart event is reported with a spark.sql.execution.id that differs from the execution id of the SparkSQLExecutionStart?

      @@ -39868,7 +39874,7 @@

      Supress success

      2022-02-23 14:18:04
      -

*Thread Reply:* WILLJ: We're deep inside this thing and have an executionid |9|

+

*Thread Reply:* WILLJ: We're deep inside this thing and have an executionid |9|
😂

      @@ -39895,7 +39901,7 @@

      Supress success

      2022-02-23 21:56:48
      -

      *Thread Reply:* Hah @Michael Collado I see you found my method of debugging in Databricks 😅

      +

      *Thread Reply:* Hah @Michael Collado I see you found my method of debugging in Databricks 😅

      But you're exactly right, there's a SparkSQLExecutionStart event with execution id 8 and then a set of JobStart events all with execution id 9!

      @@ -39997,7 +40003,7 @@

      Supress success

      2022-02-28 14:00:56
      -

      *Thread Reply:* You won’t want to miss this talk!

      +

      *Thread Reply:* You won’t want to miss this talk!

      @@ -40051,7 +40057,7 @@

      Supress success

      2022-02-28 15:29:58
      -

      *Thread Reply:* hi Martin - when you talk about a DataHub integration, did you mean a method to collect information from DataHub? I don't see a current issue open for that, but I recommend you make one and to kick off the discussion around it.

      +

      *Thread Reply:* hi Martin - when you talk about a DataHub integration, did you mean a method to collect information from DataHub? I don't see a current issue open for that, but I recommend you make one and to kick off the discussion around it.

      If you mean sending information to DataHub, that should already be possible if users pass a datahub api endpoint to the OPENLINEAGE_ENDPOINT variable

      @@ -40079,7 +40085,7 @@

      Supress success

      2022-02-28 16:29:54
      -

*Thread Reply:* Hi, thanks for the reply! I meant to emit the Openlineage JSON structure to datahub.

+

*Thread Reply:* Hi, thanks for the reply! I meant to emit the Openlineage JSON structure to datahub.

Could you please be more specific, possibly link an article on how to find the endpoint on the datahub side? Many thanks!

      @@ -40107,7 +40113,7 @@

      Supress success

      2022-02-28 17:15:31
      -

      *Thread Reply:* ooooh, sorry I misread - I thought you meant that datahub had built an endpoint. Your integration should emit openlineage events to an endpoint, but datahub would have to build that support into their product likely? I'm not sure how to go about it

      +

      *Thread Reply:* ooooh, sorry I misread - I thought you meant that datahub had built an endpoint. Your integration should emit openlineage events to an endpoint, but datahub would have to build that support into their product likely? I'm not sure how to go about it

      @@ -40133,7 +40139,7 @@

      Supress success

      2022-02-28 17:16:27
      -

      *Thread Reply:* I'd reach out to datahub, potentially?

      +

      *Thread Reply:* I'd reach out to datahub, potentially?

      @@ -40159,7 +40165,7 @@

      Supress success

      2022-02-28 17:21:51
      -

      *Thread Reply:* i see. ok, will do!

      +

      *Thread Reply:* i see. ok, will do!

      @@ -40185,7 +40191,7 @@

      Supress success

      2022-03-02 18:15:21
      -

      *Thread Reply:* It has been discussed in the past but I don’t think there is something yet. The Kafka transport PR that is in flight should facilitate this

      +

      *Thread Reply:* It has been discussed in the past but I don’t think there is something yet. The Kafka transport PR that is in flight should facilitate this

      @@ -40211,7 +40217,7 @@

      Supress success

      2022-03-02 18:33:45
      -

*Thread Reply:* Thanks for the response! though dragging Kafka in just for the data delivery bit is too much. I think the clearest way would be to push Datahub to make an API endpoint and parser for the OL /lineage data structure.

+

*Thread Reply:* Thanks for the response! though dragging Kafka in just for the data delivery bit is too much. I think the clearest way would be to push Datahub to make an API endpoint and parser for the OL /lineage data structure.

I see this is more a political thing that would require a joint effort of the DataHub team and OpenLineage with a common goal.

      @@ -40429,7 +40435,7 @@

      Supress success

      2022-03-07 14:56:58
      -

      *Thread Reply:* Hi marco - we don't have documentation on that yet, but the Postgres extractor is a pretty good example of how they're implemented.

      +

      *Thread Reply:* Hi marco - we don't have documentation on that yet, but the Postgres extractor is a pretty good example of how they're implemented.

      all the included extractors are here: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow/openlineage/airflow/extractors

      @@ -40457,7 +40463,7 @@

      Supress success

      2022-03-07 15:07:41
      -

*Thread Reply:* Thanks. I can follow that to build my own. Also I am installing this environment right now in Airflow 2. It seems I need Marquez and the openlineage-airflow library. It seems that by this example I can put my extractors in any path as long as it is referenced in the environment variable. Is that correct?

+

*Thread Reply:* Thanks. I can follow that to build my own. Also I am installing this environment right now in Airflow 2. It seems I need Marquez and the openlineage-airflow library. It seems that by this example I can put my extractors in any path as long as it is referenced in the environment variable. Is that correct?
OPENLINEAGE_EXTRACTOR_&lt;operator&gt;=full.path.to.ExtractorClass
Also do I need anything else other than Marquez and openlineage_airflow?

      @@ -40485,7 +40491,7 @@

      Supress success

      2022-03-07 15:30:45
      -

      *Thread Reply:* Yes, as long as the extractors are in the python path.

      +

      *Thread Reply:* Yes, as long as the extractors are in the python path.
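For anyone landing here later, the skeleton of a custom extractor looks roughly like this (class and operator names are placeholders; the base class comes from openlineage-airflow):

```python
from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata

class MyOperatorExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls):
        # operator class name(s) this extractor handles (placeholder)
        return ["MyOperator"]

    def extract(self) -> TaskMetadata:
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            inputs=[],   # fill in openlineage Dataset objects the task reads
            outputs=[],  # ...and the ones it writes
        )
```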

      @@ -40511,7 +40517,7 @@

      Supress success

      2022-03-07 15:31:59
      -

      *Thread Reply:* I built one a little while ago for a custom operator, I'd be happy to share what I did. I put it in the same file as the operator class for convenience.

      +

      *Thread Reply:* I built one a little while ago for a custom operator, I'd be happy to share what I did. I put it in the same file as the operator class for convenience.

      @@ -40537,7 +40543,7 @@

      Supress success

      2022-03-07 15:32:51
      -

      *Thread Reply:* That will be great help. Thanks

      +

      *Thread Reply:* That will be great help. Thanks

      @@ -40563,10 +40569,10 @@

      Supress success

      2022-03-08 20:38:27
      -

      *Thread Reply:* This is the one I wrote:

      +

      *Thread Reply:* This is the one I wrote:

-

+

@@ -40598,7 +40604,7 @@

      Supress success

      2022-03-08 20:39:30
      -

      *Thread Reply:* to make it work, I set this environment variable:

      +

      *Thread Reply:* to make it work, I set this environment variable:

      OPENLINEAGE_EXTRACTOR_HttpToBigQueryOperator=http_to_bigquery.HttpToBigQueryExtractor

      @@ -40626,7 +40632,7 @@

      Supress success

      2022-03-08 20:40:57
      -

      *Thread Reply:* the extractor starts at line 183, and the really important bits start at line 218

      +

      *Thread Reply:* the extractor starts at line 183, and the really important bits start at line 218

      @@ -40712,7 +40718,7 @@

      Supress success

      2022-03-08 12:05:25
      -

      *Thread Reply:* You would see that job was run - but we couldn't extract dataset lineage from it.

      +

      *Thread Reply:* You would see that job was run - but we couldn't extract dataset lineage from it.

      @@ -40738,7 +40744,7 @@

      Supress success

      2022-03-08 12:05:49
      -

      *Thread Reply:* The good news is that we're working to solve this problem in general.

      +

      *Thread Reply:* The good news is that we're working to solve this problem in general.

      @@ -40764,7 +40770,7 @@

      Supress success

      2022-03-08 12:15:52
      -

*Thread Reply:* I see, so i definitely will need the custom extractor built. I just need to understand where to set the path to the extractor. I can build one by following the postgres extractor you have built.

+

*Thread Reply:* I see, so i definitely will need the custom extractor built. I just need to understand where to set the path to the extractor. I can build one by following the postgres extractor you have built.

      @@ -40790,7 +40796,7 @@

      Supress success

      2022-03-08 12:50:00
      -

      *Thread Reply:* That depends how you deploy Airflow. Our tests use environment in docker-compose: https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/tests/integration/tests/docker-compose-2.yml#L34

      +

      *Thread Reply:* That depends how you deploy Airflow. Our tests use environment in docker-compose: https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/tests/integration/tests/docker-compose-2.yml#L34

      @@ -40841,7 +40847,7 @@

      Supress success

      2022-03-08 13:19:37
      -

      *Thread Reply:* Thanks for the example. I can show this to my infra support person for his reference.

      +

      *Thread Reply:* Thanks for the example. I can show this to my infra support person for his reference.

      @@ -40975,7 +40981,7 @@

      Supress success

      2022-03-10 12:46:18
      -

      *Thread Reply:* Do you need to specify a namespace that is not « default »?

      +

      *Thread Reply:* Do you need to specify a namespace that is not « default »?

      @@ -41027,7 +41033,7 @@

      Supress success

      2022-03-10 07:40:13
      -

      *Thread Reply:* @Kevin Mellott aren't you the chart wizard? Maybe you could help 🙂

      +

      *Thread Reply:* @Kevin Mellott aren't you the chart wizard? Maybe you could help 🙂

      @@ -41057,7 +41063,7 @@

      Supress success

      2022-03-10 14:09:26
      -

      *Thread Reply:* Ok so I had to update a chart dependency

      +

      *Thread Reply:* Ok so I had to update a chart dependency

      @@ -41083,7 +41089,7 @@

      Supress success

      2022-03-10 14:10:39
      -

*Thread Reply:* Now I installed the service in amazon using this

+

*Thread Reply:* Now I installed the service in amazon using this
helm install marquez . --dependency-update --set marquez.db.host=myhost --set marquez.db.user=myuser --set marquez.db.password=mypassword --namespace marquez --atomic --wait

      @@ -41110,7 +41116,7 @@

      Supress success

      2022-03-10 14:11:31
      -

      *Thread Reply:* i can see marquez-web running and marquez as well as the database i set up manually

      +

      *Thread Reply:* i can see marquez-web running and marquez as well as the database i set up manually

      @@ -41136,7 +41142,7 @@

      Supress success

      2022-03-10 14:12:27
      -

      *Thread Reply:* however I can not fetch initial data when login into the endpoint

      +

      *Thread Reply:* however I can not fetch initial data when login into the endpoint

      @@ -41171,7 +41177,7 @@

      Supress success

      2022-03-10 14:52:06
      -

      *Thread Reply:* 👋 @Marco Diaz happy to hear that the Helm install is completing without error! To help troubleshoot the error above, can you please let me know if this endpoint is available and working?

      +

      *Thread Reply:* 👋 @Marco Diaz happy to hear that the Helm install is completing without error! To help troubleshoot the error above, can you please let me know if this endpoint is available and working?

      http://localhost:5000/api/v1/namespaces
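For instance, a quick check from Python:

```python
import requests

resp = requests.get("http://localhost:5000/api/v1/namespaces")
print(resp.status_code, resp.json())
```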

      @@ -41199,7 +41205,7 @@

      Supress success

      2022-03-10 15:13:16
      -

*Thread Reply:* i got this

+

*Thread Reply:* i got this
{"namespaces":[{"name":"default","createdAt":"2022-03-10T18:05:55.780593Z","updatedAt":"2022-03-10T19:03:31.309713Z","ownerName":"anonymous","description":"The default global namespace for dataset, job, and run metadata not belonging to a user-specified namespace."}]}

      @@ -41226,7 +41232,7 @@

      Supress success

      2022-03-10 15:13:34
      -

*Thread Reply:* i have to use the namespace marquez to redirect there

+

*Thread Reply:* i have to use the namespace marquez to redirect there
kubectl port-forward svc/marquez 5000:80 -n marquez

      @@ -41253,7 +41259,7 @@

      Supress success

      2022-03-10 15:13:48
      -

      *Thread Reply:* is there something i need to change in a config file?

      +

      *Thread Reply:* is there something i need to change in a config file?

      @@ -41279,7 +41285,7 @@

      Supress success

      2022-03-10 15:14:39
      -

      *Thread Reply:* also how would i change the "localhost" address to something that is accessible in amazon without the need to redirect?

      +

      *Thread Reply:* also how would i change the "localhost" address to something that is accessible in amazon without the need to redirect?

      @@ -41305,7 +41311,7 @@

      Supress success

      2022-03-10 15:14:59
      -

      *Thread Reply:* Sorry for all the questions. I am not an infra guy and have had to do all this by myself

      +

      *Thread Reply:* Sorry for all the questions. I am not an infra guy and have had to do all this by myself

      @@ -41331,7 +41337,7 @@

      Supress success

      2022-03-10 15:39:23
      -

      *Thread Reply:* No problem at all, I think there are a couple of things at play here. With the local setup, it appears that the web is attempting to access the API on the wrong port number (3000 instead of 5000). I’ll create an issue for that one so that we can fix it.

      +

      *Thread Reply:* No problem at all, I think there are a couple of things at play here. With the local setup, it appears that the web is attempting to access the API on the wrong port number (3000 instead of 5000). I’ll create an issue for that one so that we can fix it.

      As to the EKS installation (or any non-local install), this is where you would need to use what’s called an ingress controller to expose the services outside of the Kubernetes cluster. There are different flavors of these (NGINX is popular), and I believe that AWS EKS has some built-in capabilities that might help as well.

      @@ -41386,7 +41392,7 @@

      Supress success

      2022-03-10 15:40:50
      -

      *Thread Reply:* So how do i fix this issue?

      +

      *Thread Reply:* So how do i fix this issue?

      @@ -41412,7 +41418,7 @@

      Supress success

      2022-03-10 15:46:56
      -

      *Thread Reply:* If your goal is to deploy to AWS, then you would need to get the EKS ingress configured. It’s not a trivial task, but they do have a bit of a walkthrough at https://www.eksworkshop.com/beginner/130_exposing-service/.

      +

      *Thread Reply:* If your goal is to deploy to AWS, then you would need to get the EKS ingress configured. It’s not a trivial task, but they do have a bit of a walkthrough at https://www.eksworkshop.com/beginner/130_exposing-service/.

      However, if you are just seeking to explore Marquez and try things out, then I would highly recommend the “Open in Gitpod” functionality at https://github.com/MarquezProject/marquez#try-it. That will perform a full deployment for you in a temporary environment very quickly.

      Supress success
      2022-03-10 16:02:05
      -

      *Thread Reply:* i need to use it in aws for a POC

      +

      *Thread Reply:* i need to use it in aws for a POC

      @@ -41491,7 +41497,7 @@

      Supress success

      2022-03-10 19:15:08
      -

*Thread Reply:* Is there a better guide on how to install and setup Marquez in AWS?

+

*Thread Reply:* Is there a better guide on how to install and setup Marquez in AWS?
This guide is omitting many steps https://marquezproject.github.io/marquez/running-on-aws.html

      Supress success
      2022-03-11 13:59:28
      -

      *Thread Reply:* I still see this +

      *Thread Reply:* I still see this https://files.slack.com/files-pri/T01CWUYP5AR-F036JKN77EW/image.png

      @@ -41683,7 +41689,7 @@

      Supress success

      2022-03-11 14:00:09
      -

      *Thread Reply:* I created my own database and changed the values for host, user and password inside the chart.yml

      +

      *Thread Reply:* I created my own database and changed the values for host, user and password inside the chart.yml

      @@ -41709,7 +41715,7 @@

      Supress success

      2022-03-11 14:00:23
      -

      *Thread Reply:* Does it show that within the AWS deployment? It looks to show localhost in your screenshot.

      +

      *Thread Reply:* Does it show that within the AWS deployment? It looks to show localhost in your screenshot.

      @@ -41735,7 +41741,7 @@

      Supress success

      2022-03-11 14:00:52
      -

      *Thread Reply:* Or are you working through the local deploy right now?

      +

      *Thread Reply:* Or are you working through the local deploy right now?

      @@ -41761,7 +41767,7 @@

      Supress success

      2022-03-11 14:01:57
      -

      *Thread Reply:* It shows the same using the exposed service

      +

      *Thread Reply:* It shows the same using the exposed service

      @@ -41787,7 +41793,7 @@

      Supress success

      2022-03-11 14:02:09
      -

      *Thread Reply:* i just didnt do another screenshot

      +

      *Thread Reply:* i just didnt do another screenshot

      @@ -41813,7 +41819,7 @@

      Supress success

      2022-03-11 14:02:27
      -

      *Thread Reply:* Could it be communication with the DB?

      +

      *Thread Reply:* Could it be communication with the DB?

      @@ -41839,7 +41845,7 @@

      Supress success

      2022-03-11 14:04:37
      -

      *Thread Reply:* What do you see if you view the network traffic within your web browser (right click -> Inspect -> Network). Specifically, wondering what the response code from the Marquez API URL looks like.

      +

      *Thread Reply:* What do you see if you view the network traffic within your web browser (right click -> Inspect -> Network). Specifically, wondering what the response code from the Marquez API URL looks like.

      @@ -41865,7 +41871,7 @@

      Supress success

      2022-03-11 14:14:48
      -

*Thread Reply:* i see this error

+

*Thread Reply:* i see this error
Error occured while trying to proxy to: <a href="http://xxxxxxxxxxxxxxxxxxxxxxxxx.us-east-1.elb.amazonaws.com/api/v1/namespaces">xxxxxxxxxxxxxxxxxxxxxxxxx.us-east-1.elb.amazonaws.com/api/v1/namespaces</a>

      @@ -41892,7 +41898,7 @@

      Supress success

      2022-03-11 14:16:00
      -

      *Thread Reply:* it seems to be trying to use the same address to access the api endpoint

      +

      *Thread Reply:* it seems to be trying to use the same address to access the api endpoint

      @@ -41918,7 +41924,7 @@

      Supress success

      2022-03-11 14:16:26
      -

      *Thread Reply:* however the api service is in a different endpoint

      +

      *Thread Reply:* however the api service is in a different endpoint

      @@ -41944,7 +41950,7 @@

      Supress success

      2022-03-11 14:18:24
      -

*Thread Reply:* The API resides here

+

*Thread Reply:* The API resides here
<a href="http://Xxxxxxxxxxxxxxxxxxxxxx-2064419849.us-east-1.elb.amazonaws.com">Xxxxxxxxxxxxxxxxxxxxxx-2064419849.us-east-1.elb.amazonaws.com</a>

      @@ -41971,7 +41977,7 @@

      Supress success

      2022-03-11 14:19:13
      -

*Thread Reply:* The web service resides here

+

*Thread Reply:* The web service resides here
<a href="http://xxxxxxxxxxxxxxxxxxxxxxxxxxx-335729662.us-east-1.elb.amazonaws.com">xxxxxxxxxxxxxxxxxxxxxxxxxxx-335729662.us-east-1.elb.amazonaws.com</a>

      @@ -41998,7 +42004,7 @@

      Supress success

      2022-03-11 14:19:25
      -

      *Thread Reply:* do they both need to be under the same LB?

      +

      *Thread Reply:* do they both need to be under the same LB?

      @@ -42024,7 +42030,7 @@

      Supress success

      2022-03-11 14:19:56
      -

*Thread Reply:* How would i do that if they install as separate services?

+

*Thread Reply:* How would i do that if they install as separate services?

      @@ -42050,7 +42056,7 @@

      Supress success

      2022-03-11 14:27:15
      -

      *Thread Reply:* You are correct, both the website and API are expecting to be exposed on the same ALB. This will give you a single URL that can reach your Kubernetes cluster, and then the ALB will allow you to configure Ingress rules to route the traffic based on the request.

      +

      *Thread Reply:* You are correct, both the website and API are expecting to be exposed on the same ALB. This will give you a single URL that can reach your Kubernetes cluster, and then the ALB will allow you to configure Ingress rules to route the traffic based on the request.

      Here is an example from one of the AWS repos - in the ingress resource you can see the single rule setup to point traffic to a given service.

      @@ -42167,7 +42173,7 @@

      Supress success

      2022-03-11 14:36:40
      -

      *Thread Reply:* Thanks for the help. Now I know what the issue is

      +

      *Thread Reply:* Thanks for the help. Now I know what the issue is

      @@ -42193,7 +42199,7 @@

      Supress success

      2022-03-11 14:51:34
      -

      *Thread Reply:* Great to hear!!

      +

      *Thread Reply:* Great to hear!!

      @@ -42248,7 +42254,7 @@

      Supress success

      2022-03-16 10:29:06
      -

      *Thread Reply:* Hi! Yes, OpenLineage is free. It is an open source standard for collection, and it provides the agents that integrate with pipeline tools to capture lineage metadata. You also need a metadata server, and there is an open source one called Marquez that you can use.

      +

      *Thread Reply:* Hi! Yes, OpenLineage is free. It is an open source standard for collection, and it provides the agents that integrate with pipeline tools to capture lineage metadata. You also need a metadata server, and there is an open source one called Marquez that you can use.

      @@ -42274,7 +42280,7 @@

      Supress success

      2022-03-16 10:29:15
      -

      *Thread Reply:* It supports the databases listed here: https://openlineage.io/integration

      +

      *Thread Reply:* It supports the databases listed here: https://openlineage.io/integration

      @@ -42347,7 +42353,7 @@

      Supress success

      2022-03-16 10:29:53
      -

      *Thread Reply:* Not sure I understand - are you looking for example code in Python that shows how to make OpenLineage calls?

      +

      *Thread Reply:* Not sure I understand - are you looking for example code in Python that shows how to make OpenLineage calls?

      @@ -42373,7 +42379,7 @@

      Supress success

      2022-03-16 12:45:14
      -

      *Thread Reply:* yup

      +

      *Thread Reply:* yup

      @@ -42399,7 +42405,7 @@

      Supress success

      2022-03-16 13:10:04
      -

      *Thread Reply:* how to run

      +

      *Thread Reply:* how to run

      @@ -42425,7 +42431,7 @@

      Supress success

      2022-03-16 23:08:31
      -

      *Thread Reply:* this is a good post for getting started with Marquez: https://openlineage.io/blog/explore-lineage-api/

      +

      *Thread Reply:* this is a good post for getting started with Marquez: https://openlineage.io/blog/explore-lineage-api/

      @@ -42472,7 +42478,7 @@

      Supress success

      2022-03-16 23:08:51
      -

      *Thread Reply:* once you have run ./docker/up.sh, you should be able to run through that and see how the system runs

      +

      *Thread Reply:* once you have run ./docker/up.sh, you should be able to run through that and see how the system runs

      @@ -42498,7 +42504,7 @@

      Supress success

      2022-03-16 23:09:45
      -

      *Thread Reply:* There is a python client you can find here: https://github.com/OpenLineage/OpenLineage/tree/main/client/python

      +

      *Thread Reply:* There is a python client you can find here: https://github.com/OpenLineage/OpenLineage/tree/main/client/python

      @@ -42524,7 +42530,7 @@

      Supress success

      2022-03-17 00:05:58
      -

      *Thread Reply:* Thank you

      +

      *Thread Reply:* Thank you

      @@ -42550,7 +42556,7 @@

      Supress success

      2022-03-19 00:00:32
      -

      *Thread Reply:* You are welcome 🙂

      +

      *Thread Reply:* You are welcome 🙂

      @@ -42576,7 +42582,7 @@

      Supress success

      2022-04-19 09:28:50
      -

      *Thread Reply:* Hey @Ross Turk, (and potentially @Maciej Obuchowski) - what are the plans for OL Python client? I'd like to use it, but without a pip package it's not really project-friendly.

      +

      *Thread Reply:* Hey @Ross Turk, (and potentially @Maciej Obuchowski) - what are the plans for OL Python client? I'd like to use it, but without a pip package it's not really project-friendly.

      Is there any work in that direction, is the current client code considered mature and just needs re-packaging, or is it just a thought sketch and some serious work is needed?

      @@ -42606,7 +42612,7 @@

      Supress success

      2022-04-19 09:32:17
      -

      *Thread Reply:* What do you mean without pip-package?

      +

      *Thread Reply:* What do you mean without pip-package?

      @@ -42632,7 +42638,7 @@

      Supress success

      2022-04-19 09:32:18
      -

      *Thread Reply:* https://pypi.org/project/openlineage-python/

      +

      *Thread Reply:* https://pypi.org/project/openlineage-python/

      @@ -42658,7 +42664,7 @@

      Supress success

      2022-04-19 09:35:08
      -

*Thread Reply:* It's still developed, for example next release will have pluggable backends - like Kafka

+

*Thread Reply:* It's still developed, for example next release will have pluggable backends - like Kafka
https://github.com/OpenLineage/OpenLineage/pull/530

      @@ -42685,7 +42691,7 @@

      Supress success

      2022-04-19 09:40:11
      -

*Thread Reply:* My apologies Maciej!

+

*Thread Reply:* My apologies Maciej!
In my defense - looking for "open lineage" on pypi doesn't show this in the first 20 results. Still, should have checked setup.py. My bad, and thank you for the pointer!

      @@ -42712,7 +42718,7 @@

      Supress success

      2022-04-19 10:00:49
      -

      *Thread Reply:* We might need to add some keywords to setup.py - right now we have only "openlineage" there 😉

      +

      *Thread Reply:* We might need to add some keywords to setup.py - right now we have only "openlineage" there 😉

      @@ -42738,7 +42744,7 @@

      Supress success

      2022-04-20 08:12:29
      -

      *Thread Reply:* My mistake was that I was expecting a separate repo for the clients. But now I'm playing around with the package and trying to figure out the OL concepts. Thank you for your contribution, it's much nicer to experiment from ipynb than curl 🙂

      +

      *Thread Reply:* My mistake was that I was expecting a separate repo for the clients. But now I'm playing around with the package and trying to figure out the OL concepts. Thank you for your contribution, it's much nicer to experiment from ipynb than curl 🙂

      @@ -42837,7 +42843,7 @@

      Supress success

      2022-03-16 14:35:14
      -

      *Thread Reply:* the seed data is being inserted by this command here: https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/cli/SeedCommand.java

      +

      *Thread Reply:* the seed data is being inserted by this command here: https://github.com/MarquezProject/marquez/blob/main/api/src/main/java/marquez/cli/SeedCommand.java

      @@ -43095,7 +43101,7 @@

      Supress success

      2022-03-17 00:06:53
      -

*Thread Reply:* Got it, but if i changed the code in this java file - let's say i added another job here satisfying the syntax - it's not appearing in the lineage flow

+

*Thread Reply:* Got it, but if i changed the code in this java file - let's say i added another job here satisfying the syntax - it's not appearing in the lineage flow

      @@ -43234,7 +43240,7 @@

      Supress success

      2022-03-23 10:20:47
      -

      *Thread Reply:* I’m not too familiar with the 504 error in ALB, but found a guide with troubleshooting steps. If this is an issue with connectivity to the Postgres database, then you should be able to see errors within the marquez pod in EKS (kubectl logs <marquez pod name>) to confirm.

      +

      *Thread Reply:* I’m not too familiar with the 504 error in ALB, but found a guide with troubleshooting steps. If this is an issue with connectivity to the Postgres database, then you should be able to see errors within the marquez pod in EKS (kubectl logs <marquez pod name>) to confirm.

      I know that EKS needs to have connectivity established to the Postgres database, even in the case of RDS, so that could be the culprit.

      @@ -43262,7 +43268,7 @@

      Supress success

      2022-03-23 16:09:09
      -

      *Thread Reply:* @Kevin Mellott This is the error I am seeing in the logs +

*Thread Reply:* @Kevin Mellott This is the error I am seeing in the logs 
[HPM] Proxy created: /api/v1 -> <http://localhost:5000/> App listening on port 3000! [HPM] Error occurred while trying to proxy request /api/v1/namespaces from <a href="http://marquez-interface-test.di.rbx.com">marquez-interface-test.di.rbx.com</a> to <http://localhost:5000/> (ECONNREFUSED) (<https://nodejs.org/api/errors.html#errors_common_system_errors>)

      @@ -43291,7 +43297,7 @@

      Supress success

      2022-03-23 16:22:13
      -

      *Thread Reply:* It looks like the website is attempting to find the API on localhost. I believe this can be resolved by setting the following Helm chart value within your deployment.

      +

      *Thread Reply:* It looks like the website is attempting to find the API on localhost. I believe this can be resolved by setting the following Helm chart value within your deployment.

      marquez.hostname=marquez-interface-test.di.rbx.com
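
For reference, a sketch of how that value might be applied - the release name and chart location are assumptions, not taken from this thread:

# from a checkout of the Marquez repo (chart path assumed)
helm upgrade --install marquez ./chart \
  --set marquez.hostname=marquez-interface-test.di.rbx.com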

      @@ -43319,7 +43325,7 @@

      Supress success

      2022-03-23 16:22:54
      -

      *Thread Reply:* assuming that is the DNS used by the website

      +

      *Thread Reply:* assuming that is the DNS used by the website

      @@ -43345,7 +43351,7 @@

      Supress success

      2022-03-23 16:48:53
      -

      *Thread Reply:* thanks, that did it. I have a question regarding the database

      +

      *Thread Reply:* thanks, that did it. I have a question regarding the database

      @@ -43371,7 +43377,7 @@

      Supress success

      2022-03-23 16:50:01
      -

*Thread Reply:* I made my own database manually. Should the marquez tables be created automatically when installing marquez?

+

*Thread Reply:* I made my own database manually. Should the marquez tables be created automatically when installing marquez?

      @@ -43397,7 +43403,7 @@

      Supress success

      2022-03-23 16:56:10
      -

*Thread Reply:* Also, could you put both the API and interface on the same port (3000)?

+

*Thread Reply:* Also, could you put both the API and interface on the same port (3000)?

      @@ -43423,7 +43429,7 @@

      Supress success

      2022-03-23 17:21:58
      -

      *Thread Reply:* Seems I am still having the forwarding issue +

*Thread Reply:* Seems I am still having the forwarding issue 
[HPM] Proxy created: /api/v1 -> <http://marquez-interface-test.di.rbx.com:5000/> App listening on port 3000! [HPM] Error occurred while trying to proxy request /api/v1/namespaces from <a href="http://marquez-interface-test.di.rbx.com">marquez-interface-test.di.rbx.com</a> to <http://marquez-interface-test.di.rbx.com:5000/> (ECONNRESET) (<https://nodejs.org/api/errors.html#errors_common_system_errors>)

      @@ -43488,7 +43494,7 @@

      Supress success

      2022-03-23 09:25:18
      -

      *Thread Reply:* It's always Delta, isn't it?

      +

      *Thread Reply:* It's always Delta, isn't it?

When I originally worked on Delta support I tried to find an answer on the Delta slack and got one:

      @@ -43522,7 +43528,7 @@

      Supress success

      2022-03-23 09:25:48
      -

      *Thread Reply:* I haven't touched how it works in Spark 2 - wanted to make it work with Spark 3's new catalogs, so can't help you there.

      +

      *Thread Reply:* I haven't touched how it works in Spark 2 - wanted to make it work with Spark 3's new catalogs, so can't help you there.

      @@ -43548,7 +43554,7 @@

      Supress success

      2022-03-23 09:46:14
      -

      *Thread Reply:* Argh!! It's always Databricks doing something 🙄

      +

      *Thread Reply:* Argh!! It's always Databricks doing something 🙄

      Thanks, Maciej!

      @@ -43576,7 +43582,7 @@

      Supress success

      2022-03-23 09:51:59
      -

      *Thread Reply:* One last question for you, @Maciej Obuchowski, any thoughts on how I could identify WHY a particular JobStart event fired? Is it just stepping through every event? Was that your approach to getting Spark3 Delta working? Thank you so much for the insights!

      +

      *Thread Reply:* One last question for you, @Maciej Obuchowski, any thoughts on how I could identify WHY a particular JobStart event fired? Is it just stepping through every event? Was that your approach to getting Spark3 Delta working? Thank you so much for the insights!

      @@ -43602,7 +43608,7 @@

      Supress success

      2022-03-23 09:58:08
      -

      *Thread Reply:* Before that, we were using just JobStart/JobEnd events and I couldn't find events that correspond to logical plan that has anything to do with what job was actually doing. I just found out that SQLExecution events have what I want, so I just started using them and stopped worrying about Projection or Aggregate, or other events that don't really matter here - and that's how filtering idea was born: https://github.com/OpenLineage/OpenLineage/issues/423

      +

      *Thread Reply:* Before that, we were using just JobStart/JobEnd events and I couldn't find events that correspond to logical plan that has anything to do with what job was actually doing. I just found out that SQLExecution events have what I want, so I just started using them and stopped worrying about Projection or Aggregate, or other events that don't really matter here - and that's how filtering idea was born: https://github.com/OpenLineage/OpenLineage/issues/423

      @@ -43628,7 +43634,7 @@

      Supress success

      2022-03-23 09:59:37
      -

      *Thread Reply:* Are you trying to get environment info from those events, or do you actually get Job event with proper logical plans like SaveIntoDataSourceCommand?

      +

      *Thread Reply:* Are you trying to get environment info from those events, or do you actually get Job event with proper logical plans like SaveIntoDataSourceCommand?

Might be worth just posting here all the events + logical plans that are generated for a particular job, as I've done in that issue

      @@ -43656,7 +43662,7 @@

      Supress success

      2022-03-23 09:59:40
      -

*Thread Reply:* scala> spark.sql("CREATE TABLE tbl USING delta AS SELECT * FROM tmp") +

*Thread Reply:* scala> spark.sql("CREATE TABLE tbl USING delta AS SELECT * FROM tmp") 
21/11/09 19:01:46 WARN SparkSQLExecutionContext: SparkListenerSQLExecutionStart - executionId: 3 21/11/09 19:01:46 WARN SparkSQLExecutionContext: org.apache.spark.sql.catalyst.plans.logical.CreateTableAsSelect 21/11/09 19:01:46 WARN SparkSQLExecutionContext: SparkListenerSQLExecutionStart - executionId: 4 @@ -43702,7 +43708,7 @@

      Supress success

      2022-03-23 11:41:37
      -

      *Thread Reply:* The JobStart event contains a Properties field and that contains a bunch of fields we want to extract to get more precise lineage information within Databricks.

      +

      *Thread Reply:* The JobStart event contains a Properties field and that contains a bunch of fields we want to extract to get more precise lineage information within Databricks.

      As far as we know, the SQLExecutionStart event does not have any way to get these properties :(

      @@ -43797,7 +43803,7 @@

      Supress success

      2022-03-23 11:42:33
      -

      *Thread Reply:* I started down this path with the Project statement but I agree with @Michael Collado that a ProjectVisitor isn't a great idea.

      +

      *Thread Reply:* I started down this path with the Project statement but I agree with @Michael Collado that a ProjectVisitor isn't a great idea.

      https://github.com/OpenLineage/OpenLineage/issues/617

      Supress success
      2022-03-25 20:33:02
      -

      *Thread Reply:* Marquez and OpenLineage are job-focused lineage tools, so once you run a job in an OL-integrated instance of Airflow (or any other supported integration), you should see the jobs and DBs appear in the marquez ui

      +

      *Thread Reply:* Marquez and OpenLineage are job-focused lineage tools, so once you run a job in an OL-integrated instance of Airflow (or any other supported integration), you should see the jobs and DBs appear in the marquez ui

      @@ -44074,7 +44080,7 @@

      Supress success

      2022-03-25 21:44:54
      -

      *Thread Reply:* If you want to seed it with some data, just to try it out, you can run docker/up.sh -s and it will run a seeding job as it starts.

      +

      *Thread Reply:* If you want to seed it with some data, just to try it out, you can run docker/up.sh -s and it will run a seeding job as it starts.

      @@ -44126,7 +44132,7 @@

      Supress success

      2022-03-31 18:34:40
      -

      *Thread Reply:* Yep! Marquez will register all in/out datasets present in the OL event as well as link them to the run

      +

      *Thread Reply:* Yep! Marquez will register all in/out datasets present in the OL event as well as link them to the run

      @@ -44152,7 +44158,7 @@

      Supress success

      2022-03-31 18:35:47
      -

      *Thread Reply:* FYI, @Peter Hicks is working on displaying the dataset version to run relationship in the web UI, see https://github.com/MarquezProject/marquez/pull/1929

      +

      *Thread Reply:* FYI, @Peter Hicks is working on displaying the dataset version to run relationship in the web UI, see https://github.com/MarquezProject/marquez/pull/1929

      @@ -44265,7 +44271,7 @@

      Supress success

      2022-03-28 15:43:46
      -

      *Thread Reply:* Hi Marco,

      +

      *Thread Reply:* Hi Marco,

      Datakin is a reporting tool built on the Marquez API, and therefore designed to take in Lineage using the OpenLineage specification.

      @@ -44295,7 +44301,7 @@

      Supress success

      2022-03-28 15:47:53
      -

*Thread Reply:* No, that is it. Got it. So, I can install Datakin and still use openlineage and marquez?

+

*Thread Reply:* No, that is it. Got it. So, I can install Datakin and still use openlineage and marquez?

      @@ -44321,7 +44327,7 @@

      Supress success

      2022-03-28 15:55:07
      -

      *Thread Reply:* if you set up a datakin account, you'll have to change the environment variables used by your OpenLineage integrations, and the runEvents will be sent to Datakin rather than Marquez. You shouldn't have any loss of functionality, and you also won't have to keep manually hosting Marquez

      +

      *Thread Reply:* if you set up a datakin account, you'll have to change the environment variables used by your OpenLineage integrations, and the runEvents will be sent to Datakin rather than Marquez. You shouldn't have any loss of functionality, and you also won't have to keep manually hosting Marquez

      @@ -44347,7 +44353,7 @@

      Supress success

      2022-03-28 16:10:25
      -

      *Thread Reply:* Will I still be able to use facets for backfills?

      +

      *Thread Reply:* Will I still be able to use facets for backfills?

      @@ -44373,7 +44379,7 @@

      Supress success

      2022-03-28 17:04:03
      -

      *Thread Reply:* yeah it works in the same way - Datakin actually submodules the Marquez API

      +

      *Thread Reply:* yeah it works in the same way - Datakin actually submodules the Marquez API

      @@ -44488,7 +44494,7 @@

      Supress success

      2022-03-29 06:24:21
      -

      *Thread Reply:* Yes, you don't need to modify dags in Airflow 2.1+

      +

      *Thread Reply:* Yes, you don't need to modify dags in Airflow 2.1+

      @@ -44514,7 +44520,7 @@

      Supress success

      2022-03-29 17:47:39
      -

*Thread Reply:* ok, I added that environment variable. Now my question is how do I configure my other variables. +

*Thread Reply:* ok, I added that environment variable. Now my question is how do I configure my other variables. 
I have marquez running in AWS with an ingress. Do I use OPENLINEAGE_URL or MARQUEZ_URL?
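
For reference, the integration reads OPENLINEAGE_-prefixed variables; a minimal sketch, assuming the documented names and a hypothetical ingress URL:

export OPENLINEAGE_URL=http://marquez.example.com   # your Marquez ingress
export OPENLINEAGE_NAMESPACE=my-namespace           # optional; falls back to "default"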

      @@ -44551,7 +44557,7 @@

      Supress success

      2022-03-29 17:48:09
      -

*Thread Reply:* Also, would a new namespace be created if I add the variable?

+

*Thread Reply:* Also, would a new namespace be created if I add the variable?

      @@ -44603,7 +44609,7 @@

      Supress success

      2022-03-30 14:59:13
      -

*Thread Reply:* Hi Datafool - I'm not familiar with how trino works, but the DBT-OL integration works by wrapping the dbt run command with dbt-ol run, and capturing lineage data from the run_results.json file

+

*Thread Reply:* Hi Datafool - I'm not familiar with how trino works, but the DBT-OL integration works by wrapping the dbt run command with dbt-ol run, and capturing lineage data from the run_results.json file

      These things don't necessarily preclude you from using OpenLineage on trino, so it may work already.

      @@ -44631,7 +44637,7 @@

      Supress success

      2022-03-30 18:34:38
      -

*Thread Reply:* hey @John Thomas yep, tried to use the dbt-ol run command but it seems trino is not supported, only bigquery, redshift and a few others.

+

*Thread Reply:* hey @John Thomas yep, tried to use the dbt-ol run command but it seems trino is not supported, only bigquery, redshift and a few others.

      @@ -44657,7 +44663,7 @@

      Supress success

      2022-03-30 18:36:41
      -

      *Thread Reply:* aaah I misunderstood what Trino is - yeah we don't currently support jobs that are running outside of those environments.

      +

      *Thread Reply:* aaah I misunderstood what Trino is - yeah we don't currently support jobs that are running outside of those environments.

      We don't currently have plans for this, but a great first step would be opening an issue in the OpenLineage repo.

      @@ -44687,7 +44693,7 @@

      Supress success

      2022-03-30 20:23:46
      -

      *Thread Reply:* oh okay, got it, yes I can contribute, I'll see if I can get some time in the next few weeks. Thanks @John Thomas

      +

      *Thread Reply:* oh okay, got it, yes I can contribute, I'll see if I can get some time in the next few weeks. Thanks @John Thomas

      @@ -44963,7 +44969,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:11:35
      -

      *Thread Reply:* Hmmmm. Are you running in Docker? Is it possible for you to shell into your scheduler container and make sure the ENV is properly set?

      +

      *Thread Reply:* Hmmmm. Are you running in Docker? Is it possible for you to shell into your scheduler container and make sure the ENV is properly set?

      @@ -44989,7 +44995,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:11:57
      -

      *Thread Reply:* looks to me like the value you posted is correct, and return ['QueryOperator'] seems right to me

      +

      *Thread Reply:* looks to me like the value you posted is correct, and return ['QueryOperator'] seems right to me

      @@ -45015,7 +45021,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:33:00
      -

      *Thread Reply:* It is in an EKS cluster +

      *Thread Reply:* It is in an EKS cluster I checked and the variable is there OPENLINEAGE_EXTRACTOR_QUERYOPERATOR=shared.plugins.ol_custom_extractors.QueryOperatorExtractor

      @@ -45043,7 +45049,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:33:56
      -

      *Thread Reply:* I am wondering if it is an issue with my extractor code. Something not rendering well

      +

      *Thread Reply:* I am wondering if it is an issue with my extractor code. Something not rendering well

      @@ -45069,7 +45075,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:40:17
      -

      *Thread Reply:* I don’t think it’s even executing your extractor code. The error message traces back to here: +

      *Thread Reply:* I don’t think it’s even executing your extractor code. The error message traces back to here: https://github.com/OpenLineage/OpenLineage/blob/249868fa9b97d218ee35c4a198bcdf231a9b874b/integration/airflow/openlineage/lineage_backend/__init__.py#L77

      @@ -45121,7 +45127,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:40:45
      -

      *Thread Reply:* I am currently digging into _get_extractor to see where it might be missing yours 🤔

      +

      *Thread Reply:* I am currently digging into _get_extractor to see where it might be missing yours 🤔

      @@ -45147,7 +45153,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:46:36
      -

      *Thread Reply:* Thanks

      +

      *Thread Reply:* Thanks

      @@ -45173,7 +45179,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:47:19
      -

      *Thread Reply:* silly idea, but you could add a log message to __init__ in your extractor.

      +

      *Thread Reply:* silly idea, but you could add a log message to __init__ in your extractor.
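
For reference, a minimal sketch of such an extractor with a log line in __init__ - this assumes the extractor API of openlineage-airflow 0.6.x, with the class and operator names taken from this thread:

import logging

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata

logger = logging.getLogger(__name__)

class QueryOperatorExtractor(BaseExtractor):
    def __init__(self, operator):
        super().__init__(operator)
        # if this never shows up in the scheduler logs, the env var or PYTHONPATH is wrong
        logger.info("QueryOperatorExtractor instantiated for %s", operator.task_id)

    @classmethod
    def get_operator_classnames(cls):
        return ['QueryOperator']

    def extract(self) -> TaskMetadata:
        # placeholder; real lineage extraction would populate inputs/outputs here
        return TaskMetadata(name=f"{self.operator.dag_id}.{self.operator.task_id}")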

      @@ -45199,7 +45205,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:47:25
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/249868fa9b97d218ee35c4a198bcdf231a[…]ntegration/airflow/openlineage/airflow/extractors/extractors.py

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/249868fa9b97d218ee35c4a198bcdf231a[…]ntegration/airflow/openlineage/airflow/extractors/extractors.py

      @@ -45250,7 +45256,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:48:20
      -

      *Thread Reply:* the openlineage client actually tries to import the value of that env variable from pos 22. if that happens, but for some reason it fails to register the extractor, we can at least know that it’s importing

      +

      *Thread Reply:* the openlineage client actually tries to import the value of that env variable from pos 22. if that happens, but for some reason it fails to register the extractor, we can at least know that it’s importing

      @@ -45276,7 +45282,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:48:54
      -

      *Thread Reply:* if you add a log line, you can verify that your PYTHONPATH and env are correct

      +

      *Thread Reply:* if you add a log line, you can verify that your PYTHONPATH and env are correct

      @@ -45302,7 +45308,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:49:23
      -

      *Thread Reply:* will try that

      +

      *Thread Reply:* will try that

      @@ -45328,7 +45334,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:49:29
      -

      *Thread Reply:* and let you know

      +

      *Thread Reply:* and let you know

      @@ -45354,7 +45360,7 @@

      logger = logging.getLogger(name)

      2022-03-31 14:49:39
      -

      *Thread Reply:* ok!

      +

      *Thread Reply:* ok!

      @@ -45380,7 +45386,7 @@

      logger = logging.getLogger(name)

      2022-03-31 15:04:05
      -

      *Thread Reply:* @Marco Diaz can you try env variable OPENLINEAGE_EXTRACTOR_QueryOperator instead of full caps?

      +

      *Thread Reply:* @Marco Diaz can you try env variable OPENLINEAGE_EXTRACTOR_QueryOperator instead of full caps?
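
In other words, the suffix after OPENLINEAGE_EXTRACTOR_ has to match the operator class name exactly, e.g.:

export OPENLINEAGE_EXTRACTOR_QueryOperator=shared.plugins.ol_custom_extractors.QueryOperatorExtractor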

      @@ -45410,7 +45416,7 @@

      logger = logging.getLogger(name)

      2022-03-31 15:13:37
      -

      *Thread Reply:* Will try that too

      +

      *Thread Reply:* Will try that too

      @@ -45436,7 +45442,7 @@

      logger = logging.getLogger(name)

      2022-03-31 15:13:44
      -

      *Thread Reply:* Thanks for helping

      +

      *Thread Reply:* Thanks for helping

      @@ -45462,7 +45468,7 @@

      logger = logging.getLogger(name)

      2022-03-31 16:58:24
      -

*Thread Reply:* @Maciej Obuchowski My setup does not allow me to submit environment variables with lowercase letters. Is the name of the variable used to register the extractor?

+

*Thread Reply:* @Maciej Obuchowski My setup does not allow me to submit environment variables with lowercase letters. Is the name of the variable used to register the extractor?

      @@ -45488,7 +45494,7 @@

      logger = logging.getLogger(name)

      2022-03-31 17:15:57
      -

      *Thread Reply:* yes, it's case sensitive...

      +

      *Thread Reply:* yes, it's case sensitive...

      @@ -45514,7 +45520,7 @@

      logger = logging.getLogger(name)

      2022-03-31 17:18:42
      -

*Thread Reply:* I see

+

*Thread Reply:* I see

      @@ -45540,7 +45546,7 @@

      logger = logging.getLogger(name)

      2022-03-31 17:39:16
      -

*Thread Reply:* So it is definitely the name of the variable. I changed the name of the operator to capitals and now it is being registered

+

*Thread Reply:* So it is definitely the name of the variable. I changed the name of the operator to capitals and now it is being registered

      @@ -45566,7 +45572,7 @@

      logger = logging.getLogger(name)

      2022-03-31 17:39:44
      -

      *Thread Reply:* Could there be a way not to make this case sensitive?

      +

      *Thread Reply:* Could there be a way not to make this case sensitive?

      @@ -45592,7 +45598,7 @@

      logger = logging.getLogger(name)

      2022-03-31 18:31:27
      -

*Thread Reply:* yes - could you create an issue on the OpenLineage repository?

+

*Thread Reply:* yes - could you create an issue on the OpenLineage repository?

      @@ -45618,7 +45624,7 @@

      logger = logging.getLogger(name)

      2022-04-01 10:46:59
      -

      *Thread Reply:* sure

      +

      *Thread Reply:* sure

      @@ -45694,7 +45700,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:25:52
      -

      *Thread Reply:* Yes, it should. This line is the likely culprit: +

      *Thread Reply:* Yes, it should. This line is the likely culprit: https://github.com/OpenLineage/OpenLineage/blob/431251d25f03302991905df2dc24357823d9c9c3/integration/common/openlineage/common/sql/parser.py#L30

      @@ -45746,7 +45752,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:26:25
      -

      *Thread Reply:* I bet if that said ['INTO','OVERWRITE'] it would work

      +

      *Thread Reply:* I bet if that said ['INTO','OVERWRITE'] it would work
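
For illustration, the two statement shapes at stake here (table names are hypothetical) - the first matches the 'INTO' token, the second was being skipped:

INSERT INTO analytics.orders SELECT * FROM staging.orders;            -- 'INTO': parsed
INSERT OVERWRITE TABLE analytics.orders SELECT * FROM staging.orders; -- was skipped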

      @@ -45772,7 +45778,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:27:23
      -

      *Thread Reply:* @Maciej Obuchowski do you agree? should OVERWRITE be a token we look for? if so, I can submit a short PR.

      +

      *Thread Reply:* @Maciej Obuchowski do you agree? should OVERWRITE be a token we look for? if so, I can submit a short PR.

      @@ -45798,7 +45804,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:30:36
      -

      *Thread Reply:* we have a better solution

      +

      *Thread Reply:* we have a better solution

      @@ -45824,7 +45830,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:30:37
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/644

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/644

      @@ -45878,7 +45884,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:31:27
      -

      *Thread Reply:* ah! I heard there was a new SQL parser, but did not know it was imminent!

      +

      *Thread Reply:* ah! I heard there was a new SQL parser, but did not know it was imminent!

      @@ -45904,7 +45910,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:31:30
      -

      *Thread Reply:* I've added this case as a test and it works: https://github.com/OpenLineage/OpenLineage/blob/764dfdb885112cd0840ebc7384ff958bf20d4a70/integration/sql/tests/tests_insert.rs

      +

      *Thread Reply:* I've added this case as a test and it works: https://github.com/OpenLineage/OpenLineage/blob/764dfdb885112cd0840ebc7384ff958bf20d4a70/integration/sql/tests/tests_insert.rs

      @@ -46047,7 +46053,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:31:33
      -

      *Thread Reply:* let me review this PR

      +

      *Thread Reply:* let me review this PR

      @@ -46073,7 +46079,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:36:32
      -

*Thread Reply:* Do I have to download a new version of the openlineage-airflow python library?

+

*Thread Reply:* Do I have to download a new version of the openlineage-airflow python library?

      @@ -46099,7 +46105,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:36:41
      -

*Thread Reply:* If so, which version?

+

*Thread Reply:* If so, which version?

      @@ -46125,7 +46131,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:37:22
      -

      *Thread Reply:* this PR isn’t merged yet 😞 so if you wanted to try this you’d have to build the python client from the sql/rust-parser-impl branch

      +

      *Thread Reply:* this PR isn’t merged yet 😞 so if you wanted to try this you’d have to build the python client from the sql/rust-parser-impl branch

      @@ -46151,7 +46157,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:38:17
      -

      *Thread Reply:* ok, np. I am not in a hurry yet. Do you have an ETA for the merge?

      +

      *Thread Reply:* ok, np. I am not in a hurry yet. Do you have an ETA for the merge?

      @@ -46177,7 +46183,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:39:50
      -

      *Thread Reply:* Hard to say, it’s currently in-review. Let me pull some strings, see if I can get eyes on it.

      +

      *Thread Reply:* Hard to say, it’s currently in-review. Let me pull some strings, see if I can get eyes on it.

      @@ -46203,7 +46209,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:40:34
      -

      *Thread Reply:* I will check again next week don't worry. I still need to make some things in my extractor work

      +

      *Thread Reply:* I will check again next week don't worry. I still need to make some things in my extractor work

      @@ -46229,7 +46235,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:40:36
      -

      *Thread Reply:* after it’s merged, we’ll have to do an OpenLineage release as well - perhaps next week?

      +

      *Thread Reply:* after it’s merged, we’ll have to do an OpenLineage release as well - perhaps next week?

      @@ -46259,7 +46265,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:40:41
      -

      *Thread Reply:* 👍

      +

      *Thread Reply:* 👍

      @@ -46311,7 +46317,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:34:10
      -

      *Thread Reply:* Hm - what version of dbt are you using?

      +

      *Thread Reply:* Hm - what version of dbt are you using?

      @@ -46337,7 +46343,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:47:50
      -

      *Thread Reply:* @Tien Nguyen The dbt schema version changes with different versions of dbt. If you have recently updated, you may have to make some changes: https://docs.getdbt.com/docs/guides/migration-guide/upgrading-to-v1.0

      +

      *Thread Reply:* @Tien Nguyen The dbt schema version changes with different versions of dbt. If you have recently updated, you may have to make some changes: https://docs.getdbt.com/docs/guides/migration-guide/upgrading-to-v1.0

      docs.getdbt.com
      @@ -46388,7 +46394,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:48:27
      -

      *Thread Reply:* also make sure you are on the latest version of openlineage-dbt - I believe we have made it a bit more tolerant of dbt schema changes.

      +

      *Thread Reply:* also make sure you are on the latest version of openlineage-dbt - I believe we have made it a bit more tolerant of dbt schema changes.

      @@ -46414,7 +46420,7 @@

      logger = logging.getLogger(name)

      2022-04-01 13:52:46
      -

      *Thread Reply:* @Ross Turk Thank you very much for your answer. I will update those and see if I can resolve the issues.

      +

      *Thread Reply:* @Ross Turk Thank you very much for your answer. I will update those and see if I can resolve the issues.

      @@ -46440,7 +46446,7 @@

      logger = logging.getLogger(name)

      2022-04-01 14:20:00
      -

*Thread Reply:* @Ross Turk Thank you very much for your help. The latest version of dbt didn't work, but version 0.20.0 works for this problem.

+

*Thread Reply:* @Ross Turk Thank you very much for your help. The latest version of dbt didn't work, but version 0.20.0 works for this problem.

      @@ -46466,7 +46472,7 @@

      logger = logging.getLogger(name)

      2022-04-01 14:22:42
      -

      *Thread Reply:* Hmm. Interesting, I remember when dbt 1.0 came out we fixed a very similar issue: https://github.com/OpenLineage/OpenLineage/pull/397

      +

      *Thread Reply:* Hmm. Interesting, I remember when dbt 1.0 came out we fixed a very similar issue: https://github.com/OpenLineage/OpenLineage/pull/397

      @@ -46532,7 +46538,7 @@

      logger = logging.getLogger(name)

      2022-04-01 14:25:17
      -

      *Thread Reply:* if you run pip3 list | grep openlineage-dbt, what version does it show?

      +

      *Thread Reply:* if you run pip3 list | grep openlineage-dbt, what version does it show?

      @@ -46558,7 +46564,7 @@

      logger = logging.getLogger(name)

      2022-04-01 14:26:26
      -

      *Thread Reply:* I wonder if you have somehow ended up with an older version of the integration

      +

      *Thread Reply:* I wonder if you have somehow ended up with an older version of the integration

      @@ -46584,7 +46590,7 @@

      logger = logging.getLogger(name)

      2022-04-01 14:33:43
      -

      *Thread Reply:* it is 0.1.0

      +

      *Thread Reply:* it is 0.1.0

      @@ -46610,7 +46616,7 @@

      logger = logging.getLogger(name)

      2022-04-01 14:34:23
      -

*Thread Reply:* Is 0.1.0 an older version of openlineage?

+

*Thread Reply:* Is 0.1.0 an older version of openlineage?

      @@ -46636,7 +46642,7 @@

      logger = logging.getLogger(name)

      2022-04-01 14:43:14
      -

      *Thread Reply:* ❯ pip3 list | grep openlineage-dbt +

      *Thread Reply:* ❯ pip3 list | grep openlineage-dbt openlineage-dbt 0.6.2

      @@ -46663,7 +46669,7 @@

      logger = logging.getLogger(name)

      2022-04-01 14:43:26
      -

      *Thread Reply:* the latest is 0.6.2 - that might be your issue

      +

      *Thread Reply:* the latest is 0.6.2 - that might be your issue

      @@ -46689,7 +46695,7 @@

      logger = logging.getLogger(name)

      2022-04-01 14:43:59
      -

      *Thread Reply:* How are you going about installing it?

      +

      *Thread Reply:* How are you going about installing it?

      @@ -46715,7 +46721,7 @@

      logger = logging.getLogger(name)

      2022-04-01 18:35:26
      -

*Thread Reply:* @Ross Turk. I followed the instructions from OpenLineage: "pip3 install openlineage-dbt"

+

*Thread Reply:* @Ross Turk. I followed the instructions from OpenLineage: "pip3 install openlineage-dbt"

      @@ -46741,7 +46747,7 @@

      logger = logging.getLogger(name)

      2022-04-01 18:36:00
      -

      *Thread Reply:* Hm! Interesting. I did the same thing to get 0.6.2.

      +

      *Thread Reply:* Hm! Interesting. I did the same thing to get 0.6.2.

      @@ -46767,7 +46773,7 @@

      logger = logging.getLogger(name)

      2022-04-01 18:51:36
      -

*Thread Reply:* @Ross Turk Yes. I have tried to reinstall and clear the cache but it still installs 0.1.0

+

*Thread Reply:* @Ross Turk Yes. I have tried to reinstall and clear the cache but it still installs 0.1.0

      @@ -46793,7 +46799,7 @@

      logger = logging.getLogger(name)

      2022-04-01 18:53:07
      -

*Thread Reply:* But thanks for the version. I reinstalled 0.6.2 by specifying the version

+

*Thread Reply:* But thanks for the version. I reinstalled 0.6.2 by specifying the version

      @@ -46866,7 +46872,7 @@

      logger = logging.getLogger(name)

      2022-04-03 15:07:09
      -

*Thread Reply:* they'll work with the new parser - added tests for those

+

*Thread Reply:* they'll work with the new parser - added tests for those

      @@ -46917,7 +46923,7 @@

      logger = logging.getLogger(name)

      2022-04-03 15:07:39
      -

      *Thread Reply:* btw, thank you very much for notifying us about multiple bugs @Marco Diaz!

      +

      *Thread Reply:* btw, thank you very much for notifying us about multiple bugs @Marco Diaz!

      @@ -46943,7 +46949,7 @@

      logger = logging.getLogger(name)

      2022-04-03 15:20:55
      -

*Thread Reply:* @Maciej Obuchowski thank you for making sure these cases are taken into account. I am getting more familiar with the OpenLineage code as I build my extractors. If I see anything else I will let you know. Any ETA on the new parser release date?

+

*Thread Reply:* @Maciej Obuchowski thank you for making sure these cases are taken into account. I am getting more familiar with the OpenLineage code as I build my extractors. If I see anything else I will let you know. Any ETA on the new parser release date?

      @@ -46969,7 +46975,7 @@

      logger = logging.getLogger(name)

      2022-04-03 15:55:28
      -

*Thread Reply:* it should be a week or two, unless anything comes up

+

*Thread Reply:* it should be a week or two, unless anything comes up

      @@ -46995,7 +47001,7 @@

      logger = logging.getLogger(name)

      2022-04-03 17:10:02
      -

      *Thread Reply:* I see. Keeping my fingers crossed this is the only thing delaying me right now.

      +

      *Thread Reply:* I see. Keeping my fingers crossed this is the only thing delaying me right now.

      @@ -47047,7 +47053,7 @@

      logger = logging.getLogger(name)

      2022-04-03 15:02:13
      -

      *Thread Reply:* current one handles cases where you have one CTE (like this test) but not multiple - next one will handle arbitrary number of CTEs (like this test)

      +

      *Thread Reply:* current one handles cases where you have one CTE (like this test) but not multiple - next one will handle arbitrary number of CTEs (like this test)

      @@ -47099,7 +47105,7 @@

      logger = logging.getLogger(name)

      2022-04-04 11:11:53
      -

      *Thread Reply:* I've mentioned it before but I want to talk a bit about new SQL parser

      +

      *Thread Reply:* I've mentioned it before but I want to talk a bit about new SQL parser

      @@ -47129,7 +47135,7 @@

      logger = logging.getLogger(name)

      2022-04-04 13:25:17
      -

      *Thread Reply:* Will the parser be released after the 13?

      +

      *Thread Reply:* Will the parser be released after the 13?

      @@ -47155,7 +47161,7 @@

      logger = logging.getLogger(name)

      2022-04-08 11:47:05
      -

      *Thread Reply:* @Michael Robinson added additional item to Agenda - client transports feature that we'll have in next release

      +

      *Thread Reply:* @Michael Robinson added additional item to Agenda - client transports feature that we'll have in next release

      @@ -47185,7 +47191,7 @@

      logger = logging.getLogger(name)

      2022-04-08 12:56:44
      -

      *Thread Reply:* Thanks, Maciej

      +

      *Thread Reply:* Thanks, Maciej

      @@ -47241,7 +47247,7 @@

      logger = logging.getLogger(name)

      2022-04-05 10:28:08
      -

      *Thread Reply:* I had kind of the same question. +

*Thread Reply:* I had kind of the same question. 
I found https://marquezproject.github.io/marquez/openapi.html#tag/Lineage
With some of the entries marked Deprecated, I am not sure how to proceed.

      @@ -47269,7 +47275,7 @@

      logger = logging.getLogger(name)

      2022-04-05 11:55:35
      -

      *Thread Reply:* Hey folks, are you looking for the OpenAPI specification found here?

      +

      *Thread Reply:* Hey folks, are you looking for the OpenAPI specification found here?

      @@ -47295,7 +47301,7 @@

      logger = logging.getLogger(name)

      2022-04-05 15:33:23
      -

      *Thread Reply:* @Patrick Mol, Marquez's deprecated endpoints were the old methods for creating lineage (making jobs, dataset, and runs independently), they were deprecated because we moved over to using the OpenLineage spec for all lineage collection purposes.

      +

      *Thread Reply:* @Patrick Mol, Marquez's deprecated endpoints were the old methods for creating lineage (making jobs, dataset, and runs independently), they were deprecated because we moved over to using the OpenLineage spec for all lineage collection purposes.

      The GET methods for jobs/datasets/etc are still functional

      @@ -47323,7 +47329,7 @@

      logger = logging.getLogger(name)

      2022-04-05 21:10:39
      -

      *Thread Reply:* Hey John,

      +

      *Thread Reply:* Hey John,

Thanks for sharing the OpenAPI docs. I was wondering if there is any means to set up an OpenLineage API that will receive events without a consumer like Marquez, or is it essential to always pair with a consumer to receive the events?

      @@ -47351,7 +47357,7 @@

      logger = logging.getLogger(name)

      2022-04-05 21:47:13
      -

*Thread Reply:* the OpenLineage integrations don’t have any way to receive events, since they’re designed to send events to other apps - what were you expecting OpenLineage to do?

+

*Thread Reply:* the OpenLineage integrations don’t have any way to receive events, since they’re designed to send events to other apps - what were you expecting OpenLineage to do?

      Marquez is our reference implementation of an OpenLineage consumer, but egeria also has a functional endpoint

      @@ -47379,7 +47385,7 @@

      logger = logging.getLogger(name)

      2022-04-06 09:53:31
      -

      *Thread Reply:* Hi @John Thomas, +

*Thread Reply:* Hi @John Thomas, 
Would creation of Sources and Datasets have an equivalent in the OpenLineage specification? So far I only see the Inputs and Outputs in the Run Event spec.

      @@ -47407,7 +47413,7 @@

      logger = logging.getLogger(name)

      2022-04-06 11:31:10
      -

      *Thread Reply:* Inputs and outputs in the OL spec are Datasets in the old MZ spec, so they're equivalent

      +

      *Thread Reply:* Inputs and outputs in the OL spec are Datasets in the old MZ spec, so they're equivalent

      @@ -47461,7 +47467,7 @@

      logger = logging.getLogger(name)

      2022-04-05 15:16:34
      -

      *Thread Reply:* Hi Marco - it looks like LivyOperator itself does derive from BaseOperator, have you seen any other errors around this problem?

      +

      *Thread Reply:* Hi Marco - it looks like LivyOperator itself does derive from BaseOperator, have you seen any other errors around this problem?

      @Maciej Obuchowski might be more help here

      @@ -47489,7 +47495,7 @@

      logger = logging.getLogger(name)

      2022-04-05 15:21:03
      -

*Thread Reply:* It is the operators that inherit from LivyOperator. It doesn't find the parameters like sql, connection, etc.

+

*Thread Reply:* It is the operators that inherit from LivyOperator. It doesn't find the parameters like sql, connection, etc.

      @@ -47515,7 +47521,7 @@

      logger = logging.getLogger(name)

      2022-04-05 15:25:42
      -

      *Thread Reply:* My guess is that operators that inherit from other operators (not baseoperator) will have the same problem

      +

      *Thread Reply:* My guess is that operators that inherit from other operators (not baseoperator) will have the same problem

      @@ -47541,7 +47547,7 @@

      logger = logging.getLogger(name)

      2022-04-05 15:32:13
      -

      *Thread Reply:* interesting! I'm not sure about that. I can look into it if I have time, but Maciej is definitely the person who would know the most.

      +

      *Thread Reply:* interesting! I'm not sure about that. I can look into it if I have time, but Maciej is definitely the person who would know the most.

      @@ -47567,7 +47573,7 @@

      logger = logging.getLogger(name)

      2022-04-06 15:49:48
      -

      *Thread Reply:* @Marco Diaz I wonder - perhaps it would be better to instrument spark with OpenLineage. It doesn’t seem that Airflow will know much about what’s happening underneath here. Have you looked into openlineage-spark?

      +

      *Thread Reply:* @Marco Diaz I wonder - perhaps it would be better to instrument spark with OpenLineage. It doesn’t seem that Airflow will know much about what’s happening underneath here. Have you looked into openlineage-spark?

      @@ -47593,7 +47599,7 @@

      logger = logging.getLogger(name)

      2022-04-06 15:51:57
      -

*Thread Reply:* I have not tried that library yet. I need to see how to implement it because we have several custom Spark operators that use Livy

+

*Thread Reply:* I have not tried that library yet. I need to see how to implement it because we have several custom Spark operators that use Livy

      @@ -47619,7 +47625,7 @@

      logger = logging.getLogger(name)

      2022-04-06 15:52:59
      -

      *Thread Reply:* Do you have any examples?

      +

      *Thread Reply:* Do you have any examples?

      @@ -47645,7 +47651,7 @@

      logger = logging.getLogger(name)

      2022-04-06 15:54:01
      -

      *Thread Reply:* there is a good blog post from @Michael Collado: https://openlineage.io/blog/openlineage-spark/

      +

      *Thread Reply:* there is a good blog post from @Michael Collado: https://openlineage.io/blog/openlineage-spark/

      openlineage.io
      @@ -47692,7 +47698,7 @@

      logger = logging.getLogger(name)

      2022-04-06 15:54:37
      -

      *Thread Reply:* and the doc page here has a good overview: +

      *Thread Reply:* and the doc page here has a good overview: https://openlineage.io/integration/apache-spark/

      @@ -47740,7 +47746,7 @@

      logger = logging.getLogger(name)

      2022-04-06 16:38:15
      -

      *Thread Reply:* is this all we need to pass? +

*Thread Reply:* is this all we need to pass? 
spark-submit --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener" \ --packages "io.openlineage:openlineage_spark:0.2.+" \ --conf "spark.openlineage.host=http://<your_ol_endpoint>" \ @@ -47771,7 +47777,7 @@

      logger = logging.getLogger(name)

      2022-04-06 16:38:49
      -

      *Thread Reply:* If so, yes our operators have a way to pass configurations to spark and we may be able to implement it.

      +

      *Thread Reply:* If so, yes our operators have a way to pass configurations to spark and we may be able to implement it.

      @@ -47797,7 +47803,7 @@

      logger = logging.getLogger(name)

      2022-04-06 16:41:27
      -

      *Thread Reply:* Looks right to me

      +

      *Thread Reply:* Looks right to me

      @@ -47823,7 +47829,7 @@

      logger = logging.getLogger(name)

      2022-04-06 16:42:03
      -

      *Thread Reply:* Will give it a try

      +

      *Thread Reply:* Will give it a try

      @@ -47849,7 +47855,7 @@

      logger = logging.getLogger(name)

      2022-04-06 16:42:50
      -

      *Thread Reply:* Do we have to install the library on the spark side or the airflow side?

      +

      *Thread Reply:* Do we have to install the library on the spark side or the airflow side?

      @@ -47875,7 +47881,7 @@

      logger = logging.getLogger(name)

      2022-04-06 16:42:58
      -

*Thread Reply:* I assume it is the spark side

+

*Thread Reply:* I assume it is the spark side

      @@ -47901,7 +47907,7 @@

      logger = logging.getLogger(name)

      2022-04-06 16:44:25
      -

*Thread Reply:* The --packages argument tells spark where to get the jar (you'll want to upgrade to 0.6.1)

+

*Thread Reply:* The --packages argument tells spark where to get the jar (you'll want to upgrade to 0.6.1)
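
Putting both corrections together, a hedged sketch of the submit command - the endpoint and namespace are placeholders, and note the artifact id is spelled openlineage-spark:

spark-submit \
  --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener" \
  --packages "io.openlineage:openlineage-spark:0.6.1" \
  --conf "spark.openlineage.host=http://your_ol_endpoint" \
  --conf "spark.openlineage.namespace=my_namespace" \
  your_job.py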

      @@ -47927,7 +47933,7 @@

      logger = logging.getLogger(name)

      2022-04-06 16:44:54
      -

      *Thread Reply:* sounds good

      +

      *Thread Reply:* sounds good

      @@ -47979,7 +47985,7 @@

      logger = logging.getLogger(name)

      2022-04-06 04:54:27
      -

      *Thread Reply:* @Will Johnson

      +

      *Thread Reply:* @Will Johnson

      @@ -48005,7 +48011,7 @@

      logger = logging.getLogger(name)

      2022-04-07 12:43:27
      -

      *Thread Reply:* Hey @Varun Singh! We are building a github repository that deploys a few resources that will support a limited number of Azure data sources being pushed into Azure Purview. You can expect a public release near the end of the month! Feel free to direct message me if you'd like more details!

      +

      *Thread Reply:* Hey @Varun Singh! We are building a github repository that deploys a few resources that will support a limited number of Azure data sources being pushed into Azure Purview. You can expect a public release near the end of the month! Feel free to direct message me if you'd like more details!

      @@ -48091,7 +48097,7 @@

      logger = logging.getLogger(name)

      2022-04-07 01:00:43
      -

      *Thread Reply:* Are both airflow2 and Marquez installed locally on your computer?

      +

      *Thread Reply:* Are both airflow2 and Marquez installed locally on your computer?

      @@ -48117,7 +48123,7 @@

      logger = logging.getLogger(name)

      2022-04-07 09:04:19
      -

      *Thread Reply:* yes Marco

      +

      *Thread Reply:* yes Marco

      @@ -48143,7 +48149,7 @@

      logger = logging.getLogger(name)

      2022-04-07 15:00:18
      -

      *Thread Reply:* can you open marquez on +

      *Thread Reply:* can you open marquez on <http://localhost:3000>

      @@ -48170,7 +48176,7 @@

      logger = logging.getLogger(name)

      2022-04-07 15:00:40
      -

      *Thread Reply:* and get a response from +

      *Thread Reply:* and get a response from <http://localhost:5000/api/v1/namespaces>
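
A quick sanity check from the machine running Airflow, assuming the default local ports:

curl http://localhost:5000/api/v1/namespaces   # should return a JSON list of namespaces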

      @@ -48197,7 +48203,7 @@

      logger = logging.getLogger(name)

      2022-04-07 15:26:41
      -

*Thread Reply:* yes, I used this guide https://openlineage.io/getting-started and executed a POST to marquez correctly

+

*Thread Reply:* yes, I used this guide https://openlineage.io/getting-started and executed a POST to marquez correctly

      openlineage.io
      @@ -48244,7 +48250,7 @@

      logger = logging.getLogger(name)

      2022-04-07 22:17:34
      -

*Thread Reply:* In theory you should receive events in jobs under the airflow namespace

+

*Thread Reply:* In theory you should receive events in jobs under the airflow namespace

      @@ -48305,7 +48311,7 @@

      logger = logging.getLogger(name)

      2022-04-07 14:59:06
      -

      *Thread Reply:* It looks like you need to add a payment method to your DBT account

      +

      *Thread Reply:* It looks like you need to add a payment method to your DBT account

      @@ -48357,7 +48363,7 @@

      logger = logging.getLogger(name)

      2022-04-11 12:50:48
      -

      *Thread Reply:* It does, but admittedly not very well. It can't recognize what you're doing inside your tasks. The good news is that we're working on it and long term everything should work well.

      +

      *Thread Reply:* It does, but admittedly not very well. It can't recognize what you're doing inside your tasks. The good news is that we're working on it and long term everything should work well.

      @@ -48387,7 +48393,7 @@

      logger = logging.getLogger(name)

      2022-04-11 12:58:28
      -

      *Thread Reply:* Thanks for the quick reply Maciej.

      +

      *Thread Reply:* Thanks for the quick reply Maciej.

      @@ -48447,7 +48453,7 @@

      logger = logging.getLogger(name)

      2022-04-12 13:56:53
      -

      *Thread Reply:* Hi Sandeep,

      +

      *Thread Reply:* Hi Sandeep,

      1&3: We don't currently have Hive or Presto on the roadmap! The best way to start the conversation around them would be to create a proposal in the OpenLineage repo, outlining your thoughts on implementation and benefits.

      @@ -48484,7 +48490,7 @@

      logger = logging.getLogger(name)

      2022-04-12 15:39:23
      -

      *Thread Reply:* Thank you John +

*Thread Reply:* Thank you John 
The standard facets page links to the github issues currently

      @@ -48511,7 +48517,7 @@

      logger = logging.getLogger(name)

      2022-04-12 15:40:33
      2022-04-12 15:41:01
      -

      *Thread Reply:* Will check it out thank you

      +

      *Thread Reply:* Will check it out thank you

      @@ -48660,7 +48666,7 @@

      logger = logging.getLogger(name)

      2022-04-12 16:08:30
      -

      *Thread Reply:* is there anything in your marquez-api logs that might indicate issues?

      +

      *Thread Reply:* is there anything in your marquez-api logs that might indicate issues?

      What guide did you follow to setup the spark integration?

      @@ -48688,7 +48694,7 @@

      logger = logging.getLogger(name)

      2022-04-12 16:10:07
      -

      *Thread Reply:* Followed this guide https://openlineage.io/integration/apache-spark/ and used the spark-defaults.conf approach

      +

      *Thread Reply:* Followed this guide https://openlineage.io/integration/apache-spark/ and used the spark-defaults.conf approach

      @@ -48714,7 +48720,7 @@

      logger = logging.getLogger(name)

      2022-04-12 16:11:04
      -

*Thread Reply:* The logs from the dataproc side show no errors, let me check from the marquez api side +

*Thread Reply:* The logs from the dataproc side show no errors, let me check from the marquez api side. 
To confirm, we should be able to see the datasets from the marquez UI with the spark integration, right?

      @@ -48741,7 +48747,7 @@

      logger = logging.getLogger(name)

      2022-04-12 16:11:50
      -

      *Thread Reply:* I'm not super familiar with the spark integration, since I work more with airflow - I'd start with looking through the readme for the spark integration here

      +

      *Thread Reply:* I'm not super familiar with the spark integration, since I work more with airflow - I'd start with looking through the readme for the spark integration here

      @@ -48767,7 +48773,7 @@

      logger = logging.getLogger(name)

      2022-04-12 16:14:44
      -

      *Thread Reply:* Hmm, the readme says it aims to generate the input and output datasets

      +

      *Thread Reply:* Hmm, the readme says it aims to generate the input and output datasets

      @@ -48793,7 +48799,7 @@

      logger = logging.getLogger(name)

      2022-04-12 16:40:38
      -

      *Thread Reply:* Are you looking at the same namespace?

      +

      *Thread Reply:* Are you looking at the same namespace?

      @@ -48819,7 +48825,7 @@

      logger = logging.getLogger(name)

      2022-04-12 16:40:51
      -

      *Thread Reply:* Yes, the same one where I can see the job

      +

      *Thread Reply:* Yes, the same one where I can see the job

      @@ -48845,7 +48851,7 @@

      logger = logging.getLogger(name)

      2022-04-12 16:54:49
      -

      *Thread Reply:* Tailing the API logs and rerunning the spark job now to hopefully catch errors if any, will ping back here

      +

      *Thread Reply:* Tailing the API logs and rerunning the spark job now to hopefully catch errors if any, will ping back here

      @@ -48871,7 +48877,7 @@

      logger = logging.getLogger(name)

      2022-04-12 17:01:10
      -

*Thread Reply:* Don’t see any failures in the logs, any suggestions on how to debug this?

+

*Thread Reply:* Don’t see any failures in the logs, any suggestions on how to debug this?

      @@ -48897,7 +48903,7 @@

      logger = logging.getLogger(name)

      2022-04-12 17:08:24
      -

      *Thread Reply:* I'd next set up a basic spark notebook and see if you can't get it to send dataset information on something simple in order to check if it's a setup issue or a problem with your spark job specifically

      +

      *Thread Reply:* I'd next set up a basic spark notebook and see if you can't get it to send dataset information on something simple in order to check if it's a setup issue or a problem with your spark job specifically

      @@ -48923,7 +48929,7 @@

      logger = logging.getLogger(name)

      2022-04-12 17:14:43
      -

      *Thread Reply:* ok, that sounds good, will try that

      +

      *Thread Reply:* ok, that sounds good, will try that

      @@ -48949,7 +48955,7 @@

      logger = logging.getLogger(name)

      2022-04-12 17:16:06
      -

*Thread Reply:* before that, I see that the spark-lineage integration posts lineage to the api +

*Thread Reply:* before that, I see that the spark-lineage integration posts lineage to the api 
https://marquezproject.github.io/marquez/openapi.html#tag/Lineage/paths/~1lineage/post
We don’t seem to add a DataSet in this; does marquez internally create this “dataset” based on Output and fields?

      @@ -48977,7 +48983,7 @@

      logger = logging.getLogger(name)

      2022-04-12 17:16:34
      -

      *Thread Reply:* yeah, you should be seeing "input" and "output" in the runEvents - that's where datasets come from

      +

      *Thread Reply:* yeah, you should be seeing "input" and "output" in the runEvents - that's where datasets come from

      @@ -49003,7 +49009,7 @@

      logger = logging.getLogger(name)

      2022-04-12 17:17:00
      -

      *Thread Reply:* I'm not sure if it's a problem with your specific spark job or with the integration itself, however

      +

      *Thread Reply:* I'm not sure if it's a problem with your specific spark job or with the integration itself, however

      @@ -49029,7 +49035,7 @@

      logger = logging.getLogger(name)

      2022-04-12 17:19:16
      -

*Thread Reply:* By runEvents, do you mean a job Object or lineage Object? +

*Thread Reply:* By runEvents, do you mean a job Object or lineage Object? 
The integration seems to be only POSTing lineage objects

      @@ -49056,7 +49062,7 @@

      logger = logging.getLogger(name)

      2022-04-12 17:20:34
      -

      *Thread Reply:* yep, a runEvent is body that gets POSTed to the /lineage endpoint:

      +

      *Thread Reply:* yep, a runEvent is body that gets POSTed to the /lineage endpoint:

      https://openlineage.io/docs/openapi/
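
For reference, a minimal runEvent body with one input and one output - the runId is any UUID, and the dataset names here are hypothetical:

{
  "eventType": "COMPLETE",
  "eventTime": "2022-04-12T17:20:00.000Z",
  "run": { "runId": "d46e465b-d358-4d32-83d4-df660ff614dd" },
  "job": { "namespace": "spark_integration", "name": "my_spark_job" },
  "inputs": [ { "namespace": "gs://my-bucket", "name": "raw/events" } ],
  "outputs": [ { "namespace": "gs://my-bucket", "name": "curated/events" } ],
  "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.6.2/integration/spark"
}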

      @@ -49088,7 +49094,7 @@

      logger = logging.getLogger(name)

      2022-04-12 17:41:01
      -

      *Thread Reply:* > Yes, the same one where I can see the job +

*Thread Reply:* > Yes, the same one where I can see the job 
I think you should look at another namespace, whose name depends on which systems you're actually using

      @@ -49115,7 +49121,7 @@

      logger = logging.getLogger(name)

      2022-04-12 17:48:24
      -

*Thread Reply:* Shouldn’t the dataset be created in the same namespace we define in the spark properties?

+

*Thread Reply:* Shouldn’t the dataset be created in the same namespace we define in the spark properties?

      @@ -49141,7 +49147,7 @@

      logger = logging.getLogger(name)

      2022-04-15 10:19:06
      -

      *Thread Reply:* I found few datasets in the table location, I ran it in a similar (hive metastore, gcs, sparksql and scala spark jobs) setup to the one mentioned in this post https://openlineage.slack.com/archives/C01CK9T7HKR/p1649967405659519

      +

      *Thread Reply:* I found few datasets in the table location, I ran it in a similar (hive metastore, gcs, sparksql and scala spark jobs) setup to the one mentioned in this post https://openlineage.slack.com/archives/C01CK9T7HKR/p1649967405659519

      @@ -49288,7 +49294,7 @@

      logger = logging.getLogger(name)

      2022-04-15 10:17:55
      -

      *Thread Reply:* This was my experience as well, I was under the impression we would see the table as a dataset. +

      *Thread Reply:* This was my experience as well, I was under the impression we would see the table as a dataset. Looking forward to understanding the expected behavior

      @@ -49315,7 +49321,7 @@

      logger = logging.getLogger(name)

      2022-04-15 10:39:34
      -

      *Thread Reply:* relevant: https://github.com/OpenLineage/OpenLineage/issues/435

      +

      *Thread Reply:* relevant: https://github.com/OpenLineage/OpenLineage/issues/435

      @@ -49371,7 +49377,7 @@

      logger = logging.getLogger(name)

      2022-04-15 12:36:08
      -

      *Thread Reply:* Ah! Thank you both for confirming this! And it's great to see the proposal, Maciej!

      +

      *Thread Reply:* Ah! Thank you both for confirming this! And it's great to see the proposal, Maciej!

      @@ -49397,7 +49403,7 @@

      logger = logging.getLogger(name)

      2022-06-10 12:37:41
      -

*Thread Reply:* Is there a timeline around when we can expect this fix?

      +

*Thread Reply:* Is there a timeline around when we can expect this fix?

      @@ -49423,7 +49429,7 @@

      logger = logging.getLogger(name)

      2022-06-10 12:46:47
      -

      *Thread Reply:* Not a simple fix, but I guess we'll start working on this relatively soon.

      +

      *Thread Reply:* Not a simple fix, but I guess we'll start working on this relatively soon.

      @@ -49449,7 +49455,7 @@

      logger = logging.getLogger(name)

      2022-06-10 13:10:31
      -

*Thread Reply:* I see, thanks for the update! We are very much interested in this feature.

      +

*Thread Reply:* I see, thanks for the update! We are very much interested in this feature.

      @@ -49566,7 +49572,7 @@

      logger = logging.getLogger(name)

      2022-04-19 13:40:54
      -

      *Thread Reply:* You probably need to change dataset from default

      +

      *Thread Reply:* You probably need to change dataset from default

      @@ -49592,7 +49598,7 @@

      logger = logging.getLogger(name)

      2022-04-19 13:47:04
      -

*Thread Reply:* I clicked on everything 😕 I manually created a namespace (joining the pod and sending curl to the local Marquez endpoint) to check whether there was a network issue; that was OK. I created a namespace called data-dev. Airflow is mounted on k8s using the Helm chart. +

*Thread Reply:* I clicked on everything 😕 I manually created a namespace (joining the pod and sending curl to the local Marquez endpoint) to check whether there was a network issue; that was OK. I created a namespace called data-dev. Airflow is mounted on k8s using the Helm chart.
``` config:
  AIRFLOW__WEBSERVER__BASE_URL: "http://airflow.dev.test.io"
  PYTHONPATH: "/opt/airflow/dags/repo/config"
@@ -49644,7 +49650,7 @@

      logger = logging.getLogger(name)

      2022-04-19 15:16:47
      -

*Thread Reply:* I think the answer is somewhere in the Airflow logs 🙂 +

*Thread Reply:* I think the answer is somewhere in the Airflow logs 🙂 For some reason, the OpenLineage events aren't being sent to Marquez.

      @@ -49671,7 +49677,7 @@

      logger = logging.getLogger(name)

      2022-04-20 11:08:09
      -

*Thread Reply:* Thanks, in the end it was my error... I created a dummy DAG to see if maybe it was an issue with the DAG, and now I can see something in Marquez

      +

*Thread Reply:* Thanks, in the end it was my error... I created a dummy DAG to see if maybe it was an issue with the DAG, and now I can see something in Marquez

      @@ -49732,7 +49738,7 @@

      logger = logging.getLogger(name)

      2022-04-20 08:20:35
      -

      *Thread Reply:* That's more of a Marquez question 🙂 +

      *Thread Reply:* That's more of a Marquez question 🙂 We have a long-standing issue to add that API https://github.com/MarquezProject/marquez/issues/1736

      @@ -49759,7 +49765,7 @@

      logger = logging.getLogger(name)

      2022-04-20 09:32:19
      -

*Thread Reply:* I see it already got skipped for 2 releases, and my only conclusion is that people using Marquez don't make mistakes - ergo, API not needed 🙂 Let's see if I can stick around the project long enough to offer a bit of help; now I just need to showcase it and get interest in my org.

      +

*Thread Reply:* I see it already got skipped for 2 releases, and my only conclusion is that people using Marquez don't make mistakes - ergo, API not needed 🙂 Let's see if I can stick around the project long enough to offer a bit of help; now I just need to showcase it and get interest in my org.

      @@ -49794,7 +49800,7 @@

      logger = logging.getLogger(name)

      Any thoughts?

      - + @@ -49803,7 +49809,7 @@

      logger = logging.getLogger(name)

      - + @@ -49919,7 +49925,7 @@

      logger = logging.getLogger(name)

      2022-04-20 18:11:19
      -

      *Thread Reply:* Namespace is defined by you, it does not have to be name of the spark cluster.

      +

      *Thread Reply:* Namespace is defined by you, it does not have to be name of the spark cluster.

      @@ -49945,7 +49951,7 @@

      logger = logging.getLogger(name)

      2022-04-20 18:11:42
      -

      *Thread Reply:* And I definitely recommend to use newer version than 0.2.+ 🙂

      +

      *Thread Reply:* And I definitely recommend to use newer version than 0.2.+ 🙂

      @@ -49971,7 +49977,7 @@

      logger = logging.getLogger(name)

      2022-04-20 18:13:32
      -

*Thread Reply:* oh I see that someone mentioned that it has to be replaced with the name of the Spark cluster

      +

*Thread Reply:* oh I see that someone mentioned that it has to be replaced with the name of the Spark cluster

      @@ -49997,7 +50003,7 @@

      logger = logging.getLogger(name)

      2022-04-20 18:13:57
      -

      *Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1634089656188400?thread_ts=1634085740.187700&cid=C01CK9T7HKR

      +

      *Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1634089656188400?thread_ts=1634085740.187700&cid=C01CK9T7HKR

      @@ -50064,7 +50070,7 @@

      logger = logging.getLogger(name)

      2022-04-20 18:19:19
      -

*Thread Reply:* @Maciej Obuchowski may I know if I can add the --packages "io.openlineage:openlineage_spark:0.2.+" dependency as part of the Spark jar file, that is, as part of the pom.xml?

      +

*Thread Reply:* @Maciej Obuchowski may I know if I can add the --packages "io.openlineage:openlineage_spark:0.2.+" dependency as part of the Spark jar file, that is, as part of the pom.xml?

      @@ -50090,7 +50096,7 @@

      logger = logging.getLogger(name)

      2022-04-21 03:54:25
      -

      *Thread Reply:* I think it needs to run on the driver

      +

      *Thread Reply:* I think it needs to run on the driver
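For reference, a hedged sketch of wiring the integration up from PySpark code instead of spark-submit flags; the artifact coordinates, host, and namespace are illustrative and, as noted above, the listener runs on the driver:

```python
# Hedged sketch: attaching the OpenLineage listener via SparkSession config.
# spark.jars.packages is the programmatic equivalent of --packages and must
# be set before the session (and its JVM) starts.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lineage-example")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.8.2")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.host", "http://localhost:5000")
    .config("spark.openlineage.namespace", "my-namespace")
    .getOrCreate()
)
```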

      @@ -50145,7 +50151,7 @@

      logger = logging.getLogger(name)

      2022-04-21 07:16:44
      -

      *Thread Reply:* > OpenLineage API is very limited in attributes that can be passed. +

*Thread Reply:* > OpenLineage API is very limited in attributes that can be passed. Can you specify where you think it's limited? The way to solve those problems would be to evolve OpenLineage.

> One practical question/example: how do we create a job of type STREAMING,
@@ -50177,7 +50183,7 @@

      logger = logging.getLogger(name)

      2022-04-21 07:23:11
      -

      *Thread Reply:* Hey Maciej

      +

      *Thread Reply:* Hey Maciej

      first off - thanks for being active on the channel!

      @@ -50214,7 +50220,7 @@

      logger = logging.getLogger(name)

      2022-04-21 07:53:59
      -

*Thread Reply:* > I just gave an example of how you would express a specific job type creation +

*Thread Reply:* > I just gave an example of how you would express a specific job type creation
Yes, but you're trying to achieve something by passing this parameter or creating a job in a certain way. We're trying to cover everything in the OpenLineage API. Even if we don't have everything yet, the spec has from the beginning been designed to allow emitting custom data via the custom facet mechanism.

> I have the feeling I'm still missing some key concepts on how OpenLineage is designed.
@@ -50244,7 +50250,7 @@

      logger = logging.getLogger(name)

      2022-04-21 11:29:20
      -

      *Thread Reply:* > Any pointers are welcome! +

      *Thread Reply:* > Any pointers are welcome! BTW: OpenLineage is an open standard. Everyone is welcome to contribute and discuss. Every feedback ultimately helps us build better systems.

      @@ -50271,7 +50277,7 @@

      logger = logging.getLogger(name)

      2022-04-22 03:32:48
      -

*Thread Reply:* I agree, but for now I'm more likely to be in the "I didn't get it" category, and not in the "brilliant new idea" category 🙂

      +

*Thread Reply:* I agree, but for now I'm more likely to be in the "I didn't get it" category, and not in the "brilliant new idea" category 🙂

My temporary goal is to go over the documentation and write down the gaps that confused me (and the solutions), and maybe publish that as an article for a wider audience. So far I have realized that:
• I don't get the naming convention - it became clearer that it's important with the Naming examples, but more info is needed
@@ -50339,7 +50345,7 @@

      logger = logging.getLogger(name)

      2022-04-21 10:21:33
      -

      *Thread Reply:* Driver logs from Databricks:

      +

      *Thread Reply:* Driver logs from Databricks:

      ```22/04/21 14:05:07 INFO EventEmitter: Lineage completed successfully: ResponseMessage(responseCode=200, body={}, error=null) {"eventType":"COMPLETE",[...], "schemaURL":"https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunEvent"}

      @@ -50391,7 +50397,7 @@

      logger = logging.getLogger(name)

      2022-04-21 11:32:37
      -

      *Thread Reply:* @Karatuğ Ozan BİRCAN are you running on Spark 3.2? If yes, then new release should have fixed your problem: https://github.com/OpenLineage/OpenLineage/issues/609

      +

      *Thread Reply:* @Karatuğ Ozan BİRCAN are you running on Spark 3.2? If yes, then new release should have fixed your problem: https://github.com/OpenLineage/OpenLineage/issues/609

      @@ -50417,7 +50423,7 @@

      logger = logging.getLogger(name)

      2022-04-21 11:33:15
      -

      *Thread Reply:* Spark 3.1.2 with Scala 2.12

      +

      *Thread Reply:* Spark 3.1.2 with Scala 2.12

      @@ -50443,7 +50449,7 @@

      logger = logging.getLogger(name)

      2022-04-21 11:33:50
      -

      *Thread Reply:* In fact, I couldn't make it work in Spark 3.2. But I'll test it again. Thanks for the info.

      +

      *Thread Reply:* In fact, I couldn't make it work in Spark 3.2. But I'll test it again. Thanks for the info.

      @@ -50469,7 +50475,7 @@

      logger = logging.getLogger(name)

      2022-05-20 16:15:47
      -

      *Thread Reply:* Has this been resolved? +

      *Thread Reply:* Has this been resolved? I am facing the same issue with spark 3.2.

      @@ -50522,7 +50528,7 @@

      logger = logging.getLogger(name)

      2022-04-21 15:34:24
      -

      *Thread Reply:* I don't think that the facets are particularly strongly defined, but I would expect that it could be possible to see both on a pythonOperator that's executing SQL queries, depending on how the extractor was written

      +

      *Thread Reply:* I don't think that the facets are particularly strongly defined, but I would expect that it could be possible to see both on a pythonOperator that's executing SQL queries, depending on how the extractor was written

      @@ -50548,7 +50554,7 @@

      logger = logging.getLogger(name)

      2022-04-21 15:34:45
      -

      *Thread Reply:* ah sure, that makes sense

      +

      *Thread Reply:* ah sure, that makes sense

      @@ -50600,7 +50606,7 @@

      logger = logging.getLogger(name)

      2022-04-21 16:17:59
      -

      *Thread Reply:* We're actively working on it - expect it in next OpenLineage release. https://github.com/OpenLineage/OpenLineage/pull/645

      +

      *Thread Reply:* We're actively working on it - expect it in next OpenLineage release. https://github.com/OpenLineage/OpenLineage/pull/645

      @@ -50660,7 +50666,7 @@

      logger = logging.getLogger(name)

      2022-04-21 16:24:16
      -

      *Thread Reply:* nice -thanks!

      +

      *Thread Reply:* nice -thanks!

      @@ -50686,7 +50692,7 @@

      logger = logging.getLogger(name)

      2022-04-21 16:25:19
      -

      *Thread Reply:* Assuming we don't need to do anything except using the next update? Or do you expect that we need to change quite a lot of configs?

      +

      *Thread Reply:* Assuming we don't need to do anything except using the next update? Or do you expect that we need to change quite a lot of configs?

      @@ -50712,7 +50718,7 @@

      logger = logging.getLogger(name)

      2022-04-21 17:44:46
      -

      *Thread Reply:* No, it should be automatic.

      +

      *Thread Reply:* No, it should be automatic.

      @@ -50766,7 +50772,7 @@

      logger = logging.getLogger(name)

      2022-04-25 03:52:20
      -

      *Thread Reply:* Hey Will,

      +

      *Thread Reply:* Hey Will,

      while I would not consider myself in the team, I'm dabbling in OL, hitting walls and learning as I go. If I don't have enough experience to contribute, I'd be happy to at least proof-read and point out things which are not clear from a novice perspective. Let me know!

      @@ -50798,7 +50804,7 @@

      logger = logging.getLogger(name)

      2022-04-25 13:49:48
      -

      *Thread Reply:* I'll hold you to that @Mirko Raca 😉

      +

      *Thread Reply:* I'll hold you to that @Mirko Raca 😉

      @@ -50824,7 +50830,7 @@

      logger = logging.getLogger(name)

      2022-04-25 17:18:02
      -

      *Thread Reply:* I will support! I’ve done a few recent presentations on the internals of OpenLineage that might also be useful - maybe some diagrams can be reused.

      +

      *Thread Reply:* I will support! I’ve done a few recent presentations on the internals of OpenLineage that might also be useful - maybe some diagrams can be reused.

      @@ -50850,7 +50856,7 @@

      logger = logging.getLogger(name)

      2022-04-25 17:56:44
      -

      *Thread Reply:* Any chance you have links to those old presentations? Would be great to build off of an existing one and then update for some of the new naming conventions.

      +

      *Thread Reply:* Any chance you have links to those old presentations? Would be great to build off of an existing one and then update for some of the new naming conventions.

      @@ -50876,7 +50882,7 @@

      logger = logging.getLogger(name)

      2022-04-25 18:00:26
      -

      *Thread Reply:* the most recent one was an astronomer webinar

      +

      *Thread Reply:* the most recent one was an astronomer webinar

      happy to share the slides with you if you want 👍 here’s a PDF:

      @@ -50917,7 +50923,7 @@

      logger = logging.getLogger(name)

      2022-04-25 18:00:44
      -

      *Thread Reply:* the other ones have not been public, unfortunately 😕

      +

      *Thread Reply:* the other ones have not been public, unfortunately 😕

      @@ -50943,7 +50949,7 @@

      logger = logging.getLogger(name)

      2022-04-25 18:02:24
      -

      *Thread Reply:* architecture, object model, run lifecycle, naming conventions == the basics IMO

      +

      *Thread Reply:* architecture, object model, run lifecycle, naming conventions == the basics IMO

      @@ -50969,7 +50975,7 @@

      logger = logging.getLogger(name)

      2022-04-26 09:14:42
      -

      *Thread Reply:* Thank you so much, Ross! This is a great base to work from.

      +

      *Thread Reply:* Thank you so much, Ross! This is a great base to work from.

      @@ -51081,7 +51087,7 @@

      logger = logging.getLogger(name)

      2022-04-27 00:36:05
      -

      *Thread Reply:* For a spark job like that, you'd have at least four events:

      +

      *Thread Reply:* For a spark job like that, you'd have at least four events:

      1. START event - This represents the SparkSQLExecutionStart
      2. START event #2 - This represents a JobStart event
3. COMPLETE event - This represents a JobEnd event
4. COMPLETE event #2 - This represents a SparkSQLExecutionEnd event
For CSV to Parquet, you should be seeing inputs and outputs that match across each event. OpenLineage scans the logical plan and reports back the inputs / outputs / metadata across the different facets for each event BECAUSE each event might give you some different information.
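To make the pairing concrete, here is a hedged skeleton of those four payloads for a CSV-to-Parquet job (fields heavily trimmed; the runId and paths are illustrative):

```python
# The shared runId ties the four events together; the inputs/outputs should
# line up across all of them.
run = {"runId": "11111111-2222-3333-4444-555555555555"}
io = {
    "inputs": [{"namespace": "file", "name": "/data/raw/events.csv"}],
    "outputs": [{"namespace": "file", "name": "/data/curated/events"}],
}
events = [
    {"eventType": "START", "run": run, **io},     # SparkSQLExecutionStart
    {"eventType": "START", "run": run, **io},     # JobStart
    {"eventType": "COMPLETE", "run": run, **io},  # JobEnd
    {"eventType": "COMPLETE", "run": run, **io},  # SparkSQLExecutionEnd
]
```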
@@ -51115,7 +51121,7 @@

        logger = logging.getLogger(name)

      2022-04-27 21:51:07
      -

*Thread Reply:* Hi @Will Johnson, good evening. We are seeing an issue while using the Spark integration: when we provide the openlineage.host property a value like <http://lineage.com/common/marquez>, where my Marquez API is running, I see that the line below modifies the host to become <http://lineage.com/api/v1/lineage> instead of <http://lineage.com/common/marquez/api/v1/lineage>, which is causing the problem +

*Thread Reply:* Hi @Will Johnson, good evening. We are seeing an issue while using the Spark integration: when we provide the openlineage.host property a value like <http://lineage.com/common/marquez>, where my Marquez API is running, I see that the line below modifies the host to become <http://lineage.com/api/v1/lineage> instead of <http://lineage.com/common/marquez/api/v1/lineage>, which is causing the problem
https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/src/main/common/java/io/openlineage/spark/agent/EventEmitter.java#L49
I see that it was added 5 months ago and released as part of 0.4.0. Is there any way we can fix the line to be like the below?
this.lineageURI =
@@ -51175,7 +51181,7 @@

      logger = logging.getLogger(name)

      2022-04-28 14:31:42
      -

*Thread Reply:* Can you open up a GitHub issue for this? I had this same issue, so our implementation always has to feature the /api/v1/lineage. The host config is literally the host: you're specifying a host and a path. I'd be happy to see greater flexibility with the API endpoint, but the /v1/ is important to know which version of OpenLineage's specification you're communicating with.

      +

*Thread Reply:* Can you open up a GitHub issue for this? I had this same issue, so our implementation always has to feature the /api/v1/lineage. The host config is literally the host: you're specifying a host and a path. I'd be happy to see greater flexibility with the API endpoint, but the /v1/ is important to know which version of OpenLineage's specification you're communicating with.

      @@ -51227,7 +51233,7 @@

      logger = logging.getLogger(name)

      2022-04-27 15:10:24
      -

      *Thread Reply:* @Michael Collado made one for a recent webinar:

      +

      *Thread Reply:* @Michael Collado made one for a recent webinar:

      https://gist.github.com/collado-mike/d1854958b7b1672f5a494933f80b8b58

      @@ -51255,7 +51261,7 @@

      logger = logging.getLogger(name)

      2022-04-27 15:11:38
      -

      *Thread Reply:* it's not exactly for an operator that has source-destination, but it shows how to format lineage events for a few different kinds of datasets

      +

      *Thread Reply:* it's not exactly for an operator that has source-destination, but it shows how to format lineage events for a few different kinds of datasets

      @@ -51281,7 +51287,7 @@

      logger = logging.getLogger(name)

      2022-04-27 15:51:32
      -

      *Thread Reply:* Thanks! I'm going to take a look

      +

      *Thread Reply:* Thanks! I'm going to take a look

      @@ -51337,7 +51343,7 @@

      logger = logging.getLogger(name)

      2022-04-28 17:44:00
      -

      *Thread Reply:* Thanks for your input. The release is authorized. Look for it tomorrow!

      +

      *Thread Reply:* Thanks for your input. The release is authorized. Look for it tomorrow!

      @@ -51390,7 +51396,7 @@

      logger = logging.getLogger(name)

      2022-04-28 14:41:10
      -

      *Thread Reply:* What's the spark job that's running - this looks similar to an error that can happen when jobs have a very short lifecycle

      +

      *Thread Reply:* What's the spark job that's running - this looks similar to an error that can happen when jobs have a very short lifecycle

      @@ -51416,7 +51422,7 @@

      logger = logging.getLogger(name)

      2022-04-28 14:47:27
      -

*Thread Reply:* nothing in the Spark job, it's just a simple CSV to Parquet conversion

      +

*Thread Reply:* nothing in the Spark job, it's just a simple CSV to Parquet conversion

      @@ -51442,7 +51448,7 @@

      logger = logging.getLogger(name)

      2022-04-28 14:48:50
      -

*Thread Reply:* ah yeah, that's probably it - when the job finishes before the OpenLineage integration can poll it for information, this error is thrown. Since the job is very quick, it creates a race condition

      +

*Thread Reply:* ah yeah, that's probably it - when the job finishes before the OpenLineage integration can poll it for information, this error is thrown. Since the job is very quick, it creates a race condition

      @@ -51472,7 +51478,7 @@

      logger = logging.getLogger(name)

      2022-05-03 17:16:39
      -

      *Thread Reply:* @John Thomas may i know how to solve this kind of issue?

      +

      *Thread Reply:* @John Thomas may i know how to solve this kind of issue?

      @@ -51498,7 +51504,7 @@

      logger = logging.getLogger(name)

      2022-05-03 17:20:11
      -

      *Thread Reply:* This is probably an issue with the integration - for now you can either open an issue, or see if you're still getting a subset of events and take it as is. I'm not sure what you could do on your end aside from adding a sleep call or similar

      +

      *Thread Reply:* This is probably an issue with the integration - for now you can either open an issue, or see if you're still getting a subset of events and take it as is. I'm not sure what you could do on your end aside from adding a sleep call or similar

      @@ -51524,7 +51530,7 @@

      logger = logging.getLogger(name)

      2022-05-03 17:21:17
      -

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/src/main/common/java/io/openlineage/spark/agent/OpenLineageSparkListener.java#L151 did you mean that if we add a sleep in this method, it will solve this?

      +

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/src/main/common/java/io/openlineage/spark/agent/OpenLineageSparkListener.java#L151 did you mean that if we add a sleep in this method, it will solve this?

      @@ -51575,7 +51581,7 @@

      logger = logging.getLogger(name)

      2022-05-03 18:44:43
      -

      *Thread Reply:* oh no I meant making sure your jobs don't close too quickly

      +

      *Thread Reply:* oh no I meant making sure your jobs don't close too quickly

      @@ -51601,7 +51607,7 @@

      logger = logging.getLogger(name)

      2022-05-06 00:14:15
      -

*Thread Reply:* Hi @John Thomas, we figured out the error: it was indeed caused by conflicting versions. With shadowJar and shading, we are not seeing it anymore.

      +

*Thread Reply:* Hi @John Thomas, we figured out the error: it was indeed caused by conflicting versions. With shadowJar and shading, we are not seeing it anymore.

      @@ -51661,7 +51667,7 @@

      logger = logging.getLogger(name)

      2022-04-29 18:41:37
      -

      *Thread Reply:* Amazing work on the new sql parser @Maciej Obuchowski 💯 :firstplacemedal:

      +

      *Thread Reply:* Amazing work on the new sql parser @Maciej Obuchowski 💯 :firstplacemedal:

      @@ -51747,7 +51753,7 @@

      logger = logging.getLogger(name)

      2022-05-03 04:12:09
      -

      *Thread Reply:* Hi @Hubert Dulay,

      +

      *Thread Reply:* Hi @Hubert Dulay,

      while I'm not an expert, I can offer the following: • Marquez has had the but what I got here - that API is not encouraged @@ -51779,7 +51785,7 @@

      logger = logging.getLogger(name)

      2022-05-03 21:19:06
      -

      *Thread Reply:* Thanks for the details

      +

      *Thread Reply:* Thanks for the details

      @@ -51831,7 +51837,7 @@

      logger = logging.getLogger(name)

      2022-05-02 13:26:52
      -

      *Thread Reply:* Hi Kostikey - this blog has an example with Spark and jupyter, which might be a good place to start!

      +

      *Thread Reply:* Hi Kostikey - this blog has an example with Spark and jupyter, which might be a good place to start!

      openlineage.io
      @@ -51878,7 +51884,7 @@

      logger = logging.getLogger(name)

      2022-05-02 14:58:29
      -

      *Thread Reply:* Hi @John Thomas, thanks for the reply. I think I am close but my cluster is unable to talk to the marquez server. After looking at log4j I see the following rows:

      +

      *Thread Reply:* Hi @John Thomas, thanks for the reply. I think I am close but my cluster is unable to talk to the marquez server. After looking at log4j I see the following rows:

      22/05/02 18:43:39 INFO SparkContext: Registered listener io.openlineage.spark.agent.OpenLineageSparkListener 22/05/02 18:43:40 INFO EventEmitter: Init OpenLineageContext: Args: ArgumentParser(host=<http://135.170.226.91:8400>, version=v1, namespace=gus-namespace, jobName=default, parentRunId=null, apiKey=Optional.empty, urlParams=Optional[{}]) URI: <http://135.170.226.91:8400/api/v1/lineage>? @@ -51955,7 +51961,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:03:40
      -

      *Thread Reply:* hmmm, interesting - if it's easy could you spin both up locally and check that it's just a communication issue? It helps with diagnosis

      +

      *Thread Reply:* hmmm, interesting - if it's easy could you spin both up locally and check that it's just a communication issue? It helps with diagnosis

      It might also be a firewall issue, but your cURL should preclude that

      @@ -51983,7 +51989,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:05:38
      -

      *Thread Reply:* Since it's Databricks I was having a hard time figuring out how to try locally. Other than just using plain 'ol spark on my laptop and a localhost Marquez...

      +

      *Thread Reply:* Since it's Databricks I was having a hard time figuring out how to try locally. Other than just using plain 'ol spark on my laptop and a localhost Marquez...

      @@ -52009,7 +52015,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:07:13
      -

      *Thread Reply:* hmm, that could be an interesting test to see if it's a databricks issue - the databricks integration is pretty much the same as the spark integration, just with a little bit of a wrapper and the init script

      +

      *Thread Reply:* hmm, that could be an interesting test to see if it's a databricks issue - the databricks integration is pretty much the same as the spark integration, just with a little bit of a wrapper and the init script

      @@ -52035,7 +52041,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:08:44
      -

*Thread Reply:* yeah, I was going to try that, but it just didn't seem like helpful troubleshooting for exactly that reason... but I may just do that anyway so I can see something working 🙂 (morale booster)

      +

*Thread Reply:* yeah, I was going to try that, but it just didn't seem like helpful troubleshooting for exactly that reason... but I may just do that anyway so I can see something working 🙂 (morale booster)

      @@ -52061,7 +52067,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:09:22
      -

      *Thread Reply:* oh totally! Network issues are a huge pain in the ass, and if you're still seeing issues locally with spark/mz then we'll know a lot more than we do now 🙂

      +

      *Thread Reply:* oh totally! Network issues are a huge pain in the ass, and if you're still seeing issues locally with spark/mz then we'll know a lot more than we do now 🙂

      @@ -52087,7 +52093,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:11:19
      -

      *Thread Reply:* sounds good, i will give it a go!

      +

      *Thread Reply:* sounds good, i will give it a go!

      @@ -52113,7 +52119,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:16:16
      -

      *Thread Reply:* @Kostikey Mustakas - I think spark.openlineage.version should be equal to 1 not v1.

      +

      *Thread Reply:* @Kostikey Mustakas - I think spark.openlineage.version should be equal to 1 not v1.

      In addition, is http://135.170.226.91:8400 accessible to Databricks? Could you try doing a %sh command inside of a databricks notebook and see if you can ping that IP address (https://linux.die.net/man/8/ping)?
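For reference, the corrected cluster Spark config would then look something like this (keys per the Spark integration docs of that era; the host and namespace are taken from the log above, with the version value fixed to 1):

```
spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.host <http://135.170.226.91:8400>
spark.openlineage.version 1
spark.openlineage.namespace gus-namespace
```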

      @@ -52164,7 +52170,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:19:22
      -

*Thread Reply:* Yeah, I meant to ask about that. The docs say 1 like you mention: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/databricks. I second-guessed it because of this thread https://openlineage.slack.com/archives/C01CK9T7HKR/p1638848249159700.

      +

*Thread Reply:* Yeah, I meant to ask about that. The docs say 1 like you mention: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/databricks. I second-guessed it because of this thread https://openlineage.slack.com/archives/C01CK9T7HKR/p1638848249159700.

      @@ -52225,7 +52231,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:23:42
      -

      *Thread Reply:* @Will Johnson, ping fails... this is surprising as the curl command mentioned above works fine.

      +

      *Thread Reply:* @Will Johnson, ping fails... this is surprising as the curl command mentioned above works fine.

      @@ -52251,7 +52257,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:37:00
      -

      *Thread Reply:* I’m also trying to set up Databricks according to Running Marquez on AWS. Right now I’m stuck on the database part rather than the Marquez part — I can’t connect my EKS cluster to the RDS database which I described in more detail on the Marquez slack.

      +

      *Thread Reply:* I’m also trying to set up Databricks according to Running Marquez on AWS. Right now I’m stuck on the database part rather than the Marquez part — I can’t connect my EKS cluster to the RDS database which I described in more detail on the Marquez slack.

      @Kostikey Mustakas Sorry for the distraction, but I’m curious how you have set up your networking to make the API requests work with Databricks. Good luck with your issue!

      @@ -52280,7 +52286,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:47:17
      -

*Thread Reply:* @Julius Rentergent We are using Azure and leverage Private Endpoints to connect resources in separate subscriptions. There is a Bastion proxy in place that we can map HTTP traffic through, and I have a Load Balancer Inbound NAT rule I set up that maps one of our whitelisted ports (8400) to 5000.

      +

*Thread Reply:* @Julius Rentergent We are using Azure and leverage Private Endpoints to connect resources in separate subscriptions. There is a Bastion proxy in place that we can map HTTP traffic through, and I have a Load Balancer Inbound NAT rule I set up that maps one of our whitelisted ports (8400) to 5000.

      @@ -52310,7 +52316,7 @@

      logger = logging.getLogger(name)

      2022-05-02 20:15:01
      -

      *Thread Reply:* @Will Johnson a little progress maybe... I created a private endpoint and updated dns to point to it. Now I get a 404 Not Found error instead of a timeout

      +

      *Thread Reply:* @Will Johnson a little progress maybe... I created a private endpoint and updated dns to point to it. Now I get a 404 Not Found error instead of a timeout

      @@ -52340,7 +52346,7 @@

      logger = logging.getLogger(name)

      2022-05-02 20:16:41
      -

      *Thread Reply:* 22/05/03 00:09:24 ERROR EventEmitter: Could not emit lineage [responseCode=404]: {"eventType":"START","eventTime":"2022-05-03T00:09:22.498Z","run":{"runId":"f41575a0-e59d-4cbc-a401-9b52d2b020e0","facets":{"spark_version":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","spark-version":"3.2.1","openlineage_spark_version":"0.8.1"},"spark.logicalPlan":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","plan":[{"class":"org.apache.spark.sql.catalyst.plans.logical.ShowNamespaces","num-children":1,"namespace":0,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"databaseName","dataType":"string","nullable":false,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":4,"jvmId":"aad3656d_8903_4db3_84f0_fe6d773d71c3"},"qualifier":[]}]]},{"class":"org.apache.spark.sql.catalyst.analysis.ResolvedNamespace","num_children":0,"catalog":null,"namespace":[]}]},"spark_unknown":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","output":{"description":"Unable to serialize logical plan due to: Infinite recursion (StackOverflowError) (through reference chain: org.apache.spark.sql.catalyst.expressions.AttributeReference[\"preCanonicalized\"] .... +

      *Thread Reply:* 22/05/03 00:09:24 ERROR EventEmitter: Could not emit lineage [responseCode=404]: {"eventType":"START","eventTime":"2022-05-03T00:09:22.498Z","run":{"runId":"f41575a0-e59d-4cbc-a401-9b52d2b020e0","facets":{"spark_version":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","spark-version":"3.2.1","openlineage_spark_version":"0.8.1"},"spark.logicalPlan":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","plan":[{"class":"org.apache.spark.sql.catalyst.plans.logical.ShowNamespaces","num-children":1,"namespace":0,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"databaseName","dataType":"string","nullable":false,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":4,"jvmId":"aad3656d_8903_4db3_84f0_fe6d773d71c3"},"qualifier":[]}]]},{"class":"org.apache.spark.sql.catalyst.analysis.ResolvedNamespace","num_children":0,"catalog":null,"namespace":[]}]},"spark_unknown":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.8.1/integration/spark>","_schemaURL":"<https://openlineage.io/spec/1-0-2/OpenLineage.json#/$defs/RunFacet>","output":{"description":"Unable to serialize logical plan due to: Infinite recursion (StackOverflowError) (through reference chain: org.apache.spark.sql.catalyst.expressions.AttributeReference[\"preCanonicalized\"] .... OpenLineageHttpException(code=null, message={"code":404,"message":"HTTP 404 Not Found"}, details=null) at io.openlineage.spark.agent.EventEmitter.emit(EventEmitter.java:68)

      @@ -52368,7 +52374,7 @@

      logger = logging.getLogger(name)

      2022-05-27 00:03:30
      -

      *Thread Reply:* Following up on this as I encounter the same issue with the Openlineage Databricks integration. This issue seems quite malicious as it crashes the Spark Context and requires a restart.

      +

      *Thread Reply:* Following up on this as I encounter the same issue with the Openlineage Databricks integration. This issue seems quite malicious as it crashes the Spark Context and requires a restart.

      I have marquez running on AWS EKS; I’m using Openlineage 0.8.2 on Databricks 10.4 (Spark 3.2.1) and my Spark config looks like this: spark.openlineage.host <https://internal-xxxxxxxxxxxxxxxxxxxxxxx-xxxxxxxxxxx.us-east-1.elb.amazonaws.com> @@ -52410,7 +52416,7 @@

      logger = logging.getLogger(name)

      2022-05-27 00:22:27
      -

      *Thread Reply:* I tried two more things: +

*Thread Reply:* I tried two more things:
• curl works, ping fails, just like in the previous report
• Databricks allows providing spark configs without quotes, whereas quotes are generally required for Spark. So I added the quotes to the host name, but now I'm getting: ERROR OpenLineageSparkListener: Unable to parse open lineage endpoint. Lineage events will not be collected

      @@ -52438,7 +52444,7 @@

      logger = logging.getLogger(name)

      2022-05-27 14:00:38
      -

      *Thread Reply:* @Kostikey Mustakas May I ask what is the reason for migration from Palantir? Sorry for this off-topic question!

      +

      *Thread Reply:* @Kostikey Mustakas May I ask what is the reason for migration from Palantir? Sorry for this off-topic question!

      @@ -52464,7 +52470,7 @@

      logger = logging.getLogger(name)

      2022-05-30 05:46:27
      -

*Thread Reply:* @Julius Rentergent created an issue on the project GitHub: https://github.com/OpenLineage/OpenLineage/issues/795

      +

*Thread Reply:* @Julius Rentergent created an issue on the project GitHub: https://github.com/OpenLineage/OpenLineage/issues/795

      @@ -52547,7 +52553,7 @@

      logger = logging.getLogger(name)

      2022-06-01 11:15:26
      -

      *Thread Reply:* Thank you @Maciej Obuchowski. +

      *Thread Reply:* Thank you @Maciej Obuchowski. Just to clarify, the Spark Context crashes with and without port; it’s just that adding the port causes it to crash more quickly (on the 1st attempt).

      I will run some more experiments when I have time, and add the results to the ticket.

      @@ -52616,7 +52622,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:09:42
      -

      *Thread Reply:* @Michael Robinson - just to be sure, is the 5/19 meeting at 10 AM PT as well?

      +

      *Thread Reply:* @Michael Robinson - just to be sure, is the 5/19 meeting at 10 AM PT as well?

      @@ -52642,7 +52648,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:14:11
      -

      *Thread Reply:* Yes, and I’ll update the msg for others. Thank you

      +

      *Thread Reply:* Yes, and I’ll update the msg for others. Thank you

      @@ -52668,7 +52674,7 @@

      logger = logging.getLogger(name)

      2022-05-02 15:16:25
      -

      *Thread Reply:* Thank you!

      +

      *Thread Reply:* Thank you!

      @@ -52746,7 +52752,7 @@

      logger = logging.getLogger(name)

      2022-05-04 07:04:48
      -

      *Thread Reply:* It seems we can't for now. This was the same question I had last week:

      +

      *Thread Reply:* It seems we can't for now. This was the same question I had last week:

      https://github.com/MarquezProject/marquez/issues/1736

      logger = logging.getLogger(name)
      2022-05-04 10:56:35
      -

*Thread Reply:* Seems that it's a really popular request 🙂

      +

*Thread Reply:* Seems that it's a really popular request 🙂

      @@ -52876,7 +52882,7 @@

      logger = logging.getLogger(name)

      2022-05-04 11:28:48
      -

      *Thread Reply:* @Mirko Raca pointed out that I was missing eventType.

      +

      *Thread Reply:* @Mirko Raca pointed out that I was missing eventType.

      Mirko Raca : "From a quick glance - you're missing "eventType": "START", attribute. It's also worth noting that metadata typically shows up after the second event (type COMPLETE)"

      @@ -52937,7 +52943,7 @@

      logger = logging.getLogger(name)

      2022-05-06 05:26:16
      -

*Thread Reply:* As far as I understand, OpenLineage has tools to extract metadata from sources. Depending on your source, you may find an existing integration; if one doesn't exist, you should write your own integration (and collaborate with the project)

      +

*Thread Reply:* As far as I understand, OpenLineage has tools to extract metadata from sources. Depending on your source, you may find an existing integration; if one doesn't exist, you should write your own integration (and collaborate with the project)

      @@ -52963,7 +52969,7 @@

      logger = logging.getLogger(name)

      2022-05-06 12:59:06
      -

      *Thread Reply:* @Sandeep Bhat take a look at https://openlineage.io/integration - there is some info there on the different integrations that can be used to automatically pull metadata.

      +

      *Thread Reply:* @Sandeep Bhat take a look at https://openlineage.io/integration - there is some info there on the different integrations that can be used to automatically pull metadata.

      openlineage.io
      @@ -53010,7 +53016,7 @@

      logger = logging.getLogger(name)

      2022-05-06 13:00:39
      -

      *Thread Reply:* The Airflow integration, in particular, uses a SQL parser to determine input/output tables (in cases where the data store can't be queried for that info)

      +

      *Thread Reply:* The Airflow integration, in particular, uses a SQL parser to determine input/output tables (in cases where the data store can't be queried for that info)

      @@ -53062,7 +53068,7 @@

      logger = logging.getLogger(name)

      2022-05-12 05:48:43
      -

*Thread Reply:* To my understanding - datasets model the structure, not the content. So, as long as your table doesn't change its number of columns, it's the same thing.

      +

*Thread Reply:* To my understanding - datasets model the structure, not the content. So, as long as your table doesn't change its number of columns, it's the same thing.

      The catch-all would be to create a Dataset facet which would record the distinction between append/overwrite per run. But, while this is supported by the standard, Marquez does not handle custom facets at the moment (I'll happily be corrected).

      @@ -53090,7 +53096,7 @@

      logger = logging.getLogger(name)

      2022-05-12 06:05:36
      -

      *Thread Reply:* Thanks, that makes sense. We're looking for a way to get the lineage of table contents. We may have to opt for new names on overwrite, or indeed extend a facet to flag these.

      +

      *Thread Reply:* Thanks, that makes sense. We're looking for a way to get the lineage of table contents. We may have to opt for new names on overwrite, or indeed extend a facet to flag these.

      @@ -53116,7 +53122,7 @@

      logger = logging.getLogger(name)

      2022-05-12 06:06:44
      -

*Thread Reply:* The use case is compliance, where we need to show how a certain delivered data product (at a given point in time) was constructed. We have all our transforms/transfers as code, but there are a few parts where datasets get recreated in the process after fixes have been made, and I wouldn't want to bother the auditors with those stray paths

      +

*Thread Reply:* The use case is compliance, where we need to show how a certain delivered data product (at a given point in time) was constructed. We have all our transforms/transfers as code, but there are a few parts where datasets get recreated in the process after fixes have been made, and I wouldn't want to bother the auditors with those stray paths

      @@ -53142,7 +53148,7 @@

      logger = logging.getLogger(name)

      2022-05-12 06:12:09
      -

*Thread Reply:* We have the LifecycleStateChangeDataset facet that captures this information. It's currently emitted when using the Spark integration

      +

*Thread Reply:* We have the LifecycleStateChangeDataset facet that captures this information. It's currently emitted when using the Spark integration
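For anyone emitting events by hand, a hedged sketch of attaching that facet to an output dataset with the Python client (assuming a client version that ships the facet; namespace and name are illustrative):

```python
# Hedged sketch: marking an output dataset as overwritten via the
# lifecycleStateChange facet.
from openlineage.client.facet import (
    LifecycleStateChange,
    LifecycleStateChangeDatasetFacet,
)
from openlineage.client.run import Dataset

output = Dataset(
    namespace="gs://my-bucket",
    name="curated/events",
    facets={
        "lifecycleStateChange": LifecycleStateChangeDatasetFacet(
            lifecycleStateChange=LifecycleStateChange.OVERWRITE
        )
    },
)
```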

      @@ -53168,7 +53174,7 @@

      logger = logging.getLogger(name)

      2022-05-12 06:13:25
      -

      *Thread Reply:* > But, while this is supported by the standard, Marquez does not handle custom facets at the moment (I'll happily be corrected). +

      *Thread Reply:* > But, while this is supported by the standard, Marquez does not handle custom facets at the moment (I'll happily be corrected). It displays this information when it exists

      @@ -53199,7 +53205,7 @@

      logger = logging.getLogger(name)

      2022-05-12 06:13:29
      -

      *Thread Reply:* Oh that looks perfect! I completely missed that, thanks!

      +

      *Thread Reply:* Oh that looks perfect! I completely missed that, thanks!

      @@ -53251,7 +53257,7 @@

      logger = logging.getLogger(name)

      2022-05-13 05:19:47
      -

*Thread Reply:* The work for Spark is not yet fully merged

      +

*Thread Reply:* The work for Spark is not yet fully merged

      @@ -53367,7 +53373,7 @@

      logger = logging.getLogger(name)

      2022-05-16 11:31:17
      -

*Thread Reply:* I think you can either use the proxy backend: https://github.com/OpenLineage/OpenLineage/tree/main/proxy

      +

*Thread Reply:* I think you can either use the proxy backend: https://github.com/OpenLineage/OpenLineage/tree/main/proxy

or configure the OL client to send data to Kafka: https://github.com/OpenLineage/OpenLineage/tree/main/client/python#kafka
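The Kafka option looks roughly like this; a hedged sketch assuming an openlineage-python version that ships the Kafka transport (see the client README linked above), with broker and topic names as illustrative placeholders:

```python
from openlineage.client import OpenLineageClient
from openlineage.client.transport.kafka import KafkaConfig, KafkaTransport

config = KafkaConfig(
    config={"bootstrap.servers": "broker:9092"},  # passed through to confluent-kafka
    topic="openlineage.events",
    flush=True,
)
# Events emitted through this client go to Kafka instead of an HTTP backend.
client = OpenLineageClient(transport=KafkaTransport(config))
```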

      @@ -53400,7 +53406,7 @@

      logger = logging.getLogger(name)

      2022-05-16 12:15:59
      -

*Thread Reply:* Thank you very much for the useful pointers. The proxy solution could indeed work in our case, but it implies creating another service in front of Kafka, and thus another layer of complexity in the architecture. If there is a more "native" way of streaming events directly from the Airflow backend, that would be great to know

      +

*Thread Reply:* Thank you very much for the useful pointers. The proxy solution could indeed work in our case, but it implies creating another service in front of Kafka, and thus another layer of complexity in the architecture. If there is a more "native" way of streaming events directly from the Airflow backend, that would be great to know

      @@ -53426,7 +53432,7 @@

      logger = logging.getLogger(name)

      2022-05-16 12:37:10
      -

      *Thread Reply:* The second link 😉

      +

      *Thread Reply:* The second link 😉

      @@ -53452,7 +53458,7 @@

      logger = logging.getLogger(name)

      2022-05-17 03:46:03
      -

*Thread Reply:* Sure, we already implemented the Python client for jobs outside Airflow and it works great 🙂 +

*Thread Reply:* Sure, we already implemented the Python client for jobs outside Airflow and it works great 🙂 You are saying that there is a way to use this Python client in conjunction with the MWAA lineage backend to relay the job events that come with the Airflow integration (without including it in the DAGs)? Our strategy is to use both the Airflow backend to collect automatic lineage events without modifying any existing DAGs, and the in-code implementation to allow our data engineers to send their own events if they want to. The second option works perfectly but the first one is where we struggle a bit, especially with MWAA.

      @@ -53481,7 +53487,7 @@

      logger = logging.getLogger(name)

      2022-05-17 05:24:30
      -

      *Thread Reply:* If you can mount file to MWAA, then yes - it should work with config file option: https://github.com/OpenLineage/OpenLineage/tree/main/client/python#config-file

      +

      *Thread Reply:* If you can mount file to MWAA, then yes - it should work with config file option: https://github.com/OpenLineage/OpenLineage/tree/main/client/python#config-file

      @@ -53507,7 +53513,7 @@

      logger = logging.getLogger(name)

      2022-05-17 05:40:45
      -

      *Thread Reply:* Brilliant! I'm going to test that. Thank you Maciej!

      +

      *Thread Reply:* Brilliant! I'm going to test that. Thank you Maciej!

      @@ -53663,7 +53669,7 @@

      logger = logging.getLogger(name)

      2022-05-18 16:34:25
      -

      *Thread Reply:* Hey Tyler - A custom extractor just needs to be able to assemble the runEvents and send the information out to the lineage backends.

      +

      *Thread Reply:* Hey Tyler - A custom extractor just needs to be able to assemble the runEvents and send the information out to the lineage backends.

      If the things you're sending/receiving with TaskFlow are accessible in terms of metadata in the environment the DAG is running in, then you should be able to make one that would work!

      @@ -53695,7 +53701,7 @@

      logger = logging.getLogger(name)

      2022-05-18 16:41:16
      -

*Thread Reply:* TaskFlow internally is just PythonOperator. If you write an extractor that assumes something more than just it being a PythonOperator, then you could probably make it work 🙂

      +

*Thread Reply:* TaskFlow internally is just PythonOperator. If you write an extractor that assumes something more than just it being a PythonOperator, then you could probably make it work 🙂
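For orientation, a hedged sketch of what such an extractor can look like (class and method names follow openlineage-airflow around 0.8.x, mirroring the webinar example discussed below; dataset names are illustrative). It would be registered through the OPENLINEAGE_EXTRACTORS environment variable mentioned later in the thread, e.g. OPENLINEAGE_EXTRACTORS=my_module.MyPythonExtractor:

```python
from typing import List, Optional

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata
from openlineage.client.run import Dataset


class MyPythonExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        # TaskFlow-decorated tasks run as _PythonDecoratedOperator
        return ["_PythonDecoratedOperator"]

    def extract(self) -> Optional[TaskMetadata]:
        return None  # defer to extract_on_complete

    def extract_on_complete(self, task_instance) -> Optional[TaskMetadata]:
        # Assemble whatever metadata is reachable from the finished task
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            inputs=[Dataset(namespace="postgres://db-host:5432", name="public.src_table")],
            outputs=[Dataset(namespace="snowflake://account", name="DB.SCHEMA.DST_TABLE")],
        )
```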

      @@ -53721,7 +53727,7 @@

      logger = logging.getLogger(name)

      2022-05-18 17:15:52
      -

      *Thread Reply:* Thanks @John Thomas @Maciej Obuchowski, Your answers both make sense. I just keep running into this error in my logs: +

      *Thread Reply:* Thanks @John Thomas @Maciej Obuchowski, Your answers both make sense. I just keep running into this error in my logs: [2022-05-18, 20:52:34 UTC] {__init__.py:97} WARNING - Unable to find an extractor. task_type=_PythonDecoratedOperator airflow_dag_id=Postgres_1_to_Snowflake_1_v3 task_id=Postgres_1 airflow_run_id=scheduled__2022-05-18T20:51:34.334045+00:00 The picture is my custom extractor, it's not doing anything currently as this is just a test.

      @@ -53758,7 +53764,7 @@

      logger = logging.getLogger(name)

      2022-05-18 17:16:05
      -

      *Thread Reply:* thanks again for the help yall

      +

      *Thread Reply:* thanks again for the help yall

      @@ -53784,7 +53790,7 @@

      logger = logging.getLogger(name)

      2022-05-18 17:16:34
      -

      *Thread Reply:* did you set the environment variable with the path to your extractor?

      +

      *Thread Reply:* did you set the environment variable with the path to your extractor?

      @@ -53810,7 +53816,7 @@

      logger = logging.getLogger(name)

      2022-05-18 17:16:46
      -

      *Thread Reply:*

      +

      *Thread Reply:*

      @@ -53845,7 +53851,7 @@

      logger = logging.getLogger(name)

      2022-05-18 17:17:13
      -

      *Thread Reply:* i believe thats correct @John Thomas

      +

      *Thread Reply:* i believe thats correct @John Thomas

      @@ -53871,7 +53877,7 @@

      logger = logging.getLogger(name)

      2022-05-18 17:18:35
      -

*Thread Reply:* and the versions I'm using: +

*Thread Reply:* and the versions I'm using: Astronomer Runtime 5.0.0 based on Airflow 2.3.0+astro.1

      @@ -53898,7 +53904,7 @@

      logger = logging.getLogger(name)

      2022-05-18 17:25:58
      -

      *Thread Reply:* this might not be the problem, but you should have only one of extract and extract_on_complete - which one are you meaning to use?

      +

      *Thread Reply:* this might not be the problem, but you should have only one of extract and extract_on_complete - which one are you meaning to use?

      @@ -53924,7 +53930,7 @@

      logger = logging.getLogger(name)

      2022-05-18 17:32:26
      -

      *Thread Reply:* ahh thanks John, as of right now extract_on_complete.

      +

      *Thread Reply:* ahh thanks John, as of right now extract_on_complete.

      This is a similar setup as Michael had in the video.

      @@ -53961,7 +53967,7 @@

      logger = logging.getLogger(name)

      2022-05-18 17:33:31
      -

      *Thread Reply:* if it's still not working I'm not really sure at this point - that's about what I had when I spun up my own custom extractor

      +

      *Thread Reply:* if it's still not working I'm not really sure at this point - that's about what I had when I spun up my own custom extractor

      @@ -53987,7 +53993,7 @@

      logger = logging.getLogger(name)

      2022-05-18 17:39:44
      -

      *Thread Reply:* is there anything in logs regarding extractors?

      +

      *Thread Reply:* is there anything in logs regarding extractors?

      @@ -54013,7 +54019,7 @@

      logger = logging.getLogger(name)

      2022-05-18 17:40:36
      -

      *Thread Reply:* just this: +

      *Thread Reply:* just this: [2022-05-18, 21:36:59 UTC] {__init__.py:97} WARNING - Unable to find an extractor. task_type=_PythonDecoratedOperator airflow_dag_id=competitive_oss_projects_git_to_snowflake task_id=Transform_git_logs_to_S3 airflow_run_id=scheduled__2022-05-18T21:35:57.694690+00:00

      @@ -54040,7 +54046,7 @@

      logger = logging.getLogger(name)

      2022-05-18 17:41:11
      -

      *Thread Reply:* @John Thomas Thanks, I appreciate your help.

      +

      *Thread Reply:* @John Thomas Thanks, I appreciate your help.

      @@ -54066,7 +54072,7 @@

      logger = logging.getLogger(name)

      2022-05-19 06:01:52
      -

      *Thread Reply:* No Failed to import messages?

      +

      *Thread Reply:* No Failed to import messages?

      @@ -54092,8 +54098,8 @@

      logger = logging.getLogger(name)

      2022-05-19 11:26:34
      -

      *Thread Reply: @Maciej Obuchowski None that I can see. Here is the full log: -``` Failed to verify remote log exists s3:///dag_id=Postgres_1_to_Snowflake_1_v3/run_id=scheduled2022-05-19T15:23:49.248097+00:00/task_id=Postgres_1/attempt=1.log. +

      *Thread Reply:* @Maciej Obuchowski None that I can see. Here is the full log: +```* Failed to verify remote log exists s3:///dag_id=Postgres_1_to_Snowflake_1_v3/run_id=scheduled2022-05-19T15:23:49.248097+00:00/task_id=Postgres_1/attempt=1.log. Please provide a bucket_name instead of "s3:///dag_id=Postgres_1_to_Snowflake_1_v3/run_id=scheduled2022-05-19T15:23:49.248097+00:00/task_id=Postgres_1/attempt=1.log" Falling back to local log * Reading local file: /usr/local/airflow/logs/dagid=Postgres1toSnowflake1v3/runid=scheduled2022-05-19T15:23:49.248097+00:00/taskid=Postgres1/attempt=1.log @@ -54163,7 +54169,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 16:57:38
      -

      *Thread Reply:* @Maciej Obuchowski is our ENV var wrong maybe? Do we need to mention the file to import somewhere else that we may have missed?

      +

      *Thread Reply:* @Maciej Obuchowski is our ENV var wrong maybe? Do we need to mention the file to import somewhere else that we may have missed?

      @@ -54189,7 +54195,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-20 10:26:01
      -

      *Thread Reply:* @Josh Owens one thing I can think of is that you might have older openlineage integration version, as OPENLINEAGE_EXTRACTORS variable was added very recently: https://github.com/OpenLineage/OpenLineage/pull/694

      +

      *Thread Reply:* @Josh Owens one thing I can think of is that you might have older openlineage integration version, as OPENLINEAGE_EXTRACTORS variable was added very recently: https://github.com/OpenLineage/OpenLineage/pull/694

      @@ -54215,7 +54221,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-20 11:58:28
      -

      *Thread Reply:* @Maciej Obuchowski, that was it! For some reason, my requirements.txt wasn't pulling the latest version of openlineage-airflow. Working now with 0.8.2

      +

      *Thread Reply:* @Maciej Obuchowski, that was it! For some reason, my requirements.txt wasn't pulling the latest version of openlineage-airflow. Working now with 0.8.2

      @@ -54241,7 +54247,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-20 11:59:01
      -

      *Thread Reply:* 🙌

      +

      *Thread Reply:* 🙌

      @@ -54295,7 +54301,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 05:59:59
      -

*Thread Reply:* Currently, it assumes the latest version. +

*Thread Reply:* Currently, it assumes the latest version. There's an effort with DatasetVersionDatasetFacet to be able to specify it manually - or to extract this information from cases like Iceberg or Delta Lake tables.

      @@ -54322,7 +54328,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 06:14:59
      -

      *Thread Reply:* Ah ok. Is it Marquez assuming the latest version when it records the OpenLineage event?

      +

      *Thread Reply:* Ah ok. Is it Marquez assuming the latest version when it records the OpenLineage event?

      @@ -54348,7 +54354,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 06:18:20
      -

      *Thread Reply:* yes

      +

      *Thread Reply:* yes

      @@ -54378,7 +54384,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 06:54:40
      -

      *Thread Reply:* Thanks, that's very helpful 👍

      +

      *Thread Reply:* Thanks, that's very helpful 👍

      @@ -54466,7 +54472,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 15:27:35
      -

      *Thread Reply:* @Howard Yoo What version of airflow?

      +

      *Thread Reply:* @Howard Yoo What version of airflow?

      @@ -54492,7 +54498,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 15:27:51
      -

      *Thread Reply:* it's 2.3

      +

      *Thread Reply:* it's 2.3

      @@ -54518,7 +54524,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 15:28:42
      -

      *Thread Reply:* (sorry, it's 2.4)

      +

      *Thread Reply:* (sorry, it's 2.4)

      @@ -54544,7 +54550,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 15:29:28
      -

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow I'd refer to the docs again.

+

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow I'd refer to the docs again.

      "Airflow 2.3+ Integration automatically registers itself for Airflow 2.3 if it's installed on Airflow worker's python. This means you don't have to do anything besides configuring it, which is described in Configuration section."

      @@ -54573,7 +54579,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 15:29:53
      -

      *Thread Reply:* Right, configuring I don't see any issues

      +

      *Thread Reply:* Right, configuring I don't see any issues

      @@ -54599,7 +54605,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 15:30:56
      -

*Thread Reply:* so you don't need:

+

*Thread Reply:* so you don't need:

      from openlineage.airflow import DAG

      @@ -54629,7 +54635,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 15:31:41
      -

      *Thread Reply:* Okay... that makes sense then

      +

      *Thread Reply:* Okay... that makes sense then

      @@ -54655,7 +54661,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 15:32:47
      -

*Thread Reply:* so if you need to import DAG it would just be:
from airflow import DAG

      @@ -54686,7 +54692,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-19 15:56:19
      -

      *Thread Reply:* Thanks!

      +

      *Thread Reply:* Thanks!

      @@ -54772,7 +54778,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-20 10:23:55
      -

*Thread Reply:* What changes do you mean by letting openlineage support?
Or, do you mean, to write Apache Camel integration?

      @@ -54799,7 +54805,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-22 19:54:17
      -

*Thread Reply:* @Maciej Obuchowski Yes, let openlineage work the same as it does with airflow

+

*Thread Reply:* @Maciej Obuchowski Yes, let openlineage work the same as it does with airflow

      @@ -54825,7 +54831,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-22 19:56:47
      -

*Thread Reply:* I think this is a very valuable thing. I wish openlineage could support some commonly used pipeline tools, and try to abstract out some general interfaces so that users can extend it by themselves

+

*Thread Reply:* I think this is a very valuable thing. I wish openlineage could support some commonly used pipeline tools, and try to abstract out some general interfaces so that users can extend it by themselves

      @@ -54851,7 +54857,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-23 05:20:30
      -

*Thread Reply:* For Python, we have the OL client, common libraries (well, at least the beginning of them) and the SQL parser

+

*Thread Reply:* For Python, we have the OL client, common libraries (well, at least the beginning of them) and the SQL parser

      @@ -54877,7 +54883,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-23 05:20:44
      -

      *Thread Reply:* As we support more systems, the general libraries will grow as well.

      +

      *Thread Reply:* As we support more systems, the general libraries will grow as well.

      @@ -54939,7 +54945,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-20 20:23:15
      -

      *Thread Reply:* This looks like a bug. we are probably not looking at the right instance in the TaskInstanceListener

      +

      *Thread Reply:* This looks like a bug. we are probably not looking at the right instance in the TaskInstanceListener

      @@ -54965,7 +54971,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-21 14:17:19
      -

      *Thread Reply:* @Howard Yoo I filed: https://github.com/OpenLineage/OpenLineage/issues/767 for this

      +

      *Thread Reply:* @Howard Yoo I filed: https://github.com/OpenLineage/OpenLineage/issues/767 for this

      @@ -55107,7 +55113,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-21 09:42:21
      -

      *Thread Reply:* Thank you so much, Michael!

      +

      *Thread Reply:* Thank you so much, Michael!

      @@ -55159,7 +55165,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-24 06:41:11
      -

      *Thread Reply:* In Python or Java?

      +

      *Thread Reply:* In Python or Java?

      @@ -55185,7 +55191,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-24 06:44:32
      -

      *Thread Reply:* In python just inherit BaseFacet and add _get_schema static method that would point to some place where you have your json schema of a facet. For example our DbtVersionRunFacet

      +

      *Thread Reply:* In python just inherit BaseFacet and add _get_schema static method that would point to some place where you have your json schema of a facet. For example our DbtVersionRunFacet

      In Java you can take a look at Spark's custom facets.
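A minimal sketch of such a custom facet in Python, modeled on the DbtVersionRunFacet pattern; the class name, field, and schema URL here are all hypothetical:

```
import attr
from openlineage.client.facet import BaseFacet

@attr.s
class MyToolVersionRunFacet(BaseFacet):
    version: str = attr.ib()

    @staticmethod
    def _get_schema() -> str:
        # URL of this facet's JSON schema, hosted wherever you publish it
        return "https://example.com/schemas/my-tool-version-run-facet.json"
```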

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -
      2022-05-24 16:40:00
      -

      *Thread Reply:* Thanks, @Maciej Obuchowski, I was asking in regards to Python, sorry I should have clarified.

      +

      *Thread Reply:* Thanks, @Maciej Obuchowski, I was asking in regards to Python, sorry I should have clarified.

      I'm not sure what the disconnect is, but the facets aren't showing up in the inputs and outputs. The Lineage event is sent successfully to my astrocloud.

      @@ -55410,7 +55416,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-25 09:21:02
      -

*Thread Reply:* _get_schema should return the address of the schema hosted somewhere else - afaik sending an object field where the server expects a string field might cause some problems

+

*Thread Reply:* _get_schema should return the address of the schema hosted somewhere else - afaik sending an object field where the server expects a string field might cause some problems

      @@ -55436,7 +55442,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-25 09:21:59
      -

      *Thread Reply:* can you register ManualLineageFacet as facets not as inputFacets or outputFacets?

      +

      *Thread Reply:* can you register ManualLineageFacet as facets not as inputFacets or outputFacets?

      @@ -55462,7 +55468,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-25 13:15:30
      -

*Thread Reply:* Thanks for the advice @Maciej Obuchowski, I was able to get it working!
Also great talk today at the airflow summit.

      @@ -55489,7 +55495,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-25 13:25:17
      -

      *Thread Reply:* Thanks 🙇

      +

      *Thread Reply:* Thanks 🙇

      @@ -55611,7 +55617,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-24 06:35:05
      -

*Thread Reply:* We're also getting data quality from dbt if you're running dbt test or dbt build
https://github.com/OpenLineage/OpenLineage/blob/main/integration/common/openlineage/common/provider/dbt.py#L399

      @@ -55663,7 +55669,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-24 06:37:15
      -

*Thread Reply:* Generally, you'd need to construct DataQualityAssertionsDatasetFacet and/or DataQualityMetricsInputDatasetFacet and attach it to the tested dataset

+

*Thread Reply:* Generally, you'd need to construct DataQualityAssertionsDatasetFacet and/or DataQualityMetricsInputDatasetFacet and attach it to the tested dataset
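A sketch of constructing those two facets with the Python client; the values are made up, and on the dataset they would sit under the dataQualityAssertions and dataQualityMetrics facet keys:

```
from openlineage.client.facet import (
    Assertion,
    ColumnMetric,
    DataQualityAssertionsDatasetFacet,
    DataQualityMetricsInputDatasetFacet,
)

# One passed not_null assertion on the "id" column
assertions = DataQualityAssertionsDatasetFacet(
    assertions=[Assertion(assertion="not_null", success=True, column="id")]
)

# Row count plus per-column metrics for the tested dataset
metrics = DataQualityMetricsInputDatasetFacet(
    rowCount=1500,
    bytes=None,
    columnMetrics={"id": ColumnMetric(nullCount=0, distinctCount=1500)},
)
```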

      @@ -55714,7 +55720,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-24 13:23:34
      -

      *Thread Reply:* Thanks @Maciej Obuchowski!!!

      +

      *Thread Reply:* Thanks @Maciej Obuchowski!!!

      @@ -55871,7 +55877,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-25 16:59:47
      -

      *Thread Reply:* We're working on this feature - should be in the next release from OpenLineage side

      +

      *Thread Reply:* We're working on this feature - should be in the next release from OpenLineage side

      @@ -55901,7 +55907,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-25 17:06:12
      -

      *Thread Reply:* Thanks! I will keep an eye on updates.

      +

      *Thread Reply:* Thanks! I will keep an eye on updates.

      @@ -55999,7 +56005,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-26 06:05:33
      -

      *Thread Reply:* Looks great! Thanks for sharing!

      +

      *Thread Reply:* Looks great! Thanks for sharing!

      @@ -56148,7 +56154,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-30 05:38:41
      -

      *Thread Reply:* Might be that case if you're using Spark 3.2

      +

      *Thread Reply:* Might be that case if you're using Spark 3.2

      @@ -56174,7 +56180,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-30 05:38:54
      -

      *Thread Reply:* There were some changes to those operators

      +

      *Thread Reply:* There were some changes to those operators

      @@ -56200,7 +56206,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-30 05:39:09
      -

      *Thread Reply:* If you're not using 3.2, please share more details 🙂

      +

      *Thread Reply:* If you're not using 3.2, please share more details 🙂

      @@ -56226,7 +56232,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-30 07:58:58
      -

*Thread Reply:* Yep, I'm using Spark version 3.2.1

+

*Thread Reply:* Yep, I'm using Spark version 3.2.1

      @@ -56252,7 +56258,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-30 07:59:35
      -

*Thread Reply:* is it an open issue, or do I have some option to force them to be sent?

+

*Thread Reply:* is it an open issue, or do I have some option to force them to be sent?

      @@ -56278,7 +56284,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-30 07:59:58
      -

      *Thread Reply:* btw thank you for quick response @Maciej Obuchowski

      +

      *Thread Reply:* btw thank you for quick response @Maciej Obuchowski

      @@ -56304,7 +56310,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-05-30 08:00:34
      -

      *Thread Reply:* Yes, we have issue for AlterTable at least

      +

      *Thread Reply:* Yes, we have issue for AlterTable at least

      @@ -56330,7 +56336,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-01 02:52:14
      -

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/616 -> that’s the issue for altering tables in Spark 3.2.
@Ilqar Memmedov Did you mean drop table or drop columns? I am not aware of any drop table issue.

      @@ -56393,7 +56399,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-01 06:03:38
      -

      *Thread Reply:* @Paweł Leszczyński drop table statement.

      +

      *Thread Reply:* @Paweł Leszczyński drop table statement.

      @@ -56419,7 +56425,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-01 06:05:58
      -

*Thread Reply:* To reproduce it, I just create a simple Spark job:
create a table as select from another, select data from the table, and then drop the entire table.
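That sequence as a minimal PySpark sketch; the table names are hypothetical:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("drop-table-repro").getOrCreate()

spark.sql("CREATE TABLE target_table AS SELECT * FROM source_table")  # CTAS
spark.sql("SELECT COUNT(*) FROM target_table").show()                 # read it back
spark.sql("DROP TABLE target_table")                                  # the drop reportedly emits no event
```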

      @@ -56493,7 +56499,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-03 14:20:16
      -

      *Thread Reply:* hi xiang 👋 lineage in airflow depends on the operator. some operators have extractors as part of the integration, but when they are missing you only see job information in Marquez.

      +

      *Thread Reply:* hi xiang 👋 lineage in airflow depends on the operator. some operators have extractors as part of the integration, but when they are missing you only see job information in Marquez.

      @@ -56519,7 +56525,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-03 14:20:51
      -

      *Thread Reply:* take a look at https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#extractors--sending-the-correct-data-from-your-dags for a bit more detail

      +

      *Thread Reply:* take a look at https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#extractors--sending-the-correct-data-from-your-dags for a bit more detail

      @@ -56676,7 +56682,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 08:02:58
      -

      *Thread Reply:* I think internal naming is more important 🙂

      +

      *Thread Reply:* I think internal naming is more important 🙂

      I guess, for now, try to match what the local directory has.

      @@ -56704,7 +56710,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 10:59:39
      -

      *Thread Reply:* Thanks @Maciej Obuchowski

      +

      *Thread Reply:* Thanks @Maciej Obuchowski

      @@ -56809,7 +56815,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 11:00:41
      -

      *Thread Reply:* Those jobs are actually what Spark does underneath 🙂

      +

      *Thread Reply:* Those jobs are actually what Spark does underneath 🙂

      @@ -56835,7 +56841,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 11:00:57
      -

      *Thread Reply:* Are you using Delta Lake btw?

      +

      *Thread Reply:* Are you using Delta Lake btw?

      @@ -56861,7 +56867,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 12:02:39
      -

*Thread Reply:* No, this is not Delta Lake. It is a normal Spark app.

+

*Thread Reply:* No, this is not Delta Lake. It is a normal Spark app.

      @@ -56887,7 +56893,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 13:58:05
      -

      *Thread Reply:* @Maciej Obuchowski i think David posted about this before. https://openlineage.slack.com/archives/C01CK9T7HKR/p1636011698055200

      +

      *Thread Reply:* @Maciej Obuchowski i think David posted about this before. https://openlineage.slack.com/archives/C01CK9T7HKR/p1636011698055200

      @@ -56961,7 +56967,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 14:27:46
      -

*Thread Reply:* I agree that it looks bad on UI, but I also think the integration is doing a good job here. The eventual "aggregation" should be done by the event consumer.

+

*Thread Reply:* I agree that it looks bad on UI, but I also think the integration is doing a good job here. The eventual "aggregation" should be done by the event consumer.

      If anything, we should filter some 'useless' nodes like collect_limit since they add nothing.

      @@ -57199,7 +57205,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 14:33:09
      -

      *Thread Reply:* @Maciej Obuchowski but we only see these 2 jobs in the namespace, no other jobs were part of the lineage metadata, are we doing something wrong?

      +

      *Thread Reply:* @Maciej Obuchowski but we only see these 2 jobs in the namespace, no other jobs were part of the lineage metadata, are we doing something wrong?

      @@ -57225,7 +57231,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 16:09:15
      -

*Thread Reply:* @Michael Robinson On this note, may we know how to form a lineage if we have a different set of APIs called before the Spark job (already integrated with OpenLineageSparkListener)? We want to see how the different sets of params pass through these components before landing in the Spark job. If we use the OpenLineage client to post the lineage events to Marquez, do we need to mention the same run UUID across the lineage events for the run, or is there another way to do this? Can you please advise?

+

*Thread Reply:* @Michael Robinson On this note, may we know how to form a lineage if we have a different set of APIs called before the Spark job (already integrated with OpenLineageSparkListener)? We want to see how the different sets of params pass through these components before landing in the Spark job. If we use the OpenLineage client to post the lineage events to Marquez, do we need to mention the same run UUID across the lineage events for the run, or is there another way to do this? Can you please advise?

      @@ -57251,7 +57257,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 22:51:38
      -

      *Thread Reply:* I think I understand what you are asking -

      +

      *Thread Reply:* I think I understand what you are asking -

      The runID is used to correlate different state updates (i.e., start, fail, complete, abort) across the lifespan of a run. So if you are trying to add additional metadata to the same job run, you’d use the same runID.
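For illustration, a sketch with the Python client where the same run UUID ties the START and COMPLETE events together; the namespace, job name, producer string, and URL are placeholders:

```
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")
run_id = str(uuid4())  # reuse this id for every state update of the run
job = Job(namespace="my-namespace", name="my-job")

def now() -> str:
    return datetime.now(timezone.utc).isoformat()

client.emit(RunEvent(RunState.START, now(), Run(run_id), job, "my-producer"))
# ... the job does its work ...
client.emit(RunEvent(RunState.COMPLETE, now(), Run(run_id), job, "my-producer"))
```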

      @@ -57287,7 +57293,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 22:59:39
      -

*Thread Reply:* thanks @Ross Turk, but what I am looking for is, let's say, for example: if we have 4 components in the system, then we want to show the 4 components as job icons in the graph, and the datasets between them would show the input/output parameters that these components use.
A(job) --> DS1(dataset) --> B(job) --> DS2(dataset) --> C(job) --> DS3(dataset) --> D(job)

      @@ -57314,7 +57320,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 23:04:37
      -

      *Thread Reply:* then you would need to have separate Jobs for each, with inputs and outputs defined

      +

      *Thread Reply:* then you would need to have separate Jobs for each, with inputs and outputs defined

      @@ -57340,7 +57346,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 23:06:03
      -

      *Thread Reply:* so there would be a Run of job B that shows DS1 as an input and DS2 as an output

      +

      *Thread Reply:* so there would be a Run of job B that shows DS1 as an input and DS2 as an output
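Continuing the sketch above (same imports and helpers), job B's COMPLETE event would carry DS1 as an input and DS2 as an output; all names remain hypothetical:

```
from openlineage.client.run import Dataset

client.emit(
    RunEvent(
        RunState.COMPLETE,
        now(),
        Run(str(uuid4())),
        Job(namespace="my-namespace", name="B"),
        "my-producer",
        inputs=[Dataset(namespace="my-namespace", name="DS1")],
        outputs=[Dataset(namespace="my-namespace", name="DS2")],
    )
)
```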

      @@ -57366,7 +57372,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 23:06:18
      -

      *Thread Reply:* got it

      +

      *Thread Reply:* got it

      @@ -57392,7 +57398,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-08 23:06:34
      -

      *Thread Reply:* (fyi: I know openlineage but my understanding stops at spark 😄)

      +

      *Thread Reply:* (fyi: I know openlineage but my understanding stops at spark 😄)

      @@ -57422,7 +57428,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-10 12:27:58
      -

*Thread Reply:* > The eventual “aggregation” should be done by event consumer.
@Maciej Obuchowski Are there any known client-side libraries that support this aggregation already? In case of Spark applications running as part of ETL pipelines, most of the time our end user is interested in seeing only the aggregated view where all jobs spawned as part of a single application are rolled up into 1 job.

      @@ -57449,7 +57455,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-10 12:32:14
      -

      *Thread Reply:* I believe Microsoft @Will Johnson has something similar to that, but it's probably proprietary.

      +

      *Thread Reply:* I believe Microsoft @Will Johnson has something similar to that, but it's probably proprietary.

      We'd love to have something like it, but AFAIK it affects only some percentage of Spark jobs and we can only do so much.

      @@ -57479,7 +57485,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-11 23:38:27
      -

      *Thread Reply:* @Maciej Obuchowski Microsoft ❤️ OSS!

      +

      *Thread Reply:* @Maciej Obuchowski Microsoft ❤️ OSS!

      Apache Atlas doesn't have the same model as Marquez. It only knows of effectively one entity that represents the complete asset.

      @@ -57542,7 +57548,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-19 09:37:06
      -

      *Thread Reply:* thank you !

      +

      *Thread Reply:* thank you !

      @@ -57640,7 +57646,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-09 13:04:00
      -

      *Thread Reply:* Hi, is the link correct? The meeting room is empty

      +

      *Thread Reply:* Hi, is the link correct? The meeting room is empty

      @@ -57666,7 +57672,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-09 16:04:23
      -

      *Thread Reply:* sorry about that, thanks for letting us know

      +

      *Thread Reply:* sorry about that, thanks for letting us know

      @@ -57736,7 +57742,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-13 18:49:37
      -

      *Thread Reply:* Hey @Mark Beebe!

      +

      *Thread Reply:* Hey @Mark Beebe!

      In this case, nodeId is going to be either a dataset or a job. You need to tell Marquez where to start since there is likely to be more than one graph. So you need to get your hands on an identifier for that starting node.

      @@ -57764,7 +57770,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-13 18:50:07
      -

      *Thread Reply:* You can do this in a few ways (that I can think of). First, by looking for a namespace, then querying for the datasets in that namespace:

      +

      *Thread Reply:* You can do this in a few ways (that I can think of). First, by looking for a namespace, then querying for the datasets in that namespace:
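Roughly, with the Marquez HTTP API; the base URL, namespace, and dataset names below are placeholders:

```
import requests

MARQUEZ = "http://localhost:5000/api/v1"

namespaces = requests.get(f"{MARQUEZ}/namespaces").json()
datasets = requests.get(f"{MARQUEZ}/namespaces/food_delivery/datasets").json()

# nodeId for a dataset takes the form dataset:<namespace>:<name>
lineage = requests.get(
    f"{MARQUEZ}/lineage",
    params={"nodeId": "dataset:food_delivery:public.categories"},
).json()
```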

      @@ -57799,7 +57805,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-13 18:53:43
      -

      *Thread Reply:* Or you can search, if you know the name of the dataset:

      +

      *Thread Reply:* Or you can search, if you know the name of the dataset:
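And a sketch of the search route, continuing the snippet above; each result carries the dataset's namespace and name:

```
results = requests.get(f"{MARQUEZ}/search", params={"q": "categories"}).json()
```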

      @@ -57834,7 +57840,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-13 18:53:54
      -

      *Thread Reply:* aaaaannnnd that’s actually all the ways I can think of.

      +

      *Thread Reply:* aaaaannnnd that’s actually all the ways I can think of.

      @@ -57860,7 +57866,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-14 08:11:30
      -

      *Thread Reply:* That worked, thank you so much!

      +

      *Thread Reply:* That worked, thank you so much!

      @@ -57946,7 +57952,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-14 10:37:01
      -

      *Thread Reply:* tnazarew - Tomasz Nazarewicz

      +

      *Thread Reply:* tnazarew - Tomasz Nazarewicz

      @@ -57972,7 +57978,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-14 10:37:14
      -

      *Thread Reply:* 🙌

      +

      *Thread Reply:* 🙌

      @@ -57998,7 +58004,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-15 12:46:45
      -

      *Thread Reply:* 👍

      +

      *Thread Reply:* 👍

      @@ -58390,7 +58396,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:39:16
      -

      *Thread Reply:* I think you’re reference the deprecation of the DatasetAPI in Marquez? A milestone for the Marquez is to only collect metadata via OpenLineage events. This includes metadata for datasets , jobs , and runs . The DatasetAPI won’t be removed until support for collecting dataset metadata via OpenLineage has been added, see https://github.com/OpenLineage/OpenLineage/issues/323

      +

      *Thread Reply:* I think you’re reference the deprecation of the DatasetAPI in Marquez? A milestone for the Marquez is to only collect metadata via OpenLineage events. This includes metadata for datasets , jobs , and runs . The DatasetAPI won’t be removed until support for collecting dataset metadata via OpenLineage has been added, see https://github.com/OpenLineage/OpenLineage/issues/323

      @@ -58492,7 +58498,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:40:28
      -

      *Thread Reply:* Once the spec supports dataset metadata, we’ll outline steps in the Marquez project to switch to using the new dataset event type

      +

      *Thread Reply:* Once the spec supports dataset metadata, we’ll outline steps in the Marquez project to switch to using the new dataset event type

      @@ -58518,7 +58524,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:43:20
      -

      *Thread Reply:* The DatasetAPI was also deprecated to avoid confusion around which API to use

      +

      *Thread Reply:* The DatasetAPI was also deprecated to avoid confusion around which API to use

      @@ -58596,7 +58602,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:44:49
      -

      *Thread Reply:* Do you want to register just datasets? Or are you extracting metadata for a job that would include input / output datasets? (outside of Airflow of course)

      +

      *Thread Reply:* Do you want to register just datasets? Or are you extracting metadata for a job that would include input / output datasets? (outside of Airflow of course)

      @@ -58622,7 +58628,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:45:09
      -

      *Thread Reply:* Sorry didn't notice you over here ! lol

      +

      *Thread Reply:* Sorry didn't notice you over here ! lol

      @@ -58648,7 +58654,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:45:53
      -

      *Thread Reply:* So ideally I would like to map out our current data flow from on prem to aws

      +

      *Thread Reply:* So ideally I would like to map out our current data flow from on prem to aws

      @@ -58674,7 +58680,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:47:39
      -

      *Thread Reply:* What do you mean by mapping to AWS? Like send OL events to a service on AWS that would process the lineage metadata?

      +

      *Thread Reply:* What do you mean by mapping to AWS? Like send OL events to a service on AWS that would process the lineage metadata?

      @@ -58700,7 +58706,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:48:14
      -

      *Thread Reply:* no, just visualize the current migration flow.

      +

      *Thread Reply:* no, just visualize the current migration flow.

      @@ -58726,7 +58732,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:48:53
      -

*Thread Reply:* Ah I see, you're doing an infra migration from on-prem to AWS 👌

+

*Thread Reply:* Ah I see, you're doing an infra migration from on-prem to AWS 👌

      @@ -58752,7 +58758,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:49:08
      -

*Thread Reply:* really, AWS is irrelevant. Source sink -> migration scripts -> s3 -> additional processing -> final sink

+

*Thread Reply:* really, AWS is irrelevant. Source sink -> migration scripts -> s3 -> additional processing -> final sink

      @@ -58778,7 +58784,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:49:19
      -

      *Thread Reply:* correct

      +

      *Thread Reply:* correct

      @@ -58804,7 +58810,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:49:45
      -

      *Thread Reply:* right right. so you want to map out that flow and visualize it in Marquez? (or some other meta service)

      +

      *Thread Reply:* right right. so you want to map out that flow and visualize it in Marquez? (or some other meta service)

      @@ -58830,7 +58836,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:50:05
      -

      *Thread Reply:* yes

      +

      *Thread Reply:* yes

      @@ -58856,7 +58862,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:50:26
      -

      *Thread Reply:* which I think I can do once the first nodes exist

      +

      *Thread Reply:* which I think I can do once the first nodes exist

      @@ -58882,7 +58888,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:51:18
      -

*Thread Reply:* But I don't know how to get that initial node. I tried using the input facet at job start, that didn't do it. I also can't get the sql context that is in these examples.

+

*Thread Reply:* But I don't know how to get that initial node. I tried using the input facet at job start, that didn't do it. I also can't get the sql context that is in these examples.

      @@ -58908,7 +58914,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:51:54
      -

*Thread Reply:* really just want to re-create food_delivery using my own biz context

+

*Thread Reply:* really just want to re-create food_delivery using my own biz context

      @@ -58934,7 +58940,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:52:14
      -

      *Thread Reply:* Have you looked over our workshops and this example? (assuming you’re using python?)

      +

      *Thread Reply:* Have you looked over our workshops and this example? (assuming you’re using python?)

      @@ -59032,7 +59038,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:53:49
      -

      *Thread Reply:* that goes over the py client with some OL examples, but really calling openlineage.emit(...) method with RunEvents and specifying Marquez as the backend will get you up and running!

      +

      *Thread Reply:* that goes over the py client with some OL examples, but really calling openlineage.emit(...) method with RunEvents and specifying Marquez as the backend will get you up and running!

      @@ -59058,7 +59064,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:54:32
      -

      *Thread Reply:* Don’t forget to configure the transport for the client

      +

      *Thread Reply:* Don’t forget to configure the transport for the client
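A sketch of both configuration styles for the Python client; the URL and namespace values are examples:

```
from openlineage.client import OpenLineageClient

# Explicitly point the client at Marquez...
client = OpenLineageClient(url="http://localhost:5000")

# ...or leave construction to environment configuration, e.g.:
#   OPENLINEAGE_URL=http://localhost:5000
#   OPENLINEAGE_NAMESPACE=food_delivery
```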

      @@ -59084,7 +59090,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:54:45
      -

      *Thread Reply:* sweet. Thank you! I'll take a look. Also.. Just came across datakin for the first time. very nice 🙂

      +

      *Thread Reply:* sweet. Thank you! I'll take a look. Also.. Just came across datakin for the first time. very nice 🙂

      @@ -59110,7 +59116,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:55:25
      -

      *Thread Reply:* thanks! …. but we’re now part of astronomer.io 😉

      +

      *Thread Reply:* thanks! …. but we’re now part of astronomer.io 😉

      @@ -59136,7 +59142,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:55:48
      -

      *Thread Reply:* making airflow oh-so-easy-to-use one DAG at a time

      +

      *Thread Reply:* making airflow oh-so-easy-to-use one DAG at a time

      @@ -59162,7 +59168,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:55:52
      -

      *Thread Reply:* saw that too !

      +

      *Thread Reply:* saw that too !

      @@ -59188,7 +59194,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:56:03
      -

      *Thread Reply:* you’re on top of it!

      +

      *Thread Reply:* you’re on top of it!

      @@ -59214,7 +59220,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:56:28
      -

      *Thread Reply:* ha. Thanks again!

      +

      *Thread Reply:* ha. Thanks again!

      @@ -59347,7 +59353,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:59:38
      -

      *Thread Reply:* Honestly, I’ve been a huge fan of using / falling back on inlets and outlets since day 1. AND if you’re willing to contribute this support, you get a +1 from me (I’ll add some minor comments to the issue) /cc @Julien Le Dem

      +

      *Thread Reply:* Honestly, I’ve been a huge fan of using / falling back on inlets and outlets since day 1. AND if you’re willing to contribute this support, you get a +1 from me (I’ll add some minor comments to the issue) /cc @Julien Le Dem

      @@ -59377,7 +59383,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:59:59
      -

      *Thread Reply:* would be great to get @Maciej Obuchowski thoughts on this as well

      +

      *Thread Reply:* would be great to get @Maciej Obuchowski thoughts on this as well

      @@ -59407,7 +59413,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-08 12:40:39
      -

*Thread Reply:* I have created a draft PR for this here.
Please let me know if the changes make sense.

      @@ -59491,7 +59497,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-08 12:42:30
      -

      *Thread Reply:* I think this effort: https://github.com/OpenLineage/OpenLineage/pull/904 ultimately makes more sense, since it will allow getting lineage on Airflow 2.3+ too

      +

      *Thread Reply:* I think this effort: https://github.com/OpenLineage/OpenLineage/pull/904 ultimately makes more sense, since it will allow getting lineage on Airflow 2.3+ too

      @@ -59578,7 +59584,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-08 18:12:47
      -

*Thread Reply:* I have made the changes in-line to the mentioned comments here.
Does this look good?

      @@ -59662,7 +59668,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-12 09:35:22
      -

      *Thread Reply:* I think it looks good! Would be great to have tests for this feature though.

      +

      *Thread Reply:* I think it looks good! Would be great to have tests for this feature though.

      @@ -59692,7 +59698,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-15 21:56:50
      -

*Thread Reply:* I have added the tests! Would really appreciate it if someone can take a look and let me know if anything else needs to be done.
Thank you for the support! 😄

      @@ -59723,7 +59729,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-18 06:48:03
      -

      *Thread Reply:* One change and I think it will be good for now.

      +

      *Thread Reply:* One change and I think it will be good for now.

      @@ -59749,7 +59755,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-18 06:48:07
      -

      *Thread Reply:* Have you tested it manually?

      +

      *Thread Reply:* Have you tested it manually?

      @@ -59775,7 +59781,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-20 13:22:04
      -

*Thread Reply:* Thanks a lot for the review! Appreciate it 🙌
Yes, I tested it manually (for Airflow versions 2.1.4 and 2.3.3) and it works 🎉

      @@ -59802,7 +59808,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-20 13:24:55
      -

      *Thread Reply:* I think this is such a useful feature to have, thank you! Would you mind adding a little example to the PR of how to use it? Like a little example DAG or something? ( either in a comment or edit the PR description )

      +

      *Thread Reply:* I think this is such a useful feature to have, thank you! Would you mind adding a little example to the PR of how to use it? Like a little example DAG or something? ( either in a comment or edit the PR description )

      @@ -59832,7 +59838,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-20 15:20:32
      -

      *Thread Reply:* Yes, Sure! I will add it in the PR description

      +

      *Thread Reply:* Yes, Sure! I will add it in the PR description

      @@ -59858,7 +59864,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-21 05:30:56
      -

      *Thread Reply:* I think it would be easy to convert to integration test then if you provided example dag

      +

      *Thread Reply:* I think it would be easy to convert to integration test then if you provided example dag

      @@ -59888,7 +59894,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-27 12:20:43
      -

      *Thread Reply:* ping @Fenil Doshi if possible I would really love to see the example DAG on there 🙂 🙏

      +

      *Thread Reply:* ping @Fenil Doshi if possible I would really love to see the example DAG on there 🙂 🙏

      @@ -59914,7 +59920,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-27 12:26:22
      -

      *Thread Reply:* Yes, I was going to but the PR got merged so did not update the description. Should I just update the description of merged PR? Or should I add it somewhere in the docs?

      +

      *Thread Reply:* Yes, I was going to but the PR got merged so did not update the description. Should I just update the description of merged PR? Or should I add it somewhere in the docs?

      @@ -59940,7 +59946,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-27 12:42:29
      -

      *Thread Reply:* ^ @Ross Turk is it easy for @Fenil Doshi to contribute doc for manual inlet definition on the new doc site?

      +

      *Thread Reply:* ^ @Ross Turk is it easy for @Fenil Doshi to contribute doc for manual inlet definition on the new doc site?

      @@ -59966,7 +59972,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-27 12:48:32
      -

      *Thread Reply:* It is easy 🙂 it's just markdown: https://github.com/openlineage/docs/

      +

      *Thread Reply:* It is easy 🙂 it's just markdown: https://github.com/openlineage/docs/

      @@ -59992,7 +59998,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-27 12:49:23
      -

*Thread Reply:* @Fenil Doshi feel free to create a new page here and don't sweat where to put it, we're still figuring the structure of it out and will move it then

+

*Thread Reply:* @Fenil Doshi feel free to create a new page here and don't sweat where to put it, we're still figuring the structure of it out and will move it then

      @@ -60022,7 +60028,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-27 13:12:31
      -

      *Thread Reply:* exactly, yes - don’t be worried about the doc quality right now, the doc site is still in a pre-release state. so whatever you write will be likely edited or moved before it becomes official 👍

      +

      *Thread Reply:* exactly, yes - don’t be worried about the doc quality right now, the doc site is still in a pre-release state. so whatever you write will be likely edited or moved before it becomes official 👍

      @@ -60052,7 +60058,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-27 20:37:34
      -

      *Thread Reply:* I added documentations here - https://github.com/OpenLineage/docs/pull/16

      +

      *Thread Reply:* I added documentations here - https://github.com/OpenLineage/docs/pull/16

      Also, have added an example for it. 🙂 Let me know if something is unclear and needs to be updated.

      @@ -60109,7 +60115,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-28 12:50:54
      -

      *Thread Reply:* Thanks! very cool.

      +

      *Thread Reply:* Thanks! very cool.

      @@ -60135,7 +60141,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-28 12:52:22
      -

      *Thread Reply:* Does Airflow check the types of the inlets/outlets btw?

      +

      *Thread Reply:* Does Airflow check the types of the inlets/outlets btw?

Like I wonder if a user could directly define an OpenLineage DataSet (which might even have various other facets included on it) and specify it in the inlets/outlets?

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -
      2022-07-28 12:54:56
      -

      *Thread Reply:* Yeah, I was also curious about using the models from airflow.lineage.entities as opposed to openlineage.client.run.

      +

      *Thread Reply:* Yeah, I was also curious about using the models from airflow.lineage.entities as opposed to openlineage.client.run.

      @@ -60214,7 +60220,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-28 12:55:42
      -

      *Thread Reply:* I am accustomed to creating OpenLineage entities like this:

      +

      *Thread Reply:* I am accustomed to creating OpenLineage entities like this:

      taxes = Dataset(namespace="<postgres://foobar>", name="schema.table")

      @@ -60242,7 +60248,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-28 12:56:45
      -

      *Thread Reply:* I don’t dislike the airflow.lineage.entities models especially, but if we only support one of them…

      +

      *Thread Reply:* I don’t dislike the airflow.lineage.entities models especially, but if we only support one of them…

      @@ -60268,7 +60274,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-28 12:58:18
      -

      *Thread Reply:* yeah, if Airflow allows that class within inlets/outlets it'd be nice to support both imo.

      +

      *Thread Reply:* yeah, if Airflow allows that class within inlets/outlets it'd be nice to support both imo.

      Like we would suggest users to use openlineage.client.run.Dataset but if a user already has DAGs that use Table then they'd still work in a best efforts way.

      @@ -60296,7 +60302,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-28 13:03:07
      -

      *Thread Reply:* either Airflow depends on OpenLineage or we can probably change those entities as part of AIP-48 overhaul to more openlineage-like ones

      +

      *Thread Reply:* either Airflow depends on OpenLineage or we can probably change those entities as part of AIP-48 overhaul to more openlineage-like ones

      @@ -60322,7 +60328,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-28 17:18:35
      -

      *Thread Reply:* hm, not sure I understand the dependency issue. isn’t this extractor living in openlineage-airflow?

      +

      *Thread Reply:* hm, not sure I understand the dependency issue. isn’t this extractor living in openlineage-airflow?

      @@ -60348,7 +60354,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-08-15 09:49:02
      -

      *Thread Reply:* I gave manual lineage a try with native OL Datasets specified in the Airflow inlets/outlets and it seems to work! Had to make some small tweaks which I have attempted here: https://github.com/OpenLineage/OpenLineage/pull/1015

      +

      *Thread Reply:* I gave manual lineage a try with native OL Datasets specified in the Airflow inlets/outlets and it seems to work! Had to make some small tweaks which I have attempted here: https://github.com/OpenLineage/OpenLineage/pull/1015

(I left the support for converting the Airflow Table to Dataset because I think that's nice to have also)
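A sketch of the usage that change enables, with hypothetical dataset names and an arbitrary task:

```
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from openlineage.client.run import Dataset

with DAG(
    dag_id="manual_lineage_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
) as dag:
    transform = BashOperator(
        task_id="transform",
        bash_command="echo transforming",
        inlets=[Dataset(namespace="postgres://db", name="schema.source_table")],
        outlets=[Dataset(namespace="postgres://db", name="schema.target_table")],
    )
```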

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -
      2022-06-28 18:45:52
      -

      *Thread Reply:* Ahh great question! I actually just updated the seeding cmd for Marquez to do just this (but in java of course)

      +

      *Thread Reply:* Ahh great question! I actually just updated the seeding cmd for Marquez to do just this (but in java of course)

      @@ -60500,7 +60506,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:46:15
      -

      *Thread Reply:* Give me a sec to send you over the diff…

      +

      *Thread Reply:* Give me a sec to send you over the diff…

      @@ -60530,7 +60536,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-28 18:56:35
      -

      *Thread Reply:* … continued here https://openlineage.slack.com/archives/C01CK9T7HKR/p1656456734272809?thread_ts=1656456141.097229&cid=C01CK9T7HKR

      +

      *Thread Reply:* … continued here https://openlineage.slack.com/archives/C01CK9T7HKR/p1656456734272809?thread_ts=1656456141.097229&cid=C01CK9T7HKR

      @@ -60680,7 +60686,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-29 10:04:14
      -

      *Thread Reply:* I think you should declare those as sources? Or do you need something different?

      +

      *Thread Reply:* I think you should declare those as sources? Or do you need something different?

      @@ -60706,7 +60712,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-29 21:15:33
      -

      *Thread Reply:* I'll try to experiment with this.

      +

      *Thread Reply:* I'll try to experiment with this.

      @@ -60759,7 +60765,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-29 10:02:17
      -

      *Thread Reply:* this should already be working if you run dbt-ol test or dbt-ol build

      +

      *Thread Reply:* this should already be working if you run dbt-ol test or dbt-ol build

      @@ -60785,7 +60791,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-06-29 21:15:25
      -

      *Thread Reply:* oh, nice!

      +

      *Thread Reply:* oh, nice!

      @@ -60837,7 +60843,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-06 14:26:59
      -

*Thread Reply:* Maybe @Maciej Obuchowski knows? You need to check that it's using the dbt-ol command and that the configuration is available. (environment variables or conf file)

+

*Thread Reply:* Maybe @Maciej Obuchowski knows? You need to check that it's using the dbt-ol command and that the configuration is available. (environment variables or conf file)
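For reference, a typical invocation under the assumption that Marquez is reachable from the workers; the URL and namespace are examples:

```
OPENLINEAGE_URL=http://marquez:5000 OPENLINEAGE_NAMESPACE=dbt dbt-ol run
```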

      @@ -60863,7 +60869,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-06 15:31:20
      -

      *Thread Reply:* Maybe some aws networking stuff? I'm not really sure how mwaa works internally (or, at all - never used it)

      +

      *Thread Reply:* Maybe some aws networking stuff? I'm not really sure how mwaa works internally (or, at all - never used it)

      @@ -60889,7 +60895,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-06 15:35:06
      -

      *Thread Reply:* anyway, any logs/errors should be in the same space where your task logs are

      +

      *Thread Reply:* anyway, any logs/errors should be in the same space where your task logs are

      @@ -60941,7 +60947,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-06 10:21:50
      -

      *Thread Reply:* What is the status on the Flink / Streaming decisions being made for OpenLineage / Marquez?

      +

      *Thread Reply:* What is the status on the Flink / Streaming decisions being made for OpenLineage / Marquez?

      A few months ago, Flink was being introduced and it was said that more thought was needed around supporting streaming services in OpenLineage.

      @@ -60975,7 +60981,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-06 11:08:01
      -

      *Thread Reply:* @Will Johnson added your item

      +

      *Thread Reply:* @Will Johnson added your item

      @@ -61093,7 +61099,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-11 10:30:34
      -

*Thread Reply:* would appreciate a TSC discussion on OL philosophy for streaming in general and where/if it fits in the vision and strategy for OL. I fully appreciate the current maturity; this is more about validating how OL is being positioned from a vision perspective. As we consider aligning our enterprise lineage solution around OL, we want to make sure we're not making bad assumptions. A neat discussion might be: "imagine that Confluent decided to make Stream Lineage OL compliant/capable - are we cool with that and what are the implications?".

+

*Thread Reply:* would appreciate a TSC discussion on OL philosophy for streaming in general and where/if it fits in the vision and strategy for OL. I fully appreciate the current maturity; this is more about validating how OL is being positioned from a vision perspective. As we consider aligning our enterprise lineage solution around OL, we want to make sure we're not making bad assumptions. A neat discussion might be: "imagine that Confluent decided to make Stream Lineage OL compliant/capable - are we cool with that and what are the implications?".

      @@ -61123,7 +61129,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-12 12:36:17
      -

      *Thread Reply:* @Michael Robinson could I also have a quick 5m to talk about plans for a documentation site?

      +

      *Thread Reply:* @Michael Robinson could I also have a quick 5m to talk about plans for a documentation site?

      @@ -61153,7 +61159,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-12 12:46:29
      -

      *Thread Reply:* @David Cecchi @Ross Turk Added your items to the agenda. Thanks and looking forward to the discussion!

      +

      *Thread Reply:* @David Cecchi @Ross Turk Added your items to the agenda. Thanks and looking forward to the discussion!

      @@ -61179,7 +61185,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-12 15:08:48
      -

      *Thread Reply:* this is great - will keep an eye out for recording. if it got tabled due to lack of attendance will pick it up next TSC.

      +

      *Thread Reply:* this is great - will keep an eye out for recording. if it got tabled due to lack of attendance will pick it up next TSC.

      @@ -61205,7 +61211,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-12 16:12:43
      -

      *Thread Reply:* I think OpenLineage should have some representation at https://impactdatasummit.com/2022

      +

      *Thread Reply:* I think OpenLineage should have some representation at https://impactdatasummit.com/2022

      I’m happy to help craft the abstract, look over slides, etc. (I could help present, but all I’ve done with OpenLineage is one tutorial, so I’m hardly an expert).

      @@ -61286,7 +61292,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-07 12:04:55
      -

      *Thread Reply:* @Michael Collado would you have any thoughts on how to extend the Facets without having to alter OpenLineage itself?

      +

      *Thread Reply:* @Michael Collado would you have any thoughts on how to extend the Facets without having to alter OpenLineage itself?

      @@ -61312,7 +61318,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-07 15:16:45
      -

*Thread Reply:* This is described here. Notably:
> Custom implementations are registered by following Java's ServiceLoader conventions. A file called io.openlineage.spark.api.OpenLineageEventHandlerFactory must exist in the application or jar's META-INF/service directory. Each line of that file must be the fully qualified class name of a concrete implementation of OpenLineageEventHandlerFactory. More than one implementation can be present in a single file. This might be useful to separate extensions that are targeted toward different environments - e.g., one factory may contain Azure-specific extensions, while another factory may contain GCP extensions.
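Concretely, the registration file would look like this; the implementation class name is hypothetical:

```
# META-INF/services/io.openlineage.spark.api.OpenLineageEventHandlerFactory
com.mycompany.lineage.MyEventHandlerFactory
```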

      @@ -61339,7 +61345,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-07 15:17:55
      -

      *Thread Reply:* This example is present in the test package - https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]ervices/io.openlineage.spark.api.OpenLineageEventHandlerFactory

      +

      *Thread Reply:* This example is present in the test package - https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]ervices/io.openlineage.spark.api.OpenLineageEventHandlerFactory

      @@ -61390,7 +61396,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-07 20:19:01
      -

      *Thread Reply:* @Michael Collado you are amazing! Thank you so much for pointing me to the docs and example!

      +

      *Thread Reply:* @Michael Collado you are amazing! Thank you so much for pointing me to the docs and example!

      @@ -61496,7 +61502,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 06:18:29
      -

      *Thread Reply:* Can you expand on what's not working exactly?

      +

      *Thread Reply:* Can you expand on what's not working exactly?

      This is not something we're aware of.

      @@ -61524,7 +61530,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 04:09:39
      -

      *Thread Reply:* @Maciej Obuchowski Sure, I have my own library where I am creating a shadowJar. This includes the open lineage library into the new uber jar. This worked fine till 0.9.0 but now building the shadowJar gives this error +

*Thread Reply:* @Maciej Obuchowski Sure, I have my own library where I am creating a shadowJar. This bundles the OpenLineage library into the new uber jar. This worked fine till 0.9.0, but now building the shadowJar gives this error:
Could not determine the dependencies of task ':shadowJar'.
> Could not resolve all dependencies for configuration ':runtimeClasspath'.
> Could not find spark:app:0.10.0.

@@ -61576,7 +61582,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 05:00:02
      -

*Thread Reply:* Can you try 0.11? I think we might have already fixed that.

+

*Thread Reply:* Can you try 0.11? I think we might have already fixed that.

      @@ -61602,7 +61608,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 05:50:03
      -

      *Thread Reply:* Tried with that as well. Doesn't work

      +

      *Thread Reply:* Tried with that as well. Doesn't work

      @@ -61628,7 +61634,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 05:56:50
      -

      *Thread Reply:* Same error with 0.11.0 as well

      +

      *Thread Reply:* Same error with 0.11.0 as well

      @@ -61654,7 +61660,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 08:11:13
      -

      *Thread Reply:* I think I see - we removed internal dependencies from maven's pom.xml but we also publish gradle metadata: https://repo1.maven.org/maven2/io/openlineage/openlineage-spark/0.11.0/openlineage-spark-0.11.0.module

      +

      *Thread Reply:* I think I see - we removed internal dependencies from maven's pom.xml but we also publish gradle metadata: https://repo1.maven.org/maven2/io/openlineage/openlineage-spark/0.11.0/openlineage-spark-0.11.0.module

      @@ -61680,7 +61686,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 08:11:34
      -

      *Thread Reply:* we should remove the dependencies or disable the gradle metadata altogether, it's not required

      +

      *Thread Reply:* we should remove the dependencies or disable the gradle metadata altogether, it's not required

      @@ -61706,7 +61712,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 08:16:18
      -

      *Thread Reply:* @Varun Singh For now I think you can try ignoring gradle metadata: https://docs.gradle.org/current/userguide/declaring_repositories.html#sec:supported_metadata_sources

      +

      *Thread Reply:* @Varun Singh For now I think you can try ignoring gradle metadata: https://docs.gradle.org/current/userguide/declaring_repositories.html#sec:supported_metadata_sources

      @@ -61732,7 +61738,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 14:18:45
      -

*Thread Reply:* @Varun Singh did you find out how to build the shadowJar successfully with release 0.10.0? I can build the shadowJar with 0.9.0, but not with a higher version. If your problem is already resolved, could you share some suggestions? thanks ^^

+

*Thread Reply:* @Varun Singh did you find out how to build the shadowJar successfully with release 0.10.0? I can build the shadowJar with 0.9.0, but not with a higher version. If your problem is already resolved, could you share some suggestions? thanks ^^

      @@ -61758,7 +61764,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-20 03:44:40
      -

      *Thread Reply:* @Hanbing Wang I followed @Maciej Obuchowski's instructions (Thank you!) and added this to my build.gradle file: +

      *Thread Reply:* @Hanbing Wang I followed @Maciej Obuchowski's instructions (Thank you!) and added this to my build.gradle file: repositories { mavenCentral() { metadataSources { @@ -61793,7 +61799,7 @@
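
The build.gradle excerpt above is cut off by the diff. For reference, a complete repositories block that ignores the published Gradle Module Metadata (per the Gradle docs linked earlier) would look roughly like this - the exact block used in the thread isn't shown, so treat it as a sketch:

repositories {
    mavenCentral() {
        metadataSources {
            mavenPom()   // take dependency information from the POM only
            artifact()   // fall back to the artifact itself when no POM exists
            // omitting gradleMetadata() stops Gradle from reading the published
            // .module file that still declares the internal spark:app dependency
        }
    }
}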

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-20 05:26:04
      -

*Thread Reply:* Also, I am not able to see the 3rd party dependencies in the dependency lock file, but they are present in some folder inside the jar (relocated in subproject's build file). But this is a different problem, I guess.

+

*Thread Reply:* Also, I am not able to see the 3rd party dependencies in the dependency lock file, but they are present in some folder inside the jar (relocated in subproject's build file). But this is a different problem, I guess.

      @@ -61819,7 +61825,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-20 18:45:50
      -

      *Thread Reply:* Thanks @Varun Singh for the very helpful info. I will also try update build.gradle and rebuild shadowJar again.

      +

      *Thread Reply:* Thanks @Varun Singh for the very helpful info. I will also try update build.gradle and rebuild shadowJar again.

      @@ -61986,7 +61992,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 08:50:15
      -

*Thread Reply:* Could you unzip the created jar and verify that the classes you’re trying to use are present? Perhaps there’s a relocate in the shadowJar plugin that renames the classes. Making sure the classes are present in the jar is a good starting point.

+

*Thread Reply:* Could you unzip the created jar and verify that the classes you’re trying to use are present? Perhaps there’s a relocate in the shadowJar plugin that renames the classes. Making sure the classes are present in the jar is a good starting point.

Then you can try calling Class.forName just from the spark-shell without any listeners added. The classes should be available there.

      @@ -62014,7 +62020,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 11:42:25
      -

      *Thread Reply:* Thank you for the reply Pawel! Hanna and I just wrapped up some testing.

      +

      *Thread Reply:* Thank you for the reply Pawel! Hanna and I just wrapped up some testing.

It looks like Databricks AND open source Spark do some magic when you install a library OR use --jars on the spark-shell. In both Databricks and Apache Spark, the thread running the SparkListener cannot see the additional libraries installed unless they're on the original / main class path.

      @@ -62050,7 +62056,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 12:02:05
      -

      *Thread Reply:* We run tests using --packages for external stuff like Delta - which is the same as --jars , but getting them from maven central, not local disk, and it works, like in KafkaRelationVisitor.

      +

      *Thread Reply:* We run tests using --packages for external stuff like Delta - which is the same as --jars , but getting them from maven central, not local disk, and it works, like in KafkaRelationVisitor.

What if you did it like that? By that I mean adding it to your code with compileOnly in Gradle or provided in Maven, compiling with it, then using a static method to check if it loads?
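
A sketch of that pattern, modeled on KafkaRelationVisitor - the probed class comes from this thread, while the wrapper class and method names are made up:

public class KustoVisitorSupport {
  // The connector is declared compileOnly (Gradle) / provided (Maven), so it is
  // compiled against but not bundled; this probe checks at runtime whether the
  // class actually made it onto the classpath.
  public static boolean hasKustoClasses() {
    try {
      KustoVisitorSupport.class.getClassLoader()
          .loadClass("com.microsoft.kusto.spark.datasource.KustoSourceOptions");
      return true;
    } catch (ClassNotFoundException e) {
      return false;
    }
  }
}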

      @@ -62078,7 +62084,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 12:02:36
      -

      *Thread Reply:* > • When we use a classloader on a class we installed and imported, it DOES find the class. myImportedClass.getClass().getClassLoader().getResource("com/microsoft/kusto/spark/datasource/KustoSourceOptions.class") +

      *Thread Reply:* > • When we use a classloader on a class we installed and imported, it DOES find the class. myImportedClass.getClass().getClassLoader().getResource("com/microsoft/kusto/spark/datasource/KustoSourceOptions.class") Isn't that this actual scenario?

      @@ -62105,7 +62111,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 12:36:47
      -

      *Thread Reply:* Thank you for the reply, Maciej!

      +

      *Thread Reply:* Thank you for the reply, Maciej!

      I will try the compileOnly route tonight!

      @@ -62137,7 +62143,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 06:43:47
      -

      *Thread Reply:* The difference between --jars and --packages is that for packages all transitive dependencies will be handled. But this does not seem to be the case here.

      +

      *Thread Reply:* The difference between --jars and --packages is that for packages all transitive dependencies will be handled. But this does not seem to be the case here.

      More doc can be found here: (https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management)

      @@ -62171,7 +62177,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 11:20:02
      -

      *Thread Reply:* @Paweł Leszczyński thank you for the reply! Hanna and I experimented with jars vs extraClassPath.

      +

      *Thread Reply:* @Paweł Leszczyński thank you for the reply! Hanna and I experimented with jars vs extraClassPath.

      When using jars, the spark listener does NOT find the class using a classloader.

      @@ -62207,7 +62213,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 11:22:01
      -

*Thread Reply:* So, this is only relevant to Databricks now? Because I don't understand what you do differently than us with Kafka/Iceberg/Delta

+

*Thread Reply:* So, this is only relevant to Databricks now? Because I don't understand what you do differently than us with Kafka/Iceberg/Delta

      @@ -62233,7 +62239,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 11:22:48
      -

*Thread Reply:* I'm not the spark/classpath expert though - maybe @Michael Collado has something to add?

+

*Thread Reply:* I'm not the spark/classpath expert though - maybe @Michael Collado has something to add?

      @@ -62259,7 +62265,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 11:24:12
      -

      *Thread Reply:* @Maciej Obuchowski that's a super good question on Iceberg. How do you instantiate a spark job with Iceberg installed?

      +

      *Thread Reply:* @Maciej Obuchowski that's a super good question on Iceberg. How do you instantiate a spark job with Iceberg installed?

      @@ -62285,7 +62291,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 11:26:04
      -

      *Thread Reply:* It is still relevant to apache spark because I can't get OpenLineage to find the installed package UNLESS I use extraClassPath.

      +

      *Thread Reply:* It is still relevant to apache spark because I can't get OpenLineage to find the installed package UNLESS I use extraClassPath.

      @@ -62311,7 +62317,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 11:29:13
      -

      *Thread Reply:* Basically, by adding --packages org.apache.iceberg:iceberg_spark_runtime_3.1_2.12:0.13.0

      +

      *Thread Reply:* Basically, by adding --packages org.apache.iceberg:iceberg_spark_runtime_3.1_2.12:0.13.0

      https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]a/io/openlineage/spark/agent/SparkContainerIntegrationTest.java

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -
      2022-07-14 11:29:51
      -

*Thread Reply:* Trying with --packages right now.

+

*Thread Reply:* Trying with --packages right now.

      @@ -62390,7 +62396,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 11:54:37
      -

      *Thread Reply:* Using --packages wouldn't let me find the Spark relation's default source:

      +

      *Thread Reply:* Using --packages wouldn't let me find the Spark relation's default source:

      Spark Shell command spark-shell --master local[4] \ @@ -62440,7 +62446,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 11:59:07
      -

*Thread Reply:* what if you load your listener using packages as well?

+

*Thread Reply:* what if you load your listener using packages as well?

      @@ -62466,7 +62472,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 12:00:38
      -

      *Thread Reply:* That's how I'm doing it locally using spark.conf: +

      *Thread Reply:* That's how I'm doing it locally using spark.conf: spark.jars.packages com.google.cloud.bigdataoss:gcs_connector:hadoop3-2.2.2,io.delta:delta_core_2.12:1.0.0,org.apache.iceberg:iceberg_spark3_runtime:0.12.1,io.openlineage:openlineage_spark:0.9.0

      @@ -62497,7 +62503,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 12:20:47
      -

      *Thread Reply:* @Maciej Obuchowski - You beautiful bearded man! +

      *Thread Reply:* @Maciej Obuchowski - You beautiful bearded man! 🙏 2022-07-14 11:14:21,266 INFO MyLogger: Trying LogicalRelation 2022-07-14 11:14:21,266 INFO MyLogger: Got logical relation @@ -62541,7 +62547,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 12:21:35
      -

      *Thread Reply:* This is an annoying detail about Java ClassLoaders and the way Spark loads extra jars/packages

      +

      *Thread Reply:* This is an annoying detail about Java ClassLoaders and the way Spark loads extra jars/packages

      Remember Java's ClassLoaders are hierarchical - there are parent ClassLoaders and child ClassLoaders. Parents can't see their children's classes, but children can see their parent's classes.
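
A tiny Java illustration of that visibility rule (the jar path and class name are made up):

import java.net.URL;
import java.net.URLClassLoader;

public class VisibilityDemo {
  public static void main(String[] args) throws Exception {
    ClassLoader parent = VisibilityDemo.class.getClassLoader();
    URL[] extraJars = { new URL("file:///tmp/extra.jar") };
    try (URLClassLoader child = new URLClassLoader(extraJars, parent)) {
      // a child delegates to its parent first, then searches its own URLs:
      child.loadClass("com.example.Extra");   // found in /tmp/extra.jar
      // the parent never consults the child, so this throws ClassNotFoundException:
      parent.loadClass("com.example.Extra");
    }
  }
}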

      @@ -62575,7 +62581,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 12:26:15
      -

      *Thread Reply:* In Databricks, we were not able to simply use the --packages argument to load the listener, which is why we have that init script that copies the jar into the classpath that Databricks uses for application startup (the main ClassLoader). You need to copy your visitor jar into the same location so that both jars are loaded by the same ClassLoader and can see each other

      +

      *Thread Reply:* In Databricks, we were not able to simply use the --packages argument to load the listener, which is why we have that init script that copies the jar into the classpath that Databricks uses for application startup (the main ClassLoader). You need to copy your visitor jar into the same location so that both jars are loaded by the same ClassLoader and can see each other

      @@ -62626,7 +62632,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 12:29:09
      -

      *Thread Reply:* (as an aside, this is one of the major drawbacks of the java agent approach and one reason why all the documentation recommends using the spark.jars.packages configuration parameter for loading the OL library - it guarantees that any DataSource nodes loaded by the Spark ClassLoader can be seen by the OL library and we don't have to use reflection for everything)

      +

      *Thread Reply:* (as an aside, this is one of the major drawbacks of the java agent approach and one reason why all the documentation recommends using the spark.jars.packages configuration parameter for loading the OL library - it guarantees that any DataSource nodes loaded by the Spark ClassLoader can be seen by the OL library and we don't have to use reflection for everything)

      @@ -62652,7 +62658,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 12:30:25
      -

      *Thread Reply:* @Michael Collado Thank you so much for the reply. The challenge is that Databricks has their own mechanism for installing libraries / packages.

      +

      *Thread Reply:* @Michael Collado Thank you so much for the reply. The challenge is that Databricks has their own mechanism for installing libraries / packages.

      https://docs.microsoft.com/en-us/azure/databricks/libraries/

      @@ -62709,7 +62715,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 12:31:32
      -

      *Thread Reply:* Unfortunately, I can't ask users to install their packages on Databricks in a non-standard way (e.g. via an init script) because no one will follow that recommendation.

      +

      *Thread Reply:* Unfortunately, I can't ask users to install their packages on Databricks in a non-standard way (e.g. via an init script) because no one will follow that recommendation.

      @@ -62735,7 +62741,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 12:32:46
      -

*Thread Reply:* yeah, I'd prefer if we didn't need an init script to get OL on Databricks either 🤷‍♂️

+

*Thread Reply:* yeah, I'd prefer if we didn't need an init script to get OL on Databricks either 🤷‍♂️

      @@ -62765,7 +62771,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-17 01:03:02
      -

      *Thread Reply:* Quick update: +

*Thread Reply:* Quick update:
• Turns out using a class loader from a Scala spark listener does not have this problem.
• https://stackoverflow.com/questions/7671888/scala-classloaders-confusion
• I'm trying to use URLClassLoader as recommended by a few MSFT folks and point it at the /local_disk0/tmp folder.

@@ -62846,7 +62852,7 @@
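
A hedged sketch of that idea - scan the staging folder for jars, build a URLClassLoader over them, and probe for a class. The folder comes from the thread; everything else is illustrative:

import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

public class StagedJarProbe {
  public static boolean classStaged(String className) throws Exception {
    List<URL> urls = new ArrayList<>();
    File[] jars = new File("/local_disk0/tmp").listFiles((dir, name) -> name.endsWith(".jar"));
    if (jars != null) {
      for (File jar : jars) {
        urls.add(jar.toURI().toURL()); // each staged jar becomes a classpath entry
      }
    }
    try (URLClassLoader loader = new URLClassLoader(
        urls.toArray(new URL[0]), StagedJarProbe.class.getClassLoader())) {
      return loader.getResource(className.replace('.', '/') + ".class") != null;
    }
  }
}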

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-18 05:45:59
      -

      *Thread Reply:* Can't help you now, but I'd love if you dumped the knowledge you've gained through this process into some doc on new OpenLineage doc site 🙏

      +

      *Thread Reply:* Can't help you now, but I'd love if you dumped the knowledge you've gained through this process into some doc on new OpenLineage doc site 🙏

      @@ -62876,7 +62882,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-18 05:48:15
      -

      *Thread Reply:* We'll definitely put all of it together as a reference for others, and hopefully have a solution by the end of it too

      +

      *Thread Reply:* We'll definitely put all of it together as a reference for others, and hopefully have a solution by the end of it too

      @@ -63025,7 +63031,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 16:40:48
      -

*Thread Reply:* Soo cool, @David Cecchi 💯💯💯. I’m not familiar with MarkLogic, but pretty awesome ETL platform and the lineage graph looks 👌! Did you have to write any custom integration code? Or were you able to use our off-the-shelf integrations to get things working? (Also, thanks for sharing!)

+

*Thread Reply:* Soo cool, @David Cecchi 💯💯💯. I’m not familiar with MarkLogic, but pretty awesome ETL platform and the lineage graph looks 👌! Did you have to write any custom integration code? Or were you able to use our off-the-shelf integrations to get things working? (Also, thanks for sharing!)

      @@ -63051,7 +63057,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 16:57:29
      -

      *Thread Reply:* team had to write some custom stuff but it's all framework so it can be repurposed not rewritten over and over. i would see this as another "Platform" in the context of the integrations semantic OL uses, so no, we didn't start w/ an existing solution. just used internal hooks and then called lineage APIs.

      +

      *Thread Reply:* team had to write some custom stuff but it's all framework so it can be repurposed not rewritten over and over. i would see this as another "Platform" in the context of the integrations semantic OL uses, so no, we didn't start w/ an existing solution. just used internal hooks and then called lineage APIs.

      @@ -63077,7 +63083,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 17:02:53
      -

*Thread Reply:* Ah, that totally makes sense. Would you be open to a brief presentation and/or demo in a future OL community meeting? The community is always looking to hear how OL is used in the wild, and this seems aligned with that (assuming you can talk about the implementation at a high-level)

+

*Thread Reply:* Ah, that totally makes sense. Would you be open to a brief presentation and/or demo in a future OL community meeting? The community is always looking to hear how OL is used in the wild, and this seems aligned with that (assuming you can talk about the implementation at a high-level)

      @@ -63103,7 +63109,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 17:05:35
      -

      *Thread Reply:* No pressure, of course 😉

      +

      *Thread Reply:* No pressure, of course 😉

      @@ -63129,7 +63135,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 17:08:50
      -

      *Thread Reply:* ha not feeling any pressure. familiar with the intentions and dynamic. let's keep that on radar - i don't keep tabs on community meetings but mid/late august would be workable. and to be clear, this is being used in the wild in a sandbox 🙂.

      +

      *Thread Reply:* ha not feeling any pressure. familiar with the intentions and dynamic. let's keep that on radar - i don't keep tabs on community meetings but mid/late august would be workable. and to be clear, this is being used in the wild in a sandbox 🙂.

      @@ -63155,7 +63161,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 17:12:55
      -

      *Thread Reply:* Sounds great, and a reasonable timeline! (cc @Michael Robinson can follow up). Even if it’s in a sandbox, talking about the level of effort helps with improving our APIs or sharing with others how smooth it can be!

      +

      *Thread Reply:* Sounds great, and a reasonable timeline! (cc @Michael Robinson can follow up). Even if it’s in a sandbox, talking about the level of effort helps with improving our APIs or sharing with others how smooth it can be!

      @@ -63185,7 +63191,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 17:18:27
      -

      *Thread Reply:* chiming in as well to say this is really cool 👍

      +

      *Thread Reply:* chiming in as well to say this is really cool 👍

      @@ -63211,7 +63217,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-13 18:26:28
      -

      *Thread Reply:* Nice! Would this become a product feature in Marklogic Data Hub?

      +

      *Thread Reply:* Nice! Would this become a product feature in Marklogic Data Hub?

      @@ -63237,7 +63243,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 11:07:42
      -

      *Thread Reply:* MarkLogic is a multi-model database and search engine. This implementation triggers off the MarkLogic Datahub Github batch records created when running the datahub flows. Just a toe in the water so far.

      +

      *Thread Reply:* MarkLogic is a multi-model database and search engine. This implementation triggers off the MarkLogic Datahub Github batch records created when running the datahub flows. Just a toe in the water so far.

      @@ -63327,7 +63333,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 20:37:25
      -

      *Thread Reply:* That sounds like a good idea to me. Be good to have some guidance on that.

      +

      *Thread Reply:* That sounds like a good idea to me. Be good to have some guidance on that.

      The repo is open for business! Feel free to add the page where you think it fits.

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -
      2022-07-14 20:42:09
      -

      *Thread Reply:* OK! Let’s do this!

      +

      *Thread Reply:* OK! Let’s do this!

      @@ -63414,7 +63420,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 20:59:36
      -

      *Thread Reply:* @Ross Turk, feel free to assign to me https://github.com/OpenLineage/docs/issues/1!

      +

      *Thread Reply:* @Ross Turk, feel free to assign to me https://github.com/OpenLineage/docs/issues/1!

      @@ -63527,7 +63533,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 20:43:09
      -

      *Thread Reply:* Thanks, @Ross Turk for finding a home for more technical / how-to docs… long overdue 💯

      +

      *Thread Reply:* Thanks, @Ross Turk for finding a home for more technical / how-to docs… long overdue 💯

      @@ -63553,7 +63559,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 21:22:09
      -

      *Thread Reply:* BTW you can see the current site at http://openlineage.io/docs/ - merges to main will ship a new site.

      +

      *Thread Reply:* BTW you can see the current site at http://openlineage.io/docs/ - merges to main will ship a new site.

      openlineage.io
      @@ -63600,7 +63606,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 21:23:32
      -

*Thread Reply:* great, was using docs.openlineage.io … we’ll eventually want the docs to live under the docs subdomain though?

+

*Thread Reply:* great, was using docs.openlineage.io … we’ll eventually want the docs to live under the docs subdomain though?

      @@ -63626,7 +63632,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 21:25:32
      -

      *Thread Reply:* TBH I activated GitHub Pages on the repo expecting it to live at openlineage.github.io/docs, thinking we could look at it there before it's ready to be published and linked in to the website

      +

      *Thread Reply:* TBH I activated GitHub Pages on the repo expecting it to live at openlineage.github.io/docs, thinking we could look at it there before it's ready to be published and linked in to the website

      @@ -63652,7 +63658,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 21:25:39
      -

      *Thread Reply:* and it came live at openlineage.io/docs 😄

      +

      *Thread Reply:* and it came live at openlineage.io/docs 😄

      @@ -63678,7 +63684,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 21:26:06
      -

      *Thread Reply:* nice and sounds good 👍

      +

      *Thread Reply:* nice and sounds good 👍

      @@ -63704,7 +63710,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-14 21:26:31
      -

      *Thread Reply:* still do not understand why, but I'll take it as a happy accident. we can move to docs.openlineage.io easily - just need to add the A record in the LF infra + the CNAME file in the static dir of this repo

      +

      *Thread Reply:* still do not understand why, but I'll take it as a happy accident. we can move to docs.openlineage.io easily - just need to add the A record in the LF infra + the CNAME file in the static dir of this repo

      @@ -63808,7 +63814,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-15 14:41:30
      -

      *Thread Reply:* yes - openlineage is job -> dataset -> job. particularly, the model is designed to observe the movement of data

      +

      *Thread Reply:* yes - openlineage is job -> dataset -> job. particularly, the model is designed to observe the movement of data

      @@ -63834,7 +63840,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-15 14:43:41
      -

      *Thread Reply:* the spec is based around run events, which are observed states of job runs. jobs are observed to see how they affect datasets, and that relationship is what OpenLineage traces

      +

      *Thread Reply:* the spec is based around run events, which are observed states of job runs. jobs are observed to see how they affect datasets, and that relationship is what OpenLineage traces
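
In other words, lineage is assembled from observed run states. An abridged START event (all values illustrative) looks like:

{
  "eventType": "START",
  "eventTime": "2022-07-15T14:40:00Z",
  "run": { "runId": "3f5e83fa-3480-44ff-8f84-c2a2a3091c3b" },
  "job": { "namespace": "my-namespace", "name": "my-job" },
  "inputs": [ { "namespace": "my-namespace", "name": "input_table" } ],
  "outputs": [ { "namespace": "my-namespace", "name": "output_table" } ],
  "producer": "https://example.com/my-producer"
}

A later COMPLETE (or FAIL/ABORT) event with the same runId closes out the run.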

      @@ -63938,7 +63944,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-18 15:16:54
      -

*Thread Reply:* This thread covers glue in some detail: https://openlineage.slack.com/archives/C01CK9T7HKR/p1637605977118000?threadts=1637605977.118000&cid=C01CK9T7HKR

+

*Thread Reply:* This thread covers glue in some detail: https://openlineage.slack.com/archives/C01CK9T7HKR/p1637605977118000?threadts=1637605977.118000&cid=C01CK9T7HKR

      @@ -63999,7 +64005,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-18 15:17:49
      -

*Thread Reply:* TL;DR: you can use the spark integration to capture some lineage, but it's not comprehensive

+

*Thread Reply:* TL;DR: you can use the spark integration to capture some lineage, but it's not comprehensive

      @@ -64025,7 +64031,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-18 16:29:02
      -

      *Thread Reply:* i suspect there will be opportunities to influence AWS to be a "fast follower" if OL adoption and buy-in starts to feel authentically real in non-aws portions of the stack. i discussed OL casually with AWS analytics leadership (Rahul Pathak) last winter and he seemed curious and open to this type of idea. to be clear, ~95% chance he's forgotten that conversation now but hey it's still something.

      +

      *Thread Reply:* i suspect there will be opportunities to influence AWS to be a "fast follower" if OL adoption and buy-in starts to feel authentically real in non-aws portions of the stack. i discussed OL casually with AWS analytics leadership (Rahul Pathak) last winter and he seemed curious and open to this type of idea. to be clear, ~95% chance he's forgotten that conversation now but hey it's still something.

      @@ -64055,7 +64061,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-18 19:34:32
      -

      *Thread Reply:* There are a couple of aws people here (including me) following.

      +

      *Thread Reply:* There are a couple of aws people here (including me) following.

      @@ -64120,7 +64126,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 20:11:38
      -

      *Thread Reply:* I believe what you want is the DocumentationJobFacet. It adds a description property to a job.

      +

      *Thread Reply:* I believe what you want is the DocumentationJobFacet. It adds a description property to a job.
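
For reference, the job section of a run event carrying that facet would look roughly like this (namespace, name, and producer are placeholders; the schema URL follows the spec's usual pattern):

"job": {
  "namespace": "my-namespace",
  "name": "my-job",
  "facets": {
    "documentation": {
      "_producer": "https://example.com/my-producer",
      "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/DocumentationJobFacet.json",
      "description": "Nightly job that refreshes the orders table."
    }
  }
}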

      @@ -64146,7 +64152,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 20:13:03
      -

      *Thread Reply:* You can see a Python example here, in the Airflow integration: https://github.com/OpenLineage/OpenLineage/blob/65a5f021a1ba3035d5198e759587737a05b242e1/integration/airflow/openlineage/airflow/adapter.py#L217

      +

      *Thread Reply:* You can see a Python example here, in the Airflow integration: https://github.com/OpenLineage/OpenLineage/blob/65a5f021a1ba3035d5198e759587737a05b242e1/integration/airflow/openlineage/airflow/adapter.py#L217

      @@ -64176,7 +64182,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 20:13:18
      -

      *Thread Reply:* (looking for a curl example…)

      +

      *Thread Reply:* (looking for a curl example…)

      @@ -64202,7 +64208,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 20:25:49
      -

      *Thread Reply:* I see, so there are special facet keys which will get translated into something special in the ui, is that correct?

      +

      *Thread Reply:* I see, so there are special facet keys which will get translated into something special in the ui, is that correct?

      Are these documented anywhere?

      @@ -64230,7 +64236,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 20:27:55
      -

      *Thread Reply:* Correct - info from the various OpenLineage facets are used in the Marquez UI.

      +

      *Thread Reply:* Correct - info from the various OpenLineage facets are used in the Marquez UI.

      @@ -64256,7 +64262,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 20:28:28
      -

      *Thread Reply:* I couldn’t find a curl example with a description field, but I did generate this one with a sql field:

      +

      *Thread Reply:* I couldn’t find a curl example with a description field, but I did generate this one with a sql field:

      { "job": { @@ -64309,7 +64315,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 20:28:58
      -

      *Thread Reply:* The facets (at least, those in the core spec) are here: https://github.com/OpenLineage/OpenLineage/tree/65a5f021a1ba3035d5198e759587737a05b242e1/spec/facets

      +

      *Thread Reply:* The facets (at least, those in the core spec) are here: https://github.com/OpenLineage/OpenLineage/tree/65a5f021a1ba3035d5198e759587737a05b242e1/spec/facets

      @@ -64335,7 +64341,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 20:29:19
      -

      *Thread Reply:* it’s designed so that facets can exist outside the core, in other repos, as well

      +

      *Thread Reply:* it’s designed so that facets can exist outside the core, in other repos, as well

      @@ -64361,7 +64367,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 22:25:39
      -

      *Thread Reply:* Thank you for sharing these, I was able to get the sql query highlighting to work. But I failed to get the location link or the documentation to work. My facet attempt looked like: +

      *Thread Reply:* Thank you for sharing these, I was able to get the sql query highlighting to work. But I failed to get the location link or the documentation to work. My facet attempt looked like: { "facets": { "description": "test-description-job", @@ -64405,7 +64411,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-19 22:36:55
      -

*Thread Reply:* I got the documentation link to work by renaming the property from documentation -> description. I still haven't been able to get the external link to work

+

*Thread Reply:* I got the documentation link to work by renaming the property from documentation -> description. I still haven't been able to get the external link to work

      @@ -64512,7 +64518,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-21 16:25:47
      -

      *Thread Reply:* There’s a good integration between OL and Databricks for pulling metadata out of running Spark clusters. But there’s not currently a connection between OL and the Unity Catalog.

      +

      *Thread Reply:* There’s a good integration between OL and Databricks for pulling metadata out of running Spark clusters. But there’s not currently a connection between OL and the Unity Catalog.

      I think it would be cool to see some discussions start to develop around it 👍

      @@ -64544,7 +64550,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-21 16:26:44
      -

      *Thread Reply:* Absolutely. I saw some mention of APIs and access, and was wondering if maybe they used OpenLineage as a framework, which would be awesome.

      +

      *Thread Reply:* Absolutely. I saw some mention of APIs and access, and was wondering if maybe they used OpenLineage as a framework, which would be awesome.

      @@ -64570,7 +64576,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-21 16:30:55
      -

      *Thread Reply:* (and since Azure Databricks uses it - https://openlineage.io/blog/openlineage-microsoft-purview/ I wasn’t sure about Unity Catalog)

      +

      *Thread Reply:* (and since Azure Databricks uses it - https://openlineage.io/blog/openlineage-microsoft-purview/ I wasn’t sure about Unity Catalog)

      openlineage.io
      @@ -64621,7 +64627,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-21 16:56:24
      -

      *Thread Reply:* We're in the early stages of discussion regarding an OpenLineage integration for Unity. You showing interest would help increase the priority of that on the DB side.

      +

      *Thread Reply:* We're in the early stages of discussion regarding an OpenLineage integration for Unity. You showing interest would help increase the priority of that on the DB side.

      @@ -64651,7 +64657,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-27 11:41:48
      -

      *Thread Reply:* I'm interested in Databricks enabling an openlineage endpoint, serving as a catalogue. Similar to how they provide hosted MLFlow. I can mention this to our Databricks reps as well

      +

      *Thread Reply:* I'm interested in Databricks enabling an openlineage endpoint, serving as a catalogue. Similar to how they provide hosted MLFlow. I can mention this to our Databricks reps as well

      @@ -64706,7 +64712,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-23 04:12:26
      -

      *Thread Reply:* Link to spec where I looked https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.json

      +

      *Thread Reply:* Link to spec where I looked https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.json

      @@ -64993,7 +64999,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-23 04:37:11
      -

*Thread Reply:* My bad. I realize now that column lineage has been implemented as a facet, hence not visible in the main spec https://github.com/OpenLineage/OpenLineage/search?q=ColumnLineageDatasetFacet&type=

+

*Thread Reply:* My bad. I realize now that column lineage has been implemented as a facet, hence not visible in the main spec https://github.com/OpenLineage/OpenLineage/search?q=ColumnLineageDatasetFacet&type=

      @@ -65023,7 +65029,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-26 19:37:54
      -

      *Thread Reply:* It is supported in the Spark integration

      +

      *Thread Reply:* It is supported in the Spark integration

      @@ -65049,7 +65055,7 @@

      [2022-05-19, 15:24:50 UTC] {taskinstance.py:1357} INFO -

      2022-07-26 19:39:13
      -

      *Thread Reply:* @Paweł Leszczyński could you add the Column Lineage facet here in the spec? https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md#standard-facets

      +

      *Thread Reply:* @Paweł Leszczyński could you add the Column Lineage facet here in the spec? https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md#standard-facets

      @@ -65116,7 +65122,7 @@

      SundayFunday

      2022-07-24 16:26:59
      -

      *Thread Reply:* @Ross Turk I still want to contribute something like this to the OpenLineage docs / new site but the bar for an internal doc is lower in my mind 😅

      +

      *Thread Reply:* @Ross Turk I still want to contribute something like this to the OpenLineage docs / new site but the bar for an internal doc is lower in my mind 😅

      @@ -65142,7 +65148,7 @@

      SundayFunday

      2022-07-25 11:49:54
      -

      *Thread Reply:* 😄

      +

      *Thread Reply:* 😄

      @@ -65168,7 +65174,7 @@

      SundayFunday

      2022-07-25 11:50:54
      -

      *Thread Reply:* @Will Johnson happy to help you with docs, when the time comes! sketching outline --> editing, whatever you need

      +

      *Thread Reply:* @Will Johnson happy to help you with docs, when the time comes! sketching outline --> editing, whatever you need

      @@ -65194,7 +65200,7 @@

      SundayFunday

      2022-07-26 19:39:56
      -

      *Thread Reply:* This looks nice by the way.

      +

      *Thread Reply:* This looks nice by the way.

      @@ -65268,7 +65274,7 @@

      SundayFunday

      2022-07-26 10:39:57
      -

      *Thread Reply:* do you have proper execute permissions?

      +

      *Thread Reply:* do you have proper execute permissions?

      @@ -65294,7 +65300,7 @@

      SundayFunday

      2022-07-26 10:41:09
      -

      *Thread Reply:* not sure how that works on windows, but it just looks like it does not recognize dbt-ol as executable

      +

      *Thread Reply:* not sure how that works on windows, but it just looks like it does not recognize dbt-ol as executable

      @@ -65320,7 +65326,7 @@

      SundayFunday

      2022-07-26 10:43:00
      -

*Thread Reply:* yes, I have admin rights. How do I make this executable?

+

*Thread Reply:* yes, I have admin rights. How do I make this executable?

      @@ -65346,7 +65352,7 @@

      SundayFunday

      2022-07-26 10:43:25
      -

      *Thread Reply:* btw do we have a sample docker image where dbt-ol can run?

      +

      *Thread Reply:* btw do we have a sample docker image where dbt-ol can run?

      @@ -65372,7 +65378,7 @@

      SundayFunday

      2022-07-26 17:33:08
      -

      *Thread Reply:* I have also never tried on Windows 😕 but you might try python3 dbt-ol run?

      +

      *Thread Reply:* I have also never tried on Windows 😕 but you might try python3 dbt-ol run?

      @@ -65398,7 +65404,7 @@

      SundayFunday

      2022-07-26 21:03:43
      -

      *Thread Reply:* will try that

      +

      *Thread Reply:* will try that

      @@ -65480,7 +65486,7 @@

      SundayFunday

      2022-07-27 01:54:31
      -

*Thread Reply:* This may be a result of splitting the Spark integration into multiple submodules: app, shared, spark2, spark3, spark32, etc. If the test case is from the shared submodule (this one looks like that), you could try running:

+

*Thread Reply:* This may be a result of splitting the Spark integration into multiple submodules: app, shared, spark2, spark3, spark32, etc. If the test case is from the shared submodule (this one looks like that), you could try running: ./gradlew :shared:test --tests io.openlineage.spark.agent.OpenLineageSparkListenerTest

      @@ -65507,7 +65513,7 @@

      SundayFunday

      2022-07-27 03:18:42
      -

      *Thread Reply:* @Paweł Leszczyński, I tried running that command, and I get the following error:

      +

      *Thread Reply:* @Paweł Leszczyński, I tried running that command, and I get the following error:

      ```> Task :shared:test FAILED

      @@ -65557,7 +65563,7 @@

      SundayFunday

      2022-07-27 03:24:41
      -

      *Thread Reply:* When running build and test for all the submodules, I can see outputs for tests in different submodules (spark3, spark2 etc), but for some reason, I cannot find any indication that the tests in +

      *Thread Reply:* When running build and test for all the submodules, I can see outputs for tests in different submodules (spark3, spark2 etc), but for some reason, I cannot find any indication that the tests in OpenLineage/integration/spark/app/src/test/java/io/openlineage/spark/agent/lifecycle/plan are being run at all.

      @@ -65585,7 +65591,7 @@

      SundayFunday

      2022-07-27 03:42:43
      -

      *Thread Reply:* That’s interesting. Let’s ask @Tomasz Nazarewicz about that.

      +

      *Thread Reply:* That’s interesting. Let’s ask @Tomasz Nazarewicz about that.

      @@ -65615,11 +65621,11 @@

      SundayFunday

      2022-07-27 03:57:08
      -

      *Thread Reply:* For reference, I attached the stdout and stderr messages from running the following: +

*Thread Reply:* For reference, I attached the stdout and stderr messages from running the following: ./gradlew :shared:spotlessApply && ./gradlew :app:spotlessApply && ./gradlew clean build test

      - + @@ -65651,7 +65657,7 @@

      SundayFunday

      2022-07-27 04:27:23
      -

      *Thread Reply:* I'll look into it

      +

      *Thread Reply:* I'll look into it

      @@ -65677,7 +65683,7 @@

      SundayFunday

      2022-07-28 05:17:36
      -

*Thread Reply:* Update: some tests appeared not to be visible after the split; that's fixed, but now I have to solve some dependency issues

+

*Thread Reply:* Update: some tests appeared not to be visible after the split; that's fixed, but now I have to solve some dependency issues

      @@ -65707,7 +65713,7 @@

      SundayFunday

      2022-07-28 05:19:16
      -

      *Thread Reply:* That's great, thank you!

      +

      *Thread Reply:* That's great, thank you!

      @@ -65733,7 +65739,7 @@

      SundayFunday

      2022-07-29 06:05:55
      -

      *Thread Reply:* Hi Tomasz, thanks so much for looking into this. Is this your PR (https://github.com/OpenLineage/OpenLineage/pull/953) that fixes the whole issue, or is there still some work to do to solve the dependency issues you mentioned?

      +

      *Thread Reply:* Hi Tomasz, thanks so much for looking into this. Is this your PR (https://github.com/OpenLineage/OpenLineage/pull/953) that fixes the whole issue, or is there still some work to do to solve the dependency issues you mentioned?

      @@ -65799,7 +65805,7 @@

      SundayFunday

      2022-07-29 06:07:58
      -

      *Thread Reply:* I'm still testing it, should've changed it to draft, sorry

      +

      *Thread Reply:* I'm still testing it, should've changed it to draft, sorry

      @@ -65829,7 +65835,7 @@

      SundayFunday

      2022-07-29 06:08:59
      -

      *Thread Reply:* No worries! If I can help with testing or anything please let me know!

      +

      *Thread Reply:* No worries! If I can help with testing or anything please let me know!

      @@ -65855,7 +65861,7 @@

      SundayFunday

      2022-07-29 06:09:29
      -

      *Thread Reply:* Will do! Thanks :)

      +

      *Thread Reply:* Will do! Thanks :)

      @@ -65881,7 +65887,7 @@

      SundayFunday

      2022-08-02 11:06:31
      -

      *Thread Reply:* Hi @Tomasz Nazarewicz, if possible, could you please share an estimated timeline for resolving the issue? We have 3 PRs which we are either waiting to open or to update which are dependent on the tests.

      +

      *Thread Reply:* Hi @Tomasz Nazarewicz, if possible, could you please share an estimated timeline for resolving the issue? We have 3 PRs which we are either waiting to open or to update which are dependent on the tests.

      @@ -65907,7 +65913,7 @@

      SundayFunday

      2022-08-02 13:45:34
      -

      *Thread Reply:* @Hanna Moazam hi, it's quite difficult to do that because the issue is that all the tests are passing when I execute ./gradlew app:test +

      *Thread Reply:* @Hanna Moazam hi, it's quite difficult to do that because the issue is that all the tests are passing when I execute ./gradlew app:test but one is failing with ./gradlew app:build

      but if it fixes your problem I can disable this test for now and make a PR without it, then you can maybe unblock your stuff and I will have more time to investigate the issue.

      @@ -65936,7 +65942,7 @@

      SundayFunday

      2022-08-02 14:54:45
      -

*Thread Reply:* Oh, that's a strange issue. Yes, that would be really helpful if you can, because we have some tests we implemented and we need to make sure they pass as expected.

+

*Thread Reply:* Oh, that's a strange issue. Yes, that would be really helpful if you can, because we have some tests we implemented and we need to make sure they pass as expected.

      @@ -65962,7 +65968,7 @@

      SundayFunday

      2022-08-02 14:54:52
      -

      *Thread Reply:* Thank you for your help Tomasz!

      +

      *Thread Reply:* Thank you for your help Tomasz!

      @@ -65988,7 +65994,7 @@

      SundayFunday

      2022-08-03 06:12:07
      -

      *Thread Reply:* @Hanna Moazam https://github.com/OpenLineage/OpenLineage/pull/980 here is the pull request with the changes

      +

      *Thread Reply:* @Hanna Moazam https://github.com/OpenLineage/OpenLineage/pull/980 here is the pull request with the changes

      @@ -66060,7 +66066,7 @@

      SundayFunday

      2022-08-03 06:12:26
      -

*Thread Reply:* it's waiting for review currently

+

*Thread Reply:* it's waiting for review currently

      @@ -66086,7 +66092,7 @@

      SundayFunday

      2022-08-03 06:20:41
      -

      *Thread Reply:* Thank you!

      +

      *Thread Reply:* Thank you!

      @@ -66261,7 +66267,7 @@

      SundayFunday

      2022-07-26 19:41:13
      -

      *Thread Reply:* The doc site would benefit from a page about it. Maybe @Paweł Leszczyński?

      +

      *Thread Reply:* The doc site would benefit from a page about it. Maybe @Paweł Leszczyński?

      @@ -66287,7 +66293,7 @@

      SundayFunday

      2022-07-27 01:59:27
      -

      *Thread Reply:* Sure, it’s already on my list, will do

      +

      *Thread Reply:* Sure, it’s already on my list, will do

      @@ -66317,7 +66323,7 @@

      SundayFunday

      2022-07-29 07:55:40
      -

      *Thread Reply:* https://openlineage.io/docs/integrations/spark/spark_column_lineage

      +

      *Thread Reply:* https://openlineage.io/docs/integrations/spark/spark_column_lineage

      openlineage.io
      @@ -66403,7 +66409,7 @@

      SundayFunday

      2022-07-27 04:08:18
      -

      *Thread Reply:* To be honest, I have never seen that in action and would love to have that in our documentation.

      +

      *Thread Reply:* To be honest, I have never seen that in action and would love to have that in our documentation.

      @Michael Collado or @Maciej Obuchowski: are you able to create some doc? I think one of you was working on that.

      @@ -66431,7 +66437,7 @@

      SundayFunday

      2022-07-27 04:24:19
      -

      *Thread Reply:* Yes, parent run

      +

      *Thread Reply:* Yes, parent run

      @@ -66487,7 +66493,7 @@

      SundayFunday

      2022-07-27 04:47:18
      -

      *Thread Reply:* Will take a look

      +

      *Thread Reply:* Will take a look

      @@ -66513,7 +66519,7 @@

      SundayFunday

      2022-07-27 04:47:40
      -

      *Thread Reply:* But generally this support message is just a warning

      +

      *Thread Reply:* But generally this support message is just a warning

      @@ -66539,7 +66545,7 @@

      SundayFunday

      2022-07-27 10:04:20
      -

      *Thread Reply:* @shweta p any actual error you've found? +

*Thread Reply:* @shweta p any actual error you've found? I've tested it with dbt-bigquery on 1.1.0 and it works despite the warning:

      ➜ small OPENLINEAGE_URL=<http://localhost:5050> dbt-ol build @@ -66620,7 +66626,7 @@

      SundayFunday

      2022-07-27 20:41:44
      -

      *Thread Reply:* I think it's safe to say we'll see a release by the end of next week

      +

      *Thread Reply:* I think it's safe to say we'll see a release by the end of next week

      @@ -66719,7 +66725,7 @@

      SundayFunday

      2022-07-28 08:50:44
      -

      *Thread Reply:* Welcome to the community!

      +

      *Thread Reply:* Welcome to the community!

      We talked about this exact topic in the most recent community call. https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting#MonthlyTSCmeeting-Nextmeeting:Nov10th2021(9amPT)

      @@ -66764,7 +66770,7 @@

      SundayFunday

      2022-07-28 09:56:05
      -

      *Thread Reply:* > or do you mean reporting start job/end job should be processed each event? +

      *Thread Reply:* > or do you mean reporting start job/end job should be processed each event? We definitely want to avoid tracking every single event 🙂

      One thing worth mentioning is that OpenLineage events are meant to be cumulative - the streaming jobs start, run, and eventually finish or restart. In the meantime, we capture additional events "in the middle" - for example, on Apache Flink checkpoint, or every few minutes - where we can emit additional information connected to the state of the job.

      @@ -66797,7 +66803,7 @@

      SundayFunday

      2022-07-28 11:11:17
      -

      *Thread Reply:* @Will Johnson and @Maciej Obuchowski Thank you for your answer

      +

      *Thread Reply:* @Will Johnson and @Maciej Obuchowski Thank you for your answer

      jobs start, run, and eventually finish or restart

      @@ -66836,7 +66842,7 @@

      SundayFunday

      2022-07-28 11:11:36
      -

      *Thread Reply:*

      +

      *Thread Reply:*

      @@ -66871,7 +66877,7 @@

      SundayFunday

      2022-07-28 12:00:34
      -

*Thread Reply:* The idea is that jobs usually get upgraded - for example, you change the Apache Flink version, increase resources, or change the structure of a job - that's the difference for us. The stop events make sense, because if you, for example, changed the SQL of your Flink SQL job, you probably would want this to be captured - from X to Y the job was running well with the older SQL version, but after the change, the second run started and throughput dropped to 10% of the previous one.

+

*Thread Reply:* The idea is that jobs usually get upgraded - for example, you change the Apache Flink version, increase resources, or change the structure of a job - that's the difference for us. The stop events make sense, because if you, for example, changed the SQL of your Flink SQL job, you probably would want this to be captured - from X to Y the job was running well with the older SQL version, but after the change, the second run started and throughput dropped to 10% of the previous one.

> if you do start/stop events from the checkpoints on Flink it might be the wrong representation instead use the concept of event-driven for example reporting state. But this is a misunderstanding 🙂

@@ -66903,7 +66909,7 @@

      SundayFunday

      2022-07-28 12:01:16
      -

      *Thread Reply:* The checkpoint would be captured as a new eventType: RUNNING - do I miss something why you want to add StateFacet?

      +

      *Thread Reply:* The checkpoint would be captured as a new eventType: RUNNING - do I miss something why you want to add StateFacet?
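
A sketch of what such an in-the-middle checkpoint event could look like (RUNNING was a proposed eventType at the time of this thread; all identifiers are made up):

{
  "eventType": "RUNNING",
  "eventTime": "2022-07-28T12:00:00Z",
  "run": { "runId": "d46e465b-d358-4d32-83d4-df660ff614dd" },
  "job": { "namespace": "streaming", "name": "orders_enrichment" },
  "producer": "https://example.com/my-producer"
}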

      @@ -66933,7 +66939,7 @@

      SundayFunday

      2022-07-28 14:24:03
      -

*Thread Reply:* About the argument - it depends on what the definition of a job in streaming mode is; I agree that if you already have a ‘job’, you want to know more information about it.

+

*Thread Reply:* About the argument - it depends on what the definition of a job in streaming mode is; I agree that if you already have a ‘job’, you want to know more information about it.

should each event that enters the sub-process (job) make a “Start job” and “End job” REST call?

      @@ -66999,7 +67005,7 @@

      SundayFunday

      2022-07-28 17:30:33
      -

      *Thread Reply:* Thanks for the +1s. We will initiate the release by Tuesday.

      +

      *Thread Reply:* Thanks for the +1s. We will initiate the release by Tuesday.

      @@ -67054,7 +67060,7 @@

      SundayFunday

      2022-08-03 08:30:15
      -

      *Thread Reply:* I think this is an interesting idea, however, just the static analysis does not convey any runtime information.

      +

      *Thread Reply:* I think this is an interesting idea, however, just the static analysis does not convey any runtime information.

      We're doing something similar within Airflow now, but as a fallback mechanism: https://github.com/OpenLineage/OpenLineage/pull/914

      @@ -67084,7 +67090,7 @@

      SundayFunday

      2022-08-03 14:25:31
      -

*Thread Reply:* Thanks for the detailed response @Maciej Obuchowski! It seems like this solution is specific only to Airflow, and I wonder why we wouldn't generalize this outside of just Airflow? My thinking is that there are other areas where there is vast scope (e.g. arbitrary code that does data manipulations), and without such an option, the only path is to provide full runtime information via building your own extractor, which might be a bit hard/expensive to do.

+

*Thread Reply:* Thanks for the detailed response @Maciej Obuchowski! It seems like this solution is specific only to Airflow, and I wonder why we wouldn't generalize this outside of just Airflow? My thinking is that there are other areas where there is vast scope (e.g. arbitrary code that does data manipulations), and without such an option, the only path is to provide full runtime information via building your own extractor, which might be a bit hard/expensive to do. If I understand your response correctly, then you assume that OpenLineage can get wide enough "native" support across the stack without resorting to a fallback like 'static code analysis'. Is that your base assumption?

      @@ -67164,7 +67170,7 @@

      SundayFunday

      2022-07-29 12:58:38
      -

      *Thread Reply:* The query API has a different spec than the reporting API, so what you’d get from Marquez would look different from what Marquez receives.

      +

      *Thread Reply:* The query API has a different spec than the reporting API, so what you’d get from Marquez would look different from what Marquez receives.

A few ideas:

      @@ -67195,7 +67201,7 @@

      SundayFunday

      2022-07-30 16:29:24
      -

      *Thread Reply:* ok, now I understand, thank you

      +

      *Thread Reply:* ok, now I understand, thank you

      @@ -67225,7 +67231,7 @@

      SundayFunday

      2022-08-03 08:25:57
      -

      *Thread Reply:* FYI we want to have something like that too: https://github.com/MarquezProject/marquez/issues/1927

      +

      *Thread Reply:* FYI we want to have something like that too: https://github.com/MarquezProject/marquez/issues/1927

      But if you need just the raw events endpoint, without UI, then Marquez might be overkill for your needs

      SundayFunday
      2022-07-30 13:49:25
      -

*Thread Reply:* Short of implementing an OpenLineage endpoint in Amundsen, yes that's the right approach.

+

*Thread Reply:* Short of implementing an OpenLineage endpoint in Amundsen, yes that's the right approach.

The Lineage endpoint in Marquez can output the whole graph centered on a node ID, and you can use the jobs/datasets APIs to grab lists of each for reference.
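As a hedged sketch of querying that endpoint (the host, namespace, and dataset name below are placeholders):
```python
import requests

# nodeId format: "dataset:<namespace>:<name>" (or "job:<namespace>:<name>").
resp = requests.get(
    "http://localhost:5000/api/v1/lineage",
    params={"nodeId": "dataset:my-namespace:my_table"},
)
resp.raise_for_status()
# The response holds the lineage graph centered on that node; walk its
# nodes/edges, or cross-reference with the datasets/jobs list APIs.
print(resp.json())
```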

      @@ -67341,7 +67347,7 @@

      SundayFunday

      2022-07-31 00:35:06
      -

*Thread Reply:* Is your lineage information coming via OpenLineage? If so, you can quickly use the Amundsen scripts to load data into Amundsen; for example, see this script here: https://github.com/amundsen-io/amundsendatabuilder/blob/master/example/scripts/sample_data_loader.py

+

*Thread Reply:* Is your lineage information coming via OpenLineage? If so, you can quickly use the Amundsen scripts to load data into Amundsen; for example, see this script here: https://github.com/amundsen-io/amundsendatabuilder/blob/master/example/scripts/sample_data_loader.py

      Where is your lineage coming from?

      SundayFunday
      2022-08-01 20:17:22
      -

*Thread Reply:* yes @Barak F we are using OpenLineage

+

*Thread Reply:* yes @Barak F we are using OpenLineage

      @@ -67598,7 +67604,7 @@

      SundayFunday

      2022-08-02 01:26:18
      -

      *Thread Reply:* So, have you tried using Amundsen data builder scripts to load the lineage information into Amundsen? (maybe you'll have to "play" with those a bit)

      +

      *Thread Reply:* So, have you tried using Amundsen data builder scripts to load the lineage information into Amundsen? (maybe you'll have to "play" with those a bit)

      @@ -67624,7 +67630,7 @@

      SundayFunday

      2022-08-03 08:24:58
      -

      *Thread Reply:* AFAIK there is OpenLineage extractor: https://www.amundsen.io/amundsen/databuilder/#openlineagetablelineageextractor

      +

      *Thread Reply:* AFAIK there is OpenLineage extractor: https://www.amundsen.io/amundsen/databuilder/#openlineagetablelineageextractor

      Not sure it solves your issue though 🙂

      @@ -67652,7 +67658,7 @@

      SundayFunday

      2022-08-05 04:46:45
      -

      *Thread Reply:* thanks

      +

      *Thread Reply:* thanks

      @@ -67815,7 +67821,7 @@

      SundayFunday

      2022-08-02 12:28:11
      -

*Thread Reply:* I think the reason the server model is very generic is that new facets can be added later (also as custom facets) - and generally the server wants to accept all valid events and use the facet information it can actually handle, rather than reject an event because it has an unknown field.

+

*Thread Reply:* I think the reason the server model is very generic is that new facets can be added later (also as custom facets) - and generally the server wants to accept all valid events and use the facet information it can actually handle, rather than reject an event because it has an unknown field.

The server model was added here after some relevant discussion in Marquez - I think @Michael Collado @Willy Lulciuc can add to that.

      SundayFunday
      2022-08-02 15:54:24
      -

*Thread Reply:* Thanks for the response. I realize the server stubs were created to support flexibility, but it also makes the parsing logic on the server side a bit more complex, as we need to maintain code on the server side to look for specific facets & their properties from maps, or, like Marquez, duplicate the OL model on our end with the facets we care about. Wanted to know what's the guidance around managing this server side. @Willy Lulciuc @Michael Collado Any suggestions?

+

*Thread Reply:* Thanks for the response. I realize the server stubs were created to support flexibility, but it also makes the parsing logic on the server side a bit more complex, as we need to maintain code on the server side to look for specific facets & their properties from maps, or, like Marquez, duplicate the OL model on our end with the facets we care about. Wanted to know what's the guidance around managing this server side. @Willy Lulciuc @Michael Collado Any suggestions?

      @@ -67972,7 +67978,7 @@

      SundayFunday

      2022-08-03 04:45:36
      -

      *Thread Reply:* @Paweł Leszczyński do we collect column level lineage on renames?

      +

      *Thread Reply:* @Paweł Leszczyński do we collect column level lineage on renames?

      @@ -67998,7 +68004,7 @@

      SundayFunday

      2022-08-03 04:56:23
      -

*Thread Reply:* @Maciej Obuchowski no, we don’t.

+

*Thread Reply:* @Maciej Obuchowski no, we don’t. @Varun Singh create table as select may suit you well. Other examples are within tests like: • https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/spark3/src/[…]lifecycle/plan/column/ColumnLevelLineageUtilsV2CatalogTest.java • https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/spark3/src/[…]ecycle/plan/column/ColumnLevelLineageUtilsNonV2CatalogTest.java
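A small illustration of that suggestion, assuming a Spark session with the OpenLineage listener attached (table and column names are placeholders):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With the OpenLineage Spark listener attached, a CTAS like this should yield
# a columnLineage facet on new_table linking `renamed` back to
# old_table.original. (old_table/new_table are placeholder names.)
spark.sql("""
    CREATE TABLE new_table AS
    SELECT original AS renamed, value
    FROM old_table
""")
```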

      @@ -68075,7 +68081,7 @@

      SundayFunday

      2022-08-05 05:55:12
      -

*Thread Reply:* I’ve created an issue for column lineage in case of renaming:

+

*Thread Reply:* I’ve created an issue for column lineage in case of renaming:
https://github.com/OpenLineage/OpenLineage/issues/993

      @@ -68131,7 +68137,7 @@

      SundayFunday

      2022-08-08 09:37:43
      -

      *Thread Reply:* Thanks @Paweł Leszczyński!

      +

      *Thread Reply:* Thanks @Paweł Leszczyński!

      @@ -68183,7 +68189,7 @@

      SundayFunday

      2022-08-03 13:00:22
      -

      *Thread Reply:* Fivetran is a tool that copies data from source systems to target databases. One of these source systems might be SalesForce, for example.

      +

      *Thread Reply:* Fivetran is a tool that copies data from source systems to target databases. One of these source systems might be SalesForce, for example.

This copying results in thousands of SQL queries run against the target database for each sync. I don’t think each of these queries should map to an OpenLineage job; I think the entire synchronization should. Maybe I’m wrong here.

      @@ -68211,7 +68217,7 @@

      SundayFunday

      2022-08-03 13:01:00
      -

      *Thread Reply:* But if I’m right, that means that there needs to be a way to specify “SalesForce Account #45123452233” as a dataset.

      +

      *Thread Reply:* But if I’m right, that means that there needs to be a way to specify “SalesForce Account #45123452233” as a dataset.

      @@ -68237,7 +68243,7 @@

      SundayFunday

      2022-08-03 13:01:44
      -

      *Thread Reply:* or it ends up just being a job with outputs and no inputs…but that’s not very illuminating

      +

      *Thread Reply:* or it ends up just being a job with outputs and no inputs…but that’s not very illuminating

      @@ -68263,7 +68269,7 @@

      SundayFunday

      2022-08-03 13:02:27
      -

      *Thread Reply:* or is that good enough?

      +

      *Thread Reply:* or is that good enough?

      @@ -68289,7 +68295,7 @@

      SundayFunday

      2022-08-04 10:31:11
      -

      *Thread Reply:* You are looking at a pretty big topic here 🙂

      +

      *Thread Reply:* You are looking at a pretty big topic here 🙂

      Basically you're asking what is a job in OpenLineage - and it's not fully answered yet.

      @@ -68319,7 +68325,7 @@

      SundayFunday

      2022-08-04 15:50:22
      -

*Thread Reply:* my 2 cents on this is that in the Salesforce example, the system is too complex to capture as a single dataset, and so maybe different objects within a Salesforce account (org/account/opportunity/etc…) could be treated as individual datasets. But as @Maciej Obuchowski pointed out, this is quite a large topic 🙂

+

*Thread Reply:* my 2 cents on this is that in the Salesforce example, the system is too complex to capture as a single dataset, and so maybe different objects within a Salesforce account (org/account/opportunity/etc…) could be treated as individual datasets. But as @Maciej Obuchowski pointed out, this is quite a large topic 🙂

      @@ -68345,7 +68351,7 @@

      SundayFunday

      2022-08-08 13:46:31
      -

      *Thread Reply:* I guess it depends on whether you actually care about the table/column level lineage for an operation like “copy salesforce to snowflake”.

      +

      *Thread Reply:* I guess it depends on whether you actually care about the table/column level lineage for an operation like “copy salesforce to snowflake”.

      I can see it being a nuisance having all of that on a lineage graph. OTOH, I can see it being useful to know that a datum can be traced back to a specific endpoint at SFDC.

      @@ -68373,7 +68379,7 @@

      SundayFunday

      2022-08-08 13:46:55
      -

      *Thread Reply:* this is a design decision, IMO.

      +

      *Thread Reply:* this is a design decision, IMO.

      @@ -68567,7 +68573,7 @@

      SundayFunday

      2022-08-10 22:34:29
      -

      *Thread Reply:* I am so sad I'm going to miss this month's meeting 😰 Looking forward to the recording!

      +

      *Thread Reply:* I am so sad I'm going to miss this month's meeting 😰 Looking forward to the recording!

      @@ -68593,7 +68599,7 @@

      SundayFunday

      2022-08-12 06:19:58
      -

      *Thread Reply:* We missed you too @Will Johnson 😉

      +

      *Thread Reply:* We missed you too @Will Johnson 😉

      @@ -68688,7 +68694,7 @@

      SundayFunday

      2022-08-12 06:09:57
      -

*Thread Reply:* From the first look, you're missing the outputs field in your event - this might break something

+

*Thread Reply:* From the first look, you're missing the outputs field in your event - this might break something

      @@ -68714,7 +68720,7 @@

      SundayFunday

      2022-08-12 06:10:20
      -

      *Thread Reply:* If not, then Marquez logs might help to see something

      +

      *Thread Reply:* If not, then Marquez logs might help to see something

      @@ -68740,7 +68746,7 @@

      SundayFunday

      2022-08-12 13:12:56
      -

*Thread Reply:* Does the START event need to have an output?

+

*Thread Reply:* Does the START event need to have an output?

      @@ -68766,7 +68772,7 @@

      SundayFunday

      2022-08-12 13:19:24
      -

      *Thread Reply:* It can have empty output 🙂

      +

      *Thread Reply:* It can have empty output 🙂

      @@ -68792,7 +68798,7 @@

      SundayFunday

      2022-08-12 13:32:43
      -

      *Thread Reply:* well, in your case you need to send COMPLETE event

      +

      *Thread Reply:* well, in your case you need to send COMPLETE event

      @@ -68818,7 +68824,7 @@

      SundayFunday

      2022-08-12 13:33:44
      -

*Thread Reply:* Internally, Marquez does not create a dataset version until your COMPLETE event. It makes sense when your semantics are transactional - you can still read from the previous dataset version until it's finished writing.

+

*Thread Reply:* Internally, Marquez does not create a dataset version until your COMPLETE event. It makes sense when your semantics are transactional - you can still read from the previous dataset version until it's finished writing.

      @@ -68844,7 +68850,7 @@

      SundayFunday

      2022-08-12 13:34:06
      -

      *Thread Reply:* After I send COMPLETE event with the same information I can see the dataset.

      +

      *Thread Reply:* After I send COMPLETE event with the same information I can see the dataset.

      @@ -68879,7 +68885,7 @@

      SundayFunday

      2022-08-12 13:56:37
      -

*Thread Reply:* Thanks for the explanation @Maciej Obuchowski So, if I understand this correctly, I won't see the my-test-input dataset till I have the COMPLETE event with input and output?

+

*Thread Reply:* Thanks for the explanation @Maciej Obuchowski So, if I understand this correctly, I won't see the my-test-input dataset till I have the COMPLETE event with input and output?

      @@ -68905,7 +68911,7 @@

      SundayFunday

      2022-08-12 14:34:51
      -

*Thread Reply:* @Raj Mishra Yes and no 🙂

+

*Thread Reply:* @Raj Mishra Yes and no 🙂

Basically your COMPLETE event does not need to contain any input and output datasets at all - the OpenLineage model is cumulative, so it's enough to have datasets on either start or complete. That also means you can add different datasets at different moments of a run's lifecycle - for example, you know inputs, but not outputs, so you emit inputs on START, but not on COMPLETE.
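A minimal sketch of this cumulative pattern with the OpenLineage Python client of this era (the endpoint, namespace, and dataset names are placeholders):
```python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import (
    InputDataset, Job, OutputDataset, Run, RunEvent, RunState,
)

client = OpenLineageClient(url="http://localhost:5000")  # e.g. a local Marquez

job = Job(namespace="my-namespace", name="my-job")
run = Run(runId=str(uuid4()))
producer = "https://example.com/my-producer"  # placeholder

# START: we already know the inputs, but not the outputs yet.
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run, job=job, producer=producer,
    inputs=[InputDataset("my-namespace", "my-test-input")],
    outputs=[],
))

# COMPLETE: only the outputs; the backend merges both events cumulatively.
client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run, job=job, producer=producer,
    inputs=[],
    outputs=[OutputDataset("my-namespace", "my-test-output")],
))
```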

      @@ -68947,7 +68953,7 @@

      SundayFunday

      2022-08-12 14:47:56
      -

      *Thread Reply:* @Maciej Obuchowski Thank you so much! This is great explanation.

      +

      *Thread Reply:* @Maciej Obuchowski Thank you so much! This is great explanation.

      @@ -68999,7 +69005,7 @@

      SundayFunday

      2022-08-11 20:35:33
      -

*Thread Reply:* Would adding support for alias/grouping as a config on the OL client side be valuable to other users? i.e. the OL client could pass down an alias/grouping facet. Or should this be treated purely as a server-side feature?

+

*Thread Reply:* Would adding support for alias/grouping as a config on the OL client side be valuable to other users? i.e. the OL client could pass down an alias/grouping facet. Or should this be treated purely as a server-side feature?

      @@ -69025,7 +69031,7 @@

      SundayFunday

      2022-08-12 06:11:21
      -

      *Thread Reply:* Agreed 🙂

      +

      *Thread Reply:* Agreed 🙂

      How do you produce this dataset? Spark integration? Are you using any system like Apache Iceberg/Delta Lake or just writing raw files?

      @@ -69053,7 +69059,7 @@

      SundayFunday

      2022-08-12 12:59:48
      -

      *Thread Reply:* these are raw files written from Spark or map reduce jobs. And downstream Spark jobs read these raw files to produce tables

      +

      *Thread Reply:* these are raw files written from Spark or map reduce jobs. And downstream Spark jobs read these raw files to produce tables

      @@ -69079,7 +69085,7 @@

      SundayFunday

      2022-08-12 13:27:34
      -

*Thread Reply:* written using Spark dataframe API, like

+

*Thread Reply:* written using Spark dataframe API, like df.write.format("parquet").save("/tmp/spark_output/parquet") or RDD?

      @@ -69107,7 +69113,7 @@

      SundayFunday

      2022-08-12 13:27:59
      -

      *Thread Reply:* the actual API used matters, because we're handling different cases separately

      +

      *Thread Reply:* the actual API used matters, because we're handling different cases separately

      @@ -69133,7 +69139,7 @@

      SundayFunday

      2022-08-12 13:29:48
      -

      *Thread Reply:* I see. Let me look that up to be absolutely sure

      +

      *Thread Reply:* I see. Let me look that up to be absolutely sure

      @@ -69159,7 +69165,7 @@

      SundayFunday

      2022-08-12 19:21:41
      -

*Thread Reply:* It is like this: df.write.format("parquet").save("/tmp/spark_output/parquet")

+

*Thread Reply:* It is like this: df.write.format("parquet").save("/tmp/spark_output/parquet")

      @@ -69185,7 +69191,7 @@

      SundayFunday

      2022-08-15 12:43:45
      -

*Thread Reply:* @Maciej Obuchowski curious what you had in mind with respect to RDDs & Dataframes. Also, what if we cannot integrate OL with the frameworks that produce this dataset, but only those that consume from the already produced datasets? Is there a way we could still capture the dataset appropriately?

+

*Thread Reply:* @Maciej Obuchowski curious what you had in mind with respect to RDDs & Dataframes. Also, what if we cannot integrate OL with the frameworks that produce this dataset, but only those that consume from the already produced datasets? Is there a way we could still capture the dataset appropriately?

      @@ -69211,7 +69217,7 @@

      SundayFunday

      2022-08-16 05:30:57
      -

*Thread Reply:* @Sharanya Santhanam the naming should be consistent between reading and writing, so it wouldn't change much if you can't integrate OL into writers. For the rest, can you create an issue on OL GitHub so someone can pick it up? I'm on vacation now.

+

*Thread Reply:* @Sharanya Santhanam the naming should be consistent between reading and writing, so it wouldn't change much if you can't integrate OL into writers. For the rest, can you create an issue on OL GitHub so someone can pick it up? I'm on vacation now.

      @@ -69237,7 +69243,7 @@

      SundayFunday

      2022-08-16 15:08:41
      -

*Thread Reply:* Sounds good, ty!

+

*Thread Reply:* Sounds good, ty!

      @@ -69316,7 +69322,7 @@

      SundayFunday

      2022-08-12 06:09:11
      -

      *Thread Reply:* yes

      +

      *Thread Reply:* yes

      @@ -69342,7 +69348,7 @@

      SundayFunday

      2022-08-12 06:09:32
      -

      *Thread Reply:* please do create 🙂

      +

      *Thread Reply:* please do create 🙂

      @@ -69398,7 +69404,7 @@

      SundayFunday

      2022-08-17 13:54:19
      -

*Thread Reply:* it's per project.

+

*Thread Reply:* it's per project.
java based: ./gradlew test
python based: https://github.com/OpenLineage/OpenLineage/tree/main/integration/airflow#development

      @@ -69510,7 +69516,7 @@

      SundayFunday

      2022-08-19 18:38:10
      -

*Thread Reply:* Hey Will! A bunch of folks are on vacation or out this week. Sorry for the delay, I am personally not sure, but if it's not too urgent you can have an answer when knowledgeable folks are back.

+

*Thread Reply:* Hey Will! A bunch of folks are on vacation or out this week. Sorry for the delay, I am personally not sure, but if it's not too urgent you can have an answer when knowledgeable folks are back.

      @@ -69536,7 +69542,7 @@

      SundayFunday

      2022-08-19 20:21:18
      -

      *Thread Reply:* Hah! No worries, @Julien Le Dem! I can definitely wait for the lucky people who are enjoying the last few weeks of summer unlike the rest of us 😋

      +

      *Thread Reply:* Hah! No worries, @Julien Le Dem! I can definitely wait for the lucky people who are enjoying the last few weeks of summer unlike the rest of us 😋

      @@ -69562,7 +69568,7 @@

      SundayFunday

      2022-08-29 05:31:32
      -

      *Thread Reply:* @Paweł Leszczyński might want to look at that

      +

      *Thread Reply:* @Paweł Leszczyński might want to look at that

      @@ -69617,7 +69623,7 @@

      SundayFunday

      2022-08-19 12:30:08
      -

      *Thread Reply:* Hello Hanbing, the spark integration works for PySpark since pyspark is wrapped into regular spark operators.

      +

      *Thread Reply:* Hello Hanbing, the spark integration works for PySpark since pyspark is wrapped into regular spark operators.

      @@ -69643,7 +69649,7 @@

      SundayFunday

      2022-08-19 13:49:35
      -

*Thread Reply:* @Julien Le Dem Thanks a lot for your help. I searched around, but I couldn't find any doc introducing how PySpark is supported in OpenLineage.

+

*Thread Reply:* @Julien Le Dem Thanks a lot for your help. I searched around, but I couldn't find any doc introducing how PySpark is supported in OpenLineage.
My company wants to integrate with openlineage-spark. I am working on figuring out what info OpenLineage makes available for non-SQL jobs, and whether it at least has support for logging the logical plan.

      @@ -69670,7 +69676,7 @@

      SundayFunday

      2022-08-19 18:26:48
      -

      *Thread Reply:* Yes, it does send the logical plan as part of the event

      +

      *Thread Reply:* Yes, it does send the logical plan as part of the event

      @@ -69696,7 +69702,7 @@

      SundayFunday

      2022-08-19 18:27:32
      -

      *Thread Reply:* This configuration here should work as well for pyspark https://openlineage.io/docs/integrations/spark/

      +

      *Thread Reply:* This configuration here should work as well for pyspark https://openlineage.io/docs/integrations/spark/

      @@ -69743,7 +69749,7 @@

      SundayFunday

      2022-08-19 18:28:11
      -

      *Thread Reply:* --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener"

      +

      *Thread Reply:* --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener"

      @@ -69769,7 +69775,7 @@

      SundayFunday

      2022-08-19 18:28:26
      -

      *Thread Reply:* you need to add the jar, set the listener and pass your OL config

      +

      *Thread Reply:* you need to add the jar, set the listener and pass your OL config
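Putting those three pieces together in PySpark might look like this sketch (the jar version, endpoint, and namespace are placeholders following the 0.13.x-era docs; adjust to your release):
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("openlineage_demo")
    # Pull in the integration jar and register the listener.
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.13.1")
    .config("spark.extraListeners",
            "io.openlineage.spark.agent.OpenLineageSparkListener")
    # OL config: where events are POSTed, and the job namespace.
    .config("spark.openlineage.host", "http://localhost:5000")
    .config("spark.openlineage.namespace", "my-namespace")
    .getOrCreate()
)

# Any job run in this session now emits OpenLineage events.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.write.format("parquet").mode("overwrite").save("/tmp/spark_output/parquet")
```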

      @@ -69795,7 +69801,7 @@

      SundayFunday

      2022-08-19 18:31:11
      -

      *Thread Reply:* Actually I'm demoing this at 27:10 right here 🙂 https://pretalx.com/bbuzz22/talk/FHEHAL/

      +

      *Thread Reply:* Actually I'm demoing this at 27:10 right here 🙂 https://pretalx.com/bbuzz22/talk/FHEHAL/

      @@ -69846,7 +69852,7 @@

      SundayFunday

      2022-08-19 18:32:11
      -

      *Thread Reply:* you can see the parameters I'm passing to the pyspark command line in the video

      +

      *Thread Reply:* you can see the parameters I'm passing to the pyspark command line in the video

      @@ -69872,7 +69878,7 @@

      SundayFunday

      2022-08-19 18:35:50
      -

      *Thread Reply:* @Julien Le Dem Thanks for the info, Let me take a look at the video now.

      +

      *Thread Reply:* @Julien Le Dem Thanks for the info, Let me take a look at the video now.

      @@ -69898,7 +69904,7 @@

      SundayFunday

      2022-08-19 18:40:10
      -

      *Thread Reply:* The full demo starts at 24:40. It shows lineage connected together in Marquez coming from 3 different sources: Airflow, Spark and a custom integration

      +

      *Thread Reply:* The full demo starts at 24:40. It shows lineage connected together in Marquez coming from 3 different sources: Airflow, Spark and a custom integration

      @@ -69979,7 +69985,7 @@

      SundayFunday

      2022-08-22 14:38:58
      -

      *Thread Reply:* @Michael Robinson can we start posting the “Unreleased” section in the changelog along with the release request? That way, we / the community will know what will be in the upcoming release

      +

      *Thread Reply:* @Michael Robinson can we start posting the “Unreleased” section in the changelog along with the release request? That way, we / the community will know what will be in the upcoming release

      @@ -70009,7 +70015,7 @@

      SundayFunday

      2022-08-22 15:00:37
      -

      *Thread Reply:* The release is approved. Thanks @Willy Lulciuc, @Minkyu Park, @Harel Shein

      +

      *Thread Reply:* The release is approved. Thanks @Willy Lulciuc, @Minkyu Park, @Harel Shein

      @@ -70092,7 +70098,7 @@

      SundayFunday

      2022-08-23 03:55:24
      -

*Thread Reply:* Cool! Are the new ownership facets populated by the Airflow integration?

+

*Thread Reply:* Cool! Are the new ownership facets populated by the Airflow integration?

      @@ -70144,7 +70150,7 @@

      SundayFunday

      2022-08-25 18:15:59
      -

*Thread Reply:* There is not that I know of besides the Amundsen integration example you pointed at.

+

*Thread Reply:* There is not that I know of besides the Amundsen integration example you pointed at.
A basic idea to do such a thing would be to implement an OpenLineage endpoint (receive the lineage events through HTTP POSTs) and convert them to a format the graph db understands. If others in the community have ideas, please chime in.
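A minimal sketch of such an endpoint, assuming Flask and a hypothetical upsert_into_graph converter (the route path merely mirrors the Marquez-style /api/v1/lineage convention):
```python
from flask import Flask, request

app = Flask(__name__)

def upsert_into_graph(event: dict) -> None:
    # Placeholder: translate the run event into upserts for your graph db,
    # e.g. merge dataset and job nodes plus the edges between them.
    job = event.get("job", {})
    for ds in event.get("inputs", []) + event.get("outputs", []):
        print(f"{ds.get('namespace')}.{ds.get('name')} <-> {job.get('name')}")

# OpenLineage clients POST run events as JSON to the configured endpoint.
@app.route("/api/v1/lineage", methods=["POST"])
def lineage():
    upsert_into_graph(request.get_json(force=True))
    return "", 201

if __name__ == "__main__":
    app.run(port=5000)
```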

      @@ -70171,7 +70177,7 @@

      SundayFunday

      2022-09-01 13:48:09
      -

*Thread Reply:* Understood, thanks a lot Julien. Makes sense.

+

*Thread Reply:* Understood, thanks a lot Julien. Makes sense.

      @@ -70227,7 +70233,7 @@

      SundayFunday

      2022-08-25 17:32:44
      -

      *Thread Reply:* @Michael Robinson ^

      +

      *Thread Reply:* @Michael Robinson ^

      @@ -70253,7 +70259,7 @@

      SundayFunday

      2022-08-25 17:34:04
      -

      *Thread Reply:* Thanks, Harel. 3 +1s from committers is all we need to make this happen today.

      +

      *Thread Reply:* Thanks, Harel. 3 +1s from committers is all we need to make this happen today.

      @@ -70279,7 +70285,7 @@

      SundayFunday

      2022-08-25 17:52:40
      -

      *Thread Reply:* 🙏

      +

      *Thread Reply:* 🙏

      @@ -70305,7 +70311,7 @@

      SundayFunday

      2022-08-25 18:09:51
      -

      *Thread Reply:* Thanks, all. The release is authorized

      +

      *Thread Reply:* Thanks, all. The release is authorized

      @@ -70335,7 +70341,7 @@

      SundayFunday

      2022-08-25 18:16:44
      -

      *Thread Reply:* can you also state the main purpose for this release?

      +

      *Thread Reply:* can you also state the main purpose for this release?

      @@ -70361,7 +70367,7 @@

      SundayFunday

      2022-08-25 18:25:49
      -

      *Thread Reply:* I believe (correct me if wrong, @Harel Shein) that this is to make available a fix of a bug in the compare functionality

      +

      *Thread Reply:* I believe (correct me if wrong, @Harel Shein) that this is to make available a fix of a bug in the compare functionality

      @@ -70387,7 +70393,7 @@

      SundayFunday

      2022-08-25 18:27:53
      -

*Thread Reply:* The ParentRunFacet from the airflow integration is not compliant with the OpenLineage spec, and this release includes the fix for that, so that Marquez can handle parent run/job information.

+

*Thread Reply:* The ParentRunFacet from the airflow integration is not compliant with the OpenLineage spec, and this release includes the fix for that, so that Marquez can handle parent run/job information.

      @@ -70486,7 +70492,7 @@

      SundayFunday

      2022-08-26 19:26:22
      -

*Thread Reply:* What version of Spark are you using? It should be enabled by default for Spark 3.

+

*Thread Reply:* What version of Spark are you using? It should be enabled by default for Spark 3.
https://openlineage.io/docs/integrations/spark/spark_column_lineage

      @@ -70534,7 +70540,7 @@

      SundayFunday

      2022-08-26 20:21:12
      -

*Thread Reply:* Thanks. Good to hear that. I am using 0.9.+. I will try again.

+

*Thread Reply:* Thanks. Good to hear that. I am using 0.9.+. I will try again.

      @@ -70560,7 +70566,7 @@

      SundayFunday

      2022-08-29 13:14:01
      -

*Thread Reply:* I tested 0.9.+ and 0.12.+ with Spark 3.0 and 3.2. There is still no columnLineage dataset facet. This is strange. I saw the column lineage design proposal 148; it should be supported from 0.9.+. Am I missing something?

+

*Thread Reply:* I tested 0.9.+ and 0.12.+ with Spark 3.0 and 3.2. There is still no columnLineage dataset facet. This is strange. I saw the column lineage design proposal 148; it should be supported from 0.9.+. Am I missing something?

      @@ -70586,7 +70592,7 @@

      SundayFunday

      2022-08-29 13:14:41
      -

      *Thread Reply:* @Harel Shein

      +

      *Thread Reply:* @Harel Shein

      @@ -70612,7 +70618,7 @@

      SundayFunday

      2022-08-30 00:56:18
      -

      *Thread Reply:* @Jason it depends on the data source. What sort of data are you trying to read? Is it in a hive metastore? Is it on an S3 bucket? Is it a delta file format?

      +

      *Thread Reply:* @Jason it depends on the data source. What sort of data are you trying to read? Is it in a hive metastore? Is it on an S3 bucket? Is it a delta file format?

      @@ -70638,7 +70644,7 @@

      SundayFunday

      2022-08-30 13:51:03
      -

*Thread Reply:* I tried reading a Hive metastore on S3 and a CSV file locally. All are missing the columnLineage facet

+

*Thread Reply:* I tried reading a Hive metastore on S3 and a CSV file locally. All are missing the columnLineage facet

      @@ -70664,7 +70670,7 @@

      SundayFunday

      2022-08-31 00:33:17
      -

      *Thread Reply:* @Jason - Sorry, you'll have to translate a bit for me. Can you share a snippet of code you're using to do the read and write? Is it a special package you need to install or is it just using the hadoop standard for S3? https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html

      +

      *Thread Reply:* @Jason - Sorry, you'll have to translate a bit for me. Can you share a snippet of code you're using to do the read and write? Is it a special package you need to install or is it just using the hadoop standard for S3? https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html

      @@ -70690,7 +70696,7 @@

      SundayFunday

      2022-08-31 20:00:47
      -

*Thread Reply:* spark.read \

+

*Thread Reply:* spark.read \
  .option("header", "true") \
  .option("inferschema", "true") \
  .csv("data/input/batch/wikidata.csv") \
@@ -70722,7 +70728,7 @@

      SundayFunday

      2022-08-31 20:01:21
      -

      *Thread Reply:* This is simple code run on my local for testing

      +

      *Thread Reply:* This is simple code run on my local for testing

      @@ -70748,7 +70754,7 @@

      SundayFunday

      2022-08-31 21:41:31
      -

      *Thread Reply:* Which version of OpenLineage are you running? You might look at the code on the main branch. This looks like a HadoopFSRelation which I implemented for column lineage but the latest release (0.13.1) does not include it yet.

      +

      *Thread Reply:* Which version of OpenLineage are you running? You might look at the code on the main branch. This looks like a HadoopFSRelation which I implemented for column lineage but the latest release (0.13.1) does not include it yet.

      @@ -70774,7 +70780,7 @@

      SundayFunday

      2022-08-31 21:42:05
      -

*Thread Reply:* Specifically this commit is what implemented it.

+

*Thread Reply:* Specifically this commit is what implemented it.
https://github.com/OpenLineage/OpenLineage/commit/ce30178cc81b63b9930be11ac7500ed34808edd3

      @@ -70801,7 +70807,7 @@

      SundayFunday

      2022-08-31 22:02:16
      -

      *Thread Reply:* I see. I use 0.13.0

      +

      *Thread Reply:* I see. I use 0.13.0

      @@ -70827,7 +70833,7 @@

      SundayFunday

      2022-09-01 12:04:41
      -

      *Thread Reply:* @Jason we have our monthly release coming up now, so it should be included in 0.14.0 when released today/tomorrow

      +

      *Thread Reply:* @Jason we have our monthly release coming up now, so it should be included in 0.14.0 when released today/tomorrow

      @@ -70853,7 +70859,7 @@

      SundayFunday

      2022-09-01 12:52:52
      -

      *Thread Reply:* Great. Thanks Harel.

      +

      *Thread Reply:* Great. Thanks Harel.

      @@ -70957,7 +70963,7 @@

      SundayFunday

      2022-08-29 05:29:52
      -

*Thread Reply:* Reading a dataset - which is what an input dataset implies - does not mutate the dataset 🙂

+

*Thread Reply:* Reading a dataset - which is what an input dataset implies - does not mutate the dataset 🙂

      @@ -70983,7 +70989,7 @@

      SundayFunday

      2022-08-29 05:30:14
      -

*Thread Reply:* If you change the dataset, it would be represented as some other job with this dataset in its outputs list

+

*Thread Reply:* If you change the dataset, it would be represented as some other job with this dataset in its outputs list

      @@ -71009,7 +71015,7 @@

      SundayFunday

      2022-08-29 12:42:55
      -

*Thread Reply:* So, changing the input dataset will always create new output data versions? Sorry, I have trouble understanding this, but if the input is changing, shouldn't the input dataset have different versions?

+

*Thread Reply:* So, changing the input dataset will always create new output data versions? Sorry, I have trouble understanding this, but if the input is changing, shouldn't the input dataset have different versions?

      @@ -71035,7 +71041,7 @@

      SundayFunday

      2022-09-01 08:35:42
      -

      *Thread Reply:* @Raj Mishra if input is changing, there should be something else in your data infrastructure that changes this dataset - and it should emit this dataset as output

      +

      *Thread Reply:* @Raj Mishra if input is changing, there should be something else in your data infrastructure that changes this dataset - and it should emit this dataset as output

      @@ -71087,7 +71093,7 @@

      SundayFunday

      2022-08-29 12:35:32
      -

      *Thread Reply:* I think you want to emit raw events using python or java client: https://openlineage.io/docs/client/python

      +

      *Thread Reply:* I think you want to emit raw events using python or java client: https://openlineage.io/docs/client/python

      @@ -71134,7 +71140,7 @@

      SundayFunday

      2022-08-29 12:35:46
      -

      *Thread Reply:* (docs in progress 😉)

      +

      *Thread Reply:* (docs in progress 😉)

      @@ -71160,7 +71166,7 @@

      SundayFunday

      2022-08-30 02:07:02
      -

*Thread Reply:* can you give a hint what I should look for, for modeling a dataset on top of another dataset? potentially also mapping columns?

+

*Thread Reply:* can you give a hint what I should look for, for modeling a dataset on top of another dataset? potentially also mapping columns?

      @@ -71186,7 +71192,7 @@

      SundayFunday

      2022-08-30 02:12:50
      -

*Thread Reply:* I can only see that I can have a dataset as input to a job run, and not as input to another dataset

+

*Thread Reply:* I can only see that I can have a dataset as input to a job run, and not as input to another dataset

      @@ -71212,7 +71218,7 @@

      SundayFunday

      2022-09-01 08:34:35
      -

      *Thread Reply:* Not sure I understand - jobs process input datasets into output datasets. There is always something that can be modeled into a job that consumes input and produces output.

      +

      *Thread Reply:* Not sure I understand - jobs process input datasets into output datasets. There is always something that can be modeled into a job that consumes input and produces output.

      @@ -71238,7 +71244,7 @@

      SundayFunday

      2022-09-01 10:30:51
      -

*Thread Reply:* so OpenLineage forces me to put a job between datasets? that does not fit our use case

+

*Thread Reply:* so OpenLineage forces me to put a job between datasets? that does not fit our use case

      @@ -71264,7 +71270,7 @@

      SundayFunday

      2022-09-01 10:31:09
      -

*Thread Reply:* unless we can somehow easily hide the process that does that on the graph.

+

*Thread Reply:* unless we can somehow easily hide the process that does that on the graph.

      @@ -71316,7 +71322,7 @@

      SundayFunday

      2022-08-30 04:44:06
      -

      *Thread Reply:* I don't think it will work for Spark 2.X.

      +

      *Thread Reply:* I don't think it will work for Spark 2.X.

      @@ -71342,7 +71348,7 @@

      SundayFunday

      2022-08-30 13:42:20
      -

*Thread Reply:* Is there a plan to support Spark 2.x?

+

*Thread Reply:* Is there a plan to support Spark 2.x?

      @@ -71368,7 +71374,7 @@

      SundayFunday

      2022-08-30 14:00:38
      -

      *Thread Reply:* Nope - on the other hand we plan to drop any support for it, as it's unmaintained for quite a bit and vendors are dropping support for it too - afaik Databricks in April 2023.

      +

      *Thread Reply:* Nope - on the other hand we plan to drop any support for it, as it's unmaintained for quite a bit and vendors are dropping support for it too - afaik Databricks in April 2023.

      @@ -71394,7 +71400,7 @@

      SundayFunday

      2022-08-30 17:19:43
      -

*Thread Reply:* I see. Thanks. Amazon EMR still supports Spark 2.x

+

*Thread Reply:* I see. Thanks. Amazon EMR still supports Spark 2.x

      @@ -71493,7 +71499,7 @@

      SundayFunday

      2022-09-01 05:56:13
      -

*Thread Reply:* At least for Iceberg I've done it, since I want to emit the DatasetVersionDatasetFacet for the input dataset only at START - after I finish writing, the dataset might have a different version than before writing.

+

*Thread Reply:* At least for Iceberg I've done it, since I want to emit the DatasetVersionDatasetFacet for the input dataset only at START - after I finish writing, the dataset might have a different version than before writing.

      @@ -71519,7 +71525,7 @@

      SundayFunday

      2022-09-01 05:58:59
      -

      *Thread Reply:* Same should be for output AFAIK - output version should be emitted only on COMPLETE, since the version changes after I finish writing.

      +

      *Thread Reply:* Same should be for output AFAIK - output version should be emitted only on COMPLETE, since the version changes after I finish writing.

      @@ -71545,7 +71551,7 @@

      SundayFunday

      2022-09-01 09:52:30
      -

      *Thread Reply:* Ah! Okay, so this still requires us to truly combine START and COMPLETE to get a TOTAL picture of the entire run. Is that fair?

      +

      *Thread Reply:* Ah! Okay, so this still requires us to truly combine START and COMPLETE to get a TOTAL picture of the entire run. Is that fair?

      @@ -71571,7 +71577,7 @@

      SundayFunday

      2022-09-01 10:30:41
      -

      *Thread Reply:* Yes

      +

      *Thread Reply:* Yes

      @@ -71601,7 +71607,7 @@

      SundayFunday

      2022-09-01 10:31:21
      -

      *Thread Reply:* As usual, thank you Maciej for the responses and insights!

      +

      *Thread Reply:* As usual, thank you Maciej for the responses and insights!

      @@ -71657,7 +71663,7 @@

      SundayFunday

      2022-09-07 14:13:34
      -

*Thread Reply:* Can anyone help with this? Am I missing something?

+

*Thread Reply:* Can anyone help with this? Am I missing something?

      @@ -71748,7 +71754,7 @@

      SundayFunday

      2022-09-01 13:18:59
      -

      *Thread Reply:* Thanks. The release is authorized. It will be initiated within 2 business days.

      +

      *Thread Reply:* Thanks. The release is authorized. It will be initiated within 2 business days.

      @@ -71804,7 +71810,7 @@

      SundayFunday

      2022-09-08 10:31:44
      -

      *Thread Reply:* Which integration are you looking to implement?

      +

      *Thread Reply:* Which integration are you looking to implement?

      And what environment are you looking to deploy it on? The Cloud? On-Prem?

      @@ -71832,7 +71838,7 @@

      SundayFunday

      2022-09-08 10:40:11
      -

*Thread Reply:* We are planning to deploy on-premise with Kerberos as authentication for Postgres

+

*Thread Reply:* We are planning to deploy on-premise with Kerberos as authentication for Postgres

      @@ -71858,7 +71864,7 @@

      SundayFunday

      2022-09-08 11:27:06
      -

      *Thread Reply:* Ah! Are you planning on running Marquez as well and that is your main concern or are you planning on building your own store of OpenLineage Events and using the SQL integration to generate those events?

      +

      *Thread Reply:* Ah! Are you planning on running Marquez as well and that is your main concern or are you planning on building your own store of OpenLineage Events and using the SQL integration to generate those events?

      https://github.com/OpenLineage/OpenLineage/tree/main/integration

      @@ -71886,7 +71892,7 @@

      SundayFunday

      2022-09-08 11:33:44
      -

*Thread Reply:* I am looking to deploy Marquez on-prem, with on-prem Postgres as the backend, with Kerberos authentication.

+

*Thread Reply:* I am looking to deploy Marquez on-prem, with on-prem Postgres as the backend, with Kerberos authentication.

      @@ -71912,7 +71918,7 @@

      SundayFunday

      2022-09-08 11:34:32
      -

*Thread Reply:* Is this the right forum for Marquez as well, or is there a different Slack channel available for Marquez?

+

*Thread Reply:* Is this the right forum for Marquez as well, or is there a different Slack channel available for Marquez?

      @@ -71938,7 +71944,7 @@

      SundayFunday

      2022-09-08 11:46:35
      -

      *Thread Reply:* https://bit.ly/MarquezSlack

      +

      *Thread Reply:* https://bit.ly/MarquezSlack

      @@ -71964,7 +71970,7 @@

      SundayFunday

      2022-09-08 11:47:14
      -

      *Thread Reply:* There is another slack channel just for Marquez! That might be a better spot with more dedicated Marquez developers.

      +

      *Thread Reply:* There is another slack channel just for Marquez! That might be a better spot with more dedicated Marquez developers.

      @@ -72039,7 +72045,7 @@

      SundayFunday

      2022-09-06 15:54:30
      -

      *Thread Reply:* Thanks for breaking up the changes in the release! Love the new format 💯

      +

      *Thread Reply:* Thanks for breaking up the changes in the release! Love the new format 💯

      @@ -72099,7 +72105,7 @@

      SundayFunday

      2022-09-07 10:00:11
      -

*Thread Reply:* Is PR #1069 all that’s going into 0.14.1?

+

*Thread Reply:* Is PR #1069 all that’s going into 0.14.1?

      @@ -72125,7 +72131,7 @@

      SundayFunday

      2022-09-07 10:27:39
      -

      *Thread Reply:* There’s also 1058. 1069 is urgently needed. We can technically wait…

      +

      *Thread Reply:* There’s also 1058. 1069 is urgently needed. We can technically wait…

      @@ -72155,7 +72161,7 @@

      SundayFunday

      2022-09-07 10:30:31
      -

      *Thread Reply:* (edited prior message because I’m not sure how accurately I was describing the issue)

      +

      *Thread Reply:* (edited prior message because I’m not sure how accurately I was describing the issue)

      @@ -72181,7 +72187,7 @@

      SundayFunday

      2022-09-07 10:39:32
      -

      *Thread Reply:* Thanks for clarifying!

      +

      *Thread Reply:* Thanks for clarifying!

      @@ -72207,7 +72213,7 @@

      SundayFunday

      2022-09-07 10:50:29
      -

      *Thread Reply:* Thanks, all. The release is authorized.

      +

      *Thread Reply:* Thanks, all. The release is authorized.

      @@ -72237,7 +72243,7 @@

      SundayFunday

      2022-09-07 11:04:39
      -

      *Thread Reply:* 1058 also fixes some bugs

      +

      *Thread Reply:* 1058 also fixes some bugs

      @@ -72289,7 +72295,7 @@

      SundayFunday

      2022-09-08 04:41:07
      -

      *Thread Reply:* Usually there is something creating the view, for example dbt materialization: https://docs.getdbt.com/docs/building-a-dbt-project/building-models/materializations

      +

      *Thread Reply:* Usually there is something creating the view, for example dbt materialization: https://docs.getdbt.com/docs/building-a-dbt-project/building-models/materializations

      Besides that, there is this proposal that did not get enough love yet https://github.com/OpenLineage/OpenLineage/issues/323

      @@ -72317,7 +72323,7 @@

      SundayFunday

      2022-09-08 04:53:23
      -

*Thread Reply:* but we are not working with dbt. We try to model the lineage of our internal view/table hierarchy, which is related to a proprietary application of ours. So we like that OpenLineage lets me explicitly model stuff, and not only via scanning some DW. But in that case we don't want a job in between.

+

*Thread Reply:* but we are not working with dbt. We try to model the lineage of our internal view/table hierarchy, which is related to a proprietary application of ours. So we like that OpenLineage lets me explicitly model stuff, and not only via scanning some DW. But in that case we don't want a job in between.

      @@ -72343,7 +72349,7 @@

      SundayFunday

      2022-09-08 04:58:47
      -

      *Thread Reply:* this PR does not seem to support lineage between datasets

      +

      *Thread Reply:* this PR does not seem to support lineage between datasets

      @@ -72369,7 +72375,7 @@

      SundayFunday

      2022-09-08 12:49:48
      -

      *Thread Reply:* This is something core to the OpenLineage design - the lineage relationships are defined as dataset-job-dataset, not dataset-dataset.

      +

      *Thread Reply:* This is something core to the OpenLineage design - the lineage relationships are defined as dataset-job-dataset, not dataset-dataset.

      In OpenLineage, something observes the lineage relationship being created.

      @@ -72397,7 +72403,7 @@

      SundayFunday

      2022-09-08 12:50:13
      -

      *Thread Reply:*

      +

      *Thread Reply:*

      @@ -72436,7 +72442,7 @@

      SundayFunday

      2022-09-08 12:51:15
      -

      *Thread Reply:* It’s a bit different from some other lineage approaches, but OL is intended to be a push model. A job is observed as it runs, metadata is pushed to the backend.

      +

      *Thread Reply:* It’s a bit different from some other lineage approaches, but OL is intended to be a push model. A job is observed as it runs, metadata is pushed to the backend.

      @@ -72462,7 +72468,7 @@

      SundayFunday

      2022-09-08 12:54:27
      -

*Thread Reply:* so in this case, according to OpenLineage 🙂, the job would be whatever runs within the pipeline that creates the view. Very operational point of view.

+

*Thread Reply:* so in this case, according to OpenLineage 🙂, the job would be whatever runs within the pipeline that creates the view. Very operational point of view.

      @@ -72488,7 +72494,7 @@

      SundayFunday

      2022-09-11 12:27:42
      -

*Thread Reply:* but what about the view definition use case? you have lineage of columns in view/base-table relationships

+

*Thread Reply:* but what about the view definition use case? you have lineage of columns in view/base-table relationships

      @@ -72514,7 +72520,7 @@

      SundayFunday

      2022-09-11 12:28:05
      -

*Thread Reply:* how would you model that in OpenLineage? would you create a dummy job?

+

*Thread Reply:* how would you model that in OpenLineage? would you create a dummy job?

      @@ -72540,7 +72546,7 @@

      SundayFunday

      2022-09-11 12:31:57
      -

*Thread Reply:* would you say that because this is my use case I might be better off choosing some other lineage tool?

+

*Thread Reply:* would you say that because this is my use case I might be better off choosing some other lineage tool?

      @@ -72566,7 +72572,7 @@

      SundayFunday

      2022-09-11 12:33:04
      -

*Thread Reply:* for the context: I am not talking about view and table definitions in some warehouse, e.g. SF, but our internal data processing mechanism with proprietary view/table definitions (in Flink SQL), and we want to push this metadata for visibility

+

*Thread Reply:* for the context: I am not talking about view and table definitions in some warehouse, e.g. SF, but our internal data processing mechanism with proprietary view/table definitions (in Flink SQL), and we want to push this metadata for visibility

      @@ -72592,7 +72598,7 @@

      SundayFunday

      2022-09-12 17:20:13
      -

      *Thread Reply:* Ah, gotcha. Yeah, I would say it’s probably best to create a job in this case. You can send the view definition using a sourcecodefacet, so it will be collected as well. You’d want to send START and STOP events for it.

      +

      *Thread Reply:* Ah, gotcha. Yeah, I would say it’s probably best to create a job in this case. You can send the view definition using a sourcecodefacet, so it will be collected as well. You’d want to send START and STOP events for it.
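For illustration, a sketch of attaching the view’s SQL to such a job with the Python client (the SourceCodeJobFacet usage and all names are assumptions based on the client of this era):
```python
from openlineage.client.facet import SourceCodeJobFacet
from openlineage.client.run import Job

# The view (re)definition is modeled as a job; its SQL rides along in the
# source code facet. Namespace, job name, and SQL below are placeholders.
job = Job(
    namespace="my-namespace",
    name="define_my_view",
    facets={"sourceCode": SourceCodeJobFacet("sql", "CREATE VIEW my_view AS SELECT ... FROM base_table")},
)
# Then emit START and COMPLETE RunEvents for this job, with base_table as
# input and my_view as output, as in the client sketch earlier in the log.
```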

      @@ -72618,7 +72624,7 @@

      SundayFunday

      2022-09-12 17:22:03
      -

      *Thread Reply:* regarding the PR linked before, you are right - I wonder if someday the spec should have a way to express “the system was made aware that these datasets are related, but did not observe the relationship being created so it can’t tell you i.e. how long it took or whether it changed over time”

      +

      *Thread Reply:* regarding the PR linked before, you are right - I wonder if someday the spec should have a way to express “the system was made aware that these datasets are related, but did not observe the relationship being created so it can’t tell you i.e. how long it took or whether it changed over time”

      @@ -72713,7 +72719,7 @@

      SundayFunday

      2022-09-09 14:01:13
      -

*Thread Reply:* Hey, @data_fool! Not in the near term, but of course we’d love to see this happen. We’re open to having an Airbyte integration driven by the community. Want to open an issue to start the discussion?

+

*Thread Reply:* Hey, @data_fool! Not in the near term, but of course we’d love to see this happen. We’re open to having an Airbyte integration driven by the community. Want to open an issue to start the discussion?

      @@ -72739,7 +72745,7 @@

      SundayFunday

      2022-09-09 15:36:20
      -

      *Thread Reply:* hey @Willy Lulciuc, Yep, will open an issue. Thanks!

      +

      *Thread Reply:* hey @Willy Lulciuc, Yep, will open an issue. Thanks!

      @@ -72795,7 +72801,7 @@

      SundayFunday

      2022-09-12 19:26:25
      -

      *Thread Reply:* yes!

      +

      *Thread Reply:* yes!

      @@ -72821,7 +72827,7 @@

      SundayFunday

      2022-09-26 10:31:56
      -

*Thread Reply:* Any example or ticket on how to do lineage across namespaces?

+

*Thread Reply:* Any example or ticket on how to do lineage across namespaces?

      @@ -72873,7 +72879,7 @@

      SundayFunday

      2022-09-12 04:56:13
      -

      *Thread Reply:* Yes https://openlineage.io/blog/column-lineage/

      +

      *Thread Reply:* Yes https://openlineage.io/blog/column-lineage/

      @@ -72920,7 +72926,7 @@

      SundayFunday

      2022-09-22 02:18:45
      -

*Thread Reply:* • More details on Spark & Column level lineage integration: https://openlineage.io/docs/integrations/spark/spark_column_lineage

+

*Thread Reply:* • More details on Spark & Column level lineage integration: https://openlineage.io/docs/integrations/spark/spark_column_lineage
• Proposal on how to implement column level lineage in Marquez (implementation is currently work in progress): https://github.com/MarquezProject/marquez/blob/main/proposals/2045-column-lineage-endpoint.md
@Iftach Schonbaum let us know if you find the information useful.

      SundayFunday
      2022-09-12 15:30:08
      -

      *Thread Reply:* or is it automatic for anything that exists in extractors/?

      +

      *Thread Reply:* or is it automatic for anything that exists in extractors/?

      @@ -73195,7 +73201,7 @@

      SundayFunday

      2022-09-12 15:30:16
      -

      *Thread Reply:* Yes

      +

      *Thread Reply:* Yes

      @@ -73229,7 +73235,7 @@

      SundayFunday

      2022-09-12 15:31:12
      -

*Thread Reply:* so anything I add to the extractors directory with the same name as the operator will automatically extract the metadata from the operator, is that correct?

+

*Thread Reply:* so anything I add to the extractors directory with the same name as the operator will automatically extract the metadata from the operator, is that correct?

      @@ -73255,7 +73261,7 @@

      SundayFunday

      2022-09-12 15:31:31
      -

      *Thread Reply:* Well, not entirely

      +

      *Thread Reply:* Well, not entirely

      @@ -73281,7 +73287,7 @@

      SundayFunday

      2022-09-12 15:31:47
      -

      *Thread Reply:* please take a look at the source code of one of the extractors

      +

      *Thread Reply:* please take a look at the source code of one of the extractors
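For reference, a minimal sketch of what such an extractor can look like (MyOperator is hypothetical; the BaseExtractor/TaskMetadata API shown follows the openlineage-airflow package of this era):
```python
from typing import List, Optional

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata

class MyOperatorExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        # Extractors are matched to operators by class name.
        return ["MyOperator"]

    def extract(self) -> Optional[TaskMetadata]:
        # self.operator is the task being observed; pull whatever it already
        # knows about its inputs/outputs and map them to OpenLineage datasets.
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            inputs=[],   # OpenLineage Dataset objects would go here
            outputs=[],
        )
```
If memory serves, a custom extractor is then registered via an OPENLINEAGE_EXTRACTOR_<OperatorName> environment variable pointing at its import path; see the extractor doc linked later in this thread.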

      @@ -73311,7 +73317,7 @@

      SundayFunday

      2022-09-12 15:32:13
      -

      *Thread Reply:* also, there are docs available at openlineage.io/docs

      +

      *Thread Reply:* also, there are docs available at openlineage.io/docs

      @@ -73341,7 +73347,7 @@

      SundayFunday

      2022-09-12 15:33:45
      -

*Thread Reply:* ok, I'll take a look. I think one thing that would be helpful is having a custom setup without Marquez. A lot of the docs or videos I found were integrated with Marquez.

+

*Thread Reply:* ok, I'll take a look. I think one thing that would be helpful is having a custom setup without Marquez. A lot of the docs or videos I found were integrated with Marquez.

      @@ -73367,7 +73373,7 @@

      SundayFunday

      2022-09-12 15:34:29
      -

*Thread Reply:* I see. Marquez is an OpenLineage backend that stores the lineage data, so many examples do need it.

+

*Thread Reply:* I see. Marquez is an OpenLineage backend that stores the lineage data, so many examples do need it.

      @@ -73393,7 +73399,7 @@

      SundayFunday

      2022-09-12 15:34:47
      -

*Thread Reply:* If you do not want to run Marquez but just test out OpenLineage, you can also take a look at the OpenLineage Proxy.

+

*Thread Reply:* If you do not want to run Marquez but just test out OpenLineage, you can also take a look at the OpenLineage Proxy.

      @@ -73423,7 +73429,7 @@

      SundayFunday

      2022-09-12 15:35:14
      -

*Thread Reply:* awesome, thanks Howard! I'll take a look at these resources and come back around if I need to

+

*Thread Reply:* awesome, thanks Howard! I'll take a look at these resources and come back around if I need to

      @@ -73453,7 +73459,7 @@

      SundayFunday

      2022-09-12 16:01:45
      -

      *Thread Reply:* http://openlineage.io/docs/integrations/airflow/extractor - this is the doc you might want to read

      +

      *Thread Reply:* http://openlineage.io/docs/integrations/airflow/extractor - this is the doc you might want to read

      @@ -73504,7 +73510,7 @@

      SundayFunday

      2022-09-12 17:08:49
      -

      *Thread Reply:* yeah, saw that doc earlier. thanks @Maciej Obuchowski appreciate it 🙏

      +

      *Thread Reply:* yeah, saw that doc earlier. thanks @Maciej Obuchowski appreciate it 🙏

      @@ -73562,7 +73568,7 @@

      SundayFunday

      2022-09-21 20:57:39
      -

      *Thread Reply:* In my head, I have:

      +

      *Thread Reply:* In my head, I have:

      Pyspark script -> Store metadata in S3 -> Marquez UI gets data from S3 and displays it

      @@ -73592,10 +73598,10 @@

      SundayFunday

      2022-09-22 02:14:50
      -

*Thread Reply:* It’s more like: you add the openlineage jar to your Spark job and configure what to do with the events. Popular options are:
• send to a REST endpoint (like Marquez),
• send as an event onto Kafka,
• print it onto the console

+

*Thread Reply:* It’s more like: you add the openlineage jar to your Spark job and configure what to do with the events. Popular options are:
• send to a REST endpoint (like Marquez),
• send as an event onto Kafka,
• print it onto the console
There is no S3 in between Spark & Marquez by default. Marquez serves both as the API where events are sent and the UI to investigate them.

      @@ -73623,7 +73629,7 @@

      SundayFunday

      2022-09-22 17:36:10
      -

      *Thread Reply:* Yeah S3 was just an example for a storage option.

      +

      *Thread Reply:* Yeah S3 was just an example for a storage option.

      I actually found the answer I was looking for, turns out I had to look at Marquez documentation: https://marquezproject.ai/resources/deployment/

      @@ -73688,7 +73694,7 @@

      SundayFunday

      2022-09-26 00:27:05
      -

      *Thread Reply:* The Spark execution model follows:

      +

      *Thread Reply:* The Spark execution model follows:

      1. Spark SQL Execution Start event
      2. Spark Job Start event
      3. Spark Job End event
4. Spark SQL Execution End event
As a result, OpenLineage tracks all of those executions and jobs. There is a proposed plan to distinguish between those events (e.g. you wouldn't get two starts but one Start and one Job Start or something like that).

@@ -73720,7 +73726,7 @@

        SundayFunday

      2022-09-26 15:16:26
      -

*Thread Reply:* Thanks @Will Johnson

+

*Thread Reply:* Thanks @Will Johnson
Can I get an example of how the proposed plan can be used to distinguish between Start and Job Start events? When I compare the 2 Start events I got, only the event_time is different; all other information is the same.

      @@ -73748,7 +73754,7 @@

      SundayFunday

      2022-09-26 15:30:34
      -

*Thread Reply:* One followup question: if I process multiple queries in one command, for example (Drop + Create Table + Insert Overwrite), should I expect

+

*Thread Reply:* One followup question: if I process multiple queries in one command, for example (Drop + Create Table + Insert Overwrite), should I expect
(1). 1 Spark SQL execution start event
(2). 3 Spark job start events (each query has a job start event)
(3). 3 Spark job end events (each query has a job end event)

@@ -73778,7 +73784,7 @@

      SundayFunday

      2022-09-27 10:25:47
      -

      *Thread Reply:* Re: Distinguish between start and job start events. There was a proposal to differentiate the two (https://github.com/OpenLineage/OpenLineage/issues/636) but the current discussion is here: https://github.com/OpenLineage/OpenLineage/issues/599 As it currently stands, there is not a way to tell which one is which (I believe). The design of OpenLineage is such that you should consume ALL events under the same run id and job name / namespace.

      +

      *Thread Reply:* Re: Distinguish between start and job start events. There was a proposal to differentiate the two (https://github.com/OpenLineage/OpenLineage/issues/636) but the current discussion is here: https://github.com/OpenLineage/OpenLineage/issues/599 As it currently stands, there is not a way to tell which one is which (I believe). The design of OpenLineage is such that you should consume ALL events under the same run id and job name / namespace.

Re: Multiple Queries in One Command: This is where Spark's execution model comes into play. I believe each one of those commands is executed sequentially, and as a result you'd actually get three execution starts and three execution ends. If you chose DROP + Create Table As Select, that would be only two commands and thus only two execution start events.
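
A consumer-side sketch of that advice (hypothetical helper code, not an OpenLineage API): group whatever events arrive by run.runId and treat their union as the state of the run.

from collections import defaultdict

runs = defaultdict(list)

def on_event(event: dict) -> None:
    # START/COMPLETE and job-level events for one execution share run.runId,
    # so runId (plus job namespace/name) is the merge key.
    runs[event["run"]["runId"]].append(event)

def merged_run_facets(run_id: str) -> dict:
    # Later events may repeat or extend facets; merging keeps the union.
    facets = {}
    for event in runs[run_id]:
        facets.update(event["run"].get("facets", {}))
    return facets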

      SundayFunday
      2022-09-27 16:49:37
      -

*Thread Reply:* Thanks a lot for your help 🙏 @Will Johnson,

+

*Thread Reply:* Thanks a lot for your help 🙏 @Will Johnson,
For multiple queries in one command, I am still confused about why Drop + CreateTable and Drop + CreateTableAsSelect act differently.

When I test Drop + Create Table

@@ -73894,7 +73900,7 @@

      SundayFunday

      2022-09-27 22:03:38
      -

      *Thread Reply:* @Hanbing Wang are you running this on Databricks with a hive metastore that is defaulting to Delta by any chance?

      +

      *Thread Reply:* @Hanbing Wang are you running this on Databricks with a hive metastore that is defaulting to Delta by any chance?

      I THINK there are some gaps in OpenLineage because of the way Databricks Delta handles things and now there is Unity catalog that is causing some hiccups as well.

      @@ -73922,7 +73928,7 @@

      SundayFunday

      2022-09-28 09:18:48
      -

      *Thread Reply:* > For multiple queries in one command, I still have a confused place why Drop + CreateTable and Drop + CreateTableAsSelect act different. +

      *Thread Reply:* > For multiple queries in one command, I still have a confused place why Drop + CreateTable and Drop + CreateTableAsSelect act different. @Hanbing Wang That's basically why we capture all the events (SQL Execution, Job) instead of one of them. We're just inconsistently notified of them by Spark.

      Some computations emit SQL Execution events, some emit Job events, I think majority emits both. This also differs by spark version.

      @@ -73956,7 +73962,7 @@

      SundayFunday

      2022-09-28 15:44:45
      -

*Thread Reply:* @Will Johnson and @Maciej Obuchowski Thanks a lot for your help

+

*Thread Reply:* @Will Johnson and @Maciej Obuchowski Thanks a lot for your help
We are not running on Databricks. We implemented the OpenLineage Spark listener with a custom Event Transport which emits the events to our own events pipeline, with a hive metastore. We are using Spark version 3.2.1

@@ -73986,7 +73992,7 @@

      SundayFunday

      2022-09-29 15:16:28
      -

      *Thread Reply:* Ooof! @Hanbing Wang then I'm not certain why you're not receiving the extra event 😞 You may need to run your spark cluster in debug mode to step through the Spark Listener.

      +

      *Thread Reply:* Ooof! @Hanbing Wang then I'm not certain why you're not receiving the extra event 😞 You may need to run your spark cluster in debug mode to step through the Spark Listener.

      @@ -74012,7 +74018,7 @@

      SundayFunday

      2022-09-29 15:17:08
      -

      *Thread Reply:* @Maciej Obuchowski - I'll add it to my list!

      +

      *Thread Reply:* @Maciej Obuchowski - I'll add it to my list!

      @@ -74038,7 +74044,7 @@

      SundayFunday

      2022-09-30 15:34:01
      -

      *Thread Reply:* @Will Johnson Thanks a lot for your help. Let us debug and continue investigating on this issue.

      +

      *Thread Reply:* @Will Johnson Thanks a lot for your help. Let us debug and continue investigating on this issue.

      @@ -74096,7 +74102,7 @@

      SundayFunday

      2022-09-28 08:34:20
      -

*Thread Reply:* One of the assumptions was to create a stateless integration model where multiple events can be sent for a single job run. This has several advantages, like sending events for jobs which suddenly fail, sending events immediately, etc.

+

*Thread Reply:* One of the assumptions was to create a stateless integration model where multiple events can be sent for a single job run. This has several advantages, like sending events for jobs which suddenly fail, sending events immediately, etc.

The events can then be merged at the backend side. The behavior you describe can then be achieved by using backends like Marquez and the Marquez API to obtain the combined data.

      @@ -74336,7 +74342,7 @@

      SundayFunday

      2022-09-29 14:30:37
      -

      *Thread Reply:* Hello @srutikanta hota, could you elaborate a bit on your use case? I'm not sure what you are trying to achieve. Possibly @Paweł Leszczyński will know.

      +

      *Thread Reply:* Hello @srutikanta hota, could you elaborate a bit on your use case? I'm not sure what you are trying to achieve. Possibly @Paweł Leszczyński will know.

      @@ -74362,7 +74368,7 @@

      SundayFunday

      2022-09-29 15:24:26
      -

      *Thread Reply:* @srutikanta hota - Not sure what MDC properties stands for but you might take inspiration from the DatabricksEnvironmentHandler Facet Builder: https://github.com/OpenLineage/OpenLineage/blob/65a5f021a1ba3035d5198e759587737a05[…]ark/agent/facets/builder/DatabricksEnvironmentFacetBuilder.java

      +

      *Thread Reply:* @srutikanta hota - Not sure what MDC properties stands for but you might take inspiration from the DatabricksEnvironmentHandler Facet Builder: https://github.com/OpenLineage/OpenLineage/blob/65a5f021a1ba3035d5198e759587737a05[…]ark/agent/facets/builder/DatabricksEnvironmentFacetBuilder.java

      You can create a facet that could extract out the properties that you might set from within the spark session.

      @@ -74454,7 +74460,7 @@

      SundayFunday

      2022-09-30 04:56:25
      -

*Thread Reply:* Many thanks for the details. My use case is simple: I'd like to default the Spark job group ID as the OpenLineage parent runId if there is no parent run ID set.

+

*Thread Reply:* Many thanks for the details. My use case is simple: I'd like to default the Spark job group ID as the OpenLineage parent runId if there is no parent run ID set.
sc.setJobGroup("myjobgroupid", "job description goes here")
This sets the value in Spark as setLocalProperty(SparkContext.SPARK_JOB_GROUP_ID, group_id)

      @@ -74485,7 +74491,7 @@

      SundayFunday

      2022-09-30 05:01:08
      -

*Thread Reply:* MDC is the ability to add extra key -> value pairs to a log entry, while not doing this within the message body. So the question here is (I believe): how to add custom entries / custom facets to OpenLineage events?

+

*Thread Reply:* MDC is the ability to add extra key -> value pairs to a log entry, while not doing this within the message body. So the question here is (I believe): how to add custom entries / custom facets to OpenLineage events?

@srutikanta hota What information would you like to include? There is a great chance we already have some fields for that. If not, it's still worth putting it in the right place: is this info job specific, run specific, or related to some of the input / output datasets?

      @@ -74513,7 +74519,7 @@

      SundayFunday

      2022-09-30 05:04:34
      -

*Thread Reply:* @srutikanta hota sounds like you want to set up

+

*Thread Reply:* @srutikanta hota sounds like you want to set up
spark.openlineage.parentJobName
spark.openlineage.parentRunId
https://openlineage.io/docs/integrations/spark/
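
A minimal PySpark sketch of those two settings (the job name and UUID are placeholders):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Groups every event from this application under a parent job/run,
    # e.g. the scheduler task that launched it.
    .config("spark.openlineage.parentJobName", "my_scheduler.my_task")
    .config("spark.openlineage.parentRunId", "016d1e2b-1a62-4fbe-bbeb-d26e9f96ac24")
    .getOrCreate()
)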

      @@ -74542,7 +74548,7 @@

      SundayFunday

      2022-09-30 05:15:18
      -

*Thread Reply:* @… we are having a long-running Spark context (the context may run for a week) where we submit jobs. Setting the parentRunId at the beginning won't help. We are submitting the jobs with a Spark job group ID, and I'd like to use the group ID as the parentRunId

+

*Thread Reply:* @… we are having a long-running Spark context (the context may run for a week) where we submit jobs. Setting the parentRunId at the beginning won't help. We are submitting the jobs with a Spark job group ID, and I'd like to use the group ID as the parentRunId

      https://spark.apache.org/docs/1.6.1/api/R/setJobGroup.html

      @@ -74612,7 +74618,7 @@

      SundayFunday

      2022-09-29 14:22:06
      -

      *Thread Reply:* Hi Trevor, thank you for reaching out. I’d be happy to discuss with you how we can help you support OpenLineage. Let me send you an email.

      +

      *Thread Reply:* Hi Trevor, thank you for reaching out. I’d be happy to discuss with you how we can help you support OpenLineage. Let me send you an email.

      @@ -74732,7 +74738,7 @@

      SundayFunday

      2022-09-30 14:38:34
      -

      *Thread Reply:* Is this in the context of an OpenLineage extractor?

      +

      *Thread Reply:* Is this in the context of an OpenLineage extractor?

      @@ -74758,7 +74764,7 @@

      SundayFunday

      2022-09-30 14:40:47
      -

      *Thread Reply:* Yes! I was specifically looking at the PostgresOperator

      +

      *Thread Reply:* Yes! I was specifically looking at the PostgresOperator

      @@ -74784,7 +74790,7 @@

      SundayFunday

      2022-09-30 14:41:54
      -

      *Thread Reply:* (as Snowflake lineage can be retrieved from their internal ACCESS_HISTORY tables, we wouldn’t need to use Airflow’s SnowflakeOperator to get lineage, we’d use the method on the openlineage blog)

      +

      *Thread Reply:* (as Snowflake lineage can be retrieved from their internal ACCESS_HISTORY tables, we wouldn’t need to use Airflow’s SnowflakeOperator to get lineage, we’d use the method on the openlineage blog)

      @@ -74810,7 +74816,7 @@

      SundayFunday

      2022-09-30 14:43:08
      -

*Thread Reply:* The extractor for the SQL operators gets the query like this:

+

*Thread Reply:* The extractor for the SQL operators gets the query like this:
https://github.com/OpenLineage/OpenLineage/blob/45fda47d8ef29dd6d25103bb491fb8c443[…]gration/airflow/openlineage/airflow/extractors/sql_extractor.py

      @@ -74866,7 +74872,7 @@

      SundayFunday

      2022-09-30 14:43:48
      -

      *Thread Reply:* let me see if I can find the corresponding part of the Airflow API docs...

      +

      *Thread Reply:* let me see if I can find the corresponding part of the Airflow API docs...

      @@ -74892,7 +74898,7 @@

      SundayFunday

      2022-09-30 14:45:00
      -

      *Thread Reply:* aha! I’m not so far behind the times, it was only put in during July https://github.com/OpenLineage/OpenLineage/pull/907

      +

      *Thread Reply:* aha! I’m not so far behind the times, it was only put in during July https://github.com/OpenLineage/OpenLineage/pull/907

      @@ -74957,7 +74963,7 @@

      SundayFunday

      2022-09-30 14:47:28
      -

*Thread Reply:* Hm. The PostgresOperator seems to extend BaseOperator directly:

+

*Thread Reply:* Hm. The PostgresOperator seems to extend BaseOperator directly:
https://github.com/apache/airflow/blob/029ebacd9cbbb5e307a03530bdaf111c2c3d4f51/airflow/providers/postgres/operators/postgres.py#L58

      @@ -75009,7 +75015,7 @@

      SundayFunday

      2022-09-30 14:48:01
      -

      *Thread Reply:* yeah 😞 I couldn’t find a way to make that work as an end-user.

      +

      *Thread Reply:* yeah 😞 I couldn’t find a way to make that work as an end-user.

      @@ -75035,7 +75041,7 @@

      SundayFunday

      2022-09-30 14:48:08
      -

      *Thread Reply:* perhaps that can't be assumed for all operators that deal with SQL. I know that @Maciej Obuchowski has spent a lot of time on this.

      +

      *Thread Reply:* perhaps that can't be assumed for all operators that deal with SQL. I know that @Maciej Obuchowski has spent a lot of time on this.

      @@ -75061,7 +75067,7 @@

      SundayFunday

      2022-09-30 14:49:14
      -

      *Thread Reply:* I don't know enough about the airflow internals 😞

      +

      *Thread Reply:* I don't know enough about the airflow internals 😞

      @@ -75087,7 +75093,7 @@

      SundayFunday

      2022-09-30 14:50:00
      -

      *Thread Reply:* No worries. In case it saves you work, I also had a look at https://github.com/apache/airflow/blob/029ebacd9cbbb5e307a03530bdaf111c2c3d4f51/airflow/providers/common/sql/operators/sql.py - which also extends BaseOperator but not with a way to just get the SQL.

      +

      *Thread Reply:* No worries. In case it saves you work, I also had a look at https://github.com/apache/airflow/blob/029ebacd9cbbb5e307a03530bdaf111c2c3d4f51/airflow/providers/common/sql/operators/sql.py - which also extends BaseOperator but not with a way to just get the SQL.

      @@ -75366,7 +75372,7 @@

      SundayFunday

      2022-09-30 15:22:24
      -

*Thread Reply:* that's more of an Airflow question indeed. As far as I understand, you need to read a file with a SQL statement within an Airflow Operator and do something other than run the query (like pass it as an XCom)? The SQLExtractors we have get the same SQL that the operators render and use it to extract additional information, like table schema, straight from the database

+

*Thread Reply:* that's more of an Airflow question indeed. As far as I understand, you need to read a file with a SQL statement within an Airflow Operator and do something other than run the query (like pass it as an XCom)? The SQLExtractors we have get the same SQL that the operators render and use it to extract additional information, like table schema, straight from the database
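
To make the extractor shape concrete, here is a minimal sketch against openlineage-airflow's BaseExtractor (the operator name is hypothetical; see the extractor doc linked earlier in this log for the exact interface in your version):

from typing import List, Optional

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata

class SqlFileExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        # Operators this extractor should handle (hypothetical operator)
        return ["SqlFileOperator"]

    def extract(self) -> Optional[TaskMetadata]:
        # Templated fields are already rendered at this point, so a templated
        # `sql` attribute holds the final SQL text.
        _sql = getattr(self.operator, "sql", "")
        return TaskMetadata(name=f"{self.operator.dag_id}.{self.operator.task_id}")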

      @@ -75444,7 +75450,7 @@

      SundayFunday

      2022-09-30 19:06:47
      -

*Thread Reply:* There's a design doc linked from the PR: https://github.com/apache/airflow/pull/20443

+

*Thread Reply:* There's a design doc linked from the PR: https://github.com/apache/airflow/pull/20443
https://docs.google.com/document/d/1L3xfdlWVUrdnFXng1Di4nMQYQtzMfhvvWDR9K4wXnDU/edit

      @@ -75528,7 +75534,7 @@

      SundayFunday

      2022-09-30 19:18:47
      -

      *Thread Reply:* amazing thank you I will take a look

      +

      *Thread Reply:* amazing thank you I will take a look

      @@ -75595,7 +75601,7 @@

      SundayFunday

      2022-10-03 11:33:30
      -

      *Thread Reply:* Hey @Michael Robinson. Removal of Airflow 1.x support is planned for next release after 0.15.0

      +

      *Thread Reply:* Hey @Michael Robinson. Removal of Airflow 1.x support is planned for next release after 0.15.0

      @@ -75625,7 +75631,7 @@

      SundayFunday

      2022-10-03 11:37:03
      -

      *Thread Reply:* 0.15.0 would be the last release supporting Airflow 1.x

      +

      *Thread Reply:* 0.15.0 would be the last release supporting Airflow 1.x

      @@ -75651,7 +75657,7 @@

      SundayFunday

      2022-10-03 11:37:07
      -

      *Thread Reply:* just caught this myself. I’ll make the change

      +

      *Thread Reply:* just caught this myself. I’ll make the change

      @@ -75677,7 +75683,7 @@

      SundayFunday

      2022-10-03 11:40:33
      -

      *Thread Reply:* we’re still on 1.10.15 at the moment so i guess our team would have to rely on <=0.15.0?

      +

      *Thread Reply:* we’re still on 1.10.15 at the moment so i guess our team would have to rely on <=0.15.0?

      @@ -75703,7 +75709,7 @@

      SundayFunday

      2022-10-03 11:49:47
      -

      *Thread Reply:* Is this something you want to continue doing or do you want to migrate relatively soon?

      +

      *Thread Reply:* Is this something you want to continue doing or do you want to migrate relatively soon?

We want to remove the 1.10 integration because, for multiple PRs, maintaining compatibility with it takes a lot of time; the code is littered with checks like this:
if parse_version(AIRFLOW_VERSION) >= parse_version("2.0.0"):

      @@ -75736,7 +75742,7 @@

      SundayFunday

      2022-10-03 12:03:40
      -

      *Thread Reply:* hey Maciej, we do have plans to migrate in the coming months but for right now we need to stay on 1.10.15.

      +

      *Thread Reply:* hey Maciej, we do have plans to migrate in the coming months but for right now we need to stay on 1.10.15.

      @@ -75762,7 +75768,7 @@

      SundayFunday

      2022-10-04 09:39:11
      -

      *Thread Reply:* Thanks, all. The release is authorized, and you can expect it by Thursday.

      +

      *Thread Reply:* Thanks, all. The release is authorized, and you can expect it by Thursday.

      @@ -75816,7 +75822,7 @@

      SundayFunday

      2022-10-03 17:56:36
      -

      *Thread Reply:* running airflow 2.3.4 with openlineage-airflow 0.14.1

      +

      *Thread Reply:* running airflow 2.3.4 with openlineage-airflow 0.14.1

      @@ -75842,7 +75848,7 @@

      SundayFunday

      2022-10-03 18:03:03
      -

      *Thread Reply:* if you're talking about LineageBackend, it is used in Airflow 2.1-2.2. It did not have functionality where you can be notified on task start or failure, so we wanted to expand the functionality: https://github.com/apache/airflow/issues/17984

      +

      *Thread Reply:* if you're talking about LineageBackend, it is used in Airflow 2.1-2.2. It did not have functionality where you can be notified on task start or failure, so we wanted to expand the functionality: https://github.com/apache/airflow/issues/17984

      Consensus of Airflow maintainers wasn't positive about changing this interface, so we went with another direction: https://github.com/apache/airflow/pull/20443

      SundayFunday
      2022-10-03 18:06:58
      -

*Thread Reply:* Why does nothing happen? https://github.com/OpenLineage/OpenLineage/blob/895160423643398348154a87e0682c3ab5c8704b/integration/airflow/openlineage/lineage_backend/__init__.py#L91

+

*Thread Reply:* Why does nothing happen? https://github.com/OpenLineage/OpenLineage/blob/895160423643398348154a87e0682c3ab5c8704b/integration/airflow/openlineage/lineage_backend/__init__.py#L91

      @@ -75969,7 +75975,7 @@

      SundayFunday

      2022-10-03 18:30:32
      -

      *Thread Reply:* ah hmm ok, i will double check. i commented that part out so technically it should run but maybe i missed something

      +

      *Thread Reply:* ah hmm ok, i will double check. i commented that part out so technically it should run but maybe i missed something

      @@ -75995,7 +76001,7 @@

      SundayFunday

      2022-10-03 18:30:42
      -

      *Thread Reply:* thank you for your fast response @Maciej Obuchowski ! i appreciate it

      +

      *Thread Reply:* thank you for your fast response @Maciej Obuchowski ! i appreciate it

      @@ -76021,7 +76027,7 @@

      SundayFunday

      2022-10-03 18:31:13
      -

      *Thread Reply:* it seems like it doesn't use my custom wrapper but instead uses the openlineage implementation.

      +

      *Thread Reply:* it seems like it doesn't use my custom wrapper but instead uses the openlineage implementation.

      @@ -76047,7 +76053,7 @@

      SundayFunday

      2022-10-03 20:11:15
      -

*Thread Reply:* @Maciej Obuchowski ok, after checking, we are emitting events with our custom backend, but an odd thing is that an attempt is always made with the openlineage backend. is there something obvious i am perhaps missing 🤔

+

*Thread Reply:* @Maciej Obuchowski ok, after checking, we are emitting events with our custom backend, but an odd thing is that an attempt is always made with the openlineage backend. is there something obvious i am perhaps missing 🤔

it ends up with requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url immediately after task start, but by the end, on task success/failure, it emits both the RunState.COMPLETE and RunState.START events with our custom backend into our own pipeline.

      @@ -76075,7 +76081,7 @@

      SundayFunday

      2022-10-04 06:19:06
      -

      *Thread Reply:* If you're on 2.3 and trying to use some wrapped LineageBackend, what I think is happening is OpenLineagePlugin that automatically registers via setup.py entrypoint https://github.com/OpenLineage/OpenLineage/blob/65a5f021a1ba3035d5198e759587737a05b242e1/integration/airflow/openlineage/airflow/plugin.py#L30

      +

      *Thread Reply:* If you're on 2.3 and trying to use some wrapped LineageBackend, what I think is happening is OpenLineagePlugin that automatically registers via setup.py entrypoint https://github.com/OpenLineage/OpenLineage/blob/65a5f021a1ba3035d5198e759587737a05b242e1/integration/airflow/openlineage/airflow/plugin.py#L30

      @@ -76126,7 +76132,7 @@

      SundayFunday

      2022-10-04 06:23:48
      -

      *Thread Reply:* I think if you want to extend it with proprietary code there are two good options.

      +

      *Thread Reply:* I think if you want to extend it with proprietary code there are two good options.

First, if your code only needs to touch the HTTP client side - which I guess is the case due to the 401 error - then you can create a custom Transport.

      @@ -76160,7 +76166,7 @@

      SundayFunday

      2022-10-04 14:23:33
      -

      *Thread Reply:* amazing thank you for your help. i will take a look

      +

      *Thread Reply:* amazing thank you for your help. i will take a look

      @@ -76186,7 +76192,7 @@

      SundayFunday

      2022-10-04 14:49:47
      -

      *Thread Reply:* @Maciej Obuchowski is there a way to extend the plugin like how we can wrap the custom backend with 2.2? or would it be necessary to fork it.

      +

      *Thread Reply:* @Maciej Obuchowski is there a way to extend the plugin like how we can wrap the custom backend with 2.2? or would it be necessary to fork it.

      we're trying to not fork and instead opt with extending.

      @@ -76214,7 +76220,7 @@

      SundayFunday

      2022-10-05 04:55:05
      -

      *Thread Reply:* I think it's best to fork, since it's getting loaded by Airflow as an entrypoint: https://github.com/OpenLineage/OpenLineage/blob/133110300e8ea4e42e3640608cfed459683d5a8d/integration/airflow/setup.py#L70

      +

      *Thread Reply:* I think it's best to fork, since it's getting loaded by Airflow as an entrypoint: https://github.com/OpenLineage/OpenLineage/blob/133110300e8ea4e42e3640608cfed459683d5a8d/integration/airflow/setup.py#L70
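
For context, the registration there is a plain setuptools entry point, roughly:

setup(
    # ...
    entry_points={
        "airflow.plugins": [
            "OpenLineagePlugin = openlineage.airflow.plugin:OpenLineagePlugin",
        ],
    },
)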

      @@ -76273,7 +76279,7 @@

      SundayFunday

      2022-10-05 13:29:24
      -

      *Thread Reply:* got it. and in terms of the openlineage.yml and defining a custom transport is there a way i can define where openlineage-python should look for the custom transport? e.g. different path

      +

      *Thread Reply:* got it. and in terms of the openlineage.yml and defining a custom transport is there a way i can define where openlineage-python should look for the custom transport? e.g. different path

      @@ -76299,7 +76305,7 @@

      SundayFunday

      2022-10-05 13:30:04
      -

*Thread Reply:* because from the docs i can't tell, except for the file i'm supposed to copy and implement.

+

*Thread Reply:* because from the docs i can't tell, except for the file i'm supposed to copy and implement.

      @@ -76325,7 +76331,7 @@

      SundayFunday

      2022-10-05 14:18:19
      -

      *Thread Reply:* @Paul Lee you should derive from Transport base class and register type as full python import path to your custom transport, for example https://github.com/OpenLineage/OpenLineage/blob/f8533266491acea2159f602f782a99a4f8a82cca/client/python/tests/openlineage.yml#L2

      +

      *Thread Reply:* @Paul Lee you should derive from Transport base class and register type as full python import path to your custom transport, for example https://github.com/OpenLineage/OpenLineage/blob/f8533266491acea2159f602f782a99a4f8a82cca/client/python/tests/openlineage.yml#L2

      @@ -76376,7 +76382,7 @@

      SundayFunday

      2022-10-05 14:20:48
      -

*Thread Reply:* your custom transport should also define a custom Config class, and this class should implement a from_dict method

+

*Thread Reply:* your custom transport should also define a custom Config class, and this class should implement a from_dict method

      @@ -76402,7 +76408,7 @@

      SundayFunday

      2022-10-05 14:20:56
      -

      *Thread Reply:* the whole process is here: https://github.com/OpenLineage/OpenLineage/blob/a62484ec14359a985d283c639ac7e8b9cfc54c2e/client/python/openlineage/client/transport/factory.py#L47

      +

      *Thread Reply:* the whole process is here: https://github.com/OpenLineage/OpenLineage/blob/a62484ec14359a985d283c639ac7e8b9cfc54c2e/client/python/openlineage/client/transport/factory.py#L47
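
Putting the thread together, a custom transport sketch might look like this (names follow the test transport and factory linked above; attribute names may differ between client versions, and the endpoint key is made up):

from openlineage.client.transport import Config, Transport

class PipelineConfig(Config):
    def __init__(self, endpoint: str) -> None:
        self.endpoint = endpoint

    @classmethod
    def from_dict(cls, params: dict) -> "PipelineConfig":
        # Receives the remaining keys of the transport section of openlineage.yml
        return cls(endpoint=params["endpoint"])

class PipelineTransport(Transport):
    kind = "pipeline"        # hypothetical kind
    config = PipelineConfig  # tells the factory how to parse the config section

    def __init__(self, config: PipelineConfig) -> None:
        self.endpoint = config.endpoint

    def emit(self, event) -> None:
        # Replace with delivery to your own pipeline / auth scheme.
        print(f"would send {event} to {self.endpoint}")

The matching openlineage.yml would then set transport.type to the full import path of the class (e.g. type: my_package.transport.PipelineTransport), as in the linked test config.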

      @@ -76453,7 +76459,7 @@

      SundayFunday

      2022-10-05 14:21:09
      -

      *Thread Reply:* and I know we need to document this better 🙂

      +

      *Thread Reply:* and I know we need to document this better 🙂

      @@ -76483,7 +76489,7 @@

      SundayFunday

      2022-10-05 15:35:31
      -

      *Thread Reply:* amazing, thanks for all your help 🙂 +1 to the docs, if i have some time when done i will push up some docs to document what i've done

      +

      *Thread Reply:* amazing, thanks for all your help 🙂 +1 to the docs, if i have some time when done i will push up some docs to document what i've done

      @@ -76509,7 +76515,7 @@

      SundayFunday

      2022-10-05 15:50:29
      -

      *Thread Reply:* https://github.com/openlineage/docs/ - let me know and I'll review 🙂

      +

      *Thread Reply:* https://github.com/openlineage/docs/ - let me know and I'll review 🙂

      @@ -76648,7 +76654,7 @@

      SundayFunday

      2022-10-04 14:25:49
      -

      *Thread Reply:* Thanks, all. The release is authorized.

      +

      *Thread Reply:* Thanks, all. The release is authorized.

      @@ -76751,7 +76757,7 @@

      SundayFunday

      2022-10-06 13:29:30
      -

*Thread Reply:* would love to add improvements to the docs :) for newcomers

+

*Thread Reply:* would love to add improvements to the docs :) for newcomers

      @@ -76781,7 +76787,7 @@

      SundayFunday

      2022-10-06 13:31:07
      -

      *Thread Reply:* also, what’s TSC?

      +

      *Thread Reply:* also, what’s TSC?

      @@ -76807,7 +76813,7 @@

      SundayFunday

      2022-10-06 15:20:23
      -

      *Thread Reply:* Technical Steering Committee, but it’s open to everyone

      +

      *Thread Reply:* Technical Steering Committee, but it’s open to everyone

      @@ -76837,7 +76843,7 @@

      SundayFunday

      2022-10-06 15:20:45
      -

      *Thread Reply:* and we encourage newcomers to attend

      +

      *Thread Reply:* and we encourage newcomers to attend

      @@ -76889,7 +76895,7 @@

      SundayFunday

      2022-10-06 14:39:27
      -

      *Thread Reply:* is there any error/warn message logged maybe?

      +

      *Thread Reply:* is there any error/warn message logged maybe?

      @@ -76915,7 +76921,7 @@

      SundayFunday

      2022-10-06 14:40:53
      -

      *Thread Reply:* none that i'm seeing on our workers. i do see that our custom http transport is being utilized on START.

      +

      *Thread Reply:* none that i'm seeing on our workers. i do see that our custom http transport is being utilized on START.

      but on SUCCESS nothing fires.

      @@ -76943,7 +76949,7 @@

      SundayFunday

      2022-10-06 14:41:21
      -

      *Thread Reply:* which makes me believe the listeners themselves aren't being utilized? 🤔

      +

      *Thread Reply:* which makes me believe the listeners themselves aren't being utilized? 🤔

      @@ -76969,7 +76975,7 @@

      SundayFunday

      2022-10-06 16:37:54
      -

      *Thread Reply:* uhm, any chance you're experiencing this with custom extractors?

      +

      *Thread Reply:* uhm, any chance you're experiencing this with custom extractors?

      @@ -76995,7 +77001,7 @@

      SundayFunday

      2022-10-06 16:38:13
      -

      *Thread Reply:* I'd be happy to jump on a quick call if you wish

      +

      *Thread Reply:* I'd be happy to jump on a quick call if you wish

      @@ -77021,7 +77027,7 @@

      SundayFunday

      2022-10-06 16:38:40
      -

      *Thread Reply:* but in more EU friendly hours 🙂

      +

      *Thread Reply:* but in more EU friendly hours 🙂

      @@ -77047,7 +77053,7 @@

      SundayFunday

      2022-10-07 16:19:47
      -

*Thread Reply:* no custom extractors, it's using the base extractor. a call would be 👍. let me look at my calendar and EU hours.

+

*Thread Reply:* no custom extractors, it's using the base extractor. a call would be 👍. let me look at my calendar and EU hours.

      @@ -77144,7 +77150,7 @@

      SundayFunday

      2022-10-20 05:11:09
      -

*Thread Reply:* Apparently the value is hard-coded somewhere in the code that I couldn't figure out, but at least I learnt that on my Mac this port 5000 being held up can be freed by following the simple step below.

+

*Thread Reply:* Apparently the value is hard-coded somewhere in the code that I couldn't figure out, but at least I learnt that on my Mac this port 5000 being held up can be freed by following the simple step below.

      @@ -77211,7 +77217,7 @@

      SundayFunday

      2022-10-11 03:47:31
      -

      *Thread Reply:* Hi @Hanna Moazam,

      +

      *Thread Reply:* Hi @Hanna Moazam,

Within recent PR https://github.com/OpenLineage/OpenLineage/pull/1111, I removed BigQuery dependencies from the spark2, spark32 and spark3 subprojects. It has to stay in shared because of BigQueryNodeVisitor. The usage of BigQueryNodeVisitor is tricky, as we never know if the bigquery classes are available at runtime or not. The check is done in io.openlineage.spark.agent.lifecycle.BaseVisitorFactory:
if (BigQueryNodeVisitor.hasBigQueryClasses()) {

@@ -77303,7 +77309,7 @@

      SundayFunday

      2022-10-11 09:20:44
      -

      *Thread Reply:* So, if we wanted to add Snowflake, we would need to:

      +

      *Thread Reply:* So, if we wanted to add Snowflake, we would need to:

      1. Pick a version of snowflake's spark library
      2. Pick a version of scala that we target (i.e. we are only going to support Snowflake in Spark 3.2 so scala 2.12 will be hard coded)
      3. Add the visitor code to Shared
      4. Add the dependencies to app (ONLY if there is an integration test in app?? This is the confusing part still)
      @@ -77332,7 +77338,7 @@

      SundayFunday

      2022-10-12 03:51:54
      -

*Thread Reply:* Yes. Please note that the snowflake library will not be included in the target OpenLineage jar, so you may test it manually against multiple Snowflake library versions or even adjust the code in case of minor differences.

+

*Thread Reply:* Yes. Please note that the snowflake library will not be included in the target OpenLineage jar, so you may test it manually against multiple Snowflake library versions or even adjust the code in case of minor differences.

      @@ -77362,7 +77368,7 @@

      SundayFunday

      2022-10-12 05:20:17
      -

      *Thread Reply:* Thank you Pawel!

      +

      *Thread Reply:* Thank you Pawel!

      @@ -77388,7 +77394,7 @@

      SundayFunday

      2022-10-12 12:18:16
      -

*Thread Reply:* Basically the same pattern you've already done with Kusto 😉

+

*Thread Reply:* Basically the same pattern you've already done with Kusto 😉
https://github.com/OpenLineage/OpenLineage/blob/a96ecdabe66567151e7739e25cd9dd03d6[…]va/io/openlineage/spark/agent/lifecycle/BaseVisitorFactory.java

      @@ -77440,7 +77446,7 @@

      SundayFunday

      2022-10-12 12:26:35
      -

      *Thread Reply:* We actually used only reflection for Kusto and were hoping to do it the 'better' way with the package itself for snowflake - if it's possible :)

      +

      *Thread Reply:* We actually used only reflection for Kusto and were hoping to do it the 'better' way with the package itself for snowflake - if it's possible :)

      @@ -77496,7 +77502,7 @@

      SundayFunday

      2022-10-11 05:03:26
      -

      *Thread Reply:* Reference implementation of OpenLineage consumer is Marquez: https://github.com/MarquezProject/marquez

      +

      *Thread Reply:* Reference implementation of OpenLineage consumer is Marquez: https://github.com/MarquezProject/marquez

      @@ -77876,7 +77882,7 @@

      SundayFunday

      2022-10-26 15:22:22
      -

      *Thread Reply:* Hey @Austin Poulton, welcome! 👋

      +

      *Thread Reply:* Hey @Austin Poulton, welcome! 👋

      @@ -77902,7 +77908,7 @@

      SundayFunday

      2022-10-31 06:09:41
      -

      *Thread Reply:* thanks Harel 🙂

      +

      *Thread Reply:* thanks Harel 🙂

      @@ -77973,7 +77979,7 @@

      SundayFunday

      2022-11-01 13:37:54
      -

      *Thread Reply:* Thanks, all! The release is authorized. We will initiate it within 48 hours.

      +

      *Thread Reply:* Thanks, all! The release is authorized. We will initiate it within 48 hours.

      @@ -78025,7 +78031,7 @@

      SundayFunday

      2022-11-02 09:19:43
      -

      *Thread Reply:* I think amundsen-openlineage dataloader precedes column-level lineage in OL by a bit, so I doubt this works

      +

      *Thread Reply:* I think amundsen-openlineage dataloader precedes column-level lineage in OL by a bit, so I doubt this works

      @@ -78051,7 +78057,7 @@

      SundayFunday

      2022-11-02 15:54:31
      -

      *Thread Reply:* do you want to open up an issue for it @Iftach Schonbaum?

      +

      *Thread Reply:* do you want to open up an issue for it @Iftach Schonbaum?

      @@ -78170,7 +78176,7 @@

      SundayFunday

      2022-11-03 12:25:29
      -

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated shortly.

      +

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated shortly.

      @@ -78268,7 +78274,7 @@

      SundayFunday

      2022-11-04 06:34:27
      -

      *Thread Reply:* > Are there any tutorial and documentation how to create an Openlinage connector. +

      *Thread Reply:* > Are there any tutorial and documentation how to create an Openlinage connector. We have somewhat of a start of a doc: https://openlineage.io/docs/development/developing/

      @@ -78440,7 +78446,7 @@

      SundayFunday

      2022-11-08 10:22:15
      -

      *Thread Reply:* Welcome Kenton! Happy to help 👍

      +

      *Thread Reply:* Welcome Kenton! Happy to help 👍

      @@ -78497,7 +78503,7 @@

      SundayFunday

      2022-11-08 10:06:10
      -

      *Thread Reply:* What kind of data? My first feeling is that you need to extend the Spark integration

      +

      *Thread Reply:* What kind of data? My first feeling is that you need to extend the Spark integration

      @@ -78523,7 +78529,7 @@

      SundayFunday

      2022-11-09 00:35:29
      -

*Thread Reply:* Yes, we wanted to add information like user/job description that we can use later with the rest of the OpenLineage event fields in our system

+

*Thread Reply:* Yes, we wanted to add information like user/job description that we can use later with the rest of the OpenLineage event fields in our system

      @@ -78549,7 +78555,7 @@

      SundayFunday

      2022-11-09 00:41:35
      -

      *Thread Reply:* I can see in this PR https://github.com/OpenLineage/OpenLineage/pull/490 that env values can be captured which we can use to add some custom metadata but it seems it is specific to Databricks only.

      +

      *Thread Reply:* I can see in this PR https://github.com/OpenLineage/OpenLineage/pull/490 that env values can be captured which we can use to add some custom metadata but it seems it is specific to Databricks only.

      @@ -78636,7 +78642,7 @@

      SundayFunday

      2022-11-09 05:14:50
      -

      *Thread Reply:* I think it makes sense to have something like that, but generic, if you want to contribute it

      +

      *Thread Reply:* I think it makes sense to have something like that, but generic, if you want to contribute it

      @@ -78666,7 +78672,7 @@

      SundayFunday

      2022-11-14 03:28:35
      -

*Thread Reply:* @Maciej Obuchowski Do you mean adding something like

+

*Thread Reply:* @Maciej Obuchowski Do you mean adding something like
spark.openlineage.jobFacet.FacetName.Key=Value
to the spark conf should add a new job facet like
"FacetName": {
    "Key": "Value"

@@ -78696,7 +78702,7 @@

      SundayFunday

      2022-11-14 05:56:02
      -

*Thread Reply:* We can argue about the name of that key, but yes, something like that. Just notice that while it's possible to attach something to run and job facets directly, it would be much harder to do this with datasets

+

*Thread Reply:* We can argue about the name of that key, but yes, something like that. Just notice that while it's possible to attach something to run and job facets directly, it would be much harder to do this with datasets

      @@ -78748,7 +78754,7 @@

      SundayFunday

      2022-11-10 02:22:18
      -

*Thread Reply:* Hi @Varun Singh, what version of openlineage-spark were you using? Are you able to copy the lineage event here?

+

*Thread Reply:* Hi @Varun Singh, what version of openlineage-spark were you using? Are you able to copy the lineage event here?

      @@ -78946,7 +78952,7 @@

      SundayFunday

      2022-11-11 16:28:07
      -

*Thread Reply:* What if we just don't include it in the BaseVisitorFactory but only in the Spark3 visitor factories?

+

*Thread Reply:* What if we just don't include it in the BaseVisitorFactory but only in the Spark3 visitor factories?

      @@ -78998,7 +79004,7 @@

      SundayFunday

      2022-11-11 16:24:30
      -

      *Thread Reply:* You might look here: https://github.com/OpenLineage/OpenLineage/blob/f7049c599a0b1416408860427f0759624326677d/client/python/openlineage/client/serde.py#L51

      +

      *Thread Reply:* You might look here: https://github.com/OpenLineage/OpenLineage/blob/f7049c599a0b1416408860427f0759624326677d/client/python/openlineage/client/serde.py#L51

      @@ -79079,7 +79085,7 @@

      SundayFunday

      2022-11-15 02:09:25
      -

      *Thread Reply:* I don’t think this is possible at the moment.

      +

      *Thread Reply:* I don’t think this is possible at the moment.

      @@ -79135,7 +79141,7 @@

      SundayFunday

      2022-11-16 09:13:17
      -

      *Thread Reply:* Thanks, all. The release is authorized.

      +

      *Thread Reply:* Thanks, all. The release is authorized.

      @@ -79161,7 +79167,7 @@

      SundayFunday

      2022-11-16 10:41:07
      -

      *Thread Reply:* The PR for the changelog updates: https://github.com/OpenLineage/OpenLineage/pull/1306

      +

      *Thread Reply:* The PR for the changelog updates: https://github.com/OpenLineage/OpenLineage/pull/1306

      @@ -79213,7 +79219,7 @@

      SundayFunday

      2022-11-16 12:27:12
      -

*Thread Reply:* I think we had a similar request before, but nothing was implemented.

+

*Thread Reply:* I think we had a similar request before, but nothing was implemented.

      @@ -79333,7 +79339,7 @@

      SundayFunday

      2022-11-22 13:40:44
      -

*Thread Reply:* It's kind of a hard problem on the UI side. The backend can express that relationship

+

*Thread Reply:* It's kind of a hard problem on the UI side. The backend can express that relationship

      @@ -79359,7 +79365,7 @@

      SundayFunday

      2022-11-22 13:48:58
      -

*Thread Reply:* Thanks for replying. Could you please point me to the API that allows me to do that? I've been calling GET /lineage with the dataset in the node ID, e.g., nodeId=dataset:my_dataset. Where could I specify the version of my dataset?

+

*Thread Reply:* Thanks for replying. Could you please point me to the API that allows me to do that? I've been calling GET /lineage with the dataset in the node ID, e.g., nodeId=dataset:my_dataset. Where could I specify the version of my dataset?

      @@ -79411,7 +79417,7 @@

      SundayFunday

      2022-11-19 04:54:13
      -

*Thread Reply:* Templated fields are rendered before generating lineage data. Do you have some sample code, or preferably logs?

+

*Thread Reply:* Templated fields are rendered before generating lineage data. Do you have some sample code, or preferably logs?

      @@ -79437,7 +79443,7 @@

      SundayFunday

      2022-11-22 13:40:11
      -

      *Thread Reply:* If you're on 1.10 then I think it won't work

      +

      *Thread Reply:* If you're on 1.10 then I think it won't work

      @@ -79463,7 +79469,7 @@

      SundayFunday

      2022-11-28 12:50:39
      -

      *Thread Reply:* @Maciej Obuchowski we are still on airflow 1.10.15 unfortunately.

      +

      *Thread Reply:* @Maciej Obuchowski we are still on airflow 1.10.15 unfortunately.

      cc. @Eli Schachar @Allison Suarez

      @@ -79491,7 +79497,7 @@

      SundayFunday

      2022-11-28 12:50:49
      -

      *Thread Reply:* is there no workaround we can make work?

      +

      *Thread Reply:* is there no workaround we can make work?

      @@ -79517,7 +79523,7 @@

      SundayFunday

      2022-11-28 12:51:01
      -

      *Thread Reply:* @Jakub Dardziński is this for airflow versions 2.0+?

      +

      *Thread Reply:* @Jakub Dardziński is this for airflow versions 2.0+?

      @@ -79569,7 +79575,7 @@

      SundayFunday

      2022-11-21 07:28:04
      -

*Thread Reply:* Yeah. However, to add it, just a tiny bit of code would be required.

+

*Thread Reply:* Yeah. However, to add it, just a tiny bit of code would be required.

      Either in the URL version https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/main/java/io/openlineage/spark/agent/EventEmitter.java#L48

      @@ -79699,7 +79705,7 @@

      SundayFunday

      2022-11-22 18:45:48
      -

      *Thread Reply:* I think that's what NominalTimeFacet covers

      +

      *Thread Reply:* I think that's what NominalTimeFacet covers

      @@ -79772,7 +79778,7 @@

      SundayFunday

      2022-11-28 10:29:58
      -

      *Thread Reply:* Hey @Rahul Sharma, what version of Airflow are you running?

      +

      *Thread Reply:* Hey @Rahul Sharma, what version of Airflow are you running?

      @@ -79798,7 +79804,7 @@

      SundayFunday

      2022-11-28 10:30:14
      -

      *Thread Reply:* i am using airflow 2.x

      +

      *Thread Reply:* i am using airflow 2.x

      @@ -79824,7 +79830,7 @@

      SundayFunday

      2022-11-28 10:30:27
      -

      *Thread Reply:* can we connect if you have time ?

      +

      *Thread Reply:* can we connect if you have time ?

      @@ -79850,7 +79856,7 @@

      SundayFunday

      2022-11-28 11:11:58
      -

      *Thread Reply:* did you see these docs before? https://openlineage.io/integration/apache-airflow/#airflow-20

      +

      *Thread Reply:* did you see these docs before? https://openlineage.io/integration/apache-airflow/#airflow-20

      @@ -79897,7 +79903,7 @@

      SundayFunday

      2022-11-28 11:12:22
      -

      *Thread Reply:* yes

      +

      *Thread Reply:* yes

      @@ -79923,7 +79929,7 @@

      SundayFunday

      2022-11-28 11:12:36
      -

      *Thread Reply:* i already set configuration in airflow.cfg file

      +

      *Thread Reply:* i already set configuration in airflow.cfg file

      @@ -79949,7 +79955,7 @@

      SundayFunday

      2022-11-28 11:12:57
      -

      *Thread Reply:* where are you sending the events to?

      +

      *Thread Reply:* where are you sending the events to?

      @@ -79975,7 +79981,7 @@

      SundayFunday

      2022-11-28 11:13:24
      -

      *Thread Reply:* i have a docker machine on which marquez is working

      +

      *Thread Reply:* i have a docker machine on which marquez is working

      @@ -80001,7 +80007,7 @@

      SundayFunday

      2022-11-28 11:13:47
      -

      *Thread Reply:* so, what is the issue you are seeing?

      +

      *Thread Reply:* so, what is the issue you are seeing?

      @@ -80027,7 +80033,7 @@

      SundayFunday

      2022-11-28 11:15:37
      -

      *Thread Reply:* there is no error

      +

      *Thread Reply:* there is no error

      @@ -80053,7 +80059,7 @@

      SundayFunday

      2022-11-28 11:16:01
      -

      *Thread Reply:* ```[lineage]

      +

*Thread Reply:* ```[lineage]
# what lineage backend to use
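
(For reference, the documented form of that section points the backend at the lineage_backend module referenced elsewhere in this log:)

[lineage]
# what lineage backend to use
backend = openlineage.lineage_backend.OpenLineageBackend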

      @@ -80094,7 +80100,7 @@

SundayFunday

      2022-11-28 11:16:09
      -

      *Thread Reply:* above config i have set

      +

      *Thread Reply:* above config i have set

      @@ -80120,7 +80126,7 @@

SundayFunday

      2022-11-28 11:16:22
      -

*Thread Reply:* please let me know if there is anything else I need to do

+

*Thread Reply:* please let me know if there is anything else I need to do

      @@ -80172,7 +80178,7 @@

SundayFunday

      2022-11-25 02:20:40
      -

      *Thread Reply:* please have a look at openapi definition of the event: https://openlineage.io/apidocs/openapi/

      +

      *Thread Reply:* please have a look at openapi definition of the event: https://openlineage.io/apidocs/openapi/

      @@ -80224,7 +80230,7 @@

SundayFunday

      2022-11-30 14:10:10
      -

      *Thread Reply:* hey, I'll DM you.

      +

      *Thread Reply:* hey, I'll DM you.

      @@ -80287,7 +80293,7 @@

SundayFunday

      2022-12-06 13:56:17
      -

*Thread Reply:* Thanks, all. The release is authorized and will be initiated within two business days.

+

*Thread Reply:* Thanks, all. The release is authorized and will be initiated within two business days.

      @@ -80370,7 +80376,7 @@

SundayFunday

      2022-12-02 15:21:20
      -

      *Thread Reply:* Microsoft https://openlineage.io/blog/openlineage-microsoft-purview/

      +

      *Thread Reply:* Microsoft https://openlineage.io/blog/openlineage-microsoft-purview/

      @@ -80421,7 +80427,7 @@

SundayFunday

      2022-12-05 13:54:06
      -

      *Thread Reply:* I think we can share that we have over 2,000 installs of that Microsoft solution accelerator using OpenLineage.

      +

      *Thread Reply:* I think we can share that we have over 2,000 installs of that Microsoft solution accelerator using OpenLineage.

      That means we have thousands of companies having experimented with OpenLineage and Microsoft Purview.

      @@ -80526,7 +80532,7 @@

SundayFunday

      2022-12-07 14:22:58
      -

      *Thread Reply:* For open discussion, I'd like to ask the team for an overview of how the different gradle files are working together for the Spark implementation. I'm terribly confused on where dependencies need to be added (whether it's in shared, app, or a spark version specific folder). Maybe @Maciej Obuchowski...?

      +

      *Thread Reply:* For open discussion, I'd like to ask the team for an overview of how the different gradle files are working together for the Spark implementation. I'm terribly confused on where dependencies need to be added (whether it's in shared, app, or a spark version specific folder). Maybe @Maciej Obuchowski...?

      @@ -80556,7 +80562,7 @@

SundayFunday

      2022-12-07 14:25:12
      -

      *Thread Reply:* Unfortunately I'll be unable to attend the meeting @Will Johnson 😞

      +

      *Thread Reply:* Unfortunately I'll be unable to attend the meeting @Will Johnson 😞

      @@ -80586,7 +80592,7 @@

SundayFunday

      2022-12-08 13:03:08
      -

      *Thread Reply:* This is starting now. CC @Will Johnson

      +

      *Thread Reply:* This is starting now. CC @Will Johnson

      @@ -80612,7 +80618,7 @@

SundayFunday

      2022-12-09 19:24:15
      -

      *Thread Reply:* @Will Johnson Check the notes and the recording. @Michael Collado did a pass at explaining the relationship between shared, app and the versions

      +

      *Thread Reply:* @Will Johnson Check the notes and the recording. @Michael Collado did a pass at explaining the relationship between shared, app and the versions

      @@ -80638,7 +80644,7 @@

SundayFunday

      2022-12-09 19:24:30
      -

      *Thread Reply:* feel free to follow up here as well

      +

      *Thread Reply:* feel free to follow up here as well

      @@ -80664,7 +80670,7 @@

SundayFunday

      2022-12-09 19:39:37
      -

      *Thread Reply:* ascii art to the rescue! (top “depends on” bottom)

      +

      *Thread Reply:* ascii art to the rescue! (top “depends on” bottom)

          app
        /  |  \
  spark2 spark3 spark32
        \  |  /
        shared

@@ -80719,7 +80725,7 @@

SundayFunday

      2022-12-09 19:40:05
      -

      *Thread Reply:* 😍

      +

      *Thread Reply:* 😍

      @@ -80745,7 +80751,7 @@

SundayFunday

      2022-12-09 19:41:13
      -

      *Thread Reply:* (btw, we should have written datakin to output ascii art; it’s obviously the superior way to generate graphs 😜)

      +

      *Thread Reply:* (btw, we should have written datakin to output ascii art; it’s obviously the superior way to generate graphs 😜)

      @@ -80771,7 +80777,7 @@

SundayFunday

      2022-12-14 05:18:53
      -

      *Thread Reply:* Hi, is there a recording for this meeting?

      +

      *Thread Reply:* Hi, is there a recording for this meeting?

      @@ -80823,7 +80829,7 @@

SundayFunday

      2022-12-07 22:05:25
      -

      *Thread Reply:* The namespace is the bucket and the dataset name is the path. Is there a blob storage provider in particular you are thinking of?

      +

      *Thread Reply:* The namespace is the bucket and the dataset name is the path. Is there a blob storage provider in particular you are thinking of?

      @@ -80849,7 +80855,7 @@

SundayFunday

      2022-12-07 23:13:41
      -

      *Thread Reply:* Thanks, that makes sense. We use GCS, so it is already covered by the naming conventions documented. I was just not sure if I was understanding the document correctly or not.

      +

      *Thread Reply:* Thanks, that makes sense. We use GCS, so it is already covered by the naming conventions documented. I was just not sure if I was understanding the document correctly or not.

      @@ -80875,7 +80881,7 @@

SundayFunday

      2022-12-07 23:34:33
      -

      *Thread Reply:* No problem. Let us know if you have suggestions on the wording to make the doc clearer

      +

      *Thread Reply:* No problem. Let us know if you have suggestions on the wording to make the doc clearer

      @@ -80975,7 +80981,7 @@

SundayFunday

      2022-12-09 11:51:16
      -

*Thread Reply:* Dataset dependencies are represented through a common relationship with a Job - e.g., the task that performed the transformation.

+

*Thread Reply:* Dataset dependencies are represented through a common relationship with a Job - e.g., the task that performed the transformation.
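
Concretely, "dataset 1 depends on table 1 and table 2" is expressed through the job that produced dataset 1; a sketch with the Python client (all names are placeholders):

from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

event = RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid4())),
    job=Job(namespace="my_pipeline", name="build_dataset_1"),  # the transformation
    inputs=[
        Dataset(namespace="my_db", name="table_1"),
        Dataset(namespace="my_db", name="table_2"),
    ],
    outputs=[Dataset(namespace="my_db", name="dataset_1")],
    producer="https://example.com/my-agent",  # hypothetical producer URI
)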

      @@ -81001,7 +81007,7 @@

SundayFunday

      2022-12-11 09:01:19
      -

*Thread Reply:* Is it possible to populate a table-level dependency without any transformation using the OpenLineage specifications? Like defining that dataset 1 is dependent on table 1 and table 2, which can be represented as separate datasets

+

*Thread Reply:* Is it possible to populate a table-level dependency without any transformation using the OpenLineage specifications? Like defining that dataset 1 is dependent on table 1 and table 2, which can be represented as separate datasets

      @@ -81027,7 +81033,7 @@

SundayFunday

      2022-12-13 15:24:20
      -

      *Thread Reply:* Not explicitly, in today's spec. The guiding principle is that something created that dependency, and the dependency changes over time in a way that is important to study.

      +

      *Thread Reply:* Not explicitly, in today's spec. The guiding principle is that something created that dependency, and the dependency changes over time in a way that is important to study.

      @@ -81053,7 +81059,7 @@

SundayFunday

      2022-12-13 15:25:12
      -

*Thread Reply:* I say this to explain why it is the way it is - but the spec can change over time to serve new use cases, certainly!

+

*Thread Reply:* I say this to explain why it is the way it is - but the spec can change over time to serve new use cases, certainly!

      @@ -81105,7 +81111,7 @@

SundayFunday

      2022-12-14 09:56:22
      -

      *Thread Reply:* Hi @Anirudh Shrinivason, you could start with column-lineage & spark workshop available here -> https://github.com/OpenLineage/workshops/tree/main/spark

      +

      *Thread Reply:* Hi @Anirudh Shrinivason, you could start with column-lineage & spark workshop available here -> https://github.com/OpenLineage/workshops/tree/main/spark

      @@ -81135,7 +81141,7 @@

SundayFunday

      2022-12-14 10:05:54
      -

      *Thread Reply:* Hi @Paweł Leszczyński Thanks for the link! But this does not really answer the concern.

      +

      *Thread Reply:* Hi @Paweł Leszczyński Thanks for the link! But this does not really answer the concern.

      @@ -81161,7 +81167,7 @@

SundayFunday

      2022-12-14 10:06:08
      -

      *Thread Reply:* I am already able to capture column lineage

      +

      *Thread Reply:* I am already able to capture column lineage

      @@ -81187,7 +81193,7 @@

SundayFunday

      2022-12-14 10:06:33
      -

      *Thread Reply:* What I would like is to capture some extra environment variables, and send it to the server along with the lineage

      +

      *Thread Reply:* What I would like is to capture some extra environment variables, and send it to the server along with the lineage

      @@ -81213,7 +81219,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-14 11:22:59
      -

      *Thread Reply:* i remember we already have a facet for that: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/facets/EnvironmentFacet.java

      +

      *Thread Reply:* i remember we already have a facet for that: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/facets/EnvironmentFacet.java

      @@ -81290,7 +81296,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-14 11:24:07
      -

      *Thread Reply:* but it is only used at the moment to capture some databricks environment attributes

      +

      *Thread Reply:* but it is only used at the moment to capture some databricks environment attributes

      @@ -81316,7 +81322,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-14 11:28:29
      -

*Thread Reply:* so you can contribute to the project and add a feature which adds specified/all environment variables to the lineage event.

+

*Thread Reply:* so you can contribute to the project and add a feature which adds specified/all environment variables to the lineage event.

you can also have a look at the extending section of the Spark integration docs (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#extending) and create a class that adds a run facet builder according to your needs.
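(For orientation, the payload such a facet carries is just a map of selected variables. A rough, illustrative sketch of the "environment-properties" run facet shape the Spark integration emits for Databricks today - the keys and values below are invented:)

# Illustrative shape of the "environment-properties" run facet (values made up):
environment_properties_facet = {
    "environment-properties": {
        "orgId": "1234567890",
        "spark.databricks.clusterUsageTags.clusterOwnerOrgId": "1234567890",
    }
}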

      @@ -81344,7 +81350,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-14 11:29:28
      -

*Thread Reply:* the third way is to create an issue related to this, because being able to send selected/all environment variables in the OL event seems like a really cool feature.

+

*Thread Reply:* the third way is to create an issue related to this, because being able to send selected/all environment variables in the OL event seems like a really cool feature.

      @@ -81374,7 +81380,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-14 21:49:19
      -

      *Thread Reply:* That is great! Thank you so much! This really helps!

      +

      *Thread Reply:* That is great! Thank you so much! This really helps!

      @@ -81400,7 +81406,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-15 01:44:42
      -

*Thread Reply:* List&lt;String&gt; dbPropertiesKeys =

+

*Thread Reply:* List&lt;String&gt; dbPropertiesKeys =
    Arrays.asList(
        "orgId",
        "spark.databricks.clusterUsageTags.clusterOwnerOrgId",

@@ -81443,7 +81449,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-15 01:57:05
      -

      *Thread Reply:* I have opened an issue in the community here: https://github.com/OpenLineage/OpenLineage/issues/1419

      +

      *Thread Reply:* I have opened an issue in the community here: https://github.com/OpenLineage/OpenLineage/issues/1419

      @@ -81501,7 +81507,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-01 02:24:39
      -

*Thread Reply:* Hi @Paweł Leszczyński I have opened a PR to help add this use case. Please help to see if we can merge it in. Thanks!

+

*Thread Reply:* Hi @Paweł Leszczyński I have opened a PR to help add this use case. Please help to see if we can merge it in. Thanks!
https://github.com/OpenLineage/OpenLineage/pull/1545

      @@ -81595,7 +81601,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-02 11:45:52
      -

      *Thread Reply:* Hey @Anirudh Shrinivason, sorry for late reply, but I reviewed the PR.

      +

      *Thread Reply:* Hey @Anirudh Shrinivason, sorry for late reply, but I reviewed the PR.

      @@ -81621,7 +81627,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-06 03:06:42
      -

      *Thread Reply:* Hey thanks a lot! I have made the requested changes! Thanks!

      +

      *Thread Reply:* Hey thanks a lot! I have made the requested changes! Thanks!

      @@ -81647,7 +81653,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-06 03:06:49
      -

      *Thread Reply:* @Maciej Obuchowski ^ 🙂

      +

      *Thread Reply:* @Maciej Obuchowski ^ 🙂

      @@ -81677,7 +81683,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-06 09:09:34
      -

      *Thread Reply:* Hey @Anirudh Shrinivason, took a look at it but it unfortunately fails integration tests (throws NPE), can you take a look again?

      +

      *Thread Reply:* Hey @Anirudh Shrinivason, took a look at it but it unfortunately fails integration tests (throws NPE), can you take a look again?

23/02/06 12:18:39 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception
java.lang.NullPointerException

@@ -81723,7 +81729,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-07 04:17:02
      -

      *Thread Reply:* Hi yeah my bad. It should be fixed in the latest push. But I think the tests are not running in the CI because of some GCP environment issue? I am not really sure how to fix it...

      +

      *Thread Reply:* Hi yeah my bad. It should be fixed in the latest push. But I think the tests are not running in the CI because of some GCP environment issue? I am not really sure how to fix it...

      @@ -81749,7 +81755,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-07 04:18:46
      -

      *Thread Reply:* I can make them run, it's just that running them on forks is disabled. We need to make it more clear I suppose

      +

      *Thread Reply:* I can make them run, it's just that running them on forks is disabled. We need to make it more clear I suppose

      @@ -81779,7 +81785,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-07 04:24:38
      -

      *Thread Reply:* Ahh I see thanks! Also, some of the tests are failing on my local, such as https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/test/java/io/openlineage/spark/agent/lifecycle/DeltaDataSourceTest.java. Is this expected behaviour?

      +

      *Thread Reply:* Ahh I see thanks! Also, some of the tests are failing on my local, such as https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/test/java/io/openlineage/spark/agent/lifecycle/DeltaDataSourceTest.java. Is this expected behaviour?

      @@ -81917,7 +81923,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-07 07:20:11
      -

      *Thread Reply:* tests failing isn't expected behaviour 🙂

      +

      *Thread Reply:* tests failing isn't expected behaviour 🙂

      @@ -81943,7 +81949,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-08 03:37:23
      -

      *Thread Reply:* Ahh yeap it was a local ide issue on my side. I added some tests to verify the presence of env variables too.

      +

      *Thread Reply:* Ahh yeap it was a local ide issue on my side. I added some tests to verify the presence of env variables too.

      @@ -81969,7 +81975,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-08 03:47:22
      -

      *Thread Reply:* @Anirudh Shrinivason let me know then when you'll push fixed version, I can run full tests then

      +

      *Thread Reply:* @Anirudh Shrinivason let me know then when you'll push fixed version, I can run full tests then

      @@ -81995,7 +82001,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-08 03:49:35
      -

      *Thread Reply:* I have pushed just now

      +

      *Thread Reply:* I have pushed just now

      @@ -82021,7 +82027,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-08 03:49:39
      -

      *Thread Reply:* You can run the tests

      +

      *Thread Reply:* You can run the tests

      @@ -82047,7 +82053,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-08 04:13:07
      -

      *Thread Reply:* @Maciej Obuchowski mb I pushed again rn. Missed out a closing bracket.

      +

      *Thread Reply:* @Maciej Obuchowski mb I pushed again rn. Missed out a closing bracket.

      @@ -82077,7 +82083,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-10 00:47:04
      -

      *Thread Reply:* @Maciej Obuchowski Hi, could we merge this PR in? I'd like to see if we can have these changes in the new release...

      +

      *Thread Reply:* @Maciej Obuchowski Hi, could we merge this PR in? I'd like to see if we can have these changes in the new release...

      @@ -82129,7 +82135,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-16 13:29:44
      -

      *Thread Reply:* Hi 👋 this would require a series of JSON calls:

      +

      *Thread Reply:* Hi 👋 this would require a series of JSON calls:

      1. start the first task
      2. end the first task, specify output dataset
      3. start the second task, specify input dataset
4. end the second task (a sketch of these four calls follows below)
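(To make that concrete: a minimal sketch of those four calls using the OpenLineage Python client. The namespace, job, and dataset names here are invented, and the URL assumes a local Marquez:)

from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # assumes a local Marquez
PRODUCER = "https://example.com/my-pipeline"  # hypothetical producer URI
interim = Dataset(namespace="my-namespace", name="interim_table")

def emit(job_name, run_id, state, inputs=None, outputs=None):
    # One OpenLineage RunEvent per call: START or COMPLETE for a given job run.
    client.emit(RunEvent(
        eventType=state,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=run_id),
        job=Job(namespace="my-namespace", name=job_name),
        producer=PRODUCER,
        inputs=inputs or [],
        outputs=outputs or [],
    ))

run1, run2 = str(uuid4()), str(uuid4())
emit("first_task", run1, RunState.START)                        # 1. start the first task
emit("first_task", run1, RunState.COMPLETE, outputs=[interim])  # 2. end it, specifying the output dataset
emit("second_task", run2, RunState.START, inputs=[interim])     # 3. start the second task, specifying the input
emit("second_task", run2, RunState.COMPLETE, inputs=[interim])  # 4. end the second task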
      @@ -82158,7 +82164,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-16 13:32:08
      -

*Thread Reply:* in OpenLineage relationships are typically Job -> Dataset -> Job, so

+

*Thread Reply:* in OpenLineage relationships are typically Job -> Dataset -> Job, so
• you create a relationship between datasets by referring to them in the same job - i.e., this task read from these datasets and wrote to those datasets
• you create a relationship between tasks by referring to the same datasets across both of them - i.e., this task wrote that dataset and this other task read from it

      @@ -82186,7 +82192,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-16 13:35:06
      -

      *Thread Reply:* @Bramha Aelem if you look in this directory, you can find example start/complete JSON calls that show how to specify input/output datasets.

      +

      *Thread Reply:* @Bramha Aelem if you look in this directory, you can find example start/complete JSON calls that show how to specify input/output datasets.

      (it’s an airflow workshop, but those examples are for a part of the workshop that doesn’t involve airflow)

      @@ -82214,7 +82220,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-16 13:35:46
      -

      *Thread Reply:* (these can also be found in the docs)

      +

      *Thread Reply:* (these can also be found in the docs)

      openlineage.io
      @@ -82265,7 +82271,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-16 14:49:30
      -

      *Thread Reply:* @Ross Turk - Thanks for the details. will try and get back to you on it

      +

      *Thread Reply:* @Ross Turk - Thanks for the details. will try and get back to you on it

      @@ -82291,7 +82297,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-17 19:53:21
      -

      *Thread Reply:* @Ross Turk - Good Evening, It worked as expected. I am able to replicate the scenarios which I am looking for.

      +

      *Thread Reply:* @Ross Turk - Good Evening, It worked as expected. I am able to replicate the scenarios which I am looking for.

      @@ -82321,7 +82327,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-17 19:53:48
      -

      *Thread Reply:* @Ross Turk - Thanks for your response.

      +

      *Thread Reply:* @Ross Turk - Thanks for your response.

      @@ -82347,7 +82353,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-12 13:23:56
      -

*Thread Reply:* @Ross Turk - First activity: I am making an HTTP call to pull the lookup data and store it in ADLS.

+

*Thread Reply:* @Ross Turk - First activity: I am making an HTTP call to pull the lookup data and store it in ADLS.
Second activity: After the completion of the first activity, I am making an Azure Databricks call to use the lookup file and generate the output tables.
How can I refer to the Databricks-generated table facets as an input to the subsequent activities in the pipeline? When I refer to it as an input, the Spark tables metadata does not show up. How can this be achieved? After the execution of each activity in the ADF pipeline I am sending start and complete/fail event lineage to Marquez.

      @@ -82460,7 +82466,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-20 14:27:59
      -

      *Thread Reply:* welcome! let’s us know if you have any questions

      +

      *Thread Reply:* welcome! let’s us know if you have any questions

      @@ -82512,7 +82518,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2022-12-29 10:47:00
      -

      *Thread Reply:* Hey Matt! Happy to have you here! Feel free to reach out if you have any questions

      +

      *Thread Reply:* Hey Matt! Happy to have you here! Feel free to reach out if you have any questions

      @@ -82728,7 +82734,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-03 13:20:14
      -

      *Thread Reply:* @Ross Turk maybe you can help with this?

      +

      *Thread Reply:* @Ross Turk maybe you can help with this?

      @@ -82754,7 +82760,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-03 13:34:23
      -

      *Thread Reply:* hmm, I believe the dbt-ol integration does capture bytes/rows, but only for some data sources: https://github.com/OpenLineage/OpenLineage/blob/6ae1fd5665d5fd539b05d044f9b6fb831ce9d475/integration/common/openlineage/common/provider/dbt.py#L567

      +

      *Thread Reply:* hmm, I believe the dbt-ol integration does capture bytes/rows, but only for some data sources: https://github.com/OpenLineage/OpenLineage/blob/6ae1fd5665d5fd539b05d044f9b6fb831ce9d475/integration/common/openlineage/common/provider/dbt.py#L567
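(Where it is captured, the row/byte counts land on the output dataset as an outputStatistics facet, roughly this shape - the values are invented:)

# Illustrative shape of the outputStatistics output dataset facet (values made up):
output_statistics_facet = {
    "outputStatistics": {
        "rowCount": 1500,
        "size": 1048576,  # bytes
    }
}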

      @@ -82805,7 +82811,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-03 13:34:58
      -

      *Thread Reply:* I haven't personally tried it with Snowflake in a few versions, but the code suggests that it's one of them.

      +

      *Thread Reply:* I haven't personally tried it with Snowflake in a few versions, but the code suggests that it's one of them.

      @@ -82831,7 +82837,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-03 13:35:42
      -

      *Thread Reply:* @Max you say your dbt-ol run is resulting in only jobs and no datasets emitted, is that correct?

      +

      *Thread Reply:* @Max you say your dbt-ol run is resulting in only jobs and no datasets emitted, is that correct?

      @@ -82857,7 +82863,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-03 13:38:06
      -

      *Thread Reply:* if so, I'd say something rather strange is going on because in my experience each model should result in a Job and a Dataset.

      +

      *Thread Reply:* if so, I'd say something rather strange is going on because in my experience each model should result in a Job and a Dataset.

      @@ -82909,7 +82915,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-03 01:53:10
      -

      *Thread Reply:* I was looking for something similar to the airflow integration

      +

      *Thread Reply:* I was looking for something similar to the airflow integration

      @@ -82935,7 +82941,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-03 13:21:18
      -

      *Thread Reply:* hey @Kuldeep - i don't think there's something for Luigi right now - is that something you'd potentially be interested in?

      +

      *Thread Reply:* hey @Kuldeep - i don't think there's something for Luigi right now - is that something you'd potentially be interested in?

      @@ -82961,7 +82967,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-03 13:23:53
      -

      *Thread Reply:* @Viraj Parekh Yes this is something we are interested in! There are a lot of projects out there that use luigi

      +

      *Thread Reply:* @Viraj Parekh Yes this is something we are interested in! There are a lot of projects out there that use luigi

      @@ -83026,7 +83032,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-03 23:07:59
      -

*Thread Reply:* Hi @Michael Robinson "a new, improved mechanism for passing conf parameters to the OpenLineage client in the Spark integration"

+

*Thread Reply:* Hi @Michael Robinson "a new, improved mechanism for passing conf parameters to the OpenLineage client in the Spark integration"
Would it be possible to have more details on what this entails, please? Thanks!

      @@ -83053,7 +83059,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-04 09:21:46
      -

      *Thread Reply:* @Tomasz Nazarewicz might explain this better

      +

      *Thread Reply:* @Tomasz Nazarewicz might explain this better

      @@ -83079,7 +83085,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-04 10:04:22
      -

*Thread Reply:* @Anirudh Shrinivason until now, if you wanted to add a new property to the OL client, you also had to implement it in the integration, because the integration had to parse all properties, create appropriate objects, etc. The new implementation makes client properties transparent to the integration; they are only passed through, and parsing happens inside the client.

+

*Thread Reply:* @Anirudh Shrinivason until now, if you wanted to add a new property to the OL client, you also had to implement it in the integration, because the integration had to parse all properties, create appropriate objects, etc. The new implementation makes client properties transparent to the integration; they are only passed through, and parsing happens inside the client.
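(In practice that means the integration just forwards spark.openlineage.** properties to the client. A minimal PySpark sketch - the endpoint is hypothetical, and the artifact coordinates assume the 0.19.x line discussed here:)

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ol-client-config-example")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.19.2")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # These transport.** keys are parsed by the client, not by the integration:
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")  # hypothetical backend
    .getOrCreate()
)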

      @@ -83105,7 +83111,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-04 13:02:39
      -

      *Thread Reply:* Thanks, all. The release is authorized and will commence shortly 🙂

      +

      *Thread Reply:* Thanks, all. The release is authorized and will commence shortly 🙂

      @@ -83131,7 +83137,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-04 22:00:55
      -

      *Thread Reply:* @Tomasz Nazarewicz Ahh I see. Okay thanks!

      +

      *Thread Reply:* @Tomasz Nazarewicz Ahh I see. Okay thanks!

      @@ -83192,7 +83198,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-05 23:45:38
      -

      *Thread Reply:* @Michael Robinson Will there be a recording?

      +

      *Thread Reply:* @Michael Robinson Will there be a recording?

      @@ -83218,7 +83224,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-06 09:10:50
      -

      *Thread Reply:* @Anirudh Shrinivason Yes, and the recording will be here: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

      +

      *Thread Reply:* @Anirudh Shrinivason Yes, and the recording will be here: https://wiki.lfaidata.foundation/display/OpenLineage/Monthly+TSC+meeting

      @@ -83364,7 +83370,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-09 02:28:43
      -

*Thread Reply:* Hi @Hanna Moazam, we've written a Jupyter notebook to demo the dataset symlinks feature:

+

*Thread Reply:* Hi @Hanna Moazam, we've written a Jupyter notebook to demo the dataset symlinks feature:
https://github.com/OpenLineage/workshops/blob/main/spark/dataset_symlinks.ipynb

For the scenario you describe, there should be a symlink facet sent similar to:

@@ -83464,7 +83470,7 @@
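(The example JSON fell outside this diff hunk; for reference, a symlinks dataset facet carries a list of alternate identifiers, along these lines - the names below are illustrative:)

# Illustrative shape of the symlinks dataset facet (identifiers made up):
symlinks_facet = {
    "symlinks": {
        "identifiers": [
            {
                "namespace": "hive://metastore-host:9083",
                "name": "default.some_table",
                "type": "TABLE",
            }
        ]
    }
}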

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-09 14:21:10
      -

      *Thread Reply:* This is so awesome, @Paweł Leszczyński - Thank you so much for sharing this! I'm wondering if we could extend this to capture the hive JDBC Connection URL. I will explore this and put in an issue and PR to try and extend it. Thank you for the insights!

      +

      *Thread Reply:* This is so awesome, @Paweł Leszczyński - Thank you so much for sharing this! I'm wondering if we could extend this to capture the hive JDBC Connection URL. I will explore this and put in an issue and PR to try and extend it. Thank you for the insights!

      @@ -83591,7 +83597,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-12 17:31:12
      -

      *Thread Reply:* @Varun Singh why not just use the KafkaTransport and the Event Hub's Kafka endpoint?

      +

      *Thread Reply:* @Varun Singh why not just use the KafkaTransport and the Event Hub's Kafka endpoint?

      https://github.com/yogyang/OpenLineage/blob/2b7fa8bbd19a2207d54756e79aea7a542bf7bb[…]/main/java/io/openlineage/client/transports/KafkaTransport.java

      @@ -83747,7 +83753,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-12 17:43:55
      -

*Thread Reply:* @Julien Le Dem is there a channel to discuss the community call / ask follow-up questions on the community call topics? For example, I wanted to ask more about the AirflowFacet and whether we expect to introduce more tool-specific facets into the spec. Where's the right place to ask that question? On the PR?

+

*Thread Reply:* @Julien Le Dem is there a channel to discuss the community call / ask follow-up questions on the community call topics? For example, I wanted to ask more about the AirflowFacet and whether we expect to introduce more tool-specific facets into the spec. Where's the right place to ask that question? On the PR?

      @@ -83773,7 +83779,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-17 15:11:05
      -

*Thread Reply:* I think asking in #general is the right place. If there's a specific github issue/PR, this is a good place as well. You can also tag the relevant folks to get their attention

+

*Thread Reply:* I think asking in #general is the right place. If there's a specific github issue/PR, this is a good place as well. You can also tag the relevant folks to get their attention

      @@ -83825,7 +83831,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-13 02:30:37
      -

*Thread Reply:* Hello @Allison Suarez, in case of InsertIntoHadoopFsRelationCommand, the Spark OpenLineage implementation uses the method:

+

*Thread Reply:* Hello @Allison Suarez, in case of InsertIntoHadoopFsRelationCommand, the Spark OpenLineage implementation uses the method:
DatasetIdentifier di = PathUtils.fromURI(command.outputPath().toUri(), "file");
(https://github.com/OpenLineage/OpenLineage/blob/0.19.2/integration/spark/shared/sr[…]ark/agent/lifecycle/plan/InsertIntoHadoopFsRelationVisitor.java)

      @@ -83885,7 +83891,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-22 10:04:56
      -

      *Thread Reply:* @Allison Suarez it would also be good to know what compute engine you're using to run your code on? On-Prem Apache Spark? Azure/AWS/GCP Databricks?

      +

      *Thread Reply:* @Allison Suarez it would also be good to know what compute engine you're using to run your code on? On-Prem Apache Spark? Azure/AWS/GCP Databricks?

      @@ -83911,7 +83917,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-13 18:18:52
      -

      *Thread Reply:* I created a custom visitor and fixed the issue that way, thank you!

      +

      *Thread Reply:* I created a custom visitor and fixed the issue that way, thank you!

      @@ -83993,7 +83999,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-13 14:39:51
      -

      *Thread Reply:* seems like a bug to me, but tagging @Tomasz Nazarewicz / @Paweł Leszczyński

      +

      *Thread Reply:* seems like a bug to me, but tagging @Tomasz Nazarewicz / @Paweł Leszczyński

      @@ -84019,7 +84025,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-13 15:22:19
      -

*Thread Reply:* So we needed a generic way of passing parameters to the client, and made the assumption that every field with a ; would be treated as an array

+

*Thread Reply:* So we needed a generic way of passing parameters to the client, and made the assumption that every field with a ; would be treated as an array

      @@ -84045,7 +84051,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-14 02:00:04
      -

      *Thread Reply:* Thanks for the confirmation, should I add a condition to split only if it's a key that can have array values? We can have a list of such keys like facets.disabled

      +

      *Thread Reply:* Thanks for the confirmation, should I add a condition to split only if it's a key that can have array values? We can have a list of such keys like facets.disabled

      @@ -84071,7 +84077,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-14 02:28:41
      -

      *Thread Reply:* We thought about this solution but it forces us to know the structure of each config and we wanted to avoid that as much as possible

      +

      *Thread Reply:* We thought about this solution but it forces us to know the structure of each config and we wanted to avoid that as much as possible

      @@ -84097,7 +84103,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-14 02:34:06
      -

      *Thread Reply:* Maybe the condition could be having ; and [] in the value

      +

      *Thread Reply:* Maybe the condition could be having ; and [] in the value
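(That bracket-plus-semicolon convention is how array-valued keys like facets.disabled are typically written for the Spark integration; a sketch from PySpark, assuming the spark.openlineage prefix used elsewhere in this channel:)

from pyspark.sql import SparkSession

# A bracketed, semicolon-separated value is parsed as an array by the client:
spark = (
    SparkSession.builder
    .config("spark.openlineage.facets.disabled", "[spark_unknown;spark.logicalPlan]")
    .getOrCreate()
)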

      @@ -84127,7 +84133,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-15 08:14:14
      -

      *Thread Reply:* Makes sense, I can add this check. Thanks @Tomasz Nazarewicz!

      +

      *Thread Reply:* Makes sense, I can add this check. Thanks @Tomasz Nazarewicz!

      @@ -84153,7 +84159,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-16 01:15:19
      -

      *Thread Reply:* Created issue https://github.com/OpenLineage/OpenLineage/issues/1506 for this

      +

      *Thread Reply:* Created issue https://github.com/OpenLineage/OpenLineage/issues/1506 for this

      @@ -84325,7 +84331,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-17 15:56:38
      -

      *Thread Reply:* Thank you, Ross!! I appreciate it. I might have coordinated it, but it’s been a team effort. Lots of folks shared knowledge and time to help us check all the boxes, literally and figuratively (lots of boxes). ;)

      +

      *Thread Reply:* Thank you, Ross!! I appreciate it. I might have coordinated it, but it’s been a team effort. Lots of folks shared knowledge and time to help us check all the boxes, literally and figuratively (lots of boxes). ;)

      @@ -84440,7 +84446,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-18 11:17:14
      -

      *Thread Reply:* What are the errors?

      +

      *Thread Reply:* What are the errors?

      @@ -84466,7 +84472,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-18 11:18:09
      -

*Thread Reply:* 'dbt-ol' is not recognized as an internal or external command,

+

*Thread Reply:* 'dbt-ol' is not recognized as an internal or external command,
operable program or batch file.

      @@ -84493,7 +84499,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 11:11:09
      -

      *Thread Reply:* Hm, I think this is due to different windows conventions around scripts.

      +

      *Thread Reply:* Hm, I think this is due to different windows conventions around scripts.

      @@ -84519,7 +84525,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 14:26:35
      -

      *Thread Reply:* I have not tried it on Windows before myself, but on mac/linux if you make a Python virtual environment in venv/ and run pip install openlineage-dbt, the script winds up in ./venv/bin/dbt-ol.

      +

      *Thread Reply:* I have not tried it on Windows before myself, but on mac/linux if you make a Python virtual environment in venv/ and run pip install openlineage-dbt, the script winds up in ./venv/bin/dbt-ol.
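(A quick way to check where pip puts console scripts like dbt-ol on any platform - on Windows it is typically a Scripts\ folder rather than bin/:)

import sysconfig

# Directory where pip installs console-script entry points (e.g. dbt-ol):
print(sysconfig.get_path("scripts"))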

      @@ -84545,7 +84551,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 14:27:04
      -

      *Thread Reply:* (maybe that helps!)

      +

      *Thread Reply:* (maybe that helps!)

      @@ -84571,7 +84577,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 14:38:23
      -

      *Thread Reply:* This might not work, but I think I have an idea that would allow it to run as python -m dbt-ol run ...

      +

      *Thread Reply:* This might not work, but I think I have an idea that would allow it to run as python -m dbt-ol run ...
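(Background on why the -m form fails as-is: dbt-ol ships as a console script rather than an importable module, and a hyphenated name can't be a Python module anyway. The fix would be roughly a module entry point like this - all names here are hypothetical:)

# hypothetical dbt_ol/__main__.py, which would enable: python -m dbt_ol run ...
import sys

from dbt_ol.main import main  # assumed entry-point function mirroring the dbt-ol script

if __name__ == "__main__":
    sys.exit(main())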

      @@ -84597,7 +84603,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 14:38:27
      -

      *Thread Reply:* That needs one fix though

      +

      *Thread Reply:* That needs one fix though

      @@ -84623,7 +84629,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 14:40:52
      -

*Thread Reply:* Hi @Maciej Obuchowski, thanks for the input, when I try to use python -m dbt-ol run, I see the below error :(

+

*Thread Reply:* Hi @Maciej Obuchowski, thanks for the input, when I try to use python -m dbt-ol run, I see the below error :(
\python.exe: No module named dbt-ol

      @@ -84650,7 +84656,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-24 13:23:56
      -

      *Thread Reply:* We’re seeing a similar issue with the Great Expectations integration at the moment. This is purely a guess, but what happens when you try with openlineage-dbt 0.18.0?

      +

      *Thread Reply:* We’re seeing a similar issue with the Great Expectations integration at the moment. This is purely a guess, but what happens when you try with openlineage-dbt 0.18.0?

      @@ -84676,7 +84682,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-24 13:24:36
      -

      *Thread Reply:* @Michael Robinson GE issue is on Windows?

      +

      *Thread Reply:* @Michael Robinson GE issue is on Windows?

      @@ -84702,7 +84708,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-24 13:24:49
      -

      *Thread Reply:* No, not Windows

      +

      *Thread Reply:* No, not Windows

      @@ -84728,7 +84734,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-24 13:24:55
      -

      *Thread Reply:* (that I know of)

      +

      *Thread Reply:* (that I know of)

      @@ -84754,7 +84760,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-24 13:46:39
      -

      *Thread Reply:* @Michael Robinson - I see the same error. I used 2 Combinations

      +

      *Thread Reply:* @Michael Robinson - I see the same error. I used 2 Combinations

      1. Python 3.8.10 with openlineage-dbt 0.18.0 & Latest
      2. Python 3.9.7 with openlineage-dbt 0.18.0 & Latest
      @@ -84783,7 +84789,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-24 13:49:19
      -

      *Thread Reply:* Hm. You should be able to find the dbt-ol command wherever pip is installing the packages. In my case, that's usually in a virtual environment.

      +

      *Thread Reply:* Hm. You should be able to find the dbt-ol command wherever pip is installing the packages. In my case, that's usually in a virtual environment.

      But if I am not in a virtual environment, it installs the packages in my PYTHONPATH. You might try this to see if the dbt-ol script can be found in one of the directories in sys.path.

      @@ -84820,7 +84826,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-24 13:58:38
      -

      *Thread Reply:* this can help you verify that your PYTHONPATH and PATH are correct - installing an unrelated python command-line tool and seeing if you can execute it:

      +

      *Thread Reply:* this can help you verify that your PYTHONPATH and PATH are correct - installing an unrelated python command-line tool and seeing if you can execute it:

      @@ -84855,7 +84861,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-24 13:59:42
      -

*Thread Reply:* Again, I think this is a Windows issue

+

*Thread Reply:* Again, I think this is a Windows issue

      @@ -84881,7 +84887,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-24 14:00:54
      -

      *Thread Reply:* @Maciej Obuchowski you think even if dbt-ol could be found in the path, that might not be the issue?

      +

      *Thread Reply:* @Maciej Obuchowski you think even if dbt-ol could be found in the path, that might not be the issue?

      @@ -84907,7 +84913,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-24 14:15:13
      -

      *Thread Reply:* Hi @Ross Turk - I could not find the dbt-ol in the site-packages.

      +

      *Thread Reply:* Hi @Ross Turk - I could not find the dbt-ol in the site-packages.

      @@ -84933,7 +84939,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-24 14:16:48
      -

      *Thread Reply:* Hm 😕 then perhaps @Maciej Obuchowski is right and there is a bigger issue here

      +

      *Thread Reply:* Hm 😕 then perhaps @Maciej Obuchowski is right and there is a bigger issue here

      @@ -84959,7 +84965,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-24 14:31:15
      -

      *Thread Reply:* @Ross Turk & @Maciej Obuchowski I see the issue event when I do the install using the https://pypi.org/project/openlineage-dbt/#files - openlineage-dbt-0.19.2.tar.gz.

      +

      *Thread Reply:* @Ross Turk & @Maciej Obuchowski I see the issue event when I do the install using the https://pypi.org/project/openlineage-dbt/#files - openlineage-dbt-0.19.2.tar.gz.

      For some reason, I see only the following folder created

      @@ -85056,7 +85062,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-18 18:54:57
      -

      *Thread Reply:* This is excellent! May we promote it on openlineage and marquez social channels?

      +

      *Thread Reply:* This is excellent! May we promote it on openlineage and marquez social channels?

      @@ -85082,7 +85088,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-18 18:55:30
      -

      *Thread Reply:* This is an amazing write up! 🔥 💯 🚀

      +

      *Thread Reply:* This is an amazing write up! 🔥 💯 🚀

      @@ -85108,7 +85114,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-18 19:49:46
      -

*Thread Reply:* Happy to have it promoted. 😄

+

*Thread Reply:* Happy to have it promoted. 😄
Vish posted on LinkedIn: https://www.linkedin.com/posts/vishwanatha-nayak-b8462054_automate-data-lineage-on-amazon-mwaa-with-activity-7021589819763945473-yMHF if you want something to repost there.

      @@ -85200,7 +85206,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 03:58:42
      -

*Thread Reply:* Hello @Anirudh Shrinivason, you need to build your openlineage-java package first. Possibly you built it some time ago in a different version

+

*Thread Reply:* Hello @Anirudh Shrinivason, you need to build your openlineage-java package first. Possibly you built it some time ago in a different version

      @@ -85226,7 +85232,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 03:59:28
      -

*Thread Reply:* ./gradlew clean build publishToMavenLocal

+

*Thread Reply:* ./gradlew clean build publishToMavenLocal
in /client/java should help.

      @@ -85253,7 +85259,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 04:34:33
      -

      *Thread Reply:* Ahh yeap this works thanks! 🙂

      +

      *Thread Reply:* Ahh yeap this works thanks! 🙂

      @@ -85305,7 +85311,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 11:03:39
      -

*Thread Reply:* It's been a while since I looked at Atlas, but does it even now support anything other than very Java/Apache-adjacent projects like Hive and HBase?

+

*Thread Reply:* It's been a while since I looked at Atlas, but does it even now support anything other than very Java/Apache-adjacent projects like Hive and HBase?

      @@ -85331,7 +85337,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 13:10:11
      -

      *Thread Reply:* To directly answer your question @Sheeri Cabral (Collibra): I am not aware of any resources currently that explain this 😞 but I would welcome the creation of one & pitch in where possible!

      +

      *Thread Reply:* To directly answer your question @Sheeri Cabral (Collibra): I am not aware of any resources currently that explain this 😞 but I would welcome the creation of one & pitch in where possible!

      @@ -85361,7 +85367,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-20 17:00:25
      -

      *Thread Reply:* I don’t know enough about Atlas to make that doc.

      +

      *Thread Reply:* I don’t know enough about Atlas to make that doc.

      @@ -85421,7 +85427,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 11:00:54
      -

      *Thread Reply:* I think most of your questions will be answered by this video: https://www.youtube.com/watch?v=LRr-ja8_Wjs

      +

      *Thread Reply:* I think most of your questions will be answered by this video: https://www.youtube.com/watch?v=LRr-ja8_Wjs

      YouTube
      @@ -85481,7 +85487,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 13:10:58
      -

      *Thread Reply:* I agree - a lot of the answers are in that overview video. You might also take a look at the docs, they do a pretty good job of explaining how it works.

      +

      *Thread Reply:* I agree - a lot of the answers are in that overview video. You might also take a look at the docs, they do a pretty good job of explaining how it works.

      @@ -85507,7 +85513,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 13:19:34
      -

*Thread Reply:* More explicitly:

+

*Thread Reply:* More explicitly:
• Airflow is an interesting platform to observe because it runs a large variety of workloads and lineage can only be automatically extracted for some of them
• In general, OpenLineage is essentially a standard and data model for lineage. There are integrations for various systems, including Airflow, that cause them to emit lineage events to an OpenLineage compatible backend. It's a push model.
• Marquez is one such backend, and the one I recommend for testing & development

@@ -85545,7 +85551,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-19 13:23:33
      -

      *Thread Reply:* FWIW I did a workshop on openlineage and airflow a while back, and it's all in this repo. You can find slides + a quick Python example + a simple Airflow example in there.

      +

      *Thread Reply:* FWIW I did a workshop on openlineage and airflow a while back, and it's all in this repo. You can find slides + a quick Python example + a simple Airflow example in there.

      @@ -85571,7 +85577,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-20 03:44:22
      -

      *Thread Reply:* Thanks a lot!! Very helpful!

      +

      *Thread Reply:* Thanks a lot!! Very helpful!

      @@ -85597,7 +85603,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-20 11:42:43
      -

      *Thread Reply:* 👍

      +

      *Thread Reply:* 👍

      @@ -85649,7 +85655,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-20 19:45:55
      -

*Thread Reply:* Hello Brad, off the top of my head:

+

*Thread Reply:* Hello Brad, off the top of my head:

      @@ -85675,7 +85681,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-20 19:47:15
      -

*Thread Reply:* • Marquez uses the API HTTP Post. so does Astro

+

*Thread Reply:* • Marquez uses the API HTTP Post. so does Astro
• Egeria and Purview prefer consuming through a Kafka topic. There is a ProxyBackend that takes HTTP Posts and writes to Kafka. The client can also be configured to write to Kafka

      @@ -85706,7 +85712,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-20 19:48:09
      -

      *Thread Reply:* @Will Johnson @Mandy Chessell might have opinions

      +

      *Thread Reply:* @Will Johnson @Mandy Chessell might have opinions

      @@ -85732,7 +85738,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-20 19:49:10
      -

      *Thread Reply:* The Microsoft Purview approach is documented here: https://learn.microsoft.com/en-us/samples/microsoft/purview-adb-lineage-solution-accelerator/azure-databricks-to-purview-lineage-connector/

      +

      *Thread Reply:* The Microsoft Purview approach is documented here: https://learn.microsoft.com/en-us/samples/microsoft/purview-adb-lineage-solution-accelerator/azure-databricks-to-purview-lineage-connector/

      learn.microsoft.com
      @@ -85783,7 +85789,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-20 19:49:47
      -

      *Thread Reply:* There’s a blog post about Egeria here: https://openlineage.io/blog/openlineage-egeria/

      +

      *Thread Reply:* There’s a blog post about Egeria here: https://openlineage.io/blog/openlineage-egeria/

      openlineage.io
      @@ -85830,7 +85836,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-22 10:00:56
      -

      *Thread Reply:* @Brad Paskewitz at Microsoft, the solution that Julien linked above, we are using the HTTP Transport (REST API) as we are consuming the OpenLineage Events and transforming them to Apache Atlas / Microsoft Purview.

      +

      *Thread Reply:* @Brad Paskewitz at Microsoft, the solution that Julien linked above, we are using the HTTP Transport (REST API) as we are consuming the OpenLineage Events and transforming them to Apache Atlas / Microsoft Purview.

      However, there is a good deal of interest in using the kafka transport instead and that's our future roadmap.

      @@ -85914,7 +85920,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-25 12:04:48
      -

      *Thread Reply:* Hmm, if Databricks is shutting the process down without waiting for the ListenerBus to clear, I don’t know that there’s a lot we can do. The best thing is to somehow delay the main application thread from exiting. One thing you could try is to subclass the OpenLineageSparkListener and generate a lock for each SparkListenerSQLExecutionStart and release it when the accompanying SparkListenerSQLExecutionEnd event is processed. Then, in the main application, block until all such locks are released. If you try it and it works, let us know!

      +

      *Thread Reply:* Hmm, if Databricks is shutting the process down without waiting for the ListenerBus to clear, I don’t know that there’s a lot we can do. The best thing is to somehow delay the main application thread from exiting. One thing you could try is to subclass the OpenLineageSparkListener and generate a lock for each SparkListenerSQLExecutionStart and release it when the accompanying SparkListenerSQLExecutionEnd event is processed. Then, in the main application, block until all such locks are released. If you try it and it works, let us know!

      @@ -85940,7 +85946,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-26 05:46:35
      -

      *Thread Reply:* Ok thanks for the idea! I'll tell you if I try this and if it works 🤞

      +

      *Thread Reply:* Ok thanks for the idea! I'll tell you if I try this and if it works 🤞

      @@ -85992,7 +85998,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-25 10:38:03
      -

      *Thread Reply:* Hey Petr. Please DM me or describe the issue here 🙂

      +

      *Thread Reply:* Hey Petr. Please DM me or describe the issue here 🙂

      @@ -86044,7 +86050,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-27 15:25:04
      -

*Thread Reply:* Command

+

*Thread Reply:* Command
spark-submit --packages "io.openlineage:openlineage_spark:0.19.+" \
    --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener" \
    --conf "spark.openlineage.transport.type=kafka" \

@@ -86076,7 +86082,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-27 15:25:14
      -

*Thread Reply:* 23/01/27 17:29:06 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception

+

*Thread Reply:* 23/01/27 17:29:06 ERROR AsyncEventQueue: Listener OpenLineageSparkListener threw an exception
java.lang.NullPointerException
    at io.openlineage.client.transports.TransportFactory.build(TransportFactory.java:44)
    at io.openlineage.spark.agent.EventEmitter.&lt;init&gt;(EventEmitter.java:40)

@@ -86121,7 +86127,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-27 15:25:31
      -

      *Thread Reply:* I would appreciate any pointers on getting started with using openlineage-spark with Kafka.

      +

      *Thread Reply:* I would appreciate any pointers on getting started with using openlineage-spark with Kafka.

      @@ -86147,7 +86153,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-27 16:15:00
      -

      *Thread Reply:* Also this might seem a little elementary but the kafka topic itself, should it be hosted on the spark cluster or could it be any kafka topic?

      +

      *Thread Reply:* Also this might seem a little elementary but the kafka topic itself, should it be hosted on the spark cluster or could it be any kafka topic?

      @@ -86173,7 +86179,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 08:37:07
      -

      *Thread Reply:* 👀 Could I get some help on this, please?

      +

      *Thread Reply:* 👀 Could I get some help on this, please?

      @@ -86199,7 +86205,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 09:07:08
      -

      *Thread Reply:* I think any NullPointerException is clearly our bug, can you open issue on OL GitHub?

      +

      *Thread Reply:* I think any NullPointerException is clearly our bug, can you open issue on OL GitHub?

      @@ -86229,7 +86235,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 09:30:51
      -

*Thread Reply:* @Maciej Obuchowski Another interesting thing is if I use 0.19.2 version specifically, I get

+

*Thread Reply:* @Maciej Obuchowski Another interesting thing is if I use 0.19.2 version specifically, I get
23/01/30 14:28:33 INFO RddExecutionContext: RDDs are empty: skipping sending OpenLineage event

      I am trying to print to console at the moment. I haven't been able to get Kafka transport type working though.

      @@ -86258,7 +86264,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 09:41:12
      -

      *Thread Reply:* Are you getting events printed on the console though? This log should not affect you if you're running, for example Spark SQL jobs

      +

      *Thread Reply:* Are you getting events printed on the console though? This log should not affect you if you're running, for example Spark SQL jobs

      @@ -86284,7 +86290,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 09:42:28
      -

*Thread Reply:* I am trying to run a python file using pyspark.
23/01/30 14:40:49 INFO RddExecutionContext: RDDs are empty: skipping sending OpenLineage event

+

*Thread Reply:* I am trying to run a python file using pyspark.
23/01/30 14:40:49 INFO RddExecutionContext: RDDs are empty: skipping sending OpenLineage event
I see this and don't see any events on the console.

      @@ -86311,7 +86317,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 09:55:41
      -

*Thread Reply:* Any logs fitting the pattern

+

*Thread Reply:* Any logs fitting the pattern
log.warn("Unable to access job conf from RDD", nfe);
or
log.info("Found job conf from RDD {}", jc);

@@ -86341,7 +86347,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 09:57:20
      -

*Thread Reply:* 23/01/30 14:40:48 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[2] at reduceByKey at /tmp/spark-20487725-f49b-4587-986d-e63a61890673/statusapidemo.py:47), which has no missing parents

+

*Thread Reply:* 23/01/30 14:40:48 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[2] at reduceByKey at /tmp/spark-20487725-f49b-4587-986d-e63a61890673/statusapidemo.py:47), which has no missing parents
23/01/30 14:40:49 WARN RddExecutionContext: Unable to access job conf from RDD
java.lang.NoSuchFieldException: Field is not instance of HadoopMapRedWriteConfigUtil
    at io.openlineage.spark.agent.lifecycle.RddExecutionContext.lambda$setActiveJob$0(RddExecutionContext.java:117)

@@ -86396,7 +86402,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 10:03:35
      -

      *Thread Reply:* I think this is same problem as this: https://github.com/OpenLineage/OpenLineage/issues/1521

      +

      *Thread Reply:* I think this is same problem as this: https://github.com/OpenLineage/OpenLineage/issues/1521

      @@ -86422,7 +86428,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 10:04:14
      -

      *Thread Reply:* and I think I might have solution on a branch for it, just need to polish it up to release

      +

      *Thread Reply:* and I think I might have solution on a branch for it, just need to polish it up to release

      @@ -86448,7 +86454,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 10:13:37
      -

      *Thread Reply:* Aah got it. I will give it a try with SQL and a jar.

      +

      *Thread Reply:* Aah got it. I will give it a try with SQL and a jar.

Do you have an ETA on when the python issue would be fixed?

      @@ -86476,7 +86482,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 10:37:51
      -

      *Thread Reply:* @Maciej Obuchowski Well I run into the same errors if I run spark-submit on a jar.

      +

      *Thread Reply:* @Maciej Obuchowski Well I run into the same errors if I run spark-submit on a jar.

      @@ -86502,7 +86508,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 10:38:44
      -

      *Thread Reply:* I think that has nothing to do with python

      +

      *Thread Reply:* I think that has nothing to do with python

      @@ -86528,7 +86534,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 10:39:16
      -

      *Thread Reply:* BTW, which Spark version are you using?

      +

      *Thread Reply:* BTW, which Spark version are you using?

      @@ -86554,7 +86560,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 10:41:22
      -

      *Thread Reply:* We are on 3.3.1

      +

      *Thread Reply:* We are on 3.3.1

      @@ -86584,7 +86590,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 11:38:24
      -

*Thread Reply:* @Maciej Obuchowski Do you have an estimated release date for the fix? Our team is specifically interested in using the Emitter to write out to Kafka.

+

*Thread Reply:* @Maciej Obuchowski Do you have an estimated release date for the fix? Our team is specifically interested in using the Emitter to write out to Kafka.

      @@ -86610,7 +86616,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 11:46:30
      -

      *Thread Reply:* I think we plan to release somewhere in the next week

      +

      *Thread Reply:* I think we plan to release somewhere in the next week

      @@ -86640,7 +86646,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-06 09:21:25
      -

      *Thread Reply:* @Susmitha Anandarao PR fixing this has been merged, release should be today

      +

      *Thread Reply:* @Susmitha Anandarao PR fixing this has been merged, release should be today

      @@ -86699,7 +86705,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-27 18:38:40
      -

      *Thread Reply:* hmmm, I am not sure. perhaps @Benji Lampel can help, he’s very familiar with those operators.

      +

      *Thread Reply:* hmmm, I am not sure. perhaps @Benji Lampel can help, he’s very familiar with those operators.

      @@ -86729,7 +86735,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-27 18:46:15
      -

      *Thread Reply:* @Benji Lampel any help would be appreciated!

      +

      *Thread Reply:* @Benji Lampel any help would be appreciated!

      @@ -86755,7 +86761,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-30 09:01:34
      -

*Thread Reply:* Hey Paul, the SQLCheckExtractors were written with the intent that they would be used by a provider that inherits from them - they are all treated as a sort of base class. What is the exact error message you're getting? And what is the operator code?

+

*Thread Reply:* Hey Paul, the SQLCheckExtractors were written with the intent that they would be used by a provider that inherits from them - they are all treated as a sort of base class. What is the exact error message you're getting? And what is the operator code?
Could you try this with a PostgresCheckOperator? (Also, only the SqlColumnCheckOperator and SqlTableCheckOperator will provide data quality facets in their output; those functions are not implementable in the other operators at this time.)
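(For reference, a minimal sketch of the kind of task those two operators cover, since they are the ones that can emit data quality facets - the connection, table, and thresholds here are invented:)

from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLColumnCheckOperator

with DAG(dag_id="orders_quality", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    check_orders = SQLColumnCheckOperator(
        task_id="check_orders",
        conn_id="postgres_default",  # hypothetical connection
        table="public.orders",       # hypothetical table
        column_mapping={
            "id": {"null_check": {"equal_to": 0}},
            "amount": {"min": {"geq_to": 0}},
        },
    )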

      @@ -86787,7 +86793,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-31 14:36:07
      -

*Thread Reply:* @Benji Lampel here is the error message. I am not sure what the operator code is.

+

*Thread Reply:* @Benji Lampel here is the error message. I am not sure what the operator code is.

[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - Traceback (most recent call last):
[2023-01-31, 00:32:38 UTC] {logging_mixin.py:115} WARNING - File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner

@@ -86832,7 +86838,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-31 14:37:06
      -

      *Thread Reply:* and above that

      +

      *Thread Reply:* and above that

[2023-01-31, 00:32:38 UTC] {connection.py:424} ERROR - Unable to retrieve connection from secrets backend (EnvironmentVariablesBackend). Checking subsequent secrets backend.
Traceback (most recent call last):

@@ -86867,7 +86873,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-31 14:39:31
      -

*Thread Reply:* sorry, I should mention we're wrapping the CheckOperator, as we're still migrating from 1.10.15 @Benji Lampel

+

*Thread Reply:* sorry, I should mention we're wrapping the CheckOperator, as we're still migrating from 1.10.15 @Benji Lampel

      @@ -86893,7 +86899,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-31 15:09:51
      -

      *Thread Reply:* What do you mean by wrapping the CheckOperator? Like how so, exactly? Can you show me the operator code you're using in the DAG?

      +

      *Thread Reply:* What do you mean by wrapping the CheckOperator? Like how so, exactly? Can you show me the operator code you're using in the DAG?

      @@ -86919,7 +86925,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-31 17:38:45
      -

      *Thread Reply:* like so

      +

      *Thread Reply:* like so

      class CustomSQLCheckOperator(CheckOperator): ....

      @@ -86948,7 +86954,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-31 17:39:30
      -

*Thread Reply:* I think I found the issue though: we have our own get_hook function, so we don't follow the traditional Airflow way of setting CONN_ID. That's why CONN_ID is always None, and that path only gets called through OpenLineage, which doesn't ever get called with our custom wrapper

+

*Thread Reply:* I think I found the issue though: we have our own get_hook function, so we don't follow the traditional Airflow way of setting CONN_ID. That's why CONN_ID is always None, and that path only gets called through OpenLineage, which doesn't ever get called with our custom wrapper

      @@ -87030,7 +87036,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-31 12:57:47
      -

      *Thread Reply:* The dbt integration uses the python client, you should be able to do something similar than with the java client. See here: https://github.com/OpenLineage/OpenLineage/tree/main/client/python#kafka

      +

      *Thread Reply:* The dbt integration uses the python client, you should be able to do something similar than with the java client. See here: https://github.com/OpenLineage/OpenLineage/tree/main/client/python#kafka

      @@ -87056,7 +87062,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-01-31 13:26:33
      -

      *Thread Reply:* Thank you for this!

      +

      *Thread Reply:* Thank you for this!

I created an openlineage.yml file with the following data to test out the integration locally.
transport:

@@ -87130,7 +87136,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-02 14:39:29
      -

*Thread Reply:* @Susmitha Anandarao It's not installed because it's a large binary package. We don't want to install something giant for every user when the majority won't use it, and it's 100x bigger than the rest of the client.

+

*Thread Reply:* @Susmitha Anandarao It's not installed because it's a large binary package. We don't want to install something giant for every user when the majority won't use it, and it's 100x bigger than the rest of the client.

We need to indicate this way better, though, and not throw this error directly at the user - both in docs and code.
      @@ -87349,7 +87355,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-02 08:39:34
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/999#issuecomment-1209048556

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/999#issuecomment-1209048556

      Does this fit your experience?

      @@ -87377,7 +87383,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-02 08:39:59
      -

      *Thread Reply:* We sometimes experience this in context of very small, quick jobs

      +

      *Thread Reply:* We sometimes experience this in context of very small, quick jobs

      @@ -87403,7 +87409,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-02 08:43:24
      -

*Thread Reply:* Yes, my scenarios are dealing with quick jobs.

+

*Thread Reply:* Yes, my scenarios are dealing with quick jobs.
Good to know that we will be able to solve it with future spark versions. Thanks!

      @@ -87509,7 +87515,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-03 13:24:03
      -

      *Thread Reply:* exciting to see the client proxy work being released by @Minkyu Park 💯

      +

      *Thread Reply:* exciting to see the client proxy work being released by @Minkyu Park 💯

      @@ -87535,7 +87541,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-03 13:35:38
      -

      *Thread Reply:* This was without a doubt among the fastest release votes we’ve ever had 😉 . Thank you! You can expect the release to happen on Monday.

      +

      *Thread Reply:* This was without a doubt among the fastest release votes we’ve ever had 😉 . Thank you! You can expect the release to happen on Monday.

      @@ -87561,7 +87567,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-03 14:02:52
      -

      *Thread Reply:* Lol the proxy is still in development and not ready for use

      +

      *Thread Reply:* Lol the proxy is still in development and not ready for use

      @@ -87587,7 +87593,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-03 14:03:26
      -

      *Thread Reply:* Good point! Let’s make that clear in the release / docs?

      +

      *Thread Reply:* Good point! Let’s make that clear in the release / docs?

      @@ -87617,7 +87623,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-03 14:03:33
      -

      *Thread Reply:* But it doesn’t block anything anyway, so happy to see the release

      +

      *Thread Reply:* But it doesn’t block anything anyway, so happy to see the release

      @@ -87643,7 +87649,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-03 14:04:38
      -

      *Thread Reply:* We can celebrate that the proposal for the proxy is merged. I’m happy with that 🥳

      +

      *Thread Reply:* We can celebrate that the proposal for the proxy is merged. I’m happy with that 🥳

      @@ -87699,7 +87705,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-06 14:47:15
      -

      *Thread Reply:* Hey @Daniel Joanes! thanks for the question.

      +

      *Thread Reply:* Hey @Daniel Joanes! thanks for the question.

      I am not aware of an issue that captures this. Column-level lineage is a somewhat new facet in the spec, and implementations across the various integrations are in varying states of readiness.

      @@ -87729,7 +87735,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-06 14:50:59
      -

      *Thread Reply:* Go for it, once it's created i'll add a watch

      +

      *Thread Reply:* Go for it, once it's created i'll add a watch

      @@ -87759,7 +87765,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-06 14:51:13
      -

      *Thread Reply:* Thanks Ross!

      +

      *Thread Reply:* Thanks Ross!

      @@ -87785,7 +87791,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-06 23:10:30
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1581

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1581

      @@ -87945,7 +87951,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-09 10:50:49
      -

      *Thread Reply:* Thanks for requesting a release. 3 +1s from committers will authorize an immediate release.

      +

      *Thread Reply:* Thanks for requesting a release. 3 +1s from committers will authorize an immediate release.

      @@ -87971,7 +87977,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-09 11:15:35
      -

      *Thread Reply:* 0.20.5 ?

      +

      *Thread Reply:* 0.20.5 ?

      @@ -88001,7 +88007,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-09 11:28:20
      -

      *Thread Reply:* @Michael Robinson auth'd

      +

      *Thread Reply:* @Michael Robinson auth'd

      @@ -88027,7 +88033,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-09 11:32:06
      -

      *Thread Reply:* 👍 the release is authorized

      +

      *Thread Reply:* 👍 the release is authorized

      @@ -88126,14 +88132,14 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      "$ref": "#/$defs/Job" }, "inputs": { - "description": "The set of <strong><em>*input</em><em></strong> datasets.", + "description": "The set of <strong><em>*input</em></strong>* datasets.", "type": "array", "items": { "$ref": "#/$defs/InputDataset" } }, "outputs": { - "description": "The set of <strong></em><em>output</em>*</strong> datasets.", + "description": "The set of <strong><em>*output</em></strong>* datasets.", "type": "array", "items": { "$ref": "#/$defs/OutputDataset" @@ -88389,7 +88395,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-09 16:23:36
      -

      *Thread Reply:* I am also exploring something similar, but writing to kafka, and would want to know more on how we could add custom metadata from spark.

      +

      *Thread Reply:* I am also exploring something similar, but writing to kafka, and would want to know more on how we could add custom metadata from spark.

      @@ -88415,7 +88421,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-10 02:23:40
      -

      *Thread Reply:* Hi @Avinash Pancham @Susmitha Anandarao, it's great to hear about successful experimenting on your side.

      +

      *Thread Reply:* Hi @Avinash Pancham @Susmitha Anandarao, it's great to hear about successful experimenting on your side.

Although the OpenLineage spec provides some built-in facet definitions, a facet object can be anything you want (https://openlineage.io/apidocs/openapi/#tag/OpenLineage/operation/postRunEvent). The example metadata provided in this chat could be put into job or run facets, I believe.
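
As an illustration of how free-form such a facet can be, a sketch with the Python client might look like this; the facet class, field names, and the "myTeam" key are all invented for the example:

```
# A sketch of a custom run facet for the OpenLineage Python client.
# All names here (MyTeamRunFacet, teamName, costCenter, "myTeam") are made up.
import attr
from openlineage.client.facet import BaseFacet

@attr.s
class MyTeamRunFacet(BaseFacet):
    # arbitrary metadata to attach to a run
    teamName: str = attr.ib()
    costCenter: str = attr.ib()

# pass it when building the run, e.g.:
# Run(runId=..., facets={"myTeam": MyTeamRunFacet(teamName="data-platform", costCenter="cc-42")})
```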

      @@ -88445,7 +88451,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-10 09:09:10
      -

      *Thread Reply:* (I would love to see what your specs are! I’m not with Astronomer, just a community member, but I am finding that many of the customizations people are making to the spec are valuable ones that we should consider adding to core)

      +

      *Thread Reply:* (I would love to see what your specs are! I’m not with Astronomer, just a community member, but I am finding that many of the customizations people are making to the spec are valuable ones that we should consider adding to core)

      @@ -88471,7 +88477,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-14 16:51:28
      -

      *Thread Reply:* Are there any examples out there of customizations already done in Spark? An example would definitely help!

      +

      *Thread Reply:* Are there any examples out there of customizations already done in Spark? An example would definitely help!

      @@ -88497,7 +88503,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 08:43:08
      -

      *Thread Reply:* I think @Will Johnson might have something to add about customization

      +

      *Thread Reply:* I think @Will Johnson might have something to add about customization

      @@ -88523,7 +88529,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 23:58:36
      -

      *Thread Reply:* Oh man... Mike Collado did a nice write up on Slack of how many different ways there are to customize / extend OpenLineage. I know we all talked about doing a blog post at one point!

      +

      *Thread Reply:* Oh man... Mike Collado did a nice write up on Slack of how many different ways there are to customize / extend OpenLineage. I know we all talked about doing a blog post at one point!

      @Susmitha Anandarao - You might take a look at https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/[…]ark/agent/facets/builder/DatabricksEnvironmentFacetBuilder.java which has a hard coded set of properties we are extracting.

      @@ -88797,7 +88803,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 04:52:59
      2023-02-15 07:19:39
      -

      *Thread Reply:* @Quentin Nambot this happens because we run additional integration tests against real databases (like BigQuery) which aren't ever configured on forks, since we don't want to expose our secrets. We need to figure out how to make this experience better, but in the meantime we've pushed your code using git-push-fork-to-upstream-branch and it passes all the tests.

      +

      *Thread Reply:* @Quentin Nambot this happens because we run additional integration tests against real databases (like BigQuery) which aren't ever configured on forks, since we don't want to expose our secrets. We need to figure out how to make this experience better, but in the meantime we've pushed your code using git-push-fork-to-upstream-branch and it passes all the tests.

      @@ -88850,7 +88856,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 07:21:49
      -

      *Thread Reply:* Feel free to un-draft your PR if you think it's ready for review

      +

      *Thread Reply:* Feel free to un-draft your PR if you think it's ready for review

      @@ -88876,7 +88882,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 08:03:56
      -

      *Thread Reply:* Ok nice thanks 👍

      +

      *Thread Reply:* Ok nice thanks 👍

      @@ -88902,7 +88908,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 08:04:49
      -

      *Thread Reply:* I think it's ready, however should I update the version somewhere?

      +

      *Thread Reply:* I think it's ready, however should I update the version somewhere?

      @@ -88928,7 +88934,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 08:42:39
      -

      *Thread Reply:* @Quentin Nambot I don't think so - it's just that you opened PR as Draft , so I'm not sure if you want to add something else to it.

      +

      *Thread Reply:* @Quentin Nambot I don't think so - it's just that you opened PR as Draft , so I'm not sure if you want to add something else to it.

      @@ -88958,7 +88964,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 08:43:36
      -

      *Thread Reply:* No I don't want to add anything so I opened it 👍

      +

      *Thread Reply:* No I don't want to add anything so I opened it 👍

      @@ -89035,7 +89041,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 21:30:47
      -

      *Thread Reply:* You can add your own EventHandlerFactory https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/api/OpenLineageEventHandlerFactory.java . See the docs about extending here - https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#extending

      +

      *Thread Reply:* You can add your own EventHandlerFactory https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/api/OpenLineageEventHandlerFactory.java . See the docs about extending here - https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#extending

      @@ -89085,7 +89091,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 21:32:19
      -

*Thread Reply:* I have that set up already like this:

+

*Thread Reply:* I have that set up already like this:
public class LyftOpenLineageEventHandlerFactory implements OpenLineageEventHandlerFactory {
  @Override
  public Collection<PartialFunction<LogicalPlan, List<OutputDataset>>>
@@ -89122,7 +89128,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 21:33:35
      -

      *Thread Reply:* do I just add a constructor? the visitorFactory is private so I wasn't sure if that's something that was intended to change

      +

      *Thread Reply:* do I just add a constructor? the visitorFactory is private so I wasn't sure if that's something that was intended to change

      @@ -89148,7 +89154,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 21:34:30
      -

      *Thread Reply:* .

      +

      *Thread Reply:* .

      @@ -89174,7 +89180,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 21:34:49
      -

      *Thread Reply:* @Michael Collado

      +

      *Thread Reply:* @Michael Collado

      @@ -89200,7 +89206,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 21:35:14
      -

      *Thread Reply:* The VisitorFactory is only used by the internal EventHandlerFactory. It shouldn’t be needed for your custom one

      +

      *Thread Reply:* The VisitorFactory is only used by the internal EventHandlerFactory. It shouldn’t be needed for your custom one

      @@ -89226,7 +89232,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-15 21:35:48
      -

      *Thread Reply:* Have you added the file to the META-INF folder of your jar?

      +

      *Thread Reply:* Have you added the file to the META-INF folder of your jar?
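
For context, that registration is a one-line ServiceLoader file inside the jar; the package prefix below is hypothetical, only the class name comes from this thread:

```
# file: META-INF/services/io.openlineage.spark.api.OpenLineageEventHandlerFactory
# its content is just the fully-qualified name of the implementation, e.g.:
com.example.lineage.LyftOpenLineageEventHandlerFactory
```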

      @@ -89252,7 +89258,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 11:01:56
      -

*Thread Reply:* yes, I am able to use my custom event handler factory with a list of visitors, but for some reason I can't access the visitors for some commands (AlterTableAddPartitionCommand is one)

      +

*Thread Reply:* yes, I am able to use my custom event handler factory with a list of visitors, but for some reason I can't access the visitors for some commands (AlterTableAddPartitionCommand is one)

      @@ -89278,7 +89284,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 11:02:29
      -

      *Thread Reply:* so even if I set up everything correctly I am unable to reach the code for that specific visitor

      +

      *Thread Reply:* so even if I set up everything correctly I am unable to reach the code for that specific visitor

      @@ -89304,7 +89310,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 11:05:22
      -

      *Thread Reply:* and my assumption is I can reach other commands but not this one because the command is not defined in the BaseVisitorFactory but maybe im wrong @Michael Collado

      +

      *Thread Reply:* and my assumption is I can reach other commands but not this one because the command is not defined in the BaseVisitorFactory but maybe im wrong @Michael Collado

      @@ -89330,7 +89336,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 15:05:19
      -

      *Thread Reply:* the VisitorFactory is loaded by the InternalEventHandlerFactory here. However, the createOutputDatasetQueryPlanVisitors should contain a union of everything defined by the VisitorFactory as well as your custom visitors: see this code.

      +

      *Thread Reply:* the VisitorFactory is loaded by the InternalEventHandlerFactory here. However, the createOutputDatasetQueryPlanVisitors should contain a union of everything defined by the VisitorFactory as well as your custom visitors: see this code.

      @@ -89404,7 +89410,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 15:09:21
      -

      *Thread Reply:* there might be a conflict with another visitor that’s being matched against that command. Can you turn on debug logging and look for this line to see what visitor is being applied to that command?

      +

      *Thread Reply:* there might be a conflict with another visitor that’s being matched against that command. Can you turn on debug logging and look for this line to see what visitor is being applied to that command?

      @@ -89461,7 +89467,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 16:54:46
      -

      *Thread Reply:* This was helpful, it works now, thank you so much Michael!

      +

      *Thread Reply:* This was helpful, it works now, thank you so much Michael!

      @@ -89517,7 +89523,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:09:49
      -

      *Thread Reply:* what is the curl cmd you are running? and what endpoint are you hitting? (assuming Marquez?)

      +

      *Thread Reply:* what is the curl cmd you are running? and what endpoint are you hitting? (assuming Marquez?)

      @@ -89543,7 +89549,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:18:28
      -

*Thread Reply:* yep

+

*Thread Reply:* yep
I am running curl:
curl -X POST http://localhost:5000/api/v1/namespaces/test ^
  -H 'Content-Type: application/json' ^
  -d '{ownerName:"me", description:"no description"^
@@ -89578,7 +89584,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:23:08
      -

*Thread Reply:* Marquez logs all supported endpoints (and methods) on start up. For example, here are all the supported methods on /api/v1/namespaces/{namespace}:

+

*Thread Reply:* Marquez logs all supported endpoints (and methods) on start up. For example, here are all the supported methods on /api/v1/namespaces/{namespace}:
marquez-api | DELETE /api/v1/namespaces/{namespace} (marquez.api.NamespaceResource)
marquez-api | GET /api/v1/namespaces/{namespace} (marquez.api.NamespaceResource)
marquez-api | PUT /api/v1/namespaces/{namespace} (marquez.api.NamespaceResource)
@@ -89608,7 +89614,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:26:23
      -

*Thread Reply:* 3rd stupid question of the night

+

*Thread Reply:* 3rd stupid question of the night

      *Thread Reply:* 3rd stupid question of the night Sorry kept on trying POST who knows why

      @@ -89635,7 +89641,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:26:56
      -

      *Thread Reply:* no worries! keep the questions coming!

      +

      *Thread Reply:* no worries! keep the questions coming!

      @@ -89661,7 +89667,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:29:46
      -

      *Thread Reply:* well, maybe because it’s so late on your end! get some rest!

      +

      *Thread Reply:* well, maybe because it’s so late on your end! get some rest!

      @@ -89687,7 +89693,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:36:25
      -

*Thread Reply:* Yeah but I want to see how it works

+

*Thread Reply:* Yeah but I want to see how it works
Right now I have a response 200 for the creation of the names ... but it seems that nothing occurred, neither on the marquez front end (localhost:3000) nor in the database

      @@ -89716,7 +89722,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:37:13
      -

      *Thread Reply:* can you curl the list namespaces endpoint?

      +

      *Thread Reply:* can you curl the list namespaces endpoint?

      @@ -89742,7 +89748,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:38:14
      -

*Thread Reply:* yep: nothing changed

+

*Thread Reply:* yep: nothing changed
only default and food_delivery

      @@ -89769,7 +89775,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:38:47
      -

      *Thread Reply:* can you post your server logs? you should see the request

      +

      *Thread Reply:* can you post your server logs? you should see the request

      @@ -89795,7 +89801,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:40:41
      -

*Thread Reply:* marquez-api | XXX.23.0.4 - - [17/Feb/2023:00:30:38 +0000] "PUT /api/v1/namespaces/ciro HTTP/1.1" 500 110 "-" "-" 7

+

*Thread Reply:* marquez-api | XXX.23.0.4 - - [17/Feb/2023:00:30:38 +0000] "PUT /api/v1/namespaces/ciro HTTP/1.1" 500 110 "-" "-" 7
marquez-api | INFO [2023-02-17 00:32:07,072] marquez.logging.LoggingMdcFilter: status: 200

      @@ -89822,7 +89828,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:41:12
      -

      *Thread Reply:* the server is returning a 500 ?

      +

      *Thread Reply:* the server is returning a 500 ?

      @@ -89848,7 +89854,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:41:57
      -

      *Thread Reply:* odd that LoggingMdcFilter is logging 200

      +

      *Thread Reply:* odd that LoggingMdcFilter is logging 200

      @@ -89874,7 +89880,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:43:24
      -

      *Thread Reply:* Bit confused because now I realize that postman is returning bad request

      +

      *Thread Reply:* Bit confused because now I realize that postman is returning bad request

      @@ -89900,7 +89906,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:43:51
      -

      *Thread Reply:*

      +

      *Thread Reply:*

      @@ -89935,7 +89941,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-16 19:44:30
      -

*Thread Reply:* You'll notice that I go to use 3000 in the url

+

*Thread Reply:* You'll notice that I go to use 3000 in the url
If I use 5000 I get No host

      @@ -89962,7 +89968,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-17 01:14:50
      -

      *Thread Reply:* odd, the API should be using port 5000, have you followed our quickstart for Marquez?

      +

      *Thread Reply:* odd, the API should be using port 5000, have you followed our quickstart for Marquez?

      @@ -89988,7 +89994,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-17 03:43:29
      -

*Thread Reply:* Hello Willy

+

*Thread Reply:* Hello Willy
I am starting from scratch following the instructions from https://openlineage.io/docs/getting-started/
I am on Windows
Instead of
@@ -90038,7 +90044,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-20 03:40:48
      -

*Thread Reply:* The issue is related to Postman

      +

*Thread Reply:* The issue is related to Postman

      @@ -90090,7 +90096,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-17 03:54:47
      -

*Thread Reply:* Hi @Anirudh Shrinivason, this is a great question. We included extra fields in the OpenLineage spec to contain that information:

+

*Thread Reply:* Hi @Anirudh Shrinivason, this is a great question. We included extra fields in the OpenLineage spec to contain that information:
"transformationDescription": {
  "type": "string",
  "description": "a string representation of the transformation applied"
@@ -90125,7 +90131,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-17 04:02:30
      -

      *Thread Reply:* Thanks a lot! That is great. Is there a potential plan in the roadmap to support this for spark?

      +

      *Thread Reply:* Thanks a lot! That is great. Is there a potential plan in the roadmap to support this for spark?

      @@ -90151,7 +90157,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-17 04:08:16
      -

*Thread Reply:* I think there will be a growing interest in that. In general, a dependency may be really difficult to express if many Spark operators are used on input columns to produce an output one. The simple version would be just to detect an identity operation or some kind of hashing.

      +

*Thread Reply:* I think there will be a growing interest in that. In general, a dependency may be really difficult to express if many Spark operators are used on input columns to produce an output one. The simple version would be just to detect an identity operation or some kind of hashing.

To sum up, we don't yet have a proposal on that, but this seems to be a natural next step in enriching column lineage features.

      @@ -90179,7 +90185,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-17 04:40:04
      -

      *Thread Reply:* Got it. Thanks! If this item potentially comes on the roadmap, then I'd be happy to work with other interested developers to help contribute! 🙂

      +

      *Thread Reply:* Got it. Thanks! If this item potentially comes on the roadmap, then I'd be happy to work with other interested developers to help contribute! 🙂

      @@ -90205,7 +90211,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-17 04:43:00
      -

*Thread Reply:* Great to hear that. What you could perhaps start with is coming to our monthly OpenLineage meetings and asking @Michael Robinson to put this item on the discussions list. There are many strategies to address this issue, and hearing your story, usage scenario, and what you are trying to achieve would be super helpful in the design and implementation phase.

      +

*Thread Reply:* Great to hear that. What you could perhaps start with is coming to our monthly OpenLineage meetings and asking @Michael Robinson to put this item on the discussions list. There are many strategies to address this issue, and hearing your story, usage scenario, and what you are trying to achieve would be super helpful in the design and implementation phase.

      @@ -90231,7 +90237,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-17 04:44:18
      -

      *Thread Reply:* Got it! The monthly meeting might be a bit hard for me to attend live, because of the time zone. But I'll try my best to make it to the next one! thanks!

      +

      *Thread Reply:* Got it! The monthly meeting might be a bit hard for me to attend live, because of the time zone. But I'll try my best to make it to the next one! thanks!

      @@ -90257,7 +90263,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-17 09:46:22
      -

      *Thread Reply:* Thank you for bringing this up, @Anirudh Shrinivason. I’ll add it to the agenda of our next meeting because there might be interest from others in adding this to the roadmap.

      +

      *Thread Reply:* Thank you for bringing this up, @Anirudh Shrinivason. I’ll add it to the agenda of our next meeting because there might be interest from others in adding this to the roadmap.

      @@ -90319,7 +90325,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-20 02:10:13
      -

      *Thread Reply:* Hi @thebruuu, pls take a look at logging documentation of Dropwizard (https://www.dropwizard.io/en/latest/manual/core.html#logging) - the framework Marquez is implemented in. The logging configuration section is present in marquez.yml .

      +

      *Thread Reply:* Hi @thebruuu, pls take a look at logging documentation of Dropwizard (https://www.dropwizard.io/en/latest/manual/core.html#logging) - the framework Marquez is implemented in. The logging configuration section is present in marquez.yml .

      @@ -90345,7 +90351,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-20 03:29:07
      -

      *Thread Reply:* Thank You Pavel

      +

      *Thread Reply:* Thank You Pavel

      @@ -90397,7 +90403,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-21 09:12:17
      -

      *Thread Reply:* We generally schedule a release every month, next one will be in the next week - is that okay @Anirudh Shrinivason?

      +

      *Thread Reply:* We generally schedule a release every month, next one will be in the next week - is that okay @Anirudh Shrinivason?

      @@ -90423,7 +90429,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-21 11:38:50
      -

      *Thread Reply:* Yes, there’s one scheduled for next Wednesday, if that suits.

      +

      *Thread Reply:* Yes, there’s one scheduled for next Wednesday, if that suits.

      @@ -90449,7 +90455,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-21 21:45:58
      -

      *Thread Reply:* Okay yeah sure that works. Thanks

      +

      *Thread Reply:* Okay yeah sure that works. Thanks

      @@ -90479,7 +90485,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-01 10:12:45
      -

      *Thread Reply:* @Anirudh Shrinivason we’re expecting the release to happen today or tomorrow, FYI

      +

      *Thread Reply:* @Anirudh Shrinivason we’re expecting the release to happen today or tomorrow, FYI

      @@ -90505,7 +90511,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-01 21:22:40
      -

      *Thread Reply:* Awesome thanks

      +

      *Thread Reply:* Awesome thanks

      @@ -90569,7 +90575,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-23 23:44:03
      -

*Thread Reply:* This is my checkpoint configuration in GE.

+

*Thread Reply:* This is my checkpoint configuration in GE.
name: 'openlineage_checkpoint'
config_version: 1.0
template_name:
@@ -90632,7 +90638,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-24 11:31:05
      -

      *Thread Reply:* What version of GX are you running? And is this being run directly through GX or through Airflow with the operator?

      +

      *Thread Reply:* What version of GX are you running? And is this being run directly through GX or through Airflow with the operator?

      @@ -90658,7 +90664,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-26 20:05:12
      -

*Thread Reply:* I use the latest version of Great Expectations. This error occurs both when run directly through Great Expectations and through Airflow

      +

*Thread Reply:* I use the latest version of Great Expectations. This error occurs both when run directly through Great Expectations and through Airflow

      @@ -90684,7 +90690,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-27 09:10:00
      -

      *Thread Reply:* I noticed another issue in the latest version as well. Try dropping to GE version great-expectations==0.15.44 for now. That is the latest one that works for me.

      +

      *Thread Reply:* I noticed another issue in the latest version as well. Try dropping to GE version great-expectations==0.15.44 for now. That is the latest one that works for me.

      @@ -90710,7 +90716,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-27 09:11:34
      -

      *Thread Reply:* You should definitely open an issue here, and you can tag me @denimalpaca in the comment

      +

      *Thread Reply:* You should definitely open an issue here, and you can tag me @denimalpaca in the comment

      @@ -90736,7 +90742,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-27 18:07:29
      -

      *Thread Reply:* Thanks Benji, but I still have the same problem after I drop to great-expectations==0.15.44 , this is my requirement file

      +

      *Thread Reply:* Thanks Benji, but I still have the same problem after I drop to great-expectations==0.15.44 , this is my requirement file

      great_expectations==0.15.44
       sqlalchemy
      @@ -90771,7 +90777,7 @@ 

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-02-28 13:34:03
      -

      *Thread Reply:* interesting... I do think this may be a GX issue so let's see if they say anything. I can also cross post this thread to their slack

      +

      *Thread Reply:* interesting... I do think this may be a GX issue so let's see if they say anything. I can also cross post this thread to their slack

      @@ -90823,7 +90829,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-01 11:47:40
      -

*Thread Reply:* I know I’m responding to an older post - I’m not sure if this would work in your environment? https://aws.amazon.com/blogs/big-data/build-data-lineage-for-data-lakes-using-aws-glue-amazon-neptune-and-spline/

+

*Thread Reply:* I know I’m responding to an older post - I’m not sure if this would work in your environment? https://aws.amazon.com/blogs/big-data/build-data-lineage-for-data-lakes-using-aws-glue-amazon-neptune-and-spline/
Are you using AWS Glue with Spark jobs?

      @@ -90875,7 +90881,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-02 15:16:14
      -

      *Thread Reply:* This was proposed by our AWS Solution architect but we are not seeing much improvement compared to open lineage. Have you deployed the above solution to prod?

      +

      *Thread Reply:* This was proposed by our AWS Solution architect but we are not seeing much improvement compared to open lineage. Have you deployed the above solution to prod?

      @@ -90901,7 +90907,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-25 11:30:44
      -

      *Thread Reply:* We are currently in the research phase, so we have not deployed to prod. We have customers with thousands of existing scripts that they don’t want to rewrite to add openlineage libraries - i would imagine that if you are already integrating OpenLineage in your code, the spark listener isn’t an improvement. Our research is on magically getting lineage from existing scripts 😄

      +

      *Thread Reply:* We are currently in the research phase, so we have not deployed to prod. We have customers with thousands of existing scripts that they don’t want to rewrite to add openlineage libraries - i would imagine that if you are already integrating OpenLineage in your code, the spark listener isn’t an improvement. Our research is on magically getting lineage from existing scripts 😄

      @@ -90962,7 +90968,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-01 10:26:22
      -

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated as soon as possible.

      +

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated as soon as possible.

      @@ -91014,7 +91020,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-02 05:15:55
      -

*Thread Reply:* GitHub has a new issue template for reporting vulnerabilities, actually, if you use a config that enables this issue template.

      +

*Thread Reply:* GitHub has a new issue template for reporting vulnerabilities, actually, if you use a config that enables this issue template.

      @@ -91202,7 +91208,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-02 20:04:19
      -

      *Thread Reply:* Hey @Paul Lee, are you seeing this happen for Async operators?

      +

      *Thread Reply:* Hey @Paul Lee, are you seeing this happen for Async operators?

      @@ -91228,7 +91234,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-02 20:06:00
      -

*Thread Reply:* might be related to this issue https://github.com/OpenLineage/OpenLineage/pull/1601

+

*Thread Reply:* might be related to this issue https://github.com/OpenLineage/OpenLineage/pull/1601

      *Thread Reply:* might be related to this issue https://github.com/OpenLineage/OpenLineage/pull/1601 that was fixed in 0.20.6

      @@ -91279,7 +91285,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-03 16:15:44
      -

      *Thread Reply:* hmm perhaps.

      +

      *Thread Reply:* hmm perhaps.

      @@ -91305,7 +91311,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-03 16:15:55
      -

      *Thread Reply:* @Harel Shein if i want to turn off openlineage listener how do i do that? do i just remove the package?

      +

      *Thread Reply:* @Harel Shein if i want to turn off openlineage listener how do i do that? do i just remove the package?

      @@ -91331,7 +91337,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-03 16:24:07
      -

      *Thread Reply:* meaning, you don’t want openlineage to collect any information from your Airflow deployment?

      +

      *Thread Reply:* meaning, you don’t want openlineage to collect any information from your Airflow deployment?

      @@ -91361,7 +91367,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-03 16:24:50
      -

      *Thread Reply:* in that case, you could either remove it from your requirements file, or set OPENLINEAGE_DISABLED=True in your Airflow env vars

      +

      *Thread Reply:* in that case, you could either remove it from your requirements file, or set OPENLINEAGE_DISABLED=True in your Airflow env vars

      @@ -91391,7 +91397,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-06 14:43:56
      -

      *Thread Reply:* removed it from requirements and also the backend key in airflow config. needed both

      +

      *Thread Reply:* removed it from requirements and also the backend key in airflow config. needed both

      @@ -91482,7 +91488,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-02 23:46:08
      -

      *Thread Reply:* Are you seeing duplicate START events or do you see two events one that is a START and one that is COMPLETE?

      +

      *Thread Reply:* Are you seeing duplicate START events or do you see two events one that is a START and one that is COMPLETE?

      OpenLineage's events may send partial information. You should expect to collect all events for a given RunId and merge them together to get the complete events.

      @@ -91512,7 +91518,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-03 00:45:19
      -

      *Thread Reply:* Hmm...I'm seeing 2 start events for the same runnable command

      +

      *Thread Reply:* Hmm...I'm seeing 2 start events for the same runnable command

      @@ -91538,7 +91544,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-03 00:45:27
      -

      *Thread Reply:* And 2 complete

      +

      *Thread Reply:* And 2 complete

      @@ -91564,7 +91570,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-03 00:46:08
      -

      *Thread Reply:* I am currently only testing on parquet tables...

      +

      *Thread Reply:* I am currently only testing on parquet tables...

      @@ -91590,7 +91596,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-03 02:31:28
      -

*Thread Reply:* One of OpenLineage's assumptions is the ability to merge lineage events in the backend to make client integrations stateless. So, it is possible that Spark can emit multiple events for the same job. However, sometimes it does not make any sense to send or collect some events, which happened to us some time ago with delta. In that case we decided to filter them and created a filtering mechanism (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/filters) that can be extended in case of other unwanted events being generated and sent.

      +

*Thread Reply:* One of OpenLineage's assumptions is the ability to merge lineage events in the backend to make client integrations stateless. So, it is possible that Spark can emit multiple events for the same job. However, sometimes it does not make any sense to send or collect some events, which happened to us some time ago with delta. In that case we decided to filter them and created a filtering mechanism (https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/filters) that can be extended in case of other unwanted events being generated and sent.

      @@ -91616,7 +91622,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-05 22:59:06
      -

      *Thread Reply:* Ahh I see...okay thanks!

      +

      *Thread Reply:* Ahh I see...okay thanks!

      @@ -91642,7 +91648,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-07 00:05:48
      -

*Thread Reply:* in general, you should build any event consumer system with at-least-once semantics. Even if this issue is fixed, there is a possibility of duplicates for other valid scenarios

      +

*Thread Reply:* in general, you should build any event consumer system with at-least-once semantics. Even if this issue is fixed, there is a possibility of duplicates for other valid scenarios

      @@ -91672,7 +91678,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-09 14:10:47
      -

      *Thread Reply:* Hi..I compared some duplicate 'START' events just now, and noticed that they are exactly the same, with the only exception of one of them having an 'environment-properties' field... Could I just quickly check if this is a bug or a feature haha?

      +

      *Thread Reply:* Hi..I compared some duplicate 'START' events just now, and noticed that they are exactly the same, with the only exception of one of them having an 'environment-properties' field... Could I just quickly check if this is a bug or a feature haha?

      @@ -91698,7 +91704,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-10 01:18:18
      -

      *Thread Reply:* CC: @Paweł Leszczyński ^

      +

      *Thread Reply:* CC: @Paweł Leszczyński ^

      @@ -91821,7 +91827,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-08 17:30:44
      -

      *Thread Reply:* if you can set up env variables for particular notebooks, you can set OPENLINEAGE_DISABLED=true

      +

      *Thread Reply:* if you can set up env variables for particular notebooks, you can set OPENLINEAGE_DISABLED=true

      @@ -91976,7 +91982,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-13 17:08:53
      -

      *Thread Reply:* is DAG run capturing enabled starting airflow 2.5.1 ? https://github.com/apache/airflow/pull/27113

      +

      *Thread Reply:* is DAG run capturing enabled starting airflow 2.5.1 ? https://github.com/apache/airflow/pull/27113

      @@ -92042,7 +92048,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-13 17:11:47
      -

      *Thread Reply:* you're right, only the change was included in 2.5.0

      +

      *Thread Reply:* you're right, only the change was included in 2.5.0

      @@ -92072,7 +92078,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-13 17:43:15
      -

      *Thread Reply:* Thanks Jakub

      +

      *Thread Reply:* Thanks Jakub

      @@ -92196,7 +92202,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-15 15:18:22
      -

      *Thread Reply:* I don’t see it currently on the AirflowRunFacet, but probably not a big deal to add it? @Benji Lampel wdyt?

      +

      *Thread Reply:* I don’t see it currently on the AirflowRunFacet, but probably not a big deal to add it? @Benji Lampel wdyt?

      @@ -92424,7 +92430,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-15 15:22:00
      -

      *Thread Reply:* Definitely could be a good thing to have--is there not some info facet that could hold this data already? I don't see an issue with adding to the AirflowRunFacet tho (full disclosure, I'm not super familiar with this facet)

      +

      *Thread Reply:* Definitely could be a good thing to have--is there not some info facet that could hold this data already? I don't see an issue with adding to the AirflowRunFacet tho (full disclosure, I'm not super familiar with this facet)

      @@ -92454,7 +92460,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-15 15:58:40
      -

      *Thread Reply:* Perhaps DocumentationJobFacet or DocumentationDatasetFacet?

      +

      *Thread Reply:* Perhaps DocumentationJobFacet or DocumentationDatasetFacet?

      @@ -92557,7 +92563,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-18 12:09:23
      -

      *Thread Reply:* dbt-ol does not read from openlineage.yml so you need to pass this information in OPENLINEAGE_NAMESPACE environment variable

      +

      *Thread Reply:* dbt-ol does not read from openlineage.yml so you need to pass this information in OPENLINEAGE_NAMESPACE environment variable

      @@ -92583,7 +92589,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-20 15:17:03
      -

      *Thread Reply:* Hmmm. Interesting! I thought that it used client = OpenLineageClient.from_environment(), I’ll do some testing with Kafka backends.

      +

      *Thread Reply:* Hmmm. Interesting! I thought that it used client = OpenLineageClient.from_environment(), I’ll do some testing with Kafka backends.

      @@ -92609,7 +92615,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-20 15:22:07
      -

*Thread Reply:* Thank you for the hint. I was able to make it work by specifying the env OPENLINEAGE_CONFIG to point to the yml file holding transport info, along with OPENLINEAGE_NAMESPACE

      +

*Thread Reply:* Thank you for the hint. I was able to make it work by specifying the env OPENLINEAGE_CONFIG to point to the yml file holding transport info, along with OPENLINEAGE_NAMESPACE
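
Spelled out, the working setup looks roughly like this; the config path and namespace are placeholders:

```
export OPENLINEAGE_CONFIG=/path/to/openlineage.yml   # yml file holding the transport config
export OPENLINEAGE_NAMESPACE=my_namespace            # namespace dbt-ol emits under
dbt-ol run
```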

      @@ -92639,7 +92645,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-20 15:24:05
      -

      *Thread Reply:* Awesome! That’s exactly what I was going to test.

      +

      *Thread Reply:* Awesome! That’s exactly what I was going to test.

      @@ -92665,7 +92671,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-20 15:25:04
      -

      *Thread Reply:* I think it also works if you put it in $HOME/.openlineage/openlineage.yml.

      +

      *Thread Reply:* I think it also works if you put it in $HOME/.openlineage/openlineage.yml.

      @@ -92695,7 +92701,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-21 08:32:17
      -

      *Thread Reply:* @Susmitha Anandarao I might have provided misleading information. I meant that dbt-ol does not read OL namespace from openlineage.yml but from OPENLINEAGE_NAMESPACE env var instead

      +

      *Thread Reply:* @Susmitha Anandarao I might have provided misleading information. I meant that dbt-ol does not read OL namespace from openlineage.yml but from OPENLINEAGE_NAMESPACE env var instead

      @@ -92893,7 +92899,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-23 04:25:19
      -

      *Thread Reply:* It should be represented by multiple datasets, unless I misunderstood what you mean by multi-table

      +

      *Thread Reply:* It should be represented by multiple datasets, unless I misunderstood what you mean by multi-table

      @@ -92919,7 +92925,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-23 10:55:58
      -

      *Thread Reply:* here at Fivetran when we sync data it is generally 1 schema with multiple tables (sometimes many) so we would want to represent all of that

      +

      *Thread Reply:* here at Fivetran when we sync data it is generally 1 schema with multiple tables (sometimes many) so we would want to represent all of that

      @@ -92945,7 +92951,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-23 11:11:25
      -

      *Thread Reply:* So what I understand:

      +

      *Thread Reply:* So what I understand:

      1. your single job represents synchronization of multiple tables
      2. you want to have precise input-output dataset lineage? am I right?
3.
@@ -92979,7 +92985,7 @@

        MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-23 12:57:15
      -

      *Thread Reply:* Yes your statements are correct. Thanks for sharing that model, that makes sense to me

      +

      *Thread Reply:* Yes your statements are correct. Thanks for sharing that model, that makes sense to me

      @@ -93052,7 +93058,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-27 05:26:06
      -

*Thread Reply:* I think it's better to just create a POJO. This is what we do in the Spark integration, for example.

      +

*Thread Reply:* I think it's better to just create a POJO. This is what we do in the Spark integration, for example.

For now, the JSON Schema generator isn't flexible enough to generate custom facets from whatever schema we give it, so it would be unnecessary complexity

      MARQUEZAPIKEY=[YOURAPIKEY]
      2023-03-27 12:29:57
      -

      *Thread Reply:* Agreed, just a POJO would work. This is using Jackson, so you would use annotations as needed. You can also use a Jackson JSONNode or even Map.

      +

      *Thread Reply:* Agreed, just a POJO would work. This is using Jackson, so you would use annotations as needed. You can also use a Jackson JSONNode or even Map.

      @@ -93165,7 +93171,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-27 17:59:28
      -

*Thread Reply:* That depends on the OL consumer, but for something like SchemaDatasetFacet it seems to be okay to assume the schema stays the same if not sent.

      +

*Thread Reply:* That depends on the OL consumer, but for something like SchemaDatasetFacet it seems to be okay to assume the schema stays the same if not sent.

For others, like OutputStatisticsOutputDatasetFacet, you definitely can't assume that, as the data is unique to each run.

      @@ -93193,7 +93199,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-27 19:05:14
      -

      *Thread Reply:* ok great thanks, that makes sense to me

      +

      *Thread Reply:* ok great thanks, that makes sense to me

      @@ -93254,7 +93260,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-28 04:47:31
      -

      *Thread Reply:* OpenLineage API: https://openlineage.io/docs/getting-started/

      +

      *Thread Reply:* OpenLineage API: https://openlineage.io/docs/getting-started/

      @@ -93327,7 +93333,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-28 08:05:30
      -

*Thread Reply:* I think it would be great to support V2SessionCatalog, and it would very much help if you created a GitHub issue with more explanation and examples of its use.

      +

*Thread Reply:* I think it would be great to support V2SessionCatalog, and it would very much help if you created a GitHub issue with more explanation and examples of its use.

      @@ -93353,7 +93359,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-29 02:53:37
      -

      *Thread Reply:* Sure thanks!

      +

      *Thread Reply:* Sure thanks!

      @@ -93379,7 +93385,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-03-29 05:34:37
      -

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1747

+

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1747
I have opened an issue here. Thanks! 🙂

      @@ -93437,7 +93443,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-17 11:53:52
      -

      *Thread Reply:* Hi @Maciej Obuchowski Just curious, is this issue on the potential roadmap for the next Openlineage release?

      +

      *Thread Reply:* Hi @Maciej Obuchowski Just curious, is this issue on the potential roadmap for the next Openlineage release?

      @@ -93516,7 +93522,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-03 09:32:46
      -

*Thread Reply:* The blog post is a bit old and in the meantime there have been changes introduced in the OpenLineage Python Client.

+

*Thread Reply:* The blog post is a bit old and in the meantime there have been changes introduced in the OpenLineage Python Client.
May I ask if you just want to test the flow, or are you looking for a viable Snowflake data lineage solution?

      @@ -93543,7 +93549,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-03 10:47:57
      -

      *Thread Reply:* I believe that this will work if you change the line to client.transport.emit()

      +

      *Thread Reply:* I believe that this will work if you change the line to client.transport.emit()

      @@ -93569,7 +93575,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-03 10:49:05
      -

      *Thread Reply:* (this would be in the dags/lineage folder, if memory serves)

      +

      *Thread Reply:* (this would be in the dags/lineage folder, if memory serves)

      @@ -93595,7 +93601,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-03 10:57:23
      -

      *Thread Reply:* Ross is right, that should work

      +

      *Thread Reply:* Ross is right, that should work

      @@ -93621,7 +93627,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-04 12:23:13
      -

      *Thread Reply:* This works! Thank you so much!

      +

      *Thread Reply:* This works! Thank you so much!

      @@ -93647,7 +93653,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-04 12:24:40
      -

*Thread Reply:* @Jakub Dardziński I want to use a viable Snowflake data lineage solution alongside an Amazon DataZone Catalog 🙂

      +

*Thread Reply:* @Jakub Dardziński I want to use a viable Snowflake data lineage solution alongside an Amazon DataZone Catalog 🙂

      @@ -93673,7 +93679,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-04 13:03:58
      -

      *Thread Reply:* I have been meaning to revisit that tutorial 👍

      +

      *Thread Reply:* I have been meaning to revisit that tutorial 👍

      @@ -93740,7 +93746,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-03 15:39:46
      -

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated within 48 hours.

      +

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated within 48 hours.

      @@ -93901,7 +93907,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-04 13:39:19
      -

*Thread Reply:* Do you have a small configuration and job to replicate this?

      +

*Thread Reply:* Do you have a small configuration and job to replicate this?

      @@ -93927,7 +93933,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-04 22:21:35
      -

*Thread Reply:* Yeah. For configs:

+

*Thread Reply:* Yeah. For configs:
spark.driver.bindAddress: "localhost"
spark.master: "local[*]"
spark.sql.catalogImplementation: "hive"
@@ -93963,7 +93969,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-04 22:23:19
      -

*Thread Reply:* The issue is because of the spark.jars.packages config... spark.jars config also runs into the same issue. Because the executor tries to fetch the jar from the driver for some reason even though there are no executors set...

      +

*Thread Reply:* The issue is because of the spark.jars.packages config... spark.jars config also runs into the same issue. Because the executor tries to fetch the jar from the driver for some reason even though there are no executors set...

      @@ -93989,7 +93995,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-05 05:38:55
      -

      *Thread Reply:* TBH I'm not sure if we can do anything about it. Seems like just having any SparkListener which is not in Spark jars would fall under the same problems, right?

      +

      *Thread Reply:* TBH I'm not sure if we can do anything about it. Seems like just having any SparkListener which is not in Spark jars would fall under the same problems, right?

      @@ -94015,7 +94021,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-10 06:07:11
      -

*Thread Reply:* Yeah... Actually, this was because of binding the driver ip to localhost. In that case, the executor was not able to get the jar from the driver. But yeah I don't think we could have done anything from the openlineage end anyway for this. Was just an interesting error to encounter lol

      +

*Thread Reply:* Yeah... Actually, this was because of binding the driver ip to localhost. In that case, the executor was not able to get the jar from the driver. But yeah I don't think we could have done anything from the openlineage end anyway for this. Was just an interesting error to encounter lol

      @@ -94067,7 +94073,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-04 13:02:21
      -

      *Thread Reply:* In this case you would have four runevents:

      +

*Thread Reply:* In this case you would have four runevents (see the sketch after this list):

      1. a START event on my-job where my-input is the input and my-output is the output, with a runId you generate on the client
      2. a COMPLETE event on my-job with the same runId from #1
      3. a START event on my-job2 where the input is my-output and the output is my-final-output, with a separate runId you generate
      4. a COMPLETE event on my-job2 with the same runId from #3
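
A sketch of those four events with the Python client; the backend URL, namespace, and producer URI are placeholders, while the job and dataset names come from the example above:

```
# Emit START/COMPLETE pairs for my-job and my-job2.
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # placeholder backend URL
producer = "https://example.com/my-producer"             # placeholder producer URI
ns = "my-namespace"                                      # placeholder namespace

def now():
    return datetime.now(timezone.utc).isoformat()

def run_pair(job_name, inputs, outputs):
    run = Run(runId=str(uuid4()))  # the same runId ties START and COMPLETE together
    job = Job(namespace=ns, name=job_name)
    client.emit(RunEvent(RunState.START, now(), run, job, producer, inputs, outputs))
    client.emit(RunEvent(RunState.COMPLETE, now(), run, job, producer, inputs, outputs))

run_pair("my-job", [Dataset(ns, "my-input")], [Dataset(ns, "my-output")])
run_pair("my-job2", [Dataset(ns, "my-output")], [Dataset(ns, "my-final-output")])
```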
      @@ -94096,7 +94102,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-04 14:53:14
      -

*Thread Reply:* thanks for the response. I tried it but now the UI only shows for like one second and then turns blank. I had a similar issue before. It seems to me every time I add a bad lineage, the UI stops working. I have to delete the docker image :-( Not sure whether it is a MacOS M1 related issue.

      +

*Thread Reply:* thanks for the response. I tried it but now the UI only shows for like one second and then turns blank. I had a similar issue before. It seems to me every time I add a bad lineage, the UI stops working. I have to delete the docker image :-( Not sure whether it is a MacOS M1 related issue.

      @@ -94122,7 +94128,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-04 16:07:06
      -

      *Thread Reply:* Hmmm, that's interesting. Not sure I've seen that before. If you happen to catch it in that state again, perhaps capture the contents of the lineage_events table so it can be replicated.

      +

      *Thread Reply:* Hmmm, that's interesting. Not sure I've seen that before. If you happen to catch it in that state again, perhaps capture the contents of the lineage_events table so it can be replicated.

      @@ -94148,7 +94154,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-04 16:24:28
      -

*Thread Reply:* I can fairly easily reproduce this blank UI issue. Apparently I used the same runId for two different jobs. If I use a different runId (which I should), the lineage displays correctly. Thanks again!

      +

*Thread Reply:* I can fairly easily reproduce this blank UI issue. Apparently I used the same runId for two different jobs. If I use a different runId (which I should), the lineage displays correctly. Thanks again!

      @@ -94213,7 +94219,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-05 05:35:02
      -

      *Thread Reply:* You can add https://openlineage.io/docs/spec/facets/dataset-facets/column_lineage_facet/ to your datasets.

      +

      *Thread Reply:* You can add https://openlineage.io/docs/spec/facets/dataset-facets/column_lineage_facet/ to your datasets.

      However, I don't think you can currently do any filtering over it

      MARQUEZAPIKEY=[YOURAPIKEY]
      2023-04-05 13:20:20
      -

      *Thread Reply:* you can see a good example here, @Lq Dodo: https://github.com/MarquezProject/marquez/blob/289fa3eef967c8f7915b074325bb6f8f55480030/docker/metadata.json#L430

      +

      *Thread Reply:* you can see a good example here, @Lq Dodo: https://github.com/MarquezProject/marquez/blob/289fa3eef967c8f7915b074325bb6f8f55480030/docker/metadata.json#L430

      @@ -94313,7 +94319,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-06 11:48:48
      -

*Thread Reply:* those examples really help. I can at least build the lineage with column level info using the apis. thanks a lot! Ideally I'd like to select one column from the UI and have it show me the column-level graph. Seems not possible.

      +

      *Thread Reply:* those examples really help. I can at least build the lineage with column level info using the apis. thanks a lot! Ideally I'd like select one column from the UI and then show me the column level graph. Seems not possible.

      @@ -94339,7 +94345,7 @@


      2023-04-06 12:46:54
      -

      *Thread Reply:* correct, right now there isn't column-level metadata on the lineage graph 😞

      +

      *Thread Reply:* correct, right now there isn't column-level metadata on the lineage graph 😞

      @@ -94393,7 +94399,7 @@


      2023-04-06 10:22:17
      -

      *Thread Reply:* something needs to trigger lineage collection, are you using some sort of scheduler / execution engine?

      +

      *Thread Reply:* something needs to trigger lineage collection, are you using some sort of scheduler / execution engine?

      @@ -94419,7 +94425,7 @@


      2023-04-06 11:26:13
      -

*Thread Reply:* Nope... We currently don't have a scheduling tool. Isn't it possible to use the OpenLineage API and collect the details?

+

*Thread Reply:* Nope... We currently don't have a scheduling tool. Isn't it possible to use the OpenLineage API and collect the details?

      @@ -94609,7 +94615,7 @@


      2023-04-07 12:50:14
      -

      *Thread Reply:* E.g. do I need to update my custom operators to inherit from DefaultExtractor?

      +

      *Thread Reply:* E.g. do I need to update my custom operators to inherit from DefaultExtractor?

      @@ -94635,7 +94641,7 @@


      2023-04-07 13:18:05
      -

      *Thread Reply:* FWIW, I can tell some level of connectivity to my Marquez deployment is working since I can see it created the default namespace I defined in my OPENLINEAGE_NAMESPACE env var.

      +

      *Thread Reply:* FWIW, I can tell some level of connectivity to my Marquez deployment is working since I can see it created the default namespace I defined in my OPENLINEAGE_NAMESPACE env var.

      @@ -94661,7 +94667,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-04-07 18:37:44
      -

      *Thread Reply:* hey John, it is enough to add the method to your custom operator. Perhaps something breaks inside the method. Did anything show up in the logs?

      +

      *Thread Reply:* hey John, it is enough to add the method to your custom operator. Perhaps something breaks inside the method. Did anything show up in the logs?
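
The method being discussed is the one openlineage-airflow's DefaultExtractor looks for on the operator itself. A rough sketch, assuming the openlineage-airflow interface of that era; the dataset names are illustrative:

```python
from airflow.models import BaseOperator
from openlineage.airflow.extractors.base import TaskMetadata
from openlineage.client.run import Dataset

class MyCustomOperator(BaseOperator):
    def execute(self, context):
        ...  # the operator's actual work

    # Picked up by DefaultExtractor without writing a separate extractor class.
    def get_openlineage_facets_on_start(self) -> TaskMetadata:
        return TaskMetadata(
            name=f"{self.dag_id}.{self.task_id}",
            inputs=[Dataset(namespace="my-db", name="schema.input_table")],
            outputs=[Dataset(namespace="my-db", name="schema.output_table")],
        )
```
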

      @@ -94687,7 +94693,7 @@


      2023-04-07 19:03:01
      -

      *Thread Reply:* That’s the strange part. I’m not seeing anything to suggest that the method is ever getting called. I’m also expecting that the listener created by the plugin should at least be calling this log line when the task runs. However, I’m not seeing that either. I’m able to verify the plugin is registered using airflow plugins and have debug level logging enabled via AIRFLOW__LOGGING__LOGGING_LEVEL='DEBUG'. This is the output of airflow plugins

      +

      *Thread Reply:* That’s the strange part. I’m not seeing anything to suggest that the method is ever getting called. I’m also expecting that the listener created by the plugin should at least be calling this log line when the task runs. However, I’m not seeing that either. I’m able to verify the plugin is registered using airflow plugins and have debug level logging enabled via AIRFLOW__LOGGING__LOGGING_LEVEL='DEBUG'. This is the output of airflow plugins

name | macros | listeners | source
==================+================================================+==============================+=================================================

@@ -94746,7 +94752,7 @@


      2023-04-11 13:09:05
      -

      *Thread Reply:* Figured this out. Just needed to run the airflow scheduler and trigger tasks through the DAGs vs. airflow tasks test …

      +

      *Thread Reply:* Figured this out. Just needed to run the airflow scheduler and trigger tasks through the DAGs vs. airflow tasks test …

      @@ -94800,7 +94806,7 @@


      2023-04-07 16:30:00
      -

      *Thread Reply:* (they are importing dagkit sdk stuff like Job, JobContext, ExecutionContext, and NodeContext.)

      +

      *Thread Reply:* (they are importing dagkit sdk stuff like Job, JobContext, ExecutionContext, and NodeContext.)

      @@ -94826,7 +94832,7 @@


      2023-04-07 18:40:39
      -

      *Thread Reply:* Do they run those scripts in PythonOperator? If so, they should receive some events but with no datasets extracted

      +

      *Thread Reply:* Do they run those scripts in PythonOperator? If so, they should receive some events but with no datasets extracted

      @@ -94852,7 +94858,7 @@


      2023-04-07 21:28:25
      -

      *Thread Reply:* How can I know that? Would it be in the scripts or the airflow configuration or...

      +

      *Thread Reply:* How can I know that? Would it be in the scripts or the airflow configuration or...

      @@ -94878,7 +94884,7 @@


      2023-04-08 07:13:56
      -

      *Thread Reply:* And "with no datasets extracted" that means I wouldn't have the schema of the input and output datasets? (I need the db/schema/table/column names for my purposes)

      +

      *Thread Reply:* And "with no datasets extracted" that means I wouldn't have the schema of the input and output datasets? (I need the db/schema/table/column names for my purposes)

      @@ -94904,7 +94910,7 @@


      2023-04-11 02:49:07
      -

*Thread Reply:* That really depends on the current code, but in general any custom code in Airflow does not extract any extra information, especially datasets. One can write their own extractors (more in the docs)

+

*Thread Reply:* That really depends on the current code, but in general any custom code in Airflow does not extract any extra information, especially datasets. One can write their own extractors (more in the docs)
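
As a sketch of what writing your own extractor can look like with openlineage-airflow (the operator class and table names here are hypothetical):

```python
from typing import List, Optional

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata
from openlineage.client.run import Dataset

class MyScriptExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        # Operators this extractor should handle.
        return ["MyScriptOperator"]

    def extract(self) -> Optional[TaskMetadata]:
        # Derive datasets from the operator's attributes however you can.
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            inputs=[Dataset(namespace="my-db", name="schema.source_table")],
            outputs=[Dataset(namespace="my-db", name="schema.target_table")],
        )
```
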

      openlineage.io
      @@ -94955,7 +94961,7 @@


      2023-04-12 16:52:04
      -

      *Thread Reply:* Thanks! This is very helpful. Exactly what I needed.

      +

      *Thread Reply:* Thanks! This is very helpful. Exactly what I needed.

      @@ -95011,7 +95017,7 @@


      2023-04-12 02:30:19
      -

*Thread Reply:* Currently there's no extractor implemented for MS-SQL. We try to keep the list of supported databases updated here: https://openlineage.io/docs/integrations/about/

+

*Thread Reply:* Currently there's no extractor implemented for MS-SQL. We try to keep the list of supported databases updated here: https://openlineage.io/docs/integrations/about/

      openlineage.io
      @@ -95297,7 +95303,7 @@


      2023-04-11 17:05:35
      -

*Thread Reply:* > The snowflake.connector import connect package seems to be outdated here in extract_openlineage.py and is not working for airflow.
What's the error?

+

*Thread Reply:* > The snowflake.connector import connect package seems to be outdated here in extract_openlineage.py and is not working for airflow.
What's the error?

> Does anyone know how to rewrite this code (e.g., with SnowflakeOperator )

@@ -95429,7 +95435,7 @@


      2023-04-11 18:13:49
      -

*Thread Reply:* Hi Maciej! Thank you so much for the reply! I managed to generate a working combination on Windows between the airflow example in the marquez git and the snowflake openlineage git. The only error I still get is:

+

*Thread Reply:* Hi Maciej! Thank you so much for the reply! I managed to generate a working combination on Windows between the airflow example in the marquez git and the snowflake openlineage git. The only error I still get is:
****** Log file does not exist: /opt/bitnami/airflow/logs/dag_id=etl_openlineage/run_id=manual__2023-04-10T14:12:53.764783+00:00/task_id=send_ol_events/attempt=1.log
****** Fetching from: <http://1c8bb4a78f14:8793/log/dag_id=etl_openlineage/run_id=manual__2023-04-10T14:12:53.764783+00:00/task_id=send_ol_events/attempt=1.log>
****** !!!! Please make sure that all your Airflow components (e.g. schedulers, webservers and workers) have the same 'secret_key' configured in 'webserver' section and time is synchronized on all your machines (for example with ntpd) !!!!!

@@ -95462,7 +95468,7 @@


      2023-04-12 05:35:41
      -

      *Thread Reply:* I think it's Airflow error related to getting logs from worker

      +

      *Thread Reply:* I think it's Airflow error related to getting logs from worker

      @@ -95488,7 +95494,7 @@


      2023-04-12 05:36:07
      -

      *Thread Reply:* snowflake.connector is a Snowflake connector library that SnowflakeOperator uses underneath to connect to Snowflake

      +

      *Thread Reply:* snowflake.connector is a Snowflake connector library that SnowflakeOperator uses underneath to connect to Snowflake

      @@ -95514,7 +95520,7 @@


      2023-04-12 10:15:21
      -

      *Thread Reply:* Ah alright! Thanks for pointing that out! 🙂 Do you know how to solve it? Or do you have any recommendations on how to look for the solution?

      +

      *Thread Reply:* Ah alright! Thanks for pointing that out! 🙂 Do you know how to solve it? Or do you have any recommendations on how to look for the solution?

      @@ -95540,7 +95546,7 @@


      2023-04-12 10:19:53
      -

      *Thread Reply:* I have no experience with Windows, and I think it's the issue: https://github.com/apache/airflow/issues/10388

      +

      *Thread Reply:* I have no experience with Windows, and I think it's the issue: https://github.com/apache/airflow/issues/10388

      I would try running it in Docker TBH

      2023-04-12 11:47:41
      -

*Thread Reply:* Yeah, I was running Airflow in Docker but this didn't work. I'll try to use my Macbook for now because I don't think there is a solution for this in the short term. Thank you so much for the support though!!

+

*Thread Reply:* Yeah, I was running Airflow in Docker but this didn't work. I'll try to use my Macbook for now because I don't think there is a solution for this in the short term. Thank you so much for the support though!!

      @@ -95729,7 +95735,7 @@


      2023-04-13 11:19:57
      -

      *Thread Reply:* Very interesting!

      +

      *Thread Reply:* Very interesting!

      @@ -95755,7 +95761,7 @@


      2023-04-13 13:28:53
      -

      *Thread Reply:* that’s awesome 🙂

      +

      *Thread Reply:* that’s awesome 🙂

      @@ -95807,7 +95813,7 @@


      2023-04-13 08:36:01
      -

*Thread Reply:* Thanks Ernie, I’m looking at Airflow as well as GE and would like to contribute back to the project as well… we’re close to getting a public preview release of our product done and then we want to help build out OpenLineage

+

*Thread Reply:* Thanks Ernie, I’m looking at Airflow as well as GE and would like to contribute back to the project as well… we’re close to getting a public preview release of our product done and then we want to help build out OpenLineage

      @@ -95863,7 +95869,7 @@


      2023-04-13 14:10:08
      -

*Thread Reply:* The dag is basically just:

+

*Thread Reply:* The dag is basically just:
```dag = DAG(
    dag_id="asana_example_dag",
    default_args=default_args,

@@ -95902,7 +95908,7 @@


      2023-04-13 14:11:02
      -

*Thread Reply:* This is the error I’m getting, seems to be coming from this line:

+

*Thread Reply:* This is the error I’m getting, seems to be coming from this line:
[2023-04-13, 17:45:33 UTC] {logging_mixin.py:115} WARNING - Exception in thread Thread-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/threading.py", line 926, in _bootstrap_inner

@@ -96024,7 +96030,7 @@


      2023-04-13 14:12:11
      -

      *Thread Reply:* This is with Airflow 2.3.2 and openlineage-airflow 0.22.0

      +

      *Thread Reply:* This is with Airflow 2.3.2 and openlineage-airflow 0.22.0

      @@ -96050,7 +96056,7 @@


      2023-04-13 14:13:34
      -

      *Thread Reply:* Seems like it might be some issue like this with a circular structure? https://stackoverflow.com/questions/46283738/attributeerror-when-using-python-deepcopy

      +

      *Thread Reply:* Seems like it might be some issue like this with a circular structure? https://stackoverflow.com/questions/46283738/attributeerror-when-using-python-deepcopy

      Stack Overflow
      @@ -96101,7 +96107,7 @@


      2023-04-14 08:44:36
      -

*Thread Reply:* Just by a quick look at it, it will definitely be fixed with Airflow 2.6, as it won't need to deepcopy anything.

+

*Thread Reply:* Just by a quick look at it, it will definitely be fixed with Airflow 2.6, as it won't need to deepcopy anything.

      @@ -96131,7 +96137,7 @@


      2023-04-14 08:47:16
      -

*Thread Reply:* I can't seem to reproduce the issue. I ran the following example DAG with the same Airflow and OL versions as yours:

+

*Thread Reply:* I can't seem to reproduce the issue. I ran the following example DAG with the same Airflow and OL versions as yours:
```import datetime

from airflow.lineage.entities import Table

@@ -96180,7 +96186,7 @@


      2023-04-14 08:53:48
      -

      *Thread Reply:* is there any extra configuration you made possibly?

      +

      *Thread Reply:* is there any extra configuration you made possibly?

      @@ -96206,7 +96212,7 @@


      2023-04-14 13:02:40
      -

*Thread Reply:* @John Lukenoff, I was finally able to reproduce this when passing xcom as task.output

+

*Thread Reply:* @John Lukenoff, I was finally able to reproduce this when passing xcom as task.output
Looks like this was reported here and solved by this PR (not sure if this was released in 2.3.3 or later)

      @@ -96281,7 +96287,7 @@


      2023-04-14 13:06:59
      -

      *Thread Reply:* Ah interesting. Let me see if bumping my Airflow version resolves this. Haven’t had a chance to tinker with it much since yesterday.

      +

      *Thread Reply:* Ah interesting. Let me see if bumping my Airflow version resolves this. Haven’t had a chance to tinker with it much since yesterday.

      @@ -96307,7 +96313,7 @@


      2023-04-14 13:13:21
      -

      *Thread Reply:* I ran it against 2.4 and same dag works

      +

      *Thread Reply:* I ran it against 2.4 and same dag works

      @@ -96333,7 +96339,7 @@


      2023-04-14 13:15:35
      -

      *Thread Reply:* 👍 Looks like a fix for that issue was rolled out in 2.3.3. I’m gonna try that for now (my company has a notoriously difficult time with airflow major version updates 😅)

      +

      *Thread Reply:* 👍 Looks like a fix for that issue was rolled out in 2.3.3. I’m gonna try that for now (my company has a notoriously difficult time with airflow major version updates 😅)

      @@ -96359,7 +96365,7 @@


      2023-04-14 13:17:06
      -

      *Thread Reply:* 👍

      +

      *Thread Reply:* 👍

      @@ -96385,7 +96391,7 @@


      2023-04-17 12:29:09
      -

      *Thread Reply:* Got this working! We just monkey patched the __deepcopy__ method of the BaseOperator for now until we can get bandwidth for an airflow upgrade. Thanks for the help here!

      +

      *Thread Reply:* Got this working! We just monkey patched the __deepcopy__ method of the BaseOperator for now until we can get bandwidth for an airflow upgrade. Thanks for the help here!
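
The actual patch isn't shown in the thread, but a monkey patch along those lines could look roughly like this sketch, assuming the failure happens inside BaseOperator.__deepcopy__:

```python
import copy

from airflow.models import BaseOperator

_original_deepcopy = BaseOperator.__deepcopy__

def _patched_deepcopy(self, memo):
    try:
        return _original_deepcopy(self, memo)
    except AttributeError:
        # Fall back to a shallow copy when deepcopy trips over unpicklable
        # attributes (the circular-structure error discussed above).
        return copy.copy(self)

BaseOperator.__deepcopy__ = _patched_deepcopy
```
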

      @@ -96458,7 +96464,7 @@


      2023-04-17 03:56:30
      -

*Thread Reply:* This is the spark submit command:

+

*Thread Reply:* This is the spark submit command:
spark-submit --py-files /usr/local/lib/common_utils.zip,/usr/local/lib/team_utils.zip,/usr/local/lib/project_utils.zip
  --conf spark.executor.cores=16
  --conf spark.hadoop.fs.s3a.connection.maximum=100
  --conf spark.sql.shuffle.partitions=1000

@@ -96489,7 +96495,7 @@


      2023-04-17 04:09:57
      -

*Thread Reply:* @Anirudh Shrinivason pls create an issue for this and I will look at it. Although it may be difficult to find the root cause, a null pointer exception should always be avoided, and this seems to be a bug.

+

*Thread Reply:* @Anirudh Shrinivason pls create an issue for this and I will look at it. Although it may be difficult to find the root cause, a null pointer exception should always be avoided, and this seems to be a bug.

      @@ -96515,7 +96521,7 @@


      2023-04-17 04:14:41
      -

      *Thread Reply:* Hmm yeah sure. I'll create an issue on github for this issue. Thanks!

      +

      *Thread Reply:* Hmm yeah sure. I'll create an issue on github for this issue. Thanks!

      @@ -96541,7 +96547,7 @@


      2023-04-17 05:13:54
      -

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1784

+

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1784
Opened an issue here

      @@ -96626,7 +96632,7 @@


      2023-04-18 02:25:04
      -

*Thread Reply:* Hi @Allison Suarez, CustomColumnLineageVisitor should definitely be public. I'll prepare a fix PR for that. We do have a test for custom column lineage visitors (CustomColumnLineageVisitorTestImpl), but they're in the same package. Thanks for bringing this up.

+

*Thread Reply:* Hi @Allison Suarez, CustomColumnLineageVisitor should definitely be public. I'll prepare a fix PR for that. We do have a test for custom column lineage visitors (CustomColumnLineageVisitorTestImpl), but they're in the same package. Thanks for bringing this up.

      @@ -96652,7 +96658,7 @@


      2023-04-18 03:07:11
      -

*Thread Reply:* This PR should resolve the problem:

+

*Thread Reply:* This PR should resolve the problem:
https://github.com/OpenLineage/OpenLineage/pull/1788

      @@ -96740,7 +96746,7 @@


      2023-04-18 13:34:43
      -

      *Thread Reply:* Thank you so much @Paweł Leszczyński 🙂

      +

      *Thread Reply:* Thank you so much @Paweł Leszczyński 🙂

      @@ -96766,7 +96772,7 @@


      2023-04-18 13:35:46
      -

      *Thread Reply:* How does the release process work for OL? Do we have to wait a certain amount of time to get this change in a new release?

      +

      *Thread Reply:* How does the release process work for OL? Do we have to wait a certain amount of time to get this change in a new release?

      @@ -96792,7 +96798,7 @@


      2023-04-18 17:34:29
      -

      *Thread Reply:* @Maciej Obuchowski ^

      +

      *Thread Reply:* @Maciej Obuchowski ^

      @@ -96818,7 +96824,7 @@


      2023-04-19 01:49:33
      -

*Thread Reply:* 0.22.0 was released two weeks ago, so the next scheduled release should be in the next two weeks. We can ask @Michael Robinson his opinion on releasing 0.22.1 before that.

+

*Thread Reply:* 0.22.0 was released two weeks ago, so the next scheduled release should be in the next two weeks. We can ask @Michael Robinson his opinion on releasing 0.22.1 before that.

      @@ -96844,7 +96850,7 @@


      2023-04-19 09:08:58
      -

*Thread Reply:* Hi Allison 👋,

+

*Thread Reply:* Hi Allison 👋,
Anyone can request a release in the #general channel. I encourage you to go this route. You’ll need three +1s (there’s more info about the process here: https://github.com/OpenLineage/OpenLineage/blob/main/GOVERNANCE.md), but I don’t know of any reasons why we can’t do a mid-cycle release. 🙂

      @@ -96875,7 +96881,7 @@


      2023-04-19 16:23:20
      -

      *Thread Reply:* seems like we got enough +1s

      +

      *Thread Reply:* seems like we got enough +1s

      @@ -96901,7 +96907,7 @@


      2023-04-19 16:24:33
      -

      *Thread Reply:* We need three committers to give a +1. I’ll reach out again to see if I can recruit a third

      +

      *Thread Reply:* We need three committers to give a +1. I’ll reach out again to see if I can recruit a third

      @@ -96931,7 +96937,7 @@


      2023-04-19 16:24:55
      -

      *Thread Reply:* oooh

      +

      *Thread Reply:* oooh

      @@ -96957,7 +96963,7 @@


      2023-04-19 16:32:47
      -

      *Thread Reply:* Yeah, sorry I forgot to mention that!

      +

      *Thread Reply:* Yeah, sorry I forgot to mention that!

      @@ -96983,7 +96989,7 @@


      2023-04-20 05:02:46
      -

      *Thread Reply:* we have it now

      +

      *Thread Reply:* we have it now

      @@ -97138,7 +97144,7 @@


      2023-04-20 09:46:06
      -

      *Thread Reply:* The release is authorized and will be initiated within 2 business days (not including tomorrow).

      +

      *Thread Reply:* The release is authorized and will be initiated within 2 business days (not including tomorrow).

      @@ -97246,7 +97252,7 @@


      2023-04-20 08:57:45
      -

      *Thread Reply:* Hi Sai,

      +

      *Thread Reply:* Hi Sai,

Perhaps you could try printing OpenLineage events into the logs. This can be achieved with the Spark config parameter spark.openlineage.transport.type.

@@ -97278,7 +97284,7 @@


      2023-04-20 09:18:53
      -

      *Thread Reply:* Hi @Paweł Leszczyński I passed this config as below, but could not see any changes in the logs. The events are getting generated sometimes like below:

      +

      *Thread Reply:* Hi @Paweł Leszczyński I passed this config as below, but could not see any changes in the logs. The events are getting generated sometimes like below:

      23/04/20 10:00:15 INFO ConsoleTransport: {"eventType":"START","eventTime":"2023-04-20T10:00:15.085Z","run":{"runId":"ef4f46d1-d13a-420a-87c3-19fbf6ffa231","facets":{"spark.logicalPlan":{"producer":"https://github.com/OpenLineage/OpenLineage/tree/0.22.0/integration/spark","schemaURL":"https://openlineage.io/spec/1-0-5/OpenLineage.json#/$defs/RunFacet","plan":[{"class":"org.apache.spark.sql.catalyst.plans.logical.CreateTableAsSelect","num-children":2,"name":0,"partitioning":[],"query":1,"tableSpec":null,"writeOptions":null,"ignoreIfExists":false},{"class":"org.apache.spark.sql.catalyst.analysis.ResolvedTableName","num-children":0,"catalog":null,"ident":null},{"class":"org.apache.spark.sql.catalyst.plans.logical.Project","num-children":1,"projectList":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num_children":0,"name":"workorderid","dataType":"integer","nullable":true,"metadata":{},"exprId":{"product-cl

      @@ -97315,7 +97321,7 @@


      2023-04-20 09:19:37
      -

      *Thread Reply:* Ok, great. This means the issue is related to Spark <-> Marquez connection

      +

      *Thread Reply:* Ok, great. This means the issue is related to Spark <-> Marquez connection

      @@ -97341,7 +97347,7 @@


      2023-04-20 09:20:33
      -

      *Thread Reply:* Some time ago Spark config has changed and here is the up-to-date-documentation: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark

      +

      *Thread Reply:* Some time ago Spark config has changed and here is the up-to-date-documentation: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark

      @@ -97367,7 +97373,7 @@


      2023-04-20 09:21:10
      -

*Thread Reply:* please note that spark.openlineage.transport.url has to be used, which is different from what you have in the attached screenshot

+

*Thread Reply:* please note that spark.openlineage.transport.url has to be used, which is different from what you have in the attached screenshot

      @@ -97393,7 +97399,7 @@


      2023-04-20 09:22:40
      -

      *Thread Reply:* You mean instead of "spark.openlineage.host" I need to use "spark.openlineage.transport.url"?

      +

      *Thread Reply:* You mean instead of "spark.openlineage.host" I need to use "spark.openlineage.transport.url"?

      @@ -97419,7 +97425,7 @@


      2023-04-20 09:23:04
      -

      *Thread Reply:* yes, please give it a try

      +

      *Thread Reply:* yes, please give it a try

      @@ -97445,7 +97451,7 @@


      2023-04-20 09:23:40
      -

*Thread Reply:* sure, I'll give it a try and let you know the outcome

+

*Thread Reply:* sure, I'll give it a try and let you know the outcome

      @@ -97471,7 +97477,7 @@


      2023-04-20 09:23:48
      -

      *Thread Reply:* and set spark.openlineage.transport.type to http

      +

      *Thread Reply:* and set spark.openlineage.transport.type to http
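
Put together, a minimal PySpark session wired up this way might look like the following sketch; the package version, URL, and namespace are placeholders:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("openlineage-demo")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.22.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")  # or "console" to only log events
    .config("spark.openlineage.transport.url", "http://localhost:5000")
    .config("spark.openlineage.namespace", "my-namespace")
    .getOrCreate()
)
```
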

      @@ -97497,7 +97503,7 @@


      2023-04-20 09:24:04
      -

      *Thread Reply:* okay

      +

      *Thread Reply:* okay

      @@ -97523,7 +97529,7 @@


      2023-04-20 09:26:42
      -

*Thread Reply:* do these configs suffice, or do I need to add anything else?

+

*Thread Reply:* do these configs suffice, or do I need to add anything else?

spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.consoleTransport true

@@ -97555,7 +97561,7 @@


      2023-04-20 09:27:07
      -

      *Thread Reply:* spark.openlineage.consoleTransport true this one can be removed

      +

      *Thread Reply:* spark.openlineage.consoleTransport true this one can be removed

      @@ -97581,7 +97587,7 @@


      2023-04-20 09:27:33
      -

      *Thread Reply:* otherwise shall be OK

      +

      *Thread Reply:* otherwise shall be OK

      @@ -97607,7 +97613,7 @@


      2023-04-20 10:01:30
      -

*Thread Reply:* I added these configs and ran it, but still the same issue. Now I am not able to see the events in the log file either.

+

*Thread Reply:* I added these configs and ran it, but still the same issue. Now I am not able to see the events in the log file either.

      @@ -97633,7 +97639,7 @@


      2023-04-20 10:04:27
      -

*Thread Reply:* 23/04/20 13:51:22 INFO SparkSQLExecutionContext: OpenLineage received Spark event that is configured to be skipped: SparkListenerSQLExecutionStart

+

*Thread Reply:* 23/04/20 13:51:22 INFO SparkSQLExecutionContext: OpenLineage received Spark event that is configured to be skipped: SparkListenerSQLExecutionStart
23/04/20 13:51:22 INFO SparkSQLExecutionContext: OpenLineage received Spark event that is configured to be skipped: SparkListenerSQLExecutionEnd

Does this need any changes on the config side?

      @@ -97773,7 +97779,7 @@


      2023-04-24 06:49:43
      -

*Thread Reply:* @Anirudh Shrinivason Thanks for linking this as it contains a clear explanation on Spark catalogs. However, I am still unable to write a failing integration test that reproduces the scenario. Could you provide an example of Spark code which is failing on V2SessionCatalog, and more details on how you are trying to read/write data?

+

*Thread Reply:* @Anirudh Shrinivason Thanks for linking this as it contains a clear explanation on Spark catalogs. However, I am still unable to write a failing integration test that reproduces the scenario. Could you provide an example of Spark code which is failing on V2SessionCatalog, and more details on how you are trying to read/write data?

      @@ -97799,7 +97805,7 @@


      2023-04-24 07:14:04
      -

*Thread Reply:* Hi @Paweł Leszczyński I noticed this issue on one of our pipelines before actually. I didn't note down which pipeline the issue was occurring in, unfortunately. I'll keep checking from my end to identify the spark job that ran into this error. In the meantime, I'll also try to see for which cases deltaCatalog makes use of the V2SessionCatalog to understand this better. Thanks!

+

*Thread Reply:* Hi @Paweł Leszczyński I noticed this issue on one of our pipelines before actually. I didn't note down which pipeline the issue was occurring in, unfortunately. I'll keep checking from my end to identify the spark job that ran into this error. In the meantime, I'll also try to see for which cases deltaCatalog makes use of the V2SessionCatalog to understand this better. Thanks!

      @@ -97825,7 +97831,7 @@


      2023-04-26 03:44:15
      -

*Thread Reply:* Hi @Paweł Leszczyński

+

*Thread Reply:* Hi @Paweł Leszczyński
'''
CREATE TABLE IF NOT EXISTS TABLE_NAME (
    SOME COLUMNS

@@ -97863,7 +97869,7 @@


      2023-04-26 03:44:48
      -

      *Thread Reply:* Thanks @Anirudh Shrinivason, will look into that.

      +

      *Thread Reply:* Thanks @Anirudh Shrinivason, will look into that.

      @@ -97889,7 +97895,7 @@


      2023-04-26 05:06:05
      -

      *Thread Reply:* which spark & delta versions are you using?

      +

      *Thread Reply:* which spark & delta versions are you using?

      @@ -97915,7 +97921,7 @@


      2023-04-27 02:35:50
      -

*Thread Reply:* I am not 100% sure if this is something you described, but this was an error I was able to replicate and fix. Please look at the exception stacktrace and let me know if it is the same on your side.

+

*Thread Reply:* I am not 100% sure if this is something you described, but this was an error I was able to replicate and fix. Please look at the exception stacktrace and let me know if it is the same on your side.
https://github.com/OpenLineage/OpenLineage/pull/1798

      @@ -98037,7 +98043,7 @@


      2023-04-27 02:36:20
      -

      *Thread Reply:* Hi

      +

      *Thread Reply:* Hi

      @@ -98063,7 +98069,7 @@


      2023-04-27 02:36:45
      -

      *Thread Reply:* Hmm actually I am noticing this error on my local

      +

      *Thread Reply:* Hmm actually I am noticing this error on my local

      @@ -98089,7 +98095,7 @@


      2023-04-27 02:37:01
      -

      *Thread Reply:* But on the prod job, I am seeing no such error in the logs...

      +

      *Thread Reply:* But on the prod job, I am seeing no such error in the logs...

      @@ -98115,7 +98121,7 @@


      2023-04-27 02:37:28
      -

      *Thread Reply:* Also, I was using spark 3.1.2

      +

      *Thread Reply:* Also, I was using spark 3.1.2

      @@ -98145,7 +98151,7 @@


      2023-04-27 02:37:39
      -

      *Thread Reply:* then perhaps it's sth different :face_palm: will try to replicate on spark 3.1.2

      +

      *Thread Reply:* then perhaps it's sth different :face_palm: will try to replicate on spark 3.1.2

      @@ -98175,7 +98181,7 @@


      2023-04-27 02:37:42
      -

      *Thread Reply:* Not too sure which delta version the prod job was using...

      +

      *Thread Reply:* Not too sure which delta version the prod job was using...

      @@ -98201,7 +98207,7 @@


      2023-04-28 03:30:49
      -

*Thread Reply:* I was running on Spark 3.1.2 the following command:

+

*Thread Reply:* I was running on Spark 3.1.2 the following command:
spark.sql(
    "CREATE TABLE t_partitioned (a int, b int) USING delta " +
    "PARTITIONED BY (a) LOCATION '/tmp/delta/tbl'"
)

@@ -98236,7 +98242,7 @@


      2023-04-28 03:31:47
      -

      *Thread Reply:* Oh... hmm... that is strange. Let me check more from my end too

      +

      *Thread Reply:* Oh... hmm... that is strange. Let me check more from my end too

      @@ -98262,7 +98268,7 @@


      2023-04-28 03:33:01
      -

      *Thread Reply:* for spark 3.1, we're using delta 1.0.0

      +

      *Thread Reply:* for spark 3.1, we're using delta 1.0.0

      @@ -98320,7 +98326,7 @@


      2023-04-24 05:38:09
      -

*Thread Reply:* Hi Cory, I think the definition of ParentRunFacet (https://openlineage.io/docs/spec/facets/run-facets/parent_run) contains the answer to that:

+

*Thread Reply:* Hi Cory, I think the definition of ParentRunFacet (https://openlineage.io/docs/spec/facets/run-facets/parent_run) contains the answer to that:
> Commonly, scheduler systems like Apache Airflow will trigger processes on remote systems, such as on Apache Spark or Apache Beam jobs. Those systems might have their own OpenLineage integration and report their own job runs and dataset inputs/outputs. The ParentRunFacet allows those downstream jobs to report which jobs spawned them to preserve job hierarchy. To do that, the scheduler system should have a way to pass its own job and run id to the child job.
For example, when airflow is used to run a Spark job, we want Spark events to contain some information on what triggered the Spark job, and the parameters you ask about are used to pass that information from the airflow operator to the spark job.
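
Concretely, the child job's events carry something like the following under run.facets. A sketch only; the ids and names are made up:

```python
parent_run_facet = {
    "parent": {
        "_producer": "https://example.com/my-scheduler",
        "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/ParentRunFacet.json",
        # The *scheduler's* run and job, not the child's:
        "run": {"runId": "d9cb8e0b-7d3c-4d9a-97e1-2f27d61d1f5b"},
        "job": {"namespace": "airflow", "name": "my_dag.my_task"},
    }
}
```
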

      2023-04-26 17:28:39
      -

      *Thread Reply:* Thank you for pointing me at this documentation; I did not see it previously. In my setup, the calling system is AWS Step Functions, which have no integration with OpenLineage.

      +

      *Thread Reply:* Thank you for pointing me at this documentation; I did not see it previously. In my setup, the calling system is AWS Step Functions, which have no integration with OpenLineage.

      So I've been essentially passing non-existing parent job information to OpenLineage. It has been useful as a data point for searches and reporting though.

      @@ -98399,7 +98405,7 @@


      2023-04-27 04:59:39
      -

      *Thread Reply:* I think parentRunId should be the same for Openlineage START and COMPLETE event. Is it like this in your case?

      +

      *Thread Reply:* I think parentRunId should be the same for Openlineage START and COMPLETE event. Is it like this in your case?

      @@ -98425,7 +98431,7 @@


      2023-05-03 11:13:58
      -

*Thread Reply:* that makes sense, and based on my configuration, I would think that it would be. However, given that I am seeing incomplete jobs in Marquez, I'm wondering if somehow the parentRunId is changing. I need to investigate

+

*Thread Reply:* that makes sense, and based on my configuration, I would think that it would be. However, given that I am seeing incomplete jobs in Marquez, I'm wondering if somehow the parentRunId is changing. I need to investigate

      @@ -98525,7 +98531,7 @@


      2023-04-21 09:06:06
      -

      *Thread Reply:* I think @Michael Robinson has to manually promote artifacts

      +

      *Thread Reply:* I think @Michael Robinson has to manually promote artifacts

      @@ -98551,7 +98557,7 @@


      2023-04-21 09:08:06
      -

      *Thread Reply:* I promoted the artifacts, but there is a delay before they appear in Maven. A couple releases ago, the delay was about 24 hours long

      +

      *Thread Reply:* I promoted the artifacts, but there is a delay before they appear in Maven. A couple releases ago, the delay was about 24 hours long

      @@ -98577,7 +98583,7 @@


      2023-04-21 09:26:09
      -

      *Thread Reply:* Ahh I see... Thanks!

      +

      *Thread Reply:* Ahh I see... Thanks!

      @@ -98603,7 +98609,7 @@


      2023-04-21 10:10:38
      -

      *Thread Reply:* @Anirudh Shrinivason are you using search.maven.org by chance? Version 0.23.0 is not appearing there yet, but I do see it on central.sonatype.com.

      +

      *Thread Reply:* @Anirudh Shrinivason are you using search.maven.org by chance? Version 0.23.0 is not appearing there yet, but I do see it on central.sonatype.com.

      @@ -98629,7 +98635,7 @@


      2023-04-21 10:15:00
      -

      *Thread Reply:* Hmm I can see it now on search.maven.org actually. But I still cannot see it on https://mvnrepository.com/artifact/io.openlineage/openlineage-spark ...

      +

      *Thread Reply:* Hmm I can see it now on search.maven.org actually. But I still cannot see it on https://mvnrepository.com/artifact/io.openlineage/openlineage-spark ...

      @@ -98655,7 +98661,7 @@


      2023-04-21 10:19:38
      -

      *Thread Reply:* Understood. I believe you can download the 0.23.0 jars from central.sonatype.com. For Spark, try going here: https://central.sonatype.com/artifact/io.openlineage/openlineage-spark/0.23.0/versions

      +

      *Thread Reply:* Understood. I believe you can download the 0.23.0 jars from central.sonatype.com. For Spark, try going here: https://central.sonatype.com/artifact/io.openlineage/openlineage-spark/0.23.0/versions

      Maven Central
      @@ -98702,7 +98708,7 @@


      2023-04-22 06:11:10
      -

*Thread Reply:* Yup. I can see it on all maven repos now haha. I think it's just the delay.

+

*Thread Reply:* Yup. I can see it on all maven repos now haha. I think it's just the delay.

      @@ -98728,7 +98734,7 @@


      2023-04-22 06:11:18
      -

      *Thread Reply:* ~24 hours ig

      +

      *Thread Reply:* ~24 hours ig

      @@ -98754,7 +98760,7 @@


      2023-04-24 16:49:15
      -

      *Thread Reply:* 👍

      +

      *Thread Reply:* 👍

      @@ -98806,7 +98812,7 @@


      2023-04-21 09:28:18
      -

      *Thread Reply:* Good news, everyone! The login worked on the second attempt after starting the Docker containers. Although it's unclear why it failed the first time.

      +

      *Thread Reply:* Good news, everyone! The login worked on the second attempt after starting the Docker containers. Although it's unclear why it failed the first time.

      @@ -98873,7 +98879,7 @@


      2023-04-24 02:32:51
      -

*Thread Reply:* Hi Natalie, it's wonderful to hear you're planning to contribute. Yes, you're right about TransportFactory. What other transport type did you have in mind? If it is something generic, then it is surely OK to include it within TransportFactory. If it is a custom feature, we could follow the ServiceLoader pattern that we're using to allow including custom plan visitors and dataset builders.

+

*Thread Reply:* Hi Natalie, it's wonderful to hear you're planning to contribute. Yes, you're right about TransportFactory. What other transport type did you have in mind? If it is something generic, then it is surely OK to include it within TransportFactory. If it is a custom feature, we could follow the ServiceLoader pattern that we're using to allow including custom plan visitors and dataset builders.

      @@ -98899,7 +98905,7 @@


      2023-04-24 02:54:40
      -

*Thread Reply:* Hi @Paweł Leszczyński

+

*Thread Reply:* Hi @Paweł Leszczyński
Yes, I was planning to change TransportFactory to support custom/generic transport types using the ServiceLoader pattern. After this change is done, I will be able to use our custom OpenMetadataTransport without changing anything in OpenLineage core. For now I don't have other types in mind, but once we add the customization support anyone will be able to create their own transport type and report the lineage to different backends
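
The change described here is in the Java client, but the idea carries over to the Python client as well: a hypothetical custom transport is just a Transport subclass handed to the client. A sketch, with the OpenMetadata details elided:

```python
from openlineage.client import OpenLineageClient
from openlineage.client.transport import Transport

class OpenMetadataTransport(Transport):
    """Hypothetical transport forwarding OpenLineage events to OpenMetadata."""

    def __init__(self, base_url: str):
        self.base_url = base_url

    def emit(self, event):
        # Translate `event` and POST it to the OpenMetadata lineage endpoint.
        ...

client = OpenLineageClient(transport=OpenMetadataTransport("http://openmetadata:8585"))
```
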

      @@ -98930,7 +98936,7 @@


      2023-04-24 03:28:30
      -

      *Thread Reply:* Perhaps it's not strictly related to this particular usecase, but you may also find interesting our recent PoC about Fluentd & Openlineage integration. This will bring some cool backend features like: copy event and send it to multiple backends, send it to backends supported by fluentd output plugins etc. https://github.com/OpenLineage/OpenLineage/pull/1757/files?short_path=4fc5534#diff-4fc55343748f353fa1def0e00c553caa735f9adcb0da18baad50a989c0f2e935

      +

      *Thread Reply:* Perhaps it's not strictly related to this particular usecase, but you may also find interesting our recent PoC about Fluentd & Openlineage integration. This will bring some cool backend features like: copy event and send it to multiple backends, send it to backends supported by fluentd output plugins etc. https://github.com/OpenLineage/OpenLineage/pull/1757/files?short_path=4fc5534#diff-4fc55343748f353fa1def0e00c553caa735f9adcb0da18baad50a989c0f2e935

      @@ -98956,7 +98962,7 @@


      2023-04-24 05:36:24
      -

      *Thread Reply:* Sounds interesting. Thanks, I will look into it

      +

      *Thread Reply:* Sounds interesting. Thanks, I will look into it

      @@ -99068,7 +99074,7 @@


      2023-04-25 08:07:36
      -

      *Thread Reply:* What's the extract_openlineage.py file? Looks like your code?

      +

      *Thread Reply:* What's the extract_openlineage.py file? Looks like your code?

      @@ -99103,7 +99109,7 @@


      2023-04-25 08:43:04
      -

*Thread Reply:* import json

+

*Thread Reply:* import json
import os
from pendulum import datetime

      @@ -99204,7 +99210,7 @@


      2023-04-25 09:52:33
      -

      *Thread Reply:* OpenLineageClient expects RunEvent classes and you're sending it raw json. I think at this point your options are either sending them by constructing your own HTTP client, using something like requests, or using something like https://github.com/python-attrs/cattrs to structure json to RunEvent

      +

      *Thread Reply:* OpenLineageClient expects RunEvent classes and you're sending it raw json. I think at this point your options are either sending them by constructing your own HTTP client, using something like requests, or using something like https://github.com/python-attrs/cattrs to structure json to RunEvent
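
The cattrs route could look roughly like this sketch; the file path is hypothetical, and nested facet fields may need extra structuring hooks:

```python
import json

import cattrs
from openlineage.client import OpenLineageClient
from openlineage.client.run import RunEvent

client = OpenLineageClient(url="http://localhost:5000")

# Hypothetical: one raw event, e.g. a row fetched from Snowflake's view.
raw = json.loads(open("event.json").read())
event = cattrs.structure(raw, RunEvent)  # dict -> RunEvent
client.emit(event)
```
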

      @@ -99259,7 +99265,7 @@


      2023-04-25 10:05:57
      -

*Thread Reply:* @Jakub Dardziński suggested that you can

+

*Thread Reply:* @Jakub Dardziński suggested that you can
change client.emit(ol_event) to client.transport.emit(ol_event) and it should work

      @@ -99290,7 +99296,7 @@


      2023-04-25 12:24:08
      -

      *Thread Reply:* @Maciej Obuchowski I believe this is from https://github.com/Snowflake-Labs/OpenLineage-AccessHistory-Setup/blob/main/examples/airflow/dags/lineage/extract_openlineage.py

      +

      *Thread Reply:* @Maciej Obuchowski I believe this is from https://github.com/Snowflake-Labs/OpenLineage-AccessHistory-Setup/blob/main/examples/airflow/dags/lineage/extract_openlineage.py

      @@ -99418,7 +99424,7 @@


      2023-04-25 12:25:26
      -

      *Thread Reply:* I believe this example no longer works - perhaps a new access history pull/push example could be created that is simpler and doesn’t use airflow.

      +

      *Thread Reply:* I believe this example no longer works - perhaps a new access history pull/push example could be created that is simpler and doesn’t use airflow.

      @@ -99448,7 +99454,7 @@


      2023-04-26 08:34:02
      -

      *Thread Reply:* I think separating the actual getting data from the view and Airflow DAG would make sense

      +

      *Thread Reply:* I think separating the actual getting data from the view and Airflow DAG would make sense

      @@ -99474,7 +99480,7 @@


      2023-04-26 13:57:34
      -

      *Thread Reply:* Yeah - I also think that Airflow confuses the issue. You don’t need Airflow to get lineage from Snowflake Access History, the only reason Airflow is in the example is a) to simulate a pipeline that can be viewed in Marquez; b) to establish a mechanism that regularly pulls and emits lineage…

      +

      *Thread Reply:* Yeah - I also think that Airflow confuses the issue. You don’t need Airflow to get lineage from Snowflake Access History, the only reason Airflow is in the example is a) to simulate a pipeline that can be viewed in Marquez; b) to establish a mechanism that regularly pulls and emits lineage…

      but most people will already have A, and the simplest example doesn’t need to accomplish B.

      @@ -99502,7 +99508,7 @@


      2023-04-26 13:58:59
      -

      *Thread Reply:* just a few weeks ago 🙂 I was working on a script that you could run like SNOWFLAKE_USER=foo ./process_snowflake_lineage.py --from-date=xxxx-xx-xx --to-date=xxxx-xx-xx

      +

      *Thread Reply:* just a few weeks ago 🙂 I was working on a script that you could run like SNOWFLAKE_USER=foo ./process_snowflake_lineage.py --from-date=xxxx-xx-xx --to-date=xxxx-xx-xx

      @@ -99528,7 +99534,7 @@


      2023-04-27 11:13:58
      -

      *Thread Reply:* Hi @Ross Turk! Do you have a link to this script? Perhaps this script can fix the connection issue 🙂

      +

      *Thread Reply:* Hi @Ross Turk! Do you have a link to this script? Perhaps this script can fix the connection issue 🙂

      @@ -99554,7 +99560,7 @@


      2023-04-27 11:47:20
      -

      *Thread Reply:* No, it never became functional before I stopped to take on another task 😕

      +

      *Thread Reply:* No, it never became functional before I stopped to take on another task 😕

      @@ -99611,7 +99617,7 @@


      2023-04-25 09:54:07
      -

*Thread Reply:* Looks like the first URL is proper, but there's something wrong with the entity - Marquez logs would help here.

+

*Thread Reply:* Looks like the first URL is proper, but there's something wrong with the entity - Marquez logs would help here.

      @@ -99637,7 +99643,7 @@


      2023-04-25 09:57:36
      -

*Thread Reply:* This is my log in airflow, can you please provide more info on it?

+

*Thread Reply:* This is my log in airflow, can you please provide more info on it?

      @@ -99672,7 +99678,7 @@


      2023-04-25 10:13:37
      -

      *Thread Reply:* Airflow log does not tell us why Marquez rejected the event. Marquez logs would be more helpful

      +

      *Thread Reply:* Airflow log does not tell us why Marquez rejected the event. Marquez logs would be more helpful

      @@ -99698,7 +99704,7 @@


      2023-04-26 05:48:08
      -

*Thread Reply:* We investigated the marquez container logs and were unable to locate the error. Could you please specify which log file belongs to marquez when connecting to airflow or snowflake?

+

*Thread Reply:* We investigated the marquez container logs and were unable to locate the error. Could you please specify which log file belongs to marquez when connecting to airflow or snowflake?

Is it correct that the marquez-web log points to <http://api:5000/>?
[HPM] Proxy created: /api/v1 -> <http://api:5000/>

@@ -99750,7 +99756,7 @@


      2023-04-26 11:26:36
      -

      *Thread Reply:* I've the same error at the moment but can provide some additional screenshots. The Event data in Snowflake seems fine and the data is being retrieved correctly by the Airflow DAG. However, there seems to be a warning in the Marquez API logs. Hopefully we can troubleshoot this together!

      +

      *Thread Reply:* I've the same error at the moment but can provide some additional screenshots. The Event data in Snowflake seems fine and the data is being retrieved correctly by the Airflow DAG. However, there seems to be a warning in the Marquez API logs. Hopefully we can troubleshoot this together!

      @@ -99794,7 +99800,7 @@


      2023-04-26 11:33:35
      -

      *Thread Reply:*

      +

      *Thread Reply:*

      @@ -99838,7 +99844,7 @@


      2023-04-26 13:06:30
      -

*Thread Reply:* Possibly the Python part in between does some weird things, like double-JSONing the data? I can imagine it being wrapped in a second, unnecessary JSON object

+

*Thread Reply:* Possibly the Python part in between does some weird things, like double-JSONing the data? I can imagine it being wrapped in a second, unnecessary JSON object

      @@ -99864,7 +99870,7 @@


      2023-04-26 13:08:18
      -

*Thread Reply:* I guess the only way to check is to print one of those events - in the form they are sent in the Python part, not Snowflake - and see what they look like. For example using ConsoleTransport or setting DEBUG log level in Airflow

+

*Thread Reply:* I guess the only way to check is to print one of those events - in the form they are sent in the Python part, not Snowflake - and see what they look like. For example using ConsoleTransport or setting DEBUG log level in Airflow

      @@ -99890,7 +99896,7 @@


      2023-04-26 14:37:32
      -

      *Thread Reply:* Here is a code snippet by using logging in DEBUG on the snowflake python connector:

      +

      *Thread Reply:* Here is a code snippet by using logging in DEBUG on the snowflake python connector:

[2023-04-26T17:16:55.166+0000] {cursor.py:593} DEBUG - binding: [set currentorganization='[PRIVATE]';] with input=[None], processed=[{}]
[2023-04-26T17:16:55.166+0000] {cursor.py:800} INFO - query: [set currentorganization='[PRIVATE]';]

@@ -99978,7 +99984,7 @@


      2023-04-26 14:45:26
      -

      *Thread Reply:* I don't see any Airflow standard logs here, but anyway I looked at it and debugging it would not work if you're bypassing OpenLineageClient.emit and going directly to transport - the logging is done on Client level https://github.com/OpenLineage/OpenLineage/blob/acc207d63e976db7c48384f04bc578409f08cc8a/client/python/openlineage/client/client.py#L73

      +

      *Thread Reply:* I don't see any Airflow standard logs here, but anyway I looked at it and debugging it would not work if you're bypassing OpenLineageClient.emit and going directly to transport - the logging is done on Client level https://github.com/OpenLineage/OpenLineage/blob/acc207d63e976db7c48384f04bc578409f08cc8a/client/python/openlineage/client/client.py#L73

      @@ -100029,7 +100035,7 @@


      2023-04-27 11:16:20
      -

      *Thread Reply:* I'm sorry, do you have a code snippet on how to get these logs from https://github.com/Snowflake-Labs/OpenLineage-AccessHistory-Setup/blob/main/examples/airflow/dags/lineage/extract_openlineage.py? I still get the ValueError for OpenLineageClient.emit

      +

      *Thread Reply:* I'm sorry, do you have a code snippet on how to get these logs from https://github.com/Snowflake-Labs/OpenLineage-AccessHistory-Setup/blob/main/examples/airflow/dags/lineage/extract_openlineage.py? I still get the ValueError for OpenLineageClient.emit

      @@ -100157,7 +100163,7 @@


      2023-05-04 10:56:34
      -

      *Thread Reply:* Hey does anyone have an idea on this? I'm still stuck on this issue 😞

      +

      *Thread Reply:* Hey does anyone have an idea on this? I'm still stuck on this issue 😞

      @@ -100183,7 +100189,7 @@


      2023-05-05 08:58:49
      -

      *Thread Reply:* I've found the root cause. It's because facets don't have _producer and _schemaURL set. I'll provide a fix soon

      +

      *Thread Reply:* I've found the root cause. It's because facets don't have _producer and _schemaURL set. I'll provide a fix soon
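
In other words, every facet object in the emitted JSON needs those two fields. A well-formed custom facet looks roughly like this sketch (the values are placeholders):

```python
my_facet = {
    "_producer": "https://example.com/my-producer",
    "_schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/$defs/RunFacet",
    "someKey": "someValue",
}
```
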

      @@ -100274,7 +100280,7 @@


      2023-04-26 11:36:57
      -

      *Thread Reply:* I’ll be there! I’m looking forward to see you all.

      +

      *Thread Reply:* I’ll be there! I’m looking forward to see you all.

      @@ -100300,7 +100306,7 @@


      2023-04-26 11:37:23
      -

      *Thread Reply:* We’ll talk about the evolution of the spec.

      +

      *Thread Reply:* We’ll talk about the evolution of the spec.

      @@ -100358,7 +100364,7 @@


      2023-04-28 04:23:24
      -

*Thread Reply:* I've created an integration test based on your example. The OpenLineage event gets sent; however, it does not contain the output dataset. I will look deeper into that.

+

*Thread Reply:* I've created an integration test based on your example. The OpenLineage event gets sent; however, it does not contain the output dataset. I will look deeper into that.

      @@ -100388,7 +100394,7 @@


      2023-04-28 08:55:43
      -

      *Thread Reply:* Hey, sorry do you mean input dataset is empty? Or output dataset?

      +

      *Thread Reply:* Hey, sorry do you mean input dataset is empty? Or output dataset?

      @@ -100414,7 +100420,7 @@


      2023-04-28 08:55:51
      -

      *Thread Reply:* I am seeing that input dataset is empty

      +

      *Thread Reply:* I am seeing that input dataset is empty

      @@ -100440,7 +100446,7 @@


      2023-04-28 08:56:05
      -

      *Thread Reply:* ooh, I see input datasets

      +

      *Thread Reply:* ooh, I see input datasets

      @@ -100466,7 +100472,7 @@


      2023-04-28 08:56:11
      -

      *Thread Reply:* Hmm

      +

      *Thread Reply:* Hmm

      @@ -100492,7 +100498,7 @@


      2023-04-28 08:56:12
      -

      *Thread Reply:* I see

      +

      *Thread Reply:* I see

      @@ -100518,7 +100524,7 @@


      2023-04-28 08:57:07
      -

*Thread Reply:* I created a test method in the SparkDeltaIntegrationTest class:

+

*Thread Reply:* I created a test method in the SparkDeltaIntegrationTest class:
```@Test
void testDeltaMergeInto() {
    Dataset<Row> dataset =

@@ -100579,7 +100585,7 @@


      2023-04-28 08:59:14
      -

*Thread Reply:* Oh yeah, my bad. I am seeing that the output dataset is empty.

+

*Thread Reply:* Oh yeah, my bad. I am seeing that the output dataset is empty.

      @@ -100605,7 +100611,7 @@


      2023-04-28 08:59:21
      -

      *Thread Reply:* Checks out with your observation

      +

      *Thread Reply:* Checks out with your observation

      @@ -100631,7 +100637,7 @@


      2023-05-03 23:23:36
      -

*Thread Reply:* Hi @Paweł Leszczyński just curious, has a fix for this been implemented already?

+

*Thread Reply:* Hi @Paweł Leszczyński just curious, has a fix for this been implemented already?

      @@ -100657,7 +100663,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-04 02:40:11
      -

*Thread Reply:* Hi @Anirudh Shrinivason, I had some days OOO. I will look into this soon.

+

*Thread Reply:* Hi @Anirudh Shrinivason, I had some days OOO. I will look into this soon.

      @@ -100683,7 +100689,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-04 07:37:52
      -

      *Thread Reply:* Ahh okie! Thanks so much! Hope you had a good rest!

      +

      *Thread Reply:* Ahh okie! Thanks so much! Hope you had a good rest!

      @@ -100709,7 +100715,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-04 07:38:38
      -

      *Thread Reply:* yeah. this was an amazing extended weekend 😉

      +

      *Thread Reply:* yeah. this was an amazing extended weekend 😉

      @@ -100739,7 +100745,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-05 02:09:10
      -

      *Thread Reply:* This should be it: https://github.com/OpenLineage/OpenLineage/pull/1823

      +

      *Thread Reply:* This should be it: https://github.com/OpenLineage/OpenLineage/pull/1823

      @@ -100838,7 +100844,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-05 02:43:24
      -

*Thread Reply:* Hi @Anirudh Shrinivason, please let me know if there is still something to be done within #1747 [PROPOSAL] Support for V2SessionCatalog. I could not reproduce exactly what you described but fixed some issue nearby.

+

*Thread Reply:* Hi @Anirudh Shrinivason, please let me know if there is still something to be done within #1747 [PROPOSAL] Support for V2SessionCatalog. I could not reproduce exactly what you described but fixed some issue nearby.

      @@ -100864,7 +100870,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-05 02:49:38
      -

      *Thread Reply:* Hmm yeah sure let me find out the exact cause of the issue. The pipeline that was causing the issue is now inactive haha. So I'm trying to backtrace from the limited logs I captured last time. Let me get back by next week thanks! 🙇

      +

      *Thread Reply:* Hmm yeah sure let me find out the exact cause of the issue. The pipeline that was causing the issue is now inactive haha. So I'm trying to backtrace from the limited logs I captured last time. Let me get back by next week thanks! 🙇

      @@ -100890,7 +100896,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-05 09:35:00
      -

      *Thread Reply:* Hi @Paweł Leszczyński I was trying to replicate the issue from my end, but couldn't do so. I think we can close the issue for now, and revisit later on if the issue resurfaces. Does that sound okay?

      +

      *Thread Reply:* Hi @Paweł Leszczyński I was trying to replicate the issue from my end, but couldn't do so. I think we can close the issue for now, and revisit later on if the issue resurfaces. Does that sound okay?

      @@ -100916,7 +100922,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-05 09:40:33
      -

      *Thread Reply:* sounds cool. we can surely create a new issue later on.

      +

      *Thread Reply:* sounds cool. we can surely create a new issue later on.

      @@ -100946,7 +100952,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-09 23:34:04
      -

*Thread Reply:* @Paweł Leszczyński - I was trying to implement these new changes in databricks. I was wondering which Java file I should use for building the jar file. Could you please help me?

+

*Thread Reply:* @Paweł Leszczyński - I was trying to implement these new changes in databricks. I was wondering which Java file I should use for building the jar file. Could you please help me?

      @@ -100972,7 +100978,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-09 00:46:34
      -

      *Thread Reply:* .

      +

      *Thread Reply:* .

      @@ -100998,7 +101004,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-09 02:37:49
      -

*Thread Reply:* Hi, I found that these merge operations have no input datasets/col lineage: +

*Thread Reply:* Hi, I found that these merge operations have no input datasets/col lineage:
```df.write.format(fileformat).mode(mode).option("mergeSchema", mergeschema).option("overwriteSchema", overwriteSchema).save(path)

df.write.format(fileformat).mode(mode).option("mergeSchema", mergeschema).option("overwriteSchema", overwriteSchema)\
@@ -101038,7 +101044,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-09 02:41:24
      -

      *Thread Reply:* Hi @Anirudh Shrinivason, great to hear from you. Could you create an issue out of this? I am working at the moment on Spark 3.4. Once this is ready, I will look at the spark issues. And this one seems to be nicely reproducible. Thanks for that.

      +

      *Thread Reply:* Hi @Anirudh Shrinivason, great to hear from you. Could you create an issue out of this? I am working at the moment on Spark 3.4. Once this is ready, I will look at the spark issues. And this one seems to be nicely reproducible. Thanks for that.

      @@ -101068,7 +101074,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-09 02:49:56
      -

      *Thread Reply:* Sure let me create an issue! Thanks!

      +

      *Thread Reply:* Sure let me create an issue! Thanks!

      @@ -101094,7 +101100,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-09 02:55:21
      -

      *Thread Reply:* Created an issue here! https://github.com/OpenLineage/OpenLineage/issues/1919 +

      *Thread Reply:* Created an issue here! https://github.com/OpenLineage/OpenLineage/issues/1919 Thanks! 🙇
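(For anyone trying to reproduce the report above, a rough PySpark sketch of the kind of Delta write being described; all paths, options, and data are placeholders, and Delta Lake is assumed to be configured on the cluster.)

```
from pyspark.sql import SparkSession

# Placeholder repro sketch (assumes Delta Lake is on the classpath and the
# OpenLineage listener is configured); not the reporter's exact job.
spark = SparkSession.builder.appName("delta-write-repro").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# A write of the shape reported to emit no input datasets / column lineage:
(
    df.write.format("delta")
    .mode("overwrite")
    .option("mergeSchema", "true")
    .option("overwriteSchema", "true")
    .save("/tmp/delta/target_table")
)
```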

      @@ -101162,7 +101168,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-15 10:39:50
      -

      *Thread Reply:* Hi @Paweł Leszczyński I just realised, https://github.com/OpenLineage/OpenLineage/pull/1823/files +

      *Thread Reply:* Hi @Paweł Leszczyński I just realised, https://github.com/OpenLineage/OpenLineage/pull/1823/files This PR doesn't actually capture column lineage for the MergeIntoCommand? It looks like there is no column lineage field in the events json.

      @@ -101189,7 +101195,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-17 04:21:24
      -

      *Thread Reply:* Hi @Paweł Leszczyński Is there a potential timeline in mind to support column lineage for the MergeIntoCommand? We're really excited for this feature and would be a huge help to overcome a current blocker. Thanks!

      +

      *Thread Reply:* Hi @Paweł Leszczyński Is there a potential timeline in mind to support column lineage for the MergeIntoCommand? We're really excited for this feature and would be a huge help to overcome a current blocker. Thanks!

      @@ -101315,7 +101321,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-01 12:02:13
      -

      *Thread Reply:* “producer” is an element of the event run itself - e.g. what produced the JSON packet you’re studying. There is only one of these per event run. You can think of it as a top-level property.

      +

      *Thread Reply:* “producer” is an element of the event run itself - e.g. what produced the JSON packet you’re studying. There is only one of these per event run. You can think of it as a top-level property.

“_producer” (and “_schemaURL”) are elements of a facet. They are the 2 required elements for any customized facet (though I don’t agree they should be required, or at least I believe they should be able to be compatible with a blank value and a null value).

      @@ -101345,7 +101351,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-01 12:06:52
      -

      *Thread Reply:* just curious --- is/was there any specific reason for the underscore prefix? If they are in a facet, they would already be qualified.......

      +

      *Thread Reply:* just curious --- is/was there any specific reason for the underscore prefix? If they are in a facet, they would already be qualified.......

      @@ -101371,7 +101377,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-01 13:13:28
      -

*Thread Reply:* The facet “BaseFacet” that’s used for customization has 2 required elements - _producer and _schemaURL. So I don’t believe it’s related to qualification.

+

*Thread Reply:* The facet “BaseFacet” that’s used for customization has 2 required elements - _producer and _schemaURL. So I don’t believe it’s related to qualification.
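(To make that concrete, here is a hand-written illustration of a custom facet carrying the two required base fields; every value below is made up.)

```
# Illustration only: a custom facet must carry _producer and _schemaURL,
# per the BaseFacet requirement discussed above. All values are made up.
my_custom_facet = {
    "_producer": "https://example.com/my-producer/v1.0.0",
    "_schemaURL": "https://example.com/schemas/MyCustomFacet.json",
    "myField": "my value",
}
```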

      @@ -101442,7 +101448,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-02 19:43:12
      -

      *Thread Reply:* Thanks for voting. The release will commence within 2 days.

      +

      *Thread Reply:* Thanks for voting. The release will commence within 2 days.

      @@ -101494,7 +101500,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-04 02:33:32
      -

      *Thread Reply:* Although it is not documented, we do have an integration test for that: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/test/resources/spark_scripts/spark_kafka.py

      +

      *Thread Reply:* Although it is not documented, we do have an integration test for that: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/test/resources/spark_scripts/spark_kafka.py

      The test reads and writes data to Kafka and verifies if input/output datasets are collected.

      MARQUEZAPIKEY=[YOURAPIKEY]
      2023-05-01 22:37:25
      -

      *Thread Reply:* From my experience, yeah it works for pyspark

      +

      *Thread Reply:* From my experience, yeah it works for pyspark

      @@ -101670,7 +101676,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-01 22:39:18
      -

      *Thread Reply:* Also from my experience, I think we can only use one of them as we can only configure one spark listener... correct me if I'm wrong. But it seems like the latest releases of spline are already using openlineage to some capacity?

      +

      *Thread Reply:* Also from my experience, I think we can only use one of them as we can only configure one spark listener... correct me if I'm wrong. But it seems like the latest releases of spline are already using openlineage to some capacity?

      @@ -101700,7 +101706,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-08 09:46:15
      -

      *Thread Reply:* In spark.extraListeners you can configure multiple listeners by comma separating them - I think you can use multiple ones with OpenLineage without obvious problems. I think we do pretty similar things to Spline though

      +

      *Thread Reply:* In spark.extraListeners you can configure multiple listeners by comma separating them - I think you can use multiple ones with OpenLineage without obvious problems. I think we do pretty similar things to Spline though
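(A minimal PySpark sketch of that: the OpenLineage listener class name is the documented one, while com.example.OtherListener is a hypothetical second listener.)

```
from pyspark.sql import SparkSession

# spark.extraListeners accepts a comma-separated list of listener classes;
# com.example.OtherListener is a hypothetical placeholder.
spark = (
    SparkSession.builder.appName("multi-listener-demo")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.25.0")
    .config(
        "spark.extraListeners",
        "io.openlineage.spark.agent.OpenLineageSparkListener,com.example.OtherListener",
    )
    .getOrCreate()
)
```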

      @@ -101734,7 +101740,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-25 11:28:41
      -

      *Thread Reply:* (I never said thank you for this, so, thank you!)

      +

      *Thread Reply:* (I never said thank you for this, so, thank you!)

      @@ -101797,7 +101803,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-02 10:45:37
      -

      *Thread Reply:* Hello Sai, thanks for your question! A number of folks who could help with this are OOO, but someone will reply as soon as possible.

      +

      *Thread Reply:* Hello Sai, thanks for your question! A number of folks who could help with this are OOO, but someone will reply as soon as possible.

      @@ -101823,7 +101829,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-04 02:44:46
      -

      *Thread Reply:* That is interesting @Sai. Are you able to reproduce this with a simple code snippet? Which Openlineage version are you using?

      +

      *Thread Reply:* That is interesting @Sai. Are you able to reproduce this with a simple code snippet? Which Openlineage version are you using?

      @@ -101849,7 +101855,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-05 01:16:20
      -

*Thread Reply:* Yes @Paweł Leszczyński. Each join query I run on top of delta tables has two start and two complete events. We are using the below jar for openlineage.

+

*Thread Reply:* Yes @Paweł Leszczyński. Each join query I run on top of delta tables has two start and two complete events. We are using the below jar for openlineage.

      openlineage-spark-0.22.0.jar

      @@ -101881,7 +101887,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-05 02:41:26
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1828

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1828

      @@ -101937,7 +101943,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-08 04:05:26
      -

      *Thread Reply:* Hi @Paweł Leszczyński any updates on this issue?

      +

      *Thread Reply:* Hi @Paweł Leszczyński any updates on this issue?

      Also, OL is not giving column level lineage for group by operations on tables. Is this expected?

      @@ -101965,7 +101971,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-08 04:07:04
      -

      *Thread Reply:* Hi @Sai, https://github.com/OpenLineage/OpenLineage/pull/1830 should fix duplication issue

      +

      *Thread Reply:* Hi @Sai, https://github.com/OpenLineage/OpenLineage/pull/1830 should fix duplication issue

      @@ -102059,7 +102065,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-08 04:08:06
      -

      *Thread Reply:* this would be part of next release?

      +

      *Thread Reply:* this would be part of next release?

      @@ -102085,7 +102091,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-08 04:08:30
      -

      *Thread Reply:* Regarding column lineage & group by issue, I think it's something on databricks side -> we do have an open issue for that #1821

      +

      *Thread Reply:* Regarding column lineage & group by issue, I think it's something on databricks side -> we do have an open issue for that #1821

      @@ -102111,7 +102117,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-08 04:09:24
      -

*Thread Reply:* once #1830 is reviewed and merged, it will be part of the next release

+

*Thread Reply:* once #1830 is reviewed and merged, it will be part of the next release

      @@ -102137,7 +102143,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-08 04:11:01
      -

      *Thread Reply:* sure.. thanks @Paweł Leszczyński

      +

      *Thread Reply:* sure.. thanks @Paweł Leszczyński

      @@ -102163,7 +102169,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-16 03:27:01
      -

      *Thread Reply:* @Paweł Leszczyński I have used the latest jar (0.25.0) and still this issue persists. I see two events for same input/output lineage.

      +

      *Thread Reply:* @Paweł Leszczyński I have used the latest jar (0.25.0) and still this issue persists. I see two events for same input/output lineage.

      @@ -102215,7 +102221,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-03 04:06:37
      -

*Thread Reply:* For example, MySQL is stored as producer + host + port + database + table as something like <mysql://db.foo.com:6543/metrics.orders> +

*Thread Reply:* For example, MySQL is stored as producer + host + port + database + table as something like <mysql://db.foo.com:6543/metrics.orders>
For an API (especially one following REST conventions), I was thinking something like method + host + port + path or GET <https://api.service.com:443/v1/users>
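(Purely as a thought experiment, the proposed name could be assembled like this; the helper and values are illustrative only.)

```
# Illustrative only: assembling the proposed API dataset name.
def api_dataset_name(method: str, host: str, port: int, path: str) -> str:
    return f"{method} https://{host}:{port}{path}"

print(api_dataset_name("GET", "api.service.com", 443, "/v1/users"))
# GET https://api.service.com:443/v1/users
```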

      @@ -102242,7 +102248,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-03 10:13:25
      -

      *Thread Reply:* Hi Thomas, thanks for asking about this — it sounds cool! I don’t know of others working on this kind of thing, but I’ve been developing a SQLAlchemy integration and have been experimenting with job naming — which I realize isn’t exactly what you’re working on. Hopefully others will chime in here, but in the meantime, would you be willing to create an issue about this? It seems worth discussing how we could expand the spec for this kind of use case.

      +

      *Thread Reply:* Hi Thomas, thanks for asking about this — it sounds cool! I don’t know of others working on this kind of thing, but I’ve been developing a SQLAlchemy integration and have been experimenting with job naming — which I realize isn’t exactly what you’re working on. Hopefully others will chime in here, but in the meantime, would you be willing to create an issue about this? It seems worth discussing how we could expand the spec for this kind of use case.

      @@ -102268,7 +102274,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-03 10:58:32
      -

      *Thread Reply:* I suspect this will definitely be a bigger discussion. Let me ponder on the problem a bit more and come back with something a bit more concrete.

      +

      *Thread Reply:* I suspect this will definitely be a bigger discussion. Let me ponder on the problem a bit more and come back with something a bit more concrete.

      @@ -102294,7 +102300,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-03 10:59:21
      -

      *Thread Reply:* Looking forward to hearing more!

      +

      *Thread Reply:* Looking forward to hearing more!

      @@ -102320,7 +102326,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-03 11:05:47
      -

      *Thread Reply:* On a tangential note, does OpenLineage's column level lineage have support for (I see it can be extended but want to know if someone had to map this before): +

*Thread Reply:* On a tangential note, does OpenLineage's column level lineage have support for (I see it can be extended but want to know if someone had to map this before):
• Properties as a path in a structure (like a JSON structure, Avro schema, protobuf, etc) maybe using something like JSON Path or XPath notation.
• Fragments (when a column is a JSON blob, there is an entire sub-structure that needs to be described)
• Transformation description (how an input affects an output. Is it a direct copy of the value or is it part of a formula)

      @@ -102349,7 +102355,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-03 11:22:21
      -

      *Thread Reply:* I don’t know, but I’ll ping some folks who might.

      +

      *Thread Reply:* I don’t know, but I’ll ping some folks who might.

      @@ -102375,7 +102381,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-04 03:24:01
      -

      *Thread Reply:* Hi @Thomas. Column-lineage support currently does not include json fields. We have included in specification fields like transformationDescription and transformationType to store a string representation of the transformation applied and its type like IDENTITY|MASKED. However, those fields aren't filled within Spark integration at the moment.

      +

      *Thread Reply:* Hi @Thomas. Column-lineage support currently does not include json fields. We have included in specification fields like transformationDescription and transformationType to store a string representation of the transformation applied and its type like IDENTITY|MASKED. However, those fields aren't filled within Spark integration at the moment.
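(For orientation, a hand-written sketch of one output field's column-lineage entry using those fields; the shape follows the spec's intent, but every value is invented.)

```
# Invented example of one output field's entry in the column-lineage facet;
# transformationDescription / transformationType are the fields mentioned above.
column_lineage_fields = {
    "total_amount": {
        "inputFields": [
            {
                "namespace": "mysql://db.foo.com:6543",
                "name": "metrics.orders",
                "field": "amount",
            }
        ],
        "transformationDescription": "direct copy of orders.amount",
        "transformationType": "IDENTITY",
    }
}
```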

      @@ -102615,7 +102621,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-08 07:38:19
      -

*Thread Reply:* Hi @Harshini Devathi, I think this is the same as issue: https://github.com/OpenLineage/OpenLineage/issues/1821

+

*Thread Reply:* Hi @Harshini Devathi, I think this is the same as issue: https://github.com/OpenLineage/OpenLineage/issues/1821

      @@ -102746,7 +102752,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-08 19:44:26
      -

*Thread Reply:* Thank you @Paweł Leszczyński. So, is this an issue with Databricks? The issue thread says that it was able to work on AWS Glue. If so, is there some kind of solution to make it work on Databricks?

+

*Thread Reply:* Thank you @Paweł Leszczyński. So, is this an issue with Databricks? The issue thread says that it was able to work on AWS Glue. If so, is there some kind of solution to make it work on Databricks?

      @@ -102798,7 +102804,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-09 20:17:38
      -

      *Thread Reply:* maybe @Will Johnson knows?

      +

      *Thread Reply:* maybe @Will Johnson knows?

      @@ -102861,7 +102867,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-09 10:06:02
      -

*Thread Reply:* There are a few possible issues:

+

*Thread Reply:* There are a few possible issues:

1. The column-level lineage is not implemented for a particular part of the Spark LogicalPlan
2. Azure SQL or Databricks have their own implementations of some Spark class, which does not exactly match our extractor. We've seen that happen
3. You're using a SQL JDBC connection with SELECT * - in which case we can't do anything for now, since we don't know the input columns.
4. Possibly something else 🙂 @Paweł Leszczyński might have an idea. To fully understand the issue, we'd have to see logs, the LogicalPlan of the Spark job, or the job code itself

@@ -102891,7 +102897,7 @@

MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-10 02:35:32
      -

      *Thread Reply:* @Sai, providing a short code snippet that is able to reproduce this would be super helpful in examining that.

      +

      *Thread Reply:* @Sai, providing a short code snippet that is able to reproduce this would be super helpful in examining that.

      @@ -102917,7 +102923,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-10 02:59:24
      -

*Thread Reply:* sure Pawel +

*Thread Reply:* sure Pawel
Will share the code I used in some time

      @@ -102944,10 +102950,10 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-10 03:37:54
      -

      *Thread Reply:* Here is the code we use.

      +

      *Thread Reply:* Here is the code we use.

-

+

@@ -102979,7 +102985,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-16 03:23:13
      -

      *Thread Reply:* Hi Team, Any updates on this?

      +

      *Thread Reply:* Hi Team, Any updates on this?

      @@ -103005,7 +103011,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-16 03:23:37
      -

*Thread Reply:* I tried putting a SQL query with column names in it; the lineage still didn't show up.

+

*Thread Reply:* I tried putting a SQL query with column names in it; the lineage still didn't show up.

      @@ -103082,7 +103088,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-09 10:02:25
      -

      *Thread Reply:* @Anirudh Shrinivason what's your OL and Spark version?

      +

      *Thread Reply:* @Anirudh Shrinivason what's your OL and Spark version?

      @@ -103108,7 +103114,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-09 10:03:29
      -

      *Thread Reply:* Some example job would also help, or logs/LogicalPlan 🙂

      +

      *Thread Reply:* Some example job would also help, or logs/LogicalPlan 🙂

      @@ -103134,7 +103140,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-09 10:05:54
      -

      *Thread Reply:* OL version is 0.23.0 and spark version is 3.3.1

      +

      *Thread Reply:* OL version is 0.23.0 and spark version is 3.3.1

      @@ -103164,7 +103170,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-09 11:00:22
      -

*Thread Reply:* Hmm, actually it seems like the error is intermittent. I ran the same job again, but did not notice any errors this time...

+

*Thread Reply:* Hmm, actually it seems like the error is intermittent. I ran the same job again, but did not notice any errors this time...

      @@ -103190,7 +103196,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-10 02:27:19
      -

*Thread Reply:* This is interesting and it happens within a line: +

*Thread Reply:* This is interesting and it happens within a line:
job.finalStage().parents().forall(toScalaFn(stage -> stageMap.put(stage.id(), stage)));
The result of stageMap.put is Stage and for some reason which I don't understand it tries doing unboxToBoolean. We could rewrite that to:
job.finalStage().parents().forall(toScalaFn(stage -> {
@@ -103223,7 +103229,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-11 02:22:25
      -

      *Thread Reply:* @Anirudh Shrinivason, please let us know if it is still a valid issue. If so, we can create an issue for that.

      +

      *Thread Reply:* @Anirudh Shrinivason, please let us know if it is still a valid issue. If so, we can create an issue for that.

      @@ -103249,7 +103255,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-11 03:11:13
      -

*Thread Reply:* Hi @Paweł Leszczyński Sorry for the late reply. Yeah, I think if we are able to fix this, it'll be better. If this is the dedicated fix, then I can create an issue and raise an MR.

+

*Thread Reply:* Hi @Paweł Leszczyński Sorry for the late reply. Yeah, I think if we are able to fix this, it'll be better. If this is the dedicated fix, then I can create an issue and raise an MR.

      @@ -103275,7 +103281,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-11 04:12:46
      -

*Thread Reply:* Opened an issue and PR. Do help check if it's okay, thanks!

+

*Thread Reply:* Opened an issue and PR. Do help check if it's okay, thanks!

      @@ -103301,7 +103307,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-11 04:29:33
      -

      *Thread Reply:* please run ./gradlew spotlessApply with Java 8

      +

      *Thread Reply:* please run ./gradlew spotlessApply with Java 8

      @@ -103365,7 +103371,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-10 07:36:33
      -

      *Thread Reply:* > I mean I want to be able to publish one item at a time and then (maybe in a future release of the customer product) be able to exploit runtime lineage also +

      *Thread Reply:* > I mean I want to be able to publish one item at a time and then (maybe in a future release of the customer product) be able to exploit runtime lineage also This is not something that we support yet - there are definitely a lot of plans and preliminary work for that.

      @@ -103392,7 +103398,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-10 07:57:44
      -

*Thread Reply:* Thanks for the response, btw I already took a look at the current capabilities provided by openlineage, so my “hidden” question is how to achieve what the customer wants in order to be integrated in some way with openlineage+marquez? +

*Thread Reply:* Thanks for the response, btw I already took a look at the current capabilities provided by openlineage, so my “hidden” question is how to achieve what the customer wants in order to be integrated in some way with openlineage+marquez?
should I choose between make or buy (between already supported platforms) and then try to align “static” (aka declarative) lineage metadata within the openlineage conceptual model?

      @@ -103516,7 +103522,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-11 13:58:28
      -

      *Thread Reply:* Hi John. Great question! [full disclosure, I am with Manta 🙂 ]. I'll let others answer as to their experience with ourselves or many other vendors that provide lineage, but want to mention that a variety of our customers are finding it beneficial to bring code based static lineage together with the event-based runtime lineage that OpenLineage provides. This gives them the best of both worlds, for analyzing the lineage of their existing systems, where rich parsers already exist (for everything from legacy ETL tools, reporting tools, rdbms, etc.), to newer or home-grown technologies where applying OpenLineage is a viable alternative.

      +

      *Thread Reply:* Hi John. Great question! [full disclosure, I am with Manta 🙂 ]. I'll let others answer as to their experience with ourselves or many other vendors that provide lineage, but want to mention that a variety of our customers are finding it beneficial to bring code based static lineage together with the event-based runtime lineage that OpenLineage provides. This gives them the best of both worlds, for analyzing the lineage of their existing systems, where rich parsers already exist (for everything from legacy ETL tools, reporting tools, rdbms, etc.), to newer or home-grown technologies where applying OpenLineage is a viable alternative.

      @@ -103546,7 +103552,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-11 14:12:04
      -

      *Thread Reply:* @Ernie Ostic do you see a single front-runner in the static lineage space? The static/event-based situation you describe is exactly the product roadmap I'm seeing here at Fivetran and I'm wondering if there's an opportunity to drive consensus towards a best-practice solution. If I'm not mistaken weren't there plans to start supporting non-run-based events in OL as well?

      +

      *Thread Reply:* @Ernie Ostic do you see a single front-runner in the static lineage space? The static/event-based situation you describe is exactly the product roadmap I'm seeing here at Fivetran and I'm wondering if there's an opportunity to drive consensus towards a best-practice solution. If I'm not mistaken weren't there plans to start supporting non-run-based events in OL as well?

      @@ -103576,7 +103582,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-11 14:16:34
      -

      *Thread Reply:* I definitely like the idea of a 3rd party solution being complementary to OSS tools we can maintain ourselves while allowing us to offload maintenance effort where possible. Currently I have strong opinions on both sides of the build vs. buy aisle and this seems like the best of both worlds.

      +

      *Thread Reply:* I definitely like the idea of a 3rd party solution being complementary to OSS tools we can maintain ourselves while allowing us to offload maintenance effort where possible. Currently I have strong opinions on both sides of the build vs. buy aisle and this seems like the best of both worlds.

      @@ -103602,7 +103608,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-11 14:52:40
      -

      *Thread Reply:* @Brad Paskewitz that’s 100% our plan to extend the OL spec to support “run-less” events. We want to collect that static metadata for Datasets and Jobs outside of the context of a run through OpenLineage. +

      *Thread Reply:* @Brad Paskewitz that’s 100% our plan to extend the OL spec to support “run-less” events. We want to collect that static metadata for Datasets and Jobs outside of the context of a run through OpenLineage. happy to get your feedback here as well: https://github.com/OpenLineage/OpenLineage/pull/1839

      @@ -103633,7 +103639,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-11 14:57:46
      -

      *Thread Reply:* Hi @John Lukenoff. Here at Atlan we've been working with the OpenLineage community for quite some time to unlock the use case you describe. These efforts are adjacent to our ongoing integration with Fivetran. Happy to connect and give you a demo of what we've built and dig into your use case specifics.

      +

      *Thread Reply:* Hi @John Lukenoff. Here at Atlan we've been working with the OpenLineage community for quite some time to unlock the use case you describe. These efforts are adjacent to our ongoing integration with Fivetran. Happy to connect and give you a demo of what we've built and dig into your use case specifics.

      @@ -103659,7 +103665,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-12 11:26:32
      -

      *Thread Reply:* Thanks all! These comments are really informative, it’s exciting to hear about vendors leaning into the project to let us continue to benefit from the tremendous progress being made by the community. Had a great discussion with Atlan yesterday and plan to connect with Manta next week to discuss our use-cases.

      +

      *Thread Reply:* Thanks all! These comments are really informative, it’s exciting to hear about vendors leaning into the project to let us continue to benefit from the tremendous progress being made by the community. Had a great discussion with Atlan yesterday and plan to connect with Manta next week to discuss our use-cases.

      @@ -103685,7 +103691,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-12 12:34:32
      -

      *Thread Reply:* Reach out anytime, John. @John Lukenoff Looking forward to engaging further with you on these topics!

      +

      *Thread Reply:* Reach out anytime, John. @John Lukenoff Looking forward to engaging further with you on these topics!

      @@ -103741,7 +103747,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-12 11:19:02
      -

      *Thread Reply:* Thank you for requesting an OpenLineage release. As stated here, three +1s from committers will authorize an immediate release. Our policy is not to release on Fridays, so the earliest we could initiate would be Monday.

      +

      *Thread Reply:* Thank you for requesting an OpenLineage release. As stated here, three +1s from committers will authorize an immediate release. Our policy is not to release on Fridays, so the earliest we could initiate would be Monday.

      @@ -103767,7 +103773,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-12 13:12:43
      -

      *Thread Reply:* A release on Monday is totally fine @Michael Robinson.

      +

      *Thread Reply:* A release on Monday is totally fine @Michael Robinson.

      @@ -103793,7 +103799,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-15 08:37:39
      -

      *Thread Reply:* The release will be initiated today. Thanks @Harshini Devathi

      +

      *Thread Reply:* The release will be initiated today. Thanks @Harshini Devathi

      @@ -103823,7 +103829,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-16 20:16:07
      -

      *Thread Reply:* Appreciate it @Michael Robinson and thanks to all the committers for the prompt response

      +

      *Thread Reply:* Appreciate it @Michael Robinson and thanks to all the committers for the prompt response

      @@ -103947,7 +103953,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-16 14:13:16
      -

      *Thread Reply:* Can’t wait! 💯

      +

      *Thread Reply:* Can’t wait! 💯

      @@ -104002,7 +104008,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-17 01:58:28
      -

*Thread Reply:* Hi @Anirudh Shrinivason, are you able to send the OL events to the console? This would let us confirm whether the issue is related to event generation or to emitting it and waiting for the backend to respond.

+

*Thread Reply:* Hi @Anirudh Shrinivason, are you able to send the OL events to the console? This would let us confirm whether the issue is related to event generation or to emitting it and waiting for the backend to respond.
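(One way to do that, sketched in PySpark; the spark.openlineage.transport.type property follows the docs for recent versions at the time, so treat the exact name and values as an assumption for your jar version.)

```
from pyspark.sql import SparkSession

# Sketch: emit OpenLineage events to the driver logs instead of an HTTP
# backend, to separate event generation from emission problems.
# Property name per the docs of the time; verify against your jar version.
spark = (
    SparkSession.builder.appName("ol-console-debug")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.25.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "console")
    .getOrCreate()
)
```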

      @@ -104028,7 +104034,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-17 01:59:03
      -

      *Thread Reply:* Ahh okay sure. Let me see if I can do that

      +

      *Thread Reply:* Ahh okay sure. Let me see if I can do that

      @@ -104063,7 +104069,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      @Paweł Leszczyński @Michael Robinson

-

+

@@ -104099,7 +104105,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 03:55:51
      -

      *Thread Reply:* Hi @Paweł Leszczyński

      +

      *Thread Reply:* Hi @Paweł Leszczyński

      Thanks for fixing the issue and with new release merge is working. But I could not see any input and output datasets for this. Let me know if you need any further details to look into this.

      @@ -104137,7 +104143,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 04:00:01
      -

*Thread Reply:* Oh man, it's just that vanilla Spark differs from the one available on the Databricks platform. Our integration tests do verify behaviour on vanilla Spark, which still leaves a possibility for inconsistency. Will need to get back to it at some time.

+

*Thread Reply:* Oh man, it's just that vanilla Spark differs from the one available on the Databricks platform. Our integration tests do verify behaviour on vanilla Spark, which still leaves a possibility for inconsistency. Will need to get back to it at some time.

      @@ -104163,7 +104169,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 02:11:28
      -

      *Thread Reply:* Hi @Paweł Leszczyński

      +

      *Thread Reply:* Hi @Paweł Leszczyński

      Did you get chance to look into this issue?

      @@ -104191,7 +104197,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 02:13:18
      -

*Thread Reply:* Hi Sai, I am going back to Spark. I am working on support for Spark 3.4, which is going to add some event filtering on internal Delta operations that unnecessarily trigger events

+

*Thread Reply:* Hi Sai, I am going back to Spark. I am working on support for Spark 3.4, which is going to add some event filtering on internal Delta operations that unnecessarily trigger events

      @@ -104217,7 +104223,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 02:13:28
      -

*Thread Reply:* this may be related to the issue you created

+

*Thread Reply:* this may be related to the issue you created

      @@ -104243,7 +104249,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 02:14:13
      -

*Thread Reply:* I do have plans to create an integration test for Databricks, which will be helpful in tackling the issues you raised

+

*Thread Reply:* I do have plans to create an integration test for Databricks, which will be helpful in tackling the issues you raised

      @@ -104269,7 +104275,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 02:14:27
      -

      *Thread Reply:* so yes, I am looking at the Spark

      +

      *Thread Reply:* so yes, I am looking at the Spark

      @@ -104295,7 +104301,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 02:20:06
      -

*Thread Reply:* thanks much Pawel.. I am looking more into the merge part as first priority, as we use it frequently.

+

*Thread Reply:* thanks much Pawel.. I am looking more into the merge part as first priority, as we use it frequently.

      @@ -104321,7 +104327,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 02:21:01
      -

      *Thread Reply:* I know, this is important.

      +

      *Thread Reply:* I know, this is important.

      @@ -104347,7 +104353,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 02:21:14
      -

*Thread Reply:* It just still needs some time.

+

*Thread Reply:* It just still needs some time.

      @@ -104373,7 +104379,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 02:21:46
      -

      *Thread Reply:* thank you for your patience and being so proactive on those issues.

      +

      *Thread Reply:* thank you for your patience and being so proactive on those issues.

      @@ -104399,7 +104405,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 02:22:12
      -

      *Thread Reply:* no problem.. Please do keep us posted with updates..

      +

      *Thread Reply:* no problem.. Please do keep us posted with updates..

      @@ -104461,7 +104467,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 10:26:48
      -

      *Thread Reply:* The release request has been approved and will be initiated shortly.

      +

      *Thread Reply:* The release request has been approved and will be initiated shortly.

      @@ -104513,7 +104519,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 02:29:55
      -

      *Thread Reply:* Hi Anirudh, I saw your issue and I think it is the same one as solved within #1858. Are you able to reproduce it on a version built on the top of main?

      +

      *Thread Reply:* Hi Anirudh, I saw your issue and I think it is the same one as solved within #1858. Are you able to reproduce it on a version built on the top of main?

      @@ -104539,7 +104545,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 06:21:05
      -

*Thread Reply:* Hi I haven't managed to try with the main branch. But if it's the same error then all's good! If the error resurfaces then we can look into it.

+

*Thread Reply:* Hi I haven't managed to try with the main branch. But if it's the same error then all's good! If the error resurfaces then we can look into it.

      @@ -104593,7 +104599,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 02:28:31
      -

      *Thread Reply:* I know this one: https://openlineage.io/docs/integrations/dbt

      +

      *Thread Reply:* I know this one: https://openlineage.io/docs/integrations/dbt

      openlineage.io
      @@ -104640,7 +104646,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 02:41:39
      -

*Thread Reply:* Hi @Paweł Leszczyński Thanks for the reply; I tried the same but am facing the issue below

+

*Thread Reply:* Hi @Paweł Leszczyński Thanks for the reply; I tried the same but am facing the issue below

      requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url:

      @@ -104670,7 +104676,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 02:44:09
      -

      *Thread Reply:* @Lovenish Goyal, exactly. You need to start Marquez. +

      *Thread Reply:* @Lovenish Goyal, exactly. You need to start Marquez. More about it: https://marquezproject.ai/quickstart

      @@ -104697,7 +104703,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 10:27:52
      -

      *Thread Reply:* @Lovenish Goyal how are you running dbt core currently?

      +

      *Thread Reply:* @Lovenish Goyal how are you running dbt core currently?

      @@ -104723,7 +104729,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 01:55:20
      -

*Thread Reply:* Trying, but facing an issue while running the Marquez project @Jakub Dardziński

+

*Thread Reply:* Trying, but facing an issue while running the Marquez project @Jakub Dardziński

      @@ -104749,7 +104755,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 01:56:03
      -

      *Thread Reply:* @Harel Shein we have created custom docker image of DBT + Airflow and running it on an EC2

      +

      *Thread Reply:* @Harel Shein we have created custom docker image of DBT + Airflow and running it on an EC2

      @@ -104775,7 +104781,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 09:05:31
      -

      *Thread Reply:* for running dbt core on Airflow, we have a utility that helps develop dbt natively on Airflow. There’s also built in support for collecting lineage if you have the airflow-openlineage provider installed. +

      *Thread Reply:* for running dbt core on Airflow, we have a utility that helps develop dbt natively on Airflow. There’s also built in support for collecting lineage if you have the airflow-openlineage provider installed. https://astronomer.github.io/astronomer-cosmos/#quickstart

      @@ -104802,7 +104808,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 09:06:30
      -

      *Thread Reply:* RE issues running Marquez, can you share what those are? I’m guessing that since you are running both of them in individual docker images, the airflow deployment might not be able to communicate with the Marquez endpoints?

      +

      *Thread Reply:* RE issues running Marquez, can you share what those are? I’m guessing that since you are running both of them in individual docker images, the airflow deployment might not be able to communicate with the Marquez endpoints?

      @@ -104828,7 +104834,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 09:06:53
      -

      *Thread Reply:* @Harel Shein I've already helped with running Marquez 🙂

      +

      *Thread Reply:* @Harel Shein I've already helped with running Marquez 🙂

      @@ -104884,7 +104890,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 02:34:35
      -

      *Thread Reply:* I am not sure if this is the same. If you see OL events collected with column-lineage missing, then it's a different one.

      +

      *Thread Reply:* I am not sure if this is the same. If you see OL events collected with column-lineage missing, then it's a different one.

      @@ -104910,7 +104916,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 02:41:11
      -

*Thread Reply:* Please also be aware that it is extremely helpful to investigate the issues on your own before creating them.

+

*Thread Reply:* Please also be aware that it is extremely helpful to investigate the issues on your own before creating them.

Our integration traverses Spark's logical plans and extracts lineage events from plan nodes that it understands. Some plan nodes are not supported yet and, from my experience, when working on an issue, 80% of the time is spent on reproducing the scenario.

      @@ -104940,7 +104946,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 03:52:30
      -

      *Thread Reply:* @Paweł Leszczyński Thanks for the prompt response.

      +

      *Thread Reply:* @Paweł Leszczyński Thanks for the prompt response.

Provided sample code with and without aggregate functions and their respective lineage events for reference.

      @@ -105240,7 +105246,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 03:55:04
      -

*Thread Reply:* 2. Please find the code using an aggregate function:

+

*Thread Reply:* 2. Please find the code using an aggregate function:

          final_df=spark.sql("""
           select productid
      @@ -105485,7 +105491,7 @@ 

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 04:09:17
      -

      *Thread Reply:* amazing. https://github.com/OpenLineage/OpenLineage/issues/1861

      +

      *Thread Reply:* amazing. https://github.com/OpenLineage/OpenLineage/issues/1861
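(A minimal shape of the aggregate case tracked in that issue, for anyone reproducing; table and column names are invented, and spark is an existing SparkSession.)

```
# Invented minimal shape of the report: column lineage was present without
# the aggregate and missing once SUM/GROUP BY was introduced.
final_df = spark.sql(
    """
    SELECT productid,
           SUM(quantity) AS total_quantity
    FROM default.sales
    GROUP BY productid
    """
)
final_df.write.mode("overwrite").saveAsTable("default.sales_by_product")
```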

      @@ -105571,7 +105577,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 04:11:56
      -

      *Thread Reply:* Thanks for considering the request and looking into it

      +

      *Thread Reply:* Thanks for considering the request and looking into it

      @@ -105706,7 +105712,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 20:13:09
      -

      *Thread Reply:* Hi @Bramha Aelem I replied in the ticket. Thank you for opening it.

      +

      *Thread Reply:* Hi @Bramha Aelem I replied in the ticket. Thank you for opening it.

      @@ -105732,7 +105738,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-18 21:15:30
      -

      *Thread Reply:* Hi @Julien Le Dem - Thanks for quick response. I replied in the ticket. Please let me know if you need any more details.

      +

      *Thread Reply:* Hi @Julien Le Dem - Thanks for quick response. I replied in the ticket. Please let me know if you need any more details.

      @@ -105758,7 +105764,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 02:13:57
      -

      *Thread Reply:* Hi @Bramha Aelem - asked for more details in the ticket.

      +

      *Thread Reply:* Hi @Bramha Aelem - asked for more details in the ticket.

      @@ -105784,7 +105790,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-22 11:08:58
      -

      *Thread Reply:* Hi @Paweł Leszczyński - I replied with necessary details in the ticket. Please let me know if you need any more details.

      +

      *Thread Reply:* Hi @Paweł Leszczyński - I replied with necessary details in the ticket. Please let me know if you need any more details.

      @@ -105810,7 +105816,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-25 15:22:42
      -

*Thread Reply:* Hi @Paweł Leszczyński - any further updates on the issue?

+

*Thread Reply:* Hi @Paweł Leszczyński - any further updates on the issue?

      @@ -105836,7 +105842,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-26 01:56:47
      -

*Thread Reply:* hi @Bramha Aelem, i was out of office for a few days. will get back into this soon. thanks for the update.

+

*Thread Reply:* hi @Bramha Aelem, i was out of office for a few days. will get back into this soon. thanks for the update.

      @@ -105862,7 +105868,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-27 18:46:27
      -

*Thread Reply:* Hi @Paweł Leszczyński - Thanks for your reply. Will wait for your response to proceed further on the issue.

+

*Thread Reply:* Hi @Paweł Leszczyński - Thanks for your reply. Will wait for your response to proceed further on the issue.

      @@ -105888,7 +105894,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 19:29:08
      -

*Thread Reply:* Hi @Paweł Leszczyński - Hope you are doing well. Did you get a chance to look into the samples provided in the ticket? Kindly let me know your observations/recommendations.

+

*Thread Reply:* Hi @Paweł Leszczyński - Hope you are doing well. Did you get a chance to look into the samples provided in the ticket? Kindly let me know your observations/recommendations.

      @@ -105914,7 +105920,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-09 12:43:54
      -

*Thread Reply:* Hi @Paweł Leszczyński - Hope you are doing well. Did you get a chance to look into the samples provided in the ticket? Kindly let me know your observations/recommendations.

+

*Thread Reply:* Hi @Paweł Leszczyński - Hope you are doing well. Did you get a chance to look into the samples provided in the ticket? Kindly let me know your observations/recommendations.

      @@ -105944,7 +105950,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-06 10:29:01
      -

*Thread Reply:* Hi @Paweł Leszczyński - Good day. Did you get a chance to look into the query which I have posted? Can you please provide any thoughts on my observation/query.

+

*Thread Reply:* Hi @Paweł Leszczyński - Good day. Did you get a chance to look into the query which I have posted? Can you please provide any thoughts on my observation/query.

      @@ -105998,7 +106004,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 03:49:27
      -

*Thread Reply:* Hello @John Doe, this mostly means there's something wrong with your transport config for emitting OpenLineage events.

+

*Thread Reply:* Hello @John Doe, this mostly means there's something wrong with your transport config for emitting OpenLineage events.

      @@ -106024,7 +106030,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 03:49:41
      -

      *Thread Reply:* what do you want to do with the events?

      +

      *Thread Reply:* what do you want to do with the events?

      @@ -106050,7 +106056,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 04:10:24
      -

      *Thread Reply:* Hi @Paweł Leszczyński, I am working on a PoC to understand the use cases of OL and how it build Lineages.

      +

      *Thread Reply:* Hi @Paweł Leszczyński, I am working on a PoC to understand the use cases of OL and how it build Lineages.

      As for the transport config I am using the codes from the documentation to setup OL. https://openlineage.io/docs/integrations/spark/quickstart_local

      @@ -106102,7 +106108,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 04:38:58
      -

*Thread Reply:* ok, I am wondering if what you experience isn't similar to issue #1860. Could you try openlineage 0.23.0 to see if you get the same error?

+

*Thread Reply:* ok, I am wondering if what you experience isn't similar to issue #1860. Could you try openlineage 0.23.0 to see if you get the same error?

      <https://github.com/OpenLineage/OpenLineage/issues/1860>

      @@ -106130,7 +106136,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-19 10:05:59
      -

      *Thread Reply:* I tried with 0.23.0 still getting the same error

      +

      *Thread Reply:* I tried with 0.23.0 still getting the same error

      @@ -106156,7 +106162,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-23 02:34:52
      -

      *Thread Reply:* @Paweł Leszczyński any other way I can try to setup. The issue still persists

      +

      *Thread Reply:* @Paweł Leszczyński any other way I can try to setup. The issue still persists

      @@ -106182,7 +106188,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-29 03:53:04
      -

*Thread Reply:* hmyy, I've just redone the steps from https://openlineage.io/docs/integrations/spark/quickstart_local with 0.26.0 and could not reproduce the behaviour you encountered.

+

*Thread Reply:* hmyy, I've just redone the steps from https://openlineage.io/docs/integrations/spark/quickstart_local with 0.26.0 and could not reproduce the behaviour you encountered.

      openlineage.io
      @@ -106259,7 +106265,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-22 16:34:00
      -

      *Thread Reply:* Tom, thanks for your question. This is really exciting! I assume you’ve already started checking out the docs, but there are many other resources on the website, as well (on the blog and resources pages in particular). And don’t skip the YouTube channel, where we’ve recently started to upload short, more digestible excerpts from the community meetings. Please keep us updated as you make progress!

      +

      *Thread Reply:* Tom, thanks for your question. This is really exciting! I assume you’ve already started checking out the docs, but there are many other resources on the website, as well (on the blog and resources pages in particular). And don’t skip the YouTube channel, where we’ve recently started to upload short, more digestible excerpts from the community meetings. Please keep us updated as you make progress!

      @@ -106289,7 +106295,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-22 16:48:06
      -

      *Thread Reply:* Hi Michael! Thank you so much for sending these resources! I've been working on this thesis for quite some time already and it's almost finished. I just needed some additional information to help in accurately describing some of the processes in OpenLineage and Marquez. Will send you the case study chapter later this week to get some feedback if possible. Keep you posted on things such as publication! Perhaps it can make OpenLineage even more popular than it already is 😉

      +

      *Thread Reply:* Hi Michael! Thank you so much for sending these resources! I've been working on this thesis for quite some time already and it's almost finished. I just needed some additional information to help in accurately describing some of the processes in OpenLineage and Marquez. Will send you the case study chapter later this week to get some feedback if possible. Keep you posted on things such as publication! Perhaps it can make OpenLineage even more popular than it already is 😉

      @@ -106315,7 +106321,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-22 16:52:18
      -

      *Thread Reply:* Yes, please share it! Looking forward to checking it out. Super cool!

      +

      *Thread Reply:* Yes, please share it! Looking forward to checking it out. Super cool!

      @@ -106371,7 +106377,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-22 10:04:44
      -

      *Thread Reply:* Thank you so much Ernie! That sounds like a very interesting direction to keep in mind during research!

      +

      *Thread Reply:* Thank you so much Ernie! That sounds like a very interesting direction to keep in mind during research!

      @@ -106496,7 +106502,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-23 07:24:09
      -

      *Thread Reply:* yeah, it's a bug

      +

      *Thread Reply:* yeah, it's a bug

      @@ -106522,7 +106528,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-23 12:00:48
      -

      *Thread Reply:* so it's optional then? 😄 or bug in the example?

      +

      *Thread Reply:* so it's optional then? 😄 or bug in the example?

      @@ -106686,7 +106692,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-23 19:01:43
      -

      *Thread Reply:* @Michael Collado wondering if you could shed some light here

      +

      *Thread Reply:* @Michael Collado wondering if you could shed some light here

      @@ -106712,7 +106718,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-24 05:39:01
      -

*Thread Reply:* can you show the logical plan of your Spark job? I think using hive is not the most important part, but whether the job's LogicalPlan parses to CreateHiveTableAsSelectCommand or InsertIntoHiveTable

+

*Thread Reply:* can you show the logical plan of your Spark job? I think using hive is not the most important part, but whether the job's LogicalPlan parses to CreateHiveTableAsSelectCommand or InsertIntoHiveTable
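(If you need to capture that plan, explain(extended=True) on the job's final DataFrame prints the parsed, analyzed, optimized, and physical plans; the table names below are the ones from this thread.)

```
# Prints the parsed, analyzed, optimized, and physical plans to stdout,
# which is the detail being asked for above.
df = spark.table("default.tableA").union(spark.table("default.tableB"))
df.explain(extended=True)
```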

      @@ -106738,7 +106744,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-24 19:37:02
      -

*Thread Reply:* It parses into InsertIntoHadoopFsRelationCommand. example

+

*Thread Reply:* It parses into InsertIntoHadoopFsRelationCommand. example
== Optimized Logical Plan ==
InsertIntoHadoopFsRelationCommand <s3a://uchmsdev03/default/sharanyaOutputTable>, false, [id#89], Parquet, [serialization.format=1, mergeSchema=false, partitionOverwriteMode=dynamic], Append, CatalogTable(
Database: default

@@ -106793,7 +106799,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-24 19:39:54
      -

*Thread Reply:* using spark 3.2 & this is the query

+

*Thread Reply:* using spark 3.2 & this is the query
spark.sql(s"INSERT INTO default.sharanyaOutput select * from (SELECT * from default.tableA union all " +
  s"select * from default.tableB)")
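
A quick way to reproduce this kind of check is Spark's own explain output. A minimal sketch, reusing the hypothetical table names from this thread and assuming an existing SparkSession named spark:

# Prints the parsed, analyzed, and optimized logical plans plus the physical plan;
# the optimized plan's root node (e.g. InsertIntoHadoopFsRelationCommand) is the
# node the OpenLineage Spark integration matches its dataset builders against.
spark.sql(
    "INSERT INTO default.sharanyaOutput "
    "SELECT * FROM default.tableA UNION ALL SELECT * FROM default.tableB"
).explain(extended=True)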

      @@ -106847,7 +106853,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-24 05:37:06
      -

*Thread Reply:* I think we can't really get it from the Spark context, as Spark jobs are submitted in compiled JAR form, instead of plain text like, for example, Airflow DAGs.

      +

*Thread Reply:* I think we can't really get it from the Spark context, as Spark jobs are submitted in compiled JAR form, instead of plain text like, for example, Airflow DAGs.

      @@ -106873,7 +106879,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-25 02:15:35
      -

*Thread Reply:* How about a Jupyter Notebook-based Spark job?

      +

*Thread Reply:* How about a Jupyter Notebook-based Spark job?

      @@ -106899,7 +106905,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-25 08:44:18
      -

      *Thread Reply:* I don't think it changes much - but maybe @Paweł Leszczyński knows more

      +

      *Thread Reply:* I don't think it changes much - but maybe @Paweł Leszczyński knows more

      @@ -106982,7 +106988,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-25 12:00:04
      -

*Thread Reply:* We developed a FileTransport class in order to save our metrics locally in a JSON file, if you're interested

      +

*Thread Reply:* We developed a FileTransport class in order to save our metrics locally in a JSON file, if you're interested

      @@ -107008,7 +107014,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-25 12:00:37
      -

      *Thread Reply:* Does it also save the openlineage information, e.g. inputs/outputs?

      +

      *Thread Reply:* Does it also save the openlineage information, e.g. inputs/outputs?

      @@ -107034,7 +107040,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-25 12:02:07
      -

*Thread Reply:* yes, it saves all the JSON information, inputs/outputs included

      +

*Thread Reply:* yes, it saves all the JSON information, inputs/outputs included

      @@ -107060,7 +107066,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-25 12:03:03
      -

      *Thread Reply:* Yes! then I am very interested. Is there guidance on how to use the FileTransport class?

      +

      *Thread Reply:* Yes! then I am very interested. Is there guidance on how to use the FileTransport class?

      @@ -107086,7 +107092,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-25 13:06:22
      -

*Thread Reply:* @alexandre bergere it would be a pretty useful contribution if you can submit it 🙂

      +

*Thread Reply:* @alexandre bergere it would be a pretty useful contribution if you can submit it 🙂

      @@ -107116,7 +107122,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-25 13:08:28
      -

*Thread Reply:* We are using it on a transformed OpenLineage library we developed! I'm going to make a PR in order to share it with you :)

      +

*Thread Reply:* We are using it on a transformed OpenLineage library we developed! I'm going to make a PR in order to share it with you :)

      @@ -107146,7 +107152,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-25 13:56:48
      -

      *Thread Reply:* would be great to have. I had it in mind to implement as an enabler for databricks integration tests. great to hear that!

      +

      *Thread Reply:* would be great to have. I had it in mind to implement as an enabler for databricks integration tests. great to hear that!

      @@ -107172,7 +107178,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-29 08:19:46
      -

*Thread Reply:* PR sent: https://github.com/OpenLineage/OpenLineage/pull/1891 🙂

+

*Thread Reply:* PR sent: https://github.com/OpenLineage/OpenLineage/pull/1891 🙂
@Maciej Obuchowski could you tell me how to update the documentation once approved please?
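
A sketch of what configuring such a transport could look like once the PR lands, assuming it merges with a file transport type; the option names below are illustrative, so check the merged docs for the final ones:

# openlineage.yml: hypothetical configuration for a file-based transport
transport:
  type: file
  log_file_path: /tmp/openlineage/events.json   # events written locally as JSON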

      @@ -107261,7 +107267,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-29 08:36:21
      -

      *Thread Reply:* @alexandre bergere we have separate repo for website + docs: https://github.com/OpenLineage/docs

      +

      *Thread Reply:* @alexandre bergere we have separate repo for website + docs: https://github.com/OpenLineage/docs

      @@ -107376,7 +107382,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-26 13:08:56
      -

      *Thread Reply:* I wonder if this blog post might help? https://openlineage.io/blog/openlineage-microsoft-purview

      +

      *Thread Reply:* I wonder if this blog post might help? https://openlineage.io/blog/openlineage-microsoft-purview

      openlineage.io
      @@ -107423,7 +107429,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-26 16:13:38
      -

      *Thread Reply:* This might not fully match your use case, either, but might help: https://learn.microsoft.com/en-us/samples/microsoft/purview-adb-lineage-solution-accelerator/azure-databricks-to-purview-lineage-connector/

      +

      *Thread Reply:* This might not fully match your use case, either, but might help: https://learn.microsoft.com/en-us/samples/microsoft/purview-adb-lineage-solution-accelerator/azure-databricks-to-purview-lineage-connector/

      learn.microsoft.com
      @@ -107474,7 +107480,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-01 23:23:49
      -

      *Thread Reply:* Thanks @Michael Robinson

      +

      *Thread Reply:* Thanks @Michael Robinson

      @@ -107526,7 +107532,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-29 05:04:32
      -

*Thread Reply:* I think, for now, such a thing is not defined other than by the implementation of consumers.

      +

*Thread Reply:* I think, for now, such a thing is not defined other than by the implementation of consumers.

      @@ -107552,7 +107558,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-05-30 10:32:09
      -

      *Thread Reply:* Any reason for that?

      +

      *Thread Reply:* Any reason for that?

      @@ -107578,7 +107584,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-01 10:25:33
      -

*Thread Reply:* The idea is that, for a particular run, facets can be attached to any event type.

      +

*Thread Reply:* The idea is that, for a particular run, facets can be attached to any event type.

This has advantages: for example, a job that modifies a dataset it's also reading from can get the particular version of the dataset it's reading from and attach it on START; it would not work if you tried to do it on COMPLETE, as the dataset would have changed by then.
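
A minimal sketch of that pattern with the Python client; the namespace, job, and dataset names are made up, and the Marquez URL is a placeholder:

from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="<http://localhost:5000>")
run = Run(runId=str(uuid4()))
job = Job(namespace="example", name="rewrite_table")
producer = "<https://example.com/jobs/rewrite_table>"

# START: attach input facets here, while the dataset version being read still exists.
client.emit(RunEvent(RunState.START, datetime.now(timezone.utc).isoformat(),
                     run, job, producer,
                     inputs=[Dataset(namespace="example", name="db.table")]))

# COMPLETE: output facets belong here, once the write has actually happened.
client.emit(RunEvent(RunState.COMPLETE, datetime.now(timezone.utc).isoformat(),
                     run, job, producer,
                     outputs=[Dataset(namespace="example", name="db.table")]))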

      @@ -107659,7 +107665,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-01 10:18:48
      -

      *Thread Reply:* I think it should be output-only, yes.

      +

      *Thread Reply:* I think it should be output-only, yes.

      @@ -107685,7 +107691,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-01 10:19:14
      -

      *Thread Reply:* @Paweł Leszczyński what do you think?

      +

      *Thread Reply:* @Paweł Leszczyński what do you think?

      @@ -107711,7 +107717,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 08:35:13
      -

      *Thread Reply:* yes, should be output only I think

      +

      *Thread Reply:* yes, should be output only I think

      @@ -107737,7 +107743,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-05 13:39:07
      -

      *Thread Reply:* should we move it over then? 😄

      +

      *Thread Reply:* should we move it over then? 😄

      @@ -107763,7 +107769,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-05 13:39:31
      -

      *Thread Reply:* under Output Dataset Facets that is

      +

      *Thread Reply:* under Output Dataset Facets that is

      @@ -107828,7 +107834,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-01 14:23:17
      -

      *Thread Reply:* Correction: Julien and Willy’s talk at Data+AI Summit will take place on June 28

      +

      *Thread Reply:* Correction: Julien and Willy’s talk at Data+AI Summit will take place on June 28

      @@ -107888,7 +107894,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-02 10:30:39
      -

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated on Monday in accordance with our policy here.

      +

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated on Monday in accordance with our policy here.

      @@ -108068,7 +108074,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-05 15:47:08
      -

      *Thread Reply:* @Bernat Gabor thanks for the PR!

      +

      *Thread Reply:* @Bernat Gabor thanks for the PR!

      @@ -108124,7 +108130,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-06 10:58:23
      -

      *Thread Reply:* Thanks @Maciej Obuchowski. The release is authorized and will be initiated as soon as possible.

      +

      *Thread Reply:* Thanks @Maciej Obuchowski. The release is authorized and will be initiated as soon as possible.

      @@ -108412,7 +108418,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 12:39:20
      -

      *Thread Reply:* Do you get any output in your devtools? I just ran into this yesterday and it looks like it’s related to this issue: https://github.com/MarquezProject/marquez/issues/2410

      +

      *Thread Reply:* Do you get any output in your devtools? I just ran into this yesterday and it looks like it’s related to this issue: https://github.com/MarquezProject/marquez/issues/2410

      @@ -108476,7 +108482,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 12:40:01
      -

      *Thread Reply:* Seems like more of a Marquez client-side issue than something with OL

      +

      *Thread Reply:* Seems like more of a Marquez client-side issue than something with OL

      @@ -108502,7 +108508,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 12:43:02
      -

*Thread Reply:* ohh but if I try using the console output, it throws ClientProtocolError

      +

*Thread Reply:* ohh but if I try using the console output, it throws ClientProtocolError

      @@ -108537,7 +108543,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 12:43:41
      -

      *Thread Reply:* Sorry I mean in the dev console of your web browser

      +

      *Thread Reply:* Sorry I mean in the dev console of your web browser

      @@ -108563,7 +108569,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 12:44:43
      -

      *Thread Reply:* this is the dev console in browser

      +

      *Thread Reply:* this is the dev console in browser

      @@ -108598,7 +108604,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 12:47:59
      -

      *Thread Reply:* Seems like it’s coming from this line. Are there any job facets defined when you fetch from the API directly? That seems like kind of an old version of OL so maybe the schema is incompatible with the version Marquez is expecting

      +

      *Thread Reply:* Seems like it’s coming from this line. Are there any job facets defined when you fetch from the API directly? That seems like kind of an old version of OL so maybe the schema is incompatible with the version Marquez is expecting

      @@ -108649,7 +108655,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 12:51:21
      -

      *Thread Reply:* from pyspark.sql import SparkSession

      +

      *Thread Reply:* from pyspark.sql import SparkSession

spark = (SparkSession.builder.master('local')
    .appName('sample_spark')

@@ -108690,7 +108696,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 12:51:49
      -

*Thread Reply:* This is my only code, I haven't done anything apart from this

      +

*Thread Reply:* This is my only code, I haven't done anything apart from this
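
For reference, a minimal end-to-end configuration along the lines discussed in this thread; the Marquez URL and namespace are placeholders, and the exact config keys for your OpenLineage version are listed in the Spark integration docs:

from pyspark.sql import SparkSession

spark = (SparkSession.builder.master('local')
    .appName('sample_spark')
    .config('spark.jars.packages', 'io.openlineage:openlineage_spark:0.27.1')
    .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
    .config('spark.openlineage.transport.type', 'http')
    .config('spark.openlineage.transport.url', '<http://localhost:5000>')
    .config('spark.openlineage.namespace', 'spark_integration')
    .getOrCreate())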

      @@ -108716,7 +108722,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 12:52:30
      -

      *Thread Reply:* I would try a more recent version of OL. Looks like you’re using 0.12.0 and I think the project is on 0.27.x currently

      +

      *Thread Reply:* I would try a more recent version of OL. Looks like you’re using 0.12.0 and I think the project is on 0.27.x currently

      @@ -108742,7 +108748,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 12:55:07
      -

*Thread Reply:* so I should change io.openlineage:openlineage_spark:0.12.0 to io.openlineage:openlineage_spark:0.27.1?

      +

*Thread Reply:* so I should change io.openlineage:openlineage_spark:0.12.0 to io.openlineage:openlineage_spark:0.27.1?

      @@ -108772,7 +108778,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:10:03
      -

*Thread Reply:* it executed well, but I'm unable to see it in Marquez

      +

*Thread Reply:* it executed well, but I'm unable to see it in Marquez

      @@ -108798,7 +108804,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:18:16
      -

*Thread Reply:* Marquez didn't get updated

      +

*Thread Reply:* Marquez didn't get updated

      @@ -108833,7 +108839,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:20:44
      -

*Thread Reply:* I am actually doing a POC on OpenLineage to find table- and column-level lineage for my team at Amazon.

+

*Thread Reply:* I am actually doing a POC on OpenLineage to find table- and column-level lineage for my team at Amazon. If this goes through, the team could use OpenLineage to track data lineage on a larger scale.

      @@ -108860,7 +108866,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:24:49
      -

      *Thread Reply:* Maybe marquez is still pulling the data from the previous run using the old OL version. Do you still get the same error in the browser console? Do you get the same result if you rebuild and start with a clean marquez db?

      +

      *Thread Reply:* Maybe marquez is still pulling the data from the previous run using the old OL version. Do you still get the same error in the browser console? Do you get the same result if you rebuild and start with a clean marquez db?

      @@ -108886,7 +108892,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:25:10
      -

      *Thread Reply:* yes i did that as well

      +

      *Thread Reply:* yes i did that as well

      @@ -108912,7 +108918,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:25:49
      -

*Thread Reply:* the error was present only once you clicked on any of the jobs in Marquez,

+

*Thread Reply:* the error was present only once you clicked on any of the jobs in Marquez, and since my job isn't showing up I can't check for the error itself

      @@ -108939,7 +108945,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:26:29
      -

*Thread Reply:* docker run --network sparkdefault -p 3000:3000 -e MARQUEZ_HOST=marquez-api -e MARQUEZ_PORT=5000 --link marquez-api:marquez-api marquezproject/marquez-web:0.19.1

      +

*Thread Reply:* docker run --network sparkdefault -p 3000:3000 -e MARQUEZ_HOST=marquez-api -e MARQUEZ_PORT=5000 --link marquez-api:marquez-api marquezproject/marquez-web:0.19.1

      used this to rebuild marquez

      @@ -108967,7 +108973,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:26:54
      -

      *Thread Reply:* That’s odd, sorry, that’s probably the most I can help, I’m kinda new to OL/Marquez as well 😅

      +

      *Thread Reply:* That’s odd, sorry, that’s probably the most I can help, I’m kinda new to OL/Marquez as well 😅

      @@ -108993,7 +108999,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:27:41
      -

*Thread Reply:* no problem, can you refer me to someone who would know, so that I can ask them?

      +

*Thread Reply:* no problem, can you refer me to someone who would know, so that I can ask them?

      @@ -109019,7 +109025,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:29:25
      -

*Thread Reply:* Actually, looking at it now, I think you're using a slightly outdated version of marquez-web too. I would update that tag to at least 0.33.0; that's what I'm using

      +

*Thread Reply:* Actually, looking at it now, I think you're using a slightly outdated version of marquez-web too. I would update that tag to at least 0.33.0; that's what I'm using

      @@ -109045,7 +109051,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:30:10
      -

*Thread Reply:* Other than that, I would ask in the Marquez Slack channel or raise an issue on GitHub in that project. Seems like more of an issue with Marquez, since at least some data is rendering in the UI initially

      +

*Thread Reply:* Other than that, I would ask in the Marquez Slack channel or raise an issue on GitHub in that project. Seems like more of an issue with Marquez, since at least some data is rendering in the UI initially

      @@ -109071,7 +109077,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:32:58
      -

*Thread Reply:* nope, that version also didn't help

      +

*Thread Reply:* nope, that version also didn't help

      @@ -109097,7 +109103,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:33:19
      -

      *Thread Reply:* can you share their slack link?

      +

      *Thread Reply:* can you share their slack link?

      @@ -109123,7 +109129,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:34:52
      -

      *Thread Reply:* http://bit.ly/MarquezSlack

      +

      *Thread Reply:* http://bit.ly/MarquezSlack

      @@ -109149,7 +109155,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 13:35:08
      -

      *Thread Reply:* that link is no longer active

      +

      *Thread Reply:* that link is no longer active

      @@ -109175,7 +109181,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-09 18:44:25
      -

      *Thread Reply:* Hello @Rachana Gandhi could you point to the doc where you found the example .config(‘spark.jars.packages’, ‘io.openlineage:openlineage_spark:0.12.0’) ? We should update it to have the latest version instead.

      +

      *Thread Reply:* Hello @Rachana Gandhi could you point to the doc where you found the example .config(‘spark.jars.packages’, ‘io.openlineage:openlineage_spark:0.12.0’) ? We should update it to have the latest version instead.

      @@ -109201,7 +109207,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-09 18:54:49
      -

      *Thread Reply:* https://openlineage.io/docs/integrations/spark/quickstart_local/

      +

      *Thread Reply:* https://openlineage.io/docs/integrations/spark/quickstart_local/

      openlineage.io
      @@ -109248,7 +109254,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-09 18:59:17
      -

      *Thread Reply:* https://openlineage.io/docs/guides/spark

      +

      *Thread Reply:* https://openlineage.io/docs/guides/spark

      also the docker compose here has an earlier version of marquez

      MARQUEZAPIKEY=[YOURAPIKEY]
      2023-07-13 17:00:54
      -

*Thread Reply:* Facing the same issue with my initial POC. Did we get any solution for this?

      +

*Thread Reply:* Facing the same issue with my initial POC. Did we get any solution for this?

      @@ -109355,7 +109361,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 14:43:55
      -

      *Thread Reply:* Requesting a release? 3 +1s from committers will authorize. More info here: https://github.com/OpenLineage/OpenLineage/blob/main/GOVERNANCE.md

      +

      *Thread Reply:* Requesting a release? 3 +1s from committers will authorize. More info here: https://github.com/OpenLineage/OpenLineage/blob/main/GOVERNANCE.md

      @@ -109381,7 +109387,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 14:44:14
      -

      *Thread Reply:* Yeah, that one 😊

      +

      *Thread Reply:* Yeah, that one 😊

      @@ -109407,7 +109413,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 14:44:44
      -

      *Thread Reply:* Because the python client is broken as is today without a new release

      +

      *Thread Reply:* Because the python client is broken as is today without a new release

      @@ -109437,7 +109443,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 18:45:04
      -

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated by EOB next Tuesday, but in all likelihood well before then.

      +

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated by EOB next Tuesday, but in all likelihood well before then.

      @@ -109463,7 +109469,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-08 19:06:34
      -

      *Thread Reply:* cool

      +

      *Thread Reply:* cool

      @@ -109588,7 +109594,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-13 04:44:28
      -

*Thread Reply:* Do you mean to just log events to a logging backend?

      +

*Thread Reply:* Do you mean to just log events to a logging backend?

      @@ -109614,7 +109620,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-13 04:54:30
      -

      *Thread Reply:* Hmm more like have a separate logging config for sending all the logs to a backend

      +

      *Thread Reply:* Hmm more like have a separate logging config for sending all the logs to a backend

      @@ -109640,7 +109646,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-13 04:54:38
      -

*Thread Reply:* Not the events themselves

      +

*Thread Reply:* Not the events themselves

      @@ -109666,7 +109672,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-13 05:01:10
      -

      *Thread Reply:* @Anirudh Shrinivason with Spark integration?

      +

      *Thread Reply:* @Anirudh Shrinivason with Spark integration?

      @@ -109692,7 +109698,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-13 05:01:59
      -

      *Thread Reply:* It uses slf4j so you should be able to set up your log4j logger

      +

      *Thread Reply:* It uses slf4j so you should be able to set up your log4j logger

      @@ -109718,7 +109724,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-13 05:10:55
      -

      *Thread Reply:* Yeah with the spark integration. Ahh I see. Okay sure thanks!

      +

      *Thread Reply:* Yeah with the spark integration. Ahh I see. Okay sure thanks!

      @@ -109744,7 +109750,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-21 23:21:14
      -

*Thread Reply:* ~Hi @Maciej Obuchowski May I know what class path I should be using for setting up log4j if I want to set it up for OL-related logs? Is there some guide or runbook for setting up log4j with OL? Thanks!~

+

*Thread Reply:* ~Hi @Maciej Obuchowski May I know what class path I should be using for setting up log4j if I want to set it up for OL-related logs? Is there some guide or runbook for setting up log4j with OL? Thanks!~
Nvm lol found it! 🙂
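
For anyone else looking for it: with the default Spark logging setup this comes down to raising the log level for the OpenLineage packages. A minimal sketch for log4j 1.x, which Spark 3.2 and earlier ship with; log4j2-based Spark versions use the equivalent log4j2.properties syntax:

# conf/log4j.properties: surface OpenLineage integration logs
log4j.logger.io.openlineage=DEBUG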

      @@ -109832,7 +109838,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-14 10:42:41
      -

*Thread Reply:* Hello, is anyone who has recently installed the latest version of marquez/open-lineage-spark using the docker image available to help Vamshi and me or provide any pointers? Thank you

      +

*Thread Reply:* Hello, is anyone who has recently installed the latest version of marquez/open-lineage-spark using the docker image available to help Vamshi and me or provide any pointers? Thank you

      @@ -109858,7 +109864,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-15 03:38:38
      -

*Thread Reply:* if you're working on a Mac, you may have an issue related to port 5000. The instructions here https://github.com/MarquezProject/marquez#quickstart provide a workaround for that: ./docker/up.sh --api-port 9000

      +

*Thread Reply:* if you're working on a Mac, you may have an issue related to port 5000. The instructions here https://github.com/MarquezProject/marquez#quickstart provide a workaround for that: ./docker/up.sh --api-port 9000

      @@ -109884,7 +109890,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-15 08:43:33
      -

*Thread Reply:* @Paweł Leszczyński, thank you. We are using Ubuntu on an EC2 instance, and each time we run into different errors and are never able to access the application page, web server, or admin interface. We have run out of ideas for what else to try to get this setup up and running

      +

*Thread Reply:* @Paweł Leszczyński, thank you. We are using Ubuntu on an EC2 instance, and each time we run into different errors and are never able to access the application page, web server, or admin interface. We have run out of ideas for what else to try to get this setup up and running

      @@ -109910,7 +109916,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-22 14:47:00
      -

      *Thread Reply:* @Michael Robinson Can you please help us here?

      +

      *Thread Reply:* @Michael Robinson Can you please help us here?

      @@ -109936,7 +109942,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-22 14:58:57
      -

      *Thread Reply:* @Vamshi krishna I’m sorry you’re still blocked. Thanks for the information about your system. Would you please share some of the errors you are getting? More details would help us reproduce and diagnose.

      +

      *Thread Reply:* @Vamshi krishna I’m sorry you’re still blocked. Thanks for the information about your system. Would you please share some of the errors you are getting? More details would help us reproduce and diagnose.

      @@ -109962,7 +109968,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-22 16:35:00
      -

*Thread Reply:* @Michael Robinson, thank you, Vamshi and I will share the errors that we are running into shortly

      +

*Thread Reply:* @Michael Robinson, thank you, Vamshi and I will share the errors that we are running into shortly

      @@ -109988,7 +109994,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 09:48:16
      -

*Thread Reply:* We are following the https://openlineage.io/getting-started/ guide and trying to set up Marquez on an Ubuntu EC2 instance. The following are the versions of Docker, Docker Compose, and Ubuntu

      +

*Thread Reply:* We are following the https://openlineage.io/getting-started/ guide and trying to set up Marquez on an Ubuntu EC2 instance. The following are the versions of Docker, Docker Compose, and Ubuntu

      openlineage.io
      @@ -110044,7 +110050,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 09:49:51
      -

*Thread Reply:* @Michael Robinson When we follow the documentation without changing anything and run sudo ./docker/up.sh, we see the following errors:

      +

*Thread Reply:* @Michael Robinson When we follow the documentation without changing anything and run sudo ./docker/up.sh, we see the following errors:

      @@ -110079,7 +110085,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:00:38
      -

*Thread Reply:* So, I edited the up.sh file, modified the docker compose command by removing the --log-level flag, ran sudo ./docker/up.sh, and found the following errors:

      +

*Thread Reply:* So, I edited the up.sh file, modified the docker compose command by removing the --log-level flag, ran sudo ./docker/up.sh, and found the following errors:

      @@ -110114,7 +110120,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:02:29
      -

*Thread Reply:* Then I copied .env.example to .env, since compose needs a .env file

      +

*Thread Reply:* Then I copied .env.example to .env, since compose needs a .env file

      @@ -110149,7 +110155,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:05:04
      -

      *Thread Reply:* I got this error:

      +

      *Thread Reply:* I got this error:

      @@ -110184,7 +110190,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:09:24
      -

*Thread Reply:* since I am getting timeouts, I thought it might be an issue with the proxy. So, I followed this doc: https://stackoverflow.com/questions/58841014/set-proxy-on-docker, added my outbound proxy, and tried again

      +

*Thread Reply:* since I am getting timeouts, I thought it might be an issue with the proxy. So, I followed this doc: https://stackoverflow.com/questions/58841014/set-proxy-on-docker, added my outbound proxy, and tried again

      Stack Overflow
      @@ -110240,7 +110246,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:23:46
      -

*Thread Reply:* @Michael Robinson Then it kind of worked, but I'm seeing the following errors:

      +

*Thread Reply:* @Michael Robinson Then it kind of worked, but I'm seeing the following errors:

      @@ -110275,7 +110281,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:24:31
      -

      *Thread Reply:*

      +

      *Thread Reply:*

      @@ -110310,7 +110316,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:25:29
      -

*Thread Reply:* @Michael Robinson @Paweł Leszczyński Can you please look at the above steps and let us know what we are missing/doing wrong? I appreciate your help and time.

      +

*Thread Reply:* @Michael Robinson @Paweł Leszczyński Can you please look at the above steps and let us know what we are missing/doing wrong? I appreciate your help and time.

      @@ -110336,7 +110342,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:45:39
      -

      *Thread Reply:* The latest errors look to me like they’re being caused by postgres and might reflect a port conflict. Are you using the default port for the API (5000)? You might try using a different port. More info about this in the Marquez readme: https://github.com/MarquezProject/marquez/blob/0.35.0/README.md.

      +

      *Thread Reply:* The latest errors look to me like they’re being caused by postgres and might reflect a port conflict. Are you using the default port for the API (5000)? You might try using a different port. More info about this in the Marquez readme: https://github.com/MarquezProject/marquez/blob/0.35.0/README.md.

      @@ -110362,7 +110368,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:46:55
      -

*Thread Reply:* Yes we are using the default ports:

+

*Thread Reply:* Yes we are using the default ports:
API_PORT=5000
API_ADMIN_PORT=5001
WEB_PORT=3000

@@ -110392,7 +110398,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:47:40
      -

      *Thread Reply:* We see these postgres permission issues only occasionally. Other times we only see db and api containers up but not the web

      +

      *Thread Reply:* We see these postgres permission issues only occasionally. Other times we only see db and api containers up but not the web

      @@ -110418,7 +110424,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:52:38
      -

      *Thread Reply:* I would try running ./docker/up.sh --api-port 9000 (see Pawel’s message above for more context.)

      +

      *Thread Reply:* I would try running ./docker/up.sh --api-port 9000 (see Pawel’s message above for more context.)

      @@ -110448,7 +110454,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:54:18
      -

      *Thread Reply:* Still no luck. Seeing same errors.

      +

      *Thread Reply:* Still no luck. Seeing same errors.

2023-06-23 14:53:23.971 GMT [1] LOG: could not open configuration file "/etc/postgresql/postgresql.conf": Permission denied
marquez-db | 2023-06-23 14:53:23.971 GMT [1] FATAL: configuration file "/etc/postgresql/postgresql.conf" contains errors
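
Standard Docker commands that can help narrow this kind of failure down, using the container names shown in the thread:

docker ps -a                 # is marquez-db running, or did it exit immediately?
docker logs marquez-db       # full postgres startup log, including the permission error
docker volume ls             # stale volumes from earlier runs can carry over bad permissions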

      @@ -110477,7 +110483,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 10:54:43
      -

*Thread Reply:* ERROR [2023-06-23 14:53:42,269] org.apache.tomcat.jdbc.pool.ConnectionPool: Unable to create initial connections of pool.

+

*Thread Reply:* ERROR [2023-06-23 14:53:42,269] org.apache.tomcat.jdbc.pool.ConnectionPool: Unable to create initial connections of pool.
marquez-api | ! java.net.UnknownHostException: postgres
marquez-api | ! at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:567)
marquez-api | ! at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)

@@ -110541,7 +110547,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 11:06:32
      -

*Thread Reply:* Why do you run docker up with sudo? Some of your screenshots suggest docker is not able to access the docker registry. The last error java.net.UnknownHostException: postgres may just be a result of the container being down. Could you verify whether all the containers are up and running, and if not, what the error is? Are you able to test this docker up on your laptop or in another environment?

      +

*Thread Reply:* Why do you run docker up with sudo? Some of your screenshots suggest docker is not able to access the docker registry. The last error java.net.UnknownHostException: postgres may just be a result of the container being down. Could you verify whether all the containers are up and running, and if not, what the error is? Are you able to test this docker up on your laptop or in another environment?

      @@ -110567,7 +110573,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 11:08:34
      -

*Thread Reply:* Docker commands require sudo and cannot run as another user.

+

*Thread Reply:* Docker commands require sudo and cannot run as another user. The Postgres container is not coming up. It is failing with the following errors:

2023-06-23 14:53:23.971 GMT [1] LOG: could not open configuration file "/etc/postgresql/postgresql.conf": Permission denied

@@ -110597,7 +110603,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 11:10:19
      -

*Thread Reply:* and what does docker ps -a say about the postgres container? Why did it fail?

      +

*Thread Reply:* and what does docker ps -a say about the postgres container? Why did it fail?

      @@ -110623,7 +110629,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 11:11:36
      -

      *Thread Reply:*

      +

      *Thread Reply:*

      @@ -110658,7 +110664,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 11:25:17
      -

*Thread Reply:* hmm, no changes have been made to postgresql.conf on our side since August 2022. Did you apply any changes, or do you have a clean clone of the repo?

      +

*Thread Reply:* hmm, no changes have been made to postgresql.conf on our side since August 2022. Did you apply any changes, or do you have a clean clone of the repo?

      @@ -110684,7 +110690,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 11:29:46
      -

      *Thread Reply:* No we didn't make any changes

      +

      *Thread Reply:* No we didn't make any changes

      @@ -110710,7 +110716,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 11:32:21
      -

      *Thread Reply:* you did write earlier Note: I had to modify docker-compose command in up.sh as per docker compose V2.

      +

      *Thread Reply:* you did write earlier Note: I had to modify docker-compose command in up.sh as per docker compose V2.

      @@ -110736,7 +110742,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 11:34:54
      -

      *Thread Reply:* Yes all I did was modified this line: docker-compose --log-level ERROR $compose_files up $ARGS to +

      *Thread Reply:* Yes all I did was modified this line: docker-compose --log-level ERROR $compose_files up $ARGS to docker compose $compose_files up $ARGS since docker compose v2 doesn't support --log-level flag

      @@ -110763,7 +110769,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 11:37:03
      -

      *Thread Reply:* Let me pull an older version and try

      +

      *Thread Reply:* Let me pull an older version and try

      @@ -110789,7 +110795,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 12:09:43
      -

*Thread Reply:* Still no luck, same exact errors. Tried on a different Ubuntu instance; still seeing the same errors with postgres

      +

*Thread Reply:* Still no luck, same exact errors. Tried on a different Ubuntu instance; still seeing the same errors with postgres

      @@ -110815,7 +110821,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 15:06:32
      -

      *Thread Reply:* @Jeremy W

      +

      *Thread Reply:* @Jeremy W

      @@ -110867,7 +110873,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-15 10:49:42
      -

*Thread Reply:* > Or could there be cases where the column lineage, and any output information is only present in one of the events, but not the other?

+

*Thread Reply:* > Or could there be cases where the column lineage, and any output information is only present in one of the events, but not the other?
Yes. Generally, events regarding a single run are cumulative
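
In consumer terms, "cumulative" means facets for one run should be merged across its events rather than read from any single event. A hypothetical minimal merge:

def merge_run_events(events):
    """Combine cumulative OpenLineage events for a single run: later facets win,
    and inputs/outputs are unioned by (namespace, name)."""
    merged = {"run_facets": {}, "inputs": {}, "outputs": {}}
    for event in sorted(events, key=lambda e: e["eventTime"]):
        merged["run_facets"].update(event.get("run", {}).get("facets", {}))
        for key in ("inputs", "outputs"):
            for ds in event.get(key, []):
                merged[key][(ds["namespace"], ds["name"])] = ds
    return merged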

      @@ -110894,7 +110900,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-15 11:07:03
      -

      *Thread Reply:* Ahh I see... Is it fair to assume that if I see column lineage in a start event, it's the full column lineage? Or could it be possible that half the lineage is in the start event, and half the lineage is in the complete event?

      +

      *Thread Reply:* Ahh I see... Is it fair to assume that if I see column lineage in a start event, it's the full column lineage? Or could it be possible that half the lineage is in the start event, and half the lineage is in the complete event?

      @@ -110920,7 +110926,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-15 22:50:51
      -

      *Thread Reply:* Hi @Maciej Obuchowski just pinging in case you'd missed the above message. 🙇

      +

      *Thread Reply:* Hi @Maciej Obuchowski just pinging in case you'd missed the above message. 🙇

      @@ -110950,7 +110956,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-16 04:48:57
      -

      *Thread Reply:* Actually, in this case this definitely should not happen. @Paweł Leszczyński am I right?

      +

      *Thread Reply:* Actually, in this case this definitely should not happen. @Paweł Leszczyński am I right?

      @@ -110980,7 +110986,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-16 04:50:16
      -

*Thread Reply:* @Maciej Obuchowski yes, you're right

      +

*Thread Reply:* @Maciej Obuchowski yes, you're right

      @@ -111036,7 +111042,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-16 02:14:19
      -

      *Thread Reply:* ~No, we do have an open issue for that: https://github.com/OpenLineage/OpenLineage/issues/1758~

      +

      *Thread Reply:* ~No, we do have an open issue for that: https://github.com/OpenLineage/OpenLineage/issues/1758~

      @@ -111086,7 +111092,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-16 05:02:26
      -

*Thread Reply:* @nivethika R, I am sorry for the misleading response; we've merged a PR for that: https://github.com/OpenLineage/OpenLineage/pull/1636. It does not support select *, but besides that, it should be operational.

      +

*Thread Reply:* @nivethika R, I am sorry for the misleading response; we've merged a PR for that: https://github.com/OpenLineage/OpenLineage/pull/1636. It does not support select *, but besides that, it should be operational.

      Could you please try a query from our integration tests to verify if this is working for you or not: https://github.com/OpenLineage/OpenLineage/pull/1636/files#diff-137aa17091138b69681510e13e3b7d66aa9c9c7c81fe8fe13f09f0de76448dd5R46 ?

      MARQUEZAPIKEY=[YOURAPIKEY]
      2023-06-21 11:07:04
      -

      *Thread Reply:* Hi Nagendra, sorry you’re running into this error. We’re looking into it!

      +

      *Thread Reply:* Hi Nagendra, sorry you’re running into this error. We’re looking into it!

      @@ -111424,7 +111430,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-19 09:37:20
      -

*Thread Reply:* Hey @Anirudh Shrinivason, Paweł and I are at Berlin Buzzwords right now. Will definitely look at it later

      +

*Thread Reply:* Hey @Anirudh Shrinivason, Paweł and I are at Berlin Buzzwords right now. Will definitely look at it later

      @@ -111450,7 +111456,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-19 10:47:06
      -

      *Thread Reply:* Oh nice! Thanks!

      +

      *Thread Reply:* Oh nice! Thanks!

      @@ -111503,10 +111509,10 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-20 03:47:50
      -

*Thread Reply:* This is the event generated for the above query.

      +

*Thread Reply:* This is the event generated for the above query.

-

+

@@ -111577,7 +111583,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

this is the event for the view for which no lineage is being generated

-

+

@@ -111643,7 +111649,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-20 17:34:08
      -

      *Thread Reply:* Hi John, glad to hear you figured out a path forward on this! Please let us know what you learn 🙂

      +

      *Thread Reply:* Hi John, glad to hear you figured out a path forward on this! Please let us know what you learn 🙂

      @@ -111763,7 +111769,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-20 17:33:36
      -

      *Thread Reply:* Hi Harshini, I’ve alerted our resident Spark and column-lineage expert about this. Hope to have an answer for you soon.

      +

      *Thread Reply:* Hi Harshini, I’ve alerted our resident Spark and column-lineage expert about this. Hope to have an answer for you soon.

      @@ -111789,7 +111795,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-20 19:39:46
      -

      *Thread Reply:* Thank you Michael, looking forward to it

      +

      *Thread Reply:* Thank you Michael, looking forward to it

      @@ -111815,7 +111821,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-21 02:58:41
      -

*Thread Reply:* Hello @Harshini Devathi. An interesting topic which I have never thought about. The ordering of the fields we get for Spark apps comes from the Spark logical plans we extract information from, and we do not apply any sorting on them. So, if a Spark plan contains columns a, b, c, we trust that's the order of columns for the dataset and don't want to check it on our own.

      +

*Thread Reply:* Hello @Harshini Devathi. An interesting topic which I have never thought about. The ordering of the fields we get for Spark apps comes from the Spark logical plans we extract information from, and we do not apply any sorting on them. So, if a Spark plan contains columns a, b, c, we trust that's the order of columns for the dataset and don't want to check it on our own.

      @@ -111841,7 +111847,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-21 02:59:45
      -

*Thread Reply:* btw, please let us know how you obtain your lineage: within a Spark app or from some SQLs scheduled by Airflow?

      +

*Thread Reply:* btw, please let us know how you obtain your lineage: within a Spark app or from some SQLs scheduled by Airflow?

      @@ -111867,7 +111873,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-23 14:40:31
      -

*Thread Reply:* Hello @Paweł Leszczyński, thank you for the response. We do not need you to check the ordering specifically, but I assume that the Spark logical plan maintains the column order based on the input datasets. Can we retain that order by adding a column id or some sequence number which helps represent the lineage in the same order?

      +

*Thread Reply:* Hello @Paweł Leszczyński, thank you for the response. We do not need you to check the ordering specifically, but I assume that the Spark logical plan maintains the column order based on the input datasets. Can we retain that order by adding a column id or some sequence number which helps represent the lineage in the same order?

We are capturing the lineage using the Spark OpenLineage connector, by posting custom lineage to Marquez through API calls, and we are also in the process of leveraging the SQL connector feature using Airflow.

      @@ -111895,7 +111901,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-26 04:35:43
      -

*Thread Reply:* Hi @Harshini Devathi, are you asking about the schema facet within a dataset? This should have an order from the Spark logical plans. Or are you asking about the columnLineage facet? Or the Marquez API responses? It's not clear to me why you need it. Each column is identified by a dataset (dataset namespace + dataset name) and a field name. You can, on your side, generate a column id based on that and order columns by that id, but still I think I am missing the arguments for doing so.

      +

*Thread Reply:* Hi @Harshini Devathi, are you asking about the schema facet within a dataset? This should have an order from the Spark logical plans. Or are you asking about the columnLineage facet? Or the Marquez API responses? It's not clear to me why you need it. Each column is identified by a dataset (dataset namespace + dataset name) and a field name. You can, on your side, generate a column id based on that and order columns by that id, but still I think I am missing the arguments for doing so.

      @@ -112119,7 +112125,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-27 03:45:31
      -

      *Thread Reply:* Is it because the logLevel is set to WARN or ERROR?

      +

      *Thread Reply:* Is it because the logLevel is set to WARN or ERROR?

      @@ -112145,7 +112151,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-27 12:07:12
      -

*Thread Reply:* No, I set it to INFO; maybe I need to add some jars?

      +

*Thread Reply:* No, I set it to INFO; maybe I need to add some jars?

      @@ -112171,7 +112177,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-27 12:30:02
      -

      *Thread Reply:* Hmm have you set the relevant spark configs?

      +

      *Thread Reply:* Hmm have you set the relevant spark configs?

      @@ -112197,7 +112203,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-27 12:32:50
      -

*Thread Reply:* yep, I have http working. But not the console

+

*Thread Reply:* yep, I have http working. But not the console
spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.transport.type=console

      @@ -112225,7 +112231,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-27 12:35:27
      -

      *Thread Reply:* Oh wait http works but not console...

      +

      *Thread Reply:* Oh wait http works but not console...

      @@ -112251,7 +112257,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-27 12:37:02
      -

*Thread Reply:* If you want to see the console events which are emitted, then you need to set the logLevel to DEBUG

      +

*Thread Reply:* If you want to see the console events which are emitted, then you need to set the logLevel to DEBUG

      @@ -112277,7 +112283,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-27 12:37:44
      -

      *Thread Reply:* tried that too, still nothing

      +

      *Thread Reply:* tried that too, still nothing

      @@ -112303,7 +112309,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-27 12:38:54
      -

*Thread Reply:* Is the openlineage jar installed and added to the config?

      +

*Thread Reply:* Is the openlineage jar installed and added to the config?

      @@ -112329,7 +112335,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-27 12:39:09
      -

      *Thread Reply:* yep, that’s why http works

      +

      *Thread Reply:* yep, that’s why http works

      @@ -112355,7 +112361,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-27 12:39:26
      -

*Thread Reply:* the only thing I see in the logs is this:

+

*Thread Reply:* the only thing I see in the logs is this:
23/06/27 07:39:11 INFO SparkSQLExecutionContext: OpenLineage received Spark event that is configured to be skipped: SparkListenerJobEnd

      @@ -112382,7 +112388,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-27 12:40:59
      -

      *Thread Reply:* Hmm if an event is still emitted for this case, but logs not showing up then I'm not sure... Maybe someone with more knowledge on this can help

      +

      *Thread Reply:* Hmm if an event is still emitted for this case, but logs not showing up then I'm not sure... Maybe someone with more knowledge on this can help

      @@ -112408,7 +112414,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-27 12:42:37
      -

      *Thread Reply:* sure, thanks for trying @Anirudh Shrinivason

      +

      *Thread Reply:* sure, thanks for trying @Anirudh Shrinivason

      @@ -112434,7 +112440,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-28 05:23:36
      -

      *Thread Reply:* What job are you trying this on? If there's this message, then logging is working afaik

      +

      *Thread Reply:* What job are you trying this on? If there's this message, then logging is working afaik

      @@ -112460,7 +112466,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-28 12:16:52
      -

*Thread Reply:* Hi @Maciej Obuchowski Actually, I also noticed a similar issue... For some Spark pipelines, the log level is set to debug, but I'm not seeing any events being logged. I am, however, receiving these events in the backend. Has any of the logging been removed from some places?

      +

*Thread Reply:* Hi @Maciej Obuchowski Actually, I also noticed a similar issue... For some Spark pipelines, the log level is set to debug, but I'm not seeing any events being logged. I am, however, receiving these events in the backend. Has any of the logging been removed from some places?

      @@ -112486,7 +112492,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-28 20:57:45
      -

*Thread Reply:* yep, exactly the same thing here too @Maciej Obuchowski, I can get the events on http, but changing to console gets me nothing from ConsoleTransport.

      +

*Thread Reply:* yep, exactly the same thing here too @Maciej Obuchowski, I can get the events on http, but changing to console gets me nothing from ConsoleTransport.

      @@ -112564,7 +112570,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-29 05:57:59
      -

      *Thread Reply:* Hi @Anirudh Shrinivason, could you disable serializing spark.logicalPlan to see if the behaviour is the same?

      +

      *Thread Reply:* Hi @Anirudh Shrinivason, could you disable serializing spark.logicalPlan to see if the behaviour is the same?

      @@ -112590,7 +112596,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-29 05:58:28
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark -> spark.openlineage.facets.disabled -> [spark_unknown;spark.logicalPlan]

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark -> spark.openlineage.facets.disabled -> [spark_unknown;spark.logicalPlan]
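
Concretely, that's one extra setting on the job, for example as a spark-submit flag:

--conf spark.openlineage.facets.disabled=[spark_unknown;spark.logicalPlan]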

      @@ -112616,7 +112622,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-29 05:59:55
      -

*Thread Reply:* We do serialize the logicalPlan because this is useful in many cases, but sometimes it can lead to serializing things that shouldn't be serialized

      +

*Thread Reply:* We do serialize the logicalPlan because this is useful in many cases, but sometimes it can lead to serializing things that shouldn't be serialized

      @@ -112642,7 +112648,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-29 15:49:35
      -

      *Thread Reply:* Ahh I see. Yeah okay let me try that

      +

      *Thread Reply:* Ahh I see. Yeah okay let me try that

      @@ -112704,7 +112710,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-06-30 08:05:53
      -

      *Thread Reply:* Thanks, all. The release is authorized.

      +

      *Thread Reply:* Thanks, all. The release is authorized.

      @@ -112929,7 +112935,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-07 10:37:49
      -

      *Thread Reply:* thanks for watching and sharing! the recording is also on youtube 😉 https://www.youtube.com/watch?v=rO3BPqUtWrI

      +

      *Thread Reply:* thanks for watching and sharing! the recording is also on youtube 😉 https://www.youtube.com/watch?v=rO3BPqUtWrI

      YouTube
      @@ -112989,7 +112995,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-07 10:38:01
      -

      *Thread Reply:* ❤️

      +

      *Thread Reply:* ❤️

      @@ -113015,7 +113021,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-08 13:35:10
      -

      *Thread Reply:* Very much agree. I’ve even forwarded to a few people here and there, those who I think should learn about it.

      +

      *Thread Reply:* Very much agree. I’ve even forwarded to a few people here and there, those who I think should learn about it.

      @@ -113045,7 +113051,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-08 13:47:17
      -

*Thread Reply:* You're both too kind :)

+

*Thread Reply:* You're both too kind :)
Thank you for your support and being part of the community.

      @@ -113162,7 +113168,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-11 05:46:36
      -

*Thread Reply:* I think yes, this is a great use case. @Paweł Leszczyński is more familiar with the Spark integration code than I am.

+

*Thread Reply:* I think yes, this is a great use case. @Paweł Leszczyński is more familiar with the Spark integration code than I am.
I think in this case, we would add the datasetVersion facet with the underlying Iceberg version: https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/DatasetVersionDatasetFacet.json
We extract this information in a few places: https://github.com/search?q=repo%3AOpenLineage%2FOpenLineage%20%20DatasetVersionDatasetFacet&type=code

      @@ -113215,7 +113221,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-11 05:57:17
      -

*Thread Reply:* Yes, we do have datasetVersion, which captures the Iceberg or Delta version of input and output datasets. Input versions are collected on START, while output versions are collected on COMPLETE, in case a job reads and writes to the same dataset. So, even though the column-lineage facet is missing the version, it should be available within events related to a particular run.

      +

*Thread Reply:* Yes, we do have datasetVersion, which captures the Iceberg or Delta version of input and output datasets. Input versions are collected on START, while output versions are collected on COMPLETE, in case a job reads and writes to the same dataset. So, even though the column-lineage facet is missing the version, it should be available within events related to a particular run.

If it is not, then perhaps the issue here is the lack of support for the as of syntax. As far as I remember, we always get the current version of a dataset, and this may be the missing part here.

      @@ -113243,7 +113249,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-11 05:58:49
      -

*Thread Reply:* link to the method that gets the dataset version for Iceberg: https://github.com/OpenLineage/OpenLineage/blob/0.29.2/integration/spark/spark3/sr[…]lineage/spark3/agent/lifecycle/plan/catalog/IcebergHandler.java

      +

*Thread Reply:* link to the method that gets the dataset version for Iceberg: https://github.com/OpenLineage/OpenLineage/blob/0.29.2/integration/spark/spark3/sr[…]lineage/spark3/agent/lifecycle/plan/catalog/IcebergHandler.java

      @@ -113294,7 +113300,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-11 10:57:26
      -

*Thread Reply:* Thank you @Julien Le Dem and @Paweł Leszczyński

+

*Thread Reply:* Thank you @Julien Le Dem and @Paweł Leszczyński
Based on what I've seen so far, indeed it seems that only the current snapshot is tracked when IcebergHandler.getDatasetVersion() is called. Initially I was expecting to be able to obtain the snapshotId from the SparkTable which comes within getDatasetVersion(), but now I realize that OL is using an older version of the Iceberg runtime (0.12.1), which does not support time travel (introduced in 0.14.1). The evidence is:

@@ -113330,7 +113336,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-11 15:48:53
      -

      *Thread Reply:* Created issues: #1969 and #1970

      +

      *Thread Reply:* Created issues: #1969 and #1970

      @@ -113356,7 +113362,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 07:15:14
      -

*Thread Reply:* Thanks @Juan Manuel Cappi. The openlineage-spark jar contains modules like spark3, spark32, spark33, and spark34, which is going to be merged soon (we do have a ready PR for that). spark34 will be compiled against the latest Iceberg version. Once this is done, #1969 can be closed. For #1970, one would need to implement a datasetBuilder within the spark34 module that visits the node within Spark's logical plan responsible for as of and creates the dataset for the OpenLineage event in a way other than getting the latest snapshot version.

      +

      *Thread Reply:* Thanks @Juan Manuel Cappi. openlineage-spark jar contains modules like spark3 , spark32 , spark33 and spark34 that is going to be merged soon (we do have a ready PR for that). spark34 will be compiled against latest iceberg version. Once this is done #1969 can be closed. For 1970, one would need to implement datasetBuilder within spark34 module and visits node within spark's logical plan that is responsible for as of and creates dataset for OpenLineage event other way than getting latest snapshot version.

      @@ -113382,7 +113388,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-13 12:51:19
      -

*Thread Reply:* @Paweł Leszczyński I’ve seen PR #1971 and I see a new spark34 project with the latest iceberg-spark dependency version, but the other versions (spark33, spark32, etc.) have not been upgraded in that PR. Since the change is small and does not break any tests, I’ve created PR #1976 to fix #1969. That alone unlocks some time-travel lineage (i.e. the dataset identifier now becomes schema.table.version or schema.table.snapshot_id). Hope it makes sense

+

*Thread Reply:* @Paweł Leszczyński I’ve seen PR #1971 and I see a new spark34 project with the latest iceberg-spark dependency version, but the other versions (spark33, spark32, etc.) have not been upgraded in that PR. Since the change is small and does not break any tests, I’ve created PR #1976 to fix #1969. That alone unlocks some time-travel lineage (i.e. the dataset identifier now becomes schema.table.version or schema.table.snapshot_id). Hope it makes sense
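For context, the as of reads under discussion look roughly like this on the Spark side (a sketch; table names and snapshot ids are made up, and the SQL VERSION AS OF / TIMESTAMP AS OF forms need a recent enough Spark/Iceberg pairing, which is exactly the version constraint discussed above):

```python
# DataFrameReader-style time travel (available in older Iceberg releases too)
df = (spark.read
      .option("snapshot-id", 10963874102873)  # a specific Iceberg snapshot
      .format("iceberg")
      .load("db.table"))

# SQL time travel (newer Spark/Iceberg combinations)
spark.sql("SELECT * FROM db.table VERSION AS OF 10963874102873")
spark.sql("SELECT * FROM db.table TIMESTAMP AS OF '2023-07-01 00:00:00'")
```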

      @@ -113408,7 +113414,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-14 04:37:55
      -

*Thread Reply:* Hi @Juan Manuel Cappi, You're right: after discussing this with you, I realized we support some version of Iceberg (for Spark 3.3 it's still 0.14.0), but it is not the latest Iceberg version matching the Spark version.

+

*Thread Reply:* Hi @Juan Manuel Cappi, You're right: after discussing this with you, I realized we support some version of Iceberg (for Spark 3.3 it's still 0.14.0), but it is not the latest Iceberg version matching the Spark version.

There's a tricky part here. Although we want our code to succeed with the latest Spark, we don't want it to fail in a nasty way (a class-not-found exception) when a user is working with an old Iceberg version. There are places in our code where we check "are Iceberg classes on the classpath?" We need to extend this to "are Iceberg classes on the classpath, and is the Iceberg version above 0.14 or not?" For sure this is the case for the merge into commands I am working on at the moment. Let's see if the other integration tests are affected in your PR

      @@ -113462,7 +113468,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-11 08:28:59
      -

*Thread Reply:* what do you mean by that? there is a pyspark & kafka integration test that verifies events being sent when reading from or writing to a kafka topic: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]a/io/openlineage/spark/agent/SparkContainerIntegrationTest.java

+

*Thread Reply:* what do you mean by that? there is a pyspark & kafka integration test that verifies events being sent when reading from or writing to a kafka topic: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]a/io/openlineage/spark/agent/SparkContainerIntegrationTest.java
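The kind of job that test exercises is the plain Structured Streaming Kafka source/sink, roughly as below (a sketch with made-up broker and topic names, not code taken from the linked test; whether a given streaming job emits lineage depends on which logical plan nodes it uses):

```python
# Read a topic with Structured Streaming
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "input_topic")
          .load())

# Write the value column back out to another topic
query = (events.selectExpr("CAST(value AS STRING) AS value")
         .writeStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "output_topic")
         .option("checkpointLocation", "/tmp/checkpoints/example")
         .start())
```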

      @@ -113513,7 +113519,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-11 09:28:56
      -

*Thread Reply:* We do have an old issue https://github.com/OpenLineage/OpenLineage/issues/372 to support more Spark plans that are stream-related. But if you have an example of streaming that is not working for you, that would be really helpful.

+

*Thread Reply:* We do have an old issue https://github.com/OpenLineage/OpenLineage/issues/372 to support more Spark plans that are stream-related. But if you have an example of streaming that is not working for you, that would be really helpful.

      @@ -113573,7 +113579,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-26 08:03:30
      -

*Thread Reply:* I have a pipeline which reads from a topic and sends data to 3 Hive tables and one Postgres table. It's not emitting any lineage for this pipeline.

+

*Thread Reply:* I have a pipeline which reads from a topic and sends data to 3 Hive tables and one Postgres table. It's not emitting any lineage for this pipeline.

      @@ -113599,7 +113605,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-26 08:06:51
      -

      *Thread Reply:* just one task is getting created

      +

      *Thread Reply:* just one task is getting created

      @@ -113677,7 +113683,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 06:14:41
      -

*Thread Reply:* Hi @Anirudh Shrinivason, I’m not familiar with Delta, but enabling debugging helped me a lot to understand what’s going on when things fail silently. Just add at the end:

+

*Thread Reply:* Hi @Anirudh Shrinivason, I’m not familiar with Delta, but enabling debugging helped me a lot to understand what’s going on when things fail silently. Just add at the end:
spark.sparkContext.setLogLevel("DEBUG")

      @@ -113704,7 +113710,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 06:20:47
      -

      *Thread Reply:* Yeah I checked on debug

      +

      *Thread Reply:* Yeah I checked on debug

      @@ -113730,7 +113736,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 06:20:50
      -

      *Thread Reply:* There are no errors

      +

      *Thread Reply:* There are no errors

      @@ -113756,7 +113762,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 06:21:10
      -

      *Thread Reply:* Just that there is no environment-properties in the event that is being emitted

      +

      *Thread Reply:* Just that there is no environment-properties in the event that is being emitted

      @@ -113782,7 +113788,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 07:31:01
      -

*Thread Reply:* Hi @Anirudh Shrinivason, what Spark version is that? I see your Delta version is pretty old. Anyway, the observation is weird and I don't know how Delta could interfere with the environment facet builder. These are such disjoint features. Are you sure you create a new session (there is getOrCreate)?

+

*Thread Reply:* Hi @Anirudh Shrinivason, what Spark version is that? I see your Delta version is pretty old. Anyway, the observation is weird and I don't know how Delta could interfere with the environment facet builder. These are such disjoint features. Are you sure you create a new session (there is getOrCreate)?

      @@ -113808,7 +113814,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 19:29:06
      -

*Thread Reply:* @Paweł Leszczyński it's because of this line: https://github.com/OpenLineage/OpenLineage/blob/0.29.2/integration/spark/app/src/m[…]nlineage/spark/agent/lifecycle/InternalEventHandlerFactory.java

+

*Thread Reply:* @Paweł Leszczyński it's because of this line: https://github.com/OpenLineage/OpenLineage/blob/0.29.2/integration/spark/app/src/m[…]nlineage/spark/agent/lifecycle/InternalEventHandlerFactory.java

      @@ -113859,7 +113865,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 19:32:44
      -

*Thread Reply:* Assuming this is https://learn.microsoft.com/en-us/azure/databricks/delta/ ... Delta ... which is Azure Databricks. @Anirudh Shrinivason

+

*Thread Reply:* Assuming this is https://learn.microsoft.com/en-us/azure/databricks/delta/ ... Delta ... which is Azure Databricks. @Anirudh Shrinivason

      @@ -113885,7 +113891,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 22:58:13
      -

      *Thread Reply:* Hmm I wasn't using databricks

      +

      *Thread Reply:* Hmm I wasn't using databricks

      @@ -113911,7 +113917,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 22:59:12
      -

      *Thread Reply:* @Paweł Leszczyński I'm using spark 3.1 btw

      +

      *Thread Reply:* @Paweł Leszczyński I'm using spark 3.1 btw

      @@ -113937,7 +113943,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-13 08:05:49
      -

      *Thread Reply:* @Anirudh Shrinivason This should resolve the issue https://github.com/OpenLineage/OpenLineage/pull/1973

      +

      *Thread Reply:* @Anirudh Shrinivason This should resolve the issue https://github.com/OpenLineage/OpenLineage/pull/1973

      @@ -114030,7 +114036,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-13 08:06:11
      -

*Thread Reply:* The PR description contains info on how the observed behaviour was possible

+

*Thread Reply:* The PR description contains info on how the observed behaviour was possible

      @@ -114056,7 +114062,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-13 08:07:47
      -

      *Thread Reply:* As always, thank you @Anirudh Shrinivason for providing clear information on how to reproduce the issue 🚀 :medal: 👍

      +

      *Thread Reply:* As always, thank you @Anirudh Shrinivason for providing clear information on how to reproduce the issue 🚀 :medal: 👍

      @@ -114082,7 +114088,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-13 09:52:29
      -

*Thread Reply:* Ohh that is really great! Thanks so much for the help! 🙂

+

*Thread Reply:* Ohh that is really great! Thanks so much for the help! 🙂

      @@ -114211,7 +114217,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 15:22:52
      -

      *Thread Reply:* what do you mean by Access file?

      +

      *Thread Reply:* what do you mean by Access file?

      @@ -114237,7 +114243,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 16:09:03
      -

*Thread Reply:* ... accdb file, a Microsoft Access file: I am in a reverse-engineering project facing spaghetti-style development and would have loved to use Airflow and OpenLineage as a magic wand to help me in this damn work

+

*Thread Reply:* ... accdb file, a Microsoft Access file: I am in a reverse-engineering project facing spaghetti-style development and would have loved to use Airflow and OpenLineage as a magic wand to help me in this damn work

      @@ -114263,7 +114269,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 21:44:21
      -

      *Thread Reply:* oof.. I’d look into https://airflow.apache.org/docs/apache-airflow-providers-odbc/4.0.0/ +

      *Thread Reply:* oof.. I’d look into https://airflow.apache.org/docs/apache-airflow-providers-odbc/4.0.0/ but I really have no clue..

      @@ -114290,7 +114296,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-13 09:47:02
      -

      *Thread Reply:* Thank you Harel +

      *Thread Reply:* Thank you Harel I started from that too ... but it became foggy after the initial step

      @@ -114347,7 +114353,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 16:55:00
      -

      *Thread Reply:* Did you check the namespace menu in the top right for a food_delivery namespace?

      +

      *Thread Reply:* Did you check the namespace menu in the top right for a food_delivery namespace?

      @@ -114373,7 +114379,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 16:55:12
      -

      *Thread Reply:* (Hi Aaman!)

      +

      *Thread Reply:* (Hi Aaman!)

      @@ -114399,7 +114405,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 16:55:45
      -

      *Thread Reply:* Hi! Thank you that helped!

      +

      *Thread Reply:* Hi! Thank you that helped!

      @@ -114425,7 +114431,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 16:55:55
      -

      *Thread Reply:* I think that should be added to the quickstart guide

      +

      *Thread Reply:* I think that should be added to the quickstart guide

      @@ -114455,7 +114461,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-12 16:56:23
      -

      *Thread Reply:* Great idea, thank you

      +

      *Thread Reply:* Great idea, thank you

      @@ -114603,7 +114609,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-14 04:09:15
      -

*Thread Reply:* Hi @Steven, I assume you send your OpenLineage events to Marquez. The 422 http code is a response from the backend, and Marquez is still waiting for the PR https://github.com/MarquezProject/marquez/pull/2495 to be merged and released. This PR makes Marquez understand DatasetEvents. They won't be saved in the Marquez database (this is to be implemented in the future), but at least one will not experience an error response code.

+

*Thread Reply:* Hi @Steven, I assume you send your OpenLineage events to Marquez. The 422 http code is a response from the backend, and Marquez is still waiting for the PR https://github.com/MarquezProject/marquez/pull/2495 to be merged and released. This PR makes Marquez understand DatasetEvents. They won't be saved in the Marquez database (this is to be implemented in the future), but at least one will not experience an error response code.

To sum up: what you do is correct. You are using a feature that is allowed on the client side but still not implemented on the backend.

      MARQUEZAPIKEY=[YOURAPIKEY]
      2023-07-14 04:10:30
      -

      *Thread Reply:* Thanks!!

      +

      *Thread Reply:* Thanks!!

      @@ -114773,7 +114779,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-18 02:35:18
      -

*Thread Reply:* Hi @Harshit Soni, where are you deploying your Spark? locally or not? is it on mac? Calling @Maciej Obuchowski to help with the libopenlineage_sql_java architecture compilation issue

+

*Thread Reply:* Hi @Harshit Soni, where are you deploying your Spark? locally or not? is it on mac? Calling @Maciej Obuchowski to help with the libopenlineage_sql_java architecture compilation issue

      @@ -114799,7 +114805,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-18 02:38:03
      -

      *Thread Reply:* Currently, was testing on local.

      +

      *Thread Reply:* Currently, was testing on local.

      @@ -114825,7 +114831,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-18 02:39:43
      -

*Thread Reply:* We have created a centralised utility for all data ingestion needs and want to see how lineage is created for the same using OpenLineage.

+

*Thread Reply:* We have created a centralised utility for all data ingestion needs and want to see how lineage is created for the same using OpenLineage.

      @@ -114851,7 +114857,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-18 05:16:55
      -

      *Thread Reply:* 👀

      +

      *Thread Reply:* 👀

      @@ -115003,7 +115009,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-17 08:13:49
      -

      *Thread Reply:* > or would it somewhat work to write a mapper in between to support their spec? +

      *Thread Reply:* > or would it somewhat work to write a mapper in between to support their spec? I think yeah - maybe https://github.com/Natural-Intelligence/openLineage-openMetadata-transporter would work out of the box if I understand correctly?

      @@ -115059,7 +115065,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-17 08:38:59
      -

      *Thread Reply:* Tagging @Natalie Zeller in case you want to collaborate

      +

      *Thread Reply:* Tagging @Natalie Zeller in case you want to collaborate

      @@ -115085,7 +115091,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-17 08:47:34
      -

*Thread Reply:* Hi,

+

*Thread Reply:* Hi,
We've implemented a transporter that transmits lineage from OpenLineage to OpenMetadata, you can find the github project here. I've also published a blog post that explains this integration and how to use it. I'll be happy to help if you have any questions

      @@ -115147,7 +115153,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-17 09:49:30
      -

      *Thread Reply:* very cool! thanks a lot for responding so quickly

      +

      *Thread Reply:* very cool! thanks a lot for responding so quickly

      @@ -115239,7 +115245,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-18 05:10:31
      -

*Thread Reply:* runId is a UUID assigned per Spark action (a compute trigger within a Spark job). A single Spark script can therefore result in multiple runs

+

*Thread Reply:* runId is a UUID assigned per Spark action (a compute trigger within a Spark job). A single Spark script can therefore result in multiple runs

      @@ -115265,7 +115271,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-18 05:13:17
      -

      *Thread Reply:* adding an extra facet with applicationId looks like a good idea to me: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html#applicationId:String

      +

      *Thread Reply:* adding an extra facet with applicationId looks like a good idea to me: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html#applicationId:String

      spark.apache.org
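In the meantime, such a facet could be hand-built along these lines (a sketch; the facet name "sparkApplication" and its shape are made up for illustration, while spark.sparkContext.applicationId itself is the standard Spark API linked above):

```python
app_id = spark.sparkContext.applicationId  # e.g. 'application_1689875901234_0001'

# Hypothetical custom run facet carrying the Spark application id, so a
# backend could group all runs belonging to one Spark application.
run_facets = {
    "sparkApplication": {
        "_producer": "https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark",
        "_schemaURL": "https://openlineage.io/spec/2-0-0/OpenLineage.json#/$defs/RunFacet",
        "applicationId": app_id,
    }
}
```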
      @@ -115312,7 +115318,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-18 23:06:01
      -

      *Thread Reply:* Got it thanks!

      +

      *Thread Reply:* Got it thanks!

      @@ -115368,7 +115374,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-20 06:48:54
      -

*Thread Reply:* I think we don't have pandas support so far. So, if one uses pandas to read local files on disk, there is little OpenLineage (OL) can do. There is an old pandas issue in our backlog (over 2 years old) -> https://github.com/OpenLineage/OpenLineage/issues/108

+

*Thread Reply:* I think we don't have pandas support so far. So, if one uses pandas to read local files on disk, there is little OpenLineage (OL) can do. There is an old pandas issue in our backlog (over 2 years old) -> https://github.com/OpenLineage/OpenLineage/issues/108

Surely one can use the Python OL client to create events manually and send them to MQZ, which may be less convenient (https://github.com/OpenLineage/OpenLineage/tree/main/client/python)
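Emitting such manual events for a pandas step could look roughly like this (a sketch against the openlineage-python client as documented around this time; the namespace, job name, file paths, producer string, and Marquez URL are all made up, and the client API may differ between releases):

```python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # Marquez API

run = Run(runId=str(uuid4()))
job = Job(namespace="pandas_jobs", name="daily_aggregation")
producer = "manual-pandas-wrapper"  # hypothetical producer string

client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run, job=job, producer=producer,
))

# ... the pandas work happens here ...

client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run, job=job, producer=producer,
    inputs=[Dataset(namespace="file", name="/data/input.csv")],
    outputs=[Dataset(namespace="file", name="/data/output.csv")],
))
```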

      @@ -115422,7 +115428,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-20 06:52:32
      -

      *Thread Reply:* Thanks Pawel for responding

      +

      *Thread Reply:* Thanks Pawel for responding

      @@ -115474,7 +115480,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-19 03:40:20
      -

*Thread Reply:* Hi @Anirudh Shrinivason, I am still working on that. It's kind of complex because I want to refactor column-level lineage so that it can work with multiple Spark versions and multiple Delta jars, as the merge into implementation for Delta differs across Delta releases. I thought it was ready, but this needs some extra work in the next days. I am excited about that too!

+

*Thread Reply:* Hi @Anirudh Shrinivason, I am still working on that. It's kind of complex because I want to refactor column-level lineage so that it can work with multiple Spark versions and multiple Delta jars, as the merge into implementation for Delta differs across Delta releases. I thought it was ready, but this needs some extra work in the next days. I am excited about that too!

      @@ -115500,7 +115506,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-19 03:54:37
      -

*Thread Reply:* Ahh I see... Got it! Is there a tentative timeline for when we can expect this? So sorry haha, don't mean to rush you. Just curious to know, that's all! 🙂

+

*Thread Reply:* Ahh I see... Got it! Is there a tentative timeline for when we can expect this? So sorry haha, don't mean to rush you. Just curious to know, that's all! 🙂

      @@ -115526,7 +115532,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-19 22:06:10
      -

      *Thread Reply:* Can we author a release sometime soon? Would like to use the CustomEnvironmentFacetBuilder for delta catalog!

      +

      *Thread Reply:* Can we author a release sometime soon? Would like to use the CustomEnvironmentFacetBuilder for delta catalog!

      @@ -115552,7 +115558,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-20 05:28:43
      -

*Thread Reply:* we're pretty close, I think, with merge into for Delta, which is under review. Waiting for it would be nice. Anyway, we're 3 weeks past the last release.

+

*Thread Reply:* we're pretty close, I think, with merge into for Delta, which is under review. Waiting for it would be nice. Anyway, we're 3 weeks past the last release.

      @@ -115578,7 +115584,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-20 06:50:56
      -

      *Thread Reply:* @Anirudh Shrinivason releases are available basically on-demand using our process in GOVERNANCE.md. I recommend watching 1958 and then making a request in #general once it’s been merged. But, as Paweł suggested, we have a scheduled release coming soon, anyway. Thanks for your interest in the fix!

      +

      *Thread Reply:* @Anirudh Shrinivason releases are available basically on-demand using our process in GOVERNANCE.md. I recommend watching 1958 and then making a request in #general once it’s been merged. But, as Paweł suggested, we have a scheduled release coming soon, anyway. Thanks for your interest in the fix!

      @@ -115652,7 +115658,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-20 11:01:14
      -

      *Thread Reply:* Ahh I see. Got it. Thanks! @Michael Robinson @Paweł Leszczyński

      +

      *Thread Reply:* Ahh I see. Got it. Thanks! @Michael Robinson @Paweł Leszczyński

      @@ -115678,7 +115684,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 03:12:22
      -

      *Thread Reply:* @Anirudh Shrinivason it's merged -> https://github.com/OpenLineage/OpenLineage/pull/1958

      +

      *Thread Reply:* @Anirudh Shrinivason it's merged -> https://github.com/OpenLineage/OpenLineage/pull/1958

      @@ -115778,7 +115784,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 04:19:15
      -

      *Thread Reply:* Awesome thanks so much! @Paweł Leszczyński

      +

      *Thread Reply:* Awesome thanks so much! @Paweł Leszczyński

      @@ -115834,7 +115840,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-19 07:14:54
      -

*Thread Reply:* One hacky approach would be to update the current dataset identifier to include the snapshot_id, so, for schema.table.tag we would have something like schema.table.tag-snapshot_id. The benefit is that it’s explicit and it doesn’t require a change in the OL schemas. The obvious downside (though not that serious in my opinion) is that it impacts readability. Not sure, though, if there are other non-obvious side effects.

+

*Thread Reply:* One hacky approach would be to update the current dataset identifier to include the snapshot_id, so, for schema.table.tag we would have something like schema.table.tag-snapshot_id. The benefit is that it’s explicit and it doesn’t require a change in the OL schemas. The obvious downside (though not that serious in my opinion) is that it impacts readability. Not sure, though, if there are other non-obvious side effects.

Another alternative would be to add a dedicated property. For instance, in the job > latestRun schema, the input/output dataset version objects could look like this: "inputDatasetVersions": [

@@ -115891,7 +115897,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-19 08:33:43
      -

      *Thread Reply:* @Paweł Leszczyński what do you think?

      +

      *Thread Reply:* @Paweł Leszczyński what do you think?

      @@ -115917,7 +115923,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-19 08:38:16
      -

*Thread Reply:* 1. How does snapshotId differ from version? Could one make the OL version property a string concat of iceberg-snapshot-id.iceberg-version?

+

*Thread Reply:* 1. How does snapshotId differ from version? Could one make the OL version property a string concat of iceberg-snapshot-id.iceberg-version?

2. I don't think it's necessary (or I don't understand why) to add snapshot-id within column lineage. Each entry within inputFields of columnLineage is already available within inputs of the OL event related to this run.
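For reference, the columnLineage entries referred to in point 2 identify input columns only by dataset coordinates plus field name, roughly as in this hand-written sketch after spec/facets/ColumnLineageDatasetFacet.json (dataset and field names are made up, not captured output):

```python
# Each output field lists the input fields it was derived from; the
# (namespace, name) pair points back to a dataset in the event's inputs,
# where the datasetVersion facet (and hence the snapshot) is recorded.
column_lineage_facet = {
    "fields": {
        "total_amount": {
            "inputFields": [
                {"namespace": "s3://warehouse", "name": "db.orders", "field": "amount"}
            ]
        }
    }
}
```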
      @@ -115946,7 +115952,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-19 18:43:31
      -

*Thread Reply:* Yes, I think I follow the idea. The problem with that is that the version is tied to the dataset name, i.e. my_namespace.table_A.tag_v1, which stays the same for the source dataset, which is the one being used with time travel.

+

*Thread Reply:* Yes, I think I follow the idea. The problem with that is that the version is tied to the dataset name, i.e. my_namespace.table_A.tag_v1, which stays the same for the source dataset, which is the one being used with time travel.
Suppose the following sequence:
step 1 => table_A.tag_v1 has snapshot id 123-abc

@@ -115991,7 +115997,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-20 06:35:22
      -

*Thread Reply:* So, my understanding of the problem is that the Iceberg version is not unique. So, if you have version 3, revert to version 2, and then write something again, one ends up again with version 3.

+

*Thread Reply:* So, my understanding of the problem is that the Iceberg version is not unique. So, if you have version 3, revert to version 2, and then write something again, one ends up again with version 3.

I would not like to mess with dataset names because, on backends like Marquez, dataset names being the same across different jobs and runs is what allows creating the lineage graph. If dataset names are different, then there is no way to build a lineage graph across multiple jobs.

      @@ -116053,7 +116059,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 13:09:39
      -

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated within two business days per our policy here.

      +

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated within two business days per our policy here.

      @@ -116079,7 +116085,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-25 13:44:47
      -

      *Thread Reply:* @Anirudh Shrinivason and others waiting on this release: the release process isn’t working as expected due to security improvements recently made to the website, ironically enough, which is the source for the spec. But we’re working on a fix and hope to complete the release soon.

      +

      *Thread Reply:* @Anirudh Shrinivason and others waiting on this release: the release process isn’t working as expected due to security improvements recently made to the website, ironically enough, which is the source for the spec. But we’re working on a fix and hope to complete the release soon.

      @@ -116105,7 +116111,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-25 15:19:49
      -

      *Thread Reply:* @Anirudh Shrinivason the release (0.30.1) is out now. Thanks for your patience 🙂

      +

      *Thread Reply:* @Anirudh Shrinivason the release (0.30.1) is out now. Thanks for your patience 🙂

      @@ -116131,7 +116137,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-25 23:21:14
      -

      *Thread Reply:* Hi @Michael Robinson Thanks a lot!

      +

      *Thread Reply:* Hi @Michael Robinson Thanks a lot!

      @@ -116157,7 +116163,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-26 08:52:24
      -

      *Thread Reply:* 👍

      +

      *Thread Reply:* 👍

      @@ -116218,7 +116224,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 08:10:58
      -

      *Thread Reply:* > I am running a job in Marquez with 180 rows of metadata +

      *Thread Reply:* > I am running a job in Marquez with 180 rows of metadata Do you mean that you have +100 rows of metadata in the jobs table for Marquez? Or that the job never finishes?

      @@ -116245,7 +116251,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 08:11:47
      -

*Thread Reply:* Also, yes, we have an event viewer that allows you to query the raw OL events

+

*Thread Reply:* Also, yes, we have an event viewer that allows you to query the raw OL events

      @@ -116280,7 +116286,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 08:12:19
      -

      *Thread Reply:* If you post a sample of your events, it’d be helpful to troubleshoot your issue

      +

      *Thread Reply:* If you post a sample of your events, it’d be helpful to troubleshoot your issue

      @@ -116306,10 +116312,10 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 08:53:25
      -

      *Thread Reply:*

      +

      *Thread Reply:*

-

+

@@ -116341,7 +116347,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 08:53:31
      -

*Thread Reply:* Sure, Willy, thanks for your response. The job is still running. This is the code I am running from a Jupyter notebook using the Python client:

+

*Thread Reply:* Sure, Willy, thanks for your response. The job is still running. This is the code I am running from a Jupyter notebook using the Python client:

      @@ -116367,7 +116373,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 08:54:33
      -

      *Thread Reply:* as you can see my input and output datasets are just 1 row

      +

      *Thread Reply:* as you can see my input and output datasets are just 1 row

      @@ -116393,7 +116399,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 08:55:02
      -

      *Thread Reply:* included column lineage but job keeps running so I don't know if it is working

      +

      *Thread Reply:* included column lineage but job keeps running so I don't know if it is working

      @@ -116479,7 +116485,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 08:20:33
      -

      *Thread Reply:* spark integration is based on extracting lineage from optimized plans

      +

      *Thread Reply:* spark integration is based on extracting lineage from optimized plans

      @@ -116505,7 +116511,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 08:25:35
      -

*Thread Reply:* https://youtu.be/rO3BPqUtWrI?t=1326 I recommend the whole presentation, but in case you're just interested in the Spark integration, there are a few mins that explain how this is achieved (the link points to 22:06 of the video)

+

*Thread Reply:* https://youtu.be/rO3BPqUtWrI?t=1326 I recommend the whole presentation, but in case you're just interested in the Spark integration, there are a few mins that explain how this is achieved (the link points to 22:06 of the video)

      YouTube
      @@ -116565,7 +116571,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 08:43:47
      -

      *Thread Reply:* Thanks Pawel for sharing. I will take a look. Have a nice weekend.

      +

      *Thread Reply:* Thanks Pawel for sharing. I will take a look. Have a nice weekend.

      @@ -116621,7 +116627,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-21 09:57:51
      -

      *Thread Reply:* Welcome, @Jens Pfau!

      +

      *Thread Reply:* Welcome, @Jens Pfau!

      @@ -116923,7 +116929,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-23 13:31:53
      -

      *Thread Reply:* That looks like your URL provided to OpenLineage is missing http:// or https:// in the front

      +

      *Thread Reply:* That looks like your URL provided to OpenLineage is missing http:// or https:// in the front

      @@ -116949,7 +116955,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-23 14:54:55
      -

*Thread Reply:* sorry how can I resolve this? do I need to add this? I just followed the guide step by step. You don't mention anywhere to add anything. You provide something that

+

*Thread Reply:* sorry how can I resolve this? do I need to add this? I just followed the guide step by step. You don't mention anywhere to add anything. You provide something that

      @@ -116975,7 +116981,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-23 14:55:05
      -

      *Thread Reply:* really does not work out of the box

      +

      *Thread Reply:* really does not work out of the box

      @@ -117001,7 +117007,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-23 14:55:13
      -

*Thread Reply:* and this is supposed to be a demo

+

*Thread Reply:* and this is supposed to be a demo

      @@ -117027,7 +117033,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-23 17:07:49
      -

*Thread Reply:* bumping e.g. to io.openlineage:openlineage-spark:0.29.2 seems to fix the issue

+

*Thread Reply:* bumping e.g. to io.openlineage:openlineage-spark:0.29.2 seems to fix the issue

not sure why it stopped working for 0.12.0 but we’ll take a look and fix accordingly

      @@ -117055,7 +117061,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 04:51:34
      -

      *Thread Reply:* ...probably by bumping the version on this page 🙂

      +

      *Thread Reply:* ...probably by bumping the version on this page 🙂

      @@ -117081,7 +117087,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 05:00:28
      -

*Thread Reply:* thank you both for coming back to me, I bumped to 0.29 and I think that it now runs. Is this the expected output?

+

      *Thread Reply:* thank you both for coming back to me , I bumped to 0.29 and i think that it now runs.Is this the expected output ? 23/07/24 08:43:55 INFO ConsoleTransport: {"eventTime":"2023_07_24T08:43:55.941Z","producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","schemaURL":"<https://openlineage.io/spec/2-0-0/OpenLineage.json#/$defs/RunEvent>","eventType":"COMPLETE","run":{"runId":"186c06c0_e79c_43cf_8bb7_08e1ab4c86a5","facets":{"spark.logicalPlan":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","_schemaURL":"<https://openlineage.io/spec/2-0-0/OpenLineage.json#/$defs/RunFacet>","plan":[{"class":"org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand","num-children":1,"table":{"product-class":"org.apache.spark.sql.catalyst.catalog.CatalogTable","identifier":{"product-class":"org.apache.spark.sql.catalyst.TableIdentifier","table":"temp2","database":"default"},"tableType":{"product-class":"org.apache.spark.sql.catalyst.catalog.CatalogTableType","name":"MANAGED"},"storage":{"product_class":"org.apache.spark.sql.catalyst.catalog.CatalogStorageFormat","compressed":false,"properties":null},"schema":{"type":"struct","fields":[]},"provider":"parquet","partitionColumnNames":[],"owner":"","createTime":1690188235517,"lastAccessTime":-1,"createVersion":"","properties":null,"unsupportedFeatures":[],"tracksPartitionsInCatalog":false,"schemaPreservesCase":true,"ignoredProperties":null},"mode":null,"query":0,"outputColumnNames":"[a, b]"},{"class":"org.apache.spark.sql.execution.LogicalRDD","num_children":0,"output":[[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num-children":0,"name":"a","dataType":"long","nullable":true,"metadata":{},"exprId":{"product-class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":12,"jvmId":"173725f4_02c4_4174_9d18_3a61aa311d62"},"qualifier":[]}],[{"class":"org.apache.spark.sql.catalyst.expressions.AttributeReference","num_children":0,"name":"b","dataType":"long","nullable":true,"metadata":{},"exprId":{"product_class":"org.apache.spark.sql.catalyst.expressions.ExprId","id":13,"jvmId":"173725f4-02c4-4174-9d18-3a61aa311d62"},"qualifier":[]}]],"rdd":null,"outputPartitioning":{"product_class":"org.apache.spark.sql.catalyst.plans.physical.UnknownPartitioning","numPartitions":0},"outputOrdering":[],"isStreaming":false,"session":null}]},"spark_version":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","_schemaURL":"<https://openlineage.io/spec/2-0-0/OpenLineage.json#/$defs/RunFacet>","spark-version":"3.1.2","openlineage_spark_version":"0.29.2"}}},"job":{"namespace":"default","name":"sample_spark.execute_create_data_source_table_as_select_command","facets":{}},"inputs":[],"outputs":[{"namespace":"file","name":"/home/jovyan/spark-warehouse/temp2","facets":{"dataSource":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","_schemaURL":"<https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet>","name":"file","uri":"file"},"schema":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","_schemaURL":"<https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet>","fields":[{"name":"a","type":"long"},{"name":"b","type":"long"}]},"symlinks":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","_schemaURL":"<https://openlineage.io/spec/fa
cets/1-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet>","identifiers":[{"namespace":"/home/jovyan/spark-warehouse","name":"default.temp2","type":"TABLE"}]},"lifecycleStateChange":{"_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.29.2/integration/spark>","_schemaURL":"<https://openlineage.io/spec/facets/1-0-0/LifecycleStateChangeDatasetFacet.json#/$defs/LifecycleStateChangeDatasetFacet>","lifecycleStateChange":"CREATE"}},"outputFacets":{}}]} ? Also i then proceeded to run docker run --network spark_default -p 3000:3000 -e MARQUEZ_HOST=marquez-api -e MARQUEZ_PORT=5000 --link marquez-api:marquez-api marquezproject/marquez-web:0.19.1 @@ -117120,7 +117126,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 11:11:08
      -

      *Thread Reply:* You'd need to set up spark.openlineage.transport.url to send OpenLineage events to Marquez

      +

      *Thread Reply:* You'd need to set up spark.openlineage.transport.url to send OpenLineage events to Marquez

      @@ -117146,7 +117152,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 11:12:28
      -

*Thread Reply:* where and how can I do this?

+

*Thread Reply:* where and how can I do this?

      @@ -117172,7 +117178,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 11:13:04
      -

      *Thread Reply:* do i need to edit the conf ?

      +

      *Thread Reply:* do i need to edit the conf ?

      @@ -117198,7 +117204,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 11:37:09
      -

      *Thread Reply:* yes, in the spark conf

      +

      *Thread Reply:* yes, in the spark conf

      @@ -117224,7 +117230,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 11:37:48
      -

      *Thread Reply:* what this url should be ?

      +

      *Thread Reply:* what this url should be ?

      @@ -117250,7 +117256,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 11:37:51
      -

      *Thread Reply:* http://localhost:3000/ ?

      +

      *Thread Reply:* http://localhost:3000/ ?

      @@ -117276,7 +117282,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 11:43:30
      -

*Thread Reply:* That depends on how you ran Marquez, but looking at your screenshot the UI is at 3000, so I guess the API would be at 5000

+

*Thread Reply:* That depends on how you ran Marquez, but looking at your screenshot the UI is at 3000, so I guess the API would be at 5000

      @@ -117302,7 +117308,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 11:43:46
      -

      *Thread Reply:* as that's default in Marquez docker-compose

      +

      *Thread Reply:* as that's default in Marquez docker-compose

      @@ -117328,7 +117334,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 11:44:14
      -

      *Thread Reply:* i cannot see spark conf

      +

      *Thread Reply:* i cannot see spark conf

      @@ -117354,7 +117360,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 11:44:23
      -

      *Thread Reply:* is it in there or do i need to create it ?

      +

      *Thread Reply:* is it in there or do i need to create it ?

      @@ -117380,7 +117386,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-24 16:42:53
      -

*Thread Reply:* Is something like

+

*Thread Reply:* Is something like
```from pyspark.sql import SparkSession

spark = (SparkSession.builder.master('local')
```

@@ -117416,7 +117422,7 @@
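The snippet above is cut off; stitched together from the fragments in this thread, the full session setup would look something like this sketch (the app name matches the job name seen in the event output above; the URL assumes the docker-compose network where the Marquez API container is reachable as marquez-api:5000):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.master('local')
         .appName('sample_spark')
         # ship the OpenLineage listener with the job
         .config('spark.jars.packages', 'io.openlineage:openlineage-spark:0.29.2')
         .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
         # send events to the Marquez API over http
         .config('spark.openlineage.transport.type', 'http')
         .config('spark.openlineage.transport.url', 'http://marquez-api:5000')
         .getOrCreate())
```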

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-25 05:08:08
      -

*Thread Reply:* OK when I use the snippet you provided and then execute

+

*Thread Reply:* OK when I use the snippet you provided and then execute
docker run --network spark_default -p 3000:3000 -e MARQUEZ_HOST=marquez-api -e MARQUEZ_PORT=5000 --link marquez-api:marquez-api marquezproject/marquez-web:0.19.1

      I can now see this

      @@ -117454,7 +117460,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-25 05:08:52
      -

      *Thread Reply:* but when i click on the job i then get this

      +

      *Thread Reply:* but when i click on the job i then get this

      @@ -117489,7 +117495,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-07-25 05:09:05
      -

      *Thread Reply:* so i cannot see any details of the job

      +

      *Thread Reply:* so i cannot see any details of the job

      @@ -117515,18 +117521,14 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 05:54:50
      -

      *Thread Reply:* @George Polychronopoulos Hi, I am facing the same issue. After adding spark conf and using the docker run command, marquez is still showing empty. Do I need to change something in the run command?

      +

      *Thread Reply:* @George Polychronopoulos Hi, I am facing the same issue. After adding spark conf and using the docker run command, marquez is still showing empty. Do I need to change something in the run command?

      @@ -117554,7 +117556,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 05:55:15
      -

      *Thread Reply:* yes i will tell you

      +

      *Thread Reply:* yes i will tell you

      @@ -117580,7 +117582,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 07:36:41
      -

*Thread Reply:* For the docker command that I used, I updated the marquez-web version to 0.40.0 and I also updated the MARQUEZ_HOST, which I am not sure if I had to or not. The UI is running but not showing anything: docker run --network spark_default -p 3000:3000 -e MARQUEZ_HOST=localhost -e MARQUEZ_PORT=5000 --link marquez-api:marquez-api marquez/marquez-web:0.40.0

+

*Thread Reply:* For the docker command that I used, I updated the marquez-web version to 0.40.0 and I also updated the MARQUEZ_HOST, which I am not sure if I had to or not. The UI is running but not showing anything: docker run --network spark_default -p 3000:3000 -e MARQUEZ_HOST=localhost -e MARQUEZ_PORT=5000 --link marquez-api:marquez-api marquez/marquez-web:0.40.0

      @@ -117606,7 +117608,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 07:36:52
      -

*Thread Reply:* it's because you are running this command, right

+

*Thread Reply:* it's because you are running this command, right

      @@ -117632,7 +117634,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 07:36:55
      -

*Thread Reply:* yes, that's it

+

*Thread Reply:* yes, that's it

      @@ -117658,7 +117660,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 07:36:58
      -

      *Thread Reply:* you need 0.40

      +

      *Thread Reply:* you need 0.40

      @@ -117684,7 +117686,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 07:37:03
      -

      *Thread Reply:* and there is a lot of stuff

      +

      *Thread Reply:* and there is a lot of stuff

      @@ -117710,7 +117712,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 07:37:07
      -

*Thread Reply:* you need to change

+

*Thread Reply:* you need to change

      @@ -117736,7 +117738,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 07:37:10
      -

      *Thread Reply:* in the Docker

      +

      *Thread Reply:* in the Docker

      @@ -117762,7 +117764,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 07:37:24
      -

      *Thread Reply:* so the spark

      +

      *Thread Reply:* so the spark

      @@ -117788,7 +117790,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 07:37:25
      -

      *Thread Reply:* version

      +

      *Thread Reply:* version

      @@ -117814,7 +117816,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 07:37:27
      -

      *Thread Reply:* the python

      +

      *Thread Reply:* the python

      @@ -117840,7 +117842,7 @@

      MARQUEZAPIKEY=[YOURAPIKEY]

      2023-09-05 07:38:05
      -

*Thread Reply:* version: "3.10"

+

*Thread Reply:* version: "3.10"
services:
  notebook:
    image: jupyter/pyspark-notebook:spark-3.4.1

@@ -117935,7 +117937,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:38:10
      -

*Thread Reply:* this is how mine looks

+

*Thread Reply:* this is how mine looks

      @@ -117961,7 +117963,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:38:20
      -

*Thread Reply:* it is all tested and the latest version

+

*Thread Reply:* it is all tested and the latest version

      @@ -117987,7 +117989,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:38:31
      -

      *Thread Reply:* postgres does not work beyond 12

      +

      *Thread Reply:* postgres does not work beyond 12

      @@ -118013,7 +118015,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:38:56
      -

      *Thread Reply:* if you run this docker-compose up

      +

      *Thread Reply:* if you run this docker-compose up

      @@ -118039,7 +118041,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:38:58
      -

      *Thread Reply:* the notebooks

      +

      *Thread Reply:* the notebooks

      @@ -118065,7 +118067,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:39:02
      -

*Thread Reply:* are 10x faster

+

*Thread Reply:* are 10x faster

      @@ -118091,7 +118093,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:39:06
      -

      *Thread Reply:* and give no errors

      +

      *Thread Reply:* and give no errors

      @@ -118117,7 +118119,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:39:14
      -

      *Thread Reply:* also you need to update other stuff

      +

      *Thread Reply:* also you need to update other stuff

      @@ -118143,7 +118145,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:39:18
      -

      *Thread Reply:* such as

      +

      *Thread Reply:* such as

      @@ -118169,7 +118171,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:39:26
      -

*Thread Reply:* don't run what is in the docs

+

*Thread Reply:* don't run what is in the docs

      @@ -118195,7 +118197,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:39:34
      -

      *Thread Reply:* but run what is in github

      +

      *Thread Reply:* but run what is in github

      @@ -118221,7 +118223,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:40:13
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark

      @@ -118247,7 +118249,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:40:22
      -

      *Thread Reply:* run in your notebooks what is in here

      +

      *Thread Reply:* run in your notebooks what is in here

      @@ -118273,7 +118275,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:40:32
      -

      *Thread Reply:* ```from pyspark.sql import SparkSession

      +

*Thread Reply:*
```from pyspark.sql import SparkSession

spark = (SparkSession.builder.master('local')
    .appName('samplespark')
```

@@ -118306,7 +118308,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:40:38
      -

*Thread Reply:* they don't update the documentation

+

*Thread Reply:* they don't update the documentation

      @@ -118332,7 +118334,7 @@

      Marquez as an OpenLineage Client

      2023-09-05 07:40:44
      -

      *Thread Reply:* it took me 4 weeks to get here

      +

      *Thread Reply:* it took me 4 weeks to get here

      @@ -118423,7 +118425,7 @@

      Marquez as an OpenLineage Client

      2023-07-24 00:14:43
      -

      *Thread Reply:* Like I want to change the facets of a specific dataset.

      +

      *Thread Reply:* Like I want to change the facets of a specific dataset.

      @@ -118449,7 +118451,7 @@

      Marquez as an OpenLineage Client

      2023-07-24 16:45:18
      -

      *Thread Reply:* @Willy Lulciuc do you have any idea? 🙂

      +

      *Thread Reply:* @Willy Lulciuc do you have any idea? 🙂

      @@ -118475,7 +118477,7 @@

      Marquez as an OpenLineage Client

      2023-07-25 05:02:47
      -

      *Thread Reply:* I solved this by adding a new job which outputs to the same dataset. This ended up in a newer dataset version.

      +

      *Thread Reply:* I solved this by adding a new job which outputs to the same dataset. This ended up in a newer dataset version.

      @@ -118501,7 +118503,7 @@

      Marquez as an OpenLineage Client

      2023-07-25 06:20:58
      -

      *Thread Reply:* @Steven great to hear that you solved the issue! but there are some minor logical inconsistencies that we’d like to address with versioning (for both datasets and jobs) in Marquez. The tl;dr is the version column wasn’t meant to be used externally, but internally within Marquez. The issue is “minor” as it’s more of a pointer thing. We’ll be addressing soon. For some background, you can look at: +

      *Thread Reply:* @Steven great to hear that you solved the issue! but there are some minor logical inconsistencies that we’d like to address with versioning (for both datasets and jobs) in Marquez. The tl;dr is the version column wasn’t meant to be used externally, but internally within Marquez. The issue is “minor” as it’s more of a pointer thing. We’ll be addressing soon. For some background, you can look at: • https://github.com/MarquezProject/marquez/issues/2071https://github.com/MarquezProject/marquez/pull/2153

      Marquez as an OpenLineage Client
      2023-07-25 06:43:32
      -

*Thread Reply:* @Steven ahh very good point, it's technically not an "error" in the true sense, but annoying nonetheless. I think you're referencing the init container in the Marquez helm chart? https://github.com/MarquezProject/marquez/blob/main/chart/templates/marquez/deployment.yaml#L37

+

*Thread Reply:* @Steven ahh very good point, it's technically not an "error" in the true sense, but annoying nonetheless. I think you're referencing the init container in the Marquez helm chart? https://github.com/MarquezProject/marquez/blob/main/chart/templates/marquez/deployment.yaml#L37

      @@ -118655,7 +118657,7 @@

      Marquez as an OpenLineage Client

      2023-07-25 06:45:24
      -

*Thread Reply:* hmm, actually what raises the error you're referencing? the Marquez http server?

+

*Thread Reply:* hmm, actually what raises the error you're referencing? the Marquez http server?

      @@ -118681,7 +118683,7 @@

      Marquez as an OpenLineage Client

      2023-07-25 06:49:08
      -

*Thread Reply:* > Every time I restart the marquez deployment I have to drop all those tables otherwise it will raise table already exists ERROR

+

*Thread Reply:* > Every time I restart the marquez deployment I have to drop all those tables otherwise it will raise table already exists ERROR
This shouldn’t be an error. I’m trying to understand the scenario in which this error is thrown (any info is helpful). We use flyway to manage our db schema, but you may have gotten into an odd state somehow

      @@ -118734,7 +118736,7 @@

      Marquez as an OpenLineage Client

      2023-07-31 03:35:58
      -

*Thread Reply:* Nope, one should modify the cluster as per the doc <https://openlineage.io/docs/integrations/spark/quickstart_databricks>, but no changes in the notebook are required.

+

*Thread Reply:* Nope, one should modify the cluster as per the doc <https://openlineage.io/docs/integrations/spark/quickstart_databricks>, but no changes in the notebook are required.

      @@ -118760,7 +118762,7 @@

      Marquez as an OpenLineage Client

      2023-08-02 10:59:00
      -

      *Thread Reply:* Right, great, that’s exactly what I was hoping 😄

      +

      *Thread Reply:* Right, great, that’s exactly what I was hoping 😄

      @@ -118866,7 +118868,7 @@

      Marquez as an OpenLineage Client

      2023-07-27 14:47:18
      -

      *Thread Reply:* This is the first I’ve heard of someone trying to do this, but others have tried getting lineage from pandas. There isn’t support for this currently, but this thread contains a link to an issue that might be helpful: https://openlineage.slack.com/archives/C01CK9T7HKR/p1689850134978429?thread_ts=1689688067.729469&cid=C01CK9T7HKR.

      +

      *Thread Reply:* This is the first I’ve heard of someone trying to do this, but others have tried getting lineage from pandas. There isn’t support for this currently, but this thread contains a link to an issue that might be helpful: https://openlineage.slack.com/archives/C01CK9T7HKR/p1689850134978429?thread_ts=1689688067.729469&cid=C01CK9T7HKR.

      @@ -118931,7 +118933,7 @@

      Marquez as an OpenLineage Client

      2023-07-28 02:10:14
      -

      *Thread Reply:* Thank you for your response. We have implemented the “manual way” of emitting events with python OL client. We are now looking for a more automated way, so that updates to the scripts that run in Ray are minimal to none

      +

      *Thread Reply:* Thank you for your response. We have implemented the “manual way” of emitting events with python OL client. We are now looking for a more automated way, so that updates to the scripts that run in Ray are minimal to none

      @@ -118957,7 +118959,7 @@

      Marquez as an OpenLineage Client

      2023-07-28 13:03:43
      -

*Thread Reply:* If you're actively using Ray, then you know way more about it than me, or probably any other OL contributor 🙂

+

*Thread Reply:* If you're actively using Ray, then you know way more about it than me, or probably any other OL contributor 🙂
I don't know how it works or is deployed, but I would recommend checking if there's a robust way of being notified at runtime about the processing occurring there.

      @@ -118984,7 +118986,7 @@

      Marquez as an OpenLineage Client

      2023-07-31 12:17:07
      -

      *Thread Reply:* Thank you for the tip. That’s the kind of details I’m looking for, but couldn’t find yet

      +

      *Thread Reply:* Thank you for the tip. That’s the kind of details I’m looking for, but couldn’t find yet

      @@ -119036,7 +119038,7 @@

      Marquez as an OpenLineage Client

      2023-07-28 10:53:35
      -

      *Thread Reply:* @Martin Fiser can you share any resources or pointers that might be helpful?

      +

      *Thread Reply:* @Martin Fiser can you share any resources or pointers that might be helpful?

      @@ -119062,7 +119064,7 @@

      Marquez as an OpenLineage Client

      2023-08-21 19:17:17
      -

*Thread Reply:* Hi, apologies - vacation period has hit me. However, here are the resources:

+

*Thread Reply:* Hi, apologies - vacation period has hit me. However, here are the resources:

API endpoint: https://app.swaggerhub.com/apis-docs/keboola/job-queue-api/1.3.4#/Jobs/getJobOpenApiLineage

@@ -119123,7 +119125,7 @@

      Marquez as an OpenLineage Client

      2023-08-01 05:14:58
      -

      *Thread Reply:* +1, I could also really use this!

      +

      *Thread Reply:* +1, I could also really use this!

      @@ -119149,7 +119151,7 @@

      Marquez as an OpenLineage Client

      2023-08-01 05:27:34
      -

*Thread Reply:* Found a way: you download it as json in the above link (“Download OpenAPI specification”), then if you copy-paste it into editor.swagger.io it asks if you want to convert to yaml :)

+

*Thread Reply:* Found a way: you download it as json in the above link (“Download OpenAPI specification”), then if you copy-paste it into editor.swagger.io it asks if you want to convert to yaml :)
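The same conversion can be done locally in a couple of lines of Python instead of pasting into editor.swagger.io (a sketch assuming PyYAML is installed; the filenames are made up):

```python
import json
import yaml

# Convert the downloaded OpenAPI spec from JSON to YAML.
with open("openlineage-openapi.json") as f:   # hypothetical filename
    spec = json.load(f)

with open("openlineage-openapi.yaml", "w") as f:
    yaml.safe_dump(spec, f, sort_keys=False)
```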

      @@ -119175,7 +119177,7 @@

      Marquez as an OpenLineage Client

      2023-08-01 10:25:49
      -

      *Thread Reply:* Whilst that works, it isn't complete. The issue is that the "facets" are not resolved. Exploring the website repository (https://github.com/OpenLineage/website/tree/main/static/spec) shows that facets aren't published alongside the spec, beyond 1.0.1 - which means its hard to know which revisions of the facets belong to which version of the spec.

      +

*Thread Reply:* Whilst that works, it isn't complete. The issue is that the "facets" are not resolved. Exploring the website repository (https://github.com/OpenLineage/website/tree/main/static/spec) shows that facets aren't published alongside the spec, beyond 1.0.1 - which means it's hard to know which revisions of the facets belong to which version of the spec.

      @@ -119201,7 +119203,7 @@

      Marquez as an OpenLineage Client

      2023-08-01 10:26:54
      -

      *Thread Reply:* Good point! Would be good if we could clarify how to get the full spec, in that case

      +

      *Thread Reply:* Good point! Would be good if we could clarify how to get the full spec, in that case

      @@ -119227,7 +119229,7 @@

      Marquez as an OpenLineage Client

      2023-08-01 10:30:57
      -

*Thread Reply:* Granted. If the spec follows backwards-compatible evolution rules, then this shouldn't be a problem, i.e., new fields must be optional, you cannot remove existing fields, you cannot modify existing fields, etc.

      +

*Thread Reply:* Granted. If the spec follows backwards-compatible evolution rules, then this shouldn't be a problem, i.e., new fields must be optional, you cannot remove existing fields, you cannot modify existing fields, etc.

      @@ -119257,7 +119259,7 @@

      Marquez as an OpenLineage Client

      2023-08-01 12:15:22
      -

*Thread Reply:* We don't have facets with a newer version than 1.1.0

      +

*Thread Reply:* We don't have facets with a newer version than 1.1.0

      @@ -119283,7 +119285,7 @@

      Marquez as an OpenLineage Client

      2023-08-01 12:15:56
      -

      *Thread Reply:* @Damien Hawes we've moved to merge docs and website repos here: https://github.com/OpenLineage/docs

      +

      *Thread Reply:* @Damien Hawes we've moved to merge docs and website repos here: https://github.com/OpenLineage/docs

      @@ -119338,7 +119340,7 @@

      Marquez as an OpenLineage Client

      2023-08-01 12:18:23
      -

      *Thread Reply:* > Would be good if we could clarify how to get the full spec, in that case +

      *Thread Reply:* > Would be good if we could clarify how to get the full spec, in that case Is using https://github.com/OpenLineage/OpenLineage/tree/main/spec not enough? We have separate files with facets definition to be able to evolve them separetely from main spec

      @@ -119365,7 +119367,7 @@

      Marquez as an OpenLineage Client

      2023-08-02 04:53:03
      -

*Thread Reply:* @Maciej Obuchowski - thanks for your input. I understand the desire to evolve the facets independently from the main spec, yet I keep running into a mental wall.

      +

*Thread Reply:* @Maciej Obuchowski - thanks for your input. I understand the desire to evolve the facets independently from the main spec, yet I keep running into a mental wall.

      If I say, 'My application is compatible with OpenLineage 1.0.5' - what does that mean exactly? Does it mean that I am at least compatible with the base definition of RunEvent and its nested components, but not facets?

      @@ -119403,7 +119405,7 @@

      Marquez as an OpenLineage Client

      2023-08-02 04:56:36
      -

*Thread Reply:* If I approach this from a conventional software engineering standpoint, where I provide a library to my consumers: the library has a version associated with it, and that version encompasses all the objects located within that particular library. If I release a new version of my library, it implies that some form of evolution has happened. Whether it is a bug fix, a documentation change, or an evolution of the API of my objects, it means something has changed and the new version is there to indicate that.

      +

*Thread Reply:* If I approach this from a conventional software engineering standpoint, where I provide a library to my consumers: the library has a version associated with it, and that version encompasses all the objects located within that particular library. If I release a new version of my library, it implies that some form of evolution has happened. Whether it is a bug fix, a documentation change, or an evolution of the API of my objects, it means something has changed and the new version is there to indicate that.

      @@ -119429,7 +119431,7 @@

      Marquez as an OpenLineage Client

      2023-08-02 04:56:53
      -

*Thread Reply:* Yes - it means you can read and understand the base spec. Facets are completely optional - reading them might provide you additional information, but you as an event consumer need to define what you do with them. Basically, the needs can be very different between consumers; the spec should not define the behavior of a consumer.

      +

*Thread Reply:* Yes - it means you can read and understand the base spec. Facets are completely optional - reading them might provide you additional information, but you as an event consumer need to define what you do with them. Basically, the needs can be very different between consumers; the spec should not define the behavior of a consumer.
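
A minimal sketch of what that means for a consumer (illustrative Python, not from the spec; names are made up): read the base fields you know, and silently skip any facet you do not recognize.

import json

def summarize_event(raw: str) -> dict:
    """Summarize an OpenLineage RunEvent using only the base spec."""
    event = json.loads(raw)
    summary = {
        "eventType": event.get("eventType"),
        "job": event["job"]["name"],
        "inputs": [d["name"] for d in event.get("inputs", [])],
        "outputs": [d["name"] for d in event.get("outputs", [])],
    }
    # Facets are optional: pick out the ones this consumer cares about
    # and ignore everything else.
    for out in event.get("outputs", []):
        schema = out.get("facets", {}).get("schema")
        if schema:
            summary.setdefault("output_fields", []).extend(
                f["name"] for f in schema.get("fields", [])
            )
    return summary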

      @@ -119459,7 +119461,7 @@

      Marquez as an OpenLineage Client

      2023-08-02 05:01:26
      -

      *Thread Reply:* OK. Thanks for the clarification. That clears things up for me.

      +

      *Thread Reply:* OK. Thanks for the clarification. That clears things up for me.

      @@ -119586,7 +119588,7 @@

      Marquez as an OpenLineage Client

      2023-08-01 13:26:30
      -

      *Thread Reply:* Thanks, @Maciej Obuchowski

      +

      *Thread Reply:* Thanks, @Maciej Obuchowski

      @@ -119612,7 +119614,7 @@

      Marquez as an OpenLineage Client

      2023-08-01 15:43:00
      -

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated within two business days.

      +

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated within two business days.

      @@ -119753,7 +119755,7 @@

      Marquez as an OpenLineage Client

      2023-08-02 10:09:33
      -

      *Thread Reply:* I think the apidocs are not up to date 🙂

      +

      *Thread Reply:* I think the apidocs are not up to date 🙂

      @@ -119779,7 +119781,7 @@

      Marquez as an OpenLineage Client

      2023-08-02 10:09:43
      -

      *Thread Reply:* https://openlineage.io/spec/2-0-2/OpenLineage.json has the newest spec

      +

      *Thread Reply:* https://openlineage.io/spec/2-0-2/OpenLineage.json has the newest spec

      @@ -119805,7 +119807,7 @@

      Marquez as an OpenLineage Client

      2023-08-02 10:44:23
      -

      *Thread Reply:* thanks for the pointer @Maciej Obuchowski

      +

      *Thread Reply:* thanks for the pointer @Maciej Obuchowski

      @@ -119831,7 +119833,7 @@

      Marquez as an OpenLineage Client

      2023-08-02 10:49:17
      -

      *Thread Reply:* Also working on updating the apidocs

      +

      *Thread Reply:* Also working on updating the apidocs

      @@ -119857,7 +119859,7 @@

      Marquez as an OpenLineage Client

      2023-08-02 11:21:14
      -

      *Thread Reply:* The API docs are now up to date @Juan Luis Cano Rodríguez! Thank you for raising this issue.

      +

      *Thread Reply:* The API docs are now up to date @Juan Luis Cano Rodríguez! Thank you for raising this issue.

      @@ -119991,7 +119993,7 @@

      Marquez as an OpenLineage Client

      2023-08-03 05:36:04
      -

*Thread Reply:* I think our underlying SQL parser does not handle the Postgres versions of those queries

      +

*Thread Reply:* I think our underlying SQL parser does not handle the Postgres versions of those queries

      @@ -120017,7 +120019,7 @@

      Marquez as an OpenLineage Client

      2023-08-03 05:36:14
      -

      *Thread Reply:* Can you post the (anonymized?) queries?

      +

      *Thread Reply:* Can you post the (anonymized?) queries?

      @@ -120047,7 +120049,7 @@

      Marquez as an OpenLineage Client

      2023-08-03 07:03:09
      -

      *Thread Reply:* for example

      +

      *Thread Reply:* for example

      copy bi.marquez_test_2 from '******' iam_role '**********' delimiter as '^' gzi
       
      @@ -120076,7 +120078,7 @@

      Marquez as an OpenLineage Client

      2023-08-07 13:35:30
      -

*Thread Reply:* @Zahi Fail iam_role suggests you want the redshift version of this supported, not the Postgres one, right?

      +

*Thread Reply:* @Zahi Fail iam_role suggests you want the redshift version of this supported, not the Postgres one, right?

      @@ -120102,7 +120104,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 04:04:35
      -

      *Thread Reply:* @Maciej Obuchowski hey, actually I tried both Postgres and Redshift to S3 operators. +

      *Thread Reply:* @Maciej Obuchowski hey, actually I tried both Postgres and Redshift to S3 operators. Both of them sent a new event through OL to Marquez, and still wasn’t part of the entire flow.

      @@ -120160,7 +120162,7 @@

      Marquez as an OpenLineage Client

      2023-08-04 02:17:19
      -

      *Thread Reply:* Hey @Athitya Kumar,

      +

      *Thread Reply:* Hey @Athitya Kumar,

1. For parsing SQL queries, we're using sqlparser-rs (https://github.com/sqlparser-rs/sqlparser-rs), which already has great coverage of sql syntax and supports different dialects. It's an open source project and we have already contributed to it for the snowflake dialect.
2. We don't have such a benchmark, but if you like, you could contribute and help us provide one. We do support joins, subqueries, iceberg and delta tables, jdbc for Spark and much more. Everything we support is covered in our tests.
3. Not sure if I got it properly. Marquez is our reference backend implementation, which parses all the facets and stores them in a relational db in a relational manner (facets, jobs, datasets and runs in separate tables).
      @@ -120218,7 +120220,7 @@

      Marquez as an OpenLineage Client

      2023-08-04 02:29:53
      -

*Thread Reply:* For (3), I was referring to where we call the sqlparser-rs in our spark-openlineage event listener / integration, and what customising/improving them would look like

      +

*Thread Reply:* For (3), I was referring to where we call the sqlparser-rs in our spark-openlineage event listener / integration, and what customising/improving them would look like

      @@ -120244,7 +120246,7 @@

      Marquez as an OpenLineage Client

      2023-08-04 02:37:20
      -

*Thread Reply:* sqlparser-rs is a Rust library and we bundle it within iface-java (https://github.com/OpenLineage/OpenLineage/blob/main/integration/sql/iface-java/src/main/java/io/openlineage/sql/SqlMeta.java). It's capable of extracting input/output datasets and column lineage information from SQL

      +

*Thread Reply:* sqlparser-rs is a Rust library and we bundle it within iface-java (https://github.com/OpenLineage/OpenLineage/blob/main/integration/sql/iface-java/src/main/java/io/openlineage/sql/SqlMeta.java). It's capable of extracting input/output datasets and column lineage information from SQL
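
The same parser is also published as a Python package (openlineage-sql), which can be handy for poking at it directly. A rough sketch, assuming the parse entry point and SqlMeta fields of recent releases; the query is made up:

from openlineage_sql import parse

meta = parse(
    ["INSERT INTO warehouse.orders SELECT id, total FROM staging.orders"],
    dialect="snowflake",  # omit to fall back to the generic dialect
)
# SqlMeta exposes the extracted inputs/outputs as table metadata objects.
print(meta.in_tables)   # expected to mention staging.orders
print(meta.out_tables)  # expected to mention warehouse.orders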

      @@ -120376,7 +120378,7 @@

      Marquez as an OpenLineage Client

      2023-08-04 02:40:02
      -

      *Thread Reply:* and this is Spark code that extracts it from JdbcRelation -> https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/[…]ge/spark/agent/lifecycle/plan/handlers/JdbcRelationHandler.java

      +

      *Thread Reply:* and this is Spark code that extracts it from JdbcRelation -> https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/[…]ge/spark/agent/lifecycle/plan/handlers/JdbcRelationHandler.java

      @@ -120511,7 +120513,7 @@

      Marquez as an OpenLineage Client

      2023-08-04 04:08:53
      -

*Thread Reply:* I think the 3rd question relates generally to Spark SQL handling, rather than handling JDBC connections inside Spark, right?

      +

*Thread Reply:* I think the 3rd question relates generally to Spark SQL handling, rather than handling JDBC connections inside Spark, right?

      @@ -120537,7 +120539,7 @@

      Marquez as an OpenLineage Client

      2023-08-04 04:24:57
      -

      *Thread Reply:* Yup, both actually. Related to getting the JDBC connection info in the input/output facet, as well as spark-sql queries we do on that JDBC connection

      +

      *Thread Reply:* Yup, both actually. Related to getting the JDBC connection info in the input/output facet, as well as spark-sql queries we do on that JDBC connection

      @@ -120563,7 +120565,7 @@

      Marquez as an OpenLineage Client

      2023-08-04 06:00:17
      -

*Thread Reply:* For Spark SQL - it's translated to Spark's internal query LogicalPlan. We take that plan and process its nodes. From the root node we can take the output dataset, from leaf nodes we can take input datasets, and inside internal nodes we track columns to extract column-level lineage. We express those (table-level) operations by implementing classes like QueryPlanVisitor

      +

*Thread Reply:* For Spark SQL - it's translated to Spark's internal query LogicalPlan. We take that plan and process its nodes. From the root node we can take the output dataset, from leaf nodes we can take input datasets, and inside internal nodes we track columns to extract column-level lineage. We express those (table-level) operations by implementing classes like QueryPlanVisitor

You can extend that, for example for additional types of nodes that we don't support, by implementing your own QueryPlanVisitor, then implementing OpenLineageEventHandlerFactory and packaging this into a .jar deployed alongside the OpenLineage jar - this would be loaded by us using Java's ServiceLoader.

      Marquez as an OpenLineage Client
      2023-08-08 05:06:07
      -

*Thread Reply:* @Maciej Obuchowski @Paweł Leszczyński - Thanks for your responses! I had a follow-up query regarding the sqlparser-rs that's used internally by open-lineage: the SQL dialects supported by sqlparser-rs here don't include spark-sql / presto-sql dialects, which means they'd fall back to the generic dialect:

      +

*Thread Reply:* @Maciej Obuchowski @Paweł Leszczyński - Thanks for your responses! I had a follow-up query regarding the sqlparser-rs that's used internally by open-lineage: the SQL dialects supported by sqlparser-rs here don't include spark-sql / presto-sql dialects, which means they'd fall back to the generic dialect:

      "--ansi" =&gt; Box::new(AnsiDialect {}), "--bigquery" =&gt; Box::new(BigQueryDialect {}), @@ -120707,7 +120709,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:21:32
      -

*Thread Reply:* the spark-sql integration is based on spark LogicalPlan's tree, extracting input/output datasets from tree nodes, which is more detailed than sql parsing

      +

*Thread Reply:* the spark-sql integration is based on spark LogicalPlan's tree, extracting input/output datasets from tree nodes, which is more detailed than sql parsing

      @@ -120733,7 +120735,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 07:04:52
      -

      *Thread Reply:* I think presto/trino dialect is very standard - there shouldn't be any problems with regular queries

      +

      *Thread Reply:* I think presto/trino dialect is very standard - there shouldn't be any problems with regular queries

      @@ -120759,7 +120761,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 11:19:53
      -

      *Thread Reply:* @Paweł Leszczyński - Got it, and would you be able to point me to where within the openlineage-spark integration do we:

      +

      *Thread Reply:* @Paweł Leszczyński - Got it, and would you be able to point me to where within the openlineage-spark integration do we:

      1. provide the Spark Logical Plan / query to sqlparser-rs
      2. get the output of sqlparser-rs (parsed query AST) & stitch back the inputs/outputs in the open-lineage events?
      @@ -120788,7 +120790,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 12:09:06
      -

      *Thread Reply:* For example, we'd like to understand which dialectname of sqlparser-rs would be used in which scenario by open-lineage and what're the interactions b/w open-lineage & sqlparser-rs

      +

      *Thread Reply:* For example, we'd like to understand which dialectname of sqlparser-rs would be used in which scenario by open-lineage and what're the interactions b/w open-lineage & sqlparser-rs

      @@ -120814,7 +120816,7 @@

      Marquez as an OpenLineage Client

      2023-08-09 12:18:47
      -

*Thread Reply:* @Paweł Leszczyński - In case you missed the above messages ^

      +

*Thread Reply:* @Paweł Leszczyński - In case you missed the above messages ^

      @@ -120840,7 +120842,7 @@

      Marquez as an OpenLineage Client

      2023-08-10 03:31:32
      -

*Thread Reply:* Sqlparser-rs is used within the Spark integration only for spark jdbc queries (queries to external databases). That's the only scenario. For spark.sql(...), instead of SQL parsing, we rely on the logical plan of a job and extract information from it. For jdbc queries, which use sqlparser-rs, the dialect is extracted from the url: +

*Thread Reply:* Sqlparser-rs is used within the Spark integration only for spark jdbc queries (queries to external databases). That's the only scenario. For spark.sql(...), instead of SQL parsing, we rely on the logical plan of a job and extract information from it. For jdbc queries, which use sqlparser-rs, the dialect is extracted from the url: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/util/JdbcUtils.java#L69

      @@ -120922,7 +120924,7 @@

      Marquez as an OpenLineage Client

      2023-08-06 17:25:31
      -

      *Thread Reply:* No, it's not.

      +

      *Thread Reply:* No, it's not.

      @@ -120948,7 +120950,7 @@

      Marquez as an OpenLineage Client

      2023-08-06 23:53:17
      -

      *Thread Reply:* Is it only available for spark version 3+?

      +

      *Thread Reply:* Is it only available for spark version 3+?

      @@ -120974,7 +120976,7 @@

      Marquez as an OpenLineage Client

      2023-08-07 04:53:41
      -

      *Thread Reply:* Yes

      +

      *Thread Reply:* Yes

      @@ -121141,7 +121143,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 04:59:23
      -

*Thread Reply:* Does this work with similar jobs, but with a small number of columns?

      +

*Thread Reply:* Does this work with similar jobs, but with a small number of columns?

      @@ -121167,7 +121169,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:12:52
      -

*Thread Reply:* thanks for the reply @Maciej Obuchowski yes it works for a small number of columns +

*Thread Reply:* thanks for the reply @Maciej Obuchowski yes it works for a small number of columns but not for a big number of columns

      @@ -121194,7 +121196,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:14:04
      -

*Thread Reply:* one more question: how much data do the jobs approximately process, and how long does the execution take?

      +

*Thread Reply:* one more question: how much data do the jobs approximately process, and how long does the execution take?

      @@ -121220,7 +121222,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:14:54
      -

*Thread Reply:* ah… it’s like 20 min ~ 30 min, varies +

*Thread Reply:* ah… it’s like 20 min ~ 30 min, varies; data size is like 2000,0000 rows with 100 ~ 1000 columns

      @@ -121247,7 +121249,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:15:17
      -

*Thread Reply:* that's interesting. we could prepare an integration test for that. 100 cols shouldn't make a difference

      +

*Thread Reply:* that's interesting. we could prepare an integration test for that. 100 cols shouldn't make a difference

      @@ -121273,7 +121275,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:15:37
      -

*Thread Reply:* honestly sorry for the typo, it's 1000 columns

      +

*Thread Reply:* honestly sorry for the typo, it's 1000 columns

      @@ -121299,7 +121301,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:15:44
      -

      *Thread Reply:* pivoting features

      +

      *Thread Reply:* pivoting features

      @@ -121325,7 +121327,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:16:09
      -

*Thread Reply:* i checked, it works well for small numbers of columns

      +

*Thread Reply:* i checked, it works well for small numbers of columns

      @@ -121351,7 +121353,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:16:39
      -

*Thread Reply:* if it's 1000, then maybe we're over the event size - the event is too large and the backend can't accept it

      +

*Thread Reply:* if it's 1000, then maybe we're over the event size - the event is too large and the backend can't accept it

      @@ -121377,7 +121379,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:17:06
      -

      *Thread Reply:* maybe debug logs could tell us something

      +

      *Thread Reply:* maybe debug logs could tell us something

      @@ -121403,7 +121405,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:19:27
      -

      *Thread Reply:* i’ll do spark.sparkContext.setLogLevel("DEBUG") ing

      +

      *Thread Reply:* i’ll do spark.sparkContext.setLogLevel("DEBUG") ing

      @@ -121429,7 +121431,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:19:30
      -

*Thread Reply:* are there any errors in the logs? perhaps pivoting uses nodes in SparkPlan that we don't support yet

      +

*Thread Reply:* are there any errors in the logs? perhaps pivoting uses nodes in SparkPlan that we don't support yet

      @@ -121455,7 +121457,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:19:52
      -

*Thread Reply:* did you check pivoting that results in fewer columns?

      +

*Thread Reply:* did you check pivoting that results in fewer columns?

      @@ -121481,7 +121483,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:20:33
      -

      *Thread Reply:* @추호관 would also be good to disable logicalPlan facet: +

      *Thread Reply:* @추호관 would also be good to disable logicalPlan facet: spark.openlineage.facets.disabled: [spark_unknown;spark.logicalPlan] in spark conf

      @@ -121509,7 +121511,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:23:40
      -

*Thread Reply:* got it. can we do it in python config? +

*Thread Reply:* got it. can we do it in python config? .config("spark.dynamicAllocation.enabled", "true") \ .config("spark.dynamicAllocation.initialExecutors", "5") \ .config("spark.openlineage.facets.disabled", [spark_unknown;spark.logicalPlan]

      @@ -121538,7 +121540,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:24:31
      -

      *Thread Reply:* .config("spark.dynamicAllocation.enabled", "true") \ +

      *Thread Reply:* .config("spark.dynamicAllocation.enabled", "true") \ .config("spark.dynamicAllocation.initialExecutors", "5") \ .config("spark.openlineage.facets.disabled", "[spark_unknown;spark.logicalPlan]"

      @@ -121566,7 +121568,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:24:42
      -

      *Thread Reply:* ah.. string got it

      +

      *Thread Reply:* ah.. string got it

      @@ -121592,7 +121594,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:36:03
      -

*Thread Reply:* ah… there are no errors nor debug-level issues; it successfully Registered listener io.openlineage.spark.agent.OpenLineageSparkListener

      +

*Thread Reply:* ah… there are no errors nor debug-level issues; it successfully Registered listener io.openlineage.spark.agent.OpenLineageSparkListener

      @@ -121618,7 +121620,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:39:40
      -

*Thread Reply:* maybe df.groupBy(some column).pivot(some_column).agg(*agg_cols) is not supported

      +

*Thread Reply:* maybe df.groupBy(some column).pivot(some_column).agg(*agg_cols) is not supported
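
For anyone wanting to reproduce this, a minimal PySpark sketch of the job shape under discussion (all names and values are illustrative), which could seed a failing test:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", "m1", 1.0), ("a", "m2", 2.0), ("b", "m1", 3.0)],
    ["key", "metric", "value"],
)

# Pivoting fans the distinct values of 'metric' out into output columns;
# with ~1000 distinct values the resulting logical plan gets very wide.
agg_cols = [F.sum("value").alias("value_sum")]
pivoted = df.groupBy("key").pivot("metric").agg(*agg_cols)
pivoted.write.mode("overwrite").parquet("/tmp/pivot_test")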

      @@ -121644,7 +121646,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:43:44
      -

      *Thread Reply:* oh.. interesting spark.openlineage.facets.disabled this option gives me output when eventType is START +

      *Thread Reply:* oh.. interesting spark.openlineage.facets.disabled this option gives me output when eventType is START “eventType”: “START” “outputs”: [ … @@ -121676,7 +121678,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:54:13
      -

      *Thread Reply:* Yes +

*Thread Reply:* Yes "spark.openlineage.facets.disabled", "[spark_unknown;spark.logicalPlan]" &lt;- this option gives output when eventType is START, but does not give output with bunches of columns when that config is not set

      @@ -121703,7 +121705,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:55:18
      -

*Thread Reply:* this option prevents the logicalPlan from being serialized and sent as part of the OpenLineage event, where it is included in one of the facets

      +

*Thread Reply:* this option prevents the logicalPlan from being serialized and sent as part of the OpenLineage event, where it is included in one of the facets

      @@ -121729,7 +121731,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:56:12
      -

*Thread Reply:* possibly, serializing logicalPlans, in the case of pivots, leads to event sizes that are not acceptable

      +

*Thread Reply:* possibly, serializing logicalPlans, in the case of pivots, leads to event sizes that are not acceptable

      @@ -121755,7 +121757,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:57:56
      -

*Thread Reply:* Ah… so you mean the pivot makes the serialized logical plan too large to generate an event. +

*Thread Reply:* Ah… so you mean the pivot makes the serialized logical plan too large to generate an event, and disabling the logical plan facet makes it possible to generate the event because the pivot's logical plan is no longer serialized.

      Can we overcome this

      @@ -121784,7 +121786,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:58:48
      -

      *Thread Reply:* we've seen such issues for some plans some time ago

      +

      *Thread Reply:* we've seen such issues for some plans some time ago

      @@ -121814,7 +121816,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:59:29
      -

      *Thread Reply:* oh…. how did you solve it?

      +

      *Thread Reply:* oh…. how did you solve it?

      @@ -121840,7 +121842,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:59:32
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/c3a5211f919c01870a7f79f48588177a9b[…]io/openlineage/spark/agent/lifecycle/LogicalPlanSerializer.java

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/c3a5211f919c01870a7f79f48588177a9b[…]io/openlineage/spark/agent/lifecycle/LogicalPlanSerializer.java

      @@ -121895,7 +121897,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 05:59:51
      -

      *Thread Reply:* by excluding some properties from plan to be serialized

      +

      *Thread Reply:* by excluding some properties from plan to be serialized

      @@ -121921,7 +121923,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 06:01:14
      -

      *Thread Reply:* here https://github.com/OpenLineage/OpenLineage/blob/c3a5211f919c01870a7f79f48588177a9b[…]io/openlineage/spark/agent/lifecycle/LogicalPlanSerializer.java we exclude certain classes

      +

      *Thread Reply:* here https://github.com/OpenLineage/OpenLineage/blob/c3a5211f919c01870a7f79f48588177a9b[…]io/openlineage/spark/agent/lifecycle/LogicalPlanSerializer.java we exclude certain classes

      @@ -121976,7 +121978,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 06:02:00
      -

*Thread Reply:* AH…. the excluded properties cause the logical plans of pivoting to be ignored

      +

*Thread Reply:* AH…. the excluded properties cause the logical plans of pivoting to be ignored

      @@ -122002,7 +122004,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 06:08:25
      -

      *Thread Reply:* you can start with writing a failing test here -> https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]/openlineage/spark/agent/lifecycle/SparkReadWriteIntegTest.java

      +

      *Thread Reply:* you can start with writing a failing test here -> https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]/openlineage/spark/agent/lifecycle/SparkReadWriteIntegTest.java

then you can try to debug the logical plan to find out what should be excluded from it when it's being serialized. Even if you find this difficult, a failing integration test is super helpful to let others help you with it.

      Marquez as an OpenLineage Client
      2023-08-08 06:24:54
      -

*Thread Reply:* okay i would look into it and maybe PR, thanks

      +

*Thread Reply:* okay i would look into it and maybe PR, thanks

      @@ -122258,7 +122260,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 06:38:45
      -

      *Thread Reply:* Can I ask if there are any suspicious properties?

      +

      *Thread Reply:* Can I ask if there are any suspicious properties?

      @@ -122284,7 +122286,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 06:39:25
      -

      *Thread Reply:* sure

      +

      *Thread Reply:* sure

      @@ -122318,7 +122320,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 07:10:40
      -

      *Thread Reply:* Thanks I would also try to find the property too

      +

      *Thread Reply:* Thanks I would also try to find the property too

      @@ -122370,7 +122372,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 07:49:55
      -

      *Thread Reply:* Hi @Anirudh Shrinivason, +

*Thread Reply:* Hi @Anirudh Shrinivason, I think I would take a look at this https://sqlglot.com/sqlglot/diff.html

      @@ -122398,7 +122400,7 @@

      Marquez as an OpenLineage Client

      2023-08-09 23:12:37
      -

      *Thread Reply:* Hey @Guy Biecher Yeah I was looking at this... but it seems to calculate similarity from a more textual context, as opposed to a more semantic one... +

*Thread Reply:* Hey @Guy Biecher Yeah I was looking at this... but it seems to calculate similarity from a more textual context, as opposed to a more semantic one... eg: SELECT ** FROM TABLE_1 and SELECT col1,col2,col3 FROM TABLE_1 could be the same semantic query, but sqlglot would give diffs in the ast because it's textual...

      @@ -122425,7 +122427,7 @@

      Marquez as an OpenLineage Client

      2023-08-10 02:26:51
      -

*Thread Reply:* I totally get you. In such cases, without the metadata of TABLE_1, it's impossible. What I would do is replace all ** before you use the diff function.

      +

*Thread Reply:* I totally get you. In such cases, without the metadata of TABLE_1, it's impossible. What I would do is replace all ** before you use the diff function.

      @@ -122451,7 +122453,7 @@

      Marquez as an OpenLineage Client

      2023-08-10 07:04:37
      -

      *Thread Reply:* Yeah I was thinking about the same... But the more nested and complex your queries get, the harder it'll become to accurately pre-process before running the ast diff too... +

*Thread Reply:* Yeah I was thinking about the same... But the more nested and complex your queries get, the harder it'll become to accurately pre-process before running the ast diff too... But yeah that's probably the approach I'd be taking haha... Happy to discuss and learn if there are better ways of doing this
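
One way to make the comparison more semantic before diffing, as discussed above: expand the star against known table metadata, then diff the ASTs. A sketch assuming sqlglot's qualify rule expands stars when a schema is supplied; the schema is made up:

from sqlglot import diff, parse_one
from sqlglot.optimizer.qualify import qualify

schema = {"table_1": {"col1": "int", "col2": "int", "col3": "int"}}

q1 = qualify(parse_one("SELECT * FROM table_1"), schema=schema)
q2 = qualify(parse_one("SELECT col1, col2, col3 FROM table_1"), schema=schema)

# After star expansion both ASTs should describe the same projection,
# so the edit script should consist mostly of 'keep' operations.
for edit in diff(q1, q2):
    print(edit)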

      @@ -122555,14 +122557,14 @@

      Marquez as an OpenLineage Client

      "$ref": "#/$defs/Job" }, "inputs": { - "description": "The set of <strong><em>*input</em><em></strong> datasets.", + "description": "The set of <strong><em>*input</em></strong>* datasets.", "type": "array", "items": { "$ref": "#/$defs/InputDataset" } }, "outputs": { - "description": "The set of <strong></em><em>output</em><em></strong> datasets.", + "description": "The set of <strong><em>*output</em></strong>* datasets.", "type": "array", "items": { "$ref": "#/$defs/OutputDataset" @@ -122598,14 +122600,14 @@

      Marquez as an OpenLineage Client

      "$ref": "#/$defs/Job" }, "inputs": { - "description": "The set of <strong></em><em>input</em><em></strong> datasets.", + "description": "The set of <strong><em>*input</em></strong>* datasets.", "type": "array", "items": { "$ref": "#/$defs/InputDataset" } }, "outputs": { - "description": "The set of <strong></em><em>output</em>*</strong> datasets.", + "description": "The set of <strong><em>*output</em></strong>* datasets.", "type": "array", "items": { "$ref": "#/$defs/OutputDataset" @@ -122824,7 +122826,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 09:53:05
      -

      *Thread Reply:* Hey @Luigi Scorzato! +

      *Thread Reply:* Hey @Luigi Scorzato! You can read about these 2 event types in this blog post: https://openlineage.io/blog/static-lineage

      @@ -122876,7 +122878,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 09:53:38
      -

      *Thread Reply:* we’ll work on getting the documentation improved to clarify the expected use cases for each event type. this is a relatively new addition to the spec.

      +

      *Thread Reply:* we’ll work on getting the documentation improved to clarify the expected use cases for each event type. this is a relatively new addition to the spec.

      @@ -122906,7 +122908,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 10:08:28
      -

      *Thread Reply:* this sounds relevant for my 3rd question, doesn't it? But I do not see scheduling information among the use cases, am I wrong?

      +

      *Thread Reply:* this sounds relevant for my 3rd question, doesn't it? But I do not see scheduling information among the use cases, am I wrong?

      @@ -122932,7 +122934,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 11:16:39
      -

      *Thread Reply:* you’re not wrong, these 2 events were not designed for runtime lineage, but rather “static” lineage that gets emitted after the fact

      +

      *Thread Reply:* you’re not wrong, these 2 events were not designed for runtime lineage, but rather “static” lineage that gets emitted after the fact

      @@ -122984,7 +122986,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 09:54:40
      -

      *Thread Reply:* great question! did you get a chance to look at the current Flink integration?

      +

      *Thread Reply:* great question! did you get a chance to look at the current Flink integration?

      @@ -123010,7 +123012,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 10:07:06
      -

*Thread Reply:* to be honest, I only quickly went through this and I did not identify what I needed. Can you please point me to the relevant section?

      +

*Thread Reply:* to be honest, I only quickly went through this and I did not identify what I needed. Can you please point me to the relevant section?

      openlineage.io
      @@ -123057,7 +123059,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 11:13:17
      -

      *Thread Reply:* here’s an example START event for Flink: https://github.com/OpenLineage/OpenLineage/blob/main/integration/flink/src/test/resources/events/expected_kafka.json

      +

      *Thread Reply:* here’s an example START event for Flink: https://github.com/OpenLineage/OpenLineage/blob/main/integration/flink/src/test/resources/events/expected_kafka.json

      @@ -123181,7 +123183,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 11:13:26
      -

      *Thread Reply:* or a checkpoint (RUNNING) event: https://github.com/OpenLineage/OpenLineage/blob/main/integration/flink/src/test/resources/events/expected_kafka_checkpoints.json

      +

      *Thread Reply:* or a checkpoint (RUNNING) event: https://github.com/OpenLineage/OpenLineage/blob/main/integration/flink/src/test/resources/events/expected_kafka_checkpoints.json

      @@ -123314,7 +123316,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 11:15:55
      -

      *Thread Reply:* generally speaking, you can see the execution contexts that invoke generation of OL events here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/flink/src/main/ja[…]/openlineage/flink/visitor/lifecycle/FlinkExecutionContext.java

      +

      *Thread Reply:* generally speaking, you can see the execution contexts that invoke generation of OL events here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/flink/src/main/ja[…]/openlineage/flink/visitor/lifecycle/FlinkExecutionContext.java

      @@ -123533,7 +123535,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 17:46:17
      -

*Thread Reply:* thank you! So, if I understand correctly, the key is that even eventType=START admits output datasets. Correct? What determines how often the eventType=RUNNING events are emitted?

      +

*Thread Reply:* thank you! So, if I understand correctly, the key is that even eventType=START admits output datasets. Correct? What determines how often the eventType=RUNNING events are emitted?

      @@ -123563,7 +123565,7 @@

      Marquez as an OpenLineage Client

      2023-08-09 03:25:16
      -

      *Thread Reply:* now I see, RUNNING events are emitted on onJobCheckpoint

      +

      *Thread Reply:* now I see, RUNNING events are emitted on onJobCheckpoint

      @@ -123615,7 +123617,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 11:10:47
      -

      *Thread Reply:* For Airflow, this is part of the AirflowRunFacet, here: https://github.com/OpenLineage/OpenLineage/blob/81372ca2bc2afecab369eab4a54cc6380dda49d0/integration/airflow/facets/AirflowRunFacet.json#L100

      +

      *Thread Reply:* For Airflow, this is part of the AirflowRunFacet, here: https://github.com/OpenLineage/OpenLineage/blob/81372ca2bc2afecab369eab4a54cc6380dda49d0/integration/airflow/facets/AirflowRunFacet.json#L100

      For other orchestrators / schedulers, that would depend..

      @@ -123673,7 +123675,7 @@

      Marquez as an OpenLineage Client

      2023-08-08 12:15:40
      -

      *Thread Reply:* I might be wrong, but I believe it's unique for each cluster - the common part is dbfs\.

      +

      *Thread Reply:* I might be wrong, but I believe it's unique for each cluster - the common part is dbfs\.

      @@ -123699,7 +123701,7 @@

      Marquez as an OpenLineage Client

      2023-08-09 02:38:54
      -

      *Thread Reply:* dbfs is mounted to a databricks workspace which can run multiple clusters. so i think, it's common.

      +

      *Thread Reply:* dbfs is mounted to a databricks workspace which can run multiple clusters. so i think, it's common.

      Worth mentioning: init-scripts located in dbfs are becoming deprecated next month and we plan moving them into workspaces.

      @@ -123731,7 +123733,7 @@

      Marquez as an OpenLineage Client

      2023-08-11 01:33:24
      -

      *Thread Reply:* yes, the init scripts are moved at workspace level.

      +

      *Thread Reply:* yes, the init scripts are moved at workspace level.

      @@ -123846,7 +123848,7 @@

      Marquez as an OpenLineage Client

      2023-08-09 02:35:28
      -

*Thread Reply:* great to hear. I still need some time as there are a few corner cases. For example: what should be the behaviour when alter table rename is called 😉 But sure, you can test it if you like. ci is failing on integration tests, but ./gradlew clean build with unit tests is fine.

      +

*Thread Reply:* great to hear. I still need some time as there are a few corner cases. For example: what should be the behaviour when alter table rename is called 😉 But sure, you can test it if you like. ci is failing on integration tests, but ./gradlew clean build with unit tests is fine.

      @@ -123876,7 +123878,7 @@

      Marquez as an OpenLineage Client

      2023-08-10 03:33:50
      -

*Thread Reply:* @GitHubOpenLineageIssues Feel invited to join today's community meeting and advocate for the importance of this issue. Such discussions are extremely helpful in prioritising the backlog the right way.

      +

*Thread Reply:* @GitHubOpenLineageIssues Feel invited to join today's community meeting and advocate for the importance of this issue. Such discussions are extremely helpful in prioritising the backlog the right way.

      @@ -123929,7 +123931,7 @@

      Marquez as an OpenLineage Client

      2023-08-09 07:55:28
      -

      *Thread Reply:* Query: +

      *Thread Reply:* Query: INSERT INTO test.merchant_md (Select m.`id`, @@ -123968,7 +123970,7 @@

      Marquez as an OpenLineage Client

      2023-08-09 08:01:56
      -

      *Thread Reply:* "columnLineage":{ +

      *Thread Reply:* "columnLineage":{ "_producer":"<https://github.com/OpenLineage/OpenLineage/tree/0.30.1/integration/spark>", "_schemaURL":"<https://openlineage.io/spec/facets/1-0-1/ColumnLineageDatasetFacet.json#/$defs/ColumnLineageDatasetFacet>", "fields":{ @@ -124062,7 +124064,7 @@

      Marquez as an OpenLineage Client

      2023-08-09 08:23:57
      -

      *Thread Reply:* "contact_name":{ +

      *Thread Reply:* "contact_name":{ "inputFields":[ { "namespace":"<s3a://datalake>", @@ -124151,7 +124153,7 @@

      Marquez as an OpenLineage Client

      2023-08-09 11:23:42
      -

      *Thread Reply:* > OL-spark version to 0.6.+ +

      *Thread Reply:* > OL-spark version to 0.6.+ This OL version is ancient. You can try with 1.0.0

      I think you're hitting this issue which duplicates jobs: https://github.com/OpenLineage/OpenLineage/issues/1943

      @@ -124216,7 +124218,7 @@

      Marquez as an OpenLineage Client

      2023-08-10 01:46:08
      -

      *Thread Reply:* I haven’t mentioned that I tried multiple OL versions - 1.0.0 / 0.30.1 / 0.6.+ … +

      *Thread Reply:* I haven’t mentioned that I tried multiple OL versions - 1.0.0 / 0.30.1 / 0.6.+ … None of them worked for me. @Maciej Obuchowski

      @@ -124244,7 +124246,7 @@

      Marquez as an OpenLineage Client

      2023-08-10 05:25:49
      -

      *Thread Reply:* @Zahi Fail understood. Can you provide sample job that reproduces this behavior, and possibly some logs?

      +

      *Thread Reply:* @Zahi Fail understood. Can you provide sample job that reproduces this behavior, and possibly some logs?

      @@ -124270,7 +124272,7 @@

      Marquez as an OpenLineage Client

      2023-08-10 05:26:11
      -

*Thread Reply:* If you can, it might be better to create an issue on github and communicate there.

      +

*Thread Reply:* If you can, it might be better to create an issue on github and communicate there.

      @@ -124296,7 +124298,7 @@

      Marquez as an OpenLineage Client

      2023-08-10 08:34:01
      -

*Thread Reply:* Before creating an issue in GIT, I wanted to check if my issue is only related to version compatibility..

      +

*Thread Reply:* Before creating an issue in GIT, I wanted to check if my issue is only related to version compatibility..

      This is the sample of my test: ```from pyspark.sql import SparkSession @@ -124360,7 +124362,7 @@

      csv_file = location.csv

      2023-08-10 08:40:13
      -

      *Thread Reply:* try spark.openlineage.transport.url instead of spark.openlineage.host

      +

      *Thread Reply:* try spark.openlineage.transport.url instead of spark.openlineage.host

      @@ -124386,7 +124388,7 @@

      csv_file = location.csv

      2023-08-10 08:40:27
      -

      *Thread Reply:* and possibly link the doc where you've seen spark.openlineage.host 🙂

      +

      *Thread Reply:* and possibly link the doc where you've seen spark.openlineage.host 🙂

      @@ -124412,7 +124414,7 @@

      csv_file = location.csv

      2023-08-10 08:59:27
      -

      *Thread Reply:* https://openlineage.io/blog/openlineage-spark/

      +

      *Thread Reply:* https://openlineage.io/blog/openlineage-spark/

      openlineage.io
      @@ -124472,7 +124474,7 @@

      csv_file = location.csv

      2023-08-10 09:04:56
      -

      *Thread Reply:* changing to “spark.openlineage.transport.url” didn’t make any change

      +

      *Thread Reply:* changing to “spark.openlineage.transport.url” didn’t make any change

      @@ -124498,7 +124500,7 @@

      csv_file = location.csv

      2023-08-10 09:09:42
      -

      *Thread Reply:* do you see the ConsoleTransport log? it suggests Spark integration did not register that you want to send events to Marquez

      +

      *Thread Reply:* do you see the ConsoleTransport log? it suggests Spark integration did not register that you want to send events to Marquez

      @@ -124524,7 +124526,7 @@

      csv_file = location.csv

      2023-08-10 09:10:09
      -

      *Thread Reply:* let's try adding spark.openlineage.transport.type to http

      +

      *Thread Reply:* let's try adding spark.openlineage.transport.type to http

      @@ -124550,7 +124552,7 @@

      csv_file = location.csv

      2023-08-10 09:14:50
      -

      *Thread Reply:* Now it works !

      +

      *Thread Reply:* Now it works !

      @@ -124576,7 +124578,7 @@

      csv_file = location.csv

      2023-08-10 09:14:58
      -

      *Thread Reply:* thanks @Maciej Obuchowski

      +

      *Thread Reply:* thanks @Maciej Obuchowski

      @@ -124602,7 +124604,7 @@

      csv_file = location.csv

      2023-08-10 09:23:04
      -

*Thread Reply:* Cool 🙂 however it should not require that if you provide spark.openlineage.transport.url - I'll create an issue for debugging that.

      +

*Thread Reply:* Cool 🙂 however it should not require that if you provide spark.openlineage.transport.url - I'll create an issue for debugging that.
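
For reference, the combination that worked here, folded into one sketch of a complete PySpark session (the package version, URL and namespace are placeholders):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ol_marquez_test")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:1.0.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")
    .config("spark.openlineage.namespace", "my_namespace")
    .getOrCreate()
)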

      @@ -124724,7 +124726,7 @@

      csv_file = location.csv

      2023-08-10 02:55:46
      -

*Thread Reply:* Let me first rephrase my understanding of the question: assume a user runs spark.sql('INSERT INTO ...'). Are we able to include the sql query INSERT INTO ... within the SQL facet?

      +

*Thread Reply:* Let me first rephrase my understanding of the question: assume a user runs spark.sql('INSERT INTO ...'). Are we able to include the sql query INSERT INTO ... within the SQL facet?

We once had a look at it and found it difficult. Given an SQL query, spark immediately translates it to a logical plan (which our integration is based on) and we didn't find any place where we could inject our code and get access to the sql being run.

      @@ -124752,7 +124754,7 @@

      csv_file = location.csv

      2023-08-10 04:27:51
      -

      *Thread Reply:* Got it. So for spark.sql() - there's no interaction with sqlparser-rs and we directly try stitching the input/output & column lineage from the spark logical plan. Would something like this fall under the spark.jdbc() route or the spark.sql() route (say, if the df is collected / written somewhere)?

      +

      *Thread Reply:* Got it. So for spark.sql() - there's no interaction with sqlparser-rs and we directly try stitching the input/output & column lineage from the spark logical plan. Would something like this fall under the spark.jdbc() route or the spark.sql() route (say, if the df is collected / written somewhere)?

      val df = spark.read.format("jdbc") .option("url", url) @@ -124785,7 +124787,7 @@

      csv_file = location.csv

      2023-08-10 05:15:17
      -

*Thread Reply:* @Athitya Kumar I understand your issue. From my side, there's one problem with this - potentially there can be multiple queries for one spark job. You can imagine something like joining the results of two queries - possibly from separate systems - and then one SqlJobFacet would be misleading. This needs a more thorough spec discussion

      +

*Thread Reply:* @Athitya Kumar I understand your issue. From my side, there's one problem with this - potentially there can be multiple queries for one spark job. You can imagine something like joining the results of two queries - possibly from separate systems - and then one SqlJobFacet would be misleading. This needs a more thorough spec discussion
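
For context, the jdbc route in PySpark terms, i.e. the path where the integration hands the query string to sqlparser-rs (a sketch with placeholder connection values, assuming an existing SparkSession named spark):

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/analytics")
    .option("query", "SELECT id, total FROM public.orders")
    .option("user", "reader")
    .option("password", "...")  # placeholder
    .load()
)
# Writing the result out makes this a complete job with an output dataset.
df.write.mode("overwrite").parquet("/tmp/orders_copy")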

      @@ -124871,7 +124873,7 @@

      csv_file = location.csv

      2023-08-10 05:42:20
      -

*Thread Reply:* I'm using the python OpenLineage package and extending the BaseFacet class

      +

*Thread Reply:* I'm using the python OpenLineage package and extending the BaseFacet class

      @@ -124897,7 +124899,7 @@

      csv_file = location.csv

      2023-08-10 05:53:57
      -

      *Thread Reply:* for custom facets, as long as it's valid json - go for it

      +

      *Thread Reply:* for custom facets, as long as it's valid json - go for it

      @@ -124923,7 +124925,7 @@

      csv_file = location.csv

      2023-08-10 05:55:03
      -

*Thread Reply:* However, I tried to insert a list of strings. And when I tried to get the dataset, the returned value of that list field was empty.

      +

*Thread Reply:* However, I tried to insert a list of strings. And when I tried to get the dataset, the returned value of that list field was empty.

      @@ -124949,7 +124951,7 @@

      csv_file = location.csv

      2023-08-10 05:55:57
      -

      *Thread Reply:* @attr.s +

      *Thread Reply:* @attr.s class MyFacet(BaseFacet): columns: list[str] = attr.ib() Here's my python code.
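
Putting the pieces from this thread together, a minimal end-to-end sketch (the facet name, URL and producer are illustrative):

from datetime import datetime, timezone
from uuid import uuid4

import attr
from openlineage.client import OpenLineageClient
from openlineage.client.facet import BaseFacet
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

@attr.s
class MyFacet(BaseFacet):
    columns: list = attr.ib()  # a plain list of strings, as discussed

client = OpenLineageClient(url="http://localhost:5000")

dataset = Dataset(
    namespace="example",
    name="my_table",
    facets={"myFacet": MyFacet(columns=["col1", "col2"])},
)

client.emit(
    RunEvent(
        eventType=RunState.COMPLETE,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=str(uuid4())),
        job=Job(namespace="example", name="my_job"),
        producer="https://example.com/my-producer",
        inputs=[],
        outputs=[dataset],
    )
)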

      @@ -124978,7 +124980,7 @@

      csv_file = location.csv

      2023-08-10 05:59:02
      -

*Thread Reply:* How did you emit and serialize the event, and where did you look when you said you tried to get the dataset?

      +

*Thread Reply:* How did you emit and serialize the event, and where did you look when you said you tried to get the dataset?

      @@ -125004,7 +125006,7 @@

      csv_file = location.csv

      2023-08-10 06:00:27
      -

      *Thread Reply:* I assume the problem is somewhere there, not on the level of facet definition, since SchemaDatasetFacet looks pretty much the same and it works

      +

      *Thread Reply:* I assume the problem is somewhere there, not on the level of facet definition, since SchemaDatasetFacet looks pretty much the same and it works

      @@ -125039,7 +125041,7 @@

      csv_file = location.csv

      2023-08-10 06:00:54
      -

      *Thread Reply:* I use the python openlineage client to emit the RunEvent. +

      *Thread Reply:* I use the python openlineage client to emit the RunEvent. openlineage_client.emit( RunEvent( eventType=RunState.COMPLETE, @@ -125076,7 +125078,7 @@

      csv_file = location.csv

      2023-08-10 06:02:12
      -

*Thread Reply:* Yah, a list of objects is working, but a list of strings is not. 😩

      +

*Thread Reply:* Yah, a list of objects is working, but a list of strings is not. 😩

      @@ -125102,7 +125104,7 @@

      csv_file = location.csv

      2023-08-10 06:03:23
      -

      *Thread Reply:* I think the problem is related to the openlineage package openlineage.client.serde.py. The function Serde.to_json()

      +

      *Thread Reply:* I think the problem is related to the openlineage package openlineage.client.serde.py. The function Serde.to_json()

      @@ -125128,7 +125130,7 @@

      csv_file = location.csv

      2023-08-10 06:05:56
      -

      *Thread Reply:*

      +

      *Thread Reply:*

      @@ -125163,7 +125165,7 @@

      csv_file = location.csv

      2023-08-10 06:19:34
      -

      *Thread Reply:* I think the code here filters out those string values in the list

      +

      *Thread Reply:* I think the code here filters out those string values in the list

      @@ -125198,7 +125200,7 @@

      csv_file = location.csv

      2023-08-10 06:21:39
      -

      *Thread Reply:* 👀

      +

      *Thread Reply:* 👀

      @@ -125224,7 +125226,7 @@

      csv_file = location.csv

      2023-08-10 06:24:48
      -

*Thread Reply:* Yah, the values in the list will end up False in this check and be filtered out +

*Thread Reply:* Yah, the values in the list will end up False in this check and be filtered out: isinstance(x, dict)

      😳
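
A standalone sketch of the failure mode (illustrative, not the exact Serde source): a filter meant to keep only nested structures also drops plain strings from lists.

def buggy_filter(values: list) -> list:
    # Only dict elements survive, so ["col1", "col2"] becomes [].
    return [x for x in values if isinstance(x, dict)]

def fixed_filter(values: list) -> list:
    # Keep any non-null element instead of requiring a dict.
    return [x for x in values if x is not None]

print(buggy_filter(["col1", "col2"]))  # [] (the behavior reported above)
print(fixed_filter(["col1", "col2"]))  # ['col1', 'col2']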

      @@ -125253,7 +125255,7 @@

      csv_file = location.csv

      2023-08-10 06:26:33
      -

      *Thread Reply:* wow, that's right 😬

      +

      *Thread Reply:* wow, that's right 😬

      @@ -125279,7 +125281,7 @@

      csv_file = location.csv

      2023-08-10 06:26:47
      -

      *Thread Reply:* want to create PR fixing that?

      +

      *Thread Reply:* want to create PR fixing that?

      @@ -125305,7 +125307,7 @@

      csv_file = location.csv

      2023-08-10 06:27:20
      -

      *Thread Reply:* Sure! May do this later tomorrow.

      +

      *Thread Reply:* Sure! May do this later tomorrow.

      @@ -125335,7 +125337,7 @@

      csv_file = location.csv

      2023-08-10 23:59:28
      -

      *Thread Reply:* I created the pr at https://github.com/OpenLineage/OpenLineage/pull/2044 +

      *Thread Reply:* I created the pr at https://github.com/OpenLineage/OpenLineage/pull/2044 But the ci on integration-test-integration-spark FAILED

      @@ -125432,7 +125434,7 @@

      csv_file = location.csv

      2023-08-11 04:17:01
      -

*Thread Reply:* @Steven sorry for that - some tests require credentials that are not present on the forked versions of CI. It will work once I push it to origin. Anyway, the failing Spark tests aren't a blocker for this Python PR

      +

*Thread Reply:* @Steven sorry for that - some tests require credentials that are not present on the forked versions of CI. It will work once I push it to origin. Anyway, the failing Spark tests aren't a blocker for this Python PR

      @@ -125458,7 +125460,7 @@

      csv_file = location.csv

      2023-08-11 04:17:45
      -

*Thread Reply:* I would only ask to add some tests for that case with facets containing a list of strings

      +

*Thread Reply:* I would only ask to add some tests for that case with facets containing a list of strings

      @@ -125484,7 +125486,7 @@

      csv_file = location.csv

      2023-08-11 04:18:21
      -

      *Thread Reply:* Yeah sure, I will add them now

      +

      *Thread Reply:* Yeah sure, I will add them now

      @@ -125510,7 +125512,7 @@

      csv_file = location.csv

      2023-08-11 04:25:19
      -

*Thread Reply:* ah, we had another CI problem, the go version was too old in one of the jobs - nevertheless I won't judge your PR on stuff failing outside your PR anyway 🙂

      +

*Thread Reply:* ah, we had another CI problem, the go version was too old in one of the jobs - nevertheless I won't judge your PR on stuff failing outside your PR anyway 🙂

      @@ -125536,7 +125538,7 @@

      csv_file = location.csv

      2023-08-11 04:36:57
      -

      *Thread Reply:* LOL🤣 I've added some tests and made a force push

      +

      *Thread Reply:* LOL🤣 I've added some tests and made a force push

      @@ -125562,7 +125564,7 @@

      csv_file = location.csv

      2023-10-20 08:31:45
      -

      *Thread Reply:* @GitHubOpenLineageIssues +

*Thread Reply:* @GitHubOpenLineageIssues I am trying to contribute to Integration tests, which are listed here as a good first issue. The CONTRIBUTING.md mentions that i can trigger CI for integration tests from a forked branch using this tool. @@ -125603,7 +125605,7 @@

      csv_file = location.csv

      2023-10-23 04:57:44
      -

*Thread Reply:* what PR is the problem related to? I can run git-push-fork-to-upstream-branch for you

      +

*Thread Reply:* what PR is the problem related to? I can run git-push-fork-to-upstream-branch for you

      @@ -125629,7 +125631,7 @@

      csv_file = location.csv

      2023-10-25 01:08:41
      -

      *Thread Reply:* @Paweł Leszczyński thanks for approving my PR - ( link )

      +

      *Thread Reply:* @Paweł Leszczyński thanks for approving my PR - ( link )

I will make the changes needed for the new integration test case for drop table (good first issue) in another PR. I would need your help to run the integration tests again, thank you

      @@ -125682,7 +125684,7 @@

      csv_file = location.csv

      2023-10-26 07:48:52
      -

      *Thread Reply:* @Paweł Leszczyński +

*Thread Reply:* @Paweł Leszczyński opened a PR ( link ) for the drop table integration test. Can you please help run the integration test?

      csv_file = location.csv
      2023-10-26 07:50:29
      -

      *Thread Reply:* sure, some of our tests require access to S3/BigQuery secret keys, so will not work automatically from the fork, and require action on our side. working on that

      +

      *Thread Reply:* sure, some of our tests require access to S3/BigQuery secret keys, so will not work automatically from the fork, and require action on our side. working on that

      @@ -125800,7 +125802,7 @@

      csv_file = location.csv

      2023-10-29 09:31:22
      -

      *Thread Reply:* thanks @Paweł Leszczyński +

      *Thread Reply:* thanks @Paweł Leszczyński let me know if i can help in any way

      @@ -125827,7 +125829,250 @@

      csv_file = location.csv

      2023-11-15 02:31:50
      -

      *Thread Reply:* @Paweł Leszczyński any action item on my side?

      +

      *Thread Reply:* @Paweł Leszczyński any action item on my side?

savan (SavanSharan_Navalgi@intuit.com)
2023-12-01 02:57:20

      *Thread Reply:* @Paweł Leszczyński can you please take a look at this ? 🙂

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-01 03:05:05

*Thread Reply:* Hi @savan, were you able to run integration tests locally on your side? It seems the generated OL event is missing schema facet
"outputs" : [ {
  "namespace" : "file",
  "name" : "/tmp/drop_test/drop_table_test",
  "facets" : {
    "dataSource" : {
      "_producer" : "<https://github.com/OpenLineage/OpenLineage/tree/1.5.0-SNAPSHOT/integration/spark>",
      "_schemaURL" : "<https://openlineage.io/spec/facets/1-0-0/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet>",
      "name" : "file",
      "uri" : "file"
    },
    "symlinks" : {
      "_producer" : "<https://github.com/OpenLineage/OpenLineage/tree/1.5.0-SNAPSHOT/integration/spark>",
      "_schemaURL" : "<https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json#/$defs/SymlinksDatasetFacet>",
      "identifiers" : [ {
        "namespace" : "/tmp/drop_test",
        "name" : "default.drop_table_test",
        "type" : "TABLE"
      } ]
    },
    "lifecycleStateChange" : {
      "_producer" : "<https://github.com/OpenLineage/OpenLineage/tree/1.5.0-SNAPSHOT/integration/spark>",
      "_schemaURL" : "<https://openlineage.io/spec/facets/1-0-0/LifecycleStateChangeDatasetFacet.json#/$defs/LifecycleStateChangeDatasetFacet>",
      "lifecycleStateChange" : "DROP"
    }
  },
  "outputFacets" : { }
} ]
which shouldn't be such a big problem I believe. This event intends to notify table is dropped which is still ok I believe without schema.

savan (SavanSharan_Navalgi@intuit.com)
2023-12-01 03:06:40

*Thread Reply:* @Paweł Leszczyński
i am unable to run integration tests locally,
as you mentioned it requires S3/BigQuery secret keys
and won't work from a forked branch

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-01 03:07:06

      *Thread Reply:* you can run this particular test you modify, don't need to run all of them

savan (SavanSharan_Navalgi@intuit.com)
2023-12-01 03:07:55

*Thread Reply:* can you please share any doc which will help me do that.
i did go through the readme doc,
i was stuck at
> you don't have permission to perform this action

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-01 03:08:18

      *Thread Reply:* ./gradlew :app:integrationTest --tests io.openlineage.spark.agent.SparkIcebergIntegrationTest.testDropTable

savan (SavanSharan_Navalgi@intuit.com)
2023-12-01 03:08:33

*Thread Reply:* let me try
thanks!

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-01 03:08:35

      *Thread Reply:* this should run the thing you modify

      @@ -125893,7 +126138,7 @@

      csv_file = location.csv

      2023-08-11 09:18:44
      -

      *Thread Reply:* Hmm... this is kinda "logical" difference - is column level lineage taken from actual "physical" operations - like in this case, we always take from t1 - or from "logical" where t2 is used only for predicate, yet we still want to indicate it as a source?

      +

      *Thread Reply:* Hmm... this is kinda "logical" difference - is column level lineage taken from actual "physical" operations - like in this case, we always take from t1 - or from "logical" where t2 is used only for predicate, yet we still want to indicate it as a source?

      @@ -125919,7 +126164,7 @@

      csv_file = location.csv

      2023-08-11 09:18:58
      -

      *Thread Reply:* I think your interpretation is more useful

      +

      *Thread Reply:* I think your interpretation is more useful

      @@ -125949,7 +126194,7 @@

      csv_file = location.csv

      2023-08-11 09:25:03
      -

      *Thread Reply:* @Maciej Obuchowski - Yup, especially for use-cases where we wanna depend on column lineage for impact analysis, I think we should be considering even predicates. For example, if t2.c1 / t2.c2 gets corrupted or dropped, the query would be impacted - which means that we should be including even predicates (t2.c1 / t2.c2) in the column lineage imo

      +

      *Thread Reply:* @Maciej Obuchowski - Yup, especially for use-cases where we wanna depend on column lineage for impact analysis, I think we should be considering even predicates. For example, if t2.c1 / t2.c2 gets corrupted or dropped, the query would be impacted - which means that we should be including even predicates (t2.c1 / t2.c2) in the column lineage imo

      But is there any technical limitation if we wanna implement this / make an OSS contribution for this (like logical predicate columns not being part of the spark logical plan object that we get in the PlanVisitor or something like that)?
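
To make the two readings concrete, here is a minimal sketch of the case being discussed (the tables t1/t2/t3 and their columns are illustrative, and the snippet assumes they already exist):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# t2 never contributes a value to the output column; it only decides which rows survive.
spark.sql("""
    INSERT INTO t3
    SELECT t1.c1
    FROM t1
    JOIN t2 ON t1.id = t2.id
    WHERE t2.c2 > 0
""")

# "Physical" column lineage:          t3.c1 <- t1.c1
# "Logical"/predicate-aware lineage:  t3.c1 <- t1.c1, t2.id, t2.c2
```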

      @@ -125977,7 +126222,7 @@

      csv_file = location.csv

      2023-08-11 11:14:58
      -

      *Thread Reply:* It's probably a bit of work, but can't think it's impossible on parser side - @Paweł Leszczyński will know better about spark collection

      +

      *Thread Reply:* It's probably a bit of work, but can't think it's impossible on parser side - @Paweł Leszczyński will know better about spark collection

      @@ -126003,7 +126248,7 @@

      csv_file = location.csv

      2023-08-11 12:45:34
      -

      *Thread Reply:* This is a case where it would be nice to have an alternate indication (perhaps in the Column lineage facet?) for this type of "suggested" lineage. As noted, this is especially important for impact analysis purposes. We (and I believe others do the same or similar) call that "indirect" lineage at Manta.

      +

      *Thread Reply:* This is a case where it would be nice to have an alternate indication (perhaps in the Column lineage facet?) for this type of "suggested" lineage. As noted, this is especially important for impact analysis purposes. We (and I believe others do the same or similar) call that "indirect" lineage at Manta.

      @@ -126029,7 +126274,7 @@

      csv_file = location.csv

      2023-08-11 12:49:10
      -

      *Thread Reply:* Something like additional flag in inputFields, right?

      +

      *Thread Reply:* Something like additional flag in inputFields, right?

      @@ -126059,7 +126304,7 @@

      csv_file = location.csv

      2023-08-14 02:36:34
      -

      *Thread Reply:* Yes, this would require some extension to the spec. What do you mean spark-sql : spark.sql() with some spark query or SQL in spark JDBC?

      +

      *Thread Reply:* Yes, this would require some extension to the spec. What do you mean spark-sql : spark.sql() with some spark query or SQL in spark JDBC?
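
Purely as an illustration of the kind of extension being discussed, an extra marker on each inputFields entry might look like the sketch below; the transformationType key is hypothetical here, not something defined by the spec at this point:

```python
# Hypothetical sketch of a columnLineage facet with an indirect-lineage marker.
column_lineage_facet = {
    "fields": {
        "c1": {
            "inputFields": [
                # the value is physically copied from t1.c1
                {"namespace": "db", "name": "t1", "field": "c1", "transformationType": "DIRECT"},
                # t2.c2 only appears in the predicate
                {"namespace": "db", "name": "t2", "field": "c2", "transformationType": "INDIRECT"},
            ]
        }
    }
}
```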

      @@ -126085,7 +126330,7 @@

      csv_file = location.csv

      2023-08-15 15:16:49
      -

      *Thread Reply:* Sorry, missed your question @Paweł Leszczyński. By spark-sql, I'm referring to the former: spark.sql() with some spark query

      +

      *Thread Reply:* Sorry, missed your question @Paweł Leszczyński. By spark-sql, I'm referring to the former: spark.sql() with some spark query

      @@ -126111,7 +126356,7 @@

      csv_file = location.csv

      2023-08-16 03:10:57
      -

      *Thread Reply:* cc @Jens Pfau - you may be also interested in extending column level lineage facet.

      +

      *Thread Reply:* cc @Jens Pfau - you may be also interested in extending column level lineage facet.

      @@ -126137,7 +126382,7 @@

      csv_file = location.csv

      2023-08-22 02:23:08
      -

      *Thread Reply:* Hi, is there a github issue for this feature? Seems like a really cool and exciting functionality to have!

      +

      *Thread Reply:* Hi, is there a github issue for this feature? Seems like a really cool and exciting functionality to have!

      @@ -126163,7 +126408,7 @@

      csv_file = location.csv

      2023-08-22 08:03:49
      -

      *Thread Reply:* @Anirudh Shrinivason - Are you referring to this issue: https://github.com/OpenLineage/OpenLineage/issues/2048?

      +

      *Thread Reply:* @Anirudh Shrinivason - Are you referring to this issue: https://github.com/OpenLineage/OpenLineage/issues/2048?

      @@ -126279,7 +126524,7 @@

      csv_file = location.csv

      2023-08-14 06:00:21
      -

      *Thread Reply:* Not really I think - the integration does not rely purely on the logical plan

      +

      *Thread Reply:* Not really I think - the integration does not rely purely on the logical plan

      @@ -126305,7 +126550,7 @@

      csv_file = location.csv

      2023-08-14 06:00:44
      -

      *Thread Reply:* At least, not in all cases. For some maybe

      +

      *Thread Reply:* At least, not in all cases. For some maybe

      @@ -126331,7 +126576,7 @@

      csv_file = location.csv

      2023-08-14 07:34:39
      -

      *Thread Reply:* We're using pretty similar approach in our column level lineage tests where we run some spark commands, register custom listener https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]eage/spark/agent/util/LastQueryExecutionSparkEventListener.java which catches the logical plan. Further we run our tests on the captured logical plan.

      +

      *Thread Reply:* We're using pretty similar approach in our column level lineage tests where we run some spark commands, register custom listener https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/tes[…]eage/spark/agent/util/LastQueryExecutionSparkEventListener.java which catches the logical plan. Further we run our tests on the captured logical plan.

      The difference here, between what you're asking about, is that we still have an access to the same spark session.

      @@ -126431,7 +126676,7 @@

      csv_file = location.csv

      2023-08-14 11:03:28
      -

      *Thread Reply:* @Paweł Leszczyński - We're mainly interested to see the inputs/outputs (mainly column schema and column lineage) for different logical plans. Is that something that could be done in a static manner without running spark jobs in your opinion?

      +

      *Thread Reply:* @Paweł Leszczyński - We're mainly interested to see the inputs/outputs (mainly column schema and column lineage) for different logical plans. Is that something that could be done in a static manner without running spark jobs in your opinion?

      For example, I know that we can statically create logical plans

      @@ -126459,7 +126704,7 @@

      csv_file = location.csv

      2023-08-16 03:05:44
      -

      *Thread Reply:* The more we talk the more I am wondering what is the purpose of doing so? Do you want to test openlineage coverage or is there any production scenario where you would like to apply this?

      +

      *Thread Reply:* The more we talk the more I am wondering what is the purpose of doing so? Do you want to test openlineage coverage or is there any production scenario where you would like to apply this?

      @@ -126485,7 +126730,7 @@

      csv_file = location.csv

      2023-08-16 04:01:39
      -

      *Thread Reply:* @Paweł Leszczyński - This is for testing openlineage coverage so that we can be more confident on what're the happy path scenarios and what're the scenarios where it may not work / work partially etc

      +

      *Thread Reply:* @Paweł Leszczyński - This is for testing openlineage coverage so that we can be more confident on what're the happy path scenarios and what're the scenarios where it may not work / work partially etc

      @@ -126511,7 +126756,7 @@

      csv_file = location.csv

      2023-08-16 04:22:01
      -

      *Thread Reply:* If this is for testing, then you're also capable of mocking some SparkSession/catalog methods when Openlineage integration tries to access them. If you want to reuse LogicalPlans from your prod environment, you will encounter logicalplan serialization issues. On the other hand, if you generate logical plans from some example Spark jobs, then the same can be easier achieved in a way the integration tests are run with mockserver.

      +

      *Thread Reply:* If this is for testing, then you're also capable of mocking some SparkSession/catalog methods when Openlineage integration tries to access them. If you want to reuse LogicalPlans from your prod environment, you will encounter logicalplan serialization issues. On the other hand, if you generate logical plans from some example Spark jobs, then the same can be easier achieved in a way the integration tests are run with mockserver.

      @@ -126598,7 +126843,7 @@

      csv_file = location.csv

      2023-08-15 01:13:49
      -

      *Thread Reply:* We're uploading the init scripts to s3 via tf. But yeah ig there are some access permissions that the user needs to have

      +

      *Thread Reply:* We're uploading the init scripts to s3 via tf. But yeah ig there are some access permissions that the user needs to have

      @@ -126628,7 +126873,7 @@

      csv_file = location.csv

      2023-08-16 07:32:00
      -

      *Thread Reply:* Hello +

*Thread Reply:* Hello
I am new here and I am asking why do you need an init script ? If it's a spark integration we can just specify --package=io.openlineage...
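
For reference, the --packages route usually amounts to something like the sketch below when building the session yourself (a sketch only: the version, endpoint, and exact spark.openlineage.** keys depend on the integration release you use):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # resolved from Maven Central, so the listener jar is on the driver classpath
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:1.0.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")  # placeholder endpoint
    .getOrCreate()
)
```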

      @@ -126656,7 +126901,7 @@

      csv_file = location.csv

      2023-08-16 07:41:25
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/databricks/open-lineage-init-script.sh -> I think the issue was in having openlineage-jar installed immediately on the classpath bcz it's required when OpenLineageSparkListener is instantiated. It didn't work without it.

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/databricks/open-lineage-init-script.sh -> I think the issue was in having openlineage-jar installed immediately on the classpath bcz it's required when OpenLineageSparkListener is instantiated. It didn't work without it.

      @@ -126725,7 +126970,7 @@

      csv_file = location.csv

      2023-08-16 07:43:55
      -

      *Thread Reply:* Yes it happens if you use --jars s3://.../...openlineage-spark-VERSION.jar parameter. (I made a ticket for this issue in Databricks support) +

      *Thread Reply:* Yes it happens if you use --jars s3://.../...openlineage-spark-VERSION.jar parameter. (I made a ticket for this issue in Databricks support) But if you use --package io.openlineage... (the package will be downloaded from maven) it works fine.

      @@ -126756,7 +127001,7 @@

      csv_file = location.csv

      2023-08-16 07:47:50
      -

      *Thread Reply:* I think they don't use the right class loader.

      +

      *Thread Reply:* I think they don't use the right class loader.

      @@ -126782,7 +127027,7 @@

      csv_file = location.csv

      2023-08-16 08:36:14
      -

      *Thread Reply:* To make sure: are you able to run Openlineage & Spark on Databricks Runtime without init_scripts?

      +

      *Thread Reply:* To make sure: are you able to run Openlineage & Spark on Databricks Runtime without init_scripts?

      I was doing this a second ago and this ended up with Caused by: java.lang.ClassNotFoundException: io.openlineage.spark.agent.OpenLineageSparkListener not found in com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@1609ed55

      @@ -126845,7 +127090,7 @@

      csv_file = location.csv

      2023-08-15 12:19:34
      -

      *Thread Reply:* Ok, nevermind. I figured it out. The port 5000 is reserved in MACOS so I had to start on port 9000 instead.

      +

      *Thread Reply:* Ok, nevermind. I figured it out. The port 5000 is reserved in MACOS so I had to start on port 9000 instead.

      @@ -126997,7 +127242,7 @@

      csv_file = location.csv

      2023-08-15 06:48:43
      -

      *Thread Reply:* Would be useful to see generated event or any logs

      +

      *Thread Reply:* Would be useful to see generated event or any logs

      @@ -127023,7 +127268,7 @@

      csv_file = location.csv

      2023-08-16 03:09:05
      -

      *Thread Reply:* @Anirudh Shrinivason what if there is just one union instead of four? What if there are just two columns selected instead of 10? What if inner join is skipped? Does merge into matter?

      +

      *Thread Reply:* @Anirudh Shrinivason what if there is just one union instead of four? What if there are just two columns selected instead of 10? What if inner join is skipped? Does merge into matter?

      The smaller SQL to reproduce the problem, the easier it is to find the root cause. Most of the issues are reproducible with just few lines of code.

      @@ -127051,7 +127296,7 @@

      csv_file = location.csv

      2023-08-16 03:34:30
      -

*Thread Reply:* Yup let me try to identify the cause from my end. Give me some time haha. I'll reach out again once there is more clarity on the occurrence

+

*Thread Reply:* Yup let me try to identify the cause from my end. Give me some time haha. I'll reach out again once there is more clarity on the occurrence

      @@ -127179,7 +127424,7 @@

      csv_file = location.csv

      2023-08-16 09:24:09
      -

      *Thread Reply:* thanks @Abdallah for the thoughtful issue that you submitted! +

*Thread Reply:* thanks @Abdallah for the thoughtful issue that you submitted! was wondering if you’d consider opening up a PR? would love to help you as a contributor if that’s something you are interested in.

      @@ -127206,7 +127451,7 @@

      csv_file = location.csv

      2023-08-17 11:59:51
      -

      *Thread Reply:* Hello

      +

      *Thread Reply:* Hello

      @@ -127232,7 +127477,7 @@

      csv_file = location.csv

      2023-08-17 11:59:58
      -

      *Thread Reply:* Yes I am working on it

      +

      *Thread Reply:* Yes I am working on it

      @@ -127258,7 +127503,7 @@

      csv_file = location.csv

      2023-08-17 12:00:14
      -

      *Thread Reply:* I deleted the line that has that filter.

      +

      *Thread Reply:* I deleted the line that has that filter.

      @@ -127284,7 +127529,7 @@

      csv_file = location.csv

      2023-08-17 12:00:24
      -

      *Thread Reply:* I am adding some tests now

      +

      *Thread Reply:* I am adding some tests now

      @@ -127310,7 +127555,7 @@

      csv_file = location.csv

      2023-08-17 12:00:45
      -

      *Thread Reply:* But running +

      *Thread Reply:* But running ./gradlew --no-daemon databricksIntegrationTest -x test -Pspark.version=3.4.0 -PdatabricksHost=$DATABRICKS_HOST -PdatabricksToken=$DATABRICKS_TOKEN

      @@ -127337,7 +127582,7 @@

      csv_file = location.csv

      2023-08-17 12:01:11
      -

      *Thread Reply:* gives me +

*Thread Reply:* gives me
A problem occurred evaluating project ':app'.
> Could not resolve all files for configuration ':app:spark33'.
   > Could not resolve io.openlineage:openlineage_java:1.1.0-SNAPSHOT.

@@ -127377,7 +127622,7 @@

      csv_file = location.csv

      2023-08-17 12:01:25
      -

      *Thread Reply:* And I am trying to understand what should I do.

      +

      *Thread Reply:* And I am trying to understand what should I do.

      @@ -127403,7 +127648,7 @@

      csv_file = location.csv

      2023-08-17 12:13:37
      -

      *Thread Reply:* I am compiling sql integration

      +

      *Thread Reply:* I am compiling sql integration

      @@ -127429,7 +127674,7 @@

      csv_file = location.csv

      2023-08-17 13:04:15
      -

      *Thread Reply:* I built the java client

      +

      *Thread Reply:* I built the java client

      @@ -127455,7 +127700,7 @@

      csv_file = location.csv

      2023-08-17 13:04:29
      -

      *Thread Reply:* but having +

*Thread Reply:* but having
A problem occurred evaluating project ':app'.
> Could not resolve all files for configuration ':app:spark33'.
   > Could not resolve io.openlineage:openlineage_java:1.1.0-SNAPSHOT.

@@ -127495,7 +127740,7 @@

      csv_file = location.csv

      2023-08-17 14:47:41
      -

      *Thread Reply:* Please do ./gradlew publishToMavenLocal in client/java directory

      +

      *Thread Reply:* Please do ./gradlew publishToMavenLocal in client/java directory

      @@ -127521,7 +127766,7 @@

      csv_file = location.csv

      2023-08-17 14:47:59
      -

      *Thread Reply:* Okay thanks

      +

      *Thread Reply:* Okay thanks

      @@ -127547,7 +127792,7 @@

      csv_file = location.csv

      2023-08-17 14:48:01
      -

      *Thread Reply:* will do

      +

      *Thread Reply:* will do

      @@ -127573,7 +127818,7 @@

      csv_file = location.csv

      2023-08-22 10:33:02
      -

      *Thread Reply:* Hello back

      +

      *Thread Reply:* Hello back

      @@ -127599,7 +127844,7 @@

      csv_file = location.csv

      2023-08-22 10:33:12
      -

      *Thread Reply:* I created a databricks cluster.

      +

      *Thread Reply:* I created a databricks cluster.

      @@ -127625,7 +127870,7 @@

      csv_file = location.csv

      2023-08-22 10:35:00
      -

*Thread Reply:* And I had some issues: -PdatabricksHost doesn't work with System.getProperty("databricksHost"), so I changed to -DdatabricksHost with System.getenv("databricksHost")

+

*Thread Reply:* And I had some issues: -PdatabricksHost doesn't work with System.getProperty("databricksHost"), so I changed to -DdatabricksHost with System.getenv("databricksHost")

      @@ -127651,7 +127896,7 @@

      csv_file = location.csv

      2023-08-22 10:36:19
      -

*Thread Reply:* Then I had an issue that the path dbfs:/databricks/openlineage/ didn't exist, so I then created the folder /dbfs/databricks/openlineage/

+

*Thread Reply:* Then I had an issue that the path dbfs:/databricks/openlineage/ didn't exist, so I then created the folder /dbfs/databricks/openlineage/

      @@ -127677,7 +127922,7 @@

      csv_file = location.csv

      2023-08-22 10:38:03
      -

      *Thread Reply:* And now I am investigating this issue : +

*Thread Reply:* And now I am investigating this issue :
java.lang.NullPointerException
    at io.openlineage.spark.agent.DatabricksUtils.uploadOpenlineageJar(DatabricksUtils.java:226)
    at io.openlineage.spark.agent.DatabricksUtils.init(DatabricksUtils.java:66)

@@ -127725,7 +127970,7 @@

      csv_file = location.csv

      2023-08-22 10:39:22
      -

      *Thread Reply:* Suppressed: com.databricks.sdk.core.DatabricksError: Missing required field: cluster_id

      +

      *Thread Reply:* Suppressed: com.databricks.sdk.core.DatabricksError: Missing required field: cluster_id

      @@ -127751,7 +127996,7 @@

      csv_file = location.csv

      2023-08-22 10:40:18
      -

      *Thread Reply:* at io.openlineage.spark.agent.DatabricksUtils.uploadOpenlineageJar(DatabricksUtils.java:226)

      +

      *Thread Reply:* at io.openlineage.spark.agent.DatabricksUtils.uploadOpenlineageJar(DatabricksUtils.java:226)

      @@ -127777,7 +128022,7 @@

      csv_file = location.csv

      2023-08-22 10:54:51
      -

*Thread Reply:* I did this !echo "xxx" > /dbfs/databricks/openlineage/openlineage-spark-V.jar

+

*Thread Reply:* I did this !echo "xxx" > /dbfs/databricks/openlineage/openlineage-spark-V.jar

      @@ -127803,7 +128048,7 @@

      csv_file = location.csv

      2023-08-22 10:55:29
      -

      *Thread Reply:* To create some fake file that can be deleted in uploadOpenlineageJar function.

      +

      *Thread Reply:* To create some fake file that can be deleted in uploadOpenlineageJar function.

      @@ -127829,7 +128074,7 @@

      csv_file = location.csv

      2023-08-22 10:56:09
      -

      *Thread Reply:* Because if there is no file, this part fails +

*Thread Reply:* Because if there is no file, this part fails
StreamSupport.stream(
        workspace.dbfs().list("dbfs:/databricks/openlineage/").spliterator(), false)
    .filter(f -> f.getPath().contains("openlineage-spark"))

@@ -127864,7 +128109,7 @@

      csv_file = location.csv

      2023-08-22 11:47:17
      -

      *Thread Reply:* does this work after +

*Thread Reply:* does this work after !echo "xxx" > /dbfs/databricks/openlineage/openlineage-spark-V.jar ?

      @@ -127892,7 +128137,7 @@

      csv_file = location.csv

      2023-08-22 11:47:36
      -

      *Thread Reply:* Yes

      +

      *Thread Reply:* Yes

      @@ -127918,7 +128163,7 @@

      csv_file = location.csv

      2023-08-22 19:02:05
      -

      *Thread Reply:* I am now having another error in the driver

      +

      *Thread Reply:* I am now having another error in the driver

23/08/22 22:56:26 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Exception when registering SparkListener

@@ -127959,7 +128204,7 @@

      csv_file = location.csv

      2023-08-22 19:19:29
      -

      *Thread Reply:* Can you please share with me your json conf for the cluster ?

      +

      *Thread Reply:* Can you please share with me your json conf for the cluster ?

      @@ -127994,7 +128239,7 @@

      csv_file = location.csv

      2023-08-22 19:55:57
      -

*Thread Reply:* It's because in my build file I have

+

*Thread Reply:* It's because in my build file I have

      @@ -128029,7 +128274,7 @@

      csv_file = location.csv

      2023-08-22 19:56:27
      -

      *Thread Reply:* and the one that was copied is

      +

      *Thread Reply:* and the one that was copied is

      @@ -128064,7 +128309,7 @@

      csv_file = location.csv

      2023-08-22 20:01:12
      -

      *Thread Reply:* due to the findAny 😕 +

*Thread Reply:* due to the findAny 😕
private static void uploadOpenlineageJar(WorkspaceClient workspace) {
    Path jarFile = Files.list(Paths.get("../build/libs/"))

@@ -128097,7 +128342,7 @@

      csv_file = location.csv

      2023-08-22 20:35:10
      -

      *Thread Reply:* It works finally 😄

      +

      *Thread Reply:* It works finally 😄

      @@ -128123,7 +128368,7 @@

      csv_file = location.csv

      2023-08-23 05:16:19
      -

      *Thread Reply:* The PR 😄 +

      *Thread Reply:* The PR 😄 https://github.com/OpenLineage/OpenLineage/pull/2061

      @@ -128216,7 +128461,7 @@

      csv_file = location.csv

      2023-08-23 08:23:49
      -

      *Thread Reply:* thanks for the pr 🙂

      +

      *Thread Reply:* thanks for the pr 🙂

      @@ -128242,7 +128487,7 @@

      csv_file = location.csv

      2023-08-23 08:24:02
      -

      *Thread Reply:* code formatting checks complain now

      +

      *Thread Reply:* code formatting checks complain now

      @@ -128268,7 +128513,7 @@

      csv_file = location.csv

      2023-08-23 08:25:09
      -

      *Thread Reply:* for the JAR issues, do you also want to create PR as you've fixed the issue on your end?

      +

      *Thread Reply:* for the JAR issues, do you also want to create PR as you've fixed the issue on your end?

      @@ -128294,7 +128539,7 @@

      csv_file = location.csv

      2023-08-23 09:06:26
      -

      *Thread Reply:* @Abdallah you're using newer version of Java than 8, right?

      +

      *Thread Reply:* @Abdallah you're using newer version of Java than 8, right?

      @@ -128320,7 +128565,7 @@

      csv_file = location.csv

      2023-08-23 09:07:07
      -

      *Thread Reply:* AFAIK googleJavaFormat behaves differently between Java versions

      +

      *Thread Reply:* AFAIK googleJavaFormat behaves differently between Java versions

      @@ -128346,7 +128591,7 @@

      csv_file = location.csv

      2023-08-23 09:15:41
      -

      *Thread Reply:* Okay I will switch back to another java version

      +

      *Thread Reply:* Okay I will switch back to another java version

      @@ -128372,7 +128617,7 @@

      csv_file = location.csv

      2023-08-23 09:25:06
      -

      *Thread Reply:* terra@MacBook-Pro-M3 spark % java -version +

*Thread Reply:* terra@MacBook-Pro-M3 spark % java -version
java version "1.8.0_381"
Java(TM) SE Runtime Environment (build 1.8.0_381-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.381-b09, mixed mode)

      @@ -128401,7 +128646,7 @@

      csv_file = location.csv

      2023-08-23 09:28:28
      -

      *Thread Reply:* Can you tell me which java version should I use ?

      +

      *Thread Reply:* Can you tell me which java version should I use ?

      @@ -128427,7 +128672,7 @@

      csv_file = location.csv

      2023-08-23 09:49:42
      -

      *Thread Reply:* Hello, I have +

*Thread Reply:* Hello, I have
@mobuchowski ERROR: Missing environment variable {i}
Can you please check what does it come from ?

      csv_file = location.csv
      2023-08-23 09:50:18
      2023-08-23 09:50:24
      -

      *Thread Reply:* Can you help please ?

      +

      *Thread Reply:* Can you help please ?

      @@ -128551,7 +128796,7 @@

      csv_file = location.csv

      2023-08-23 10:08:43
      -

      *Thread Reply:* Java 8

      +

      *Thread Reply:* Java 8

      @@ -128577,7 +128822,7 @@

      csv_file = location.csv

      2023-08-23 10:10:14
      -

      *Thread Reply:* ```Hello, I have

      +

      *Thread Reply:* ```Hello, I have

@mobuchowski ERROR: Missing environment variable {i}
Can you please check what does it come from ? (edited) ```

@@ -128607,7 +128852,7 @@

      csv_file = location.csv

      2023-08-23 10:11:10
      2023-08-23 10:53:34
      -

      *Thread Reply:* @Abdallah merged 🙂

      +

      *Thread Reply:* @Abdallah merged 🙂

      @@ -128659,7 +128904,7 @@

      csv_file = location.csv

      2023-08-23 10:59:22
      -

      *Thread Reply:* Thank you !

      +

      *Thread Reply:* Thank you !

      @@ -128800,7 +129045,7 @@

      csv_file = location.csv

      2023-08-21 20:26:56
      -

      *Thread Reply:* openlineage-airflow is the package maintained in the OpenLineage project and to be used for versions of Airflow before 2.7. You could use it with 2.7 as well but you’d be staying on the “old” integration. +

*Thread Reply:* openlineage-airflow is the package maintained in the OpenLineage project and to be used for versions of Airflow before 2.7. You could use it with 2.7 as well but you’d be staying on the “old” integration.
apache-airflow-providers-openlineage is the new package, maintained in the Airflow project, that can be used starting with Airflow 2.7 and is the recommended package moving forward. It is compatible with the configuration of the old package described in that usage page.
CC: @Maciej Obuchowski @Jakub Dardziński It looks like this page needs improvement.

      @@ -128852,7 +129097,7 @@

      csv_file = location.csv

      2023-08-22 05:03:28
      -

      *Thread Reply:* Yeah, I'll fix that

      +

      *Thread Reply:* Yeah, I'll fix that

      @@ -128882,7 +129127,7 @@

      csv_file = location.csv

      2023-08-22 17:55:08
      -

      *Thread Reply:* https://github.com/apache/airflow/pull/33610

      +

      *Thread Reply:* https://github.com/apache/airflow/pull/33610

      fyi

      csv_file = location.csv
      2023-08-22 18:02:46
      -

      *Thread Reply:* that really depends on the use case IMHO +

*Thread Reply:* that really depends on the use case IMHO
if you consider a whole directory/folder as a dataset (meaning that each file inside folds into a larger whole) you should label the dataset as the directory

      you might as well have directory with each file being something different - in this case it would be best to set each file separately as dataset
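
As a sketch of the two options (bucket and paths are made up; namespace/name follow the usual object-storage convention of namespace = storage root, name = path):

```python
# One dataset per directory: each new daily file folds into the same dataset.
directory_dataset = {"namespace": "s3://my-bucket", "name": "daily_dumps"}

# One dataset per file: every file is tracked separately.
file_dataset = {"namespace": "s3://my-bucket", "name": "daily_dumps/2023-08-22.parquet"}
```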

      @@ -129009,7 +129254,7 @@

      csv_file = location.csv

      2023-08-22 18:04:32
      -

      *Thread Reply:* there was also SymlinksDatasetFacet introduced to store alternative dataset names, might be useful: https://github.com/OpenLineage/OpenLineage/pull/936

      +

      *Thread Reply:* there was also SymlinksDatasetFacet introduced to store alternative dataset names, might be useful: https://github.com/OpenLineage/OpenLineage/pull/936

      @@ -129035,7 +129280,7 @@

      csv_file = location.csv

      2023-08-22 18:07:26
      -

      *Thread Reply:* cool, yeah in general each file is just a snapshot of data from a client (for example, daily dump). the parquet datasets are normally partitioned and might have small fragments and I definitely picture it as more of a table than individual files

      +

      *Thread Reply:* cool, yeah in general each file is just a snapshot of data from a client (for example, daily dump). the parquet datasets are normally partitioned and might have small fragments and I definitely picture it as more of a table than individual files

      @@ -129065,7 +129310,7 @@

      csv_file = location.csv

      2023-08-23 08:22:09
      -

      *Thread Reply:* Agree with Jakub here - with object storage, people use different patterns, but usually some directory layer vs file is the valid abstraction level, especially if your pattern is adding files with new data inside

      +

      *Thread Reply:* Agree with Jakub here - with object storage, people use different patterns, but usually some directory layer vs file is the valid abstraction level, especially if your pattern is adding files with new data inside

      @@ -129095,7 +129340,7 @@

      csv_file = location.csv

      2023-08-25 10:26:52
      -

      *Thread Reply:* I tested a dataset for each raw file versus the folder and the folder looks much cleaner (not sure if I can collapse individual datasets/files into a group?)

      +

      *Thread Reply:* I tested a dataset for each raw file versus the folder and the folder looks much cleaner (not sure if I can collapse individual datasets/files into a group?)

      from 2022, this particular source had 6 raw schema changes (client controlled, no warning). what should I do to make that as obvious as possible if I track the dataset at a folder level?

      @@ -129123,7 +129368,7 @@

      csv_file = location.csv

      2023-08-25 10:32:19
      -

      *Thread Reply:* I was thinking that I could name the dataset based on the schema_version (identified by the raw column names), so in this example I would have 6 OL datasets feeding into one "staging" dataset

      +

      *Thread Reply:* I was thinking that I could name the dataset based on the schema_version (identified by the raw column names), so in this example I would have 6 OL datasets feeding into one "staging" dataset

      @@ -129149,7 +129394,7 @@

      csv_file = location.csv

      2023-08-25 10:32:57
      -

      *Thread Reply:* not sure what the best practice would be in this scenario though

      +

      *Thread Reply:* not sure what the best practice would be in this scenario though

      @@ -129196,7 +129441,7 @@

      csv_file = location.csv

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-08-22 18:35:45
      @@ -129228,7 +129473,7 @@

      csv_file = location.csv

      2023-08-23 05:22:18
      -

      *Thread Reply:* Which version are you trying to use?

      +

      *Thread Reply:* Which version are you trying to use?

      @@ -129254,7 +129499,7 @@

      csv_file = location.csv

      2023-08-23 05:22:45
      -

      *Thread Reply:* Both OL and MWAA/Airflow 🙂

      +

      *Thread Reply:* Both OL and MWAA/Airflow 🙂

      @@ -129280,7 +129525,7 @@

      csv_file = location.csv

      2023-08-23 05:23:52
      -

      *Thread Reply:* 'openlineage' is not a package +

      *Thread Reply:* 'openlineage' is not a package suggests that something went wrong with import process, for example cycle in import path

      @@ -129302,12 +129547,12 @@

      csv_file = location.csv

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-08-23 16:50:34
      -

*Thread Reply:* MWAA: 2.6.3
OL: 1.0.0

I can see from the log that OL has been successfully installed to the webserver:

@@ -129354,7 +129599,7 @@

      I can see from the log that OL has been successfully installed to the webserver: @@ -129354,7 +129599,7 @@

      csv_file = location.csv

      2023-08-24 08:18:36
      -

      *Thread Reply:* It’s taking long to update MWAA environment but I tested 2.6.3 version with the followingrequirements.txt: +

      *Thread Reply:* It’s taking long to update MWAA environment but I tested 2.6.3 version with the followingrequirements.txt: openlineage-airflow and openlineage-airflow==1.0.0 @@ -129379,12 +129624,12 @@

      csv_file = location.csv

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-08-24 08:29:30
      -

      *Thread Reply:* Yeah, it takes forever to update MWAA even for a simple change. If you open either the webserver log (in CloudWatch) or the AirFlow UI, you should see the above error message.

      +

      *Thread Reply:* Yeah, it takes forever to update MWAA even for a simple change. If you open either the webserver log (in CloudWatch) or the AirFlow UI, you should see the above error message.

      @@ -129410,7 +129655,7 @@

      csv_file = location.csv

      2023-08-24 08:33:53
      -

      *Thread Reply:* The thing is that I don’t see any error messages. +

*Thread Reply:* The thing is that I don’t see any error messages. I wrote simple DAG to test too:
```from __future__ import annotations

      @@ -129467,12 +129712,12 @@

      csv_file = location.csv

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-08-24 08:48:11
      -

      *Thread Reply:* Oh how interesting. I did have a plugin that sets the endpoint & key via env var. Let me try to disable that to see if it fixes the issue. Will report back after 30 mins, or however long it takes to update MWAA 😉

      +

      *Thread Reply:* Oh how interesting. I did have a plugin that sets the endpoint & key via env var. Let me try to disable that to see if it fixes the issue. Will report back after 30 mins, or however long it takes to update MWAA 😉

      @@ -129498,7 +129743,7 @@

      csv_file = location.csv

      2023-08-24 08:50:05
      -

      *Thread Reply:* ohh, I see +

      *Thread Reply:* ohh, I see you probably followed this guide: https://aws.amazon.com/blogs/big-data/automate-data-lineage-on-amazon-mwaa-with-openlineage/?

      @@ -129545,12 +129790,12 @@

      csv_file = location.csv

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-08-24 09:04:27
      -

      *Thread Reply:* Actually no. I'm not aware of this guide. I assume it's outdated already?

      +

      *Thread Reply:* Actually no. I'm not aware of this guide. I assume it's outdated already?

      @@ -129576,7 +129821,7 @@

      csv_file = location.csv

      2023-08-24 09:04:54
      -

      *Thread Reply:* tbh I don’t know

      +

      *Thread Reply:* tbh I don’t know

      @@ -129597,12 +129842,12 @@

      csv_file = location.csv

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-08-24 09:04:55
      -

      *Thread Reply:* Actually while we're on that topic, what's the recommended way to pass the URL & API Key in MWAA?

      +

      *Thread Reply:* Actually while we're on that topic, what's the recommended way to pass the URL & API Key in MWAA?

      @@ -129628,7 +129873,7 @@

      csv_file = location.csv

      2023-08-24 09:28:00
      -

      *Thread Reply:* I think it's still a plugin that sets env vars

      +

      *Thread Reply:* I think it's still a plugin that sets env vars
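
The plugin pattern referred to here is roughly the following sketch (endpoint, key, and namespace are placeholders; the AWS guide linked above wires the values from Secrets Manager instead of hardcoding them):

```python
import os

from airflow.plugins_manager import AirflowPlugin

# Must run before openlineage-airflow is imported by the scheduler/workers.
os.environ["OPENLINEAGE_URL"] = "https://your-openlineage-endpoint"  # placeholder
os.environ["OPENLINEAGE_API_KEY"] = "your-api-key"                   # placeholder
os.environ["OPENLINEAGE_NAMESPACE"] = "my-mwaa-environment"          # placeholder


class EnvVarPlugin(AirflowPlugin):
    name = "env_var_plugin"
```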

      @@ -129649,12 +129894,12 @@

      csv_file = location.csv

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-08-24 09:32:18
      -

      *Thread Reply:* Yeah based on the page you shared, secret manager + plugin seems like the way to go.

      +

      *Thread Reply:* Yeah based on the page you shared, secret manager + plugin seems like the way to go.

      @@ -129675,12 +129920,12 @@

      csv_file = location.csv

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-08-24 10:31:50
      -

      *Thread Reply:* Alas after disabling the plugin and restarting the cluster, I'm still getting the same error. Do you mind to share a screenshot of your cluster's settings so I can compare?

      +

      *Thread Reply:* Alas after disabling the plugin and restarting the cluster, I'm still getting the same error. Do you mind to share a screenshot of your cluster's settings so I can compare?

      @@ -129706,7 +129951,7 @@

      csv_file = location.csv

      2023-08-24 11:57:04
      -

      *Thread Reply:* Are you maybe importing some top level OpenLineage code anywhere? This error is most likely circular import

      +

      *Thread Reply:* Are you maybe importing some top level OpenLineage code anywhere? This error is most likely circular import

      @@ -129727,12 +129972,12 @@

      csv_file = location.csv

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-08-24 12:01:12
      -

      *Thread Reply:* Let me try removing all the dags to see if it helps.

      +

      *Thread Reply:* Let me try removing all the dags to see if it helps.

      @@ -129753,12 +129998,12 @@

      csv_file = location.csv

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-08-24 18:42:49
      -

      *Thread Reply:* @Maciej Obuchowski you were correct! It was indeed the DAGs. The errors are gone after removing all the dags. Now just need to figure what caused the circular import since I didn't import OL directly in DAG.

      +

      *Thread Reply:* @Maciej Obuchowski you were correct! It was indeed the DAGs. The errors are gone after removing all the dags. Now just need to figure what caused the circular import since I didn't import OL directly in DAG.

      @@ -129779,12 +130024,12 @@

      csv_file = location.csv

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-08-24 18:44:33
      -

      *Thread Reply:* Could this be the issue? +

*Thread Reply:* Could this be the issue?
from airflow.lineage.entities import File, Table
How could I declare lineage manually if I can't import these classes?
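
For context, the manual declaration those classes are meant for normally looks like the sketch below (the operator, table, and file identifiers are illustrative; the import itself is what was failing in this environment):

```python
from datetime import datetime

from airflow import DAG
from airflow.lineage.entities import File, Table
from airflow.operators.bash import BashOperator

with DAG(dag_id="manual_lineage_example", start_date=datetime(2023, 1, 1), schedule=None):
    copy_task = BashOperator(
        task_id="copy_data",
        bash_command="echo copying",
        inlets=[Table(cluster="prod", database="db", name="source_table")],  # illustrative
        outlets=[File(url="s3://my-bucket/output.csv")],                     # illustrative
    )
```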

      @@ -129812,7 +130057,7 @@

      csv_file = location.csv

      2023-08-25 06:52:47
      -

      *Thread Reply:* @Mars Lan I'll look in more details next week, as I'm in transit now

      +

      *Thread Reply:* @Mars Lan (Metaphor) I'll look in more details next week, as I'm in transit now

      @@ -129838,7 +130083,7 @@

      csv_file = location.csv

      2023-08-25 06:53:18
      -

      *Thread Reply:* but if you could narrow down a problem to single dag that I or @Jakub Dardziński could reproduce, ideally locally, it would help a lot

      +

      *Thread Reply:* but if you could narrow down a problem to single dag that I or @Jakub Dardziński could reproduce, ideally locally, it would help a lot

      @@ -129859,12 +130104,12 @@

      csv_file = location.csv

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-08-25 07:07:11
      -

      *Thread Reply:* Thanks. I think I understand how this works much better now. Found a few useful BQ example dags. Will give them a try and report back.

      +

      *Thread Reply:* Thanks. I think I understand how this works much better now. Found a few useful BQ example dags. Will give them a try and report back.

      @@ -129921,7 +130166,7 @@

      csv_file = location.csv

      2023-08-23 07:32:19
      -

      *Thread Reply:* are you using Airflow to connect to Redshift?

      +

      *Thread Reply:* are you using Airflow to connect to Redshift?

      @@ -129947,7 +130192,7 @@

      csv_file = location.csv

      2023-08-24 06:50:05
      -

      *Thread Reply:* Hi @Jakub Dardziński, +

      *Thread Reply:* Hi @Jakub Dardziński, Thank you for your reply. No, we are not using Airflow. We are using load/Unload cmd with Pyspark and also Pandas with JDBC connection

      @@ -129976,7 +130221,7 @@

      csv_file = location.csv

      2023-08-25 13:28:37
      -

      *Thread Reply:* @Paweł Leszczyński might know answer if Spark<->OL integration works with Redshift. Eventually JDBC is supported with sqlparser

      +

      *Thread Reply:* @Paweł Leszczyński might know answer if Spark<->OL integration works with Redshift. Eventually JDBC is supported with sqlparser

      for Pandas I think there wasn’t too much work done

      @@ -130004,7 +130249,7 @@

      csv_file = location.csv

      2023-08-28 02:18:49
      -

      *Thread Reply:* @Nitin If you're using jdbc within Spark, the lineage should be obtained via sqlparser-rs library https://github.com/sqlparser-rs/sqlparser-rs. In case it's not, please try to provide some minimal SQL code (or pyspark) which leads to uncaught lineage.

      +

      *Thread Reply:* @Nitin If you're using jdbc within Spark, the lineage should be obtained via sqlparser-rs library https://github.com/sqlparser-rs/sqlparser-rs. In case it's not, please try to provide some minimal SQL code (or pyspark) which leads to uncaught lineage.
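
A minimal sketch of the JDBC-within-Spark case meant here (connection values and the query are placeholders) — the SQL passed through the query/dbtable option is what the parser gets to see:

```python
# Assumes an existing SparkSession named `spark`; all values are placeholders.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:redshift://example-cluster:5439/dev")
    .option("query", "SELECT id, amount FROM public.sales")  # parsed for lineage
    .option("user", "user")
    .option("password", "password")
    .load()
)
```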

      @@ -130059,7 +130304,7 @@

      csv_file = location.csv

      2023-08-28 04:53:03
      -

      *Thread Reply:* Hi @Jakub Dardziński / @Paweł Leszczyński, thank you for taking out time to reply on my query. We need to capture only load and unload query lineage which we are running using Spark.

      +

      *Thread Reply:* Hi @Jakub Dardziński / @Paweł Leszczyński, thank you for taking out time to reply on my query. We need to capture only load and unload query lineage which we are running using Spark.

      If you have any sample implementation for reference, it will be indeed helpful

      @@ -130087,7 +130332,7 @@

      csv_file = location.csv

      2023-08-28 06:12:46
      -

      *Thread Reply:* I think we don't support load yet on our side: https://github.com/OpenLineage/OpenLineage/blob/main/integration/sql/impl/src/visitor.rs#L8

      +

      *Thread Reply:* I think we don't support load yet on our side: https://github.com/OpenLineage/OpenLineage/blob/main/integration/sql/impl/src/visitor.rs#L8

      @@ -130138,7 +130383,7 @@

      csv_file = location.csv

      2023-08-28 08:18:14
      -

      *Thread Reply:* Yeah! any way you can think of, we can accommodate it specially load and unload statement. +

*Thread Reply:* Yeah! any way you can think of, we can accommodate it, especially the load and unload statements. Also, we would like to capture lineage information where our endpoints are Sagemaker and Redis

      @@ -130165,7 +130410,7 @@

      csv_file = location.csv

      2023-08-28 13:20:37
      -

      *Thread Reply:* @Paweł Leszczyński can we use this code base integration/common/openlineage/common/provider/redshift_data.py for redshift lineage capture

      +

      *Thread Reply:* @Paweł Leszczyński can we use this code base integration/common/openlineage/common/provider/redshift_data.py for redshift lineage capture

      @@ -130191,7 +130436,7 @@

      csv_file = location.csv

      2023-08-28 14:26:40
      -

      *Thread Reply:* it still expects input and output tables that are usually retrieved from sqlparser

      +

      *Thread Reply:* it still expects input and output tables that are usually retrieved from sqlparser

      @@ -130217,7 +130462,7 @@

      csv_file = location.csv

      2023-08-28 14:31:00
      -

      *Thread Reply:* for Sagemaker there is an Airflow integration written, might be an example possibly +

      *Thread Reply:* for Sagemaker there is an Airflow integration written, might be an example possibly https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/extractors/sagemaker_extractors.py

      @@ -130275,7 +130520,7 @@

      csv_file = location.csv

      2023-08-23 12:27:15
      -

      *Thread Reply:* Thank you for requesting a release @Abdallah. Three +1s from committers will authorize.

      +

      *Thread Reply:* Thank you for requesting a release @Abdallah. Three +1s from committers will authorize.

      @@ -130305,7 +130550,7 @@

      csv_file = location.csv

      2023-08-23 13:13:18
      -

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated within 2 business days.

      +

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated within 2 business days.

      @@ -130357,7 +130602,7 @@

      csv_file = location.csv

      2023-08-23 13:24:32
      2023-08-23 13:29:05
      -

      *Thread Reply:* For custom transport, you have to provide implementation of interface https://github.com/OpenLineage/OpenLineage/blob/4a1a5c3bf9767467b71ca0e1b6d820ba9e[…]ain/java/io/openlineage/client/transports/TransportBuilder.java and point to it in META_INF file

      +

      *Thread Reply:* For custom transport, you have to provide implementation of interface https://github.com/OpenLineage/OpenLineage/blob/4a1a5c3bf9767467b71ca0e1b6d820ba9e[…]ain/java/io/openlineage/client/transports/TransportBuilder.java and point to it in META_INF file

      @@ -130434,7 +130679,7 @@

      csv_file = location.csv

      2023-08-23 13:29:52
      -

      *Thread Reply:* But if I understand correctly, if you want to change behavior rather than extend, the correct way may be to either contribute it to repo - if that behavior is useful to anyone, or fork the repo

      +

      *Thread Reply:* But if I understand correctly, if you want to change behavior rather than extend, the correct way may be to either contribute it to repo - if that behavior is useful to anyone, or fork the repo

      @@ -130460,7 +130705,7 @@

      csv_file = location.csv

      2023-08-23 15:14:43
      -

      *Thread Reply:* @Maciej Obuchowski - Can you elaborate more on the "point to it in META_INF file"? Let's say we have the custom transport type built in a standalone jar by extending transport builder - what're the exact next steps to use this custom transport in the standalone jar when doing spark-submit?

      +

      *Thread Reply:* @Maciej Obuchowski - Can you elaborate more on the "point to it in META_INF file"? Let's say we have the custom transport type built in a standalone jar by extending transport builder - what're the exact next steps to use this custom transport in the standalone jar when doing spark-submit?

      @@ -130486,7 +130731,7 @@

      csv_file = location.csv

      2023-08-23 15:23:13
      -

      *Thread Reply:* @Athitya Kumar your jar needs to have META-INF/services/io.openlineage.client.transports.TransportBuilder with fully qualified class names of your custom TransportBuilders there - like openlineage-spark has +

*Thread Reply:* @Athitya Kumar your jar needs to have META-INF/services/io.openlineage.client.transports.TransportBuilder with fully qualified class names of your custom TransportBuilders there - like openlineage-spark has
io.openlineage.client.transports.HttpTransportBuilder
io.openlineage.client.transports.KafkaTransportBuilder
io.openlineage.client.transports.ConsoleTransportBuilder

@@ -130517,7 +130762,7 @@

      csv_file = location.csv

      2023-08-25 01:49:29
      -

      *Thread Reply:* @Maciej Obuchowski - I think this change may be required for consumers to leverage custom transports, can you check & verify this GH comment? +

      *Thread Reply:* @Maciej Obuchowski - I think this change may be required for consumers to leverage custom transports, can you check & verify this GH comment? https://github.com/OpenLineage/OpenLineage/issues/2007#issuecomment-1690350630

      @@ -130574,7 +130819,7 @@

      csv_file = location.csv

      2023-08-25 06:52:30
      -

      *Thread Reply:* Probably, I will look at more details next week @Athitya Kumar as I'm in transit

      +

      *Thread Reply:* Probably, I will look at more details next week @Athitya Kumar as I'm in transit

      @@ -130625,7 +130870,7 @@

      csv_file = location.csv

- 👏 Ayoub Oudmane, Abdallah, Yuanli Wang, Athitya Kumar, Mars Lan, Maciej Obuchowski, Harel Shein, Kiran Hiremath, Thomas Abraham
+ 👏 Ayoub Oudmane, Abdallah, Yuanli Wang, Athitya Kumar, Mars Lan (Metaphor), Maciej Obuchowski, Harel Shein, Kiran Hiremath, Thomas Abraham
      @@ -130732,7 +130977,7 @@

      csv_file = location.csv

      2023-08-25 11:32:12
      -

      *Thread Reply:* there will certainly be more meetups, don’t worry about that!

      +

      *Thread Reply:* there will certainly be more meetups, don’t worry about that!

      @@ -130758,7 +131003,7 @@

      csv_file = location.csv

      2023-08-25 11:32:30
      -

      *Thread Reply:* where are you located? perhaps we can try to organize a meetup closer to where you are.

      +

      *Thread Reply:* where are you located? perhaps we can try to organize a meetup closer to where you are.

      @@ -130784,7 +131029,7 @@

      csv_file = location.csv

      2023-08-25 11:49:37
      -

      *Thread Reply:* Thanks a lot for the response, we are in London. We'd be glad to help you organise a meetup and also meet in person!

      +

      *Thread Reply:* Thanks a lot for the response, we are in London. We'd be glad to help you organise a meetup and also meet in person!

      @@ -130810,7 +131055,7 @@

      csv_file = location.csv

      2023-08-25 11:51:39
      -

      *Thread Reply:* This is awesome, thanks @George Polychronopoulos. I’ll start a channel and invite you

      +

      *Thread Reply:* This is awesome, thanks @George Polychronopoulos. I’ll start a channel and invite you

      @@ -130862,7 +131107,7 @@

      csv_file = location.csv

      2023-08-28 05:59:10
      -

      *Thread Reply:* Although you emit DatasetEvent, you still emit an event and eventTime is a valid marker.

      +

      *Thread Reply:* Although you emit DatasetEvent, you still emit an event and eventTime is a valid marker.

      @@ -130888,7 +131133,7 @@

      csv_file = location.csv

      2023-08-28 06:01:40
      -

      *Thread Reply:* so, should I use the current time at the moment of emitting it and that's it?

      +

      *Thread Reply:* so, should I use the current time at the moment of emitting it and that's it?

      @@ -130914,7 +131159,7 @@

      csv_file = location.csv

      2023-08-28 06:01:53
      -

      *Thread Reply:* yes, that should be it

      +

      *Thread Reply:* yes, that should be it
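
Putting the advice together, emitting such an event over plain HTTP could look like the sketch below (endpoint, producer, namespace, and dataset name are placeholders, and the schemaURL is illustrative):

```python
from datetime import datetime, timezone

import requests

event = {
    # per the advice above: the moment of emission is a valid eventTime
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "producer": "https://example.com/my/producer",  # placeholder
    "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/DatasetEvent",  # illustrative
    "dataset": {"namespace": "my-namespace", "name": "my-dataset"},
}

requests.post("http://localhost:5000/api/v1/lineage", json=event, timeout=10)
```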

      @@ -130970,7 +131215,7 @@

      csv_file = location.csv

      2023-08-28 06:02:49
      -

      *Thread Reply:* marquez is not capable of reflecting DatasetEvents in DB but it should respond with Unsupported event type

      +

      *Thread Reply:* marquez is not capable of reflecting DatasetEvents in DB but it should respond with Unsupported event type

      @@ -130996,7 +131241,7 @@

      csv_file = location.csv

      2023-08-28 06:03:15
      -

      *Thread Reply:* and return 200 instead of 201 created

      +

      *Thread Reply:* and return 200 instead of 201 created

      @@ -131022,7 +131267,7 @@

      csv_file = location.csv

      2023-08-28 06:05:41
      -

      *Thread Reply:* I'll have a deeper look then, probably I'm doing something wrong. thanks @Paweł Leszczyński

      +

      *Thread Reply:* I'll have a deeper look then, probably I'm doing something wrong. thanks @Paweł Leszczyński

      @@ -131074,7 +131319,7 @@

      csv_file = location.csv

      2023-08-28 14:23:24
      -

      *Thread Reply:* I'd rather generate them from OL spec (compliant with JSON Schema)

      +

      *Thread Reply:* I'd rather generate them from OL spec (compliant with JSON Schema)

      @@ -131100,7 +131345,7 @@

      csv_file = location.csv

      2023-08-28 15:12:21
      -

      *Thread Reply:* I'll look into this. I take you to mean that I would use the OL spec which is available as a set of JSON schemas to create the data object and then HTTP POST it using vanilla Golang. Is that correct? Thank you for your help!

      +

      *Thread Reply:* I'll look into this. I take you to mean that I would use the OL spec which is available as a set of JSON schemas to create the data object and then HTTP POST it using vanilla Golang. Is that correct? Thank you for your help!

      @@ -131126,7 +131371,7 @@

      csv_file = location.csv

      2023-08-28 15:30:05
      -

      *Thread Reply:* Correct! You’re also very welcome to contribute Golang client (currently we have Python & Java clients) if you manage to send events using golang 🙂

      +

      *Thread Reply:* Correct! You’re also very welcome to contribute Golang client (currently we have Python & Java clients) if you manage to send events using golang 🙂
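
A rough sketch of that approach in Python (a Go client would mirror the same shape): build the event as plain data per the spec, optionally validate it against the published JSON Schema, then HTTP POST it. The spec URL is the schema's canonical location at the time of writing; treat the details as assumptions.

```
import json
import uuid
from datetime import datetime, timezone
from urllib.request import urlopen

import requests
from jsonschema import validate  # pip install jsonschema

event = {
    "eventType": "START",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "my_namespace", "name": "my_job"},
    "producer": "https://example.com/my-producer",
    "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/definitions/RunEvent",
}

schema = json.load(urlopen("https://openlineage.io/spec/2-0-2/OpenLineage.json"))
validate(instance=event, schema=schema)  # raises ValidationError if the event is malformed

requests.post("http://localhost:5000/api/v1/lineage", json=event)
```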

      @@ -131249,7 +131494,7 @@

      csv_file = location.csv

      - 🎉 Drew Meyers, Harel Shein, Maciej Obuchowski, Julian LaNeve, Mars Lan + 🎉 Drew Meyers, Harel Shein, Maciej Obuchowski, Julian LaNeve, Mars Lan (Metaphor)
      @@ -131274,31 +131519,23 @@

      csv_file = location.csv

      2023-08-29 03:18:04
      -

Hello, I'm currently in the process of following the instructions outlined in the provided getting started guide at https://openlineage.io/getting-started/. However, I've encountered a problem while attempting to complete *Step 1* of the guide. Unfortunately, I'm encountering an internal server error at this stage. I did manage to successfully run Marquez, but it appears that there might be an issue that needs to be addressed. I have attached screenshots.

      +

Hello, I'm currently in the process of following the instructions outlined in the provided getting started guide at https://openlineage.io/getting-started/. However, I've encountered a problem while attempting to complete *Step 1* of the guide. Unfortunately, I'm encountering an internal server error at this stage. I did manage to successfully run Marquez, but it appears that there might be an issue that needs to be addressed. I have attached screenshots.

      @@ -131326,7 +131563,7 @@

      csv_file = location.csv

      2023-08-29 03:20:18
      -

*Thread Reply:* is port 5000 taken by any other application? or does ./docker/up.sh show any errors in its logs?

      +

*Thread Reply:* is port 5000 taken by any other application? or does ./docker/up.sh show any errors in its logs?

      @@ -131352,18 +131589,14 @@

      csv_file = location.csv

      2023-08-29 05:23:01
      -

*Thread Reply:* @Jakub Dardziński port 5000 is not taken by any other application. The logs show some errors, but I am not sure what the issue is here.

      +

*Thread Reply:* @Jakub Dardziński port 5000 is not taken by any other application. The logs show some errors, but I am not sure what the issue is here.

      @@ -131391,7 +131624,7 @@

      csv_file = location.csv

      2023-08-29 10:02:38
      -

      *Thread Reply:* I think Marquez is running on WSL while you're trying to connect from host computer?

      +

      *Thread Reply:* I think Marquez is running on WSL while you're trying to connect from host computer?

      @@ -131443,7 +131676,7 @@

      csv_file = location.csv

      2023-08-29 10:58:29
      -

      *Thread Reply:* reply by @Julian LaNeve: yes 🙂💯

      +

      *Thread Reply:* reply by @Julian LaNeve: yes 🙂💯

      @@ -131499,7 +131732,7 @@

      csv_file = location.csv

      2023-08-30 10:53:08
      -

      *Thread Reply:* > then should my namespace be based on the client I am working with? +

      *Thread Reply:* > then should my namespace be based on the client I am working with? I think each of those sources should be a different namespace?
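
In other words, one namespace per source system; a short sketch with the Python client's Dataset type (all names illustrative):

```
from openlineage.client.run import Dataset

# Distinct sources get distinct namespaces; datasets are named within them
orders = Dataset(namespace="postgres://client-a-db:5432", name="public.orders")
events = Dataset(namespace="s3://client-a-raw", name="exports/events")
```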

      @@ -131526,7 +131759,7 @@

      csv_file = location.csv

      2023-08-30 12:59:53
      -

      *Thread Reply:* got it, yeah I was kind of picturing as one namespace for the client (we handle many clients but they are completely distinct entities). I was able to get it to work with multiple namespaces like you suggested and Marquez was able to plot everything correctly in the visualization

      +

      *Thread Reply:* got it, yeah I was kind of picturing as one namespace for the client (we handle many clients but they are completely distinct entities). I was able to get it to work with multiple namespaces like you suggested and Marquez was able to plot everything correctly in the visualization

      @@ -131552,7 +131785,7 @@

      csv_file = location.csv

      2023-08-30 13:01:18
      -

      *Thread Reply:* I noticed some of my Dataset facets make more sense as Run facets, for example, the name of the specific file I processed and how many rows of data / size of the data for that schedule. that won't impact the Run facets Airflow provides right? I can still have the schedule information + my custom run facets?

      +

      *Thread Reply:* I noticed some of my Dataset facets make more sense as Run facets, for example, the name of the specific file I processed and how many rows of data / size of the data for that schedule. that won't impact the Run facets Airflow provides right? I can still have the schedule information + my custom run facets?

      @@ -131578,7 +131811,7 @@

      csv_file = location.csv

      2023-08-30 13:06:38
      -

      *Thread Reply:* Yes, unless you name it the same as one of the Airflow facets 🙂

      +

      *Thread Reply:* Yes, unless you name it the same as one of the Airflow facets 🙂
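
A minimal sketch of such a custom run facet (field names are illustrative; the only constraint from the reply above is not clashing with the facet names the Airflow integration already emits):

```
import attr

from openlineage.client.facet import BaseFacet


@attr.s
class ProcessedFileRunFacet(BaseFacet):
    # Per-run details: which file this schedule processed and how big it was
    filename: str = attr.ib()
    rowCount: int = attr.ib()
    sizeBytes: int = attr.ib()
```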

      @@ -131630,7 +131863,7 @@

      csv_file = location.csv

      2023-08-30 12:23:18
      -

      *Thread Reply:* I’ve seen people do this through the ingress controller in Kubernetes. Unfortunately I don’t have documentation besides k8s specific ones you would find for the ingress controller you’re using. You’d redirect any unauthenticated request to your identity provider

      +

      *Thread Reply:* I’ve seen people do this through the ingress controller in Kubernetes. Unfortunately I don’t have documentation besides k8s specific ones you would find for the ingress controller you’re using. You’d redirect any unauthenticated request to your identity provider

      @@ -131727,7 +131960,7 @@

      csv_file = location.csv

      2023-08-30 12:15:31
      -

      *Thread Reply:* I’ll be there and looking forward to see @John Lukenoff ‘s presentation

      +

      *Thread Reply:* I’ll be there and looking forward to see @John Lukenoff ‘s presentation

      @@ -131783,7 +132016,7 @@

      csv_file = location.csv

      2023-08-30 23:25:21
      -

      *Thread Reply:* Sorry about that!

      +

      *Thread Reply:* Sorry about that!

      @@ -131887,7 +132120,7 @@

      csv_file = location.csv

      2023-08-31 02:58:41
      -

      *Thread Reply:* io.openlineage:openlineage_spark:0.12.0 -> could you repeat the steps with newer version?

      +

      *Thread Reply:* io.openlineage:openlineage_spark:0.12.0 -> could you repeat the steps with newer version?

      @@ -131992,7 +132225,7 @@

      csv_file = location.csv

      2023-09-01 05:15:28
      -

      *Thread Reply:* please use latest io.openlineage:openlineage_spark:1.1.0 instead. openlineage-java is already contained in the jar, no need to add it on your own.

      +

      *Thread Reply:* please use latest io.openlineage:openlineage_spark:1.1.0 instead. openlineage-java is already contained in the jar, no need to add it on your own.
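
For reference, a minimal way to pull in that single artifact from PySpark (a sketch; the transport config keys are assumptions - check the docs for the exact keys your version supports):

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("ol-example")
    # openlineage-spark already bundles openlineage-java, so one package suffices
    .config("spark.jars.packages", "io.openlineage:openlineage_spark:1.1.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # transport settings below are assumptions - verify against your version's docs
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")
    .getOrCreate()
)
```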

      @@ -132044,7 +132277,7 @@

      csv_file = location.csv

      2023-09-01 06:00:53
      -

      *Thread Reply:* @Michael Robinson

      +

      *Thread Reply:* @Michael Robinson

      @@ -132070,7 +132303,7 @@

      csv_file = location.csv

      2023-09-01 17:13:32
      -

      *Thread Reply:* The recording is on the youtube channel here. I’ll update the wiki ASAP

      +

      *Thread Reply:* The recording is on the youtube channel here. I’ll update the wiki ASAP

      YouTube
      @@ -132195,7 +132428,7 @@

      csv_file = location.csv

      - 🙌 Harel Shein, Willy Lulciuc, Mars Lan, Peter Hicks, Maciej Obuchowski, Paweł Leszczyński, Eric Veleker, Sheeri Cabral (Collibra), Ross Turk, Michael Robinson + 🙌 Harel Shein, Willy Lulciuc, Mars Lan (Metaphor), Peter Hicks, Maciej Obuchowski, Paweł Leszczyński, Eric Veleker, Sheeri Cabral (Collibra), Ross Turk, Michael Robinson
      @@ -132224,7 +132457,7 @@

      csv_file = location.csv

      2023-09-01 23:09:55
      -

      *Thread Reply:* https://www.youtube.com/watch?v=zvCdrNJsxBo&t=2260s

      +

      *Thread Reply:* https://www.youtube.com/watch?v=zvCdrNJsxBo&t=2260s

      YouTube
      @@ -132368,7 +132601,7 @@

      csv_file = location.csv

      2023-09-04 06:43:47
      -

      *Thread Reply:* Sounds good to me

      +

      *Thread Reply:* Sounds good to me

      @@ -132394,7 +132627,7 @@

      csv_file = location.csv

      2023-09-11 05:15:03
      -

      *Thread Reply:* Added this here: https://github.com/OpenLineage/OpenLineage/pull/2099

      +

      *Thread Reply:* Added this here: https://github.com/OpenLineage/OpenLineage/pull/2099

      @@ -132509,7 +132742,7 @@

      csv_file = location.csv

      2023-09-04 06:54:12
      -

      *Thread Reply:* I think it only depends on log4j configuration

      +

      *Thread Reply:* I think it only depends on log4j configuration

      @@ -132535,7 +132768,7 @@

      csv_file = location.csv

      2023-09-04 06:57:15
      -

      *Thread Reply:* ```# Set everything to be logged to the console +

*Thread Reply:* ```# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err

@@ -132571,7 +132804,7 @@

      set the log level for the openlineage spark library

      2023-09-04 11:28:11
      -

      *Thread Reply:* Hmm... I can see the logs for the other commands, like createViewCommand etc. I just cannot see it for any of the delta runs

      +

      *Thread Reply:* Hmm... I can see the logs for the other commands, like createViewCommand etc. I just cannot see it for any of the delta runs

      @@ -132597,7 +132830,7 @@

      set the log level for the openlineage spark library

      2023-09-05 03:33:03
      -

      *Thread Reply:* that's interesting. So, logging is done here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/main/java/io/openlineage/spark/agent/EventEmitter.java#L63 and this code is unaware of delta.

      +

      *Thread Reply:* that's interesting. So, logging is done here: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/main/java/io/openlineage/spark/agent/EventEmitter.java#L63 and this code is unaware of delta.

The possible problem could be filtering delta events (which we do because delta is noisy)

      set the log level for the openlineage spark library
      2023-09-05 03:33:36
      -

      *Thread Reply:* Recently, we've closed that https://github.com/OpenLineage/OpenLineage/issues/1982 which prevents generating events for ` +

      *Thread Reply:* Recently, we've closed that https://github.com/OpenLineage/OpenLineage/issues/1982 which prevents generating events for ` createOrReplaceTempView

      @@ -132716,7 +132949,7 @@

      set the log level for the openlineage spark library

      2023-09-05 03:35:12
      -

      *Thread Reply:* and this is the code change: https://github.com/OpenLineage/OpenLineage/pull/1987/files

      +

      *Thread Reply:* and this is the code change: https://github.com/OpenLineage/OpenLineage/pull/1987/files

      @@ -132742,7 +132975,7 @@

      set the log level for the openlineage spark library

      2023-09-05 05:19:22
      -

*Thread Reply:* Hmm, I'm a little confused here. I thought we are only filtering out events for certain specific commands, like show table etc., because it's noisy, right? Some important commands like MergeInto or SaveIntoDataSource used to be logged before, but I notice now that it's not being logged anymore... +

*Thread Reply:* Hmm, I'm a little confused here. I thought we are only filtering out events for certain specific commands, like show table etc., because it's noisy, right? Some important commands like MergeInto or SaveIntoDataSource used to be logged before, but I notice now that it's not being logged anymore...
I'm using 0.23.0 openlineage version.

      @@ -132769,7 +133002,7 @@

      set the log level for the openlineage spark library

      2023-09-05 05:47:51
      -

      *Thread Reply:* yes, we do. it's just sometimes when doing a filter, we can remove too much. but SaveIntoDataSource and MergeInto should be fine, as we do check them within the tests

      +

      *Thread Reply:* yes, we do. it's just sometimes when doing a filter, we can remove too much. but SaveIntoDataSource and MergeInto should be fine, as we do check them within the tests

      @@ -132821,7 +133054,7 @@

      set the log level for the openlineage spark library

      2023-09-05 08:54:57
      -

*Thread Reply:* map_index should indeed be included when calculating the run ID (it’s deterministic in the Airflow integration) +

*Thread Reply:* map_index should indeed be included when calculating the run ID (it’s deterministic in the Airflow integration)
what version of Airflow are you using btw?
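
Illustrative only (the real logic lives in the integration): a deterministic run ID has to hash the map_index along with the other task coordinates, roughly like:

```
from uuid import NAMESPACE_URL, uuid5


def lineage_run_id(dag_id: str, task_id: str, logical_date: str, try_number: int, map_index: int) -> str:
    # Same inputs always yield the same UUID, so a given mapped task instance
    # always resolves to the same OpenLineage run
    seed = f"{dag_id}.{task_id}.{logical_date}.{try_number}.{map_index}"
    return str(uuid5(NAMESPACE_URL, seed))
```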

      @@ -132848,7 +133081,7 @@

      set the log level for the openlineage spark library

      2023-09-05 09:04:14
      -

      *Thread Reply:* 2.7.0

      +

      *Thread Reply:* 2.7.0

      I do see this error log in all of my dynamic tasks which might explain it:

      @@ -132892,7 +133125,7 @@

      set the log level for the openlineage spark library

      2023-09-05 09:05:34
      -

      *Thread Reply:* I only have a few custom operators with the on_complete facet so I think this is a built in one - it runs before my task custom logs for example

      +

      *Thread Reply:* I only have a few custom operators with the on_complete facet so I think this is a built in one - it runs before my task custom logs for example

      @@ -132918,7 +133151,7 @@

      set the log level for the openlineage spark library

      2023-09-05 09:06:05
      -

      *Thread Reply:* and any time I messed up my custom facet, the error would be at the bottom of the logs. this is on top, probably an on_start facet?

      +

      *Thread Reply:* and any time I messed up my custom facet, the error would be at the bottom of the logs. this is on top, probably an on_start facet?

      @@ -132944,7 +133177,7 @@

      set the log level for the openlineage spark library

      2023-09-05 09:16:32
      -

      *Thread Reply:* seems like some circular import

      +

      *Thread Reply:* seems like some circular import

      @@ -132970,7 +133203,7 @@

      set the log level for the openlineage spark library

      2023-09-05 09:19:47
      -

      *Thread Reply:* I just tested it manually, it’s a bug in OL provider. let me fix that

      +

      *Thread Reply:* I just tested it manually, it’s a bug in OL provider. let me fix that

      @@ -132996,7 +133229,7 @@

      set the log level for the openlineage spark library

      2023-09-05 10:53:28
      -

      *Thread Reply:* cool, thanks. I am glad it is just a bug, I was afraid dynamic tasks were not supported for a minute there

      +

      *Thread Reply:* cool, thanks. I am glad it is just a bug, I was afraid dynamic tasks were not supported for a minute there

      @@ -133022,7 +133255,7 @@

      set the log level for the openlineage spark library

      2023-09-07 11:46:20
      -

      *Thread Reply:* how do the provider updates work? they can be released in between Airflow releases and issues for them are raised on the main Airflow repo?

      +

      *Thread Reply:* how do the provider updates work? they can be released in between Airflow releases and issues for them are raised on the main Airflow repo?

      @@ -133048,7 +133281,7 @@

      set the log level for the openlineage spark library

      2023-09-07 11:50:07
      -

*Thread Reply:* generally speaking, anything related to OL-Airflow should be placed in the Airflow repo; important changes/bug fixes would be implemented in the OL repo as well

      +

*Thread Reply:* generally speaking, anything related to OL-Airflow should be placed in the Airflow repo; important changes/bug fixes would be implemented in the OL repo as well

      @@ -133074,7 +133307,7 @@

      set the log level for the openlineage spark library

      2023-09-07 15:40:31
      -

      *Thread Reply:* got it, thanks

      +

      *Thread Reply:* got it, thanks

      @@ -133100,7 +133333,7 @@

      set the log level for the openlineage spark library

      2023-09-07 19:43:46
      -

      *Thread Reply:* is there a way for me to install the openlineage provider based on the commit you made to fix the circular imports?

      +

      *Thread Reply:* is there a way for me to install the openlineage provider based on the commit you made to fix the circular imports?

I was going to try to install from the Airflow main branch but didn't want to mess anything up

      @@ -133128,7 +133361,7 @@

      set the log level for the openlineage spark library

      2023-09-07 19:44:39
      -

      *Thread Reply:* I saw it was merged to airflow main but it is not in 2.7.1 and there is no 1.0.3 provider version yet, so I wondered if I could manually install it for the time being

      +

      *Thread Reply:* I saw it was merged to airflow main but it is not in 2.7.1 and there is no 1.0.3 provider version yet, so I wondered if I could manually install it for the time being

      @@ -133154,7 +133387,7 @@

      set the log level for the openlineage spark library

      2023-09-08 05:45:48
      -

      *Thread Reply:* https://github.com/apache/airflow/blob/main/BREEZE.rst#preparing-provider-packages +

*Thread Reply:* https://github.com/apache/airflow/blob/main/BREEZE.rst#preparing-provider-packages
building the provider package on your own would probably be the best idea? that depends on how you manage your Airflow instance

      @@ -133181,7 +133414,7 @@

      set the log level for the openlineage spark library

      2023-09-08 12:01:53
      -

      *Thread Reply:* there's 1.1.0rc1 btw

      +

      *Thread Reply:* there's 1.1.0rc1 btw

      @@ -133207,7 +133440,7 @@

      set the log level for the openlineage spark library

      2023-09-08 13:44:44
      -

      *Thread Reply:* perfect, thanks. I got started with breeze but then stopped haha

      +

      *Thread Reply:* perfect, thanks. I got started with breeze but then stopped haha

      @@ -133237,7 +133470,7 @@

      set the log level for the openlineage spark library

      2023-09-10 20:29:00
      -

      *Thread Reply:* The dynamic task mapping error is gone, I did run into this:

      +

      *Thread Reply:* The dynamic task mapping error is gone, I did run into this:

      File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/openlineage/extractors/base.py", line 70, in disabledoperators operator.strip() for operator in conf.get("openlineage", "disabledfor_operators").split(";") @@ -133271,7 +133504,7 @@

      set the log level for the openlineage spark library

      2023-09-10 20:49:17
      -

      *Thread Reply:* added "disabledforoperators" to my openlineage config and it worked (using Airflow helm chart - not sure if that means there is an error because the value I provided should just be the default value, not sure why I needed to explicitly specify it)

      +

      *Thread Reply:* added "disabledforoperators" to my openlineage config and it worked (using Airflow helm chart - not sure if that means there is an error because the value I provided should just be the default value, not sure why I needed to explicitly specify it)

openlineage:
  disabled_for_operators: ""

@@ -133392,7 +133625,7 @@

      set the log level for the openlineage spark library

      2023-09-06 18:03:01
      -

      *Thread Reply:* @Willy Lulciuc wdyt?

      +

      *Thread Reply:* @Willy Lulciuc wdyt?

      @@ -133488,7 +133721,7 @@

      set the log level for the openlineage spark library

      2023-09-06 18:28:17
      -

      *Thread Reply:* Hello Bernat,

      +

      *Thread Reply:* Hello Bernat,

The primary mechanism to extend the model is through facets. You can either:
• create new standard facets in the spec: https://github.com/OpenLineage/OpenLineage/tree/main/spec/facets

@@ -133528,7 +133761,7 @@

      set the log level for the openlineage spark library

      2023-09-06 18:31:12
      -

*Thread Reply:* I see, so in general one is best off copying a standard facet and maintaining it under a different name. That way it can be made mandatory 🙂 and one does not need to be blocked for a long time until there's community agreement 🤔

      +

*Thread Reply:* I see, so in general one is best off copying a standard facet and maintaining it under a different name. That way it can be made mandatory 🙂 and one does not need to be blocked for a long time until there's community agreement 🤔

      @@ -133554,7 +133787,7 @@

      set the log level for the openlineage spark library

      2023-09-06 18:35:43
      -

      *Thread Reply:* Yes, The goal of custom facets is to allow you to experiment and extend the spec however you want without having to wait for approval. +

      *Thread Reply:* Yes, The goal of custom facets is to allow you to experiment and extend the spec however you want without having to wait for approval. If the custom facet is very specific to a third party project/product then it makes sense for it to stay a custom facet. If it is more generic then it makes sense to add it to the core facets as part of the spec. Hopefully community agreement can be achieved relatively quickly. Unless someone is strongly against something, it can be added without too much red tape. Typically with support in at least one of the integrations to validate the model.
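
For the record, whether standard or custom, a facet on the wire is just a keyed JSON object carrying its own _producer and _schemaURL; a custom one might look like this (all names illustrative):

```
# The key is the facet name; prefixing it with an org identifier avoids clashes
my_run_facets = {
    "myCompany_connectionInfo": {
        "_producer": "https://example.com/my-producer",
        "_schemaURL": "https://example.com/schemas/ConnectionInfoRunFacet.json",
        "host": "db.internal.example.com",
        "port": 5432,
    }
}
```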

      @@ -133701,7 +133934,7 @@

      set the log level for the openlineage spark library

      - 🙌 Mars Lan, Jarek Potiuk, Harel Shein, Maciej Obuchowski, Peter Hicks, Paweł Leszczyński, Dongjin Seo + 🙌 Mars Lan (Metaphor), Jarek Potiuk, Harel Shein, Maciej Obuchowski, Peter Hicks, Paweł Leszczyński, Dongjin Seo
      @@ -133757,7 +133990,7 @@

      set the log level for the openlineage spark library

      2023-09-11 17:09:40
      -

      *Thread Reply:* If I log this line I can tell the TokenProvider is the class instance I would expect: https://github.com/OpenLineage/OpenLineage/blob/45d94fb73b5488d34b8ca544b58317382ceb3980/client/python/openlineage/client/transport/http.py#L55

      +

      *Thread Reply:* If I log this line I can tell the TokenProvider is the class instance I would expect: https://github.com/OpenLineage/OpenLineage/blob/45d94fb73b5488d34b8ca544b58317382ceb3980/client/python/openlineage/client/transport/http.py#L55

      @@ -133808,7 +134041,7 @@

      set the log level for the openlineage spark library

      2023-09-11 17:11:14
      -

      *Thread Reply:* However, if I log the token_provider here I get the origin TokenProvider: https://github.com/OpenLineage/OpenLineage/blob/45d94fb73b5488d34b8ca544b58317382ceb3980/client/python/openlineage/client/transport/http.py#L154

      +

      *Thread Reply:* However, if I log the token_provider here I get the origin TokenProvider: https://github.com/OpenLineage/OpenLineage/blob/45d94fb73b5488d34b8ca544b58317382ceb3980/client/python/openlineage/client/transport/http.py#L154

      @@ -133859,7 +134092,7 @@

      set the log level for the openlineage spark library

      2023-09-11 17:18:56
      -

      *Thread Reply:* Ah I think I see the issue. Looks like this was introduced here, we are instantiating with the base token provider here when we should be using the subclass: https://github.com/OpenLineage/OpenLineage/pull/1869/files#diff-2f8ea6f9a22b5567de8ab56c6a63da8e7adf40cb436ee5e7e6b16e70a82afe05R57

      +

      *Thread Reply:* Ah I think I see the issue. Looks like this was introduced here, we are instantiating with the base token provider here when we should be using the subclass: https://github.com/OpenLineage/OpenLineage/pull/1869/files#diff-2f8ea6f9a22b5567de8ab56c6a63da8e7adf40cb436ee5e7e6b16e70a82afe05R57

      @@ -133885,7 +134118,7 @@

      set the log level for the openlineage spark library

      2023-09-11 17:37:42
      -

      *Thread Reply:* Opened a PR for this here: https://github.com/OpenLineage/OpenLineage/pull/2100

      +

      *Thread Reply:* Opened a PR for this here: https://github.com/OpenLineage/OpenLineage/pull/2100
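
For context, wiring a token provider through the Python client's HTTP transport looks roughly like this (names per the client docs of the time; the bug above was that the config path re-instantiated the base TokenProvider instead of the configured subclass):

```
from openlineage.client import OpenLineageClient
from openlineage.client.transport.http import ApiKeyTokenProvider, HttpConfig, HttpTransport

http_config = HttpConfig(
    url="http://localhost:5000",
    # should survive config parsing as the subclass, not the TokenProvider base;
    # the "api_key" key name is per the docs example
    auth=ApiKeyTokenProvider({"api_key": "my-secret-key"}),
)
client = OpenLineageClient(transport=HttpTransport(http_config))
```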

      @@ -134085,18 +134318,14 @@

      set the log level for the openlineage spark library

      2023-09-12 08:15:19
      -

      *Thread Reply:* This is the error message:

      +

      *Thread Reply:* This is the error message:

      @@ -134124,7 +134353,7 @@

      set the log level for the openlineage spark library

      2023-09-12 10:38:41
      -

      *Thread Reply:* no permissions?

      +

      *Thread Reply:* no permissions?

      @@ -134153,15 +134382,11 @@

      set the log level for the openlineage spark library

I am trying to run Google Cloud Composer where I have added the openlineage-airflow PyPI package as a dependency and have set the env var OPENLINEAGE_EXTRACTORS to point to my custom extractor. I have added a folder named dependencies and placed my extractor file inside it, and the path given to OPENLINEAGE_EXTRACTORS is dependencies.<filename>.<extractor_class_name>…still it fails with an exception saying No module named 'dependencies'. Can anyone kindly help me correct my mistake?

      @@ -134189,7 +134414,7 @@

      set the log level for the openlineage spark library

      2023-09-12 17:15:36
      -

      *Thread Reply:* Hey @Guntaka Jeevan Paul, can you share some details on which versions of airflow and openlineage you’re using?

      +

      *Thread Reply:* Hey @Guntaka Jeevan Paul, can you share some details on which versions of airflow and openlineage you’re using?

      @@ -134215,7 +134440,7 @@

      set the log level for the openlineage spark library

      2023-09-12 17:16:26
      -

*Thread Reply:* airflow ---> 2.5.3, openlineage-airflow ---> 1.1.0

      +

*Thread Reply:* airflow ---> 2.5.3, openlineage-airflow ---> 1.1.0

      @@ -134241,7 +134466,7 @@

      set the log level for the openlineage spark library

      2023-09-12 17:45:08
      -

      *Thread Reply:* ```import traceback +

*Thread Reply:* ```import traceback
import uuid
from typing import List, Optional

      @@ -134295,7 +134520,7 @@

      set the log level for the openlineage spark library

      2023-09-12 17:45:24
      -

*Thread Reply:* this is the custom extractor code that I'm trying with

      +

*Thread Reply:* this is the custom extractor code that I'm trying with
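
Since the snippet above is cut off in the archive, here is the general shape of a bare-bones custom extractor for openlineage-airflow - a sketch, not the poster's actual code:

```
from typing import List, Optional

from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata


class BigQueryInsertJobExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls) -> List[str]:
        # Operators this extractor should handle
        return ["BigQueryInsertJobOperator"]

    def extract(self) -> Optional[TaskMetadata]:
        # Populate inputs/outputs/facets from self.operator here
        return TaskMetadata(name=f"{self.operator.dag_id}.{self.operator.task_id}")
```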

      @@ -134321,7 +134546,7 @@

      set the log level for the openlineage spark library

      2023-09-12 21:10:02
      -

      *Thread Reply:* thanks @Guntaka Jeevan Paul, will try to take a deeper look tomorrow

      +

      *Thread Reply:* thanks @Guntaka Jeevan Paul, will try to take a deeper look tomorrow

      @@ -134347,7 +134572,7 @@

      set the log level for the openlineage spark library

      2023-09-13 07:54:26
      -

      *Thread Reply:* No module named 'dependencies'. +

      *Thread Reply:* No module named 'dependencies'. This sounds like general Python problem

      @@ -134374,7 +134599,7 @@

      set the log level for the openlineage spark library

      2023-09-13 07:55:12
      -

      *Thread Reply:* https://stackoverflow.com/questions/69991553/how-to-import-custom-modules-in-cloud-composer

      +

      *Thread Reply:* https://stackoverflow.com/questions/69991553/how-to-import-custom-modules-in-cloud-composer

      Stack Overflow
      @@ -134426,7 +134651,7 @@

      set the log level for the openlineage spark library

      2023-09-13 07:56:28
      -

      *Thread Reply:* basically, if you're able to import the file from your dag code, OL should be able too

      +

      *Thread Reply:* basically, if you're able to import the file from your dag code, OL should be able too

      @@ -134452,7 +134677,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:01:12
      -

      *Thread Reply:* The Problem is in the GCS Composer there is a component called Triggerer, which they say is used for deferrable operators…i have logged into that pod and i could see that the GCS Bucket is not mounted on this, but i am unable to understand why is the initialisation happening inside the triggerer pod

      +

      *Thread Reply:* The Problem is in the GCS Composer there is a component called Triggerer, which they say is used for deferrable operators…i have logged into that pod and i could see that the GCS Bucket is not mounted on this, but i am unable to understand why is the initialisation happening inside the triggerer pod

      @@ -134478,18 +134703,14 @@

      set the log level for the openlineage spark library

      2023-09-13 08:01:32
      -

      *Thread Reply:*

      +

      *Thread Reply:*

      @@ -134517,7 +134738,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:01:47
      -

      *Thread Reply:* > The Problem is in the GCS Composer there is a component called Triggerer, which they say is used for deferrable operators…i have logged into that pod and i could see that the GCS Bucket is not mounted on this, but i am unable to understand why is the initialisation happening inside the triggerer pod +

      *Thread Reply:* > The Problem is in the GCS Composer there is a component called Triggerer, which they say is used for deferrable operators…i have logged into that pod and i could see that the GCS Bucket is not mounted on this, but i am unable to understand why is the initialisation happening inside the triggerer pod OL integration is not running on triggerer, only on worker and scheduler pods

      @@ -134544,18 +134765,14 @@

      set the log level for the openlineage spark library

      2023-09-13 08:01:53
      -

      *Thread Reply:*

      +

      *Thread Reply:*

      @@ -134583,7 +134800,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:03:26
      -

*Thread Reply:* As you can see in this screenshot, I am looking at the logs of the triggerer, and it clearly says it is unable to import the openlineage plugin

      +

*Thread Reply:* As you can see in this screenshot, I am looking at the logs of the triggerer, and it clearly says it is unable to import the openlineage plugin

      @@ -134609,18 +134826,14 @@

      set the log level for the openlineage spark library

      2023-09-13 08:03:29
      2023-09-13 08:10:32
      -

*Thread Reply:* I see. There are a few possible things to do here - composer could mount the user files, Airflow could not start plugins on the triggerer, or we could detect we're on the triggerer and not import anything there. However, does it impact OL or Airflow operation in any other way than this log?

      +

*Thread Reply:* I see. There are a few possible things to do here - composer could mount the user files, Airflow could not start plugins on the triggerer, or we could detect we're on the triggerer and not import anything there. However, does it impact OL or Airflow operation in any other way than this log?

      @@ -134674,7 +134887,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:12:06
      -

      *Thread Reply:* Probably we'd have to do something if that really bothers you as there won't be further changes to Airflow 2.5

      +

      *Thread Reply:* Probably we'd have to do something if that really bothers you as there won't be further changes to Airflow 2.5

      @@ -134700,7 +134913,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:18:14
      -

*Thread Reply:* The problem is it is actually not registering this custom extractor written by me; hence I am just receiving the DefaultExtractor output and my extractor code is not even getting triggered

      +

*Thread Reply:* The problem is it is actually not registering this custom extractor written by me; hence I am just receiving the DefaultExtractor output and my extractor code is not even getting triggered

      @@ -134726,7 +134939,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:22:49
      -

      *Thread Reply:* any suggestions to try @Maciej Obuchowski

      +

      *Thread Reply:* any suggestions to try @Maciej Obuchowski

      @@ -134752,7 +134965,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:27:48
      -

      *Thread Reply:* Could you share worker logs?

      +

      *Thread Reply:* Could you share worker logs?

      @@ -134778,7 +134991,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:27:56
      -

      *Thread Reply:* and check if module is importable from your dag code?

      +

      *Thread Reply:* and check if module is importable from your dag code?

      @@ -134804,10 +135017,10 @@

      set the log level for the openlineage spark library

      2023-09-13 08:31:25
      -

      *Thread Reply:* these are the worker pod logs…where there is no log of openlineageplugin

      +

      *Thread Reply:* these are the worker pod logs…where there is no log of openlineageplugin

      - + @@ -134839,7 +135052,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:31:52
      -

      *Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1694608076879469?thread_ts=1694545905.974339&cid=C01CK9T7HKR --> sure will check now on this one

      +

      *Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1694608076879469?thread_ts=1694545905.974339&cid=C01CK9T7HKR --> sure will check now on this one

      @@ -134900,7 +135113,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:38:32
      -

      *Thread Reply:* { +

*Thread Reply:* {
  "textPayload": "Traceback (most recent call last): File \"/opt/python3.8/lib/python3.8/site-packages/openlineage/airflow/utils.py\", line 427, in import_from_string module = importlib.import_module(module_path) File \"/opt/python3.8/lib/python3.8/importlib/__init__.py\", line 127, in import_module return _bootstrap._gcd_import(name[level:], package, level) File \"&lt;frozen importlib._bootstrap&gt;\", line 1014, in _gcd_import File \"&lt;frozen importlib._bootstrap&gt;\", line 991, in _find_and_load File \"&lt;frozen importlib._bootstrap&gt;\", line 961, in _find_and_load_unlocked File \"&lt;frozen importlib._bootstrap&gt;\", line 219, in _call_with_frames_removed File \"&lt;frozen importlib._bootstrap&gt;\", line 1014, in _gcd_import File \"&lt;frozen importlib._bootstrap&gt;\", line 991, in _find_and_load File \"&lt;frozen importlib._bootstrap&gt;\", line 961, in _find_and_load_unlocked File \"&lt;frozen importlib._bootstrap&gt;\", line 219, in _call_with_frames_removed File \"&lt;frozen importlib._bootstrap&gt;\", line 1014, in _gcd_import File \"&lt;frozen importlib._bootstrap&gt;\", line 991, in _find_and_load File \"&lt;frozen importlib._bootstrap&gt;\", line 973, in _find_and_load_unlocked ModuleNotFoundError: No module named 'airflow.gcs'",
  "insertId": "pt2eu6fl9z5vw",
  "resource": {

@@ -134946,18 +135159,14 @@

      set the log level for the openlineage spark library

      2023-09-13 08:44:11
      -

*Thread Reply:* this is one of the experiments I did, but then I reverted back to keeping it as dependencies.big_query_insert_job_extractor.BigQueryInsertJobExtractor…where dependencies is a module I created inside my dags folder

      +

*Thread Reply:* this is one of the experiments I did, but then I reverted back to keeping it as dependencies.big_query_insert_job_extractor.BigQueryInsertJobExtractor…where dependencies is a module I created inside my dags folder

      @@ -134985,18 +135194,14 @@

      set the log level for the openlineage spark library

      2023-09-13 08:44:33
      2023-09-13 08:45:46
      -

      *Thread Reply:* these are the logs of the triggerer pod specifically

      +

      *Thread Reply:* these are the logs of the triggerer pod specifically

      @@ -135063,7 +135264,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:46:31
      -

      *Thread Reply:* yeah it would be expected to have this in triggerer where it's not mounted, but will it behave the same for worker where it's mounted?

      +

      *Thread Reply:* yeah it would be expected to have this in triggerer where it's not mounted, but will it behave the same for worker where it's mounted?

      @@ -135089,7 +135290,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:47:09
      -

*Thread Reply:* maybe __init__.py is missing for the top-level dag path?

      +

*Thread Reply:* maybe __init__.py is missing for the top-level dag path?

      @@ -135115,18 +135316,14 @@

      set the log level for the openlineage spark library

      2023-09-13 08:49:01
      -

      *Thread Reply:* these are the logs of the worker pod at startup, where it does not complain of the plugin like in triggerer, but when tasks are run on this worker…somehow it is not picking up the extractor for the operator that i have written it for

      +

      *Thread Reply:* these are the logs of the worker pod at startup, where it does not complain of the plugin like in triggerer, but when tasks are run on this worker…somehow it is not picking up the extractor for the operator that i have written it for

      @@ -135154,7 +135351,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:49:54
      -

*Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1694609229577469?thread_ts=1694545905.974339&cid=C01CK9T7HKR --> you mean making the dags folder itself a module by adding the __init__.py?

      +

*Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1694609229577469?thread_ts=1694545905.974339&cid=C01CK9T7HKR --> you mean making the dags folder itself a module by adding the __init__.py?

      @@ -135215,7 +135412,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:55:24
      -

      *Thread Reply:* yes, I would put whole custom code directly in dags folder, to make sure import paths are the problem

      +

      *Thread Reply:* yes, I would put whole custom code directly in dags folder, to make sure import paths are the problem

      @@ -135241,7 +135438,7 @@

      set the log level for the openlineage spark library

      2023-09-13 08:55:48
      -

      *Thread Reply:* and would be nice if you could set +

      *Thread Reply:* and would be nice if you could set AIRFLOW__LOGGING__LOGGING_LEVEL="DEBUG"

      @@ -135268,15 +135465,15 @@

      set the log level for the openlineage spark library

      2023-09-13 09:14:58
      -

      *Thread Reply:* ```Starting the process, got command: triggerer +

*Thread Reply:* ```Starting the process, got command: triggerer
Initializing airflow.cfg.
airflow.cfg initialization is done.
[2023-09-13T13:11:46.620+0000] {settings.py:267} DEBUG - Setting up DB connection pool (PID 8)
[2023-09-13T13:11:46.622+0000] {settings.py:372} DEBUG - settings.prepare_engine_args(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=570, pid=8
[2023-09-13T13:11:46.742+0000] {cli_action_loggers.py:39} DEBUG - Adding <function default_action_log at 0x7ff39ca1d3a0> to pre execution callback
[2023-09-13T13:11:47.638+0000] {cli_action_loggers.py:65} DEBUG - Calling callbacks: [<function default_action_log at 0x7ff39ca1d3a0>]
[Airflow ASCII-art banner]

@@ -135387,7 +135584,7 @@

      set the log level for the openlineage spark library

      2023-09-13 09:15:10
      -

      *Thread Reply:* still the same error in the triggerer pod

      +

      *Thread Reply:* still the same error in the triggerer pod

      @@ -135413,18 +135610,14 @@

      set the log level for the openlineage spark library

      2023-09-13 09:16:23
      -

*Thread Reply:* I have changed the dags folder - added the __init__ file as you suggested - and updated OPENLINEAGE_EXTRACTORS to big_query_insert_job_extractor.BigQueryInsertJobExtractor…still the same thing

      +

*Thread Reply:* I have changed the dags folder - added the __init__ file as you suggested - and updated OPENLINEAGE_EXTRACTORS to big_query_insert_job_extractor.BigQueryInsertJobExtractor…still the same thing

      @@ -135452,7 +135645,7 @@

      set the log level for the openlineage spark library

      2023-09-13 09:36:27
      -

      *Thread Reply:* > still the same error in the triggerer pod +

*Thread Reply:* > still the same error in the triggerer pod
it won't change - we're not trying to fix the triggerer import but the worker one, and we should look only at the worker pod at this point

      @@ -135479,7 +135672,7 @@

      set the log level for the openlineage spark library

      2023-09-13 09:43:34
      -

*Thread Reply:* ```extractor for <class 'airflow.providers.google.cloud.operators.bigquery.BigQueryInsertJobOperator'> is <class 'big_query_insert_job_extractor.BigQueryInsertJobExtractor'

      +

*Thread Reply:* ```extractor for <class 'airflow.providers.google.cloud.operators.bigquery.BigQueryInsertJobOperator'> is <class 'big_query_insert_job_extractor.BigQueryInsertJobExtractor'

Using extractor BigQueryInsertJobExtractor task_type=BigQueryInsertJobOperator airflow_dag_id=data_analytics_dag task_id=join_bq_datasets.bq_join_holidays_weather_data_2021 airflow_run_id=manual__2023-09-13T13:24:08.946947+00:00

      @@ -135512,7 +135705,7 @@

      set the log level for the openlineage spark library

      2023-09-13 09:44:44
      -

      *Thread Reply:* able to see these logs in the worker pod…so what you said is right that it is able to get the extractor but i get the below error immediately where it says not a git repository

      +

      *Thread Reply:* able to see these logs in the worker pod…so what you said is right that it is able to get the extractor but i get the below error immediately where it says not a git repository

      @@ -135538,7 +135731,7 @@

      set the log level for the openlineage spark library

      2023-09-13 09:45:24
      -

*Thread Reply:* seems like we are almost there…am I missing something obvious?

      +

*Thread Reply:* seems like we are almost there…am I missing something obvious?

      @@ -135564,7 +135757,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:06:35
      -

      *Thread Reply:* > fatal: not a git repository (or any parent up to mount point /home/airflow) +

*Thread Reply:* > fatal: not a git repository (or any parent up to mount point /home/airflow)
&gt; Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
&gt; fatal: not a git repository (or any parent up to mount point /home/airflow)
&gt; Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).

@@ -135594,7 +135787,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:06:51
      -

      *Thread Reply:* that’s casual log in composer

      +

      *Thread Reply:* that’s casual log in composer

      @@ -135620,7 +135813,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:12:16
      -

      *Thread Reply:* extractor for &lt;class 'airflow.providers.google.cloud.operators.bigquery.BigQueryInsertJobOperator'&gt; is &lt;class 'big_query_insert_job_extractor.BigQueryInsertJobExtractor' +

      *Thread Reply:* extractor for &lt;class 'airflow.providers.google.cloud.operators.bigquery.BigQueryInsertJobOperator'&gt; is &lt;class 'big_query_insert_job_extractor.BigQueryInsertJobExtractor' that’s actually class from your custom module, right?

      @@ -135647,18 +135840,14 @@

      set the log level for the openlineage spark library

      2023-09-13 10:14:03
      -

*Thread Reply:* I’ve done an experiment - that’s how GCS looks

      +

*Thread Reply:* I’ve done an experiment - that’s how GCS looks

      @@ -135686,18 +135875,14 @@

      set the log level for the openlineage spark library

      2023-09-13 10:14:09
      -

      *Thread Reply:* and env vars

      +

      *Thread Reply:* and env vars

      @@ -135725,7 +135910,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:14:19
      -

      *Thread Reply:* I have this extractor detected as expected

      +

      *Thread Reply:* I have this extractor detected as expected

      @@ -135751,7 +135936,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:15:06
      -

*Thread Reply:* seen as &lt;class 'dependencies.bq.BigQueryInsertJobExtractor'&gt;

      +

*Thread Reply:* seen as &lt;class 'dependencies.bq.BigQueryInsertJobExtractor'&gt;

      @@ -135777,7 +135962,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:16:02
      -

      *Thread Reply:* no __init__.py in base dags folder

      +

      *Thread Reply:* no __init__.py in base dags folder

      @@ -135803,7 +135988,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:17:02
      -

      *Thread Reply:* I also checked that triggerer pod indeed has no gcsfuse set up, tbh no idea why, maybe some kind of optimization +

      *Thread Reply:* I also checked that triggerer pod indeed has no gcsfuse set up, tbh no idea why, maybe some kind of optimization the only effect is that when loading plugins in triggerer it throws some errors in logs, we don’t do anything at the moment there

      @@ -135830,7 +136015,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:19:26
      -

*Thread Reply:* okk…got it @Jakub Dardziński…so the __init__ at the top level of dags is not required either, got it. Just one more doubt: there is a requirement where I want to change the operator’s property in the extractor inside the extract function - will that be taken into account, and will the operator’s execute be called with the property that I populated in my extractor?

      +

*Thread Reply:* okk…got it @Jakub Dardziński…so the __init__ at the top level of dags is not required either, got it. Just one more doubt: there is a requirement where I want to change the operator’s property in the extractor inside the extract function - will that be taken into account, and will the operator’s execute be called with the property that I populated in my extractor?

      @@ -135856,7 +136041,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:21:28
      -

*Thread Reply:* for example I want to add a custom job_id to the BigQueryInsertJobOperator, so whenever someone uses the BigQueryInsertJobOperator I want to intercept that and add this job_id property to the operator…will that work?

      +

*Thread Reply:* for example I want to add a custom job_id to the BigQueryInsertJobOperator, so whenever someone uses the BigQueryInsertJobOperator I want to intercept that and add this job_id property to the operator…will that work?

      @@ -135882,7 +136067,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:24:46
      -

*Thread Reply:* I’m not sure if using OL for such a thing is the best choice. Wouldn’t it be better to subclass the operator?

      +

*Thread Reply:* I’m not sure if using OL for such a thing is the best choice. Wouldn’t it be better to subclass the operator?

      @@ -135908,7 +136093,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:25:37
      -

*Thread Reply:* but the answer is: it depends on the Airflow version; in 2.3+ I’m pretty sure the changed property stays in the execute method

      +

*Thread Reply:* but the answer is: it depends on the Airflow version; in 2.3+ I’m pretty sure the changed property stays in the execute method

      @@ -135934,7 +136119,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:27:49
      -

*Thread Reply:* yeah, ideally that is how we should have done this, but the problem is our client has around 1000+ DAGs in different Google Cloud projects, owned by multiple teams…so they are not willing to change anything in their DAGs. Thankfully they are using Airflow 2.4.3

      +

*Thread Reply:* yeah, ideally that is how we should have done this, but the problem is our client has around 1000+ DAGs in different Google Cloud projects, owned by multiple teams…so they are not willing to change anything in their DAGs. Thankfully they are using Airflow 2.4.3

      @@ -135960,7 +136145,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:31:15
      -

      *Thread Reply:* task_policy might be better tool for that: https://airflow.apache.org/docs/apache-airflow/2.6.0/administration-and-deployment/cluster-policies.html

      +

      *Thread Reply:* task_policy might be better tool for that: https://airflow.apache.org/docs/apache-airflow/2.6.0/administration-and-deployment/cluster-policies.html
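
A sketch of that approach (this goes in airflow_local_settings.py; the job_id scheme is illustrative):

```
# airflow_local_settings.py
from airflow.models.baseoperator import BaseOperator


def task_policy(task: BaseOperator) -> None:
    # Runs for every task at DAG parse time, so no DAG code changes are needed
    if task.task_type == "BigQueryInsertJobOperator":
        task.job_id = f"custom-{task.dag_id}-{task.task_id}"
```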

      @@ -135990,7 +136175,7 @@

      set the log level for the openlineage spark library

      2023-09-13 10:35:30
      -

*Thread Reply:* btw I double-checked - the execute method runs in a different process, so this would not change the task’s attribute there

      +

*Thread Reply:* btw I double-checked - the execute method runs in a different process, so this would not change the task’s attribute there

      @@ -136016,7 +136201,7 @@

      set the log level for the openlineage spark library

      2023-09-16 03:32:49
      -

      *Thread Reply:* @Jakub Dardziński any idea how can we achieve this one. ---> https://openlineage.slack.com/archives/C01CK9T7HKR/p1694849427228709

      +

      *Thread Reply:* @Jakub Dardziński any idea how can we achieve this one. ---> https://openlineage.slack.com/archives/C01CK9T7HKR/p1694849427228709

      @@ -136106,12 +136291,12 @@

      set the log level for the openlineage spark library

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-09-12 17:34:29
      -

      *Thread Reply:* I'm getting quite close with MWAA. See https://openlineage.slack.com/archives/C01CK9T7HKR/p1692743745585879.

      +

      *Thread Reply:* I'm getting quite close with MWAA. See https://openlineage.slack.com/archives/C01CK9T7HKR/p1692743745585879.

      @@ -136204,7 +136389,7 @@

      set the log level for the openlineage spark library

      2023-09-13 02:54:37
      -

      *Thread Reply:* The Spark integration requires that two parameters are passed to it, namely:

      +

      *Thread Reply:* The Spark integration requires that two parameters are passed to it, namely:

spark.openlineage.parentJobName
spark.openlineage.parentRunId

@@ -136236,7 +136421,7 @@

      set the log level for the openlineage spark library

      2023-09-13 02:55:51
      -

      *Thread Reply:* Thanks, will check this out

      +

      *Thread Reply:* Thanks, will check this out
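
For reference, a minimal sketch of passing those two parameters to the Spark job (values illustrative; in Airflow they would come from the orchestrating task's run):

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("child-job")
    # Tie this job's lineage to the orchestrator's run (values illustrative)
    .config("spark.openlineage.parentJobName", "my_dag.my_task")
    .config("spark.openlineage.parentRunId", "b1796d4d-9a63-4dfe-9d1c-8b0a4f7d3c1e")
    .getOrCreate()
)
```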

      @@ -136262,7 +136447,7 @@

      set the log level for the openlineage spark library

      2023-09-13 02:57:43
      -

      *Thread Reply:* As for double accounting of events - that's a bit harder to diagnose.

      +

      *Thread Reply:* As for double accounting of events - that's a bit harder to diagnose.

      @@ -136288,7 +136473,7 @@

      set the log level for the openlineage spark library

      2023-09-13 04:33:03
      -

*Thread Reply:* Can you share the job and events? +

*Thread Reply:* Can you share the job and events?
Also @Paweł Leszczyński

      @@ -136315,7 +136500,7 @@

      set the log level for the openlineage spark library

      2023-09-13 06:03:49
      -

      *Thread Reply:* Sure, sharing Job and events.

      +

      *Thread Reply:* Sure, sharing Job and events.

      @@ -136327,7 +136512,7 @@

      set the log level for the openlineage spark library

      - + @@ -136359,10 +136544,10 @@

      set the log level for the openlineage spark library

      2023-09-13 06:06:21
      -

      *Thread Reply:*

      +

      *Thread Reply:*

      - + @@ -136394,7 +136579,7 @@

      set the log level for the openlineage spark library

      2023-09-13 06:39:02
      -

      *Thread Reply:* Hi @Suraj Gupta,

      +

      *Thread Reply:* Hi @Suraj Gupta,

      Thanks for providing such a detailed description of the problem.

      @@ -136430,7 +136615,7 @@

      set the log level for the openlineage spark library

      2023-09-13 07:28:03
      -

      *Thread Reply:* Thanks @Paweł Leszczyński, will confirm about writing to file and get back.

      +

      *Thread Reply:* Thanks @Paweł Leszczyński, will confirm about writing to file and get back.

      @@ -136456,7 +136641,7 @@

      set the log level for the openlineage spark library

      2023-09-13 07:33:35
      -

      *Thread Reply:* And yes, the issue is reproducible. Will raise an issue for this.

      +

      *Thread Reply:* And yes, the issue is reproducible. Will raise an issue for this.

      @@ -136482,7 +136667,7 @@

      set the log level for the openlineage spark library

      2023-09-13 07:33:54
      -

      *Thread Reply:* even if you write onto a file?

      +

      *Thread Reply:* even if you write onto a file?

      @@ -136508,7 +136693,7 @@

      set the log level for the openlineage spark library

      2023-09-13 07:37:21
      -

      *Thread Reply:* Yes, even when I write to a parquet file.

      +

      *Thread Reply:* Yes, even when I write to a parquet file.

      @@ -136534,7 +136719,7 @@

      set the log level for the openlineage spark library

      2023-09-13 07:49:28
      -

      *Thread Reply:* ok. i think i was able to reproduce it locally with https://github.com/OpenLineage/OpenLineage/pull/2103/files

      +

      *Thread Reply:* ok. i think i was able to reproduce it locally with https://github.com/OpenLineage/OpenLineage/pull/2103/files

      @@ -136560,7 +136745,7 @@

      set the log level for the openlineage spark library

      2023-09-13 07:56:11
      -

      *Thread Reply:* Opened an issue: https://github.com/OpenLineage/OpenLineage/issues/2104

      +

      *Thread Reply:* Opened an issue: https://github.com/OpenLineage/OpenLineage/issues/2104

      @@ -136586,7 +136771,7 @@

      set the log level for the openlineage spark library

      2023-09-25 16:32:09
      -

      *Thread Reply:* @Paweł Leszczyński I see that the PR is work in progress. Any rough estimate on when we can expect this fix to be released?

      +

      *Thread Reply:* @Paweł Leszczyński I see that the PR is work in progress. Any rough estimate on when we can expect this fix to be released?

      @@ -136612,7 +136797,7 @@

      set the log level for the openlineage spark library

      2023-09-26 03:32:03
      -

*Thread Reply:* @Suraj Gupta I put a comment within your issue. It's a bug we need to solve, but I cannot give any estimates today.

      +

      *Thread Reply:* @Suraj Gupta put a comment within your issue. it's a bug we need to solve but I cannot bring any estimates today.

      @@ -136638,7 +136823,7 @@

      set the log level for the openlineage spark library

      2023-09-26 04:33:03
      -

*Thread Reply:* Thanks for the update @Paweł Leszczyński. Also, please look into this comment. It might be related and I'm not sure if it's expected behaviour.

+

*Thread Reply:* Thanks for the update @Paweł Leszczyński. Also, please look into this comment. It might be related and I'm not sure if it's expected behaviour.

      @@ -136907,7 +137092,7 @@

      set the log level for the openlineage spark library

      2023-09-14 07:32:25
      -

*Thread Reply:* Seems like an issue on our side. Do you know how the source is read? What LogicalPlan leaf is used to read src? Would love to find out how this is done differently

+

*Thread Reply:* Seems like an issue on our side. Do you know how the source is read? What LogicalPlan leaf is used to read src? Would love to find out how this is done differently

      @@ -136933,7 +137118,7 @@

      set the log level for the openlineage spark library

      2023-09-14 09:16:58
      -

*Thread Reply:* Hmm, I'll have to do an explain plan to see what exactly it is.

+

*Thread Reply:* Hmm, I'll have to do an explain plan to see what exactly it is.

However, my sample job uses spark.sql("SELECT * FROM dhawes.ol_test_hadoop_src")

      @@ -136965,7 +137150,7 @@

      set the log level for the openlineage spark library

      2023-09-14 09:23:59
      -

*Thread Reply:* &gt;&gt;&gt; spark.sql("SELECT * FROM dhawes.ol_test_hadoop_src").explain(True) +

*Thread Reply:* &gt;&gt;&gt; spark.sql("SELECT * FROM dhawes.ol_test_hadoop_src").explain(True)
== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation `dhawes`.`ol_test_hadoop_src`

      @@ -136996,7 +137181,7 @@

      set the log level for the openlineage spark library

      - +
      @@ -137029,7 +137214,7 @@

      set the log level for the openlineage spark library

      - +
      @@ -137039,7 +137224,7 @@

      set the log level for the openlineage spark library

      2023-09-14 10:05:19
      -

*Thread Reply:* Especially the first PR is affecting users of the astronomer-cosmos library: https://github.com/astronomer/astronomer-cosmos/issues/533

+

*Thread Reply:* Especially the first PR is affecting users of the astronomer-cosmos library: https://github.com/astronomer/astronomer-cosmos/issues/533

      @@ -137065,7 +137250,7 @@

      set the log level for the openlineage spark library

      2023-09-14 10:05:24
      -

*Thread Reply:* Thanks @tati for requesting your first OpenLineage release! Three +1s from committers will authorize it

+

*Thread Reply:* Thanks @tati for requesting your first OpenLineage release! Three +1s from committers will authorize it

      @@ -137095,7 +137280,7 @@

      set the log level for the openlineage spark library

      2023-09-14 11:59:55
      -

      *Thread Reply:* The release is authorized and will be initiated within two business days.

      +

      *Thread Reply:* The release is authorized and will be initiated within two business days.

      @@ -137115,7 +137300,7 @@

      set the log level for the openlineage spark library

      - +
      @@ -137125,7 +137310,7 @@

      set the log level for the openlineage spark library

      2023-09-15 04:40:12
      -

      *Thread Reply:* Thanks a lot, @Michael Robinson!

      +

      *Thread Reply:* Thanks a lot, @Michael Robinson!

      @@ -137184,7 +137369,7 @@

      set the log level for the openlineage spark library

      2023-10-03 20:33:35
      -

      *Thread Reply:* I have cleaned up the registry proposal. +

*Thread Reply:* I have cleaned up the registry proposal.
https://docs.google.com/document/d/1zIxKST59q3I6ws896M4GkUn7IsueLw8ejct5E-TR0vY/edit
In particular:
• I clarified that option 2 is preferred at this point.

@@ -137216,7 +137401,7 @@

      set the log level for the openlineage spark library

      2023-10-05 17:34:12
      -

      *Thread Reply:* I have created a ticket to make this easier to find. Once I get more feedback I’ll turn it into a md file in the repo: https://docs.google.com/document/d/1zIxKST59q3I6ws896M4GkUn7IsueLw8ejct5E-TR0vY/edit#heading=h.enpbmvu7n8gu +

      *Thread Reply:* I have created a ticket to make this easier to find. Once I get more feedback I’ll turn it into a md file in the repo: https://docs.google.com/document/d/1zIxKST59q3I6ws896M4GkUn7IsueLw8ejct5E-TR0vY/edit#heading=h.enpbmvu7n8gu https://github.com/OpenLineage/OpenLineage/issues/2161

      @@ -137394,7 +137579,7 @@

      set the log level for the openlineage spark library

      2023-09-16 04:22:47
      -

*Thread Reply:* you don't need an actual task instance to do that. You only need to set the additional argument as a Jinja template, same as above

+

*Thread Reply:* you don't need an actual task instance to do that. You only need to set the additional argument as a Jinja template, same as above

      @@ -137420,7 +137605,7 @@

      set the log level for the openlineage spark library

      2023-09-16 04:25:28
      -

*Thread Reply:* task_instance in this case is just part of a string which is evaluated when the Jinja render happens

+

*Thread Reply:* task_instance in this case is just part of a string which is evaluated when the Jinja render happens

      @@ -137446,7 +137631,7 @@

      set the log level for the openlineage spark library

      2023-09-16 04:27:10
      -

*Thread Reply:* ohh…then we could use the same example as above inside the task_policy to intercept the Operator and add the openlineage-specific additional properties?

+

*Thread Reply:* ohh…then we could use the same example as above inside the task_policy to intercept the Operator and add the openlineage-specific additional properties?

      @@ -137472,7 +137657,7 @@

      set the log level for the openlineage spark library

      2023-09-16 04:30:59
      -

*Thread Reply:* correct, just remember not to override all properties, just add the OL-specific ones

+

*Thread Reply:* correct, just remember not to override all properties, just add the OL-specific ones

      @@ -137498,7 +137683,7 @@

      set the log level for the openlineage spark library

      2023-09-16 04:32:02
      -

      *Thread Reply:* yeah sure…thank you so much @Jakub Dardziński, will try this out and keep you posted

      +

      *Thread Reply:* yeah sure…thank you so much @Jakub Dardziński, will try this out and keep you posted

      @@ -137528,7 +137713,7 @@

      set the log level for the openlineage spark library

      2023-09-16 05:00:24
      -

      *Thread Reply:* We want to automate setting those options at some point inside the operator itself

      +

      *Thread Reply:* We want to automate setting those options at some point inside the operator itself
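
A minimal sketch of what such a task_policy could look like, following the Jinja-template idea above (the SparkSubmitOperator check, the conf attribute handling, and the property name are illustrative assumptions, not the provider's API):

# airflow_local_settings.py
from airflow.models import BaseOperator

def task_policy(task: BaseOperator) -> None:
    # Only touch Spark-submit style operators; leave everything else alone.
    if task.task_type == "SparkSubmitOperator":
        conf = dict(getattr(task, "conf", None) or {})  # attribute name is an assumption
        # Rendered at runtime; task_instance is just part of the template string.
        conf.setdefault(
            "spark.openlineage.parentJobName",
            "{{ task_instance.dag_id }}.{{ task_instance.task_id }}",
        )
        task.conf = conf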

      @@ -137584,7 +137769,7 @@

      set the log level for the openlineage spark library

      2023-09-19 16:40:55
      -

      *Thread Reply:* there’s no out-of-the-box possibility to do that yet, you’re very welcome to create an issue in GitHub and maybe contribute as well! 🙂

      +

      *Thread Reply:* there’s no out-of-the-box possibility to do that yet, you’re very welcome to create an issue in GitHub and maybe contribute as well! 🙂

      @@ -137605,7 +137790,7 @@

      set the log level for the openlineage spark library

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-09-17 09:07:41
      @@ -137668,7 +137853,7 @@

      set the log level for the openlineage spark library

      2023-09-18 16:25:11
      -

*Thread Reply:* That’s correct. For now there’s no way to configure the endpoint via env var. You can do that by using a config file

+

*Thread Reply:* That’s correct. For now there’s no way to configure the endpoint via env var. You can do that by using a config file
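
For reference, a minimal sketch of the same endpoint override done programmatically with the Python client (class names from openlineage.client.transport.http; the URL and endpoint values are placeholders):

from openlineage.client import OpenLineageClient
from openlineage.client.transport.http import HttpConfig, HttpTransport

# "endpoint" is the path events are POSTed to; this is the value that
# could not be overridden via env var at the time of this thread.
config = HttpConfig.from_dict({
    "type": "http",
    "url": "http://localhost:5000",
    "endpoint": "api/v1/lineage",
})
client = OpenLineageClient(transport=HttpTransport(config))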

      @@ -137689,12 +137874,12 @@

      set the log level for the openlineage spark library

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-09-18 16:30:39
      -

      *Thread Reply:* How do you do that in Airflow? Any particular reason for excluding endpoint override via env var? Happy to create a PR to fix that.

      +

      *Thread Reply:* How do you do that in Airflow? Any particular reason for excluding endpoint override via env var? Happy to create a PR to fix that.

      @@ -137720,7 +137905,7 @@

      set the log level for the openlineage spark library

      2023-09-18 16:52:48
      -

      *Thread Reply:* historical I guess? go for the PR, of course 🚀

      +

      *Thread Reply:* historical I guess? go for the PR, of course 🚀

      @@ -137741,12 +137926,12 @@

      set the log level for the openlineage spark library

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-10-03 08:52:16
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2151

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2151

      @@ -137858,7 +138043,7 @@

      set the log level for the openlineage spark library

      2023-09-18 16:26:07
      -

*Thread Reply:* there’s no actual single source of what integrations are currently implemented in the openlineage Airflow provider. That’s something we should work on so it’s more visible

+

*Thread Reply:* there’s no actual single source of what integrations are currently implemented in the openlineage Airflow provider. That’s something we should work on so it’s more visible

      @@ -137884,7 +138069,7 @@

      set the log level for the openlineage spark library

      2023-09-18 16:26:46
      -

*Thread Reply:* answering this quickly - GE & MS SQL are not implemented yet in the provider

+

*Thread Reply:* answering this quickly - GE & MS SQL are not implemented yet in the provider

      @@ -137910,7 +138095,7 @@

      set the log level for the openlineage spark library

      2023-09-18 16:26:58
      -

      *Thread Reply:* but I also invite you to contribute if you’re interested! 🙂

      +

      *Thread Reply:* but I also invite you to contribute if you’re interested! 🙂

      @@ -137963,7 +138148,7 @@

      set the log level for the openlineage spark library

      2023-09-19 16:40:06
      -

      *Thread Reply:* If you're using Airflow 2.7, take a look at https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html

      +

      *Thread Reply:* If you're using Airflow 2.7, take a look at https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html
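
A minimal sketch of wiring the 2.7 provider up via environment variables, per that guide (the config keys follow the provider docs; the namespace and URL values are placeholders):

import os

# JSON transport config consumed by apache-airflow-providers-openlineage.
os.environ["AIRFLOW__OPENLINEAGE__NAMESPACE"] = "my_namespace"
os.environ["AIRFLOW__OPENLINEAGE__TRANSPORT"] = (
    '{"type": "http", "url": "http://localhost:5000", "endpoint": "api/v1/lineage"}'
)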

      @@ -137993,7 +138178,7 @@

      set the log level for the openlineage spark library

      2023-09-19 16:40:54
      -

      *Thread Reply:* If you use one of the lower versions, take a look here https://openlineage.io/docs/integrations/airflow/usage

      +

      *Thread Reply:* If you use one of the lower versions, take a look here https://openlineage.io/docs/integrations/airflow/usage

      openlineage.io
      @@ -138040,7 +138225,7 @@

      set the log level for the openlineage spark library

      2023-09-20 06:26:56
      -

*Thread Reply:* Maciej, +

*Thread Reply:* Maciej,
Thanks for sharing the link https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html; this should address the issue

      @@ -138073,7 +138258,7 @@

      set the log level for the openlineage spark library

      - 🎉 Jakub Dardziński, Mars Lan, Ross Turk, Guntaka Jeevan Paul, Peter Hicks, Maciej Obuchowski, Athitya Kumar, John Lukenoff, Harel Shein, Francis McGregor-Macdonald, Laurent Paris + 🎉 Jakub Dardziński, Mars Lan (Metaphor), Ross Turk, Guntaka Jeevan Paul, Peter Hicks, Maciej Obuchowski, Athitya Kumar, John Lukenoff, Harel Shein, Francis McGregor-Macdonald, Laurent Paris
      @@ -138161,7 +138346,7 @@

      set the log level for the openlineage spark library

      2023-09-22 21:05:20
      -

*Thread Reply:* Hi @Michael Robinson Thank you! I love the job that you’ve done. If you have a few seconds, please hint at how I can push lineage gathered from Airflow and Spark jobs into DataHub for visualization? I didn’t find any solutions or official support either at OpenLineage or at DataHub, but I still want to continue using OpenLineage

+

*Thread Reply:* Hi @Michael Robinson Thank you! I love the job that you’ve done. If you have a few seconds, please hint at how I can push lineage gathered from Airflow and Spark jobs into DataHub for visualization? I didn’t find any solutions or official support either at OpenLineage or at DataHub, but I still want to continue using OpenLineage

      @@ -138187,7 +138372,7 @@

      set the log level for the openlineage spark library

      2023-09-22 21:30:22
      -

      *Thread Reply:* Hi Yevhenii, thank you for using OpenLineage. The DataHub integration is new to us, but perhaps the experts on Spark and Airflow know more. @Paweł Leszczyński @Maciej Obuchowski @Jakub Dardziński

      +

      *Thread Reply:* Hi Yevhenii, thank you for using OpenLineage. The DataHub integration is new to us, but perhaps the experts on Spark and Airflow know more. @Paweł Leszczyński @Maciej Obuchowski @Jakub Dardziński

      @@ -138213,7 +138398,7 @@

      set the log level for the openlineage spark library

      2023-09-23 08:11:17
      -

*Thread Reply:* @Yevhenii Soboliev at Airflow Summit, Shirshanka Das from DataHub mentioned this as an upcoming feature.

+

*Thread Reply:* @Yevhenii Soboliev at Airflow Summit, Shirshanka Das from DataHub mentioned this as an upcoming feature.

      @@ -138275,7 +138460,7 @@

      set the log level for the openlineage spark library

      2023-09-21 02:15:00
      -

*Thread Reply:* Also, do we have any docs on how OL works with the latest airflow version? A few questions: +

*Thread Reply:* Also, do we have any docs on how OL works with the latest airflow version? A few questions:
• How is it replacing the concept of custom extractors and Manually Annotated Lineage in the latest version?
• Do we have any examples of setting up the integration to emit input/output datasets for non-supported Operators like PythonOperator?

      @@ -138303,7 +138488,7 @@

      set the log level for the openlineage spark library

      2023-09-27 10:04:09
      -

*Thread Reply:* > Question: Now if we upgrade our Airflow version to 2.7 in the future, would our code be backward compatible? +

*Thread Reply:* > Question: Now if we upgrade our Airflow version to 2.7 in the future, would our code be backward compatible?
It will be compatible; “default extractors” is generally the same concept as we’re using in the 2.7 integration. One thing that might be good to update is import paths, from openlineage.airflow to airflow.providers.openlineage, but it should work both ways
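
A sketch of that import-path change for a custom extractor, assuming the class name stays the same across both packages:

try:
    # Airflow 2.7+: extractors live in the provider package.
    from airflow.providers.openlineage.extractors import BaseExtractor
except ImportError:
    # Pre-2.7: the standalone openlineage-airflow package.
    from openlineage.airflow.extractors.base import BaseExtractor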

      @@ -138492,7 +138677,7 @@

      set the log level for the openlineage spark library

      I am attaching the log4j, there is no openlineagecontext

      - + @@ -138528,7 +138713,7 @@

      set the log level for the openlineage spark library

      2023-09-21 23:47:22
      -

      *Thread Reply:* this issue is resolved, solution can be found here: https://openlineage.slack.com/archives/C01CK9T7HKR/p1691592987038929

      +

      *Thread Reply:* this issue is resolved, solution can be found here: https://openlineage.slack.com/archives/C01CK9T7HKR/p1691592987038929

      @@ -138599,7 +138784,7 @@

      set the log level for the openlineage spark library

      2023-09-25 08:59:10
      -

      *Thread Reply:* We were all out at Airflow Summit last week, so apologies for the delayed response. Glad you were able to resolve the issue!

      +

      *Thread Reply:* We were all out at Airflow Summit last week, so apologies for the delayed response. Glad you were able to resolve the issue!

      @@ -138655,7 +138840,7 @@

      set the log level for the openlineage spark library

      2023-09-25 08:56:53
      -

      *Thread Reply:* Hey @Sangeeta Mishra, I’m not sure that I fully understand your question here. What do you mean by OpenLineage authentication? +

      *Thread Reply:* Hey @Sangeeta Mishra, I’m not sure that I fully understand your question here. What do you mean by OpenLineage authentication? What are you using to generate OL events? What’s your OL receiving backend?

      @@ -138682,7 +138867,7 @@

      set the log level for the openlineage spark library

      2023-09-25 09:04:33
      -

      *Thread Reply:* Hey @Harel Shein, +

*Thread Reply:* Hey @Harel Shein, I wanted to clarify the previous message. I apologize for any confusion. When I mentioned "OpenLineage authentication," I was actually referring to the authentication process for the OpenLineage backend, specifically using HTTP transport. This involves using my custom token provider, which utilizes access keys and secrets for authentication. The OL backend is an HTTP-based backend. I hope this clears things up!
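
A minimal sketch of such a caching token provider for the Python client's HTTP transport (TokenProvider and get_bearer come from openlineage.client.transport.http; the option names and the key-exchange call are hypothetical):

from openlineage.client.transport.http import TokenProvider

class CachingKeyPairTokenProvider(TokenProvider):
    def __init__(self, options: dict) -> None:
        super().__init__(options)
        self._key = options["api_key"]        # hypothetical option names
        self._secret = options["api_secret"]
        self._token = None                    # cached until expiry (refresh elided)

    def get_bearer(self) -> str:
        if self._token is None:
            self._token = self._exchange_keys_for_token()
        return f"Bearer {self._token}"

    def _exchange_keys_for_token(self) -> str:
        ...  # hypothetical call to the auth service using key + secret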

      @@ -138709,7 +138894,7 @@

      set the log level for the openlineage spark library

      2023-09-25 09:05:12
      -

      *Thread Reply:* Are you using Marquez?

      +

      *Thread Reply:* Are you using Marquez?

      @@ -138735,7 +138920,7 @@

      set the log level for the openlineage spark library

      2023-09-25 09:05:55
      -

      *Thread Reply:* We are trying to leverage our own backend here.

      +

      *Thread Reply:* We are trying to leverage our own backend here.

      @@ -138761,7 +138946,7 @@

      set the log level for the openlineage spark library

      2023-09-25 09:07:03
      -

      *Thread Reply:* I see.. I’m not sure the OpenLineage community could help here. Which webserver framework are you using?

      +

      *Thread Reply:* I see.. I’m not sure the OpenLineage community could help here. Which webserver framework are you using?

      @@ -138787,7 +138972,7 @@

      set the log level for the openlineage spark library

      2023-09-25 09:08:56
      -

      *Thread Reply:* KTOR framework

      +

      *Thread Reply:* KTOR framework

      @@ -138813,7 +138998,7 @@

      set the log level for the openlineage spark library

      2023-09-25 09:15:33
      -

*Thread Reply:* Our backend authentication operates based on either a pair of keys or a single bearer token, with a limited expiry time. Hence, we wanted to cache this information inside the token provider.

+

*Thread Reply:* Our backend authentication operates based on either a pair of keys or a single bearer token, with a limited expiry time. Hence, we wanted to cache this information inside the token provider.

      @@ -138839,7 +139024,7 @@

      set the log level for the openlineage spark library

      2023-09-25 09:26:57
      -

      *Thread Reply:* I see, I would ask this question here https://ktor.io/support/

      +

      *Thread Reply:* I see, I would ask this question here https://ktor.io/support/

      Ktor Framework
      @@ -138890,7 +139075,7 @@

      set the log level for the openlineage spark library

      2023-09-25 10:12:52
      -

      *Thread Reply:* Thank you

      +

      *Thread Reply:* Thank you

      @@ -138916,7 +139101,7 @@

      set the log level for the openlineage spark library

      2023-09-26 04:13:20
      -

      *Thread Reply:* @Sangeeta Mishra which openlineage client are you using: java or python?

      +

      *Thread Reply:* @Sangeeta Mishra which openlineage client are you using: java or python?

      @@ -138942,7 +139127,7 @@

      set the log level for the openlineage spark library

      2023-09-26 04:19:53
      -

      *Thread Reply:* @Paweł Leszczyński I am using python client

      +

      *Thread Reply:* @Paweł Leszczyński I am using python client

      @@ -139006,7 +139191,7 @@

      set the log level for the openlineage spark library

      2023-09-27 09:56:00
      -

      *Thread Reply:* I am not sure it's stated in the doc. Here's the list of spark facets schemas: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/shared/facets/spark/v1

      +

      *Thread Reply:* I am not sure it's stated in the doc. Here's the list of spark facets schemas: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark/shared/facets/spark/v1

      @@ -139062,7 +139247,7 @@

      set the log level for the openlineage spark library

      2023-09-27 09:47:39
      -

*Thread Reply:* For spark we do send start and complete for each spark action being run (a single operation that causes spark processing to be run). However, it is difficult for us to know if we're dealing with the last action within a spark job or a spark script.

+

*Thread Reply:* For spark we do send start and complete for each spark action being run (a single operation that causes spark processing to be run). However, it is difficult for us to know if we're dealing with the last action within a spark job or a spark script.

      @@ -139088,7 +139273,7 @@

      set the log level for the openlineage spark library

      2023-09-27 09:49:35
      -

*Thread Reply:* I think we need to look deeper into that as there is a recurring need to capture such information

+

*Thread Reply:* I think we need to look deeper into that as there is a recurring need to capture such information

      @@ -139114,7 +139299,7 @@

      set the log level for the openlineage spark library

      2023-09-27 09:49:57
      -

*Thread Reply:* and the spark listener has methods like onApplicationStart and onApplicationEnd

+

*Thread Reply:* and the spark listener has methods like onApplicationStart and onApplicationEnd

      @@ -139140,7 +139325,7 @@

      set the log level for the openlineage spark library

      2023-09-27 09:50:13
      -

*Thread Reply:* We are using the SparkListener, which has a function called onApplicationStart which gets called whenever a spark application starts, so I was thinking why can't we send one at start and similarly at end as well

+

*Thread Reply:* We are using the SparkListener, which has a function called onApplicationStart which gets called whenever a spark application starts, so I was thinking why can't we send one at start and similarly at end as well

      @@ -139166,7 +139351,7 @@

      set the log level for the openlineage spark library

      2023-09-27 09:50:33
      -

      *Thread Reply:* additionally, we would like to have a concept of a parent run for a spark job which aggregates all actions run within a single spark job context

      +

      *Thread Reply:* additionally, we would like to have a concept of a parent run for a spark job which aggregates all actions run within a single spark job context
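
For illustration, this is roughly what such a parent-run link looks like when emitted with the Python client (facet and class names from openlineage.client; the namespaces, job names, and producer are placeholders):

import uuid
from datetime import datetime, timezone
from openlineage.client.facet import ParentRunFacet
from openlineage.client.run import Job, Run, RunEvent, RunState

app_run_id = str(uuid.uuid4())  # one run id shared by the whole spark application
parent = ParentRunFacet.create(runId=app_run_id, namespace="spark", name="my_app")

action_event = RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid.uuid4()), facets={"parent": parent}),
    job=Job(namespace="spark", name="my_app.action_1"),  # one action within the app
    producer="example-producer",
    inputs=[],
    outputs=[],
)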

      @@ -139192,7 +139377,7 @@

      set the log level for the openlineage spark library

      2023-09-27 09:51:11
      -

*Thread Reply:* yeah exactly, the way that it works with the airflow integration

+

*Thread Reply:* yeah exactly, the way that it works with the airflow integration

      @@ -139218,7 +139403,7 @@

      set the log level for the openlineage spark library

      2023-09-27 09:51:26
      -

      *Thread Reply:* we do have an issue for that https://github.com/OpenLineage/OpenLineage/issues/2105

      +

      *Thread Reply:* we do have an issue for that https://github.com/OpenLineage/OpenLineage/issues/2105

      @@ -139287,7 +139472,7 @@

      set the log level for the openlineage spark library

      2023-09-27 09:52:08
      -

*Thread Reply:* what you can do is: come to our monthly OpenLineage open meetings, raise that issue and convince the community about its importance

+

*Thread Reply:* what you can do is: come to our monthly OpenLineage open meetings, raise that issue and convince the community about its importance

      @@ -139313,7 +139498,7 @@

      set the log level for the openlineage spark library

      2023-09-27 09:53:32
      -

*Thread Reply:* yeah sure, would love to do that…how can I join them? Will that be posted here in this slack channel?

+

*Thread Reply:* yeah sure, would love to do that…how can I join them? Will that be posted here in this slack channel?

      @@ -139339,7 +139524,7 @@

      set the log level for the openlineage spark library

      2023-09-27 09:54:08
      -

      *Thread Reply:* Hi, you can see the schedule and RSVP here: https://openlineage.io/community

      +

      *Thread Reply:* Hi, you can see the schedule and RSVP here: https://openlineage.io/community

      openlineage.io
      @@ -139487,74 +139672,54 @@

      set the log level for the openlineage spark library

      - 🙌 Mars Lan, Harel Shein, Paweł Leszczyński + 🙌 Mars Lan (Metaphor), Harel Shein, Paweł Leszczyński
      @@ -139595,44 +139760,32 @@

      set the log level for the openlineage spark library

      2023-09-27 11:55:47
      -

      *Thread Reply:* A few more pics:

      +

      *Thread Reply:* A few more pics:

      @@ -139878,7 +140031,7 @@

      set the log level for the openlineage spark library

      2023-09-28 02:42:18
      -

      *Thread Reply:* responded within the issue

      +

      *Thread Reply:* responded within the issue

      @@ -139946,7 +140099,7 @@

      set the log level for the openlineage spark library

      2023-09-28 03:04:48
      -

*Thread Reply:* Hi @Erik Alfthan, the channel is totally OK. I am not an airflow integration expert, but from what it looks like to me, you're missing the openlineage-sql library, which is a rust library used to extract lineage from sql queries. This is how we do that in circle ci: +

*Thread Reply:* Hi @Erik Alfthan, the channel is totally OK. I am not an airflow integration expert, but from what it looks like to me, you're missing the openlineage-sql library, which is a rust library used to extract lineage from sql queries. This is how we do that in circle ci:
https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/8080/workflows/aba53369-836c-48f5-a2dd-51bc0740a31c/jobs/140113

and the subproject page with build instructions: https://github.com/OpenLineage/OpenLineage/tree/main/integration/sql

      @@ -139975,7 +140128,7 @@

      set the log level for the openlineage spark library

      2023-09-28 03:07:23
      -

      *Thread Reply:* Ok, so I go and "manually" build the internal dependency so that it becomes available in the pip cache?

      +

      *Thread Reply:* Ok, so I go and "manually" build the internal dependency so that it becomes available in the pip cache?

      I was hoping for something more automagical, but that should work

      @@ -140003,7 +140156,7 @@

      set the log level for the openlineage spark library

      2023-09-28 03:08:06
      -

      *Thread Reply:* I think so. @Jakub Dardziński am I right?

      +

      *Thread Reply:* I think so. @Jakub Dardziński am I right?

      @@ -140029,7 +140182,7 @@

      set the log level for the openlineage spark library

      2023-09-28 03:18:27
      -

      *Thread Reply:* https://openlineage.io/docs/development/developing/python/setup +

*Thread Reply:* https://openlineage.io/docs/development/developing/python/setup
there’s a guide on how to set up the dev environment

> Typically, you first need to build openlineage-sql locally (see README). After each release you have to repeat this step in order to bump the local version of the package.

@@ -140059,7 +140212,7 @@

      set the log level for the openlineage spark library

      2023-09-28 03:27:20
      -

*Thread Reply:* It didn't find the wheel in the cache, but if I used the line in the sql/README.md +

*Thread Reply:* It didn't find the wheel in the cache, but if I used the line in the sql/README.md
pip install openlineage-sql --no-index --find-links ../target/wheels --force-reinstall
It is installed and thus skipped/passed when pip later checks if it needs to be installed.

      @@ -140093,7 +140246,7 @@

      set the log level for the openlineage spark library

      2023-09-28 03:31:52
      -

*Thread Reply:* > It didn't find the wheel in the cache, but if I used the line in the sql/README.md +

*Thread Reply:* > It didn't find the wheel in the cache, but if I used the line in the sql/README.md
> pip install openlineage-sql --no-index --find-links ../target/wheels --force-reinstall
> It is installed and thus skipped/passed when pip later checks if it needs to be installed.
That's actually expected. You should build a new wheel locally and then install it.

      @@ -140133,7 +140286,7 @@

      set the log level for the openlineage spark library

      2023-09-28 03:32:04
      -

      *Thread Reply:* I just realized that I should probably skip setting up my wsl and just run the tests in the docker setup you prepared

      +

      *Thread Reply:* I just realized that I should probably skip setting up my wsl and just run the tests in the docker setup you prepared

      @@ -140159,7 +140312,7 @@

      set the log level for the openlineage spark library

      2023-09-28 03:35:46
      -

*Thread Reply:* You could do that as well, but if you want to test your changes against many Airflow versions, that wouldn't be possible I think (run them with tox, btw)

+

*Thread Reply:* You could do that as well, but if you want to test your changes against many Airflow versions, that wouldn't be possible I think (run them with tox, btw)

      @@ -140185,7 +140338,7 @@

      set the log level for the openlineage spark library

      2023-09-28 04:54:39
      -

      *Thread Reply:* This is starting to feel like a rabbit hole 😞

      +

      *Thread Reply:* This is starting to feel like a rabbit hole 😞

When I run tox, I get a lot of build errors
• client needs to be built

@@ -140216,7 +140369,7 @@

      set the log level for the openlineage spark library

      2023-09-28 05:19:34
      -

      *Thread Reply:* would you like to share some exact log lines? I’ve never seen such errors, they probably are system specific

      +

      *Thread Reply:* would you like to share some exact log lines? I’ve never seen such errors, they probably are system specific

      @@ -140242,7 +140395,7 @@

      set the log level for the openlineage spark library

      2023-09-28 06:45:48
      -

      *Thread Reply:* Getting requirements to build wheel did not run successfully. +

*Thread Reply:* Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─&gt; [62 lines of output]
/tmp/pip-build-env-q1pay0xo/overlay/lib/python3.10/site-packages/setuptools/config/setupcfg.py:293: _DeprecatedConfig: Deprecated config in `setup.cfg`

@@ -140338,7 +140491,7 @@

      set the log level for the openlineage spark library

      2023-09-28 06:53:54
      -

      *Thread Reply:* Then, for the actual error in my PR: Evidently you are not using isort, so what linter/fixer should I use for imports?

      +

      *Thread Reply:* Then, for the actual error in my PR: Evidently you are not using isort, so what linter/fixer should I use for imports?

      @@ -140364,7 +140517,7 @@

      set the log level for the openlineage spark library

      2023-09-28 06:58:15
      -

      *Thread Reply:* for the error - I think there’s a mistake in the docs. Could you please run maturin build --out target/wheels as a temp solution?

      +

      *Thread Reply:* for the error - I think there’s a mistake in the docs. Could you please run maturin build --out target/wheels as a temp solution?

      @@ -140394,7 +140547,7 @@

      set the log level for the openlineage spark library

      2023-09-28 06:58:57
      -

*Thread Reply:* we’re using ruff; tox runs it as one of the commands

+

*Thread Reply:* we’re using ruff; tox runs it as one of the commands

      @@ -140420,7 +140573,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:00:37
      -

      *Thread Reply:* Not in the airflow folder? +

*Thread Reply:* Not in the airflow folder?
OpenLineage/integration/airflow$ maturin build --out target/wheels
💥 maturin failed
Caused by: pyproject.toml at /home/obr_erikal/projects/OpenLineage/integration/airflow/pyproject.toml is invalid

@@ -140454,7 +140607,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:02:32
      -

      *Thread Reply:* I meant change here https://github.com/OpenLineage/OpenLineage/blob/main/integration/sql/README.md

      +

      *Thread Reply:* I meant change here https://github.com/OpenLineage/OpenLineage/blob/main/integration/sql/README.md

so cd iface-py

@@ -140496,7 +140649,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:05:12
      -

*Thread Reply:* yes, that part I actually worked out myself, but I fail to understand the cause of the cython_sources error. I have python3-dev installed on WSL Ubuntu with python version 3.10.12 in a virtualenv. Anything in that setup that could cause issues?

+

*Thread Reply:* yes, that part I actually worked out myself, but I fail to understand the cause of the cython_sources error. I have python3-dev installed on WSL Ubuntu with python version 3.10.12 in a virtualenv. Anything in that setup that could cause issues?

      @@ -140522,7 +140675,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:12:20
      -

*Thread Reply:* looks like it has something to do with the latest release of Cython? +

*Thread Reply:* looks like it has something to do with the latest release of Cython?
pip install "Cython&lt;3" maybe solves the issue?

      @@ -140549,7 +140702,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:15:06
      -

*Thread Reply:* I didn't have any cython before the install. Also no change. Could it be some update to setuptools itself? Seems like the deprecation notice and the error are coming from inside setuptools

+

*Thread Reply:* I didn't have any cython before the install. Also no change. Could it be some update to setuptools itself? Seems like the deprecation notice and the error are coming from inside setuptools

      @@ -140575,7 +140728,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:16:59
      -

      *Thread Reply:* (I.e. I tried the pip install "Cython&lt;3" command without any change in the output )

      +

      *Thread Reply:* (I.e. I tried the pip install "Cython&lt;3" command without any change in the output )

      @@ -140601,7 +140754,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:20:30
      -

*Thread Reply:* Applying ruff lint on the converter.py file fixed the issue on the PR though, so unless you have any feedback on the change itself, I will set it up on my own computer later instead (right now doing changes on behalf of a client on the client's computer)

+

*Thread Reply:* Applying ruff lint on the converter.py file fixed the issue on the PR though, so unless you have any feedback on the change itself, I will set it up on my own computer later instead (right now doing changes on behalf of a client on the client's computer)

      If the issue persists on my own computer, I'll dig a bit further

      @@ -140629,7 +140782,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:21:03
      -

      *Thread Reply:* It’s a bit hard for me to find the root cause as I cannot reproduce this locally and CI works fine as well

      +

      *Thread Reply:* It’s a bit hard for me to find the root cause as I cannot reproduce this locally and CI works fine as well

      @@ -140655,7 +140808,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:22:41
      -

      *Thread Reply:* Yeah, I am thinking that if I run into the same problem "at home", I might find it worthwhile to understand the issue. Right now, the client only wants the fix.

      +

      *Thread Reply:* Yeah, I am thinking that if I run into the same problem "at home", I might find it worthwhile to understand the issue. Right now, the client only wants the fix.

      @@ -140685,7 +140838,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:25:10
      -

      *Thread Reply:* Is there an official release cycle?

      +

      *Thread Reply:* Is there an official release cycle?

or more specifically, given that the PRs are approved, how soon can they reach openlineage-dbt and apache-airflow-providers-openlineage?

      @@ -140713,7 +140866,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:28:58
      -

      *Thread Reply:* we need to differentiate some things:

      +

      *Thread Reply:* we need to differentiate some things:

1. OpenLineage repository:
a. dbt integration - this is the only place where it is maintained

@@ -140748,7 +140901,7 @@

        set the log level for the openlineage spark library

      2023-09-28 07:31:30
      -

      *Thread Reply:* it’s a bit complex for this split but needed temporarily

      +

      *Thread Reply:* it’s a bit complex for this split but needed temporarily

      @@ -140774,7 +140927,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:31:47
      -

      *Thread Reply:* oh, I did the fix in the wrong place! The client is on airflow 2.7 and is using the provider. Is it syncing?

      +

      *Thread Reply:* oh, I did the fix in the wrong place! The client is on airflow 2.7 and is using the provider. Is it syncing?

      @@ -140800,7 +140953,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:32:28
      -

*Thread Reply:* it’s not, two separate places ~and we haven’t even added the whole thing with converting old lineage objects to OL specific~

+

*Thread Reply:* it’s not, two separate places ~and we haven’t even added the whole thing with converting old lineage objects to OL specific~

      editing, that’s not true

      @@ -140828,7 +140981,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:34:40
      -

      *Thread Reply:* the code’s here: +

      *Thread Reply:* the code’s here: https://github.com/apache/airflow/blob/main/airflow/providers/openlineage/extractors/manager.py#L154

      @@ -140880,7 +141033,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:35:17
      -

*Thread Reply:* sorry I did not mention this earlier. we definitely need to add some guidance on how to proceed with contributions to OL and the Airflow OL provider

+

*Thread Reply:* sorry I did not mention this earlier. we definitely need to add some guidance on how to proceed with contributions to OL and the Airflow OL provider

      @@ -140906,7 +141059,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:36:10
      -

*Thread Reply:* anyway, the dbt fix is the blocking issue, so if that part comes next week, there is no real urgency in getting the columns. It is a nice-to-have for our ingest parquet files.

+

*Thread Reply:* anyway, the dbt fix is the blocking issue, so if that part comes next week, there is no real urgency in getting the columns. It is a nice-to-have for our ingest parquet files.

      @@ -140932,7 +141085,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:37:12
      -

      *Thread Reply:* may I ask if you use some custom operator / python operator there?

      +

      *Thread Reply:* may I ask if you use some custom operator / python operator there?

      @@ -140958,7 +141111,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:37:33
      -

      *Thread Reply:* yeah, taskflow with inlets/outlets

      +

      *Thread Reply:* yeah, taskflow with inlets/outlets

      @@ -140984,7 +141137,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:38:38
      -

      *Thread Reply:* so we extract from sources and use pyarrow to create parquet files in storage that an mssql-server can use as external tables

      +

      *Thread Reply:* so we extract from sources and use pyarrow to create parquet files in storage that an mssql-server can use as external tables

      @@ -141010,7 +141163,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:39:54
      -

      *Thread Reply:* awesome 👍 +

      *Thread Reply:* awesome 👍 we have plans to integrate more with Python operator as well but not earlier than in Airflow 2.8

      @@ -141037,7 +141190,7 @@

      set the log level for the openlineage spark library

      2023-09-28 07:43:41
      -

*Thread Reply:* I guess writing a generic extractor for the python operator is quite hard, but if you could support some inlet/outlet type for tabular file formats / their python libraries like pyarrow or maybe even pandas, and document it, I think a lot of people would understand how to use them

+

*Thread Reply:* I guess writing a generic extractor for the python operator is quite hard, but if you could support some inlet/outlet type for tabular file formats / their python libraries like pyarrow or maybe even pandas, and document it, I think a lot of people would understand how to use them
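
A minimal sketch of the taskflow inlets/outlets pattern being described (File and Table come from airflow.lineage.entities; the URL and table values are placeholders):

from airflow.decorators import task
from airflow.lineage.entities import File, Table

@task(
    inlets=[File(url="s3://bucket/raw/names.csv")],
    outlets=[Table(cluster="mssql", database="warehouse", name="dbo.names_ext")],
)
def extract_to_parquet():
    ...  # pyarrow writes the parquet file backing the external table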

      @@ -141143,7 +141296,7 @@

      set the log level for the openlineage spark library

      2023-10-02 17:00:08
      -

      *Thread Reply:* Thanks all. The release is authorized and will be initiated within 2 business days.

      +

      *Thread Reply:* Thanks all. The release is authorized and will be initiated within 2 business days.

      @@ -141169,7 +141322,7 @@

      set the log level for the openlineage spark library

      2023-10-02 17:11:46
      -

*Thread Reply:* looking forward to that, I am seeing inconsistent results in Databricks for Spark 3.4+; sometimes there are no inputs/outputs. Hope that is fixed?

+

*Thread Reply:* looking forward to that, I am seeing inconsistent results in Databricks for Spark 3.4+; sometimes there are no inputs/outputs. Hope that is fixed?

      @@ -141195,7 +141348,7 @@

      set the log level for the openlineage spark library

      2023-10-03 09:59:24
      -

      *Thread Reply:* @Jason Yip if it isn’t fixed for you, would love it if you could open up an issue that will allow us to reproduce and fix

      +

      *Thread Reply:* @Jason Yip if it isn’t fixed for you, would love it if you could open up an issue that will allow us to reproduce and fix

      @@ -141225,7 +141378,7 @@

      set the log level for the openlineage spark library

      2023-10-03 20:23:40
      -

      *Thread Reply:* @Harel Shein the issue still exists -> Spark 3.4 and above, including 3.5, saveAsTable and create table won't have inputs and outputs in Databricks

      +

      *Thread Reply:* @Harel Shein the issue still exists -> Spark 3.4 and above, including 3.5, saveAsTable and create table won't have inputs and outputs in Databricks

      @@ -141251,7 +141404,7 @@

      set the log level for the openlineage spark library

      2023-10-03 20:30:15
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/2124

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/2124

      @@ -141415,7 +141568,7 @@

      set the log level for the openlineage spark library

      2023-10-03 20:30:21
      -

      *Thread Reply:* and of course this issue still exists

      +

      *Thread Reply:* and of course this issue still exists

      @@ -141441,7 +141594,7 @@

      set the log level for the openlineage spark library

      2023-10-03 21:45:09
      -

      *Thread Reply:* thanks for posting, we’ll continue looking into this.. if you find any clues that might help, please let us know.

      +

      *Thread Reply:* thanks for posting, we’ll continue looking into this.. if you find any clues that might help, please let us know.

      @@ -141467,7 +141620,7 @@

      set the log level for the openlineage spark library

      2023-10-03 21:46:27
      -

*Thread Reply:* are there any instructions on how to hook up a debugger to OL?

+

*Thread Reply:* are there any instructions on how to hook up a debugger to OL?

      @@ -141493,7 +141646,7 @@

      set the log level for the openlineage spark library

      2023-10-04 09:04:16
      -

      *Thread Reply:* @Paweł Leszczyński has been working on adding a debug facet, but more suggestions are more than welcome!

      +

      *Thread Reply:* @Paweł Leszczyński has been working on adding a debug facet, but more suggestions are more than welcome!

      @@ -141519,7 +141672,7 @@

      set the log level for the openlineage spark library

      2023-10-04 09:05:58
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2147

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2147

      @@ -141623,7 +141776,7 @@

      set the log level for the openlineage spark library

      2023-10-05 03:20:11
      -

      *Thread Reply:* @Paweł Leszczyński do you have a build for the PR? Appreciated!

      +

      *Thread Reply:* @Paweł Leszczyński do you have a build for the PR? Appreciated!

      @@ -141649,7 +141802,7 @@

      set the log level for the openlineage spark library

      2023-10-05 15:05:08
      -

      *Thread Reply:* we’ll ask for a release once it’s reviewed and merged

      +

      *Thread Reply:* we’ll ask for a release once it’s reviewed and merged

      @@ -141758,7 +141911,7 @@

      set the log level for the openlineage spark library

      2023-10-03 09:01:38
      -

*Thread Reply:* Not sure if I follow your question. Whenever OL is released, there is a script, new-version.sh - https://github.com/OpenLineage/OpenLineage/blob/main/new-version.sh - being run that modifies the codebase.

+

*Thread Reply:* Not sure if I follow your question. Whenever OL is released, there is a script, new-version.sh - https://github.com/OpenLineage/OpenLineage/blob/main/new-version.sh - being run that modifies the codebase.

So, if you pull the code, it contains an OL version that has not been released yet, and in the case of dependencies, one needs to build them on their own.

      @@ -141812,7 +141965,7 @@

      set the log level for the openlineage spark library

      2023-10-04 05:27:26
      -

      *Thread Reply:* Hmm. Let's elaborate my use case a bit.

      +

      *Thread Reply:* Hmm. Let's elaborate my use case a bit.

      We run Apache Hive on-premise. Hive provides query execution hooks for pre-query, post-query, and I think failed query.

      @@ -141848,7 +142001,7 @@

      set the log level for the openlineage spark library

      2023-10-04 05:27:36
      -

      *Thread Reply:* I hope that helps.

      +

      *Thread Reply:* I hope that helps.

      @@ -141878,7 +142031,7 @@

      set the log level for the openlineage spark library

      2023-10-04 09:03:02
      -

*Thread Reply:* I get it now. In Circle CI we do have 3 build steps: +

*Thread Reply:* I get it now. In Circle CI we do have 3 build steps:
- build-integration-sql-x86
- build-integration-sql-arm
- build-integration-sql-macos

@@ -141948,7 +142101,7 @@

      set the log level for the openlineage spark library

      2023-10-23 11:56:12
      -

*Thread Reply:* It doesn't have the free resource class still 😞 +

*Thread Reply:* It doesn't have the free resource class still 😞
We're blocked on that, unfortunately. Another solution would be to migrate to GH Actions, where most of our solution could be replaced by something like this: https://github.com/PyO3/maturin-action

      @@ -142025,7 +142178,7 @@

      set the log level for the openlineage spark library

      - 👍 Jason Yip, Peter Hicks, Peter Huang, Mars Lan + 👍 Jason Yip, Peter Hicks, Peter Huang, Mars Lan (Metaphor)
      @@ -142049,12 +142202,12 @@

      set the log level for the openlineage spark library

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-10-04 07:42:59
      -

      *Thread Reply:* Any chance we can do a 1.3.2 soonish to include https://github.com/OpenLineage/OpenLineage/pull/2151 instead of waiting for the next monthly release?

      +

      *Thread Reply:* Any chance we can do a 1.3.2 soonish to include https://github.com/OpenLineage/OpenLineage/pull/2151 instead of waiting for the next monthly release?

      @@ -142169,7 +142322,7 @@

      set the log level for the openlineage spark library

      2023-10-04 03:01:02
      -

*Thread Reply:* That's a great use case for OpenLineage. Unfortunately, we don't have any doc or recommendation on that.

+

*Thread Reply:* That's a great use case for OpenLineage. Unfortunately, we don't have any doc or recommendation on that.

I would try using the FluentD proxy we have (https://github.com/OpenLineage/OpenLineage/tree/main/proxy/fluentd) to copy the event stream (alerting is just one of the use cases for lineage events) and write a fluentd plugin to send it asynchronously further to an alerting service like PagerDuty.

      @@ -142314,7 +142467,7 @@

      set the log level for the openlineage spark library

      -
      Mars Lan +
      Mars Lan (Metaphor) (mars@metaphor.io)
      2023-10-06 07:19:01
      @@ -142413,12 +142566,12 @@

      set the log level for the openlineage spark library

      2023-10-06 19:16:02
      -

      *Thread Reply:* Thanks for requesting a release, @Mars Lan. It has been approved and will be initiated within 2 business days of next Monday.

      +

      *Thread Reply:* Thanks for requesting a release, @Mars Lan (Metaphor). It has been approved and will be initiated within 2 business days of next Monday.

      - 🙏 Mars Lan + 🙏 Mars Lan (Metaphor)
      @@ -142446,20 +142599,16 @@

      set the log level for the openlineage spark library

@here I am trying out the openlineage integration of spark on databricks. There is no event getting emitted from OpenLineage; I see logs saying OpenLineage Event Skipped. I am attaching the Notebook that I am trying to run and the cluster logs. Kindly, can someone help me with this?

      - + @@ -142491,7 +142640,7 @@

      set the log level for the openlineage spark library

      2023-10-09 00:02:10
      -

*Thread Reply:* from my experience, it will only work on Spark 3.3.x or below, aka Runtime 12.2 or below. Anything above, the events will show up once in a blue moon

+

*Thread Reply:* from my experience, it will only work on Spark 3.3.x or below, aka Runtime 12.2 or below. Anything above, the events will show up once in a blue moon

      @@ -142517,7 +142666,7 @@

      set the log level for the openlineage spark library

      2023-10-09 00:04:38
      -

*Thread Reply:* ohh, thanks for the information @Jason Yip, I am trying out with 13.3 Databricks Version and Spark 3.4.1; will try using a lower version as you suggested. Is there any issue tracking this bug, @Jason Yip?

+

*Thread Reply:* ohh, thanks for the information @Jason Yip, I am trying out with 13.3 Databricks Version and Spark 3.4.1; will try using a lower version as you suggested. Is there any issue tracking this bug, @Jason Yip?

      @@ -142543,7 +142692,7 @@

      set the log level for the openlineage spark library

      2023-10-09 00:06:06
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/2124

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/2124

      @@ -142707,7 +142856,7 @@

      set the log level for the openlineage spark library

      2023-10-09 00:11:54
      -

*Thread Reply:* tried with databricks 12.2 --> spark 3.3.2, still the same behaviour, no event getting emitted

+

*Thread Reply:* tried with databricks 12.2 --> spark 3.3.2, still the same behaviour, no event getting emitted

      @@ -142733,7 +142882,7 @@

      set the log level for the openlineage spark library

      2023-10-09 00:12:35
      -

*Thread Reply:* you can do 11.3, it's the most stable one I know

+

*Thread Reply:* you can do 11.3, it's the most stable one I know

      @@ -142759,7 +142908,7 @@

      set the log level for the openlineage spark library

      2023-10-09 00:12:46
      -

      *Thread Reply:* sure, let me try that out

      +

      *Thread Reply:* sure, let me try that out

      @@ -142785,7 +142934,7 @@

      set the log level for the openlineage spark library

      2023-10-09 00:31:51
      -

*Thread Reply:* still the same problem…the jar that I am using is the latest openlineage-spark-1.3.1.jar, do you think that can be the problem?

+

*Thread Reply:* still the same problem…the jar that I am using is the latest openlineage-spark-1.3.1.jar, do you think that can be the problem?

      @@ -142811,7 +142960,7 @@

      set the log level for the openlineage spark library

      2023-10-09 00:43:59
      -

      *Thread Reply:* tried with openlineage-spark-1.2.2.jar, still the same issue, seems like they are skipping some events

      +

      *Thread Reply:* tried with openlineage-spark-1.2.2.jar, still the same issue, seems like they are skipping some events

      @@ -142837,7 +142986,7 @@

      set the log level for the openlineage spark library

      2023-10-09 01:47:20
      -

      *Thread Reply:* Probably not all events will be captured, I have only tested create tables and jobs

      +

      *Thread Reply:* Probably not all events will be captured, I have only tested create tables and jobs

      @@ -142863,7 +143012,7 @@

      set the log level for the openlineage spark library

      2023-10-09 04:31:12
      -

      *Thread Reply:* Hi @Guntaka Jeevan Paul, how did you configure openlineage and what is your job doing?

      +

      *Thread Reply:* Hi @Guntaka Jeevan Paul, how did you configure openlineage and what is your job doing?

We do have a bunch of integration tests on the Databricks platform available here and they're passing on databricks runtime 13.0.x-scala2.12.

      @@ -142943,7 +143092,7 @@

      set the log level for the openlineage spark library

      2023-10-09 05:06:41
      -

*Thread Reply:* babynames = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("dbfs:/FileStore/babynames.csv") +

*Thread Reply:* babynames = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("dbfs:/FileStore/babynames.csv")
babynames.createOrReplaceTempView("babynames_table")
years = spark.sql("select distinct(Year) from babynames_table").rdd.map(lambda row : row[0]).collect()
years.sort()

@@ -142974,7 +143123,7 @@

      set the log level for the openlineage spark library

      2023-10-09 05:08:09
      -

*Thread Reply:* this is the script that I am running @Paweł Leszczyński…kindly let me know if I’m making any mistake. I have added the init script at the cluster level and from the logs I could see that openlineage is configured, as I see a log statement

+

*Thread Reply:* this is the script that I am running @Paweł Leszczyński…kindly let me know if I’m making any mistake. I have added the init script at the cluster level and from the logs I could see that openlineage is configured, as I see a log statement

      @@ -143000,7 +143149,7 @@

      set the log level for the openlineage spark library

      2023-10-09 05:10:30
      -

*Thread Reply:* there's nothing wrong in that script. It's just that we decided to limit the amount of OL events for jobs that don't write their data anywhere and just do a collect operation

+

*Thread Reply:* there's nothing wrong in that script. It's just that we decided to limit the amount of OL events for jobs that don't write their data anywhere and just do a collect operation

      @@ -143026,7 +143175,7 @@

      set the log level for the openlineage spark library

      2023-10-09 05:11:02
      -

*Thread Reply:* this is also a potential reason why you can't see any events

+

*Thread Reply:* this is also a potential reason why you can't see any events

      @@ -143052,7 +143201,7 @@

      set the log level for the openlineage spark library

      2023-10-09 05:14:33
      -

*Thread Reply:* ohh…ok, will try out the test script that you have mentioned above. Kindly correct me if my understanding is incorrect: if there are a few transformations and finally a write somewhere, that is where the OL events are expected to be emitted?

+

*Thread Reply:* ohh…ok, will try out the test script that you have mentioned above. Kindly correct me if my understanding is incorrect: if there are a few transformations and finally a write somewhere, that is where the OL events are expected to be emitted?

      @@ -143078,7 +143227,7 @@

      set the log level for the openlineage spark library

      2023-10-09 05:16:54
      -

*Thread Reply:* yes. The main purpose of the lineage is to track dependencies between the datasets, when a job reads from dataset A and writes to dataset B. In the case of a databricks notebook that does show or collect and prints some query result on the screen, there may be no reason to track it in the sense of lineage.

+

*Thread Reply:* yes. The main purpose of the lineage is to track dependencies between the datasets, when a job reads from dataset A and writes to dataset B. In the case of a databricks notebook that does show or collect and prints some query result on the screen, there may be no reason to track it in the sense of lineage.
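A minimal PySpark sketch of that distinction, reusing the babynames example from this thread (the output path is hypothetical): the read-plus-write variant gives OpenLineage an A -> B dataset dependency to report, whereas a collect()-only job may be filtered out.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ol-lineage-demo").getOrCreate()

# dataset A (read)
babynames = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("dbfs:/FileStore/babynames.csv")
)

# A collect()-only step like this is the kind of job that may emit no events:
# years = spark.sql("select distinct(Year) from babynames_table").collect()

# Writing the result to dataset B is what makes the lineage trackable.
(
    babynames.groupBy("Year").count()
    .write.mode("overwrite")
    .format("parquet")
    .save("dbfs:/FileStore/babynames_by_year")  # hypothetical output path
)
```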

      @@ -143107,7 +143256,7 @@

      set the log level for the openlineage spark library

@channel We released OpenLineage 1.4.1! Additions:
-• Client: allow setting client’s endpoint via environment variable 2151 @Mars Lan
+• Client: allow setting client’s endpoint via environment variable 2151 @Mars Lan (Metaphor)
• Flink: expand Iceberg source types 2149 @Peter Huang
• Spark: add debug facet 2147 @Paweł Leszczyński
• Spark: enable Nessie REST catalog 2165 @julwin

@@ -143121,7 +143270,7 @@

      set the log level for the openlineage spark library

- 👍 Jason Yip, Ross Turk, Mars Lan, Harel Shein, Rodrigo Maia
+ 👍 Jason Yip, Ross Turk, Mars Lan (Metaphor), Harel Shein, Rodrigo Maia
      @@ -143172,7 +143321,7 @@

      set the log level for the openlineage spark library

      2023-10-09 18:56:13
      -

      *Thread Reply:* Hi Drew, thank you for using OpenLineage! I don’t know the details of your use case, but I believe this is expected, yes. In general, the dataset namespace is different. Jobs are namespaced separately from datasets, which are namespaced by their containing datasources. This is the case so datasets have the same name regardless of the job writing to them, as datasets are sometimes shared by jobs in different namespaces.

      +

      *Thread Reply:* Hi Drew, thank you for using OpenLineage! I don’t know the details of your use case, but I believe this is expected, yes. In general, the dataset namespace is different. Jobs are namespaced separately from datasets, which are namespaced by their containing datasources. This is the case so datasets have the same name regardless of the job writing to them, as datasets are sometimes shared by jobs in different namespaces.

      @@ -143256,7 +143405,7 @@

      set the log level for the openlineage spark library

      2023-10-11 03:46:13
      -

      *Thread Reply:* Is this related to any OL version? In OL 1.2.2. we've added extra variable spark.databricks.clusterUsageTags.clusterAllTags to be captured, but this should not break things.

      +

      *Thread Reply:* Is this related to any OL version? In OL 1.2.2. we've added extra variable spark.databricks.clusterUsageTags.clusterAllTags to be captured, but this should not break things.

      I think we're facing some issues on recent databricks runtime versions. Here is an issue for this: https://github.com/OpenLineage/OpenLineage/issues/2131

      @@ -143316,7 +143465,7 @@

      set the log level for the openlineage spark library

      2023-10-11 11:17:06
      -

      *Thread Reply:* yes, exactly Spark 3.4+

      +

      *Thread Reply:* yes, exactly Spark 3.4+

      @@ -143342,7 +143491,7 @@

      set the log level for the openlineage spark library

      2023-10-11 21:12:27
      -

      *Thread Reply:* Btw I don't understand the code flow entirely, if we are talking about a different classpath only, I see there's Unity Catalog handler in the code and it says it works the same as Delta, but I am not seeing it subclassing Delta. I suppose it will work the same.

      +

      *Thread Reply:* Btw I don't understand the code flow entirely, if we are talking about a different classpath only, I see there's Unity Catalog handler in the code and it says it works the same as Delta, but I am not seeing it subclassing Delta. I suppose it will work the same.

      I am happy to jump on a call to show you if needed

      @@ -143370,7 +143519,7 @@

      set the log level for the openlineage spark library

      2023-10-16 02:58:56
      -

      *Thread Reply:* @Paweł Leszczyński do you think in Spark 3.4+ only one event would happen?

      +

      *Thread Reply:* @Paweł Leszczyński do you think in Spark 3.4+ only one event would happen?

/**
 * We get exact copies of OL events for org.apache.spark.scheduler.SparkListenerJobStart and

@@ -143438,7 +143587,7 @@

      set the log level for the openlineage spark library

      2023-10-10 23:44:53
      -

      *Thread Reply:* I use it to get the table with database name

      +

      *Thread Reply:* I use it to get the table with database name

      @@ -143464,7 +143613,7 @@

      set the log level for the openlineage spark library

      2023-10-10 23:47:15
      -

*Thread Reply:* so can i think of it like: if there is a symlink, then that table is kind of a reference to the original dataset

+

*Thread Reply:* so can i think of it like: if there is a symlink, then that table is kind of a reference to the original dataset

      @@ -143490,7 +143639,7 @@

      set the log level for the openlineage spark library

      2023-10-11 01:25:44
      -

      *Thread Reply:* yes

      +

      *Thread Reply:* yes

      @@ -143640,7 +143789,7 @@

      set the log level for the openlineage spark library

      2023-10-11 06:57:46
      -

*Thread Reply:* Hey Guntaka - can I ask you a favour? Can you please stop using @here or @channel - please keep in mind, you're pinging over 1000 people when you use that mention. It's incredibly distracting to have Slack notify me of a message that isn't pertinent to me.

+

*Thread Reply:* Hey Guntaka - can I ask you a favour? Can you please stop using @here or @channel - please keep in mind, you're pinging over 1000 people when you use that mention. It's incredibly distracting to have Slack notify me of a message that isn't pertinent to me.

      @@ -143666,7 +143815,7 @@

      set the log level for the openlineage spark library

      2023-10-11 06:58:50
      -

      *Thread Reply:* sure noted @Damien Hawes

      +

      *Thread Reply:* sure noted @Damien Hawes

      @@ -143692,7 +143841,7 @@

      set the log level for the openlineage spark library

      2023-10-11 06:59:34
      -

      *Thread Reply:* Thank you!

      +

      *Thread Reply:* Thank you!

      @@ -143744,7 +143893,7 @@

      set the log level for the openlineage spark library

      2023-10-12 13:55:26
      -

*Thread Reply:* Make sure to provide a dataset field nodeId as a query param in your request. If you’ve seeded Marquez with test metadata, you can use:

+

*Thread Reply:* Make sure to provide a dataset field nodeId as a query param in your request. If you’ve seeded Marquez with test metadata, you can use:
curl -XGET "<http://localhost:5002/api/v1/column-lineage?nodeId=datasetField%3Afood_delivery%3Apublic.delivery_7_days%3Acustomer_email>"
You can view the API docs for column lineage here!

      @@ -143772,7 +143921,7 @@

      set the log level for the openlineage spark library

      2023-10-17 05:57:36
      -

*Thread Reply:* Thanks Willy. The documentation says 'name space' so i constructed the API like this:

+

*Thread Reply:* Thanks Willy. The documentation says 'name space' so i constructed the API like this:
'http://marquez-web:3000/api/v1/column-lineage/nodeId=datasetField:file:/home/jovyan/Downloads/event_attribute.csv:eventType'
but it is still not working 😞

      @@ -143800,7 +143949,7 @@

      set the log level for the openlineage spark library

      2023-10-17 06:07:06
      -

      *Thread Reply:* nodeId is constructed like this: datasetField:<namespace>:<dataset>:<field name>

      +

      *Thread Reply:* nodeId is constructed like this: datasetField:<namespace>:<dataset>:<field name>
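A hedged sketch of the same lookup in Python, using the seeded food_delivery example from the curl command earlier in this thread (the namespace, dataset, and field are that example's, not yours); requests takes care of percent-encoding the colons in the nodeId:

```python
import requests

namespace = "food_delivery"          # example namespace from the seeded demo data
dataset = "public.delivery_7_days"   # example dataset
field = "customer_email"             # example column

# nodeId format: datasetField:<namespace>:<dataset>:<field name>
node_id = f"datasetField:{namespace}:{dataset}:{field}"

resp = requests.get(
    "http://localhost:5002/api/v1/column-lineage",
    params={"nodeId": node_id},      # encoded to datasetField%3Afood_delivery%3A...
)
resp.raise_for_status()
print(resp.json())
```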

      @@ -143897,7 +144046,7 @@

      set the log level for the openlineage spark library

      2023-10-11 14:26:45
      -

*Thread Reply:* Newly added discussion topics:

+

*Thread Reply:* Newly added discussion topics:
• a proposal to add a Registry of Consumers and Producers
• a dbt issue to add OpenLineage Dataset names to the Manifest
• a proposal to add Dataset support in Spark LogicalPlan Nodes

@@ -143953,7 +144102,7 @@

      set the log level for the openlineage spark library

      2023-10-13 01:56:19
      -

      *Thread Reply:* just follow these instructions: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#build

      +

      *Thread Reply:* just follow these instructions: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#build

      @@ -143979,7 +144128,7 @@

      set the log level for the openlineage spark library

      2023-10-13 06:41:56
      -

*Thread Reply:* when trying to install openlineage-java in local via this command --> cd ../../client/java/ && ./gradlew publishToMavenLocal, i am receiving this error

+

*Thread Reply:* when trying to install openlineage-java in local via this command --> cd ../../client/java/ && ./gradlew publishToMavenLocal, i am receiving this error
```> Task :signMavenJavaPublication FAILED

      FAILURE: Build failed with an exception.
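(Assumption, not from this thread: when the failure is only in the signing task, a local-only install can usually proceed by excluding that task, e.g. `./gradlew publishToMavenLocal -x signMavenJavaPublication`, since Gradle's `-x` flag skips the named task.)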

      @@ -144012,18 +144161,14 @@

      set the log level for the openlineage spark library

      2023-10-13 13:35:06
      -

      *Thread Reply:* @Paweł Leszczyński this is what I am getting

      +

      *Thread Reply:* @Paweł Leszczyński this is what I am getting

      @@ -144051,10 +144196,10 @@

      set the log level for the openlineage spark library

      2023-10-13 13:36:00
      -

      *Thread Reply:* attaching the html

      +

      *Thread Reply:* attaching the html

      2023-10-16 03:02:13
      -

*Thread Reply:* which java are you using? what is your operating system (is it windows?)?

+

*Thread Reply:* which java are you using? what is your operating system (is it windows?)?

      @@ -144112,7 +144257,7 @@

      set the log level for the openlineage spark library

      2023-10-16 03:35:18
      -

      *Thread Reply:* yes it is Windows, i downloaded java 8 but I can try to build it with Linux subsystem or Mac

      +

      *Thread Reply:* yes it is Windows, i downloaded java 8 but I can try to build it with Linux subsystem or Mac

      @@ -144138,7 +144283,7 @@

      set the log level for the openlineage spark library

      2023-10-16 03:35:51
      -

      *Thread Reply:* In my case it is Mac

      +

      *Thread Reply:* In my case it is Mac

      @@ -144164,7 +144309,7 @@

      set the log level for the openlineage spark library

      2023-10-16 03:56:09
      -

*Thread Reply:* ** Where:

+

*Thread Reply:* ** Where:
Build file '/mnt/c/Users/jason/Downloads/github/OpenLineage/integration/spark/build.gradle' line: 9

** What went wrong:

@@ -144198,7 +144343,7 @@

      set the log level for the openlineage spark library

      2023-10-16 03:56:23
      -

      *Thread Reply:* tried with Linux subsystem

      +

      *Thread Reply:* tried with Linux subsystem

      @@ -144224,7 +144369,7 @@

      set the log level for the openlineage spark library

      2023-10-16 04:04:29
      -

      *Thread Reply:* we don't have any restrictions for windows builds, however it is something we don't test regularly. 2h ago we did have a successful build on circle CI https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/8271/workflows/0ec521ae-cd21-444a-bfec-554d101770ea

      +

      *Thread Reply:* we don't have any restrictions for windows builds, however it is something we don't test regularly. 2h ago we did have a successful build on circle CI https://app.circleci.com/pipelines/github/OpenLineage/OpenLineage/8271/workflows/0ec521ae-cd21-444a-bfec-554d101770ea

      @@ -144250,7 +144395,7 @@

      set the log level for the openlineage spark library

      2023-10-16 04:13:04
      -

*Thread Reply:* ... 111 more

+

*Thread Reply:* ... 111 more
Caused by: java.lang.ClassNotFoundException: org.gradle.api.provider.HasMultipleValues
... 117 more

      @@ -144278,7 +144423,7 @@

      set the log level for the openlineage spark library

      2023-10-17 00:26:07
      -

*Thread Reply:* @Paweł Leszczyński now I am doing gradlew instead of gradle on windows coz the Linux one doesn't work. The doc didn't mention setting up Spark / Hadoop and that's my original question -- do I need to set up local Spark? Now it's throwing an error on Hadoop: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.

+

*Thread Reply:* @Paweł Leszczyński now I am doing gradlew instead of gradle on windows coz the Linux one doesn't work. The doc didn't mention setting up Spark / Hadoop and that's my original question -- do I need to set up local Spark? Now it's throwing an error on Hadoop: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.

      @@ -144304,7 +144449,7 @@

      set the log level for the openlineage spark library

      2023-10-21 23:33:48
      -

      *Thread Reply:* Got it working with Mac, couldn't get it working with Windows / Linux subsystem

      +

      *Thread Reply:* Got it working with Mac, couldn't get it working with Windows / Linux subsystem

      @@ -144330,7 +144475,7 @@

      set the log level for the openlineage spark library

      2023-10-22 13:08:40
      -

*Thread Reply:* Now getting class not found despite build and test succeeding

+

*Thread Reply:* Now getting class not found despite build and test succeeding

      @@ -144356,7 +144501,7 @@

      set the log level for the openlineage spark library

      2023-10-22 21:46:23
      -

      *Thread Reply:* I uploaded the wrong jar.. there are so many jars, only the jar in the spark folder works, not subfolder

      +

      *Thread Reply:* I uploaded the wrong jar.. there are so many jars, only the jar in the spark folder works, not subfolder

      @@ -144434,7 +144579,7 @@

      set the log level for the openlineage spark library

      2023-10-16 00:01:42
      -

      *Thread Reply:* Hi @Paweł Leszczyński is this expected? CMIIW but we should expect to see the events being logged when running with debug log level right?

      +

      *Thread Reply:* Hi @Paweł Leszczyński is this expected? CMIIW but we should expect to see the events being logged when running with debug log level right?

      @@ -144460,7 +144605,7 @@

      set the log level for the openlineage spark library

      2023-10-16 04:17:30
      -

      *Thread Reply:* It's impossible to know without seeing how you've configured the listener.

      +

      *Thread Reply:* It's impossible to know without seeing how you've configured the listener.

      Can you show this configuration?

      @@ -144488,7 +144633,7 @@

      set the log level for the openlineage spark library

      2023-10-17 03:15:20
      -

*Thread Reply:* spark.openlineage.transport.url <url>

+

*Thread Reply:* spark.openlineage.transport.url <url>
spark.openlineage.transport.endpoint /<endpoint>
spark.openlineage.transport.type http
spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener

@@ -144520,7 +144665,7 @@

      set the log level for the openlineage spark library

      2023-10-17 04:40:03
      -

      *Thread Reply:* Two things:

      +

      *Thread Reply:* Two things:

      1. If you want debug logs, you're going to have to provide a log4j.properties file or log4j2.properties file depending on the version of spark you're running. In that file, you will need to configure the logging levels. If I am not mistaken, the sc.setLogLevel controls ONLY the log levels of Spark namespaced components (i.e., org.apache.spark)
      2. You're telling the listener to emit to a URL. If you want to see the events emitted to the console, then set spark.openlineage.transport.type=console, and remove the other spark.openlineage.transport.** configurations. Do either (1) or (2).
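A minimal sketch of option (2), i.e. the console transport described above, expressed as PySpark session config (the property names are the ones quoted in this thread):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("ol-console-debug")
    .config("spark.extraListeners",
            "io.openlineage.spark.agent.OpenLineageSparkListener")
    # emit events to stdout/logs instead of an HTTP endpoint
    .config("spark.openlineage.transport.type", "console")
    # note: no spark.openlineage.transport.url / .endpoint here
    .getOrCreate()
)
```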
@@ -144550,7 +144695,7 @@

        set the log level for the openlineage spark library

      2023-10-20 00:49:45
      -

      *Thread Reply:* @Damien Hawes Hi, sflr.

      +

      *Thread Reply:* @Damien Hawes Hi, sflr.

      1. So enabling sc.setLogLevel does actually enable debug logs from Openlineage. I can see the events and everyting being logged if I save it as a parquet format instead of delta.
      2. I do want to emit events to the url. But, I would like to just see what exactly are the events being emitted for some specific jobs, since I see that the lineage is incorrect for some MergeInto cases
      @@ -144579,7 +144724,7 @@

      set the log level for the openlineage spark library

      2023-10-26 04:56:50
      -

      *Thread Reply:* Hi @Damien Hawes would like to check again on whether you'd have any thoughts about this... Thanks! 🙂

      +

      *Thread Reply:* Hi @Damien Hawes would like to check again on whether you'd have any thoughts about this... Thanks! 🙂

      @@ -144639,7 +144784,7 @@

      set the log level for the openlineage spark library

      2023-10-17 06:55:47
      -

*Thread Reply:* In general, we assume that OL events per run are cumulative. So, if you have 20 events with the same runId, then even if only a single event contains some facet, we consider this OK and let the backend combine it together. That's what we do in the Marquez project (a reference backend architecture for OL), and that's why it is worth using Marquez as a REST API.

+

*Thread Reply:* In general, we assume that OL events per run are cumulative. So, if you have 20 events with the same runId, then even if only a single event contains some facet, we consider this OK and let the backend combine it together. That's what we do in the Marquez project (a reference backend architecture for OL), and that's why it is worth using Marquez as a REST API.

      Are you able to use job namespace to aggregate all the Spark actions run within the databricks notebook? This is something that should serve this purpose.
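A hedged sketch of that "cumulative" idea, for anyone combining events themselves instead of letting Marquez do it: fold every event sharing a runId into one record by merging run facets (the field layout follows the OpenLineage event schema; the last-write-wins merge policy is an assumption here).

```python
from collections import defaultdict

def combine_runs(events):
    """events: iterable of OpenLineage event dicts, each with run.runId set."""
    combined = defaultdict(dict)
    for event in events:
        run_id = event["run"]["runId"]
        # later events win on facet-name collisions (assumed policy)
        combined[run_id].update(event["run"].get("facets", {}))
    return combined
```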

      @@ -144667,7 +144812,7 @@

      set the log level for the openlineage spark library

      2023-10-17 12:48:33
      -

*Thread Reply:* @Rodrigo Maia for Spark 3.4 I don't see the environment-properties showing up at all, but if you run the code as it is, register a listener on SparkListenerJobStart and get the properties, all of those properties will show up. There's an event filter that filters out the SparkListenerJobStart, I suspect that filtered out the "unnecessary" events.. was trying to do a custom build to do that, but still trying to set up Hadoop and Spark on my local

+

*Thread Reply:* @Rodrigo Maia for Spark 3.4 I don't see the environment-properties showing up at all, but if you run the code as it is, register a listener on SparkListenerJobStart and get the properties, all of those properties will show up. There's an event filter that filters out the SparkListenerJobStart, I suspect that filtered out the "unnecessary" events.. was trying to do a custom build to do that, but still trying to set up Hadoop and Spark on my local

      @@ -144693,18 +144838,14 @@

      set the log level for the openlineage spark library

      2023-10-18 05:23:16
      -

      *Thread Reply:* @Paweł Leszczyński you are right. This is what we are doing as well, combining events with the same runId to process the information on our backend. But even so, there are several runIds without this information. I went through these events to have a better view of what was happening. As you can see from 7 runIds, only 3 were showing the "environment-properties" attribute. Some condition is not being met here, or maybe it is what @Jason Yip suspects and there's some sort of filtering of unnecessary events

      +

      *Thread Reply:* @Paweł Leszczyński you are right. This is what we are doing as well, combining events with the same runId to process the information on our backend. But even so, there are several runIds without this information. I went through these events to have a better view of what was happening. As you can see from 7 runIds, only 3 were showing the "environment-properties" attribute. Some condition is not being met here, or maybe it is what @Jason Yip suspects and there's some sort of filtering of unnecessary events

      @@ -144732,7 +144873,7 @@

      set the log level for the openlineage spark library

      2023-10-19 02:28:03
      -

      *Thread Reply:* @Rodrigo Maia, If you are able to provide a small Spark script such that none of the OL events contain the environment-properties, but at least one should, please raise an issue for this.

      +

      *Thread Reply:* @Rodrigo Maia, If you are able to provide a small Spark script such that none of the OL events contain the environment-properties, but at least one should, please raise an issue for this.

      @@ -144758,7 +144899,7 @@

      set the log level for the openlineage spark library

      2023-10-19 02:29:11
      -

*Thread Reply:* It's extremely helpful when community members open issues that are not only described well, but also contain the small piece of code needed to reproduce the problem.

+

*Thread Reply:* It's extremely helpful when community members open issues that are not only described well, but also contain the small piece of code needed to reproduce the problem.

      @@ -144784,7 +144925,7 @@

      set the log level for the openlineage spark library

      2023-10-19 02:59:39
      -

      *Thread Reply:* I know. that's the goal. that is why I wanted to understand in the first place if there was any condition preventing this from happening, but now i get that this is not expected behaviour.

      +

      *Thread Reply:* I know. that's the goal. that is why I wanted to understand in the first place if there was any condition preventing this from happening, but now i get that this is not expected behaviour.

      @@ -144814,7 +144955,7 @@

      set the log level for the openlineage spark library

      2023-10-19 13:44:00
      -

      *Thread Reply:* @Paweł Leszczyński @Rodrigo Maia I am referring to this: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/filters/DeltaEventFilter.java#L51

      +

      *Thread Reply:* @Paweł Leszczyński @Rodrigo Maia I am referring to this: https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/main/java/io/openlineage/spark/agent/filters/DeltaEventFilter.java#L51

      @@ -144865,7 +145006,7 @@

      set the log level for the openlineage spark library

      2023-10-19 14:49:03
      -

*Thread Reply:* Please note that I am getting the same behavior, no code is needed, Spark 3.4+ won't be generating the environment-properties no matter what. I have been testing the same code for 2 months from this issue: https://github.com/OpenLineage/OpenLineage/issues/2124

+

*Thread Reply:* Please note that I am getting the same behavior, no code is needed, Spark 3.4+ won't be generating the environment-properties no matter what. I have been testing the same code for 2 months from this issue: https://github.com/OpenLineage/OpenLineage/issues/2124

      I tried the code without OL and it worked perfectly, so it is OL filtering out the event for sure. I will try posting the code I use to collect the properties.

      set the log level for the openlineage spark library
      2023-10-19 23:46:17
      -

*Thread Reply:* this code proves that the properties are still there, somehow got filtered out by OL:

+

*Thread Reply:* this code proves that the properties are still there, somehow got filtered out by OL:

```%scala
import org.apache.spark.scheduler._

      @@ -145080,7 +145221,7 @@

      set the log level for the openlineage spark library

      2023-10-19 23:55:05
      -

      *Thread Reply:* of course feel free to test this logic as well, it still works -- if not the filtering:

      +

      *Thread Reply:* of course feel free to test this logic as well, it still works -- if not the filtering:

      https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/shared/src/[…]ark/agent/facets/builder/DatabricksEnvironmentFacetBuilder.java

      set the log level for the openlineage spark library
      2023-10-30 04:46:16
      -

*Thread Reply:* Any ideas on how i could test it?

+

*Thread Reply:* Any ideas on how i could test it?

      @@ -145333,7 +145474,7 @@

      set the log level for the openlineage spark library

      2023-10-18 01:32:01
      -

*Thread Reply:* hey, did you try to follow one of these guides?

+

*Thread Reply:* hey, did you try to follow one of these guides?
https://openlineage.io/docs/guides/about

      @@ -145360,7 +145501,7 @@

      set the log level for the openlineage spark library

      2023-10-18 09:14:08
      -

      *Thread Reply:* Which guide were you using, and what errors/issues are you encountering?

      +

      *Thread Reply:* Which guide were you using, and what errors/issues are you encountering?

      @@ -145386,7 +145527,7 @@

      set the log level for the openlineage spark library

      2023-10-21 15:43:14
      -

      *Thread Reply:* Thanks Jakub for the response.

      +

      *Thread Reply:* Thanks Jakub for the response.

      @@ -145412,18 +145553,14 @@

      set the log level for the openlineage spark library

      2023-10-21 15:45:42
      -

*Thread Reply:* In docker, the marquez-api image is not running and is exiting with exit code 127.

+

*Thread Reply:* In docker, the marquez-api image is not running and is exiting with exit code 127.

      @@ -145451,7 +145588,7 @@

      set the log level for the openlineage spark library

      2023-10-22 09:34:53
      -

      *Thread Reply:* @ankit jain thanks. I don't recognize 127, but 9 times out of 10 if the API or DB container fails the reason is a port conflict. Have you checked if port 5000 is available?

      +

      *Thread Reply:* @ankit jain thanks. I don't recognize 127, but 9 times out of 10 if the API or DB container fails the reason is a port conflict. Have you checked if port 5000 is available?

      @@ -145477,7 +145614,7 @@

      set the log level for the openlineage spark library

      2023-10-22 09:54:10
      -

*Thread Reply:* could you please check what’s the output of

+

*Thread Reply:* could you please check what’s the output of
git config --get core.autocrlf
or
git config --global --get core.autocrlf

@@ -145507,7 +145644,7 @@

      set the log level for the openlineage spark library

      2023-10-24 08:09:14
      -

*Thread Reply:* @Michael Robinson thanks, I checked - port 5000 is not available.

+

*Thread Reply:* @Michael Robinson thanks, I checked - port 5000 is not available.
I tried deleting the docker images and recreating them, but still the same issue persists, stating /usr/bin/env bash\r not found. The Gradle build is successful.

      @@ -145536,7 +145673,7 @@

      set the log level for the openlineage spark library

      2023-10-24 08:09:54
      -

      *Thread Reply:* @Jakub Dardziński thanks, first command resulted as true and second command has no response

      +

      *Thread Reply:* @Jakub Dardziński thanks, first command resulted as true and second command has no response

      @@ -145562,7 +145699,7 @@

      set the log level for the openlineage spark library

      2023-10-24 08:15:57
      -

      *Thread Reply:* are you running docker and git in Windows or Mac OS before 10.0?

      +

      *Thread Reply:* are you running docker and git in Windows or Mac OS before 10.0?
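(Context for why core.autocrlf matters here: with core.autocrlf set to true on Windows, shell scripts are checked out with CRLF line endings, which is exactly what produces the /usr/bin/env bash\r not found error above. A common fix - an assumption here, not confirmed in this thread - is `git config --global core.autocrlf input` followed by a fresh clone, so the scripts keep LF endings.)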

      @@ -145616,7 +145753,7 @@

      set the log level for the openlineage spark library

      2023-10-20 02:57:50
      -

      *Thread Reply:* Hi, just checking, are you excluding the sparkPlan from the events? Or is it sending the spark plan too

      +

      *Thread Reply:* Hi, just checking, are you excluding the sparkPlan from the events? Or is it sending the spark plan too

      @@ -145642,7 +145779,7 @@

      set the log level for the openlineage spark library

      2023-10-23 11:59:40
      -

      *Thread Reply:* yeah - setting spark.openlineage.facets.disabled to [spark_unknown;spark.logicalPlan] should help

      +

      *Thread Reply:* yeah - setting spark.openlineage.facets.disabled to [spark_unknown;spark.logicalPlan] should help
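The same setting as PySpark session config, a minimal sketch using exactly the value from the reply above:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("ol-slim-events")
    .config("spark.extraListeners",
            "io.openlineage.spark.agent.OpenLineageSparkListener")
    # drop the heavyweight facets so large logical plans aren't serialized
    .config("spark.openlineage.facets.disabled", "[spark_unknown;spark.logicalPlan]")
    .getOrCreate()
)
```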

      @@ -145668,7 +145805,7 @@

      set the log level for the openlineage spark library

      2023-10-24 17:50:26
      -

*Thread Reply:* sorry for the late reply - turns out this job is just whack 😄 we were going in circles trying to figure it out, and we ended up dropping events without open lineage enabled at all. But good to know that disabling the logical plan should speed us up if we run into this again

+

*Thread Reply:* sorry for the late reply - turns out this job is just whack 😄 we were going in circles trying to figure it out, and we ended up dropping events without open lineage enabled at all. But good to know that disabling the logical plan should speed us up if we run into this again

      @@ -145732,7 +145869,7 @@

      set the log level for the openlineage spark library

      2023-10-23 04:56:25
      -

*Thread Reply:* Hmm, that is interesting. Did it occur on databricks runtime? Could you give it a try with Scala 2.12? I think we don't test scala 2.13.

+

*Thread Reply:* Hmm, that is interesting. Did it occur on databricks runtime? Could you give it a try with Scala 2.12? I think we don't test scala 2.13.

      @@ -145758,7 +145895,7 @@

      set the log level for the openlineage spark library

      2023-10-23 12:02:13
      -

      *Thread Reply:* I believe our Scala 2.12 jobs are working fine. It's not databricks runtime. We run Spark on Kube.

      +

      *Thread Reply:* I believe our Scala 2.12 jobs are working fine. It's not databricks runtime. We run Spark on Kube.

      @@ -145784,7 +145921,59 @@

      set the log level for the openlineage spark library

      2023-10-24 06:47:14
      -

      *Thread Reply:* Ok. I think You can raise an issue to support Scala 2.13 for latest Spark versions.

      +

      *Thread Reply:* Ok. I think You can raise an issue to support Scala 2.13 for latest Spark versions.

+

Damien Hawes (damien.hawes@booking.com)
2023-12-08 05:57:55

*Thread Reply:* Yeah - this just hit me yesterday.

+

Damien Hawes (damien.hawes@booking.com)
2023-12-08 05:58:29

*Thread Reply:* I've created a ticket for it, it wasn't a fun surprise, that's for sure.

      @@ -145836,7 +146025,7 @@

      set the log level for the openlineage spark library

      2023-10-26 07:45:41
      -

      *Thread Reply:* Hi @priya narayana, please get familiar with Extending section on our docs: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#extending

      +

      *Thread Reply:* Hi @priya narayana, please get familiar with Extending section on our docs: https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark#extending

      @@ -145862,7 +146051,7 @@

      set the log level for the openlineage spark library

      2023-10-26 09:53:07
      -

*Thread Reply:* Okay, thank you. Just checking whether there are any other docs or git code which could also help me

+

*Thread Reply:* Okay, thank you. Just checking whether there are any other docs or git code which could also help me

      @@ -145917,15 +146106,11 @@

      set the log level for the openlineage spark library

      Im upgrading the version from openlineage-airflow==0.24.0 to openlineage-airflow 1.4.1 but im seeing the following error, any help is appreciated

      @@ -145953,7 +146138,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:14:02
      -

      *Thread Reply:* @Jakub Dardziński any thoughts?

      +

      *Thread Reply:* @Jakub Dardziński any thoughts?

      @@ -145979,7 +146164,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:14:24
      -

      *Thread Reply:* what version of Airflow are you using?

      +

      *Thread Reply:* what version of Airflow are you using?

      @@ -146005,7 +146190,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:14:52
      -

      *Thread Reply:* 2.6.3 that satisfies the requirement

      +

      *Thread Reply:* 2.6.3 that satisfies the requirement

      @@ -146031,7 +146216,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:16:38
      -

      *Thread Reply:* is it possible you have some custom operator?

      +

      *Thread Reply:* is it possible you have some custom operator?

      @@ -146057,7 +146242,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:17:15
      -

      *Thread Reply:* i think its the base operator causing the issue

      +

      *Thread Reply:* i think its the base operator causing the issue

      @@ -146083,7 +146268,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:17:36
      -

      *Thread Reply:* so no i believe

      +

      *Thread Reply:* so no i believe

      @@ -146109,7 +146294,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:18:43
      -

*Thread Reply:* BaseOperator is the parent class for all other operators; it defines how to do a deepcopy

+

*Thread Reply:* BaseOperator is the parent class for all other operators; it defines how to do a deepcopy
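A hypothetical sketch of the antipattern being hinted at here: if a DAG attaches an object that cannot be deep-copied (a lock, socket, live client, ...) to a task, the listener's copy.deepcopy(task) blows up, which matches the traceback above. The class name below is illustrative only.

```python
import copy
import threading

class DataWarehouseTask:            # stand-in for a custom operator instance
    def __init__(self):
        # locks (and sockets, DB clients, ...) cannot be pickled or deep-copied
        self.client_lock = threading.Lock()

task = DataWarehouseTask()
copy.deepcopy(task)  # raises TypeError: cannot pickle '_thread.lock' object
```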

      @@ -146135,7 +146320,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:19:11
      -

      *Thread Reply:* yeah so its controlled by Airflow itself, I didnt customize it

      +

      *Thread Reply:* yeah so its controlled by Airflow itself, I didnt customize it

      @@ -146161,7 +146346,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:19:49
      -

      *Thread Reply:* uhm, maybe it's possible you could share dag code? you may hide sensitive data

      +

      *Thread Reply:* uhm, maybe it's possible you could share dag code? you may hide sensitive data

      @@ -146187,7 +146372,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:21:23
      -

      *Thread Reply:* let me try with lower versions of openlineage, what's say

      +

      *Thread Reply:* let me try with lower versions of openlineage, what's say

      @@ -146213,7 +146398,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:21:39
      -

      *Thread Reply:* its a big jump from 0.24.0 to 1.4.1

      +

      *Thread Reply:* its a big jump from 0.24.0 to 1.4.1

      @@ -146239,7 +146424,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:22:25
      -

      *Thread Reply:* but i will help here to investigate this issue

      +

      *Thread Reply:* but i will help here to investigate this issue

      @@ -146265,7 +146450,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:24:03
      -

      *Thread Reply:* for me it seems that within dag or task you're defining some object that is not easy to copy

      +

      *Thread Reply:* for me it seems that within dag or task you're defining some object that is not easy to copy

      @@ -146291,7 +146476,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:26:05
      -

*Thread Reply:* possible, but with 0.24.0 that issue is not occurring, so the worry is that the version upgrade could potentially break things

+

*Thread Reply:* possible, but with 0.24.0 that issue is not occurring, so the worry is that the version upgrade could potentially break things

      @@ -146317,7 +146502,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:39:34
      -

      *Thread Reply:* 0.24.0 is not that old 🤔

      +

      *Thread Reply:* 0.24.0 is not that old 🤔

      @@ -146343,7 +146528,7 @@

      set the log level for the openlineage spark library

      2023-10-26 13:45:07
      -

*Thread Reply:* i see the issue with 0.24.0 I see it as warning

+

*Thread Reply:* i see the issue with 0.24.0 I see it as warning
[airflow/utils/log/logging_mixin.py::_propagate_log()::150] WARNING - File "/usr/lib64/python3.8/threading.py", line 932, in _bootstrap_inner
[2023-10-26, 17:40:50 UTC] [airflow/utils/log/logging_mixin.py::_propagate_log()::150] WARNING - self.run()
[2023-10-26, 17:40:50 UTC] [airflow/utils/log/logging_mixin.py::_propagate_log()::150] WARNING - File "/usr/lib64/python3.8/threading.py", line 870, in run

@@ -146427,18 +146612,14 @@

      set the log level for the openlineage spark library

      2023-10-26 14:18:08
      -

*Thread Reply:* I see the difference in calling between these 2 versions: the current version checks if Airflow is >2.6 and then directly runs on_running, but the earlier version was running it on a separate thread. Is this what's raising this exception?

+

*Thread Reply:* I see the difference in calling between these 2 versions: the current version checks if Airflow is >2.6 and then directly runs on_running, but the earlier version was running it on a separate thread. Is this what's raising this exception?

      @@ -146466,7 +146647,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:24:49
      -

      *Thread Reply:* this is the issue - https://github.com/OpenLineage/OpenLineage/blob/c343835c1664eda94d5c315897ae6702854c81bd/integration/airflow/openlineage/airflow/listener.py#L89 while copying the task

      +

      *Thread Reply:* this is the issue - https://github.com/OpenLineage/OpenLineage/blob/c343835c1664eda94d5c315897ae6702854c81bd/integration/airflow/openlineage/airflow/listener.py#L89 while copying the task

      @@ -146517,7 +146698,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:25:21
      -

*Thread Reply:* since we are running it directly if version>2.6.0, it's throwing the error in the main processing

+

*Thread Reply:* since we are running it directly if version>2.6.0, it's throwing the error in the main processing

      @@ -146543,7 +146724,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:28:02
      -

      *Thread Reply:* may i know which Airflow version we tested this process?

      +

      *Thread Reply:* may i know which Airflow version we tested this process?

      @@ -146569,7 +146750,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:28:39
      -

      *Thread Reply:* im on 2.6.3

      +

      *Thread Reply:* im on 2.6.3

      @@ -146595,7 +146776,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:30:53
      -

*Thread Reply:* 2.1.4, 2.2.4, 2.3.4, 2.4.3, 2.5.2, 2.6.1

+

*Thread Reply:* 2.1.4, 2.2.4, 2.3.4, 2.4.3, 2.5.2, 2.6.1
usually there are not too many changes between minor versions

      I still believe it might be some code you might improve and probably is also an antipattern in airflow

      @@ -146624,7 +146805,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:34:26
      -

*Thread Reply:* hummm...that's a valid observation, but I don't write the DAGs, other teams do, so imagine if many people wrote such DAGs - I can't ask everyone to change their patterns, right? If something is running on the current openlineage version with a warning, it should still run on the upgraded version, shouldn't it?

+

*Thread Reply:* hummm...that's a valid observation, but I don't write the DAGs, other teams do, so imagine if many people wrote such DAGs - I can't ask everyone to change their patterns, right? If something is running on the current openlineage version with a warning, it should still run on the upgraded version, shouldn't it?

      @@ -146650,7 +146831,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:38:04
      -

      *Thread Reply:* however I see ur point

      +

      *Thread Reply:* however I see ur point

      @@ -146676,7 +146857,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:49:52
      -

*Thread Reply:* So that specific task has 570 lines of query and a pretty bulky query, let me split it into smaller units

+

*Thread Reply:* So that specific task has 570 lines of query and a pretty bulky query, let me split it into smaller units

      @@ -146702,7 +146883,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:50:15
      -

      *Thread Reply:* that should help right? @Jakub Dardziński

      +

      *Thread Reply:* that should help right? @Jakub Dardziński

      @@ -146728,7 +146909,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:51:27
      -

      *Thread Reply:* query length shouldn’t be the issue, rather any python code

      +

      *Thread Reply:* query length shouldn’t be the issue, rather any python code

      @@ -146754,7 +146935,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:51:50
      -

      *Thread Reply:* I get your point too, we might figure out some mechanism to skip irrelevant parts of task instance so that it doesn’t fail then

      +

      *Thread Reply:* I get your point too, we might figure out some mechanism to skip irrelevant parts of task instance so that it doesn’t fail then

      @@ -146780,7 +146961,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:52:12
      -

      *Thread Reply:* actually its failing on that task itself

      +

      *Thread Reply:* actually its failing on that task itself

      @@ -146806,7 +146987,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:52:33
      -

      *Thread Reply:* let me try it will be pretty quick

      +

      *Thread Reply:* let me try it will be pretty quick

      @@ -146832,7 +147013,7 @@

      set the log level for the openlineage spark library

      2023-10-26 14:58:58
      -

      *Thread Reply:* @Jakub Dardziński but ur right we have to fix this at Openlineage side as well. Because ideally Openlineage shouldn't be causing any issue to the main DAG processing

      +

      *Thread Reply:* @Jakub Dardziński but ur right we have to fix this at Openlineage side as well. Because ideally Openlineage shouldn't be causing any issue to the main DAG processing

      @@ -146858,7 +147039,7 @@

      set the log level for the openlineage spark library

      2023-10-26 17:51:05
      -

      *Thread Reply:* it doesn’t break any airflow functionality, execution is wrapped into try/except block, only exception traceback is logged as you can see

      +

      *Thread Reply:* it doesn’t break any airflow functionality, execution is wrapped into try/except block, only exception traceback is logged as you can see

      @@ -146884,7 +147065,7 @@

      set the log level for the openlineage spark library

      2023-10-27 05:25:54
      -

      *Thread Reply:* Can you migrate to Airflow 2.7 and use apache-airflow-providers-openlineage? Ideally we wouldn't make meaningful changes to openlineage-airflow

      +

      *Thread Reply:* Can you migrate to Airflow 2.7 and use apache-airflow-providers-openlineage? Ideally we wouldn't make meaningful changes to openlineage-airflow

      @@ -146910,7 +147091,7 @@

      set the log level for the openlineage spark library

      2023-10-27 11:35:44
      -

      *Thread Reply:* yup thats what im planning to do

      +

      *Thread Reply:* yup thats what im planning to do

      @@ -146936,7 +147117,7 @@

      set the log level for the openlineage spark library

      2023-10-27 13:59:03
      -

*Thread Reply:* referencing https://openlineage.slack.com/archives/C01CK9T7HKR/p1698398754823079?thread_ts=1698340358.557159&cid=C01CK9T7HKR|this conversation - what it takes to move to the openlineage provider package from openlineage-airflow. Im updating Airflow to 2.7.2 but moving off of openlineage-airflow to the provider package. Im trying to estimate the amount of work it takes, any thoughts? reading changelogs I dont think its too much of a change but please share your thoughts and if somewhere its drafted please do share that as well

+

*Thread Reply:* referencing https://openlineage.slack.com/archives/C01CK9T7HKR/p1698398754823079?thread_ts=1698340358.557159&cid=C01CK9T7HKR|this conversation - what it takes to move to the openlineage provider package from openlineage-airflow. Im updating Airflow to 2.7.2 but moving off of openlineage-airflow to the provider package. Im trying to estimate the amount of work it takes, any thoughts? reading changelogs I dont think its too much of a change but please share your thoughts and if somewhere its drafted please do share that as well

      @@ -146962,7 +147143,7 @@

      set the log level for the openlineage spark library

      2023-10-30 08:21:10
      -

*Thread Reply:* Generally not much - I would maybe think of operator coverage. For example, for BigQuery the old openlineage-airflow supports BigQueryExecuteQueryOperator. However, the new apache-airflow-providers-openlineage supports BigQueryInsertJobOperator - because it's the intended replacement for BigQueryExecuteQueryOperator and the Airflow community does not want to accept contributions to deprecated operators.

+

*Thread Reply:* Generally not much - I would maybe think of operator coverage. For example, for BigQuery the old openlineage-airflow supports BigQueryExecuteQueryOperator. However, the new apache-airflow-providers-openlineage supports BigQueryInsertJobOperator - because it's the intended replacement for BigQueryExecuteQueryOperator and the Airflow community does not want to accept contributions to deprecated operators.
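For reference, a hedged sketch of the covered replacement operator (import path and parameter names per the Google provider; the project, dataset, and table are hypothetical):

```python
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

run_query = BigQueryInsertJobOperator(
    task_id="run_query",
    configuration={
        "query": {
            "query": "SELECT * FROM `my-project.my_dataset.my_table`",
            "useLegacySql": False,
        }
    },
)
```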

      @@ -146992,7 +147173,7 @@

      set the log level for the openlineage spark library

      2023-10-31 15:00:38
      -

*Thread Reply:* one question if someone is around - when im keeping both openlineage-airflow and apache-airflow-providers-openlineage in my requirement file, i see the following error -

+

*Thread Reply:* one question if someone is around - when im keeping both openlineage-airflow and apache-airflow-providers-openlineage in my requirement file, i see the following error -
from openlineage.airflow.extractors import Extractors
ModuleNotFoundError: No module named 'openlineage.airflow'
any thoughts?

      @@ -147021,7 +147202,7 @@

      set the log level for the openlineage spark library

      2023-10-31 15:37:07
      -

      *Thread Reply:* I would usually do a pip freeze | grep openlineage as a sanity check to validate that the module is actually installed. Not sure how the provider and the module play together though

      +

      *Thread Reply:* I would usually do a pip freeze | grep openlineage as a sanity check to validate that the module is actually installed. Not sure how the provider and the module play together though

      @@ -147047,7 +147228,7 @@

      set the log level for the openlineage spark library

      2023-10-31 17:07:41
      -

*Thread Reply:* yeah so @John Lukenoff im not getting how i can use the specific extractor when i run my operator. Say for example, I have a custom datawarehouseOperator and i want to override get_openlineage_facets_on_start and get_openlineage_facets_on_complete using the redshift extractor - then how would i do that?

+

*Thread Reply:* yeah so @John Lukenoff im not getting how i can use the specific extractor when i run my operator. Say for example, I have a custom datawarehouseOperator and i want to override get_openlineage_facets_on_start and get_openlineage_facets_on_complete using the redshift extractor - then how would i do that?
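Not an answer from the thread, but a hedged sketch of how the provider package expects this to look: the operator itself implements the get_openlineage_facets_on_* hooks and returns an OperatorLineage, so no separate extractor is registered (the operator name and the Redshift namespace/dataset names below are hypothetical):

```python
from airflow.models.baseoperator import BaseOperator
from airflow.providers.openlineage.extractors import OperatorLineage
from openlineage.client.run import Dataset

class DataWarehouseOperator(BaseOperator):
    def execute(self, context):
        ...  # run the warehouse query

    def get_openlineage_facets_on_complete(self, task_instance):
        # report what was read and written, instead of registering an extractor
        return OperatorLineage(
            inputs=[Dataset(namespace="redshift://cluster:5439", name="schema.source_table")],
            outputs=[Dataset(namespace="redshift://cluster:5439", name="schema.target_table")],
        )
```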

      @@ -147130,7 +147311,7 @@

      set the log level for the openlineage spark library

      2023-10-30 09:03:57
      -

*Thread Reply:* In general, I think this kind of use case is probably best served by facets, but what do you think @Paweł Leszczyński?

+

*Thread Reply:* In general, I think this kind of use case is probably best served by facets, but what do you think @Paweł Leszczyński?

      openlineage.io
      @@ -147252,7 +147433,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:04:30
      -

      *Thread Reply:* Hmm, have you looked over our Running on AWS docs?

      +

      *Thread Reply:* Hmm, have you looked over our Running on AWS docs?

      @@ -147278,7 +147459,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:06:08
      -

      *Thread Reply:* More specifically, the AWS RDS section. How are you deploying Marquez on Ec2?

      +

      *Thread Reply:* More specifically, the AWS RDS section. How are you deploying Marquez on Ec2?

      @@ -147304,7 +147485,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:08:05
      -

      *Thread Reply:* we were primarily referencing this document on git - https://github.com/MarquezProject/marquez

      +

      *Thread Reply:* we were primarily referencing this document on git - https://github.com/MarquezProject/marquez

      @@ -147359,7 +147540,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:09:05
      -

      *Thread Reply:* leveraged docker and docker-compose

      +

      *Thread Reply:* leveraged docker and docker-compose

      @@ -147385,7 +147566,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:13:10
      -

      *Thread Reply:* hmm so you’re running docker-compose up on an Ec2 instance you’ve ssh’d into? (just trying to understand your setup better)

      +

      *Thread Reply:* hmm so you’re running docker-compose up on an Ec2 instance you’ve ssh’d into? (just trying to understand your setup better)

      @@ -147411,7 +147592,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:13:26
      -

      *Thread Reply:* yes, thats correct

      +

      *Thread Reply:* yes, thats correct

      @@ -147437,7 +147618,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:16:39
      -

*Thread Reply:* I’ve only used docker compose for local dev or integration tests. but, ok you’re probably in the PoC phase. Can you run the docker cmd on your local machine successfully? What OS is installed on the Ec2 instance?

+

*Thread Reply:* I’ve only used docker compose for local dev or integration tests. but, ok you’re probably in the PoC phase. Can you run the docker cmd on your local machine successfully? What OS is installed on the Ec2 instance?

      @@ -147463,7 +147644,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:18:00
      -

      *Thread Reply:* yes, i can run and the OS is Ubuntu 20.04.6 LTS

      +

      *Thread Reply:* yes, i can run and the OS is Ubuntu 20.04.6 LTS

      @@ -147489,7 +147670,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:19:27
      -

*Thread Reply:* we initially ran into a permission denied error related to the postgresql.conf file and we had to update file permissions to 777, after which we started to see the below errors

+

*Thread Reply:* we initially ran into a permission denied error related to the postgresql.conf file and we had to update file permissions to 777, after which we started to see the below errors

      @@ -147515,7 +147696,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:19:36
      -

*Thread Reply:* marquez-db | 2023-10-27 20:35:52.512 GMT [35] FATAL: no pg_hba.conf entry for host "172.18.0.5", user "marquez", database "marquez", no encryption

+

*Thread Reply:* marquez-db | 2023-10-27 20:35:52.512 GMT [35] FATAL: no pg_hba.conf entry for host "172.18.0.5", user "marquez", database "marquez", no encryption
marquez-db | 2023-10-27 20:35:52.529 GMT [36] FATAL: no pg_hba.conf entry for host "172.18.0.5", user "marquez", database "marquez", no encryption

      @@ -147542,7 +147723,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:20:12
      -

      *Thread Reply:* we then manually updated pg_hba.conf file to include host user and db details

      +

      *Thread Reply:* we then manually updated pg_hba.conf file to include host user and db details
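(For reference, a hedged example of the kind of pg_hba.conf entry that clears the FATAL above, following the `host <database> <user> <address> <auth-method>` column layout - the subnet comes from the docker network in the log, and md5 auth is an assumption: `host  marquez  marquez  172.18.0.0/16  md5`.)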

      @@ -147568,7 +147749,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:20:42
      -

      *Thread Reply:* Did you also update the marquez.yml with the db user / password?

      +

      *Thread Reply:* Did you also update the marquez.yml with the db user / password?

      @@ -147594,7 +147775,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:20:48
      -

      *Thread Reply:* after which we started to see the errors posted in the github open issues page

      +

      *Thread Reply:* after which we started to see the errors posted in the github open issues page

      @@ -147620,7 +147801,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:21:33
      -

      *Thread Reply:* hmm are you using an external database or are you spinning up the entire Marquez stack with docker compose?

      +

      *Thread Reply:* hmm are you using an external database or are you spinning up the entire Marquez stack with docker compose?

      @@ -147646,7 +147827,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:21:56
      -

      *Thread Reply:* we are spinning up the entire Marquez stack with docker compose

      +

      *Thread Reply:* we are spinning up the entire Marquez stack with docker compose

      @@ -147672,7 +147853,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:23:24
      -

      *Thread Reply:* we did not change anything in the marquez.yml, i think we did not find that file in the github repo that we cloned into our local instance

      +

      *Thread Reply:* we did not change anything in the marquez.yml, i think we did not find that file in the github repo that we cloned into our local instance

      @@ -147698,7 +147879,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:26:31
      -

      *Thread Reply:* It’s important that the init-db.sh script runs, but I don’t think it is

      +

      *Thread Reply:* It’s important that the init-db.sh script runs, but I don’t think it is

      @@ -147724,7 +147905,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:26:56
      -

      *Thread Reply:* can you grab all the docker compose logs and share them? it’s hard to debug otherwise

      +

      *Thread Reply:* can you grab all the docker compose logs and share them? it’s hard to debug otherwise

      @@ -147750,10 +147931,10 @@

      set the log level for the openlineage spark library

      2023-10-27 17:29:59
      -

      *Thread Reply:*

      +

      *Thread Reply:*

@@ -147785,7 +147966,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:33:15
      -

      *Thread Reply:* I would first suggest to remove the --build flag since you are specifying a version of Marquez to use via --tag

      +

      *Thread Reply:* I would first suggest to remove the --build flag since you are specifying a version of Marquez to use via --tag

      @@ -147811,7 +147992,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:33:49
      -

*Thread Reply:* not the issue per se, but it will help clear up some of the logs

+

*Thread Reply:* not the issue per se, but it will help clear up some of the logs

      @@ -147837,7 +148018,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:35:06
      -

      *Thread Reply:* for sure thanks. we could get the logs without the --build portion, we tried with that option just once

      +

      *Thread Reply:* for sure thanks. we could get the logs without the --build portion, we tried with that option just once

      @@ -147863,7 +148044,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:35:40
      -

      *Thread Reply:* the errors were the same with/without --build option

      +

      *Thread Reply:* the errors were the same with/without --build option

      @@ -147889,7 +148070,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:36:02
      -

*Thread Reply:* marquez-api | ERROR [2023-10-27 21:34:58,019] org.apache.tomcat.jdbc.pool.ConnectionPool: Unable to create initial connections of pool. +

*Thread Reply:* marquez-api | ERROR [2023-10-27 21:34:58,019] org.apache.tomcat.jdbc.pool.ConnectionPool: Unable to create initial connections of pool.
marquez-api | ! org.postgresql.util.PSQLException: FATAL: password authentication failed for user "marquez"
marquez-api | ! at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:693)
marquez-api | ! at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:203) @@ -147946,7 +148127,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:38:52
      -

      *Thread Reply:* debugging docker issues like this is so difficult

      +

      *Thread Reply:* debugging docker issues like this is so difficult

      @@ -147972,7 +148153,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:40:44
      -

      *Thread Reply:* it could be a number of things, but you are connected to the database it’s just that the marquez user hasn’t been created

      +

      *Thread Reply:* it could be a number of things, but you are connected to the database it’s just that the marquez user hasn’t been created

      @@ -147998,7 +148179,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:41:59
      -

      *Thread Reply:* the /init-db.sh is what manages user creation

      +

      *Thread Reply:* the /init-db.sh is what manages user creation

      @@ -148024,7 +148205,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:42:17
      -

      *Thread Reply:* so it’s possible that the script isn’t running for whatever reason on your Ec2 instance

      +

      *Thread Reply:* so it’s possible that the script isn’t running for whatever reason on your Ec2 instance

      @@ -148050,7 +148231,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:44:20
      -

      *Thread Reply:* do you have other services running on that Ec2 instance? Like, other than Marquez

      +

      *Thread Reply:* do you have other services running on that Ec2 instance? Like, other than Marquez

      @@ -148076,7 +148257,7 @@

      set the log level for the openlineage spark library

      2023-10-27 17:44:52
      -

      *Thread Reply:* is there a postgres process running outside of docker?

      +

      *Thread Reply:* is there a postgres process running outside of docker?

      @@ -148102,7 +148283,7 @@

      set the log level for the openlineage spark library

      2023-10-27 20:34:50
      -

      *Thread Reply:* no other services except marquez on this EC2 instance

      +

      *Thread Reply:* no other services except marquez on this EC2 instance

      @@ -148128,7 +148309,7 @@

      set the log level for the openlineage spark library

      2023-10-27 20:35:49
      -

      *Thread Reply:* this was a new Ec2 instance that was spun up to install and use marquez

      +

      *Thread Reply:* this was a new Ec2 instance that was spun up to install and use marquez

      @@ -148154,7 +148335,7 @@

      set the log level for the openlineage spark library

      2023-10-27 20:36:09
      -

*Thread Reply:* and we can confirm that no postgres process runs outside of docker

+

*Thread Reply:* and we can confirm that no postgres process runs outside of docker

      @@ -148206,7 +148387,7 @@

      set the log level for the openlineage spark library

      2023-10-30 09:59:53
      -

      *Thread Reply:* hi @Jason Yip could you provide an example of such a job?

      +

      *Thread Reply:* hi @Jason Yip could you provide an example of such a job?

      @@ -148232,7 +148413,7 @@

      set the log level for the openlineage spark library

      2023-10-30 16:51:55
      -

      *Thread Reply:* @Paweł Leszczyński same old:

      +

      *Thread Reply:* @Paweł Leszczyński same old:

      delete the old table if needed

      @@ -148384,7 +148565,7 @@

      show data

      2023-10-31 06:34:05
      -

      *Thread Reply:* Can you show the actual event? Should be in the events tab in Marquez

      +

      *Thread Reply:* Can you show the actual event? Should be in the events tab in Marquez

      @@ -148410,7 +148591,7 @@

      show data

      2023-10-31 11:59:07
      -

      *Thread Reply:* @John Lukenoff, would you mind posting the link to Marquez teams slack channel?

      +

      *Thread Reply:* @John Lukenoff, would you mind posting the link to Marquez teams slack channel?

      @@ -148436,7 +148617,7 @@

      show data

      2023-10-31 12:15:37
      -

      *Thread Reply:* yep here is the link: https://marquezproject.slack.com/archives/C01E8MQGJP7/p1698702140709439

      +

      *Thread Reply:* yep here is the link: https://marquezproject.slack.com/archives/C01E8MQGJP7/p1698702140709439

      This is the full event, sanitized of internal info: { @@ -148514,7 +148695,7 @@

      show data

      2023-10-31 12:43:07
      -

      *Thread Reply:* thank you!

      +

      *Thread Reply:* thank you!

      @@ -148540,7 +148721,7 @@

      show data

      2023-10-31 13:14:29
      -

      *Thread Reply:* @John Lukenoff, sorry to trouble again, is the slack channel still active? for whatever reason i cant get to this workspace

      +

      *Thread Reply:* @John Lukenoff, sorry to trouble again, is the slack channel still active? for whatever reason i cant get to this workspace

      @@ -148566,7 +148747,7 @@

      show data

      2023-10-31 13:15:26
      -

      *Thread Reply:* yep it’s still active, maybe you need to join the workspace first? https://join.slack.com/t/marquezproject/shared_invite/zt-266fdhg9g-TE7e0p~EHK50GJMMqNH4tg

      +

      *Thread Reply:* yep it’s still active, maybe you need to join the workspace first? https://join.slack.com/t/marquezproject/shared_invite/zt-266fdhg9g-TE7e0p~EHK50GJMMqNH4tg

      @@ -148592,7 +148773,7 @@

      show data

      2023-10-31 13:25:51
      -

      *Thread Reply:* that was a good call. the link you just shared worked! thank you!

      +

      *Thread Reply:* that was a good call. the link you just shared worked! thank you!

      @@ -148618,7 +148799,7 @@

      show data

      2023-10-31 13:27:55
      -

      *Thread Reply:* yeah from OL perspective this looks good - the inputs and outputs are there, the extraction error facet looks like it should

      +

      *Thread Reply:* yeah from OL perspective this looks good - the inputs and outputs are there, the extraction error facet looks like it should

      @@ -148644,7 +148825,7 @@

      show data

      2023-10-31 13:28:05
      -

      *Thread Reply:* must be some Marquez hiccup 🙂

      +

      *Thread Reply:* must be some Marquez hiccup 🙂

      @@ -148674,7 +148855,7 @@

      show data

      2023-10-31 13:28:45
      -

      *Thread Reply:* Makes sense, I’ll tail my marquez logs today to see if I can find anything

      +

      *Thread Reply:* Makes sense, I’ll tail my marquez logs today to see if I can find anything

      @@ -148700,7 +148881,7 @@

      show data

      2023-11-01 19:37:06
      -

      *Thread Reply:* Somehow this started working after we switched from our beta to prod infrastructure. I suspect something was failing due to constraints on the size of our db and the load of poor quality data it was under after months of testing against it

      +

      *Thread Reply:* Somehow this started working after we switched from our beta to prod infrastructure. I suspect something was failing due to constraints on the size of our db and the load of poor quality data it was under after months of testing against it

      @@ -148772,7 +148953,7 @@

      show data

      2023-11-02 05:11:58
      -

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated within 2 business days.

      +

      *Thread Reply:* Thanks, all. The release is authorized and will be initiated within 2 business days.

      @@ -148825,7 +149006,7 @@

      show data

      - 👍 Mars Lan, harsh loomba + 👍 Mars Lan (Metaphor), harsh loomba
      @@ -148880,7 +149061,7 @@

      show data

      2023-11-02 03:34:15
      -

      *Thread Reply:* Hi John, we're always happy to help with the contribution.

      +

      *Thread Reply:* Hi John, we're always happy to help with the contribution.

      One of the possible solutions to this would be to do that just in openlineage-java client: • introduce config entry like normalizeDatasetNameToAscii : enabled/disabled @@ -148913,7 +149094,7 @@

      show data

      2023-11-02 03:34:47
      -

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/main/client/java/src/main/java/io/openlineage/client/utils/DatasetIdentifier.java

      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/blob/main/client/java/src/main/java/io/openlineage/client/utils/DatasetIdentifier.java

      @@ -149037,7 +149218,7 @@

      show data

      2023-11-02 02:41:50
      -

      *Thread Reply:* there’s actually an issue for that: +

      *Thread Reply:* there’s actually an issue for that: https://github.com/OpenLineage/OpenLineage/issues/2189

      but the way to do this is imho to create new custom transport (it might inherit from HTTP transport) and register it in transport factory

      @@ -149066,7 +149247,7 @@

      show data

      2023-11-02 13:05:05
      -

      *Thread Reply:* I am thinking of just modifying the HTTP transport and using requests.auth.AuthBase to create different auth methods instead of a TokenProvider class

      +

      *Thread Reply:* I am thinking of just modifying the HTTP transport and using requests.auth.AuthBase to create different auth methods instead of a TokenProvider class

      Classes which subclass requests.auth.AuthBase can also just directly be given to the requests call in the auth parameter
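A minimal sketch of that idea, assuming plain requests usage (the class name and bearer-token scheme here are hypothetical, not part of the OpenLineage client):

```
import requests

class BearerAuth(requests.auth.AuthBase):
    """Hypothetical auth helper: attaches a bearer token to every outgoing request."""

    def __init__(self, token: str):
        self.token = token

    def __call__(self, r):
        # requests invokes this hook on the prepared request just before sending
        r.headers["Authorization"] = f"Bearer {self.token}"
        return r

# Any requests call accepts such an object directly via the auth parameter:
# requests.post(url, json=payload, auth=BearerAuth("my-token"))
```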

      @@ -149098,7 +149279,7 @@

      show data

      2023-11-02 14:40:24
      -

      *Thread Reply:* would you like to contribute? 🙂

      +

      *Thread Reply:* would you like to contribute? 🙂

      @@ -149124,7 +149305,7 @@

      show data

      2023-11-02 14:43:05
      -

*Thread Reply:* I was about to contribute, but I actually just realized that there is an existing way to provide a custom transport that would solve for my use case. My only question is how do I register this custom transport in my MWAA environment? Can I provide the custom transport as an Airflow plugin and then specify the class in the Openlineage.yml config? Will it automatically pick it up?

+

*Thread Reply:* I was about to contribute, but I actually just realized that there is an existing way to provide a custom transport that would solve for my use case. My only question is how do I register this custom transport in my MWAA environment? Can I provide the custom transport as an Airflow plugin and then specify the class in the Openlineage.yml config? Will it automatically pick it up?

      @@ -149150,7 +149331,7 @@

      show data

      2023-11-02 15:45:56
      -

      *Thread Reply:* although I did not test this in MWAA but locally only: I’ve created Airflow plugin that in __init__.py has defined (or imported) following code: +

      *Thread Reply:* although I did not test this in MWAA but locally only: I’ve created Airflow plugin that in __init__.py has defined (or imported) following code: ```from openlineage.client.transport import register_transport, Transport, Config

      @register_transport @@ -149191,7 +149372,7 @@

      show data

      2023-11-02 15:47:45
      -

      *Thread Reply:* in setup.py it’s: +

      *Thread Reply:* in setup.py it’s: ..., entry_points={ 'airflow.plugins': [ @@ -149225,7 +149406,7 @@
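For context, a fuller sketch of what such a plugin module might contain, following the imports and decorator shown above (the transport kind, config handling, and emit body are illustrative; exact attribute names can differ between client versions):

```
from openlineage.client.serde import Serde
from openlineage.client.transport import register_transport, Transport, Config

@register_transport
class FileTransport(Transport):
    kind = "file"      # referenced as `type` in the openlineage.yml transport config
    config = Config    # a Config subclass could add custom fields

    def __init__(self, config: Config) -> None:
        self.config = config

    def emit(self, event) -> None:
        # Illustrative handling: append each serialized event to a local file
        with open("/tmp/lineage_events.ndjson", "a") as f:
            f.write(Serde.to_json(event) + "\n")
```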

      show data

      2023-11-03 12:52:55
      -

      *Thread Reply:* ok great thanks for following up on this, super helpful

      +

      *Thread Reply:* ok great thanks for following up on this, super helpful

      @@ -149269,7 +149450,7 @@

      show data

      - 👍 Jason Yip, Sophie LY, Tristan GUEZENNEC -CROIX-, Mars Lan, Sangeeta Mishra + 👍 Jason Yip, Sophie LY, Tristan GUEZENNEC -CROIX-, Mars Lan (Metaphor), Sangeeta Mishra
      @@ -149301,7 +149482,7 @@

      show data

@Paweł Leszczyński I tested 1.5.0, it works great now, but the environment facet is gone in START... which I very much want.. any thoughts?

      - + @@ -149363,7 +149544,7 @@

      show data

      2023-11-04 15:44:22
      -

*Thread Reply:* @Paweł Leszczyński looks like I need to bring bad news.. 13.3 is fixed for specific scenarios, but 11.3 is still reading output as dbfs.. there are scenarios where it's not producing input and output, like:

+

*Thread Reply:* @Paweł Leszczyński looks like I need to bring bad news.. 13.3 is fixed for specific scenarios, but 11.3 is still reading output as dbfs.. there are scenarios where it's not producing input and output, like:

      create table table using delta as location 'abfss://....' @@ -149393,7 +149574,7 @@

      show data

      2023-11-04 15:44:31
      -

*Thread Reply:* Will test more and open issues

+

*Thread Reply:* Will test more and open issues

      @@ -149419,7 +149600,7 @@

      show data

      2023-11-06 05:34:33
      -

*Thread Reply:* @Jason Yip how did you manage to get the environment attribute? It's not showing up for me at all. I've tried Databricks but also tried a local instance of Spark.

+

*Thread Reply:* @Jason Yip how did you manage to get the environment attribute? It's not showing up for me at all. I've tried Databricks but also tried a local instance of Spark.

      @@ -149445,7 +149626,7 @@

      show data

      2023-11-07 18:32:02
      -

*Thread Reply:* @Rodrigo Maia it's showing up in one of the RUNNING events, not in the START event anymore

+

*Thread Reply:* @Rodrigo Maia it's showing up in one of the RUNNING events, not in the START event anymore

      @@ -149471,7 +149652,7 @@

      show data

      2023-11-08 03:04:32
      -

      *Thread Reply:* I never had a running event 🫠 Am I filtering something?

      +

      *Thread Reply:* I never had a running event 🫠 Am I filtering something?

      @@ -149497,7 +149678,7 @@

      show data

      2023-11-08 13:03:26
      -

      *Thread Reply:* Umm.. ok show me your code, will try on my end

      +

      *Thread Reply:* Umm.. ok show me your code, will try on my end

      @@ -149523,7 +149704,7 @@

      show data

      2023-11-08 14:26:06
      -

      *Thread Reply:* @Paweł Leszczyński @Rodrigo Maia actually if you are using UC-enabled cluster, you won't get any RUNNING events

      +

      *Thread Reply:* @Paweł Leszczyński @Rodrigo Maia actually if you are using UC-enabled cluster, you won't get any RUNNING events

      @@ -149635,7 +149816,7 @@

      show data

      2023-11-04 07:11:46
      -

      *Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1698315220142929 +

      *Thread Reply:* https://openlineage.slack.com/archives/C01CK9T7HKR/p1698315220142929 Do you need some more guidance than that?

      @@ -149697,7 +149878,7 @@

      show data

      2023-11-04 07:13:47
      -

      *Thread Reply:* yes

      +

      *Thread Reply:* yes

      @@ -149723,7 +149904,7 @@

      show data

      2023-11-04 07:15:21
      -

      *Thread Reply:* It seems pretty extensively described, what kind of help do you need?

      +

      *Thread Reply:* It seems pretty extensively described, what kind of help do you need?

      @@ -149749,7 +149930,7 @@

      show data

      2023-11-04 07:16:13
      -

      *Thread Reply:* io.openlineage.spark.api.OpenLineageEventHandlerFactory if i use this how will i pass custom listener to my spark submit

      +

      *Thread Reply:* io.openlineage.spark.api.OpenLineageEventHandlerFactory if i use this how will i pass custom listener to my spark submit

      @@ -149775,7 +149956,7 @@

      show data

      2023-11-04 07:17:25
      -

*Thread Reply:* I would like to know how I can customize my events using this. For example: in the "input" facet I want only the symlinks name; I am not interested in anything else

+

*Thread Reply:* I would like to know how I can customize my events using this. For example: in the "input" facet I want only the symlinks name; I am not interested in anything else

      @@ -149801,7 +149982,7 @@

      show data

      2023-11-04 07:17:32
      -

      *Thread Reply:* can you please provide some guidance

      +

      *Thread Reply:* can you please provide some guidance

      @@ -149827,7 +150008,7 @@

      show data

      2023-11-04 07:18:36
      -

      *Thread Reply:* @Jakub Dardziński this is the doubt i have

      +

      *Thread Reply:* @Jakub Dardziński this is the doubt i have

      @@ -149853,7 +150034,7 @@

      show data

      2023-11-04 08:17:25
      -

      *Thread Reply:* Some one who did spark integration throw some light

      +

      *Thread Reply:* Some one who did spark integration throw some light

      @@ -149879,7 +150060,7 @@

      show data

      2023-11-04 08:21:22
      -

      *Thread Reply:* it's weekend for most of us so you probably need to wait until Monday for precise answers

      +

      *Thread Reply:* it's weekend for most of us so you probably need to wait until Monday for precise answers

      @@ -149984,7 +150165,7 @@

      show data

      2023-11-08 10:42:35
      -

      *Thread Reply:* Thanks for merging this @Maciej Obuchowski!

      +

      *Thread Reply:* Thanks for merging this @Maciej Obuchowski!

      @@ -150072,7 +150253,7 @@

      show data

      2023-11-07 05:56:09
      -

      *Thread Reply:* similarly to spark config, you can use flink config

      +

      *Thread Reply:* similarly to spark config, you can use flink config

      @@ -150098,7 +150279,7 @@

      show data

      2023-11-07 22:36:53
      -

      *Thread Reply:* @Maciej Obuchowski - Got it. Our use-case is that we're trying to build a wrapper on top of openlineage-flink for productionising for our flink jobs.

      +

      *Thread Reply:* @Maciej Obuchowski - Got it. Our use-case is that we're trying to build a wrapper on top of openlineage-flink for productionising for our flink jobs.

      We're trying to have a wrapper class that extends OpenLineageFlinkJobListener class, and overwrites the HTTP transport endpoint/url to a constant value (say, example.com and /api/v1/flink). But we see that the OpenLineageFlinkJobListener constructor is defined as a private constructor - just wanted to check with the team whether it was just a default scope, or intended to be private. If it was just a default scope, can we contribute a PR to make it public, to make it friendly for teams trying to adopt & extend openlineage?

      @@ -150128,7 +150309,7 @@

      show data

      2023-11-08 05:55:43
      -

      *Thread Reply:* We parse flink conf to get that information: https://github.com/OpenLineage/OpenLineage/blob/26494b596e9669d2ada164066a73c44e04[…]ink/src/main/java/io/openlineage/flink/client/EventEmitter.java

      +

      *Thread Reply:* We parse flink conf to get that information: https://github.com/OpenLineage/OpenLineage/blob/26494b596e9669d2ada164066a73c44e04[…]ink/src/main/java/io/openlineage/flink/client/EventEmitter.java

> But we see that the OpenLineageFlinkJobListener constructor is defined as a private constructor - just wanted to check with the team whether it was just a default scope, or intended to be private. The way to construct it is via a public builder in the same class

      @@ -150184,7 +150365,7 @@

      show data

      2023-11-09 12:41:02
      -

      *Thread Reply:* > I think easier way than wrapper class would be use existing flink configuration, or to set up OPENLINEAGE_URL env variable, or have openlineage.yml config file - not sure why this is the way you've chosen? +

      *Thread Reply:* > I think easier way than wrapper class would be use existing flink configuration, or to set up OPENLINEAGE_URL env variable, or have openlineage.yml config file - not sure why this is the way you've chosen? @Maciej Obuchowski - The reasoning behind going with a wrapper class is that we can abstract out the nitty-gritty like how/where we're publishing openlineage events etc - especially for companies that have a lot of teams that may be adopting openlineage.

      For example, if we wanna move away from http transport to kafka transport - we'd be changing only this wrapper class and ask folks to update their wrapper class dependency version. If we went without the wrapper class, then the exact config changes would need to be synced and done by many different teams, who may not have enough context.

      @@ -150217,7 +150398,7 @@

      show data

      2023-11-09 13:03:56
      -

      *Thread Reply:* @Athitya Kumar that makes sense. Feel free to provide PR adding getters and stuff.

      +

      *Thread Reply:* @Athitya Kumar that makes sense. Feel free to provide PR adding getters and stuff.

      @@ -150234,6 +150415,78 @@

      show data

      +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-11-22 10:47:32
      +
      +

      *Thread Reply:* Created an issue for the same: https://github.com/OpenLineage/OpenLineage/issues/2273

      + +

      @Maciej Obuchowski - Can you assign this issue to me, if the proposal looks good?

      +
      + + + + + + + +
      +
      Labels
      + proposal +
      + + + + + + + + + + +
      + + + +
      + 👍 Maciej Obuchowski +
      + +
      +
      +
      +
      + + + + +
      @@ -150282,7 +150535,7 @@

      show data

      2023-11-07 06:07:51
      -

      *Thread Reply:* hey, a name of the output dataset should be put at the end of the job name. This was introduced to help with jobs that call multiple spark actions

      +

      *Thread Reply:* hey, a name of the output dataset should be put at the end of the job name. This was introduced to help with jobs that call multiple spark actions

      @@ -150308,7 +150561,7 @@

      show data

      2023-11-07 07:05:52
      -

      *Thread Reply:* Hi Paweł, +

      *Thread Reply:* Hi Paweł, Thanks for your answer, yes indeed with the newer version of OL, we automatically have the name of the output dataset at the end of the job name, but no App run id, nor any parent run facet.

      @@ -150335,7 +150588,7 @@

      show data

      2023-11-07 08:16:44
      -

*Thread Reply:* yes, you're right. I mean you can set in config spark.openlineage.parentJobName, which will be shared through the whole app run, but this needs to be set manually

+

*Thread Reply:* yes, you're right. I mean you can set in config spark.openlineage.parentJobName, which will be shared through the whole app run, but this needs to be set manually
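A minimal sketch of that config on the session builder (the app and job names here are hypothetical):

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('my_app')  # hypothetical
         .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
         # set manually, as noted above; shared across the whole application run
         .config('spark.openlineage.parentJobName', 'my_parent_job')
         .getOrCreate())
```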

      @@ -150361,7 +150614,7 @@

      show data

      2023-11-07 08:36:58
      -

      *Thread Reply:* I see, thanks a lot for your reply we'll try that

      +

      *Thread Reply:* I see, thanks a lot for your reply we'll try that

      @@ -150413,7 +150666,7 @@

      show data

      2023-11-08 10:49:04
      -

      *Thread Reply:* Sounds like it, yes - if the logical dataset names are different but physical one is the same

      +

      *Thread Reply:* Sounds like it, yes - if the logical dataset names are different but physical one is the same

      @@ -150465,7 +150718,7 @@

      show data

      2023-11-08 13:01:16
      -

      *Thread Reply:* No but it should work the same I tried on AWS and Google Colab and Azure

      +

      *Thread Reply:* No but it should work the same I tried on AWS and Google Colab and Azure

      @@ -150495,7 +150748,7 @@

      show data

      2023-11-09 03:10:54
      -

      *Thread Reply:* Yes. @Abdallah could provide some details if needed.

      +

      *Thread Reply:* Yes. @Abdallah could provide some details if needed.

      @@ -150529,7 +150782,7 @@

      show data

      2023-11-20 11:29:26
      -

      *Thread Reply:* Thanks @Tristan GUEZENNEC -CROIX- +

      *Thread Reply:* Thanks @Tristan GUEZENNEC -CROIX- HI @Abdallah i was able to set up a spark cluster on AWS EMR but im struggling to configure the OL Listener. Ive tried with steps and bootstrap actions for the jar and it didn't work out. How did you manage to include the jar? Besides, what about the spark configuration? Could you send me a sample of these configs?

      @@ -150543,6 +150796,133 @@

      show data

      +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2023-11-28 03:52:44
      +
      +

*Thread Reply:* HI @Abdallah. Ive sent you a message with some more information. If you could provide some more details that would be awesome. Thank you 😄

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Abdallah + (abdallah@terrab.me) +
      +
      2023-11-28 04:14:14
      +
      +

*Thread Reply:* Hi, sorry for the late reply. I didn't see your message in time.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Abdallah + (abdallah@terrab.me) +
      +
      2023-11-28 04:16:02
      +
      +

      *Thread Reply:* If you want to test ol quickly without bootstrap action, you can use the following submit.json

      + +

[
  {
    "Name": "MyJob",
    "Jar" : "command-runner.jar",
    "Args": [
      "spark-submit",
      "--deploy-mode", "cluster",
      "--packages","io.openlineage:openlineage-spark:1.2.2",
      "--conf","spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener",
      "--conf","spark.openlineage.transport.type=http",
      "--conf","spark.openlineage.transport.url=<OPENLINEAGE-URL>",
      "--conf","spark.openlineage.version=v1",
      "path/to/your/job.py"
    ],
    "ActionOnFailure": "CONTINUE",
    "Type": "CUSTOM_JAR"
  }
]
with

      + +

aws emr add-steps --cluster-id <cluster-id> --steps file://<path-to-your-json>/submit.json

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Abdallah + (abdallah@terrab.me) +
      +
      2023-11-28 04:19:20
      +
      +

      *Thread Reply:* "--packages","io.openlineage:openlineage_spark:1.2.2" +Need to be mentioned before the creation of the spark session

      + + + +
      +
      +
      +
      + + + + +
      @@ -150822,7 +151202,7 @@

      show data

      2023-11-10 15:32:28
      -

      *Thread Reply:* Here's for more reference: https://dilorom.medium.com/finding-the-path-to-a-table-in-databricks-2c74c6009dbb

      +

      *Thread Reply:* Here's for more reference: https://dilorom.medium.com/finding-the-path-to-a-table-in-databricks-2c74c6009dbb

      Medium
      @@ -150964,7 +151344,7 @@

      show data

      @Paweł Leszczyński I went back to 1.4.1, output does show adls location. But environment facet is gone in 1.4.1. It shows up in 1.5.0 but namespace is back to dbfs....

      - + @@ -151022,7 +151402,7 @@

      show data

      2023-11-13 04:52:07
      -

      *Thread Reply:* Thanks @Jason Yip for your engagement in finding the cause and solution to this issue.

      +

      *Thread Reply:* Thanks @Jason Yip for your engagement in finding the cause and solution to this issue.

      Among the technical problems, another problem here is that our databricks integration tests are run on AWS and the issue you describe occurs in Azure. I would consider this a primary issue as it is difficult for me to verify the behaviour you describe and fix it with a failing integration test at the start.

      @@ -151052,7 +151432,451 @@

      show data

      2023-11-13 18:06:44
      -

      *Thread Reply:* I didn't know Azure and AWS Databricks are different. Let me try it on AWS as well

      +

      *Thread Reply:* I didn't know Azure and AWS Databricks are different. Let me try it on AWS as well

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-23 04:51:38
      +
      +

      *Thread Reply:* @Paweł Leszczyński finally got a chance to run but its a different script, its pretty interesting

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-23 04:51:59
      +
      +

      *Thread Reply:* "inputs": [ + { + "namespace": "<wasbs://publicwasb@mmlspark.blob.core.windows.net>", + "name": "AdultCensusIncome.parquet" + } + ], + "outputs": [ + { + "namespace": "<wasbs://publicwasb@mmlspark.blob.core.windows.net>", + "name": "AdultCensusIncome.parquet" + }

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-23 04:52:19
      +
      +

      *Thread Reply:* df.write.format("delta").mode("overwrite").option("path", "").saveAsTable("test.AdultCensusIncome")

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-23 04:52:41
      +
      +

      *Thread Reply:* it somehow got the input path as output path 😲

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-23 04:53:53
      +
      +

      *Thread Reply:* here's the full script:

      + +

df = spark.read.parquet("")

      + +

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "")

      + +

      df.write.format("delta").mode("overwrite").option("path", "").saveAsTable("test.AdultCensusIncome")

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-23 04:54:23
      +
      +

      *Thread Reply:* yep, just 3 lines, will try it on Azure as well

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-11-23 08:14:23
      +
      +

*Thread Reply:* just to clarify, were you able to reproduce this issue just on AWS Databricks using S3? Asking this again because this is how our integration test environment looks.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-23 18:28:36
      +
      +

*Thread Reply:* @Paweł Leszczyński the issue is different. The issue before on Azure was that it said dbfs even though it was wasbs. This time around the destination is s3, but it says wasbs... it somehow took the path from the inputs

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-23 18:28:52
      +
      +

      *Thread Reply:* I have not tested this new script on Azure yet

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-24 03:53:21
      +
      +

      *Thread Reply:* @Paweł Leszczyński tried on Azure, results is the same

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-24 03:53:24
      +
      +

      *Thread Reply:* df = spark.read.parquet("")

      + +

df.write\
    .format('delta')\
    .mode('overwrite')\
    .option('overwriteSchema', 'true')\
    .option('path', adlsRootPath + '/examples/data/parquet/AdultCensusIncome/silver/AdultCensusIncome')\
    .saveAsTable('AdultCensusIncome')

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-24 03:54:05
      +
      +

      *Thread Reply:* "outputs": [ + { + "namespace": "dbfs", + "name": "/user/hive/warehouse/adultcensusincome",

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-24 03:54:24
      +
      +

*Thread Reply:* Please note that 1.4.1 correctly identified the output as wasbs

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-24 04:14:39
      +
      +

      *Thread Reply:* so to summarize, on Azure it'd become dbfs

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-24 04:15:01
      +
      +

      *Thread Reply:* on AWS, it somehow becomes the same as input

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jason Yip + (jasonyip@gmail.com) +
      +
      2023-11-24 04:15:34
      +
      +

      *Thread Reply:* 1.4.1 Azure is fine, I have not tested 1.4.1 on AWS

      @@ -151105,7 +151929,7 @@

      show data

      2023-11-15 07:27:34
      2023-11-15 07:27:55
      -

      *Thread Reply:* thank you @Maciej Obuchowski

      +

      *Thread Reply:* thank you @Maciej Obuchowski

      @@ -151185,7 +152009,7 @@

      show data

      2023-11-16 11:46:16
      -

      *Thread Reply:* Hey @Naresh reddy can you help me understand what you mean by competitors? +

      *Thread Reply:* Hey @Naresh reddy can you help me understand what you mean by competitors? OL is a specification that can be used to solve various problems, so if you have a clear problem statement, maybe I can help with pros/cons for that problem

      @@ -151199,6 +152023,62 @@

      show data

      +
      +
      + + + + +
      + +
      Naresh reddy + (naresh.naresh36@gmail.com) +
      +
      2023-11-22 23:49:05
      +
      +

      *Thread Reply:* i wanted to integrate Airflow using OL but wanted to understand what are the pros and cons of OL, if you can shed light that would be great

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Harel Shein + (harel.shein@gmail.com) +
      +
      2023-11-27 19:11:54
      +
      +

      *Thread Reply:* Airflow supports OL natively via a provider since 2.7. But it’s hard for me to tell you pros/cons without understanding your use case

      + + + +
      + 👀 Naresh reddy +
      + +
      +
      +
      +
      + + + + +
      @@ -151238,7 +152118,7 @@

      show data

      2023-11-16 13:38:42
      -

      *Thread Reply:* Hi @Naresh reddy, thanks for your question. We’ve heard that OpenLineage is attractive because of its desirable integrations, including a best-in-class Spark integration, its extensibility, the fact that it’s not destructive, and the fact that it’s open source. I’m not aware of pain points per se, but there are certainly features and integrations that we wish we could focus on but can’t at the moment — like the Dagster integration, which needs a new maintainer. OpenLineage is like any other open standard in that ecosystem coverage is a constant process rather than a journey, and it requires contributions in order to get close to 100%. Thankfully, we are gaining users and contributors all the time, and integrations are being added or improved upon daily. See the Ecosystem page on the website for a list of consumers and producers and links to more resources, and check out the GitHub repo for the codebase, commit history, contributors, governance procedures, and more. We’re quick to respond to messages here and issues on GitHub — usually within one day.

      +

      *Thread Reply:* Hi @Naresh reddy, thanks for your question. We’ve heard that OpenLineage is attractive because of its desirable integrations, including a best-in-class Spark integration, its extensibility, the fact that it’s not destructive, and the fact that it’s open source. I’m not aware of pain points per se, but there are certainly features and integrations that we wish we could focus on but can’t at the moment — like the Dagster integration, which needs a new maintainer. OpenLineage is like any other open standard in that ecosystem coverage is a constant process rather than a journey, and it requires contributions in order to get close to 100%. Thankfully, we are gaining users and contributors all the time, and integrations are being added or improved upon daily. See the Ecosystem page on the website for a list of consumers and producers and links to more resources, and check out the GitHub repo for the codebase, commit history, contributors, governance procedures, and more. We’re quick to respond to messages here and issues on GitHub — usually within one day.

      @@ -151271,6 +152151,10 @@

      show data

      +
      + 🙌 Naresh reddy +
      +
      @@ -151319,7 +152203,7 @@

      show data

      2023-11-20 06:07:36
      -

      *Thread Reply:* Yes, it works with Airflow and Spark - there is caveat that amount of operators that support it on Airflow side is fairly small and limited generally to most popular SQL operators. +

      *Thread Reply:* Yes, it works with Airflow and Spark - there is caveat that amount of operators that support it on Airflow side is fairly small and limited generally to most popular SQL operators. > will it also allow to connect to Power BI and derive the downstream column lineage ? No, there is no such feature yet 🙂 However, there's nothing preventing this - if you wish to work on such implementation, we'd be happy to help.

      @@ -151348,7 +152232,7 @@

      show data

      2023-11-21 00:20:11
      -

*Thread Reply:* Thank you Maciej Obuchowski for the update. Currently we are looking for a tool which can support connecting to Power BI and pulling column-level lineage information for reports and dashboards. How can this be achieved with OL? Can you give some idea?

+

*Thread Reply:* Thank you Maciej Obuchowski for the update. Currently we are looking for a tool which can support connecting to Power BI and pulling column-level lineage information for reports and dashboards. How can this be achieved with OL? Can you give some idea?

      @@ -151374,7 +152258,7 @@

      show data

      2023-11-21 07:59:10
      -

      *Thread Reply:* I don't think I can help you with that now, unless you want to work on your own integration with PowerBI 🙁

      +

      *Thread Reply:* I don't think I can help you with that now, unless you want to work on your own integration with PowerBI 🙁

      @@ -151456,7 +152340,7 @@

      show data

      2023-11-21 07:37:00
      -

      *Thread Reply:* Can you provide details about those issues? Like exceptions, logs, details of the jobs and how do you run them?

      +

      *Thread Reply:* Can you provide details about those issues? Like exceptions, logs, details of the jobs and how do you run them?

      @@ -151482,7 +152366,7 @@

      show data

      2023-11-21 07:45:37
      -

*Thread Reply:* Hi @Maciej Obuchowski - I re-ran the docker container after deleting the metadata_db folder possibly created by another local test, which fixed that one, but got a problem with OpenLineageListener during initialization of spark: +

*Thread Reply:* Hi @Maciej Obuchowski - I re-ran the docker container after deleting the metadata_db folder possibly created by another local test, which fixed that one, but got a problem with OpenLineageListener during initialization of spark: while I execute: spark = (SparkSession.builder.master('local') .appName('Food Delivery') @@ -151562,7 +152446,7 @@

      show data

      2023-11-21 07:58:09
      -

      *Thread Reply:* 🤔 Jars are added during image building: https://github.com/getindata/openlineage-bbuzz2023-column-lineage/blob/main/Dockerfile#L12C1-L12C29

      +

      *Thread Reply:* 🤔 Jars are added during image building: https://github.com/getindata/openlineage-bbuzz2023-column-lineage/blob/main/Dockerfile#L12C1-L12C29

      @@ -151588,7 +152472,7 @@

      show data

      2023-11-21 07:58:28
      -

      *Thread Reply:* are you sure &lt;local-path&gt; is right?

      +

      *Thread Reply:* are you sure &lt;local-path&gt; is right?

      @@ -151614,7 +152498,7 @@

      show data

      2023-11-21 08:00:49
      -

*Thread Reply:* yes, it's the same as in the sample - wondering why it's not getting added: +

*Thread Reply:* yes, it's the same as in the sample - wondering why it's not getting added: ```from pyspark.sql import SparkSession

      spark = (SparkSession.builder.master('local') @@ -151653,7 +152537,7 @@

      show data

      2023-11-21 08:04:31
      -

      *Thread Reply:* can you make sure jars are in this directory? just by docker run --entrypoint /usr/local/bin/bash IMAGE_NAME "ls /home/jovyan/jars"

      +

      *Thread Reply:* can you make sure jars are in this directory? just by docker run --entrypoint /usr/local/bin/bash IMAGE_NAME "ls /home/jovyan/jars"

      @@ -151679,7 +152563,7 @@

      show data

      2023-11-21 08:06:27
      -

*Thread Reply:* another option to try is to replace spark.jars with spark.jars.packages io.openlineage:openlineage-spark:1.5.0,org.postgresql:postgresql:42.7.0

+

*Thread Reply:* another option to try is to replace spark.jars with spark.jars.packages io.openlineage:openlineage-spark:1.5.0,org.postgresql:postgresql:42.7.0
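As a sketch against the builder from this thread (assuming the container has internet access to resolve the packages from Maven Central):

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder.master('local')
         .appName('Food Delivery')
         # resolve the jars from Maven Central instead of pointing spark.jars at local files
         .config('spark.jars.packages',
                 'io.openlineage:openlineage-spark:1.5.0,org.postgresql:postgresql:42.7.0')
         .getOrCreate())
```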

      @@ -151705,7 +152589,7 @@

      show data

      2023-11-21 08:16:54
      -

*Thread Reply:* I think this was done for the purpose of the presentation, to make sure the demo would work without internet access. That can be the reason the jar was added manually to the docker image. openlineage-spark can be added to Spark via spark.jars.packages, like we do here https://openlineage.io/docs/integrations/spark/quickstart_local

+

*Thread Reply:* I think this was done for the purpose of the presentation, to make sure the demo would work without internet access. That can be the reason the jar was added manually to the docker image. openlineage-spark can be added to Spark via spark.jars.packages, like we do here https://openlineage.io/docs/integrations/spark/quickstart_local

      openlineage.io
      @@ -151752,7 +152636,7 @@

      show data

      2023-11-21 09:21:59
      -

*Thread Reply:* got it guys - thanks a lot for the help - it turns out that the spark contexts from notebooks 2 and 3 had some kind of metadata conflict - when I combined those 2 and recreated the image to clean up the old metadata it worked. +

*Thread Reply:* got it guys - thanks a lot for the help - it turns out that the spark contexts from notebooks 2 and 3 had some kind of metadata conflict - when I combined those 2 and recreated the image to clean up the old metadata it worked.
One more note is that sometimes kernels return weird results but it may be caused by some local nuances - anyways thx !

      @@ -151765,6 +152649,15905 @@

      show data

      + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2023-11-22 05:18:53
      +
      +

Hi Everyone +I created a custom operator in airflow to extract metadata of a file, like size, creation time and modification time, and I used it in my dag. It is running fine in airflow, saving the metadata info of that file to a csv file, but I want to see that metadata info, which we are saving in the csv file, shown in the marquez ui as an extra facet. How can we add that extra facet so it shows on the marquez ui? +Thank you

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-11-22 05:20:34
      +
      +

      *Thread Reply:* are you adding job or run facet?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2023-11-22 05:21:32
      +
      +

      *Thread Reply:* job facet

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-11-22 05:22:24
      +
      +

*Thread Reply:* that should be a run facet, right? +that's a dynamic value dependent on an individual run

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2023-11-22 05:24:42
      +
      +

      *Thread Reply:* yes yes sorry run facet

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-11-22 05:25:12
      +
      +

      *Thread Reply:* it should appear then in Marquez as additional facet

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2023-11-22 05:31:31
      +
      +

      *Thread Reply:* I can see these things already for custom operator but i want to add extra info under the root like the metadata of the file

      + +

      like ( file_name, size, modification time, creation time )

      + +
      + + + + + + + + + +
      + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-11-22 05:33:20
      +
      +

      *Thread Reply:* yeah, for that you need to provide additional run facets under these keys you mentioned

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2023-11-23 01:31:01
      +
      +

      *Thread Reply:* can u please tell precisely where we can add additional facet to be visible on ui ?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-11-23 04:16:53
      +
      +

      *Thread Reply:* custom extractors should return from extract methods TaskMetadata that takes run_facets as argument +https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/extractors/base.py#L28
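A minimal sketch of such an extractor, following the BaseExtractor linked above (the operator name, facet class, and field values are hypothetical):

```
import attr
from openlineage.airflow.extractors.base import BaseExtractor, TaskMetadata
from openlineage.client.facet import BaseFacet

@attr.s
class FileMetadataFacet(BaseFacet):    # hypothetical custom run facet
    file_name: str = attr.ib()
    size: int = attr.ib()

class FileMetadataExtractor(BaseExtractor):
    @classmethod
    def get_operator_classnames(cls):
        return ["FileMetadataOperator"]    # hypothetical custom operator

    def extract(self) -> TaskMetadata:
        # entries in run_facets appear under the run's facets in the Marquez UI
        return TaskMetadata(
            name=f"{self.operator.dag_id}.{self.operator.task_id}",
            run_facets={"fileMetadata": FileMetadataFacet(file_name="data.csv", size=1024)},
        )
```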

      +
      + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2023-11-30 06:34:55
      +
      +

*Thread Reply:* Hi @Jakub Dardziński finally today I am able to get the extra facet on the Marquez ui using a custom operator and custom extractor. +Thanks for the help. It is a really nice community.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-11-30 06:43:55
      +
      +

      *Thread Reply:* Hey, that’s great to hear! +Sorry I didn’t answer yesterday but you managed on your own 🙂

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2023-11-30 10:51:09
      +
      +

*Thread Reply:* Yes, no problem, I tried and explored more by myself. +Thanks 😊.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rafał Wójcik + (rwojcik@griddynamics.com) +
      +
      2023-11-22 05:32:40
      +
      +

Hi Guys, one more question - as we fixed the sample from the previous thread, I started playing with the example - when I execute in a notebook: +spark.sql('''SELECT * FROM public.restaurants r where r.name = "BBUZZ DONER"''').show() +I get all raw events in Marquez but all job.facets.sql fields are null - is there any way to capture the SQL query that we use in Spark? I know that we can pull this out from spark.logicalPlan but plain SQL would be much more convenient

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-11-22 06:53:37
      +
      +

*Thread Reply:* I was looking into it some time ago and was not able to extract SQL from the logical plan. It seemed to me that the SQL string is translated into a LogicalPlan before the OpenLineage code gets called, and I wasn't able to find the SQL anywhere

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-11-22 06:54:31
      +
      +

      *Thread Reply:* Optional&lt;String&gt; query = ScalaConversionUtils.asJavaOptional(relation.jdbcOptions().parameters().get(JDBCOptions$.MODULE$.JDBC_QUERY_STRING())); +🙂

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-11-22 06:54:54
      +
      +

      *Thread Reply:* ah, not only from JDBCRelation

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-11-22 08:03:52
      +
      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/issues/1250#issuecomment-1306865798

      +
      + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-11-22 13:57:29
      +
      +

@general +The next OpenLineage meetup is happening one week from today on November 29th in Warsaw/remote (it’s hybrid) at 5:30 pm CET (8:30 am PT). Sign-up and more details can be found here: https://www.meetup.com/warsaw-openlineage-meetup-group/events/296705558/. Hoping to see many of you there!

      +
      +
      Meetup
      + + + + + + + + + + + + + + + + + +
      + + + +
      + 🙌 Harel Shein, Maciej Obuchowski, Sangeeta Mishra +
      + +
      + :flag_pl: Harel Shein, Maciej Obuchowski, Paweł Leszczyński, Stefan Krawczyk +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Ryhan Sunny + (rsunny@altinity.com) +
      +
      2023-11-23 12:26:31
      +
      +

      Hi all, don’t miss out on the upcoming Open Source Analytics (virtual) Conference OSA Con 2023 - the place to be for cutting-edge insights into open-source analytics and AI! Learn and share development tips on the hottest open source projects, like Kafka, Airflow, Grafana, ClickHouse, and DuckDB. See who’s speaking and save your spot today at https://osacon.io/

      +
      +
      osacon.io
      + + + + + + + + + + + + + + + + + +
      + + + +
      + ❤️ Jarek Potiuk, Jakub Dardziński, Julien Le Dem, Willy Lulciuc +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2023-11-24 10:47:55
      +
      +

with Airflow, I have operators and define input_datasets and output_datasets. If Task B uses the output_dataset of Task A as the input_dataset, does it overwrite the metadata such as the documentation facet etc?

      + +

should I ensure that the input_dataset info is exactly the same between tasks, or do I move certain logic into the run_facet?

      + +

      for example, this input_facet is the output dataset from my previous task.

      + +

input_facet = {
    "dataSource": input_source,
    "schema": SchemaDatasetFacet(fields=fields),
    "deltaTable": DeltaTableFacet(
        path=self.source_model.dataset_uri,
        name=self.source_model.name,
        description=self.source_model.description,
        partition_columns=json.dumps(self.source_model.partition_columns or []),
        unique_constraint=json.dumps(self.source_model.unique_constraint or []),
        rows=self.source_delta_table.rows,
        file_count=self.source_delta_table.file_count,
        size=self.source_delta_table.size,
    ),
}
input_dataset = Dataset(
    namespace=input_namespace, name=input_name, facets=input_facet
)
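DeltaTableFacet above looks like a custom facet; a minimal sketch of how such a class might be declared with the Python client, attrs-style like the client's built-in facets (field list abbreviated):

```
import attr
from openlineage.client.facet import BaseFacet

@attr.s
class DeltaTableFacet(BaseFacet):
    # abbreviated: the snippet above also passes description, unique_constraint,
    # rows, file_count and size
    path: str = attr.ib()
    name: str = attr.ib()
    partition_columns: str = attr.ib()
```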

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-11-24 10:51:55
      +
      +

      *Thread Reply:* First question, how do you do that? Define get_openlineage_facets methods on your custom operators?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2023-11-24 10:59:09
      +
      +

      *Thread Reply:* yeah

      + +

      I have this method:

      + +

``` def get_openlineage_facets_on_complete(self, task_instance: Any) -> Any:
    """Returns the OpenLineage facets for the task instance"""
    from airflow.providers.openlineage.extractors import OperatorLineage

    _ = task_instance
    inputs = self._create_input_datasets()
    outputs = self._create_output_datasets()
    run_facet = self._create_run_facet()
    job_facet = self._create_job_facet()
    return OperatorLineage(
        inputs=inputs, outputs=outputs, run_facets=run_facet, job_facets=job_facet
    )```
      +
      + +

      and then I define each facet

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2023-11-24 11:00:33
      +
      +

      *Thread Reply:* the input and output change

      + +

``` def _create_job_facet(self) -> dict[str, Any]:
    """Creates the Job facet for the OpenLineage Job"""
    from openlineage.client.facet import (
        DocumentationJobFacet,
        OwnershipJobFacet,
        OwnershipJobFacetOwners,
        SourceCodeJobFacet,
    )

    return {
        "documentation": DocumentationJobFacet(
            description=f"""Filters data from {self.source_model.dataset_uri} using
            Polars and writes the data to the path:
            {self.destination_model.dataset_uri}.
            """
        ),
        "ownership": OwnershipJobFacet(
            owners=[OwnershipJobFacetOwners(name=self.owner)]
        ),
        "sourceCode": SourceCodeJobFacet(
            language="python", source=self.transform_model_name
        ),
    }

def _create_run_facet(self) -> dict[str, Any]:
    """Creates the Run facet for the OpenLineage Run"""
    return {}```
      +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2023-11-24 11:02:07
      +
      +

*Thread Reply:* but a lot of the time I am reading a dataset and filtering it or selecting a subset of columns and saving a new dataset. I just want to make sure my input_dataset remains consistent basically, since a lot of different airflow tasks might be using it

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2023-11-24 11:02:36
      +
      +

      *Thread Reply:* and these are all custom operators

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-24 11:03:29

      *Thread Reply:* Yeah, you should be okay sending those facets on output dataset only

ldacey (lance.dacey2@sutherlandglobal.com)
2023-11-24 11:07:19

*Thread Reply:* so just ignore the input_facet completely? or pass an empty list or something?

```
return OperatorLineage(
    inputs=None, outputs=outputs, run_facets=run_facet, job_facets=job_facet
)
```

ldacey (lance.dacey2@sutherlandglobal.com)
2023-11-24 11:09:53

      *Thread Reply:* cool I'll try that, makes things cleaner for sure

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-24 11:10:20

      *Thread Reply:* pass input datasets, just don't include redundant facets
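For illustration, a minimal sketch of that advice applied to the operator above; the helper shape and the exact facet keys are assumptions carried over from the earlier snippets, not a prescribed API:

```python
# Sketch only: keep emitting the input Dataset (so lineage edges still exist),
# but leave out facets, like the DeltaTableFacet above, that are already
# attached to the output dataset.
from openlineage.client.run import Dataset

def _create_input_datasets(self) -> list:
    facets = {
        "dataSource": input_source,                   # from the earlier snippet
        "schema": SchemaDatasetFacet(fields=fields),  # from the earlier snippet
        # no "deltaTable" facet here - it is sent on the output dataset only
    }
    return [Dataset(namespace=input_namespace, name=input_name, facets=facets)]
```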

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-24 11:10:39

      *Thread Reply:* if you won't pass datasets, you won't get lineage

ldacey (lance.dacey2@sutherlandglobal.com)
2023-11-24 11:26:12

      *Thread Reply:* got it, thanks

Haneefa tasneem (haneefa.tasneem.shaik@gmail.com)
2023-11-28 01:12:47

I am trying to run this Spark script, but my Spark context is stopping on its own. Without the OpenLineage configuration (listener), the script works fine. I need the configuration to integrate with OpenLineage.

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import col


def execute_spark_script(query_num, output_path):
    # Create a Spark session
    spark = (SparkSession.builder.master('local[*]').appName('openlineage_spark_test')
             # Install and set up the OpenLineage listener
             .config('spark.jars.packages', 'io.openlineage:openlineage_spark:0.3.+')
             .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
             .config('spark.openlineage.host', 'http://localhost:3000')
             .config('spark.openlineage.namespace', 'airflow')
             .config('spark.openlineage.transport.type', 'console')
             .getOrCreate()
             )

    # DataFrame 1
    data = [[295, "South Bend", "Indiana", "IN", 101190, 112.9]]
    columns = ["rank", "city", "state", "code", "population", "price"]
    df1 = spark.createDataFrame(data, schema="rank LONG, city STRING, state STRING, code STRING, population LONG, price DOUBLE")

    print(f"Count after DataFrame 1: {df1.count()} rows, {len(df1.columns)} columns")

    # Save DataFrame 1 to the desired location
    df1.write.mode("overwrite").csv(output_path + "df1")

    # DataFrame 2
    df2 = (spark.read
           .format("csv")
           .option("header", "true")
           .option("inferSchema", "true")
           .load("/home/haneefa/Downloads/export.csv")
           )

    # Save DataFrame 2 to the desired location
    df2.write.mode("overwrite").csv(output_path + "df2")

    # Returns a DataFrame that combines the rows of df1 and df2
    query_df = df1.union(df2)
    print(f"Count after combining DataFrame 1 and DataFrame 2: {query_df.count()} rows, {len(query_df.columns)} columns")

    # Save the combined DataFrame to the desired location
    query_df.write.mode("overwrite").csv(output_path + "query_df")

    # Query 1: Add a new column derived from existing columns
    query1_df = query_df.withColumn("population_price_ratio", col("population") / col("price"))
    print(f"Count after Query 1: {query1_df.count()} rows, {len(query1_df.columns)} columns")

    # Save Query 1 result to the desired location
    query1_df.write.mode("overwrite").csv(output_path + "query1_df")

    spark.stop()


if __name__ == "__main__":
    execute_spark_script(1, "/home/haneefa/airflow/dags/saved_files/")
```

Damien Hawes (damien.hawes@booking.com)
2023-11-28 06:19:40

      *Thread Reply:* Hello @Haneefa tasneem - can you confirm which version of Spark you're running?

Damien Hawes (damien.hawes@booking.com)
2023-11-28 06:20:44

*Thread Reply:* Additionally, I noticed this line:

```
.config('spark.jars.packages', 'io.openlineage:openlineage_spark:0.3.+')
```
Could you try changing the version to 1.5.0 instead of 0.3.+?
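For reference, the connector artifact on Maven Central is published with a hyphen (openlineage-spark), so a pinned coordinate would look roughly like:

```
.config('spark.jars.packages', 'io.openlineage:openlineage-spark:1.5.0')
```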

👍 Paweł Leszczyński

Haneefa tasneem (haneefa.tasneem.shaik@gmail.com)
2023-11-28 08:04:24

*Thread Reply:* Hello. My Spark version is 3.5.0.

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import urllib.request


def execute_spark_script(query_num, output_path):
    # Create a Spark session
    ol_jars = ['https://repo1.maven.org/maven2/io/openlineage/openlineage-spark/0.3.1/openlineage-spark-0.3.1.jar']
    files = [urllib.request.urlretrieve(url)[0] for url in ol_jars]
    spark = (SparkSession.builder.master('local[*]').appName('openlineage_spark_test')
             .config('spark.jars', ",".join(files))
             # Install and set up the OpenLineage listener
             .config('spark.jars.packages', 'io.openlineage:openlineage_spark:0.3.1')
             .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
             .config('spark.openlineage.host', 'http://localhost:5000')
             .config('spark.openlineage.namespace', 'airflow')
             .getOrCreate()
             )

    # DataFrame 1
    data = [[295, "South Bend", "Indiana", "IN", 101190, 112.9]]
    columns = ["rank", "city", "state", "code", "population", "price"]
    df1 = spark.createDataFrame(data, schema="rank LONG, city STRING, state STRING, code STRING, population LONG, price DOUBLE")

    print(f"Count after DataFrame 1: {df1.count()} rows, {len(df1.columns)} columns")

    # Save DataFrame 1 to the desired location
    df1.write.mode("overwrite").csv(output_path + "df1")

    # DataFrame 2
    df2 = (spark.read
           .format("csv")
           .option("header", "true")
           .option("inferSchema", "true")
           .load("/home/haneefa/Downloads/export.csv")
           )

    # Save DataFrame 2 to the desired location
    df2.write.mode("overwrite").csv(output_path + "df2")

    # Returns a DataFrame that combines the rows of df1 and df2
    query_df = df1.union(df2)
    print(f"Count after combining DataFrame 1 and DataFrame 2: {query_df.count()} rows, {len(query_df.columns)} columns")

    # Save the combined DataFrame to the desired location
    query_df.write.mode("overwrite").csv(output_path + "query_df")

    # Query 1: Add a new column derived from existing columns
    query1_df = query_df.withColumn("population_price_ratio", col("population") / col("price"))
    print(f"Count after Query 1: {query1_df.count()} rows, {len(query1_df.columns)} columns")

    # Save Query 1 result to the desired location
    query1_df.write.mode("overwrite").csv(output_path + "query1_df")

    spark.stop()


if __name__ == "__main__":
    execute_spark_script(1, "/home/haneefa/airflow/dags/saved_files/")
```

Haneefa tasneem (haneefa.tasneem.shaik@gmail.com)
2023-11-28 08:04:45

      *Thread Reply:* above is the modified code

Haneefa tasneem (haneefa.tasneem.shaik@gmail.com)
2023-11-28 08:05:20

      *Thread Reply:* It seems the issue is with the listener in spark config

Haneefa tasneem (haneefa.tasneem.shaik@gmail.com)
2023-11-28 08:05:55

*Thread Reply:* Please do let me know if I should make any changes.

Damien Hawes (damien.hawes@booking.com)
2023-11-28 08:20:46

*Thread Reply:* Yeah - I suspect it's because of the version of the connector that you're using.

You're using 0.3.1; please try it with 1.5.0.

🚀 Paweł Leszczyński

Haneefa tasneem (haneefa.tasneem.shaik@gmail.com)
2023-11-28 12:27:02

*Thread Reply:* Yes, it seems the issue was with the version. Thank you, it's resolved now.

Haneefa tasneem (haneefa.tasneem.shaik@gmail.com)
2023-11-29 03:29:09

*Thread Reply:* Hi. I was able to see some metadata information on Marquez. I wanted to know if there is a way I can see the lineage of the data as it's getting transformed - as in, when we are running a query here, can we see the lineage of the dataset? I tried modifying the code like this:

```
# Save Query 1 result to the desired location
query_df.write.mode("overwrite").csv(output_path + "query_df")

# Register the DataFrame as a temporary SQL table
query_df.write.mode("overwrite").saveAsTable("temp_table")

# Query 1: Add a new column derived from existing columns
query1_df = spark.sql("SELECT *, population / price as population_price_ratio FROM temp_table")
print(f"Count after Query 1: {query1_df.count()} rows, {len(query1_df.columns)} columns")

# Register the DataFrame as a temporary SQL table
query1_df.write.mode("overwrite").saveAsTable("temp_table2")

# Save Query 1 result to the desired location
query1_df.write.mode("overwrite").csv(output_path + "query1_df")
```
Haneefa tasneem (haneefa.tasneem.shaik@gmail.com)
2023-11-29 03:32:25

*Thread Reply:* But I'm getting an error. Is there a way I can see how we are deriving a new column from the previous dataset?

Michael Robinson (michael.robinson@astronomer.io)
2023-11-28 13:06:08

@channel
Friendly reminder: our first Warsaw meetup is happening tomorrow at 5:30 PM CET (8:30 AM PT) - and it's hybrid: https://openlineage.slack.com/archives/C01CK9T7HKR/p1700679449568039

slackbot
2023-11-29 06:30:57

      This message was deleted.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-11-29 07:03:57

*Thread Reply:* Hey Shahid, I think we already discussed it here: https://openlineage.slack.com/archives/C01CK9T7HKR/p1700648333528029?thread_ts=1700648333.528029&cid=C01CK9T7HKR

Haneefa tasneem (haneefa.tasneem.shaik@gmail.com)
2023-11-29 09:09:00

Hi. This code runs on my local Ubuntu (both in and out of the virtual environment; I have installed Airflow in a virtual environment on Ubuntu), but it's not getting executed on Airflow. I'm getting the following error:

```
airflow.exceptions.AirflowException: Cannot execute: spark-submit --master <spark://10.0.2.15:4041> --jars /home/haneefa/Downloads/openlineage-spark-1.5.0.jar --name SparkScript_query1 --deploy-mode client /home/haneefa/airflow/dags/custom_operators/sample_sql_spark.py 1. Error code is: 1.
[2023-11-29, 13:51:21 UTC] {taskinstance.py:1400} INFO - Marking task as FAILED. dag_id=spark_dagf, task_id=spark_submit_query1, execution_date=20231129T134548, start_date=20231129T134723, end_date=20231129T135121
```

The spark-submit is working fine on my Ubuntu.

```
from pyspark import SparkContext
from os.path import abspath
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import urllib.request


def execute_spark_script(query_num, output_path):
    # Create a Spark session
    # ol_jars = ['https://repo1.maven.org/maven2/io/openlineage/openlineage-spark/1.5.0/openlineage-spark-1.5.0.jar']
    warehouse_location = abspath('spark-warehouse')
    # files = [urllib.request.urlretrieve(url)[0] for url in ol_jars]
    spark = (SparkSession.builder.master('local[*]').appName('openlineage_spark_test')
             # .config('spark.jars', ",".join(files))
             # Install and set up the OpenLineage listener
             .config('spark.jars.packages', 'io.openlineage:openlineage_spark:1.5.0')
             .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
             .config('spark.openlineage.host', 'http://localhost:5000')
             .config('spark.openlineage.namespace', 'airflow')
             .config('spark.openlineage.transport.type', 'console')
             .config("spark.sql.warehouse.dir", warehouse_location)
             .getOrCreate()
             )

    spark.sparkContext.setLogLevel("INFO")

    # DataFrame 1
    data = [[295, "South Bend", "Indiana", "IN", 101190, 112.9]]
    columns = ["rank", "city", "state", "code", "population", "price"]
    df1 = spark.createDataFrame(data, schema="rank LONG, city STRING, state STRING, code STRING, population LONG, price DOUBLE")

    print(f"Count after DataFrame 1: {df1.count()} rows, {len(df1.columns)} columns")

    # Save DataFrame 1 to the desired location
    df1.write.mode("overwrite").csv(output_path + "df1")

    # DataFrame 2
    df2 = (spark.read
           .format("csv")
           .option("header", "true")
           .option("inferSchema", "true")
           .load("/home/haneefa/Downloads/export.csv")
           )

    # Returns a DataFrame that combines the rows of df1 and df2
    query_df = df1.union(df2)
    query_df.count()

    # Save DataFrame 2 to the desired location
    query_df.write.mode("overwrite").csv(output_path + "query_df")

    # Register the DataFrame as a temporary SQL table
    query_df.write.mode("overwrite").saveAsTable("temp_table")

    # Query 1: Add a new column derived from existing columns
    query1_df = spark.sql("SELECT *, population / price as population_price_ratio FROM temp_table")
    print(f"Count after Query 1: {query1_df.count()} rows, {len(query1_df.columns)} columns")

    # Register the DataFrame as a temporary SQL table
    query1_df.write.mode("overwrite").saveAsTable("temp_table2")

    # Save Query 1 result to the desired location
    query1_df.write.mode("overwrite").csv(output_path + "query1_df")

    spark.stop()


if __name__ == "__main__":
    execute_spark_script(1, "/home/haneefa/airflow/dags/saved_files/")
```

The spark-submit commands:

```
spark-submit --master <spark://10.0.2.15:4041> --name SparkScript_query1 --deploy-mode client /home/haneefa/airflow/dags/custom_operators/sample_sql_spark.py

./bin/spark-submit --class "SparkTest" --master local[*] --jars
```

This is my DAG:

```
from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

default_args = {
    'owner': 'admin',
    'start_date': datetime(2023, 1, 1),
}

dag = DAG(
    'spark_dagf',
    default_args=default_args,
    schedule_interval=None,
)

# Set up SparkSubmitOperator for each query
queries = ['query1']
previous_task = None

for query in queries:
    task_id = f'spark_submit_{query}'
    script_path = '/home/haneefa/airflow/dags/custom_operators/sample_sql_spark.py'
    query_num = queries.index(query) + 1

    spark_task = SparkSubmitOperator(
        task_id=task_id,
        application=script_path,
        name=f"SparkScript_{query}",
        conn_id='spark_2',
        jars='/home/haneefa/Downloads/openlineage-spark-1.5.0.jar',
        application_args=[str(query_num)],
        dag=dag,
    )

    if previous_task:
        previous_task >> spark_task

    previous_task = spark_task

if __name__ == "__main__":
    dag.cli()
```

Michael Robinson (michael.robinson@astronomer.io)
2023-11-29 10:36:45

@channel
Today's hybrid meetup with Google starts in about one hour! DM me for the link. https://openlineage.slack.com/archives/C01CK9T7HKR/p1701194768476699

David Lauzon (davidonlaptop@gmail.com)
2023-11-29 16:35:03

*Thread Reply:* Thanks for organizing this meetup and sharing it online. Very good quality of talks, btw!

Michael Robinson (michael.robinson@astronomer.io)
2023-11-29 16:37:26

      *Thread Reply:* Thanks for coming! @Jens Pfau did most of the heavy lifting

🙌 Paweł Leszczyński

Jens Pfau (jenspfau@google.com)
2023-11-30 04:52:17

*Thread Reply:* Thank you to everyone who turned up!
I noticed there was a question from @Sheeri Cabral (Collibra) that we didn't answer: "Can someone point me to information on the collector that sends to S3? I'm curious as to the implementation (e.g. does each API push result in one S3 file, so the bucket ends up having millions of files? Does it append to a file? Are there rules (configurable or otherwise) about rotating a file or directory? etc.)"

I didn't quite catch the context of that.
@Paweł Leszczyński did you?

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-30 06:46:27

*Thread Reply:* There was a question about querying lineage events in warehouses. Julien confirmed that even with hundreds of thousands or millions of events, this could still be accomplished within Marquez, as PostgreSQL, its backend, should handle this.

I was referring to the fluentd OpenLineage proxy, which lets users copy each event and send it to multiple backends. Fluentd has a list of out-of-the-box output plugins including BigQuery, S3, Redshift and others (https://www.fluentd.org/dataoutputs)
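As a sketch, the fan-out itself is plain fluentd configuration: a copy output section duplicates each matched OpenLineage event to several stores. The plugin parameters below are illustrative placeholders, not a tested proxy config:

```
<match openlineage.**>
  @type copy
  <store>
    @type s3                      # fluent-plugin-s3
    s3_bucket my-lineage-events   # placeholder bucket
    s3_region us-east-1
    path lineage/
  </store>
  <store>
    @type file                    # also keep a local copy
    path /var/log/openlineage/events
  </store>
</match>
```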

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-11-30 06:47:11

      *Thread Reply:* Some more info about fluentd proxy: https://github.com/OpenLineage/OpenLineage/tree/main/proxy/fluentd

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-12-05 12:35:44

      *Thread Reply:* This is extremely helpful, thanks @Paweł Leszczyński!!!

Stefan Krawczyk (stefan@dagworks.io)
2023-11-29 15:00:00

Hi all. I'm Stefan, and I'm after some directional advice:

Context:
• I drive an open source project called Hamilton. TL;DR: it's an opinionated way to write python code to express dataflows, e.g. great for feature engineering, data processing, doing LLM workflows, etc. in python.
• One of the features is that you get "lineage as code", e.g. you can get "column/dataframe level" lineage for pandas (or pyspark) code. So given a dataflow definition, and execution of a Hamilton dataflow, we can emit this code provenance for any artifacts generated - and this works wherever python runs.
Ask:
• As I understand it, OpenLineage was built more for artifact-to-artifact lineage, e.g. this table -> table -> ML model, etc. Question for people who use/consume lineage: would this extra level of granularity (e.g. the code that ran to create an artifact), which we can provide with Hamilton, be interesting to emit as part of an OpenLineage event (e.g. to see inside your python Airflow task, or Spark job)? I'm trying to determine how to prioritize an OpenLineage integration, and whether someone would pick up Hamilton because of it.
• If you would find this extra level of granularity useful, could I schedule a call with you so I can learn more about your use case please?
CC @Jakub Dardziński since I saw at the meetup that you deal with the Airflow python operator & OpenLineage.

Website: https://hamilton.dagworks.io/en/latest/
Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-30 09:57:11

*Thread Reply:* We try to include SourceCodeFacet when possible when emitting OpenLineage events. However, it's purely optional, as facets are, since we can't guarantee it will always be there - for example, it's not possible for us to get the actual SQL from spark-sql jobs.

✅ Sheeri Cabral (Collibra)

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-11-30 09:57:29

      *Thread Reply:* Not sure if that answers your question 🙂
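For readers following the thread, a minimal sketch of attaching that facet with the Python client - the job name, source string, and producer URI below are placeholders:

```python
import uuid
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.facet import SourceCodeJobFacet
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # e.g. a Marquez instance

# Attach the source code of the transformation to the job as a facet.
job = Job(
    namespace="examples",
    name="my_transform",
    facets={"sourceCode": SourceCodeJobFacet(language="python", source="def transform(df): ...")},
)
client.emit(
    RunEvent(
        eventType=RunState.COMPLETE,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=str(uuid.uuid4())),
        job=job,
        producer="https://example.com/hypothetical-producer",
    )
)
```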

Stefan Krawczyk (stefan@dagworks.io)
2023-11-30 19:15:24

*Thread Reply:* @Maciej Obuchowski kind of. That's tactical implementation advice and definitely useful.

My question is more around: is that kind of detail actually useful for someone, and what use case would it power or enable? E.g. if I integrate OpenLineage and emit that level of detail, would someone use it? If so, why?

Mandy Chessell (mandy.e.chessell@gmail.com)
2023-12-02 12:45:25

*Thread Reply:* @Stefan Krawczyk the type of use case I have seen of this style is a business user/auditor who is not interested in how many jobs or intermediate stores are used in the end-to-end pipeline. They want to understand the transformations that the data underwent. For this type of user we extract details such as the source code facet, or facets that describe a rule or decision, from the lineage graph and just display these transformations/decisions. If Hamilton was providing the content for these types of facets, would they be consumable by such a user? Or perhaps, I should say, "could" they be consumed by such a user if the pipeline developer was careful to use meaningful variable names and comments.

Stefan Krawczyk (stefan@dagworks.io)
2023-12-02 14:07:42

*Thread Reply:* Awesome, thanks @Mandy Chessell. That's useful context. What domain would this person be operating in? Yes, I would think Hamilton could be useful here. It'll help standardize how the code is structured, and it makes it easier to link an output with the code that created it.

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-12-04 15:01:26

      *Thread Reply:* @Stefan Krawczyk speaking as a vendor of lineage technology, our customers want to see the transformations - SQL if it’s SQL-based, or as much description as possible if it’s an ETL tool without direct SQL. (e.g. IBM DataStage might show a few stages like “group” and “aggregation” and show what field was used for grouping, etc).

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-12-04 15:03:50

*Thread Reply:* i have seen 5 general use cases for lineage:

1. Data Provenance - "where the data comes from". This is often used when explaining what data means - by showing where it came from. e.g. salary data - did it come from a Google form, or from an HR department?
The most visceral example I have is when a VP sees 2 similar reports with different numbers, e.g. Total Sales, but the reports have different numbers, and then the VP wants to know why, and how come we can't trust the data, etc.

With lineage it's easy to see that the data from report A had test data taken out and that's why the sales $$ is less, but also that's the accurate one.

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-12-04 15:05:18

      *Thread Reply:* 2. Impact Analysis - if data provenance is “what’s upstream”, impact analysis is “what’s downstream”. You want to change something and want to know what it might affect. Perhaps you want to decommission a table or a whole server. Perhaps you are expanding, e.g. you started off in US Dollars and now are adding Japanese Yen…you have a field called “Amount” and now you want to add another field called “currency” and update every report that uses “Amount”…..Impact analysis is for that use case.

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-12-04 15:05:41

      *Thread Reply:* 3. Compliance - you can show that your sensitive data stays in the sensitive places, because you can show where your data is flowing.

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-12-04 15:06:31

      *Thread Reply:* 4. Migration verification - Compare the lineage from legacy system A with the lineage from new system B. When they have parity, your migration is complete.

Sheeri Cabral (Collibra) (sheeri.cabral@collibra.com)
2023-12-04 15:07:03

      *Thread Reply:* 5. Onboarding - lineage diagrams can be used like architecture diagrams to easily acquaint a user with what data you have, how it flows, what reports it goes to, etc.

Stefan Krawczyk (stefan@dagworks.io)
2023-12-05 00:42:27

      *Thread Reply:* Thanks, that context is helpful @Sheeri Cabral (Collibra) !

❤️ Sheeri Cabral (Collibra), Jakub Dardziński

Juan Luis Cano Rodríguez (juan_luis_cano@mckinsey.com)
2023-12-11 05:08:31

      *Thread Reply:* loved this thread, thanks everyone!

Michael Robinson (michael.robinson@astronomer.io)
2023-12-01 17:03:32

@channel
The November issue of OpenLineage News is here! This issue covers the latest updates to the OpenLineage Airflow Provider, a recap of the meetup in Warsaw, recent releases, and much more.
To get the newsletter directly in your inbox each month, sign up here: apache.us14.list-manage.com
👍 Paweł Leszczyński, Jakub Dardziński

Michael Robinson (michael.robinson@astronomer.io)
2023-12-04 15:34:26

@channel
I'm opening a vote to release OpenLineage 1.6.0, including:
• a new JobTypeFacet containing additional job-related information to improve support for Flink and streaming in general
• an option for the Flink job listener to read from Flink conf
• in the dbt integration, a new command to send metadata of the last run without running the job
• bug fixes in the Spark and Flink integrations
• more.
Three +1s from committers will authorize. Thanks in advance.

➕ Jakub Dardziński, Harel Shein, Damien Hawes, Mandy Chessell

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-05 08:16:25

*Thread Reply:* Can we hold on for a day? I do have some doubts about this one: https://github.com/OpenLineage/OpenLineage/pull/2293
Michael Robinson (michael.robinson@astronomer.io)
2023-12-05 08:50:44

      *Thread Reply:* Our policy allows for 2 days, so there’s no problem as far as I’m concerned

Michael Robinson (michael.robinson@astronomer.io)
2023-12-05 11:32:29

      *Thread Reply:* Thanks, all, the release is authorized and will be initiated within 2 business days.

Paweł Leszczyński (pawel.leszczynski@getindata.com)
2023-12-05 11:33:35

*Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2297 - this should resolve the problem. We don't have to wait anymore. Thank you @Michael Robinson
Sathish Kumar J (sathish.jeganathan@walmart.com)
2023-12-06 21:51:29

      @Harel Shein thanks for the invite!

Harel Shein (harel.shein@gmail.com)
2023-12-06 21:52:39

      *Thread Reply:* Welcome :)

Zacay Daushin (zacayd@octopai.com)
2023-12-07 04:24:52

      Hi

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 04:27:51

      *Thread Reply:* Hey Zacay!

Zacay Daushin (zacayd@octopai.com)
2023-12-07 04:28:51

I'd like to know: does OpenLineage have support for column-level lineage?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 04:30:12

      *Thread Reply:* it’s currently supported in Spark integration and for SQL-based Airflow operators

Zacay Daushin (zacayd@octopai.com)
2023-12-07 04:30:48

      *Thread Reply:* and DBT?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 04:31:11

      *Thread Reply:* DBT has only table-level lineage at the moment

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 04:36:28

      *Thread Reply:* would you be interested in contributing/helping to add this feature? 🙂

Zacay Daushin (zacayd@octopai.com)
2023-12-07 04:38:12

*Thread Reply:* look at Gudu Soft

Zacay Daushin (zacayd@octopai.com)
2023-12-07 04:38:18

*Thread Reply:* it parses SQL

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 04:40:05

      *Thread Reply:* I’m sorry but I didn’t get it. What does the parser provide that would be helpful in terms of dbt?

Sumeet Gyanchandani (sumeet.gyanchandani@gmail.com)
2023-12-07 07:11:47

*Thread Reply:* @Jakub Dardziński I played around with column-level lineage in Spark recently, and also with listening to OL events and converting them to Apache Atlas entities to upload to Azure Purview. Works like a charm 🙂

Flink, not so much. Do you have column-level lineage in Flink yet, or is it on the roadmap for the future? Happy to contribute.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 07:12:29

      *Thread Reply:* @Maciej Obuchowski @Paweł Leszczyński know more about Flink 🙂

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 07:12:58

      *Thread Reply:* it's awesome you got it working with Purview!

🙏 Sumeet Gyanchandani

Maciej Obuchowski (maciej.obuchowski@getindata.com)
2023-12-07 07:19:58

*Thread Reply:* we're actively working with the Flink team to have a first-class integration for Flink - a lot of things, like column-level lineage, are unfortunately not currently available

🙌 Sumeet Gyanchandani

Sumeet Gyanchandani (sumeet.gyanchandani@gmail.com)
2023-12-07 09:09:33

      *Thread Reply:* @Jakub Dardziński and @Maciej Obuchowski thank you for the prompt responses. I really appreciate it!

🙂 Jakub Dardziński

Simran Suri (mailsimransuri@gmail.com)
2023-12-07 05:21:13

Hi everyone, could someone help me with integrating OpenLineage with dbt? I'm particularly interested in sending events to a Kafka topic rather than using HTTP. Any guidance on this would be greatly appreciated.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 05:28:04

*Thread Reply:* Hey, it's essentially described here: https://openlineage.io/docs/client/python
👍 Simran Suri

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 05:28:40

      *Thread Reply:* in your case best would be to: +Set an environment variable to a file path: OPENLINEAGE_CONFIG=path/to/openlineage.yml.

👍 Simran Suri

Simran Suri (mailsimransuri@gmail.com)
2023-12-07 05:29:43

*Thread Reply:* that will work with dbt, right? Would only setting up the environment variable be enough? Also, to get OL events I need to do dbt-ol run, correct?

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 05:30:23

      *Thread Reply:* setting env var + putting config file to pointed path

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 05:30:31

      *Thread Reply:* and yes, you need to run it with dbt-ol wrapper script
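Putting the thread together, a sketch of the Kafka setup - the topic and broker below are placeholders, and the keys follow the Python client's Kafka transport, so double-check them against your installed client version:

```
# openlineage.yml - the file pointed to by OPENLINEAGE_CONFIG
transport:
  type: kafka
  topic: openlineage_events            # placeholder topic
  config:
    bootstrap.servers: localhost:9092  # placeholder broker
  flush: true
```

and then run dbt through the wrapper, e.g. `OPENLINEAGE_CONFIG=path/to/openlineage.yml dbt-ol run`.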

Simran Suri (mailsimransuri@gmail.com)
2023-12-07 05:31:50

      *Thread Reply:* Great, that's very helpful. I'll try the same and will definitely ask another question if I encounter any issues while trying this out.

👍 Jakub Dardziński

Damien Hawes (damien.hawes@booking.com)
2023-12-07 07:15:55

      *Thread Reply:* @Jakub Dardziński - regarding the Python client, does it have a functionality similar to the Java client? For example, the Java client allows you to use the service provider interface to implement a custom ~client~ transport.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 07:17:37

      *Thread Reply:* do you mean custom transport?

Damien Hawes (damien.hawes@booking.com)
2023-12-07 07:17:47

      *Thread Reply:* Correct.

Jakub Dardziński (jakub.dardzinski@getindata.com)
2023-12-07 07:21:23

      *Thread Reply:* it sure does, there's mention of it in the link above
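For reference, a custom transport in the Python client is roughly shaped like the sketch below. The base-class hook names have shifted between client versions, so treat this as an assumed outline and verify against the client docs for your version:

```python
# Assumed API shape for an openlineage-python custom transport.
from openlineage.client.transport import Config, Transport


class PrintConfig(Config):
    @classmethod
    def from_dict(cls, params: dict):
        # Parse any transport-specific settings from the yaml config here.
        return cls()


class PrintTransport(Transport):
    kind = "print"        # assumed registration name
    config = PrintConfig  # assumed config class hook

    def __init__(self, config: PrintConfig) -> None:
        self.transport_config = config

    def emit(self, event) -> None:
        # Replace with any destination you like.
        print(event)
```

The transport would then be referenced from openlineage.yml by setting transport.type to the registered kind or to the full import path of the class, per the client docs.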

Rodrigo Maia (rodrigo.maia@manta.io)
2023-12-07 12:26:46

Hi All, I'm still struggling to get lineage from Databricks Unity Catalog.
Is anyone here extracting lineage from Databricks/UC successfully?

Rodrigo Maia (rodrigo.maia@manta.io)
2023-12-07 12:27:09

*Thread Reply:* OL 1.5 is showing this error:

```
23/12/07 17:20:15 INFO PlanUtils: apply method failed with
org.apache.spark.SparkException: There is no Credential Scope. Current env: Driver
  at com.databricks.unity.UCSDriver$Manager.$anonfun$currentScopeId$3(UCSDriver.scala:131)
  at scala.Option.getOrElse(Option.scala:189)
  at com.databricks.unity.UCSDriver$Manager.currentScopeId(UCSDriver.scala:131)
  at com.databricks.unity.UCSDriver$Manager.currentScope(UCSDriver.scala:134)
  at com.databricks.unity.UnityCredentialScope$.currentScope(UnityCredentialScope.scala:100)
  at com.databricks.unity.UnityCredentialScope$.getSAMRegistry(UnityCredentialScope.scala:120)
  at com.databricks.unity.SAMRegistry$.registerSAM(SAMRegistry.scala:307)
  at com.databricks.unity.SAMRegistry$.registerDefaultSAM(SAMRegistry.scala:323)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalogImpl.defaultTablePath(SessionCatalog.scala:1200)
  at com.databricks.sql.managedcatalog.ManagedCatalogSessionCatalog.defaultTablePath(ManagedCatalogSessionCatalog.scala:991)
  at io.openlineage.spark3.agent.lifecycle.plan.catalog.AbstractDatabricksHandler.getDatasetIdentifier(AbstractDatabricksHandler.java:92)
  at io.openlineage.spark3.agent.lifecycle.plan.catalog.CatalogUtils3.lambda$getDatasetIdentifier$2(CatalogUtils3.java:61)
  at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
  at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
  at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
  at java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1361)
  at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126)
  at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:499)
  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486)
  at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
  at java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152)
  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
  at java.util.stream.ReferencePipeline.findAny(ReferencePipeline.java:536)
  at io.openlineage.spark3.agent.lifecycle.plan.catalog.CatalogUtils3.getDatasetIdentifier(CatalogUtils3.java:63)
  at io.openlineage.spark3.agent.lifecycle.plan.catalog.CatalogUtils3.getDatasetIdentifier(CatalogUtils3.java:46)
  at io.openlineage.spark3.agent.utils.PlanUtils3.getDatasetIdentifier(PlanUtils3.java:79)
  at io.openlineage.spark35.agent.lifecycle.plan.CreateReplaceOutputDatasetBuilder.apply(CreateReplaceOutputDatasetBuilder.java:144)
  at io.openlineage.spark35.agent.lifecycle.plan.CreateReplaceOutputDatasetBuilder.lambda$apply$3(CreateReplaceOutputDatasetBuilder.java:116)
  at java.util.Optional.map(Optional.java:215)
  at io.openlineage.spark35.agent.lifecycle.plan.CreateReplaceOutputDatasetBuilder.apply(CreateReplaceOutputDatasetBuilder.java:114)
  at io.openlineage.spark35.agent.lifecycle.plan.CreateReplaceOutputDatasetBuilder.apply(CreateReplaceOutputDatasetBuilder.java:60)
  at io.openlineage.spark35.agent.lifecycle.plan.CreateReplaceOutputDatasetBuilder.apply(CreateReplaceOutputDatasetBuilder.java:39)
  at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder$1.apply(AbstractQueryPlanDatasetBuilder.java:94)
  at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder$1.apply(AbstractQueryPlanDatasetBuilder.java:85)
  at io.openlineage.spark.agent.util.PlanUtils.safeApply(PlanUtils.java:279)
  at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder.lambda$apply$0(AbstractQueryPlanDatasetBuilder.java:75)
  at java.util.Optional.map(Optional.java:215)
  at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder.apply(AbstractQueryPlanDatasetBuilder.java:67)
  at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder.apply(AbstractQueryPlanDatasetBuilder.java:39)
  at io.openlineage.spark.agent.util.PlanUtils.safeApply(PlanUtils.java:279)
  at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.lambda$null$23(OpenLineageRunEventBuilder.java:451)
  at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
  at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
  at java.util.Iterator.forEachRemaining(Iterator.java:116)
  at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
  at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
  at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
  at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
  at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
  at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272)
  at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
  at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
  at java.util.stream.StreamSpliterators$WrappingSpliterator.forEachRemaining(StreamSpliterators.java:313)
  at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
  at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
  at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
  at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
  at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildOutputDatasets(OpenLineageRunEventBuilder.java:410)
  at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.populateRun(OpenLineageRunEventBuilder.java:298)
  at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:281)
  at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:238)
  at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.end(SparkSQLExecutionContext.java:126)
  at io.openlineage.spark.agent.OpenLineageSparkListener.sparkSQLExecEnd(OpenLineageSparkListener.java:98)
  at io.openlineage.spark.agent.OpenLineageSparkListener.onOtherEvent(OpenLineageSparkListener.java:84)
  at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:102)
  at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
  at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:42)
  at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:42)
  at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:118)
  at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:102)
  at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:114)
  at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:114)
  at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
  at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
  at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:109)
  at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:105)
  at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1660)
  at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:105)
```

Rodrigo Maia (rodrigo.maia@manta.io)
2023-12-07 12:27:41

      *Thread Reply:* Event output is showing up empty.

Rodrigo Maia (rodrigo.maia@manta.io)
2023-12-11 08:21:08

      *Thread Reply:* Anyone with the same issue or no issues at all regarding Unity Catalog?

Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 12:58:55

@channel
We released OpenLineage 1.6.2, including:
• Dagster: support Dagster 1.5.x #2220 @tsungchih
• dbt: add a new command dbt-ol send-events to send metadata of the last run without running the job #2285 @sophiely
• Flink: add option for Flink job listener to read from Flink conf #2229 @ensctom
• Spark: get column-level lineage from JDBC dbtable option #2284 @mobuchowski
• Spec: introduce JobTypeJobFacet to contain additional job-related information #2241 @pawel-big-lebowski
• SQL: add quote information from sqlparser-rs #2259 @JDarDagran
• bug fixes, tests, and more.
Thanks to all the contributors, including new contributors @tsungchih and @ensctom!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/1.6.2
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/1.5.0...1.6.2
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

Jorge (jorge.varona@nike.com)
2023-12-07 13:01:27

      👋 Hi everyone! Just lurking to get insights into how I might implement this at my company to solve our lineage challenge.

Michael Robinson (michael.robinson@astronomer.io)
2023-12-07 13:02:06

      *Thread Reply:* 👋 Hi Jorge, welcome! Can you tell us a bit about your use case?

Jorge (jorge.varona@nike.com)
2023-12-07 13:03:51

*Thread Reply:* Large company with a big ball of mud data ecosystem. There is active debate on doubling down on a vendor (probably won't work) or doing the work to instrument our jobs/tooling in an agnostic way. I'm inclined to do the latter but want to understand what the implementation may look like.

Harel Shein (harel.shein@gmail.com)
2023-12-07 14:44:04

      *Thread Reply:* welcome! please feel free to ask any questions as they arise

👍 Jorge

Simran Suri (mailsimransuri@gmail.com)
2023-12-08 05:59:40

Hello everyone,
I'm currently utilizing the OpenLineage JAR version openlineage-spark-1.1.0.jar to capture lineage information in my environment. I have a specific requirement to capture facets related to input datasets, especially Data Quality Metrics facets such as row-count information for the input data, as well as output dataset facets covering output statistics, like row-count information for the output data.

Although the "inputFacets": {} and "outputFacets": {} tags seem to be enabled in the event, the values within these tags are not reflecting the expected information; they always seem to be blank.

This setup involves Databricks, and the cluster's Spark version is Apache Spark 3.3.2. I've configured the OpenLineage setup in the Global Init scripts within the Databricks workspace.

I would greatly appreciate it if someone could provide guidance or insight into this issue.

Harel Shein (harel.shein@gmail.com)
2023-12-08 11:40:47

*Thread Reply:* Can you turn on the debug facet and share an example event so we can try to help?
spark.openlineage.debugFacet should be set to enabled
This is from https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark
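On a Databricks cluster configured through a Global Init script, as described above, that is one more line next to the other OpenLineage settings in the Spark config (a sketch):

```
spark.openlineage.debugFacet enabled
```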

Simran Suri (mailsimransuri@gmail.com)
2023-12-11 03:22:03

      *Thread Reply:* Hi @Harel Shein, sure.

Simran Suri (mailsimransuri@gmail.com)
2023-12-11 03:23:59

      *Thread Reply:* This text file contains a total of 10-11 events, including the start and completion events of one of my notebook runs. The process is simply reading from a Hive location and performing a full load to another Hive location.

Rodrigo Maia (rodrigo.maia@manta.io)
2023-12-11 03:32:40

*Thread Reply:* @Simran Suri do you get any cluster logs with an error? I'm running a newer version of the OL jar and I'm getting inputs and outputs from Hive (but not for Unity Catalog)

Simran Suri (mailsimransuri@gmail.com)
2023-12-11 03:54:28

      *Thread Reply:* No @Rodrigo Maia, I can't see any errors there

Harel Shein (harel.shein@gmail.com)
2023-12-12 16:05:18

*Thread Reply:* Thanks for adding that!
Per the docs here, did you extend the Input/OutputDatasetFacetBuilder with anything to track data quality metrics?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Simran Suri + (mailsimransuri@gmail.com) +
      +
      2023-12-13 00:49:01
      +
      +

*Thread Reply:* Actually no, I haven't tried that out. Can you give me a bit more detail on how it can be extended? Do I need to add some configs?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-12-08 15:04:43
      +
      +

      @channel +This month’s TSC meeting is next Thursday the 14th at 10am PT. On the tentative agenda: +• announcements +• recent releases +• proposal updates +• open discussion +• more (TBA) +More info and the meeting link can be found on the website. All are welcome! Do you have a discussion topic, use case or integration you’d like to demo? DM me to be added to the agenda.

      +
      +
      openlineage.io
      + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Joey Mukherjee + (joey.mukherjee@swri.org) +
      +
      2023-12-09 18:27:05
      +
      +

      Hi! I'm interested in using OpenLineage to track data files in my pipeline. Basically, I receive a data file and run some other Python code + ancillary files to produce other data files which are then used to produce more data files and onward. Each level of files is versioned and I would like to track this lineage. I don't use Spark, Airflow, dbt, etc. I do use Prefect though. My wrapper script is Python. Is OpenLineage appropriate? Seems like it... is my "DataSet" every individual file that I produce? I think I have to write my own integration and facet. Is that also correct? Any other advice? Thanks!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Harel Shein + (harel.shein@gmail.com) +
      +
      2023-12-10 12:09:47
      +
      +

*Thread Reply:* Hi @Joey Mukherjee, welcome to the community! This use case would work using the OpenLineage spec. You are right, unfortunately we don't currently have a Prefect integration, but we'd be happy to support you if you chose to write it! 🙂
I don't know enough about Prefect to say what would be the right model to map to. Have you looked at the OpenLineage data model?
https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md

      + + + +
      + 👍 Jakub Dardziński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-12-11 08:12:40
      +
      +

      Hey team.

      + +

      In the spark integration: When we do a spark.jdbc() or spark.read() from a JDBC connection like mysql/postgres etc, does OpenLineage support capturing metadata on the JDBC connection (host URI / port / username etc) in the OL events or not?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-12-11 09:32:14
      +
      +

*Thread Reply:* > OpenLineage support capturing metadata on the JDBC connection (host URI / port / username etc)
I think those are part of the dataset namespace? Probably not username, though
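For reference, a sketch of what the dataset identity typically looks like under the OpenLineage naming conventions for JDBC sources (host and table values hypothetical):

```python
# Connection details (scheme, host, port) live in the namespace; the table
# identity lives in the name. Credentials such as usernames are not captured.
input_dataset = {
    "namespace": "postgres://db.example.com:5432",
    "name": "mydb.public.orders",
}
```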

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-12-11 12:34:46
      +
      +

      *Thread Reply:* Ack, and are these spark.jdbc inputs being captured for spark 3.x onwards or for spark 2.x as well?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-12-11 12:34:55
      +
      +

      *Thread Reply:* @Maciej Obuchowski ^

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-12-11 12:53:25
      +
      +

      *Thread Reply:* I would assume not for Spark 2

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      David Goss + (david.goss@matillion.com) +
      +
      2023-12-11 11:12:27
      +
      +

      ❓ Are there any particular standards/conventions around how to express data types in the dataset schema facet? The examples I’ve seen have been just types like integer etc. I think it would be useful for certain types to include modifiers for size, precision etc, so like varchar(50) and stuff like that. Would it just be a case of, stick to how the platform (mysql, snowflake, whatever) expresses it in DDL?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-12-11 12:54:12
      +
      +

      *Thread Reply:* We do express what's in DDL basically. It's database specific anyway
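For illustration, a sketch of how a schema facet might carry the platform-native DDL types under that convention (field names and types hypothetical):

```python
# SchemaDatasetFacet fields, keeping types exactly as the platform's DDL
# expresses them, including size/precision modifiers.
schema_facet = {
    "fields": [
        {"name": "id", "type": "integer"},
        {"name": "username", "type": "varchar(50)"},
        {"name": "amount", "type": "numeric(18,2)"},
    ]
}
```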

      + + + +
      + 👍 David Goss +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2023-12-11 16:31:08
      +
      +

QQ - we have a multi-cluster Redshift architecture where table meta-information could exist in a different cluster. The way I see it, the extractor here requires table meta-information to create the lineage, right? Currently the tables that live outside that cluster don't show up in my input datasets. Any thoughts?

      +
      + + + + + + + + + + + + + + + + +
      +
      + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-11 16:34:13
      +
      +

*Thread Reply:* I’m not entirely sure how multi-cluster Redshift works. AFAIK the solution would be to take advantage of SVV_REDSHIFT_COLUMNS
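A sketch of that approach, assuming the standard columns of Redshift's cross-database SVV_REDSHIFT_COLUMNS view (the schema filter is hypothetical):

```python
# Query column metadata across databases, including tables that live
# outside the current database.
query = """
SELECT database_name, schema_name, table_name, column_name, data_type
FROM svv_redshift_columns
WHERE schema_name = 'my_schema'
"""
```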

      + + + +
      + 👀 harsh loomba +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-11 16:34:31
      +
      +

*Thread Reply:* I’ve opened a PR for the Airflow OL provider here:
https://github.com/apache/airflow/pull/35794

      + + + +
      + 👀 harsh loomba +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-14 04:56:15
      +
      +

      *Thread Reply:* hey @harsh loomba, did you have a chance to check above out?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2023-12-14 11:50:53
      +
      +

*Thread Reply:* I did check, it looks promising for the problem statement we have, right @Willy Lulciuc?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Willy Lulciuc + (willy@datakin.com) +
      +
      2023-12-14 14:30:44
      +
      +

      *Thread Reply:* @harsh loomba yep, it does!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Willy Lulciuc + (willy@datakin.com) +
      +
      2023-12-14 14:31:20
      +
      +

      *Thread Reply:* thanks @Jakub Dardziński for the quick fix!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2023-12-14 14:31:44
      +
      +

      *Thread Reply:* ohhh great

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2023-12-14 14:31:58
      +
      +

      *Thread Reply:* thanks!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2023-12-14 14:32:07
      +
      +

*Thread Reply:* hopefully it will be added in the upcoming OL release

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Joey Mukherjee + (joey.mukherjee@swri.org) +
      +
      2023-12-11 19:47:37
      +
      +

      I'm playing with OpenLineage and for my start event and complete event, do they have to have the same input and output datasets? Say my input datasets generate files unknown at start time, can OpenLineage handle that? Right now, I am getting 422 Client Error: for url: http://xxx:5000/api/v1/lineage . How do I find out the error? I am not using any of the integrations.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-12-12 05:37:04
      +
      +

*Thread Reply:* > I'm playing with OpenLineage and for my start event and complete event, do they have to have the same input and output datasets? Say my input datasets generate files unknown at start time, can OpenLineage handle that?
The idea of OpenLineage is to handle exactly this kind of event - the events are generally meant to be cumulative. As you said, inputs or outputs can be unknown at start time, but the opposite situation can also happen: you're reading a version of the dataset that has been changing in the meantime, and you know the particular version only at start.

      + +

      That being said, the actual handling of those events depends on the consumer itself, so it depends on what you're using.

      + +

      > Right now, I am getting 422 Client Error: for url: http://xxx:5000/api/v1/lineage . How do I find out the error? I am not using any of the integrations. +It's probably consumer specific - hard to tell without knowledge of what you're using.

      + + + +
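To illustrate the cumulative model, a minimal sketch with the openlineage-python client types (names and paths hypothetical): the START event omits outputs that aren't known yet, and the COMPLETE event for the same runId fills them in.

```python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

def now() -> str:
    return datetime.now(timezone.utc).isoformat()

job = Job(namespace="my_namespace", name="my_pipeline.step1")
producer = "https://example.com/my-producer"
run = Run(runId=str(uuid4()))  # the shared runId ties both events together
src = Dataset(namespace="file", name="/data/input.csv")

# START: the output file is not known yet, so outputs stays empty.
start = RunEvent(eventType=RunState.START, eventTime=now(), run=run, job=job,
                 producer=producer, inputs=[src], outputs=[])

# COMPLETE: the output discovered during the run is added cumulatively.
out = Dataset(namespace="file", name="/data/output_v1.csv")
complete = RunEvent(eventType=RunState.COMPLETE, eventTime=now(), run=run, job=job,
                    producer=producer, inputs=[src], outputs=[out])
```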
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-12-13 03:06:36
      +
      +

      Hey folks. Is there a way to disable column-level lineage / "schema" facet in inputs/outputs for spark integration?

      + +

      Basically to have just table-level lineage and disable column-level lineage

      + + + +
      + ➕ Anirudh Shrinivason +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-12-13 03:22:17
      +
      +

      *Thread Reply:* cc @Honey Thakuria

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Damien Hawes + (damien.hawes@booking.com) +
      +
      2023-12-13 03:47:20
      +
      +

      *Thread Reply:* The spark integration provides the ability to disable facets, via the spark.openlineage.facets.disabled configuration.

      + +

      You provide values like this: +spark.openlineage.facets.disabled=[spark_unknown;spark.logicalPlan;&lt;more&gt;]

      + + + +
      + 👍 Jakub Dardziński, Paweł Leszczyński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-12-13 06:25:24
      +
      +

      *Thread Reply:* Right, but is there a specific facet we can disable here to get table-level lineage but skip column-level lineage in the events?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Damien Hawes + (damien.hawes@booking.com) +
      +
      2023-12-13 06:44:00
      +
      +

      *Thread Reply:* The facet name that you want to disable is "columnLineage" in this case
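A minimal sketch of that, using the facets.disabled syntax shown above (app name hypothetical):

```python
# Keep table-level lineage but skip the columnLineage facet entirely.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("table-level-lineage-only")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.facets.disabled", "[spark_unknown;columnLineage]")
    .getOrCreate()
)
```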

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      rajeshree parmar + (rajeshreedatavizz@gmail.com) +
      +
      2023-12-13 05:05:07
      +
      +

Hi, I want to use OpenLineage on my platform. How can I use it? I already have metadata and profiler pipelines - how do I integrate them?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Kacper Muda + (kacper.muda@getindata.com) +
      +
      2023-12-13 08:33:36
      +
      +

      Hi, i'd like to request a release (patch 1.6.3) that will include this PR: #2305 . It would help people using OL with Airflow integration (with Airflow version 2.6).

      + + + +
      + 👍 Harel Shein, Maciej Obuchowski +
      + +
      + ➕ Jakub Dardziński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-12-14 08:40:03
      +
      +

      *Thread Reply:* Thanks for requesting a release. It is authorized and will be initiated within 2 business days (not including Friday).

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Kacper Muda + (kacper.muda@getindata.com) +
      +
      2023-12-14 08:57:34
      +
      +

      *Thread Reply:* Perfect, thank you !

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Joey Mukherjee + (joey.mukherjee@swri.org) +
      +
      2023-12-13 09:59:56
      +
      +

I have a question/misunderstanding about the output section from Marquez under I/O. For one of my Jobs, the outputs shown are the right output plus all of the inputs from the previous Job in the pipeline. I'm not sure why, but how do I find the cause of this? I feel like I did everything correctly. To be clear, I am using a custom pipeline using files and the example code from the Python section of the OpenLineage docs.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Joey Mukherjee + (joey.mukherjee@swri.org) +
      +
      2023-12-13 13:18:15
      +
      +

      *Thread Reply:* I notice only one of my three jobs has Run information. I have only six events, and all three have START and COMPLETE states. From what I can tell, the information in the events list is correct, but the GUI is not right. Open to any ideas on how to debug!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-12-13 11:34:15
      +
      +

      @channel +This month’s TSC meeting is tomorrow at 10 am PT https://openlineage.slack.com/archives/C01CK9T7HKR/p1702065883107479

      +
      + + +
      + + + } + + Michael Robinson + (https://openlineage.slack.com/team/U02LXF3HUN7) +
      + + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-14 02:01:32
      +
      +

      Hi team, I noticed this error: +ERROR ColumnLevelLineageUtils: Error when invoking static method 'buildColumnLineageDatasetFacet' for Spark3 +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875245402Z java.lang.reflect.InvocationTargetException +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875250061Z at jdk.internal.reflect.GeneratedMethodAccessor469.invoke(Unknown Source) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875253194Z at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875256744Z at java.base/java.lang.reflect.Method.invoke(Method.java:566) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875260324Z at io.openlineage.spark.agent.lifecycle.plan.column.ColumnLevelLineageUtils.buildColumnLineageDatasetFacet(ColumnLevelLineageUtils.java:35) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875263824Z at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.lambda$buildOutputDatasets$21(OpenLineageRunEventBuilder.java:434) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875267715Z at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875270666Z at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875273785Z at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875276824Z at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875279799Z at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875285515Z at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875288751Z at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875292022Z at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildOutputDatasets(OpenLineageRunEventBuilder.java:447) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875294766Z at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.populateRun(OpenLineageRunEventBuilder.java:306) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875297558Z at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:289) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875300352Z at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:232) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875303027Z at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.start(SparkSQLExecutionContext.java:70) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875305695Z at 
io.openlineage.spark.agent.OpenLineageSparkListener.lambda$sparkSQLExecStart$0(OpenLineageSparkListener.java:91) +[2023-12-12, 14:16:44 UTC] {subprocess.py:93} INFO - 2023-12-12T14:16:44.875308769Z at java.base/java.util.Optional.ifPresent(Optional.java:183) +... +in some pipelines for OL version 0.30.1. May I check if this has already been reported/been fixed in the later releases of OL? Thanks!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-14 02:35:34
      +
      +

*Thread Reply:* No, it wasn't reported. Are you able to reproduce the same with a recent OpenLineage version?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-14 02:36:47
      +
      +

      *Thread Reply:* I have not tried actually... let me try that and get back if it persists. Thanks

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-14 02:37:54
      +
      +

*Thread Reply:* just to make sure: this is just an error in the logs and should not prevent the OL event from being generated (except for column-level lineage), right?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-14 02:45:04
      +
      +

      *Thread Reply:* Yeah, but column level lineage is what we'd actually want to capture so wondering why this error was being thrown

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-14 02:46:03
      +
      +

      *Thread Reply:* sure, could you also paste the end of the stacktrace?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-14 02:48:28
      +
      +

      *Thread Reply:* Yup, let me get that

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-14 02:49:43
      +
      +

      *Thread Reply:* INFO - 2023-12-12T15:06:34.726868651Z at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726871557Z at scala.collection.convert.Wrappers$IteratorWrapper.forEachRemaining(Wrappers.scala:31) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726874327Z at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726878805Z at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726881789Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.collect(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726884664Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.lambda$collect$0(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726887371Z at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726890358Z at scala.collection.convert.Wrappers$IteratorWrapper.forEachRemaining(Wrappers.scala:31) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726893117Z at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726895977Z at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726898933Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.collect(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726901694Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.lambda$collect$0(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726904581Z at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726907524Z at scala.collection.convert.Wrappers$IteratorWrapper.forEachRemaining(Wrappers.scala:31) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726910276Z at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726913201Z at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726916094Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.collect(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726918805Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.lambda$collect$0(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726921673Z at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726924364Z at 
scala.collection.convert.Wrappers$IteratorWrapper.forEachRemaining(Wrappers.scala:31) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726927240Z at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726930033Z at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726932890Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.collect(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726935677Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.lambda$collect$0(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726938562Z at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726941533Z at scala.collection.convert.Wrappers$IteratorWrapper.forEachRemaining(Wrappers.scala:31) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726944428Z at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726947297Z at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726951913Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.collect(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726954964Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.lambda$collect$0(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726957942Z at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726960897Z at scala.collection.convert.Wrappers$IteratorWrapper.forEachRemaining(Wrappers.scala:31) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726963830Z at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726968390Z at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726971331Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.collect(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726974162Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.lambda$collect$0(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726977051Z at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726979989Z at scala.collection.convert.Wrappers$IteratorWrapper.forEachRemaining(Wrappers.scala:31) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726982847Z at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) +[2023-12-12, 15:06:34 UTC] 
{subprocess.py:93} INFO - 2023-12-12T15:06:34.726985790Z at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726988700Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.collect(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726991547Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.lambda$collect$0(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726994338Z at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.726997282Z at scala.collection.convert.Wrappers$IteratorWrapper.forEachRemaining(Wrappers.scala:31) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727000102Z at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727003005Z at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727006039Z at io.openlineage.spark3.agent.lifecycle.plan.column.InputFieldsCollector.collect(InputFieldsCollector.java:61) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727009344Z at io.openlineage.spark3.agent.lifecycle.plan.column.ColumnLevelLineageUtils.collectInputsAndExpressionDependencies(ColumnLevelLineageUtils.java:72) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727012365Z at io.openlineage.spark3.agent.lifecycle.plan.column.ColumnLevelLineageUtils.lambda$null$2(ColumnLevelLineageUtils.java:83) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727015353Z at java.base/java.util.Optional.ifPresent(Optional.java:183) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727018363Z at io.openlineage.spark3.agent.lifecycle.plan.column.ColumnLevelLineageUtils.lambda$collectInputsAndExpressionDependencies$3(ColumnLevelLineageUtils.java:80) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727021227Z at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:174) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727024129Z at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1(TreeNode.scala:175) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727029039Z at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1$adapted(TreeNode.scala:175) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727032321Z at scala.collection.immutable.List.foreach(List.scala:431) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727035293Z at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:175) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727038585Z at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1(TreeNode.scala:175) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727041600Z at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1$adapted(TreeNode.scala:175) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727044499Z at 
scala.collection.immutable.List.foreach(List.scala:431) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727047389Z at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:175) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727050337Z at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1(TreeNode.scala:175) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727053034Z at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1$adapted(TreeNode.scala:175) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727055990Z at scala.collection.immutable.List.foreach(List.scala:431) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727060521Z at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:175) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727063671Z at io.openlineage.spark3.agent.lifecycle.plan.column.ColumnLevelLineageUtils.collectInputsAndExpressionDependencies(ColumnLevelLineageUtils.java:76) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727066753Z at io.openlineage.spark3.agent.lifecycle.plan.column.ColumnLevelLineageUtils.buildColumnLineageDatasetFacet(ColumnLevelLineageUtils.java:42) +[2023-12-12, 15:06:34 UTC] {subprocess.py:93} INFO - 2023-12-12T15:06:34.727069592Z ... 35 more

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-14 02:45:58
      +
      +

      Another question, +I noticed that for a few cases, lineage is not being captured if running: +df.toPandas() +via pyspark, and then doing some pandas operations on it and writing it back to an s3 location. May I check if this is expected?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-14 02:47:18
      +
      +

*Thread Reply:* this is something we haven't tested nor included in our tests. Not sure what happens when Spark data goes to pandas.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-14 02:48:18
      +
      +

      *Thread Reply:* Got it. Thanks!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Damien Hawes + (damien.hawes@booking.com) +
      +
      2023-12-15 04:39:45
      +
      +

*Thread Reply:* I can speak from experience, because we had a similar issue in our custom Spark listener. toPandas breaks lineage because toPandas is like calling collect, which forces the data in the Spark DataFrame to the driver and pipes it to the running Python sidecar process. Whatever you do afterwards, you're running code in a private memory space, and Spark has no way of knowing what you're doing.
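A minimal sketch of the pattern being described (paths and column names hypothetical):

```python
# The read below is visible to the Spark listener, but toPandas() collects the
# data to the driver; everything after that runs in plain pandas, outside
# Spark's execution model, so the final write produces no lineage event.
df = spark.read.parquet("s3://bucket/input/")      # tracked by the listener
pdf = df.toPandas()                                # data leaves Spark here
pdf["total"] = pdf["price"] * pdf["qty"]           # invisible to Spark
pdf.to_parquet("s3://bucket/output/out.parquet")   # untracked write
```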

      + + + +
      + 👍 Anirudh Shrinivason +
      + +
      + :gratitude_thank_you: Anirudh Shrinivason +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-29 03:41:29
      +
      +

      *Thread Reply:* Just wondering, is it even technically possible to address this case where pipelines use toPandas to get lineage?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Damien Hawes + (damien.hawes@booking.com) +
      +
      2024-01-08 04:46:36
      +
      +

      *Thread Reply:* To an extent - yes. Though not necessarily with the OL connector. As hinted previously, we had to recently update our own listener that we wrote years ago in order to publish lineage for subquery alias / project operations from collect / toPandas operations.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-12-14 04:58:57
      +
      +

We're noticing significant delays in a few Spark jobs, wherein the OpenLineage Spark listener seems to keep running even after the Spark/YARN application has completed & requested a graceful exit. Even after the application has ended, we see a couple of events still being processed & each event takes around 5-6 mins (rarely we have seen 9 mins as well):

2023-12-13 22:52:26.834 -0800] [INFO ] [spark-listener-group-shared] [org.apache.spark.scheduler.AsyncEventQueue.logInfo@57] - Process of event SparkListenerJobEnd(12,1702535631760,JobSucceeded) by listener OpenLineageSparkListener took 385.396979168s.
This kinda results in our actual jobs taking 30+ mins to be marked as completed, which impacts SLA.

Has anyone faced this issue, and any tips on how we can debug which event is causing this exact 5-6 min delay / which method in OpenLineageSparkListener is taking the time?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-12-14 10:59:48
      +
      +

      *Thread Reply:* Ping @Paweł Leszczyński @Maciej Obuchowski ^

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-12-14 11:56:17
      +
      +

      *Thread Reply:* Can you try disabling lineage plan and spark_unknown facet? Only thing I can think of is serializing extremely large logical plans

      + + + +
      + 👍 Paweł Leszczyński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-14 12:32:33
      +
      +

*Thread Reply:* yes, omitting plan serialization can be the first thing to try

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-14 12:33:27
      +
      +

*Thread Reply:* the other one would be to verify that the backend is responding in a reasonable time

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-12-14 13:57:22
      +
      +

*Thread Reply:* @Paweł Leszczyński @Maciej Obuchowski - These 2 facets have already been disabled - but what we see is that the event is still too huge for the OpenLineage Spark listener to realistically process in time.

For example, we have a job that reads from an S3 directory with lots of dirs/files - resulting in 1041 inputs * 8 columns per output 😅

Is there a max-limit config we can set on OpenLineage to only parse events if the Spark event size is < x MB or the # of inputs+outputs is < y, or something like that?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-12-14 14:06:34
      +
      +

*Thread Reply:* And also, when we disable facets like spark.logicalPlan / schema / columnLineage - does it mean that that part of the event is not read from Spark at all, or is it only dropped while generating/emitting the OL event?

Basically, if we have a very huge Spark event, would disabling facets help, or would it still take a lot of time?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-12-15 12:48:57
      +
      +

*Thread Reply:* @Paweł Leszczyński @Maciej Obuchowski - WDYT? ^

In our use-case, we figured out the scenario that caused this huge-event issue. We had a job that did spark.read from <s3://bucket/dir1/dir2/dir3_*/*> (with a file-level wildcard) instead of <s3://bucket/dir1/dir2/dir3_*/> - we were able to remove the wildcard and fix the issue.

But I think this is something that should be handled on the Spark listener side, so that we're not really dependent on the patterns of the Spark job code itself 😄

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-18 03:59:46
      +
      +

*Thread Reply:* That's a problem that pops up from time to time. So far, we've never been able to come up with a general solution like a circuit breaker.

      + +

We rather solve these problems after they're identified. In the past we added an option to prevent sending the serialized LogicalPlan, as well as to trim it if it exceeds a certain number of kilobytes.

      + +

What caused the issue here? Was it the event being too large, or Spark OpenLineage internals trying to resolve * and causing long-lasting backend calls to S3?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Haneefa tasneem + (haneefa.tasneem.shaik@gmail.com) +
      +
      2023-12-14 06:22:28
      +
      +

Hi. I am trying to run a Spark script through an Airflow DAG, but I am not able to see any lineage information. In the Spark script I took a sample csv file, created a DataFrame, made some transformations, and then saved it as a csv file. Please do let me know if there is any way I can see lineage information. Here is my Spark script for reference:

```python
from pyspark import SparkContext
from os.path import abspath
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import urllib.request

def execute_spark_script(query_num, output_path):
    # Create a Spark session
    #ol_jars = ['https://repo1.maven.org/maven2/io/openlineage/openlineage-spark/1.5.0/openlineage-spark-1.5.0.jar']
    warehouse_location = abspath('spark-warehouse')
    #files = [urllib.request.urlretrieve(url)[0] for url in ol_jars]
    spark = (SparkSession.builder.appName("DataManipulation").master('local[*]').appName('openlineage_spark_test')
             #.config('spark.jars', ",".join(files))
             # Install and set up the OpenLineage listener
             .config('spark.jars.packages', 'io.openlineage:openlineage-spark:1.5.0')
             .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
             .config('spark.openlineage.host', '<http://localhost:5000>')
             .config('spark.openlineage.namespace', 'airflow')
             .config('spark.openlineage.transport.type', 'console')
             .config("spark.sql.warehouse.dir", warehouse_location)
             .getOrCreate()
             )

    spark.sparkContext.setLogLevel("INFO")

    # DataFrame 1
    data = [[295, "South Bend", "Indiana", "IN", 101190, 112.9]]
    columns = ["rank", "city", "state", "code", "population", "price"]
    df1 = spark.createDataFrame(data, schema="rank LONG, city STRING, state STRING, code STRING, population LONG, price DOUBLE")

    print(f"Count after DataFrame 1: {df1.count()} rows, {len(df1.columns)} columns")

    # Save DataFrame 1 to the desired location
    df1.write.mode("overwrite").csv(output_path + "df1")

    # DataFrame 2
    df2 = (spark.read
           .format("csv")
           .option("header", "true")
           .option("inferSchema", "true")
           .load("/home/haneefa/Downloads/export.csv")
           )

    df2.write.mode("overwrite").csv(output_path + "df2")

    # Returns a DataFrame that combines the rows of df1 and df2
    query_df = df1.union(df2)
    query_df.count()

    # Save the combined DataFrame to the desired location
    query_df.write.mode("overwrite").csv(output_path + "query_df")

    # Register the DataFrame as a temporary SQL table
    query_df.write.saveAsTable("temp_tb1")

    # Query 1: Add a new column derived from existing columns
    query1_df = spark.sql("SELECT *, population / price as population_price_ratio FROM temp_tb1")
    print(f"Count after Query 1: {query1_df.count()} rows, {len(query1_df.columns)} columns")

    # Register the DataFrame as a temporary SQL table
    query1_df.write.saveAsTable("temp_tb2")

    # Save Query 1 result to the desired location
    query1_df.write.mode("overwrite").csv(output_path + "query1_df")

    # Read the saved DataFrame in parquet format
    #parquet_df = spark.read.parquet(output_path + "query1_df")

    spark.stop()

if __name__ == "__main__":
    execute_spark_script(1, "/home/haneefa/airflow/dags/saved_files/")
```

spark-submit --master --name SparkScript_query1 --deploy-mode client /home/haneefa/airflow/dags/custom_operators/sample_sql_spark.py

./bin/spark-submit --class "SparkTest" --master local[*] --jars

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Abdallah + (abdallah@terrab.me) +
      +
      2023-12-14 09:28:27
      +
      +

      *Thread Reply:* Do you run your script through a spark-submit?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Haneefa tasneem + (haneefa.tasneem.shaik@gmail.com) +
      +
      2023-12-16 14:12:08
      +
      +

*Thread Reply:* yes I do. Here is my Airflow DAG code:

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2023, 1, 1),
}

dag = DAG(
    'abc_spark_dag_edit',
    default_args=default_args,
    description='A simple Airflow DAG to run the provided Spark script',
    schedule_interval='@once',
)

spark_task = SparkSubmitOperator(
    task_id='run_spark_script',
    application='/home/haneefa/airflow/dags/custom_operators/spark_edit.py',  # path to your Spark script
    name='example_spark_job',
    conn_id='spark2',  # Spark connection ID configured in Airflow
    jars='/home/haneefa/Downloads/openlineage-spark-1.5.0.jar',
    verbose=False,
    dag=dag,
)

spark_task
```
Please do let me know if I can do anything.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Haneefa tasneem + (haneefa.tasneem.shaik@gmail.com) +
      +
      2023-12-18 02:30:50
      +
      +

*Thread Reply:* Hi. Just checking in. Please do let me know if there's anything I can try.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Abdallah + (abdallah@terrab.me) +
      +
      2024-01-03 08:16:36
      +
      +

*Thread Reply:* Can you share the driver logs, please?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Abdallah + (abdallah@terrab.me) +
      +
      2024-01-03 08:17:06
      +
      +

      *Thread Reply:* I see that you are mixing different types of transport. +... +.config('spark.openlineage.host', '<http://localhost:5000>') +... +.config('spark.openlineage.transport.type', 'console') +...
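For what it's worth, a sketch of one consistent setup, using a single HTTP transport pointed at a lineage backend (URL and app name hypothetical):

```python
# Pick one transport: here, HTTP to a backend on port 5000. Using the console
# transport instead would print events to the driver logs rather than send them.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("consistent-transport")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")
    .getOrCreate()
)
```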

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Mariusz Górski + (gorskimariusz13@gmail.com) +
      +
      2023-12-15 03:38:06
      +
      +

hey, I have a question re the OpenLineage Spark integration. According to the docs, while configuring the Spark session the following parameters (amongst others) can be passed:
• spark.openlineage.parentJobName
• spark.openlineage.appName
• spark.openlineage.namespace
What I understand from the OL spec is that the first parameter (parentJobName) would go into the facets section of every lineage event, while spark.openlineage.appName would replace the spark.appName portion of the job.name property of the lineage event (and indeed that's what we are observing). spark.openlineage.namespace would materialize as job.namespace in lineage events (that also seems accurate). What I also found in the documentation is that the parent facet is used to materialize a job (like an Airflow DAG) under which a given event is emitted. That was also my expectation for parentJobName; however, after configuring this option, the value I set it to is nowhere to be found in the output lineage events. The question is then - is my understanding of this property wrong (and if so, what is the correct one), or is this value not being propagated properly to lineage events?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-15 04:19:51
      +
      +

*Thread Reply:* hi 😉 to sum up: you're setting the parentJobName property and you don't see it anywhere within the content of the OL event. ~Looking at the code, the facet should be attached to job.~

      + + + +
      + 👍 Mariusz Górski +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-15 06:07:26
      +
      +

      *Thread Reply:* Looking at the code, the facet should be ingested into the event within this method https://github.com/OpenLineage/OpenLineage/blob/203052d663c4cd461ec38787434fc53a57[…]enlineage/spark/agent/lifecycle/OpenLineageRunEventBuilder.java

      + +

      However, I don't see any integration test for this. Feel free to create an issue for this in case parent job name is still missing.

      +
      + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2023-12-15 13:53:47
      +
      +

      *Thread Reply:* +1, I've noticed this too - the namespace & appName from spark conf reflects in the openlineage events, while the parentJobName doesn't.

      + +

      But as a workaround, we kinda used namespace as a JSON string & provided the parentJobName as a key in this JSON

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Mariusz Górski + (gorskimariusz13@gmail.com) +
      +
      2023-12-16 04:10:49
      +
      +

*Thread Reply:* indeed you can put anything into the namespace, but this is a workaround and we'd like to have a generic, OL-spec-compliant approach. So far I've checked the tests, and the parent facet is available in some of them (like the BQ write integration test) but not in others, so I'm still not sure why it's sometimes there and sometimes not. Will keep digging.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Mariusz Górski + (gorskimariusz13@gmail.com) +
      +
      2023-12-16 09:23:28
      +
      +

*Thread Reply:* ok I figured it out 🙂 tl;dr when you use parentJobName you need to also define parentRunId (this implicit relationship could be better documented), and the tricky part is that parentRunId needs to be a proper UUID, not just a random string (which was a wrong assumption on my part - again, an improvement to the docs is also required). I will update the docs in the upcoming week based on this discovery. I tested this, and after making the changes described above the parent facet is visible in Spark OL events 🙂
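A minimal sketch of what that configuration might look like (namespace and job names hypothetical):

```python
# Both settings must be present for the parent run facet to appear, and
# parentRunId must be a valid UUID rather than an arbitrary string.
import uuid
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parent-facet-example")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.namespace", "my_namespace")
    .config("spark.openlineage.parentJobName", "my_dag.my_task")
    .config("spark.openlineage.parentRunId", str(uuid.uuid4()))
    .getOrCreate()
)
```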

      + + + +
      + 🙌 Jakub Dardziński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Mariusz Górski + (gorskimariusz13@gmail.com) +
      +
      2023-12-27 04:11:27
      +
      +

      *Thread Reply:* https://github.com/OpenLineage/docs/pull/268 took a little longer but there it is, I've also changed the proposal for exemplary airflow task metadata so it's aligned with how parent run facet is populated in airflow since OL 1.7.0 🙂

      +
      + + + + + + + +
      +
      Comments
      + 1 +
      + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Simran Suri + (mailsimransuri@gmail.com) +
      +
      2023-12-17 11:33:34
      +
      +

Hi everyone, I have a question about Airflow's job-to-job level lineage. I've successfully obtained OpenLineage events for Airflow DAGs, and now I'm looking to create inter-DAG dependencies (task-to-task), such as mapping D1.t1->D1.t2->D1.t3->D2.t1->D2.t2->D2.t3. Could you please advise on which fields I should consider to achieve this level of mapping?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-17 15:34:24
      +
      +

      *Thread Reply:* I’m not sure what you’re trying to achieve

      + +

      would you like to set relationships both between Airflow task <-> task and DAG <-> DAG?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Simran Suri + (mailsimransuri@gmail.com) +
      +
      2023-12-17 15:43:12
      +
      +

*Thread Reply:* Yes, I'm trying to set a relationship between both of them. Suppose I have 2 DAGs, D1 and D2, with 2 tasks each. The lineage would be D1.t1 -> D1.t2 (TriggerDagRunOperator) -> D2.t1 -> D2.t2.
That way, I'd be able to mark inter-DAG dependencies as well as task dependencies within the same DAG.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-17 15:55:47
      +
      +

*Thread Reply:* in terms of Marquez, the relationship between jobs is established by common input/output datasets.
There’s also a parent/child relationship (e.g. between a DAG and a task).
There’s actually no facet that would point directly between two jobs; that would require some sort of customization.
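To make the first point concrete, a minimal sketch with the openlineage-python client (names and endpoint hypothetical): D1.t2 declares dataset X as an output and D2.t1 declares it as an input, which is what lets a consumer like Marquez connect the two jobs.

```python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")
x = Dataset(namespace="my_namespace", name="dataset_x")

# Emit one COMPLETE event per job; the shared dataset is what links them.
for job_name, inputs, outputs in [("D1.t2", [], [x]), ("D2.t1", [x], [])]:
    client.emit(RunEvent(
        eventType=RunState.COMPLETE,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=str(uuid4())),
        job=Job(namespace="my_namespace", name=job_name),
        producer="https://example.com/my-producer",
        inputs=inputs,
        outputs=outputs,
    ))
```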

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Simran Suri + (mailsimransuri@gmail.com) +
      +
      2023-12-17 16:12:28
      +
      +

*Thread Reply:* @Jakub Dardziński In terms of Airflow OpenLineage events, I can see task-level dependencies within the same DAG in the events.
But where I have inter-DAG dependencies via TriggerDagRunOperator, I can't see that level of information in the lineage events, as mentioned here.

I'm not able to find the upstream dependencies of a DAG.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Parkash Pant + (ppant@tucowsinc.com) +
      +
      2023-12-17 21:24:16
      +
      +

Hi Everyone, need help! I am new to OpenLineage and Marquez and I am trying to test them with our local installation of Airflow 2.6.3. Airflow and Marquez are each running in a separate docker container, I have installed the openlineage-airflow integration in Airflow, and I have set OPENLINEAGE_URL and OPENLINEAGE_NAMESPACE. However, upon successfully running a DAG, Airflow is not able to emit the OpenLineage event. Below is the error msg. I have also cross-checked that the Marquez API is listening at localhost:5000.
2023-12-17 14:20:18 [2023-12-17T20:20:18.489+0000] {adapter.py:98} ERROR - Failed to emit OpenLineage event of id f0d30e1a-30cc-3ce5-9cbb-1d3529d5e206
2023-12-17 14:20:18 Traceback (most recent call last):
2023-12-17 14:20:18   File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
2023-12-17 14:20:18     conn = connection.create_connection(
2023-12-17 14:20:18   File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/util/connection.py", line 95, in create_connection
2023-12-17 14:20:18     raise err
2023-12-17 14:20:18   File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
2023-12-17 14:20:18     sock.connect(sa)
2023-12-17 14:20:18 ConnectionRefusedError: [Errno 111] Connection refused
2023-12-17 14:20:18
2023-12-17 14:20:18 During handling of the above exception, another exception occurred:
2023-12-17 14:20:18
2023-12-17 14:20:18 Traceback (most recent call last):
2023-12-17 14:20:18   File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 714, in urlopen
2023-12-17 14:20:18     httplib_response = self._make_request(
2023-12-17 14:20:18   File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 415, in _make_request
2023-12-17 14:20:18     conn.request(method, url, **httplib_request_kw)
2023-12-17 14:20:18   File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connection.py", line 244, in request
2023-12-17 14:20:18     super(HTTPConnection, self).request(method, url, body=body, headers=headers)
2023-12-17 14:20:18   File "/usr/lib/python3.10/http/client.py", line 1282, in request
2023-12-17 14:20:18     self._send_request(method, url, body, headers, encode_chunked)
2023-12-17 14:20:18   File "/usr/lib/python3.10/http/client.py", line 1328, in _send_request
2023-12-17 14:20:18     self.endheaders(body, encode_chunked=encode_chunked)
2023-12-17 14:20:18   File "/usr/lib/python3.10/http/client.py", line 1277, in endheaders
2023-12-17 14:20:18     self._send_output(message_body, encode_chunked=encode_chunked)
2023-12-17 14:20:18   File "/usr/lib/python3.10/http/client.py", line 1037, in _send_output
2023-12-17 14:20:18     self.send(msg)
2023-12-17 14:20:18   File "/usr/lib/python3.10/http/client.py", line 975, in send
2023-12-17 14:20:18     self.connect()
2023-12-17 14:20:18   File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connection.py", line 205, in connect
2023-12-17 14:20:18     conn = self._new_conn()
2023-12-17 14:20:18   File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connection.py", line 186, in _new_conn
2023-12-17 14:20:18     raise NewConnectionError(
2023-12-17 14:20:18 urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f3244438a90>: Failed to establish a new connection: [Errno 111] Connection refused
2023-12-17 14:20:18
2023-12-17 14:20:18 During handling of the above exception, another exception occurred:
2023-12-17 14:20:18
2023-12-17 14:20:18 Traceback (most recent call last):
2023-12-17 14:20:18   File "/usr/local/airflow/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
2023-12-17 14:20:18     resp = conn.urlopen(
2023-12-17 14:20:18   File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 798, in urlopen
2023-12-17 14:20:18     retries = retries.increment(
2023-12-17 14:20:18   File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
2023-12-17 14:20:18     raise MaxRetryError(_pool, url, error or ResponseError(cause))
2023-12-17 14:20:18 urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url: /api/v1/lineage (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3244438a90>: Failed to establish a new connection: [Errno 111] Connection refused'))
2023-12-17 14:20:18
2023-12-17 14:20:18 During handling of the above exception, another exception occurred:
2023-12-17 14:20:18
2023-12-17 14:20:18 Traceback (most recent call last):
2023-12-17 14:20:18   File "/usr/local/airflow/.local/lib/python3.10/site-packages/openlineage/airflow/adapter.py", line 95, in emit
2023-12-17 14:20:18     return self.client.emit(event)
2023-12-17 14:20:18   File "/usr/local/airflow/.local/lib/python3.10/site-packages/openlineage/client/client.py", line 102, in emit
2023-12-17 14:20:18     self.transport.emit(event)
2023-12-17 14:20:18   File "/usr/local/airflow/.local/lib/python3.10/site-packages/openlineage/client/transport/http.py", line 159, in emit
2023-12-17 14:20:18     resp = session.post(
2023-12-17 14:20:18   File "/usr/local/airflow/.local/lib/python3.10/site-packages/requests/sessions.py", line 637, in post
2023-12-17 14:20:18     return self.request("POST", url, data=data, json=json, **kwargs)
2023-12-17 14:20:18   File "/usr/local/airflow/.local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
2023-12-17 14:20:18     resp = self.send(prep, **send_kwargs)
2023-12-17 14:20:18   File "/usr/local/airflow/.local/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
2023-12-17 14:20:18     r = adapter.send(request, **kwargs)
2023-12-17 14:20:18   File "/usr/local/airflow/.local/lib/python3.10/site-packages/requests/adapters.py", line 519, in send
2023-12-17 14:20:18     raise ConnectionError(e, request=request)
2023-12-17 14:20:18 requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url: /api/v1/lineage (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3244438a90>: Failed to establish a new connection: [Errno 111] Connection refused'))

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-18 02:28:33
      +
      +

      *Thread Reply:* try with host.docker.internal instead of localhost
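For context: inside the Airflow container, localhost points at the container itself, not the host, so the emitter has to target the host gateway (or the Marquez container's service name if both are on a shared Docker network). A minimal sketch, assuming the default Marquez API port:

OPENLINEAGE_URL=http://host.docker.internal:5000
# or, on a shared network (the service name here is an assumption):
# OPENLINEAGE_URL=http://marquez-api:5000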

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Parkash Pant + (ppant@tucowsinc.com) +
      +
      2023-12-18 11:51:48
      +
      +

      *Thread Reply:* It worked. Thanks!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Ourch + (michael.ourch@qonto.com) +
      +
      2023-12-18 05:02:29
      +
      +

Hey everyone,
I started experimenting with OpenLineage on my stack (dbt + Snowflake).
The data lineage works great but I could not get the column lineage feature working.
Is it only implemented for Spark at the moment?
Thanks 🙏

      + + + + +
      + ✅ Michael Ourch +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-18 05:03:54
      +
      +

*Thread Reply:* it works for Spark (with the exception of JDBC) and for Airflow SQL-based operators (except the BigQuery operator)

      + + + +
      + 🙏 Michael Ourch +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Ourch + (michael.ourch@qonto.com) +
      +
      2023-12-18 08:20:18
      +
      +

      *Thread Reply:* Thanks @Jakub Dardziński for the clarification 🙏

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Daniel Henneberger + (me@danielhenneberger.com) +
      +
      2023-12-18 16:16:30
      +
      +

Hey y'all, I'm trying out the Flink OpenLineage integration with Marquez. I cloned Marquez and did a ./docker/up.sh and configured Flink using the yaml. However, when it tries to emit an event, I get:

      + +

ERROR io.openlineage.flink.client.EventEmitter - Failed to emit OpenLineage event:
io.openlineage.client.OpenLineageClientException: code: 422, response: {"errors":["job.facets.jobType.integration must not be null"]}
    at io.openlineage.client.transports.HttpTransport.throwOnHttpError(HttpTransport.java:150) ~[sqrl-cli.jar:0.4.1-SNAPSHOT]
    at io.openlineage.client.transports.HttpTransport.emit(HttpTransport.java:124) ~[sqrl-cli.jar:0.4.1-SNAPSHOT]
    at io.openlineage.client.transports.HttpTransport.emit(HttpTransport.java:111) ~[sqrl-cli.jar:0.4.1-SNAPSHOT]
    at io.openlineage.client.OpenLineageClient.emit(OpenLineageClient.java:46) ~[sqrl-cli.jar:0.4.1-SNAPSHOT]
Is there something else I need to configure?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Daniel Henneberger + (me@danielhenneberger.com) +
      +
      2023-12-18 17:22:39
      +
      +

      I opened an issue: https://github.com/OpenLineage/OpenLineage/issues/2324

      +
      + + + + + + + +
      +
      Comments
      + 1 +
      + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-19 03:07:34
      +
      +

*Thread Reply:* great finding, sorry for this.
this should help:
https://github.com/OpenLineage/OpenLineage/pull/2325

      +
      + + + + + + + +
      +
      Labels
      + documentation, integration/flink, streaming +
      + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-19 00:19:59
      +
      +

Hi team, Noticed this OL error for spark 3.4.1:
23/12/15 09:51:35 ERROR PlanUtils: Apply failed:
java.lang.NoSuchMethodError: 'java.lang.String org.apache.spark.sql.execution.datasources.PartitionedFile.filePath()'
    at io.openlineage.spark.agent.util.PlanUtils.lambda$null$4(PlanUtils.java:241)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
    at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
    at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658)
    at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274)
    at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
    at scala.collection.convert.Wrappers$IteratorWrapper.forEachRemaining(Wrappers.scala:31)
    at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
    at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
    at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274)
    at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
    at io.openlineage.spark.agent.util.PlanUtils.findRDDPaths(PlanUtils.java:248)
    at io.openlineage.spark.agent.lifecycle.plan.AbstractRDDNodeVisitor.findInputDatasets(AbstractRDDNodeVisitor.java:42)
    at io.openlineage.spark.agent.lifecycle.plan.SqlExecutionRDDVisitor.apply(SqlExecutionRDDVisitor.java:43)
    at io.openlineage.spark.agent.lifecycle.plan.SqlExecutionRDDVisitor.apply(SqlExecutionRDDVisitor.java:22)
    at io.openlineage.spark.agent.util.PlanUtils$1.lambda$apply$2(PlanUtils.java:99)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
    at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
    at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
    at io.openlineage.spark.agent.util.PlanUtils$1.apply(PlanUtils.java:115)
    at io.openlineage.spark.agent.util.PlanUtils$1.apply(PlanUtils.java:79)
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:127)
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126)
    at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:30)
    at scala.PartialFunction$AndThen.applyOrElse(PartialFunction.scala:194)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.lambda$visitLogicalPlan$14(OpenLineageRunEventBuilder.java:400)
    at io.openlineage.spark.agent.util.ScalaConversionUtils$3.apply(ScalaConversionUtils.java:131)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$map$1(TreeNode.scala:305)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$map$1$adapted(TreeNode.scala:305)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:285)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1$adapted(TreeNode.scala:286)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1$adapted(TreeNode.scala:286)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode.map(TreeNode.scala:305)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.lambda$buildInputDatasets$6(OpenLineageRunEventBuilder.java:351)
    at java.base/java.util.Optional.map(Optional.java:265)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildInputDatasets(OpenLineageRunEventBuilder.java:349)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.populateRun(OpenLineageRunEventBuilder.java:305)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:289)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:241)
    at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.end(SparkSQLExecutionContext.java:95)
    at io.openlineage.spark.agent.OpenLineageSparkListener.sparkSQLExecEnd(OpenLineageSparkListener.java:98)
    at io.openlineage.spark.agent.OpenLineageSparkListener.onOtherEvent(OpenLineageSparkListener.java:84)
    at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100)
    at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
    at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
    at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
    at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
    at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
    at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1471)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
OL version 0.30.1. May I check if this has already been reported/fixed in the later releases of OL? Thanks!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-19 03:16:08
      + +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-19 22:35:34
      +
      +

      *Thread Reply:* Got it thanks!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-19 00:22:03
      +
      +

      Also, just checking, is there a way to set the log level for OL separately for spark? Or does it always use the underlying spark context log level?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-19 03:17:21
      +
      +

      *Thread Reply:* this is how we set it in tests -> https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/app/src/test/resources/log4j.properties
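For reference, a rough sketch of the relevant lines from such a log4j.properties (the appender setup is illustrative; io.openlineage is the package the integration logs under):

log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
# raise only the OpenLineage integration to DEBUG, independent of the Spark root level
log4j.logger.io.openlineage=DEBUG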

      +
      + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-19 22:35:16
      +
      +

      *Thread Reply:* Ahh I see... it goes in via log4j properties. Is there any plan to make this configurable via simpler means? Say env variable or using dedicated spark configs?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-20 02:27:06
      +
      +

*Thread Reply:* We didn't plan anything so far. But feel free to create an issue and justify why this is important. Perhaps more people share the same feeling about it.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-20 02:37:45
      +
      +

      *Thread Reply:* Sure, I'll do that then. Thanks! 🙂

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-19 04:45:05
      +
      +

hi, does someone use the OpenLineage solution to get metadata from Airflow?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-19 05:06:24
      +
      +

      *Thread Reply:* hey Zacay, do you have any issue with using Airflow integration?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 05:42:30
      +
      +

*Thread Reply:* Do i need to install openlineage, or is configuration enough?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 05:43:03
      +
      +

*Thread Reply:* i installed Marquez
and pointed the Airflow cfg to marquez:5000

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 05:43:12
      +
      +

      *Thread Reply:* what version of Airflow are you using?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 05:43:55
      +
      +

      *Thread Reply:* Version: v2.8.0

      +
      +
      PyPI
      + + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 05:44:19
      +
      +

      *Thread Reply:* 2.8.0

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 05:46:09
      +
      +

      *Thread Reply:* you should follow this guide then +https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/guides/user.html

      + +

As with any other Airflow provider, you need to install it, e.g. with pip install apache-airflow-providers-openlineage

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 05:47:07
      +
      +

*Thread Reply:* and are these variables kept in airflow.cfg or in .env?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 05:50:12
      +
      +

      *Thread Reply:* this is Airflow config variables so you can either set them in airflow.cfg or using environment variables as described here https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html#envvar-AIRFLOW__-SECTION-__-KEY

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 05:54:21
      +
      +

*Thread Reply:* i created a DAG that creates a table and inserts one line into it,
then in the airflow.cfg:
[openlineage]
transport = '{"type": "http", "url": "http://10.0.19.7:5000"}'
namespace='airflow'

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 05:54:38
      +
      +

*Thread Reply:* and on 10.0.19.7 there is the Marquez URL on port 3000

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 05:54:51
      +
      +

      *Thread Reply:* i run the dag but see no lineage

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 05:54:58
      +
      +

      *Thread Reply:* is there any log to get?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 05:57:16
      +
      +

      *Thread Reply:* there would be if you enable debug logs

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:00:28
      +
      +

      *Thread Reply:* in the Airflow.cfg?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 07:14:07
      +
      +

      *Thread Reply:* Yes. I’m assuming you don’t see anything yet in logs without enabling debug level?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:15:05
      +
      +

*Thread Reply:* I am looking at the Airflow logs

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:15:17
      +
      +

*Thread Reply:* but nothing related to the lineage

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 07:17:04
      +
      +

      *Thread Reply:* in Admin > Plugins can you see whether you have OpenLineageProviderPlugin and if so, are there listeners?

      + +
      + + + + + + + + + +
      + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:19:02
      +
      +

*Thread Reply:* there is an OpenLineageProviderPlugin
but no listeners

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 07:20:52
      +
      +

*Thread Reply:* listeners are disabled following this logic:
def _is_disabled() -> bool:
    return (
        conf.getboolean("openlineage", "disabled", fallback=False)
        or os.getenv("OPENLINEAGE_DISABLED", "false").lower() == "true"
        or (
            conf.get("openlineage", "transport", fallback="") == ""
            and conf.get("openlineage", "config_path", fallback="") == ""
            and os.getenv("OPENLINEAGE_URL", "") == ""
            and os.getenv("OPENLINEAGE_CONFIG", "") == ""
        )
    )
so maybe your config is not loaded properly?
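In other words, the listener stays enabled only if disabled isn't set and at least one of transport, config_path, OPENLINEAGE_URL, or OPENLINEAGE_CONFIG is non-empty. A minimal sketch of a cfg section that passes this check (unquoted JSON, per my reading of the provider docs; the URL is the one from this thread):

[openlineage]
transport = {"type": "http", "url": "http://10.0.19.7:5000"}
namespace = my-namespace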

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:25:38
      +
      +

      *Thread Reply:* Dont

      + +
      + + + + + + + +
      + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:25:44
      +
      +

      *Thread Reply:* here is the config

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:25:56
      +
      +

      *Thread Reply:*

      + +
      + + + + + + + +
      + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:25:59
      +
      +

      *Thread Reply:* here is also a .env

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 07:27:25
      +
      +

*Thread Reply:* you have transport and namespace twice under the openlineage section; the second ones are empty

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 07:32:23
      +
      +

*Thread Reply:* apparently the .env is not taken into account; not sure where the file is and what your deployment is

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 07:32:44
      +
      +

      *Thread Reply:* also, AIRFLOW__LINEAGE__BACKEND=openlineage.lineage_backend.OpenLineageBackend is needed only for Airflow <2.3

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:34:54
      +
      +

*Thread Reply:* ok
so now i have an airflow.cfg with
[openlineage]
transport = '{"type": "http", "url": "http://10.0.19.7:5000"}'
namespace='my-namespace'
But still when i run i see no lineage

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 07:41:15
      +
      +

*Thread Reply:* can you verify and confirm that the changes in your Airflow config are applied? I don't see any other reason, and I also can't tell what your deployment is

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:43:32
      +
      +

      *Thread Reply:* Can you send me an example of airflow.cfg that works and i will try to compare

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 07:44:34
      +
      +

      *Thread Reply:* the one you sent seems ok, it may be a matter of how you configure Airflow to read it, where you put changed config file

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:45:16
      +
      +

*Thread Reply:* it is in the same place as the logs and dags directories

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 07:47:22
      +
      +

      *Thread Reply:* okay, let’s try this

      + +

please change temporarily
expose_config = False
to
expose_config = True
and check whether you can see the config in the UI under Admin > Configuration

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:49:21
      +
      +

*Thread Reply:* Do I need to stop and start the Docker containers?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 07:50:23
      +
      +

      *Thread Reply:* yes

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:53:41
      +
      +

*Thread Reply:* i changed it but see no change in the UI under Admin > Configuration

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:54:11
      +
      +

*Thread Reply:* i wonder if the cfg file affects Airflow at all

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 07:54:29
      +
      +

      *Thread Reply:* maybe it is relevant to the docker-compose.yaml?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 07:56:44
      +
      +

*Thread Reply:* again, I don't know how you deploy your Airflow instance
if you need more help with that you might ask in the Airflow Slack or learn more from the Airflow docs (which are pretty good imho)

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 09:16:42
      +
      +

*Thread Reply:* It seems that the config is on one of the containers

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 09:16:47
      +
      +

      *Thread Reply:* so i got inside

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 09:17:05
      +
      +

*Thread Reply:* but i think that if i stop and start it, it doesn't save the change

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 09:21:53
      +
      +

*Thread Reply:* I would use env vars to override the Airflow config entries (I passed the link above) and set them in the compose file for all Airflow containers.
I'm assuming you're using the Airflow-provided docker compose yaml.
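A rough sketch of that in the stock Airflow compose file (assuming its shared x-airflow-common environment block; host and namespace values are illustrative):

```yaml
x-airflow-common: &airflow-common
  environment: &airflow-common-env
    # applied to every Airflow container that extends airflow-common
    AIRFLOW__OPENLINEAGE__TRANSPORT: '{"type": "http", "url": "http://10.0.19.7:5000"}'
    AIRFLOW__OPENLINEAGE__NAMESPACE: my-namespace
```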

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 09:22:25
      +
      +

      *Thread Reply:* right i use docker-compose yaml

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 09:28:23
      +
      +

*Thread Reply:* so you mean to create an .env file in the location of the yaml and there put
AIRFLOW__OPENLINEAGE__DISABLED=False

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 09:30:20
      +
      +

      *Thread Reply:* no, to put this into yaml file

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 09:30:37
      +
      +

      *Thread Reply:* OK

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 09:32:51
      +
      +

*Thread Reply:* i put in the yaml
  OPENLINEAGE_DISABLED=false
  OPENLINEAGE_URL=http://10.0.19.7:5000
  AIRFLOW__OPENLINEAGE__NAMESPACE=food_delivery
and i will stop and start and let's see how it goes

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2023-12-20 14:04:37
      +
      +

      *Thread Reply:* out of curiosity, do you use Astronomer bootstrap solution to spinup airflow with openlineage?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-19 04:45:13
      +
      +

i used this post
https://openlineage.io/docs/guides/airflow_proxy/

      +
      +
      openlineage.io
      + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-12-19 08:49:49
      +
      +

*Thread Reply:* For the newest Airflow provider documentation, please look at https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/index.html

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-20 02:37:02
      +
      +

Hi team, got this error with OL 1.6.2 on dbr aws:
23/12/20 07:10:18 ERROR DriverDaemon$: XXX Fatal uncaught exception. Terminating driver.
java.lang.IllegalStateException: LiveListenerBus is stopped.
    at org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:109)
    at org.apache.spark.scheduler.LiveListenerBus.addToSharedQueue(LiveListenerBus.scala:66)
    at org.apache.spark.sql.QueryProfileListener$.initialize(QueryProfileListener.scala:122)
    at com.databricks.backend.daemon.driver.DatabricksILoop$.$anonfun$executeDependedOperations$1(DatabricksILoop.scala:652)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1709)
    at com.databricks.unity.UCSUniverseHelper$.withNewScope(UCSUniverseHelper.scala:8)
    at com.databricks.backend.daemon.driver.DatabricksILoop$.executeDependedOperations(DatabricksILoop.scala:580)
    at com.databricks.backend.daemon.driver.DatabricksILoop$.initializeSharedDriverContext(DatabricksILoop.scala:448)
    at com.databricks.backend.daemon.driver.DatabricksILoop$.getOrCreateSharedDriverContext(DatabricksILoop.scala:294)
    at com.databricks.backend.daemon.driver.DriverCorral.driverContext(DriverCorral.scala:292)
    at com.databricks.backend.daemon.driver.DriverCorral.<init>(DriverCorral.scala:159)
    at com.databricks.backend.daemon.driver.DriverDaemon.<init>(DriverDaemon.scala:71)
    at com.databricks.backend.daemon.driver.DriverDaemon$.create(DriverDaemon.scala:452)
    at com.databricks.backend.daemon.driver.DriverDaemon$.initialize(DriverDaemon.scala:546)
    at com.databricks.backend.daemon.driver.DriverDaemon$.wrappedMain(DriverDaemon.scala:511)
    at com.databricks.DatabricksMain.$anonfun$main$1(DatabricksMain.scala:149)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.DatabricksMain.$anonfun$withStartupProfilingData$1(DatabricksMain.scala:498)
    at com.databricks.logging.UsageLogging.$anonfun$recordOperation$1(UsageLogging.scala:571)
    at com.databricks.logging.UsageLogging.executeThunkAndCaptureResultTags$1(UsageLogging.scala:666)
    at com.databricks.logging.UsageLogging.$anonfun$recordOperationWithResultTags$4(UsageLogging.scala:684)
    at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:426)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:196)
    at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:424)
    at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:418)
    at com.databricks.DatabricksMain.withAttributionContext(DatabricksMain.scala:91)
    at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:470)
    at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:455)
    at com.databricks.DatabricksMain.withAttributionTags(DatabricksMain.scala:91)
    at com.databricks.logging.UsageLogging.recordOperationWithResultTags(UsageLogging.scala:661)
    at com.databricks.logging.UsageLogging.recordOperationWithResultTags$(UsageLogging.scala:580)
    at com.databricks.DatabricksMain.recordOperationWithResultTags(DatabricksMain.scala:91)
    at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:571)
    at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:540)
    at com.databricks.DatabricksMain.recordOperation(DatabricksMain.scala:91)
    at com.databricks.DatabricksMain.withStartupProfilingData(DatabricksMain.scala:498)
    at com.databricks.DatabricksMain.main(DatabricksMain.scala:148)
    at com.databricks.backend.daemon.driver.DriverDaemon.main(DriverDaemon.scala)
Dbr spark 3.4, jre 1.8
Is spark 3.4 not supported for OL as of yet?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-20 02:37:29
      +
      +

      *Thread Reply:* OL 1.3.1 works fine btw... This error only pops up with OL 1.6.2

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-20 03:00:26
      +
      +

      *Thread Reply:* Error happens when starting the cluster itself

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-20 04:56:09
      +
      +

*Thread Reply:* spark 3.4 yes, but databricks runtime 14.x was not tested so far (edited)

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-20 09:00:12
      +
      +

*Thread Reply:* I've just tested the existing Databricks integration on the latest dbr 14.2 and spark 3.5. The Databricks integration tests we have are passing. So please share more details on how you end up with logs like the above.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-21 01:59:54
      +
      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2328 -> link to tests being run

      +
      + + + + + + + +
      +
      Labels
      + integration/spark, ci, tests +
      + +
      +
      Comments
      + 1 +
      + + + + + + + + + + +
      + + + +
      + 👀 Anirudh Shrinivason +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-29 03:12:16
      +
      +

      *Thread Reply:* Hi @Paweł Leszczyński sorry missed out on this earlier... Actually, I got this while trying to simply start the dbr cluster with the OL spark configs. Nothing else done from my end for this. Works with OL 1.3.1 but not with 1.6.2

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-29 03:14:03
      +
      +

*Thread Reply:* there was an issue with 1.6.2 related to a logging class on the classpath which may be responsible for this, but it got solved in the recent release.

      + + + +
      + 👍 Anirudh Shrinivason +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 05:39:45
      +
      +

hi
does someone use Airflow lineage?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 05:40:58
      +
      +

      *Thread Reply:* hey Zacay, I’ve already tried to help you here: https://openlineage.slack.com/archives/C01CK9T7HKR/p1702980384927369?thread_ts=1702979105.626809&cid=C01CK9T7HKR

      +
      + + +
      + + + } + + Jakub Dardziński + (https://openlineage.slack.com/team/U02S6F54MAB) +
      + + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2023-12-20 08:57:07
      +
      +

      *Thread Reply:* thanks for the help

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-20 08:57:26
      +
      +

*Thread Reply:* did you manage to make it work?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2023-12-20 14:04:58
      +
      +

      *Thread Reply:* out of curiosity, do you use Astronomer bootstrap solution to spinup airflow with openlineage?

      + + + +
      +
      +
      +
      + + + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2023-12-20 14:05:55
      +
      +

Hi Everyone
I was trying to add an extra facet at the job level but I am not able to do that.
To explain more: assume d1 (input) and d2 (output) are databases, and a Python file with classes and functions converts d1 to d2. For this, I am extracting some info about that Python file using an external Python parser and want to attach it to the job that sits between d1 and d2. I tried adding:
```custom_facets = {
    "customKey": "customValue",
    "anotherKey": "anotherValue"
    # Add more custom facets as needed
}

# Creating a Job with custom facets
job = Job(
    namespace=file_name,
    name=single_fn_info.name,
    facets=custom_facets  # Include the custom facets here
)```
      +
      + +

but it is not working. I looked into this documentation https://openlineage.io/docs/spec/facets/custom-facets/ and tried, but it's still not showing any facet in the UI. Currently, the facets section for every job only shows a root dict which is blank and nothing else. So what am I missing, and how can we implement this?
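One possible cause, sketched below under the assumption that the plain string values are the problem: the Python client serializes each entry in facets as a facet object (an attrs class extending BaseFacet) keyed by facet name, not as a bare string. The facet class and key here are invented for illustration; file_name and single_fn_info come from the snippet above:

```python
import attr
from openlineage.client.facet import BaseFacet
from openlineage.client.run import Job

@attr.s
class PythonParserFacet(BaseFacet):
    # hypothetical fields produced by the external Python parser
    customKey: str = attr.ib()
    anotherKey: str = attr.ib()

job = Job(
    namespace=file_name,
    name=single_fn_info.name,
    # facet name -> facet object, rather than string -> string
    facets={"pythonParser": PythonParserFacet(customKey="customValue",
                                              anotherKey="anotherValue")},
)
```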

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-21 05:12:26
      +
      +

@Michael Robinson can we vote for the OL release? #2319 brings a significant fix to Spark logging and I think it's worth releasing without waiting for the release cycle.

      + + + +
      + ➕ Paweł Leszczyński, Jakub Dardziński, Maciej Obuchowski, Rodrigo Maia, Michael Robinson, harsh loomba +
      + +
      + 👍 Michael Robinson +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-12-21 09:55:59
      +
      +

      *Thread Reply:* Thanks, @Paweł Leszczyński, the release is authorized and will be initiated as soon as possible within 2 business days.

      + + + +
      + 🙌 Paweł Leszczyński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-12-21 10:54:23
      +
      +

      *Thread Reply:* Changelog PR: https://github.com/OpenLineage/OpenLineage/pull/2331

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2023-12-21 13:00:32
      +
      +

*Thread Reply:* @Jakub Dardziński any progress on this one? https://github.com/apache/airflow/pull/35794 - wondering if this could have been part of the release

      +
      + + + + + + + +
      +
      Labels
      + provider:amazon-aws, area:providers, provider:openlineage +
      + +
      +
      Comments
      + 3 +
      + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-21 13:02:36
      +
      +

      *Thread Reply:* I'm dependent on Airflow committers, pinging in the PR from time to time

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-21 13:02:57
      +
      +

      *Thread Reply:* if you wish to comment that you need the PR it might be taken into consideration as well

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2023-12-21 13:13:47
      +
      +

*Thread Reply:* wait, this will be supported by the openlineage-airflow package as well, right? I see you have made changes in the airflow provider package but I don't see changes in the standalone openlineage repo 🤔

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-21 13:15:22
      +
      +

*Thread Reply:* We aim to make as few changes as possible to the openlineage-airflow package to encourage users to use the provider one. This change would be quite huge and probably won't be backported.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2023-12-21 13:17:54
      +
      +

*Thread Reply:* We have not yet moved to the provider packages because our team is still making a decision. And the move to the provider package needs me to change a lot, so i would rather have this feature in the standalone package.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Hitesh + (splicer9904@gmail.com) +
      +
      2023-12-21 05:25:29
      +
      +

Hi Team
I am trying to send OpenLineage events to a Kafka eventhub and for that, I am passing spark.openlineage.transport.properties.sasl.jaas.config as org.apache.kafka.common.security.plain.PlainLoginModule required username=\"connection_string\" password=\"connection_string\";

When I run the job, I get the error Value not specified for key 'username' in JAAS config.

I am getting the connection string from the Kafka eventhub properties, and I'm running openlineage-spark-1.4.1.

Can someone please help me out?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-12-21 09:20:53
      +
      +

      *Thread Reply:* I don't think we can parse those non-escaped spaces as regular properties

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-12-21 09:23:01
      +
      +

*Thread Reply:* can you try something like
"org.apache.kafka.common.security.plain.PlainLoginModule required username='connection_string' password='connection_string'"
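In property form, that suggestion would look roughly like the sketch below (topic and bootstrap-server values are placeholders, and the surrounding SASL properties are assumptions about a typical SASL/PLAIN setup, not a verified Event Hubs config):

spark.openlineage.transport.type=kafka
spark.openlineage.transport.topicName=<topic>
spark.openlineage.transport.properties.bootstrap.servers=<namespace>.servicebus.windows.net:9093
spark.openlineage.transport.properties.security.protocol=SASL_SSL
spark.openlineage.transport.properties.sasl.mechanism=PLAIN
# single quotes inside the JAAS value avoid the escaped-double-quote parsing problem
spark.openlineage.transport.properties.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username='connection_string' password='connection_string';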

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Abdallah + (abdallah@terrab.me) +
      +
      2023-12-21 11:14:44
      +
      +

Hello,

I hope you are doing well.

I am wondering if any of you had the same issue before.

Thank you.
23/12/21 16:01:26 WARN DatabricksEnvironmentFacetBuilder: Failed to load dbutils in OpenLineageListener:
java.util.NoSuchElementException: None.get
    at scala.None$.get(Option.scala:529)
    at scala.None$.get(Option.scala:527)
    at com.databricks.dbutils_v1.impl.DbfsUtilsImpl._driverContext$lzycompute(DbfsUtilsImpl.scala:27)
    at com.databricks.dbutils_v1.impl.DbfsUtilsImpl._driverContext(DbfsUtilsImpl.scala:26)
    at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.$anonfun$driverContext$1(DbfsUtilsImpl.scala:29)
    at com.databricks.dbutils_v1.impl.DBUtilsV1Utils$.checkLocalDriver(DBUtilsV1Impl.scala:61)
    at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.driverContext(DbfsUtilsImpl.scala:29)
    at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.sc(DbfsUtilsImpl.scala:30)
    at com.databricks.dbutils_v1.impl.DbfsUtilsImpl._core$lzycompute(DbfsUtilsImpl.scala:32)
    at com.databricks.dbutils_v1.impl.DbfsUtilsImpl._core(DbfsUtilsImpl.scala:32)
    at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.$anonfun$core$1(DbfsUtilsImpl.scala:34)
    at com.databricks.dbutils_v1.impl.DBUtilsV1Utils$.checkLocalDriver(DBUtilsV1Impl.scala:61)
    at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.core(DbfsUtilsImpl.scala:34)
    at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.mounts(DbfsUtilsImpl.scala:166)
    at io.openlineage.spark.agent.facets.builder.DatabricksEnvironmentFacetBuilder.getDatabricksMountpoints(DatabricksEnvironmentFacetBuilder.java:142)
    at io.openlineage.spark.agent.facets.builder.DatabricksEnvironmentFacetBuilder.getDatabricksEnvironmentalAttributes(DatabricksEnvironmentFacetBuilder.java:98)
    at io.openlineage.spark.agent.facets.builder.DatabricksEnvironmentFacetBuilder.build(DatabricksEnvironmentFacetBuilder.java:60)
    at io.openlineage.spark.agent.facets.builder.DatabricksEnvironmentFacetBuilder.build(DatabricksEnvironmentFacetBuilder.java:32)
    at io.openlineage.spark.api.CustomFacetBuilder.accept(CustomFacetBuilder.java:40)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.lambda$null$27(OpenLineageRunEventBuilder.java:491)
    at java.lang.Iterable.forEach(Iterable.java:75)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.lambda$buildRunFacets$28(OpenLineageRunEventBuilder.java:491)
    at java.util.ArrayList.forEach(ArrayList.java:1259)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRunFacets(OpenLineageRunEventBuilder.java:491)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.populateRun(OpenLineageRunEventBuilder.java:313)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:289)
    at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:250)
    at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.start(SparkSQLExecutionContext.java:167)
    at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$onJobStart$10(OpenLineageSparkListener.java:151)
    at java.util.Optional.ifPresent(Optional.java:159)
    at io.openlineage.spark.agent.OpenLineageSparkListener.onJobStart(OpenLineageSparkListener.java:147)
    at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
    at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:39)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:39)
    at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:118)
    at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:102)
    at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:107)
    at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:107)
    at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:102)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:98)
    at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1639)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:98)

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-12-21 12:01:59
      +
      +

      *Thread Reply:* Can you provide additional details: what is your version of OpenLineage integration, Spark, Databricks; how you're running your job, additional logs

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Abdallah + (abdallah@terrab.me) +
      +
      2023-12-21 12:02:42
      +
      +

      *Thread Reply:* Version of OL 1.2.2

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Abdallah + (abdallah@terrab.me) +
      +
      2023-12-21 12:03:07
      +
      +

      *Thread Reply:* spark_version: 11.3.x-scala2.12

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Abdallah + (abdallah@terrab.me) +
      +
      2023-12-21 12:03:32
      +
      +

      *Thread Reply:* Running job through spark-submit

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2023-12-22 06:03:06
      +
      +

      *Thread Reply:* @Abdallah could you try to upgrade to 1.7.0 that was released lately?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-12-21 13:08:08
      +
      +

@channel
We released OpenLineage 1.7.0!
In this release, we turned off support for Airflow versions >=2.8.0 in the Airflow integration and added a parent run facet to COMPLETE and FAIL events, also in Airflow. If you're on the most recent release of Airflow and wish to continue receiving events from Airflow after upgrading, use the OpenLineage Airflow Provider instead.

Added
• Airflow: add parent run facet to COMPLETE and FAIL events in Airflow integration #2320 @kacpermuda
  Adds a parent run facet to all events in the Airflow integration.

Removed
• Airflow: remove Airflow 2.8+ support #2330 @kacpermuda
  To encourage use of the Provider, this removes the listener from the plugin if the Airflow version is >=2.8.0.

A number of bug fixes were released as well, including:
• Airflow: repair up.sh for MacOS #2316 #2318 @kacpermuda
  Some scripts were not working well on MacOS. This adjusts them.
• Airflow: repair run_id for FAIL event in Airflow 2.6+ #2305 @kacpermuda
  The Run_id in a FAIL event was different than in the START event for Airflow 2.6+.
• Flink: name Kafka datasets according to the naming convention #2321 @pawel-big-lebowski
  Adds a kafka:// prefix to Kafka topic datasets' namespaces.
• Spec: fix inconsistency with Redshift authority format #2315 @davidjgoss
  Amends the Authority format for consistency with other references in the same section.

Thanks to all the contributors, including new contributor @Kacper Muda!
Release: https://github.com/OpenLineage/OpenLineage/releases/tag/1.7.0
Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md
Commit history: https://github.com/OpenLineage/OpenLineage/compare/1.6.2...1.7.0
Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage
PyPI: https://pypi.org/project/openlineage-python/

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2023-12-21 13:10:40
      +
      +

      *Thread Reply:* Shoutout to @Kacper Muda for huge contribution from the very start 🚀

      + + + +
      + :gratitude_thank_you: Kacper Muda, Sheeri Cabral (Collibra) +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-12-21 13:11:22
      +
      +

      *Thread Reply:* +1 on that! I think there were even more changes from Kacper than are listed here

      + + + +
      + :gratitude_thank_you: Kacper Muda +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Kacper Muda + (kacper.muda@getindata.com) +
      +
      2023-12-21 14:43:38
      +
      +

*Thread Reply:* Thanks! Just a quick note: the listener API was introduced in Airflow 2.3, so it was already missing in the plugin for Airflow <2.3 - I made no changes there. I just removed it for >=2.8.0, to encourage use of the provider, as You said 🙂 But the result is exactly as You said, there is no listener in <2.3 and >=2.8 😄

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Kacper Muda + (kacper.muda@getindata.com) +
      +
      2023-12-21 14:45:57
      +
      +

*Thread Reply:* So as a result: I don't think we turned off support for Airflow versions &lt;2.3.0 🙂

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2023-12-21 14:46:48
      +
      +

      *Thread Reply:* This is good to know — thanks. I’ve updated the notes here and will do so elsewhere, as well.

      + + + +
      + 🙌 Kacper Muda, Jakub Dardziński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2023-12-30 14:33:21
      +
      +

*Thread Reply:* Hi @Jakub Dardziński, previously, for a custom operator, we used to write a custom extractor file and save it with the other extractors under the Airflow integration folder.

      + +

What is the procedure now with this new provider update?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Kacper Muda + (kacper.muda@getindata.com) +
      +
      2024-01-01 05:08:40
      +
      +

*Thread Reply:* Hey @Shahid Shaikh, to my understanding nothing has changed regarding Custom Extractors: You can still use them if You wish to (see the bottom of these docs to learn how they can be registered in the provider package). However, in my opinion, the best way to use the provider package is to implement OpenLineage methods directly in the operators, as described here. Let me know if this answers Your question.
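For illustration, a hedged sketch of that operator-level approach, assuming the airflow.providers.openlineage package; the operator, namespace, and dataset names here are made up:

```python
# Hypothetical custom operator exposing lineage directly, per the provider docs.
from airflow.models import BaseOperator
from airflow.providers.openlineage.extractors import OperatorLineage
from openlineage.client.run import Dataset


class MyCustomOperator(BaseOperator):
    def execute(self, context):
        ...  # the operator's actual work

    def get_openlineage_facets_on_start(self) -> OperatorLineage:
        # Called by the provider's listener when the task starts.
        return OperatorLineage(
            inputs=[Dataset(namespace="postgres://db-host:5432", name="public.source_table")],
            outputs=[Dataset(namespace="postgres://db-host:5432", name="public.target_table")],
        )
```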

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2024-01-04 04:46:34
      +
      +

*Thread Reply:* Yes, thanks @Kacper Muda. I referred to the docs and was able to do the work as you said by using the provider package directly.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2023-12-28 00:46:57
      +
      +

Hi Everyone +I was looking into the Airflow integration with Marquez. I ran a number of DAGs and observed that each task in every DAG creates a new job on Marquez, and we are not able to see any job-to-job linkage on the Marquez map. Why is it not showing? How can we add this feature?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2023-12-28 05:57:49
      +
      +

Hey All! +I've tested merge operations for Spark (on Databricks) and the OpenLineage result. So far I have failed to produce any inputs/outputs for the jobs, even though the data schema is present in the logical plan attribute of the JSON OL event.

      + +

My test consisted of: +• Reading from a parquet/CSV file in DBFS (Databricks file storage) +• Creating a temporary table +• Performing the merge with spark.sql("merge... target...source ") with the target being a table in Hive. +Test variables: +• Source: Parquet/CSV +• Target: Hive Table +• OL x Spark versions: + ◦ OL 1.6.2 -> Spark 3.3.2 + ◦ OL 1.6.2 -> Spark 3.2.1 + ◦ OL 1.0.0 -> Spark 3.2.1 +I've created a pdf with some code samples and OL input and output attributes. A condensed sketch of the scenario follows below.
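A minimal PySpark sketch of that scenario, assuming a Delta-backed target table (MERGE INTO requires Delta outside Databricks defaults); paths and table names are placeholders:

```python
# Hedged reproduction sketch of the merge test described above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ol-merge-test").enableHiveSupport().getOrCreate()

# 1. Read a source file from (DBFS-like) storage.
source = spark.read.option("header", "true").csv("dbfs:/tmp/source.csv")

# 2. Register it as a temporary view.
source.createOrReplaceTempView("source_view")

# 3. Merge into the Hive/Delta target table.
spark.sql("""
    MERGE INTO default.target_table AS t
    USING source_view AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```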

      + + +
      + + + + + + + +
      + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2023-12-28 05:59:53
      +
      +

      *Thread Reply:* @Sai @Anirudh Shrinivason Did you have any luck with the merge events?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2023-12-28 06:00:51
      +
      +

      *Thread Reply:* @Paweł Leszczyński I know you were working on this in the past. Am I missing something here?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2023-12-28 06:02:23
      +
      +

*Thread Reply:* Should I test with more recent versions of Spark and the latest OL?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-28 06:31:04
      +
      +

      *Thread Reply:* Are you able to verify if the issue is databricks runtime specific or the same happens on vanilla Spark docker containers?

      + +

      We do have an integration test for merge into and delta tables here -> https://github.com/OpenLineage/OpenLineage/blob/17fa874083850b711f364c4de656e14d79[…]/java/io/openlineage/spark/agent/SparkDeltaIntegrationTest.java

      + +

      would this test fail on databricks as well?

      +
      + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2023-12-28 08:30:56
      +
      +

      *Thread Reply:* the test on databricks fails to generate inputs and outputs for the merge operation

      + +

      "job": { + "namespace": "default", + "name": "dbc-a9e3696a-291f_cloud_databricks_com_execute_merge_into_command_edge", + "facets": {} + }, + "inputs": [], + "outputs": []

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2023-12-28 08:32:11
      +
      +

*Thread Reply:* The create view also fails to generate inputs and outputs (but I don't know if this one is supposed to) +"job": { + "namespace": "default", + "name": "dbc-a9e3696a-291f_cloud_databricks_com_execute_create_view_command", + "facets": {} + }, + "inputs": [], + "outputs": []

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + + +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anirudh Shrinivason + (anirudh.shrinivason@grabtaxi.com) +
      +
      2023-12-28 23:00:10
      +
      +

*Thread Reply:* Hey @Rodrigo Maia, I'm using OL version 1.3.1, and it seems to work for most MERGE INTO cases, though not all...

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2023-12-29 02:36:14
      +
      +

*Thread Reply:* @Rodrigo Maia will try to add the existing testMergeInto to be run on Databricks and see why it is failing

      + + + +
      + 🙌 Rodrigo Maia +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-02 08:39:50
      +
      +

*Thread Reply:* Looking into this and hopefully will have a solution this week

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-03 06:27:09
      +
      +

      *Thread Reply:* https://github.com/OpenLineage/OpenLineage/pull/2348

      +
      + + + + + + + +
      +
      Labels
      + documentation, integration/spark +
      + +
      +
      Comments
      + 1 +
      + + + + + + + + + + +
      + + + +
      + ❤️ Rodrigo Maia +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Sheeri Cabral (Collibra) + (sheeri.cabral@collibra.com) +
      +
      2024-01-02 10:47:11
      +
      +

      Hello! My company has a job opening for a Senior Data Engineer in our Data Office, in Czech Republic - https://www.linkedin.com/feed/update/urn:li:activity:7147975999253610496/ DM me here or on LinkedIn if you have questions.

      +
      +
      linkedin.com
      + + + + + + + + + + + + + + + + + +
      + + + +
      + 🔥 Maciej Obuchowski +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Harel Shein + (harel.shein@gmail.com) +
      +
      2024-01-02 11:00:58
      +
      +

      *Thread Reply:* created a #jobs channel, as we’re continuing to build a strong community here 🙂

      + + + +
      + 🚀 Jakub Dardziński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2024-01-02 11:30:00
      +
      +

      @channel +This Friday, Jan. 4, is the last day to respond to the 2023 Ecosystem Survey, which will close at 5 pm ET that day. It’s an opportunity to tell us about your organization’s lineage needs and priorities for the purpose of updating the project roadmap. For example, you can tell us: +• which data connectors you’d like OL to support +• which cloud data platforms you’d like OL to support +• which additional orchestrators you’d like OL to support +• and more. +Your feedback is very important. Thanks in advance!

      +
      +
      Google Docs
      + + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Martins Cardoso + (rodrigo.cardoso1@ibm.com) +
      +
      2024-01-02 12:27:27
      +
      +

      Hi everyone! Just passing by for a quick question: after reading SparkColumnLineage and ColumnLineageDatasetFacet.json I can see in the future work section of the 1st URL that

      + +
      +

      Current version of the mechanism allows finding input fields that were used to produce the output field but does not determine how were they used + This means that the transformationDescription and transformationType from ColumnLineageDatasetFacet are not sent in the events?

      +
      + +

      Thanks in advance for any inputs! Have a nice day.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-03 04:59:48
      +
      +

*Thread Reply:* Currently, the Spark integration does not fill the transformation fields. Additionally, there's a proposal to improve this: https://github.com/OpenLineage/OpenLineage/issues/2186

      +
      + + + + + + + +
      +
      Labels
      + proposal +
      + +
      +
      Comments
      + 5 +
      + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Martins Cardoso + (rodrigo.cardoso1@ibm.com) +
      +
      2024-01-03 05:17:09
      +
      +

      *Thread Reply:* Thanks a lot for the clarification!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Vinay R + (vinayrme58@gmail.com) +
      +
      2024-01-03 11:33:45
      +
      +

      Hi,

      + +

      I'm currently working on extracting transformation logic from Redshift SQL insert statements. For instance, in the query 'SELECT A+B as Col1 FROM table1,' I'm trying to fetch the transformation logic 'A+B' for the column 'Col1.' I've been using regex, but it has limitations with subqueries, unions, and CTEs. Are there any specialized tools or suggestions for improving this process, especially when dealing with complex SQL structures like subqueries and unions?

      + +

      Queries are in rsql format. Thanks in advance any help would be appreciated.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2024-01-03 14:04:27
      +
      +

*Thread Reply:* We don’t parse what the actual transformations are in sqlparser
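For what it's worth, one hedged alternative outside OpenLineage is a general-purpose SQL parser such as the open-source sqlglot library, which handles subqueries, unions, and CTEs; a minimal sketch, assuming its Redshift dialect is close enough to rsql:

```python
# Illustrative only: extracting per-column transformation logic with sqlglot.
import sqlglot
from sqlglot import exp

sql = "SELECT A + B AS Col1 FROM table1"
tree = sqlglot.parse_one(sql, read="redshift")

for projection in tree.selects:
    name = projection.alias_or_name
    # For aliased projections, the transformation is the wrapped expression.
    logic = projection.this if isinstance(projection, exp.Alias) else projection
    print(name, "<-", logic.sql())  # Col1 <- A + B
```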

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Mariusz Górski + (gorskimariusz13@gmail.com) +
      +
      2024-01-03 12:14:41
      +
      +

Hey, I’ve submitted an update to the OL docs re Spark - anyone fancy conducting a review? 👀 🙏

      + +

      https://github.com/OpenLineage/docs/pull/268

      +
      + + + + + + + +
      +
      Comments
      + 1 +
      + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2024-01-04 04:43:48
      +
      +

Hi everyone, is job-to-job linkage possible in OpenLineage?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2024-01-04 04:47:05
      +
      +

      *Thread Reply:* OpenLineage focuses on data lineage which means we don’t explicitly track job to job lineage but rather job -> dataset -> job

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2024-01-04 04:53:24
      +
      +

*Thread Reply:* Thanks for the insight, Jakub! That makes sense. I'm curious, though: is there a way within OpenLineage to achieve this kind of linkage?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2024-01-04 04:54:53
      +
      +

*Thread Reply:* Sure, you can create a custom facet that would be used to expose such relations. +In terms of Airflow, there is an Airflow-specific run facet that contains upstream and downstream task IDs, but it’s informative rather than a way to achieve explicitly what you’re trying to do
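As an illustration of the custom-facet route, a hedged sketch using the Python client's attr-based facets; the facet name and field are invented, not part of the spec:

```python
# Hypothetical run facet exposing upstream jobs; illustrative only.
import attr
from openlineage.client.facet import BaseFacet


@attr.s
class UpstreamJobsRunFacet(BaseFacet):
    # e.g. ["my-namespace/upstream_job_a"]
    upstreamJobs: list = attr.ib(factory=list)
```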

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2024-01-04 05:02:37
      +
      +

*Thread Reply:* Thanks for clarifying that, Jakub. I'll look into that as you suggested.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2024-01-04 12:25:16
      +
      +

      @channel +UK-based members: a speaker is needed for a meetup with Confluent on January 31st in London. Please reply or DM me if you’re interested in speaking.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2024-01-04 13:21:25
      +
      +

      @channel +This month’s TSC meeting is next Thursday the 11th at 10am PT. On the tentative agenda: +• announcements +• recent releases +• open discussion +• more (TBA) +More info and the meeting link can be found on the website. All are welcome! Do you have a discussion topic, use case or integration you’d like to demo? DM me to be added to the agenda.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      William Sia + (willsia@amazon.com) +
      +
      2024-01-08 03:35:03
      +
      +

Hi everyone, I am trying to use the datasetVersion facet but want to confirm that I am using it correctly; at the moment the behaviour seems a bit weird. Basically, we have a pipeline like this: +sourceData1 -> job1 -> datasetA

      + +

      dataSetA v1, sourceData2 -> job2 -> datasetB

      + +

But a job can be configured to use a specific version of an incoming dataset, so in the above example job2 has been configured to always use datasetA v1. So if I run job1 and it produces datasetA v2, then run job2, it should still pick up v1.

      + +

      When we're integrating our pipeline by emitting events to OpenLineage ( Marquez ), we include +"inputs": [ + { + "namespace": "test.namespace", + "name": "datasetA", + "facets": { + "version": { + "_producer": "<https://some.producer.com/version/1.0>", + "_schemaURL": "<https://openlineage.io/spec/facets/1-0-0/DatasetVersionDatasetFacet.json>", + "datasetVersion": "1" + } + } + } + ], +"datasetVersion":"1" -> this refers to the "datasetVersion" of output dataset of job1

      + +

But what happens at the moment is that job2 always uses the latest version of datasetA, and it rewrites the latest version's version of datasetA to 1

      + +

So I'm just wondering whether I am emitting the events wrongly or datasetVersion is not meant to be used for this purpose
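For comparison, the same pinned input expressed through the Python client would look roughly like this; a hedged sketch, assuming openlineage-python's facet classes:

```python
# Pinning job2's input to datasetA v1 via the version facet, mirroring the JSON above.
from openlineage.client.facet import DatasetVersionDatasetFacet
from openlineage.client.run import Dataset

dataset_a_v1 = Dataset(
    namespace="test.namespace",
    name="datasetA",
    facets={"version": DatasetVersionDatasetFacet(datasetVersion="1")},
)
```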

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-08 04:57:43
      +
      +

*Thread Reply:* Hi @William Sia, the OpenLineage-Spark integration fetches the latest version of the dataset. If you're deliberately reading some older one, the OpenLineage event will still point to the latest, which is kind of a bug or missing feature. Feel free to create an issue for this. If possible, please provide a code snippet to reproduce it. Is this happening on delta/iceberg?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-08 09:46:16
      +
      +

      *Thread Reply:* > But what happened at the moment is, job2 always use the latest version of datasetA and it rewrites the latest version's version of datasetA to 1 +You mean that event contains different version than you emit? Or do you mean you see different thing on the Marquez UI?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      William Sia + (willsia@amazon.com) +
      +
      2024-01-08 21:56:19
      +
      +

*Thread Reply:* @Maciej Obuchowski, what I meant is that I am expecting my job2 to have lineage to the input of datasetA v1 (because I specify the version facet of the dataset input as 1), but what happened is that job2 has lineage to the latest version of datasetA, and the datasetVersion parameter of the Version Facet of the latest version is now modified to 1 (it was 2 because there was a run of job1 that updated the version to 2). So job2, which has an input of datasetA, modified the Version Facet of the latter.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      William Sia + (willsia@amazon.com) +
      +
      2024-01-08 21:58:52
      +
      +

*Thread Reply:* Cheers @Paweł Leszczyński, will raise an issue with more details on GitHub. At the moment we're enabling the application that my team is working on to emit events, so I am emitting the events manually to mimic the data flow in our system.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-09 07:32:02
      +
      +

*Thread Reply:* > (because I specify the version facet of the dataset input as 1) +So what I understand is that you emit an event for job2 that has datasetA as input with version 1, as shown above

      + +

> but what happened is job2 has lineage to the latest version of datasetA +On the UI, I understand? I think it's a Marquez issue - properly displaying which dataset version you're reading. Can you post a screenshot on the Marquez issue showing exactly where you'd expect to see different data?

      + +

> datasetVersion parameter of the Version Facet of the latest version is now modified to 1 (it was 2 because there was a run of job1 that updated the version to 2). +Yeah, if job1 updated datasetA to the next version, then it makes sense.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      William Sia + (willsia@amazon.com) +
      +
      2024-01-11 22:08:49
      +
      +

*Thread Reply:* I had raised this issue with some sample payloads here: https://github.com/MarquezProject/marquez/issues/2733. Let me know if you need more details

      +
      + + + + + + + +
      +
      Comments
      + 1 +
      + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2024-01-08 12:55:01
      +
      +

@channel +Meetup alert: our first OpenLineage x Apache Kafka® meetup will be happening on January 31st (in-person) at 6:00 PM in Confluent’s London offices. Keep an eye out for more details and sign up here: https://www.meetup.com/london-openlineage-meetup-group/events/298420417/?utmmedium=referral&utmcampaign=share-btnsavedeventssharemodal&utmsource=link

      +
      +
      Meetup
      + + + + + + + + + + + + + + + + + +
      + + + +
      + 🚀 alexandre bergere, Maciej Obuchowski +
      + +
      + ❤️ Willy Lulciuc, Maciej Obuchowski, Eric Veleker +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2024-01-09 03:06:38
      +
      +

Hi, +are you familiar with the Spline solution? +Do you know if we can use it on Databricks without adding any code to the notebooks in order to get the notebook name?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Damien Hawes + (damien.hawes@booking.com) +
      +
      2024-01-09 03:42:24
      +
      +

      *Thread Reply:* Spline is maintained by an entirely different group of people. Thus, we aren't familiar with the semantics of Spline. I suggest that you reach out to the maintainers of Spline to understand more about its capabilities.

      + + + +
      + ➕ Maciej Obuchowski +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Zacay Daushin + (zacayd@octopai.com) +
      +
      2024-01-09 03:42:57
      +
      +

      *Thread Reply:* thanks

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anand Thamothara Dass + (anand_thamotharadass@cable.comcast.com) +
      +
      2024-01-09 08:57:33
      +
      +

Hi folks, we are on MWAA-supported Airflow 2.7.2. For the OpenLineage integration, is there a Kinesis transport type to emit the events?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2024-01-09 09:08:01
      +
      +

*Thread Reply:* there’s no Kinesis transport available in Python at the moment

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anand Thamothara Dass + (anand_thamotharadass@cable.comcast.com) +
      +
      2024-01-09 09:12:04
      +
      +

*Thread Reply:* Thank you @Jakub Dardziński. Can we expect support in the future? Any suggested alternative till then?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2024-01-09 09:13:36
      +
      +

*Thread Reply:* that’s an open source project 🙂 if someone finds it useful to contribute, or the community decides to implement this, it’ll probably be supported

      + +

I’m not sure what your use case is, but we have a Kafka transport
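For reference, wiring the Python client to the Kafka transport looks roughly like this; a hedged sketch, as config field names may differ between client versions:

```python
# Emitting OpenLineage events to a Kafka topic.
from openlineage.client import OpenLineageClient
from openlineage.client.transport.kafka import KafkaConfig, KafkaTransport

config = KafkaConfig(
    config={"bootstrap.servers": "broker-1:9092"},  # passed to the Kafka producer
    topic="openlineage.events",
    flush=True,  # flush the producer after each emit
)
client = OpenLineageClient(transport=KafkaTransport(config))
```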

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anand Thamothara Dass + (anand_thamotharadass@cable.comcast.com) +
      +
      2024-01-09 09:20:48
      +
      +

*Thread Reply:* Thanks @Jakub Dardziński. We have limited support for Kafka and we are widely using Kinesis. The use case is that we are trying to build an in-house lineage store where we can emit these events. If Kafka is the only streaming option available, we can give it a try.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Damien Hawes + (damien.hawes@booking.com) +
      +
      2024-01-09 09:52:16
      +
      +

      *Thread Reply:* I'd look to see if AWS offers something like KafkaProxy, but for Kinesis. That is, you emit via HTTP and it forwards to the Kinesis stream.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Anand Thamothara Dass + (anand_thamotharadass@cable.comcast.com) +
      +
      2024-01-09 10:00:24
      +
      +

*Thread Reply:* Nice, I'll take a look into it.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Harel Shein + (harel.shein@gmail.com) +
      +
      2024-01-09 10:03:55
      +
      +

      *Thread Reply:* agree with @Jakub Dardziński & @Damien Hawes. +the transport interface is also pretty straightforward, so adding support for Kinesis might be trivial - in case you do decide to contribute
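To make that concrete, a hedged sketch of what such a contribution might look like; the class shape and boto3 usage are illustrative, not an existing OL API:

```python
# Hypothetical Kinesis transport built on the Python client's Transport
# interface; error handling and config parsing omitted.
import boto3
from openlineage.client.serde import Serde
from openlineage.client.transport import Transport


class KinesisTransport(Transport):
    kind = "kinesis"

    def __init__(self, stream_name: str, region: str):
        self.stream_name = stream_name
        self.client = boto3.client("kinesis", region_name=region)

    def emit(self, event):
        self.client.put_record(
            StreamName=self.stream_name,
            Data=Serde.to_json(event).encode("utf-8"),
            PartitionKey=event.job.name,  # keep a job's events in one shard
        )
```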

      + + + +
      + 👍 Anand Thamothara Dass +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2024-01-10 14:22:18
      +
      +

      Hello everyone,

      + +

      I'm exploring OpenLineage + Marquez for lineage in our warehouse solution connecting to Snowflake. We've set up ETL jobs for various operations like Select, Update, Delete, and Merge. Is there a way to capture Change Data Capture (CDC) at the column level using OpenLineage? Our goal is to present this CDC information as facets on Marquez. +Any insights or suggestions on this would be greatly appreciated!

      + + + +
      + 👍 alexandre bergere +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Harel Shein + (harel.shein@gmail.com) +
      +
      2024-01-11 05:14:03
      +
      +

      *Thread Reply:* What are you using for CDC?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2024-01-11 16:12:11
      +
      +

      *Thread Reply:* @Harel Shein In our current setup, we typically utilize Slowly Changing Dimension (SCD) Type 2 at the source level for Change Data Capture (CDC). In this specific scenario, the source is Snowflake.

      + +

      During the ETL process, we receive a CDC trigger or event from Snowflake.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Harel Shein + (harel.shein@gmail.com) +
      +
      2024-01-12 13:01:58
      +
      +

      *Thread Reply:* what do you use to run this ETL process? If that framework is supported by openlineage, I assume it would work

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2024-01-10 15:23:47
      +
      +

      @channel +This month’s TSC meeting, open to all, is tomorrow at 10 am PT https://openlineage.slack.com/archives/C01CK9T7HKR/p1704392485753579

      +
      + + +
+ + + + + Michael Robinson + (https://openlineage.slack.com/team/U02LXF3HUN7) +
      + + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2024-01-11 04:32:41
      +
      +

      Hi Everyone

      + +

There is this search option for facets on the Marquez UI. Is there any functionality behind it?

      + +

Do we have the ability to search the lineage we are getting?

      + +
      + + + + + + + + + +
      + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Harel Shein + (harel.shein@gmail.com) +
      +
      2024-01-11 05:43:04
      +
      +

      *Thread Reply:* great question @Shahid Shaikh! I’m actually not sure, would you mind posting this question on the Marquez slack? +https://join.slack.com/t/marquezproject/shared_invite/zt-2afft44fo-nWcmZmdrv7qSxfNl6iOKMg

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Shahid Shaikh + (ssshahidwin@gmail.com) +
      +
      2024-01-11 13:35:47
      +
      +

      *Thread Reply:* Thanks @Harel Shein I will take further update from Marquez slack.

      + + + +
      + 👍 Harel Shein +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Simran Suri + (mailsimransuri@gmail.com) +
      +
      2024-01-11 04:38:39
      +
      +

      Hi everyone, +I'm currently running my Spark code on Azure Kubernetes Service (AKS), and I'm interested in knowing if OpenLineage can provide cluster details such as its name, etc. In my current setup and run events, I haven't been able to find these cluster-related details.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Harel Shein + (harel.shein@gmail.com) +
      +
      2024-01-11 05:47:13
      +
      +

      *Thread Reply:* I’m not sure if all of those details are readily available to the spark application to extract them. If you find a way, those can be reported in a custom facet. +The debug facet for Spark would be the closest option to what you’re looking for, but definitely doesn’t contain k8s cluster level data.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-11 07:01:18
      +
      +

      *Thread Reply:* Not out of the box, but if those are in environment variables, you can use +spark.openlineage.facets.custom_environment_variables=[YOUR_VARIABLE;] +to include it as CustomEnvironmentFacet
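For example, set at session build time; the environment variable name here is hypothetical:

```python
# Illustrative: forwarding a cluster-identifying environment variable into
# the emitted events via the CustomEnvironmentFacet.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aks-job")
    .config(
        "spark.openlineage.facets.custom_environment_variables",
        "[AKS_CLUSTER_NAME;]",  # hypothetical env var set on the driver
    )
    .getOrCreate()
)
```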

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Simran Suri + (mailsimransuri@gmail.com) +
      +
      2024-01-11 07:04:01
      +
      +

*Thread Reply:* Thanks, can you please share the reference link for this, so that I can try to implement it @Maciej Obuchowski

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-11 07:32:21
      + +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2024-01-11 10:12:27
      +
      +

      *Thread Reply:* Would this work if these environment variables were set at runtime?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-11 10:13:45
      +
      +

      *Thread Reply:* during runtime of a job? I don't think so, we collect them when we spawn the OpenLineage integration

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2024-01-11 10:15:01
      +
      +

      *Thread Reply:* I thought so. thank you 🙂

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2024-01-11 10:23:42
      +
      +

      *Thread Reply:* @Maciej Obuchowski By the way, coming back to this, is there any way to pass some variable value from the executing code/job to the OL spark listener? The example above is something I was looking forward to, like, setting some (ENV) var at run time and having access to this value in the OL event.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Simran Suri + (mailsimransuri@gmail.com) +
      +
      2024-01-15 03:30:06
      +
      +

*Thread Reply:* @Maciej Obuchowski, actually the cluster details are not in env variables. In my Airflow setup those are stored in the Airflow connections, and I have a connection ID for that. I'm successfully getting the connection ID details but not the associated Spark cluster name and details.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      David Goss + (david.goss@matillion.com) +
      +
      2024-01-11 09:01:29
      +
      +

      ❓ Is there a standard or well-used facet for recording the type of a dataset? e.g. Table vs View at a very simple level.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-11 09:28:11
      +
      +

*Thread Reply:* I think we think of the type of the dataset at the dataset namespace schema level: something on s3 will always be a link to a path in a bucket.

      + +

      Table vs View is a specific category of datasets that makes a lot of sense when talking about database datasets, but would not make sense when talking about object storage ones, so my naive view is that's something that should be unique to a particular data source type.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      David Goss + (david.goss@matillion.com) +
      +
      2024-01-11 09:30:10
      +
      +

      *Thread Reply:* Very fair! I think we’re probably going to use a custom facet for this for our own purposes, I just wanted to check if there was anything standard or standard-adjacent we should align to.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Julien Le Dem + (julien@apache.org) +
      +
      2024-01-11 18:59:34
      +
      +

      *Thread Reply:* This sounds good. Please share that custom facet’s schema if you’re willing. This might be a good candidate for a new optional core facet for views.
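For discussion, a hypothetical shape for such a facet in the Python client; the name and field are invented here, not part of the spec:

```python
# Hypothetical custom dataset facet distinguishing tables from views.
import attr
from openlineage.client.facet import BaseFacet


@attr.s
class DatasetTypeDatasetFacet(BaseFacet):
    datasetType: str = attr.ib(default="TABLE")  # e.g. "TABLE" or "VIEW"
```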

      + + + +
      + 👍 David Goss +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Julien Le Dem + (julien@apache.org) +
      +
      2024-01-11 19:16:03
      +
      +

As discussed in the call today, I have updated the OpenLineage registry proposal. It also includes the contributions of @Sheeri Cabral (Collibra) and feedback from the review. If you are interested (in particular @Eric Veleker, @Ernie Ostic, @Sheeri Cabral (Collibra), @Jens Pfau, but others as well), please comment on the PR. I think we are close enough to implement a first version.

      +
      + + + + + + + +
      +
      Labels
      + documentation, proposal +
      + +
      +
      Comments
      + 2 +
      + + + + + + + + + + +
      + + + +
      + ❤️ Jarek Potiuk, Michael Robinson, Paweł Leszczyński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Honey Thakuria + (Honey_Thakuria@intuit.com) +
      +
      2024-01-12 00:08:47
      +
      +

Hi everyone, +We're trying to use OpenLineage for the Presto on Spark script mentioned below, but we aren't getting CREATE or INSERT lifecycle events and are only able to get DROP events.

      + +

Could anyone help us by letting us know how we can get proper events for Presto on Spark? Any tweaks, config changes? +cc @Kiran Hiremath @Athitya Kumar +```drop table if exists schema_name.table1; +drop table if exists schema_name.table2; +drop table if exists schema_name.table3;

      + +

CREATE TABLE schema_name.table1 AS +SELECT * FROM ( + VALUES + (1, 'a'), + (2, 'b'), + (3, 'c') +) AS temp1 (id, name);

      + +

CREATE TABLE schema_name.table2 AS +SELECT * FROM ( + VALUES + (1, 'a'), + (2, 'b'), + (3, 'c') +) AS temp2 (id, name);

      + +

CREATE TABLE schema_name.table3 AS +SELECT * +FROM schema_name.table1 +UNION ALL +SELECT * +FROM schema_name.table2;```

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2024-01-13 10:54:45
      +
      +

      Hey team! 👋

      + +

      We're seeing some instances where the openlineage spark listener class seems to be running long even after the spark job has completed.

      + +

While we debug the reason why it's long-running (a huge Spark event due to partitions / JSON reads etc., which could be lagging the listener thread/JVM), we just wanted to see if there's a way we could ensure from the OpenLineage listener that none of the methods overridden from SparkListener executes for more than 2 mins (say)?

      + +

      For example, does having a configurable timeout value (like spark.openlineage.timeout.seconds=120 spark conf) and having an internal Executor + Future.get() with timeout / google's SimpleTimeLimiter make sense for this complexity for any spark event handled by OL spark listener?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-15 04:07:58
      +
      +

*Thread Reply:* yes, this would definitely make sense

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2024-01-14 22:51:40
      +
      +

      Hey team.

      + +

We're observing that the OpenLineage Spark listener runs long (sometimes even for a couple of hours) even though the Spark job has completed. We've seen a pattern that this mostly happens for jobs with 3 levels of subqueries - is there a known issue for this, wherein a huge Spark event object from the listener bus causes a long delay in the OpenLineage Spark listener's event processing due to JVM lag or OpenLineage code, etc.?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-15 04:07:11
      +
      +

*Thread Reply:* there is no known issue for this. Just to make sure: did you disable serializing the LogicalPlan and sending it via the event?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2024-01-15 04:14:00
      +
      +

      *Thread Reply:* @Paweł Leszczyński - Yup, we've set this conf: +"spark.openlineage.facets.disabled": "[spark_unknown;spark.logicalPlan]" +We've also seen logs from spark where it says that event took more than 10-15 mins to process: +[INFO ] [spark-listener-group-shared] [org.apache.spark.scheduler.AsyncEventQueue.logInfo@57] - Process of event SparkListenerSQLExecutionEnd(54,1702566454593) by listener OpenLineageSparkListener took 614.914735426s.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-15 04:24:43
      +
      +

*Thread Reply:* yeah, that's interesting. Are you able to make sure this is related to event generation and not a backend issue (like a Marquez deadlock)?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2024-01-15 04:25:58
      +
      +

      *Thread Reply:* Yup yup, we use HTTP transport which has a 30 seconds API GW timeout on our side - but we also tried with console transport type to rule out the possibility of a backend issue

      + +

      Faced the same issue with console transport type too

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-15 06:32:28
      +
      +

      *Thread Reply:* when running for a long time, does it succeed or not? Perhaps there is a cycle in logical plan. Could you turn on and attach debugFacet ?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2024-01-15 06:45:02
      +
      +

      *Thread Reply:* It runs for a long time and succeeds - for some jobs it's a matter of 1 hour, whereas we've seen jobs delaying by 4 hours as well.

      + +

      @Paweł Leszczyński - How do we enable debugFacet? Any additional spark conf to add?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-15 06:47:16
      +
      +

      *Thread Reply:* yes, spark.openlineage.debugFacet=enabled

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Kacper Muda + (kacper.muda@getindata.com) +
      +
      2024-01-15 05:29:21
      +
      +

      Hey, +can I request a patch release that will include this fix ? If there is already an upcoming release planned let me know, maybe it will be soon enough 🙂

      + + + +
      + ➕ Jakub Dardziński, Harel Shein, Maciej Obuchowski +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2024-01-16 16:01:22
      +
      +

      *Thread Reply:* I’m bumping this request. I would love to include this https://github.com/OpenLineage/OpenLineage/pull/2373 as well

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2024-01-19 09:05:54
      +
      +

      *Thread Reply:* Thanks, all. The release is authorized.

      + + + +
      + 🙌 Jakub Dardziński +
      + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2024-01-15 07:24:23
      +
      +

Hello All! +I want to try to help the community by testing some of the bugs I've been reporting, but also to work with custom facets. I'm having trouble building the project. +Is there any step-by-step document (and dependencies) on how to build it?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2024-01-15 07:48:40
      +
      +

      *Thread Reply:*

      + +
      + + + + + + + + + +
      + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2024-01-15 07:51:20
      +
      +

      *Thread Reply:* gradle --version

      + +
      + +

      Gradle 8.5

      + +

      Build time: 2023-11-29 14:08:57 UTC +Revision: 28aca86a7180baa17117e0e5ba01d8ea9feca598

      + +

      Kotlin: 1.9.20 +Groovy: 3.0.17 +Ant: Apache Ant(TM) version 1.10.13 compiled on January 4 2023 +JVM: 21.0.1 (Homebrew 21.0.1) +OS: Mac OS X 14.2 aarch64

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-15 09:20:10
      +
      +

      *Thread Reply:* I think easiest way would be to use Java 8 (it's still supported by Spark)

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-15 09:23:30
      +
      +

      *Thread Reply:* I recommend Azul Zulu, it works for mac M1 https://www.azul.com/downloads/?version=java-8-lts&os=macos&architecture=arm-64-bit&package=jdk#zulu

      +
      +
      Azul | Better Java Performance, Superior Java Support
      + + + + + + +
      +
      Est. reading time
      + 1 minute +
      + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2024-01-15 11:10:00
      +
      +

*Thread Reply:* Java downgraded to 8, thanks. +But now, let's say I want to build the project (focusing on the Spark integration). What should I do?

      + +

I tried gradle build (in integration/spark) and got the same error. Am I missing any step here?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Mattia Bertorello + (mattia.bertorello@booking.com) +
      +
      2024-01-15 16:37:05
      +
      +

      *Thread Reply:* If you're having issues with Java, I suggest using https://asdf-vm.com/ to ensure that Gradle doesn't accidentally use a different Java version.

      +
      +
      asdf-vm.com
      + + + + + + + + + + + + + + + +
      + +
      + + + + + + + +
      + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-16 05:01:01
      +
      +

      *Thread Reply:* @Rodrigo Maia I would try removing contents of ~/.m2/repository - this usually helps with this kind of error

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-16 05:01:12
      +
      +

      *Thread Reply:* The only downside is that you'll have to redownload dependencies

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Rodrigo Maia + (rodrigo.maia@manta.io) +
      +
      2024-01-17 05:22:09
      +
      +

      *Thread Reply:* thank you for the ideas to solve the issue. I'll try that during the weekend 😄

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Dinesh Singh + (dinesh.r-singh@hpe.com) +
      +
      2024-01-16 07:00:17
      +
      +

Hello Team, I am new to column-level lineage. I would like to know the fastest and easiest means of creating column-level lineage data/JSON. Which integration will be helpful: Airflow, Spark, or a direct MySQL operator?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Dinesh Singh + (dinesh.r-singh@hpe.com) +
      +
      2024-01-16 07:04:36
      +
      +

*Thread Reply:* Previously I worked with the DataHub integration with Spark to generate lineage. How do I skip DataHub and directly use the OpenLineage adapters? +Below is the previous example:

      + +

      from pyspark.sql import SparkSession

      + +

spark = SparkSession \ + .builder \ + .master("local[*]") \ + .appName("NetflixMovies") \ + .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.23") \ + .config("spark.extraListeners", "datahub.spark.DatahubSparkListener") \ + .config("spark.datahub.rest.server", "<http://4.labs.net:8080>") \ + .enableHiveSupport() \ + .getOrCreate()

      + +

      print("Reading CSV File") +movies_df = spark \ + .read \ + .format("csv") \ + .option("header", "true") \ + .option("inferSchema", "true") \ + .load("n_movies.csv")

      + +

      movies_above8_rating = movies_df.filter(movies_df.rating &gt;= 8.0)

      + +

      print("Writing CSV File") +movies_above8_rating \ + .coalesce(1) \ + .write \ + .format("csv") \ + .option("header", "true") \ + .mode("overwrite") \ + .save("movies_above_rating8.csv") +print("Completed") +spark.stop()

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-16 07:12:28
      +
      +

      *Thread Reply:* Have you tried directly doing the same just using OpenLineage Spark listener?
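For comparison, the same session bootstrapped with the OpenLineage listener would look roughly like this; a hedged sketch, as the package version, endpoint, and transport config keys are placeholders that assume a recent openlineage-spark:

```python
# Swapping the DataHub listener for the OpenLineage Spark listener.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("NetflixMovies")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:1.7.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")  # e.g. Marquez
    .config("spark.openlineage.namespace", "my_namespace")
    .enableHiveSupport()
    .getOrCreate()
)
```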

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Dinesh Singh + (dinesh.r-singh@hpe.com) +
      +
      2024-01-16 07:16:21
      +
      +

*Thread Reply:* I did try, no success yet!

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-16 07:16:43
      +
      +

      *Thread Reply:* Can you share the resulting event?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Dinesh Singh + (dinesh.r-singh@hpe.com) +
      +
      2024-01-16 07:17:06
      +
      +

*Thread Reply:* I need some time to reproduce.

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Dinesh Singh + (dinesh.r-singh@hpe.com) +
      +
      2024-01-16 07:17:16
      +
      +

      *Thread Reply:* Is there any other example you can suggest ?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-16 07:33:47
      +
      +

      *Thread Reply:* You can look at our integration tests: https://github.com/OpenLineage/OpenLineage/blob/31f8ce588526e9c7c4bc7d849699cb7ce2[…]va/io/openlineage/spark/agent/ColumnLineageIntegrationTest.java

      +
      + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Fabio Manganiello + (fabio.manganiello@booking.com) +
      +
      2024-01-19 05:06:31
      +
      +

      Hi channel, do we have a standard way to infer the type of the job from the OpenLineage events that we receive? Context: I'm building a product where we're expected to tell the users whether a particular job is an Airflow DAG, a dbt workflow, a Spark job etc. The (relatively) most reliable piece of data to extract this information is the producer string published in the event, but I've noticed that there's really no standard way of formatting it. After seeing some producer strings end in /dbt and /spark we thought "cool, we can just scrape the last slashed token of the producer URL and infer the type from there". Then we met the Airflow integration, which apparently publishes a producer in the format <https://github.com/apache/airflow/tree/providers-openlineage/1.3.0>.

      + +

      Is the producer the only way to infer the type of a job from an event? If so, are there any efforts in standardizing its possible values, or maybe create a registry of producers readily available in the OL libraries? If not, what's an alternative way of getting this info?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-19 05:53:21
      +
      +

      *Thread Reply:* We introduced JobTypeJobFacet https://github.com/OpenLineage/OpenLineage/blob/main/spec/facets/JobTypeJobFacet.json some time ago

      +
      + + + + + + + + + + + + + + + + +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-19 05:54:39
      +
      +

*Thread Reply:* The initial idea was to distinguish streaming jobs from batch ones. Then, it got extended to have jobType like QUERY|COMMAND|DAG|TASK|JOB|MODEL, or integration, which can be SPARK|DBT|AIRFLOW|FLINK
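Illustratively, attaching it when emitting events by hand from Python; a hedged sketch that assumes a client version shipping JobTypeJobFacet:

```python
# Marking a job as a batch Spark job via the JobTypeJobFacet.
from openlineage.client.facet import JobTypeJobFacet
from openlineage.client.run import Job

job = Job(
    namespace="my-namespace",
    name="daily_aggregation",
    facets={
        "jobType": JobTypeJobFacet(
            processingType="BATCH", integration="SPARK", jobType="JOB"
        )
    },
)
```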

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Fabio Manganiello + (fabio.manganiello@booking.com) +
      +
      2024-01-19 05:56:30
      +
      +

      *Thread Reply:* Thanks, that should definitely address my concern! Is it supposed to be filled explicitly (e.g. through env variables) or is it filled by the OL connector automatically? I've taken a look at several dumps of OL events but I can't recall seeing that facet being populated

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-19 06:13:17
      +
      +

      *Thread Reply:* I introduced this for Flink only https://github.com/OpenLineage/OpenLineage/pull/2241/files as that is what I needed at that point. You would need to add it into Spark integration if you like

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2024-01-19 10:49:34
      +
      +

      *Thread Reply:* JobTypeJobFacet(processingType="BATCH", integration="AIRFLOW", jobType="QUERY"),

      + +

      would you consider the jobType as QUERY if airflow is running a task that filters a delta table and saves the output to a different delta table? it is a task as well in that case of course

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2024-01-19 10:00:04
      +
      +

Any suggestions on naming for Graph API sources from Outlook? I pull a lot of data from email attachments with Airflow. Generally I am passing a resource (email address), the mailbox, and a subfolder; from there I list messages and find attachments

      + +
      + + + + + + + + + +
      + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      Kacper Muda + (kacper.muda@getindata.com) +
      +
      2024-01-19 10:08:06
      +
      +

      *Thread Reply:* What is the difference between mailbox and email address ? Could You provide some examples of all the parts provided?

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2024-01-19 10:22:42
      +
      +

*Thread Reply:* sure, so I definitely want to leave out the email address if possible (it is my name and I don't want it in OL metadata)

      + +

so we use the Graph API credentials to open the mailbox of a resource (my work email), +then we open the main Inbox folder, and from there we open folders/subfolders, which is important to track

      + +

      hook = MicrosoftEmailHook(graph_conn_id="ms_email_api", resource="my@email", folder="client_name", subfolder="system_name")

      + +

``` def get_conn(self) -> MailBox: + """Connects to the Office 365 account and opens the mailbox""" + credentials = (self.graph_conn.login, self.graph_conn.password) + account = Account( + credentials, auth_flow_type="credentials", tenant_id=self.graph_conn.host + ) + if not account.is_authenticated: + account.authenticate() + return account.mailbox(resource=self.resource)

      + +
      @cached_property
      +def mailbox(self) -&gt; MailBox:
      +    """Returns the authenticated account mailbox"""
      +    return self.get_conn()
      +
      +@cached_property
      +def current_folder(self) -&gt; Folder:
      +    """Property to lazily open and return the current folder"""
      +    if not self._current_folder:
      +        self.open_folder()
      +    return self._current_folder
      +
      +def open_folder(self) -&gt; Folder:
      +    """Opens the specified folder and sets it as the current folder"""
      +    inbox = self.mailbox.inbox_folder()
      +    f = inbox.get_folder(folder_name=self.folder)
      +    self._current_folder = (
      +        f.get_folder(folder_name=self.subfolder) if self.subfolder else f
      +    )
      +
      +def get_messages(
      +    self,
      +    query: Query | None = None,
      +    download_attachments: bool = False,
      +) -&gt; list[Message]:
      +    """Lists all messages in a folder. Optionally filters them based on an OData
      +    query
      +
      +    Args:
      +        query: OData query object
      +        download_attachments: Whether attachments should be downloaded
      +
      +    Returns:
      +        A Message object or list of Message objects. A tuple of the Message and
      +        Attachment is returned if return_attachments is True
      +    """
      +    messages = [
      +        message
      +        for message in self.current_folder.get_messages(
      +            limit=self.limit,
      +            batch=self.batch,
      +            query=query,
      +            download_attachments=download_attachments,
      +        )
      +    ]
      +    return messages```
      +
      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2024-01-19 10:30:27
      +
      +

      *Thread Reply:* I am not sure if something like this would be the namespace or the resource name, but I would consider a "source" of data to be this +outlook://{self.folder}/{self.subfolder}

      + + + +
      +
      +
      +
      + + + + + +
      +
      + + + + +
      + +
      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2024-01-19 10:30:44
      +
      +

      *Thread Reply:* then we list attachments with filters (subject, received since and until)

      Kacper Muda + (kacper.muda@getindata.com) +
      +
      2024-01-19 10:33:56
      +
      +

      *Thread Reply:* Ok, got it. I'm not sure if leaving out your email is the best choice, as it's the best identifier of the attachments' location 🙂 It's just my opinion, but I think I would go with something like:

      + +

      namespace: email://{email_address} +name: {folder}/{subfolder}

      + +

      OR

      + +

      namespace: email +name: {email_address}/{folder}/{subfolder}
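      For illustration, a minimal sketch of how those two options could be expressed with the openlineage-python client; the mailbox address and folder values below are placeholders, not values from this thread:

```
# Sketch of the two proposed naming options using the openlineage-python
# client; the mailbox address and folder names are placeholder values.
from openlineage.client.run import Dataset

# Option 1: the mailbox address is the namespace
option_one = Dataset(
    namespace="email://my.name@company.com",
    name="client_name/system_name",
)

# Option 2: a single "email" namespace, address folded into the name
option_two = Dataset(
    namespace="email",
    name="my.name@company.com/client_name/system_name",
)
```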

      Kacper Muda + (kacper.muda@getindata.com) +
      +
      2024-01-19 10:34:44
      +
      +

      *Thread Reply:* I'm not sure if outlook://{self.folder}/{self.subfolder} is descriptive enough, as you can't really tell who received this attachment. But of course, it all depends on your use case

      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2024-01-19 10:35:14
      +
      +

      *Thread Reply:* here is the output of message.build_url() +<https://graph.microsoft.com/v1.0/users/name@company.com/folder>

      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2024-01-19 10:36:08
      +
      +

      *Thread Reply:* which is a built-in method in the O365 library I am using. And yeah, I think I need the resource name no matter what

      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2024-01-19 10:36:28
      +
      +

      *Thread Reply:* it makes it clear it is my personal work email address and not the service account email address.

      Kacper Muda + (kacper.muda@getindata.com) +
      +
      2024-01-19 10:37:52
      +
      +

      *Thread Reply:* I think I would not base the namespace on the email provider like Outlook or Gmail, because it's something that can easily change over time, yet it does not influence the actual content of the email. If you suddenly move from Microsoft to Google and migrate all your folder/subfolder structure, does it really matter for you, from a lineage perspective?

      Kacper Muda + (kacper.muda@getindata.com) +
      +
      2024-01-19 10:38:57
      +
      +

      *Thread Reply:* This one, I think, is more debatable 😄

      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2024-01-19 10:39:16
      +
      +

      *Thread Reply:* agree

      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2024-01-19 10:39:59
      +
      +

      *Thread Reply:* the folder name in Outlook corresponds to the GCS bucket name the files are loaded into as well, so I would want that to be nice and clear in Marquez. It is all based on the client name, which is the "folder" for emails

      Kacper Muda + (kacper.muda@getindata.com) +
      +
      2024-01-19 10:44:15
      +
      +

      *Thread Reply:* If your specific email folder structure is somehow mapped to GCS, then you can try looking at the naming spec for GCS and mixing it in. The namespace in GCS contains the bucket name, so maybe in your case it's wise to put it there as well, so you have separate namespaces for each client? At the end of the day, you are using it and it should match your needs
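      As a rough sketch of that idea, assuming the OpenLineage naming spec for GCS (namespace gs://{bucket}, name = object path) and placeholder bucket, path, and address values:

```
# Sketch pairing the email input with a GCS output per the naming spec;
# the bucket, object path, and mailbox address are all placeholders.
from openlineage.client.run import Dataset

email_input = Dataset(
    namespace="email://my.name@company.com",
    name="client_name/system_name",
)
gcs_output = Dataset(
    namespace="gs://client_name",
    name="system_name/2024/01/attachment.csv",
)
```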

      ldacey + (lance.dacey2@sutherlandglobal.com) +
      +
      2024-01-19 10:54:44
      +
      +

      *Thread Reply:* yep, that is correct. It kind of falls apart with SFTP sources because the namespace includes the host, which is not very descriptive (just some domain names).

      + +

      in general, though, we try to keep client namespaces separate, since they are unrelated and different teams work on them

      Abdallah + (abdallah@terrab.me) +
      +
      2024-01-19 10:11:36
      +
      +

      Hello,

      + +

      I hope you are all doing well.

      + +

      We've observed an issue with symlinks generated by the Spark integration, specifically when dealing with a Glue Catalog table: the dataset namespace ends up being the same as the dataset name. This issue arises from the method used to parse the dataset's URI.

      + +

      I would like to discuss this issue with the contributors.

      + +

      https://github.com/OpenLineage/OpenLineage/pull/2379
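      To make the shape of the facet concrete, a hypothetical example of a Glue table symlink built with the openlineage-python client, where namespace and name stay distinct rather than collapsing into the same value (the ARN and table name are made up):

```
# Hypothetical illustration only: a symlink identifier for a Glue catalog
# table; the account/region ARN and table name are invented placeholders.
from openlineage.client.facet import (
    SymlinksDatasetFacet,
    SymlinksDatasetFacetIdentifiers,
)

symlinks = SymlinksDatasetFacet(
    identifiers=[
        SymlinksDatasetFacetIdentifiers(
            namespace="arn:aws:glue:us-east-1:123456789012",
            name="my_database.my_table",
            type="TABLE",
        )
    ]
)
```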

      Labels
      + integration/spark +
      Athitya Kumar + (athityakumar@gmail.com) +
      +
      2024-01-22 02:48:44
      +
      +

      Hey team

      + +

      Does the openlineage-spark listener support parsing Kafka sinks from the logical plan to publish lineage events? If yes, since which openlineage-spark version has this been supported?

      + +

      We have a pattern like df.write.format("KAFKA").option("TOPIC", "topicName") for which we don't see any OpenLineage events; we're currently using openlineage-spark:1.1.0
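      For reference, a hedged sketch of that pattern with the OpenLineage listener attached, written with the canonical lowercase format/option names. The package coordinates and spark.openlineage.* config keys follow the OpenLineage Spark docs; the broker, topic, namespace, and endpoint are placeholders, and whether Kafka sink lineage is actually emitted depends on the integration version, per the reply below:

```
# Sketch: enable the OpenLineage Spark listener around a Kafka sink write.
# Broker, topic, namespace, and Marquez URL below are placeholder values.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("kafka-sink-lineage")
    .config(
        "spark.jars.packages",
        "io.openlineage:openlineage-spark:1.8.0,"
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0",
    )
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")
    .config("spark.openlineage.namespace", "my-namespace")
    .getOrCreate()
)

# Kafka sinks require a value (and optionally key) column of string/binary type.
df = spark.createDataFrame([("k1", "v1")], ["key", "value"])

(
    df.write.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "topicName")
    .save()
)
```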

      Paweł Leszczyński + (pawel.leszczynski@getindata.com) +
      +
      2024-01-22 03:10:18
      +
      +

      *Thread Reply:* However, this feature has not been improved or developed for a long time (around 2 years), and surely has some gaps, like https://github.com/OpenLineage/OpenLineage/issues/372

      Labels
      + integration/spark, streaming +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2024-01-22 15:42:55
      +
      +

      @channel +We released OpenLineage 1.8.0, featuring: +• Flink: support Flink 1.18 #2366 @HuangZhenQiu +• Spark: add Gradle plugins to simplify the build process to support Scala 2.13 #2376 @d-m-h +• Spark: support multiple Scala versions LogicalPlan implementation #2361 @mattiabertorello +• Spark: Use ScalaConversionUtils to convert Scala and Java collections #2357 @mattiabertorello +• Spark: support MERGE INTO queries on Databricks #2348 @pawel-big-lebowski +• Spark: Support Glue catalog in iceberg #2283 @nataliezeller1 +Plus many bug fixes and a change in Spark! +Thanks to all the contributors with a special shoutout to new contributor @Mattia Bertorello, who had no fewer than 5 contributions in this release! +Release: https://github.com/OpenLineage/OpenLineage/releases/tag/1.8.0 +Changelog: https://github.com/OpenLineage/OpenLineage/blob/main/CHANGELOG.md +Commit history: https://github.com/OpenLineage/OpenLineage/compare/1.7.0...1.8.0 +Maven: https://oss.sonatype.org/#nexus-search;quick~openlineage +PyPI: https://pypi.org/project/openlineage-python/

      + 🚀 Jakub Dardziński, alexandre bergere, Harel Shein, Mattia Bertorello, Maciej Obuchowski, Julian LaNeve +
      harsh loomba + (hloomba@upgrade.com) +
      +
      2024-01-22 16:43:32
      +
      +

      Hello team, I see the following issue when I install apache-airflow-providers-openlineage==1.4.0

      harsh loomba + (hloomba@upgrade.com) +
      +
      2024-01-22 16:43:56
      +
      +

      *Thread Reply:* @Jakub Dardziński @Willy Lulciuc

      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-22 17:00:17
      +
      +

      *Thread Reply:* Have you modified your local files? Please try to uninstall and reinstall the package

      harsh loomba + (hloomba@upgrade.com) +
      +
      2024-01-22 17:10:47
      +
      +

      *Thread Reply:* I didn't do anything; all other packages are fine

      harsh loomba + (hloomba@upgrade.com) +
      +
      2024-01-22 17:12:42
      +
      +

      *Thread Reply:* let me reinstall

      Jakub Dardziński + (jakub.dardzinski@getindata.com) +
      +
      2024-01-22 17:30:10
      +
      +

      *Thread Reply:* it's odd because it should be line 142, actually. Are you sure you're installing it properly?

      harsh loomba + (hloomba@upgrade.com) +
      +
      2024-01-22 17:33:33
      +
      +

      *Thread Reply:* actually, it could be my local environment

      harsh loomba + (hloomba@upgrade.com) +
      +
      2024-01-22 17:33:39
      +
      +

      *Thread Reply:* I'm trying something

      jayant joshi + (itsjayantjoshi@gmail.com) +
      +
      2024-01-25 06:42:58
      +
      +

      Hi Team! +I want to create column-level lineage using PySpark. +I followed the https://openlineage.io/blog/openlineage-spark/ blog step by step. +When I run the next command, docker run --network spark_default -p 3000:3000 -e MARQUEZ_HOST=marquez-api -e MARQUEZ_PORT=5000 --link marquez-api:marquez-api marquezproject/marquez-web:0.19.1, I see the <http://localhost:3000> Marquez UI, +but it continuously shows "Something went wrong while fetching search" without giving any result. In the console: "Error occurred while trying to proxy request /api/v1/search/?q=p&amp;sort=NAME&amp;limit=100 from localhost:3000 to <http://marquez-api:5000/> (EAI_AGAIN) (<https://nodejs.org/api/errors.html#errors_common_system_errors>)". For the dataset, I am reading a CSV file. For reference, attaching a screenshot. +Is there any solution?

      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-25 07:48:45
      +
      +

      *Thread Reply:* First, can you try a newer version? The latest is 0.44.0; 0.19.1 is over two years old

      Tamizharasi Karuppasamy + (tamizhsangami@gmail.com) +
      +
      2024-01-29 04:13:49
      +
      +

      *Thread Reply:* Hi, I am trying to generate column-level lineage for a CSV file on AWS S3 using Spark and OpenLineage. I am following https://github.com/OpenLineage/OpenLineage/blob/main/integration/spark/README.md and https://openlineage.io/docs/integrations/spark/quickstart_local for reference, but I get the error below. Kindly help me resolve it.

      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-29 05:00:53
      +
      +

      *Thread Reply:* looks like you have a typo in versioning @Tamizharasi Karuppasamy

      jayant joshi + (itsjayantjoshi@gmail.com) +
      +
      2024-01-29 06:07:41
      +
      +

      *Thread Reply:* Hi @Maciej Obuchowski, as per your suggestion I used 0.44.0, but the error still persists: "[HPM] Error occurred while proxying request localhost:3000/api/v1/events/lineage?limit=20&amp;before=2024-01-29T23:59:59.000Z&amp;after=2024-01-29T00:00:00.000Z&amp;offset=0&amp;sortDirection=desc to <http://marquez-api:5000/> [EAI_AGAIN] (<https://nodejs.org/api/errors.html#errors_common_system_errors>)", and in Docker "marquez-api" is not running. +Sharing the log for your reference: +2024-01-29 16:35:29 marquez-db | 2024-01-29 11:05:29.547 UTC [39] FATAL: password authentication failed for user "marquez" +2024-01-29 16:35:29 marquez-db | 2024-01-29 11:05:29.547 UTC [39] DETAIL: Role "marquez" does not exist. +2024-01-29 16:35:29 marquez-db | Connection matched pg_hba.conf line 95: "host all all all md5" +2024-01-29 16:35:29 marquez-api | ERROR [2024-01-29 11:05:29,553] org.apache.tomcat.jdbc.pool.ConnectionPool: Unable to create initial connections of pool. +2024-01-29 16:35:29 marquez-api | ! org.postgresql.util.PSQLException: FATAL: password authentication failed for user "marquez" +2024-01-29 16:35:29 marquez-api | ! at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:693) +2024-01-29 16:35:29 marquez-api | ! at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:203) +2024-01-29 16:35:29 marquez-api | ! at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:258) +2024-01-29 16:35:29 marquez-api | ! at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54) +2024-01-29 16:35:29 marquez-api | ! at org.postgresql.jdbc.PgConnection.&lt;init&gt;(PgConnection.java:263) +2024-01-29 16:35:29 marquez-api | ! at org.postgresql.Driver.makeConnection(Driver.java:443) +2024-01-29 16:35:29 marquez-api | ! at org.postgresql.Driver.connect(Driver.java:297) +2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.PooledConnection.connectUsingDriver(PooledConnection.java:346) +2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.PooledConnection.connect(PooledConnection.java:227) +2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.ConnectionPool.createConnection(ConnectionPool.java:772) +2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.ConnectionPool.borrowConnection(ConnectionPool.java:700) +2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.ConnectionPool.init(ConnectionPool.java:499) +2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.ConnectionPool.&lt;init&gt;(ConnectionPool.java:155) +2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.DataSourceProxy.pCreatePool(DataSourceProxy.java:118) +2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.DataSourceProxy.createPool(DataSourceProxy.java:107) +2024-01-29 16:35:29 marquez-api | ! at org.apache.tomcat.jdbc.pool.DataSourceProxy.getConnection(DataSourceProxy.java:131) +2024-01-29 16:35:29 marquez-api | ! at org.flywaydb.core.internal.jdbc.JdbcUtils.openConnection(JdbcUtils.java:48) +2024-01-29 16:35:29 marquez-api | ! at org.flywaydb.core.internal.jdbc.JdbcConnectionFactory.&lt;init&gt;(JdbcConnectionFactory.java:75) +2024-01-29 16:35:29 marquez-api | ! at org.flywaydb.core.FlywayExecutor.execute(FlywayExecutor.java:147) +2024-01-29 16:35:29 marquez-api | ! at org.flywaydb.core.Flyway.info(Flyway.java:190) +2024-01-29 16:35:29 marquez-api | ! at marquez.db.DbMigration.hasPendingDbMigrations(DbMigration.java:78) +2024-01-29 16:35:29 marquez-api | ! at marquez.db.DbMigration.migrateDbOrError(DbMigration.java:33) +2024-01-29 16:35:29 marquez-api | ! at marquez.MarquezApp.run(MarquezApp.java:109) +2024-01-29 16:35:29 marquez-api | ! at marquez.MarquezApp.run(MarquezApp.java:51) +2024-01-29 16:35:29 marquez-api | ! at io.dropwizard.cli.EnvironmentCommand.run(EnvironmentCommand.java:67) +2024-01-29 16:35:29 marquez-api | ! at io.dropwizard.cli.ConfiguredCommand.run(ConfiguredCommand.java:98) +2024-01-29 16:35:29 marquez-api | ! at io.dropwizard.cli.Cli.run(Cli.java:78) +2024-01-29 16:35:29 marquez-api | ! at io.dropwizard.Application.run(Application.java:94) +2024-01-29 16:35:29 marquez-api | ! at marquez.MarquezApp.main(MarquezApp.java:63) +2024-01-29 16:35:29 marquez-api | INFO [2024-01-29 11:05:29,556] marquez.MarquezApp: Stopping app...

      Maciej Obuchowski + (maciej.obuchowski@getindata.com) +
      +
      2024-01-29 06:08:29
      +
      +

      *Thread Reply:* Please delete all docker volumes related to Marquez and try again

      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2024-01-25 09:43:16
      +
      +

      @channel +Our first London meetup is happening next Wednesday, Jan. 31, at Confluent's offices in Covent Garden. Click through to the Meetup page to sign up and view an up-to-date agenda, featuring talks by @Abdallah, Kirill Kulikov at Confluent, and @Paweł Leszczyński! https://openlineage.slack.com/archives/C01CK9T7HKR/p1704736501486239

      + ❤️ Maciej Obuchowski, Abdallah, Eric Veleker, Rodrigo Maia +
      Michael Robinson + (michael.robinson@astronomer.io) +
      +
      2024-01-29 12:51:10
      +
      +

      @channel +The 2023 Year in Review special issue of OpenLineage News is here! This issue contains an overview of the exciting events, developments and releases in 2023, including the Airflow Provider and Static Lineage. +To get the newsletter directly in your inbox each month, sign up here.

      + ❤️ Ross Turk, Harel Shein +
      + +
      + 🙌 Francis McGregor-Macdonald, Harel Shein +
      + +
      Ross Turk + (ross@rossturk.com) +
      +
      2024-01-29 13:15:16
      +
      +

      *Thread Reply:* love it! so much activity!

      + :gratitude_thank_you: Michael Robinson +
      + +
      + ➕ Harel Shein +
      + +