Embed advanced incremental sync and local state #7

Open
koltyakov opened this issue Feb 24, 2023 · 5 comments

@koltyakov
Owner

Sync should support the advanced incremental behaviors shown in https://github.com/koltyakov/spsync. Technically, spsync can be used for fetching lists.

@koltyakov koltyakov self-assigned this Feb 24, 2023
@koltyakov
Owner Author

koltyakov commented Feb 27, 2023

@yevgenypats @disq could you point me in the right direction on how CloudQuery deals with incremental updates?

I found a Shopify PR with incremental code based on timestamps. However, it could be a bit more complex in SharePoint. Let me elaborate.

Sometimes you can expect a really large number of objects in prod, e.g. millions of documents or list items (not necessarily in a single list, as that usually means poor design, but in total). SharePoint is not the fastest system (many SQL-like scenarios can be technically expensive), and too many requests can get the API throttled.

SharePoint has a few techniques when it comes to list data increments. One could be based on the modified field, but it's not that simple. An item can be moved to the recycle bin and then restored, and its timestamp doesn't change. There are also cases where what needs syncing can't be determined from a timestamp at all. Then there is the change API, which can explicitly return the change set since a given change token.

So, let's say we query a large list. With the change API we get, in the blink of an eye and very cheaply, what's been added, removed, restored, changed, etc.
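
To make the change-set idea concrete, here is a rough Go sketch of replaying such a set onto a local copy of a list. The `Change` and `ChangeKind` types and the `ApplyChanges` function are hypothetical and simplified for illustration; they are not the SharePoint API's own types, nor spsync's code.

```go
package main

import "fmt"

// ChangeKind is a hypothetical enum for the kinds of changes mentioned above.
type ChangeKind int

const (
	Added ChangeKind = iota
	Updated
	Deleted
	Restored
)

// Change is a hypothetical, simplified change-log entry.
type Change struct {
	Kind   ChangeKind
	ItemID int
	Title  string // only meaningful for Added, Updated and Restored
}

// ApplyChanges replays a change set on a local copy of the list
// (item ID -> title) and reports how many upserts and deletes it performed.
func ApplyChanges(local map[int]string, changes []Change) (upserts, deletes int) {
	for _, c := range changes {
		switch c.Kind {
		case Added, Updated, Restored:
			// A restore is treated like an upsert: the item must exist again.
			local[c.ItemID] = c.Title
			upserts++
		case Deleted:
			if _, ok := local[c.ItemID]; ok {
				delete(local, c.ItemID)
				deletes++
			}
		}
	}
	return upserts, deletes
}

func main() {
	local := map[int]string{1: "Spec.docx", 2: "Plan.xlsx"}
	changes := []Change{
		{Kind: Updated, ItemID: 1, Title: "Spec v2.docx"},
		{Kind: Deleted, ItemID: 2},
		{Kind: Restored, ItemID: 2, Title: "Plan.xlsx"},
		{Kind: Added, ItemID: 3, Title: "Notes.txt"},
	}
	upserts, deletes := ApplyChanges(local, changes)
	fmt.Println("upserts:", upserts, "deletes:", deletes, "items:", len(local))
}
```

Note that the delete followed by a restore of item 2 is visible here as two explicit entries, which is exactly the case a filter on the modified timestamp would miss.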

Here we come to the question: how could I tell the CloudQuery SDK to work in an incremental mode and only update/create/delete something specific? Is it feasible?

Please no rush, I only plan to spend some time on this feature next weekend.

@yevgenypats
Contributor

So we have another example of incremental sync/CDC (change data capture):

https://github.com/cloudquery/cloudquery/tree/main/plugins/source/postgresql

There we subscribe to PostgreSQL changes, and only those are updated/created in the destination. We don't support deletes yet, but it is on the roadmap.

One thing you need to do is save the state somewhere, so you need to use a backend like in Shopify. For example, you can save to a local file where you stopped syncing and then continue from the same place. A backend is really just a database where you can store "cursors" or any other data needed to resume a sync.
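
A minimal sketch of the local-file idea, in Go. Everything here is an assumption for illustration: `FileStore`, `OpenFileStore`, the `sync-state.json` file name and the `lists/Documents` key are hypothetical and not part of the CloudQuery SDK.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
)

// FileStore is a hypothetical local-file backend: a flat key -> cursor map
// persisted as JSON.
type FileStore struct {
	path string
	data map[string]string
}

// OpenFileStore loads existing cursors from path. A missing file is not an
// error; it simply means this is the first run.
func OpenFileStore(path string) (*FileStore, error) {
	s := &FileStore{path: path, data: map[string]string{}}
	raw, err := os.ReadFile(path)
	if errors.Is(err, os.ErrNotExist) {
		return s, nil
	}
	if err != nil {
		return nil, err
	}
	if err := json.Unmarshal(raw, &s.data); err != nil {
		return nil, err
	}
	if s.data == nil { // the file contained JSON null
		s.data = map[string]string{}
	}
	return s, nil
}

// Get returns the stored cursor for key, if any.
func (s *FileStore) Get(key string) (string, bool) {
	v, ok := s.data[key]
	return v, ok
}

// Set updates the cursor and writes the whole map back to disk.
func (s *FileStore) Set(key, value string) error {
	s.data[key] = value
	raw, err := json.MarshalIndent(s.data, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(s.path, raw, 0o600)
}

func main() {
	store, err := OpenFileStore("sync-state.json")
	if err != nil {
		fmt.Println("open state:", err)
		return
	}
	if token, ok := store.Get("lists/Documents"); ok {
		fmt.Println("resuming from cursor", token)
	} else {
		fmt.Println("no cursor yet, doing a full sync")
	}
	// Save the new cursor only after the sync itself has succeeded.
	if err := store.Set("lists/Documents", "token-after-this-run"); err != nil {
		fmt.Println("save state:", err)
	}
}
```

Running it twice shows the resume behavior: the first run finds no cursor, the second one picks up the value saved by the first.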

Something to note here: we have a few plugins that use incremental sync (I think the HN plugin and potentially a few others), but we are actively working on an improved version of incremental syncing and cursor storage. Depending on your timeline, you might want to wait a bit so you won't need to update the incremental part to a new API.

@koltyakov
Owner Author

OK. Great to know. Thank you!

I'll wait then. It's not so critical, and it would be incomplete without deletes anyway.

Regarding the storage backend, I completely understand; it's pretty logical. I plan to have a local one first. If CloudQuery provides an abstract key-value store for keeping something within sessions that is easily injectable via the SDK, that would be the one to switch to.
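
To illustrate what "injectable" could look like, here is a hedged Go sketch. The `StateStore` interface, `MemoryStore`, `FetchFunc` and `SyncList` are all hypothetical names made up for this example; they are not the CloudQuery SDK's actual API. The point is only that the sync code depends on a small key-value contract, so a local implementation can later be swapped for an SDK-provided one.

```go
package main

import "fmt"

// StateStore is a hypothetical minimal key-value contract for cursors.
// An empty string returned by Get means "no cursor stored yet".
type StateStore interface {
	Get(key string) (string, error)
	Set(key, value string) error
}

// MemoryStore is a throwaway in-memory implementation of StateStore.
type MemoryStore struct{ m map[string]string }

func NewMemoryStore() *MemoryStore { return &MemoryStore{m: map[string]string{}} }

func (s *MemoryStore) Get(key string) (string, error) { return s.m[key], nil }

func (s *MemoryStore) Set(key, value string) error {
	s.m[key] = value
	return nil
}

// FetchFunc is a hypothetical stand-in for "ask the source what changed
// since this cursor"; it returns the number of changes and the next cursor.
type FetchFunc func(since string) (changed int, next string, err error)

// SyncList reads the cursor, fetches changes since it, and stores the new
// cursor only after the fetch has succeeded.
func SyncList(store StateStore, key string, fetch FetchFunc) (int, error) {
	since, err := store.Get(key)
	if err != nil {
		return 0, fmt.Errorf("read cursor: %w", err)
	}
	changed, next, err := fetch(since)
	if err != nil {
		return 0, fmt.Errorf("fetch changes: %w", err)
	}
	if err := store.Set(key, next); err != nil {
		return 0, fmt.Errorf("save cursor: %w", err)
	}
	return changed, nil
}

func main() {
	store := NewMemoryStore() // swap for any other StateStore implementation
	run := 0
	fetch := func(since string) (int, string, error) {
		run++
		if since == "" {
			return 1000, "token-1", nil // first run: everything is new
		}
		return 3, fmt.Sprintf("token-%d", run), nil
	}
	for i := 0; i < 2; i++ {
		n, err := SyncList(store, "lists/Documents", fetch)
		if err != nil {
			fmt.Println("sync failed:", err)
			return
		}
		fmt.Println("changed items:", n)
	}
}
```

Because `SyncList` only sees the interface, switching from a local store to another backend would not touch the sync logic itself.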

@candiduslynx
Contributor

@koltyakov FYI, SDK v4 contains simple logic (state.Backend) to use during syncs.
A minimal example can be found in the Google Analytics source plugin code.

@koltyakov
Owner Author

@candiduslynx thanks! I've already looked into this briefly, following this article. Good to know which plugin I can use as a sample.

3 participants