Embed advanced incremental sync and local state #7
Comments
@yevgenypats @disq could you point me in the right direction on how CloudQuery deals with incremental updates? I found a PR for Shopify with incremental code based on timestamps. However, it could be a bit more complex in SharePoint. Let me elaborate. In prod you can sometimes expect a really large number of objects, e.g. millions of documents or list items (not necessarily in a single list, as that usually means poor design, but in total). SharePoint is not the fastest system (many SQL-like scenarios can be technically expensive), and too many requests can get the API throttled. SharePoint has a few techniques for list data increments. One is based on the Modified field, but it's not that simple: an item can be moved to the recycle bin and then restored, and the timestamp doesn't change. There are also cases where what needs syncing can't be derived from a timestamp at all. Then there is the change API, which can explicitly return the change set since the last change token. So, let's say we query a large list. With the change API we get, in the blink of an eye and very cheaply, what was added, removed, restored, changed, etc. This brings me to the question: how can I tell the CloudQuery SDK to work in incremental mode and only update/create/delete specific items? Is it feasible? Please, no rush; I only plan to spend some time on this feature next weekend.
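To make the change-token idea above concrete, here is a minimal sketch in Go of applying a change set to a local mirror of a list. Everything in it is hypothetical: `Change`, `ChangeType`, `Mirror`, and `Apply` are illustrative names, not part of the SharePoint API, spsync, or the CloudQuery SDK, and it assumes each change entry already carries the item payload and the token to resume from.

```go
package main

// ChangeType is a hypothetical enum of the change kinds mentioned above.
type ChangeType int

const (
	Added ChangeType = iota
	Updated
	Deleted
	Restored
)

// Change is an assumed shape for one entry of a change set.
// A real change feed may only return IDs, requiring a follow-up fetch.
type Change struct {
	Type   ChangeType
	ItemID int
	Title  string // stand-in for the item payload
	Token  string // change token to resume from after this entry
}

// Mirror is a toy in-memory destination keyed by item ID.
type Mirror struct {
	Items map[int]string
	Token string // last change token that was applied
}

// NewMirror returns an empty mirror with no token yet.
func NewMirror() *Mirror {
	return &Mirror{Items: map[int]string{}}
}

// Apply replays changes in order and reports how many upserts and
// deletes were performed. Restored items are treated as upserts,
// which covers the recycle-bin case where Modified does not change.
func (m *Mirror) Apply(changes []Change) (upserts, deletes int) {
	for _, c := range changes {
		switch c.Type {
		case Added, Updated, Restored:
			m.Items[c.ItemID] = c.Title
			upserts++
		case Deleted:
			delete(m.Items, c.ItemID)
			deletes++
		}
		// Advance the cursor only after the entry has been applied.
		m.Token = c.Token
	}
	return upserts, deletes
}
```

The point of the sketch is that the destination only sees the touched items, and the last applied token is the single piece of state needed for the next run.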
So we have another example of incremental sync/CDC (change data capture): https://github.com/cloudquery/cloudquery/tree/main/plugins/source/postgresql, where we subscribe to PostgreSQL changes and only those are updated/created in the destination. We don't support deletes yet, but it is on the roadmap. One thing you need to do is save the state somewhere, so you need to use a backend like in Shopify; you can save, let's say in a local file, where you stopped syncing and then continue from the same place. A backend is really just a database where you can store "cursors" or any other data needed to resume a sync. Something to note here: we have a few plugins that use incremental sync (I think the HN plugin and potentially a few others), but we are actively working on an improved version of incremental syncing and cursor storage, so depending on your timeline you might want to wait a bit so you won't need to update the incremental part to a new API.
OK. Great to know. Thank you! I'll wait then. It's not so critical and would be incomplete without deletes anyway. Regarding the storage backend, I completely understand; that's pretty logical. I plan to have a local one first. If CloudQuery provides an abstract key-value store for persisting data between sessions that is easily injectable via the SDK, that would be the one to switch to.
@koltyakov FYI SDK v4 contains the simple logic (state.Backend) to use during syncs.
@candiduslynx thanks! I've already been looking into this briefly, following this article. Nice to know the plugin which I can use as a sample.
Sync should support the advanced incremental behaviors shown in https://github.com/koltyakov/spsync. Technically, spsync can be used for fetching lists.