Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strip White Space from Cell Values #215

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

JVickery-TBS
Copy link
Contributor

@JVickery-TBS JVickery-TBS commented May 7, 2024

Use case

Legacy datasets that have white space in columns that business owners want to be treated as number/timestamp. XLoader cannot set that field or fail's data loads when said data is found 3000+ rows deep.

Changes being made

feat(logic): strip white space;

  • Strip white space padding from cell values when loading into DataStore.

- Strip white space from cell values when loading into DataStore.
- Fixed white space stripping for `load_table`.
- Fixed white space stripping for `load_table`.
@duttonw
Copy link
Collaborator

duttonw commented May 7, 2024

Hi @JVickery-TBS ,

What is the business use case for this change or what use behaviour are you seeing that you are trying to correct?

should/could this be controlled by a per resource feature flag on reload (like types) for when removing tailing space is not wanted?

ckanext/xloader/parser.py Outdated Show resolved Hide resolved
@JVickery-TBS
Copy link
Contributor Author

@duttonw we have some users with large CSVs that have just spaces in empty cells instead of actual empty cells (apparently from some old data building applications). So they are unable to use the numeric or timestamp types as their "empty" values are actually spaces.

So just stripping all string cell values would work. For the "per resource flag", would that be adding a field to the Data Dictionary for each field? Just a "Strip White Space" checkbox or something eh?

- Condition stripping on `str` type.
- Align test with new strip code.
- Write headers to stream for `load_csv`.
@duttonw
Copy link
Collaborator

duttonw commented May 8, 2024

@duttonw we have some users with large CSVs that have just spaces in empty cells instead of actual empty cells (apparently from some old data building applications). So they are unable to use the numeric or timestamp types as their "empty" values are actually spaces.

So just stripping all string cell values would work. For the "per resource flag", would that be adding a field to the Data Dictionary for each field? Just a "Strip White Space" checkbox or something eh?

Hi @JVickery-TBS ,

That is a very good use case, are you able to update this pr per the following:

  • Always strip whitespace for non-text set fields. (Your code should already be doing this)

  • Do not 'strip' whitespace on type 'text'. As you never know if something might be using the datastore to concat emails and need's that trailing white space.

  • Optional: Add an option of per resource overrides, per how we set comments on the db table field if it should be test/numeric/timestamp and description (nice title) etc. We could extend to have that extra option added to this template (inside xloader plugin) dictionary_form.html, add logic to save it to the 'comments' section of the datastore table as well as use it when reading in for it.

Other things in this space

I'm not really a fan of having the datastore field types being stored on the datastore table as comments since when you drop the table (which is like all the time when you upload a new file to the resource**) if the uploaded file fails all customizations are lost.

I feel it should be set as a custom param against the resource instead so its persisted on a xloader failure so that when a valid csv file is upload it does not affect downstream api products expecting a number/timestamp when xloader defaulted to 'text' again.

Do you also think its a good idea in correcting the data storage location?

** recent changes now only truncate if csv fields match.

@duttonw
Copy link
Collaborator

duttonw commented May 8, 2024

Thankfully I've not run into this before but did run into something similar when requested to review why only 3800ish rows were imported into the db instead of the 3million on https://www.data.qld.gov.au/dataset/road-location-and-traffic-data/resource/daab3617-077f-450a-a1c0-57c26d8ba47c

The reason I found was that one of the col's which the first 3000 rows had looked like it was 100% numeric, but it was not and the system would crash but leave any loaded data it had as it was doing it the slow way (fast way fails fast when pushing the 256mb file to db)

@JVickery-TBS
Copy link
Contributor Author

@duttonw okay, I will find some time to add in the condition on text type overrides to not strip from those. And will implement a new Data Dictionary field for stripping white space. (will add some test coverage for these things too)

As for the Data Dictionary being stored in the DB table info, I have always found it a bit strange. But I think the idea behind it was to keep everything in the same table and database so the queries would be the fastest and there would not need to be any joins or querying between different databases.

That being said, I would say to keep the Data Dictionary inside the table info as it is currently, but we could come up with another place where it gets saved too? Such as the normal resource table in an actual field called data_dictionary. And then add in some more logic in datastore_create, which would check between any existing table info and the data_dictionary field value.

But yeah, having the Data Dictionary inside the table info where the tables can get deleted is not great to me.

@wardi is in tomorrow, so I will discuss this further with him. He may know of any issues we could run into here or maybe know of the best way to do this.

@duttonw
Copy link
Collaborator

duttonw commented May 9, 2024

We could take it as is.

I just want to not have side effects that people were using before or have the ability to not strip by flag.

Reference Comic: https://xkcd.com/1172/ (workflow)


As to location of data store data types/comments, with items like validation schema going into resource extra params. We already have a pattern we could reference. This of course is something that needs to be fleshed out if it were to go ahead as well as if we want to also mirror it into the data pusher plugin so they are in sync with a defacto way.

@JVickery-TBS
Copy link
Contributor Author

@duttonw yeah for sure would need to be fully fleshed out in upstream. For now, with this PR I will add a new field to the Data Dictionary form and use it in the new code. (and skip the text overrides).

I also just remembered that I did make this action: https://github.com/ckan/ckan/blob/master/ckanext/datastore/logic/action.py#L498

Which is really just a wrapper of datastore_delete but forces a filters param to be given and not null. So this could be an option in the meantime for Xloader. As if you do datastore_delete with filters={}, it will just delete all the records in the table and not the table itself. Which would preserver the table info and Data Dictionary with any overrides, comments, labels, etc.

@wardi
Copy link
Contributor

wardi commented May 10, 2024

When re-xloader-ing data if we use datastore_records_delete instead of datastore_delete (or datastore_delete with an empty filters dict) then the existing columns and data dictionary values won't be removed.
Then datastore_create should update columns while keeping data dictionary values from the previous columns.

I think data dictionary data in the resources is a good idea (this is something that ckanext-validation already does) we just need to deal with custom resource schemas somehow, and handling when columns are added/removed.

@wardi
Copy link
Contributor

wardi commented May 10, 2024

Note that the new IDataDictionaryForm allows users to customize their data dictionaries with private data that may not be appropriate for a public schema in the resource. So for now having xloader not delete columns when loading data is the only way to keep this data.

@wardi
Copy link
Contributor

wardi commented May 10, 2024

In fact xloader should add its own settings to columns with an IDataDictionaryForm so things like "strip whitespace" will be presented in the data dictionary form and validated by ckan.

- Extended tabulator stream iterator to strip white space from cell values.
@duttonw
Copy link
Collaborator

duttonw commented May 12, 2024

Note that the new IDataDictionaryForm allows users to customize their data dictionaries with private data that may not be appropriate for a public schema in the resource. So for now having xloader not delete columns when loading data is the only way to keep this data.

There is ways to strip it from going out unless you are sysadmin or internal plugin. We had a highly known vendor do this for PII fields from going out publicly, You can see this in effect here https://github.com/qld-gov-au/ckanext-resource-visibility/blob/master/ckanext/resource_visibility/logic/action.py#L28 which hides these fields which are in the db but not in the solr index nor page display (unless author/sysadmin) https://github.com/qld-gov-au/ckanext-resource-visibility/blob/e35a0fc6c3b84176373be6c9b0c1faffbb30f01f/ckanext/resource_visibility/constants.py#L15

@duttonw
Copy link
Collaborator

duttonw commented May 12, 2024

When re-xloader-ing data if we use datastore_records_delete instead of datastore_delete (or datastore_delete with an empty filters dict) then the existing columns and data dictionary values won't be removed. Then datastore_create should update columns while keeping data dictionary values from the previous columns.

I think data dictionary data in the resources is a good idea (this is something that ckanext-validation already does) we just need to deal with custom resource schemas somehow, and handling when columns are added/removed.

The main problem we are running into Authors who accidently use the "datastore 'csv'" instead of the original which includes the _id field, edits that and then uploads that back into CKAN, due to a change in fields, xloader drops the table to reload it. Of course when a validation schema is present (which takes time to get added everywhere). This could be stopped and hopefully @JVickery-TBS work on integration for that is fully ready for others to consume.

On the point of getting validation schemas to the masses'. We got a a standalone plugin built that adds a new tab when a datastore is made that general Authors can use to generate a simple validation schema which could then be applied at the resource level, or on the dataset (for monthly new files).

https://github.com/qld-gov-au/ckanext-validation-schema-generator
It may need a bit more TLC + readme updates on allowing it to be standalone consumable since they did go to down in placing huge number of overrides in our primary theme/plugin customizations. https://github.com/qld-gov-au/ckanext-data-qld/blob/develop/ckanext/data_qld/ckan_dataset.json#L203

@JVickery-TBS
Copy link
Contributor Author

@duttonw I do have an idea in the backlog to exclude the _id column when downloading from the Web UI. As it really is not needed by users in that way. If they need those ID columns for some machine integration, I feel like those users are able to use datastore_search

- Added `strip_extra_white` info field and form fields.
- Added validator for `strip_extra_white`.
- Used `strip_extra_white` to control stripping white space.
- Added `strip_extra_white` field and form fields.
- Used `strip_extra_white` to control stripping white space.
- Minor logic fixes for the new `strip_extra_white` field.
@JVickery-TBS
Copy link
Contributor Author

@ThrawnCA @duttonw okay I have added a new DataStore field and schema for strip_extra_white and baked it into the Loader script.

Not sure why the tests are unable to import the DataStore module however :/

Copy link
Collaborator

@duttonw duttonw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @JVickery-TBS ,

Some tweaks to the tests but overall. You have done an amazing job.

I think its ok to go as it his but would like @ThrawnCA to review it once he's more available. Currently he is dealing with a tight deadline that AWS imposed on us :'( on our infra IasC.

I do want to test this in a live environment (or pick ckan:develop as upstream) as I'm hesitant in including gui changes which don't have selenium BDD testing on a straight PR.

As Such, @JVickery-TBS if you do deploy this to one of your test environments that is publicly visibly and you can give myself of @ThrawnCA author rights for a small window of time, we could do the independent verification testing on the gui and merge it in.

We will still be pulling this into the qld-gov-au fork and can test on our dev.data.qld.gov.au system in the coming weeks/monthish.

It's also on my todo list on getting the pypi system deployments automated when my current workload drops off a bit so we can move back to nice pip installs again.

Regards,

@duttonw

ckanext/xloader/loader.py Outdated Show resolved Hide resolved
ckanext/xloader/tests/test_loader.py Show resolved Hide resolved
ckanext/xloader/tests/test_loader.py Show resolved Hide resolved
ckanext/xloader/tests/test_loader.py Outdated Show resolved Hide resolved
@duttonw
Copy link
Collaborator

duttonw commented May 18, 2024

cicd testing issue due to missing dependancy wiring that exists in normal runtime.

    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/__w/ckanext-xloader/ckanext-xloader/ckanext/xloader/plugin.py", line 7, in <module>
    from ckanext.datastore.interfaces import IDataDictionaryForm
ImportError: cannot import name 'IDataDictionaryForm' from 'ckanext.datastore.interfaces' (/srv/app/src/ckan/ckanext/datastore/interfaces.py)

Reviewing how https://github.com/qld-gov-au/ckanext-s3filestore/ works with iUploader interface.
Please see if just adding:

If the interface or methods you want to use do not exist in 2.9 you may need to wire in checks to skip the new functionality to make it backwards compatible.

See:
https://github.com/qld-gov-au/ckanext-s3filestore/blob/master/ckanext/s3filestore/uploader.py#L28

if toolkit.check_ckan_version(min_version='2.8'):
    from ckan.lib.uploader import ALLOWED_UPLOAD_TYPES
else:
    from cgi import FieldStorage
    if toolkit.check_ckan_version(min_version='2.7.0'):
        from werkzeug.datastructures import FileStorage as FlaskFileStorage
        ALLOWED_UPLOAD_TYPES = (FieldStorage, FlaskFileStorage)
    else:
        ALLOWED_UPLOAD_TYPES = (FieldStorage)

and maybe also in combination of
https://github.com/qld-gov-au/ckan/blob/d13b0e80726d133539d7067fee1f83d1e791e03b/ckan/logic/action/delete.py#L191

        # Don't break if old plugin/interface is found
        if hasattr(upload, "delete"):
            upload.delete(id)  # type: ignore
        else:
            log.warning("%s does not have delete function, could not cleanup: %s", type(upload).__name__, id)

@duttonw
Copy link
Collaborator

duttonw commented Jul 2, 2024

Hi @JVickery-TBS ,

Once you can resolve the merge conflict as well as why the tests is failing, this should be good to go

2024-05-14 17:22:31,450 INFO  [ckan.config.environment] Loading templates from /srv/app/src/ckan/ckan/templates
Traceback (most recent call last):
  File "/usr/bin/ckan", line 8, in <module>
    sys.exit(ckan())
  File "/usr/lib/python3.10/site-packages/click/core.py", line 1[13](https://github.com/ckan/ckanext-xloader/actions/runs/9083543297/job/24962461543?pr=215#step:5:14)0, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.10/site-packages/click/core.py", line 1054, in main
    with self.make_context(prog_name, args, **extra) as ctx:
  File "/usr/lib/python3.10/site-packages/click/core.py", line 920, in make_context
    self.parse_args(ctx, args)
  File "/srv/app/src/ckan/ckan/cli/cli.py", line 121, in parse_args
    result = super().parse_args(ctx, args)
  File "/usr/lib/python3.10/site-packages/click/core.py", line 1613, in parse_args
    rest = super().parse_args(ctx, args)
  File "/usr/lib/python3.10/site-packages/click/core.py", line 1378, in parse_args
    value, args = param.handle_parse_result(ctx, opts, args)
  File "/usr/lib/python3.10/site-packages/click/core.py", line 2360, in handle_parse_result
    value = self.process_value(ctx, value)
  File "/usr/lib/python3.10/site-packages/click/core.py", line 2322, in process_value
    value = self.callback(ctx, self, value)
  File "/srv/app/src/ckan/ckan/cli/cli.py", line 131, in _init_ckan_config
    _add_ctx_object(ctx, value)
  File "/srv/app/src/ckan/ckan/cli/cli.py", line [14](https://github.com/ckan/ckanext-xloader/actions/runs/9083543297/job/24962461543?pr=215#step:5:15)0, in _add_ctx_object
    ctx.obj = CtxObject(path)
  File "/srv/app/src/ckan/ckan/cli/cli.py", line 57, in __init__
    self.app = make_app(raw_config)
  File "/srv/app/src/ckan/ckan/config/middleware/__init__.py", line 27, in make_app
    load_environment(conf)
  File "/srv/app/src/ckan/ckan/config/environment.py", line 69, in load_environment
    p.load_all()
  File "/srv/app/src/ckan/ckan/plugins/core.py", line [22](https://github.com/ckan/ckanext-xloader/actions/runs/9083543297/job/24962461543?pr=215#step:5:23)4, in load_all
    load(*plugins)
  File "/srv/app/src/ckan/ckan/plugins/core.py", line 240, in load
    service = _get_service(plugin)
  File "/srv/app/src/ckan/ckan/plugins/core.py", line 347, in _get_service
    return plugin.load()(name=plugin_name)
  File "/usr/lib/python3.10/site-packages/pkg_resources/__init__.py", line [24](https://github.com/ckan/ckanext-xloader/actions/runs/9083543297/job/24962461543?pr=215#step:5:25)64, in load
    return self.resolve()
  File "/usr/lib/python3.10/site-packages/pkg_resources/__init__.py", line 2470, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/__w/ckanext-xloader/ckanext-xloader/ckanext/xloader/plugin.py", line 7, in <module>
    from ckanext.datastore.interfaces import IDataDictionaryForm
ImportError: cannot import name 'IDataDictionaryForm' from 'ckanext.datastore.interfaces' (/srv/app/src/ckan/ckanext/datastore/interfaces.py)
Error: Process completed with exit code 1.

- Updated various tests.
- Added more gettext.
# Conflicts:
#	ckanext/xloader/loader.py
### RESOLVED.
- Updated post parser to set empty string type cells to `None` for parody with `load_csv`.
- Updated some tests for new code.
- Added check for ckan version `2.11` for the data dictionary form.
- Updated conditions in `load_csv`.
- Fixed test for new output with `strip_extra_white`.
- Fixed CKAN version pardoy of `info` vs `_info` in `update_datastore_info_field` (upstream issue).
@JVickery-TBS
Copy link
Contributor Author

Just the one 2.9 test failing now, will have to get a 2.9 environment set up quickly and see if I can debug it in there. I know there are major differences between 2.10 datastore_info and 2.9 so I imagine it is something in there.

ckanext/xloader/loader.py Show resolved Hide resolved
- DS fields for ckan versions.
- DS fields for ckan versions.
- Support current versions of CKAN for the DataDictionary form override for `strip_extra_white`.
- Support current versions of CKAN for existing info before existing data dictionary custom fields.
@@ -81,7 +81,7 @@ def test_xloader_data_into_datastore(self, cli, data):
with mock.patch("ckanext.xloader.jobs.get_response", get_response):
stdout = cli.invoke(ckan, ["jobs", "worker", "--burst"]).output
assert "File hash: d44fa65eda3675e11710682fdb5f1648" in stdout
assert "Fields: [{'id': 'x', 'type': 'text'}, {'id': 'y', 'type': 'text'}]" in stdout
assert "Fields: [{'id': 'x', 'type': 'text', 'strip_extra_white': True}, {'id': 'y', 'type': 'text', 'strip_extra_white': True}]" in stdout
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need tests of the ability to disable this behaviour.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ThrawnCA okay added test coverage for load_csv and load_table with 'strip_extra_white': False

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...welp those didn't work. Worked locally on 2.9 and 2.11, will see if I can replicate and fix em up

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ThrawnCA I cannot seem to be able to replicate the failed tests in any of my environments.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JVickery-TBS can you rebase this and try again to see if the 'tests' fix themselves.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants