-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wider numeric field format support #44
base: develop
Are you sure you want to change the base?
Conversation
Use regular expression operations on numeric fields to discard all characters except numbers and decimal separator, and identify negative values by a leading or trailing '-' or case insensitive 'DR'.
pdf_statement_reader/parse.py
Outdated
@@ -37,22 +38,24 @@ def get_raw_df(filename, num_pages, config): | |||
def clean_numeric(df, config): | |||
numeric_cols = [config["columns"][col] for col in config["cleaning"]["numeric"]] | |||
|
|||
def format_negatives(s): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know I haven't been good yet about writing tests for the project, but can you please add some tests to show that the format_currency_number
method actually does what it's supposed to.
I have a natural aversion to regex, mostly because it generally results in the opposite of what python generally gives us - clean human readable code that's easy to easily visually grep what's going on. Python's built in string methods usually do a better job at this. In cases like this however with multiple complex rules, I'll concede that using regex may result in cleaner code. I'd only include it if there were tests against it though. Just reading it now, I'm not certain what all the side effects and edge cases might be.
I'll update with a sample test on the original format_negatives
method that you can use as a basis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the format_negatives
method outside of the clean_numeric
method so that it could be tested but that's now caused a merge conflict with your branch. Should be easy enough to fix though.
Looking at that old code I wrote in parse.py
there's certainly a lot to be desired in terms of it being robust to the different content that might come out of a pdf statement file
Use regular expression operations on numeric fields to discard all characters except numbers and decimal separator, and identify negative values by a leading or trailing '-' or case insensitive 'DR'. See #33.