Parse Errors #33

Open
flywire opened this issue May 30, 2021 · 13 comments

@flywire
Contributor

flywire commented May 30, 2021

Done at #33 (comment) (pdf at #30).

Date    Transaction                             Debit   Credit  Balance
01 Jul  2018 OPENING BALANCE                                    $1,384.89 CR
01 Jul  DEBIT INTEREST CHARGED on this account
        to June 30. 2018 is $0.11
02 Jul  Transfer to another Bank NetBank        372.00          $1,012.89 CR
        Rob Ubank Transfer
  1. Amounts may start with, end with, or contain CR/DB
  2. Dates have no year
  3. Skip lines that start with, end with, or contain Balance/Forward
  4. Concatenate wrapped Transaction descriptions
  5. Allow a currency sign
  6. Allow a thousands separator
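
Items 1, 5 and 6 above can be illustrated with a minimal sketch. `parse_amount` is a hypothetical helper, not part of pdf_statement_reader, and it assumes that a trailing CR marks a positive value and a trailing DR/DB a negative one:

```python
import re
from decimal import Decimal
from typing import Optional

# Hypothetical helper for illustration only; not part of pdf_statement_reader.
# Accepts an optional "$", digits with well-formed "," thousands groups
# (or no separators at all), and an optional two-digit decimal part.
_AMOUNT_RE = re.compile(r"^\$?(\d{1,3}(?:,\d{3})*|\d+)(\.\d{2})?$")


def parse_amount(text: str) -> Optional[Decimal]:
    """Parse values such as '$1,384.89 CR'; return None if the text is malformed."""
    cleaned = text.strip().upper()
    sign = 1
    # Assumption: a trailing CR is positive, a trailing DR/DB is negative.
    for marker, marker_sign in (("CR", 1), ("DR", -1), ("DB", -1)):
        if cleaned.endswith(marker):
            cleaned = cleaned[: -len(marker)].strip()
            sign = marker_sign
            break
    match = _AMOUNT_RE.match(cleaned)
    if match is None:
        return None
    whole = match.group(1).replace(",", "")
    fraction = match.group(2) or ""
    return sign * Decimal(whole + fraction)
```

Returning `None` for malformed text, rather than raising, lets a caller decide whether to skip the row or stop the conversion.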
@flywire
Contributor Author

flywire commented Jun 14, 2021

Using #34 (comment)

(pdf_statement_reader) E:\venv\pdf_statement_reader>psr pdf2csv MockTemplate.pdf
Converted MockTemplate.pdf and saved as MockTemplate.csv
Date,Transaction Description,Transaction Detail,Debit Amount,Credit Amount,Balance
,"Lorem ipsum dolor sit amet,","consectetur adipiscing elit, sed",-197.60,,802.40
,tristique senectus. Quisque,,-976.14,,192.73
,egestas diam in arcu curs,euismod quis. Eget mi proin sed,-904.80,,-712.07
,libero enim sed. Dignissim enim,,,327.70,-831.74
,sit amet venenatis urna. Ipsum a,,,362.86,-468.88
,arcu cursus vitae congue mauris,,-455.41,,-924.29
,rhoncus aenean,,,169.89,-754.40
,consectetur adipiscing,,,462.19,-292.21
,arcu felis bibendum ut tristique,,,452.75,160.54
,et egestas. Sagittis vitae et leos,,,171.54,332.08
,duis ut diam. Tempor commo,,,237.17,569.25
,ullamcorper,sed arcu non. Facilisis volutpat,,392.18,961.43
,est velil egestas. Vel facilisi,,-786.68,,595.30
  1. Date dropped with format used
  2. Transactions with a thousands separator in Balance are dropped (the run terminates with an exception if a thousands separator appears in Debit/Credit Amount)
  3. Where next transaction dropped, next Transaction Description shown in previous Transaction Detail
  4. Only first page processed

@marlanperumal
Owner

Can you provide the full json config that you used? In the issue that you linked, I only see the layout section.

@flywire
Contributor Author

flywire commented Jun 14, 2021

(pdf_statement_reader) E:\venv\pdf_statement_reader>psr pdf2csv MockTemplate.pdf

[default: za.absa.cheque]

@marlanperumal
Owner

Oh I see, did you just use the default config?

{
    "layout": {
        "default": {
            "area": [280, 27, 763, 576],
            "columns": [83, 264, 344, 425, 485, 570]
        },
        "first": {
            "area": [480, 27, 763, 576],
            "columns": [83, 264, 344, 425, 485, 570]
        }
    },
    "columns": {
        "trans_date": "Date",
        "trans_type": "Transaction Description",
        "trans_detail": "Transaction Detail",
        "debit": "Debit Amount",
        "credit": "Credit Amount",
        "balance": "Balance"
    },
    "order": [
        "trans_date",
        "trans_type",
        "trans_detail",
        "debit",
        "credit",
        "balance"
    ],
    "cleaning": {
        "numeric": ["debit", "credit", "balance"],
        "date": ["trans_date"],
        "date_format": "%d/%m/%Y",
        "trans_detail": "below",
        "dropna": ["balance"]
    }
}

You should create a new json config file, updated with the correct values, store it in pdf_statement_reader/config/au/citi/mock.json or similar and call it with

psr pdf2csv MockTemplate.pdf --config=au.citi.mock
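
One way to produce such a file is to copy the default config and override only the values that differ. The sketch below is an illustration under assumptions: `derive_config` and `save_config` are hypothetical helpers, not part of pdf_statement_reader, and the override shown (a two-digit year format) is only an example.

```python
import copy
import json
from pathlib import Path


def derive_config(base: dict, cleaning_overrides: dict) -> dict:
    """Return a copy of `base` with keys in its "cleaning" section overridden."""
    config = copy.deepcopy(base)  # leave the default config untouched
    config.setdefault("cleaning", {}).update(cleaning_overrides)
    return config


def save_config(config: dict, path) -> None:
    """Write the config as indented JSON, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=4), encoding="utf-8")


# Example usage (paths follow the layout suggested above):
# base = json.loads(Path("pdf_statement_reader/config/za/absa/cheque.json").read_text())
# mock = derive_config(base, {"date_format": "%d/%m/%y"})
# save_config(mock, "pdf_statement_reader/config/au/citi/mock.json")
```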

@flywire
Contributor Author

flywire commented Jun 14, 2021

lol now I finally get it, we have a cross-communication issue.

For now the only config supported is for Cheque account statements from Absa bank in South Africa.

You are suggesting a new config file is required but I was trying to create a sample pdf to use with default setting.

Testing pdf(s) are also needed. Maybe pdf_statement_reader/config/00/test/mock.json

@flywire
Contributor Author

flywire commented Jun 16, 2021

Read MockTemplate.pdf date with "date_format": "%d/%m/%y" but saved in csv file as "yyyy/mm/dd"

@flywire
Contributor Author

flywire commented Jun 20, 2021

What do you think about changing JSON cleaning as follows?

  • "drop": ["trans_type", "(^(?i)BALANCE.*(?i)FORWARD$)"]

@marlanperumal
Owner

marlanperumal commented Jun 20, 2021

What do you think about changing JSON cleaning as follows?

  • "drop": ["trans_type", "(^(?i)BALANCE.*(?i)FORWARD$)"]

To be honest I'd prefer to be explicit on the column names, also because of my previously mentioned aversion to regex. I might be persuaded to go for some additional keys like

  • drop_if_prefix
  • drop_if_suffix
  • drop_if_contains
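
A rough sketch of how the three proposed keys might behave, assuming each key maps explicit column names to a list of strings and that matching is case-insensitive. None of this exists in pdf_statement_reader; the config shape and the function names are illustrative assumptions.

```python
from typing import Dict, List

Row = Dict[str, str]

# Each proposed key maps to a plain string test, so no regex is needed.
_TESTS = {
    "drop_if_prefix": str.startswith,
    "drop_if_suffix": str.endswith,
    "drop_if_contains": str.__contains__,
}


def should_drop(row: Row, cleaning: dict) -> bool:
    """True if any configured column value matches one of its listed strings."""
    for key, test in _TESTS.items():
        for column, needles in cleaning.get(key, {}).items():
            value = (row.get(column) or "").casefold()
            if any(test(value, needle.casefold()) for needle in needles):
                return True
    return False


def drop_rows(rows: List[Row], cleaning: dict) -> List[Row]:
    """Return only the rows that no drop_if_* rule matches."""
    return [row for row in rows if not should_drop(row, cleaning)]


# Hypothetical config fragment:
# "cleaning": {"drop_if_prefix": {"trans_type": ["BALANCE"]},
#              "drop_if_suffix": {"trans_type": ["FORWARD"]}}
```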

@marlanperumal
Owner

Read MockTemplate.pdf date with "date_format": "%d/%m/%y" but saved in csv file as "yyyy/mm/dd"

Are you reading the csv file into Excel? (hence the format change). The desired behaviour is actually to always output the date in iso-format YYYY-MM-DD no matter what the input format.
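
The behaviour described here, parsing with the configured `date_format` and always writing ISO dates, can be sketched with the standard library. `to_iso_date` is a hypothetical name used for illustration and is not claimed to be how pdf_statement_reader implements it.

```python
from datetime import datetime


def to_iso_date(text: str, date_format: str) -> str:
    """Parse `text` using `date_format` and return it as YYYY-MM-DD."""
    return datetime.strptime(text.strip(), date_format).date().isoformat()


# A two-digit year needs %y; %Y expects four digits and raises ValueError here.
print(to_iso_date("01/07/18", "%d/%m/%y"))    # 2018-07-01
print(to_iso_date("01/07/2018", "%d/%m/%Y"))  # 2018-07-01
```

Software that needs another output format can reformat the ISO string afterwards, since it parses unambiguously.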

@marlanperumal
Owner

lol now I finally get it, we have a cross-communication issue.

For now the only config supported is for Cheque account statements from Absa bank in South Africa.

You are suggesting a new config file is required but I was trying to create a sample pdf to use with default setting.

Testing pdf(s) are also needed. Maybe pdf_statement_reader/config/00/test/mock.json

I'll try to create a config for your mock pdf later this week. It'll actually be a really good base to use for testing all the other functionality.

@flywire
Contributor Author

flywire commented Jun 20, 2021

config > [country code] > [bank] > [statement type].json

Country code is irrelevant, as a bank's format is normally consistent across countries.

explicit column names

The json file already uses coded field names, which are more portable because they are not specific to actual statements.

The desired behaviour is actually to always output the date in iso-format YYYY-MM-DD no matter what the input format.

No, the output format is specified by the software consuming the file.

I'll try to create a config for your mock pdf later this week

Only difference from default is "date_format": "%d/%m/%Y" -> "date_format": "%d/%m/%y". Do one for CbaBankStatement.pdf.

@flywire
Contributor Author

flywire commented Jun 21, 2021

CbaBankStatement.pdf fully supported on https://github.com/flywire/pdf_statement_reader/commits/cba with extended functionality.

CbaBankStatement.csv

Date,Transaction,Debit,Credit,Balance
01/07/18,2018 Opening Balance,,,1384.89
01/07/18,Debit Interest Charged On This Account To June 30. 2018 Is $0.11,,,
02/07/18,Transfer To Another Bank Netbank Rob Ubank Transfer,372.00,,1012.89
02/07/18,Transfer From Mckay Mj Mick Mckay - Neck Hackles.,,43.80,
02/07/18,Direct Debit 000115 Colonial Mutual 1200180874627741,25.00,,
02/07/18,Loan Repayment Ln Repay694259331,280.00,,751.69
03/07/18,Petstock Heathmont P Heathmont Aus Card Xx4521 Value Date: 01/07/2018,32.99,,718.70
03/07/18,Woolworths 3149 Eastla Ringwood Aus Card Xx4521 Value Date: 01/07/2018,134.26,,584.44
03/07/18,Heathmont Iga Heathmont Aus Card Xx4521 Value Date: 30/06/2018,30.00,,554.44
03/07/18,Post Heathmont Lpohe Heathmont Ca Aus Card Xx4521 Value Date: 29/06/2018,47.57,,506.87
03/07/18,Five Star Music Pl Ringwood Vi Aus Card Xx4521 Value Date: 28/06/2018,50.00,,456.87
03/07/18,Post Heathmont Lpohe Heathmont Ca Aus Card Xx4521 Value Date: 28/06/2018,55.65,,401.22

Note: The original pdf file has 2 invalid balances because "." is used as both the decimal and the thousands separator in those values, so they are ignored.

@flywire flywire closed this as completed Jun 22, 2021
@flywire flywire reopened this Jun 22, 2021
@flywire
Contributor Author

flywire commented Jun 24, 2021

The issue remains for data misidentified as transactions after the end of the transaction data, as previously described. I suggest supporting dropping all records after a user-specified string.
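
A minimal sketch of that suggestion, assuming rows are dicts keyed by field name and that the row containing the marker is dropped along with everything after it. `truncate_at_marker` and its parameters are hypothetical, not an existing pdf_statement_reader option.

```python
from typing import Dict, List

Row = Dict[str, str]


def truncate_at_marker(rows: List[Row], column: str, marker: str) -> List[Row]:
    """Keep rows up to, but not including, the first row whose `column` contains `marker`."""
    kept: List[Row] = []
    needle = marker.casefold()  # assumption: matching is case-insensitive
    for row in rows:
        if needle in (row.get(column) or "").casefold():
            break  # end of transaction data: ignore this row and all later ones
        kept.append(row)
    return kept
```

If the marker never appears, every row is kept, so the option would be safe to leave set on statements that lack the trailing text.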
