Merge pull request #47 from marlanperumal/release/0.2.3
Release/0.2.3
marlanperumal authored Jun 20, 2021
2 parents 27633ed + a2df2aa commit 92a3017
Showing 8 changed files with 507 additions and 278 deletions.
20 changes: 20 additions & 0 deletions .github/workflows/issue-branch.yml
@@ -0,0 +1,20 @@
name: Create issue branch

on:
  issues:
    types: [ assigned ]
  issue_comment:
    types: [ created ]
  pull_request:
    types: [ closed ]

jobs:
  create_issue_branch_job:
    runs-on: ubuntu-latest
    steps:
      - name: Create Issue Branch
        uses: robvanderleek/create-issue-branch@master
        with:
          branchname: 'feature/GH-${issue.number}-${issue.title,}'
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
8 changes: 4 additions & 4 deletions Pipfile
@@ -4,13 +4,13 @@ verify_ssl = true
 name = "pypi"

 [packages]
-ipykernel = "*"
-tabula-py = "*"
-pikepdf = "*"
-click = "*"
+tabula-py = "==2.2.0"
+pikepdf = "==2.11.4"
+click = "==7.1.2"
 pdf-statement-reader = {editable = true, path = "."}

 [dev-packages]
+ipykernel = "*"
 setuptools = "*"
 wheel = "*"
 pytest = "*"
559 changes: 291 additions & 268 deletions Pipfile.lock

Large diffs are not rendered by default.

12 changes: 12 additions & 0 deletions README.md
@@ -13,6 +13,15 @@ Banks generally send account statements in pdf format. These pdfs are often encr

## Installation

Python software can optionally be installed in a virtual environment to avoid conflicts with system packages, as described [here](https://docs.python.org/3/library/venv.html).
For example, on Windows:
```
python -m venv ./venv/psr
.\venv\psr\scripts\activate
cd .\venv\psr
```
Use `deactivate` to return to the normal system.

```
pip install pdf-statement-reader
```
@@ -70,6 +79,7 @@ The configuration file itself is in JSON format. Here's the Absa cheque account

```json5
{
"$schema": "https://raw.githubusercontent.com/marlanperumal/pdf_statement_reader/develop/pdf_statement_reader/config/psr_config.schema.json",
// Describes the page layout that should be scanned
"layout": {
// Default layout for all pages not otherwise defined
@@ -127,6 +137,8 @@ The configuration file itself is in JSON format. Here's the Absa cheque account

These were the configuration options that were required for the default format. It is envisaged that as more formats are added, the list of options will grow.

This format is also captured in `pdf_statement_reader/config/psr_config.schema.json` as a [json-schema](https://json-schema.org/understanding-json-schema/index.html). If you're using VS Code or another compatible text editor, you should get autocompletion hints as long as you include the `$schema` tag at the top of your json file.

A key part of setting up a new configuration is getting the page coordinates for the area and columns. The easiest way to do this is to run the [tabula GUI](https://tabula.technology/), autodetect the page areas, save the settings as a template, then download and inspect the JSON template file. It's not a one-to-one mapping to the psr config, but it should be a good starting point.
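
As a rough illustration of that mapping, the sketch below converts one selection from a Tabula template into a psr-style `area`. It assumes the exported template is a JSON list whose entries carry `x1`, `y1`, `x2` and `y2` in points; the helper name `tabula_template_to_area` and the sample numbers are hypothetical and not part of psr.

```python
import json


def tabula_template_to_area(template_json):
    """Convert the first selection of a Tabula template to a psr-style area.

    Assumption: each template entry has x1, y1, x2, y2 keys in points.
    """
    selections = json.loads(template_json)
    first = selections[0]
    # psr expects integers ordered as [top, left, bottom, right]
    return [round(first[key]) for key in ("y1", "x1", "y2", "x2")]


# Hypothetical template content, made up for illustration
example_template = json.dumps([
    {"page": 1, "extraction_method": "guess",
     "x1": 27.2, "y1": 280.4, "x2": 575.8, "y2": 762.9,
     "width": 548.6, "height": 482.5}
])

print(tabula_template_to_area(example_template))  # prints [280, 27, 763, 576]
```

The `columns` list still has to be filled in by hand, since a template selection only describes the outer box of the table.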

## CLI API
161 changes: 161 additions & 0 deletions pdf_statement_reader/config/psr_config.schema.json
@@ -0,0 +1,161 @@
{
"$schema": "http://json-schema.org/draft-07/schema",
"$id": "https://github.com/marlanperumal/pdf_statement_reader/psr_config.schema.json",
"title": "PDF Statement Reader Config",
"description": "Config file to be used by the PDF Statement Reader (psr) application to read pdf bank statements",
"type": "object",
"properties": {
"layout": {
"description": "Describes the page layout that should be scanned",
"type": "object",
"properties": {
"default": {
"description": "Default layout for all pages not otherwise defined",
"type": "object",
"properties": {
"area": {
"description": "The page coordinates containing the table in pts: [top, left, bottom, right]",
"type": "array",
"items": {
"type": "integer"
},
"minItems": 4,
"maxItems": 4,
"examples": [
[280, 27, 763, 576]
],
"uniqueItems": true
},
"columns": {
"description": "The right x coordinate of each column in the table",
"type": "array",
"items": {
"type": "integer"
},
"minItems": 1,
"examples": [
[83, 264, 344, 425, 485, 570]
],
"uniqueItems": true
}
}
},
"first": {
"description": "Layout for the first page",
"type": "object",
"properties": {
"area": {
"description": "The page coordinates containing the table in pts: [top, left, bottom, right]",
"type": "array",
"items": {
"type": "integer"
},
"minItems": 4,
"maxItems": 4,
"examples": [
[280, 27, 763, 576]
],
"uniqueItems": true
},
"columns": {
"description": "The right x coordinate of each column in the table",
"type": "array",
"items": {
"type": "integer"
},
"minItems": 1,
"examples": [
[83, 264, 344, 425, 485, 570]
],
"uniqueItems": true
}
}
}
},
"required": ["default"]
},
"columns": {
"description": "The columns names to be used as they exactly appear in the statement",
"type": "object",
"minProperties": 1,
"examples": [
{
"trans_date": "Date",
"trans_type": "Transaction Description",
"trans_detail": "Transaction Detail",
"debit": "Debit Amount",
"credit": "Credit Amount",
"balance": "Balance"
}
]
},
"order": {
"description": "The order of the columns to be output in the csv",
"type": "array",
"items": {
"type": "string"
},
"minItems": 1,
"examples": [
[
"trans_date",
"trans_type",
"trans_detail",
"debit",
"credit",
"balance"
]
],
"uniqueItems": true
},
"cleaning": {
"description": "Specifies any cleaning operations required",
"type": "object",
"properties": {
"numeric": {
"description": "Convert these columns to numeric",
"type": "array",
"items": {
"type": "string"
},
"examples": [
["debit", "credit", "balance"]
],
"uniqueItems": true
},
"date": {
"description": "Convert these columns to date",
"type": "array",
"items": {
"type": "string"
},
"examples": [
["trans_date"]
],
"uniqueItems": true
},
"date_format": {
"description": "Use this date format to parse any date columns",
"type": "string",
"examples": ["%d/%m/%Y"]
},
"trans_detail": {
"description": "For cases where the transaction detail is stored in the next line below the transaction type",
"type": "string",
"examples": ["below"]
},
"dropna": {
"description": "Only keep the rows where these columns are populated",
"type": "array",
"items": {
"type": "string"
},
"examples": [
["balance"]
],
"uniqueItems": true
}
}
}
}
}
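
To make the layout constraints in this schema concrete, here is a minimal standard-library sketch that checks a `layout` block against a few of them: a required `default` entry, a four-item `area`, a non-empty `columns` list, and unique integer items. The function names are hypothetical, and the check is deliberately simplified (it treats every key under `layout` as a page spec), so it is not a substitute for a real JSON Schema validator.

```python
def check_int_array(name, values, min_items, max_items=None):
    """Check item count, integer type and uniqueness for one coordinate array."""
    problems = []
    too_few = len(values) < min_items
    too_many = max_items is not None and len(values) > max_items
    if too_few or too_many:
        problems.append(f"{name}: wrong number of items ({len(values)})")
    if not all(type(v) is int for v in values):
        problems.append(f"{name}: every item must be an integer")
    elif len(set(values)) != len(values):
        problems.append(f"{name}: items must be unique")
    return problems


def check_layout(layout):
    """Return a list of problems found in a psr layout block (empty if none)."""
    problems = []
    if "default" not in layout:
        problems.append("layout.default is required")
    for page, spec in layout.items():
        if "area" in spec:
            problems += check_int_array(f"layout.{page}.area", spec["area"], 4, 4)
        if "columns" in spec:
            problems += check_int_array(f"layout.{page}.columns", spec["columns"], 1)
    return problems


good = {"default": {"area": [280, 27, 763, 576],
                    "columns": [83, 264, 344, 425, 485, 570]}}
print(check_layout(good))  # prints []
```

Returning a list of messages rather than raising on the first failure makes it easier to report every problem in a config file at once.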
1 change: 1 addition & 0 deletions pdf_statement_reader/config/za/absa/cheque.json
@@ -1,4 +1,5 @@
{
"$schema": "https://raw.githubusercontent.com/marlanperumal/pdf_statement_reader/develop/pdf_statement_reader/config/psr_config.schema.json",
"layout": {
"default": {
"area": [280, 27, 763, 576],
14 changes: 8 additions & 6 deletions pdf_statement_reader/parse.py
@@ -34,15 +34,17 @@ def get_raw_df(filename, num_pages, config):
     return statement


+def format_negatives(s):
+    s = str(s)
+    if s.endswith("-"):
+        return "-" + s[:-1]
+    else:
+        return s
+
+
 def clean_numeric(df, config):
     numeric_cols = [config["columns"][col] for col in config["cleaning"]["numeric"]]

-    def format_negatives(s):
-        s = str(s)
-        if s.endswith("-"):
-            return "-" + s[:-1]
-        else:
-            return s

     for col in numeric_cols:
         df[col] = df[col].apply(format_negatives)
10 changes: 10 additions & 0 deletions tests/test_parse_methods.py
@@ -0,0 +1,10 @@
from pdf_statement_reader.parse import format_negatives

def test_format_negatives():
    assert format_negatives(123.45) == "123.45"
    assert format_negatives(-123.45) == "-123.45"
    assert format_negatives(0) == "0"
    assert format_negatives("123.45") == "123.45"
    assert format_negatives("-123.45") == "-123.45"
    assert format_negatives("0") == "0"
    assert format_negatives("123.45-") == "-123.45"
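
The tests above only cover the string rewrite. As a sketch of how the result might be consumed, the hypothetical `to_float` helper below normalises the sign with `format_negatives` and then parses the value. The thousands-separator handling is an assumption for illustration and is not taken from `parse.py`.

```python
def format_negatives(s):
    # Same logic as in the diff: move a trailing minus sign to the front
    s = str(s)
    if s.endswith("-"):
        return "-" + s[:-1]
    else:
        return s


def to_float(raw):
    """Hypothetical helper: fix the sign, drop separators, then parse."""
    cleaned = format_negatives(raw).replace(",", "").replace(" ", "")
    return float(cleaned)


print(to_float("1,234.56-"))  # prints -1234.56
```

Because `format_negatives` always calls `str` first, values that are already numeric pass through unchanged before parsing.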
