This project is a Python script that extracts tabular dengue case data from a PDF published by the Epidemiology and Disease Control Division, Department of Health Services, Ministry of Health & Population, and saves it to an Excel file. The extracted data can then be used for further analysis or reporting.
ℹ️ Visit the public data download platform to view the 2024 dengue situation reports for Nepal at https://konishon.github.io/data-dengue-situation-report-nepal-2024/
- Extract tables from a PDF document using pdfplumber.
- Convert the extracted tables into a pandas DataFrame.
- Save the DataFrame to an Excel file with a name derived from the original PDF file.
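The three steps above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names and the `_extracted_tables.xlsx` suffix are assumptions, and all tables are simply stacked row by row without any header handling.

```python
from pathlib import Path


def extract_tables(pdf_path):
    """Return every table found in the PDF as a list of row lists."""
    import pdfplumber  # third-party; imported here so the helpers below work without it

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_tables() returns one list of rows per table on the page
            tables.extend(page.extract_tables())
    return tables


def flatten_tables(tables):
    """Concatenate the rows of all tables into a single list of rows."""
    return [row for table in tables for row in table]


def output_path_for(pdf_path, suffix="_extracted_tables.xlsx"):
    """Derive the output file name from the PDF name (assumed naming scheme)."""
    pdf_path = Path(pdf_path)
    return pdf_path.with_name(pdf_path.stem + suffix)


def pdf_to_excel(pdf_path):
    """Extract all tables from the PDF and write them to one Excel sheet."""
    import pandas as pd  # writing .xlsx also needs an engine such as openpyxl

    rows = flatten_tables(extract_tables(pdf_path))
    df = pd.DataFrame(rows)
    out = output_path_for(pdf_path)
    # No header row is inferred in this sketch, so none is written
    df.to_excel(out, index=False, header=False)
    return out
```

Calling `pdf_to_excel("report.pdf")` would write `report_extracted_tables.xlsx` next to the input file, provided pdfplumber, pandas, and an Excel writer are installed.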
To install the required packages, follow these steps:
- Clone the repository or download the script files.
- Install the necessary Python dependencies using pip:
pip install -r requirements.txt
To extract the tables from a PDF, pass its path to the script:
python dengue_cases_nepal_data_extraction.py "data/PDFs/extracted/67049d6319129-2.pdf" --auto-clean
This will generate a CSV file named dengue_cases_nepal_extracted_tables.CSV.
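For readers wondering how a command line of this shape is typically parsed, here is a small argparse sketch. It is an assumption about the interface, not the script's real parser: `build_parser` and the `pdf_path` argument name are hypothetical, and the flag is only parsed as a boolean because its exact effect is not described here.

```python
import argparse


def build_parser():
    """Build a parser for one positional PDF path and an optional flag."""
    parser = argparse.ArgumentParser(
        description="Extract tables from a dengue situation report PDF."
    )
    parser.add_argument("pdf_path", help="path to the PDF file to process")
    # The flag name comes from the usage line; what it does is not specified here
    parser.add_argument(
        "--auto-clean",
        action="store_true",
        help="enable the optional clean-up step",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.pdf_path, args.auto_clean)
```

With `action="store_true"`, the option defaults to `False` and becomes `True` only when `--auto-clean` is passed.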
To build the download page from an extracted CSV file, run:
python generate_html.py data/CSVs/66f3af5ec20b2-2_extracted_tables.csv
This will generate the public data download page available at https://konishon.github.io/data-dengue-situation-report-nepal-2024/
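As a rough idea of what turning a CSV into an HTML page involves, the sketch below renders CSV text as a bare HTML table using only the standard library. It is not the code in generate_html.py, whose output format is not described here, and `csv_to_html_table` is a hypothetical name.

```python
import csv
import html
import io


def csv_to_html_table(csv_text):
    """Render CSV text as a minimal HTML table, escaping every cell."""
    rows = csv.reader(io.StringIO(csv_text))
    lines = ["<table>"]
    for row in rows:
        # html.escape keeps characters such as & and < from breaking the markup
        cells = "".join(f"<td>{html.escape(cell)}</td>" for cell in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)
```

The returned string can be written to an `.html` file or embedded in a larger page template.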
- If the PDF file doesn't exist, the script will raise a FileNotFoundError.
- If no tables are found in the PDF, the script will raise a ValueError.
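The two failure cases above could be checked along these lines. This is a sketch under assumptions: the helper names and error messages are hypothetical and may differ from what the script actually does.

```python
from pathlib import Path


def ensure_pdf_exists(pdf_path):
    """Return the path as a Path, or raise FileNotFoundError if it is missing."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"PDF file not found: {path}")
    return path


def ensure_tables_found(tables):
    """Return the tables unchanged, or raise ValueError if the list is empty."""
    if not tables:
        raise ValueError("No tables found in the PDF")
    return tables
```

A caller can wrap the extraction in `try` / `except (FileNotFoundError, ValueError)` to print a readable message instead of a traceback.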
This project is licensed under the MIT License. Feel free to use and modify it according to your needs.