These scripts were used to run OCR on a corpus of ~1,500 articles by the architectural critic Ada Louise Huxtable.
This code uses the Google Cloud Vision API and Google Cloud Storage.
Before starting, make sure that your input list and your filenames do not include apostrophes or commas.
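One way to check this before uploading is sketched below. The function name `find_problem_names` and the sample filenames are hypothetical and not part of the original scripts; the check only looks for the two characters named above.

```python
# Characters the note above says must not appear in the input list or filenames.
FORBIDDEN_CHARS = ("'", ",")


def find_problem_names(names):
    """Return the names that contain an apostrophe or a comma, in input order."""
    return [name for name in names if any(ch in name for ch in FORBIDDEN_CHARS)]


if __name__ == "__main__":
    # Hypothetical sample names, for illustration only.
    sample = ["article_001.pdf", "critic's_view.pdf", "new_york,_1970.pdf"]
    for bad in find_problem_names(sample):
        print("Rename before running OCR:", bad)
```

An empty result means the list is safe to pass to the OCR script.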
-
google_pdf_ocr.py
Runs OCR on PDF files stored in the cloud and writes the JSON output back to the cloud.
-
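For orientation, the general shape of an asynchronous PDF OCR request with the `google-cloud-vision` Python client is sketched below. This is not the contents of google_pdf_ocr.py. The helper names, the `batch_size` default, and the timeout are assumptions, and the sketch assumes the client library is installed and credentials are configured.

```python
def build_ocr_configs(pdf_uri, output_prefix, batch_size=2):
    """Build input/output settings for one PDF (hypothetical helper)."""
    for uri in (pdf_uri, output_prefix):
        if not uri.startswith("gs://"):
            raise ValueError("Expected a gs:// URI, got: " + uri)
    input_config = {
        "gcs_source": {"uri": pdf_uri},
        "mime_type": "application/pdf",
    }
    output_config = {
        "gcs_destination": {"uri": output_prefix},
        "batch_size": batch_size,  # pages per output JSON file (assumed value)
    }
    return input_config, output_config


def submit_ocr(pdf_uri, output_prefix, timeout=600):
    """Submit one PDF for OCR and wait until the JSON output is written."""
    # Imported here so the helper above can be used without the library installed.
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    input_config, output_config = build_ocr_configs(pdf_uri, output_prefix)
    request = vision.AsyncAnnotateFileRequest(
        features=[vision.Feature(type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)],
        input_config=input_config,
        output_config=output_config,
    )
    operation = client.async_batch_annotate_files(requests=[request])
    operation.result(timeout=timeout)  # blocks until the job finishes
```

Both URIs must point at Cloud Storage, which is why the helper rejects anything that does not start with `gs://`.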
Download the output using the Google Cloud CLI in a terminal:
./google-cloud-sdk/bin/gcloud init
gsutil cp -r [GOOGLE FOLDER] [OUTPUT FOLDER]
-
json_to_csv_rename.py
Writes all output (filename, detected text) to a single CSV.
Code from Google Cloud Vision API.
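As a rough illustration of the JSON-to-CSV step, the sketch below is not the contents of json_to_csv_rename.py. It assumes each downloaded JSON file has a top-level `responses` list whose entries may carry `fullTextAnnotation.text`, and it writes one CSV row per JSON file. The function name and the column headers are hypothetical.

```python
import csv
import json
from pathlib import Path


def json_folder_to_csv(json_dir, csv_path):
    """Write (filename, detected text) for every JSON file in json_dir to one CSV."""
    rows = []
    for path in sorted(Path(json_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        # One response per page; pages with no detected text contribute "".
        pages = [
            response.get("fullTextAnnotation", {}).get("text", "")
            for response in data.get("responses", [])
        ]
        rows.append((path.name, "\n".join(pages)))

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "text"])  # assumed column names
        writer.writerows(rows)
    return len(rows)
```

The `csv` module quotes fields that contain commas or newlines, so multi-page text stays in a single cell.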