Scripts for data gathering for Ada Louise Huxtable text mining project

tracy-st/huxtable-ocr

huxtable-ocr

These scripts were used to run OCR on a corpus of ~1,500 articles by the architectural critic Ada Louise Huxtable.

Getting Started

This code uses the Google Cloud Vision API and Google Cloud Storage.
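As a rough illustration of how the two services fit together, the sketch below builds the JSON body for the Vision API's asynchronous PDF annotation request (the REST method `files:asyncBatchAnnotate`), which reads a PDF from Cloud Storage and writes JSON results back to Cloud Storage. It is not taken from `google_pdf_ocr.py`; the function name, bucket name, and paths are hypothetical, and authentication and the actual HTTP call are left out.

```python
import json


def build_pdf_ocr_request(input_uri, output_prefix, batch_size=20):
    """Build the JSON body for a Vision API files:asyncBatchAnnotate call.

    input_uri and output_prefix are gs:// URIs. The function name and
    arguments are hypothetical; only the field names follow the REST API.
    """
    return {
        "requests": [
            {
                # The PDF to read from Cloud Storage.
                "inputConfig": {
                    "gcsSource": {"uri": input_uri},
                    "mimeType": "application/pdf",
                },
                # Dense-text OCR, suited to scanned articles.
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
                # Where the JSON results are written; batchSize is the
                # number of pages grouped into each output file.
                "outputConfig": {
                    "gcsDestination": {"uri": output_prefix},
                    "batchSize": batch_size,
                },
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical bucket and object names, for illustration only.
    body = build_pdf_ocr_request(
        "gs://example-bucket/pdfs/article_0001.pdf",
        "gs://example-bucket/ocr-output/article_0001/",
    )
    print(json.dumps(body, indent=2))
```

The output prefix ends with a slash so that each article's result files land in their own folder, which keeps the later download step simple.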

Before starting, make sure that your input list and your filenames do not include apostrophes or commas.
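One way to check this before running anything is a small helper like the one below. It is a minimal sketch and not part of the repository; the function names and the example folder are hypothetical, and it only looks for the plain ASCII apostrophe and the comma.

```python
from pathlib import Path

# Characters the README asks you to avoid in the input list and filenames.
FORBIDDEN = ("'", ",")


def find_problem_names(names):
    """Return the names that contain an apostrophe or a comma."""
    return [name for name in names if any(ch in name for ch in FORBIDDEN)]


def find_problem_files(folder):
    """Check every file name under a local folder (hypothetical helper)."""
    names = [path.name for path in Path(folder).rglob("*") if path.is_file()]
    return find_problem_names(names)


if __name__ == "__main__":
    for name in find_problem_names(["plain.pdf", "critic's_view.pdf", "a,b.pdf"]):
        print("rename before uploading:", name)
```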

Usage

  1. google_pdf_ocr.py
    Runs OCR on PDF files stored in Google Cloud Storage and writes the JSON output back to Cloud Storage.

  2. Download the output with the Google Cloud CLI in a terminal:
    ./google-cloud-sdk/bin/gcloud init
    gsutil cp -r [GOOGLE FOLDER] [OUTPUT FOLDER]

  3. json_to_csv_rename.py
    Writes all output (filename, detected text) to a single CSV file.
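The last step can be sketched roughly as follows. This is not the code in `json_to_csv_rename.py`, and it does no renaming; it assumes the downloaded files follow the Vision API's usual output layout, where each JSON file holds a `responses` list with one entry per page and the page text sits under `fullTextAnnotation.text`. The function names and folder paths are hypothetical.

```python
import csv
import json
from pathlib import Path


def extract_text(json_path):
    """Join the detected text of every page in one Vision output file."""
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    pages = []
    for response in data.get("responses", []):
        # Pages with no detected text may lack fullTextAnnotation.
        annotation = response.get("fullTextAnnotation") or {}
        pages.append(annotation.get("text", ""))
    return "\n".join(pages)


def write_csv(json_dir, csv_path):
    """Write one (filename, text) row per JSON file; return the row count."""
    rows = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "text"])
        for json_path in sorted(Path(json_dir).rglob("*.json")):
            writer.writerow([json_path.name, extract_text(json_path)])
            rows += 1
    return rows


if __name__ == "__main__":
    # Hypothetical local paths; point these at your [OUTPUT FOLDER].
    count = write_csv("ocr-output", "huxtable_text.csv")
    print(f"wrote {count} rows")
```

The `csv` module quotes each field as needed, so article text containing commas or line breaks still ends up in a single cell.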

Acknowledgments

Based on code from the Google Cloud Vision API.
