Essex County Council Asset Spider

This is a one-off project to spider essex.gov.uk and find all files hosted on the Contentful asset CDN that are in use, to aid in migration.

Running

There are two options for running this programme. Both will write CSV and JSON documents to the output directory.

N.B.: These output files are appended to rather than overwritten, so delete any existing files before re-running the spider.
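For example, assuming the default output paths used in the commands below, a clean re-run starts with:

rm -f ./output/output.json ./output/output.csv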

Docker

There is a Docker environment available. Ensure you have Docker installed (note that the Docker Desktop application requires a paid licence in many situations). Then start the spider using:

docker compose up

This will configure and run the spider, printing its status to the screen.

N.B.: The Docker image contains a copy of the spider code, so this approach isn't well suited to developing changes to the spider: you'll need to run docker compose build to rebuild the image after each change.
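For example, a rebuild-and-run cycle after editing the spider looks like:

docker compose build
docker compose up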

Python

Ensure you have Python 3.9 installed, as well as pipenv. Then run the following:

pipenv install
pipenv run scrapy runspider scraper.py -o ./output/output.json -o ./output/output.csv

By default, this spider will crawl the homepage only. If you have extra URLs to crawl, these can be passed as an argument. For example, to also crawl all news items, even those not reachable from the homepage, you can export them from the Contentful API and add them like this:

contentful space export ...
jq -rc '.entries | [.[] | select(.sys.contentType.sys.id | contains("news")) | ("https://www.essex.gov.uk/news/" + .fields.slug["en-GB"])]' < contentful-export.json > extra_urls.json
pipenv install
pipenv run scrapy runspider scraper.py -o ./output/output.json -o ./output/output.csv -a extra_urls_file=./extra_urls.json
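The actual argument handling lives in scraper.py, but as a rough sketch of how a Scrapy spider typically consumes such a file (the class name, CSS selectors and the ctfassets.net domain below are illustrative assumptions, not the real scraper.py code):

import json
import scrapy

class AssetSpider(scrapy.Spider):
    # Hypothetical spider name; the real spider is defined in scraper.py.
    name = "contentful-assets"
    start_urls = ["https://www.essex.gov.uk/"]

    def __init__(self, extra_urls_file=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if extra_urls_file:
            # The jq command above writes a JSON array of URL strings.
            with open(extra_urls_file) as handle:
                self.start_urls = self.start_urls + json.load(handle)

    def parse(self, response):
        # Record any asset references served from the Contentful CDN
        # (ctfassets.net); the attribute selectors here are illustrative,
        # and link-following to crawl the rest of the site is omitted.
        for url in response.css("img::attr(src), a::attr(href)").getall():
            if "ctfassets.net" in url:
                yield {"page": response.url, "asset": response.urljoin(url)}

Because -a arguments are passed to the spider's constructor as keyword arguments, extra_urls_file arrives as a plain string path, and the yielded items are what end up in the JSON and CSV output files.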
