A web crawler in Python that crawls the World Wide Web starting from a given seed URL up to a specified depth, retrieves PDF files, and converts them to text documents with proper formatting and minimal loss of data. The crawler takes two inputs (a rough sketch of the overall flow is given after this list):
1. Seed URL
2. Path to the directory where the PDF and text files should be saved
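The overall flow can be sketched roughly as follows. This is an illustrative outline only, not the exact implementation: the function and parameter names are assumptions, and it presumes the third-party packages requests, beautifulsoup4, and pdfminer.six for fetching pages, extracting links, and converting PDFs to text.

```python
# Illustrative sketch of the crawl: breadth-first traversal from the seed URL
# up to a fixed depth, saving PDFs and converting each one to a text file.
import os
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from pdfminer.high_level import extract_text

def crawl(seed_url, out_dir, max_depth=2):
    visited = set()
    queue = deque([(seed_url, 0)])          # (url, depth) pairs to visit
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        if url.lower().endswith(".pdf"):
            # Save the PDF, then extract its text into a .txt file alongside it.
            name = os.path.join(out_dir, os.path.basename(url))
            with open(name, "wb") as f:
                f.write(resp.content)
            with open(name.rsplit(".", 1)[0] + ".txt", "w", encoding="utf-8") as f:
                f.write(extract_text(name))
            continue
        # Otherwise parse the page and enqueue its links one level deeper.
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            queue.append((urljoin(url, a["href"]), depth + 1))
```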
Known limitations:
1. Complex PDFs that contain non-trivial fonts or embedded images are not converted to text cleanly.
2. The same PDF may be downloaded more than once when multiple links map to the same URI.
3. The entire script runs in a single thread, which slows down the process.
4. The extracted text may contain unnecessary data or noise from the PDFs.
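The redundant-download issue could in principle be reduced by normalising URLs before they are compared against the visited set. The sketch below is only a possible mitigation, and the names in it (normalize, should_fetch) are illustrative rather than part of the current script.

```python
# Possible mitigation (not in the current script): normalise URLs so that links
# that differ only in case, trailing slash, or fragment map to one visited entry.
from urllib.parse import urlsplit, urlunsplit

seen = set()

def normalize(url):
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"   # treat ".../dir/" and ".../dir" alike
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

def should_fetch(url):
    """Return True the first time a normalised URL is seen, False afterwards."""
    key = normalize(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```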