Welcome to "the Lopez"

Crawling and scraping the Web for fun and profit.

A word of caution

There is a very thin line between a crawl and a DoS attack. Please be mindful of the crawling speed you inflict on websites! For your convenience, crawling is limited to 2.5 hits per second per origin, which is a good default. You can override this value using the set max_hits_per_sec directive in your configuration, but make sure that you will not overload the server (or that you have permission to do so). Remember: some people's livelihoods depend on these websites, and not every site has good DoS mitigation.
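To make the number concrete: 2.5 hits per second per origin means at least 0.4 seconds between two consecutive requests to the same origin, while different origins are throttled independently. The Rust sketch below illustrates that idea only. It is not Lopez's actual implementation, and the names OriginThrottle and reserve are hypothetical, made up for this example.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical per-origin throttle (illustration only, not Lopez's code).
struct OriginThrottle {
    /// Minimum spacing between two hits on the same origin.
    min_interval: Duration,
    /// Earliest instant at which each origin may be hit again.
    next_allowed: HashMap<String, Instant>,
}

impl OriginThrottle {
    fn new(max_hits_per_sec: f64) -> Self {
        assert!(max_hits_per_sec > 0.0, "rate must be positive");
        Self {
            // 2.5 hits per second -> one hit every 0.4 seconds.
            min_interval: Duration::from_secs_f64(1.0 / max_hits_per_sec),
            next_allowed: HashMap::new(),
        }
    }

    /// Reserves the next free slot for `origin` and returns how long the
    /// caller should wait (counted from `now`) before sending the request.
    fn reserve(&mut self, origin: &str, now: Instant) -> Duration {
        // The slot is either "right now" or the pending slot, if one is
        // still in the future.
        let slot = match self.next_allowed.get(origin) {
            Some(&pending) if pending > now => pending,
            _ => now,
        };
        self.next_allowed
            .insert(origin.to_string(), slot + self.min_interval);
        slot.duration_since(now)
    }
}
```

With a rate of 2.5, three back-to-back reservations for the same origin yield waits of 0 s, 0.4 s and 0.8 s, while a first reservation for another origin is served immediately.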

Also, some people may get angry that you are scraping their website and may start annoying you because of that. If they are crazy enough or money is involved, they may even try to prosecute you. And the judicial system is just crazy nowadays, so who knows?

In either case, I have nothing to do with that. Use this program at your own risk.

Installing the damn thing

If you are feeling particularly lazy today, just copy and paste the following in your favorite command line (Unix-like only):

curl -L "https://github.com/tokahuke/lopez/releases/latest/download/entalator" \
    > /tmp/entalator
chmod +x /tmp/entalator
sudo /tmp/entalator &&
sudo cp /tmp/entalator /usr/share/lopez

You will get the latest Lopez experience, which is lopez installed for all users on your computer with full access to lopez-std out of the box. If you ever wish to get rid of the installation, just use the following one-liner:

sudo /usr/share/lopez/entalator --uninstall

but remember there is no turning back.

This method should work on any Unix-based system; there is an open issue for porting it to Windows. However, with a bit more setup, you can run lopez on most architectures. Compiling from the source code in the repository using Cargo (the Rust package manager) should be quite simple.

Running the damn thing

If you installed from the entalator, you will have the binary lopez available globally on your machine. To get started, run

lopez --help

to get a friendly help dialog. This will list your options while running Lopez. To really get started running lopez, see our Quickstart guide.

Lopez Crawl Directives

You will need a Crawl Directives file to run the crawl. This file describes what you want to scrape from web pages as well as where and how you want to crawl the Web. For more information on the syntax and semantics, see this link. Either way, here is a nice example (yes, syntax highlighting is supported for VSCode!):

[Image: sample code example for Lopez Crawl Directives]

Backends

Lopez supports the idea of backends, which are where the data comes from and goes to. The implementation is completely generic, so you may write your own if you so wish. For now, lopez ships with a nice PostgreSQL backend for your convenience. Contributions adding support for other popular databases (and unpopular ones as well) are greatly appreciated.

For more information on backends, see the documentation for the lib_lopez::backend module.
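To give a feel for what "completely generic" can mean in Rust, here is a minimal sketch of a crawl loop written against a trait instead of a concrete database. Everything in it is an assumption made for illustration: the trait CrawlBackend, the type MemoryBackend and the function drain are hypothetical and are not the actual items of lib_lopez::backend, whose documentation remains the reference.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical backend trait (illustration only, not the lib_lopez API).
trait CrawlBackend {
    /// Queues a URL unless it was seen before. Returns true if it was queued.
    fn enqueue(&mut self, url: &str) -> bool;
    /// Takes the next URL to crawl, if any is left.
    fn next_url(&mut self) -> Option<String>;
    /// Stores whatever was scraped from a page.
    fn store(&mut self, url: &str, data: String);
}

/// Toy implementation that keeps everything in memory.
#[derive(Default)]
struct MemoryBackend {
    seen: HashSet<String>,
    queue: VecDeque<String>,
    results: HashMap<String, String>,
}

impl CrawlBackend for MemoryBackend {
    fn enqueue(&mut self, url: &str) -> bool {
        // `insert` returns false when the URL was already present.
        if self.seen.insert(url.to_string()) {
            self.queue.push_back(url.to_string());
            true
        } else {
            false
        }
    }

    fn next_url(&mut self) -> Option<String> {
        self.queue.pop_front()
    }

    fn store(&mut self, url: &str, data: String) {
        self.results.insert(url.to_string(), data);
    }
}

/// Works with any backend: it only talks to the trait.
/// Returns the number of pages processed.
fn drain<B: CrawlBackend>(backend: &mut B, scrape: impl Fn(&str) -> String) -> usize {
    let mut processed = 0;
    while let Some(url) = backend.next_url() {
        let data = scrape(&url);
        backend.store(&url, data);
        processed += 1;
    }
    processed
}
```

An in-memory backend like this one is only good for experiments, since the queue and the results grow with the crawl. A database-backed implementation of the same trait would keep that state out of RAM.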

Minimum Rust Version

For now, Lopez compiles only on Rust Nightly. Unfortunately, we are waiting on a few features to be stabilized.

Good news: stabilization is due in a few days!

Features

Let's brag a little!

  • The beast is fast in comparison with similar programs I have made in the past using the Python ecosystem (BeautifulSoup, asyncio, etc.). It's in Rust; what were you expecting?

  • It uses very little memory. If crawling is not done correctly, it can gobble up your memory and still ask for more. Using a database (PostgreSQL), all evil is averted!

  • It is polite. Yes, it obeys robots.txt and no, you can't turn that off.
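For readers wondering what obeying robots.txt involves, here is a deliberately simplified Rust sketch of the check a polite crawler performs before fetching a path. It is not the code Lopez uses, and the function name is_allowed is hypothetical. It assumes that only the "User-agent: *" group matters, that rules are plain path prefixes (no wildcards), and that the longest matching rule wins, with Allow winning ties.

```rust
/// Returns true if `path` may be fetched according to the `User-agent: *`
/// group of `robots_txt`. Simplified sketch: prefix rules only.
fn is_allowed(robots_txt: &str, path: &str) -> bool {
    let mut in_star_group = false;
    let mut last_was_agent = false;
    // Best matching rule so far: (rule length, is it an Allow rule?).
    let mut best: Option<(usize, bool)> = None;

    for line in robots_txt.lines() {
        // Drop comments and surrounding whitespace.
        let line = line.split('#').next().unwrap_or("").trim();
        let Some((key, value)) = line.split_once(':') else {
            continue;
        };
        let key = key.trim().to_ascii_lowercase();
        let value = value.trim();

        match key.as_str() {
            "user-agent" => {
                // Consecutive user-agent lines belong to the same group;
                // a user-agent line after rules starts a new group.
                if !last_was_agent {
                    in_star_group = false;
                }
                if value == "*" {
                    in_star_group = true;
                }
                last_was_agent = true;
            }
            "allow" | "disallow" if in_star_group => {
                last_was_agent = false;
                // An empty rule matches nothing.
                if value.is_empty() || !path.starts_with(value) {
                    continue;
                }
                let allow = key == "allow";
                let better = match best {
                    None => true,
                    Some((len, was_allow)) => {
                        value.len() > len || (value.len() == len && allow && !was_allow)
                    }
                };
                if better {
                    best = Some((value.len(), allow));
                }
            }
            _ => last_was_agent = false,
        }
    }

    // No matching rule means the path is allowed.
    best.map_or(true, |(_, allow)| allow)
}
```

A real implementation also has to fetch and cache the file per origin, match its own user agent and handle wildcards, all of which this sketch leaves out.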

Limitations and future plans

  • Lopez is still limited to a single machine. No distributed programming yet. However, what are you scraping that requires so much firepower?

  • No JavaScript execution. This is pretty standard, since JavaScript is heavy to run. If it is ever to be supported, it should be opt-in.

  • This crate needs more docs and more support for other backends. Sorry, I have a full-time job.

  • See the open issues for more scary (and interesting) stuff.

Licensing

All the work in this repository is licensed under the Affero GPLv3 (aka AGPL) license. See the license file for more detailed information.