Releases: tokahuke/lopez

Minor improvements

30 Dec 13:47
70b14d7

Minor improvements:

  • Added a new optional variable MAX_QUOTA which allows you to control the maximum number of crawled pages per process instance, instead of per crawl.
  • Added a new "recently completed" pages entry to the stats of the crawl.
  • Added a new seo.h2s analysis to the seo library, which collects all h2 headers in a page.
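
To illustrate what "collecting all h2 headers in a page" amounts to, here is a minimal Python sketch using only the standard library. It is not how seo.h2s is implemented in Lopez; the class and function names are made up for this example.

```python
from html.parser import HTMLParser


class H2Collector(HTMLParser):
    """Collects the text of every <h2> element (illustrative, not Lopez code)."""

    def __init__(self):
        super().__init__()
        self.h2s = []          # finished header texts, in document order
        self._inside = False   # True while the parser is inside an <h2>
        self._buffer = []      # text fragments of the current header

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._inside = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "h2" and self._inside:
            self._inside = False
            self.h2s.append("".join(self._buffer).strip())

    def handle_data(self, data):
        # Text of nested inline tags (e.g. <em>) is kept as part of the header.
        if self._inside:
            self._buffer.append(data)


def collect_h2s(html):
    """Return the list of h2 header texts found in an HTML string."""
    parser = H2Collector()
    parser.feed(html)
    parser.close()
    return parser.h2s
```

Calling collect_h2s on a page's HTML returns one string per h2 header, in the order they appear.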

Typed Analyses

08 Oct 00:34

This is a kind-of-breaking release, but you should be fine, most probably. With this release, all analyses have their type logged in the backend. All existing analyses are marked as any. This is in preparation for proper querying over the crawl results in a way that is independent of the backend implementation.

Plus, more transformers were implemented. However, they are not yet documented.

Remove waves from the command line directly

23 Sep 23:30

This is a minor release that allows you to remove waves from the command line directly, without having to write that single SQL line yourself. Now, the Backend trait handles that for you. Its default implementation refuses to delete anything, so existing backends keep compiling, which is what allows this version to be a minor one.
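
The "default implementation" idea can be sketched in Python as a base class whose method refuses by default. This is only an analogy to the Rust trait: the method name remove_wave and both subclasses are hypothetical and do not come from the Lopez codebase.

```python
class Backend:
    """Hypothetical stand-in for the Backend trait (names are illustrative)."""

    def remove_wave(self, wave_name):
        # Default behavior: deletion is not supported by this backend.
        raise NotImplementedError(f"this backend cannot remove wave {wave_name!r}")


class LegacyBackend(Backend):
    """Written before remove_wave existed; it still works unchanged."""


class InMemoryBackend(Backend):
    """A backend that opts in by overriding the default."""

    def __init__(self):
        self.waves = {"2020-09": ["https://example.com/"]}

    def remove_wave(self, wave_name):
        # Returns True if the wave existed and was removed, False otherwise.
        return self.waves.pop(wave_name, None) is not None
```

Because LegacyBackend inherits the refusing default, adding the new method does not force any change on it, which is the property that keeps such a release non-breaking.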

Logging improvements

22 Sep 02:32
  • Now, you can use --verbose if you really want to be annoyed by logs.
  • Dropping support for Jemalloc in MUSL (in preparation for a Docker image using Alpine). Not a major break, since Jemalloc is notoriously incompatible with MUSL (I couldn't get it to compile, and rumor has it that segfaults abound).

Small, but breaking, bugfix

19 Sep 13:41

Now, the exit status of lopez reflects the outcome of the run. Before, it would exit with 0 (ok) even if, e.g., there were errors in the crawl directives.
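
As a rough illustration of the fixed behavior, the sketch below maps the outcome of a run to a process exit status. The function run_crawl and its directive_errors parameter are hypothetical; they are not part of lopez.

```python
import sys


def run_crawl(directive_errors):
    """Hypothetical entry point: returns the status the process should exit with."""
    if directive_errors:
        for error in directive_errors:
            print(f"error: {error}", file=sys.stderr)
        return 1  # non-zero tells the shell or CI that the run failed
    return 0  # zero means success


# A real command-line entry point would hand the status to the operating
# system, for example with: sys.exit(run_crawl(errors))
```

With this shape, a script that chains commands with && stops as soon as the crawl directives fail to validate.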

v0.4.0

04 Sep 02:08

In this release, there are a bunch of goodies:

  • Improvements to memory usage (profiled using heaptrack):
    • Goodbye, OpenSSL, you buggy bastard! Stop leaking my memory. Using rustls now and memory consumption has reasonably improved. Since nobody is going to be putting their bank's password into lopez, it is actually very ok to use rustls, despite its lack of maturity.
    • Goodbye, reqwest: let's centralize all open sockets in the crawler itself. With reqwest, I can't do that and therefore I waste memory. The code got clunkier, but nothing that a good refactoring can't fix.
  • Pretty printing reports, error messages (compile messages a little less awful) and overall less scandalous logging.
  • More regex transformers: matches and replace ... with ....
  • You can now disable PageRank in the crawl using the enable_page_rank set-variable.
  • And you can ask to run only the PageRank with the lopez page-rank sub-command.

v0.3.1

31 Aug 01:56

This is a minor release to fix some bugs and insert one final feature which didn't make 0.3.0:

  • There was a bug in the implementation of count(extractor) which caused Lopez to panic.
  • The numeric transformers were not handling the space between the tag (e.g. the word equals) and the accompanying number.
  • You now have the !explode pseudo-transformer. It works similarly to flatten, but sends the elements of an array to the aggregator one at a time, instead of sending the whole array at once. This allows you to do this, for example:
select * {
    class-frequency: group(classes !explode, count);
}

The above rule counts how many times each class appears in a webpage. Since classes returns an array, grouping by it directly would be a type error, as only strings can be keys of a JSON map. However, with !explode, the grouping is done over each element of the classes array, and each element is a string.
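
The semantics described above can be mimicked in a few lines of Python. This is a sketch of the idea only, not of Lopez's aggregator code; group_count and its parameters are names invented for the example.

```python
from collections import Counter


def group_count(elements_classes, explode):
    """Count grouping keys across selected elements.

    elements_classes: one list of class names per selected element.
    explode: if True, each class name is sent to the aggregator on its own.
    """
    counts = Counter()
    for classes in elements_classes:
        if explode:
            counts.update(classes)  # every string becomes its own grouping key
        else:
            # The whole array would have to be the key, and JSON map keys
            # must be strings, so this is rejected as a type error.
            raise TypeError("group key must be a string, got an array")
    return dict(counts)
```

For two elements with classes ["btn", "primary"] and ["btn"], the exploded version yields a count of 2 for "btn" and 1 for "primary".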

  • You can now use filter and each with maps. Both transformers will operate on the keys of the input map.

v0.3.0

30 Aug 18:42

Note: due to some bugs in the implementation, this release is yanked. Please use a more up-to-date version.

This is another huge release in which much has changed. Here are some key points:

  • Set-variables were introduced to make the number of environment variables smaller. Now, you can configure Lopez directly in your .lcd file. Of course "CLI-ish" things are kept in the CLI. You can read more about set variables in this wiki article.
  • We changed the syntax of selection rules to something much more expressive, introducing extractors, aggregators and transformers in order to avoid post-processing. This is not backwards compatible at all, though. You can read more about these subjects here.
  • We now have a small, but functional wiki, where you can get started using Lopez.
  • Besides validating your configurations, you can now test them against a supplied URL.
  • There was a bug in the way full rule names were rendered in the database: they were all prepended with a dot. This dot does not appear anymore, but any crawls started with 0.2 will break when continued with 0.3.

Better than 0.1, for sure

01 Sep 16:26

Changes include:

  • Jemalloc is back: I tested and it seems to reduce memory consumption.
  • postgres-lopez now has views to reduce the number of annoying joins in your life. They are the named_* views, which replace id-like fields with their corresponding values.
  • PageRank is now:
    1. Calculated over all pages, canonical or otherwise.
    2. Calculated only over links between closed pages.
  • lopez-std now has some more modules:
    1. frontiers: common places that you may crawl, but don't want to go further (e.g., Wikipedia, Google, WebArchive, etc.)
    2. ignore-tracking: common tracking parameters meant for humans, which can be safely ignored by robot-kind.
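
To make the PageRank change above more concrete, here is a small power-iteration sketch in Python that only keeps links whose two endpoints are closed pages. It is a textbook formulation written for illustration; the damping factor, the iteration count and the handling of pages without outgoing links are assumptions, not details taken from Lopez.

```python
def page_rank(links, closed, damping=0.85, iterations=50):
    """Power-iteration PageRank restricted to links between closed pages.

    links: iterable of (source, target) pairs.
    closed: set of pages whose crawl has finished.
    """
    pages = sorted(closed)
    if not pages:
        return {}

    # Keep only the edges whose source and target are both closed.
    out_links = {page: [] for page in pages}
    for source, target in links:
        if source in closed and target in closed:
            out_links[source].append(target)

    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(rank[page] for page in pages if not out_links[page])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {page: base for page in pages}
        for source, targets in out_links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

Pages that are still open never appear in the result, and a link pointing to an open page is simply ignored, so the ranks of the closed pages always sum to 1.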