Releases · tokahuke/lopez

30 Dec 13:47

tokahuke

v0.6.1

70b14d7

Minor improvements (Latest)

Minor improvements:

Added a new optional variable MAX_QUOTA which allows you to control the maximum number of crawled pages per process instance, instead of per crawl.
Added a new "recently completed" pages entry to the stats of the crawl.
Added a new seo.h2s analysis to the seo library, which collects all h2 headers in a page.
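To illustrate the difference between a per-crawl quota and the per-process cap that MAX_QUOTA adds, here is a minimal sketch. It is not Lopez's actual code: the function names, the way the raw value is passed in, and the choice to treat a missing or unparsable value as "no limit" are all assumptions made for the example.

```rust
/// Parses the raw text of MAX_QUOTA (hypothetical helper).
/// Assumption: an unset or unparsable value means "no per-process limit".
fn parse_max_quota(raw: Option<&str>) -> Option<usize> {
    raw.and_then(|value| value.trim().parse::<usize>().ok())
}

/// Number of pages this process instance may still crawl (hypothetical helper).
///
/// * `crawl_remaining`: pages left in the quota of the whole crawl.
/// * `crawled_by_process`: pages already crawled by this process instance.
/// * `max_quota`: the optional per-process cap.
fn pages_allowed(
    crawl_remaining: usize,
    crawled_by_process: usize,
    max_quota: Option<usize>,
) -> usize {
    match max_quota {
        // The process stops at whichever limit is reached first.
        Some(limit) => crawl_remaining.min(limit.saturating_sub(crawled_by_process)),
        // Without a cap, only the crawl-wide quota applies.
        None => crawl_remaining,
    }
}
```

With a crawl that has 1000 pages left and a per-process cap of 500, a process that has already crawled 450 pages would get 50 more under this sketch, and the crawl could then be continued by another process instance.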

08 Oct 00:34

tokahuke

v0.6.0

54ecaec

Typed Analyses

This is a kind-of-breaking release, but you should be fine, most probably. With this release, all analyses have their type logged in the backend. All existing analyses are marked as any. This is in preparation for proper querying over the crawl results in a way that is independent of the backend implementation.

Plus, more transformers were implemented. However, they are not yet documented.

23 Sep 23:30

tokahuke

v0.5.2

088e699

Remove waves from the command line directly

This is a minor release that allows you to remove waves from the command line directly, without having to code that single SQL line. Now, the Backend trait handles that for you. Its default implementation refuses to delete anything, which keeps existing backends compiling and is why this release can stay minor.
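The "default implementation" trick can be sketched as follows. This is a simplified stand-in, not the real Backend trait: the method name `remove_wave`, its signature, and both example backends are hypothetical.

```rust
/// Hypothetical, simplified stand-in for the `Backend` trait.
trait Backend {
    /// Default implementation: refuse to delete anything.
    /// Backends written before this method existed still compile unchanged.
    fn remove_wave(&mut self, wave_name: &str) -> Result<(), String> {
        Err(format!("this backend cannot remove wave `{}`", wave_name))
    }
}

/// A backend that does not override the method keeps the safe default.
struct ReadOnlyBackend;

impl Backend for ReadOnlyBackend {}

/// A backend that opts in by overriding the method.
struct InMemoryBackend {
    waves: Vec<String>,
}

impl Backend for InMemoryBackend {
    fn remove_wave(&mut self, wave_name: &str) -> Result<(), String> {
        let before = self.waves.len();
        // Drop every stored wave with a matching name.
        self.waves.retain(|wave| wave != wave_name);
        if self.waves.len() < before {
            Ok(())
        } else {
            Err(format!("no such wave: `{}`", wave_name))
        }
    }
}
```

Because the new method has a body, adding it to the trait does not break existing implementors, which is what makes this kind of change non-breaking.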

22 Sep 02:32

tokahuke

v0.5.1

a9476e4

Logging improvements

Now, you can use --verbose if you really want to be annoyed by logs.
Dropping support for Jemalloc in MUSL (in preparation for a Docker image using Alpine). Not a major break, since Jemalloc is notoriously incompatible with MUSL (I couldn't get it to compile, and rumors say segfaults abound).

19 Sep 13:41

tokahuke

v0.5.0

69487aa

Small, but breaking, bugfix

Now, the exit status of lopez reflects the outcome of the process. Before, it would exit with 0 (ok) even if, e.g., there were errors in the crawl directives.
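A minimal sketch of the idea, using only the Rust standard library. The function names and the use of 1 as the failure status are assumptions for illustration; the notes do not say which non-zero value lopez actually returns.

```rust
use std::process::ExitCode;

/// Maps the outcome of a run to a numeric exit status (hypothetical helper).
/// Assumption: any error is reported as 1.
fn status_for(outcome: &Result<(), String>) -> u8 {
    match outcome {
        Ok(()) => 0,
        Err(_) => 1,
    }
}

/// Wraps the numeric status in the standard library's `ExitCode`.
fn to_exit_code(outcome: &Result<(), String>) -> ExitCode {
    ExitCode::from(status_for(outcome))
}
```

In a binary, `fn main() -> ExitCode` could return `to_exit_code(&outcome)` directly, so that shell scripts and CI jobs see a failure when the crawl directives are invalid.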

04 Sep 02:08

tokahuke

v0.4.0

cd7bb04

v0.4.0

In this release, there are a bunch of goodies:

Improvements to memory usage (profiled using heaptrack):
- Goodbye, OpenSSL, you buggy bastard! Stop leaking my memory. Using rustls now and memory consumption has reasonably improved. Since nobody is going to be putting their bank's password into lopez, it is actually very ok to use rustls, despite its lack of maturity.
- Goodbye, Reqwest: let's centralize all open sockets in the crawler itself. With Reqwest, I can't do that and therefore I waste memory. The code got clunkier, but nothing that a good refactoring can't fix.
Pretty printing reports, error messages (compile messages a little less awful) and overall less scandalous logging.
More regex transformers: matches and replace ... with ....
You can now disable PageRank in the crawl using the enable_page_rank set-variable.
And you can ask to run only the PageRank with the lopez page-rank sub-command.

31 Aug 01:56

tokahuke

v0.3.1

f55588f

v0.3.1

This is a minor release to fix some bugs and insert one final feature which didn't make 0.3.0:

There was a bug in the implementation of count(extractor) which caused Lopez to panic.
The numeric transformers were not handling the space between the tag (e.g. the word equals) and the accompanying number.
You now have the !explode pseudo-transformer. It works similarly to flatten, but allows you to send the contents of an array one at a time, instead of sending the whole array to the aggregator. This allows you to do this, for example:

select * {
    class-frequency: group(classes !explode, count);
}

The above rule counts how many times each class appears in a webpage. Since classes returns an array, this would be a type error, as only strings can be the key of a JSON map. However, with !explode, the group is done over each element of the classes array, which is a string.

You can now use filter and each with maps. Both transformers will operate on the keys of the input map.
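To make the !explode behaviour in the class-frequency rule concrete, here is a rough plain-Rust equivalent of "explode, then group and count". It only models the semantics described in these notes; the function name and the input shape (one array of class names per selected element) are assumptions, and this is not how Lopez implements it.

```rust
use std::collections::BTreeMap;

/// Counts how often each class name occurs across all selected elements.
/// Each inner `Vec` models the array that `classes` returns for one element.
fn class_frequency(elements: &[Vec<&str>]) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for classes in elements {
        // "Explode": hand each array item to the aggregator individually,
        // so the grouping key is a string rather than a whole array.
        for class in classes {
            *counts.entry((*class).to_string()).or_insert(0) += 1;
        }
    }
    counts
}
```

For two elements with classes `["btn", "primary"]` and `["btn"]`, the resulting map holds a count of 2 for `btn` and 1 for `primary`.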

30 Aug 18:42

tokahuke

v0.3.0

54484d8

v0.3.0

Note: due to some bugs in the implementation, this release is yanked. Please use a more up-to-date version.

This is another huge release in which much has changed. Here are some key points:

Set-variables were introduced to make the number of environment variables smaller. Now, you can configure Lopez directly in your .lcd file. Of course "CLI-ish" things are kept in the CLI. You can read more about set variables in this wiki article.
We changed the syntax of selection rules to something much more expressive, introducing extractors, aggregators and transformers in order to avoid post-processing. This is not backwards compatible at all, though. You can read more about these subjects here.
We now have a small, but functional wiki, where you can get started using Lopez.
Besides validating your configurations, you can now test them against a supplied URL.
There was a bug with the way full rule names were rendered in the database: they were all prepended with a dot. This dot does not appear anymore, but any crawls started with 0.2 will break when continued with 0.3.

01 Sep 16:26

tokahuke

v0.2.0

93dbf33

Better than 0.1, for sure

Changes include:

Jemalloc is back: I tested and it seems to reduce memory consumption.
postgres-lopez now has views to reduce the number of annoying joins in your life. They are the named_* views, which replace id-like fields with their corresponding values.
PageRank is now:
1. Calculated over all pages, canonical or otherwise.
2. Calculated only over links between closed pages.
lopez-std now has some more modules:
1. frontiers: common places that you may crawl, but don't want to go further (e.g., Wikipedia, Google, WebArchive, etc.)
2. ignore-tracking: common tracking parameters meant for humans, which can be safely ignored by robot-kind.
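As an illustration of what ignoring tracking parameters amounts to, the sketch below strips a few well-known ones from a URL using only the standard library. The parameter list and the function name are assumptions; the actual contents of the ignore-tracking module are not listed in these notes.

```rust
/// Hypothetical sample of tracking parameters (not the module's real list).
const TRACKING_PARAMS: [&str; 5] =
    ["utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"];

/// Removes tracking parameters from the query string of `url`,
/// leaving the path, the other parameters and the fragment untouched.
fn strip_tracking(url: &str) -> String {
    // Split off the fragment first so it can be re-attached unchanged.
    let (without_fragment, fragment) = match url.split_once('#') {
        Some((head, frag)) => (head, Some(frag)),
        None => (url, None),
    };
    let (base, query) = match without_fragment.split_once('?') {
        Some((base, query)) => (base, query),
        None => (without_fragment, ""),
    };
    // Keep only the `key=value` pairs whose key is not a tracking parameter.
    let kept: Vec<&str> = query
        .split('&')
        .filter(|pair| !pair.is_empty())
        .filter(|pair| {
            let key = pair.split('=').next().unwrap_or("");
            !TRACKING_PARAMS.contains(&key)
        })
        .collect();
    let mut out = String::from(base);
    if !kept.is_empty() {
        out.push('?');
        out.push_str(&kept.join("&"));
    }
    if let Some(frag) = fragment {
        out.push('#');
        out.push_str(frag);
    }
    out
}
```

Normalizing URLs this way means that two links differing only in their tracking parameters resolve to the same page, so a crawler does not visit it twice.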
