Releases: tokahuke/lopez
Minor improvements
Minor improvements:
- Added a new optional variable MAX_QUOTA, which allows you to control the maximum number of crawled pages per process instance, instead of per crawl.
- Added a new "recently completed" pages entry to the stats of the crawl.
- Added a new seo.h2s analysis to the seo library, which collects all h2 headers in a page.
Typed Analyses
This is a kind-of-breaking release, but you should be fine, most probably. With this release, all analyses have their type logged in the backend. All existing analyses are marked as any. This is in preparation for proper querying over the crawl results in a way that is independent of the backend implementation.
Plus, more transformers were implemented. However, they are not yet documented.
Remove waves from the command line directly
This is a minor release that allows you to remove waves from the command line directly, without having to code that single SQL line. Now, the Backend trait handles that for you (with a default impl. of not allowing you to delete anything; this allows the current version to be minor).
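The default-implementation trick mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the trait below is a simplified stand-in, and `remove_wave`, `BackendError`, `LegacyBackend` and `InMemoryBackend` are hypothetical names, not Lopez's actual API. It shows why adding a method with a default body does not break existing implementors.

```rust
/// Hypothetical error type (an assumption, not Lopez's real error).
#[derive(Debug, PartialEq)]
pub enum BackendError {
    Unsupported(&'static str),
}

/// Simplified, hypothetical stand-in for the `Backend` trait.
pub trait Backend {
    fn name(&self) -> &str;

    /// New method with a default body: by default, nothing can be deleted.
    /// Implementors written before this method existed still compile.
    fn remove_wave(&mut self, _wave: &str) -> Result<(), BackendError> {
        Err(BackendError::Unsupported("remove_wave"))
    }
}

/// A backend that predates the new method and overrides nothing.
pub struct LegacyBackend;

impl Backend for LegacyBackend {
    fn name(&self) -> &str {
        "legacy"
    }
}

/// A backend that opts in to wave removal.
pub struct InMemoryBackend {
    pub waves: Vec<String>,
}

impl Backend for InMemoryBackend {
    fn name(&self) -> &str {
        "in-memory"
    }

    fn remove_wave(&mut self, wave: &str) -> Result<(), BackendError> {
        // Drop every stored wave whose name matches.
        self.waves.retain(|w| w.as_str() != wave);
        Ok(())
    }
}
```

Because the default body returns an error instead of deleting anything, a backend that does not override the method simply refuses the operation.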
Logging improvements
- Now, you can use --verbose if you really want to be annoyed by logs.
- Dropping support for Jemalloc in MUSL (in preparation for a Docker image using Alpine). Not a major break, since Jemalloc is notoriously incompatible with MUSL (couldn't get it to compile, and rumors say segfaults abound).
Small, but breaking, bugfix
Now, the exit status of lopez reflects the outcome of the process. Before, it would answer with a 0 (ok) even if, e.g., errors in the crawl directives were present.
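As a rough illustration of the behavior described above, the sketch below maps the outcome of a run to a process exit status. `RunError` and `exit_status` are hypothetical names, and the non-zero value 1 is an assumption; the release note only says that failures no longer report 0.

```rust
/// Hypothetical outcome of a run; the variants are illustrative only.
#[derive(Debug)]
pub enum RunError {
    InvalidDirectives(String),
    CrawlFailed(String),
}

/// Maps the outcome of the run to an exit status:
/// 0 on success, non-zero (here, 1) on any error.
pub fn exit_status(outcome: &Result<(), RunError>) -> i32 {
    match outcome {
        Ok(()) => 0,
        Err(_) => 1,
    }
}

// A binary would typically end `main` with something like:
//     std::process::exit(exit_status(&outcome));
```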
v0.4.0
In this release, there are a bunch of goodies:
- Improvements to memory usage (profiled using heaptrack):
  - Goodbye, OpenSSL, you buggy bastard! Stop leaking my memory. Using rustls now and memory consumption has reasonably improved. Since nobody is going to be putting their bank's password into lopez, it is actually very ok to use rustls, despite its lack of maturity.
  - Goodbye, Reqwest: let's centralize all open sockets in the crawler itself. With Reqwest, I can't do that and therefore I waste memory. The code got clunkier, but nothing that a good refactoring does not help.
- Pretty printing reports, error messages (compile messages a little less awful) and overall less scandalous logging.
- More regex transformers: matches and replace ... with ....
- You can now disable PageRank in the crawl using the enable_page_rank set-variable.
- And you can ask to run only the PageRank with the lopez page-rank sub-command.
v0.3.1
This is a minor release to fix some bugs and insert one final feature which didn't make 0.3.0:
- There was a bug in the implementation of count(extractor) which caused Lopez to panic.
- The numeric transformers were not handling the space between the tag (e.g. the word equals) and the accompanying number.
- You now have the !explode pseudo-transformer. It works similarly to flatten, but allows you to send the contents of an array one at a time, instead of sending the whole array to the aggregator. This allows you to do this, for example:
select * {
class-frequency: group(classes !explode, count);
}
The above rule counts how many times each class appears in a webpage. Since classes returns an array, grouping by it directly would be a type error, as only strings can be the key of a JSON map. However, with !explode, the group is done over each element of the classes array, which is a string.
- You can now use filter and each with maps. Both transformers will operate on the keys of the input map.
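To make the semantics of the class-frequency rule above more concrete, here is a small sketch of what grouping with !explode amounts to. It is not Lopez's implementation; `group_exploded_count` is a hypothetical helper, and the input models the classes array of each selected element as a `Vec<&str>`.

```rust
use std::collections::BTreeMap;

/// Counts how many times each class appears across all selected elements.
/// Each inner `Vec` plays the role of the `classes` array of one element.
pub fn group_exploded_count(classes_per_element: &[Vec<&str>]) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for classes in classes_per_element {
        // The "explode" step: feed each array item to the aggregator on its own,
        // so the grouping key is a string rather than a whole array.
        for class in classes {
            *counts.entry((*class).to_string()).or_insert(0) += 1;
        }
    }
    counts
}
```

Without the inner loop, the grouping key would be the array itself, which is the type error the release note describes.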
v0.3.0
Note: due to some bugs in the implementation, this release is yanked. Please use a more up-to-date version.
This is another huge release in which much has changed. Here are some key points:
- Set-variables were introduced to make the number of environment variables smaller. Now, you can configure Lopez directly in your .lcd file. Of course, "CLI-ish" things are kept in the CLI. You can read more about set-variables in this wiki article.
- We changed the syntax of selection rules to something much more expressive, introducing extractors, aggregators and transformers in order to avoid post-processing. This is not backwards compatible at all, though. You can read more about these subjects here.
- We now have a small, but functional wiki, where you can get started using Lopez.
- Besides validating your configurations, you can now test them against a supplied URL.
- There was a bug with the way full rule names were rendered in the database: they were all prepended with a dot. This dot does not appear anymore, but any crawls started with 0.2 will break when continued with 0.3.
Better than 0.1, for sure
Changes include:
- Jemalloc is back: I tested and it seems to reduce memory consumption.
- postgres-lopez now has views to reduce the number of annoying joins in your life. They are the named_* views and substitute id-like fields by their corresponding values.
- PageRank is now
  - Calculated over all pages, canonical or otherwise.
  - Calculated only over links between closed pages.
- lopez-std now has some more modules:
  - frontiers: common places that you may crawl, but don't want to go further (e.g., Wikipedia, Google, WebArchive, etc...)
  - ignore-tracking: common tracking parameters meant for humans, which can be safely ignored by robot-kind.
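The idea behind ignoring tracking parameters can be sketched as a small URL clean-up step. This is a minimal illustration, not the contents of lopez-std: `strip_tracking` is a hypothetical helper, the parameter list is an assumed sample of widely used tracking keys, and URL fragments are not handled.

```rust
/// Illustrative sample of tracking keys (an assumption, not lopez-std's list).
const TRACKING_PARAMS: &[&str] = &["utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"];

/// Removes known tracking parameters from the query string of `url`.
/// Assumes the URL has no fragment part.
pub fn strip_tracking(url: &str) -> String {
    // URLs without a query string are returned unchanged.
    let (base, query) = match url.split_once('?') {
        Some(parts) => parts,
        None => return url.to_string(),
    };

    // Keep only the key=value pairs whose key is not a tracking parameter.
    let kept: Vec<&str> = query
        .split('&')
        .filter(|pair| {
            let key = pair.split('=').next().unwrap_or("");
            !key.is_empty() && !TRACKING_PARAMS.contains(&key)
        })
        .collect();

    if kept.is_empty() {
        base.to_string()
    } else {
        format!("{}?{}", base, kept.join("&"))
    }
}
```

Two URLs that differ only in such parameters then normalize to the same address, so a crawler does not visit the same page twice.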