We've all written code that fetches content from the internets, or accepts user posts from the internets, and in next to no time you have a database full of trash.
I'd like to crawl, scrape and generally import stuff from the tubes knowing that I have some kind of garbage filter on my user agent... something that can read [Adblock Plus ...] files and filter content based on those rules.
After a quick dive in the CPAN swamp I found WWW::Adblock and App::DNS::Adblock - sadly both require that I have the files on disk.
step one: fetch a pile of blacklists, and keep them up to date.
After reading a couple of threads I discovered that filterlists.com and pi-hole have their full lists on github (filterlist, pihole) which saves me the trouble of searching/scraping.
These lists overlap a lot and will require de-duping, but that's a whole other thing.
Rough plan:
- read a list of files to mirror
- check with the internet to find out what's changed
- write modifications to disk
- wait some time and do it all again
.yaml seems fine for a config file, so let's just make a list of files we're interested in and call it list_sync.yaml:
# The file contains a yaml document per source file,
# ---
# name: <something pretty> - a name, mostly for logs
# disabled: "reason" - skip this rule
# path: <path-on-disk> - where should we save it
# url: <place-on-internet> - where should I get it from?
---
name: ricklist
path: index/adblock/ricklist.txt
url: http://rickrolldb.com/ricklist.txt
---
url: https://raw.githubusercontent.com/MajkiIT/polish-ads-filter/master/polish-adblock-filters/adblock.txt
path: index/adblock/polish-adblock-filters-adblock.txt
It's super easy to just grab the config with LoadFile and ignore the content, but it's also nice to do some validation and set some defaults for the optional values. The validation could be more aggressive, but I don't really need a type system for this one.
Let's get started on this script, and call it list_sync:
#! /usr/local/bin/perl
#ABSTRACT: keep a directory of files up to date
use warnings; use strict;
use AnyEvent;
use FindBin;
use File::Basename qw[ basename ];
use lib "$FindBin::Bin/../lib";
my @CONFIG_PATH=("$FindBin::Bin");
my $CONFIG_NAME = basename($0);
my ($CONFIG_FILE) = grep -e, map "$_/$CONFIG_NAME.yaml", @CONFIG_PATH
or AE::log fatal => "No $CONFIG_NAME.yaml found in ".join ", ", @CONFIG_PATH;
AE::log info => "Will read sync rules from: $CONFIG_FILE";
my @sync_rules;
{
use YAML qw[ LoadFile DumpFile Dump ];
my $normalise = sub {
my $name= $_->{name} || $_->{path} || $_->{url} || Dump($_);
# without a url we don't have anything to mirror.
AE::log info => q[The entry "%s" is missing a url, so it'll be ignored], $name
and return unless exists $_->{url};
AE::log info => q[The entry "%s" is missing a destination path, so it'll be ignored.], $name
and return unless exists $_->{path};
AE::log debug => q[The entry "%s" is disabled.], $name
and return if exists $_->{disabled};
$_->{name} ||= $_->{url};
$_->{statefile} ||= $_->{path} . ".$CONFIG_NAME";
$_
};
my @unvalidated;
eval {@unvalidated=LoadFile $CONFIG_FILE;1}
or AE::log fatal => "Can't read $CONFIG_FILE: $@";
my @normal = map $normalise->(), @unvalidated;
AE::log note => " ... file list contained %s, I kept %s. (see above)",
0+@unvalidated, 0+@normal;
AE::log fatal => "There were no rules left after validation"
unless @sync_rules = @normal;
AE::log debug => sub {sprintf "%s=[%s] %s -> %s", @{ $_ }{qw[ type name url path ]} }
for @sync_rules
}
So we have a list of urls printed out, and the content of the config file looks like it will be enough to serve us even after we decide to add type: and all sorts of crazy other keys to the config.
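For example, a future entry might grow into something like this (type: is already peeked at by the debug logging line above; refresh: is purely hypothetical):
---
name: ricklist
type: adblock    # how we might parse/de-dupe it later
refresh: 3600    # hypothetical: per-list polling interval in seconds
path: index/adblock/ricklist.txt
url: http://rickrolldb.com/ricklist.txt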
Now we have a nice list of files to fetch, let's get set up to do some http requests and grab them:
use AnyEvent::HTTP;
my $loop= AnyEvent->condvar;
for my $rule (@sync_rules) {
http_request GET => $rule->{url},
headers => {
# Pass in last-modified and ETag from previous request
# so we don't use all the bandwidth.
},
on_body => sub {my ($partial_body, $headers) = @_;
# We can rudely hang up after the first chunk if we see content that
# indicates the file hasn't changed by returning 0
;1},
sub { my($data, $headers) = @_;
AE::log debug => Dump( [$data, $headers] );
$loop->send; # exit after the first response
}
}
$loop->recv
So, if we run this with debugging turned up all the way:
$ PERL_ANYEVENT_VERBOSE=9 perl list_sync
2017-08-06 13:06:49.000000 +0200 debug AnyEvent::Util: Using Guard module to implement guards.
2017-08-06 13:06:49.000000 +0200 info main: Will read sync rules from: list_sync.yaml
2017-08-06 13:06:49.000000 +0200 debug main: The entry "FilterList" is disabled.
2017-08-06 13:06:49.000000 +0200 debug main: The entry "pihole" is disabled.
2017-08-06 13:06:49.000000 +0200 debug main: The entry "280blocker for japanese mobile site" is disabled.
2017-08-06 13:06:49.000000 +0200 debug main: The entry "index/adblock/polish-adblock-filters-adblock.txt" is disabled.
2017-08-06 13:06:49.000000 +0200 info main: The entry "https://pastebin.com/raw/kDqbYwWr" is missing a destination path, so it'll be ignored.
2017-08-06 13:06:49.000000 +0200 note main: ... file list contained 6, I kept 1. (see above)
2017-08-06 13:06:49.000000 +0200 debug main: adblock=[ricklist] http://rickrolldb.com/ricklist.txt -> index/adblock/ricklist.txt
So there's all our config validation stuff, in super high verbosity. It's all buffered up so it all appears in the log at the same time.
AE::log defaults to only showing fatal, alert, crit, and error level messages, so almost none of this will turn up in your logs by default. I normally set PERL_ANYEVENT_VERBOSE=info in my login profile so I get the more juicy stuff while running scripts.
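Setting that up is a one-liner in a bourne-style shell profile:
export PERL_ANYEVENT_VERBOSE=info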
Moving on:
2017-08-06 13:06:49.000000 +0200 info AnyEvent: Autoloaded model 'AnyEvent::Impl::EV', using it.
2017-08-06 13:06:49.178216 +0200 info AnyEvent::IO: Autoloaded IO model 'Perl', using it.
2017-08-06 13:06:49.330338 +0200 debug AnyEvent::Socket: (re)loading /etc/hosts.
2017-08-06 13:06:49.842046 +0200 debug main: ---
2017-08-06 13:06:49.842046 +0200 + - ''
2017-08-06 13:06:49.842046 +0200 + - HTTPVersion: 1.1
2017-08-06 13:06:49.842046 +0200 + Reason: OK
2017-08-06 13:06:49.842046 +0200 + Status: 200
2017-08-06 13:06:49.842046 +0200 + URL: http://rickrolldb.com/ricklist.txt
2017-08-06 13:06:49.842046 +0200 + accept-ranges: bytes
2017-08-06 13:06:49.842046 +0200 + connection: keep-alive
2017-08-06 13:06:49.842046 +0200 + content-length: 10726
2017-08-06 13:06:49.842046 +0200 + content-type: text/plain
2017-08-06 13:06:49.842046 +0200 + date: 'Sun, 06 Aug 2017 11:06:49 GMT'
2017-08-06 13:06:49.842046 +0200 + last-modified: 'Sun, 06 Aug 2017 11:00:02 GMT'
2017-08-06 13:06:49.842046 +0200 + server: nginx/1.0.2
2017-08-06 13:06:49.842046 +0200 + vary: Accept-Encoding
We can see here that we got '' as $data, because we passed an (empty) on_body callback. Since we're starting with ricklist.txt, let's update our on_body to be more helpful:
my $ricks = 0;    # declared just before the http_request call
on_body => sub {my ($partial_body, $headers) = @_;
AE::log warn => "Yep, this is the ricklist.txt! " . ($ricks++)
if $partial_body =~ m{! Rickroll blacklist, http://rickrolldb.com};
return 1
},
We'll see the same thing as before, with one addition:
2017-08-06 13:35:13.800477 +0200 warn main: Yep, this is the ricklist.txt! 0
The callback was actually run several times with different-sized blocks of content; the zero at the end tells us that the text we were looking for was in the first block.
Fiddling a bit with the callback and logging what we get, we see that $partial_body can even be split in the middle of lines; the length changes, but the callback is called pretty frequently. We can use this callback to stream large lists to disk without having to slurp the whole thing into memory. Neat.
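Here's the kind of throwaway callback I mean, a sketch that just logs each chunk as it arrives (drop it into the http_request from above):
on_body => sub { my ($partial_body, $headers) = @_;
    AE::log info => "chunk of %s bytes", length $partial_body;
    1    # returning true keeps the download going
},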
We only want it if-modified-since...
We want to be good internet citizens, and part of doing that is allowing http servers to say "There's nothing to see here, move along." When speaking http this comes in the form of an HTTP 304 Not Modified response.
To do this we need to do a little bit of extra bookkeeping on our end, but in return we get back zero-length responses when there's no change to the file. This means we can spend our bandwidth on other stuff, which is part of the point of blocking ads in the first place.
So, the interesting header from the response above is:
2017-08-06 14:06:49.842046 +0200 + last-modified: 'Sun, 06 Aug 2017 12:00:02 GMT'
So let's give this a try. We'll just set If-Modified-Since to the value we saw above:
http_request GET => $rule->{url},
headers => {
'If-Modified-Since' => 'Sun, 06 Aug 2017 12:00:02 GMT'
},
on_body => sub {my ($partial_body, $headers) = @_; ;1},
sub { my($data, $headers) = @_;
AE::log debug => Dump( [$data, $headers] );
}
and we see:
2017-08-06 14:15:04.301523 +0200 + Reason: Not Modified
2017-08-06 14:15:04.301523 +0200 + Status: 304
...
2017-08-06 14:15:04.301523 +0200 + date: 'Sun, 06 Aug 2017 12:15:04 GMT'
2017-08-06 14:15:04.301523 +0200 + last-modified: 'Sun, 06 Aug 2017 12:00:02 GMT'
^C
but if you use the value from earlier:
'If-Modified-Since' => 'Sun, 06 Aug 2017 11:00:02 GMT'
You'll be sent the whole body again.
2017-08-06 14:20:11.509902 +0200 + Reason: OK
2017-08-06 14:20:11.509902 +0200 + Status: 200
...
2017-08-06 14:20:11.509902 +0200 + content-length: 10726
2017-08-06 14:20:11.509902 +0200 + date: 'Sun, 06 Aug 2017 12:20:11 GMT'
2017-08-06 14:20:11.509902 +0200 + last-modified: 'Sun, 06 Aug 2017 12:00:02 GMT'
^C
From the last-modified dates and times we see in the headers, we can guess that they generate this file at the top of every hour and dump it out to disk for us to download.
Add in ETag support for good measure
Most of the block lists I found were hosted on github. GitHub does this neat thing where it puts the name of the git blob corresponding to that version of the file in the ETag header. If we add a list from raw.githubusercontent.com:
2017-08-06 14:47:41.078166 +0200 + Reason: OK
2017-08-06 14:47:41.078166 +0200 + Status: 200
...
2017-08-06 14:47:41.078166 +0200 + content-length: 109252
...
2017-08-06 14:47:41.078166 +0200 + date: 'Sun, 06 Aug 2017 12:47:41 GMT'
2017-08-06 14:47:41.078166 +0200 + etag: '"325bfb353d2bf60a77771b487cf0f9cb9fd87aac"'
2017-08-06 14:47:41.078166 +0200 + expires: 'Sun, 06 Aug 2017 12:52:41 GMT'
If we include that in our If-None-Match we can provoke the same kind of HTTP 304 Not Modified responses.
headers => {
'If-None-Match' => '"325bfb353d2bf60a77771b487cf0f9cb9fd87aac"'
},
and we see:
2017-08-06 15:00:16.755018 +0200 + Reason: Not Modified
2017-08-06 15:00:16.755018 +0200 + Status: 304
...
2017-08-06 15:00:16.755018 +0200 + date: 'Sun, 06 Aug 2017 13:00:16 GMT'
2017-08-06 15:00:16.755018 +0200 + etag: '"325bfb353d2bf60a77771b487cf0f9cb9fd87aac"'
2017-08-06 15:00:16.755018 +0200 + expires: 'Sun, 06 Aug 2017 13:05:16 GMT'
We have two good ways to prevent extra fetching of content, but that's too easy. When a request includes both, If-None-Match takes precedence (servers are supposed to ignore If-Modified-Since if it's present), so we simply prefer the ETag: if the server doesn't set it we won't have it, so we won't pass it back. The rest of the time we can stick with If-Modified-Since and things will be fine.
Keeping the headers around
After fetching the page the first time we can simply stash the values in our $rule hash. This looks pretty much the same as before; we just store the headers and pass them along with the next request.
If we keep piling stuff on like this it gets to be a bit of a mess, so let's pull the http request out into its own sub:
sub update {
my $rule = shift;
# cancel the timer while we fetch the file:
unschedule($rule);
use AnyEvent::HTTP;
http_request GET => $rule->{url},
headers => {
# prefer the ETag, fall back to If-Modified-Since, send neither on a first fetch:
(-e $rule->{path}) ? (
exists $rule->{last_etag}?('If-None-Match' => $rule->{last_etag}):
exists $rule->{last_mod} ?('If-Modified-Since' => $rule->{last_mod} ): ()
) : (),
},
# stream to disk
on_body => sub {my ($partial_body, $headers) = @_; ... ;1},
# If we die in on_body the download will stop and this won't happen:
sub { my($data, $headers) = @_;
AE::log debug => Dump( [undef, $headers] );
# $headers->{expires}?
$rule->{last_etag} = $headers->{etag}
if exists $headers->{etag};
$rule->{last_mod} = $headers->{'last-modified'}
if exists $headers->{'last-modified'};
# check again later:
schedule( $rule );
}
}
So all we need to do is call update periodically so we keep polling for new content, though 90 seconds might be a bit too fast for a lot of these lists (as we saw, ricklist is hourly):
my $timers;
sub unschedule {
my $rule = shift;
undef $timers->{ $rule->{name} }
}
sub schedule {
my $rule = shift;
$timers->{ $rule->{name} } = AnyEvent->timer (
after => rand 90,
cb => sub{ update($rule) }
);
}
my $loop= AnyEvent->condvar;
for my $rule (@sync_rules) {
schedule( $rule );
}
$loop->recv
As long as we run our script near the top of the hour when ricklist is re-generated, we can see:
2017-08-06 15:52:23.779815 +0200 + Reason: OK
2017-08-06 15:52:23.779815 +0200 + Status: 200
2017-08-06 15:52:23.779815 +0200 + last-modified: 'Sun, 06 Aug 2017 13:00:01 GMT'
2017-08-06 15:52:47.563668 +0200 + Reason: Not Modified
2017-08-06 15:52:47.563668 +0200 + Status: 304
2017-08-06 15:52:47.563668 +0200 + date: 'Sun, 06 Aug 2017 13:52:47 GMT'
2017-08-06 15:52:47.563668 +0200 + last-modified: 'Sun, 06 Aug 2017 13:00:01 GMT'
2017-08-06 16:00:16.781009 +0200 + Reason: OK
2017-08-06 16:00:16.781009 +0200 + Status: 200
2017-08-06 16:00:16.781009 +0200 + date: 'Sun, 06 Aug 2017 14:00:16 GMT'
2017-08-06 16:00:16.781009 +0200 + last-modified: 'Sun, 06 Aug 2017 14:00:02 GMT'
2017-08-06 16:00:33.477262 +0200 + Reason: Not Modified
2017-08-06 16:00:33.477262 +0200 + Status: 304
2017-08-06 16:00:33.477262 +0200 + date: 'Sun, 06 Aug 2017 14:00:33 GMT'
2017-08-06 16:00:33.477262 +0200 + last-modified: 'Sun, 06 Aug 2017 14:00:02 GMT'
Now it's just a matter of implementing on_body, which looks a bit like this.
Let's add an on_body to stream to the file
Once we've decided that the server is really going to send us some content, we want to create a file and write the data there:
This is mostly logging, but it seems silly to just pass empty callbacks to all these calls. Note that $fh is a lexical declared at the top of update, so it persists between chunks:
on_body => sub {my ($partial_body, $headers) = @_;
# returning false from on_body cancels the request, so we skip non-2xx bodies:
return if $headers->{Status} !~ /^2/;
use AnyEvent::IO qw[:aio :flags];
return aio_open($rule->{statefile}, O_CREAT | O_EXCL | O_RDWR, 0600, sub {
($fh) = @_
or return AE::log error => "$! - denial of service attack?";
AE::log debug => q[Opened %s for write], $rule->{statefile};
aio_write $fh, $partial_body, sub { my $length = shift;
AE::log debug => q[Wrote %s / %s to %s], $length, length $partial_body, $rule->{statefile}
}
}) unless $fh;
aio_write $fh, $partial_body, sub { my $length = shift;
AE::log debug => q[Wrote %s / %s to %s], $length, length $partial_body, $rule->{statefile}
};
;1},
Because we call aio_open with O_CREAT | O_EXCL, our open will fail if the file already exists. It's possible that another version of this script is writing there... or that we crashed in the past. Once our temp file is written to the path we decided on in $rule->{statefile}, we'll want to move it into place and replace the existing version.
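If we want the crashed-in-the-past case to heal itself, one option is to clear out leftover statefiles at startup, before the first update. A sketch (it simply throws away any partial download):
use AnyEvent::IO qw[ :aio ];
for my $rule (@sync_rules) {
    # remove any stale temp file so a later O_CREAT | O_EXCL open can succeed
    aio_unlink $rule->{statefile}, sub {
        AE::log debug => q[Removed stale %s], $rule->{statefile} if @_;
    };
}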
and finally we replace the old file
Here we use an old unix trick: create a hard link to the existing file, then rename your new file over it. This way the old file's contents are still referred to by the hard link, and you can recover them easily if something goes wrong with a write.
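The shape of the trick in plain synchronous Perl, just to make the dance clear:
link   'list.txt', 'list.txt.bak';    # the old contents now have two names
rename 'list.txt.tmp', 'list.txt';    # atomic replace (same filesystem)
# list.txt.bak still points at the old data if we need to roll back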
When we add this to our existing callback, we get this. It's mostly logging again, but you only ever really think about logging when you don't have enough:
sub { my($data, $headers) = @_;
AE::log debug => "[%s] %s %s with %s byte body from %s",
$rule->{name}, @{ $headers }{qw[ Status Reason content-length URL]};
# $headers->{expires}?
$rule->{last_etag} = $headers->{etag}
if exists $headers->{etag};
$rule->{last_mod} = $headers->{'last-modified'}
if exists $headers->{'last-modified'};
return aio_close $fh, sub {
AE::log debug => q[Closed %s], $rule->{statefile};
my $bak = $rule->{path} . '.bak';
# just like in the docs, create a .bak, and then rename the temp over the top:
aio_unlink $bak, sub {
# don't care if the unlink fails.
aio_link $rule->{path}, $bak, sub {
my($success) = @_;
# the link can fail, but not if the file exists:
return AE::log error => "Can't create %s: $!", $bak
if -e $rule->{path} and not $success;
aio_rename $rule->{statefile}, $rule->{path}, sub {
@_ or return AE::log error => "Can't replace '%s': $!", $rule->{path};
AE::log info => "Replaced %s with %s", $rule->{path}, $rule->{statefile};
};
};
# check again later:
schedule( $rule );
}
} if $fh;
schedule( $rule );
}
Wrap up
This chunk of code should be enough to keep a directory of mirrored Adblock lists up to date, and perhaps other stuff too.
Exercises for the reader
There's always room to improve; here are a couple of ideas...
Allow for per-file polling frequency
Polling every file every 90 seconds is definitely too much for Adblock lists; it could be good to configure ranges of delays, and maybe even some backoff to handle failures.
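A sketch of the simplest version, assuming a hypothetical refresh key in each config entry (falling back to the 90 seconds we already use):
sub schedule {
    my $rule = shift;
    my $delay = $rule->{refresh} || 90;    # hypothetical per-rule setting
    $timers->{ $rule->{name} } = AnyEvent->timer (
        after => rand $delay,
        cb => sub { update($rule) }
    );
}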
Scheduling to force downloads
You might want to enforce a maximum age for the downloaded files, so we can be sure we have the latest version available.
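One way to sketch that, assuming a hypothetical max_age key: if the copy on disk is older than max_age seconds, drop the stashed validators so the server has to send a full body again:
# inside update(), before building the conditional headers
my $too_old = $rule->{max_age} && -e $rule->{path}
    && time - (stat _)[9] > $rule->{max_age};
delete @{$rule}{qw[ last_etag last_mod ]} if $too_old;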
Verification of downloads
There are lots of ways to verify downloads. We can try:
- content length
- sha / md5 checksums from another url
- Adblock ! Checksum comments (see the sketch below)
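That last one deserves a sketch. Many Adblock lists carry a line like ! Checksum: <base64 md5>; assuming the convention the usual validators follow (drop the checksum line, strip carriage returns, collapse blank lines, then MD5 in base64 without padding), verification might look like:
use Digest::MD5 qw[ md5_base64 ];
sub adblock_checksum_ok {
    my $content = shift;
    my ($claimed) = $content =~ m{^\s*!\s*Checksum:\s*(\S+)}mi
        or return 1;                              # nothing claimed, nothing to check
    $content =~ s{^\s*!\s*Checksum:.*\n?}{}mi;    # hash everything except that line
    $content =~ s{\r}{}g;                         # strip carriage returns
    $content =~ s{\n+}{\n}g;                      # collapse blank lines
    return md5_base64($content) eq $claimed;      # md5_base64 omits '=' padding
}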
The whole thing
The whole script looks like this:
#! /usr/local/bin/perl
#ABSTRACT: keep a directory of files up to date
use warnings; use strict;
use AnyEvent;
use FindBin;
use File::Basename qw[ basename ];
use lib "$FindBin::Bin/../lib";
my @CONFIG_PATH=("$FindBin::Bin");
my $CONFIG_NAME = basename($0);
my ($CONFIG_FILE) = grep -e, map "$_/$CONFIG_NAME.yaml", @CONFIG_PATH
or AE::log fatal => "No $CONFIG_NAME.yaml found in ".join ", ", @CONFIG_PATH;
AE::log info => "Will read sync rules from: $CONFIG_FILE";
# maintain a list of files to mirror
my @sync_rules;
{
use YAML qw[ LoadFile DumpFile Dump ];
my $normalise = sub {
my $name= $_->{name} || $_->{path} || $_->{url} || Dump($_);
# without a url we don't have anything to mirror.
AE::log info => q[The entry "%s" is missing a url, so it'll be ignored], $name
and return unless exists $_->{url};
AE::log info => q[The entry "%s" is missing a destination path, so it'll be ignored.], $name
and return unless exists $_->{path};
AE::log debug => q[The entry "%s" is disabled.], $name
and return if exists $_->{disabled};
$_->{name} ||= $_->{url};
# we store the ETag and Modified here:
$_->{statefile} ||= $_->{path} . ".$CONFIG_NAME";
$_
};
my @unvalidated;
eval {@unvalidated=LoadFile $CONFIG_FILE;1}
or AE::log fatal => "Can't read $CONFIG_FILE: $@";
my @normal = map $normalise->(), @unvalidated;
AE::log note => " ... file list contained %s, I kept %s. (see above)",
0+@unvalidated, 0+@normal;
AE::log fatal => "There were no rules left after validation"
unless @sync_rules = @normal;
AE::log debug => sub {sprintf "%s=[%s] %s -> %s", @{ $_ }{qw[ type name url path ]} }
for @sync_rules
}
sub update {
my $rule = shift;
# cancel the timer while we fetch the file:
unschedule($rule);
use AnyEvent::HTTP;
my $fh;
http_request GET => $rule->{url},
headers => {
# prefer the ETag, fall back to If-Modified-Since, send neither on a first fetch:
(-e $rule->{path}) ? (
exists $rule->{last_etag}?('If-None-Match' => $rule->{last_etag}):
exists $rule->{last_mod} ?('If-Modified-Since' => $rule->{last_mod} ): ()
) : (),
},
# stream to disk
on_body => sub {my ($partial_body, $headers) = @_;
# returning false from on_body cancels the request, so we skip non-2xx bodies:
return if $headers->{Status} !~ /^2/;
use AnyEvent::IO qw[:aio :flags];
return aio_open($rule->{statefile}, O_CREAT | O_EXCL | O_RDWR, 0600, sub {
($fh) = @_
or return AE::log error => "Can't create %s: $!", $rule->{statefile};
AE::log debug => q[Opened %s for write], $rule->{statefile};
aio_write $fh, $partial_body, sub { my $length = shift;
AE::log debug => q[Wrote %s / %s to %s], $length, length $partial_body, $rule->{statefile}
}
}) unless $fh;
aio_write $fh, $partial_body, sub { my $length = shift;
AE::log debug => q[Wrote %s / %s to %s], $length, length $partial_body, $rule->{statefile}
};
;1},
# If we die in on_body the download will stop and this won't happen:
sub { my($data, $headers) = @_;
AE::log debug => "[%s] %s %s with %s byte body from %s",
$rule->{name}, @{ $headers }{qw[ Status Reason content-length URL]};
# $headers->{expires}?
$rule->{last_etag} = $headers->{etag}
if exists $headers->{etag};
$rule->{last_mod} = $headers->{'last-modified'}
if exists $headers->{'last-modified'};
return aio_close $fh, sub {
AE::log debug => q[Closed %s], $rule->{statefile};
my $bak = $rule->{path} . '.bak';
# just like in the docs, create a .bak, and then rename the temp over the top:
aio_unlink $bak, sub {
# don't care if the unlink fails.
aio_link $rule->{path}, $bak, sub {
my($success) = @_;
# the link can fail, but not if the file exists:
return AE::log error => "Can't create %s: $!", $bak
if -e $rule->{path} and not $success;
aio_rename $rule->{statefile}, $rule->{path}, sub {
@_ or return AE::log error => "Can't replace '%s': $!", $rule->{path};
AE::log info => "Replaced %s with %s", $rule->{path}, $rule->{statefile};
};
};
# check again later:
schedule( $rule );
}
} if $fh;
schedule( $rule );
}
}
my $timers;
sub unschedule {
my $rule = shift;
undef $timers->{ $rule->{name} }
}
sub schedule {
my $rule = shift;
$timers->{ $rule->{name} } = AnyEvent->timer (
after => rand 90,
cb => sub{ update($rule) }
);
}
my $loop= AnyEvent->condvar;
for my $rule (@sync_rules) {
schedule( $rule );
}
$loop->recv