Keeping a bunch of files up to date with AnyEvent::HTTP


We've all written code that fetches content from the internets, or accepts user posts from the internets, and in next to no time you have a database full of trash.

I'd like to crawl, scrape and generally import stuff from the tubes knowing that I have some kind of garbage filter on my user agent... something that can read [Adblock Plus ...] files and filter content based on those rules.

After a quick dive in the cpan swamp I found WWW::Adblock and App::DNS::Adblock - sadly both require that I have the files on disk.

Step one: fetch a pile of blacklists, and keep them up to date.

After reading a couple of threads I discovered that filterlists.com and pi-hole have their full lists on github (filterlist, pihole) which saves me the trouble of searching/scraping.

These lists overlap a lot and will require de-duping, but that's a whole other thing.

Rough plan:

  1. read a list of files to mirror
  2. check with the internet to find out what's changed
  3. write modifications to disk
  4. wait some time and do it all again

.yaml seems fine for a config file

So let's just make a list of files we're interested in and call it list_sync.yaml:

# The file contains a yaml document per source file, 
# ---
# name: <something pretty>      - a name, mostly for logs
# disabled: "reason"            - skip this rule
# path: <path-on-disk>          - where should we save it
# url: <place-on-internet>      - where should I get it from?
---
name: ricklist
path: index/adblock/ricklist.txt
url: http://rickrolldb.com/ricklist.txt
---
url: https://raw.githubusercontent.com/MajkiIT/polish-ads-filter/master/polish-adblock-filters/adblock.txt
path: index/adblock/polish-adblock-filters-adblock.txt

It's super easy to just grab the config with LoadFile and ignore the content, but it's also nice to do some validation and set some defaults for the optional values. The validation could be more aggressive, but I don't really need a type system for this one.

Let's get started on this script, and call it list_sync:

#! /usr/local/bin/perl
#ABSTRACT: keep a directory of files up to date

use warnings; use strict;
use AnyEvent;

use FindBin;
use File::Basename qw[ basename ];
use lib "$FindBin::Bin/../lib";
my @CONFIG_PATH=("$FindBin::Bin");
my $CONFIG_NAME = basename($0);

my ($CONFIG_FILE) = grep -e, map "$_/$CONFIG_NAME.yaml", @CONFIG_PATH
    or AE::log fatal => "No $CONFIG_NAME.yaml found in ".join ", ", @CONFIG_PATH;

AE::log info => "Will read sync rules from: $CONFIG_FILE";    

my @sync_rules;
{
    use YAML qw[ LoadFile  DumpFile Dump ];
    my $normalise = sub {
        my $name= $_->{name} || $_->{path} || $_->{url} || Dump($_);
        # without a url we don't have anything to mirror.
        AE::log info => q[The entry "%s" is missing a url, so it'll be ignored.], $name
            and return unless exists $_->{url};

        AE::log info => q[The entry "%s" is missing a destination path, so it'll be ignored.], $name
            and return unless exists $_->{path};

        AE::log debug => q[The entry "%s" is disabled.], $name
            and return if exists $_->{disabled};

        $_->{name} ||= $_->{url};
        $_->{statefile} ||= $_->{path} . ".$CONFIG_NAME";

        $_
    };

    my @unvalidated;
    eval {@unvalidated=LoadFile $CONFIG_FILE;1} 
        or AE::log fatal => "Can't read $CONFIG_FILE: $@";

    my @normal = map $normalise->(), @unvalidated;
    AE::log note => "  ... file list contained %s, I kept %s. (see above)",
                            0+@unvalidated, 0+@normal;

    AE::log fatal => "There were no rules left after validation"
        unless @sync_rules = @normal;

    AE::log debug => sub {sprintf "%s=[%s] %s -> %s", @{ $_ }{qw[ type name url path ]} }
        for @sync_rules

}

So we have a list of urls printed out, and the content of the config file looks like it will be enough to serve us even after we decide to add type: and all sorts of crazy other keys to the config.

Now we have a nice list of files to fetch, let's get set up to do some http requests to grab them:

use AnyEvent::HTTP;

my $loop= AnyEvent->condvar;
for my $rule (@sync_rules) { 

;http_request GET => $rule->{url},
    headers => {
        # Pass in last-modified and ETag from previous request
        # so we don't use all the bandwidth.
    },
    on_body => sub {my ($partial_body, $headers) = @_;
        # We can rudely hang up after the first chunk if we see content that
        # indicates the file hasn't changed by returning 0
    ;1},
    sub { my($data, $headers) = @_;
        AE::log debug => Dump( [$data, $headers] );

        $loop->send; # exit after the first response
    }
}
$loop->recv

So, if we run this, with debugging turned up all the way:

$ PERL_ANYEVENT_VERBOSE=9 perl list_sync
2017-08-06 13:06:49.000000 +0200 debug AnyEvent::Util: Using Guard module to implement guards.
2017-08-06 13:06:49.000000 +0200 info  main: Will read sync rules from: list_sync.yaml
2017-08-06 13:06:49.000000 +0200 debug main: The entry "FilterList" is disabled.
2017-08-06 13:06:49.000000 +0200 debug main: The entry "pihole" is disabled.
2017-08-06 13:06:49.000000 +0200 debug main: The entry "280blocker for japanese mobile site" is disabled.
2017-08-06 13:06:49.000000 +0200 debug main: The entry "index/adblock/polish-adblock-filters-adblock.txt" is disabled.
2017-08-06 13:06:49.000000 +0200 info  main: The entry "https://pastebin.com/raw/kDqbYwWr" is missing a destination path, so it'll be ignored.
2017-08-06 13:06:49.000000 +0200 note  main:   ... file list contained 6, I kept 1. (see above)
2017-08-06 13:06:49.000000 +0200 debug main: adblock=[ricklist] http://rickrolldb.com/ricklist.txt -> index/adblock/ricklist.txt

So there's all our config validation stuff, in super high verbosity. It's all buffered up so it all appears in the log at the same time.

AE::log defaults to only showing fatal, alert, crit, and error level messages, so none of this will turn up in your logs by default.

I normally set PERL_ANYEVENT_VERBOSE=info in my login profile so I get the more juicy stuff while running scripts.
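
If you want the same, it's one line in your shell startup file (bash/zsh shown; adjust for your shell):

export PERL_ANYEVENT_VERBOSE=info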

Moving on:

2017-08-06 13:06:49.000000 +0200 info  AnyEvent: Autoloaded model 'AnyEvent::Impl::EV', using it.
2017-08-06 13:06:49.178216 +0200 info  AnyEvent::IO: Autoloaded IO model 'Perl', using it.
2017-08-06 13:06:49.330338 +0200 debug AnyEvent::Socket: (re)loading /etc/hosts.
2017-08-06 13:06:49.842046 +0200 debug main: ---
2017-08-06 13:06:49.842046 +0200 + - ''
2017-08-06 13:06:49.842046 +0200 + - HTTPVersion: 1.1
2017-08-06 13:06:49.842046 +0200 +   Reason: OK
2017-08-06 13:06:49.842046 +0200 +   Status: 200
2017-08-06 13:06:49.842046 +0200 +   URL: http://rickrolldb.com/ricklist.txt
2017-08-06 13:06:49.842046 +0200 +   accept-ranges: bytes
2017-08-06 13:06:49.842046 +0200 +   connection: keep-alive
2017-08-06 13:06:49.842046 +0200 +   content-length: 10726
2017-08-06 13:06:49.842046 +0200 +   content-type: text/plain
2017-08-06 13:06:49.842046 +0200 +   date: 'Sun, 06 Aug 2017 11:06:49 GMT'
2017-08-06 13:06:49.842046 +0200 +   last-modified: 'Sun, 06 Aug 2017 11:00:02 GMT'
2017-08-06 13:06:49.842046 +0200 +   server: nginx/1.0.2
2017-08-06 13:06:49.842046 +0200 +   vary: Accept-Encoding

We can see here that we got '' as $data, because we passed an (empty) on_body callback.

Since we're starting with ricklist.txt, let's update our on_body to be more helpful:

on_body => sub {my ($partial_body, $headers) = @_;
    # $ricks needs a `my $ricks = 0;` declared somewhere above the request
    AE::log warn => "Yep, this is the ricklist.txt! " . ($ricks++)
      if $partial_body =~ m{!   Rickroll blacklist, http://rickrolldb.com};

    return 1
},

We'll see the same thing as before, plus one extra line:

2017-08-06 13:35:13.800477 +0200 warn  main: Yep, this is the ricklist.txt! 0

Here the callback was run several times with different size blocks of content. Seeing the zero at the end indicates that the text we were looking for was in the first block.

Fiddling a bit with the callback and logging what we get, we see $partial_body can even be split in the middle of lines; the length changes, but you're called pretty frequently. We can use this callback to stream large lists to disk without having to slurp the whole thing into memory. Neat.
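
A throwaway version of that fiddling, if you want to watch the chunking yourself:

on_body => sub {my ($partial_body, $headers) = @_;
    # chunk sizes vary from run to run, and chunks can split mid-line
    AE::log warn => "chunk of %d bytes", length $partial_body;
;1},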

We only want it if-modified-since...

We want to be good internet citizens, and part of doing that is allowing http servers to say "There's nothing to see here. Move along." When speaking http this comes in the form of an HTTP/304 Not Modified response.

To do this we need to do a little bit of extra bookkeeping on our end, but in return we get back zero-length responses when there's no change to the file. This means we can spend our bandwidth on other stuff, which is part of the point of blocking ads in the first place.

So, the interesting header from the response above is:

2017-08-06 14:06:49.842046 +0200 +   last-modified: 'Sun, 06 Aug 2017 12:00:02 GMT'

So let's give this a try. We'll just set If-Modified-Since to the value we saw above:

;http_request GET => $rule->{url},
    headers => {
        'If-Modified-Since' => 'Sun, 06 Aug 2017 12:00:02 GMT'
    },
    on_body => sub {my ($partial_body, $headers) = @_; ;1},
    sub { my($data, $headers) = @_;
        AE::log debug => Dump( [$data, $headers] );
    }

and we see:

2017-08-06 14:15:04.301523 +0200 +   Reason: Not Modified
2017-08-06 14:15:04.301523 +0200 +   Status: 304
...
2017-08-06 14:15:04.301523 +0200 +   date: 'Sun, 06 Aug 2017 12:15:04 GMT'
2017-08-06 14:15:04.301523 +0200 +   last-modified: 'Sun, 06 Aug 2017 12:00:02 GMT'
^C

but if you use the value from earlier:

'If-Modified-Since' => 'Sun, 06 Aug 2017 11:00:02 GMT'

You'll be sent the whole body again.

2017-08-06 14:20:11.509902 +0200 +   Reason: OK
2017-08-06 14:20:11.509902 +0200 +   Status: 200
...
2017-08-06 14:20:11.509902 +0200 +   content-length: 10726
2017-08-06 14:20:11.509902 +0200 +   date: 'Sun, 06 Aug 2017 12:20:11 GMT'
2017-08-06 14:20:11.509902 +0200 +   last-modified: 'Sun, 06 Aug 2017 12:00:02 GMT'
^C

From the last-modified dates and times we see in the headers, we can guess that they generate this file at the top of every hour and dump it out to disk for us to download.

Add in ETag support for good measure

Most of the block lists I found were hosted on github. github does this neat thing where they use the name of the git blob corresponding to that version of the file as the ETag. If we add a list from raw.githubusercontent.com:

2017-08-06 14:47:41.078166 +0200 +   Reason: OK
2017-08-06 14:47:41.078166 +0200 +   Status: 200
...
2017-08-06 14:47:41.078166 +0200 +   content-length: 109252
...
2017-08-06 14:47:41.078166 +0200 +   date: 'Sun, 06 Aug 2017 12:47:41 GMT'
2017-08-06 14:47:41.078166 +0200 +   etag: '"325bfb353d2bf60a77771b487cf0f9cb9fd87aac"'
2017-08-06 14:47:41.078166 +0200 +   expires: 'Sun, 06 Aug 2017 12:52:41 GMT'
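
If you happen to have that same revision in a working copy, you can check the blob-name claim with git hash-object, which prints the blob name for a file (the path here assumes a checkout of the polish-ads-filter repo):

me@compy386:~ git hash-object polish-adblock-filters/adblock.txt
325bfb353d2bf60a77771b487cf0f9cb9fd87aac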

If we include that in our If-None-Match we can provoke the same kind of HTTP/304 Not Modified responses.

headers => {
    'If-None-Match' => '"325bfb353d2bf60a77771b487cf0f9cb9fd87aac"'
},

and we see:

2017-08-06 15:00:16.755018 +0200 +   Reason: Not Modified
2017-08-06 15:00:16.755018 +0200 +   Status: 304
...
2017-08-06 15:00:16.755018 +0200 +   date: 'Sun, 06 Aug 2017 13:00:16 GMT'
2017-08-06 15:00:16.755018 +0200 +   etag: '"325bfb353d2bf60a77771b487cf0f9cb9fd87aac"'
2017-08-06 15:00:16.755018 +0200 +   expires: 'Sun, 06 Aug 2017 13:05:16 GMT'

We now have two good ways to prevent extra fetching of content, but sending both headers on one request doesn't buy us anything: servers are supposed to prefer If-None-Match when both are present, but support for that varies in the wild.

No problem though: we simply prefer the ETag. If the server doesn't set one we won't have it, so we won't pass it back.

The rest of the time we can stick with If-Modified-Since and things will be fine.
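
In code, that preference comes out as a cascading ternary in the headers hash; this is the shape of what ends up in the full script below:

headers => {
    exists $rule->{last_etag} ? ('If-None-Match'     => $rule->{last_etag})
  : exists $rule->{last_mod}  ? ('If-Modified-Since' => $rule->{last_mod} )
  : ()
},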

Keeping the headers around

After fetching the page the first time we can simply stash the values in our $rule hash. This looks pretty much the same as before, we just store/send the headers and pass them along with the next request.

If we keep piling stuff on like this it gets to be a bit of a mess, so let's pull the http request out into its own sub:

sub update {
    my $rule = shift;

    # cancel the timer while we fetch the file:
    unschedule($rule);

    use AnyEvent::HTTP;
    http_request GET => $rule->{url},
        headers => {
            (-e  $rule->{path}) ? (
            exists $rule->{last_etag}?('If-None-Match'     => $rule->{last_etag}): 
            exists $rule->{last_mod} ?('If-Modified-Since' => $rule->{last_mod} ): ()
            ) : (),
        },
        # stream to disk
        on_body => sub {my ($partial_body, $headers) = @_; ... ;1},

        # If we die in on_body the download will stop and this won't happen:
        sub { my($data, $headers) = @_;
          AE::log debug => Dump( [undef, $headers] );

          # $headers->{expires}?

          $rule->{last_etag} = $headers->{etag}
            if exists $headers->{etag};
          $rule->{last_mod}  = $headers->{'last-modified'}
            if exists $headers->{'last-modified'};

          # check again later:
          schedule( $rule );
       }
}

So all we need to do is call update periodically so we keep polling for new content, though 90 seconds might be a bit too fast for a lot of these lists (as we saw, ricklist is hourly):

my $timers;
sub unschedule {
    my $rule = shift;
    # dropping the guard cancels the timer
    undef $timers->{ $rule->{name} } 
}
sub schedule {
    my $rule = shift;

    $timers->{ $rule->{name} } = AnyEvent->timer (
        after => rand 90,
        cb    => sub{ update($rule) }
    );
}

my $loop= AnyEvent->condvar;
for my $rule (@sync_rules) { 
    schedule( $rule );
}
$loop->recv

As long as we run our script near the top of the hour when ricklist is re-generated, we can see:

2017-08-06 15:52:23.779815 +0200 +   Reason: OK
2017-08-06 15:52:23.779815 +0200 +   Status: 200
2017-08-06 15:52:23.779815 +0200 +   last-modified: 'Sun, 06 Aug 2017 13:00:01 GMT'

2017-08-06 15:52:47.563668 +0200 +   Reason: Not Modified
2017-08-06 15:52:47.563668 +0200 +   Status: 304
2017-08-06 15:52:47.563668 +0200 +   date: 'Sun, 06 Aug 2017 13:52:47 GMT'
2017-08-06 15:52:47.563668 +0200 +   last-modified: 'Sun, 06 Aug 2017 13:00:01 GMT'

2017-08-06 16:00:16.781009 +0200 +   Reason: OK
2017-08-06 16:00:16.781009 +0200 +   Status: 200
2017-08-06 16:00:16.781009 +0200 +   date: 'Sun, 06 Aug 2017 14:00:16 GMT'
2017-08-06 16:00:16.781009 +0200 +   last-modified: 'Sun, 06 Aug 2017 14:00:02 GMT'

2017-08-06 16:00:33.477262 +0200 +   Reason: Not Modified
2017-08-06 16:00:33.477262 +0200 +   Status: 304
2017-08-06 16:00:33.477262 +0200 +   date: 'Sun, 06 Aug 2017 14:00:33 GMT'
2017-08-06 16:00:33.477262 +0200 +   last-modified: 'Sun, 06 Aug 2017 14:00:02 GMT'

Now it's just a matter of implementing on_body, which looks a bit like this.

Let's add an on_body to stream to the file

Once we've decided that the server is really going to send us some content, we want to create a file and write the data there. This is mostly logging, but it seems silly to just pass empty callbacks to all these calls:

    on_body => sub {my ($partial_body, $headers) = @_;

        return if $headers->{Status} !~ /^2/; 

        use AnyEvent::IO qw[:aio :flags];
        # $fh is declared with a `my $fh;` just before the http_request call
        return aio_open($rule->{statefile}, O_CREAT | O_EXCL | O_RDWR, 0600, sub {
            ($fh) = @_
                or return AE::log error => "Can't create %s: $!", $rule->{statefile};
            AE::log debug => q[Opened %s for write], $rule->{statefile};

            aio_write $fh, $partial_body, sub { my $length = shift;
                AE::log debug => q[Wrote %s / %s to %s], $length, length $partial_body, $rule->{statefile}
            }
        }) unless $fh;

        aio_write $fh, $partial_body, sub { my $length = shift;
            AE::log debug => q[Wrote %s / %s to %s], $length, length $partial_body, $rule->{statefile}
        };

    ;1},

Because we call aio_open with O_CREAT | O_EXCL, our open will fail if the file already exists. It's possible that another version of this script is writing there... or that we crashed in the past.

Once our temp file is written to the path we decided on in $rule->{statefile}, we'll want to move it into place and replace the existing version.

And finally we replace the old file

Here we use an old unix trick: create a hard link to the existing file, then rename your new file over it. This way the old file's contents are still referred to by the hard link, and you can recover them easily if something goes wrong with a write.
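
Stripped of the callbacks, the dance is just this (a synchronous sketch, file names made up):

my ($new, $path) = ('ricklist.txt.tmp', 'ricklist.txt');
my $bak = "$path.bak";
unlink $bak;                            # don't care if this fails
if (-e $path) {
    link $path, $bak or die "link: $!"; # old contents stay reachable via .bak
}
rename $new, $path or die "rename: $!"; # atomic replace on the same filesystem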

When we add this to our existing callback, we get this. It's mostly logging again, but you only ever really think about logging when you don't have enough:

    sub { my($data, $headers) = @_;
      AE::log debug => "[%s] %s %s with %s byte body from %s",
                $rule->{name}, @{ $headers }{qw[ Status Reason content-length URL]};

      # $headers->{expires}?

      $rule->{last_etag} = $headers->{etag}
        if exists $headers->{etag};
      $rule->{last_mod}  = $headers->{'last-modified'}
        if exists $headers->{'last-modified'};

        return aio_close $fh, sub {
            AE::log debug => q[Closed %s], $rule->{statefile};
            my $bak = $rule->{path} . '.bak';

            # just like in the docs, create a .bak, and then rename the temp over the top:
            aio_unlink $bak, sub {
                # don't care if the unlink fails.
                aio_link $rule->{path}, $bak, sub {
                    my($success) = @_;
                    # the link can fail, but not if the file exists:
                    return AE::log error => "Can't create %s: $!", $bak
                       if -e $rule->{path} and not $success;

                   aio_rename $rule->{statefile}, $rule->{path}, sub {
                      @_ or return AE::log error => "Can't replace '%s': $!", $rule->{path};
                      AE::log info => "Replaced %s with %s", $rule->{path}, $rule->{statefile};
                   };
                };
                # check again later:
                schedule( $rule );
            }
        } if $fh;
        schedule( $rule );
   }

Wrap up

This chunk of code should be enough to get a list of mirrored Adblock lists, and perhaps other stuff too.

Exercises for the reader

There's always room to improve, here's a couple of ideas...

Allow for per-file polling frequency

Polling every file every 90 seconds is definitely too much for Adblock lists; it could be good to configure ranges of delays, and maybe even some back-offs to handle failures.
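
A hypothetical per-rule poll key would slot straight into the existing config and schedule(); note the key is an invention, the script above ignores it:

---
name: ricklist
path: index/adblock/ricklist.txt
url: http://rickrolldb.com/ricklist.txt
poll: 3600                  # seconds between checks

and then in schedule():

    after => rand( $rule->{poll} || 90 ),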

Scheduling to force downloads

You might want to enforce a maximum age for the downloaded files, forcing a fresh download now and then, so we can be sure we have the latest version available.
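
One way to sketch that, using an invented max_age key: throw away the cached validators when the file on disk gets too old, so the next request can't come back as a 304:

# hypothetical, at the top of update():
my $max_age = $rule->{max_age} || 7 * 24 * 3600;
delete @{ $rule }{qw[ last_etag last_mod ]}
    if -e $rule->{path} and time - (stat _)[9] > $max_age;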

Verification of downloads

There are lots of ways to verify downloads. We can try:

  • content length
  • sha / md5 checksums from another url
  • Adblock ! Checksum comments (sketched below).
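
For the Adblock case, the lists carry a comment like ! Checksum: <base64 md5>, computed over the list with the checksum line removed and line endings normalised. Something like this should be close, though it's a sketch I haven't run against the real lists:

use Digest::MD5 qw[ md5_base64 ];
sub adblock_checksum_ok {
    my $body = shift;
    my ($stated) = $body =~ /^! Checksum:\s*(\S+)/m
        or return 1;                     # nothing stated, nothing to check
    (my $stripped = $body) =~ s/^! Checksum:.*\n//m;
    $stripped =~ s/\r//g;                # normalise line endings
    return md5_base64($stripped) eq $stated;
}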

The whole thing

The whole script looks like this:

#! /usr/local/bin/perl
#ABSTRACT: keep a directory of files up to date

use warnings; use strict;
use AnyEvent;

use FindBin;
use File::Basename qw[ basename ];
use lib "$FindBin::Bin/../lib";
my @CONFIG_PATH=("$FindBin::Bin");
my $CONFIG_NAME = basename($0);

my ($CONFIG_FILE) = grep -e, map "$_/$CONFIG_NAME.yaml", @CONFIG_PATH
    or AE::log fatal => "No $CONFIG_NAME.yaml found in ".join ", ", @CONFIG_PATH;

AE::log info => "Will read sync rules from: $CONFIG_FILE";
# maintain a list of files to mirror


my @sync_rules;
{
    use YAML qw[ LoadFile  DumpFile Dump ];
    my $normalise = sub {
        my $name= $_->{name} || $_->{path} || $_->{url} || Dump($_);
        # without a url we don't have anything to mirror.
        AE::log info => q[The entry "%s" is missing a url, so it'll be ignored.], $name
            and return unless exists $_->{url};

        AE::log info => q[The entry "%s" is missing a destination path, so it'll be ignored.], $name
            and return unless exists $_->{path};

        AE::log debug => q[The entry "%s" is disabled.], $name
            and return if exists $_->{disabled};


        $_->{name} ||= $_->{url};

        # we store the ETag and Modified here:
        $_->{statefile} ||= $_->{path} . ".$CONFIG_NAME";

        $_
    };

    my @unvalidated;
    eval {@unvalidated=LoadFile $CONFIG_FILE;1} 
        or AE::log fatal => "Can't read $CONFIG_FILE: $@";

    my @normal = map $normalise->(), @unvalidated;
    AE::log note => "  ... file list contained %s, I kept %s. (see above)",
                            0+@unvalidated, 0+@normal;

    AE::log fatal => "There were no rules left after validation"
        unless @sync_rules = @normal;

    AE::log debug => sub {sprintf "%s=[%s] %s -> %s", @{ $_ }{qw[ type name url path ]} }
        for @sync_rules

}

sub update {
    my $rule = shift;

    # cancel the timer while we fetch the file:
    unschedule($rule);

    use AnyEvent::HTTP;

    my $fh;
    http_request GET => $rule->{url},
        headers => {
            (-e  $rule->{path}) ? (
            exists $rule->{last_etag}?('If-None-Match'     => $rule->{last_etag}): 
            exists $rule->{last_mod} ?('If-Modified-Since' => $rule->{last_mod} ): ()
            ) : (),
        },

        # stream to disk
        on_body => sub {my ($partial_body, $headers) = @_;

            return if $headers->{Status} !~ /^2/; 

            use AnyEvent::IO qw[:aio :flags];
            return aio_open($rule->{statefile}, O_CREAT | O_EXCL | O_RDWR, 0600, sub {
              ($fh) = @_
                or return AE::log error => "Can't create %s: $!", $rule->{statefile};
                AE::log debug => q[Opened %s for write], $rule->{statefile};

                aio_write $fh, $partial_body, sub { my $length = shift;
                    AE::log debug => q[Wrote %s / %s to %s], $length, length $partial_body, $rule->{statefile}
                }
            }) unless $fh;

            aio_write $fh, $partial_body, sub { my $length = shift;
                AE::log debug => q[Wrote %s / %s to %s], $length, length $partial_body, $rule->{statefile}
            };

        ;1},

        # If we die in on_body the download will stop and this won't happen:
        sub { my($data, $headers) = @_;
          AE::log debug => "[%s] %s %s with %s byte body from %s",
                    $rule->{name}, @{ $headers }{qw[ Status Reason content-length URL]};

          # $headers->{expires}?

          $rule->{last_etag} = $headers->{etag}
            if exists $headers->{etag};
          $rule->{last_mod}  = $headers->{'last-modified'}
            if exists $headers->{'last-modified'};

            return aio_close $fh, sub {
                AE::log debug => q[Closed %s], $rule->{statefile};
                my $bak = $rule->{path} . '.bak';

                # just like in the docs, create a .bak, and then rename the temp over the top:
                aio_unlink $bak, sub {
                    # don't care if the unlink fails.
                    aio_link $rule->{path}, $bak, sub {
                        my($success) = @_;
                        # the link can fail, but not if the file exists:
                        return AE::log error => "Can't create %s: $!", $bak
                           if -e $rule->{path} and not $success;

                       aio_rename $rule->{statefile}, $rule->{path}, sub {
                          @_ or return AE::log error => "Can't replace '%s': $!", $rule->{path};
                          AE::log info => "Replaced %s with %s", $rule->{path}, $rule->{statefile};
                       };
                    };
                    # check again later:
                    schedule( $rule );
                }
            } if $fh;
            schedule( $rule );
       }
}

my $timers;
sub unschedule {
    my $rule = shift;
    undef $timers->{ $rule->{name} } 
}
sub schedule {
    my $rule = shift;

    $timers->{ $rule->{name} } = AnyEvent->timer (
        after => 1,#rand 10,
        cb    => sub{ update($rule) }
    );
}

my $loop= AnyEvent->condvar;
for my $rule (@sync_rules) { 
    schedule( $rule );
}
$loop->recv

Self-restarting AnyEvent program



Whether you're running 1800 production web servers or a single VPS, there will come a time when you need to do some automated log cleanup, update some configs, or monitor the state of some process or network connection.

So you write your little script and dump it on the machine, via puppet or salt. You add an entry to your crontab and never worry about it again.

Yeah. That's never the end. Some new requirements come up, the script needs to check new things, or the config files move because of an os upgrade and the damn thing keeps emailing you warnings. We come to a point where we need to patch our script and deploy the changes, and make sure the old version stops and the new version is running. Over and over.

This is the point where it's handy to have your bundle of maintenance scripts running from an auto-updating git working copy on your boxes.

The aim here is to build a script that runs and does its work while keeping an eye on your git repo's state.

Setup: clone the repo, install your dependencies

I'm going to assume you want to add this to an existing project:

me@compy386:~ git clone your-github-project /usr/local/yourthing    
me@compy386:~ cpanm -L /usr/local/yourthing/misc AnyEvent AnyEvent::HTTP

but if you want to just play with this without defacing one of your public projects:

me@compy386:~ 
    mkdir -p restarter/misc
    cd restarter
    git init 
    <paste code in a file called computer_program>
    git add .
    git commit -am "Let's get this party started"
    cpanm -L misc AnyEvent AnyEvent::HTTP

(maybe git ignore misc/*)

Supervise it, or just run it

Pop something like this in your supervisor.conf, conf.d/computer_program.conf or whatever:

[program:computer_program]
command=/usr/local/yourthing/computer_program --loglevel=%(ENV_LOGLEVEL)s 
# You'll also want to configure log rotation if you're doing this long term.

And fire it up

me@compy386:~ supervisorctl start computer_program
me@compy386:~ supervisorctl status

And

me@compy386:~ ps ax | grep computer[_]program
computer_program 4b825dc642cb6eb9a060e54bf8d69288fbee4904

If you're not convinced, you can just run it directly

me@compy386:~ /usr/local/yourthing/computer_program
... It'll just run.

See it at work

... since the point is to restart the script each time the git repo it's running from changes, the best way to test this is to just commit locally and see the script notice the change.

Open up another terminal and commit a change to the script, or anything else in the directory.

me@compy386:/usr/local/yourthing echo "# never mind" >> computer_program
me@compy386:/usr/local/yourthing git commit -am "Helpful docs"

In the other tab we get one of these:

2017-05-02 23:15:11.163089 +0200 trace main: restarter: first run, starting with /usr/local/yourthing at 4ffdd6cc2b36b6fd69553202b7db1c7288927521
2017-05-02 23:17:41.198546 +0200 trace main: restarter[4ffdd6cc2b36b6fd69553202b7db1c7288927521 cmp 5f0e2d68cc2fef915215490db7ad5f6a52c75a45]: not taking more jobs

Depending on what you set $CHECK_EVERY to, the script will bail out, and if supervised it'll be restarted.

The script itself

#! /usr/bin/perl

#ABSTRACT: wireframe AnyEvent program that exits when code changes.

use warnings; use strict;
my $MY_NAME = $0;

use Cwd qw(cwd abs_path);
use File::Basename qw(dirname);
use File::Spec::Functions qw(catdir);

our $NEST;
use lib catdir( $NEST= dirname( abs_path($0) ), 'misc/lib/perl5');
# ...for use with cpanm -L misc AnyEvent

use AnyEvent;
use AnyEvent::Util qw/ run_cmd /;

my $computer_program = AnyEvent->condvar;

# simple self-restart setup 
my ( $CHECK_EVERY, $MURDER_AFTER) = ( 2.5*60, 150 );
my (
    $WRAP_IT_UP,   # flag, if true don't pick up any new jobs. (Timer for exit).
    $RUNNING_JOBS, # guards for currently running jobs
    $running_sha,  # string, sha1 of the last commit on this directory
    $restarter,    # guard, runs git log.
);

my $timer = AnyEvent->timer(
    after => 1,
    interval => $CHECK_EVERY,
    cb => sub {
        # this will kill the previous run by destroying the guard.
        $restarter = run_cmd [qw/ git log -1 --format=%H /, $NEST],
            '>' => sub {

            return unless @_; # If it doesn't run, just keep on keep'on.

            my $output = shift;
            chomp $output;

            if (not defined $running_sha) {
                $0 = join ' ', $MY_NAME, $running_sha=$output;
                AE::log trace=> "restarter: first run, starting with %s at %s", $NEST, $output;
                return;
            }

            my $we_good = (defined $running_sha and defined $output and $running_sha eq $output);
            AE::log trace=> "restarter[%s cmp %s]: %s",$running_sha, $output,
                                            $we_good ? 'seems fine' : 'not taking more jobs'
                                            ;
            # life is fine, keep going.
            return if $we_good;

            # nothing going on, so just bail out.
            if (not defined $RUNNING_JOBS or not @$RUNNING_JOBS) {
                AE::log info=> "restarter[%s cmp %s]: New software while idle. Bailing out to upgrade.",$running_sha, $output,
                return $computer_program->send
            }

            my $murder_in=$MURDER_AFTER;
            AE::log trace=> "restarter[%s cmp %s]: New software while Busy. Murder in %ss",$running_sha, $output, $murder_in;

            # busy. murder the jobs after $MURDER_AFTER seconds.
            $WRAP_IT_UP = AnyEvent->timer(after => $murder_in, cb => sub {
                AE::log info=> "restarter: %ss timer has passed, murdering %s jobs", $murder_in, 0+@{ $RUNNING_JOBS || [] };
                undef $_ for @{ $RUNNING_JOBS || [] };
                return $computer_program->send;
            });
        }
    }
);

# start the loop:
$computer_program->recv;
__END__

Exercises for the reader

When first working on a script like this, one can easily get in trouble: the need to commit a change in order to test the restart behaviour makes for an exciting life. Fortunately the script will also notice if you use git commit --amend to change the sha of the current commit.

This won't be your usual work flow, but it makes testing much more fun.

Obviously you can only amend a commit if you haven't pushed it yet.
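
To bump the SHA without changing anything at all, amend in place and keep the message; the committer timestamp changes, so the commit name does too:

me@compy386:/usr/local/yourthing git commit --amend --no-edit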

Have the script do some stuff

Use AnyEvent::HTTP to poll some web service, and check the response for some value.

Use AE::log info to log when the value is present, so you can see your code working.

If you add the right callbacks to your request you should be able to see the event being cancelled when you commit a change to the script.
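
A minimal sketch of that exercise; the url and the match are stand-ins for whatever service you actually care about:

use AnyEvent;
use AnyEvent::HTTP;

my $poller = AnyEvent->timer(after => 1, interval => 60, cb => sub {
    http_get 'https://example.com/status', sub {
        my ($data, $headers) = @_;
        AE::log info => "service says ok"
            if defined $data and $data =~ /\bok\b/;
    };
});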

Have your script check for changes on your remote

Since we're already polling git for changes, we could go one step further and poll for changes to the remote too.

This is relatively straight forward if you have a passwordless http remote like the ones github provide.

Here you'd want to use git remote update and git rev-parse origin/HEAD to check whether your working copy and/or script are at the latest SHA for your repo.
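
On the command line the check looks something like this; if the two SHAs differ, there's something to pull:

me@compy386:/usr/local/yourthing git remote update
me@compy386:/usr/local/yourthing git rev-parse HEAD origin/HEAD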

Spawn an ssh-agent for private repos

If your git repo is secret, like on gitlab or paid github accounts, you'll need to use a "deploy key" to fetch changes from the repo.

You can use run_cmd to start up another process running ssh-agent; once you know the socket path it chooses, you can pass that on in the environment (SSH_AUTH_SOCK, or via a GIT_SSH wrapper) for any commands that interact with your remotes.
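
A sketch of the shape of it; the socket path is made up, and you'd still need an ssh-add step to load the deploy key into the agent:

use AnyEvent::Util qw/ run_cmd /;

# -a picks the socket path, -D keeps the agent in the foreground;
# hold on to $agent the same way the restarter holds its run_cmd
my $sock  = "/tmp/deploy-agent.$$";
my $agent = run_cmd [ 'ssh-agent', '-D', '-a', $sock ];

# anything that talks to the remote gets pointed at the agent:
$ENV{SSH_AUTH_SOCK} = $sock;
run_cmd [qw/ git remote update /];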

Wrap up

There are some other obvious options for getting your changes out:

  • Packaging your script, sticking it in a package mirror, and restarting with hooks
  • Have your config management drop the new copy and do the init/rc.d/systemd dance
  • put your script in a base container, and rebuild your farm with every change

Each has its tradeoffs, and I'm a simple fellow. I like to push a change and watch graphs to see my changes kick in. After all, if your change doesn't have an impact on the graphs, was it really a change?

Being able to push changes to a git repo and have them go live automatically takes a lot of the sting out of having to deploy daemons to large groups of production boxes. The more pessimistic reader will ask "But if I push a broken version, won't the script stop doing the git-fetch?"

Yep. That'll happen. You'd likely want one script that just does the git pull and nothing else, while the rest of your scripts simply watch for the change and restart when it lands.

🍆


Thanks for coming.

Templating Without a Template.


You can generate html from perl with the things CGI.pm exports, but chances are that you might hire a designer and they'll hate you for it.

There are oodles of template engines on cpan; my personal favourite is HTML::Mason... Mason is more the context of this post than the purpose, though.

At some point in your life, you'll find yourself in mason land with a list of things, and the need to stick that in an array for your template:

sub get_records {
    my @records = ORM->search_for_related_things();
    $_->{link_tag} = sprintf '<a href="view_record.html?id=%s">%s</a>',
       $_->id, $_->title 
        for @records;
    return @records;
}
and then you can readily use that html in your template:

<nav><ul>
% for my $record ( $controller->get_records ) { 
    <li><% $record->{link_tag} %>
% }
</ul></nav>

Except you've just messed up escaping on both sets of interpolations above. The database shouldn't have entity encoded strings in it, nor should it have HTML tags of any sort, so these should be escaped.

There's also an injection for any id that contains a double quote character: users can break out of the url and inject javascript or even whole tags as they like.

What if we don't put HTML in the array?

HTML::Element is part of the HTML::Tree distribution, and is used, surprisingly, for modelling elements in an HTML Document. The handy part is that it knows how to escape magic values in attributes and the like:

me@compy386:~ perl -MHTML::Element  -E '
    my $a=HTML::Element->new(a=> href=> "view_record.html?id=14");
    $a->push_content( "This & that");
    say $a->as_HTML
    '

<a href="view_record.html?id=14">This &amp; that</a>

We just need to make url construction in there safe too...

me@compy386:~ perl -MURI -MHTML::Element -E '
    (my $b= URI->new("view_record.html"))->query_form(id=>14);
    my $a=HTML::Element->new(a=> href=> $b );
    $a->push_content( "This & that");
    say $a->as_HTML
    '

<a href="view_record.html?id=14">This &amp; that</a>

Seems ok: more objects representing our markup and less string concat'ing means we're less likely to get escaping wrong. Let's try it on:

me@compy386:~ perl -MURI -MHTML::Element -E '
    (my $b= URI->new("view_record.html"))->query_form(id=>q/5" onload="alert(1)" "/);
    my $a=HTML::Element->new(a=> href=> $b );
    $a->push_content( "This & that");
    say $a->as_HTML
    '

We can see that URI did the helpful thing and escaped everything, so the href stayed inside the html tag:

<a href="view_record.html?id=5%22+onload%3D%22alert(1)%22+%22">This &amp; that</a>

That's kinda annoying though

In order to decide how annoying creating a URI object and passing it to an HTML::Element object really is, it seems only fair to do it right in the other version too...

You can't really do it right because you don't know the context that the {link_tag} will be used in, so we can just assume that the call site will correctly escape it, throwing out half of the bath water, and most of the baby:

sub get_records {
    my @records = ORM->search_for_related_things();
    $_->{link_tag} = sprintf '<a href="view_record.html?id=%s">%s</a>',
       encode_entities(url_escape($_->id)), encode_entities($_->title)
        for @records;
    return @records;
}

That's also the simplest case

Even though a link element is fairly straight forward, we can still see that it turns into a whole bundle of code if you do it by hand. If you're building anything more complicated than a link to some other place and a heading, you'll quickly find that you're trying to escape params from the request, data from your database, external APIs and from all kinds of trust levels. You'll be doing it in all sorts of different contexts in your document. Do you remember the escaping rules for javascript strings in a JSON response? How are they different from the rules in an inline <script> tag? How do css expressions work again? Are they different in an html attribute? Life is tough.

Having an object model is handy

Having an object model that represents your data allows you to store much more information than simply passing strings about, and that will in turn give you a better idea of how to correctly use your data and how to avoid security issues caused by mixing contexts and allowing user input to cross trust boundaries.

And best of all, you don't have to do all the escaping by hand.

Run-length encoding


I did a hacker-rank thing about run-length encoding; after hitting submit I realised there was more golfing to be done:

me@compy-386:~ echo "aaaabcdeeee" | perl -plE '
s/^(.)(\1*)/$o.= $1.($2&&length" $2");""/e while $_;
$_=$o
'
a4bcde4

It's nothing too magical:

  • -print each line after -Evaling the expression for each line of the file, with automatic handling of -line endings
  • for each iteration, we match the first character in $_, and then any more of them (\1*)
    • with /e, s/// will evaluate the replacement as an expression instead of treating it as a plain old string replacement
    • the expression appends the first match $1 (the first letter) and the run length to $o; the sneaky leading space in length" $2" adds one to the length of $2, so $1 itself gets counted
    • an empty $2 means the length isn't added, because the challenge dictated that a single character is left alone ('a' instead of 'a1')
    • the "" is there so the matched text is replaced with nothing, moving us closer to $_ being empty
  • the while loop continues until $_ is empty
  • once $_ is empty and all the text is processed, $o is assigned to $_ so it's printed.

Todo:

  • remove $o, by using print, or with fancy use of /g.
  • remove the while, I'm sure it can be done.
  • remove "" from the replace.

Standard perl-golf disclaimer

Please don't do this kind of thing in a production code base
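
For contrast, a boring version of the same encoding that future-you can actually read:

sub rle {
    my $in  = shift;
    my $out = '';
    while ($in =~ /\G(.)(\1*)/gs) {
        my $run = 1 + length $2;          # the matched char plus the rest of the run
        $out .= $1 . ($run > 1 ? $run : '');
    }
    return $out;
}
print rle("aaaabcdeeee"), "\n";           # a4bcde4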

Some notes on git-config



I'm sure everyone has a .gitconfig with some handy aliases like:

  alias.ff=pull --ff-only
  alias.rb=pull --rebase

If you open up your .gitconfig it'll look something like:

[alias]
    ; too lazy to type these all in full
    root = !pwd
    ff   = pull --ff-only
    rb   = pull --rebase
    stat = status

It looks like an ini file, really.

The cool thing about .ini is that everyone has their own freak-show extensions to the simple ini format, which is really not much more complex than what's above, but has been extended in different directions with each implementation.

git-config obviously has its own rules about what's allowable, and how things are stored.

You're not allowed underscores:

me@compy386:~ $ git config -f ./example --add foo.bar_baz 1
error: invalid key: foo.bar_baz

me@compy386:~ $ git config -f ./example --add foo_bar.bar 1
error: invalid key: foo_bar.bar

So, unless your language lets you have - in method names, or you like camelCase, you're going to have to mangle the names after reading your config.

Your settings need to be in a section

me@compy386:~ $ git config -f example --add bar 1
error: key does not contain a section: bar

Doing this makes the config file much easier to deal with, and leaves you without the quagmire of nonsense dealing with "keys with no section go into the _ section"

Sections can have sub-sections

If you want to have configs for multiple named things of the same type:

me@compy386:~ $ git config -f ./example --add foo.thething.bar-baz 1

me@compy386:~ $ cat ./example 
[foo "thething"]
    bar-baz = 1

me@compy386:~ $ git config --list -f ./example 
foo.thething.bar-baz=1

Yep, you can have 2 levels of keys, and you end up with [first "second"] in your config. Neat!

This is used for branches and remotes among other things:

.git/config
[branch "master"]
    remote = origin
    merge = refs/heads/master

Sections can have the same name as sub-sections

me@compy386:~ $ git config -f example --add foo.bar.baz 1

me@compy386:~ $ cat example 
[foo]
    bar = 1
[foo "bar"]
    baz = 1

me@compy386:~ $ git config -l -f example 
foo.bar=1
foo.bar.baz=1

If you're parsing this directly into a data structure you can end up with some fairly upsetting situations, like foo.bar becoming a hashmap when you don't expect it.
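
A sketch of the trap, with an invented naive parser; the second pass dies because foo.bar is already the plain string "1" when foo.bar.baz shows up:

use strict; use warnings;

my %config;
for my $line (`git config -l -f example`) {
    chomp $line;
    my ($key, $value) = split /=/, $line, 2;
    my @parts = split /\./, $key;
    my $leaf  = pop @parts;
    my $node  = \%config;
    # autovivify intermediate hashes... until one of them is already a string
    $node = $node->{$_} ||= {} for @parts;
    $node->{$leaf} = $value;   # boom: Can't use string ("1") as a HASH ref
}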

git-config - you might as well use it.

If you're building a tool that depends on git for a large portion of its job, you might as well use git-config too. It's a format that your users are likely already familiar with, and fits neatly into the ecosystem.