Friday, 20 November 2015

Scraping the web with perl6


So you want to use Perl 6 to gather data from web pages? Scraping the web is not that easy, and plenty of factors can make your life a nightmare. Heavy use of JavaScript/AJAX can make it impossible to get information from some websites (think of a page that adds content as you scroll). Even when you think you are only dealing with plain old HTML, there are still some considerations to keep in mind. So, depending on your needs, Perl 6 may not be the answer for you.

JSON


There are lots of modules to parse JSON, so if the website you are trying to gather data from provides a JSON API, use it. You can still read this article to learn how to use an HTTP user agent, but its focus is gathering data from HTML.

XHTML vs HTML5


A word of warning first. In this article I will use a library that parses HTML5 and gives us an XML tree. An important difference between XHTML and HTML5 parsing is how errors are handled.
Example:
<p><a href="hello.html"/>Hello <i>world</i></a></p>

An HTML5 parser ignores the / on a non-void element such as <a>, so the link stays open until the </a> tag and wraps the text. An XHTML (XML) parser treats <a href="hello.html"/> as an empty element that is closed immediately, and then runs into a stray </a>, which is a well-formedness error. The two do not produce the same DOM in the end.
If you want to parse something like the HTML inside an EPUB, you will need an XHTML parser. If you want "it looks like it does in my browser", use HTML5 parsing.

Perl 6 small issue


Let me get this out of the way quickly: there is one small issue that can be annoying when writing a scraper in Perl 6. At the time of writing, Perl 6 implementations are still quite slow (compared to other languages, that is). It's not that bad, but having to wait 5-10 seconds each time you rerun your script to see whether you fixed your mistake becomes annoying in the long run, especially when extracting data from HTML.

Modules and tools


Perl 6 doesn't come with a user agent or an HTML parser out of the box, so we will be using these modules:

HTTP::UserAgent - our HTTP(S) client.
XML - provides a way to handle an XML tree and search it.
Gumbo - a binding to a C library that parses HTML5.

If you have already searched modules.perl6.org, you probably stumbled across HTML::Parser::XML and Web::Scraper. H:P:X is written more as an XHTML parser and it is very slow (it takes 30 seconds to parse the web pages I use in this article). Web::Scraper uses H:P:X internally.

You will need to install the Gumbo C library to use the Gumbo Perl 6 binding.

An additional tool is a web browser with developer tools, such as Google Chrome. It will be useful for finding the tags and classes we will be looking for.

I love MLP fanfiction


Yes, I love ponies and the people who write fan fiction about them. That is what led me to write some Perl 6 and contribute a bit to some modules. How is that related? Well, the website https://www.fimfiction.net hosts said fan fiction and provides a way to read stories and add them to different bookshelves. Stories are tagged by character, content, rating, etc. Since I read quite a lot of stories, I wanted to compute stats such as: what is the percentage of comedy in my main bookshelf, or how many stories involve a given character?

That is why, as an example, we will scrape story information from the page that displays a bookshelf.

I will use https://www.fimfiction.net/bookshelf/751448/, the bookshelf I use with an IRC bot to do various things.

First step: getting the number of pages


Before writing code, we need to look at the page source. I am assuming you know how to use your browser to display the source code and to inspect elements. Write down the tags and classes that look interesting; you can put them as comments in your future script to have the right spelling in front of you.

You should notice that this bookshelf has 2 pages of stories. That is information we need if we want to extract all the stories. You should also notice that the link to page 2 is: https://www.fimfiction.net/bookshelf/751448/favourites?order=date_added&page=2
The order=date_added part is interesting, because we probably want to use this form of the URL instead of just the bookshelf URL followed by the bookshelf id, in case the default display order changes. The favourites part of the URL is the name of the bookshelf. Luckily, fimfiction also supports https://www.fimfiction.net/bookshelf/751448/?order=date_added&page=2, so we don't need the name and don't have to figure out how it is mangled (when it contains spaces, for instance).
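The URL scheme described above can be sketched as a small helper. This is only an illustration: the name bookshelf-page-url is made up for this example, and it assumes the ?order=date_added&page=N form keeps working as described.

```raku
use v6;

# Hypothetical helper (not part of the script in this article): builds the
# URL of one page of a bookshelf, always forcing the date_added order.
sub bookshelf-page-url($bookshelf-id, Int $page) {
    my $base = 'http://www.fimfiction.net/bookshelf/';
    return $base ~ $bookshelf-id ~ '/?order=date_added&page=' ~ $page;
}

# One URL per page, for a bookshelf with 2 pages
.say for (1..2).map({ bookshelf-page-url(751448, $_) });
```

Once the number of pages is known, the same helper can be called in a loop to fetch every page of the bookshelf.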

Back to extracting the number of pages: we simply inspect the element that lets you navigate between pages. It is a div with the class page_list, which contains a list (ul/li tags) of links.
The text of the second-to-last element gives you the number of pages in this case, the last element being an arrow.
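The "second-to-last element" rule can be tried on plain strings before touching any HTML. The sketch below assumes the text of each <li> has already been collected into an array; the page-count name and the '>' placeholder for the arrow are made up for this example.

```raku
use v6;

# Hypothetical helper: @labels holds the text of each <li> of the page list,
# in document order, the last one being the "next page" arrow.
sub page-count(@labels) {
    return 1 if @labels.elems <= 1;   # no usable page list: a single page
    return @labels[*-2].Int;          # second-to-last entry
}

say page-count(['1', '2', '>']);   # 2
say page-count([]);                # 1
```

The [*-2] subscript counts from the end of the array, which is equivalent to the [@pages_li.elems-2] form.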


Time to write some code



use v6;

use Gumbo;
use XML;
use HTTP::UserAgent;

# Base URLs; note that I don't use https
my $bbaseurl = "http://www.fimfiction.net/bookshelf/";
my $fimbaseurl = "http://www.fimfiction.net/";

my $bookshelfid = @*ARGS[0];

my $url = $bbaseurl~$bookshelfid;
my $ua = HTTP::UserAgent.new;
# Fimfiction hides mature content (violent stories/sex stories) by default.
$ua.cookies.set-cookie('Set-Cookie:view_mature=true; ');

# HTTP::UserAgent gives us a response object
my $rep = $ua.get($url);

if ! $rep.is-success {
    die "Can't contact $url: " ~ $rep.status-line;
}

# First, we are only interested in the number of pages

# We could have just called parse-html($rep.content) and searched the resulting XML tree,
# but the parse-html provided by Gumbo offers some basic filtering that speeds up the parsing.
# :SINGLE makes it stop at the first element that matches div class="page_list"
# :nowhitespace tells it not to add the whitespace found outside elements (like indentation tabs)
my $xmldoc = parse-html($rep.content, :TAG<div>, :class<page_list>, :SINGLE, :nowhitespace);

# Note: $xmldoc contains the html tag as root, not the <div>
# We don't care about the <ul> or the extra content of this div, so let's get all the <li> tags

my @pages_li = $xmldoc.lookfor(:TAG<li>);

my $number_of_page = 1;

# If we have more than one <li>
if @pages_li.elems > 1 {
    # Get the text of the second-to-last element
    $number_of_page = @pages_li[@pages_li.elems-2][0][0].text;
}

say "Bookshelf n°$bookshelfid has $number_of_page page(s)";

This gives the following output (ignore the -I perl6-gumbo/lib/ option here; it only adds a local library path):


root@testperl6:~/piko# perl6 -I perl6-gumbo/lib/ article1.p6 751448
Bookshelf n°751448 has 2 page(s)
root@testperl6:~/piko# perl6 -I perl6-gumbo/lib/ article1.p6 149291
Bookshelf n°149291 has 60 page(s)
root@testperl6:~/piko# 

You are probably confused by @pages_li[@pages_li.elems-2][0][0].text.
The XML tree for <li><a href="foo">blabla</a></li> looks like this:
XML::Element named 'li'
--XML::Element named 'a' with %attribs<href> = "foo"
----XML::Text containing the text 'blabla'

That is why we use @pages_li[] to get the right <li> element, then [0] on it to get its first child (<a>), then [0] again to get that element's first child, which is an XML::Text holding our text.

Be careful: using [] (or .nodes[]) on an XML::Element gives you its children, which are of type XML::Node, so they can be XML::Text nodes containing useless whitespace as well as the elements you are looking for. Use the elements() method if you only want child elements.
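To see the difference between nodes and elements without fetching anything, here is a minimal sketch that only uses the XML module on a hand-written fragment. It assumes the module's from-xml function is available; the fragment, with spaces around the link, is made up for this example.

```raku
use v6;
use XML;

# A well-formed fragment with some whitespace around the link
my $doc = from-xml('<li> <a href="foo">blabla</a> </li>');
my $li  = $doc.root;

# .nodes lists every child node, whatever its type
say .^name for $li.nodes;

# .elements only looks at child elements; :SINGLE returns the element itself
my $a = $li.elements(:TAG<a>, :SINGLE);
say $a<href>;            # the href attribute
say $a.nodes[0].text;    # the text inside the link
```

Going through elements() avoids having to guess at which index the <a> ends up when text nodes sit between the tags.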


Getting some info about the stories


Fimfiction provides a small JSON API to get story info (api/story.php), but we will not use it, and it does not give you everything anyway. Here we will focus on gathering the following for each story:

-The title
-The author name
-The tags (Comedy/Adventure...)
-The character tags.

Let's get back to our web browser. We need neither the header nor the footer of the page. You can narrow the wanted content down to the inner div, or just gather all the story_content_box elements.

Title and author name are quite easy:


    # :SINGLE makes lookfor return a single XML::Element instead of an Array of them
    %story<title> = $story_div.lookfor(:TAG<a>, :class<story_name>, :SINGLE)[0].text;
    # The author name is the text of a link, with fancy stuff around the <a> tag, inside an author div
    %story<author> = $story_div.lookfor(:TAG<div>, :class<author>, :SINGLE).lookfor(:TAG<a>, :SINGLE)[0].text;

The tags are trickier. They are links nested in a big description div, and their class varies. Luckily, the elements method (lookfor is a shortcut for elements(:RECURSE)) can take a regex:


    my $description_div = $story_div.lookfor(:TAG<div>, :class<description>, :SINGLE);
    my @tags = $description_div.lookfor(:TAG<a>, :class(/^story_category/));
    for @tags -> $atag {
      %story<tags>.push($atag[0].text);
    }
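To check what the /^story_category/ regex accepts, it can be tried on plain strings first. The class values below are made up for this example; only the regex comes from the code above.

```raku
use v6;

# Hypothetical class values, as they could appear on the <a> tags
my @classes = 'story_category', 'story_category_comedy', 'character_icon';

# Keep only the values that start with "story_category"
my @matching = @classes.grep(/^story_category/);

say @matching;   # [story_category story_category_comedy]
```

The ^ anchors the match at the start of the string, so a class that merely contains story_category somewhere in the middle would not be kept.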

Character tags are easy to get:


    my $extradiv = $story_div.lookfor(:TAG<div>, :class<extra_story_data>, :SINGLE);
    my @charactera = $extradiv.lookfor(:TAG<a>, :class<character_icon>);
    for @charactera -> $aelem {
      # Accessing one attribute
      %story<character_tags>.push($aelem<title>);
    }
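Once every story has been turned into a hash like %story, the stats mentioned at the beginning are a few lines away. The sketch below runs on made-up sample data; the function names tag-percentage and character-counts are hypothetical, and only the title, tags and character_tags keys come from the code above.

```raku
use v6;

# Hypothetical sample data, shaped like the %story hash filled in above
my @stories =
    %( title => 'Story A', tags => ['Comedy', 'Slice of Life'], character_tags => ['Twilight Sparkle'] ),
    %( title => 'Story B', tags => ['Adventure'],               character_tags => ['Rarity', 'Twilight Sparkle'] ),
    %( title => 'Story C', tags => ['Comedy'],                  character_tags => ['Rarity'] ),
    %( title => 'Story D', tags => ['Dark'],                    character_tags => ['Twilight Sparkle'] );

# Percentage of stories carrying a given tag
sub tag-percentage(@stories, Str $tag) {
    my $hits = @stories.grep({ .<tags>.any eq $tag }).elems;
    return 100 * $hits / @stories.elems;
}

# Number of stories each character appears in
sub character-counts(@stories) {
    my %count;
    for @stories -> %story {
        %count{$_}++ for %story<character_tags>.list;
    }
    return %count;
}

say 'Comedy: ' ~ tag-percentage(@stories, 'Comedy') ~ '%';   # Comedy: 50%

my %count = character-counts(@stories);
say $_ ~ ': ' ~ %count{$_} for %count.keys.sort;
```

With real data, @stories would be filled by pushing one %story per story_content_box found on each page of the bookshelf.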



Final file 

