<!DOCTYPE html>
<html lang="en">
<head>
<meta name="generator" content="HTML Tidy for HTML5 for Linux version 5.2.0">
<title>Web Scraping</title>
<meta charset="utf-8">
<meta name="description" content="A collection of examples of using Common Lisp">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="icon" href="assets/cl-logo-blue.png"/>
<link rel="stylesheet" href="assets/style.css">
<script type="text/javascript" src="assets/highlight-lisp.js">
</script>
<script type="text/javascript" src="assets/jquery-3.2.1.min.js">
</script>
<script type="text/javascript" src="assets/jquery.toc/jquery.toc.min.js">
</script>
<script type="text/javascript" src="assets/toggle-toc.js">
</script>

<link rel="stylesheet" href="assets/github.css">

<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous">
</head>
<body>
<h1 id="title-xs"><a href="index.html">The Common Lisp Cookbook</a> – Web Scraping</h1>
<div id="logo-container">
<a href="index.html">
<img id="logo" src="assets/cl-logo-blue.png"/>
</a>

<div id="searchform-container">
<form onsubmit="duckSearch()" action="javascript:void(0)">
<input id="searchField" type="text" value="" placeholder="Search...">
</form>
</div>

<div id="toc-container" class="toc-close">
<div id="toc-title">Table of Contents</div>
<ul id="toc" class="list-unstyled"></ul>
</div>
</div>

<div id="content-container">
<h1 id="title-non-xs"><a href="index.html">The Common Lisp Cookbook</a> – Web Scraping</h1>

<!-- Announcement we can keep for 1 month or more. I remove it and re-add it from time to time. -->
<p class="announce">
📹 <a href="https://www.udemy.com/course/common-lisp-programming/?couponCode=6926D599AA-LISP4ALL">NEW! Learn Lisp in videos and support our contributors with this 40% discount.</a>
</p>
<p class="announce-neutral">
📕 <a href="index.html#download-in-epub">Get the EPUB and PDF</a>
</p>

<div id="content">
<p>The set of tools for web scraping in Common Lisp is pretty complete
and pleasant. In this short tutorial we’ll see how to make HTTP
requests, parse HTML, extract content and make asynchronous requests.</p>

<p>Our simple task will be to extract the list of links on the CL
Cookbook’s index page and check if they are reachable.</p>

<p>We’ll use the following libraries:</p>

<ul>
<li><a href="https://github.com/fukamachi/dexador">Dexador</a> - an HTTP client
(that aims to replace the venerable Drakma),</li>
<li><a href="https://shinmera.github.io/plump/">Plump</a> - a markup parser that works on malformed HTML,</li>
<li><a href="https://shinmera.github.io/lquery/">Lquery</a> - a DOM manipulation
library, to extract content from our Plump result,</li>
<li><a href="https://lparallel.org/pmap-family/">lparallel</a> - a library for parallel programming (read more in the <a href="process.html">process section</a>).</li>
</ul>

<p>Before starting, let’s install those libraries with Quicklisp:</p>

<pre><code class="language-lisp">(ql:quickload '("dexador" "plump" "lquery" "lparallel"))
</code></pre>

<h2 id="http-requests">HTTP Requests</h2>

<p>Easy things first. Install Dexador. Then we use the <code>get</code> function:</p>

<pre><code class="language-lisp">(defvar *url* "https://lispcookbook.github.io/cl-cookbook/")
(defvar *request* (dex:get *url*))
</code></pre>

<p>This returns multiple values: the whole page content, the return code
(200), the response headers, the URI and the stream.</p>

<pre><code>"&lt;!DOCTYPE html&gt;
&lt;html lang=\"en\"&gt;
&lt;head&gt;
&lt;title&gt;Home &amp;ndash; the Common Lisp Cookbook&lt;/title&gt;
[…]
"
200
#&lt;HASH-TABLE :TEST EQUAL :COUNT 19 {1008BF3043}&gt;
#&lt;QURI.URI.HTTP:URI-HTTPS https://lispcookbook.github.io/cl-cookbook/&gt;
#&lt;CL+SSL::SSL-STREAM for #&lt;FD-STREAM for "socket 192.168.0.23:34897, peer: 151.101.120.133:443" {100781C133}&gt;&gt;
</code></pre>

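<p>Since these are multiple values (and not a list), we can destructure
them with the standard <code>multiple-value-bind</code>. For instance (a
quick sketch reusing the <code>*url*</code> defined above):</p>

<pre><code class="language-lisp">;; Bind the first two return values of dex:get and ignore the rest.
(multiple-value-bind (body status)
    (dex:get *url*)
  (format t "Got ~a characters with status ~a~%" (length body) status))
</code></pre>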
<p>Remember, in Slime we can inspect the objects with a right-click on
them.</p>

<h2 id="parsing-and-extracting-content-with-css-selectors">Parsing and extracting content with CSS selectors</h2>

<p>We’ll use <code>lquery</code> to parse the HTML and extract the
content.</p>

<ul>
<li><a href="https://shinmera.github.io/lquery/">https://shinmera.github.io/lquery/</a></li>
</ul>

<p>We first need to parse the HTML into an internal data structure. Use
<code>(lquery:$ (initialize &lt;html&gt;))</code>:</p>

<pre><code class="language-lisp">(defvar *parsed-content* (lquery:$ (initialize *request*)))
;; => #&lt;PLUMP-DOM:ROOT {1009EE5FE3}&gt;
</code></pre>

<p>lquery uses <a href="https://shinmera.github.io/plump/">Plump</a> internally.</p>

<p>Now we’ll extract the links with CSS selectors.</p>

<p><strong>Note</strong>: to find out what the CSS selector of the element
I’m interested in should be, I right-click on an element in the browser and
choose “Inspect element”. This opens up the inspector of my browser’s
web dev tools and I can study the page structure.</p>

<p>So the links I want to extract are inside an element with an <code>id</code> of
“content”, and they are in regular list elements (<code>li</code>).</p>

<p>Let’s try something:</p>

<pre><code class="language-lisp">(lquery:$ *parsed-content* "#content li")
;; => #(#&lt;PLUMP-DOM:ELEMENT li {100B3263A3}&gt; #&lt;PLUMP-DOM:ELEMENT li {100B3263E3}&gt;
;;      #&lt;PLUMP-DOM:ELEMENT li {100B326423}&gt; #&lt;PLUMP-DOM:ELEMENT li {100B326463}&gt;
;;      #&lt;PLUMP-DOM:ELEMENT li {100B3264A3}&gt; #&lt;PLUMP-DOM:ELEMENT li {100B3264E3}&gt;
;;      #&lt;PLUMP-DOM:ELEMENT li {100B326523}&gt; #&lt;PLUMP-DOM:ELEMENT li {100B326563}&gt;
;;      #&lt;PLUMP-DOM:ELEMENT li {100B3265A3}&gt; #&lt;PLUMP-DOM:ELEMENT li {100B3265E3}&gt;
;;      #&lt;PLUMP-DOM:ELEMENT li {100B326623}&gt; #&lt;PLUMP-DOM:ELEMENT li {100B326663}&gt;
;;      […]
</code></pre>

<p>Wow, it works! We get a vector of Plump elements.</p>

<p>I’d like to easily check what those elements are. To see the entire
HTML, we can end our lquery line with <code>(serialize)</code>:</p>

<pre><code class="language-lisp">(lquery:$ *parsed-content* "#content li" (serialize))
#("&lt;li&gt;&lt;a href=\"license.html\"&gt;License&lt;/a&gt;&lt;/li&gt;"
  "&lt;li&gt;&lt;a href=\"getting-started.html\"&gt;Getting started&lt;/a&gt;&lt;/li&gt;"
  "&lt;li&gt;&lt;a href=\"editor-support.html\"&gt;Editor support&lt;/a&gt;&lt;/li&gt;"
  […]
</code></pre>

<p>And to see their <em>textual</em> content (the user-visible text inside the
HTML), we can use <code>(text)</code> instead:</p>

<pre><code class="language-lisp">(lquery:$ *parsed-content* "#content li" (text))
#("License" "Editor support" "Strings" "Dates and Times" "Hash Tables"
  "Pattern Matching / Regular Expressions" "Functions" "Loop" "Input/Output"
  "Files and Directories" "Packages" "Macros and Backquote"
  "CLOS (the Common Lisp Object System)" "Sockets" "Interfacing with your OS"
  "Foreign Function Interfaces" "Threads" "Defining Systems"
  […]
  "Pascal Costanza’s Highly Opinionated Guide to Lisp"
  "Loving Lisp - the Savy Programmer’s Secret Weapon by Mark Watson"
  "FranzInc, a company selling Common Lisp and Graph Database solutions.")
</code></pre>

<p>All right, so we see we are manipulating what we want. Now, to get their
<code>href</code>, a quick look at lquery’s doc tells us to use <code>(attr
"some-name")</code>:</p>

<pre><code class="language-lisp">(lquery:$ *parsed-content* "#content li a" (attr :href))
;; => #("license.html" "editor-support.html" "strings.html" "dates_and_times.html"
;;      "hashes.html" "pattern_matching.html" "functions.html" "loop.html" "io.html"
;;      "files.html" "packages.html" "macros.html"
;;      "/cl-cookbook/clos-tutorial/index.html" "os.html" "ffi.html"
;;      "process.html" "systems.html" "win32.html" "testing.html" "misc.html"
;;      […]
;;      "http://www.nicklevine.org/declarative/lectures/"
;;      "http://www.p-cos.net/lisp/guide.html" "https://leanpub.com/lovinglisp/"
;;      "https://franz.com/")
</code></pre>

<p><em>Note</em>: using <code>(serialize)</code> after <code>attr</code> leads to an error.</p>

<p>Nice, we now have the list (well, a vector) of links of the
page. We’ll now write an async program to check that they are
reachable.</p>

<p>External resources:</p>

<ul>
<li><a href="https://developer.mozilla.org/en-US/docs/Glossary/CSS_Selector">CSS selectors</a></li>
</ul>

<h2 id="async-requests">Async requests</h2>

<p>In this example we’ll take the list of URLs from above and check
whether they are reachable. We want to do this asynchronously, but to see
the benefits we’ll first do it synchronously!</p>

<p>We need a bit of filtering first to exclude the email addresses (maybe
that was doable in the CSS selector?).</p>

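<p>As an aside: CLSS, the selector engine behind lquery, supports CSS
attribute selectors, so the filtering below could probably be done in the
selector itself. Something like this (an untested sketch; check CLSS’s
documentation for the exact syntax it accepts):</p>

<pre><code class="language-lisp">;; [href^=http] should match only links whose href begins with "http",
;; which would skip the "mailto:" links in one go.
(lquery:$ *parsed-content* "#content li a[href^=http]" (attr :href))
</code></pre>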
<p>We put the vector of URLs in a variable:</p>

<pre><code class="language-lisp">(defvar *urls* (lquery:$ *parsed-content* "#content li a" (attr :href)))
</code></pre>

<p>We remove the elements that start with “mailto:” (a quick look at the
<a href="strings.html">strings</a> page will help):</p>

<pre><code class="language-lisp">(remove-if (lambda (it) (string= it "mailto:" :start1 0 :end1 (length "mailto:"))) *urls*)
;; => #("license.html" "editor-support.html" "strings.html" "dates_and_times.html"
;;      […]
;;      "process.html" "systems.html" "win32.html" "testing.html" "misc.html"
;;      "license.html" "http://lisp-lang.org/"
;;      "https://github.com/CodyReichert/awesome-cl"
;;      "http://www.lispworks.com/documentation/HyperSpec/Front/index.htm"
;;      […]
;;      "https://franz.com/")
</code></pre>

<p>Actually, before writing the <code>remove-if</code> (which works on any sequence,
including vectors) I tested with a <code>(map 'vector …)</code> to see that the
results were indeed <code>nil</code> or <code>t</code>.</p>

<p>As a side note, there is a handy <code>starts-with</code> function in
<a href="https://github.com/diogoalexandrefranco/cl-strings/">cl-strings</a>,
available in Quicklisp. So we could do:</p>

<pre><code class="language-lisp">(map 'vector (lambda (it) (cl-strings:starts-with it "mailto:")) *urls*)
</code></pre>

<p>It also has an option to ignore or respect case.</p>

<p>While we’re at it, we’ll only consider links starting with “http”, in
order not to write too much stuff irrelevant to web scraping:</p>

<pre><code class="language-lisp">(remove-if-not (lambda (it) (string= it "http" :start1 0 :end1 (length "http"))) *) ;; note the remove-if-NOT
</code></pre>

<p>All right, we put this result in another variable:</p>

<pre><code class="language-lisp">(defvar *filtered-urls* *)
</code></pre>

<p>And now to the real work. For every URL, we want to request it and
check that its return code is 200. We have to ignore certain
errors. Indeed, a request can time out, be redirected (we don’t want
that) or return an error code.</p>

<p>To be in real conditions, we’ll add a link that times out to our list:</p>

<pre><code class="language-lisp">(setf (aref *filtered-urls* 0) "http://lisp.org") ;; too bad indeed
</code></pre>

<p>We’ll take the simple approach of ignoring errors and returning <code>nil</code> in
that case. If all goes well, we return the status code, which should be
200.</p>

<p>As we saw at the beginning, <code>dex:get</code> returns several values, including
the return code. We’ll catch only this one with <code>nth-value</code> (instead
of all of them with <code>multiple-value-bind</code>) and we’ll use
<code>ignore-errors</code>, which returns nil in case of an error. We could also
use <code>handler-case</code> and catch specific error types (see examples in
dexador’s documentation) or (better yet?) use <code>handler-bind</code> to catch
any <code>condition</code>.</p>

<p>(<em>ignore-errors has the caveat that when there’s an error, we cannot
return the element it comes from. We’ll get there nonetheless.</em>)</p>

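<p>For reference, the <code>handler-case</code> variant for a single URL could
look like this (a sketch based on dexador’s documentation, which defines the
<code>dex:http-request-failed</code> condition):</p>

<pre><code class="language-lisp">(handler-case (nth-value 1 (dex:get "http://lisp.org"))
  ;; Signalled by Dexador on 4xx and 5xx responses:
  (dex:http-request-failed (e)
    (dex:response-status e))
  ;; Anything else (timeouts, DNS errors, …): return nil.
  (error () nil))
</code></pre>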
<pre><code class="language-lisp">(map 'vector (lambda (it)
               (ignore-errors
                 (nth-value 1 (dex:get it))))
     *filtered-urls*)
</code></pre>

<p>We get:</p>

<pre><code>#(NIL 200 200 200 200 200 200 200 200 200 200 NIL 200 200 200 200 200 200 200
  200 200 200 200)
</code></pre>

<p>It works, but <em>it took a very long time</em>. How much time exactly?
Let’s check with <code>(time …)</code>:</p>

<pre><code>Evaluation took:
  21.554 seconds of real time
  0.188000 seconds of total run time (0.172000 user, 0.016000 system)
  0.87% CPU
  55,912,081,589 processor cycles
  9,279,664 bytes consed
</code></pre>

<p>21 seconds! Obviously this synchronous method isn’t efficient. We
wait 10 seconds for links that time out. It’s time to write and
measure an async version.</p>

<p>After installing <code>lparallel</code> and looking at
<a href="https://lparallel.org/">its documentation</a>, we see that the parallel
map <a href="https://lparallel.org/pmap-family/">pmap</a> seems to be what we
want. And it’s only a one-word edit. Let’s try:</p>

<pre><code class="language-lisp">(time (lparallel:pmap 'vector
        (lambda (it)
          (ignore-errors
            (let ((status (nth-value 1 (dex:get it))))
              status)))
        *filtered-urls*))
;; Evaluation took:
;;   11.584 seconds of real time
;;   0.156000 seconds of total run time (0.136000 user, 0.020000 system)
;;   1.35% CPU
;;   30,050,475,879 processor cycles
;;   7,241,616 bytes consed
;;
;; #(NIL 200 200 200 200 200 200 200 200 200 200 NIL 200 200 200 200 200 200 200
;;   200 200 200 200)
</code></pre>

<p>Bingo. It still takes more than 10 seconds because we wait 10 seconds
for one request that times out. But otherwise it processes all the HTTP
requests in parallel, and so it is much faster.</p>

<p>Shall we get the URLs that aren’t reachable, remove them from our list,
and measure the execution time in the sync and async cases?</p>

<p>What we do is: instead of returning only the return code, we check that it
is valid and we return the URL:</p>

<pre><code class="language-lisp">... (if (and status (= 200 status)) it) ...
(defvar *valid-urls* *)
</code></pre>

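<p>Put together, the full parallel version might look like this (a sketch
assembling the fragment above with the earlier <code>pmap</code> call):</p>

<pre><code class="language-lisp">(defvar *valid-urls*
  (lparallel:pmap 'vector
                  (lambda (it)
                    (ignore-errors
                      (let ((status (nth-value 1 (dex:get it))))
                        ;; return the URL itself when it is reachable
                        (if (and status (= 200 status)) it))))
                  *filtered-urls*))
</code></pre>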
<p>We get a vector of URLs with a couple of <code>nil</code>s: indeed, I thought I
would have only one unreachable URL but I discovered another
one. Hopefully I have pushed a fix before you try this tutorial.</p>

<p>But which ones are they? We saw the status codes, but not the URLs. We
have a vector with all the URLs and another with the valid ones. We’ll
simply treat them as sets and compute their difference. This will show
us the bad ones. We must transform our vectors to lists for that.</p>

<pre><code class="language-lisp">(set-difference (coerce *filtered-urls* 'list)
                (coerce *valid-urls* 'list))
;; => ("http://lisp-lang.org/" "http://www.psg.com/~dlamkins/sl/cover.html")
</code></pre>

<p>Gotcha!</p>

<p>By the way, it takes 8.280 seconds of real time for me to check the list of
valid URLs synchronously, and 2.857 seconds async.</p>

<p>Have fun doing web scraping in CL!</p>

<p>More helpful libraries:</p>

<ul>
<li>we could use <a href="https://github.com/tsikov/vcr">VCR</a>, a store-and-replay
utility, to set up repeatable tests or to speed up our
experiments in the REPL a bit.</li>
<li><a href="https://github.com/orthecreedence/cl-async">cl-async</a>,
<a href="https://github.com/orthecreedence/carrier">carrier</a> and other
network, parallelism and concurrency libraries to explore on the
<a href="https://github.com/CodyReichert/awesome-cl">awesome-cl</a> list,
<a href="http://www.cliki.net/">Cliki</a> or
<a href="https://quickdocs.org/-/search?q=web">Quickdocs</a>.</li>
</ul>

<p class="page-source">
Page source: <a href="https://github.com/LispCookbook/cl-cookbook/blob/master/web-scraping.md">web-scraping.md</a>
</p>
</div>

<script type="text/javascript">

// Don't write the TOC on the index.
if (window.location.pathname != "/cl-cookbook/") {
$("#toc").toc({
content: "#content", // will ignore the first h1 with the site+page title.
headings: "h1,h2,h3,h4"});
}

$("#two-cols + ul").css({
"column-count": "2",
});
$("#contributors + ul").css({
"column-count": "4",
});
</script>

<div>
<footer class="footer">
<hr/>
© 2002–2021 the Common Lisp Cookbook Project
</footer>

</div>
<div id="toc-btn">T<br>O<br>C</div>
</div>

<script type="text/javascript">
HighlightLisp.highlight_auto({className: null});
</script>

<script type="text/javascript">
function duckSearch() {
  var searchField = document.getElementById("searchField");
  if (searchField && searchField.value) {
    var query = encodeURIComponent("site:lispcookbook.github.io/cl-cookbook/ " + searchField.value);
    window.location.href = "https://duckduckgo.com/?kj=b2&kf=-1&ko=1&q=" + query;
    // https://duckduckgo.com/params
    // kj=b2: blue header in results page
    // kf=-1: no favicons
  }
}
</script>

<script async defer data-domain="lispcookbook.github.io/cl-cookbook" src="https://plausible.io/js/plausible.js"></script>

</body>
</html>