<!DOCTYPE html>
<html lang="en">
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Linux version 5.2.0">
<title>Web Scraping</title>
<meta charset="utf-8">
<meta name="description" content="A collection of examples of using Common Lisp">
<meta name="viewport" content=
"width=device-width, initial-scale=1">
<link rel="stylesheet" href=
"assets/style.css">
<script type="text/javascript" src=
"assets/highlight-lisp.js">
</script>
<script type="text/javascript" src=
"assets/jquery-3.2.1.min.js">
</script>
<script type="text/javascript" src=
"assets/jquery.toc/jquery.toc.min.js">
</script>
<script type="text/javascript" src=
"assets/toggle-toc.js">
</script>
<link rel="stylesheet" href=
"assets/github.css">
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous">
</head>
<body>
<h1 id="title-xs"><a href="index.html">The Common Lisp Cookbook</a> &ndash; Web Scraping</h1>
<div id="logo-container">
<a href="index.html">
<img id="logo" src="assets/cl-logo-blue.png"/>
</a>
<div id="searchform-container">
<form onsubmit="duckSearch()" action="javascript:void(0)">
<input id="searchField" type="text" value="" placeholder="Search...">
</form>
</div>
<div id="toc-container" class="toc-close">
<div id="toc-title">Table of Contents</div>
<ul id="toc" class="list-unstyled"></ul>
</div>
</div>
<div id="content-container">
<h1 id="title-non-xs"><a href="index.html">The Common Lisp Cookbook</a> &ndash; Web Scraping</h1>
<!-- Announcement we can keep for 1 month or more. I remove it and re-add it from time to time. -->
<p class="announce">
📹 <a href="https://www.udemy.com/course/common-lisp-programming/?couponCode=6926D599AA-LISP4ALL">NEW! Learn Lisp in videos and support our contributors with this 40% discount.</a>
</p>
<p class="announce-neutral">
📕 <a href="index.html#download-in-epub">Get the EPUB and PDF</a>
</p>
<div id="content"
<p>The set of tools to do web scraping in Common Lisp is pretty complete
and pleasant. In this short tutorial well see how to make http
requests, parse html, extract content and do asynchronous requests.</p>
<p>Our simple task will be to extract the list of links on the CL
Cookbooks index page and check if they are reachable.</p>
<p>Well use the following libraries:</p>
<ul>
<li><a href="https://github.com/fukamachi/dexador">Dexador</a> - an HTTP client
(that aims at replacing the venerable Drakma),</li>
<li><a href="https://shinmera.github.io/plump/">Plump</a> - a markup parser, that works on malformed HTML,</li>
<li><a href="https://shinmera.github.io/lquery/">Lquery</a> - a DOM manipulation
library, to extract content from our Plump result,</li>
<li><a href="https://lparallel.org/pmap-family/">lparallel</a> - a library for parallel programming (read more in the <a href="process.html">process section</a>).</li>
</ul>
<p>Before starting, let's install those libraries with Quicklisp:</p>
<pre><code class="language-lisp">(ql:quickload '("dexador" "plump" "lquery" "lparallel"))
</code></pre>
<h2 id="http-requests">HTTP Requests</h2>
<p>Easy things first. Install Dexador. Then we use the <code>get</code> function:</p>
<pre><code class="language-lisp">(defvar *url* "https://lispcookbook.github.io/cl-cookbook/")
(defvar *request* (dex:get *url*))
</code></pre>
<p>This returns multiple values: the whole page content, the status code
(200), the response headers, the uri and the stream.</p>
<pre><code>"&lt;!DOCTYPE html&gt;
&lt;html lang=\"en\"&gt;
&lt;head&gt;
&lt;title&gt;Home &amp;ndash; the Common Lisp Cookbook&lt;/title&gt;
[…]
"
200
#&lt;HASH-TABLE :TEST EQUAL :COUNT 19 {1008BF3043}&gt;
#&lt;QURI.URI.HTTP:URI-HTTPS https://lispcookbook.github.io/cl-cookbook/&gt;
#&lt;CL+SSL::SSL-STREAM for #&lt;FD-STREAM for "socket 192.168.0.23:34897, peer: 151.101.120.133:443" {100781C133}&gt;&gt;
</code></pre>
<p>Remember, in Slime we can inspect the objects with a right-click on
them.</p>
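<p>If we want each value in its own variable, here is a minimal sketch with
<code>multiple-value-bind</code> (the variable names are our choice):</p>
<pre><code class="language-lisp">(multiple-value-bind (body status headers uri stream)
    (dex:get *url*)
  (declare (ignorable headers uri stream))
  ;; body is the page content, status the HTTP status code.
  (format t "Status: ~a~%Content length: ~a~%" status (length body)))
</code></pre>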
<h2 id="parsing-and-extracting-content-with-css-selectors">Parsing and extracting content with CSS selectors</h2>
<p>We'll use <code>lquery</code> to parse the html and extract the
content.</p>
<ul>
<li><a href="https://shinmera.github.io/lquery/">https://shinmera.github.io/lquery/</a></li>
</ul>
<p>We first need to parse the html into an internal data structure. Use
<code>(lquery:$ (initialize &lt;html&gt;))</code>:</p>
<pre><code class="language-lisp">(defvar *parsed-content* (lquery:$ (initialize *request*)))
;; =&gt; #&lt;PLUMP-DOM:ROOT {1009EE5FE3}&gt;
</code></pre>
<p>lquery uses <a href="https://shinmera.github.io/plump/">Plump</a> internally.</p>
<p>Now we'll extract the links with CSS selectors.</p>
<p><strong>Note</strong>: to find out what the CSS selector of the element
I'm interested in should be, I right-click on the element in the browser and
choose “Inspect element”. This opens up the inspector of my browser's
web dev tools and I can study the page structure.</p>
<p>So the links I want to extract are inside the element with an <code>id</code> of
“content”, and they are in regular list elements (<code>li</code>).</p>
<p>Let's try something:</p>
<pre><code class="language-lisp">(lquery:$ *parsed-content* "#content li")
;; =&gt; #(#&lt;PLUMP-DOM:ELEMENT li {100B3263A3}&gt; #&lt;PLUMP-DOM:ELEMENT li {100B3263E3}&gt;
;; #&lt;PLUMP-DOM:ELEMENT li {100B326423}&gt; #&lt;PLUMP-DOM:ELEMENT li {100B326463}&gt;
;; #&lt;PLUMP-DOM:ELEMENT li {100B3264A3}&gt; #&lt;PLUMP-DOM:ELEMENT li {100B3264E3}&gt;
;; #&lt;PLUMP-DOM:ELEMENT li {100B326523}&gt; #&lt;PLUMP-DOM:ELEMENT li {100B326563}&gt;
;; #&lt;PLUMP-DOM:ELEMENT li {100B3265A3}&gt; #&lt;PLUMP-DOM:ELEMENT li {100B3265E3}&gt;
;; #&lt;PLUMP-DOM:ELEMENT li {100B326623}&gt; #&lt;PLUMP-DOM:ELEMENT li {100B326663}&gt;
;; […]
</code></pre>
<p>Wow, it works! We get a vector of Plump elements.</p>
<p>I'd like to easily check what those elements are. To see the entire
html, we can end our lquery line with <code>(serialize)</code>:</p>
<pre><code class="language-lisp">(lquery:$ *parsed-content* "#content li" (serialize))
#("&lt;li&gt;&lt;a href=\"license.html\"&gt;License&lt;/a&gt;&lt;/li&gt;"
"&lt;li&gt;&lt;a href=\"getting-started.html\"&gt;Getting started&lt;/a&gt;&lt;/li&gt;"
"&lt;li&gt;&lt;a href=\"editor-support.html\"&gt;Editor support&lt;/a&gt;&lt;/li&gt;"
[…]
</code></pre>
<p>And to see their <em>textual</em> content (the user-visible text inside the
html), we can use <code>(text)</code> instead:</p>
<pre><code class="language-lisp">(lquery:$ *parsed-content* "#content" (text))
#("License" "Editor support" "Strings" "Dates and Times" "Hash Tables"
"Pattern Matching / Regular Expressions" "Functions" "Loop" "Input/Output"
"Files and Directories" "Packages" "Macros and Backquote"
"CLOS (the Common Lisp Object System)" "Sockets" "Interfacing with your OS"
"Foreign Function Interfaces" "Threads" "Defining Systems"
[…]
"Pascal Costanzas Highly Opinionated Guide to Lisp"
"Loving Lisp - the Savy Programmers Secret Weapon by Mark Watson"
"FranzInc, a company selling Common Lisp and Graph Database solutions.")
</code></pre>
<p>All right, so we see we are manipulating what we want. Now to get their
<code>href</code>, a quick look at lquery's doc and we'll use <code>(attr
"some-name")</code>:</p>
<pre><code class="language-lisp">(lquery:$ *parsed-content* "#content li a" (attr :href))
;; =&gt; #("license.html" "editor-support.html" "strings.html" "dates_and_times.html"
;; "hashes.html" "pattern_matching.html" "functions.html" "loop.html" "io.html"
;; "files.html" "packages.html" "macros.html"
;; "/cl-cookbook/clos-tutorial/index.html" "os.html" "ffi.html"
;; "process.html" "systems.html" "win32.html" "testing.html" "misc.html"
;; […]
;; "http://www.nicklevine.org/declarative/lectures/"
;; "http://www.p-cos.net/lisp/guide.html" "https://leanpub.com/lovinglisp/"
;; "https://franz.com/")
</code></pre>
<p><em>Note</em>: using <code>(serialize)</code> after <code>attr</code> leads to an error.</p>
<p>Nice, we now have the list (well, a vector) of links of the
page. We'll now write an async program to check that they are
reachable.</p>
<p>External resources:</p>
<ul>
<li><a href="https://developer.mozilla.org/en-US/docs/Glossary/CSS_Selector">CSS selectors</a></li>
</ul>
<h2 id="async-requests">Async requests</h2>
<p>In this example we'll take the list of urls from above and check
if they are reachable. We want to do this asynchronously, but to see
the benefits we'll first do it synchronously!</p>
<p>We need a bit of filtering first to exclude the email addresses (maybe
that was doable in the CSS selector? we'll come back to this below).</p>
<p>We put the vector of urls in a variable:</p>
<pre><code class="language-lisp">(defvar *urls* (lquery:$ *parsed-content* "#content li a" (attr :href)))
</code></pre>
<p>We remove the elements that start with “mailto:” (a quick look at the
<a href="strings.html">strings</a> page will help):</p>
<pre><code class="language-lisp">(remove-if (lambda (it) (string= it "mailto:" :start1 0 :end1 (length "mailto:"))) *urls*)
;; =&gt; #("license.html" "editor-support.html" "strings.html" "dates_and_times.html"
;; […]
;; "process.html" "systems.html" "win32.html" "testing.html" "misc.html"
;; "license.html" "http://lisp-lang.org/"
;; "https://github.com/CodyReichert/awesome-cl"
;; "http://www.lispworks.com/documentation/HyperSpec/Front/index.htm"
;; […]
;; "https://franz.com/")
</code></pre>
<p>Actually, before writing the <code>remove-if</code> (which works on any sequence,
including vectors) I tested with a <code>(map 'vector …)</code> to check that the
results were indeed <code>nil</code> or <code>t</code>.</p>
<p>As a side note, there is a handy <code>starts-with</code> function in
<a href="https://github.com/diogoalexandrefranco/cl-strings/">cl-strings</a>,
available in Quicklisp. So we could do:</p>
<pre><code class="language-lisp">(map 'vector (lambda (it) (cl-strings:starts-with it "mailto:")) *urls*)
</code></pre>
<p>It also has an option to ignore or respect case.</p>
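<p>For instance (a quick sketch; the <code>:ignore-case</code> keyword is part of
cl-strings' documented API):</p>
<pre><code class="language-lisp">(cl-strings:starts-with "MAILTO:someone@example.com" "mailto:" :ignore-case t)
;; =&gt; T
</code></pre>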
<p>While we're at it, we'll only consider links starting with “http”, in
order not to write too much stuff irrelevant to web scraping
(<code>*</code> is the REPL shorthand for the previous result, here the vector
returned by the <code>remove-if</code> above):</p>
<pre><code class="language-lisp">(remove-if-not (lambda (it) (string= it "http" :start1 0 :end1 (length "http"))) *) ;; note the remove-if-NOT
</code></pre>
<p>All right, we put this result in another variable:</p>
<pre><code class="language-lisp">(defvar *filtered-urls* *)
</code></pre>
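<p>As an aside, to answer the question raised earlier: the filtering could
likely be done in the CSS selector itself. CLSS, the selectors library behind
lquery, documents attribute selectors such as <code>^=</code> (“starts
with”), so a sketch like this should select only the http links directly:</p>
<pre><code class="language-lisp">;; Hypothetical one-step filtering with an attribute selector:
(lquery:$ *parsed-content* "#content li a[href^=http]" (attr :href))
</code></pre>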
<p>And now to the real work. For every url, we want to request it and
check that its status code is 200. We have to ignore certain
errors. Indeed, a request can time out, be redirected (we don't want
that) or return an error code.</p>
<p>To be in real conditions we'll add a link that times out to our list:</p>
<pre><code class="language-lisp">(setf (aref *filtered-urls* 0) "http://lisp.org") ;; too bad indeed
</code></pre>
<p>We'll take the simple approach of ignoring errors and returning <code>nil</code> in
that case. If all goes well, we return the status code, which should be
200.</p>
<p>As we saw at the beginning, <code>dex:get</code> returns multiple values, including
the status code. We'll catch only this one with <code>nth-value</code> (instead
of all of them with <code>multiple-value-bind</code>) and we'll use
<code>ignore-errors</code>, which returns <code>nil</code> in case of an error. We could also
use <code>handler-case</code> and catch specific error types (see examples in
Dexador's documentation) or (better yet?) use <code>handler-bind</code> to catch
any <code>condition</code>.</p>
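<p>For example, here is a minimal <code>handler-case</code> sketch, based on the
conditions documented by Dexador (note that a timeout or a DNS failure is not
an HTTP error, so we also catch the generic <code>error</code> as a fallback):</p>
<pre><code class="language-lisp">(defun try-get (url)
  "Return the HTTP status code of URL, or NIL if the request fails."
  (handler-case (nth-value 1 (dex:get url))
    (dex:http-request-failed (e)
      ;; 4xx or 5xx: the server answered, so we do have a status code.
      (dex:response-status e))
    (error () nil)))
</code></pre>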
<p>(<em>ignore-errors has the caveat that when there's an error, we cannot
return the element it comes from. We'll get by, though.</em>)</p>
<pre><code class="language-lisp">(map 'vector (lambda (it)
(ignore-errors
(nth-value 1 (dex:get it))))
*filtered-urls*)
</code></pre>
<p>We get:</p>
<pre><code>#(NIL 200 200 200 200 200 200 200 200 200 200 NIL 200 200 200 200 200 200 200
200 200 200 200)
</code></pre>
<p>It works, but <em>it took a very long time</em>. How much time exactly? With
<code>(time …)</code>:</p>
<pre><code>Evaluation took:
21.554 seconds of real time
0.188000 seconds of total run time (0.172000 user, 0.016000 system)
0.87% CPU
55,912,081,589 processor cycles
9,279,664 bytes consed
</code></pre>
<p>21 seconds! Obviously this synchronous method isn't efficient. We
wait 10 seconds for each link that times out. It's time to write and
measure an async version.</p>
<p>After installing <code>lparallel</code> and looking at
<a href="https://lparallel.org/">its documentation</a>, we see that the parallel
map <a href="https://lparallel.org/pmap-family/">pmap</a> seems to be what we
want. And it's only a one-word edit. Let's try:</p>
<pre><code class="language-lisp">(time (lparallel:pmap 'vector
(lambda (it)
(ignore-errors (let ((status (nth-value 1 (dex:get it)))) status)))
*filtered-urls*)
;; Evaluation took:
;; 11.584 seconds of real time
;; 0.156000 seconds of total run time (0.136000 user, 0.020000 system)
;; 1.35% CPU
;; 30,050,475,879 processor cycles
;; 7,241,616 bytes consed
;;
;;#(NIL 200 200 200 200 200 200 200 200 200 200 NIL 200 200 200 200 200 200 200
;; 200 200 200 200)
</code></pre>
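<p>One caveat: lparallel dispatches work through a kernel of worker
threads. If <code>lparallel:*kernel*</code> is not set, <code>pmap</code> signals an error
(with a restart offering to create a kernel for you). A minimal setup, to run
before the <code>pmap</code> call:</p>
<pre><code class="language-lisp">;; Create a kernel with a handful of worker threads (the count is up to you).
(setf lparallel:*kernel* (lparallel:make-kernel 4))
</code></pre>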
<p>Bingo. It still takes more than 10 seconds because we wait 10 seconds
for one request that times out. But otherwise it processes all the http
requests in parallel, so it is much faster.</p>
<p>Shall we get the urls that aren't reachable, remove them from our list
and measure the execution time in the sync and async cases?</p>
<p>What we do is: instead of returning only the status code, we check that it
is valid and we return the url:</p>
<pre><code class="language-lisp">... (if (and status (= 200 status)) it) ...
(defvar *valid-urls* *)
</code></pre>
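<p>Spelled out in full, the <code>pmap</code> call could look like this sketch:</p>
<pre><code class="language-lisp">(lparallel:pmap 'vector
                (lambda (it)
                  (ignore-errors
                   (let ((status (nth-value 1 (dex:get it))))
                     ;; Return the url itself when it answers with a 200.
                     (if (and status (= 200 status)) it))))
                *filtered-urls*)
</code></pre>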
<p>We get a vector of urls with a couple of <code>nil</code>s: indeed, I thought I
would have only one unreachable url but I discovered another
one. Hopefully I will have pushed a fix before you try this tutorial.</p>
<p>But what are they? We saw the status codes but not the urls :S We
have a vector with all the urls and another with the valid ones. We'll
simply treat them as sets and compute their difference: this will show
us the bad ones. We must transform our vectors into lists for that.</p>
<pre><code class="language-lisp">(set-difference (coerce *filtered-urls* 'list)
(coerce *valid-urls* 'list))
;; =&gt; ("http://lisp-lang.org/" "http://www.psg.com/~dlamkins/sl/cover.html")
</code></pre>
<p>Gotcha!</p>
<p>BTW it takes 8.280 seconds of real time for me to check the list of
valid urls synchronously, and 2.857 seconds async.</p>
<p>Have fun doing web scraping in CL!</p>
<p>More helpful libraries:</p>
<ul>
<li>we could use <a href="https://github.com/tsikov/vcr">VCR</a>, a store-and-replay
utility, to set up repeatable tests or to speed up our experiments in the
REPL a bit.</li>
<li><a href="https://github.com/orthecreedence/cl-async">cl-async</a>,
<a href="https://github.com/orthecreedence/carrier">carrier</a> and others
network, parallelism and concurrency libraries to see on the
<a href="https://github.com/CodyReichert/awesome-cl">awesome-cl</a> list,
<a href="http://www.cliki.net/">Cliki</a> or
<a href="https://quickdocs.org/-/search?q=web">Quickdocs</a>.</li>
</ul>
<p class="page-source">
Page source: <a href="https://github.com/LispCookbook/cl-cookbook/blob/master/web-scraping.md">web-scraping.md</a>
</p>
</div>
<script type="text/javascript">
// Don't write the TOC on the index.
if (window.location.pathname != "/cl-cookbook/") {
$("#toc").toc({
content: "#content", // will ignore the first h1 with the site+page title.
headings: "h1,h2,h3,h4"});
}
$("#two-cols + ul").css({
"column-count": "2",
});
$("#contributors + ul").css({
"column-count": "4",
});
</script>
<div>
<footer class="footer">
<hr/>
&copy; 2002&ndash;2021 the Common Lisp Cookbook Project
</footer>
</div>
<div id="toc-btn">T<br>O<br>C</div>
</div>
<script text="javascript">
HighlightLisp.highlight_auto({className: null});
</script>
<script type="text/javascript">
function duckSearch() {
var searchField = document.getElementById("searchField");
if (searchField && searchField.value) {
var query = escape("site:lispcookbook.github.io/cl-cookbook/ " + searchField.value);
window.location.href = "https://duckduckgo.com/?kj=b2&kf=-1&ko=1&q=" + query;
// https://duckduckgo.com/params
// kj=b2: blue header in results page
// kf=-1: no favicons
}
}
</script>
<script async defer data-domain="lispcookbook.github.io/cl-cookbook" src="https://plausible.io/js/plausible.js"></script>
</body>
</html>