<!DOCTYPE html>
<html lang='en'><head><meta charset='utf-8' /><meta name='pinterest' content='nopin' /><link href='../../../../static/css/style.css' rel='stylesheet' type='text/css' /><link href='../../../../static/css/print.css' rel='stylesheet' type='text/css' media='print' /><title>Happy Little Words / Steve Losh</title></head><body><header><a id='logo' href='https://stevelosh.com/'>Steve Losh</a><nav><a href='../../../index.html'>Blog</a> - <a href='https://stevelosh.com/projects/'>Projects</a> - <a href='https://stevelosh.com/photography/'>Photography</a> - <a href='https://stevelosh.com/links/'>Links</a> - <a href='https://stevelosh.com/rss.xml'>Feed</a></nav></header><hr class='main-separator' /><main id='page-blog-entry'><article><h1><a href='index.html'>Happy Little Words</a></h1><p class='date'>Posted on November 20th, 2015.</p><p>In late October the video game streaming site Twitch.tv <a href="http://blog.twitch.tv/2015/10/introducing-twitch-creative/">launched "Twitch
Creative"</a>, essentially giving people permission to stream
non-video game related creative content on the site. To celebrate the launch
they streamed all 403 episodes of <a href="https://en.wikipedia.org/wiki/The_Joy_of_Painting">The Joy of Painting with Bob Ross</a> in
a giant marathon.</p>

<p>The Bob Ross channel has its own chat room, and it quickly became packed with
folks watching Bob paint. The chat spawned its own memes and conventions within
days, mostly taking gamer slang (e.g. "gg" for "good game") and applying it to
the show (people spam "gg" in the chat whenever Bob finishes a painting).</p>

<p>Sadly that marathon has ended, but they've kept the dream alive by having <a href="http://blog.twitch.tv/2015/11/monday-night-is-bob-ross-night/">"Bob
Ross Night" on Mondays</a>. Every Monday they're going to stream a season
of the show twice (once at a Europe-friendly time and again for American folks).
Last Monday I scraped the Twitch chat during the marathon(s) of Season 2 and
decided to have some fun poking around at the data.</p>

<ol class="table-of-contents"><li><a href="index.html#s1-scraping">Scraping</a></li><li><a href="index.html#s2-volume">Volume</a></li><li><a href="index.html#s3-n-grams">N-grams</a></li><li><a href="index.html#s4-graphing">Graphing</a></li><li><a href="index.html#s5-up-next">Up Next</a></li></ol>

<h2 id="s1-scraping"><a href="index.html#s1-scraping">Scraping</a></h2>

<p>Scraping the chat was pretty easy. Twitch has an IRC gateway for chats, so
I just ran an IRC client (<a href="https://weechat.org/">weechat</a>) on a VPS and had it log the channel like
any other. Once the marathon finished I just <code>scp</code>'ed down the 8 MB log and
started working with it.</p>

<p>First I trimmed both ends so the log only covers about an hour and a half
before the first marathon started through about an hour and a half after the
second one ended. So the data I'm going to work with runs from
2015-11-16 14:30 to 2015-11-17 07:30 (all times are in UTC), or 17 hours.</p>

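<p>One way to do that kind of trimming, sketched in Python. This assumes each raw log line starts with a <code>YYYY-MM-DD HH:MM:SS</code> timestamp followed by a tab; the <code>trim</code> function and the sample lines are made up for illustration and aren't necessarily how the log was actually trimmed.</p>

```python
# Window boundaries taken from the text (UTC).
START = "2015-11-16 14:30:00"
END = "2015-11-17 07:30:00"

def trim(lines, start=START, end=END):
    """Keep only lines whose leading timestamp falls in [start, end)."""
    kept = []
    for line in lines:
        # Timestamps in this format sort lexicographically, so plain
        # string comparison is enough and no date parsing is needed.
        stamp = line.split("\t", 1)[0]
        if start <= stamp < end:
            kept.append(line)
    return kept

# Hypothetical sample lines (nick field simplified).
sample = [
    "2015-11-16 14:29:59\tearly\ttoo soon",
    "2015-11-16 14:30:00\tsjl\thi bob",
    "2015-11-17 07:29:59\tsjl\tgg",
    "2015-11-17 07:30:00\tlate\ttoo late",
]
for line in trim(sample):
    print(line)
```

<p>Only the two middle lines survive, since the end of the window is exclusive.</p>
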
<p>Then I cleaned it up to remove some of the cruft (status messages from the
client and such) and lowercase everything:</p>

<pre><code>cat data/raw | grep -E '^[^\t]+\t <' | gsed -e 's/./\L\0/g' > data/log
</code></pre>

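<p>For anyone without <code>gsed</code> handy, here's a rough Python sketch of the same filter: keep only lines where the first tab is followed by a space and an opening angle bracket, then lowercase them. The <code>clean</code> function and the sample lines (including the nick) are hypothetical, and the real log may contain line shapes this doesn't anticipate.</p>

```python
import re

# Mirrors the grep pattern: anything up to the first tab, then " <",
# which is how chat messages (as opposed to status lines) start.
MESSAGE_RE = re.compile(r"^[^\t]+\t <")

def clean(lines):
    """Drop non-message lines and lowercase the rest."""
    return [line.lower() for line in lines if MESSAGE_RE.match(line)]

# Hypothetical sample input.
raw = [
    "2015-11-16 14:30:00\t <_Painter_>\tHi Bob!",
    "2015-11-16 14:30:01\t--\tirc: status noise",
    "2015-11-16 14:30:02\t <_Painter_>\tRUINED",
]
for line in clean(raw):
    print(line)
```
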
<p>Then I made an ugly little Python script to massage the data into something
a bit easier to work with later:</p>

<pre><code><span class="code"><span class="symbol">import</span> datetime, sys, time

<span class="special">def</span><span class="keyword"> datetime_to_epoch</span><span class="paren1">(<span class="code">dt</span>)</span>:
    <span class="symbol">return</span> int<span class="paren1">(<span class="code">time.mktime<span class="paren2">(<span class="code">dt.timetuple<span class="paren3">(<span class="code"></span>)</span></span>)</span></span>)</span>

<span class="symbol">for</span> line <span class="symbol">in</span> sys.stdin:
    timestamp, nick, msg = <span class="paren1">(<span class="code">s.strip<span class="paren2">(<span class="code"></span>)</span> <span class="symbol">for</span> s <span class="symbol">in</span> line.split<span class="paren2">(<span class="code"><span class="string">'</span><span class="string">\t</span><span class="string">'</span>, 2</span>)</span></span>)</span>

    timestamp = datetime_to_epoch<span class="paren1">(<span class="code">
        datetime.datetime.strptime<span class="paren2">(<span class="code">timestamp, <span class="string">'%Y-%m-%d %H:%M:%S'</span></span>)</span></span>)</span>

    <span class="comment"># strip off <>'s
</span>    nick = nick<span class="paren1">[<span class="code">1:-1</span>]</span>

    <span class="symbol">print</span><span class="paren1">(<span class="code">timestamp, nick, msg</span>)</span></span></code></pre>

<p>This results in a file with one message per line, in the format:</p>

<pre><code>timestamp nick message goes here...
</code></pre>

<p>On a side note: I tried out <a href="https://mosh.mit.edu/">Mosh</a> for persisting a connection to the server
(instead of using tmux or screen to persist a session) and it worked pretty
well. I might start using it more often.</p>

<h2 id="s2-volume"><a href="index.html#s2-volume">Volume</a></h2>

<p>Now that we've got a nice clean corpus, let's start playing with it!</p>

<p>The obvious first question: how many messages did people send in total?</p>

<pre><code>><((°> cat data/messages | wc -l
165368
</code></pre>

<p>That's almost 10,000 messages per hour! And since there were periods of almost
no activity before, between, and after the two marathons it means the rate
<em>during</em> them was well over that!</p>

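<p>The arithmetic behind that figure, as a quick check. The per-minute average is an extra derived number, not one quoted above, and both are averages over the whole 17-hour window, quiet periods included.</p>

```python
total_messages = 165368  # line count of data/messages
hours = 17               # 2015-11-16 14:30 to 2015-11-17 07:30 UTC

per_hour = total_messages / hours
per_minute = per_hour / 60

print(round(per_hour))    # 9728
print(round(per_minute))  # 162
```
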
<p>Who talked the most?</p>

<pre><code>><((°> cat data/messages | cuts -f 2 | sort | uniq -c | sort -nr | head -5
269 fuscia13
259 almightypainter
239 sabrinamywaifu
235 roudydogg1
201 ionone
</code></pre>

<p>Some talkative folks (though honestly I expected a bit higher numbers here).
<a href="https://bitbucket.org/sjl/dotfiles/src/default/fish/functions/cuts.fish">cuts</a> is "<strong>cut</strong> on <strong>s</strong>paces", a little function I use so I don't have
to type <code>-d ' '</code> all the time.</p>

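<p>The same count can be sketched in Python with <code>collections.Counter</code>, which stands in for the <code>sort | uniq -c | sort -nr | head</code> part of the pipeline. The <code>top_talkers</code> function and the sample lines are hypothetical; the real input would be the lines of <code>data/messages</code>.</p>

```python
from collections import Counter

def top_talkers(lines, n=5):
    """Count messages per nick in 'timestamp nick message' lines."""
    # Field 2 (index 1) is the nick, just like `cuts -f 2`.
    counts = Counter(line.split(" ", 2)[1] for line in lines if line.strip())
    return counts.most_common(n)

# Hypothetical sample input.
sample = [
    "1447680000 sjl hi bob",
    "1447680001 painter gg",
    "1447680002 sjl ruined",
    "1447680003 sjl saved",
    "1447680004 painter rip devil",
    "1447680005 lurker gg",
]
print(top_talkers(sample, 2))  # [('sjl', 3), ('painter', 2)]
```
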
<h2 id="s3-n-grams"><a href="index.html#s3-n-grams">N-grams</a></h2>

<p>The chat has spawned a bunch of its own memes and jargon. I made another ugly
Python script to split up messages into <a href="https://en.wikipedia.org/wiki/N-gram">n-grams</a> so we can analyze them more
easily:</p>

<pre><code><span class="code"><span class="symbol">import</span> sys
<span class="symbol">import</span> nltk

<span class="special">def</span><span class="keyword"> window</span><span class="paren1">(<span class="code">coll, size</span>)</span>:
    <span class="string">'''Generate a "sliding window" of tuples of length size over coll.

    coll must be sliceable and have a fixed len.

    '''</span>
    coll_len = len<span class="paren1">(<span class="code">coll</span>)</span>
    <span class="symbol">for</span> i <span class="symbol">in</span> range<span class="paren1">(<span class="code">coll_len</span>)</span>:
        <span class="symbol">if</span> i + size > coll_len:
            <span class="symbol">break</span>
        else:
            <span class="symbol">yield</span> tuple<span class="paren1">(<span class="code">coll<span class="paren2">[<span class="code">i:i+size</span>]</span></span>)</span>

<span class="symbol">for</span> line <span class="symbol">in</span> sys.stdin:
    timestamp, nick, msg = line.split<span class="paren1">(<span class="code"><span class="string">' '</span>, 2</span>)</span>

    n = int<span class="paren1">(<span class="code">sys.argv<span class="paren2">[<span class="code">1</span>]</span></span>)</span>

    <span class="symbol">for</span> ngram <span class="symbol">in</span> set<span class="paren1">(<span class="code">window<span class="paren2">(<span class="code">nltk.word_tokenize<span class="paren3">(<span class="code">msg</span>)</span>, n</span>)</span></span>)</span>:
        <span class="symbol">print</span><span class="paren1">(<span class="code">timestamp, nick, <span class="string">'__'</span>.join<span class="paren2">(<span class="code">ngram</span>)</span></span>)</span></span></code></pre>

<p>This lets us easily split a message into unigrams:</p>

<pre><code>><((°> echo "1447680000 sjl beat the devil out of it" | python src/split.py 1
1447680000 sjl it
1447680000 sjl the
1447680000 sjl beat
1447680000 sjl of
1447680000 sjl out
1447680000 sjl devil
</code></pre>

<p>The order of n-grams within a message isn't preserved because the splitting
script uses a <code>set</code> to remove duplicate n-grams. I wanted to remove dupes
because it turns out people frequently copy and paste the same word many times
in a single message and I didn't want that to throw off the numbers.</p>

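<p>A small sketch of what that deduplication does to a spammy message. To keep it runnable without NLTK, <code>str.split()</code> stands in for <code>nltk.word_tokenize</code> (an assumption; the two tokenize differently once punctuation is involved), and the <code>window</code> here is a condensed version of the one in the script.</p>

```python
def window(coll, size):
    """Yield each run of `size` consecutive items of coll as a tuple."""
    for i in range(len(coll) - size + 1):
        yield tuple(coll[i:i + size])

# str.split() is a stand-in for nltk.word_tokenize.
tokens = "gg gg gg gg wp".split()

with_dupes = ["__".join(gram) for gram in window(tokens, 1)]
deduped = sorted(set(with_dupes))  # sorted only to make the output stable

print(with_dupes)  # ['gg', 'gg', 'gg', 'gg', 'wp']
print(deduped)     # ['gg', 'wp']
```

<p>Without the <code>set</code>, this one message would add four to the count for "gg"; with it, the message counts once.</p>
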
<p>Bigrams are just as easy: just change the parameter to <code>split.py</code>:</p>

<pre><code>><((°> echo "1447680000 sjl beat the devil out of it" | python src/split.py 2
1447680000 sjl of__it
1447680000 sjl the__devil
1447680000 sjl beat__the
1447680000 sjl devil__out
1447680000 sjl out__of
</code></pre>

<p>N-grams are joined with double underscores to make them easier to plot later.</p>

<p>So what are the most frequent unigrams?</p>

<pre><code>><((°> cat data/words | cuts -f3 | sort | uniq -c | sort -nr | head -15
19523 bob
14367 ruined
11961 kappaross
11331 !
10989 gg
 7666 is
 6592 the
 6305 ?
 6090 i
 5376 biblethump
 5240 devil
 5122 saved
 5075 rip
 4813 it
 4727 a
</code></pre>

<p>Some of these are expected, like "Bob" and stopwords like "is" and "the".</p>

<p>The chat loves to spam "RUINED" whenever Bob makes a drastic change to the
painting that looks awful at first, and then spam "SAVED" once he applies a bit
more paint and it looks beautiful. This happens frequently with mountains.</p>

<p>"KappaRoss" and "BibleThump" are <a href="https://fivethirtyeight.com/features/why-a-former-twitch-employee-has-one-of-the-most-reproduced-faces-ever/">Twitch emotes</a> that produce small
images in the chat.</p>

<p>When Bob cleans his brush he beats it against the leg of the easel to remove the
paint thinner, and he often smiles and says "just beat the devil out of it". It
didn't take long before chat started spamming "RIP DEVIL" every time he cleans
the brush.</p>

<p>How about the most frequent bigrams and trigrams?</p>

<pre><code>><((°> cat data/bigrams | cuts -f3 | sort | uniq -c | sort -nr | head -15
3731 rip__devil
3153 !__!
2660 bob__ross
2490 hi__bob
1844 <__3
1838 kappaross__kappaross
1533 bob__is
1409 bob__!
1389 god__bless
1324 happy__little
1181 van__dyke
1093 gg__wp
1024 is__back
 908 i__believe
 895 ?__?

><((°> cat data/trigrams | cuts -f3 | sort | uniq -c | sort -nr | head -15
2130 !__!__!
1368 kappaross__kappaross__kappaross
 678 van__dyke__brown
 617 bob__is__back
 548 ?__?__?
 503 biblethump__biblethump__biblethump
 401 bob__ross__is
 377 hi__bob__!
 376 beat__the__devil
 361 bob__!__!
 331 bob__<__3
 324 <__3__<
 324 3__<__3
 303 i__love__you
 302 son__of__a
</code></pre>

<p>Looks like lots of love for Bob and no sympathy for the devil. It also seems
like <a href="https://www.bobross.com/ProductDetails.asp?ProductCode=VanDykeBrown">Van Dyke Brown</a> is Twitch chat's favorite color by a landslide.</p>

<p>Note that the exact n-grams depend on the tokenization method. I used NLTK's
<code>word_tokenize</code> because it was easy and worked pretty well.
<code>wordpunct_tokenize</code> also works, but it splits up basic punctuation a bit too
much for my liking (e.g. it turns <code>bob's</code> into three tokens <code>bob</code>, <code>'</code>, and <code>s</code>,
where <code>word_tokenize</code> produces just <code>bob</code> and <code>'s</code>).</p>

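<p>The three-token split is easy to reproduce with a plain regex. The pattern below is, as far as I know, the one behind NLTK's <code>WordPunctTokenizer</code>, but treat that as an assumption; the <code>wordpunct</code> helper is a hypothetical stand-in so the example runs without NLTK installed.</p>

```python
import re

# Assumed pattern: runs of word characters, or runs of characters
# that are neither word characters nor whitespace.
WORDPUNCT_RE = re.compile(r"\w+|[^\w\s]+")

def wordpunct(text):
    """Split text into alphanumeric runs and punctuation runs."""
    return WORDPUNCT_RE.findall(text)

print(wordpunct("bob's"))         # ['bob', "'", 's']
print(wordpunct("rip devil!!!"))  # ['rip', 'devil', '!!!']
```
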
<h2 id="s4-graphing"><a href="index.html#s4-graphing">Graphing</a></h2>

<p>Pure numbers are interesting, but <a href="https://en.wikipedia.org/wiki/Anscombe%27s_quartet">can be misleading</a>. Let's make some
graphs to get a sense of what the data feels like. I'm using <a href="http://www.gnuplot.info/">gnuplot</a> to
make the graphs.</p>

<p>What does the overall volume look like? We'll use minute-wide buckets on the
x-axis to make the graph a bit easier to read.</p>

<p><a href="https://stevelosh.com/static/images/blog/2015/11/hlw-total-large.png"><img src="../../../../static/images/blog/2015/11/hlw-total.png" alt="Photo"></a></p>

<p>Can you tell where the two marathons start and end?</p>

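<p>The minute-wide bucketing behind a volume graph like this can be sketched in a few lines of Python. This assumes the <code>timestamp nick message</code> format from earlier; the <code>bucket_counts</code> function and sample lines are hypothetical, and the actual plotting script isn't shown in this post.</p>

```python
from collections import Counter

def bucket_counts(lines, width=60):
    """Count messages per `width`-second bucket, keyed by bucket start."""
    counts = Counter()
    for line in lines:
        timestamp = int(line.split(" ", 1)[0])
        # Round the epoch timestamp down to the start of its bucket.
        counts[timestamp - timestamp % width] += 1
    return sorted(counts.items())

# Hypothetical sample input.
sample = [
    "1447680000 sjl hi bob",
    "1447680030 painter hi bob",
    "1447680059 lurker gg",
    "1447680060 sjl ruined",
]
for start, count in bucket_counts(sample):
    print(start, count)
```

<p>Each output line is a bucket start and a count separated by whitespace, which is a shape gnuplot can read as two columns.</p>
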
<p>Let's try to identify where episodes start and finish. Chat usually spams "hi
bob" when an episode starts and "gg" when it finishes, so let's plot those.
We'll use 30-second x buckets here because a minute isn't a fine enough
resolution for the events we're looking for. To make it easier to read we'll
just look at the first half of the first marathon.</p>

<p><a href="https://stevelosh.com/static/images/blog/2015/11/hlw-higg-large.png"><img src="../../../../static/images/blog/2015/11/hlw-higg.png" alt="Photo"></a></p>
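<p>Pulling out per-term series like these comes down to filtering the n-gram lines before bucketing them. Here's a rough sketch, assuming the <code>timestamp nick ngram</code> lines that <code>split.py</code> prints, with unigram and bigram output fed in together; <code>term_series</code> and the sample data are hypothetical.</p>

```python
from collections import Counter, defaultdict

def term_series(ngram_lines, terms, width=30):
    """Per-term counts in `width`-second buckets, keyed by bucket start."""
    series = defaultdict(Counter)
    for line in ngram_lines:
        timestamp, _nick, ngram = line.split(" ", 2)
        ngram = ngram.strip()
        if ngram in terms:
            t = int(timestamp)
            series[ngram][t - t % width] += 1
    return {term: sorted(counts.items()) for term, counts in series.items()}

# Hypothetical sample input mixing unigram and bigram lines.
sample = [
    "1447680000 sjl hi__bob",
    "1447680010 painter hi__bob",
    "1447680020 lurker happy__little",
    "1447681800 sjl gg",
    "1447681829 painter gg",
    "1447681830 lurker hi__bob",
]
for term, points in term_series(sample, {"hi__bob", "gg"}).items():
    print(term, points)
```
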
<p>This works pretty well! The graph starts with a big spike of "hi bob", then as
each episode finishes we see a (huge) spike of "gg", followed immediately by
a round of "hi bob" as the next episode starts.</p>

<p>Can we find all the times Bob cleaned his brush?</p>

<p><a href="https://stevelosh.com/static/images/blog/2015/11/hlw-ripdevil-large.png"><img src="../../../../static/images/blog/2015/11/hlw-ripdevil.png" alt="Photo"></a></p>

<p>Looks like the devil isn't having a very good time. It's encouraging that the
two showings have roughly the same structure (three main clusters of peaks).</p>

<p>Note that there are a couple of smaller peaks between the two showings. Twitch
showed another streamer painting between the two marathons, so it's likely that
she cleaned her brush a couple of times and the chat responded. Fewer people
were watching the stream during the break, hence the smaller peaks.</p>

<p>When did Bob get the most love? We'll use 5-minute x bins here because we just
want a general idea.</p>

<p><a href="https://stevelosh.com/static/images/blog/2015/11/hlw-love-large.png"><img src="../../../../static/images/blog/2015/11/hlw-love.png" alt="Photo"></a></p>

<p>Lots of love all around, but especially as he signed off at the end.</p>

<p>One of my favorite moments was when Bob said something about "changing your mind
in mid <strong>stream</strong>" and the chat started spamming conspiracy theories about how
he somehow knew about the stream 30 years in the past:</p>

<p><a href="https://stevelosh.com/static/images/blog/2015/11/hlw-heknew-large.png"><img src="../../../../static/images/blog/2015/11/hlw-heknew.png" alt="Photo"></a></p>

<h2 id="s5-up-next"><a href="index.html#s5-up-next">Up Next</a></h2>

<p>Poking around at this chat corpus was a lot of fun (and <em>definitely</em> counts as
studying for my NLP final, <em>definitely</em>). I'll probably record the chat during
next week's marathon and do some more poking, specifically around finding unique
events (e.g. his son Steve coming on the show) by comparing rate percentiles.</p>

<p>If you've got other ideas for things I should graph, <a href="http://twitter.com/stevelosh">let me know</a>.</p>
</article></main><hr class='main-separator' /><footer><nav><a href='https://github.com/sjl/'>GitHub</a> ・ <a href='https://twitter.com/stevelosh/'>Twitter</a> ・ <a href='https://instagram.com/thirtytwobirds/'>Instagram</a> ・ <a href='https://hg.stevelosh.com/.plan/'>.plan</a></nav></footer></body></html>