<!DOCTYPE html>
<html lang='en'><head><meta charset='utf-8' /><meta name='pinterest' content='nopin' /><link href='../../../../static/css/style.css' rel='stylesheet' type='text/css' /><link href='../../../../static/css/print.css' rel='stylesheet' type='text/css' media='print' /><title>Just Beat the Data Out of It / Steve Losh</title></head><body><header><a id='logo' href='https://stevelosh.com/'>Steve Losh</a><nav><a href='../../../index.html'>Blog</a> - <a href='https://stevelosh.com/projects/'>Projects</a> - <a href='https://stevelosh.com/photography/'>Photography</a> - <a href='https://stevelosh.com/links/'>Links</a> - <a href='https://stevelosh.com/rss.xml'>Feed</a></nav></header><hr class='main-separator' /><main id='page-blog-entry'><article><script type='text/javascript' async
src='https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML'></script><h1><a href='index.html'>Just Beat the Data Out of It</a></h1><p class='date'>Posted on November 30th, 2015.</p><p><a href="https://stevelosh.com/blog/2015/11/beat-the-data/blog/2015/11/happy-little-words/">Last week</a> we played around with a transcript of the Bob Ross Twitch
chat during the Season 2 marathon. I scraped the chat again last Monday to get
the transcript for the Season 3 marathon, so let's pick up where we left off.</p>
<ol class="table-of-contents"><li><a href="index.html#s1-volume-comparison">Volume Comparison</a></li><li><a href="index.html#s2-spiky-n-grams">Spiky N-grams</a></li><li><a href="index.html#s3-percentiles">Percentiles</a></li><li><a href="index.html#s4-spikiness-scores">Spikiness Scores</a></li><li><a href="index.html#s5-results">Results</a></li><li><a href="index.html#s6-join-the-fun">Join the Fun</a></li></ol>
<h2 id="s1-volume-comparison"><a href="index.html#s1-volume-comparison">Volume Comparison</a></h2>
<p>Was this week busier or quieter than last week?</p>
<p><a href="https://stevelosh.com/static/images/blog/2015/11/btd-volume-comparison-large.png"><img src="../../../../static/images/blog/2015/11/btd-volume-comparison.png" alt="Season 2 and 3 chat volume comparison"></a></p>
<p>Note the separate x axes, which line up the start and end times of the two
logs. Also, I used two-minute buckets to make this crowded graph a bit cleaner
to look at (see the y axis label).</p>
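<p>If you want to play along at home, the bucketing itself is easy to sketch. This is a rough Python sketch, not the code behind the graph: it assumes every message timestamp has already been converted to seconds since the start of the log, and <code>bucket_counts</code> is a name made up for illustration.</p>

```python
from collections import Counter

BUCKET_SECONDS = 120  # two-minute buckets, as on the volume graph


def bucket_counts(timestamps, bucket_seconds=BUCKET_SECONDS):
    """Count messages per fixed-width time bucket.

    `timestamps` are seconds since the start of the log (an assumed
    format; the real transcript format is not shown in this post).
    Returns one count per bucket, including the empty ones.
    """
    if not timestamps:
        return []
    counts = Counter(int(t // bucket_seconds) for t in timestamps)
    last_bucket = max(counts)
    return [counts.get(i, 0) for i in range(last_bucket + 1)]


# Invented message times, in seconds from the start of the log.
example = [3, 50, 119, 120, 130, 400]
print(bucket_counts(example))
```

<p>Keeping the empty buckets matters later on: a bin where nobody said anything is still a data point when we start asking how often a term is quiet.</p>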
<p>Seems like this was a bit quieter than last week. It's encouraging that the
basic structure looks the same — this hints that there are some patterns
waiting to be discovered.</p>
<h2 id="s2-spiky-n-grams"><a href="index.html#s2-spiky-n-grams">Spiky N-grams</a></h2>
<p>Last week we looked at graphs of various n-grams and saw that some of them show
pretty clear patterns. The end of each episode brings a flood of <code>gg</code>, and when
Bob's son Steve comes on the show we get a big spike in <code>steve</code>:</p>
<p><a href="https://stevelosh.com/static/images/blog/2015/11/btd-s2-ggsteve-large.png"><img src="../../../../static/images/blog/2015/11/btd-s2-ggsteve.png" alt="Plot of &quot;gg&quot; and &quot;steve&quot; unigrams in Season 2"></a></p>
<p>It's reasonable to expect the same behavior this week. What did we get?</p>
<p><a href="https://stevelosh.com/static/images/blog/2015/11/btd-s3-ggsteve-large.png"><img src="../../../../static/images/blog/2015/11/btd-s3-ggsteve.png" alt="Plot of &quot;gg&quot; and &quot;steve&quot; unigrams in Season 3"></a></p>
<p>Looks pretty similar! In fact the <code>steve</code> plot is even more obvious this week.
And in both cases the second streaming of the season repeats the pattern seen
in the first.</p>
<p>Each week, between the two showings of the season, the channel "hosts" another
painter. This just means that it "pipes through" another streamer's channel so
people don't get bored.</p>
<p>This week whoever is in charge of picking the guest stream did a shitty job.
After the first showing ended viewers were assaulted with the most loud,
obnoxious manchild on the planet.</p>
<p>The chat was not pleased:</p>
<p><a href="https://stevelosh.com/static/images/blog/2015/11/btd-s3-douche-large.png"><img src="../../../../static/images/blog/2015/11/btd-s3-douche.png" alt="The Douche-o-Meter™"></a></p>
<p>Thankfully whoever manages Bob's channel mercy-killed the hosting after 10
minutes or so, and we enjoyed the blissful silence.</p>
<p>So we've seen that the rates of certain n-grams have clear patterns. If we're
interested in a particular n-gram that's great: we can graph it and take
a look. But what if we want to <em>find</em> interesting n-grams to look at, without
having to watch the whole marathon (or comb through the logs)?</p>
<h2 id="s3-percentiles"><a href="index.html#s3-percentiles">Percentiles</a></h2>
<p><a href="https://en.wikipedia.org/wiki/Percentile">Percentiles</a> are a really useful measurement in a lot of fields, so let's
take a look at them here. We'll start with a relatively common n-gram like
"the":</p>
<p><a href="https://stevelosh.com/static/images/blog/2015/11/btd-s3-percentile-the-large.png"><img src="../../../../static/images/blog/2015/11/btd-s3-percentile-the.png" alt="Percentile graph of &quot;the&quot; in Season 3"></a></p>
<p>Here we've got a pretty smooth gradation from the lower percentiles up to the
higher ones. Note that these are rates of <code>the</code> per minute, so the value <code>11</code>
at <code>50</code> (the 50th percentile) means that half of all 2-minute bins recorded had
eleven or fewer instances of <code>the</code>. This seems low for English text, but a lot of
the messages in the Bob Ross chat are one- or two-word slang; full sentences are rare.</p>
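<p>To make the percentile idea concrete, here's a rough Python sketch using the nearest-rank definition. That definition is an assumption on my part (there are several ways to define a percentile, and the graphs may use a different one), and the counts are invented.</p>

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least
    p percent of the data at or below it, for p in (0, 100]."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    # p * n / 100 keeps the arithmetic exact for whole-number results.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Invented counts of one n-gram in ten consecutive bins.
counts = [0, 0, 1, 2, 4, 11, 12, 15, 20, 40]
for p in (50, 90, 100):
    print(p, percentile(counts, p))
```

<p>With this definition the 100th percentile is simply the busiest bin, which is handy to keep in mind for what follows.</p>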
<p>If we go back to the normal n-gram plot of <code>the</code> we can see that it's not a very
"spiky" word:</p>
<p><a href="https://stevelosh.com/static/images/blog/2015/11/btd-s3-the-large.png"><img src="../../../../static/images/blog/2015/11/btd-s3-the.png" alt="Plot of &quot;the&quot; unigram in Season 3"></a></p>
<p>Let's look at another common word, <code>bob</code>:</p>
<p><a href="https://stevelosh.com/static/images/blog/2015/11/btd-s3-percentile-bob-large.png"><img src="../../../../static/images/blog/2015/11/btd-s3-percentile-bob.png" alt="Percentile graph of &quot;bob&quot;"></a></p>
<p>Pretty smooth, though it's a little bit steeper at the end (probably because of
the deluge of <code>hi bob</code> when an episode starts). N-gram plot for comparison:</p>
<p><a href="https://stevelosh.com/static/images/blog/2015/11/btd-s3-bob-large.png"><img src="../../../../static/images/blog/2015/11/btd-s3-bob.png" alt="Plot of &quot;bob&quot; unigram in Season 3"></a></p>
<p>What about an n-gram we <em>know</em> represents a mostly-unique event, like <code>steve</code>?
We would expect the graph of percentiles to look steeper, because the lower and
middle percentiles would be very low and the highest few would skyrocket.</p>
<p><a href="https://stevelosh.com/static/images/blog/2015/11/btd-s3-percentile-steve-large.png"><img src="../../../../static/images/blog/2015/11/btd-s3-percentile-steve.png" alt="Percentile graph of &quot;steve&quot; in Season 3"></a></p>
<p><a href="https://stevelosh.com/static/images/blog/2015/11/btd-s3-steve-large.png"><img src="../../../../static/images/blog/2015/11/btd-s3-steve.png" alt="Plot of &quot;steve&quot; unigram in Season 3"></a></p>
<p>We've tentatively identified another pattern in the data, but how can it help us
find new interesting terms?</p>
<h2 id="s4-spikiness-scores"><a href="index.html#s4-spikiness-scores">Spikiness Scores</a></h2>
<p>If we look at the percentiles for a few known-spiky terms we can see a pattern:</p>
<p><a href="https://stevelosh.com/static/images/blog/2015/11/btd-s3-percentile-steve-large.png"><img src="../../../../static/images/blog/2015/11/btd-s3-percentile-steve.png" alt="Percentile graph of &quot;steve&quot; in Season 3"></a></p>
<p><a href="https://stevelosh.com/static/images/blog/2015/11/btd-s3-percentile-drugs-large.png"><img src="../../../../static/images/blog/2015/11/btd-s3-percentile-drugs.png" alt="Percentile graph of &quot;drugs&quot; in Season 3"></a></p>
<p><a href="https://stevelosh.com/static/images/blog/2015/11/btd-s3-percentile-cringe-large.png"><img src="../../../../static/images/blog/2015/11/btd-s3-percentile-cringe.png" alt="Percentile graph of &quot;cringe&quot; in Season 3"></a></p>
<p>The top percentile or two have some volume, but it quickly drops away to
nothingness within five or ten percent. So let's try to define a really basic
"spikiness score" that we can work out for all n-grams:</p>
<div>$$ {\text{Spikiness}}(w) = \frac{P_{100}(w)}{P_{95}(w) + 0.1} $$</div>
<p>We'll start by saying that the spikiness score of a word is the value of the
100th percentile for that word, divided by the 95th percentile (plus a small
smoothing factor to avoid division by zero). Let's try some words:</p>
<pre><code>the 1.78
bob 2.39
steve 4.67
drugs 30.00
cringe 60.00</code></pre>
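<p>If you'd like to poke at this first version of the score yourself, here's a minimal Python sketch of the formula above. The nearest-rank percentile is an assumption (the scores in this post may have been computed with a different definition), and both sets of counts are invented rather than taken from the chat logs.</p>

```python
import math

SMOOTHING = 0.1  # the small constant in the denominator


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100] (an assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def spikiness(counts, upper=100, lower=95):
    """Upper percentile divided by the smoothed lower percentile."""
    return percentile(counts, upper) / (percentile(counts, lower) + SMOOTHING)


# Invented per-bin counts for twenty bins (not real chat data).
steady = [9, 10, 10, 11, 11, 12, 12, 13, 14, 15] * 2
bursty = [0] * 19 + [30]

print(f"steady {spikiness(steady):.2f}")
print(f"bursty {spikiness(bursty):.2f}")
```

<p>The steady term scores just under 1, while the term that is silent in all but one bin ends up divided by little more than the smoothing factor, so its score shoots up.</p>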
<p>This doesn't look too terrible. The words we consider spiky are all scored
higher than the non-spiky ones, but it's not quite there yet. <code>steve</code> is rated
pretty low even though we consider it to be spiky.</p>
<p>When we made our initial formula we arbitrarily picked the 100th and 95th
percentiles out of thin air. What if we choose the 99th and 90th instead?</p>
<div>$$ {\text{Spikiness}}(w) = \frac{P_{99}(w)}{P_{90}(w) + 0.1} $$</div>
<pre><code>bob 3.56
the 1.42
steve 77.27
cringe 20.00
drugs 10.00</code></pre>
<p>This has changed the scores quite a bit, and now they're more like what we want.
But again, we just picked the two percentiles out of thin air. It would be nice
if we could get a feel for how the choice of percentiles affects our spikiness
scores. Once again, let's turn to gnuplot. We'll generalize our function:</p>
<div>$$ {\text{Spikiness}}(w, L, U) = \frac{P_{U}(w)}{P_{L}(w) + 0.1} $$</div>
<p>And graph it for all the combinations of percentiles for a couple of words we
know:</p>
<p><a href="https://stevelosh.com/static/images/blog/2015/11/btd-ssp-the-large.png"><img src="../../../../static/images/blog/2015/11/btd-ssp-the.png" alt="Spikiness percentile sensitivity plot for &quot;the&quot;"></a></p>
<p><a href="https://stevelosh.com/static/images/blog/2015/11/btd-ssp-bob-large.png"><img src="../../../../static/images/blog/2015/11/btd-ssp-bob.png" alt="Spikiness percentile sensitivity plot for &quot;bob&quot;"></a></p>
<p><a href="https://stevelosh.com/static/images/blog/2015/11/btd-ssp-steve-large.png"><img src="../../../../static/images/blog/2015/11/btd-ssp-steve.png" alt="Spikiness percentile sensitivity plot for &quot;steve&quot;"></a></p>
<p><a href="https://stevelosh.com/static/images/blog/2015/11/btd-ssp-rip__devil-large.png"><img src="../../../../static/images/blog/2015/11/btd-ssp-rip__devil.png" alt="Spikiness percentile sensitivity plot for &quot;rip devil&quot;"></a></p>
<p>These graphs are approaching the point of being impossible to read, but we can
definitely see a pattern. In the first two graphs (common words) the only way
to get a high spikiness score is to choose our formula's lower percentile to be
<em>really</em> low (15th percentile or lower).</p>
<p>In the second two graphs (spiky words) we can see that the score is high when
the upper percentile is 99th or 100th, and the lower percentile is beneath the
90th (or thereabouts).</p>
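<p>The sensitivity plots boil down to evaluating the generalized score over a grid of lower and upper percentiles. Here's a rough Python sketch of that sweep (the plots themselves came from gnuplot, so this is an illustration and not the code behind them). The nearest-rank percentile, the <code>sweep</code> helper, and the data are all assumptions made up for the example.</p>

```python
import math


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100] (an assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def spikiness(counts, lower, upper, smoothing=0.1):
    """Generalized score: P_U divided by the smoothed P_L."""
    return percentile(counts, upper) / (percentile(counts, lower) + smoothing)


def sweep(counts, lowers, uppers):
    """Score every (lower, upper) pair where upper is above lower."""
    return {
        (lower, upper): spikiness(counts, lower, upper)
        for lower in lowers
        for upper in uppers
        if upper > lower
    }


# Invented data: 100 bins, silent except for one short burst.
bursty = [0] * 95 + [5, 10, 20, 30, 40]

grid = sweep(bursty, lowers=(50, 90, 95, 99), uppers=(99, 100))
for (lower, upper), score in sorted(grid.items()):
    print(f"L={lower:3d} U={upper:3d} score={score:7.2f}")
```

<p>For this made-up bursty term the score stays huge for any lower percentile up to the 95th and collapses once the lower percentile lands inside the burst, which is the same shape the spiky plots show.</p>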
<p>Now that we have a hypothesis let's try a couple more plots to see if it still
holds:</p>
<p><a href="https://stevelosh.com/static/images/blog/2015/11/btd-ssp-gg-large.png"><img src="../../../../static/images/blog/2015/11/btd-ssp-gg.png" alt="Spikiness percentile sensitivity plot for &quot;gg&quot;"></a></p>
<p><code>gg</code> does come in spikes, but it happens so often that we need to select
a smaller lower percentile if we want it to be considered spiky. Whether we
want to depends on what we're looking for — if we want <em>rare</em> events then we
probably want to exclude it.</p>
<p><code>ruined</code> gets spammed so much that it's certainly not rare, and isn't even
particularly spiky in any way:</p>
<p><a href="https://stevelosh.com/static/images/blog/2015/11/btd-ssp-ruined-large.png"><img src="../../../../static/images/blog/2015/11/btd-ssp-ruined.png" alt="Spikiness percentile sensitivity plot for &quot;ruined&quot;"></a></p>
<p><a href="https://stevelosh.com/static/images/blog/2015/11/btd-s3-ruined-large.png"><img src="../../../../static/images/blog/2015/11/btd-s3-ruined.png" alt="Plot of &quot;ruined&quot; unigram in Season 3"></a></p>
<p>So it looks like we're at least on a reasonable track here. Let's settle on the
100th and 90th for now and see where they lead.</p>
<p>There's one other addition to our spikiness formula we should make before moving
on: if the 100th percentile of a term is small (e.g. less than 5) then while it
might technically be spiky, we probably don't care about it. So we'll just drop
those on the floor and not really worry about them.</p>
<div>$$
{\text{Spikiness}}(w) = \begin{cases} 0& {\text{if}}\ P_{100}(w) < 5 \\ \frac{P_{100}(w)}{P_{90}(w) + 0.1}& {\text{otherwise}} \end{cases}
$$</div>
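<p>As a rough Python sketch, the piecewise formula above might look like this. The nearest-rank 90th percentile is an assumption, the parameter names <code>min_peak</code> and <code>smoothing</code> are made up for illustration, and the example counts are invented.</p>

```python
import math


def spikiness(counts, min_peak=5, smoothing=0.1):
    """Piecewise score: 0 for quiet terms, else P100 / (P90 + smoothing).

    The 90th percentile uses the nearest-rank definition, which is an
    assumption; the 100th percentile is just the busiest bin.
    """
    ordered = sorted(counts)
    p100 = ordered[-1]
    if min_peak > p100:
        return 0.0  # too quiet to care about, however spiky it looks
    p90 = ordered[max(1, math.ceil(90 * len(ordered) / 100)) - 1]
    return p100 / (p90 + smoothing)


# Invented per-bin counts for ten bins.
quiet = [0] * 9 + [3]    # spiky, but the peak is under the cutoff
loud = [0] * 9 + [50]    # spiky, with a peak worth looking at
steady = [10] * 10       # busy, but not spiky at all

for name, counts in (("quiet", quiet), ("loud", loud), ("steady", steady)):
    print(f"{name} {spikiness(counts):.2f}")
```

<p>The cutoff only looks at the peak, so a term has to get genuinely loud at least once before its ratio counts for anything.</p>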
<h2 id="s5-results"><a href="index.html#s5-results">Results</a></h2>
<p>Now that we've got a way to measure a term's spikiness, we can calculate it for
all n-grams and sort to find some interesting ones. Let's try it with bigrams:</p>
<pre><code>mouth__noises 680.00
(__mouth 520.00
soft__music 480.00
elevator__music 480.00
noises__) 470.00
believe__biblethump 460.00
cool__elevator 450.00
soft__rock 390.00
smooth__soft 390.00
smooth__jazz 380.00
relaxing__guitar 360.00
guitar__music 360.00
son__of 330.00
music__) 330.00
(__soft 330.00
a__gun 320.00
(__relaxing 320.00
big__shaft 300.00
super__steve 290.00
jazz__music 280.00
crazy__day 280.00
zoop__zoop 270.00
the__heck 270.00
(__smooth 260.00
flat__trees 240.00
steve__! 220.00
hi__steve 220.00
...</code></pre>
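<p>For anyone who wants to try this on their own logs, here's a hedged end-to-end sketch in Python: split messages into bigrams, count them per one-minute bin, score each one, and sort. It is not the code that produced the list above. In particular the tokenizer just splits on whitespace (so a parenthesis stays glued to its word, unlike the separate <code>(</code> tokens in the list), the input format of <code>(seconds, text)</code> pairs is assumed, and the chat messages are invented.</p>

```python
from collections import defaultdict
import math


def bigrams(message):
    """Whitespace-split bigrams joined with '__' (a simplified tokenizer)."""
    words = message.lower().split()
    return ["__".join(pair) for pair in zip(words, words[1:])]


def counts_per_bin(messages, bin_seconds=60):
    """Map each bigram to a list of counts, one entry per time bin.

    `messages` is a list of (seconds_from_start, text) pairs, which is
    an assumed input format.
    """
    n_bins = int(max(t for t, _ in messages) // bin_seconds) + 1
    table = defaultdict(lambda: [0] * n_bins)
    for t, text in messages:
        for gram in bigrams(text):
            table[gram][int(t // bin_seconds)] += 1
    return dict(table)


def spikiness(counts, min_peak=5, smoothing=0.1):
    """0 for quiet terms, else P100 / (P90 + smoothing), nearest-rank."""
    ordered = sorted(counts)
    p100 = ordered[-1]
    if min_peak > p100:
        return 0.0
    p90 = ordered[max(1, math.ceil(90 * len(ordered) / 100)) - 1]
    return p100 / (p90 + smoothing)


def rank_by_spikiness(messages):
    """Return (bigram, score) pairs, spikiest first."""
    table = counts_per_bin(messages)
    scored = [(gram, spikiness(counts)) for gram, counts in table.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented chat: steady greetings, plus one burst in the eighth minute.
messages = [(60 * minute + s, "hi bob") for minute in range(10) for s in (5, 35)]
messages += [(60 * 7 + s, "(mouth noises)") for s in range(10, 16)]

for gram, score in rank_by_spikiness(messages):
    print(f"{gram} {score:.2f}")
```

<p>On this toy input the one-off burst lands at the top and the steady greeting scores zero because it never clears the peak cutoff.</p>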
<p>We can get similar results for unigrams, trigrams, etc. Let's graph a couple of
these highly-spiky terms. Twitch chat definitely loves innuendo:</p>
<p><a href="https://stevelosh.com/static/images/blog/2015/11/btd-s3-innuendo-large.png"><img src="../../../../static/images/blog/2015/11/btd-s3-innuendo.png" alt="Plot of vaguely sexual n-grams in Season 3"></a></p>
<p>Something new this week was the addition of captions, which sometimes included
things like <code>(soft music)</code> and <code>(mouth noises)</code>. The chat liked to poke fun at
those:</p>
<p><a href="https://stevelosh.com/static/images/blog/2015/11/btd-s3-mouthnoises-large.png"><img src="../../../../static/images/blog/2015/11/btd-s3-mouthnoises.png" alt="Plot of &quot;soft music&quot; and &quot;mouth noises&quot; bigrams in Season 3"></a></p>
<p>We can also see some particular elements of paintings:</p>
<p><a href="https://stevelosh.com/static/images/blog/2015/11/btd-s3-subjects-large.png"><img src="../../../../static/images/blog/2015/11/btd-s3-subjects.png" alt="Plot of subject n-grams in Season 3"></a></p>
<p>The lists aren't perfect. They contain a lot of redundant stuff (e.g.
<code>(soft music)</code> produces 3 separate bigrams that all score as highly spiky),
and there's a bunch of stuff we don't care about as much. But if you're looking
to find some interesting terms they can at least give you a starting point.</p>
<h2 id="s6-join-the-fun"><a href="index.html#s6-join-the-fun">Join the Fun</a></h2>
<p>I'm posting this right as the Season 4 marathon is going live on
<a href="http://twitch.tv/BobRoss">the Bob Ross Twitch channel</a>. If you've got some time feel free to pull up your
comfy computer chair and join a few thousand other people for a relaxing evening
with Bob!</p>
</article></main><hr class='main-separator' /><footer><nav><a href='https://github.com/sjl/'>GitHub</a> ・ <a href='https://twitter.com/stevelosh/'>Twitter</a> ・ <a href='https://instagram.com/thirtytwobirds/'>Instagram</a> ・ <a href='https://hg.stevelosh.com/.plan/'>.plan</a></nav></footer></body></html>