Yesterday I had a lot of fun playing with Project Jupyter. For those who aren't aware of this project, it's an effort to provide a workspace for performing repeatable experimentation with data. In short, it mixes markdown editing capabilities with a REPL environment for a large number of languages. From the website:
"a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more"
After having my interest piqued by the browser version I bit the bullet and spent ages downloading and installing the full version to a virtual machine. It's a Python-based web-app so it requires quite a bit of setup and unfortunately I found the documentation to be a bit sparse.
And so it was that while trying to work out how to install the FSharp module I came across Azure Notebooks. This is a free, Azure-hosted version of Jupyter that has almost all the features of a local installation but with none of the faff. After quickly spinning up a new notebook there I didn't even look back at the local installation.
A Jupyter [Data] Mining Core Project
As per the title and lead of this post, I decided to use Jupyter to have a little fun.
Back in September, while grinding my way through season 11 of Red Dwarf, I began to wonder why it wasn't as funny as it used to be. Had the writing deteriorated? Were the actors past it? Or were these elements still as great as they used to be and it was me who had changed?
I started thinking about ways this could be investigated such as:
- Using IMDB rating as a measure of humour in each episode of Red Dwarf
- Performing sentiment analysis of each episode's transcript to see if the sentiment had changed
- Using word-count to determine whether there was a correlation between character participation and overall humour
Well, it was a funny notion and provided a pleasant distraction from the pretty awful episode of Red Dwarf I was watching at the time. I added it to my "ideas" list in Trello, finished the episode and went to bed.
Yesterday, when I came across Project Jupyter, I knew it'd be a great medium for performing this investigation so I shifted the analysis from "Ideas" to "In progress" and got cracking.
Data Science using F#
Now, while in relation to this investigation I use the term "data science" to basically mean "munging a few numbers and drawing a few graphs", I do think F# makes a fantastic language for the discipline in general. It has some incredible mechanisms for acquiring and cleaning data as well as for parsing natural language. Couple this with its concise, functional, elegant syntax and the ability to leverage components from the entire breadth of the .NET ecosystem and you have quite a significant offering.
The Azure implementation of Project Jupyter is first class and, for now at least, totally free. Getting started is as simple as logging in with Microsoft credentials and then clicking 'Add notebook'. As it's an MS implementation, I used Edge to edit the notebook and found the experience extremely robust, especially given it's a "Preview" program.
In fact I experienced just two issues while authoring my notebook:
- Data Store - It's not currently possible to upload or store data within the Azure notebook library (despite having functionality to do this in the web-interface). Instead you need to host your data on one of a small number of whitelisted sites. Fortunately Github is one of these sites so this doesn't prove to be much of an issue.
- Packages - While Azure Notebooks provides access to a large number of packages "out-of-the-box" (e.g. FSharp.Data, XPlot.Plotly, etc.) it can be tricky to add/use other packages. For example, I wanted to use the XPlot.GoogleCharts package (as it provided trendline capabilities) and ended up having to write a custom display printer for it to work (due to an open issue on Github).
Apart from these issues, authoring and scripting F# in an Azure Notebook was almost as fast as using "F# Interactive". It even provides Intellisense capabilities but, in practice, these are usually too slow to be of actual use.
From Azure Notebooks you're able to download your notebook as a native ".ipynb" file (in fact this is encouraged as MS reserves the right to remove unused notebooks after 60 days). You can then share this file with other people who have Jupyter installed or, preferably, commit it to a repository in Github, which has excellent support for Jupyter Notebooks.
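As a rough illustration of what ends up in that ".ipynb" file: it is plain JSON containing a list of cells, each marked as markdown or code. The sketch below is in Python (the language Jupyter itself is built on) rather than F#, and uses only the standard library. The `summarise_notebook` name and the tiny sample notebook are made up for illustration and are not taken from my notebook.

```python
import json


def summarise_notebook(raw):
    """Count the cells of each type in a notebook given as a JSON string."""
    notebook = json.loads(raw)
    counts = {}
    for cell in notebook.get("cells", []):
        kind = cell.get("cell_type", "unknown")
        counts[kind] = counts.get(kind, 0) + 1
    return counts


# Hypothetical two-cell notebook, just enough to show the layout of the file.
SAMPLE = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["1 + 1"]},
    ],
})

if __name__ == "__main__":
    print(summarise_notebook(SAMPLE))
```

Because the file is just text, it diffs and commits like any other source file, which is part of why a Git repository is a sensible home for it.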
You can find my notebook "A sentiment(al) analysis of why Red Dwarf is no longer funny (to me)" here. As you will see when you click the link, Github not only shows you the text of the notebook but also renders the code cells together with their saved output, without running a kernel itself. This is a "limited rendering only" so Github also provides a link to open the notebook in the 'nbviewer' web-app. This link is shown below:
I had timeboxed my investigation into Project Jupyter and therefore didn't get round to performing an actual sentiment analysis of the content of each episode. However, I did manage to do the following:
- Programmatically download episode information from several sources in JSON format and use JsonValue to dynamically query these sources
- Scrape demographically categorized rating information from IMDB and use HtmlDocument to parse the data into strong types
- Resolve an issue with rendering XPlot.GoogleCharts charts within the notebook and use these charts to provide an interactive visualisation of the decline in rating of Red Dwarf across time and demographic categories.
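To give a feel for the first two steps, here is a minimal sketch. It is not the notebook's code: the notebook uses F# with JsonValue and HtmlDocument, whereas this uses Python's standard library as a stand-in. The JSON shape, the HTML markup, the numbers and every name in it are invented for illustration and do not reflect the real sources.

```python
import json
from dataclasses import dataclass
from html.parser import HTMLParser

# Hypothetical payload; the real sources and their field names are not shown here.
EPISODES_JSON = (
    '{"episodes": [{"season": 1, "title": "Example A"},'
    ' {"season": 1, "title": "Example B"}]}'
)

# Hypothetical markup with made-up values, standing in for a scraped ratings table.
RATINGS_HTML = """
<table>
  <tr><td>Group A</td><td>8.1</td></tr>
  <tr><td>Group B</td><td>7.9</td></tr>
</table>
"""


@dataclass
class DemographicRating:
    demographic: str
    rating: float


def episode_titles(raw):
    """Query the parsed JSON dynamically, by key, with no schema declared up front."""
    document = json.loads(raw)
    return [episode["title"] for episode in document["episodes"]]


class RatingTableParser(HTMLParser):
    """Collects the text of every <td>, grouped by the <tr> it sits in."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def parse_ratings(html):
    """Turn two-column table rows into typed DemographicRating records."""
    parser = RatingTableParser()
    parser.feed(html)
    return [DemographicRating(row[0], float(row[1]))
            for row in parser.rows if len(row) == 2]


if __name__ == "__main__":
    print(episode_titles(EPISODES_JSON))
    print(parse_ratings(RATINGS_HTML))
```

The split mirrors the two bullets: the JSON is queried loosely by key, while the scraped table is converted into a typed record as early as possible so that later charting code does not have to deal with raw strings.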
These steps provided a fair stab at correlating episode rating with the overall decline in Red Dwarf's humorousness but are a long way short of any form of "data science". It was both enlightening and a lot of fun doing this small project and I will certainly consider Azure Notebooks a valuable tool in my toolbox.
Should I find the time, I would certainly like to return to this project and use FParsec and Azure Text Analytics to perform an actual sentiment analysis. Hopefully it'll overturn, or at least help justify, my somewhat disturbing conclusion!