Ryan's Data Blog

Follow me on GitHub

Exploring Baby Names and what's next

Added a Baby Names EDA notebook, exploring baby names in the United States since 1880, using a dataset downloaded from Kaggle. Be forewarned: this notebook is a little messy; I will be going back through to delete commented-out code and add more explanation of what is going on throughout. I found some interesting bits of information, though; baby names are much more unique now! If you were female and born before 1950 (and after 1880) you were probably named Mary, and if male, probably Michael. In 1950 the top name for boys and for girls each made up about 5% of the population (in this data), compared to only about 1% in 2014! Parents are getting more creative as well, with bunches of names never seen before 1990 popping into the top 10 during the 90s. Obviously I was curious about my own name, which predictably has a bump a few years before I was born, but has been on a steady decline since :-/. I compared my name against some others; biblical names have been popular since 1880 (the beginning of this dataset) and seem to stand near or at the top of the most popular names.

Messed around with some Plotly plots in this one: a time series with a slider, scatter plots, line graphs. I really love how easy it is to incorporate these interactives into Jupyter, rather than needing to extract the data to a different application; they deserve more love! It's also really nice that you can toggle and dive into the data right in the plots, which makes the visualizations much more informative. Still used some nice static matplotlib plots too (need for speed). A fun and interesting weekend of data exploring.
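For anyone curious, an interactive time series with a range slider is only a few lines of Plotly. This is just a sketch of the idea; the file name and column names here are my assumptions about the Kaggle download, not the notebook's actual code:

```python
import pandas as pd
import plotly.express as px

# Assumed columns: Year, Name, Sex, Count (check your Kaggle download).
names = pd.read_csv("NationalNames.csv")

# Share of all births that the single top name captures, per year and sex.
totals = names.groupby(["Year", "Sex"])["Count"].sum().rename("Total")
top = names.sort_values("Count", ascending=False).groupby(["Year", "Sex"]).head(1)
top = top.join(totals, on=["Year", "Sex"])
top["Share"] = top["Count"] / top["Total"]
top = top.sort_values("Year")

# Interactive time series with a range slider, rendered inline in Jupyter.
fig = px.line(top, x="Year", y="Share", color="Sex", hover_name="Name")
fig.update_xaxes(rangeslider_visible=True)
fig.show()
```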

Currently I am taking an AI course from Columbia University, learning search algorithms. I may upload the projects to GitHub (after the course is finished), but they are not very interesting thus far: implementing breadth-first and depth-first search, A*, and IDA* on the 8-puzzle problem (a rough sketch of the BFS flavor follows below). For the next data project I am thinking of doing something like "around the world in data": every few days (each week, maybe) find some interesting data about a country from around the world, most likely economic or health data. It would be interesting to check out the major imports/exports of countries around the world, with a bit of background on each good or service. Stay tuned!
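Here's that sketch: a bare-bones breadth-first search on the 8-puzzle. This is my own toy version for flavor, not the course's code:

```python
from collections import deque

GOAL = (0, 1, 2, 3, 4, 5, 6, 7, 8)  # 0 is the blank tile

def neighbors(state):
    """Yield states reachable by sliding a tile into the blank."""
    i = state.index(0)
    r, c = divmod(i, 3)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < 3 and 0 <= nc < 3:
            j = nr * 3 + nc
            s = list(state)
            s[i], s[j] = s[j], s[i]
            yield tuple(s)

def bfs(start):
    """Return the shortest list of states from start to GOAL, or None."""
    frontier = deque([start])
    parent = {start: None}
    while frontier:
        state = frontier.popleft()
        if state == GOAL:
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in neighbors(state):
            if nxt not in parent:
                parent[nxt] = state
                frontier.append(nxt)
    return None

print(bfs((1, 0, 2, 3, 4, 5, 6, 7, 8)))  # one slide away from GOAL
```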

Robot Localization

Estimate and predict the robot's location, given observation values from a sensor.

Fun project, and a ton of trial and error with the initial algorithms! Here is what the program does: given observations from a noisy sensor, locate and estimate the robot's hidden location. The sensor provides a distribution of probable values, and initially the robot is equally likely to be in any location on the grid. The robot has a transition model with a distribution, which is given in the robot.py file.

Treating the sequence of hidden states as a hidden Markov model (HMM), I implement the forward-backward algorithm: take the distribution of the current observation (phi) together with the previous message (the state from the previous movement; on the initial step, the prior distribution), then sum the transition-model distribution over each of the possible values, weighted by phi times the previous message. This becomes the message for the next stage; iterating over all observations gives us the forward messages. The process for the backward messages is similar, using the transpose (the columns) of the transition distribution matrix. Once we have the forward and backward messages, we loop through the observations once more, calculating the marginal for each one by multiplying the messages together with that observation's distribution: backward_message[i] * forward_message[i] * phi.

Next I implemented MAP estimation for HMMs using the Viterbi algorithm. For Viterbi I used min-sum for HMMs, where each message is the minimum over states of the sum of the $-\log_2$ of the previous message, phi (observation), and psi (transition), storing the argmin at each step as the traceback messages Viterbi needs. Once we have minimized over all possible states, we take the argmin of the last message and, following our traceback messages, recover the MAP estimate.
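To make the above concrete, here is a toy numpy sketch of both passes. The two-state HMM, transition matrix, and observation sequence are made up for illustration; they are not the project's grid or sensor model:

```python
import numpy as np

# A[i, j] = P(next = j | current = i); B[i, k] = P(obs = k | state = i);
# pi = prior over initial states. All values here are invented.
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])
obs = [0, 0, 1]  # observed sensor readings
n = len(obs)

# Forward messages: phi (observation likelihood) times the propagated
# previous message; normalizing each step keeps the numbers from underflowing.
fwd = np.zeros((n, 2))
fwd[0] = pi * B[:, obs[0]]
fwd[0] /= fwd[0].sum()
for t in range(1, n):
    fwd[t] = B[:, obs[t]] * (fwd[t - 1] @ A)
    fwd[t] /= fwd[t].sum()

# Backward messages: the same recursion run in reverse, using A's transpose.
bwd = np.ones((n, 2))
for t in range(n - 2, -1, -1):
    bwd[t] = A @ (B[:, obs[t + 1]] * bwd[t + 1])
    bwd[t] /= bwd[t].sum()

# Marginals: forward times backward, renormalized per time step.
marginals = fwd * bwd
marginals /= marginals.sum(axis=1, keepdims=True)

# Viterbi via min-sum on -log2 probabilities, storing traceback pointers.
logA, logB, logpi = -np.log2(A), -np.log2(B), -np.log2(pi)
msg = logpi + logB[:, obs[0]]
back = []
for t in range(1, n):
    total = msg[:, None] + logA          # cost of each (prev, cur) pair
    back.append(total.argmin(axis=0))    # best predecessor for each state
    msg = total.min(axis=0) + logB[:, obs[t]]

# Trace back from the best final state to recover the MAP sequence.
path = [int(msg.argmin())]
for ptr in reversed(back):
    path.append(int(ptr[path[-1]]))
map_estimate = path[::-1]

print(marginals)
print(map_estimate)
```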

The struggle was trying to implement forward-backward using dictionaries rather than numpy arrays. I eventually got a forward-backward implementation working that way, but once I got to Viterbi I knew it would be easier to work with matrices. I reworked forward-backward using matrices, which made the implementation rather simple. Using log-scale is a must when dealing with these probabilities; even in this limited state space, underflow was an issue, as I quickly realized (quick demonstration below). I worked two pet examples of both Viterbi and forward-backward, which helped with intuition and simplified the bigger project. This is not implemented in a notebook; similar to the last project, I will try to put together something that renders nicely to follow along, with some screenshots or videos of the robot graphics, which show the estimates as the robot moves around the grid. The next project is a spam/ham email classifier. A classic ML project, but I will be implementing it from scratch, similar to the recommender and localization. Robot Localization repo.
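That underflow point is easy to demonstrate:

```python
import numpy as np

p = np.full(500, 0.05)      # 500 small probabilities
print(np.prod(p))           # 0.0 -- the product underflows float64 entirely
print(np.sum(np.log(p)))    # about -1497.9 -- the log-space sum stays usable
```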

A working movie recommender!

Predict the rating of a new (unseen) movie, X, given a set of ratings for movies already seen.

How it's done: putting information gain and entropy to use. Given ratings for a set of movies, compute the posterior distribution: find the likelihood of each candidate distribution given the observation (a movie rating). Once we have the likelihood distribution, we use Bayesian inference to get a posterior distribution over each possible rating, and its likelihood, for each movie in the dataset. In addition we provide a MAP (maximum a posteriori) estimate for each movie, which is really what would be shown to an end user; the full distribution is interesting, though. Once we have the distribution of possible ratings for each movie, I implement an entropy function, which gives us the entropy (randomness) of each of the posteriors. Movie recommender repo.
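A toy sketch of that inference step. The rating scale, prior, and noise model here are all made up for illustration; the real model lives in the repo:

```python
import numpy as np

ratings = np.array([-2, -1, 0, 1, 2])   # hypothetical rating scale
prior = np.full(5, 0.2)                 # uniform prior over ratings

def likelihood(observed):
    """Hypothetical noise model: mass concentrated near the observed rating."""
    dist = np.exp(-np.abs(ratings - observed))
    return dist / dist.sum()

def posterior(observed):
    """Bayes' rule: posterior proportional to likelihood times prior."""
    post = likelihood(observed) * prior
    return post / post.sum()

def entropy(dist):
    """Shannon entropy in bits; 0 * log 0 treated as 0."""
    nz = dist[dist > 0]
    return -np.sum(nz * np.log2(nz))

post = posterior(1)
map_rating = ratings[post.argmax()]     # what an end user would actually see
print(post, map_rating, entropy(post))
```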

I did not work this one in a notebook, so what's posted is not as easily read as the notebooks posted earlier. However, I will work to convert it into a notebook with outputs that explain the steps taken, along with screenshots from the running program. Next up is robot localization; cool stuff coming here! Localization is a foundation of self-driving cars: getting observations from sensors to predict the location of the car relative to the objects it sees. Stay tuned.

Probability!

This week was filled with probability! With the movie recommendation project in the works, re-affirming Bayesian inference was a must. Although some of this notebook (Bayes Inference...) was a bit of review, I'm not yet at a point where probability, especially conditioning on multiple events, is super intuitive, but Python definitely helps to build insight. Expectations are weighted averages, not a difficult topic, but the linearity and total-expectation properties let more difficult problems become simple; variance should always be calculated using expectations. I have been here before with random variables, the real meat of probability! I solve (with notes in the notebook) problems similar to the hat problem and the coupon collector, really only using discrete random variables and PMFs in this notebook (a quick coupon-collector check below).
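Here's that coupon-collector check: linearity of expectation gives the answer in one line, and a quick simulation agrees. My own sketch, not the notebook's code:

```python
import random

# Expected draws to see all n coupons: by linearity of expectation this is
# n * (1/1 + 1/2 + ... + 1/n), no simulation required.
n = 10
exact = n * sum(1 / k for k in range(1, n + 1))

def draws_to_collect(n):
    """Draw uniformly until every coupon has been seen at least once."""
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        draws += 1
    return draws

simulated = sum(draws_to_collect(n) for _ in range(100000)) / 100000
print(exact, simulated)   # both come out around 29.29 for n = 10
```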

Getting into Measuring Randomness, I explore what will really contribute to the movie recommender: using entropy to measure the information gained from each set of ratings, and divergence to compare the true values to the predicted values. In the Measuring Randomness notebook I go through calculating entropy, divergence, and mutual information, and play a bit with geometric distributions. With all three of these measurements being rooted in machine learning (e.g. decision trees and information gain), it is great to get some real practice using entropy and divergence at a lower level than simply reading the scikit-learn docs! Information theory is a fascinating subject, and measuring randomness is really the foundation of the entire field.
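For reference, entropy and KL divergence are only a few lines of numpy each. This is a generic sketch, not the notebook's exact code:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) in bits."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q):
    """D(p || q) in bits; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

fair = np.array([0.5, 0.5])
biased = np.array([0.9, 0.1])
print(entropy(fair))                # 1.0 bit -- maximal for two outcomes
print(entropy(biased))              # ~0.469 bits -- less random
print(kl_divergence(biased, fair))  # ~0.531 bits -- cost of assuming fair
```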

I may upload some older data explorations, though some are pretty rough and all over the place. The movie recommender mini-project is basically done; I will upload it to GitHub once finished. I've been looking into the Quantum Mechanics course on Stanford Online; it has been great so far, highly recommended. Also, Mining Massive Datasets is about to open up, so I may venture into that. The next (new) post will include more probability, using undirected graphical models as data structures for probability distributions. The next project is robot localization!

First Post

Welcome! I will be using this blog to post data reports, explorations, and tutorials. The links along the right side should render as HTML, with the code available to download from the GitHub repo. I will post summaries of new explorations along with their data sources, hoping to get at least a couple in each week, so stay tuned. Any questions or comments, email me: slootryan at gmail dot com


The datasets already explored include a couple of Spark notebooks, where, using Databricks, I explored word counts in Shakespeare's works (roughly sketched below) and log data with Apache Spark. It's crazy how fast Spark is compared to trying to do the same projects on my machine; both were basically not possible locally! The R markdowns include a couple of problem sets from Udacity's Data Analysis course (L4, L5, L6); the other two were tangents while taking the course: an exploration of the classic Diamonds dataset, focusing on a bunch of visualizations, and an analysis of global food production per person. The newest (most complete) data I could get was from 2007, and I saw some interesting trends in the DRC, the US, and globally. I delved into some more advanced visualizations in the food supply analysis with Plotly.
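The Spark word count is the canonical first example, and it looks roughly like this. A sketch only; the file path is a placeholder, not the Databricks notebook's:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shakespeare-wordcount").getOrCreate()

counts = (
    spark.sparkContext.textFile("shakespeare.txt")   # placeholder path
    .flatMap(lambda line: line.lower().split())      # one record per word
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)                 # counted in parallel
    .sortBy(lambda pair: -pair[1])
)
print(counts.take(10))  # the ten most frequent words
```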

I have a few more that I may upload, but they are not yet complete. New posts coming up should be more interesting, and will include more mathematical material (i.e. probabilistic) and more machine learning.