Data Science at Engine Yard
We are in a privileged position when it comes to cloud data. Engine Yard hosts a variety of customers, from large enterprises to one-man-show startups. This means we get access to a wide variety of data representing countless cloud-usage behavioral patterns.
As the business grows, we have started investing in turning the vast volume of data we have into relevant information that can be used to improve our customers’ experience on our platform.
From application performance forecasting to marketing insights, the data we have serves one purpose: improving our customers’ experience and their businesses.
What is data science and what does a data scientist do?
Not unlike Web 2.0 and Ajax, the term data science is slowly becoming a victim of the industry’s hype. Trying to understand what a data scientist does has become increasingly confusing. What are the main tasks? What do they do on a day-to-day basis?
You might often hear the term data anthropologist instead of data scientist. That’s because the tasks of an anthropologist are similar to those of a data scientist:
- Information and patterns are present everywhere, but they have to be uncovered,
- Once you have found a source of information, you need to plan your approach to retrieving it,
- When ready to retrieve information, you retrieve it piece by piece, clean each piece, operate on said pieces and put them back together,
- After having put the pieces back together, you study the set and make judgments through observation,
- You then review and test the collection methods,
- You then make inferences based on the key indicators,
- Then comes the critical task of presenting the findings to other people, including explaining the patterns, trends, cultural traditions, etc.
As a data scientist, my work is fairly similar. My daily tasks are roughly:
- Look at the data we have collected,
- Identify which data we don’t have, and where we can get it from,
- Work on storing that data and, more importantly, being able to retrieve it in a timely and scalable manner,
- Automate the data collection,
- Once data is collected, explore it (this part is very important!),
- Talk to people in the organization to identify relevant questions,
- When you know what you want, prepare your data,
- Clean the data,
- Clean it more,
- Apply a split-apply-combine strategy. If you are familiar with MapReduce, split-apply-combine is very similar: the split and apply steps correspond to the map, and the combine corresponds to the reduce part of a MapReduce process (see the sketch after this list),
- Try merging various sources of information and look for patterns,
- Based on the data, formulate hypotheses, reject or accept them,
- Present your findings to a wider audience; in our case that can be our support team, engineering team, executives and, most importantly, our customers.
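To make the split-apply-combine idea concrete, here is a minimal sketch in R using the plyr package (listed further down) and R’s built-in iris dataset rather than our actual data:

```r
library(plyr)  # CRAN package implementing split-apply-combine

# Split iris by species (the "map"-like step), apply mean() within each
# group, and combine the per-group results back into a single data frame
# (the "reduce"-like step).
ddply(iris, .(Species), summarise,
      mean_sepal_length = mean(Sepal.Length),
      mean_petal_length = mean(Petal.Length))
```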
From that list you can probably tell that a few skills and technologies are required. A well-rounded data scientist should have strong foundations in data analysis, statistics and application development, and be obsessively curious. I would even add that basic economics and general business knowledge will help you convey your findings and potential decisions to a broader range of decision-makers.
Technologies We Use
The technologies we use internally are as diverse as the skills we consider necessary to be a data scientist. We’ve therefore compiled a list of what we use and what each is for.
- R: R is the lingua franca of today’s statisticians. It is a platform for statistical computing. It’s open source and runs on Unix, OS X, and Windows. More importantly, it has very powerful packages to help with statistical analysis and graphical visualisation of the analyses.
- Python: Python is easy to learn, it’s clean, it has a strong academic and scientific community, and it therefore has excellent scientific computing packages.
- NumPy: NumPy is the fundamental package for scientific computing with Python. It provides an N-dimensional array/matrix object and very useful capabilities for linear algebra, Fourier transforms, and much more.
- SciPy: SciPy (pronounced “Sigh Pie”) is open-source software for mathematics, science, and engineering. It provides efficient and user-friendly numerical routines and is designed to work with NumPy data-structures, namely array objects.
- NLTK: The Natural Language ToolKit for Python. NLTK is a leading platform for building Python programs to work with human language data. If you do anything in computational linguistics, you probably want to take a look at NLTK.
- CouchDB: CouchDB is a database that uses JSON documents and JavaScript for MapReduce queries, and makes the results and processing available through a truly RESTful HTTP API. CouchDB has a VERY appealing replication architecture which we leverage. In our case, we use CouchDB to collect and process/index data locally on development machines, then replicate the data to a central cluster (a rough sketch of kicking off such a replication from R appears at the end of this section). With some internal magic, our CouchDB data is then fed to our Hadoop-and-friends setup. We use CouchDB and R together a lot for data sampling and local processing.
- Hadoop: Hadoop and its family have become the focal point of most “big data” discussions in the industry. It is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. There are various projects from the Hadoop family that we use and that you might find interesting: ZooKeeper, Pig, Hive, HBase and Mahout.
- Mahout: Mahout allows us to leverage the power of MapReduce with Hadoop to perform machine learning and data mining tasks. It has powerful clustering and classification core algorithms.
- D3: D3.js is a JavaScript library for manipulating documents based on data. We are currently experimenting with it and aren’t doing anything serious with it yet. We are looking at ways to link R with D3.js. It has been said that future iterations of the ggplot2 R package might use D3.js for interactivity. This is exciting :–)
- Various R Packages: We use a large number of R packages when performing data analysis and visualisation, so here is a sub-list of a few packages we use:
- RStudio: Whilst RStudio is not a package per se, it is a great IDE for developing R and generating plots on the fly, allowing one to visualise the output rapidly.
- plyr: A CRAN R package to split, apply and combine data.
- reshape: Reshape is hands-down the most useful package to me personally. It allows me to completely remove the shape of data and give it a new shape. Read about it, learn it, it’ll save you tons of time.
- ggplot2: ggplot2 is a very powerful plotting system for R.
- RCurl and RJSONIO: Used together, these packages allow you to connect to any HTTP API and parse its JSON responses into native R objects, which you can then reshape and split-apply-combine (see the sketch after this list).
- Hmisc: The Hmisc library contains many functions useful for data analysis, high-level graphics, utility operations, computing sample size and power, importing datasets, imputing missing values, etc.
- devtools: This is simply a package that makes developing R a lot simpler. It allows you to install non-peer-reviewed code straight from GitHub and experiment immediately.
- lattice: Lattice is a powerful and elegant high-level data visualization system, with an emphasis on multivariate data. We use this for graphical analyses with non-trivial multivariate requirements.
- lubridate: As a data scientist, you will inevitably have to deal with time series, and time series contain timestamps and dates. As an application developer, I’m certain there’s no need to explain the need for a good library to handle, recognise, fix and parse date fields. This is what lubridate does: it makes it easy to work with dates and times.
- forecast: When working with time series (already?!) and linear models, this package will allow you to run basic univariate forecasting methods such as exponential smoothing via state space models and automatic ARIMA modelling.
- quantmod: This is a package we use for quantitative financial modelling. It is useful for analysing markets and their trends.
- PerformanceAnalytics: Econometric tools for performance and risk analysis. With some fiddling, one can build graphical visualisation of risk and performance analysis for servers and cloud utilisation.
- The following packages are for fun and getting your feet wet with R:
- RXKCD: Interface with XKCD from R. See this blog post for more information about the RXKCD package.
- twitteR: Interface with the Twitter HTTP API from R.
- fun: The FUN R package. Play games from R directly :–). See this blog post for examples.
- The following packages are for those who are more mathematically inclined and generally curious:
- msm: This is a package that contains functions for fitting general continuous-time Markov and Hidden Markov multi-state models to longitudinal data.
- deSolve: This is another very useful package if you are working with ordinary differential equations (ODE), partial differential equations (PDE), or differential algebraic equations (DAE). It is essentially a solver for ODEs, PDEs, DAEs and delay differential equations (DDEs).
- rugarch: This is a package for univariate GARCH models, covering ARFIMA, in-mean and external-regressor variants and various GARCH flavours, with methods for fitting, forecasting, simulation, inference and plotting.
- KFAS: This package provides functions for Kalman filtering, smoothing, forecasting and simulation of Gaussian, Poisson and Binomial state space models with exact diffuse initialization when distributions of some or all elements of initial state vector are unknown. If you need to do multivariate analyses, this is likely to come in handy.
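To give a feel for how a few of these packages fit together, here is a hedged sketch of a small, typical pipeline: fetch JSON metrics over HTTP with RCurl and RJSONIO, parse timestamps with lubridate, reshape into daily aggregates, then plot and forecast. The endpoint URL, field names and instance name are invented for illustration.

```r
library(RCurl)      # talk to HTTP APIs
library(RJSONIO)    # parse JSON responses
library(reshape)    # melt / cast
library(lubridate)  # parse and bucket timestamps
library(ggplot2)    # plotting
library(forecast)   # univariate forecasting

# Hypothetical internal endpoint returning per-instance metrics as JSON, e.g.
# [{"instance":"app1","ts":"2013-05-01T12:00:00Z","cpu":0.42,"mem":0.81}, ...]
raw  <- getURL("http://metrics.internal.example/api/instances.json")
recs <- fromJSON(raw)

# Flatten the list of records into a data frame and parse the timestamps.
metrics <- data.frame(
  instance = sapply(recs, `[[`, "instance"),
  ts       = ymd_hms(sapply(recs, `[[`, "ts")),
  cpu      = as.numeric(sapply(recs, `[[`, "cpu")),
  mem      = as.numeric(sapply(recs, `[[`, "mem"))
)

# Melt into long form (instance and ts are the id columns), bucket by day,
# and cast back into daily means per instance and measure --
# split-apply-combine again.
long     <- melt(metrics, c("instance", "ts"))
long$day <- floor_date(long$ts, "day")
long$ts  <- NULL
daily    <- as.data.frame(cast(long, instance + day ~ variable, mean))

# Quick visual check, then a two-week forecast of one instance's daily CPU.
ggplot(daily, aes(day, cpu, colour = instance)) + geom_line()
plot(forecast(auto.arima(ts(daily$cpu[daily$instance == "app1"])), h = 14))
```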
There is of course a lot more technology behind it all, but these are the broad lines of our data science initiative.
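As one last illustration of the CouchDB piece mentioned above, this is roughly how a push from a development machine to a central cluster could be kicked off from R via CouchDB’s standard /_replicate endpoint. It is a hedged sketch: the host names and database name are invented, and our actual setup involves more internal tooling.

```r
library(RCurl)    # HTTP from R
library(RJSONIO)  # build the JSON request body

# Hypothetical hosts: a CouchDB on the local workstation and the central cluster.
local_couch   <- "http://127.0.0.1:5984"
central_couch <- "http://couch.example.com:5984"

# Ask the local CouchDB to continuously replicate a (made-up) database
# to the central cluster via the standard /_replicate endpoint.
body <- toJSON(list(
  source     = paste(local_couch,   "usage_events", sep = "/"),
  target     = paste(central_couch, "usage_events", sep = "/"),
  continuous = TRUE
))

resp <- getURL(paste(local_couch, "_replicate", sep = "/"),
               customrequest = "POST",
               postfields    = body,
               httpheader    = c("Content-Type" = "application/json"))

fromJSON(resp)  # should contain ok = TRUE once replication has been accepted
```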
Reality of Data Science
Even though being a data scientist has been proclaimed one of the sexiest jobs of the 21st century, there are a few rarely mentioned facts about the job:
- There’s nothing glamorous about it. You are not a rockstar, but a janitor. You will clean a lot of data. A LOT.
- Most people don’t and won’t care about data. That’s right.
- There is a lot of mathematics involved. Get over it.
- Some people will get offended by your findings. People are volatile, especially when what the data says differs from their preconceived beliefs.
- If you are awesome at technology and maths but can’t communicate, being a data scientist is probably not for you.
- All the business lingo you hear in the economic news and all the crazy hyped buzzwords people use may very well become part of your vocabulary, as you may have to use them when conveying your findings. Then again, you might not.
- Some days will be long. Very long. And you won’t find a thing. Get more coffee.
- You will read that classical statistics is outdated for today’s needs. That may be true; however, classical statistics provides strong foundations and gets you well versed in the cryptic world and vocabulary of the statistician.
- Not everything is about Big Data. In fact, learning how to sample is a key skill you’ll need. Proper sampling has a LOT of benefits, and a well-drawn sample can be very representative of your actual dataset (see the short sketch below).
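As a toy illustration of that last point, here is a quick sketch with made-up data showing how closely a 1% simple random sample can track the distribution of the full dataset:

```r
set.seed(42)

# A (made-up) "population" of one million skewed observations,
# e.g. response times in milliseconds.
population <- rlnorm(1e6, meanlog = 3, sdlog = 1)

# A 1% simple random sample.
sampled <- sample(population, size = 1e4)

# The sample's quantiles track the population's closely.
round(quantile(population, c(0.5, 0.9, 0.99)), 1)
round(quantile(sampled,    c(0.5, 0.9, 0.99)), 1)
```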
The most important tip of all
Whenever you read about data science or data analysis, it’s about the ability to store petabytes of data, retrieve that data in nanoseconds, turn it into a rainbow with a unicorn dancing on it, etc. There is one aspect that is consistently missing from those articles and blog posts: when you have data, any amount of data, how do you identify which questions are relevant to you?
I once heard an economist friend of mine say the following about the field of economics:
Economics is a field with all the answers and none of the questions.
That observation, in my opinion, applies surprisingly well to the field of data science. If there is one important tip, it’s this:
Spend time finding the right questions. Answers are easy to find once you have a question.
Closing Note
We hope this will be useful to you and we’d love to hear from you about topics of interest and questions about our data science initiative at Engine Yard.
Share your thoughts with @engineyard on Twitter