
Author: Thomas Wiecki

Published: November 18, 2014

Python has an extremely rich and healthy ecosystem of data science tools. Unfortunately, to outsiders this ecosystem can look like a jungle (cue snake joke). In this blog post I will provide a step-by-step guide to venturing into this PyData jungle.

What’s wrong with the many lists of PyData packages already out there, you might ask? I think that providing too many options can easily overwhelm someone who is just getting started. So instead, I will keep a very narrow scope and focus on the 10% of tools that allow you to do 90% of the work. After you have mastered these essentials, you can browse the long lists of PyData packages to decide which to try next.

The upside is that the few tools I will introduce already allow you to do most things a data scientist does in their day-to-day work (i.e. data I/O, data munging, and data analysis).

It has happened quite a few times that people came up to me and said: “I heard Python is amazing for data science, so I wanted to start learning it, but I spent two days installing Python and all the other modules!” It’s quite reasonable to think that you have to install Python if you want to use it, but installing the full PyData stack manually, when you don’t know which tools you actually need, is quite an undertaking. So I strongly recommend against doing that.

Fortunately for you, the fine folks at Continuum have created the Anaconda Python distribution that installs most of the PyData stack for you, and the modules it doesn’t provide out of the box can easily be installed via a GUI. The distribution is also available for all major platforms, so save yourself the two days and just use that!

After Python is installed, most people start by launching it. Again, very reasonable, but unfortunately dead wrong. I don’t know a single SciPythonista who uses the Python command shell directly (YMMV). Instead, IPython, and specifically the IPython Notebook, is an incredibly powerful Python shell that is used ubiquitously in PyData. I strongly recommend you start directly with the IPython Notebook (IPyNB) and don’t bother with anything else; you won’t regret it. In brief, the IPyNB is a Python shell that you access via your web browser. It allows you to mix code, text, and graphics (even interactive ones). This blog post was written in an IPyNB, and it’s rare to find a talk at a Python conference that does not use the IPython Notebook. It comes preinstalled with Anaconda, so you can just start using it. Here’s a minimal example of what a code cell and its output look like:
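```
print('Hello World')
```

Running the cell shows its output directly beneath it:

```
Hello World
```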

This thing is a rocket – every time I hear one of the core devs talk at a conference, I am flabbergasted by all the new things they have cooked up. To get an idea of some of the more advanced capabilities, check out this short tutorial on IPython widgets, which allow you to attach sliders to control a plot interactively:
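The widget code itself isn’t reproduced here, but a minimal sketch looks roughly like this (assuming IPython 2.x, where the widgets live under `IPython.html.widgets`; in later releases they moved to the standalone `ipywidgets` package):

```
# Render plots inside the notebook
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# A sketch, not code from the original post. In IPython 2.x the
# import below works; later releases use: from ipywidgets import interact
from IPython.html.widgets import interact

def plot_sine(freq=1.0):
    x = np.linspace(0, 2 * np.pi, 200)
    plt.plot(x, np.sin(freq * x))

# interact() inspects the keyword argument and renders a slider;
# dragging it redraws the plot with the chosen frequency.
interact(plot_sine, freq=(0.5, 5.0))
```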

Normally, people recommend you start by learning `NumPy` (pronounced num-pie, not num-pee!), the library that provides multi-dimensional arrays. Certainly this was the way to go a few years ago, but I hardly use `NumPy` directly at all today. The reason is that `NumPy` has become more of a core library that other libraries build on to provide a much nicer interface. Thus, the main library to use for working with data is `Pandas`. It can read and write data in all kinds of formats (including databases), do joins and other SQL-like operations for reshaping data, handle missing values with ease, supports time series, has basic plotting and statistical functionality, and much more. There is certainly a learning curve to all its features, but I strongly suggest you go through most of the documentation as a first step. Trust me, the time you invest will be repaid a thousandfold by more efficient data munging. Here are a few quick tricks to whet your appetite:
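The construction code for the example frame isn’t shown in this post, but a sketch of how such a `DataFrame` might be built looks like this (the values are read off the output below):

```
import pandas as pd

# A sketch reconstructing the frame shown below; the original
# construction code is not shown, so the values are inferred
# from the printed output. Scalars are broadcast to all rows.
df = pd.DataFrame({'A': 1,
                   'B': pd.Timestamp('2013-01-02'),
                   'C': 1,
                   'D': [1, 2, 1, 2],
                   'E': ['test', 'train', 'test', 'train'],
                   'F': 'foo'})
df
```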

|   | A | B          | C | D | E     | F   |
|---|---|------------|---|---|-------|-----|
| 0 | 1 | 2013-01-02 | 1 | 1 | test  | foo |
| 1 | 1 | 2013-01-02 | 1 | 2 | train | foo |
| 2 | 1 | 2013-01-02 | 1 | 1 | test  | foo |
| 3 | 1 | 2013-01-02 | 1 | 2 | train | foo |

Columns can be accessed by name:
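For example, with the `df` sketched above:

```
# Both forms return the column as a pandas Series
df['E']
df.E
```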

Compute the sum of `D` for each category in `E`:
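With `groupby`, this is a one-liner:

```
# Split the rows by the categories in E, then sum D within each group
df.groupby('E')['D'].sum()
```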

Doing this in `NumPy` (or *gasp* `Matlab`!) would be much clunkier.

There’s a ton more. If you’re not convinced, check out 10 minutes to pandas, from which I borrowed these examples.

The main plotting library of Python is `Matplotlib`. However, I don’t recommend using it directly, for the same reason I don’t recommend spending time learning `NumPy` initially. While `Matplotlib` is very powerful, it is its own jungle and requires lots of tweaking to make your plots look shiny. So instead, I recommend starting with `Seaborn`. `Seaborn` essentially treats `Matplotlib` as a core library (just like `Pandas` does with `NumPy`). I will briefly illustrate the main benefits of `Seaborn`. Specifically, it:

- creates aesthetically pleasing plots by default (for one thing, it does not default to the jet colormap),
- creates statistically meaningful plots, and
- understands the `pandas` `DataFrame`, so the two work well together.

While `pandas` comes prepackaged with `anaconda`, `seaborn` is not directly included but can easily be installed with `conda install seaborn`.

```
import seaborn as sns

# Load one of the data sets that come with seaborn
tips = sns.load_dataset("tips")

# Scatter plot of tip vs. total bill with a regression fit,
# marginal distributions, and the correlation coefficient
sns.jointplot("total_bill", "tip", tips, kind='reg');
```

As you can see, with just one line we create a pretty complex statistical plot, including the best-fitting linear regression line along with confidence intervals, marginals, and the correlation coefficient. Recreating this plot in `matplotlib` would take quite a bit of (ugly) code, including calls to `scipy` to run the linear regression and manually applying the linear regression formula to draw the line (and I don’t even know how to do the marginal plots and confidence intervals off the top of my head). This and the next example are taken from the tutorial on quantitative linear models.


Data has structure. Often, there are different groups or categories we are interested in (`pandas`’ `groupby` functionality is amazing in this case). For example, the `tips` data set looks like this:
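The first few rows can be displayed with `head()` (using the `tips` frame loaded above):

```
# Peek at the first five rows of the tips DataFrame
tips.head()
```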

|   | total_bill | tip  | sex    | smoker | day | time   | size |
|---|------------|------|--------|--------|-----|--------|------|
| 0 | 16.99      | 1.01 | Female | No     | Sun | Dinner | 2    |
| 1 | 10.34      | 1.66 | Male   | No     | Sun | Dinner | 3    |
| 2 | 21.01      | 3.50 | Male   | No     | Sun | Dinner | 3    |
| 3 | 23.68      | 3.31 | Male   | No     | Sun | Dinner | 2    |
| 4 | 24.59      | 3.61 | Female | No     | Sun | Dinner | 4    |

We might ask whether smokers tip differently than non-smokers. Without `seaborn`, this would require a `pandas` `groupby` together with the complex plotting code for a linear regression. With `seaborn`, we can simply pass the column name we wish to split by as the `col` keyword argument:
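The original plotting cell isn’t reproduced here, but with `seaborn`’s `lmplot` it would look something like this (using the positional signature of 2014-era seaborn; recent versions require keyword arguments, e.g. `sns.lmplot(data=tips, x="total_bill", y="tip", col="smoker")`):

```
# A sketch, not the original cell: one regression panel per
# category of the 'smoker' column
sns.lmplot("total_bill", "tip", tips, col="smoker")
```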

Pretty neat, eh?

As you dive deeper, you might want to control certain details of these plots at a more fine-grained level. Because `seaborn` is just calling into `matplotlib`, you will probably want to start learning that library at that point. For most things, however, I’m pretty happy with what `seaborn` provides.

The idea of this blog post was to provide a small, select set of packages that maximizes your efficiency when starting with data science in Python. If you want to dig deeper, here are a few useful resources:

- Curated decibans of scientific programming resources in Python – very comprehensive list of Python modules.
- Python for Data Analysis – the book by Wes McKinney (Pandas’ creator).
- The Open Source Data Science Masters – many useful links to various resources.

Thanks to Katie Green and Andreas Dietzel for helpful feedback on an earlier draft.