Python Setup for Data Science

I recently had to set up Python on a new laptop (a MacBook Pro running macOS High Sierra) and found myself scrambling to look up the install instructions again, wishing I had documented them somewhere. For future reference, here are the steps:

1. Python

macOS High Sierra comes pre-installed with Python 2.7. Since that is the version I plan to use, I found no reason to upgrade to Python 3.

To avoid permission denied errors when installing Python packages, change the ownership of your Python folder as below:

sudo chown -R $USER /Library/Python/2.7

Quite a few Stack Overflow threads suggest running sudo pip install instead of pip install to avoid the permission denied errors. This is not recommended because of the security risks, and I also find it somewhat annoying. Changing the ownership of the Python directory is a cleaner way around this.
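To see whether the ownership change took effect, a small check along these lines may help. It is only a sketch: the helper names writable_without_sudo and report_site_dirs are made up for illustration, and it simply asks the running interpreter which package directories it uses and whether the current user can write to them. It should run on both Python 2.7 and Python 3.

```python
import os
import site
import sys


def writable_without_sudo(path):
    """Return True if `path` is a directory the current user can write into."""
    return os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)


def report_site_dirs():
    """Return (directory, writable) pairs for the interpreter's package dirs."""
    # site.getsitepackages() may be missing inside older virtualenv builds,
    # so fall back to scanning sys.path for "...-packages" directories.
    if hasattr(site, "getsitepackages"):
        dirs = site.getsitepackages()
    else:
        dirs = [p for p in sys.path if p.endswith("-packages")]
    return [(d, writable_without_sudo(d)) for d in dirs]


if __name__ == "__main__":
    for directory, ok in report_site_dirs():
        print("%-60s %s" % (directory, "writable" if ok else "not writable"))
```

If a directory is reported as not writable, a plain pip install into it would be expected to fail with a permission error.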

2. pip

pip helps with installing and updating Python packages.

  • Download the get-pip.py file from here
  • Run the following on your terminal:
    python get-pip.py

There are multiple applications to install and manage packages for Python, such as conda, easy_install, or pip. I prefer pip. As a general rule of thumb, it is best to stick to one of them rather than install some packages using pip and others using easy_install (say). Python packages often have dependencies on other packages, and using different package managers can sometimes cause problems.

3. matplotlib, numpy, scipy

These come pre-installed with macOS. matplotlib is needed for plots and visualization. numpy and scipy are required for scientific computing.

4. Jupyter

Jupyter is required for running your python code and analysis in interactive notebooks

Install as below:

python -m pip install --user jupyter

Open your ~/.bashrc file and add your python location to PATH:

export PATH="$HOME/Library/Python/2.7/bin:$PATH"

Run the following on your terminal

source ~/.bashrc

Note: The --user option in the install command above is there because of an OS error I encountered when running pip install without it: OSError: [Errno 1] Operation not permitted: '/System/Library/Frameworks/Python.framework/Versions/2.7/share'. By default, pip install attempts to install the package system-wide, for all users of the computer, which causes the above error. It is recommended to install packages locally for the user instead. We will keep this convention of adding --user to pip install for the subsequent installs.
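To find out where per-user installs end up on a given machine, Python's standard site module can report the locations. The sketch below uses two hypothetical helpers, user_install_locations and on_path. It also assumes that command-line scripts land in a bin folder under the user base, which matches the PATH entry above but may differ on other platforms.

```python
import os
import site


def user_install_locations():
    """Return the per-user base, site-packages, and scripts directories."""
    base = site.getuserbase()              # root of the per-user install tree
    packages = site.getusersitepackages()  # where user-level packages are placed
    scripts = os.path.join(base, "bin")    # assumption: POSIX-style layout
    return base, packages, scripts


def on_path(directory):
    """Return True if `directory` is listed in the PATH environment variable."""
    entries = os.environ.get("PATH", "").split(os.pathsep)
    target = os.path.normpath(directory)
    return any(os.path.normpath(e) == target for e in entries if e)


if __name__ == "__main__":
    base, packages, scripts = user_install_locations()
    print("user base:     %s" % base)
    print("user packages: %s" % packages)
    print("user scripts:  %s (on PATH: %s)" % (scripts, on_path(scripts)))
```

If the scripts directory is reported as not on PATH, commands such as jupyter will likely not be found until the export line above has been added and sourced.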

5. virtualenv

virtualenv is used to create virtual environments. It is a lot cleaner to work in virtual environments, because that way we don’t mess up our standard Python distribution as we pile on packages. By default, a new virtual environment starts out isolated from the packages installed earlier (pass --system-site-packages when creating it if you want them visible). All installs within a virtual environment are local to it, and if something goes wrong, you can delete the virtualenv you created and create a new one.

Install virtualenv

pip install --user virtualenv

Then create a directory to keep track of your virtual environments. For example, I create a directory called pyenvs under my home directory.

mkdir ~/pyenvs

Now create a virtual environment within the directory you created. I have called it pyds. You can name it whatever you want.

virtualenv ~/pyenvs/pyds

Activate the virtualenv

source ~/pyenvs/pyds/bin/activate

If you want to deactivate your virtualenv, type deactivate on your terminal. Note that all the packages that you installed in your virtualenv won’t be available in your base installation. For now we will remain in the virtualenv we just activated.

We will make all subsequent installations in the steps that follow within this virtualenv. We no longer need to add --user when installing within a virtualenv, since every install is local by virtue of being within your virtual environment.
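If you are ever unsure whether the virtualenv is actually active, the interpreter itself can tell you. This is a minimal sketch with a hypothetical helper name, in_virtualenv. It relies on the sys.real_prefix attribute that older virtualenv releases set, and on sys.base_prefix, which Python 3 and newer virtualenv releases provide.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases record the base installation in sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # Otherwise compare sys.prefix with sys.base_prefix; they differ inside an
    # environment. Where base_prefix is absent, fall back to "not in a venv".
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


if __name__ == "__main__":
    print("inside a virtual environment: %s" % in_virtualenv())
    print("sys.prefix: %s" % sys.prefix)
```

Run it with the python that is first on your PATH, once before and once after activating, to see the difference.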

6. pandas: for data processing and manipulation

pip install pandas

7. scikit-learn: for building models

pip install -U scikit-learn

Note: The -U option upgrades a package to the latest version if you already have it installed.

8. statsmodels: for more descriptive linear and logistic regression models

pip install -U statsmodels

9. seaborn: for nicer visualizations

pip install -U seaborn

10. tensorflow: for deep learning models

pip install -U tensorflow

11. keras: a high-level API that sits atop tensorflow

pip install -U keras

12. Install ipython and jupyter within the virtualenv

It is possible the jupyter installation still points to a location other than your virtualenv. You can check this by typing

which python

This should point to ~/pyenvs/pyds/bin/python

which jupyter

If this doesn’t point to a location within ~/pyenvs/pyds/ then you need to install it within your virtualenv. It is recommended that you install ipython as well.
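The same check can be done from Python, which is handy if you want to test several commands at once. The sketch below is an illustration only: find_on_path and inside_prefix are hypothetical helpers, and the first one is a simplified stand-in for what the which command does.

```python
import os
import sys


def find_on_path(command):
    """Return the first executable named `command` found on PATH, or None."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, command)
        if directory and os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def inside_prefix(path, prefix=sys.prefix):
    """Return True if `path` lives under `prefix` (the active environment)."""
    path = os.path.abspath(path)
    prefix = os.path.abspath(prefix)
    return path == prefix or path.startswith(prefix + os.sep)


if __name__ == "__main__":
    for command in ("python", "ipython", "jupyter"):
        location = find_on_path(command)
        if location is None:
            print("%-8s not found on PATH" % command)
        else:
            where = "inside" if inside_prefix(location) else "OUTSIDE"
            print("%-8s %s (%s the active environment)" % (command, location, where))
```

Run it with the virtualenv activated. Any command reported as OUTSIDE is a candidate for reinstalling within the virtualenv.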

pip install --ignore-installed ipython
pip install -U jupyter

13. Check your installation

Type jupyter notebook on your command line. To test whether all the packages work properly within a notebook, type the following inside your Jupyter notebook and press Cmd+Enter.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import optimize
import tensorflow as tf
import seaborn as sns
from keras.models import Sequential

If no errors are thrown, the setup looks good. You are now ready to analyze data and build models in Python.
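If one of the imports does fail, it can be useful to see the status of every package at once rather than stopping at the first error. Here is a rough sketch of such a check. The helper name check_packages and the PACKAGES list are my own, using the import names of the packages installed in the steps above (scikit-learn is imported as sklearn).

```python
import importlib

# Import names of the packages installed in the steps above.
PACKAGES = ["numpy", "pandas", "matplotlib", "scipy", "sklearn",
            "statsmodels", "seaborn", "tensorflow", "keras"]


def check_packages(names):
    """Try importing each package; map its name to a version or an error note."""
    results = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            results[name] = getattr(module, "__version__", "unknown version")
        except Exception as exc:
            # Broad on purpose: a broken install can fail with more than ImportError.
            results[name] = "MISSING (%s)" % exc
    return results
```

In a notebook cell, something like `for name, status in sorted(check_packages(PACKAGES).items()): print(name, status)` would list each package next to its version or the reason it could not be imported.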
