When I first began work doing what is now known as data science 9 years ago, the term did not exist, though the field did exist under a host of different names - analytics science, statistics, modeling (of the mathematical kind, not to be confused with its glamorous namesake), among others. Plenty of data was already available, and while it was not called big data then, it presented many of the challenges that recent tools have made somewhat easier. In late 2012, around the time Harvard Business Review (HBR) published its article - Data Scientist: The Sexiest Job of the 21st Century - search volume for the term began to grow. This is likely when data science went mainstream. It has since been a lot easier to tell people you are a data scientist than to provide a tl;dr of the myriad things the term has come to represent. We are all too familiar with it, but what does it really mean to be a data scientist?
To say that a data scientist works with data and derives insights from it is much like saying that a CEO runs a company - yes, but what does that involve? A detailed definition sadly runs into the popular trope of the blind men and the elephant. The nature of what constitutes data science differs across organizations, and my own view is likely limited by the specifics of what I have seen covered under this broad sweep of a term. With that caveat, here is an attempt at an answer.
Visualization

I call this out first not because it is the first thing data scientists do - that distinction belongs to the less glamorous but critical steps of extracting, cleaning, and munging data. Once all that is done, visualization is where you begin to get a feel for the data. From the simple lines and bars available in standard plotting libraries such as ggplot2 (in R) and matplotlib (in Python), to the more interactive visualizations made possible by tools like d3, highcharts, and plotly, compelling visualizations can tell you a lot about what is going on. At its most mundane, visualization is a good way to catch anything funky in the data and to inspect distributions (and check whether they are Gaussian). For many analyses, a report based on visualizations may be all that is needed to answer the question at hand. In fact, if you can pull the data into an Excel sheet and plot it there, that works too. The best tool, after all, is the one that gets the job done.
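As a minimal sketch of this kind of first look, here is how one might eyeball a distribution with matplotlib before doing anything fancier. The data is synthetic and the column name is hypothetical; it assumes matplotlib is installed.

```python
import random

import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

random.seed(42)
# Hypothetical stand-in for a real column, e.g. loan amounts
loan_amounts = [random.gauss(15000, 4000) for _ in range(1000)]

# A histogram is often enough to spot skew, outliers, or bad values
fig, ax = plt.subplots()
ax.hist(loan_amounts, bins=30)
ax.set_xlabel("Loan amount")
ax.set_ylabel("Count")
fig.savefig("loan_amounts.png")
```

Even a plot this simple will surface negative values, truncation, or a second mode that summary statistics can hide.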
There are those who, when working on model development (more on this later), go directly into it without spending enough time visualizing the data. This view holds that visualizations are limited to the 2-3 dimensions one can plot, whereas a regression offers a much cleaner interpretation and accounts for the simultaneous impact of all variables on the variable of interest. While visualization does not preclude modeling, the pitfall of relying purely on models and regressions (which are summarized interpretations of the data) is best explained by Anscombe's quartet.
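The quartet is easy to reproduce with the standard library alone: four datasets whose summary statistics nearly coincide while their scatter plots look nothing alike.

```python
from statistics import mean

# Anscombe's quartet (Anscombe, 1973). Datasets I-III share x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

def corr(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)

datasets = [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]
stats = [(round(mean(x), 2), round(mean(y), 2), round(corr(x, y), 3))
         for x, y in datasets]
# All four have x-mean 9.0, y-mean ~7.50, and correlation ~0.816 -
# yet plotting them reveals a line, a curve, an outlier, and a
# vertical stack. The summaries alone would never tell you that.
```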
A key challenge in visualization is figuring out how best to build a visual narrative for the question you are looking to answer. This may be a simple analytic question, or part of a broader modeling exercise where you are looking to see how two variables interact. While it might seem cool to have all your visualization wizardry on display, try to stick to the message. Some of the rules of good writing, such as "omit needless words," have parallels in visualization (omit needless plots). As with good writing, good visualizations take time and patience.
Good references are Edward Tufte's books on visualization. For inspiration, check out the visualizations in the New York Times (NYT). Third-party tools such as Tableau and Mode Analytics have free public versions that you can play with using public datasets.
Modeling

While this does not involve walking on a ramp, modeling (or modelling, in the UK) is the most glamorous part of data science. Predictive (or scoring) models translate the patterns surfaced by visualizations into actionable predictions. For example, suppose that a borrower's likelihood of defaulting on a loan decreases as their credit history grows (among other drivers). If you now have to assess the risk of an applicant who walks into a bank, the model can provide a risk score based on the inputs (e.g. credit history, among others) that characterize the prospective borrower.
The most common form of modeling is what is called supervised learning. It involves two parts - training & validation, and testing. One begins with an objective function that relates the outcome (what you are trying to predict) to a set of drivers (that may influence the outcome). The purpose of training and validation is to identify the drivers that most influence the outcome and to quantify the level of their influence (i.e. which ones are more or less predictive). The test dataset is a small portion of the data that is held out so that once you have finalized the model (i.e. decided on the best approach and drivers), you can test it on clean data that the model has not seen before. A common approach to supervised learning is regression, or some variation of it (e.g. logistic regression). For binary outcomes, a slew of approaches exist - random forests, gradient boosting, support vector machines (SVMs), neural networks (and their more popular cousin, deep learning), among others. An important part of modeling is trying multiple approaches and comparing them on quantitative (e.g. ROC, precision, recall, F1-score, memory usage, and execution time) and qualitative (e.g. interpretability, ease of implementation) metrics.
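The train/hold-out-test workflow above can be sketched with scikit-learn. This is a hedged illustration on synthetic data - in practice the features would come from your own extraction and cleaning steps, and the model zoo would be wider.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary-outcome data standing in for, say, loan defaults
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Hold out a test set the models never see during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Compare approaches on a quantitative metric (here, ROC AUC);
# qualitative criteria like interpretability you weigh by hand.
aucs = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]
    aucs[name] = roc_auc_score(y_test, scores)
```

In a fuller workflow you would also cross-validate on the training portion to tune each model before touching the test set.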
Unsupervised learning addresses another typical modeling challenge: when there is no labeled outcome and you are looking to derive associations from the data (e.g. customer segmentation based on similar characteristics).
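A minimal unsupervised sketch, again hedged: clustering synthetic "customers" into segments with k-means. The two features (monthly spend, visits) are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic segments: low-spend and high-spend customers.
# Columns: monthly_spend, visits (made up for illustration).
low = rng.normal(loc=[20.0, 2.0], scale=2.0, size=(100, 2))
high = rng.normal(loc=[80.0, 10.0], scale=2.0, size=(100, 2))
customers = np.vstack([low, high])

# No labels are given; k-means groups customers by similarity alone
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
labels = kmeans.labels_
```

In practice the hard parts are choosing the number of clusters and scaling features so no single one dominates the distance metric.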
Both R and Python have packages that you can use for model training and evaluation. A good starting point on this topic is An Introduction to Statistical Learning and its more advanced precursor, The Elements of Statistical Learning.
Reporting

At the close of any analysis or modeling exercise, the data scientist has to document and present the results in a clear way, tailored to the audience. R Markdown (in R) and Jupyter notebooks (in Python) allow code outputs and plots to be embedded, and are a convenient way to present your results. This is an important and often neglected part of data science, for one is often all too eager to move on to the next task. Business consultants are known to obsess over their presentation slides, constantly iterating and improving on them. For a data scientist, the report deserves similar treatment, particularly if it will guide a decision or a course of action based on your analysis. I have found that starting a document early in the analysis is a great way of both setting scope and anticipating questions (and answering them through analysis). It is the dialog you have with yourself as you go about your work, and a good reference to have if you need to revisit the project months later when memory of its details has faded away.
Experimentation

If you are a data scientist at a consumer website or mobile app, you are likely to be responsible for the design and analysis of experiments. Design involves deciding on metrics, the randomization unit, power, significance, the expected change, and the resulting sample size. Once you have gathered the required experimental data, you test for statistical significance to see if the variant is significantly different from the baseline. The Udacity course on A/B testing is a good place to start for those new to the space.
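For a binary metric like conversion, the significance test at the end of an experiment is often a two-proportion z-test. A stdlib-only sketch (the counts below are made up for illustration):

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    conv_*: number of conversions; n_*: number of users in each arm.
    Returns the z statistic and the two-sided p-value.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis of no difference
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Hypothetical experiment: baseline converts 200/2000, variant 260/2000
z, p = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
significant = p < 0.05
```

The same arithmetic run in reverse (fixing power and the minimum detectable change) is how the required sample size gets calculated at design time.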
Machine Learning (ML)
While the section on modeling covers several machine learning concepts, I will use this section to highlight a couple of sub-domains that have more of an ML (à la computer science) flavor. Text processing - or, more generally, extracting features from unstructured data - is one. Information retrieval techniques, such as those used in search and recommendations, are another.
Working with Other Teams

Data science sits in the middle of a pipeline that has data infrastructure at one end and production infrastructure at the other. Short of working with CSVs, data is often stored in a Hadoop-based ecosystem managed by a data engineering/infrastructure team. The production systems where the final models are likely to be deployed are owned by engineering. Therefore, to get data and to deploy models, a data scientist has to work closely with these groups. An understanding of both the data infrastructure and the production systems is valuable for being effective. Data scientists in a client-based or consulting set-up will likely work closely with business consultants, who can provide a good understanding of the business context and the problem being solved.
While not comprehensive, this covers most of what I have come to know as data science. If I have missed anything, please feel free to call it out in the comments!