Brewing a Perspective on Data Science

If you’ve ever taken an introductory course in statistics, you’ve almost certainly come across the concept of Student’s T-test. For those who haven’t, Student’s T-test is a statistical test used in hypothesis testing to compare mean values between groups of observations. For example, one might want to compare the average height of a sample of men and a sample of women in a particular country to see if there is a significant difference. Student’s T-test would be an appropriate choice of statistical tool to analyse the data in this case.
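To make the height example concrete, here is a minimal sketch of the two-sample, equal-variance form of Student’s T-test in plain Python. The heights are made up purely for illustration, the function name `student_t_statistic` is my own, and the critical value is the standard two-sided 5% figure from t tables for 10 degrees of freedom. Treat it as an illustration under those assumptions rather than a complete analysis.

```python
from math import sqrt
from statistics import mean, variance


def student_t_statistic(a, b):
    """Return (t, degrees of freedom) for a two-sample Student's t-test.

    Assumes both samples come from populations with equal variance,
    so the two sample variances are pooled.
    """
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    # statistics.variance is the sample variance (divides by n - 1).
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    std_err = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / std_err, df


# Hypothetical heights in centimetres, invented for this example.
men = [178.0, 182.5, 175.0, 180.0, 185.5, 177.0]
women = [165.0, 170.5, 162.0, 168.0, 171.5, 163.0]

t_stat, df = student_t_statistic(men, women)

# Two-sided 5% critical value for 10 degrees of freedom, from t tables.
T_CRITICAL_DF10 = 2.228
significant = abs(t_stat) > T_CRITICAL_DF10

print(f"t = {t_stat:.2f}, df = {df}, significant at 5%: {significant}")
```

With only six observations per group, this is exactly the small-sample setting the test was designed for. In practice one would usually reach for a statistics library to obtain a p-value rather than consult a table.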

The origins of Student’s T-test are well documented [1,2,3], but worth briefly repeating here. The details of the method were published in a 1908 paper by a man named William Sealy Gosset, under the pseudonym Student. Gosset’s employer, Guinness (coincidentally the brewer of my favourite beer), forbade employees from publishing their work under their real names. Gosset was concerned with applying statistics to the optimisation of the brewing process at Guinness. The traditional statistical methods of the time required large sample sizes, so Gosset developed his method to deal with the small sample sizes he encountered during the brewing process.

Fast forward more than 100 years. Data is the new oil [4], Artificial Intelligence is the new electricity [5], and Data Science is the sexiest job of the 21st century [6]. As if you haven’t heard already. In addition, many new concepts are finding their way into the vocabulary of technology organisations:

  • Chief Data Officers (company executives responsible for data governance within an organisation);
  • Data products (the creation of value from data to solve problems); and
  • Data plays (business strategy based upon acquisition and utilisation of data).

 

At the centre of all this, Data Scientists are generally lauded as pioneers of this brave new data-driven world. Despite the adulation, their role within many organisations, and in the field in general, remains vaguely defined. Why is that? Many years ago, Guinness saw the value in hiring the best Oxford and Cambridge biochemistry and statistics graduates to work on its industrial processes. This was surely modern-day data science.

Rather than offer yet another attempt at defining the data scientist role here, I will point the interested reader to Harris et al. [8], who provide a thorough study of the topic. Instead, I will offer two reasons why I believe the data science role will remain difficult to define for some time to come:

1. Breadth of Topics

The range of topics generally considered to fall under the remit of a data scientist at an average technology company can be broad, spanning traditional business intelligence, experimental design and machine learning algorithms. Until organisations reach a scale where they can invest more heavily in data science teams, data scientists are required to maintain a broad skill set, which can make the position difficult to define and hire for.

2. Bridging the gap to production

Despite waiting three years and awarding 1 million dollars to the winner, Netflix did not put the winning solution of its recommendation competition into its production environment, owing to the estimated engineering cost of doing so [9]. As getting insights into production systems becomes more important to companies that invest in data science initiatives, data scientists are increasingly expected to possess high-quality software engineering skills.

Building recommendation engines, chatbots and self-driving cars is a far cry from the days of Student; however, the goal of data scientists working within organisations remains the same: to apply data-driven methods to the optimisation of products.

 

  1. https://en.wikipedia.org/wiki/William_Sealy_Gosset
  2. https://www.irishtimes.com/news/science/how-a-student-of-guinness-became-the-faraday-of-statistics-1.1779032
  3. http://www-history.mcs.st-andrews.ac.uk/Biographies/Gosset.html
  4. https://medium.com/project-2030/data-is-the-new-oil-a-ludicrous-proposition-1d91bba4f294
  5. https://medium.com/syncedreview/artificial-intelligence-is-the-new-electricity-andrew-ng-cc132ea6264
  6. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
  7. https://www.kdnuggets.com/2016/10/battle-data-science-venn-diagrams.html
  8. http://www.stat.wvu.edu/~jharner/courses/dsci503/docs/Analyzing_the_Analyzers.pdf
  9. https://medium.com/netflix-techblog/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429