Where’s My Wine From? A Simple Geodata Hack

I enjoy the occasional glass of wine. For the last few years I’ve been studiously scanning (almost) every bottle that I’ve drunk with Vivino. For the uninitiated, Vivino is an online wine community that consists of a crowdsourced database of wine reviews and ratings, an online marketplace for wine, and a mobile app that scans wine labels and matches them to the database.

A nice feature of a Vivino account is the ability to export your historical scans in spreadsheet format. After using this feature, I was curious to find out where the wines I had drunk over the years came from. As any self-respecting techie would, I decided to write a small program to visualise them on a map (links provided at the bottom of this post).

Mapping the geographic distribution of a data set is a common use case for Data Scientists and Software Engineers. Data sets often come with some form of address information but no geographic coordinates, as was the case with my wine data export. This adds a processing step to the task.

At a high level, the workflow can be defined as follows:

  1. Assign geographic coordinates to each entry in the data set (data enrichment).
  2. Aggregate the data by Winery (data aggregation).
  3. Transform the data into a suitable format for displaying on a map (data transformation).
  4. Display the data on a map (data visualisation).
  5. Make the map available (deployment).

Below I outline the tools I used to implement the above workflow. I stuck to free and open source tools in the Python and JavaScript programming languages, but be aware that many alternatives exist.

Data Enrichment

 The task of mapping from a geographic name, e.g. “Berlin”, to coordinates is known as geocoding. Nominatim is an open source tool for searching OpenStreetMap data. It powers the search box on the OpenStreetMap website and also provides an interface that can be accessed over the web. Be sure to comply with Nominatim’s usage policy when making use of its public API.
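
 As an illustration, a minimal sketch of a geocoding helper against the public Nominatim search endpoint could look like the following. It is not the code from my repository; the function names and the User-Agent string are hypothetical, and the one-second pause reflects my reading of the usage policy (at most one request per second, with an identifying User-Agent), so check the current policy before relying on it.

```python
import json
import time
import urllib.parse
import urllib.request

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"
# Hypothetical identifier: replace with your own application name and contact.
USER_AGENT = "wine-map-example/0.1 (you@example.com)"


def build_search_url(place):
    """Build a Nominatim search URL for a free-form place name."""
    query = urllib.parse.urlencode({"q": place, "format": "json", "limit": 1})
    return f"{NOMINATIM_URL}?{query}"


def parse_first_hit(results):
    """Return (lat, lon) as floats from the first result, or None if empty."""
    if not results:
        return None
    # Nominatim returns coordinates as strings, so convert them.
    return float(results[0]["lat"]), float(results[0]["lon"])


def geocode(place, pause=1.0):
    """Look up one place name, then sleep to stay under the rate limit."""
    request = urllib.request.Request(
        build_search_url(place), headers={"User-Agent": USER_AGENT}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        results = json.load(response)
    time.sleep(pause)
    return parse_first_hit(results)
```

 Calling geocode("Berlin") would then return a (latitude, longitude) pair, or None if nothing matched. Since many bottles share the same winery or region, caching the results per distinct name keeps the number of requests small.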

Data Aggregation

 Pandas is a data analysis library that is quickly becoming a mainstay amongst Data Scientists and Engineers[1]. The library provides a data frame data structure that developers can easily apply common data wrangling operations to, such as sort, filter, groupby and descriptive statistics.

 For my wine data, the aggregation required is rather simple: one line of pandas code to group the data by Winery and count the number of bottles.
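
 To give an idea of what that one line can look like, here is a small sketch on made-up data. The column names ("Winery", "Wine name") and the rows are assumptions for illustration and may not match the real Vivino export.

```python
import pandas as pd

# Made-up rows; the column names are assumptions, not the real export schema.
scans = pd.DataFrame(
    {
        "Winery": ["Winery A", "Winery B", "Winery A", "Winery C"],
        "Wine name": ["Red 2015", "White 2018", "Red 2016", "Rosé 2017"],
    }
)

# Group by winery and count the rows (one row per scanned bottle) in each group.
bottles_per_winery = scans.groupby("Winery").size().reset_index(name="bottles")
```

 The result is a data frame with one row per winery and a "bottles" column holding the count, which is a convenient shape for the transformation step that follows.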

Data Transformation

 GeoJSON is an open standard for representing geographic information, and most mapping tools support it out of the box. JavaScript Object Notation (JSON) itself is a lightweight, text-based data interchange format supported by most modern programming languages. In Python, the Shapely package provides support for manipulating geometric objects. I used Shapely in combination with shapely-geojson to transform the output of the data aggregation step into a single GeoJSON file.
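
 To show the shape of the target format, the sketch below builds a GeoJSON FeatureCollection by hand with the standard json module. This is a stand-in for the Shapely and shapely-geojson approach, not a reproduction of it; the helper names, property names and coordinates are made up for illustration. One detail worth noting is that GeoJSON orders coordinates as longitude first, then latitude.

```python
import json


def to_feature(winery, bottles, lat, lon):
    """Build one GeoJSON Feature with a Point geometry for a winery."""
    return {
        "type": "Feature",
        # GeoJSON positions are [longitude, latitude], in that order.
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"winery": winery, "bottles": bottles},
    }


def to_feature_collection(rows):
    """Wrap a list of row dicts into a GeoJSON FeatureCollection."""
    return {
        "type": "FeatureCollection",
        "features": [to_feature(**row) for row in rows],
    }


# Illustrative rows, as they might look after geocoding and aggregation.
rows = [
    {"winery": "Winery A", "bottles": 2, "lat": 44.84, "lon": -0.58},
    {"winery": "Winery B", "bottles": 1, "lat": 43.77, "lon": 11.25},
]

geojson_text = json.dumps(to_feature_collection(rows), ensure_ascii=False, indent=2)
```

 Writing geojson_text to a file gives a document that most mapping tools can load directly.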

Data Visualisation

 OpenLayers is a free, open source JavaScript library for displaying map data in a web browser. It has many features and is easy to customise and extend. Once the data has been transformed into GeoJSON format, it is straightforward to display it using OpenLayers with a few lines of JavaScript.

Deployment 

 Flask is an open source, lightweight web application framework. It is particularly suitable for prototyping as it allows rapid development of web applications with a minimum of overhead. An additional benefit is that it is written in Python, the language of choice for many data professionals.
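
 As a rough sketch of how little code this takes, the following Flask application serves a GeoJSON document at a single URL. The route name and the placeholder data are assumptions for illustration; a real application would load the file produced in the transformation step and also serve the page containing the map.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Placeholder data; in practice this would be loaded from the generated file.
WINERIES = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-0.58, 44.84]},
            "properties": {"winery": "Winery A", "bottles": 2},
        }
    ],
}


@app.route("/wineries.geojson")
def wineries():
    """Return the winery data as a JSON response."""
    return jsonify(WINERIES)
```

 The map page can then fetch /wineries.geojson from the same server and hand the result to the mapping library.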

 Docker is an open source containerisation framework for creating isolated environments for running software applications. Applications deployed as Docker containers bundle their own tools, configuration and libraries, and will run the same regardless of the underlying infrastructure. Containers have the advantage over virtual machines (VMs) of being much smaller: each VM includes a full copy of an operating system, whereas containers share the kernel of the host operating system.

Many examples of “dockerizing” a Flask application can be found online.  

If you’re interested in how this all fits together from a technical perspective (or want to apply it yourself), feel free to peruse the code on GitHub. If you’re just interested in where my wine comes from, feel free to take a look at the map. It turns out my wine purchasing behaviour has a strong European bias, with a particular focus on Bordeaux!

There’s of course a lot more that could be done here in terms of visualisation; my intention with this post was simply to demonstrate how the pieces fit together as a data workflow. More importantly, I hope it’s apparent that this workflow can be applied more broadly than just this use case.

  1. https://www.theregister.co.uk/2017/09/14/python_explosion_blamed_on_pandas/