
Geocoding French addresses with BAN

Geocoding French postal addresses with free tools was a rather tricky task before the National Address Database (Base Adresse Nationale, or BAN). Collaboratively built by several actors from the private and public sectors, this open database comes with a freely usable API. This article shows how to geocode French addresses with the BAN API and how to visualize the results with R.

Getting some data to geocode

Rated hotel data is a good example to start with. The dataset can be found on Data.gouv.fr and contains all the hotels rated by Atout France under a clear star rating system. Although rated hosting capacity has already been covered on this blog, geocoding has not, even though the dataset contains no coordinates. However, the postal address of each hotel is available and good enough for geocoding through the BAN API.

We first need to download the CSV file, drop the columns we do not need and select the relevant rows for our example. We’ll only focus here on the luxury hotels, rated 4 and 5 stars by Atout France. There is no higher ranking in the classification.

prefix <- "../../sources/"
hotels_csv_path <- paste0(prefix, "ratedHotelsRaw.csv")
if(!file.exists(hotels_csv_path)){
download.file(
  "http://static.data.gouv.fr/0c/27f27c8782f876878b0ce5cc6914231d750c1532d2837750bf8bdc7c541aca.csv",
  hotels_csv_path)
}

We read the CSV, being careful with the encoding, then exclude the variables we do not need and select the luxury hotels. The first rows of the processed dataframe luxuryHotels look like this:

ratedHotelsRaw <- read.csv2(hotels_csv_path,
                            fileEncoding = "LATIN1")
ratedHotels <- ratedHotelsRaw[c(4,7,8,9,10)]
luxuryHotels <- ratedHotels[
  ratedHotels$CLASSEMENT == "4 étoiles"
  | ratedHotels$CLASSEMENT == "5 étoiles",]

knitr::kable(luxuryHotels[1:5,], row.names = FALSE)
| CLASSEMENT | NOM.COMMERCIAL | ADRESSE | CODE.POSTAL | COMMUNE |
|---|---|---|---|---|
| 4 étoiles | HOLIDAY INN BORDEAUX SUD PESSAC | 10 avenue BECQUEREL | 33600 | PESSAC |
| 4 étoiles | HÔTEL PALM BEACH | 5 place de l’étang | 6400 | CANNES |
| 4 étoiles | HÔTEL MATHIS ÉLYSÉES | 3 RUE DE PONTHIEU | 75008 | PARIS |
| 4 étoiles | CHÂTEAU DE CURZAY | le château | 86600 | CURZAY-SUR-VONNE |
| 4 étoiles | LES CORDERIES | 214 rue des moulins | 80230 | SAINT-VALERY-SUR-SOMME |
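
One detail in this preview is worth a note: the postal code of the Cannes hotel shows as 6400, whereas postal codes in Cannes start with 06. This suggests the column was read as a number and lost its leading zero. Below is a minimal sketch of one way to restore five-character codes before geocoding. It uses a small hypothetical vector rather than the real column, and whether the BAN API copes with the shortened form is not tested here.

```r
# Hypothetical sample of postal codes as they may appear after read.csv2()
# has parsed the column as numbers (the leading zero of "06400" is lost).
postal_codes <- c(33600, 6400, 75008)

# Pad every code to five characters with leading zeros.
padded_codes <- sprintf("%05d", as.integer(postal_codes))

print(padded_codes)
```

On the real data, the same call could be applied to luxuryHotels$CODE.POSTAL. Another option would be to read the column as text from the start, for instance through the colClasses argument of read.csv2.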

As you can see, this dataset lacks geographical information. We need at least the longitude and latitude columns to display these hotels as points on a map.

Geocoding our CSV with the BAN API

We are now going to use the BAN API to geocode the addresses of luxury hotels in France. We will perform a bulk geocoding by sending a CSV file to the API. We begin by saving our slightly processed dataframe to a CSV file:

hotel_path <- paste0(prefix,"geocoding-french-addresses-with-ban/luxuryHotels.csv")

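# Write the file as UTF-8, the charset declared when uploading it to the API.
write.csv(luxuryHotels,
          file = hotel_path,
          row.names = FALSE,
          fileEncoding = "UTF-8")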

We can now send our CSV to the BAN API. The file must be sent in the data parameter, while the address columns must be listed with the columns parameter. Otherwise, all columns of the CSV are concatenated to build the address to geocode. As we do not want to pollute our geocoding with names and ratings, we need to specify the 3 columns that make up the address of each hotel:

* ADRESSE: Containing the street and street number.
* CODE.POSTAL:  The postal code of the hotel.
* COMMUNE:  The town name.

The data is sent with an HTTP POST request. The API returns a CSV file, which can be converted to a dataframe object with the content function:

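library(httr)
library(RCurl)

geocoded_hotels_path <- "../../sources/luxuryHotelsGeo.csv"

if (!file.exists(geocoded_hotels_path)) {
  # Send the CSV as a multipart POST request. Each "columns" entry
  # names one of the address columns to use for geocoding.
  queryResults <- POST(
    "http://api-adresse.data.gouv.fr/search/csv/",
    body = list(
      data = upload_file(hotel_path, type = "text/csv; charset=UTF-8"),
      columns = "ADRESSE",
      columns = "CODE.POSTAL",
      columns = "COMMUNE"
    )
  )
  # Parse the returned CSV and cache it on disk for later runs.
  luxuryHotels_geo <- content(queryResults)
  write.csv2(luxuryHotels_geo, geocoded_hotels_path, row.names = FALSE)
} else {
  # Reuse the cached result instead of querying the API again.
  luxuryHotels_geo <- read.csv2(geocoded_hotels_path)
}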

Several columns are added to the original dataset. In addition to the latitude and longitude, you can find fields such as the corrected address, a confidence score or the department label. And that’s all for geocoding French addresses with the BAN API!
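
Before mapping anything, it can be useful to check how confident the geocoder was. The sketch below assumes the score comes back in a column named result_score with values between 0 and 1. That column name, the sample rows and the 0.5 threshold are all assumptions made for illustration, so compare them with names(luxuryHotels_geo) on your own output.

```r
# Hypothetical geocoding output; column names and values are assumptions.
geocoded <- data.frame(
  NOM.COMMERCIAL = c("HOTEL A", "HOTEL B", "HOTEL C"),
  latitude       = c(49.18, NA, 49.36),
  longitude      = c(-0.37, NA, 0.08),
  result_score   = c(0.92, NA, 0.41),
  stringsAsFactors = FALSE
)

# Arbitrary threshold for this example, not a value recommended by the API.
min_score <- 0.5

# Flag rows that were not geocoded (NA) or that scored below the threshold.
needs_review <- is.na(geocoded$result_score) | geocoded$result_score < min_score

print(geocoded[needs_review, c("NOM.COMMERCIAL", "result_score")])
```

Rows flagged this way could be inspected by hand or geocoded again with a cleaned-up address.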

Input CSV files are limited to 8 MB for the moment. You can use simpler URLs to query the API for single addresses, and similar endpoints for reverse geocoding, starting from latitudes and longitudes. Take a look at the quick documentation of the BAN API for more information.
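
As an illustration of those simpler URLs, the sketch below only builds the request strings and does not contact the API. It assumes that the single-address endpoint is /search/ with a q parameter and that the reverse endpoint is /reverse/ with lon and lat parameters. Check both against the BAN documentation before relying on them. The helper names are hypothetical.

```r
# Base URL used earlier in this article.
ban_base_url <- "http://api-adresse.data.gouv.fr"

# Hypothetical helper: URL for geocoding one free-text address.
# The endpoint and the "q" parameter are assumptions to verify in the docs.
ban_search_url <- function(address) {
  paste0(ban_base_url, "/search/?q=",
         utils::URLencode(address, reserved = TRUE))
}

# Hypothetical helper: URL for reverse geocoding a longitude/latitude pair.
# The endpoint and the "lon"/"lat" parameters are assumptions as well.
ban_reverse_url <- function(lon, lat) {
  paste0(ban_base_url, "/reverse/?lon=", lon, "&lat=", lat)
}

search_url  <- ban_search_url("3 rue de Ponthieu 75008 Paris")
reverse_url <- ban_reverse_url(2.31, 48.87)

print(search_url)
print(reverse_url)
```

The resulting strings could then be passed to httr's GET function and the response parsed with content, as was done for the CSV request.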

Visualizing luxury hotels in Normandy

Getting map data

Now that the addresses are geocoded, we can display the luxury hotels on a map. This requires a shapefile of the regions we want to map. Natural Earth provides various shapefiles of regions all over the world, and Admin 1 – States, Provinces is the right dataset for our example. It contains polygons of countries and regions and can be read with the shapefile function of the raster R package.

library(raster)

regions_zip_url <- "https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_1_states_provinces.zip"
filename <- basename(regions_zip_url)
file_path <- paste0("../../sources/",filename)
exdir <- paste0("../../sources/NaturalEarth/")
borders_file_path <- "../../sources/NaturalEarth/ne_10m_admin_1_states_provinces.shp"

if(!file.exists(borders_file_path)){
  download.file(regions_zip_url,file_path)
  unzip(file_path,exdir=exdir)
}

borders <- shapefile(borders_file_path)

We’ll focus on Normandy, a rather touristy region that is popular with Parisians and Brits alike throughout the year. We need to subset our borders dataframe to keep only Haute-Normandie and Basse-Normandie, which together form Normandy:

normandie <- borders[borders$region %in% c("Haute-Normandie", "Basse-Normandie"), ]

As we have our geocoded hotels on one side and a map on the other, we can now display the two on a single plot.

Displaying a simple map

The luxuryHotels_geo dataframe we created in the geocoding part of this tutorial contains all the luxury hotels in France. As our data visualization only covers Normandy, we need to exclude the other hotels from the dataframe. We could subset conventionally, using the postal code column, but this requires knowing the postal code structure and the departments of Normandy. Although this would be the easier way, we’ll use a foolproof method for this example by excluding the hotels whose coordinates fall outside the Normandy map we created previously:

library(maptools)
 
# First, we drop the hotels whose geocoding did not work.
luxuryHotels_geo <- luxuryHotels_geo[!is.na(luxuryHotels_geo$longitude) & !is.na(luxuryHotels_geo$latitude),]
 
# We create spatial objects for our hotels from the coordinates of each one.
coordinates(luxuryHotels_geo)=~longitude+latitude
 
# We give the hotels the same projection as our map.
projection(luxuryHotels_geo)=projection(borders)
 
# We overlay the 2 datasets.
overlay <- over(luxuryHotels_geo,normandie)
 
# Hotels that fall outside the Normandy polygons get NA in the overlay,
# so we use it to keep only the Normandy hotels.
luxuryHotels_geo$over <- overlay$region
luxuryHotels_geo.Normandie <- luxuryHotels_geo[!is.na(luxuryHotels_geo$over),]
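
For comparison, the postal code approach mentioned earlier could look like the sketch below. It relies on the first two digits of a metropolitan French postal code generally matching the department number, and on the five departments of Normandy (14, 27, 50, 61 and 76). The sample dataframe is hypothetical.

```r
# Hypothetical sample; the real data would come from the hotels dataframe.
hotels <- data.frame(
  COMMUNE     = c("CAEN", "CANNES", "ROUEN", "PARIS"),
  CODE.POSTAL = c(14000, 6400, 76000, 75008),
  stringsAsFactors = FALSE
)

# Departments of Normandy: Calvados, Eure, Manche, Orne, Seine-Maritime.
normandy_departments <- c("14", "27", "50", "61", "76")

# Restore leading zeros, then take the first two digits as the department.
postal_code <- sprintf("%05d", as.integer(hotels$CODE.POSTAL))
department  <- substr(postal_code, 1, 2)

# Keep only the hotels located in a Normandy department.
hotels_normandy <- hotels[department %in% normandy_departments, ]

print(hotels_normandy$COMMUNE)
```

This version is shorter, but it depends on clean postal codes, whereas the spatial overlay only depends on the geocoded coordinates.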

We can finally plot our 2 datasets easily with the plot function from R, and color hotels by ranking:

plot(normandie)
plot(luxuryHotels_geo.Normandie, add = TRUE, pch = "+",
     col = factor(luxuryHotels_geo.Normandie$CLASSEMENT))
legend(x = -1.15, y = 50.12, pch = "+",
       col = unique(factor(luxuryHotels_geo.Normandie$CLASSEMENT)),
       legend = unique(factor(luxuryHotels_geo.Normandie$CLASSEMENT)),
       cex = 0.7)
title(main = "Luxury hotels in Normandy", font.main = 4)

Figure: Luxury hotels in Normandy

This plot may not be the finest, but a quick look reveals, unsurprisingly, many luxury hotels in Calvados, the most touristy department of Normandy. With places like Caen, Deauville and the D-Day beaches, the area attracts many visitors from all over the world each year, often with high expectations for accommodation.

Thanks to Guillaume for code corrections.