Geocoding French postal addresses with free tools used to be a rather tricky task before the National Address Database, or Base Adresse Nationale (BAN). Built collaboratively by several actors from the private and public sectors, this open database comes with a freely usable API. This article shows how to geocode French addresses with the BAN API and how to visualize the results with R.

Getting some data to geocode

Rated hotel data is a good example to start with. The dataset can be found on Data.gouv.fr and contains all the hotels rated by Atout France under a clear star rating system. Although rated accommodation capacity has already been covered on this website, the geocoding question has not been addressed, even though the dataset contains no coordinates. However, each hotel’s postal address is available and good enough for geocoding through the BAN API.

We first need to download the CSV file, drop the columns we do not need and select the relevant rows for our example. We’ll focus only on luxury hotels, rated 4 and 5 stars by Atout France. There is no higher rating in the classification.

#Download the CSV from data.gouv.fr
if (!file.exists("ratedHotelsRaw.csv")) {
  download.file("http://static.data.gouv.fr/0c/27f27c8782f876878b0ce5cc6914231d750c1532d2837750bf8bdc7c541aca.csv",
                "ratedHotelsRaw.csv")
}

#Read the CSV being careful with the encoding
ratedHotelsRaw <- read.csv2("./ratedHotelsRaw.csv",fileEncoding = "LATIN1")

#Keep only the rating, name, address, postal code and town columns (selected by position)
ratedHotels <- ratedHotelsRaw[c(4,7,8,9,10)]

#Selecting luxury hotels
luxuryHotels <- ratedHotels[ratedHotels$CLASSEMENT == "4 étoiles"
                      | ratedHotels$CLASSEMENT == "5 étoiles",]

The first rows of the processed dataframe luxuryHotels look like this:

| CLASSEMENT | NOM.COMMERCIAL | ADRESSE | CODE.POSTAL | COMMUNE |
|---|---|---|---|---|
| 4 étoiles | HOLIDAY INN BORDEAUX SUD PESSAC | 10 avenue BECQUEREL | 33600 | PESSAC |
| 4 étoiles | HÔTEL PALM BEACH | 5 place de l’étang | 6400 | CANNES |
| 4 étoiles | HÔTEL MATHIS ÉLYSÉES | 3 RUE DE PONTHIEU | 75008 | PARIS |

As you can see, this dataset lacks geographical information. We need at least the longitude and latitude columns to display these hotels as points on a map.
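One detail in the rows above is worth noting before geocoding: the Cannes postal code appears as 6400 rather than 06400, presumably because the column was parsed as a number and lost its leading zero. Here is a minimal sketch of one way to restore five-character codes, assuming the column holds whole numbers or digit strings. The helper name padPostalCode is hypothetical and not part of the original script.

```r
# Hypothetical helper: restore the leading zero lost when postal codes
# are parsed as numbers (e.g. 6400 instead of "06400").
padPostalCode <- function(codes) {
  # as.character() first, so that factors are converted by label, not by level index
  sprintf("%05d", as.integer(as.character(codes)))
}

padPostalCode(c(33600, 6400, 75008))
# "33600" "06400" "75008"
```

Applied to our data, this could be written as luxuryHotels$CODE.POSTAL <- padPostalCode(luxuryHotels$CODE.POSTAL) before the dataframe is saved to a CSV file.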

Geocoding our CSV with the BAN API

We are now going to use the BAN API to geocode the addresses of luxury hotels in France. We will geocode them in bulk by sending a CSV file to the API. We begin by saving our slightly processed dataframe to a CSV file:

write.csv(luxuryHotels, file = "./luxuryHotels.csv",row.names=FALSE)

We can now send our CSV to the BAN API. The file must be sent in the data parameter, while the address columns must be listed with the columns parameter. Otherwise, all columns of the CSV are concatenated to build the address to geocode. As we do not want to pollute our geocoding with names and ratings, we need to specify the 3 columns that make up the address of each hotel:

  • ADRESSE: Containing the street and street number.
  • CODE.POSTAL:  The postal code of the hotel.
  • COMMUNE:  The town name.

The data is sent with an HTTP POST request:

library(httr)
library(RCurl)

queryResults <- POST("http://api-adresse.data.gouv.fr/search/csv/",
  body = list(
    data = upload_file("luxuryHotels.csv", type = "text/csv; charset=UTF-8"),
    columns = "ADRESSE",
    columns = "CODE.POSTAL",
    columns = "COMMUNE"
  )
)

The API returns a CSV file, which can be converted to a dataframe object with the content function:

luxuryHotels <- content(queryResults)

Several columns are added to the original dataset. In addition to the latitude and longitude, you will find, among others, the corrected address, a confidence score and the department label. And that’s all it takes to geocode French addresses with the BAN API!
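Since the API reports a confidence score for each match, it can be worth screening out weak matches before mapping. The following is a minimal sketch, assuming the score is returned in a column named result_score with values between 0 and 1 (check names(luxuryHotels) for the actual column names). The 0.5 threshold and the helper name keepConfidentMatches are arbitrary choices made for illustration.

```r
# Hypothetical helper: keep only the rows whose geocoding score reaches a threshold.
# Assumes a numeric score column (named "result_score" here) between 0 and 1.
keepConfidentMatches <- function(df, minScore = 0.5, scoreColumn = "result_score") {
  score <- df[[scoreColumn]]
  # Rows with a missing score are dropped as well
  df[!is.na(score) & score >= minScore, , drop = FALSE]
}

# Toy data standing in for the geocoded result
toy <- data.frame(
  NOM.COMMERCIAL = c("HOTEL A", "HOTEL B", "HOTEL C"),
  result_score   = c(0.92, 0.31, NA)
)
keepConfidentMatches(toy)
```

On the toy data only HOTEL A is kept. The right threshold depends on how clean the input addresses are, so it is worth inspecting a few rejected rows by hand before settling on a value.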

Input CSV files are currently limited to 8 MB. You can use simpler URLs to query the API for single addresses, and similar functions are available for reverse geocoding from latitudes and longitudes. Take a look at the quick documentation of the BAN API for more information.
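As an illustration of the single-address case, here is a hedged sketch of a query against the /search/ endpoint of the same host. It assumes that the endpoint accepts q and limit parameters and returns a GeoJSON FeatureCollection whose features carry label and score properties. These details should be checked against the API documentation, and both function names are hypothetical.

```r
# Turn one parsed GeoJSON feature into a one-row dataframe.
# Assumes coordinates are ordered longitude, latitude, as in GeoJSON.
featureToRow <- function(feature) {
  data.frame(label     = feature$properties$label,
             score     = feature$properties$score,
             longitude = feature$geometry$coordinates[[1]],
             latitude  = feature$geometry$coordinates[[2]],
             stringsAsFactors = FALSE)
}

# Query the API for one address (needs the httr package and network access).
# Returns NULL when the API finds no match.
geocodeOne <- function(address) {
  response <- httr::GET("http://api-adresse.data.gouv.fr/search/",
                        query = list(q = address, limit = 1))
  httr::stop_for_status(response)
  features <- httr::content(response, as = "parsed",
                            type = "application/json")$features
  if (length(features) == 0) return(NULL)
  featureToRow(features[[1]])
}
```

A call such as geocodeOne("3 rue de Ponthieu 75008 Paris") would then return a single row, provided the service is reachable at that URL. For more than a handful of addresses, the CSV endpoint shown earlier remains the better fit, since it sends one request instead of one per address.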

Visualizing luxury hotels in Normandy

Getting map data

Once the addresses are geocoded, we can now visualize luxury hotels over a map. A shapefile of the regions we want to map is therefore required. Natural Earth provides various shapefiles of regions all over the world, and Admin 1 – States, Provinces is the right dataset for our example. It contains polygons of countries and regions and can be processed with the `shapefile` function of the raster R package.

library(raster)

if(!file.exists("NaturalEarth/ne_10m_admin_1_states_provinces.shp")){
download.file("http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_1_states_provinces.zip"
,"ne_10m_admin_1_states_provinces.zip")
unzip("ne_10m_admin_1_states_provinces.zip",exdir="NaturalEarth")
}

borders <- shapefile("NaturalEarth/ne_10m_admin_1_states_provinces.shp")

We’ll focus on Normandy, a rather touristy region visited all year round by Parisians and Brits alike. We need to subset our `borders` dataframe to keep only Haute-Normandie and Basse-Normandie, which together form Normandy:

normandie <- borders[borders$region %in% c("Haute-Normandie", "Basse-Normandie"), ]

As we have our geocoded hotels on one side and a map on the other, we can now display the two on a single plot.

Displaying a simple map

The luxuryHotels dataframe returned by the API in the geocoding part of this tutorial contains all the luxury hotels in France. As our data visualization focuses only on Normandy, we need to exclude the other hotels from the dataframe.
We could decide to subset conventionally, using the postal code column, but this implies that we know the postal code structure and the department structure of Normandy. Although this would be the easier way, we’ll use a foolproof method for this example by excluding the hotels whose coordinates are out of the Normandy map we created previously:

library(maptools)
library(raster)

# First, we drop the hotels whose geocoding did not work.
luxuryHotels <- luxuryHotels[!is.na(luxuryHotels$longitude) & !is.na(luxuryHotels$latitude),]

# We create spatial objects for our hotels from the coordinates of each one.
coordinates(luxuryHotels)=~longitude+latitude

# We set the projection of the hotels to match that of our map.
projection(luxuryHotels)=projection(borders)

# We overlay the 2 datasets.
overlay <- over(luxuryHotels,normandie)

# We use the overlay result to subset the hotels
# and keep only those located in Normandy.
luxuryHotels$over <- overlay$OBJECTID_1
luxuryHotels.Normandie <- luxuryHotels[!is.na(luxuryHotels$over),]
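For comparison, the conventional postal code approach mentioned above can be sketched as follows. It assumes that the first two digits of a postal code give the department number, which holds in most but not all cases, and that Normandy is made up of departments 14, 27, 50, 61 and 76. The helper name isNormandyPostalCode is hypothetical.

```r
# Hypothetical helper: flag postal codes that belong to the five departments
# of Normandy (14 Calvados, 27 Eure, 50 Manche, 61 Orne, 76 Seine-Maritime).
isNormandyPostalCode <- function(codes) {
  # Zero-pad to five characters so that codes read as numbers (e.g. 6400) stay aligned
  padded <- sprintf("%05d", as.integer(as.character(codes)))
  substr(padded, 1, 2) %in% c("14", "27", "50", "61", "76")
}

isNormandyPostalCode(c(14800, 76600, 75008, 6400))
# TRUE TRUE FALSE FALSE
```

The subset would then read luxuryHotels[isNormandyPostalCode(luxuryHotels$CODE.POSTAL), ]. It is shorter than the spatial overlay, but it relies on the postal code being correct and on the department list staying up to date.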

We can finally plot our 2 datasets easily with the plot function from R, and color hotels by ranking:

jpeg("Luxury hotels in Normandy.jpg",2000,1100,res=300)
plot(normandie)
plot(luxuryHotels.Normandie,add=T,pch="+",col=factor(luxuryHotels.Normandie$CLASSEMENT))
legend(x=-1.15,y=50.12,pch="+",col=unique(factor(luxuryHotels.Normandie$CLASSEMENT)),
 legend=unique(factor(luxuryHotels.Normandie$CLASSEMENT)),cex=0.7)
title(main="Luxury hotels in Normandy", font.main=4)
dev.off()

[Figure: Luxury hotels in Normandy]

This plot may not be one of the finest, but a quick look reveals, unsurprisingly, a lot of luxury hotels in Calvados, the most touristy department of Normandy. With places like Caen, Deauville and the D-Day beaches, the area attracts many visitors from all over the world each year, often with high expectations for accommodation.

 


4 COMMENTS

  1. Hi,

    Thank you very much for this article. Is your code to access the API still working? I tried to reproduce it but it doesn’t work.

    Thanks a lot!

    • Hi,

      Many thanks FrenchKPI !

      For Vivien, a solution :

      queryResults <- POST("http://api-adresse.data.gouv.fr/search/csv/",
        body = list(
          data = upload_file("luxuryHotels.csv", type = "text/csv; charset=UTF-8"),
          columns = "ADRESSE",
          columns = "CODE.POSTAL",
          columns = "COMMUNE"
        )
      )

      You have to specify the type of the uploaded file (here CSV) as well as its encoding. Then the query works, and the result is exactly the same as with a Python HTTP query.

      Adding verbose() to the POST call prints the details of the HTTP exchange, which can help to understand what goes wrong.
