This is a review of the Data Science Specialization offered by Johns Hopkins University on Coursera. I followed the first 9 courses over the past year and was pretty satisfied. Massive Open Online Courses (MOOCs) are fairly popular now. However, that does not mean today’s knowledge comes without work. Following a MOOC is different from following tutorials written by tech enthusiasts on their blogs, as it requires more focus and organization.
A MOOC, the Coursera way
This was the first time I seriously followed a MOOC. The full Data Science Specialization is divided into 10 sessions. These courses last one month each and progressively get you into the Data Science world, from using Git to developing a fully functional data application. Difficulty increases steadily for someone following all the sessions in the recommended order. The 10th session consists of a capstone project. Here, students get to work on a real-life, company-sponsored project.
Sessions are held regularly, as this specialization is still very popular. One can take a course for free or by paying a fee, about 35€ at the time I took it. The fee gets you a Signature Track ID, a feature that identifies you each time you upload assignments to the platform.
Coursera identifies you by taking a photo of your face with your webcam, and by recording your typing fingerprint. Does this typing fingerprint method really work? I changed my keyboard during a course, going from a French AZERTY layout to a Canadian QWERTY one. This affected my typing rhythm for about 2 weeks. However, I never got into trouble.
Once you have successfully finished a course using Signature Track, you get a personal certificate link you can add to your CV or LinkedIn profile. For illustration, my certificates are available on the Coursera website.
Coursera does not require students to take the courses consecutively. I stopped for a few months during the summer, and continued afterwards. However, signing up for a course requires you to complete it within the month. In the Data Science Specialization, each course was divided in roughly the same way:
- The first week gently introduces the subject. Students have to complete an easy quiz before Sunday night.
- The second week is a little more complex, and prepares you for the assignment you’ll complete the following week. It also ends with a quiz.
- The third week is the most intense one, because of the assignment you need to complete using the skills you learned in the previous weeks. Projects are published on GitHub or a similar platform to allow grading.
- Peer grading takes place during the fourth week. You are required to grade at least four other students using the objective grading rules of the project. Don’t worry, you get access to these rules before publishing your project. You often need to take a short recap quiz during this fourth week as well.
Once the course is finished, you get your grades within the following days.
In my opinion, this specialization is suitable for anyone who works with data but does not know R. From the database administrator to the regular Excel user, that is indeed a lot of people. That may explain the success of this specialization on Coursera.
The 10 Data Science Specialization Courses
The Data Scientist’s Toolbox
The first month introduces data science and common data science tools. The first step consists of understanding what data science means. Then students install R and RStudio, the popular R IDE. Git is also taught, as students have to publish programming assignments on GitHub. The course covers Windows, OS X and Linux setups. This month may be lighter than the following courses, but understanding Git is mandatory to continue.
R Programming
Once RStudio is installed, it is time to learn how to develop with R. This course covers R basics. While the description recommends some programming experience, the course can be followed by anyone. The month closes with a project including some recursive programming.
Getting and Cleaning Data
Getting, cleaning and profiling data may not be the most fun part of a data scientist’s job. However, one must master these skills in order to deliver reliable data analysis efficiently. Indeed, real-life databases never look like the clean Kaggle data sets.
Exploratory Data Analysis
How do you explore freshly cleaned data? You visualize it and compute some summary statistics. In this course, you’ll learn and try out a detailed, reproducible method for understanding a whole range of data.
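To give an idea of what "computing some summary statistics" looks like, here is a minimal sketch. The course itself uses R; Python's standard library is used here only for illustration, and the `temperatures` values and the `summarize` helper are made up for this example.

```python
import statistics

# Hypothetical sample (made-up values): daily temperatures in degrees Celsius.
temperatures = [12.0, 15.0, 11.0, 14.0, 13.0, 30.0, 12.0]


def summarize(values):
    """Return a few basic summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


summary = summarize(temperatures)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In this sample, the single large value (30) pulls the mean up to about 15.3 while the median stays at 13. Spotting this kind of gap is exactly what a quick exploration is for.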
Reproducible Research
Generating analysis reports is nice. Updating the same reports with fresh data is even better. Far from the costly Business Intelligence solutions, R provides lightweight libraries to generate analysis reports easily. For corporate reporting or research papers, being able to reproduce an analysis remains critical. This course teaches you how to address that issue.
Statistical Inference
After exploration comes understanding the properties of the data. After teaching you theory and practice, this course lets you draw conclusions about the underlying distributions in data.
Regression Models
In this course you’ll learn how to identify predictors in the data. Least squares and inference using regression models are covered, as well as ANOVA and ANCOVA tests. From theory to practice, you’ll be able to use the most important statistical analysis tool in a data scientist’s toolkit.
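As an illustration of the least squares idea, here is a minimal sketch of a simple linear regression with one predictor. It is written in plain Python rather than the R used in the course, and the `fit_line` helper and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations, and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data (made-up values), roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
print("residuals:", [round(r, 2) for r in residuals])
```

The slope is the ratio of the covariance of x and y to the variance of x, and the fitted line always passes through the point of means, which is why the residuals sum to zero.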
Practical Machine Learning
This course covers the whole process needed to perform machine learning. From training models to analyzing error rates, you’ll understand what overfitting, classification trees, Naive Bayes and random forests are. And you’ll be ready to take part in a Kaggle competition.
Developing Data Products
The last regular course teaches you how to develop data-oriented web applications using R and the Shiny library. Shiny allows you to develop interactive data visualizations without web development skills. The final assignment consists of a simple application hosted at Shinyapps.io. I decided to visualize car labeling data.
Data Science Capstone
I did not participate in this last step of the specialization. It seems Coursera recently established a partnership with SwiftKey. The purpose of the project appears to be predicting the next word typed on a keyboard. An interesting presentation of one of these projects is available here.
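Since I did not take the capstone, I cannot say which method it expects. As a rough idea of what next-word prediction can look like, here is a minimal bigram-counting sketch in Python. The `train_bigrams` and `predict_next` helpers and the tiny corpus are hypothetical, and a real project would need a much larger corpus and a smarter model.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words follow it in the text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def predict_next(model, word, k=3):
    """Return up to k of the words most often seen after `word`."""
    counts = model.get(word.lower())
    if not counts:
        return []  # word never seen, or never followed by anything
    return [candidate for candidate, _ in counts.most_common(k)]


# Hypothetical toy corpus (made up for this example).
corpus = "the cat sat on the mat and the cat ate the fish"
model = train_bigrams(corpus)

print(predict_next(model, "the"))
print(predict_next(model, "dog"))
```

In this toy corpus, "cat" follows "the" twice while "mat" and "fish" follow it once each, so "cat" comes first in the suggestions. A word that never appears, such as "dog", yields an empty list.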
After the MOOC?
Getting a basic tool set to work with data can be enough for many. But completing the Data Science Specialization at Coursera does not make you a Data Scientist. However, these 9 courses can be a perfect start for someone interested in becoming one. After completing the specialization, popular resources include two Springer books freely available as PDFs:
- An Introduction to Statistical Learning, which continues with the subjects covered in the specialization.
- The Elements of Statistical Learning, a must-read once you have finished the introductory book.
On the other hand, Data Science is not only statistics. Improving programming skills remains mandatory for students. There are many paths here: from databases to programming languages, technologies are increasingly varied and powerful. Finally, do not forget the very enriching competitions at Kaggle. In conclusion, quality resources are easily available on the web, and one just needs to follow the path to become a real Data Scientist. But first, good luck with the Data Science Specialization at Coursera!