It is becoming more common for analytics teams to get access to tremendous amounts of heterogeneous data. Since information rarely comes clean and ordered, processing continuous flows of raw data without clear metadata is a real challenge. As much as any other programming task, data analysis is subject to unexpected complexity. How can you limit risks through metadata, before even including data in your data quality workflow and before profiling?
In any case, the analytics department cannot reject data before profiling it. But even with lots of automation, profiling is costly. To limit unexpected complexity, a series of controls can be performed with the provider beforehand. However, providers rarely show the same maturity regarding the data they deliver. While some create and manage their own structured databases with data quality policies, others only reuse existing data. Even worse, some have no access to a reliable database and only use spreadsheets shared by email. How do you gather the right metadata regardless of your provider? How do you start your data quality protocol the agile way? Here are the questions to ask during the first interviews with data providers, before profiling data.
1- Assess what the data represents
This may sound trivial, but assessing the overall meaning of the data is a good place to kick-start a data quality process. Ensure the provider has a clear idea of the data and what it represents. Data almost always represents places, people or events. If the data does not easily fall into one of these three dimensions, be sure to understand why.
2- Acknowledge analytic perimeter
The analytic perimeter is an important parameter to take into account when integrating external data. As with software source code, licenses often regulate data usage. This is a common situation in Open Data. For example, the permissive Open License regulates much of the data published on the French Open Data portal Data.gouv.fr. However, more restrictive Open Data licenses exist. As of September 2012, OpenStreetMap publishes its data under the terms of the Open Data Commons Open Database License (ODbL). The ODbL requires users to keep the same license on modified versions of the database.
Where there is no license, the law prevails. Most of the data sets on Data.gov, the U.S. Open Data portal, are U.S. Government Works. This means these data sets are not subject to copyright and can be reused freely.
When data comes from internal sources, pay attention to its content. If the data contains personal information, additional laws can apply. Information privacy laws vary from country to country. As misuse of data can bring you to court, always acknowledge your perimeter.
3- Map existing data usages
Surely, someone has used the data before you received it. While mapping every private use may be impossible in the case of Open Data, at least map the usages within the organization. Who used this data? How? For what kind of results? Separate production and reporting usages. Both provide insights about the scope and importance of the data within the organization. Moreover, chances are that basic reporting already exists. This may help steer the analysis in the right direction.
4- Formally delimit the data perimeter
Data is not omniscient. Agree with your provider on the data perimeter before profiling it yourself. Only then confront it with reality, and work together to correct flaws during the data quality processes.
For geographical data, assess its boundaries or the lack thereof. For a database of personal records, make sure to identify the right population. For event data, defining the expected time frame is mandatory for an efficient data quality policy.
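As an illustration of confronting an agreed perimeter with reality, here is a minimal Python sketch for event data. The function name, the agreed time frame and the sample dates are all hypothetical; a real check would read the dates from the provider's extract.

```python
from datetime import date


def outside_time_frame(event_dates, start, end):
    """Return the dates that fall outside the agreed [start, end] time frame."""
    return [d for d in event_dates if not (start <= d <= end)]


# Hypothetical perimeter agreed with the provider: the year 2012.
agreed_start, agreed_end = date(2012, 1, 1), date(2012, 12, 31)

# Hypothetical event dates, standing in for a real extract.
events = [date(2012, 3, 14), date(2011, 12, 30), date(2012, 12, 31)]

strays = outside_time_frame(events, agreed_start, agreed_end)
print(strays)  # [datetime.date(2011, 12, 30)]
```

Any date reported here is a topic to bring back to the provider: either the perimeter was wrongly stated, or the data contains a flaw.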
5- Name the creators of the data
If the data was not created by the provider, make sure you can name its real creator. Sometimes, the provider's source is not the source of the data set. Trace the data flow upstream until you find the real creator, or creators, of the data. A vague office number is not enough. Once the true data sources and the means to contact them are identified, move on to the next step.
6- Understand how the data have been created
Once the data creator or creators have been identified, it is time to understand how the data was created. Countless methods create data: manual input, automatic sensors, application logs…
Luckily, each method has a kind of data quality signature. Manual input in a single application usually does not produce big data, but can include many errors. On the other hand, automatic sensors and application logs can create huge amounts of accurate data. Handling this much information requires the analytics department to coordinate with IT to avoid technical flaws in the processes.
Data update frequency is a mandatory piece of metadata for defining the analytic workflow. Depending on your goals, though, analytics may not need to closely follow production schedules.
7- Acknowledge the data format before profiling it
Although more technical, the data format is no less important. From flat files to structured relational databases, data can take various forms. A general rule is to seek the rawest, unaltered form of the data, stopping just short of paper. In fact, the analytics department should never be asked to scan data. However, this often means getting read access to the database. Still, database administrators can decline full read access to their systems. Moreover, the analytic workflow may be vulnerable to database schema updates.
A common solution for database administrators is to provide flat file extracts to data users. In this case, always promote the use of open formats like CSV or JSON over proprietary formats like XLS.
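One practical argument for open formats is that they can be read with nothing more than a language's standard library. The sketch below shows this in Python; the `load_extract` helper and the sample extracts are made up for illustration.

```python
import csv
import io
import json


def load_extract(text, fmt):
    """Parse a flat-file extract into a list of dicts, standard library only."""
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(text)))
    if fmt == "json":
        return json.loads(text)
    raise ValueError(f"unsupported format: {fmt}")


# Hypothetical extracts describing the same record in both formats.
csv_extract = "city,population\nSpringfield,1000\n"
json_extract = '[{"city": "Springfield", "population": 1000}]'

print(load_extract(csv_extract, "csv"))    # [{'city': 'Springfield', 'population': '1000'}]
print(load_extract(json_extract, "json"))  # [{'city': 'Springfield', 'population': 1000}]
```

Note that the CSV reader returns every value as text, while JSON keeps the number as a number. CSV carries no type information, so field types have to come from the provider's metadata.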
8- Understand the data model
Understand the data model. Does the data come in tables? If so, tables rarely come alone. Write down the relational model. For document-style nested data, list all the possible attributes.
Pay attention to the case of multiple creators. Did they use the same metadata repositories? If not, definitions may differ between data of different origins, and you may not be able to join the different tables or documents automatically.
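To make the join problem concrete, here is a small Python sketch that lists the key values present on only one side of a would-be join. The `unmatched_keys` helper and the two sample extracts are hypothetical; they assume two creators who encode the same country differently.

```python
def unmatched_keys(left_rows, right_rows, key):
    """Return the key values found on only one side of a would-be join."""
    left = {row[key] for row in left_rows}
    right = {row[key] for row in right_rows}
    return sorted(left - right), sorted(right - left)


# Hypothetical extracts from two creators using different country codings.
sales = [{"country": "FR", "amount": 120}, {"country": "DE", "amount": 80}]
regions = [{"country": "France", "zone": "EU"}, {"country": "DE", "zone": "EU"}]

only_in_sales, only_in_regions = unmatched_keys(sales, regions, "country")
print(only_in_sales)    # ['FR']
print(only_in_regions)  # ['France']
```

Here "FR" and "France" refer to the same place, but an automatic join would silently drop both rows. A list like this is a good starting point for a discussion with the creators about shared reference values.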
9- Ensure the data dictionary contains all the needed metadata
Data does not come alone. Of all the metadata, including that gathered in the previous steps, the data dictionary may be the most important. More or less structured, this document describes all the variables, or columns, and their meaning. It also describes their possible values and all the field types. Your data quality workflow may require integrating it into the metadata repository.
Do not forget that dictionaries are rarely complete, as data evolves quickly. New uses of the data are an opportunity to complete them.
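A data dictionary becomes more useful once it is machine-readable. The following Python sketch assumes a dictionary expressed as a plain dict, with one validity rule per column, and checks a small extract against it. The column names, the rules and the sample rows are all hypothetical.

```python
import csv
import io

# Hypothetical data dictionary: meaning and validity rule for each column.
DICTIONARY = {
    "station_id": {"meaning": "numeric station identifier", "is_valid": str.isdigit},
    "status": {"meaning": "station state", "is_valid": lambda v: v in {"open", "closed"}},
}


def check_against_dictionary(rows, dictionary):
    """List (row number, column, problem) for values the dictionary cannot explain."""
    issues = []
    for row_no, row in enumerate(rows, start=1):
        for column, value in row.items():
            spec = dictionary.get(column)
            if spec is None:
                issues.append((row_no, column, "undocumented column"))
            elif not spec["is_valid"](value):
                issues.append((row_no, column, f"unexpected value {value!r}"))
    return issues


# Hypothetical extract: "temp" is undocumented and "A7" is not numeric.
sample = "station_id,status,temp\n12,open,21.5\nA7,closed,19.0\n"
rows = list(csv.DictReader(io.StringIO(sample)))

for issue in check_against_dictionary(rows, DICTIONARY):
    print(issue)
```

Undocumented columns are exactly the gaps mentioned above: each one is an opportunity to complete the dictionary with the provider.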
10- Ask the data provider
No checklist could replace open discussion with the data provider. Discuss strengths and weaknesses of the data sets, and identify problems you may have forgotten as an analyst.
New issues always arise during data processing and analysis. Get back to your provider as soon as needed. Design your processes along the lines of agile methods. Do not get stuck in the good ol’ V-model, more suitable for construction than for the software world. Also, the provider may be thankful to learn about previously unidentified flaws in its data.
Once these questions are asked and the metadata gathered, you can start profiling data without expecting big surprises or hidden costs. Can you think of other questions to ask before profiling data? Please share them with us!