AI in QA: Beyond Metadata
Vincent Brandon, Data Coordinator
July 14, 2021
In a previous article, I reviewed Pandas Profiling, a metadata generation and visualization library in Python for getting insights into datasets. Running statistics on a static dataset empowers human data engineers and analysts to manage data quality actively. For small reports, profiling the data and manually checking each field for expected behavior works incredibly well. But what happens when we have hundreds, or even thousands, of columns with wildly different data types, distributions, and relationships? What happens when we have millions of rows to review to ensure the relationships between columns are consistent?
Enter automated data processing. Machine learning is being pushed further up the data pipeline to help with ingesting and cleanup. Many solutions are trying to overcome major hurdles in data migration and reliability.
Newer tooling around loading data can be binned into three areas:
- Type inference (is this a number or a character?)
- Context inference (is this a field we see in some domain? Like an SSN or address.)
- Dashboards to handle all the metadata information
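The first of these, type inference, can be sketched in a few lines. Here is a minimal illustration (the `infer_type` helper and the sample columns are invented for this post, not taken from any particular library): try progressively wider parsers against every raw string in a column and report the narrowest type that fits.

```python
# Minimal type-inference sketch: guess whether a raw string column holds
# integers, floats, ISO dates, or free text. The helper and sample data
# are illustrative, not from any specific tool.
from datetime import datetime

def infer_type(values):
    """Return the narrowest type ('int', 'float', 'date', 'str') fitting all values."""
    def fits(parser):
        try:
            for v in values:
                parser(v)
            return True
        except (ValueError, TypeError):
            return False

    if fits(int):
        return "int"
    if fits(float):
        return "float"
    if fits(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "str"

columns = {
    "id": ["1", "2", "3"],
    "score": ["0.5", "1.25", "3"],
    "enrolled": ["2021-07-14", "2021-01-02"],
    "name": ["Ada", "Grace"],
}
inferred = {name: infer_type(vals) for name, vals in columns.items()}
print(inferred)  # {'id': 'int', 'score': 'float', 'enrolled': 'date', 'name': 'str'}
```

Production inference engines handle far more (locales, currencies, mixed columns, nulls), but the core idea is the same cascade of candidate parsers.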
All these processes produce further metadata. On top of the statistics, information about valid field types (is it a date, a floating-point number, a string, and how long?) is stored to aid in translating data between systems. Contextual inference is a pathway to domain-specific validation and cleaning. If you know you have an address, natural language processing and geolocation services can be employed to standardize all values across datasets.
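Contextual inference can start much simpler than NLP: flag columns whose values match a known domain pattern. The sketch below uses US SSNs and ZIP codes as examples; the patterns and the 0.9 match threshold are illustrative choices, not a production ruleset.

```python
# Hedged sketch of context inference: report the first known domain whose
# pattern matches at least `threshold` of a column's values.
import re

DOMAIN_PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "zip_code": re.compile(r"^\d{5}(-\d{4})?$"),
}

def infer_context(values, threshold=0.9):
    """Return a matching domain name for the column, or None."""
    for domain, pattern in DOMAIN_PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if values and hits / len(values) >= threshold:
            return domain
    return None

print(infer_context(["123-45-6789", "987-65-4321"]))  # ssn
print(infer_context(["84101", "84102-1234"]))         # zip_code
print(infer_context(["hello", "world"]))              # None
```

The threshold matters: a strict 100% match would let a single typo hide an SSN column from downstream masking and validation.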
Keeping track of the many engines working away is difficult. This difficulty is why dashboards have become so prominent. Metadata can become as cumbersome as the raw data itself if not presented properly. Behind the scenes, database connectors are trying to translate data types and text formatting from one system to the next. Statistics such as distribution models, counts, and cardinality are generated. See StreamSets and DataRobot as examples.
Cleaning & Enrichment
Even with all the help of scanning every field, checking that every type is correct or flagged, and giving users context as to what the variables could or should be, data engineering teams are still bogged down cleaning the dataset. However, advances in estimating functional dependencies (how one value might influence another, across columns, across time, across datasets) are finally yielding optimized solutions that run in reasonable amounts of time (Zhang, Guo, & Rekatsinas, 2020).
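At its core, a functional dependency A → B holds when every value of A maps to exactly one value of B. The cited work infers these statistically from noisy data; the exact-match check below is a deliberate simplification to show the idea, with invented sample rows.

```python
# Toy check of a candidate functional dependency lhs -> rhs:
# the dependency holds when each lhs value maps to a single rhs value.
from collections import defaultdict

def holds_fd(rows, lhs, rhs):
    """Check whether column `lhs` functionally determines column `rhs`."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return all(len(vals) == 1 for vals in seen.values())

rows = [
    {"zip": "84101", "city": "Salt Lake City"},
    {"zip": "84101", "city": "Salt Lake City"},
    {"zip": "84604", "city": "Provo"},
]
print(holds_fd(rows, "zip", "city"))          # True
rows.append({"zip": "84101", "city": "SLC"})  # a variant spelling breaks the FD
print(holds_fd(rows, "zip", "city"))          # False
```

Notice how a single variant spelling breaks the exact check: that fragility is exactly why statistical approaches that tolerate noise are the interesting advance.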
HoloClean, for example, can impute missing data. By building thousands of models as a dataset is processed, aided by human engineers with domain experience, data cleaning software can detect misspellings, bad assignments, or blips in ingestion that would otherwise go unnoticed. What this means for researchers, business users, and analysts is more reliable data.
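To make the imputation idea concrete (this is not HoloClean's actual API, just a toy illustration of the principle), a missing value can be filled from a model over correlated columns. Here the "model" is simply the most frequent city observed among rows sharing the same zip code:

```python
# Toy model-based imputation: fill a missing 'city' using the modal city
# of rows that share the same 'zip'. Real systems learn far richer models.
from collections import Counter

def impute_city(rows):
    """Fill None 'city' values in place from the modal city per 'zip'."""
    by_zip = {}
    for row in rows:
        if row["city"] is not None:
            by_zip.setdefault(row["zip"], Counter())[row["city"]] += 1
    for row in rows:
        if row["city"] is None and row["zip"] in by_zip:
            row["city"] = by_zip[row["zip"]].most_common(1)[0][0]
    return rows

rows = [
    {"zip": "84101", "city": "Salt Lake City"},
    {"zip": "84101", "city": None},
    {"zip": "84604", "city": "Provo"},
]
impute_city(rows)
print(rows[1]["city"])  # Salt Lake City
```

Systems like HoloClean generalize this pattern: many probabilistic models over many column relationships, with human domain knowledge supplying constraints.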
For engineering teams and coordinators, this is an exciting time. The UDRC maintains a longitudinal database of hundreds of tables, thousands of fields, billions of data points. We will continue to improve our upstream processes to deliver the best possible data for our customers.
Zhang, Y., Guo, Z., & Rekatsinas, T. (2020). A Statistical Perspective on Discovering Functional Dependencies in Noisy Data. SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data.