As some of you know, I’ve moved from consulting back into a full-time role at Crop Pro Insurance. There is so much opportunity in this role. First of all, it gives me my first full-time data science credit. I also get to build a team to support data science projects. On top of that, I get to push my data automation code to the next level. All of these work together to make me ask one question: “Can we build a tool that would assist in data analysis tasks?”
What does a data analyst do?
I’m not saying data analysts aren’t doing anything. What I am asking is: can we list what they are doing, so we can identify tasks that are ripe for automation?
They Collect Ontology
During the first couple of weeks of the new job, a lot of my time is spent learning about the business. I’m learning new acronyms, new vocabulary, and new concepts. The fancy term for this work is Ontology Collection. Data analysts will collect this information on paper or in OneNote. Eventually, they’ll turn it into human-readable documentation that lets developers and business owners talk about the business in a single language.
I’ve found that the process of gathering this kind of information isn’t structured. It’s generally a set of conversations between the analysts and the Subject Matter Experts (SMEs). For the most part, the analyst records notes while the SMEs talk about all the details of their work. It’s up to the analyst to figure out which pieces of information are important and which to discard. Right now, I think this would be a difficult task to automate. The end goal of this automation would be a chatbot that records all the raw information and parses it for new words and phrases, feeding each new word or phrase back to the SME as a prompt for further definition.
But how could we model ontology in a way that the bot could work?
How could we model this information so that it could be consumed by other processes down the line?
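One possible starting point, sketched in Python: treat the ontology as a glossary of terms and have the bot compare each raw transcript against it, turning anything it doesn’t recognize into a question for the SME. Everything named here (`Term`, `Ontology`, `unknown_words`, the stopword list) is hypothetical, the sample term is only an illustration, and the sketch only handles single words, not phrases.

```python
import json
import re
from dataclasses import asdict, dataclass, field


@dataclass
class Term:
    """One glossary entry. The field names are assumptions, not a standard."""
    name: str
    definition: str = ""
    related: list = field(default_factory=list)


class Ontology:
    def __init__(self):
        self.terms = {}  # lower-cased name -> Term

    def add(self, name, definition=""):
        self.terms[name.lower()] = Term(name=name, definition=definition)

    def unknown_words(self, transcript, stopwords=frozenset()):
        """Words in the transcript that are not in the glossary, in first-seen order."""
        found = []
        for word in re.findall(r"[A-Za-z][A-Za-z\-]+", transcript):
            key = word.lower()
            if key in self.terms or key in stopwords or key in found:
                continue
            found.append(key)
        return found

    def prompts(self, transcript, stopwords=frozenset()):
        """Turn each unknown word into a follow-up question for the SME."""
        return [f'What does "{w}" mean?' for w in self.unknown_words(transcript, stopwords)]

    def export(self):
        """Machine-readable form that downstream processes could consume."""
        return json.dumps([asdict(t) for t in self.terms.values()], indent=2)


glossary = Ontology()
glossary.add("APH", "Actual Production History")
stopwords = {"the", "uses", "for", "each"}
for question in glossary.prompts("The APH uses yield for each unit", stopwords):
    print(question)
```

The same glossary serves both audiences: `prompts` drives the conversation with the SME, and `export` hands the collected terms to other processes as JSON.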
They Populate a Data Catalog
Yes, I’m referring to Azure Data Catalog. And yes, I’m aware ADC has room for improvement, but it is still the best option I’ve found for facilitating this kind of work. It’s the first product to realize no one person or team will ever be able to catalog all the data assets in your enterprise. It will take many SMEs to get there. Data Analysts are great at this kind of work because they can take information from all prior meetings that hint at a data source and explore that information further. Let’s say someone mentions a system in passing. A good data analyst will record what they can when they learn about this new source, and explore for more information later.
This follow-up seems ripe for automation. Any time a new source of information is identified, we could queue it up for interrogation. By interrogation, I mean exploring the information schema of a given source. That capability isn’t always baked into the source system; sometimes it requires separate tools to get the job done. But this is a process that automation can kick-start before passing the work over to a human for further analysis.
Collecting this information in both a human-readable and machine-readable way will be critical. A solid metadata model can support both of these goals.
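To make the idea of interrogating a source concrete, here is a minimal sketch that assumes the source is a SQLite database, chosen only because Python ships with a driver for it; other systems expose their schema differently. The `interrogate` and `describe` functions, the `description` field, and the sample `policy` table are all hypothetical. The automated pass fills in tables, columns, and types, and leaves the descriptions blank for a human to complete.

```python
import json
import sqlite3


def interrogate(conn):
    """Return {table: [column metadata, ...]} for every user table in a SQLite database."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        catalog[table] = [
            {
                "column": name,
                "type": col_type,
                "nullable": not notnull,
                "description": "",  # left empty for the analyst or SME to fill in
            }
            for _cid, name, col_type, notnull, _default, _pk in columns
        ]
    return catalog


def describe(catalog):
    """Render the same metadata as plain text for human readers."""
    lines = []
    for table, columns in catalog.items():
        lines.append(table)
        for col in columns:
            null_note = "nullable" if col["nullable"] else "required"
            lines.append(f"  {col['column']} ({col['type']}, {null_note})")
    return "\n".join(lines)


# A throwaway in-memory database stands in for a newly discovered source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE policy (id INTEGER NOT NULL, crop TEXT)")
catalog = interrogate(conn)
print(json.dumps(catalog, indent=2))  # machine-readable
print(describe(catalog))              # human-readable
```

One dictionary feeds both outputs, which is the point of having a single metadata model behind the human-readable and machine-readable views.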
They Collect Business Rules
When analysts are in meetings, they’ll also learn about business rules. These rules vary tremendously. They can define which data is valid versus what’s considered bad data. They can define Service Level Agreements that control when developers can deploy solutions to different environments. All of this information is useful, but are we collecting it in a way that a machine can consume later? Most often I find this information in documentation and have to implement it in code or jobs. I’ve never found it collected in a way I could simply reference.
The real challenge here is going to be defining a structure that’s flexible enough to allow for rules around subject areas that are completely unknown until they’re learned. The rules will also have to be enforceable. This sounds incredibly difficult right now, but are there techniques out there that could work?
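One technique that might work for the simplest rules is to store them as data rather than code, so a rule about a brand-new subject area is just another record, and a small generic engine enforces whatever it finds. The sketch below is an assumption-heavy illustration: the rule format, the `violations` function, and the sample rules and field names are all made up, and it only covers single-field comparisons, not things like SLAs.

```python
import operator

# Comparison names a rule may use, mapped to the functions that implement them.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "in": lambda value, allowed: value in allowed,
}

# Hypothetical rules, stored as plain data so they could live in a file or table.
RULES = [
    {"id": "acres-positive", "subject": "policy", "field": "acres", "op": ">", "value": 0},
    {"id": "known-crop", "subject": "policy", "field": "crop", "op": "in",
     "value": ["corn", "soybeans", "wheat"]},
]


def violations(record, rules, subject):
    """Return the ids of the rules for this subject that the record breaks."""
    failed = []
    for rule in rules:
        if rule["subject"] != subject:
            continue
        check = OPERATORS[rule["op"]]
        # A missing field counts as a violation instead of raising an error.
        if rule["field"] not in record or not check(record[rule["field"]], rule["value"]):
            failed.append(rule["id"])
    return failed


print(violations({"acres": 0, "crop": "corn"}, RULES, "policy"))
print(violations({"acres": 120, "crop": "corn"}, RULES, "policy"))
```

The first record breaks the acreage rule and the second passes both. Because the rules are data, the same list could also be rendered into documentation, which would keep the human-readable and enforceable versions from drifting apart.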
There’s far more to the job of a data analyst. What tasks am I missing here? What are your thoughts on automation and extending analysts’ abilities with AI? Could we equip our best analysts to do even more? Could we keep up with growing demand through a hybrid solution of human and machine? Share your thoughts below and via Twitter. I’m interested in what you think!