Data Challenges in the AI-ML Journey

Posted on: March 31st, 2022

Author: Sudhakaran Jampala

The adoption of artificial intelligence and machine learning (AI/ML) has accelerated because cloud services offer cost-effective, near-limitless data storage and computing capacity.¹ According to Forrester, 53% of global data and analytics decision-makers report that they are at some stage of the AI/ML journey, whether in the implementation or post-implementation phase.² Gartner estimates that by 2025, 50% of cloud data centers will deploy advanced robots with AI/ML capabilities, resulting in 30% greater operating efficiency.³ Yet fear and uncertainty about AI/ML projects persist across companies, as an estimated 70% to 85% of data science projects fail.⁴

The need for a data strategy to counter data challenges

To counter data complexity, particularly with unstructured data, many organizations launch a single project to organize all their data, which usually means placing it in a large data lake.⁵ This approach rarely works because it is not part of a Well-Architected ML lifecycle.

Exhibit 1: Well-Architected ML Lifecycle

Source: Amazon Web Services⁶

Therefore, it is imperative to analyze and investigate datasets and summarize their main characteristics, typically with data visualization methods, before running data analytics. This process is called Exploratory Data Analysis (EDA), and it makes it easier to discover patterns in the data, identify anomalies, test hypotheses, and check underlying assumptions.
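
As a rough illustration of the EDA step described above, the sketch below summarizes a single numeric field and flags possible anomalies using only the Python standard library. The `daily_orders` values, the function names, and the choice of the interquartile-range rule are assumptions made for this example; in practice a library such as pandas would usually do this work.

```python
import statistics

# Hypothetical sample: daily order counts with one suspicious spike.
daily_orders = [120, 135, 128, 131, 125, 990, 127, 133]


def summarize(values):
    """Return basic univariate summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def flag_outliers(values, k=1.5):
    """Flag values outside the interquartile-range fences (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


summary = summarize(daily_orders)
outliers = flag_outliers(daily_orders)
print(summary)
print("Possible anomalies:", outliers)
```

A flagged value is only a candidate for investigation: whether it is an error or a genuine event is a judgment the analyst still has to make.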

Exhibit 2: Statistical functions and techniques possible with EDA

Function | Brief Description
Clustering and dimension reduction | Help develop graphical displays of high-dimensional data containing many variables
Univariate visualization | Shows each field in the raw dataset along with summary statistics
Bivariate visualizations | Together with summary statistics, help assess the relationship between each variable in the dataset and the target variable
Multivariate visualizations | Map and explain interactions between different data fields
K-means clustering | An unsupervised learning method for clustering, commonly used in market segmentation, pattern recognition, and image compression
Predictive models | Use statistics and data to predict outcomes; linear regression is one example

Source: IBM⁷
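
To make the K-means row in Exhibit 2 more concrete, here is a minimal sketch of the standard assign-and-update loop (Lloyd's algorithm) in plain Python. The sample points, the explicit starting centroids, and the function name `kmeans` are assumptions chosen for illustration; a production project would more likely rely on an established ML library.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points.

    `centroids` holds the starting centers (a list of tuples). They are
    passed in explicitly so that the run is deterministic.
    """
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep center
        if new_centroids == centroids:
            break  # converged: centers no longer move
        centroids = new_centroids
    return centroids, assignments


# Hypothetical customers: (monthly spend, visits per month), arbitrary units.
points = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2), (8.0, 8.0), (8.5, 7.5), (9.0, 8.2)]
centers, labels = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

With these starting centers the two visibly separate groups end up in different clusters; with real data the result depends on the initial centers and on the chosen number of clusters.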

Data for analysis is rarely available in a structured or readily usable form; it may contain errors and omissions and may lack contextual metadata. To get the data into a usable format, data scientists use data wrangling: cleansing, validating, and structuring the raw data.

Exhibit 3: Key Data Wrangling Activities

Activity | Brief Description
Discovering | Exploring the raw data to understand what it contains and how it might be used
Structuring | Restructuring the unstructured data by reshaping or merging it for easier analysis
Cleaning | Making corrections and removing inaccurate data to boost data quality
Enriching | Adding data from other sources to augment the existing data
Validating | Verifying the data’s consistency, quality, and security
Publishing | Pushing the treated data down the data pipeline for analytical use

Source: Techcanvass⁸
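
Several of the activities in Exhibit 3 can be sketched in a few lines. The example below cleans, de-duplicates, and validates a handful of records before "publishing" them. The field names, the sample rows, and the age rule in `is_valid` are hypothetical and stand in for whatever rules a real dataset would need.

```python
# Hypothetical raw rows, as they might arrive from a CSV export.
raw_rows = [
    {"customer_id": "C001", "age": "34", "country": " us "},
    {"customer_id": "C002", "age": "", "country": "DE"},
    {"customer_id": "C001", "age": "34", "country": "US"},  # duplicate ID
    {"customer_id": "C003", "age": "-5", "country": "fr"},  # invalid age
]


def clean(row):
    """Structuring and cleaning: trim text, normalize case, parse numbers."""
    age_text = row["age"].strip()
    return {
        "customer_id": row["customer_id"].strip(),
        "age": int(age_text) if age_text else None,  # missing age becomes None
        "country": row["country"].strip().upper(),
    }


def is_valid(row):
    """Validating: an assumed rule that a known age lies between 0 and 120."""
    return row["age"] is None or 0 <= row["age"] <= 120


def wrangle(rows):
    """Return (published, rejected) lists after cleaning and validation."""
    seen_ids = set()
    published, rejected = [], []
    for row in map(clean, rows):
        if row["customer_id"] in seen_ids:
            continue  # drop repeated customer IDs, keeping the first
        seen_ids.add(row["customer_id"])
        (published if is_valid(row) else rejected).append(row)
    return published, rejected


published, rejected = wrangle(raw_rows)
print(published)
print(rejected)
```

Keeping the rejected rows, rather than silently dropping them, makes it possible to review why records failed validation.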

EDA leads to feature engineering and feature selection. Feature engineering takes raw data from the selected datasets and transforms it into “features” that better represent the underlying problem to be solved.⁹ “Features” are fixed-size arrays of numbers that AI/ML algorithms can process. Feature engineering includes data cleansing, and it can be the largest part of an AI/ML project in terms of time spent.
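
The idea of turning raw records into fixed-size arrays of numbers can be shown with a small sketch. It assumes two hypothetical fields, a numeric `age` that is min-max scaled and a categorical `plan` that is one-hot encoded; the records and the helper name `build_encoder` are illustrative only.

```python
# Hypothetical raw records with one numeric and one categorical field.
records = [
    {"age": 25, "plan": "basic"},
    {"age": 40, "plan": "premium"},
    {"age": 55, "plan": "basic"},
]


def build_encoder(rows):
    """Learn scaling bounds and categories from `rows`, return an encoder."""
    ages = [r["age"] for r in rows]
    lo, hi = min(ages), max(ages)
    plans = sorted({r["plan"] for r in rows})  # fixed, repeatable column order

    def encode(row):
        # Min-max scaling maps age into the range [0, 1].
        scaled_age = (row["age"] - lo) / (hi - lo) if hi > lo else 0.0
        # One-hot encoding: one column per known plan.
        one_hot = [1.0 if row["plan"] == p else 0.0 for p in plans]
        return [scaled_age] + one_hot

    return encode


encode = build_encoder(records)
# Each vector has the layout [scaled_age, plan == "basic", plan == "premium"].
features = [encode(r) for r in records]
print(features)
```

Every record now maps to a vector of the same length, which is the property most training algorithms require.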

The optimum finish

After feature engineering and selection, the next step is training. Training and optimizing an ML model is a largely iterative process. Training is the most intensive step of the entire life cycle, and keeping track of the results of each experiment quickly becomes complex. Data scientists can face operational frustrations at this stage when they have no means of recording the precise configuration of each run. Tracking tools simplify the process by recording the data, the features selected, and the model parameters alongside the performance metrics. Experiments can then be compared side by side, making the differences in performance clear.
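
A minimal sketch of what such tracking records might look like follows. The `Run` and `ExperimentLog` classes are hypothetical names invented for this example, and the metric values are placeholders rather than real results; dedicated tracking tools offer far more than this.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One training experiment: what went in and what came out."""
    name: str
    dataset: str
    features: list
    params: dict
    metrics: dict = field(default_factory=dict)


class ExperimentLog:
    """Keeps runs in memory so they can be compared side by side."""

    def __init__(self):
        self.runs = []

    def record(self, run):
        self.runs.append(run)

    def best(self, metric):
        """Return the run with the highest value for `metric`."""
        return max(self.runs, key=lambda r: r.metrics[metric])

    def compare(self, metric):
        """Return (name, params, value) tuples, best run first."""
        ranked = sorted(self.runs, key=lambda r: r.metrics[metric], reverse=True)
        return [(r.name, r.params, r.metrics[metric]) for r in ranked]


log = ExperimentLog()
# Placeholder numbers for illustration only, not measured results.
log.record(Run("run-1", "customers_v1", ["age", "plan"],
               {"learning_rate": 0.1}, {"accuracy": 0.81}))
log.record(Run("run-2", "customers_v1", ["age", "plan", "region"],
               {"learning_rate": 0.05}, {"accuracy": 0.86}))

comparison = log.compare("accuracy")
best_run = log.best("accuracy")
for name, params, value in comparison:
    print(name, params, value)
```

Because each run stores its dataset, features, and parameters together with its metrics, a difference in performance can be traced back to the configuration that produced it.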

Significant versions of a model need to be captured for possible later use; this challenge is called reproducibility. The objective is to save enough information about the environment in which the model was developed so that the model can be rebuilt from scratch with similar results. Without reproducibility, handing the model over to production (or DevOps) will be riddled with inefficiencies.
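
One way to approach this, sketched here under simplifying assumptions, is to write a small manifest next to each saved model. The manifest fields, the function name `build_manifest`, and the sample inputs are hypothetical; a real project would typically also pin package versions and record the code revision.

```python
import hashlib
import json
import platform


def build_manifest(training_data: bytes, params: dict, seed: int) -> dict:
    """Collect the details needed to rebuild a model under similar conditions."""
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        # A content hash identifies exactly which data the model was trained on.
        "data_sha256": hashlib.sha256(training_data).hexdigest(),
        "params": params,
        "seed": seed,  # the random seed used for training
    }


# Hypothetical inputs for illustration.
sample_data = b"age,plan\n25,basic\n40,premium\n"
manifest = build_manifest(sample_data, {"learning_rate": 0.1}, seed=42)

# Serialize with sorted keys so that manifests are easy to diff.
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
print(manifest_json)
```

Storing this file with the model artifact gives the team that receives the model a starting point for recreating the training environment.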

  1. https://d1.awsstatic.com/psc-digital/2021/gc-400/mining-insights-fsi-/AWS_Mining_Intelligent_Insights_with_Machine_Learning_Financial_Services_eBook.pdf
  2. https://www.mobiquity.com/insights/embarking-on-the-ai-ml-journey
  3. https://www.gartner.com/en/newsroom/press-releases/2021-11-01-gartner-predicts-half-of-cloud-data-centers-will-deploy-robots-with-ai-capabilties-by-2025
  4. https://www.servercomputeworks.com/datasheets/AI-Journey-whitepaper.pdf
  5. https://www.ibm.com/downloads/cas/EBJQ6K7M
  6. https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/wellarchitected-machine-learning-lens.pdf
  7. https://www.ibm.com/in-en/cloud/learn/exploratory-data-analysis
  8. https://businessanalyst.techcanvass.com/what-is-data-wrangling-and-exploratory-analysis/
  9. https://itlligenze.com/uploads/5/137039/files/oreilly-ml-ops.pdf
