Data quality: The not-so-secret sauce for AI and machine learning

September 06, 2019 by Yugal Sharma - ADVERTORIAL
If AI or machine learning algorithms aren’t living up to your expectations, could data be the culprit?

AI adoption is growing quickly

The promise of artificial intelligence has always felt more like a future state, but the reality is that many companies are already adopting AI initiatives. This is especially true in scientific R&D, where the last few years have seen a huge increase in machine learning and AI initiatives in everything from QSAR models to genomics. According to a 2018 survey, AI adoption grew drastically, from 38% in 2017 to 61% in 2018, across a variety of industries, including healthcare, manufacturing and financial services. However, most early adopters noted that one of the biggest challenges to successful implementation involved data: specifically, accessing, protecting, integrating and preparing data for AI initiatives.

Danger: Data challenges ahead

While companies are heavily investing in the talent needed to design and implement AI algorithms, the success of these initiatives depends largely upon the training data on which they are built and tested. Many companies struggle to manage the vast amounts of unstructured data needed to support projects and translate it into the usable, categorized training sets required to feed algorithms. Some businesses are drowning in data; others are searching for specialized scientific data not readily available in the public domain. Often the data sets that are available take a long time to acquire and transform for the intended purpose. From taxonomies and classifications to connecting disparate data sets, AI initiatives require massive amounts of data preparation to unlock the promise of machine learning.

Pay now or pay later

Up to 80% of a data scientist's time is spent on data wrangling and preparation. There are a variety of public repositories for scientific data, but all have inherent challenges, including transcription errors, mislabeled units and overly complex patent language. Another key challenge is translating foreign-language content. Patents, for example, are published in more than 60 languages globally. The ability to quickly translate, extract, connect and normalize the relevant data is invaluable to the success of AI projects. If reported affinities are off by three or six orders of magnitude, algorithms may never yield an accurate prediction. When data scientists use comprehensive data that is normalized, quality checked and trusted to have correct semantic linking, they can focus their time and energy on optimizing algorithms instead of preparing data.
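To make the unit problem concrete, here is a minimal illustrative sketch of the kind of normalization step the paragraph describes: converting affinity values reported in mixed units onto one scale before they reach a model. The unit table, function name and example values are all hypothetical, not part of any real CAS pipeline.

```python
# Hypothetical unit-normalization sketch. A value of "5 nM" and a value
# of "5 mM" differ by six orders of magnitude, which is exactly the kind
# of error that silently wrecks a trained model if units are dropped.

UNIT_TO_MOLAR = {
    "M": 1.0,
    "mM": 1e-3,
    "uM": 1e-6,
    "nM": 1e-9,
    "pM": 1e-12,
}

def normalize_affinity(value, unit):
    """Convert a reported affinity to molar; fail loudly on unknown units."""
    try:
        return value * UNIT_TO_MOLAR[unit]
    except KeyError:
        raise ValueError(f"Unrecognized unit: {unit!r}")

# The same number with different unit labels lands six orders apart.
a = normalize_affinity(5.0, "nM")
b = normalize_affinity(5.0, "mM")
print(b / a)
```

Failing loudly on an unrecognized unit, rather than passing the raw number through, is the design point: a mislabeled record that errors out can be fixed, while one that slips through skews the training set.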

Unfortunately, teams searching for data will often rely on public sources or spend as little as possible to label and prepare data. When teams are dealing with unstructured scientific data, patents in 60 different languages or complex reaction schemes, they find it's not easy to classify and connect this type of data in a meaningful way. The opportunity costs of data preparation, as well as the accuracy and comprehensiveness of the data, should all be factored into the equation when assessing opportunities for machine learning improvements. If AI-derived predictions aren't meeting expectations, chances are good that the data itself is derailing results.

What is high-quality data?

As the saying goes, quality matters, whether it's seafood, healthcare or training data. Don't eat bargain sushi, and don't feed poor-quality data to your algorithms either. For high-quality data to be leveraged to its fullest capacity, it should be clean and normalized, with correct semantic meaning and connections. This level of quality is not easily achieved. The scientific experts at CAS have a deep understanding of patent language and emerging trends in publications, as well as the foreign-language expertise to separate the signal from the noise. Their expertise in taxonomies, semantic linking and data categorization is essential to building and maintaining a high-quality data set.

The payoff of investing in high-quality data

Our team at CAS has a number of active projects that apply our content collection to various AI and machine learning applications. In fact, we recently filed a patent application based on the work of one of our talented data scientists, Jaron Maxson. He was interested in leveraging machine learning and CAS's content collection to help solve challenges in the materials space. Specifically, he wanted to see if an algorithm could accurately predict functional uses for newly developed polymers. Researchers are creating novel polymers with unique properties but struggle to find the best applications for these compounds. If successful, Jaron's algorithm could potentially increase the ROI on polymer research by maximizing the commercial applications of new developments.

Because of their combinatorial complexity, polymers are inherently one of the most challenging groups for any classification system. The other big challenge with polymers is settling on a measurable definition of polymer function, since there is no recognized methodology for assigning functions to polymers. This is where CAS's long-standing classification system was able to provide a new type of definition for a rather disorganized trait. Representing polymer functions by using predetermined fields of chemistry allowed a novel application of our classically indexed data.

There are millions of existing and theoretical polymers with hundreds of potential properties, but Jaron was able to take a small set of high-quality property data that had been intellectually indexed from the literature by CAS scientists and build a prediction model for applications. The results are promising. The algorithm demonstrated a statistically significant prediction accuracy of 66% when utilizing at least three populated properties for these polymers.
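The general shape of such a model can be sketched with a toy example: predicting a function label for a polymer from a handful of populated property values. Everything below is invented for illustration (the property vectors, labels and the nearest-neighbor approach); Jaron's actual model and the CAS indexed data are not public.

```python
import math

# Illustrative toy data only: each tuple is a polymer's scaled property
# vector (e.g., glass-transition temperature, density, modulus mapped to
# comparable 0-1 ranges) paired with an invented function label.
TRAINING = [
    ((0.9, 0.2, 0.1), "packaging"),
    ((0.8, 0.3, 0.2), "packaging"),
    ((0.1, 0.9, 0.8), "coatings"),
    ((0.2, 0.8, 0.9), "coatings"),
]

def predict_function(properties, k=3):
    """Label a polymer by majority vote among its k nearest neighbors."""
    dists = sorted(
        (math.dist(properties, vec), label) for vec, label in TRAINING
    )
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)

print(predict_function((0.85, 0.25, 0.15)))  # prints "packaging"
```

The point the sketch makes is the same one the paragraph makes: with even three reliably populated, consistently indexed properties per polymer, a simple model has something meaningful to vote on; with noisy or missing properties, no amount of algorithmic sophistication compensates.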

Though an early proof of concept, it illustrates three important points:

  • The quality, unique classifications and historical reach of CAS data are valuable in giving scientists a new way of defining previously disorganized values.
  • Using a diverse and comprehensive training set for models will yield better predictions with less data preparation.
  • CAS's comprehensive collection of data can be easily customized to support the needs of specific algorithms. From property data, polymers and reactions to journals, patents and dissertations, the possibilities are endless.

If your AI or machine learning efforts aren't meeting expectations and your teams are struggling with data challenges, we'd love to talk to see how we can leverage our expertise in data and machine learning to enable faster breakthroughs, greater efficiencies and better decisions. Contact us today!

About Yugal Sharma

Yugal Sharma, PhD, is Senior Director, CAS Solutions, at SEMANTiCS 2019 Premium Sponsor CAS.

About SEMANTiCS Conference

The annual SEMANTiCS conference is the meeting place for professionals who make semantic computing work, understand its benefits and know its limitations. Every year, SEMANTiCS attracts information managers, IT architects, software engineers and researchers from organisations ranging from NPOs, universities and public administrations to the largest companies in the world.