Manual research effort was reduced significantly, resulting in estimated cost savings of ~$1.3M USD per year

Problem Statement & Challenge

The R&D team of a leading confectionery and food manufacturing company wanted to leverage insights from scientific research on topics significant to their industry, such as food safety, health and nutrition, and pet food, drawn from multiple online sources. However, in their existing setup, SMEs manually perused and picked articles, categorized them, and stored them in a repository from which the needed information was reported to business stakeholders. This process proved time-consuming and tedious, so the client was looking to build a solution that would reduce manual effort and increase efficiency and productivity.


TheMathCompany worked with the client to build an end-to-end NLP solution that streamlined the research process of scanning and identifying the needed articles. An automated solution was to be deployed to extract, scan, compile, and categorize numerous documents swiftly. The resulting tool would also reduce the need for manual review by automatically identifying and rejecting irrelevant articles based on their abstracts, increasing the productivity and working efficiency of SMEs.

Client testimonial

“This tool has revolutionized how our team operates and has enabled us to be faster, more accurate, and efficient. The automation and complex machine learning work that has gone into this project has been a game changer. It has allowed us to widen our views on what analytics could do for our company.”

- Director


To automate the process and save the considerable time spent sifting through copious amounts of data, TheMathCompany deployed the end-to-end NLP solution. The Automated NLP Engine consisted of a data pipeline that extracted data from various online sources and sifted through the numerous documents pertinent to the client's research efforts, and a model that leveraged an ML algorithm to categorize articles by relevance. The pipeline extracted data from the relevant external sources on a daily basis. The models were then integrated to classify documents and store the results, which multiple SMEs could view on a User Interface.

Through the User Interface, SMEs could share feedback on relevance accuracy, and the model would learn from these reviews and improve its accuracy over time.

Data Pipeline: The data pipeline extracted pertinent research data from private and public external data sources and stored the outputs. The data was also transformed into formats that could be picked up by the NLP models.

The entire infrastructure was built on Azure. Public and private data sources were used to source the needed research material, using API and RPA techniques, respectively. The data was then transformed on Databricks to make it ready for the model, and the pipeline was scheduled to run daily.
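The pipeline's shape can be sketched in a few lines. This is a minimal illustration, not the deployed implementation: the `Article` schema, field names, and the idea of passing fetcher callables (an API client for public sources, an RPA scraper for private ones) are assumptions for the sake of the example.

```python
from dataclasses import dataclass

@dataclass
class Article:
    """Normalized record produced by the daily extraction step."""
    source: str
    title: str
    abstract: str
    published: str

def transform(raw: dict, source: str) -> Article:
    # Map source-specific field names onto the common schema the
    # downstream NLP models expect.
    return Article(
        source=source,
        title=raw.get("title", "").strip(),
        abstract=raw.get("abstract", "").strip(),
        published=raw.get("date", ""),
    )

def run_daily_pipeline(fetchers: dict) -> list:
    # `fetchers` maps a source name to a callable returning raw records:
    # an API client for public sources, an RPA scraper for private ones.
    articles = []
    for source, fetch in fetchers.items():
        for raw in fetch():
            articles.append(transform(raw, source))
    return articles
```

In production this sort of skeleton would be scheduled as a daily Databricks job, with the fetchers swapped in per source.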

Model Development: An ensemble NLP model was built to predict and categorize article relevance based on abstracts. A metadata network was created to provide the key metrics, and the model took historical data into consideration to ensure that the relevance tags and categories identified were in line with the client's practices.

In the modelling process, a Relevance Model and a Category Classification Model were created.

- Relevance Model: Articles were classified as relevant or irrelevant. ELMo and word2vec embeddings were generated for each abstract, then combined and served as independent variables for a neural network model, which used them to predict and classify articles correctly.

- Category Classification: This ML model was built so that once the needed articles were identified, they could be tagged with the corresponding categories/topics, a capability that proved handy when SMEs created business reports for their teams. This step was previously manual; the secondary model was built alongside the primary NLP engine to reduce the effort spent on tedious categorization.
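The relevance step can be illustrated with a toy sketch. The case study only states that ELMo and word2vec embeddings were combined and fed to a neural network; the concatenation choice, the single sigmoid output unit standing in for the network, and all function names below are assumptions made for illustration.

```python
import math

def combine_embeddings(elmo_vec, w2v_vec):
    # The engine combined ELMo and word2vec representations of an
    # abstract; simple concatenation is assumed here.
    return list(elmo_vec) + list(w2v_vec)

def relevance_score(features, weights, bias):
    # Stand-in for the neural network's output layer: a single
    # sigmoid unit over the combined embedding features.
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def classify(features, weights, bias, threshold=0.5):
    # Threshold the score into the relevant/irrelevant tag the
    # SMEs reviewed in the UI.
    relevant = relevance_score(features, weights, bias) >= threshold
    return "relevant" if relevant else "irrelevant"
```

In the real system the weights would of course be learned from the SME-labelled history rather than supplied by hand, and the category model would run only on articles tagged relevant.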


Both models were re-trained every two weeks to learn from the SMEs' continuous feedback, improving the accuracy of results over time.
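The fortnightly re-training loop rests on folding SME feedback back into the training set. A minimal sketch of that merge step, assuming a hypothetical `{"id", "label"}` record shape that is not described in the case study:

```python
def merge_feedback(training_data, feedback):
    # SME corrections take precedence over earlier labels for the
    # same article id; records are deduplicated by id so the next
    # re-training run sees each article once, with its latest label.
    merged = {rec["id"]: rec for rec in training_data}
    for rec in feedback:
        merged[rec["id"]] = rec
    return list(merged.values())
```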

Web-based User Interface: The final User Interface helped SMEs view the model results and provide feedback, which was then used to re-train the model and improve its accuracy. The interface's automation statistics page displayed the accuracy levels for relevance tagging and category classification over a given period. The UI acted as a transparent representation of the model's performance with regard to overall data accuracy, and provided specific statistics on the accuracy of identifying article relevance.

After reviewing articles, SMEs could download the data in a format compatible with their internal EndNote data libraries. The client company had in-house software that acted as a repository of all the data collected, so the final output was made downloadable as a text file that could be uploaded to the client's existing tool. This enabled the SMEs to access the selected research output on a platform they were already familiar and comfortable with.
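An export of this kind might look as follows. The case study only says "a text file format compatible with EndNote"; the %-prefixed field codes below follow the common EndNote tagged (.enw) import convention, and both that choice and the input dictionary shape are assumptions about the client's library configuration.

```python
def to_endnote_tagged(article: dict) -> str:
    # EndNote's tagged text import format prefixes each field with a
    # %-code: %0 reference type, %T title, %A author (one per line),
    # %D year, %X abstract.
    lines = ["%0 Journal Article", "%T " + article["title"]]
    for author in article.get("authors", []):
        lines.append("%A " + author)
    if "year" in article:
        lines.append("%D " + str(article["year"]))
    if "abstract" in article:
        lines.append("%X " + article["abstract"])
    return "\n".join(lines) + "\n"
```

Records serialized this way can be concatenated into one text file and imported into an existing EndNote library.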

Deploying the product:

Once the solution was built, it was fully deployed to the client environment following the client's best practices. The source, code, data, models, and UI were set up in the client ecosystem so that, once deployed, the NLP engine ran independently on the client's platform on a daily basis.
