Automated NLP engine
The R&D team of a leading confectionery and food manufacturing company wanted to leverage insights from scientific research on topics significant to its industry, such as food safety, health and nutrition, and pet food, drawn from multiple online sources. However, in the existing setup, subject matter experts (SMEs) manually perused and picked articles, categorized them, and stored them in a repository from which the needed information was reported to business stakeholders. This process proved time-consuming and tedious. The client therefore wanted a solution that would reduce manual effort and increase efficiency and productivity.
TheMathCompany worked with the client to build an end-to-end NLP solution that would streamline the research process of scanning and identifying the needed articles. An automated solution was to be deployed to extract, scan, compile, and categorize numerous documents swiftly. The resulting tool would also reduce the need for manual review, automatically identify and reject irrelevant articles based on their abstracts, and increase the productivity and working efficiency of SMEs.
To automate the process and save the time spent sifting through large volumes of data, TheMathCompany deployed the end-to-end NLP solution. The resulting Automated NLP Engine consisted of a data pipeline that extracted data from relevant external online sources on a daily basis and sifted through the numerous documents pertinent to the client’s research efforts, together with models that leveraged an ML algorithm to categorize articles by pertinence. All the models were then integrated to classify documents and store the results, which multiple SMEs could view through the User Interface.
Through the User Interface, SMEs could share feedback on the accuracy of the relevance predictions; the model learned from these reviews and improved its accuracy over time.
Data Pipeline: The data pipeline extracted pertinent research data from private and public external data sources and stored the outputs. The data was also transformed into formats that could be picked up by the NLP models.
The entire infrastructure was built on Azure. Research material was sourced from public data sources through APIs and from private data sources through RPA techniques. The data was then transformed on Databricks to make it ready for the model. The pipeline was scheduled to run daily.
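The transformation step described above can be illustrated with a minimal sketch. It assumes that records arriving from the API and RPA routes use different field names and must be mapped onto one common schema before modelling. The field names (`title`, `headline`, `abstract`, `summary`), the source labels, and the rule of dropping duplicates and records without an abstract are all hypothetical choices made for illustration; the actual Azure and Databricks implementation is not described in this case study.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Article:
    """Common schema handed to the NLP models (hypothetical)."""
    source: str
    title: str
    abstract: str
    fetched_on: date


def clean(text):
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(str(text or "").split())


def normalize(raw, source, fetched_on):
    """Map a raw record onto the common schema.

    Assumption: API records use 'title'/'abstract', while RPA-scraped
    records use 'headline'/'summary'.
    """
    title = clean(raw.get("title") or raw.get("headline"))
    abstract = clean(raw.get("abstract") or raw.get("summary"))
    return Article(source, title, abstract, fetched_on)


def run_daily_batch(batches, fetched_on):
    """Normalize one day's records, skipping duplicates and empty abstracts."""
    seen_titles = set()
    articles = []
    for source, records in batches.items():
        for raw in records:
            article = normalize(raw, source, fetched_on)
            key = article.title.lower()
            if not article.abstract or key in seen_titles:
                continue
            seen_titles.add(key)
            articles.append(article)
    return articles


# Toy input standing in for one day's extraction from both routes.
batches = {
    "public_api": [
        {"title": "Cocoa  flavanols and health", "abstract": "A review of trials."},
    ],
    "private_rpa": [
        {"headline": "cocoa flavanols and health", "summary": "Duplicate entry."},
        {"headline": "Pet food safety", "summary": ""},
        {"headline": "Salmonella in pet food", "summary": "A contamination study."},
    ],
}
articles = run_daily_batch(batches, date(2024, 1, 1))
```

In this toy run the duplicate title and the record with an empty abstract are discarded, so two articles remain for the models to score. Deduplicating on a lowercased title is only one possible rule; a production pipeline would more likely key on a stable identifier such as a DOI where one is available.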
Model Development: An ensemble NLP model was built to predict the pertinence of articles and categorize them based on their abstracts. A metadata network was created to provide the key metrics, and the model took historical data into consideration to ensure that the relevance tags and categories it identified were in line with the client’s practices.
In the modelling process, a Relevance Model and a Category Classification Model were created.
- Relevance Model: Articles were classified as relevant or irrelevant. Embeddings were generated with ELMo and word2vec, then combined to serve as the independent variables (input features) for a neural network model, which used them to predict the relevance of each article.
- Category Classification: This ML model was built so that, once the needed articles were identified, they could be tagged with the corresponding categories or topics. This proved handy when SMEs were creating business reports for their teams. Categorization used to be a manual step; the secondary model was built alongside the primary NLP engine to reduce the manual effort spent on it.
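The idea behind the Relevance Model, combining two embeddings of an abstract into one feature vector and passing it to a classifier, can be sketched as follows. This is an illustration under stated assumptions, not the client's model: the toy vectors stand in for real ELMo and word2vec outputs, "combined" is assumed to mean concatenation, and a single logistic unit trained by gradient descent stands in for the neural network, whose architecture the case study does not specify. All names here are hypothetical.

```python
import math
import random


def combine_embeddings(contextual_vec, static_vec):
    """Concatenate two embedding vectors into one feature vector."""
    return list(contextual_vec) + list(static_vec)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class RelevanceClassifier:
    """A single logistic unit: a simplified stand-in for the neural network."""

    def __init__(self, n_features, lr=0.5, seed=0):
        rng = random.Random(seed)
        self.w = [rng.uniform(-0.01, 0.01) for _ in range(n_features)]
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        """Probability that the article is relevant."""
        z = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return sigmoid(z)

    def fit(self, X, y, epochs=200):
        """Stochastic gradient descent on the log loss."""
        for _ in range(epochs):
            for x, target in zip(X, y):
                err = self.predict_proba(x) - target
                self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
                self.b -= self.lr * err
        return self

    def predict(self, x):
        return "relevant" if self.predict_proba(x) >= 0.5 else "irrelevant"


# Toy data: (contextual vector, static vector, label), with 1 = relevant.
# Real ELMo and word2vec vectors would have hundreds of dimensions.
toy_abstracts = [
    ([0.9, 0.8], [0.7, 0.1, 0.2], 1),
    ([0.8, 0.9], [0.6, 0.2, 0.1], 1),
    ([0.1, 0.2], [0.1, 0.9, 0.8], 0),
    ([0.2, 0.1], [0.2, 0.8, 0.9], 0),
]
X = [combine_embeddings(c, s) for c, s, _ in toy_abstracts]
y = [label for _, _, label in toy_abstracts]
model = RelevanceClassifier(n_features=len(X[0])).fit(X, y)
```

Concatenation keeps both representations intact and lets the classifier weight them independently. A category classifier could plausibly reuse the same combined features with a multi-class output, though the case study does not say whether the two models shared inputs.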