At Indix (acquired by Avalara), the goal was to build the "Google of Products". It was an ambitious goal that involved crawling the web to gather product information from 5,000+ brand and retailer websites, classifying the products into a taxonomy of 5,000+ nodes, and extracting relevant product attributes to match products across retailers. This structured data was then exposed via a search API for customer use cases that needed product information. The product catalog currently holds 3+ billion products. The team also built an e-commerce knowledge graph with 100 million nodes and about a billion edges to solve problems like Query Intent Recognition and Query Understanding for Product Search.
Naturally, a robust NLP pipeline was needed to solve these problems by making sense of unstructured text data at this scale.
The first part of the talk will cover the evolution of the architecture, building blocks and algorithms of the NLP Pipeline.
The building blocks Rajesh will cover include Language Models, Word Embeddings, and Knowledge Graphs.
The algorithms he'll cover include classification, entity extraction, document similarity, and query understanding (for the e-commerce domain).
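To give a flavour of the document-similarity building block mentioned above, here is a minimal sketch of embedding-based similarity. The word vectors and values below are entirely hypothetical toy data, not anything from the Indix pipeline; a real system would learn embeddings from a large corpus.

```python
import math

# Toy 3-dimensional word vectors (hypothetical values for illustration;
# a real pipeline would learn embeddings from a large corpus).
embeddings = {
    "phone":  [0.90, 0.10, 0.00],
    "mobile": [0.85, 0.15, 0.05],
    "shoe":   [0.05, 0.90, 0.20],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def doc_vector(tokens):
    """Represent a document as the average of its word vectors."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

# A document about phones scores closer to "phone" than to "shoe".
query = doc_vector(["phone", "mobile"])
print(cosine(query, doc_vector(["phone"])))  # high similarity
print(cosine(query, doc_vector(["shoe"])))   # low similarity
```

Averaging word vectors is the simplest document representation; production systems typically use weighted averages or learned sentence encoders, but the cosine-similarity comparison stays the same.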
Following the acquisition by Avalara, the team was tasked with making sense of unstructured text data in the Tax Compliance domain with limited data.
The second part of the talk will focus on how Rajesh's team is fine-tuning the e-commerce NLP Pipeline and applying Transfer Learning techniques from the e-commerce domain to solve language-understanding problems in the tax compliance domain.
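The transfer-learning idea above can be sketched in miniature: freeze a featurizer "pretrained" on the source domain and fit only a small classifier head on the scarce target-domain labels. Everything here is illustrative, not the team's actual pipeline; the featurizer, vocabulary, and training examples are made up for the sketch.

```python
import math

# Stand-in vocabulary for a "pretrained" e-commerce featurizer
# (hypothetical; a real featurizer would be a learned encoder).
VOCAB = ["tax", "exempt", "grocery", "taxable",
         "electronics", "purchase", "sales", "item"]

def pretrained_features(text):
    """Frozen featurizer: a fixed bag-of-words over VOCAB."""
    tokens = text.lower().split()
    return [float(tokens.count(w)) for w in VOCAB]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fine_tune(examples, epochs=200, lr=0.5):
    """Fit a logistic-regression head on the frozen features
    using plain SGD over a handful of labelled examples."""
    w = [0.0] * len(VOCAB)
    b = 0.0
    for _ in range(epochs):
        for text, label in examples:
            x = pretrained_features(text)
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - label  # gradient of the log loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# Limited labelled tax-domain data (made-up examples).
train = [
    ("sales tax exempt grocery item", 1),
    ("taxable electronics purchase", 0),
]
w, b = fine_tune(train)

def predict(text):
    """Probability that a line item is tax-exempt (toy model)."""
    x = pretrained_features(text)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

The design choice this illustrates: with limited target-domain data, only the small head is trained, so the handful of tax-compliance labels is enough to adapt representations learned from the much larger e-commerce corpus.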