How Much Data Do You Need To Train A Chatbot and Where To Find It?

Published in Chatbots Life, Jan 31, 2020

Most providers and vendors say you need plenty of data to train a chatbot to handle customer support or other queries effectively. But how much is plenty, exactly? We take a look at how various bots are trained and what data they use.

Recently, Google revealed that its latest chatbot, Meena (PDF), was trained on some 341 GB of data. Meena is “a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations.” That puts it a few steps beyond the usual scripted bot, or even those that claim AI smarts.

The 38-page scientific paper highlights the advanced nature of Meena, but any business looking to train its bot faces the same opening question: how much training is enough?

Building A Better Bot Through Training

For a narrowly focused or simple bot, one that takes reservations or tells customers about opening times or what’s in stock, there’s no need for training at all. A script and an API link to a website can provide all the information perfectly well, and thousands of businesses find these simple bots save enough working time to make them valuable assets.
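To make the idea of an untrained, scripted bot concrete, here is a minimal sketch in Python. The keyword patterns, canned answers and the scripted_reply function are all hypothetical, chosen only for illustration; a real bot would pull opening times or stock levels from the business's own website or API rather than from hard-coded strings.

```python
import re

# Hypothetical canned answer; a real bot would fetch this from the
# business's own website or API instead of hard-coding it.
OPENING_HOURS = "We are open 9am to 5pm, Monday to Saturday."
BOOKING_PROMPT = "Sure, what date and time would you like?"
STOCK_PROMPT = "Which item should I check for you?"
FALLBACK = "Sorry, I can only help with opening times, reservations and stock."

# Each rule pairs a keyword pattern with the reply it triggers.
# No training is involved: the first pattern that matches wins.
RULES = [
    (re.compile(r"\b(open|opening|hours|close)\b", re.IGNORECASE), OPENING_HOURS),
    (re.compile(r"\b(book|reserve|reservation|table)\b", re.IGNORECASE), BOOKING_PROMPT),
    (re.compile(r"\b(stock|available|availability)\b", re.IGNORECASE), STOCK_PROMPT),
]


def scripted_reply(message: str) -> str:
    """Return the reply for the first rule whose keywords appear in the message."""
    for pattern, reply in RULES:
        if pattern.search(message):
            return reply
    return FALLBACK


if __name__ == "__main__":
    for question in ["What are your opening hours?", "Can I book a table?", "Tell me a joke"]:
        print(question, "->", scripted_reply(question))
```

A bot like this only works while customers use the expected keywords, which is exactly the limit that training is meant to overcome.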

But when a bot needs to answer a range of questions that people could phrase in many different ways, training becomes essential: it teaches the bot to understand what is being asked of it through natural language processing (NLP).

These bots can be trained on data you already have in the business, such as digitised call centre transcripts, emails or Messenger requests, which provide the variation needed for intent classification and recognition. To see how data capture can be done, there’s this insightful piece from a Japanese university, where researchers collected hundreds of questions and answers from logs to train their bots.
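As a rough illustration of how logged questions become training data, the sketch below trains a tiny intent classifier on a handful of labelled utterances. Everything here is an assumption made for the example: the six utterances and the two intent names are invented, and the technique (a multinomial Naive Bayes with add-one smoothing, written in plain Python) is a simple stand-in, not the method used by any of the bots mentioned in this article.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical labelled examples, standing in for utterances pulled from
# call centre transcripts or email logs and tagged with an intent by a person.
LABELLED_LOGS = [
    ("where is my order", "order_status"),
    ("has my parcel shipped yet", "order_status"),
    ("track my order please", "order_status"),
    ("i want my money back", "refund"),
    ("how do i get a refund", "refund"),
    ("please refund my payment", "refund"),
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Count words per intent; these counts are the whole 'model'."""
    word_counts = defaultdict(Counter)
    intent_counts = Counter()
    vocab = set()
    for text, intent in examples:
        tokens = tokenize(text)
        word_counts[intent].update(tokens)
        intent_counts[intent] += 1
        vocab.update(tokens)
    return word_counts, intent_counts, vocab


def classify(text, model):
    """Pick the intent with the highest Naive Bayes log-probability."""
    word_counts, intent_counts, vocab = model
    total_examples = sum(intent_counts.values())
    best_intent, best_score = None, -math.inf
    for intent, count in intent_counts.items():
        score = math.log(count / total_examples)
        denominator = sum(word_counts[intent].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[intent][token] + 1) / denominator)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent


model = train(LABELLED_LOGS)

if __name__ == "__main__":
    print(classify("can i have a refund", model))
    print(classify("where is my parcel", model))
```

With only three examples per intent, the classifier handles phrasings close to what it has seen and little else. The more varied the logged questions for each intent, the more ways of asking it can recognise, which is why the volume and variety of data matter.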

KLM used some 60,000 questions from its customers to train BlueBot, the airline’s chatbot. Businesses like Babylon Health can gain useful training data from unstructured data, but the quality of that data needs to be firmly vetted, as they noted in a 2019 blog post.

Others have to go further out of their way to find unique information to deliver top-notch customer service. The developers of the Rose chatbot at the Las Vegas Cosmopolitan Hotel gathered theirs in person: “Over the course of 12 weeks, we met with every department within The Cosmopolitan to learn the secrets and surprises the typical guest wouldn’t find on their own. With a ton of information, the team leveraged the user experience team to identify key conversation categories that would help guests experience the property through their interests.”

Mainstream Sources of Training Data

Alternatively, you can buy in data suitable for your vertical or market, using services like Lionbridge, which provides business-focused data across a broad range of categories. Many popular datasets are also available to any business, including:

Microsoft Research’s Social Media Conversation Corpus (free): a collection of 12,696 Tweet IDs representing 4,232 three-step conversational snippets extracted from Twitter logs, dated from 2015.
The Ubuntu Ranking Dataset Creator (free): a large dataset for research into unstructured multi-turn dialogue systems.
Chinese Treebank (free): a collection of Chinese web, newswire, magazine and broadcast news text to help bots learn the language.
A negative feedback dataset, ideal for handling those grumpy customers.

Or, your business can use API-based services like Microsoft’s Azure LUIS (Language Understanding), which use entity dictionaries and other tools to reduce the need for training.

Whatever your chatbot, finding the right type and quality of data is key to giving it the right grounding to deliver a high-quality customer experience. With the right data, you can train chatbots like SnatchBot through simple learning tools or use their pre-trained models for specific use cases.

Clearly, the more data you have, the better, and better still if it can be provided as entities and intents or similar identifiers. But even raw data can be useful in training bots to help customers.
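To show the difference between raw data and data labelled with an intent and entities, here is a small sketch of how one training example might be stored. The class names, field names and the sample sentence are hypothetical and do not follow any particular vendor's format; each platform defines its own.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Entity:
    """A labelled span of the utterance, stored as character offsets."""
    label: str
    start: int
    end: int


@dataclass
class Example:
    """One training utterance; intent and entities stay empty for raw data."""
    text: str
    intent: Optional[str] = None
    entities: List[Entity] = field(default_factory=list)


def find_entity(text: str, value: str, label: str) -> Entity:
    """Locate value inside text and wrap its position as an Entity."""
    start = text.find(value)
    if start == -1:
        raise ValueError(f"{value!r} does not occur in {text!r}")
    return Entity(label=label, start=start, end=start + len(value))


def entity_values(example: Example) -> Dict[str, str]:
    """Map each entity label to the text it covers."""
    return {e.label: example.text[e.start:e.end] for e in example.entities}


SENTENCE = "Book a table for two at 7pm on Friday"

# Raw data: just the text, as it might appear in a chat log.
raw = Example(text=SENTENCE)

# Labelled data: the same text, with an intent and three entities attached.
labelled = Example(
    text=SENTENCE,
    intent="make_reservation",
    entities=[
        find_entity(SENTENCE, "two", "party_size"),
        find_entity(SENTENCE, "7pm", "time"),
        find_entity(SENTENCE, "Friday", "day"),
    ],
)

if __name__ == "__main__":
    print(raw.intent, entity_values(raw))
    print(labelled.intent, entity_values(labelled))
```

The labelled version tells a training tool both what the customer wants and which words carry the details, while the raw version leaves that work to be done later, by a person or by the tool itself.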

Hopefully, this gives you some insight into the volume of data required for building a chatbot or training a neural net. The best bots also learn from new questions that are asked of them, either through supervised training or AI-based training, and as AI takes over, self-learning bots could rapidly become the norm.