Open Data already poses plenty of challenges, from data presentation formats that rarely follow a common pattern to the lack of good methods to facilitate visualization. As a result, regular users who want to access and reuse the data face a great deal of complexity. This complexity becomes even greater when we talk about the Multidimensional Data Model. Based on this model, Open Data may be structured as cubes, involving dimensions and metrics that expand as data grows on the Web, so the information needed may be spread over several data cubes. One thing is apparent: users cannot easily find what they want, and when they do, natural language queries are not usually supported.
We have been studying ways to explore open data and data cubes through natural language, which led us to chatbots. During a conversation with the user, a chatbot can recognize the information needed and match it against database dimensions. How nice would it be if the chatbot could suggest metrics and dimensions according to the matches found, helping the user formulate a query? To top it off, it could deliver an answer that the user might take a long time to find on their own.
Chatbot Conversational Flow for Multidimensional Queries
Following this motivation, we implemented a chatbot conversation flow for answering dimensional queries (see next Figure). The conversation starts with a query sentence (i.e., an input), from which we recognize the user intention. We assume the input sentence may contain terms related to database attributes, so we preprocess it to find all meaningful keywords, which are then used in a state called Reads Database Schema. This state searches for the keywords in the database schema, aiming to find related metadata. Right after that, the dimensions and metrics of interest are returned to the bot, so it can make appropriate suggestions to the user.
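To make the keyword matching step more concrete, here is a minimal Python sketch of the idea. It is not the Xatkit implementation: the schema contents, the stopword list, the naive singularization, and the function names `extract_keywords` and `read_database_schema` are all assumptions made for illustration.

```python
import re

# Hypothetical schema vocabulary; these names are illustrative only
# and are not taken from the real BIOD metadata.
SCHEMA = {
    "metrics": ["schools count", "schools average", "students count"],
    "dimensions": ["city name", "state name", "year"],
}

# Small, assumed stopword list for the preprocessing step.
STOPWORDS = {"show", "me", "per", "and", "the", "of", "by", "in", "a", "for"}


def extract_keywords(sentence):
    """Lowercase the sentence, split it into words, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in STOPWORDS]


def singular(word):
    """Very naive singularization, good enough for this toy vocabulary."""
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith("s"):
        return word[:-1]
    return word


def read_database_schema(keywords, schema=SCHEMA):
    """Return the metrics and dimensions that share a word with the keywords."""
    stems = {singular(k) for k in keywords}
    matches = {}
    for kind, names in schema.items():
        matches[kind] = [
            name for name in names
            if stems & {singular(w) for w in name.split()}
        ]
    return matches


keywords = extract_keywords("show me schools per cities and states")
matches = read_database_schema(keywords)
```

With this toy schema, `matches` holds the metrics and dimensions that the bot could offer as suggestions in the next step of the conversation.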
The bot suggests metrics and dimensions during the conversation, and all user choices are processed (temporarily stored as query parameters). The user can also filter data, i.e., search for specific values within a chosen dimension; if no filter is specified, the chatbot considers all possible dimension values in its search. After deciding on a filter, the user also decides whether to include more dimensions in the search. Once all dimensions have been chosen, the state Query Execution is called next, which is responsible for actually accessing the database. After executing the query with the stored parameters, the chatbot receives the results and finally presents them to the user.
As an example, suppose that a user inputs the query “show me schools per cities and states”. After the database schema is read using the query keywords, the chatbot presents a list of related metrics such as “schools count”, “schools average”, and so on. The user chooses “schools count”. The bot then suggests a list of related dimensions, from which the user chooses “city name”, since they want to see the schools count per city. They prefer not to filter by city, but they include another dimension, “state name”, in the query, this time filtering by the American state “Texas”. The bot then stores all dimensional information as query parameters, executes the query, and returns an answer with the schools count in each city of Texas.
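The example conversation can be sketched in Python as follows. This is only an illustration of how the choices might be stored and then executed: the `QueryParameters` class, the `execute_query` function, and the in-memory rows (including their numbers) are invented for this sketch, and the real chatbot queries the database instead of a Python list.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class QueryParameters:
    """Choices collected during the conversation (hypothetical structure)."""
    metric: Optional[str] = None
    # dimension name -> allowed values; an empty list means "all values"
    dimensions: dict = field(default_factory=dict)

    def add_dimension(self, name, filter_values=None):
        self.dimensions[name] = list(filter_values or [])


# Made-up rows standing in for the database; the numbers are not real data.
ROWS = [
    {"city name": "Austin", "state name": "Texas", "schools count": 3},
    {"city name": "Austin", "state name": "Texas", "schools count": 2},
    {"city name": "Dallas", "state name": "Texas", "schools count": 4},
    {"city name": "Miami", "state name": "Florida", "schools count": 5},
]


def execute_query(params, rows):
    """Filter rows by the chosen values, then sum the metric per dimension tuple."""
    totals = defaultdict(int)
    for row in rows:
        keep = all(
            not allowed or row[dim] in allowed
            for dim, allowed in params.dimensions.items()
        )
        if keep:
            key = tuple(row[dim] for dim in params.dimensions)
            totals[key] += row[params.metric]
    return dict(totals)


# Replaying the conversation: metric first, then dimensions and filters.
params = QueryParameters()
params.metric = "schools count"
params.add_dimension("city name")              # no filter: all cities
params.add_dimension("state name", ["Texas"])  # filter by one state
result = execute_query(params, ROWS)
```

Keeping an empty filter list to mean "all values" mirrors the behavior described above, where the chatbot considers every value of a dimension when the user does not filter it.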
We tested this conversation flow with a multidimensional database called BIOD (Blended Integrated Open Data). BIOD contains Brazilian open data and is made available through a RESTful API, so it can be queried by passing dimensional parameters in the URL. We implemented the conversation flow using the Xatkit development framework, which allowed us to define training sentences corresponding to the database schema vocabulary. All dimensional information chosen by the user during the conversation with the bot is used to fill the query parameters in the URL.
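As a rough illustration of how the choices could be turned into a request URL, consider the Python sketch below. The base URL and the parameter names `metrics`, `dimensions`, and `filters` are placeholders chosen for this example; they are assumptions and do not describe the real BIOD API.

```python
from urllib.parse import urlencode

# Placeholder endpoint, NOT the real BIOD address.
BASE_URL = "https://example.org/api/v1/data"


def build_query_url(metric, dimensions, base_url=BASE_URL):
    """Encode the chosen metric, dimensions, and filters as URL query parameters.

    `dimensions` maps each dimension name to a list of filter values;
    an empty list means the dimension is not filtered.
    """
    params = [
        ("metrics", metric),
        ("dimensions", ",".join(dimensions)),
    ]
    for name, values in dimensions.items():
        for value in values:
            # Hypothetical filter syntax: "<dimension>==<value>"
            params.append(("filters", f"{name}=={value}"))
    # urlencode takes care of escaping spaces and special characters.
    return f"{base_url}?{urlencode(params)}"


url = build_query_url(
    "schools count",
    {"city name": [], "state name": ["Texas"]},
)
```

Letting `urlencode` do the escaping avoids hand-building the query string, which matters here because dimension names and filter values may contain spaces.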
To assess whether we were on the right track, we performed an evaluation through an empirical user study. Twenty-one participants with different backgrounds interacted with the chatbot and posed a set of predefined questions until receiving an answer from BIOD. After that, each participant was asked to fill in an evaluation form covering four distinct categories of questions: Visibility (how adequate was the information displayed on the chat), Support (how assertive was the bot when providing guidance), Usefulness (user satisfaction with the received answers), and Simplicity (how intuitive and easy the interaction was). The form questions followed a linear scale, meaning that each participant chose an option from 1 to 5 to state how much they agreed with the current statement (the closer to 5, the better).
The results were encouraging: Visibility and Usefulness were the highest-scoring categories, meaning that the information shown in the chat was understandable, and users were able to compose a dimensional query even without knowing the database structure/metadata (which was our main goal). Also, most users completely agreed with sentences such as “I would use the chatbot again”, and “the bot gave me answers faster than if I had to search on my own”, demonstrating how important real-time answers are. Support and Simplicity obtained good scores too, although we had some limitations. Although most users were able to compose queries based on the bot's guidance, the bot still needs more flexibility, especially when receiving unexpected entries.
With human participation, we received very useful feedback for future improvement of the chatbot, e.g., “including more detailed explanations”, “including return options”, “dealing with typing errors”, and “including buttons to select the options in the chat” (many of the suggestions are on their way!). Most importantly, we were able to understand a little more about how the user feels when interacting with a virtual assistant, which can help us improve the user experience and provide context-aware answers. As the chatbot exempts the user from knowing database metadata and query languages such as SQL, we believe it has the potential to improve open data transparency for everyone who is looking for information.
Our results with the chatbot approach are described in the paper “Talk to your data: a chatbot system for multidimensional datasets”, which will be published at IEEE COMPSAC 2022 (check a preliminary version here). Besides me, the paper is co-authored by Marcos Didonet Del Fabro, Celio Trois, Luis Carlos Bona, Jordi Cabot, and Leon Gonçalves. The chatbot that queries BIOD data is available online, with updates and improvements made periodically. Currently, we’re working on an entity recognition task for the chatbot, as well as the inclusion of synonyms through WordNet, to better understand user intentions. Also, our objective is not only to enable a database query, but also to recommend databases based on similarity metrics. We still have a long way to go, but one step at a time we can bring the user closer to talking to data 🙂
You can also take a look at a presentation where we elaborate on this work:
Maria Helena is a PhD student at UFPR (Federal University of Paraná, Brazil), researching topics related to Open Data, Data Discovery, and Question-Answering systems.