Developing a system with multi-lingual capabilities in order to receive answer to user specific queries
Abhishek Gupta
Abstract
FAQ-based systems that respond to user-specific questions are common and popular. However, current systems are mostly limited to question-answer pairs in English. In addition, query text as users typically enter it introduces errors into the flow, which makes a successful transaction harder to complete. This proposal is limited to a two-language system in which the questions and answers can be in English or Bengali. The system should be able to identify the user's question among a set of available questions and deliver the appropriate response. A highly accurate system would enable the creation of FAQ-like content for various services using language modules, and this framework could then provide a seamless reading experience to users.
Additional Information
Project Timeline
| Time | Task |
| --- | --- |
| May 21 - June 10 | Implement the given system in English with ontology database |
| June 11 - June 18 | Building understanding about Bengali |
| June 19 - July 08 | Translate the given system to work in case of Bengali |
| July 9 | Midterm submissions |
| July 09 - July 22 | Expand the model to enable input for semi-structured data |
| July 23 - Aug 05 | Socialize the bot by adding interactions to the bot and maintaining contexts of the chat with the user |
| Aug 06 - Aug 23 | Finalize the code, implement an interface |
| Aug 24 | Submission |
Deliverables
A complete interface through which an administrator can upload a file (in the prescribed format) and, after a pre-processing step, ask questions to the virtual bot described here. In the backend, the NLP model will run to interact with the user, much like the Alice bot does. All code will be in Python with minimal dependencies.
Detailed Approach
This is one possible approach for the project, which I will refine as the work progresses. It involves a thorough study of the available literature and hands-on experimentation, which I am doing now and will continue throughout the project.
Step1 : Implement the given system in English with ontology database
- Find the most important keywords with respect to the question.
- Use a POS tagger on a set of questions.
- Observe which tags usually mark the keywords of a given question.
- Estimate a confidence level for the given keywords.
- On the basis of the keywords, map the question to a query that can be run against a given ontology database, for example a small database of questions pertaining to Ankur.
- Once the answer is retrieved, compose a sentence on the basis of the POS tags of the answer.
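The keyword-selection bullets above can be sketched roughly as follows. This is a minimal illustration, not the final design: the hand-tagged questions, the tag names, the function names `tag_confidence` and `extract_keywords`, and the 0.5 threshold are all assumptions made for the example, and a real POS tagger would supply the tags.

```python
from collections import Counter

# Hypothetical hand-labelled questions: (word, POS tag, is_keyword).
TAGGED_QUESTIONS = [
    [("what", "WP", False), ("is", "VBZ", False), ("ankur", "NNP", True)],
    [("who", "WP", False), ("founded", "VBD", True), ("ankur", "NNP", True)],
    [("where", "WRB", False), ("is", "VBZ", False),
     ("the", "DT", False), ("office", "NN", True)],
]


def tag_confidence(tagged_questions):
    """Estimate, per POS tag, the fraction of its words that were keywords."""
    seen, hits = Counter(), Counter()
    for question in tagged_questions:
        for _word, tag, is_keyword in question:
            seen[tag] += 1
            hits[tag] += int(is_keyword)
    return {tag: hits[tag] / seen[tag] for tag in seen}


def extract_keywords(tagged_question, confidence, threshold=0.5):
    """Keep the words whose tag confidence reaches the (assumed) threshold."""
    keywords = []
    for word, tag in tagged_question:
        score = confidence.get(tag, 0.0)  # unseen tags get zero confidence
        if score >= threshold:
            keywords.append((word, score))
    return keywords


confidence = tag_confidence(TAGGED_QUESTIONS)
new_question = [("what", "WP"), ("does", "VBZ"), ("ankur", "NNP"), ("publish", "VB")]
print(extract_keywords(new_question, confidence))
```

The per-tag ratio doubles as the confidence level attached to each keyword; with more labelled questions it could be smoothed so that rarely seen tags are not trusted too quickly.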
Step2 : Building understanding about Bengali
- Understand the basic linguistics of Bengali
- Improve domain knowledge so that it can be used for better results in the end system
Step3 : Translate the given system to work in case of Bengali
- Code a POS tagger for Bengali based on the Conditional Random Fields approach described at http://www.computing.dcu.ie/~rhaque/SNLP_rejwanul.pdf
- Build a small ontology database for Bangla
- Use the approach described in step 1 to get a basic Q/A system for Bangla
The method mentioned is similar to the "bag of words" approach, with extra weight given to certain words that are likely to be in the answer (based on an analysis of the distribution of POS tags).
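A weighted bag of words of this kind might look like the sketch below. The tag weights, the default weight, the tag names, and the helper names are assumptions chosen for illustration; in the project the weights would come from the POS-tag distribution analysis, and the tokens would come from the Bengali tagger rather than the English placeholders used here.

```python
from collections import Counter

# Hypothetical weights per POS tag; real values would be estimated from data.
TAG_WEIGHTS = {"NOUN": 1.0, "VERB": 0.6, "ADJ": 0.4}
DEFAULT_WEIGHT = 0.1  # assumed weight for every other tag


def weighted_bag(tagged_tokens):
    """Build a bag of words where each occurrence adds its tag weight."""
    bag = Counter()
    for word, tag in tagged_tokens:
        bag[word.lower()] += TAG_WEIGHTS.get(tag, DEFAULT_WEIGHT)
    return bag


def overlap_score(query_bag, faq_bag):
    """Sum the shared weight of the words present in both bags."""
    shared = set(query_bag) & set(faq_bag)
    return sum(min(query_bag[word], faq_bag[word]) for word in shared)


def best_match(tagged_query, tagged_faqs):
    """Return the id of the stored question with the highest overlap score."""
    query_bag = weighted_bag(tagged_query)
    return max(
        tagged_faqs,
        key=lambda faq_id: overlap_score(query_bag, weighted_bag(tagged_faqs[faq_id])),
    )


faqs = {
    "office": [("where", "ADV"), ("is", "VERB"), ("office", "NOUN")],
    "founder": [("who", "PRON"), ("is", "VERB"), ("founder", "NOUN")],
}
print(best_match([("office", "NOUN"), ("location", "NOUN")], faqs))
```

Because the scoring only needs tokens and tags, the same code would serve any language for which a tagger is available.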
Step4 : Expand the model to enable input for semi-structured data (on Bangla)
- Perform latent semantic analysis of the document, treating every sentence (updated) as a document in the model
- Find where the question lies relative to these sentence vectors by computing the correlation (in simple terms, the cosine between the vectors)
- If the question is most closely related to, say, line A, use A to answer the question
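The retrieval step in this list can be illustrated with plain term-count vectors and cosine similarity. Note that this sketch deliberately leaves out the dimensionality-reduction step (truncated SVD) that makes the method latent semantic analysis, so that it stays free of external dependencies; it only shows how each sentence is treated as its own document and how the question is compared against it. The sentences and function names are made up for the example.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a sentence into a sparse term-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_related_sentence(question, sentences):
    """Return (score, sentence) for the sentence closest to the question."""
    question_vector = vectorize(question)
    scored = [(cosine(question_vector, vectorize(s)), s) for s in sentences]
    return max(scored, key=lambda pair: pair[0])


# Hypothetical semi-structured input, one sentence per "document".
sentences = [
    "the library opens at nine",
    "books can be borrowed for two weeks",
]
print(most_related_sentence("when does the library open", sentences))
```

In a full LSA pipeline the sentence vectors and the question would first be projected into the reduced concept space, and the same cosine comparison would then be applied there.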
Step5 : Socialize the bot by adding interactions to the bot and maintaining contexts of the chat with the user
- While answering a question, estimate the confidence level of the answer
- If the confidence level of the answer is low, make the system ask intelligent questions related to the original question posed by the user
- On the basis of the user's reply, try to improve the confidence level
This approach should use the POS-tag-based method. For example, if there is a NOUN ADJECTIVE pair in the question, the system can ask the user "Do you really think NOUN is ADJECTIVE?" and, on parsing the boolean reply, associate the adjective with the noun in the model (whether LSA or bag of words).
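A rough sketch of this clarification loop is given below. The tag names, the 0.5 confidence threshold, the accepted yes/no words, and the dictionary used to store associations are all assumptions for illustration; the actual model update would depend on whether LSA or the bag-of-words model is in use.

```python
YES_WORDS = {"yes", "y", "true"}
NO_WORDS = {"no", "n", "false"}


def clarifying_question(tagged_question, confidence, threshold=0.5):
    """Return (noun, adjective, prompt) when confidence is low, else None."""
    if confidence >= threshold:
        return None
    nouns = [word for word, tag in tagged_question if tag == "NOUN"]
    adjectives = [word for word, tag in tagged_question if tag == "ADJ"]
    if not nouns or not adjectives:
        return None  # no NOUN ADJECTIVE pair to ask about
    noun, adjective = nouns[0], adjectives[0]
    return noun, adjective, f"Do you really think {noun} is {adjective}?"


def record_reply(associations, noun, adjective, reply):
    """Parse a yes/no reply and update the noun-to-adjectives mapping."""
    answer = reply.strip().lower()
    if answer in YES_WORDS:
        associations.setdefault(noun, set()).add(adjective)
        return True
    if answer in NO_WORDS:
        associations.get(noun, set()).discard(adjective)
        return False
    return None  # reply was not a recognisable boolean


associations = {}
question = [("is", "VERB"), ("python", "NOUN"), ("slow", "ADJ")]
noun, adjective, prompt = clarifying_question(question, confidence=0.2)
print(prompt)
record_reply(associations, noun, adjective, "Yes")
print(associations)
```

Returning `None` for an unparseable reply leaves room for the bot to repeat or rephrase the question instead of guessing.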
Notes :
Step 1 (and consequently 2) and Step 4 can be changed after a wider literature review of expert systems such as Siri, among other things.
Steps 2 and 4 might look a little non-complementary to the reader; a hybrid model of these is what I have in mind as of now. In case the ideal answer is subjective in nature, LSA is expected to give better results, while if the answer is objective in nature, approach two should work better.
A more detailed (and more dynamic) report can be found here. Named entity recognition can further improve the accuracy obtained with the POS tags, but I am not yet sure where it would fit in the timeline. After reading a research paper on NER using CRF and SVM, I strongly feel that we should use CRF for the same.
UPDATE : I strongly feel that we should work in a language-independent way ("bag of words" and "latent semantic analysis" can both work in a similar fashion) so that the system can be adopted by others if required. With the system, we can leave slots so that people can add language-dependent plugins such as a thesaurus.
IBM's statistical question answering system
- Treats the problem as information retrieval (given that they have a large database to search over, as in a typical information retrieval problem).
- Their system does not really answer the question; in a way, it fetches the nearest possible statements.
- To find the correlation they use techniques such as Matching Words, Thesaurus Match, Mis-Match Words, Dispersion, and Cluster Words.
Automated faq answering: Continued experience with shallow language understanding
- Does not analyze user queries; instead, analysis is applied to the FAQs in the database long before any user queries are submitted.
- The work of FAQ retrieval is reduced to keyword matching without inference; the system still creates an illusion of intelligence.
- Some more work in order to process phrases
- The approach can very well be adopted for step 1 of the project
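The idea of analysing the FAQs ahead of time and reducing retrieval to keyword matching can be illustrated with a small inverted index. This is only a sketch of the general idea, not the method of the paper: the stop-word list, the FAQ entries, and the function names are assumptions made for the example.

```python
import re
from collections import defaultdict

# Assumed stop-word list; a real one would be larger and language-specific.
STOPWORDS = {"how", "do", "i", "my", "the", "a", "is"}


def tokenize(text):
    """Lowercase the text and keep its non-stop-word tokens."""
    return [w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS]


def build_index(faqs):
    """Precompute keyword -> FAQ ids, before any user query arrives."""
    index = defaultdict(set)
    for faq_id, question in faqs.items():
        for word in tokenize(question):
            index[word].add(faq_id)
    return index


def lookup(index, query):
    """Rank FAQ ids by how many query keywords they share."""
    votes = defaultdict(int)
    for word in tokenize(query):
        for faq_id in index.get(word, ()):
            votes[faq_id] += 1
    return sorted(votes, key=lambda faq_id: (-votes[faq_id], faq_id))


# Hypothetical FAQ entries.
faqs = {
    1: "How do I reset my password?",
    2: "How do I change my email address?",
}
index = build_index(faqs)
print(lookup(index, "forgot password reset"))
```

All the analysis cost is paid once in `build_index`; answering a query is then a handful of dictionary lookups.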
Automated Question Answering Using Question Templates That Cover the Conceptual Model of the Database
This would help while working on step 5. The paper describes a system that, during the question-answering process, retrieves relevant data instances and question templates and offers one or several interpretations of the original question. The user then selects the interpretation to be answered.
Even though the project is research-oriented, a lot of work has already been done in this area at various places around the world. This demonstrates the feasibility of the project and gives confidence that it can keep to the proposed timeline and produce the deliverables mentioned. As promised, I have done a lot of research in the last 7 days, and I am now quite sure that I can start implementing as soon as the project begins. I would be more than happy to answer any questions or concerns regarding the project or the approach mentioned.
