|May 21 - June 10
||Implement the given system in English with ontology database
|June 11 - June 18
||Building understanding about Bengali
|June 19 - July 08
||Translate the given system to work in case of Bengali
|July 09 - July 22
||Expand the model to enable input for semi-structured data
|July 23 - Aug 05
||Socialize the bot by adding interactions to the bot and maintaining contexts of the chat with the user
|Aug 06 - Aug 23
||Finalize the code, implement an interface
A complete interface where an administrator can upload a file (in the prescribed format) and, after a pre-processing step, ask questions to the virtual bot described here. In the backend, the NLP model will run to interact with the user, much like the ALICE bot does. All code will be in Python with minimal dependencies.
This is one possible approach for the project, which I will keep refining as the work progresses. It involves a thorough study of the available literature and experimentation — something I am already doing now and will continue throughout the project.
Step 1: Implement the given system in English with an ontology database
Find the most important keywords with respect to the question.
Use a POS tagger on a set of questions
Observe which tags are usually the keywords for a given question
Estimate a confidence level for the given keywords
On the basis of the keywords, map the question to a query that can be run on a given ontology database. For example, a small database answering questions pertaining to Ankur.
Once the answer is retrieved, construct a sentence on the basis of the POS tags of the answer.
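The keyword-extraction idea above can be sketched as follows. This is a minimal illustration, not the real implementation: a real POS tagger (e.g. NLTK's) would replace the toy `TAG_LEXICON`, and the keyword tag set and confidence estimate are assumptions of mine.

```python
# Toy stand-in for a real POS tagger; in practice NLTK's pos_tag (or similar)
# would supply the tags. The lexicon and rules below are illustrative only.
TAG_LEXICON = {
    "what": "WP", "is": "VBZ", "the": "DT", "of": "IN",
    "ankur": "NNP", "goal": "NN", "when": "WRB", "founded": "VBD",
}

KEYWORD_TAGS = {"NN", "NNP", "NNS"}  # nouns tend to carry the question's topic

def extract_keywords(question):
    """Tag each token and keep the ones whose tag is usually a keyword."""
    tokens = question.lower().strip("?").split()
    tagged = [(t, TAG_LEXICON.get(t, "UNK")) for t in tokens]
    keywords = [t for t, tag in tagged if tag in KEYWORD_TAGS]
    # Crude confidence: the fraction of tokens we could tag at all.
    confidence = sum(tag != "UNK" for _, tag in tagged) / len(tagged)
    return keywords, confidence

keywords, conf = extract_keywords("What is the goal of Ankur?")
print(keywords)  # ['goal', 'ankur']
```

The extracted keywords would then be slotted into a query template for the ontology database; the confidence score feeds into Step 5.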
Step 2: Building understanding about Bengali
Step 3: Translate the given system to work in case of Bengali
Code a POS tagger for Bengali based on the Conditional Random Fields approach described at http://www.computing.dcu.ie/~rhaque/SNLP_rejwanul.pdf
Build a small ontology database for Bangla
Apply the approach described in Step 1 to get a basic Q/A system for Bangla
The method mentioned is similar to the "bag of words" approach, with weights assigned to the words that are likely to appear in the answer (based on an analysis of the distribution of POS tags)
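A weighted bag-of-words matcher of this kind might look like the sketch below. The tag weights here are placeholder assumptions, not values from the proposal; the real weights would come from the POS-tag distribution analysis mentioned above.

```python
# Illustrative weighted bag-of-words matcher; the weights are assumptions.
TAG_WEIGHT = {"NN": 2.0, "NNP": 3.0, "JJ": 1.5}  # nouns weighted highest
DEFAULT_WEIGHT = 0.5

def score(question_tagged, candidate_tokens):
    """Sum POS-based weights of question words found in a candidate answer."""
    candidate = set(t.lower() for t in candidate_tokens)
    return sum(TAG_WEIGHT.get(tag, DEFAULT_WEIGHT)
               for word, tag in question_tagged
               if word.lower() in candidate)

q = [("Ankur", "NNP"), ("goal", "NN")]
answers = [
    "Ankur promotes Bangla computing".split(),
    "The weather is pleasant today".split(),
]
best = max(answers, key=lambda a: score(q, a))
print(" ".join(best))  # the Ankur sentence wins
```

Because the scoring is language-independent (only the tagger and the weights are language-specific), the same code serves both the English and the Bangla pipelines.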
Step 4: Expand the model to enable input of semi-structured data (in Bangla)
Apply latent semantic analysis to the document, treating every sentence as a separate document in the model
Find where the question lies with respect to the vectors above by computing the correlation (in simple terms, the cosine similarity between the vectors)
If the question is most closely related to sentence A, use A to answer the question
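The cosine step can be sketched as below. Note this is a simplification: full LSA would first apply an SVD-based dimensionality reduction to the term-sentence matrix, which is omitted here; only the cosine matching over raw term vectors is shown, on an invented toy corpus.

```python
import math
from collections import Counter

# Term-vector cosine as a stand-in for full LSA (the SVD step is omitted).
def cosine(a, b):
    """Cosine of the angle between two bag-of-words vectors."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

sentences = [
    "ankur develops bangla software".split(),
    "the festival starts in april".split(),
]
question = "what software does ankur develop".split()
best = max(sentences, key=lambda s: cosine(question, s))
print(" ".join(best))  # the ankur sentence is closest
```

Note that "develop" and "develops" do not match here — a stemmer (or a language-dependent plugin slot, as suggested later) would fix that.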
Step 5: Socialize the bot by adding interactions and maintaining the context of the chat with the user
While answering a question, estimate the confidence level of the answer
If the confidence level of the answer is low, make the system ask intelligent questions related to the original question as posed by the user
On the basis of the answer to the above, try to improve the confidence level
This approach should use the POS-tags-based method. For example, if there is a NOUN ADJECTIVE pair in the question, the system can ask the user "Do you really think NOUN is ADJECTIVE?", and on parsing the boolean answer it can associate the adjective with the noun in the model (whether LSA or bag of words).
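The clarifying-question idea can be sketched as follows. The tagged input, the confidence threshold, and the placeholder answer string are all illustrative assumptions.

```python
# Sketch of the Step 5 clarifying-question loop; threshold is an assumption.
CONFIDENCE_THRESHOLD = 0.5

def clarifying_question(tagged_question):
    """If an ADJECTIVE NOUN pair is present, ask the user to confirm it."""
    for (w1, t1), (w2, t2) in zip(tagged_question, tagged_question[1:]):
        if t1 == "JJ" and t2 == "NN":          # adjective modifying a noun
            return f"Do you really think {w2} is {w1}?"
    return None

def answer(tagged_question, confidence):
    if confidence < CONFIDENCE_THRESHOLD:
        follow_up = clarifying_question(tagged_question)
        if follow_up:
            return follow_up                   # ask instead of guessing
    return "<best answer from the model>"      # placeholder for Step 1/4 output

q = [("is", "VBZ"), ("the", "DT"), ("new", "JJ"), ("interface", "NN")]
print(answer(q, confidence=0.3))  # Do you really think interface is new?
```

On a "yes", the adjective-noun association would be written back into the model (bag-of-words weights or LSA vectors), raising the confidence for the retry.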
Step 1 (and consequently Step 2) and Step 4 may change after a wider literature review of expert systems such as SIRI, among other things
Steps 2 and 4 might look somewhat non-complementary to the reader; a hybrid model of the two is what I have in mind as of now. In case the ideal answer is subjective in nature, LSA is expected to give better results; on the other hand, if the answer is objective in nature, the second approach should work better.
A more detailed (and more dynamic) report can be found here. Use of named entity recognition together with the POS tags can further improve the accuracy, but I am not sure where it would fit in the timeline. After reading research papers on NER using CRF and SVM, I strongly feel that we should use CRF for this.
UPDATE: I strongly feel that we should work in a language-independent way — both "bag of words" and "latent semantic analysis" can work in a similar fashion — so that the system can be adopted by others if required. We can also leave slots in the system so that people can add language-dependent plugins, such as a thesaurus.
IBM's statistical question answering system
- Treats the problem as information retrieval (given that they have a large database to search over, as in a typical information-retrieval problem).
- Their system does not really answer the question; rather, it fetches the nearest possible statements.
- To find the correlation, they use techniques such as Matching Words, Thesaurus Match, Mis-Match Words, Dispersion, and Cluster Words.
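Two of those features can be illustrated with a toy scorer: Matching Words reward overlap between the question and a candidate statement, while Mis-Match Words penalize question words absent from it. The weights below are my assumptions, not values from the IBM paper, and the remaining features (Dispersion, Cluster Words, Thesaurus Match) are not modeled here.

```python
# Toy illustration of two IBM features; weights are assumptions.
MATCH_WEIGHT, MISMATCH_PENALTY = 1.0, 0.5

def ir_score(question_words, candidate_words):
    """Reward matching words, penalize question words missing from candidate."""
    q, c = set(question_words), set(candidate_words)
    matches = len(q & c)
    mismatches = len(q - c)
    return MATCH_WEIGHT * matches - MISMATCH_PENALTY * mismatches

print(ir_score({"ankur", "goal"}, {"ankur", "promotes", "bangla"}))  # 0.5
```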
Question Answering from Frequently Asked Question Files: Experiences with the FAQ FINDER System
This paper is about FAQ FINDER, a natural language question-answering system that uses files of frequently asked questions as its knowledge base. Unlike information-retrieval approaches that rely on a purely lexical metric of similarity between query and document, it uses a semantic knowledge base (WordNet) to improve its ability to match questions and answers.
Since no such semantic knowledge base yet exists (in a usable form) for Bangla, we should, as I suggested, work in a language-independent fashion initially, and implement things so that tools like the above can be hot-plugged without any issue.
Automated FAQ answering: Continued experience with shallow language understanding
- Does not analyze user queries; instead, analysis is applied to the FAQs in the database long before any user queries are submitted.
- The work of FAQ retrieval is reduced to keyword matching without inference; the system still creates an illusion of intelligence.
- Some additional work is needed in order to process phrases.
- This approach can readily be adopted for Step 1 of the project.
Automated Question Answering Using Question Templates That Cover the Conceptual Model of the Database
This would help while working on Step 5. The paper describes how, during the question-answering process, the system retrieves relevant data instances and question templates and offers one or several interpretations of the original question; the user then selects an interpretation to be answered.
Even though the project is research oriented, a lot of work has already been done in this regard at various places around the world, demonstrating the feasibility of the project and giving me confidence that it can abide by the proposed timelines and deliver what is promised. As promised, I have done a lot of research study over the last 7 days, and I am now quite sure that I can start implementing things as soon as the project begins. I would be more than happy to answer any questions or concerns regarding the project or the approach described.