The task we undertook was to build an AI system that could answer a series of questions based on legal transcripts, questions such as “Which bank was involved?”, “What is the family status of the plaintiff?”, and “What are the characteristics of the clearance decided?”. We were given 5 questions, 200 legal transcripts digitized through OCR (Optical Character Recognition), and, for each document, the text that answers the questions.

After analyzing the data, we found that the answers are consistently extracted from the original document, but not as a continuous, homogeneous span of text; rather, each answer is a combination of partial phrases from various paragraphs of the document. That meant we couldn’t build a system that simply finds the starting and ending point of each answer. Moreover, the very same phrase could well be part of the answer to one or more other questions. For example, the phrase “The plaintiff is married so half of his belongings will pass into the ownership of his wife…” answers both the question about family status and the corresponding question about the clearance.

This is why we decided to approach the task as a multi-label text classification problem. For each phrase, our model predicts the probability that it is part of the answer to each of the questions. All phrases predicted as answers to a specific question are concatenated to form the final answer passage for that question. A question may also have no answer in a document, in which case our model predicts ‘no answer’.
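The post-processing step described above, turning per-phrase probabilities into answer passages, can be sketched as follows. The function name and the 0.5 decision threshold are illustrative assumptions, not taken from the original write-up:

```python
# Turn per-phrase, per-question probabilities into answer passages.
# A phrase can contribute to several answers, and a question with no
# selected phrases gets the 'no answer' label.

def build_answers(phrases, probs, threshold=0.5):
    """phrases: list of N strings; probs: list of N score lists,
    one score per question. Returns one answer string per question."""
    num_questions = len(probs[0])
    answers = []
    for q in range(num_questions):
        selected = [p for p, scores in zip(phrases, probs) if scores[q] >= threshold]
        answers.append(" ".join(selected) if selected else "no answer")
    return answers
```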

In more detail, our initial model was a standard sequence model consisting of word embeddings, a bidirectional GRU RNN, a self-attention mechanism, and a fully connected layer with 5 outputs, one per question.
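A minimal PyTorch sketch of a baseline along these lines is shown below. All layer sizes and names are illustrative assumptions; only the overall architecture (embeddings, bidirectional GRU, self-attention, linear output per question) comes from the description above:

```python
import torch
import torch.nn as nn

class BaselineTagger(nn.Module):
    """Sketch: embeddings -> BiGRU -> additive self-attention -> 5 logits."""

    def __init__(self, vocab_size=10000, emb_dim=100, hidden=128, num_questions=5):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)           # per-token attention score
        self.out = nn.Linear(2 * hidden, num_questions)

    def forward(self, token_ids):                      # (batch, seq_len)
        h, _ = self.gru(self.emb(token_ids))           # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)   # (batch, seq_len, 1)
        context = (weights * h).sum(dim=1)             # weighted sum over tokens
        return self.out(context)                       # one logit per question
```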

Having this as our baseline model, we started experimenting with various possible improvements. You can find below what worked well and what didn’t work that great for us:

  1. We created domain-specific word embeddings based on the 30,000 similar unlabeled documents that were given to us. These word embeddings were much better than generic Greek embeddings.
  2. We tried 3 alternatives for word embeddings: don’t use pretrained embeddings at all; use pretrained embeddings and fine-tune them during training; or use pretrained embeddings without fine-tuning, keeping their weights frozen. The third alternative produced the best results, followed by the second and, lastly, the first. As Professor Manning says in his online NLP course, do not fine-tune the word embeddings if you don’t have enough training data.
  3. Multi-head attention helped; we used a number of heads equal to the number of output classes * 4.
  4. Adding another fully connected layer was also a good idea. Of course, the extra parameters required regularization, so we used both dropout with probability 0.5 and weight decay wherever possible.
  5. We couldn’t find a way to use the actual evaluation metric as a loss function. The evaluation metric was ROUGE-4, which counts the overlap of 4-grams between the actual answer and our prediction, so we used PyTorch’s BCEWithLogitsLoss instead, which combines a sigmoid activation with binary cross-entropy loss.
  6. We couldn’t resist the temptation to add special features based on specific words that we know characterize a phrase as an answer to a given question. For example, the words “child” and “kid” would most probably appear in a phrase about family status. We tried to keep these hand-crafted features diverse, as we knew that more, unknown questions would be added in the final evaluation phase, and we wanted our model to remain effective on new questions as well.
  7. We also tried to incorporate a Greek BERT model created by AUEB’s Natural Language Processing Group, but it increased the training time without significantly improving the results.
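The embedding alternatives from points 1 and 2 map directly onto PyTorch’s `nn.Embedding.from_pretrained` and its `freeze` flag. A small sketch, where the random matrix stands in for the domain-specific vectors trained on the 30,000 unlabeled documents:

```python
import torch
import torch.nn as nn

# The random matrix below is a stand-in for pretrained domain-specific
# word vectors (vocab_size x emb_dim).
pretrained = torch.randn(10000, 100)

# Best alternative in our experiments: pretrained weights, kept frozen.
frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)

# Second best: pretrained weights, fine-tuned along with the model.
finetuned = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)

# freeze=True simply turns off gradients for the embedding matrix.
assert not frozen.weight.requires_grad
assert finetuned.weight.requires_grad
```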

The final model, after the aforementioned improvements, is:
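A minimal sketch of a model combining these improvements is given below. It is an illustrative reconstruction, not the exact final architecture: it adds multi-head self-attention with 5 classes * 4 = 20 heads, a second fully connected layer, and dropout 0.5, with weight decay in the optimizer. The hidden size 150 is chosen only so that 2 * 150 = 300 is divisible by the 20 heads, as `nn.MultiheadAttention` requires:

```python
import torch
import torch.nn as nn

class ImprovedTagger(nn.Module):
    """Sketch: BiGRU -> 20-head self-attention -> 2 FC layers with dropout."""

    def __init__(self, vocab_size=10000, emb_dim=100, hidden=150, num_questions=5):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.mha = nn.MultiheadAttention(2 * hidden, num_heads=num_questions * 4,
                                         batch_first=True)
        self.drop = nn.Dropout(0.5)
        self.fc1 = nn.Linear(2 * hidden, hidden)
        self.fc2 = nn.Linear(hidden, num_questions)

    def forward(self, token_ids):
        h, _ = self.gru(self.emb(token_ids))
        attended, _ = self.mha(h, h, h)              # self-attention over tokens
        pooled = self.drop(attended.mean(dim=1))     # average-pool the sequence
        return self.fc2(self.drop(torch.relu(self.fc1(pooled))))

model = ImprovedTagger()
loss_fn = nn.BCEWithLogitsLoss()   # sigmoid + binary cross-entropy in one op
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```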

Wrapping it up, one has to point out that it’s very important to have a solid baseline model and to run many experiments to find out what works best. Furthermore, try to find a metric, a single number, that evaluates your model. It’s not always easy, but it’s crucial to be able to compare your models, especially after you have created hundreds of them.

So, if there is a takeaway for me, it’s that there isn’t a “silver bullet” that can kill the beast and solve any NLP problem; you’d better prepare yourself to try a great many techniques to get the job done!