\chapter{Our Contribution}
In the previous chapters, we described the basics of the question answering task. Unfortunately, the models and datasets described so far work well only for English. We would like to be able to create such models for Czech as well. In this chapter, we describe how the previously described dataset and models can be reused for the Czech QA task.

\section{Common tools and data}
In this section, we describe the basic tools and the dataset we have used in both of the methods we have tried.

\subsection{Dataset}
We have downloaded the SQuAD 1.1 dataset from \cite{squadsource}. We have chosen SQuAD version 1.1 over version 2.0 because it is easier to predict: an answer always exists for each question. More about the SQuAD dataset can be found in \ref{chapter02}.

The structure of the dataset is as follows. There are two .json files. The first one, train-v1.1.json, contains all the training data: each context comes with several questions, and each question has one answer. The second one, dev-v1.1.json, is used as the evaluation dataset. The training set contains 87,599 questions and the development set 10,570 questions. The structure of this file is the same, with the exception that it was annotated manually, so there can be several answers for one question. The best matching answer is always chosen to be compared with the predicted one. This also allows small deviations, which is useful because an answer that is not 100\% identical to the predicted one can still be correct.

The structure of both data files is as follows. A \textit{data} tag contains the list of all articles. Each article has a \textit{title} tag and a list of paragraphs with the context in the \textit{paragraphs} tag. Each paragraph has its own list of questions and answers in the \textit{qas} tag, which consists of three tags: \textit{question}, \textit{id} (a unique identifier of each question for easy identification), and \textit{answers}. Each \textit{answers} collection consists of the text of the answer in the \textit{text} tag and the starting index of the answer in the context in the \textit{answer\_start} tag.

Basically, the structure looks like this:

\lstset{language=C}
\begin{lstlisting}
{ data [
    title
    paragraphs [{
      context
      qas [{
        answers [
          text
          answer_start
        ]
        question
        id
      }]
    }]
  ]
  version }
\end{lstlisting}
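To illustrate, the structure above can be traversed with a short Python sketch (the sample record below is ours, in the same shape as train-v1.1.json):

```python
import json

def iter_qas(squad):
    """Yield (title, context, question, answer_text, answer_start)
    tuples from a SQuAD-style dict with the structure shown above."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield (article["title"], paragraph["context"],
                           qa["question"], answer["text"],
                           answer["answer_start"])

# A real file would be loaded with:
#   with open("train-v1.1.json", encoding="utf-8") as f:
#       squad = json.load(f)
# Here we use a tiny inline example in the same shape:
squad = {
    "version": "1.1",
    "data": [{
        "title": "Prague",
        "paragraphs": [{
            "context": "Prague is the capital of the Czech Republic.",
            "qas": [{
                "id": "q1",
                "question": "What is the capital of the Czech Republic?",
                "answers": [{"text": "Prague", "answer_start": 0}],
            }],
        }],
    }],
}

for title, context, question, text, start in iter_qas(squad):
    # answer_start points at the answer inside the context
    assert context[start:start + len(text)] == text
```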


\subsection{Translation of the data}
We have translated the SQuAD dataset from English to Czech. For that, we have used the LINDAT Translator, the best freely available translator between Czech and English, developed at the Institute of Formal and Applied Linguistics at the Faculty of Mathematics and Physics of Charles University. More about this translator can be found in \cite{lindat}.

The translation introduces noise into the dataset. In the English dataset, the answer in the text was always identical to the text in the answer tag; after translation, the answer in the text and the answer in the \textit{answers} tag may differ.

After translating the paragraphs, answers and questions, the start index had to be recomputed, as we need it during training. The problem is that we cannot use exact matching, for several reasons. The first one is that the translated answer does not have to match the text exactly. Instead, we go character by character and find the longest common substring using Algorithm~\ref{Alg:lcs}: we start with the whole text and compute the match between it and the translated answer, then systematically delete the first character until the string is empty, measuring the longest common substring at each step.

\begin{algorithm}[H]
	\caption{Finding the start index of the translated answer using the longest common substring}
	\label{Alg:lcs}
	\hspace*{\algorithmicindent} \textbf{Input:} \\
	$text$ = translated text \\
	$answer$ = translated answer \\
	$idx$ = original index of the middle of the answer \\

	\hspace*{\algorithmicindent} \textbf{Output:} \\
	$bestMatch$ = start index of the best answer\\

	\begin{algorithmic}
	\FOR{i = 0 to len($text$)}
	\STATE $lcs[i]$ = longestCommonSubstring($text$[i:len($text$)], $answer$)
	\ENDFOR
	\STATE $maxLcs$ = find maxima of $lcs$
	\IF{$maxLcs$ has only one item}
	\STATE return $maxLcs$
	\ELSE
	\FOR {i = 0 to len($maxLcs$)}
	\STATE $maxPos[i]$=$maxLcs[i].match$*(1-abs($idx$-$maxLcs[i].idx$)/len($text$))
	\ENDFOR
	\ENDIF
	\STATE return argmax($maxPos$)
	\end{algorithmic}

\end{algorithm}

The other problem is that the answer can occur several times in the text while only one occurrence is correct. We therefore assume that the sentences appear in the same order in English and in Czech, and we use the original answer position to find the new one. For each candidate answer we compute a score: the length of its longest common substring weighted by its position. To simplify the computation, the middle position of the answer is used. The closer the candidate position is to the original one, and the more similar the strings are, the higher the score. Finally, the candidate with the highest score is chosen and its starting index is used. If the starting index points to the middle of a word, it is moved to the beginning of that word.
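A possible Python sketch of Algorithm~\ref{Alg:lcs} together with the position-based tie-breaking described above (the names and details are our illustration, not the exact thesis code; moving the index to a word boundary is omitted):

```python
def longest_common_substring(a, b):
    """Length of the longest common substring of a and b (classic
    dynamic programming, O(len(a) * len(b)))."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def find_answer_start(text, answer, orig_idx):
    """Recompute the start index of the translated answer in the
    translated text: score every suffix of the text by its longest
    common substring with the answer; if several positions tie,
    weight them by their distance from the original middle index."""
    scores = [longest_common_substring(text[i:], answer)
              for i in range(len(text))]
    best = max(scores)
    candidates = [i for i, s in enumerate(scores) if s == best]
    if len(candidates) == 1:
        return candidates[0]
    # several equally good positions: prefer the one nearest orig_idx
    return max(candidates,
               key=lambda i: best * (1 - abs(orig_idx - i) / len(text)))
```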


To facilitate our work with the translated data, we have slightly modified the final .json file after translation. Two new tags were added into the \textit{answers} tag. The first one is \textit{answer\_end}, which is computed while recomputing the starting index; it points to the end of the last word of the answer in the text and was added to make visualization easier. The other one is \textit{answer\_match}, which holds the score of the match. See below.

\lstset{language=C}
\begin{lstlisting}
{ data [
    title
    paragraphs [{
      context
      qas [{
        answers [
          text
          answer_start
          answer_end
          answer_match
        ]
        question
        id
      }]
    }]
  ]
  version }
\end{lstlisting}
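The two extra tags can be attached with a helper like the following (a sketch; the match score itself comes from the matching step above, and how exactly the thesis stores it is an assumption here):

```python
def augment_answer(context, answer, match_score):
    """Add the answer_end and answer_match tags described above to a
    single answer dict. answer_end points just past the last word of
    the answer in the context, which makes visualization easier."""
    start = answer["answer_start"]
    end = start + len(answer["text"])
    # extend to the end of the current word if we stopped inside one
    while end < len(context) and not context[end].isspace():
        end += 1
    answer["answer_end"] = end
    answer["answer_match"] = match_score
    return answer
```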

We will now describe the most common translation mistakes. One of them is a changed word order, which confuses the system while recomputing the start index of the answer in the text; an example can be seen in Figure~\ref{Fig:img-order}. Another common mistake is caused by synonyms: the translator chooses two different Czech words in the question and in the answer for the same English word, see Figure~\ref{Fig:img-synonyms}. A further mistake is caused by the different properties of Czech and English, namely that Czech words are declined, see Figure~\ref{Fig:img-declination}. Another problem is the translation of numbers: they can be written as words and translated, or written as digits, which confuses the algorithm, see Figure~\ref{Fig:img-numbers}. The same problem occurs with names, see Figure~\ref{Fig:img-names}. Some of the deviations in the translations can be observed below.


\begin{figure}[H]
	\centering
	\includegraphics[width=140mm]{../img/example-order.png}
	\caption{Example of an answer selected by the algorithm where the word order differs between the text and the answer.}
	\label{Fig:img-order}
\end{figure}

\begin{figure}[H]
	\centering
	\includegraphics[width=140mm]{../img/example-declination.png}
	\caption{Example of an answer selected by the algorithm with different declination in the text and the answer.}
	\label{Fig:img-declination}
\end{figure}

\begin{figure}[H]
	\centering
	\includegraphics[width=140mm]{../img/example-synonyms.png}
	\caption{Example of an answer selected by the algorithm with synonyms in the text and the answer.}
	\label{Fig:img-synonyms}
\end{figure}

\begin{figure}[H]
	\centering
	\includegraphics[width=140mm]{../img/example-numbers.png}
	\caption{Example of an answer selected by the algorithm with non-translated numbers in the text and the answer.}
	\label{Fig:img-numbers}
\end{figure}

\begin{figure}[H]
	\centering
	\includegraphics[width=140mm]{../img/example-names.png}
	\caption{Example of an answer selected by the algorithm with partially translated names in the text and the answer.}
	\label{Fig:img-names}
\end{figure}


After translating all the data and successfully recomputing the start and end indices, we had to analyze the data and observe how good the translation was. For that, we have created graphs \ref{Fig:img-trainmatch} and \ref{Fig:img-devmatch} showing how successful the translation was. We can also observe that the numbers are very similar for both sets, which is good: the development set and the training set do not differ too much, so the mistakes are likely similar as well.

\begin{figure}[H]
	\centering
	\includegraphics[width=140mm]{../img/train-match.png}
	\caption{Plot of how much the translated answers match the answers in the text for the training set.}
	\label{Fig:img-trainmatch}
\end{figure}


\begin{figure}[H]
	\centering
	\includegraphics[width=140mm]{../img/dev-match.png}
	\caption{Plot of how much the translated answers match the answers in the text for the development set.}
	\label{Fig:img-devmatch}
\end{figure}

According to graphs \ref{Fig:img-trainmatch} and \ref{Fig:img-devmatch}, almost all translations have a match above 50\% in both the training and the development set. Moreover, only a few translations have a match below 80\%: if we keep only questions with a rounded match of at least 80\%, we still preserve almost 90\% of the data, which is enough to make good predictions. We can therefore discard the low-match answers so that they do not cause too much noise during training, as these answers are probably located incorrectly in the text. The exact values can be observed in Table~\ref{tab:match-after-trans}.

\begin{table}[h]
	\centering
	\begin{tabular}{| l | l | l |}
	\hline
	\textbf{Match} & \textbf{Train set size} & \textbf{Test set size} \\
	\hline
	\textbf{100\%} & 57.8\% & 58.0\% \\
	\hline
	\textbf{$\geq$ 90\%} & 78.9\% & 78.2\% \\
	\hline
	\textbf{$\geq$ 80\%} & 89.0\% & 88.6\% \\
	\hline
	\textbf{$\geq$ 70\%} & 94.7\% & 94.2\% \\
	\hline
	\textbf{$\geq$ 60\%} & 89.2\% & 98.0\% \\
	\hline
	\textbf{$\geq$ 50\%} & 99.8\% & 99.8\% \\
	\hline
	\end{tabular}
	\caption{Percentage of data with a match higher than a certain threshold.}
	\label{tab:match-after-trans}
\end{table}

To better observe the influence of the translation on the model results, we create several datasets, each containing only answers with a match greater than or equal to a certain value. We found that the best partitions are in steps of 5\%, so we create datasets with a match of exactly 100\% and of at least 95\%, 90\%, 85\% and 80\%, for both the training and the development set. We will use them for the training described in the following part. For completeness, the percentage of the original data preserved in the newly created files is shown in Table~\ref{tab:match-after-trans-files}. The numbers are slightly different now because of rounding deviations; we still have more than 85\% of the data preserved, which is enough.

\begin{table}[h]
	\centering
	\begin{tabular}{| l | l | l |}
	\hline
	\textbf{Match} & \textbf{Train set size} & \textbf{Test set size} \\
	\hline
	\textbf{100\%} & 50.6\% & 51.0\% \\
	\hline
	\textbf{$\geq$ 95\%} & 59.1\% & 59.2\% \\
	\hline
	\textbf{$\geq$ 90\%} & 71.2\% & 71.2\% \\
	\hline
	\textbf{$\geq$ 85\%} & 80.0\% & 78.9\% \\
	\hline
	\textbf{$\geq$ 80\%} & 85.1\% & 84.5\% \\
	\hline
	\end{tabular}
	\caption{Percentage of the original data preserved in the newly created files for each match threshold.}
	\label{tab:match-after-trans-files}
\end{table}
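Creating the thresholded files can be sketched as follows (assuming the \textit{answer\_match} tag from the previous subsection; the file naming is our illustration):

```python
import json

def filter_by_match(squad, threshold):
    """Return a copy of a SQuAD-style dict keeping only questions whose
    answers all have answer_match >= threshold; paragraphs and articles
    left with no questions are dropped."""
    out = {"version": squad.get("version"), "data": []}
    for article in squad["data"]:
        paragraphs = []
        for paragraph in article["paragraphs"]:
            qas = [qa for qa in paragraph["qas"]
                   if all(a.get("answer_match", 0.0) >= threshold
                          for a in qa["answers"])]
            if qas:
                paragraphs.append({"context": paragraph["context"],
                                   "qas": qas})
        if paragraphs:
            out["data"].append({"title": article["title"],
                                "paragraphs": paragraphs})
    return out

# e.g. the five thresholded training files described above:
# for t in (1.0, 0.95, 0.90, 0.85, 0.80):
#     with open("train-match-%d.json" % int(t * 100), "w") as f:
#         json.dump(filter_by_match(squad, t), f, ensure_ascii=False)
```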


\subsection{Machine for training}
For training and testing, we have used the Artificial Intelligence Cluster (AIC) \cite{aic}. AIC is a computational grid with sufficient computational capacity for research in the field of deep learning using both CPUs and GPUs. It is built on top of the SGE scheduling system. Bachelor's and Master's students of MFF can use it to run their experiments and learn the proper ways of grid computing in the process.
%4 processors Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz.

\section{BiDAF Model}
The BiDAF network is a model trained on the SQuAD dataset. It is a multi-level hierarchical process which, with the help of an attention mechanism, represents the context of words at several levels of granularity and limits the information loss in the context and relation representation. More about this model can be found in \ref{chapter03}.

As mentioned above, the goal of this work is to reuse this model to solve the task also in Czech. We have used the model from \cite{biattflowsource}. There are two main approaches, both based on machine translation between English and Czech. The first one is to translate the whole SQuAD dataset to Czech and train the model in Czech. The second one is to train the model in English, translate the Czech input to English, let the model produce the answer, and then translate it back to Czech.

\subsection{Translation of English data to Czech}
We have taken the whole SQuAD dataset and translated it from English to Czech. Then we have trained the model in Czech and tested its accuracy.

For the training, we had to replace the English embeddings used for the English dataset with Czech ones. We have used Czech embeddings created by RNDr. Milan Straka, Ph.D. with word2vec on 4 billion Czech words; there are 1.5 million embeddings.
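Assuming the embeddings come in the usual plain-text word2vec format (a header line with vocabulary size and dimension, then one word and its vector per line), a minimal reader might look like this sketch:

```python
def parse_word2vec_text(lines):
    """Parse plain-text word2vec embeddings: the first line holds
    "vocab_size dimension", every following line a word and its
    vector components separated by spaces."""
    it = iter(lines)
    vocab_size, dim = map(int, next(it).split())
    embeddings = {}
    for line in it:
        parts = line.rstrip("\n").split(" ")
        vector = [float(x) for x in parts[1:]]
        assert len(vector) == dim
        embeddings[parts[0]] = vector
    assert len(embeddings) == vocab_size
    return embeddings

# usage with a real embedding file (hypothetical file name):
# with open("czech-vectors.txt", encoding="utf-8") as f:
#     emb = parse_word2vec_text(f)
```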

After the translation, we have obtained 5 data files for training and 5 data files for testing, differing in the match between the answer in the text and the translated answer, as described above. To find the best combination of training and development data, we have trained the model on all possible combinations: 25 models with training and development sets containing answers with a match from 80\% to 100\%. The results can be seen in Tables \ref{tab:cz-results-acc} and \ref{tab:cz-results-f1}.

\begin{table}[]
	\begin{tabular}{|l|l|l|l|l|l|}
	\hline
	\textbf{Dev/Train} & \textbf{100\%}& \textbf{95\%}& \textbf{90\%}& \textbf{85\%}& \textbf{80\%} \\
	\hline
	\textbf{100\%}& 0.5703 & 0.5734 & 0.5884 & 0.5923 & 0.596 \\ \hline
	\textbf{95\%}& 0.5314 & 0.557 & 0.5717 & 0.5769 & 0.5771 \\ \hline
	\textbf{90\%}& 0.4825 & 0.5131 & 0.5478 & 0.5612 & 0.5314 \\ \hline
	\textbf{85\%}& 0.4572 & 0.4747 & 0.493 & 0.5286 & 0.5456 \\ \hline
	\textbf{80\%}& 0.4386 & 0.481 & 0.5076 & 0.5234 & 0.5343 \\ \hline
	\end{tabular}
	\caption{Accuracy after translating SQuAD to Czech: models trained and tested on the corresponding data files.}
\label{tab:cz-results-acc}
\end{table}

\begin{table}[]
	\begin{tabular}{|l|l|l|l|l|l|}
	\hline
	\textbf{Dev/Train} & \textbf{100\%}& \textbf{95\%}& \textbf{90\%}& \textbf{85\%}& \textbf{80\%} \\
	\hline
	\textbf{100\%}& 0.6501 & 0.6575 & 0.6749 & 0.6795 & 0.6789 \\ \hline
	\textbf{95\%}& 0.6235 & 0.6486 & 0.6627 & 0.6718 & 0.67371 \\ \hline
	\textbf{90\%}& 0.5856 & 0.6185 & 0.6519 & 0.664 & 0.6599 \\ \hline
	\textbf{85\%}& 0.5657 & 0.6083 & 0.6202 & 0.6464 & 0.6535 \\ \hline
	\textbf{80\%}& 0.5501 & 0.5988 & 0.6209 & 0.6388 & 0.6479 \\ \hline
	\end{tabular}
	\caption{F1 score after translating SQuAD to Czech: models trained and tested on the corresponding data files.}
	\label{tab:cz-results-f1}
\end{table}

If we compare all these results in the graphs with accuracy (\ref{Fig:graph-acc}) and F1 score (\ref{Fig:graph-f1}), we can see that the best results were obtained with the training set containing answers with a match greater than or equal to 80\% and the development set restricted to a 100\% match between the answer and the answer in the text. The results are logical: a match above 80\% does not bring much noise into the dataset, and we have much more data than with a training set restricted to a 100\% match, so the model can answer more questions. The reason why the development set works best with 100\%-match answers is that the other development sets are noisy and it is harder for the model to predict them.

\begin{figure}[H]
	\centering
	\includegraphics[width=140mm]{../img/transl_cz_accuracy.png}
	\caption{Comparison of accuracy for all combinations of training and development sets.}
	\label{Fig:graph-acc}
\end{figure}

\begin{figure}[H]
	\centering
	\includegraphics[width=140mm]{../img/transl_cz_f1.png}
	\caption{Comparison of F1 score for all combinations of training and development sets.}
	\label{Fig:graph-f1}
\end{figure}

\subsection{Translation of Czech input into English}
We have trained the bi-att-flow model in English; it took 124 hours, 29 minutes and 31 seconds on CPU. Then we have taken the Czech test dataset, translated it into English and run it through the model. Immediately after that, we have measured the accuracy with the following results, see Table~\ref{tab:cz-en-results}.
\begin{table}[h]
	\centering
	\begin{tabular}{| l | l |}
	\hline
	\textbf{Accuracy} & \textbf{F1} \\
	\hline
	0.5439 & 0.6758 \\
	\hline
	\end{tabular}
	\caption{Results after translating the Czech dataset to English for evaluation; accuracy computed on this dataset without translating back to Czech.}
	\label{tab:cz-en-results}
\end{table}

To make the evaluation complete, we have taken the English answers and translated them back to Czech. Then we have measured the similarity between the original and the newly obtained Czech answers, with the results in Table~\ref{tab:cz-en-cz-results}.

\begin{table}[h]
	\centering
	\begin{tabular}{| l | l |}
	\hline
	\textbf{Accuracy} & \textbf{F1} \\
	\hline
	0.6208 & 0.7036 \\
	\hline
	\end{tabular}
	\caption{Results after translating the Czech dataset to English for evaluation, then translating the answers back to Czech and computing accuracy on the Czech dataset.}
	\label{tab:cz-en-cz-results}
\end{table}

As we do not have any Czech training or testing dataset for evaluation, we have translated the whole development dataset and chosen only those questions and answers where the match between the translated answer and the answer in the text was more than 95\%. The match was computed in the same way as in the previous paragraph; the algorithm is described in \ref{Alg:lcs}.
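The similarity between the original and back-translated answers can be measured with the standard SQuAD metrics, exact match and token-level F1; a sketch (the usual normalization, where the English article removal is effectively a no-op for Czech):

```python
import re
import string
from collections import Counter

def normalize(s):
    """Usual SQuAD answer normalization: lowercase, drop punctuation
    and English articles, squeeze whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```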

\subsection{Summary of results}

To select the ideal model for the Czech question answering task, we compare the best result of the model trained in Czech, the best result of the English model with translated Czech input, and the original English model on English data. The best results were obtained with the English model trained on the English dataset, translating the Czech input into English and then translating the English output back into Czech; see Table~\ref{tab:all-results}.

We think the English model is better because we have more data and there is no noise caused by the translation. Moreover, the LINDAT translator was built so that when a sentence is translated from English to Czech and then back to English, it tries to return the original sentence, so the loss is minimal.

\begin{table}[h]
	\centering
	\begin{tabular}{| l | l | l |}
	\hline
	\textbf{Model} & \textbf{Accuracy} & \textbf{F1} \\
	\hline
	English & 0.6422 & 0.7529 \\
	\hline
	CZ & 0.596 & 0.6789 \\
	\hline
	CZ-EN & 0.5439 & 0.6758 \\
	\hline
	CZ-EN-CZ & 0.6208 & 0.7036 \\
	\hline
	\end{tabular}
	\caption{Comparison of the results of all models.}
	\label{tab:all-results}
\end{table}


To sum it up, the best model for Czech achieves an accuracy of 62.08\% and an F1 score of 70.36\%, while the English model achieves an accuracy of 64.22\% and an F1 score of 75.29\%. The accuracy is only about 2\% lower and the F1 score only about 5\% lower, which is not such a big difference.

However, the bi-att-flow model itself is not very strong, with an accuracy of only about 64\% and an F1 score of only about 75\% on the English dataset. For that reason, we have tried to reuse another model that gives accuracy TODO and F1 score TODO for English.



\section{Transformer model}
The following table summarizes the preliminary results of the transformer (BERT) experiments. Exact match and F1 are measured on the development set, duration is the training time, and the last two columns show the results when the model is tested on Czech data; missing values were not measured.

\begin{table}[h]
	\centering
	\begin{tabular}{| l | l | l | l | l | l |}
	\hline
	\textbf{Model} & \textbf{Exact} & \textbf{F1} & \textbf{Duration} & \textbf{Exact (CZ)} & \textbf{F1 (CZ)} \\
	\hline
	output\_enbert\_ensquad\_2epochs (2 epochs) & 80.69 & 88.16 & 04:50:42 & 71.56 & 83.10 \\
	\hline
	output\_enbert\_ensquad\_3epochs (3 epochs) & 80.03 & 87.80 & 04:59:10 & & \\
	\hline
	output\_en\_multi\_cased\_ensquad & 81.50 & 88.75 & & 72.53 & 83.93 \\
	\hline
	output\_en\_multi\_uncased\_ensquad & 81.86 & 89.12 & 05:28:54 & 72.70 & 84.23 \\
	\hline
	output\_cz\_multi\_cased\_czsquad & 64.55 & 73.66 & 03:27:08 & 64.55 & 73.66 \\
	\hline
	output\_cz\_multi\_uncased\_czsquad & 66.58 & 75.60 & 02:57:45 & 66.58 & 75.60 \\
	\hline
	output\_cz\_multi\_uncased\_czsquad\_train100 & 68.45 & 75.99 & & & \\
	\hline
	output\_cz\_multi\_cased\_czsquad\_train100 & 66.15 & 74.28 & & & \\
	\hline
	\end{tabular}
	\caption{Preliminary results of the transformer models.}
	\label{tab:transformer-results}
\end{table}