50 commits
751cbbc  Create quiz01-response.md (Veranchos, Oct 31, 2018)
0165b0a  Update quiz01-response.md (Veranchos, Oct 31, 2018)
60281cf  Update quiz01-response.md (Veranchos, Oct 31, 2018)
0d30b24  add segmentation results (Veranchos, Nov 3, 2018)
2c161ac  Create segmentation-response.md (Veranchos, Nov 3, 2018)
a49a9ca  typo (Veranchos, Nov 3, 2018)
4bf4c9b  add tokenization results (Veranchos, Nov 4, 2018)
b5cb3d0  Merge branch 'master' of https://github.com/Veranchos/ftyers.github.io (Veranchos, Nov 4, 2018)
40cb613  Create tokenization-response.md (Veranchos, Nov 4, 2018)
ae8cffa  added practical 2 (better be late than never :) (Mar 27, 2019)
f674581  Update transliterate.py (Veranchos, Mar 27, 2019)
ec70af9  Create transliteration-response.md (Veranchos, Mar 27, 2019)
9ab9f1c  added practical 3 (Mar 27, 2019)
07e591e  Merge branch 'master' of https://github.com/Veranchos/ftyers.github.io (Mar 27, 2019)
18ad6c5  added part of practical 3 (Mar 27, 2019)
68af195  Create Unigram_model-response.md (Veranchos, Mar 27, 2019)
5376a10  practical 3 update (Veranchos, Mar 28, 2019)
0e18962  changed number of symbols in a floating numbers (Veranchos, Mar 28, 2019)
84881fe  Update Unigram_model-response.md (Veranchos, Mar 28, 2019)
5e38afd  Update Unigram_model-response.md (Veranchos, Mar 28, 2019)
89c48d7  Update Unigram_model-response.md (Veranchos, Mar 28, 2019)
5c81d35  Update Unigram_model-response.md (Veranchos, Mar 28, 2019)
64e7f53  typo (Veranchos, Mar 28, 2019)
ea4ad61  added quiz 2 (Mar 28, 2019)
d560b4b  Merge branch 'master' of https://github.com/Veranchos/ftyers.github.io (Mar 28, 2019)
11665ad  added practical 4 (Veranchos, Mar 28, 2019)
fc9c7f8  updated quiz-2 (Veranchos, Mar 30, 2019)
decd3b0  Add files via upload (Veranchos, Mar 30, 2019)
c56365c  Update quiz-2.md (Veranchos, Mar 30, 2019)
2fb0680  Create pluralize.py (Veranchos, Mar 30, 2019)
4e7ace6  Update quiz-2.md (Veranchos, Mar 30, 2019)
5030de2  Create quize-3.md (Veranchos, Mar 30, 2019)
da6d766  unigran-tagger report update (Veranchos, Mar 30, 2019)
38b6884  added practical 5 (Veranchos, Mar 30, 2019)
bf84d76  Merge branch 'master' of https://github.com/Veranchos/ftyers.github.io (Veranchos, Mar 30, 2019)
e53b41d  add report on practical 5 (Veranchos, Mar 30, 2019)
c512840  udpipe practical update (Veranchos, Mar 31, 2019)
c951160  Update practical5-report.md (Veranchos, Mar 31, 2019)
5e0d188  typo (Veranchos, Mar 31, 2019)
c57d49c  Update Unigram-pos-tagger-report.md (Veranchos, Apr 1, 2019)
470c543  typo (Veranchos, Apr 2, 2019)
d852d7e  Update practical5-report.md (Veranchos, Apr 2, 2019)
f9f4e43  changed file name (Veranchos, Apr 2, 2019)
2eb4ff3  Rename practical5-report.md to practical5-response.md (Veranchos, Apr 2, 2019)
4bcedf6  modified practical 4 (Apr 2, 2019)
04f7670  slightly changed practical 4 files (Apr 2, 2019)
378de30  Update Unigram-pos-tagger-response.md (Veranchos, Apr 2, 2019)
98867f0  Update Unigram-pos-tagger-response.md (Veranchos, Apr 2, 2019)
234c1e1  added code for prac4 (Apr 2, 2019)
dd94581  Merge branch 'master' of https://github.com/Veranchos/ftyers.github.io (Apr 2, 2019)
50 changes: 50 additions & 0 deletions 2018-komp-ling/practicals/Practical 1/50random.txt

Large diffs are not rendered by default.

29 changes: 29 additions & 0 deletions 2018-komp-ling/practicals/Practical 1/Segmentator 1.py
@@ -0,0 +1,29 @@
from nltk.tokenize import sent_tokenize


def read_a_file(file_path):
    file = open(file_path, 'r', encoding='UTF-8')
    data = file.read()
    file.close()
    return data


def segmentator(text):
    # split the text into sentences with NLTK's Punkt and write them
    # to the result file, one sentence per line
    sent_tokenized_list = sent_tokenize(text)
    f = open('nltk_punkt_result.txt', 'w', encoding='utf-8')
    for sentence in sent_tokenized_list:
        f.write(sentence + '\n')
    f.close()
    print(sent_tokenized_list[:10])
    return sent_tokenized_list


def main():
    text = read_a_file('50random.txt')
    return segmentator(text)


if __name__ == '__main__':
    main()
37 changes: 37 additions & 0 deletions 2018-komp-ling/practicals/Practical 1/Tokenization/Tokenisation.py
@@ -0,0 +1,37 @@
from maxmatch import maxmatch

# create a dictionary for the MaxMatch algorithm


def create_dict(filename):
    f = open(filename, 'r', encoding='UTF-8')
    text = f.read()
    f.close()
    dictionary = text.split('\n')
    return dictionary


def sentences_to_tokenize(filename):
    # read the test sentences, one per line
    f = open(filename, 'r', encoding='UTF-8')
    text = f.read()
    f.close()
    sentences = text.split('\n')
    return sentences


def main():
    dictionary = create_dict('dict.txt')
    sentences = [sentence.strip() for sentence in sentences_to_tokenize('japanese_texts.txt')]

    # all tokens are separated with commas
    tokenized_sentences = [', '.join(filter(lambda token: token != ',', maxmatch(sentence, dictionary)))
                           for sentence in sentences]

    results = open('tokenization_result.txt', 'w', encoding='UTF-8')
    results.write('\n'.join(tokenized_sentences))
    results.close()

    return tokenized_sentences


if __name__ == '__main__':
    main()


13 changes: 13 additions & 0 deletions 2018-komp-ling/practicals/Practical 1/Tokenization/maxmatch.py
@@ -0,0 +1,13 @@
def maxmatch(sentence, dictionary):
    # greedy longest-match-first tokenisation (Jurafsky & Martin)
    if len(sentence) == 0:
        return []

    # try the longest prefix first; fall back to a single character
    # when nothing longer is in the dictionary (the i == 1 case)
    for i in range(len(sentence), -1, -1):
        word = sentence[:i]
        remainder = sentence[i:]

        if word in dictionary or i == 1:
            return [word] + maxmatch(remainder, dictionary)



Large diffs are not rendered by default.

134 changes: 134 additions & 0 deletions 2018-komp-ling/practicals/Practical 1/nltk_punkt_result.txt

Large diffs are not rendered by default.

134 changes: 134 additions & 0 deletions 2018-komp-ling/practicals/Practical 1/pragmatic_segmenter_result.txt

Large diffs are not rendered by default.

36 changes: 36 additions & 0 deletions 2018-komp-ling/practicals/Practical 1/segmentation-response.md
@@ -0,0 +1,36 @@
# Segmentation
1. I downloaded a dump of Russian Wikipedia and extracted the texts with WikiExtractor. Then I picked 50 random paragraphs with this bash command:

`head -n 50000 wiki.txt | sort -R | head -n 50 > 50random.txt`

So, this was my [data](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%201/50random.txt).

2. I applied the Ruby pragmatic_segmenter to it:

`ruby -I . segmenter.rb < 50random.txt > pragmatic_segmenter_result.txt`

[That's](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%201/pragmatic_segmenter_result.txt) what I've got.

3. Then I wrote [my implementation using NLTK's Punkt.](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%201/Segmentator%201.py)

[Here](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%201/nltk_punkt_result.txt) are the results.

## Comparison of results
Both methods produced the same number of sentences (134), but the sentences themselves differ.

For instance, pragmatic_segmenter split the common Russian abbreviation "т. к." ("так как", i.e. "because") into separate sentences:
> Мясо барана разделывают по суставам, при этом следят за тем, чтобы в бараний желудок (гюзян) не попали острые части, т.
> к.
> они могут его повредить.

NLTK's segmenter handled that case, but it also had trouble with abbreviations: for example, it split on language abbreviations such as "тиб." (Tibetan) and "санскр." (Sanskrit).
>Херуками, например, являются тантрические божества Чакрасамвара (тиб.
>Демчог) и Вишуддха Херука (тиб.
>Яндаг Херука).

pragmatic_segmenter, in contrast, got this one right:
>Херуками, например, являются тантрические божества Чакрасамвара (тиб. Демчог) и Вишуддха Херука (тиб. Яндаг Херука).

## Evaluation

As both approaches produced exactly the same number of sentences on my data, a purely quantitative comparison cannot tell them apart. What is interesting is that they make different types of mistakes.
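
One quick way to surface those disagreements is a set comparison of the two result files. A minimal sketch (not part of the practical; it assumes both files hold one sentence per line):
```
# Sketch: list the sentences on which the two segmenters disagree.
def load_sentences(path):
    with open(path, encoding='UTF-8') as f:
        return set(line.strip() for line in f if line.strip())

punkt = load_sentences('nltk_punkt_result.txt')
pragmatic = load_sentences('pragmatic_segmenter_result.txt')

print('Only in the Punkt output:')
for sentence in sorted(punkt - pragmatic):
    print(' ', sentence)

print('Only in the pragmatic_segmenter output:')
for sentence in sorted(pragmatic - punkt):
    print(' ', sentence)
```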
20 changes: 20 additions & 0 deletions 2018-komp-ling/practicals/Practical 1/tokenization-response.md
@@ -0,0 +1,20 @@
# Tokenization

To apply the MaxMatch algorithm, you need a dictionary, so I made one.
First, I downloaded the Japanese GSD training CoNLL-U file. Then I processed it with these bash commands:
1. Delete all comments:
> sed '/^#/d' ja_gsd-ud-train.conllu > japanese_dict.txt
2. Delete all blank lines:
> sed '/^\s*$/d' japanese_dict.txt > japanese_dict_no_empty_lines.txt

You also need data to test on. Here is mine.
I downloaded the Japanese GSD test CoNLL-U file and extracted the raw texts from it with these bash commands:
1. Extract the text lines:
> sed -n '/^# text =/p' ja_gsd-ud-test.conllu > japanese_test_texts.txt
2. Delete the '# text = ' prefixes:
> sed 's/^# text =//' japanese_test_texts.txt > japanese_texts.txt


Then I wrote [Python code](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%201/Tokenization/Tokenisation.py) to tokenize the text with the [MaxMatch](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%201/Tokenization/maxmatch.py) algorithm (described in Jurafsky & Martin's book).
[Here](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%201/Tokenization/tokenization_result.txt) is the result. All the tokens are separated with commas.

33 changes: 33 additions & 0 deletions 2018-komp-ling/practicals/Practical 2/Transliteration_table.txt
@@ -0,0 +1,33 @@
А A
Б B
В V
Г G
Д D
Е E
Ё Yo
Ж Zh
З Z
И I
Й Y
К K
Л L
М M
Н N
О O
П P
Р R
С S
Т T
У U
Ф F
Х H
Ц TS
Ч Ch
Ш Sh
Щ Sch
Ъ -
Ы Y
Ь -
Э E
Ю Yu
Я Ya
21 changes: 21 additions & 0 deletions 2018-komp-ling/practicals/Practical 2/rank.py
@@ -0,0 +1,21 @@
import sys


# read a frequency list ("frequency<TAB>word" per line) from the file
# given as a command-line argument
freq = []

fd = open(sys.argv[1], 'r')
for line in fd.readlines():
    line = line.strip('\n')
    (f, w) = line.split('\t')
    freq.append((int(f), w))
fd.close()

# assign ranks, assuming the list is sorted by descending frequency;
# words with the same frequency share a rank
rank = 1
prev_freq = freq[0][0]
ranks = []
for i in range(0, len(freq)):
    if freq[i][0] < prev_freq:
        rank = rank + 1
        prev_freq = freq[i][0]
    ranks.append((rank, freq[i][0], freq[i][1]))
    print('%d\t%d\t%s' % (ranks[i][0], ranks[i][1], ranks[i][2]))
58 changes: 58 additions & 0 deletions 2018-komp-ling/practicals/Practical 2/transliterate.py
@@ -0,0 +1,58 @@

def read_text(filename):
    f = open(filename, 'r', encoding='UTF-8')
    text = f.read()
    text = text.replace('\ufeff', '')

    return text


def write_new_text(filepath, text):
    new_text = open(filepath, 'w+', encoding='UTF-8')
    new_text.write(text)
    new_text.close()
    return new_text


def create_map(alphabets):
    lines = alphabets.split('\n')
    list_of_letters = []
    list_of_matches = []
    for line in lines:
        low_case = line.lower()  # because all letters in our table are in upper case
        list_of_letters.append(low_case)
        list_of_letters.append(line)
    for line in list_of_letters:
        letters = line.split('\t')
        list_of_matches.append(letters)
    matches = dict(list_of_matches)

    return matches


def transliterate(text, matches):
    transliterated_text = ''
    for letter in text:
        if letter in matches:
            if matches[letter] == '-':  # for Ъ and Ь
                transliterated_text += ''
            else:
                transliterated_text += matches[letter]
        else:
            transliterated_text += letter

    return transliterated_text


def main():
    alphabets = read_text('Transliteration_table.txt')
    matches = create_map(alphabets)
    text_to_transliterate = read_text('Text_to_transliterate.txt')
    transliterated_text = transliterate(text_to_transliterate, matches)
    print(transliterated_text)

    return write_new_text('Transliterated_text.txt', transliterated_text)


if __name__ == '__main__':
    main()
29 changes: 29 additions & 0 deletions 2018-komp-ling/practicals/Practical 2/transliteration-response.md
@@ -0,0 +1,29 @@
# Practical 2
Here are responses to some of the questions from Practical 2:
>You'll note that the code does not print out the frequency list in order. Which Unix command might you use to sort the output in frequency order ?

```sort -nr```

>What do you think we would get if we set the argument reverse to False ?

We'd get an ascending sorted list.

First, I made [rank.py](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%202/rank.py), which takes a frequency-list file as a command-line argument, reads it in, and outputs a ranked frequency list.
Then I [implemented the transliteration algorithm](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%202/transliterate.py) described in the task.

>What to do with ambiguous letters ? For example, Cyrillic `е' could be either je or e.

Russian _е_ becomes _je_ after _ь_, _ъ_ and vowels, and at the beginning of a word; it becomes _e_ in all other cases. So we can add rules to disambiguate it, applied either before or after transliteration. For example, after transliteration we can look at every _e_ in the transliterated text and replace it with _je_ when it follows one of [AOIEUaoieu'].
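
A minimal sketch of that post-processing rule (the function name and the exact character class are my assumptions):
```
import re

# Sketch: rewrite 'e' as 'je' when it follows a transliterated vowel
# or the apostrophe sometimes used for the soft/hard signs. The
# character class is an assumption, not part of the practical.
def fix_e(transliterated):
    return re.sub(r"(?<=[AOIEUaoieu'])e", "je", transliterated)

print(fix_e('poezd'))  # -> pojezd
```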

>Can you think of a way that you could provide mappings from many characters to one character ?
>For example sh → ш or дж → c ?

Maybe we can first go through the whole text and replace the multi-character sequences (_sh_ with _ш_, _дж_ with _j_, and so on), and then make a second pass to replace the remaining single letters; see the sketch below.
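
A sketch of that two-pass idea with a toy mapping (the mappings and names here are illustrative assumptions, not the full table):
```
# Sketch: handle many-to-one mappings by replacing the longest
# sequences first, then the remaining single characters.
MULTI = {'sh': 'ш', 'ch': 'ч'}
SINGLE = {'a': 'а', 'b': 'б', 'v': 'в', 'r': 'р'}

def detransliterate(text):
    # first pass: multi-character sequences, longest first
    for seq, letter in sorted(MULTI.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(seq, letter)
    # second pass: the remaining single characters
    for char, letter in SINGLE.items():
        text = text.replace(char, letter)
    return text

print(detransliterate('shvabra'))  # -> швабра
```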
>How might you make different mapping rules for characters at the beginning or end of the string ?

In the case of Russian-to-English transliteration, I guess the only troublesome word-initial letter is _е_. We can write something like:
```
if word.startswith('е'):
    ...
```
If we are afraid of losing capitalisation at the beginning of a sentence, we could add the capitalised variants to our matches table, or just implement the simple rule I described above.
85 changes: 85 additions & 0 deletions 2018-komp-ling/practicals/Practical 3/Unigram_model-response.md
@@ -0,0 +1,85 @@
# Practical 3
## Functions
Output (first 10 lines) from running the [script](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%203/palindrome.py) for finding palindromes
on the freq.txt file from the previous practical:
```
3439 как
1669 или
1254 еще
1085 ее
283 тут
279 тот
190 оно
90 XX
80 11
71 XIX
```
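
palindrome.py is linked above rather than shown; the core check can be as small as this sketch (assuming freq.txt holds frequency<TAB>word lines, as in the previous practical):
```
# Sketch of the palindrome filter over the frequency list.
with open('freq.txt', encoding='UTF-8') as f:
    for line in f:
        freq, word = line.strip().split('\t')
        lowered = word.lower()
        if len(lowered) > 1 and lowered == lowered[::-1]:
            print(freq, word)
```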

## Implementing n-dimensional matrices with a dict

I implemented the code from the task and got this output:
```
a absorbed all and another
бы 0 0 0 0 0
вас 0 0 0 0 0
видит 0 0 0 0 0
всего 0 0 0 0 0
вы 0 0 0 0 0
```
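
The code itself is not reproduced in this diff; a minimal sketch of the idea (a two-dimensional "matrix" stored as a dict keyed by (row, column) tuples, with the labels taken from the output above):
```
# Sketch: a 2-D matrix faked with a dict keyed by (row, col) tuples.
rows = ['бы', 'вас', 'видит', 'всего', 'вы']
cols = ['a', 'absorbed', 'all', 'and', 'another']

counts = {}
for r in rows:
    for c in cols:
        counts[(r, c)] = 0

# Print the table; end='' keeps each row on a single line.
print('\t', end='')
print('\t'.join(cols))
for r in rows:
    print(r, end='')
    for c in cols:
        print('\t%d' % counts[(r, c)], end='')
    print()
```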
>Why do we need end='' passed to the print() statement ? What would happen if we didn't have it?

We need ```end=''``` to keep every element from being printed on its own line, because ```\n``` is the default ending of ```print()```.
Without it we get this table:
```
a absorbed all and another
бы
0
0
0
0
0

вас
0
0
0
0
0

видит
0
0
0
0
0

всего
0
0
0
0
0

вы
0
0
0
0
0

```
After saving [this](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%203/args.py) code and running it from the command line with
```$ python3 args.py a b c ```

I got this output:
```['args.py', 'a', 'b', 'c']```
-- the list of arguments passed on the command line, starting with the script name itself.

## Unigram language model
Here is my code: [train.py](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%203/train.py). It runs from the command line and takes two arguments: (1) the path to the input file and (2) the path to the output file. Mine are [test.txt](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%203/test.txt) and [res.txt](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%203/res.txt) respectively. The test file was taken from the [Russian SynTagRus corpus](https://github.com/UniversalDependencies/UD_Russian-SynTagRus/blob/master/ru_syntagrus-ud-test.conllu).
```
$ python3 train.py test.txt res.txt
```
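
train.py itself is only linked above, not shown in this diff; a minimal sketch of a unigram trainer with the same command-line interface (the whitespace tokenisation and the output format are my assumptions):
```
# Sketch: train a unigram model as relative frequencies of tokens.
import sys
from collections import Counter

def train(in_path, out_path):
    with open(in_path, encoding='UTF-8') as f:
        tokens = f.read().split()  # naive whitespace tokenisation
    counts = Counter(tokens)
    total = sum(counts.values())
    with open(out_path, 'w', encoding='UTF-8') as out:
        for token, count in counts.most_common():
            out.write('%s\t%.6f\n' % (token, count / total))

if __name__ == '__main__':
    train(sys.argv[1], sys.argv[2])
```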
>What might be a simple improvement to the language model for languages with orthographic case ?

I think a simple improvement might be lemmatization of the wordforms: we could count not only the wordforms themselves but also their lemmas, which would tell us more about the distribution of words.
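
For Russian this could be sketched with the pymorphy2 analyser (the library choice is my assumption; any lemmatiser would do):
```
# Sketch: count lemmas instead of surface forms with pymorphy2.
import pymorphy2
from collections import Counter

morph = pymorphy2.MorphAnalyzer()
tokens = ['кошка', 'кошки', 'кошек', 'Кошке']
lemmas = [morph.parse(token.lower())[0].normal_form for token in tokens]
print(Counter(lemmas))  # Counter({'кошка': 4})
```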
3 changes: 3 additions & 0 deletions 2018-komp-ling/practicals/Practical 3/args.py
@@ -0,0 +1,3 @@
import sys

print(sys.argv)