50 commits
751cbbc  Create quiz01-response.md (Veranchos, Oct 31, 2018)
0165b0a  Update quiz01-response.md (Veranchos, Oct 31, 2018)
60281cf  Update quiz01-response.md (Veranchos, Oct 31, 2018)
0d30b24  add segmentation results (Veranchos, Nov 3, 2018)
2c161ac  Create segmentation-response.md (Veranchos, Nov 3, 2018)
a49a9ca  typo (Veranchos, Nov 3, 2018)
4bf4c9b  add tokenization results (Veranchos, Nov 4, 2018)
b5cb3d0  Merge branch 'master' of https://github.com/Veranchos/ftyers.github.io (Veranchos, Nov 4, 2018)
40cb613  Create tokenization-response.md (Veranchos, Nov 4, 2018)
ae8cffa  added practical 2 (better be late than never :) (Mar 27, 2019)
f674581  Update transliterate.py (Veranchos, Mar 27, 2019)
ec70af9  Create transliteration-response.md (Veranchos, Mar 27, 2019)
9ab9f1c  added practical 3 (Mar 27, 2019)
07e591e  Merge branch 'master' of https://github.com/Veranchos/ftyers.github.io (Mar 27, 2019)
18ad6c5  added part of practical 3 (Mar 27, 2019)
68af195  Create Unigram_model-response.md (Veranchos, Mar 27, 2019)
5376a10  practical 3 update (Veranchos, Mar 28, 2019)
0e18962  changed number of symbols in a floating numbers (Veranchos, Mar 28, 2019)
84881fe  Update Unigram_model-response.md (Veranchos, Mar 28, 2019)
5e38afd  Update Unigram_model-response.md (Veranchos, Mar 28, 2019)
89c48d7  Update Unigram_model-response.md (Veranchos, Mar 28, 2019)
5c81d35  Update Unigram_model-response.md (Veranchos, Mar 28, 2019)
64e7f53  typo (Veranchos, Mar 28, 2019)
ea4ad61  added quiz 2 (Mar 28, 2019)
d560b4b  Merge branch 'master' of https://github.com/Veranchos/ftyers.github.io (Mar 28, 2019)
11665ad  added practical 4 (Veranchos, Mar 28, 2019)
fc9c7f8  updated quiz-2 (Veranchos, Mar 30, 2019)
decd3b0  Add files via upload (Veranchos, Mar 30, 2019)
c56365c  Update quiz-2.md (Veranchos, Mar 30, 2019)
2fb0680  Create pluralize.py (Veranchos, Mar 30, 2019)
4e7ace6  Update quiz-2.md (Veranchos, Mar 30, 2019)
5030de2  Create quize-3.md (Veranchos, Mar 30, 2019)
da6d766  unigran-tagger report update (Veranchos, Mar 30, 2019)
38b6884  added practical 5 (Veranchos, Mar 30, 2019)
bf84d76  Merge branch 'master' of https://github.com/Veranchos/ftyers.github.io (Veranchos, Mar 30, 2019)
e53b41d  add report on practical 5 (Veranchos, Mar 30, 2019)
c512840  udpipe practical update (Veranchos, Mar 31, 2019)
c951160  Update practical5-report.md (Veranchos, Mar 31, 2019)
5e0d188  typo (Veranchos, Mar 31, 2019)
c57d49c  Update Unigram-pos-tagger-report.md (Veranchos, Apr 1, 2019)
470c543  typo (Veranchos, Apr 2, 2019)
d852d7e  Update practical5-report.md (Veranchos, Apr 2, 2019)
f9f4e43  changed file name (Veranchos, Apr 2, 2019)
2eb4ff3  Rename practical5-report.md to practical5-response.md (Veranchos, Apr 2, 2019)
4bcedf6  modified practical 4 (Apr 2, 2019)
04f7670  slightly changed practical 4 files (Apr 2, 2019)
378de30  Update Unigram-pos-tagger-response.md (Veranchos, Apr 2, 2019)
98867f0  Update Unigram-pos-tagger-response.md (Veranchos, Apr 2, 2019)
234c1e1  added code for prac4 (Apr 2, 2019)
dd94581  Merge branch 'master' of https://github.com/Veranchos/ftyers.github.io (Apr 2, 2019)
50 changes: 50 additions & 0 deletions 2018-komp-ling/practicals/Practical 1/50random.txt

Large diffs are not rendered by default.

29 changes: 29 additions & 0 deletions 2018-komp-ling/practicals/Practical 1/Segmentator 1.py
@@ -0,0 +1,29 @@
from nltk.tokenize import sent_tokenize


def read_a_file(file_path):
    file = open(file_path, 'r', encoding='UTF-8')
    data = file.read()
    file.close()
    return data


def segmentator(text):
    # split the text into sentences with NLTK's Punkt and write them
    # to the result file, one sentence per line
    sent_tokenized_list = sent_tokenize(text)
    f = open('nltk_punkt_result.txt', 'w', encoding='utf-8')
    for sentence in sent_tokenized_list:
        f.write(sentence + '\n')
    f.close()
    print(sent_tokenized_list[:10])
    return sent_tokenized_list


def main():
    text = read_a_file('50random.txt')
    return segmentator(text)


if __name__ == '__main__':
    main()
37 changes: 37 additions & 0 deletions 2018-komp-ling/practicals/Practical 1/Tokenization/Tokenisation.py
@@ -0,0 +1,37 @@
from maxmatch import maxmatch

# create a dictionary for the MaxMatch algorithm


def create_dict(filename):
    f = open(filename, 'r', encoding='UTF-8')
    text = f.read()
    f.close()
    dictionary = text.split('\n')
    return dictionary


def sentences_to_tokenize(filename):
    # read the test sentences, one per line
    f = open(filename, 'r', encoding='UTF-8')
    text = f.read()
    f.close()
    sentences = text.split('\n')
    return sentences


def main():
    dictionary = create_dict('dict.txt')
    sentences = [sentence.strip() for sentence in sentences_to_tokenize('japanese_texts.txt')]

    # all tokens are separated with commas
    tokenized_sentences = [', '.join(filter(lambda token: token != ',', maxmatch(sentence, dictionary)))
                           for sentence in sentences]

    results = open('tokenization_result.txt', 'w', encoding='UTF-8')
    results.write('\n'.join(tokenized_sentences))
    results.close()

    return tokenized_sentences


if __name__ == '__main__':
    main()


13 changes: 13 additions & 0 deletions 2018-komp-ling/practicals/Practical 1/Tokenization/maxmatch.py
@@ -0,0 +1,13 @@
def maxmatch(sentence, dictionary):
    # greedy longest-match-first tokenisation (Jurafsky & Martin)
    if len(sentence) == 0:
        return []

    # try the longest prefix first; fall back to a single character
    # when nothing longer is in the dictionary (the i == 1 case)
    for i in range(len(sentence), -1, -1):
        word = sentence[:i]
        remainder = sentence[i:]

        if word in dictionary or i == 1:
            return [word] + maxmatch(remainder, dictionary)



Large diffs are not rendered by default.

134 changes: 134 additions & 0 deletions 2018-komp-ling/practicals/Practical 1/nltk_punkt_result.txt

Large diffs are not rendered by default.

134 changes: 134 additions & 0 deletions 2018-komp-ling/practicals/Practical 1/pragmatic_segmenter_result.txt

Large diffs are not rendered by default.

36 changes: 36 additions & 0 deletions 2018-komp-ling/practicals/Practical 1/segmentation-response.md
@@ -0,0 +1,36 @@
# Segmentation
1. I downloaded a dump of Russian Wikipedia and extracted the texts with WikiExtractor. Then I picked 50 random paragraphs with this bash command:

`head -n 50000 wiki.txt | sort -R | head -n 50 > 50random.txt`

So, this was my [data](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%201/50random.txt).

2. I applied the Ruby pragmatic_segmenter to it:

`ruby -I . segmenter.rb < 50random.txt > pragmatic_segmenter_result.txt`

[That's](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%201/pragmatic_segmenter_result.txt) what I've got.

3. Then I wrote [my implementation using NLTK's Punkt.](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%201/Segmentator%201.py)

[Here](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%201/nltk_punkt_result.txt) are the results.

## Comparison of results
Both methods produced the same number of sentences (134), but the sentences themselves differ.

For instance, pragmatic_segmenter split the common Russian abbreviation "т. к." ("так как", i.e. "because") into separate sentences:
> Мясо барана разделывают по суставам, при этом следят за тем, чтобы в бараний желудок (гюзян) не попали острые части, т.
> к.
> они могут его повредить.

NLTK's segmenter handled that case, but it also had trouble with abbreviations: for example, it split on language abbreviations such as "тиб." (Tibetan) and "санскр." (Sanskrit).
>Херуками, например, являются тантрические божества Чакрасамвара (тиб.
>Демчог) и Вишуддха Херука (тиб.
>Яндаг Херука).

pragmatic_segmenter, in contrast, got this one right:
>Херуками, например, являются тантрические божества Чакрасамвара (тиб. Демчог) и Вишуддха Херука (тиб. Яндаг Херука).

## Evaluation

As both approaches produced exactly the same number of sentences on my data, a purely quantitative comparison cannot tell them apart. What is interesting is that they make different types of mistakes.
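
One quick way to surface those disagreements is a set comparison of the two result files. A minimal sketch (not part of the practical; it assumes both files hold one sentence per line):
```
# Sketch: list the sentences on which the two segmenters disagree.
def load_sentences(path):
    with open(path, encoding='UTF-8') as f:
        return set(line.strip() for line in f if line.strip())

punkt = load_sentences('nltk_punkt_result.txt')
pragmatic = load_sentences('pragmatic_segmenter_result.txt')

print('Only in the Punkt output:')
for sentence in sorted(punkt - pragmatic):
    print(' ', sentence)

print('Only in the pragmatic_segmenter output:')
for sentence in sorted(pragmatic - punkt):
    print(' ', sentence)
```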
20 changes: 20 additions & 0 deletions 2018-komp-ling/practicals/Practical 1/tokenization-response.md
@@ -0,0 +1,20 @@
# Tokenization

To apply the MaxMatch algorithm, you need a dictionary, so I made one.
First, I downloaded the Japanese GSD training CoNLL-U file. Then I processed it with these bash commands:
1. Delete all comments:
> sed '/^#/d' ja_gsd-ud-train.conllu > japanese_dict.txt
2. Delete all blank lines:
> sed '/^\s*$/d' japanese_dict.txt > japanese_dict_no_empty_lines.txt

You also need data to test on. Here is mine.
I downloaded the Japanese GSD test CoNLL-U file and extracted the raw texts from it with these bash commands:
1. Extract the text lines:
> sed -n '/^# text =/p' ja_gsd-ud-test.conllu > japanese_test_texts.txt
2. Delete the '# text = ' prefixes:
> sed 's/^# text =//' japanese_test_texts.txt > japanese_texts.txt


Then I wrote [Python code](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%201/Tokenization/Tokenisation.py) to tokenize the text with the [MaxMatch](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%201/Tokenization/maxmatch.py) algorithm (described in Jurafsky & Martin's book).
[Here](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%201/Tokenization/tokenization_result.txt) is the result. All the tokens are separated with commas.

33 changes: 33 additions & 0 deletions 2018-komp-ling/practicals/Practical 2/Transliteration_table.txt
@@ -0,0 +1,33 @@
А A
Б B
В V
Г G
Д D
Е E
Ё Yo
Ж Zh
З Z
И I
Й Y
К K
Л L
М M
Н N
О O
П P
Р R
С S
Т T
У U
Ф F
Х H
Ц TS
Ч Ch
Ш Sh
Щ Sch
Ъ -
Ы Y
Ь -
Э E
Ю Yu
Я Ya
21 changes: 21 additions & 0 deletions 2018-komp-ling/practicals/Practical 2/rank.py
@@ -0,0 +1,21 @@
import sys


# read a frequency list ("frequency<TAB>word" per line) from the file
# given as a command-line argument
freq = []

fd = open(sys.argv[1], 'r')
for line in fd.readlines():
    line = line.strip('\n')
    (f, w) = line.split('\t')
    freq.append((int(f), w))
fd.close()

# assign ranks, assuming the list is sorted by descending frequency;
# words with the same frequency share a rank
rank = 1
prev_freq = freq[0][0]
ranks = []
for i in range(0, len(freq)):
    if freq[i][0] < prev_freq:
        rank = rank + 1
        prev_freq = freq[i][0]
    ranks.append((rank, freq[i][0], freq[i][1]))
    print('%d\t%d\t%s' % (ranks[i][0], ranks[i][1], ranks[i][2]))
58 changes: 58 additions & 0 deletions 2018-komp-ling/practicals/Practical 2/transliterate.py
@@ -0,0 +1,58 @@

def read_text(filename):
    f = open(filename, 'r', encoding='UTF-8')
    text = f.read()
    text = text.replace('\ufeff', '')

    return text


def write_new_text(filepath, text):
    new_text = open(filepath, 'w+', encoding='UTF-8')
    new_text.write(text)
    new_text.close()
    return new_text


def create_map(alphabets):
    lines = alphabets.split('\n')
    list_of_letters = []
    list_of_matches = []
    for line in lines:
        low_case = line.lower()  # because all letters in our table are in upper case
        list_of_letters.append(low_case)
        list_of_letters.append(line)
    for line in list_of_letters:
        letters = line.split('\t')
        list_of_matches.append(letters)
    matches = dict(list_of_matches)

    return matches


def transliterate(text, matches):
    transliterated_text = ''
    for letter in text:
        if letter in matches:
            if matches[letter] == '-':  # for Ъ and Ь
                transliterated_text += ''
            else:
                transliterated_text += matches[letter]
        else:
            transliterated_text += letter

    return transliterated_text


def main():
    alphabets = read_text('Transliteration_table.txt')
    matches = create_map(alphabets)
    text_to_transliterate = read_text('Text_to_transliterate.txt')
    transliterated_text = transliterate(text_to_transliterate, matches)
    print(transliterated_text)

    return write_new_text('Transliterated_text.txt', transliterated_text)


if __name__ == '__main__':
    main()
29 changes: 29 additions & 0 deletions 2018-komp-ling/practicals/Practical 2/transliteration-response.md
@@ -0,0 +1,29 @@
# Practical 2
Here are responses to some of the questions from Practical 2:
>You'll note that the code does not print out the frequency list in order. Which Unix command might you use to sort the output in frequency order ?

```sort -nr```

>What do you think we would get if we set the argument reverse to False ?

We'd get an ascending sorted list.

First, I made [rank.py](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%202/rank.py), which takes a frequency-list file as a command-line argument, reads it in, and outputs a ranked frequency list.
Then I [implemented the transliteration algorithm](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%202/transliterate.py) described in the task.

>What to do with ambiguous letters ? For example, Cyrillic `е' could be either je or e.

Russian _е_ becomes _je_ after _ь_, _ъ_ and vowels, and at the beginning of a word; it becomes _e_ in all other cases. So we can add rules to disambiguate it, applied either before or after transliteration. For example, after transliteration we can look at every _e_ in the transliterated text and replace it with _je_ when it follows one of [AOIEUaoieu'].
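
A minimal sketch of that post-processing rule (the function name and the exact character class are my assumptions):
```
import re

# Sketch: rewrite 'e' as 'je' when it follows a transliterated vowel
# or the apostrophe sometimes used for the soft/hard signs. The
# character class is an assumption, not part of the practical.
def fix_e(transliterated):
    return re.sub(r"(?<=[AOIEUaoieu'])e", "je", transliterated)

print(fix_e('poezd'))  # -> pojezd
```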

>Can you think of a way that you could provide mappings from many characters to one character ?
>For example sh → ш or дж → c ?

Maybe we can first go through the whole text and replace the multi-character sequences (_sh_ with _ш_, _дж_ with _j_, and so on), and then make a second pass to replace the remaining single letters; see the sketch below.
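
A sketch of that two-pass idea with a toy mapping (the mappings and names here are illustrative assumptions, not the full table):
```
# Sketch: handle many-to-one mappings by replacing the longest
# sequences first, then the remaining single characters.
MULTI = {'sh': 'ш', 'ch': 'ч'}
SINGLE = {'a': 'а', 'b': 'б', 'v': 'в', 'r': 'р'}

def detransliterate(text):
    # first pass: multi-character sequences, longest first
    for seq, letter in sorted(MULTI.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(seq, letter)
    # second pass: the remaining single characters
    for char, letter in SINGLE.items():
        text = text.replace(char, letter)
    return text

print(detransliterate('shvabra'))  # -> швабра
```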
>How might you make different mapping rules for characters at the beginning or end of the string ?

In the case of Russian-to-English transliteration, I guess the only troublesome word-initial letter is _е_. We can write something like:
```
if word.startswith('е'):
    ...
```
If we are afraid of losing capitalisation at the beginning of a sentence, we could add the capitalised variants to our matches table, or just implement the simple rule I described above.
85 changes: 85 additions & 0 deletions 2018-komp-ling/practicals/Practical 3/Unigram_model-response.md
@@ -0,0 +1,85 @@
# Practical 3
## Functions
Output (first 10 lines) from running the [script](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%203/palindrome.py) for finding palindromes
on the freq.txt file from the previous practical:
```
3439 как
1669 или
1254 еще
1085 ее
283 тут
279 тот
190 оно
90 XX
80 11
71 XIX
```
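
palindrome.py is linked above rather than shown; the core check can be as small as this sketch (assuming freq.txt holds frequency<TAB>word lines, as in the previous practical):
```
# Sketch of the palindrome filter over the frequency list.
with open('freq.txt', encoding='UTF-8') as f:
    for line in f:
        freq, word = line.strip().split('\t')
        lowered = word.lower()
        if len(lowered) > 1 and lowered == lowered[::-1]:
            print(freq, word)
```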

## Implementing n-dimensional matrices with a dict

I implemented the code from the task and got this output:
```
a absorbed all and another
бы 0 0 0 0 0
вас 0 0 0 0 0
видит 0 0 0 0 0
всего 0 0 0 0 0
вы 0 0 0 0 0
```
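
The code itself is not reproduced in this diff; a minimal sketch of the idea (a two-dimensional "matrix" stored as a dict keyed by (row, column) tuples, with the labels taken from the output above):
```
# Sketch: a 2-D matrix faked with a dict keyed by (row, col) tuples.
rows = ['бы', 'вас', 'видит', 'всего', 'вы']
cols = ['a', 'absorbed', 'all', 'and', 'another']

counts = {}
for r in rows:
    for c in cols:
        counts[(r, c)] = 0

# Print the table; end='' keeps each row on a single line.
print('\t', end='')
print('\t'.join(cols))
for r in rows:
    print(r, end='')
    for c in cols:
        print('\t%d' % counts[(r, c)], end='')
    print()
```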
>Why do we need end='' passed to the print() statement ? What would happen if we didn't have it?

We need ```end=''``` to keep every element from being printed on its own line, because ```\n``` is the default ending of ```print()```.
Without it we get this table:
```
a absorbed all and another
бы
0
0
0
0
0

вас
0
0
0
0
0

видит
0
0
0
0
0

всего
0
0
0
0
0

вы
0
0
0
0
0

```
After saving [this](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%203/args.py) code and running it from the command line with
```$ python3 args.py a b c ```

I got this output:
```['args.py', 'a', 'b', 'c']```
-- the list of arguments passed on the command line, starting with the script name itself.

## Unigram language model
Here is my code: [train.py](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%203/train.py). It runs from the command line and takes two arguments: (1) the path to the input file and (2) the path to the output file. Mine are [test.txt](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%203/test.txt) and [res.txt](https://github.com/Veranchos/ftyers.github.io/blob/master/2018-komp-ling/practicals/Practical%203/res.txt) respectively. The test file was taken from the [Russian SynTagRus corpus](https://github.com/UniversalDependencies/UD_Russian-SynTagRus/blob/master/ru_syntagrus-ud-test.conllu).
```
$ python3 train.py test.txt res.txt
```
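
train.py itself is only linked above, not shown in this diff; a minimal sketch of a unigram trainer with the same command-line interface (the whitespace tokenisation and the output format are my assumptions):
```
# Sketch: train a unigram model as relative frequencies of tokens.
import sys
from collections import Counter

def train(in_path, out_path):
    with open(in_path, encoding='UTF-8') as f:
        tokens = f.read().split()  # naive whitespace tokenisation
    counts = Counter(tokens)
    total = sum(counts.values())
    with open(out_path, 'w', encoding='UTF-8') as out:
        for token, count in counts.most_common():
            out.write('%s\t%.6f\n' % (token, count / total))

if __name__ == '__main__':
    train(sys.argv[1], sys.argv[2])
```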
>What might be a simple improvement to the language model for languages with orthographic case ?

I think a simple improvement might be lemmatization of the wordforms: we could count not only the wordforms themselves but also their lemmas, which would tell us more about the distribution of words.
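
For Russian this could be sketched with the pymorphy2 analyser (the library choice is my assumption; any lemmatiser would do):
```
# Sketch: count lemmas instead of surface forms with pymorphy2.
import pymorphy2
from collections import Counter

morph = pymorphy2.MorphAnalyzer()
tokens = ['кошка', 'кошки', 'кошек', 'Кошке']
lemmas = [morph.parse(token.lower())[0].normal_form for token in tokens]
print(Counter(lemmas))  # Counter({'кошка': 4})
```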
3 changes: 3 additions & 0 deletions 2018-komp-ling/practicals/Practical 3/args.py
@@ -0,0 +1,3 @@
import sys

print(sys.argv)