1Unstructured Knowledge
2
36
4
In my opinion, the idea of a version control system was not presented completely; nothing was said about branches, for example, and there were few practical examples. The part on prototyping was very good.
6
7Knowledge represented by:
8
9 objects or concepts
10 relations between them
11
12Statistical MT - Assumptions
13
14Neural Translation (NMT)
15Quantitative Knowledge
16
17Picture: bing.com/images
18
19Krzysztof Jassem
20
21Example:
22
23Find me a restaurant nearby
24
25Algorithms for automatic
26
27grammatical error correction
28
29If both: translation model and language model are well trained,
30
31then the desired translation gets the highest probability.
32
Rating: 5/10
34
A very good lecture, but for people who have no idea about such things, e.g. first-year students. If someone did not know this by their fourth year, I do not know what they are doing here...
36
37"Linguistic Regularities in Continuous Space Word Representations"
38
39Mikolov, Yih, Zweig (2012)
40
41Efficient algorithms
42
43for Hybrid
44
45Neural Machine Translation
46
47Rule-based translation requires large amount of human work.
48
492018 / 2019
50Sentiment Analysis
51
52Exercise: Compare P("be"), P("go") and P("an")
53
54A System for Supporting Decisions
55
56on Investing at Polish Stock Exchange
57
58based on stock exchange news
59
60http://www.elbot.com
61Step 3. Rules that induce company names
62Step 2. Rules that induce posts
63
64Aims of summarization
65
66 absorb information from vast amounts of data
67 understand emotions in documents
68
69https://translatica.pl
70
71Probabilistic methods
72
73for spell-checking
74Question Answering (QA) Systems
75
At some level of translation quality, further improvements are very hard to implement.
77Statistical Methods
781-of-N Encoding
79
80Queen is a wife of a king. King is a husband of a queen.
81Qualitative Knowledge
82
83Input Sentence: Poznań mi się podoba.
84n-gram Model
85
86https://zpjn.wmi.amu.edu.pl/
87
88uploads/Posiadala-mgr-i.pdf
89
902014
91
Text classification is an automated process that aims to assign a document to one of a set of predefined categories (classes).
93
10/10. Clearly presented topics, well explained. The most important features of a prototype were covered. The goals of continuous integration were explained.
95
Neural networks are the latest approach to Machine Translation (2014).
97
98A) are differentiable
99
100B) are non-negative
101
102C) are non-decreasing
103
104D) are continuous
105
106E) have values between 0 and 1
107
108Dawid Jurkiewicz
109
110P("an") > P("be") > P("go")
111
112Misiu zaraz Państwu łapkę poda.
113
114Goal: To obtain information relevant to an information need from a collection of information resources.
115
116P(be | it, be) > P(go | it, be) > P (an | it, be)
117Dialogue Systems
118
119Poznań has over 600 000 inhabitants.
120
1212017
122
Rating: 7
124
The prototyping phase, which is an essential element of startups and saves a lot of time, was discussed well.
126
127Kinga Kramer
128
P(be | let, it) = COUNT(let, it, be) divided by COUNT(let, it)
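A minimal sketch of this count-based estimate in Python (the toy corpus below is illustrative, not part of the lecture material):

from collections import Counter

# toy corpus, already tokenized (illustrative)
corpus = "let it be let it go let it be".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def p(word, w1, w2):
    # P(word | w1, w2) = COUNT(w1, w2, word) / COUNT(w1, w2)
    return trigrams[(w1, w2, word)] / bigrams[(w1, w2)]

print(p("be", "let", "it"))   # 0.66... (2 of the 3 'let it' bigrams are followed by 'be')
print(p("go", "let", "it"))   # 0.33...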
130
131Goal: To extract important posts in corporations
132
1332018
134
135 AIML - operated by A.L.I.C.E
136
137 Keywords
138
139 Word statistics (web pages)
140
141 Question Answering
142
143Shallow
144
145Automatic Summarization of Polish News
146
147Articles by Sentence Selection
148Knowledge Structure
149
1502004 - 2010 - 2018
151
152Results:
153
154CEO of Google LLC
155
156CTO of Microsoft LLC
157Deep
158
An artificial neural network (ANN) is a computational model based on the structure and functions of biological neural networks.
160
161Information that flows through the network affects its structure -
162
163a neural network changes based on input and output.
164
165Correct answers:
166
167C)
168
169D)
170
1712018
172Bootstrapping Method - Example
173Unigram (1-gram) Model
174
175Queen is a wife of a king. King is a husband of a queen.
176
177Results:
178
179Director of Google LLC
180
181President of Microsoft LLC
182
#Name-inducing rule (seed rule extended)
184
185Post = < CEO, CTO, Director, President >
186
187CorpSuffix = < LLC >
188
189Left: < Post > <'of'>
190
191Match: < Name >
192
193Right: < CorpSuffix >
194Unsupervised Methods
195Step 1. Seed Rule
196Automatic Summarization
197Rule-based Translation (RBMT)
198
199https://zpjn.wmi.amu.edu.pl
200
201/uploads/Klimek-mgr-i.pdf
202
203Word embedding is a numerical representation of a text.
204
205Input: Poznań jest miastem pięknym
206Text Classification
207
208https://zpjn.wmi.amu.edu.pl/uploads
209
210/Kania-mgr-i.pdf
211
212Concepts: dog, animal
213
214Relation: is_a
215
216Dawid Klimek
217
218Document Summary - short note on a document that conveys its meaning and information included.
219Rule-based Example
220
221Choose correct statement:
222
223Translation model
224
225A) is an example of a translation system
226
227B) is a function that assigns probabilities to potential translations of words (or phrases)
228
229C) is a translation system modeled for mobile devices
230
D) is used in rule-based translation
232
233Word Embeddings
234Language Model
235Predictive Systems
236
2372018
238
239He took it out.
240
241<rule id="" name="">
242
243<pattern>
244
245<token inflected='yes' regexp='yes'>cofać|cofnąć</token>
246
247<token>do</token>
248
249<token>tyłu</token>
250
251</pattern>
252
253...
254
255</rule>
256
257AIML script - Example
258
259Language model is a mapping that assigns probability to a given sequence of words.
260Information Retrieval
261
262All activation functions:
263
264Conversational Search
265
266Łukasz Pawluczuk
267
Lexicon (7 words):
269
270a, husband, is, king, of, queen, wife
271Frequency Encoding
272
273[2, 0, 1, 1, 1, 1, 1] [2, 1, 1, 1, 0, 1, 1] [4, 1, 2, 2, 2, 2, 1]
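A minimal sketch of computing such count vectors (illustrative only):

from collections import Counter

lexicon = ["a", "husband", "is", "king", "of", "queen", "wife"]

def frequency_vector(sentence):
    # count how many times each lexicon word occurs in the sentence
    counts = Counter(sentence.lower().replace(".", "").split())
    return [counts[word] for word in lexicon]

print(frequency_vector("Queen is a wife of a king."))   # [2, 0, 1, 1, 1, 1, 1]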
274
275Results:
276
277Director of Samsung LLC
278
279President of Apple LLC
280
281Marcin Kania
282Artificial Neural Network
283
284TEST - Question 3
285
286"Voice-activated platform that listens in on company meetings for trigger phrases like “what are” and “I wonder.” When it hears them, Second Mind’s search function whirs into action, seeking an answer for the rest of your sentence."
287Types of Knowledge
288
Extracting Information about church service start times
290
2911. Match words of the question to words in a document.
292
2932. Return the best matching document
294
295No semantic analysis executed
296Semantic Search
297
298Picture: IMPACT Final Event 26-06-2012 - Franciska de Jong - Indexing and searching of ‘noisy’ data
299
300#Seed Rule
301
302Post = < CEO, CTO >
303
304CorpSuffix = < LLC >
305
306Left: < Post > <'of'>
307
308Match: < Name >
309
310Right: < CorpSuffix >
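A rough sketch of applying such a seed rule with a regular expression (the sample text and names are made up for illustration):

import re

posts = ["CEO", "CTO"]        # seed Post values
corp_suffixes = ["LLC"]       # seed CorpSuffix values

text = "John Smith is the CEO of Google LLC. Anna Brown is the CTO of Microsoft LLC."

# Left: <Post> 'of'   Match: <Name>   Right: <CorpSuffix>
pattern = "(?:" + "|".join(posts) + r") of (\w+) (?:" + "|".join(corp_suffixes) + ")"

print(re.findall(pattern, text))   # ['Google', 'Microsoft']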
311Machine Translation Methods
312
3132015
314
315Lexicon (7 words):
316
317[a, husband, is, king, of, queen, wife]
318
319Analysis of Sentiment
320
321in Students' Comments on Lectures
322
3232017
324Machine Translation
325
326 Images
327 Videos
328 Documents
329 Texts in the Internet
330
331Voice Assistants
332
333Classification of Investment Funds by Means of Machine Learning Methods.
334
335Objects: Poznań, Poland
336
337Relation: located_in
338
339Tomasz Dwojak
340
341Roman Grundkiewicz
342NLP Applications
343
344Answer: B
345Structured Knowledge
346
347An open-source C++ library for NMT has been developed at AMU.
348Statistical Translation (SMT)
Bootstrapping important corporation posts
350
351P(Poznan is a beautiful city) =
352
353P(a | Poznan, is) *
354
355P(beautiful | is, a) *
356
P(city | a, beautiful)
358Other Applications
359Test Question 1
360Methods of Text Analysis
361
The approach is currently being superseded by other Machine Translation methods.
363
364Choose correct sentence:
365
366n-gram language model
367
368A) assigns probability to a word depending on n previous words
369
370B) assigns probability to a word depending on (n-1) previous words
371
372C) is used to calculate probabilities of sentences that are exactly n-word long
373
374D) is used to calculate probabilities of sentences whose weight is 'n'.
375
376Chatbots
377Rule-based Methods
378
379TEST - Question 4
380Language Model in MT
381Spelling & Grammar Correction
382Translation Model - Example
383Test Question 5
384
385Knowledge given in measurable values:
386
387 geographical location
388 population
389 size, length
390 ...
391
392Automated process of categorizing opinions expressed in a piece of text, usually:
393
394positive, negative, or neutral.
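A minimal lexicon-based sketch of the idea (the word lists are illustrative; real systems use trained classifiers):

POSITIVE = {"good", "great", "clear", "excellent"}
NEGATIVE = {"bad", "boring", "unclear", "poor"}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("very clear and great lecture"))   # positive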
395Augmented Writing
396Students' Comments on K. Jassem's Lectures
397Translatica - Examples
398
399Definition based on: www.techopedia.com
400
401https://zpjn.wmi.amu.edu.pl/uploads/
402
403Pawluczuk-mgr-i.pdf
404Word2Vec
405Reinforcement Learning
406
407https://zpjn.wmi.amu.edu.pl
408
409/uploads/grundkiewicz2016phd.pdf
410
411Output Sentence: I like Poznań.
412
413 Deep syntactic analysis of a question
414 Semantic analysis of a question
415 Conversion to a database query.
416 Knowledge database required (in a structured form)
417
418Guess: What English words may have appeared at the other end of the "phone"?
419
420<category>
421
422<pattern> LET'S TALK ABOUT FLOWERS. </pattern>
423
424<template> Yes <set name="topic">flowers</set> </template>
425
426</category>
427
428<topic name="flowers">
429
430<category>
431
432<pattern> * </pattern>
433
434<template> Flowers have a nice smell. </template>
435
436</category>
437
438<category>
439
440<pattern> I LIKE IT SO MUCH! </pattern>
441
442<template> I like flowers too. </template>
443
444</category>
445
446</topic>
447Translatica
448
449NMT already outperforms SMT.
450NMT - Summary
451Supervised Methods
452How do chatbots work?
453
454RDF (Resource Description Framework) Triples
455
456https://textio.com/
457
458Tomasz Posiadała
459RBMT - Summary
460
461Semantic search is a data searching technique in which a search query aims to:
462
463 match keywords
464 determine the intent
465 determine the context
466
467# Rule post_inducing_1:
468
469Match: < Post> < 'of Google LLC'>
470
471#Rule post_inducing_2:
472
473Match: <Post> < 'of Microsoft LLC'>
474
475TEST - Question 5
476
477Aim:
478
479To predict future based on information in documents.
480
481Task:
482
Predict whether the company's stock price will rise or fall on a given date
484
485Training Data:
486
487X: Stock reports on the company
488
489Y: 1 (UP) or 0 (DOWN) next day(s)
490
491On pociagnął to, ujawniać
492
493że jest się homoseksualistą.
494
495Example
496
497Summarize users' opinions on BIXBY vs SIRI
498
499M. Junczys-Dowmunt
500
501R. Grundkiewicz
502
503T. Dwojak
504
505https://marian-nmt.github.io/
506
507Picture: http://www.cs.utah.edu/~hal/courses/2009S_AI/Walkthrough/Speech/
508
509Change of Order
510
511Picture: http://nlp.postech.ac.kr/research/previous_research/smt/
512
513Lexical Rules
514
515Information overload - information exceeds our capacity to understand it.
516
517Parallel text corpora
518
519Picture: textberg.jpg
520
521Shift of Subject and Object
522
523Input: Words represented as 1-of-N vectors: (0, 0, 0, ..., 1, 0, 0)
524
525length of the vector = size of lexicon
526
527Output: Word Embeddings - Representations of words as
528
529M-dimensional vectors
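A minimal sketch of building 1-of-N vectors for the toy lexicon used earlier (illustrative):

import numpy as np

lexicon = ["a", "husband", "is", "king", "of", "queen", "wife"]

def one_of_n(word):
    # 1-of-N (one-hot) vector: length = size of the lexicon, a single 1 at the word's index
    vector = np.zeros(len(lexicon), dtype=int)
    vector[lexicon.index(word)] = 1
    return vector

print(one_of_n("king"))   # [0 0 0 1 0 0 0]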
530
531The probability of a word depends on (n-1) previous words.
532
533Trigram (3-gram) - probability of a word depends on 2 previous words.
534
535P(go | let, it)
536
537P(be | let, it)
538
539P(an | let, it)
540
541Compare:
542
543Weronika Sieińska
544
545Use of Reinforcement Learning in Dialogue Modeling
546
5472019
548
549picture from: wikipedia
550
551[1, 0, 1, 1, 1, 1, 1] [1, 1, 1, 1, 0, 1, 1] [1, 1, 1, 1, 1, 1, 1]
552
5531-st sentence 2nd sentence Both sentences
554
555Translation candidates:
556
557 Poznan is city beautiful
558 Poznan is beautiful city
559 Poznan is town beautiful
560 Poznan is town pretty
561 Poznan is a beautiful city
562
563...
564
565Poznań
566
567 Poznan: 0,7
568 Posen: 0,2
569 Poznań: 0,1
570
571Jest
572
573 Is: 0,9
574 Was: 0,1
575
576Miastem
577
578 City: 0,7
579 Town: 0,3
580
581Pięknym
582
583 Beautiful: 0,7
584 Pretty: 0,2
585 Excellent: 0,1
586
587NULL (empty word)
588
589 Null: 0,6
590 A: 0,2
591 The: 0,2
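A toy sketch of combining the two models: the word-translation probabilities come from the tables above, while the language-model scores below are invented for illustration.

# word-translation probabilities (translation model)
tm = {
    "Poznań":  {"Poznan": 0.7, "Posen": 0.2, "Poznań": 0.1},
    "jest":    {"is": 0.9, "was": 0.1},
    "miastem": {"city": 0.7, "town": 0.3},
    "pięknym": {"beautiful": 0.7, "pretty": 0.2, "excellent": 0.1},
}

# toy language-model scores for two word orders (invented numbers)
lm = {
    "Poznan is beautiful city": 0.04,
    "Poznan is city beautiful": 0.001,
}

source = ["Poznań", "jest", "miastem", "pięknym"]
chosen = {"Poznań": "Poznan", "jest": "is", "miastem": "city", "pięknym": "beautiful"}

tm_score = 1.0
for word in source:
    tm_score *= tm[word][chosen[word]]    # 0.7 * 0.9 * 0.7 * 0.7

for candidate, lm_score in lm.items():
    print(candidate, tm_score * lm_score)
# the translation model alone cannot decide the word order;
# the language model makes the fluent order win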
592
593Translation Model
594
595Sequence of words:
596
597Probability of the sequence:
598
599Method:
600
601Recurrent Neural Networks
602
603Unigram probability:
604
605The Teddy bear of epidemics
606
607will pass a trap to the State.
608
609
610(ab)c = a(bc)
611
612[A-Za-z\u00C0-\u017F]
613
614my$ mommy
615
616^my my prescious
617
618^my my, oh my
619
620trimming spaces and tabs
621Selected RegEx Properties
622Definition
623
624b.d bad
625
626bed
627
628bud
629
630bird
631
632(x)
633
634x*
635
636(xy)*
637
638y*(x|z)
639
640groups() returns all matching subgroups in a tuple.
641
642findall() finds all the matches and returns them as a list of strings.
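For example (illustrative):

import re

print(re.findall(r'\d+', 'rooms 12 and 305, floor 3'))   # ['12', '305', '3']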
643Lookbehind
644
645(?<=a)b (positive lookbehind)
646
647matches the b (and only the b) in cab,
648
649but does not match b in cob or debt.
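A minimal check in Python (illustrative):

import re

print(re.findall(r'(?<=a)b', 'cab'))    # ['b']
print(re.findall(r'(?<=a)b', 'cob'))    # []
print(re.findall(r'(?<=a)b', 'debt'))   # []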
650
651If not, you can also add flags to the start of the regex.
652
In response to a symbol, the automaton transitions to another state.
654
655Krzysztof Jassem
656
657colou?r color
658
659colour
660
661Feb(ruary)? 23(rd)? Feb 23
662
663February 23
664
665February 23rd
666
667.?@amu\.edu\.pl mail to: jassem@amu.edu.pl
668
L(x*) = {"", x, xx, xxx, xxxx, ...}
670
671^4[0-9]{12}(?:[0-9]{3})?$
672
673sit. A cat sits on a mat.
674
675Any character: German, French or Polish:
676group()
677
678Let r denote a regular expression.
679
680Then, L(r) denotes a language generated by r.
681
682{n} - n repetitions
683
684[ÀÂÇÉÈÊËÎÏÔÛÙÜŸàâçéèêëîïôûùüÿ] French
685
686L(x) = {x}; L(y) = {y}; L(z) = {z}
687
688([Dd]og|[Cc]at)+
689Practice (Advanced)
690
691https://www.tutorialspoint.com/automata_theory/constructing_fa_from_re.htm
692RegEx Definition
693
694\d [0-9]
695
696\w [a-zA-Z0-9_]
697
698\t Tab character
699
700\r End of line character
701
702\n New line character
703
704\s [ \t\r\n]
705
706\D [^\d]
707
708\W [^\w]
709
710\S [^\s]
711Laziness
712
713dogs
714
715line = "Cats are smarter than dogs";
716
717found = re.match( r'dogs', line)
718
719if found:
720
721print found.group()
722
723else:
724
725print "No match!!"
726
727[a-z]*@amu\.edu\.pl
728
729mail to: jassem@amu.edu.pl
730
731\d+ PZ033GA
732
733^\d+$ 347
734
735Anchors do not match characters.
736
737They match positions in the string.
738
739<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>
740Replacing strings
741
At the beginning, the automaton is in the initial state.
743
744VISA card number
745
Most diacritic characters may be represented by a Unicode range:
747
748Phone Num : 2004959559
749
750line = "Cats are smarter than dogs";
751
752found = re.search( r'(\w)* are (\w)*', line)
753
754if found:
755
756print found.groups()
757
758else:
759
760print "Nothing found!!"
761
762ab* abbb
763
764a
765
766Regular expressions are used as patterns for searching and replacing strings.
767
768For a given regex r a program searches for (the first) string that belongs to L(r).
769
770+ matches 1 or more repetitions
771Representing RegEx as FA
772Diacritic characters
773
774A) are differentiable
775
776B) are non-negative
777
778C) are non-decreasing
779
780D) are continuous
781
782E) have values between 0 and 1
783Word boundaries
784
785@(.+\.)+ mail to: jassem@amu.edu.pl
786
787[AEO]la Ala ma kota
788
789pi[wk]o piwo
790
791piko
792
793piwko
794
795wst[aą][zż]ka wstążka
796
797wstazka
798
799. This is a cat.
800Theoretical Background
801
802\bis\b This island is beautiful.
803
804\Bis\B This island is not an aisle.
805Negated Classes
806
807Regexes a and b are equivalent (a = b) if they define the same language.
808
809Theorem:
810
811For each regular expression r there exist NFA (DFA) that accepts L(r).
812
813seat\. He seats him on a seat.
814
815('Cats', 'smarter')
816
817d[ay] Today is Sunday.
818
819a(b | c) = ab | ac
820
821https://regex101.com/
822
823Quantifiers are used to match repetitions.
824
825Backreferences match strings that have already been matched as groups:
826
827\1 - matches first group
828
829\2 - matches second group, etc.
830Literal Characters & Concatenation
831groups()
The neural network presented in the video was used for recognition (not for generation):
833A) generates a hand-written form of a digit
834B) classifies a hand-written digit into one of 10 classes
835C) recognizes the author of the writing
836
837word = re.search(r'\w+', "Hello World")
838
839word.group()
840
841'Hello'
842
843https://www.tutorialspoint.com/automata_theory/non_deterministic_finite_automaton.htm
844
845This includes:
846Shorthand Characters Classes
847
848\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b
849
850cat About cats and mice
851
852e-mail
853Automaton
854
855-
856
857Correct answers:
858
859C)
860
861D)
862
863. Ala ma kota.
864
865kot. Ala ma kota.
866
867r.k rak
868
869rok
870
871ryk
872
873.o.a lola
874
875cola
876
877rola
878
879kolano!
880
881kot\. Do Ali przyszedł kot.
882
883Kleene star matches 0 or more repetitions.
884
885(a | b) | c = a | (b | c)
886
887E - alphabet (set of characters)
888
889[A-ZÀÂÇÉÈÊËÎÏÔÛÙÜŸÖÜÄÖÜẞĄĆĘŁŃÓŚŹŻa-zàâçéèêëîïôûùüÿäöüßąćęłńóśźż]
890Dot
891
892(a*)* = a*
893
894[?.!] Hey!
895'+' Quantifier
896NLP Examples
897
898Extremal Clause
899
900Nothing else is a regular expression.
901
902group() returns the whole string that matches the regex.
903'Eliza' bot
904Programming Regexes
905Flags
906
907\b matches word boundary;
908
909\B negates \b:
910
911[^lp]asem
912
913Jassem idzie lasem
914
915z pasem, czasem.
916Backreferences
917
918<H1>My First Heading</H1>
919
L((xy)*) = {"", xy, xyxy, ...} = L(xy)*
921match()
922RegEx and
923Finite-State Automata
924
925Grouping may be used to represent a repetition of a pattern longer than one character.
926
927Inductive Clause
928
If a, b are regular expressions, then:
930
931(a) (grouping),
932
933a* (Kleene star - repetition),
934
935ab (concatenation),
936
937a|b (alternative),
938
939are regular expressions.
940Practice (Basic)
941Deterministic Finite-State Automaton (DFA)
942
pattern = re.compile(r'\w+')
944
945word = pattern.search('Hello World')
946
947word.group()
948
949'Hello'
950
951All activation functions:
952
953(ab)* ababab
954
955s = 'My name is John Jones.'
956
957mo = re.search(r'John(?! Doe)', s)
958
959print(mo.group())
960
961John
962Positive Classes
963
964[ ] matches any one (!) character inside brackets.
965
L(xxx(y | z)) = {xxxy, xxxz} = L(xxxy | xxxz)
967
968a Jack is a boy.
969
970L((xy)z) = {xyz} = L(x(yz))
971
972^[ \t]+|[ \t]+$
973
974group(n) returns the n-th subgroup.
975
976Theorem:
977
978For any NFA there exists DFA that accepts the same language.
979
980Unicode range without mathematical symbols:
981
982[A-Z]un A gun is no Fun.
983
984floating point numbers
985
986dog Oh, my god, it's a dog!
987Perl Contribution
988Lookahead
989
990Automaton is always in a state.
991
992https://www.eclecticenergies.com/ego/eliza
993
994User1: Nobody loves me.
995
996ELIZA1: WHY DO YOU THINK NOBODY LOVES YOU
997
998^ matches beginning of the string
999
1000$ matches end of the string.
1001
1002Flags modify the standard searching procedure.
1003Basic operations
1004
1005(ab)+ abababa
1006Examples
1007
1008Iraq[^i]
1009
1010Iraqi freedom.
1011
1012Iraq is a country.
1013
1014Iraq.
1015
1016check at:
1017
1018User2: They're always making fun of me.
1019
1020ELIZA2: CAN YOU GIVE ME A SPECIFIC EXAMPLE
1021
1022? matches 0 or 1 repetition
1023Number of repetitions
1024Regular Expressions in NLP
1025
1026Answer: B
1027
1028(ab)? ab
1029
1030"" (empty string)
1031
Patterns may be grouped by means of parentheses: "( )"
1033
1034phone = "2004-959-559 # This is Phone Number"
1035
1036# Delete Python-style comments
1037
1038num = re.sub(r'#.*$', "", phone)
1039
1040print "Phone Num : ", num
1041
1042Phone Num : 2004-959-559
1043
1044E = {x,y,z}
1045
1046# Remove anything other than digits
1047
1048num = re.sub(r'\D', "", phone)
1049
1050print "Phone Num : ", num
1051
1052[ÄÖÜẞäöüß] German
1053
1054date
1055
1056{n, } - at least n repetitions
1057Negated
1058Character Classes
1059
1060(Python)
1061Grouping
1062
L(xy | yz) = {xy, yz} = L(yz | xy)
1064
1065^ (caret character) at the start of [ ] negates characters inside [ ].
1066
1067E = {x, y, z}
1068Kleene star
1069Test Question 1
1070Anchors
1071
1072re.search() looks for the string that matches the regex.
1073
1074Returns a match object on success, none on failure.
1075
1076re.compile() changes a regex into a Python object.
1077
1078group() and groups() return the matching string(s).
1079
1080ab? ab
1081
1082a
1083Quantifiers
1084
1085a | b = b | a
1086
1087-
1088
1089[ĄĆĘŁŃÓŚŹŻąćęłńóśźż] Polish
1090Test Question 5
1091
1092Once all input has been received:
1093
1094- if automaton is in an accepting state, input has been accepted
1095
1096- otherwise, input has been rejected
1097Non-deterministic Finite-State Automaton (NFA)
1098
1099× (U+00D7) ÷ (U+00F7)
1100
1101Dot (.) matches any character except for newline (equivalent to [^\n]).
1102
1103[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u017F]
1104
1105https://www.tutorialspoint.com/automata_theory/deterministic_finite_automaton.htm
1106findall()
1107Character Classes
1108
1109Normally, flags are specified outside the regular expression.
1110
1111match() checks for a match only at the beginning of the string.
1112
1113line = "Cats are smarter than dogs";
1114
1115found = re.search( r'(\w*) are (\w*)', line)
1116
1117if found:
1118
1119print found.group()
1120
1121print found.group(1)
1122
1123print found.group(2)
1124
1125else:
1126
1127print "Nothing found!!"
1128
1129L(x*y) = {y, xy, xxy, xxxy, ...}
1130Alternative
1131
The automaton receives input: a sequence of symbols (one by one) that belong to a finite alphabet.
1133
1134[cmh]at There is a cat on the mat.
1135
1136[:digit:] any digit
1137
1138[:alpha:] any letter
1139
1140[:alnum:] any digit or letter
1141
1142a About cats and mice
1143
1144L((x | y) | z) = {x, y, z} = L(x | (y | z))
1145
1146ab+ abbb
1147
1148ab
1149
1150cat|dog This is a catalogue.
1151
1152Kowalsky|i This is Kowalsky.
1153
1154Kowalsk(y|i) This is Kowalsky.
1155
1156wi(fe|ves) These are my wives.
1157Language Defined by RegEx
1158
1159Unicode range
1160Greediness
1161
1162L(x*y | z*) = {y, xy, xxy, z, zz, zzz, ...}
1163
1164No match!!
1165
1166x
1167
1168z
1169
1170moja$ moja, ty, mordo, moja
1171
1172^moja mordo, ty moja
1173
1174^moja moja krew
1175
1176([Dd]og|[Cc]at){1,3}
1177
1178{n,m} - between n and m repetitions
1179
1180(?<!a)b (negative lookbehind)
1181
1182doesn't match the b in cab,
1183
1184but matches the b in cob or debt.
1185
1186User3: My boyfriend says I'm depressed.
1187
1188ELIZA3: I AM SORRY TO HEAR YOU ARE DEPRESSED
1189
1190^(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$
1191
The automaton has some special states - accepting (final) states.
1193'?' Quantifier
1194
1195^[-+]?[0-9]*\.?[0-9]+$
1196
1197line = "Cats are smarter than dogs";
1198
1199found = re.search( r'dogs', line)
1200
1201if found:
1202
1203print found.group()
1204
1205else:
1206
1207print "No match!!"
1208
1209kot|pies Uważaj na pieszych!
1210
1211Kowalski|a Przyszła Kowalska.
1212
1213Kowalsk(i|a) Przyszła Kowalska.
1214
1215fizy(k|cy) Przyszli fizycy.
1216
1217Cats are smarter
1218
1219Cats
1220
1221smarter
1222Reference to a group
1223
1224amu\.edu\.pl jassem@amu.edu.pl
1225
1226\[\[.*?\]\]
1227
1228\[\[[^\]]*\]\]
1229
1230[[Poland]] scored two goals against [[Germany]].
1231
1232<h1>My First Heading</h1>
1233
1234<.+>
1235
1236\[\[.*?\]\]
1237
1238[[Poland]] scored two goals against [[Germany]].
1239
1240(.*)@amu\.edu\.pl \1
1241
1242jassem@amu.edu.pl jassem
1243
1244multiline
1245
1246(?m)^\w+
1247
1248first
1249
1250second
1251
1252http://www.multiwingspan.co.uk/a23.php?page=fsm
1253
1254\[\[.*\]\]
1255
1256[[Poland]] scored two goals against [[Germany]].
1257
1258?= looks one group ahead
1259
1260(for positive confirmation)
1261
1262str = 'Here comes Barack Obama'
1263
1264found = re.search(r'\w+(?= Obama)',str)
1265
1266found.group()
1267
1268Barack
1269
1270([\'\"])[^\1]*\1
1271
1272matches single or double quoted strings
1273
1274<h1>My First Heading</h1>
1275
1276<.+?>
1277
1278global
1279
1280(?g)this
1281
1282this and again this
1283
1284a ala ma kota.
1285
1286kot kto ma kota?
1287
1288a Ala ma kota.
1289
1290kot piękne koty
1291
1292x?y
1293
1294xw
1295
1296(\w+) \1
1297
1298matches word repetition
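For example (illustrative):

import re

print(re.search(r'(\w+) \1', 'it is is a typo').group())   # 'is is'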
1299
1300?! looks one group ahead
1301
1302(for negative confirmation)
1303
1304s = 'My name is John Depp. His name is John Kennedy.'
1305
1306mo = re.search( r'John(?! Depp)', s)
1307
print(mo.group())   # the second 'John', i.e. the one not followed by ' Depp'

John
1311
1312ungreedy
1313
1314(?U)a+
1315
1316aaaaa
1317
1318<h1>My First Heading</h1>
1319
1320<.+?>
1321
1322<[^>]+>
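A small comparison of the greedy, lazy and negated-class variants on the heading above (illustrative):

import re

html = '<h1>My First Heading</h1>'

print(re.findall(r'<.+>', html))     # ['<h1>My First Heading</h1>']  (greedy)
print(re.findall(r'<.+?>', html))    # ['<h1>', '</h1>']              (lazy)
print(re.findall(r'<[^>]+>', html))  # ['<h1>', '</h1>']              (negated class)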
1323
1324Basic Clause
1325
1326(empty string) is a regular expression;
1327
1328any character a is a regular expression.
1329
1330case insensitive
1331
1332(?i)A
1333
1334a
1335
1336Based on: https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf
1337
1338s/(.*) me\./WHY DO YOU THINK \1 YOU/
1339
1340s/.* always .*/CAN YOU GIVE ME A SPECIFIC EXAMPLE/
1341
1342s/.* I’m (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
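A minimal Python rendering of these ELIZA-style substitutions (a sketch, not the original ELIZA code; the more specific rules are tried first):

import re

rules = [
    (r'.* always .*', 'CAN YOU GIVE ME A SPECIFIC EXAMPLE'),
    (r".* I'm (depressed|sad) .*", r'I AM SORRY TO HEAR YOU ARE \1'),
    (r'(.*) me\.', r'WHY DO YOU THINK \1 YOU'),
]

def eliza(utterance):
    for pattern, answer in rules:
        if re.search(pattern, utterance):
            return re.sub(pattern, answer, utterance).upper()
    return 'PLEASE GO ON'

print(eliza('Nobody loves me.'))                  # WHY DO YOU THINK NOBODY LOVES YOU
print(eliza("They're always making fun of me."))  # CAN YOU GIVE ME A SPECIFIC EXAMPLE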
1343
1344Copy and Move
Test Question 1
1346Introduction to Linux
1347
1348cut -f 1-2,4-5 data.txt
1349
1350one two four five
1351
1352alpha beta delta epsilon
1353
1354# change delimiter for ';'
1355
1356cut -f 2-4 -d ';' numbers
1357
1358two;three;four
1359
13602;3;4
1361Word Count
1362
1363Examples from: https://www.computerhope.com/unix/udiff.htm
1364Listing a Directory
1365
1366To append the sorted content of the current directory to the file named 'directories':
1367
1368A) ls -l | sort >> directories
1369
1370B) sort < ls -l >> directories
1371
1372C) ls -l | sort > directories
1373
1374D) ls -l < sort > directories
1375
1376To change something which matches a regular expression you can use sed 's/.../.../'(stream editor)
1377
1378# change first word 'junk' into 'garbage'
1379
1380sed 's/junk/garbage/' my-text-file.txt
1381
1382# change each word 'junk' into 'garbage'
1383
1384sed 's/junk/garbage/g' my-text-file.txt
1385
1386Correct answer: A
1387What is a script?
1388
1389To relocate a file 'myfile' from the subdirectory 'subdir' into the current directory use:
1390
1391A) relocate subdir/myfile myfile
1392
1393B) relocate ./subdir/myfile .
1394
1395C) mv ./subdir/myfile .
1396
1397D) mv /subdir/myfile /
1398
1399#!/bin/bash
1400
1401tr -sc [:alpha:] '\n' < document > words
1402
1403sort < words | uniq -c > frequency_list
1404
1405# sort and remove duplicates
1406
1407sort -u names
1408
1409Filip Gralinski
1410
1411Krzysztof Jassem
1412
1413Rafal Jaworski
1414
1415# sort in reversed order and remove duplicates
1416
1417sort -ru names
1418
1419Rafal Jaworski
1420
1421Krzysztof Jassem
1422
1423Filip Gralinski
1424
1425Replacing regular expressions: sed
1426diff (append lines)
1427
1428# tr - s squeezes multiple characters into one
1429
1430echo "Too many spaces " | tr -s ' '
1431
1432Too many spaces
1433Getting Information on Text Files
1434Processing Text Files
1435
1436Correct answer: B
1437
sort -ru removes the duplicated line - the number of lines is now equal to '6'
1439
1440Correct answer: D
1441
1442A) 1c1
1443
1444< Krzysztof Jassem
1445
1446---
1447
1448> Krzysztof Jasem
1449
14503a4
1451
1452> Pawel Skorzewski
1453Viewing / Editing a File
1454
1455 Directories and files are case-sensitive
1456
1457 Examples of directory paths:
1458
1459. current directory
1460
1461.. parent directory
1462
1463~ home directory
1464
1465cut -f 3 data.txt
1466
1467three
1468
1469gamma
1470
1471# Do not display file names in results
1472
1473$ grep -h apple grocery.list grocery.list2
1474
1475apples
1476
1477dry apples
1478
1479cut -c -4 numbers
1480
1481will display:
1482
1483A) one;
1484
14851;2;
1486
1487B) five
1488
1489;4;5
1490
1491C) one;two;three;four
1492
14931;2;3;4
1494
1495D) four
1496
14974
1498
1499# Sort contents of a file (line by line)
1500
1501sort < personal_info
1502
1503# Displays this on screen:
1504
1505My e-mail is jassem@amu.edu.pl
1506
1507My name is Krzysztof Jassem
1508
In order to back up all the files in the current directory to another directory (named 'copy'):
1510
1511A) mkdir copy
1512
1513cp all ./copy
1514
1515B) mkdir copy
1516
1517ls * ./copy
1518
1519C) ls copy
1520
1521cp * ./copy
1522
1523D) mkdir copy
1524
1525cp * ./copy
1526delimiter other than tab
1527
1528diff file1.txt file2.txt
1529cut characters
1530
1531# cut characters
1532
1533# from 3-12 including tabs
1534
1535cut -c 3-12 data.txt
1536
1537e two thre
1538
1539pha beta g
1540Searching / Replacing
1541
On Windows you may try PowerShell ISE or install the beta version of Ubuntu Bash on Windows 10.
1543How to get help
1544
15454d3
1546
1547< I promise.
1548Sort by key
1549
1550# tr -d deletes character from a string
1551
1552echo 'ed, of course!' |tr -d "aeiou"
1553
1554d, f crs!
1555
1556cut cuts out selected sections of each line, removing the rest.
1557
1558By default they show 10 lines of the file but you can define a different number by using -n switch:
1559
1560head -n2 grocery.list
1561
1562apples
1563
1564bananas
1565
1566tail -n2 grocery.list
1567
1568plums
1569
1570carrots
1571Test Question 2
1572
1573#Search for 'apple' in both test files
1574
1575$ grep apple grocery.list grocery.list2
1576
1577grocery.list:apples
1578
1579grocery.list2:dry apples
1580
1581You can count characters, words and lines in the text file by using wc command:
1582
1583You can count only lines (words, characters) by using -l, -w or -c options:
1584
1585wc grocery.list
1586
15874 4 29 grocery.list
1588
1589wc -l grocery.list
1590
15914 grocery.list
1592
1593wc -w grocery.list
1594
15954 grocery.list
1596
1597wc -c grocery.list
1598
159929 grocery.list
1600
1601This script makes the frequency list of words in a document.
1602
1603#!/bin/bash
1604
1605# replace all non-alpha characters with new line; # option -c complements the choice (all but...)
1606
1607tr -sc [:alpha:] '\n' < document > words
1608
1609# sort, delete duplicate and count
1610
1611# unix filters out duplicate lines
1612
1613# unix -c prefixes lines with a number
1614
1615# representing how many times they occurred.
1616
1617sort < words | uniq -c > frequency_list
1618
1619# Sort lines typed from keyboard
1620
1621sort
1622
1623jaworski
1624
1625graliński
1626
1627jassem
1628
1629Ctrl-d
1630
1631# This will be displayed as output:
1632
1633graliński
1634
1635jassem
1636
1637jaworski
1638
1639cat > names
1640
1641Krzysztof Jassem
1642
1643Filip Gralinski
1644
1645Rafal Jaworski
1646
1647Krzysztof Jassem
1648
1649ctrl-d
1650
1651The search can be negated by using -v switch:
1652
1653# show lines without the word 'junk'
1654
1655egrep -v 'junk' my-text-file.txt
1656
1657Pipes redirect the output of a command to the input of another command.
1658Operations on directories
1659Diff
1660
1661To edit a file you can use e.g. nano, pico or vim:
1662
1663nano grocery.list
1664
1665pico grocery.list
1666
1667https://ryanstutorials.net/linuxtutorial/piping.php
1668
1669cut -f 2-4 data.txt
1670
1671two three four
1672
1673beta gamma delta
1674
1675Script is a text file that contains a series of commands.
1676diff (delete lines)
1677
1678#change lines from 2 to 4
1679
16802,4c2,4
1681
1682< I need to run the laundry.
1683
1684< I need to wash the dog.
1685
1686< I need to get the car detailed.
1687
1688---
1689
1690> I need to do the laundry.
1691
1692> I need to wash the car.
1693
1694> I need to get the dog detailed.
1695Searching
1696
1697cut -f 3- data.txt
1698
1699three four five
1700
1701gamma delta epsilon
1702Pipes
1703Sorting
1704Text Processing Tools in Linux
1705
1706Examples frm https://www.ibm.com/developerworks/aix/library/au-unixtext/
1707
1708To see what is inside current directory use
1709
1710ls command:
1711
1712# Lists files in current directory
1713
1714ls
1715
1716# Can be used with path as an argument
1717
1718ls my-directory
1719
1720# ls has many options, e.g.: recursive listing...
1721
1722ls -R my-directory
1723
1724# ...or listing in reversed alphabetical order
1725
1726ls -r my-directory
1727
1728To show a file content in the terminal use cat:
1729
1730cat name-of-the-file.txt
1731
1732You may also create a file using cat
1733
1734$ cat > grocery.list
1735
1736apples
1737
1738bananas
1739
1740plums
1741
1742<ctrl-d>
1743
1744You may also append text to a file using cat
1745
1746$ cat >> grocery.list
1747
1748carrots
1749
1750<ctrl-d>
1751Test Question 3
1752
1753#content of 'data.txt':
1754
1755one two three four five
1756
1757alpha beta gamma delta epsilon
1758cut
1759
1760cat > names1
1761
1762Krzysztof Jassem
1763
1764Rafal Jaworski
1765
1766Filip Gralinski
1767
1768#Create a new test file
1769
1770$cat grocery.list2
1771
1772Apple Sauce
1773
1774wild rice
1775
1776black beans
1777
1778kidney beans
1779
1780dry apples
1781
17822a3
1783
1784> Oh yeah, I also need to buy grated cheese.
1785
1786Correct answer: D
1787
1788Correct answer: A
1789
1790Change first line
1791
1792Append line 4 of names2 after line 3 of names 1
1793
1794# sort by second key (column)
1795
1796sort -k2 names
1797
1798Filip Gralinski
1799
1800Krzysztof Jassem
1801
1802Krzysztof Jassem
1803
1804Rafal Jaworski
1805
1806Example of a Script
1807Redirecting Input
1808Directories and files
1809
1810#content of 'data.txt':
1811
1812one two three four five
1813
1814alpha beta gamma delta epsilon
1815How to open a terminal?
1816Test Question 8
1817cut fields
1818
1819Correct answer: B
1820Basic Linux Commands
1821
1822#create a new test file
1823
1824cat > business_card
1825
1826Krzysztof Jassem
1827
1828jassem@amu.edu.pl
1829Viewing Part of a File
1830
1831You can sort files line by line using sort
1832
To display the first 10 characters of 'myfile1' and the first 10 characters of 'myfile2':
1834
1835A) head -n10 myfile1 | myfile2
1836
1837B) head -c10 myfile1 | myfile2
1838
1839C) head -n10 myfile1 myfile2
1840
1841D) head -c10 myfile1 myfile2
1842
1843#Search for 'apple' ignoring case
1844
1845$ grep -i apple grocery.list grocery.list2
1846
1847grocery.list:apples
1848
1849grocery.list2:Apple Sauce
1850
1851grocery.list2:dry apples
1852
1853#replace small letters with capital ones
1854
1855echo "What is the standard text editor?" |tr [:lower:] [:upper:]
1856
1857WHAT IS THE STANDARD TEXT EDITOR?
1858
1859#content of 'numbers':
1860
1861one;two;three;four;five
1862
18631;2;3;4;5
1864Test Question 6
1865Test Question 4
1866Example - Frequency list
1867
1868 Commands are case sensitive
1869
1870 TAB completion
1871
When you press the TAB key, the system will try to determine what you want to write based on what you have already typed.
1875
1876 Commands history
1877
1878Commands are stored in the history.
1879
1880Press UP or DOWN key to navigate the history.
1881
1882 You can scroll the screen by using
1883
1884SHIFT + PageUp or SHIFT + PageDown
1885
1886diff file1 file2
1887Replacing
1888
1889# Remove three files in the current directory
1890
1891rm file1 file2 file3
1892
1893# Remove all files from the current directory
1894
1895rm *
1896
1897# List the current directory and the parent directory
1898
1899ls . ..
1900
1901Diff displays differences between two files (line by line).
1902
It gives you instructions on how to change the first file to make it match the second file.
1904Test Question 9
1905
1906Correct answer: A
1907Processing Multiple Files
1908grep
1909
1910Correct answer: A
1911
A script should have the .sh extension, e.g.
1913
1914cat > delEL.sh
1915
1916#first line of a script should be:
1917
1918#!/bin/bash
1919
1920#second line command: filter out empty lines
1921
cat business_card | egrep "." > business_card.nonempty
1925
1926#third line command: run 'wc' on the new file
1927
1928wc business_card.nonempty
1929
1930Uniq sorting
1931
1932 On Ubuntu you can press: CTRL + ALT + T
1933
1934 Or you can switch from the GUI to the terminal by pressing ALT + F1 (or F2, F3...)
1935 You can go back to the GUI by pressing ALT + F7
1936
1937Test Question 5
1938
1939#content of 'numbers':
1940
1941one;two;three;four;five
1942
19431;2;3;4;5
1944Redirecting Input / Output
1945
You can see the beginning/end of a file by using head or tail.
1947
1948A) Copies /test/example/myfile.txt to newfile.txt
1949
1950B) Replaces every occurrence of 'test' in myfile.txt with 'example'; the result is saved in newfile.txt
1951
1952C) Replaces every occurrence of 'test' in myfile.txt with 'example'; the result is saved in myfile.txt
1953
1954D) Moves /test/example/myfile.txt to newfile.txt
1955Creating / Removing Directories
1956
1957Correct answer: C
1958Scripts
1959
1960To replace strings you can use tr.
1961Redirecting Output
1962
1963Suppose that the file 'info' has one non-empty duplicate line and
1964
1965wc info returns
1966
19677 14 113
1968
1969A possible result of:
1970
1971sort -ru info | wc
1972
1973A) 8 14 113
1974
1975B) 6 12 98
1976
1977C) 7 13 113
1978
1979D) 7 14 98
1980
1981# Sort and save in another file:
1982
1983sort < personal_info > personal_info.sorted
1984
1985To create a directory use mkdir:
1986
1987# Create directory named 'my-directory'
1988
1989mkdir my-directory
1990
1991# Can be used on longer paths with -p switch
1992
1993mkdir -p my-directory/work/junk
1994
1995To remove directory use rm command
1996
1997# Remove 'my-directory'
1998
1999rm my-directory
2000
2001# Can be used for files as well:
2002
2003rm file-to-remove
2004
2005# Append echo to file
2006
2007echo "My e-mail is jassem@amu.edu.pl" >>
2008
2009personal_info
2010
2011diff file1 file2
2012
2013The same operation may be achieved by one command:
2014
2015A) tr -sc [:alpha:] '\n' < document | sort | uniq -c > frequency_list
2016
2017B) tr -sc [:alpha:] '\n' < document > words | sort | uniq -c > frequency_list
2018
2019sed 's/test/example/g' myfile.txt > newfile.txt
2020diff (change lines)
2021tr(anslate)
2022
2023To search in a file you can use grep.
2024
2025grep returns lines that contain searched strings.
2026Change Directory
2027
2028# Echo to screen
2029
2030echo "My name is Krzysztof Jassem"
2031
2032# Echo to file
2033
2034echo "My name is Krzysztof Jassem" >
2035
2036personal_info
2037
2038# search and count found lines
2039
2040egrep -c '^begin|end$' myfile.txt
2041Basic Linux Facts
2042
2043#After creation of the script make it an executable file:
2044
2045chmod +x delEL.sh
2046
2047# To run a script, type its name preceded with ./:
2048
2049./delEL.sh
2050
20512 3 35 business_card.nonempty
2052
2053FINISH
2054
2055Use egrep if you want to search using regular expressions:
2056
2057# show lines with verb 'go' in one of its forms
2058
2059egrep 'go|went|gone|goes' my-text-file.txt
2060
# show lines beginning with an upper case letter
2062
2063egrep '^[A-Z]' my-text-file.txt
2064
2065# show lines ending with a dot
2066
2067egrep '\.$' my-text-file.txt
2068
2069 Linux manual: man and the name of the command, e.g.
2070
2071man ls
2072
2073# !!! Exit by pressing q
2074
2075 Linux info pages type: info and the name of the command:
2076
2077info ls
2078
2079# !!! Exit by pressing q
2080
2081 Try --help or -help parameters, e.g.
2082
2083ls --help
2084egrep
2085Test Question 7
2086
2087To display the number of lines that start with the letter "K" in the file 'names':
2088
2089A) egrep -n '^K' names
2090
2091B) egrep -c '^K' names
2092
2093C) egrep -n '$K' names
2094
2095D) egrep -c '$K' names
2096
2097#content of 'data.txt':
2098
2099one two three four five
2100
2101alpha beta gamma delta epsilon
2102
2103# Move to parent directory
2104
2105cd ..
2106
2107# Move two directories back
2108
2109cd ../..
2110
2111# Move to 'my-directory'
2112
2113cd my-directory
2114
2115# Move to the home directory
2116
2117cd ~
Test Question 10
2119
2120diff names1 names2 will return:
2121Reversed sorting
2122
2123cat > names2
2124
2125Krzysztof Jasem
2126
2127Rafal Jaworski
2128
2129Filip Gralinski
2130
2131Pawel Skorzewski
2132
2133You can display the first or last characters using -c switch
2134
2135head -c12 grocery.list
2136
2137# '\n' is counted as 1 char!
2138
2139apples
2140
2141banan
2142
2143tail -c12 grocery.list
2144
# the final newline is counted as one char!
2146
2147ums
2148
2149carrots
2150
2151B) 1a1
2152
2153< Krzysztof Jassem
2154
2155---
2156
2157> Krzysztof Jasem
2158
21593a3
2160
2161> Pawel Skorzewski
2162
2163Correct answer: B
2164
2165# file1:
2166
2167I need to go to the store.
2168
2169I need to buy some apples.
2170
2171When I get home, I'll wash the dog.
2172
2173I promise.
2174
2175# file2:
2176
2177I need to go to the store.
2178
2179I need to buy some apples.
2180
2181When I get home, I'll wash the dog.
2182
2183# copy file1 to file1.copy
2184
2185cp file1 file1.copy
2186
2187# to copy directories use -r switch
2188
2189cp -r my-directory my-directory-copy
2190
2191Use cp to copy files or directories
2192
2193# move file to the parent directory
2194
2195mv file1 ..
2196
2197# change the name of the file
2198
2199mv file1 file1-with-new-name
2200
2201Use mv to move or rename (files or directories):
2202
2203# file1
2204
2205I need to buy apples.
2206
2207I need to run the laundry.
2208
2209I need to wash the dog.
2210
2211I need to get the car detailed.
2212
2213# file2
2214
2215I need to buy apples.
2216
2217I need to do the laundry.
2218
2219I need to wash the car.
2220
2221I need to get the dog detailed.
2222
2223# file1.txt:
2224
2225I need to go to the store.
2226
2227I need to buy some apples.
2228
2229When I get home, I'll wash the dog.
2230
2231# file2.txt:
2232
2233I need to go to the store.
2234
2235I need to buy some apples.
2236
2237Oh yeah, I also need to buy grated cheese.
2238
2239When I get home, I'll wash the dog.
2240
2241(1) The U.
2242
2243(2) K.
2244
2245(3) Prime Minister, Mrs.
2246
2247(4) Theresa May, was seen out with her family today.
2248
2249Segmentation Rules eXchange (SRX)
2250
2251we
2252
2253First sentence.|| Second sentence
2254
2255doesnt
2256
2257doesn't
2258
2259doesn' t
2260
2261doesn t
2262
2263does nt
2264English Rules
2265
2266W roku '17
2267
2268^(19|20)\d\d[- \.](0[1-9]|1[012])[- \.](0[1-9]|[12][0-9]|3[01])$
2269Tokenization by PSI-Toolkit
2270Test Question 3
2271
2272jblack@domain.com
2273
2274www.prezi.com
2275
2276142.32.48.231
2277
2278An optimal tokenizer would split the text:
2279
2280Rzeka św. Wawrzyńca
2281
2282into:
2283
22841. 1 token
2285
22862. 2 tokens
2287
22883. 3 tokens
2289
22904. 4 tokens
2291
2292<languagerules> - segmentation rules for a set of languages
2293
2294<languagerule> rules for one language
2295
2296<rule> one rule of segmentation
2297
2298<maprules> specifies what rules should be used for what language
2299
23009.
2301
2302Zadzwoń do inż. Kowalskiego
2303
2304opowiada
2305
2306Task Segmentation 1.
2307
Implement your own SRX tool. The tool should allow the user to create their own SRX rules and test them against selected text documents.
2309
2310(1) The U.K.
2311
2312(2) Prime Minister, Mrs.
2313
2314(3) Theresa May, was seen out with her family today.
23151. Get training data
2316
2317baran
2318
2319https://github.com/filipg/psi-toolkit/tree/master/tools/segmenters/srx/data
2320Solutions
2321
2322Question: What to execute first: segmentation or tokenization?
2323
2324Let's try it with PSI-Toolkit!
2325
2326baran kawał opowiadał ani
2327Test Question 4
2328
2329<languagerule languagerulename="English">
2330
2331<!-- Some English abbreviations -->
2332
2333<rule break="no">
2334
2335<beforebreak>\sMrs\.</beforebreak>
2336
2337<afterbreak>\s</afterbreak>
2338
2339</rule>
2340
2341<rule break="no">
2342
2343<beforebreak>\sU\.K\.</beforebreak>
2344
2345<afterbreak>\s</afterbreak>
2346
2347</rule>
2348
2349</languagerule>
2350
2351^[0-9]{2}-[0-9]{2}$
2352
23531Z9999W9984539998
2354
2355<languagerule languagerulename="Default">
2356
2357<!-- Common rules for most languages -->
2358
2359<rule break="no">
2360
2361<beforebreak>^\s*[0-9]+\.</beforebreak>
2362
2363<afterbreak>\s</afterbreak>
2364
2365</rule>
2366
2367<rule break="yes">
2368
2369<afterbreak>\n</afterbreak>
2370
2371</rule>
2372
2373<rule break="yes">
2374
2375<beforebreak>[\.\?!]+</beforebreak>
2376
2377<afterbreak>\s</afterbreak>
2378
2379</rule>
2380
2381</languagerule>
2382
2383Suggested answer:
2384
2385co-education: 1 token
2386
2387Hewlett-Packard: 2 tokens (very unclear)
2388
2389drag-and-drop: 1 token
2390
2391drag-him-away: 3 tokens
2392
2393we
2394
2395http://text-processing.com/demo/tokenize/
2396
2397IN:
2398
2399Polish: Ala ma kota. Kot ma mysz. Mysz ma serek. Ser zostaje sam.
2400
2401English: Alice has the cat. The cat has the mouse, the mouse has cheese. Cheese remains alone.
2402
2403Result of automatic matching:
2404
2405Ala ma kota <-> Alice has the cat.
2406
2407Kot ma mysz <-> The cat has the mouse, the mouse has cheese.
2408
2409Mysz ma serek. <-> Cheese remains alone.
2410
2411Ser zostaje sam <-> ' '
2412
2413User should be able to easily adjust the matching to:
2414
2415Ala ma kota <-> Alice has the cat.
2416
2417Kot ma mysz. Mysz ma serek. <-> The cat has the mouse, the mouse has cheese.
2418
2419Ser zostaje sam <-> Cheese remains alone.
2420
2421distance
2422Test Question 1
2423It may not...
2424Rule-based approach
2425
2426some
2427
2428hocus-pocus
24292. Generate unpunctuated text
2430
2431Question: How to handle such errors?
2432
2433sentence
2434Test Question 2
2435
2436B-52
2437
2438Tokenization is the task of chopping a character sequence up into pieces, called tokens, optionally removing certain characters, such as punctuation marks.
2439
2440baranka
2441
2442l y
2443Apostrophe in English
2444
2445Please, call my lawyer -- Richard Smith -- on Tuesday.
2446
2447<maprules>
2448
2449<!-- List exceptions first -->
2450
2451<languagemap
2452
2453languagepattern="[Ee][Nn].*"
2454
2455languagerulename="English"/>
2456
2457<languagemap
2458
2459languagepattern="[Pp][Ll].*"
2460
2461languagerulename="Polish"/>
2462
2463<!-- Common breaking rules -->
2464
2465<languagemap
2466
2467languagepattern=".*"
2468
2469languagerulename="Default"/>
2470
2471</maprules>
2472
2473The above rule combined with the default rule would split:
2474
2475Polska weszła do Unii w 2004 r. A nie w r. 2005.
2476
2477into:
2478
24791. Polska weszła do Unii w 2004 r.
2480
2481A nie w r. 2005.
2482
24832. Polska weszła do Unii w 2004 r.
2484
2485A nie w r.
2486
24872005.
2488
24893. Polska weszła do Unii w 2004 r. A nie w r. 2005.
2490
24914. Polska weszła do Unii w 2004 r. A nie w r.
2492
24932005.
2494
2495Let us check an online tokenizer:
2496
2497http://text-processing.com/demo/tokenize/
2498
2499Question:
2500
2501How to distinguish dots in the Internet addresses from "normal dots"?
2502
2503Longer dash (ALT 0151, 'Ctrl Alt -')
2504
250510zł
2506
25072015r.
2508
2509ahead
2510
2511ash
2512
2513Question: How to match white characters?
2514
2515Segmentation problem
2516Dot + space
2517ends sentence?
2518
2519OUT (not remove):
2520
2521Eksplozja
2522
2523wielorybów
2524
2525nastąpiła
2526
2527w
2528
2529[[
2530
2531Tajwanie
2532
2533]]
2534
2535,
2536
2537a
2538
2539nie
2540
2541w
2542
2543[[
2544
2545Polsce
2546
2547]]
2548
2549.
2550Internal Binding
2551Machine Learning Approach
2552Default rules
2553
2554"Shorter dash" (ALT 0150; 'Ctrl -')
2555What is tokenization?
2556
2557Answer:
2558
2559<([a-z][a-z0-9]*)\b[^>]*>(.*?)</\1>
2560
25611. Tokenization is language-dependent
2562
2563\bhttp\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?\b
2564
2565Example for Lab Task Segmentation 2
2566Test Question 6
2567(1) The U.K. Prime Minister,
Mrs. Theresa May, was seen out with her family today.
2569Tokenization of European languages
2570
2571English Language
2572
2573http://morphadorner.northwestern.edu/morphadorner/sentencesplitter/example/
2574
2575The U.K. Prime Minister, Mrs. Theresa May,
2576
2577was seen out with her family today.
2578
2579Take them as tokens in the text under tokenization.
2580
2581Question: What will be matched by Regex
2582
2583<set\b[^>]*>(.*?)</set>
2584Polish Rules
2585Date formats
2586
2587How to tokenize a text without separators, e.g. picture writing?
2588
2589(1) The U.K. Prime Minister, Mrs.
2590
2591(2) Theresa May, was seen out with her family today.
2592Dot + space except for U.K.
2593ends sentence
2594Reverse Match Max
2595
2596boys
2597
2598boys'
2599
2600boys '
2601
2602canon
2603Initial tokenization
2604Other useful RegEx expressions
2605
2606DeepCorrection 1: Sentence Segmentation of unpunctuated text.
2607
2608by Praneeth Bedapudi
2609
2610words existing only together
2611Chinese
2612Dash
2613Word Boundary
2614
2615Let us check: I like cats, dogs, etc.
2616
2617http://text-processing.com/demo/tokenize/
2618
<!-- abbreviations before people's surnames - do not split the sentence before either a lowercase or an uppercase letter -->
2620
2621<rule break="no">
2622
2623<beforebreak>\b([aA]mb|[aA]dm|[A]sp…)\.</beforebreak>
2624
2625<afterbreak>\s</afterbreak>
2626
2627</rule>
2628
<!-- abbreviations of weekday names - do not split -->
2630
2631<rule break="no">
2632
<beforebreak>\b([pP]on|[wW]t|[śŚ]r|...)\.</beforebreak>
2634
2635<afterbreak>\s</afterbreak>
2636
2637</rule>
2638
<!-- other abbreviations after which the sentence should not be split -->
2640
2641<rule break="no">
2642
2643<beforebreak>\b([aA]dr|[aA]l|[aA]rk|[cC]ieśń|…)\.</beforebreak>
2644
2645<afterbreak>\s</afterbreak>
2646
2647</rule>
2648
2649When an abbreviation is followed by a dot:
2650
26511) The dot always ends the sentence
2652
26532) The dot never ends the sentence
2654
26553) The dot ends the sentence only if it is followed by another dot
2656
26574) The dot may or may not end the sentence.
2658
2659Used to make a break in a sentence.
2660Hyphen
2661Problem
2662
2663Orange'owy
2664
2665Bridge'owy
2666
2667The above SRX combined with the default rule would split:
2668
2669The U.K. Prime Minister, Mrs. Theresa May, was seen out with her family today.
2670
2671into:
2672
26731. The U.
2674
2675K.
2676
2677Prime Minister, Mrs.
2678
Theresa May, was seen out with her family today. (4 sentences)
2680
26812. The U.K.
2682
2683Prime Minister, Mrs.
2684
Theresa May, was seen out with her family today. (3)
2686
26873. The U.K.
2688
Prime Minister, Mrs. Theresa May, was seen out with her family today. (2)
2690
4. The U.K. Prime Minister, Mrs. Theresa May, was seen out with her family today. (1)
2692
2693Yes. Close to 1 mln sentences
2694
2695someverynicesentence
2696Map Rules
2697
2698wecanonlyseeashortdistanceahead
2699
2700\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b
2701Use of Tokenization
2702
2703Question: How to extend this pattern to all possible tags (Python-like regex)?
2704
27055m
2706SRX rules in PSI-Toolkit
2707
27083. The most common method used for tokenization is Regular Expressions
2709Hyphen vs Dash
2710En Dash
2711
27129. Zadzwoń do inż.
2713
2714Kowalskiego
2715
2716Answer:
2717
2718e-mail address
2719
2720IP address
2721
2722URL
2723Lab Tasks
2724
2725Answer: Tags and everything between them
2726Example (Polish)
2727Word Splitting
2728
2729A tokenizer should rather treat the dash as a separate token.
2730
2731Mr. O'Connor doesn't think that the boys' stories about the cowboy's hat are amusing.
2732
2733i
2734Lab Tasks
2735
2736cowboys
2737
2738cowboy's
2739
2740cowboy 's
2741
2742cowboy ' s
2743
2744boy s
2745
2746Abbreviations
2747
27489. Zadzwoń do inż. Kowalskiego
2749
2750canon
2751
2752y
2753
English abbreviations: dept., min., Mr., Mrs., no, No.
2755Emoticons
2756
2757n
2758
2759Answer 1: If an expression with a dot does not match any regex, then it is a potential error.
2760MaxMatch Algorithm
2761
2762It is believed that tokenization is an initial step in any NLP task.
2763Test Question 8
2764Tokenization
2765
2766My sister loves chocolate -- my brother loves vanilla.
2767
2768OUT (remove)
2769
2770Eksplozja
2771
2772wielorybów
2773
2774nastąpiła
2775
2776w
2777
2778Tajwanie
2779
2780a
2781
2782nie
2783
2784w
2785
2786Polsce
2787Handling Errors
2788
2789Polish abbreviations: inż., np., itd., itp., dr, mgr
2790
2791McDonald's
2792
2793Rockn'roll
2794
2795Answer: Certainly, they are.
2796
2797O'Connor doesn't boys' cowboy's
2798
2799C++
2800
2801C#
2802
2803Which of the following sentences contains a dash (not a hyphen)?
2804
28051. To jakieś czary - mary.
2806
28072. Polska flaga jest biało - czerwona.
2808
28093. Kupiłem e - bilet.
2810
28114. Kupiłem 2 bilety, a Julek - 3.
2812
2813Theoretically: (1)I (2)like (3)cats (4), (5)dogs 6(,) (7)etc. 8(.)
2814
2815Proceed from the end of the string rather than from the beginning.
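A minimal sketch of the forward MaxMatch algorithm (the tiny dictionary is illustrative); the reverse variant simply starts matching from the end of the string:

def max_match(text, dictionary):
    # greedily take the longest dictionary word that starts at the current position;
    # fall back to a single character when nothing matches
    tokens = []
    while text:
        for length in range(len(text), 0, -1):
            if text[:length] in dictionary or length == 1:
                tokens.append(text[:length])
                text = text[length:]
                break
    return tokens

words = {"we", "can", "canon", "only", "see", "a", "short", "distance", "ahead"}
print(max_match("wecanonlyseeashortdistanceahead", words))
# ['we', 'canon', 'l', 'y', 'see', 'a', 'short', 'distance', 'ahead'] - greedy matching goes wrong here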
2816
2817(1) @def litera [a-zA-Z]
2818
2819(2) @def litCyfSpMysl [a-zA-Z0-9_-]
2820
2821(3) @def litSpMysl [a-zA-Z_-]
2822
2823(4) @rule blank/B := [\ \n\r\t\f\p{Z}]+
2824
2825(5) @rule punct/I := [!\?]+|[\.,;:\p{P}]
2826
2827(6) @rule addr/X := (litera+:\/\/)? litSpMysl+ (\.(?:litCyfSpMysl)+)+ (\?[^\ ])?
2828
2829Abbreviations are shortened forms of words or lengthy phrases.
2830
2831(0[1-9]|[12][0-9]|3[01])\.(0[1-9]|1[012])\.(([0-9]{2})|([0-9]{4}))
2832Tokenization - Conclusions
2833
2834jassem@amu.edu.pl
2835
2836A dash is longer than a hyphen and is used to indicate a range or a pause.
2837
28389.
2839
2840Zadzwoń do inż.
2841
2842Kowalskiego
2843
2844see
2845
28461. To match non-word tokens
2847
28482. To get meaning of words
2849
28503. To split text into sentences
2851
28524. To match abbreviations
2853Dots in the mail addresses
2854
2855ort
2856Statistical Algorithms
2857What is Segmentation?
2858
2859Answer: \2
2860
2861Hyphen links words.
2862
2863Dash separates words.
2864
2865Dash and hyphen are different characters. But...
2866It may work...
2867
28681. Prepare data for training:
2869
2870 1 mln sentences - training set
2871 100 000 sentences - validation set
2872
28732. Use of Machine Learning algorithms
2874Dot
2875
2876Answer:
2877
2878By means of regular expressions.
2879
2880opowiadała
2881Solution
2882
2883segment --lang pl ! write-simple --tags segment
2884
2885Zwiedziłem wiele krajów, m.in. Niemcy, Francję, Kanadę. Uwielbiam podróżować!
2886
2887Goal: Find a corpus of texts perfectly divided into sentences
2888
2889l
2890
2891\b[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+\b
2892Hard (Non-breaking) Space
2893
2894distance
2895
2896Example error in text:
2897
2898End of a sentence.Beginning of the sentence
2899
2900How to tokenize emoticons?
2901
2902Question: How to tokenize "I like cats, dogs etc." ?
2903
2904łani
2905
2906Segmentation is the initial splitting of the text into concise fragments, usually sentences.
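A very small illustration of rule-based sentence splitting with abbreviation exceptions, in the spirit of the SRX rules discussed here (the abbreviation list and the regex are only a sketch):

import re

ABBREVIATIONS = {"Mr.", "Mrs.", "U.K.", "inż.", "np."}   # illustrative, far from complete

def split_sentences(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # break after '.', '!' or '?' unless the token is a known abbreviation
        if re.search(r'[.!?]$', token) and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("The U.K. Prime Minister, Mrs. Theresa May, was seen out with her family today."))
# ['The U.K. Prime Minister, Mrs. Theresa May, was seen out with her family today.']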
2907
2908Task Segmentation 2. - group task
2909
2910Implement a half-automatic bilingual sentence aligner.
2911
2912Details, see here below:
2913Segmentation
2914
2915Question: How are tokens separated from each other?
2916
2917czary-mary
2918
2919Answer: Usually tokenization. Sentence splitting assembles the tokenized text into sentences.
2920http://www.gala-global.org/oscarStandards/srx/srx20.html
2921
2922barankawałopowiadałani
2923
2924Answer: Yes, it does.
2925Train the model
2926
2927OConnor
2928
2929O'Connor
2930
2931O' Connor
2932
2933O 'Connor
2934
2935Based on:
2936
2937How to tokenize the following character sequences:
2938
2939co-education
2940
2941Hewlett-Packard
2942
2943drag-and-drop
2944
2945drag-him-away
2946
2947 number range 15-25
2948 time period October11-October 15, 2012
2949 game scores Poland won 2-0
2950
2951A sentence splitter based on the above rule only would result in:
2952
Question: Are the tokenizers language-specific?
2954
2955ahead
2956
2957e-Usługi,
2958
2959e-learning
2960
2961anty-reklama
2962Unusual tokens
2963Sentence Splitting
2964Dot + space except for Mrs. or U.K.
2965ends sentence
2966Test Question 10
2967Does dot always end sentence?
2968Test Question 5
2969Tokenization problems
2970
2971IN:
2972
2973Eksplozja wielorybów nastąpiła w [[Tajwanie]], a nie w [[Polsce]].
2974MaxMatch English
2975
2976Na pewno wydałem w Las Vegas ponad 10 000$.
2977
2978After prefixes:
2979
2980ashort
2981
2982M*A*S*H
2983On-line Splitters
2984
2985Question: What will be removed by regex:
2986
2987s/<(.*?)>//g
2988Space
2989
2990Universal, XML-based language that describes segmentation rules:
2991
2992 language(s) rules are applied to
2993 break positions
2994 unbreak positions
2995
2996baran kawał opowiada łani
2997
2998<languagerule languagerulename="Polish">
2999
3000<!-- Polish dates -->
3001
3002<rule break="no">
3003
3004<beforebreak>\s[r]\.</beforebreak>
3005
<afterbreak>\s[0-9]{4}</afterbreak>
3007
3008</rule>
3009
3010</languagerule>
3011
3012closed-door meetings
3013
3014a book-loving student
3015
3016ten-minute break
3017Example (English)
3018In tokenization, regexes are commonly used:
3019
3020Wikipedia dump + CoreNLP segmentation?
3021SRX elements
3022
3023Answer: Everything between XML tags named set
3024
3025Input text: To jest książka inż. Kowalskiego
3026
is processed by an English and a Polish tokenizer, each of which returns punctuation marks as separate tokens.
3030
3031Output (only for the ending of the sentence):
3032
30331. English: inż . Kowalskiego (3 tokens); Polish: inż . Kowalskiego (3)
3034
30352. English: inż. Kowalskiego (2); Polish: inż. Kowalskiego (2)
3036
30373. English : inż . Kowalskiego (3); Polish: inż. Kowalskiego (2)
3038
30394. English: inż. Kowalskiego (2); Polish: inż . Kowalskiego (3)
3040
What do the regexes below match?
3042Test Question 7
3043
3044very
3045
3046nice
3047
3048see
3049
3050kawał
3051
3052Check all possible divisions of the input text into chunks of length 1 or 2.
3053
3054Answer 2: Machine learning (later today)
3055
3056How to treat two-word geographical names?
3057
3058 Stargard Szczeciński, Środa Wlkp, Nowa Wieś
3059 Las Vegas, Los Angeles
3060 San Francisco-Palo Alto (train table)
3061
3062Dash, Hyphen
3063
3064Polish Language (PSI-Toolkit)
3065
3066psi-toolkit.wmi.amu.edu.pl
3067
3068Już jestem -- rzekł Stach.
3069
3070Mam pięć przyczep, a samochodów -- tylko trzy.
3071
3072Alfabet Morse'a
3073
3074Dziewczyna Rudy'ego
3075
3076EXAMPLE SRX FILE
3077
3078http://www.gala-global.org/oscarStandards/srx/srx20.html#AppSample
3079
3080They are often written in the same way
3081
3082How to tokenize large numbers (e.g. phone numbers):
3083
3084 8 66 11 55
3085 6 661 555 234
3086
3087Tokenization 1. Create your own tokenizer for Polish texts:
3088
3089 For each token the tokenizer should return its type.
3090 Tokenization rules should be editable independent of application.
3091
3092Other languages - why not?
3093
Tokenization 2. Implement an algorithm to extract an exhaustive list of abbreviations from a text corpus.
3095
3096In tokenization an unusual token should:
3097
30981) be omitted
3099
31002) returned in one piece
3101
31023) divided into smallest pieces
3103
31044) replaced by an emoticon
3105
3106biało-czarny
3107
3108polsko-angielskie,
3109
3110dom-przyczepa
3111Apostrophe in Polish
3112
3113A hyphen (-) is a punctuation mark that’s used to join words or parts of words.
3114
The tokenizer should rather keep the hyphen inside a single token.
3116
31172. Tokenization is not trivial - there are rules and exceptions
3118
3119Question: Does the dot belong to the token?
3120Em dash
3121
3122Answer: [ \t\n\r]* ...or \s*
3123
3124anti-virus
3125
3126e-learning
3127
3128eco-service
3129
3130barankawałopowiadałani
3131
3132to link a few words into one
3133
3134Answer:
3135
3136 white spaces
3137 punctuation marks
3138 markup tags
3139
3140Let us check PSI-Toolkit
3141
3142http://psi-toolkit.amu.edu.pl
3143
3144https://tatoeba.org/eng ?
3145
3146The U.K. Prime Minister,
3147
3148Mrs. Theresa May, was seen out with her family today.
3149
3150Hyphenation Problems
3151Apostrophe
3152According to the method shown above how would you tokenize the text:
3153You are smart, aren't you?
31541) are nt
31552) arent
31563) aren ' t
31574) are ' nt
3158MaxMatch Polish
3159
3160Search for most frequent pairs (sequences) of symbols in the training corpus.
3161
3162http://multiservice.nlp.ipipan.waw.pl/pl/
3163
Note: Do not forget to use a "hard space" (e.g. Ctrl-Shift-Space, or a tilde) where the line should not break.
3165Test Question 9
3166
3167Take the division which has the largest probability (according to the training corpus)
3168
3169wał
3170
3171IN:
3172
31739. Zadzwoń do inż. Kowalskiego
3174
3175Question: Suppose we need to extract only a month from the last regex. How to do it?
3176
3177How to tokenize words like:
3178
3179 na pewno, co prawda
3180 faux pas
3181 white space (whitespace), data base (database)
3182 lowercase, lower-case, lower case
3183
3184First sentence.||Second sentence
3185
3186First sentence.|| second sentence
3187
3188First sentence|| Second sentence
3189
3190First sentence;|| Second sentence
3191
3192W roku '17 dziewczyna Rudy'ego poszła do McDonald'sa, aby otrzymać bridge'owy podarunek.
3193
3194The main morpheme of a word, conveying its meaning.
3195
3196{computer, computers}
3197
3198computer-> computerized
3199Step 1
3200Porter Algorithm
3201Basic Concepts
3202Homographs
3203
3204mothers-in-law
3205
3206isolation -> isolat
3207
3208Stemming - for a given word returns its stem.
3209
3210I’m sure I’m right.
3211
3212Take a right turn at the intersection.
3213Stemming
3214
32150
3216
3217friendly
3218
3219Answer
3220
3221foodie
3222
3223Krzysztof Jassem
3224
3225IN: lisów
3226Stemming
3227
3228She has a rose garden.
3229
3230Sales rose by 20% over the Christmas period.
3231
3232Imperfect Solution:
3233
If more than one inflection is possible, choose the most frequent one (based on the corpus)
3235Morfeusz Lemmatizer
3236True vs False Derivatives
3237
3238Task: Lemmatizers 2.
3239
3240Compare (at least two) Polish lemmatizers:
3241
32421) Collect (download) a corpus of Polish texts
3243
32442) Lemmatize the corpus with different tools
3245
32463) "Diff" the result files
3247
3248The result of the task is the report on the experiment.
3249
3250fan-bloody-tastic
3251
3252webisode
3253Lemmatization
3254
3255She has a rose garden.
3256
3257Sales rose by 20% over the Christmas period.
3258
32592. Lemmatization returns all possible base forms (lemmas) of the word.
3260
3261Word formation may take 2 steps:
3262
3263 derivational formation, e.g. food -> foodie
3264 inflectional formation, e.g. foodie -> foodies
3265
3266Task 6 (Team work)
3267
3268The points are awarded by another group
3269
32701. Speed
3271
32722. No lexicon needed
3273
32743. Stemming returns the same stem for different words having the same origin (good for document classification)
3275
3276vowel = {a, e, i, o, u} + {y after consonant}
3277
3278V = series of vowels, e.g. {a}, {ea}, {ai}
3279
3280consonant = any other letter
3281
3282C = series of consonants, e.g. {c} {ch}, {tr}
3283
3284Any English word may be represented as:
3285
3286(C)(VC)...(VC)(V)
3287
3288(...) indicates optionality (0 or 1 occurrence), e.g.
3289
3290Example
3291
3292bread : CVC
3293
3294m (measure) = number of VC groups in a word
3295
3296m(bread) = ?
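A toy Python sketch of the measure, following the definitions above (the handling of 'y' is simplified):

```python
def measure(word):
    # m = number of VC groups in the word's (C)(VC)...(V) representation
    def is_vowel(i):
        c = word[i]
        return c in "aeiou" or (c == "y" and i > 0 and word[i - 1] not in "aeiou")
    pattern = ""
    for i in range(len(word)):
        v = "V" if is_vowel(i) else "C"
        if not pattern or pattern[-1] != v:
            pattern += v
    return pattern.count("VC")

print(measure("bread"), measure("fee"))   # 1 0
```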
3297
3298What are the advantages of stemming compared to lemmatization?
3299
For each word, POS-tagging returns one and only one possibility:
3301
3302{(lemma), part-of-speech, (morphological values)}
3303
3304that matches the context.
3305
3306Give 3 examples of suffix-merged derivatives - use endings different from the ones in the above table.
3307
3308komputer -> komputery
3309
3310budować -> budowałem
3311
3312dobry -> lepszy
3313
3314Rule Example
3315
3316sses -> ss caresses -> caress
3317
3318ies -> i ponies -> poni
3319
3320ss -> ss caress -> caress
3321
s -> (empty) cats -> cat
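A minimal Python sketch of just this step (the plural rules above); the full Porter stemmer has many more steps and conditions:

```python
def step_1a(word):
    if word.endswith("sses"):
        return word[:-2]       # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]       # ponies -> poni
    if word.endswith("ss"):
        return word            # caress -> caress
    if word.endswith("s"):
        return word[:-1]       # cats -> cat
    return word

for w in ("caresses", "ponies", "caress", "cats"):
    print(w, "->", step_1a(w))
```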
3323
33244. Stemming returns stems of words - it is fast but imperfect.
3325
3. POS-tagging returns only one part of speech (and one base form) of the word - the one that best fits the context.
3327Task 1
3328
3329Problem:
3330
3331Suppose lexicon includes inflected forms of words:
3332
3333ruch (Genitive: ruchu)
3334
3335brzuch (Genitive: brzucha)
3336
3337How to inflect a new word: paluch?
3338
3339Morphemes are the smallest language units that convey some meaning.
3340
3341Goal: To generate words that are not present in the lexicon although they are used in real language.
3342Answer
3343
3344foodie -> foodies
3345
3346Give 3 new examples of true derivatives and
3347
33483 new examples of false derivatives.
3349
3350The below command asks to list all base forms
3351Text Classification
3352
3353I’m sure I’m right.
3354
3355Take a right turn at the intersection.
3356POS-tagging
3357
3358The conference was very well organized.
3359
3360The dog fell down a well.
3361
3362isolated -> isolat
3363
3364{work, worked}
3365
3366OUT:
3367
33681) lemma: fox
3369
33702) part of speech: noun
3371
33723) number: plural
3373TASK 3
3374
3375Most popular Polish lemmatizers return the list of the following items:
3376
3377gigabyte
3378Lemmatization vs POS-Tagging
3379
3380megastore
3381
3382{nice, nicer, nicest}
3383
3384lemmatize --lang en | simple-writer --tag lexeme
3385
3386Homographs are words with the same spelling but having more than one meaning.
3387
3388Task: Lemmatizers 1.
3389
3390Using an open-source lexicon develop your own lemmatizer.
3391
3392Example lexicons:
3393
3394 SJP: http://www.sjp.pl/
3395 aspell: ftp://ftp.gnu.org/gnu/aspell/dict/0index.html
3396
3397Stem
3398Examples of word formation
3399Morphemes
3400
3401Affix - morpheme attached to a stem, modifying its meaning.
3402
3403Give 3 examples (other than given above) of each of (any language):
3404
3405A) words with a suffix
3406
3407B) words with a prefix
3408
3409C) words with an infix
3410
34113 points
3412
34131) inflection
3414
34152) derivation
3416
34173) derivation
3418
34194) none
3420
3421m(fee) = ?
Use of Morphological Analysis
3423
3424able -> ability
3425Affix
3426Suffix
3427
3428http://sgjp.pl/morfeusz/demo/
3429Is Word Long Enough?
3430False Derivatives
3431Prefix
3432
Morphological analysis - automatic process that extracts morphemes from a word.
3434Infix
3435Answer
3436Inflection
3437
3438Input: 1) text (edit window or a file)
3439
34402) command (a pipe of instructions)
3441
3442crowdfund
3443
3444Goal: To generate all inflected forms from a base form
3445
3446by analogy to other words
3447
3448Word formation - automatic creation of a new word based on other words and morphemes
3449
3450*S - stem ends with the letter 's'
3451
3452(*T - stems ends with 't' etc.)
3453
3454*v* - stem contains a vowel
3455
3456*d* stem ends with any double letter
3457
3458https://langrid.org/playground/morphological-analyzer.html
3459Answer
3460
3461subtask
3462Derivation by Merging Prefixes
3463True Derivatives
3464
3465fejsować
3466
3467megafajny
3468
3469makroproducent
3470
3471grobbing
3472
3473{miły, miłego, miły, milszy, milsza, najmilszy,...}
3474
3. Some words are confirmed in the corpus only because of orthographic errors.
3476
3477kolacja -> kolować
3478
3479Problem? YES
3480
3481(m>1) ance -> allowance -> allow
3482
3483(m>1) ence -> inference -> infer
3484
3485...
3486STEPS 4, 5, 6
3487Derivation by Merging Suffixes
3488Steps of Porter Algorithm
3489
3490Derived forms are built by:
3491
3492 adding suffixes
3493 changing suffixes
3494 adding prefixes
3495
3496isolate -> isolat
3497
3498Inflection changes values of:
3499
3500 number
3501 gender
3502 tense
3503 case
3504
3505unbelievable
3506Conclusions
3507Inflectional Formation
3508Lexeme
3509
3510A fly was buzzing against the window.
3511
3512Let’s fly a kite.
3513
3514PSI-Toolkit - a set of tools that analyses natural language
3515
3516(*d*&!(*L || *S || *Z))
3517
3518double letter -> single letter
3519
3520tann -> tan
3521
3522fall -> fall
3523
3524Porter 1. Implement Porter algorithm for Polish in Python. The application should return:
3525
3526- stem
3527
3528- applied steps of algorithm
3529
3530- information on POS (e.g. noun) if possible
3531
3532Porter 2. TEAM TASK. Implement a platform for easy creation of new Porter algorithms. Functions:
3533
3534- edition of stemming rules
3535
3536- possibility of use for different languages
3537
3538- input: text; output: list of subsequent stems and POS (if possible)
3539PSI-Toolkit
3540True Derivatives vs Errors
3541Search Queries
3542
3543IN: foxes
3544
3545IN: koty szafy czci powieść lub
3546Task 7
3547
3548worked
3549
3550A) za wirus owałem
3551
3552B) zawirusować;
3553
35541st person, sing, past
3555
3556C) wirus
3557
3558Another example: katar, tatar; awatar
3559
3560foxes
3561
3562Let's try:
3563
3564Nie mamy w domu soli
3565
1. Morphological analysis aims at extracting morphemes from words.
3567Step 2
3568
3569Derivation - forming new base words from existing ones
3570Task 1.
3571
3572Classify below examples to one of:
3573
3574inflection; derivation; none:
3575
3576A) zły -> najgorszy
3577
3578B) actor -> actress
3579
3580C) believe -> unbelievable
3581
3582D) alkohol -> piwo
3583
35844 points
3585Optional conditions
3586
3587True derivatives modify the meaning of the words they are derived from but do not change the meaning significantly.
3588Purpose of Stemming
3589
2. Some generated words exist in the corpus, although they are not in fact derivatives (but inflected forms of other words).
3591
3592czajnik -> czajniczka
3593
3594konik -> koniczka
3595
3596Problem? YES
3597STEP 7
3598
3599do -> undo
3600
3601(m>0) d -> agreed -> agree
3602
3603feed -> feed
3604
3605(*v*) ed -> plastered -> plaster
3606
3607bled -> bled
3608
3609(*v*) ing -> motoring -> motor
3610
3611sing -> sing
3612
3613Other Polish lemmatizers
3614
3615computer -> computers
3616
3617build -> built
3618
3619good -> better
3620
catastrophe -> catastrophize
3622Task 4
3623
3624foxes
3625Morphological Analysis
3626Theoretical background
3627
3628Inflection (pol. odmiana) - forming a new (inflected) word from a base word
3629Lab Tasks
3630
3631In: Word
3632
3633Out: Stem of a word
3634Morphological Analysis
3635
3636Morphological Synthesis 1. TEAM TASK.
3637
3638Implement a system that generates words that do not occur in standard dictionaries, although they are confirmed in a corpus, for a few natural languages.
3639
36405. Word formation generates words that may not exist in a lexicon.
3641
3642excite
3643
3644excitable, exciting, excited, unexcited
3645http://9ol.es/porter_js_demo.html
3646http://text-processing.com/demo/stem/
3647English POS-Tagger Online
3648Word Formation
3649
3650{good, better, best}
3651
3652to photobomb
3653
3654Lemmatization - for a given word returns:
3655
36561) its base form (lemma)
3657
36582) part of speech (optionally)
3659
36603) values of morphological features (optionally)
3661
3662undo
3663
3664Example condition:
3665
3666(m>1) && (*S || *T))
3667
3668kropek
3669English lemmatizers
3670
3671List of lemmatizers (and other Polish NLP tools):
3672
3673http://clip.ipipan.waw.pl/LRT
3674
3675{budować, buduję, budujesz, budowałem, budowali,...}
3676
3677Examples:
3678
3679believed -> believe-d
3680
3681unbelievable -> un-believ-able
3682
3683Inflection (usually but not always) consists in adding suffixes to a stem.
3684
3685build -> building
3686True Derivatives
3687
3688Idea
3689
3690Algorithm takes 7 steps in turn.
3691
3692In each step:
3693
3694if (word is long enough) && (optional conditions)
3695
3696then cut off or replace its ending (different endings for different steps)
3697
3698(m>0) ational -> ate relational -> relate
3699
3700(m>0) tional -> tion conditional -> condition
3701
3702...
3703
3704http://psi-toolkit.wmi.amu.edu.pl
3705
3706Automatic process of finding the stem of the word - usually by cutting off letters from the word.
3707Lemmatization
3708
3709achieve
3710
3711achievement, achiever, achievable, unachievable
3712
3713Verify your examples by morfeusz.
3714
3715For a word zawirusowałem what will be the output of:
3716
3717A) lemmatizer (2 points)
3718
3719B) stemmer (1 point)
3720
3721 http://psi-toolkit.wmi.amu.edu.pl:
3722 lemmatize --lang pl | simple-writer --tag lexeme
3723 lemmatize | write-morphology
3724
3725Give 3 examples of words, for which a Polish lemmatizer returns more than one lemma. Do not use any device for verification.
3726
37273 points
3728
3729store -> megastore
3730
3731lemmatize | write-morphology
3732
3733komputer -> komputerowy
3734
3735budować -> budowla
3736
3737dobry -> dobro
3738
3739książka -> książeczka
3740
3741robić -> zrobić
3742
3743good -> goodness
3744
3745Optional conditions are checked for a stem that would result after cutting the ending.
3746
3747cieniutki
3748
3749http://morphological.org/
3750
3751False derivatives totally change the meaning of words they are derived from. They are "derivatives by coincidence".
3752
3753base_form part-of-speech:value1:value2...
3754
37551
3756Morphological Analysis
3757
3758{komputer, komputera, komputerowi, komputerem, komputerze, komputery,...}
3759Goal of Automatic Word Formation
3760
3761(m>1) e -> probate -> probat
3762
3763rate-> rate
3764
3765Algorithm:
3766
37671) Take an existing word W
3768
37692) Cut an n-letter ending from W (n = 3,4,5,...)
3770
37713) Append a new ending to W
3772
4) Confirm that the new word occurs in a corpus
3774Derivation
3775
3776Commands for lemmatization:
3777
3778catastrophize
3779Lab Tasks
3780Lab Task
3781
3782Set of words with the same base form.
3783
3784(m>0) ative -> formative -> form
3785
3786(m>0) alize -> al formalize -> formal
3787
3788...
3789
3790OUT:
3791
37921) lemma: lis
3793
37942) part of speech: noun
3795
37963) number: plural
3797
3798case: genitive
3799Porter Stemmer Online
3800
3801A fly was buzzing against the window.
3802
3803Let’s fly a kite.
3804Potential Errors
3805
3806For each word lemmatization returns all possibilities:
3807
3808lemma, (optionally: part-of-speech, morphological values)
3809STEP 3
3810
3811wodoszczelny
3812
38131. Non-existing words:
3814
3815balować -> balacja
3816
3817zdolny -> zdolizacja
3818
3819Non-existing words are deleted by corpus confirmation.
3820
3821Problem? NO
3822
3823smarter
3824
3825kota: base form = kota
3826
3827subst = substantivum (rzeczownik)
3828
3829pl = plural number (liczba mnoga)
3830
3831nom.acc.voc = nominative (mianownik)
3832
3833or accusative (biernik)
3834
3835or vocative (wołacz)
3836
3837f = feminine (rodzaj żeński)
3838
3839Third line:
3840
3841first line:
3842
3843subst = substantivum (rzeczownik)
3844
3845sg = singular number(liczba pojedyncza)
3846
3847gen = genitive case (przypadek dopełniacza)
3848
3849f = feminine gender(rodzaj żeński)
3850
3851second line:
3852
3853subst = substantivum (rzeczownik)
3854
3855pl = plural number (liczba mnoga)
3856
3857nom.acc.voc = nominative or accusative or vocative (mianownik lub biernik lub wołacz)
3858
3859f = feminine gender (rodzaj żeński)
3860
3861fin = finite verb (czasownik odmieniony)
3862
3863sg = singular number
3864
3865ter = third person (trzecia osoba)
3866
3867imperf = imperfective (niedokonany)
3868
3869inf = infinitive (bezokolicznik)
3870
3871perf = perfective (dokonany)
3872
3873impt = imperative (tryb rozkazujący)
3874
3875sg = singular (liczba pojedyncza)
3876
3877sec = second (druga osoba)
3878
3879imperf = imperfective (niedokonany)
3880
conj = conjunction (spójnik)
3882Task 8
3883
38841. How many nouns in the corpus are inflected in the same way as "rufa"?
3885
2. How many nouns in the corpus are inflected in the same way as "koja"?
3887Answer
3888
38891. 4503
3890
38912. 5836
3892
3893zbudowałem
3894
3895kilobajt
3896
3897podzadanie
3898
3899We work hard
3900
3901we+pron
3902
3903work+noun | work+verb
3904
3905hard+adj | hard+adv
3906
3907we+pron
3908
3909work+verb
3910
3911hard + adv
3912
3913lisy
3914
3915pracowałem
3916
3917mądrzejszy
3918
3919Nie mamy w domu soli
3920
3921nie:q+qub
3922
3923mieć+fin
3924
3925w+prep
3926
3927dom+subst
3928
3929sól+subst
3930
3931lisy
3932
3933nieziemski
3934
3935sąsiedzki
3936
3937Lemmatization
3938
3939POS-tagging
3940
3941budował -> budow
3942
3943budowla -> budow
3944
3945budowlany -> budow
3946
3947GPL licence
3948
3949http://www.apertium.org/
3950NLP-Toolkits Tasks
3951Bonsai - statistical translator
3952Tokenizer
3953
3954Coreference occurs when two or more expressions in a text refer to the same person or thing (wikipedia).
3955Coreference Analysis
3956Extended POS-tags
3957Task 2
3958Using PSI-Tools in Linux Bash
3959
3960Processing of English
3961
3962language
3963
3964Extended POS-tags (denoted as TAG) merge POS-tags and morphological features.
3965Switches
3966
3967User can add Linux processes in the pipe, e.g. "sort" or "grep"
3968POS-Tagging
3969
3970Process of recognizing and classifying named entities - (multiword) elements in text, such as:
3971
3972 names of persons (mgr Roman Grundkiewicz)
3973 organizations (Uniwersytet im. Adama Mickiewicza)
3974 locations (rzeka św. Wawrzyńca)
3975 expressions of times (11 luty 1965)
3976 quantities (200 m)
3977 monetary values (45$)
3978 percentages (84%)
3979 etc. (eduwiki.wmi.amu.edu.pl)
3980
3981Installation
3982
A statistical language model assigns a probability to a word following a given sequence of words. Based on this, it calculates probabilities of whole sentences.
3984Note: NLTK Lemmatizer works on single tokens:
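For instance (a sketch; assumes the 'wordnet' data has been downloaded with nltk.download()):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("foxes"))             # fox
print(lemmatizer.lemmatize("better", pos="a"))   # good (POS hint needed)
```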
3985Annotators
3986SPAM-classifier
39871) assigns a floating-point value to SPAM e-mails
39882) decides if a new-coming e-mail is SPAM or NOT
39893) classifies SPAM e-mails to recipients
4) classifies SPAM e-mails to topics
3991
3992Computational treatment of opinion, sentiment, and subjectivity in text.
3993TASK 9
3994
3995lang-guesser guesses the language of the text
3996
3997Bilexicon returns a number of translations:
3998
39991) one for the whole text
4000
40012) one for a sentence
4002
40033) one for a word
4004
40054) one or more for a word
4006
4007VBG - verb, gerund
4008
4009Shape Alpha Stop
4010
4011On-line demo version:
4012
4013http://nlp.stanford.edu:8080/parser/
4014
4015Processing is divided into phases:
4016
4017 language recognition
4018 tokenization
4019 sentence splitting
4020 Named Entity Recognition
4021
4022VBZ - verb, 3-rd person, singular number, Present tense
4023What's NLP toolkit?
4024Cloning Docker Repo
4025
4026The above example requires the existence of a classifier (a program that assigns a text to a class). Such a classifier may be trained with the aid of NLTK tools.
4027TASK 7
4028
4029Requires a few complex operations from a user before the first annotation is received (sometimes they end in a failure).
4030
4031 Stanford - command line instructions
4032
4033 UIMA, nltk – Python (java?) bindings
4034
4035 GATE – pipeline of tools
4036
4037 Apertium – Machine Translation
4038
4039PSI-Toolkit Licence
4040Tree
4041
4042Full list of aliases - see PSI-Toolkit Documentation
4043Scraping Text Documents
4044Tokenization
4045Translators
4046
4047$ sudo docker run -i skorzewski/psi-toolkit tokenize --lang pl
4048
4049< example | sort | uniq -c > example.freq_list
4050
4051User can add Linux processes in the pipe, e.g. "sort" or "grep"
4052Spacy basic functions
4053
4054Applications require Java compilation. Programming skills (at least basic level) are necessary.
4055Task 4
4056Sentiment Analysis
4057Research Team
4058POS-tagging
4059
4060Polish Translation: analiza wydźwięku
4061TASK 6
4062Task 3
4063Ideas from other toolkits
4064
4065Linux packages
4066
4067run from
4068
4069command line
4070Selected Operations
4071
4072I saw a man with a dog. He was very calm.
4073Other Types of Writers
4074
4075in Python
4076The tool presented in the previous
4077slide is a type of:
40781) Tokenizer
40792) Named Entity Recognizer
40803) POS-tagger
40814) Coreference analyzer
4082
4083Industrial-Strength Natural Language Processing
4084
4085Model is a way of representing something so that it can be grasped by a human or a computer.
4086
4087In order to extract a Polish fragment
4088
4089from a multilingual text and replace some fragments of the extract using regexes:
4090
40911) use PSI-Toolkit in Bash
4092
40932) use PSI-Toolkit as a Python binding
4094
40953) Both 1) and 2) may work
4096
40974) neither of 1) or 2) works
4098
4099Tokenizer depends on the language
4100
4101NLP = Natural Language Processing
4102
4103NLP-Toolkit – a set of tools and libraries for Natural Language Processing
4104
4105source code (GitHub)
4106
4107Return syntax structures (graphics or text)
4108Task 3
4109Sentence Splitting
4110Language guesser
4111Spell-checking
4112
4113coded in Java
4114
4115Tools analyze text in order to extract information
4116Information System Laboratory
4117Running PSI-Toolkit in Docker
4118Aliases
4119Lattice
4120
4121Words having the same POS-tag, may differ with morphological features:
4122
4123"Czy Rzeka św. Warzyńca znajduje
4124
4125się w Europie czy w Ameryce Północnej?"
4126
4127A NER tool should return n named entities in the above sentence.
4128
41291) n = 0
4130
41312) n = 2
4132
41333) n = 3
4134
41354) n = 5
4136
4137Separating text into smaller pieces
4138Examples of NLP Toolkits
4139
sudo apt-get install apt-transport-https ca-certificates curl \
  software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get install docker-ce
4157
4158 Segmenter works on srx breaking rules
4159 User may re-define breaking rules
4160
4161Using PSI-Tools in Linux Bash
4162
4163PSI-lattice is a structure that
4164
4165spans over:
4166
41671) one letter
4168
41692) one token
4170
41713) one sentence
4172
41734) whole input text
4174Functionalities of PSI-Toolkit
4175Lemmatizers
4176
4177Natural Language Toolkit
4178Which tool requires lexicon?
41791) Lancaster Stemmer
41802) Porter Stemmer
41813) WordNet Lemmatizer
41824) None
4183
4184Spell-checker in PSI-Toolkit:
4185
41861) underlines incorrect words
4187
41882) underlines incorrect sentences
4189
41903) prompts one correction for each incorrect word
4191
41924) prompts a list of corrections for each incorrect word
4193
4194 Morfologik (outside tool!)
4195 Lammerlemma (lemmatizer)
4196 User may define own lemmatizers
4197
4198Spacy
4199Processing
4200Web Documents
4201
4202 gv-writer →
4203 draw
4204 graph
4205 write-chart
4206 write-graph
4207 tp-tokenizer →
4208 token-generator
4209 tokenise
4210 tokeniser
4211 tokenize
4212 tokenizer
4213
4214 txt-reader – raw text
4215 apertium-reader – various formats
4216 HTML
4217 RTF
4218 Open-Office Writer
4219 Microsoft Office (docx)
4220 pdf-reader
 nkjp-reader – National Corpus of Polish (Narodowy Korpus Języka Polskiego, http://nkjp.pl)
4222 utt-reader – UTT (UAM Text Tools) (http://utt.amu.edu.pl)
4223
4224Tokenization
4225TASK 5
4226Task 5
4227NLP Toolkits
4228Segmenter
4229(Sentence Splitter)
4230Scraping HTML Documents
4231
4232UIMA - Unstructured Information Management System
4233Language Model
4234
4235GPL:
4236
4237 right to run the program
4238 right to analyze and modify the program
4239 right to distribute unmodified copy
4240 right to improve and publicize the improved version
4241
4242Bilexicon
4243Named Entity Recognition
4244
4245For a sentence "Jan nie lubi soli",
4246
4247a POS-tagger returns n POS-tags
4248
4249for the word "soli"
4250
42511) n = 1
4252
42532) n = 2
4254
42553) n = 3
4256
42574) n = 4
4258Types of Readers
4259
4260PROPN - Proper Noun
4261
4262Example:
4263
4264Language may be modelled with a formal grammar.
4265
4266reader | annotator_1 | ...|annotator_n | writer
4267Tokenization & Sentence Splitting
4268
4269 PSI-Pipe - use PSI-Toolkit in Linux locally
4270 PSI-Server - create your own PSI-Toolkit web-service
4271
„Narzędzia do automatycznego przetwarzania języka polskiego udostępnione publicznie" (Tools for automatic processing of Polish made publicly available)
21.04.2011 – 20.04.2013
4274
4275 bracketing-writer
4276
4277[NP[AP[very large] house] <np><ap>very large</ap> house</np>
4278
4279 gv-writer (draw) – visual (examples to follow)
4280
4281Data Structure
4282Lemmatization
4283
4284$ cat example
4285
4286Do Polski przyjechali prof. Kowalski i dr Kowalski.
4287
4288Przywieźli do Polski zwierzęta, np. psa i kota, za które zapłacili 500$.
4289
4290^D
4291
4292Statistical language model is based on a text corpus.
4293
NNP - proper noun, singular
4295
4296Which of the following is not returned by psi-writer?
4297
1) start position of each arc

2) length of each arc

3) tag attached to each arc

4) priority of each arc
4305
4306Shape: The word shape – capitalization, punctuation, digits.
4307
4308Alpha: Is the token an alpha character?
4309
4310Stop: Is the token part of a stop list, i.e. the most common words of the language?
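A short sketch of these token attributes in spaCy (assumes the small English model en_core_web_sm is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The U.K. Prime Minister was seen out in 2018!")
for token in doc:
    # shape_: capitalization/punctuation/digit pattern, e.g. 'Xxx', 'dddd'
    print(token.text, token.shape_, token.is_alpha, token.is_stop)
```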
4311
4312Statistical (Machine Learning) NLP methods require a language model.
4313Linux Packages
4314About PSI-Toolkit
4315Stemmers, Lemmatizers
4316Task 1
4317Morphological Features
4318Readers
4319Sentence Splitting
4320PSI - Toolkit Lattice
4321Named Entity Recognition (NER)
4322TASK 11
4323Structures Used in NLP
4324Psi-Toolkit pipeline
4325
4326http://psi-toolkit.wmi.amu.edu.pl/help/documentation.html
4327
4328Lemmatization - returns all possible lemmas for a word - each lemma is marked with a POS.
4329
4330POS-tagging - determines the actual POS in a context.
4331
4332 Free open-source platform for Machine Translation.
4333 Machine Translation web-service
4334 User may use an existing MT engine or...
4335 User may create their own MT engine
4336 Just re-edit a few dictionary files
4337
4338Other attributes
4339Parsers
4340Python bindings
4341
4342web service
4343TASK 8
4344
4345Process of assigning part-of-speech to words in a text.
4346
> echo "I've been to many countries. I enjoy travelling." |
psi-pipe segment --lang en ! write-simple --tag segment > segmented.txt
4350
4351> cat segmented.txt
4352
4353I’ve been to many countries.
4354
4355I enjoy travelling.
4356
4357Opinion Mining
4358
4359Most PSI-annotators may be customized by switches
4360
4361By using aspell (aspell.net), PSI-Toolkit supports over 80 languages.
4362Apertium
4363Transferer - rule-based translator
4364
4365LGPL allows the work to be linked with free software or proprietary software
4366
4367--lang pl denotes:
4368
43691) a reader
4370
43712) a writer
4372
43733) a tokenizer
4374
43754) an option of an annotator
4376Writers
4377Task 4
4378
> echo "Piękna gra trwała długo" |
4380
4381psi-pipe read-text ! tokenize ! lemmatize ! simple-writer --tag lexeme
4382Stemmers
4383Readers read text from input device, separate symbols and initiate the PSI-lattice.
4384
4385Targeted for both:
4386
4387 language engineers (who can develop their applications in GATE)
4388 linguist, who can use complete tools
4389
4390Aperium on-line
4391
Each arc of the lattice is annotated with a tag.
4393POS-Tagging
4394
4395There is a coreference between "he" and:
4396
43971) I
4398
43992) man
4400
44013) dog
4402
44034) none of above
4404
4405Users may substitute a dictionary with one of their own.
4406PSI-Toolkit
4407Running PSi-Toolkit
4408
4409Finding the lemma (base form) for a given word (optionally also part of speech and morphological features).
4410
> echo "Piękna gra trwała długo" |
4412
4413psi-pipe read-text ! tokenize ! lemmatize ! simple-writer --tag lexeme
4414
4415Language model is a way to represent a language for a computer.
4416Note: Results of different tokenizers may differ:
4417PSI-Toolkit Users
4418
4419NumPy needed!
4420
4421Returns all possible translations: word by word
4422
> echo "I've been to many countries. I enjoy travelling." |
psi-pipe segment --lang en ! write-simple --tag segment > segmented.txt
4426
4427> cat segmented.txt
4428
4429I’ve been to many countries.
4430
4431I enjoy travelling.
4432NLTK
4433
4434Lang-guesser guesses a language for:
4435
44361) one token
4437
44382) entire text
4439
44403) fragment of text (longer than a token)
4441
44424) one character
4443TASK 10
4444Option --tag lexeme tells writer to
44451) display stems
44462) display lemmas
44473) display lemmas and POS tags
44484) display POS tags
4449Lemmatizers
4450Executed:
4451>>> from nltk.tokenize import sent_tokenize
4452Next, two commands were given:
4453A) >>> sent_tokenize("First sentence. Second sentence")
4454B) >>> nltk.sent_tokenize("First sentence. Second sentence")
4455Which command will run properly (no errors reported)?
44561) Only A)
44572) Only B)
44583) Both
44594) None
4460Various Results of Tokenizers
4461
4462http://www.slideshare.net/japerk/nltk-in-20-minutes
4463Note: word_tokenize() returns the list of tokens:
4464Note: sent_tokenize() returns a list of sentences:
4465Note: ne-chunk returns a nested tree (nltk.tree.Tree object):
4466Note: sent_tokenize() "knows" when not to split after dot:
4467Note: you should use pos_tag() for a list of tokens:
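A combined sketch of the notes above (assumes the 'punkt' tokenizer and the POS tagger data have been downloaded):

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Mrs. Theresa May was seen out. She smiled."
sentences = sent_tokenize(text)        # does not split after "Mrs."
tokens = word_tokenize(sentences[0])   # list of tokens
print(sentences)
print(nltk.pos_tag(tokens))            # pos_tag() expects a list of tokens
```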
4468
44692 ,
4470
44712 .
4472
44731 $
4474
44751 500
4476
44771 do
4478
44791 Do
4480
44811 dr
4482
44832 i
4484
44851 kota
4486
44872 Kowalski
4488
44891 które
4490
44911 np.
4492
44932 Polski
4494
44951 prof.
4496
44971 przyjechali
4498
44991 Przywieźli
4500
45011 psa
4502
45031 za
4504
45051 zapłacili
4506
45071 zwierzęta
4508
45091. http://pypi.python.org/pypi/nltk
4510
45112. http://sourceforge.net/projects/numpy/files/NumPy/
4512
45133. Test installation:
4514
4515Start>Python34
4516
4517>>> import nltk
4518
45194. Download data
4520
4521>>> nltk.download()
4522
4523Windows
4524Note: Stemmers operate on single tokens:
4525Note: word_tokenize() is a function that may be imported from 'nltk.tokenize' package:
4526Note: pos-tag() is a function from the 'nltk.pos' package
4527Note: when used on strings, pos_tag() assigns tags to characters:
4528Note: use find() and rfind() to get indexes of the first / last appearance of the word:
4529Note: You need BeautifulSoup.
4530But you have to install first (using pip)!
4531
45321. sudo pip install -U nltk
4533
45342. sudo pip install -U numpy
4535
45363. Test installation:
4537
4538$ python
4539
4540>>> import nltk
4541
45424. Download data
4543
4544>>> nltk.download()
4545
4546Linux
4547Note: read().decode() does not work on HTML:
4548Note: urllib is a Python library that helps working with URLs.
4549Note: urllib.request.urlopen() opens, reads URLs and forms a new object:
4550Note: use read().decode to extract text from URL:
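A sketch combining the notes above (example.com is a placeholder URL; BeautifulSoup must be installed with pip first):

```python
from urllib import request
from bs4 import BeautifulSoup

html = request.urlopen("http://example.com").read().decode("utf8")
text = BeautifulSoup(html, "html.parser").get_text()   # strip HTML markup
print(text[:200])
```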
4551
4552
4553Problem: unbalanced classes
4554Recall
4555Features
4556
45573. Bayes classifier is simple to implement, yet it gives good results - particularly in Sentiment Analysis.
4558Results of the Experiments
4559Regression
4560
4561Ocena 7
4562
4563Dobrze omówiona faza prototypowania, która jest istotnym elementem startupów i pozwala zaoszczędzić wiele czasu
4564
456510/10. Jasno przedstawione tematy, dobrze wytłumaczone. Uwzględnione najważniejsze cechy prototypu. Wytłumaczone cele ciągłej integracji.
4566Bayes Classifier - Algorithm
4567
4568All activation functions:
4569
45702. The records of the games are stored
4571
4572 correct moves lead to victories
4573 errors lead to losses
4574
4575Identifying Groups of Celestial Bodies
4576
4577Anti-SPAM problem
4578
4579Goal: To classify a new mail as SPAM or NOT SPAM
4580
4581machine learns to play (one game of) checkers
4582
4583Anti-SPAM problem
4584
4585Goal: To classify a new mail as SPAM or NOT SPAM
4586
4587TEMPERATURE PREDICTION
4588
4589Goal: To predict global average temperature in 2020
4590
4591Answer: B
4592
4593percentage of correct classifications
4594
4595Krzysztof Jassem
4596Students' comments on lectures
4597
4598P(erformance) =
4599Astronomical data analysis
4600Supervised Learning
4601
4602Example 1
4603
4604Example 2
4605
4606picture from (Jurafsky, Martin, to appear)
4607
4608Example 1
4609Formula for Bayes' Rule
4610
4611Humans help machines by labeling the data used for learning.
4612
4613Percentage of true labels among those predicted by system as positive
4614F-measure
4615
4616Accuracy of stupid classifier:
4617
4618Percentage of correct classifications
4619Sentiment Analysis
4620
4621false negative
4622
2. Humans label each mail as either SPAM or NOT SPAM
4624AMU Case Study
4625
4626F-measure takes into account both precision and recall.
4627
4628Wykład oceniam na 9 punktów.
4629
4630Przedstawiono interesujące zaganienia, które mogą okazać się przydatne w przyszłej pracy zawodowej. Prezentacje bardzo czytelne, ładnie wykonane.
4631
4632records (history) of played games
4633Bayes Classifier
4634Regression vs Classification
4635To predict a continuous valued output
4636Precision
4637Definition
4638
4639P(erformance) =
4640Bayes' Rule Applied to Texts
4641Example: SPAM detection: SPAM = positive
4642
4643How to learn likelihood probability?
4644
46451. A large set of e-mails is collected
4646
4647machine learns to classify e-mails as spam
4648
4649or (not spam)
4650
4651Example 2
4652
4653E(xperience) =
4654Grouping Similar Objects
4655Test Question 1
4656Likelihood Probability
4657
4658percentage of wins over a randomly-playing algorithm
4659
4660true negative
4661
4662Stupid classifier: label all mails as NOT COFFEE
4663
4664A) are differentiable
4665
4666B) are non-negative
4667
4668C) are non-decreasing
4669
4670D) are continuous
4671
4672E) have values between 0 and 1
4673
3. The machine learns on data prepared (i.e. collected and labeled) by humans.
4675
46761. Humans play (thousands of) games against humans or machines
4677
4678Percentage of positive cases labeled correctly by the classifier.
4679
4680E(xperience) =
4681
4682-
4683Introduction to Machine Learning
4684
4685Example: Suppose we want to find a mail about coffee. 99% is NOT about coffee.
4686Spam Detection
4687
4688Feature is a characteristic of the learning data represented as a number.
4689Test Question 5
4690Contingency Matrix
4691
469299%
4693Examples
4694
4695GAME of CHECKERS.
4696
Goal: To teach a machine to play checkers
4698Assigning News to Topics
4699Prior Probability
4700
4701Automated process of categorizing opinions expressed in a piece of text, usually:
4702
4703positive, negative, or neutral.
4704
4705Features: words in the e-mail
4706Identifying Computer Clusters
4707
47082. Bayes classifier naively assumes the independence of features.
4709
4710Machine learns on the data that have not been labeled by humans.
4711
47121. Bayes classifier is a supervised classification method.
4713
47144. To evaluate classifications, the following metrics are used:
4715
4716 accuracy,
4717 precision,
4718 recall,
4719 F-measure.
4720
4721Example 2
4722
47233. Machine learns on the records of already played games
4724Examples of Classifications
4725Classification
4726Various Methods
4727
4728Tom M. Mitchell (1998) "A computer program is said to learn:
4729
4730 from experience E
4731 with respect to some class of tasks T
4732 and performance measure P,
4733
4734if its performance at tasks in T, as measured by P, improves with experience E".
4735
4736false positive
4737
4738Anti-SPAM problem
4739
4740Goal: To classify a new mail as SPAM or NOT SPAM
4741Supervised vs Unsupervised Learning
4742
4743Text is an unordered set (bag) of words
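A minimal sketch of a bag-of-words Bayes classifier using NLTK (the training mails are invented toy examples):

```python
from nltk.classify import NaiveBayesClassifier

def bag_of_words(text):
    # order is ignored: a mail is reduced to the set of words it contains
    return {word.lower(): True for word in text.split()}

train = [
    (bag_of_words("Buy cheap pills now"), "SPAM"),
    (bag_of_words("Win money fast"), "SPAM"),
    (bag_of_words("Meeting at noon tomorrow"), "NOT SPAM"),
    (bag_of_words("Lecture notes attached"), "NOT SPAM"),
]
classifier = NaiveBayesClassifier.train(train)
print(classifier.classify(bag_of_words("cheap money now")))   # expected: SPAM
```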
4744
4745-
4746
4747Ocena: 5/10
4748
4749Wykład bardzo dobry, ale dla osób nie mającego pojęcia o takich rzeczach np. studentów pierwszego roku. Jeżeli ktoś tego nie wiedział na 4 roku to nie wiem co tu robi...
4750
4751Trying to predict a discrete output -
4752
4753one of a finite set of classes
4754
4755Numerical representation of a feature:
4756
4757- numerical identifier of a word in a finite lexicon
4758Bag of Words Assumption
4759
47606
4761
4762Moim zdaniem nie przedstawiono kompletnej idei systemu kontroli kodu, nie powiedziano nic np. o branchach, mało praktycznych przykładów. Część o prototypowaniu była bardzo w porządku.
4763
4764T(ask) =
4765
4766history of classifications with outputs labeled by humans
4767Unsupervised Learning
4768Conclusions
4769
4770T(ask) =
4771
4772An element of the learning data is represented as a numerical vector, called feature vector, that consists of feature values.
4773Evaluation
4774
4775Example 1
4776
4777GAME of CHECKERS.
4778
Goal: To teach a machine to play checkers
4780
4781Feature: Average global temperature in a given year
4782Market Segmentation
4783Social Network Analysis
4784Accuracy
4785
47863 experiments:
4787
4788 10 classes: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
4789 5 classes: 0-6, 7, 8, 9, 10
4790 3 classes: 0-6 (negative), 7-8 (neutral), 9-10 (positive)
4791
4792Types of Machine Learning
4793
4794true positive
4795
4796=
4797
4798NOT SPAM
4799
4800SPAM
4801
4802SPAM
4803
4804NOT SPAM
4805
4806Example: To classify a review to one of three classes:
4807
4808 negative
4809 neutral
4810 positive
4811
4812SPAM
4813
4814Classifier A labels mail as SPAM if P(SPAM) > 90%
4815
4816Classifier B labels mail as SPAM if P(SPAM) > 70%
4817
4818Feature vector: vector of average temperatures in previous years
4819
4820(14.32, 14.35, ...)
4821
4822NOT SPAM
4823
4824Suppose that we test STUPID CLASSIFIER on 100 mails.
4825
4826recall = 0 / 1 = 0
4827
4828A has higher precision, lower recall.
4829
B has lower precision, higher recall.
4831
4832(1038, 923, 348, 2032, 253) for a 5-word e-mail.
4833
4834Feature vector: vector of all word identifiers in the e-mail, e.g.
4835
4836Arthur Samuel (1959) "Field of study that gives computers the ability to learn without being explicitly programmed."
4837
4838Suppose that we test STUPID CLASSIFIER on 100 mails.
4839
4840Precision = 0 / 0
4841
4842Golden
4843
4844System Output
4845
4846Type of Outcome
4847
4848Example: To predict average temperature in 2020 learning on data from previous years.
4849
4850Formula for Gradient Descent
4851
4852Having m training samples (x,y)
4853
4854find a function h: x-> y based on the samples.
4855
4856x:
4857
4858Question: Which of the above characteristics (age, sex, vote) are features (x), and which one is label (y)?
4859Age = 36
4860
4861= (4,5, 4000)
4862
4863Interpretation: h(x) = P(y = 1 | x)
4864
4865Krzysztof Jassem
4866One-Hot Encoding
4867
4868Feature 6: Log word count of document
4869
4870Value: ln(64) = 4.15
4871
4872Weight: 0.7 (positive impact)
4873
4874Feature 3: Existence of word "no"
4875
4876Value: 1
4877
4878Weight: -1.2 (negative impact)
4879
4880[2, 0, 1, 1, 1, 1, 1] [2, 1, 1, 1, 1, 0] [4, 1, 2, 2, 2, 2, 1]
4881
P(choice = Duda) = -1.701 + 0.064 * 25 = -0.10
4883Logistic Regression
4884Sentiment Analysis - Calculations
4885Age = 25
4886
4887Goal: Automatic rating, e.g. 4.25
4888
4889y:
4890Age = 36
4891
4892Vector Space Model = [w(1), ...w(m)],
4893
4894w(i) = weight of i-th word
4895
48964. Use the GD algorithm to optimize the loss function
4897
4898Feature 4: No of personal pronouns
4899
4900Value: 3
4901
4902Weight: 0.5 (low positive impact)
4903Intuition
4904
4905Lexicon (7 words):
4906
4907a, husband, is, king, of, queen, wife
4908GD - Picture
4909Age = 36
4910Frequency Encoding
4911Cost function
4912Non-NLP Example
4913
4914What other factors may have impact on the first salary?
4915
4916Task: To classify a document as either positive or negative
4917
4918Task:
4919
4920Given a new hotel review, automatically score it with a real number that corresponds to the sentiment of the review:
4921
4922the more positive review, the higher score.
4923Multi-Variable Logistic Regression
4924Example
4925How to Apply Logistic Regression in NLP?
4926
4927TEST QUESTION 1
4928Age = 35
4929Woman = 1
4930Linear regression
4931
4932QUESTION:
4933
4934How to represent linear regression if there are more than one feature?
4935
4936Solution:
4937
49381. Take a set of hotel reviews (texts + scores, e.g. stars)
4939
49402. Build the Vector Model Space for each review (limit vocabulary)
4941
49423. Label each review with the number of stars scored to this review
4943
49444. Build a linear regression model on the data
4945Data Set
4946
4947Test Question 3
4948
4949Select ALL true statements for the logistic regression model in Duda vs Komorowski voting
4950
4951A) People aged 25 and under are almost sure to vote for Komorowski
4952
4953B) People aged 45 and above are almost sure to vote for Duda
4954
4955C) There is a distinct difference of preference between age 35 and 36
4956
4957D) There is a distinct difference of preference between age 45 and 46
4958
49593. Define the function that minimizes the prediction error (cross entropy loss function)
4960Sentiment Analysis - Experiment
4961Inverted Document Frequency
4962
ANSWER: No. The algorithm ends up in a local minimum.
4964
4965One-Hot Encoding - a |V|-long vector with values 1 or 0, indicating if a word is present in the document
4966
4967Feature 5: Existence "!"
4968
4969Value: 0
4970
4971Weight: 2 (high positive impact)
4972
4973P (choice = Duda) = -1.701 + 0.064 * 35 = 0.54
4974
4975Suppose we are at a local minimum. What will the next iteration do?
4976
A) Leave the theta value unchanged

B) Decrease the theta value

C) Increase the theta value
4982
4983D) Move towards global minimum
4984Sentiment Analysis
4985
4986-
4987
4988For a given set of linear coefficients:
4989When is Logistic Regression Used?
4990
4991Monthly salary
4992
4993Task: to classify a given document d into a class c
P(choice = Duda) = -1.701 + 0.064 * age
4995
Answer: The number of rows in X (n) is not equal to the number of columns in theta (n+1).
4997From Linear to Logistic
4998
49991000 people:
5000
5001 age: <18...100>
5002 sex: <1; 0> (woman, man)
5003 vote: <1; 0> (Duda, Tusk)
5004
5005Conclusions
5006
5007Linear regression:
5008
5009P(choice = Duda) = -1.701 + 0.064 * 36 = 0.60
5010Logistic Regression in NLP
5011Gradient Descent
5012Logistic Model
5013Age = 35
5014
50151. Represent the document d as a feature vector x
5016
5017m =
5018
5019Predicting the unknown value of a variable Y based on known values of some variables Xs, assuming there is a linear relationship between Xs and Y.
5020
5021Linear regression:
5022
50233. Logistic regression is used for classification tasks -- it calculates the probability of a class.
5024
5025V - finite (limited) vocabulary of words used in a set of documents.
5026Learning Regression
5027
5028Select the incorrect sentence:
5029
50301) The more documents contain the word, the higher is its weight
5031
50322) The more frequent a word is in the document, the higher is its weight
5033
50343) Sentiment analysis assigns, to a document, a score, which mirrors the author's opinion on a matter
5035
50364) For a given document the linear regression method returns its score (a real number)
5037
5038P (choice = Duda) = -1.701 + 0.064 * 35 = 0.54
5039
5040- ith sample
5041
5042Back to the "salary" example
5043
5044Feature 2: No of negative words
5045
5046Value: 2
5047
5048Weight: -5 (high negative impact)
5049
P(choice = Duda) = -1.701 + 0.064 * 25 = -0.10
5051
5052Feature 1: No of positive words
5053
5054Value: 3
5055
5056Weight: 2.5 (high positive impact)
5057Test Question 5
5058
5059Lexicon (7 words):
5060
5061[a, husband, is, king, of, queen, wife]
5062
5063Multinomial classification: more than 2 classes
5064Learning Data (Training Samples)
5065
5066P(choice = Duda) = -1.701 + 0.064 * 36 = 0.60
5067Linear Regression in NLP
5068
5069TEST: Find an error in the above formula
5070
5071Linear regression:
5072
5073x - input variable
5074Age = 35
5075Woman = 0
5076
5077QUESTION 1: What happens if alpha is too small?
5078
5079QUESTION 2: What happens if alpha is too large?
5080
5081Linear regression:
5082Age = 25
5083Regression Methods in NLP
5084
5085These are binary classification problems: 0 or 1
5086
5087Average mark
5088Document Representations
5089Probabilities Depending on Age
5090Formula for Logistic Regression
y = -1.701 + 0.064 * age
5092Formula for Linear Regression
5093
50942. Define the function that computes c from x
5095
5096 sigmoid function (for binary classification)
5097 softmax function for multinomial classification)
5098
5099Logistic regression applies regression (continuous-valued output) to classification problems.
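A sketch using the coefficients from the voting example above; the sigmoid squeezes the linear score into a probability in (0, 1):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

for age in (25, 35, 45):
    z = -1.701 + 0.064 * age          # linear score from the slides
    print(age, round(z, 2), round(sigmoid(z), 2))
# 25 -> -0.10 -> 0.47;  35 -> 0.54 -> 0.63;  45 -> 1.18 -> 0.76
```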
5100
5101Logistic regression
5102
51035
5104
51051) tf(t,d): frequency of the term t in the document d
5106
5107P (choice = Duda) = -1.701 + 0.064 * 45 = 1.18
5108
5109training set
5110
5111y - output variable
5112Linear Model?
5113
5114Linear regression - one feature:
5115
QUESTION: Does the algorithm end up in the same location independently of the starting point?
5117
5118P (choice = Duda) = -1.701 + 0.064 * 45 = 1.18
5119Local vs Global Minimum
5120
5121w(i) = w(t, d) = tf(t, d) * idf(t, d)
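A toy tf-idf sketch following this formula, with tf as the raw count and idf(t) = log(N / D); the three documents are invented:

```python
import math

docs = [
    "queen is a wife of a king".split(),
    "king is a husband of a queen".split(),
    "a dog chased a cat".split(),
]
N = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term)                    # term frequency in the document
    D = sum(1 for d in docs if term in d)   # documents containing the term
    return tf * math.log(N / D)

print(tf_idf("king", docs[0]), tf_idf("a", docs[0]))   # "a" gets weight 0
```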
5122
example from Jurafsky, Martin, 2018
5124
5125ANSWER 1: The algorithm is slow.
5126
ANSWER 2: The algorithm may overshoot the minimum.
5128
5129-
5130Rating a review
5131Age = 35
5132
2. Linear regression assumes a linear relationship between features and predicted values.
5134
51351. Regression is a ML method that predicts a continuous-valued output.
5136
51374. In NLP, features used in logistic regression involve:
5138
5139 lexicon;
5140 upper and lower-casing;
5141 punctuation marks;
5142 a lot of other features (e.g. parts of speech).
5143
5144Age = 45
5145
5146TEST QUESTION 8
5147
5148Queen is a wife of a king. King is a husband of a queen.
5149Woman = 1
5150Gradient Descent -
5151Formulas and Intuition
5152
5153Algorithm:
5154
Repeat adjusting theta by the above value until convergence
5156
5157Iterative algorithm for minimizing Cost Function:
5158
1) Start with any value of theta

2) Keep changing theta until the minimum value of the Cost Function is reached
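A minimal gradient descent sketch for one-variable linear regression h(x) = theta0 + theta1*x; the data set and learning rate alpha are invented toy values:

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 6.0, 8.1]
theta0, theta1, alpha, m = 0.0, 0.0, 0.01, len(xs)

for _ in range(5000):
    # partial derivatives of the mean-squared-error cost function
    grad0 = sum(theta0 + theta1 * x - y for x, y in zip(xs, ys)) / m
    grad1 = sum((theta0 + theta1 * x - y) * x for x, y in zip(xs, ys)) / m
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(round(theta0, 2), round(theta1, 2))   # roughly 0.0 and 2.0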
5162
5163Logistic regression calculates the probability that a given object belongs to a class.
5164Logistic Regression
51651- Variable (age) Example
5166
5167m - # training samples
5168
5169Test Question 2
5170
5171Select ALL correct answers.
5172
5173The minus value of the prediction function for age 25:
5174
5175A) is admissible in a classification problem
5176
5177B) is not admissible in a classification problem and shows the disadvantage of linear approach
5178
5179C) is caused by lack of training data for age = 25
5180
5181D) shows that people aged under 25 should not vote
5182Age = 45
5183
5184Answer: features: age, sex
5185
5186label: vote
5187
5188Queen is a wife of a king. King is a husband of a queen.
5189Woman = 0
5190Logistic Function
5191
5192N = # all documents
5193
5194D = # documents that contain term t
5195
51962) inverse document frequency:
5197
the fewer documents contain the word, the more informative it is: idf(t) = log(N / D)
5199
5200Two variables:
5201
5202COST FUNCTION
5203
5204Linear relationship between university mark average and the first salary
5205
5206Single variable:
5207
5208cost function returns the difference between real values y and values h(x) calculated by the linear function h().
5209
5210Conclusion
5211
5212For the cost function, the Gradient Descent algorithm ends up in the global minimum.
5213
5214Logistic regression:
5215
5216Logistic regression:
5217
Learning regression consists in finding the parameters theta which minimize the cost function.
5219
5220[1, 0, 1, 1, 1, 1, 1] [1, 1, 1, 1, 1, 1, 0] [1, 1, 1, 1, 1, 1, 1]
5221
52221-st sentence 2nd sentence Both sentences
5223
5224(repeated)
5225
5226Logistic regression
5227
5228ANSWER
5229
5230(repeated)
5231
52321 variable:
5233
52342 variables:
5235Sigmoid function
5236
5237Repeat:
5238
5239alpha controls the size of the step
5240
5241For all 'j' simultaneously:
5242
5243By definition, the cost function J always looks like right-hand picture. Such a function is called convex.
5244
5245MATRIX FORM:
5246
5247
5248Perceptron - neural unit with linear activation functions and binary output (0 or 1).
5249
5250 NNs are built of neural units (inspired by human neurons)
5251
5252Task: Define an NN with one hidden layer for XOR function and ReLU activation function.
5253
5254Hint: Use OR and AND functions as the hidden layer.
5255
5256The logical functions may be calculated by a perceptron.
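A sketch of such perceptrons with hand-picked weights and bias (binary step output):

```python
def perceptron(w1, w2, bias):
    return lambda x1, x2: 1 if w1 * x1 + w2 * x2 + bias > 0 else 0

AND = perceptron(1, 1, -1.5)
OR = perceptron(1, 1, -0.5)

for x in ((0, 0), (0, 1), (1, 0), (1, 1)):
    print(x, AND(*x), OR(*x))
```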
5257XOR Problem
5258Units (Neurons)
5259
5260All activation functions:
5261
5262Algorithms
5263
5264"Linguistic Regularities in Continuous Space Word Representations"
5265
5266Mikolov, Yih, Zweig (2012)
5267
5268Krzysztof Jassem
5269
5270XOR function
5271
5272AND function
5273
5274 In some cases the objects may be classified to one of the groups directly (OR, AND).
5275 Then, one-step methods like perceptron (linear regression) or logistic regression are sufficient.
5276
5277Task: Define a perceptron that calculates the XOR function:
5278
5279 returns 1 for: (1, 0), (0, 1)
 returns 0 for (0, 0), (1, 1)
5281
5282AND function
5283Context Probabilities
5284
5285Answer: 3 x 4 x 2
5286Multi-layer Neural Networks
5287Output
5288
5289Word Embeddings: Representations of words as numerical vectors
5290
5291The output of the perceptron is defined by the function:
5292
 Each unit multiplies input values by weights, adds a bias and applies a non-linear function.
5294
5295Applications
5296
5297Task: Build a perceptron that calculates logical functions.
5298
5299OR function
5300Neural Networks
5301Input
5302
5303Answer: B
5304Word Embeddings
5305
5306 Neural Network may be used in NLP to predict a word based on previous context.
5307
5308Test Question 2
5309
5310dobry
5311
5312kobiet
5313
5314jak
5315What is the size of matrix if NN consists solely of:
5316
5317 2 input units
5318 3 2nd-layer units?
5319
53201) 2 x 3
53212) 3 x 3
53223) 4 x 3
53234) 4 x 4
5324
5325Draw a simple DNN with:
5326
5327- 2 input units,
5328
5329- 2 hidden layers
5330
5331- 1 output unit
5332
5333Set all values of theta parameters.
5334
5335Assume g = Id (linear regression model) in all transitions.
5336
5337Calculate: h(-2, 1).
5338
McCulloch-Pitts neuron, 1943
5340
5341Backward propagation is a learning algorithm for finding NN parameters.
5342
5343The algorithm starts from the output units in order to minimize cost functions of all activations.
5344
5345The calculations use the chain rule of differentiation for partial derivatives.
5346
5347Andrew Ng, Coursera Machine Learning Course
5348
5349Conclusion: No perceptron can be defined for XOR function.
5350
5351Softmax function normalizes any vector to a vector of probability values:
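A sketch of the softmax function (the input vector is an invented example):

```python
import math

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

print(softmax([2.0, 1.0, 0.1]))   # ~[0.66, 0.24, 0.10]
```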
5352Test Question 1
5353More hidden layers
5354
5355Predicting the next word based on previous ones.
5356
5357A) are differentiable
5358
5359B) are non-negative
5360
5361C) are non-decreasing
5362
5363D) are continuous
5364
5365E) have values between 0 and 1
5366Logical Perceptrons
5367
5368with sigmoid activations
5369
5370Give the correct interpretation of the coefficient:
5371
5372-
5373
5374Dzień...
5375Neural Networks
5376– Introduction
5377Test Question 3
5378Neural Network
5379
5380Applying NN to a classification problem:
5381
5382 output values should be between:
5383 output values should sum up to:
5384
5385Individual Task
5386
5387Task: Define a perceptron that calculates the AND function:
5388
5389 returns 1 for: (1, 1)
5390 returns 0 for (0, 0), (0, 1), (1, 0)
5391
5392One hidden layer
5393
5394Task: Define a perceptron that calculates the OR function:
5395
5396 returns 1 for: (1, 1), (1, 0), (0, 1)
5397 returns 0 for (0, 0)
5398
53990 and 1
5400
54011
5402
5403Answer: 3 x 3 (bias unit)
5404
5405 However, in some classification tasks intermediate representations need to be built (XOR example).
5406 Such representations are found by hidden layers in neural networks.
5407
5408Good...
5409
5410Daniel Jurafsky & James H. Martin. Speech and Language Processing. Draft of September 23rd, 2018
5411Test Question 4
5412Architecture
5413Test Question 5
5414
5415Answer: parameter that corresponds to:
5416
5417 2nd layer (1st hidden layer)
5418 2nd neuron (a2) in the layer
5419 3rd input (x3) to the neuron
5420
5421What is the minimum size of matrix if NN consists solely of:
5422
5423 2 input units
5424 3 2nd-layer units
5425 3 3rd-layer units?
5426
54271) 2 x 3 x 2
54282) 3 x 3 x 2
54292) 3 x 4 x 2
54303) 3 x 3 x 3
5431Backward Propagation
5432
5433Neural Networks are a generalization of logistic regression:
5434
5435 applied repeatedly
5436 using various logistic functions
5437
A neural network presented in the video was used for recognition (not for generation):
5439A) generates a hand-written form of a digit
5440B) classifies a hand-written digit into one of 10 classes
5441C) recognizes the author of the writing
5442Forward Propagation
5443
5444Neuron is taking:
5445
5446 weighted sum of its inputs
5447 additional term called bias
5448
5449luck
5450
5451for
5452
5453news
5454Other Activation Functions
5455
5456-
5457Logical Functions
5458Perceptron - Output
5459
5460Correct answers:
5461
5462C)
5463
5464D)
5465
5466If ANN are used for classification problems, then Softmax is used as activation function in the output layer.
5467
5468 In NLP Neural Networks words are represented by word embeddings (learned by other NNs).
5469
5470Conclusions
5471
Deep learning - learning with NNs built of a few hidden layers
5473Softmax Function
5474
5475Nice to ...
5476
5477Przesyłam gorące...
5478Summary
5479NLP Applications
5480Neural Networks
5481
5482Calculating outputs of consecutive layers.
5483
5484see
5485
5486be
5487
5488hear
5489
5490 Neural networks are trained by algorithms such as Gradient Descent.
5491
5492pozdrowienia
5493
5494tematy
5495
5496i
5497
5498Probability of a word given all previous words
5499
55004-gram:
5501
5502Probability of a word given 3 previous words
5503
5504sigmoid as activation function
5505
5506Approximation:
5507
5508Probability of a word given (N-1) previous words
5509
5510Reason: Results of XOR cannot be separated by a line:
5511
5512activation function
5513
5514[
5515
5516ReLU
5517
5518sigmoid applied to weighted input
5519
5520tanh
5521
5522Method:
5523
5524Recurrent Neural Networks
5525
5526Input: Words represented as one-hot vectors: (0, 0, 0, ..., 1, 0, 0)
5527
5528length of the vector = size of lexicon
5529
5530Output: Word Embeddings - Representations of words as
5531
5532M-dimensional vectors
5533
5534Training Set / Dev Set / Test Set
5535
5536Experimentally confirmed value of lambda: 0.4
5537Add-k Smoothing
5538Restaurant Example - Counts
Short Messages (SMS)
5540
5541Dzień...
5542
5543wikipedia
5544Information Retrieval
5545
5546Language model is a function that assigns probabilities to sequences of words in the language.
5547N-grams
5548Guess the author of this drama
5549
5550How are...
5551
5552Backoff (trigrams)
5553
5554Dziękuję za...
5555
5556Each n-gram is chosen at random.
5557
However, the more frequent the n-gram, the more likely it is to be chosen.
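A minimal sketch of this sampling scheme for bigrams; the tiny counts dictionary below is made up for illustration:

import random

# hypothetical bigram counts: counts[w1][w2] = how many times the bigram (w1, w2) was seen
counts = {"how": {"are": 3, "do": 1}, "are": {"you": 4}, "you": {"doing": 2, "feeling": 1}}

def next_word(word):
    """Pick the next word at random, proportionally to the bigram frequency."""
    followers = counts[word]
    words = list(followers)
    weights = [followers[w] for w in words]
    return random.choices(words, weights=weights, k=1)[0]

print("how", next_word("how"))   # "are" is chosen about 3 times out of 4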
5559
5560Krzysztof Jassem
5561
2. Choose all 5-word sequences that appear at least 40 times
5563Hand-writing recognition
5564
The lower the perplexity, the better the language model.
5566Bigram Add-1 Formula
5567
Joint probability of an n-word sequence: the probability that such a sequence appears in the language.
5569
5570słychać, wolisz, słychać u Ciebie
5571Stupid Backoff
5572
5573Documents are ranked based on the probability of the query Q in the document's language model.
5574
The development set is the part of the data on which the developer checks whether the model works.
5576Good vs Bad LM
5577Chain rule
5578
5579- use hash tables for representing words
5580
5581- store probabilities as 4-8 bits rather than 8 byte floats
5582
PP is undefined (cannot divide by zero)
5584
5585Number of word types:
5586Zeros!
5587
5588Test Set (W):
5589
5590How are you feeling these days?
5591
5592you, things?
5593
5594It’s fun to wreck a nice beach.
5595Bigrams (N=2)
5596Language Modeling
5597Restaurant Example - Probabilities
5598
5599actress
5600
5601across
5602
5603acres
5604
5605caress
5606
5607access
5608
5609From: Jurafsky, Martin, to appear
5610
5611What is the sum of Laplace probabilities over all words from V?
5612
5613Informally: For a given sequence of words, how likely is that sequence to appear in the language.
5614
5615Suggested partition:
5616
561780 / 10 / 10
5618
5619P(How are you doing) =
5620
5621# (How are you doing) /
5622
5623# (all word sequences of length 4)
5624Applications
5625
5626Compare P(to | want)!
5627
5628soon, in, then
5629Perplexity
5630
563113 million
5632
5633Idea: When the number of N-grams is zero, use (N-1)-grams, (N-2)-grams ...
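A minimal sketch of this idea for trigrams in the stupid-backoff style: if the trigram is unseen, fall back to the bigram, then the unigram, each time multiplying by a fixed weight (0.4 here is an illustrative value; the tiny frequency tables are made up):

# hypothetical relative frequencies; anything missing is treated as unseen
tri = {("how", "are", "you"): 0.5}
bi  = {("are", "you"): 0.4}
uni = {"you": 0.1}

def backoff_score(w1, w2, w3, alpha=0.4):
    """Score S(w3 | w1 w2): use the trigram if seen, otherwise back off with a fixed weight."""
    if (w1, w2, w3) in tri:
        return tri[(w1, w2, w3)]
    if (w2, w3) in bi:
        return alpha * bi[(w2, w3)]
    return alpha * alpha * uni.get(w3, 0.0)

print(backoff_score("how", "are", "you"))   # trigram seen: 0.5
print(backoff_score("why", "are", "you"))   # unseen trigram, seen bigram: 0.4 * 0.4 = 0.16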
5634
5635P(How are you doing) =
5636
5637P(How) * P(are | How) * P(you | How are) * P(doing | How are you)
5638
For N-gram models, the higher N, the lower the perplexity.
5640
Jurafsky, Martin, Speech and Language Processing, 3rd edition draft, unpublished
5642
The training set is the part of the data on which the model is estimated.
5644
5645A) are differentiable
5646
5647B) are non-negative
5648
5649C) are non-decreasing
5650
5651D) are continuous
5652
5653E) have values between 0 and 1
5654Interpolation (trigrams)
5655
5656PANDARUS:
5657
5658Alas, I think he shall be come approached and the day
5659
5660When little srain would be attain'd into being never fed,
5661
5662And who is but a chain and subjects of his death,
5663
5664I should not sleep.
5665
5666SECOND SENATOR
5667
5668They are away this miseries, produced upon my soul,
5669
5670Breaking and strongly should be buried, when I perish
5671
5672The earth and thoughts of many states.
5673
5674DUKE VINCENTIO:
5675
5676Well, your wit is in the care of side and that.
5677
5678SECOND LORD
5679
5680They would be ruled after this chamber, and
5681
5682my fair nues begun out of the fact, to be conveyed,
5683
5684Whose noble souls I'll have the heart of the wars.
5685
5686Stupid backoff is a type of interpolation model.
5687
5688-
5689Language Model
5690Test Question 5
5691
5692dobry, kobiet, jak
5693
5694Smoothing assigns non-zero probabilities to non-occurring
5695
5696N-grams.
5697
5698for, in advance
5699
The test set is the part of the data on which the final solution is evaluated.
5701
5702actress
5703
5704across
5705
5706acres
5707
5708caress
5709
5710access
5711Machine Translation
5712
5713informację, pamięć
5714Add-1 Smoothing - Unigrams
5715
57161. Collect the corpus. Number of words:
5717
5718http://karpathy.github.io/2015/05/21/rnn-effectiveness/
5719Laplace Smoothing
5720
5721Co ...
5722
5723Yesterday I was walking acress the street.
5724
57251
5726Google N-gram Viewer
5727Definition
5728
5729Number of 5-grams:
5730
5731Add-k makes sense for k < 1.
5732N-grams
5733
5734Jak się masz?
5735
5736How are you?
5737
5738-
5739
5740Correct answers:
5741
5742C)
5743
5744D)
5745Backoff and Interpolation
5746
5747How to acress the Deep Web?
5748Add-1 Smoothing - Bigrams
5749
5750Who is the President of Microsoft?
5751
5752How do you have yourself?
5753
5754https://books.google.com/ngrams
5755Speech Recognition
5756log probabilities
5757
5758Training Set:
5759
5760How are you doing today?
5761
5762How are you doing these days?
5763
5764I am feeling OK.
5765
The higher the probability of the test set, the better the language model.
Unknown Words
5768
5769It works surprisingly well for large corpora.
5770
5771See you....
5772Text Generation on N-grams
5773Practice
5774Zero probabilities
5775Smoothing
5776Real corpora
5777
5778Add-k requires experiments on the devset in order to tune k.
5779
5780Thanks...
5781
5782It’s fun to recognize speech.
5783
5784No smoothing
5785
57864. Build the language model on the data
5787
5788P (I want to eat Chinese lunch) = ?
5789
5790P(W) = 0
5791Spelling Correction
5792Google Corpus (2006)
5793
5794Standard unigram probability
5795
5796(without Add-1):
5797
5798Chain Rule
5799
5800or even:
58011. Closed vocabulary
5802
1. Replace all words outside the vocabulary by UNK
5804
2. Create the model, treating UNKs as any other words
5806
5807P(How are you doing) ~
5808
5809P(How) * P(are| How)
5810
5811* P(you | are) * P(doing | you)
5812
5813Żyd karabin niesie.
5814
5815Żydka rabin niesie.
5816
5817No smoothing
5818
5819Standard bigram probability
5820
5821Decomposition of the joint probability:
5822
5823Perplexity for bigram model
5824
5825Word history probability:
5826
5827P(doing| How are you) =
5828
5829count(How are you doing)
5830
5831count (How are you)
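A minimal sketch of such maximum-likelihood estimates, built from the toy training set used above (lower-cased, punctuation stripped) and applied with the bigram approximation:

from collections import Counter

# toy training set from the slides
corpus = ["how are you doing today", "how are you doing these days", "i am feeling ok"]
tokens = [w for line in corpus for w in line.split()]

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))     # sentence boundaries ignored for brevity

def p_bigram(w2, w1):
    """Maximum-likelihood estimate: P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

# bigram approximation of the chain rule (the initial P(how) term is left out for brevity)
print(p_bigram("are", "how") * p_bigram("you", "are") * p_bigram("doing", "you"))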
5832
5833Add-1 smoothing
5834
5835Continuous Test Set
5836
5837P(doing | How are you) ~
5838
5839P(doing | you)
5840
5841if zero:
5842
5843Add-1
5844
5845Add-1 bigram probability
5846
5847Then take:
5848
5849access
5850
5851actress
5852
5853across
5854
5855acres
5856
5857caress
5858
5859Unigrams:
5860
3. Count all N-grams with N between 1 and 5
5862
5863(and release the data to public)
5864Trigrams (N=3)
5865
5866P(doing | How are you) ~
5867
5868P(doing | are you)
5869
5870P(How are you doing) ~
5871
5872P(How) * P(are | How) *
5873
5874P(you| How are ) * P(doing | are you)
5875
Joint probability:
5877
5878P(How are you doing) =
5879
5880count(How are you doing)
5881
5882count (all 4-word sequences)
5883
5884Laplace Add-1 unigram probability:
5885
5886V is the number of all words (size of Vocabulary)
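A minimal sketch of the Add-1 (Laplace) estimate for bigrams, (count(w1 w2) + 1) / (count(w1) + V); the toy tokens below are made up:

from collections import Counter

tokens = "i want to eat i want chinese food".split()   # made-up toy data
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)                                       # size of the vocabulary

def p_laplace(w2, w1):
    """Add-1 smoothed bigram probability: (count(w1 w2) + 1) / (count(w1) + V)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(p_laplace("to", "want"))     # seen bigram: (1 + 1) / (2 + 6) = 0.25
print(p_laplace("food", "want"))   # unseen bigram still gets a small non-zero probability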
5887
5888Find:
5889
5890across
5891
5892actress
5893
5894acres
5895
5896caress
5897
5898access
5899
5900Recurrence stops at unigrams.
5901Maximum Likelihood Estimation:
2. Open vocabulary with infrequent words
5903
1. Replace all infrequent words by UNK
5905
59062. Create the model as before
5907
Infrequent words - words that occur fewer than n times (likely errors).
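A minimal sketch of step 1, assuming a frequency threshold of 2 and the token name UNK (both illustrative):

from collections import Counter

tokens = "a a a b b c".split()        # toy data; 'c' occurs only once
counts = Counter(tokens)

THRESHOLD = 2                         # hypothetical value of n
mapped = [w if counts[w] >= THRESHOLD else "UNK" for w in tokens]
print(mapped)                         # ['a', 'a', 'a', 'b', 'b', 'UNK']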
5909
5910Formula for Perplexity
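The formula itself is not reproduced in this dump; the standard definition is PP(W) = P(w1 w2 ... wN)^(-1/N). A minimal sketch computing it in log space (the per-word probabilities below are made up):

import math

def perplexity(word_probs):
    """Perplexity of a test set, given the probability the model assigns to each word."""
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)   # fails if any probability is 0 (the "zeros" problem)
    return math.exp(-log_prob / n)

print(perplexity([0.1, 0.2, 0.05, 0.1]))   # about 10; the lower the perplexity, the better the model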
5911
5912
5913Assume that all weights are equal to 1.
5914
Source: Jurafsky, Martin
5916
5917This is a [braod: broad] statement on the coat [baehavour: behavior].
5918
5919count(oa) = How many times oa should have appeared in total.
5920
5921Edit Distance - measure of graphical dissimilarity of two words
5922
5923C(x) - set of all candidates for x
59242-gram model
5925
1. Levenshtein distance is an old concept still used in text processing.
5927
5928Krzysztof Jassem
5929
59300 (E X ; E X)
5931Noisy channel
5932
5933E X P O R T -> E X P I R T -> E X P I R E -> E X P I R E S
5934
5935P(whiskey on the rocks) =
5936
P(rocks | the) * P(the | on) * P(on | whiskey) =
5938
59391/4 * 2/4 * 3/4 = 3/32
5940Confusion Matrix
5941
59422. Damerau-Levenshtein distance is applied.
5943
5944whisker on the rocky
5945
5946whisked on the rocks
5947Real-word Errors
5948Non-word Correction
5949
5950whisker on them rocks
5951Prior probability
5952Language Tool
5953
5954p(whisker | whisker) = 0.6
5955
5956p(whiskey | whisker) = 0.3
5957
5958p(whisked | whisker) = 0.1
5959
5960p(vodka | whisker) = 0
5961
5962whiskey on the rock
5963
5964An error is something you have done which is considered to be incorrect or wrong.
5965
5966Candidate:
5967
5968whisker an the rocks
5969
5970P(baot | bot) = ins[b,a] / count(b) = 1 / 7
5971Distance Between Substrings
5972Edit probability
5973
5974P(whisker) = P(whisker on the rocks) * P(whisker | whisker) =
5975
5976= 1/32 * 0.6 =
5977
5978= 6/320
5979fourty
5980
5981P(whiskey) = P(whiskey on the rocks) * P(whiskey | whisker) =
5982
5983= 3/32 * 0.3 =
5984
5985= 9/320
5986
Dynamic programming: a large problem can be solved by combining solutions to sub-problems.
5988
5989Assume that word X has length m
5990
5991Assume that word Y has length n
5992
59931. The observed word x is out of vocabulary D
5994achieve
5995
5996D(4, 4) =
5997
5998A) are differentiable
5999
6000B) are non-negative
6001
6002C) are non-decreasing
6003
6004D) are continuous
6005
6006E) have values between 0 and 1
6007
6008X : observed sentence (with at most one error)
6009
6010W: candidate correction sentence
6011
6012C(X): set of candidate correction sentences
6013
6014D(i, j) = distance between X[1...i] and Y[1...j]
6015Task & Solution
6016Introduction
6017
6018based on: Daniel Jurafsky & James H. Martin. Speech and Language Processing
6019
6020Whisker on the rocks, please.
6021
6022Minimum edit distance is the minimum number of editing operations needed to transform one word to another.
6023
6024CORRECTED!
6025
6026Take alpha = 0.6 (actually, alpha should vary between 0.95 and 1)
6027break
6028Definitions
6029
60303
6031
6032https://languagetool.org/
6033
6034x - observed word
6035Language model
6036
6037boat
6038
6039Alignment is a correspondence between characters.
6040
6041whisker on then rocks
6042
6043whisker on Małysz's face
6044
6045trans[o,a] = How many times oa was mistakenly typed as ao in the corpus of errors
6046
6047Levenshtein distance: weight 1 for all: deletion, insertion, substitution
6048
6049-
6050
6051Correct answers:
6052
6053C)
6054
6055D)
6056
6057Solution: Dynamic programming
6058
6059whiskey on the rocks
6060through
6061
6062Language Tool is a proof-reading web service created collaboratively by users.
6063acheive
6064
6065P(boat) = 3/30 = 1/10
6066
6067P(bait) = 1/30
6068
6069P(bat) = 2/30 = 1/15
6070
6071P(bot) = 1/30
6072
6073Set of candidates:
6074
Distance between X and Y is equal to D(m, n)
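A minimal sketch of the Wagner & Fischer dynamic programming with Levenshtein weights (cost 1 for deletion, insertion and substitution):

def min_edit_distance(x, y):
    """D(i, j) = distance between x[:i] and y[:j]; the answer is D(m, n)."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete all characters of x
    for j in range(n + 1):
        d[0][j] = j                       # insert all characters of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
            # a Damerau-Levenshtein variant would also check a transposition here
    return d[m][n]

print(min_edit_distance("EXPORT", "EXPIRES"))   # 3, as in the example above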
6076
6077Can you lave him my message?
6078Summary
6079
6080Editing operation is a simple action on characters, such as:
6081
6082 insertion, e.g. bid - > bird
6083 deletion, e.g. bird - > bid
6084 substitution, e.g. bid -> bad
6085 (transposition, e.g. bird -> brid)
6086
6087Real Word Correction
6088
6089[Al-Naomi: Al-Naimi] lives in Nagao.
6090
60911 (E X P ; E X P I)
6092
6093P(baot | bat) = ins[a,o] / count(a) = 2 / 17
6094
6095For all candidates,
6096
6097distribute probability proportionally to their edit probabilities.
6098Simple Model
6099
6100p(on | on) = 0.6
6101
6102p(an | on) = 0.25
6103
6104p(one | on) = 0.15
6105advise
6106
61073. In non-real word correction, we estimate probabilities of:
6108
6109 character corrections
6110 unigrams (single words)
6111
6112Alignment
6113
6114whisker one the rocks
6115
6116Observed error word x: baot
6117Edit Distance
6118brake
6119
6120All activation functions:
6121
6122More Sophisticated Model
6123
6124E X P O R T -> E X P I R T
6125
6126This talk is conducted bye Jassem.
6127
6128P(baot | boat) = trans[o,a] / count(oa) = 2 / 3
6129Example
6130
P(baot | bait) = sub[i,o] / count(o) = 1 / 8
6132
61332. Spell-checking is handled by statistical methods.
6134
Damerau-Levenshtein distance is a version of the Levenshtein distance with transposition of adjacent characters (xy -> yx) as a single editing operation.
6136
61377
6138adress
6139Algorithm
6140
6141w - candidate word
6142Error corpus - Example
6143compliment
6144Noisy Channel
6145Wagner & Fischer Algorithm
6146
6147whiskey in the glass
6148
6149E X P I R E S has length
6150
6151vodka on the rocks
6152
6153according to: Collins dictionary
6154
6155D(2, 2) =
6156Spell Checking
6157
6158P(baot | boat) = trans[o,a] / count(oa)
6159
61606
6161
6162D(6, 5) =
6163Total probability
6164
6165P(whiskey) > P(whisker)
6166
6167P(whisker on the rocks) =
6168
P(rocks | the) * P(the | on) * P(on | whisker) =
6170
61711/4 * 2/4 * 1/4 = 1/32
6172Assumptions
6173Model
6174Definition
6175
6176A number type may be [flaot:float], [reaol:real], integer, [naotural:natural].
6177definitely
6178
6179whisker and the beard
6180Candidate sentences
6181Non-word errors
6182Test Question 1
6183
6184Boat is chosen as the most likely candidate for the correction.
6185advice
6186
6187D(6, 7) = 3 (E X P O R T; E X P I R E S)
6188
6189whiskey on ice
6190
61911 ( E X P O; E X P I)
6192
6193Levenshtein distance between EXPORT and EXPIRES is:
6194
6195For any two strings the number of alternative alignments is very large.
6196
6197whisker on the rock
6198
6199-
6200Problem
6201Assumptions
6202
6203whiskey on the rocks
6204Formulas
6205
6206Task: Find the alignment with the minimum Levenshtein distance.
6207
1. There is no more than one error word in the sentence.
6209
6210whisker on the rocks
6211
6212Spelling error is an error in the conventionally accepted form of spelling a word.
6213definately
6214Channel Model
6215Examples
6216
6217whisker vs moustashe
6218
62194. In the real-word correction we estimate probabilities of:
6220
6221 word corrections
6222 whole sentences
6223
6224p(whisker | whisker) = 0.6
6225
6226p(whiskey | whisker) = 0.2
6227
6228p(whisked | whisker) = 0.2
6229
6230p(vodka | whisker) = 0
6231
6232advice, advise
6233
6234Diane gave him great advice. ('Advice' means to give recommendation.)
6235
6236The doctor will advise you about which exercises w
6237
6238Real word errors - errors that result in an existing word.
6239forty
6240
6241p(rocks | rocks) = 0.6
6242
6243p(rock | rocks) = 0.35
6244
6245p(rocky | rocks) = 0.05
6246
6247A spelling is the correct order of the letters in a word.
6248
6249I am leaving in about fifteen minuets.
6250complement
6251
6252E X P O R T has length
6253Best Candidate
6254
6255p(the | the) = 0.6
6256
6257p(them | the) = 0.2
6258
6259p(then | the) = 0.2
6260
6261A letter with mistakes to check by Language Tool:
6262
6263https://www.englisch-hilfen.de/en/exercises/structures/error_text_letter.htm
6264
62652. The edit distance between the error and the intended word is equal to 1.
6266threw
6267
6268Distance between words: sum of editing operations multiplied by their weights
6269Pseudo Code
6270
62713. The set of candidates is limited to words, which:
6272
6273 belong to D
 have an edit distance of 1 from x (see the sketch below)
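A minimal sketch of that candidate generation: produce every string at edit distance 1 from x and keep only the dictionary words (the small dictionary D below is made up):

import string

def edits1(word):
    """All strings at edit distance 1 from word (deletions, transpositions, substitutions, insertions)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutions = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + substitutions + inserts)

D = {"boat", "bot", "bat", "bait", "about"}            # hypothetical dictionary
candidates = {w for w in edits1("baot") if w in D}
print(candidates)                                       # {'boat', 'bot', 'bat', 'bait'}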
6275
6276A confusion matrix lists how many times one thing was confused with the other.
6277Example
6278
6279D(3, 4) =
6280address
6281
6282baot boat transposition [ao | oa]
6283
6284baot bait substitution [o | i]
6285
6286baot bat insertion [ao | a]
6287
6288baot bot insertion [ba | b]
6289
Task: Calculate the matrix for (EXPORT, EXPIRES)
6291
62922 (E X P O R T ; E X P I R)
6293
6294p(on | on) = 0.6
6295
6296p(an | on) = 0.2
6297
6298p(one | on) = 0.2
6299
The prior probability is estimated with a unigram language model built from a large corpus.
6301
6302E X P I R T -> E X P I R E S
6303
6304Diane gave him great advice.
6305
6306The doctor will advise you about
6307
6308your exercises.
6309
Four confusion matrices are designed, one for each type of edit operation:
6311
6312del[x,y]: count (xy typed as x)
6313
6314ins[x,y]: count (x typed as xy)
6315
6316sub[x,y]: count (x typed as y)
6317
6318trans[x,y]: count (xy typed as yx)
6319
6320sub[i,u]: How many times i was mistakenly typed as u.
6321
6322Find the most likely word w that was misspelled as x
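A minimal sketch of the noisy-channel choice, argmax over w of P(x | w) * P(w); the channel and prior values below are the rounded numbers from the baot example, which in practice come from the confusion matrices and a unigram language model:

# channel model P(x | w): probability that the intended word w was typed as x = "baot"
channel = {"boat": 2/3, "bait": 1/8, "bat": 2/17, "bot": 1/7}
# prior P(w): unigram probabilities estimated from the example corpus
prior   = {"boat": 1/10, "bait": 1/30, "bat": 1/15, "bot": 1/30}

# noisy channel: pick w maximising P(x | w) * P(w); P(x) is the same for every w, so it is ignored
best = max(channel, key=lambda w: channel[w] * prior[w])
print(best)   # 'boat' wins: 2/3 * 1/10 is the largest product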
6323
6324E X P O R T *
6325
6326E X P I R E S
6327
6328s s i
6329
6330E A X P O R T *
6331
6332A E X P I R E S
6333
6334s s s s i
6335
6336Denominator P(x) is the same for all w
6337
6338Sources of confusion matrices:
6339
6340www.dcs.bbk.ac.uk/~ROGER/corpora.html
6341
6342norvig.com/ngrams/
6343
6344E * X P O R T *
6345
6346* E X P I R E S
6347
6348d i s s i
6349
6350I am not a bot. I have a boat. My boat is not similar to any other boat. I also have a special fishing bat. This bat launches live baits.
6351
6352picture: https://www.saltwatersportsman.com/bait-chummer-baseball-bat-fishing-tips
6353
6354Limit the vocabulary by a subset C
6355
6356of good candidates
6357
6358E X P O R T *
6359
6360E X P I R E S
6361
6362s s i
6363
6364E X P O * R T *
6365
6366E X P * I R E S
6367
6368d i s i
6369
6370That pink blouse complements your
6371
6372gray pants.
6373
6374My instructor gave me a compliment
6375
6376on my essay.
6377
6378I'll need a break before going back to work.
6379
6380This car needs a new set of brakes.
6381
6382Bayes' Rule for conditional probability
6383
6384He threw out the garbage.
6385
6386I drove through a small town.
6387
6388
6389<orth>ur.</orth>
6390
6391<lex>
6392
6393<base>urodzony</base>
6394
6395<ctag>adj:sg:gen:m1:pos</ctag>
6396
6397</lex>
6398
Shallow 2. Apply SPADE to transform Polish cardinal numerals into their "spoken" form, e.g.
6400
6401IN: Mam 5 znaczków OUT: Mam pięć znaczków
6402
6403IN: Mam 5 kolegów OUT: Mam pięciu kolegów.
6404
6405Maximum score: 200
6406
6407IN:
6408
6409What are we asking about?
6410
6411Oto portret Jana ur. w 1965 roku.
6412Suppose the following input to SPADE:
6413Widziałem grzeczny dziewczynka
6414The Basic Nominal Group rule (Select ALL correct answers):
6415A) will group together grzeczny dziewczynka
6416B) will assign syntax head to dziewczynka
6417C) will assign syntax head to grzeczny
6418D) will NOT group together grzeczny dziewczynka
6419What's puddle?
6420What's NERT?
6421Lab Tasks
6422Question 5
6423
6424Gdzie jest Beata Szydło?
6425
6426Gdzie pracuje Krzysztof Jassem?
6427
6428Kiedy Barack Obama przyjechał do Polski?
6429
6430Kiedy Donald Trump ożenił się z Melanie?
6431Numerals
6432Shallow parsing in
6433Question Answering
6434
6435W 223. odcinku.
6436
6437Dodaj 2/3 l mleka.
6438
6439Art. 242 pkt 1 kodeksu karnego
6440
6441<orth>ur.</orth>
6442
6443<lex>
6444
6445<base>urodzony</base>
6446
6447<ctag>adj:number*:case*:gender*:pos</ctag>
6448
6449</lex>
6450
6451https://cogcomp.cs.illinois.edu/page/demos/
6452
6453 type of a question (e.g. temporal, spatial)
6454 subject (subject of action)
6455 action (main activity)
6456 phrases for the action (various phrases that may represent action)
6457 constraints (action objects, location, time, etc.)
6458
6459Rule syntax
6460Shallow Parsing
6461Question 6
6462Action
6463
6464 Aim: Incomplete analysis of a sentence
6465 Results: Simple representations of parts of sentences
6466 Methods:
6467 Regular expressions
6468 Statistical computations
6469 Efficiency: High
6470
6471Shallow Parsing - Basic Ideas
6472
6473 Rule (name of the rule)
6474
6475 Left (left context of the group)
6476
6477 Match (group of tokens to match)
6478
6479 Right (right context of the group)
6480
6481 Eval (the result of matching)
6482
6483PESEL 87101615001
6484
6485NIP 1001230210
6486
6487+48 606606606
6488
6489Can we use SPADE in TTS?
6490
6491Miałem 5 żon.
6492
6493art
6494
6495Rule "ok. --> około"
6496
6497Match: [orth~"ok/i"] ns [orth~"\."];
6498
6499Eval: word(qub, "około");
6500
6501https://responsivevoice.org/text-to-speech-languages/polski-syntezator-mowy/
6502
6503no space between tokens
6504Example Rule (2)
6505Parsing may help a text synthesizer to correctly read (select ALL correct answers):
6506A) Acronyms (e.g. IBM)
6507B) Abbreviations (e.g. prof. Miodka)
6508C) Numerals (e.g. Zabrakło 14 głosów)
6509D) Telephone numbers (e.g. Mój telefon to +48 60606606)
6510
6511Miałam pięciu mężów.
6512
6513Marta Wieczorek, "Lingwistyczne aspekty syntezy mowy z tekstu",
6514
6515Msc thesis 2010, supervisor: Krzysztof Jassem
6516
6517puddle - shallow parser in PSI-Toolkit
6518Question 3
6519
6520This is a parcel for professor Gracz.
6521
6522Rule "ur. --> urodzony"
6523
6524Match: [orth~"ur/i"] ns [orth~"\."];
6525
6526Eval: word(adj:number*:case*:gender*:pos, "urodzony");
6527
65281. Implement a shallow parser that recognizes a few types of NP groups in Polish, based on information returned by the PSI-Toolkit lemmatizer.
6529
6530[PER Neil A. Armstrong] , the 38-year-old civilian commander, radioed to earth and the mission control room here: "[LOC Houston] , [ORG Tranquility Base] here; the Eagle has landed."
Why is text normalization so important for the synthesis of Polish text?
6532
6533Hint: Phrases may be found in a lexicon of synonyms.
6534
6535Question 1
6536
6537Lab Tasks
6538Can we make a computer speak Polish better than ResponsiveVoice (Ivona)?
6539
Sometimes the (syntactic) head of a group is contrasted with the semantic head of the group - the word which is most important for the meaning.
6541Question 6
6542
6543Adam Sosnowski, "Konwersja tekstu ortograficznego na tekst fonetyczny przy użyciu parsingu płytkiego", MSc thesis 2011, supervisor: Krzysztof Jassem
6544
6545Type: Spatial
6546
6547Examples:
6548
6549Gdzie jest Beata Szydło?
6550
6551Gdzie pracuje Krzysztof Jassem?
6552
6553Hint for Search Algorithm:
6554
6555Select sentences with recognized Named Entity: Location
6556
6557Miałem pięć żon.
6558
6559Neil A. Armstrong, the 38-year-old civilian commander, radioed to earth and the mission control room here: "Houston, Tranquility Base here; the Eagle has landed."
6560
6561 Author: Leszek Manicki (WMI MS Thesis)
6562
6563 Takes as input:
6564 Raw text
 PSI-Lattice after morphological analysis
6566
6567 Puts tokens into groups
6568
6569 Disambiguates morphological hypotheses
6570
6571 Adds new edges to the PSI-Lattice
6572
6573Rule example
6574Query
6575NERT - How to use shallow parsing in translation?
6576Example Rule (3)
6577
6578Na wojnie zginęło około 6 milionów Polaków.
6579Examples of Questions
6580
6581art
6582
6583Rule "PP1: Conjunction + prepositional phrase"
6584
6585Match: [pos~"conj"] [type=PP];
6586
6587Eval: delete(pos!~"conj", 1); group(CPP, 2);
6588
658915
6590SPADE - main assumptions
6591Question 5
6592
6593In:
6594
6595Sentence separated into tokens.
6596
Each token is characterized morphologically.
6598
6599Out:
6600
6601The structure of the sentence, often in the form of a tree.
6602Example Rule (1)
6603
6604art
6605How to implement
6606SPADE normalization?
6607
6608Rule "num+sub"
6609
6610Match: [pos~"num"] [pos~"subst"];
6611
6612Eval: unify(case gender, 1, 2);
6613
6614OUT:
6615
6616Type: Temporal
6617
6618Examples:
6619
6620Kiedy ożenił się Donald Trump?
6621
6622Kiedy Barack Obama przyjechał do Polski?
6623
6624Hint for Search Algorithm:
6625
6626Select sentences with recognized Named Entity: Time
6627Subject
6628
6629NUMBER
6630
6631COLOR
6632Translation of Person Names
6633Unification
6634Phrases
6635
6636Dostarczono 15 małych zielonych plastikowych niezbędników
6637
6638Polish text:
6639What's SPADE?
6640
Shallow 3. Apply SPADE to recognize person entities in Polish texts, e.g.
6642
6643IN: Kiedy mogę się spotkać z prof. UAM dr. hab. Krzysztofem Jassemem?
6644
6645OUT: Kiedy mogę się spotkać z <prof. UAM dr. hab. Krzysztofem Jassemem>?
6646
6647Maximum score: 200
6648Rule syntax
6649
6650Hint for Search Algorithm:
6651
6652Look for sentences, in which Subject is followed directly by an element of Phrases
6653
6654Rule "sub+adj"
6655
6656Match: [pos~"subst"] [pos~"adj"];
6657
6658Eval: unify(case number gender, 1, 2);
6659
6660Action is an activity or a state of the subject.
6661
6662Action is usually represented by the main verb.
6663
6664Hint for Search Algorithm:
6665
6666Look for all inflected forms of constraints
6667
6668(e.g. Melania, Melanią, ...Polski, Polsce, ...)
6669
6670cmd
6671
6672Miałam pięciu mężów.
6673Example Rule (1)
6674
6675The head of a phrase is the word that determines the syntactic type of that phrase.
6676
6677Gdzie jest Beata Szydło?
6678
6679Gdzie pracuje Krzysztof Jassem?
6680
6681Kiedy Donald Trump ożenił się z Melanie?
6682
6683Kiedy Barack Obama przebywał w Polsce?
6684
6685MATERIAL
6686
Rule "Basic Nominal Group"

Match: [pos~"adj"][pos~"subst"];
6690
6691Eval: unify(case, number, gender, 1, 2);
6692
6693group(NG, 2, 1);
6694
6695Na wojnie zginęło ok. 6 milionów Polaków
6696
6697Phrases that represent activities similar to Action.
6698
6699create new word
6700NERT - Definition
6701
6702Rule (date; 1st~quarter II)
6703
6704Match: <1|I> <base~kwartał> <[0-9]{4}> <r\.>?
6705
6706Action: append(1st quarter of \3)
6707
6708Rule(Person pan~name 5)
6709
6710Match: <base~[pP]an> <{PersonInfix}>* <{ProperPL}; sem~surname>
6711
6712Action: append(\2:t \3:nomu)
6713
6714Select ALL correct sentences:
6715
6716Shallow 1. Design a system that recognizes Polish names of companies and translates them into another language (e.g. English), e.g.
6717
6718In: Jan Kowalski został nowym dyrektorem Małopolskiej Huty Szkła S.A.
6719
6720Out: Jan Kowalski został nowym dyrektorem Malopolska Huta Szkła SA
6721
6722Maximum points: 200
6723
6724Suggested query representation:
6725
6726syntactic head
6727NERT - Examples
6728
6729Shallow parsing is an analysis of a sentence which identifies its constituents (noun groups, verbs, verb groups, etc.), but does not specify their internal structure, nor their role in the main sentence.
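A minimal sketch of that idea in plain Python, assuming the tokens are already POS-tagged; the tag names and the single adjective + noun pattern are a simplified stand-in for an NG rule:

def group_nominal(tagged):
    """Group an adjective directly followed by a noun into one NG constituent."""
    groups, i = [], 0
    while i < len(tagged):
        if i + 1 < len(tagged) and tagged[i][1] == "adj" and tagged[i + 1][1] == "subst":
            groups.append(("NG", [tagged[i][0], tagged[i + 1][0]]))   # the noun is the syntactic head
            i += 2
        else:
            groups.append(tagged[i])
            i += 1
    return groups

# hypothetical tagged input
tokens = [("przykładowym", "adj"), ("zdaniem", "subst"), ("działa", "fin")]
print(group_nominal(tokens))   # [('NG', ['przykładowym', 'zdaniem']), ('działa', 'fin')]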
6730
6731To jest przesyłka dla pana profesora Gracza.
6732
6733 Author: Adam Przepiórkowski (IPI PAN)
6734
6735 Takes as input either of:
6736 Raw text
6737 XML morphological annotation of a text
6738
6739 Puts tokens into groups
6740
6741 Disambiguates morphological hypotheses
6742
6743 Indicates heads of groups
6744
6745Question 4
6746puddle RULES
6747Conclusions
6748Head of a Constituent
6749
6750<orth>ur.</orth>
6751
6752<lex>
6753
6754<base>urodzony</base>
6755
6756<ctag>adj:number*:case*:gender*:pos</ctag>
6757
6758</lex>
6759
6760to make a living
6761
6762<orth>Jana</orth>
6763
6764<lex>
6765
6766<base>Jan</base>
6767
6768<ctag>subst:sg:gen:m1</ctag>
6769
6770</lex>
6771
6772Na wojnie zginęło około 6 milionów Polaków.
6773
6774Dostarczono
6775
6776Hint for Search Algorithm:
6777
6778Look for sentences, in which Subject is directly followed by Action.
6779Can ResponsiveVoice (Ivona) speak Polish?
6780
6781semantic head
6782Translation of temporal expressions
6783
6784OPER
6785NER & NERT
6786
6787Answer:
6788
6789Similar to SPADE
6790Application of Shallow Parsing - Text Normalization for TTS
6791Example Rule
6792(numeral agreement)
6793IN
6794A. Unification decreases the number of syntactic hypotheses
6795B. Unification increases the number of syntactic hypotheses
6796C. Grouping adds a new element to a HTML file
6797D. Grouping deletes an element from a HTML file
6798
6799Kiedy Barack Obama przyjechał do Polski?
6800
6801Kupił 2 m wstążki.
6802
6803Spotkał prof. Jana Miodka.
6804
6805W 56 r. p.n.e.
6806
6807Przyszła p. Wieczorek.
6808
6809Named Entity and Translation (NERT):
6810
6811Task of identifying Named Entities in texts and defining their translations in other languages
6812A rule for unification of nominal groups may help synthesizer correctly read the sentences (select ALL correct):
6813A) Spotkałem NG[prof. Jana] Miodka
6814B) Dodaj 2/3 NG[litra mleka]
6815C) NG[NIP] 1001230210
6816D) W NG[artykule 243.] kodeksu karnego
6817Internet addresses,
6818special characters
6819
6820Oto portret Jana ur. w 1965 roku.
6821
6822SIZE
6823
6824<tok id="tA5">
6825
6826<orth>przykładowym</orth>
6827
6828<lex>
6829
6830<base>przykładowy</base>
6831
6832<ctag>adj:sg:inst:m1:pos</ctag>
6833
6834</lex>
6835
6836<lex>
6837
6838<base>przykładowy</base>
6839
6840<ctag>adj:sg:inst:n1:pos</ctag>
6841
6842</lex>
6843
6844<lex>
6845
6846<base>przykładowy</base>
6847
6848<ctag>adj:sg:inst:n2:pos</ctag>
6849
6850</lex>
6851
6852<lex>
6853
6854<base>przykładowy</base>
6855
6856<ctag>adj:sg:loc:m1:pos</ctag>
6857
6858</lex>
6859
6860<lex>
6861
6862<base>przykładowy</base>
6863
6864<ctag>adj:pl:dat:f:pos</ctag>
6865
6866</lex>
6867
6868</tok>
6869
6870Zaprezentuję działanie algorytmu SPADE tym przykładowym zdaniem.
6871What Does Text Normalization Consist In?
6872Deep Parsing vs Shallow Parsing
6873
6874Miałem 5 żon.
6875Deep Parsing -
6876Examples
6877
6878Krzysztof Jassem
6879
6880What type of question is it?
6881
6882c
6883Rule example
6884
6885Miałam 5 mężów.
6886
6887Adam Sosnowski, "Konwersja tekstu ortograficznego na tekst fonetyczny przy użyciu parsingu płytkiego", MSc thesis 2011, supervisor: Krzysztof Jassem
6888
6889Gdzie jest Beata Szydło?
6890
6891Phrases: przebywa, udała się, wyjechała
6892
6893NG[Mężczyzna ten niewysoki] NG[wielką bryką] jeździ.
6894
6895# 'Gdzie jest' template
6896
6897## Question to match
6898
6899Match: <orth~"podaj">? <orth~"gdzie"> <orth~"jest"> <ne~person>
6900
6901## Query representation
6902
6903Type: PLACE
6904
6905Subject: \4
6906
6907Action: <base~"być">
6908
6909Constraints:
6910
Phrases: <base~"przebywać"> <orth~"(w | na)"> ;
6912
6913<base~"polecieć"> <orth~"(do | na)"> ;
6914
6915<base~"pojechać"> <orth~"(do | na)"> ;
6916
6917Miałam 5 mężów.
6918
6919In:
6920
6921Sentence separated into tokens.
6922
6923Each token is characterized morphologically.
6924
6925Out:
6926
6927Components of the sentence: noun groups, verb groups etc.
6928
6929Oto portret Jana ur. w 1965 roku.
6930Example Rule
6931(numeral agreement)
6932
6933<orth>Jana</orth>
6934
6935<lex>
6936
6937<base>Jan</base>
6938
6939<ctag>subst:sg:gen:m1</ctag>
6940
6941</lex>
6942
6943Kiedy Donald Trump ożenił się z Melanią?
6944
6945Rule "sub+adj"
6946
6947Match: [pos~"subst"] [pos~"adj"];
6948
6949Eval: unify(case number gender, 1, 2);
6950SPADE algorithm
6951Can we use SPADE in TTS?
6952
6953OUT
6954
6955<orth>ur.</orth>
6956
6957<lex>
6958
6959<base>urodzony</base>
6960
6961<ctag>adj:sg:gen:m1:pos</ctag>
6962
6963</lex>
6964SPADE RULES
6965A rule for unification of nominal groups may help synthesizer correctly read the sentences (select ALL correct):
6966A) Spotkałem NG[prof. Jana] Miodka
6967B) Dodaj 2/3 NG[litra mleka]
6968C) NG[NIP] 1001230210
6969D) W NG[artykule 243.] kodeksu karnego
6970NER - Example
6971Question Type
6972Definition of NER
6973
6974Autostrada zostanie otwarta w I kwartale 2015 r.
6975Shallow Parsing
6976Constraints
6977
6978zielonych
6979
6980dla p. dr. J. Kowalskiego leg. się dow. osob. BAC1234567,
6981
6982zam. na os. B. Chrobrego 10 m 7
6983
6984just need a little more info....
6985
6986create new word
6987
6988 Rule (name of the rule)
6989
6990 Match (group of tokens to match)
6991
6992 Eval (the result of matching)
6993
6994Question 3
6995
6996Dogs like cats
6997
6998Oto portret Jana ur. w 1965 roku.
6999Example Rule (3)
7000
7001Deep parsing (Syntax Analysis) is an analysis of a sentence which produces a syntactic representation of a sentence.
7002
7003IN
7004
7005a small red dog
7006
7007Autostrada zostanie otwarta w I kw. 2015 r.
7008
7009To jest przesyłka dla pana profesora Gracza.
7010
7011Samsung Electronics Polska Sp. z o.o.
7012
7013Kapitał zakładowy Huty Miedzi S.A. wynosi 780 mln zł.
7014
7015Na ul. Umultowskiej 87 znajduje się budynek Uniwersytetu im. Adama Mickiewicza.
7016Document numbers
7017
<group synh="tA6" semh="tA5" type="NG">
7019
7020<tok id="tA5">
7021
7022<orth>przykładowym</orth>
7023
7024<lex>
7025
7026<base>przykładowy</base>
7027
7028<ctag>adj:sg:inst:n2:pos</ctag>
7029
7030</lex>
7031
7032</tok>
7033
7034<tok id="tA6">
7035
7036<orth>zdaniem</orth>
7037
7038<lex>
7039
7040<base>zdanie</base>
7041
7042<ctag>subst:sg:inst:n2</ctag>
7043
7044</tok>
7045
7046</group>
7047
7048no space between tokens
7049
7050The motorway will be opened in the 1st quarter of 2015.
7051
7052Let's try!
7053Example of Template (1)
7054
7055Hint for Search Algorithm:
7056
7057Look for the texts with Subject in the nominative form.
7058How to implement
7059SPADE normalization?
7060
7061niezbędników
7062
7063Rule "ok. --> około"
7064
7065Match: [orth~"ok/i"] ns [orth~"\."];
7066
7067Eval: word(qub, "około");
7068
7069Mężczyzna ten niewysoki wielką bryką jeździ.
7070
7071małych
7072
7073art
7074Example of Template (2)
7075
7076<tok id="tA6">
7077
7078<orth>zdaniem</orth>
7079
7080<lex>
7081
7082<base>zdanie</base>
7083
7084<ctag>subst:sg:inst:n2</ctag>
7085
7086</lex>
7087
7088<lex>
7089
7090<base>zdać</base>
7091
7092<ctag>ger:sg:inst:n2:perf:aff</ctag>
7093
7094</lex>
7095
7096</tok>
7097
7098<tok id="tA5">
7099
7100<orth>Przykładowym</orth>
7101
7102<lex>
7103
7104<base>przykładowy</base>
7105
7106<ctag>adj:sg:inst:n2:pos</ctag>
7107
7108</lex>
7109
7110</tok>
7111
7112<tok id="tA6">
7113
7114<orth>zdaniem</orth>
7115
7116<lex>
7117
7118<base>zdanie</base>
7119
7120<ctag>subst:sg:inst:n2</ctag>
7121
7122</tok>
7123
7124Rule "num+sub"
7125
7126Match: [pos~"num"] [pos~"subst"];
7127
7128Eval: unify(case gender, 1, 2);
7129Abbreviations
7130Shallow Parsing - Example
7131puddle
7132
7133Shallow parsing approach:
7134
7135 recognizes groups of words in a sentence
7136
7137 uses methods similar to RegExps
7138
7139 works with high efficiency
7140
7141 suffices for numerous NLP tasks
7142
7143In the rule for translation of Polish persons into English, in the Action section:
7144Action: append(\2:t \3:nomu)
7145A) nomu stands for "nie mów nikomu"
7146B) nomu stands for "no, mu możesz powiedzieć"
7147C) nomu stands for "naive operational mutual underground"
7148D) nomu stands for "nominative, uppercase"
7149What is Parsing?
7150Grouping
7151
7152Rule "PP2: Prep. phrase + conj. prep. phrase"
7153
7154Match: [type=PP] [type=CPP]+;
7155
7156Eval: group(PP, 1);
7157
7158Example of unification:
7159
7160Named entities: "atomic elements in text" that belong to some predefined categories such as:
7161
7162 names of persons
7163 organizations
7164 locations
7165 expressions of times
7166 quantities
7167 monetary values
7168 percentages, etc.
7169
7170Named entity recognition (NER) is the task of identifying such named entities in raw texts.
7171Shallow Parsing - Definition
7172
7173# 'Kiedy ślub' template
7174
7175## Question to match
7176
7177Match: <orth~"podaj">? <orth~"kiedy"> <ne~person>
7178
7179<base~"pobrać"> <orth~"się"> <orth = "z">
7180
7181<ne~person>
7182
7183## Query representation
7184
7185Type: TIME
7186
7187Subject: \3
7188
7189Action: <base~"pobrać"> <orth~"się">
7190
7191Constraints: \7
7192
7193Phrases: <base~"ożenić"> <orth~"się"> ;
7194
<base~"wyjść"> <orth~"za mąż"> ;
7196What's NER?
7197Select ALL correct sentences:
7198puddle - main assumptions
7199
7200at school
7201
A query is a representation of a question that helps find an answer in a heap of documents.
7203
7204Kiedy Barack Obama przyjechał do Polski?
7205
7206Phrases: przyleciał, pojawił się, wyladował, został przyjęty
7207Test Question 2
7208Example Rule (2)
7209
7210Miałem pięć żon.
7211
7212http://cogcomp.cs.illinois.edu/page/demo_view/NER
7213
72142. Implement a simple QA system that searches for an answer in the Internet. The system should be based on 2 - 3 templates and shallow parsing techniques.
7215
7216Na wojnie zginęło ok. 6 milionów Polaków
7217
7218Any other words that may help select the answer.
7219
A) Shallow parsing is more efficient than deep parsing
7221
B) Shallow parsing returns more information on the structure of a sentence than deep parsing
7223
7224C) Shallow parsing requires more sophisticated tools than deep parsing
7225
7226D) Shallow parsing returns only partial information on the syntax structure of the sentence
7227
7228SPADE, for each recognized group, points out its syntactic and semantic head.
7229
7230Rule "ur. --> urodzony"
7231
7232Match: [orth~"ur/i"] ns [orth~"\."];
7233
7234Eval: word(adj:number*:case*:gender*:pos, "urodzony");
7235
7236a small red dog
7237What is NERT for?
7238
7239plastikowych
7240SPADE
7241
7242www.onet.pl.
7243
7244jassem@amu.edu.pl.
7245
7246# $ %ˆ { ( : & *:
7247
7248Select ALL correct sentences:
7249A) Search algorithm should look for all inflected forms of words in Constraints
7250B) Search algorithm should look for nominative forms of words in Subject
7251C) Action represents a set of verbs similar to the main activity in the sentence
7252D) Phrases represent a list of words similar to Subject
7253cats
7254like
7255Verb
7256Phrase
7257Verb Phrase
7258Dogs
7259Noun
7260Sentence
7261
7262cmd : OPER NUMBER art ;
7263
7264Grammar rules
7265
7266art : attr art
7267
7268| KIND ;
7269
7270attr : SIZE
7271
7272| COLOR
7273
7274| MATERIAL ;
7275
7276
7277Test Question 1
7278Pointing Devices
7279Syntax Analysis
7280
7281Formal grammar consists of:
7282
7283 terminal symbols - words in a lexicon
 non-terminal (auxiliary) symbols
7285 start symbol - one of non-terminal symbols
7286 set of productions (grammar rules)
7287
7288Grammar is a method of describing admissible sequences of words that belong to a lexicon.
7289Example of a Session
7290Lexical Analysis
7291Yacc
7292Human-Machine Communication
7293Running the parser
7294Old Days
7295
72964. Up-to-date NLP applications often apply
7297
7298Lex/Yacc-like ideas.
7299
7300Robot: How can I help you?
7301
7302User: Receive 5 tiny blue metal boxes.
7303Example of LU Application
7304Program Arguments
7305
7306> my_program -g -o database.txt
7307
7308Krzysztof Jassem
7309Running the application
7310Formal Grammars
7311Grammar Elements
7312What is Grammar?
7313Test Question
7314Database Representation
7315
7316Robot: Done! I have 5 large white plastic boxes on stock. How can I help you?
7317
7318User: Thank you.
7319
2. Lex/Yacc may be successfully applied for very simple natural language expressions.
7321Representation of an article
7322Lexer and Parser (in Python)
7323
7324Syntax analysis checks if the sequence of token types is admissible by the grammar.
7325
7326In case of success syntax analysis may return an action(s).
7327Test Question
7328Python Lex and Yacc
7329
7330PLY = Python Lex and Yacc
7331Topic of the Lecture
7332
7333Touch Communication
7334
73353. Full understanding of human language requires more sophisticated methods, but...
7336Adjustments for Parser
7337
7338Which command is not recognized by this grammar?
7339
73401) two tiny boxes
7341
73422) five tiny box
7343
73443) five boxes
7345
73464) five tiny boxes
7347Modifications in Lexer
7348
1. Lex/Yacc was invented for the processing of artificial languages.
7350On Success: Modify Database
7351(increase or reduce number of goods on stock)
7352Task to complete
7353
7354To code, in Python, a simple program that understands natural language.
7355PLY Module
7356Evolution
7357in Man-Machine Interaction
7358
7359Understand means: "do the appropriate action".
7360
7361ASR (Automatic Speech Recognition)
7362
7363NLU (Natural Language Understanding)
7364
7365NLG (Natural Language Generation)
7366
7367TTS (Text To Speech)
7368Case Study - Smart Home
7369Adjustments for Lexer
7370A. sentence : NOUN VERB ;
7371B. sentence : NOUN VERB
7372{ $$ = $2; }
7373C. sentence : NOUN VERB
7374{ printf ("Parsed!"); }
7375| NOUN VERB NOUN;
7376D. sentence : NOUN VERB { "Success!" }
7377
7378Robot: Done! I have 8 large white plastic boxes on stock. How can I help you?
7379
7380User: Release 3 large white plastic rings.
7381
7382lex.py and yacc.py are modules found in ply package
7383Interpreting Meaning
7384
7385This code gives:
7386
7387Program that executes syntax analysis is called parser.
7388Lex
7389
7390Context-free grammar:
7391
7392All grammar rules have one non-terminal on the left side.
7393Verbal Communication
7394
73951. Text is split into tokens
7396
73972. Each token is assigned its type (from the lexicon)
7398
7399t is an instance of LexToken.
7400
7401t has attributes:
7402
7403 t.value: text matched (string)
7404 t.type (token type)
7405 t.lineno (line number)
7406 t.lexpos (position of the token)
7407
7408t should be returned; otherwise token is ignored
7409Database
7410Stand-alone Lexer
7411On Failure: No action
7412What is the toughest challenge in vocal man-machine communication:
7413A) Text-to-speech synthesis
7414B) Speech-to-text recognition
7415C) Natural language understanding
7416D) Natural language generation
7417
7418Database stores the number of each article on stock, e.g.
7419
7420tiny white metal boxes : 5
7421
7422large plastic blue rings: 10
7423
7424lex is a module for lexical analysis.
7425Lexical & Syntax Analysis
7426
7427Task: to make the machine understand the command in natural language.
7428
74291. User was given choice by an application
7430Conclusions
7431Formal Grammar - Example
7432Language Understanding (LU)
7433Modifications in Parser
7434Which of the below statements is not a correct grammar rule in Yacc?
7435
7436Program that executes lexical analysis is called lexer.
7437
7438Robot: Done! I have 5 tiny blue metal boxes on stock. How can I help you?
7439
7440User: Receive 8 large white plastic rings.
7441
7442two: NUMBER tiny: SIZE boxes: KIND
7443
7444Action: Save info in database
7445
7446Here, grammar rules are functions:
7447
7448Tiny White Metal Boxes
7449
74500000
7451
7452First, all token types must be defined:
7453
7454Then, for all token types, rules are defined in a form of RegEx:
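The original slide code is not reproduced in this dump; below is a minimal, hypothetical PLY lexer for the warehouse commands (token names follow the grammar used elsewhere in the lecture: OPER, NUMBER, SIZE, COLOR, MATERIAL, KIND; the word lists are illustrative):

import ply.lex as lex

tokens = ('OPER', 'NUMBER', 'SIZE', 'COLOR', 'MATERIAL', 'KIND')

# token rules defined as regular expressions
t_OPER     = r'[Rr]eceive|[Rr]elease'
t_SIZE     = r'tiny|large'
t_COLOR    = r'white|blue'
t_MATERIAL = r'metal|plastic'
t_KIND     = r'boxes|rings'
t_ignore   = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)      # t.value holds the matched text; convert it to a number
    return t

def t_error(t):
    print("Illegal character:", t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("Receive 5 tiny blue metal boxes")
for tok in lexer:
    print(f"{tok.value}: {tok.type}")    # Receive: OPER, 5: NUMBER, tiny: SIZE, ...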
7455
7456The above code gives:
7457
7458The rules may perform actions:
7459
7460p is a list of objects for:
7461
7462p[0] - left-hand symbol
7463
7464p[1] - first right-hand symbol, etc.
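A minimal, hypothetical PLY parser for the same grammar (cmd : OPER NUMBER art, etc.); a stripped-down lexer is repeated so the sketch runs on its own, and the values built in p[0] are one possible choice:

import ply.lex as lex
import ply.yacc as yacc

# stripped-down lexer (same token set as in the lexer sketch)
tokens = ('OPER', 'NUMBER', 'SIZE', 'COLOR', 'MATERIAL', 'KIND')
t_OPER     = r'[Rr]eceive|[Rr]elease'
t_SIZE     = r'tiny|large'
t_COLOR    = r'white|blue'
t_MATERIAL = r'metal|plastic'
t_KIND     = r'boxes|rings'
t_ignore   = ' '

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()

# grammar rules as functions: the docstring is the production,
# p[0] is the left-hand symbol, p[1], p[2], ... are the right-hand symbols
def p_cmd(p):
    'cmd : OPER NUMBER art'
    p[0] = (p[1], p[2], p[3])     # e.g. ('Receive', 5, ['tiny', 'blue', 'metal', 'boxes'])

def p_art(p):
    '''art : attr art
           | KIND'''
    p[0] = [p[1]] + p[2] if len(p) == 3 else [p[1]]

def p_attr(p):
    '''attr : SIZE
            | COLOR
            | MATERIAL'''
    p[0] = p[1]

def p_error(p):
    print("Syntax error")

parser = yacc.yacc()
print(parser.parse("Receive 5 tiny blue metal boxes", lexer=lexer))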
7465
7466Lex - lexical analysis - lex.py
7467
7468Yacc - syntax analysis - yacc.py
7469
7470originally written in C
7471
7472This code gives:
7473
7474dwa małe pudełka
7475
7476dwa: NUMBER
7477
7478małe: SIZE
7479
7480pudełka: KIND
7481
7482two tiny boxes
7483
7484two: NUMBER
7485
7486tiny: SIZE
7487
7488boxes: KIND
7489
7490Large Blue Plastic Rings
7491
7492...
7493
74941111
7495
7496two: NUMBER boxes: KIND tiny: SIZE
7497
7498Failure
7499
7500dwa małe pudełka
7501
7502pięć małych pudełek
7503
7504Correct sequences
7505
7506two tiny boxes
7507
7508five tiny boxes
7509
7510This code gives:
7511
7512Non-terminal symbols
7513
7514{ COMMAND, ARTICLE, NUMBER, SIZE, KIND }
7515
7516Incorrect Sequences
7517
7518dwa małych pudełka
7519
7520pięć małe pudełek
7521
7522two tiny box
7523
7524tiny five boxes
7525
7526Tiny White Metal Rings
7527
75280001
7529Create a program that understands human sentences, e.g.
7530Receive 5 tiny blue plastic boxes.
7531Release 2 large metal white boxes.
7532
7533(Quantity of articles is expressed with digits.)
7534
7535After 2 deliveries:
7536
75373 tiny white metal boxes
7538
75395 large blue plastic rings
7540
7541Productions
7542
7543COMMAND → NUMBER ARTICLE
7544
7545ARTICLE → SIZE KIND
7546
7547NUMBER → dwa
7548
7549NUMBER → pięć
7550
7551SIZE → małe
7552
7553SIZE → małych
7554
7555KIND → pudełka
7556
7557KIND → pudełek
7558
7559NUMBER →two
7560
7561NUMBER → five
7562
7563SIZE → tiny
7564
7565KIND → box
7566
7567KIND → boxes
7568
7569{ dwa, pięć, małe, małych,
7570
7571pudełka, pudełek }
7572
7573Terminal symbols
7574
7575{ two, five, tiny, box, boxes }
7576
7577{ dwa, pięć, małe, małych,
7578
7579pudełka, pudełek }
7580
7581Lexicon
7582
7583{ two, five, tiny, box, boxes }
7584
75850
7586
75871
7588
7589Tiny
7590
7591Large
7592
7593White
7594
7595Blue
7596
7597Metal
7598
7599Plastic
7600
7601Boxes
7602
7603Rings
7604
7605dwa: NUMBER pudełka: KIND małe: SIZE
7606
7607Failure
7608
7609dwa: NUMBER małe: SIZE pudełka: KIND
7610
7611Action: Save info in database
7612
7613The above code produces:
7614
7615Initial Settings:
7616
7617The rule may take an action. Then it is defined as a function:
7618
7619Start symbol
7620
7621COMMAND
7622
7623
7624Tokenization and sentence splitting are usually the first steps in text processing, which segment text into smaller pieces.
7625
7626In case of natural language texts these two pre-processing phases may prove far from trivial.
7627
The Internet is an infinite source of linguistic knowledge.

The lecture shows how to scrape (extract) linguistic data from the Internet.
7631
7632https://prezi.com/jcwtt0iptfcz/?token=57809fc48232eddf18d8b8a1197758ee440680293142cac0abca8d147c7061bd&utm_campaign=share&utm_medium=copy
Lecture 9. Artificial Neural Networks
7634Lecture 14. Deep Parsing
7635(Syntax Analysis)
7636Lecture 3. Processing Texts in Linux
7637
7638https://prezi.com/s76b4ailuz7y/?token=36f0d0824020f1c1aa25ce64b1cc7a8ae7cb46af3c2e30b468873adee74b5f5e&utm_campaign=share&utm_medium=copy
7639
7640Logistic regression may be successfully applied in sentiment analysis.
7641Standard Tasks in NLP
7642
The lecture provides basic information on methods used in Machine Learning (ML).
7644Course on Natural Language Processing 2019 / 2020
7645
7646Most recent NLP solutions apply Artificial Neural Networks (ANN).
7647
Natural Language Processing (NLP) is a domain of Artificial Intelligence that aims at making computers understand the language used by humans.
7649
NLP is applied in a widening variety of computer systems, which will be shown during the lecture.
7651
7652https://prezi.com/nfqfn-tkksd1/?token=7c7e470c76114f3d926219c69975875d266ec319c16a718ad1fe4dcbb9210801&utm_campaign=share&utm_medium=copy
7653
A language model is a function that assigns probabilities to elements of the language (sentences) - the more likely a sentence is to appear, the higher its probability.
7655
Language modeling is a statistical approach to examining natural language.
7657Basics
7658
7659Krzysztof Jassem
7660
7661Regular Expressions (RegEx) is a powerful tool for text searching / replacing / validation. RegEx are used in almost all tasks of NLP.
7662
7663Python is regarded as a very convenient environment for using Regular Expressions.
7664
7665https://prezi.com/g8j5kje6njtf/?token=097b919a3ddc21ad2965174f7fd690f62aa3d00b4bddba8e0f490ee4357b71cf&utm_campaign=share&utm_medium=copy
7666Lecture 1. Applications of NLP
7667
7668The lecture demonstrates the methods of linear and logistic regression.
7669
The lecture provides an introduction to ANNs and shows their applications in NLP.
7671Supplementary Lecture. Web Scraping
7672
7673Short version of the lecture:
7674Lecture 13. Python Lex & Yacc
7675
There exist a number of toolkits that help with processing natural language.
7677
7678The lecture covers two of them: PSI-Toolkit developed at AMU and NLTK (Python module).
7679
7680https://prezi.com/bmxjygvv1px3/?token=5c9a71e8e8ae31d54a723f43a8286fe005009817edd4658fb5a467b57409b6a4&utm_campaign=share&utm_medium=copy
7681
7682https://prezi.com/rbhq8cm4vtzi/?token=d42a73a0e6c6086ceccab9e80bcbf236389955c08fabd5999bb129ebc4a7883a&utm_campaign=share&utm_medium=copy
7683
The basic ML algorithm used in Natural Language Processing is the Bayes classifier.
7685
7686Deep Parsing aims at disclosing the structure and meaning of the whole sentence.
7687
7688The lecture will present most common parsing algorithms and methods of language description.
7689
76902. Standard tasks in NLP
7691Lecture 7. Bayes Classifier
7692
7693Statistical methods consist in examining the language by means of statistics - mainly the frequency of occurrence of some phenomena in the language.
7694
7695Lecture 4. Text Segmentation
7696
7697Lecture 5. Morphological Analysis
7698
7699Lecture 6. NLP Toolkits
7700
Once a natural language text is tokenized, the tokens are looked up in the lexicon. This is the first step towards understanding the meaning of the text.
7702
7703The lecture discusses the problem of building robust lexicons that cover still-evolving natural languages.
7704
7705https://prezi.com/_exvlgugk_-e/?token=6c1b850cc5855d0cce389fdf70328949f89fa539554682c0563f854a0238389d&utm_campaign=share&utm_medium=copy
7706
7707Machine Learning approach to NLP is based on pure data (texts) without human-prepared linguistic rules.
7708
As opposed to rule-based methods, the Machine Learning approach requires significantly less human work.
7710
7711On the other hand, this approach is harder to control by humans and thus less reliable.
7712
7713Lecture 1. Applications of NLP
7714
7715Lecture 2. Regular Expressions in Python
7716
7717Lecture 3. Processing Text in Bash Shell
7718
7719The lectures will teach basic concepts in Natural Language Processing.
7720Machine Learning Approach
7721
7722https://prezi.com/qrvzpicblhjy/?token=8d8967aa799b927298d8932534b66c76333f63bc0a5139ccea65fbd4539c9fe2&utm_campaign=share&utm_medium=copy
7723
7724https://prezi.com/lgrqjn3mqv2l/?token=04daf495327dbf9d553e5b36271d7d14073c7f5e1725a7f85c01244798821b6e&utm_campaign=share&utm_medium=copy
7725
77265. Machine Learning approach to NLP
7727
7728https://prezi.com/svbsskbxwrtx/?token=e6979fe7c86c63210e9051b7b52b4d458aceefe4c3476214bf3a5585d130041f&utm_campaign=share&utm_medium=copy
7729
7730Shallow parsing is a process that identifies main components of a sentence.
7731
7732This type of processing may be used in speech synthesis, which is demonstrated during the lecture.
7733
7734The lecture shows how to write a simple NLP application in Python using Yacc and Lex.
7735
7736Lex is a tool for lexical analysis.
7737
7738Yacc is a tool for parsing text.
7739
7740Full version of the lecture:
7741
7742https://prezi.com/xzvr4atryeid/?token=785de03666ee2925bbc1d7789e22b3f2e178ebfd9a4e87e9d51e337b28aa6d81&utm_campaign=share&utm_medium=copy
7743Lecture 6. NLP Toolkits
7744
7745Rule-based approach to NLP consists in human preparation of rules that describe linguistic phenomena.
7746
7747As opposed to Machine Learning methods, rule-based approach requires significantly more human work.
7748
7749On the other hand, this approach is easier to control by humans and thus more reliable.
7750Lecture 11. Spell-checking
7751
77524. Statistical methods in NLP
7753Overview of the Course
7754
7755The lectures will overview preliminary procedures that are used in almost all NLP systems.
7756Lecture 2. Regular Expressions in Python
7757Lecture 5. Morphological Analysis
7758Lecture 10. Language Modeling
7759
77603. Rule-based methods for NLP
7761Statistical Methods
7762
7763https://prezi.com/0c0orqubgk3g/?token=252f4ff988e5bc544982b1a018d15c17329b1d225421c8341802d0199eea639f&utm_campaign=share&utm_medium=copy
7764
77651. Basic concepts of NLP
7766
The majority of basic NLP text-processing tasks can be executed with simple Linux commands and scripts.
7768
7769The lecture teaches how to use commands (starting with elementary Linux) and to write scripts.
7770
7771https://prezi.com/47ivuwazf5yn/?token=53e29b57f949bffa39cfa47b585acd879bd0fe4e7c3720b3e819be8e8ac5f503&utm_campaign=share&utm_medium=copy
7772Lecture 4. Text Segmentation
7773Lecture 12. Shallow Parsing
7774
7775Spell-checking is one of the applications of language modeling.
7776
7777Thanks to the statistical approach, the most likely corrections are prompted as first.
7778Rule-based Approach
7779
7780https://prezi.com/emfutfrsjwz0/?token=713d96b40514cb5481244f94d9d10f5d7c8530b48e90292107716d761c3e9022&utm_campaign=share&utm_medium=copy
7781
7782https://prezi.com/vwmvvbudwrpv/?token=2d2d3e00b4545be2bcf2ec469f6a165a64560e6d4a41b09a6a49d719fd81dab6&utm_campaign=share&utm_medium=copy
7783Lecture 8. Regression