
Presentation: Speech and Language Processing (3rd ed. draft), Dan Jurafsky and James H. Martin. Chapter 2.3, p. 11.

Presentation slides

Slide 2 Normalization
Need to “normalize” terms
Information Retrieval: indexed text & query terms must have the same form.
We want to match U.S.A. and USA
We implicitly define equivalence classes of terms
e.g., deleting periods in a term
Alternative: asymmetric expansion:
Enter: window Search: window, windows
Enter: windows Search: Windows, windows, window
Enter: Windows Search: Windows
Enter: Снеговик Search: Снеговик, снеговики (Russian: ‘snowman’, ‘snowmen’)
Potentially more powerful, but less efficient

Where else might normalization be needed?
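
The two approaches on the slide above can be sketched in a few lines of Python. This is only an illustration: the function names `normalize` and `expand` and the `EXPANSIONS` table are assumptions made for this example, and the table simply copies the three English rows from the slide.

```python
def normalize(term):
    """Equivalence classing: map a term to its class by deleting periods."""
    return term.replace(".", "")


# Asymmetric expansion: each query term lists the index terms to search for.
# The entries are copied from the slide; the dict itself is a hypothetical layout.
EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows", "window"],
    "Windows": ["Windows"],
}


def expand(query):
    """Return the search terms for a query; unknown queries map to themselves."""
    return EXPANSIONS.get(query, [query])


print(normalize("U.S.A."))  # same class as "USA"
print(expand("windows"))
```

Equivalence classing needs one function applied to both index and query. Expansion needs a table lookup per query term and may trigger several searches, which is why the slide calls it more powerful but less efficient.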


Slide 3 Case folding
Applications like IR: reduce all letters to lower case
Since users tend to use lower case
Possible exception: upper case in mid-sentence?
e.g., General Motors
Fed vs. fed
SAIL vs. sail
МегаФон (a Russian mobile operator) vs. мегафон (‘megaphone’)
For sentiment analysis, MT, Information extraction
Case is helpful (US versus us is important)

What are the advantages of converting text to a single case?
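
A minimal sketch of case folding and of the mid-sentence exception mentioned on the slide. Both function names are hypothetical, and lower-casing only the sentence-initial token is a deliberately naive heuristic chosen for illustration, not a method given in the slides.

```python
def fold_case(tokens):
    """Plain case folding: lower-case every token (typical for IR)."""
    return [t.lower() for t in tokens]


def fold_case_keep_midsentence(tokens):
    """Naive exception: lower-case only the first token of the sentence,
    so that mid-sentence capitals such as 'Fed' or 'SAIL' survive."""
    return [t.lower() if i == 0 else t for i, t in enumerate(tokens)]


sentence = "The Fed raised rates".split()
print(fold_case(sentence))                   # 'Fed' collapses into 'fed'
print(fold_case_keep_midsentence(sentence))  # 'Fed' is preserved
```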


Slide 4 Lemmatization
Reduce inflections or variant forms to base form
am, are, is → be
car, cars, car's, cars' → car
Lemmatization: have to find correct dictionary headword form
Machine translation
Spanish quiero (‘I want’), quieres (‘you want’) same lemma as querer ‘want’
the boy's cars are different colors → the boy car be different color
Мы ели суп, а вдоль аллеи стояли раскидистые ели → я есть суп, а вдоль аллея стоять раскидистый ель (Russian: ‘We ate soup, and spreading spruces stood along the alley’; the homograph ели is both ‘ate’ and ‘spruces’)

Which form of a noun and of a verb usually serves as the lemma?
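
The English example on the slide can be reproduced with a dictionary lookup. This is a toy sketch: the `LEMMAS` table is hypothetical and covers only the words in the example, whereas a real lemmatizer needs a full dictionary and, for homographs such as the Russian ели, the surrounding context.

```python
# Hypothetical lookup table covering only the slide's examples.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
    "boy's": "boy",
    "colors": "color",
}


def lemmatize(tokens):
    """Replace each token with its dictionary headword if one is known."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]


print(" ".join(lemmatize("the boy's cars are different colors".split())))
```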


Slide 5 Morphology
Morphemes: The small meaningful units that make up words
Stems: The core meaning-bearing units
Affixes: Bits and pieces that adhere to stems
Often with grammatical functions

Give examples of affixes.


Slide 6 Stemming
Reduce terms to their stems in information retrieval
Stemming is crude chopping of affixes
language dependent
e.g., automate(s), automatic, automation all reduced to automat.
For example, Russian чистый (‘clean’) and чистка (‘cleaning’) both reduce to чист.

Before stemming: for example compressed and compression are both accepted as equivalent to compress.

After stemming: for exampl compress and compress ar both accept as equival to compress

What is the difference between lemmatization and stemming? Which is more accurate?
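
To make "crude chopping of affixes" concrete, here is a minimal suffix stripper. The suffix list, the `MIN_STEM` threshold, and the name `crude_stem` are assumptions picked so that the slide's automat example works; this is not the Porter stemmer and it would mis-stem many other words.

```python
# Hypothetical suffix list, longest first, chosen only for the slide's example.
SUFFIXES = sorted(["ion", "ic", "es", "e", "s"], key=len, reverse=True)
MIN_STEM = 3  # do not chop if fewer than this many letters would remain


def crude_stem(word):
    """Strip the first (longest) matching suffix, with no dictionary check."""
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM:
            return stem
    return word


for w in ["automate", "automates", "automatic", "automation"]:
    print(w, "->", crude_stem(w))
```

Unlike the lemmatizer lookup, nothing here checks that the result is a real word, which is why stemming is fast but less precise.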


Slide 7 Porter’s algorithm
The most common English stemmer

Step 1a
sses → ss      caresses → caress
ies → i        ponies → poni
ss → ss        caress → caress
s → ø          cats → cat

Step 1b
(*v*)ing → ø   walking → walk
               sing → sing
(*v*)ed → ø    plastered → plaster

Step 2 (for long stems)
ational → ate  relational → relate
izer → ize     digitizer → digitize
ator → ate     operator → operate

Step 3 (for longer stems)
al → ø         revival → reviv
able → ø       adjustable → adjust
ate → ø        activate → activ


What is the main readily apparent advantage of this algorithm?
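
Steps 1a and 1b from the slide can be written as ordered rewrite rules. The sketch below covers only the rules listed on the slide and makes two simplifying assumptions: a vowel is any of a, e, i, o, u, and the cleanup that the full Porter algorithm applies after step 1b is omitted. The function names are hypothetical.

```python
VOWELS = set("aeiou")  # simplification: 'y' is never treated as a vowel here

# Step 1a rules in priority order: the first matching suffix wins.
STEP_1A = (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", ""))


def step_1a(word):
    for suffix, replacement in STEP_1A:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


def step_1b(word):
    """(*v*)ing -> ø and (*v*)ed -> ø: strip only if the stem has a vowel."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            return stem if any(ch in VOWELS for ch in stem) else word
    return word


def stem_steps_1a_1b(word):
    return step_1b(step_1a(word))


for w in ["caresses", "ponies", "caress", "cats", "walking", "sing", "plastered"]:
    print(w, "->", stem_steps_1a_1b(w))
```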


Slide 8 Viewing morphology in a corpus
Why only strip -ing if there is a vowel?
(*v*)ing → ø   walking → walk
               sing → sing

How can you tell, in most cases, whether -ing should be stripped?


Slide 9 Viewing morphology in a corpus
Why only strip -ing if there is a vowel?
(*v*)ing → ø   walking → walk
               sing → sing

tr -sc 'A-Za-z' '\n' < shakes.txt | grep 'ing$' | sort | uniq -c | sort -nr

1312 King
548 being
541 nothing
388 king
375 bring
358 thing
307 ring
152 something
145 coming
130 morning

tr -sc 'A-Za-z' '\n' < shakes.txt | grep '[aeiou].*ing$' | sort | uniq -c | sort -nr

548 being
541 nothing
152 something
145 coming
130 morning
122 having
120 living
117 loving
116 Being
102 going

Explain how these commands work.
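
For readers less familiar with Unix pipelines, here is a rough Python equivalent of the two commands. It is a sketch, not a drop-in replacement: the function name `ing_counts` is hypothetical, and it works on a string so that no copy of `shakes.txt` is assumed.

```python
import re
from collections import Counter


def ing_counts(text, require_vowel=False):
    """Count words ending in 'ing', most frequent first.

    With require_vowel=True a vowel must precede the final 'ing',
    mirroring grep '[aeiou].*ing$' instead of grep 'ing$'.
    """
    # tr -sc 'A-Za-z' '\n': split the text into runs of ASCII letters.
    words = re.findall(r"[A-Za-z]+", text)
    # grep: keep only the words that match the pattern.
    pattern = re.compile(r"[aeiou].*ing$" if require_vowel else r"ing$")
    # sort | uniq -c | sort -nr: count and order by frequency.
    return Counter(w for w in words if pattern.search(w)).most_common()


sample = "The King is being kind; the king and King sing nothing."
print(ing_counts(sample))
print(ing_counts(sample, require_vowel=True))
```

As in the shell version, case is preserved, so `King` and `king` are counted separately, and the vowel condition filters out `King`, `king`, and `sing`.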


Slide 10 Dealing with complex morphology is sometimes necessary
Some languages require complex morpheme segmentation
Turkish
Uygarlastiramadiklarimizdanmissinizcasina
‘(behaving) as if you are among those whom we could not civilize’
Uygar ‘civilized’ + las ‘become’
+ tir ‘cause’ + ama ‘not able’
+ dik ‘past’ + lar ‘plural’
+ imiz ‘p1pl’ + dan ‘abl’
+ mis ‘past’ + siniz ‘2pl’ + casina ‘as if’

In which other language could word segmentation cause major problems?
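
One simple way to hold such a segmentation in code is a list of (morpheme, gloss) pairs. The sketch below only records the analysis given on the slide and checks that the pieces concatenate back to the original word; the names `WORD` and `SEGMENTS` are hypothetical, and no automatic segmenter is implied.

```python
WORD = "Uygarlastiramadiklarimizdanmissinizcasina"

# Morphemes and glosses exactly as listed on the slide.
SEGMENTS = [
    ("Uygar", "civilized"), ("las", "become"), ("tir", "cause"),
    ("ama", "not able"), ("dik", "past"), ("lar", "plural"),
    ("imiz", "p1pl"), ("dan", "abl"), ("mis", "past"),
    ("siniz", "2pl"), ("casina", "as if"),
]

# Sanity check: the morphemes must spell out the full word.
assert "".join(morpheme for morpheme, _ in SEGMENTS) == WORD

for morpheme, gloss in SEGMENTS:
    print(f"{morpheme:<8}{gloss}")
```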


Slide 11 Basic Text Processing
Word Normalization and Stemming
