What is the difference between lemmatization and stemming?

Question 1

When do I use each?

Also, is NLTK’s lemmatization dependent on part of speech? Wouldn’t it be more accurate if it were?

Question 2

Short and dense: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
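
To make the contrast concrete, here is a minimal sketch in plain Python. The suffix list and the four-entry vocabulary are made up for illustration; real stemmers such as Porter's use much more careful rules, and real lemmatizers use a full lexicon.

```python
# Toy illustration only: both the rule list and the vocabulary are hypothetical.
SUFFIXES = ("ing", "ful", "es", "ed", "s")

def crude_stem(word: str) -> str:
    """Chop the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical miniature vocabulary mapping inflected forms to lemmas.
LEMMAS = {"cars": "car", "ponies": "pony", "was": "be", "better": "good"}

def lookup_lemma(word: str) -> str:
    """Return the dictionary form if known, otherwise the word unchanged."""
    word = word.lower()
    return LEMMAS.get(word, word)

for w in ("cars", "ponies", "was", "better"):
    print(w, "->", crude_stem(w), "|", lookup_lemma(w))
```

The stemmer turns "ponies" into the non-word "poni" and cannot do anything useful with "was" or "better", while the lookup returns "pony", "be" and "good" because those forms are listed in its vocabulary.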

From the NLTK docs:

Lemmatization and stemming are special cases of normalization. They identify a canonical representative for a set of related word forms.

Question 3

Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.

For instance:

The word “better” has “good” as its lemma. This link is missed by stemming, as it requires a dictionary look-up.

The word “walk” is the base form for word “walking”, and hence this is matched in both stemming and lemmatisation.

The word “meeting” can be either the base form of a noun or a form of a verb (“to meet”) depending on the context, e.g., “in our last meeting” or “We are meeting again tomorrow”. Unlike stemming, lemmatisation can in principle select the appropriate lemma depending on the context.

Source: https://en.wikipedia.org/wiki/Lemmatisation

Question 4

There are two aspects to show their differences:

A stemmer returns the stem of a word, which needn’t be identical to the morphological root of the word. It is usually sufficient that related words map to the same stem, even if the stem is not itself a valid root. A lemmatizer, by contrast, returns the dictionary form of a word, which must be a valid word.
In lemmatisation, the part of speech of a word is determined first, and the normalisation rules differ from one part of speech to another. A stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.

Reference http://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization
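
The two points above (part-of-speech-specific rules, and a result that must be a valid word) can be sketched roughly as follows. The lexicon and the rule tables are tiny, hypothetical stand-ins; a real lemmatizer would rely on a resource such as WordNet.

```python
# Hypothetical miniature lexicon, split by part of speech.
VOCAB = {
    "n": {"stripe", "meeting", "car"},
    "v": {"strip", "meet", "care"},
}

# Hypothetical (suffix, replacement) rules, different for each part of speech.
RULES = {
    "n": [("s", ""), ("es", "")],
    "v": [("s", ""), ("es", ""), ("ing", ""), ("ing", "e"), ("ed", "")],
}

def rule_lemma(word: str, pos: str) -> str:
    """Apply the rules for `pos`; accept a candidate only if it is in VOCAB."""
    word = word.lower()
    if word in VOCAB[pos]:
        return word
    for suffix, replacement in RULES[pos]:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in VOCAB[pos]:
                return candidate
    return word  # no valid dictionary form found; leave the word unchanged

print(rule_lemma("stripes", "n"), rule_lemma("stripes", "v"))
print(rule_lemma("meeting", "n"), rule_lemma("meeting", "v"))
```

Because every candidate is checked against the lexicon, "caring" as a verb is rejected as "car" and accepted as "care", which a plain suffix stripper cannot do.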

Question 5

The purpose of both stemming and lemmatization is to reduce morphological variation. This is in contrast to the more general “term conflation” procedures, which may also address lexico-semantic, syntactic, or orthographic variations.

The real difference between stemming and lemmatization is threefold:

Stemming reduces word-forms to (pseudo)stems, whereas lemmatization reduces the word-forms to linguistically valid lemmas. This difference is apparent in languages with more complex morphology, but may be irrelevant for many IR applications;
Lemmatization deals only with inflectional variance, whereas stemming may also deal with derivational variance;
In terms of implementation, lemmatization is usually more sophisticated (especially for morphologically complex languages) and usually requires some sort of lexica. Satisfactory stemming, on the other hand, can be achieved with rather simple rule-based approaches.

Lemmatization may also be backed up by a part-of-speech tagger in order to disambiguate homonyms.

Question 6

As MYYN pointed out, stemming is the process of stripping inflectional and sometimes derivational affixes to arrive at a base form that all of the original words are probably related to. Lemmatization is concerned with obtaining the single word that allows you to group together a bunch of inflected forms. This is harder than stemming because it requires taking the context into account (and thus the meaning of the word), while stemming ignores context.

As for when you would use one or the other, it’s a matter of how much your application depends on getting the meaning of a word in context correct. If you’re doing machine translation, you probably want lemmatization to avoid mistranslating a word. If you’re doing information retrieval over a billion documents with 99% of your queries ranging from 1-3 words, you can settle for stemming.
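
As a rough illustration of the retrieval case, the sketch below indexes documents by stem so that a query for "walk" also finds "walking" and "walked". The `crude_stem` helper is a hypothetical stand-in for a real stemmer, and the three documents are invented.

```python
from collections import defaultdict

def crude_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strip a few common suffixes."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each stem to the ids of the documents containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[crude_stem(token)].add(doc_id)
    return index

def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return ids of documents matching every stemmed query term."""
    postings = [index.get(crude_stem(term), set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()

docs = {1: "walking the dogs", 2: "he walked home", 3: "a dog barked"}
index = build_index(docs)
print(search(index, "walk"))
print(search(index, "dogs walking"))
```

Nothing here depends on the stems being real words; it only matters that the query and the documents are reduced in the same way, which is why a cheap stemmer is often enough for this kind of matching.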

As for NLTK, the WordNetLemmatizer does use the part of speech, though you have to provide it (otherwise it defaults to nouns). Passing it “dove” and “v” yields “dive” while “dove” and “n” yields “dove”.
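
In NLTK itself the call is `WordNetLemmatizer().lemmatize(word, pos)`, which needs the WordNet data to be installed. The behaviour described above (a tag you supply, with noun as the default) can be mimicked with a small stand-in. This is not NLTK code, and the lookup table is a hypothetical sample rather than real WordNet data.

```python
# Hypothetical stand-in for a POS-aware lemmatizer; not NLTK, not WordNet data.
IRREGULAR = {
    ("dove", "v"): "dive",
    ("better", "a"): "good",
    ("meeting", "v"): "meet",
}

def lemmatize(word: str, pos: str = "n") -> str:
    """Look up (word, pos); treat the word as a noun when no tag is given."""
    word = word.lower()
    return IRREGULAR.get((word, pos), word)

print(lemmatize("dove"))        # no tag, so the noun reading is assumed
print(lemmatize("dove", "v"))   # explicit verb tag
```

Without a tag, "dove" stays "dove"; with the verb tag it becomes "dive". In practice the tag would usually come from a part-of-speech tagger run over the whole sentence.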

Question 7

An example-driven explanation of the differences between lemmatization and stemming:

Lemmatization handles matching “car” to “cars” along with matching “car” to “automobile”.

Stemming handles matching “car” to “cars”.

Lemmatization implies a broader scope of fuzzy word matching that is still handled by the same subsystems. It implies certain techniques for low level processing within the engine, and may also reflect an engineering preference for terminology.

[…] Taking FAST as an example, their lemmatization engine handles not only basic word variations like singular vs. plural, but also thesaurus operators like having “hot” match “warm”.

This is not to say that other engines don’t handle synonyms, of course they do, but the low level implementation may be in a different subsystem than those that handle base stemming.

http://www.ideaeng.com/stemming-lemmatization-0601

Question 8

IANACL, but I think stemming is a rough hack people use to get all the different forms of the same word down to a base form, which need not be a legitimate word on its own.
Something like the Porter Stemmer uses simple regexes to eliminate common word suffixes.

Lemmatization brings a word down to its actual base form which, in the case of irregular verbs, might look nothing like the input word.
Something like Morpha, which uses FSTs to bring nouns and verbs to their base form.

Question 9

Stemming just removes or stems the last few characters of a word, often leading to incorrect meanings and spelling. Lemmatization considers the context and converts the word to its meaningful base form, which is called Lemma. Sometimes, the same word can have multiple different Lemmas. We should identify the Part of Speech (POS) tag for the word in that specific context. Here are the examples to illustrate all the differences and use cases:

If you lemmatize the word ‘Caring’, it would return ‘Care’. If you stem it by simply stripping the suffix, it would return ‘Car’, which is erroneous.
If you lemmatize the word ‘Stripes’ in verb context, it would return ‘Strip’. If you lemmatize it in noun context, it would return ‘Stripe’. If you just stem it, it would just return ‘Strip’.
You would get the same results whether you lemmatize or stem words such as walking, running, swimming… to walk, run, swim etc.
Lemmatization is computationally expensive since it involves look-up tables and whatnot. If you have a large dataset and performance is an issue, go with stemming. Remember you can also add your own rules to stemming. If accuracy is paramount and the dataset isn’t humongous, go with lemmatization.
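
As a sketch of what "adding your own rules" could look like, here is a small suffix stripper that accepts extra rules and per-word exceptions. The class and its method names are hypothetical and do not correspond to any particular library.

```python
class RuleStemmer:
    """Hypothetical suffix stripper whose rules and exceptions can be extended."""

    def __init__(self) -> None:
        self.rules: list[tuple[str, str]] = [("ing", ""), ("es", ""), ("s", "")]
        self.exceptions: dict[str, str] = {}

    def add_rule(self, suffix: str, replacement: str = "") -> None:
        """Give a new rule priority over the built-in ones."""
        self.rules.insert(0, (suffix, replacement))

    def add_exception(self, word: str, stem: str) -> None:
        """Force a specific stem for a word the rules get wrong."""
        self.exceptions[word.lower()] = stem

    def stem(self, word: str) -> str:
        word = word.lower()
        if word in self.exceptions:
            return self.exceptions[word]
        for suffix, replacement in self.rules:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: len(word) - len(suffix)] + replacement
        return word

stemmer = RuleStemmer()
print(stemmer.stem("caring"))            # built-in rules only
stemmer.add_exception("caring", "care")  # patch the known bad case
stemmer.add_rule("ies", "y")             # extra rule for plurals in -ies
print(stemmer.stem("caring"), stemmer.stem("ponies"), stemmer.stem("stripes"))
```

With only the built-in rules "caring" comes out as "car"; after the exception and the extra rule are added it comes out as "care", "ponies" becomes "pony", and "stripes" still becomes "strip" regardless of part of speech.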

Question 10

Stemming is the process of removing the last few characters of a given word, to obtain a shorter form, even if that form doesn’t have any meaning.

Examples:

"beautiful" -> "beauti"
"corpora" -> "corpora"

Stemming can be done very quickly.

Lemmatization, on the other hand, is the process of converting the given word into its base form according to the dictionary meaning of the word.

Examples:

"beautiful" -> "beauty"
"corpora" -> "corpus"

Lemmatization takes more time than stemming.
