Tag Archives: NLP

Cpca: a handy Python module that automatically recognizes provinces, cities and districts in text and plots them

When working on NLP (natural language processing) tasks, you often need to recognize and extract provinces, cities and administrative districts from text. You could do this yourself by looking terms up one by one against a keyword list, but you would first have to collect that list of province/city/district keywords, which is fairly tedious.

Today I'd like to introduce a module that does this for you: pass it a string and it returns the province, city and district mentioned in that string, and it can even mark them on a map. It is the cpca module.

1. Preparation

Before you start, make sure Python and pip have been successfully installed on your computer. If not, please read this article: the Super-Detailed Python Installation Guide.

(Optional 1) If you use Python mainly for data analysis, you can install Anaconda directly (Anaconda, a Great Helper for Python Data Analysis and Mining); it comes with Python and pip built in.

(Optional 2) I also recommend the VSCode editor for writing small Python projects: VSCode, the Best Companion for Python Programming: A Detailed Guide.

On Windows, open Cmd (Start, then Run, then type CMD); on macOS, open Terminal (Command + Space, then type Terminal), and run the following command to install the dependency:

pip install cpca

Note that the cpca module currently supports only Python 3 and above.

On Windows, you may run into a problem like the following:

Building wheel for pyahocorasick (setup.py) ... error

Download and install the Microsoft Visual C++ Build Tools first, then run pip install cpca again, and the problem will be solved.

2. Basic Usage of cpca

The most basic province/city/district extraction takes just two lines of code:

# WeChat official account: Python 实用宝典
# 2022/06/23

import cpca

location_str = [
    "广东省深圳市福田区巴丁街深南中路1025号新城大厦1层",
    "特斯拉上海超级工厂是特斯拉汽车首座美国本土以外的超级工厂,位于中华人民共和国上海市。",
    "三星堆遗址位于中国四川省广汉市城西三星堆镇的鸭子河畔,属青铜时代文化遗址"
]
df = cpca.transform(location_str)
print(df)

The output looks like this:

     省     市     区                     地址  adcode
0  广东省   深圳市   福田区     巴丁街深南中路1025号新城大厦1层  440304
1  上海市  None  None                      。  310000
2  四川省   德阳市   广汉市  城西三星堆镇的鸭子河畔,属青铜时代文化遗址  510681

Note the third entry: cpca not only recognized the county-level city 广汉市 in the sentence, it also automatically mapped it to 德阳市, the prefecture-level city that administers it, which is quite impressive.
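
Since the printed output suggests that cpca.transform returns an ordinary pandas DataFrame, you can post-process the result with the usual pandas operations. A minimal sketch (the column names "省", "区" are taken from the output above; treating the result as a standard DataFrame is an assumption):

import cpca

location_str = [
    "广东省深圳市福田区巴丁街深南中路1025号新城大厦1层",
    "三星堆遗址位于中国四川省广汉市城西三星堆镇的鸭子河畔,属青铜时代文化遗址",
]
df = cpca.transform(location_str)

# Count how many input strings were mapped to each province
print(df["省"].value_counts())

# Keep only the rows where a district-level area ("区") was resolved
print(df[df["区"].notna()])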

If you want to know from which position in the string the program extracted the province/city/district names, add the pos_sensitive=True parameter:

# WeChat official account: Python 实用宝典
# 2022/06/23

import cpca

location_str = [
    "广东省深圳市福田区巴丁街深南中路1025号新城大厦1层",
    "特斯拉上海超级工厂是特斯拉汽车首座美国本土以外的超级工厂,位于中华人民共和国上海市。",
    "三星堆遗址位于中国四川省广汉市城西三星堆镇的鸭子河畔,属青铜时代文化遗址"
]
df = cpca.transform(location_str, pos_sensitive=True)
print(df)

The output looks like this:

(base) G:\push\20220623>python 1.py
     省     市     区                     地址  adcode  省_pos  市_pos  区_pos
0  广东省   深圳市   福田区     巴丁街深南中路1025号新城大厦1层  440304      0      3      6
1  上海市  None  None                      。  310000     38     -1     -1
2  四川省   德阳市   广汉市  城西三星堆镇的鸭子河畔,属青铜时代文化遗址  510681      9     -1     12

It marks the position (index) at which each province, city and district was recognized. For special cases such as 德阳市, which is inferred rather than found in the text, the position is marked as -1.
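
A minimal sketch showing how these *_pos values map back to the original string (the indices 0, 3 and 6 are taken from the first row of the output above; this is plain string slicing, not part of cpca):

s = "广东省深圳市福田区巴丁街深南中路1025号新城大厦1层"

# 省_pos=0, 市_pos=3, 区_pos=6: each name happens to be three characters long here
print(s[0:3])   # 广东省
print(s[3:6])   # 深圳市
print(s[6:9])   # 福田区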

3. Advanced Usage

It can also recognize multiple regions in bulk from a long piece of text:

# WeChat official account: Python 实用宝典
# 2022/06/23

import cpca

long_text = "对一个城市的评价总会包含个人的感情。如果你喜欢一个城市,很有可能是喜欢彼时彼地的自己。"\
    "在广州、香港读过书,工作过,在深圳买过房、短暂生活过,去北京出了几次差。"\
    "想重点比较一下广州、深圳和香港,顺带说一下北京。总的来说,觉得广州舒适、"\
    "香港精致、深圳年轻气氛好、北京大气又粗糙。答主目前选择了广州。"
df = cpca.transform_text_with_addrs(long_text, pos_sensitive=True)
print(df)

The output looks like this:

(base) G:\push\20220623>python 1.py
          省     市     区 地址  adcode  省_pos  市_pos  区_pos
0       广东省   广州市  None     440100     -1     44     -1
1   香港特别行政区  None  None     810000     47     -1     -1
2       广东省   深圳市  None     440300     -1     58     -1
3       北京市  None  None     110000     71     -1     -1
4       广东省   广州市  None     440100     -1     86     -1
5       广东省   深圳市  None     440300     -1     89     -1
6   香港特别行政区  None  None     810000     92     -1     -1
7       北京市  None  None     110000    100     -1     -1
8       广东省   广州市  None     440100     -1    110     -1
9   香港特别行政区  None  None     810000    115     -1     -1
10      广东省   深圳市  None     440300     -1    120     -1
11      北京市  None  None     110000    128     -1     -1
12      广东省   广州市  None     440100     -1    143     -1

What's more, the module also ships with some simple drawing tools that can render the data output above as a heat map on a map:

# WeChat official account: Python 实用宝典
# 2022/06/23

import cpca
from cpca import drawer

long_text = "对一个城市的评价总会包含个人的感情。如果你喜欢一个城市,很有可能是喜欢彼时彼地的自己。"\
    "在广州、香港读过书,工作过,在深圳买过房、短暂生活过,去北京出了几次差。"\
    "想重点比较一下广州、深圳和香港,顺带说一下北京。总的来说,觉得广州舒适、"\
    "香港精致、深圳年轻气氛好、北京大气又粗糙。答主目前选择了广州。"
df = cpca.transform_text_with_addrs(long_text, pos_sensitive=True)
drawer.draw_locations(df[cpca._ADCODE], "df.html")

When you run it, you may get this error:

(base) G:\push\20220623>python 1.py
Traceback (most recent call last):
  File "1.py", line 12, in <module>
    drawer.draw_locations(df[cpca._ADCODE], "df.html")
  File "G:\Anaconda3\lib\site-packages\cpca\drawer.py", line 41, in draw_locations
    import folium
ModuleNotFoundError: No module named 'folium'

Just install it with pip:

pip install folium

Then run the code again. It will generate df.html in the current directory; double-click to open it, and the result looks like this:

Doesn't that feel really convenient? From now on, this module is all you need for recognizing Chinese place names.

For more details, you can visit the project's GitHub page; the project's README is written entirely in Chinese and is very easy to read:

https://github.com/DQinYuan/chinese_province_city_area_mapper

If you cannot access GitHub, you can also reply cpca in the backend of the Python实用宝典 WeChat official account to download the complete project.

That's all for this article. If you enjoyed today's Python tutorial, please keep following Python实用宝典.


What is the difference between lemmatization and stemming?

Question: What is the difference between lemmatization and stemming?

When do I use each ?

Also…is the NLTK lemmatization dependent upon Parts of Speech? Wouldn’t it be more accurate if it was?


Answer 0

Short and dense: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma .

From the NLTK docs:

Lemmatization and stemming are special cases of normalization. They identify a canonical representative for a set of related word forms.


Answer 1

Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.

For instance:

  1. The word “better” has “good” as its lemma. This link is missed by stemming, as it requires a dictionary look-up.

  2. The word “walk” is the base form for word “walking”, and hence this is matched in both stemming and lemmatisation.

  3. The word “meeting” can be either the base form of a noun or a form of a verb (“to meet”) depending on the context, e.g., “in our last meeting” or “We are meeting again tomorrow”. Unlike stemming, lemmatisation can in principle select the appropriate lemma depending on the context.

Source: https://en.wikipedia.org/wiki/Lemmatisation
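
A minimal NLTK sketch of the first two examples above (assuming the WordNet data has already been downloaded with nltk.download('wordnet'); the pos arguments passed to the lemmatizer are an added assumption, since the quoted text does not mention NLTK):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# "better" -> "good" needs the dictionary look-up; the stemmer misses it.
print(stemmer.stem("better"))                    # better
print(lemmatizer.lemmatize("better", pos="a"))   # good (as an adjective)

# "walking" -> "walk" is handled by both.
print(stemmer.stem("walking"))                   # walk
print(lemmatizer.lemmatize("walking", pos="v"))  # walk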


Answer 2

There are two aspects to show their differences:

  1. A stemmer will return the stem of a word, which needn't be identical to the morphological root of the word. It is usually sufficient that related words map to the same stem, even if the stem is not in itself a valid root, while in lemmatisation, it will return the dictionary form of a word, which must be a valid word.

  2. In lemmatisation, the part of speech of a word should be determined first, and the normalisation rules will be different for different parts of speech, while the stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.

Reference http://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization


Answer 3

The purpose of both stemming and lemmatization is to reduce morphological variation. This is in contrast to the more general "term conflation" procedures, which may also address lexico-semantic, syntactic, or orthographic variations.

The real difference between stemming and lemmatization is threefold:

  1. Stemming reduces word-forms to (pseudo)stems, whereas lemmatization reduces the word-forms to linguistically valid lemmas. This difference is apparent in languages with more complex morphology, but may be irrelevant for many IR applications;

  2. Lemmatization deals only with inflectional variance, whereas stemming may also deal with derivational variance;

  3. In terms of implementation, lemmatization is usually more sophisticated (especially for morphologically complex languages) and usually requires some sort of lexicon. Satisfactory stemming, on the other hand, can be achieved with rather simple rule-based approaches.

Lemmatization may also be backed up by a part-of-speech tagger in order to disambiguate homonyms.


Answer 4

As MYYN pointed out, stemming is the process of removing inflectional and sometimes derivational affixes to a base form that all of the original words are probably related to. Lemmatization is concerned with obtaining the single word that allows you to group together a bunch of inflected forms. This is harder than stemming because it requires taking the context into account (and thus the meaning of the word), while stemming ignores context.

As for when you would use one or the other, it’s a matter of how much your application depends on getting the meaning of a word in context correct. If you’re doing machine translation, you probably want lemmatization to avoid mistranslating a word. If you’re doing information retrieval over a billion documents with 99% of your queries ranging from 1-3 words, you can settle for stemming.

As for NLTK, the WordNetLemmatizer does use the part of speech, though you have to provide it (otherwise it defaults to nouns). Passing it “dove” and “v” yields “dive” while “dove” and “n” yields “dove”.
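
For reference, a minimal sketch of that call (assuming the WordNet corpus has been downloaded):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("dove", pos="v"))  # dive
print(lemmatizer.lemmatize("dove", pos="n"))  # dove
print(lemmatizer.lemmatize("dove"))           # dove (pos defaults to noun)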


Answer 5

An example-driven explanation of the differences between lemmatization and stemming:

Lemmatization handles matching “car” to “cars” along with matching “car” to “automobile”.

Stemming handles matching “car” to “cars” .

Lemmatization implies a broader scope of fuzzy word matching that is still handled by the same subsystems. It implies certain techniques for low level processing within the engine, and may also reflect an engineering preference for terminology.

[…] Taking FAST as an example, their lemmatization engine handles not only basic word variations like singular vs. plural, but also thesaurus operators like having “hot” match “warm”.

This is not to say that other engines don’t handle synonyms, of course they do, but the low level implementation may be in a different subsystem than those that handle base stemming.

http://www.ideaeng.com/stemming-lemmatization-0601


Answer 6

ianacl
but i think Stemming is a rough hack people use to get all the different forms of the same word down to a base form which need not be a legit word on its own
Something like the Porter Stemmer can use simple regexes to eliminate common word suffixes

Lemmatization brings a word down to its actual base form which, in the case of irregular verbs, might look nothing like the input word
Something like Morpha which uses FSTs to bring nouns and verbs to their base form


Answer 7

Stemming just removes or stems the last few characters of a word, often leading to incorrect meanings and spelling. Lemmatization considers the context and converts the word to its meaningful base form, which is called Lemma. Sometimes, the same word can have multiple different Lemmas. We should identify the Part of Speech (POS) tag for the word in that specific context. Here are the examples to illustrate all the differences and use cases:

  1. If you lemmatize the word ‘Caring‘, it would return ‘Care‘. If you stem, it would return ‘Car‘ and this is erroneous.
  2. If you lemmatize the word ‘Stripes‘ in verb context, it would return ‘Strip‘. If you lemmatize it in noun context, it would return ‘Stripe‘. If you just stem it, it would just return ‘Strip‘.
  3. You would get the same results whether you lemmatize or stem words such as walking, running, swimming… to walk, run, swim, etc.
  4. Lemmatization is computationally expensive since it involves look-up tables and what not. If you have large dataset and performance is an issue, go with Stemming. Remember you can also add your own rules to Stemming. If accuracy is paramount and dataset isn’t humongous, go with Lemmatization.

Answer 8

Stemming is the process of removing the last few characters of a given word, to obtain a shorter form, even if that form doesn’t have any meaning.

Examples,

"beautiful" -> "beauti"
"corpora" -> "corpora"

Stemming can be done very quickly.

Lemmatization, on the other hand, is the process of converting the given word into its base form according to the dictionary meaning of the word.

Examples,

"beautiful" -> "beauty"
"corpora" -> "corpus"

Lemmatization takes more time than stemming.
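
A hedged NLTK sketch of these examples: the Porter stemmer produces the truncated forms shown above, and the WordNet lemmatizer resolves the irregular plural "corpora" through its dictionary. Note that because the WordNet lemmatizer only undoes inflection, it leaves the adjective "beautiful" unchanged; the "beautiful" -> "beauty" mapping above assumes a lemmatizer that also handles derivational morphology.

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("beautiful"))         # beauti  (not a real word)
print(stemmer.stem("corpora"))           # corpora (unchanged)

print(lemmatizer.lemmatize("corpora"))   # corpus  (dictionary look-up)
print(lemmatizer.lemmatize("beautiful")) # beautiful (inflection only)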


How to get rid of punctuation using the NLTK tokenizer?

Question: How to get rid of punctuation using the NLTK tokenizer?

I’m just starting to use NLTK and I don’t quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words instead. How can I get rid of punctuation? Also word_tokenize doesn’t work with multiple sentences: dots are added to the last word.


Answer 0

Take a look at the other tokenizing options that nltk provides here. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')

Output:

['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']

Answer 1

You do not really need NLTK to remove punctuation. You can remove it with simple python. For strings:

import string
s = '... some string with punctuation ...'
s = s.translate(None, string.punctuation)

Or for unicode:

import string
translate_table = dict((ord(char), None) for char in string.punctuation)   
s.translate(translate_table)

and then use this string in your tokenizer.

P.S. The string module has some other sets of elements that can be removed (like digits).


Answer 2

The code below will remove all punctuation marks as well as non-alphabetic characters. It is copied from their book:

http://www.nltk.org/book/ch01.html

import nltk

s = "I can't do this now, because I'm so tired.  Please give me some time. @ sd  4 232"

words = nltk.word_tokenize(s)

words=[word.lower() for word in words if word.isalpha()]

print(words)

output

['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']

Answer 3

As noted in the comments, start with sent_tokenize(), because word_tokenize() works only on a single sentence. You can filter out punctuation with filter(). And if you have unicode strings, make sure they are unicode objects (not 'str' objects encoded with some encoding like 'utf-8').

from nltk.tokenize import word_tokenize, sent_tokenize

text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
print filter(lambda word: word not in ',-', tokens)

Answer 4

I just used the following code, which removed all the punctuation:

tokens = nltk.wordpunct_tokenize(raw)

type(tokens)

text = nltk.Text(tokens)

type(text)  

words = [w.lower() for w in text if w.isalpha()]

Answer 5

I think you need some sort of regular expression matching (the following code is in Python 3):

import string
import re
import nltk

s = "I can't do this now, because I'm so tired.  Please give me some time."
l = nltk.word_tokenize(s)
ll = [x for x in l if not re.fullmatch('[' + string.punctuation + ']+', x)]
print(l)
print(ll)

Output:

['I', 'ca', "n't", 'do', 'this', 'now', ',', 'because', 'I', "'m", 'so', 'tired', '.', 'Please', 'give', 'me', 'some', 'time', '.']
['I', 'ca', "n't", 'do', 'this', 'now', 'because', 'I', "'m", 'so', 'tired', 'Please', 'give', 'me', 'some', 'time']

Should work well in most cases since it removes punctuation while preserving tokens like “n’t”, which can’t be obtained from regex tokenizers such as wordpunct_tokenize.


Answer 6

Sincerely asking, what is a word? If your assumption is that a word consists of alphabetic characters only, you are wrong, since words such as can't will be broken into pieces (such as can and t) if you remove punctuation before tokenisation, which is very likely to affect your program negatively.

Hence the solution is to tokenise and then remove punctuation tokens.

import string

from nltk.tokenize import word_tokenize

tokens = word_tokenize("I'm a southern salesman.")
# ['I', "'m", 'a', 'southern', 'salesman', '.']

tokens = list(filter(lambda token: token not in string.punctuation, tokens))
# ['I', "'m", 'a', 'southern', 'salesman']

…and then if you wish, you can replace certain tokens such as 'm with am.
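
A minimal sketch of that follow-up step, expanding a couple of contraction tokens once punctuation has been filtered out (the expansion mapping is illustrative, not exhaustive; assumes the usual NLTK tokenizer data, e.g. punkt, is installed):

import string

from nltk.tokenize import word_tokenize

tokens = word_tokenize("I'm a southern salesman.")
tokens = [t for t in tokens if t not in string.punctuation]

expansions = {"'m": "am", "n't": "not"}          # illustrative mapping only
tokens = [expansions.get(t, t) for t in tokens]
# ['I', 'am', 'a', 'southern', 'salesman']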


Answer 7

I use this code to remove punctuation:

import nltk
def getTerms(sentences):
    tokens = nltk.word_tokenize(sentences)
    words = [w.lower() for w in tokens if w.isalnum()]
    print tokens
    print words

getTerms("hh, hh3h. wo shi 2 4 A . fdffdf. A&&B ")

And If you want to check whether a token is a valid English word or not, you may need PyEnchant

Tutorial:

 import enchant
 d = enchant.Dict("en_US")
 d.check("Hello")
 d.check("Helo")
 d.suggest("Helo")

Answer 8

Remove punctuation (this will remove '.' as well, as part of the punctuation handling, using the code below):

        tbl = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))
        text_string = text_string.translate(tbl) #text_string don't have punctuation
        w = word_tokenize(text_string)  #now tokenize the string 

Sample Input/Output:

direct flat in oberoi esquire. 3 bhk 2195 saleable 1330 carpet. rate of 14500 final plus 1% floor rise. tax approx 9% only. flat cost with parking 3.89 cr plus taxes plus possession charger. middle floor. north door. arey and oberoi woods facing. 53% paymemt due. 1% transfer charge with buyer. total cost around 4.20 cr approx plus possession charges. rahul soni

['direct', 'flat', 'oberoi', 'esquire', '3', 'bhk', '2195', 'saleable', '1330', 'carpet', 'rate', '14500', 'final', 'plus', '1', 'floor', 'rise', 'tax', 'approx', '9', 'flat', 'cost', 'parking', '389', 'cr', 'plus', 'taxes', 'plus', 'possession', 'charger', 'middle', 'floor', 'north', 'door', 'arey', 'oberoi', 'woods', 'facing', '53', 'paymemt', 'due', '1', 'transfer', 'charge', 'buyer', 'total', 'cost', 'around', '420', 'cr', 'approx', 'plus', 'possession', 'charges', 'rahul', 'soni']


Answer 9

Just adding to the solution by @rmalouf, this will not include any numbers because \w+ is equivalent to [a-zA-Z0-9_]

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')

Answer 10

You can do it in one line without nltk (python 3.x).

import string
string_text= string_text.translate(str.maketrans('','',string.punctuation))

Java or Python for natural language processing

Question: Java or Python for natural language processing

I would like to know which programming language is better for natural language processing, Java or Python? I have found lots of questions and answers about it. But I am still lost in choosing which one to use.

And I want to know which NLP library to use for Java since there are lots of libraries (LingPipe, GATE, OpenNLP, StanfordNLP). For Python, most programmers recommend NLTK.

But if I am to do some text processing or information extraction from unstructured data (just free formed plain English text) to get some useful information, what is the best option? Java or Python? Suitable library?

Updated

What I want to do is to extract useful product information from unstructured data (E.g. users make different forms of advertisement about mobiles or laptops with not very standard English language)


Answer 0

Java vs Python for NLP is very much a preference or necessity. Depending on the company/projects you’ll need to use one or the other and often there isn’t much of a choice unless you’re heading a project.

Other than NLTK (www.nltk.org), there are actually other libraries for text processing in python:

(for more, see https://pypi.python.org/pypi?%3Aaction=search&term=natural+language+processing&submit=search)

For Java, there’re tonnes of others but here’s another list:

This is a nice comparison for basic string processing, see http://nltk.googlecode.com/svn/trunk/doc/howto/nlp-python.html

A useful comparison of GATE vs UIMA vs OpenNLP, see https://www.assembla.com/spaces/extraction-of-cost-data/wiki/Gate-vs-UIMA-vs-OpenNLP?version=4

If you’re uncertain, which is the language to go for NLP, personally i say, “any language that will give you the desired analysis/output”, see Which language or tools to learn for natural language processing?

Here’s a pretty recent (2017) of NLP tools: https://github.com/alvations/awesome-community-curated-nlp

An older list of NLP tools (2013): http://web.archive.org/web/20130703190201/http://yauhenklimovich.wordpress.com/2013/05/20/tools-nlp


Other than language processing tools, you would very much need machine learning tools to incorporate into NLP pipelines.

There’s a whole range in Python and Java, and once again it’s up to preference and whether the libraries are user-friendly enough:

Machine Learning libraries in python:

(for more, see https://pypi.python.org/pypi?%3Aaction=search&term=machine+learning&submit=search)


With the recent (2015) deep learning tsunami in NLP, possibly you could consider: https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software

I’ll avoid listing deep learning tools out of non-favoritism / neutrality.


Other Stackoverflow questions that also asked for NLP/ML tools:


Answer 1

The question is very open ended. That said, rather than choose one, below is a comparison depending on the language that you would like to use (since there are good libraries available in both languages).

Python

In terms of Python, the first place you should look at is the Python Natural Language Toolkit. As they note in their description, NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

There is also some excellent code that you can look up that originated out of Google’s Natural Language Toolkit project that is Python based. You can find a link to that code here on GitHub.

Java

The first place to look would be Stanford's Natural Language Processing Group. All of the software that is distributed there is written in Java. All recent distributions require Oracle Java 6+ or OpenJDK 7+. Distribution packages include components for command-line invocation, jar files, a Java API, and source code.

Another great option that you see in a lot of machine learning environments here (general option), is Weka. Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.


How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit-learn?

Question: How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit-learn?

I’m working in a sentiment analysis problem the data looks like this:

label instances
    5    1190
    4     838
    3     239
    1     204
    2     127

So my data is unbalanced since 1190 instances are labeled with 5. For the classification I'm using scikit's SVC. The problem is I do not know how to balance my data in the right way in order to accurately compute the precision, recall, accuracy and f1-score for the multiclass case. So I tried the following approaches:

First:

    wclf = SVC(kernel='linear', C= 1, class_weight={1: 10})
    wclf.fit(X, y)
    weighted_prediction = wclf.predict(X_test)

print 'Accuracy:', accuracy_score(y_test, weighted_prediction)
print 'F1 score:', f1_score(y_test, weighted_prediction,average='weighted')
print 'Recall:', recall_score(y_test, weighted_prediction,
                              average='weighted')
print 'Precision:', precision_score(y_test, weighted_prediction,
                                    average='weighted')
print '\n clasification report:\n', classification_report(y_test, weighted_prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, weighted_prediction)

Second:

auto_wclf = SVC(kernel='linear', C= 1, class_weight='auto')
auto_wclf.fit(X, y)
auto_weighted_prediction = auto_wclf.predict(X_test)

print 'Accuracy:', accuracy_score(y_test, auto_weighted_prediction)

print 'F1 score:', f1_score(y_test, auto_weighted_prediction,
                            average='weighted')

print 'Recall:', recall_score(y_test, auto_weighted_prediction,
                              average='weighted')

print 'Precision:', precision_score(y_test, auto_weighted_prediction,
                                    average='weighted')

print '\n clasification report:\n', classification_report(y_test,auto_weighted_prediction)

print '\n confussion matrix:\n',confusion_matrix(y_test, auto_weighted_prediction)

Third:

clf = SVC(kernel='linear', C= 1)
clf.fit(X, y)
prediction = clf.predict(X_test)


from sklearn.metrics import precision_score, \
    recall_score, confusion_matrix, classification_report, \
    accuracy_score, f1_score

print 'Accuracy:', accuracy_score(y_test, prediction)
print 'F1 score:', f1_score(y_test, prediction)
print 'Recall:', recall_score(y_test, prediction)
print 'Precision:', precision_score(y_test, prediction)
print '\n clasification report:\n', classification_report(y_test,prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, prediction)


F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
  sample_weight=sample_weight)
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
  sample_weight=sample_weight)
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1082: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
  sample_weight=sample_weight)
 0.930416613529

However, I'm getting warnings like this:

/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172:
DeprecationWarning: The default `weighted` averaging is deprecated,
and from version 0.18, use of precision, recall or F-score with 
multiclass or multilabel data or pos_label=None will result in an 
exception. Please set an explicit value for `average`, one of (None, 
'micro', 'macro', 'weighted', 'samples'). In cross validation use, for 
instance, scoring="f1_weighted" instead of scoring="f1"

How can I deal correctly with my unbalanced data in order to compute in the right way classifier’s metrics?


Answer 0

I think there is a lot of confusion about which weights are used for what. I am not sure I know precisely what bothers you so I am going to cover different topics, bear with me ;).

Class weights

The weights from the class_weight parameter are used to train the classifier. They are not used in the calculation of any of the metrics you are using: with different class weights, the numbers will be different simply because the classifier is different.

Basically in every scikit-learn classifier, the class weights are used to tell your model how important a class is. That means that during the training, the classifier will make extra efforts to classify properly the classes with high weights.
How they do that is algorithm-specific. If you want details about how it works for SVC and the doc does not make sense to you, feel free to mention it.

The metrics

Once you have a classifier, you want to know how well it is performing. Here you can use the metrics you mentioned: accuracy, recall_score, f1_score

Usually when the class distribution is unbalanced, accuracy is considered a poor choice as it gives high scores to models which just predict the most frequent class.

I will not detail all these metrics but note that, with the exception of accuracy, they are naturally applied at the class level: as you can see in this print of a classification report they are defined for each class. They rely on concepts such as true positives or false negative that require defining which class is the positive one.

             precision    recall  f1-score   support

          0       0.65      1.00      0.79        17
          1       0.57      0.75      0.65        16
          2       0.33      0.06      0.10        17
avg / total       0.52      0.60      0.51        50

The warning

F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The 
default `weighted` averaging is deprecated, and from version 0.18, 
use of precision, recall or F-score with multiclass or multilabel data  
or pos_label=None will result in an exception. Please set an explicit 
value for `average`, one of (None, 'micro', 'macro', 'weighted', 
'samples'). In cross validation use, for instance, 
scoring="f1_weighted" instead of scoring="f1".

You get this warning because you are using the f1-score, recall and precision without defining how they should be computed! The question could be rephrased: from the above classification report, how do you output one global number for the f1-score? You could:

  1. Take the average of the f1-score for each class: that’s the avg / total result above. It’s also called macro averaging.
  2. Compute the f1-score using the global count of true positives / false negatives, etc. (you sum the number of true positives / false negatives for each class). Aka micro averaging.
  3. Compute a weighted average of the f1-score. Using 'weighted' in scikit-learn will weigh the f1-score by the support of the class: the more elements a class has, the more important the f1-score for this class in the computation.

These are 3 of the options in scikit-learn, the warning is there to say you have to pick one. So you have to specify an average argument for the score method.

Which one you choose is up to how you want to measure the performance of the classifier: for instance macro-averaging does not take class imbalance into account and the f1-score of class 1 will be just as important as the f1-score of class 5. If you use weighted averaging however you’ll get more importance for the class 5.

The whole argument specification in these metrics is not super-clear in scikit-learn right now, it will get better in version 0.18 according to the docs. They are removing some non-obvious standard behavior and they are issuing warnings so that developers notice it.

Computing scores

Last thing I want to mention (feel free to skip it if you’re aware of it) is that scores are only meaningful if they are computed on data that the classifier has never seen. This is extremely important as any score you get on data that was used in fitting the classifier is completely irrelevant.

Here’s a way to do it using StratifiedShuffleSplit, which gives you a random splits of your data (after shuffling) that preserve the label distribution.

from sklearn.datasets import make_classification
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.svm import SVC

# We use a utility to generate artificial classification data.
X, y = make_classification(n_samples=100, n_informative=10, n_classes=3)
svc = SVC(kernel='linear', C=1)  # define the classifier evaluated in the loop below
sss = StratifiedShuffleSplit(y, n_iter=1, test_size=0.5, random_state=0)
for train_idx, test_idx in sss:
    X_train, X_test, y_train, y_test = X[train_idx], X[test_idx], y[train_idx], y[test_idx]
    svc.fit(X_train, y_train)
    y_pred = svc.predict(X_test)
    print(f1_score(y_test, y_pred, average="macro"))
    print(precision_score(y_test, y_pred, average="macro"))
    print(recall_score(y_test, y_pred, average="macro"))    

Hope this helps.


Answer 1

There are a lot of very detailed answers here, but I don't think you are answering the right questions. As I understand the question, there are two concerns:

  1. How to I score a multiclass problem?
  2. How do I deal with unbalanced data?

1.

You can use most of the scoring functions in scikit-learn with multiclass problems as well as with single-class problems. Ex.:

from sklearn.metrics import precision_recall_fscore_support as score

predicted = [1,2,3,4,5,1,2,1,1,4,5] 
y_test = [1,2,3,4,5,1,2,1,1,4,1]

precision, recall, fscore, support = score(y_test, predicted)

print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))

This way you end up with tangible and interpretable numbers for each of the classes.

| Label | Precision | Recall | FScore | Support |
|-------|-----------|--------|--------|---------|
| 1     | 94%       | 83%    | 0.88   | 204     |
| 2     | 71%       | 50%    | 0.54   | 127     |
| ...   | ...       | ...    | ...    | ...     |
| 4     | 80%       | 98%    | 0.89   | 838     |
| 5     | 93%       | 81%    | 0.91   | 1190    |

Then…

2.

… you can tell if the unbalanced data is even a problem. If the scoring for the less represented classes (class 1 and 2) are lower than for the classes with more training samples (class 4 and 5) then you know that the unbalanced data is in fact a problem, and you can act accordingly, as described in some of the other answers in this thread. However, if the same class distribution is present in the data you want to predict on, your unbalanced training data is a good representative of the data, and hence, the unbalance is a good thing.


Answer 2

Posed question

Responding to the question 'what metric should be used for multi-class classification with imbalanced data': Macro-F1-measure. Macro Precision and Macro Recall can also be used, but they are not as easily interpretable as for binary classification; they are already incorporated into the F-measure, and extra metrics complicate method comparison, parameter tuning, and so on.

Micro averaging is sensitive to class imbalance: if your method, for example, works well for the most common labels and totally messes up the others, micro-averaged metrics will still show good results.

Weighting averaging isn’t well suited for imbalanced data, because it weights by counts of labels. Moreover, it is too hardly interpretable and unpopular: for instance, there is no mention of such an averaging in the following very detailed survey I strongly recommend to look through:

Sokolova, Marina, and Guy Lapalme. “A systematic analysis of performance measures for classification tasks.” Information Processing & Management 45.4 (2009): 427-437.

Application-specific question

However, returning to your task, I’d research 2 topics:

  1. metrics commonly used for your specific task: this lets you (a) compare your method with others and understand if you are doing something wrong, and (b) reuse someone else's findings instead of exploring this by yourself;
  2. cost of different errors of your method: for example, the use-case of your application may rely on 4- and 5-star reviews only; in this case, a good metric should count only these 2 labels.

Commonly used metrics. As I can infer after looking through literature, there are 2 main evaluation metrics:

  1. Accuracy, which is used, e.g. in

Yu, April, and Daryl Chang. “Multiclass Sentiment Prediction using Yelp Business.”

(link) – note that the authors work with almost the same distribution of ratings, see Figure 5.

Pang, Bo, and Lillian Lee. “Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.” Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2005.

(link)

  2. MSE (or, less often, Mean Absolute Error – MAE) – see, for example,

Lee, Moontae, and R. Grafe. “Multiclass sentiment analysis with restaurant reviews.” Final Projects from CS N 224 (2010).

(link) – they explore both accuracy and MSE, considering the latter to be better

Pappas, Nikolaos, Rue Marconi, and Andrei Popescu-Belis. “Explaining the Stars: Weighted Multiple-Instance Learning for Aspect-Based Sentiment Analysis.” Proceedings of the 2014 Conference on Empirical Methods In Natural Language Processing. No. EPFL-CONF-200899. 2014.

(link) – they utilize scikit-learn for evaluation and baseline approaches and state that their code is available; however, I can’t find it, so if you need it, write a letter to the authors, the work is pretty new and seems to be written in Python.

Cost of different errors. If you care more about avoiding gross blunders, e.g. assigning a 1-star label to a 5-star review or something like that, look at MSE; if the difference matters, but not so much, try MAE, since it doesn't square the difference; otherwise stay with Accuracy.

About approaches, not metrics

Try regression approaches, e.g. SVR, since they generally outperform multiclass classifiers like SVC or OVA SVM.
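
A minimal sketch of the MSE/MAE style of evaluation described above, assuming the 1-5 star labels are treated as plain integers (the two label vectors below are illustrative, not real data):

from sklearn.metrics import mean_squared_error, mean_absolute_error

y_test    = [5, 4, 3, 1, 2, 5, 4]   # true star ratings (illustrative)
predicted = [5, 5, 2, 1, 2, 4, 1]   # a model's predictions (illustrative)

# MSE punishes gross blunders (e.g. predicting 1 for a 4) quadratically, MAE only linearly
print(mean_squared_error(y_test, predicted))
print(mean_absolute_error(y_test, predicted))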


Answer 3

First of all, it’s a bit hard to tell, using counting analysis alone, whether your data is unbalanced or not. For example: is 1 positive observation in 1000 just noise, an error, or a scientific breakthrough? You never know.
So it’s always better to use all your available knowledge and choose its status wisely.

Okay, what if it’s really unbalanced?
Once again: look at your data. Sometimes you can find one or two observations duplicated hundreds of times. Sometimes it’s useful to create such fake single-class observations.
If all the data is clean, the next step is to use class weights in the prediction model.

So what about multiclass metrics?
In my experience, none of your metrics is usually used. There are two main reasons.
First: it’s always better to work with probabilities than with hard predictions (because how else could you tell apart models that predict 0.9 and 0.6 if they both give you the same class?).
Second: it’s much easier to compare your prediction models and build new ones when you depend on only one good metric.
From my experience I can recommend logloss or MSE (mean squared error).
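A minimal sketch of my own (not part of the answer) that combines the two suggestions above – class weights in the model and a single probability-based metric (log loss) for comparison:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic 3-class data with an 80/15/5 class split.
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           weights=[0.8, 0.15, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)   # probabilities, not hard labels
print(log_loss(y_te, proba))      # one number to compare models by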

How to fix the sklearn warnings?
Simply (as yangjie noticed) overwrite the average parameter with one of these values: 'micro' (compute metrics globally), 'macro' (compute the metric for each label and take their unweighted mean), or 'weighted' (same as macro, but weighted by the number of true instances per label).

f1_score(y_test, prediction, average='weighted')

All your warnings come from calling the metric functions with the default average value, 'binary', which is inappropriate for multiclass prediction.
Good luck and have fun with machine learning!

Edit:
I found another answerer’s recommendation to switch to regression approaches (e.g. SVR), with which I cannot agree. As far as I remember, there isn’t even such a thing as multiclass regression. Yes, there is multilabel regression, which is quite different, and yes, in some cases it’s possible to switch between regression and classification (if the classes are somehow ordered), but that is pretty rare.

What I would recommend (within the scope of scikit-learn) is to try other very powerful classification tools: gradient boosting, random forest (my favorite), KNeighbors, and many more.

After that you can calculate the arithmetic or geometric mean of the predictions, and most of the time you’ll get an even better result:

final_prediction = (KNNprediction * RFprediction) ** 0.5
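A slightly fuller sketch of that blending idea (my own illustration, with placeholder data), where the geometric mean is taken over the predicted class probabilities rather than over hard labels:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_classes=3, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=15).fit(X_train, y_train)

# Element-wise geometric mean of the two probability matrices; the columns align
# because both models were fitted on the same label set.
blend = np.sqrt(rf.predict_proba(X_test) * knn.predict_proba(X_test))
final_prediction = rf.classes_[blend.argmax(axis=1)]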

Stanford-tensorflow-tutorials – this repository contains code examples for Stanford courses


Stanford TensorFlow Tutorials

This repository contains code examples for the course CS 20: TensorFlow for Deep Learning Research.
It will be updated as the course progresses.
The detailed syllabus and lecture notes can be found here.
For this course, I use Python 3.6 and TensorFlow 1.4.1.

For the code and notes from the previous year's course, see the 2017 folder and the website https://web.stanford.edu/class/cs20si/2017

For setup instructions and the list of dependencies, see the setup folder of this repository

Nltk – NLTK source

Natural Language Toolkit (NLTK)


NLTK – the Natural Language Toolkit – is a suite of open-source Python modules, data sets, and tutorials supporting research and development in natural language processing. NLTK requires Python version 3.5, 3.6, 3.7, 3.8, or 3.9.

For documentation, please visit nltk.org
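A minimal usage sketch (not part of the README); it assumes the required resources have already been fetched with nltk.download('punkt') and nltk.download('averaged_perceptron_tagger'):

import nltk

text = "NLTK makes it easy to work with human language data in Python."
tokens = nltk.word_tokenize(text)   # split the sentence into word tokens
tagged = nltk.pos_tag(tokens)       # attach part-of-speech tags
print(tagged[:5])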

Contributing

Do you want to contribute to NLTK development? Great! Please read CONTRIBUTING.md for more details.

See also how to contribute to NLTK

Donate

Have you found the toolkit helpful? Please support NLTK development by donating to the project via PayPal, using the link on the NLTK homepage.

Citing

If you publish work that uses NLTK, please cite the NLTK book as follows:

Bird, Steven, Edward Loper and Ewan Klein (2009).
Natural Language Processing with Python.  O'Reilly Media Inc.

Copyright

Copyright (C) 2001-2021 NLTK Project

For license information, see LICENSE.txt

AUTHORS.md contains a list of everyone who has contributed to NLTK

Redistributing

  • NLTK source code is distributed under the Apache 2.0 License
  • NLTK documentation is distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 United States license
  • NLTK corpora are provided under the terms given in the README file for each corpus; all corpora are redistributable and available for non-commercial use
  • NLTK may be freely redistributed, subject to the provisions of these licenses

Jina – a cloud-native neural search framework for any kind of data

Cloud-native neural search framework for any data type

Jina allows you to build deep-learning-powered search-as-a-service in just minutes.

🌌 All data types – index and query any kind of unstructured data at scale: video, image, long/short text, music, source code, PDF, and more

🌩️ Fast and cloud-native – distributed architecture from day one, scalable and cloud-native by design: enjoy containerization, streaming, parallelism, sharding, async scheduling, and HTTP/gRPC/WebSocket protocols

⏱️ Time saver – with the design patterns of this neural search system, going from zero to a production-ready system takes only minutes

🍱 Own your stack – keep end-to-end ownership of your solution's stack and avoid the integration pitfalls of fragmented, multi-vendor, generic legacy tools

Run a quick demo

Install

  • Via PyPI: pip install -U "jina[standard]"
  • Via Docker: docker run jinaai/jina:latest

More installation options (x86/64, arm64, v6, v7; Linux/macOS with Python 3.7/3.8/3.9; Docker users):

  • Minimum (no HTTP, WebSocket or Docker support): pip install jina or docker run jinaai/jina:latest
  • Daemon: pip install "jina[daemon]" or docker run --network=host jinaai/jina:latest-daemon
  • With additional dependencies: pip install "jina[devel]" or docker run jinaai/jina:latest-devel

Version identifiers are explained here. Jina can run on the Windows Subsystem for Linux, and we welcome the community to help us with native Windows support.

Get started

Document, Executor, and Flow are the three fundamental concepts in Jina.

1️⃣ Copy-paste the minimal example below and run it:

💡 Preliminaries: character embedding, pooling, Euclidean distance

import numpy as np
from jina import Document, DocumentArray, Executor, Flow, requests

class CharEmbed(Executor):  # a simple character embedding with mean-pooling
    offset = 32  # letter `a`
    dim = 127 - offset + 1  # last pos reserved for `UNK`
    char_embd = np.eye(dim) * 1  # one-hot embedding for all chars

    @requests
    def foo(self, docs: DocumentArray, **kwargs):
        for d in docs:
            r_emb = [ord(c) - self.offset if self.offset <= ord(c) <= 127 else (self.dim - 1) for c in d.text]
            d.embedding = self.char_embd[r_emb, :].mean(axis=0)  # average pooling

class Indexer(Executor):
    _docs = DocumentArray()  # for storing all documents in memory

    @requests(on='/index')
    def foo(self, docs: DocumentArray, **kwargs):
        self._docs.extend(docs)  # extend stored `docs`

    @requests(on='/search')
    def bar(self, docs: DocumentArray, **kwargs):
        q = np.stack(docs.get_attributes('embedding'))  # get all embeddings from query docs
        d = np.stack(self._docs.get_attributes('embedding'))  # get all embeddings from stored docs
        euclidean_dist = np.linalg.norm(q[:, None, :] - d[None, :, :], axis=-1)  # pairwise euclidean distance
        for dist, query in zip(euclidean_dist, docs):  # add & sort match
            query.matches = [Document(self._docs[int(idx)], copy=True, scores={'euclid': d}) for idx, d in enumerate(dist)]
            query.matches.sort(key=lambda m: m.scores['euclid'].value)  # sort matches by their values

f = Flow(port_expose=12345, protocol='http', cors=True).add(uses=CharEmbed, parallel=2).add(uses=Indexer)  # build a Flow, with 2 parallel CharEmbed, tho unnecessary
with f:
    f.post('/index', (Document(text=t.strip()) for t in open(__file__) if t.strip()))  # index all lines of _this_ file
    f.block()  # block for listening request

2️⃣ Open http://localhost:12345/docs (an extended Swagger UI) in your browser, click the /search tab, and enter:

{"data": [{"text": "@requests(on=something)"}]}

That is, we want to find the lines in the snippet above that are most similar to @request(on=something). Now click the Execute button!

3️⃣ Not a GUI person? Then let's do it in Python! Keep the above server running and start a simple client:

from jina import Client, Document
from jina.types.request import Response


def print_matches(resp: Response):  # the callback function invoked when task is done
    for idx, d in enumerate(resp.docs[0].matches[:3]):  # print top-3 matches
        print(f'[{idx}]{d.scores["euclid"].value:2f}: "{d.text}"')


c = Client(protocol='http', port_expose=12345)  # connect to localhost:12345
c.post('/search', Document(text='request(on=something)'), on_done=print_matches)

which prints the following results:

         Client@1608[S]:connected to the gateway at localhost:12345!
[0]0.168526: "@requests(on='/index')"
[1]0.181676: "@requests(on='/search')"
[2]0.192049: "query.matches = [Document(self._docs[int(idx)], copy=True, score=d) for idx, d in enumerate(dist)]"

😔 Doesn't work? Our bad! Please report it here.

Read the tutorials

Support

Join us

Jina is backed by Jina AI. We are actively hiring full-stack developers and solution engineers to build the next neural search ecosystem in open source.

Contributing

We welcome all kinds of contributions from the open-source community, individuals, and partners. We owe our success to your active involvement.
TextBlob – simple, Pythonic text processing: sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more

Homepage: https://textblob.readthedocs.io/

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

from textblob import TextBlob

text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''

blob = TextBlob(text)
blob.tags           # [('The', 'DT'), ('titular', 'JJ'),
                    #  ('threat', 'NN'), ('of', 'IN'), ...]

blob.noun_phrases   # WordList(['titular threat', 'blob',
                    #            'ultimate movie monster',
                    #            'amoeba-like mass', ...])

for sentence in blob.sentences:
    print(sentence.sentiment.polarity)
# 0.060
# -0.341

TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both.

Features

  • Noun phrase extraction
  • Part-of-speech tagging
  • Sentiment analysis
  • Classification (Naive Bayes, Decision Tree)
  • Tokenization (splitting text into words and sentences)
  • Word and phrase frequencies
  • Parsing
  • n-grams
  • Word inflection (pluralization and singularization) and lemmatization
  • Spelling correction
  • Add new models or languages through extensions
  • WordNet integration
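For instance, two of the features above (spelling correction and word inflection) look roughly like this – a short sketch along the lines of the TextBlob quickstart examples:

from textblob import TextBlob, Word

print(TextBlob("I havv goood speling!").correct())  # spelling correction
print(Word("cat").pluralize())                      # -> "cats"
print(Word("went").lemmatize("v"))                  # lemmatize as a verb -> "go"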

Get it now

$ pip install -U textblob
$ python -m textblob.download_corpora

Examples

For more examples, see the Quickstart guide

Documentation

Full documentation is available at https://textblob.readthedocs.io/

Requirements

  • Python >= 2.7 or >= 3.5

Project links

License

MIT licensed. See the bundled LICENSE file for more details.

Virgilio – your new mentor for Data Science e-learning

What is Virgilio?

Learning and reading on the Internet means wandering through an endless jungle of chaotic information, all the more so in fast-moving, innovative fields.

Have you ever felt overwhelmed when trying to approach Data Science without a real "path" to follow?

Are you tired of clicking "Run", "Run", "Run" in a Jupyter notebook, with the false confidence that the comfort zone of someone else's work gives you?

Have you ever been confused by the several conflicting names for the same algorithm or method, coming from different websites and scattered tutorials?

Virgilio addresses these crucial problems, for free, for everyone.

Enter the new web version of Virgilio!

About

Virgilio is developed and maintained by these awesome people. You can email us at virgilio.datascience (at) gmail.com or join the Discord chat.

Contributing

Awesome! Check the contribution guidelines and get involved in our project!

License

The content is released under the Creative Commons BY-NC-SA 4.0 license, the code under the MIT license. The Virgilio image comes from here.