
What Is a Tag

Updated: 2023-03-12 23:52:42


Posted on March 12, 2023 (author: 愚公移山英文版)

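POS Tagging

POS tagging is short for part-of-speech tagging; parts of speech are also called word classes or lexical categories. The names vary, but the task is the same: labeling each word with its part of speech.

NLTK's off-the-shelf tools make it easy to POS-tag a piece of text:

>>> import nltk
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]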

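The API documentation describes this interface as follows:

Use NLTK's currently recommended part of speech tagger to tag the given list of tokens.

I checked the code: pos_tag loads the standard Treebank POS tagger, which uses the following tags: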

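CC    Coordinating conjunction
CD    Cardinal number
DT    Determiner
EX    Existential there
FW    Foreign word
IN    Preposition or subordinating conjunction
JJ    Adjective
JJR   Adjective, comparative
JJS   Adjective, superlative
LS    List item marker
MD    Modal
NN    Noun, singular or mass
NNS   Noun, plural
NNP   Proper noun, singular
NNPS  Proper noun, plural
PDT   Predeterminer
POS   Possessive ending
PRP   Personal pronoun
PRP$  Possessive pronoun
RB    Adverb
RBR   Adverb, comparative
RBS   Adverb, superlative
RP    Particle
SYM   Symbol
TO    to
UH    Interjection
VB    Verb, base form
VBD   Verb, past tense
VBG   Verb, gerund or present participle
VBN   Verb, past participle
VBP   Verb, non-3rd person singular present
VBZ   Verb, 3rd person singular present
WDT   Wh-determiner
WP    Wh-pronoun
WP$   Possessive wh-pronoun
WRB   Wh-adverb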

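With these tag abbreviations explained, the output of the interface above is easy to read.

Some of the corpora that ship with NLTK are already annotated with part-of-speech tags and can be used as training data. Every tagged corpus provides a tagged_words() method.

>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]
>>> # simplify_tags is the NLTK 2 API; NLTK 3 replaced it with tagset='universal', which uses different tag names
>>> nltk.corpus.brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'N'), ('County', 'N'), ...]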

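Automatic Tagging

The following sections cover several automatic tagging methods. Because a word's tag depends on its context, tagging works sentence by sentence rather than word by word. If the input were a flat stream of words, the last word of one sentence would influence the tag of the first word of the next, which makes no sense. Working with sentences avoids this error: the influence of context never crosses a sentence boundary.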
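The effect of sentence boundaries can be sketched in plain Python. This is a minimal illustration and not NLTK code: `bigram_contexts` and `toy_sents` are hypothetical names, and the toy sentences were tagged by hand.

```python
def bigram_contexts(tagged_sents):
    """Yield (previous_tag, word, tag) triples, resetting the context per sentence."""
    for sent in tagged_sents:
        prev_tag = None  # a sentence-initial word has no preceding tag
        for word, tag in sent:
            yield prev_tag, word, tag
            prev_tag = tag


# Hypothetical toy data, tagged by hand for illustration only.
toy_sents = [
    [("I", "PRP"), ("run", "VBP"), (".", ".")],
    [("Dogs", "NNS"), ("bark", "VBP"), (".", ".")],
]

contexts = list(bigram_contexts(toy_sents))
for context in contexts:
    print(context)
```

The first word of the second sentence gets `None` as its context, not the tag of the period that ended the first sentence.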

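We use the Brown corpus as the example:

>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')

This retrieves the tagged sentences and the untagged sentences separately. The tagged ones serve as the reference for training and evaluating a tagger, and the untagged ones as the input to tag.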

The Default Tagger

The simplest possible tagger assigns the same tag to each token.

>>> raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
>>> tokens = nltk.word_tokenize(raw)
>>> default_tagger = nltk.DefaultTagger('NN')
>>> default_tagger.tag(tokens)
[('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN'),
('eggs', 'NN'), ('and', 'NN'), ('ham', 'NN'), (',', 'NN'), ('I', 'NN'),
('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('them', 'NN'), ('Sam', 'NN'),
('I', 'NN'), ('am', 'NN'), ('!', 'NN')]

This tagger really is that simple: it labels every token with the one tag you give it. On its own it looks pointless, but it is useful as a backoff.

The Regular Expression Tagger

The regular expression tagger assigns tags to tokens on the basis of matching patterns.

>>> patterns = [
...     (r'.*ing$', 'VBG'),                # gerunds
...     (r'.*ed$', 'VBD'),                 # simple past
...     (r'.*es$', 'VBZ'),                 # 3rd singular present
...     (r'.*ould$', 'MD'),                # modals
...     (r'.*\'s$', 'NN$'),                # possessive nouns
...     (r'.*s$', 'NNS'),                  # plural nouns
...     (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
...     (r'.*', 'NN')                      # nouns (default)
... ]
>>> regexp_tagger = nltk.RegexpTagger(patterns)
>>> regexp_tagger.tag(brown_sents[3])
[('``', 'NN'), ('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), ('handful', 'NN'),
('of', 'NN'), ('such', 'NN'), ('reports', 'NNS'), ('was', 'NNS'), ('received', 'VBD'),
("''", 'NN'), (',', 'NN'), ('the', 'NN'), ('jury', 'NN'), ('said', 'NN'), (',', 'NN'),
('``', 'NN'), ('considering', 'VBG'), ('the', 'NN'), ('widespread', 'NN'), ...]

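This tagger is a step up: you define regular-expression rules, a token that matches a rule gets the corresponding tag, and everything else still falls through to the default.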
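How the patterns are applied can be sketched with the standard `re` module. This is an illustration of the first-match-wins idea under stated assumptions, not NLTK's implementation; `regexp_tag` and `PATTERNS` are hypothetical names.

```python
import re

# (regex, tag) pairs, tried in order; the first pattern that matches wins.
PATTERNS = [
    (r".*ing$", "VBG"),                # gerunds
    (r".*ed$", "VBD"),                 # simple past
    (r".*es$", "VBZ"),                 # 3rd singular present
    (r".*ould$", "MD"),                # modals
    (r".*'s$", "NN$"),                 # possessive nouns
    (r".*s$", "NNS"),                  # plural nouns
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),  # cardinal numbers
    (r".*", "NN"),                     # nouns (default)
]


def regexp_tag(tokens, patterns=PATTERNS):
    """Tag each token with the tag of the first pattern it matches."""
    compiled = [(re.compile(pattern), tag) for pattern, tag in patterns]
    tagged = []
    for token in tokens:
        for regex, tag in compiled:
            if regex.match(token):
                tagged.append((token, tag))
                break
    return tagged


result = regexp_tag(["reports", "was", "received", "considering", "3.14", "jury"])
print(result)
```

Because only the spelling is inspected, "was" ends in "s" and is tagged NNS, the same mistake visible in the Brown output above. Pattern order matters too: moving the catch-all `.*` to the top would tag everything NN.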

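The Lookup Tagger

Let's find the hundred most frequent words and store their most likely tag.

This method starts to be of practical use: it counts the most frequent words in the training corpus, finds the most likely tag for each of them, and tags with that.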

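>>> fd = nltk.FreqDist(brown.words(categories='news'))
>>> cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
>>> most_freq_words = [word for word, _ in fd.most_common(100)]  # the 100 most frequent words
>>> likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags)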

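This code takes the 100 most frequent words from the corpus, finds the tag each of them occurs with most often, and builds the likely_tags dictionary from the result. The dictionary is then passed to UnigramTagger as its model.

UnigramTagger is a one-gram tagger, that is, a simple tagger that ignores the surrounding context.

The biggest problem with this method: tags are only specified for the top 100 words, so what happens to all the other words? This is where the default tagger from earlier becomes useful:

>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags, backoff=nltk.DefaultTagger('NN'))

This partly solves the problem: words the model does not know are tagged by the default tagger.

The accuracy of this method depends entirely on the size of the model. With only the top 100 words, accuracy may be low, but it keeps improving as more words are included.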
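The relationship between model size and accuracy can be sketched in plain Python. This is a toy illustration under stated assumptions, not NLTK's implementation: `build_lookup_model`, `lookup_tag`, `accuracy`, and the hand-tagged `toy` corpus are all hypothetical, and accuracy is measured on the training data itself only to keep the example short.

```python
from collections import Counter, defaultdict


def build_lookup_model(tagged_words, size):
    """Map each of the `size` most frequent words to its most frequent tag."""
    word_counts = Counter(word for word, _ in tagged_words)
    tag_counts = defaultdict(Counter)
    for word, tag in tagged_words:
        tag_counts[word][tag] += 1
    return {
        word: tag_counts[word].most_common(1)[0][0]
        for word, _ in word_counts.most_common(size)
    }


def lookup_tag(tokens, model, default="NN"):
    """Look each token up in the model; back off to `default` for unknown words."""
    return [(token, model.get(token, default)) for token in tokens]


def accuracy(model, gold):
    """Fraction of gold (word, tag) pairs that the lookup tagger reproduces."""
    predicted = lookup_tag([word for word, _ in gold], model)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical toy corpus, tagged by hand for illustration only.
toy = [
    ("the", "AT"), ("dog", "NN"), ("saw", "VBD"), ("the", "AT"), ("cat", "NN"),
    ("the", "AT"), ("cat", "NN"), ("ran", "VBD"), (".", "."), (".", "."),
]

scores = {size: accuracy(build_lookup_model(toy, size), toy) for size in (0, 1, 3, 6)}
print(scores)
```

With an empty model every token falls back to NN; each additional word in the model can only add correct lookups on this toy data, so the score rises with the model size.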

N-Gram Tagging

Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token.

The lookup tagger above is built on the unigram tagger. Here is the more general way to use a unigram tagger:

>>> import nltk
>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')
>>> unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)  # training
>>> unigram_tagger.tag(brown_sents[2007])
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'),
('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'),
(',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'),
('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'),
('direct', 'JJ'), ('.', '.')]

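As shown, a unigram tagger can be trained on an already-tagged corpus.

An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens.

In other words, an n-gram tagger takes context into account: it uses the tags of the previous n-1 words when tagging the current word.

Take the bigram tagger, a special case of the n-gram tagger, as an example: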

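>>> train_sents = brown_tagged_sents[:int(len(brown_tagged_sents) * 0.9)]  # assumed split: first 90% for training
>>> bigram_tagger = nltk.BigramTagger(train_sents)
>>> bigram_tagger.tag(brown_sents[2007])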

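There is a catch: if the context of a word in the sentence being tagged never occurred in the training set, the tagger cannot tag that word, even if the word itself was seen in training. Backoff is again the way to solve this: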

>>> t0 = nltk.DefaultTagger('NN')
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)
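The backoff chain can be sketched without NLTK. The following is a simplified illustration and not the library's implementation: `train_unigram`, `train_bigram`, `tag_with_backoff`, and the hand-tagged `toy_train` sentences are hypothetical.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Most frequent tag for each word."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def train_bigram(tagged_sents):
    """Most frequent tag for each (previous tag, word) context."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = None  # context is reset at every sentence start
        for word, tag in sent:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return {context: c.most_common(1)[0][0] for context, c in counts.items()}


def tag_with_backoff(tokens, bigram, unigram, default="NN"):
    """Try the bigram model, then the unigram model, then the default tag."""
    tagged, prev_tag = [], None
    for token in tokens:
        tag = bigram.get((prev_tag, token))
        if tag is None:
            tag = unigram.get(token, default)
        tagged.append((token, tag))
        prev_tag = tag
    return tagged


# Hypothetical toy training data, tagged by hand for illustration only.
toy_train = [
    [("the", "AT"), ("run", "NN"), ("ended", "VBD")],
    [("they", "PPSS"), ("run", "VB"), ("fast", "RB")],
    [("the", "AT"), ("run", "NN")],
]
bigram_model = train_bigram(toy_train)
unigram_model = train_unigram(toy_train)

seen = tag_with_backoff(["they", "run"], bigram_model, unigram_model)
unseen = tag_with_backoff(["dogs", "run", "fast"], bigram_model, unigram_model)
print(seen)
print(unseen)
```

In the second sentence "dogs" is unknown to both models and gets the default tag, while "run" and "fast" occur in contexts the bigram model never saw and are resolved by the unigram model.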

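Transformation-Based Tagging

N-gram taggers have two problems: the model takes up a fairly large amount of space, and the context only includes the tags of the preceding words, never the words themselves.

The tagger introduced here handles both problems better. It stores rules instead of a model, which saves a great deal of space, and its rules are not limited to tags but can also look at the words themselves.

Brill tagging is a kind of transformation-based learning. The general idea is very simple: guess the tag of each word, then go back and fix the mistakes.

The following example shows how Brill tagging works:

(1) replace NN with VB when the previous word is TO;

(2) replace TO with IN when the next tag is NNS.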

Phrase   to  increase  grants  to  states  for  vocational  rehabilitation
Unigram  TO  NN        NNS     TO  NNS     IN   JJ          NN
Rule 1       VB
Rule 2                         IN
Output   TO  VB        NNS     IN  NNS     IN   JJ          NN
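The table can be reproduced with a few lines of plain Python. This is a hand-written sketch of the two rules only; `rule1` and `rule2` are hypothetical helper names, and a real Brill tagger learns its rules instead of hard-coding them.

```python
words = "to increase grants to states for vocational rehabilitation".split()
unigram_tags = ["TO", "NN", "NNS", "TO", "NNS", "IN", "JJ", "NN"]


def rule1(tags):
    """(1) Replace NN with VB when the previous word is tagged TO."""
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == "NN" and tags[i - 1] == "TO":
            out[i] = "VB"
    return out


def rule2(tags):
    """(2) Replace TO with IN when the next tag is NNS."""
    out = list(tags)
    for i in range(len(tags) - 1):
        if tags[i] == "TO" and tags[i + 1] == "NNS":
            out[i] = "IN"
    return out


# Each rule checks its condition against the tags as they were before that rule ran.
after_rule1 = rule1(unigram_tags)
output = rule2(after_rule1)
print(list(zip(words, output)))
```

Rule 1 only changes "increase"; the final "rehabilitation" stays NN because its preceding tag is JJ. Rule 2 only changes the second "to", whose next tag is NNS.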

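The first step tags every word with a unigram tagger; many of these tags may be wrong.

The rules then correct the tags that the first step guessed wrong, which gives a more accurate final tagging.

How are these rules generated? Automatically, during the training phase. Each rule follows the template "replace T1 with T2 in the context C".

During its training phase, the tagger guesses values for T1, T2, and C, to create thousands of candidate rules. Each rule is scored according to its net benefit: the number of incorrect tags that it corrects, less the number of correct tags it incorrectly modifies.

In other words, training first creates thousands of candidate rules. These can be produced by simple counting, so some of them will be inaccurate. Each rule is then used to fix mistakes, and the result is compared with the correct tags: the number of tags the rule fixes minus the number it breaks is its score. High-scoring rules are kept and low-scoring rules are discarded. Here are some example rules: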

NN -> VB if the tag of the preceding word is 'TO'
NN -> VBD if the tag of the following word is 'DT'
NN -> VBD if the tag of the preceding word is 'NNS'
NN -> NNP if the tag of words i-2...i-1 is '-NONE-'
NN -> NNP if the tag of the following word is 'NNP'
NN -> NNP if the text of words i-2...i-1 is 'like'
NN -> VBN if the text of the following word is '*-1'
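The net-benefit score can be sketched as follows. This is a minimal illustration under stated assumptions: rules are limited to the single template "replace T1 with T2 when the preceding tag is C", the names `apply_rule` and `score_rule` are hypothetical, and the tag sequences are made up for the example.

```python
def apply_rule(tags, rule):
    """Apply (from_tag, to_tag, prev_tag): rewrite from_tag after prev_tag."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out


def score_rule(current, gold, rule):
    """Net benefit: wrong tags the rule corrects minus correct tags it breaks."""
    proposed = apply_rule(current, rule)
    fixed = sum(c != g and p == g for c, p, g in zip(current, proposed, gold))
    broken = sum(c == g and p != g for c, p, g in zip(current, proposed, gold))
    return fixed - broken


# Made-up initial guesses and reference tags for one short sequence.
current = ["TO", "NN", "NNS", "TO", "NN", "TO", "NN"]
gold = ["TO", "VB", "NNS", "TO", "VB", "TO", "NN"]

candidates = [("NN", "VB", "TO"), ("NNS", "VBZ", "NN")]
scores = {rule: score_rule(current, gold, rule) for rule in candidates}
best = max(candidates, key=scores.get)
print(scores, best)
```

The first candidate fixes two tags and breaks one, for a score of 1; the second fixes nothing and breaks one, for a score of -1, so it would be discarded.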

Published: 2023-03-12 23:52:40

Article link: https://www.wtabcd.cn/zhishi/a/167863636126939.html
