I'm using both spaCy and Stanza to identify named entities in very short strings (brand names and business names):
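import spacy
import stanza

# BUILDING THE MODELS
#-----stanza
sen = stanza.Pipeline("en")
smlp = stanza.MultilingualPipeline()
#----spacy
spl = spacy.load("en_core_web_lg")   # installed version: en_core_web_lg-3.7.1
spt = spacy.load("en_core_web_trf")  # installed version: en_core_web_trf-3.7.3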
# TESTING THE MODELS
name = 'The Port of Peri Peri'
print(name)
# spacy
print('\n SPACY------------------------------------------------------------------')
print('spacy spl')
doc = spl(name)
for token in doc:
    print(token.text, token.is_oov, token.shape_, token.tag_, token.pos_, token.dep_, token.ent_type_, token.ent_iob_)
print('-----------------')
print('spacy trf')
doc = spt(name)
for token in doc:
    print(token.text, token.is_oov, token.shape_, token.tag_, token.pos_, token.dep_, token.ent_type_, token.ent_iob_)
#stanza
print('\n STANZA----------------------------------------------------------------')
print('stanza sen')
doc = sen(name)
for sent in doc.sentences:
    for token in sent.tokens:
        for word in token.words:
            print(word.text, word.xpos, word.upos, word.deprel, token.ner)
print('-----------------')
print('stanza smlp')
doc = smlp(name)
for sent in doc.sentences:
    for token in sent.tokens:
        for word in token.words:
            print(word.text, word.xpos, word.upos, word.deprel, token.ner)
print('-----------------')
Output:
The Port of Peri Peri
SPACY------------------------------------------------------------------
spacy spl
The False Xxx DT DET det ORG B
Port False Xxxx NNP PROPN ROOT ORG I
of False xx IN ADP prep ORG I
Peri False Xxxx NNP PROPN compound ORG I
Peri False Xxxx NNP PROPN pobj ORG I
-----------------
spacy trf
The True Xxx DT DET det FAC B
Port True Xxxx NNP PROPN ROOT FAC I
of True xx IN ADP prep FAC I
Peri True Xxxx NNP PROPN compound FAC I
Peri True Xxxx NNP PROPN pobj FAC I
STANZA----------------------------------------------------------------
stanza sen
The DT DET det B-PERSON
Port NNP PROPN root I-PERSON
of IN ADP case I-PERSON
Peri NNP PROPN nmod I-PERSON
Peri NNP PROPN nmod E-PERSON
-----------------
stanza smlp
The DT DET det B-PERSON
Port NNP PROPN root I-PERSON
of IN ADP case I-PERSON
Peri NNP PROPN nmod I-PERSON
Peri NNP PROPN nmod E-PERSON
-----------------
Looking at the ends of the output lines, spaCy tags the name as ORG (organization) or FAC (facility), whereas both Stanza pipelines tag it as PERSON, which is ridiculous: person names do not usually start with "The" or contain "of". Are there any improvements I can make to the Stanza models, or is this as good as they get?
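For comparing the taggers at the entity level rather than token by token, here is a minimal sketch of how the per-token tags could be collapsed into spans. `collapse_tags` is a hypothetical helper of my own, not part of spaCy or Stanza, and it assumes tags in the `B-`/`I-`/`E-`/`S-`/`O` style that Stanza prints above:

```python
def collapse_tags(tokens, tags):
    """Merge token-level BIO/BIOES tags into (text, label) entity spans.

    Hypothetical helper for inspection only; not a spaCy or Stanza API.
    """
    spans = []
    current_tokens, current_label = [], None

    def flush():
        # Store the span collected so far, if any, and reset the state.
        nonlocal current_tokens, current_label
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix in ("B", "S") or (prefix in ("I", "E") and label != current_label):
            # A new entity starts here (also covers an I-/E- tag with no opening B-).
            flush()
            current_tokens, current_label = [token], label
        elif prefix in ("I", "E"):
            # Continuation of the current entity.
            current_tokens.append(token)
        else:
            # "O" or an unexpected tag: close whatever was open.
            flush()
            continue
        if prefix in ("E", "S"):
            # Explicit end marker in the BIOES scheme.
            flush()

    flush()
    return spans


# Tokens and tags copied from the 'stanza sen' output above.
tokens = ["The", "Port", "of", "Peri", "Peri"]
tags = ["B-PERSON", "I-PERSON", "I-PERSON", "I-PERSON", "E-PERSON"]
print(collapse_tags(tokens, tags))  # [('The Port of Peri Peri', 'PERSON')]
```

For the spaCy rows, the same helper should work if each tag is built as `f"{token.ent_iob_}-{token.ent_type_}"` whenever `token.ent_iob_` is not `"O"`. This only makes the disagreement easier to read; it does not change what either model predicts.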