✨ Using Rubrix with spaCy NER¶
In this tutorial, you’ll learn to log spaCy Named Entity Recognition (NER) predictions. This is useful for evaluating pre-trained models, spotting frequent errors, and improving your pipelines over time.
Introduction¶
In this tutorial we will:
Load the Gutenberg Time dataset from the Hugging Face Hub.
Use a transformer-based spaCy model for detecting entities in this dataset and log the detected entities into a Rubrix dataset. This dataset can be used for exploring the quality of predictions and for creating a new training set, by correcting, adding and validating entities.
Use a smaller spaCy model for detecting entities and log the detected entities into the same Rubrix dataset for comparing its predictions with the previous model.
As a bonus, we will use Rubrix and spaCy on a more challenging dataset: IMDB.
Setup Rubrix¶
If you are new to Rubrix, visit and ⭐ star the Github repo for more materials like this and detailed docs.
If you have not installed and launched Rubrix, check the Setup and Installation guide.
Once installed, you only need to import Rubrix:
[1]:
import rubrix as rb
Install tutorial dependencies¶
In this tutorial, we’ll use the datasets and spaCy libraries, and the en_core_web_trf pretrained English model, a RoBERTa-based spaCy pipeline. If you do not have them installed, run:
[ ]:
%pip install datasets -qqq
%pip install -U spacy -qqq
%pip install protobuf
Our dataset¶
For this tutorial, we’re going to use the Gutenberg Time dataset from the Hugging Face Hub. It contains all explicit time references in a dataset of 52,183 novels whose full text is available via Project Gutenberg. These novel extracts are likely to contain plenty of named entities.
[ ]:
from datasets import load_dataset
dataset = load_dataset("gutenberg_time", split="train")
Let’s take a look at our dataset!
[ ]:
train, test = dataset.train_test_split(test_size=0.002, seed=42).values()
test
Logging spaCy NER entities into Rubrix¶
Using a Transformer-based pipeline¶
Let’s install and load our RoBERTa-based pretrained pipeline and apply it to one of our dataset records:
[ ]:
!python -m spacy download en_core_web_trf
[ ]:
import spacy
nlp = spacy.load("en_core_web_trf")
doc = nlp(dataset[0]["tok_context"])
doc
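Each entity in doc.ents carries a label plus character offsets, and Rubrix’s TokenClassificationRecord expects predictions as (label, start_char, end_char) tuples. It is worth sanity-checking that those offsets slice the original text back to the entity’s surface form. A minimal, self-contained sketch (the text and tuples below are illustrative, not real model output):

```python
# Example text and spaCy-style entity predictions as (label, start, end) tuples.
# These values are made up for illustration; real ones come from doc.ents.
text = "Anne Shirley arrived at Green Gables in June."
entities = [
    ("PERSON", 0, 12),
    ("LOC", 24, 36),
    ("DATE", 40, 44),
]

# Each (start, end) span should slice the expected surface form out of the text
for label, start, end in entities:
    print(label, repr(text[start:end]))
```

If a span prints a truncated or shifted string, the offsets do not match the text you are logging, and the annotations will be misaligned in Rubrix.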
Now let’s apply the nlp pipeline to our dataset records, collecting the tokens and NER entities.
[ ]:
records = []

for record in test:
    # We only need the text of each instance
    text = record["tok_context"]

    # spaCy Doc creation
    doc = nlp(text)

    # Entity annotations
    entities = [
        (ent.label_, ent.start_char, ent.end_char)
        for ent in doc.ents
    ]

    # Pre-tokenized input text
    tokens = [token.text for token in doc]

    # Rubrix TokenClassificationRecord list
    records.append(
        rb.TokenClassificationRecord(
            text=text,
            tokens=tokens,
            prediction=entities,
            prediction_agent="en_core_web_trf",
        )
    )
[ ]:
records[0]
[ ]:
rb.log(records=records, name="gutenberg_spacy_ner")
If you go to the gutenberg_spacy_ner dataset in Rubrix you can explore the predictions of this model:
You can filter records containing specific entity types.
You can see the most frequent “mentions”, or surface forms, for each entity type. Mentions are the string values of a specific entity type; for example, “1 month” can be a mention of a duration entity. This is useful for error analysis, to quickly spot potential issues and problematic entity types.
You can use the free-text search to find records containing specific words.
You could validate, include or reject specific entity annotations to build a new training set.
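The same mention statistics the UI shows can also be computed offline from the (label, start, end) tuples before logging. A small sketch with hard-coded example predictions (the texts and spans below are illustrative):

```python
from collections import Counter

# Illustrative (text, entities) pairs in the same format logged above;
# in practice you would collect these while building your records.
predictions = [
    ("It was June when Holmes left London.",
     [("DATE", 7, 11), ("PERSON", 17, 23), ("GPE", 29, 35)]),
    ("Holmes returned to London in June.",
     [("PERSON", 0, 6), ("GPE", 19, 25), ("DATE", 29, 33)]),
]

# Count surface forms ("mentions") per entity label
mentions = Counter(
    (label, text[start:end])
    for text, entities in predictions
    for label, start, end in entities
)

for (label, mention), count in mentions.most_common():
    print(label, mention, count)
```

A quick scan of the most common mentions per label is often enough to spot systematic errors before opening the UI.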
Using a smaller but more efficient pipeline¶
Now let’s compare these results with a smaller, but more efficient pre-trained model. Let’s first download it:
[ ]:
!python -m spacy download en_core_web_sm
[ ]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(dataset[0]["tok_context"])
[ ]:
records = []  # Creating an empty record list to save all the records

for record in test:
    text = record["tok_context"]  # We only need the text of each instance
    doc = nlp(text)  # spaCy Doc creation

    # Entity annotations
    entities = [
        (ent.label_, ent.start_char, ent.end_char)
        for ent in doc.ents
    ]

    # Pre-tokenized input text
    tokens = [token.text for token in doc]

    # Rubrix TokenClassificationRecord list
    records.append(
        rb.TokenClassificationRecord(
            text=text,
            tokens=tokens,
            prediction=entities,
            prediction_agent="en_core_web_sm",
        )
    )
[ ]:
rb.log(records=records, name="gutenberg_spacy_ner")
Exploring and comparing the en_core_web_sm and en_core_web_trf models¶
If you go to your gutenberg_spacy_ner dataset, you can explore and compare the results of both models.
You can use the predicted by filter, which comes from the prediction_agent parameter of your TokenClassificationRecord, to only see predictions of a specific model:
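Besides eyeballing both models in the UI, you can quantify how much they agree on a record locally. A minimal sketch (the function and prediction lists below are illustrative, not part of the Rubrix API) that measures the fraction of one model’s spans exactly reproduced by the other:

```python
def span_agreement(preds_a, preds_b):
    """Fraction of model A's (label, start, end) spans exactly matched by model B.
    A deliberately simple measure; a real comparison might also credit partial overlaps."""
    if not preds_a:
        return 1.0
    matched = sum(1 for span in preds_a if span in set(preds_b))
    return matched / len(preds_a)

# Illustrative predictions from the two pipelines on the same text
trf_preds = [("PERSON", 0, 12), ("LOC", 24, 36), ("DATE", 40, 44)]
sm_preds = [("PERSON", 0, 12), ("DATE", 40, 44)]

print(span_agreement(trf_preds, sm_preds))  # 2 of 3 spans match
```

Records with low agreement are good candidates to inspect first, since at least one of the models must be wrong on them.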
Extra: Explore the IMDB dataset¶
So far both spaCy pretrained models seem to work pretty well. Let’s try with a more challenging dataset, which is more dissimilar to the original training data these models have been trained on.
[ ]:
imdb = load_dataset("imdb", split="test[0:5000]")
[ ]:
records = []

for record in imdb:
    # We only need the text of each instance
    text = record["text"]

    # spaCy Doc creation
    doc = nlp(text)

    # Entity annotations
    entities = [
        (ent.label_, ent.start_char, ent.end_char)
        for ent in doc.ents
    ]

    # Pre-tokenized input text
    tokens = [token.text for token in doc]

    # Rubrix TokenClassificationRecord list
    records.append(
        rb.TokenClassificationRecord(
            text=text,
            tokens=tokens,
            prediction=entities,
            prediction_agent="en_core_web_sm",
        )
    )
[ ]:
rb.log(records=records, name="imdb_spacy_ner")
Exploring this dataset highlights the need for fine-tuning on specific domains.
For example, if we check the most frequent mentions for the Person label, we find two highly frequent misclassified entities: gore (the film genre) and Oscar (the prize). You can check each and every example yourself by using the filters and the search box.
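One cheap heuristic for surfacing such errors offline: Person mentions that are entirely lowercase, like “gore”, are often not names at all. A sketch with an illustrative, made-up list of predicted Person mentions:

```python
# Illustrative Person mentions collected from predictions; lowercase surface
# forms are often misclassifications (e.g. "gore" the film genre).
person_mentions = ["Hitchcock", "gore", "Oscar", "gore", "De Niro", "oscar"]

suspicious = [m for m in person_mentions if m.islower()]
print(suspicious)  # ['gore', 'gore', 'oscar']
```

Heuristics like this will not catch capitalized errors such as “Oscar”, which is why the interactive exploration in Rubrix remains useful.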
Summary¶
In this tutorial, we have learnt to log and explore different spaCy NER models with Rubrix. Using what we’ve learnt here you can:
Build custom dashboards using Kibana to monitor and visualize spaCy models.
Build training sets using pre-trained spaCy models.