We evaluate CamemBERT on four downstream tasks for French: part-of-speech (POS) tagging, dependency parsing, named entity recognition (NER) and natural language inference (NLI). It improves the state of the art on most of these tasks over previous monolingual and multilingual approaches, which confirms the effectiveness of large pretrained language models for French.
We release CamemBERT, a Tasty French Language Model. CamemBERT is trained on 138GB of French text. It establishes a new state of the art in POS tagging, dependency parsing and NER, and achieves strong results in NLI. Bon appétit !
import torch
camembert = torch.hub.load('pytorch/fairseq', 'camembert.v0')
camembert.eval()  # disable dropout (or leave in train mode to finetune)
# Download camembert model
wget https://dl.fbaipublicfiles.com/fairseq/models/camembert.v0.tar.gz
tar -xzvf camembert.v0.tar.gz

# Load the model in fairseq
from fairseq.models.roberta import CamembertModel
camembert = CamembertModel.from_pretrained('/path/to/camembert.v0')
camembert.eval()  # disable dropout (or leave in train mode to finetune)
masked_line = 'Le camembert est <mask> :)'
camembert.fill_mask(masked_line, topk=3)
# [('Le camembert est délicieux :)', 0.4909118115901947, ' délicieux'),
#  ('Le camembert est excellent :)', 0.10556942224502563, ' excellent'),
#  ('Le camembert est succulent :)', 0.03453322499990463, ' succulent')]
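Each prediction returned by `fill_mask` in the example above is a tuple of the filled-in sentence, its score, and the predicted token (note the leading space on the token). As a minimal sketch of how such output might be post-processed, the helper below keeps only the tokens whose score clears a threshold. The name `confident_tokens` and the `min_score` default are hypothetical and not part of fairseq; the sample data is copied from the output shown above, so the snippet runs without loading the model.

```python
from typing import List, Tuple

# Shape of one prediction, mirroring the tuples shown above:
# (filled-in sentence, score, predicted token).
Prediction = Tuple[str, float, str]


def confident_tokens(predictions: List[Prediction], min_score: float = 0.1) -> List[str]:
    """Return predicted tokens scoring at least min_score, best first.

    Hypothetical helper for illustration; not a fairseq function.
    """
    ranked = sorted(predictions, key=lambda p: p[1], reverse=True)
    # Tokens carry a leading space in the output above, so strip it.
    return [token.strip() for _, score, token in ranked if score >= min_score]


# Sample output copied from the fill_mask example above.
predictions = [
    ('Le camembert est délicieux :)', 0.4909118115901947, ' délicieux'),
    ('Le camembert est excellent :)', 0.10556942224502563, ' excellent'),
    ('Le camembert est succulent :)', 0.03453322499990463, ' succulent'),
]

print(confident_tokens(predictions))  # ['délicieux', 'excellent']
```

With a loaded model, the list literal could be replaced by the return value of `camembert.fill_mask(masked_line, topk=...)`, assuming it keeps the tuple layout shown above.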
# Extract the last layer's features
line = "J'aime le camembert!"
tokens = camembert.encode(line)
last_layer_features = camembert.extract_features(tokens)
assert last_layer_features.size() == torch.Size([1, 10, 768])

# Extract all layers' features (layer 0 is the embedding layer)
all_layers = camembert.extract_features(tokens, return_all_hiddens=True)
assert len(all_layers) == 13
assert torch.all(all_layers[-1] == last_layer_features)