CrossEncoderTokenizer

class lightning_ir.cross_encoder.tokenizer.CrossEncoderTokenizer(*args, query_length: int = 32, doc_length: int = 512, **kwargs)[source]

Bases: LightningIRTokenizer

__init__(*args, query_length: int = 32, doc_length: int = 512, **kwargs)[source]
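The `query_length` and `doc_length` parameters cap how many tokens of the query and document survive before the two are joined into a single cross-encoder input. A minimal sketch of that behavior — not the lightning_ir implementation; the helper names `truncate` and `join_pair` and the `[CLS]`/`[SEP]` markers are illustrative assumptions:

```python
# Hypothetical sketch: truncate query and document to their respective
# maximum lengths, then concatenate them into one input sequence.

def truncate(tokens: list[str], max_length: int) -> list[str]:
    """Keep at most max_length tokens (illustrative helper)."""
    return tokens[:max_length]

def join_pair(query: list[str], doc: list[str],
              query_length: int = 32, doc_length: int = 512) -> list[str]:
    """Build a single cross-encoder input from a query/document pair."""
    return (["[CLS]"] + truncate(query, query_length)
            + ["[SEP]"] + truncate(doc, doc_length) + ["[SEP]"])

pair = join_pair(["what", "is", "ir"], ["ir", "is", "information", "retrieval"])
# pair == ['[CLS]', 'what', 'is', 'ir', '[SEP]',
#          'ir', 'is', 'information', 'retrieval', '[SEP]']
```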

Methods

__init__(*args[, query_length, doc_length])

expand_queries(queries, num_docs)

from_pretrained(model_name_or_path, *args, ...)
    Loads a pretrained tokenizer.

preprocess(queries, docs, num_docs)

tokenize([queries, docs, num_docs])

truncate(text, max_length)
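The `expand_queries(queries, num_docs)` signature suggests aligning each query with a variable number of documents. A minimal sketch of that expansion, assuming `num_docs` holds the per-query document counts (an assumption about the parameter's meaning, not the library's actual implementation):

```python
def expand_queries(queries: list[str], num_docs: list[int]) -> list[str]:
    # Repeat each query once per document so the expanded query list
    # lines up one-to-one with a flat list of documents.
    return [q for q, n in zip(queries, num_docs) for _ in range(n)]

expanded = expand_queries(["q1", "q2"], [2, 3])
# expanded == ['q1', 'q1', 'q2', 'q2', 'q2']
```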

classmethod from_pretrained(model_name_or_path: str, *args, **kwargs) LightningIRTokenizer

Loads a pretrained tokenizer. Wraps the transformers.PreTrainedTokenizer.from_pretrained method to return a derived LightningIRTokenizer class. See LightningIRTokenizerClassFactory for more details.

>>> # Loading using model class and backbone checkpoint
>>> type(BiEncoderTokenizer.from_pretrained("bert-base-uncased"))
<class 'lightning_ir.base.class_factory.BiEncoderBertTokenizerFast'>
>>> # Loading using base class and backbone checkpoint
>>> type(LightningIRTokenizer.from_pretrained("bert-base-uncased", config=BiEncoderConfig()))
<class 'lightning_ir.base.class_factory.BiEncoderBertTokenizerFast'>
Parameters:

model_name_or_path (str) – Name or path of the pretrained tokenizer

Raises:

ValueError – If called on the abstract class LightningIRTokenizer without passing a config

Returns:

A derived LightningIRTokenizer consisting of a backbone tokenizer and a LightningIRTokenizer mixin

Return type:

LightningIRTokenizer