BiEncoderTokenizer

class lightning_ir.bi_encoder.tokenizer.BiEncoderTokenizer(*args, query_token: str = '[QUE]', doc_token: str = '[DOC]', query_expansion: bool = False, query_length: int = 32, attend_to_query_expanded_tokens: bool = False, doc_expansion: bool = False, doc_length: int = 512, attend_to_doc_expanded_tokens: bool = False, add_marker_tokens: bool = True, **kwargs)

Bases: LightningIRTokenizer

__init__(*args, query_token: str = '[QUE]', doc_token: str = '[DOC]', query_expansion: bool = False, query_length: int = 32, attend_to_query_expanded_tokens: bool = False, doc_expansion: bool = False, doc_length: int = 512, attend_to_doc_expanded_tokens: bool = False, add_marker_tokens: bool = True, **kwargs)
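The signature above lists the constructor parameters without describing them. The following toy sketch shows one plausible reading, assuming ColBERT-style semantics: a marker token is placed after [CLS], the sequence is truncated to the configured length, and expansion pads with [MASK] tokens that are attended to only if the corresponding attend_to_* flag is set. The function sketch_encode, its arguments, and the [CLS]/[SEP]/[MASK] strings are hypothetical illustrations and are not part of lightning_ir; the real tokenizer may behave differently.

```python
from typing import List, Sequence, Tuple


def sketch_encode(
    tokens: Sequence[str],
    marker: str,
    max_length: int,
    expansion: bool,
    attend_to_expanded: bool,
    add_marker: bool = True,
    mask: str = "[MASK]",
) -> Tuple[List[str], List[int]]:
    """Toy model of marker insertion and expansion (assumed semantics)."""
    # Assumption: the marker token directly follows [CLS].
    sequence = ["[CLS]"] + ([marker] if add_marker else []) + list(tokens)
    # Truncate so that the closing [SEP] still fits within max_length.
    sequence = sequence[: max_length - 1] + ["[SEP]"]
    attention = [1] * len(sequence)
    if expansion:
        # Assumption: expansion pads with mask tokens up to max_length.
        pad = max_length - len(sequence)
        sequence += [mask] * pad
        attention += [1 if attend_to_expanded else 0] * pad
    return sequence, attention


# Hypothetical query with the default marker '[QUE]' and a short length of 8.
query_tokens, query_attention = sketch_encode(
    ["what", "is", "ir"], "[QUE]", 8, expansion=True, attend_to_expanded=False
)
print(query_tokens)
print(query_attention)
```

Under these assumptions, query_expansion and attend_to_query_expanded_tokens are independent switches: the first controls whether the padding exists, the second whether the model may attend to it.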

Methods

__init__(*args[, query_token, doc_token, ...])

from_pretrained(model_name_or_path, *args, ...)

Loads a pretrained tokenizer.

tokenize([queries, docs])

tokenize_doc(docs, *args, **kwargs)

tokenize_query(queries, *args, **kwargs)

Attributes

doc_token

doc_token_id

query_token

query_token_id

config_class

alias of BiEncoderConfig

classmethod from_pretrained(model_name_or_path: str, *args, **kwargs) → LightningIRTokenizer

Loads a pretrained tokenizer. Wraps the transformers.PreTrainedTokenizer.from_pretrained method to return a derived LightningIRTokenizer class. See LightningIRTokenizerClassFactory for more details.

>>> # Loading using model class and backbone checkpoint
>>> type(BiEncoderTokenizer.from_pretrained("bert-base-uncased"))
...
<class 'lightning_ir.base.class_factory.BiEncoderBertTokenizerFast'>
>>> # Loading using base class and backbone checkpoint
>>> type(LightningIRTokenizer.from_pretrained("bert-base-uncased", config=BiEncoderConfig()))
...
<class 'lightning_ir.base.class_factory.BiEncoderBertTokenizerFast'>
Parameters:

model_name_or_path (str) – Name or path of the pretrained tokenizer

Raises:

ValueError – If called on the abstract class LightningIRTokenizer and no config is passed

Returns:

A derived LightningIRTokenizer consisting of a backbone tokenizer and a LightningIRTokenizer mixin

Return type:

LightningIRTokenizer
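The idea of deriving a class from a backbone tokenizer plus a mixin can be sketched with plain Python. This is a minimal illustration under stated assumptions, not the LightningIRTokenizerClassFactory implementation: every class below is a stand-in that only shares a name with the real one, and the lookup tables MIXIN_FOR_CONFIG and BACKBONE_FOR_CHECKPOINT are hypothetical.

```python
class BertTokenizerFast:
    """Stand-in for a backbone tokenizer (not the transformers class)."""

    def __init__(self, name_or_path: str) -> None:
        self.name_or_path = name_or_path


class BiEncoderConfig:
    """Stand-in config; its type selects the mixin."""


class LightningIRTokenizer:
    """Stand-in abstract mixin."""

    @classmethod
    def from_pretrained(cls, model_name_or_path: str, *args, config=None, **kwargs):
        if cls is LightningIRTokenizer:
            # The abstract class cannot know which mixin to use without a config.
            if config is None:
                raise ValueError("abstract class needs a config to pick a mixin")
            mixin = MIXIN_FOR_CONFIG[type(config)]
        else:
            mixin = cls
        backbone = BACKBONE_FOR_CHECKPOINT[model_name_or_path]
        # Build e.g. "BiEncoder" + "BertTokenizerFast" as the derived class name.
        name = mixin.__name__[: -len("Tokenizer")] + backbone.__name__
        derived = type(name, (mixin, backbone), {})
        return derived(model_name_or_path)


class BiEncoderTokenizer(LightningIRTokenizer):
    """Stand-in bi-encoder mixin."""


# Hypothetical lookup tables standing in for config and checkpoint resolution.
MIXIN_FOR_CONFIG = {BiEncoderConfig: BiEncoderTokenizer}
BACKBONE_FOR_CHECKPOINT = {"bert-base-uncased": BertTokenizerFast}

tokenizer = BiEncoderTokenizer.from_pretrained("bert-base-uncased")
print(type(tokenizer).__name__)  # prints "BiEncoderBertTokenizerFast"
```

In this sketch the derived object is an instance of both the mixin and the backbone, which mirrors the description of the return value above; calling the stand-in LightningIRTokenizer.from_pretrained without a config raises ValueError.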