BiEncoderTokenizer
- class lightning_ir.bi_encoder.tokenizer.BiEncoderTokenizer(*args, query_token: str = '[QUE]', doc_token: str = '[DOC]', query_expansion: bool = False, query_length: int = 32, attend_to_query_expanded_tokens: bool = False, doc_expansion: bool = False, doc_length: int = 512, attend_to_doc_expanded_tokens: bool = False, add_marker_tokens: bool = True, **kwargs)[source]
Bases: LightningIRTokenizer
- __init__(*args, query_token: str = '[QUE]', doc_token: str = '[DOC]', query_expansion: bool = False, query_length: int = 32, attend_to_query_expanded_tokens: bool = False, doc_expansion: bool = False, doc_length: int = 512, attend_to_doc_expanded_tokens: bool = False, add_marker_tokens: bool = True, **kwargs)[source]
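The signature above does not describe what the marker and expansion parameters do. The following toy sketch shows one plausible reading, assuming ColBERT-style conventions: the marker token is placed after `[CLS]`, and query expansion pads the sequence with `[MASK]` tokens up to `query_length`. The function `sketch_tokenize_query` and its whitespace-level "tokens" are hypothetical and not part of lightning_ir; the library's actual behavior may differ.

```python
from typing import List, Tuple


def sketch_tokenize_query(
    words: List[str],
    query_token: str = "[QUE]",
    query_expansion: bool = False,
    query_length: int = 32,
    attend_to_query_expanded_tokens: bool = False,
    add_marker_tokens: bool = True,
) -> Tuple[List[str], List[int]]:
    """Toy illustration only; NOT the Lightning IR implementation."""
    tokens = ["[CLS]"]
    if add_marker_tokens:
        # Assumption: the marker token directly follows [CLS].
        tokens.append(query_token)
    tokens.extend(words)
    tokens.append("[SEP]")
    # Naive truncation to the maximum length (may drop the trailing [SEP]).
    tokens = tokens[:query_length]
    attention_mask = [1] * len(tokens)
    if query_expansion:
        # Assumption: expansion pads with [MASK] tokens up to query_length.
        num_pad = query_length - len(tokens)
        tokens += ["[MASK]"] * num_pad
        # Expanded positions are attended to only if the flag is set.
        attention_mask += [int(attend_to_query_expanded_tokens)] * num_pad
    return tokens, attention_mask


tokens, mask = sketch_tokenize_query(
    ["what", "is", "ir"], query_expansion=True, query_length=8
)
print(tokens)
print(mask)
```

Under these assumptions, the document-side parameters (`doc_token`, `doc_expansion`, `doc_length`, `attend_to_doc_expanded_tokens`) would mirror the query-side ones with different defaults.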
Methods
__init__(*args[, query_token, doc_token, ...])
from_pretrained(model_name_or_path, *args, ...) – Loads a pretrained tokenizer.
tokenize([queries, docs])
tokenize_doc(docs, *args, **kwargs)
tokenize_query(queries, *args, **kwargs)
Attributes
doc_token
doc_token_id
query_token
query_token_id
- config_class
alias of BiEncoderConfig
- classmethod from_pretrained(model_name_or_path: str, *args, **kwargs) → LightningIRTokenizer
Loads a pretrained tokenizer. Wraps the transformers.PreTrainedTokenizer.from_pretrained method to return a derived LightningIRTokenizer class. See LightningIRTokenizerClassFactory for more details.
>>> # Loading using model class and backbone checkpoint
>>> type(BiEncoderTokenizer.from_pretrained("bert-base-uncased"))
<class 'lightning_ir.base.class_factory.BiEncoderBertTokenizerFast'>
>>> # Loading using base class and backbone checkpoint
>>> type(LightningIRTokenizer.from_pretrained("bert-base-uncased", config=BiEncoderConfig()))
<class 'lightning_ir.base.class_factory.BiEncoderBertTokenizerFast'>
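The class name in the example output, BiEncoderBertTokenizerFast, suggests a class created at runtime by combining a Lightning IR tokenizer with a backbone tokenizer. A minimal sketch of that general pattern follows, using Python's built-in `type()`. The names `BackboneTokenizer`, `MarkerMixin`, and `derive_tokenizer_class` are hypothetical stand-ins and are not part of lightning_ir or transformers; the real class factory may work differently.

```python
class BackboneTokenizer:
    """Hypothetical stand-in for a backbone tokenizer (e.g. a BERT tokenizer)."""

    def tokenize(self, text):
        # Toy tokenization: lowercase and split on whitespace.
        return text.lower().split()


class MarkerMixin:
    """Hypothetical stand-in for a mixin that adds query handling."""

    query_token = "[QUE]"

    def tokenize_query(self, text):
        # Relies on tokenize() supplied by the backbone class.
        return [self.query_token] + self.tokenize(text)


def derive_tokenizer_class(mixin, backbone):
    """Create a new class at runtime that inherits from mixin and backbone."""
    name = f"{mixin.__name__}{backbone.__name__}"
    # The mixin comes first so its methods take precedence in the MRO.
    return type(name, (mixin, backbone), {})


Derived = derive_tokenizer_class(MarkerMixin, BackboneTokenizer)
tok = Derived()
print(Derived.__name__)
print(tok.tokenize_query("Hello World"))
```

Placing the mixin before the backbone in the base-class tuple lets the mixin override backbone methods while still inheriting everything it does not override.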
- Parameters:
model_name_or_path (str) – Name or path of the pretrained tokenizer
- Raises:
ValueError – If called on the abstract class LightningIRTokenizer and no config is passed
- Returns:
A derived LightningIRTokenizer consisting of a backbone tokenizer and a LightningIRTokenizer mixin
- Return type: