tokenizer
get_chatml_encoder() -> Tokenizer
Back-compat helper for the legacy ChatML tiktoken setup.
encode_chat_template(template: ChatTemplate, encoder: Optional[Tokenizer] = None, system_prompt: Optional[str] = None, prompt: bool = False, *, tokenize: bool = True) -> list[int] | str
Encode a chat template as tokens or formatted text.
- For tiktoken, formatting is always ChatML.
- For HuggingFace tokenizers, formatting uses tokenizer.apply_chat_template.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| template | ChatTemplate | List of chat turns | required |
| encoder | Optional[Tokenizer] | Tokenizer to use. Required when tokenize=True. | None |
| system_prompt | Optional[str] | Optional system prompt to prepend | None |
| prompt | bool | If True, append a generation prompt for autoregressive generation | False |
| tokenize | bool | If True, return token ids. If False, return formatted text. | True |
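The tokenize=False path can be illustrated with a minimal sketch of ChatML rendering. This is an assumption about the formatting logic, not the helper's actual implementation: each turn is taken to be a dict with "role" and "content" keys, and role and message are separated by a newline as in standard ChatML.

```python
def format_chatml(template, system_prompt=None, prompt=False):
    """Sketch: render chat turns as ChatML text.

    Assumes each turn is a dict with "role" and "content" keys
    (the real ChatTemplate type is not shown in this doc).
    """
    turns = list(template)
    if system_prompt is not None:
        # Prepend the optional system turn.
        turns = [{"role": "system", "content": system_prompt}] + turns
    text = "".join(
        f"<|im_start|>{t['role']}\n{t['content']}<|im_end|>\n" for t in turns
    )
    if prompt:
        # Open an assistant turn so autoregressive generation continues here.
        text += "<|im_start|>assistant\n"
    return text
```

With tokenize=True, the same text would then be passed through the encoder to produce token ids.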
encode_chat_template_with_mask(template: ChatTemplate, encoder: Tokenizer, system_prompt: Optional[str] = None) -> tuple[list[int], list[bool]]
Encode a chat template and return a per-token assistant mask.
Returns:
| Type | Description |
|---|---|
| tuple[list[int], list[bool]] | (ids, assistant_mask): ids is the token list; assistant_mask[i] is True if token i belongs to an assistant turn (standard SFT masking). |
Uses incremental encoding: encodes progressively longer prefixes of the conversation to find exact token boundaries for each turn.
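The incremental-encoding idea can be sketched with a toy character-level encoder (an assumption for illustration; the real Tokenizer only needs to encode prefixes consistently):

```python
def assistant_mask(turns, encode):
    """Sketch of incremental encoding for an assistant mask.

    Encodes progressively longer prefixes of the formatted conversation;
    the tokens added at each step belong to that step's turn, so they
    inherit its role. Assumes turns are {"role", "content"} dicts.
    """
    ids, mask = [], []
    text = ""
    for turn in turns:
        text += f"<|im_start|>{turn['role']}\n{turn['content']}<|im_end|>\n"
        new_ids = encode(text)
        # Tokens beyond the previous prefix were produced by this turn.
        n_new = len(new_ids) - len(ids)
        ids = new_ids
        mask.extend([turn["role"] == "assistant"] * n_new)
    return ids, mask
```

A character-level encoder (one token per character) is trivially prefix-consistent, which makes the boundary arithmetic exact; real subword tokenizers can merge across boundaries, which is why exact prefix re-encoding is used rather than encoding each turn independently.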
decode_chat_template(tokens: list[int] | str, encoder: Optional[Tokenizer] = None) -> ChatTemplate
Decode tokens back into a ChatTemplate.
Parses the ChatML format: <|im_start|>role message<|im_end|>
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tokens | list[int] \| str | Token list or string (if encoder is None, treated as string) | required |
| encoder | Optional[Tokenizer] | Tokenizer (if None, tokens is treated as the raw string) | None |
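The parsing step can be sketched as a regex over the decoded string. This is a minimal illustration, not the helper's actual code: a newline is assumed between role and message (standard ChatML), and edge cases such as truncated turns are ignored.

```python
import re

# One ChatML turn: <|im_start|>role\ncontent<|im_end|>
# DOTALL lets the content span multiple lines; .*? keeps the match non-greedy.
CHATML_TURN = re.compile(r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>", re.DOTALL)

def parse_chatml(text):
    """Sketch: parse ChatML text back into a list of role/content turns."""
    return [
        {"role": role, "content": content}
        for role, content in CHATML_TURN.findall(text)
    ]
```

With an encoder, the token ids would first be decoded to a string and then parsed the same way.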