tokenizer

get_chatml_encoder() -> Tokenizer

Back-compat helper for the legacy ChatML tiktoken setup.

encode_chat_template(template: ChatTemplate, encoder: Optional[Tokenizer] = None, system_prompt: Optional[str] = None, prompt: bool = False, *, tokenize: bool = True) -> list[int] | str

encode_chat_template(
    template: ChatTemplate,
    encoder: Optional[Tokenizer] = None,
    system_prompt: Optional[str] = None,
    prompt: bool = False,
    *,
    tokenize: Literal[False],
) -> str
encode_chat_template(
    template: ChatTemplate,
    encoder: Tokenizer = ...,
    system_prompt: Optional[str] = None,
    prompt: bool = False,
    *,
    tokenize: Literal[True] = True,
) -> list[int]

Encode a chat template as tokens or formatted text.

  • For tiktoken, formatting is always ChatML.
  • For HuggingFace tokenizers, formatting uses tokenizer.apply_chat_template.

Parameters:

Name           Type                 Default   Description
template       ChatTemplate         required  List of chat turns.
encoder        Optional[Tokenizer]  None      Tokenizer to use. Required when tokenize=True.
system_prompt  Optional[str]        None      Optional system prompt to prepend.
prompt         bool                 False     If True, append a generation prompt for autoregressive generation.
tokenize       bool                 True      If True, return token ids. If False, return formatted text.
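
To make the ChatML formatting concrete, here is a minimal sketch of what the formatted text could look like when tokenize=False. It does not call this module: format_chatml and the Turn alias are hypothetical names, a turn is assumed to be a dict with "role" and "content" keys, and the newline after the role follows the common ChatML layout, which this page does not spell out.

```python
from typing import Optional

# Assumption: a chat turn is a dict with "role" and "content" keys.
Turn = dict[str, str]


def format_chatml(
    turns: list[Turn],
    system_prompt: Optional[str] = None,
    prompt: bool = False,
) -> str:
    """Hypothetical helper: render chat turns as ChatML text."""
    parts: list[str] = []
    if system_prompt is not None:
        # The system prompt is prepended as its own turn.
        parts.append(f"<|im_start|>system\n{system_prompt}<|im_end|>\n")
    for turn in turns:
        parts.append(f"<|im_start|>{turn['role']}\n{turn['content']}<|im_end|>\n")
    if prompt:
        # Generation prompt: open an assistant turn and leave it unclosed.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


text = format_chatml(
    [{"role": "user", "content": "Hi"}],
    system_prompt="Be brief.",
    prompt=True,
)
print(text)
```

With prompt=True the text ends in an open assistant turn, which is what lets an autoregressive model continue from that point.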

encode_chat_template_with_mask(template: ChatTemplate, encoder: Tokenizer, system_prompt: Optional[str] = None) -> tuple[list[int], list[bool]]

Encode a chat template and return a per-token assistant mask.

Returns:

Type                          Description
tuple[list[int], list[bool]]  (ids, assistant_mask): ids is the token list; assistant_mask[i] is True if token i belongs to an assistant turn (standard SFT masking).

Uses incremental encoding: encodes progressively longer prefixes of the conversation to find exact token boundaries for each turn.
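
The incremental approach can be sketched as follows. This is an illustration under stated assumptions, not this module's implementation: render, toy_encode, and encode_with_mask are hypothetical names, the tokenizer is a byte-level stand-in, and the whole assistant turn (including its role header) is marked True, which may differ from what the library actually masks.

```python
from typing import Callable

# Assumption: a chat turn is a dict with "role" and "content" keys.
Turn = dict[str, str]


def render(turns: list[Turn]) -> str:
    """Hypothetical ChatML renderer used only for this sketch."""
    return "".join(
        f"<|im_start|>{t['role']}\n{t['content']}<|im_end|>\n" for t in turns
    )


def toy_encode(text: str) -> list[int]:
    """Stand-in tokenizer: one token id per UTF-8 byte."""
    return list(text.encode("utf-8"))


def encode_with_mask(
    turns: list[Turn],
    encode: Callable[[str], list[int]] = toy_encode,
) -> tuple[list[int], list[bool]]:
    ids: list[int] = []
    mask: list[bool] = []
    for i, turn in enumerate(turns):
        # Encode the conversation up to and including turn i.
        prefix_ids = encode(render(turns[: i + 1]))
        # Tokens beyond the previous prefix length belong to turn i.
        new_tokens = len(prefix_ids) - len(ids)
        mask.extend([turn["role"] == "assistant"] * new_tokens)
        ids = prefix_ids
    return ids, mask


turns = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
]
ids, mask = encode_with_mask(turns)
```

The sketch relies on each prefix encoding to a prefix of the full token list. That holds trivially for the byte-level stand-in; for a real subword tokenizer it is an assumption that depends on turn boundaries falling on special tokens.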

decode_chat_template(tokens: list[int] | str, encoder: Optional[Tokenizer] = None) -> ChatTemplate

Decode tokens back into a ChatTemplate.

Parses ChatML format: <|im_start|>role message<|im_end|>

Parameters:

Name     Type                 Default   Description
tokens   list[int] | str      required  Token list, or the raw string when encoder is None.
encoder  Optional[Tokenizer]  None      Tokenizer. If None, tokens is treated as the raw string.
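
For the string case (encoder is None), parsing ChatML text back into turns might look like the sketch below. parse_chatml and TURN_RE are hypothetical names, the output shape (a list of role/content dicts) is an assumption about ChatTemplate, and the pattern accepts any single whitespace character between the role and the message because this page does not specify the separator.

```python
import re

# One ChatML turn: <|im_start|>role<whitespace>message<|im_end|>
# DOTALL lets the message span multiple lines; the non-greedy match
# stops at the first <|im_end|>.
TURN_RE = re.compile(r"<\|im_start\|>(\w+)\s(.*?)<\|im_end\|>", re.DOTALL)


def parse_chatml(text: str) -> list[dict[str, str]]:
    """Hypothetical parser: extract (role, content) turns from ChatML text."""
    return [
        {"role": m.group(1), "content": m.group(2)}
        for m in TURN_RE.finditer(text)
    ]


parsed = parse_chatml(
    "<|im_start|>user\nHi<|im_end|>\n"
    "<|im_start|>assistant\nHello\nthere<|im_end|>\n"
)
print(parsed)
```

An unclosed trailing turn, such as a generation prompt, does not match the pattern and is skipped in this sketch.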