Interface Encoding
-
Method Summary
Modifier and Type / Method / Description
int countTokens(String text): Encodes the given text into a list of token ids and returns the number of tokens.
decode(List<Integer> tokens): Decodes the given list of token ids into a text.
byte[] decodeBytes(List<Integer> tokens): Decodes the given list of token ids into a byte array.
encode(String text): Encodes the given text into a list of token ids.
encode(String text, int maxTokens): Encodes the given text into a list of token ids.
encodeOrdinary(String text): Encodes the given text into a list of token ids, ignoring special tokens.
encodeOrdinary(String text, int maxTokens): Encodes the given text into a list of token ids, ignoring special tokens.
String getName(): Returns the name of this encoding.
-
Method Details
-
encode
Encodes the given text into a list of token ids.
Special tokens are artificial tokens used to unlock capabilities from a model, such as fill-in-the-middle. Parsing special tokens in a text is currently not supported, so if the text contains special tokens, this method throws an UnsupportedOperationException.
If you want to encode special tokens as ordinary text, use encodeOrdinary(String).
    Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
    encoding.encode("hello world"); // returns [15339, 1917]
    encoding.encode("hello <|endoftext|> world"); // throws an UnsupportedOperationException
Parameters:
text - the text to encode
Returns:
the list of token ids
Throws:
UnsupportedOperationException - if the text contains special tokens, which are currently not supported
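The behavior described above suggests one possible calling pattern: try the strict encode first and fall back to encodeOrdinary when the text happens to contain special tokens. The sketch below illustrates that pattern against a hypothetical stand-in interface, TextEncoder, which only mirrors the two method shapes described here. It is not the library's Encoding type, and the helper name encodeLenient is an assumption made for illustration.

```java
import java.util.List;

class LenientEncodingSketch {

    /** Hypothetical stand-in mirroring the two documented methods; not the library's Encoding interface. */
    interface TextEncoder {
        /** May throw UnsupportedOperationException if the text contains special tokens. */
        List<Integer> encode(String text);

        /** Encodes special tokens as if they were ordinary text. */
        List<Integer> encodeOrdinary(String text);
    }

    /**
     * Tries the strict encode first and falls back to encodeOrdinary
     * when the strict variant rejects the text.
     */
    static List<Integer> encodeLenient(TextEncoder encoder, String text) {
        try {
            return encoder.encode(text);
        } catch (UnsupportedOperationException e) {
            // The text contained special tokens: treat them as plain text instead.
            return encoder.encodeOrdinary(text);
        }
    }
}
```

Whether silently falling back is appropriate depends on the application. If special tokens in user input are unexpected, calling encodeOrdinary directly or surfacing the exception may be the better choice.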
-
encode
Encodes the given text into a list of token ids.
Special tokens are artificial tokens used to unlock capabilities from a model, such as fill-in-the-middle. Parsing special tokens in a text is currently not supported, so if the text contains special tokens, this method throws an UnsupportedOperationException.
If you want to encode special tokens as ordinary text, use encodeOrdinary(String, int).
This method truncates the list of token ids if the number of tokens exceeds the given maxTokens parameter. It tries to keep together the tokens of any character that is encoded into multiple tokens. For example, if the text contains a character that is encoded into 3 tokens, and the maxTokens limit would cut off the last of those tokens, the first two tokens of that character are dropped as well. The actual number of tokens returned may therefore be less than maxTokens.
    Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
    encoding.encode("hello world", 100); // returns [15339, 1917]
    encoding.encode("hello <|endoftext|> world", 100); // throws an UnsupportedOperationException
Parameters:
text - the text to encode
maxTokens - the maximum number of tokens to encode
Returns:
the EncodingResult containing a list of token ids and whether the tokens were truncated due to the maxTokens parameter
Throws:
UnsupportedOperationException - if the text contains special tokens, which are currently not supported
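The truncation rule above can be illustrated with a toy model. The sketch below assumes a hypothetical tokenizer that emits one token per UTF-8 byte, so a character such as the euro sign occupies 3 tokens. Real encodings such as CL100K_BASE map tokens to byte sequences of varying length, and the library's actual implementation may differ; the class and method names here are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class TruncationSketch {

    /** Toy tokenizer: one token per UTF-8 byte (an assumption for illustration only). */
    static List<Integer> toByteTokens(String text) {
        List<Integer> tokens = new ArrayList<>();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            tokens.add(b & 0xFF);
        }
        return tokens;
    }

    /**
     * Keeps at most maxTokens tokens (maxTokens is assumed to be >= 0),
     * then drops any trailing tokens that belong to a character whose
     * remaining tokens were cut off.
     */
    static List<Integer> truncate(List<Integer> tokens, int maxTokens) {
        if (tokens.size() <= maxTokens) {
            return tokens;
        }
        int end = maxTokens;
        // If the first excluded token is a UTF-8 continuation byte (10xxxxxx),
        // the cut falls inside a character: back up to the start of that character.
        while (end > 0 && (tokens.get(end) & 0xC0) == 0x80) {
            end--;
        }
        return new ArrayList<>(tokens.subList(0, end));
    }
}
```

With the input "a\u20ACb" (the letter a, the euro sign, the letter b) the toy tokenizer yields 5 tokens. Truncating to 3 would split the euro sign, so only the single token for "a" is kept; truncating to 4 keeps the complete euro sign.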
-
encodeOrdinary
Encodes the given text into a list of token ids, ignoring special tokens.
This method does not throw an exception if the text contains special tokens, but instead encodes them as if they were ordinary text.
    Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
    encoding.encodeOrdinary("hello world"); // returns [15339, 1917]
    encoding.encodeOrdinary("hello <|endoftext|> world"); // returns [15339, 83739, 8862, 728, 428, 91, 29, 1917]
Parameters:
text - the text to encode
Returns:
the list of token ids
-
encodeOrdinary
Encodes the given text into a list of token ids, ignoring special tokens.
This method does not throw an exception if the text contains special tokens, but instead encodes them as if they were ordinary text.
It truncates the list of token ids if the number of tokens exceeds the given maxTokens parameter. It tries to keep together the tokens of any character that is encoded into multiple tokens. For example, if the text contains a character that is encoded into 3 tokens, and the maxTokens limit would cut off the last of those tokens, the first two tokens of that character are dropped as well. The actual number of tokens returned may therefore be less than maxTokens.
    Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
    encoding.encodeOrdinary("hello world", 100); // returns [15339, 1917]
    encoding.encodeOrdinary("hello <|endoftext|> world", 100); // returns [15339, 83739, 8862, 728, 428, 91, 29, 1917]
Parameters:
text - the text to encode
maxTokens - the maximum number of tokens to encode
Returns:
the EncodingResult containing a list of token ids and whether the tokens were truncated due to the maxTokens parameter
-
countTokens
Encodes the given text into a list of token ids and returns the number of tokens. It is more performant than encode(String).
    Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
    encoding.countTokens("hello world"); // returns 2
    encoding.countTokens("hello <|endoftext|> world"); // throws an UnsupportedOperationException
Parameters:
text - the text to count tokens for
Returns:
the number of tokens
Throws:
UnsupportedOperationException - if the text contains special tokens, which are currently not supported
-
decode
Decodes the given list of token ids into a text.
    Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
    encoding.decode(List.of(15339, 1917)); // returns "hello world"
    encoding.decode(List.of(15339, 1917, Integer.MAX_VALUE)); // throws an IllegalArgumentException
Parameters:
tokens - the list of token ids
Returns:
the decoded text
Throws:
IllegalArgumentException - if the list contains invalid token ids
-
decodeBytes
Decodes the given list of token ids into a byte array.
    Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
    encoding.decodeBytes(List.of(15339, 1917)); // returns [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
    encoding.decodeBytes(List.of(15339, 1917, Integer.MAX_VALUE)); // throws an IllegalArgumentException
Parameters:
tokens - the list of token ids
Returns:
the decoded byte array
Throws:
IllegalArgumentException - if the list contains invalid token ids
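The byte values in the decodeBytes example are the UTF-8 bytes of "hello world", which is the text the decode example returns for the same token ids. The sketch below uses only the Java standard library to show that relationship. The class and method names are hypothetical, and that decode interprets the bytes as UTF-8 in general is an assumption based on these two examples.

```java
import java.nio.charset.StandardCharsets;

class DecodeBytesSketch {

    /** Interprets a decoded byte array as UTF-8 text (assumed to match what decode returns). */
    static String bytesToText(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Working with the raw bytes can be useful when a token sequence might end in the middle of a multi-byte character, because the String constructor used here replaces malformed input with a replacement character instead of preserving the original bytes.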
-
getName
String getName()
Returns the name of this encoding. This is the name which is used to identify the encoding and must be unique for registration in the EncodingRegistry.
Returns:
the name of this encoding
-