Interface Encoding


public interface Encoding
  • Method Details

    • encode

      List<Integer> encode(String text)
      Encodes the given text into a list of token ids.

      Special tokens are artificial tokens used to unlock capabilities from a model, such as fill-in-the-middle. Parsing special tokens in a text is currently not supported, so this method throws an UnsupportedOperationException if the text contains special tokens.

      If you want to encode special tokens as ordinary text, use encodeOrdinary(String).

       Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
       encoding.encode("hello world");
       // returns [15339, 1917]
      
       encoding.encode("hello <|endoftext|> world");
       // throws an UnsupportedOperationException
       
      Parameters:
      text - the text to encode
      Returns:
      the list of token ids
      Throws:
      UnsupportedOperationException - if the text contains special tokens, which are not currently supported
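      The fallback to encodeOrdinary(String) mentioned above can be sketched as a small helper. This is a minimal sketch and not part of this interface: the class EncodeFallbackSketch and the method encodeOrFallback are hypothetical names, and the two encoders are passed in as plain functions so that the example compiles with the JDK alone.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical helper, not part of the Encoding interface.
final class EncodeFallbackSketch {

    // Tries the strict encoder first. If it rejects the text because of
    // special tokens, the text is encoded as ordinary text instead.
    static List<Integer> encodeOrFallback(
            Function<String, List<Integer>> strict,
            Function<String, List<Integer>> ordinary,
            String text) {
        try {
            return strict.apply(text);
        } catch (UnsupportedOperationException e) {
            return ordinary.apply(text);
        }
    }
}
```

      With an Encoding instance, the call would presumably look like encodeOrFallback(encoding::encode, encoding::encodeOrdinary, text). The fallback changes how special tokens are interpreted, so it only suits callers that really want them treated as plain text.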
    • encode

      EncodingResult encode(String text, int maxTokens)
      Encodes the given text into a list of token ids.

      Special tokens are artificial tokens used to unlock capabilities from a model, such as fill-in-the-middle. Parsing special tokens in a text is currently not supported, so this method throws an UnsupportedOperationException if the text contains special tokens.

      If you want to encode special tokens as ordinary text, use encodeOrdinary(String, int).

      This method truncates the list of token ids if the number of tokens exceeds the given maxTokens parameter. It tries to keep together the tokens of any character that is encoded into multiple tokens. For example, if the text contains a character that is encoded into 3 tokens and the maxTokens limit would cut off the last of them, the first two tokens of that character are dropped as well. The actual number of tokens may therefore be less than maxTokens.

       Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
       encoding.encode("hello world", 100);
       // returns an EncodingResult containing [15339, 1917]
      
       encoding.encode("hello <|endoftext|> world", 100);
       // throws an UnsupportedOperationException
       
      Parameters:
      text - the text to encode
      maxTokens - the maximum number of tokens to encode
      Returns:
      the EncodingResult containing a list of token ids and whether the tokens were truncated due to the maxTokens parameter
      Throws:
      UnsupportedOperationException - if the text contains special tokens, which are not currently supported
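      The truncation rule described above can be illustrated without the library. The following is a minimal sketch under stated assumptions, not the actual implementation: each token is modelled as the raw bytes it stands for, and the class TruncationSketch with its methods is hypothetical. After cutting the list at maxTokens, trailing tokens are dropped until the remaining bytes form valid UTF-8, so that no character is left half-encoded.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; tokens are modelled as the bytes they decode to.
final class TruncationSketch {

    // Returns true if the concatenated token bytes are well-formed UTF-8.
    static boolean isValidUtf8(List<byte[]> tokens) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] token : tokens) {
            out.write(token, 0, token.length);
        }
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(out.toByteArray()));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Keeps at most maxTokens tokens, then drops trailing tokens until the
    // kept bytes no longer end in the middle of a character.
    static List<byte[]> truncate(List<byte[]> tokens, int maxTokens) {
        int end = Math.min(Math.max(maxTokens, 0), tokens.size());
        List<byte[]> kept = new ArrayList<>(tokens.subList(0, end));
        while (!kept.isEmpty() && !isValidUtf8(kept)) {
            kept.remove(kept.size() - 1);
        }
        return kept;
    }
}
```

      For instance, if one token holds the bytes of "hi" and an emoji is split across three further tokens, truncating to 3 tokens keeps only the first one in this sketch, which mirrors the case where fewer than maxTokens tokens are returned.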
    • encodeOrdinary

      List<Integer> encodeOrdinary(String text)
      Encodes the given text into a list of token ids, ignoring special tokens.

      This method does not throw an exception if the text contains special tokens, but instead encodes them as if they were ordinary text.

       Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
       encoding.encodeOrdinary("hello world");
       // returns [15339, 1917]
      
       encoding.encodeOrdinary("hello <|endoftext|> world");
       // returns [15339, 83739, 8862, 728, 428, 91, 29, 1917]
       
      Parameters:
      text - the text to encode
      Returns:
      the list of token ids
    • encodeOrdinary

      EncodingResult encodeOrdinary(String text, int maxTokens)
      Encodes the given text into a list of token ids, ignoring special tokens.

      This method does not throw an exception if the text contains special tokens, but instead encodes them as if they were ordinary text.

      It truncates the list of token ids if the number of tokens exceeds the given maxTokens parameter. It tries to keep together the tokens of any character that is encoded into multiple tokens. For example, if the text contains a character that is encoded into 3 tokens and the maxTokens limit would cut off the last of them, the first two tokens of that character are dropped as well. The actual number of tokens may therefore be less than maxTokens.

       Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
       encoding.encodeOrdinary("hello world", 100);
       // returns an EncodingResult containing [15339, 1917]
      
       encoding.encodeOrdinary("hello <|endoftext|> world", 100);
       // returns an EncodingResult containing [15339, 83739, 8862, 728, 428, 91, 29, 1917]
       
      Parameters:
      text - the text to encode
      maxTokens - the maximum number of tokens to encode
      Returns:
      the EncodingResult containing a list of token ids and whether the tokens were truncated due to the maxTokens parameter
    • countTokens

      int countTokens(String text)
      Encodes the given text into a list of token ids and returns the number of tokens. It is more performant than encode(String).

       Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
       encoding.countTokens("hello world");
       // returns 2
      
       encoding.countTokens("hello <|endoftext|> world");
       // throws an UnsupportedOperationException
       
      Parameters:
      text - the text to count tokens for
      Returns:
      the number of tokens
      Throws:
      UnsupportedOperationException - if the text contains special tokens, which are not currently supported
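      Since countTokens is described as cheaper than encode(String), it lends itself to budget checks where only the count matters. The following is a minimal sketch of greedy batching under a token budget. The class TokenBudgetSketch and the method batchByTokenBudget are hypothetical, and the counter is passed in as a plain function so that the example compiles with the JDK alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

// Hypothetical helper, not part of the Encoding interface.
final class TokenBudgetSketch {

    // Groups texts in order so that each batch stays within the budget.
    // A single text that exceeds the budget ends up in a batch of its own.
    static List<List<String>> batchByTokenBudget(
            List<String> texts,
            ToIntFunction<String> counter,
            int maxTokensPerBatch) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int used = 0;
        for (String text : texts) {
            int count = counter.applyAsInt(text);
            if (!current.isEmpty() && used + count > maxTokensPerBatch) {
                batches.add(current);
                current = new ArrayList<>();
                used = 0;
            }
            current.add(text);
            used += count;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

      With an Encoding instance, the counter would presumably be encoding::countTokens. Note that such a counter throws an UnsupportedOperationException for texts containing special tokens, as documented above.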
    • decode

      String decode(List<Integer> tokens)
      Decodes the given list of token ids into text.

       Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
       encoding.decode(List.of(15339, 1917));
       // returns "hello world"
      
       encoding.decode(List.of(15339, 1917, Integer.MAX_VALUE));
       // throws an IllegalArgumentException
       
      Parameters:
      tokens - the list of token ids
      Returns:
      the decoded text
      Throws:
      IllegalArgumentException - if the list contains invalid token ids
    • decodeBytes

      byte[] decodeBytes(List<Integer> tokens)
      Decodes the given list of token ids into a byte array.

       Encoding encoding = EncodingRegistry.getEncoding(EncodingType.CL100K_BASE);
       encoding.decodeBytes(List.of(15339, 1917));
       // returns [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
      
       encoding.decodeBytes(List.of(15339, 1917, Integer.MAX_VALUE));
       // throws an IllegalArgumentException
       
      Parameters:
      tokens - the list of token ids
      Returns:
      the decoded byte array
      Throws:
      IllegalArgumentException - if the list contains invalid token ids
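      The byte array in the example above is the UTF-8 encoding of "hello world". The following minimal sketch turns such bytes into a String, assuming the decoded bytes are UTF-8 as that output suggests. The class DecodeBytesSketch and the method utf8 are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the Encoding interface.
final class DecodeBytesSketch {

    // Interprets the given bytes as UTF-8. Malformed sequences are replaced
    // with U+FFFD by the String constructor instead of raising an error.
    static String utf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

      A token list that ends in the middle of a multi-byte character yields bytes that are not valid UTF-8 on their own. Converting them to a String loses that information, so working with the raw bytes is the safer choice when partial token sequences need to be concatenated later.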
    • getName

      String getName()
      Returns the name of this encoding. The name identifies the encoding and must be unique when the encoding is registered in the EncodingRegistry.
      Returns:
      the name of this encoding