Module utf8

Basic UTF8 character counting support for Luakit

This module provides a partial implementation of the Lua 5.3 UTF-8 library.

Functions

utf8.len (s, begin, end)

Return the number of characters (not bytes) of a UTF-8-encoded string.

If the optional parameters begin and/or end are given, only characters whose first byte lies between byte positions begin and end (both inclusive) are counted.

An error is raised if s (or the characters that start in the slice from begin to end) contains invalid UTF-8 characters, or if begin or end point to byte indices not in s.

Parameters

  • s
    Type: string
    The string whose length is to be returned.
  • begin
    Type: integer
    Optional
    Default: 1
    Only consider s from (1-based byte) index begin onwards. If negative, count from end of s (with -1 being the last byte).
  • end
    Type: integer
    Optional
    Default: -1
    Only consider s up to and including (1-based byte) index end. If negative, count from end of s (with -1 being the last byte).

Return Values

  • integer
    The length (in UTF8 characters) of s.
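
Example

The counting rules above can be sketched on a string that contains one two-byte character. This is only an illustration: it assumes that utf8 is reachable as a global table (as it is in stock Lua 5.3), and the sample string is made up for the example. Error behaviour on invalid input is not shown.

```lua
-- Assumption: `utf8` is available as a global table.
-- "h\195\169llo" is "héllo"; the "é" occupies bytes 2 and 3.
local s = "h\195\169llo"

local byte_count = #s                -- 6 bytes
local char_count = utf8.len(s)       -- 5 characters
local from_second = utf8.len(s, 2)   -- "é", "l", "l", "o": 4 characters
local first_two = utf8.len(s, 1, 2)  -- "h" and "é" both begin in bytes 1..2: 2
local last_three = utf8.len(s, -3)   -- bytes 4..6 are "llo": 3 characters

print(byte_count, char_count, from_second, first_two, last_three)
```

Note that utf8.len(s, 1, 2) counts the "é" even though its second byte lies at position 3, because only the position of a character's first byte is compared against begin and end.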

utf8.offset (string, woffset, base)

Convert an offset (in UTF8 characters) to a byte offset.

If the optional parameter base is given and positive, characters are counted starting from (byte) index base.

An error is raised if base is smaller than 1 or larger than the (byte) length of string, or if base points to a byte inside string that is not the starting byte of a UTF8 encoding.

Examples

  • utf8.offset("abc",2,2) would return 3
  • utf8.offset("abc",-3) would return 1

Parameters

  • string
    Type: string
    The string in which offsets should be converted.
  • woffset
    Type: integer
    The offset (1-based, in UTF8 characters) which should be converted.
  • base
    Type: integer
    Optional
    A (1-based byte) index in string. Defaults to 1 if woffset is positive, and to the (byte) length of string if woffset is negative. See the description above.

Return Values

  • integer
    The (1-based) byte offset of the woffset-th UTF8 character in string.
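
Example

A common use of utf8.offset is to cut a string at a character boundary rather than at a byte boundary. The sketch below is illustrative only: it assumes that utf8 is reachable as a global table (as it is in stock Lua 5.3), and the sample string is made up for the example.

```lua
-- Assumption: `utf8` is available as a global table.
-- "h\195\169llo" is "héllo"; the "é" occupies bytes 2 and 3.
local s = "h\195\169llo"

-- The third character ("l") begins at byte 4, after the two-byte "é".
local third = utf8.offset(s, 3)

-- Everything before that byte is exactly the first two characters.
local prefix = s:sub(1, third - 1)   -- "h\195\169"

-- The two examples from the description above.
local from_base = utf8.offset("abc", 2, 2)   -- 3
local from_end = utf8.offset("abc", -3)      -- 1

print(third, #prefix, from_base, from_end)
```

Passing the result of utf8.offset to string.sub keeps the slice from splitting a multi-byte character, which a plain s:sub(1, 2) would do here.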

Attribution

Copyright