Overview

Utilities and code for exploring Unicode string preparation.

This package contains code for exploring strategies for preparing Unicode strings for use as identifiers. Many approaches to this problem have been tried over the years; this code lets you explore them, compare them, and create new ones.

At present, this code focuses only on which characters are allowed, rejected, stripped, etc. by these preparation methods. Some of these methods also apply further processing to the whole string, including contextual checks, which are not yet implemented in this code base.
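
As a rough illustration of per-character classification, here is a minimal sketch in Python 3. It is not the code in this package: the labels, the classify() and prepare() functions, and the toy rules (keyed on Unicode general category) are all assumptions made up for this example. Real methods such as IDNA2008 or StringPrep use much more detailed tables.

```python
import unicodedata

# Hypothetical labels for this sketch; the package's own names may differ.
ALLOWED, REJECTED, STRIPPED = "allowed", "rejected", "stripped"

def classify(cp):
    """Toy rule set: label one code point by its Unicode general category."""
    category = unicodedata.category(chr(cp))
    if category[0] in "LN":      # letters and numbers are kept
        return ALLOWED
    if category == "Cf":         # format characters (e.g. ZERO WIDTH JOINER) are dropped
        return STRIPPED
    return REJECTED              # everything else makes the string invalid

def prepare(text):
    """Apply classify() to each character: keep, drop, or raise ValueError."""
    kept = []
    for ch in text:
        label = classify(ord(ch))
        if label == REJECTED:
            raise ValueError("code point U+%04X is not allowed" % ord(ch))
        if label == ALLOWED:
            kept.append(ch)
    return "".join(kept)
```

Note that this sketch, like the package at present, looks at each character in isolation and performs no contextual checks.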

You can start by running the included command line utility:

> python preputil.py --help
Usage:
preputil.py (-m | --map) [method]               -- map out a method
preputil.py (-d | --diff) [method1] [method2]   -- compare two methods
preputil.py (-l | --list)                       -- list available methods
preputil.py (-h | --help)                       -- this help text

> python preputil.py --list
Available methods, and aliases to use on command line:
    IDNA2008:    idna idna2008 idnabis
    StringPrep:  string stringprep
    UAX31:       id uax31
    VName:       vname

> python preputil.py --map idna2008 | less
> python preputil.py --diff stringprep uax31 | less

Note that the '--map' and '--diff' modes compute functions over the entire Unicode code point range, and so take some time to run. On my 2.2 GHz MacBook they take ~17 seconds each!
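
To give a feel for why these modes are slow, here is a minimal Python 3 sketch of mapping a classifier over a span of code points and collapsing the result into runs, plus a diff of two classifiers. This is an assumption about the general technique, not a copy of what preputil.py does; map_ranges(), diff_ranges(), and the two toy classifiers are hypothetical names invented for this example.

```python
import unicodedata
from itertools import groupby

MAX_CODE_POINT = 0x10FFFF  # last code point in the Unicode code space

def letters_only(cp):
    """Toy classifier: allow only letters."""
    return "allowed" if unicodedata.category(chr(cp)).startswith("L") else "rejected"

def letters_and_digits(cp):
    """Toy classifier: allow letters and decimal digits."""
    category = unicodedata.category(chr(cp))
    return "allowed" if category.startswith("L") or category == "Nd" else "rejected"

def map_ranges(classifier, start=0, stop=MAX_CODE_POINT + 1):
    """Yield (first, last, label) for each run of code points with the same label."""
    first = start
    for label, run in groupby(range(start, stop), key=classifier):
        length = sum(1 for _ in run)   # consume the run to count its members
        yield first, first + length - 1, label
        first += length

def diff_ranges(method_a, method_b, start=0, stop=MAX_CODE_POINT + 1):
    """Yield (first, last, label_a, label_b) for runs where the two methods disagree."""
    def both(cp):
        return (method_a(cp), method_b(cp))
    for first, last, (label_a, label_b) in map_ranges(both, start, stop):
        if label_a != label_b:
            yield first, last, label_a, label_b
```

With the default arguments each function visits all 1,114,112 code points, which is where the running time goes. Passing a smaller start and stop is handy for quick experiments.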

The processing code is all in the package newprep. See the docstrings in the modules for more extensive notes, caveats, and information on how to use and extend this code.

Main Modules:
    codepoint - Utilities for handling code points and sets of code points.
    methods   - A list of available preparation methods.
    UCD       - Information and utilities from the Unicode Character Database.
    prep      - Base class for representing a preparation method.

Method Modules:
    idna2008   - IDNA2008's code point classification
    stringprep - An RFC 3454 ("StringPrep") style profile
    uax31      - Unicode's Identifier Syntax
    vname      - A proposed method for use in VWRAP

This code was written by Mark Lentczner at Linden Lab. It was inspired by the newprep BoF session at IETF 77 (Anaheim, 2010) and by work needed for both the VWRAP working group and internal Linden Lab development. Discussion of this code and the surrounding issues can be found on the newprep mailing list: https://www.ietf.org/mailman/listinfo/newprep or taken up directly with the author.

The code is released open source under an "MIT style" license. See LICENSE.

- Mark Lentczner
  markl@lindenlab.com
  May, 2010