Universa string reduction algorithm

Goal

To protect registered names, nicks and other publicly visible identification strings (in the USN service, for example) against falsification and accidental mistakes.

The idea

Only "known" characters are accepted, and for every known character substitution rules are in effect that replace look-alike characters with their "base" characters. A string containing only non-repeating base characters is a string archetype, or a reduced string.

When adding any national character set, only "meaningful" characters are added, numbers should be converted to the European system first, and any look-alike characters should be added to the substitution tables.

The algorithm

Fix the numbers

Replace any national (e.g. Roman) numerals with European ones, e.g. "Campus IV" -> "Campus 4". For Roman numerals, only numbers that are clearly separated from the text (by terminal characters, such as punctuation and whitespace) should be converted.
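A minimal sketch of the Roman-numeral case, assuming only upper-case numerals in the range 1..3999 are recognised and that "clearly separated" means not adjacent to a letter, digit or underscore. The names ROMAN_RE, roman_to_int and fix_numbers are illustrative and not part of the original description. A one-letter word such as "I" would also be converted by this sketch, so a real implementation may need more context.

```python
import re

# Values of the individual Roman digits.
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

# A well-formed upper-case Roman numeral that is not glued to a word character.
# The lookahead (?=[MDCLXVI]) rules out empty matches.
ROMAN_RE = re.compile(
    r"(?<!\w)(?=[MDCLXVI])"
    r"M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"
    r"(?!\w)"
)


def roman_to_int(numeral: str) -> int:
    """Convert a well-formed Roman numeral, reading it right to left."""
    total = 0
    previous = 0
    for ch in reversed(numeral):
        value = ROMAN_VALUES[ch]
        # A smaller digit to the left of a larger one is subtracted (IV = 4).
        total += -value if value < previous else value
        previous = value
    return total


def fix_numbers(text: str) -> str:
    """Replace stand-alone Roman numerals with European digits."""
    return ROMAN_RE.sub(lambda m: str(roman_to_int(m.group(0))), text)


print(fix_numbers("Campus IV"))
```

Words that merely contain Roman digits, such as "CIVIL", are left untouched because the numeral must be delimited on both sides.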

Replace all punctuation

Only letter characters, digits and _ can remain. The rest is removed.
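This step can be sketched with a single regular expression. The sketch assumes that whitespace is kept at this stage, since the example in the repetition step keeps the space between words. The name strip_punctuation is illustrative.

```python
import re

# In Python 3, \w on str patterns matches Unicode letters, digits and "_".
# Whitespace is kept here (an assumption, see the note above).
NOT_ALLOWED = re.compile(r"[^\w\s]")


def strip_punctuation(text: str) -> str:
    """Remove every character that is not a letter, digit, "_" or whitespace."""
    return NOT_ALLOWED.sub("", text)


print(strip_punctuation("Hello, world!!?"))
```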

Replace look-alike characters and substrings

  • All I-like characters and fragments (1 ! i | l I î ï í ī į ì and their upper-case versions, ][ ) are replaced with 1

  • All U-like characters (u U v V w W û ü ù ú ū and their upper-case versions) are replaced with V

  • All O-like characters (o O º @ Q * ô ö ò ó œ ø ō õ/ and their upper-case versions) are replaced with 0

  • All 5-like characters (S, s, Z, z) are replaced with 5

  • All 8-like characters (B and all similar characters) are replaced with 8

(to be continued)

  • All accented characters, umlauts and the like are replaced with their unaccented upper-case equivalents, e.g. ç -> C, ë -> E, unless they were already processed.

  • Leading and trailing spaces are removed.

  • The case of all characters is changed to upper case. Important! The case change must be done after all other substitutions.
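The substitution rules above can be sketched as a character translation table. This is only one possible reading: the sketch assumes that "upper-case versions" refers to the accented letters (plain letters are taken exactly as listed), handles single characters only (not the ][ fragment), and includes just B in the 8-like group because the list is still open. The names LOOKALIKE_GROUPS, build_table, strip_accents and substitute are illustrative.

```python
import unicodedata

# Look-alike groups taken from the list above: target -> source characters.
LOOKALIKE_GROUPS = {
    "1": "1!i|lI" "îïíīįì",
    "V": "uUvVwW" "ûüùúū",
    "0": "oOº@Q*" "ôöòóœøōõ",
    "5": "SsZz",
    "8": "B",  # only B is listed explicitly in the text
}


def build_table(groups):
    """Build a str.translate table from the look-alike groups."""
    table = {}
    for target, sources in groups.items():
        for ch in sources:
            table[ord(ch)] = target
            upper = ch.upper()
            # Assumption: upper-case versions are added for accented letters only.
            if ord(ch) > 127 and len(upper) == 1:
                table[ord(upper)] = target
    return table


TABLE = build_table(LOOKALIKE_GROUPS)


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (ç -> c, ë -> e)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def substitute(text: str) -> str:
    text = unicodedata.normalize("NFC", text)  # so precomposed table keys match
    text = text.translate(TABLE)               # look-alikes first
    text = strip_accents(text)                 # remaining accented characters
    text = text.strip()                        # leading and trailing spaces
    return text.upper()                        # case change comes last


print(substitute("Hello"))
```

Repeated characters are still present after this step ("Hello" gives HE110 here); they are collapsed later.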

Ensure there are only allowed characters

Each character from any supported set must be explicitly added to the character-translation table, and all potentially confusing characters must be added to the substitution table above. Any unlisted character causes a failure.
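A minimal sketch of the check, assuming a Latin-only deployment in which only upper-case letters, digits, the underscore and the space can remain after substitution. The ALLOWED set, UnknownCharacterError and ensure_allowed are hypothetical names, and the real table would be larger for every supported character set.

```python
import string

# Hypothetical allowed set for a Latin-only deployment.
ALLOWED = frozenset(string.ascii_uppercase + string.digits + "_ ")


class UnknownCharacterError(ValueError):
    """Raised when a string contains a character missing from the table."""


def ensure_allowed(text: str, allowed=ALLOWED) -> str:
    """Return text unchanged, or fail on the first unlisted character."""
    for position, ch in enumerate(text):
        if ch not in allowed:
            raise UnknownCharacterError(
                f"character {ch!r} at position {position} is not allowed"
            )
    return text


print(ensure_allowed("HE110"))
```

Failing loudly, rather than silently dropping the character, keeps an unsupported script from reducing to an empty or misleading archetype.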

Remove all repetitions

Any sequence of two or more identical characters is replaced with a single such character. E.g. Hello becomes HE10, and Hello, world!!? -> HE10 V0R1D (w is a U-like character, so it becomes V).
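Collapsing repetitions can be sketched with a back-reference. The function name remove_repetitions is illustrative, and the input is assumed to have already passed the substitution step.

```python
import re

# One character followed by one or more copies of itself.
REPEATS = re.compile(r"(.)\1+", re.DOTALL)


def remove_repetitions(text: str) -> str:
    """Replace every run of identical characters with a single character."""
    return REPEATS.sub(r"\1", text)


print(remove_repetitions("HE110"))
```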

Done

The result is the reduced string, or archetype string, which should be unique in the relevant context.
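One way the uniqueness requirement could be enforced is to key a registry by the reduced string instead of the original name. The sketch below is an assumption about usage, not part of the algorithm: NameRegistry is a hypothetical class, and toy_reduce is a deliberately simplified stand-in for the full reduction.

```python
class NameRegistry:
    """Registers names and rejects any name whose archetype is already taken."""

    def __init__(self, reduce):
        self._reduce = reduce   # function: original name -> reduced string
        self._taken = {}        # reduced string -> original name

    def register(self, name: str) -> str:
        key = self._reduce(name)
        if key in self._taken:
            raise ValueError(
                f"{name!r} collides with registered name {self._taken[key]!r}"
            )
        self._taken[key] = name
        return key


def toy_reduce(name: str) -> str:
    # Stand-in only: a real reducer would apply every step described above.
    return name.strip().upper().replace("O", "0").replace("L", "1")


registry = NameRegistry(toy_reduce)
print(registry.register("Bolt"))
```

With this stand-in, a later attempt to register "B0LT" fails because it reduces to the same key as "Bolt".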