Universa string reduction algorithm¶
Goal¶
To protect registered names, nicknames and other publicly visible identification strings (in the USN service, for example) against falsification and accidental mistakes.
The idea¶
Only "known" characters are accepted, and for every known character substitution rules are in effect that replace look-alike characters with their "base" characters. A string containing only non-repeating base characters is a string archetype, or reduced string.
When adding any national character set, only "meaningful" characters are added, numbers should first be converted to the European system, and any look-alike characters should be added to the substitution tables.
The algorithm¶
Fix the numbers¶
Convert any national (e.g. Roman) numbers to European ones, e.g. "Campus IV" -> "Campus 4". For Roman numbers, only those that are clearly separated from the text (by terminal characters such as punctuation and whitespace) should be converted.
Replace all punctuation¶
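The Roman-number rule can be sketched as follows. This is only an illustration under stated assumptions: the names `ROMAN_RE`, `roman_to_int` and `fix_numbers` are hypothetical, only upper-case well-formed numerals are recognised, and "clearly separated" is interpreted as "not adjacent to a letter, digit or underscore".

```python
import re

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

# Well-formed upper-case Roman numeral that is not glued to a word character.
# The (?=[MDCLXVI]) lookahead prevents the pattern from matching an empty string.
ROMAN_RE = re.compile(
    r"(?<!\w)(?=[MDCLXVI])"
    r"M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"
    r"(?!\w)"
)


def roman_to_int(numeral: str) -> int:
    """Convert a well-formed Roman numeral to an integer."""
    values = [ROMAN_VALUES[ch] for ch in numeral]
    total = 0
    for i, value in enumerate(values):
        # A smaller value in front of a larger one is subtracted (IV = 4).
        if i + 1 < len(values) and values[i + 1] > value:
            total -= value
        else:
            total += value
    return total


def fix_numbers(text: str) -> str:
    """Replace stand-alone Roman numerals with European digits."""
    return ROMAN_RE.sub(lambda m: str(roman_to_int(m.group(0))), text)


print(fix_numbers("Campus IV"))   # Campus 4
print(fix_numbers("Campus IVb"))  # Campus IVb (not clearly separated)
```

Note that a pattern like this also converts ordinary words that happen to be valid numerals (for example "MIX"), so a real implementation may need stricter rules.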
Only letter characters, digits and _ can remain. The rest is removed.
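A minimal sketch of this filter, assuming that inner spaces are kept at this stage (the "Hello, world" example further down suggests that they are). The function name `strip_punctuation` is hypothetical.

```python
def strip_punctuation(text: str) -> str:
    """Keep letters, digits, '_' and (by assumption) spaces; drop the rest."""
    return "".join(
        ch for ch in text
        if ch.isalnum() or ch == "_" or ch == " "
    )


print(strip_punctuation("Hello, world!!?"))  # Hello world
print(strip_punctuation("user_name-42"))     # user_name42
```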
Replace look-alike characters and substrings¶
All I-like characters and fragments (1 ! i | l I î ï í ī į ì and their upper-case versions, ][ ) are replaced with 1
All U-like characters (u U v V w W û ü ù ú ū and their upper-case versions) are replaced with V
All O-like characters (o O º @ Q * ô ö ò ó œ ø ō õ/ and their upper-case versions) are replaced with 0
All 5-like characters (S, s, Z, z) are replaced with 5
All 8-like characters (B and all similar characters) are replaced with 8
(to be continued)
All accented characters, characters with umlauts and the like are replaced with their non-accented upper-case counterparts, e.g. ç -> C, ë -> E, if they were not already processed.
Leading and trailing spaces are removed
Change the case of all characters to upper case. Important! The case change must be done after all other substitutions.
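The substitution rules above could be implemented with a lookup table, roughly as sketched here. The names `LOOKALIKE_GROUPS`, `TABLE`, `strip_accents` and `replace_lookalikes` are hypothetical, the groups contain only the characters listed above (the list is marked as unfinished), and accent removal via Unicode decomposition is an assumption about how the "non-accented" form is obtained. Entries such as ! or ][ only matter if they have not already been removed as punctuation.

```python
import unicodedata

# Base character -> look-alike characters, taken from the lists above.
LOOKALIKE_GROUPS = {
    "1": "1!i|lIîïíīįì",
    "V": "uUvVwWûüùúū",
    "0": "oOº@Q*ôöòóœøōõ",
    "5": "SsZz",
    "8": "B",
}

# Flat table; the upper-case version of every listed character is added too.
TABLE = {}
for base, chars in LOOKALIKE_GROUPS.items():
    for ch in chars:
        TABLE[ch] = base
        TABLE[ch.upper()] = base


def strip_accents(ch: str) -> str:
    """Drop combining marks after canonical decomposition (ç -> c)."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def replace_lookalikes(text: str) -> str:
    text = text.replace("][", "1")  # the only multi-character fragment listed
    out = []
    for ch in text:
        # Characters not in the table only lose their accents.
        out.append(TABLE[ch] if ch in TABLE else strip_accents(ch))
    # The case change comes last, after all other substitutions.
    return "".join(out).strip().upper()


print(replace_lookalikes("Hello"))  # HE110
print(replace_lookalikes(" çë "))   # CE
```

Repeated characters such as the double 1 in HE110 are left in place here; they are collapsed in a later step.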
Ensure there are only allowed characters¶
Each character from any supported set must be explicitly added to the character-translation table, and all potentially confusing characters must be added to the substitution table above. Any non-listed character causes a failure.
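A sketch of this check, assuming a hypothetical allowed set of upper-case Latin letters, digits, underscore and space. A real table would list the characters of every supported national set explicitly. The name `ensure_allowed` and the use of `ValueError` to signal the failure are assumptions.

```python
import string

# Hypothetical allowed set for a Latin-only deployment.
ALLOWED = set(string.ascii_uppercase + string.digits + "_ ")


def ensure_allowed(text: str) -> str:
    """Return text unchanged, or fail on the first character not listed."""
    for position, ch in enumerate(text):
        if ch not in ALLOWED:
            raise ValueError(
                f"unsupported character {ch!r} at position {position}"
            )
    return text


ensure_allowed("HE110")        # passes
try:
    ensure_allowed("HE110 Ж")  # Ж is not in the table
except ValueError as error:
    print(error)
```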
Remove all repetitions¶
Any sequence of 2 or more identical characters is replaced with a single such character. E.g. Hello will become HE10, and Hello, world!!? -> HE10 V0R1D (w is U-like, so it becomes V).
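Collapsing repetitions is a small step; one possible sketch uses `itertools.groupby`, which groups consecutive equal items. The function name `collapse_repeats` is hypothetical.

```python
from itertools import groupby


def collapse_repeats(text: str) -> str:
    """Replace every run of identical characters with a single character."""
    return "".join(char for char, _run in groupby(text))


print(collapse_repeats("HE110"))       # HE10
print(collapse_repeats("B00KKEEPER"))  # B0KEPER
```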
Done¶
The result is the reduced string, or archetype string, and it should be unique in the relevant contexts.