Unicode script

Suppose I have a Unicode character, and I want to determine the script or
scripts it belongs to.

For example:

U+0033 DIGIT THREE "3" belongs to the script "COMMON";
U+0061 LATIN SMALL LETTER A "a" belongs to the script "LATIN";
U+03BE GREEK SMALL LETTER XI "ξ" belongs to the script "GREEK".


Is this information available from Python?


More about Unicode scripts:

http://www.unicode.org/reports/tr24/
http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt
http://www.unicode.org/Public/UCD/latest/ucd/ScriptExtensions.txt


-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

Steve
12/15/2016 4:53:58 PM
On Thu, Dec 15, 2016 at 4:53 PM, Steve D'Aprano
<steve+python@pearwood.info> wrote:
> Suppose I have a Unicode character, and I want to determine the script or
> scripts it belongs to.
>
> For example:
>
> U+0033 DIGIT THREE "3" belongs to the script "COMMON";
> U+0061 LATIN SMALL LETTER A "a" belongs to the script "LATIN";
> U+03BE GREEK SMALL LETTER XI "ξ" belongs to the script "GREEK".
>
> Is this information available from Python?

Tools/unicode/makeunicodedata.py doesn't include data from "Scripts.txt". If
adding an external dependency is ok, then you can use PyICU. For
example:

    >>> import icu
    >>> icu.Script.getScript('\u0033').getName()
    'Common'
    >>> icu.Script.getScript('\u0061').getName()
    'Latin'
    >>> icu.Script.getScript('\u03be').getName()
    'Greek'

There isn't documentation specific to Python, so you'll have to figure
things out experimentally with reference to the C API.

http://icu-project.org/apiref/icu4c
http://icu-project.org/apiref/icu4c/uscript_8h.html
eryk
12/15/2016 6:01:58 PM
I think this might be what you want:
https://docs.python.org/3/howto/unicode.html#unicode-properties
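
For reference, a minimal sketch of what the unicodedata module described in
that HOWTO reports for the three example characters. It uses only the
standard name() and category() functions; neither of them returns a script
such as "Latin" or "Greek", so this covers names and general categories only.

```python
import unicodedata

for ch in "3a\u03be":
    # name() and category() are standard unicodedata functions;
    # neither of them reports the Unicode script of the character.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
```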

-- 
Joel Goldstick
http://joelgoldstick.com/blog
http://cc-baseballstats.info/stats/birthdays
Joel
12/15/2016 6:03:39 PM
On 2016-12-15 16:53, Steve D'Aprano wrote:
> Suppose I have a Unicode character, and I want to determine the script or
> scripts it belongs to.
>
> For example:
>
> U+0033 DIGIT THREE "3" belongs to the script "COMMON";
> U+0061 LATIN SMALL LETTER A "a" belongs to the script "LATIN";
> U+03BE GREEK SMALL LETTER XI "ξ" belongs to the script "GREEK".
>
>
> Is this information available from Python?
>
>
> More about Unicode scripts:
>
> http://www.unicode.org/reports/tr24/
> http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt
> http://www.unicode.org/Public/UCD/latest/ucd/ScriptExtensions.txt
>
>
Interestingly, there's issue 6331 "Add unicode script info to the 
unicode database". Looks like it didn't make it into Python 3.6.

MRAB
12/15/2016 6:06:49 PM
On 12/15/2016 11:53 AM, Steve D'Aprano wrote:
> Suppose I have a Unicode character, and I want to determine the script or
> scripts it belongs to.
>
> For example:
>
> U+0033 DIGIT THREE "3" belongs to the script "COMMON";
> U+0061 LATIN SMALL LETTER A "a" belongs to the script "LATIN";
> U+03BE GREEK SMALL LETTER XI "ξ" belongs to the script "GREEK".
>
> Is this information available from Python?



Yes, though not as nicely as you probably want.  (Have you searched for
existing 3rd party modules?)  As near as I can tell, there is no direct
'script' property in the unicodedata database.

Option 1: unicodedata module, from char name

 >>> import unicodedata as ucd
 >>> ucd.name('\u03be')
'GREEK SMALL LETTER XI'
 >>> ucd.name('\u0061')
'LATIN SMALL LETTER A'

In most cases, the non-common char names start with a script name.
In some cases, the script name is 2 or 3 words.

 >>> ucd.name('\U00010A60')
'OLD SOUTH ARABIAN LETTER HE'

In a few cases, the script name is embedded in the name.
 >>> ucd.name('\U0001F200')
'SQUARE HIRAGANA HOKA'

Occasionally the script name is omitted.
 >>> ucd.name('\u3300')
'SQUARE APAATO'  # Katakana

Too bad the Unicode Consortium did not use a consistent naming scheme:
script [, subscript]: character

LATIN: SMALL LETTER A
ARABIAN, OLD SOUTH: LETTER HE
KATAKANA: SQUARE APAATO
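
A rough sketch of the name-based heuristic, to show both how far it gets and
where it breaks down. KNOWN_SCRIPTS and guess_script_from_name are
hypothetical names made up for this illustration, and the script list is
deliberately incomplete; a real list would have to come from the Unicode
data files, which defeats much of the purpose.

```python
import unicodedata

# Hypothetical, deliberately incomplete list of script names.
KNOWN_SCRIPTS = ("OLD SOUTH ARABIAN", "LATIN", "GREEK", "HIRAGANA", "KATAKANA")


def guess_script_from_name(ch):
    """Guess a script from the character name; return None if no guess."""
    name = unicodedata.name(ch, "")
    words = name.split()
    # Prefer a script name at the very start of the character name ...
    for script in KNOWN_SCRIPTS:
        script_words = script.split()
        if words[:len(script_words)] == script_words:
            return script
    # ... then fall back to a script name embedded later in the name.
    for script in KNOWN_SCRIPTS:
        if f" {script} " in f" {name} ":
            return script
    # Names such as 'SQUARE APAATO' or 'DIGIT THREE' give no hint at all.
    return None


for ch in "a\u03be\U00010A60\U0001F200\u3300":
    print(f"U+{ord(ch):04X}", guess_script_from_name(ch))
```

The last character, U+3300, comes back as None even though it is Katakana,
which is exactly the weakness described above.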

> More about Unicode scripts:
>
> http://www.unicode.org/reports/tr24/
> http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt

Option 2: Fetch the above Scripts.txt.

Suboption 1: Turn Scripts.txt into a list of lines.  The lines could be
condensed to codepoint or codepoint range, script.  Write a function
that takes a character or codepoint and linearly scans the list for a
matching line.  This makes each lookup O(number-of-lines).
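
A minimal sketch of this suboption, assuming the "range ; script # comment"
line layout of Scripts.txt.  SAMPLE is a small hand-written stand-in for the
downloaded file, and parse_scripts and script_of are names invented for the
example, not an existing API.

```python
# Hand-written sample in the layout of Scripts.txt; real code would read
# the downloaded file instead.
SAMPLE = """\
# comment lines and blank lines are skipped

0030..0039    ; Common # sample: ASCII digits
0061..007A    ; Latin  # sample: ASCII lowercase letters
00AA          ; Latin  # sample: a single code point
03A3..03E1    ; Greek  # sample: some Greek letters
"""


def parse_scripts(lines):
    """Condense the lines to a list of (first, last, script) tuples."""
    ranges = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        codepoints, script = (field.strip() for field in line.split(";"))
        first, _, last = codepoints.partition("..")
        ranges.append((int(first, 16), int(last or first, 16), script))
    return ranges


def script_of(ch, ranges):
    """Linear scan: each lookup is O(number of lines)."""
    cp = ord(ch)
    for first, last, script in ranges:
        if first <= cp <= last:
            return script
    return "Unknown"  # Scripts.txt gives unlisted code points this value


RANGES = parse_scripts(SAMPLE.splitlines())
print(script_of("3", RANGES), script_of("a", RANGES), script_of("\u03be", RANGES))
```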

Suboption 2: Turn Scripts.txt into a list of scripts, with codepoint
being the index.  This takes more preparation, but makes each lookup
O(1).  Once the preparation is done, the list could be turned into a
tuple and saved as a .py file, with the tuple being a compiled constant
in a .pyc file.

To avoid bloat, make sure that multiple entries for a script use the
same string object instead of multiple equal strings.  (CPython string
interning might do this automatically, but cross-implementation code
should not depend on this.)  The difference is

scripts = [..., 'Han', 'Han', 'Han', ...]  # multiple strings
versus
HAN = 'Han'
scripts = [..., HAN, HAN, HAN, ...]  # multiple references to one string

On a 64-bit system, the latter would use 8 bytes per entry, or 8 x the
number of defined codepoints (about 200,000) bytes in total.  Assuming
such a module does not already exist, it might be worth making one
available on PyPI.
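
A sketch of the indexed-table idea with shared string objects.  RANGES is a
hypothetical stand-in for (first, last, script) data already parsed from
Scripts.txt, and build_table is a name invented here.

```python
import sys

# Hypothetical (first, last, script) ranges, e.g. as parsed from Scripts.txt.
RANGES = [
    (0x0030, 0x0039, "Common"),
    (0x0061, 0x007A, "Latin"),
    (0x03A3, 0x03E1, "Greek"),
]


def build_table(ranges, default="Unknown"):
    """Return a tuple that maps every code point to a script name."""
    table = [default] * (sys.maxunicode + 1)
    shared = {}  # exactly one string object per distinct script name
    for first, last, script in ranges:
        script = shared.setdefault(script, script)
        table[first:last + 1] = [script] * (last - first + 1)
    return tuple(table)


SCRIPTS = build_table(RANGES)


def script_of(ch):
    return SCRIPTS[ord(ch)]  # O(1): a single index operation


print(script_of("3"), script_of("a"), script_of("\u03be"))
```

Note that this sketch allocates an entry for every code point up to
sys.maxunicode, not only for the defined ones, so the tuple is larger than
an estimate based on defined codepoints alone.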

> http://www.unicode.org/Public/UCD/latest/ucd/ScriptExtensions.txt

Essentially, ditto, except that I would use a dict rather than a
sequence as there are only about 400 codepoints involved.
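
A sketch of the dict variant, assuming the same "range ; value # comment"
layout, with the value being a space-separated list of short script codes.
The SAMPLE lines are illustrative only (the exact code lists vary between
Unicode versions), and parse_script_extensions is a name made up for this
example.

```python
# Illustrative lines in the layout of ScriptExtensions.txt; the exact
# script-code lists differ between Unicode versions.
SAMPLE = """\
1CD0..1CD2    ; Beng Deva Gran Knda            # sample range
3001..3003    ; Bopo Hang Hani Hira Kana Yiii  # sample range
"""


def parse_script_extensions(lines):
    """Return {codepoint: frozenset of script codes}."""
    extensions = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        codepoints, codes = line.split(";")
        first, _, last = codepoints.strip().partition("..")
        value = frozenset(codes.split())  # shared by the whole range
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            extensions[cp] = value
    return extensions


EXTENSIONS = parse_script_extensions(SAMPLE.splitlines())
print(sorted(EXTENSIONS.get(0x3002, ())))
```

Codepoints missing from the dict have no extension entry, so a caller would
fall back to the plain Script value for them.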

-- 
Terry Jan Reedy


Terry
12/15/2016 7:45:15 PM
On 12/15/2016 1:06 PM, MRAB wrote:
> Interestingly, there's issue 6331 "Add unicode script info to the
> unicode database". Looks like it didn't make it into Python 3.6.

https://bugs.python.org/issue6331
Opened in 2009 with patch and 2 revisions for 2.x.  At least the Python
code needs to be updated.

Approved in principle by Martin, then unicodedata curator, but no longer
active.  Neither, very much, are the other 2 listed in the Expert's index.
From what I could see, both the Python API (there is no doc patch yet)
and internal implementation need more work.  If I were to get involved,
I would look at the APIs of PyICU (see Eryk Sun's post) and the
unicodescript module on PyPI (mentioned by Pander Musubi, on the issue).

-- 
Terry Jan Reedy


Terry
12/15/2016 9:57:24 PM
On 2016-12-15 21:57, Terry Reedy wrote:
> https://bugs.python.org/issue6331
> Opened in 2009 with patch and 2 revisions for 2.x.  At least the Python
> code needs to be updated.
>
> Approved in principle by Martin, then unicodedata curator, but no longer
> active.  Neither, very much, are the other 2 listed in the Expert's index.
>
>  From what I could see, both the Python API (there is no doc patch yet)
> and internal implementation need more work.  If I were to get involved,
> I would look at the APIs of PyICU (see Eryk Sun's post) and the
> unicodescript module on PyPI (mention by Pander Musubi, on the issue).
>
For what it's worth, the post has prompted me to get back to a module I 
started which will report such Unicode properties, essentially the ones 
that the regex module supports. It just needs a few more tweaks and 
packaging up...

MRAB
12/16/2016 2:44:20 AM
On 2016-12-16 02:44, MRAB wrote:
> For what it's worth, the post has prompted me to get back to a module I
> started which will report such Unicode properties, essentially the ones
> that the regex module supports. It just needs a few more tweaks and
> packaging up...
>
Finally completed and uploaded!

It's called 'uniprop' and it's at:

https://pypi.python.org/pypi/uniprop/1.0

For Python 3.4-3.6.

MRAB
12/18/2016 2:34:49 AM