unicode display of common characters

In my emacs I display the text:

  A\u2019gan Town Coalmine, held a five-day strike starting on 13 November. 
  The teachers were demanding the payment of some four months\u2019 wage 
  arrears...

The escape shows up literally, exactly as above: a backslash, then u,
then the hex number 2019. That code point is a typographic single
quote (improperly used in this case).

I believe such a character can be displayed by my fonts (Firefox
understands it properly). My localization is en_US.UTF-8 UTF-8 and my
emacs is 24.4.1.  

-- 
 
       Haines Brown
       KB1GRM       
Reply Haines 4/17/2005 3:30:57 PM

Hi Haines,


Haines Brown writes:
> my emacs is 24.4.1.

21.4 is the latest released version of FSF Emacs, I think, and it
doesn't include the change mentioned below.  Unless you are talking
about XEmacs, whose current version I don't know.

> In my emacs I display the text:
>
>   A\u2019gan Town Coalmine, held a five-day strike starting on 13 November. 
>   The teachers were demanding the payment of some four months\u2019 wage 
>   arrears...
>
> This is actually the way it appears (a back slash, then u followed
> by the number 2019). It is a typographic single quote (improperly
> used in this case).

Those are called UCNs (universal character names); the notation is a
common convention in programming languages.

I can store a file with those UCNs replaced by the real characters
(using UTF-8 encoding) and read it again.  Emacs can handle this
fine.
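
As a rough illustration of what a UCN stands for, the sketch below
turns each \uXXXX escape in a string into the character it names.  It
assumes Emacs 23 or later, where a character is simply its Unicode
code point; the function name my-decode-ucns is made up for this
example and is not part of Emacs.

```elisp
;; Hypothetical helper, not part of Emacs.  Assumes Emacs 23+, where
;; (string CODE) yields the character with Unicode code point CODE.
(defun my-decode-ucns (text)
  "Return TEXT with every \\uXXXX escape replaced by the character it names."
  (let ((case-fold-search nil))         ; match a lowercase \u only
    (replace-regexp-in-string
     "\\\\u[0-9a-fA-F]\\{4\\}"
     (lambda (ucn)
       ;; UCN is the matched text, e.g. backslash-u-2019; drop the
       ;; two leading characters and read the rest as hexadecimal.
       (string (string-to-number (substring ucn 2) 16)))
     text t t)))
```

Calling (my-decode-ucns "four months\\u2019 wage") should give back
the same text with a real right single quotation mark in place of the
escape.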

I do get the same text as you, if I copy-and-paste from gedit (just to
pick a random Unicode-aware application that I have installed).

The reason is probably that gedit returns this format when Emacs asks
for the STRING type clipboard.  Gedit would give the UTF-8 of the
string if Emacs asked for UTF8_STRING instead.  The CVS version of
Emacs does that, so things should work there.  It's probably also
possible to make 21.4 behave correctly by patching just the function
where that change was made (x-selection-value in lisp/term/x-win.el),
but I haven't looked into that any further.
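
For reference, later Emacs releases expose this choice as the variable
x-select-request-type (to my knowledge it appears from Emacs 22 on and
is not available in 21.4).  A minimal init-file sketch, assuming an
X11 build that defines that variable:

```elisp
;; Ask the selection owner for UTF8_STRING first and fall back to the
;; older types.  Assumes an X11 build of Emacs that defines
;; `x-select-request-type'; Emacs 21.4 does not have it.
(setq x-select-request-type '(UTF8_STRING COMPOUND_TEXT TEXT STRING))
```

The order of the list is the order in which the types are requested,
so STRING is only used as a last resort.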


benny
Reply Benjamin 4/18/2005 4:12:47 PM


Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net> writes:

> Those are called UCNs, it's a common convention in programming
> languages.
> 
> It's also probably possible to make 21.4 behave correctly by just
> patching the function where that change was made
> (lisp/term/x-win.el, x-selection-value),  but I haven't looked into
> that much further.  

Benny, thanks for the reply. My first thought was to edit x-win.el
(I run Debian, so the file is
/usr/share/emacs/21.4/lisp/term/x-win.elc). The problem is that there
is no x-selection-value in my file (I assume its being byte-compiled
makes no difference).

If I understand you correctly, the current CVS version of emacs
changes the clipboard so that the escaped representation of unicode
characters will appear correctly.

I'm unclear what the clipboard has to do with my difficulty
displaying the unicode character for a typographic single quote. The
text I used as the example in my original message originated as an
e-mail message under emacs (r-mail) and was saved by emacs. It was not
pasted to or from any other application. 
  
-- 
 
       Haines Brown
       KB1GRM       
Reply Haines 4/18/2005 8:01:04 PM

Hi Haines,


Haines Brown writes:
> I'm unclear what the clipboard has to do with my difficulty
> displaying the unicode character for a typographic single quote.

Maybe nothing, and I didn't say that it did.  I said, *I* could
reproduce a similar problem when I used the clipboard.  You didn't say
how your problem was occurring, so how would I know if this applies to
your situation?

> The text I used as the example in my original message originated as
> an e-mail message under emacs (r-mail) and was saved by emacs. It
> was not pasted to or from any other application.

As I said, saving a file containing the text you gave (with the real
characters) with Emacs and reading it in again works fine for me.  I
don't use rmail, so if that is a factor I wouldn't know how and why.
UCNs are not used with the mail encoding protocols that I know.

You may want to give a step-by-step recipe to reproduce the problem
and then ask somebody who also uses rmail to look into it.


benny
Reply Benjamin 4/19/2005 10:26:57 PM

Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net> writes:

> Hi Haines,
> > The text I used as the example in my original message originated as
> > an e-mail message under emacs (r-mail) and was saved by emacs. It
> > was not pasted to or from any other application.
> 
> As I said, saving a file containing the text you gave (with the real
> characters) with Emacs and reading it in again works fine for me.  I
> don't use rmail, so if that is a factor I wouldn't know how and why.
> UCNs are not used with the mail encoding protocols that I know.
> 
> You want to give a recipe to reproduce the problem step by step and
> then ask somebody to look into it who also uses rmail.

Thanks, Benny, I will. I had a file that was saved from a message
being read in rmail. The saved file, when opened (with a simple file
reader), had the string: "... four months\u2019 wage arrears ..." The \
is not an escape; it, the u, and the four following digits are
separate iso-8859-1 characters.

I tried to imitate your little test. When I open this file in emacs,
the "\u2019" appears literally, not as the typographic single quote
that the Unicode value 2019 represents.

Here are some Chinese characters that I pasted from a web page, so
rmail is not involved in this case:
\u66f4\u591a\u65b0\u95fb. Emacs does not display them as Chinese
characters; I see the pasted text literally (backslash, u and then a
hex number).

-- 
 
       Haines Brown
       KB1GRM       
Reply Haines 4/21/2005 2:01:17 PM

Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net> writes:

> As I said, saving a file containing the text you gave (with the real
> characters) with Emacs and reading it in again works fine for me.  I
> don't use rmail, so if that is a factor I wouldn't know how and why.
> UCNs are not used with the mail encoding protocols that I know.
> 
> You want to give a recipe to reproduce the problem step by step and
> then ask somebody to look into it who also uses rmail.

Another recipe, and this time without rmail involvement:

In a web page, there is displayed an em-dash. It appears correctly
when viewed with Firefox. When I view the source of this html file it
shows that the web page used the illicit Windows character entity to
express an em-dash:

  ... under Marxist ideology. &#151; Translator]

When I copy and paste the string displayed in my browser into a file
viewed by emacs, the em-dash is represented as unicode:

  ... under Marxist ideology. \u2014 Translator]

This is causing me grief because I must remove unicode from HTML files
created under emacs in order for browsers to display the HTML file
properly, instead of displaying the \u2014 literally.

I suspect that emacs is acting properly here. Apparently, when
presented with a Windows em-dash, emacs must represent it as unicode,
although it would be nice to have it represented instead as the HTML
character entity (&#8212;) when emacs is in HTML mode. Is there any
way to get emacs HTML mode to do that?

If not, and this problem is limited to the Windows proprietary
character set, I suspect the only work-around would be to run a macro
on the file to convert Windows character entities to legitimate
entities, such as converting \u2014 to &#8212;. If I have to do this,
any general advice would be appreciated.
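
The conversion asked about here can be sketched in a few lines of
Lisp.  This is only an illustration under assumptions: the command
name my-ucns-to-html-entities is invented, and it handles nothing but
four-digit \uXXXX escapes, rewriting each as a decimal numeric entity.

```elisp
;; Hypothetical command for illustration; the name is not from Emacs.
(defun my-ucns-to-html-entities ()
  "Rewrite each \\uXXXX escape in the buffer as a decimal HTML entity."
  (interactive)
  (save-excursion
    (goto-char (point-min))
    (let ((case-fold-search nil))       ; match a lowercase \u only
      (while (re-search-forward "\\\\u\\([0-9a-fA-F]\\{4\\}\\)" nil t)
        ;; Example: hex 2014 is decimal 8212, so the escape for an
        ;; em dash becomes the entity &#8212;
        (replace-match
         (format "&#%d;" (string-to-number (match-string 1) 16))
         t t)))))
```

Run with M-x in the buffer holding the pasted text, it leaves
everything except the escapes untouched.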

-- 
       Haines Brown
       KB1GRM       
Reply Haines 4/22/2005 3:05:16 PM

Hi Haines,


Haines Brown writes:
> When I copy and paste the string displayed in my browser into a file
> viewed by emacs, the em-dash is represented as unicode:
>
>   ... under Marxist ideology. \u2014 Translator]

I thought the clipboard was not involved?  I thought I had already
established that the way Emacs uses the X11 clipboard has exactly this
behaviour in the released version of Emacs (and that the CVS version
works better here)?  What am I missing?

> This is causing me grief because I must remove unicode from HTML
> files created under emacs

Creating HTML in Emacs is a different use-case as I understand it,
with different operations.  What exactly are you doing that causes you
problems?

Maybe what you are after is this (?):

- You want to copy some HTML from a website that you have open in your
  browser.

- The HTML source contains &#151; (which is illegal as you noticed).

- You want to use a browser and copy-and-paste to translate this to
  legal Unicode. 

- Because Emacs doesn't use Unicode on the clipboard, you get UCNs
  instead.

If that is your situation, I'd say your copy-and-paste technique
doesn't work and you need something else.

You can just copy the source code into Emacs (i.e. &#151;) or save the
file directly from your browser.  Then use a simple Lisp procedure or
an external filter program that converts the broken entities.  E.g. tidy
<http://tidy.sourceforge.net> can fix this and other HTML breakage.

Of course whether this suggestion would help depends on the nature of
your actual use-case.


benny
Reply Benjamin 4/24/2005 11:16:31 AM

Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net> writes:

> Hi Haines,
> 
> Haines Brown writes:
> > When I copy and paste the string displayed in my browser into a file
> > viewed by emacs, the em-dash is represented as unicode:
> >
> >   ... under Marxist ideology. \u2014 Translator]
> 
> I thought clipboard was not involved?  

I apologize for my difficulties grasping the point. 

> Maybe what you are after is this (?):
> 
> - You want to copy some HTML from a website that you have open in your
>   browser.
> 
> - The HTML source contains &#151; (which is illegal as you noticed).
> 
> - You want to use a browser and copy-and-paste to translate this to
>   legal Unicode. 
> 
> - Because Emacs doesn't use Unicode on the clipboard, you get UCNs
>   instead.

This is where I went astray. For various reasons, I can't copy
material from an HTML source, but only as it is displayed in a browser,
using the browser's copy utility. But sometimes I'm copying from a
plain text file containing hex quoted printable, etc. So I don't use
the emacs clipboard, and that is why I dismissed the emacs clipboard
issue as irrelevant. I'm pasting into emacs material copied elsewhere.

> If that is your situation, I'd say your copy-and-paste technique
> doesn't work, you need something else.

Are you implying there are more intelligent clipboards I could use
under linux to copy what is displayed on browser?
 
> You can just copy the source code into Emacs (i.e. &#151;) or save
> the file directly from your browser.  Then have a simple Lisp
> procedure or external filter program that converts the broken
> entities.  E.g. tidy <http://tidy.sourceforge.net> can fix this and
> other HTML breakage.

I don't care for tidy except perhaps as a syntax error checker. In
this I'm swimming against the tide, I know, but I use a bunch of
markup procedures in emacs that conflict with a tidied source. But
tidy could be used if I configure it to do just the conversion of hex
quoted printable (=97), unicode (\u2014), octal (\226) or Windows
character entities (&#151;) to legitimate HTML character entities. I've
not studied tidy enough to know whether it could be configured to do
just that. I would rather do it with emacs if possible.
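
A rough sketch of that conversion in Emacs Lisp follows.  It makes
several assumptions: a recent Emacs (23 or later, which ships the
windows-1252 coding system and unibyte-string), invented function
names, and a deliberate restriction to the Windows-specific range
0x80-0x9F.  Note that octal 226 is byte 0x96, the en dash, while 0x97
(decimal 151, octal 227) is the em dash.

```elisp
;; Hypothetical helpers, not part of Emacs.  Assumes Emacs 23+.
(defun my-cp1252-byte-to-entity (byte)
  "Return a decimal HTML entity for BYTE read as windows-1252."
  (save-match-data
    (format "&#%d;"
            (string-to-char
             (decode-coding-string (unibyte-string byte)
                                   'windows-1252)))))

(defun my-cp1252-notations-to-entities (text)
  "Rewrite =XX, \\OOO and &#NNN; forms for bytes 0x80-0x9F in TEXT."
  (let ((case-fold-search nil))
    ;; Quoted-printable, =80 through =9F.  Caution: this would also
    ;; hit unrelated text such as an attribute written width=97.
    (setq text (replace-regexp-in-string
                "=[89][0-9A-F]"
                (lambda (m) (my-cp1252-byte-to-entity
                             (string-to-number (substring m 1) 16)))
                text t t))
    ;; Octal escapes, \200 through \237.
    (setq text (replace-regexp-in-string
                "\\\\2[0-3][0-7]"
                (lambda (m) (my-cp1252-byte-to-entity
                             (string-to-number (substring m 1) 8)))
                text t t))
    ;; Numeric entities &#128; through &#159;.
    (replace-regexp-in-string
     "&#1\\(?:2[89]\\|[3-5][0-9]\\);"
     (lambda (m) (my-cp1252-byte-to-entity
                  (string-to-number (substring m 2 -1))))
     text t t)))
```

The three passes are kept separate so that each notation can be
dropped or tightened independently.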

-- 
 
       Haines Brown
       KB1GRM       
Reply Haines 4/24/2005 1:29:18 PM

Hi Haines,


Haines Brown writes:
> So I don't use the emacs clipboard, and that is why I dismissed the
> emacs clipboard issue as irrelevant. I'm pasting into emacs material
> copied elsewhere.

If you don't use the Emacs clipboard handler, you should not be
getting UCNs.

Unless you use cut-and-paste in other programs, those programs have
similar issues, and you then get the UCNs from those programs.  But
that would be off-topic for this group here.

Benjamin Riefenstahl wrote:
>> If that is your situation, I'd say your copy-and-paste technique
>> doesn't work, you need something else.
>
> Are you implying there are more intelligent clipboards I could use
> under linux to copy what is displayed on browser?

No, I was implying that some other workflow was needed on your side
for handling the situation, like what I proposed next.

>> You can just copy the source code into Emacs (i.e. &#151;) or save
>> the file directly from your browser.  Then have a simple Lisp
>> procedure or external filter program that converts the broken
>> entities. [...]

> [...] I would rather do it with emacs if possible.

As I said, you can write a Lisp procedure to do it, but you will need
some data to convert the Windows encoding to something that Emacs
understands, like Unicode.  Usually that would be a coding system
called windows-1252, but that coding system is only available in CVS
Emacs, it's not in any released version yet.

If you have an Emacs built from CVS, you can use code like this (I had
something similar lying around for other purposes):

  (defun benny-replace-ucns ()
    "Replace each \\uXXXX escape in the buffer with the character it names."
    (interactive)
    (save-excursion
      (goto-char (point-min))
      (while (re-search-forward
              "\\\\u\\([0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]\\)" nil t)
        (replace-match (benny-codepoint-to-char
                        (string-to-number (match-string 1) 16))
                       t t))))

  (defun benny-replace-numeric-entities ()
    "Replace &#xHHHH; and &#NNNN; entities in the buffer with characters."
    (interactive)
    (save-excursion
      ;; Hexadecimal entities first, then decimal ones.
      (goto-char (point-min))
      (while (re-search-forward "&#x\\([0-9a-fA-F]+\\);" nil t)
        (replace-match (benny-codepoint-to-char
                        (string-to-number (match-string 1) 16))
                       t t))
      (goto-char (point-min))
      (while (re-search-forward "&#\\([0-9]+\\);" nil t)
        (replace-match (benny-codepoint-to-char
                        (string-to-number (match-string 1)))
                       t t))))

  (defun benny-codepoint-to-char (codepoint)
    "Return CODEPOINT as a string; values below #x100 are windows-1252 bytes."
    (if (< codepoint #x100)
        (decode-coding-string
         (format "%c" codepoint) 'windows-1252)
      ;; Otherwise build the two UTF-16 bytes, low byte first, and decode.
      (decode-coding-string
       (format "%c%c" (% codepoint #x100) (/ codepoint #x100)) 'utf-16-le)))

If you are limited to the released Emacs, you can change
benny-codepoint-to-char to use a lookup table that you can create from
<ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT>.
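
A minimal sketch of such a lookup table, covering the rows of
CP1252.TXT that differ from Latin-1 (0x80-0x9F).  The names
my-cp1252-table and my-cp1252-to-unicode are invented for this
example.  Turning the resulting code point into a character is left
out, since that step depends on the Emacs version.

```elisp
;; The 0x80-0x9F rows of CP1252.TXT as (BYTE . UNICODE-CODE-POINT)
;; pairs.  The five bytes the mapping leaves undefined (81, 8D, 8F,
;; 90, 9D) are omitted.
(defconst my-cp1252-table
  '((#x80 . #x20AC) (#x82 . #x201A) (#x83 . #x0192) (#x84 . #x201E)
    (#x85 . #x2026) (#x86 . #x2020) (#x87 . #x2021) (#x88 . #x02C6)
    (#x89 . #x2030) (#x8A . #x0160) (#x8B . #x2039) (#x8C . #x0152)
    (#x8E . #x017D) (#x91 . #x2018) (#x92 . #x2019) (#x93 . #x201C)
    (#x94 . #x201D) (#x95 . #x2022) (#x96 . #x2013) (#x97 . #x2014)
    (#x98 . #x02DC) (#x99 . #x2122) (#x9A . #x0161) (#x9B . #x203A)
    (#x9C . #x0153) (#x9E . #x017E) (#x9F . #x0178))
  "Windows-1252 bytes whose Unicode code point differs from the byte.")

(defun my-cp1252-to-unicode (byte)
  "Return the Unicode code point for the windows-1252 value BYTE.
Values not in the table are returned unchanged, which is correct for
the ranges where windows-1252 and Latin-1 agree."
  (or (cdr (assq byte my-cp1252-table)) byte))
```

With this, the value 151 from &#151; maps to #x2014, the em dash, and
146 maps to #x2019, the typographic single quote.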

Bind benny-replace-ucns or benny-replace-numeric-entities to a
global key and use that to clean up your texts.
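
For instance, in an init file (the key choices are arbitrary and only
an assumption here; C-c followed by a letter is reserved for users):

```elisp
;; Example bindings.  The keys are arbitrary choices and the commands
;; must be defined before the keys are pressed.
(global-set-key (kbd "C-c u") 'benny-replace-ucns)
(global-set-key (kbd "C-c e") 'benny-replace-numeric-entities)
```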


benny
Reply Benjamin 4/25/2005 1:44:47 PM

Benjamin Riefenstahl <b.riefenstahl@turtle-trading.net> writes:

> If you don't use the Emacs clipboard handler, you should not be
> getting UCNs.

Well, this happens without the emacs clipboard, which is strange. Here
is an example.

I just created a web page containing the em-dash character entity
&#8212;. Whether I display that page in galeon or firefox, a copy
(mouse select; C-c) from it and a paste (C-y) into emacs produces a
UCN (\u2014). If I do the copy-paste into a different application
(such as scribus, fileRunner), it pastes as an em-dash as it
should. So apparently my system clipboard is functioning properly, but
emacs won't accept its content without converting it to UCN notation.

I also don't understand the use of UCN notation. Isn't it used for C
coding?

> >> You can just copy the source code into Emacs (i.e. &#151;) or
> >> save the file directly from your browser.  Then have a simple
> >> Lisp procedure or external filter program that converts the
> >> broken entities. [...]
> 
> > [...] I would rather do it with emacs if possible.
> 
> As I said, you can write a Lisp procedure to do it, but you will
> need some data to convert the Windows encoding to something that
> Emacs understands, like Unicode.

Thank you for the LISP example for transforming how characters are
represented. My sense is that I can expand on it to include also hex
quoted printable (=97) and octal (\226) conversions to legitimate
HTML character entities. When I can find the time, I'll look at LISP
programming to see if I can figure out how to do that.

> If you have an Emacs built from CVS, you can use code like this (I
> had something similar lying around for other purposes):
> 
>   (defun benny-replace-ucns ()
>     (interactive)
>     (save-excursion
>       (goto-char (point-min))
>       (while (re-search-forward
>               "\\\\u\\([0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]\\)" nil t)
> 	(replace-match (benny-codepoint-to-char
> 			(string-to-number (match-string 1) 16))
> 		       t t))

> If you are limited to the released Emacs, you can change
> benny-codepoint-to-char to use a lookup table that you can create from
> <ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT>.
> 
> Bind 'benny-replace-ucns or 'benny-replace-numeric-entities to a
> global key and use that to cleanup your texts.

If I understand the general idea here, because you are using CVS
emacs, you are able to replace UCN character notation by using a loop
that references "benny-codepoint-to-char" (a variable?  A map?) to
replace the UCN with the letter equivalent.

But, for example, what would be the result of \u2014, since emacs does
not display an em-dash? 

I guess I can eventually figure out how to get
"benny-codepoint-to-char" (whatever that is) to use a lookup table. I
downloaded the map you pointed to, and it appears to support UCN to
unicode representations of Latin-1 characters. With a bit of luck
perhaps I could build similar tables for hex quoted printable (=97)
and octal (\226) notations.

Thanks, again.

-- 
 
       Haines Brown
       KB1GRM       
Reply Haines 4/25/2005 4:24:06 PM
