


character-set encoding

HTML 5 requires documents which claim to be ISO-8859-1 to be parsed as
Windows-1252, apparently because some websites claiming to be ISO-8859-1
are actually Windows-1252.  (Why in the world?  More web sites advertise
themselves as ISO-8859-1 than as Windows-1252 and even if that were not
the case, standards shouldn't standardize wrong behaviour.)

What does the standard say about a website advertising itself as
ISO-8859-15?  Should it be parsed as ISO-8859-15?

helbig
5/16/2016 7:10:07 PM
comp.authoring.html

On Mon, 16 May 2016 19:10:07 +0000 (UTC),
helbig@asclothestro.multivax.de (Phillip Helbig) wrote:
> HTML 5 requires documents which claim to be ISO-8859-1 to be parsed as
> Windows-1252, apparently because some websites claiming to be ISO-8859-1
> are actually Windows-1252.

In the HTML5 W3C Recommendation of 28 October 2014, I don't find anything that
mentions it. The only thing I could find is that Windows-1252 is the
suggested encoding only if the document is not in one of the listed locale
languages (section 8.2.2.2, Determining the character encoding).

> (Why in the world?  More web sites advertise
> themselves as ISO-8859-1 than as Windows-1252 and even if that were not
> the case, standards shouldn't standardize wrong behaviour.)

I agree. It makes the standard itself substandard.

> What does the standard say about a website advertising itself as
> ISO-8859-15?  Should it be parsed as ISO-8859-15?

ISO-8859-15 itself is more like a small universal character set that covers
several languages. It's not meant for a specific language or specific
language group, which most character sets are for - including ISO-8859's own
character sets.

So, if ISO-8859-15 is not available locally, Windows-1252 would be
preferred, considering that HTML5 suggests character sets based on a list
of languages (and not on a list of character sets), and that ISO-8859-15
covers multiple languages [1]. Moreover, Windows-1252 has the smallest
character table difference from ISO-8859-15 [2].

[1] https://en.wikipedia.org/wiki/ISO-8859-15#Coverage
[2] https://en.wikipedia.org/wiki/ISO-8859-15#Differences_from_ISO-8859-1
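For reference, the actual difference between the two encodings is easy to enumerate with Python 3's bundled codecs (a quick sketch, assuming only the standard library):

```python
# Sketch: list the code points where ISO-8859-15 differs from ISO-8859-1,
# using Python 3's built-in codecs.
diffs = {
    b: (bytes([b]).decode("iso-8859-1"), bytes([b]).decode("iso-8859-15"))
    for b in range(256)
    if bytes([b]).decode("iso-8859-1") != bytes([b]).decode("iso-8859-15")
}
for point, (latin1, latin9) in sorted(diffs.items()):
    print(f"0x{point:02X}: {latin1!r} -> {latin9!r}")
# Only eight positions differ, all in the 0xA0-0xBF block;
# 0xA4 changes from the currency sign to the euro sign.
```

The eight changed positions are exactly the ones tabulated in the Wikipedia pages cited above.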
JJ
5/17/2016 2:20:41 AM
On 2016-05-16 19:10:07 +0000, Phillip Helbig (undress to reply) said:

> HTML 5 requires documents which claim to be ISO-8859-1 to be parsed as
> Windows-1252, apparently because some websites claiming to be ISO-8859-1
> are actually Windows-1252.  (Why in the world?  More web sites advertise
> themselves as ISO-8859-1 than as Windows-1252 and even if that were not
> the case, standards shouldn't standardize wrong behaviour.)

The issue is not whether they declare themselves to be ISO-8859-1, but 
whether they do so falsely. But I thought I remembered the standard 
explaining why they chose to take this odious step as the lesser of two 
evils.

-- 
John W Kennedy
"Information is light. Information, in itself, about anything, is light."
  -- Tom Stoppard. "Night and Day"

John
5/17/2016 2:57:02 AM
17.5.2016, 5:20, JJ wrote:

> On Mon, 16 May 2016 19:10:07 +0000 (UTC),
> helbig@asclothestro.multivax.de (Phillip Helbig) wrote:
>> HTML 5 requires documents which claim to be ISO-8859-1 to be parsed as
>> Windows-1252, apparently because some websites claiming to be ISO-8859-1
>> are actually Windows-1252.
>
> In the HTML5 W3C Recommendation of 28 October 2014, I don't find anything
> that mentions it.

This part of “HTML5” (a very vague concept) was moved out of the W3C 
specification for HTML5, into the “Encoding” document, which is cited in 
that spec and has the normative status of Candidate Recommendation. But 
normative statuses have lost importance in the “HTML5” world.

The “Encoding” document lists several encoding names that shall be 
interpreted as windows-1252, including iso-8859-1:

https://www.w3.org/TR/encoding/#legacy-single-byte-encodings
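The practical effect of that alias can be sketched with Python 3, whose codecs (unlike browsers following the Encoding spec) keep the two encodings distinct:

```python
# Byte 0x93 is a C1 control in real ISO-8859-1, but a curly quote in
# windows-1252. Browsers following the Encoding spec always pick the latter.
raw = b"\x93quoted\x94"
print(repr(raw.decode("iso-8859-1")))  # '\x93quoted\x94' (C1 controls)
print(raw.decode("windows-1252"))      # “quoted”
```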

>> (Why in the world?  More web sites advertise
>> themselves as ISO-8859-1 than as Windows-1252 and even if that were not
>> the case, standards shouldn't standardize wrong behaviour.)
>
> I agree. It makes the standard itself substandard.

The reason is simple enough: There is no use for C1 Controls (code points 
from 80 to 9F in hexadecimal) in HTML documents, and it is virtually 
certain that if an HTML document declared to be ISO-8859-1 encoded 
contains such code points, then the document is in fact windows-1252 
encoded and the code points should be interpreted as graphic characters 
according to windows-1252. Well, in rare cases it might be a matter of 
data error, but even then, windows-1252 interpretation makes more sense 
than pretending that we are interpreting the data as iso-8859-1.
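The heuristic described here can be sketched in a few lines of Python 3 (`sniff_latin1` is a hypothetical helper for illustration, not anything from a spec):

```python
# If a document labelled ISO-8859-1 contains bytes in the C1 range
# 0x80-0x9F, it is almost surely windows-1252 in practice.
def sniff_latin1(raw: bytes) -> str:
    if any(0x80 <= b <= 0x9F for b in raw):
        return "windows-1252"
    # Without C1 bytes, the two decodings are identical anyway.
    return "iso-8859-1"

print(sniff_latin1("für".encode("iso-8859-1")))     # iso-8859-1
print(sniff_latin1("it’s".encode("windows-1252")))  # windows-1252 (byte 0x92)
```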

>> What does the standard say about a website advertising itself as
>> ISO-8859-15?  Should it be parsed as ISO-8859-15?
>
> ISO-8859-15 itself is more like a small universal character set that covers
> several languages. It's not meant for a specific language or specific
> language group which most character sets are for - including ISO-8859 own
> character sets.

I think you have misunderstood what ISO-8859-15 was meant for. It was 
introduced in order to include the euro sign, “€”; the extra letters 
that were added due to the needs of Finnish and French were much less 
important. ISO-8859-15 soon became obsolete, or was born obsolete, since 
anyone who needs “€” in text can use windows-1252 or utf-8.
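For comparison, the euro sign's byte value under each of the encodings mentioned (Python 3 codecs assumed):

```python
# One character, three byte representations.
print("€".encode("windows-1252").hex())  # 80 (in the C1 range)
print("€".encode("iso-8859-15").hex())   # a4 (replaces the currency sign)
print("€".encode("utf-8").hex())         # e282ac (three bytes)
```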

The “Encoding” document lists iso-8859-15 as a separate encoding, with a 
few names in addition to the reference name. It is included as one of 
the set of encodings for which support is required:
https://www.w3.org/TR/encoding/#names-and-labels

So the answer is that a document advertised as ISO-8859-15 shall be 
parsed as ISO-8859-15.

> Considering that HTML5 mentions only the suggested character sets
> based on a list of languages

What you are referring to is a description of how the character encoding 
is determined when it is not declared properly. That is, “guessing” or 
“sniffing” the encoding, based on the language of the user’s environment! 
The “sniffing” algorithm never results in using iso-8859-15, simply 
because it is not commonly used anywhere.

-- 
Yucca, http://www.cs.tut.fi/~jkorpela/
Jukka
5/17/2016 12:41:05 PM
In article <nhf3a8$8sp$1@dont-email.me>, "Jukka K. Korpela"
<jkorpela@cs.tut.fi> writes: 

> The reason is simple enough: There is no use for C1 Controls (code points 
> from 80 to 9F in hexadecimal) in HTML documents, 

True.

> and it is virtually 
> certain that if an HTML document declared to be ISO-8859-1 encoded 
> contains such code points, then the document is in fact windows-1252 
> encoded and the code points should be interpreted as graphic characters 
> according to windows-1252. 

Virtually certain?  I don't think so.  Suppose I write a page in German 
and use ä instead of &auml; and so on.  It would make sense to call it 
ISO-8859-1, right?  I don't even USE Windows.  Why should a Windows 
character set even be relevant.

The point is not the C1 controls; the point is the 8-bit printable 
characters.  These are NOT all at the same positions as in windows-1252.

> >> ISO-8859-15?  Should it be parsed as ISO-8859-15?
> >
> > ISO-8859-15 itself is more like a small universal character set that covers
> > several languages. It's not meant for a specific language or specific
> > language group which most character sets are for - including ISO-8859 own
> > character sets.
> 
> I think you have misunderstood what ISO-8859-15 was meant for. It was 
> introduced in order to include the euro sign, `€'; the extra letters 
> that were added due to the needs of Finnish and French were much less 
> important. 

So if I want to use the Euro sign as an 8-bit character, why not use 
ISO-8859-15?

> ISO-8859-15 soon became obsolete, or was born obsolete, since 
> anyone who needs `€' in text can use windows-1252 or utf-8.

But suppose I don't use Windows.  I would have to jump through hoops to 
create the proper 8-bit representations, which would look different on 
my screen (not in HTML, but when editing the source).  If the code 
points in utf-8 are not identical to ISO-8859-15, same problem.  If so, 
why not call it ISO-8859-15?

> The `Encoding' document lists iso-8859-15 as a separate encoding, with a 
> few names in addition to the reference name. It is included as one of 
> the set of encodings for which support is required:
> https://www.w3.org/TR/encoding/#names-and-labels

Reason enough to use it!

> So the answer is that a document advertised as ISO-8859-15 shall be 
> parsed as ISO-8859-15.

Great!

helbig
5/18/2016 7:40:48 PM
Am 18.05.2016 um 21:40 schrieb Phillip Helbig (undress to reply):

> In article <nhf3a8$8sp$1@dont-email.me>, "Jukka K. Korpela"
> <jkorpela@cs.tut.fi> writes:

>> and it is virtually
>> certain that if an HTML document declared to be ISO-8859-1 encoded
>> contains such code points, then the document is in fact windows-1252
>> encoded and the code points should be interpreted as graphic characters
>> according to windows-1252.

> Virtually certain?  I don't think so.  Suppose I write a page in German
> and use ä instead of &auml; and so on.  It would make sense to call it
> ISO-8859-1, right?  I don't even USE Windows.  Why should a Windows
> character set even be relevant.

I did exactly this for many years and it worked and would still work now 
had I not decided to switch to UTF-8.

But Jukka said "if ... contains such code points" (to wit, between U+0080 
and U+009F). Now, it is virtually certain that this will not be the case 
in your example. So you would not notice if a browser applied 
Windows-1252 instead of ISO-8859-1.

> But suppose I don't use Windows.  I would have to jump through hoops to
> create the proper 8-bit representations, which would look different on
> my screen (not in HTML, but when editing the source).  If the code
> points in utf-8 are not identical to ISO-8859-15, same problem.  If so,
> why not call it ISO-8859-15?

If (1) for you producing UTF-8 is more of a problem than ISO-8859-15 and 
(2) you never need characters outside ISO-8859-15, e.g. quotation marks 
or dashes, then this is a good choice. And it will work if you declare 
what you have done. For me, neither (1) nor (2) are fulfilled.

-- 
Helmut Richter

Helmut
5/18/2016 7:58:31 PM
In article <nhihgt$bp5$1@news.in.tum.de>, Helmut Richter <hhr-m@web.de>
writes: 

> Am 18.05.2016 um 21:40 schrieb Phillip Helbig (undress to reply):
> 
> > In article <nhf3a8$8sp$1@dont-email.me>, "Jukka K. Korpela"
> > <jkorpela@cs.tut.fi> writes:
> 
> >> and it is virtually
> >> certain that if an HTML document declared to be ISO-8859-1 encoded
> >> contains such code points, then the document is in fact windows-1252
> >> encoded and the code points should be interpreted as graphic characters
> >> according to windows-1252.
> 
> > Virtually certain?  I don't think so.  Suppose I write a page in German
> > and use ä instead of &auml; and so on.  It would make sense to call it
> > ISO-8859-1, right?  I don't even USE Windows.  Why should a Windows
> > character set even be relevant.
> 
> I did exactly this for many years and it worked and would still work now 
> had I not decided to switch to UTF-8.
> 
> But Jukka said "if ... contains such code points" (to wit between U+0080 
> and U+009F). Now, it is virtually certain that this will not be the case 
> in your example. So you would not notice if a browser applied 
> Windows-1252 instead of ISO-8859-1.

That is true.

> > But suppose I don't use Windows.  I would have to jump through hoops to
> > create the proper 8-bit representations, which would look different on
> > my screen (not in HTML, but when editing the source).  If the code
> > points in utf-8 are not identical to ISO-8859-15, same problem.  If so,
> > why not call it ISO-8859-15?
> 
> If (1) for you producing UTF-8 is more of a problem than ISO-8859-15 

Yes.

> and 
> (2) you never need characters outside ISO-8859-15, e.g. quotation marks 
> or dashes, 

Yes.

> then this is a good choice. And it will work if you declare 
> what you have done. 

OK.

So, similarly, in contrast to the previous post I replied to, the EURO
sign should show up here: €, at least in a mime-aware newsreader, since
the newsgroup headers should contain the relevant declarations. 

To summarize: Windows-1252 puts some extra printable characters where the
8-bit non-printable characters are in ISO-8859-1 (and ISO-8859-15).  I could
of course use those bit representations, but they would not show up as
intended on a non-Windows machine (e.g., in a text editor), i.e. not as
what the user should see on a web page.  ISO-8859-15 is essentially the
same as ISO-8859-1 except that it replaces a few rarely used characters
with the EURO sign and some additional letters that allow more languages
to be completely covered. Thus, that makes sense to me.  If I actually
needed the symbols in ISO-8859-1 but not in ISO-8859-15, I could replace
my default encoding of ISO-8859-15 with ISO-8859-1 for that page, or use
the "&" mnemonics or the "#" values (the latter being the only option if I
need both symbols in ISO-8859-1 but not in ISO-8859-15 AND in ISO-8859-5
but not in ISO-8859-1).
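Those two notations ("&" mnemonics and "#" numeric references) can be sanity-checked with Python's stdlib html module (a sketch; in actual HTML source one simply writes the entity):

```python
import html

# Both the named entity and the numeric character reference
# resolve to U+20AC EURO SIGN.
assert html.unescape("&euro;") == "\u20ac"
assert html.unescape("&#8364;") == "\u20ac"  # 8364 decimal = 0x20AC
```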

helbig
5/19/2016 8:20:48 AM
Phillip Helbig (undress to reply) <helbig@asclothestro.multivax.de> wrote:
>"Jukka K. Korpela" <jkorpela@cs.tut.fi> writes: 

>>The reason is simple enough: There is no use for C1 Controls (code points 
>>from 80 to 9F in hexadecimal) in HTML documents, 

>True.

What, no one wants to offer backwards compatibility for teletypewriters
any longer?

>> and it is virtually 
>> certain that if an HTML document declared to be ISO-8859-1 encoded 
>> contains such code points, then the document is in fact windows-1252 
>> encoded and the code points should be interpreted as graphic characters 
>> according to windows-1252. 

>Virtually certain?  I don't think so.  Suppose I write a page in German 
>and use ä instead of &auml; and so on.  It would make sense to call it 
>ISO-8859-1, right?  I don't even USE Windows.  Why should a Windows 
>character set even be relevant.

>The point is not the C1 controls; the point is the 8-bit printable 
>characters.  These are NOT all at the same positions as in windows-1252.

Exactly right. Microsoft did that deliberately to annoy those who
used non-Microsoft word processors like WordPerfect.

If the Web page declares the wrong character set, get the author to fix it.
Adam
5/27/2016 5:53:01 AM
27.5.2016, 8:53, Adam H. Kerman wrote:

> Phillip Helbig (undress to reply) <helbig@asclothestro.multivax.de> wrote:
>> "Jukka K. Korpela" <jkorpela@cs.tut.fi> writes:
[...]
>>> and it is virtually
>>> certain that if an HTML document declared to be ISO-8859-1 encoded
>>> contains such code points, then the document is in fact windows-1252
>>> encoded and the code points should be interpreted as graphic characters
>>> according to windows-1252.
>
>> Virtually certain?  I don't think so.

Any counterexample?

>> Suppose I write a page in German
>> and use ä instead of &auml; and so on.  It would make sense to call it
>> ISO-8859-1, right?  I don't even USE Windows.  Why should a Windows
>> character set even be relevant.

How is that relevant to C1 controls? This was about them, and only in 
the code range reserved for them in ISO-8859-1 do ISO-8859-1 (ISO Latin 
1) and windows-1252 (Windows Latin 1) differ. In particular, the letter 
“ä” has the same position in both.

You may call it ISO-8859-1, but for web browsers, it is an alias for 
windows-1252.

>> The point is not the C1 controls; the point is the 8-bit printable
>> characters.  These are NOT all at the same positions as in windows-1252.
>
> Exactly right.

Exactly what are you talking about? The printable characters that 
windows-1252 allocates to locations 80...1F (hex) are not present in 
ISO-8859-1 at all.

> Microsoft did that deliberately to annoy those who
> used non-Microsoft word processors like WordPerfect.

They used some of 80...9F for graphic characters that are actually needed.

> If the Web page declares the wrong character set, get the author to fix it.

Should a web page user expect that to work?

Treating ISO-8859-1 as windows-1252 often helps to make browsers render 
the characters the author meant, instead of ignoring them or signalling 
errors to the user (who mostly can do absolutely nothing to “get the 
author to fix it”). It causes absolutely no harm (or any effect) when 
the data is ISO-8859-1 and does not try to use C1 Controls.

The “guess” is wrong only in the rare cases where the character data has 
been corrupted somehow or the encoding is actually neither ISO-8859-1 
nor windows-1252. Even then, it hardly makes things worse.

-- 
Yucca, http://www.cs.tut.fi/~jkorpela/
Jukka
5/27/2016 12:57:34 PM
Jukka K. Korpela wrote:

> 27.5.2016, 8:53, Adam H. Kerman wrote:
>> Phillip Helbig (undress to reply) <helbig@asclothestro.multivax.de>
>> wrote:
>>> "Jukka K. Korpela" <jkorpela@cs.tut.fi> writes:
> 
> You may call it ISO-8859-1, but for web browsers, it is an alias for
> windows-1252.
> 
>>> The point is not the C1 controls; the point is the 8-bit printable
>>> characters.  These are NOT all at the same positions as in windows-1252.
>> Exactly right.
> 
> Exactly what are you talking about? The printable character that
> windows-1252 allocates to locations 80...1F (hex) are not present in
> ISO-8859-1 at all.

You mean 80..._9F_ (hex); and that the characters at those code points in 
Windows-1252 are not in ISO-8859-1, at the same code points or at all, was 
precisely the point being argued.  That is what puts into doubt the decision 
to treat ISO-8859-1 as an alias for Windows-1252.

See also: <https://de.wikipedia.org/wiki/ISO_8859-1>


PointedEars
-- 
Prototype.js was written by people who don't know javascript for people
who don't know javascript. People who don't know javascript are not
the best source of advice on designing systems that use javascript.
  -- Richard Cornford, cljs, <f806at$ail$1$8300dec7@news.demon.co.uk>
Thomas
5/27/2016 6:35:00 PM
Jukka K. Korpela <jkorpela@cs.tut.fi> wrote:
>27.5.2016, 8:53, Adam H. Kerman wrote:
>>Phillip Helbig (undress to reply) <helbig@asclothestro.multivax.de> wrote:
>>>"Jukka K. Korpela" <jkorpela@cs.tut.fi> writes:

>[...]
>>>> and it is virtually
>>>> certain that if an HTML document declared to be ISO-8859-1 encoded
>>>> contains such code points, then the document is in fact windows-1252
>>>> encoded and the code points should be interpreted as graphic characters
>>>> according to windows-1252.

>>> Virtually certain?  I don't think so.

>Any counterexample?

>>> Suppose I write a page in German
>>> and use ä instead of &auml; and so on.  It would make sense to call it
>>> ISO-8859-1, right?  I don't even USE Windows.  Why should a Windows
>>> character set even be relevant.

>How is that relevant to C1 controls? This was about them, and only in 
>the code range reserved for them in ISO-8859-1 do ISO-8859-1 (ISO Latin 
>1) and windows-1252 (Windows Latin 1) differ. In particular, the letter 
>“ä” has the same position in both.

>You may call it ISO-8859-1, but for web browsers, it is an alias for 
>windows-1252.

I use Firefox. It's set to auto-detect the character set, which means
it reads the MIME header. If I can't read all the text on a Web page,
I have to manually choose the character set, and try both UTF-8 and
Windows-1252. Sometimes I have to try other character sets, or the page
is in more than one character set, sigh. I am not familiar with a setting
in Firefox that says "assume Windows-1252 even if marked ISO-8859-1".

If your statement was about Internet Explorer or other browsers written
by Microsoft, it wouldn't surprise anyone that it was true, but your
statement isn't universally applicable to all browsers.

>>> The point is not the C1 controls; the point is the 8-bit printable
>>> characters.  These are NOT all at the same positions as in windows-1252.

>> Exactly right.

>Exactly what are you talking about?

Obviously, I'm agreeing with the O.P.'s position that there are code
points used for PRINTING CHARACTERS in ISO-8859-1 that are used to print
different characters in Windows-1252.

The O.P.'s point is well taken.

>The printable characters that windows-1252 allocates to locations 80...1F
>(hex) are not present in ISO-8859-1 at all.

Your position, that you can just parse for the use of control characters,
does not make it a safe assumption that the character set in use is
Windows-1252, as that's not Microsoft's only proprietary character set
that uses code points for control characters to render printing characters.

>>Microsoft did that deliberately to annoy those who used non-Microsoft
>>word processors like WordPerfect.

>They used some of 80...9F for graphic characters that are actually needed.

That doesn't address the point the O.P. made that Microsoft changed
code points, and didn't merely add code points.

>>If the Web page declares the wrong character set, get the author to fix it.

>Should a web page user expect that to work?

Ghod forbid people should fix errors.

>Treating ISO-8859-1 as windows-1252 often helps to make browsers render 
>the characters the author meant, instead of ignoring them or signalling 
>errors to the user (who mostly can do absolutely nothing to “get the 
>author to fix it”). It causes absolutely no harm (or any effect) when 
>the data is ISO-8859-1 and does not try to use C1 Controls.

>The “guess” is wrong only in the rare cases where the character data has 
>been corrupted somehow or the encoding is actually neither ISO-8859-1 
>nor windows-1252. Even then, it hardly makes things worse.
Adam
5/27/2016 6:36:42 PM
27.5.2016, 21:36, Adam H. Kerman wrote:

> I use Firefox.

Then try this:
http://www.cs.tut.fi/~jkorpela/chars/test.html
(It declares iso-8859-1 but is actually windows-1253 encoded.)

> I am not familiar with a setting
> in Firefox that says "assume Windows-1252 even if marked ISO-8859-1".

There is no such setting, any more than there is setting for “take 1 + 1 
as meaning 2”. ☺

> Obviously, I'm agreeing with the O.P.'s position that there are code
> points used for PRINTING CHARACTERS in ISO-8859-1 that are used to print
> different characters in Windows-1252.

There are no such code points. Please prove I’m wrong by listing those 
code points and citing the relevant specs.

> Your position that you can just parse for the use of control characters
> is not necessarily a safe assumption that the character set in use is
> Windows-1252, as that's not Microsoft's only proprietary character set
> that uses code points for control characters to render printing characters.

The issue was with pages declared to be ISO-8859-1 encoded. If you 
declare, say, an ISO-8859-2 encoded page as ISO-8859-1, you will surely 
have problems more serious than those discussed here.

> That doesn't address the point the O.P. made that Microsoft changed
> code points, and didn't merely add code points.

What code points are you referring to?

There were no code points they could possibly add. The 8-bit coding 
space has 256 points, and that’s it. What they did was allocating 
graphic characters to some code points that are reserved for control 
codes (“control characters”) in iso-8859-1 (and never used that way in 
HTML, for example). If you think otherwise, please specify which code 
points and characters you are talking about.
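That allocation is easy to inspect with Python 3's windows-1252 codec (a sketch; five of the 32 C1 positions are left unassigned and raise a decode error in strict mode):

```python
# Which C1 positions (0x80-0x9F) windows-1252 reassigns to graphic characters.
assigned = {}
for b in range(0x80, 0xA0):
    try:
        assigned[b] = bytes([b]).decode("windows-1252")
    except UnicodeDecodeError:
        pass  # 0x81, 0x8D, 0x8F, 0x90, 0x9D are unassigned
print(len(assigned), "graphic characters, e.g.", assigned[0x80], assigned[0x93])
# 27 graphic characters: quotes, dashes, the euro sign, etc.
```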

-- 
Yucca, http://www.cs.tut.fi/~jkorpela/
Jukka
5/27/2016 8:08:33 PM
On Fri, 27 May 2016 18:36:42 +0000 (UTC), Adam H. Kerman quoted Jukka K.
Korpela as having written::

>> ... the letter 
>> “䒠 has the same position in both.

Funny: in Jukka's original, what I see is

> ... the letter 
> “ä” has the same position in both.

Somehow, ISO-8859-1, Win-1252, or UTF-8 didn't quite all agree here :-) .

Cheers, -- tlvp
-- 
Avant de repondre, jeter la poubelle, SVP.
tlvp
5/28/2016 7:29:09 AM
In article <ni8nbt$nm1$4@news.albasani.net>, "Adam H. Kerman"
<ahk@chinet.com> writes: 

> >> and it is virtually 
> >> certain that if an HTML document declared to be ISO-8859-1 encoded 
> >> contains such code points, then the document is in fact windows-1252 
> >> encoded and the code points should be interpreted as graphic characters 
> >> according to windows-1252. 
> 
> >Virtually certain?  I don't think so.  Suppose I write a page in German 
> >and use ä instead of &auml; and so on.  It would make sense to call it 
> >ISO-8859-1, right?  I don't even USE Windows.  Why should a Windows 
> >character set even be relevant.
> 
> >The point is not the C1 controls; the point is the 8-bit printable 
> >characters.  These are NOT all at the same positions as in windows-1252.
> 
> Exactly right. Microsoft did that deliberately to annoy those who
> used non-Microsoft word processors like WordPerfect.

This is not the only example of Microsoft implementing such a policy.

> If the Web page declares the wrong character set, get the author to fix it.

In practice, little chance.

helbig
5/28/2016 10:07:36 AM
28.5.2016, 10:29, tlvp wrote:

> On Fri, 27 May 2016 18:36:42 +0000 (UTC), Adam H. Kerman quoted Jukka K.
> Korpela as having written::
>
>>> ... the letter
>>> “䒠 has the same position in both.
>
> Funny: in Jukka's original, what I see is
>
>> ... the letter
>> “ä” has the same position in both.
>
> Somehow, ISO-8859-1, Win-1252, or UTF-8 didn't quite all agree here :-) .

Don’t blame the encodings. Adam’s message did not specify a character 
encoding in a Content-Type header, so the interpretation of bytes 
outside the ASCII range 0..7F (hex) is up to the recipient, the software 
that displays the message.

ObHTML: The same happens when an HTML document is sent by a server 
without any specification of character encoding. However, in that case, 
the guessing algorithm is defined in detail in HTML5.

-- 
Yucca, http://www.cs.tut.fi/~jkorpela/
Jukka
5/28/2016 2:43:11 PM
Adam H. Kerman <ahk@chinet.com> wrote:

>I use Firefox. It's set to auto-detect the character set, which means
>it reads the MIME header.

I'm going to have to withdraw several points here. First of all, the
auto-detect feature is related to the use of non-Latin alphabets and
isn't relevant to what's under discussion.

>Obviously, I'm agreeing with the O.P.'s position that there are code
>points used for PRINTING CHARACTERS in ISO-8859-1 that are used to print
>different characters in Windows-1252.

Memory was fuzzy on this point as well. I had recalled that four code
points were swapped, but comparing the two character sets, they're not.

I was thinking back a couple of decades how painful it used to be
dealing with documents produced in one word processor that couldn't
be imported into another word processor, because Microsoft Word would
take specific character patterns and substitute non-ASCII characters
by default from their proprietary character set so they couldn't be
edited on a competitor's application. The "smart quote" substitution
was notorious, especially since Windows-1252 right (close) single quote
was used for apostrophe as well, making it every bit the ambiguous
character that ASCII apostrophe/single quote/acute accent, so it
wasn't much of an improvement.
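The substitution described above can be illustrated with Python 3 (a sketch of the byte-level effect, not of Word's actual behaviour):

```python
# Word stored the "smart" quotes as windows-1252 bytes 0x91-0x94;
# 0x92 (U+2019) doubles as both closing quote and apostrophe.
smart = b"it\x92s \x93fine\x94".decode("windows-1252")
print(smart)  # it’s “fine”
assert "\u2019" in smart  # the ambiguous right single quote
```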
Adam
5/28/2016 3:22:53 PM
Jukka K. Korpela <jkorpela@cs.tut.fi> wrote:

>Then try this:
>http://www.cs.tut.fi/~jkorpela/chars/test.html
>(It declares iso-8859-1 but is actually windows-1253 encoded.)

I see that you declared iso-8859-1. Firefox reports that it's Windows-1252
encoded. With code page 1253, I should see modern Greek letters, but not
with 1252.

Looks like this chart:
https://msdn.microsoft.com/en-us/goglobal/cc305146.aspx

I withdrew my other points. I'm not withdrawing the point that browsers
assuming it's one code page when it's actually a different code page is
not helpful in any way.

Your demonstration supports my point, actually, so you should
have persuaded yourself to change your position.
Adam
5/28/2016 3:28:25 PM
tlvp <mPiOsUcB.EtLlLvEp@att.net> wrote:
>On Fri, 27 May 2016 18:36:42 +0000 (UTC), Adam H. Kerman quoted Jukka K.
>Korpela as having written::

>>> ... the letter 
>>> “䒠 has the same position in both.

>Funny: in Jukka's original, what I see is

>> ... the letter 
>> “ä” has the same position in both.
>
>Somehow, ISO-8859-1, Win-1252, or UTF-8 didn't quite all agree here :-) .

>Cheers, -- tlvp

I could use MIME headers with Usenet articles, but generally, I don't.
I try to use ASCII characters wherever possible in Usenet articles and
substitute the ambiguous ASCII character for the Windows-1252 or UTF-8
character, but there's no way to do that when quoting Jukka, so I don't
bother. Anyway, I think more than one character set is being used in
this thread. I am aware that the quotes look ugly.

I'm declaring windows-1252 just for you, as Jukka's precursor article
was declared in that character set, but it appears that you've substituted
UTF-8 code points and this still won't make the quoted text renderable.
Adam
5/28/2016 3:52:21 PM
Phillip Helbig (undress to reply) <helbig@asclothestro.multivax.de> wrote:
>In article <ni8nbt$nm1$4@news.albasani.net>, "Adam H. Kerman"
><ahk@chinet.com> writes: 
>
>> >> and it is virtually 
>> >> certain that if an HTML document declared to be ISO-8859-1 encoded 
>> >> contains such code points, then the document is in fact windows-1252 
>> >> encoded and the code points should be interpreted as graphic characters 
>> >> according to windows-1252. 
>> 
>> >Virtually certain?  I don't think so.  Suppose I write a page in German 
>> >and use ä instead of &auml; and so on.  It would make sense to call it 
>> >ISO-8859-1, right?  I don't even USE Windows.  Why should a Windows 
>> >character set even be relevant.
>> 
>> >The point is not the C1 controls; the point is the 8-bit printable 
>> >characters.  These are NOT all at the same positions as in windows-1252.
>> 
>> Exactly right. Microsoft did that deliberately to annoy those who
>> used non-Microsoft word processors like WordPerfect.
>
>This is not the only example of Microsoft implementing such a policy.

I corrected my statement elsewhere, but MS did the code point change
in the application before saving the document. I was recalling that
four code points of printable characters had been swapped in windows-1252,
but comparing the two character encodings, my memory was wrong.

>> If the Web page declares the wrong character set, get the author to fix it.

>In practice, little chance.

I know.

It occurs to me that I sometimes might encounter a Web page that
correctly declares its character set, but was served in another character set.
Adam
5/28/2016 3:57:07 PM
On Sat, 28 May 2016 15:52:21 +0000 (UTC), Adam H. Kerman wrote,
and I quote it in full only to illustrate all that has changed yet again
with the intentional Win-1252 encoding declaration:

> tlvp <mPiOsUcB.EtLlLvEp@att.net> wrote:
>>On Fri, 27 May 2016 18:36:42 +0000 (UTC), Adam H. Kerman quoted Jukka K.
>>Korpela as having written::
> 
>>>> ... the letter 
>>>> “䒠 has the same position in both.
> 
>>Funny: in Jukka's original, what I see is
> 
>>> ... the letter 
>>> “äâ€ has the same position in both.
>>
>>Somehow, ISO-8859-1, Win-1252, or UTF-8 didn't quite all agree here :-) .
> 
>>Cheers, -- tlvp
> 
> I could use MIME headers with Usenet articles, but generally, I don't.
> I try to use ASCII characters wherever possible in Usenet articles and
> substitute the ambiguous ASCII character for the Windows-1252 or UTF-8
> character, but there's no way to do that when quoting Jukka, so I don't
> bother. Anyway, I think more than one character set is being used in
> this thread. I am aware that the quotes look ugly.
> 
> I'm declaring windows-1252 just for you, as Jukka's precursor article
> was declared in that character set, but it appears that you've substituted
> UTF-8 code points and this still won't make the quoted text renderable.

FWiW, it went, clearly, from bad to worse :-) . Apologies for wasting B/W. 

Cheers, -- tlvp
-- 
Avant de repondre, jeter la poubelle, SVP.
tlvp
5/28/2016 8:12:57 PM