HTML 5 requires documents which claim to be ISO-8859-1 to be parsed as Windows-1252, apparently because some websites claiming to be ISO-8859-1 are actually Windows-1252. (Why in the world? More web sites advertise themselves as ISO-8859-1 than as Windows-1252, and even if that were not the case, standards shouldn't standardize wrong behaviour.) What does the standard say about a website advertising itself as ISO-8859-15? Should it be parsed as ISO-8859-15?
On Mon, 16 May 2016 19:10:07 +0000 (UTC), Phillip Helbig
<helbig@asclothestro.multivax.de> wrote:

> HTML 5 requires documents which claim to be ISO-8859-1 to be parsed as
> Windows-1252, apparently because some websites claiming to be ISO-8859-1
> are actually Windows-1252.

In the HTML5 W3C Recommendation of 28 October 2014, I don't find anything that mentions this. The only thing I could find is that Windows-1252 is the suggested encoding only if the document is not in one of the listed locale languages (section 8.2.2.2, "Determining the character encoding").

> (Why in the world? More web sites advertise
> themselves as ISO-8859-1 than as Windows-1252, and even if that were not
> the case, standards shouldn't standardize wrong behaviour.)

I agree. It makes the standard itself substandard.

> What does the standard say about a website advertising itself as
> ISO-8859-15? Should it be parsed as ISO-8859-15?

ISO-8859-15 itself is more like a small universal character set that covers several languages. It is not meant for a specific language or specific language group, which is what most character sets are for, including ISO 8859's own character sets. So, if ISO-8859-15 is not available locally, Windows-1252 would be preferred, considering that HTML5 mentions suggested character sets based only on a list of languages (not on a list of character sets), and since ISO-8859-15 covers multiple languages [1]. Moreover, Windows-1252 has the smallest character table difference when compared with ISO-8859-15 [2].

[1] https://en.wikipedia.org/wiki/ISO-8859-15#Coverage
[2] https://en.wikipedia.org/wiki/ISO-8859-15#Differences_from_ISO-8859-1
On 2016-05-16 19:10:07 +0000, Phillip Helbig (undress to reply) said:

> HTML 5 requires documents which claim to be ISO-8859-1 to be parsed as
> Windows-1252, apparently because some websites claiming to be ISO-8859-1
> are actually Windows-1252. (Why in the world? More web sites advertise
> themselves as ISO-8859-1 than as Windows-1252, and even if that were not
> the case, standards shouldn't standardize wrong behaviour.)

The issue is not whether they declare themselves to be ISO-8859-1, but whether they do so falsely. But I thought I remembered the standard explaining why they chose to take this odious step as the lesser of two evils.

-- 
John W Kennedy
"Information is light. Information, in itself, about anything, is light."
  -- Tom Stoppard, "Night and Day"
17.5.2016, 5:20, JJ wrote:

> On Mon, 16 May 2016 19:10:07 +0000 (UTC), Phillip Helbig wrote:
>> HTML 5 requires documents which claim to be ISO-8859-1 to be parsed as
>> Windows-1252, apparently because some websites claiming to be ISO-8859-1
>> are actually Windows-1252.
>
> In the HTML5 W3C Recommendation of 28 October 2014, I don't find
> anything that mentions this.

This part of “HTML5” (a very vague concept) was moved out of the W3C specification for HTML5, into the “Encoding” document, which is cited in that spec and has the normative status of Candidate Recommendation. But normative statuses have lost importance in the “HTML5” world.

The “Encoding” document lists several encoding names that shall be interpreted as windows-1252, including iso-8859-1:
https://www.w3.org/TR/encoding/#legacy-single-byte-encodings

>> (Why in the world? More web sites advertise
>> themselves as ISO-8859-1 than as Windows-1252, and even if that were not
>> the case, standards shouldn't standardize wrong behaviour.)
>
> I agree. It makes the standard itself substandard.

The reason is simple enough: there is no use for C1 Controls (code points from 80 to 9F in hexadecimal) in HTML documents, and it is virtually certain that if an HTML document declared to be ISO-8859-1 encoded contains such code points, then the document is in fact windows-1252 encoded and the code points should be interpreted as graphic characters according to windows-1252. Well, in rare cases it might be a matter of data error, but even then, a windows-1252 interpretation makes more sense than pretending that we are interpreting the data as iso-8859-1.

>> What does the standard say about a website advertising itself as
>> ISO-8859-15? Should it be parsed as ISO-8859-15?
>
> ISO-8859-15 itself is more like a small universal character set that
> covers several languages. It's not meant for a specific language or
> specific language group which most character sets are for - including
> ISO-8859's own character sets.

I think you have misunderstood what ISO-8859-15 was meant for. It was introduced in order to include the euro sign, “€”; the extra letters that were added due to the needs of Finnish and French were much less important. ISO-8859-15 soon became obsolete, or was born obsolete, since anyone who needs “€” in text can use windows-1252 or utf-8.

The “Encoding” document lists iso-8859-15 as a separate encoding, with a few names in addition to the reference name. It is included as one of the set of encodings for which support is required:
https://www.w3.org/TR/encoding/#names-and-labels

So the answer is that a document advertised as ISO-8859-15 shall be parsed as ISO-8859-15.

> Considering that HTML5 mentions only the suggested character sets
> based on a list of languages

What you are referring to is a description of how the character encoding is determined when it is not declared properly. That is, “guessing” or “sniffing” the encoding, based on the language of the user’s environment! The “sniffing” algorithm never results in using iso-8859-15, simply because it is not commonly used anywhere.

-- 
Yucca, http://www.cs.tut.fi/~jkorpela/
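The C1-controls argument above can be checked mechanically. A minimal sketch using Python's standard codecs (the codec names are Python's, not the WHATWG labels):

```python
# Bytes 0x80-0x9F are C1 control codes under ISO-8859-1 but printable
# characters under windows-1252, so their presence in a page labelled
# ISO-8859-1 strongly suggests the page is really windows-1252.
data = b"smart quotes: \x93quoted\x94"

as_latin1 = data.decode("iso-8859-1")  # C1 controls, invisible in a browser
as_cp1252 = data.decode("cp1252")      # curly quotation marks

assert as_latin1[14] == "\x93"    # U+0093: a C1 control code
assert as_cp1252[14] == "\u201c"  # U+201C: LEFT DOUBLE QUOTATION MARK
```

Both decodes succeed without error; only the windows-1252 reading yields characters anyone would actually want on a web page.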
In article <nhf3a8$8sp$1@dont-email.me>, "Jukka K. Korpela"
<jkorpela@cs.tut.fi> writes:

> The reason is simple enough: there is no use for C1 Controls (code
> points from 80 to 9F in hexadecimal) in HTML documents,

True.

> and it is virtually certain that if an HTML document declared to be
> ISO-8859-1 encoded contains such code points, then the document is in
> fact windows-1252 encoded and the code points should be interpreted as
> graphic characters according to windows-1252.

Virtually certain? I don't think so. Suppose I write a page in German and use ä instead of &auml; and so on. It would make sense to call it ISO-8859-1, right? I don't even USE Windows. Why should a Windows character set even be relevant?

The point is not the C1 controls; the point is the 8-bit printable characters. These are NOT all at the same positions as in windows-1252.

>>> ISO-8859-15? Should it be parsed as ISO-8859-15?
>>
>> ISO-8859-15 itself is more like a small universal character set that
>> covers several languages. It's not meant for a specific language or
>> specific language group which most character sets are for - including
>> ISO-8859's own character sets.
>
> I think you have misunderstood what ISO-8859-15 was meant for. It was
> introduced in order to include the euro sign, "€"; the extra letters
> that were added due to the needs of Finnish and French were much less
> important.

So if I want to use the Euro sign as an 8-bit character, why not use ISO-8859-15?

> ISO-8859-15 soon became obsolete, or was born obsolete, since anyone
> who needs "€" in text can use windows-1252 or utf-8.

But suppose I don't use Windows. I would have to jump through hoops to create the proper 8-bit representations, which would look different on my screen (not in HTML, but when editing the source). If the code points in utf-8 are not identical to ISO-8859-15, same problem. If so, why not call it ISO-8859-15?

> The "Encoding" document lists iso-8859-15 as a separate encoding, with a
> few names in addition to the reference name. It is included as one of
> the set of encodings for which support is required:
> https://www.w3.org/TR/encoding/#names-and-labels

Reason enough to use it!

> So the answer is that a document advertised as ISO-8859-15 shall be
> parsed as ISO-8859-15.

Great!
Am 18.05.2016 um 21:40 schrieb Phillip Helbig (undress to reply):

> In article <nhf3a8$8sp$1@dont-email.me>, "Jukka K. Korpela"
> <jkorpela@cs.tut.fi> writes:
>> and it is virtually certain that if an HTML document declared to be
>> ISO-8859-1 encoded contains such code points, then the document is in
>> fact windows-1252 encoded and the code points should be interpreted as
>> graphic characters according to windows-1252.
>
> Virtually certain? I don't think so. Suppose I write a page in German
> and use ä instead of &auml; and so on. It would make sense to call it
> ISO-8859-1, right? I don't even USE Windows. Why should a Windows
> character set even be relevant?

I did exactly this for many years, and it worked and would still work now had I not decided to switch to UTF-8.

But Jukka said "if ... contains such code points" (to wit, between U+0080 and U+009F). Now, it is virtually certain that this will not be the case in your example. So you would not notice if a browser applied Windows-1252 instead of ISO-8859-1.

> But suppose I don't use Windows. I would have to jump through hoops to
> create the proper 8-bit representations, which would look different on
> my screen (not in HTML, but when editing the source). If the code
> points in utf-8 are not identical to ISO-8859-15, same problem. If so,
> why not call it ISO-8859-15?

If (1) for you producing UTF-8 is more of a problem than ISO-8859-15, and (2) you never need characters outside ISO-8859-15, e.g. quotation marks or dashes, then this is a good choice. And it will work if you declare what you have done. For me, neither (1) nor (2) is fulfilled.

-- 
Helmut Richter
In article <nhihgt$bp5$1@news.in.tum.de>, Helmut Richter <hhr-m@web.de> writes:

> But Jukka said "if ... contains such code points" (to wit, between U+0080
> and U+009F). Now, it is virtually certain that this will not be the case
> in your example. So you would not notice if a browser applied
> Windows-1252 instead of ISO-8859-1.

That is true.

> If (1) for you producing UTF-8 is more of a problem than ISO-8859-15,

Yes.

> and (2) you never need characters outside ISO-8859-15, e.g. quotation
> marks or dashes,

Yes.

> then this is a good choice. And it will work if you declare what you
> have done.

OK. So, similarly, in contrast to the previous post I replied to, the EURO sign should show up here: €, at least in a mime-aware newsreader, since the newsgroup headers should contain the relevant declarations.

To summarize: Windows-1252 puts some extra characters where the 8-bit non-printable characters are in ISO-8859-1 (and ISO-8859-15). I could of course use these bit representations, but they would not show up as intended on a non-Windows machine (e.g., in a text editor), i.e. not as what the user should see on a web page. ISO-8859-15 is essentially the same as ISO-8859-1 except that it replaces a few rarely used characters with the EURO sign and some additional letters to allow more languages to be completely covered. Thus, that makes sense to me. If I actually needed the symbols which are in ISO-8859-1 but not in ISO-8859-15, I could replace my default encoding of ISO-8859-15 with ISO-8859-1 for that page, or use the "&" mnemonics or the "#" values (the latter being the only option if I need both symbols which are in ISO-8859-1 but not in ISO-8859-15 AND symbols which are in ISO-8859-5 but not in ISO-8859-1).
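The relationship between the three encodings summarized above can be illustrated with a short sketch (Python codec names, which differ slightly from the HTML labels):

```python
# ISO-8859-15 replaces the ISO-8859-1 currency sign at 0xA4 with the
# euro sign; windows-1252 instead puts the euro sign at 0x80, inside
# the C1 range that ISO-8859-1 reserves for control codes.
assert b"\xa4".decode("iso8859-15") == "\u20ac"  # euro sign
assert b"\xa4".decode("iso-8859-1") == "\xa4"    # currency sign
assert b"\x80".decode("cp1252") == "\u20ac"      # euro sign again

# The "#" values mentioned above: a numeric character reference is
# valid in a page of any declared encoding.
assert "&#%d;" % 0x20AC == "&#8364;"             # the euro sign in HTML
```

So a page declared (and actually) ISO-8859-15 carries the euro sign as byte 0xA4, while the same byte under ISO-8859-1 is the rarely used currency sign.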
Phillip Helbig (undress to reply) <helbig@asclothestro.multivax.de> wrote:

> "Jukka K. Korpela" <jkorpela@cs.tut.fi> writes:
>> The reason is simple enough: there is no use for C1 Controls (code
>> points from 80 to 9F in hexadecimal) in HTML documents,
>
> True.

What, no one wants to offer backwards compatibility for teletypewriters any longer?

>> and it is virtually certain that if an HTML document declared to be
>> ISO-8859-1 encoded contains such code points, then the document is in
>> fact windows-1252 encoded and the code points should be interpreted as
>> graphic characters according to windows-1252.
>
> Virtually certain? I don't think so. Suppose I write a page in German
> and use ä instead of &auml; and so on. It would make sense to call it
> ISO-8859-1, right? I don't even USE Windows. Why should a Windows
> character set even be relevant?
>
> The point is not the C1 controls; the point is the 8-bit printable
> characters. These are NOT all at the same positions as in windows-1252.

Exactly right. Microsoft did that deliberately to annoy those who used non-Microsoft word processors like WordPerfect.

If the Web page declares the wrong character set, get the author to fix it.
27.5.2016, 8:53, Adam H. Kerman wrote:

> Phillip Helbig (undress to reply) <helbig@asclothestro.multivax.de> wrote:
>> "Jukka K. Korpela" <jkorpela@cs.tut.fi> writes:
[...]
>>> and it is virtually certain that if an HTML document declared to be
>>> ISO-8859-1 encoded contains such code points, then the document is in
>>> fact windows-1252 encoded and the code points should be interpreted as
>>> graphic characters according to windows-1252.
>
>> Virtually certain? I don't think so.

Any counterexample?

>> Suppose I write a page in German and use ä instead of &auml; and so
>> on. It would make sense to call it ISO-8859-1, right? I don't even
>> USE Windows. Why should a Windows character set even be relevant?

How is that relevant to C1 controls? This was about them, and only in the code range reserved for them in ISO-8859-1 do ISO-8859-1 (ISO Latin 1) and windows-1252 (Windows Latin 1) differ. In particular, the letter “ä” has the same position in both.

You may call it ISO-8859-1, but for web browsers, it is an alias for windows-1252.

>> The point is not the C1 controls; the point is the 8-bit printable
>> characters. These are NOT all at the same positions as in windows-1252.
>
> Exactly right.

Exactly what are you talking about? The printable characters that windows-1252 allocates to locations 80...1F (hex) are not present in ISO-8859-1 at all.

> Microsoft did that deliberately to annoy those who used non-Microsoft
> word processors like WordPerfect.

They used some of 80...9F for graphic characters that are actually needed.

> If the Web page declares the wrong character set, get the author to fix it.

Should a web page user expect that to work? Treating ISO-8859-1 as windows-1252 often helps to make browsers render the characters the author meant, instead of ignoring them or signalling errors to the user (who mostly can do absolutely nothing to “get the author to fix it”). It causes absolutely no harm (or any effect) when the data is ISO-8859-1 and does not try to use C1 Controls.

The “guess” is wrong only in the rare cases where the character data has been corrupted somehow or the encoding is actually neither ISO-8859-1 nor windows-1252. Even then, it hardly makes things worse.

-- 
Yucca, http://www.cs.tut.fi/~jkorpela/
Jukka K. Korpela wrote:

> 27.5.2016, 8:53, Adam H. Kerman wrote:
>> Phillip Helbig (undress to reply) <helbig@asclothestro.multivax.de>
>> wrote:
>>> "Jukka K. Korpela" <jkorpela@cs.tut.fi> writes:
>
> You may call it ISO-8859-1, but for web browsers, it is an alias for
> windows-1252.
>
>>> The point is not the C1 controls; the point is the 8-bit printable
>>> characters. These are NOT all at the same positions as in windows-1252.
>> Exactly right.
>
> Exactly what are you talking about? The printable characters that
> windows-1252 allocates to locations 80...1F (hex) are not present in
> ISO-8859-1 at all.

You mean 80..._9F_ (hex), and that the characters at those code points in Windows-1252 are not in ISO-8859-1 at the same code points, or at all, was precisely the point argued. Because that puts into doubt the decision to treat ISO-8859-1 as an alias for Windows-1252.

See also: <https://de.wikipedia.org/wiki/ISO_8859-1>

PointedEars
-- 
Prototype.js was written by people who don't know javascript for people
who don't know javascript. People who don't know javascript are not the
best source of advice on designing systems that use javascript.
  -- Richard Cornford, cljs, <f806at$ail$1$8300dec7@news.demon.co.uk>
Jukka K. Korpela <jkorpela@cs.tut.fi> wrote:

> 27.5.2016, 8:53, Adam H. Kerman wrote:
>> Phillip Helbig (undress to reply) <helbig@asclothestro.multivax.de> wrote:
>>> "Jukka K. Korpela" <jkorpela@cs.tut.fi> writes:
> [...]
>>>> and it is virtually certain that if an HTML document declared to be
>>>> ISO-8859-1 encoded contains such code points, then the document is in
>>>> fact windows-1252 encoded and the code points should be interpreted as
>>>> graphic characters according to windows-1252.
>>> Virtually certain? I don't think so.
>
> Any counterexample?
>
>>> Suppose I write a page in German and use ä instead of &auml; and so
>>> on. It would make sense to call it ISO-8859-1, right? I don't even
>>> USE Windows. Why should a Windows character set even be relevant?
>
> How is that relevant to C1 controls? This was about them, and only in
> the code range reserved for them in ISO-8859-1 do ISO-8859-1 (ISO Latin
> 1) and windows-1252 (Windows Latin 1) differ. In particular, the letter
> “ä” has the same position in both.
>
> You may call it ISO-8859-1, but for web browsers, it is an alias for
> windows-1252.

I use Firefox. It's set to auto-detect the character set, which means it reads the MIME header. If I can't read all the text on a Web page, I have to manually choose the character set, and try both UTF-8 and Windows-1252. Sometimes I have to try other character sets, or the page is in more than one character set, sigh.

I am not familiar with a setting in Firefox that says "assume Windows-1252 even if marked ISO-8859-1". If your statement was about Internet Explorer or other browsers written by Microsoft, it wouldn't surprise anyone that it was true, but your statement isn't universally applicable to all browsers.

>>> The point is not the C1 controls; the point is the 8-bit printable
>>> characters. These are NOT all at the same positions as in windows-1252.
>> Exactly right.
>
> Exactly what are you talking about?

Obviously, I'm agreeing with the O.P.'s position that there are code points used for PRINTING CHARACTERS in ISO-8859-1 that are used to print different characters in Windows-1252. The O.P.'s point is well taken.

> The printable characters that windows-1252 allocates to locations
> 80...1F (hex) are not present in ISO-8859-1 at all.

Your position that you can just parse for the use of control characters is not necessarily a safe assumption that the character set in use is Windows-1252, as that's not Microsoft's only proprietary character set that uses code points for control characters to render printing characters.

>> Microsoft did that deliberately to annoy those who used non-Microsoft
>> word processors like WordPerfect.
>
> They used some of 80...9F for graphic characters that are actually needed.

That doesn't address the point the O.P. made that Microsoft changed code points, and didn't merely add code points.

>> If the Web page declares the wrong character set, get the author to fix it.
>
> Should a web page user expect that to work?

Ghod forbid people should fix errors.

> Treating ISO-8859-1 as windows-1252 often helps to make browsers render
> the characters the author meant, instead of ignoring them or signalling
> errors to the user (who mostly can do absolutely nothing to “get the
> author to fix it”). It causes absolutely no harm (or any effect) when
> the data is ISO-8859-1 and does not try to use C1 Controls.
>
> The “guess” is wrong only in the rare cases where the character data has
> been corrupted somehow or the encoding is actually neither ISO-8859-1
> nor windows-1252. Even then, it hardly makes things worse.
27.5.2016, 21:36, Adam H. Kerman wrote:

> I use Firefox.

Then try this:
http://www.cs.tut.fi/~jkorpela/chars/test.html
(It declares iso-8859-1 but is actually windows-1253 encoded.)

> I am not familiar with a setting in Firefox that says "assume
> Windows-1252 even if marked ISO-8859-1".

There is no such setting, any more than there is a setting for “take 1 + 1 as meaning 2”. ☺

> Obviously, I'm agreeing with the O.P.'s position that there are code
> points used for PRINTING CHARACTERS in ISO-8859-1 that are used to print
> different characters in Windows-1252.

There are no such code points. Please prove I’m wrong by listing those code points and citing the relevant specs.

> Your position that you can just parse for the use of control characters
> is not necessarily a safe assumption that the character set in use is
> Windows-1252, as that's not Microsoft's only proprietary character set
> that uses code points for control characters to render printing characters.

The issue was with pages declared to be ISO-8859-1 encoded. If you declare, say, an ISO-8859-2 encoded page as ISO-8859-1, you will surely have problems more serious than those discussed here.

> That doesn't address the point the O.P. made that Microsoft changed
> code points, and didn't merely add code points.

What code points are you referring to? There were no code points they could possibly add. The 8-bit coding space has 256 points, and that’s it. What they did was allocate graphic characters to some code points that are reserved for control codes (“control characters”) in iso-8859-1 (and never used that way in HTML, for example). If you think otherwise, please specify which code points and characters you are talking about.

-- 
Yucca, http://www.cs.tut.fi/~jkorpela/
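The effect of that test page can be simulated offline. A sketch of what happens when Greek windows-1253 bytes are decoded as windows-1252 (a browser's reading of an "iso-8859-1" label), using Python's standard codecs:

```python
# Greek text encoded as windows-1253...
greek_bytes = "αβγ".encode("cp1253")
assert greek_bytes == b"\xe1\xe2\xe3"

# ...decoded as windows-1252: no error is raised, the text just
# silently comes out as accented Latin letters.
assert greek_bytes.decode("cp1252") == "\u00e1\u00e2\u00e3"  # áâã
assert greek_bytes.decode("cp1253") == "αβγ"                 # intended text
```

Since every byte is assigned some character in both code pages, the wrong decoding produces no error, only wrong text.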
On Fri, 27 May 2016 18:36:42 +0000 (UTC), Adam H. Kerman quoted Jukka K. Korpela as having written:

>> ... the letter
>> 䒠 has the same position in both.

Funny: in Jukka's original, what I see is

> ... the letter
> “ä” has the same position in both.

Somehow, ISO-8859-1, Win-1252, or UTF-8 didn't quite all agree here :-) .

Cheers, -- tlvp
-- 
Avant de repondre, jeter la poubelle, SVP.
In article <ni8nbt$nm1$4@news.albasani.net>, "Adam H. Kerman"
<ahk@chinet.com> writes:

>>> and it is virtually certain that if an HTML document declared to be
>>> ISO-8859-1 encoded contains such code points, then the document is in
>>> fact windows-1252 encoded and the code points should be interpreted as
>>> graphic characters according to windows-1252.
>
>> Virtually certain? I don't think so. Suppose I write a page in German
>> and use ä instead of &auml; and so on. It would make sense to call it
>> ISO-8859-1, right? I don't even USE Windows. Why should a Windows
>> character set even be relevant?
>
>> The point is not the C1 controls; the point is the 8-bit printable
>> characters. These are NOT all at the same positions as in windows-1252.
>
> Exactly right. Microsoft did that deliberately to annoy those who
> used non-Microsoft word processors like WordPerfect.

This is not the only example of MicroSoft implementing such a policy.

> If the Web page declares the wrong character set, get the author to fix it.

In practice, little chance.
28.5.2016, 10:29, tlvp wrote:

> On Fri, 27 May 2016 18:36:42 +0000 (UTC), Adam H. Kerman quoted Jukka K.
> Korpela as having written:
>
>>> ... the letter
>>> 䒠 has the same position in both.
>
> Funny: in Jukka's original, what I see is
>
>> ... the letter
>> “ä” has the same position in both.
>
> Somehow, ISO-8859-1, Win-1252, or UTF-8 didn't quite all agree here :-) .

Don’t blame the encodings. Adam’s message did not specify a character encoding in a Content-Type header, so the interpretation of bytes outside the ASCII range 0..7F (hex) is up to the recipient, i.e. the software that displays the message.

ObHTML: the same happens when an HTML document is sent by a server without any specification of character encoding. However, in that case, the guessing algorithm is defined in detail in HTML5.

-- 
Yucca, http://www.cs.tut.fi/~jkorpela/
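The guessing mentioned above can be caricatured in a few lines. This is not the HTML5 algorithm (which also considers byte order marks, meta tags, and the user's locale), just a toy version of the common try-UTF-8-first heuristic:

```python
def guess_decode(data: bytes) -> str:
    """Toy sniffer: valid UTF-8 wins, otherwise fall back to windows-1252."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("cp1252")

# Well-formed UTF-8 passes through unchanged.
assert guess_decode("ä€".encode("utf-8")) == "ä€"
# The same characters as single cp1252 bytes are invalid UTF-8,
# so the fallback kicks in.
assert guess_decode(b"\xe4\x80") == "ä€"
```

The heuristic works because multi-byte UTF-8 sequences follow a strict pattern that random legacy-encoded bytes almost never match by accident.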
Adam H. Kerman <ahk@chinet.com> wrote:

> I use Firefox. It's set to auto-detect the character set, which means
> it reads the MIME header.

I'm going to have to withdraw several points here. First of all, the auto-detect feature is related to the use of non-Latin alphabets and isn't relevant to what's under discussion.

> Obviously, I'm agreeing with the O.P.'s position that there are code
> points used for PRINTING CHARACTERS in ISO-8859-1 that are used to print
> different characters in Windows-1252.

My memory was fuzzy on this point as well. I had recalled that four code points were swapped, but comparing the two character sets, they're not.

I was thinking back a couple of decades to how painful it used to be dealing with documents produced in one word processor that couldn't be imported into another word processor, because Microsoft Word would take specific character patterns and substitute non-ASCII characters by default from their proprietary character set so they couldn't be edited on a competitor's application. The "smart quote" substitution was notorious, especially since the Windows-1252 right (close) single quote was used for the apostrophe as well, making it every bit as ambiguous as the ASCII apostrophe/single quote/acute accent, so it wasn't much of an improvement.
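The smart-quote ambiguity described above is easy to demonstrate. A small sketch; the fold-back replacement is illustrative, not Word's actual behaviour:

```python
# windows-1252 byte 0x92 is U+2019 (RIGHT SINGLE QUOTATION MARK),
# which doubles as the apostrophe, so the character is just as
# ambiguous as the plain ASCII apostrophe it replaced.
smart = b"it\x92s".decode("cp1252")
assert smart == "it\u2019s"

# Folding back to plain ASCII, as one might for a Usenet post:
plain = smart.replace("\u2019", "'")
assert plain == "it's"
```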
Jukka K. Korpela <jkorpela@cs.tut.fi> wrote:

> Then try this:
> http://www.cs.tut.fi/~jkorpela/chars/test.html
> (It declares iso-8859-1 but is actually windows-1253 encoded.)

I see that you declared iso-8859-1. Firefox reports that it's Windows-1252 encoded. With code page 1253, I should see modern Greek letters, but not with 1252. Looks like this chart:

https://msdn.microsoft.com/en-us/goglobal/cc305146.aspx

I withdrew my other points. But I'm not withdrawing my point: a browser assuming one code page when the page is actually in a different one isn't helpful in any way. Your demonstration supports my point, actually, so you should have persuaded yourself to change your position.
tlvp <mPiOsUcB.EtLlLvEp@att.net> wrote:

> On Fri, 27 May 2016 18:36:42 +0000 (UTC), Adam H. Kerman quoted Jukka K.
> Korpela as having written:
>
>>> ... the letter
>>> 䒠 has the same position in both.
>
> Funny: in Jukka's original, what I see is
>
>> ... the letter
>> “ä” has the same position in both.
>
> Somehow, ISO-8859-1, Win-1252, or UTF-8 didn't quite all agree here :-) .
>
> Cheers, -- tlvp

I could use MIME headers with Usenet articles, but generally, I don't. I try to use ASCII characters wherever possible in Usenet articles and substitute the ambiguous ASCII character for the Windows-1252 or UTF-8 character, but there's no way to do that when quoting Jukka, so I don't bother. Anyway, I think more than one character set is being used in this thread. I am aware that the quotes look ugly.

I'm declaring windows-1252 just for you, as Jukka's precursor article was declared in that character set, but it appears that you've substituted UTF-8 code points, and this still won't make the quoted text renderable.
Phillip Helbig (undress to reply) <helbig@asclothestro.multivax.de> wrote:

> In article <ni8nbt$nm1$4@news.albasani.net>, "Adam H. Kerman"
> <ahk@chinet.com> writes:
>
>>>> and it is virtually certain that if an HTML document declared to be
>>>> ISO-8859-1 encoded contains such code points, then the document is in
>>>> fact windows-1252 encoded and the code points should be interpreted as
>>>> graphic characters according to windows-1252.
>>
>>> Virtually certain? I don't think so. Suppose I write a page in German
>>> and use ä instead of &auml; and so on. It would make sense to call it
>>> ISO-8859-1, right? I don't even USE Windows. Why should a Windows
>>> character set even be relevant?
>>
>>> The point is not the C1 controls; the point is the 8-bit printable
>>> characters. These are NOT all at the same positions as in windows-1252.
>>
>> Exactly right. Microsoft did that deliberately to annoy those who
>> used non-Microsoft word processors like WordPerfect.
>
> This is not the only example of MicroSoft implementing such a policy.

I corrected my statement elsewhere, but MS did the code point change in the application before saving the document. I was recalling that four code points of printable characters had been swapped in windows-1252, but comparing the two character encodings, my memory was wrong.

>> If the Web page declares the wrong character set, get the author to fix it.
>
> In practice, little chance.

I know. It occurs to me that I sometimes might encounter a Web page that correctly declares its character set, but was served in another character set.
On Sat, 28 May 2016 15:52:21 +0000 (UTC), Adam H. Kerman wrote, and I quote it in full only to illustrate all that has changed yet again with the intentional Win-1252 encoding declaration:

> tlvp <mPiOsUcB.EtLlLvEp@att.net> wrote:
>> On Fri, 27 May 2016 18:36:42 +0000 (UTC), Adam H. Kerman quoted Jukka K.
>> Korpela as having written:
>
>>>> ... the letter
>>>> Â“ä’ has the same position in both.
>
>> Funny: in Jukka's original, what I see is
>
>>> ... the letter
>>> “ä†has the same position in both.
>>
>> Somehow, ISO-8859-1, Win-1252, or UTF-8 didn't quite all agree here :-) .
>
>> Cheers, -- tlvp
>
> I could use MIME headers with Usenet articles, but generally, I don't.
> I try to use ASCII characters wherever possible in Usenet articles and
> substitute the ambiguous ASCII character for the Windows-1252 or UTF-8
> character, but there's no way to do that when quoting Jukka, so I don't
> bother. Anyway, I think more than one character set is being used in
> this thread. I am aware that the quotes look ugly.
>
> I'm declaring windows-1252 just for you, as Jukka's precursor article
> was declared in that character set, but it appears that you've substituted
> UTF-8 code points and this still won't make the quoted text renderable.

FWiW, it went, clearly, from bad to worse :-) . Apologies for wasting B/W.

Cheers, -- tlvp
-- 
Avant de repondre, jeter la poubelle, SVP.