UTF-8 Encoding Error

I am getting the error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 15: invalid start byte

when I try to read some files through TaggedCorpusReader. TaggedCorpusReader is a class
in NLTK.
My files are saved in the MS-Windows default "ANSI" format.
I am using Python 2.7 on MS-Windows 7.

So far I have tried the following options:
string.encode('utf-8').strip()
unicode(string)
unicode(str, errors='replace')
unicode(str, errors='ignore')
string.decode('cp1252')

But none of them has helped much.

Could anyone kindly suggest a solution?
subhabangalore
12/23/2016 6:38:15 AM
comp.lang.python

On 22Dec2016 22:38, Subhabrata Banerjee <subhabangalore@gmail.com> wrote:
>I am getting the error:
>UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 15: invalid start byte
>
>as I try to read some files through TaggedCorpusReader. TaggedCorpusReader is a module
>of NLTK.
>My files are saved in ANSI format in MS-Windows default.
>I am using Python2.7 on MS-Windows 7.
>
>I have tried the following options till now,
>string.encode('utf-8').strip()
>unicode(string)
>unicode(str, errors='replace')
>unicode(str, errors='ignore')
>string.decode('cp1252')
>
>But nothing is of much help.

It would help to see a very small program that produces your error message.

Generally you need to open text files in the same encoding used for their text.
Which sounds obvious, but I'm presuming you've not done that.

Normally, when you open a file you can specify the text encoding. I am not a 
Windows guy, so I do not know what "ANSI format in MS-Windows default" means at 
the encoding level.

Supposing you had a bit of code like this:

  with open("filename.txt", "r") as fp:
      for line in fp:
          # line is a Python 2 str, but is a sequence of bytes internally
          unicode_line = line.decode('utf8')
          # unicode_line is a Python 2 _unicode_ object, which is text, a 
          # sequence of Unicode codepoints

you could get an error like yours if the file _did not_ contain UTF-8 encoded 
text.

If you used:
    unicode(str, errors='replace')
    unicode(str, errors='ignore')
I would not have expected the error you recite, but we would need to see an 
example program to be sure.
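
For what it is worth, here is a minimal sketch of what those two error handlers do when given an explicit codec. The sample bytes are made up for illustration (a tagged line with a stray 0x96 byte), and I use the bytes `.decode` method rather than `unicode()` so that the snippet runs unchanged on Python 2.7 and Python 3:

```python
# Hypothetical sample: a tagged line containing the byte 0x96,
# which is not a valid start byte in UTF-8.
raw = b"this/DT is/VBZ \x96/: a/DT test/NN"

# 'replace' substitutes U+FFFD (the replacement character) for the bad byte.
replaced = raw.decode("utf-8", "replace")

# 'ignore' silently drops the bad byte.
ignored = raw.decode("utf-8", "ignore")

print(repr(replaced))
print(repr(ignored))
```

Neither handler raises, but both lose the original character, so they hide the problem rather than fix it.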

I would guess that the text in your file is not UTF-8 encoded, and that you 
need to specify the correct encoding to the .decode call.
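
To make that concrete, here is a small sketch under the assumption that the file is really Windows code page 1252 (which is what "ANSI" usually means on a Western European Windows install). The byte string is a hypothetical example, built so that 0x96 sits at position 15 as in the reported error:

```python
# Hypothetical bytes as they might come out of a cp1252-encoded file.
raw = b"this/DT is/VBZ \x96/: a/DT test/NN"

# Decoding as UTF-8 fails, because 0x96 cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Decoding with the encoding the file was actually saved in succeeds;
# in cp1252 the byte 0x96 is U+2013 (EN DASH).
text = raw.decode("cp1252")
print(repr(text))
```

If the file turns out to use some other code page, the same call applies with that codec name instead of "cp1252".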

Cheers,
Cameron Simpson <cs@zip.com.au>
Cameron
12/23/2016 6:58:15 AM
On Thu, 22 Dec 2016 22:38:15 -0800, subhabangalore wrote:

> My files are saved in ANSI format in MS-Windows default.

Having a testfile in Code Page 1252 that contains:

this is a test >“”

The following ought to decode it into a Unicode string:

>>> import codecs
>>> s = codecs.open('testfile', encoding='cp1252').read()
>>> s
u'this is a test >\u201c\u201d\n'
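
The same idea as a self-contained sketch that first creates the test file, so it can be run anywhere. The temporary path and file contents are made up for illustration, and `io.open` is used because it behaves the same on Python 2.7 and Python 3. I believe NLTK's corpus readers, TaggedCorpusReader included, also accept an `encoding` keyword that defaults to 'utf8', but treat that as an assumption to check against the NLTK documentation:

```python
import io
import os
import tempfile

# Hypothetical test file, written as raw cp1252 bytes
# (0x93 and 0x94 are the left and right curly double quotes).
path = os.path.join(tempfile.mkdtemp(), "testfile")
with open(path, "wb") as fp:
    fp.write(b"this is a test >\x93\x94\n")

# io.open decodes the bytes to Unicode text while reading.
with io.open(path, encoding="cp1252") as fp:
    s = fp.read()

print(repr(s))
```

If the NLTK assumption holds, passing `encoding='cp1252'` when constructing the reader should avoid the UnicodeDecodeError without any manual decoding.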

If you want to output to the DOS box, you commonly have to switch it over 
to code page 852 instead of Unicode.

I'm running Linux (Terminal is Unicode) so I can't offer more than these 
suggestions.

HTH
-- 
Gilmeh
Gilmeh
12/24/2016 3:42:07 PM
On 22 December 2016 at 22:38, subhabangalore@gmail.com wrote:
>I am getting the error:
>UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 15: invalid start byte

The following is a reflex of mine, whenever I encounter Python 2 Unicode errors:

import sys
reload(sys)
sys.setdefaultencoding('utf8')

A relevant Stack Overflow answer awaits you here:

http://stackoverflow.com/a/21190382/2230956
Grady
12/25/2016 5:50:16 AM