windows utf8 & lxml

Hi 

I have been trying to get a script that works on Mint to work on Windows. The key blocker has been UTF-8 errors, most of which I have solved.

Now, however, for the last error I am trying to overcome, the solution appears to be to use .decode('windows-1252') to correct an ASCII error.

I am using lxml to read my content, and decode is not supported. Are there any known ways to read with lxml and fix Unicode faults?

The key part of my script is 

        for content in roots:
            utf8_parser = etree.XMLParser(encoding='utf-8')
            fix_ascii = utf8_parser.decode('windows-1252')
            mytree = etree.fromstring(
                content.read().encode('utf-8'), parser=fix_ascii)

Without the added .decode my code looks like

        for content in roots:
            utf8_parser = etree.XMLParser(encoding='utf-8')
            mytree = etree.fromstring(
                content.read().encode('utf-8'), parser=utf8_parser)

However, doing it this way returns this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

I found this SO answer for it, http://stackoverflow.com/a/29217546/461887, but cannot seem to implement it with lxml.

Ideas?
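
For reference, .decode() is a method of bytes objects, so it cannot be called on an XMLParser. The following is only a minimal sketch of one alternative, assuming each item in roots corresponds to a file on disk: read the file as raw bytes and let the parser work out the encoding from the byte order mark or the XML declaration. The helper name parse_xml_file is hypothetical, and the fallback import is there only so the sketch runs where lxml is not installed.

```python
from pathlib import Path

try:
    from lxml import etree  # the library used in the post
except ImportError:
    # Fallback so the sketch still runs without lxml installed.
    import xml.etree.ElementTree as etree


def parse_xml_file(path):
    """Parse an XML file from its raw bytes (hypothetical helper).

    The file is never decoded in Python, so no UnicodeDecodeError can be
    raised by read(); the parser picks the encoding from the byte order
    mark or the XML declaration.
    """
    raw = Path(path).read_bytes()
    return etree.fromstring(raw)
```

It might be called as root = parse_xml_file("data/example.xml"), where the path is a placeholder. Passing bytes rather than a decoded str also avoids having to guess between UTF-8, UTF-16 and windows-1252 by hand.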

Sayth
Sayth
12/20/2016 11:53:42 AM
comp.lang.python

On Tue, 20 Dec 2016 03:53:42 -0800 (PST), Sayth Renshaw
<flebber.crue@gmail.com> declaimed the following:

>UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

	Given that "position 0" I'd be tempted to think you may have a file
with a UTF-16LE byte order mark.
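
A quick way to test that suspicion is to look at the first bytes of the file. This is a minimal sketch using only the standard codecs module; the function name sniff_bom and the returned codec names are choices made for this illustration, not part of lxml.

```python
import codecs

# Longer marks come first: the UTF-32-LE mark begins with the UTF-16-LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw):
    """Return a codec name if raw starts with a byte order mark, else None."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None
```

Reading the first few bytes of a file in binary mode, for example open(path, "rb").read(4), and passing them to sniff_bom would show whether the 0xff at position 0 is the start of a UTF-16 byte order mark.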
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
    wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/

Dennis
12/20/2016 3:35:02 PM
Possibly I will have to use a different method from lxml, like this:
http://stackoverflow.com/a/29057244/461887

Sayth
Sayth
12/20/2016 6:58:02 PM
Why is Windows so hard? I'm sort of running out of ideas; I've tried the methods in the docs, SO, etc.

Currently

        for xml_data in roots:
            parser_xml = etree.XMLParser()
            mytree = etree.parse(xml_data, parser_xml)

Returns
C:\Users\Sayth\Anaconda3\envs\race\python.exe C:/Users/Sayth/PycharmProjects/bs4race/race.py data/ -e *.xml
Traceback (most recent call last):
  File "C:/Users/Sayth/PycharmProjects/bs4race/race.py", line 100, in <module>
    data_attr(rootObs)
  File "C:/Users/Sayth/PycharmProjects/bs4race/race.py", line 55, in data_attr
    mytree = etree.parse(xml_data, parser_xml)
  File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src\lxml\lxml.etree.c:81110)
  File "src/lxml/parser.pxi", line 1832, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:118109)
  File "src/lxml/parser.pxi", line 1852, in lxml.etree._parseFilelikeDocument (src\lxml\lxml.etree.c:118392)
  File "src/lxml/parser.pxi", line 1747, in lxml.etree._parseDocFromFilelike (src\lxml\lxml.etree.c:117180)
  File "src/lxml/parser.pxi", line 1162, in lxml.etree._BaseParser._parseDocFromFilelike (src\lxml\lxml.etree.c:111907)
  File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:105102)
  File "src/lxml/parser.pxi", line 702, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:106769)
  File "src/lxml/lxml.etree.pyx", line 324, in lxml.etree._ExceptionContext._raise_if_stored (src\lxml\lxml.etree.c:12074)
  File "src/lxml/parser.pxi", line 373, in lxml.etree._FileReaderContext.copyToBuffer (src\lxml\lxml.etree.c:102431)
io.UnsupportedOperation: read

Process finished with exit code 1

Thoughts?

Sayth
Sayth
12/21/2016 9:03:48 AM
Sayth Renshaw wrote:

> Why is windows so hard. 

I don't think this has anything to do with the OS. Your xml_data is
probably not what you think it is. Compare:

$ python3
Python 3.4.3 (default, Nov 17 2016, 01:08:31) 
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> import lxml.etree
>>> lxml.etree.parse(sys.stdout)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 3239, in lxml.etree.parse (src/lxml/lxml.etree.c:69955)
  File "parser.pxi", line 1769, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:102257)
  File "parser.pxi", line 1789, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:102516)
  File "parser.pxi", line 1684, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:101442)
  File "parser.pxi", line 1134, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:97069)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91275)
  File "parser.pxi", line 679, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92426)
  File "lxml.etree.pyx", line 327, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:10196)
  File "parser.pxi", line 373, in lxml.etree._FileReaderContext.copyToBuffer (src/lxml/lxml.etree.c:89083)
io.UnsupportedOperation: not readable

That looks similar to what you get.
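
Both tracebacks end in copyToBuffer, which suggests the parser treated the argument as a file object and then could not read from it. The following sketch only illustrates that kind of check with the standard library; the helper name is_readable_file and the throwaway file are assumptions made for the example, and what xml_data actually holds in the original script is not shown in the thread.

```python
import os
import tempfile


def is_readable_file(obj):
    """Return True if obj is a file object that can be read from."""
    readable = getattr(obj, "readable", None)
    return callable(readable) and bool(readable())


# Demonstration with a throwaway file (hypothetical content).
fd, path = tempfile.mkstemp(suffix=".xml")
os.close(fd)

with open(path, "w") as handle:   # write-only: a parser cannot read this
    handle.write("<meeting/>")
    write_mode_ok = is_readable_file(handle)

with open(path, "rb") as handle:  # binary read mode: fine for a parser
    read_mode_ok = is_readable_file(handle)

os.remove(path)
print(write_mode_ok, read_mode_ok)  # prints "False True"
```

Printing type(xml_data) and, for file objects, xml_data.mode just before the etree.parse call would show whether the loop is handing over something like the first case.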

Peter
12/21/2016 9:36:39 AM
I'm not a beginning Python coder, but I'm not an advanced one either. I
can't see why I have this problem, though at this point I've probably
been looking at it too hard and for too long (several days), so maybe
I'm just too close to it.
Can one of you guys see the problem (besides my childish coding)? I'll
give you the code first, and then the problem.

def moving():
    import csv
    ls = []
    with open('E:\\Coding projects\\Pycharm\\Moving\\New Listings.csv', 'r') as infile:
        raw = csv.reader(infile)
        indata = list(raw)
        rows = indata.__len__()
    for i in range(rows):
        ls.append(indata[i])
    # sort: Description only, to make hyperlinks & find duplicates
    mergeSort(ls)
    # find & mark dups, make hyperlink if not dup
    for i in range(1, len(ls) - 1):
        if ls[i][0] == ls[i + 1][0]:
            ls[i][1] = "dup"
        else:
            # make hyperlink
            desc = ls[i][0]
            url = ls[i][1]
            ls[i][0] = '=HYPERLINK(\"' + url + '\",\"' + desc + '\")'
    # save to csv
    ls.insert(0, ["Description","url"])
    with open('E:\\Coding projects\\Pycharm\\Moving\\Moving 2017 out.csv', 'w') as outfile:
        writer = csv.writer(outfile, lineterminator='\n')
        writer.writerows(ls)

import operator
def mergeSort(L, compare = operator.lt):
    if len(L) < 2:
        return L[:]
    else:
        middle = int(len(L)/2)
        left = mergeSort(L[:middle], compare)
        right = mergeSort(L[middle:], compare)
        return merge(left, right, compare)

def merge(left, right, compare):
    result = []
    i,j = 0, 0
    while i < len(left) and j < len(right):
        if compare(left[i], right[j]):
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    while (i < len(left)):
        result.append(left[i])
        i += 1
    while (j < len(right)):
        result.append(right[j])
        j += 1
    return result

moving()

The problem is this: mergeSort does put the list in perfect order. I can
see that by looking at result on merge's final return to mergeSort, and
at left and right once back in mergeSort. Both the left half and the
right half are in order. But the list L is still in its original order,
and after mergeSort completes, ls is still in its original order. Maybe
there's some bonehead error causing this, but I just can't see it.

I can provide a sample csv file for input, if you want to execute this,
but to keep things simple, you can see the problem in just a table with
webpage titles in one column and their urls in the second column.

Any insights would be greatly appreciated.
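
Reading the code as posted, one likely explanation is that mergeSort builds and returns a new list and never modifies the list it is given, while moving() calls mergeSort(ls) without keeping the return value. The sketch below is a condensed copy of the two posted functions with a small made-up table (the rows are hypothetical) to show the difference.

```python
import operator


def merge(left, right, compare):
    """Merge two already sorted lists into a new sorted list."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if compare(left[i], right[j]):
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    result.extend(left[i:])
    result.extend(right[j:])
    return result


def mergeSort(L, compare=operator.lt):
    """Return a sorted copy of L; L itself is left untouched."""
    if len(L) < 2:
        return L[:]
    middle = len(L) // 2
    left = mergeSort(L[:middle], compare)
    right = mergeSort(L[middle:], compare)
    return merge(left, right, compare)


rows = [["b", "url2"], ["a", "url1"], ["c", "url3"]]

mergeSort(rows)         # sorted copy is discarded; rows is unchanged
unchanged = list(rows)

rows = mergeSort(rows)  # keep the sorted copy by rebinding the name
```

After the first call rows is still in its original order; after the second it is sorted. In moving(), the equivalent change would be ls = mergeSort(ls).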

Deborah
12/22/2016 12:55:18 AM