This repository has been archived by the owner on Sep 14, 2018. It is now read-only.

xml.etree.ElementTree can't read some XMLs #1300

Open
NValerij opened this issue Jun 17, 2016 · 3 comments

@NValerij

Hello.
There are 3 issues with this file.
Code to reproduce:

import xml.etree.ElementTree as ET
ET.parse('test.xml')
  • The BOM is not recognized: xmllib.Error: Syntax error at line 1: illegal data at start of file.
    OK, I can work around it with ET.parse(codecs.open(r'D:\NLC\LexicalSpanAnnotator\TestData\test.xml', 'r', encoding = 'utf-8'))
  • The symbol with code 8233 (U+2029, PARAGRAPH SEPARATOR) breaks parsing: xmllib.Error: Syntax error at line 3: illegal character in content.
    I can also work around it (load the text and replace this symbol with the &#8233; mnemonic, but that is not a good idea in general).
  • There is no empty line at the end of the file, and the message about it is very strange: xmllib.Error: Syntax error at line 4: data not in content
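
The three workarounds in the list above can be combined into one pre-processing step. The following is only a minimal sketch: the helper names `clean_xml_text` and `parse_xml_text` are hypothetical, it assumes the file has already been decoded to a Unicode string, and it has not been tested against the xmllib-based parser. As noted above, blindly replacing U+2029 is not safe in general (for example inside CDATA sections or comments, where character references are not expanded).

```python
import xml.etree.ElementTree as ET

BOM = u'\ufeff'
PARAGRAPH_SEPARATOR = u'\u2029'  # code point 8233


def clean_xml_text(text):
    """Hypothetical helper: apply the three workarounds to decoded XML text."""
    # 1. Drop a leading byte order mark, if the decoder left one in place.
    if text.startswith(BOM):
        text = text[len(BOM):]
    # 2. Replace U+2029 with a numeric character reference.
    #    Assumption: the character occurs only in text or attribute values.
    text = text.replace(PARAGRAPH_SEPARATOR, u'&#8233;')
    # 3. Make sure the document ends with a newline.
    if not text.endswith(u'\n'):
        text += u'\n'
    return text


def parse_xml_text(text):
    """Hypothetical helper: clean the text, then parse it with ElementTree."""
    return ET.fromstring(clean_xml_text(text))
```

For example, `parse_xml_text(codecs.open('test.xml', 'r', encoding='utf-8').read())` would return the root element; a conforming parser turns the character reference back into U+2029 in the parsed text.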

I've checked this file with the ElementTree parser from Python 3.4 (sorry, no 2.7 installed) and with the MSXML parser. Both parsed it without errors.

@slide
Contributor

slide commented Jul 29, 2016

This happens because we don't have pyexpat implemented.

@slide slide removed the untriaged label Jul 29, 2016
@kunom
Contributor

kunom commented Aug 12, 2016

To be a bit more detailed: the ElementTree implementation was patched to use xmllib instead of pyexpat as the underlying XML parser. xmllib has been deprecated since Python 2.0, but it is a pure-Python implementation, which makes integration into IronPython much easier.
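
To see which situation a given interpreter is in, one can check whether the pyexpat module is importable at all. This is a minimal sketch; `module_available` is a hypothetical helper name, and the check only shows whether the import succeeds, not which parser ElementTree actually uses.

```python
import importlib


def module_available(name):
    """Hypothetical helper: return True if the named module can be imported."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


# False on an interpreter that ships without the expat bindings.
has_pyexpat = module_available('pyexpat')
```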

See also the checkin comment of commit cb73948.

@kunom
Contributor

kunom commented Aug 16, 2016

See also #393.
