The XML specification lists a number of Unicode characters that are either illegal or "discouraged". Given a string, what is the best way to remove all of those characters from it?
Right now, my best bet is a regular expression, but it's a bit of a mouthful:
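import re
# \x09 (tab), \x0a (LF) and \x0d (CR) are legal in XML, so they are left out of the class.
illegal_xml_re = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x84\x86-\x9f\ud800-\udfff\ufdd0-\ufddf\ufffe-\uffff]')
clean = illegal_xml_re.sub(u'', dirty)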
(On a narrow build, Python 2.5 cannot represent code points above 0xFFFF as single characters, so I assumed there was no need to filter those.)
My question is: is this the best/proper way to do this?
Is there a more efficient or standard way?
UPDATE: Based on kaizer.se's comment, a more correct regular expression would have to be constructed on the fly, like this:
import re
import sys

# Code point ranges that XML 1.0 treats as illegal or discouraged.
illegal_unichrs = [ (0x00, 0x08), (0x0B, 0x0C), (0x0E, 0x1F), (0x7F, 0x84),
                    (0x86, 0x9F), (0xD800, 0xDFFF), (0xFDD0, 0xFDDF),
                    (0xFFFE, 0xFFFF),
                    (0x1FFFE, 0x1FFFF), (0x2FFFE, 0x2FFFF), (0x3FFFE, 0x3FFFF),
                    (0x4FFFE, 0x4FFFF), (0x5FFFE, 0x5FFFF), (0x6FFFE, 0x6FFFF),
                    (0x7FFFE, 0x7FFFF), (0x8FFFE, 0x8FFFF), (0x9FFFE, 0x9FFFF),
                    (0xAFFFE, 0xAFFFF), (0xBFFFE, 0xBFFFF), (0xCFFFE, 0xCFFFF),
                    (0xDFFFE, 0xDFFFF), (0xEFFFE, 0xEFFFF), (0xFFFFE, 0xFFFFF),
                    (0x10FFFE, 0x10FFFF) ]
# Skip ranges that a narrow build (sys.maxunicode == 0xFFFF) cannot represent.
illegal_ranges = [u"%s-%s" % (unichr(low), unichr(high))
                  for (low, high) in illegal_unichrs
                  if low < sys.maxunicode]
illegal_xml_re = re.compile(u'[%s]' % u''.join(illegal_ranges))
I really wish someone could point me to a C implementation of this, perhaps in one of the many Python XML libraries?
UPDATE: I've accepted Olemis Lang's answer, which has some tweaks to the version presented above. Please pay attention to the surrogate pair issue before you decide which version to use.
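To illustrate the surrogate pair issue: on a narrow Python 2 build, a character above U+FFFF is stored as a high surrogate followed by a low surrogate, so a blanket `\ud800-\udfff` range also deletes valid characters. One possible workaround, sketched below, is to strip only unpaired surrogates. The name `lone_surrogate_re` is my own, the snippet is written for Python 3 (where the two-code-unit string merely simulates a narrow-build string), and it is not necessarily what the accepted answer does.

```python
import re

# Match a low surrogate not preceded by a high surrogate,
# or a high surrogate not followed by a low surrogate.
lone_surrogate_re = re.compile(
    "(?<![\ud800-\udbff])[\udc00-\udfff]"
    "|[\ud800-\udbff](?![\udc00-\udfff])"
)

# Simulated narrow-build string: a valid pair (\ud83d\ude00),
# then a lone high surrogate and a lone low surrogate.
dirty = "a\ud83d\ude00b\ud800c\udc00"
clean = lone_surrogate_re.sub("", dirty)
# The valid pair survives; only the two unpaired surrogates are removed.
```

On a narrow build this pattern would take the place of the `(0xD800, 0xDFFF)` entry, with the remaining ranges handled by the character class as before.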