How To Decode Unicode String That Is Read From A File In Python?
I have a file containing UTF-16 strings. When I try to read the text, extra quotes are added and the string looks like "b'\\xff\\xfeA\\x00'". The inbuilt .decode function
Solution 1:
Try this:
str.encode().decode()
Solution 2:
It looks like the file has been created by writing bytes literals to it, something like this:
some_bytes = b'Hello world'
with open('myfile.txt', 'w') as f:
    f.write(str(some_bytes))
This gets around the fact that attempting to write bytes to a file opened in text mode raises an error, but at the cost that the file now contains "b'Hello world'" (note the 'b' inside the quotes).
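A minimal sketch of what goes wrong, using an in-memory text stream (io.StringIO) as a stand-in for a file opened in text mode. The variable names are illustrative, and the bytes are the ones quoted in the question:

```python
import io

some_bytes = b'\xff\xfeA\x00'   # UTF-16 bytes for 'A', including the BOM
text_stream = io.StringIO()     # stand-in for a file opened with mode 'w'

# Writing bytes to a text stream is rejected with a TypeError.
try:
    text_stream.write(some_bytes)
    write_failed = False
except TypeError:
    write_failed = True

# str() does not decode the bytes; it returns their repr, prefix and all.
stringified = str(some_bytes)
text_stream.write(stringified)

print(write_failed)   # True
print(stringified)    # b'\xff\xfeA\x00'
```

The text that ends up in the stream is the literal characters b, ', \, x, f, f and so on, not the letter A.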
The solution is to decode the bytes to str before writing:
some_bytes = 'Hello world'.encode('utf-16')  # example bytes that really are UTF-16
my_str = some_bytes.decode('utf-16')  # or whatever the encoding of the bytes might be
with open('myfile.txt', 'w') as f:
    f.write(my_str)
or open the file in binary mode and write the bytes directly:
some_bytes = b'Hello world'
with open('myfile.txt', 'wb') as f:
    f.write(some_bytes)
Note you will need to provide the correct encoding if opening the file in text mode:
with open('myfile.txt', encoding='utf-16') as f:  # Be sure to use the correct encoding
    text = f.read()
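As a rough end-to-end check of the binary-write approach, the sketch below writes UTF-16 bytes in binary mode and reads them back in text mode. It assumes the bytes are UTF-16 with a byte order mark, and it uses a temporary directory so that no real file is touched:

```python
import os
import tempfile

text = 'Hello world'
some_bytes = text.encode('utf-16')  # includes a byte order mark (BOM)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'myfile.txt')

    # Binary mode: the bytes are written unchanged.
    with open(path, 'wb') as f:
        f.write(some_bytes)

    # Text mode: the encoding must match the bytes that were written.
    with open(path, encoding='utf-16') as f:
        round_tripped = f.read()

print(round_tripped == text)  # True
```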
Consider running Python with the -b or -bb flag set to raise a warning or exception respectively to detect attempts to stringify bytes.
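If the file already exists in the stringified form and cannot be regenerated, one possible workaround (not part of the answer above) is to parse the text back into a bytes object with ast.literal_eval and then decode it. This sketch assumes the file holds exactly one bytes literal and that the underlying bytes are UTF-16; file_contents stands in for what reading the file would return:

```python
import ast

# What the file contains as text: the repr of a bytes object.
file_contents = "b'\\xff\\xfeA\\x00'"

# literal_eval safely evaluates the literal back into real bytes.
recovered_bytes = ast.literal_eval(file_contents)

# Decode with the encoding the bytes were originally in.
recovered_text = recovered_bytes.decode('utf-16')

print(recovered_text)  # A
```

ast.literal_eval only accepts Python literals, so it is a safer choice here than eval.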