Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
Our proposal improves the robustness of pandas' text importers, in particular the read_csv() function. Currently, an encoding can be set explicitly; otherwise it defaults to None, which seems to resolve to 'utf-8', though this may be platform-specific. Unfortunately, CSV files often come in other encodings. For example, Excel does not use UTF-8 by default, and users often pay no attention to the encoding when saving, so we have to handle files in different encodings. pandas raises a UnicodeDecodeError whenever an encoding other than 'utf-8' is required, even though text editors detect the right encoding automatically.
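The failure can be reproduced without pandas, since it comes from the byte-to-text decoding step itself. The snippet below is a minimal sketch using made-up sample data; ISO-8859-1 is chosen only as an illustrative non-UTF-8 encoding.

```python
# Sample CSV content written as ISO-8859-1 bytes (illustrative data, not from a real file).
raw = "name,city\nRené,Orléans\n".encode("ISO-8859-1")

try:
    raw.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError as exc:
    # The accented characters are single bytes that are not valid UTF-8 sequences.
    decoded_as_utf8 = False
    print(f"utf-8 decoding failed: {exc.reason} (byte offset {exc.start})")

# Decoding with the encoding the bytes were written in works as expected.
text = raw.decode("ISO-8859-1")
print(text)
```

The same bytes decode cleanly once the correct encoding is supplied, which is why the missing piece is detection rather than parsing.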
Feature Description
Several resources suggest detecting the encoding automatically with chardet.detect(). In our experiments, the following code correctly recognized the encoding ('utf-8' or 'ISO-8859-1'):
import io

import chardet
import pandas as pd

filename = 'path/to/some/file.csv'  # source file
encoding = None  # encoding can be predefined or not

# Read the raw bytes once so they can be used for both detection and parsing.
with open(filename, 'rb') as file:
    data = file.read()

if encoding is None:  # if not explicitly given, detect the encoding from the bytes
    encoding = chardet.detect(data)['encoding']

df = pd.read_csv(io.BytesIO(data), encoding=encoding)
This could be offered inside pandas directly as an additional encoding='auto' case, or even in the None case in place of the current default. We do not know whether automatic detection might fail in some cases, but in our experiments it did a much better job than the current default decoding. Therefore, we would like to propose this feature.
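One way the 'auto' case might be resolved is sketched below. This is not pandas code: resolve_encoding, its detect parameter, and the 'utf-8' fallback are hypothetical names and choices made for illustration. The fallback covers the case where chardet.detect() reports no encoding (its 'encoding' entry can be None, for example on empty input).

```python
from typing import Callable, Optional


def resolve_encoding(
    data: bytes,
    encoding: Optional[str] = "auto",
    detect: Optional[Callable[[bytes], dict]] = None,
    fallback: str = "utf-8",
) -> Optional[str]:
    """Return the encoding to use for `data` (hypothetical helper).

    Only encoding='auto' triggers detection; any other value, including
    None, is passed through unchanged so existing behavior is preserved.
    """
    if encoding != "auto":
        return encoding
    if detect is None:
        # Imported lazily so chardet is only required when detection is used.
        import chardet

        detect = chardet.detect
    guess = detect(data).get("encoding")
    # Assumption: fall back to 'utf-8' when the detector cannot decide.
    return guess or fallback
```

With such a helper, the call from the example above would become pd.read_csv(io.BytesIO(data), encoding=resolve_encoding(data)). Keeping detection opt-in through 'auto' avoids changing results for callers who rely on the current default.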
Alternative Solutions
The alternative is the status quo: users must specify the correct encoding explicitly to avoid a UnicodeDecodeError.
Additional Context
No response