fix: prioritize utf-8 over system locale to prevent crashes on Windows #549
+49
−1
The Problem
Currently, `gitingest` crashes (or skips files with an error message) on Windows when processing valid UTF-8 files that contain specific characters (like the smart quote `”` or certain math symbols) located past the first 1024 bytes.

Why this happens:
- The encoding detection prefers `locale.getpreferredencoding()` (often `cp1252` on Windows) over `utf-8`.
- `cp1252` is "greedy": if the first 1024 bytes of a file are standard ASCII, `cp1252` claims the file is valid.
- When the rest of the file contains a byte that is undefined in `cp1252` (e.g., byte `0x9D` from a UTF-8 smart quote), decoding raises a `UnicodeDecodeError`.

The Solution
I modified `src/gitingest/utils/file_utils.py` to prioritize `utf-8` in the encoding detection list.

Since UTF-8 is stricter than legacy encodings, checking it first ensures that:
- […] `cp1252`.

Benefits
- […] 🚀).

How to Reproduce
I created a standalone reproduction script. You can run this on Linux/Mac to simulate the Windows environment and witness the crash:
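A minimal sketch of such a reproduction, not the script from this PR. It assumes the detection step decodes only the first 1024 bytes and returns the first candidate encoding that succeeds; `detect_encoding`, `make_sample`, and `reads_cleanly` are hypothetical names, and the Windows locale is simulated by passing `cp1252` as the first candidate.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def detect_encoding(path: Path, candidates: list[str], chunk_size: int = 1024) -> str | None:
    """Hypothetical detector: return the first candidate that decodes the head of the file."""
    with path.open("rb") as fh:
        head = fh.read(chunk_size)
    for encoding in candidates:
        try:
            head.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None


def make_sample(directory: Path) -> Path:
    """Write 2000 ASCII bytes followed by U+201D, which UTF-8 encodes as E2 80 9D."""
    path = directory / "sample.txt"
    path.write_bytes(b"a" * 2000 + "\u201d".encode("utf-8"))
    return path


def reads_cleanly(path: Path, candidates: list[str]) -> bool:
    """Detect on the head, then read the whole file with the detected encoding."""
    encoding = detect_encoding(path, candidates)
    try:
        path.read_text(encoding=encoding)
    except UnicodeDecodeError:
        # Byte 0x9D is undefined in cp1252, so the full read fails.
        return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = make_sample(Path(tmp))
        # Simulated Windows ordering: locale encoding first.
        print("cp1252 first:", reads_cleanly(sample, ["cp1252", "utf-8"]))
        # Ordering proposed in this PR: utf-8 first.
        print("utf-8 first: ", reads_cleanly(sample, ["utf-8", "cp1252"]))
```

With `cp1252` first, the ASCII-only head passes detection and the full read then fails on `0x9D`; with `utf-8` first, the same file reads without error.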
Verification
- […] `tests/test_windows_encoding.py` that mocks the Windows environment.
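A test along these lines could look like the sketch below. It is not the content of `tests/test_windows_encoding.py`: `candidate_encodings` and `decode_bytes` are hypothetical helpers standing in for the real detection code, and the Windows environment is simulated by patching `locale.getpreferredencoding` to return `cp1252`. The second test rests on the assumption that legacy `cp1252` files should still decode through the fallback.

```python
from __future__ import annotations

import locale
from unittest import mock


def candidate_encodings() -> list[str]:
    """Hypothetical helper: UTF-8 first, then the locale encoding, without duplicates."""
    preferred = locale.getpreferredencoding(False).lower()
    return list(dict.fromkeys(["utf-8", preferred]))


def decode_bytes(data: bytes) -> tuple[str, str]:
    """Return the decoded text and the first candidate encoding that accepted it."""
    for encoding in candidate_encodings():
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the data")


def test_utf8_wins_under_windows_locale() -> None:
    # ASCII prefix longer than 1024 bytes, then a UTF-8 smart quote (E2 80 9D).
    data = b"a" * 2000 + "\u201d".encode("utf-8")
    with mock.patch("locale.getpreferredencoding", return_value="cp1252"):
        text, encoding = decode_bytes(data)
    assert encoding == "utf-8"
    assert text.endswith("\u201d")


def test_legacy_cp1252_still_falls_back() -> None:
    # 0xE9 on its own is invalid UTF-8 but is "é" in cp1252.
    data = "caf\u00e9".encode("cp1252")
    with mock.patch("locale.getpreferredencoding", return_value="cp1252"):
        text, encoding = decode_bytes(data)
    assert encoding == "cp1252"
    assert text == "caf\u00e9"
```

Patching the locale call rather than the platform keeps the test runnable on Linux and macOS, which matches the intent of simulating Windows described above.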