fix: prioritize utf-8 over system locale to prevent crashes on Windows #549

CalumRakk · 2025-12-01T07:48:12Z

The Problem

Currently, gitingest crashes (or skips files with an error message) on Windows when processing valid UTF-8 files that contain specific characters (like smart quotes ” or certain math symbols) located past the first 1024 bytes.

Why this happens:

The system currently prioritizes locale.getpreferredencoding() (often cp1252 on Windows) over utf-8.
cp1252 is "greedy": if the first 1024 bytes of a file are standard ASCII, cp1252 claims the file is valid.
However, when reading the full file, Python raises a UnicodeDecodeError as soon as it encounters a byte that is undefined in cp1252 (e.g., byte 0x9D, the last byte of E2 80 9D, the UTF-8 encoding of the smart quote ”).
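
The failure mode described above can be reproduced with the standard library alone. This is a minimal sketch, not gitingest's code: the 1024-byte probe size is taken from the description above, and the variable names are illustrative.

```python
# Illustrative only: a 1024-byte probe passes under cp1252,
# while decoding the whole content fails.
PROBE_SIZE = 1024  # assumed from the description above

# 1050 ASCII bytes, a space, then U+201D (right double quotation mark).
data = ("A" * 1050 + " \u201d").encode("utf-8")

# UTF-8 encodes U+201D as E2 80 9D; 0x9D has no mapping in Python's cp1252 codec.
assert "\u201d".encode("utf-8") == b"\xe2\x80\x9d"

# The probe sees only ASCII, so cp1252 appears to be a valid choice.
probe_ok = True
try:
    data[:PROBE_SIZE].decode("cp1252")
except UnicodeDecodeError:
    probe_ok = False

# Decoding the full content reaches byte 0x9D and raises.
full_error = None
try:
    data.decode("cp1252")
except UnicodeDecodeError as exc:
    full_error = exc

print(probe_ok)          # True
print(full_error.start)  # 1053, the offset of byte 0x9D
```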

The Solution

I modified src/gitingest/utils/file_utils.py to prioritize utf-8 in the encoding detection list.

Since UTF-8 is stricter than legacy encodings, checking it first ensures that:

Valid UTF-8 files are correctly identified immediately.
We avoid the "false positive" detection of cp1252.
If a file actually uses a legacy encoding, the UTF-8 check will fail quickly, and detection will gracefully fall back to the system locale.
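
The ordering change can be sketched as follows. This is a hypothetical helper, not the actual code in src/gitingest/utils/file_utils.py: the names `candidate_encodings` and `detect_encoding` are assumptions for illustration, and the sketch decodes the whole byte string rather than a fixed-size probe.

```python
import locale
from typing import List, Optional


def candidate_encodings() -> List[str]:
    """Hypothetical helper: UTF-8 first, then the system locale encoding."""
    ordered: List[str] = []
    for enc in ("utf-8", locale.getpreferredencoding(False)):
        # Skip duplicates such as "UTF-8" when the locale is already UTF-8.
        if enc.lower() not in (e.lower() for e in ordered):
            ordered.append(enc)
    return ordered


def detect_encoding(data: bytes, encodings: List[str]) -> Optional[str]:
    """Return the first encoding that decodes `data` without error, else None."""
    for enc in encodings:
        try:
            data.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
        return enc
    return None


utf8_bytes = "smart quote: \u201d".encode("utf-8")
legacy_bytes = "caf\u00e9".encode("cp1252")  # b'caf\xe9' is not valid UTF-8

print(detect_encoding(utf8_bytes, ["utf-8", "cp1252"]))    # utf-8
print(detect_encoding(legacy_bytes, ["utf-8", "cp1252"]))  # cp1252
```

Because UTF-8 rejects most byte sequences produced by single-byte legacy encodings, trying it first costs little for legacy files and avoids the false positive for valid UTF-8 files.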

Benefits

CRITICAL: Prevents crashes/read errors on Windows with files containing smart quotes, em dashes, etc.
Feature: Enables full Emoji support on Windows (prevents "mojibake" like ðŸš€).
Consistency: Ensures behavior is identical across Linux, macOS, and Windows.
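
The mojibake mentioned above is what results when UTF-8 bytes are decoded as cp1252. A short standard-library illustration (not gitingest code):

```python
rocket = "\U0001F680"  # the rocket emoji
raw = rocket.encode("utf-8")  # four bytes: F0 9F 9A 80

# cp1252 maps each of the four bytes to a separate character.
mojibake = raw.decode("cp1252")
print(mojibake)  # ðŸš€

# Decoding with the correct encoding round-trips cleanly.
print(raw.decode("utf-8") == rocket)  # True
```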

How to Reproduce

I created a standalone reproduction script. You can run this on Linux/Mac to simulate the Windows environment and witness the crash:

import sys
from pathlib import Path
from unittest.mock import patch

# Add src to path to import internal modules
sys.path.insert(0, str(Path(__file__).parent / "src"))

from gitingest.schemas.filesystem import FileSystemNode, FileSystemNodeType

def reproduction():
    print("--- Simulating Windows (CP1252) Environment ---")

    # Mock Windows environment with CP1252 encoding
    with patch("locale.getpreferredencoding", return_value="cp1252"), \
         patch("platform.system", return_value="Windows"):

        # Create content:
        # 1. Fill the initial detection buffer (1024 bytes) with ASCII so the early check passes.
        # 2. Add emojis and a smart quote (”).
        # Note: the UTF-8 encoding of the smart quote (”) ends in byte 0x9D, which is undefined in CP1252 and causes the crash.
        content = (
            "A" * 1050
            + "\n\n"
            + "# Modern Features & Emojis 🚀\n"
            + "This project uses modern features:\n"
            + "- ✨ Magic\n"
            + "- 🔥 Speed\n"
            + "And smart quotes: ” (THIS TRIGGERS THE CRASH)"
        )

        filename = "reproduce_crash.md"
        temp_file = Path(filename)
        temp_file.write_text(content, encoding="utf-8")

        try:
            print(f"Attempting to read '{filename}'...")

            node = FileSystemNode(
                name=filename,
                type=FileSystemNodeType.FILE,
                path_str=str(temp_file),
                path=temp_file,
            )

            result = node.content

            if "Error reading file" in result:
                print("\n[❌ CRASH] Windows failed to read Emojis/Special characters.")
                print(f"Error details:\n{result}")
            else:
                print("\n[✅ SUCCESS] File read correctly!")
                print("Last lines read:")
                print(result[-150:])

        finally:
            if temp_file.exists():
                temp_file.unlink()

if __name__ == "__main__":
    reproduction()

Verification

Added a regression test, tests/test_windows_encoding.py, that mocks the Windows environment.
Verified that existing tests pass.
Verified code style (pre-commit hooks passed).

fix: prioritize utf-8 over system locale to prevent crashes on Windows

78cebcb
