@CalumRakk CalumRakk commented Dec 1, 2025

The Problem

Currently, gitingest fails on Windows (it crashes or skips the file with an error message) when it processes a valid UTF-8 file that contains certain characters, such as smart quotes or some math symbols, past the first 1024 bytes.

Why this happens:

  1. The system currently prioritizes locale.getpreferredencoding() (often cp1252 on Windows) over utf-8.
  2. cp1252 is "greedy": if the first 1024 bytes of a file are plain ASCII, they also decode cleanly as cp1252, so cp1252 claims the file is valid.
  3. However, when the full file is read and Python encounters a byte that is undefined in cp1252 (e.g., byte 0x9D, the last byte of the UTF-8 encoding of the right smart quote ”), it raises a UnicodeDecodeError.
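
The three steps above can be reproduced with the standard library alone. This is a minimal sketch and does not call gitingest; the 1024-byte sample size is taken from the description above, and the variable names are illustrative only.

```python
# U+201D (right double quotation mark) encodes to E2 80 9D in UTF-8.
# 0x9D has no mapping in Python's cp1252 codec.
data = b"A" * 1050 + "\u201d".encode("utf-8")

# Step 2: the first 1024 bytes are pure ASCII, so cp1252 accepts the sample.
sample = data[:1024]
try:
    sample.decode("cp1252")
    sample_ok = True
except UnicodeDecodeError:
    sample_ok = False

# Step 3: decoding the whole file as cp1252 fails on the 0x9D byte.
bad_byte = None
try:
    data.decode("cp1252")
    full_ok = True
except UnicodeDecodeError as exc:
    full_ok = False
    bad_byte = data[exc.start]  # the offending byte value

print(sample_ok, full_ok, hex(bad_byte) if bad_byte is not None else None)
```

The same bytes decode without error as UTF-8, which is why checking UTF-8 first avoids the failure.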

The Solution

I modified src/gitingest/utils/file_utils.py to prioritize utf-8 in the encoding detection list.

Since UTF-8 is stricter than legacy encodings, checking it first ensures that:

  1. Valid UTF-8 files are correctly identified immediately.
  2. We avoid the "false positive" detection of cp1252.
  3. If a file actually uses a legacy encoding, the UTF-8 check fails quickly and detection gracefully falls back to the system locale.
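
The ordering described above could be sketched as follows. This is an illustration under stated assumptions, not the code in src/gitingest/utils/file_utils.py: the function name pick_encoding and its parameters are hypothetical, and the locale encoding is passed in explicitly so the example behaves the same on every platform.

```python
import codecs
from typing import Optional


def pick_encoding(sample: bytes, locale_encoding: str,
                  is_complete: bool = False) -> Optional[str]:
    """Hypothetical helper: return the first candidate that decodes `sample`.

    UTF-8 is tried first because it is strict; the locale encoding is
    only a fallback.
    """
    candidates = ["utf-8"]
    # Skip the locale encoding if it is just another name for UTF-8.
    if codecs.lookup(locale_encoding).name != "utf-8":
        candidates.append(locale_encoding)

    for enc in candidates:
        decoder = codecs.getincrementaldecoder(enc)()
        try:
            # final=False tolerates a multi-byte character that is cut
            # off at the end of a partial sample.
            decoder.decode(sample, final=is_complete)
        except UnicodeDecodeError:
            continue
        return enc
    return None


# ASCII-only sample: UTF-8 now wins instead of cp1252.
print(pick_encoding(b"A" * 1024, "cp1252"))
# Genuine cp1252 text: UTF-8 rejects it, the locale encoding is used.
print(pick_encoding("café au lait".encode("cp1252"), "cp1252"))
```

Because pure ASCII is valid in both encodings, the order of the candidate list alone decides which one is chosen for an ASCII-only sample.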

Benefits

  • CRITICAL: Prevents crashes/read errors on Windows with files containing smart quotes, em dashes, etc.
  • Feature: Enables full Emoji support on Windows (prevents "mojibake" such as ðŸš€ appearing in place of 🚀).
  • Consistency: Ensures behavior is identical across Linux, macOS, and Windows.
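
For readers unfamiliar with mojibake, the short sketch below shows how it arises. It uses only the standard library, and the variable names are illustrative.

```python
rocket = "\U0001F680"          # the rocket emoji
raw = rocket.encode("utf-8")   # bytes F0 9F 9A 80

# All four bytes happen to be defined in cp1252, so decoding raises no
# error and silently produces the wrong text.
garbled = raw.decode("cp1252")
restored = raw.decode("utf-8")

print(repr(garbled), repr(restored))
```

Unlike the smart-quote case, this produces no exception at all, so the corruption goes unnoticed unless UTF-8 is tried first.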

How to Reproduce

I created a standalone reproduction script. Save it in the repository root (it adds ./src to the import path) and run it on Linux or macOS to simulate the Windows environment and witness the crash:

import sys
from pathlib import Path
from unittest.mock import patch

# Add src to path to import internal modules
sys.path.insert(0, str(Path(__file__).parent / "src"))

from gitingest.schemas.filesystem import FileSystemNode, FileSystemNodeType

def reproduction():
    print("--- Simulating Windows (CP1252) Environment ---")

    # Mock Windows environment with CP1252 encoding
    with patch("locale.getpreferredencoding", return_value="cp1252"), \
         patch("platform.system", return_value="Windows"):

        # Create content:
        # 1. Fill initial buffer (1024 bytes) to bypass early detection.
        # 2. Add Emojis and a Smart Quote (”).
        # Note: The smart quote (”) contains byte 0x9D, which is undefined in CP1252 and causes the crash.
        content = (
            "A" * 1050
            + "\n\n"
            + "# Modern Features & Emojis 🚀\n"
            + "This project uses modern features:\n"
            + "- ✨ Magic\n"
            + "- 🔥 Speed\n"
            + "And smart quotes: ” (THIS TRIGGERS THE CRASH)"
        )

        filename = "reproduce_crash.md"
        temp_file = Path(filename)
        temp_file.write_text(content, encoding="utf-8")

        try:
            print(f"Attempting to read '{filename}'...")

            node = FileSystemNode(
                name=filename,
                type=FileSystemNodeType.FILE,
                path_str=str(temp_file),
                path=temp_file,
            )

            result = node.content

            if "Error reading file" in result:
                print("\n[❌ CRASH] Windows failed to read Emojis/Special characters.")
                print(f"Error details:\n{result}")
            else:
                print("\n[✅ SUCCESS] File read correctly!")
                print("Last lines read:")
                print(result[-150:])

        finally:
            if temp_file.exists():
                temp_file.unlink()

if __name__ == "__main__":
    reproduction()

Verification

  • Added a regression test tests/test_windows_encoding.py that mocks the Windows environment.
  • Verified that existing tests pass.
  • Verified code style (pre-commit hooks passed).
