2026-03-17-xldiff-design.md - xl-cli-tools - CLI tools for viewing and editing Excel files

2026-03-17-xldiff-design.md (10600B)
      1 # xldiff — Excel Sheet Comparison Tool
      2 
      3 Port of [go-xldiff](../../../go-xldiff) to Rust, leveraging polars for type-aware comparison.
      4 
      5 ## CLI Interface
      6 
      7 ```
      8 xldiff <FILE1>[:<SHEET>] <FILE2>[:<SHEET>] [flags]
      9 ```
     10 
     11 Two positional file arguments, each optionally suffixed with `:SheetName` or `:0` (0-based index). The same file may appear twice with different sheets (e.g., `report.xlsx:Q1 report.xlsx:Q2`).
     12 
     13 Sheet selection uses `rfind(':')` to split on the last colon, so Windows drive-letter paths (e.g., `C:\file.xlsx`) are handled correctly. The suffix is tried as a sheet name first, then as a 0-based numeric index.
     14 
     15 ### Flags
     16 
     17 | Flag | Type | Default | Description |
     18 |------|------|---------|-------------|
     19 | `--key <COLS>` | string | *(none)* | Comma-separated key columns (header name or column letter). Omit for positional mode. |
     20 | `--cols <COLS>` | string | *(all)* | Columns to compare (header name or column letter). Key columns always included. |
     21 | `--skip <N>[,<M>]` | string | `0` | Rows to skip before header. Single value applies to both files; comma pair sets each independently. |
     22 | `--no-header` | bool | false | Treat first row as data; generate synthetic `column_0`, `column_1`, … headers (matching reader.rs convention). |
     23 | `--tolerance <F>` | f64 | *(none)* | Numeric tolerance for float comparisons (e.g., `0.01`). |
     24 | `--format <FMT>` | string | `text` | Output format: `text`, `markdown`, `json`, `csv`. |
     25 | `--no-color` | bool | false | Force-disable ANSI colors (auto-detected by default). |
     26 
     27 **Column resolution priority** (consistent with `filter.rs::resolve_column()`):
     28 1. Exact header name match
     29 2. Case-insensitive header name match
     30 3. Column letter (A, B, AA) if purely alphabetic
     31 
     32 This means `--key A` matches a header named "A" before column position 0. To select the first column by position, use `--cols` with letter "A" only when no header is named "A".
     33 
     34 ### Exit Codes
     35 
     36 xldiff follows the `diff(1)` convention, which differs from xlcat/xlset/xlfilter:
     37 
     38 - **0** — no differences
     39 - **1** — differences found
     40 - **2** — error (invalid arguments or runtime failure)
     41 
     42 The other tools in this repo use exit 1 for runtime errors and exit 2 for argument errors. xldiff merges both error types into exit 2 because exit 1 is reserved for "differences found" — the standard convention for diff tools, enabling `if ! xldiff ...; then ...` in scripts.
     43 
     44 ## Data Structures
     45 
     46 ```rust
     47 pub struct DiffResult {
     48     pub headers: Vec<String>,       // Columns that were compared (filtered set if --cols used)
     49     pub key_columns: Vec<String>,
     50     pub added: Vec<DiffRow>,
     51     pub removed: Vec<DiffRow>,
     52     pub modified: Vec<ModifiedRow>,
     53     pub source_a: SheetSource,
     54     pub source_b: SheetSource,
     55 }
     56 
     57 pub struct SheetSource {
     58     pub file_name: String,
     59     pub sheet_name: String,
     60 }
     61 
     62 pub struct DiffRow {
     63     pub values: Vec<String>,        // Cell values as display strings
     64 }
     65 
     66 pub struct ModifiedRow {
     67     pub key: Vec<String>,
     68     pub changes: Vec<CellChange>,
     69 }
     70 
     71 pub struct CellChange {
     72     pub column: String,
     73     pub old_value: String,
     74     pub new_value: String,
     75 }
     76 ```
     77 
     78 `DiffResult` is format-agnostic. Each output formatter consumes it independently. `SheetSource` carries file/sheet metadata for headers.
     79 
     80 Values are converted to display strings during diff construction. Type-aware comparison (tolerance, numeric equality) happens during the diff phase; the result stores only the string representations.
     81 
     82 `headers` reflects the effective column set after `--cols` filtering, so output formatters use it directly without further filtering.
     83 
     84 ## Diff Algorithm
     85 
     86 ### Mismatched Headers
     87 
     88 Before diffing, the tool checks whether both DataFrames share the same column names (after `--cols` filtering). Behavior:
     89 
     90 - **Key-based mode:** key columns must exist in both files or the tool exits with an error. Non-key columns are compared by name — columns present in only one file are reported as added/removed columns in a warning on stderr, and comparison proceeds on the intersection. `DiffResult.headers` contains the union.
     91 - **Positional mode:** columns are matched by position. If column counts differ, the tool warns on stderr and pads the shorter side with empty values.
     92 
     93 ### Positional Mode (no `--key`)
     94 
     95 1. Convert each row to a string: join all compared columns with a null-byte separator.
     96 2. Build frequency maps (`HashMap<RowHash, usize>`) for both tables.
     97 3. Walk table A: if B has remaining copies, consume one; otherwise mark as removed.
     98 4. Walk table B: if A has remaining copies, consume one; otherwise mark as added.
     99 5. No "modified" rows — any cell change produces a different hash.
    100 
    101 ### Key-Based Mode (`--key` provided)
    102 
    103 The implementation uses HashMap-based row matching (like the Go version) rather than polars joins. This avoids complications with join column type mismatches, row reordering, and duplicate-key cartesian products. Polars is used for reading and column selection, not for the join itself.
    104 
    105 1. Resolve key column names/letters to column indices.
    106 2. Build `HashMap<Vec<String>, RowData>` for both tables, keyed by the concatenated key column values.
    107 3. For each key in A not in B → removed.
    108 4. For each key in B not in A → added.
    109 5. For each key in both → compare non-key columns cell by cell:
    110    - **Numeric (Float64):** if `--tolerance` is set, equal when `|a - b| <= tolerance`; otherwise exact. NaN == NaN is treated as equal; NaN vs non-NaN is a difference.
    111    - **Integer vs Float cross-type:** if one file infers a column as Int64 and the other as Float64, both values are compared as f64. Tolerance applies if set.
    112    - **Integer (Int64):** exact comparison (tolerance does not apply to pure integer columns).
    113    - **String:** exact string comparison.
    114    - **Datetime:** compare as timestamps.
    115    - **Boolean:** exact match.
    116    - **Null vs value:** treated as different. Null vs null: equal.
    117 6. Rows with any cell difference become a `ModifiedRow` with a list of `CellChange`s.
    118 7. Duplicate keys produce a warning on stderr. Last occurrence wins (same as Go).
    119 
    120 ### Column Filtering (`--cols`)
    121 
    122 Applied before diffing. Both DataFrames are narrowed to the selected columns. Key columns are always retained, even when omitted from `--cols`.
    123 
    124 ### Tolerance
    125 
    126 - Applies to Float64 columns and Int64-vs-Float64 cross-type comparisons in key-based mode.
    127 - Positional mode hashes exact string values; tolerance does not apply.
    128 - If `--tolerance` is set but no numeric columns exist, it is a silent no-op.
    129 
    130 ## Output Formats
    131 
    132 ### No Differences
    133 
    134 When the diff result is empty (no added, removed, or modified rows):
    135 
    136 - **Text:** prints "No differences found." to stdout.
    137 - **Markdown:** prints "No differences found."
    138 - **JSON:** `{"added":[],"removed":[],"modified":[]}`
    139 - **CSV:** header row only (no data rows).
    140 
    141 All formats exit 0.
    142 
    143 ### Text (default, colored)
    144 
    145 ```
    146 --- Sheet1 (old.xlsx)
    147 +++ Sheet1 (new.xlsx)
    148 
    149 Added: 1 | Removed: 1 | Modified: 1
    150 
    151 - ID: "3"  Name: "Charlie"  Score: "70"
    152 + ID: "4"  Name: "Dana"  Score: "85"
    153 
    154 ~ ID: "1"
    155     Score: "90" → "95"
    156 ```
    157 
    158 ANSI colors: red (`-`), green (`+`), yellow (`~`). Auto-detects TTY via `std::io::IsTerminal`; `--no-color` overrides.
    159 
    160 ### Markdown (`--format markdown`)
    161 
    162 ```markdown
    163 ## Added (1)
    164 | ID | Name | Score |
    165 |----|------|-------|
    166 | 4  | Dana | 85    |
    167 
    168 ## Removed (1)
    169 | ID | Name | Score |
    170 |----|------|-------|
    171 | 3  | Charlie | 70 |
    172 
    173 ## Modified (1)
    174 | Key (ID) | Column | Old | New |
    175 |----------|--------|-----|-----|
    176 | 1        | Score  | 90  | 95  |
    177 ```
    178 
    179 Clean, uncolored tables. The added/removed tables can reuse `formatter.rs::render_table_header()` and `render_table_rows()` for consistent alignment with xlcat/xlfilter output. The modified table is diff-specific.
    180 
    181 ### JSON (`--format json`)
    182 
    183 ```json
    184 {
    185   "added": [{"ID": "4", "Name": "Dana", "Score": "85"}],
    186   "removed": [{"ID": "3", "Name": "Charlie", "Score": "70"}],
    187   "modified": [
    188     {
    189       "key": {"ID": "1"},
    190       "changes": [{"column": "Score", "old": "90", "new": "95"}]
    191     }
    192   ]
    193 }
    194 ```
    195 
    196 Empty arrays are `[]`, never null. Pretty-printed with 2-space indentation.
    197 
    198 ### CSV (`--format csv`)
    199 
    200 ```csv
    201 _status,ID,Name,Score,_old_ID,_old_Name,_old_Score
    202 added,4,Dana,85,,,
    203 removed,3,Charlie,70,,,
    204 modified,1,,95,,,90
    205 ```
    206 
    207 `_status` column indicates row type. `_old_*` columns hold previous values. For modified rows: new values go in the main columns, old values in `_old_*` columns, only for changed cells. Unchanged cells are empty in both positions. Key column values appear in the main columns to identify the row.
    208 
    209 ## Module Layout
    210 
    211 ### New Files
    212 
    213 - **`src/diff.rs`** — Core diff logic: `DiffResult`, positional/key-based algorithms, tolerance, duplicate key detection.
    214 - **`src/bin/xldiff.rs`** — CLI binary: clap argument parsing, sheet resolution, output formatting (text, markdown, json, csv).
    215 
    216 ### Reused Modules
    217 
    218 - **`reader.rs`** — `read_sheet()` / `read_sheet_with_skip()` to load DataFrames.
    219 - **`metadata.rs`** — `read_file_info()` for sheet resolution.
    220 - **`cell.rs`** — Column letter resolution (shared with filter.rs via `col_letter_to_index()`).
    221 - **`formatter.rs`** — `render_table_header()` and `render_table_rows()` for markdown added/removed tables.
    222 
    223 ### Output Formatting
    224 
    225 The four format renderers live in the binary (`xldiff.rs`), not in `formatter.rs`. Diff output structure differs enough from xlcat's table rendering to warrant separation. The markdown formatter reuses `formatter.rs` table-rendering helpers for the added/removed sections.
    226 
    227 ### Cargo.toml Changes
    228 
    229 - Add `[[bin]] name = "xldiff"` entry.
    230 - Add `serde = { version = "1", features = ["derive"] }` and `serde_json = "1"` for JSON output.
    231 - TTY detection via `std::io::IsTerminal` (stable since Rust 1.70, no extra crate needed).
    232 
    233 ### Pipeline
    234 
    235 ```
    236 parse args → resolve sheets → read DataFrames
    237   → narrow to --cols → diff (positional or key-based)
    238   → format output → print → exit code
    239 ```
    240 
    241 ## Testing
    242 
    243 - **Unit tests** in `diff.rs`: positional mode, key-based mode, tolerance (including cross-type Int64/Float64, NaN handling), duplicate keys, column filtering, edge cases (empty sheets, single-row, mismatched columns/headers).
    244 - **Integration tests** in `tests/test_xldiff.rs`: CLI end-to-end with `assert_cmd`.
    245 - **Fixtures**: add helpers to `tests/common/mod.rs`:
    246   - `create_diff_pair()` — two files with added/removed/identical rows for positional mode.
    247   - `create_diff_pair_with_keys()` — two files with key columns and modified cells.
    248   - `create_diff_pair_with_floats()` — two files with float values differing by small amounts for tolerance testing.
	xl-cli-tools CLI tools for viewing and editing Excel files
	Log \| Files \| Refs \| README \| LICENSE