2026-03-17-xldiff-design.md (10600B)
1 # xldiff — Excel Sheet Comparison Tool 2 3 Port of [go-xldiff](../../../go-xldiff) to Rust, leveraging polars for type-aware comparison. 4 5 ## CLI Interface 6 7 ``` 8 xldiff <FILE1>[:<SHEET>] <FILE2>[:<SHEET>] [flags] 9 ``` 10 11 Two positional file arguments, each optionally suffixed with `:SheetName` or `:0` (0-based index). The same file may appear twice with different sheets (e.g., `report.xlsx:Q1 report.xlsx:Q2`). 12 13 Sheet selection uses `rfind(':')` to split on the last colon, so Windows drive-letter paths (e.g., `C:\file.xlsx`) are handled correctly. The suffix is tried as a sheet name first, then as a 0-based numeric index. 14 15 ### Flags 16 17 | Flag | Type | Default | Description | 18 |------|------|---------|-------------| 19 | `--key <COLS>` | string | *(none)* | Comma-separated key columns (header name or column letter). Omit for positional mode. | 20 | `--cols <COLS>` | string | *(all)* | Columns to compare (header name or column letter). Key columns always included. | 21 | `--skip <N>[,<M>]` | string | `0` | Rows to skip before header. Single value applies to both files; comma pair sets each independently. | 22 | `--no-header` | bool | false | Treat first row as data; generate synthetic `column_0`, `column_1`, … headers (matching reader.rs convention). | 23 | `--tolerance <F>` | f64 | *(none)* | Numeric tolerance for float comparisons (e.g., `0.01`). | 24 | `--format <FMT>` | string | `text` | Output format: `text`, `markdown`, `json`, `csv`. | 25 | `--no-color` | bool | false | Force-disable ANSI colors (auto-detected by default). | 26 27 **Column resolution priority** (consistent with `filter.rs::resolve_column()`): 28 1. Exact header name match 29 2. Case-insensitive header name match 30 3. Column letter (A, B, AA) if purely alphabetic 31 32 This means `--key A` matches a header named "A" before column position 0. To select the first column by position, use `--cols` with letter "A" only when no header is named "A". 33 34 ### Exit Codes 35 36 xldiff follows the `diff(1)` convention, which differs from xlcat/xlset/xlfilter: 37 38 - **0** — no differences 39 - **1** — differences found 40 - **2** — error (invalid arguments or runtime failure) 41 42 The other tools in this repo use exit 1 for runtime errors and exit 2 for argument errors. xldiff merges both error types into exit 2 because exit 1 is reserved for "differences found" — the standard convention for diff tools, enabling `if ! xldiff ...; then ...` in scripts. 43 44 ## Data Structures 45 46 ```rust 47 pub struct DiffResult { 48 pub headers: Vec<String>, // Columns that were compared (filtered set if --cols used) 49 pub key_columns: Vec<String>, 50 pub added: Vec<DiffRow>, 51 pub removed: Vec<DiffRow>, 52 pub modified: Vec<ModifiedRow>, 53 pub source_a: SheetSource, 54 pub source_b: SheetSource, 55 } 56 57 pub struct SheetSource { 58 pub file_name: String, 59 pub sheet_name: String, 60 } 61 62 pub struct DiffRow { 63 pub values: Vec<String>, // Cell values as display strings 64 } 65 66 pub struct ModifiedRow { 67 pub key: Vec<String>, 68 pub changes: Vec<CellChange>, 69 } 70 71 pub struct CellChange { 72 pub column: String, 73 pub old_value: String, 74 pub new_value: String, 75 } 76 ``` 77 78 `DiffResult` is format-agnostic. Each output formatter consumes it independently. `SheetSource` carries file/sheet metadata for headers. 79 80 Values are converted to display strings during diff construction. Type-aware comparison (tolerance, numeric equality) happens during the diff phase; the result stores only the string representations. 81 82 `headers` reflects the effective column set after `--cols` filtering, so output formatters use it directly without further filtering. 83 84 ## Diff Algorithm 85 86 ### Mismatched Headers 87 88 Before diffing, the tool checks whether both DataFrames share the same column names (after `--cols` filtering). Behavior: 89 90 - **Key-based mode:** key columns must exist in both files or the tool exits with an error. Non-key columns are compared by name — columns present in only one file are reported as added/removed columns in a warning on stderr, and comparison proceeds on the intersection. `DiffResult.headers` contains the union. 91 - **Positional mode:** columns are matched by position. If column counts differ, the tool warns on stderr and pads the shorter side with empty values. 92 93 ### Positional Mode (no `--key`) 94 95 1. Convert each row to a string: join all compared columns with a null-byte separator. 96 2. Build frequency maps (`HashMap<RowHash, usize>`) for both tables. 97 3. Walk table A: if B has remaining copies, consume one; otherwise mark as removed. 98 4. Walk table B: if A has remaining copies, consume one; otherwise mark as added. 99 5. No "modified" rows — any cell change produces a different hash. 100 101 ### Key-Based Mode (`--key` provided) 102 103 The implementation uses HashMap-based row matching (like the Go version) rather than polars joins. This avoids complications with join column type mismatches, row reordering, and duplicate-key cartesian products. Polars is used for reading and column selection, not for the join itself. 104 105 1. Resolve key column names/letters to column indices. 106 2. Build `HashMap<Vec<String>, RowData>` for both tables, keyed by the concatenated key column values. 107 3. For each key in A not in B → removed. 108 4. For each key in B not in A → added. 109 5. For each key in both → compare non-key columns cell by cell: 110 - **Numeric (Float64):** if `--tolerance` is set, equal when `|a - b| <= tolerance`; otherwise exact. NaN == NaN is treated as equal; NaN vs non-NaN is a difference. 111 - **Integer vs Float cross-type:** if one file infers a column as Int64 and the other as Float64, both values are compared as f64. Tolerance applies if set. 112 - **Integer (Int64):** exact comparison (tolerance does not apply to pure integer columns). 113 - **String:** exact string comparison. 114 - **Datetime:** compare as timestamps. 115 - **Boolean:** exact match. 116 - **Null vs value:** treated as different. Null vs null: equal. 117 6. Rows with any cell difference become a `ModifiedRow` with a list of `CellChange`s. 118 7. Duplicate keys produce a warning on stderr. Last occurrence wins (same as Go). 119 120 ### Column Filtering (`--cols`) 121 122 Applied before diffing. Both DataFrames are narrowed to the selected columns. Key columns are always retained, even when omitted from `--cols`. 123 124 ### Tolerance 125 126 - Applies to Float64 columns and Int64-vs-Float64 cross-type comparisons in key-based mode. 127 - Positional mode hashes exact string values; tolerance does not apply. 128 - If `--tolerance` is set but no numeric columns exist, it is a silent no-op. 129 130 ## Output Formats 131 132 ### No Differences 133 134 When the diff result is empty (no added, removed, or modified rows): 135 136 - **Text:** prints "No differences found." to stdout. 137 - **Markdown:** prints "No differences found." 138 - **JSON:** `{"added":[],"removed":[],"modified":[]}` 139 - **CSV:** header row only (no data rows). 140 141 All formats exit 0. 142 143 ### Text (default, colored) 144 145 ``` 146 --- Sheet1 (old.xlsx) 147 +++ Sheet1 (new.xlsx) 148 149 Added: 1 | Removed: 1 | Modified: 1 150 151 - ID: "3" Name: "Charlie" Score: "70" 152 + ID: "4" Name: "Dana" Score: "85" 153 154 ~ ID: "1" 155 Score: "90" → "95" 156 ``` 157 158 ANSI colors: red (`-`), green (`+`), yellow (`~`). Auto-detects TTY via `std::io::IsTerminal`; `--no-color` overrides. 159 160 ### Markdown (`--format markdown`) 161 162 ```markdown 163 ## Added (1) 164 | ID | Name | Score | 165 |----|------|-------| 166 | 4 | Dana | 85 | 167 168 ## Removed (1) 169 | ID | Name | Score | 170 |----|------|-------| 171 | 3 | Charlie | 70 | 172 173 ## Modified (1) 174 | Key (ID) | Column | Old | New | 175 |----------|--------|-----|-----| 176 | 1 | Score | 90 | 95 | 177 ``` 178 179 Clean, uncolored tables. The added/removed tables can reuse `formatter.rs::render_table_header()` and `render_table_rows()` for consistent alignment with xlcat/xlfilter output. The modified table is diff-specific. 180 181 ### JSON (`--format json`) 182 183 ```json 184 { 185 "added": [{"ID": "4", "Name": "Dana", "Score": "85"}], 186 "removed": [{"ID": "3", "Name": "Charlie", "Score": "70"}], 187 "modified": [ 188 { 189 "key": {"ID": "1"}, 190 "changes": [{"column": "Score", "old": "90", "new": "95"}] 191 } 192 ] 193 } 194 ``` 195 196 Empty arrays are `[]`, never null. Pretty-printed with 2-space indentation. 197 198 ### CSV (`--format csv`) 199 200 ```csv 201 _status,ID,Name,Score,_old_ID,_old_Name,_old_Score 202 added,4,Dana,85,,, 203 removed,3,Charlie,70,,, 204 modified,1,,95,,,90 205 ``` 206 207 `_status` column indicates row type. `_old_*` columns hold previous values. For modified rows: new values go in the main columns, old values in `_old_*` columns, only for changed cells. Unchanged cells are empty in both positions. Key column values appear in the main columns to identify the row. 208 209 ## Module Layout 210 211 ### New Files 212 213 - **`src/diff.rs`** — Core diff logic: `DiffResult`, positional/key-based algorithms, tolerance, duplicate key detection. 214 - **`src/bin/xldiff.rs`** — CLI binary: clap argument parsing, sheet resolution, output formatting (text, markdown, json, csv). 215 216 ### Reused Modules 217 218 - **`reader.rs`** — `read_sheet()` / `read_sheet_with_skip()` to load DataFrames. 219 - **`metadata.rs`** — `read_file_info()` for sheet resolution. 220 - **`cell.rs`** — Column letter resolution (shared with filter.rs via `col_letter_to_index()`). 221 - **`formatter.rs`** — `render_table_header()` and `render_table_rows()` for markdown added/removed tables. 222 223 ### Output Formatting 224 225 The four format renderers live in the binary (`xldiff.rs`), not in `formatter.rs`. Diff output structure differs enough from xlcat's table rendering to warrant separation. The markdown formatter reuses `formatter.rs` table-rendering helpers for the added/removed sections. 226 227 ### Cargo.toml Changes 228 229 - Add `[[bin]] name = "xldiff"` entry. 230 - Add `serde = { version = "1", features = ["derive"] }` and `serde_json = "1"` for JSON output. 231 - TTY detection via `std::io::IsTerminal` (stable since Rust 1.70, no extra crate needed). 232 233 ### Pipeline 234 235 ``` 236 parse args → resolve sheets → read DataFrames 237 → narrow to --cols → diff (positional or key-based) 238 → format output → print → exit code 239 ``` 240 241 ## Testing 242 243 - **Unit tests** in `diff.rs`: positional mode, key-based mode, tolerance (including cross-type Int64/Float64, NaN handling), duplicate keys, column filtering, edge cases (empty sheets, single-row, mismatched columns/headers). 244 - **Integration tests** in `tests/test_xldiff.rs`: CLI end-to-end with `assert_cmd`. 245 - **Fixtures**: add helpers to `tests/common/mod.rs`: 246 - `create_diff_pair()` — two files with added/removed/identical rows for positional mode. 247 - `create_diff_pair_with_keys()` — two files with key columns and modified cells. 248 - `create_diff_pair_with_floats()` — two files with float values differing by small amounts for tolerance testing.