✦ OPEN SOURCE RUST TOOL
DATASET PROFILING FAST.
Analyze any CSV before AI/ML training. Detects missing values, outliers, type mismatches, and no-variance columns. Built in Rust for speed.
$ cargo install rustsight
✦ PERFORMANCE
The claim isn't marketing. Here's the receipt.
Chicago Crimes dataset · 8,500,901 rows · 22 columns · 2GB CSV — source: Kaggle (chicago-crime-dataset)
| Tool | Time | vs Pandas |
|---|---|---|
| Polars | 1.42s | 22.2× faster |
| DuckDB | 4.33s | 7.3× faster |
| RustSight (this project) | 5.21s | 6.1× faster |
| Pandas | 31.53s | baseline |
| csvkit | DNF | — |
- RustSight vs Pandas: 6.1× faster
- RustSight vs DuckDB: 0.83× (DuckDB wins)
- Polars vs Pandas: 22.2× faster
- DuckDB vs Pandas: 7.3× faster
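The ratios above follow directly from the wall-clock times in the table: each speedup is the baseline time divided by the tool's time. A minimal sketch of that arithmetic (the `speedup` helper is a hypothetical name used for illustration, not part of RustSight):

```rust
/// Speedup of a tool relative to a baseline: baseline time divided by tool time.
/// Values above 1.0 mean the tool is faster than the baseline.
fn speedup(baseline_secs: f64, tool_secs: f64) -> f64 {
    baseline_secs / tool_secs
}

fn main() {
    // Timings in seconds, copied from the benchmark table.
    let pandas = 31.53;
    let tools = [("Polars", 1.42), ("DuckDB", 4.33), ("RustSight", 5.21)];

    for (name, secs) in tools {
        println!("{name}: {:.1}x vs Pandas", speedup(pandas, secs));
    }

    // RustSight measured against DuckDB: a value below 1.0 means DuckDB is faster.
    println!("RustSight vs DuckDB: {:.2}x", speedup(4.33, 5.21));
}
```

Because the benchmark is a single run, these ratios are best read as approximate.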
✦ KEY FINDINGS
✦ COLUMN ANALYSIS (from RustSight report)
| Column Name | Type | Missing | Min | Max | Mean |
|---|---|---|---|---|---|
| ID | #numeric | clean | 634.00 | 14,118,918.00 | 7,575,905.59 |
| Case Number | #categorical | clean | — | — | — |
| Date | #categorical | clean | — | — | — |
| Block | #categorical | clean | — | — | — |
| IUCR | #categorical | clean | — | — | — |
| Primary Type | #categorical | clean | — | — | — |
| Description | #categorical | clean | — | — | — |
| Location Desc. | #categorical | 15,626 | — | — | — |
| Arrest | #numeric | clean | 0.00 | 1.00 | 0.25 |
| Domestic | #numeric | clean | 0.00 | 1.00 | 0.17 |
| Beat | #numeric | clean | 111.00 | 2,535.00 | 1,183.17 |
| District | #numeric | 47 | 1.00 | 31.00 | 11.30 |
| Ward | #numeric | 614,818 (7.2%) | 1.00 | 50.00 | 22.79 |
| Community Area | #numeric | 613,685 (7.2%) | 0.00 | 77.00 | 37.37 |
| FBI Code | #categorical | clean | — | — | — |
| X Coordinate | #numeric | 94,671 (1.1%) | 0.00 | 1,205,119.00 | 1,164,666.40 |
| Y Coordinate | #numeric | 94,671 (1.1%) | 0.00 | 1,951,622.00 | 1,885,922.46 |
| Year | #numeric | clean | 2001.00 | 2026.00 | 2011.14 |
| Updated On | #categorical | clean | — | — | — |
| Latitude | #numeric | 94,671 (1.1%) | 36.62 | 42.02 | 41.84 |
| Longitude | #numeric | 94,671 (1.1%) | -91.69 | -87.52 | -87.67 |
| Location | #categorical | 94,671 (1.1%) | — | — | — |
Benchmark run on Windows · 20 threads · release build (cargo build --release) · single run · RustSight v1.0.0
Reproduce this benchmark →
✦ FEATURES
CSV Analysis
rustsight stats data.csv
- Numeric vs categorical detection
- Min / max / mean per column
- Missing value count per column
- Streaming: no RAM limit
- Saves _report.txt automatically
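To illustrate how streaming per-column statistics can work in principle, here is a minimal sketch using only the Rust standard library. It is not RustSight's actual implementation: `ColumnStats` and `profile` are hypothetical names, and the naive comma split assumes no quoted fields. Rows are read one at a time, so memory use depends on the number of columns rather than the number of rows.

```rust
use std::io::{BufRead, BufReader, Read};

/// Running statistics for one column, updated one cell at a time.
#[derive(Debug, Default, Clone)]
struct ColumnStats {
    missing: u64,
    numeric: u64,
    non_numeric: u64,
    min: f64,
    max: f64,
    sum: f64,
}

impl ColumnStats {
    fn update(&mut self, cell: &str) {
        let cell = cell.trim();
        if cell.is_empty() {
            self.missing += 1;
        } else if let Ok(v) = cell.parse::<f64>() {
            if self.numeric == 0 {
                self.min = v;
                self.max = v;
            } else {
                self.min = self.min.min(v);
                self.max = self.max.max(v);
            }
            self.numeric += 1;
            self.sum += v;
        } else {
            self.non_numeric += 1;
        }
    }

    /// A column counts as numeric only if every non-empty cell parsed as a number.
    fn is_numeric(&self) -> bool {
        self.numeric > 0 && self.non_numeric == 0
    }

    fn mean(&self) -> Option<f64> {
        if self.is_numeric() {
            Some(self.sum / self.numeric as f64)
        } else {
            None
        }
    }
}

/// Reads a header row, then folds every following row into per-column stats.
/// Assumption: fields contain no quoted commas or embedded newlines.
fn profile<R: Read>(input: R) -> std::io::Result<(Vec<String>, Vec<ColumnStats>)> {
    let mut lines = BufReader::new(input).lines();
    let header: Vec<String> = match lines.next() {
        Some(line) => line?.split(',').map(|s| s.trim().to_string()).collect(),
        None => return Ok((Vec::new(), Vec::new())),
    };
    let mut stats = vec![ColumnStats::default(); header.len()];
    for line in lines {
        let line = line?;
        for (col, cell) in stats.iter_mut().zip(line.split(',')) {
            col.update(cell);
        }
    }
    Ok((header, stats))
}

fn main() -> std::io::Result<()> {
    // Small in-memory sample; a file handle from std::fs::File::open works the same way.
    let csv = "price,ticker,volume\n10.5,AAA,100\n,BBB,300\n12.5,CCC,\n";
    let (header, stats) = profile(csv.as_bytes())?;
    for (name, s) in header.iter().zip(&stats) {
        println!(
            "{name}: numeric={} missing={} mean={:?}",
            s.is_numeric(),
            s.missing,
            s.mean()
        );
    }
    Ok(())
}
```

A production profiler would need a real CSV parser to handle quoting and escapes; the point here is only that min, max, mean, and missing counts can all be maintained in a single pass.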
ML Readiness Check
rustsight validate data.csv
- High missing value ratio warnings
- No-variance column detection
- Outlier flagging
- Mixed-type column detection
- Clear warnings before training
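As a rough illustration of three of these checks (high missing ratio, no variance, outliers), the sketch below evaluates a single numeric column. Everything in it is an assumption made for the example: the `Warning` enum, the `check_column` function, the 5% missing-ratio threshold, and the z-score rule with a cutoff of 3 are hypothetical choices, not RustSight's documented behavior.

```rust
/// Warnings a readiness check might raise for one numeric column.
#[derive(Debug, PartialEq)]
enum Warning {
    HighMissing { ratio: f64 },
    NoVariance,
    Outliers { count: usize },
}

// Hypothetical thresholds, chosen for illustration only.
const MAX_MISSING_RATIO: f64 = 0.05;
const Z_THRESHOLD: f64 = 3.0;

/// `None` represents a missing cell, `Some(v)` a parsed numeric value.
fn check_column(cells: &[Option<f64>]) -> Vec<Warning> {
    let mut warnings = Vec::new();
    if cells.is_empty() {
        return warnings;
    }

    let present: Vec<f64> = cells.iter().flatten().copied().collect();
    let ratio = (cells.len() - present.len()) as f64 / cells.len() as f64;
    if ratio > MAX_MISSING_RATIO {
        warnings.push(Warning::HighMissing { ratio });
    }
    if present.is_empty() {
        return warnings;
    }

    // A column whose minimum equals its maximum carries no information.
    let min = present.iter().copied().fold(f64::INFINITY, f64::min);
    let max = present.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    if min == max {
        warnings.push(Warning::NoVariance);
        return warnings;
    }

    // Flag values more than Z_THRESHOLD population standard deviations from the mean.
    let n = present.len() as f64;
    let mean = present.iter().sum::<f64>() / n;
    let variance = present.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    let std_dev = variance.sqrt();
    let count = present
        .iter()
        .filter(|v| ((**v - mean) / std_dev).abs() > Z_THRESHOLD)
        .count();
    if count > 0 {
        warnings.push(Warning::Outliers { count });
    }
    warnings
}

fn main() {
    // Twenty identical readings plus one extreme value.
    let mut column: Vec<Option<f64>> = vec![Some(10.0); 20];
    column.push(Some(1000.0));
    println!("{:?}", check_column(&column));
}
```

The z-score rule is only one of several common outlier heuristics; an interquartile-range rule would be a reasonable alternative for skewed columns.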
File Inspection
rustsight inspect file.csv
- Total byte size
- UTF-8 validity check
- Line and word count
- Non-ASCII byte detection
- Works on any file type
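The inspection metrics listed above can all be derived from a file's raw bytes, which is why they apply to any file type. A minimal sketch under stated assumptions: `FileSummary` and `inspect` are hypothetical names, lines are counted as newline bytes, and words are runs of bytes separated by ASCII whitespace. RustSight may define these differently.

```rust
/// Byte-level summary of a file's contents.
#[derive(Debug, PartialEq)]
struct FileSummary {
    bytes: usize,
    valid_utf8: bool,
    lines: usize,
    words: usize,
    non_ascii_bytes: usize,
}

fn inspect(data: &[u8]) -> FileSummary {
    FileSummary {
        bytes: data.len(),
        // from_utf8 validates the whole buffer without copying it.
        valid_utf8: std::str::from_utf8(data).is_ok(),
        // Assumption: a "line" is terminated by a newline byte.
        lines: data.iter().filter(|&&b| b == b'\n').count(),
        // Assumption: a "word" is a non-empty run between ASCII whitespace bytes.
        words: data
            .split(|b| b.is_ascii_whitespace())
            .filter(|w| !w.is_empty())
            .count(),
        non_ascii_bytes: data.iter().filter(|b| !b.is_ascii()).count(),
    }
}

fn main() {
    // In-memory sample; for a real file, pass the bytes returned by std::fs::read.
    let sample = "id,city\n1,Zürich\n".as_bytes();
    println!("{:?}", inspect(sample));
}
```

This sketch takes the whole buffer at once for brevity; the same counters could be updated chunk by chunk for very large files, with extra care needed for UTF-8 sequences that straddle chunk boundaries.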
✦ REAL OUTPUT
This is what RustSight actually produces. No setup, no config.
$ rustsight stats stockdata.csv
(contents of stockdata_report.txt)
File: stockdata.csv
Total rows: 2617
Columns: 7
🕛 Analysis completed in 1ms
✦ FORMAT SUPPORT
What works today. What's coming.
| Format | Status |
|---|---|
| CSV | ✓ Supported |
| Parquet | ⧖ Planned |
| JSON | ⧖ Planned |
| Arrow | ⧖ Planned |
| TXT/Binary | ✓ Supported |
✦ GET STARTED
Install
cargo install rustsight
Requires the Rust toolchain (rustup.rs)
Profile a dataset
rustsight stats your_data.csv
Generates a column-level report with types, stats, and missing values
Check ML readiness
rustsight validate your_data.csv
Flags outliers, high-missing columns, and no-variance features