OPEN SOURCE RUST TOOL

DATASET PROFILING FAST.

Analyze any CSV before AI/ML training. Detects missing values, outliers, type mismatches, and no-variance columns. Built in Rust for speed.

$ cargo install rustsight
View on GitHub →

PERFORMANCE

The claim isn't marketing. Here's the receipt.

Chicago Crimes dataset · 8,500,901 rows · 22 columns · 2GB CSV — source: Kaggle (chicago-crime-dataset)

Tool | Time | vs Pandas
Polars | 1.42s | 22.2× faster
DuckDB | 4.33s | 7.3× faster
RustSight (this project) | 5.21s | 6.1× faster
Pandas | 31.53s | baseline
csvkit | DNF |

RustSight vs Pandas: 6.1× faster
RustSight vs DuckDB: 0.83× (DuckDB wins)
Polars vs Pandas: 22.2× faster
DuckDB vs Pandas: 7.3× faster
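The ratios above can be checked by hand. The sketch below assumes each ratio is simply one wall-clock time divided by another (the page does not say how the figures were derived); `speedup` and `format_ratio` are illustrative helper names, not part of RustSight.

```rust
/// Speedup of a tool relative to a baseline: baseline seconds divided by
/// tool seconds. A value above 1.0 means the tool is faster than the baseline.
fn speedup(baseline_secs: f64, tool_secs: f64) -> f64 {
    baseline_secs / tool_secs
}

/// Formats a ratio with a fixed number of decimals, followed by "×".
fn format_ratio(ratio: f64, decimals: usize) -> String {
    // `{:.*}` takes the precision first, then the value to format.
    format!("{:.*}×", decimals, ratio)
}
```

With the timings from the table, `speedup(31.53, 5.21)` rounds to 6.1× for RustSight vs Pandas, and `speedup(4.33, 5.21)` rounds to 0.83× for RustSight vs DuckDB.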

⚠ Benchmark results will vary by machine. These timings come from a single run on a Windows PC with 20 threads, using a release build (cargo build --release). Your results may be faster or slower depending on CPU core count, available RAM, disk read speed (SSD vs HDD), and OS scheduling. The relative ordering between tools should hold across hardware, but the absolute seconds will not. RustSight's next optimization target is parallel line splitting via memmap2 + rayon, which is expected to narrow the gap to DuckDB.
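To illustrate what parallel line splitting means, here is a minimal sketch that uses only the standard library: scoped threads over an in-memory byte slice stand in for rayon and a memmap2 mapping. It is not RustSight's implementation. The function names are hypothetical, and the sketch assumes no field contains a quoted newline.

```rust
use std::thread;

/// Splits `data` into at most `parts` byte ranges. Every range except possibly
/// the last ends just after a newline, so no line straddles two ranges.
fn chunk_ranges(data: &[u8], parts: usize) -> Vec<(usize, usize)> {
    let parts = parts.max(1);
    // Ceiling division keeps the number of ranges at or below `parts`.
    let target = ((data.len() + parts - 1) / parts).max(1);
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < data.len() {
        // Tentative end, then extend to the next newline so lines stay whole.
        let mut end = (start + target).min(data.len());
        while end < data.len() && data[end - 1] != b'\n' {
            end += 1;
        }
        ranges.push((start, end));
        start = end;
    }
    ranges
}

/// Counts lines in one chunk: newline bytes, plus one for a trailing
/// line that has no terminating newline.
fn count_lines(chunk: &[u8]) -> usize {
    let newlines = chunk.iter().filter(|&&b| b == b'\n').count();
    let trailing = usize::from(matches!(chunk.last(), Some(&b) if b != b'\n'));
    newlines + trailing
}

/// Counts lines in parallel, one scoped thread per range.
fn count_lines_parallel(data: &[u8], parts: usize) -> usize {
    let ranges = chunk_ranges(data, parts);
    thread::scope(|s| {
        let handles: Vec<_> = ranges
            .iter()
            .map(|&(a, b)| s.spawn(move || count_lines(&data[a..b])))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum::<usize>()
    })
}
```

A real profiler would parse and aggregate columns inside each worker instead of only counting lines, then merge the per-chunk results. The boundary logic is the part this sketch is meant to show.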
KEY FINDINGS
  • RustSight beats Pandas by 6.1× on 8.5 million rows. A hand-built Rust CLI comfortably outperforms the industry-standard Python data library with zero configuration.
  • RustSight is within 0.88s of DuckDB (5.21s vs 4.33s), a production C++ query engine built by a team of database researchers. Closing this gap is the next engineering milestone.
  • Polars wins overall at 1.42s. It memory-maps the file and parses in parallel from disk. Its engine is also written in Rust, making it the practical ceiling for this hardware.
  • csvkit is not viable at scale. Pure Python row-by-row processing on 8.5M rows was killed after 40+ minutes. Estimated completion: 60–120 minutes.
COLUMN ANALYSIS (from RustSight report)
Column Name | Type | Missing | Min | Max | Mean
ID | numeric | clean | 634.00 | 14,118,918.00 | 7,575,905.59
Case Number | categorical | clean | | |
Date | categorical | clean | | |
Block | categorical | clean | | |
IUCR | categorical | clean | | |
Primary Type | categorical | clean | | |
Description | categorical | clean | | |
Location Desc. | categorical | 15,626 | | |
Arrest | numeric | clean | 0.00 | 1.00 | 0.25
Domestic | numeric | clean | 0.00 | 1.00 | 0.17
Beat | numeric | clean | 111.00 | 2,535.00 | 1,183.17
District | numeric | 47 | 1.00 | 31.00 | 11.30
Ward | numeric | 614,818 (7.2%) | 1.00 | 50.00 | 22.79
Community Area | numeric | 613,685 (7.2%) | 0.00 | 77.00 | 37.37
FBI Code | categorical | clean | | |
X Coordinate | numeric | 94,671 (1.1%) | 0.00 | 1,205,119.00 | 1,164,666.40
Y Coordinate | numeric | 94,671 (1.1%) | 0.00 | 1,951,622.00 | 1,885,922.46
Year | numeric | clean | 2001.00 | 2026.00 | 2011.14
Updated On | categorical | clean | | |
Latitude | numeric | 94,671 (1.1%) | 36.62 | 42.02 | 41.84
Longitude | numeric | 94,671 (1.1%) | -91.69 | -87.52 | -87.67
Location | categorical | 94,671 (1.1%) | | |

Benchmark run on Windows · 20 threads · release build (cargo build --release) · single run · RustSight v1.0.0

Reproduce this benchmark →

FEATURES

CSV Analysis

rustsight stats data.csv
  • Numeric vs categorical detection
  • Min / max / mean per column
  • Missing value count per column
  • Streaming — no RAM limit
  • Saves _report.txt automatically
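The features listed above fit a single streaming pass. The sketch below is one way such a pass could look, not RustSight's actual code: it keeps only a few counters per column, so memory use does not grow with row count. `ColumnStats` and `profile` are hypothetical names, and the naive comma split ignores quoted fields.

```rust
use std::io::{self, BufRead};

/// Running statistics for one column; nothing here grows with the row count.
#[derive(Debug, Clone)]
struct ColumnStats {
    name: String,
    missing: u64,
    numeric_count: u64,
    text_count: u64,
    min: f64,
    max: f64,
    sum: f64,
}

impl ColumnStats {
    fn new(name: &str) -> Self {
        ColumnStats {
            name: name.to_string(),
            missing: 0,
            numeric_count: 0,
            text_count: 0,
            min: f64::INFINITY,
            max: f64::NEG_INFINITY,
            sum: 0.0,
        }
    }

    /// Folds one cell into the running statistics.
    fn observe(&mut self, raw: &str) {
        let cell = raw.trim();
        if cell.is_empty() {
            self.missing += 1;
            return;
        }
        match cell.parse::<f64>() {
            Ok(v) if v.is_finite() => {
                self.numeric_count += 1;
                self.min = self.min.min(v);
                self.max = self.max.max(v);
                self.sum += v;
            }
            // Anything that is not a finite number counts as text.
            _ => self.text_count += 1,
        }
    }

    /// Numeric means every non-missing cell parsed as a number.
    fn is_numeric(&self) -> bool {
        self.numeric_count > 0 && self.text_count == 0
    }

    fn mean(&self) -> Option<f64> {
        if self.is_numeric() {
            Some(self.sum / self.numeric_count as f64)
        } else {
            None
        }
    }
}

/// Reads a header line, then streams the remaining rows one at a time.
fn profile<R: BufRead>(reader: R) -> io::Result<Vec<ColumnStats>> {
    let mut lines = reader.lines();
    let header = match lines.next() {
        Some(h) => h?,
        None => return Ok(Vec::new()),
    };
    let mut cols: Vec<ColumnStats> = header
        .split(',')
        .map(|n| ColumnStats::new(n.trim()))
        .collect();
    for line in lines {
        let line = line?;
        let mut cells = line.split(',');
        for col in cols.iter_mut() {
            // A short row counts as missing for the columns it lacks.
            col.observe(cells.next().unwrap_or(""));
        }
    }
    Ok(cols)
}
```

Passing a `BufReader` around an open file to `profile` processes the file line by line. A production tool would swap the comma split for a proper CSV parser.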

ML Readiness Check

rustsight validate data.csv
  • High missing value ratio warnings
  • No-variance column detection
  • Outlier flagging
  • Mixed-type column detection
  • Clear warnings before training
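As a rough illustration of three of these checks, the sketch below flags a high missing ratio, a constant column, and outliers in one numeric column. The page does not say which rules RustSight applies, so the 5% threshold and the 1.5×IQR outlier rule are assumptions chosen for the example, and all names are hypothetical. Mixed-type detection is left out.

```rust
/// Warnings a readiness check might raise for one numeric column.
#[derive(Debug, PartialEq)]
enum Warning {
    HighMissing { ratio: f64 },
    NoVariance,
    Outliers { count: usize },
}

/// Hypothetical cutoff: warn when more than 5% of cells are missing.
const MISSING_THRESHOLD: f64 = 0.05;

/// Linearly interpolated quantile of a sorted, non-empty slice.
fn quantile(sorted: &[f64], q: f64) -> f64 {
    let pos = q * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo as f64)
}

/// Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (an assumed rule, see above).
fn iqr_outliers(values: &[f64]) -> Vec<f64> {
    if values.len() < 4 {
        return Vec::new();
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let q1 = quantile(&sorted, 0.25);
    let q3 = quantile(&sorted, 0.75);
    let fence = 1.5 * (q3 - q1);
    values
        .iter()
        .copied()
        .filter(|v| *v < q1 - fence || *v > q3 + fence)
        .collect()
}

/// Runs the three checks on one column; `None` marks a missing cell.
fn check_numeric_column(cells: &[Option<f64>]) -> Vec<Warning> {
    let mut warnings = Vec::new();
    if cells.is_empty() {
        return warnings;
    }
    let present: Vec<f64> = cells.iter().flatten().copied().collect();
    let ratio = (cells.len() - present.len()) as f64 / cells.len() as f64;
    if ratio > MISSING_THRESHOLD {
        warnings.push(Warning::HighMissing { ratio });
    }
    if !present.is_empty() && present.windows(2).all(|w| w[0] == w[1]) {
        // Every present value is identical, so the feature carries no signal.
        warnings.push(Warning::NoVariance);
    } else {
        let count = iqr_outliers(&present).len();
        if count > 0 {
            warnings.push(Warning::Outliers { count });
        }
    }
    warnings
}
```

This version needs the column's values in memory to sort them. A streaming tool would more likely use a running mean and variance, or an approximate quantile sketch.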

File Inspection

rustsight inspect file.csv
  • Total byte size
  • UTF-8 validity check
  • Line and word count
  • Non-ASCII byte detection
  • Works on any file type
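The inspection items above map onto a few standard-library calls. The sketch below is an illustration under assumptions, not the tool's code: `FileReport` and `inspect_bytes` are hypothetical names, lines are counted as newline bytes, and words are runs of non-whitespace bytes.

```rust
/// Byte-level facts about a file, independent of its format.
#[derive(Debug, PartialEq)]
struct FileReport {
    bytes: usize,
    valid_utf8: bool,
    lines: usize,
    words: usize,
    non_ascii_bytes: usize,
}

/// Inspects a buffer that has already been read into memory.
fn inspect_bytes(data: &[u8]) -> FileReport {
    FileReport {
        bytes: data.len(),
        // `from_utf8` validates the whole buffer without copying it.
        valid_utf8: std::str::from_utf8(data).is_ok(),
        // Counts newline bytes, so a final unterminated line is not counted.
        lines: data.iter().filter(|&&b| b == b'\n').count(),
        // A word is a non-empty run of bytes between ASCII whitespace.
        words: data
            .split(|b| b.is_ascii_whitespace())
            .filter(|w| !w.is_empty())
            .count(),
        // Bytes with the high bit set, which includes every byte of a
        // multi-byte UTF-8 character.
        non_ascii_bytes: data.iter().filter(|b| !b.is_ascii()).count(),
    }
}

/// Convenience wrapper that reads the whole file first.
fn inspect_file(path: &std::path::Path) -> std::io::Result<FileReport> {
    Ok(inspect_bytes(&std::fs::read(path)?))
}
```

Because everything works on raw bytes, the same function handles text and binary input. For very large files, reading in fixed-size chunks would avoid holding the whole file in memory, at the cost of handling UTF-8 sequences split across chunk boundaries.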

REAL OUTPUT

This is what RustSight actually produces. No setup, no config.

stockdata_report.txt

$ rustsight stats stockdata.csv

(contents of stockdata_report.txt)

File: stockdata.csv

Column|Missing|Type|Min|Max|Mean
Date|0|categorical|N/A|N/A|N/A
Adj Close|0|numeric|55.42|306.28|130.76
Close|0|numeric|57.39|317.14|135.15
High|0|numeric|58.65|319.32|136.92
Low|0|numeric|57.20|308.91|133.36
Open|0|numeric|57.30|313.50|135.25
Volume|0|numeric|3,775,300.00|271,879,400.00|18,992,309.82

Total rows: 2617

Columns: 7

🕛 Analysis completed in 1ms
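The pipe-delimited rows in the report are simple to consume from other programs. The sketch below parses one row, assuming the layout matches the excerpt above exactly (six fields, thousands separators in numbers, N/A for absent statistics). `ReportRow` and `parse_report_row` are hypothetical names, and the real report file may contain lines this sketch does not handle.

```rust
/// One data row of the report, in the column order shown above.
#[derive(Debug, PartialEq)]
struct ReportRow {
    column: String,
    missing: u64,
    kind: String,
    min: Option<f64>,
    max: Option<f64>,
    mean: Option<f64>,
}

/// Parses a number that may contain thousands separators; "N/A" yields None.
fn parse_number(cell: &str) -> Option<f64> {
    cell.trim().replace(',', "").parse().ok()
}

/// Parses "Column|Missing|Type|Min|Max|Mean". Returns None for the header row
/// and for any line that does not have exactly six fields.
fn parse_report_row(line: &str) -> Option<ReportRow> {
    let cells: Vec<&str> = line.split('|').map(str::trim).collect();
    if cells.len() != 6 {
        return None;
    }
    Some(ReportRow {
        column: cells[0].to_string(),
        // The header's "Missing" label is not a number, so the header is skipped.
        missing: cells[1].replace(',', "").parse().ok()?,
        kind: cells[2].to_string(),
        min: parse_number(cells[3]),
        max: parse_number(cells[4]),
        mean: parse_number(cells[5]),
    })
}
```

Applying `parse_report_row` to each line of the report with `filter_map` keeps the data rows and drops the header, the totals, and the timing line.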

FORMAT SUPPORT

What works today. What's coming.

Format | Status
CSV | Supported
Parquet | Planned
JSON | Planned
Arrow | Planned
TXT/Binary | Supported

Vote for the next format →

GET STARTED

1. Install

cargo install rustsight

Requires Rust toolchain (rustup.rs)

2. Profile a dataset

rustsight stats your_data.csv

Generates a column-level report with types, stats, and missing values

3. Check ML readiness

rustsight validate your_data.csv

Flags outliers, high-missing columns, and no-variance features