How We Validate CSV Files

A technical overview of our CSV validation engine and the checks we perform on every file.

Encoding Detection

We analyze the byte patterns in your file to detect its character encoding. Common encodings such as UTF-8, UTF-8 with BOM, Latin-1 (ISO-8859-1), and Windows-1252 are detected. If a BOM (Byte Order Mark) is present, it is identified and reported. Non-UTF-8 files are converted to UTF-8 before analysis.
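
As an illustration only, the detection order described above could be sketched as follows. The function names `detect_encoding` and `to_utf8` are hypothetical, and the rule used to separate Latin-1 from Windows-1252 (looking for bytes in the 0x80-0x9F range) is an assumed heuristic, not necessarily what the engine does.

```python
import codecs

def detect_encoding(raw: bytes) -> tuple[str, bool]:
    """Return (encoding_name, has_bom) for the given bytes (hypothetical helper)."""
    # A UTF-8 BOM is the three bytes EF BB BF at the very start of the file.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig", True
    try:
        raw.decode("utf-8")
        return "utf-8", False
    except UnicodeDecodeError:
        pass
    # Assumed heuristic: bytes 0x80-0x9F are control codes in Latin-1 but
    # printable characters (such as the euro sign) in Windows-1252.
    if any(0x80 <= b <= 0x9F for b in raw):
        return "cp1252", False
    return "latin-1", False

def to_utf8(raw: bytes) -> bytes:
    """Convert the input to UTF-8 (without BOM) using the detected encoding."""
    encoding, _ = detect_encoding(raw)
    return raw.decode(encoding, errors="replace").encode("utf-8")
```

For example, `detect_encoding(b"caf\xe9")` falls through the UTF-8 check and is reported as Latin-1, and `to_utf8` then re-encodes it as UTF-8.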

Delimiter Auto-Detection

Our engine samples the first several rows of your file and counts occurrences of common delimiters: comma (,), semicolon (;), tab, and pipe (|). The delimiter that occurs most consistently across rows is selected. This handles regional differences, such as the semicolon-delimited exports common in European locales.
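
A minimal sketch of this kind of consistency check is shown below. The name `detect_delimiter`, the sample size of 10 rows, and the scoring rule (share of rows that agree on a count, then the count itself) are assumptions made for illustration. The sketch also counts characters naively and ignores quoting.

```python
from collections import Counter

CANDIDATES = [",", ";", "\t", "|"]

def detect_delimiter(text: str, sample_rows: int = 10) -> str:
    """Pick the candidate whose per-row count is most consistent (hypothetical helper)."""
    rows = [line for line in text.splitlines() if line.strip()][:sample_rows]
    best, best_score = ",", (0.0, 0)  # fall back to comma when nothing matches
    for delim in CANDIDATES:
        nonzero = [row.count(delim) for row in rows if delim in row]
        if not nonzero:
            continue
        # Most common per-row count, and how many rows share it.
        mode_count, mode_rows = Counter(nonzero).most_common(1)[0]
        score = (mode_rows / len(rows), mode_count)
        if score > best_score:
            best, best_score = delim, score
    return best
```

In a file such as `name;price` / `widget;1,50` / `gadget;2,75`, the semicolon appears once in every row while the comma is missing from the header, so the semicolon is chosen.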

Structure Validation

We check that every row has the same number of columns as the header row, and flag rows with missing or extra columns. We also check for empty rows, trailing delimiters, and improper line endings. Each structural issue is categorized by severity.
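
The column-count, empty-row, and trailing-delimiter checks could look roughly like the sketch below, built on Python's standard `csv` module. The function name `check_structure` and the two severity labels ("error" and "warning") are assumptions for illustration. Line-ending checks are left out to keep it short.

```python
import csv
import io

def check_structure(text: str, delimiter: str = ",") -> list[tuple[int, str, str]]:
    """Return (line_number, severity, message) tuples (hypothetical helper)."""
    issues = []
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    expected = None  # column count taken from the first non-empty row
    for row in reader:
        line_no = reader.line_num
        if not row:
            issues.append((line_no, "warning", "empty row"))
        elif expected is None:
            expected = len(row)
        elif len(row) == expected + 1 and row[-1] == "":
            # One extra, empty field at the end usually means a trailing delimiter.
            issues.append((line_no, "warning", "trailing delimiter"))
        elif len(row) != expected:
            issues.append(
                (line_no, "error", f"expected {expected} columns, found {len(row)}")
            )
    return issues
```

Calling `check_structure("id,name\n1,Ann\n2\n")` reports line 3 as an error because it has one column where the header has two.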

Header Validation

If a header row is detected, we check for empty headers, duplicate header names, and headers with leading/trailing whitespace. Clean, unique headers are essential for data import systems and databases.
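
The three header checks can be sketched in a few lines. The function name `check_headers` and the message wording are hypothetical, and duplicates are compared after trimming whitespace, which is an assumption.

```python
from collections import Counter

def check_headers(headers: list[str]) -> list[str]:
    """Return human-readable problems found in a header row (hypothetical helper)."""
    problems = []
    for i, name in enumerate(headers, start=1):
        if name.strip() == "":
            problems.append(f"column {i}: empty header")
        elif name != name.strip():
            problems.append(f"column {i}: leading/trailing whitespace in {name!r}")
    # Assumption: names that differ only by surrounding whitespace count as duplicates.
    counts = Counter(name.strip() for name in headers if name.strip())
    for name, n in counts.items():
        if n > 1:
            problems.append(f"duplicate header {name!r} appears {n} times")
    return problems
```

A header row like `["id", " name", "", "id"]` would produce one whitespace problem, one empty-header problem, and one duplicate problem.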

Data Type Analysis

For each column, we sample the data to determine the predominant data type: integer, float, boolean, date, email, URL, or string. This helps identify columns where data types are mixed or unexpected, which often indicates data quality issues.
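
One way to approximate this is to classify each sampled value and take the most common result. Everything in the sketch below is an assumption made for illustration: the names `classify` and `column_type`, the simplified email and URL patterns, the single ISO date format, and the sample size of 100.

```python
import re
from collections import Counter
from datetime import datetime

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately loose
URL_RE = re.compile(r"^https?://\S+$", re.IGNORECASE)

def classify(value: str) -> str:
    """Classify one cell; checks run from most to least specific."""
    v = value.strip()
    if v.lower() in {"true", "false"}:
        return "boolean"
    if re.fullmatch(r"[+-]?\d+", v):
        return "integer"
    if re.fullmatch(r"[+-]?(\d+\.\d*|\.\d+)([eE][+-]?\d+)?", v):
        return "float"
    try:
        datetime.strptime(v, "%Y-%m-%d")  # assumption: ISO dates only
        return "date"
    except ValueError:
        pass
    if EMAIL_RE.match(v):
        return "email"
    if URL_RE.match(v):
        return "url"
    return "string"

def column_type(values: list[str], sample_size: int = 100) -> tuple[str, float]:
    """Return (predominant_type, share_of_sampled_values_with_that_type)."""
    sample = [v for v in values[:sample_size] if v.strip()]
    if not sample:
        return "string", 0.0
    kind, n = Counter(classify(v) for v in sample).most_common(1)[0]
    return kind, n / len(sample)
```

A share well below 1.0, as in `column_type(["1", "2", "x", "4"])`, is the kind of signal that points to a mixed-type column.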

Duplicate & Empty Row Detection

We hash each row to efficiently detect exact duplicates. Empty rows (rows with no data or only delimiters) are also identified and counted. Both issues commonly cause problems during data import.
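
A minimal sketch of hash-based duplicate detection follows. The function name, the choice of SHA-256, and the unit-separator character used to join cells are assumptions, not a description of the engine's internals.

```python
import hashlib

def find_duplicates_and_empties(
    rows: list[list[str]],
) -> tuple[list[tuple[int, int]], list[int]]:
    """Return (duplicates, empties) using 0-based row indexes (hypothetical helper).

    duplicates holds (row_index, index_of_first_occurrence) pairs.
    """
    seen: dict[str, int] = {}
    duplicates: list[tuple[int, int]] = []
    empties: list[int] = []
    for i, row in enumerate(rows):
        # A row with no cells, or only blank cells, counts as empty.
        if all(cell.strip() == "" for cell in row):
            empties.append(i)
            continue
        # Join with a separator unlikely to appear in data, so that
        # ["a,b"] and ["a", "b"] do not hash to the same value.
        digest = hashlib.sha256("\x1f".join(row).encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates.append((i, seen[digest]))
        else:
            seen[digest] = i
    return duplicates, empties
```

Storing one fixed-size digest per row keeps memory use predictable even when individual rows are long.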

Quality Score Calculation

The quality score starts at 100 and is reduced based on the severity and count of issues found. Major issues like inconsistent columns or encoding problems cause larger deductions, while minor issues like trailing whitespace cause smaller ones.

Score Ranges

  • 90-100: Excellent - File is clean and ready for use
  • 80-89: Good - Minor issues that may not cause problems
  • 60-79: Fair - Some issues that should be reviewed
  • 40-59: Poor - Significant issues that need fixing
  • 0-39: Critical - Major problems that will likely cause import failures
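
To make the scoring concrete, here is a small sketch that deducts points per issue and maps the result onto the ranges above. The deduction weights (10 for major, 2 for minor) and the function names are purely hypothetical. Only the starting value of 100 and the range labels come from this page.

```python
# Hypothetical weights: the actual deductions are not documented here.
DEDUCTIONS = {"major": 10, "minor": 2}

def quality_score(issue_counts: dict[str, int]) -> int:
    """Start at 100 and subtract a per-issue deduction, never going below 0."""
    score = 100
    for severity, count in issue_counts.items():
        score -= DEDUCTIONS.get(severity, 0) * count
    return max(0, score)

def score_label(score: int) -> str:
    """Map a 0-100 score onto the labels listed in Score Ranges."""
    if score >= 90:
        return "Excellent"
    if score >= 80:
        return "Good"
    if score >= 60:
        return "Fair"
    if score >= 40:
        return "Poor"
    return "Critical"
```

Under these assumed weights, a file with one major and three minor issues scores 100 - 10 - 6 = 84, which falls in the Good range.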
