Technical Perspective: TASHEEH: Repairing Row-Structure in Raw CSV Files

Matthias Boehm

2025

Open science and data exchange in general rely on standardized and interoperable file formats. Comma-separated value (CSV) files are probably the most versatile, simplest, and most widely used file format for tabular data. For example, the FAIR data principles of research data management promote findable, accessible, interoperable, and reusable data and metadata. In this context, CSV files ensure accessibility and interoperability because of their simple structure and text-based format, making them amenable to long-term storage. An analysis by the Google Dataset Search team found that schema.org contained almost 30M datasets, of which 37% are tables in CSV or XLS format.