Convert complex statistical formats into editable data with one command

5,389 stars

Edit-Banana: Stop Copying Tables by Hand

We've all been there. You find a perfect table of data in a PDF, a research paper, or a webpage—maybe it's census data, financial results, or experimental findings. You need that data in a spreadsheet or a script, but it's trapped as a static image or in a messy, non-editable format. Your next hour is suddenly filled with the mind-numbing task of manual data entry. What if you could just… get the data?

That's the frustration Edit-Banana is built to solve. It's a command-line tool that takes those complex, formatted statistical tables (think PDFs, images, or messy text) and converts them into clean, editable data with a single command. It's like Ctrl+C, Ctrl+V for data that was never meant to be copied.

What It Does

In simple terms, Edit-Banana is an intelligent table extractor. You feed it a file containing a table—often from academic papers, reports, or official documents where data is presented for human reading, not machine processing. It then identifies the table structure, parses the rows and columns, and outputs the data into a usable format like CSV or Excel.

It goes beyond basic OCR by understanding the logic of statistical tables: merged headers, nested columns, footnotes, and units. It tries to reconstruct the intended dataset from the formatted presentation layer.
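To make "merged headers" concrete, here is a minimal sketch of one common way to flatten a two-row header in which a top-level cell spans several columns. This is an illustration only, not Edit-Banana's actual method: the function name flatten_header, the " / " separator, and the sample table are all assumptions made up for this example.

```python
def flatten_header(top_row, bottom_row, sep=" / "):
    """Combine a two-row table header into a single list of column names.

    An empty cell in top_row is treated as the continuation of a merged
    (spanned) cell to its left, so the last non-empty label is carried forward.
    This is a simplification: a real table may have ungrouped columns after a span.
    """
    names = []
    current_group = ""
    for top, bottom in zip(top_row, bottom_row):
        if top.strip():
            current_group = top.strip()
        bottom = bottom.strip()
        if current_group and bottom:
            names.append(f"{current_group}{sep}{bottom}")
        else:
            # Only one of the two header levels is present for this column.
            names.append(current_group or bottom)
    return names


# Hypothetical header: "2020" and "2021" each span two sub-columns.
top = ["Region", "2020", "", "2021", ""]
bottom = ["", "Male", "Female", "Male", "Female"]
columns = flatten_header(top, bottom)
print(columns)
```

The point of the sketch is that the presentation layer (a label centered over two columns) has to be turned back into one unambiguous name per column before the data is usable in a spreadsheet or script.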

Why It's Cool

The magic of Edit-Banana isn't just that it extracts text; it's that it aims to extract meaningful structure. Here’s what makes it stand out:

  • One-Command Simplicity: A single command like edit-banana input.pdf -o data.csv can replace an afternoon of manual data entry.
  • Handles the Messy Stuff: It's designed for the real world of data presentation. It doesn't just bail when it sees a spanned header or a superscript footnote symbol; it tries to integrate that information intelligently.
  • Developer-Centric: It's a CLI tool, which means it slots perfectly into data processing pipelines. You can automate the extraction of hundreds of tables, hook it into a data scraping script, or use it as the first step in your ETL process.
  • Fights PDF Hell: For anyone in research, data analysis, or journalism, getting data out of PDFs is a notorious pain point. Edit-Banana is a direct assault on that problem.
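The pipeline idea from the list above could be sketched roughly like this in Python. It assumes the edit-banana command is installed and on your PATH, and that it accepts the input.pdf -o data.csv form quoted above; the helper names build_command and convert_all, the directory layout, and the assumption that a non-zero exit code signals failure are hypothetical choices for this example, not something the project documents here.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir):
    """Build the argument list for one conversion.

    Assumes the CLI form `edit-banana <input> -o <output>` shown earlier.
    """
    out_path = Path(out_dir) / (Path(pdf_path).stem + ".csv")
    return ["edit-banana", str(pdf_path), "-o", str(out_path)]


def convert_all(in_dir, out_dir):
    """Run the converter on every PDF in in_dir and return the files that failed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf_path in sorted(Path(in_dir).glob("*.pdf")):
        cmd = build_command(pdf_path, out_dir)
        # Assumption: a non-zero exit code means the extraction failed.
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(pdf_path)
    return failed
```

Calling convert_all("reports", "extracted") would then try each PDF in a reports folder and hand back the ones that need a manual look, which is usually what you want from the first step of an ETL job.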

How to Try It

Ready to free some trapped data? Getting started is straightforward.

  1. Clone the repo:

    git clone https://github.com/BIT-DataLab/Edit-Banana.git
    cd Edit-Banana
    

Last updated: Mar 10, 2026