	---
	license: other
	pretty_name: FinTabNet-OTSL
	size_categories:
	- 10K<n<100K
	tags:
	- table-structure-recognition
	- table-understanding
	- PDF
	task_categories:
	- object-detection
	- table-to-text
	---
	# Dataset Card for FinTabNet_OTSL

	## Dataset Description

	- Homepage: https://ds4sd.github.io
	- Paper: https://arxiv.org/pdf/2305.03393

	### Dataset Summary

	This dataset is a conversion of the original [FinTabNet](https://developer.ibm.com/exchanges/data/all/fintabnet/) into the OTSL format presented in our paper "Optimized Table Tokenization for Table Structure Recognition". It includes the original annotations alongside new additions.

	### Dataset Structure

	* cells: original dataset cell ground truth (content).
	* otsl: new reduced table structure token format.
	* html: original dataset ground truth HTML (structure).
	* html_restored: HTML generated from OTSL.
	* cols: grid column length.
	* rows: grid row length.
	* image: PIL image.
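
As a rough illustration of how these fields relate, the sketch below checks that a record carries every listed field and that its OTSL token count agrees with `rows` and `cols`. It assumes, without confirmation from this card, that `otsl` is a list of token strings in which each table row ends with an `nl` token and that `rows` and `cols` are integers. The `describe_record` helper and the example record are hypothetical.

```python
from typing import Any, Dict, List

# Field names as listed in the dataset card.
EXPECTED_FIELDS = ("cells", "otsl", "html", "html_restored", "cols", "rows", "image")


def describe_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize one record and compare its OTSL grid with rows/cols.

    Assumption (not confirmed by the card): record["otsl"] is a list of
    token strings in which every table row ends with an "nl" token, and
    rows/cols are the integer grid dimensions.
    """
    missing = [name for name in EXPECTED_FIELDS if name not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")

    tokens: List[str] = list(record["otsl"])
    otsl_rows = tokens.count("nl")          # one "nl" per table row
    otsl_cells = len(tokens) - otsl_rows    # every other token is a grid cell
    return {
        "rows": record["rows"],
        "cols": record["cols"],
        "otsl_rows": otsl_rows,
        "otsl_cells": otsl_cells,
        "grid_matches": otsl_rows == record["rows"]
        and otsl_cells == record["rows"] * record["cols"],
    }


# Hand-made record for illustration only; it is not taken from the dataset.
example = {
    "cells": [], "html": [], "html_restored": [], "image": None,
    "otsl": ["fcel", "lcel", "nl", "fcel", "ecel", "nl"],
    "rows": 2, "cols": 2,
}
summary = describe_record(example)
print(summary)
```

If `grid_matches` is false for real records, the assumptions above about the field types probably do not hold and the helper should be adapted.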

	### OTSL Vocabulary

	OTSL is a new, reduced table structure token format.
	More information on the OTSL format and its concepts can be found in our paper.
	The format used in this dataset extends the work presented in the paper and introduces slight modifications:

	* "fcel" - cell that has content in it
	* "ecel" - cell that is empty
	* "lcel" - left-looking cell (to handle horizontally merged cells)
	* "ucel" - up-looking cell (to handle vertically merged cells)
	* "xcel" - 2D span cell; in this dataset, it covers the entire area of a merged cell
	* "nl" - new line token
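
To make the vocabulary concrete, here is a minimal sketch that turns a flat OTSL token sequence into a grid, derives row and column spans, and emits bare HTML. It is not the conversion used to produce `html_restored`. It assumes every merged region starts at an `fcel` or `ecel` token and continues to the right with a run of `lcel` (or `xcel`) tokens and downward with a run of `ucel` (or `xcel`) tokens. All function names are hypothetical.

```python
from typing import List, Tuple

ORIGIN_TOKENS = {"fcel", "ecel"}  # tokens that start a cell


def otsl_to_grid(tokens: List[str]) -> List[List[str]]:
    """Split a flat OTSL token sequence into rows at each "nl" token."""
    grid: List[List[str]] = []
    row: List[str] = []
    for tok in tokens:
        if tok == "nl":
            grid.append(row)
            row = []
        else:
            row.append(tok)
    if row:  # tolerate a missing trailing "nl"
        grid.append(row)
    return grid


def _run(grid: List[List[str]], r: int, c: int, dr: int, dc: int, kinds: set) -> int:
    """Count continuation tokens starting next to (r, c) in direction (dr, dc).

    The first neighbour must be one of `kinds`; only tokens equal to that
    first neighbour are counted, so an "lcel" run does not absorb "xcel".
    """
    r, c = r + dr, c + dc
    if not (r < len(grid) and c < len(grid[r]) and grid[r][c] in kinds):
        return 0
    kind, n = grid[r][c], 0
    while r < len(grid) and c < len(grid[r]) and grid[r][c] == kind:
        n += 1
        r, c = r + dr, c + dc
    return n


def cell_spans(grid: List[List[str]]) -> List[Tuple[int, int, int, int]]:
    """Return (row, col, rowspan, colspan) for every fcel/ecel origin."""
    spans = []
    for r, row in enumerate(grid):
        for c, tok in enumerate(row):
            if tok in ORIGIN_TOKENS:
                colspan = 1 + _run(grid, r, c, 0, 1, {"lcel", "xcel"})
                rowspan = 1 + _run(grid, r, c, 1, 0, {"ucel", "xcel"})
                spans.append((r, c, rowspan, colspan))
    return spans


def otsl_to_html(tokens: List[str]) -> str:
    """Render structure-only HTML (empty <td> elements) from OTSL tokens."""
    grid = otsl_to_grid(tokens)
    spans = cell_spans(grid)
    parts = ["<table>"]
    for r in range(len(grid)):
        parts.append("<tr>")
        for row, _col, rowspan, colspan in spans:
            if row != r:
                continue
            attrs = f' rowspan="{rowspan}"' if rowspan > 1 else ""
            attrs += f' colspan="{colspan}"' if colspan > 1 else ""
            parts.append(f"<td{attrs}></td>")
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)


# Hand-made 3x3 example: a horizontal merge in row 0, a vertical merge in column 0.
tokens = ["fcel", "lcel", "fcel", "nl",
          "fcel", "fcel", "ecel", "nl",
          "ucel", "fcel", "fcel", "nl"]
print(otsl_to_html(tokens))
```

Counting only tokens of the same kind as the first neighbour keeps a horizontal `lcel` run from swallowing an adjacent 2D span. Span boundaries can still be ambiguous when two regions filled with `xcel` touch, so treat the result as an approximation.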

	### Data Splits

	The dataset provides three splits:
	- `train`
	- `val`
	- `test`
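
A split can be loaded with the Hugging Face `datasets` library, as sketched below. The repository id `docling-project/FinTabNet_OTSL` is assumed from this card's location on the Hub, and `load_split` is a hypothetical wrapper; only `datasets.load_dataset` is a real library call. Streaming is used here so that inspecting a few records does not require downloading a whole split.

```python
from typing import Any

# Split names as listed in the dataset card.
SPLITS = ("train", "val", "test")

# Assumed repository id, taken from where this card is hosted on the Hub.
REPO_ID = "docling-project/FinTabNet_OTSL"


def load_split(split: str, streaming: bool = True) -> Any:
    """Load one split of the dataset, rejecting unknown split names early."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so that the name check works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=streaming)


if __name__ == "__main__":
    # Requires network access and `pip install datasets`.
    val = load_split("val")
    first = next(iter(val))
    print(sorted(first.keys()))
```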

	## Additional Information

	### Dataset Curators

	The dataset was converted by the [Deep Search team](https://ds4sd.github.io/) at IBM Research.
	You can contact us at [deepsearch-core@zurich.ibm.com](mailto:deepsearch-core@zurich.ibm.com).

	Curators:
	- Maksym Lysak, [@maxmnemonic](https://github.com/maxmnemonic)
	- Ahmed Nassar, [@nassarofficial](https://github.com/nassarofficial)
	- Christoph Auer, [@cau-git](https://github.com/cau-git)
	- Nikos Livathinos, [@nikos-livathinos](https://github.com/nikos-livathinos)
	- Peter Staar, [@PeterStaar-IBM](https://github.com/PeterStaar-IBM)


	### Citation Information


	```bib
	@misc{lysak2023optimized,
	title={Optimized Table Tokenization for Table Structure Recognition},
	author={Maksym Lysak and Ahmed Nassar and Nikolaos Livathinos and Christoph Auer and Peter Staar},
	year={2023},
	eprint={2305.03393},
	archivePrefix={arXiv},
	primaryClass={cs.CV}
	}
	```