A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2021; you can also visit <a rel="external noopener" href="https://arxiv.org/pdf/2101.09465v2.pdf">the original URL</a>. The file type is <code>application/pdf</code>.
WebSRC: A Dataset for Web-Based Structural Reading Comprehension
[article]
<span title="2021-11-08">2021</span>
<i >
arXiv
</i>
<span class="release-stage" >pre-print</span>
Moreover, we proposed WebSRC, a novel Web-based Structural Reading Comprehension dataset. WebSRC consists of 400K question-answer pairs, which are collected from 6.4K web pages. ...
In this paper, we introduce the task of structural reading comprehension (SRC) on web. Given a web page and a question about it, the task is to find the answer from the web page. ...
Acknowledgments: We sincerely thank the anonymous reviewers for their valuable comments. ...
<span class="external-identifiers">
<a target="_blank" rel="external noopener" href="https://arxiv.org/abs/2101.09465v2">arXiv:2101.09465v2</a>
<a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/7wlb3dgfgnapziyjkwdy5u3ubi">fatcat:7wlb3dgfgnapziyjkwdy5u3ubi</a>
</span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20211113101328/https://arxiv.org/pdf/2101.09465v2.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext">
<button class="ui simple right pointing dropdown compact black labeled icon button serp-button">
<i class="icon ia-icon"></i>
Web Archive
[PDF]
<div class="menu fulltext-thumbnail">
<img src="https://blobs.fatcat.wiki/thumbnail/pdf/54/dd/54dd1600ad9e0b2d5a2899c3b7a7a403087823ae.180px.jpg" alt="fulltext thumbnail" loading="lazy">
</div>
</button>
</a>
<a target="_blank" rel="external noopener" href="https://arxiv.org/abs/2101.09465v2" title="arxiv.org access">
<button class="ui compact blue labeled icon button serp-button">
<i class="file alternate outline icon"></i>
arxiv.org
</button>
</a>
WebSRC: A Dataset for Web-Based Structural Reading Comprehension
<span title="">2021</span>
<i title="Association for Computational Linguistics">
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
</i>
<span class="release-stage">unpublished</span>
Moreover, we proposed WebSRC, a novel Web-based Structural Reading Comprehension dataset. WebSRC consists of 400K question-answer pairs, which are collected from 6.4K web pages. ...
In this paper, we introduce the task of structural reading comprehension (SRC) on web. Given a web page and a question about it, the task is to find the answer from the web page. ...
Acknowledgments: We sincerely thank the anonymous reviewers for their valuable comments. ...
<span class="external-identifiers">
<a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.18653/v1/2021.emnlp-main.343">doi:10.18653/v1/2021.emnlp-main.343</a>
<a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/7er2qgkdvnemddsr2qrr2zk4ie">fatcat:7er2qgkdvnemddsr2qrr2zk4ie</a>
</span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20211105033955/https://aclanthology.org/2021.emnlp-main.343.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext">
<button class="ui simple right pointing dropdown compact black labeled icon button serp-button">
<i class="icon ia-icon"></i>
Web Archive
[PDF]
<div class="menu fulltext-thumbnail">
<img src="https://blobs.fatcat.wiki/thumbnail/pdf/9c/ce/9ccef5d274eb9ba13691e772d78b33ad84e2d5bc.180px.jpg" alt="fulltext thumbnail" loading="lazy">
</div>
</button>
</a>
<a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.18653/v1/2021.emnlp-main.343">
<button class="ui left aligned compact blue labeled icon button serp-button">
<i class="external alternate icon"></i>
Publisher / doi.org
</button>
</a>
TIE: Topological Information Enhanced Structural Reading Comprehension on Web Pages
[article]
<span title="2022-05-13">2022</span>
<i >
arXiv
</i>
<span class="release-stage" >pre-print</span>
Recently, the structural reading comprehension (SRC) task on web pages has attracted increasing research interest. ...
Experimental results demonstrate that our model outperforms strong baselines and achieves state-of-the-art performance on the web-based SRC benchmark WebSRC at the time of writing. ...
Acknowledgements: We sincerely thank the anonymous reviewers for their valuable comments. ...
<span class="external-identifiers">
<a target="_blank" rel="external noopener" href="https://arxiv.org/abs/2205.06435v1">arXiv:2205.06435v1</a>
<a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/j63hnebuind2tl45gnnb2z5cbq">fatcat:j63hnebuind2tl45gnnb2z5cbq</a>
</span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20220520183052/https://arxiv.org/pdf/2205.06435v1.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext">
<button class="ui simple right pointing dropdown compact black labeled icon button serp-button">
<i class="icon ia-icon"></i>
Web Archive
[PDF]
<div class="menu fulltext-thumbnail">
<img src="https://blobs.fatcat.wiki/thumbnail/pdf/11/e6/11e680848c3e65a7fe64ffc10afa4003cd9c4e14.180px.jpg" alt="fulltext thumbnail" loading="lazy">
</div>
</button>
</a>
<a target="_blank" rel="external noopener" href="https://arxiv.org/abs/2205.06435v1" title="arxiv.org access">
<button class="ui compact blue labeled icon button serp-button">
<i class="file alternate outline icon"></i>
arxiv.org
</button>
</a>
MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding
[article]
<span title="2022-03-11">2022</span>
<i >
arXiv
</i>
<span class="release-stage" >pre-print</span>
However, there are still a large number of digital documents where the layout information is not fixed and needs to be interactively and dynamically rendered for visualization, making existing layout-based ...
In this paper, we propose MarkupLM for document understanding tasks with markup languages as the backbone, such as HTML/XML-based documents, where text and markup information is jointly pre-trained. ...
We evaluate the MarkupLM models on the Web-based Structural Reading Comprehension (WebSRC) dataset (Chen et al., 2021) and the Structured Web Data Extraction (SWDE) dataset (Hao et al., 2011). ...
<span class="external-identifiers">
<a target="_blank" rel="external noopener" href="https://arxiv.org/abs/2110.08518v2">arXiv:2110.08518v2</a>
<a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/pr2kzjrt3ffp5kb2jzhkmziuey">fatcat:pr2kzjrt3ffp5kb2jzhkmziuey</a>
</span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20220324212224/https://arxiv.org/pdf/2110.08518v2.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext">
<button class="ui simple right pointing dropdown compact black labeled icon button serp-button">
<i class="icon ia-icon"></i>
Web Archive
[PDF]
<div class="menu fulltext-thumbnail">
<img src="https://blobs.fatcat.wiki/thumbnail/pdf/6a/73/6a731761b80eb37aa1a4a867bac331617ca13a81.180px.jpg" alt="fulltext thumbnail" loading="lazy">
</div>
</button>
</a>
<a target="_blank" rel="external noopener" href="https://arxiv.org/abs/2110.08518v2" title="arxiv.org access">
<button class="ui compact blue labeled icon button serp-button">
<i class="file alternate outline icon"></i>
arxiv.org
</button>
</a>
Document AI: Benchmarks, Models and Applications
[article]
<span title="2021-11-16">2021</span>
<i >
arXiv
</i>
<span class="release-stage" >pre-print</span>
Document AI, or Document Intelligence, is a relatively new research topic that refers to the techniques for automatically reading, understanding, and analyzing business documents. ...
This paper briefly reviews some of the representative models, tasks, and benchmark datasets. ...
Mainstream Document AI Tasks and Benchmarks: Document AI involves automatic reading, comprehension, and analysis of documents. ...
<span class="external-identifiers">
<a target="_blank" rel="external noopener" href="https://arxiv.org/abs/2111.08609v1">arXiv:2111.08609v1</a>
<a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/7mg67htkgbgyjg63hlegd32m24">fatcat:7mg67htkgbgyjg63hlegd32m24</a>
</span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20211123060603/https://arxiv.org/pdf/2111.08609v1.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext">
<button class="ui simple right pointing dropdown compact black labeled icon button serp-button">
<i class="icon ia-icon"></i>
Web Archive
[PDF]
<div class="menu fulltext-thumbnail">
<img src="https://blobs.fatcat.wiki/thumbnail/pdf/34/c6/34c61db889ce5fd1002b9c1cd2331bfa1072cef7.180px.jpg" alt="fulltext thumbnail" loading="lazy">
</div>
</button>
</a>
<a target="_blank" rel="external noopener" href="https://arxiv.org/abs/2111.08609v1" title="arxiv.org access">
<button class="ui compact blue labeled icon button serp-button">
<i class="file alternate outline icon"></i>
arxiv.org
</button>
</a>
MarkupLM: Pre-training of Text and Markup Language for Visually Rich Document Understanding
<span title="">2022</span>
<i title="Association for Computational Linguistics">
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
</i>
<span class="release-stage">unpublished</span>
However, there are still a large number of digital documents where the layout information is not fixed and needs to be interactively and dynamically rendered for visualization, making existing layout-based ...
Multimodal pre-training with text, layout, and image has made significant progress for Visually Rich Document Understanding (VRDU), especially the fixed-layout documents such as scanned document images ...
We evaluate the MarkupLM models on the Web-based Structural Reading Comprehension (WebSRC) dataset (Chen et al., 2021) and the Structured Web Data Extraction (SWDE) dataset (Hao et al., 2011). ...
<span class="external-identifiers">
<a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.18653/v1/2022.acl-long.420">doi:10.18653/v1/2022.acl-long.420</a>
<a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/fnffy7dpfrba5lyokhwppb7moa">fatcat:fnffy7dpfrba5lyokhwppb7moa</a>
</span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20220516062805/https://aclanthology.org/2022.acl-long.420.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext">
<button class="ui simple right pointing dropdown compact black labeled icon button serp-button">
<i class="icon ia-icon"></i>
Web Archive
[PDF]
<div class="menu fulltext-thumbnail">
<img src="https://blobs.fatcat.wiki/thumbnail/pdf/96/96/9696523b3c41d4df70b4f93de573ab1075c68112.180px.jpg" alt="fulltext thumbnail" loading="lazy">
</div>
</button>
</a>
<a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.18653/v1/2022.acl-long.420">
<button class="ui left aligned compact blue labeled icon button serp-button">
<i class="external alternate icon"></i>
Publisher / doi.org
</button>
</a>
WebFormer: The Web-page Transformer for Structure Information Extraction
[article]
<span title="2022-02-01">2022</span>
In this paper, we introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents. ...
Structure information extraction refers to the task of extracting structured text fields from web pages, such as extracting a product offer from a shopping page including product title, description, brand ...
Experiments (4.1 Datasets). SWDE [18, 61]: The Structured Web Data Extraction (SWDE) dataset is designed for structural reading comprehension and information extraction on the web. ...
<span class="external-identifiers">
<a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.48550/arxiv.2202.00217">doi:10.48550/arxiv.2202.00217</a>
<a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/vzstgyznwrhm7e2pdqpqzu3tcy">fatcat:vzstgyznwrhm7e2pdqpqzu3tcy</a>
</span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20220202025858/https://arxiv.org/pdf/2202.00217.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext">
<button class="ui simple right pointing dropdown compact black labeled icon button serp-button">
<i class="icon ia-icon"></i>
Web Archive
[PDF]
<div class="menu fulltext-thumbnail">
<img src="https://blobs.fatcat.wiki/thumbnail/pdf/77/36/77365b30336ac46d620d958dc4c108a159c02834.180px.jpg" alt="fulltext thumbnail" loading="lazy">
</div>
</button>
</a>
<a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.48550/arxiv.2202.00217">
<button class="ui left aligned compact blue labeled icon button serp-button">
<i class="external alternate icon"></i>
Publisher / doi.org
</button>
</a>