WebSRC: A Dataset for Web-Based Structural Reading Comprehension [article]

Xingyu Chen, Zihan Zhao, Lu Chen, Danyang Zhang, Jiabao Ji, Ao Luo, Yuxuan Xiong, Kai Yu
<span title="2021-11-08">2021</span> <i > arXiv </i> &nbsp; <span class="release-stage" >pre-print</span>
Moreover, we proposed WebSRC, a novel Web-based Structural Reading Comprehension dataset. WebSRC consists of 400K question-answer pairs, which are collected from 6.4K web pages.  ...  In this paper, we introduce the task of structural reading comprehension (SRC) on web. Given a web page and a question about it, the task is to find the answer from the web page.  ...  Acknowledgments We sincerely thank the anonymous reviewers for their valuable comments.  ... 
<span class="external-identifiers"> <a target="_blank" rel="external noopener" href="https://arxiv.org/abs/2101.09465v2">arXiv:2101.09465v2</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/7wlb3dgfgnapziyjkwdy5u3ubi">fatcat:7wlb3dgfgnapziyjkwdy5u3ubi</a> </span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20211113101328/https://arxiv.org/pdf/2101.09465v2.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext">Web Archive [PDF]</a>

WebSRC: A Dataset for Web-Based Structural Reading Comprehension

Xingyu Chen, Zihan Zhao, Lu Chen, JiaBao Ji, Danyang Zhang, Ao Luo, Yuxuan Xiong, Kai Yu
<span title="">2021</span> <i title="Association for Computational Linguistics"> Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing </i> &nbsp; <span class="release-stage">unpublished</span>
Moreover, we proposed WebSRC, a novel Web-based Structural Reading Comprehension dataset. WebSRC consists of 400K question-answer pairs, which are collected from 6.4K web pages.  ...  In this paper, we introduce the task of structural reading comprehension (SRC) on web. Given a web page and a question about it, the task is to find the answer from the web page.  ...  Acknowledgments We sincerely thank the anonymous reviewers for their valuable comments.  ... 
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.18653/v1/2021.emnlp-main.343">doi:10.18653/v1/2021.emnlp-main.343</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/7er2qgkdvnemddsr2qrr2zk4ie">fatcat:7er2qgkdvnemddsr2qrr2zk4ie</a> </span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20211105033955/https://aclanthology.org/2021.emnlp-main.343.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext">Web Archive [PDF]</a>

TIE: Topological Information Enhanced Structural Reading Comprehension on Web Pages [article]

Zihan Zhao, Lu Chen, Ruisheng Cao, Hongshen Xu, Xingyu Chen, Kai Yu
<span title="2022-05-13">2022</span> <i > arXiv </i> &nbsp; <span class="release-stage" >pre-print</span>
Recently, the structural reading comprehension (SRC) task on web pages has attracted increasing research interests.  ...  Experimental results demonstrate that our model outperforms strong baselines and achieves state-of-the-art performances on the web-based SRC benchmark WebSRC at the time of writing.  ...  Acknowledgements We sincerely thank the anonymous reviewers for their valuable comments.  ... 
<span class="external-identifiers"> <a target="_blank" rel="external noopener" href="https://arxiv.org/abs/2205.06435v1">arXiv:2205.06435v1</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/j63hnebuind2tl45gnnb2z5cbq">fatcat:j63hnebuind2tl45gnnb2z5cbq</a> </span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20220520183052/https://arxiv.org/pdf/2205.06435v1.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext">Web Archive [PDF]</a>

MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding [article]

Junlong Li, Yiheng Xu, Lei Cui, Furu Wei
<span title="2022-03-11">2022</span> <i > arXiv </i> &nbsp; <span class="release-stage" >pre-print</span>
While, there are still a large number of digital documents where the layout information is not fixed and needs to be interactively and dynamically rendered for visualization, making existing layout-based  ...  In this paper, we propose MarkupLM for document understanding tasks with markup languages as the backbone, such as HTML/XML-based documents, where text and markup information is jointly pre-trained.  ...  We evaluate the MarkupLM models on the Web-based Structural Reading Comprehension (WebSRC) dataset (Chen et al., 2021) and the Structured Web Data Extraction (SWDE) dataset (Hao et al., 2011).  ... 
<span class="external-identifiers"> <a target="_blank" rel="external noopener" href="https://arxiv.org/abs/2110.08518v2">arXiv:2110.08518v2</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/pr2kzjrt3ffp5kb2jzhkmziuey">fatcat:pr2kzjrt3ffp5kb2jzhkmziuey</a> </span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20220324212224/https://arxiv.org/pdf/2110.08518v2.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext">Web Archive [PDF]</a>

Document AI: Benchmarks, Models and Applications [article]

Lei Cui, Yiheng Xu, Tengchao Lv, Furu Wei
<span title="2021-11-16">2021</span> <i > arXiv </i> &nbsp; <span class="release-stage" >pre-print</span>
Document AI, or Document Intelligence, is a relatively new research topic that refers to the techniques for automatically reading, understanding, and analyzing business documents.  ...  This paper briefly reviews some of the representative models, tasks, and benchmark datasets.  ...  MAINSTREAM DOCUMENT AI TASKS AND BENCHMARKS Document AI involves automatic reading, comprehension, and analysis of documents.  ... 
<span class="external-identifiers"> <a target="_blank" rel="external noopener" href="https://arxiv.org/abs/2111.08609v1">arXiv:2111.08609v1</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/7mg67htkgbgyjg63hlegd32m24">fatcat:7mg67htkgbgyjg63hlegd32m24</a> </span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20211123060603/https://arxiv.org/pdf/2111.08609v1.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext">Web Archive [PDF]</a>

MarkupLM: Pre-training of Text and Markup Language for Visually Rich Document Understanding

Junlong Li, Yiheng Xu, Lei Cui, Furu Wei
<span title="">2022</span> <i title="Association for Computational Linguistics"> Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) </i> &nbsp; <span class="release-stage">unpublished</span>
While, there are still a large number of digital documents where the layout information is not fixed and needs to be interactively and dynamically rendered for visualization, making existing layout-based  ...  Multimodal pre-training with text, layout, and image has made significant progress for Visually Rich Document Understanding (VRDU), especially the fixed-layout documents such as scanned document images  ...  We evaluate the MarkupLM models on the Web-based Structural Reading Comprehension (WebSRC) dataset (Chen et al., 2021) and the Structured Web Data Extraction (SWDE) dataset (Hao et al., 2011).  ... 
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.18653/v1/2022.acl-long.420">doi:10.18653/v1/2022.acl-long.420</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/fnffy7dpfrba5lyokhwppb7moa">fatcat:fnffy7dpfrba5lyokhwppb7moa</a> </span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20220516062805/https://aclanthology.org/2022.acl-long.420.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext">Web Archive [PDF]</a>

WebFormer: The Web-page Transformer for Structure Information Extraction [article]

Qifan Wang, Yi Fang, Anirudh Ravula, Fuli Feng, Xiaojun Quan, Dongfang Liu
<span title="2022-02-01">2022</span>
In this paper, we introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents.  ...  Structure information extraction refers to the task of extracting structured text fields from web pages, such as extracting a product offer from a shopping page including product title, description, brand  ...  EXPERIMENTS 4.1 Datasets SWDE [18, 61]: The Structured Web Data Extraction (SWDE) dataset is designed for structural reading comprehension and information extraction on the web.  ... 
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.48550/arxiv.2202.00217">doi:10.48550/arxiv.2202.00217</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/vzstgyznwrhm7e2pdqpqzu3tcy">fatcat:vzstgyznwrhm7e2pdqpqzu3tcy</a> </span>
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20220202025858/https://arxiv.org/pdf/2202.00217.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext">Web Archive [PDF]</a>