A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2017; you can also visit <a rel="external noopener" href="http://papers.www2017.com.au.s3-website-ap-southeast-2.amazonaws.com/proceedings/p1221.pdf">the original URL</a>. The file type is <code>application/pdf</code>.
<i title="ACM Press">
<a target="_blank" rel="noopener" href="https://fatcat.wiki/container/s4hirppq3jalbopssw22crbwwa" style="color: black;">Proceedings of the 26th International Conference on World Wide Web - WWW '17</a>
Programming community-based question-answering (PCQA) websites such as Stack Overflow enable programmers to find working solutions to their questions. Despite detailed posting guidelines, duplicate questions that have been answered are frequently created. To tackle this problem, Stack Overflow provides a mechanism for reputable users to manually mark duplicate questions. This is a laborious effort, and leads to many duplicate questions remain undetected. Existing duplicate detection<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1145/3038912.3052701">doi:10.1145/3038912.3052701</a> <a target="_blank" rel="external noopener" href="https://dblp.org/rec/conf/www/ZhangSLA17.html">dblp:conf/www/ZhangSLA17</a> <a target="_blank" rel="external noopener" href="https://fatcat.wiki/release/7pxytsaa2ncjtipqlfgcuabgdm">fatcat:7pxytsaa2ncjtipqlfgcuabgdm</a> </span>
more »... s from traditional community based question-answering (CQA) websites are difficult to be adopted directly to PCQA, as PCQA posts often contain source code which is linguistically very different from natural languages. In this paper, we propose a methodology designed for the PCQA domain to detect duplicate questions. We model the detection as a classification problem over question pairs. To extract features for question pairs, our methodology leverages continuous word vectors from the deep learning literature, topic model features and phrases pairs that co-occur frequently in duplicate questions mined using machine translation systems. These features capture semantic similarities between questions and produce a strong performance for duplicate detection. Experiments on a range of real-world datasets demonstrate that our method works very well; in some cases over 30% improvement compared to state-of-the-art benchmarks. As a product of one of the proposed features, the association score feature, we have mined a set of associated phrases from duplicate questions on Stack Overflow and open the dataset to the public.
<a target="_blank" rel="noopener" href="https://web.archive.org/web/20170712063804/http://papers.www2017.com.au.s3-website-ap-southeast-2.amazonaws.com/proceedings/p1221.pdf" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext"> <button class="ui simple right pointing dropdown compact black labeled icon button serp-button"> <i class="icon ia-icon"></i> Web Archive [PDF] <div class="menu fulltext-thumbnail"> <img src="https://blobs.fatcat.wiki/thumbnail/pdf/68/e4/68e47fea6d61acbf0b058a963e42228a4e3f07af.180px.jpg" alt="fulltext thumbnail" loading="lazy"> </div> </button> </a> <a target="_blank" rel="external noopener noreferrer" href="https://doi.org/10.1145/3038912.3052701"> <button class="ui left aligned compact blue labeled icon button serp-button"> <i class="external alternate icon"></i> acm.org </button> </a>