Detecting Duplicate Posts in Programming QA Communities via Latent Semantics and Association Rules

Wei Emma Zhang, Quan Z. Sheng, Jey Han Lau, Ermyas Abebe
<span title="">2017</span> <i title="ACM Press"> <a target="_blank" rel="noopener" href="" style="color: black;">Proceedings of the 26th International Conference on World Wide Web - WWW &#39;17</a> </i> &nbsp;
Programming community-based question-answering (PCQA) websites such as Stack Overflow enable programmers to find working solutions to their questions. Despite detailed posting guidelines, duplicate questions that have been answered are frequently created. To tackle this problem, Stack Overflow provides a mechanism for reputable users to manually mark duplicate questions. This is a laborious effort, and leads to many duplicate questions remain undetected. Existing duplicate detection
more &raquo; ... s from traditional community based question-answering (CQA) websites are difficult to be adopted directly to PCQA, as PCQA posts often contain source code which is linguistically very different from natural languages. In this paper, we propose a methodology designed for the PCQA domain to detect duplicate questions. We model the detection as a classification problem over question pairs. To extract features for question pairs, our methodology leverages continuous word vectors from the deep learning literature, topic model features and phrases pairs that co-occur frequently in duplicate questions mined using machine translation systems. These features capture semantic similarities between questions and produce a strong performance for duplicate detection. Experiments on a range of real-world datasets demonstrate that our method works very well; in some cases over 30% improvement compared to state-of-the-art benchmarks. As a product of one of the proposed features, the association score feature, we have mined a set of associated phrases from duplicate questions on Stack Overflow and open the dataset to the public.
<span class="external-identifiers"> <a target="_blank" rel="external noopener noreferrer" href="">doi:10.1145/3038912.3052701</a> <a target="_blank" rel="external noopener" href="">dblp:conf/www/ZhangSLA17</a> <a target="_blank" rel="external noopener" href="">fatcat:7pxytsaa2ncjtipqlfgcuabgdm</a> </span>
<a target="_blank" rel="noopener" href="" title="fulltext PDF download" data-goatcounter-click="serp-fulltext" data-goatcounter-title="serp-fulltext"> <button class="ui simple right pointing dropdown compact black labeled icon button serp-button"> <i class="icon ia-icon"></i> Web Archive [PDF] <div class="menu fulltext-thumbnail"> <img src="" alt="fulltext thumbnail" loading="lazy"> </div> </button> </a> <a target="_blank" rel="external noopener noreferrer" href=""> <button class="ui left aligned compact blue labeled icon button serp-button"> <i class="external alternate icon"></i> </button> </a>