Predicting Human Similarity Judgments Using Large Language Models [article]

Raja Marjieh, Ilia Sucholutsky, Theodore R. Sumers, Nori Jacoby, Thomas L. Griffiths
2022 arXiv   pre-print
Here we leverage recent advances in language models and online recruitment, proposing an efficient domain-general procedure for predicting human similarity judgments based on text descriptions.  ...  Similarity judgments provide a well-established method for accessing mental representations, with applications in psychology, neuroscience and machine learning.  ...  We evaluated two representations for predicting human similarity judgments based on these labels, namely, a one-hot representation and a word embedding representation.  ... 
arXiv:2202.04728v1 fatcat:pahfltoyvzhlngjothqyijizoa

Semantic Answer Similarity for Evaluating Question Answering Models [article]

Julian Risch and Timo Möller and Julian Gutsch and Malte Pietsch
2021 arXiv   pre-print
We find that semantic similarity metrics based on recent transformer models correlate much better with human judgment than traditional lexical similarity metrics on our two newly created datasets and one  ...  In this short paper, we present SAS, a cross-encoder-based metric for the estimation of semantic answer similarity, and compare it to seven existing metrics.  ...  Results Table 2 lists the correlation between different automated evaluation metrics and human judgment using Pearson's r and Kendall's tau-b rank correlation coefficients on labeled subsets of SQuAD  ... 
arXiv:2108.06130v3 fatcat:omyydvcy6nbodp6famdhcrcqwq

Frame-Based Continuous Lexical Semantics through Exponential Family Tensor Factorization and Semantic Proto-Roles

Francis Ferraro, Adam Poliak, Ryan Cotterell, Benjamin Van Durme
2017 Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017)  
We study how different frame annotations complement one another when learning continuous lexical semantics.  ...  We learn the representations from a tensorized skip-gram model that consistently encodes syntactic-semantic content better, with multiple 10% gains over baselines.  ...  Teichert et al. (2017) demonstrated slight improvements in jointly and conditionally predicting PropBank (Bonial et al., 2013) 's semantic role labels and Reisinger et al. (2015) 's proto-role labels  ... 
doi:10.18653/v1/s17-1011 dblp:conf/starsem/FerraroPCD17 fatcat:mef5uswrzbbsnic2g7ntfs76ui

Frame-Based Continuous Lexical Semantics through Exponential Family Tensor Factorization and Semantic Proto-Roles [article]

Francis Ferraro, Adam Poliak, Ryan Cotterell, Benjamin Van Durme
2017 arXiv   pre-print
We study how different frame annotations complement one another when learning continuous lexical semantics.  ...  We learn the representations from a tensorized skip-gram model that consistently encodes syntactic-semantic content better, with multiple 10% gains over baselines.  ...  Teichert et al. (2017) demonstrated slight improvements in jointly and conditionally predicting PropBank (Bonial et al., 2013) 's semantic role labels and Reisinger et al. (2015) 's proto-role labels  ... 
arXiv:1706.09562v1 fatcat:whacdzwkqjcf5atjpml7yrtjfy

Better Semantic Frame Based MT Evaluation via Inversion Transduction Grammars

Dekai Wu, Chi-kiu Lo, Meriem Beloucif, Markus Saers
2014 Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation  
upon MEANT's already-high correlation with human adequacy judgments.  ...  We introduce an inversion transduction grammar based restructuring of the MEANT automatic semantic frame based MT evaluation metric, which, by leveraging ITG language biases, is able to further improve  ...  For both reference and machine translations, the ASSERT (Pradhan et al., 2004) semantic role labeler was used to automatically predict semantic parses.  ... 
doi:10.3115/v1/w14-4003 dblp:conf/ssst/WuLBS14 fatcat:4m2gtq2vz5d4xg5uhpphzqyw5e

The Many Dimensions of Truthfulness: Crowdsourcing Misinformation Assessments on a Multidimensional Scale [article]

Michael Soprano and Kevin Roitero and David La Barbera and Davide Ceolin and Damiano Spina and Stefano Mizzaro and Gianluca Demartini
2021 arXiv   pre-print
However, fake news are a subtle matter: statements can be just biased ("cherrypicked"), imprecise, wrong, etc. and the unidimensional truth scale used in existing work cannot account for such differences  ...  capture independent pieces of information; (3) the crowdsourcing task can be easily learned by the workers; and (4) the resulting assessments provide a useful basis for a more complete estimation of statement  ...  We thank the reviewers for their comments; they provided insightful remarks that helped us to improve the overall quality of the paper.  ... 
arXiv:2108.01222v1 fatcat:26prpndntffkfmjrfoabd3wz5m

Evaluating Coherence in Dialogue Systems using Entailment [article]

Nouha Dziri, Ehsan Kamalloo, Kory W. Mathewson, Osmar Zaiane
2020 arXiv   pre-print
Some researchers resort to human judgment experimentation for assessing response quality, which is expensive, time consuming, and not scalable.  ...  Results show that our metrics can be used as a surrogate for human judgment, making it easy to evaluate dialogue systems on large-scale datasets and allowing an unbiased estimate for the quality of the  ...  While our results illustrate that the proposed approach correlates reasonably with human judgment and provide an unbiased estimate for the response quality, we believe that there is still room for improvement  ... 
arXiv:1904.03371v2 fatcat:dxo27ah7ureffj6d3knvuswo7i

Evaluating Coherence in Dialogue Systems using Entailment

Nouha Dziri, Ehsan Kamalloo, Kory Mathewson, Osmar Zaiane
2019 Proceedings of the 2019 Conference of the North  
Some researchers resort to human judgment experimentation for assessing response quality, which is expensive, time consuming, and not scalable.  ...  Results show that our metrics can be used as a surrogate for human judgment, making it easy to evaluate dialogue systems on large-scale datasets and allowing an unbiased estimate for the quality of the  ...  While our results illustrate that the proposed approach correlates reasonably with human judgment and provide an unbiased estimate for the response quality, we believe that there is still room for improvement  ... 
doi:10.18653/v1/n19-1381 dblp:conf/naacl/DziriKMZ19 fatcat:jqy62q6sdfgexpb2ee7yh57ada

Lexical Access Preference and Constraint Strategies for Improving Multiword Expression Association within Semantic MT Evaluation

Dekai Wu, Chi-kiu Lo, Markus Saers
2014 Proceedings of the 4th Workshop on Cognitive Aspects of the Lexicon (CogALex)  
associations, leading to state-of-the-art improvements in correlation with human adequacy judgments.  ...  Because of this, one of the most important factors in correctly predicting semantic translation adequacy is the accuracy of recognizing alternative lexical realizations of the same multiword expressions  ...  HR0011-12-C-0014 and HR0011-12-C-0016, and GALE contract nos.  ... 
doi:10.3115/v1/w14-4719 dblp:conf/cogalex/WuLS14 fatcat:cenznrrc4jak3pferczmpwmo6m

comp-syn: Perceptually Grounded Word Embeddings with Color [article]

Bhargav Srinivasa Desikan, Tasker Hull, Ethan O. Nadler, Douglas Guilbeault, Aabir Abubaker Kar, Mark Chu, Donald Ruggiero Lo Sardo
2020 arXiv   pre-print
In particular, we show that (1) comp-syn predicts human judgments of word concreteness with greater accuracy and in a more interpretable fashion than word2vec using low-dimensional word-color embeddings  ...  Our package release includes word-color embeddings for over 40,000 English words, each associated with crowd-sourced word concreteness judgments.  ...  D.G. and T.H. acknowledge intellectual support from the Institute for Advanced Learning (IAL) in Ontario, Canada.  ... 
arXiv:2010.04292v2 fatcat:vrk2yyex6zgdha4t4hggdzrzk4

Can Audio Captions Be Evaluated with Image Caption Metrics? [article]

Zelin Zhou, Zhiling Zhang, Xuenan Xu, Zeyu Xie, Mengyue Wu, Kenny Q. Zhu
2022 arXiv   pre-print
This problem is still unstudied due to the lack of human judgment datasets on caption quality. Therefore, we firstly construct two evaluation benchmarks, AudioCaps-Eval and Clotho-Eval.  ...  Current metrics are found in poor correlation with human annotations on these datasets.  ...  We thus leverage Sentence-BERT for better estimation of semantic similarity and propose Error Detector to penalize sentences with fluency issues.  ... 
arXiv:2110.04684v2 fatcat:3rzsk5wmnje5boqddb6l2vmube

BiMEANT: Integrating Cross-Lingual and Monolingual Semantic Frame Similarities in the MEANT Semantic MT Evaluation Metric [chapter]

Chi-kiu Lo, Dekai Wu
2014 Lecture Notes in Computer Science  
We present experimental results showing that integrating cross-lingual semantic frame similarity into the semantic frame based automatic MT evaluation metric MEANT improves its correlation with human judgment  ...  To address this issue we propose a new bilingual metric, BiMEANT, that correlates with human judgment more closely than MEANT by incorporating new cross-lingual semantic frame similarity assessments into  ...  Chinese semantic parser.  ... 
doi:10.1007/978-3-319-11397-5_6 fatcat:hg6tltaswrh47jqhbysuwwmxhi

Temporal web dynamics and its application to information retrieval

Kira Radinsky, Fernando Diaz, Susan Dumais, Milad Shokouhi, Anlei Dong, Yi Chang
2013 Proceedings of the sixth ACM international conference on Web search and data mining - WSDM '13  
Operators for manipulating streams of interest  Filter  Link  Visualize E. Adar, M. Dontcheva, J. Fogarty and D. Weld. Zoetrope: Interacting with the ephemeral web.  ...  Error at time t The prediction for time t J. Durbin and S.  ...  place semantics for a tag be derived from the tag's location and time usage distribution?  ... 
doi:10.1145/2433396.2433500 dblp:conf/wsdm/RadinskyDDSDC13 fatcat:v3r5yqpnwjcezm35ux4v5eiyse

Inherent Disagreements in Human Textual Inferences

Ellie Pavlick, Tom Kwiatkowski
2019 Transactions of the Association for Computational Linguistics  
We argue for a refined evaluation objective that requires models to explicitly capture the full distribution of plausible human judgments.  ...  We show that, very often, disagreements are not dismissible as annotation "noise", but rather persist as we collect more ratings and as we vary the amount of context provided to raters.  ...  Acknowledgments Thank you to the Action Editor Chris Potts and the anonymous reviewers for their input on earlier drafts of this paper.  ... 
doi:10.1162/tacl_a_00293 fatcat:e5xmen7w3jhzzmkyn4lpvlvxaa

Legal Judgment Prediction with Multi-Stage Case Representation Learning in the Real Court Setting [article]

Luyao Ma, Yating Zhang, Tianyi Wang, Xiaozhong Liu, Wei Ye, Changlong Sun, Shikun Zhang
2021 arXiv   pre-print
Legal judgment prediction (LJP) is an essential task for legal AI.  ...  An extensive set of experiments with a large civil trial data set shows that the proposed model can more accurately characterize the interactions among claims, fact and debate for legal judgment prediction  ...  [2] leverages BERT to focus only on learning good representation of the pure input fact text for judgment prediction.  ... 
arXiv:2107.05192v1 fatcat:isqunj4khfgc5klhlni3kyowdu