Abstract
The quality of training and testing datasets is critical when a model is trained and evaluated on annotated data. In Information Retrieval (IR), human experts annotate documents as relevant or not relevant to a given query. The relevance judgments of human assessors are inherently subjective and dynamic. However, the judgments of a small group of experts are usually taken as ground truth to “objectively” evaluate the performance of an IR system. A recent trend is to employ a group of judges, for example through outsourcing, to alleviate the potential bias that stems from relying on a single expert’s judgment. Nevertheless, different judges may hold different opinions and disagree with each other, and this inconsistency in human relevance judgment may affect IR system evaluation results. Further, previous research has focused mainly on the quality of documents rather than on the quality of the queries submitted to an IR system. In this research, we introduce the Relevance Judgment Convergence Degree (RJCD) to measure the quality of queries in evaluation datasets. Experimental results reveal a strong correlation between the proposed RJCD score and the performance difference between two IR systems.