主页

索引

模块索引

搜索页面

Evaluating Verifiability in Generative Search Engines

  • https://arxiv.org/abs/2304.09848

  • 标题:评估生成搜索引擎中的可验证性

  • 作者:Nelson F. Liu, Tianyi Zhang, Percy Liang

  • 斯坦福大学计算机科学系

  • 生成式搜索引擎直接生成对用户查询的响应,以及内联引用。一个值得信赖的生成搜索引擎的一个先决条件是可验证性,即系统应该全面引用(高引用召回率;所有陈述都得到引用的完全支持)和准确(高引用精度;每个引用都支持其相关陈述)。我们进行人工评估,以审计四个流行的生成搜索引擎——Bing Chat、NeevaAI、perplexity 和 YouChat——来自各种来源的各种查询(例如,历史 Google 用户查询、Reddit 上动态收集的开放式问题等)。我们发现,来自现有生成搜索引擎的响应是流畅的,看起来信息量很大,但经常包含不支持的陈述和不准确的引用:平均而言,只有51.5%的生成句子得到引用的完全支持,只有74.5%的引用支持其相关的句子。我们认为,对于可能作为信息搜索用户主要工具的系统来说,这些结果低得令人担忧,特别是考虑到它们的可信度外表。我们希望我们的研究结果能进一步推动可信赖的生成式搜索引擎的发展,并帮助研究人员和用户更好地理解现有商业系统的不足。

  • Generative search engines directly generate responses to user queries, along with in-line citations. A prerequisite trait of a trustworthy generative search engine is verifiability, i.e., systems should cite comprehensively (high citation recall; all statements are fully supported by citations) and accurately (high citation precision; every cite supports its associated statement). We conduct human evaluation to audit four popular generative search engines – Bing Chat, NeevaAI, this http URL, and YouChat – across a diverse set of queries from a variety of sources (e.g., historical Google user queries, dynamically-collected open-ended questions on Reddit, etc.). We find that responses from existing generative search engines are fluent and appear informative, but frequently contain unsupported statements and inaccurate citations: on average, a mere 51.5% of generated sentences are fully supported by citations and only 74.5% of citations support their associated sentence. We believe that these results are concerningly low for systems that may serve as a primary tool for information-seeking users, especially given their facade of trustworthiness. We hope that our results further motivate the development of trustworthy generative search engines and help researchers and users better understand the shortcomings of existing commercial systems.

https://img.zhaoweiguo.com/uPic/2024/08/hLE8Hz.png

Figure 1: Generative search engines answer user queries by generating a tailored response, along with in-line citations. However, not all generated statements are fully supported by citations (citation recall), and not every citation supports its associated statement (citation precision).

https://img.zhaoweiguo.com/uPic/2024/08/K8spCq.png

Figure 2: 计算引文召回率和精确度的示例。Examples of calculating citation recall and precision.

  • Citation recall measures the proportion of generated statements that are supported by citations.

  • Citation precision measures the proportion of citations that support their associated statements.

  • Partially-supporting citations only improve citation precision when their associated statement is supported by the union of its citations and no other associated citation fully supports the statement by itself (middle example).

  • “引文召回率”衡量的是得到引文支持的生成陈述的比例。

  • “引文精度”衡量支持其关联陈述的引文比例。

  • “部分支持引用”在下面情况下能提高引文的准确性,当与其他部分支持引用共同提供的支持时。

主页

索引

模块索引

搜索页面