About

Draft edited September 20, 2023 10:11 PM (PDT)

This hosts the provisional comparative evaluation of new web search systems with a focus on dimensions around integrating with users' curiosity-engagement, question-generation, response validation, search repair, search sharing, complaint and feedback, and other concerns at the core of my research.

The new attention to search, sparked by widespread interest in OpenAI's ChatGPT, has pushed many to develop new ways to search.

This evaluation is a look towards comparing different search systems, with a particular eye towards finding and shining some light on systems that are opening up search in new ways.

I have a wide range of criteria by which to provide some marks and remarks on these systems. I'll need to narrow them down and gradually work through them. I'm not thinking about them as goals, requirements, or desired specifications, and some may even contradict one another. For some of the criteria I will provide citations as reference or support. Some criteria are drawn from examples in previous search systems (including shuttered, speculative, and experimental systems). My goal here is not to simply do an accounting of searching today, but to get some sense of where we might want search to go.

Initial Search Systems

In this initial set of reviews, I’m focusing on these search engines, listed alphabetically:

Andi Search (andisearch.com)
Metaphor (metaphor.systems)
Perplexity AI (perplexity.ai)
Phind (phind.com)
You.com (you.com)

These are my initial examples of new approaches to searching in generative web search systems. I may provide some contextualizing comments about other systems, like the explicit search-focused tools from Google and Microsoft, and chat-based systems like ChatGPT, Anthropic’s Claude, etc., that support search and search-like interactions.

To the extent that they support public-facing search, I will also be examining newer search libraries and services (including RAG frameworks), like the offerings from LangChain, from LlamaIndex, and Weaviate’s Verba, with comparison to (the also adapting) existing tools like those from Algolia and Elasticsearch.

Broad Criteria

My criteria are broad. I’m focused on concerns my research and training best prepare me to engage with. These are broadly questions related to the explicit and implicit articulation of the search system, the interactions around queries and results, the ability to share the burden of search, and the formalized methods of complaint. I’ll do some explicit evaluations of atomic performance related to “hallucination” or “groundedness”, but my focus is more on how people perceive and perform-with tool outputs than the outputs themselves. How are the searchers ushered into their searches? What do they see as searchable? How can they engage with search results (or responses)? Are they expected to vet the responses for hallucinations? How is automation bias addressed? What post-search activities are supported by the search system itself?

I’ll ask about features or uses that might perhaps be refused or reimagined, while situating this period of search amidst a longer history of search. There are important concerns about misleading results, sources of training and reference data, oversight, and the future of work. I’m very much developing these reviews to acknowledge that these systems will keep changing. Where there are very important concerns that I am less well-versed in, like accessibility, I will leverage other resources.

What searches can we avoid doing?

What newly / more easily think to do?

What make newly possible?

What can be slower or faster?

Seamful? Viscid? Vetted?

Doubtful & deliberate?

Ephemeral or persistent?

Memorable? Public? Shared?

Surfing or blazing?

Embedded? Loosely coupled?

Fun?

Daniel Griffin, Sep 3, 2023

Scoping

This is not intended to be an introductory guide to these systems, but is focused on making sense of what new search tools are providing and what they might become. These reviews may be useful to heavy users, developers, and others looking to understand changes in system support for various searching practices.

I will largely be looking at systems for web search, including those more focused on particular subject areas. Though they are important, these reviews will not (yet at least) engage with new search systems for:

academic search
- Examples:
  - Consensus (consensus.app)
  - Elicit (elicit.org)
- See:
  - Michael Gusenbauer’s call for independent audits (author copy)
  - Aaron Tay’s “categorization of interesting new academic discovery tools”
enterprise search
- Examples:
  - Glean (glean.com)
  - Vectara (vectara.com)
personal knowledge management
- Examples:
  - Klu (klu.so)
  - Rewind (rewind.ai)

I will not be trying to replicate metrics like those in ragas (tagline: “Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines”; from Exploding Gradients). Those metrics are intended to examine: Faithfulness, Context Relevancy, Context Recall, Answer Relevancy, Aspect Critiques. I’m more interested in actions outside the RAG-pipeline itself. What can be done to improve coordination between searcher expectation and system performance? What can be done to remedy system failures at searching-time?

I will also not be focused on in-editor code generation tools (like GitHub’s Copilot) and writing tools (like Lex.page) that replace or subsume some searching tasks.

I will not be very focused on various metrics related to speed, unless they are very noticeable in frequent use.

I am concerned about questions of bias, but here only insofar as the problems in these systems are markedly different from the prior problems found in search.

I am not focused on explainability or transparency of these systems, though some questions will definitely engage with those topics. I will be more focused on examining questions around seamfulness, tractability, and traceability. I will be thinking about how practical algorithmic knowledge [@cotter2022practical] is built up and valued.

I’m less focused on responding to or rehashing and regurgitating arguments about “model collapse” than on looking at how these search tools and their users imagine supporting or working towards unsealing knowledge, whether through articulations that help users doubt & dig deeper, providing multiple drafts, or RAG adaptations.

The most important work would be work looking at how these search systems and tools are imagined and used (or not) by other people. I am not looking at that right now, but I will look at aspects of the systems identified publicly by different users or others.

Roadmap

Under Construction

This is presently very much under construction, much more a speculative prototype than a full-fledged system. A proper system will require many more people working together. I am not proposing I have the “solution” or could build out a framework like this alone. This is just part of my joining the conversation about what search might be.

TODO

Add to system pages (background, reference data, reflections); scraping; interviews and guest commentary; glossary; reading list; suggested class exercises and discussion questions; add search functionality.

Additional Criteria under Consideration (incomplete)

Queries

Do they support persistent queries?
Does it autocorrect queries?

Disclosure of rating and moderation practices

Different practices may cause more or less harm to the individuals tasked with those roles. These practices may also be a leading indicator of the trustworthiness of the system itself. See @meisner2022labor.
Explicit statement about responsibility around something like ‘societal relevance’ [@sundin2021relevance]?

Oversight

Is there an oversight or advisory board of any type (outside of legal mandates)?
- Comment: See AI ethics boards, Ethical Use Advisory Council, the Facebook Oversight board, etc., [@magassa2017diverse; @young2019inclusive]

Feedback and Complaint

Are there formal feedback mechanisms for the search results or conversation pages?
What is the transparency around feedback?
Are there examples of changes from community feedback?

Community

Are there sponsored community spaces (forums, workspaces)?
- Note: Often Discord channels, sometimes Slack

Refining, Revising, and Reviewing Results

Is there support for disambiguating terms used in questions?
Are there clarification prompts?
Are there suggested or related searches?
Is there ‘faceted search’?
Are follow-on searches supported in conversational interfaces?
Are there branching searches?
Can you re-generate the response?
Can you adjust ‘temperature’?
Can you ‘block results’?
Can you ‘filter results’?
Can you ‘sort results’?

Safety

Are there “safesearch” options?
Are there statements about children using the system?

Sharing

Is there a way to share search sessions (multiple queries across a time period)?
Is there a co-search function (share a link to a shared search session)?

Standards, Openness, and Auditing

Have they published academic research?
Do they provide access to academic researchers or journalists?
- See, e.g., Lurie’s comparison of platform research API requirements [@lurie2023comparing]
Is there an engineering blog?
Do they actively support user audits?
- TODO: pull from /docs/2023/07/05/search-audits.html
- See also: Feedback and Complaint above.
Have there been external evaluations?

History

Is there a clear and accessible link to search history?
Is there a search function in the search history?

Disclaimers, Warnings, and Doubting

Is there a disclaimer for searches with limited results? Perhaps like Google’s warning around so-called “data voids” [@golebiewski2018data; @golebiewski2019data]
Are there efforts to encourage appropriate “doubting” [@lindemann2023sealed_paper; @lindemann2023sealed_poster] (or unsealing of knowledges)?

Project

This project is not externally funded. Daniel Griffin, Ph.D., is pursuing this work in the course of his research and job search.