Extension:WikibaseMediaInfo/MediaSearch

MediaSearch aims improve the experience of searching for media files.

As part of MediaSearch, we also built a new media-focused UI (Special:MediaSearch) that can even be used as a standalone extension.

This page, however, is about the new media-specific search profile in Extension:WikibaseMediaInfo that seek to leverage the structured data associated with files to improve the search results. (Almost) all relevant code lives in the src/Search directory in Extension:WikibaseMediaInfo.

Setup

WikibaseMediaInfoHooks::onCirrusSearchRegisterFullTextQueryClassifiers

This hook allows registering a "classifier." More on that later.

WikibaseMediaInfoHooks::onCirrusSearchProfileService

This wires up the other components: a query builder, a rescore profile, and a rescore function chain (I'll go over what those things are in a minute.)

The important thing to remember here is that this is the place where all of those components come together, and inform CirrusSearch when it needs to invoke our algorithms.

Right now, those conditions are:

only NS_FILE results are requested
the search term classifiers' results include a mediasearch class (more below)

When those conditions are met, the MediaSearch search profile is invoked, and with that profile (unless overridden explicitly in the request) comes a default (mediasearch-specific) rescore profile.

Components

Classifier

A classifier is essentially the first step. When a search query is received, it'll be parsed into an abstract syntax tree. All registered classifiers will get to step through that tree to understand what kind of query we're dealing with.

Those classifiers generally look for certain traits in the search term. E.g. whether the search term contains simple words or phrases, or more complex structures.

The MediaSearch classifier (MediaSearchASTClassifier) simply steps through the AST to inspect what kind of query features are used, and whether they're supported by the MediaSearch query builder. If any of the search term features is not yet supported (e.g. contains wildcards, …), it'll return some class to indicate that MediaSearch is not supported. Else, it'll indicate that MediaSearch is supported for the given search query.

Query builder profile

This is essentially the part that accepts the search term and returns an elastic query. This is where we also find matching Wikidata entities for a given search term, and include those ids as part of the elastic query. And a few more tweaks, which I'll cover in more detail below.

See MediaSearchProfiles.php for the MediaSearch query builder profiles configuration. This is also where the weights for all fields are configured.

Rescore profile

After the query builder has delivered an elastic query to find & score relevant terms for a given search term, a rescore profile can add to that elastic query to alter the result scores, based on characteristics of the results rather than the search term. E.g.: increase the score of more recent articles, popular articles, or articles with a certain template, or combinations of these. Stuff like that.

While it is possible to register a default rescore profile along with a query builder profile (as we do for MediaSearch), it is also possible to swap out the rescore profile (e.g. via the API’s srqiprofile param.)

Rescore profiles are basically a bunch of functions being executed for all results, allowing their scores to change. Those results are then combined with the original score in specific ways: some are added, some multiply, some can be multiple functions of which only the maximum score is added, etc.

See MediaSearchRescoreProfiles.php for the MediaSearch rescore profiles configuration.

Rescore function chains

Rescore function chains are a subset of the rescore profiles configuration. They're the individual functions (or a combination of functions) that calculate a specific score for a result.

Rescore profiles, OTOH, are the configuration for which function chains to use, and how they should be combined with the original resultset scores (or additional function chains.)

E.g. there can be a function chain that calculates a specific score based on incoming links to a page, which can then be used in multiple rescore profiles: one where it has a small-ish impact on the original score (wsum_inclinks - a weighted balance between original relevance score & popularity rescore), and another where it has a massive impact (popular_inclinks - basically reducing the results to a popularity sort.)

See MediaSearchRescoreFunctionChains.php for the MediaSearch rescore function chains configuration.

For MediaInfo, we registered a custom rescore profile (classic_noboostlinks_max_boost_template), which is almost a clone of an existing rescore profile in Cirrus (classic_noboostlinks) that we used to use for awhile. The difference is in their template handling. Pages can get a certain boost if they contain certain templates (e.g. Template:Quality_image or Template:Valued_image) and all of those boosts are then multiplied.

On Commons, many files that have a certain quality assessment also have another (in fact, for some, a previous quality assessment is a requirement for another), and this multiplication led to massive boosts (up to 10x), which made it easy for poor results (as in: good image, but maybe not all that relevant for the search term) to score in the first few results. (e.g. a search for "king" was dominated by quality-assessed pictures of people or places that had "king" in their name - not monarchs.) In order to put a limit on the template impact, we had to create a separate function chain where the template function was pulled out and ensured it'd only use the maximum value, and then a new profile that re-assembled the new function chains.

Query builder lifecycle

As mentioned before, this is the part that receives a search query string and turns it into an elastic query object.

Until recently, MediaSearchQueryBuilder.php held pretty much all of the MediaSearch-specific logic. This has been refactored in order to add support for more advanced queries, like cat OR dog, where all of our existing logic now might have to be applied on multiple terms. This file is now simply the main entrypoint, and it'll invoke MediaSearchASTQueryBuilder.php, where the real logic lives nowadays.

That one is quite similar to the classifier: it will receive the search term AST and step through it, invoking certain specific logic for every supported node that it encounters (which is currently limited to boolean operations (AND/OR/NOT), simple words and phrases.) This file will for the most part call out to other classes to deal with a specific part of the search query, and then assemble all those parts into one big elastic query.

Before it starts stepping through the AST and start to build an elastic query, it'll first invoke another thing: MediaSearchASTEntitiesExtractor.php. This is yet another class that traverses the nodes of the AST & it does it prior to the query builder node visitor. It'll simply capture all the relevant search terms. After this, the query builder starts doing its thing and delegates most of the work for dealing with parts of the query to other classes.

The most complex node handler that we have is WordsQueryNodeHandler.php, the one for simple words (which can be invoked more than once, for searches like cat OR dog, in which case this gets invoked once for cat and once for dog.) This is now the class that turns 1 specific search term into elastic query clauses. PhraseQueryNodeHandler.php (for exact phrase matches) is similar, but simpler.

Here's the gist of what WordsQueryNodeHandler.php (or other node handlers) do(es):

Simple words query handler

For a bunch text fields (e.g. descriptions, title, text, category, …), an elastic match query is created for the (relevant part of the) search term. Each field is given a certain weight. Matches of the search term in all of these fields will contribute to the final score per file. This score is a complex formula based on the frequency of the words in that specific field, and across the entire search index (more rare across search index = higher score, more occurrences inside field = higher score). Those scores are then multiplied with the weights per field. This is basically what a default Cirrus profile already did as well, except that those don't include (multilingual) captions.

The weights for all fields have also been tweaked based on a logistic regression that was based on 10k+ manually assessed real-world search term results: there is a strong correlation between how relevant a file is for a search term, and how well it scores. Fields that have shown that they have many false positives get a lower weight, others get boosted more.

For that given search term, we'll call the Wikidata API to figure out which (if any) entities are a good match. This is why, as mentioned earlier, we had a class fetch all individual terms in a search query - we can now efficiently request them all at once. This happens in MediaSearchEntitiesFetcher.php, and there are 2 intermediate caching layers (MediaSearchCachingEntitiesFetcher.php & MediaSearchMemoryEntitiesFetcher.php.) This will simply do a Wikidata search for a word, and then return the relevant entities.

Now that we have that entity information, we'll add a bit more to the existing match clauses that we already had. If a file has any of the statements in their index, they also count! This code is currently in WikibaseEntitiesHandler.php, though I wouldn't be too surprised if that ends up changing at some point.

Sadly, the scores for full text (one or multiple words) and statements (a very computery term like "P180=<entity id>") are extremely unbalanced. Aside from the word frequency-based algorithm, we also found that more words (in the search term) almost always lead to higher scores. We could optimize for short 1-word terms and adjust the weight of statements scores so that it roughly matches it, but statements would then never score high enough on multiple-word searches and statements would have no impact on those search results. Or vice versa. For this reason, we added a bit of score normalization that reduces the fulltext scores based on the amount of words in the search query: long or short search terms will now always be in similar territory. (This was based on averages based on real-world search terms & their best result scores, though there's a massive standard deviation between those scores to the point that it's certainly still not sufficiently balanced for some terms. We've thus far never gotten around to researching and optimizing this further, but should definitely revisit this at some point.)