Extension:WikiLambda/Discovery needs

This is a sketch of the discovery needs and use cases for users of the WikiLambda extension on Wikifunctions. It is not final, and some of the potential use cases may never be directly met by first-party systems and services.

Use cases

All use cases implicitly include an optional search term: either the user's input is matched (exactly or fuzzily) against labels in their selected or inferred input language or its fallbacks, or no input is given and the user just wants a content listing.

In all(?) use cases, results would be ordered first by fit to the search input, if provided, and then by popularity.
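
The "fit first, then popularity" ordering could be sketched roughly as follows. This is only an illustration: the `Result` record, the `popularity` number, the made-up ZIDs, and the use of `difflib.SequenceMatcher` as a stand-in fuzzy-match score are all assumptions for the example, not part of WikiLambda.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class Result:
    zid: str         # page title of the Object (made-up IDs below)
    label: str       # label in the user's language or one of its fallbacks
    popularity: int  # hypothetical popularity score; higher is more popular


def fit(query: str, label: str) -> float:
    """Crude fuzzy-match score between 0.0 (no overlap) and 1.0 (identical)."""
    return SequenceMatcher(None, query.lower(), label.lower()).ratio()


def rank(results: List[Result], query: Optional[str] = None) -> List[Result]:
    """Order by fit to the query (if one was given), then by popularity."""
    if not query:
        return sorted(results, key=lambda r: -r.popularity)
    return sorted(results, key=lambda r: (-fit(query, r.label), -r.popularity))


candidates = [
    Result("Z10001", "multiply", 5),
    Result("Z10002", "multiply", 50),
    Result("Z10003", "add", 100),
]
```

With these candidates, `rank(candidates, "multiply")` puts the two exact label matches first (most popular first), while `rank(candidates)` falls back to pure popularity order. What "popularity" actually means is still an open question (see Questions).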

Finding things to use / reference

As a function user, show me a list of Functions.
As a function user, show me a list of Functions, where they are labelled in at least one of my languages.
As a function user, show me a list of Implementations in Python 3.8 of popular Functions, and whether or not they pass their tests.
As a function user, show me a list of Implementations in any version of Python 3 of popular Functions, and whether or not they pass their tests.

As a function user, show me a list of Functions which take exactly a string.
As a function user, show me a list of Functions which take exactly a string or a boolean.
As a function user, show me a list of Functions which take exactly two strings.
As a function user, show me a list of Functions which take exactly a string and a boolean (in that order).
As a function user, show me a list of Functions which take exactly a string and a boolean (in any order).
As a function user, show me a list of Functions which take a string and a boolean (in any order), alongside other inputs.
As a function user, show me a list of Functions which take a list.
As a function user, show me a list of Functions which take a list of strings.
As a function user, show me a list of Functions which take exactly a pair of string and boolean (in either order).
As a function user, show me a list of Functions which take exactly a pair of string and something else.
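
The difference between "exactly, in that order", "exactly, in any order" and "alongside other inputs" can be made concrete with a small sketch. The type names here are plain placeholder strings rather than real ZIDs, and the function names are invented for the example; typed lists and type inheritance are deliberately left out.

```python
from collections import Counter
from typing import Sequence


def takes_exactly_in_order(inputs: Sequence[str], wanted: Sequence[str]) -> bool:
    """Same types, same count, same positions."""
    return list(inputs) == list(wanted)


def takes_exactly_any_order(inputs: Sequence[str], wanted: Sequence[str]) -> bool:
    """Same types and counts, ignoring positions (a multiset comparison)."""
    return Counter(inputs) == Counter(wanted)


def takes_alongside_others(inputs: Sequence[str], wanted: Sequence[str]) -> bool:
    """All wanted types are present often enough; extra inputs are allowed."""
    have = Counter(inputs)
    return all(have[t] >= n for t, n in Counter(wanted).items())
```

For example, a Function taking `["boolean", "string"]` fails the in-order check against `["string", "boolean"]` but passes the any-order one, and a Function taking `["boolean", "integer", "string"]` only passes the "alongside other inputs" check.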

As a function user, show me a list of Functions which output a string.
…

As a function user, show me a list of Functions which take a string and output an integer.

Finding things to translate

As a contributor who can edit in French, German, and English, show me a list of Objects which have a label in at least one of those languages but are missing one in at least one of the others, so I can fill the gaps, prioritised by popularity.
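
A minimal sketch of that filter, assuming each Object is available as a mapping from language code to label plus some popularity number (both the data shape and the function names are hypothetical):

```python
from typing import Dict, Iterable, List, Set, Tuple


def missing_languages(labels: Dict[str, str], mine: Iterable[str]) -> Set[str]:
    """Languages of mine that lack a label, if at least one of mine has one."""
    wanted = set(mine)
    present = wanted.intersection(labels)
    if not present:
        return set()  # nothing I can read to translate from
    return wanted - present


def translation_worklist(
    objects: Dict[str, Tuple[Dict[str, str], int]], mine: Iterable[str]
) -> List[Tuple[str, List[str]]]:
    """Objects with gaps in my languages, most popular first."""
    mine = list(mine)
    rows = []
    for zid, (labels, popularity) in objects.items():
        gaps = missing_languages(labels, mine)
        if gaps:
            rows.append((popularity, zid, sorted(gaps)))
    rows.sort(key=lambda row: -row[0])
    return [(zid, gaps) for _, zid, gaps in rows]


objects = {
    "Z10001": ({"en": "multiply", "de": "multiplizieren"}, 10),
    "Z10002": ({"en": "add", "fr": "additionner", "de": "addieren"}, 99),
    "Z10003": ({"fr": "soustraire"}, 40),
    "Z10004": ({"es": "dividir"}, 70),
}
```

Here `translation_worklist(objects, ["fr", "de", "en"])` skips the fully labelled Object and the one with no label in any of the contributor's languages, and lists the remaining two by popularity.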

As a contributor who can edit Chinese fluently and read English, show me a list of Objects which have recent changes to their description in English so I can ensure they are aligned.
[As above, but restricted to Functions, or Implementations, or Functions including their input/output labels, or …]

Finding things to create / extend

As a code contributor, show me a list of Functions which have no Implementations.
As a code contributor, show me a list of Functions which have at least one Implementation.
As a code contributor, show me a list of Functions which have at least three Implementations.
As a code contributor who likes writing Python, show me a list of Functions which have no Implementations in Python 3.
As a code contributor who likes writing Python and is learning Node by comparison, show me a list of Functions which have an Implementation in both Python and Node.
As a code contributor who likes writing Python and Node and making the code similar between them, show me a list of Functions which have an Implementation in Python or Node but not in the other.
As a code re-user on an environment that only lets me use Python or Node, show me a list of Functions which have at least one Implementation in Python or Node.
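
The four language-based variants above ("none in", "both", "one but not the other", "at least one of") are simple set tests once each Function carries the set of languages it has Implementations in. A sketch, assuming a hypothetical index from Function ID to language names:

```python
from typing import Dict, List, Set

# Hypothetical index: Function ID -> languages with at least one Implementation.
Index = Dict[str, Set[str]]


def with_none_in(index: Index, lang: str) -> List[str]:
    return sorted(z for z, langs in index.items() if lang not in langs)


def with_both(index: Index, a: str, b: str) -> List[str]:
    return sorted(z for z, langs in index.items() if a in langs and b in langs)


def with_exactly_one_of(index: Index, a: str, b: str) -> List[str]:
    # True for "a but not b" and for "b but not a".
    return sorted(z for z, langs in index.items() if (a in langs) != (b in langs))


def with_either(index: Index, a: str, b: str) -> List[str]:
    return sorted(z for z, langs in index.items() if a in langs or b in langs)


index: Index = {
    "Z10001": {"python", "node"},
    "Z10002": {"python"},
    "Z10003": {"node"},
    "Z10004": set(),
}
```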

As a code contributor, show me a list of Functions which have no Testers.
As a code contributor, show me a list of Functions which have at least one Tester.
As a code contributor, show me a list of Functions which have at least ten Testers.

Finding things to fix

As a code fixer, show me a list of Functions which have some Testers and at least one Implementation in Node which does not pass all its Testers.
As a code fixer, show me a list of Functions which have Testers, at least one Implementation which passes all of them, and an Implementation in Python which does not.
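
These two "fixer" queries could be expressed as predicates over a per-Implementation pass/fail/unknown flag. The record shapes and names below are invented for the sketch, and it assumes that "does not pass" means a known failure, not a result that is still unknown:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Implementation:
    language: str
    passes_all: Optional[bool]  # True = pass, False = fail, None = not yet known


@dataclass
class Function:
    zid: str
    tester_count: int
    implementations: List[Implementation] = field(default_factory=list)


def has_failing(fn: Function, language: Optional[str] = None) -> bool:
    """At least one known-failing Implementation, optionally in one language."""
    return any(
        impl.passes_all is False and (language is None or impl.language == language)
        for impl in fn.implementations
    )


def has_passing(fn: Function) -> bool:
    return any(impl.passes_all is True for impl in fn.implementations)


def node_needs_fixing(fn: Function) -> bool:
    """Has Testers and a Node Implementation that fails at least one of them."""
    return fn.tester_count > 0 and has_failing(fn, "node")


def python_lags_behind(fn: Function) -> bool:
    """Has Testers, some passing Implementation, and a failing Python one."""
    return fn.tester_count > 0 and has_passing(fn) and has_failing(fn, "python")
```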

As a polyglot translator-cum-coder mathematician working on matrix operations who has overly extreme demands of what search tools we can provide, show me a list of all Functions which are labelled in English but not German, take an Nx6 Matrix of Integers and a 6-Vector of Integers (in either order) labelled in Russian or Ukrainian but not Bulgarian, output a Vector of at least three Integers, have at least ten Testers, have a passing Implementation in Node, and do not have an Implementation in Python.

Data model inputs (denormalised content) and dimensions of search

Permanent Objects are MediaWiki pages

Field	Nature	Example	Notes
All Objects
ID	String	`Z1234`	This is just the page title, and probably won't be the basis of searches very often but power-users might type it in directly and expect it to Just Work™.
Type	Reference to another Object	`Z8` `Z10(Z22,Z22,Z30)`	These references may be complicated, compound references (as given in the second example)
Label	Strings in multiple languages	`en:multiply; de:multiplizieren; …`	This is likely the principal dimension by which type-ahead searches are made, and is similar to Wikidata's type-ahead need. Users will probably expect some level of natural language fallback in their results (pt-br -> pt, etc.).
Description	Multiple wikitext blocks	`en:'''Multiply''' is…; de: …`	These are quasi-independent blocks of wikitext; there could be dozens for a given page.
Functions (Type: Z8)
Input types	Compound reference to another Object	`Z22,Z22` `Z22,Z30,Z30`	This is to allow for "two integers" input filtering. Will also need type inheritance, and complicated management around ordering rules, if we provide for that.
Input type [multiple]	Reference to another Object	`Z22`	(Same as other references.)
Input label [multiple]	Strings in multiple languages	`en:multiplicand; de:Multiplikand; …`	(Same as other multi-lingual strings.)
Output type	Reference to another Object	`Z22`	(Same as other references.)
Output label	Strings in multiple languages	`en:result; …`	(Same as other multi-lingual strings.)
Implementations	List of references to other Objects	`Z12345;Z12346;Z12347`	Mostly this will be uninteresting per se, and instead the next field will be of interest (and the boolean "zero or not?" query)
Implementation count	Int	`3` `0`	Mostly this will be used in boolean queries of whether it is zero or not. (Would that need be better handled as its own boolean field?)
Implementation language [multiple]	String	`javascript-es6` `python38`	Some level of grouping might be wanted (python38/python39 -> python3)? Maybe hack it with stemming if we are careful with the names?
Implementation tester result [multiple]	Boolean (tristate?) of pass/fail	`true` `false` `null`	I don't think we'll want to model beyond the pass/fail/not-yet-known status. "Passed at least seventy percent of its tests" isn't really a discovery metric.
Testers	List of references to other Objects	`Z12345;Z12346;Z12347`	(Same as for Implementations.)
Tester count	Int	`3` `0`	(Same as for Implementations.)
Implementations (Type: Z14)
…
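
Two of the notes in the table, label fallback (pt-br -> pt) and grouping of implementation languages (python38/python39 -> python3), can be sketched as small helpers. Both rules are simplifying assumptions for illustration: real language fallback chains are configured per language and are not just subtag stripping, and the grouping only works if language names follow a "name plus version digits" pattern.

```python
import re
from typing import Dict, List, Optional


def fallback_chain(lang: str) -> List[str]:
    """Strip trailing subtags one at a time: 'pt-br' -> ['pt-br', 'pt']."""
    parts = lang.lower().split("-")
    return ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]


def best_label(labels: Dict[str, str], lang: str) -> Optional[str]:
    """First label found along the fallback chain, or None if there is none."""
    for code in fallback_chain(lang):
        if code in labels:
            return labels[code]
    return None


def language_family(name: str) -> str:
    """Group 'python38' and 'python39' as 'python3'; leave other names alone."""
    match = re.fullmatch(r"([a-z]+\d)\d+", name)
    return match.group(1) if match else name
```

With these rules a user searching in pt-br would still see a label that only exists in pt, and a filter on `python3` would cover both `python38` and `python39`, while a name like `javascript-es6` is left ungrouped.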

Options

Create a bespoke reporting system inside MW's code/DB set-up

What:
- We'd provide reporting via a page like Special:ListObjects or similar, and in-editor tools using the APIs.
- Special code in WikiLambda would inject content into secondary data tables any time they updated. (This is what we've already partially started for the search look-ahead API.)
- We'd very visibly cross-link from Special:Search as it'd be useless for Object searching (but vital for normal wikitext content page searching).
Pros:
- Lots of ability to add bespoke richness without slowing down the much larger production wikis that share the search infrastructure.
- Avoids poor-fit matches for our use cases.
- Not held back by another team's priorities on feature/language support from upstream.
Cons:
- Duplication of effort between us and Search Platform and Elastic – for example, we'd need to invent-our-own for a lot of specialist language-related content (e.g. stemming).
- Significant extra application server and database load compared to our expectations (but fewer demands on the ElasticSearch infrastructure).
- Users would have a split experience between searching for Object content and searching for policy pages etc.; this may be irritating, especially if the community has "how to" documentation in both the Wikifunctions: namespace and Object descriptions.
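
To give a feel for the secondary-table approach, here is a toy version using an in-memory SQLite database. The table and column names are made up for the example and are not WikiLambda's actual schema; the real thing would live in MediaWiki's database layer.

```python
import sqlite3
from typing import List

# Hypothetical denormalised tables, refreshed whenever an Object is saved.
SCHEMA = """
CREATE TABLE functions (zid TEXT PRIMARY KEY);
CREATE TABLE function_testers (function_zid TEXT, tester_zid TEXT);
"""


def functions_without_testers(conn: sqlite3.Connection) -> List[str]:
    """Every Function that has no row in the Testers table."""
    rows = conn.execute(
        """
        SELECT f.zid
        FROM functions AS f
        LEFT JOIN function_testers AS t ON t.function_zid = f.zid
        WHERE t.tester_zid IS NULL
        ORDER BY f.zid
        """
    )
    return [zid for (zid,) in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO functions VALUES (?)",
    [("Z10001",), ("Z10002",), ("Z10003",)],
)
conn.executemany(
    "INSERT INTO function_testers VALUES (?, ?)",
    [("Z10001", "Z20001"), ("Z10001", "Z20002")],
)
```

Exhaustive listings like this are straightforward in SQL, which is part of the appeal of this option; the fuzzy, language-aware label matching is the part we would have to build ourselves.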

Inject content into Wikimedia cluster's ElasticSearch system and query that

What:
- Special code in WikiLambda would define custom content fields for CirrusSearch and provide content updates.
- Users would use Special:Search for everything, with a bunch of special keywords like object_has_language:fr (possibly hidden behind a bespoke extra search interface).
Pros:
- Standard model for content querying, familiar to community members.
- Much less scary infrastructure / load balancing work for the team.
- Might be able to take advantage of current facet tool (AdvancedSearch), extending that rather than writing our own interface from scratch.
Cons:
- Content updates may be slow compared to on-wiki activity.
- Not good for categorical reporting queries ("list every Function with no Testers", "exactly how many Objects have no labels in Assyrian?" etc.).
- Less flexible to our needs.
- Language support may lag behind what we ideally want as a team.
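
If the keywords were hidden behind a bespoke interface, that interface would mostly be assembling a search string from form fields. A minimal sketch, assuming the `object_has_language` keyword mentioned above exists (the function name and parameters are invented, and how repeated keywords combine would depend on how the keyword ends up being defined):

```python
from typing import Iterable


def build_search_query(text: str, has_languages: Iterable[str] = ()) -> str:
    """Combine free-text input with object_has_language: keyword filters."""
    parts = [text.strip()] if text.strip() else []
    parts += [f"object_has_language:{code}" for code in has_languages]
    return " ".join(parts)
```

For instance, `build_search_query("multiply", ["fr", "de"])` would produce the string `multiply object_has_language:fr object_has_language:de`, which the user never has to type themselves.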

Build our own reporting system based on a triple store/etc. for internal querying (and perhaps external querying?)

What:
- We'd build out our own triple store for our special data (similar to the Wikidata Query Service and the Wikimedia Commons Query Service).
- We would proxy the results from the back-end into one or more reporting pages on Special:ListObjects or similar.
Pros:
- Amazingly flexible, built around our needs.
- Support for longer-running complex queries; we could stash the results of these and re-trigger the run every few hours or whatever would be needed.
Cons:
- Triple stores are best for combining with other datasets; ours wouldn't match up with anything external, unlike Wikidata.
- Asking Search Platform / SRE to support Yet Another BlazeGraph Instance may not make us popular.
- The effort to stand up and provide on-going support to a triple store is very high.
- Not good for categorical reporting queries ("list every Function with no Testers", "exactly how many Objects have no labels in Assyrian?" etc.), though (?) better than Elastic.
- Updates would be slower than ElasticSearch.
- External querying adds a lot of extra concerns (security, updates, load, …)

Questions

In general we'd want to provide lists "by popularity".
- What do we mean by this?
- How would we inform the searching system of our needs?
- Are there circumstances where we wouldn't want this ordering? What other orderings would we want, and why?
What have I forgotten?
…