Extension:WikiLambda/Discovery needs

This is a sketch of the discovery needs and use cases for people using the WikiLambda code on Wikifunctions. It is not final, and some of the potential use cases may never be directly met by first-party systems and services.

Use cases

Every use case below implicitly allows an optional exact or fuzzy match of the user's search input against labels in their selected or imputed input language, or in its fallbacks. Without any label input, the use case is simply a content listing.
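
A minimal sketch of the label-matching rule above, to make the fallback behaviour concrete. The `FALLBACKS` table and the name `match_language` are assumptions for illustration; real fallback chains would come from MediaWiki's language configuration, and a plain substring test stands in for whatever fuzzy matching we end up with.

```python
from __future__ import annotations

from typing import Optional

# Hypothetical fallback chains, for illustration only.
FALLBACKS = {"pt-br": ["pt", "en"], "pt": ["en"], "de": ["en"]}


def match_language(labels: dict[str, str], lang: str, query: str) -> Optional[str]:
    """Return the first language in the user's chain whose label contains the query."""
    for code in [lang, *FALLBACKS.get(lang, [])]:
        label = labels.get(code)
        # Case-insensitive substring match; a stand-in for real fuzzy matching.
        if label is not None and query.lower() in label.lower():
            return code
    return None
```

Returning the language that matched, rather than just a boolean, would let the interface tell the user that a result was found through a fallback.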

In all(?) use cases, results would be shown prioritised by fit to the search input criterion if provided, and then by popularity.
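
The ordering described above could be sketched as a two-part sort key. This assumes popularity is available as a single number per result, which is still an open question for us, and uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for a real relevance score.

```python
from difflib import SequenceMatcher


def rank(results, query=""):
    """Sort (label, popularity) pairs by fit to the query, then by popularity.

    Both parts sort descending. With no query, every result has the same fit,
    so the order is by popularity alone.
    """
    def fit(label):
        if not query:
            return 0.0
        return SequenceMatcher(None, query.lower(), label.lower()).ratio()

    return sorted(results, key=lambda r: (-fit(r[0]), -r[1]))
```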

Finding things to use / reference

  • As a function user, show me a list of Functions.
  • As a function user, show me a list of Functions, where they are labelled in at least one of my languages.
  • As a function user, show me a list of Implementations in Python 3.8 of popular Functions, and whether or not they pass their tests.
  • As a function user, show me a list of Implementations in any version of Python 3 of popular Functions, and whether or not they pass their tests.
  • As a function user, show me a list of Functions which take exactly a string.
  • As a function user, show me a list of Functions which take exactly a string or a boolean.
  • As a function user, show me a list of Functions which take exactly two strings.
  • As a function user, show me a list of Functions which take exactly a string and a boolean (in that order).
  • As a function user, show me a list of Functions which take exactly a string and a boolean (in any order).
  • As a function user, show me a list of Functions which take a string and a boolean (in any order), alongside other inputs.
  • As a function user, show me a list of Functions which take a list.
  • As a function user, show me a list of Functions which take a list of strings.
  • As a function user, show me a list of Functions which take exactly a pair of string and boolean (in either order).
  • As a function user, show me a list of Functions which take exactly a pair of string and something else.
  • As a function user, show me a list of Functions which output a string.
  • As a function user, show me a list of Functions which take a string and output an integer.
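
The "exactly", "in any order", and "alongside other inputs" variants above are three different comparisons. A minimal sketch, assuming each Function's inputs are denormalised to a flat list of type names; the function names are hypothetical, and type inheritance and generic types such as "list of strings" are deliberately ignored here.

```python
from collections import Counter


def takes_exactly(inputs, wanted, ordered=True):
    """True if the inputs are exactly the wanted types, optionally ignoring order."""
    if ordered:
        return list(inputs) == list(wanted)
    # Counter comparison is a multiset comparison, so "two strings" is respected.
    return Counter(inputs) == Counter(wanted)


def takes_at_least(inputs, wanted):
    """True if the inputs include every wanted type, in any order, alongside others."""
    # Counter subtraction drops non-positive counts; empty means nothing is missing.
    return not (Counter(wanted) - Counter(inputs))
```

Comparing multisets rather than sets matters for queries such as "exactly two strings", where a set would collapse the two inputs into one.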

Finding things to translate

  • As a contributor who can edit in French, German, and English, show me a list of Objects which have a label in at least one of those languages but not at least one of the others, so I can fix it, prioritised by popularity.
  • As a contributor who can edit Chinese fluently and read English, show me a list of Objects which have recent changes to their description in English so I can ensure they are aligned.
  • [As above, but restricted to Functions, or Implementations, or Functions including their input/output labels, or …]
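
The first translation use case above is a set comparison between the contributor's languages and the languages an Object is labelled in. A sketch under the assumption that each Object is a record with `labels` and a numeric `popularity`; both field names are hypothetical.

```python
def partially_labelled(objects, my_langs):
    """Objects labelled in at least one of my_langs but not all of them.

    Returns (object, missing_languages) pairs, most popular first.
    """
    mine = set(my_langs)
    hits = []
    for obj in objects:
        have = mine & set(obj["labels"])
        if have and have != mine:
            hits.append((obj, sorted(mine - have)))
    return sorted(hits, key=lambda hit: -hit[0]["popularity"])
```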

Finding things to create / extend

  • As a code contributor, show me a list of Functions which have no Implementations.
  • As a code contributor, show me a list of Functions which have at least one Implementation.
  • As a code contributor, show me a list of Functions which have at least three Implementations.
  • As a code contributor who likes writing Python, show me a list of Functions which have no Implementations in Python 3.
  • As a code contributor who likes writing Python and is learning Node by comparison, show me a list of Functions which have an Implementation in both Python and Node.
  • As a code contributor who likes writing Python and Node and making the code similar between them, show me a list of Functions which have an Implementation in Python or Node but not in the other.
  • As a code re-user on an environment that only lets me use Python or Node, show me a list of Functions which have at least one Implementation in Python or Node.
  • As a code contributor, show me a list of Functions which have no Testers.
  • As a code contributor, show me a list of Functions which have at least one Tester.
  • As a code contributor, show me a list of Functions which have at least ten Testers.
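
Most of the use cases in this list reduce to counts and set operations over each Function's Implementation languages. A rough sketch, assuming a denormalised record with one language entry per Implementation; the records, IDs, and field names are hypothetical, and "javascript" stands in for "Node".

```python
# Hypothetical denormalised records; one "languages" entry per Implementation.
FUNCTIONS = [
    {"id": "Z10001", "languages": [], "tester_count": 0},
    {"id": "Z10002", "languages": ["python"], "tester_count": 1},
    {"id": "Z10003", "languages": ["python", "javascript"], "tester_count": 12},
    {"id": "Z10004", "languages": ["javascript"] * 3, "tester_count": 3},
]

PAIR = {"python", "javascript"}


def select(predicate):
    """Return the IDs of the Functions for which the predicate holds."""
    return [fn["id"] for fn in FUNCTIONS if predicate(fn)]


def has_no_implementations(fn):
    return len(fn["languages"]) == 0


def has_at_least_implementations(n):
    return lambda fn: len(fn["languages"]) >= n


def in_both(fn):
    return PAIR <= set(fn["languages"])


def in_one_but_not_other(fn):
    return len(PAIR & set(fn["languages"])) == 1


def in_either(fn):
    return bool(PAIR & set(fn["languages"]))


def lacks(language):
    return lambda fn: language not in fn["languages"]
```

The Tester variants follow the same pattern on `tester_count`, for example `select(lambda fn: fn["tester_count"] >= 10)`.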

Finding things to fix

  • As a code fixer, show me a list of Functions which have some Testers and at least one Implementation in Node which does not pass all its Testers.
  • As a code fixer, show me a list of Functions which have Testers and at least one Implementation which passes all its tests, but which also have an Implementation in Python which does not.
  • As a polyglot translator-cum-coder mathematician working on matrix operations who has overly extreme demands of what search tools we can provide, show me a list of all Functions which are labelled in English but not German, take a Nx6 Matrix of Integers and a 6-Vector of Integers (in either order) labelled in Russian or Ukrainian but not Bulgarian, output a Vector of at least three integers, have at least ten testers, have a passing implementation in Node, and do not have an implementation in Python.
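
The first two "fix" use cases depend on keeping "fails" distinct from "not yet known". A sketch assuming each Implementation carries a tristate `passes` value (`True`, `False`, or `None`); the record layout and IDs are hypothetical, and "javascript" stands in for "Node".

```python
# Hypothetical records; "passes" is True, False, or None (not yet known).
FUNCTIONS = [
    {"id": "Z10001", "tester_count": 2, "implementations": [
        {"language": "javascript", "passes": False},
        {"language": "python", "passes": True}]},
    {"id": "Z10002", "tester_count": 4, "implementations": [
        {"language": "javascript", "passes": True},
        {"language": "python", "passes": False}]},
    {"id": "Z10003", "tester_count": 1, "implementations": [
        {"language": "python", "passes": None}]},
]


def fails_in(fn, language):
    """True if the Function has Testers and an Implementation in `language` known to fail."""
    return fn["tester_count"] > 0 and any(
        impl["language"] == language and impl["passes"] is False
        for impl in fn["implementations"]
    )


def passes_somewhere_but_fails_in(fn, language):
    """True if some Implementation passes while one in `language` is known to fail."""
    has_pass = any(impl["passes"] is True for impl in fn["implementations"])
    return has_pass and fails_in(fn, language)
```

Using `is False` rather than `not impl["passes"]` keeps Implementations whose result is not yet known out of the "needs fixing" list.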

Data model inputs (denormalised content) and dimensions of search

Permanent Objects are MediaWiki pages

| Field | Nature | Example | Notes |
|---|---|---|---|
| All Objects | | | |
| ID | String | Z1234 | This is just the page title, and probably won't be the basis of searches very often but power-users might type it in directly and expect it to Just Work™. |
| Type | Reference to another Object | Z8 or Z10(Z22,Z22,Z30) | These references may be complicated, compound references (as given in the second example) |
| Label | Strings in multiple languages | en:multiply; de:multiplizieren; … | This is likely the principal dimension by which type-ahead searches are made, and is similar to Wikidata's type-ahead need. Users will probably expect some level of natural language fallback in their results (pt-br -> pt, etc.). |
| Description | Multiple wikitext blocks | en:'''Multiply''' is…; de: … | These are quasi-independent blocks of wikitext; there could be dozens for a given page. |
| Functions (Type: Z8) | | | |
| Input types | Compound reference to another Object | Z22,Z22 or Z22,Z30,Z30 | This is to allow for "two integers" input filtering. Will also need type inheritance, and complicated management around ordering rules, if we provide for that. |
| Input type [multiple] | Reference to another Object | Z22 | (Same as other references.) |
| Input label [multiple] | Strings in multiple languages | en:multiplicand; de:Multiplikand; … | (Same as other multi-lingual strings.) |
| Output type | Reference to another Object | Z22 | (Same as other references.) |
| Output label | Strings in multiple languages | en:result; … | (Same as other multi-lingual strings.) |
| Implementations | List of references to other Objects | Z12345;Z12346;Z12347 | Mostly this will be uninteresting per se, and instead the next field will be of interest (and the boolean "zero or not?" query) |
| Implementation count | Int | 3 or 0 | Mostly this will be in boolean queries of if this is zero or not. (Would that need be better handled as its own boolean field?) |
| Implementation language [multiple] | String | javascript-es6 or python38 | Some level of grouping might be wanted (python38/python39 -> python3)? Maybe hack it with stemming if we are careful with the names? |
| Implementation tester result [multiple] | Boolean (tristate?) of pass/fail | true, false, or null | I don't think we'll want to model beyond the pass/fail/not-yet-known status. "Passed at least seventy percent of its tests" isn't really a discovery metric. |
| Testers | List of references to other Objects | Z12345;Z12346;Z12347 | (Same as for Implementations.) |
| Tester count | Int | 3 or 0 | (Same as for Implementations.) |
| Implementations (Type: Z14) | | | |
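
To make the Function fields above more concrete, here is a sketch of what one denormalised record might look like, including a derived "zero or not?" flag and the python38/python39 -> python3 grouping. The class and field names are assumptions for illustration, and the regular expression is only one possible way to group versions, relying on us being careful with the names.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class FunctionRecord:
    """Denormalised search record for one Function (field names are illustrative)."""
    zid: str
    labels: dict[str, str]
    input_types: list[str]
    output_type: str
    implementation_languages: list[str] = field(default_factory=list)
    tester_count: int = 0

    @property
    def implementation_count(self) -> int:
        return len(self.implementation_languages)

    @property
    def has_implementation(self) -> bool:
        # The boolean "zero or not?" query, derived rather than stored.
        return self.implementation_count > 0

    def language_groups(self) -> set[str]:
        return {group_language(name) for name in self.implementation_languages}


def group_language(name: str) -> str:
    """Collapse a minor version: python38 and python39 both become python3."""
    match = re.fullmatch(r"([a-z]+\d)\d+", name)
    return match.group(1) if match else name
```

Names that do not fit the "letters, major digit, minor digits" pattern, such as javascript-es6, are left unchanged by this grouping.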


Options

Create a bespoke reporting system inside MW's code/DB set-up

  • What:
    • We'd provide reporting via a page like Special:ListObjects or similar, and in-editor tools using the APIs.
    • Special code in WikiLambda would inject content into secondary data tables any time an Object was updated. (This is what we've already partially started for the search look-ahead API.)
    • We'd very visibly cross-link from Special:Search as it'd be useless for Object searching (but vital for normal wikitext content page searching).
  • Pros:
    • Lots of ability to add bespoke richness without slowing down the much larger users in production.
    • Avoid poor fit matches for our use cases.
    • Not held back by another team's priorities on feature/language support from upstream.
  • Cons:
    • Duplication of effort between us and Search Platform and Elastic – for example, we'd need to invent-our-own for a lot of specialist language-related content (e.g. stemming).
    • Significant extra application server and database load compared to our expectations (but fewer demands on ElasticSearch infrastructure).
    • Users would have a split experience between searching for Object content and policy pages / etc.; this may be irritating especially if the community has "how to" documentation in both Wikifunctions: namespace and Object descriptions.
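
The secondary-table idea in this option could look roughly like the following. This is only a sketch: the real thing would live in WikiLambda's PHP code and MediaWiki's database layer, whereas this uses Python's standard `sqlite3` module, and the table name, column names, and `on_save` hook are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE function_index ("
    " zid TEXT PRIMARY KEY,"
    " implementation_count INTEGER NOT NULL,"
    " tester_count INTEGER NOT NULL)"
)


def on_save(zid, implementation_count, tester_count):
    """Refresh the secondary row whenever the Object's page is saved."""
    conn.execute(
        "INSERT OR REPLACE INTO function_index VALUES (?, ?, ?)",
        (zid, implementation_count, tester_count),
    )


def functions_without_testers():
    """A categorical reporting query: every Function with no Testers."""
    rows = conn.execute(
        "SELECT zid FROM function_index WHERE tester_count = 0 ORDER BY zid"
    )
    return [zid for (zid,) in rows]
```

Because the row is rewritten on every save, exhaustive reporting queries like this one stay exact and current, which is the main attraction of this option.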

Inject content into Wikimedia cluster's ElasticSearch system and query that

  • What:
    • Special code in WikiLambda would define custom content fields for CirrusSearch and provide content updates.
    • Users would use Special:Search for everything, with a bunch of special keywords like object_has_language:fr (possibly hidden behind a bespoke extra search interface).
  • Pros:
    • Standard model for content querying, familiar to community members.
    • Much less scary infrastructure / load balancing work for the team.
    • Might be able to take advantage of current facet tool (AdvancedSearch), extending that rather than writing our own interface from scratch.
  • Cons:
    • Content updates may be slow compared to on-wiki activity.
    • Not good for categorical reporting queries ("list every Function with no Testers", "exactly how many Objects have no labels in Assyrian?" etc.).
    • Less flexible to our needs.
    • Language support may lag behind what we ideally want as a team.
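
To illustrate the user-facing shape of keyword searches such as object_has_language:fr, here is a small sketch that separates free-text terms from key:value filters. It is not how CirrusSearch parses keywords; `parse_query` is a hypothetical helper that only shows what a bespoke search interface might generate or accept.

```python
def parse_query(text):
    """Split a search string into free-text terms and key:value keyword filters."""
    terms, filters = [], {}
    for token in text.split():
        key, sep, value = token.partition(":")
        if sep and key and value:
            # Repeated keywords accumulate, e.g. two object_has_language filters.
            filters.setdefault(key, []).append(value)
        else:
            terms.append(token)
    return " ".join(terms), filters
```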

Build our own reporting system based on a triple store/etc. for internal querying (and perhaps external querying?)

  • What:
    • We'd build out our own triple store for our special data (similar to the Wikidata Query Service and the Wikimedia Commons Query Service).
    • We would proxy the results from the back-end into one or more reporting pages on Special:ListObjects or similar.
  • Pros:
    • Amazingly flexible, built around our needs.
    • Support for longer-running complex queries; we could stash the results of these and re-trigger the run every few hours or whatever would be needed.
  • Cons:
    • Triple stores are best for combining with other datasets; ours wouldn't match up with anything external, unlike Wikidata.
    • Asking Search Platform / SRE to support Yet Another Blazegraph Instance may not make us popular.
    • The effort to stand up and provide on-going support to a triple store is very high.
    • Not good for categorical reporting queries ("list every Function with no Testers", "exactly how many Objects have no labels in Assyrian?" etc.), though (?) better than Elastic.
    • Updates would be slower than ElasticSearch.
    • External querying adds a lot of extra concerns (security, updates, load, …)
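
The "stash the results and re-trigger the run every few hours" idea above could be sketched as a small wrapper around a slow query. The class name and the injected clock are assumptions for illustration; a real deployment would more likely use a job queue and a shared cache than in-process state.

```python
import time


class StashedReport:
    """Cache the result of a long-running query and re-run it once it is stale."""

    def __init__(self, run_query, max_age_seconds, clock=time.monotonic):
        self._run_query = run_query
        self._max_age = max_age_seconds
        self._clock = clock  # injectable so staleness can be tested without waiting
        self._stamp = None
        self._result = None

    def get(self):
        now = self._clock()
        if self._stamp is None or now - self._stamp >= self._max_age:
            self._result = self._run_query()
            self._stamp = now
        return self._result
```

Reporting pages would then read from `get()`, so at most one expensive back-end query runs per refresh window however many people view the page.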

Questions

  • In general we'd want to provide lists "by popularity".
    • What do we mean by this?
    • How would we inform the searching system of our needs?
    • Are there circumstances where we wouldn't want this ordering? What other orderings would we want, and why?
  • What have I forgotten?