Manual:Pywikibot/Cookbook/Page generators

Overview

A page generator is an object that is iterable (see PEP 255) and that yields page objects on which other scripts can then work.

Most of these functions just wrap a Site or Page method that returns a generator. For testing purposes, listpages.py can be used to print page titles to standard output.

Documentation

Page generators form one of the most powerful tools of Pywikibot. A page generator iterates over the desired pages.

Why use page generators?

  • You may separate finding the pages to work on from the actual processing, so the code becomes cleaner and more readable.
  • They implement reusable code for typical tasks.
  • The Pywikibot team writes the page generators and follows the changes of the MediaWiki API, so you can write your code at a higher level.

A possible reason to write your own page generator is mentioned in [[../Working with your watchlist#Follow your bot|Follow your bot]] section.

Most page generators are available via command line arguments for end users. See https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.pagegenerators.html for details. If you write your own script, you may use these arguments, but if they are fixed for the task, you may want to invoke the appropriate generator directly instead of handling command line arguments.

Life is too short to list them all here, but the most used generators are listed under the above link. You may also discover them in the pywikibot/pagegenerators directory of your Pywikibot installation. They may be divided into three main groups:

  1. High-level generators for direct use, mostly (but not exclusively) based on the MediaWiki API. They usually have long and hard-to-remember names, but the names can always be looked up in the docs or the code. They are connected to command line arguments.
  2. Filters. They wrap around another generator (taking the original generator as an argument) and filter the results, for example by namespace. This means they won't run too fast... See the sketch after this list.
  3. Low-level API-based generators may be obtained as methods of Page, Category, User, FilePage, WikibasePage or Site objects. Most of them are wrapped in a high-level generator function, which is the preferred use (we may say, the public interface of Pywikibot); however, nothing forbids direct use. Sometimes they yield structures rather than page objects, but they can be turned into a real page generator, as we will see in an example.
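
A minimal sketch of the group 2 wrapping pattern (the titles below are only example values): a filter generator simply takes another generator as its first argument.

import pywikibot
from pywikibot.pagegenerators import (PagesFromTitlesGenerator,
                                      NamespaceFilterPageGenerator)

site = pywikibot.Site()
# a group 1 generator: pages built from plain title strings
inner = PagesFromTitlesGenerator(
    ['Budapest', 'Talk:Budapest', 'User:Example'], site=site)
# a group 2 filter wrapping it: only main namespace pages get through
for page in NamespaceFilterPageGenerator(inner, [0], site=site):
    print(page.title())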

Pagegenerators package (group 1 and 2)

Looking into the pywikibot/pagegenerators directory, you discover modules whose names begin with an underscore. This means they are not intended for direct import; however, they can be useful for discovering the features. The incorporated generators may be used as

import pywikibot.pagegenerators
for page in pywikibot.pagegenerators.AllpagesPageGenerator(total=10):
    print(page)

which is almost equivalent to:

from pywikibot.pagegenerators import AllpagesPageGenerator
for page in AllpagesPageGenerator(total=10):
    print(page)

To interpret this directory, which appears in code as the pywikibot.pagegenerators package:

  • __init__.py primarily holds the documentation, but there are also some wrapper generators in it.
  • _generators.py holds the primary generators.
  • _filters.py holds the wrapping filter generators.
  • _factory.py is responsible for interpreting the command line arguments and choosing the appropriate generator function (see the sketch below).
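
If your own script should accept these arguments, the usual pattern (a sketch; exact method names may vary slightly between Pywikibot versions) is to feed them into a GeneratorFactory:

import pywikibot
from pywikibot import pagegenerators

def main():
    # global options (-lang, -family, ...) are handled by Pywikibot itself
    local_args = pywikibot.handle_args()
    gen_factory = pagegenerators.GeneratorFactory()
    for arg in local_args:
        # page generator arguments such as -cat:..., -ns:... or -start:...
        gen_factory.handle_arg(arg)
    gen = gen_factory.getCombinedGenerator()
    if gen:
        for page in gen:
            print(page.title())

if __name__ == '__main__':
    main()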

API generators (group 3)

MediaWiki offers a lot of low-level page generators, which are implemented in the GeneratorsMixin class. APISite is a child of GeneratorsMixin, so we may use these methods on our site instance. While the above-mentioned pagelike objects have their own methods that may easily be found in the documentation of the class, they usually use an underlying method which is implemented in APISite, and sometimes offers more features.
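
For example, a minimal sketch using a Site method directly (the namespace and limit below are arbitrary example values):

import pywikibot

site = pywikibot.Site()
# a low-level generator method of the site object (group 3);
# here it yields Page objects from the template namespace
for page in site.allpages(namespace=10, total=5):
    print(page.title())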

Usage

Generators may be used in for loops as shown above, but they may also be transformed into lists:

print(list(AllpagesPageGenerator(total=10)))

But be careful: while a loop processes the pages continuously, building the list may take a while, because all the items have to be read from the generator first. This statement is very fast for total=10, takes noticeable time for total=1000, and is definitely slow for total=100000. Of course, it will also consume a lot of memory for big numbers, so it is usually better to use generators in a loop.
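
If you only need a small sample for inspection, a sketch like this, using itertools from the standard library rather than a Pywikibot feature, caps the consumption without building a full list:

from itertools import islice

import pywikibot
from pywikibot.pagegenerators import AllpagesPageGenerator

site = pywikibot.Site()
# take only the first 10 pages from a potentially very long generator
for page in islice(AllpagesPageGenerator(site=site), 10):
    print(page.title())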

A few interesting generators

A non-exhaustive list of useful generators. All these may be imported from pywikibot.pagegenerators.

Autonomous generators (_generators.py)

Most of them correspond to a special page on wiki.

  • AllpagesPageGenerator(): Yields all the pages in a long, long queue, in alphabetical order. You may specify the start, the namespace, a limit to avoid endless queues, and whether redirects should be included, excluded or exclusively yielded. See an example below.
  • PrefixingPageGenerator(): Pages whose title begins with a given string. See an example below.
  • LogeventsPageGenerator(): Pages from log events.
  • CategorizedPageGenerator(): Pages from a given category (see the sketch below this list).
  • LinkedPageGenerator(): Pages that are linked from another page. See an example in chapter [[../Creating and reading lists|Creating and reading lists]].
  • TextIOPageGenerator(): Reads from file or URL. See an example in chapter [[../Creating and reading lists|Creating and reading lists]].
  • PagesFromTitlesGenerator(): Generates pages from their titles.
  • UserContributionsGenerator(): Generates pages that a given user worked on.
  • XMLDumpPageGenerator(): Reads from a downloaded dump on your device. In the dump pages are usually sorted by pageid (creation time). See in [[../Working with dumps|Working with dumps]] chapter.

... and much more...
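
For instance, a minimal sketch with CategorizedPageGenerator (the category name is only an example):

import pywikibot
from pywikibot.pagegenerators import CategorizedPageGenerator

site = pywikibot.Site()
cat = pywikibot.Category(site, 'Category:Physics')  # example category
# yield at most five pages belonging to the category
for page in CategorizedPageGenerator(cat, total=5):
    print(page.title())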

An odd one out

  • XmlDumpReplacePageGenerator() looks for pages in a dump that are subject to a text replacement. It is defined within replace.py and may be imported from there.

Filtering generators (_filters.py)

  • NamespaceFilterPageGenerator(): Only lets through pages from the given namespace(s).
  • PageTitleFilterPageGenerator(): Lets you define an ignore list; those pages won't be yielded.
  • RedirectFilterPageGenerator(): Yields either only redirect pages or only non-redirects.
  • SubpageFilterGenerator(): Generator which filters out subpages based on depth.
  • RegexFilter: This is not a function but a class. It makes it possible to filter titles with a regular expression.
  • CategoryFilterPageGenerator(): Lets through only pages which are in all of the given categories.

Other wrappers (__init__.py)

  • PageClassGenerator(): You have Page objects from another generator. This wrapper examines them, and whichever represents a user, a category or a file description page is turned into the appropriate subclass so that you may use more methods; the others remain untouched.
  • PageWithTalkPageGenerator(): Takes a page generator and yields the content pages and the corresponding talk pages one after the other (or just the talk pages). See the sketch after this list.
  • RepeatingGenerator(): This one is exciting: it makes it possible to follow the events on your wiki live, taking pages from recent changes or from some log.
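
A minimal sketch of such a wrapper in use (the prefix is only an example value): PageWithTalkPageGenerator() yields each page followed by its talk page.

import pywikibot
from pywikibot.pagegenerators import (PrefixingPageGenerator,
                                      PageWithTalkPageGenerator)

site = pywikibot.Site()
inner = PrefixingPageGenerator('Degenfeld', site=site)
# yields the page, then its talk page, then the next page, and so on
for page in PageWithTalkPageGenerator(inner):
    print(page.title())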

Examples

List Pywikibot user scripts with AllpagesPageGenerator

You want to collect users' home-made Pywikibot scripts from all over Wikipedia. Supposing that they are in the user namespace and have a title ending in .py, a possible solution is to create your own page generator using AllpagesPageGenerator. Rather slow than quick. :-) This will not search other projects.

import pywikibot
from pywikibot.pagegenerators import AllpagesPageGenerator

site = pywikibot.Site()

def gen():
    langs = site.languages()
    for lang in langs:
        for page in AllpagesPageGenerator(
                namespace=2,
                site=pywikibot.Site(lang, 'wikipedia')):
            if page.title().endswith('.py'):
                yield page

And test with:

for page in gen():
    print(page)

If you want a reduced version for testing, you may use

    langs = site.languages()[1:4]

This will limit the number of Wikipedias to three and exclude the biggest, enwiki. You may also use a total=n argument in AllpagesPageGenerator.

Sample result:

[[de:Benutzer:Andre Riemann/wmlinksubber.py]]
[[de:Benutzer:Chricho/draw-cantorset.py]]
[[de:Benutzer:Christoph Burgmer/topiclist.py]]
[[de:Benutzer:Cmuelle8/gpx2svg.py]]
etc.

They are not guaranteed to be Pywikibot scripts, as other Python programs are also published there. You may retrieve the pages and check them for an import pywikibot line.
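
A possible sketch for this check, continuing the gen() example above (reading page.text retrieves the content, so this is the slow part):

for page in gen():
    # page.text fetches the wikitext of the page
    if 'import pywikibot' in page.text:
        print(page)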

Titles beginning with a given word – PrefixingPageGenerator

While writing this section, we had a question on the village pump about the spelling of the name of József Degenfeld. A search showed that we have several existing articles about the Degenfeld family. To quickly compose a list of them, the technique from the [[../Creating and reading lists|Creating and reading lists]] chapter was copied:

from pywikibot.pagegenerators import PrefixingPageGenerator
print('\n'.join(['* ' + page.title(as_link=True) for page in PrefixingPageGenerator('Degenfeld')]))

* [[Degenfeld-Schonburg-kastély]]
* [[Degenfeld-kastély]]
* [[Degenfeld-kastély (Baktalórántháza)]]
* [[Degenfeld-kastély (Téglás)]]
* [[Degenfeld-kastély (egyértelműsítő lap)]]
* [[Degenfeld család]]

For such rapid tasks [[../Introduction#Scripting_vs_interactive_use|shell]] is very suitable.

Pages created by a user with a site iterator

You want to list the pages created by a given user, for example yourself. How do we know that an edit was the creation of a new page? The answer is the parentid value, which is the oldid of the previous edit. If it is zero, there is no previous version, which means it was either a creation or a recreation after a deletion. Where do we get a parentid from? Either from a contribution (see the [[../Working with users|Working with users]] chapter) or from a revision (see [[../Working with page histories|Working with page histories]]).

Of course, we begin with the high-level page generators, just because this is the title of the chapter. We have one that is promising: UserContributionsGenerator(). Its description is: Yield unique pages edited by user:username. This is good for a start, but we get only pages, so we would have to fetch the first revision of each and check whether its username equals the desired user. This will not be the case for the vast majority of pages, and it would be very slow.

So we look into the source and notice that this function calls User.contributions(), which is a method of a User object and has the description Yield tuples describing this user edits. This is promising again, but looking into it we see that the tuple does not contain parentid. We find an underlying method again, which is Site.usercontribs(). This looks good, and has a link to API:Usercontribs, which is the fourth step of our investigation. Finally, this tells us what we want to hear: yes, it has parentid.

Technically, Site.usercontribs() is not a page generator, but we will turn it into one. It takes the username as a string and iterates contributions, and we may create the pages from their titles. The simplest version, just to show the essence:

for contrib in site.usercontribs(username):
    if not contrib['parentid']:
        page = pywikibot.Page(site, contrib['title'])
        # Do something

Introduction was long, solution short. :-)

But it was not only short but also fast, because we did not create unnecessary objects and, what is more important, did not retrieve unnecessary data. We got dictionaries, read the parentid and the title from them, and created only the desired Page objects – but did not retrieve the pages, which is the slow part of the work. Handle the result with care, as some false positives may occur, e.g. after page history merges.

Based on this simple solution we create an advanced application that

  • gets only pages from selected namespaces (this is not post-generation filtering like Pywikibot's NamespaceFilterPageGenerator(); the MediaWiki API does the filtering on the fly)
  • separates pages by namespaces
  • separates disambiguation pages from articles by title and creates a fictive namespace for them
  • filters out redirect pages from articles and templates (in other namespaces and among disamb pages the ratio of these is very low, so we don't bother; this is a decision)
  • saves the list to separate subpages of the user by namespace
  • writes the creation date next to the titles.

Of these tasks, only recognizing the redirects makes it necessary to retrieve the pages, which is slow and loads the server and the bandwidth. While the algorithm would be simpler if we did the filtering within the loop, it is more efficient to do this filtering afterwards, only for the selected pages:

import pywikibot

site = pywikibot.Site()
username = 'Bináris'
summary = 'A Bináris által létrehozott lapok listázása'
# Namespaces that I am interested in
namespaces = [
    0,  # main
    4,  # Wikipedia
    8,  # MediaWiki
    10,  # template
    14,  # category
]
# Subpage titles (where to save)
titles = {
    0: 'Szócikkek',
    4: 'Wikipédia',
    8: 'MediaWiki',
    10: 'Sablonok',
    14: 'Kategóriák',
    5000: 'Egyértelműsítő lapok',  # Fictive ns for disambpages
}
# To store the results
created = dict(zip(namespaces + [5000], [[] for i in range(len(titles))]))

for contrib in site.usercontribs(username, namespaces=namespaces):
    if contrib['parentid']:
        continue
    ns = contrib['ns']
    if ns == 0 and contrib['title'].endswith('(egyértelműsítő lap)'):  # disamb pages
        ns = 5000
    title = (':' if ns == 14 else '') + contrib['title']
    created[ns].append((title, contrib['timestamp']))

# Remove redirects from articles and templates
for item in created[0][:]:
    if pywikibot.Page(site, item[0]).isRedirectPage():
        created[0].remove(item) 
for item in created[10][:]:
    if pywikibot.Page(site, item[0]).isRedirectPage():
        created[10].remove(item)

for ns in created.keys():
    if not created[ns]:
        continue
    print(ns)
    page = pywikibot.Page(site, 'user:Bináris/Létrehozott lapok/' + titles[ns])
    print(page)
    page.text = 'Bottal létrehozott lista. Utoljára frissítve: ~~~~~\n\n'
    page.text += '\n'.join(
            [f'# [[{item[0]}]] {item[1][:10].replace("-", ". ")}.' 
                for item in created[ns]]
        ) + '\n'
    print(page.text)
    page.save(summary)
Line 4
Username as const (not a User object). Of course, you may get it from command line or a web interface.
Line 23
A dictionary for the results. Keys are namespace numbers, values are empty lists.
Line 26
This is the core of the script. We get the contributions with 0 parentid, check whether they are disambiguation pages, prefix the category names with a colon, and store the titles together with the timestamps as tuples. We don't retrieve any page content up to this point.
Line 35
Removal of redirects. Now we have to retrieve selected pages. Note the slicing in the loop head; this is necessary when you loop over a list and meanwhile remove items from it. [:] creates a copy to loop over it, preventing a conflict.
Line 43
We save the lists to subpages, unless they are empty.

A sample result is at hu:Szerkesztő:Bináris/Létrehozott lapok.

Summary

High-level page generators are varied and flexible and are often useful when we do some common task, especially if we want the pwb wrapper to handle our command-line arguments. But for some specific tasks we have to go deeper. On the next level there are the generator methods of pagelike objects, such as Page, User, Category etc., while on the lowest level there are the page generators and other iterators of the Site object, which are directly based on the MediaWiki API. Going deeper is possible through the documentation and the code itself.

On the other hand, iterating pages through an API iterator, given the namespace as an argument, may be faster than using a high-level generator from the pagegenerators package and then filtering it with a wrapping NamespaceFilterPageGenerator(). At least we may suppose so (no benchmark has been made).
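
If you want to check this yourself, a rough timing sketch could look like the following (the username and the limits are placeholders, and the two runs do not yield exactly the same set, because in the wrapped version the limit applies before filtering):

import time

import pywikibot
from pywikibot.pagegenerators import (UserContributionsGenerator,
                                      NamespaceFilterPageGenerator)

site = pywikibot.Site()
username = 'Example'  # placeholder user name

start = time.perf_counter()
# namespace filtering done by the MediaWiki API itself
api_side = list(UserContributionsGenerator(
    username, namespaces=[10], site=site, total=200))
print('API-side filtering:', time.perf_counter() - start)

start = time.perf_counter()
# post-generation filtering with a wrapping generator
wrapped = list(NamespaceFilterPageGenerator(
    UserContributionsGenerator(username, site=site, total=200),
    [10], site=site))
print('Wrapper filtering:', time.perf_counter() - start)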

In some rare cases this is still not enough, when some features offered by the API are not implemented in Pywikibot. You may either implement them and contribute to the common code base, or make a copy of the relevant code and enhance it with the missing parameter according to the API documentation.