Manual:Pywikibot/Cookbook/Getting a single page

From Linux Web Expert

Creating a Page object from title

In the rest of this cookbook, unless otherwise stated, we always assume that you have already run these two basic statements:

import pywikibot
site = pywikibot.Site()

You want to get the article about Budapest in your wiki. As long as it is in the article namespace, it is as simple as

page = pywikibot.Page(site, 'Budapest')

Note that Python is case-sensitive: in its world Site and Page are classes,[1] Site() and Page() create class instances, while lowercase site and page are variables.

For such simple experiments the interactive Python shell is convenient, as you can easily see the results without using print() or saving and running your code.

>>> page = pywikibot.Page(site, 'Budapest')
>>> page
Page('Budapest')
>>> type(page)
<class 'pywikibot.page._page.Page'>

Getting the type of an object is often useful when you want to discover the capabilities of Pywikibot. The output may look strange, but the main thing is that you got a Page. pywikibot.page._page.Page shows the path to it: you may find the Page class in Pywikibot/pywikibot/page/_page.py.

Now let's see the user page of your bot. Either you prefix the title with the namespace ('User' and other English names work everywhere, while the localized names work only in your own wiki) or you give the namespace number as the third argument. So

>>> title = site.username()
>>> page = pywikibot.Page(site, 'User:' + title)
>>> page
Page('Szerkesztő:BinBot')

and

>>> title = site.username()
>>> page = pywikibot.Page(site, title, 2)
>>> page
Page('Szerkesztő:BinBot')

will give the same result. 'Szerkesztő' is the localized version of 'User' in Hungarian; even though I used the English namespace name in my command, Pywikibot always shows the localized name in the result.

Getting the title of the page

On the other hand, if you already have a Page object and you need its title as a string, the title() method will do the job:

>>> page = pywikibot.Page(site, 'Budapest')
>>> page.title()
'Budapest'

Possible errors

While getting pages causes far fewer errors than saving them, a few types are worth mentioning. Some of them are technical, while others are possible contradictions between our expectations and reality. Let's go through them before actually getting the page.

  1. The page does not exist.
  2. The page is a redirect.
  3. You may have been misled regarding the content in some namespaces. If your page is in the Category namespace, the content is the descriptor page. If it is in the User namespace, the content is the user page. The trickiest is the File namespace: the content is the file descriptor page, not the file itself; however, if the file comes from Commons, the page may not exist in your wiki at all, while you still see the picture.
  4. The expected piece of text is not in the page content because it is transcluded from a template. You see the text on the page, but cannot replace it directly by bot.
  5. Sometimes badly formed code may work well. For example [[Category:Foo  bar]] with two spaces will behave as [[Category:Foo bar]]. While the page is in the category and you will get it from a page generator (see below), you won't find the desired string in it.
  6. And, unfortunately, Wikipedia servers sometimes face errors. If you get a 500 error, go and read a book until the server comes back.
  7. InvalidTitleError is raised in very rare cases. A possible reason is that you wanted to get a page title that contains illegal characters.
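
For the fifth case in the list above, one possible workaround is to search the wikitext with a regular expression that tolerates runs of spaces or underscores, instead of looking for an exact string. The following is a minimal sketch in plain Python. The helper name category_link_pattern is hypothetical and not part of Pywikibot, and the sketch assumes the English namespace name, so localized names and letter-case differences would need extra handling.

```python
import re


def category_link_pattern(category_name, namespace='Category'):
    """Build a regex for a category link, tolerant of extra spaces/underscores.

    Hypothetical helper for illustration only; not a Pywikibot function.
    """
    # Split the name into words, escape each one, and allow any run of
    # spaces or underscores between them.
    words = [re.escape(w) for w in re.split(r'[ _]+', category_name.strip()) if w]
    name = r'[ _]+'.join(words)
    # Also accept an optional sort key such as [[Category:Foo bar|key]].
    return re.compile(
        r'\[\[\s*' + re.escape(namespace) + r'\s*:\s*' + name
        + r'\s*(?:\|[^\]]*)?\]\]'
    )


wikitext = 'Some text.\n[[Category:Foo  bar]]\n'
pattern = category_link_pattern('Foo bar')

print('[[Category:Foo bar]]' in wikitext)  # False: the exact string is absent
print(bool(pattern.search(wikitext)))      # True: the tolerant pattern matches
```

In a bot, wikitext would typically be page.text, and pattern.sub() could replace the link regardless of how it was spaced.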

Getting the content of the page

Important: at this point we don't know whether the page exists. We have not contacted the live wiki yet; we have just created an object. It is like a street number: you may write it on a document whether or not there is a house at that address.

There are two main approaches to getting the content. It is important to understand the difference.

Page.text

You may notice that text does not have parentheses. Looking into the code we discover that it is not a method but a property. This means that text can be used without calling it, can be assigned a value, and is what gets written when you save the page.

>>> page = pywikibot.Page(site, 'Budapest')
>>> page.text

will write the whole text on your screen. Of course, this is only for experimenting.

You may write

text = page.text

if you need a copy of the text, but usually this is unnecessary. Page.text is not a method, so referring to it several times does not slow down your bot. Just manipulate page.text or assign it a new value, then save.

If you want to know the details of how a property works, search for "Python decorators". For using it in your scripts it is enough to know the behaviour. The documentation of the Page class lists some other properties that are likewise used without parentheses.

Page.text will never raise an error. If the page is a redirect, you will get the redirect link instead of the content of the target page. If the page does not exist, you will get an empty string, which is just what you get if the page exists but is empty (this is common for talk pages). Try this:

>>> page = pywikibot.Page(site, 'Arghhhxqrwl!!!')
>>> if page.text:
...     print('Got it!')
... else:
...     print(f'Page {page.title()} does not exist or has no content.')
...
Page Arghhhxqrwl!!! does not exist or has no content.

Page.text is convenient if you don't have to care whether the page exists; otherwise it is your responsibility to tell the two cases apart. An easy way is Page.exists().

>>> page = pywikibot.Page(site, 'Arghhhxqrwl!!!')
>>> if page.text:
...     print(len(page.text))
... else:
...     print(page.exists())
...
False

While creating a Page object does not contact the live wiki, referring to text for the first time and calling Page.exists() usually do. For many pages this will take a while. If it is too slow for you, go to the [[../Working with dumps|Working with dumps]] section. page.has_content() shows whether retrieval is necessary; if it returns True, the bot will not retrieve the page again. Therefore it also returns True for non-existing pages, as it is senseless to reload them. Although this is a public method, you are unlikely to have to use it directly.
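
The check shown above can be wrapped into a small helper that tells apart the three situations which Page.text alone cannot distinguish. This is only a sketch: page_state is a hypothetical name, not a Pywikibot function, and it relies solely on the text property and the exists() method described in this section.

```python
def page_state(page):
    """Classify a Page-like object as 'has content', 'empty' or 'missing'.

    Hypothetical helper for illustration; `page` is assumed to offer a
    `text` attribute and an `exists()` method, as pywikibot.Page does.
    """
    if page.text:
        # Non-empty text can only come from an existing page
        # (or from a value you assigned to page.text yourself).
        return 'has content'
    # An empty string is ambiguous, so ask explicitly.
    return 'empty' if page.exists() else 'missing'


# Possible usage with the objects from this cookbook:
#     page = pywikibot.Page(site, 'Arghhhxqrwl!!!')
#     print(page_state(page))
```

Checking text first mirrors the examples above; exists() is consulted only when the text is empty.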

Page.get()

The traditional way is page.get() which forces you to handle the errors. In this case we store the value in a variable.

>>> page = pywikibot.Page(site, 'Budapest')
>>> text = page.get()
>>> len(text)
165375

A non-existing page causes a NoPageError:

>>> page = pywikibot.Page(site, 'Arghhhxqrwl!!!')
>>> text = page.get()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Pywikibot\pywikibot\page\_page.py", line 397, in get
    self._getInternals()
  File "c:\Pywikibot\pywikibot\page\_page.py", line 436, in _getInternals
    self.site.loadrevisions(self, content=True)
  File "c:\Pywikibot\pywikibot\site\_generators.py", line 772, in loadrevisions
    raise NoPageError(page)
pywikibot.exceptions.NoPageError: Page [[hu:Arghhhxqrwl!!!]] doesn't exist.

A redirect page causes an IsRedirectPageError:

>>> page = pywikibot.Page(site, 'Time to Shine')
>>> text = page.get()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Pywikibot\pywikibot\page\_page.py", line 397, in get
    self._getInternals()
  File "c:\Pywikibot\pywikibot\page\_page.py", line 444, in _getInternals
    raise self._getexception
pywikibot.exceptions.IsRedirectPageError: Page [[hu:Time to Shine]] is a redirect page.

If you don't want to handle redirects and only want to distinguish between existing and non-existing pages, get_redirect will make its behaviour more similar to that of text:

>>> page = pywikibot.Page(site, 'Time to Shine')
>>> page.get(get_redirect=True)
'#ÁTIRÁNYÍTÁS [[Time to Shine (egyértelműsítő lap)]]'

Here is a piece of code to handle these cases. It is already too long for the prompt, so I saved it to a file.

for title in ['Budapest', 'Arghhhxqrwl!!!', 'Time to Shine']:
    page = pywikibot.Page(site, title)
    try:
        text = page.get()
        print(f'Length of {page.title()} is {len(text)} bytes.')
    except pywikibot.exceptions.NoPageError:
        print(f'{page.title()} does not exist.')
    except pywikibot.exceptions.IsRedirectPageError:
        print(f'{page.title()} redirects to {page.getRedirectTarget()}.')
        print(type(page.getRedirectTarget()))

Which results in:

Length of Budapest is 165375 bytes.
Arghhhxqrwl!!! does not exist.
Time to Shine redirects to [[hu:Time to Shine (egyértelműsítő lap)]].
<class 'pywikibot.page._page.Page'>

While Page.text is simple, it gives only the text of the redirect page. With getRedirectTarget() we got another Page instance without parsing the text. Of course, the target page may also not exist or be another redirect. Scripts/redirect.py gives a deeper insight.
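
Because the target of a redirect may itself be missing or be another redirect, a bot may want to follow the chain up to some limit. The sketch below is one hedged way to do this using only exists(), isRedirectPage(), title() and getRedirectTarget(). The names resolve_redirects and max_hops are hypothetical, and the real library may raise its own exceptions for unusual redirects, which this sketch does not handle.

```python
def resolve_redirects(page, max_hops=5):
    """Follow redirects and return the final non-redirect page, or None.

    Hypothetical helper for illustration. None is returned when the chain
    ends at a missing page, loops back on itself, or is longer than
    `max_hops` redirects.
    """
    seen = set()  # titles of redirect pages already visited
    hops = 0
    while page.exists():
        if not page.isRedirectPage():
            return page  # an existing, ordinary page: done
        title = page.title()
        if title in seen or hops >= max_hops:
            return None  # redirect loop or chain too long
        seen.add(title)
        page = page.getRedirectTarget()
        hops += 1
    return None  # the chain ended at a page that does not exist


# Possible usage with the objects from this cookbook:
#     target = resolve_redirects(pywikibot.Page(site, 'Time to Shine'))
#     print(target.title() if target else 'No usable target.')
```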

For a practical application see [[../Working with namespaces#Content pages and talk pages|Content pages and talk pages]].

Reloading

If your bot runs slowly and you doubt that the page text is still current, use get(force=True). The experiment below shows that it does not update page.text. This is good on one hand, as you don't lose your data, but on the other hand it is something you need to be aware of:

>>> import pywikibot as p
>>> site = p.Site()
>>> page = p.Page(site, 'Kisbolygók listája (1–1000)')
>>> page.text
'[[#1|1–500.]] • [[#501|501–1000.]]\n\n{{:Kisbolygók listája (1–500)}}\n{{:Kisbolygók listája (501–1000)}}\n\n[[Kategória:A Naprendszer kisbolygóinak listája]]'
>>> page.text = 'Luke, I am your father!'
>>> page.text
'Luke, I am your father!'
>>> page.get(force=True)
'[[#1|1–500.]] • [[#501|501–1000.]]\n\n{{:Kisbolygók listája (1–500)}}\n{{:Kisbolygók listája (501–1000)}}\n\n[[Kategória:A Naprendszer kisbolygóinak listája]]'
>>> page.text
'Luke, I am your father!'
>>>
>>> page.text = page.get()
>>> page.text
'[[#1|1–500.]] • [[#501|501–1000.]]\n\n{{:Kisbolygók listája (1–500)}}\n{{:Kisbolygók listája (501–1000)}}\n\n[[Kategória:A Naprendszer kisbolygóinak listája]]'

Page.exists() currently does not reflect a forced reload, see phab:T330980.
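
If you do want a forced reload to replace your local copy, the last step of the transcript above can be wrapped as follows. This is a minimal sketch that assumes the page exists and is not a redirect (otherwise get() raises the errors described earlier); reload_text is a hypothetical helper name, not part of Pywikibot.

```python
def reload_text(page):
    """Fetch the current wikitext from the wiki and overwrite page.text.

    Hypothetical helper for illustration. Any unsaved local change to
    page.text is discarded, so call it only when that is what you want.
    """
    page.text = page.get(force=True)
    return page.text


# Possible usage with the objects from this cookbook:
#     page = pywikibot.Page(site, 'Budapest')
#     page.text = 'Luke, I am your father!'
#     reload_text(page)   # page.text holds the wiki's version again
```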

Notes

  1. This is not quite true; as we saw earlier, Site is a factory that creates objects. The difference is hidden on purpose because it acts like a class, and Site() will really be an instance.