WikiMidas is a Python API-wrapper and spidering program designed for crawling data from MediaWiki APIs. It can retrieve information from sites built on MediaWiki, such as Wikipedia and MoeGirl. MoeGirl focuses on articles about fictional characters in anime. MoeGirlMidas, built on top of WikiMidas, is designed specifically to retrieve character data from MoeGirl.
I named the two programs after Midas, because I like the feeling that you touch something and it becomes useful.
I have little experience writing programs that retrieve data from the internet by myself. Apart from some practice exercises in a MOOC course (Using Python to Access Web Data), this is the first time I have formally written a data crawler on my own, and I ran into plenty of problems getting it to work.
To write a data crawler, there are several things one needs to be at least familiar with:
website API usage
some third-party helper libraries for data crawling
However, I had little prior knowledge of HTML beyond the basic concept that it wraps information in tags, and little experience communicating with website APIs. I also learned that I would need regular expressions, an important practical topic in which I had little knowledge and not much practice, to extract useful information. The whole process was a bit painful, but I have at least gained some practical experience in data crawling.
This library lets the user apply regular expressions to perform matching operations and extract useful information from text. The usage of the library itself is easy; the key is to understand regular expressions and write the correct pattern to match what you are looking for.
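As a minimal sketch of this idea, the snippet below uses Python's standard `re` module to pull a file name out of a wiki-style image link; the sample wikitext is made up for illustration:

```python
import re

# Sample wikitext containing an image link (illustrative only)
wikitext = "[[File:Character_portrait.png|thumb|A character portrait]]"

# Match the file name inside a [[File:...]] link:
# capture everything after "File:" up to the first "|" or "]"
match = re.search(r"\[\[File:([^|\]]+)", wikitext)
if match:
    print(match.group(1))  # Character_portrait.png
```

The hard part in practice is exactly what the text says: choosing a pattern that matches the markup you want without accidentally swallowing the surrounding text.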
This is a library used to communicate with website APIs from a Python program.
Choices during Development
How to retrieve webpage content? API or Webpage Html?
MediaWiki provides an API to retrieve webpage content using its internal parser tools, such as the MediaWiki TextExtracts extension. This raises a question: should we use the API's internal tools, or parse the content ourselves?
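For reference, a TextExtracts request looks like the sketch below. The parameter names (`prop=extracts`, `exintro`, `explaintext`) are standard TextExtracts usage, and the example targets Wikipedia's endpoint; whether the extension is installed depends on the target wiki. The snippet only builds the URL so it runs without network access:

```python
from urllib.parse import urlencode

# Build a TextExtracts query against Wikipedia's MediaWiki endpoint.
ENDPOINT = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "prop": "extracts",      # provided by the TextExtracts extension
    "exintro": 1,            # only the lead section
    "explaintext": 1,        # plain text instead of HTML
    "titles": "Python (programming language)",
    "format": "json",
}
url = ENDPOINT + "?" + urlencode(params)
print(url)
# Fetching this URL (e.g. with urllib.request) returns JSON whose
# query/pages entries carry the plain-text extract.
```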
My answer is the latter: parsing the content ourselves. On websites built on MediaWiki, the optional parser tools might not be installed. Even when they are, my preliminary tests found them not flexible enough for my more sophisticated purposes, such as getting image URLs or extracting data from tables.
Parsing by ourselves, although labor-intensive, is flexible enough to extract almost any data of interest using a combination of tools.
MoeGirlMidas is wrapped in a class called “MoeGirlAPI”. To use MoeGirlMidas in your Python program:
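As an illustrative sketch only: the document names the class `MoeGirlAPI`, but its module path, constructor, and methods are not shown, so everything below (including the MoeGirl endpoint URL) is an assumption about what such a wrapper might look like:

```python
from dataclasses import dataclass
from urllib.parse import urlencode

@dataclass
class MoeGirlAPI:
    # The MoeGirl MediaWiki endpoint; the exact URL is an assumption.
    endpoint: str = "https://zh.moegirl.org.cn/api.php"

    def build_url(self, **params):
        """Compose a request URL from MediaWiki API parameters."""
        params.setdefault("format", "json")
        return self.endpoint + "?" + urlencode(params)

api = MoeGirlAPI()
print(api.build_url(action="query", titles="初音未来"))
```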
The Python script can also be run directly from the shell (the shell may have trouble with Chinese characters):
To search for an article:
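The original snippet is not shown here, so as a hedged sketch, this is the underlying MediaWiki full-text search request (`list=search`) that such a search call would issue; the MoeGirl endpoint URL is an assumption:

```python
from urllib.parse import urlencode

# Standard MediaWiki search request; endpoint URL is an assumption.
ENDPOINT = "https://zh.moegirl.org.cn/api.php"
params = {
    "action": "query",
    "list": "search",        # MediaWiki full-text search module
    "srsearch": "初音未来",   # search keyword
    "srlimit": 5,            # number of results to return
    "format": "json",
}
search_url = ENDPOINT + "?" + urlencode(params)
print(search_url)
```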
To retrieve an article:
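Again as a hedged sketch rather than the program's actual code: retrieving an article's raw wikitext, which WikiMidas-style code would then parse itself, uses the standard MediaWiki revisions query below; the endpoint URL is an assumption:

```python
from urllib.parse import urlencode

# Fetch the latest revision's raw content; endpoint URL is an assumption.
ENDPOINT = "https://zh.moegirl.org.cn/api.php"
params = {
    "action": "query",
    "prop": "revisions",      # latest revision of the page
    "rvprop": "content",      # include the page content
    "rvslots": "main",        # main content slot
    "titles": "初音未来",
    "format": "json",
}
retrieve_url = ENDPOINT + "?" + urlencode(params)
print(retrieve_url)
```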
The sample code above can also be run in a Jupyter Notebook. See the run results here.