My aim is to get plain text (only the article text, without links, tags, parameters, and other clutter) from Wikipedia XML dumps (https://dumps.wikimedia.org/backup-index.html). I found the WikiExtractor Python script on GitHub (https://github.com/attardi/wikiextractor). After downloading and installing it (I use the PyCharm IDE on Windows 10), I tried to run it with
wikiextractor -cb 250K -o extracted D:\Wiki_dumps\ruwiktionary-20211120-pages-articles-multistream.xml.bz2
but then (after preprocessing) I got the following error:
raise ValueError('cannot find context for %r' % method) from None
ValueError: cannot find context for 'fork'
I tried changing the argument in the following call from "fork" to "spawn" (advice from the internet)
Process = get_context("fork").Process
but this only leads to
TypeError: cannot pickle '_io.BufferedWriter' object
I have no idea how to fix it or what it might be related to.
Here is the full stack trace:
INFO: Preprocessing 'D:\Wiki_dumps\ruwiktionary-20211120-pages-articles-multistream.xml.bz2' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages
...
INFO: Preprocessed 2300000 pages
INFO: Loaded 36839 templates in 209.9s
INFO: Starting page extraction from D:\Wiki_dumps\ruwiktionary-20211120-pages-articles-multistream.xml.bz2.
Traceback (most recent call last):
  File "C:\Users\Shurup\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Shurup\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Shurup\PycharmProjects\pythonProject\venv\Scripts\wikiextractor.exe_main_.py", line 7, in
  File "c:\users\shurup\pycharmprojects\pythonproject\venv\lib\site-packages\wikiextractor\WikiExtractor.py", line 640, in main
    process_dump(input_file, args.templates, output_path, file_size,
  File "c:\users\shurup\pycharmprojects\pythonproject\venv\lib\site-packages\wikiextractor\WikiExtractor.py", line 359, in process_dump
    Process = get_context("fork").Process
  File "C:\Users\Shurup\AppData\Local\Programs\Python\Python310\lib\multiprocessing\context.py", line 239, in get_context
    return super().get_context(method)
  File "C:\Users\Shurup\AppData\Local\Programs\Python\Python310\lib\multiprocessing\context.py", line 193, in get_context
    raise ValueError('cannot find context for %r' % method) from None
ValueError: cannot find context for 'fork'
Here is the stack trace with "spawn" instead of "fork":
You can run it in Docker. It works like a charm.
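Docker sidesteps the error presumably because the container runs Linux, where Python's multiprocessing offers the "fork" start method that WikiExtractor requests; on Windows only "spawn" is available. A minimal sketch for checking this on any machine (pick_start_method is a hypothetical helper name used for illustration, not part of WikiExtractor):

```python
import multiprocessing as mp


def pick_start_method(preferred="fork"):
    """Return `preferred` if this platform supports it, else the platform default.

    Hypothetical helper for illustration only; not WikiExtractor code.
    """
    available = mp.get_all_start_methods()
    if preferred in available:
        return preferred
    # The first entry of get_all_start_methods() is the platform default.
    return available[0]


if __name__ == "__main__":
    # On Windows the list contains only "spawn", so "fork" cannot be selected.
    print("supported start methods:", mp.get_all_start_methods())
    print("method that would be used:", pick_start_method("fork"))
```

If "fork" is missing from the first list, get_context("fork") raises the ValueError shown in the question.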
dockerfile:
Build:
docker build --pull --rm -f "Dockerfile" -t wikiextractor:latest .
Run:
docker run --rm -it --mount type=bind,source="$(PWD)\output",target=/app/output wikiextractor:latest
Make sure you have an output folder in your current working directory.
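As for the TypeError seen after switching to "spawn": with that start method, the arguments handed to a new process have to be pickled, and an open file object cannot be. Presumably one of the objects WikiExtractor passes to its worker processes holds an open output file. The sketch below reproduces only the pickling failure, under that assumption; can_pickle and demo are hypothetical names, and none of this is WikiExtractor code.

```python
import os
import pickle
import tempfile


def can_pickle(obj):
    """Return True if `obj` survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False


def demo():
    """Compare pickling an open binary file handle with pickling its path."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb") as handle:  # an _io.BufferedWriter instance
            handle_ok = can_pickle(handle)
        path_ok = can_pickle(path)  # a plain string pickles fine
    finally:
        os.remove(path)
    return handle_ok, path_ok


if __name__ == "__main__":
    print(demo())
```

A common way around this kind of error is to pass the file path to the child process and open the file there, which would mean changing more in WikiExtractor than the start method name.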