Wikipedia extractor problem ValueError: cannot find context for 'fork'


My aim is to get plain text (only the article text, without links, tags, parameters and other clutter) from Wikipedia XML dumps (https://dumps.wikimedia.org/backup-index.html). I found the WikiExtractor Python script on GitHub (https://github.com/attardi/wikiextractor). After downloading and installing it (I use the PyCharm IDE on Windows 10), I tried to run it with

wikiextractor -cb 250K -o extracted D:\Wiki_dumps\ruwiktionary-20211120-pages-articles-multistream.xml.bz2

but then (after preprocessing) I got the following error

raise ValueError('cannot find context for %r' % method) from None
ValueError: cannot find context for 'fork'

I tried changing the argument in the following call from "fork" to "spawn" (advice from the internet)

Process = get_context("fork").Process

but this only leads to

TypeError: cannot pickle '_io.BufferedWriter' object

I have no idea how to fix it or what it might be related to.

Here is the full stack trace:

INFO: Preprocessing 'D:\Wiki_dumps\ruwiktionary-20211120-pages-articles-multistream.xml.bz2' to collect template definitions: this may take some time.

INFO: Preprocessed 100000 pages

...

INFO: Preprocessed 2300000 pages

INFO: Loaded 36839 templates in 209.9s

INFO: Starting page extraction from D:\Wiki_dumps\ruwiktionary-20211120-pages-articles-multistream.xml.bz2.

Traceback (most recent call last):
  File "C:\Users\Shurup\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Shurup\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Shurup\PycharmProjects\pythonProject\venv\Scripts\wikiextractor.exe\__main__.py", line 7, in <module>
  File "c:\users\shurup\pycharmprojects\pythonproject\venv\lib\site-packages\wikiextractor\WikiExtractor.py", line 640, in main
    process_dump(input_file, args.templates, output_path, file_size,
  File "c:\users\shurup\pycharmprojects\pythonproject\venv\lib\site-packages\wikiextractor\WikiExtractor.py", line 359, in process_dump
    Process = get_context("fork").Process
  File "C:\Users\Shurup\AppData\Local\Programs\Python\Python310\lib\multiprocessing\context.py", line 239, in get_context
    return super().get_context(method)
  File "C:\Users\Shurup\AppData\Local\Programs\Python\Python310\lib\multiprocessing\context.py", line 193, in get_context
    raise ValueError('cannot find context for %r' % method) from None
ValueError: cannot find context for 'fork'

Here is the stack trace with "spawn" instead of "fork":

"spawn" parameter stack trace


There are 2 best solutions below


You can run it in Docker. It works like a charm.

Dockerfile:

FROM python:slim

WORKDIR /app
RUN pip install wikiextractor

COPY Wikipedia-20211212095544.xml /app/

CMD python -m wikiextractor.WikiExtractor --output /app/output /app/Wikipedia-20211212095544.xml

Build: docker build --pull --rm -f "Dockerfile" -t wikiextractor:latest .
Run: docker run --rm -it --mount type=bind,source="$(PWD)\output",target=/app/output wikiextractor:latest

Make sure you have an output folder in your current working directory.
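Once the container has filled the output folder, the plain text still sits inside wrapper tags. The sketch below shows one way to pull out just the article bodies. It assumes the usual WikiExtractor layout (files named wiki_NN in subfolders, each article wrapped in <doc ...> ... </doc>, optionally bz2-compressed); the helper names read_text and iter_articles are made up for this example and are not part of WikiExtractor.

```python
import bz2
import re
from pathlib import Path

# Assumed output format: each article is wrapped as <doc id=... title=...> ... </doc>.
DOC_OPEN = re.compile(r"<doc\b[^>]*>")
DOC_CLOSE = "</doc>"


def read_text(path):
    """Read one output file, transparently handling .bz2 compression."""
    opener = bz2.open if path.suffix == ".bz2" else open
    with opener(path, "rt", encoding="utf-8") as fh:
        return fh.read()


def iter_articles(output_dir):
    """Yield the text of every article found under output_dir (hypothetical helper)."""
    for path in sorted(Path(output_dir).rglob("wiki_*")):
        if not path.is_file():
            continue
        for chunk in read_text(path).split(DOC_CLOSE):
            body = DOC_OPEN.sub("", chunk).strip()
            if body:
                yield body


if __name__ == "__main__":
    for number, article in enumerate(iter_articles("output"), start=1):
        print(number, article[:80].replace("\n", " "))
```

Each yielded string starts with the article title followed by its text. Reading whole files into memory is fine for small chunk sizes; for very large files a streaming parser would be a better fit.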


I ran pip install wikiextractor locally, then pip install wikiextractor==0.1, and after that the extraction ran normally.
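Some background on the two errors in the question, as a small sketch using only the standard library. On Windows, Python's multiprocessing module offers only the "spawn" start method, which is why get_context("fork") raises ValueError. With "spawn", everything passed to a child process has to be pickled, and an open file object cannot be, which would explain the TypeError about '_io.BufferedWriter' (presumably an open output file handed to the worker). The helper names pick_start_method and is_picklable are invented for this illustration.

```python
import multiprocessing as mp
import os
import pickle


def pick_start_method(preferred="fork"):
    """Return `preferred` if this platform supports it, otherwise fall back to 'spawn'."""
    available = mp.get_all_start_methods()  # 'fork' is missing from this list on Windows
    return preferred if preferred in available else "spawn"


def is_picklable(obj):
    """Report whether `obj` could be pickled, as 'spawn' requires for Process arguments."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False


if __name__ == "__main__":
    print("available start methods:", mp.get_all_start_methods())
    print("chosen start method:", pick_start_method("fork"))
    # An open binary file is a BufferedWriter, like the object named in the TypeError.
    with open(os.devnull, "wb") as handle:
        print("open file picklable?", is_picklable(handle))
```

This only diagnoses the problem. Making the extractor itself work under "spawn" would mean changing what it passes to its worker processes, which is why running it on Linux (for example in Docker or WSL) is the simpler route.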