Native command throws error only when I redirect to a variable


I am using the following to get raw html code for a web page and then pass it to html2text to then get only the text content, omitting the html tags:

wget -O - https://www.voidtools.com/forum/viewtopic.php?p=36618#p36618 |html2text --ignore-links

It works consistently. A sample of the expected output:

...
vscroll_wide
scrollbar_back_color
scrollbar_back_dark_color
scrollbar_button_color
scrollbar_button_dark_color
scrollbar_button_icon_color
scrollbar_button_icon_dark_color
scrollbar_button_hot_color
scrollbar_button_hot_dark_color
scrollbar_button_hot_icon_color
scrollbar_button_hot_icon_dark_color
scrollbar_button_down_color
scrollbar_button_down_dark_color
scrollbar_button_down_icon_color
scrollbar_button_down_icon_dark_color
...

However, as soon as I attempt to save the output to a variable, or even pipe it to Select-Object for further processing, I get an error:

$text = wget -O - https://www.voidtools.com/forum/viewtopic.php?p=36618#p36618 |html2text
#wget -O - https://www.voidtools.com/forum/viewtopic.php?p=36618#p36618 |html2text --ignore-links |Select-Object -f 5

The error I am getting is:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\ralf\AppData\Local\Programs\Python\Python311\Scripts\html2text.exe\__main__.py", line 7, in <module>
  File "C:\Users\ralf\AppData\Local\Programs\Python\Python311\Lib\site-packages\html2text\cli.py", line 330, in main
    sys.stdout.write(h.handle(html))
  File "C:\Users\ralf\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\u21b3' in position 202140: character maps to <undefined>

I searched for this error, but all the solutions I came across deal with working around it inside a Python script. I need a way to get around the issue when piping or capturing stdout.

Why is PowerShell failing when redirecting the output?

PowerShell 7.4 on Windows 11

mklement0 On BEST ANSWER

Why is PowerShell failing when redirecting the output?

The reason is that python (which underlies html2text) - like many other Windows CLIs (console applications) - modifies its output behavior based on whether the output target is a console (terminal) or is redirected:

  • In the former case, such CLIs use the Unicode version of the WinAPI WriteConsole function, meaning that any character from the global Unicode alphabet is accepted.

    • This means that character-encoding problems do not surface in this case, and the output usually prints properly to the console (terminal) - that said, exotic Unicode characters may not print properly, necessitating switching to a different font.
  • In the latter case, CLIs must encode their output, and are expected to respect the legacy Windows OEM code page associated with the current console window, as reflected in the output from chcp.com and - by default - in [Console]::OutputEncoding inside a PowerShell session:

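    • E.g., the OEM code page is 437 on US-English systems. That code page (for non-CJK locales) is a single-byte encoding limited to 256 characters in total, so if the text to output contains characters that cannot be represented in it, the output cannot be encoded faithfully: depending on the program, such characters are replaced (typically with ?) or the program fails.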

      • Notably, Python exhibits nonstandard behavior by default, by encoding redirected output based on the ANSI code page (e.g., 1252 on US-English systems) rather than the OEM code page (both of which are determined by the system's active legacy system locale, aka language for non-Unicode programs). However, like the OEM code page (in non-CJK locales), ANSI code pages too are limited to 256 characters, and trying to encode a character outside that set results in the error you saw.
    • To avoid this limitation, modern CLIs increasingly encode their output using UTF-8 instead, either by default (e.g., Node.js), or on an opt-in basis (e.g., Python).
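
The code-page limitation can be reproduced in isolation, without PowerShell or html2text. The following is a minimal Python sketch (Python because it underlies html2text; the helper name can_encode is made up for illustration) that tries to encode the character from the error message, U+21B3, with the two legacy code pages mentioned above and with UTF-8:

```python
# U+21B3 is the character named in the UnicodeEncodeError from the question.
ARROW = "\u21b3"


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# cp437 = OEM code page, cp1252 = ANSI code page on US-English systems.
results = {enc: can_encode(ARROW, enc) for enc in ("cp437", "cp1252", "utf-8")}
print(results)  # {'cp437': False, 'cp1252': False, 'utf-8': True}
```

Only UTF-8 can represent the character, which is why the failure goes away once Python is told to emit UTF-8.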


In the context of PowerShell, an external program's (stdout) output is considered redirected (not targeting the console/terminal) in one of the following cases:

  • capturing external-program output in a variable ($text = wget ..., as in your case), or using it as part of an expression (e.g., "foo" + (wget ...))

  • relaying external-program output via the pipeline (e.g., wget ... | ...)

  • in Windows PowerShell and PowerShell (Core) 7 up to v7.3.x: also with >, the redirection operator; in v7.4+, using > directly on an external-program call now passes the raw bytes through to the target file.

That is, in all those cases the external program's output must be decoded into .NET strings, which PowerShell does based on the encoding stored in [Console]::OutputEncoding.
In the case at hand, this stage wasn't even reached, because Python itself wasn't able to encode its output.
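
To illustrate why the decoding stage matters once the program does emit UTF-8, here is a small Python sketch of what a receiver sees when it decodes UTF-8 bytes with the wrong encoding. It only simulates the mismatch; it assumes code page 437 as the receiver's encoding (the US-English OEM default) and does not call PowerShell itself:

```python
ARROW = "\u21b3"

# Bytes a UTF-8-emitting program would write for this one character.
raw = ARROW.encode("utf-8")

# Receiver wrongly assumes the OEM code page: each byte becomes its own character.
misread = raw.decode("cp437")

# Receiver uses the matching encoding: the original character is recovered.
correct = raw.decode("utf-8")

print(len(raw), len(misread), correct == ARROW)
```

The three UTF-8 bytes turn into three unrelated characters under code page 437, so the sender's and the receiver's encodings have to agree.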


The solution in your case is therefore two-pronged, as suggested by zett42:

  • Make sure that html2text outputs UTF-8-encoded text.

    • html2text is a Python-based script/executable, so (temporarily) set $env:PYTHONUTF8=1 before invoking it.
  • Make sure that PowerShell interprets the output as UTF-8:

    • To that end, (temporarily) set [Console]::OutputEncoding to [System.Text.UTF8Encoding]::new()

To put it all together:

$text = 
  & {
    $prevEnv = $env:PYTHONUTF8
    $env:PYTHONUTF8 = 1
    $prevEnc = [Console]::OutputEncoding
    [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()

    try {
      wget.exe -O - https://www.voidtools.com/forum/viewtopic.php?p=36618#p36618 |
        html2text
    } finally {
      $env:PYTHONUTF8 = $prevEnv
      [Console]::OutputEncoding = $prevEnc
    }
  }
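
The first prong can be checked on its own. The sketch below is an illustration rather than part of the solution: it assumes Python 3.7 or later (where PYTHONUTF8 is supported), and the names CHILD_CODE and run_child_with_utf8_mode are made up. It launches a child Python process with PYTHONUTF8=1 and captures its stdout through a pipe, i.e. in the redirected scenario:

```python
import os
import subprocess
import sys

# Child program: writes U+21B3 to stdout. The raw string keeps the command
# line ASCII-only; the child itself interprets the \u escape.
CHILD_CODE = r"import sys; sys.stdout.write('\u21b3')"


def run_child_with_utf8_mode() -> bytes:
    """Run the child with PYTHONUTF8=1 and return the raw bytes it wrote to stdout."""
    env = dict(os.environ, PYTHONUTF8="1")
    # PYTHONIOENCODING would override the stdout encoding, so drop it if present.
    env.pop("PYTHONIOENCODING", None)
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        capture_output=True,  # stdout is a pipe, so the output counts as redirected
        check=True,
    )
    return completed.stdout


captured = run_child_with_utf8_mode()
print(captured)  # b'\xe2\x86\xb3'
```

The captured bytes are the UTF-8 encoding of the character, so the child no longer fails on encoding; decoding those bytes correctly is then the job of the second prong.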

Note:

  • When you pipe data from PowerShell TO an external program (not the case here), PowerShell uses the $OutputEncoding preference variable to encode it, in which case you may have to (temporarily) change $OutputEncoding too; it defaults to ASCII(!) in Windows PowerShell, and to (BOM-less) UTF-8 in PowerShell (Core) 7 - which is problematic in both cases, as it doesn't match the default value of [Console]::OutputEncoding.
    For instance, to both send data as UTF-8 and to decode it as such, you can (temporarily) set:

    $OutputEncoding = [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
    
  • It is possible to configure a given system to use UTF-8 system-wide by default, which would make things just work without extra effort in this case (though in Windows PowerShell you may situationally still have to set $OutputEncoding); however, this configuration, which sets the system locale in a way that sets both the OEM and the ANSI code page to 65001 (UTF-8), has far-reaching consequences that may break existing scripts - see this answer.

    • GitHub issue #7233 is a much lower-impact suggestion to make PowerShell (Core) 7 console windows default to UTF-8, without the need to change the system locale (active code pages).