Convert SPIP text to markdown (or HTML)

I have to update an old website based on SPIP (a French CMS with its own Markdown-like syntax).

I'd like to convert its database content to Markdown. I haven't found any useful resource for converting SPIP syntax to HTML (which I could then turn into Markdown with league/html-to-markdown, for instance), and I can't work out which function in SPIP's own code does that conversion.

Any help would be great.

There are 2 best solutions below

0
On BEST ANSWER

I finally found a script that matches my needs: https://github.com/nhoizey/spip2markdown

It is intended to be used inside SPIP, but the main functions are easily adaptable.

1
On

Like you, I don't know of such a tool, so I wrote my own when I had to export SPIP data. However, this tool:

  • is intended to output XML instead of HTML
  • is implemented as a SPIP plugin, so it must be installed first, then driven from the SPIP private area
  • was written several years ago, and to be honest I don't remember much about it

So I can't realistically suggest that you use it.
On the other hand, if you want to write your own tool, you might take advantage of the following excerpt, which was the heart of mine:

$spip2xml_specifs = [
  'data_fields' => [
  # obj => [
  #   dest_field =>  src_field | [src_field,...]
  # ]
  # in src_field, initial "*" means: do not apply filters
    'rub' => [
      'titre' => '*titre',
      'body'  => ['descriptif','texte'],
    ],
    'art' => [
      'titre' => '*titre',
      'body'  => ['*surtitre','*soustitre','descriptif','chapo','texte','ps'],
    ],
  ],
  'str_replace' => [
    "\r\n"                                => "\n", # normalize Windows line endings to *nix
  ],
  'preg_replace' => [
    '¤\n\n\n*¤'                           => "\n\n", # limit multiple \n up to 2
    #
    '¤{{{(.+)}}}¤msU'                     => '<h3>$1</h3>',
    '¤{{(.+)}}¤msU'                       => '<b>$1</b>',
    '¤{(.+)}¤msU'                         => '<i>$1</i>',
    # _  => <br />
    '¤^_ ¤ms'                             => '<br />',
    # ---- => <hr />
    '¤^(-{4,})(\n|$)¤ms'                  => '<hr />',
    /*
    # \n\n => <paragraph>
    '¤(\n\n)?(.+)((?=\n\n)|$)¤Us'         => '<p>$2</p>',
    '¤\n\n¤'                              => '', # drop left (why?) \n\n
    */
    # [...|...->...] => <a href... /a>
    '¤\[->(.*)\]¤msU'                     => '<a href="$1">$1</a>',
    '¤\[(.*)->(.*)\]¤msU'                 => '<a href="$2">$1</a>',
    '¤<a (.*)>(.*)\|(.*)</a>¤msU'         => '<a title="$3" $1>$2</a>',
    # <cadre>, <quote> => <blockquote>
    '¤<(cadre|quote)>(.*)</\1>¤imsU'      => '<blockquote>$2</blockquote>',
    # -* => <ul... /ul>
    '¤^-\*([^*].*)¤m'                     => '<li>$1</li>',
    '¤(<li>.*</li>)¤s'                    => '<ul>$1</ul>',
    # tables, notes, anchors...? templates ("modèles") are not handled -> report them?
    #
    # finally remove superfluous <p>
    '¤<p><(h[1-6r]|ul|table)(.*)>(.*)(</\1>)?</p>¤imsU'
                                          => '<$1$2>$3$4',
  ],
];

The data_fields array lists the fields that have to be processed for the two main data containers, sections ("rubriques", key rub) and articles (key art).
The str_replace and preg_replace arrays then list all the transformations, which are applied in order to each field that is not marked with "*".
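
For illustration, here is a minimal sketch of how such a specification could be applied. It is not the original plugin code: the helper names spip_apply_rules and spip_build_field are hypothetical, joining several source fields with a blank line is an assumption, and the rule set is reduced to a few entries. The patterns use ~ as delimiter instead of ¤, on the assumption that a multi-byte delimiter may not be accepted when the source file is saved as UTF-8.

```php
<?php
// Hypothetical helper (not from the original tool): run every rule on one text.
function spip_apply_rules(string $text, array $specs): string
{
    // Plain string replacements first (e.g. line-ending normalization).
    $text = str_replace(
        array_keys($specs['str_replace']),
        array_values($specs['str_replace']),
        $text
    );
    // Then each regex rule, in declaration order.
    foreach ($specs['preg_replace'] as $pattern => $replacement) {
        $result = preg_replace($pattern, $replacement, $text);
        if ($result === null) {
            throw new RuntimeException("preg_replace failed for pattern: $pattern");
        }
        $text = $result;
    }
    return $text;
}

// Hypothetical helper: build one destination field from a database row.
// $sources is a field name or a list of field names, as in 'data_fields'.
function spip_build_field(array $row, $sources, array $specs): string
{
    $parts = [];
    foreach ((array) $sources as $src) {
        $raw   = strncmp($src, '*', 1) === 0;   // leading "*": do not apply filters
        $name  = $raw ? substr($src, 1) : $src;
        $value = $row[$name] ?? '';
        if ($value === '') {
            continue;                           // skip missing or empty fields
        }
        $parts[] = $raw ? $value : spip_apply_rules($value, $specs);
    }
    // Assumption: source fields are joined with a blank line.
    return implode("\n\n", $parts);
}

// Reduced rule set for the demonstration only.
$demo_specs = [
    'str_replace' => ["\r\n" => "\n"],
    'preg_replace' => [
        '~\{\{\{(.+)\}\}\}~msU' => '<h3>$1</h3>',
        '~\{\{(.+)\}\}~msU'     => '<b>$1</b>',
        '~\{(.+)\}~msU'         => '<i>$1</i>',
        '~\[(.*)->(.*)\]~msU'   => '<a href="$2">$1</a>',
    ],
];

// Made-up row standing in for a record of the articles table.
$row = [
    'titre' => '{Raw} title',
    'texte' => "{{{Heading}}}\r\nSome {{bold}} and {italic} text, see [SPIP->https://www.spip.net].",
];

echo spip_build_field($row, '*titre', $demo_specs), "\n";
echo spip_build_field($row, ['descriptif', 'texte'], $demo_specs), "\n";
```

Rule order matters here: the {{{...}}} rule has to run before {{...}}, which has to run before {...}, otherwise the shorter pattern would consume the braces of the longer one.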

At least I can confirm that these rules are the right ones and worked fine for me.

Feel free to ask for more information if needed.