JavaScript: transform a WEBVTT file into JSON

File .vtt:

WEBVTT

00:00:00.039 --> 00:00:25.968
VINCENZO Cassano!

00:00:26.044 --> 00:00:26.961
Damn it.

00:01:23.434 --> 00:01:24.894
Mr. Vincenzo Cassano.

00:01:24.978 --> 00:01:27.814
You're under arrest
for the murder of Mr. Oh Jeong-bae.

00:01:43.913 --> 00:01:44.956
Hands up,

00:01:45.540 --> 00:01:46.708
or I'll fire.

00:01:51.504 --> 00:01:52.964
I didn't do it.

Transformed into json:

[
   {
      "from":"00:00:00.039",
      "to":"00:00:25.968",
      "timeString":"00:00:00.039 --> 00:00:25.968",
      "text":"VINCENZO Cassano!"
   },
....,
   {
      "from":"00:01:24.978",
      "to":"00:01:27.814",
      "timeString":"00:01:24.978 --> 00:01:27.814",
      "text":"You're under arrest\nfor the murder of Mr. Oh Jeong-bae."   <- multi-line text; I assume the line break becomes \n?
   }
]

Result: enter image description here

I have a .vtt subtitle file and need to convert it into a JSON array like the one above, handling multi-line cue text as well.

I wrote the code below, which removes the leading WEBVTT header and collapses the blank lines. I couldn't get rid of the empty first line (index 0 in the image below), though I may have fixed that by adding .replace('\n', '').

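// Decode the bytes, drop the WEBVTT header, collapse runs of line breaks
// into a single \n, then remove the first remaining line break.
const v = enc.decode(text).replace('WEBVTT', '').replace(/[\r\n]{2,}/g, '\n').replace('\n', '');
const lines = v.split('\n');
// forEach rather than map: we only log, we do not build a new array.
lines.forEach((el, i) => console.log(`${i} - ${el}`));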

// Inside the loop above: check whether el contains a timestamp of the form 00:00:00.039
const test = /\b(\d{2}:\d{2}:\d{2})\.(\d{3})\b/.test(el);

enter image description here

Can you give me a hand?

There is 1 best solution below

Might I suggest SRT/VTT Parser? Its output would get you partway there, and then you just have to do a little reformatting in your own code.

You could also have a look at this answer, which uses another parser, Node VTT.
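
If you would rather not add a dependency, a hand-rolled parser for files like the one in the question is fairly short. The following is only a minimal sketch, not a full WebVTT implementation: it assumes cues are separated by blank lines and that each cue has a timing line shaped like 00:00:00.039 --> 00:00:25.968. The names parseVtt and TIMING are made up for illustration, and any cue settings after the end timestamp are dropped from timeString.

```javascript
// Minimal sketch: parse WebVTT text into [{ from, to, timeString, text }].
// parseVtt and TIMING are illustrative names, not part of any library.
const TIMING =
  /^((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})\s+-->\s+((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})/;

function parseVtt(vttText) {
  const cues = [];
  // Normalize line endings, then split into blocks separated by blank lines.
  const blocks = vttText.replace(/\r\n?/g, '\n').split(/\n{2,}/);

  for (const block of blocks) {
    const lines = block.split('\n');
    // The timing line is the first line matching the pattern
    // (a cue may carry an optional identifier line before it).
    const timingIndex = lines.findIndex((line) => TIMING.test(line));
    if (timingIndex === -1) continue; // skips the WEBVTT header and other non-cue blocks

    const [, from, to] = lines[timingIndex].match(TIMING);
    cues.push({
      from,
      to,
      timeString: `${from} --> ${to}`,
      // Everything after the timing line is cue text; multi-line text keeps its \n.
      text: lines.slice(timingIndex + 1).join('\n').trim(),
    });
  }
  return cues;
}

// Usage with two cues taken from the question.
const sample = `WEBVTT

00:00:00.039 --> 00:00:25.968
VINCENZO Cassano!

00:01:24.978 --> 00:01:27.814
You're under arrest
for the murder of Mr. Oh Jeong-bae.
`;
console.log(JSON.stringify(parseVtt(sample), null, 3));
```

Splitting on blank lines first means the header never needs to be stripped by hand, and the multi-line case falls out of joining the remaining lines of each block with \n, which JSON.stringify then escapes as \n in the output.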