How to implement a seekable media input stream using Sodium for decryption


We have a product consisting of documents and media files that are encrypted for DRM protection. On the production side, we have a Python script that encrypts the files, and on the client side, an Android app that decrypts them. This means we need to have an encryption/decryption scheme that can work compatibly on both Python and Android platforms. I've settled on libsodium/NaCl because it is available on both platforms, is free and open source, and it's designed with high-level APIs that are supposed to provide "Expert selection of default primitives" (http://nacl.cr.yp.to/features.html), thus helping the developer get things configured right without having to be an expert in the details of cryptography parameters.

Based on that, I've been able to test successfully that data encrypted by Sodium on Python can be decrypted by Sodium on Android. There's a fair bit of learning time invested in Sodium, so I'd prefer not to have to change that, if at all possible.

However, when it comes to playing large DRM-protected videos on the Android side, I believe we need a solution that works for streaming, not just for decrypting a whole file into memory. Currently we are just reading the whole file into memory and decrypting it:

final byte[] plainBytes = secretBox.decrypt(nonce, cipherText);

Obviously that's not going to work well with large video files. If we were using javax.crypto.Cipher instead of Sodium, we could use it to implement a CipherInputStream (and use that to implement an exoplayer2.upstream.DataSource or something similar). But I'm having difficulty seeing how to use libsodium to implement a decryption stream.
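
To make the javax.crypto.Cipher alternative concrete, here is a minimal sketch of what such a decryption stream could look like. It assumes the file was encrypted with AES in CTR mode, which is not what SecretBox produces, and the class and method names (`StreamingDecrypt`, `open`, `drain`) are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper: wraps an encrypted stream so it decrypts as it is read.
final class StreamingDecrypt {

    // Assumption: the data was encrypted with AES/CTR/NoPadding using this key and IV.
    static InputStream open(InputStream encrypted, byte[] key, byte[] iv) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return new CipherInputStream(encrypted, cipher);
    }

    // Demo only: reads the stream in 4 KiB pieces and collects the plaintext.
    // A media player would consume each piece instead of accumulating them.
    static byte[] drain(InputStream in) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```

This only covers sequential reading; a stream like this cannot jump backwards or ahead without being reopened.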

The libsodium library I'm using does provide bindings to "stream" functions. But this meaning of "stream" seems to be a somewhat different thing from "streaming" in the sense of Java InputStream. Moreover, all those functions seem to be very specific to the low-level detailed parameters that up to this point, libsodium has not required me to be aware of. For example, chacha20, salsa20, xsalsa20, xchacha20poly1305, etc. Up to this point, I have no idea which of these algorithms is being used on either side; SecretBox just works.

So I guess the question I would like answered most is, how can libsodium be used in Android to provide seekable, streaming decryption? Do you know of any good example code?

Subquestions of that:

  • Admittedly, now that I look closer at the docs, I see that PyNaCl's SecretBox uses the XSalsa20 stream cipher. I wonder if I can count on that always being the case, since I'm supposed to be insulated from those details?
  • I think for media playing, you need more than just streaming, in the sense of being able to consume a small piece at a time, in sequence. For typical usage, you also need it to be seekable: the user wants to be able to skip back 5 seconds without having to wait for the player to reset to the beginning of the stream and process/decrypt the whole thing again up to 5 seconds ago.
  • Is it feasible that I could use javax.crypto.Cipher on the Android side, but configure it to be compatible with the encryption algorithm (XSalsa20) and its parameters from the PyNaCl SecretBox production process?

Update:

To clarify,

  • The issue of decryption key delivery is already solved to our satisfaction, so that is not what I'm asking help on here.
  • Our app is completely offline, so the streaming issues I mentioned have to do with loading and decrypting files from local storage, rather than waiting for downloads.

Answer:

For video you might find it easier to use existing mechanisms, as they will have already solved most of your issues.

For most video applications you will want to stream the video and play/seek as you go, rather than having to download the entire video, as you point out.

At this time there are three major DRMs commonly used to encrypt content and share keys between the server and the client: Widevine, PlayReady and FairPlay. All three support the functionality you want for streamed videos. The disadvantage is that you will usually have to pay to use these DRM services.

You can also use HLS or DASH, which are Adaptive Bit Rate (ABR) streaming protocols, to stream the video (https://stackoverflow.com/a/42365034/334402).

These also allow you to use key-sharing mechanisms that are less secure, but possibly adequate for your needs: the key is essentially shared in the clear while the content itself is still encrypted. Both are free and well supported:

  • HLS AES Encryption
  • DASH ClearKey Encryption

Have a look at these answers for examples of generating both streams: https://stackoverflow.com/a/45103073/334402, https://stackoverflow.com/a/46897097/334402

You can play back the streams using open source players like dash.js in the browser and ExoPlayer for native Android.

If you wanted more security but still wanted to avoid using a commercial DRM, you could also modify the above to configure the key on your player client directly rather than transmitting it from server to client.

You then do have the risk that someone could hack or reverse engineer your client app to extract the key, but I think you will have this with your original approach anyway. The real value of DRM systems is not the content encryption, which is essentially just AES, but the mechanisms they use to securely transport and store the keys. Ultimately, it is a question of cost and benefit - it sounds like your solution may work quite adequately with a custom configured key implementation.

As an aside, on the seeking question - most video formats are broken into groups of pictures or frames which can be decoded separately from the rest of the video before and afterwards, with the help of some header info. So you can decode at, or at least near, any given point without having to decode the entire video up to that point.
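
A similar random-access property can hold at the encryption layer when the content is encrypted in a counter-based mode, because the keystream for any block can be computed directly from its position. The following is a minimal sketch of that idea, assuming AES-CTR via javax.crypto.Cipher rather than the XSalsa20 that SecretBox uses; the class and method names (`SeekableCtr`, `counterAt`, `decryptRange`) are hypothetical, and the whole ciphertext is held in a byte array only to keep the example short.

```java
import java.math.BigInteger;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper: decrypts an arbitrary byte range of an AES-CTR ciphertext.
final class SeekableCtr {
    private static final int BLOCK = 16; // AES block size in bytes

    // Returns the initial counter block advanced by blockIndex,
    // treating the 16 bytes as one big-endian integer (wrapping mod 2^128).
    static byte[] counterAt(byte[] iv, long blockIndex) {
        byte[] raw = new BigInteger(1, iv).add(BigInteger.valueOf(blockIndex)).toByteArray();
        byte[] out = new byte[BLOCK];
        int n = Math.min(raw.length, BLOCK);
        // Keep the low-order bytes, right-aligned in the 16-byte counter.
        System.arraycopy(raw, raw.length - n, out, BLOCK - n, n);
        return out;
    }

    // Decrypts `length` bytes starting at byte position `offset`,
    // without touching any ciphertext before the enclosing block.
    static byte[] decryptRange(byte[] key, byte[] iv, byte[] cipherText, long offset, int length)
            throws Exception {
        long blockIndex = offset / BLOCK;   // block that contains `offset`
        int skip = (int) (offset % BLOCK);  // bytes to discard inside that block
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
                new IvParameterSpec(counterAt(iv, blockIndex)));
        int start = (int) (blockIndex * BLOCK);
        byte[] plain = cipher.doFinal(cipherText, start, skip + length);
        return Arrays.copyOfRange(plain, skip, skip + length);
    }
}
```

In a player, the same calculation would run whenever a seek occurs: open the file at the block boundary, re-initialise the cipher with the advanced counter, and discard the first few bytes. Note that this sketch provides no integrity check on the data it decrypts.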

The thumbnails you see when you scroll or hover along the timeline on a player are generally a separate stream of still image snapshots taken at regular intervals in the video. This allows the player to show the appropriate thumbnail as if it were showing the frame at that point in the video. If the user clicks to that point, the player requests that section of the video if it does not already have it, decodes the relevant chunk and starts playing it.