How to implement a safety-critical AI compute cluster at the edge?


I want to experiment with a redundant compute architecture for an autonomous car that can handle all of the AI and other computing workloads. To do that, I bought some edge computing devices (Nvidia Jetson TX2s), each of which contains an integrated GPU. I then connected them through a gigabit Ethernet switch, so they can now communicate with each other.

I need your advice on the system architecture. How can I implement this fail-safe, safety-critical, redundant system? There are some cluster examples that provide high availability, but what I want is this: "Each compute node runs the same processes and sends its results to a master node. The master node analyses the results, votes on them, and picks the best one. If a compute node fails (bug, system down, loss of power, etc.), the system should detect the failure and transfer the failed node's compute load to healthy nodes. Each node should also run some node-specific tasks without being affected by the shared processes."
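
To make the idea concrete, here is a minimal Python sketch of the behaviour described above: heartbeat-based failure detection, strict-majority voting on the master, and round-robin reassignment of a failed node's tasks. Everything in it is a hypothetical illustration, not an existing library: the names `Node`, `healthy_nodes`, `vote`, and `reassign`, the 0.5 s timeout, and the task names are all assumptions. It also assumes results can be compared for exact equality, which real neural-network outputs usually cannot without quantization or a tolerance.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_heartbeat: float  # timestamp (seconds) of the last heartbeat received


def healthy_nodes(nodes, now, timeout_s=0.5):
    """Return the names of nodes whose last heartbeat is at most timeout_s old."""
    return [n.name for n in nodes if now - n.last_heartbeat <= timeout_s]


def vote(results):
    """Strict-majority vote over {node_name: result}; None if there is no majority."""
    if not results:
        return None
    value, count = Counter(results.values()).most_common(1)[0]
    return value if count > len(results) / 2 else None


def reassign(assignments, healthy):
    """Move tasks owned by unhealthy nodes onto healthy nodes, round-robin."""
    if not healthy:
        raise RuntimeError("no healthy nodes left")
    new_assignments = {}
    moved = 0
    for task, owner in assignments.items():
        if owner in healthy:
            new_assignments[task] = owner
        else:
            new_assignments[task] = healthy[moved % len(healthy)]
            moved += 1
    return new_assignments


if __name__ == "__main__":
    # Hypothetical snapshot: node "c" has not sent a heartbeat for 1.2 s.
    nodes = [Node("a", 10.0), Node("b", 10.0), Node("c", 9.0)]
    alive = healthy_nodes(nodes, now=10.2)
    print(alive)
    print(vote({"a": "brake", "b": "brake", "c": "steer"}))
    print(reassign({"lidar": "a", "camera": "c", "radar": "c"}, alive))
```

Note that the master node and the single Ethernet switch are themselves single points of failure in this layout, and a voter that returns `None` still needs a defined safe reaction.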

What are your thoughts? Any keywords, suggestions, or method recommendations would help.


There is 1 solution below


The primary system/software safety standard for automobiles is ISO 26262. If you're going to be serious about making an automotive product, you'll want to acquire a copy and follow the process.

The primary classification for levels of autonomy in cars is SAE J3016_201806. You'll save a lot of headache up front by knowing which level you're shooting for beforehand. You may want to shoot for Level 1 ("hands on") like an adaptive cruise control or lane departure prevention system before trying to do more sophisticated things.

Here are some general themes that I've gleaned from doing safety stuff:

  • There is no generally-accepted way to determine a probability of software failure. There's even a school of thought that software does not fail. Instead, most safety standards assign safety-significant functionality implemented in software to different "levels" that require higher levels of scrutiny based on certain criteria including severity, closeness to a hazard (are there interlocks?), etc.
  • Most safety standards define software as everything running on the hardware, so you will need to ensure that the operating system you use can also meet the standards. This usually means a real-time operating system.
  • Keep your safety-significant functionality as simple as possible. If you can do something with elementary electrical circuits and logic gates (such as an emergency stop), do it because the math and analysis are much more mature for hardware.
  • Acquire and follow a safety-relevant coding standard. The predominant one for automotive applications is MISRA C.
  • Look into using fault tree analysis to identify the relationships of failures required for a mishap to occur. This also helps identify single points of failure.
  • Try to alleviate hazards in the design if possible. Procedural mitigations and personal protective equipment should be a last resort.
  • At a minimum, you'll want a hard electrical emergency stop for the safety driver and a remote-controlled emergency stop operated by a spotter.
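
As a rough illustration of the fault tree analysis point above, the following Python sketch computes the minimal cut sets of a small AND/OR fault tree and reports any single points of failure (cut sets of size one). The tree, its event names, and the helper functions are hypothetical and chosen only to resemble the setup in the question; a real analysis would come out of a hazard analysis and normally use a dedicated tool. The sketch is purely qualitative and assigns no probabilities.

```python
from itertools import product


def minimize(sets):
    """Drop every cut set that is a strict superset of another cut set."""
    unique = set(sets)
    return [s for s in unique if not any(other < s for other in unique)]


def cut_sets(node):
    """Minimal cut sets of a tree whose nodes are either a basic-event
    name (str) or a tuple (gate, children) with gate "AND" or "OR"."""
    if isinstance(node, str):
        return [frozenset([node])]
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any single child's cut set is enough to trigger the gate.
        combined = [s for sets in child_sets for s in sets]
    elif gate == "AND":
        # One cut set from every child must occur together.
        combined = [frozenset().union(*combo) for combo in product(*child_sets)]
    else:
        raise ValueError(f"unknown gate: {gate}")
    return minimize(combined)


def single_points_of_failure(tree):
    """Basic events that cause the top event on their own."""
    return sorted(next(iter(s)) for s in cut_sets(tree) if len(s) == 1)


# Hypothetical top event: "vehicle does not reach a safe state".
tree = ("OR", [
    "ethernet_switch_fails",
    ("AND", ["node_a_fails", "node_b_fails", "node_c_fails"]),
    ("AND", ["voter_fails", "estop_fails"]),
])

if __name__ == "__main__":
    print(single_points_of_failure(tree))
```

In this made-up tree the shared Ethernet switch shows up as the only single point of failure, which is the kind of finding that would push a design toward a second, independent network path.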