LOL. I can give you an example from the recent past. Without going into great detail, we are in the final months of bringing a new customer system online. One phase of that rollout was completed recently.
For this particular technology, there are six individual sub-systems that each need to identify a situation and react accordingly, working together as a whole to handle conditions in the production environment.
Each piece was tested individually, and after we found and fixed errors, they all worked flawlessly in our test environment, a copy (aka "digital twin") of the production environment.
However, when put into production, the handoff of one piece of information between two of the systems failed. That was a key piece of information, the one that tied the individual parts together with "the big picture." We had tested this, but there are limits to what we can test in a test environment vs. a production environment.
Why are there limits? Because it would be infeasibly expensive to simulate an environment with thousands to tens of thousands of individual data points moving in realistic trends relative to one another.
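To make that failure mode concrete, here is a toy sketch (Python; every name, unit, and threshold in it is invented for illustration and has nothing to do with our actual system). Each subsystem passes its own unit tests because each test bakes in that side's assumption; the broken handoff only shows up when the two run against each other:

```python
# Toy illustration only -- all names, units, and thresholds are invented.

def detector_emit(reading_psi: float) -> dict:
    # Subsystem A publishes a pressure event. Its unit tests feed it psi
    # and check the output shape -- they pass.
    return {"sensor": "P-101", "pressure": reading_psi}  # implicitly psi

OVERPRESSURE_LIMIT = 1000.0  # Subsystem B wrote this threshold assuming kPa

def controller_react(event: dict) -> str:
    # Subsystem B decides whether to act. Its unit tests use kPa
    # fixtures -- they also pass.
    return "OPEN_RELIEF" if event["pressure"] > OVERPRESSURE_LIMIT else "NORMAL"

# End to end: 160 psi is about 1103 kPa, well past the limit, yet it
# reads as NORMAL because 160 < 1000. Both parts are "flawless" in
# isolation; the handoff between them is not.
print(controller_react(detector_emit(160.0)))  # -> NORMAL (wrong)
```

This is the software version of the Mars Climate Orbiter unit mix-up: both halves were individually correct, and the shared interface assumption was never exercised end to end.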
In our case the failure was obvious, and (patting ourselves on the back) that is because we design our systems with extensive feedback and logging features, so that when something goes wrong it doesn't take weeks or months to pinpoint the error, or force us to wait for the next occurrence and hope it can be identified (there is a classic engineer joke about brake failure on this one!).
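For what that feedback looks like in practice, here is a minimal sketch (Python; the wrapper and all names are hypothetical, not our actual implementation). The idea is simply that every inter-subsystem handoff is recorded with sender, receiver, and payload, so a wrong or missing field points straight at the offending hop:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("handoff")

def send(source: str, dest: str, payload: dict) -> dict:
    # Wrap every cross-boundary message: log who sent it, who it was
    # for, and exactly what it contained, then pass it through unchanged.
    record = {"ts": time.time(), "from": source, "to": dest, "payload": payload}
    log.info(json.dumps(record))
    return payload

# Usage: route every inter-subsystem message through the wrapper.
send("detector", "controller", {"sensor": "P-101", "pressure": 160.0})
```

The payoff is the one described above: when something breaks, the log shows which handoff dropped the key field, rather than requiring you to reproduce the failure.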
In our case the cost to fix the root-cause error and retest was not an insignificant percentage of the entire implementation cost, and that is with a solution that is not an AI (aka "big data") black box.
For autonomous driving the data set is several orders of magnitude more complex than ours, but the principle is the same. Breaking a complex problem into parts is a key engineering principle, but it does not let you skip testing the entire system as a whole. Decomposition makes that full-system testing less expensive, no doubt, but it must still happen, or you get failures like Tesla Autopilot cars wanting to drive into oncoming traffic:
https://www.youtube.com/shorts/5v6GMhkGqes
Our system is complex, but driving in the real world is at least 3 orders of magnitude (probably 4 or 5) more difficult. For FSD to work correctly, every single one of those hundreds of thousands of individual situations must be tested individually, and then regression-tested together.
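Some back-of-the-envelope arithmetic on why the "together" part is the expensive part (the numbers here are illustrative, not a claim about anyone's actual test matrix): individual situations grow linearly, but real-world situations co-occur, so the combinations dominate.

```python
from math import comb

# Hypothetical scenario dimensions for a driving stack; a real list
# runs to the hundreds of thousands.
scenarios = ["rain", "night", "oncoming_traffic", "faded_lane_lines",
             "construction", "glare", "pedestrian", "stopped_vehicle"]

n = len(scenarios)
print(n)            # 8 individual runs -- linear, cheap
print(comb(n, 2))   # 28 pairwise runs
print(comb(n, 3))   # 56 three-way runs -- already dominating

# With ~100,000 base situations, even pairwise coverage is ~5 billion cases:
print(comb(100_000, 2))  # 4,999,950,000
```

Even pairwise coverage of a realistic situation count runs to billions of cases, which is exactly why "it works in the simulator" cannot be the end of the story.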