The problem with NVMe RAID 0 is that it's only possible to achieve with software RAID. Meaning: no battery backing, no controller cache. Also, RAID 1 underneath is important to preserve data, combined with a battery. As for crashes: in case of a power outage, scripts read the UPS status via SNMP and issue SIGINT to steemd, so that the data is safely written to disk. I also made a couple of edits here and there in the code to ensure corruption-free files even if steemd crashes (not properly tested yet, but I will submit a merge request on GitHub if it holds up well on my server).
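To make the UPS part concrete, here is a minimal sketch of that kind of watchdog, not my actual script. Everything named here is an assumption for illustration: `read_status` is a hypothetical callable that you would back with your own SNMP query, the value 5 is what the standard UPS-MIB (RFC 1628) uses for "on battery" in `upsOutputSource` as far as I know (check your UPS vendor's MIB), and the 10-minute threshold is arbitrary.

```python
import os
import signal
import time
from typing import Callable, Optional, Tuple

# Assumption: upsOutputSource == 5 means "battery" in the standard UPS-MIB.
ON_BATTERY = 5


def should_shut_down(output_source: int, minutes_remaining: int,
                     min_minutes: int = 10) -> bool:
    """True when the UPS runs on battery and the estimated runtime is low."""
    return output_source == ON_BATTERY and minutes_remaining <= min_minutes


def watch_ups(read_status: Callable[[], Tuple[int, int]],
              pid: int,
              send_signal: Callable[[int, int], None] = os.kill,
              poll_seconds: float = 30.0,
              max_polls: Optional[int] = None) -> bool:
    """Poll the UPS and send SIGINT to `pid` once shutdown is warranted.

    `read_status` is a hypothetical hook returning
    (output_source, minutes_remaining); plug your SNMP query in there.
    Returns True if the signal was sent, False if max_polls ran out first.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        source, minutes = read_status()
        if should_shut_down(source, minutes):
            # SIGINT lets the process run its normal shutdown path,
            # unlike SIGKILL, which cannot be handled.
            send_signal(pid, signal.SIGINT)
            return True
        polls += 1
        time.sleep(poll_seconds)
    return False
```

In real use you would pass the steemd PID and leave `send_signal` at its default; the injectable `send_signal` is only there so the logic can be tried without killing anything.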
While theoretically you are right in terms of speed, I still prefer an industry-grade approach. It's much easier to scale. I also have a hot spare, dual power supplies, and ECC memory. These things are important. (I learned that the hard way working with mission-critical systems.)
It could be a "psychological" problem of mine :) but I don't trust the system unless I can replace a faulty drive or power supply without the guest OS even being aware that anything happened at all.
As for the seed nodes, I did manage to get p2p load balancing based on block number working, using a CCR1036-12G-4S-EM and some scripting. (It is not in use yet, as it needs to be properly tested.)
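The selection idea behind block-number-based balancing can be sketched roughly like this. It is only an illustration in Python, not the router script itself: the function names, the `max_lag` threshold, and the tie-break on connection count are all assumptions of mine for the example.

```python
from typing import Dict, List, Optional


def eligible_seeds(head_blocks: Dict[str, int], max_lag: int = 3) -> List[str]:
    """Return seed nodes whose head block is within `max_lag` of the best one."""
    if not head_blocks:
        return []
    best = max(head_blocks.values())
    return sorted(name for name, block in head_blocks.items()
                  if best - block <= max_lag)


def pick_seed(head_blocks: Dict[str, int],
              connection_counts: Dict[str, int],
              max_lag: int = 3) -> Optional[str]:
    """Pick the least-loaded seed node among those that are in sync."""
    candidates = eligible_seeds(head_blocks, max_lag)
    if not candidates:
        return None
    # Fewest current connections wins; the name breaks ties deterministically.
    return min(candidates, key=lambda n: (connection_counts.get(n, 0), n))
```

So a node that has fallen behind simply stops receiving new peers until it catches up, and the rest share the load.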
The way I see witness work a year from now, only 'by the book' industry-grade systems with a well-planned topology will survive and be able to scale. While it's possible to improvise, it comes at a price, and many of us (including myself) learn that the hard way.
Another problem is that people who are experienced with mission-critical systems outside of the crypto industry are hard to bring into projects such as this one, while people from crypto usually tend to experiment with DIY solutions. That's fun, but without the first group, the important knowledge transfer that would keep every single lesson from being paid for again never takes place.
I'll try to bring as many industry best practices as possible into this (and the other projects I'm involved in) in order to better understand the technology challenges.
While we are experimenting with DIY solutions, IBM (and others), with their proprietary blockchain solutions, are watching what we are doing, and once we prove a good working model, they will be able to deploy a Steem-like platform in a matter of seconds, one that will be more stable and scalable in every possible way. That's what I'm afraid of: that current technologies are just a testing ground for corporate players.
(These are my predictions, and articles such as this one are my contribution for the future if my view turns out to be correct. Or it may not turn out that way at all.)
The problem is the hardware requirements for a full node on Steem are so ridiculous that you have to consider DIY, especially since there is no data loss in a complete failure scenario.
Enterprise quality hardware for a full node would be $30-40K. A Dell server that meets requirements today is around $20k and that's without any redundancy or hardware RAID. The fact the hardware would be outgrown in 6 months makes it even worse.
There is a good chance all these servers we are setting up for full nodes with 512GB of RAM won't be sufficient in 6 months. AppBase promises to help in this regard, as does changing account history on full nodes to 30 days instead of everything since block 0, but then again, Steem Mobile Wallet was coming out in November.
Spending $20k-$30k+ on hardware that may last 6 months isn't a very smart move, especially when running a full node brings no financial return. Only a witness node does, and the positions with the funding to do such a thing are pretty much set in stone.
Just keep in mind that for the price of two NVMe drives, you can get a refurbished enterprise-grade server with a controller and even 240GB of ECC RAM on eBay. (A new one would cost $20-30k.)
But you are right on many points. Unless there are other motives, investing in enterprise-grade infrastructure is most likely not a wise choice and is unlikely to generate ROI.
For example, ECC RAM is crucial so that you don't need to rebuild from scratch even after a crash. Not sure how many people are aware of that. (I'm not talking about RAM as storage, just for the operation of steemd.) When ECC RAM is used, file corruption is unlikely even when kill -9 is used.
On the other hand, I am a bit surprised that these things are not documented, and I get the impression that there is a fair amount of 'selfishness' when it comes to knowledge exchange, which is the total opposite of the open source ideology that blockchain evolved from.