4K Prime update results and next steps

On Wednesday, 10/7/2020, we put out an update which moved operations from a 2048-bit prime to a 4096-bit prime. This upgrade had always been on the roadmap, but it was pushed up to deal with a large number of nodes running under-specced on cheap hardware, many of which are within the same data centers.

This created three issues which we were hoping the 4K prime update would address:

  • Running a large portion of the network through a single host can allow that hosting company to censor the network.
  • The nodes running on these cheap VPSs do not meet the specs, and as the network matures, they will not perform.
  • This situation is unfair to those who are committed to the network and have provided the requested hardware.

While GPU nodes complete 4K-prime precomputations roughly 8 times faster on average than CPU nodes, we discovered that a team with one slow node and two fast nodes is slowed down by much more than a third [1]. We have verified this property in the NodeLab, as you can see below.

The resulting numbers on the live network were even worse than the above, because the nodes running in data centers are all located very close to one another. Their reduced latency to each other, combined with this effect, means that the final performance numbers of CPU nodes vs. GPU nodes are not useful for identifying underperforming nodes given the current betanet properties. As you can see below, our sample happens to show known GPU-only nodes as slower than known CPU-only nodes due to this latency discrepancy.
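
To make the bottleneck effect described in footnote [1] concrete, here is a toy model (a sketch for illustration only, not the actual cMix scheduler, and separate from the measurements referenced above): a round's precomputation finishes only when every node in the team has finished, so a single slow node sets the pace for the whole team. Only the roughly 8x GPU-vs-CPU ratio comes from this post; every other number is invented.

```python
# Toy model of the effect described in [1]: a round's precomputation finishes
# only when the slowest team member finishes, so one slow node sets the pace.
# Only the ~8x GPU-vs-CPU ratio comes from the post; everything else is invented.

GPU_TIME = 1.0                 # arbitrary time units per precomputation
CPU_TIME = 8.0 * GPU_TIME      # CPU nodes are roughly 8x slower with 4K primes

def team_time(node_times):
    """The team can only proceed once every member has finished."""
    return max(node_times)

all_gpu = team_time([GPU_TIME, GPU_TIME, GPU_TIME])   # 1.0
one_cpu = team_time([CPU_TIME, GPU_TIME, GPU_TIME])   # 8.0

print(f"All-GPU team: {all_gpu:.0f}, team with one CPU node: {one_cpu:.0f}")
print(f"Slowdown: {one_cpu / all_gpu - 1:.0%} -- far more than a third")
```

In this model a single CPU node makes a three-node team 8 times slower than an all-GPU team, not a third slower, which is why one slow member degrades team performance so sharply.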

So what do we do now?

Until full decentralization, the xx network team will be giving this topic more care. We will release an update to the dashboard which lists the service provider for every node [2], will require the disclosure of planned service providers in the application process for future nodes, and will not accept nodes using an over-represented service provider.

The hardest part of this process is handling the currently problematic nodes. We have separated all nodes on the network into the following four groups:

  • High Concern: Nodes running in CPU mode on an over-represented service provider (27 nodes)
  • Moderate Concern: Nodes running in GPU mode on an over-represented service provider (10 nodes)
  • Minimal Concern: Nodes running in CPU mode on a non-over-represented service provider (19 nodes)
  • No Concern: Nodes running in GPU mode on a non-over-represented service provider (116 nodes)

We have disabled all nodes of High Concern as of today and have reached out to them. We will require them to make changes to continue operating as part of the network. Before any nodes are kicked from the network or lose compensation, the issues will be presented to the community for input.

For nodes of Moderate Concern, we expect that most of them will not need to take any action. We would like to avoid punishing them for happening to choose the same provider as others, so unless the problem persists we are inclined to allow them to stay.

We don’t expect nodes of Minimal Concern to need to make any changes. CPU nodes are more likely to have performance issues down the road because they are not our build targets, so there is simply a little more risk. These nodes will be contacted in the coming days because their issues may simply be configuration problems.

And nodes of No Concern are in good standing. We are pleased that a significant majority of node operators have taken our request to heart and are so invested in supporting the network.

The groups can be found here, and any node operator who needs to make changes has already been notified by email.

Long Term

In the long term, we hope this process will be handled by decentralized governance. This issue plagues most blockchains, and many simply don’t pay much attention to it; Bitcoin mining, for example, is currently heavily concentrated in China. The solution we see is for decentralized governance to take a more active role in adding nodes to the network, allowing the users of the network to vote and make informed decisions about nodes.

Compensation Fairness

The elephant in the room is the disparity in costs over the past 3 months between operators of No Concern and those of High Concern. We feel that handling this issue inappropriately would place significant strain on the community, and we will do our best to make sure node operators feel whole. We understand that an answer to this question is needed, and that continually saying “soon” rings hollow, but unfortunately we do not have a concrete plan yet.

Process

Due to the potentially complex nature of these issues, we will be handling individual cases with node operators via email. If you have any questions about your specific case, please email us at [email protected]. We of course encourage community discussion and feedback, and hope you will use the public forums to discuss these issues and share your opinions with us.

New Bugs

This update introduced a new bug where nodes will sometimes get stuck. The occurrence of this bug will not impact compensation. The team is working on a solution and expects to release a hotfix shortly.

Footnotes

[1] The underlying idea behind the 4K primes was that, as we increase the requirements of the network, the nodes that do not meet spec would become immediately obvious and could be handled. Unfortunately, if a slow node is feeding data to faster nodes, the faster nodes can only operate at the rate at which the slower node gives them input. While we can still discover slower nodes over time, the network latencies in the current betanet make it impossible to draw clear conclusions right now.
[2] We have used a “whois” IP lookup service to determine the hosting provider for an IP.
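
For anyone curious what [2] looks like in practice, here is a minimal sketch of deriving a provider name from an IP. The post does not say which lookup service the team actually uses; this version simply shells out to the standard whois command and scans for an organisation field.

```python
# Sketch: look up the hosting organisation for an IP via the system "whois" tool
# (requires the whois command to be installed). The actual lookup service used
# for the dashboard is not specified in the post.
import subprocess

ORG_FIELDS = {"orgname", "org-name", "owner", "descr"}  # names vary by registry

def hosting_provider(ip: str) -> str:
    out = subprocess.run(["whois", ip], capture_output=True, text=True).stdout
    for line in out.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() in ORG_FIELDS and value.strip():
            return value.strip()
    return "unknown"

print(hosting_provider("8.8.8.8"))  # e.g. "Google LLC"
```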

Very nice reading

Will extra CUDA cores improve the precomputation time even more? I mean, Nvidia is releasing more and more powerful GPUs :slight_smile:

It’s very simple: Upgrade or leave!

The requirements have been well known from the start. Unfortunately, generous compensation also attracts people who want to cheat the rules and maximise their profit. The compensation is so generous that the investment in hardware is recouped within the first month of operation!

It’s like having a Formula 1 race where some people entered with normal cars and slow down the race. The fast cars shouldn’t adjust to the slow cars; the slow cars should leave. They shouldn’t have been there in the first place.

Possible solution
The node software could do some pre-calculation to check performance (either CPU or GPU), and if it can’t complete it within a set time, the software should just quit with an error message.
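
Something like this, perhaps (purely a sketch of the idea, not part of the actual node software; the workload and the 5-second cutoff are invented for illustration): time a handful of 4096-bit modular exponentiations at startup and refuse to run if the machine is clearly too slow.

```python
# Sketch of a startup self-benchmark: time some 4096-bit modular exponentiations
# and refuse to start if the machine is clearly too slow. Threshold and workload
# are illustrative placeholders, not values from the xx network software.
import secrets
import sys
import time

PRIME_BITS = 4096
ITERATIONS = 20
MAX_SECONDS = 5.0              # hypothetical acceptance threshold

def benchmark() -> float:
    modulus = (1 << PRIME_BITS) - 1        # stand-in odd modulus of the right size
    base = secrets.randbits(PRIME_BITS) | 1
    exponent = secrets.randbits(PRIME_BITS)
    start = time.perf_counter()
    for _ in range(ITERATIONS):
        pow(base, exponent, modulus)
    return time.perf_counter() - start

elapsed = benchmark()
if elapsed > MAX_SECONDS:
    sys.exit(f"Benchmark took {elapsed:.1f}s (> {MAX_SECONDS}s): hardware below spec")
print(f"Benchmark passed in {elapsed:.1f}s")
```

A real check would need to cover the GPU path as well, but the idea is the same.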

Current situation
There could be a grace period for nodes to upgrade. But to be honest, the requirements are very well communicated, also during the application process and in the node setup manual. I think disabling these high concern nodes and reaching out with one last request to upgrade is a great idea! :+1:

Over-represented service providers
A lot of data centres are concentrated around main Internet connections. Therefor nodes running in data centres have a higher chance of been close together. Also, a lot of bigger Internet Service Providers may have the same WHOIS information although their customers can still be decentralised. The over-represented service providers are unforeseen by node operators and could be taken in account during the future node selection process.

Upgrade
Congratulations on the upgrade and solving the problems so quickly! It’s great to see the speed and agility of the development team. Big thumbs up! :+1:

Hi, I like the idea for the possible solution. But as a hacker… I guarantee you there is a way around it to fake that data :slight_smile: It’s impossible to do that with an untrusted party.

True, you can even build your node from source and remove this check first, but it might scare off some people. :slightly_smiling_face:

Or the network could require some handshake calculation that can’t be faked client side. Preferably one that doesn’t require a lot of computational power to validate server side.
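
One construction with roughly that shape (just a sketch of the idea, not something the xx team has proposed) is a timed-squaring challenge in the style of a time-lock puzzle: the node being tested must perform t sequential squarings, while a verifier who knows the factors of the modulus can check the answer with a single cheap exponentiation.

```python
# Sketch of a timed challenge that is slow to answer but cheap to verify when the
# verifier knows the factorisation of the modulus (time-lock-puzzle style).
# The primes below are tiny demo values; a real deployment would use large ones.

P, Q = 999983, 1000003            # demo primes only
N = P * Q
PHI = (P - 1) * (Q - 1)

def prove(x: int, t: int) -> int:
    """Prover: t sequential squarings mod N -- no shortcut without knowing P, Q."""
    y = x % N
    for _ in range(t):
        y = (y * y) % N
    return y

def verify(x: int, t: int, y: int) -> bool:
    """Verifier: reduce the exponent 2^t mod phi(N), then a single fast pow()."""
    return pow(x, pow(2, t, PHI), N) == y

challenge, steps = 12345, 100_000
print(verify(challenge, steps, prove(challenge, steps)))  # True
```

Verification costs the same small amount of work regardless of t, while the prover’s work grows with t, so the deadline for answering could be tuned to the minimum hardware spec.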

Random nodes could test other nodes randomly every X blocks… that might be something… but in that case you can’t know whether the problem is with the source node that initiates the test…

On the dashboard, could you calculate (or have each node report directly) its real processing latency, and show that latency for every node?
That would let us see the percentage of slow nodes. I’ve already found some nodes that slow down rounds a lot, but checking a specific node by hand, round after round, takes quite some time.

I tried my own node in CPU-only mode for a few minutes on the first day, to separate CUDA/GPU issues from other issues.

The latency is not a property of a single node, but of two connected nodes.

Calculation time, if you prefer. I’m not talking about network latency but about the overall latency to process a task.

US operators shed a silent tear.

After participating in a round, the nodes could independently advertise each other’s latency, and if one is consistently underperforming it could be taken out of the election process.
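
A rough sketch of what that filtering could look like, assuming every node reports the precomputation latencies it observed from its teammates (the report format, thresholds, and the use of a median are all hypothetical):

```python
# Sketch: aggregate peer-reported precomputation latencies and flag nodes that
# are consistently slow across many rounds. Reports, thresholds, and the use of
# a median are hypothetical choices, not part of the xx network protocol.
from collections import defaultdict
from statistics import median

# report format: (reporting_node, observed_node, seconds)
reports = [
    ("A", "C", 7.9), ("B", "C", 8.2), ("A", "B", 1.1),
    ("B", "A", 0.9), ("C", "A", 1.0), ("C", "B", 1.2),
]

MAX_MEDIAN_SECONDS = 2.0   # hypothetical cutoff
MIN_REPORTS = 2            # ignore nodes with too few observations

observations = defaultdict(list)
for _, observed, seconds in reports:
    observations[observed].append(seconds)

excluded = {
    node for node, times in observations.items()
    if len(times) >= MIN_REPORTS and median(times) > MAX_MEDIAN_SECONDS
}
print(excluded)  # {'C'}
```

Using the median of many independent reports also makes it harder for a single node to smear a teammate with one bad report.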

Why should someone with a first-class network and hardware located in Argentina be kicked out?
Looking at it from the perspective of that node, everyone else is constantly underperforming and should be taken out…

Originally the idea was to have several geo-bins, and I thought location or latency awareness would be taken into account by the software. But it seems random.

If nodes can prefer peers with lower latency, that would be better. “In-between” nodes (a node in Iceland, for example) would likely have peers from several continents.

I’m talking about pre-computation latency, not network latency.

What is considered underperforming for pre-comp latency?

Dedicated compute nodes that are relatively underperforming but have recommended specs shouldn’t be penalized, IMO.