Xx network Economic Tweaks - Realtime Failure Deductions

We have released a blog post about an economic change:
https://xx.network/blog/realtime-failure-deductions

We would like to hear from the community about this topic

12 Likes

Sounds good to me. A 190-point penalty per realtime failure looked a bit too high at first, since it means not only 0 cMix rewards for a node with a 5% realtime failure rate, but basically 0 total (cMix + block) rewards as soon as the realtime failure rate climbs a little above 5%. But given that very few nodes have very high realtime failure rates, and that those failures have such a negative effect on cMix messaging, it seems a very effective and reasonably fair incentive for operators to minimize their realtime failures. In fact, operators with failure rates exceeding 5% would be better off immediately stopping cMix and simply running their block validator while they work on a fix.
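For reference, here is the back-of-the-envelope maths behind that, as a rough sketch only: the ~10 points per successful realtime round is my assumption, chosen because it is the figure the 5% break-even implies, not an official number.

```python
# Illustrative break-even maths for the proposed penalty (not official figures).
# Assumption: a successful realtime round earns roughly 10 cMix points.
POINTS_PER_SUCCESS = 10
PENALTY_PER_FAILURE = 190

def net_cmix_points(failure_rate, rounds=1000):
    """Expected cMix points over `rounds` rounds at a given realtime failure rate."""
    successes = rounds * (1 - failure_rate)
    failures = rounds * failure_rate
    return successes * POINTS_PER_SUCCESS - failures * PENALTY_PER_FAILURE

for rate in (0.01, 0.03, 0.05, 0.07):
    print(f"{rate:.0%} failure rate -> {net_cmix_points(rate):+.0f} points per 1000 rounds")
# 5% is the break-even (0.95 * 10 == 0.05 * 190), so net cMix rewards hit zero
# there, and anything above that starts eating into block rewards as well.
```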

Another consequence of such a high penalty is that it will tend to widen the reward gap between nodes with below-average versus above-average realtime failure rates, which may also be good for incentivizing the reduction of realtime failures network-wide.

3 Likes

To keep it short, your proposed 5% cutoff is completely acceptable to me. It does appear that only true outliers with configuration problems (or under-specified hardware) suffer at this level and above, so they should be penalised appropriately.

1 Like

Does the team know what contributes most to poor realtime performance? Is it a combination of CPU + GPU + SSD speed, or mostly networking? If it's networking, does it vary depending on which ISPs the nodes in a team are using?

The team will also be looking into making the team multiplier contingent on minimum performance, with more information coming soon.

Are you planning to release a binary/command to test performance where the results are signed by e.g. validator keys?

1 Like

This is a welcome move, considering that a poorly performing node impacts the whole team running its rounds and the overall network. We have been testing nodes and figuring out ways to improve performance in alpha, beta, and now in Canary. This should be supported by the community for the overall benefit.

1 Like

As I was reading the proposal/analysis it occurred to me that I would like to ask for a CLI utility that can stop cMix service when a moving realtime failure rate hits a set value. (Of course, we could write our own, but it’s difficult for us to test it and make it work reliably).

Then I saw this related comment, so I want to add that it's hard to imagine a non-malicious validator who would want to keep running the cMix service when they know they'll ultimately earn nothing.

The problem is that they don't know. Maybe an unattended upgrade of some key package failed, maybe something happened to the network and the node is still up (so no alerts for the operator) but is failing realtime rounds. Whatever it is, the operator would almost certainly prefer to stop the cMix service and troubleshoot, both for the benefit of the network and their own. So having such a utility provided by the Project would be helpful.
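Something along these lines is what I have in mind, as a rough sketch only: get_recent_rounds() and the "cmix.service" unit name are placeholders for whatever source and service name an operator actually has (cmix.log, the dashboard, etc.), not anything that exists today.

```python
#!/usr/bin/env python3
# Sketch of a watchdog that stops cMix once a moving realtime failure rate
# crosses a threshold. get_recent_rounds() and "cmix.service" are placeholders.
import subprocess
import time
from collections import deque

WINDOW = 200          # number of recent rounds to consider
THRESHOLD = 0.05      # stop mixing above 5% realtime failures in the window
CHECK_INTERVAL = 60   # seconds between checks

def get_recent_rounds():
    """Placeholder: return an iterable of round outcomes (e.g. "success",
    "realtime_failure") completed since the last call."""
    raise NotImplementedError("fill in with log parsing or another data source")

def main():
    window = deque(maxlen=WINDOW)
    while True:
        window.extend(get_recent_rounds())
        if len(window) == WINDOW:
            rate = sum(1 for r in window if r == "realtime_failure") / WINDOW
            if rate >= THRESHOLD:
                # Stop only the mixing service; the block validator keeps running.
                subprocess.run(["systemctl", "stop", "cmix.service"], check=True)
                break
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    main()
```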

It is probably possible to maliciously delay one's own messages (just squeeze the pipe before REALTIME begins) and intentionally cause realtime round failures for all nodes participating in the same round. In fact, it is possible to use 2 nodes to target 1 node and squeeze the pipe only when that one node happens to participate in the same round. If all three have the same threshold value, the attacked one will stop cMix first.

If many operators have such a utility set to a fairly low value, they could take themselves offline during such an attack as far as mixing is concerned. So personally I’d set that value to be several times higher than what I normally see on a node (in terms of real time failures when everything is working fine).

I don’t know if such two-smack-one (or ten-smack-two, etc.) attacks would be feasible even without a utility that stops cMix - that may be something to consider. You don’t need to make the attacked node hit a utility threshold - if the operator stops the node or loses nominators, that may be good enough.

It may be possible to detect such attacks with statistics and deal with them through governance.

From the proposal (blog post):

how long it takes for nodes to recover from failed rounds, nor how different nodes' hardware and internet configurations may impact round times.

That is a problem. Currently, cMix restarts in such cases, which is already a punishment on its own, as at least one or two rounds are skipped due to the downtime. Sometimes when cMix recovers, it gets hit with another precomp failure "of which it has no knowledge" (as it says in cmix.log) for a round scheduled while it was down. While those are merely precomp failures, they still represent a micro-fine of 10 or 20 (missed) points added to the realtime fine. If stricter fines for realtime failures are implemented, it'd be nice if cMix could handle these cases more gracefully.

I looked at some of those nodes with higher realtime failures and, surprisingly, one has a realtime failure rate 5x higher than its precomp failure rate (usually the realtime ratio is much lower). As a precomp failure is meant to fail a round early precisely to avoid realtime failures, how is that even possible? Computationally, isn't the workload the same (precomp vs realtime)?

As a result, the team will be publishing a list of currently poorly performing node operators which xxDK users can optionally opt into not sending messages on rounds that contain those nodes.

This sounds good. A checkbox setting ([ ] avoid using the 10% lowest-performing nodes) would help. There probably shouldn't be a way to say "use these 10 gateways", because that's not a good idea privacy-wise. Also, there should probably be just one list, and some client-side randomization (avoid the bottom 11% or 13% or 9.5% rather than precisely 10%) would be helpful, so that a variant of that two-smack-one game cannot be played against the user by selectively dropping messages to more and more nodes until he or she ultimately ends up relying on the 5 or 10 nodes the attacker can gain control over.
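As a sketch of the client-side randomization I mean (the list format and the 9-13% band here are made up for illustration, not anything from the xxDK):

```python
# Each client jitters its own exclusion fraction around ~10% so an attacker
# cannot predict exactly which nodes a given user will avoid.
import random

def nodes_to_avoid(perf_list, low=0.09, high=0.13):
    """perf_list: [(node_id, realtime_failure_rate), ...] from the published list.
    Returns the set of node IDs this client will avoid."""
    fraction = random.uniform(low, high)
    ranked = sorted(perf_list, key=lambda x: x[1], reverse=True)  # worst first
    cutoff = int(len(ranked) * fraction)
    return {node_id for node_id, _ in ranked[:cutoff]}
```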

I wonder if node operators may potentially be able to save on bandwidth costs by preferring other cMix nodes vs. messaging clients.

Penalising in terms of points is fine, but if the intentions are malicious, what would stop them from simply not running the rounds?

Is there a possibility to deter them from running the node continuously? Slashing?

If it is someone who is known in the community and runs into realtime issues, is there a way to separate them out before slashing applies to them?

I think the fine is higher than I expected, and it will also increase migration to the cloud, i.e. centralization and security risks. I think the Project knows how many validators (likely) physically control their nodes.

I think we should strive for 70% residential, which is probably unachievable. I don't remember how that breakdown looked in the CSV stats that used to be shared, but I wouldn't be surprised if more than 50% of validators physically controlled their nodes at this point.

If this gets even lower (30-40%), it will be great for reliability, but even worse for security. So I would suggest starting with lower penalties and adjusting upward if necessary.

We don't want to force residential operators to move to the cloud and then, once we realize we're 90% cloud, discuss whether we should lower the penalty.

Can XX Messenger receive "not delivered" notifications or otherwise determine whether a message delivered to the cMix service ended up in a failed round, without impacting its own security?

1 Like

That would be great, except for the fact that it doesn’t seem to be currently known what causes RTF to be higher than average. For example, in my case, I have excellent hardware, symmetric gigabit fibre, and my RTF is around 0.7%. I have no idea what I can do to improve this.

1 Like

Can XX Messenger receive "not delivered" notifications or otherwise determine whether a message delivered to the cMix service ended up in a failed round, without impacting its own security?

Yes, it already does this. But excessive failures will still impact usability. Users will not be happy if they have to constantly resubmit messages.

1 Like

Okay, that’s great. I guess it’s a choice - is cutting real time failure rate from 1 in 150 to 1 in 250 worth an additional 20% of nodes moving to the cloud?

Of course, the cloud-hosted figure is a random guesstimate, and even losing 10 validator nodes out of 200+ is not going to cause a big impact, but all I’m saying is we can always hike the penalty while minding that balance.

Moving a node isn't easy, so suddenly increasing the penalty from 30 to 190 points at once is going to make some validators overreact and move to the cloud. I propose increasing it from 30 to 60 or 90 and reviewing the effect four weeks after that. I am not against the proposed fine (eventually), but let's give people time to make improvements without panicking and moving to the cloud.

1 Like

This is a valid point. The primary goal is to target very bad offenders, not people slightly above average. This would be a good argument for targeting 15%~20%. That would be a deduction of 40~60 points per realtime failure.

1 Like

It seems fair

1 Like

Not sure I agree with this. Moving a node to the cloud is not easy: there are not that many GPU VPSs available, and they are extremely expensive, which is why most people host physical nodes at home or in DCs. But most of all, under-spec or poorly connected nodes tend to produce high precomp failure rates, not high realtime failure rates. The latter are most of the time the result of gateway issues or node-gateway comm issues, and most gateways are already in the cloud. If anything, someone with a high realtime failure rate would think about moving the gateway either to a closer VPS provider/location (if available) or to a physical machine in a closer residential location.

That was my initial impression. It all depends on how far we want to go with incentivizing the reduction of realtime failures network-wide. I agree that 190 points would mean non-negligible reward differences for not-so-big realtime performance differences, and that trying to close a small realtime performance gap may not be easy and may result in operators shifting their gateways around and flooding Discord with help requests. It would then be necessary to provide clear guidance on how to reduce realtime failure rates that are not very high to start with.

In all, it might be wiser to start with a smaller penalty and see how it works.

1 Like

5% is a reasonable threshold given the current ~0.5% average realtime failure rate. I am not against moving more carefully and targeting 10%-15% initially. In any case, the "negative points" approach seems a good, viable solution without side effects, e.g. compared with chilling a node with a high RT failure rate. Even in the example of a coordinated attack on a node, it seems to me it would not be economically interesting to perform, not scalable, and only temporary. Regional multipliers should still work well to offset regional network performance.

It would be good to know what possible alternative solutions the team has considered and discarded.

About the minimum performance requirement for the team multiplier: it makes a lot of sense. However, I would disagree if the penalty were irreversible. For example, a temporarily high RT failure rate should only trigger loss of the team multiplier until it is fixed.

2 Likes

I support any initiative to chill the worst performing nodes but I don’t quite understand the combination of the proposed measures.

Users/developers will take advantage of one or all lists to exclude lower performing nodes or include high performers. If the pool of nodes is large enough, I cannot find a reason why you would not want to make use of it.
So the secondary solution effectively quarantines (permanently?) the nodes with the lowest performance. It’s a measure with an instant effect.

A result of the 190-point penalty is that it increases the differences in earnings based on an RTF percentage that is difficult to influence positively. Don't ask me why my RTF is 0.15 whilst another node's is 0.65. Whether that is a drawback or not is probably a matter of perspective, but it drifts away from the idea that all active nodes should be rewarded relatively equally.

From the blog post:

There is also a secondary stop-gap solution the team is working on. The ultimate flaw with economic solutions is that they can take time, which is not great when messages are actively being dropped by the xx messenger and other users of the xxDk. As a result, the team will be publishing a list of currently poorly performing node operators which xxDK users can optionally opt into not sending messages on rounds that contain those nodes. It will also be possible for xxDK users to opt into other, separately curated lists.

I think an adaptive non-economic solution is the best way to filter out low performance and to do so quickly. I have the following suggestions:

  1. Have a set of quarantined nodes that have to be opted into (opt-in is not a typo) by users; otherwise users will only experience non-quarantined (well-performing) nodes.
  2. Let the criteria for being quarantined be adjustable/votable based on rank (e.g. bottom 10), relative rank (bottom 1%), real-time timeouts (percentage), precomp timeouts (percentage), real-time timeouts (number of consecutive including last), precomp timeouts (number of consecutive, including last), or combinations of these, and also have a maximum number of nodes that can be quarantined.
  3. Have a test user/app of some sort that has opted in to the quarantined nodes, so that their performance can be measured continuously without impacting other users.
  4. Let the criteria for being un-quarantined also be adjustable/votable.

The idea behind this is that a default user only ever sees the well-performing nodes, so from the user's point of view nothing needs to be done to get good performance when using the xx network. An idea is also to make use of the quarantined nodes free (no postage), which creates an economic incentive for users to use the quarantined nodes so that their performance can be measured and compared against the criteria for being un-quarantined.

From the node operator's viewpoint, the idea is that there is less penalty, but also less reward, for any unintentional drop in performance.

From the overall network's point of view, the criteria must be balanced while still being responsive. If the criteria are too stringent, then during e.g. some kind of global/local networking issue, too many nodes may become quarantined; hence the idea of including a maximum percentage (e.g. 1%) or number of nodes that can be quarantined. E.g. if 5 consecutive realtime timeouts is one of the quarantine criteria, this ensures that if a global/local networking issue causes 10% of nodes to have 5 consecutive realtime timeouts, only 1% will be quarantined. All of these criteria should be settable, adjustable, votable.
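To illustrate the combination of criteria with a hard cap, here is a rough sketch; all thresholds and the NodeStats fields are placeholders that governance would actually define.

```python
# Sketch of quarantine selection with a hard cap, as described above.
# Thresholds and fields are placeholders, not proposed values.
from dataclasses import dataclass

@dataclass
class NodeStats:
    node_id: str
    realtime_timeout_pct: float        # e.g. 0.06 means 6%
    consecutive_realtime_timeouts: int

MAX_QUARANTINE_FRACTION = 0.01  # never quarantine more than 1% of nodes

def meets_quarantine_criteria(n: NodeStats) -> bool:
    return (n.realtime_timeout_pct >= 0.05
            or n.consecutive_realtime_timeouts >= 5)

def select_quarantined(nodes: list[NodeStats]) -> list[str]:
    candidates = [n for n in nodes if meets_quarantine_criteria(n)]
    # Worst offenders first, so that if a global network issue makes many
    # nodes eligible, only the worst slice up to the cap gets quarantined.
    candidates.sort(key=lambda n: (n.consecutive_realtime_timeouts,
                                   n.realtime_timeout_pct), reverse=True)
    cap = max(1, int(len(nodes) * MAX_QUARANTINE_FRACTION))
    return [n.node_id for n in candidates[:cap]]
```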

Why would cutting the realtime failure rate push nodes to move to the cloud?
There is a requirement, and if that requirement is met, everything is fine.

If a node is under-spec or poorly connected, that validator should simply be slashed. We have requirements, and they should be respected, because we "sign" a contract when we run a node. If you can't run a node due to your hardware/provider and can't find a good DC, you nominate instead.

The idea is really good; the penalty just needs to be adapted step by step, as it may be a little too high for a first try. This can really help build a better network and automatically exclude people who run on slower networks or without a GPU.