(from Feb 2021)
Some of the JDK's unmodifiable collections, such as those from `Set.of()` and `Map.of()`, also randomize iteration order. At the time they were added, Go and Python also randomized iteration order. However, more recently, Python moved away from randomization. Early in the Python 3.x releases, dict iteration order was randomized. In 3.6, iteration order was insertion order, but only as an implementation detail. In 3.7, this was elevated to a guarantee.
https://docs.python.org/3/library/stdtypes.html#dict
(search for "dictionary order")
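The Python side of that guarantee is easy to demonstrate; since 3.7 this behavior is part of the language spec, not an implementation detail:

```python
# Since Python 3.7, regular dicts are guaranteed to iterate in insertion order.
d = {}
d["banana"] = 1
d["apple"] = 2
d["cherry"] = 3
print(list(d))  # ['banana', 'apple', 'cherry']: insertion order, not sorted order
```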
There are occasional complaints about the randomized order of Java's unmodifiable collections, as it can cause intermittent test failures or even failures in production. However, the advantage of the randomized order is that we've been able to reorganize the internal implementations of these data structures -- twice -- without worrying about compatibility problems caused by changing the ordering. Historically this has been a problem with other structures such as `HashMap`. Even though `HashMap`'s order isn't guaranteed, it's predictable and stable in practice, and changing its ordering has definitely flushed out bugs.
IMO the real problem is that sometimes you really do want a deterministic order (not caring which one), but the container doesn't offer any API that provides it.
And since hash-first languages provide a broken API, there's no way to provide such an order for arbitrary types. Compare-first languages (like C++) generally do provide it, but paying the price of tree-based structures (assuming a B-tree can't negate it) isn't actually necessary.
Rather than either of those choices, I argue that languages really should be "key-first" - rather than having classes implement hashing or comparison themselves, they should return a tuple of the data that will be fed into said hash or comparison. Besides being a nicer API, this also prevents many common bugs.
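A minimal sketch of that "key-first" idea in Python (the `Point` class and `_key` method are hypothetical illustrations, not an existing API): the class exposes one method returning the tuple of data, and hashing, equality, and comparison are all derived from it, so they can never disagree with each other.

```python
# "Key-first" sketch: one _key() method is the single source of truth, and
# __hash__, __eq__, and __lt__ are all derived from it. Hand-writing these
# three separately (and keeping them consistent) is a common source of bugs.
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def _key(self):
        # The tuple of data fed into hashing and comparison.
        return (self.x, self.y)

    def __eq__(self, other):
        return isinstance(other, Point) and self._key() == other._key()

    def __hash__(self):
        return hash(self._key())

    def __lt__(self, other):
        return self._key() < other._key()
```

In a language designed this way, only `_key` would be user-written; the derived methods would come for free.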
https://www.youtube.com/watch?v=ncHmEUmJZf4
This is a fun presentation by Matthew Kulukundis (designer of Google's hash table) with Hyrum Wright (of Hyrum's Law) offering objections from the crowd about its design and rollout at Google.
What's the advantage of specifying it as random over specifying it as sequential in some way? Aren't both just specifying a behavior, where one behavior is potentially more useful than the other? I guess I understand the principled point that a set of hash keys is not an array. But it seems like more complexity than is necessary, and it could even fall victim to Hyrum's law besides: you can imagine someone using random hash ordering to implicitly randomize something without needing to call an RNG explicitly.
It seems like the most principled approach might be a re-specification of the data structure: why is "iteration" even possible over a hash table? Shouldn't the keys be specified as a "set" on which only set-appropriate operations like map can be performed?
One advantage to randomization, and why various languages did it in the first place, is that it prevents miscreants from DoSing your service if they know its implementation.
Say you're operating a service and you share its source on GitHub such that anyone can see it. Your language doesn't randomize hash values. Hashes are O(1), right? Well, not if a clever attacker can send you a set of values where they know each one will hash into the same bucket. Then you end up with, like, 9 empty buckets and 1 with 100,000 items in it. Oops!
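The degradation is easy to reproduce with a toy class whose hash is an attacker-chosen constant (a sketch of the collision effect, not a real exploit against salted string hashing):

```python
# Hash-flooding illustration: every key reports the same hash, so all
# entries collide and each insert/lookup must scan the whole collision
# chain, degrading the table from O(1) to O(n) per operation.
class EvilKey:
    def __init__(self, n):
        self.n = n

    def __hash__(self):
        return 42  # attacker-chosen constant: everything collides

    def __eq__(self, other):
        return isinstance(other, EvilKey) and self.n == other.n

table = {}
for i in range(1000):
    table[EvilKey(i)] = i  # each insert walks all previous colliding keys

print(len(table))  # 1000 distinct keys, all sharing one probe sequence
```

With real keys this placement is exactly what a per-process random salt makes unpredictable.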
Old Python, pre-randomization, had a dict.items() method that would yield all the items in order, conceptually kind of like:

    for bucket in self._buckets:
        for item in bucket:
            yield item

The order of those buckets would be repeatable between runs, so bucket foo would always come first, then bucket bar. Then Python added hash randomization, so that the resulting hash was something like hash(salt+key) instead of just hash(key). Now there's no way to tell in advance which bucket an item will get filed into, and the buckets end up more or less balanced in size.

Newer Pythons (since 3.6? 3.7?) do something altogether different, and I can't explain exactly how their ordered dicts work, except to say I sat through a presentation on them and thought it was freaking genius, even though I couldn't re-implement it myself without sitting down with their docs.
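The salting is observable from outside the interpreter. CPython exposes it through the `PYTHONHASHSEED` environment variable: with a fixed seed, string hashes are reproducible across processes; with the default random salt, they generally differ from run to run.

```python
# Demonstrates CPython's string-hash randomization via PYTHONHASHSEED.
# With a fixed seed, hash("hello") is identical across processes; with
# the default per-process random salt, it almost certainly differs.
import os
import subprocess
import sys

CODE = 'print(hash("hello"))'

def hash_in_subprocess(seed):
    env = dict(os.environ, PYTHONHASHSEED=seed)
    return subprocess.check_output([sys.executable, "-c", CODE], env=env).strip()

fixed_a, fixed_b = hash_in_subprocess("0"), hash_in_subprocess("0")
print(fixed_a == fixed_b)  # True: fixed seed is reproducible

rand_a, rand_b = hash_in_subprocess("random"), hash_in_subprocess("random")
print(rand_a == rand_b)    # almost certainly False with random salts
```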
> and I can't explain exactly how their ordered dicts work
Traditionally you handle that with a doubly linked list threaded through the entries (each entry keeps two additional references, to the previous and next entry), as in Java's LinkedHashMap: https://docs.oracle.com/javase//8/docs/api/java/util/LinkedH...
https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/...
Which is also what Python seems to be doing: https://stackoverflow.com/a/34496644
It's fairly intuitive.
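A minimal sketch of that doubly-linked approach (a hypothetical `LinkedDict`, not the actual LinkedHashMap or OrderedDict code): a plain dict gives O(1) lookup, and prev/next links thread the entries in insertion order.

```python
# Doubly-linked-list ordered map sketch: a dict maps each key to a node,
# and the nodes are chained head-to-tail in insertion order, so both
# deletion and ordered iteration are O(1) per element.
class LinkedDict:
    class _Node:
        __slots__ = ("key", "value", "prev", "next")

        def __init__(self, key, value):
            self.key, self.value = key, value
            self.prev = self.next = None

    def __init__(self):
        self._map = {}                  # key -> node, for O(1) lookup
        self._head = self._tail = None  # ends of the insertion-order chain

    def __setitem__(self, key, value):
        node = self._map.get(key)
        if node is not None:
            node.value = value          # overwrite keeps original position
            return
        node = self._Node(key, value)
        self._map[key] = node
        if self._tail is None:
            self._head = self._tail = node
        else:
            node.prev = self._tail
            self._tail.next = node
            self._tail = node

    def __delitem__(self, key):
        node = self._map.pop(key)       # unlink in O(1)
        if node.prev:
            node.prev.next = node.next
        else:
            self._head = node.next
        if node.next:
            node.next.prev = node.prev
        else:
            self._tail = node.prev

    def __iter__(self):
        node = self._head
        while node:
            yield node.key
            node = node.next
```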
Do their new default (now also ordered?) dicts do this differently?
Note that OrderedDict is an implementation in Python. CPython's dict has a different implementation. There's more about it at https://docs.python.org/3.6/whatsnew/3.6.html#new-dict-imple... and https://mail.python.org/pipermail/python-dev/2012-December/1... .
This implementation was used from 3.6, right?
It's interesting that the idea mail mentions that nothing changes about the implementation (including order) except the memory layout. That would imply insertion order was already preserved in older versions (not the case, IIRC), or that the idea underwent a few more changes that did in fact affect order.
EDIT: I couldn't quite find an answer but https://softwaremaniacs.org/blog/2020/02/05/dicts-ordered/ mentions the behavior happens since then because the implementation only tracks indices in the hash table itself and relies on maintaining entries separately in a second array that gets expanded in insertion order.
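The two-array scheme described there can be sketched with a toy `CompactDict` (hypothetical, not CPython's actual code): the sparse hash table holds only small indices into a dense entries list that grows in insertion order, so iteration is just a scan of the dense list.

```python
# "Compact dict" sketch: self.indices is the sparse hash table (slot ->
# entry index), self.entries is a dense list in insertion order. Fixed
# capacity and no deletion, for brevity; a real implementation resizes,
# and deletion leaves a placeholder in the dense list.
class CompactDict:
    def __init__(self, size=8):
        self.indices = [None] * size  # sparse: probe slots hold indices only
        self.entries = []             # dense: (hash, key, value), insertion order

    def _slot(self, key):
        i = hash(key) % len(self.indices)
        while True:                   # linear probing
            idx = self.indices[i]
            if idx is None or self.entries[idx][1] == key:
                return i
            i = (i + 1) % len(self.indices)

    def __setitem__(self, key, value):
        i = self._slot(key)
        if self.indices[i] is None:   # new key: append to the dense array
            self.indices[i] = len(self.entries)
            self.entries.append((hash(key), key, value))
        else:                         # overwrite in place, position unchanged
            self.entries[self.indices[i]] = (hash(key), key, value)

    def __getitem__(self, key):
        idx = self.indices[self._slot(key)]
        if idx is None:
            raise KeyError(key)
        return self.entries[idx][2]

    def __iter__(self):               # iteration order == insertion order
        return (e[1] for e in self.entries if e is not None)
```

This also shows why the layout is memory-friendly: the sparse table only wastes small index slots, not full three-word entries.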
This also seems straightforward, but it raises a few questions, such as how deletion is implemented efficiently.
In addition to the DoS aspect mentioned in a sibling comment, the primary reason you would do this is to avoid constraining the implementation. If you want to change the design to improve performance, for example, being forced to match the implicit ordering of the old implementation may be very difficult.
It certainly may be useful to define a specific ordering. Maps ordered by insertion time and maps ordered by key order are fairly common. But maps with these constraints probably won't be as fast as those with arbitrary ordering, so there is a trade-off decision to be made. A map constrained to the arbitrary ordering of the initial implementation is the worst of both worlds: difficult to make faster despite not having a particularly useful ordering.
As a concrete example, I am currently in the middle of rewriting Go's map implementation to use a more efficient design (https://go.dev/issue/54766). If Go hadn't randomized map iteration order this work would likely be impossible. It is unlikely we could completely change the design while keeping iteration order identical to the old implementation (assuming we want the new version to be faster at least).
I think removing methods from HashMap was not in scope for their work - they were working on the JDK, and presumably didn't have the bandwidth to patch every single Java library used at Google to use set-based methods instead of iteration-based ones.
Also, I think in practice you'll find you want to extract items from a set in some order, so you need some kind of transformation from set to list; e.g., you want to print them out.
Edit: Forgot to address your main question. If the specification requires a specific ordering, the implementation is forever bound to do that, even if other implementations would be more efficient/secure/other-desirable-properties. By introducing randomness, you reduce the risk of people accidentally relying on unspecified behavior, and are more able to change your implementation later.
One solution could be adding a `sort` argument (which takes a function that compares two `<T>` items and returns `true` or `false` depending on their order) to all functions on unordered collections of `<T>` items in which an order must be chosen, or requiring that the items in such collections implement a `TotalOrder` interface or something similar. This isn't very ergonomic in languages that don't have an equivalent of Traits or typeclasses though. In languages which permit side effects, this would include any functions that iterate over the items in an unordered collection.
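A sketch of that suggestion, using Python's idiom of a key function rather than a two-argument true/false comparator (the `to_list` helper is hypothetical, not a standard library function): any operation on an unordered collection that must pick an order takes the ordering explicitly, so the result is deterministic by construction.

```python
# Every order-producing operation on an unordered collection requires an
# explicit ordering from the caller, instead of exposing arbitrary
# iteration order. `to_list` is a hypothetical illustration.
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")

def to_list(items: Iterable[T], sort: Callable[[T], object]) -> List[T]:
    """Deterministically order an unordered collection via a caller-supplied key."""
    return sorted(items, key=sort)

s = {"pear", "apple", "banana"}
print(to_list(s, sort=str))  # ['apple', 'banana', 'pear'], regardless of set order
```

A `TotalOrder` interface would achieve the same thing implicitly, by making the key part of the element type rather than a per-call argument.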
Alternatively, maintain backward compatibility as much as possible. Function overloading might be too clever. Extra verbosity in exchange for clarity might be a worthwhile tradeoff. WET is acceptable in these scenarios.
Relevant XKCD:
https://xkcd.com/1172/
Interesting note: the JDK deliberately randomizes iteration order of certain immutable sets and maps (https://github.com/openjdk/jdk/blob/jdk-23%2B37/src/java.bas...).
Hash iteration order is a great example of Hyrum’s Law
An even better example (requires fewer words): https://xkcd.com/1172/