We can defeat the small integer intern like this (computing the value lets us avoid the caching layer):
>>> n = 674039
>>> one1 = 1
>>> one2 = (n ** 9 + 1) % (n ** 9)
>>> one1 == one2
True
>>> one1 is one2
False
How can you defeat the small string intern, i.e. get the following result:
>>> one1 = "1"
>>> one2 = <???>
>>> type(one2) is str and one1 == one2
True
>>> one1 is one2
False
The sys.intern documentation mentions that "Interned strings are not immortal", but there's no context about how a string could be kicked out of the intern, or how you can create a str instance that avoids the caching layer.
Since interning is a CPython implementation detail, answers relying on undocumented implementation details are ok/expected.
A unicode string consisting of only one character (with a value smaller than 128 or, more precisely, from latin1) is the most complicated case, because those strings aren't really interned. Instead (similar to the integer pool, and identical to the behavior for bytes) they are created at the start and stored in an array for as long as the interpreter is alive.
So every time a length-1 unicode is created, the character value is looked up to see whether it is in the latin1 array, e.g. in unicode_decode_utf8.
One could even argue that if there is a way to circumvent this in the interpreter, we are talking about a (performance) bug.
One possibility is to populate the unicode data ourselves using the C-API. I use Cython for the proof of concept, but ctypes could also be used to the same effect.
Noteworthy details:
- PyUnicode_New does not look up in latin1, because the characters aren't set yet.
- 127 is passed as maxchar to PyUnicode_New. As a result, we can interpret the data via PyUnicode_1BYTE_DATA, which makes it easy to manipulate manually without much ado.
With this, the result is as wanted: the new string compares equal to "1" but is not the same object.
Here is a similar idea, but implemented with ctypes.
Noteworthy details:
- PyUnicode_1BYTE_DATA cannot be used with ctypes, because it is a macro. An alternative would be to calculate the offset to the data member and access this memory directly (but that depends on the platform and doesn't feel very portable).
- Instead, PyUnicode_CopyCharacters is used (there are probably other ways to achieve the same), which is more abstract and portable than directly calculating/accessing the memory.
- Actually, _PyUnicode_FastCopyCharacters is used, because PyUnicode_CopyCharacters would check whether the target unicode has multiple references and throw. _PyUnicode_FastCopyCharacters doesn't perform those checks and does as asked.
And now the result is again as wanted.
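The ctypes route can be sketched roughly as follows. This is not the code from the answer but a variant under stated assumptions: instead of the private _PyUnicode_FastCopyCharacters it uses the public PyUnicode_WriteChar, and it keeps the new object as a raw pointer until the character is written, so the reference-count check passes. The helper name `fresh_char` is hypothetical, and the whole thing relies on CPython internals:

```python
import ctypes
from ctypes import pythonapi, py_object, c_void_p, c_ssize_t, c_uint32, c_int

# The new string is handled as a raw pointer (c_void_p) so that no extra
# Python-level reference exists while we write into it.
pythonapi.PyUnicode_New.argtypes = (c_ssize_t, c_uint32)
pythonapi.PyUnicode_New.restype = c_void_p
pythonapi.PyUnicode_WriteChar.argtypes = (c_void_p, c_ssize_t, c_uint32)
pythonapi.PyUnicode_WriteChar.restype = c_int
pythonapi.Py_DecRef.argtypes = (py_object,)
pythonapi.Py_DecRef.restype = None


def fresh_char(ch):
    """Return a new length-1 str equal to `ch` (ASCII only, assumption of this sketch)."""
    if len(ch) != 1 or ord(ch) > 127:
        raise ValueError("expected a single ASCII character")
    # New 1-character string with maxchar 127; its content is not set yet,
    # so no lookup in the latin1 array can happen here.
    ptr = pythonapi.PyUnicode_New(1, 127)
    # The only reference is the one owned by PyUnicode_New, so writing is allowed.
    pythonapi.PyUnicode_WriteChar(ptr, 0, ord(ch))
    # Reading .value creates an additional reference for the Python side ...
    obj = ctypes.cast(ptr, py_object).value
    # ... so the reference returned by PyUnicode_New is released again.
    pythonapi.Py_DecRef(obj)
    return obj


one1 = "1"
one2 = fresh_char("1")
print(type(one2) is str and one1 == one2)
print(one1 is one2)
```

If the assumptions hold, the first print should show that the strings are equal and the second that they are distinct objects. Keeping the pointer raw is the design choice that matters here: once the object is wrapped in py_object it has more than one reference, and PyUnicode_WriteChar would refuse to modify it.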
For strings longer than 1 character, it is a lot easier to avoid interning, e.g.:
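A minimal sketch of the multi-character case, assuming nothing beyond the standard library: a string assembled at run time is a new object and is not interned automatically. The variable names are made up for the illustration.

```python
import sys

a = "hello world!"
b = "".join(["hello", " world!"])  # assembled at run time, so a new object

print(a == b)                          # equal content
print(a is b)                          # but distinct objects
print(sys.intern(a) is sys.intern(b))  # explicit interning maps both to one object
```

Only after an explicit sys.intern call do the two names resolve to the same object.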