String comparison speed

Quoting from unit 5 lecture 10:
"We can do string comparisons very quickly, because strings are immutable. That means that we don't need to look at all the characters in the string to compare two strings. Double equal for strings can be done in such a way that it doesn't need to look at the whole string."

Then I'm asking: what is this way? I don't understand how Python can do comparison between 2 immutable objects in such a way that it doesn't depend on the object size. And why the immutability is related in this?

To me, if 2 objects are distinct from each other, then you have to compare each part of the object (that means characters for strings) to tell if these objects are equal.

asked 27 Mar '12, 17:31

Benjamin%20M's gravatar image

Benjamin M
3326
accept rate: 50%


4 Answers:

If the objects are immutable you can have the same object used over and over again. You assign that object an ID and then on comparison see if the two objects have the same ID.

Strings are immutable so if you use 'HI' in one spot in your code and then use 'HI' in a different spot. Both of those variables point to the same 'HI' string. That 'HI' string has an ID associated with it so a comparison becomes numeric instead of textual (each character doesn't have to be checked).

Total guess on my part but makes sense. I'll get destroyed by a true python zealot if I'm wrong :)

link

answered 27 Mar '12, 17:34

David%20Smith's gravatar image

David Smith
1.2k61236

edited 27 Mar '12, 17:37

So that means that when I do:
a = "ab"
b = "ab"
The "ab" string is only present once in memory, with one unique ID?

(27 Mar '12, 17:37) Benjamin M Benjamin%20M's gravatar image

Of course, your system must be careful to never generate a separate instance of the same string contents with a different ID (or pointer). Languages that don't have this property require actual comparison of the string contents at run time.

(27 Mar '12, 17:38) Kenneth I. L... Kenneth%20I.%20Laws-1's gravatar image

That's what I'm postulating yes. Let this question marinate a bit on the board. I'm either going to be laughed out of the room or folks will start to concur with what I've said. I haven't looked at the source implementation of Python or read the spec to confirm this (Lazy I know) so it would be good to wait to see what others have to say.

(27 Mar '12, 17:38) David Smith David%20Smith's gravatar image

That makes sense, thank you a lot.

(27 Mar '12, 17:43) Benjamin M Benjamin%20M's gravatar image

No problem. Glad I could help.

(27 Mar '12, 17:43) David Smith David%20Smith's gravatar image

Just as a follow-up I bet managing that String Pool presents some really interesting problems. I have more Java resources than I do Python so I might have to look into that management from the Java point of view.

(27 Mar '12, 17:46) David Smith David%20Smith's gravatar image

olol, it was written on wikipedia too:

The practice of always using references in place of copies of equal objects is known as interning. If interning is used, two objects are considered equal if and only if their references, typically represented as integers, are equal. Some languages do this automatically: for example, Python automatically interns short strings. If the algorithm which implements interning is guaranteed to do so in every case that it is possible, then comparing objects for equality is reduced to comparing their pointers, a substantial gain in speed in most applications. (Even if the algorithm is not guaranteed to be comprehensive, there still exists the possibility of a fast path case improvement when the objects are equal and use the same reference.) Interning is generally only useful for immutable objects.
--Wikipedia, 'Background' section

I should have read it before asking...

link

answered 27 Mar '12, 17:49

Benjamin%20M's gravatar image

Benjamin M
3326

I should have read the python spec before answering so we're even :)

(27 Mar '12, 17:52) David Smith David%20Smith's gravatar image
1

So it looks like creating a short string involves the Python environment checking if the string already exists, if so, it reuses the same string (e.g., its object ID).

(27 Mar '12, 18:00) Charles Lin Charles%20Lin's gravatar image

While the current answers are correct, I strongly doubt this is always the case: if you wanted to apply this optimisation to all strings and only ever use IDs, you'd have to check whether each string exists if you ever generate it. I suppose it's possible by keeping a global hashmap of strings.

link

answered 27 Mar '12, 18:03

Anton%20Golov's gravatar image

Anton Golov ♦
13.3k2174174

@poejav - This is one of the folks I was hoping would get in on this question.

@jesyspa - I agree that scaling this for all string literals is psychotic.

(27 Mar '12, 18:06) David Smith David%20Smith's gravatar image

Reading this should help. In particular, although string literals are automatically interned, everything else that comes from outside (if you, say, read a string from a file or from the network), won't be automatically interned unless you really ask for it.

link
This answer is marked "community wiki".

answered 19 Apr '12, 09:56

Konstantin%20Tretjakov's gravatar image

Konstantin T...
233

Your answer
Question text:

Markdown Basics

  • *italic* or _italic_
  • **bold** or __bold__
  • link:[text](http://url.com/ "Title")
  • image?![alt text](/path/img.jpg "Title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported

Tags:

×15,312
×70
×10
×2

Asked: 27 Mar '12, 17:31

Seen: 341 times

Last updated: 19 Apr '12, 09:56