On Oct 30, 2004, at 1:05 PM, Theodore H.Smith wrote:
Does anyone know much about the field of string similarity?
I have a distaste for the academia, because:
1) If you look at the pioneers of modern computing, few came from the
academia.
2) Most of what they write is over-complexified, yet impractical.
"O(N)" notations and really bad source code writing skills.
The "O(N)" notation is a simple, expressive way to describe the
asymptotic growth rate of a function. I'm happy to explain it, if you
like.
No need to explain it, not to me as its simple enough to grasp. To me
its like bad C. C is a simple language, correct? But C can be used to
write awful things that no human should understand (obfuscated C). To
me, O(N) is just like that. Simple, but not nice looking. A step
towards obfuscation.
O(N) notation does sense and is useful, its just unnecessarily
mathematical... Why not just express it like "linear cost", or
"quadratic cost". I suppose when you have functions whose cost really
IS an equation, like n*log(n) then it makes sense, but I'd really only
use that as a last resort, if there wasn't a better way of expressing
the costs, without resorting to obfuscation.
Also, it doesn't mention the length of time that O takes. O(N) for
example, may be slower than O(N^2) for low values of N, if O in the
first case is high, but O in the second case is low.
Say what?!?
While it is true that - if O() were a pure mathematical function -
O(N) > O(N^2) | 0 < N < 1, but N can only take on integral values; how
can you have 1/2 of a data point?
There is another factor that many programmers "conveniently" forget;
and that's that Big-O notation simplifies certain aspects of the
problem reducing the "evaluation" down to the worst-case scenario
(usually w/respect to large data sets "N"), so you can have cases where
an insertion or bubble-sort does better than a binary tree search, or
quicksort, for small values of N.
This is why these are just tools, not laws. They can help you
evaluate your code (algorithms), but they can't say they're the best
for all situations!
As an example. Say you have a somewhat small data set (N < 10000).
These need to be read into an array for use as a lookup table. For
speed, the array needs to be kept in sorted order, but insertions and
deletions are comparatively rare events.
Some programmers will go get a book on algorithms and see that
insertion sorts have a worst case performance of O(N^2), and a binary
tree (or quicksort*) has a performance of O(log2(N)) and conclude that
you should always use a binary tree or quicksort for such cases.
But what they don't read is the fine print; "...the performance of a
binary tree degrades to O(N) when the input data is already in sorted
order (or is nearly sorted), leading to a highly unbalanced tree
structure."
An insertion sort, on the other hand, performs much better when the
input data is already sorted order (it only has to append the data -
during the initial insertion - at the end of the array, without the
need to move all the array elements down one slot)! And a simple tweak
to the algorithm - using that knowledge - will help you gain O(log2(N))
for searches and O(1) performance for the initial insertions (when
you're reading the sorted items from your data file)! You'll still have
O(N) performance for "normal" insertions, but - as stated in the
problem - insertions and deletions are relatively rare.
And why do I mention the bit about the list being in sorted order?
Consider the above mentioned binary tree. It works nicely for unsorted
data (the 1st time you construct the tree from the unsorted file). But
what do you think happens the first time someone goes and inserts (or
deletes) an item from the list? That's right; the code marks the data
set 'dirty', which causes (or should cause, anyway) the data set to be
saved back to the file... in sorted order. Oops!
• IIRC, quicksort has a performance of O(N x log2(N)), a bit worse then
O(N), but not as bad as O(N^2)!
Of course when N gets high the speed of the operation becomes
irrelevant, but for certain ranges the speed of the operation does
make a difference. For example, what if in the first case, O is 1000x
slower? And what if the values of N, only can range from 10-30? In
this case, the second function is better.
Things like that, make me wonder why use such notations in the first
place. The real world can't get summarised into mathematics so neatly,
much of the time. I'd only use such notation as a last resort, not in
preference to more natural notation.
--
Theodore H. Smith - Software Developer.
http://www.elfdata.com
_______________________________________________
Unsubscribe or switch delivery mode:
<http://www.realsoftware.com/support/listmanager/>
Search the archives of this list here:
<http://www.realsoftware.com/listarchives/lists.html>
William H Squires Jr
wsquires at satx dot rr dot com dot nospam <- remove the .nospam
_______________________________________________
Unsubscribe or switch delivery mode:
<http://www.realsoftware.com/support/listmanager/>
Search the archives of this list here:
<http://www.realsoftware.com/listarchives/lists.html>
|