Text Search: A Comparison of Boyer-Moore vs. Hashing. This experiment runs two text-search techniques on the same input and compares their output.
The first approach is the Boyer-Moore algorithm using the looking-glass heuristic. It behaves well: the technique used in the algorithm has no side effects, i.e. it reports no spurious matches.
Input: A text file of 11,000 lines. The objective is to find the occurrences of a pattern.
Boyer-Moore fares at 49.2 milliseconds.
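For reference, here is a minimal Python sketch of Boyer-Moore with the looking-glass (bad-character) heuristic; it returns only the first occurrence, and the names are illustrative rather than taken from the experiment's code:

    def boyer_moore(text, pattern):
        """Return the index of the first occurrence of pattern in text, or -1."""
        n, m = len(text), len(pattern)
        if m == 0:
            return 0
        # last[c] = rightmost position of character c in the pattern
        last = {c: i for i, c in enumerate(pattern)}
        i = j = m - 1          # compare the pattern right-to-left
        while i < n:
            if text[i] == pattern[j]:
                if j == 0:
                    return i   # full match found
                i -= 1
                j -= 1
            else:
                # looking-glass shift: realign on the rightmost occurrence
                # of the mismatched text character within the pattern
                l = last.get(text[i], -1)
                i += m - min(j, l + 1)
                j = m - 1
        return -1

Scanning the pattern from its right end is what lets the algorithm skip large chunks of the text on a mismatch.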
The second technique, Malcolm C Harrison's superimposed coding to speed up the search, runs into problems, and yields some interesting observations, when the text involved is English. The technique hashes every pair of consecutive letters of every line of the input into a superimposed code. The same is done to the pattern, and we then need to examine only the lines whose codes contain the pattern's bits.
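A minimal sketch of that filter, assuming a 128-bit signature (matching the range of 128 mentioned below) and a simple letter-pair hash; bigram_hash is an illustrative choice, not one fixed by Harrison's technique:

    SIG_BITS = 128

    def bigram_hash(a, b):
        # Map a pair of consecutive letters to one of 128 bit positions.
        return (ord(a) * 31 + ord(b)) % SIG_BITS

    def signature(s):
        """Superimpose (OR together) one bit per consecutive-letter pair."""
        sig = 0
        for a, b in zip(s, s[1:]):
            sig |= 1 << bigram_hash(a, b)
        return sig

    def candidate_lines(lines, pattern):
        """Yield lines whose signature covers every bit of the pattern's."""
        pat_sig = signature(pattern)
        for line in lines:
            if signature(line) & pat_sig == pat_sig:
                yield line

In practice the line signatures would be precomputed, and every candidate still has to be verified with a real search such as Boyer-Moore above, since matching bits do not guarantee the pattern is present; those false candidates are the collisions discussed next.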
This works reasonably well for small inputs, but on the large input it produced many extra lines that contain no occurrence of the pattern at all. The cause is that hashing English text into a range of 128 results in collisions, so many that the output is unusable: the number of collisions is on the order of 1,000 (about 10% of the lines).
As mentioned by Knuth, the collisions can be brought down by removing all the spaces from the input text and the pattern, since the spaces contribute common noise. This does not help much, though it does bring the collisions down a bit. The collisions can, however, be reduced to 240-280 by applying the hash incrementally to the pattern: if 'software' is the term, also hash 'so', 'sof', 'soft', and so on, so that each incremental prefix is caught in the superimposed code as well. This is an improvement, but 240-280 collisions is still not good, not to mention the additional runtime incurred by the incremental hashing. The time is 482 ms.
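A sketch of that incremental variant, reusing SIG_BITS and signature from the sketch above. I assume the line signatures must carry the same prefix bits for every word for the filter to keep working, which would be where the extra runtime behind the 482 ms goes:

    def incremental_signature(term):
        """Pair bits as before, plus one bit per growing prefix of the
        term ('so', 'sof', 'soft', ...), computed incrementally."""
        sig = signature(term)
        if len(term) < 2:
            return sig
        h = ord(term[0]) % SIG_BITS
        for c in term[1:]:
            h = (h * 31 + ord(c)) % SIG_BITS  # extend the prefix hash by one letter
            sig |= 1 << h                     # record the bit for this prefix
        return sig

    def incremental_line_signature(line):
        # Every word of the line contributes its prefix bits too.
        sig = 0
        for word in line.split():
            sig |= incremental_signature(word)
        return sig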
Boyer-Moore, although it would take more time on much larger files, wins for now.