In making a ternary search tree to identify as fast as possible the type of word passed in to the algorithm, for instance sp_help -> 1 (procedures), select -> 2 (keywords), sysobjects -> 3 (system tables), I'm stuck on deciding what type of comparison happens at each node. Has anyone got any suggestions? Do you think for the first level I could compare just the first characters of the word passed in and the word at the 'current' node, or would it be more likely to be a whole word comparison for each node? If it would be a whole word comparison, what form would this take for best performance? Bearing in mind that each word will only ever return the same type, but more than one word will have a certain type (i.e. the type will not be unique, but the word will.)
Hi, i am assuming you have libraries to do string comparison. say you are in node "root" and it has 3 children child1, child2, child3. if the word that you are searching is less than child1(Which you can find using strcmp or any other string utility functions) then go down in child1. if it is between child1 and child2(greater than child1 and lesser than child2) then go down in child2 else go down in child3. if this answered your question well and good! if not get back with more details of what you are trying to do. regards, Lingesh [quoted text, click to view] "Bonj" wrote: > In making a ternary search tree to identify as fast as possible the type of > word passed in to the algorithm, for instance sp_help -> 1 (procedures), > select -> 2 (keywords), sysobjects -> 3 (system tables), I'm stuck on > deciding what type of comparison happens at each node. > Has anyone got any suggestions? > Do you think for the first level I could compare just the first characters > of the word passed in and the word at the 'current' node, or would it be more > likely to be a whole word comparison for each node? > If it would be a whole word comparison, what form would this take for best > performance? > Bearing in mind that each word will only ever return the same type, but more > than one word will have a certain type (i.e. the type will not be unique, but > the word will.) >
This posting is provided "AS IS" with no warranties, and confers no rights
I'm trying to build a word-classifying algorithm for a syntax highlighter for T-SQL. [quoted text, click to view] > i am assuming you have libraries to do string comparison.
No, that's what I don't have. The strings will be std::basic_string<_TCHAR> but that's the only library I am using, but I want the algorithm to be as fast as possible. [quoted text, click to view] > say you are in > node "root" and it has 3 children child1, child2, child3. if the word that > you are searching is less than child1(Which you can find using strcmp or any > other string utility functions) then go down in child1. if it is between > child1 and child2(greater than child1 and lesser than child2) then go down in > child2 else go down in child3. if this answered your question well and good! > if not get back with more details of what you are trying to do.
What I don't get, is what form would "root" take, what form would "child1" take, and especially what form would "is less than" take, in this *particular* scenario? I'm hitting brick walls because the only examples I can find are text-book generic. For instance, if I was searching for the type of word for sp_help, the parameter I pass in is a string with the value "sp_help" and I want to get back the value WT_SYSTEMPROC (where WT_SYSTEMPROC is an application-defined constant). Is the "root" node the first letter, "s"? Is the root node instead the very first (alphabetically) word out of the ones I want to recognise? Can you describe the process of traversing down the tree for this example of this specific algorithm? [quoted text, click to view] > regards, > Lingesh > > "Bonj" wrote: > > > In making a ternary search tree to identify as fast as possible the type of > > word passed in to the algorithm, for instance sp_help -> 1 (procedures), > > select -> 2 (keywords), sysobjects -> 3 (system tables), I'm stuck on > > deciding what type of comparison happens at each node. > > Has anyone got any suggestions? > > Do you think for the first level I could compare just the first characters > > of the word passed in and the word at the 'current' node, or would it be more > > likely to be a whole word comparison for each node? > > If it would be a whole word comparison, what form would this take for best > > performance? > > Bearing in mind that each word will only ever return the same type, but more > > than one word will have a certain type (i.e. the type will not be unique, but > > the word will.) > > > This posting is provided "AS IS" with no warranties, and confers no rights
sorry another one, what *form* would "go down" take? i.e. go from "sp_he" to "sp_hel", one step nearer to the child node of "sp_help"? (am I on the right lines?) What about "sp_helpindex", that would be a child node of "sp_help", but "sp_help" is a child node in itself. This is one thing that confuses me. [quoted text, click to view] "Lingeshwaran Palaniappan(Microsoft)" wrote: > Hi, > i am assuming you have libraries to do string comparison. say you are in > node "root" and it has 3 children child1, child2, child3. if the word that > you are searching is less than child1(Which you can find using strcmp or any > other string utility functions) then go down in child1. if it is between > child1 and child2(greater than child1 and lesser than child2) then go down in > child2 else go down in child3. if this answered your question well and good! > if not get back with more details of what you are trying to do. > regards, > Lingesh > > "Bonj" wrote: > > > In making a ternary search tree to identify as fast as possible the type of > > word passed in to the algorithm, for instance sp_help -> 1 (procedures), > > select -> 2 (keywords), sysobjects -> 3 (system tables), I'm stuck on > > deciding what type of comparison happens at each node. > > Has anyone got any suggestions? > > Do you think for the first level I could compare just the first characters > > of the word passed in and the word at the 'current' node, or would it be more > > likely to be a whole word comparison for each node? > > If it would be a whole word comparison, what form would this take for best > > performance? > > Bearing in mind that each word will only ever return the same type, but more > > than one word will have a certain type (i.e. the type will not be unique, but > > the word will.) > > > This posting is provided "AS IS" with no warranties, and confers no rights
[quoted text, click to view] Bonj wrote: > In making a ternary search tree to identify as fast as possible the type of > word passed in to the algorithm, for instance sp_help -> 1 (procedures), > select -> 2 (keywords), sysobjects -> 3 (system tables), I'm stuck on > deciding what type of comparison happens at each node. > Has anyone got any suggestions?
First, I would try std::map to see if you really need to go through all the work of writing and debugging this new data structure. Also, boost::spirit has a symbol table class that is supposedly based on a TST, which you might look into using. (Actually, if you're doing parsing, you should probably look into boost::spirit. It's incredibly nifty.) [quoted text, click to view] > Do you think for the first level I could compare just the first characters > of the word passed in and the word at the 'current' node, or would it be more > likely to be a whole word comparison for each node?
Only the first character. Otherwise you could just have a binary search tree. [quoted text, click to view] > If it would be a whole word comparison, what form would this take for best > performance? > Bearing in mind that each word will only ever return the same type, but more > than one word will have a certain type (i.e. the type will not be unique, but > the word will.)
A TST will basically be a binary search tree on the first character of the string and when a match is found, the node will contain another TST that starts with the second character, and so on until you run out of string or something doesn't match. So you have a structure something like: struct tst_node { _TCHAR match; struct tst_node *left, *right; union { struct tst_node *ptr; /* when match != 0 */ unsigned data; /* only used when match == 0 */ } down; }; struct tst_node *root; And for your three examples, "sp_help", "select", and "sysobjects", the top of the TST might look like: root -> node0 /* first level BST */ node0 = { match = 's' left = right = NULL ptr -> node1 } /* second level BST (advance string iterator/pointer/index) */ node1 = { match = 'p' left = node2 /* less than 'p' */ right = node3 /* greater than 'p' */ ptr -> subtree with only "_help" } node2 = { match = 'e' left = right = NULL ptr -> subtree with only "lect" } node3 = { match = 'y' left = right = NULL ptr -> subtree with only "sobjects" } Now the subtree with only "sobjects" is going to take at least 16*9 = 144 bytes, which is quite wasteful. If you're sneaky, you can store a sequence of characters in each node rather than a single character. You still treat it as having only one character for the purpose of the binary search, but if that matches, you compare the rest of the characters in that node and give up if any do not match. But this is just an optimization, and it makes things a lot more complex. (of course you are looking at implementing a TST, so maybe this is important to you) I hope this is clear enough to be of some use. :) -josh
Yes, that's great, thanks. I'll study your answer, but it seems to englighten the situation. [quoted text, click to view] > Now the subtree with only "sobjects" is going to take at least 16*9 = > 144 bytes, which is quite wasteful. If you're sneaky, you can store a > sequence of characters in each node rather than a single character. You > still treat it as having only one character for the purpose of the > binary search, but if that matches, you compare the rest of the > characters in that node and give up if any do not match. But this is > just an optimization, and it makes things a lot more complex. (of > course you are looking at implementing a TST, so maybe this is important > to you)
That's very good to know, cheers.
Just another question: One last thing I can't quite get my head round and I'm hoping you'll be able to help: What would happen algorithmically in the situation, where a node has got child nodes, but is also a root node itself? For instance, "sp_help" is a child node, but also has child nodes relating to, say, "sp_helpindex" ??? Any thoughts, like what would the code basically do? [quoted text, click to view] "josh" wrote: > Bonj wrote: > > In making a ternary search tree to identify as fast as possible the type of > > word passed in to the algorithm, for instance sp_help -> 1 (procedures), > > select -> 2 (keywords), sysobjects -> 3 (system tables), I'm stuck on > > deciding what type of comparison happens at each node. > > Has anyone got any suggestions? > > First, I would try std::map to see if you really need to go through all > the work of writing and debugging this new data structure. Also, > boost::spirit has a symbol table class that is supposedly based on a > TST, which you might look into using. (Actually, if you're doing > parsing, you should probably look into boost::spirit. It's incredibly > nifty.) > > > Do you think for the first level I could compare just the first characters > > of the word passed in and the word at the 'current' node, or would it be more > > likely to be a whole word comparison for each node? > > Only the first character. Otherwise you could just have a binary search > tree. > > > If it would be a whole word comparison, what form would this take for best > > performance? > > Bearing in mind that each word will only ever return the same type, but more > > than one word will have a certain type (i.e. the type will not be unique, but > > the word will.) > > A TST will basically be a binary search tree on the first character of > the string and when a match is found, the node will contain another TST > that starts with the second character, and so on until you run out of > string or something doesn't match. > > So you have a structure something like: > > struct tst_node > { > _TCHAR match; > struct tst_node *left, *right; > union > { > struct tst_node *ptr; /* when match != 0 */ > unsigned data; /* only used when match == 0 */ > } down; > }; > > struct tst_node *root; > > And for your three examples, "sp_help", "select", and "sysobjects", the > top of the TST might look like: > > root -> node0 > > /* first level BST */ > > node0 = > { > match = 's' > left = right = NULL > ptr -> node1 > } > > /* second level BST (advance string iterator/pointer/index) */ > > node1 = > { > match = 'p' > left = node2 /* less than 'p' */ > right = node3 /* greater than 'p' */ > ptr -> subtree with only "_help" > } > > node2 = > { > match = 'e' > left = right = NULL > ptr -> subtree with only "lect" > } > > node3 = > { > match = 'y' > left = right = NULL > ptr -> subtree with only "sobjects" > } > > Now the subtree with only "sobjects" is going to take at least 16*9 = > 144 bytes, which is quite wasteful. If you're sneaky, you can store a > sequence of characters in each node rather than a single character. You > still treat it as having only one character for the purpose of the > binary search, but if that matches, you compare the rest of the > characters in that node and give up if any do not match. But this is > just an optimization, and it makes things a lot more complex. (of > course you are looking at implementing a TST, so maybe this is important > to you) > > I hope this is clear enough to be of some use. :) > > -josh >
[quoted text, click to view] Bonj wrote: > Just another question: > One last thing I can't quite get my head round and I'm hoping you'll be able > to help: > What would happen algorithmically in the situation, where a node has got > child nodes, but is also a root node itself? For instance, "sp_help" is a > child node, but also has child nodes relating to, say, "sp_helpindex" > ??? > > Any thoughts, like what would the code basically do?
The node at "sp_help" will have an entry with its matching character equal to 0 and an entry with its maching character equal to 'i'. The first holdes the data for "sp_help" and the second goes on to match "sp_helpindex". If your strings are nul terminated, you are guaranteed that a node that corresponds to a final symbol will not have children. -josh
Don't see what you're looking for? Try a search.
|