Introduction

General Hashing Routines

File: hash.e

Version: v1.00

Date: 01-sep-2000

Author: Derek Parnell

These routines help manage hash tables. See "Hash Routines" for details.


Table of Contents

  • ActionCodes

  • ErrorCode

  • Hash Routines

  • HashKeys


  • ActionCodes

    These codes are used to indicate some action is required or was performed.

  • kHASH_ERROR

    An error was detected. See Hash_Errors for details.

  • kHASH_UNIQUE

    Insert mode: Keys must be unique, duplicate keys ignored.

  • kHASH_INSERT

    Callback action: Insert the returned data.

  • kHASH_REPLACE

    Callback action: Replace the existing with the returned data.

  • kHASH_IGNORE

    Callback action: Ignore the insert request.

  • kHASH_DEF

    Used by calling routines to request the default hashing algorithms

  • kHASH_DEF_DIM

    Set Default number of dimensions for record estimates.

  • kHASH_DEF_SEEDKEYLEN

    Set Max Key length for precalculated seeds.


    ErrorCode

    These codes are returned in the Hash_Errors sequence.

  • kHASH_ERR_BADCALLBACK

    The data returned by an Insert callback function was not a 2-element sequence

  • kHASH_ERR_TABLEID_BAD

    The supplied TableId is not any of the current Ids

  • kHASH_ERR_TABLEID_DEL

    The supplied TableId refers to a deleted table.

  • kHASH_ERR_NONATOMIC_DIM

    One of the Table dimensions is not an atom/integer

  • kHASH_ERR_NONPOS_DIM

    One of the Table dimensions is than 1.

  • kHASH_ERR_BADMODE

    A unknown Insert mode was supplied.

  • kHASH_ERR_CALLBACK

    The action code in the Insert callback data is not recognised.

  • kHASH_ERR_RNF

    The data to be replaced wasn't found.

  • kHASH_ERR_BAD_HASH

    Wrong number of hash values supplied.

  • kHASH_ERR_BAD_DEFAULT

    An unknown Default ID was supplied.

  • kHASH_ERR_BAD_VALUE

    An invalid value for the specific default ID was supplied.

  • kHASH_ERR_NOSTATE

    Unable to restore system state because there is no state to restore from.


    Hash Routines

    These routines help manage hash tables. A Hash table is like a RAM-based database, but with limitations. It is usually only for small numbers of records, and for small record sizes, but it is very fast. Individual records can only be retrieved if you know their key; there is no FindFirst, FindNext type of functions, because there is no assumed relationship between the records in a table. However, there is a function to retrieve all the stored records, optionally with a user- written filtering function.

    Typically, hash tables are used for syntax analysis, but can be used in any application that has to retrieve data based on unique keys. The record data to be stored can be anything, as these routine do not use this data in any way at all. In fact, the data probably should not include the record's key as this is stored by the routines anyway, so it would be redundant and wasteful.

    These routines store any error messages in a global sequence called Hash_Errors. After calling any of the hash routines, you can check this sequence and if its length is not zero it means that at least one error was detected. Each entry in this sequence is a three-element sequence - the first is the error code, the third is a sequence of relevant data relating to the particular error, and the second is an English text message relating to the error.

  • func Hash_Calc( integer TableId, sequence Key ) This calculates the "hash" key for your supplied data key.
  • proc Hash_DeleteTable( integer TableId) This removes a specific table, and all the data inserted into it.
  • var Hash_Errors Stores error codes and messages.
  • func Hash_Fetch( integer TableId, sequence Key, sequence Hashkey ) This tries to find the data that was stored in the Table
  • func Hash_GetTable( integer TableId, integer Filter, sequence Extras) This is used to fetch the all the data inserted into a specific table.
  • func Hash_Insert( integer TableId, sequence Key, sequence Data, integer Mode, sequence Hashkey ) This is used to insert new data into the table.
  • func Hash_NewTable( object Dimensions ) This defines a new hash table and returns its ID.
  • func Hash_Replace( integer TableId, sequence Key, sequence NewData, sequence Hashkey ) This is used to replace/update the data previously stored.
  • func Hash_RestoreState( ) This restores the internal state of the hash functions.
  • func Hash_SaveState( ) This saves the internal state of the hash functions.
  • proc Hash_SetDefault( integer pDefaultId, object pValue) This allows the caller to set various default values used by these routines.
  • func Hash_Stats( ) This is used to fetch the current statistics of all tables.
  • func Hash_TableStats( integer TableId) This is used to fetch the current statistics of a specific table.

    HashKeys

    The reason they are called hash tables is that each record's key is converted into a "Hash Key" that is, in turn, used to identify a specific queue that the record is stored in. The conversion algorithms examine each key calculate a number for the key. The number is divided by the number of queues available to see which queue the record can be stored in. Because it is possible that different keys can all map into the same queue number, some queues store multple records. The more effecient hash tables are those that have the fewest number of records per queue.

    This library helps you to distribute records over the queues by allowing you to define a hash table to use multple algorihms. Each table created with this library has a set of "dimensions", and each dimension uses a different algorithm. You not only specify how many dimensions to use, but also their relative weightings. For example, a one dimension table defined with {x}, a 2-dim table as {x,y}, a 3-dim table as {x,y,z}, and so on. There is no limit to the number of dimensions except memory. In practice, a maximum or 4-6 is realistic. These dimensions also determine the number of queues to use. The total number of queues is the product of the dimension values, ie. a 3-dim table of {x,y,z} has x * y * z queues. You can bias the algorithm(s) to use by having larger dimension values. For example, a table of {3,19,3} has 3 * 19 * 3 (171) queues and the second dimension's algorithm is more heavily relied upon to distribute the keys than the other two. Generally, the more dimensions and the more queues, the better the key distribution. But you must trade-off memory for speed.

    To make it easier, the table defintion funtion also allows you to supply an estimate of the number of records you think will be stored, it then calculates a dimension set for the table that should be suitable for the data, however it tends to be a bit memory hungry. Some functions allow you to supply your own hash keys. You might do this if the default algorithms don't suit your data. When you do this, you must supply a hash key set with exactly the same number of elements as there are in the table's dimension. Plus, each element must be in the range 1 .. N, where N is the value in the respective dimension. Thus for a table defined with dimensions of {3,5,7}, you must supply a hash-key-set with three elements and the first must be from 1 - 3, the 2nd from 1-5, and the 3rd from 1-7.


    [func]
    Hash_Calc( integer TableId, sequence Key )

    This calculates the "hash" key for your supplied data key.

    Returns: Returned is a sequence that contains the this data key's hash key.

    Category: Hash Routines

    The hash key is based on the dimensions defined for this table. You would typically use this to store the hash key for common data keys to save the lookup routine recalculating it.

    This function does not store any data!

     Examples:- 
    

    integer lHT sequence lHashKey lHT = Hash_NewTable( {7,5,91} ) -- This creates a hash table lHashKey = Hash_Calc ( lHT , "Derek") -- Returns the hash key for "Derek"

    See Also: Hash_DeleteTable, Hash_Errors, Hash_Fetch, Hash_GetTable, Hash_Insert, Hash_NewTable, Hash_Replace, Hash_RestoreState, Hash_SaveState, Hash_SetDefault, Hash_Stats, Hash_TableStats


    [proc]
    Hash_DeleteTable( integer TableId)

    This removes a specific table, and all the data inserted into it.

    Returns: Nothing is returned. This is a Procedure.

    Category: Hash Routines

    You must supply the Table to use. Once a table is deleted, its Table ID cannot be used again.

     Examples:- 
     integer lHT -- Hash Table ID 
     Hash_DeleteTable ( lHT ) 
    

    See Also: Hash_Calc, Hash_Errors, Hash_Fetch, Hash_GetTable, Hash_Insert, Hash_NewTable, Hash_Replace, Hash_RestoreState, Hash_SaveState, Hash_SetDefault, Hash_Stats, Hash_TableStats


    [var]
    Hash_Errors

    Stores error codes and messages.

    Category: Hash Routines

    This sequence is updated whenever an error is detected. Each new error is appended, enabling this to handle multiple errors in a single call. All previous errors are deleted when you call a Hash function. Each entry in this consists of a 3-element sequence.

  • [1] An ErrorCode

  • [2] A sequence of data for the error

  • [3] A message template in English

     sequence lRC 
     lRC = Hash_...( ... ) 
     if length(lRC) != 0 then 
     for i = 1 to length(lRC) do 
     printf(2, "Hash Err #%d, '%s'\n", 
     {lRC[i][1], 
     sprintf(lRC[i][3], lRC[i][2]) 
     } 
     ) 
     end for 
     else 
     ... -- Function worked. 
     end if 
    

    See Also: Hash_Calc, Hash_DeleteTable, Hash_Fetch, Hash_GetTable, Hash_Insert, Hash_NewTable, Hash_Replace, Hash_RestoreState, Hash_SaveState, Hash_SetDefault, Hash_Stats, Hash_TableStats


    [func]
    Hash_Fetch( integer TableId, sequence Key, sequence Hashkey )

    This tries to find the data that was stored in the Table

    Returns: Returned is a sequence that contains the data that was stored against the data Key.

    Category: Hash Routines

    info It uses the the data Key. If Hashkey is kHASH_DEF the routine first calculates a hash key for this data key, otherwise it uses whatever hashing data you provide. See HashKeys for more details If the returned sequence is empty, then the key has not been inserted into the hash table, otherwise the returned data is in the first element. I've done it this way so that I can support empty sequences as valid data. That is, an empty data sequence would be returned as {{}} whereas No-Data-Found is returned as {} .

     Examples:- 
    

    integer lHT -- Hash Table ID sequence lMyData lMyData = Hash_Fetch ( lHT , "Derek", kHASH_DEF) if length(lMyData) = 0 then --- no data found. else --- data is in lMyData[1] !!! N.B. Not in lMyData, but in lMyData's 1st element. end if

    See Also: Hash_Calc, Hash_DeleteTable, Hash_Errors, Hash_GetTable, Hash_Insert, Hash_NewTable, Hash_Replace, Hash_RestoreState, Hash_SaveState, Hash_SetDefault, Hash_Stats, Hash_TableStats


    [func]
    Hash_GetTable( integer TableId, integer Filter, sequence Extras)

    This is used to fetch the all the data inserted into a specific table.

    Returns: Returned is a sequence containing the selected data.

    Category: Hash Routines

    You must supply the Table to use, and an optional filtering routine, and optional data for the filter function. To use a filtering routine, the second parameter must be a valid routine_id. This routine is passed three parameters, the record's key, the record's data and the Extras. The routine must return a sequence. If the sequence is empty, the record will be skiped. Otherwise the returned sequence must contain a single element, and this element is returned to the caller of Hash_GetTable. Each element in the returned sequence is a two-element sequence containing the record's key and data. The order of the returned records is random; it is not sorted.

     Examples:- 
     function NumFilter( sequence pKey, sequence pData, sequence pExtra) 
     if IsNumeric(pKey) 
     then 
     return {pData} 
     else 
     return{} 
     end if 
     end function 
    

    function AlphaFilter( sequence pKey, sequence pData, sequence pExtra) Only select alpha-keys that match the 'type' in pExtra. if not IsNumeric(pKey) then if equal(pData[3], pExtra[1]) then return {pData} else return {} end if else return{} end if end function

    integer lHT -- Hash Table ID sequence lNums, lWords lNums = Hash_GetTable ( lHT, routine_id("NumFilter"), {}) lWords = Hash_GetTable ( lHT, routine_id("AlphaFilter"), {"CONST"})

    See Also: Hash_Calc, Hash_DeleteTable, Hash_Errors, Hash_Fetch, Hash_Insert, Hash_NewTable, Hash_Replace, Hash_RestoreState, Hash_SaveState, Hash_SetDefault, Hash_Stats, Hash_TableStats


    [func]
    Hash_Insert( integer TableId, sequence Key, sequence Data, integer Mode, sequence Hashkey )

    This is used to insert new data into the table.

    Returns: Returned is a success code, either kHASH_ERROR, kHASH_INSERT, kHASH_IGNORE or kHASH_REPLACE.

    Category: Hash Routines

    You must supply the Table to use, the data's Key, the data to store, the Insertion Mode, and optionally the key's hash value. You can use kHASH_DEF to have this routine calculate the hash for you. See HashKeys for more detail. The Insertion Mode is either kHASH_UNIQUE or a routine_id. A value of kHASH_UNIQUE inserts the data and its key if the key has not already been inserted, otherwise it just ignores it. A routine_id allows the use of more sophosticated control of the data. With a callback routine, you can elect to modify the new data prior to insertion, or update existing data, or ignore new data. The callback routine receives three parameters - the key, any existing data, and the new data. It must return a sequence of two elements - the first must be either kHASH_REPLACE, kHASH_INSERT, or kHASH_IGNORE. The second must be a sequence - for kHASH_INSERT and kHASH_REPLACE this the data put into the table, for kHASH_IGNORE it can be anything because it is just ignored.

     Examples:- 
     #1 - Simple Insertion... 
     integer lHT -- Hash Table ID 
     integer lRC 
     lRC = Hash_Insert ( lHT , "Derek", {1,2,"xyz"}, kHASH_UNIQUE, kHASH_DEF) 
     if lRC = kHASH_ERROR 
     then --- an error detected. See Hash_Errors for details.
     elsif lRC = kHASH_INSERT --- data has been succesfully inserted. 
     else  --- data already exists 
     end if 
     #2 - Insertion with callback... 
     function MyCallback(sequence pKey, sequence pExisting, sequence pNew) 
     This counts the occurances of non-numeric records. 
     if length(pExisting) = 0 
     then  -- New Record 
     if IsNumeric(pKey) 
     then 
     return {{kHASH_INSERT, pNew} ) Don't add a counter. 
     else 
     return {{kHASH_INSERT, append(pNew, 1)} ) Attach a counter to the data. 
     end if 
     else 
     if length(pExisting[1]) = 4 
     then 
     pExisting[1][4] += 1 -- Increment the counter 
     return {{kHASH_REPLACE, pExisting[1]} ) 
     else 
     return {{kHASH_IGNORE, {}} ) 
     end if 
     end if 
     end function 
    

    integer lHT -- Hash Table ID integer lRC lRC = Hash_Insert ( lHT , "Derek", {1,2,"xyz"}, routine_id("MyCallback"), kHASH_DEF) if lRC = kHASH_ERROR then --- an error detected. See Hash_Errors for details. elsif lRC = kHASH_INSERT --- data has been succesfully inserted. else --- data already exists end if

    See Also: Hash_Calc, Hash_DeleteTable, Hash_Errors, Hash_Fetch, Hash_GetTable, Hash_NewTable, Hash_Replace, Hash_RestoreState, Hash_SaveState, Hash_SetDefault, Hash_Stats, Hash_TableStats


    [func]
    Hash_NewTable( object Dimensions )

    This defines a new hash table and returns its ID.

    Returns: Returned is an integer representing a table id.

    Category: Hash Routines

    'Dimensions' is either a sequence containing a list of atoms or a single integer. If a sequence then the number of atoms specifies the number of sub-tables to use. Each sub-table uses a different algorithm to generate a hashing index. The value of each atom specifies the number of queues in its sub-table. The queues hold the inserted entries. Which queue is actually used for a given entry depends on the number of sub-tables and the number of queues in each sub-table. By adjusting these dimensions, you can influence the efficiency of finding entries in a table. The aim is to get the entries evenly spread amonsgt to queues and to have a minimum of entries per queue. If an integer, 'Dimensions' is an estimate of the number of entries expected to be stored. This function calculates a default queue length for four sub-tables. The default is biased towards efficient lookup so it uses more memory than may be required. This id must be used in subsequent hashing functions. Zero is returned when the function finds something wrong with the 'Dimensions' parameter.

     Examples:- 
    

    integer lHT lHT = Hash_NewTable( {7,5,91} ) -- This creates a hash table with 3 sub-tables Sub table #1 has 7 queues, Sub table #2 has 5 queues, Sub table #3 has 91 queues lHT = Hash_NewTable ( 7000 ) -- This creates a hash table that is expected to store about 7000 entries.

    See Also: Hash_Calc, Hash_DeleteTable, Hash_Errors, Hash_Fetch, Hash_GetTable, Hash_Insert, Hash_Replace, Hash_RestoreState, Hash_SaveState, Hash_SetDefault, Hash_Stats, Hash_TableStats


    [func]
    Hash_Replace( integer TableId, sequence Key, sequence NewData, sequence Hashkey )

    This is used to replace/update the data previously stored.

    Returns: Returned is a success code, either kHASH_ERROR or kHASH_REPLACE.

    Category: Hash Routines

    After your data has been inserted into your table via a call to Hash_Insert, you can use this function to fetch the store data. You must supply the Table the data was stored in, the data's Key, the new data, and optionally the key's hash value. You can use kHASH_DEF to have this routine calculate the hash for you. See HashKeys for more detail.

     Examples:- 
    

    integer lHT -- Hash Table ID integer lRC lRC = Hash_Replace ( lHT , "Derek", {1,2,"xyz"}, kHASH_DEF) if lRC = kHASH_ERROR then --- an error detected. See Hash_Errors for details. else --- data has been succesfully replaced. end if

    See Also: Hash_Calc, Hash_DeleteTable, Hash_Errors, Hash_Fetch, Hash_GetTable, Hash_Insert, Hash_NewTable, Hash_RestoreState, Hash_SaveState, Hash_SetDefault, Hash_Stats, Hash_TableStats


    [func]
    Hash_RestoreState( )

    This restores the internal state of the hash functions.

    Returns: Nothing is returned as this is a procedure.

    Category: Hash Routines

    The state, saved by Hash_SaveState, is restore to allow re-enterant code to work. This must be used by callback routines if they call any other Hash function. Hash_Errors is updated with code HASH_ERR_NOSTATE if there is no state to restore.

     Examples:- 
     function InsertCallback( ... ) 
     ... 
     Hash_SaveState() 
     lRC = Hash_Fetch ( lHT, "GrandTotal", kHASH_DEF) 
     lRC[1][1] += 1 
     lRC = Hash_Replace( lHT, "GrandTotal", lRC[1], kHASH_DEF) 
     Hash_RestoreState() 
     ... 
     return ... 
     end function 
    

    See Also: Hash_Calc, Hash_DeleteTable, Hash_Errors, Hash_Fetch, Hash_GetTable, Hash_Insert, Hash_NewTable, Hash_Replace, Hash_SaveState, Hash_SetDefault, Hash_Stats, Hash_TableStats


    [func]
    Hash_SaveState( )

    This saves the internal state of the hash functions.

    Returns: Nothing is returned as this is a procedure. Does not update Hash_Errors

    Category: Hash Routines

    This allows re-enterant code to work. This must be used by callback routines if they call any other Hash function.

     Examples:- 
     function InsertCallback( ... ) 
     ... 
     Hash_SaveState() 
     lRC = Hash_Fetch ( lHT, "GrandTotal", kHASH_DEF) 
     lRC[1][1] += 1 
     lRC = Hash_Replace( lHT, "GrandTotal", lRC[1], kHASH_DEF) 
     Hash_RestoreState() 
     ... 
     return ... 
     end function 
    

    See Also: Hash_Calc, Hash_DeleteTable, Hash_Errors, Hash_Fetch, Hash_GetTable, Hash_Insert, Hash_NewTable, Hash_Replace, Hash_RestoreState, Hash_SetDefault, Hash_Stats, Hash_TableStats


    [proc]
    Hash_SetDefault( integer pDefaultId, object pValue)

    This allows the caller to set various default values used by these routines.

    Returns: Returns kHASH_ERROR if an error was detected, and kHASH_IGNORE if the default was set.

    Category: Hash Routines

    You must supply two parameters, the first contains the code for the default being set, the second is the new value for that default.

  • kHASH_DEF_DIM "Default number of dimensions when record estimates are supplied to Hash_NewTable function." Must be an integer greater than zero.

  • kHASH_DEF_SEEDKEYLEN "Maximum length of key for pre-calculated seeds." Must be an integer greater than zero.

     Examples:- 
     Hash_SetDefault ( kHASH_DEF_DIM, 3 )  -- Set the default dimensions to 3. 
    

    See Also: Hash_Calc, Hash_DeleteTable, Hash_Errors, Hash_Fetch, Hash_GetTable, Hash_Insert, Hash_NewTable, Hash_Replace, Hash_RestoreState, Hash_SaveState, Hash_Stats, Hash_TableStats


    [func]
    Hash_Stats( )

    This is used to fetch the current statistics of all tables.

    Returns: Returned is a sequence containing a sequence for each table in use.

    Category: Hash Routines

    This is the same as Hash_TableStats , but returns the data for every table in use. See Hash_TableStats for the details.

     Examples:- 
     sequence lRC 
     lRC = Hash_Stats () 
     if length(lRC) = 0 
     then --- an error detected. See Hash_Errors for details. 
     else 
     for i = 1 to length(lRC) do 
     printf(1, "Records inserted for %s: %d, Avg Q len=%f\n", 
     {sprint(lRC[1][11]), lRC[i][1], lRC[i][6]}) 
     end for 
     end if 
    

    See Also: Hash_Calc, Hash_DeleteTable, Hash_Errors, Hash_Fetch, Hash_GetTable, Hash_Insert, Hash_NewTable, Hash_Replace, Hash_RestoreState, Hash_SaveState, Hash_SetDefault, Hash_TableStats


    [func]
    Hash_TableStats( integer TableId)

    This is used to fetch the current statistics of a specific table.

    Returns: Returned is a sequence containing various statistics.

    Category: Hash Routines

    Can be used to determine the best set of table dimensions to use with the given set of data being inserted. You must supply the Table to use.

    Layout of Statistics Sequence

  • [1] ==> Total records inserted

  • [2] ==> # of Queues with records

  • [3] ==> # of Empty Queues

  • [4] ==> Maximum record count in a queue

  • [5] ==> Minimum record count in a (non-empty) queue

  • [6] ==> Average number of records per queue

  • [7] ==> Standard Deviation of records in queues

  • [8] ==> # of queues with higher than average record count

  • [9] ==> # of queues with lower than average record count

  • [10]==> Queue Spread. A sequence containing queue counts per queue size. That is, the # of queues with 1 record, 2 records, 3 records ...

  • [11]==> Table Definition. A sequence containing the table's dimesions.

     Examples:- 
     integer lHT -- Hash Table ID 
     sequence lRC 
     lRC = Hash_TableStats ( lHT) 
     if length(lRC) = 0 
     then --- an error detected. See Hash_Errors for details. 
     else 
     printf(1, "Records inserted: %d, Avg Q len=%f\n", 
     {lRC[1], lRC[6]}) 
     end if 
    

    See Also: Hash_Calc, Hash_DeleteTable, Hash_Errors, Hash_Fetch, Hash_GetTable, Hash_Insert, Hash_NewTable, Hash_Replace, Hash_RestoreState, Hash_SaveState, Hash_SetDefault, Hash_Stats