Associative Arrays#

Array#

An array maps an index (usually an integer) to an element, of any type.
Arrays in Java have indexes starting at 0.
Access by indexing

1
a[i] /* 0 <= i < size */

An array is a function from natural numbers to elements all of the same type.

So can we use an array every time we want to connect numbers to values?

Actual example: exam scoring#

Class of about 300 students
Student numbers 5 digits
Students usually marked once, or sometimes twice (borderline).
Program to check that no student marked more than twice

1
static final int range = 100000; // 5-digit numbers
2
int [ ] timesMarked= new int [range];
3
for (int i= 0; i != range ; i++) timesMarked[i] = 0;
4
…
5
timesMarked[studentNum]= timesMarked[studentNum] + 1;

Very sparsely used array!
0.3% of array is used! Very space-inefficient.

We need to map a key to a value#

Example:
link (‘map’) key (student number) to value (student information).
This is called a mapping.
At most one value for any key.
Also called a function, from key to value.

Example: email-address lookup#

“Sharon” → “sharoncurtis@brookes.ac.uk"
"David” → “dlightfoot@brookes.ac.uk"
"Ian” → “ibayley@brookes.ac.uk”
Pronounce ”→” as “maps to”

Associative array#

ADT(abstract data type)
- An associative array is an abstract data type
- it is composed of a collection of unique keys and a collection of values, where each key is associated with one value (or set of values).

graph LR ADT --> Associative_array["Associative array"] ADT --> Associative_container["Associative container"] ADT --> map["map"] ADT --> mapping["mapping"] ADT --> dictionary["dictionary"] ADT --> Finite_map["Finite map"] ADT --> Index_file["Index file"]

The operation of finding the value associated with a key is called a lookup or indexing, and this is the most important operation supported by an associative array.
The relationship between a key and its value is sometimes called a mapping or binding. For example, if the value associated with the key “bob” is 7, we say that our array maps “bob” to 7.
Associative arrays are very closely related to the mathematical concept of a function with a finite domain.

Possible example: sequence of entries#

1
public class DelegateSeq implements IDelegateDB {
2
    private Delegate[] table;
3
    private final static int tableSize = 10; // or whatever
4
    private int numEntries;
5
    public DelegateSeq() {
6
        System.out.println("Simple sequence");
7
        table = new Delegate[tableSize];
8
        clearDB();
9
    }
10
    public void clearDB() {
11
        numEntries = 0;
12
    }
13
}

Sequence of entries : put#

1
public Delegate put(Delegate delegate) {
2
    assert numEntries != tableSize; // very simple
3
    assert delegate != null;
4
    String name = delegate.getName();
5
    assert name != null && !name.equals("");
6
    Delegate previous;
7
    int pos = findPos(name);
8
    // we have to check not already in table. Time efficiency of findPos? (next slide)
9
    if (pos == numEntries) { // new
10
        numEntries++;
11
        previous = null;
12
    } else {
13
        previous = table[pos];
14
    }
15
    table[pos] = delegate;
16
    return previous;
17
}

Sequence of entries: findPos#

1
private int findPos(String name) {
2
    assert name != null && !name.equals("");
3
    int i = 0;
4
    while (i != numEntries && !table[i].getName().equals(name))
5
        i++;
6
    return i;
7
}

time complexity of findPos?
Simple linear search: O(n)
So put (and also get) have linear time complexity.

Simple sequence gives O(n) behaviour#

Can be improved by using an ordered sequence.
Then we can use a binary search.

Ordered sequence: findPos#

1
private int findPos(String name) {
2
    // returns position where name is, or would go
3
    assert name != null && !name.equals("");
4
    int left = 0, right = numEntries;
5
    while (left != right) {
6
        int mid = (left + right) / 2;
7
        if (name.compareTo(table[mid].getName()) > 0) left = mid + 1;
8
        else right = mid;
9
    }
10
    return left; // or right, since left == right
11
}

This is a binary search, so O(log n) – much better than unordered

Ordered sequence: put#

1
public Delegate put(Delegate delegate) {
2
    // assertions as before
3
    Delegate previous;
4
    int pos = findPos(name);
5
    if (pos == numEntries || !name.equals(table[pos].getName())) { // new
6
        int i = numEntries;
7
        while (i != pos) { // ‘budging up’ to keep the table ordered
8
            table[i] = table[i-1]; i--;
9
        }
10
        numEntries++;
11
        previous = null;
12
    } else {
13
        previous = table[pos];
14
    }
15
    table[pos] = delegate;
16
    return previous;
17
}

But the ‘budging up’ is O(n), so not much improvement.

Ordered not much improvement#

Self indexing can be O(1) – constant time. But works only in very special cases.
- But leads to very sparse use of space in most cases.
- But leads to very sparse use of space in most cases.
Using a sequence works OK, but has O(n) time complexity.
Ordered sequence is better: put is O(n), but get is O(log n) because of the binary search.
But we want O(1) in all cases.

Answer: Hash tables#

Hans Peter Luhn IBM, 1954
Very clever idea!

We will see how they are made available in the Java Base-Class Library (API) and also how they work.

https://spectrum.ieee.org/tech-history/silicon-revolution/hans-peter-luhn-and-the-birth-of-the-hashing-algorithm

Hans Peter Luhn

Java API HashMap : Method Summary#

void clear() Removes all mappings from this map.
boolean containsKey(K key) Returns true if this map contains a mapping for the specified key.
boolean containsValue(V value) Returns true if this map maps one or more keys to the specified value.
V get(K key) Returns the value to which the specified key is mapped in this identity hash map, or null if the map contains no mapping for this key.
boolean isEmpty() Returns true if this map contains no key-value mappings.
Set<K> keySet() Returns a set view of the keys contained in this map.
V put(K key, V value) Associates the specified value with the specified key in this map.
(Returns the old discarded value – same with remove below)
V remove(K key) Removes the mapping for this key from this map if present.
int size() Returns the number of key-value mappings in this map.
Collection<V> values() Returns a collection view of the values contained in this map.

Hash tables#

Vocabulary

retrieve 检索

modulo 取模

quadratic probing 平方探查

load factor 负载因子

Hash tables (associative arrays)#

Content topics:
- Hashing
- Hash functions
- Collision resolution
  - Probing (open addressing)
  - Chaining

Typical scenario#

Fast storage and searching of large tables, lists, databases etc
Need to minimize processing operations (comparisons and index calculations) per search
Numerous application areas for fast table-lookup
For example:
databases, dictionaries, compiler symbol table, search engines, password lookup …

What’s the problem#

How to store and then be able to locate items from a large data set very quickly
The problem is whatever search methods we use, the following fact will not change:
The more data we have, the slower the access

What’s the problem - realistic demand#

The more data we have, the slower the access
For example:
- Suppose a compiler allows 6-character symbol names that start with an upper-case letter (A-Z) followed by 5 alpha-numeric (0-9, A-Z)
- How many possible symbol names could there possibly be?
- Answer: $26^1 \times 36^5 = 1,572,120,576$ or approximately $1.572 \times 10^9$
- It would not be sensible to allocate this amount ( $1.572 \times 10^9$ ) of memory and use fewer than 10,000 elements (an amount of elements that a program usually handles)

A 16-digit bank card

For example:
- How many possible accounts
  - $10^{16}$
- World population
  - $8068351220 \approx 8.07 \times 10^9$
- Proportion of accounts we may use
  - $8.07 \times 10^{9-16}$
  - $0.0000807\%$ Too wasteful
    No necessary

What’s the problem- efficiency#

Linear search
- With N items on average, we need to trawl through N/2 of them to find the one we want. So a linear search is O(N)
- With twice the number of data items, the search time doubles

xychart-beta title "T = a + bN (Series 1)" x-axis "number of items, N" [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17] y-axis "search time" 0 --> 120 line "T = a + bN" [7,14,21,28,35,42,49,56,63,70,77,84,91,98,105,112,119]

Binary Search
- This is the technique we use to find an entry in a telephone directory in which the entries are sorted. The complexity of the binary search is O(logN)

graph LR R[ ] --> A[ ] R[ ] --> B[ ] A[ ] --> A1[ ] A[ ] --> A2[ ] A1[ ] --> X1[ ] A1[ ] --> X2[ ] A2[ ] --> X3[ ] A2[ ] --> X4[ ] B[ ] --> B1[ ] B[ ] --> B2[ ] B1[ ] --> X5[ ] B1[ ] --> X6[ ] B2[ ] --> X7[ ] B2[ ] --> X8[ ]

xychart-beta title "T = a log₂N (Series1)" x-axis "number of items, N" [1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,97] y-axis "search time" 0 --> 7 line "T = a log₂N" [0.00,2.32,3.17,3.70,4.09,4.39,4.64,4.86,5.04,5.21,5.36,5.49,5.61,5.73,5.83,5.93,6.02,6.11,6.19,6.27,6.34,6.41,6.48,6.54,6.60]

Search-Algorithm Performance#

Linear search is O(N) – is the worst-possible case, linear time
Binary search is O(log N) – is better, logarithmic time, but requires data to be sorted
What we desire is O(1) or O(N⁰)
- i.e. search time is constant, independent of data size N
- so we can reach any data in a fixed number of operations, i.e. constant time

’Associative’ array#

Suitable data structures for associative array:
- Binary search tree [O(logn)]
- Hash table [O(1)]

‘Hashing’#

‘Messing things up’, for example:
- Corned-beef hash // 玉米牛肉糜
- Hash brown potatoes // 薯饼
Hence the meat-mincing machine in next slide…
Hashing for CS:
A method of transforming a search key into an address, for the purpose of fast and efficient storage and retrieval of data items.

Example: finding someone quickly#

How does hashing work#

Instead of storing each new data item in the next free memory location, we need to find a way to find its storage location based on a key value from the data content.
That way we will be able to locate each data item, using its key value.

Hashing:
The process of build a relationship between constant numbers (such as 1, 2, 3…) and particular content.

So the location to store/find a data item is obtained using a hash function applied to the data item (key) value
The hash function maps a key to an arbitrary (but not ‘random’) integer ‘bucket number’

What is a hash function#

Hash function(method):
- The way/method to convey the particular contents into constant numbers.

A hashing example#

Lets’ store some names of a few people in a table of size 12, so index 0 to 11 in Java
The key value for each person is simply the name string, e.g. “ChrisC”
The hash function is obtained from the sum of the ASCII codes of the characters in each string.

ASCII codes for letters
A	B	C	D	E	F	G	H	I	J	K	L	M
65	66	67	68	69	70	71	72	73	74	75	76	77
N	O	P	Q	R	S	T	U	V	W	X	Y	Z
78	79	80	81	82	83	84	85	86	87	88	89	90
a	b	c	d	e	f	g	h	i	j	k	l	m
97	98	99	100	101	102	103	104	105	106	107	108	109
n	o	p	q	r	s	t	u	v	w	x	y	z
110	111	112	113	114	115	116	117	118	119	120	121	122

C	h	r	i	s	C
67	104	114	105	115	67

To obtain an index in the range 0 to 11 calculate: character code sum, modulo 12
So, hash = (sum of ASCII codes for each character in string) modulo 12 (% operator)
Range of hash index is 0 to 11

Hash value for “ChrisC”
Sum of character codes: 67 + 104 + 114 + 105 + 115 + 67 = 572
572 mod 12 is 8
So hash value is 8

Similarly:

“ChrisC” → 8
“SharonC“→ 2
“DavidL“→ 0
“IanB“→ 10

Simple Hashing#

Too-simple hash function#

1
int hashFirstChar(String s) { // just value of first character
2
    return s.charAt(0) % tableSize;
3
    // mod tableSize gives a value in range 0 .. tableSize-1
4
}

(Can do the % tableSize later instead)

This hash function has problem: names starting with same letter all have same hash value!

Modify simple hash function-v1#

1
final int tableSize = 12; // for example
2
private static int hashSum(String key) {
3
    int hashValue = 0;
4
    for(int j = 0; j != key.length(); j++)
5
        hashValue = hashValue + (int)key.charAt(j);
6
    return hashValue; // or hashValue % tableSize here or later
7
}

This hash function still has problem: Different names with same letters hash to same value! hashSum("marcel") = hashSum("carmel") = 628

Modify simple hash function-v2#

1
private static int hashWeighted(String key) {
2
    int hashValue = 0;
3
    for(int j = 0; j != key.length(); j++)
4
        hashValue = j * hashValue + (int)key.charAt(j);
5
        // "j *" here is weighting
6
    return hashValue;
7
}

hashWeighted(“marcel”) = 34153
hashWeighted (“carmel”) = 33153
This hash function still has problem

Danger! Overflow ahead!#

hashWeighted(“Brzeczyszczykiewicz”)
= 1,876,978,831,955,724,338
But biggest integer in 32 bits, Integer.MAX_VALUE
= 2,147,483,647 Numeric overflow! Can crash program.

Possible solution avoiding overflow#

1
private static int hashWeighted(String key) {
2
    int hashValue = 0;
3
    for(int j = 0; j != key.length(); j++)
4
        hashValue = (j * hashValue + (int)key.charAt(j)) % tableSize;
5
    return hashValue;
6
}

Some examples of hashing functions#

Division:
h(k) is the remainder (modulus) of key k after division by the size of the table, s.
h(k) = k % s; if k = 7155, s = 97 then h(k) = 74
Truncation:
Take a few of the first or last characters of the key as the hash code. Can work well, provided the characters are evenly distributed.

Mid-square:
the key is squared and the middle digits of the result are used as the hashed value.
k = 7155; k² = 51194025; h(k) = 94 (from 511 94 025)
Folding:
the key is partitioned into several parts and the sum of the parts is used to produce the hash code
k = 7155; 71 + 55 → 126 or 17 + 55 → 72
……

Practice: calculate your name hash value#

Store some names of a group students, so table size should be 26, the key value for each one is simply the name string, e.g. “Tony”
The hash function is Division:
- k is the sum of the ASCII codes of the letters in the name string;
- h(k) is the remainder (modulus) of key k after division by the size of the table, s.
  h(k) = k % s; s = 26

ASCII codes for letters
A	B	C	D	E	F	G	H	I	J	K	L	M
65	66	67	68	69	70	71	72	73	74	75	76	77
N	O	P	Q	R	S	T	U	V	W	X	Y	Z
78	79	80	81	82	83	84	85	86	87	88	89	90
a	b	c	d	e	f	g	h	i	j	k	l	m
97	98	99	100	101	102	103	104	105	106	107	108	109
n	o	p	q	r	s	t	u	v	w	x	y	z
110	111	112	113	114	115	116	117	118	119	120	121	122

Practice#

Design your own hash method
- to calculate the hash value for your name string
- should be different from the example method mentioned above
- try to use easy way

The most handsome boy/most beautiful girl demonstrate

Summary Quiz

https://ks.wjx.top/vm/ttamyHO.aspx

Simple Hashing – with Hash Clash#

Search(Retrieval)#

Use the same hashing function and collision-resolution method as for inserting.
At best can get O(1) – constant time.

Can’t do better than that!

Perfect hash function#

A perfect hash function gives no collisions:
For all x and y:

graph LR A["x ≠ y"] --> B["hash(x) ≠ hash(y)"]

Possible if you know the data.
Collision is when two or more keys have the same hash value.
Rarely possible, but for example in a compiler,
determine whether identifier is a reserved word.
We know all the reserved words, so can construct a perfect hash function.

Desirable characteristics of a hashing function#

Quick to calculate
Gives good spread – different keys map to different bucket/slot numbers (positions in the table)
A perfect hashing function maps every key to a different bucket/slot number
but in practice, sometimes get a ‘clash’ 💣 – more than one key is mapped to the same bucket number.

More about hash functions#

The hash function can potentially generate any hash table index, 0..TableSize-1 so we often use the modulo (%) operator to restrict the hash function range
A hash function directs us to a single storage location for a particular data key value
If we could guarantee that each possible key value maps on to a unique entry (1:1 mapping) then our hashing would be perfect.
Because the hash table in practice is limited in size, and the difficulty of ensuring one-to-one mapping, we need to allow for collisions.

Try to reduce collisions!
Impossible to eliminate!

One Java built-in `hashCode()` method#

$\text{hashCode}(s) = s[0] \times 31^{(n-1)} + s[1] \times 31^{(n-2)} + \dots + s[n-1] \times 31^0$
$e.g.$
$\text{hashcode}(^{\prime\prime}\text{A}^{\prime\prime}) = 65$
$\text{hashcode}(^{\prime\prime}\text{AB}^{\prime\prime}) = 65 \times 31^1 + 66 = 2081$
$\text{hashcode}(^{\prime\prime}\text{ABC}^{\prime\prime}) = 65 \times 31^2 + 66 \times 31^1 + 67 = 64578$
Beware numeric overflow! Use % early.

1
public class Hash10 {
2
    public static void main(String[] args) {
3
        System.out.println("hashCode(\"A\")="+"A".hashCode());
4
        System.out.println("hashCode(\"AB\")="+"AB".hashCode());
5
        System.out.println("hashCode(\"ABC\")="+"ABC".hashCode());
6
        System.out.println("hashCode(\"ABCD\")="+"ABCD".hashCode());
7
    }
8
}

result:

1
hashCode("A")=65
2
hashCode("AB")=2081
3
hashCode("ABC")=64578
4
hashCode("ABCD")=2001986

Probability of hash clashes#

What is the probability that 2 (or more) people in a group of n share a birthday month?
Assume that birthdays are equally and randomly distributed over 12 months. Q(n) = probability that n people have distinct birthday months
Q(2) = 11/12 = 0.9166
Q(3) = 11/12 x 10/12 = 0.7639
Q(4) = 11/12 x 10/12 x 9/12 = 0.5729
etc

P(n) = 1-Q(n) = probability that 2 or more people of the n share a birthday month, so
P(2) = 0.0834
P(3) = 0.236
P(4) = 0.427
P(5) = 0.618
P(6) = 0.777
etc.

5 people > 50%

xychart-beta title "birthday months problem" x-axis "n" [0,2,4,6,8,10,12] y-axis "P(x)" 0 --> 1.0 line "Series1" [0,0.1,0.3,0.6,0.85,0.95,1.0]

Probabilities of clashes in Hash Table#

A surprising example:
- Create a hash table with Table_Size=1,000,000 entries (index 0..Table_Size-1)
- Hash function creates an arbitrary index number (based on the key value) that is uniformly distributed in the range 0..Table_Size-1
- Keep adding hash entries to the table
- The probability that a collision will occur before 2,500 entries are made is 95%, i.e. when the table is only 0.25% full
Conclusion: clashes can occur more frequently than we might intuitively expect.

Collision Resolution – Linear probing#

Hash collisions can be resolved by probing to find an alternative location.
- If a collision occurs [hash(key1)=hash(key2)] look for next table entry that is free
- First calculate:
  - Index1 = hash(Key) (which cell is not available)
  - hash(key) maps to range 0 ..Table_Size-1
- Next calculate
  - Index2 =(Index1 + probe) % Table_Size
- repeating with probe = 1,2,3,.. until a vacant entry is found (probe function)

primary clustering has occurred for keys that hash to index 93 and 94

Hash collisions can be resolved by probing to find an alternative location.
- Advantages:
  - resolves hash clash
  - simple algorithm
- Disadvantages:
  - local clustering around the primary key (termed primary clustering, such as key1~key6)
  - deletion is tricky (it leaves holes)

Linear probing - Number of probes#

the data is stored within the hash table (also called ‘closed hashing’).
We want to increase efficiency for probing = easy to find a vacant cell to store the hash value
= give more space than the storage required

which one is easier to find a vacant cell to store hash code for key1~key6 (collision existed)

Collision resolution: Open addressing#

Open addressing
The average number of probes for a search depends critically on the hash-table load factor, L
L = number_of_items_in_table / Table_Size often expressed as a percentage
Therefore the maximum capacity is simply the hash table size, and maximum load factor is 1.0.

Linear probing - Number of probes#

The average number of probes for a search depends critically on the hash-table load factor
Average number of probes calculation is complex to derive

xychart-beta title "average number of probes" x-axis "load factor" [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] y-axis "log(number of probes)" 0 --> 2.5 line "successful search" [0, 0.05, 0.12, 0.20, 0.30, 0.42, 0.56, 0.73, 0.94, 1.20, 1.50] line "unsuccessful search" [0, 0.15, 0.32, 0.55, 0.85, 1.22, 1.68, 2.10, 2.45]

To solve clustering#

Primary clustering
- When keys hash to the same location and are then stored close by – this leads to primary clustering
- Avoid by allowing the probe function to provide a greater spread of keys
Secondary clustering
- When keys hash to the same index and then follow the same probe sequence (same offset), this leads to secondary clustering
- Avoid with probe offset being non-linear (quadratic) or a function of the key (double hashing)

To solve clustering– Quadratic Probing#

As linear probing but instead of looking for next table entry, we move in jumps of a size involving the square of the probe integer
- First calculate:
  Index1 = hash(Key) (which is not available)
- Next calculate
  Index2 = (Index1 + i²) mod Table_Size
  with i=1,2,3, …
- So jump on by 1, 4, 9, 16, 25 …

Advantages:
- As linear and pseudo-random probing but primary clustering is reduced.
- Also simpler than pseudo-random probing
Disadvantage:
- (of all probing methods) is a degraded performance when the table starts to become full.
- Can lead to infinite loop since empty cells might be repeatedly skipped over. Need to detect this and avoid it. (OK if load factor < 50%.)

To solve clustering– Double Hashing#

can avoid both primary and secondary clustering.
uses two hash functions: initial hash, and probe
Index1 = hash(key)
{ideally maps to 0..Table_Size-1}
If Index1 entry is occupied use a probe function to locate a vacant position
Index2 = (Index1 + i * probe(key)) % Table_Size
(note probe(key) computed only once)
Probe index i = 1, 2, 3,… until a vacant position is found – but you should limit the number of probes to avoid wrap around to the same index and thus possible infinite loop.

Can require fewer probes, giving better performance than linear probing
Ensure probe(key) > 0 , ideally probe(key) maps to 1..Table_Size-1 Linear probing is double hashing when
probe(key) = ? 1

Double hashing#

Double hashing:
hash is the first function
probe is the second function

Open-Addressing Hashing Efficiency#

Average steps of probes

method	successful searches			unsuccessful searches
load factor	50%	80%	90%	50%	80%	90%
linear probing	1.5	3.0	5.5	2.5	13.0	50.0
quadratic probing	1.4	2.2	2.9	2.2	5.8	11.4
double hashing	1.4	2.0	2.6	2.0	5.0	10.0

collision resolution: chaining#

Separate chaining
- start a linked list to store the data for each key for that maps to the hash-table location
- each hash table entry can then accommodate a list of key values that share the same hash value
Since each hash table entry can be a list there is no limit to the hash-table capacity, so the load factor may exceed 1.0.

Linked lists within the hash table handle collisions
Alternative to ‘open addressing’, Linked list of all elements whose keys have same hash index.

collision resolution: chaining example#

1
import java.util.LinkedList;
2
import java.util.ListIterator;
3
public class Hash11 {
4
    public static void main(String[] args) {
5
        final int HTS = 12; // hash table size
6
        LinkedList[] staff=new LinkedList[HTS];
7
        for(int i=0; i<HTS; i++)
8
            staff[i]=new LinkedList();
9
        addStaff(staff,"Sharon");
10
        addStaff(staff,"Chris");
11
        addStaff(staff,"Ian");
12
        addStaff(staff,"David");
13
        addStaff(staff,"Peter");
14
        addStaff(staff,"Muhammad");
15
        addStaff(staff,"Arantza");
16
        addStaff(staff,"Ken");
17
        addStaff(staff,"Richard");
18
        addStaff(staff,"Hong");
19
        addStaff(staff,"William");
20
        addStaff(staff,"Mark");
21
        addStaff(staff,"Bob");
22
        addStaff(staff,"Clare");
23
        addStaff(staff,"Faye");
24
        ListIterator iterator=staff[0].listIterator();
25
        for(int i=0; i<HTS; i++) {
26
            iterator=staff[i].listIterator();
27
            System.out.print("staff["+i+"]: ");
28
            while (iterator.hasNext())
29
                System.out.print(iterator.next()+" ");
30
            System.out.println();
31
        }
32
    }
33
    private static void addStaff(LinkedList[] staff, String key) {
34
        final int HTS = 12; // hash table size
35
        staff[hash(key)].addLast(key);
36
    }
37
    private static int hash(String key) {
38
        final int HTS = 12; // hash table size
39
        return Math.abs(key.hashCode() % HTS);
40
    }
41
}

run:

1
staff[0]: Hong
2
staff[1]: Richard
3
staff[2]:
4
staff[3]: Sharon
5
staff[4]: Muhammad Ken
6
staff[5]: Arantza William Mark Bob
7
staff[6]:
8
staff[7]: Chris Clare
9
staff[8]: David Peter
10
staff[9]:
11
staff[10]: Ian
12
staff[11]: Faye
13
BUILD SUCCESSFUL (total time: 0 seconds)

Hashing with Chaining#

Average number of probes
- For the number of probes (or comparisons), first determine the average chain length for cells having a chain
- In the illustration the average chain length is
  $((1 \times 1) + (1 \times 2) + (1 \times 3)) \div 3 = 2$
- For an unsuccessful search the complete chain is inspected. The number of comparisons is this average chain length, so is 2
- For a successful search on average the search travels midway along the chain. For chain length k, the mid length is $(1+2+...+k)\div k = (k+1) \div 2$ , so with an average chain length of 2 the mid length is $(2+1)\div 2 = 1.5$

Linear Probing vs Chaining#

Linear Probing	Chaining
- Can require less memory (best for small data records) - Best when load factor is low - No pointers or memory allocation, so safe operation - Insertion can be quicker, without memory allocation - Memory references more localized	- Simple to implement - Less sensitive to clustering, but need to monitor chain lengths because searching in a chain is linear, O(n) - Better for high load-factor situations as hash table linked lists do not fill up - Better for large data records

Hashing Performance#

Best Case (ideal) O(1)
Worst case (all keys hash to same slot) O(n)
Performance very dependent on hash table Load Factor
Load Factor = number of items in table / table size
Likelihood of collision strongly related to load factor
Load Factor (L) = 0 represents empty table
- Open Addressing: 0 ≤ L ≤ 1, (hash table can only fill to capacity)
- Chaining: 0 ≤ L, (so L can exceed 1)
Typically performance becomes poor when table > 80% full
Rebuild with a bigger table to improve performance if table becomes full

Disadvantages of hash tables#

The size of the table is fixed and must be estimated in advance. It cannot be changed at runtime.
Table size should be about 10% larger than the maximum expected number of entries.
Hashing algorithms are designed for efficient insertion and retrieval. Deletions are cumbersome and should be avoided. (Don’t just replace entry by null – “example: Where’s Wally”)
Use ‘lazy deletion’; don’t really remove – just mark entry as inactive.
Contents not stored in order, so sorting needed.
The space needed for pointers for chaining could be used for a larger table.

Summary#

Access time is not dependent on the size of the table, at best O(1)
Good for storing data not directly representable as array indexes,
or for storing sets of data from very large data sets.
Key values mapped to index by hashing function.
Different keys mapping to same index means collision.
Collision resolution:
- ‘open addressing’ using various ‘probing’ algorithms;
- chaining (most efficient, but uses more memory).
Hashing efficiency depends critically on hash-table load factor and degree of clustering

References#

Robert Sedgewick, Algorithms in Java, Addison-Wesley, Ch 14.
Mark Allen Weiss, Data Structures and Algorithm Analysis, Addison-Wesley, Ch 5
Cormen, Leiserson and Rivest, Introduction to Algorithms, Ch 12
Niklaus Wirth, Algorithms+Data Structures=Programs, Prentice-Hall
Wikipedia: https://en.wikipedia.org/wiki/Hash_table

Extra: history#

The idea of hashing arose independently in different places.
In January 1953, H. P. Luhn wrote an internal IBM memorandum that used hashing with chaining.
Gene Amdahl, Elaine M. McGraw, Nathaniel Rochester, and Arthur Samuel implemented a program using hashing at about the same time. Open addressing with linear probing (relatively prime stepping) is credited to Amdahl, but Ershov (in Russia) had the same idea.

音乐

音乐

Data Structures and Algorithms Week3: Associative Arrays and Hash tables

Associative Arrays#

Array#

Actual example: exam scoring#

We need to map a key to a value#

Example: email-address lookup#

Associative array#

Possible example: sequence of entries#

Sequence of entries : put#

Sequence of entries: findPos#

Simple sequence gives O(n) behaviour#

Ordered sequence: findPos#

Ordered sequence: put#

Ordered not much improvement#

Answer: Hash tables#

Java API HashMap : Method Summary#

Hash tables#

Hash tables (associative arrays)#

Typical scenario#

What’s the problem#

What’s the problem - realistic demand#

What’s the problem- efficiency#

Search-Algorithm Performance#

’Associative’ array#

‘Hashing’#

Example: finding someone quickly#

How does hashing work#

What is a hash function#

A hashing example#

Simple Hashing#

Too-simple hash function#

Modify simple hash function-v1#

Modify simple hash function-v2#

Danger! Overflow ahead!#

Possible solution avoiding overflow#

Some examples of hashing functions#

Practice: calculate your name hash value#

Practice#

Simple Hashing – with Hash Clash#

Search(Retrieval)#

Perfect hash function#

Desirable characteristics of a hashing function#

More about hash functions#

One Java built-in hashCode() method#

Probability of hash clashes#

Probabilities of clashes in Hash Table#

Collision Resolution – Linear probing#

Linear probing - Number of probes#

Collision resolution: Open addressing#

Linear probing - Number of probes#

To solve clustering#

To solve clustering– Quadratic Probing#

To solve clustering– Double Hashing#

Double hashing#

Open-Addressing Hashing Efficiency#

collision resolution: chaining#

collision resolution: chaining example#

Hashing with Chaining#

Linear Probing vs Chaining#

Hashing Performance#

Disadvantages of hash tables#

Summary#

References#

Extra: history#

支持与分享

评论区

音乐

目录

One Java built-in `hashCode()` method#