The meaning of parameter server

When I solve the large-scale machine learning problems, many machines should share the states such as parameters, cluster information, user profiles, and so on. (Lots of training data creates powerful 10^9 to 10^12 parameters.[2])

Parameter servers store shared states like parameters, and all worker nodes share the states with frequently accessing.

Challenges for sharing parameter. [3]

  1. Accessing the parameters requires an enormous amount of network bandwidth.
  2. Many machine learning algorithms are sequential. Synchronization hurts the performance.
  3. Fault Tolerance is critical.



[2]K. Canini, Sibyl: A system for large-scale supervised machine learning, Technical Talk, 2012

[3] L. Mu, Scaling Distributed Machine Learning with the Parameter Server, OSDI’14

Definition of security related words

referenced by [1]

Side channels: Accidental leakage of sensitive data

Covert channels: Exploited by a Trojan to deliberately leak information

Timing channels: Exploited through timing variation

Storage channels: Exploited some functional aspect of the system-something that is directly visible in software, such as a register value or the return value of a system call

Attack types
  1. Exploiting cache


Prime the cache by filing one or more sets with its own lines -> probes by timing accesses to its previously-loaded lines, to see if any were evicted


– Inverse of prime and probe

– Flush a shared line of interest by using dedicated instructions or by eviction through contention -> reload the evicted line by touching it, measuring the time taken


– The attacker first pre-load victim run, and establish a baseline execution time -> Evict a line of interest, and runs the victim again. -> a variation in execution time indicates the line of interest was accessed

Cover Channel Techniques

– The sender encodes a bit by accessing some number of cache sets. The receiver decodes the message by measuring the time to access its own colliding lines

Unmasking the Victim’s Layout

– The task’s virtual address space generally sets at compile time and highly predictable. Attacker should have some knowledge about the virtual or physical layout of the victim in order to exploit some attacks(?)

[1] A Survey of Microarchitectural Timing Attacks and Countermeasures on Contemporary Hardware

Prefetch Accuracy & Coverage

There are two metrics to evaluate the performance of prefetcher.

First one is accuracy. The notion of accuracy is the ratio of the number of useful prefetches to the total number of prefetches.

Second one is coverage. The meaning of coverage is how many misses can be covered by prefetch. Unlike the specific formula of the accuracy, there are several forms of calculating coverage.

  1. # of prefetch misses / # of total cache misses [1]
  2. 100 x (# of prefetch hits / (# of prefetch hits + # of total cache misses)) [2]
  3. # of useful prefetch/# of total cache misses [3]





Cuckoo hashing


Cuckoo hashing은 기존의 hash table에서 같은 bucket에 대한 value에 대해서 chaining (Separate chaining)을 이루는 것과 다르게 open address방식으로 value들은 모두 entry array에 두고, 그에 대한 address는 처음의 hash function과 다를 수 있게 하여서 collision을 줄여 주었다.

Separate chaining[1]

Separate Chaining

Key에 대한 hash function으로 결정된 bucket에서 collision이 일어나면 같은 bucket에 해당하는 key value를 linked list의 형식으로 연결 시켜준다. 이렇게 하면 collision이 해결된다. 하지만 hash function으로 인해서 정해지는 bucket의 갯수가 몇 개로 제한이 되어있으면 bias된 bucket의 linked list가 길어질 수 있고, 사용되지 않는 bucket의 갯수가 늘어날 수 있다.

Cuckoo hashing

하나의 key에 대해서 두 개의 hash function(H1, H2)을 사용하여 H1에 대한 address가 empty라면 그 곳에 value를 넣어두고, H1에 value가 있고, H2가 empty라면 그 곳에 value를 넣어주는 방식을 취하고 있다. 만약 두 개의 hash function에 대해서 모두 value가 있다고 하면 처음의 value를 displace하고, evicted된 value는 그에 대한 다른 hash function을 이용하여 empty bucket을 찾는 방법을 취한다. 이런 방법으로 최대한 bucket에 key-value를 넣어두고, 가능하지 않는 경우에만 re-hashing을 하던, table의 size를 늘리는 형태를 취하게 된다.

Cuckoo hashing[2]

MSR(MicroSoft Research)에서는 이러한 Cuckoo hashing을 좀더 practical하게 사용할 수 있도록 연구하였다.[3] Conservative cuckoo hashing의 경우에는 bucket의 entry(slot)의 갯수를 1개로 정하였는데, 이를 8개로 늘리게 되면(element가 8 byte라고 가정) 하나의 bucket에 대해서 64 byte 즉 1개의 cache-line size가 되기 때문에 memory access 한번에 하나의 bucket의 정보를 가져올 수 있게 되고 이는 space-efficient하고, spatial locality를 고려한 hashing이 된다. 이는 bucket의 collision을 최대한 줄일 수 있으므로 효율적이게 된다.


  1. The time complexity of Lookup is O(1) in the worst case, Insertion and Deletion is amortized O(1).


  1. If the size of table grows, the insertion time is higher, so it greatly slows down the performance.




Recommended document to understand cuckoo hashing easily


Gem5 Documentation


You can get the source code of gem5 through Mercural( which is a kind of source management tool.

Using the command “hg clone” 

To build gem5, it needs some dependency programs & librarys

Please check the version of programs

  • g++: version 4.8 or newer
  • python: version 2.6-2.7
  • scons: version 0.98.1 or newer
  • swig: 2.0.4 or newer

Full system

We can compile the gem5 simulator with the tool which name is “scons”. (

In the gem5-stable/build_opts folder, there are several configuration files which specify ISA, cache coherence protocol for ruby, enabling full system, cpu types to build.

There are several options of gem5 binary(gem5.debug, gem5.opt,, gem5.perf) I will use gem5.opt.

For building full system mode with X86 ISA, you should make a file with enabling full system.

Build gem5 with scons tool

To run the gem5 with full system mode, you should have X86 linux kernel binary, disk, and swap image.

Also, you should set the PATH to the location of disk and kernel.

Then, you can use linux kernel version(provided image), and disk which parsec binaries and inputs are installed. 

System Call Emulation

In a same manner of full system mode, you can build gem5 with configuration file. (without enabling full system)


Full system

You can run the gem5 with gem5.opt binary. If you type the command like below, the binary boot the specified kernel and mount the specified disk.

You can access the kernel through telnet with the port #3456 like below screenshot.

System Call Emulation

Similar to the way of full system mode…