refactor: Improved clarity
4ndrelim committed Sep 29, 2024
1 parent cd3c465 commit 48870c9
Showing 12 changed files with 247 additions and 94 deletions.
20 changes: 10 additions & 10 deletions README.md
@@ -7,7 +7,7 @@ It is aligned with [CS2040s](https://nusmods.com/courses/CS2040S/data-structures-and-algorithms

The work here is continually being developed by CS2040s Teaching Assistants (TAs) and ex-2040s students,
under the guidance of Prof Seth.
It is still in its infant stage, mostly covering lecture content and discussion notes.
It mostly covers lecture content and discussion notes.
Future plans include deeper discussion of the tougher parts of tutorials and even practice problems / puzzles related
to DSA.

@@ -30,13 +30,13 @@ Gradle is used for development.
- [Linked List](src/main/java/dataStructures/linkedList)
- [LRU Cache](src/main/java/dataStructures/lruCache)
- Minimum Spanning Tree
* Kruskal
* Prim's
* [Kruskal](src/main/java/algorithms/minimumSpanningTree/kruskal)
* [Prim's](src/main/java/algorithms/minimumSpanningTree/prim)
* Boruvka
- [Queue](src/main/java/dataStructures/queue)
- [Deque](src/main/java/dataStructures/queue/Deque)
- [Monotonic Queue](src/main/java/dataStructures/queue/monotonicQueue)
- Segment Tree
- [Segment Tree](src/main/java/dataStructures/segmentTree)
- [Stack](src/main/java/dataStructures/stack)
- [Segment Tree](src/main/java/dataStructures/segmentTree)
- [Trie](src/main/java/dataStructures/trie)
@@ -47,10 +47,10 @@ Gradle is used for development.
* [Template](src/main/java/algorithms/binarySearch/binarySearchTemplated)
- [Counting Sort](src/main/java/algorithms/sorting/countingSort)
- [Cyclic Sort](src/main/java/algorithms/sorting/cyclicSort)
* [Special case](src/main/java/algorithms/sorting/cyclicSort/simple) of O(n) time complexity
* [Generalized case](src/main/java/algorithms/sorting/cyclicSort/generalised) of O(n^2) time complexity
* [Special case](src/main/java/algorithms/sorting/cyclicSort/simple)
* [Generalized case](src/main/java/algorithms/sorting/cyclicSort/generalised)
- [Insertion Sort](src/main/java/algorithms/sorting/insertionSort)
- [Knuth-Morris-Pratt](src/main/java/algorithms/patternFinding) aka KMP algorithm
- [Knuth-Morris-Pratt](src/main/java/algorithms/patternFinding) (KMP algorithm)
- [Merge Sort](src/main/java/algorithms/sorting/mergeSort)
* [Recursive](src/main/java/algorithms/sorting/mergeSort/recursive)
* [Bottom-up iterative](src/main/java/algorithms/sorting/mergeSort/iterative)
@@ -76,8 +76,8 @@ Gradle is used for development.
* [Selection](src/main/java/algorithms/sorting/selectionSort)
* [Merge](src/main/java/algorithms/sorting/mergeSort)
* [Quick](src/main/java/algorithms/sorting/quickSort)
* [Hoare's](src/main/java/algorithms/sorting/quickSort/hoares)
* [Lomuto's](src/main/java/algorithms/sorting/quickSort/lomuto) (Not discussed in CS2040s)
* [Hoare's](src/main/java/algorithms/sorting/quickSort/hoares) (this version is the one shown in lecture!)
* [Lomuto's](src/main/java/algorithms/sorting/quickSort/lomuto)
* [Paranoid](src/main/java/algorithms/sorting/quickSort/paranoid)
* [3-way Partitioning](src/main/java/algorithms/sorting/quickSort/threeWayPartitioning)
* [Counting Sort](src/main/java/algorithms/sorting/countingSort) (found in tutorial)
@@ -88,7 +88,7 @@ Gradle is used for development.
* [Trie](src/main/java/dataStructures/trie)
* [B-Tree](src/main/java/dataStructures/bTree)
* [Segment Tree](src/main/java/dataStructures/segmentTree) (Not covered in CS2040s but useful!)
* Red-Black Tree (Not covered in CS2040s but useful!)
* Red-Black Tree (**WIP**)
* [Orthogonal Range Searching](src/main/java/algorithms/orthogonalRangeSearching)
* Interval Trees (**WIP**)
5. [Binary Heap](src/main/java/dataStructures/heap) (Max heap)
6 changes: 0 additions & 6 deletions src/main/java/algorithms/patternFinding/KMP.java
@@ -107,12 +107,6 @@ public static List<Integer> findOccurrences(String sequence, String pattern) {
pTrav += 1;
sTrav += 1;
}
// ALTERNATIVELY
// if pTrav == 0 i.e. nothing matched, move on
// sTrav += 1
// continue
//
// pTrav = prefixTable[pTrav]
}
}
return indicesFound;
11 changes: 6 additions & 5 deletions src/main/java/algorithms/sorting/cyclicSort/README.md
@@ -3,7 +3,8 @@
## Background

Cyclic sort is a comparison-based, in-place algorithm that performs sorting (generally) in O(n^2) time.
Though under some conditions (discussed later), the best case could be done in O(n) time.
Under some special conditions (discussed later), the algorithm is non-comparison based and
the best case can be achieved in O(n) time. This is the version that tends to be used in practice.

### Implementation Invariant

@@ -24,16 +25,16 @@ This allows cyclic sort to have a time complexity of O(n) for certain inputs.

We discuss more implementation-specific details and complexity analysis in the respective folders. In short,

1. The [**simple**](./simple) case discusses the non-comparison based implementation of cyclic sort under
1. The [**simple**](./simple) case discusses the **non-comparison based** implementation of cyclic sort under
certain conditions. This allows the best case to be better than O(n^2).
2. The [**generalised**](./generalised) case discusses cyclic sort for general inputs. This is comparison-based and is
usually implemented in O(n^2).
typically implemented in O(n^2).

Note that, in practice, the generalised case is hardly used. There are more efficient algorithms for sorting,
e.g. merge sort and quick sort. If the concern is the number of swaps, generalised cyclic sort does indeed require fewer
swaps, but likely not fewer than selection sort's.

In other words, cyclic sort is specially designed for situations where the elements to be sorted are
known to fall within a specific, continuous range, such as integers from 1 to n, without any gaps or duplicates.
In other words, **cyclic sort is specially designed for situations where the elements to be sorted are
known to fall within a specific, continuous range, such as integers from 1 to n, without any gaps or duplicates.**
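
For concreteness, below is a minimal sketch of the generalised, comparison-based case. It is illustrative only and not
necessarily identical to the code in the [generalised](./generalised) folder: for each cycle start, count how many
elements are smaller than the current item to find its final position, place it there, and keep rotating the rest of
the cycle into place.

```java
public class GeneralisedCyclicSortSketch {
    /**
     * Sorts arr in-place in O(n^2) time, writing each element into its final
     * position at most once per cycle (useful when writes are expensive).
     */
    public static void sort(int[] arr) {
        int n = arr.length;
        for (int cycleStart = 0; cycleStart < n - 1; cycleStart++) {
            int item = arr[cycleStart];

            // Find where item belongs by counting the elements smaller than it.
            int pos = cycleStart;
            for (int i = cycleStart + 1; i < n; i++) {
                if (arr[i] < item) {
                    pos++;
                }
            }
            if (pos == cycleStart) {
                continue; // item is already in its correct position
            }
            while (item == arr[pos]) {
                pos++; // step past duplicates of item
            }
            int tmp = arr[pos];
            arr[pos] = item;
            item = tmp;

            // Rotate the remainder of the cycle until we return to cycleStart.
            while (pos != cycleStart) {
                pos = cycleStart;
                for (int i = cycleStart + 1; i < n; i++) {
                    if (arr[i] < item) {
                        pos++;
                    }
                }
                while (item == arr[pos]) {
                    pos++;
                }
                tmp = arr[pos];
                arr[pos] = item;
                item = tmp;
            }
        }
    }
}
```

Each placement requires an O(n) scan to locate the item's position, which is where the overall O(n^2) bound comes from.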


27 changes: 26 additions & 1 deletion src/main/java/algorithms/sorting/cyclicSort/simple/README.md
@@ -10,7 +10,31 @@ This is typically applicable when sorting a sequence of integers that are in a c
or can be easily mapped to such a range. We illustrate the idea with n integers from 0 to n-1.

In this implementation, the algorithm is **not comparison-based**! (unlike the general case).
It makes use of the known inherent ordering of the numbers, bypassing the nlogn lower bound for most sorting algorithms.
It makes use of the known inherent ordering of the numbers,
bypassing the `nlogn` lower bound that applies to comparison-based sorting algorithms.

<details>
<summary> <b>Duplicates</b> </summary>
This version is not designed to handle duplicates. When duplicates are present, the algorithm can run into issues,
such as overwriting elements or getting stuck in infinite loops,
because it assumes that each element has a unique position in the array.

If you need to handle duplicates, modifications are required,
such as checking for duplicate values before placing elements,
which can impact the simplicity and efficiency (possibly degrade to `O(n^2)`) of the algorithm.
</details>

<details>
<summary> <b>Inherent Ordering..?</b> </summary>
This property allows the sorting algorithm to avoid comparing elements with each other
and instead directly place each element in its correct position.

For example, if sorting integers from 0 to n-1, the number 0 naturally belongs at index 0, 1 at index 1, and so on.
This inherent structure allows Cyclic Sort to achieve `O(n)` time complexity,
bypassing the typical `O(nlogn)` time bound of comparison-based sorting algorithms
([proof](https://tildesites.bowdoin.edu/~ltoma/teaching/cs231/fall07/Lectures/sortLB.pdf))
by using the known order of elements rather than making comparisons to determine their positions.
</details>
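
To make this concrete, here is a minimal sketch for a permutation of 0 to n-1 (no duplicates, as discussed above).
It is illustrative only and may differ from the implementation in this folder.

```java
public class SimpleCyclicSortSketch {
    /**
     * Sorts a permutation of {0, 1, ..., n-1} in O(n) time and O(1) auxiliary space.
     * Relies on the inherent ordering: value v belongs at index v.
     */
    public static void sort(int[] arr) {
        int i = 0;
        while (i < arr.length) {
            int correctIdx = arr[i]; // index where arr[i] belongs
            if (arr[i] != arr[correctIdx]) {
                // Swap arr[i] into its correct position. Do not advance i yet:
                // the element swapped in may itself be out of place.
                int tmp = arr[correctIdx];
                arr[correctIdx] = arr[i];
                arr[i] = tmp;
            } else {
                i++; // arr[i] is already where it belongs
            }
        }
    }
}
```

Every swap places at least one element into its final position, so there are at most n swaps in total, giving the
O(n) bound.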

## Complexity Analysis

@@ -48,3 +72,4 @@ otherwise there would be a contradiction.
and sorting needs to be done in O(1) auxiliary space.
2. The implementation here uses integers from 0 to n-1. This can be easily modified for n contiguous integers starting
at some arbitrary number (simply offset by this start number).
3. This version of cyclic sort does not handle duplicates (at the very least, the O(n) bound would no longer be guaranteed).
22 changes: 19 additions & 3 deletions src/main/java/dataStructures/avlTree/README.md
@@ -13,6 +13,19 @@ Here we discuss a type of self-balancing BST, known as the AVL tree, that avoids
across the operations by ensuring careful updating of the tree's structure whenever there is a change
(e.g. insert or delete).

<details>
<summary> <b>Terminology</b> </summary>
<li>
Level: Refers to the number of edges from the root to that particular node. Root is at level 0.
</li>
<li>
Depth: The depth of a node is the same as its level; i.e. how far a node is from the root of the tree.
</li>
<li>
Height: The number of edges on the longest path from that node to a leaf. A leaf node has height 0.
</li>
</details>

### Definition of Balanced Trees
Balanced trees are a special subset of trees with **height in the order of log(n)**, where n is the number of nodes.
This choice is not an arbitrary one. It can be mathematically shown that a binary tree of n nodes has height of at least
Expand All @@ -39,8 +52,11 @@ former.

<details>
<summary> <b>Ponder..</b> </summary>
Consider any two nodes (need not have the same immediate parent node) in the tree. Is the difference in height
between the two nodes <= 1 too?
Can an AVL tree exist in which there are 2 leaf nodes whose depths differ by more than 1? What about 2? 10?
<details>
<summary> <b>Answer</b> </summary>
Yes! In fact, you can always construct a large enough AVL tree in which the depths of some 2 leaf nodes differ by more than any arbitrary x!
</details>
</details>

It can be mathematically shown that a **height-balanced tree with n nodes, has at most height <= 2log(n)** (
@@ -75,7 +91,7 @@ Hence, we need some re-balancing operations. To do so, tree rotation operations

Prof Seth explains it best! Go re-visit his slides (Lecture 10) for the operations :P <br>
Here is a [link](https://www.youtube.com/watch?v=dS02_IuZPes&list=PLgpwqdiEMkHA0pU_uspC6N88RwMpt9rC8&index=9)
for prof's lecture on trees. <br>
to prof's lecture on trees. <br>
_We may add a summary in the near future._
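
In the meantime, here is a small illustrative sketch of the bookkeeping behind rotations (the names are ours and not
necessarily those used in this folder): each node caches its height, and a right rotation restores balance while
keeping those cached heights up to date.

```java
class AvlNodeSketch {
    int key;
    int height; // height of the subtree rooted here; a single leaf has height 0
    AvlNodeSketch left;
    AvlNodeSketch right;

    AvlNodeSketch(int key) {
        this.key = key;
    }

    static int height(AvlNodeSketch node) {
        return node == null ? -1 : node.height; // an empty subtree has height -1
    }

    static void updateHeight(AvlNodeSketch node) {
        node.height = 1 + Math.max(height(node.left), height(node.right));
    }

    /** Balance factor = height(left) - height(right); the AVL invariant keeps this within {-1, 0, 1}. */
    static int balanceFactor(AvlNodeSketch node) {
        return height(node.left) - height(node.right);
    }

    /** Right rotation about root (its left child must exist); used when the left subtree is too tall. */
    static AvlNodeSketch rotateRight(AvlNodeSketch root) {
        AvlNodeSketch newRoot = root.left;
        root.left = newRoot.right;
        newRoot.right = root;
        updateHeight(root);    // the old root is now lower in the tree, so update it first
        updateHeight(newRoot);
        return newRoot;
    }
}
```

A left rotation mirrors this, and insert/delete walk back up the tree applying whichever rotation the balance factor
calls for.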

## Application
105 changes: 88 additions & 17 deletions src/main/java/dataStructures/bTree/README.md
@@ -104,26 +104,97 @@ Image Source: https://www.geeksforgeeks.org/insert-operation-in-b-tree/
The delete operation follows a similar idea to the insert operation, but involves many more edge cases. If you are
interested in learning about it, you can read more [here](https://www.geeksforgeeks.org/delete-operation-in-b-tree/).

## Application
There are many uses of B-Trees but the most common is their utility in database management systems in handling large
datasets by optimizing disk accesses.
## Application: Index Structure
B+ trees tend to be used in practice over vanilla B-trees.
The B+ tree is a specific variant of the B-tree that is optimized for efficient data retrieval from disk
and range queries.

Large amounts of data have to be stored on the disk. But disk I/O operations are slow and not knowing where to look
for the data can drastically worsen search time. B-Tree is used as an index structure to efficiently locate the
desired data. Note, the B-Tree itself can be partially stored in RAM (higher levels) and partially on disk
(lower, less freq accessed levels).
We will discuss two common applications of B+ trees: **database indexing** and **file system indexing**.

Consider a database of all the CS modules offered in NUS. Suppose there is a column "Code" (module code) in the
"CS Modules" table. If the database has a B-Tree index on the "Code" column, the keys in the B-Tree would be the
module code of all CS modules offered.
---

Each key in the B-Tree is associated with a pointer, that points to the location on the disk where the corresponding
data can be found. For e.g., a key for "CS2040s" would have a pointer to the disk location(s) where the row(s)
(i.e. data) with "CS2040s" is stored. This efficient querying allows the database quickly navigate through the keys
and find the disk location of the desired data without having to scan the whole "CS Modules" table.
### Indexing Structure

The choice of t will impact the height of the tree, and hence how fast the query is. Trade-off would be space, as a
higher t means more keys in each node, and they would have to be (if not already) loaded to RAM.
B+ trees are often used to efficiently manage large amounts of data stored on disk.
They do not store the actual data itself but instead store **pointers** (or references)
to where the data is located on the disk.

#### Pointer / Reference
A pointer in the context of a B+ tree refers to some piece of information that can be used to
retrieve actual data from the disk. Some common examples include:
- **Disk address/block number**
- **Filename with offset**
- **Database page and record ID**
- **Primary key ID**

<details>
<summary> <b>File System Indexing</b> </summary>

### B+ Trees for File System Indexing

File system indexing refers to the process by which an operating system organizes and manages files on
storage media (such as hard drives, SSDs) to enable efficient file retrieval, searching, and management.
It involves creating and maintaining indexes (similar to those in a database) that help quickly locate files,
directories, and their metadata (like file names, attributes, permissions, and timestamps).

#### Workflow:
- The **root node** of a B+ tree is typically stored in **RAM** to speed up access.
- **Nodes** in the tree contain keys and child pointers to other nodes.
- **Intermediate nodes** do not store actual data but guide the search process toward the leaf nodes.
- **Leaf nodes** either contain the actual data or pointers to the data stored on disk.
This is where the data retrieval process ends.

#### Optimized Disk I/O:
B+ trees are optimized for disk I/O, especially for **range queries**.
The tree nodes are designed to fit into disk pages, meaning a single disk read operation can bring in multiple keys
and pointers. This reduces the overall number of disk accesses required and efficiently utilizes disk pages.

#### Range Queries:
B+ trees are particularly effective for **range queries**. Since the leaf nodes in a B+ tree are linked together
(typically via a **doubly linked list**), this makes sequential access for range queries efficient.
For example, in a file system, this allows fetching multiple adjacent keys (like file names in a directory)
without requiring additional disk I/O.
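
To make the linked-leaf idea concrete, here is a small hypothetical sketch (not tied to any real file system or to the
code in this repository): each leaf holds sorted keys paired with disk block ids, and leaves are chained so a range
scan walks sideways instead of re-descending the tree.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical B+ tree leaf: sorted keys, each paired with a "pointer" (here, a disk block id). */
class BPlusLeafSketch {
    long[] keys;          // keys stored in this leaf, in sorted order
    long[] blockIds;      // blockIds[i] records where the data for keys[i] lives on disk
    BPlusLeafSketch next; // link to the next leaf in key order

    BPlusLeafSketch(long[] keys, long[] blockIds, BPlusLeafSketch next) {
        this.keys = keys;
        this.blockIds = blockIds;
        this.next = next;
    }

    /** Collects block ids for all keys in [lo, hi], starting from the leaf that contains lo. */
    static List<Long> rangeScan(BPlusLeafSketch startLeaf, long lo, long hi) {
        List<Long> result = new ArrayList<>();
        for (BPlusLeafSketch leaf = startLeaf; leaf != null; leaf = leaf.next) {
            for (int i = 0; i < leaf.keys.length; i++) {
                if (leaf.keys[i] < lo) {
                    continue;      // before the range; keep scanning
                }
                if (leaf.keys[i] > hi) {
                    return result; // past the range; stop, since leaves are in key order
                }
                result.add(leaf.blockIds[i]);
            }
        }
        return result;
    }
}
```

In a real file system the leaves would live on disk pages and `next` would be a page id rather than an in-memory
reference, but the traversal pattern is the same.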

</details>

<details>
<summary> <b>SQL Engines</b> </summary>

### B+ Trees in SQL Engines

In **MySQL**, B+ trees are extensively used in the **InnoDB** storage engine
(the default storage engine for MySQL databases).

#### Primary Key Index (Clustered Index):
In **InnoDB**, the primary key is always stored in a **clustered index**.
This means the leaf nodes of the B+ tree store the actual rows of the table.
In a clustered index, the rows are physically stored in the order of the primary key,
making retrieval by primary key highly efficient.

#### Secondary Indexes:
For secondary indexes in MySQL (specifically in InnoDB),
once the B+ tree for the secondary index is navigated to the leaf node, the following process occurs:

1. **Secondary Index B+ Tree**: The leaf nodes store the indexed column value (e.g., `last_name`)
along with a reference to the primary key (e.g., `emp_id`).
2. **Reference to Primary Key**: This reference (the primary key value) is used to look up the actual data
in the **clustered index** (which is also a B+ tree). The clustered index stores the entire row data in its leaf nodes.

#### Detailed Process:
- **Step 1**: MySQL navigates the secondary index tree based on the query condition (e.g. a range query on `last_name`)
- The internal nodes guide the search, and the leaf node contains the `last_name`
value and the corresponding primary key (`emp_id`).

- **Step 2**: Once MySQL reaches the leaf node of the secondary index B+ tree, it retrieves the primary key (`emp_id`).

- **Step 3**: MySQL uses this primary key to directly access the **clustered index** (the B+ tree for the primary key).
- It navigates the primary key B+ tree to locate the row in its leaf nodes, where the full row data
(e.g., `emp_id`, `last_name`, `first_name`, `salary`) is stored.

> **Note**: If multiple results match a query on the secondary index,
the leaf nodes of the secondary index B+ tree will store multiple primary keys corresponding to the matching rows.
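
As a rough illustration of this two-step lookup, here is a simplified model that uses ordinary `TreeMap`s as stand-ins
for the two B+ trees (this is not InnoDB's actual structure; the `emp_id` / `last_name` names simply follow the example
above).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Simplified model of a secondary-index lookup followed by a clustered-index lookup. */
class SecondaryIndexLookupSketch {
    /** Clustered index stand-in: primary key (emp_id) -> full row. */
    static final TreeMap<Integer, String[]> clusteredIndex = new TreeMap<>();

    /** Secondary index stand-in: last_name -> primary keys of matching rows. */
    static final TreeMap<String, List<Integer>> secondaryIndex = new TreeMap<>();

    /** Conceptually: SELECT * FROM employees WHERE last_name = ? */
    static List<String[]> findByLastName(String lastName) {
        List<String[]> rows = new ArrayList<>();
        // Steps 1-2: descend the secondary index to obtain the matching primary keys.
        List<Integer> empIds = secondaryIndex.getOrDefault(lastName, List.of());
        // Step 3: use each primary key to fetch the full row from the clustered index
        // (in InnoDB this is another B+ tree descent; here it is just a map lookup).
        for (int empId : empIds) {
            rows.add(clusteredIndex.get(empId));
        }
        return rows;
    }
}
```

The takeaway is that a secondary-index hit costs two index traversals: one over the secondary tree and one over the
clustered tree keyed by the primary key.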

</details>

## References
This description heavily references CS2040S Recitation Sheet 4.
CS2040S Recitation Sheet 4.
