Bloom Checker is a fast, memory-efficient tool for checking whether an email or other dataset item is present in a database. It uses a Bloom filter to answer membership queries quickly and with little memory, which makes it well suited to large datasets.
Imagine an email verification service that needs to check if millions of email addresses exist in a database. A common implementation might look like this:

    def check_email(email):
        # First, check the cache
        if cache.get(email):
            return True
        # If not in the cache, check the database
        if database.exists(email):
            cache.set(email, True)
            return True
        return False

This approach faces two significant challenges:
- Cache Miss: When a valid email isn't in the cache but exists in the database:

      Client → Cache (Miss) → Database (Found) → Update Cache

  This creates one extra unnecessary lookup, but it's manageable.

- Cache Penetration: When checking non-existent emails:

      Client → Cache (Miss) → Database (Not Found) → No Cache Update

  This becomes problematic when:
  - Attackers deliberately query non-existent emails
  - Each query unnecessarily hits both cache and database
  - System resources are wasted on known-invalid queries

Bloom Checker solves this by adding a Bloom Filter as a preliminary check:
Client → Bloom Filter → Cache → Database
When checking an email:
- If Bloom Filter says "No" → Email definitely doesn't exist (stop here)
- If Bloom Filter says "Yes" → Email might exist (proceed to cache/database)
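
The lookup order described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual implementation: the `BloomFilter` class, the double-hashing scheme built on SHA-256, and the plain `dict` and `set` standing in for the cache and the database are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a bit array of `size` bits probed `num_hashes` times."""

    def __init__(self, size: int, num_hashes: int) -> None:
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray((size + 7) // 8)

    def _positions(self, item: str):
        # Derive all probe positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # forced odd, so never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def check_email(email, bloom, cache, database):
    """Bloom filter -> cache -> database, stopping at the first definite answer."""
    if email not in bloom:
        return False  # definitely absent: cache and database are never touched
    if cache.get(email):
        return True
    if email in database:
        cache[email] = True
        return True
    return False  # the Bloom filter gave a false positive


# Hypothetical stand-ins: a set as the database, a dict as the cache.
database = {"alice@example.com", "bob@example.com"}
cache = {}
bloom = BloomFilter(size=1024, num_hashes=7)
for address in database:
    bloom.add(address)

print(check_email("alice@example.com", bloom, cache, database))
print(check_email("nobody@example.com", bloom, cache, database))
```

A Bloom filter never produces false negatives, so every address that was added passes the first check. A false positive only costs one extra cache and database lookup; it never yields a wrong final answer.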
Real-world example:

    # Without Bloom Filter:
    check_email("[email protected]")  # Cache miss + DB query wasted
    check_email("[email protected]")  # Cache miss + DB query wasted
    check_email("[email protected]")  # Cache miss + DB query wasted

    # With Bloom Checker:
    check_email("[email protected]")  # Bloom Filter: No (stops here)
    check_email("[email protected]")  # Bloom Filter: No (stops here)
    check_email("[email protected]")  # Bloom Filter: No (stops here)

Benefits:
- Protects against DoS attacks using non-existent emails
- Reduces unnecessary database load
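- Extremely memory efficient (10 million emails ≈ 15 MB of memory; the exact size depends on the configured false positive rate)
- Quick response times: each lookup is O(k), where k is the number of hash functions, regardless of how many emails are stored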
Features:
- Fast Email Verification: Quickly checks whether an email is probably in the database or definitely not.
- Bloom Filter Algorithm: Implements the space-efficient probabilistic data structure to minimize memory usage.
- Low False Positive Rate: Configurable false positive rates to suit different application needs.
- Customizable Parameters: Adjust the size of the Bloom Filter and the number of hash functions based on the dataset size.
- Graphical User Interface (GUI): Intuitive and easy-to-use interface built with Tkinter.
- File Input: Reads email lists from CSV files and displays the verification results in the interface.
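
The relationship between dataset size, false positive rate, filter size, and number of hash functions follows the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The sketch below applies them; the function name `bloom_parameters` is hypothetical and is not taken from `main.py`.

```python
import math


def bloom_parameters(n_items: int, fp_rate: float) -> tuple[int, int]:
    """Return (bits, hash_count) for n_items at the target false positive rate."""
    # m = -n * ln(p) / (ln 2)^2, rounded up to a whole number of bits
    bits = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    # k = (m / n) * ln 2, rounded to the nearest whole hash function
    hashes = max(1, round(bits / n_items * math.log(2)))
    return bits, hashes


for rate in (0.01, 0.001):
    bits, hashes = bloom_parameters(10_000_000, rate)
    print(f"p={rate}: {bits / 8 / 1_000_000:.1f} MB, k={hashes}")
```

For 10 million items this works out to roughly 12 MB at a 1% false positive rate and roughly 18 MB at 0.1%, so a figure of about 15 MB corresponds to a rate between those two.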
- Clone the repository:

      git clone https://github.com/Yamil-Serrano/Bloom-Checker.git

- Navigate to the project directory:

      cd Bloom-Checker

- Install required dependencies:

      pip install -r requirements.txt

- Run the application:

      python main.py

- Use the interface to:
  - Select the initial database CSV file.
  - Select the verification CSV file.
  - View the verification results in the interface, with color-coded outputs:
    - Green: The email is probably in the database.
    - Red: The email is definitely not in the database.
- Adjust the false positive rate directly in the main.py script if needed.

Email Address |
---|
[email protected] |
[email protected] |
[email protected] |

Email Address |
---|
[email protected] |
[email protected] |
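
As a rough sketch of how two CSV files in this single-column format could be processed outside the GUI, the example below reads the `Email Address` column from each file and labels every address in the verification file. The function names are hypothetical, and a plain Python `set` stands in for the Bloom filter to keep the example short, so here "probably present" is in fact exact.

```python
import csv
import io


def read_emails(csv_file, column="Email Address"):
    """Yield non-empty, stripped values from the email column of an open CSV file."""
    for row in csv.DictReader(csv_file):
        value = (row.get(column) or "").strip()
        if value:
            yield value


def verify(database_csv, verification_csv):
    """Return (email, label) pairs for every address in the verification file."""
    known = set(read_emails(database_csv))  # stand-in for the Bloom filter
    return [
        (email, "probably present" if email in known else "definitely absent")
        for email in read_emails(verification_csv)
    ]


# In-memory files with made-up addresses; real use would pass open(...) handles.
database_csv = io.StringIO("Email Address\nalice@example.com\nbob@example.com\n")
verification_csv = io.StringIO("Email Address\nalice@example.com\ncarol@example.com\n")

results = verify(database_csv, verification_csv)
for email, label in results:
    print(f"{email}: {label}")
```

The two labels correspond to the green and red outputs described in the usage steps.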
- Lotus flower icons created by Freepik - Flaticon
- File icons created by Good Ware - Flaticon
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.
For questions, suggestions, or contributions, please reach out via:
- GitHub: Neowizen