Step 1: Data Validation and Testing
Objective: To ensure the dataset was valid, accurate, and properly structured for migration.
Approach: Using the Pandas library, multiple Python scripts analyzed and validated the provided CSV file. This involved the following checks (a short Pandas sketch follows the list):
- Checking for unique usernames and email addresses.
- Verifying the father-child relationships.
- Sorting the data and identifying any mismatches or missing entries.
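A minimal sketch of these checks, assuming hypothetical column names (user_id, username, email, father_id) rather than the client's actual schema:

```python
# Minimal validation sketch; column names are assumptions, not the client's schema.
import pandas as pd

df = pd.read_csv("users.csv")

# Uniqueness checks for usernames and email addresses.
duplicate_usernames = df[df.duplicated("username", keep=False)]
duplicate_emails = df[df.duplicated("email", keep=False)]

# Father-child check: every non-null father_id must refer to an existing user_id.
known_ids = set(df["user_id"])
orphans = df[df["father_id"].notna() & ~df["father_id"].isin(known_ids)]

# Sort the data and flag rows with missing required values.
df = df.sort_values("user_id")
missing = df[df[["username", "email"]].isna().any(axis=1)]

print(len(duplicate_usernames), len(duplicate_emails), len(orphans), len(missing))
```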
Step 2: Formatting and Conversion
Once the data was validated, the next step was converting it into the appropriate format for migration to the database, which involved mapping the dataset to multiple schemas such as:
- User Data Schema
- Father-Child Relationship Tree Schema
- Referrer-Referee Relationship Tree Schema (Sponsor Tree)
CSV files containing data formatted according to each table’s schema were created using the
Pandas library. This conversion ensured the data was properly prepared for migration into the
target tables.
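As an illustration, a mapping of this kind might look like the following sketch, where the file names and column subsets are assumptions rather than the actual schemas:

```python
# Sketch of splitting the validated dataset into per-table CSVs;
# file names and columns are illustrative, not the real schemas.
import pandas as pd

df = pd.read_csv("validated_users.csv")

# User Data Schema
df[["user_id", "username", "email"]].to_csv("user_data.csv", index=False)

# Father-Child Relationship Tree Schema
df[["user_id", "father_id"]].to_csv("father_child_tree.csv", index=False)

# Referrer-Referee Relationship Tree Schema (Sponsor Tree)
df[["user_id", "referrer_id"]].to_csv("sponsor_tree.csv", index=False)
```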
Step 3: Tree Path Management
Challenge – Handling the Unbalanced Binary Tree
The unbalanced nature of the Binary Tree structure led to disproportionate depths for new users from the root node (Admin User), causing deeper nesting and complicating the tree path calculations.
- This exponential growth in tree depth meant that rendering the relationships between users and their ancestors generated over 1.49 billion rows of data, significantly increasing computation time and memory consumption.
- Pandas struggled with calculating the Binary Tree structure, leading to slow processing and memory overload.
Solution – Introduction of Redis
Developers explored Redis, an open-source, in-memory data store, to optimize performance. Redis
offered high-speed data retrieval and storage capabilities, reducing the load on traditional
disk-based systems. Its in-memory architecture enabled faster processing and minimized
bottlenecks associated with reading and writing large datasets.
How Redis Was Used:
- Each user's data, including their father's information and related connections, was stored in Redis.
- Thanks to Redis' fast read-write capability, the father and their upline could be retrieved rapidly for each user, and the new user's own upline stored in memory for their child users; the data was simultaneously written to a CSV file (a sketch of this pattern follows the list).
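A minimal sketch of this caching pattern using the redis-py client; the key format, path encoding, and output columns are assumptions, not the production code:

```python
# Sketch: each user's upline path is cached in Redis so a child's path can be
# built from its father's without recomputing the full ancestry.
import csv
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def record_user(user_id, father_id, writer):
    # Fetch the father's cached upline (e.g. "1/42") and extend it with this user.
    father_path = r.get(f"upline:{father_id}") or str(father_id)
    own_path = f"{father_path}/{user_id}"
    r.set(f"upline:{user_id}", own_path)

    # Write one (ancestor, descendant, depth) row per ancestor to the treepath CSV.
    ancestors = own_path.split("/")[:-1]
    for depth, ancestor in enumerate(reversed(ancestors), start=1):
        writer.writerow([ancestor, user_id, depth])

with open("treepath.csv", "a", newline="") as f:
    record_user(1001, 42, csv.writer(f))
```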
Challenge – Managing a Large Dataset
- Objective: To overcome performance limitations and create the CSV file for migration.
- Approach: The number of rows of data increased exponentially with the number of users. By using Redis for efficient data caching and Python for generating the CSV files, the team was able to manage memory consumption and reduce processing time. Even so, the user data had to be split into four chunks to be processed properly.
- Data Breakdown: The final dataset consisted of 1.49 billion rows, occupying approximately 28 GB of storage as CSV files.
- The generation process took 2 days to complete, an impressive feat considering the scale of the data.
Step 4: Introduction of Dask for Big Data Processing
To handle the enormous dataset and efficiently process the data while making final adjustments and correcting errors (such as duplication), our team incorporated Dask, a parallel computing library designed for big data processing. Dask allowed the team to process the large volume of treepath data efficiently by converting the treepath CSV files to Parquet format for processing.
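A minimal sketch of that conversion with Dask; the file paths and deduplication columns are assumptions:

```python
# Sketch: convert the treepath CSV chunks to Parquet and drop duplicates with Dask.
import dask.dataframe as dd

df = dd.read_csv("treepath_part_*.csv")

# Remove duplicated (ancestor, descendant) pairs introduced during generation.
df = df.drop_duplicates(subset=["ancestor_id", "descendant_id"])

# Parquet is columnar and compressed, so later passes are faster and smaller.
df.to_parquet("treepath_parquet/", write_index=False)
```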
Step 5: Data Migration to MySQL Using Pandas
Objective: Efficient migration of 2 million users into the client database.
Approach: The user data for 2 million users was migrated to MySQL using the Pandas library. The data, stored in CSV files formatted to match the user data table schemas, was successfully migrated in just 10 minutes.
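A sketch of a Pandas-to-MySQL load of this kind via SQLAlchemy; the connection string, table name, and chunk size are assumptions:

```python
# Sketch: load the user-data CSV into MySQL with Pandas and SQLAlchemy.
import pandas as pd
from sqlalchemy import create_engine

# Connection details are placeholders.
engine = create_engine("mysql+pymysql://migrator:password@localhost/mlm")

users = pd.read_csv("user_data.csv")
users.to_sql("users", engine, if_exists="append", index=False, chunksize=10_000)
```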
Step 6: Transition to MySQL FILE UPLOAD and Removal of Pandas
Challenge – Inefficiency of Pandas
Pandas proved insufficient for the large-scale migration of the treepath data, which consisted of
almost 1.5 billion rows.
Solution – Introduction of MySQL FILE UPLOAD
The team transitioned to direct file uploads from CSV to MySQL, removing Python/Pandas as the intermediary, and used MySQL's bulk file-upload command (LOAD DATA INFILE) for the migration. This shift enabled faster and more reliable data migration, loading 5 billion rows in just 30 minutes, with migration speeds up to 75 times faster than the previous approach.
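A sketch of such a bulk load issued from Python with mysql-connector-python; the credentials, table, and column names are assumptions:

```python
# Sketch: bulk-load one treepath CSV chunk with LOAD DATA LOCAL INFILE.
import mysql.connector

# Connection details are placeholders; local_infile must be enabled on the server.
conn = mysql.connector.connect(
    host="localhost", user="migrator", password="password",
    database="mlm", allow_local_infile=True,
)
cur = conn.cursor()
cur.execute("""
    LOAD DATA LOCAL INFILE 'treepath_part_1.csv'
    INTO TABLE treepath
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
    (ancestor_id, descendant_id, depth)
""")
conn.commit()
conn.close()
```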
- Drawbacks: While the MySQL file upload provided better speed, it lacked built-in safeguards such as duplicate detection and null-column checks, and it required indexes and keys to be disabled for the sake of speed. Thanks to the rigorous data cleaning and validation performed with Pandas in the first step, this was not an issue.
Final Step: Adding Indexes & Foreign Keys
With the data successfully processed, foreign keys and indexes were added to the tables for faster queries and data look-ups. This enabled easy data access and ensured that the entire structure was fully operational.
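A brief sketch of re-adding an index and a foreign key after the load; the table and column names are assumptions:

```python
# Sketch: re-add an index and a foreign key once the bulk load is complete.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="migrator",
                               password="password", database="mlm")
cur = conn.cursor()
cur.execute("ALTER TABLE treepath ADD INDEX idx_descendant (descendant_id)")
cur.execute("ALTER TABLE treepath ADD CONSTRAINT fk_descendant "
            "FOREIGN KEY (descendant_id) REFERENCES users (user_id)")
conn.commit()
conn.close()
```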
Results & Achievements
Within just 15 days, our team successfully added 2 million users to the unbalanced MLM tree,
overcoming significant technical challenges. The following outcomes were achieved:
- Increased Performance: The migration process was accelerated by 75 times, ensuring efficient handling of large-scale data.
- Scalable Solution: The integration of Redis and Dask ensured that the system could scale effectively to handle 1.49 billion records and continue expanding.
- Improved Efficiency: The final approach reduced memory consumption and minimized downtime, allowing for quicker data access and tree rendering.
- Successful Deployment: The migration was completed on time, with no major performance bottlenecks or data integrity issues.
Conclusion
This case study highlights the successful implementation of an advanced and scalable system for
managing 2 million users in an unbalanced binary MLM genealogy tree. By using Redis, Dask, and
MySQL, our development team overcame the challenges of large-scale data migration, unbalanced
tree management, and system performance. The project not only met its deadline but also set a
new standard for efficiently managing massive datasets in MLM structures. This accomplishment
marks a significant milestone in Infinite MLM’s journey, ensuring that our system can support
future growth without compromising performance.