Adding 2 Million Users to an Unbalanced Binary Genealogy Tree
Project Timeline:
15 Days
Challenge:
Efficiently adding and managing 2 million users within an unbalanced Binary Multi-Level Marketing (MLM) genealogy tree
Background and Objectives
A prominent company in the MLM sector needed to incorporate a vast volume of data into an unbalanced
binary genealogy tree. The goal was to add 2 million users to a MySQL database while maintaining the
structure’s integrity, ensuring smooth data migration and tree path rendering, and minimizing system
resource consumption.
The Binary Tree was to represent an MLM system in which users are added in a hierarchical structure: each
“father” has a maximum of two “children,” placed either to the left or the right under him. Alongside it
sits a second tree, the Sponsor Tree. A Sponsor Tree in MLM is a hierarchical structure that shows the
relationships between a sponsor and their recruits, forming a network of downline members.
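For illustration, a single user record participating in both trees might be modeled as in the sketch below; the field names are our own assumptions, not the client’s actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class User:
    """One node that participates in both trees at once (illustrative model)."""
    username: str            # unique identifier
    email: str               # unique per username
    father: Optional[str]    # parent in the binary tree (None for the root/Admin)
    position: Optional[str]  # 'left' or 'right' under the father
    sponsor: Optional[str]   # parent in the Sponsor Tree (who recruited this user)
```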
Challenges
Data Integrity & Structure Validation:
The data provided by the client needed to be validated for:
Unique usernames
Valid email addresses, each unique to its username
Proper user hierarchy (father-child relationships)
A valid father username and sponsor username for each user (i.e., one that exists higher in the tree)
Initial Step:
It involved collaborating closely with the client to clean the data and address any missing values
before constructing the tree structure and organizing the data into the required format for database
migration.
Handling a Large Volume of Data:
With over 2 million users, the sheer data size and the complexity of tree path calculations led to
performance issues. The unbalanced nature of the tree further complicated the relationships, adding
memory and computational overhead. On top of that, the data had to be transferred to the database
quickly and efficiently.
Solutions Implemented
Step 1:
Data Validation and Testing
Objective:
To ensure the dataset was valid, accurate, and properly structured for
migration.
Approach:
Using the Pandas library, multiple Python scripts analyzed and validated the provided CSV file (a sketch
of these checks appears after the list below). This involved:
Checking for unique usernames and email addresses.
Verifying the father-child relationships.
Sorting and identifying any data mismatches or missing entries.
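A minimal sketch of such a validation pass, assuming a CSV with username, email, father, and sponsor columns (the headers are illustrative, not the client’s actual schema):

```python
import pandas as pd

df = pd.read_csv("users.csv")

# Uniqueness checks: flag any duplicated usernames or email addresses.
dup_users = df[df["username"].duplicated(keep=False)]
dup_emails = df[df["email"].duplicated(keep=False)]

# Hierarchy check: every non-root father/sponsor must exist as a username.
# (A full pass would also verify the father appears higher in the tree.)
known = set(df["username"])
bad_fathers = df[df["father"].notna() & ~df["father"].isin(known)]
bad_sponsors = df[df["sponsor"].notna() & ~df["sponsor"].isin(known)]

for name, problems in [("duplicate usernames", dup_users),
                       ("duplicate emails", dup_emails),
                       ("unknown fathers", bad_fathers),
                       ("unknown sponsors", bad_sponsors)]:
    print(f"{name}: {len(problems)} rows")
```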
Step 2:
Formatting and Conversion
Once the data was validated, the next step was converting it into the appropriate format for
migration to the database, which involved mapping the dataset to multiple schemas such as:
User Data Schema
Father-Child Relationship Tree Schema
Referrer-Referee Relationship Tree Schema (Sponsor Tree)
CSV files containing data formatted according to each table’s schema were created using the
Pandas library. This conversion ensured the data was properly prepared for migration into the
target tables.
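A minimal sketch of this per-schema split, reusing the illustrative column names from the validation step:

```python
import pandas as pd

df = pd.read_csv("users_validated.csv")

# User data schema: one row per user.
df[["username", "email"]].to_csv("users_table.csv", index=False)

# Father-child relationship schema: one row per placement in the binary tree.
df[df["father"].notna()][["username", "father", "position"]] \
    .to_csv("binary_tree_table.csv", index=False)

# Referrer-referee (Sponsor Tree) schema: one row per recruitment edge.
df[df["sponsor"].notna()][["username", "sponsor"]] \
    .to_csv("sponsor_tree_table.csv", index=False)
```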
Step 3:
Tree Path Management
Challenge – Handling Unbalanced Binary Tree
The unbalanced nature of the Binary Tree structure led to disproportionate depths for new users
measured from the root node (Admin User), causing deeper nesting and complicating the tree path
calculations.
In a perfect binary tree, the depth for a million users is less than 20. In practice, however,
the depth typically ranges from 200 to 2000, depending on factors such as users who recruit
large numbers of new referrals. In a highly unbalanced tree, the depth can exceed 6000, as
seen in several branches here, due to the deep imbalance.
This extreme tree depth meant that rendering the relationships between users and all of their
ancestors generated over 1.49 billion rows of data (an average of roughly 745 ancestor rows per
user across 2 million users), significantly increasing computation time and memory consumption.
Pandas struggled with calculating the Binary Tree structure, leading to slow processing and
memory overload.
Solution – Introduction of Redis
Developers explored Redis, an open-source, in-memory data store, to optimize performance. Redis
offered high-speed data retrieval and storage capabilities, reducing the load on traditional
disk-based systems. Its in-memory architecture enabled faster processing and minimized
bottlenecks associated with reading and writing large datasets.
How Redis Was Used:
Each user’s data, including their father’s information and related connections, was stored in Redis.
Thanks to Redis’ fast read-write capability, it was possible to rapidly retrieve the father and
their upline for each user, store the new user and their upline in memory for that user’s
children, and simultaneously write the data to a CSV file.
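A minimal sketch of this pattern using the redis-py client and JSON-encoded ancestor lists; the key names, file layout, and column order are our assumptions, not the production code:

```python
import csv
import json
import redis

r = redis.Redis()  # assumes a local Redis instance

def add_user(username, father, writer):
    """Cache the user's full upline in Redis and emit one treepath row per ancestor."""
    # The father's upline was cached when the father was added, so this is
    # a single in-memory read instead of a recursive database query.
    upline = [father] + json.loads(r.get(f"upline:{father}") or "[]")
    r.set(f"upline:{username}", json.dumps(upline))
    # One CSV row per (ancestor, descendant, depth) pair, closure-table style.
    for depth, ancestor in enumerate(upline, start=1):
        writer.writerow([ancestor, username, depth])

# Seed the root so every later lookup succeeds, then stream users top-down.
r.set("upline:admin", "[]")
with open("treepath_chunk_1.csv", "w", newline="") as f:
    add_user("alice", "admin", csv.writer(f))
```

Because each lookup is a single in-memory read, adding a user costs one Redis read, one write, and one CSV row per ancestor, with no recursive queries against disk.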
Challenge – Managing a Large Data Set
Objective: To overcome performance limitations and create the CSV files for migration.
Approach: The number of treepath rows grew far faster than the number of users. By using Redis
for efficient data caching and Python for generating CSV files, the team was able to manage
memory consumption and reduce processing time. Even so, the user data needed to be split into
four chunks in order to process it properly.
Data Breakdown: The final dataset consisted of 1.49 billion rows, occupying approximately
28 GB of CSV storage in the end.
This process took 2 days to complete, an impressive feat considering the scale of the data.
Step 4:
Introduction of Dask for Big Data Processing
To handle the enormous dataset and efficiently process the data while making final adjustments
and correcting errors (such as duplicates), our team incorporated Dask, a parallel computing
library designed for big data processing. Dask allowed the team to process the large volume of
treepath data efficiently by converting the treepath CSV files to Parquet format for processing.
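A hedged sketch of the CSV-to-Parquet conversion, reusing the illustrative treepath columns from the Redis sketch; dask.dataframe mirrors the Pandas API while processing the files out-of-core:

```python
import dask.dataframe as dd

# Read all treepath CSV chunks lazily; nothing is loaded into memory yet.
paths = dd.read_csv("treepath_chunk_*.csv",
                    names=["ancestor", "descendant", "depth"])

# Drop accidental duplicate rows, then write columnar Parquet for fast reprocessing.
paths = paths.drop_duplicates()
paths.to_parquet("treepath_parquet/", write_index=False)
```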
Step 5:
Data Migration to MySQL Using Pandas
Objective: Efficient migration of 2 million users into the client’s database.
Approach: The user data for 2 million users was migrated to MySQL using the Pandas
library. The data, stored in CSV files formatted to match the user data table
schemas, was successfully migrated in just 10 minutes.
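A minimal sketch of such a Pandas-to-MySQL load; the connection string, table name, and chunk size are our own placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder credentials, not the client's actual connection details.
engine = create_engine("mysql+pymysql://user:password@localhost/mlm_db")

df = pd.read_csv("users_table.csv")
# chunksize keeps memory bounded by inserting in batches.
df.to_sql("users", con=engine, if_exists="append", index=False, chunksize=10_000)
```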
Step 6:
Transition to MySQL FILE UPLOAD and Removal of Pandas
Challenge – Inefficiency of Pandas
Pandas proved insufficient for the large-scale migration of the treepath data, which consisted of
almost 1.5 billion rows.
Solution – Introduction of MySQL FILE UPLOAD
The team transitioned to direct file uploads from CSV to MySQL, removing Python/Pandas as the
intermediary, and used MySQL’s file upload facility (the LOAD DATA INFILE statement) for the
migration. This shift enabled faster and more reliable data migration, with throughput on the
order of 5 billion rows per 30 minutes, up to 75 times faster than the previous approach.
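A hedged sketch of such a direct load via mysql-connector-python; credentials, table, and file names are placeholders:

```python
import mysql.connector

# allow_local_infile enables LOAD DATA LOCAL INFILE from the client side.
conn = mysql.connector.connect(host="localhost", user="user", password="password",
                               database="mlm_db", allow_local_infile=True)
cur = conn.cursor()

# Relax per-row checks for the duration of the bulk load (data was pre-validated).
cur.execute("SET unique_checks = 0")
cur.execute("SET foreign_key_checks = 0")
cur.execute("""
    LOAD DATA LOCAL INFILE 'treepath_chunk_1.csv'
    INTO TABLE treepath
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\\n'
    (ancestor, descendant, depth)
""")
conn.commit()
```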
Drawbacks: While the MySQL file upload provided better speed, it lacked built-in safety
checks, such as duplicate detection and null column validation. It also required that
indexes and keys be disabled for speed. Thanks to the rigorous data cleaning and
validation performed with Pandas in the first step, these risks were mitigated.
Final Step:
Adding Indexes & Foreign Keys
With the data successfully processed, the foreign keys and indexes were added to the tables for
faster queries and data look-ups. This enabled easy data access and ensured that the entire
structure was fully operational.
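A minimal sketch of this final pass, again with placeholder table and column names:

```python
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="user",
                               password="password", database="mlm_db")
cur = conn.cursor()

# Restore the checks relaxed during the bulk load, then add keys for fast look-ups.
cur.execute("SET unique_checks = 1")
cur.execute("SET foreign_key_checks = 1")
cur.execute("CREATE INDEX idx_descendant ON treepath (descendant)")
cur.execute("ALTER TABLE treepath "
            "ADD CONSTRAINT fk_descendant FOREIGN KEY (descendant) "
            "REFERENCES users (username)")
conn.commit()
```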
Results & Achievements
Within just 15 days, our team successfully added 2 million users to the unbalanced MLM tree,
overcoming significant technical challenges. The following outcomes were achieved:
Increased Performance: The migration process was accelerated by 75 times, ensuring
efficient handling of large-scale data.
Scalable Solution: The integration of Redis and Dask ensured that the system could
scale effectively to handle 1.49 billion records and continue expanding.
Improved Efficiency: The final approach reduced memory consumption and minimized
downtime, allowing for quicker data access and tree rendering.
Successful Deployment: The migration was completed on time, with no major performance
bottlenecks or data integrity issues.
Conclusion
This case study highlights the successful implementation of an advanced and scalable system for
managing 2 million users in an unbalanced binary MLM genealogy tree. By using Redis, Dask, and
MySQL, our development team overcame the challenges of large-scale data migration, unbalanced
tree management, and system performance. The project not only met its deadline but also set a
new standard for efficiently managing massive datasets in MLM structures. This accomplishment
marks a significant milestone in Infinite MLM’s journey, ensuring that our system can support
future growth without compromising performance.