Step 1: Data Validation and Testing
Objective: To ensure the dataset was valid, accurate, and properly structured for migration.
Approach: Using the Pandas library, multiple Python scripts analyzed and validated the provided CSV file. This involved the following checks (a short Pandas sketch follows the list):
- Checking for unique usernames and email addresses.
- Verifying the father-child relationships.
- Sorting the data and identifying any mismatches or missing entries.
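A minimal sketch of these checks, assuming hypothetical column names (user_id, username, email, father_id) rather than the client's actual schema:

```python
# Minimal validation sketch; column names are assumptions, not the client's schema.
import pandas as pd

df = pd.read_csv("users.csv")

# Uniqueness checks for usernames and email addresses.
duplicate_usernames = df[df.duplicated("username", keep=False)]
duplicate_emails = df[df.duplicated("email", keep=False)]

# Father-child check: every non-null father_id must refer to an existing user_id.
known_ids = set(df["user_id"])
orphans = df[df["father_id"].notna() & ~df["father_id"].isin(known_ids)]

# Sort the data and flag rows with missing required values.
df = df.sort_values("user_id")
missing = df[df[["username", "email"]].isna().any(axis=1)]

print(len(duplicate_usernames), len(duplicate_emails), len(orphans), len(missing))
```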
Step 2: Formatting and Conversion
Once the data was validated, the next step was converting it into the appropriate format for migration to the database, which involved mapping the dataset to multiple schemas such as:
- User Data Schema
- Father-Child Relationship Tree Schema
- Referrer-Referee Relationship Tree Schema (Sponsor Tree)
CSV files containing data formatted according to each table’s schema were created using the
Pandas library. This conversion ensured the data was properly prepared for migration into the
target tables.
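As an illustration, a mapping of this kind might look like the following sketch, where the file names and column subsets are assumptions rather than the actual schemas:

```python
# Sketch of splitting the validated dataset into per-table CSVs;
# file names and columns are illustrative, not the real schemas.
import pandas as pd

df = pd.read_csv("validated_users.csv")

# User Data Schema
df[["user_id", "username", "email"]].to_csv("user_data.csv", index=False)

# Father-Child Relationship Tree Schema
df[["user_id", "father_id"]].to_csv("father_child_tree.csv", index=False)

# Referrer-Referee Relationship Tree Schema (Sponsor Tree)
df[["user_id", "referrer_id"]].to_csv("sponsor_tree.csv", index=False)
```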
Step 3: Tree Path Management
Challenge – Handling the Unbalanced Binary Tree
The unbalanced nature of the Binary Tree structure led to disproportionate depths for new users from the root node (Admin User), causing deeper nesting and complicating the tree path calculations.
- This exponential growth in tree depth meant that rendering the relationships between users and their ancestors generated over 1.49 billion rows of data, significantly increasing computation time and memory consumption.
- Pandas struggled with calculating the Binary Tree structure, leading to slow processing and memory overload.
Solution – Introduction of Redis
Developers explored Redis, an open-source, in-memory data store, to optimize performance. Redis
offered high-speed data retrieval and storage capabilities, reducing the load on traditional
disk-based systems. Its in-memory architecture enabled faster processing and minimized
bottlenecks associated with reading and writing large datasets.
How Redis Was Used:
- Each user's data, including their father's information and related connections, was stored in Redis.
- Thanks to Redis' fast read-write capability, the father and their upline could be retrieved rapidly for each user, and the new user's own upline stored in memory for their child users; the data was simultaneously written to a CSV file (a sketch of this pattern follows the list).
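A minimal sketch of this caching pattern using the redis-py client; the key format, path encoding, and output columns are assumptions, not the production code:

```python
# Sketch: each user's upline path is cached in Redis so a child's path can be
# built from its father's without recomputing the full ancestry.
import csv
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def record_user(user_id, father_id, writer):
    # Fetch the father's cached upline (e.g. "1/42") and extend it with this user.
    father_path = r.get(f"upline:{father_id}") or str(father_id)
    own_path = f"{father_path}/{user_id}"
    r.set(f"upline:{user_id}", own_path)

    # Write one (ancestor, descendant, depth) row per ancestor to the treepath CSV.
    ancestors = own_path.split("/")[:-1]
    for depth, ancestor in enumerate(reversed(ancestors), start=1):
        writer.writerow([ancestor, user_id, depth])

with open("treepath.csv", "a", newline="") as f:
    record_user(1001, 42, csv.writer(f))
```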
Challenge – Managing a Large Dataset
- Objective: To overcome performance limitations and create the CSV file for migration.
- Approach: The number of rows of data increased exponentially with the number of users. By using Redis for efficient data caching and Python for generating the CSV files, the team was able to manage memory consumption and reduce processing time. Even so, the user data had to be split into four chunks to be processed properly.
- Data Breakdown: The final dataset consisted of 1.49 billion rows, occupying approximately 28 GB of storage as CSV files.
- The generation process took 2 days to complete, an impressive feat considering the scale of the data.
Step 4: Introduction of Dask for Big Data Processing
To handle the enormous dataset and efficiently process the data while making final adjustments and correcting errors (such as duplication), our team incorporated Dask, a parallel computing library designed for big data processing. Dask allowed the team to process the large volume of treepath data efficiently by converting the treepath CSV files to Parquet format for processing.
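A minimal sketch of that conversion with Dask; the file paths and deduplication columns are assumptions:

```python
# Sketch: convert the treepath CSV chunks to Parquet and drop duplicates with Dask.
import dask.dataframe as dd

df = dd.read_csv("treepath_part_*.csv")

# Remove duplicated (ancestor, descendant) pairs introduced during generation.
df = df.drop_duplicates(subset=["ancestor_id", "descendant_id"])

# Parquet is columnar and compressed, so later passes are faster and smaller.
df.to_parquet("treepath_parquet/", write_index=False)
```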
Step 5: Data Migration to MySQL Using Pandas
Objective: Efficient migration of 2 million users into the client database.
Approach: The user data for 2 million users was migrated to MySQL using the Pandas library. The data, stored in CSV files formatted to match the user data table schemas, was successfully migrated in just 10 minutes.
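A sketch of a Pandas-to-MySQL load of this kind via SQLAlchemy; the connection string, table name, and chunk size are assumptions:

```python
# Sketch: load the user-data CSV into MySQL with Pandas and SQLAlchemy.
import pandas as pd
from sqlalchemy import create_engine

# Connection details are placeholders.
engine = create_engine("mysql+pymysql://migrator:password@localhost/mlm")

users = pd.read_csv("user_data.csv")
users.to_sql("users", engine, if_exists="append", index=False, chunksize=10_000)
```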
Step 6: Transition to MySQL FILE UPLOAD and Removal of Pandas
Challenge – Inefficiency of Pandas
Pandas proved insufficient for the large-scale migration of the treepath data, which consisted of
almost 1.5 billion rows.
Solution – Introduction of MySQL FILE UPLOAD
The team transitioned to direct file uploads from CSV to MySQL, removing Python/Pandas as the intermediary, and used MySQL's bulk file-upload command (LOAD DATA INFILE) for the migration. This shift enabled faster and more reliable data migration, loading 5 billion rows in just 30 minutes, with migration speeds up to 75 times faster than the previous approach.
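A sketch of such a bulk load issued from Python with mysql-connector-python; the credentials, table, and column names are assumptions:

```python
# Sketch: bulk-load one treepath CSV chunk with LOAD DATA LOCAL INFILE.
import mysql.connector

# Connection details are placeholders; local_infile must be enabled on the server.
conn = mysql.connector.connect(
    host="localhost", user="migrator", password="password",
    database="mlm", allow_local_infile=True,
)
cur = conn.cursor()
cur.execute("""
    LOAD DATA LOCAL INFILE 'treepath_part_1.csv'
    INTO TABLE treepath
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
    (ancestor_id, descendant_id, depth)
""")
conn.commit()
conn.close()
```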
- Drawbacks: While the MySQL file upload provided better speed, it lacked built-in safeguards such as duplicate detection and null-column checks, and it required indexes and keys to be disabled for the sake of speed. Thanks to the rigorous data cleaning and validation performed with Pandas in the first step, this was not an issue.
Final Step: Adding Indexes & Foreign Keys
With the data successfully processed, foreign keys and indexes were added to the tables for faster queries and data look-ups. This enabled easy data access and ensured that the entire structure was fully operational.
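A brief sketch of re-adding an index and a foreign key after the load; the table and column names are assumptions:

```python
# Sketch: re-add an index and a foreign key once the bulk load is complete.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="migrator",
                               password="password", database="mlm")
cur = conn.cursor()
cur.execute("ALTER TABLE treepath ADD INDEX idx_descendant (descendant_id)")
cur.execute("ALTER TABLE treepath ADD CONSTRAINT fk_descendant "
            "FOREIGN KEY (descendant_id) REFERENCES users (user_id)")
conn.commit()
conn.close()
```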
Results & Achievements
Within just 15 days, our team successfully added 2 million users to the unbalanced MLM tree,
overcoming significant technical challenges. The following outcomes were achieved:
- Increased Performance: The migration process was accelerated by 75 times, ensuring efficient handling of large-scale data.
- Scalable Solution: The integration of Redis and Dask ensured that the system could scale effectively to handle 1.49 billion records and continue expanding.
- Improved Efficiency: The final approach reduced memory consumption and minimized downtime, allowing for quicker data access and tree rendering.
- Successful Deployment: The migration was completed on time, with no major performance bottlenecks or data integrity issues.
Conclusion
This case study highlights the successful implementation of an advanced and scalable system for
managing 2 million users in an unbalanced binary MLM genealogy tree. By using Redis, Dask, and
MySQL, our development team overcame the challenges of large-scale data migration, unbalanced
tree management, and system performance. The project not only met its deadline but also set a
new standard for efficiently managing massive datasets in MLM structures. This accomplishment
marks a significant milestone in Infinite MLM’s journey, ensuring that our system can support
future growth without compromising performance.