July - August, 2018
Introduction
Coding Galaxy is an educational app that teaches children basic computer science logic. As with other educational tools, it is crucial to understand what users experience in order to better fit their needs. The micro-behaviors of Coding Galaxy’s users have been collected for a few months. This raw data can provide valuable information to help game creators design better levels and content in the future.
Goal:
To process and classify the micro-behavior data of different users for a better understanding of the app’s user base.
Data Processing
The raw data of each user’s micro-behaviors is stored in the following format:
{"uuid":"989c3ef0-9499-11e8-8233-3705ce51e9d2","app_bundle_id":"com.tangoredu.coding","user_id":"6378","content":{"timestamp":"1533024788897","level_id":"21","stage_id":"33","question_id":"72","action":"Get","cmd":"J","rej":"0","ac_loc":"N","current_M":"TL,F,C,TR,J,C,J,C,TL,J"}}
As the example above shows, the data is stored in JSON format. uuid is the record’s unique ID. app_bundle_id is the app’s bundle ID, which identifies Coding Galaxy in this case. user_id identifies the current user, and content describes the action taken by that user. Inside content, a timestamp is stored along with the ID of the current level, stage, and question. The user’s action is also categorized and described by further details such as cmd (command) and current_M (current movements).
In order to separate the users into meaningful groups, the data must first be processed so that it reflects distinguishing characteristics. The flow of play for each question and the processed data categories are shown in the following diagram:
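As a minimal sketch of reading one such record, the snippet below parses the sample shown above with Python's standard library. The function name `parse_event` and the output field names are illustrative, and treating the 13-digit timestamp as milliseconds since the Unix epoch is an assumption, not something the data format states.

```python
import json
from datetime import datetime, timezone

# The sample record from above, copied verbatim.
raw = (
    '{"uuid":"989c3ef0-9499-11e8-8233-3705ce51e9d2",'
    '"app_bundle_id":"com.tangoredu.coding","user_id":"6378",'
    '"content":{"timestamp":"1533024788897","level_id":"21","stage_id":"33",'
    '"question_id":"72","action":"Get","cmd":"J","rej":"0","ac_loc":"N",'
    '"current_M":"TL,F,C,TR,J,C,J,C,TL,J"}}'
)


def parse_event(line: str) -> dict:
    """Flatten one raw record into the fields used for later processing."""
    record = json.loads(line)
    content = record["content"]
    return {
        "user_id": record["user_id"],
        "question_id": content["question_id"],
        "action": content["action"],
        "cmd": content["cmd"],
        # Assumption: the timestamp is milliseconds since the Unix epoch.
        "time": datetime.fromtimestamp(
            int(content["timestamp"]) / 1000, tz=timezone.utc
        ),
        # current_M is a comma-separated list of the moves placed so far.
        "moves": content["current_M"].split(","),
    }


event = parse_event(raw)
print(event["user_id"], event["action"], event["time"].date(), len(event["moves"]))
```

Under the millisecond assumption the sample falls on 31 July 2018, which is consistent with the collection period described in this report.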
![](https://static.wixstatic.com/media/24faeb_bb9e9baa2a0b4c9e9d5ede190ec9c60f~mv2.png/v1/fill/w_980,h_704,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/24faeb_bb9e9baa2a0b4c9e9d5ede190ec9c60f~mv2.png)
Each question has the goal of reaching the end point, which is achieved through a sequence of moving directions. When the user first encounters the question, he or she has to get the desired moving directions and put them in the desired position in the sequence, one at a time. Once the full sequence is built, the user can run it. Running it has four possible outcomes: 1) The user wins the game; 2) The user finds a minor mistake and continues with the get and put process; 3) The user finds a major mistake and has to press the reset button to clear the whole sequence and start over; 4) The user fails to complete the task and decides to give up and leave the game.
Based on the flow of the gameplay, seven sets of processed data are chosen to categorize the users, with the reasoning as follows:
Number of Runs: This represents the total number of runs the user has used, regardless of whether or not he or she completes the question successfully. This value separates users who require a lot of testing before completion from those who do not.
Total Duration: This represents the total amount of time that a user requires to complete the question. This value tests the efficiency and speed of the user in solving problems.
Start to First Get Time: This represents the average amount of time that a user takes to get the first moving direction. This value helps show whether or not a user plans before he or she starts working on the problem.
Start to First Run Time: This represents the average amount of time that a user takes to run the created sequence for the first time. Similar to total duration, this value reflects the efficiency and speed of the user when he or she first encounters a new problem.
Number of Resets: This represents the total number of resets the user has used while solving the problem. As only major mistakes require the reset button, this value shows how often a user makes an unrecoverable mistake during the question.
Success Rate: This represents the number of successful completions over the total number of trials (Number of Wins plus Number of Leaves). This value shows whether a user gives up easily when solving a problem, or tries to complete the problem successfully in one sitting.
Get to Put Time: This represents the average time it takes for a user to put down a moving direction in the desired position of the sequence after he or she has gotten it. This value shows whether or not a user understands the sequential logic that he or she is trying to create, as well as the time a user takes to think after picking up a moving direction.
These seven sets of data are, overall, a fitting representation of the users of Coding Galaxy. The next step is to use a mechanism to categorize them in a logical manner.
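The seven values could be derived from one user's event stream for one question roughly as sketched below. This is not the project's actual processing code. Only the "Get" action appears in the sample record; the other action names ("Start", "Put", "Run", "Reset", "Win", "Leave"), the `(timestamp_ms, action)` tuple layout, and the example events are assumptions made for illustration. The sketch covers a single play session, so the "average" for the start-to-first-get and start-to-first-run times reduces to a single value here.

```python
from statistics import mean


def first_time(events, action):
    """Return the timestamp of the first event with this action, or None."""
    return next((t for t, a in events if a == action), None)


def summarize(events):
    """events: time-sorted (timestamp_ms, action) tuples for one user and question."""
    start = events[0][0]
    actions = [a for _, a in events]

    # Pair each Get with the next Put to measure the time spent placing a move.
    gaps, pending_get = [], None
    for t, a in events:
        if a == "Get":
            pending_get = t
        elif a == "Put" and pending_get is not None:
            gaps.append(t - pending_get)
            pending_get = None

    first_get = first_time(events, "Get")
    first_run = first_time(events, "Run")
    wins, leaves = actions.count("Win"), actions.count("Leave")
    trials = wins + leaves
    return {
        "runs": actions.count("Run"),
        "total_duration_ms": events[-1][0] - start,
        "start_to_first_get_ms": None if first_get is None else first_get - start,
        "start_to_first_run_ms": None if first_run is None else first_run - start,
        "resets": actions.count("Reset"),
        "success_rate": wins / trials if trials else None,
        "get_to_put_ms": mean(gaps) if gaps else None,
    }


# Made-up session: two moves, a failed run, a reset, one more move, then a win.
session = [
    (0, "Start"), (2000, "Get"), (3000, "Put"), (4000, "Get"), (6000, "Put"),
    (7000, "Run"), (8000, "Reset"), (9000, "Get"), (9500, "Put"),
    (10000, "Run"), (11000, "Win"),
]
features = summarize(session)
print(features)
```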
Model
Even though we have processed the data and generalized it into seven sets, we do not know which metrics best separate the users into groups. In machine learning, unsupervised learning is often used to describe the structure of unlabeled data. As we are trying to group the users according to their similar micro-behaviors, it is reasonable to view this as a cluster analysis problem, in which users with similar attributes are grouped together. The chosen method for the cluster analysis is k-means clustering. As its name suggests, k-means clustering partitions objects, in this case users, by assigning each one to the cluster with the nearest mean. By doing so, the data space is divided into Voronoi cells, so that future data can easily be classified in the same manner. One example of a Voronoi diagram is shown below:
![](https://static.wixstatic.com/media/24faeb_0fa44503d9104200a3a7216d44329ec9~mv2.png/v1/fill/w_512,h_512,al_c,q_85,enc_auto/24faeb_0fa44503d9104200a3a7216d44329ec9~mv2.png)
K-means Clustering
K-means clustering follows the steps below to form multiple clusters:
![](https://static.wixstatic.com/media/24faeb_033a107691e64009a0fc9a74fde0ed28~mv2.png/v1/fill/w_980,h_1067,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/24faeb_033a107691e64009a0fc9a74fde0ed28~mv2.png)
This image demonstrates the convergence of k-means clustering:
![](https://static.wixstatic.com/media/24faeb_3b506795c6f745698047057a18dc6030~mv2.gif/v1/fill/w_220,h_214,al_c,pstr/24faeb_3b506795c6f745698047057a18dc6030~mv2.gif)
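The assign-then-update loop behind k-means can be sketched in a few lines of plain Python. This is a generic version of the standard algorithm (Lloyd's algorithm), not the implementation used in this project; the function names, the seeded random initialization, and the toy points are all illustrative choices.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Cluster a list of point tuples; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters


# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

The loop stops as soon as no centroid moves, which is the convergence shown in the animation above. Nothing in the sketch depends on the points being two-dimensional.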
However, the data processing yields seven values per question for each user, which is far more than two dimensions. Therefore, the data cannot be pictured directly in a two-dimensional Voronoi diagram like the one above. To address this, principal component analysis (PCA) is used to reduce the data from the higher dimension to two dimensions before clustering. PCA is a statistical procedure that transforms possibly correlated variables into a set of linearly uncorrelated variables called principal components. This means that it removes qualities of the original data that are potentially repetitive or redundant and combines the remaining useful features to create linearly uncorrelated representations of the original data. As a result, the reduced data is a more compact and representative summary of the original data set. This reduced data can then be analyzed by the k-means algorithm to form clusters.
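To make the reduction step concrete, here is a small standard-library sketch of PCA down to two dimensions. It centers the data, builds the covariance matrix, and finds the two leading eigenvectors by power iteration with deflation. The project most likely used a library routine instead; the function names and the toy rows are assumptions for illustration.

```python
import math
import random


def mat_vec(matrix, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]


def power_iteration(matrix, iterations, rng):
    """Approximate the dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    v = [rng.random() + 0.1 for _ in matrix]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # no variance left in any direction
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(matrix, v)))
    return eigenvalue, v


def pca_2d(rows, iterations=500, seed=0):
    """Project rows (equal-length numeric sequences) onto the top two principal components."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    components = []
    for _ in range(2):
        eigenvalue, v = power_iteration(cov, iterations, rng)
        components.append(v)
        # Deflation: remove the found direction so the next pass finds the second one.
        cov = [
            [cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
            for i in range(d)
        ]
    return [tuple(sum(a * b for a, b in zip(r, c)) for c in components) for r in centered]


# Toy 3-D data whose variance lies along the first two axes only.
rows = [(2, 0, 0), (-2, 0, 0), (0, 1, 0), (0, -1, 0)]
projected = pca_2d(rows)
print(projected)
```

The sign of each component is arbitrary, so projected coordinates may come out mirrored. The sketch only centers the data; whether the original analysis also rescaled the seven features before PCA is not stated.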
Results
The results of k-means clustering are as follows:
![](https://static.wixstatic.com/media/24faeb_01df33e240964b919a870780a35acec5~mv2.png/v1/fill/w_980,h_1157,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/24faeb_01df33e240964b919a870780a35acec5~mv2.png)
Each black dot represents a data point, and the white crosses represent the centroids of their corresponding clusters.
For cluster analyses, the elbow method is often used to determine the appropriate number of clusters within a dataset. The elbow method looks for the point beyond which adding more clusters yields sharply diminishing gains. This produces an angle in the graph when gains are plotted against the number of clusters, creating an “elbow”-like figure in the plot. The graph shown below is a plot of the k-means clustering score (y-axis) against the number of clusters (x-axis). We can see that the “elbow” is located at three clusters, meaning that most of the gain from increasing the number of clusters is achieved with three groups.
![](https://static.wixstatic.com/media/24faeb_97cea65a6e0f4f118a9852105ac9de42~mv2.png/v1/fill/w_756,h_504,al_c,q_90,enc_auto/24faeb_97cea65a6e0f4f118a9852105ac9de42~mv2.png)
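The elbow is usually read off the plot by eye, but it can also be located with a simple heuristic: pick the number of clusters where the curve bends most sharply, measured by the discrete second difference. The sketch below assumes the score is a within-cluster sum of squares (lower is better). The numbers are made up to illustrate the shape of such a curve and are not the project's results.

```python
def find_elbow(ks, scores):
    """Return the k at which the score curve bends most (largest second difference)."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        # Drop gained by reaching ks[i], minus the drop gained by the step after it.
        bend = scores[i - 1] - 2 * scores[i] + scores[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Illustrative values only: a steep fall up to k=3, then a nearly flat tail.
ks = [1, 2, 3, 4, 5, 6]
scores = [1000.0, 600.0, 250.0, 210.0, 180.0, 160.0]
elbow = find_elbow(ks, scores)
print(elbow)  # 3
```

The heuristic can be fooled by noisy curves, so it is best treated as a cross-check on the visual reading rather than a replacement for it.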
Even though the ideal number of clusters is three, it is still beneficial to divide the users into more clusters. The Google Sheets table below shows the average of each statistic within each group of users when the k-means algorithm divides them into three, four, and five clusters. The data is analyzed and color-coded to show each group’s characteristics. Green is the most ideal, followed by yellow, orange, and red, in that order. Blue marks special cases: the amount of time spent planning (“Start to First Get Time”) and the amount of time spent thinking (“Get to Put Time”). The data is then further analyzed and interpreted into understandable groups in the rightmost column.
![](https://static.wixstatic.com/media/24faeb_918a3597733f480aa6ad8a1a0834f8d8~mv2.png/v1/fill/w_980,h_368,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/24faeb_918a3597733f480aa6ad8a1a0834f8d8~mv2.png)
Conclusion
From the above results, we can see that users have been successfully grouped according to their micro-behaviors, giving a better understanding of Coding Galaxy’s user base. As the micro-behavior data is collected automatically through the app’s usage, new users can also be classified automatically with this model, enabling an instant feedback system. App developers can now better recognize each user’s needs and create designs that fit those needs. Overall, the outcome of this project is satisfying and should have a positive impact on the future development of the app.
Further Steps
With the current data from March to August 2018, only 134 users and two questions (‘23’ and ‘31’) are used in training the k-means model. Normalizing the data for each level’s difficulty should allow more questions’ data to be incorporated into the model. Furthermore, a longer period of data collection is required to make the model more accurate and representative. Other sets of processed data can also be added on top of the current seven inputs to further describe the user base.
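One possible way to normalize for difficulty, sketched here as an assumption rather than a planned design, is to convert each feature to a z-score within its own question, so that a slow time on a hard question and a slow time on an easy one land on the same scale. The row layout, the `total_duration` field, and the values are invented for illustration; only the question IDs come from the text above.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_by_question(rows, feature):
    """Add a per-question z-score of `feature` to each row (list of dicts)."""
    values = defaultdict(list)
    for row in rows:
        values[row["question_id"]].append(row[feature])
    # Mean and population standard deviation of the feature within each question.
    stats = {q: (mean(v), pstdev(v)) for q, v in values.items()}

    result = []
    for row in rows:
        mu, sigma = stats[row["question_id"]]
        z = (row[feature] - mu) / sigma if sigma > 0 else 0.0
        result.append({**row, feature + "_z": z})
    return result


# Made-up durations: question '31' takes roughly ten times longer than '23'.
rows = [
    {"user_id": "a", "question_id": "23", "total_duration": 10},
    {"user_id": "b", "question_id": "23", "total_duration": 20},
    {"user_id": "a", "question_id": "31", "total_duration": 100},
    {"user_id": "b", "question_id": "31", "total_duration": 300},
]
normalized = normalize_by_question(rows, "total_duration")
print([r["total_duration_z"] for r in normalized])
```

After this step, user "a" is one standard deviation faster than average on both questions, even though the raw durations differ by an order of magnitude.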