The XML elements include "location," "convo," and "class." I proceeded to consolidate the data into a Pandas dataframe, utilizing a for loop to create individual columns for each XML tag. During the loading process, I encountered errors with two files: "log_317.xml" and "log_1768.xml." Upon investigation, I found these two files to be nearly identical except for their "location" values. More details can be found in the ".txt" file within the Misc folder.
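The loading step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `load_logs`, the one-record-per-file layout, and the assumption that the two bad files fail with `ET.ParseError` are all mine.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def load_logs(folder):
    """Parse every .xml file in `folder` into one dict per file (assumed layout)."""
    rows, failed = [], []
    for path in sorted(Path(folder).glob("*.xml")):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError as err:
            # Record files that will not parse instead of stopping the loop.
            failed.append((path.name, str(err)))
            continue
        row = {"file": path.name}
        for elem in root.iter():
            # One column per XML tag; elements holding only whitespace are skipped.
            # If a tag repeats within a file, the last occurrence wins.
            if elem.text and elem.text.strip():
                row[elem.tag] = elem.text.strip()
        rows.append(row)
    return rows, failed
```

The returned `rows` list can be passed straight to `pandas.DataFrame(rows)`, and `failed` gives a list of the files worth inspecting by hand.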
Interestingly, these files seemed to contain hints for decoding, which led me to mistakenly invest a significant amount of time searching for non-English characters within the "convo" column. This approach was misguided. After nearly giving up, a moment of insight occurred when I recognized that certain entries in the "convo" column resembled international news.
I realized that the two problematic files contained text in a foreign language that I couldn't process in Python. This revelation sparked an idea: filtering for conversations that mention countries or related entities. I did this with spaCy, which left me with over 700 filtered rows. While not flawless, this filtering method provided a workable solution for the challenge.
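A filter along these lines might look like the sketch below. The label set and the helper names `mentions_place` and `filter_geo` are my own assumptions; `GPE`, `NORP`, and `LOC` are entity labels used by spaCy's English pipelines, but the notebook may have used a different selection.

```python
# Entity labels assumed to mark "countries or related entities":
# GPE = countries, cities, states; NORP = nationalities and other groups;
# LOC = other named locations.
GEO_LABELS = {"GPE", "NORP", "LOC"}


def mentions_place(doc, labels=GEO_LABELS):
    """True if a parsed document contains at least one entity with a wanted label."""
    return any(ent.label_ in labels for ent in doc.ents)


def filter_geo(texts, nlp):
    """Return the indices of the texts whose entities include a place-like label.

    `nlp` is any callable that turns a string into an object with an `.ents`
    sequence, which is what a loaded spaCy pipeline does.
    """
    return [i for i, text in enumerate(texts) if mentions_place(nlp(text))]
```

With spaCy and an English model installed, `nlp = spacy.load("en_core_web_sm")` would supply the callable, and the returned indices could be used with `df.iloc[...]` to keep the matching rows.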
Beforehand, I had attempted to dissect the data by splitting the location column and identifying its unique elements. The numerical values were consistently integers, ranging from two-digit numbers to just over 200. My intuition suggested they might be coordinates, but my initial attempts at plotting them while filtering by class yielded inconclusive results. Even when I plotted only the coordinates from the 700-plus filtered rows, the points still appeared random.
Out of frustration, I narrowed the data down to conversations related to Bangladesh (the country whose language appeared in the two problematic files). This reduced the dataset to just 5 rows. Plotting these values surprisingly resulted in a straight line, revealing my oversight:
I had neglected to convert the "location" values from strings to integers.
After rectifying this oversight and replotting with the filtered data, the solution became evident. For a visual representation of the plot and the final solution, please refer to the solution Jupyter notebook.
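The string-to-integer fix can be sketched like this. The sample values and the comma separator are made up for illustration, and `parse_location` is a hypothetical helper; the real "location" format may differ. The sketch also shows why string coordinates mislead: strings are ordered character by character rather than by value, and plotting libraries such as Matplotlib treat string values as categories instead of numbers.

```python
import re


def parse_location(value):
    """Pull every run of digits out of a location string and convert it to int."""
    return tuple(int(n) for n in re.findall(r"\d+", value))


# Made-up sample values; the separator is an assumption.
raw_locations = ["100,20", "9,30", "25,7"]

points = [parse_location(v) for v in raw_locations]
xs = [p[0] for p in points]
ys = [p[1] for p in points]

# As strings, "100" sorts before "9" because comparison is character by character.
print(sorted(v.split(",")[0] for v in raw_locations))  # ['100', '25', '9']
# As integers, the order reflects the actual values.
print(sorted(xs))  # [9, 25, 100]
```

Once `xs` and `ys` hold integers, a scatter plot places each point at its true position instead of spacing the values evenly as labels.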
---
The whole challenge took me just one or two hours to figure out, and half of that time I was lying in bed scrolling social media, because ideas always come to me when I'm relaxed. Not difficult at all, but at least it was fun. 😎 Looking forward to trying other kinds of challenges.
~Althea🤍