DSTA Challenge of Wits 2023 Round 2

It's a huge pity I did not attempt Round 1: I was busy finishing up my Data Science Bootcamp, and there was also my Paris trip. I was mindlessly scrolling social media when I saw the ad again and thought I should try it out.

Task:

Millions of stolen cash was stashed away by terror organisation APOCALYPSE in their safe. A thumb drive Logs.zip with log files was recovered when they were captured. Handed to you, your team’s mission is to find the secret key from the “Logs.zip” files that will unlock the safe to recover the stolen cash. Armed with programming expertise and knowledge of text processing, you know how best to represent insights from data to answer mission requirements. You are told APOCALYPSE used a 2D shape as their secret key. Can you find the secret key to unlock the safe? 

Hint 1: Download Logs.zip from the challenge website. 
Hint 2: Write a script to solve the challenge. 

Solution:

The "Logs.zip" archive contains more than 2000 XML files, named from "log_0.xml" to "log_2254.xml".

First, I wrote some code to load one of the XML files and see what was in it. Here's an excerpt from "log_0.xml":

<data>
<location>285,88</location>
<convo>the holes in this film remain agape -- holes punched through by an inconsistent , meandering , and sometimes dry plot .</convo>
<class>casual</class>
</data>

The XML elements include "location," "convo," and "class." I proceeded to consolidate the data into a Pandas dataframe, utilizing a for loop to create individual columns for each XML tag. During the loading process, I encountered errors with two files: "log_317.xml" and "log_1768.xml." Upon investigation, I found these two files to be nearly identical except for their "location" values. More details can be found in the ".txt" file within the Misc folder.
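A minimal sketch of this loading step, using only the standard library. The tag names come from the excerpt above; the function name `load_logs` and the assumption that the two bad files fail with `ET.ParseError` are mine for illustration, since the exact error is not recorded here.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

TAGS = ("location", "convo", "class")  # element names seen in the excerpt


def load_logs(folder):
    """Parse every .xml file in `folder`; return (rows, failed_file_names)."""
    rows, failed = [], []
    for path in sorted(Path(folder).glob("*.xml")):
        try:
            root = ET.parse(str(path)).getroot()
        except ET.ParseError:
            # Remember files the parser rejects so they can be inspected by hand.
            failed.append(path.name)
            continue
        row = {"file": path.name}
        for tag in TAGS:
            row[tag] = root.findtext(tag)  # None if the tag is missing
        rows.append(row)
    return rows, failed
```

The returned `rows` list can be passed straight to `pandas.DataFrame(rows)` to get one column per tag, and `failed` lists the files that need a manual look.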

Interestingly, these files seemed to contain hints for decoding, which led me to mistakenly invest a significant amount of time searching for non-English characters within the "convo" column. This approach was misguided. After nearly giving up, a moment of insight occurred when I recognized that certain entries in the "convo" column resembled international news.

I realized that the two problematic files contained text in a foreign language that I couldn't process in Python. This gave me an idea: filter for conversations that mention countries or related entities. I did this with spaCy, which left over 700 rows. While not flawless, this filtering method was workable for the challenge.
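A rough sketch of this kind of entity filter. The choice of the `GPE` and `NORP` labels and of the `en_core_web_sm` model is an assumption for illustration, not necessarily what the notebook uses, and the function names are hypothetical. The pipeline is passed in as `nlp` so the filter itself does not depend on a particular model.

```python
PLACE_LABELS = {"GPE", "NORP"}  # assumed: countries/cities, nationalities/groups


def mentions_place(doc):
    """True if a parsed spaCy Doc has at least one place-like entity."""
    return any(ent.label_ in PLACE_LABELS for ent in doc.ents)


def filter_place_rows(rows, nlp):
    """Keep rows whose "convo" text mentions a country or related entity."""
    return [row for row in rows if mentions_place(nlp(row["convo"] or ""))]


def demo(rows):
    # Not called here: needs spaCy and the en_core_web_sm model installed.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    return filter_place_rows(rows, nlp)
```

Named-entity recognition will miss some mentions and flag some false positives, which matches the "not flawless" caveat above.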

Beforehand, I had attempted to dissect the data by splitting the location column and identifying its unique values. The numerical values were consistently integers, ranging from two digits to over 200. My intuition suggested they might be coordinates, but my initial attempts at plotting them while filtering by class yielded inconclusive results. Even when I plotted only the coordinates from the 700-plus filtered rows, the result still appeared random.

Frustration led me to narrow down the data to conversations related to Bangladesh (the two problematic files were in Bengali, the language of Bangladesh). This reduced the dataset to just 5 rows. Plotting these values surprisingly resulted in a straight line, revealing my oversight: 

I had neglected to convert the "location" values from strings to integers.

After rectifying this oversight and replotting with the filtered data, the solution became evident. For a visual representation of the plot and the final solution, please refer to the solution Jupyter notebook.
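The conversion can be sketched as below. The helper names are hypothetical, and the plotting function assumes matplotlib, which may not be what the notebook uses. A likely reason for the straight line: matplotlib treats string values as categories and places them at positions 0, 1, 2, ... in order of appearance, so unique string x and y values land on a diagonal.

```python
def parse_location(text):
    """Turn a string such as "285,88" into an (x, y) tuple of ints."""
    x, y = text.split(",")
    return int(x), int(y)


def plot_points(locations):
    # Not called here: needs matplotlib installed.
    import matplotlib.pyplot as plt

    xs, ys = zip(*(parse_location(loc) for loc in locations))
    plt.scatter(xs, ys)
    plt.gca().set_aspect("equal")  # keep the shape undistorted
    plt.show()


# Strings compare character by character, so "88" sorts after "285".
print(sorted(["285", "88", "9"]))                  # ['285', '88', '9']
print(sorted(int(v) for v in ["285", "88", "9"]))  # [9, 88, 285]
```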

---

The whole challenge took me just one or two hours to figure out, and half the time I was lying in bed scrolling social media, because ideas always come to me when I'm relaxed. Not difficult at all, but at least it was fun. 😎 Looking forward to trying other kinds of challenges.

~Althea🤍