String matching for governorate information in unstructured text
Approach
As I wrote in my blogpost, “Lessons at HRDAG: Making More Syrian Records Useable,” HRDAG compiled a dictionary mapping each unique raw Arabic value in the governorates field to the English name of the Syrian governorate it refers to, for every source used in the original analysis. In this new approach, however, I am using fields that have not been preprocessed; most of their values have no corresponding entry in the dictionary, so I could not rely on it for transliterations. I wanted to take a string-matching approach to search for instances of the governorates we were looking for, but I do not know Arabic: I did not know whether proper nouns such as place names would be written differently in different grammatical contexts, nor would I have been able to spot-check whether I was on the right track in identifying locations of death.
I decided to test a different approach: Translate the original Arabic text into English using Google Translate, then string match on the English names of the 14 Syrian governorates. I saw this as picking low-hanging fruit: Find the governorates I can find now, and build on this approach later as we get more Arabic-speaking resources, since we wanted to be mindful of our oracles’ time.
As a proof of concept, I tested out this process on a source I was familiar with from prior exploration. This source had three fields I thought were likely to contain governorate of death information: area_of_death, neighborhood_of_death, and notes. Area and neighborhood of death seemed to be designated for locational information about a killing at the sub-governorate level, and the notes field is an unstructured text field with miscellaneous information.
At HRDAG, production code is written as a script (e.g., a .py or .R script) that can be run from the terminal via a Makefile. However, while testing out this approach, I drafted my code in a Jupyter Notebook so I could run code interactively, reorganize it into functions, and organize my thoughts.
Technical Details/Reflection
I used Google Cloud Translate to translate each entry in each of the three fields. Since each translation is an API call that costs some nonzero amount of Google Cloud credits, I reduced redundant translations in two ways: I trimmed the list of items to translate down to only the unique, non-null entries (stripping leading and trailing whitespace first), and I wrote out the DataFrame of original text and translations as a pickle file so that entries would not have to be re-translated on later runs.
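Here is a minimal sketch of that caching step, assuming the google-cloud-translate v2 client; the function and column names are illustrative, not HRDAG’s actual code.

```python
import pandas as pd
from google.cloud import translate_v2 as translate

def translate_unique_values(series: pd.Series, cache_path: str) -> pd.DataFrame:
    """Translate only the unique, non-null values of a field, then cache
    the original/translation pairs to a pickle so reruns skip the API."""
    client = translate.Client()
    uniques = series.dropna().str.strip().unique()
    rows = []
    for text in uniques:
        result = client.translate(text, source_language="ar", target_language="en")
        rows.append({"original": text, "translation": result["translatedText"]})
    lookup = pd.DataFrame(rows)
    lookup.to_pickle(cache_path)  # persist so we never pay to re-translate
    return lookup
```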
I ran the translation code interactively to test it on certain fields, then ran it from a function, adjusting the function whenever issues arose. For instance, I hit API rate limits, which meant I needed to run the translations in batches with sleep time between them, as in the sketch below. Another thing to beware of when troubleshooting and testing different things is accidentally overwriting your output files: it is easy to copy-paste code and forget to edit a variable name.
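A sketch of that batching workaround; the batch size and pause are illustrative guesses, not the values I actually used.

```python
import time

def translate_in_batches(texts, client, batch_size=100, pause_secs=10):
    """Send translation requests in batches, sleeping between batches
    to stay under the API's rate limits."""
    translations = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        # The v2 client accepts a list of strings and returns a list of dicts.
        results = client.translate(batch, source_language="ar", target_language="en")
        translations.extend(r["translatedText"] for r in results)
        time.sleep(pause_secs)  # tune to the actual quota
    return translations
```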
Have you ever had the rare but amazing feeling of finding the exact function you need? That was how I felt when I found Pandas’ Series.str.findall function, which finds all occurrences of a pattern or regular expression and puts them into a list. Using str.findall(), I added a column of governorate matches to my Arabic-English translation tables for each of my three fields (area_of_death, neighborhood_of_death, notes).
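For example, with a hypothetical English-translated column and a few of the governorate names:

```python
import pandas as pd

# A few of the 14 governorate names, for illustration.
governorates = ["Aleppo", "Damascus", "Homs", "Idlib"]
pattern = "|".join(governorates)

df = pd.DataFrame(
    {"notes_en": ["Killed in Aleppo", "From Homs, died near Damascus", "No location given"]}
)
df["notes_matches"] = df["notes_en"].str.findall(pattern)
# notes_matches: [['Aleppo'], ['Homs', 'Damascus'], []]
```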
One thing in this process that turned out to be more difficult than I expected was combining multiple Pandas DataFrame columns into one. I thought this would be straightforward: str.findall() outputs a cell containing a list, so it should be as simple as extending one list with another or combining the lists into a set, but that was not possible directly. Nor can you call a string function on a cell and have it apply to the cell’s value. Finally, I did not want to loop through every single cell in the DataFrame, row by row and column by column, in a nested loop; I wanted to keep the operation vectorized for efficiency and readability. Enter: .apply() and the lambda function. By putting a lambda function inside a .apply(), I could access and manipulate the values inside the cells of a particular DataFrame field, as in the sketch below.
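A minimal sketch of the pattern, using a toy DataFrame with hypothetical match columns:

```python
import pandas as pd

df = pd.DataFrame({
    "area_matches": [["Aleppo"], [], ["Homs"]],
    "neighborhood_matches": [["Aleppo"], ["Idlib"], []],
    "notes_matches": [[], ["Idlib", "Damascus"], None],
})

# One .apply() over the rows: merge the list-valued cells into a single set.
match_cols = ["area_matches", "neighborhood_matches", "notes_matches"]
df["all_matches"] = df[match_cols].apply(
    lambda row: set().union(*(m for m in row if isinstance(m, list))),
    axis=1,
)
# all_matches: [{'Aleppo'}, {'Idlib', 'Damascus'}, {'Homs'}]
```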
Given that str.findall() outputs a list, I converted each cell into a string and did some processing (e.g., removing “[“ and quotes) to facilitate using string concatenation and other manipulations to consolidate cells. There were multiple steps at which I converted the contents of a cell to string type and back. With method chaining, it is possible to chain multiple operations sequentially, and by putting each call on its own line, you can distinguish between the steps and keep things visually clean.
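For instance, one way to flatten a list-valued column into a plain, comma-separated string, with each chained call on its own line (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"matches": [["Aleppo"], ["Homs", "Damascus"], []]})

# Flatten each list cell into a plain string, one chained step per line.
df["matches_str"] = (
    df["matches"]
    .astype(str)                        # e.g. "['Homs', 'Damascus']"
    .str.replace("[", "", regex=False)  # strip the brackets...
    .str.replace("]", "", regex=False)
    .str.replace("'", "", regex=False)  # ...and the quotes
)
# matches_str: ["Aleppo", "Homs, Damascus", ""]
```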
There was another issue when consolidating multiple cells into one: Some records had more than one governorate in their text fields, either because more than one location appeared in a single entry, or because a record had conflicting governorate values across the three fields. Through manually checking values in the fields, I found that the notes field is not as closely tied to location-of-death information as we’d like: The appearance of a governorate in the notes field does not always indicate the killing took place there (e.g., the entry could be describing where the victim was traveling from at the time of death). Thus, we decided not to use the notes field in imputing governorates, and to use only area and neighborhood of death, which are more explicitly tied to the location where a death took place.
Once I had the English translations, I could string match to find governorates in English. I aggregated the list of English governorate names to search for from the dictionary that maps original Arabic to English governorates in a later step of the workflow. With English, I could count on an exact string match, given English’s lack of declensions (a word changing its form depending on its syntactic role in a sentence). However, I would not be able to pick up a match if a particular governorate was anglicized (transliterated into English) using a different spelling, something I put in the parking lot to look into later.
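As an illustration of that aggregation step, with a hypothetical slice of the Arabic-to-English dictionary and word boundaries added so names only match as whole words:

```python
import re

# Hypothetical slice of the dictionary mapping raw Arabic values
# to English governorate names.
governorate_map = {"حلب": "Aleppo", "دمشق": "Damascus", "حمص": "Homs"}

english_names = sorted(set(governorate_map.values()))
# \b keeps each name from matching inside a longer word.
pattern = r"\b(" + "|".join(re.escape(name) for name in english_names) + r")\b"
```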
Testing
How do we test that this process worked? When testing code, we need to know two things: What is the question or problem we are trying to solve, and is our answer correct or our solution effective? There is an implicit third thing we need to know: the metric to use to determine whether our answer or solution is good, and how good it is. At this point, I returned to the original goal of the project: To increase the number of records with an associated governorate. Thus, we’d want to look at how many records that previously didn’t have an associated governorate now get an imputed governorate. (We can also look at what percentage of records without governorates now have an imputed governorate.)
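A sketch of that first metric, with a toy DataFrame and illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({
    "governorate": ["Aleppo", None, None, "Homs"],
    "imputed_governorate": ["Aleppo", "Idlib", None, "Hama"],
})

# How many records that lacked a governorate now have an imputed one?
missing_before = df["governorate"].isna()
gained = (missing_before & df["imputed_governorate"].notna()).sum()
print(f"{gained} of {missing_before.sum()} previously missing records "
      f"({gained / missing_before.sum():.1%}) gained a governorate")
```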
But how do we know if these imputed governorates can be trusted? Let’s think about what this would look like if this were an assertion. We’d want to know that our imputed governorates match the ground truth: in this case, the (English) governorate labels already assigned to records. I looked at records with existing governorates, and took as a metric the percentage of records where the imputed governorate matched the actual one.
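Continuing with the toy DataFrame from the previous sketch, the agreement metric might look like this:

```python
# Among records with both an existing and an imputed governorate,
# what share agree?
both = df.dropna(subset=["governorate", "imputed_governorate"])
match_rate = (both["governorate"] == both["imputed_governorate"]).mean()
print(f"{match_rate:.0%} of imputed governorates match the existing label")
```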
There were 560 records with an originally assigned governorate and a single, consolidated imputed governorate (excluding records with more than one governorate, null values, or locations other than the 14 Syrian governorates) that I used for this test. Of those, the imputed governorate matched the originally assigned one 74% of the time (416 records). This indicates that our method is promising, but it needs further refinement and involvement from language partners before being fully productionized into the analytical workflow.