Picture taken from Unsplash. Notice how the typewriter has a QWERTZ layout? Funny, isn't it?

Hey, folks! 👋🏻

Remember how I told you about finishing university? Well, as you may know, a student needs to either write a bachelor's thesis or take a huge exam covering all the material learned over however many years of study. Since my curriculum doesn't have the option to take an exam (which I wouldn't have wanted to take anyway), I am writing a bachelor's thesis. I'm not going to tell you the exact topic, as you would probably be bored by it, but the general idea is named entity recognition.

Why?

If you've read my post, you've probably understood that while I love my bachelor's curriculum, I want to study software engineering and web development as well. So instead of writing the thesis about my main subject (the Estonian language in general), I chose computational linguistics. Named entity recognition was one of the thesis topics offered to me, and I took the plunge.

Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc. Source: Wikipedia

That may seem like a lot of gibberish to the ordinary reader, who might not even be familiar with computers, but I'll give you the gist of it:

When you read an article in a newspaper, you often stumble upon names, for example Donald Trump or Vladimir Putin, but also organisation names like Microsoft, Apple or Google. While you understand that they are names, the computer doesn't: it needs a bit of programming and some rule sets. Unfortunately, even the best programs don't understand names perfectly all the time, but neither do people.

The best system in the MUC-7 evaluation reached an F₁-score of 0.9339. The F₁-score combines precision (how many of the names the program marked are correct) and recall (how many of the real names it managed to find), so to put it simply, the program finds a name and guesses its type right (whether it's a person's name, an organisation's name etc.) roughly 93% of the time. The human annotators scored 97.60% and 96.95%. Source: Wikipedia
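For the curious, here is a minimal sketch of how an F₁-score is calculated from counts of right and wrong guesses. The function name and the counts are made up for illustration; they are not taken from the evaluation quoted above.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the harmonic mean of precision and recall."""
    if true_positives == 0:
        # No correct guesses at all: define the score as 0 to avoid dividing by zero.
        return 0.0
    # Precision: of all the names the program marked, how many were right?
    precision = true_positives / (true_positives + false_positives)
    # Recall: of all the names really in the text, how many did it find?
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 90 names found correctly, 5 false alarms, 10 names missed.
score = f1_score(true_positives=90, false_positives=5, false_negatives=10)
print(round(score, 4))  # 0.9231
```

Because it is a harmonic mean, the score drops quickly if either precision or recall is poor, so a program can't cheat by marking every word as a name.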

There are a lot of challenges 💻

A computer is simply a trail of ones and zeroes. There are a lot of big challenges with named entity recognition. I'll name two of the biggest ones:

Metonymy. Simply put, a word meaning two different things. In this case, a name meaning two different things. A very American example of this would be "White House". We could speak of the White House as the residence and workplace of the POTUS, but "The White House" is also quite often used to mean the staff of the White House. For example: "The White House is making an announcement today." The computer has a tough time understanding that, and making up a rule for this exact occasion is quite tough.
A lot of recognition models are based on information around the entity (the other words around the name in the sentence, i.e. the context), but when neither the name nor the context reveals whether the name is, for example, a person or an organisation - shit goes down. Consider "Philip Morris announced today that..." - people understand that Philip Morris is the company, not the person, but the computer doesn't.
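To make the context problem concrete, here is a toy sketch of a rule that looks at the words around a name. The cue words, the labels and the function name are all my own assumptions for illustration; real recognisers learn such hints from annotated data and are far more sophisticated.

```python
import re

# Hypothetical cue words that hint at the type of the name next to them.
PERSON_CUES = {"mr", "mrs", "dr", "president"}
ORG_CUES = {"inc", "ltd", "corp", "company"}


def classify_by_context(sentence: str, name: str, window: int = 2) -> str:
    """Guess the type of `name` from up to `window` words on each side of it."""
    tokens = re.findall(r"\w+", sentence.lower())
    name_tokens = re.findall(r"\w+", name.lower())
    n = len(name_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == name_tokens:
            before = tokens[max(0, i - window):i]
            after = tokens[i + n:i + n + window]
            context = set(before + after)
            if context & PERSON_CUES:
                return "PER"
            if context & ORG_CUES:
                return "ORG"
            # Nothing around the name gives its type away.
            return "UNKNOWN"
    return "NOT_FOUND"


for sentence in [
    "Dr Philip Morris announced today that he will retire.",
    "Philip Morris Inc announced today that prices will rise.",
    "Philip Morris announced today that prices will rise.",
]:
    print(classify_by_context(sentence, "Philip Morris"), "<-", sentence)
```

The first two sentences carry a helpful cue ("Dr", "Inc"), but the third one gives the rule nothing to work with, which is exactly the situation described above.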

So that's what I'll be doing for the upcoming year 📃

I will be implementing a Python script to recognise named entities in 19th century parish court protocols. This is probably going to be quite a tough task, because the court protocols are written in old literary language, which means that the Estonian used in them is quite different from the language we speak today. As you may know, languages develop over time, and Estonian is one that has developed a lot.

I will not be writing the program from scratch though - there is a library written in Python named EstNLTK, which has named entity recognition built into it. However, the library is written for later Estonian, the language of the 20th-21st century. It shouldn't have any problems processing that language, but it will have problems with the 19th century. Here is where I come in.

I think this will be quite a tough task, but all the mathematics etc around it seems utterly interesting. I also seem to have stumbled upon a very nice instructor for my thesis. I will try to keep you updated on how I'm doing, because I think this will be a very fun experiment, where I get to put my Python and linguistic knowledge to use.

Should you have any questions about this topic, I'm very much willing to answer, and if I don't know the answer, I will try to find it out myself. 💭

xves (61) 4 years ago

Damn, Kristjan, I'm impressed... and intrigued! You will definitely have to post about the thesis process and of things you've learnt about during your research and practice! I'd love to get to know more about this topic but I don't know where to start and what to ask even. 😅

jibspark (60) 4 years ago

Ah, of course! I'll let you know! ;)

cadawg (69) 4 years ago (edited)

Wow. That's quite some task! When you first said it I thought you were talking about looking at photos and distinguishing objects (but after seeing you talk about Organisations, I realised my assumption was utterly wrong). I am only just starting University so I haven't got to that point yet. It's sure to be interesting and I wish you the best of luck!

Thanks for posting in the programming community.
