Idea: news diversity

Most of my mornings start with a coffee and reading the news on time.mk – an aggregator of news portals.

At the time of writing this post, the sources list on time.mk contains about 115 sources. As readers, we are mostly interested in unique content, yet many of those 115 sources just generate noise. It may look something like this:

Even if you don’t understand Cyrillic, you can tell it’s definitely the same text.

So, how do we find the most unique sources and only stick to them, instead of wasting our time on copy-paste news portals that do not produce original content?

We start with the most obvious definition:

Definition 1: There exist portals p_1, p_2, \ldots, p_n.

We will just use the word portal for news portal. Every portal has a list of contents:

Definition 2: A portal p has a list of contents, each paired with a timestamp: (c_1, t_1), (c_2, t_2), \ldots, (c_n, t_n).

We have a way to compute the similarity between two contents:

Definition 3: There exists a function 0 \leq S(c_1, c_2) \leq 1 that outputs how similar c_1 and c_2 are: the lower the number, the less similar they are, and conversely, the higher the number, the more similar they are.
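The definition leaves S abstract. As a minimal sketch (the function name, the whitespace tokenization, and the choice of Jaccard similarity over word sets are my own assumptions, not something the definitions prescribe), one possible S could look like this:

```python
def similarity(c1: str, c2: str) -> float:
    """One possible S(c1, c2): Jaccard similarity over word sets, in [0, 1]."""
    words1, words2 = set(c1.lower().split()), set(c2.lower().split())
    if not words1 and not words2:
        return 1.0  # two empty texts are trivially identical
    return len(words1 & words2) / len(words1 | words2)
```

Anything that maps a pair of texts into [0, 1] – TF-IDF cosine similarity, MinHash over shingles, and so on – would satisfy Definition 3 just as well.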

So for example, given the portals p_1, p_2, p_3 from the screenshot above, together with their respective texts (c_1, t_1), (c_2, t_2), (c_3, t_3), we can already tell that S(c_1, c_2), S(c_2, c_3), and S(c_3, c_1) will all be close to 1.

But what does this tell us about p_1, p_2, p_3? Well, it tells us that they produce almost the same content around a given time t. For this reason, note also that \mid t_1 - t_2 \mid and \mid t_2 - t_3 \mid have to be small. This leads us to the next definition:

Definition 4: A portal p_1 will be similar to a portal p_2 at a specific timestamp t if \exists (c_1, t_1) \in p_1, (c_2, t_2) \in p_2 such that (see the sketch after this list):

  • S(c_1, c_2) > \lambda – i.e. the similarity is above some threshold \lambda that defines the similarity bound
  • \mid t_1 - t \mid < \epsilon – the first article is published close to timestamp t
  • \mid t_2 - t \mid < \epsilon – the second article is published close to timestamp t
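Here is a rough sketch of Definition 4, assuming articles are stored as (content, timestamp) pairs and reusing the similarity function sketched above; the default values for lam and eps are placeholders, not something the definition prescribes:

```python
from typing import List, Tuple

Article = Tuple[str, float]  # (c_i, t_i): content plus a Unix timestamp

def similar_at(p1: List[Article], p2: List[Article], t: float,
               lam: float = 0.8, eps: float = 3600.0) -> bool:
    """Definition 4: p1 is similar to p2 at timestamp t if some pair of
    articles is similar enough (S > lam) and both were published within
    eps seconds of t. The thresholds are placeholder values."""
    for c1, t1 in p1:
        if abs(t1 - t) >= eps:
            continue
        for c2, t2 in p2:
            if abs(t2 - t) < eps and similarity(c1, c2) > lam:
                return True
    return False
```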

A portal being similar to another at just a single point in time does not tell us much. For example, maybe there was some important news where one portal was the source and the others just referenced it to spread the news. We will accept cases like these as okay. However, if this happens more often, it tells us something about the overall similarity between the two portals.

Definition 5: For a given timestamp range T_1 < T_2, a portal p_1 will be similar to a portal p_2 if there are distinct timestamps t_1, t_2, \ldots, t_{\lambda'} such that (see the sketch after this list):

  • T_1 < t_1, t_2, \ldots, t_{\lambda'} < T_2 – i.e. at least \lambda' timestamps are within the range, for some threshold \lambda'
  • Definition 4 applies at each of these timestamps
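Continuing the same sketch, Definition 5 just counts how many candidate timestamps inside (T_1, T_2) satisfy Definition 4. The list of candidate timestamps (say, hourly buckets over the range) and the default thresholds are my own assumptions:

```python
def similar_in_range(p1, p2, timestamps, T1, T2, lam_prime: int = 5,
                     lam: float = 0.8, eps: float = 3600.0) -> bool:
    """Definition 5: p1 is similar to p2 over (T1, T2) if Definition 4
    holds at lam_prime (lambda') or more distinct timestamps in the range."""
    hits = 0
    for t in timestamps:  # e.g. hourly buckets covering the range
        if T1 < t < T2 and similar_at(p1, p2, t, lam, eps):
            hits += 1
            if hits >= lam_prime:
                return True
    return False
```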

Now we have a way to identify the portals producing non-original content in a given year, say 2018. We would have to monitor this weekly (or even daily) to see how a portal's trend changes. Given all of this information, we can construct a graph and cluster similar news sites, as sketched below.
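One way that clustering could look, sticking with the sketch above: treat portals as nodes, add an edge whenever Definition 5 says two portals are similar over the range, and take the connected components as clusters of copy-paste portals. The data layout (a dict mapping portal names to their article lists) is an assumption:

```python
from collections import defaultdict

def cluster_portals(portals, timestamps, T1, T2):
    """Group portals into clusters of mutually similar (copy-paste) sources."""
    names = list(portals)  # portals: dict of name -> list of (content, timestamp)
    graph = defaultdict(set)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similar_in_range(portals[a], portals[b], timestamps, T1, T2):
                graph[a].add(b)
                graph[b].add(a)
    clusters, seen = [], set()
    for name in names:
        if name in seen:
            continue
        component, stack = set(), [name]
        while stack:  # plain DFS over the similarity graph
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters
```

Portals that end up in a component of their own are the ones producing original content – exactly the sources worth keeping.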

The only remaining thing is for someone to build this system and save people a lot of time reading garbage. 🙂
