Idea: news diversity

Most of my mornings start with a coffee and reading the news on time.mk – an aggregator of news portals.

At the time of writing this post, the sources list on time.mk contains about 115 sources. As readers, we are mostly interested in unique content, yet many of those 115 sources just generate noise. It may look something like this:

Even if you don’t understand Cyrillic, you can tell it’s definitely the same text.

So, how do we find the most unique sources and only stick to them, instead of wasting our time on copy-paste news portals that do not produce original content?

We start with the most obvious definition:

Definition 1: There exist portals p_1, p_2, \ldots, p_n.

We will just use the word portal for news portal. Every portal has a list of contents:

Definition 2: A portal p has a list of contents, each paired with a timestamp: (c_1, t_1), (c_2, t_2), \ldots, (c_n, t_n).

We have a way to compute the similarity between two contents:

Definition 3: There exists a function 0 \leq S(c_1, c_2) \leq 1 that outputs how similar c_1 and c_2 are: the lower the number, the less similar they are, and conversely, the higher the number, the more similar they are.
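The definition leaves S abstract. As a minimal sketch (the function name, the whitespace tokenization, and the choice of Jaccard similarity over word sets are my own assumptions, not something the definitions prescribe), one possible S could look like this:

```python
def similarity(c1: str, c2: str) -> float:
    """One possible S(c1, c2): Jaccard similarity over word sets, in [0, 1]."""
    words1, words2 = set(c1.lower().split()), set(c2.lower().split())
    if not words1 and not words2:
        return 1.0  # two empty texts are trivially identical
    return len(words1 & words2) / len(words1 | words2)
```

Anything that maps a pair of texts into [0, 1] – TF-IDF cosine similarity, MinHash over shingles, and so on – would satisfy Definition 3 just as well.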

So for example, given the portals p_1, p_2, p_3 from the screenshot above, together with their respective texts (c_1, t_1), (c_2, t_2), (c_3, t_3), we can already tell that S(c_1, c_2), S(c_2, c_3), and S(c_3, c_1) will all be close to 1.

But what does this tell us about p_1, p_2, p_3? Well, it tells us that they produce almost the same content around a given time t. For this reason, note also that \mid t_1 - t_2 \mid and \mid t_2 - t_3 \mid have to be small. This leads us to the next definition:

Definition 4: A portal p_1 will be similar to a portal p_2 at a specific timestamp t if \exists (c_1, t_1) \in p_1, (c_2, t_2) \in p_2 such that (see the sketch after this list):

  • S(c_1, c_2) > \lambda – i.e. the similarity is above some threshold \lambda that defines the similarity bound
  • \mid t_1 - t \mid < \epsilon – the first article is published close to timestamp t
  • \mid t_2 - t \mid < \epsilon – the second article is published close to timestamp t
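Here is a rough sketch of Definition 4, assuming articles are stored as (content, timestamp) pairs and reusing the similarity function sketched above; the default values for lam and eps are placeholders, not something the definition prescribes:

```python
from typing import List, Tuple

Article = Tuple[str, float]  # (c_i, t_i): content plus a Unix timestamp

def similar_at(p1: List[Article], p2: List[Article], t: float,
               lam: float = 0.8, eps: float = 3600.0) -> bool:
    """Definition 4: p1 is similar to p2 at timestamp t if some pair of
    articles is similar enough (S > lam) and both were published within
    eps seconds of t. The thresholds are placeholder values."""
    for c1, t1 in p1:
        if abs(t1 - t) >= eps:
            continue
        for c2, t2 in p2:
            if abs(t2 - t) < eps and similarity(c1, c2) > lam:
                return True
    return False
```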

A portal being similar to another at just a single point in time does not tell us much. For example, maybe there was some important news where one portal was the source and the others just referenced it to spread the news. We will accept cases like these as okay. However, if this happens more often, it tells us something about the overall similarity between the two portals.

Definition 5: For a given timestamp range T_1 < T_2, a portal p_1 will be similar to a portal p_2 if there are distinct timestamps t_1, t_2, \ldots, t_{\lambda'} such that (see the sketch after this list):

  • T_1 < t_1, t_2, \ldots, t_{\lambda'} < T_2 – i.e. at least \lambda' timestamps are within the range, for some threshold \lambda'
  • Definition 4 applies at each of these timestamps
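Continuing the same sketch, Definition 5 just counts how many candidate timestamps inside (T_1, T_2) satisfy Definition 4. The list of candidate timestamps (say, hourly buckets over the range) and the default thresholds are my own assumptions:

```python
def similar_in_range(p1, p2, timestamps, T1, T2, lam_prime: int = 5,
                     lam: float = 0.8, eps: float = 3600.0) -> bool:
    """Definition 5: p1 is similar to p2 over (T1, T2) if Definition 4
    holds at lam_prime (lambda') or more distinct timestamps in the range."""
    hits = 0
    for t in timestamps:  # e.g. hourly buckets covering the range
        if T1 < t < T2 and similar_at(p1, p2, t, lam, eps):
            hits += 1
            if hits >= lam_prime:
                return True
    return False
```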

Now we have a way to identify the portals producing non-original content in a given year, say 2018. We would have to monitor this weekly (or even daily) to see how a portal's trend changes. Given all of this information, we can construct a graph and cluster similar news sites, as sketched below.
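One way that clustering could look, sticking with the sketch above: treat portals as nodes, add an edge whenever Definition 5 says two portals are similar over the range, and take the connected components as clusters of copy-paste portals. The data layout (a dict mapping portal names to their article lists) is an assumption:

```python
from collections import defaultdict

def cluster_portals(portals, timestamps, T1, T2):
    """Group portals into clusters of mutually similar (copy-paste) sources."""
    names = list(portals)  # portals: dict of name -> list of (content, timestamp)
    graph = defaultdict(set)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similar_in_range(portals[a], portals[b], timestamps, T1, T2):
                graph[a].add(b)
                graph[b].add(a)
    clusters, seen = [], set()
    for name in names:
        if name in seen:
            continue
        component, stack = set(), [name]
        while stack:  # plain DFS over the similarity graph
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters
```

Portals that end up in a component of their own are the ones producing original content – exactly the sources worth keeping.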

The only remaining thing is for someone to build this system and save people a lot of time reading garbage. 🙂
