Calculate share of publishers' spending that is traceable #21

Open
markbrough opened this issue Jul 22, 2022 · 13 comments

@markbrough
Member

markbrough commented Jul 22, 2022

Currently, it is difficult to see the share of publishers' spending that is traceable. For example, we know that lots of NGO funding from FCDO and the Netherlands is traceable, but it is unclear how much funding through other implementing partners (e.g. multilaterals) is traceable. It would be useful to start getting a rough idea of this.

To start with, we should capture a list of provider-org/@provider-activity-id for any activities that have incoming traceability, as Incoming Funds and Incoming Commitments (transaction-type/@code = 1 or 11). See #19
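A minimal sketch of that first step, assuming IATI 2.0x XML and Python's standard `xml.etree.ElementTree`. The function name and the sample identifiers are made up for illustration; this is not the code used in IATI-Stats.

```python
import xml.etree.ElementTree as ET

# Incoming Funds (1) and Incoming Commitment (11), as named in the comment above.
INCOMING_TYPES = {"1", "11"}


def incoming_provider_activity_ids(xml_text):
    """Collect provider-org/@provider-activity-id from incoming transactions."""
    root = ET.fromstring(xml_text)
    refs = set()
    for transaction in root.iter("transaction"):
        ttype = transaction.find("transaction-type")
        if ttype is None or ttype.get("code") not in INCOMING_TYPES:
            continue  # skip outgoing transactions and those with no type
        provider = transaction.find("provider-org")
        if provider is not None and provider.get("provider-activity-id"):
            refs.add(provider.get("provider-activity-id"))
    return refs


# Hypothetical sample data: one recipient activity with three transactions.
SAMPLE = """<iati-activities version="2.03">
  <iati-activity>
    <iati-identifier>NGO-1-PROJ</iati-identifier>
    <transaction>
      <transaction-type code="1"/>
      <provider-org ref="DONOR-A" provider-activity-id="DONOR-A-100"/>
    </transaction>
    <transaction>
      <transaction-type code="3"/>
      <provider-org ref="DONOR-A" provider-activity-id="DONOR-A-999"/>
    </transaction>
    <transaction>
      <transaction-type code="11"/>
      <provider-org ref="DONOR-B" provider-activity-id="DONOR-B-200"/>
    </transaction>
  </iati-activity>
</iati-activities>"""

incoming_refs = incoming_provider_activity_ids(SAMPLE)
```

Here the disbursement (code 3) is ignored, so only the two incoming references are captured.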

We can then run through all publishers' data again, and see which publishers' activities have corresponding activities published elsewhere that report receiving funding from them.
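As a rough illustration of this second pass, the sketch below assumes we already hold two hypothetical mappings: each publisher's own activity identifiers, and the provider-activity-id values each publisher reports on its incoming transactions. All names and data are invented for the example.

```python
def traceable_share_by_publisher(activities_by_publisher, refs_by_publisher):
    """Share of each publisher's activities that another publisher references.

    activities_by_publisher: publisher id -> set of its own activity ids
    refs_by_publisher: publisher id -> set of provider-activity-ids it reports
    """
    shares = {}
    for publisher, activity_ids in activities_by_publisher.items():
        # Only references made by *other* publishers count here.
        external_refs = set()
        for other, refs in refs_by_publisher.items():
            if other != publisher:
                external_refs |= refs
        traceable = activity_ids & external_refs
        shares[publisher] = len(traceable) / len(activity_ids) if activity_ids else 0.0
    return shares


# Hypothetical data: the NGO references two of the donor's four activities,
# and the donor references one of its own (which is not counted).
activities = {"donor": {"D-1", "D-2", "D-3", "D-4"}, "ngo": {"N-1"}}
refs = {"ngo": {"D-1", "D-2"}, "donor": {"D-3"}}
shares = traceable_share_by_publisher(activities, refs)
```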

@markbrough markbrough changed the title Calculat t Calculate share of publishers' spending that is traceable Jul 22, 2022
@Bjwebb Bjwebb self-assigned this Sep 8, 2022
@Bjwebb

Bjwebb commented Sep 20, 2022

Notes from our call last week:

We have a traceability calculation on the dashboard already, but that is how much of a recipient publisher's incoming funds have an activity id. In this issue we're interested in how many of a provider publisher's activities have a recipient publisher linking to them.

We might eventually want to filter this by humanitarian flag. This will likely be similar to how we currently handle hierarchy.

@Bjwebb

Bjwebb commented Sep 22, 2022

I've made a commit and generated some stats on the dev site:

Here's the ratio between these as percentages: https://gist.github.com/Bjwebb/7abf31ad55f09b4c470d5ad4b78eff73

Some of the numbers will be too low, as I excluded references to a publisher's own activities from the traceable sum, but left them in the total sum. I'm going to change the code to exclude those activities from the total sum as well.

@Bjwebb

Bjwebb commented Sep 23, 2022

Here's a spreadsheet with those same percentages, and percentages of activities as well as spend:
https://docs.google.com/spreadsheets/d/1iwHB46-3Eq8_OCQ0uJzpYxTNyILwV1vkt0XzeI6oMNc/edit#gid=0
This is sorted by total spend descending, which gives a useful overview of the big publishers I think.

I've excluded activities referenced by a publisher's own activities from the denominator. Unfortunately that excludes some activities that are also referenced by other publishers, so it's possible to get >100%. I'll try to exclude only those activities that are only referenced by that publisher.
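To make the >100% effect concrete, here is a small worked example with invented numbers. It assumes the numerator sums activities referenced by other publishers, and compares two choices of what to exclude from the denominator; the function and variable names are hypothetical.

```python
# Hypothetical spend per activity for one publisher.
values = {"A": 100, "B": 50, "C": 50}
referenced_by_others = {"A", "B"}  # traceable: another publisher links to these
referenced_by_self = {"B", "C"}    # the publisher's own activities link to these


def share(values, traceable, excluded_from_total):
    """Traceable spend divided by total spend after exclusions."""
    numerator = sum(values[a] for a in traceable)
    denominator = sum(v for a, v in values.items() if a not in excluded_from_total)
    return numerator / denominator


# Excluding every self-referenced activity drops B from the denominator even
# though B is still in the numerator: 150 / 100.
broad_exclusion = share(values, referenced_by_others, referenced_by_self)

# Excluding only activities referenced *solely* by the publisher itself keeps
# B in both sums: 150 / 150.
narrow_exclusion = share(
    values, referenced_by_others, referenced_by_self - referenced_by_others
)
```

With these numbers the broad exclusion gives 150% and the narrow one gives 100%.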

My work so far is at main...dev, although I hope to rebase that before opening a PR.

@markbrough
Member Author

This is looking really great, @Bjwebb ! Thank you for all this work. It's very exciting that we are already starting to get some real numbers here.

So I think we are currently counting as "traceable" the full value of a publisher X's activities which are referenced as providing incoming funds (either as Incoming Commitments or Incoming Funds) to any other publisher's activities?

Or are we only counting the value that is stated as Incoming Commitments / Incoming Funds on the other publishers' activities?

@stevieflow

I'd like to suggest a revision in terms of how we count whether an activity is traceable

So far, I think we are in a binary scenario

  • an activity has no incoming or outgoing links = it is not traceable
  • an activity has at least one incoming or outgoing link = it is traceable

Someone looking at this methodology could (devil's advocate) just then say "Ok, let's make sure all our activities have an outgoing link". They could alter their data and see their count rise to 100%.

I think we have to mitigate for such things. The key for IATI is that activities are connected - and collaboration between publishers (by publishing links to each other's activities) is how that is achieved.

It can get a bit complicated by the fact that the current assumed and preferred model of traceability is that of links pointing upwards -- organisations include links to their immediate donor / partner in their data, so we can "trace" through.

This is opposed to the start of any chain including downward links - the argument being that the org at the start of any chain may not know where the other activities are yet, as they don't exist

If we imagine this as a set network of activities for analysis:

[Image: IATI-network]

In this "network":

  • The red activity has no links - it floats in isolation
  • The yellow activity has outgoing links only - it points at others
  • The orange activity has incoming links only - it is referenced by other activities
  • The green activity is lit both ways with incoming and outgoing links

This might need revision in terms of implicit value being expressed by colours / labels, as there can be perfectly legitimate reasons for all cases - but I think it would be really useful to disaggregate how we calculate the types of activities we are finding
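One way to picture the disaggregation is the sketch below, which labels each activity by the direction of its links using the colours from the list above. It assumes links are available as hypothetical `(source, target)` pairs, where the source activity is the one publishing the link; the data is invented.

```python
from collections import Counter


def classify(activity_id, links):
    """Label an activity by link direction.

    links: list of (source_activity_id, target_activity_id) pairs, where the
    source is the activity that publishes the link.
    """
    has_outgoing = any(src == activity_id for src, _ in links)
    has_incoming = any(dst == activity_id for _, dst in links)
    if has_incoming and has_outgoing:
        return "green"   # linked both ways
    if has_incoming:
        return "orange"  # referenced by others only
    if has_outgoing:
        return "yellow"  # points at others only
    return "red"         # no links at all


# Hypothetical network: B points at C, C points at D, A is isolated.
ACTIVITIES = ["A", "B", "C", "D"]
LINKS = [("B", "C"), ("C", "D")]
counts = Counter(classify(a, LINKS) for a in ACTIVITIES)
```

Reporting the four counts separately, rather than a single traceable / not traceable figure, keeps the yellow and orange cases distinguishable.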

@markbrough
Member Author

Thanks @stevieflow - I think this makes sense, and I also think there's a decent argument in favour of making traceability "downwards" as well as "upwards". Though I'm not sure if there would be counter-arguments around redundancy etc?

I'm also a bit nervous about implementing a methodology that's quite a departure from what is currently generally expected / implemented by most publishers... Is this something you think should be implemented now, perhaps as an additional calculation to the one that @Bjwebb has been working on?

@Bjwebb

Bjwebb commented Sep 23, 2022

@markbrough

> So I think we are currently counting as "traceable" the full value of a publisher X's activities which are referenced as providing incoming funds (either as Incoming Commitments or Incoming Funds) to any other publisher's activities?

Yes, that's right, the full value (of all commitments + disbursements) for the referenced activity.
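A sketch of that "full value" approach, under the assumption that commitments and disbursements are the IATI 2.0x transaction type codes 2 and 3. Transactions are represented as hypothetical `(activity_id, type_code, value)` tuples and currency conversion is ignored.

```python
from collections import defaultdict

# Assumed 2.0x codes: 2 = commitment, 3 = disbursement.
SPEND_TYPES = {"2", "3"}


def activity_totals(transactions):
    """Sum commitments + disbursements per activity."""
    totals = defaultdict(float)
    for activity_id, type_code, value in transactions:
        if type_code in SPEND_TYPES:
            totals[activity_id] += value
    return dict(totals)


def traceable_spend_share(totals, externally_referenced):
    """Full value of referenced activities over the value of all activities."""
    total = sum(totals.values())
    traceable = sum(v for a, v in totals.items() if a in externally_referenced)
    return traceable / total if total else 0.0


# Hypothetical data: D-1 is referenced by another publisher, D-2 is not.
transactions = [
    ("D-1", "2", 100.0),
    ("D-1", "3", 60.0),
    ("D-2", "2", 40.0),
    ("D-2", "4", 999.0),  # expenditure, not counted in this sketch
]
totals = activity_totals(transactions)
spend_share = traceable_spend_share(totals, {"D-1"})
```

Note that the whole 160 for D-1 counts as traceable, regardless of how much the recipient reports as incoming.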

@stevieflow

> but I think it would be really useful to disaggregate how we calculate the types of activities we are finding

My work so far is only looking at incoming links (ie. I only look at provider-org/@provider-activity-id).

@Bjwebb

Bjwebb commented Sep 23, 2022

BTW, a couple of other notes about my code so far:

  • It doesn't look at hierarchies, so if we only expect references to activities at the bottom of the hierarchy, we might be being unfair to include all activities in the total.
  • It includes all activities, regardless of age. We might consider it unfair to penalise people for having lots of historic activities, for which it's more difficult to get traceability links in. We could just look at "current" activities, but I think there's already several ways of defining that!
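Both caveats could be handled by filtering the denominator. The sketch below assumes the bottom of the hierarchy is the highest `hierarchy` number a publisher uses, and picks just one of the possible definitions of "current" (no end date, or an end date on or after a reference date). Both choices are assumptions for illustration, and the activity records are invented.

```python
from datetime import date


def in_denominator(activity, lowest_hierarchy, on_date):
    """Keep only bottom-of-hierarchy activities that are still current."""
    if activity["hierarchy"] != lowest_hierarchy:
        return False
    end_date = activity.get("end_date")
    # One possible definition of "current"; others exist.
    return end_date is None or end_date >= on_date


# Hypothetical activities for a single publisher.
ACTIVITIES = [
    {"id": "D-PROG", "hierarchy": 1, "end_date": None},
    {"id": "D-OLD", "hierarchy": 2, "end_date": date(2015, 12, 31)},
    {"id": "D-LIVE", "hierarchy": 2, "end_date": date(2030, 1, 1)},
]
lowest = max(a["hierarchy"] for a in ACTIVITIES)
kept = [a["id"] for a in ACTIVITIES if in_denominator(a, lowest, date(2022, 9, 23))]
```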

@stevieflow

@markbrough re: traceability methodology - I think the above expresses all possible ways activities can be "linked" - I was keen to try and not make a distinction between the yellow and orange scenarios, as they are all valid and feasible. I'd be more concerned if - at this stage - we baked in some preference for upstream traceability, when the true nature of the data standard permits any route

@Bjwebb

> My work so far is only looking at incoming links (ie. I only look at provider-org/@provider-activity-id).

Thanks. It'd also be useful to count receiver-org - but as a separate count to provider.

> we only expect references to activities at the bottom of the hierarchy

I think that's true. Some publishers might have internal links between hierarchies - but agree we should just look to the lowest set for now

> We could just look at "current" activities

Yes, good point. If we get the chance, then segmenting between current (using the PWYF definition?) and non-current could be interesting, but it will start to make the data complex

@markbrough
Member Author

Feedback on work so far:

  • There are some quite interesting and in some cases surprising findings about the share of publishers’ spend that is traceable. We see higher values for NL, FCDO and BE than for other publishers, which is in line with what we would expect given their efforts on traceability of NGO implementing partners. However, traceable shares are generally quite low across the board. This could be down to two things: 1) multilaterals not having incoming traceability; 2) looking at all activities (we wouldn’t expect to have traceability for older activities).
    • (1) - This is an interesting finding and we should try to look at how this is broken down by type of organisation - MB to do some analysis and consider if we should schedule some additional work in later November;
    • (2) - We should later try to only look at current activities (this would be useful for a number of other things in the humportal) – try to schedule this work in later November.
  • There are some cases where traceability is over 100%, e.g. rutgers. It would be good to make sure that these values are correct - add some more unit tests and try to get this solid today.
  • We should also look at different sorts of traceability later – try to schedule this work in later November.
  • V1 - spend 10 mins to try to make the traceability calculations work with v1 IATI data. It would be good to be backwards compatible if possible, but it is not an immediate priority.
  • Traceability value calculations are currently looking at all sorts of transaction types, not only incoming funds and incoming commitments. This should generally amount to the same measure either way, except where publishers are using unexpected transaction types for e.g. incoming funds. We should keep the methodology this way for now, as we are taking as broad / generous a view as possible about the amount of funding that is traceable. We can adjust later if necessary.
  • We exclude any references to a publisher’s own activities for now. That seems to be sensible.

Bjwebb added a commit that referenced this issue Sep 27, 2022
#21

This was meant to exclude activities that were only referenced by
themselves from the denominator, but the calculation was at the
publisher level, where we don't know if other publishers will reference
them yet.
@Bjwebb

Bjwebb commented Sep 28, 2022

I've added 1.0x support cf1c33e

I've found why I was getting >100%. My code that tried to exclude own ref activities from the denominator was broken. I've removed it, as it was broken and not straightforward to fix: a86ab4e. It looks like it doesn't make a big difference to the numbers coming out for large publishers.

The google sheet is up to date with these changes https://docs.google.com/spreadsheets/d/1iwHB46-3Eq8_OCQ0uJzpYxTNyILwV1vkt0XzeI6oMNc/edit#gid=0

I've also started thinking about what some "end to end" tests might look like, in terms of feeding in dummy data for multiple publishers and checking that the eventual output is what we expect. There's a branch for this, but there's not very much there yet: dev...end-to-end-traceability-tests
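For illustration only, an end-to-end test along those lines might look like the sketch below. It is not taken from the branch: `run_pipeline` is a hypothetical stand-in for the real stats code, the dummy XML is invented, and for brevity it counts every provider-activity-id without filtering on transaction type.

```python
import xml.etree.ElementTree as ET

# Dummy data for two hypothetical publishers.
DUMMY_DATA = {
    "donor": """<iati-activities>
      <iati-activity><iati-identifier>DONOR-1</iati-identifier></iati-activity>
      <iati-activity><iati-identifier>DONOR-2</iati-identifier></iati-activity>
    </iati-activities>""",
    "ngo": """<iati-activities>
      <iati-activity>
        <iati-identifier>NGO-1</iati-identifier>
        <transaction>
          <transaction-type code="1"/>
          <provider-org provider-activity-id="DONOR-1"/>
        </transaction>
      </iati-activity>
    </iati-activities>""",
}


def run_pipeline(data):
    """Stand-in pipeline: publisher -> share of activities referenced by others."""
    ids, refs = {}, {}
    for publisher, xml_text in data.items():
        root = ET.fromstring(xml_text)
        ids[publisher] = {el.text.strip() for el in root.iter("iati-identifier")}
        refs[publisher] = {
            p.get("provider-activity-id")
            for p in root.iter("provider-org")
            if p.get("provider-activity-id")
        }
    result = {}
    for publisher, own in ids.items():
        # Union of the references made by every other publisher.
        external = set().union(*(r for other, r in refs.items() if other != publisher))
        result[publisher] = len(own & external) / len(own) if own else 0.0
    return result


def test_end_to_end():
    # One of the donor's two activities is referenced by the NGO.
    assert run_pipeline(DUMMY_DATA) == {"donor": 0.5, "ngo": 0.0}


test_end_to_end()
```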

@markbrough
Member Author

Thank you, this all looks great, @Bjwebb !

@markbrough
Member Author

markbrough commented Sep 28, 2022

So one issue I have noticed is that the publisher-specific files appear to be empty for all publishers here:
current/aggregated-publisher/fcdo/traceable_sum_commitments_and_disbursements_by_publisher_id.json

That also appears to be the case for the existing traceability file:
current/aggregated-publisher/fcdo/traceable_activities_by_publisher_id.json
UPDATE: sorry, I mistook the new count of activities with something quite different in the publishing statistics.

Does it seem as though this would be complicated to adjust, so that the amount for each publisher (from the big list of amounts for each publisher) is stated in each publisher's file?

Bjwebb added a commit that referenced this issue Nov 14, 2022
Bjwebb added a commit that referenced this issue Nov 14, 2022
Bjwebb added a commit that referenced this issue Nov 15, 2022
Bjwebb added a commit that referenced this issue Nov 15, 2022
Bjwebb added a commit to IATI/IATI-Stats that referenced this issue Jun 22, 2024
Bjwebb added a commit to IATI/IATI-Stats that referenced this issue Jun 22, 2024
codeforIATI#21

This was meant to exclude activities that were only referenced by
themselves from the denominator, but the calculation was at the
publisher level, where we don't know if other publishers will reference
them yet.
Bjwebb added a commit to IATI/IATI-Stats that referenced this issue Jun 22, 2024
Bjwebb added a commit to IATI/IATI-Stats that referenced this issue Jun 22, 2024
Bjwebb added a commit to IATI/IATI-Stats that referenced this issue Jun 22, 2024
Bjwebb added a commit to IATI/IATI-Stats that referenced this issue Jun 22, 2024