-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathTODO
294 lines (292 loc) · 16.8 KB
/
TODO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
== Run ==
* @statistic.maxdiag [+ announcement]
* @statistic.triangles
* @statistic.conflict [+ announcement]
* run analysis for src/algebraic-conflict/
* @network family dimacs10-
* @ft.degree
* @statistic.fconflict [+ announcement]
* @all (because last time, Matlab was broken)
* @statistic.nonbipal [+ announcement]
* @network.soc-sign-bitcoin(alpha otc)
* @statistic.controllability [+ announcement]
* @time_histogram
* @statistic.mediandist [+ announcement]
* @distr.sym
** those for large networks are often empty
* @statistic.separation [+ announcement]
* @distr.sym-n
* @statistic.tconflict [+ announcement]
* @distr.lap
* @hopdistr
* @degree
* @rating_evolution2 # + accouncement
* @statistic.(squares tour4)
* @statistic.inoutassort
* @inter [+ announcement]
* @inter2 [+ announcement]
* @decomposition.seidel [+announcement]
* @statistic.seidelnorm [+announcement]
== Easy new features ==
* new statistic: maximal egde multiplicity
* for rating networks: mean rating, i.e., mean edge weight
* statistic: [prefatt] pref. att. exponent
* Second largest eigenvalue of N: spectral gap, as per Martin Gueuning
* new node statistic: k-core number (largest k for which a node has
degree >0 in the k-core). (as measure of centrality). Plot the
corresponding distribution, the k-core distribution.
(matlab_bgl function core_numbers.m)
* Ratios: maxdiag/opnorm, maxdiag/snorm: they are measures of the "cyclicness" of a directed graph
* The spectral radius divided by the spectral norm (a measure of the [a]cyclic-ness?)
* norm. alg. conflict; norm. alg. connectivity; norm. alg. non-bipartivity (related to [nonbipn])
* left and right size of [coco]
** have to actually generate it ...
** Need to write a new C program size.c that is *not* only called for
the simple~* variant, but for the actual graph. Or better, write a
lcc.c program, that does it directly
* temporal histogram: also plot in modulo YEAR, WEEK and DAY.
* new plot type: all standard statistics in function of time (we have
to exclude statistics that are themselves based on temporal features)
* mediandist: actually compute it from hyperanf
* hopdistr.b: Y axis labels: write there the logit-transformed values.
* laplacian of signed graph WITHOUT absolute value. Models real
repulsion, better than the absolute value, which models antipodal
attraction. This is used in [357] to draw signed graphs.
** decomposition: L= D-A without absolute value
** Is there always an eigenvalue zero with constant eigenvector?
** Eigenvalues may be negative even if every node has positive
sum-of-weights.
** This is NOT the same as LAPQ (there is no implementation yet)
* statistic: median path length: compute is properly (as the rounded-up 50-percentile effective diameter)
* implement in C or as function of other statistics: fill, average degree, inoutassort, etc.
* normalized version of the algebraic conflict: derive as a problem of
assigning a number to each node such that the number is equal/opposite
when two nodes are connected, and the algebraic conflict equals the sum
of squared differences. Normalize by the number of edges, and such that
the mean square of numbers is one.
\xi' = \xi * n / 2m = \xi / d
(the exact definition is in projects/konect/presentations/komepol.opd)
* scatter over all nodes of one network: degree vs number of adjacent
triangles. There should be a power-law slope of a round 1.5, according
to [Fast Counting of Triangles in Large Real Networks:
Algorithms and Laws] p.8, fig 7.
* size distribution of connected components (directed and undirected)
* statistic: number of cnnected components
* implement all repulsive matrices
* statistic: revive [aredis]
* new count: number of 4-cliques
* new count: number of T = [1 2; 1 3; 2 3; 2 4; 3 4]. I.e., K_4 minus
one edge.
* new count: number of 4-paths
* new count: number of T = [1 2; 1 3; 2 3; 3 4]
* wegde coefficient = s / [m (n-2)]. I.e., the probability that an edge
and another node are completed by a second edge.
* directed networks count statistic: number of patterns of the form [1
2; 2 1], i.e. reciprocated edge paris.
* directed networks count statistic: number of incident edge pairs:
(1) pointed to the same node (2) pointing from the same node (3)
sequential.
* directed networks count statistic: Number of directed triangles, number of cyclic triangles
* https://marketplace.gephi.org/plugin/openord-layout/
* https://share.osf.io/registration/
* statistic: [hubiness], variance of the degree distribution
* directed Laplacian: L = D - A where A is asymmetric. Has complex spectrum.
* scatter: draw the QQ plot
* make sure the distinction between event-edges and connection-edges is
modeled accurately
* complex eigenvalues of P=D^-1A for directed networks. Largest is 1.
* average/median path length in function of network size
* Degeneracy: maximum k-core number
* Graphon plot: show the adjacency matrix with nodes sorted by U[L]_i,
and reduced to 100x100 blocks, in blsck and white
* Fourier transform of timestamps for temporal networks
** for each T, map timestamps to interval [0,1} via (mod T), and measure
the skewness of the resulting distribution. I.e., largest ratio
between one half of the interval to another (using any split). This
is a form of the Kolgomorov-Smirnov distance between the actual
distribution and the uniform distribution
** X axis: time scale, logarithmically from 1s to several years. Y
axis: skewness.
* plot: degree vs local clustering coefficient for all nodes of one network
* plot all degree distributions (for one class) on a single plot.
* c/ifub.c: at each step, use a timer-like progress indicator. Same
for triangles.c.
* @network.wikilens-ratings: reinsert 'pa' into PLOTS and fix this bug: no rule to build m/pa_compute_one.m,
needed by dat/pa.wikilens-ratings.mat
* reinsert 'inter' and 'inter2' into PLOTS, and fix this bug:
** @network.wikilens-ratings: command for @inter.wikilens-ratings failed with exit status 127
* @ft.degree: make it work for all networks. It fails for multiedge
networks at the moment, due to a missing implementation.
** When done, re-insert @ft.degree.$network into @all
* check.m: check that all tags used in each meta.* file are correct,
based on the current content of konect-handbook.tex
* check.m: Check that all datasets have a bibtex citation that is present in konect.bib
* meta.* of all datasets: remove all n3-* fields (write a unit tests
to check that only allowed fields are present in meta.*)
* run @check, and in particular dat/checkmeta
* Instead of have m/konect_{label,data}_statistic.m, store that
information as metadata in each m/konect_statistic_$statistic.m file
separately
* check.m: check that the category of the network is one of the allowed
categories (by comparing with a canonical source)
* NCP (network community plot)
* Use https://github.com/franktakes/teexgraph for the BoundingDiameter algorithm to compute the diameter
* dat/NETWORKS* : Use '.' instead of '_' for separation
* actual moments of the degree distribution
** Root average squared degree, etc.
* bidd, degree, cluscod: Put the variable in question on Y axis, to
have the "higher curve means higher value" effect.
* scatter all edge by the degree of their two connected nodes
* diam: first, remove all degree-1 node, except a single one neighbour for each non-degree-1 node.
* "directed diameter" in directed networks, i.e., longest shortest directed path
* @ft.degree: make it also work for networks with multiple edges without edge weights.
* a combination of [clusco] and [meandist] is a measure of "small-world"ness. See http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0002051
* write the "Timer" class for triangle counting
* statistic: median degree [mediandegree]: solve the C++/C thing
== Hard ==
* better graph visualization for the website
* scatter: add a plot variant that shows the correlation coefficient
* cluscod: show the mean and median, and overall mean clustering
coefficient on the plot
* symmlq/pcg/bigc(?) for lap decomposition
* animation of eigenvector embedding over time
* the spid of a network: given a distance distribution (aka hop
distribution), compute sigma^2/mu. [Four Degrees of Separation,
WebSci'12]
* new statistic: SPID (shortest-path index of dispersion) [708].
* comparison: use a scale in which the luminosity increases
continuously and the hue rotates around the colors.
* new centrality measure: C(i) = \sum_{j~i} d(i)^{-1}, where d(i) is the degree.
For directed networks, use the indegree and the out-neighbors.
* degree plots: have more and smaller ticks, even when no number is printed.
* closeness and betweenness centralities as node distributions
* New directed Laplacians: L_out = D_out - A ; L_in = D_in - A.
** Eigenvalues should be nonnegative according to r. braun
** Send results to rosemary braun <[email protected]>
* degree: also plot as semilogY (to visualize exponential distributions)
* for all decompositions: use callback instead of prepare_matrix for
computation (uses less memory)
* animated plots over time
* new distribution: normalized degree, i.e. sum(A D^-1, 2) or sum(D^-1 A, 1).
* new statistic: geometrical dimension of a network. Sum_k (λ_1/λ_k)
with Laplacian eigenvalues.
* check out scatter plots involving AREDIS. What was the argument again
why a shrinking AREDIS is better/worse than a shrinking DIAMETER.
* statistic time plots: horizontal grid lines
* inspect RADIUS of networks and add it as a statistic to the website
* decomposition.b.stoch2 : axes: Y_max should be 1.
* Compute the "diago" statistic for all diagonality tests. (see projects/latent-negative/m/diagonality.m)
* edge betweenness as a link prediction measure?
* new decomposition: adjacency matrix normalized by dividing by (degree
/ log(degree)), such that row sums equal the log node degrees
* nonnegative matrix decompositions
* decompositions with missing values [Y. Koren et al.; Matrix
factorization techniques for recommender systems]
* probabilistic models: LDA, PLSA, etc.
* distr: use luinc() instead of lu()
* distr: use dynamic bins
* normalize additively with modularity kernel: subtract (d_i d_j) / 2m
* denormalize multiplicatively: take prediction using normalizing
adjacency matrix, and multiply by the degree product afterwards
* path: for bipartite graphs, find equivalents of the local prediction
measures (i.e. number of common neighbors, Jaccard, etc.)
* find a variant of preferential attachment that works on weighted
graphs (and can be evaluated on them)
* generate lattice random networks (but smarter than just a square lattice, and should be parameter-free)
* http://coldattic.info/shvedsky/s/blogs/a-foo-walks-into-a-bar/posts/40
* degree.m: compute the degree distribution without building the full adjacency matrix
* mauc and map: make it much faster by using sparse() instead of find to find all node pairs of the form (i, *).
* implement http://arxiv.org/abs/1209.1270v1
* use http://www.mathworks.com/matlabcentral/fileexchange/16248 as data structure
* bipartite Laplacian, i.e. the Laplacian matrix in the bipartite double
cover of a directed graph (only ASYM). This is equivalent to the
split-complex Laplacian.
* Mean Reciprocal Rank as new error measure: sum of inverse ranks where a "1" is present.
* implement the 4-clustering coefficient for bipartite graphs
(references are in [bipartivity])
* implement the statistic of [680]
* gini vs jain: although the scatter plots indicate that the measures
correlate negatively (which corresponds to their definition), their time
evolution has them both increasing. In other words, "jain" should be
decreasing over time, but is increasing.
* edge multiplicity distributions: also the cumulated distribution
plots (show them on the website)
* statistic: variance of the degree distribution
* distribution of common neighbors of all connected nodes (Hans Akkermans)
* hop plot power law from [535]. Exponent of the first half (or third)
or the hop distribution.
* relation between the degree distribution (power law) and eigenvector
distribution (lognormal). Go from a power law to a lognormal by
convolutions (consider the number of paths of length 2). Go from
lognormal to power law by sums of lognormals (consider the eigenvalue
decomposition).
* compute both types of clustering coefficient: weighted by nodes, and
weighted by half-edges. (only one is used at the moment, but the second
one can be derived trivially from the clustering coefficient
distribution)
* relative clustering coefficient (relative to that of a random graph
with same size and volume) (i.e., divide by the fill)
* power law: measure distance to power law instead of statistical significance.
** measure power law exponent in function of degree (using a kernel?)
* compute the moments of a network. I.e., the number of cycles of
length N for N=1,2,3,4,.... The code for this is very similar to the
code for computing the clustering coefficient.
* statistic: network embeddedness
* new statistic: signed clustering coefficient
* eigl: try out eigs(..., 'sa', ...).
* implement the ProfFlow method from [670] for link prediction.
* decomposition_{full,time}: save the runtime of the decomposition
* decomposition: AA' + A'A = UΛU'
* decomposition: αAA' + (1-α)A'A = UΛU'
* statistic: diameter divided by radius
* more bipartite clustering coefficient measures
** Latapy et al in "Basic Notions for the Analysis of Large Two-mode Networks" [795]
** Lind, P.G., González, M.C., Herrmann, H.J., 2005. Cycles and clustering in bipartite networks. Physical Review E 72, 056127.
** (Opsahl 2012) from http://toreopsahl.com/tnet/two-mode-networks/clustering/; probably Opsahl, T., 2013. Triadic closure in two-mode networks: Redefining the global and local clustering coefficients. Social Networks 35, doi:10.1016/j.socnet.2011.07.001.
*** probability that a 4-path is completed by 2 edges to form a 6-cycle
** Robins, G., Alexander, M., 2004. Small worlds among interlocking Directors: Network structure and distance in bipartite graphs. Computational and Mathematical Organization Theory 10 (1), 69–94.
*** the probability that a 3-path is completed by an edge to form a 4-cycle
** Zhang, P., Wang, J., Li, X., Li, M., Di, Z., Fan, Y., 2008. Clustering coefficient and community structure of bipartite networks. Physica A 387, 6869–6875.
** [612]
* extend ANTICONFLICT to signed graphs, giving a statistic similar to
CONFLICT, but that is normalized by network size, and thus can be used
to compare networks. Check whether, in signed/rating networks that
change over time, the new measure grows or shrinks, validating or
invalidating balance theory. Cf. [797].
* compute the moments (number of cycles of length R) empirically from medium-sized networks.
* connected component distribution plot: for each size N, the
probability that a randomly chosen connected component of a network has
size ≥ N. On a log-log plot. Should exhibit a power-law for small N,
and a big outlier for the connected component [PEGASUS: A Peta-Scale
Graph Mining System - Implementation and Observations].
* weight power law exponent as a new statistic
* statistic only for directed acyclid graphs: length of the longest
path
* statistic: Soffer-Vasquez clustering coefficient and transitivity
$\tilde C$ and $\tilde T$ [b845].
* oddcycles: This is implemented wrongly for signed networks. It
should be based on the unsigned adjacency matrix, not on the signed
one. For instance wikisigned-k2 has a negative value of the statistic now, which
it should not have.
* maxdegree: at the moment does not into account multiple edges (except for POSITIVE networks in which they are aggregated), but should
* local sparsity plot: for each integer k, the average fill of the
k-ego-network, where the k-ego-network of a node is defined as the
network consisting of that node and its neighbors at distance at most
k. The first entry of the plot is the clustering coefficient; the last
entry is the fill of the total network.
* directed distance distribution
* [gp32] triangle closing coefficient: percentage of edges created by
triangles closing (as opposed to preferential attachment)
** Already implemented in briclo
* julia-like map of generated synthetic graphs on the two-dimensional map
* for transparent plots: http://www.mathworks.com/matlabcentral/fileexchange/23629-export-fig
* The harmonic mean distance between nodes:
ohttp://arxiv.org/abs/cond-mat/000835
* sg1: make the *_adj arrays have length n+1, where the last entry
always equals m, to simplify iteration over all edges.
* sg1: for SYM, have an additional variant where each edge is only
saved once (will make programs such as hyperanf and matrix
decompositions use half of memory).
* Can Kemeny's constant be calculated for large networks? Does it even make sense?
* Optimize diameter computation by: removing all degree-1 nodes except a single
one for each degree-2+ node that has at least one degree-1 neighbor.