In the latest update, a new evaluation package has been implemented. The idea is as follows:
First, take a single paper by Prof. Hai Zhuge, cited as [Zhuge, Hai. "Dimensionality on summarization." arXiv preprint arXiv:1507.00209 (2015)]. I manually extracted a MindMap from the pure text of the paper by going through the whole paper. You can also visualize this MindMap HERE (Figure 4 for a thumbnail).
To use the keyword MindMap, I wrote it in a pure-text format with a very simple syntax that is easy for a human to write. Please see our keyword tree source text. Basically, you write entries like this:
"nodename":"childrennode name", or a recursive definition going like this:
Then, I wrote a parser to extract a tree network from the source text, which can be serialized in JSON format for use by computer programs. The new source code can be found in our Source Code Project HERE, in the /GraphEngine subdirectory.
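As a rough illustration (not the project's actual parser), the `"parent":"child"` syntax above can be read line by line into an adjacency map and dumped as JSON; the node names below are made up:

```python
import json
from collections import defaultdict

def parse_keytree(text):
    """Parse lines of the form "parent":"child", into an adjacency map."""
    tree = defaultdict(list)
    for line in text.splitlines():
        line = line.strip().rstrip(",")
        if not line:
            continue
        parent, _, child = line.partition(":")
        tree[parent.strip('"')].append(child.strip('"'))
    return dict(tree)

source = '''"summarization":"dimensionality",
"summarization":"citation",
"dimensionality":"semantic space",'''
tree = parse_keytree(source)
print(json.dumps(tree, indent=2))  # JSON form for downstream programs
```

A repeated parent simply accumulates children, which is how a tree with multiple branches per node is expressed in this flat syntax.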
Third, we need to assign scores to the MindMap nodes. To this end, I use a PageRank method to rank the tree nodes and assign them scores. The score reflects each internal node's weight in terms of its connectivity and routing position in the tree. For example, the root node has the highest score, and internal nodes with larger sub-tree structures have higher scores.
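A minimal sketch of this scoring step, assuming a standard PageRank iteration over the tree treated as an undirected graph (the node names are illustrative, not from the paper's MindMap):

```python
# Minimal PageRank over an undirected MindMap tree with uniform teleportation.
def pagerank(edges, damping=0.85, iters=100):
    # build an undirected adjacency map so weight can flow back toward the root
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    nodes = list(adj)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            # each neighbor m splits its score evenly among its own neighbors
            flow = sum(score[m] / len(adj[m]) for m in adj[n])
            new[n] = (1 - damping) / len(nodes) + damping * flow
        score = new
    return score

# illustrative tree: the root is the best-connected node
edges = [("root", "intro"), ("root", "method"), ("root", "conclusion"),
         ("root", "related work"), ("method", "pagerank"), ("method", "cosine")]
scores = pagerank(edges)
assert max(scores, key=scores.get) == "root"
```

Because the walk is undirected, well-connected internal nodes (and especially the root) accumulate the most weight, matching the behavior described above.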
Finally, I compare each sentence of a summary against this MindMap tree using cosine similarity weighted by the tree scores. That is, for each sentence, a cosine score is computed against each node of the tree and multiplied by that node's weight, so each sentence obtains a set of scores. I then use summation, average, or maximum to select one score per sentence. Then, for each document, I again use summation, average, or maximum over its several sentence scores to compute a single final score for the document.
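The two-level scoring above can be sketched as follows. This is a simplified bag-of-words version, and the node weights and sentences are invented for illustration:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def score_sentence(sentence, node_weights, reduce=max):
    """Weighted cosine of one sentence against every tree node, reduced
    to one score (sum/avg/max are the choices described above)."""
    sv = Counter(w.strip(".,") for w in sentence.lower().split())
    scores = [w * cosine(sv, Counter(node.lower().split()))
              for node, w in node_weights.items()]
    return reduce(scores)

# illustrative node weights, e.g. from PageRank over the MindMap
node_weights = {"summarization": 0.4, "dimensionality reduction": 0.3}
sentences = ["We study dimensionality in summarization.",
             "Experiments use a citation graph."]
per_sentence = [score_sentence(s, node_weights) for s in sentences]
doc_score = sum(per_sentence) / len(per_sentence)  # average aggregation
```

Swapping `reduce=max` for `sum` or a mean gives the other per-sentence aggregations, and the same choice is repeated at the document level.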
Since there are ten methods, each producing ten to fifteen sentences, I obtain one score per document per model, which is finally listed in Figure 1:
In this test, each model is run on two paper variants: "no_abs" and "yes_abs". In "no_abs", sentences from the original abstract and conclusion are excluded from the summarization text, so no extraction model can select them. In "yes_abs", the abstract and conclusion sentences are kept and can be selected by the summarization models, if a model deems them important. Basically, you will find that this matches the ROUGE scores of the summarization texts on the same paper in Figure 2.
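The "no_abs" filtering can be sketched as a simple exclusion of any candidate sentence that also appears in the abstract or conclusion; the function and field names here are assumptions, not the project's actual code:

```python
# Hypothetical sketch of the "no_abs" setting: drop candidate sentences
# that also appear in the abstract or conclusion sections.
def filter_no_abs(sentences, abstract_sents, conclusion_sents):
    excluded = set(abstract_sents) | set(conclusion_sents)
    return [s for s in sentences if s not in excluded]

doc = ["This paper studies X.", "We propose Y.", "In conclusion, Y works."]
no_abs = filter_no_abs(doc,
                       abstract_sents=["This paper studies X."],
                       conclusion_sents=["In conclusion, Y works."])
# only the body sentence survives, so no model can select abstract text
```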
For usage, please download the Single Paper Data here:
For the single-paper result, please download the Result here:
In this work, we test summarization results of different lengths. That is, we extract the top-1, top-3, top-5, top-7, top-9, top-11, and top-13 sentences from each model's output as the final summary for ROUGE-1 evaluation, to see how the number of extracted sentences affects the ROUGE-1 score.
We test the Subparasec model and the Simgraph model with these different summary lengths.
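To make the precision/recall behavior concrete, here is a simplified unigram-overlap version of ROUGE-1 applied to top-k cutoffs (the actual evaluation uses a full ROUGE toolkit; the sentences below are invented):

```python
from collections import Counter

def rouge1(candidate, reference):
    """Simplified ROUGE-1: unigram-overlap precision and recall."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())  # clipped unigram matches
    return overlap / sum(c.values()), overlap / sum(r.values())

# illustrative ranked sentences and reference summary
ranked = ["dimensionality on summarization",
          "we rank the sentences of each document",
          "sentences are ranked by similarity"]
reference = "dimensionality on summarization of documents"

results = {}
for k in (1, 3):
    p, r = rouge1(" ".join(ranked[:k]), reference)
    results[k] = (p, r)
# shorter summaries tend toward higher precision, longer ones toward higher recall
```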
Figure 5 shows the results for different summary lengths. In the figure, sp1 is the top-1 result, sp3 is the top-3 result, etc. It shows that:
When fewer sentences are extracted, the precision score is higher and the recall rate is lower.
When more sentences are extracted, the precision score is lower and the recall rate is higher.
This is simply because fewer sentences cover fewer keywords and topics of the benchmark paper, yielding a lower recall rate.
Note that top-1 has a very high precision score of 0.8 because the first sentence selected by the Subparasec model is "Dimensionality on Summarization", the title of the paper.
The smooth decrease in precision and increase in recall indicate that the model covers the reference increasingly well as more sentences are included.
Figure 6 shows the top-1 to top-13 sentences from the Simgraph model, and the result is a little different. Top-1 has a very low precision but a higher recall rate, indicating that its first sentence is not as precise as the Subparasec model's first sentence, or even as its own second sentence.
The overall performance of the Simgraph model is lower than that of the Subparasec model.
The ranked sentence data is available HERE.
The result collection is available HERE.
The source code for this top-k evaluation is updated in our Git repository HERE. The updated source files include:
sxpDoPyrougeScore.py: calls the top-k model evaluations and the keytree matching packages.
sxpRougeConfig.py: two new configuration dicts are added, and the idname dict is extended with new model IDs.
./context/controller.py: the run_one_rankmodel() function is extended to run two top-k evaluations. We use a string regex to detect which model to run and to extract the top-k number as the parameter for the top-k ranking.
Note that in isSimLen() we do not rerun the ranking of the Simgraph model because it is too time-consuming. Instead, we directly reuse the allsent data stored from a previous Simgraph run to speed up the test. For this, we add two features to confdict that tell which system and file to borrow.
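The regex-based model detection mentioned above might look something like the following; the exact id patterns are an assumption (only the "sp1"/"sp3" naming appears earlier, and "simlen" is guessed from isSimLen()):

```python
import re

# Hypothetical sketch: model ids like "sp13" or "simlen5" carry the
# top-k number in their numeric suffix (the real ids in sxpRougeConfig.py
# may differ).
def parse_topk_model(model_id):
    m = re.match(r"(sp|simlen)(\d+)$", model_id)
    if not m:
        return None, None  # not a top-k model id
    return m.group(1), int(m.group(2))

name, k = parse_topk_model("sp13")  # -> ("sp", 13)
```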
The following contains the data and source code for Ranking Sentences for Single Document Summarization by Extracting Structure Links from the Document:
The document collection is available HERE.
The result collection is available HERE.
The source code for parsing document is available HERE.
The source code for ranking sentences of document is available HERE.