Directory-Summarizer: A Tool for Summarizing Conference Proceedings and Other Document Collections

A couple of weekends ago I put together a Python script that uses sumy and pdfminer to summarize all PDF, .docx (Word), and .txt files in a user-specified directory, including its sub-directories. It also lists (but doesn’t summarize) PowerPoint files (.ppt and .pptx). I had recently returned from the Intelligent Transportation Society of America’s Annual Meeting with a USB drive containing the conference proceedings. The problem is that the files are organized into folders only by session code (e.g., TS-3), and each session can have quite a diverse range of papers. I wanted a way to quickly scan the proceedings to identify items that might be worth my while to read, and I thought the tool might serve a similar purpose for others.
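To make the directory scan concrete, here is a minimal sketch of how files could be sorted into "summarize" and "list only" groups by extension. It is not the actual Directory-Summarizer code; the function name `classify_files` and the two constant names are hypothetical, and it uses only the standard library.

```python
import os

# Extensions the post says are summarized vs. only listed.
SUMMARIZABLE = {".pdf", ".docx", ".txt"}
LIST_ONLY = {".ppt", ".pptx"}

def classify_files(root):
    """Walk root and its sub-directories, sorting files by how they would be handled.

    Hypothetical helper for illustration; returns (to_summarize, to_list).
    """
    to_summarize, to_list = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Compare extensions case-insensitively so "PAPER.PDF" is still found.
            ext = os.path.splitext(name)[1].lower()
            path = os.path.join(dirpath, name)
            if ext in SUMMARIZABLE:
                to_summarize.append(path)
            elif ext in LIST_ONLY:
                to_list.append(path)
    return to_summarize, to_list
```

Calling `classify_files` on the top-level proceedings folder would return two lists of paths, one for each group; files with any other extension are ignored.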

The user may also specify how many sentences to include in each document’s summary, and which of the summarization algorithms included in sumy to use.

Summarizers generally attempt to determine which sentences in a document are most important for describing its content, and present those. They do not really understand a report, and can’t write a new abstract like a human could. So the sentences in the summary do not flow together, but they typically do capture the content of the document. In addition to the summary, I pull out the first line of each report, as this is often the title or the first part of the title.
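To illustrate the general idea of picking out "important" sentences, here is a toy word-frequency scorer in plain Python. It is a deliberately simple stand-in, not one of sumy's algorithms and not the code the tool uses; the names `first_line` and `summarize` are made up for this sketch, and the "ignore short words" rule is a crude assumption in place of a real stop-word list.

```python
import re
from collections import Counter

def first_line(text):
    """Return the first non-empty line, which is often the title."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""

def summarize(text, sentence_count=3):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    # Count only words longer than 3 letters (crude stop-word filter).
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence):
        # Average document-wide frequency of the words in this sentence.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:sentence_count])
    return [sentences[i] for i in keep]
```

Because the selected sentences are simply lifted from the text, the result reads as a list of disconnected statements rather than a flowing abstract, which matches the behavior described above.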

Here’s an example 6-sentence summary the tool produced for one of the papers in the proceedings, related to semi-automated platooning of trucks to reduce fuel consumption. I think it captures the scope of the paper:


This paper provides selected final results from Phase One, which is explored a range of technical and non-technical challenges, including assessing feasible real-world business models within the trucking industry.

Testing in past FHWA EAR research and by project partner Peloton has shown that, due to aerodynamic drafting effects, DATP has the potential to significantly reduce fuel use.

The premise of this research is that taking this technology to full commercialization requires a simpler technical approach (compared to fully automated platooning) which bridges from current trucking operations to DATP.

Data was taken in order to compare the relative distance measurements provided by Dynamic Based Real Time Kinematic (DRTK) and a Delphi automotive RADAR.

This particular road segment was chosen for the initial analysis due to its relatively low traffic volumes (resulting in a data set of manageable size) and limited ingress/egress points (allowing the consideration of trucks that remained on the corridor for an extended distance).

ATA Trucking Trends 2013) indicate that over-the-road operations, with an emphasis on truckload (TL) and line-haul less-than-truckload (LTL) sectors would experience the highest likelihood of encountering the desired DATP attributes.

File Path: E:TS01\2_14620_abstract_2183_0.pdf

The Directory-Summarizer can be used to generate summaries for any collection of documents stored in a master directory, and the code is available on GitHub.

P.S.: I understand that there is a Python port of tika that, once the bugs are out, could be dropped in so the summarizer could handle even more file types. Alternatively, the code could be modified to use a tika service instance to do the same. If anyone does that, let me know how it goes.



IFTTT adds “Maker” channel

Make Magazine has an article on the newly added “Maker” channel on IFTTT. For those who haven’t heard of it, IFTTT (If This, Then That) is a company that offers a web-based interface that lets users build simple control scripts for the Internet of Things and other web-based services. The scripts are all very simple: each defines a triggering event (the “if this” clause) and an action that should then be taken (the “then that”). It builds on the APIs that many services offer, and some scripts let you define parameters. So you can, for example, set up a “recipe” (as IFTTT calls them) that says “if someone tags me on Facebook, then turn my Philips Hue light red,” or “if the weather report for tomorrow includes rain, then send a tweet.”

Users who create recipes can share them with other IFTTT users. However, until now, the systems that could trigger recipes, or that could be sent action responses, were limited to pre-existing commercial services that IFTTT had set up interfaces with (e.g., Dropbox, Gmail, Amazon Echo, WeMo, etc.). While this list is growing, if you were a Maker and wanted to build your own triggering device, or have a custom device respond, you were out of luck. That’s changed now with the new Maker channel. You can use basically anything that can send or respond to RESTful URL calls to set up your own channels. Personally, I’ve found IFTTT to be a fascinating concept, but I’ve been underwhelmed by what I’ve seen so far in terms of scripts that would be useful to me, or even fun to play with. The new Maker channel may change that.
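As a rough sketch of what "sending a RESTful URL call" from your own device might look like, the snippet below builds (but does not send) an HTTP POST request using only the Python standard library. The URL pattern, the `value1` field, the event name, and the key are my assumptions about how a Maker trigger is addressed, not details from the article, so check IFTTT's own documentation before relying on them. The helper name `build_trigger_request` is hypothetical.

```python
import json
import urllib.request

def build_trigger_request(event, key, values=None):
    """Build a POST request that would fire a custom trigger event.

    Assumption: triggers are addressed as
    https://maker.ifttt.com/trigger/<event>/with/key/<key>
    with an optional JSON body. Verify against IFTTT's documentation.
    """
    url = f"https://maker.ifttt.com/trigger/{event}/with/key/{key}"
    body = json.dumps(values or {}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Example with placeholder values; nothing is sent until urlopen() is called.
request = build_trigger_request("door_opened", "MY_KEY", {"value1": "front door"})
# urllib.request.urlopen(request)  # uncomment to actually send the call
```

Keeping request construction separate from sending makes it easy to inspect the URL and payload on a workbench before wiring the call into a device.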


Quick tip: Excel’s “compare” function

Whatever our projects, we’re very likely to have data in an Excel spreadsheet at some point. When debugging from log data, it can sometimes be very useful to compare two log files to find differences. I just learned that Excel 2013 Professional has a built-in capability to do that. You need the “Inquire” tab, and then select “compare” from that. It’s an add-on, so search help for “inquire” to see how to add it to the ribbon.
