Develop
Welcome to the developer's section! Here you can find information about developing applications and tools that use our Document Identification Process (DIP) and/or our REST web service. Enjoy!
REST Web Service
Each record in our database can be thought of as having three parts: a document, a list of authors and a list of publishers.
The document has a title, a unique ID and, optionally, a DOI and some URLs.
Each of the document's authors has a first name, last name, unique ID and, optionally, some initials.
Each document can be published in multiple places (e.g. as a thesis and then later at a conference). Each of these separate publications has a name, a type and a unique ID, along with any available information about year, volume, issue etc.
You can send and receive data in either XML or JSON format using standard HTTP methods.
Here's an example of the XML generated for a document.
<result results="1">
  <document id="130030">
    <title>Watermarking is Not Cryptography</title>
    <doi>10.1007/11922841_1</doi>
    <publications>
      <publication>
        <publisher id="418">
          <name>Lecture Notes in Computer Science</name>
          <type>CONFERENCE</type>
        </publisher>
        <volume>4283</volume>
        <year>2006</year>
        <pages>1-15</pages>
      </publication>
    </publications>
    <urls>
      <url>http://www.adastral.ucl.ac.uk/~icox/papers/2006/IWDW2006.pdf</url>
    </urls>
    <authors>
      <author id="116009">
        <firstname>Gwenaël</firstname>
        <lastname>Doërr</lastname>
      </author>
      <author id="72871">
        <firstname>Ingemar</firstname>
        <lastname>Cox</lastname>
        <initials>J</initials>
      </author>
      <author id="222871">
        <firstname>Teddy</firstname>
        <lastname>Furon</lastname>
      </author>
    </authors>
    <md5>bb23febbb0e0e7e824b5d934817b9166</md5>
    <movingWindow>
      <hash>00028ade8d59c489981be74b352f98a5</hash>
      <hash>00037b0adf79079bbde986ddf5b249dd</hash>
      <hash>0003d0604e1eddf3cbc0a0bd9ee7ed29</hash>
      <hash>0006337e6423e6dccba1fb02726a73a9</hash>
      <hash>0009d81f79aa7c3ef4d7d98baec7b676</hash>
      <hash>000bed0598270e8fd8fb621d2ae57a8a</hash>
      <hash>000c579df13f6e4bc49075bce87fb493</hash>
      <hash>000de590184f26423edd693214add366</hash>
      <hash>000fc3ed88a4b72e919d25f8eb5b5231</hash>
      <hash>0013177afe2cfd7e652d8183ef594f1e</hash>
      <hash>0014656aa0101b69391a5eaba6b688f7</hash>
      <hash>0014ebab9b5388b699d789aeb55772d0</hash>
      <hash>0018d122593baaa62ba3fa70bdd1759a</hash>
      <hash>001942612b7f9ece46dac4dba1b94f34</hash>
      <hash>001a24d375fbb804b6f9458f9832952e</hash>
      <hash>001bc825d08048605f395f8329cb3360</hash>
      <hash>0021ffab046261876ef7370a7a45d847</hash>
      <hash>00263be9ef63cf79459b2cada2c89b92</hash>
      <hash>0026ac7abf681382e21ce06dd1ecbadf</hash>
      <hash>002cb827bc613852aced0b350a7c88e3</hash>
    </movingWindow>
  </document>
</result>
The Basics
At CiteSeeing we provide a REST web service and a public API for developers to use in order to query our database. The API provides fairly standard functionality as detailed below.
The CiteSeeing base URI is:
http://www.citeseeing.com/api/
From this base URI you can then add paths and parameters to obtain your desired outcome.
http://www.citeseeing.com/api/documents/130030
This returns the XML description of the document in our database with unique ID 130030, which is the same as the example XML above.
From the example above there are a few things to note:
<result results="1">
This is the root of the returned XML; you will always get this root no matter what the query. The "results" attribute tells you the total number of records in the database that matched your query. This is not necessarily the number of results returned, as there is a cap of 50 returned results per query.
<document id="130030">
Documents, authors and publishers have unique IDs. This means that a single author's or publisher's details can be linked to multiple documents and vice versa, and that changing one changes them all.
<md5> and <movingWindow>
These are the components of the fingerprint of the document. Each document has 20 hashed moving window fingerprints and one MD5 hash fingerprint. You can search for a document using either.
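To make the difference between matched and returned results concrete, here is a minimal sketch that reads both numbers from a response. The class and method names are hypothetical, and it assumes that every child element of the root is one returned record, as in the example above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of any CiteSeeing library.
final class ResultCounts
{
    /** Parses a response body and returns its root <result> element. */
    static Element parseRoot(String xml)
    {
        try
        {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        }
        catch (Exception ex)
        {
            throw new IllegalArgumentException("Response is not well-formed XML", ex);
        }
    }

    /** Total number of records in the database that matched the query. */
    static int totalMatches(Element root)
    {
        return Integer.parseInt(root.getAttribute("results"));
    }

    /** Number of records actually present in this response (at most 50). */
    static int returnedCount(Element root)
    {
        int count = 0;
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++)
        {
            // Skip whitespace text nodes; count only record elements
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE)
            {
                count++;
            }
        }
        return count;
    }
}
```

If totalMatches is larger than returnedCount, the query hit the cap and you may want to narrow it.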
Retrieving
You've seen an example of retrieving information above: you use standard HTTP GET requests. Start from the base URI, then add the type of record you're looking for, and then the ID. Depending on the type of record, you have some extra options. Here's the full retrieval API:
| Path | Parameters | Description |
| Documents | | |
| documents/{id} | - | Returns information on the document with this unique ID |
| documents/{id}/authors | - | Returns information on the authors of the document with this unique ID |
| documents/md5 | md5 | Returns information on all documents with this MD5 key |
| documents/movingwindow | hash x20 | Returns information on a document that matches the moving window keys |
| Authors | | |
| authors/{id} | - | Returns information on the author with this unique ID |
| authors/{id}/documents | - | Returns information on the documents that the author with this unique ID has written |
| authors/filter | firstname | Returns information on all authors that match these values. At least one parameter must be supplied. |
| authors/search | q | Returns information on all authors whose full name contains the supplied string |
| Publishers | | |
| publishers/{id} | - | Returns information on the publisher with this unique ID |
| publishers/{id}/documents | - | Returns information on the documents published by the publisher with this unique ID |
| publishers/types | - | Returns the possible types of publisher |
| publishers/search | q | Returns information on all publishers whose name contains the supplied string |
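As a rough illustration of the retrieval paths above, the following sketch builds a record URI and a search URI and fetches the response with a plain HttpURLConnection. The class and method names are hypothetical. Sending an Accept header to ask for XML is an assumption, since this page does not say how the response format is selected.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Hypothetical helper for illustration; not part of any CiteSeeing library.
final class Retrieve
{
    static final String BASE_URI = "http://www.citeseeing.com/api/";

    /** Builds e.g. documents/130030 or authors/72871 on top of the base URI. */
    static String recordUri(String type, long id)
    {
        return BASE_URI + type + "/" + id;
    }

    /** Builds authors/search?q=... or publishers/search?q=... with the query encoded. */
    static String searchUri(String type, String q)
    {
        try
        {
            return BASE_URI + type + "/search?q=" + URLEncoder.encode(q, "UTF-8");
        }
        catch (UnsupportedEncodingException ex)
        {
            throw new IllegalStateException("UTF-8 is always supported", ex);
        }
    }

    /** Issues a GET and returns the response body, or throws on a non-200 status. */
    static String get(String uri) throws IOException
    {
        HttpURLConnection conn = (HttpURLConnection) new URL(uri).openConnection();
        try
        {
            conn.setRequestMethod("GET");
            // Assumption: the service honours the Accept header
            conn.setRequestProperty("Accept", "application/xml");
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK)
            {
                throw new IOException("Unexpected HTTP status " + status);
            }
            try (InputStream in = conn.getInputStream())
            {
                ByteArrayOutputStream body = new ByteArrayOutputStream();
                byte[] buffer = new byte[4096];
                int n;
                while ((n = in.read(buffer)) != -1)
                {
                    body.write(buffer, 0, n);
                }
                return body.toString("UTF-8");
            }
        }
        finally
        {
            conn.disconnect();
        }
    }
}
```

For example, Retrieve.get(Retrieve.recordUri("documents", 130030)) would request the document used in the example XML.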
Creating
Creating authors and publishers is easy; creating document records, however, is slightly more involved.
In order to add a document to the database you must first know its fingerprints. Fingerprinting of a PDF can be done via our DIP.
Once you have this information, creating a record is as simple as POSTing an XML document or JSON object to the correct place.
Creating an Author Record
Below are blank XML and JSON templates for author creation. You must use MIME type application/xml or application/json for this to work.
You simply fill in the details and POST to:
http://www.citeseeing.com/api/authors
It is worth noting that an author must have a first and last name to be saved to the database; initials are optional.
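The following sketch shows one way to build and POST an author record. It assumes the creation body uses the same element names as the author XML returned by a GET (firstname, lastname, initials) without an id attribute; that shape is an inference, not a confirmed template. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of any CiteSeeing library.
final class CreateAuthor
{
    /** Escapes the characters that are not allowed in XML text content. */
    static String escape(String text)
    {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds the author body; first and last name are mandatory, initials optional. */
    static String authorXml(String firstName, String lastName, String initials)
    {
        if (firstName == null || firstName.isEmpty() || lastName == null || lastName.isEmpty())
        {
            throw new IllegalArgumentException("An author needs a first and a last name");
        }
        StringBuilder xml = new StringBuilder("<author>");
        xml.append("<firstname>").append(escape(firstName)).append("</firstname>");
        xml.append("<lastname>").append(escape(lastName)).append("</lastname>");
        if (initials != null && !initials.isEmpty())
        {
            xml.append("<initials>").append(escape(initials)).append("</initials>");
        }
        return xml.append("</author>").toString();
    }

    /** POSTs an XML body with MIME type application/xml and returns the HTTP status. */
    static int postXml(String uri, String xml) throws IOException
    {
        HttpURLConnection conn = (HttpURLConnection) new URL(uri).openConnection();
        try
        {
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/xml");
            try (OutputStream out = conn.getOutputStream())
            {
                out.write(xml.getBytes(StandardCharsets.UTF_8));
            }
            return conn.getResponseCode();
        }
        finally
        {
            conn.disconnect();
        }
    }
}
```

A call such as CreateAuthor.postXml("http://www.citeseeing.com/api/authors", CreateAuthor.authorXml("Ingemar", "Cox", "J")) would then submit the record.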
Creating a Publisher Record
Below are blank XML and JSON templates for publisher creation; it's just as easy as author creation. Again, you must use MIME type application/xml or application/json for this to work.
You simply fill in the details and POST to:
http://www.citeseeing.com/api/publisher
It is worth noting here that neither field is optional, and type must have a value that corresponds to a type returned from http://www.citeseeing.com/api/publishers/types
Creating a Document Record
This is the slightly trickier bit. Before creating a document record you must have that document's fingerprints, which consist of one MD5 hash and 20 hashed moving window fingerprints. How to get these fingerprints and what they are is explained in the DIP section.
As before, you must transfer the data using MIME type application/xml or application/json.
<document>
<title></title>
<doi></doi>
<authors>
<author id="">
</authors>
<publications>
<publication>
<publisher id="">
<volume></volume>
<issue></issue>
<year></year>
<month></month>
</publication>
</publications>
<md5></md5>
<movingWindow>
<hash></hash>
<hash></hash>
<hash></hash>
<hash></hash>
<hash></hash>
<hash></hash>
<hash></hash>
<hash></hash>
<hash></hash>
<hash></hash>
<hash></hash>
<hash></hash>
<hash></hash>
<hash></hash>
<hash></hash>
<hash></hash>
<hash></hash>
<hash></hash>
<hash></hash>
<hash></hash>
</movingWindow>
</document>
{
  "title": "",
  "doi": "",
  "authors": {"author": [{"@id": ""}]},
  "publications": {
    "publication": [{
      "publisher": {"@id": ""},
      "volume": "",
      "issue": "",
      "year": "",
      "month": ""
    }]
  },
  "md5": "",
  "movingWindow": {
    "hash": ["","","","","","","","","","","","","","","","","","","",""]
  }
}
Things to note about document record creation are:
Only title, md5 and movingWindow are mandatory and movingWindow must have 20 hashes.
You can add more authors and publications by simply adding more nodes (XML) or array objects (JSON). You can have no authors or publications by leaving out the publications and/or authors sections completely.
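The rules above can be sketched as a small builder that fills in the XML template and checks the mandatory fields before anything is sent. The class and method names are hypothetical, and publications and the DOI are left out to keep the example short.

```java
import java.util.List;

// Hypothetical helper for illustration; not part of any CiteSeeing library.
final class DocumentRecord
{
    static final int HASH_COUNT = 20;

    /** Escapes the characters that are not allowed in XML text content. */
    static String escape(String text)
    {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /**
     * Builds a document body. title, md5 and exactly 20 hashes are mandatory;
     * authorIds may be null or empty, in which case the authors section is omitted.
     */
    static String documentXml(String title, String md5, List<String> hashes, List<Integer> authorIds)
    {
        if (title == null || title.isEmpty() || md5 == null || md5.isEmpty())
        {
            throw new IllegalArgumentException("title and md5 are mandatory");
        }
        if (hashes == null || hashes.size() != HASH_COUNT)
        {
            throw new IllegalArgumentException("movingWindow must have " + HASH_COUNT + " hashes");
        }
        StringBuilder xml = new StringBuilder("<document>");
        xml.append("<title>").append(escape(title)).append("</title>");
        if (authorIds != null && !authorIds.isEmpty())
        {
            xml.append("<authors>");
            for (int id : authorIds)
            {
                // Authors are referenced by their existing unique ID
                xml.append("<author id=\"").append(id).append("\"/>");
            }
            xml.append("</authors>");
        }
        xml.append("<md5>").append(md5).append("</md5>");
        xml.append("<movingWindow>");
        for (String hash : hashes)
        {
            xml.append("<hash>").append(hash).append("</hash>");
        }
        return xml.append("</movingWindow></document>").toString();
    }
}
```

Validating on the client side like this is a design choice rather than a requirement; it simply avoids a round trip that would end in an error response.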
Updating
Updating is very similar to creating: the data structures are the same, and only the POST address changes.
By far the easiest way of updating a record is to GET that record in whatever format you desire and then edit the details as you wish before POSTing it back. This saves hassle with creating the correct format.
Below is an example of how to update an author record. Use a similar approach for the other records.
Step 1 - GET current details
Simply save the returned data of a GET:
http://www.citeseeing.com/api/authors/72871
Gives:
<result results="1">
  <author id="72871">
    <firstname>Ingemar</firstname>
    <lastname>Cox</lastname>
    <initials>J</initials>
  </author>
</result>
Step 2 - Edit those details
Simply strip the root node (result) and change the other details as you need. Here the initials have been removed.
<author id="72871">
  <firstname>Ingemar</firstname>
  <lastname>Cox</lastname>
</author>
Step 3 - POST new details to same URI
POST the above XML to the URI below using MIME type application/xml.
http://www.citeseeing.com/api/authors/72871
If everything was successful you will receive an HTTP 200 (OK) response code.
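Step 2 can also be done programmatically. The sketch below strips the root node from a GET response and returns the enclosed record as a string, ready to be edited and POSTed back. The class and method names are hypothetical, and it uses only the standard Java XML APIs.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of any CiteSeeing library.
final class UpdateRecord
{
    /** Returns the first record inside the <result> root, serialized without the root. */
    static String stripRoot(String responseXml)
    {
        try
        {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(responseXml)));

            // Find the first element child of the root, skipping whitespace nodes
            Node record = doc.getDocumentElement().getFirstChild();
            while (record != null && record.getNodeType() != Node.ELEMENT_NODE)
            {
                record = record.getNextSibling();
            }
            if (record == null)
            {
                throw new IllegalArgumentException("No record inside the root node");
            }

            // Serialize just that element, without an XML declaration
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(record), new StreamResult(out));
            return out.toString();
        }
        catch (IllegalArgumentException ex)
        {
            throw ex;
        }
        catch (Exception ex)
        {
            throw new IllegalArgumentException("Could not process the response", ex);
        }
    }
}
```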
Errors
Occasionally, things go wrong. When they do (be it on our heads or yours) we will return one of the following error codes, depending on the situation.
| Code | Error | Description |
| 400 | Bad Request | The data you have provided is not correct. |
| 404 | Not Found | The ID you have provided does not link to a record. |
| 500 | Internal Server Error | Either the data you have provided could not be parsed or there has been an internal server error. |
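A client might translate these codes into messages along the following lines. This is only a sketch with hypothetical names; the descriptions are taken from the table above.

```java
// Hypothetical helper for illustration; not part of any CiteSeeing library.
final class ApiErrors
{
    /** Maps an HTTP status returned by the service to a short explanation. */
    static String describe(int status)
    {
        switch (status)
        {
            case 200:
                return "OK";
            case 400:
                return "Bad Request: the data provided is not correct";
            case 404:
                return "Not Found: the ID provided does not link to a record";
            case 500:
                // Per the table, a 500 can also mean the submitted data could not be parsed
                return "Internal Server Error: unparseable data or a server-side failure";
            default:
                return "Unexpected status " + status;
        }
    }

    /** True when the request itself should be checked before trying again. */
    static boolean shouldCheckRequest(int status)
    {
        return status == 400 || status == 404 || status == 500;
    }
}
```

Because a 500 may be caused by data that could not be parsed, it is worth re-checking the request body before assuming the fault is on the server.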
Document Identification Process (DIP)
Our DIP provides you with the tools required to extract information on a document from our database using nothing more than a PDF document. The DIP is a two-stage process: fingerprinting followed by querying. Fingerprinting creates a list of hashed strings that can uniquely identify the document; querying sends those strings to the database and extracts the correct information.
Fingerprinting
In order to create a fingerprint of a document you require our Fingerprint Library. This is a small Java library (packaged as a .jar) that takes in a PDF document and produces a fingerprint based on the hashing method you choose (the library's API calls this an encryption method).
There is an online JavaDoc available as well as one to download.
Currently, there are two encryption methods to choose from: complete MD5 hash and moving window hash.
The complete MD5 hash method takes the text of a document and runs the MD5 algorithm on it, returning a 32-character hex string.
The moving window hash method takes the text of a document and runs a moving window fingerprinting algorithm on it, returning twenty 32-character hex strings.
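To show what a 32-character hex MD5 string looks like in practice, here is a minimal sketch using the standard java.security.MessageDigest class. It is not the Fingerprint Library's implementation: how the library extracts and normalises the PDF text is not described here, so this sketch will not necessarily reproduce the library's fingerprints. The class name is hypothetical and UTF-8 encoding is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not the Fingerprint Library's code.
final class Md5Hex
{
    /** Returns the MD5 digest of the text as a 32-character lowercase hex string. */
    static String md5Hex(String text)
    {
        try
        {
            // Assumption: the text is hashed as UTF-8 bytes
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(32);
            for (byte b : digest)
            {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
        catch (NoSuchAlgorithmException ex)
        {
            throw new IllegalStateException("MD5 is required on every Java platform", ex);
        }
    }
}
```

For instance, Md5Hex.md5Hex("abc") returns 900150983cd24fb0d6963f7d28e17f72.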
Below is an example of how to take a PDF File object and extract the desired fingerprints. As you can see it's really quite easy.
package com.citeseeing.flexample;

import java.io.File;

import com.citeseeing.fingerprint.Encrypter;
import com.citeseeing.fingerprint.Encryption;
import com.citeseeing.fingerprint.EncryptionFactory;
import com.citeseeing.fingerprint.exception.EncrypterException;
import com.citeseeing.fingerprint.extractor.Extractor;
import com.citeseeing.fingerprint.extractor.ExtractorFactory;
import com.citeseeing.fingerprint.extractor.exception.ExtractorException;

public class Fingerprint
{
    // Returns the fingerprint of the PDF, or an empty array if it could not be computed
    public static String[] getFingerprint(File pdf)
    {
        try
        {
            // For the complete MD5 hash, load Encryption.MD5 instead:
            // Encrypter encrypter = EncryptionFactory.loadEncrypter(Encryption.MD5);
            Encrypter encrypter = EncryptionFactory.loadEncrypter(Encryption.MovingWindow);
            Extractor extractor = ExtractorFactory.loadExtractor();
            // Extract the text from the PDF, then hash it into the fingerprint
            return encrypter.encrypt(extractor.extract(pdf));
        }
        catch (ExtractorException ex)
        {
            // Handle exception
        }
        catch (EncrypterException ex)
        {
            // Handle exception
        }
        return new String[0];
    }
}
The complete MD5 method considers all of the text in the document when it computes the fingerprint, which makes it very sensitive to changes. We wanted a way to match documents that were, for all intents and purposes, the same, ignoring any minor changes (things like a date stamp or a watermark), so we created the moving window method as a more robust way of fingerprinting.
Our moving window algorithm closely follows Nevin Heintze's paper, Scalable Document Fingerprinting. It's a very interesting read, if you're into this kind of thing.
The moving window hash method returns a list of twenty 32-character hex strings. If the document is small enough that fewer than 20 hashes will suffice, a default is substituted.
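To give a feel for why a windowed fingerprint tolerates small edits, here is a deliberately simplified sketch. It is not the Fingerprint Library's algorithm: the window length of 50 characters, the rule of keeping the 20 smallest hashes, and the all-zero padding value are all assumptions made for illustration. The idea is that each hash depends only on a short stretch of text, so a local change (a date stamp, say) alters only the windows that overlap it.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Simplified illustration only; not the Fingerprint Library's algorithm.
final class MovingWindowSketch
{
    static final int WINDOW = 50;  // assumed window length in characters
    static final int COUNT = 20;   // number of hashes in a fingerprint
    // Hypothetical padding value; the library's real default is not documented here
    static final String DEFAULT_HASH = "00000000000000000000000000000000";

    static String md5Hex(String text)
    {
        try
        {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(32);
            for (byte b : digest)
            {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
        catch (NoSuchAlgorithmException ex)
        {
            throw new IllegalStateException("MD5 is required on every Java platform", ex);
        }
    }

    /** Hashes every window of the text and keeps the COUNT smallest distinct hashes. */
    static String[] fingerprint(String text)
    {
        TreeSet<String> smallest = new TreeSet<>();
        for (int start = 0; start + WINDOW <= text.length(); start++)
        {
            smallest.add(md5Hex(text.substring(start, start + WINDOW)));
            if (smallest.size() > COUNT)
            {
                smallest.pollLast(); // drop the largest so only COUNT remain
            }
        }
        List<String> result = new ArrayList<>(smallest);
        while (result.size() < COUNT)
        {
            result.add(DEFAULT_HASH); // pad short documents up to COUNT entries
        }
        return result.toArray(new String[0]);
    }
}
```

Selecting a fixed subset of the window hashes keeps the fingerprint a constant size regardless of document length, which is what makes it practical to store and query.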
Querying
Once you've got your fingerprint you now need to get the information. This is easy to do, especially if you've read the section above on the REST web service.
Here's how you do it:
package com.citeseeing.flexample;

import com.sun.jersey.api.client.Client;
import com.sun.jersey.api.client.ClientResponse;
import com.sun.jersey.api.client.WebResource;
import com.sun.jersey.core.util.MultivaluedMapImpl;
import java.io.IOException;
import java.net.MalformedURLException;
import javax.ws.rs.core.MultivaluedMap;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class Query
{
    // Returns the parsed XML response, or null if the query or the parsing failed
    public static Document getDetails(String[] fingerprint)
    {
        String url = "http://www.citeseeing.com/api/documents/";
        // Parameter names as listed in the retrieval table: md5 or hash
        String paramName;
        if (fingerprint.length == 1)
        {
            // A single hash is a complete MD5 fingerprint
            url += "md5";
            paramName = "md5";
        }
        else
        {
            // Otherwise it is a 20-hash moving window fingerprint
            url += "movingwindow";
            paramName = "hash";
        }
        Client client = Client.create();
        WebResource webResource = client.resource(url);
        final MultivaluedMap<String, String> queryParams = new MultivaluedMapImpl();
        for (String hash : fingerprint)
        {
            queryParams.add(paramName, hash);
        }
        webResource = webResource.queryParams(queryParams);
        ClientResponse response = webResource.get(ClientResponse.class);
        try
        {
            if (response.getStatus() == 200)
            {
                DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
                DocumentBuilder db = dbf.newDocumentBuilder();
                return db.parse(response.getEntityInputStream());
            }
        }
        catch (SAXException ex)
        {
            // Handle Exception
        }
        catch (ParserConfigurationException ex)
        {
            // Handle Exception
        }
        catch (MalformedURLException ex)
        {
            // Handle Exception
        }
        catch (IOException ex)
        {
            // Handle Exception
        }
        return null;
    }
}
You'll notice that we use the Jersey Client and WebResource objects. These make things simpler and easy on the eye but you can use a regular URLConnection instead.
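For readers who prefer to avoid the Jersey dependency, here is a rough equivalent built on HttpURLConnection. The class and method names are hypothetical. The query parameter names (md5 for a single hash, hash for the moving window) follow the retrieval table in the REST section.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration; not part of any CiteSeeing library.
final class UrlConnectionQuery
{
    static final String DOCUMENTS_URI = "http://www.citeseeing.com/api/documents/";

    /** Builds the query URI for a 1-hash (MD5) or 20-hash (moving window) fingerprint. */
    static String queryUri(String[] fingerprint)
    {
        boolean isMd5 = fingerprint.length == 1;
        String paramName = isMd5 ? "md5" : "hash";
        StringBuilder uri = new StringBuilder(DOCUMENTS_URI)
                .append(isMd5 ? "md5" : "movingwindow");
        char separator = '?';
        for (String hash : fingerprint)
        {
            // Hex strings contain only [0-9a-f], so no URL encoding is needed
            uri.append(separator).append(paramName).append('=').append(hash);
            separator = '&';
        }
        return uri.toString();
    }

    /** Runs the query and parses the XML response; returns null on a non-200 status. */
    static Document getDetails(String[] fingerprint)
            throws IOException, SAXException, ParserConfigurationException
    {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(queryUri(fingerprint)).openConnection();
        try
        {
            if (conn.getResponseCode() != HttpURLConnection.HTTP_OK)
            {
                return null;
            }
            try (InputStream in = conn.getInputStream())
            {
                return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
            }
        }
        finally
        {
            conn.disconnect();
        }
    }
}
```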

