Sequences from USPTO Grants
The directory below contains FASTA format files of:
- nucleotide sequences (in the nt.* files); and
- amino acid sequences (in the aa.* files)
extracted from US patent grants.
GenBank provides the patnt and pataa databases of sequences from patents from
various jurisdictions at
ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA.
We have:
- filtered patnt and pataa to produce a set of sequences from the US grants;
- extracted the sequences from the US grant bulk sequence listings;
- extracted the sequences from US patent documents from 2002 onward (when data was first available in XML); and
- produced the union of the sequences from the three steps above in the aa.fsa.gz and nt.fsa.gz files.
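The union in the last step can be pictured with the following minimal Python sketch. It is not the software used to build these files. It assumes that two records are the same sequence when their deflines match exactly, which may differ from the rule actually applied, and the names read_fasta and union_fasta and the deflines in the usage note are hypothetical.

```python
from typing import Iterable, Iterator, Set, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (defline, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new defline closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def union_fasta(*sources: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield each record once, in first-seen order.

    Assumption: records are duplicates when their deflines are identical.
    """
    seen: Set[str] = set()
    for source in sources:
        for header, sequence in read_fasta(source):
            if header not in seen:
                seen.add(header)
                yield header, sequence
```

Each source can be any iterable of text lines, for example a file opened with gzip.open(path, "rt"). Given one source holding records "US1_1" and "US2_1" and another holding "US2_1" and "US3_1", the sketch yields "US1_1", "US2_1" and "US3_1" once each.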
The *-inClaims.fsa.gz files contain the subset of the sequences that are referenced in the claims.
Acknowledgement
Significant funding for this software and collection of these data was provided by the Ministry of Foreign Affairs of Norway through the International Rice Research Institute for CAMBIA's Patent Lens (the OS4 Initiative: Open Source, Open Science, Open Society, Oryza sativa).
Citation
For any derived data products that were produced using the original data set, the user should properly cite the data in any publication or in the metadata, in the following form:
Bacon N, Ashton D, Jefferson RA, Connett MB (2006) Biological sequences named and claimed in US patents and patent applications, CAMBIA Patent Lens OS4 Initiative, http://www.patentlens.net/daisy/patentlens/2205.html.
It is not ethical to publish data without proper attribution or co-authorship and acknowledgement of the ideas and funding. Compilation of this dataset required intellectual, financial and time investment in the conception, preparation and collection of data. Co-authorship in the publication of descriptive or interpretive results derived directly from the data is the privilege and responsibility of these investigators.
Notification
To assist with the furtherance of the public good mission served by the funder and sponsoring organization, we request that users notify the originators of the data set when any derivative work or publication based on or derived from the data set is distributed.
Note: Everything from the first line starting with ">US" onward consists of sequences that we extracted from patent data and that do not appear in GenBank. You can extract these sequences with:
gunzip < aa.fsa.gz | sed -n '/^>US/,$p' > aa-notInGenBank.fsa
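Where sed is not available, the same filter can be sketched in Python. This is an illustrative equivalent of the command above, not part of the released software. The function name extract_from_first_us is hypothetical.

```python
import gzip


def extract_from_first_us(src_path: str, dst_path: str) -> int:
    """Copy every line from the first '>US' defline to the end of the file.

    Mirrors `sed -n '/^>US/,$p'`: once the first matching line is seen,
    all remaining lines are copied, whatever their deflines are.
    Returns the number of lines written.
    """
    written = 0
    copying = False
    with gzip.open(src_path, "rt") as src, open(dst_path, "w") as dst:
        for line in src:
            if not copying and line.startswith(">US"):
                copying = True
            if copying:
                dst.write(line)
                written += 1
    return written
```

For example, extract_from_first_us("aa.fsa.gz", "aa-notInGenBank.fsa") streams the compressed file line by line, so the whole data set never has to fit in memory.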
The software used to generate this data is available under the GPL. Comments and suggestions for improvements are welcome at webmaster@cambia.org
Caveats
- This data is provided without any guarantee of correctness. Some of the problems in its production are described in the following points. Please help us improve it by notifying us of any shortcomings and suggestions for improvements. Please search the log files provided in the logs directory for errors and warnings pertaining to the processing of any patent of particular interest.
- Sequences were extracted from bulk sequence listings. The data quality of US sequence listings is generally high; however, the USPTO does not subject sequence listings to rigorous validation. Some sequence listings have errors ranging from minor formatting errors to gross data corruption.
- We parse the claims text for lists of SEQ ID NOS to determine which sequences are referenced in the claims. A number of issues arise in this:
  - the formatting of lists of SEQ ID NOS is not standardized and many special cases need to be handled, e.g. "SEQ ID NO:2 - 5'-GTG CCG GGG TCT TCG G-3'" refers only to SEQ ID NO: 2 and not to SEQ ID NOS: 2, 3, 4 and 5;
  - some patent text XML data is produced by Optical Character Recognition (OCR) software from scanned images of paper documents. Although the data quality is good, some characters are misread, e.g. upper case "O" and digit "0" are likely to be confused;
  - there may be typing errors;
  - sometimes nucleotide ranges are interspersed in a list of SEQ ID NOS and it's difficult to distinguish between them;
  - "923,989" may mean a single large SEQ ID NO or two smaller ones.
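To make the claim-parsing difficulties above concrete, here is a minimal Python sketch of one possible heuristic. It is not the parser used to produce the *-inClaims files. The regular expressions, the function names seq_ids_in_claim and comma_readings, and the rule of rejecting a number followed by a prime mark are all assumptions made for illustration. It does not attempt to handle OCR errors, typing errors or interspersed nucleotide ranges.

```python
import re
from typing import List

# "SEQ ID NO", "SEQ ID NOS", optionally followed by "." and/or ":".
PREFIX = re.compile(r"SEQ\s+ID\s+NOS?\.?\s*:?\s*", re.IGNORECASE)
# A number, optionally extended to a range with "-" or "to". The lookaheads
# reject a number followed by an ASCII prime mark (as in "5'-GTG...") and
# stop the regex from backtracking into part of a longer number.
ITEM = re.compile(r"(\d+)(?!['\d])(?:\s*(?:-|to)\s*(\d+)(?!['\d]))?")
# List separators, accepted only when another number follows.
SEPARATOR = re.compile(r"\s*(?:,\s*(?:and|or)?|and|or)\s*(?=\d)")


def seq_ids_in_claim(text: str) -> List[int]:
    """Return the sorted SEQ ID numbers referenced in a claim (heuristic)."""
    found = set()
    for prefix in PREFIX.finditer(text):
        pos = prefix.end()
        while True:
            item = ITEM.match(text, pos)
            if item is None:
                break
            start = int(item.group(1))
            found.add(start)
            if item.group(2) is not None:
                # Expand "2-5" to 2, 3, 4, 5; a descending range adds nothing.
                found.update(range(start, int(item.group(2)) + 1))
            separator = SEPARATOR.match(text, item.end())
            if separator is None:
                break
            pos = separator.end()
    return sorted(found)


def comma_readings(token: str) -> List[List[int]]:
    """List the possible readings of a comma-separated digit string.

    The list reading is always returned. The single-number reading is added
    only when the commas could be thousands separators.
    """
    parts = token.split(",")
    readings = [[int(part) for part in parts]]
    looks_like_thousands = (
        len(parts) > 1
        and 1 <= len(parts[0]) <= 3
        and all(len(part) == 3 for part in parts[1:])
    )
    if looks_like_thousands:
        readings.append([int("".join(parts))])
    return readings
```

With this sketch, seq_ids_in_claim("SEQ ID NO:2 - 5'-GTG CCG GGG TCT TCG G-3'") returns [2], while seq_ids_in_claim("SEQ ID NOS: 2-5") returns [2, 3, 4, 5]. Because seq_ids_in_claim always treats a comma as a list separator, comma_readings("923,989") is shown separately: it returns [[923, 989], [923989]], which leaves the choice between the two readings to the caller.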


