Error Description
You received this error for one of two reasons:
- You uploaded a file in an unrecognized format.
- You tried to submit an analysis with a sample that contains a file type that is inappropriate for the workflow.
Solutions
Solution 1
If you got the file type error when trying to start an analysis, double-check the contents of your files to verify what type of data they contain. The analysis error message should indicate which file types the workflow accepts. See "Further information" below for the formats of common file types used by Basepair.
Solution 2
If you are sure your files contain the right type of data, rename them so they have an appropriate file suffix and re-upload them. See "Further information" below for accepted suffixes for common file types used by Basepair.
Solution 3 - contact us
If you still cannot resolve your issue, please don't hesitate to contact us by either:
- Creating a ticket here: http://support.basepairtech.com/support/tickets/new
- Messaging us using the chat icon in the lower right corner of your screen on the Basepair dashboard.
Further information
Basepair deals with various types of files. Usually the type of file is indicated by its suffix (e.g. ".csv" for a comma-separated values file). Because the files we deal with are often large, they are usually also compressed in some way. Here are some common suffixes that indicate a file is compressed:
- .gz
- .zip
- .tar.gz
- .bz2
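A suffix only suggests how a file is stored. As a minimal sketch (the function name detect_compression is a hypothetical helper for illustration, not part of Basepair), Python can compare a file's first bytes against the well-known signatures of gzip, zip, and bzip2:

```python
# Leading "magic" bytes of a few common compressed formats.
MAGIC_BYTES = {
    b"\x1f\x8b": "gzip",      # also the outer layer of .tar.gz files
    b"PK\x03\x04": "zip",
    b"BZh": "bzip2",
}


def detect_compression(path):
    """Return the format suggested by the file's first bytes, or None."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    for magic, name in MAGIC_BYTES.items():
        if head.startswith(magic):
            return name
    return None
```

If detect_compression returns None for a file named "reads.fastq.gz", the file is most likely not compressed despite its suffix.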
Unfortunately, we frequently see cases where the file suffix does not match the actual file type, or where the suffix indicates the file is compressed when it is not. Below we go over the common file types Basepair deals with:
FASTQ
This file type contains sequence reads and usually looks like this:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
There are four lines for each read:
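- Line 1: The name of the read, prefixed with an "@"
- Line 2: The actual read sequence in A's, T's, C's, and G's (with N's for bases that could not be called)
- Line 3: A "+" separator, optionally followed by the read name again
- Line 4: The per-base quality scores, one character for each base in line 2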
You can read more about FASTQ file format here: https://en.wikipedia.org/wiki/FASTQ_format
Common file suffixes for FASTQ files are:
- .fastq
- .fq
- And compressed variations like: .fq.gz, .fq.zip, .fastq.gz, etc.
The following are also sometimes used:
- .txt (and its compressed variations)
- .sra (which is a special compressed format from the Sequence Read Archive)
You can open FASTQ files in a text editor to check their contents, but be warned that FASTQ files are often very large.
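For files too large for a text editor, reading only the first record is usually enough. The sketch below is illustrative only: the helper names open_text and looks_like_fastq are hypothetical, it handles only plain and gzip-compressed files, and it checks just the first four lines against the layout described above.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file, judging by content, not suffix."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    if is_gzip:
        return gzip.open(path, "rt", errors="replace")
    return open(path, "rt", errors="replace")


def looks_like_fastq(path):
    """Check that the first four lines form one well-formed FASTQ record."""
    with open_text(path) as handle:
        record = [handle.readline().rstrip("\r\n") for _ in range(4)]
    name, sequence, separator, quality = record
    return (
        name.startswith("@")                # line 1: read name
        and len(sequence) > 0               # line 2: bases
        and separator.startswith("+")       # line 3: separator
        and len(quality) == len(sequence)   # line 4: one score per base
    )
```

A True result only means the first record is plausible; it does not validate the rest of the file.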
FASTA
This file type contains DNA, RNA, and protein sequences. Reference genomes and transcriptomes are some examples of data stored in FASTA files. An example is below:
>Sequence 1
ATCGATCGTATTTCTCTTTAAACGGTATGCTA
>Sequence 2
TTTTGCGCGTTCTTAGGCTTGCTATCTC
Each sequence has a header line followed by one or more sequence lines:
- Line 1: The sequence name prefixed with a ">"
- Following line(s): The actual sequence (whether it be DNA, RNA, or protein), which may be wrapped across several lines
You can read more about the FASTA format here: https://en.wikipedia.org/wiki/FASTA
Here are some common file suffixes for FASTA files:
- .fasta
- .fa
- And compressed variations like: .fasta.gz, .fa.zip, etc.
You can open FASTA files in a text editor to check their contents.
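As a rough alternative to reading the file by eye, the sketch below lists each sequence name with its length. The function name fasta_lengths is a hypothetical helper; it accepts any iterable of text lines (such as an open file handle) and assumes sequence names are unique.

```python
def fasta_lengths(lines):
    """Map each sequence name to its total length, joining wrapped sequence lines."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue                        # ignore blank lines
        if line.startswith(">"):
            current = line[1:]              # header line starts a new sequence
            lengths[current] = 0
        elif current is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            lengths[current] += len(line)   # sequence line, possibly wrapped
    return lengths


example = [
    ">Sequence 1",
    "ATCGATCGTATTTCTCTTTAAACGGTATGCTA",
    ">Sequence 2",
    "TTTTGCGCGTTCTTAGGCTTGCTATCTC",
]
print(fasta_lengths(example))
```

A ValueError here is a hint that the file does not start with a ">" header and may not be FASTA at all.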
BAM/SAM
SAM stands for Sequence Alignment/Map, and BAM is its compressed binary form. Both store reads aligned to a reference sequence. Two example lines are shown below:
GWNJ-0965:327:GW1811231642:8:1103:14407:23987 163 KmetStat_WT_A 1 1 112M = 1 -112 GAAACGCGTA AAFFFJJJJJ AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:112 YS:i:0 YT:Z:CP
GWNJ-0965:327:GW1811231642:8:1103:10287:72367 99 KmetStat_WT_A 1 34 135M = 1 -135 GAAACGCGTA AAFFFJJJJJ AS:i:0 XS:i:-35 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:135 YS:i:0 YT:Z:CP
Each line contains information for one sequence read, such as its mapping quality, mate pair information, etc. You can read more about the BAM/SAM format here: https://samtools.github.io/hts-specs/SAMv1.pdf
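To make the per-read fields concrete, here is a minimal sketch that splits one SAM alignment line into the eleven mandatory, tab-separated fields defined in the SAM specification. The names SamRecord and parse_sam_line are hypothetical, and the function expects alignment lines only; header lines, which start with "@", should be skipped by the caller.

```python
from typing import NamedTuple, Tuple


class SamRecord(NamedTuple):
    qname: str              # read name
    flag: int               # bitwise flags (paired, mate reversed, ...)
    rname: str              # reference sequence name
    pos: int                # 1-based leftmost mapping position
    mapq: int               # mapping quality
    cigar: str              # alignment description, e.g. "112M"
    rnext: str              # reference name of the mate ("=" means same as rname)
    pnext: int              # position of the mate
    tlen: int               # observed template length
    seq: str                # read sequence
    qual: str               # per-base quality string
    tags: Tuple[str, ...]   # optional TAG:TYPE:VALUE fields


def parse_sam_line(line):
    """Parse one tab-separated SAM alignment line into a SamRecord."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 tab-separated fields, got {len(fields)}")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq), cigar,
                     rnext, int(pnext), int(tlen), seq, qual, tuple(fields[11:]))


line = "\t".join([
    "GWNJ-0965:327:GW1811231642:8:1103:14407:23987", "163", "KmetStat_WT_A",
    "1", "1", "112M", "=", "1", "-112", "GAAACGCGTA", "AAFFFJJJJJ", "AS:i:0",
])
record = parse_sam_line(line)
print(record.rname, record.mapq, record.tags)
```

A line that fails to parse this way (too few fields, or non-numeric values where integers are expected) is a sign that the file is not really SAM.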
Here are some common file suffixes for BAM/SAM files:
- .bam for BAM files
- .sam for SAM files
You can open SAM files in a text editor, but be warned that these files are often several gigabytes to tens of gigabytes in size. To check the contents of a BAM file you need specialized software (like samtools). You can also try opening it in the Integrative Genomics Viewer (IGV).
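If specialized software is not at hand, a quick sanity check is still possible. According to the SAM/BAM specification, a BAM file is BGZF-compressed (a gzip-compatible format) and its decompressed content begins with the four bytes "BAM" followed by the byte 1. The sketch below relies on that; the function name looks_like_bam is a hypothetical helper, and a True result does not prove that the whole file is intact.

```python
import gzip
import zlib


def looks_like_bam(path):
    """Return True if the file decompresses to data starting with the BAM magic bytes."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == b"BAM\x01"
    except (OSError, EOFError, zlib.error):
        # Not gzip/BGZF data (for example a plain-text SAM file), or truncated.
        return False
```

A ".bam" file for which this returns False is probably a renamed SAM or other text file.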
VCF
This file type contains variants called for one or more samples (e.g. SNPs, indels, somatic mutations). This is a more complicated file format and (unfortunately) different tools produce slightly different variations on it. An example is below:
##fileformat=VCFv4.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
You can read more about the VCF format here: https://en.wikipedia.org/wiki/Variant_Call_Format
Here are some common file suffixes for VCF files:
- .vcf
- And compressed variations like: .vcf.gz, .vcf.zip, etc.
You can open VCF files in a text editor or even Excel to view their contents.
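For a quick structural check without a spreadsheet, the sketch below reads the sample names from the "#CHROM" header line and counts the variant records. The function name summarize_vcf is a hypothetical helper; it takes any iterable of text lines and assumes the tab-separated columns of the VCF specification, where sample names follow the FORMAT column.

```python
def summarize_vcf(lines):
    """Return (sample_names, record_count) for an iterable of VCF lines."""
    samples = None
    records = 0
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue                    # blank line or meta-information line
        if line.startswith("#CHROM"):
            columns = line.split("\t")
            samples = columns[9:]       # columns after FORMAT are sample names
            continue
        records += 1                    # every other line is one variant record
    if samples is None:
        raise ValueError("no #CHROM header line found; this may not be a VCF file")
    return samples, records
```

For an uncompressed file this could be called as summarize_vcf(open("variants.vcf")), where "variants.vcf" is a placeholder file name.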