Error ID 1 - Wrong File Type

Modified on Fri, 3 May, 2019 at 1:36 PM

Error Description


You received this error for one of two reasons:

  1. You uploaded a file in an unrecognized format.
  2. You tried to submit an analysis with a sample that contains a file type that is inappropriate for the workflow.



Solutions


Solution 1

If you got the file type error when trying to start an analysis, double-check the contents of your files to verify what type of data they contain. The analysis error message should indicate which file types the workflow accepts. See "Further information" below for the formats of common file types used by Basepair.


Solution 2

If you are sure your files contain the right type of data, rename them with an appropriate file suffix and re-upload them. See "Further information" below for accepted suffixes for common file types used by Basepair.


Solution 3 - contact us

If you still cannot resolve your issue, please don't hesitate to contact us.


Further information


Basepair deals with various types of files. Usually the type of file is indicated by the suffix (e.g. ".csv" for a comma-separated file). Moreover, due to the size of the files we deal with, they are usually compressed in some way. Here are some common suffixes that indicate a file is compressed:

  1. .gz
  2. .zip
  3. .tar.gz
  4. .bz2


Unfortunately, we frequently see cases where the file suffix does not match the actual file type, or where the suffix indicates the file is compressed when it is not. Below we go over the common file types Basepair deals with:



FASTQ


This file type contains sequence reads and usually looks like this:


@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65


There are four lines for each read:

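  1. Line 1: The name of the read, prefixed with "@"
  2. Line 2: The read sequence in A's, T's, C's, and G's (N marks a base that could not be called)
  3. Line 3: A separator line that starts with "+", optionally followed by the read name again
  4. Line 4: The per-base quality scores, one character per base in line 2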

You can read more about FASTQ file format here: https://en.wikipedia.org/wiki/FASTQ_format
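
To check a file without reading all of it, it is usually enough to inspect the first record. The Python sketch below is a minimal check under that assumption: it tests that the first four lines have the layout described above. The name looks_like_fastq is hypothetical, and the check is intentionally loose (it does not validate the characters in the sequence or quality lines).

```python
def looks_like_fastq(lines):
    """Check that the first record in `lines` follows the four-line FASTQ layout."""
    record = [line.rstrip("\n") for line in lines[:4]]
    if len(record) < 4:
        return False
    name, sequence, separator, quality = record
    return (
        name.startswith("@")                   # line 1: read name
        and len(sequence) > 0                  # line 2: the read itself
        and separator.startswith("+")          # line 3: separator
        and len(sequence) == len(quality)      # line 4: one score per base
    )


# Example with a tiny made-up record.
print(looks_like_fastq(["@read1\n", "ACGT\n", "+\n", "IIII\n"]))
```

For a compressed file, the same check can be applied to the first four lines read through Python's gzip module.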



Common file suffixes for FASTQ files are:

  1. .fastq
  2. .fq
  3. And compressed variations like: .fq.gz, .fq.zip, .fastq.gz, etc.


The following are also sometimes used:

  1. .txt (and its compressed variations) 
  2. .sra (which is a special compressed format from the sequence read archive)


You can open FASTQ files in a text editor to check their contents, but be warned that FASTQ files are often very large.


FASTA


This file type contains DNA, RNA, and protein sequences. Reference genomes and transcriptomes are some examples of data stored in FASTA files. An example is below:


>Sequence 1
ATCGATCGTATTTCTCTTTAAACGGTATGCTA
>Sequence 2
TTTTGCGCGTTCTTAGGCTTGCTATCTC


In the example above there are two lines for each sequence (longer sequences are often wrapped across several lines):

  1. Line 1: The sequence name prefixed with a ">"
  2. Line 2: The actual sequence (whether it be DNA, RNA, or protein)

You can read more about the FASTA format here: https://en.wikipedia.org/wiki/FASTA
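
In practice, long sequences in FASTA files are often wrapped over several lines, so a reader has to join every line up to the next ">" header. The Python sketch below shows one way to do that; read_fasta is a hypothetical helper written for illustration, not a Basepair function.

```python
def read_fasta(lines):
    """Return a dict mapping each sequence name to its full sequence."""
    sequences = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            name = line[1:]          # header line: start a new sequence
            sequences[name] = ""
        elif name is None:
            raise ValueError("FASTA data must start with a '>' header line")
        else:
            sequences[name] += line  # sequence line: append to current entry
    return sequences


example = [
    ">Sequence 1",
    "ATCGATCGTATTTCTCTTTAAACGGTATGCTA",
    ">Sequence 2",
    "TTTTGCGCGTTCTTAGGCTTGCTATCTC",
]
for name, sequence in read_fasta(example).items():
    print(name, len(sequence))
```

A file that raises ValueError here does not begin with a ">" line and is probably not FASTA.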



Here are some common file suffixes for FASTA files:

  1. .fasta
  2. .fa
  3. And compressed variations like: .fasta.gz, .fa.zip, etc.


You can open FASTA files in a text editor to check their contents.


BAM/SAM


SAM stands for Sequence Alignment/Map, and BAM is its compressed binary form. Both store reads aligned to a reference sequence. Two example lines are shown below:


GWNJ-0965:327:GW1811231642:8:1103:14407:23987  163  KmetStat_WT_A  1  1  112M  =  1  -112  GAAACGCGTA  AAFFFJJJJJ  AS:i:0  XS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:112  YS:i:0  YT:Z:CP
GWNJ-0965:327:GW1811231642:8:1103:10287:72367  99  KmetStat_WT_A  1  34  135M  =  1  -135  GAAACGCGTA  AAFFFJJJJJ  AS:i:0  XS:i:-35  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:135  YS:i:0  YT:Z:CP


Each line contains information for one sequence read, such as its mapping quality, mate pair information, etc. You can read more about the BAM/SAM format here: https://samtools.github.io/hts-specs/SAMv1.pdf
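
According to the SAM specification linked above, every alignment line has 11 mandatory tab-separated fields, followed by optional tags such as AS:i:0. The Python sketch below splits one line into those fields. The helper name parse_sam_line is hypothetical, and the sketch assumes the columns are separated by real tab characters (in the example above the tabs are displayed as spaces).

```python
SAM_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]
INTEGER_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_line(line):
    """Split one SAM alignment line into its mandatory fields and optional tags."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 tab-separated fields")
    record = dict(zip(SAM_FIELDS, columns))
    for key in INTEGER_FIELDS:
        record[key] = int(record[key])
    record["TAGS"] = columns[len(SAM_FIELDS):]  # e.g. ["AS:i:0", "NM:i:0"]
    return record
```

Header lines in a SAM file start with "@" and should be skipped before calling this function.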


Here are some common file suffixes for BAM/SAM files:

  1. .bam for BAM file
  2. .sam for SAM file


You can open SAM files in a text editor, but be warned that these files are often several to tens of gigabytes in size. To check the contents of a BAM file you need specialized software (like samtools). You can also try opening it in the Integrative Genomics Viewer (IGV).
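
If samtools is not at hand, a quick sanity check is still possible. BAM files are stored in a gzip-compatible container (BGZF), and the decompressed data begins with the four bytes "BAM" followed by the byte 1. The Python sketch below tests for that signature using only the standard library; looks_like_bam is a hypothetical name, and the check says nothing about whether the rest of the file is intact.

```python
import gzip


def looks_like_bam(path):
    """Return True if the file decompresses to data starting with the BAM signature."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == b"BAM\x01"
    except (OSError, EOFError):
        # Not gzip-compatible (or truncated), so it cannot be a valid BAM file.
        return False
```

A SAM file that was renamed to end in ".bam" fails this check because it is plain text.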


VCF


This file type contains variants called for one or more samples (e.g. SNPs, indels, somatic mutations). This is a more complicated file format and (unfortunately) different tools produce slightly different variations on it. An example is below:


##fileformat=VCFv4.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS      ID         REF   ALT    QUAL  FILTER   INFO                             FORMAT       NA00001         NA00002          NA00003
20     14370    rs6054257  G     A      29    PASS    NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ  0|0:48:1:51,51  1|0:48:8:51,51   1/1:43:5:.,.
20     17330    .          T     A      3     q10     NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ  0|0:49:3:58,50  0|1:3:5:65,3     0/0:41:3
20     1110696  rs6040355  A     G,T    67    PASS    NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ  1|2:21:6:23,27  2|1:2:0:18,2     2/2:35:4
20     1230237  .          T     .      47    PASS    NS=3;DP=13;AA=T                   GT:GQ:DP:HQ  0|0:54:7:56,60  0|0:48:4:51,51   0/0:61:2
20     1234567  microsat1  GTC   G,GTCT 50    PASS    NS=3;DP=9;AA=G                    GT:GQ:DP     0/1:35:4        0/2:17:2         1/1:40:3


You can read more about the VCF format here: https://en.wikipedia.org/wiki/Variant_Call_Format
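
Despite the variation between tools, the header shown above gives a simple way to recognize a VCF file: the first line starts with "##fileformat=VCF", and the last header line starts with "#CHROM" and lists the fixed columns followed by the sample names. The Python sketch below checks both and returns the sample names. The name vcf_sample_names is hypothetical, and splitting on any whitespace is a deliberate simplification (the VCF specification uses tabs) so that space-aligned examples like the one above are also accepted.

```python
FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def vcf_sample_names(lines):
    """Validate the VCF header in `lines` and return the list of sample names."""
    iterator = iter(lines)
    first = next(iterator, "")
    if not first.startswith("##fileformat=VCF"):
        raise ValueError("first line must start with ##fileformat=VCF")
    for line in iterator:
        if line.startswith("##"):
            continue  # meta-information line
        columns = line.split()
        if columns[:8] != FIXED_COLUMNS:
            raise ValueError("missing or malformed #CHROM header line")
        return columns[9:]  # everything after the FORMAT column
    raise ValueError("no #CHROM header line found")
```

For the example above, the returned names would be NA00001, NA00002, and NA00003.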


Here are some common file suffixes for VCF files:

  1. .vcf
  2. And compressed variations like: .vcf.gz, .vcf.zip, etc.


You can open VCF files in a text editor or even Excel to view their contents.
