Overrepresentation analysis (ORA) identifies the biological relationships in a list of genes by testing gene sets for enrichment in the query. However, the inconsistent definitions and highly overlapping nature of gene set databases can make ORA results difficult to interpret. Here, we introduce GeneTEA, a model that takes free-text gene descriptions and combines several natural language processing methods to learn a sparse gene-by-term embedding, which can be treated as a de novo gene set database. We benchmark its performance against other popular ORA tools and find that only GeneTEA properly controls false discovery while consistently surfacing relevant biology.
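
For readers unfamiliar with ORA, the following is a minimal sketch of the generic procedure, assuming a one-sided hypergeometric test per gene set followed by Benjamini-Hochberg correction. It is not GeneTEA's implementation, which is not described here. The function names (`hypergeom_sf`, `benjamini_hochberg`, `ora`), the gene identifiers, and the two toy gene sets are all hypothetical and chosen only for illustration.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n genes from N, of which K belong to the set."""
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def ora(query, gene_sets, background):
    """Test each gene set for overrepresentation in the query list."""
    background = set(background)
    query = set(query) & background
    names = list(gene_sets)
    pvals = []
    for name in names:
        members = set(gene_sets[name]) & background
        overlap = len(query & members)
        pvals.append(
            hypergeom_sf(overlap, len(background), len(members), len(query))
        )
    fdr = benjamini_hochberg(pvals)
    # Rows are (gene set name, raw p-value, adjusted p-value), best first.
    return sorted(zip(names, pvals, fdr), key=lambda row: row[1])


# Hypothetical toy data: 20 background genes, a 5-gene query, two gene sets.
background = [f"g{i}" for i in range(20)]
query = ["g0", "g1", "g2", "g3", "g4"]
gene_sets = {
    "setA": ["g0", "g1", "g2", "g3", "g10"],      # 4 of 5 members in the query
    "setB": ["g15", "g16", "g17", "g18", "g19"],  # no members in the query
}
results = ora(query, gene_sets, background)
for name, p, q in results:
    print(f"{name}\tp={p:.4g}\tFDR={q:.4g}")
```

Because each set is tested independently, two sets that share most of their members will tend to be reported together, which is one way the database overlap mentioned above can complicate interpretation.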