Designing an End Token for GPT: A nanoGPT Example

Contents

Designing an End Token for GPT: A nanoGPT Example

1. Overview

2. Tokenizer design

3. Stopping at the end token


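1. Overview

When hand-building a GPT, you may run into performance questions: does the model need to emit its full output every time, and how can compute be saved?

Generation is capped by max_new_tokens, so an unusually long output is cut off at the cap. Conversely, if no end token was designed in, the model has no way to stop early and always generates all max_new_tokens tokens.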

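An end token helps when only part of the output is needed, or in specific scenarios such as:

1. GPT-driven automation

2. Cleaner-looking output

3. Smaller business scenarios that need specialized processing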

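These scenarios call for a purpose-built model, a "small large model", which usually does not need a large parameter count. If the design goal is a lightweight, compact model, roughly 1M to 100M parameters is a reasonable range.

For cost and development speed, nanoGPT can be used for training and development, followed by further fine-tuning iterations. The resulting performance and quality are generally good enough for some requirements, iteration is fast, and the approach suits a single developer or a small team working on a specific scenario.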


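2. Tokenizer design

The following comes from a scenario I worked on earlier: generating key presses for music.

The vocabulary includes end as a dedicated end token. For later extension, music-style identifiers could be placed before or after end, so that the style marker keeps the generated style consistent.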
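To make the idea concrete, here is a minimal sketch of forward maximum matching over a short key-press string that ends with end. The `<jazz>` style tag and the sample string are hypothetical, invented only to illustrate how a style marker could sit alongside the end token; they are not part of the original dictionary.

```python
def max_forward_matching(text, word_dict, max_len):
    """Greedy longest-match segmentation; unmatched characters become single tokens."""
    result = []
    index = 0
    while index < len(text):
        for size in range(max_len, 0, -1):  # try the longest candidate first
            piece = text[index:index + size]
            if piece in word_dict:
                result.append(piece)
                index += len(piece)
                break
        else:
            # no dictionary entry matched at this position: emit one character
            result.append(text[index])
            index += 1
    return result


# Hypothetical vocabulary: '<jazz>' is an invented style tag for illustration.
word_dict = {"'a'", "'s'", '<96>', '+', ' ', '\n', 'end', '<jazz>'}
max_len = max(len(w) for w in word_dict)

sample = "<jazz>'a'+<96> 's'end\n"
tokens = max_forward_matching(sample, word_dict, max_len)
print(tokens)
```

Because matching is longest-first, end and the style tag each come out as one token, so each gets a single ID that is easy to test for during generation.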


# custom dictionary
word_dict = set(['\n', ' ', '+', '.', '0', '1', '2', '3', '4',
         '6', '7', '8', '9', ':', "'a'", "'b'", "'c'", "'d'",
         "'e'", "'f'", "'g'", "'h'","'j'", "'n'","'m'","'q'","'w'","'r'","'t'","'y'","'u'",
        "'s'", "'v'", "'x'", "'z'",'<96>','<97>','<98>','<99>','<100>',
        '<101>','<102>','<103>','<104>','<105>','end'])

seg_list = max_forward_matching(data, word_dict, max(len(word) for word in word_dict))
words = list(seg_list)
# defaultdict that hands the next free ID to each unseen token
word_to_id = defaultdict(lambda: len(word_to_id))
# optional list for the reverse mapping (ID -> token)
id_to_word = []
# build the token -> ID mapping
for word in words:
    word_id = word_to_id[word]
    # a newly assigned ID equals the current length of id_to_word
    if word_id == len(id_to_word):
        id_to_word.append(word)
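One detail of this pattern is worth checking by hand: the defaultdict factory runs before the new key is stored, so a freshly assigned ID equals the number of tokens recorded so far. The sketch below uses a made-up token stream and compares against len(id_to_word) to detect a new ID. It is an illustration of the pattern, not the author's exact script.

```python
from collections import defaultdict

# Made-up token stream, as a segmenter might produce it.
words = ["'a'", '+', "'a'", '<96>', 'end', '+']

# Each unseen token receives the next free integer ID.
word_to_id = defaultdict(lambda: len(word_to_id))
id_to_word = []
for word in words:
    word_id = word_to_id[word]
    if word_id == len(id_to_word):  # true only when the ID was just assigned
        id_to_word.append(word)

# Freeze the mapping so that later lookups of unknown tokens raise KeyError
# instead of silently growing the vocabulary.
stoi = dict(word_to_id)

ids = [stoi[w] for w in words]
decoded = ''.join(id_to_word[i] for i in ids)
end_id = stoi['end']
print(ids, end_id)
```

Converting to a plain dict before inference is a design choice assumed here: with the defaultdict left in place, looking up a token that was never seen in training would quietly add it to the vocabulary.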

import os
import pickle
import requests
import numpy as np
from collections import defaultdict

# load the music dataset; the download fallback is inherited from nanoGPT's prepare.py
input_file_path = os.path.join(os.path.dirname(__file__), 'music.txt')
if not os.path.exists(input_file_path):
    data_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
    with open(input_file_path, 'w') as f:
        f.write(requests.get(data_url).text)

with open(input_file_path, 'r', encoding="utf-8") as f:
    data = f.read()
print(f"length of dataset in characters: {len(data):,}")

def max_forward_matching(text, word_dict, max_len):
    result = []
    index = 0
    while index < len(text):
        found = False
        for size in range(max_len, 0, -1):  # try the longest match first
            piece = text[index:index + size]
            if piece in word_dict:
                result.append(piece)
                index += size
                found = True
                break
        if not found:  # no dictionary match: emit a single character
            result.append(text[index])
            index += 1
    return result

# custom dictionary (self-built)
word_dict = set(['\n', ' ', '+', '.', '0', '1', '2', '3', '4',
         '6', '7', '8', '9', ':', "'a'", "'b'", "'c'", "'d'",
         "'e'", "'f'", "'g'", "'h'","'j'", "'n'","'m'","'q'","'w'","'r'","'t'","'y'","'u'",
        "'s'", "'v'", "'x'", "'z'",'<96>','<97>','<98>','<99>','<100>',
        '<101>','<102>','<103>','<104>','<105>','end'])
max_word_len = max(len(word) for word in word_dict)

words = max_forward_matching(data, word_dict, max_word_len)
# defaultdict that hands the next free ID to each unseen token
word_to_id = defaultdict(lambda: len(word_to_id))
# optional list for the reverse mapping (ID -> token)
id_to_word = []
# build the token -> ID mapping
for word in words:
    word_id = word_to_id[word]
    # a newly assigned ID equals the current length of id_to_word
    if word_id == len(id_to_word):
        id_to_word.append(word)

chars = list(word_to_id)
print(chars)
vocab_size = len(chars)
print("all the unique characters:", ''.join(chars))
print(f"vocab size: {vocab_size:,}")

# create a mapping from tokens to integers
stoi = { ch:i for i,ch in enumerate(chars) }
print(stoi)
itos = { i:ch for i,ch in enumerate(chars) }
print(itos)

def encode(s):
    # encoder: take a string, segment it, output a list of integers
    return [stoi[word] for word in max_forward_matching(s, word_dict, max_word_len)]

def decode(l):
    # decoder: take a list of integers, output a string
    return ''.join(itos[i] for i in l)

# encode once, then split on token IDs so no multi-character token is cut in half
ids = encode(data)
n = len(ids)
train_ids = ids[:int(n*0.95)]
val_ids = ids[int(n*0.95):]
print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")

# export to bin files
train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)
train_ids.tofile(os.path.join(os.path.dirname(__file__), 'train.bin'))
val_ids.tofile(os.path.join(os.path.dirname(__file__), 'val.bin'))

# save the meta information as well, to help us encode/decode later
meta = {
    'vocab_size': vocab_size,
    'itos': itos,
    'stoi': stoi,
}
with open(os.path.join(os.path.dirname(__file__), 'meta.pkl'), 'wb') as f:
    pickle.dump(meta, f)

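3. Stopping at the end token

During inference, check whether each newly generated token ID matches the end token. Because the design above segments text with a dictionary and then assigns IDs, the end token's ID can be obtained from code.

For example, print the encoding of candidate end tokens at tokenization time: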

print(encode("\n"))
print(encode("\t"))

Adding these lines to the source shows which ID each candidate end token maps to:

"""
Prepare the Shakespeare dataset for character-level language modeling.
So instead of encoding with GPT-2 BPE tokens, we just map characters to ints.
Will save train.bin, val.bin containing the ids, and meta.pkl containing the
encoder and decoder and some other related info.
"""
import os
import pickle
import requests
import numpy as np

# download the tiny shakespeare dataset
input_file_path = os.path.join(os.path.dirname(__file__), 'say.txt')
if not os.path.exists(input_file_path):
    data_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
    with open(input_file_path, 'w') as f:
        f.write(requests.get(data_url).text)

with open(input_file_path, 'r', encoding="utf-8", errors='replace') as f:
    data = f.read()
print(f"length of dataset in characters: {len(data):,}")

# get all the unique characters that occur in this text
chars = sorted(list(set(data)))
vocab_size = len(chars)
print("all the unique characters:", ''.join(chars))
print(f"vocab size: {vocab_size:,}")

# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
def encode(s):
    return [stoi[c] for c in s] # encoder: take a string, output a list of integers
def decode(l):
    return ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("\n"))
print(encode("\t"))

# create the train and test splits
n = len(data)
train_data = data[:int(n*0.9)]
val_data = data[int(n*0.9):]

# encode both to integers
train_ids = encode(train_data)
val_ids = encode(val_data)
print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")

# export to bin files
train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)
train_ids.tofile(os.path.join(os.path.dirname(__file__), 'train.bin'))
val_ids.tofile(os.path.join(os.path.dirname(__file__), 'val.bin'))

# save the meta information as well, to help us encode/decode later
meta = {
    'vocab_size': vocab_size,
    'itos': itos,
    'stoi': stoi,
}
with open(os.path.join(os.path.dirname(__file__), 'meta.pkl'), 'wb') as f:
    pickle.dump(meta, f)

# length of dataset in characters:  1115394
# all the unique characters:
#  !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
# vocab size: 65
# train has 1003854 tokens
# val has 111540 tokens
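Since the prepare script saves stoi in meta.pkl, the end token's ID can also be read back from that file instead of from printed output. The sketch below assumes the end token is stored as a single vocabulary entry named end, as in the music dictionary of section 2. The four-entry vocabulary and the temporary directory are stand-ins so the example runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in for the meta.pkl written by the prepare script; the vocabulary is made up.
stoi = {'\n': 0, 'end': 1, "'a'": 2, '+': 3}
meta = {
    'vocab_size': len(stoi),
    'stoi': stoi,
    'itos': {i: s for s, i in stoi.items()},
}

with tempfile.TemporaryDirectory() as tmp:
    meta_path = os.path.join(tmp, 'meta.pkl')
    with open(meta_path, 'wb') as f:
        pickle.dump(meta, f)
    # At inference time only this loading step would be needed.
    with open(meta_path, 'rb') as f:
        loaded = pickle.load(f)

END_TOKEN = 'end'  # assumed name of the end token in the vocabulary
end_id = loaded['stoi'].get(END_TOKEN)
if end_id is None:
    raise KeyError(f"{END_TOKEN!r} is not in the vocabulary; it never appeared in the training text")
print(end_id)
```

Looking the ID up by name avoids hard-coding a number that changes whenever the vocabulary or the training text changes.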

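Only one extra check needs to be added to the generation loop:

# Check whether the end token was generated. The IDs of most candidate end tokens can be
# looked up for this test; you can also define your own end token and make it a unique
# identifier to avoid interference.
if 1 in idx_next[0].tolist():
    break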

@torch.no_grad()
def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
    """
    Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
    the sequence max_new_tokens times, feeding the predictions back into the model each time.
    Most likely you'll want to make sure to be in model.eval() mode of operation for this.
    """
    for _ in range(max_new_tokens):
        # if the sequence context is growing too long we must crop it at block_size
        idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
        # forward the model to get the logits for the index in the sequence
        logits, _ = self(idx_cond)
        # pluck the logits at the final step and scale by desired temperature
        logits = logits[:, -1, :] / temperature
        # optionally crop the logits to only the top k options
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = -float('Inf')
        # apply softmax to convert logits to (normalized) probabilities
        probs = F.softmax(logits, dim=-1)
        # sample from the distribution
        idx_next = torch.multinomial(probs, num_samples=1)
        # Check whether the end token was generated. The IDs of most candidate end tokens can be
        # looked up for this test; you can also define your own end token and make it a unique
        # identifier to avoid interference.
        if 1 in idx_next[0].tolist():
            break
        # append sampled index to the running sequence and continue
        idx = torch.cat((idx, idx_next), dim=1)

    return idx
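The check above only inspects the first row of the batch (idx_next[0]), which is fine for single-sequence sampling. If several sequences were generated at once, each row would need its own stop flag. The following is a framework-free sketch of that idea in plain Python lists; sample_fn is a hypothetical stand-in for the model's sampling step, and the end ID of 0 is chosen only for the demo.

```python
def generate_batch(contexts, max_new_tokens, end_id, sample_fn):
    """Extend each sequence until it samples end_id or the token cap is reached."""
    outs = [list(c) for c in contexts]
    done = [False] * len(outs)
    for _ in range(max_new_tokens):
        if all(done):
            break  # every row has hit its end token: stop early and save compute
        for row, seq in enumerate(outs):
            if done[row]:
                continue
            nxt = sample_fn(seq)
            if nxt == end_id:
                done[row] = True  # the end token itself is not appended
            else:
                seq.append(nxt)
    return outs


def toy_sample(seq):
    """Hypothetical deterministic 'model': next token is (last token + 1) mod 6."""
    return (seq[-1] + 1) % 6


finished = generate_batch([[2], [4]], max_new_tokens=10, end_id=0, sample_fn=toy_sample)
capped = generate_batch([[2]], max_new_tokens=2, end_id=0, sample_fn=toy_sample)
print(finished, capped)
```

The two rows stop after different numbers of steps, and the capped call shows max_new_tokens still acting as the upper bound when no end token arrives in time.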