c4b38a5c72
fix a few cores from previous code updates
Matt Wells
2014-03-11 09:36:33 -07:00
5c2e78e5fa
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-03-10 20:26:30 -07:00
483f3c5bae
fix core
Matt Wells
2014-03-10 18:17:28 -07:00
f9fdc96563
no use in newline separating the list of urls if they're going to be read back in and need to be space separated
Daniel Steinberg
2014-03-10 15:22:43 -07:00
e293d465a3
snprintf instead of sprintf
Daniel Steinberg
2014-03-10 14:03:28 -07:00
41e3988fbc
not a conf file
Daniel Steinberg
2014-03-10 13:57:13 -07:00
4a7bf5d4d0
Story #2040: store raw URL submissions for customer bulk jobs
Daniel Steinberg
2014-03-10 13:50:30 -07:00
bfcb7082f4
fix bug from nuking doledb on a new collection.
Matt Wells
2014-03-10 13:48:00 -07:00
bd4484db3c
Merge branch 'testing' into diffbot-testing
Matt Wells
2014-03-10 12:08:23 -07:00
9debee20dc
Merge branch 'diffbot' into testing
Matt Wells
2014-03-09 20:44:09 -07:00
662b6d4b32
doc updates
Matt Wells
2014-03-09 20:43:49 -07:00
90ff2c2a25
update example site lists
Matt Wells
2014-03-09 20:35:45 -07:00
82db7240a3
simple print update
Matt Wells
2014-03-09 19:43:32 -07:00
f7b7274ff1
replace "exact:" directive with "seed:"; really the same thing.
Matt Wells
2014-03-09 19:35:20 -07:00
f8e561e6f4
more new site list api fixes
Matt Wells
2014-03-09 18:15:57 -07:00
11e8c16878
new site list updates
Matt Wells
2014-03-09 17:53:24 -07:00
ed626b162a
more site list based spider fixes to be more like gsa
Matt Wells
2014-03-08 20:52:31 -07:00
aab165ed20
fix bad return value from function
Matt Wells
2014-03-08 19:32:56 -08:00
4cb66c31bf
get this new api spidering
Matt Wells
2014-03-08 12:02:20 -07:00
624c1d4e68
nuke doledb fixes
Matt Wells
2014-03-08 10:51:15 -07:00
29694f4efe
startup fixes
Matt Wells
2014-03-08 10:25:56 -07:00
8aa0662a27
Merge branch 'diffbot' into testing
Matt Wells
2014-03-08 09:38:44 -07:00
14817df7a9
new site patterns api stuff
Matt Wells
2014-03-08 09:23:32 -07:00
7cdd411ef1
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-03-07 09:26:47 -08:00
72fab5b61e
Do not end a crawl while urls are still being spidered because they might add more links to spiderdb when they finally complete.
Matt Wells
2014-03-07 09:30:12 -08:00
dcd42e455e
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-03-07 09:02:29 -08:00
c143ee1fba
fix core when creating a new collection because we incremented m_numRecs but did not grow the ptr buffer. also added support for localgb.conf so we can use that instead of gb.conf to avoid git push/pull conflicts.
Matt Wells
2014-03-07 09:05:14 -08:00
f777e6cccd
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-03-07 08:23:21 -08:00
d6177019ec
minor fix
Matt Wells
2014-03-07 08:07:09 -08:00
434dd182d4
fix mem leak. always harvest links for custom crawls.
Matt Wells
2014-03-06 21:24:39 -08:00
e351d2a6f1
get searching on token working
Matt Wells
2014-03-06 17:01:41 -08:00
27e8e810d2
use collnum instead of coll string. more stable since resetting collections keeps string the same but changes the collnum.
Matt Wells
2014-03-06 15:48:11 -08:00
d74f748e93
search all collections under a token if "&token" is given but not "&c=..."
Matt Wells
2014-03-06 11:00:43 -08:00
97e46dbf4e
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-03-06 10:45:45 -08:00
ca2d307229
revert gb.conf
Matt Wells
2014-03-06 10:47:03 -08:00
efa92b16fd
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-03-06 10:45:35 -08:00
25cf0efdbf
first compiled stab at multi collection searching.
Matt Wells
2014-03-06 10:45:13 -08:00
451a092378
fix core from changing parms while evaluating a url.
Matt Wells
2014-03-06 07:47:43 -08:00
0962e243a4
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-03-05 07:43:25 -08:00
58a1feeea5
specify &header=1 explicitly to get json serp header lest we break our clients' parsers
Matt Wells
2014-03-05 07:41:59 -08:00
13e33bc261
fix jezebel crawl from hanging.
Matt Wells
2014-03-04 19:45:26 -08:00
1b62f1582b
print memtable when almost full so we can see where the leak is. more spiders for ethan. do not try to get diffbot reply if page is already json. likely it is an injected diffbot json reply.
Matt Wells
2014-03-04 18:19:50 -08:00
603cd67758
fix csv downloads some more
Matt Wells
2014-03-04 12:07:46 -08:00
2ab9aaeeaa
streaming csv fixes
Matt Wells
2014-03-04 11:04:26 -08:00
866b09d25e
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-03-04 10:46:28 -08:00
b1381cc610
make csv streamable, faster and take almost no memory.
Matt Wells
2014-03-04 10:45:57 -08:00
280dcb85cf
fix for passing testUpdatedContent smoketest
Matt Wells
2014-03-04 09:09:51 -08:00
ab9f2b33c1
definition updates
Matt Wells
2014-03-04 08:37:39 -07:00
1acb16b1ee
tweak empty doledb priority logic. anchor it more to m_doleIpTable for more reliability. seems like it was causing some slowdowns during spidering. seems more continuous now.
Matt Wells
2014-03-03 13:48:59 -08:00
48b5330d9c
only skip checking to spider a url if its doleip table is empty
Matt Wells
2014-03-03 13:22:27 -08:00
282dad6cef
deal with no coll recs when getting link text using msg25. do not share g_lineTable between collections.
Matt Wells
2014-03-03 08:04:24 -08:00
ff8a0b4ef1
do not let all collections share the same line table in linkdb.cpp
Matt Wells
2014-03-03 07:50:11 -08:00
a82abe8260
added ^ operator to url crawl patterns. good for tmz crawl.
Matt Wells
2014-03-02 14:57:59 -08:00
7fd6bbd7f5
added ^ support to url crawl expressions
Matt Wells
2014-03-02 14:41:25 -08:00
e4d425c18f
fix coll being deleted when getting link text.
Matt Wells
2014-03-02 14:24:49 -08:00
bb5016e88b
add the following fields to json search results: currentTimeUTC, responseTimeMS, docsInCollection, hits, moreResultsFollow, and docId. Changes structure of json so that now the results array is returned as an array within a dictionary (field name "results") as opposed to being the only object returned
Daniel Steinberg
2014-03-01 11:16:17 -08:00
aeb2833d20
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-02-28 11:46:44 -08:00
11efab9862
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-02-28 08:23:59 -08:00
c596d38e60
fix core from getting title of json object
Matt Wells
2014-02-28 08:18:09 -08:00
5f3aa24805
took out restrictDomain logic. now we always only follow links on the same domain as the seed UNLESS a url crawl pattern or a url crawl regex was specified.
Matt Wells
2014-02-27 19:53:17 -08:00
42f254125e
fix core in new link text logic. empty msg25 replies are ok if g_errno is set.
Matt Wells
2014-02-27 13:56:32 -08:00
365fc16606
fix core in "wait in line" logic when getting link info in Linkdb.cpp.
Matt Wells
2014-02-27 09:22:35 -08:00
af9eb8fb73
need to allow clients to not restrict to seed domains.
Matt Wells
2014-02-26 22:27:22 -08:00
927f4626ee
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-02-26 22:26:13 -08:00
eaca38cbfd
fix new result streaming logic some more
Matt Wells
2014-02-26 21:42:43 -08:00
0933884191
fix super fast and mem efficient search results streaming code.
Matt Wells
2014-02-26 21:18:08 -08:00
f11e25024a
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-02-26 20:34:06 -08:00
1030e6ada8
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-02-26 20:30:20 -08:00
b429f12346
add logic to save memory when streaming over 200 results back. should fix oom when streaming back hundreds of thousands of results.
Matt Wells
2014-02-26 20:33:35 -08:00
8208178c79
remove "Initial crawl request" dups from the urls.csv. do not count fake firstip spider request attempts in xmldoc.cpp as crawlbot page download attempts since we just re-add that request with the correct firstip and bail. it basically doubles this count from what users would expect.
Matt Wells
2014-02-26 15:48:52 -08:00
b450bfc2a6
do not show html column in csv. libreoffice and excel flub it if a cell is over 32k or so.
Matt Wells
2014-02-26 15:03:05 -08:00
a0697e1bb5
do not allow custom crawls to spider the web any more.
Matt Wells
2014-02-26 10:26:09 -08:00
6445b0572b
fix gb.conf
Matt Wells
2014-02-26 01:08:20 -08:00
a6b7e088f5
take out tfndb, unused. fix core from diffbot url too long.
Matt Wells
2014-02-26 01:07:13 -08:00
6716d8f21b
remove entry from linetable for linkinfo lookup
Matt Wells
2014-02-26 00:27:29 -08:00
8bb5d106db
fixes for query reindex/delete.
Matt Wells
2014-02-25 18:12:45 -08:00
33c8123288
more fixes for new link info code.
Matt Wells
2014-02-25 13:53:41 -08:00
9c486c77ed
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-02-25 12:32:40 -08:00
cf6695f625
speed up getNumTotalRecs() by caching it basically for 2 seconds since pingserver.cpp calls it all the time.
Matt Wells
2014-02-25 12:14:51 -08:00
b3ff7df904
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-02-25 11:05:46 -08:00
94a55bf9a6
fixes for new link info code so it doesn't bottleneck. got EFENCE_SIZE working so we can use efence on large allocs only so we don't go oom using it. might help finding some of the out of bounds writing going on.
Matt Wells
2014-02-25 10:55:05 -08:00
ceb623bb8f
do not dedup bulks. only respider urls if error is tmp. mess with msg1 in spider.cpp so niceness is MAX_NICENESS and not 0 because it was not able to trigger a doledb dump.
Matt Wells
2014-02-23 20:04:46 -08:00
72f1312652
new linkdb code compiling.
Matt Wells
2014-02-20 17:27:28 -08:00
9820f14066
checkpoint
Matt Wells
2014-02-20 14:54:21 -08:00
88dfa20cbe
docid based spider rec related fixes.
Matt Wells
2014-02-20 08:46:00 -08:00
e87b71caef
fix query reindex core
Matt Wells
2014-02-19 21:07:01 -08:00
b37b19ea4a
print comma before json item so we do not end in trailing comma ever
Matt Wells
2014-02-19 10:04:49 -08:00
dda7648333
try to fix problem of crawls stopping when they shouldn't. seems like it might be doing the trick.
Matt Wells
2014-02-19 00:51:46 -08:00
b48adc0542
try to fix crawls stopping too early
Matt Wells
2014-02-18 10:28:48 -08:00
ae2aed7066
try to fix a few cores from deleting collections. try to spider urls again if user changes certain crawling parms. like regex, patterns, etc.
Matt Wells
2014-02-18 09:44:15 -08:00
117f0ca4e8
more bio updates
Matt Wells
2014-02-17 15:32:45 -07:00
bc2b9d6179
bio.html updates
Matt Wells
2014-02-17 15:29:08 -07:00
f942183104
ignore maxtocrawl for bulk jobs too
Matt Wells
2014-02-16 22:24:17 -08:00
a4deb7ff08
exempt bulk jobs from maxtoprocess
Matt Wells
2014-02-16 22:14:43 -08:00
9c9d5fff98
print out content type in caps with maroon bg in serps. use empty site patterns to mean no restriction, not "*" anymore for simplicity.
Matt Wells
2014-02-16 22:47:02 -07:00
0b5cd6d3f9
more parm fixes
Matt Wells
2014-02-16 22:18:39 -07:00
48315f6dc3
parm fixes
Matt Wells
2014-02-16 22:13:27 -07:00
725b6189a7
show user's ip in master ips description so they can add it to the list easily.
Matt Wells
2014-02-16 21:56:31 -07:00
9d0dca71db
fix rapid coll delete bug some more.
Matt Wells
2014-02-16 20:13:06 -08:00